Module 3: NoSQL and Document-Oriented Databases

Module Overview

Need to deal with Big Data? You may need tools beyond standard SQL approaches. Enter NoSQL and document-oriented databases! In this module, we explore the world of NoSQL databases with a focus on MongoDB, one of the most popular document-oriented databases. We'll learn how to store, retrieve, and query data in a schema-less environment.

Learning Objectives

Setting Up MongoDB in 2025

MongoDB has evolved into a leading cloud database platform, with MongoDB Atlas as the recommended solution for most users. Atlas is a fully managed cloud service that simplifies deployment, scaling, and management. While local installations are still available for on-premises or offline use, Atlas is preferred for its ease of use, global availability, and automatic scaling. Here's what you need to know to get started in 2025:

Modern Setup Process

Current Deployment Options (2025)

Note: Pricing varies by cloud provider (AWS, Azure, Google Cloud), region, and usage (e.g., compute, storage, data transfer). Always verify current rates on MongoDB's pricing page, as costs may fluctuate.

Key Requirements

Quick Start Guide

To set up a free MongoDB Atlas cluster:

  1. Sign up at MongoDB Atlas.
  2. Choose the Free Tier (M0), select a cloud provider (e.g., AWS), and pick a region (e.g., US East).
  3. Configure your cluster's security: Add your IP address to the IP whitelist (e.g., 0.0.0.0/0 for access from anywhere, but use caution) and create a database user with a username and password.
  4. Deploy the cluster (takes ~5 minutes). Copy the connection string to connect via your app or MongoDB Compass.
  5. Test your connection using Python: Install pymongo (pip install pymongo) and use the connection string in your code, replacing the <password> placeholder with your database user's password.
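Step 5 can be sketched roughly as follows. This is a minimal illustration, not an official Atlas recipe: the helper names build_atlas_uri and ping are made up for this example, and the username, password, and host are placeholders to swap for the values from your own cluster. Credentials are percent-escaped with quote_plus because characters such as '@' or ':' would otherwise confuse the URI parser.

```python
from urllib.parse import quote_plus


def build_atlas_uri(username: str, password: str, host: str) -> str:
    """Return an SRV connection string with percent-escaped credentials."""
    # Characters like '@' or ':' in credentials must be escaped,
    # otherwise the URI cannot be parsed correctly.
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f"mongodb+srv://{user}:{pwd}@{host}/?retryWrites=true&w=majority"


def ping(uri: str) -> bool:
    """Return True if the cluster answers a ping within five seconds."""
    # Imported here so the URI helper works even without pymongo installed.
    from pymongo import MongoClient
    from pymongo.errors import PyMongoError

    try:
        with MongoClient(uri, serverSelectionTimeoutMS=5000) as client:
            client.admin.command("ping")
        return True
    except PyMongoError:
        return False


# Placeholder values (assumptions): substitute the ones from your cluster.
uri = build_atlas_uri("app_user", "p@ss:word", "cluster0.example.mongodb.net")
print(uri)
```

Calling ping(uri) with a real connection string should return True once your IP address is allowed and the database user exists; a False result usually points to one of those two settings.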

Tip: Ensure your IP whitelist includes your current network to avoid connection errors. For production apps, use VPC peering or private endpoints for secure access.

Additional Learning Resources

Enhance your MongoDB skills with these free resources:

Note: As of 2025, MongoDB has deprecated M2/M5 Shared clusters and Serverless instances, transitioning to Flex Clusters for greater flexibility. Atlas is the industry standard for cloud deployments due to its automated management, global scalability, and cost predictability. Local installations (e.g., MongoDB Community Edition) remain an option for on-premises or offline environments but require manual setup and maintenance.

Objective 01 - Identify appropriate use cases for document-oriented databases

Required Resources

Review each preclass resource before class.

Overview

There's a lot of hype for Big Data - what are the actual use cases? One common case is document-oriented databases (also known as document stores), which are great for storing large amounts of, well, documents (generally unstructured or semi-structured content).

Some people refer to these as “NoSQL” - but really, that is an imprecise label (anything that isn't SQL is “NoSQL”). More specifically, these are non-relational approaches to storing and retrieving data.

Document-oriented databases are a common subset of key-value stores - a general non-relational database paradigm. You've already interacted with structures like this! The general data structure abstraction is known as the hash table (or hash map), and common real-world implementations are Python dictionaries and JSON.
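To make the connection to familiar structures concrete, here is a small sketch using only the Python standard library. The sample record is invented for illustration; the point is that a "document" is a set of key-value pairs whose values may themselves be nested, and that it survives a round trip through JSON.

```python
import json

# A "document" is a set of key-value pairs; values may be nested.
document = {
    "name": "Ada",
    "languages": ["Python", "SQL"],
    "address": {"city": "London", "country": "UK"},
}

# Lookup by key is a hash-table operation.
print(document["address"]["city"])

# Serializing to JSON and back preserves the structure.
as_json = json.dumps(document)
restored = json.loads(as_json)
print(restored == document)
```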

How is a document-oriented database different? It's bigger, and (usually) runs on somebody else's computer (or computers). Because values are indexed by key hashes, it is relatively easy to split the data across multiple instances, and to figure out which instance is needed to retrieve a given record.
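The idea of using a key hash to decide which instance holds a record can be sketched in a few lines. This is a toy model, not how any particular database routes requests: the shard count, the shard_for helper, and the use of plain dicts as "instances" are all assumptions made for illustration. Production systems use more elaborate schemes, partly because a simple modulo reassigns most keys whenever the number of shards changes.

```python
import hashlib

NUM_SHARDS = 3  # hypothetical number of database instances


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard index using a stable hash."""
    # hashlib gives the same digest on every run, unlike the built-in hash(),
    # which is randomized per process for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Each shard is modeled as a plain dict.
shards = [{} for _ in range(NUM_SHARDS)]


def put(key, value):
    shards[shard_for(key)][key] = value


def get(key):
    # Only the one shard that owns the key needs to be consulted.
    return shards[shard_for(key)].get(key)


for i in range(10):
    put(f"user:{i}", {"id": i})

print(get("user:7"))
print([len(s) for s in shards])  # how the ten records were distributed
```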

Traditional SQL databases are more difficult to scale, as they calculate an index for each table based on its primary key, and these indices must live in-memory in a single server. There are new approaches to work around this which we will discuss later (e.g. PostgreSQL sharding), but it is true that non-relational approaches are still at least conceptually easier to scale.

Another important distinction - as we've seen, SQL databases require specifying a schema (what your data/types are) up-front. Document-oriented databases can generally take any sort of key-value pairs, including nested key-values, allowing you to flexibly store data without preemptively specifying structure. Some argue that this allows for faster prototyping, and is good for situations where you need to rapidly develop something that is likely to be completely rewritten in the long term anyway.

But there is an important caveat - though some may characterize NoSQL as being “schema-free”, it really just delays the necessity of a schema. Eventually your application needs to know what fields it is asking for (and probably what types they are). One way to characterize this approach is “schema-on-read”, as querying a document store usually requires specifying which key/value pairs you want. This is in contrast with “schema-on-write”, the approach of traditional SQL (which then lets you do things like SELECT * more easily, with some guarantees for what you're getting).
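One way to picture "schema-on-read" is a function that imposes structure at the moment a document is read. The sketch below is illustrative only: the User class, the read_user helper, and the sample documents are invented, and a real application might use a validation library instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    name: str
    age: Optional[int]


def read_user(doc: dict) -> User:
    """Apply the schema at read time: pick the fields and coerce types."""
    if "name" not in doc:
        raise ValueError(f"document is missing 'name': {doc!r}")
    # 'age' may be absent or stored as a string; normalize on the way out.
    raw_age = doc.get("age")
    age = int(raw_age) if raw_age is not None else None
    return User(name=str(doc["name"]), age=age)


# The store accepted all three shapes without complaint.
stored = [
    {"name": "Ada", "age": 36},
    {"name": "Grace", "age": "45", "rank": "Rear Admiral"},
    {"name": "Linus"},
]

users = [read_user(d) for d in stored]
print(users)
```

The write side never rejected anything, so every check that a SQL schema would have enforced up front now lives in read_user.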

Follow Along

Consider the following situations:

Think for a moment - which sort of database (relational or non-relational) would you recommend in which situation?

The first situation (the bank) arguably demands a relational approach. Banks should be well-defined in their data, so an up-front schema is a fine requirement. Banks also benefit more from the reliability of SQL than the scalability of document-oriented databases.

The second situation could go either way, but many would suggest document-oriented databases. You can quickly make and throw away things without worrying about a schema, and you're likely to rewrite everything after you get investor approval anyway. Overall when prototyping though - go with the tools you know. If the two people involved happen to be really experienced with PostgreSQL, that is likely to be a better approach.

The third situation is similarly dependent on details, but erring towards relational is likely the right decision. Many companies think they have larger data than they have - PostgreSQL is likely to scale up just fine, and if you're an established company you have the time to plan and develop something with structure.

The fourth situation is deceptively similar to the third one - even a larger company can get by with relational! The main exception would be if they know they are developing a product that is closely related to their existing line and will thus immediately see significant usage and “big” data. But otherwise, even a large company can get a lot of mileage out of SQL, and benefits even more from the clarity and structure of a schema (due to having more developers and a larger ecosystem).

In the last two situations it'd also be worth considering modern “NewSQL” approaches, which try to combine the best of both worlds (structure from relational, scale from non-relational). More on this in the next module!

Challenge

Come up with two situations, one where relational is appropriate and another where non-relational is. Describe both in writing, as if you were making a recommendation to your manager for which database to choose.

Additional Resources

Objective 02 - Deploy and use a simple MongoDB instance

Required Resources

Overview

MongoDB makes it quite easy to have data in the cloud - it won't be the sort of data you're necessarily used to, but it is useful and is a tool that many web apps depend on.

A good mental abstraction for MongoDB is that it is “big JSON in the cloud” - it lets you save and retrieve (persist) JSON-serialized data, at scale, over a network. Since JSON is a ubiquitous format, widely supported by browsers and web applications, this is pretty handy.

But there is an important caveat - from a data science perspective, “unstructured key-value pairs” aren't the most useful way to have data. They're great for application development - just save things and retrieve them when you need them, but use a key instead of an inscrutable memory address. But this variety means that you end up with a collection of heterogeneous documents - you aren't guaranteed that they all have the same fields, so you can't just throw them in a DataFrame and work with them.
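As a rough illustration of the extra care heterogeneous documents need, the sketch below turns a handful of differently shaped documents into rectangular rows using only the standard library. The sample documents are made up, and filling gaps with None is just one possible policy.

```python
docs = [
    {"_id": 1, "name": "Ada", "city": "London"},
    {"_id": 2, "name": "Grace", "languages": ["COBOL"]},
    {"_id": 3, "city": "Helsinki"},
]

# Collect every key that appears in any document, in first-seen order.
columns = []
for doc in docs:
    for key in doc:
        if key not in columns:
            columns.append(key)

# Build rectangular rows, filling gaps with None.
rows = [[doc.get(col) for col in columns] for doc in docs]

print(columns)
for row in rows:
    print(row)
```

Rows in this shape could then be handed to a tabular tool, for example pandas.DataFrame(rows, columns=columns). Nested values such as the languages list stay nested and would still need flattening.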

Nonetheless, you will encounter document-oriented databases in some form or another, and with proper care it is possible to get useful data from them.

Follow Along

Follow the instructions for Getting Started with MongoDB Atlas - registration is free, and the default options are generally all fine. You may have to wait some time for your cluster to be generated - it's spinning up multiple nodes, demonstrating the natural scalability of non-relational approaches.

Once it is finished, you can click the Connect button for your sandbox cluster - you have to specify a username and password, and it is suggested you randomly generate and save these values somewhere. You can generate random strings by clicking here (refresh for new ones), and use one for username and another for password (but save them somewhere locally so you remember them!).

Also, make sure to whitelist your current IP address! This will allow you to connect through the firewall that protects your cluster. If you would like to connect from Colab or another hosted notebook, run !curl https://ipecho.net/plain in a notebook cell to find its IP address. Note - it seems Colab IP addresses all start with 35., so you can whitelist all of them with the rule 35.0.0.0/8.

Next we will have to choose the connection method. Select “Connect to Your Application” (the other method, accessing data through tools, requires installing local tools - check out the extension links if you're curious). Select the MongoDB Drivers option, and choose the driver and version. Copy the connection string and replace the username and password placeholders with the values you just created, removing the '<' and '>' characters as you do. Then run the following (after replacing the string passed to MongoClient):

from pymongo import MongoClient
# The connection string must be passed as a quoted Python string.
client = MongoClient("mongodb+srv://<username>:<password>@cluster0.iapf3z5.mongodb.net/?retryWrites=true&w=majority")
db = client.test  # the database named "test"

Congratulations - you're connected to your MongoDB! PyMongo sends Python dictionaries to MongoDB and returns them from queries, but with an important caveat - all keys must be strings, so that each document can be cleanly translated to JSON by MongoDB.

result = db.test.insert_one({'stringy key': [2, 'thing', 3]})
print(result.inserted_id)
print(db.test.find_one({'stringy key': [2, 'thing', 3]}))

You should see output like:

5c6a0505d04bc70096888c2e
{'_id': ObjectId('5c6a0505d04bc70096888c2e'), 'stringy key': [2, 'thing', 3]}
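Because of the string-key restriction noted above, dictionaries with integer or other non-string keys need converting before they are inserted. The helper below is a hypothetical sketch (stringify_keys is not part of PyMongo) that does this recursively. Be aware that keys such as 1 and "1" would collide after conversion.

```python
def stringify_keys(value):
    """Recursively convert dictionary keys to strings."""
    if isinstance(value, dict):
        return {str(k): stringify_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [stringify_keys(item) for item in value]
    return value


raw = {1: "one", "nested": {2.5: ["a", {3: "three"}]}}
clean = stringify_keys(raw)
print(clean)
# With a live connection, the cleaned dictionary could then be stored:
# db.test.insert_one(clean)
```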

Challenge

Use dir() and help() to inspect the db.test object, and see what else you can do. In particular, check out db.test.insert_many and db.test.find, which let you work with lots of data at once!
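As a starting point, here is a hedged sketch of bulk insertion and querying with PyMongo's insert_many and find. The load_and_query function name and the sample documents are invented for illustration; the function takes any PyMongo collection, such as db.test from the connection made earlier.

```python
def load_and_query(collection):
    """Insert a few documents, then return the items with qty above 5."""
    docs = [
        {"item": "apple", "qty": 3},
        {"item": "pear", "qty": 8},
        {"item": "plum", "qty": 12},
    ]
    # insert_many sends the whole list in one call.
    result = collection.insert_many(docs)
    print(len(result.inserted_ids), "documents inserted")

    # find returns a cursor; iterate over it to get the matching documents.
    cursor = collection.find({"qty": {"$gt": 5}})
    return [doc["item"] for doc in cursor]


# With the connection from earlier in this section:
# print(load_and_query(db.test))
```

The filter {"qty": {"$gt": 5}} uses MongoDB's query operator syntax, where "$gt" means "greater than". Running the function twice inserts the sample documents twice, so expect duplicates in the results if you do.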

Additional Resources

Guided Project

In this guided project, we'll learn how to work with NoSQL databases and build data pipelines between different database types. Open guided-project.md in the GitHub repository below to follow along with the guided project.

The GitHub repository contains valuable resources, examples, and documentation that align with the lecture content and learning objectives. Take time to review these materials as they will help reinforce your understanding of NoSQL databases and MongoDB implementation.

Module Assignment

For this assignment, you'll practice working with MongoDB, creating document-oriented databases, and building data pipelines between SQL and NoSQL systems.

Solution Video

Additional Resources

MongoDB Learning

Documentation