Module 3: NoSQL and Document-Oriented Databases
Module Overview
Need to deal with Big Data? You may need tools beyond standard SQL approaches. Enter NoSQL and document-oriented databases! In this module, we explore the world of NoSQL databases with a focus on MongoDB, one of the most popular document-oriented databases. We'll learn how to store, retrieve, and query data in a schema-less environment.
Learning Objectives
- Deploy and use a simple MongoDB instance
- Build a data pipeline between SQL (SQLite) and NoSQL (MongoDB) databases
Setting Up MongoDB in 2025
MongoDB has evolved into a leading cloud database platform, with MongoDB Atlas as the recommended solution for most users. Atlas is a fully managed cloud service that simplifies deployment, scaling, and management. While local installations are still available for on-premises or offline use, Atlas is preferred for its ease of use, global availability, and automatic scaling. Here's what you need to know to get started in 2025:
Modern Setup Process
- MongoDB Atlas Cloud: A cloud-based platform that eliminates the need for local setup in most cases, offering automated management, backups, and global deployment across AWS, Azure, and Google Cloud.
- Flex Clusters: A flexible pricing model replacing older Shared (M2/M5) and Serverless instances. Flex Clusters provide usage-based scaling for variable workloads, starting with 5GB storage and supporting up to 500 operations per second, ideal for small to medium applications.
- Enhanced Security: Includes network isolation (restricting access to specific networks), end-to-end encryption, and granular access controls (role-based permissions). Higher-tier clusters offer advanced features like LDAP integration and database auditing.
- Integration Tools: Supports popular frameworks like Node.js, Python, and Java, and integrates with cloud providers (AWS, Azure, Google Cloud) and tools like MongoDB Compass for GUI management.
Current Deployment Options (2025)
- Free Tier (M0): Offers 512MB storage and shared resources, perfect for learning, prototyping, or small projects. Note: Features like Atlas Search and advanced backups are not supported.
- Flex Clusters: Usage-based pricing with 5GB+ storage, ideal for apps with variable traffic. Monthly costs are capped at ~$30 for up to 500 ops/sec (check MongoDB's pricing page for current rates).
- Dedicated Clusters: Start at ~$57–$60/month for M10 clusters (check MongoDB's pricing page), offering 10GB storage, 2 vCPUs, and dedicated resources for production applications.
Note: Pricing varies by cloud provider (AWS, Azure, Google Cloud), region, and usage (e.g., compute, storage, data transfer). Always verify current rates on MongoDB's pricing page, as costs may fluctuate.
Key Requirements
- A MongoDB Atlas account (free tier available, sign up with an email address)
- Python 3.7+ (preferably 3.10+) with the pymongo package (install via `pip install pymongo`; verify with `python -m pip show pymongo`)
- MongoDB Compass (optional, free GUI tool for visualizing and managing databases, downloadable from MongoDB's website)
- A stable internet connection and modern web browser (e.g., Chrome, Firefox) for accessing the Atlas dashboard
Quick Start Guide
To set up a free MongoDB Atlas cluster:
- Sign up at MongoDB Atlas.
- Choose the Free Tier (M0), select a cloud provider (e.g., AWS), and pick a region (e.g., US East).
- Configure your cluster's security: Add your IP address to the IP whitelist (e.g., 0.0.0.0/0 for access from anywhere, but use caution) and create a database user with a username and password.
- Deploy the cluster (takes ~5 minutes). Copy the connection string to connect via your app or MongoDB Compass.
- Test your connection using Python: Install pymongo (`pip install pymongo`) and use the connection string in your code (replace `<password>` with your user's password).
Tip: Ensure your IP whitelist includes your current network to avoid connection errors. For production apps, use VPC peering or private endpoints for secure access.
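The last step can be sketched as below. This is a minimal sketch, not the official Atlas procedure: the helper names `build_atlas_uri` and `ping`, the credentials, and the host `cluster0.example.mongodb.net` are all hypothetical placeholders. It percent-escapes the credentials because characters such as `@` or `/` in a password would otherwise break the connection string.

```python
from urllib.parse import quote_plus


def build_atlas_uri(username, password, host):
    """Build a mongodb+srv:// URI, percent-escaping the credentials."""
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f"mongodb+srv://{user}:{pwd}@{host}/?retryWrites=true&w=majority"


def ping(uri):
    """Return True if the cluster answers a ping within five seconds."""
    # Imported lazily so the URI helper works without pymongo installed.
    from pymongo import MongoClient
    from pymongo.errors import PyMongoError

    try:
        with MongoClient(uri, serverSelectionTimeoutMS=5000) as client:
            client.admin.command("ping")
        return True
    except PyMongoError:
        return False


if __name__ == "__main__":
    # Hypothetical credentials and host; substitute your own values.
    uri = build_atlas_uri("app_user", "p@ss/word", "cluster0.example.mongodb.net")
    print(uri)
```

If `ping` returns False, the IP access rule and the username/password are the first things worth checking.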
Additional Learning Resources
Enhance your MongoDB skills with these free resources:
- MongoDB University: Start with the “MongoDB Basics” course to learn core concepts like documents, collections, and queries.
- MongoDB Community Forums: Ask questions and get help from experts and peers.
- MongoDB Atlas Documentation: Detailed guides on setup, configuration, and best practices.
Note: As of 2025, MongoDB has deprecated M2/M5 Shared clusters and Serverless instances, transitioning to Flex Clusters for greater flexibility. Atlas is the industry standard for cloud deployments due to its automated management, global scalability, and cost predictability. Local installations (e.g., MongoDB Community Edition) remain an option for on-premises or offline environments but require manual setup and maintenance.
Objective 01 - Identify appropriate use cases for document-oriented databases
Required Resources
Review each preclass resource before class.
- Red Hat Satellite standardizing on PostgreSQL backend
- HN Discussion comparing MongoDB to PostgreSQL and SQLite
- Your databases always have a schema: Essay comparing SQL/NoSQL and discussing what a schema really is
Overview
There's a lot of hype for Big Data - what are the actual use cases? One common case is document-oriented databases (also known as document stores), which are great for storing large amounts of, well, documents (generally unstructured or semi-structured content).
Some people refer to these as “NoSQL” - but really, that is an imprecise label (anything that isn't SQL is “NoSQL”). More specifically, these are non-relational approaches to storing and retrieving data.
Document-oriented databases are a common subset of key-value stores - a general non-relational database paradigm. You've already interacted with structures like this! The general data structure abstraction is known as the hash table (or hash map), and common real-world implementations are Python dictionaries and JSON.
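To make the connection concrete, here is a small sketch of a "document" as a Python dictionary that round-trips through JSON. The field names and values are made up for illustration.

```python
import json

# A hypothetical "document": string keys mapping to values, which may
# themselves be lists or nested key-value structures.
document = {
    "name": "Ada",
    "languages": ["Python", "SQL"],
    "address": {"city": "London", "country": "UK"},
}

# Looking up a value by key is a hash-table operation in Python.
city = document["address"]["city"]

# Serializing to JSON and parsing it back preserves the structure.
as_json = json.dumps(document)
round_tripped = json.loads(as_json)

print(city)
print(round_tripped == document)
```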
How is a document-oriented database different? It's bigger, and (usually) run on somebody else's computer (or computers). Because values are indexed by key hashes, it is relatively easy to split the data across multiple instances, and figure out which instance is needed to actually retrieve a given record.
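The idea of splitting data by key hash can be sketched with a toy example. This is only an illustration of the principle, not how MongoDB implements sharding: three in-memory dictionaries stand in for three database instances, and `shard_for`, `put`, and `get` are hypothetical helper names.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Three in-memory dictionaries stand in for three database instances.
shards = [{} for _ in range(3)]


def put(key, value):
    """Store the value on whichever shard the key hashes to."""
    shards[shard_for(key, len(shards))][key] = value


def get(key):
    """The same hash tells us which instance holds the record."""
    return shards[shard_for(key, len(shards))][key]


put("user:1", {"name": "Ada"})
put("user:2", {"name": "Grace"})
print(get("user:2"))
```

Note that no instance needs to know about the others: the key alone determines where a record lives.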
Traditional SQL databases are more difficult to scale, as they calculate an index for each table based on its primary key, and these indices must live in-memory in a single server. There are new approaches to work around this which we will discuss later (e.g. PostgreSQL sharding), but it is true that non-relational approaches are still at least conceptually easier to scale.
Another important distinction - as we've seen, SQL databases require specifying a schema (what your data/types are) up-front. Document-oriented databases can generally take any sort of key-value pairs, including nested key-values, allowing you to flexibly store data without preemptively specifying structure. Some argue that this allows for faster prototyping, and is good for situations where you need to rapidly develop something that is likely to be completely rewritten in the long term anyway.
But there is an important caveat - though some may characterize NoSQL as being “schema-free”, it really just delays the necessity of a schema. Eventually your application needs to know what fields it is asking for (and probably what types they are). One way to characterize this approach is “schema-on-read”, as querying a document store usually requires specifying which key/value pairs you want. This is in contrast with “schema-on-write”, the approach of traditional SQL (which then lets you do things like `SELECT *` more easily, with some guarantees for what you're getting).
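Here is a minimal sketch of what “schema-on-read” can look like in application code. The expected fields and the helper name `read_with_schema` are assumptions made for illustration; the point is that the checks happen when a document is read, not when it is stored.

```python
# Hypothetical schema the application expects when it reads a document.
EXPECTED_FIELDS = {"name": str, "age": int}


def read_with_schema(document):
    """Pick out the expected fields, checking presence and type at read time."""
    record = {}
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in document:
            raise KeyError(f"missing field: {field}")
        value = document[field]
        if not isinstance(value, expected_type):
            raise TypeError(f"{field} should be {expected_type.__name__}")
        record[field] = value
    return record


# The stored document has an extra field the reader simply ignores.
stored = {"name": "Ada", "age": 36, "nickname": "countess"}
print(read_with_schema(stored))
```

A document missing `age`, or storing it as a string, is accepted at write time and only fails here, which is exactly the trade-off described above.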
Follow Along
Consider the following situations:
- A bank, with mission-critical data demanding high reliability and integrity
- A 2-person startup, rapidly developing a prototype in a week to demonstrate to investors
- A medium size company, profitable and established in their niche, building a new product offering
- A large company, in the same situation as the prior medium size company
Think for a moment - which sort of database (relational or non-relational) would you recommend in which situation?
The first situation (the bank) arguably demands a relational approach. A bank's data should be well defined, so an up-front schema is a fine requirement. Banks also benefit more from the reliability of SQL than the scalability of document-oriented databases.
The second situation could go either way, but many would suggest document-oriented databases. You can quickly make and throw away things without worrying about a schema, and you're likely to rewrite everything after you get investor approval anyway. Overall when prototyping though - go with the tools you know. If the two people involved happen to be really experienced with PostgreSQL, that is likely to be a better approach.
The third situation is similarly dependent on details, but erring towards relational is likely the right decision. Many companies think they have larger data than they have - PostgreSQL is likely to scale up just fine, and if you're an established company you have the time to plan and develop something with structure.
The fourth situation is deceptively similar to the third one - even a larger company can get by with relational! The main exception would be if they know they are developing a product that is closely related to their existing line and will thus immediately see significant usage and “big” data. But otherwise, even a large company can get a lot of mileage out of SQL, and benefits even more from the clarity and structure of a schema (due to having more developers and a larger ecosystem).
In the last two situations it'd also be worth considering modern “NewSQL” approaches, which try to combine the best of both worlds (structure from relational, scale from non-relational). More on this in the next module!
Challenge
Come up with two situations, one where relational is appropriate and another where non-relational is. Describe both in writing, as if you were making a recommendation to your manager for which database to choose.
Additional Resources
Objective 02 - Deploy and use a simple MongoDB instance
Required Resources
- MongoDB Atlas Getting Started: Instructions for setting up your first free MongoDB Cluster
- PyMongo: Full PyMongo documentation
Overview
MongoDB makes it quite easy to have data in the cloud - it won't be the sort of data you're necessarily used to, but it is useful and is a tool that many web apps depend on.
A good mental abstraction for MongoDB is that it is “big JSON in the cloud” - it lets you save and retrieve (persist) JSON-serialized data, at scale, over a network. Since JSON is a ubiquitous format, widely supported by browsers and web applications, this is pretty handy.
But there is an important caveat - from a data science perspective, “unstructured key-value pairs” aren't the most useful way to have data. They're great for application development - just save things and retrieve them when you need them, but use a key instead of an inscrutable memory address. But this variety means that you end up with a collection of heterogeneous documents - you aren't guaranteed that they all have the same fields, so you can't just throw them in a DataFrame and work with them.
Nonetheless, you will encounter document-oriented databases in some form or another, and with proper care it is possible to get useful data from them.
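One form of “proper care” is to normalize heterogeneous documents into uniform rows before analysis. The sketch below uses made-up documents and plain Python; rows like these could then be handed to a DataFrame constructor.

```python
# Hypothetical documents with differing fields, as a collection might return.
documents = [
    {"name": "Ada", "age": 36},
    {"name": "Grace", "rank": "Rear Admiral"},
    {"title": "untitled"},
]

# Collect every field that appears in any document, in a stable order.
columns = sorted({key for doc in documents for key in doc})

# Build one row per document, filling absent fields with None.
rows = [{col: doc.get(col) for col in columns} for doc in documents]

print(columns)
for row in rows:
    print(row)
```

Every row now has the same keys, at the cost of explicit missing values that the analysis has to handle.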
Follow Along
Follow the instructions for Getting Started with MongoDB Atlas - registration is free, and the default options are generally all fine. You may have to wait some time for your cluster to be generated - it's spinning up multiple nodes, demonstrating the natural scalability of non-relational approaches.
Once it is finished, you can click the Connect button for your sandbox cluster. You have to specify a username and password; it is suggested that you randomly generate these values, using one random string for the username and another for the password, and save them somewhere locally so you remember them.
Also, make sure to whitelist your current IP address! This allows you to connect through the firewall that protects your cluster. If you would like to connect from Colab or another hosted notebook, run `!curl https://ipecho.net/plain` to find its IP address. Note - it seems Colab IP addresses all start with `35.`, so you can whitelist all of them with the rule `35.0.0.0/8`.
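To see what a rule like `35.0.0.0/8` actually admits, Python's standard `ipaddress` module can check addresses against it. The sample addresses and the helper name `is_allowed` are made up for illustration.

```python
import ipaddress

# The rule from the text: every address whose first octet is 35.
rule = ipaddress.ip_network("35.0.0.0/8")

# The "allow from anywhere" rule, for comparison.
anywhere = ipaddress.ip_network("0.0.0.0/0")


def is_allowed(address, network=rule):
    """Return True if the address falls inside the given network."""
    return ipaddress.ip_address(address) in network


print(rule.num_addresses)                    # 16777216
print(is_allowed("35.201.10.4"))             # True
print(is_allowed("34.120.0.1"))              # False
print(is_allowed("203.0.113.7", anywhere))   # True
```

A /8 rule lets in more than 16 million addresses, so it is convenient for a sandbox cluster but much broader than whitelisting a single IP.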
Next, choose the connection method. Select “Connect to Your Application” (the other method, accessing data through tools, requires installing local tools - check out the extension links if you're curious). Select the MongoDB Drivers option, and choose the driver and version. Copy the connection string and replace the username and password placeholders with the ones you just created, removing the '<' and '>' characters as well. Then run the following (after replacing the string passed to `MongoClient`):
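from pymongo import MongoClient
# Replace <username> and <password> (angle brackets included) with your own credentials
client = MongoClient("mongodb+srv://<username>:<password>@cluster0.iapf3z5.mongodb.net/?retryWrites=true&w=majority")
db = client.test  # the database named "test"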
Congratulations - you're connected to your MongoDB! PyMongo interacts with MongoDB, giving/retrieving Python dictionaries from it, but with an important caveat - all keys must be strings, so that it can be cleanly translated to JSON by MongoDB.
result = db.test.insert_one({'stringy key': [2, 'thing', 3]})
print(result.inserted_id)
print(db.test.find_one({'stringy key': [2, 'thing', 3]}))
You should see output like:
5c6a0505d04bc70096888c2e
{'_id': ObjectId('5c6a0505d04bc70096888c2e'), 'stringy key': [2, 'thing', 3]}
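Since all keys must be strings, a dictionary with other key types needs converting before it is inserted. The helper below is a hypothetical sketch in plain Python, not part of PyMongo.

```python
def stringify_keys(value):
    """Recursively convert dictionary keys to strings."""
    if isinstance(value, dict):
        return {str(k): stringify_keys(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        # Tuples become lists, matching how JSON represents sequences.
        return [stringify_keys(item) for item in value]
    return value


raw = {1: "one", "nested": {2.5: ["a", {3: "three"}]}}
print(stringify_keys(raw))
```

One caveat: keys that differ only by type (such as `1` and `"1"`) collapse into a single key, so this conversion is only safe when that cannot happen in your data.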
Challenge
Use `dir()` and `help()` to inspect the db.test object, and see what else you can do. In particular, check out `db.test.insert_many` and `db.test.find`, which let you work with lots of data at once!
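As a starting point, here is a hedged sketch of two small wrappers around PyMongo's `insert_many` and `find`. The function names are hypothetical, and the usage lines are commented out because they assume the connected `db` object from earlier.

```python
def load_documents(collection, documents):
    """Insert a batch of documents and return how many were stored."""
    result = collection.insert_many(list(documents))
    return len(result.inserted_ids)


def fetch_documents(collection, query=None):
    """Return every document matching the query as a list."""
    # find() returns a lazy cursor; list() pulls all results into memory.
    return list(collection.find(query or {}))


# Example usage, assuming `db` is the connected database from above:
# load_documents(db.test, [{"n": i} for i in range(100)])
# fetch_documents(db.test, {"n": {"$gt": 95}})
```

Pulling everything into a list is fine for small collections; for large ones, iterating over the cursor directly avoids holding every document in memory.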
Additional Resources
Guided Project
In this guided project, we'll learn how to work with NoSQL databases and build data pipelines between different database types. Open guided-project.md in the GitHub repository below to follow along with the guided project.
The GitHub repository contains valuable resources, examples, and documentation that align with the lecture content and learning objectives. Take time to review these materials as they will help reinforce your understanding of NoSQL databases and MongoDB implementation.
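To give a feel for the SQL side of such a pipeline, the sketch below reads rows from an in-memory SQLite database and turns each one into a dictionary ready to be stored as a document. The table and its contents are made up for illustration and are not taken from the guided project.

```python
import sqlite3

# An in-memory SQLite database with a hypothetical table, for illustration.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become addressable by column name
conn.execute(
    "CREATE TABLE characters (id INTEGER PRIMARY KEY, name TEXT, level INTEGER)"
)
conn.executemany(
    "INSERT INTO characters (name, level) VALUES (?, ?)",
    [("Aria", 3), ("Borin", 7)],
)

# Each relational row becomes one document keyed by column name.
cursor = conn.execute("SELECT * FROM characters ORDER BY id")
documents = [dict(row) for row in cursor]
conn.close()

print(documents)
# With a connected PyMongo database, these could then be stored with
# something like: db.characters.insert_many(documents)
```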
Module Assignment
For this assignment, you'll practice working with MongoDB, creating document-oriented databases, and building data pipelines between SQL and NoSQL systems.