Module 1: Introduction to SQL
Module Overview
Structured Query Language - the Lingua Franca of Data. Known (to varying degrees) by software engineers, data scientists, DevOps, and MBAs, SQL is the beginning (and sometimes entirety) of many data pipelines. Why is it universal? One reason is it is relatively simple - it is a declarative language, demanding what it wants by mere description, not worrying about the details of "how." SQL is a standard - all SQL distributions support a common core of functionality, and most add their own features or special syntax on top of it. For our purposes, we will focus on PostgreSQL, but most of what we learn will be general.
Learning Objectives
- Write basic SQL queries to get specific subsets of data from a database and answer basic "business questions"
- Understand the purpose of SQL joins and perform joins to access data from multiple tables
Objective 01 - write basic SQL queries to get specific subsets of data from a database and answer basic business questions
Required Resources
Overview
The main purpose of SQL is to just get (SELECT
) some data, and to restrict the data based on
certain
conditions (WHERE
). Getting all customers from a certain region, or everyone who bought a
certain
product or on a certain day - these are common business questions useful for reports and dashboards, and
easily answered with basic SQL.
SQL is a standard, and a computational implementation of relational algebra - the math underpinning modern data modeling.
In relational algebra, datum are considered members of sets, where data with the same overall structure (attributes) are members of the same sets. Data in other sets has different attributes (columns if you want to picture it), but has at least one column shared with another set - a key that can relate entries across them.
So querying in relational algebra is essentially a batch of set operations - union, difference, Cartesian product, and intersection - but constrained to ensure the end result is valid.
The Pi symbol indicates projection (choosing attributes), and sigma
indicates selection
(setting
conditions). The above translates roughly to “Get the names and ids of all customers from California.”
Follow Along
Querying in SQL is essentially the same as relational algebra, but instead of mathematical syntax and
Greek letters we use capitalized keywords like SELECT
and WHERE
. Here is a
typical SQL query:
SELECT name, id
FROM customers
WHERE state = 'CA';
One counterintuitive difference - in SQL, SELECT
is the equivalent of projection
(Pi),
and WHERE
is the equivalent of selection (in relational algebra).
SQL queries are declarative - they simply state what they want, and the database sorts out the details. Depending on your career you may eventually want to specialize and dig deeper, but for now - trust the query planner (the part of the database that does this), and always start with queries as simple as you can.
SQL Constructs
Other common SQL constructs:
- AND - you can use
AND
to specify multiple conditions, e.g.WHERE state = 'CA'
andname = 'John Smith'
- LIMIT - if you only want at most a certain number of rows
- ORDER - to sort output
- GROUP - to aggregate output (more on this later)
In addition to writing SQL queries directly, they can be executed in Python via connector modules. For example:
import sqlite3
conn = sqlite3.connect('rpg_db.sqlite3')
curs = conn.cursor()
query = 'SELECT COUNT(*) FROM armory_item;'
curs.execute(query)
curs.execute(query).fetchall()
The general usage pattern is:
- Import the library
- Instantiate a connection to the database (may require authentication for “real” databases)
- Create a cursor (pointer into the database) - depending on workload/app you can have multiple cursors per connection
- Execute a query using the cursor
- (If relevant) Fetch results from the query
Note that, for write (INSERT
, UPDATE
, DELETE
, DROP
)
queries it may be necessary to execute a .commit()
method on the cursor for the change to
“stick” (be finalized in the database).
Challenge
Imagine the customers table also has address
and customer_since
attributes (where customer_since
is a year). Write a query to get all names and addresses
of customers living in Colorado who have been customers for at least 10 years.
Additional Resources
Objective 02 - Understand the purpose of SQL join and perform joins to access data from multiple tables
Overview
What if the data you need lives in separate tables (e.g. one for users, and another for orders)? That's
when you need to JOIN
- write a query that brings data across tables together on some
common key. This is where “relational” in relational databases comes from, and is the true heart of the
power of SQL.
Joins come in several flavors, and a favorite of SQL puzzles is to understand the distinction between
them. Consider a scenario where you have two tables, for now considered abstractly as X (with n rows
x_i
) and Y (with m rows y_j
), and that there is some common key id across the
tables allowing rows to be connected. Following are the possible joins:
- Cross join - also known as Cartesian product, this combines each
x_i
with eachy_j
, and will multiply out the number of rows returned (n * m
) - Left outer join - returns all
n
rows ofX
, with the columns populated withY
values, whenever the common key id matches and a null for those values that do not match - Right outer join - similar to left, but uses the
m
rows ofY
instead - Inner join - only returns rows that are both in
X
andY
, reducing the number of rows returned, and guaranteeing that no null values are introduced
SQL databases support all of the above, but the last case - inner joins - is the most practical in most situations. Cross joins are essentially never used outside of testing the database (most join implementations are cross-joins followed by filtering), and outer joins, while occasionally useful, should never be the default (only choose an outer join if you know it is appropriate for your problem).
There are specific cases of inner join referred to as equi-join (specifying what should be equal) and natural join (automatically joining by matching column names) - their implementation will vary across real-world databases, and use of natural join is generally discouraged. Start with regular inner join - you can research and pick up more sophisticated techniques as needed later on.
Follow Along
The full syntax to specify an inner join is:
SELECT charactercreator_character.name, charactercreator_character_inventory.id
FROM charactercreator_character
INNER JOIN charactercreator_character_inventory ON
charactercreator_character.character_id = charactercreator_character_inventory.character_id;
This query can be made easier by using table aliases, and replacing the explicit join with an implicit
inner join using a WHERE
clause:
SELECT character.name, inventory.id
FROM charactercreator_character AS character,
charactercreator_character_inventory AS inventory
WHERE character.character_id = inventory.character_id;
Both above queries are identical, returning 898 rows for the included rpg_db.sqlite3
data
in this module. Since that's a lot of rows, let's just load the first 10:
SELECT character.name, inventory.id
FROM charactercreator_character AS character,
charactercreator_character_inventory AS inventory
WHERE character.character_id = inventory.character_id
LIMIT 10;
Aliquid iste optio reiciendi 1
Aliquid iste optio reiciendi 2
Aliquid iste optio reiciendi 3
Optio dolorem ex a 4
Optio dolorem ex a 5
Optio dolorem ex a 6
Minus c 7
Minus c 8
Sit ut repr 9
Sit ut repr 10
What are we looking at? Character names and all their unique inventory id
values - each of
these then
corresponds to an actual item, but the details of that lives in another table. Fortunately, we can get
it by just growing our join:
SELECT character.name, item.name
FROM charactercreator_character AS character,
charactercreator_character_inventory AS inventory,
armory_item AS item
WHERE character.character_id = inventory.character_id
AND inventory.item_id = item.item_id
LIMIT 10;
Aliquid iste optio reiciendi Quos re
Aliquid iste optio reiciendi Dolores rem v
Aliquid iste optio reiciendi Alias laboriosam sapiente e
Optio dolorem ex a Sapiente do
Optio dolorem ex a Natus
Optio dolorem ex a Libero saepe accusantium
Minus c Saepe ea vo
Minus c Tempore
Sit ut repr Atque laudantium dolor q
Sit ut repr Sit rerum err
Note that all the names (for both characters and items) are automatically generated lorem ipsum - but the results look pretty good! We've successfully joined across three tables, connecting characters to the items they own.
The relationship between characters and items is known as many-to-many - one character can have many
items, and one item can be had by many characters (“item” here refers to the general idea of e.g. a
sword, and not a specific copy of a sword). The intermediary
charactercreator_character_inventory
facilitates this relationship - if you have simpler relationships, you can get the results from joining
only two tables. Other sorts of relationships:
- One-to-one - each entry has exactly one corresponding row in each table (e.g. books and ISBNs, people and SSNs)
- One-to-many - each entry in the first table can have multiple corresponding rows in the second table (e.g. books and book editions/printings, addresses and people who live there)
All of these can come up (many-to-one can be considered the reverse of one-to-many), but many-to-many (m2m) is particularly powerful. Consider other m2m examples - customers and items purchased, books and authors, and more.
Challenge
Come up with your own real-world examples of situations/data that can be modeled with one-to-one, one-to-many, and many-to-many relationships. Write “pseudo-SQL” to describe and query the data.
Additional Resources
Guided Project
In this guided project, we'll learn the fundamentals of SQL and how to query databases effectively. Open guided-project.md in the GitHub repository below to follow along with the guided project.
Module Assignment
For this assignment, you'll practice writing SQL queries to retrieve and analyze data from a database, demonstrating your understanding of basic SQL operations and joins.