Module 1: Introduction to SQL

Module Overview

Structured Query Language - the Lingua Franca of Data. Known (to varying degrees) by software engineers, data scientists, DevOps, and MBAs, SQL is the beginning (and sometimes entirety) of many data pipelines. Why is it universal? One reason is it is relatively simple - it is a declarative language, demanding what it wants by mere description, not worrying about the details of "how." SQL is a standard - all SQL distributions support a common core of functionality, and most add their own features or special syntax on top of it. For our purposes, we will focus on PostgreSQL, but most of what we learn will be general.

Learning Objectives

Objective 01 - write basic SQL queries to get specific subsets of data from a database and answer basic business questions

Required Resources

Overview

The main purpose of SQL is to just get (SELECT) some data, and to restrict the data based on certain conditions (WHERE). Getting all customers from a certain region, or everyone who bought a certain product or on a certain day - these are common business questions useful for reports and dashboards, and easily answered with basic SQL.

SQL is a standard, and a computational implementation of relational algebra - the math underpinning modern data modeling.

In relational algebra, datum are considered members of sets, where data with the same overall structure (attributes) are members of the same sets. Data in other sets has different attributes (columns if you want to picture it), but has at least one column shared with another set - a key that can relate entries across them.

So querying in relational algebra is essentially a batch of set operations - union, difference, Cartesian product, and intersection - but constrained to ensure the end result is valid.

 \Pi_{name,id}(\sigma_{state=CA}(Customers))

The Pi symbol indicates projection (choosing attributes), and sigma indicates selection (setting conditions). The above translates roughly to “Get the names and ids of all customers from California.”

Follow Along

Querying in SQL is essentially the same as relational algebra, but instead of mathematical syntax and Greek letters we use capitalized keywords like SELECT and WHERE. Here is a typical SQL query:

SELECT name, id
FROM customers
WHERE state = 'CA';

One counterintuitive difference - in SQL, SELECT is the equivalent of projection (Pi), and WHERE is the equivalent of selection (in relational algebra).

SQL queries are declarative - they simply state what they want, and the database sorts out the details. Depending on your career you may eventually want to specialize and dig deeper, but for now - trust the query planner (the part of the database that does this), and always start with queries as simple as you can.

SQL Constructs

Other common SQL constructs:

In addition to writing SQL queries directly, they can be executed in Python via connector modules. For example:

import sqlite3 
conn = sqlite3.connect('rpg_db.sqlite3') 
curs = conn.cursor() 
query = 'SELECT COUNT(*) FROM armory_item;' 
curs.execute(query) 
curs.execute(query).fetchall()

The general usage pattern is:

  1. Import the library
  2. Instantiate a connection to the database (may require authentication for “real” databases)
  3. Create a cursor (pointer into the database) - depending on workload/app you can have multiple cursors per connection
  4. Execute a query using the cursor
  5. (If relevant) Fetch results from the query

Note that, for write (INSERT, UPDATE, DELETE, DROP) queries it may be necessary to execute a .commit() method on the cursor for the change to “stick” (be finalized in the database).

Challenge

Imagine the customers table also has address and customer_since attributes (where customer_since is a year). Write a query to get all names and addresses of customers living in Colorado who have been customers for at least 10 years.

Additional Resources

Objective 02 - Understand the purpose of SQL join and perform joins to access data from multiple tables

Overview

What if the data you need lives in separate tables (e.g. one for users, and another for orders)? That's when you need to JOIN - write a query that brings data across tables together on some common key. This is where “relational” in relational databases comes from, and is the true heart of the power of SQL.

Joins come in several flavors, and a favorite of SQL puzzles is to understand the distinction between them. Consider a scenario where you have two tables, for now considered abstractly as X (with n rows x_i) and Y (with m rows y_j), and that there is some common key id across the tables allowing rows to be connected. Following are the possible joins:

SQL databases support all of the above, but the last case - inner joins - is the most practical in most situations. Cross joins are essentially never used outside of testing the database (most join implementations are cross-joins followed by filtering), and outer joins, while occasionally useful, should never be the default (only choose an outer join if you know it is appropriate for your problem).

There are specific cases of inner join referred to as equi-join (specifying what should be equal) and natural join (automatically joining by matching column names) - their implementation will vary across real-world databases, and use of natural join is generally discouraged. Start with regular inner join - you can research and pick up more sophisticated techniques as needed later on.

Follow Along

The full syntax to specify an inner join is:

SELECT charactercreator_character.name, charactercreator_character_inventory.id
FROM charactercreator_character
INNER JOIN charactercreator_character_inventory ON
charactercreator_character.character_id = charactercreator_character_inventory.character_id;

This query can be made easier by using table aliases, and replacing the explicit join with an implicit inner join using a WHERE clause:

SELECT character.name, inventory.id
FROM charactercreator_character AS character,
charactercreator_character_inventory AS inventory
WHERE character.character_id = inventory.character_id;

Both above queries are identical, returning 898 rows for the included rpg_db.sqlite3 data in this module. Since that's a lot of rows, let's just load the first 10:

SELECT character.name, inventory.id
FROM charactercreator_character AS character,
charactercreator_character_inventory AS inventory
WHERE character.character_id = inventory.character_id
LIMIT 10;
Aliquid iste optio reiciendi    1
Aliquid iste optio reiciendi    2
Aliquid iste optio reiciendi    3
Optio dolorem ex a    4
Optio dolorem ex a    5
Optio dolorem ex a    6
Minus c    7
Minus c    8
Sit ut repr    9
Sit ut repr    10

What are we looking at? Character names and all their unique inventory id values - each of these then corresponds to an actual item, but the details of that lives in another table. Fortunately, we can get it by just growing our join:

SELECT character.name, item.name
FROM charactercreator_character AS character,
charactercreator_character_inventory AS inventory,
armory_item AS item
WHERE character.character_id = inventory.character_id
AND inventory.item_id = item.item_id
LIMIT 10;
Aliquid iste optio reiciendi    Quos re
Aliquid iste optio reiciendi    Dolores rem v
Aliquid iste optio reiciendi    Alias laboriosam sapiente e
Optio dolorem ex a    Sapiente do
Optio dolorem ex a    Natus
Optio dolorem ex a    Libero saepe accusantium
Minus c    Saepe ea vo
Minus c    Tempore
Sit ut repr    Atque laudantium dolor q
Sit ut repr    Sit rerum err

Note that all the names (for both characters and items) are automatically generated lorem ipsum - but the results look pretty good! We've successfully joined across three tables, connecting characters to the items they own.

The relationship between characters and items is known as many-to-many - one character can have many items, and one item can be had by many characters (“item” here refers to the general idea of e.g. a sword, and not a specific copy of a sword). The intermediary charactercreator_character_inventory facilitates this relationship - if you have simpler relationships, you can get the results from joining only two tables. Other sorts of relationships:

All of these can come up (many-to-one can be considered the reverse of one-to-many), but many-to-many (m2m) is particularly powerful. Consider other m2m examples - customers and items purchased, books and authors, and more.

Challenge

Come up with your own real-world examples of situations/data that can be modeled with one-to-one, one-to-many, and many-to-many relationships. Write “pseudo-SQL” to describe and query the data.

Additional Resources

Guided Project

In this guided project, we'll learn the fundamentals of SQL and how to query databases effectively. Open guided-project.md in the GitHub repository below to follow along with the guided project.

Module Assignment

For this assignment, you'll practice writing SQL queries to retrieve and analyze data from a database, demonstrating your understanding of basic SQL operations and joins.

Solution Video

Additional Resources

PostgreSQL Documentation