Module 2: SQL for Analysis

Module Overview

SQL is simple, but it can still be surprisingly powerful. As we learned in the first unit, a lot of analysis can be done with just descriptive statistics - and with the right queries, SQL can compute all of those and more.

Learning Objectives

Objective 01 - Deploy and use a simple PostgreSQL database

Video Walkthrough

Overview

SQLite is a great starting point, and still useful for all sorts of situations - but if you want to deploy and scale, PostgreSQL is the next tool to check out. It is a performant, standards-compliant SQL database with a great ecosystem and support for sharding (partitioning data across servers for scalability - typically the domain of DevOps, but good to be aware of).

PostgreSQL is a highly influential relational database management system (RDBMS), created in the 1980s at UC Berkeley as an evolution of the Ingres database. It is released under the PostgreSQL License, a permissive MIT-style open-source license. Its principal creator, Michael Stonebraker, received the 2014 Turing Award (the computer science equivalent of the Nobel Prize) in recognition of his work and its impact.

Despite its long and rich history, PostgreSQL is a well-supported modern database, with significant new functionality having been added in recent years. It is also the second most widely used open-source SQL database, after MySQL (now owned by Oracle).

Follow Along

Sign up for Aiven.io, log in, and create a new free PostgreSQL service by following the steps outlined in the walkthrough video above. No address or credit card should be required.

Connect to your PostgreSQL service using the DBeaver client and open a SQL console. We will use it to run some SQL queries:

CREATE TABLE test_table (
  id        SERIAL PRIMARY KEY,
  name      varchar(40) NOT NULL,
  data      JSONB
);

Then you can insert some data:

INSERT INTO test_table (name, data) VALUES
(
  'A row name',
  null
),
(
  'Another row, with JSON',
  '{ "a": 1, "b": ["dog", "cat", 42], "c": true }'
);

And finally use SELECT to see the result!

SELECT * FROM test_table;
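If you want to rehearse this same create/insert/select flow locally before (or alongside) the cloud service, Python's built-in sqlite3 module accepts nearly identical SQL. This is a stand-in sketch, not the PostgreSQL setup itself: SQLite has no SERIAL or JSONB types, so the table below uses INTEGER PRIMARY KEY AUTOINCREMENT and a plain TEXT column holding a JSON string instead.

```python
import json
import sqlite3

# In-memory database: nothing to install or deploy
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# SQLite equivalents of the PostgreSQL types above:
# SERIAL -> INTEGER PRIMARY KEY AUTOINCREMENT, JSONB -> TEXT
cur.execute("""
    CREATE TABLE test_table (
        id    INTEGER PRIMARY KEY AUTOINCREMENT,
        name  VARCHAR(40) NOT NULL,
        data  TEXT
    )
""")

# Same two rows as the PostgreSQL example, passed as parameters
cur.execute(
    "INSERT INTO test_table (name, data) VALUES (?, ?), (?, ?)",
    ("A row name", None,
     "Another row, with JSON",
     json.dumps({"a": 1, "b": ["dog", "cat", 42], "c": True})),
)
conn.commit()

rows = cur.execute("SELECT * FROM test_table").fetchall()
print(rows)
conn.close()
```

The id column fills itself in, just as SERIAL does in PostgreSQL, so the first row comes back as (1, 'A row name', None).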

Challenge

Create more tables and insert some rows into them.

Additional Resources

Objective 02 - Create a data pipeline with SQL

Required Resources

Overview

What good is a database without data in it? And do we really want to do all of our analysis directly with ad hoc queries?

There's a better way - getting data into and out of things programmatically, by building data pipelines.

SQL is great, but Python is our home - it's where our core code will live, and it's supported by a great ecosystem of Data Science libraries and frameworks. The good news is we can connect to SQL from Python, using Psycopg!

Follow Along

Use pip (in an environment of your choice - even Colab) to install psycopg2-binary. Then execute the following (this requires having completed the setup from the previous objective in this module):

import psycopg2

# Connect to Aiven.io-hosted PostgreSQL
# Fill in the connection details from your Aiven.io console
conn = psycopg2.connect(
    dbname='TODO',
    user='TODO',
    password='TODO',
    host='TODO.aivencloud.com',
    port=TODO
)

# A "cursor", a structure to iterate over db records to perform queries
cur = conn.cursor()

# An example query
cur.execute('SELECT * from test_table;')

# Fetch all rows from the executed query
rows = cur.fetchall()

# Iterate over the rows and print them
for row in rows:
    print(row)

# Close the cursor and connection
cur.close()
conn.close()

If you see something like (1, 'A row name', None), congratulations! You've interacted with your database from Python.

Challenge

Write a Python function to generate random data for insertion, and execute it in a loop to generate and insert a large quantity of data.
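One possible shape for this challenge (a sketch, not the only answer - it uses only the standard library, with sqlite3 standing in for the psycopg2 connection; against PostgreSQL the only changes are the connection call and %s placeholders):

```python
import random
import sqlite3
import string

def random_rows(n):
    """Generate n (name, value) tuples of random data."""
    return [
        ("".join(random.choices(string.ascii_lowercase, k=8)),
         random.randint(0, 1000))
        for _ in range(n)
    ]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE random_data (id INTEGER PRIMARY KEY, name TEXT, value INTEGER)"
)

# executemany() batches each chunk of rows into one call;
# looping lets you scale the total volume up as far as you like
for _ in range(10):
    cur.executemany(
        "INSERT INTO random_data (name, value) VALUES (?, ?)",
        random_rows(100),
    )
conn.commit()

count = cur.execute("SELECT COUNT(*) FROM random_data").fetchone()[0]
print(count)  # 1000
```

Batching with executemany() (or psycopg2's execute_values helper) is markedly faster than issuing one INSERT per row when the row count gets large.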

Additional Resources

Guided Project

Important Note

The following Guided Project demonstrates using a cloud PostgreSQL service called ElephantSQL, which has since stopped taking new sign-ups, to accomplish the project's objectives.

In order to follow along, some adjustments must be made:

  1. Use Aiven.io instead of ElephantSQL, as demonstrated in Learning Objective 1 in this module.
  2. Connect to and interact with your database from your Python code as demonstrated in Learning Objective 2.

In this guided project, we'll learn how to deploy PostgreSQL databases and use SQL for data analysis. Open guided-project.md in the GitHub repository below to follow along with the guided project.

Module Assignment

Important Note

The following Module Assignment demonstrates using a cloud PostgreSQL service called ElephantSQL, which has since stopped taking new sign-ups, to accomplish the project's objectives.

In order to follow along, some adjustments must be made:

  1. Use Aiven.io instead of ElephantSQL, as demonstrated in Learning Objective 1 in this module.
  2. Connect to and interact with your database from your Python code as demonstrated in Learning Objective 2.

For this assignment, you'll practice deploying a PostgreSQL database, creating tables, and building data pipelines using SQL for analysis.

Solution Video

Additional Resources

PostgreSQL Tutorials

Cloud Database Services