Module 2: Exploratory Data Analysis & Feature Engineering

Module Overview

In this module, you'll learn how to work with data using Google Colab and pandas. You'll discover how to read and load datasets, explore your data with pandas' analysis tools, perform feature engineering to transform your data, and use the pandas string functions to manipulate text data.

Learning Objectives

Detailed Objective: Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) is an essential part of learning to be a data scientist, and something that experienced data scientists do regularly.

We'll be using some of the numerous tools available in the pandas library. Earlier in the module, we learned how to load datasets into notebooks. So now that we have all this data, what do we do with it?

Basic Information Methods

We can use a few methods and attributes to take a quick look at a DataFrame and get an idea of what's inside. Here are some of the most common, with a description of what each one does:

Method          Description
df.shape        Size of the DataFrame as (rows, columns); this is an attribute, so it takes no parentheses
df.head()       First 5 rows; pass a number in the parentheses to display the first n rows
df.tail()       Last 5 rows; pass a number in the parentheses to display the last n rows
df.describe()   Summary statistics for the numeric columns
df.info()       Number of entries (rows), number of columns, and the data type of each column
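As a quick illustration of the table above, here is a minimal sketch on a tiny DataFrame. The column names and values are made up for this example and are not part of any course dataset.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration
df = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f", "g"],
    "score": [10, 20, 30, 40, 50, 60, 70],
})

# shape is an attribute, not a method, so there are no parentheses
print(df.shape)

# head() and tail() default to 5 rows; pass n to change that
first_three = df.head(3)
last_two = df.tail(2)
print(first_three)
print(last_two)

# describe() summarizes only the numeric columns by default
summary = df.describe()
print(summary)

# info() prints the row count, column count, and data types
df.info()
```

Note that describe() leaves out the "name" column here because it holds strings rather than numbers.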

Column-specific Methods

Sometimes we don't want to look at the entire DataFrame and instead focus on a single column or a few columns. There are a few ways to select a column, but we'll mainly use the column name. For example, if we have a DataFrame called df and a column named "column_1," we could select just a single column by using df["column_1"]. Once we have a single column chosen, we can use some of the following methods to get more information.

Method                              Description
df.columns                          Column labels of the DataFrame
df['column_name']                   Select a single column (returns a Series)
df['column_name'].value_counts()    Count how many times each unique value occurs in the column
df.sort_values(by='column_name')    Sort the rows by the values in the given column
df.drop()                           Remove rows or columns by specifying the label or index of the row/column
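The column-level operations above can be sketched as follows. The DataFrame is a hypothetical stand-in with made-up values, used only to show what each call returns.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "red"],
    "mass": [0.9, 0.8, 1.0, 0.7, 0.85],
})

# Column labels as a plain Python list
print(list(df.columns))

# Selecting one column by name returns a Series
colors = df["color"]

# Count how often each unique value appears, most frequent first
counts = colors.value_counts()
print(counts)

# Sort the rows by one column (ascending by default)
by_mass = df.sort_values(by="mass")

# drop() returns a new DataFrame; the original df is left unchanged
no_mass = df.drop(columns=["mass"])   # remove a column by label
no_first_row = df.drop(index=[0])     # remove a row by index label
```

Because sort_values() and drop() return new objects by default, assign the result to a variable if you want to keep it.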

Handling Missing Values

With a lot of data comes the unavoidable fact that some of it will be messy. Messy data means that there will be missing values, "not-a-number" (NaN) occurrences, and zeros that do not actually mean zero (for instance, a 0 recorded in place of a missing measurement). Fortunately, several pandas methods make dealing with the mess a little easier.

Method               Description
df.isnull().sum()    Count the null values (NaN or None) in each column
df.fillna()          Fill NaN values in a variety of ways
df.dropna()          Remove rows or columns that contain NaN or None; by default drops every row with at least one missing value
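Here is a minimal sketch of the three methods above on a small DataFrame with deliberately missing entries. The data is made up, and filling the numeric column with its mean is just one possible choice, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
df = pd.DataFrame({
    "color": ["red", None, "blue", "green"],
    "mass": [0.9, 0.8, np.nan, 0.7],
})

# isnull() marks missing cells as True; sum() adds them up per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# fillna() accepts a dict mapping column names to fill values
filled = df.fillna({"color": "unknown", "mass": df["mass"].mean()})
print(filled)

# dropna() removes every row that has at least one missing value
dropped = df.dropna()
print(dropped)
```

Both fillna() and dropna() return new DataFrames here, so df itself still contains the missing values.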

Practical Example

Let's see a practical example of EDA using the M&Ms dataset, which is small and contains both numeric and object (string) data types:

# Import pandas
import pandas as pd

# Read the data from the website
url_mms = 'https://tinyurl.com/mms-statistics'
df = pd.read_csv(url_mms)

# Look at the shape
df.shape  # returns (816, 4)

# Look at the head of the file
df.head()

When we run df.head(), we'll see the first 5 rows of the dataset with columns 'type', 'color', 'diameter', and 'mass'. We can get more information about the dataset with:

# DataFrame information
df.info()

The output will tell us we have 816 entries, 4 columns, and reveal the data types. Let's also check the statistics of the numeric columns:

# DataFrame describe
df.describe()

This shows statistics like count, mean, standard deviation, min/max, and percentiles for numeric columns.

What if we want to count how many of each candy type we have?
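One way to answer this is to select the 'type' column and call value_counts() on it. The sketch below uses a few made-up stand-in rows with the same column names so that it runs without downloading anything; the type labels and numbers are assumptions for illustration, not values taken from the real file. In the notebook, apply the same call to the df loaded from url_mms.

```python
import pandas as pd

# Stand-in rows (made up) that mimic the columns of the M&Ms data.
# In the notebook, use the df returned by pd.read_csv(url_mms) instead.
df = pd.DataFrame({
    "type": ["plain", "peanut", "plain", "peanut butter", "plain", "peanut"],
    "color": ["red", "blue", "green", "brown", "yellow", "orange"],
    "diameter": [13.3, 14.9, 13.2, 15.5, 13.4, 15.0],
    "mass": [0.87, 2.6, 0.9, 1.8, 0.88, 2.5],
})

# Count how many rows there are of each candy type, most frequent first
type_counts = df["type"].value_counts()
print(type_counts)

# normalize=True returns proportions instead of raw counts
type_share = df["type"].value_counts(normalize=True)
print(type_share)
```

The same pattern works for any categorical column, for example df['color'].value_counts().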

Resources