Module 2: Exploratory Data Analysis & Feature Engineering
Module Overview
In this module, you'll learn how to work with data using Google Colab and Pandas. You'll discover how to read and load datasets, explore your data using Pandas' powerful analysis tools, perform feature engineering to transform your data, and master string functions in Pandas for text data manipulation.
Learning Objectives
- Read and load datasets using Google Colab and Pandas
- Explore and analyze data using Pandas' functionality
- Apply feature engineering techniques to transform and prepare data
- Master string functions in Pandas for text data manipulation
Detailed Objective: Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) is an essential part of learning to be a data scientist, and something that experienced data scientists do regularly.
We'll be using some of the numerous tools available in the pandas library. Earlier in the module, we learned how to load datasets into notebooks. So now that we have all this data, what do we do with it?
Basic Information Methods
We can use a few methods to take a quick look at a DataFrame and get an idea of what's inside. Here are some of the most common ones, with a description of what each one does:
Method | Description |
---|---|
df.shape | Display the size as (rows, columns); this is an attribute, so it takes no parentheses |
df.head() | Display the first 5 rows (we can display the first n rows by putting the number in the parentheses) |
df.tail() | Display the last 5 rows (we can display the last n rows by putting the number in the parentheses) |
df.describe() | Display the statistics of numerical data types |
df.info() | Display the number of entries (rows), number of columns, and the data types |
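As a rough illustration of the methods in the table above, here is a minimal sketch on a small made-up DataFrame. The name df_toy and its values are hypothetical and are not part of any course dataset.

```python
import pandas as pd

# Hypothetical toy data, invented only to illustrate the methods above
df_toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "score": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

shape = df_toy.shape          # attribute (no parentheses): (rows, columns)
first_rows = df_toy.head()    # first 5 rows by default
last_two = df_toy.tail(2)     # last 2 rows
stats = df_toy.describe()     # summary statistics for the numeric column
df_toy.info()                 # prints entry count, column count, and dtypes
```

Note that df_toy.info() prints its report directly, while the other calls return objects that a notebook displays when they are the last expression in a cell.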
Column-specific Methods
Sometimes we don't want to look at the entire DataFrame and instead focus on a single column or a few columns. There are a few ways to select a column, but we'll mainly use the column name. For example, if we have a DataFrame called df and a column named "column_1", we could select just that column by using df["column_1"]. Once we have a single column chosen, we can use some of the following methods to get more information.
Method | Description |
---|---|
df.columns | List the column names |
df['column_name'] | Select a single column (returns a Series) |
df['column_name'].value_counts() | Count how many times each unique value occurs in the column (especially useful for object and boolean columns) |
df.sort_values(by='column_name') | Sort the rows by the values in the given column |
df.drop() | Remove rows or columns by specifying the label or index of the row/column |
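The column-level methods above can be sketched on a small made-up DataFrame. The name df_toy and the column names "color" and "mass" used here are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data, invented only to illustrate the methods above
df_toy = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "mass": [0.9, 0.8, 1.0, 0.7],
})

column_names = list(df_toy.columns)            # column labels as a plain list
colors = df_toy["color"]                       # a single column is a Series
color_counts = colors.value_counts()           # occurrences of each unique value
by_mass = df_toy.sort_values(by="mass")        # rows sorted by mass, ascending
without_mass = df_toy.drop(columns=["mass"])   # drop a column by label
without_first_row = df_toy.drop(index=[0])     # drop a row by index label
```

None of these calls change df_toy itself. Each one returns a new object, so the result has to be assigned to a name if we want to keep it.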
Handling Missing Values
With a lot of data comes the unavoidable fact that some of it will be messy. Messy data means that there will be missing values, "not-a-number" (NaN) occurrences, and values recorded as zero that are not truly zero. Fortunately, several pandas methods make dealing with the mess a little easier.
Method | Description |
---|---|
df.isnull().sum() | Count the number of null occurrences (NaN or None) in each column |
df.fillna() | Fill NaN values in a variety of ways |
df.dropna() | Remove rows or columns that contain NaN or None; by default removes every row with at least one missing value |
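Here is a minimal sketch of the three missing-value methods above, again on a made-up DataFrame. The name df_toy, the placeholder string "unknown", and the choice to fill missing masses with the column mean are assumptions for illustration, not a recommended strategy for any particular dataset.

```python
import pandas as pd

# Hypothetical toy data with one missing value in each column
df_toy = pd.DataFrame({
    "color": ["red", None, "blue", "green"],
    "mass": [0.9, 0.8, float("nan"), 0.7],
})

# Number of missing values per column
null_counts = df_toy.isnull().sum()

# Fill each column with its own replacement: a placeholder label for
# color, and the mean of the observed values for mass
filled = df_toy.fillna({"color": "unknown", "mass": df_toy["mass"].mean()})

# Drop every row that has at least one missing value (the default)
dropped = df_toy.dropna()
```

Both fillna() and dropna() return new DataFrames here, so df_toy still contains its original missing values afterward.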
Practical Example
Let's see a practical example of EDA using the M&Ms dataset, which is small and contains both numeric and object (string) data types:
# Import pandas
import pandas as pd
# Read the data from the website
url_mms = 'https://tinyurl.com/mms-statistics'
df = pd.read_csv(url_mms)
# Look at the shape
df.shape # returns (816, 4)
# Look at the head of the file
df.head()
When we run df.head(), we'll see the first 5 rows of the dataset with columns 'type', 'color', 'diameter', and 'mass'. We can get more information about the dataset with:
# DataFrame information
df.info()
The output will tell us we have 816 entries, 4 columns, and reveal the data types. Let's also check the statistics of the numeric columns:
# DataFrame describe
df.describe()
This shows statistics like count, mean, standard deviation, min/max, and percentiles for numeric columns.
What if we want to count how many of each candy type we have?