Module 2: Wrangle ML Datasets
Module Overview
In this module, you'll learn essential techniques for wrangling datasets for machine learning. Data preparation is a critical step in the machine learning workflow, often taking up to 80% of a data scientist's time. You'll explore methods for data cleaning, exploration, and joining relational data to create meaningful feature sets for your models.
Learning Objectives
- explore tabular data for supervised machine learning
- join relational data for supervised machine learning
Objective 01 - Explore tabular data for supervised machine learning
Overview
In this module, we'll continue with our previous theme of learning to better understand our data as we prepare it for modeling. In this objective, we'll explore a small data set and get some additional practice in basic cleaning, identifying columns that might be suitable for feature engineering, and generally preparing the data for machine learning tasks.
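Before loading the example data below, it helps to recall the handful of pandas calls that are usually worth running on any new tabular data set. This is a generic, notebook-style sketch, assuming a DataFrame named df has already been loaded; treat it as a starting checklist rather than a fixed recipe.
# Quick first look at a new tabular data set (assumes a DataFrame named df)
print(df.shape)           # number of rows and columns
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # missing values per column
print(df.nunique())       # cardinality of each column (candidates for grouping)
print(df.describe())      # summary statistics for the numeric columns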
Follow Along
The data we'll use in the following example is from the USGS Earthquake Catalog and the PNSN Earthquake catalog. We'll load in this data and go through a few processing steps to prepare it for some further analysis.
# Load in earthquake data sets
import pandas as pd
cols_set1 = ['Evid', 'Magnitude', 'Magnitude Type', 'Time UTC', 'Lat', 'Lon', 'Depth Km']
eq_set1 = pd.read_csv('pnsn_eqlist.csv', usecols=cols_set1)
cols_set2 = ['time', 'latitude', 'longitude', 'depth', 'mag', 'magType', 'net', 'id']
eq_set2 = pd.read_csv('usgs_eqlist.csv', usecols=cols_set2)
display(eq_set1.head())
display(eq_set2.head())
| | Evid | Magnitude | Magnitude Type | Time UTC | Lat | Lon | Depth Km |
|---|---|---|---|---|---|---|---|
| 0 | 61647142 | 2.5 | l | 2020/07/11 00:16:31 | 45.8433 | -126.3158 | 75.4 |
| 1 | 61656881 | 2.6 | l | 2020/07/09 13:46:58 | 42.6240 | -127.2607 | 26.8 |
| 2 | 61653226 | 2.9 | l | 2020/07/04 12:43:18 | 43.0163 | -127.4387 | 20.8 |
| 3 | 61653141 | 2.1 | l | 2020/07/04 07:56:07 | 48.8980 | -121.9658 | 0.6 |
| 4 | 61652566 | 3.3 | l | 2020/07/03 13:34:27 | 42.0197 | -126.3753 | 10.0 |
| | time | latitude | longitude | depth | mag | magType | net | id |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020-07-11T21:17:57.180Z | 49.386000 | -120.536500 | -0.37 | 2.08 | ml | uw | uw61647417 |
| 1 | 2020-07-10T13:38:32.510Z | 42.317333 | -121.799000 | -1.45 | 2.05 | ml | uw | uw61646707 |
| 2 | 2020-07-04T12:43:17.982Z | 43.222800 | -126.950900 | 10.00 | 2.90 | ml | us | us7000ahx7 |
| 3 | 2020-07-04T07:56:07.610Z | 48.895500 | -121.966833 | 0.30 | 2.14 | ml | uw | uw61653141 |
| 4 | 2020-07-03T13:34:25.483Z | 42.037900 | -126.077500 | 10.00 | 3.20 | ml | us | us7000ahdg |
We can see that each list of earthquakes has an identification column (Evid and id). If we looked at more rows, we would see that there is some overlap. In order to combine these two tables into one, we need to do a bit of cleaning so that we can compare the identification columns.
# Clean up eq_set2 id column by stripping the leading network code
# ('uw' or 'us') so only the event number remains. Note that lstrip
# removes any leading characters from the set {'u', 'w', 's'}, which is
# fine here because the rest of each id begins with a digit.
eq_set2['id'] = eq_set2['id'].map(lambda x: x.lstrip('uws'))
# Add column with the network code
eq_set1['net'] = 'uw'
# Rename columns
new_cols = ['id', 'mag','magType','time','latitude','longitude','depth','net']
eq_set1.columns = new_cols
eq_set1.head()
| | id | mag | magType | time | latitude | longitude | depth | net |
|---|---|---|---|---|---|---|---|---|
| 0 | 61647142 | 2.5 | l | 2020/07/11 00:16:31 | 45.8433 | -126.3158 | 75.4 | uw |
| 1 | 61656881 | 2.6 | l | 2020/07/09 13:46:58 | 42.6240 | -127.2607 | 26.8 | uw |
| 2 | 61653226 | 2.9 | l | 2020/07/04 12:43:18 | 43.0163 | -127.4387 | 20.8 | uw |
| 3 | 61653141 | 2.1 | l | 2020/07/04 07:56:07 | 48.8980 | -121.9658 | 0.6 | uw |
| 4 | 61652566 | 3.3 | l | 2020/07/03 13:34:27 | 42.0197 | -126.3753 | 10.0 | uw |
We also need to check if the id column is of the same type for both DataFrames and change it if it's not.
# Check data types
print('Data type for eq_set1: ', eq_set1['id'].dtype)
print('Data type for eq_set2: ', eq_set2['id'].dtype)
# Change data type
eq_set1['id'] = eq_set1['id'].astype('str')
print('Data type for eq_set1: ', eq_set1['id'].dtype)
Data type for eq_set1: int64
Data type for eq_set2: object
Data type for eq_set1: object
Now we are ready to combine these DataFrames and continue on with our analysis.
Challenge
Using one of your own data sets, write down a list of what you need to do to the data before it's ready for a machine learning model. Do you need to do extensive cleaning? Group columns with high cardinality into a smaller number of classes? Eventually combine the individual files or DataFrames?
All of these steps are important and often take up most of our time. The modeling process will be faster (and more accurate) with properly prepared data.
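The cardinality-reduction step mentioned above can be as simple as keeping the most frequent categories and lumping everything else into an "Other" bucket. Below is a minimal sketch with a made-up column name ('city'); it is not part of the earthquake example, just an illustration of the pattern.
# Hypothetical example: reduce a high-cardinality column to its top categories
import pandas as pd
df = pd.DataFrame({'city': ['Seattle', 'Portland', 'Tacoma', 'Eugene',
                            'Seattle', 'Portland', 'Seattle', 'Bend']})
# Keep the two most frequent categories and group the rest as 'Other'
top_categories = df['city'].value_counts().nlargest(2).index
df['city_grouped'] = df['city'].where(df['city'].isin(top_categories), 'Other')
print(df['city_grouped'].value_counts())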
Additional Resources
Objective 02 - Join relational data for supervised machine learning
Overview
At this point, we have explored the earthquake data set, cleaned it up a little bit, and prepared to combine it into one file. In the next section, we'll combine the two DataFrames and, in the process, review some of the techniques we learned in previous Sprints.
Follow Along
In the following section, we'll reproduce the cleaning and preparation process from the previous objective and continue on with combining our DataFrames. After they're combined, we'll take a look and make sure there are no duplicates.
# Load in earthquake data sets
import pandas as pd
cols_set1 = ['Evid', 'Magnitude', 'Magnitude Type', 'Time UTC', 'Lat', 'Lon', 'Depth Km']
eq_set1 = pd.read_csv('pnsn_eqlist.csv', usecols=cols_set1)
cols_set2 = ['time', 'latitude', 'longitude', 'depth', 'mag', 'magType', 'net', 'id']
eq_set2 = pd.read_csv('usgs_eqlist.csv', usecols=cols_set2)
# Clean up eq_set2 id column
# (strip the 'uw' or 'us' from the number)
eq_set2['id'] = eq_set2['id'].map(lambda x: x.lstrip('uws'))
# Add column with the network code
eq_set1['net'] = 'uw'
# Rename columns
new_cols = ['id', 'mag','magType','time','latitude','longitude','depth','net']
eq_set1.columns = new_cols
# Change data type
eq_set1['id'] = eq_set1['id'].astype('str')
# Take a look at our work
display(eq_set1.head())
display(eq_set2.head())
| | id | mag | magType | time | latitude | longitude | depth | net |
|---|---|---|---|---|---|---|---|---|
| 0 | 61647142 | 2.5 | l | 2020/07/11 00:16:31 | 45.8433 | -126.3158 | 75.4 | uw |
| 1 | 61656881 | 2.6 | l | 2020/07/09 13:46:58 | 42.6240 | -127.2607 | 26.8 | uw |
| 2 | 61653226 | 2.9 | l | 2020/07/04 12:43:18 | 43.0163 | -127.4387 | 20.8 | uw |
| 3 | 61653141 | 2.1 | l | 2020/07/04 07:56:07 | 48.8980 | -121.9658 | 0.6 | uw |
| 4 | 61652566 | 3.3 | l | 2020/07/03 13:34:27 | 42.0197 | -126.3753 | 10.0 | uw |
| | time | latitude | longitude | depth | mag | magType | net | id |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020-07-11T21:17:57.180Z | 49.386000 | -120.536500 | -0.37 | 2.08 | ml | uw | 61647417 |
| 1 | 2020-07-10T13:38:32.510Z | 42.317333 | -121.799000 | -1.45 | 2.05 | ml | uw | 61646707 |
| 2 | 2020-07-04T12:43:17.982Z | 43.222800 | -126.950900 | 10.00 | 2.90 | ml | us | 7000ahx7 |
| 3 | 2020-07-04T07:56:07.610Z | 48.895500 | -121.966833 | 0.30 | 2.14 | ml | uw | 61653141 |
| 4 | 2020-07-03T13:34:25.483Z | 42.037900 | -126.077500 | 10.00 | 3.20 | ml | us | 7000ahdg |
Now we can combine the two sets. First, we need to think about how to join the DataFrames. In general, we should keep all of the data if possible. In this case, the columns in each DataFrame are the same, so we're essentially concatenating the two DataFrames. After we do that, we'll check for duplication in the event id and remove any duplicates.
# Concatenate eq_set1 and eq_set2
eq_set_all = pd.concat([eq_set1, eq_set2])
print('Size of DataFrame before dropping duplicates: ', eq_set_all.shape)
# Remove duplicate event id numbers
eq_set_all.drop_duplicates(subset='id', inplace=True)
print('Size of DataFrame after dropping duplicates: ', eq_set_all.shape)
Size of DataFrame before dropping duplicates: (68, 8)
Size of DataFrame after dropping duplicates: (50, 8)
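Notice that the time column of the combined DataFrame still mixes two timestamp formats (the PNSN 2020/07/11 00:16:31 style and the USGS ISO style). Below is a small follow-up sketch of unifying them, assuming pandas 2.x where format='mixed' is available; with an older pandas version you could instead convert each source's time column before concatenating.
# Parse the mixed timestamp formats into a single UTC datetime column
# (format='mixed' parses each entry individually; requires pandas >= 2.0)
eq_set_all['time'] = pd.to_datetime(eq_set_all['time'], format='mixed', utc=True)
# Sort chronologically so the combined catalog reads as a single event list
eq_set_all = eq_set_all.sort_values('time').reset_index(drop=True)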
Challenge
It's your turn to find some data that you need to combine or join in some way. Dig up two (or more) data sets and identify what you would need to do to combine them into one DataFrame. For more of a challenge, you can load, clean, and combine your data sets. To make this a little easier, you could use a small subset of each to practice on.
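In this module's example, the two tables had the same columns, so a simple concatenation was enough. If your tables instead share a key but carry different columns, a merge is usually the right tool. Below is a minimal sketch with made-up tables and column names, shown only to illustrate the pattern.
# Hypothetical example: two small tables that share a 'station' key
import pandas as pd
readings = pd.DataFrame({'station': ['A', 'B', 'C'],
                         'mag': [2.1, 3.4, 2.8]})
locations = pd.DataFrame({'station': ['A', 'B', 'D'],
                          'latitude': [45.8, 42.6, 47.1]})
# An outer join keeps all rows from both tables; rows without a match get NaN
combined = readings.merge(locations, on='station', how='outer')
print(combined)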
Additional Resources
Guided Project
Open DS_232_guided_project.ipynb in the GitHub repository below to follow along with the guided project:
Guided Project Video
Module Assignment
For this assignment, you'll continue working with your portfolio dataset from Module 1. You'll apply what you've learned to clean, explore, and prepare your data for modeling.
Note: There is no video for this assignment as you will be working with your own dataset and defining your own machine learning problem.