Module 2: Wrangle ML Datasets

Module Overview

In this module, you'll learn essential techniques for wrangling datasets for machine learning. Data preparation is a critical step in the machine learning workflow, often taking up to 80% of a data scientist's time. You'll explore methods for data cleaning, exploration, and joining relational data to create meaningful feature sets for your models.

Learning Objectives

Objective 01 - Explore tabular data for supervised machine learning

Overview

In this module, we'll continue our theme of learning to better understand our data as we prepare it for modeling. In this objective, we'll explore a small dataset and get some additional practice with basic cleaning: identifying columns that might be suitable for feature engineering and generally preparing the data for machine learning tasks.

Follow Along

The data we'll use in the following example comes from the USGS Earthquake Catalog and the PNSN (Pacific Northwest Seismic Network) earthquake catalog. We'll load this data and walk through a few processing steps to prepare it for further analysis.

# Load in earthquake data sets
import pandas as pd

cols_set1 = ['Evid', 'Magnitude', 'Magnitude Type', 'Time UTC', 'Lat', 'Lon', 'Depth Km']
eq_set1 = pd.read_csv('pnsn_eqlist.csv', usecols=cols_set1)

cols_set2 = ['time', 'latitude', 'longitude', 'depth', 'mag', 'magType', 'net', 'id']
eq_set2 = pd.read_csv('usgs_eqlist.csv', usecols=cols_set2)

display(eq_set1.head())
display(eq_set2.head())
Evid Magnitude Magnitude Type Time UTC Lat Lon Depth Km
0 61647142 2.5 l 2020/07/11 00:16:31 45.8433 -126.3158 75.4
1 61656881 2.6 l 2020/07/09 13:46:58 42.6240 -127.2607 26.8
2 61653226 2.9 l 2020/07/04 12:43:18 43.0163 -127.4387 20.8
3 61653141 2.1 l 2020/07/04 07:56:07 48.8980 -121.9658 0.6
4 61652566 3.3 l 2020/07/03 13:34:27 42.0197 -126.3753 10.0
time latitude longitude depth mag magType net id
0 2020-07-11T21:17:57.180Z 49.386000 -120.536500 -0.37 2.08 ml uw uw61647417
1 2020-07-10T13:38:32.510Z 42.317333 -121.799000 -1.45 2.05 ml uw uw61646707
2 2020-07-04T12:43:17.982Z 43.222800 -126.950900 10.00 2.90 ml us us7000ahx7
3 2020-07-04T07:56:07.610Z 48.895500 -121.966833 0.30 2.14 ml uw uw61653141
4 2020-07-03T13:34:25.483Z 42.037900 -126.077500 10.00 3.20 ml us us7000ahdg

We can see that each list of earthquakes has an identification column (Evid and id), and if we looked at more rows, we would see some overlap between the two. To combine these tables into one, we need to do a bit of cleaning so that the identification columns are directly comparable.

# Clean up the eq_set2 id column by stripping the leading network
# code ('uw' or 'us') so only the event number remains. Note that
# lstrip removes any leading characters in the set, not a literal prefix.
eq_set2['id'] = eq_set2['id'].str.lstrip('uws')
# Add column with the network code
eq_set1['net'] = 'uw'

# Rename columns to match eq_set2 (the new names must line up with
# the current column order, including the 'net' column we just added)
new_cols = ['id', 'mag', 'magType', 'time', 'latitude', 'longitude', 'depth', 'net']
eq_set1.columns = new_cols

eq_set1.head()
id mag magType time latitude longitude depth net
0 61647142 2.5 l 2020/07/11 00:16:31 45.8433 -126.3158 75.4 uw
1 61656881 2.6 l 2020/07/09 13:46:58 42.6240 -127.2607 26.8 uw
2 61653226 2.9 l 2020/07/04 12:43:18 43.0163 -127.4387 20.8 uw
3 61653141 2.1 l 2020/07/04 07:56:07 48.8980 -121.9658 0.6 uw
4 61652566 3.3 l 2020/07/03 13:34:27 42.0197 -126.3753 10.0 uw
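A quick caveat about the cleaning step above: str.lstrip removes any leading characters that appear in the given set, not a literal prefix, so it only works here because no event number begins with 'u', 'w', or 's'. If your environment has Python 3.9+ and pandas 1.4+, removeprefix is a safer alternative, and renaming with a dictionary avoids depending on column order. A minimal sketch, not part of the original workflow:

# Safer prefix removal: removeprefix only strips an exact leading string
# (assumes pandas 1.4+ for Series.str.removeprefix)
eq_set2['id'] = eq_set2['id'].str.removeprefix('uw').str.removeprefix('us')

# Renaming with a mapping doesn't depend on column order
eq_set1 = eq_set1.rename(columns={
    'Evid': 'id', 'Magnitude': 'mag', 'Magnitude Type': 'magType',
    'Time UTC': 'time', 'Lat': 'latitude', 'Lon': 'longitude',
    'Depth Km': 'depth',
})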

We also need to check whether the id column has the same data type in both DataFrames, and change it if it doesn't.

# Check data types
print('Data type for eq_set1: ', eq_set1['id'].dtype)
print('Data type for eq_set2: ', eq_set2['id'].dtype)

# Change data type
eq_set1['id'] = eq_set1['id'].astype('str')
print('Data type for eq_set1: ', eq_set1['id'].dtype)
Data type for eq_set1: int64
Data type for eq_set2: object
Data type for eq_set1: object

Now we're ready to combine these DataFrames and continue with our analysis.
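Before combining, it can also be worth quantifying the overlap we noticed earlier. A minimal sketch, using the cleaned id columns:

# Count events that appear in both catalogs
shared_ids = set(eq_set1['id']) & set(eq_set2['id'])
print('Events in both catalogs:', len(shared_ids))

# Or flag the shared events directly on one DataFrame
eq_set2['in_both'] = eq_set2['id'].isin(eq_set1['id'])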

Challenge

Using one of your own datasets, write down a list of what you would need to do to the data before you could train a machine learning model. Do you need to do extensive cleaning? Group columns with high cardinality into a smaller number of classes (see the sketch below)? Eventually combine the individual files or DataFrames?

All of these steps are important and often take up most of our time. The modeling process will be faster, and the results more accurate, with properly prepared data.
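As a worked example of the high-cardinality idea, one common approach is to keep the most frequent categories and collapse everything else into a single class. A minimal sketch, assuming a hypothetical DataFrame df with a categorical city column:

# Keep the 10 most common categories; lump the rest into 'other'
top_categories = df['city'].value_counts().nlargest(10).index
df['city_grouped'] = df['city'].where(df['city'].isin(top_categories), 'other')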

Additional Resources

Objective 02 - Join relational data for supervised machine learning

Overview

At this point, we have explored the earthquake dataset, cleaned it up a little, and prepared it to be combined into one table. In the next section, we'll combine the two DataFrames and, in the process, review some of the techniques we learned in previous Sprints.

Follow Along

In this section, we'll reproduce the cleaning and preparation steps from the previous objective and then combine our DataFrames. After they're combined, we'll check to make sure there are no duplicates.


# Load in earthquake data sets
import pandas as pd

cols_set1 = ['Evid', 'Magnitude', 'Magnitude Type', 'Time UTC', 'Lat', 'Lon', 'Depth Km']
eq_set1 = pd.read_csv('pnsn_eqlist.csv', usecols=cols_set1)

cols_set2 = ['time', 'latitude', 'longitude', 'depth', 'mag', 'magType', 'net', 'id']
eq_set2 = pd.read_csv('usgs_eqlist.csv', usecols=cols_set2)

# Clean up the eq_set2 id column by stripping the leading network
# code ('uw' or 'us') so only the event number remains
eq_set2['id'] = eq_set2['id'].str.lstrip('uws')

# Add column with the network code
eq_set1['net'] = 'uw'

# Rename columns to match eq_set2 (names must line up with the
# current column order, including the new 'net' column)
new_cols = ['id', 'mag', 'magType', 'time', 'latitude', 'longitude', 'depth', 'net']
eq_set1.columns = new_cols

# Change data type
eq_set1['id'] = eq_set1['id'].astype('str')

# Take a look at our work
display(eq_set1.head())
display(eq_set2.head())
id mag magType time latitude longitude depth net
0 61647142 2.5 l 2020/07/11 00:16:31 45.8433 -126.3158 75.4 uw
1 61656881 2.6 l 2020/07/09 13:46:58 42.6240 -127.2607 26.8 uw
2 61653226 2.9 l 2020/07/04 12:43:18 43.0163 -127.4387 20.8 uw
3 61653141 2.1 l 2020/07/04 07:56:07 48.8980 -121.9658 0.6 uw
4 61652566 3.3 l 2020/07/03 13:34:27 42.0197 -126.3753 10.0 uw
time latitude longitude depth mag magType net id
0 2020-07-11T21:17:57.180Z 49.386000 -120.536500 -0.37 2.08 ml uw 61647417
1 2020-07-10T13:38:32.510Z 42.317333 -121.799000 -1.45 2.05 ml uw 61646707
2 2020-07-04T12:43:17.982Z 43.222800 -126.950900 10.00 2.90 ml us 7000ahx7
3 2020-07-04T07:56:07.610Z 48.895500 -121.966833 0.30 2.14 ml uw 61653141
4 2020-07-03T13:34:25.483Z 42.037900 -126.077500 10.00 3.20 ml us 7000ahdg

Now we can combine the two sets. First, we need to think about how to join the DataFrames; in general, we should keep all of the data if possible. In this case, the columns in each DataFrame are the same, so we're essentially concatenating the two DataFrames, stacking one on top of the other. After we do that, we'll check the event id column for duplicates and remove any we find.

# Concatenate eq_set1 and eq_set2, stacking the rows
# (ignore_index=True rebuilds the index so labels aren't duplicated)
eq_set_all = pd.concat([eq_set1, eq_set2], ignore_index=True)
print('Size of DataFrame before dropping duplicates: ', eq_set_all.shape)

# Remove rows with duplicate event id numbers
# (keep='first' is the default, so the eq_set1 version is retained)
eq_set_all = eq_set_all.drop_duplicates(subset='id')
print('Size of DataFrame after dropping duplicates: ', eq_set_all.shape)
Size of DataFrame before dropping duplicates: (68, 8)
Size of DataFrame after dropping duplicates: (50, 8)
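Concatenation worked here because both tables share the same columns. If the two sources had instead carried different columns for the same events, an outer merge on the shared key would be the way to keep all of the data. A sketch for that situation (our two catalogs don't actually differ this way, so the suffixes are only illustrative):

# An outer merge keeps every row from both sides, aligned on the key;
# overlapping column names get the given suffixes
eq_merged = eq_set1.merge(eq_set2, on='id', how='outer', suffixes=('_pnsn', '_usgs'))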

Challenge

It's your turn to find some data that needs to be combined or joined in some way. Dig up two (or more) datasets and identify what you would need to do to combine them into one DataFrame. For more of a challenge, actually load, clean, and combine them; to make this a little easier, you could practice on a small subset of each.

Additional Resources

Guided Project

Open DS_232_guided_project.ipynb in the GitHub repository below to follow along with the guided project:

Guided Project Video

Module Assignment

For this assignment, you'll continue working with your portfolio dataset from Module 1. You'll apply what you've learned to clean, explore, and prepare your data for modeling.

Note: There is no video for this assignment as you will be working with your own dataset and defining your own machine learning problem.

Additional Resources