Module 1: Define ML Problems

Module Overview

In this module, you'll learn how to properly define machine learning problems. This is a crucial first step in any data science project, as a well-defined problem sets the foundation for all subsequent modeling decisions. You'll learn to choose appropriate targets, understand their distributions, and select evaluation metrics that align with your project goals.

Learning Objectives

Objective 01 - Choose a target to predict, and check its distribution

Overview

Up to this point in the course, we have worked with many data sets and fit several different types of models to them. While this has been helpful practice for learning how to select a model, create pipelines, and evaluate the results, we could use some additional experience with data that is generally less "prepared."

Working through the example in this objective will give us a chance to more carefully consider our target. We should look at the distribution of the target and its suitability for modeling: whether the classes are balanced, which type of encoding is appropriate, and how those choices might affect the model.

In the next section, we'll be using a data set from Kaggle. While this data is mostly prepared for machine learning, it's still important to think about the target and what we are trying to model.

Follow Along

This data set is available on Kaggle. It records observations of Australian weather, which we will use to try to predict whether rain occurred on the day following the measurements.

# Import libraries, load data, and view
import pandas as pd
url="https://raw.githubusercontent.com/bloominstituteoftechnology/DS-Unit-2-Kaggle-Challenge/main/data/weather/weatherAUS.csv"
weather=pd.read_csv(url)
weather.head()
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
0 2008-12-01 Albury 13.4 22.9 0.6 NaN NaN W 44.0 W ... 22.0 1007.7 1007.1 8.0 NaN 16.9 21.8 No 0.0 No
1 2008-12-02 Albury 7.4 25.1 0.0 NaN NaN WNW 44.0 NNW ... 25.0 1010.6 1007.8 NaN NaN 17.2 24.3 No 0.0 No
2 2008-12-03 Albury 12.9 25.7 0.0 NaN NaN WSW 46.0 W ... 30.0 1007.6 1008.7 NaN 2.0 21.0 23.2 No 0.0 No
3 2008-12-04 Albury 9.2 28.0 0.0 NaN NaN NE 24.0 SE ... 16.0 1017.6 1012.8 NaN NaN 18.1 26.5 No 1.0 No
4 2008-12-05 Albury 17.5 32.3 1.0 NaN NaN W 41.0 ENE ... 33.0 1010.8 1006.0 7.0 8.0 17.8 29.7 No 0.2 No

5 rows × 24 columns

As we typically do with a data set, we should get some more details. We can use the df.info() method to see how many columns we have, the data types for each of those columns, and how many of those values are non-null.

# Display the info for the weather DataFrame
weather.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142193 entries, 0 to 142192
Data columns (total 24 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           142193 non-null  object 
 1   Location       142193 non-null  object 
 2   MinTemp        141556 non-null  float64
 3   MaxTemp        141871 non-null  float64
 4   Rainfall       140787 non-null  float64
 5   Evaporation    81350 non-null   float64
 6   Sunshine       74377 non-null   float64
 7   WindGustDir    132863 non-null  object 
 8   WindGustSpeed  132923 non-null  float64
 9   WindDir9am     132180 non-null  object 
 10  WindDir3pm     138415 non-null  object 
 11  WindSpeed9am   140845 non-null  float64
 12  WindSpeed3pm   139563 non-null  float64
 13  Humidity9am    140419 non-null  float64
 14  Humidity3pm    138583 non-null  float64
 15  Pressure9am    128179 non-null  float64
 16  Pressure3pm    128212 non-null  float64
 17  Cloud9am       88536 non-null   float64
 18  Cloud3pm       85099 non-null   float64
 19  Temp9am        141289 non-null  float64
 20  Temp3pm        139467 non-null  float64
 21  RainToday      140787 non-null  object 
 22  RISK_MM        142193 non-null  float64
 23  RainTomorrow   142193 non-null  object 
dtypes: float64(17), object(7)
memory usage: 26.0+ MB

There are definitely missing values in some of the columns. There is also a Date column, but since it is stored as an object type, it will need to be converted to a datetime object. Additionally, we have quite a few categorical variables that will need to be label- or one-hot encoded. Finally, and most importantly, we need to figure out what we are trying to predict from this data set!

The variables in this data set relate to measurements of the weather such as temperature, wind speed and direction, atmospheric pressure, and if there was rain on that current date. The feature that suggests something like a prediction is RainTomorrow. This column contains an object data type and when we look at the column values in more detail, we can see the values are categorical ('Yes' and 'No').

Let's take a look at our potential target and how the 'Yes' and 'No' classes are distributed.

# Look at the distribution of the 'RainTomorrow' column
weather['RainTomorrow'].value_counts()
No     110316
Yes     31877
Name: RainTomorrow, dtype: int64
# Look at the normalized distribution of the 'RainTomorrow' column
weather['RainTomorrow'].value_counts(normalize=True)
No     0.775819
Yes    0.224181
Name: RainTomorrow, dtype: float64

This target has two somewhat imbalanced classes, with about 78% 'No' and 22% 'Yes', but it will work fine as an example classification task. When we have imbalanced data, there are different ways to address possible problems, including using different metrics to evaluate the model. The topic of imbalanced data will likely come up (if it hasn't already!) as you work through both the Guided Projects and the Module Projects.
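
Because of this imbalance, a model that always predicts 'No' would already be about 78% accurate. Below is a minimal sketch (using the weather DataFrame loaded above) of that majority-class baseline, which is a useful reference point when judging any metric on imbalanced data.

# Majority-class baseline: the accuracy of always predicting the most common class
baseline_accuracy = weather['RainTomorrow'].value_counts(normalize=True).max()
print(f'Baseline accuracy from always predicting "No": {baseline_accuracy:.3f}')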

Challenge

For the Module Project, you will source your own data set and fit a model. As you are searching for something to work with, take a few minutes to look at each data set you come across and think about what the target variable would be. One stipulation for this exercise: try to find a data set where the target is not already specified. You don't need to perform any analysis right now other than viewing the data and making some effort to understand the features.

Additional Resources

Data Science Is Not Taught At Universities - And Here Is Why

Objective 02 - Avoid leakage of information from test to train or from target to features

Overview

We briefly introduced the concept of data leakage in the previous sprint when we discussed using pipelines for preprocessing and model fitting. In general, and especially if you are using cross-validation, it's good to be conscious of when, where, and why data leakage can occur. This module is focused on learning how to work with data that isn't already prepared for modeling. Not only do we need to know which features to use and whether our target is appropriate, but we also need to guard against information leaking from the test set into training, or from the target into our features.

The two main types of leakage are leaky features (predictors) and a leaky validation or testing process.

Leaky Features

This type of leakage occurs when a feature has access to information that won't be available when you actually use the model on new data (outside of the test set) to make predictions. This can happen if the values in that feature were adjusted after the values in your target array were determined.

For example, if we were predicting if someone has heart disease (True/False) and used a feature called BP_meds (indicating if the individual is taking blood pressure medication), we might have a problem. If someone is taking this medication, it might be because they have heart disease and are being treated. Moreover, the value in this column could have been changed after they were diagnosed with heart disease.
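
To make this concrete, here is a small sketch with clearly made-up values: if a feature like BP_meds was recorded after diagnosis, it will track the target almost perfectly, and a quick correlation check can surface that red flag.

# Toy illustration (made-up values): a post-diagnosis feature correlates
# almost perfectly with the target, which is a warning sign for leakage
import pandas as pd

toy = pd.DataFrame({
    'age':           [45, 60, 52, 70, 38, 49],
    'BP_meds':       [0, 1, 0, 1, 0, 1],
    'heart_disease': [0, 1, 0, 1, 0, 1],
})

correlations = toy.corr()['heart_disease'].drop('heart_disease')
print(correlations.abs().sort_values(ascending=False))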

Leaky Testing Process

The other type of leak can happen when your training process "learns" from the validation or test data. If you are preprocessing the data, such as filling in missing values with SimpleImputer or standardizing values with StandardScaler, you might accidentally fit those transformers on the entire data set. Instead, fit the preprocessing steps on the training data only and then apply them to the test data; this prevents information from the test set from leaking into the training process.
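
As a quick illustration, here is a minimal sketch of the safe pattern on synthetic data: the scaler is fit on the training split only, and the statistics it learned are then reused to transform the test split. (Wrapping these steps in a Pipeline, as we did in the previous sprint, applies this pattern automatically during cross-validation.)

# Fit preprocessing on the training data only, then apply it to the test data
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(42).normal(size=(100, 3))
X_train, X_test = train_test_split(X, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training data only
X_test_scaled = scaler.transform(X_test)        # no refitting on the test data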

Now that we are more familiar with these two different types of data leakage, let's explore our real-world weather data from the previous objective.

Follow Along

We introduced the Australian weather data set earlier and explored it briefly to decide on the prediction target: whether it was going to rain on the day following the measurements. Let's look more closely at each of the features (predictors) to see if any of them could present leakage problems.

#Import libraries, load data, and view
import pandas as pd
url="https://raw.githubusercontent.com/bloominstituteoftechnology/DS-Unit-2-Kaggle-Challenge/main/data/weather/weatherAUS.csv"
weather=pd.read_csv(url)
weather.head()
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
0 2008-12-01 Albury 13.4 22.9 0.6 NaN NaN W 44.0 W ... 22.0 1007.7 1007.1 8.0 NaN 16.9 21.8 No 0.0 No
1 2008-12-02 Albury 7.4 25.1 0.0 NaN NaN WNW 44.0 NNW ... 25.0 1010.6 1007.8 NaN NaN 17.2 24.3 No 0.0 No
2 2008-12-03 Albury 12.9 25.7 0.0 NaN NaN WSW 46.0 W ... 30.0 1007.6 1008.7 NaN 2.0 21.0 23.2 No 0.0 No
3 2008-12-04 Albury 9.2 28.0 0.0 NaN NaN NE 24.0 SE ... 16.0 1017.6 1012.8 NaN NaN 18.1 26.5 No 1.0 No
4 2008-12-05 Albury 17.5 32.3 1.0 NaN NaN W 41.0 ENE ... 33.0 1010.8 1006.0 7.0 8.0 17.8 29.7 No 0.2 No

5 rows × 24 columns

Before we identify any possible "leaky features," we should decide which features to use and which preprocessing steps are necessary. Let's look at each type of variable (numeric and categorical) in more detail.

# Look at the statistics of categorical variables 
weather.describe(include=['object'])
Date Location WindGustDir WindDir9am WindDir3pm RainToday RainTomorrow
count 142193 142193 132863 132180 138415 140787 142193
unique 3436 49 16 16 16 2 2
top 2013-10-08 Canberra W N SE No No
freq 49 3418 9780 11393 10663 109332 110316
# Look at the statistics of the numeric variables 
weather.describe()
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RISK_MM
count 141556.000000 141871.000000 140787.000000 81350.000000 74377.000000 132923.000000 140845.000000 139563.000000 140419.000000 138583.000000 128179.000000 128212.000000 88536.000000 85099.000000 141289.000000 139467.000000 142193.000000
mean 12.186400 23.226784 2.349974 5.469824 7.624853 39.984292 14.001988 18.637576 68.843810 51.482606 1017.653758 1015.258204 4.437189 4.503167 16.987509 21.687235 2.360682
std 6.403283 7.117618 8.465173 4.188537 3.781525 13.588801 8.893337 8.803345 19.051293 20.797772 7.105476 7.036677 2.887016 2.720633 6.492838 6.937594 8.477969
min -8.500000 -4.800000 0.000000 0.000000 0.000000 6.000000 0.000000 0.000000 0.000000 0.000000 980.500000 977.100000 0.000000 0.000000 -7.200000 -5.400000 0.000000
25% 7.600000 17.900000 0.000000 2.600000 4.900000 31.000000 7.000000 13.000000 57.000000 37.000000 1012.900000 1010.400000 1.000000 2.000000 12.300000 16.600000 0.000000
50% 12.000000 22.600000 0.000000 4.800000 8.500000 39.000000 13.000000 19.000000 70.000000 52.000000 1017.600000 1015.200000 5.000000 5.000000 16.700000 21.100000 0.000000
75% 16.800000 28.200000 0.800000 7.400000 10.600000 48.000000 19.000000 24.000000 83.000000 66.000000 1022.400000 1020.000000 7.000000 7.000000 21.600000 26.400000 0.800000
max 33.900000 48.100000 371.000000 145.000000 14.500000 135.000000 130.000000 87.000000 100.000000 100.000000 1041.000000 1039.600000 9.000000 9.000000 40.200000 46.700000 371.000000

Data Exploration: Null Values

From the above DataFrame descriptions, we can see that there are a lot of null values in some of the columns. Let's take a more detailed look at how many values are missing in which columns. The plot below is created with the missingno library.

# Checking for null values
weather.isnull().sum()
Date                 0
Location             0
MinTemp            637
MaxTemp            322
Rainfall          1406
Evaporation      60843
Sunshine         67816
WindGustDir       9330
WindGustSpeed     9270
WindDir9am       10013
WindDir3pm        3778
WindSpeed9am      1348
WindSpeed3pm      2630
Humidity9am       1774
Humidity3pm       3610
Pressure9am      14014
Pressure3pm      13981
Cloud9am         53657
Cloud3pm         57094
Temp9am            904
Temp3pm           2726
RainToday         1406
RISK_MM              0
RainTomorrow         0
import matplotlib.pyplot as plt
import missingno as msno
msno.matrix(weather)

plt.show()

mod1_obj1_missingNA.png

We have four columns with a large number of null values. If we were doing this analysis for a competition (or for an actual data science job!) we would want to more carefully explore the missing values. Since these columns are missing about 40% of their data (or more), we're going to drop them for this analysis. For the other missing values, we'll use an Imputer in the preprocessing step.

To simplify the analysis for later, we'll also drop the Location column. Again, this information might be important for a more detailed model, but we're trying to keep this process simple so that we can focus on identifying the leaky features.

# Drop columns with high-percentage of missing values
cols_drop = ['Location', 'Evaporation', 'Sunshine', 'Cloud9am', 'Cloud3pm']
weather_drop = weather.drop(cols_drop, axis=1)

Data Cleaning: Datetime

We have a date column which could be converted to a datetime object. We will use only the 'month' value from this column in our model, as the full date would be too specific.

# Convert the 'Date' column to datetime, extract month
weather_drop['Date'] = pd.to_datetime(weather_drop['Date']).dt.month
weather_drop.head()

Data Processing: Pipeline

We'll separate our features into numeric and categorical types and then perform the transformation steps. Several of the numeric features are on very different scales, so we will standardize those values. We will also impute missing values with SimpleImputer(). The categorical features will be ordinally encoded with OrdinalEncoder().

# Print the column names
weather_drop.columns
# Imports
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Define the numeric features
numeric_features = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 
                    'WindSpeed9am','WindSpeed3pm', 'Humidity9am', 
                    'Humidity3pm', 'Pressure9am','Pressure3pm', 
                    'Temp9am', 'Temp3pm', 'RISK_MM']

# Create the transformer (impute, scale)
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define the categorical features
categorical_features = ['WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ordinal', OrdinalEncoder())])

# Define how the numeric and categorical features will be transformed
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Define the pipeline steps, including the classifier
clf = Pipeline(steps=[('preprocessor', preprocessor),
                  ('classifier', DecisionTreeClassifier())])

Create Feature Matrix, Target Array

We have a couple of final steps before we fit the model: create the feature matrix and then create and encode the target array.

# Create the feature matrix 
X = weather_drop.drop('RainTomorrow', axis=1)

# Create and encode the target array
from sklearn.preprocessing import LabelEncoder
label_enc = LabelEncoder()
y=label_enc.fit_transform(weather_drop['RainTomorrow'])
# Import the train_test_split utility
from sklearn.model_selection import train_test_split

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Fit the model
clf.fit(X_train,y_train)
print('Validation Accuracy', clf.score(X_test, y_test))

Wow! We achieved 100% accuracy. Is this too good to be true? Yes: whenever a model achieves near-perfect accuracy, you should suspect a problem, and that problem is often data leakage of some kind. Let's look at the feature importances to see where the problem is.

# Features (order in which they were preprocessed)
features_order = numeric_features + categorical_features

# Determine the importances
importances = pd.Series(clf.steps[1][1].feature_importances_, features_order)
# Plot feature importances
import matplotlib.pyplot as plt

n = 7
plt.figure(figsize=(10,n/2))
plt.title(f'Top {n} features')
importances.sort_values()[-n:].plot.barh(color='grey')

plt.show()

mod1_obj1_top7feature_leaky.png

It looks like the model was essentially fit on a single feature, which suggests that feature is directly tied to the target. Spoiler: one of the features is leaking information to the model. It is the RISK_MM column, which records how much rain was measured on the following day; that is essentially the same information the target encodes.

We'll remove this column, run the model again, and recalculate the feature importances.

# Remove the 'RISK_MM' column
X_noriskmm = X.drop('RISK_MM', axis=1)

# Create the new training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_noriskmm, y, test_size=0.2, random_state=42)

# Drop 'RISK_MM' from the numeric_features list
# (list.remove works in place and returns None, so don't assign its result)
numeric_features.remove('RISK_MM')

# Fit the model
clf.fit(X_train,y_train)
print('Validation Accuracy (with no "RISK_MM")', clf.score(X_test, y_test))

That's better! The accuracy is still high, but much more reasonable.

# Get feature importances

# Features (order in which they were preprocessed)
numeric_features = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 
                    'WindSpeed9am','WindSpeed3pm', 'Humidity9am', 
                    'Humidity3pm', 'Pressure9am','Pressure3pm', 
                    'Temp9am', 'Temp3pm']

categorical_features = ['WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']
features_order = numeric_features + categorical_features

importances = pd.Series(clf.steps[1][1].feature_importances_, features_order)
# Plot feature importances

n = 7
plt.figure(figsize=(10,n/2))
plt.title(f'Top {n} features')
importances.sort_values()[-n:].plot.barh(color='grey')

plt.show()

mod1_obj1_top7feature_NOleaky.png

Challenge

In the above example, we removed the RISK_MM column. However, this removes information that we might want to use in a model. For this challenge, think of a way you could group the values in this column and use the result as the target for a model. Can we predict how much rain fell instead of just making a yes/no prediction?
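
One possible starting point is sketched below, assuming the weather DataFrame from earlier is still loaded; the bin edges are illustrative choices, not prescribed by the lesson.

# Bin next-day rainfall (RISK_MM, in mm) into ordered categories that could
# serve as a multi-class target instead of the yes/no RainTomorrow column
rain_amount = pd.cut(weather['RISK_MM'],
                     bins=[-0.1, 0.0, 1.0, 10.0, weather['RISK_MM'].max()],
                     labels=['none', 'trace', 'light', 'heavy'])
print(rain_amount.value_counts())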

Additional Resources

Objective 03 - Choose an appropriate evaluation metric

Overview

Up to this point in the course we've fit several different models: linear regression, logistic regression, decision tree, and random forests. One concept that we haven't covered extensively is how to choose the metric by which we evaluate our models.

Because it's important, the following information is likely to be repeated during the Guided Project:

Classification & regression metrics are different!

Let's look at each type of task and the associated metrics.

Classification Tasks

For classification tasks, we can use metrics such as accuracy, precision, recall, the F1 score, and the area under the receiver operating characteristic (ROC) curve. Which metric is most appropriate depends on the problem, for example on whether the classes are balanced and on whether false positives or false negatives are more costly.

Regression Tasks

Generally, regression models are scored by the R squared value. It is the proportion of the variance in the dependent variable (y) that is predictable from the independent variable(s) (X).
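
As a quick illustration, the sketch below fits a linear regression on synthetic data and scores it with R squared; the data set and model here are purely illustrative.

# Fit a linear regression on synthetic data and score it with R^2
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X_reg, y_reg = make_regression(n_samples=500, n_features=3, noise=10, random_state=42)
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_reg, y_reg, random_state=42)

reg = LinearRegression().fit(X_train_r, y_train_r)
print('R^2 on the test set:', r2_score(y_test_r, reg.predict(X_test_r)))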

We'll use the iris data set here to do a basic evaluation metric exercise.

Follow Along

Let's load in the data, create the training and test sets, fit the model, and, finally, evaluate our model.

# Load in libraries, data
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split

# Create X, y and training/test sets
iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

# Import the classifier
from sklearn.tree import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train,y_train)

print('Validation Accuracy: ', dt_classifier.score(X_test, y_test))
Validation Accuracy:  0.9666666666666667

This model has a pretty high accuracy, so let's look at a few other metrics which might give us a better idea of how the model is fitting the data. First, we'll create a visualization of the confusion matrix.

# Import plotting tools and the confusion matrix display
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(dt_classifier, X_test, y_test) 
plt.show()

label_map.png

We can see from the confusion matrix that very few of the observations are being misclassified, so the high accuracy is a fair reflection of how the model fits this data.

We can also look at the classification report, which shows the precision, recall, and the F1-score.

# Create the classification report
y_pred = dt_classifier.predict(X_test)
print(metrics.classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        23
           1       0.95      0.95      0.95        19
           2       0.94      0.94      0.94        18

    accuracy                           0.97        60
   macro avg       0.96      0.96      0.96        60
weighted avg       0.97      0.97      0.97        60

Challenge

For this challenge, think about a data set you have worked with that you haven't yet evaluated. Which metric should you use? Do you understand how the model is being evaluated?

Additional Resources

Objective 04 - Use the classification metric ROC AUC to interpret a classifier model

Overview

In the previous sprint, we examined the probability threshold a classifier uses when determining the class to which an observation belongs. We can extend this concept and look at something called the receiver operating characteristic, which is usually plotted as a curve and called the ROC curve.

First, we'll go back to the idea of counting true positives and true negatives and look at two related measurements: the true positive rate (TPR) and the false positive rate (FPR).

 \text{TPR} = \frac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}}

 \text{FPR} = \frac{\text{False Positives}}{\text{False Positives}+\text{True Negatives}}

Both of the above measurements are normalized counts: the TPR divides the true positives by the total number of actual positives, and the FPR divides the false positives by the total number of actual negatives.
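
To make these definitions concrete, here is a small worked sketch with made-up labels and predictions evaluated at a single threshold.

# Compute TPR and FPR from a confusion matrix (made-up labels and predictions)
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f'TPR = {tp / (tp + fn):.2f}')  # 3 of the 4 actual positives were found
print(f'FPR = {fp / (fp + tn):.2f}')  # 1 of the 4 actual negatives was mislabeled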

When we create a ROC curve, we are plotting the TPR against the FPR for a range of threshold values. In the next section, we'll use the scikit-learn roc_curve() method to do the calculations for us. From the resulting data, we'll create a plot.

Follow Along

In order to plot a ROC curve, we need some data and a classifier model fit to that data. Let's generate a synthetic classification data set and then build the ROC curve.

# Load modules
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

# Create the data (feature, target)
X, y = make_classification(n_samples=10000, n_features=5,
                          n_classes=2, n_informative=3,
                          random_state=42)

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Create and fit the model
logreg_classifier = LogisticRegression().fit(X_train, y_train)

# Create predicted probabilities
y_pred_prob = logreg_classifier.predict_proba(X_test)[:,1]
# Create the data for the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# See the results in a table
roccurve_df = pd.DataFrame({
    'False Positive Rate': fpr, 
    'True Positive Rate': tpr, 
    'Threshold': thresholds
})

roccurve_df.head()
False Positive Rate True Positive Rate Threshold
0 0.000000 0.000000 1.999969
1 0.000000 0.000786 0.999969
2 0.000000 0.291438 0.983222
3 0.000815 0.291438 0.983049
4 0.000815 0.360566 0.970583
# Plot the ROC curve
import matplotlib.pyplot as plt

plt.plot(fpr, tpr)
plt.plot([0,1], ls='--')
plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

plt.show()

mod4_obj4_ROC.png

The above model looks pretty good. In general, the better a model, the higher the curve is, and the greater the area under the curve (AUC). The maximum value for the AUC is equal to one. While we can "eyeball" the area in our curve, there is also a tool used to calculate the AUC.

# Calculate the area under the curve
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, y_pred_prob)
0.9419681927513379

Challenge

Using a different data set for classification, see if you can construct the ROC curve. Or with the same data set generated above, try using a different classifier such as a decision tree and plot the ROC curve and calculate the AUC. Which model performs better?
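
Here is a sketch of the second part of the challenge, reusing the X_train, X_test, y_train, y_test, and y_pred_prob objects defined above; the max_depth value is an arbitrary illustrative choice.

# Fit a decision tree on the same synthetic data and compare AUC scores
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
tree_probs = tree.predict_proba(X_test)[:, 1]

print('Decision tree AUC:      ', roc_auc_score(y_test, tree_probs))
print('Logistic regression AUC:', roc_auc_score(y_test, y_pred_prob))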

Additional Resources

Guided Project

Open DS_231_guided_project.ipynb in the GitHub repository below to follow along with the guided project:

Guided Project Video

Module Assignment

For this assignment, you'll apply what you've learned to your own portfolio dataset. This hands-on experience will solidify your understanding of the concepts and prepare you for real-world machine learning tasks.

Note: There is no video solution for this assignment as you will be working with your own dataset and defining your own machine learning problem.

Additional Resources