Module 1: Define ML Problems

Module Overview

In this module, you'll learn how to properly define machine learning problems. This is a crucial first step in any data science project, as a well-defined problem sets the foundation for all subsequent modeling decisions. You'll learn to choose appropriate targets, understand their distributions, and select evaluation metrics that align with your project goals.

Learning Objectives

Objective 01 - Choose a target to predict, and check its distribution

Overview

Up to this point in the course, we have worked with many data sets and fit several different types of models to them. While this has been helpful practice for learning how to select a model, create pipelines, and evaluate the results, we could use some additional experience with data that is generally less "prepared."

Working through the example in this objective will give us a chance to more carefully consider our target. We should look at the distribution of the target and its suitability for modeling: whether the classes are balanced, which type of encoding is appropriate, and how those choices might affect the model.

In the next section, we'll be using a data set from Kaggle. While this data is mostly prepared for machine learning, it's still important to think about the target and what we are trying to model.

Follow Along

This data set is available on Kaggle. It records observations of Australian weather, which we will use to try to predict whether rain occurred on the day following the measurements.

# Import libraries, load data, and view
import pandas as pd
url="https://raw.githubusercontent.com/bloominstituteoftechnology/DS-Unit-2-Kaggle-Challenge/main/data/weather/weatherAUS.csv"
weather=pd.read_csv(url)
weather.head()
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
0 2008-12-01 Albury 13.4 22.9 0.6 NaN NaN W 44.0 W ... 22.0 1007.7 1007.1 8.0 NaN 16.9 21.8 No 0.0 No
1 2008-12-02 Albury 7.4 25.1 0.0 NaN NaN WNW 44.0 NNW ... 25.0 1010.6 1007.8 NaN NaN 17.2 24.3 No 0.0 No
2 2008-12-03 Albury 12.9 25.7 0.0 NaN NaN WSW 46.0 W ... 30.0 1007.6 1008.7 NaN 2.0 21.0 23.2 No 0.0 No
3 2008-12-04 Albury 9.2 28.0 0.0 NaN NaN NE 24.0 SE ... 16.0 1017.6 1012.8 NaN NaN 18.1 26.5 No 1.0 No
4 2008-12-05 Albury 17.5 32.3 1.0 NaN NaN W 41.0 ENE ... 33.0 1010.8 1006.0 7.0 8.0 17.8 29.7 No 0.2 No

5 rows × 24 columns

As we typically do with a data set, we should get some more details. We can use the df.info() method to see how many columns we have, the data types for each of those columns, and how many of those values are non-null.

# Display the info for the weather DataFrame
weather.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142193 entries, 0 to 142192
Data columns (total 24 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           142193 non-null  object 
 1   Location       142193 non-null  object 
 2   MinTemp        141556 non-null  float64
 3   MaxTemp        141871 non-null  float64
 4   Rainfall       140787 non-null  float64
 5   Evaporation    81350 non-null   float64
 6   Sunshine       74377 non-null   float64
 7   WindGustDir    132863 non-null  object 
 8   WindGustSpeed  132923 non-null  float64
 9   WindDir9am     132180 non-null  object 
 10  WindDir3pm     138415 non-null  object 
 11  WindSpeed9am   140845 non-null  float64
 12  WindSpeed3pm   139563 non-null  float64
 13  Humidity9am    140419 non-null  float64
 14  Humidity3pm    138583 non-null  float64
 15  Pressure9am    128179 non-null  float64
 16  Pressure3pm    128212 non-null  float64
 17  Cloud9am       88536 non-null   float64
 18  Cloud3pm       85099 non-null   float64
 19  Temp9am        141289 non-null  float64
 20  Temp3pm        139467 non-null  float64
 21  RainToday      140787 non-null  object 
 22  RISK_MM        142193 non-null  float64
 23  RainTomorrow   142193 non-null  object 
dtypes: float64(17), object(7)
memory usage: 26.0+ MB

There are definitely missing values in some of the columns. There is also a Date column, but since it is stored as an object type, it will need to be converted to a datetime object. Additionally, we have quite a few categorical variables that will need to be label- or one-hot encoded. Finally, and most importantly, we need to figure out what we are trying to predict from this data set!

The variables in this data set relate to measurements of the weather such as temperature, wind speed and direction, atmospheric pressure, and if there was rain on that current date. The feature that suggests something like a prediction is RainTomorrow. This column contains an object data type and when we look at the column values in more detail, we can see the values are categorical ('Yes' and 'No').

Let's take a look at our potential target and how the 'Yes' and 'No' classes are distributed.

# Look at the distribution of the 'RainTomorrow' column
weather['RainTomorrow'].value_counts()
No     110316
Yes     31877
Name: RainTomorrow, dtype: int64
# Look at the normalized distribution of the 'RainTomorrow' column
weather['RainTomorrow'].value_counts(normalize=True)
No     0.775819
Yes    0.224181
Name: RainTomorrow, dtype: float64

This target has two somewhat imbalanced classes, with about 78% 'No' and 22% 'Yes', but it will work fine as an example classification task. When we have imbalanced data, there are different ways to address possible problems, including using different metrics to evaluate the model. The topic of imbalanced data will likely come up (if it hasn't already!) as you work through both the Guided Projects and the Module Projects.
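
Because of this imbalance, a model that always predicts 'No' would already be about 78% accurate. Below is a minimal sketch (using the weather DataFrame loaded above) of that majority-class baseline, which is a useful reference point when judging any metric on imbalanced data.

# Majority-class baseline: the accuracy of always predicting the most common class
baseline_accuracy = weather['RainTomorrow'].value_counts(normalize=True).max()
print(f'Baseline accuracy from always predicting "No": {baseline_accuracy:.3f}')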

Challenge

For the Module Project, you will source your own data set and fit a model. As you are searching for something to work with, take a few minutes to look at each data set you come across and think about what the target variable would be. One stipulation for this exercise: try to find a data set where the target is not already specified. You don't need to perform any analysis right now other than viewing the data and making some effort to understand the features.

Additional Resources

Data Science Is Not Taught At Universities - And Here Is Why

Objective 02 - Avoid leakage of information from test to train or from target to features

Overview

We briefly introduced the concept of data leakage in the previous sprint when we discussed using pipelines for preprocessing and model fitting. In general, and especially if you are using cross-validation, it's good to be conscious of when, where, and why data leakage can occur. This module is focused on learning how to work with data that isn't already prepared for modeling. Not only do we need to know which features to use and whether our target is appropriate, but we also need to guard against information leaking from the test set into training, or from the target into our features.

The two main types of leakage are leaky features (predictors) and a leaky validation or testing process.

Leaky Features

This type of leakage occurs when a feature has access to information that won't be available when you actually use the model on new data (outside of the test set) to make predictions. This can happen if the values in that feature were adjusted after the values in your target array were determined.

For example, if we were predicting if someone has heart disease (True/False) and used a feature called BP_meds (indicating if the individual is taking blood pressure medication), we might have a problem. If someone is taking this medication, it might be because they have heart disease and are being treated. Moreover, the value in this column could have been changed after they were diagnosed with heart disease.
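
To make this concrete, here is a small sketch with clearly made-up values: if a feature like BP_meds was recorded after diagnosis, it will track the target almost perfectly, and a quick correlation check can surface that red flag.

# Toy illustration (made-up values): a post-diagnosis feature correlates
# almost perfectly with the target, which is a warning sign for leakage
import pandas as pd

toy = pd.DataFrame({
    'age':           [45, 60, 52, 70, 38, 49],
    'BP_meds':       [0, 1, 0, 1, 0, 1],
    'heart_disease': [0, 1, 0, 1, 0, 1],
})

correlations = toy.corr()['heart_disease'].drop('heart_disease')
print(correlations.abs().sort_values(ascending=False))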

Leaky Testing Process

The other type of leak can happen when your training process "learns" from the validation or test data. If you are preprocessing the data, such as filling in missing values with SimpleImputer or standardizing values with StandardScaler, you might accidentally fit those transformers on the entire data set. Instead, fit the preprocessing steps on the training data only and then apply them to the test data; this prevents information from the test set from leaking into the training process.
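
As a quick illustration, here is a minimal sketch of the safe pattern on synthetic data: the scaler is fit on the training split only, and the statistics it learned are then reused to transform the test split. (Wrapping these steps in a Pipeline, as we did in the previous sprint, applies this pattern automatically during cross-validation.)

# Fit preprocessing on the training data only, then apply it to the test data
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(42).normal(size=(100, 3))
X_train, X_test = train_test_split(X, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training data only
X_test_scaled = scaler.transform(X_test)        # no refitting on the test data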

Now that we are more familiar with these two different types of data leakage, let's explore our real-world weather data from the previous objective.

Follow Along

We introduced the Australian weather data set earlier and explored it briefly to decide on the prediction target: whether it was going to rain on the day following the measurements. Let's look more closely at each of the features (predictors) to see if any of them could present leakage problems.

#Import libraries, load data, and view
import pandas as pd
url="https://raw.githubusercontent.com/bloominstituteoftechnology/DS-Unit-2-Kaggle-Challenge/main/data/weather/weatherAUS.csv"
weather=pd.read_csv(url)
weather.head()
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
0 2008-12-01 Albury 13.4 22.9 0.6 NaN NaN W 44.0 W ... 22.0 1007.7 1007.1 8.0 NaN 16.9 21.8 No 0.0 No
1 2008-12-02 Albury 7.4 25.1 0.0 NaN NaN WNW 44.0 NNW ... 25.0 1010.6 1007.8 NaN NaN 17.2 24.3 No 0.0 No
2 2008-12-03 Albury 12.9 25.7 0.0 NaN NaN WSW 46.0 W ... 30.0 1007.6 1008.7 NaN 2.0 21.0 23.2 No 0.0 No
3 2008-12-04 Albury 9.2 28.0 0.0 NaN NaN NE 24.0 SE ... 16.0 1017.6 1012.8 NaN NaN 18.1 26.5 No 1.0 No
4 2008-12-05 Albury 17.5 32.3 1.0 NaN NaN W 41.0 ENE ... 33.0 1010.8 1006.0 7.0 8.0 17.8 29.7 No 0.2 No

5 rows × 24 columns

Before we identify any possible "leaky features," we should decide which features to use and which preprocessing steps are necessary. Let's look at each type of variable (numeric and categorical) in more detail.

# Look at the statistics of categorical variables 
weather.describe(include=['object'])
Date Location WindGustDir WindDir9am WindDir3pm RainToday RainTomorrow
count 142193 142193 132863 132180 138415 140787 142193
unique 3436 49 16 16 16 2 2
top 2013-10-08 Canberra W N SE No No
freq 49 3418 9780 11393 10663 109332 110316
# Look at the statistics of the numeric variables 
weather.describe()
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RISK_MM
count 141556.000000 141871.000000 140787.000000 81350.000000 74377.000000 132923.000000 140845.000000 139563.000000 140419.000000 138583.000000 128179.000000 128212.000000 88536.000000 85099.000000 141289.000000 139467.000000 142193.000000
mean 12.186400 23.226784 2.349974 5.469824 7.624853 39.984292 14.001988 18.637576 68.843810 51.482606 1017.653758 1015.258204 4.437189 4.503167 16.987509 21.687235 2.360682
std 6.403283 7.117618 8.465173 4.188537 3.781525 13.588801 8.893337 8.803345 19.051293 20.797772 7.105476 7.036677 2.887016 2.720633 6.492838 6.937594 8.477969
min -8.500000 -4.800000 0.000000 0.000000 0.000000 6.000000 0.000000 0.000000 0.000000 0.000000 980.500000 977.100000 0.000000 0.000000 -7.200000 -5.400000 0.000000
25% 7.600000 17.900000 0.000000 2.600000 4.900000 31.000000 7.000000 13.000000 57.000000 37.000000 1012.900000 1010.400000 1.000000 2.000000 12.300000 16.600000 0.000000
50% 12.000000 22.600000 0.000000 4.800000 8.500000 39.000000 13.000000 19.000000 70.000000 52.000000 1017.600000 1015.200000 5.000000 5.000000 16.700000 21.100000 0.000000
75% 16.800000 28.200000 0.800000 7.400000 10.600000 48.000000 19.000000 24.000000 83.000000 66.000000 1022.400000 1020.000000 7.000000 7.000000 21.600000 26.400000 0.800000
max 33.900000 48.100000 371.000000 145.000000 14.500000 135.000000 130.000000 87.000000 100.000000 100.000000 1041.000000 1039.600000 9.000000 9.000000 40.200000 46.700000 371.000000

Data Exploration: Null Values

From the above DataFrame descriptions, we can see that there are a lot of null values in some of the columns. Let's take a more detailed look at how many values are missing in which columns. The plot below is created with the missingno library.

# Checking for null values
weather.isnull().sum()
Date                 0
Location             0
MinTemp            637
MaxTemp            322
Rainfall          1406
Evaporation      60843
Sunshine         67816
WindGustDir       9330
WindGustSpeed     9270
WindDir9am       10013
WindDir3pm        3778
WindSpeed9am      1348
WindSpeed3pm      2630
Humidity9am       1774
Humidity3pm       3610
Pressure9am      14014
Pressure3pm      13981
Cloud9am         53657
Cloud3pm         57094
Temp9am            904
Temp3pm           2726
RainToday         1406
RISK_MM              0
RainTomorrow         0
import matplotlib.pyplot as plt
import missingno as msno
msno.matrix(weather)

plt.show()

mod1_obj1_missingNA.png

We have four columns with a large number of null values. If we were doing this analysis for a competition (or for an actual data science job!) we would want to more carefully explore the missing values. Since these columns are missing about 40% of their data (or more), we're going to drop them for this analysis. For the other missing values, we'll use an Imputer in the preprocessing step.

To simplify the analysis for later, we'll also drop the Location column. Again, this information might be important for a more detailed model, but we're trying to keep this process simple so that we can focus on identifying the leaky features.

# Drop columns with high-percentage of missing values
cols_drop = ['Location', 'Evaporation', 'Sunshine', 'Cloud9am', 'Cloud3pm']
weather_drop = weather.drop(cols_drop, axis=1)

Data Cleaning: Datetime

We have a date column which could be converted to a datetime object. We will use only the 'month' value from this column in our model, as the full date would be too specific.

# Convert the 'Date' column to datetime, extract month
weather_drop['Date'] = pd.to_datetime(weather_drop['Date']).dt.month
weather_drop.head()

Data Processing: Pipeline

We'll separate our features into numeric and categorical types and then perform the transformation steps. Several of the numeric features are on very different scales, so we will standardize those values. We will also impute missing values with SimpleImputer(). The categorical features will be ordinally encoded with OrdinalEncoder().

# Print the column names
weather_drop.columns
# Imports
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Define the numeric features
numeric_features = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 
                    'WindSpeed9am','WindSpeed3pm', 'Humidity9am', 
                    'Humidity3pm', 'Pressure9am','Pressure3pm', 
                    'Temp9am', 'Temp3pm', 'RISK_MM']

# Create the transformer (impute, scale)
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define the categorical features
categorical_features = ['WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ordinal', OrdinalEncoder())])

# Define how the numeric and categorical features will be transformed
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Define the pipeline steps, including the classifier
clf = Pipeline(steps=[('preprocessor', preprocessor),
                  ('classifier', DecisionTreeClassifier())])

Create Feature Matrix, Target Array

We have a couple of final steps before we fit the model: create the feature matrix and then create and encode the target array.

# Create the feature matrix 
X = weather_drop.drop('RainTomorrow', axis=1)

# Create and encode the target array
from sklearn.preprocessing import LabelEncoder
label_enc = LabelEncoder()
y=label_enc.fit_transform(weather_drop['RainTomorrow'])
# Import the train_test_split utility
from sklearn.model_selection import train_test_split

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Fit the model
clf.fit(X_train,y_train)
print('Validation Accuracy', clf.score(X_test, y_test))

Wow! We achieved 100% accuracy. Is this too good to be true? Yes: whenever a model achieves near-perfect accuracy, you should suspect a problem, and that problem is often data leakage of some kind. Let's look at the feature importances to see where the problem is.

# Features (order in which they were preprocessed)
features_order = numeric_features + categorical_features

# Determine the importances
importances = pd.Series(clf.steps[1][1].feature_importances_, features_order)
# Plot feature importances
import matplotlib.pyplot as plt

n = 7
plt.figure(figsize=(10,n/2))
plt.title(f'Top {n} features')
importances.sort_values()[-n:].plot.barh(color='grey')

plt.show()

mod1_obj1_top7feature_leaky.png

It looks like the model was essentially fit on a single feature, which suggests that feature is directly tied to the target. Spoiler: one of the features is leaking information to the model. It is the RISK_MM column, which records how much rain was measured on the following day; that is essentially the same information the target encodes.

We'll remove this column, run the model again, and recalculate the feature importances.

# Remove the 'RISK_MM' column
X_noriskmm = X.drop('RISK_MM', axis=1)

# Create the new training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_noriskmm, y, test_size=0.2, random_state=42)

# Drop 'RISK_MM' from the numeric_features list
# (list.remove works in place and returns None, so don't assign its result)
numeric_features.remove('RISK_MM')

# Fit the model
clf.fit(X_train,y_train)
print('Validation Accuracy (with no "RISK_MM")', clf.score(X_test, y_test))

That's better! The accuracy is still high, but much more reasonable.

# Get feature importances

# Features (order in which they were preprocessed)
numeric_features = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 
                    'WindSpeed9am','WindSpeed3pm', 'Humidity9am', 
                    'Humidity3pm', 'Pressure9am','Pressure3pm', 
                    'Temp9am', 'Temp3pm']

categorical_features = ['WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']
features_order = numeric_features + categorical_features

importances = pd.Series(clf.steps[1][1].feature_importances_, features_order)
# Plot feature importances

n = 7
plt.figure(figsize=(10,n/2))
plt.title(f'Top {n} features')
importances.sort_values()[-n:].plot.barh(color='grey')

plt.show()

mod1_obj1_top7feature_NOleaky.png

Challenge

In the above example, we removed the RISK_MM column. However, this removes information that we might want to use in a model. For this challenge, think of a way you could group the values in this column and use the result as the target for a model. Can we predict how much rain fell instead of just making a yes/no prediction?
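
One possible starting point is sketched below, assuming the weather DataFrame from earlier is still loaded; the bin edges are illustrative choices, not prescribed by the lesson.

# Bin next-day rainfall (RISK_MM, in mm) into ordered categories that could
# serve as a multi-class target instead of the yes/no RainTomorrow column
rain_amount = pd.cut(weather['RISK_MM'],
                     bins=[-0.1, 0.0, 1.0, 10.0, weather['RISK_MM'].max()],
                     labels=['none', 'trace', 'light', 'heavy'])
print(rain_amount.value_counts())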

Additional Resources

Objective 03 - Choose an appropriate evaluation metric

Overview

Up to this point in the course we've fit several different models: linear regression, logistic regression, decision tree, and random forests. One concept that we haven't covered extensively is how to choose the metric by which we evaluate our models.

Because it's important, the following information is likely to be repeated during the Guided Project:

Classification & regression metrics are different!

Let's look at each type of task and the associated metrics.

Classification Tasks

For classification tasks, we can use metrics such as accuracy, precision, recall, the F1 score, and the area under the receiver operating characteristic (ROC) curve. Which metric is most appropriate depends on the problem, for example on whether the classes are balanced and on whether false positives or false negatives are more costly.

Regression Tasks

Generally, regression models are scored by the R squared value. It is the proportion of the variance in the dependent variable (y) that is predictable from the independent variable(s) (X).
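
As a quick illustration, the sketch below fits a linear regression on synthetic data and scores it with R squared; the data set and model here are purely illustrative.

# Fit a linear regression on synthetic data and score it with R^2
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X_reg, y_reg = make_regression(n_samples=500, n_features=3, noise=10, random_state=42)
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_reg, y_reg, random_state=42)

reg = LinearRegression().fit(X_train_r, y_train_r)
print('R^2 on the test set:', r2_score(y_test_r, reg.predict(X_test_r)))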

We'll use the iris data set here to do a basic evaluation metric exercise.

Follow Along

Let's load in the data, create the training and test sets, fit the model, and, finally, evaluate our model.

# Load in libraries, data
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split

# Create X, y and training/test sets
iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

# Import the classifier
from sklearn.tree import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train,y_train)

print('Validation Accuracy: ', dt_classifier.score(X_test, y_test))
Validation Accuracy:  0.9666666666666667

This model has a pretty high accuracy, so let's look at a few other metrics which might give us a better idea of how the model is fitting the data. First, we'll create a visualization of the confusion matrix.

# Import plotting tools and the confusion matrix display
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(dt_classifier, X_test, y_test) 
plt.show()

label_map.png

We can see from the confusion matrix that very few of the observations are being misclassified, so the high accuracy is a fair reflection of how the model fits this data.

We can also look at the classification report, which shows the precision, recall, and the F1-score.

# Create the classification report
y_pred = dt_classifier.predict(X_test)
print(metrics.classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        23
           1       0.95      0.95      0.95        19
           2       0.94      0.94      0.94        18

    accuracy                           0.97        60
   macro avg       0.96      0.96      0.96        60
weighted avg       0.97      0.97      0.97        60

Challenge

For this challenge, think about a data set you have worked with that you haven't yet evaluated. Which metric should you use? Do you understand how the model is being evaluated?

Additional Resources

Objective 04 - Use the classification metric ROC AUC to interpret a classifier model

Overview

In the previous sprint, we examined the probability threshold a classifier uses when determining the class to which an observation belongs. We can extend this concept and look at something called the receiver operating characteristic, which is usually plotted as a curve and called the ROC curve.

First, we'll go back to the idea of counting true positives and true negatives and look at two related measurements: the true positive rate (TPR) and the false positive rate (FPR).

 \text{TPR} = \frac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}}

 \text{FPR} = \frac{\text{False Positives}}{\text{False Positives}+\text{True Negatives}}

Both of the above measurements are normalized counts: the TPR divides the true positives by the total number of actual positives, and the FPR divides the false positives by the total number of actual negatives.
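
To make these definitions concrete, here is a small worked sketch with made-up labels and predictions evaluated at a single threshold.

# Compute TPR and FPR from a confusion matrix (made-up labels and predictions)
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f'TPR = {tp / (tp + fn):.2f}')  # 3 of the 4 actual positives were found
print(f'FPR = {fp / (fp + tn):.2f}')  # 1 of the 4 actual negatives was mislabeled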

When we create a ROC curve, we are plotting the TPR against the FPR for a range of threshold values. In the next section, we'll use the scikit-learn roc_curve() method to do the calculations for us. From the resulting data, we'll create a plot.

Follow Along

In order to plot a ROC curve, we need some data and a classifier model fit to that data. Let's generate a synthetic classification data set and then build the ROC curve.

# Load modules
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

# Create the data (feature, target)
X, y = make_classification(n_samples=10000, n_features=5,
                          n_classes=2, n_informative=3,
                          random_state=42)

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Create and fit the model
logreg_classifier = LogisticRegression().fit(X_train, y_train)

# Create predicted probabilities
y_pred_prob = logreg_classifier.predict_proba(X_test)[:,1]
# Create the data for the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# See the results in a table
roccurve_df = pd.DataFrame({
    'False Positive Rate': fpr, 
    'True Positive Rate': tpr, 
    'Threshold': thresholds
})

roccurve_df.head()
False Positive Rate True Positive Rate Threshold
0 0.000000 0.000000 1.999969
1 0.000000 0.000786 0.999969
2 0.000000 0.291438 0.983222
3 0.000815 0.291438 0.983049
4 0.000815 0.360566 0.970583
# Plot the ROC curve
import matplotlib.pyplot as plt

plt.plot(fpr, tpr)
plt.plot([0,1], ls='--')
plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

plt.show()

mod4_obj4_ROC.png

The above model looks pretty good. In general, the better a model, the higher the curve is, and the greater the area under the curve (AUC). The maximum value for the AUC is equal to one. While we can "eyeball" the area in our curve, there is also a tool used to calculate the AUC.

# Calculate the area under the curve
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, y_pred_prob)
0.9419681927513379

Challenge

Using a different data set for classification, see if you can construct the ROC curve. Or with the same data set generated above, try using a different classifier such as a decision tree and plot the ROC curve and calculate the AUC. Which model performs better?
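
Here is a sketch of the second part of the challenge, reusing the X_train, X_test, y_train, y_test, and y_pred_prob objects defined above; the max_depth value is an arbitrary illustrative choice.

# Fit a decision tree on the same synthetic data and compare AUC scores
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
tree_probs = tree.predict_proba(X_test)[:, 1]

print('Decision tree AUC:      ', roc_auc_score(y_test, tree_probs))
print('Logistic regression AUC:', roc_auc_score(y_test, y_pred_prob))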

Additional Resources

Guided Project

Open DS_231_guided_project.ipynb in the GitHub repository below to follow along with the guided project:

Guided Project Video

Module Assignment

For this assignment, you'll apply what you've learned to your own portfolio dataset. This hands-on experience will solidify your understanding of the concepts and prepare you for real-world machine learning tasks.

Note: There is no video solution for this assignment as you will be working with your own dataset and defining your own machine learning problem.

Additional Resources