Module 2: Wrangle ML Datasets
In this module, you'll learn essential techniques for wrangling datasets for machine learning. Data preparation is a critical step in the machine learning workflow and is often cited as taking up to 80% of a data scientist's time. You'll explore methods for data cleaning, exploration, and joining relational data to create meaningful feature sets for your models.
Learning Objectives
1. Explore tabular datasets
Learn techniques for exploring and understanding tabular data before building machine learning models.
- Conducting comprehensive exploratory data analysis (EDA)
- Identifying missing values and outliers
- Visualizing feature distributions and relationships
- Understanding feature correlations with the target
- Using visualization libraries to gain insights
- Identifying patterns and potential feature engineering opportunities
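As a rough illustration of the first few checks listed above, here is a minimal pandas sketch. The toy DataFrame and its column names (`sqft`, `bedrooms`, `price`) are hypothetical stand-ins for a real dataset, and the 1.5 × IQR rule is only one common convention for flagging outliers, not the only valid one.

```python
import pandas as pd

# Hypothetical toy data standing in for a real tabular dataset.
df = pd.DataFrame({
    "sqft": [850, 900, 1200, None, 1500, 9000],
    "bedrooms": [2, 2, 3, 3, 4, 4],
    "price": [200, 210, 290, 300, 360, 380],  # assumed target column
})

# Count missing values in each column.
missing_counts = df.isna().sum()

# Flag outliers in one feature with the 1.5 * IQR rule.
q1, q3 = df["sqft"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (df["sqft"] < q1 - 1.5 * iqr) | (df["sqft"] > q3 + 1.5 * iqr)

# Correlation of each numeric feature with the target.
numeric = df.select_dtypes("number")
target_corr = numeric.corr()["price"].drop("price")

print(missing_counts)
print(df[outlier_mask])
print(target_corr.sort_values(ascending=False))
```

Rows with a missing `sqft` are never flagged as outliers here, because comparisons against a missing value evaluate to False. Whether a flagged row is an error or a genuine extreme value is a judgment call that depends on the dataset.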
2. Define an ML problem
Learn how to properly frame and define a machine learning problem from your dataset.
- Identifying the business question to be answered
- Choosing between classification and regression approaches
- Selecting appropriate evaluation metrics
- Defining model success criteria
- Understanding the trade-offs in different problem formulations
- Aligning the ML problem with business objectives
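One way to start on the classification-versus-regression question is to look at the target column itself. The sketch below is a rule of thumb only: the function name `suggest_problem_type` and the `max_classes` threshold of 20 are assumptions made for illustration, and the business question should always override what the heuristic suggests.

```python
import pandas as pd

def suggest_problem_type(target: pd.Series, max_classes: int = 20) -> str:
    """Guess the problem type from the target column (heuristic only)."""
    # Booleans and non-numeric labels are treated as classes.
    if pd.api.types.is_bool_dtype(target) or not pd.api.types.is_numeric_dtype(target):
        return "classification"
    # A numeric target with few distinct values is probably a set of classes.
    if target.nunique() <= max_classes:
        return "classification"
    return "regression"

# Hypothetical targets for illustration.
churned = pd.Series([0, 1, 1, 0, 0, 0])
sale_price = pd.Series([i * 1.5 for i in range(100)])

print(suggest_problem_type(churned))
print(suggest_problem_type(sale_price))

# For classification, class balance informs the choice of evaluation metric.
print(churned.value_counts(normalize=True))
```

If the class balance is heavily skewed, accuracy alone can be misleading, which is one reason the metric and the problem definition are chosen together.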
3. Join relational data
Learn techniques for combining data from multiple related sources to create more powerful feature sets.
- Understanding different types of joins (inner, outer, left, right)
- Identifying appropriate join keys
- Aggregating and summarizing data from related tables
- Handling time-based joins properly to avoid leakage
- Creating new features from joined data
- Identifying when and why to denormalize data for machine learning
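The join, aggregation, and leakage points above can be sketched together in pandas. The tables, column names, and the idea of a per-customer `prediction_date` are all hypothetical; the point is that only rows dated before the prediction date should feed the features.

```python
import pandas as pd

# Hypothetical parent table: one row per customer.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "prediction_date": pd.to_datetime(["2024-03-01"] * 3),
})

# Hypothetical child table: many orders per customer.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "order_date": pd.to_datetime(
        ["2024-01-10", "2024-03-15", "2024-02-20", "2024-02-25"]
    ),
    "amount": [50.0, 80.0, 20.0, 30.0],
})

# A left join keeps every customer, including those with no orders.
joined = customers.merge(orders, on="customer_id", how="left")

# Drop orders placed on or after the prediction date to avoid leakage.
past = joined[joined["order_date"] < joined["prediction_date"]]

# Aggregate the child rows back to one row per customer.
features = (
    past.groupby("customer_id")
    .agg(order_count=("amount", "count"), total_spent=("amount", "sum"))
    .reset_index()
)

# Rejoin so customers without past orders get zeros instead of vanishing.
result = customers.merge(features, on="customer_id", how="left").fillna(
    {"order_count": 0, "total_spent": 0.0}
)
print(result)
```

Customer 1's order from 2024-03-15 is excluded because it happened after the prediction date. Using an inner join in the first step would silently drop customer 3, who has no orders at all.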
Guided Project
Wrangle ML Datasets Guided Project
In this guided project, you'll work through a complete example of wrangling datasets for machine learning. You'll explore, clean, and prepare data, handling common challenges like missing values, outliers, and joining related tables. You'll also build a simple baseline model to establish a performance benchmark.
Datasets:
Module Assignment
Wrangle ML Datasets for Your Portfolio Project
For this assignment, you'll continue working with your portfolio dataset from Module 1. You'll apply what you've learned to clean, explore, and prepare your data for modeling.
Note: There is no video for this assignment because you will be working with your own dataset and defining your own machine learning problem.
Assignment Notebook Name: LS_DS_232_assignment.ipynb
Tasks:
- Continue to clean and explore your data.
- For the evaluation metric you chose, what score would you get just by guessing?
- Can you make a fast, first model that beats guessing?
- We recommend that you use your portfolio project dataset for all assignments this sprint.
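To make "just by guessing" concrete, here is a minimal sketch of two common baselines: always predicting the majority class for classification, and always predicting the mean for regression. The target values are made up for illustration, and accuracy and mean absolute error are assumed metrics; substitute whichever metric you chose.

```python
import pandas as pd

# Classification: always guess the most frequent class.
y_class = pd.Series([0, 0, 0, 1, 0, 1, 0, 0])  # hypothetical labels
majority_class = y_class.mode()[0]
baseline_accuracy = (y_class == majority_class).mean()

# Regression: always guess the mean of the target.
y_reg = pd.Series([10.0, 12.0, 14.0, 20.0])  # hypothetical target values
mean_guess = y_reg.mean()
baseline_mae = (y_reg - mean_guess).abs().mean()

print(f"Majority-class accuracy: {baseline_accuracy:.2f}")
print(f"Mean-guess MAE: {baseline_mae:.2f}")
```

In practice you would compute the guess from the training split and score it on a validation split. scikit-learn's `DummyClassifier` and `DummyRegressor` offer the same kind of baseline behind the usual fit/predict interface.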