Module 3: Cross-Validation and Grid Search
Overview
In this module, you will learn essential techniques for validating machine learning models and optimizing their hyperparameters. Cross-validation gives you more reliable estimates of model performance than a single train-test split, while grid search offers a systematic way to find the best hyperparameters for your models.
You'll learn how to implement both techniques with scikit-learn and understand why they matter for building models that generalize well to new, unseen data.
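To preview both ideas, here is a minimal sketch using scikit-learn on a synthetic dataset; the data and model are illustrative stand-ins, not the assignment data:

```python
# A minimal sketch of why cross-validation beats a single split.
# The dataset and model here are toy examples, not the assignment data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)
model = DecisionTreeClassifier(random_state=42)

# A single train-test split gives one noisy estimate of accuracy...
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
single_score = model.fit(X_train, y_train).score(X_test, y_test)

# ...while 5-fold cross-validation averages five held-out estimates.
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"Single split: {single_score:.3f}")
print(f"5-fold CV:    {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```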
Learning Objectives
By the end of this module, you should be able to:
- Understand the limitations of simple train-test splits for model validation
- Implement K-fold cross-validation to get more reliable model performance estimates
- Use GridSearchCV to systematically search for optimal hyperparameters (see the sketch after this list)
- Apply cross-validation and grid search to tree-based models
- Understand the trade-offs between model complexity and performance
- Interpret cross-validation results to make informed modeling decisions
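As a preview of the grid-search objective above, here is a hedged sketch of GridSearchCV on the same kind of toy data; the parameter grid is illustrative, not a recommended search space, and the range of max_depth values illustrates the complexity trade-off named in the objectives:

```python
# A sketch of GridSearchCV on toy data; the grid below is illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)

# Each max_depth value trades model complexity against generalization;
# GridSearchCV cross-validates every combination in the grid.
param_grid = {
    "max_depth": [2, 4, 8, 16, None],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,       # 5-fold cross-validation for each combination
    n_jobs=-1,  # use all available cores
)
search.fit(X, y)
print("Best CV accuracy:", round(search.best_score_, 3))
print("Best params:", search.best_params_)
```

Note that best_score_ is the mean cross-validated score of the best parameter combination, so it is directly comparable to the cross_val_score results in the earlier sketch.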
Guided Project
Cross-Validation and Grid Search with Tree-Based Models
Resources:
- GitHub Repository
- Guided Project Notebook: LS_DS_223.ipynb
- Slides
Module Assignment
Cross-Validation: Kaggle Competition
Continue improving your Kaggle competition submission by implementing cross-validation and hyperparameter optimization.
Assignment Notebook Name: LS_DS_223_assignment.ipynb
Tasks (an end-to-end sketch follows at the end of this section):
- Use your wrangle function to import the training and test data.
- Split the training data into a feature matrix X and a target vector y.
- Establish the baseline accuracy score for your dataset.
- Build clf_dt, a decision tree classifier.
- Build clf_rf, a random forest classifier.
- Evaluate both classifiers using k-fold cross-validation.
- Tune the hyperparameters of the best-performing classifier.
- Print out the best score and parameters for your tuned model.
- Create a submission.csv file and upload it to Kaggle.
Libraries to use: category_encoders, matplotlib, pandas, ydata-profiling, sklearn
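The sketch below strings the tasks together in one plausible order. The wrangle function body, file paths, target column, id column, and parameter grid are all placeholders; substitute the actual values from your own notebook and competition:

```python
# An end-to-end sketch of the assignment workflow. All paths, column
# names, and the wrangle body are placeholders for your own.
import pandas as pd
from category_encoders import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

def wrangle(filepath):
    """Placeholder for your own wrangle function from earlier modules."""
    return pd.read_csv(filepath)

# 1. Import the training and test data.
df = wrangle("train.csv")      # hypothetical path
X_test = wrangle("test.csv")   # hypothetical path

# 2. Split into feature matrix X and target vector y.
target = "status_group"        # hypothetical target column
X, y = df.drop(columns=target), df[target]

# 3. Baseline accuracy: always predict the majority class.
baseline_acc = y.value_counts(normalize=True).max()
print("Baseline accuracy:", round(baseline_acc, 3))

# 4-5. Build both classifiers as pipelines so encoding and imputation
# are refit inside each cross-validation fold.
clf_dt = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy="median"),
    DecisionTreeClassifier(random_state=42),
)
clf_rf = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy="median"),
    RandomForestClassifier(random_state=42, n_jobs=-1),
)

# 6. Evaluate both with k-fold cross-validation.
cv_dt = cross_val_score(clf_dt, X, y, cv=5, n_jobs=-1)
cv_rf = cross_val_score(clf_rf, X, y, cv=5, n_jobs=-1)
print("Decision tree CV:", cv_dt.mean().round(3))
print("Random forest CV:", cv_rf.mean().round(3))

# 7. Tune the better performer (here assumed to be the random forest);
# the grid below is illustrative, not a recommended search space.
param_grid = {
    "randomforestclassifier__n_estimators": [50, 100, 200],
    "randomforestclassifier__max_depth": [10, 20, None],
}
search = GridSearchCV(clf_rf, param_grid, cv=5, n_jobs=-1, verbose=1)
search.fit(X, y)

# 8. Print the best score and parameters.
print("Best CV score:", round(search.best_score_, 3))
print("Best params:", search.best_params_)

# 9. Create submission.csv for Kaggle (column names are hypothetical;
# match the competition's sample submission file).
submission = pd.DataFrame({"id": X_test.index, target: search.predict(X_test)})
submission.to_csv("submission.csv", index=False)
```

Putting the encoder and imputer inside each pipeline matters: they get fitted only on each fold's training portion, so the cross-validation scores are not optimistically biased by information leaking from the validation fold.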