Module 3: Cross-Validation and Grid Search

Overview

In this module, you will learn essential techniques for properly validating machine learning models and optimizing their hyperparameters. Cross-validation helps ensure that your model performance estimates are reliable, while grid search provides a systematic approach to finding the best hyperparameters for your models.

You'll learn how to implement these techniques using scikit-learn and understand their importance in building models that generalize well to new, unseen data.
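
For orientation, here is a minimal sketch of both ideas using scikit-learn's built-in iris dataset; the dataset and hyperparameter values are illustrative stand-ins, not part of this module's assignment:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# K-fold cross-validation: five performance estimates instead of one
# train-test split, so you can report a mean and a spread.
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Grid search: try every hyperparameter combination in the grid,
# scoring each candidate with cross-validation.
param_grid = {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```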

Learning Objectives

By the end of this module, you should be able to:

  • Understand the limitations of simple train-test splits for model validation
  • Implement K-fold cross-validation to get more reliable model performance estimates
  • Use GridSearchCV to systematically search for optimal hyperparameters
  • Apply cross-validation and grid search to tree-based models
  • Understand the trade-offs between model complexity and performance
  • Interpret cross-validation results to make informed modeling decisions (see the sketch after this list)
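
As a preview of that last objective, here is one way to inspect a grid search's per-combination results; `search` is assumed to be a fitted GridSearchCV object like the one sketched in the overview above:

```python
import pandas as pd

# cv_results_ holds one row per hyperparameter combination, including the
# mean and standard deviation of the validation score across folds.
results = pd.DataFrame(search.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score").head())
```

A small difference in mean score between two candidates matters less when their standard deviations overlap heavily, which is exactly the kind of judgment the fold-by-fold view supports.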

Guided Project

Cross-Validation and Grid Search with Tree-Based Models

Module Assignment

Cross-Validation: Kaggle Competition

Continue improving your Kaggle competition submission by implementing cross-validation and hyperparameter optimization.

Assignment Notebook Name: LS_DS_223_assignment.ipynb

Tasks (a code sketch covering steps 3-9 follows this list):

  1. Use the wrangle function to import the training and test data.
  2. Split the training data into a feature matrix X and a target vector y.
  3. Establish the baseline accuracy score for your dataset.
  4. Build a decision tree classifier, clf_dt.
  5. Build a random forest classifier, clf_rf.
  6. Evaluate both classifiers using k-fold cross-validation.
  7. Tune the hyperparameters of the best-performing classifier.
  8. Print the best score and parameters for your model.
  9. Create submission.csv and upload it to Kaggle.
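
The sketch below walks through tasks 3-9 under stated assumptions: `df` is the wrangled training DataFrame, `X_test` is the wrangled test feature matrix, "target" and "id" are placeholders for your competition's actual column names, and the hyperparameter grid is illustrative rather than prescribed:

```python
import pandas as pd
from category_encoders import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Task 2: split the training data into feature matrix X and target vector y.
target = "target"  # placeholder: substitute your dataset's target column
X = df.drop(columns=target)
y = df[target]

# Task 3: baseline accuracy is the relative frequency of the majority class.
print("Baseline accuracy:", y.value_counts(normalize=True).max())

# Tasks 4-5: wrap encoding and imputation in pipelines so they are refit
# inside every cross-validation fold (no leakage across folds).
clf_dt = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy="median"),
    DecisionTreeClassifier(random_state=42),
)
clf_rf = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy="median"),
    RandomForestClassifier(random_state=42, n_jobs=-1),
)

# Task 6: evaluate both classifiers with k-fold cross-validation.
print("Decision tree CV accuracy:", cross_val_score(clf_dt, X, y, cv=5).mean())
print("Random forest CV accuracy:", cross_val_score(clf_rf, X, y, cv=5).mean())

# Task 7: tune the better performer (here assumed to be the random forest).
param_grid = {
    "randomforestclassifier__n_estimators": [50, 100, 200],
    "randomforestclassifier__max_depth": [10, 20, None],
}
search = GridSearchCV(clf_rf, param_grid, cv=5, n_jobs=-1)
search.fit(X, y)

# Task 8: print the best score and params found by the search.
print("Best CV score:", search.best_score_)
print("Best params:", search.best_params_)

# Task 9: predict on the test set and write submission.csv for Kaggle.
# "id" and the target column name are placeholders for your competition's schema.
submission = pd.DataFrame({"id": X_test.index, target: search.predict(X_test)})
submission.to_csv("submission.csv", index=False)
```

Note that make_pipeline names each step after its lowercased class, so pipeline hyperparameters are addressed as `<step>__<parameter>`, which is why the grid keys are prefixed with randomforestclassifier__.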

Libraries to use: category_encoders, matplotlib, pandas, ydata-profiling, sklearn

Assignment GitHub Repository

Additional Resources