Module 2: Random Forests

Overview

In this module, you will learn about Random Forests, a powerful ensemble learning technique that builds upon the foundation of decision trees. Random Forests combine multiple decision trees to create a more robust, accurate, and stable model that mitigates many of the limitations of individual decision trees.

You'll learn how Random Forests work, their advantages over single decision trees, and how to implement and tune them effectively using scikit-learn. This module will also introduce you to the concept of ensemble learning and its benefits in machine learning.
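As a first taste, here is a minimal sketch, using synthetic data from scikit-learn, that contrasts a single decision tree with a Random Forest on the same task. All names and values are illustrative, not part of the assignment.

    # Compare one decision tree against a forest of 100 trees on
    # the same synthetic classification task.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

    tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
    forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

    # The single tree typically fits the training data almost perfectly
    # and generalizes worse; the forest averages many decorrelated trees.
    print('Tree validation accuracy:  ', tree.score(X_val, y_val))
    print('Forest validation accuracy:', forest.score(X_val, y_val))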

Learning Objectives

By the end of this module, you should be able to:

  • Understand how Random Forests combine multiple decision trees
  • Implement Random Forest models for classification and regression
  • Configure important hyperparameters like n_estimators and max_features
  • Interpret feature importances from Random Forest models (see the sketch after this list)
  • Identify the advantages of Random Forests over single decision trees
  • Apply Random Forests to real-world machine learning problems
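
The sketch below, again on synthetic data, illustrates the two objectives that involve code: setting n_estimators and max_features, and reading feature_importances_. The feature names are hypothetical.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    import pandas as pd

    X, y = make_classification(n_samples=500, n_features=5, random_state=0)
    feature_names = ['f0', 'f1', 'f2', 'f3', 'f4']  # hypothetical names

    model = RandomForestClassifier(
        n_estimators=200,     # number of trees in the forest
        max_features='sqrt',  # features considered at each split
        random_state=0,
    ).fit(X, y)

    # feature_importances_ sums to 1.0; larger values mean a feature
    # contributed more to the forest's split decisions.
    importances = pd.Series(model.feature_importances_, index=feature_names)
    print(importances.sort_values(ascending=False))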

Guided Project

Random Forests with the Tanzania Waterpumps Dataset

Module Assignment

Random Forests: Kaggle Competition

It's time to apply what you've learned about Random Forests to improve your predictions in the Kaggle competition!

Assignment Notebook Name: LS_DS_222_assignment.ipynb

Tasks:

  1. Sign up for a Kaggle account, join the Kaggle competition, and download the water pump dataset.
  2. Modify the wrangle function to engineer a new feature, then use it to import the training and test data (see the first sketch after this list).
  3. Split the training data into a feature matrix X and a target vector y.
  4. Split the feature matrix X and target vector y into training and validation sets.
  5. Establish the baseline accuracy score for your dataset (sketched below).
  6. Build and train model_rf (sketched below).
  7. Calculate the training and validation accuracy scores for your model.
  8. Adjust the model's max_depth and n_estimators to reduce overfitting.
  9. Generate a list of predictions for X_test.
  10. Stretch goal: Create a submissions.csv file and upload it to the Kaggle competition site (sketched below).
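
The sketches below walk through these tasks in order. They are minimal, hedged examples, not the assignment solution: the file names, column names, and the pump_age feature are assumptions about the water pump dataset, so adapt them to what you actually find in the data.

First, a possible wrangle function for tasks 1-2. In this dataset a construction_year of 0 usually means the year is unknown, so the sketch treats 0 as missing before computing the engineered feature.

    import numpy as np
    import pandas as pd

    def wrangle(filepath):
        df = pd.read_csv(filepath)
        # Treat construction_year == 0 as missing, then engineer a
        # hypothetical new feature: the pump's age when recorded.
        df['construction_year'] = df['construction_year'].replace(0, np.nan)
        df['pump_age'] = (pd.to_datetime(df['date_recorded']).dt.year
                          - df['construction_year'])
        return df

    df = wrangle('train_features.csv')      # assumed file names
    X_test = wrangle('test_features.csv')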
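Next, continuing from the wrangle sketch, the splits and the baseline (tasks 3-5), assuming the target column is named status_group as in the competition data. The baseline is the accuracy you would get by always predicting the majority class.

    from sklearn.model_selection import train_test_split

    target = 'status_group'
    X = df.drop(columns=target)  # feature matrix
    y = df[target]               # target vector

    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Majority-class baseline: any useful model must beat this number.
    print('Baseline accuracy:', y_train.value_counts(normalize=True).max())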
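For tasks 6-8, one reasonable shape for model_rf is a pipeline that ordinal-encodes the categorical columns, imputes missing values, and fits the forest. The hyperparameter values shown are illustrative starting points, not tuned results.

    from category_encoders import OrdinalEncoder
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline

    model_rf = make_pipeline(
        OrdinalEncoder(),                  # encode categorical columns
        SimpleImputer(strategy='median'),  # fill missing values
        RandomForestClassifier(
            n_estimators=100,  # more trees: steadier but slower
            max_depth=20,      # smaller depth: less overfitting
            random_state=42,
            n_jobs=-1,
        ),
    )
    model_rf.fit(X_train, y_train)

    print('Training accuracy:  ', model_rf.score(X_train, y_train))
    print('Validation accuracy:', model_rf.score(X_val, y_val))

A large gap between the two scores signals overfitting; shrinking max_depth (or steadying the forest with more n_estimators) usually narrows it.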
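Finally, the predictions and the stretch goal (tasks 9-10). The submission format is assumed to want an id column plus the predicted status_group; check the competition page for the exact requirements.

    import pandas as pd

    y_pred = model_rf.predict(X_test)

    submission = pd.DataFrame({
        'id': X_test['id'],      # assumed identifier column
        'status_group': y_pred,
    })
    submission.to_csv('submissions.csv', index=False)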

Libraries to use: category_encoders, matplotlib, pandas, ydata-profiling, sklearn

Assignment GitHub Repository

Additional Resources