Module 2: Random Forests
Overview
In this module, you will learn about Random Forests, a powerful ensemble learning technique that builds upon the foundation of decision trees. Random Forests combine multiple decision trees to create a more robust, accurate, and stable model that mitigates many of the limitations of individual decision trees.
You'll learn how Random Forests work, their advantages over single decision trees, and how to implement and tune them effectively using scikit-learn. This module will also introduce you to the concept of ensemble learning and its benefits in machine learning.
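The idea above can be illustrated with a minimal sketch comparing a single decision tree to a Random Forest on synthetic data. The dataset and parameter values here are illustrative, not the module's competition data:

```python
# Sketch: a Random Forest vs. a single decision tree on synthetic data.
# Dataset and hyperparameter values are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A lone tree tends to overfit; the forest averages many trees,
# each trained on a bootstrap sample with random feature subsets.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print("Tree accuracy:  ", tree.score(X_test, y_test))
print("Forest accuracy:", forest.score(X_test, y_test))
```

On most random splits the ensemble's test accuracy matches or beats the single tree, which is the stability benefit the overview describes.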
Learning Objectives
By the end of this module, you should be able to:
- Understand how Random Forests combine multiple decision trees
- Implement Random Forest models for classification and regression
- Configure important hyperparameters like n_estimators and max_features
- Interpret feature importances from Random Forest models
- Identify the advantages of Random Forests over single decision trees
- Apply Random Forests to real-world machine learning problems
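As a preview of the hyperparameter and feature-importance objectives above, here is a minimal sketch using scikit-learn's regressor; the data is synthetic and the parameter values are illustrative, not tuned:

```python
# Sketch: configuring n_estimators / max_features and reading
# feature_importances_. All values here are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=5, random_state=0)

model = RandomForestRegressor(
    n_estimators=200,     # number of trees in the ensemble
    max_features="sqrt",  # features considered at each split
    random_state=0,
).fit(X, y)

# Importances sum to 1; larger values mean the feature contributed
# more to reducing impurity across the forest's splits.
for i, imp in enumerate(model.feature_importances_):
    print(f"feature_{i}: {imp:.3f}")
```

The same `n_estimators` and `max_features` arguments apply to `RandomForestClassifier`, which the guided project uses for the Tanzania waterpumps task.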
Guided Project
Random Forests with the Tanzania Waterpumps Dataset
Resources:
- GitHub Repository
- Guided Project Notebook: LS_DS_222.ipynb
- Slides
Module Assignment
Random Forests: Kaggle Competition
It's time to apply what you've learned about Random Forests to improve your predictions in the Kaggle competition!
Assignment Notebook Name: LS_DS_222_assignment.ipynb
Tasks:
- Sign up for a Kaggle account, join the Kaggle competition, and download the water pump dataset.
- Modify the wrangle function to engineer a new feature, then use it to import the training and test data.
- Split training data into feature matrix X and target vector y.
- Split feature matrix X and target vector y into training and validation sets.
- Establish the baseline accuracy score for your dataset.
- Build and train model_rf.
- Calculate the training and validation accuracy score for your model.
- Adjust your model's max_depth and n_estimators to reduce overfitting.
- Generate a list of predictions for X_test.
- Stretch goal: Create a submissions.csv file and upload it to the Kaggle competition site.
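The core modeling steps in the list above can be sketched as follows. A small stand-in DataFrame replaces the actual water pump data (which must be downloaded from Kaggle), so the column names here are hypothetical:

```python
# Sketch of the assignment workflow on stand-in data.
# Real column names come from the Kaggle water pump dataset;
# these are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "amount_tsh": range(200),
    "gps_height": [i % 50 for i in range(200)],
    "status_group": ["functional" if i % 3 else "non functional" for i in range(200)],
})

# Split into feature matrix X and target vector y.
target = "status_group"
X = df.drop(columns=target)
y = df[target]

# Split into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Baseline accuracy: always predict the majority class.
baseline_acc = y_train.value_counts(normalize=True).max()
print("Baseline accuracy:", baseline_acc)

# Build and train model_rf; max_depth and n_estimators are the
# main levers for reducing overfitting.
model_rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
model_rf.fit(X_train, y_train)

print("Training accuracy:  ", model_rf.score(X_train, y_train))
print("Validation accuracy:", model_rf.score(X_val, y_val))
```

A large gap between training and validation accuracy signals overfitting; lowering max_depth or raising min_samples_leaf usually narrows it, which is the tuning step the assignment asks for.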
Libraries to use: category_encoders, matplotlib, pandas, ydata-profiling, sklearn