Module 1: Decision Trees

Overview

In this module, you will learn about decision trees, one of the most intuitive and widely used machine learning algorithms. Decision trees are versatile models that can be used for both classification and regression tasks, making them essential tools in a data scientist's toolkit.

You'll explore how decision trees work, from the basic concepts of node splitting to practical implementation using scikit-learn. You'll also learn about the strengths and limitations of decision trees, setting the foundation for more advanced tree-based methods in later modules.

Learning Objectives

By the end of this module, you should be able to:

  • Understand the fundamental concepts behind decision trees
  • Implement decision tree classifiers and regressors using scikit-learn
  • Interpret decision tree visualizations and understand the importance of feature selection
  • Recognize the trade-offs between model complexity and generalization performance
  • Apply decision trees to solve real-world problems
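To ground the objectives above, here is a minimal sketch of fitting a decision tree classifier with scikit-learn. It uses the built-in Iris dataset as a stand-in; the model name and hyperparameters are illustrative, not prescribed by this module.

```python
# Minimal decision tree classifier with scikit-learn (illustrative sketch).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset as a stand-in for real course data
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# max_depth caps tree growth, trading training fit for generalization
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

print("Training accuracy:", model.score(X_train, y_train))
print("Validation accuracy:", model.score(X_val, y_val))
```

Comparing the two accuracy scores is the quickest way to see the complexity/generalization trade-off named in the last two objectives: a large gap between them signals overfitting.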

Guided Project

Decision Trees with the Tanzania Waterpumps Dataset

Resources:

Module Assignment

Decision Trees: Kaggle Competition

It's Kaggle competition time! In this assignment, you'll apply what you've learned about decision trees to a real-world dataset.

Assignment Notebook: LS_DS_221_assignments.ipynb

Tasks:

  1. Sign up for a Kaggle account, join the Kaggle competition, and download the water pump dataset.
    Getting Started with Kaggle

    If this is your first time using Kaggle, here's how to get started:

    • Create an Account: Visit Kaggle.com and register with your email
    • Join the Competition: Navigate to the competition page and click "Join Competition"
    • Download Data: Go to the "Data" tab and download the dataset files

    Watch this walkthrough video for detailed instructions:

    Kaggle Challenge Setup Resource

    Additional Kaggle resources:

  2. Use the wrangle function to import the training and test data.
  3. Split the training data into a feature matrix X and a target vector y.
  4. Split the feature matrix X and target vector y into training and validation sets.
  5. Establish the baseline accuracy score for your dataset.
  6. Build and train your model, model_dt.
  7. Calculate the training and validation accuracy scores for your model.
  8. Adjust your model's max_depth to reduce overfitting.
  9. Stretch goal: Create a horizontal bar chart showing the 10 most important features for your model.
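The tasks above can be sketched end to end. Since the competition's wrangle function and waterpumps data are assignment-specific, this sketch substitutes a small synthetic DataFrame; the column names (gps_height, population, status_group) are hypothetical stand-ins, and only the variable name model_dt comes from the assignment.

```python
# Sketch of tasks 3-8 on synthetic stand-in data (not the real waterpumps dataset).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Fabricated data; in the assignment, df would come from your wrangle function
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "gps_height": rng.integers(0, 2000, 1000),
    "population": rng.integers(0, 500, 1000),
    "status_group": rng.choice(["functional", "non functional"], 1000, p=[0.6, 0.4]),
})

# Task 3: split into feature matrix X and target vector y
target = "status_group"
X = df.drop(columns=target)
y = df[target]

# Task 4: split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Task 5: baseline accuracy = relative frequency of the majority class
baseline_acc = y_train.value_counts(normalize=True).max()
print("Baseline accuracy:", round(baseline_acc, 3))

# Tasks 6 and 8: build and train model_dt; max_depth limits overfitting
model_dt = DecisionTreeClassifier(max_depth=5, random_state=42)
model_dt.fit(X_train, y_train)

# Task 7: training and validation accuracy
print("Training accuracy:", model_dt.score(X_train, y_train))
print("Validation accuracy:", model_dt.score(X_val, y_val))
```

On the real dataset you would also encode categorical features (e.g. with category_encoders from the library list) before fitting, since scikit-learn trees require numeric input.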

Libraries to use: category_encoders, matplotlib, pandas, ydata-profiling, sklearn
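The stretch goal in task 9 can be sketched with matplotlib and a fitted tree's feature_importances_ attribute. The breast cancer dataset here is a stand-in with enough named features to rank; the filename and hyperparameters are illustrative.

```python
# Sketch of the stretch goal: horizontal bar chart of the 10 most important features.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset with 30 named features
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model_dt = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X, y)

# Importances are normalized to sum to 1; sort ascending so barh puts the
# largest bar on top
importances = pd.Series(model_dt.feature_importances_, index=X.columns)
top10 = importances.sort_values().tail(10)

top10.plot(kind="barh")
plt.xlabel("Gini importance")
plt.title("Top 10 feature importances")
plt.tight_layout()
plt.savefig("feature_importances.png")
```

For the assignment, you would fit model_dt on the waterpumps training data instead and index the importances by your own feature columns.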

Assignment GitHub Repository

Additional Resources