Module 1: Decision Trees

Overview

In this module, you will learn about decision trees, one of the most intuitive and widely used machine learning algorithms. Decision trees are versatile models that can be used for both classification and regression tasks, making them essential tools in a data scientist's toolkit.

You'll explore how decision trees work, from the basic concepts of node splitting to practical implementation using scikit-learn. You'll also learn about the strengths and limitations of decision trees, setting the foundation for more advanced tree-based methods in later modules.

Learning Objectives

By the end of this module, you should be able to:

  • Understand the fundamental concepts behind decision trees
  • Implement decision tree classifiers and regressors using scikit-learn
  • Interpret decision tree visualizations and understand the importance of feature selection
  • Recognize the trade-offs between model complexity and generalization performance
  • Apply decision trees to solve real-world problems
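To ground the objectives above, here is a minimal sketch of fitting a decision tree classifier with scikit-learn. It uses the built-in Iris dataset as a stand-in; the model name and hyperparameters are illustrative, not prescribed by this module.

```python
# Minimal decision tree classifier with scikit-learn (illustrative sketch).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset as a stand-in for real course data
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# max_depth caps tree growth, trading training fit for generalization
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

print("Training accuracy:", model.score(X_train, y_train))
print("Validation accuracy:", model.score(X_val, y_val))
```

Comparing the two accuracy scores is the quickest way to see the complexity/generalization trade-off named in the last two objectives: a large gap between them signals overfitting.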

Guided Project

Decision Trees with the Tanzania Waterpumps Dataset

Resources:

Module Assignment

Decision Trees: Kaggle Competition

It's Kaggle competition time! In this assignment, you'll apply what you've learned about decision trees to a real-world dataset.

Assignment Notebook: LS_DS_221_assignments.ipynb

Tasks:

  1. Sign up for a Kaggle account, join the Kaggle competition, and download the water pump dataset.
    Getting Started with Kaggle

    If this is your first time using Kaggle, here's how to get started:

    • Create an Account: Visit Kaggle.com and register with your email
    • Join the Competition: Navigate to the competition page and click "Join Competition"
    • Download Data: Go to the "Data" tab and download the dataset files

    Watch this walkthrough video for detailed instructions:

    Kaggle Challenge Setup Resource

    Additional Kaggle resources:

  2. Use the wrangle function to import the training and test data.
  3. Split the training data into a feature matrix X and a target vector y.
  4. Split the feature matrix X and target vector y into training and validation sets.
  5. Establish the baseline accuracy score for your dataset.
  6. Build and train your model, model_dt.
  7. Calculate the training and validation accuracy scores for your model.
  8. Adjust your model's max_depth to reduce overfitting.
  9. Stretch goal: Create a horizontal bar chart showing the 10 most important features for your model.
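The tasks above can be sketched end to end. Since the competition's wrangle function and waterpumps data are assignment-specific, this sketch substitutes a small synthetic DataFrame; the column names (gps_height, population, status_group) are hypothetical stand-ins, and only the variable name model_dt comes from the assignment.

```python
# Sketch of tasks 3-8 on synthetic stand-in data (not the real waterpumps dataset).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Fabricated data; in the assignment, df would come from your wrangle function
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "gps_height": rng.integers(0, 2000, 1000),
    "population": rng.integers(0, 500, 1000),
    "status_group": rng.choice(["functional", "non functional"], 1000, p=[0.6, 0.4]),
})

# Task 3: split into feature matrix X and target vector y
target = "status_group"
X = df.drop(columns=target)
y = df[target]

# Task 4: split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Task 5: baseline accuracy = relative frequency of the majority class
baseline_acc = y_train.value_counts(normalize=True).max()
print("Baseline accuracy:", round(baseline_acc, 3))

# Tasks 6 and 8: build and train model_dt; max_depth limits overfitting
model_dt = DecisionTreeClassifier(max_depth=5, random_state=42)
model_dt.fit(X_train, y_train)

# Task 7: training and validation accuracy
print("Training accuracy:", model_dt.score(X_train, y_train))
print("Validation accuracy:", model_dt.score(X_val, y_val))
```

On the real dataset you would also encode categorical features (e.g. with category_encoders from the library list) before fitting, since scikit-learn trees require numeric input.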

Libraries to use: category_encoders, matplotlib, pandas, ydata-profiling, sklearn
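The stretch goal in task 9 can be sketched with matplotlib and a fitted tree's feature_importances_ attribute. The breast cancer dataset here is a stand-in with enough named features to rank; the filename and hyperparameters are illustrative.

```python
# Sketch of the stretch goal: horizontal bar chart of the 10 most important features.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset with 30 named features
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model_dt = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X, y)

# Importances are normalized to sum to 1; sort ascending so barh puts the
# largest bar on top
importances = pd.Series(model_dt.feature_importances_, index=X.columns)
top10 = importances.sort_values().tail(10)

top10.plot(kind="barh")
plt.xlabel("Gini importance")
plt.title("Top 10 feature importances")
plt.tight_layout()
plt.savefig("feature_importances.png")
```

For the assignment, you would fit model_dt on the waterpumps training data instead and index the importances by your own feature columns.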

Assignment GitHub Repository

Additional Resources