Module 4: Classification Metrics

Module Overview

In this module, you will learn the classification metrics used to evaluate binary classification models. You'll go beyond simple accuracy to explore confusion matrices, precision, recall, and ROC curves. These metrics show where your model succeeds and where it fails, which matters most on imbalanced datasets, where accuracy alone can be misleading.

Learning Objectives

1. Create and interpret a confusion matrix

Learn how to create, visualize, and interpret confusion matrices to understand your classifier's performance.

  • Understanding the structure of a confusion matrix
  • Interpreting true positives, true negatives, false positives, and false negatives
  • Visualizing a confusion matrix as a heatmap
  • Identifying patterns in misclassifications
  • Using ConfusionMatrixDisplay.from_estimator in scikit-learn
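As a minimal sketch of the structure described above, the four cells of a binary confusion matrix can be counted by hand. The labels below are hypothetical values made up for illustration; in practice scikit-learn's confusion_matrix and ConfusionMatrixDisplay do this counting and the plotting for you. The layout assumed here (rows are actual classes, columns are predicted classes, negative class first) matches scikit-learn's convention.

```python
from collections import Counter

# Hypothetical labels for illustration: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# Count each (actual, predicted) pair once.
counts = Counter(zip(y_true, y_pred))

tp = counts[(1, 1)]  # actual positive, predicted positive
tn = counts[(0, 0)]  # actual negative, predicted negative
fp = counts[(0, 1)]  # actual negative, predicted positive
fn = counts[(1, 0)]  # actual positive, predicted negative

# Rows are actual classes, columns are predicted classes, ordered [0, 1].
matrix = [[tn, fp],
          [fn, tp]]

for row in matrix:
    print(row)
```

Reading the off-diagonal cells tells you which kind of mistake the classifier makes more often: here it misses positives (false negatives) more often than it raises false alarms (false positives).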

2. Use Precision and Recall

Learn when accuracy isn't sufficient and how to use precision and recall to better evaluate your models.

  • Understanding why accuracy isn't always sufficient, especially for imbalanced classes
  • Calculating precision: TP/(TP + FP), the proportion of positive predictions that are actually correct
  • Calculating recall: TP/(TP + FN), the proportion of actual positives that are correctly identified
  • Understanding the trade-off between precision and recall
  • Calculating the F1 score to balance precision and recall
  • Using the classification_report function in scikit-learn
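The formulas above can be worked through on a small example. This is a sketch using hypothetical confusion-matrix counts (not results from any real model); the helper function names are illustrative only. scikit-learn's classification_report reports the same quantities per class.

```python
# Hypothetical counts for illustration.
tp, tn, fp, fn = 3, 4, 1, 2

def precision(tp, fp):
    """Proportion of positive predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Proportion of actual positives that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

accuracy = (tp + tn) / (tp + tn + fp + fn)
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f1_score(p, r)

print(f"accuracy={accuracy:.2f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a model cannot score well on F1 by maximizing only precision or only recall.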

3. Understand the relationships between classification thresholds, metrics, and predicted probabilities

Learn how classification thresholds affect the balance between precision and recall.

  • Understanding classification thresholds and their default values
  • Working with predicted probabilities using predict_proba()
  • Adjusting probability thresholds based on business requirements
  • Balancing false positives and false negatives through threshold adjustment
  • Real-world examples of threshold adjustments in medical and financial applications
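To illustrate the threshold trade-off, the sketch below applies two different cutoffs to the same set of predicted probabilities. The probabilities are hypothetical stand-ins for the positive-class column that predict_proba() would return, and the helper names are made up for this example. The 0.5 cutoff is assumed here to mirror what predict() effectively does for a binary classifier.

```python
# Hypothetical true labels and predicted positive-class probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
proba = [0.9, 0.4, 0.35, 0.8, 0.32, 0.55, 0.45, 0.1]

def apply_threshold(probabilities, threshold):
    """Predict the positive class when the probability meets the threshold."""
    return [1 if prob >= threshold else 0 for prob in probabilities]

def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    return prec, rec

results = {}
for threshold in (0.5, 0.3):
    y_pred = apply_threshold(proba, threshold)
    results[threshold] = precision_recall(y_true, y_pred)
    prec, rec = results[threshold]
    print(f"threshold={threshold}: precision={prec:.2f} recall={rec:.2f}")
```

In this toy data, lowering the threshold from 0.5 to 0.3 catches every actual positive (recall rises) at the cost of more false positives (precision falls). Which direction to move depends on whether a false negative or a false positive is more costly for the application.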

Guided Project

Classification Metrics

Resources:

Module Assignment

Classification Metrics: Kaggle Competition

In this module assignment, you'll continue working with the Tanzania Waterpumps Kaggle competition:

Assignment Notebook Name: LS_DS_224_assignment.ipynb

Tasks:

  1. Use the wrangle function to import training and test data.
  2. Split training data into feature matrix X and target vector y.
  3. Split training data into training and validation sets.
  4. Establish the baseline accuracy score for your dataset.
  5. Build a model.
  6. Calculate the training and validation accuracy score for your model.
  7. Plot the confusion matrix for your model.
  8. Print the classification report for your model.
  9. Identify likely 'non-functional' pumps in the test set.
  10. Find likely 'non-functional' pumps serving the largest populations.
  11. Plot pump locations from Task 10. (stretch goal)
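Two of the ideas in the task list can be sketched on toy data: a majority-class baseline accuracy (Task 4) and ranking likely 'non-functional' pumps by population (Tasks 9 and 10). Everything below is hypothetical and for illustration only: the labels, the pump records, the field names, and the 0.5 cutoff are assumptions, not values from the competition data or the assignment notebook.

```python
from collections import Counter

# Hypothetical training labels: 1 = non-functional, 0 = functional.
y_train = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

# Baseline: always predict the most common class.
majority_class, majority_count = Counter(y_train).most_common(1)[0]
baseline_accuracy = majority_count / len(y_train)
print(f"baseline accuracy: {baseline_accuracy:.2f}")

# Hypothetical test-set rows: predicted P(non-functional) and population served.
pumps = [
    {"id": "a", "proba": 0.92, "population": 120},
    {"id": "b", "proba": 0.15, "population": 900},
    {"id": "c", "proba": 0.71, "population": 450},
    {"id": "d", "proba": 0.55, "population": 30},
    {"id": "e", "proba": 0.48, "population": 700},
]

threshold = 0.5  # assumed cutoff; adjust to the cost of missing a broken pump

# Keep pumps predicted non-functional, then sort by population, largest first.
likely_broken = [pump for pump in pumps if pump["proba"] >= threshold]
by_population = sorted(likely_broken, key=lambda pump: pump["population"], reverse=True)
top_ids = [pump["id"] for pump in by_population[:2]]
print(top_ids)
```

A model is only useful if its validation accuracy beats the baseline, and the ranking step shows how predicted probabilities can be combined with another column to prioritize which pumps to inspect first.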

Libraries to use: category_encoders, matplotlib, pandas, ydata-profiling, plotly, sklearn

Assignment GitHub Repository

Additional Resources