Module 4: Classification Metrics
Module Overview
In this module, you will learn about various classification metrics that are essential for evaluating binary classification models. You'll explore concepts beyond simple accuracy, including confusion matrices, precision, recall, and ROC curves. These metrics are crucial for understanding how your model performs, especially with imbalanced datasets.
Learning Objectives
1. Get and Interpret a Confusion Matrix
Learn how to create, visualize, and interpret confusion matrices to understand your classifier's performance.
- Understanding the structure of a confusion matrix
- Interpreting true positives, true negatives, false positives, and false negatives
- Visualizing a confusion matrix as a heatmap
- Identifying patterns in misclassifications
- Using ConfusionMatrixDisplay.from_estimator in scikit-learn
2. Use Precision and Recall
Learn when accuracy isn't sufficient and how to use precision and recall to better evaluate your models.
- Understanding why accuracy isn't always sufficient, especially for imbalanced classes
- Calculating precision: TP / (TP + FP), the proportion of positive predictions that were actually correct
- Calculating recall: TP / (TP + FN), the proportion of actual positives that were correctly identified
- Understanding the trade-off between precision and recall
- Calculating the F1 score to balance precision and recall
- Using the classification_report function in scikit-learn
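As a worked illustration of the formulas above, the sketch below computes precision, recall, and F1 by hand on a small set of made-up labels (the labels are an assumption for illustration) and then compares the results with scikit-learn's own functions.

```python
from sklearn.metrics import (
    classification_report,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels for illustration: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Count the confusion-matrix cells needed by the formulas.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)  # TP / (TP + FP)
recall = tp / (tp + fn)     # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")

# scikit-learn reports the same numbers, per class, in one table.
print(classification_report(y_true, y_pred))
```

Here accuracy is 0.70 even though the model finds only half of the actual positives, which is the kind of gap that precision and recall are meant to expose.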
3. Understand the relationships between classification thresholds, metrics, and predicted probabilities
Learn how classification thresholds affect the balance between precision and recall.
- Understanding classification thresholds and their default values
- Working with predicted probabilities using predict_proba()
- Adjusting probability thresholds based on business requirements
- Balancing false positives and false negatives through threshold adjustment
- Real-world examples of threshold adjustments in medical and financial applications
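The effect of moving the threshold can be sketched as follows. The synthetic data, the LogisticRegression model, and the three threshold values are assumptions chosen for illustration; in practice the threshold would come from the costs of false positives and false negatives in your application.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (an assumption; not the course dataset).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Column 1 of predict_proba holds the predicted probability of class 1.
proba = model.predict_proba(X_val)[:, 1]

results = []
for threshold in (0.3, 0.5, 0.7):  # illustrative values only
    # Label an observation positive when its probability exceeds the threshold.
    y_pred = (proba > threshold).astype(int)
    precision = precision_score(y_val, y_pred, zero_division=0)
    recall = recall_score(y_val, y_pred)
    results.append((threshold, precision, recall))
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold can only shrink the set of predicted positives, so recall never increases as the threshold goes up. Precision usually rises, although that is not guaranteed.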
Guided Project
Module Assignment
Classification Metrics: Kaggle Competition
In this module assignment, you'll continue working with the Tanzania Waterpumps Kaggle competition:
Assignment Notebook Name: LS_DS_224_assignment.ipynb
Tasks:
- Use the wrangle function to import the training and test data.
- Split training data into feature matrix X and target vector y.
- Split training data into training and validation sets.
- Establish the baseline accuracy score for your dataset.
- Build a model.
- Calculate the training and validation accuracy score for your model.
- Plot the confusion matrix for your model.
- Print the classification report for your model.
- Identify likely 'non-functional' pumps in the test set.
- Find the likely 'non-functional' pumps serving the largest populations.
- Plot the locations of the pumps found in Task 10, the previous task. (stretch goal)
Libraries to use: category_encoders, matplotlib, pandas, ydata-profiling, plotly, sklearn
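Two of the tasks above, the baseline accuracy and ranking likely 'non-functional' pumps by population, can be sketched with pandas. Everything in this example is hypothetical: the toy values, the column names pump_id, population, and prob_needs_repair, and the 0.85 threshold are placeholders, not the assignment's actual data or solution.

```python
import pandas as pd

# Hypothetical target vector: 6 functional pumps, 4 non-functional pumps.
y_train = pd.Series(["functional"] * 6 + ["non-functional"] * 4)

# Baseline accuracy: the score from always predicting the majority class.
baseline_acc = y_train.value_counts(normalize=True).max()
print(f"Baseline accuracy: {baseline_acc:.2f}")

# Hypothetical test-set frame; prob_needs_repair stands in for the
# positive-class column of a fitted model's predict_proba output.
pumps = pd.DataFrame({
    "pump_id": [101, 102, 103, 104, 105, 106],
    "population": [250, 40, 1200, 0, 600, 90],
    "prob_needs_repair": [0.91, 0.35, 0.88, 0.97, 0.62, 0.15],
})

# Keep only pumps the model is fairly sure about (threshold is an assumption),
# then put the pumps serving the most people first.
threshold = 0.85
likely_broken = pumps[pumps["prob_needs_repair"] > threshold]
priority = likely_broken.sort_values("population", ascending=False)
print(priority)
```

Sorting by population after filtering on probability is one reasonable way to prioritize repairs. Other rankings, such as probability multiplied by population, are equally defensible.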