Module 4: Logistic Regression

Overview

In this module, you will transition from regression to classification with logistic regression. You'll implement train-validate-test splits, understand classification baselines, and learn about scikit-learn pipelines. These skills will enable you to build and evaluate models for binary classification problems.

Learning Objectives

Objective 1: Determine a baseline for classification

Learn how to establish reference points for evaluating your classification models.

  • Understanding the concept of classification baselines
  • Implementing majority class, stratified, and prior probability baselines
  • Calculating baseline accuracy, precision, and recall
  • Using baselines to contextualize model performance
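
As a minimal sketch of the three baseline types listed above, scikit-learn's DummyClassifier offers the strategies "most_frequent" (majority class), "stratified", and "prior". The label array below is made up for illustration (70 negatives, 30 positives) and is not from the course data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical imbalanced labels: 70 negatives and 30 positives
y = np.array([0] * 70 + [1] * 30)
X = np.zeros((len(y), 1))  # DummyClassifier ignores the feature values

baseline_scores = {}
for strategy in ("most_frequent", "stratified", "prior"):
    baseline = DummyClassifier(strategy=strategy, random_state=42)
    baseline.fit(X, y)
    y_pred = baseline.predict(X)
    baseline_scores[strategy] = {
        "accuracy": accuracy_score(y, y_pred),
        # zero_division=0 avoids a warning when no positives are predicted
        "precision": precision_score(y, y_pred, zero_division=0),
        "recall": recall_score(y, y_pred, zero_division=0),
    }

for strategy, scores in baseline_scores.items():
    print(strategy, scores)
```

The majority class baseline never predicts the positive class, so its recall is zero even though its accuracy matches the majority share. A model is only useful if it clearly beats these reference points.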

Objective 2: Implement a train-validate-test split

Learn how to divide your data into three sets for more robust model evaluation.

  • Understanding the purpose of train-validate-test splits
  • Implementing multi-stage splits with scikit-learn
  • Maintaining class distributions across splits
  • Avoiding data leakage between splits
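
One common way to get three sets is to call train_test_split twice. The sketch below assumes a 60/20/20 split and uses synthetic data; the proportions, variable names, and random seeds are illustrative choices, not requirements of the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 1000 rows, 25% positive class
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 750 + [1] * 250)

# Stage 1: hold out 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stage 2: 25% of the remaining 80% is 20% of the total, used for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())
```

Passing stratify keeps the class proportions similar in every split. To avoid leakage, fit any preprocessing (scalers, encoders, imputers) on the training set only and reuse it to transform the validation and test sets.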

Objective 3: Fit a Logistic Regression classification model

Learn how to implement and fine-tune logistic regression for binary classification.

  • Understanding the logistic function and decision boundary
  • Implementing logistic regression in scikit-learn
  • Tuning regularization strength and solver parameters
  • Handling class imbalance in logistic regression
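
The sketch below fits LogisticRegression on a synthetic imbalanced dataset and checks that the predicted probability is the logistic function applied to the model's linear score. The values chosen for C, solver, and class_weight are illustrative assumptions rather than recommended settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary problem with roughly an 80/20 class imbalance
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.8, 0.2], random_state=42,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# C is the inverse of regularization strength (smaller C = stronger penalty);
# class_weight="balanced" reweights classes inversely to their frequency
model = LogisticRegression(
    C=1.0, solver="lbfgs", class_weight="balanced", max_iter=1000
)
model.fit(X_train, y_train)

# Linear score w.x + b for each validation row
scores = model.decision_function(X_val)

# The logistic (sigmoid) function maps each score to a probability
manual_proba = 1 / (1 + np.exp(-scores))
sklearn_proba = model.predict_proba(X_val)[:, 1]

val_accuracy = model.score(X_val, y_val)
print(val_accuracy)
```

The decision boundary is where the linear score equals zero, which corresponds to a predicted probability of 0.5.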

Objective 4: Create a scikit-learn pipeline

Learn to build streamlined workflows for data preprocessing and model training.

  • Understanding the benefits of scikit-learn pipelines
  • Building pipelines with preprocessing steps and models
  • Combining feature transformers with FeatureUnion
  • Using pipelines for cross-validation and grid search
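
As a rough sketch of how these pieces fit together, the pipeline below scales the data, combines two feature transformers with FeatureUnion, and tunes the regularization strength with GridSearchCV. The choice of PCA and SelectKBest as the combined transformers, the step names, and the parameter grid are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

# FeatureUnion concatenates the outputs of its transformers column-wise:
# 2 principal components + 3 selected original features = 5 columns
features = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("kbest", SelectKBest(f_classif, k=3)),
])

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step parameters are addressed as <step name>__<parameter name>
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_C = search.best_params_["clf__C"]
best = search.best_estimator_
n_combined = best.named_steps["features"].transform(
    best.named_steps["scale"].transform(X)
).shape[1]
print(best_C, search.best_score_, n_combined)
```

Because the preprocessing steps live inside the pipeline, each cross-validation fold refits them on its own training portion, which keeps information from the held-out fold out of the preprocessing.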

Guided Project

Logistic Regression

The notebook for this guided project is JDS_SHR_214_guided_project_notes.ipynb in the GitHub repository.

Module Assignment

Logistic Regression Assignment

In this module assignment, found in the file LS_DS_214_assignment.ipynb in the GitHub repository, you'll apply logistic regression to a classification problem.

Tasks:

  1. Load adult.csv using the wrangle function
  2. Split data into feature matrix X and target vector y
  3. Split data into train, validation, and test sets
  4. Establish accuracy and confusion matrix baselines for your dataset
  5. Build a transformation pipeline for preprocessing
  6. Build and train a LogisticRegression model
  7. Evaluate the model with validation accuracy and F1 score
  8. Use your model to predict the test set and calculate test accuracy
  9. Interpret the coefficients from your logistic regression model
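
The evaluation and interpretation tasks can be sketched as follows. This example does not use adult.csv or the wrangle function; it runs on synthetic data, and the feature names (feat_a through feat_d) are hypothetical placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a wrangled dataset; feature names are hypothetical
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0,
    random_state=0,
)
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline accuracy: share of the majority class in the training labels
baseline_acc = np.bincount(y_train).max() / len(y_train)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_val)

val_acc = accuracy_score(y_val, y_pred)
val_f1 = f1_score(y_val, y_pred)
cm = confusion_matrix(y_val, y_pred)  # rows = true class, columns = predicted

# Coefficients are in log-odds units per one standard deviation of each
# (scaled) feature; exponentiating gives odds ratios
coefs = model.named_steps["logisticregression"].coef_[0]
odds_ratios = dict(zip(feature_names, np.exp(coefs)))

print(baseline_acc, val_acc, val_f1)
print(cm)
print(odds_ratios)
```

An odds ratio above 1 means the feature raises the odds of the positive class as it increases, and a value below 1 means it lowers them. The test set should be scored only once, after model choices have been settled on the validation set.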