DS Unit 4 Sprint 13 - Sprint Challenge

Sprint Challenge Overview

This sprint challenge will assess your understanding of Natural Language Processing concepts covered throughout this sprint. You'll apply text preprocessing, vector representations, document classification, and topic modeling techniques to analyze the famous Yelp dataset.

Challenge Setup

To get started with the Sprint Challenge, follow these steps:

Access the Jupyter notebook using the link below.
Download the Yelp dataset from the provided data link.
You can complete the assignment locally or in Google Colab (if you use Colab, save a copy of the notebook to your Google Drive first).

Challenge Notebook Yelp Dataset

Challenge Expectations

The Sprint Challenge is designed to test your mastery of the following key concepts:

Text tokenization: Processing raw text and creating effective tokenization functions
Vector representations: Converting text to numerical features and finding document similarity
Document classification: Building pipelines to predict star ratings from review text
Topic modeling: Implementing LDA models to discover themes in documents
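As a rough illustration of the first two concepts, the sketch below tokenizes a few made-up review strings and compares them with cosine similarity over bag-of-words counts. It uses only the Python standard library. The stop-word list, the sample reviews, and the helper names (`tokenize`, `cosine_similarity`, `most_similar`) are assumptions for illustration and are not taken from the challenge notebook, which may expect spaCy or scikit-learn instead.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "it", "to", "of"}


def tokenize(text):
    """Lowercase the text, keep alphabetic runs, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up reviews standing in for the Yelp data.
reviews = [
    "The pizza was great and the service was fast.",
    "Great pizza, fast delivery!",
    "The haircut took an hour and it was overpriced.",
]

# One term-count vector per document (a sparse document-term matrix).
bags = [Counter(tokenize(r)) for r in reviews]


def most_similar(query_index):
    """Return the index of the review closest to the query review."""
    scores = [
        (cosine_similarity(bags[query_index], bag), i)
        for i, bag in enumerate(bags)
        if i != query_index
    ]
    return max(scores)[1]


print(tokenize(reviews[0]))
print(most_similar(0))
```

On real data a vectorizer plus a nearest-neighbors model would usually replace the hand-rolled similarity loop, but the idea is the same: tokens become counts, and counts become comparable vectors.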

What to Expect

In this sprint challenge, you'll apply everything you've learned about Natural Language Processing to work with real Yelp review data. This challenge will test your ability to:

Create effective tokenization functions that process text appropriately
Build document-term matrices and use nearest neighbors for similarity analysis
Construct classification pipelines with proper vectorization and parameter tuning
Implement topic models using Gensim and interpret the results meaningfully
Visualize NLP results using both pyLDAvis and matplotlib
Present your findings and analysis in a clear, structured manner
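The classification item in the list above could look something like the following sketch, which assumes scikit-learn is installed. The example texts, the star labels, the choice of `LogisticRegression`, and the parameter grid are all invented for illustration; the challenge notebook may specify a different estimator or grid.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up training data: review text and a star rating (1 or 5).
texts = [
    "Great food and friendly staff",
    "Amazing service, will come back",
    "Loved the pizza, great place",
    "Fantastic coffee and great prices",
    "Terrible service and cold food",
    "Rude staff, awful experience",
    "The food was bad and overpriced",
    "Worst haircut ever, terrible place",
]
stars = [5, 5, 5, 5, 1, 1, 1, 1]

# Vectorizer and classifier chained so both are tuned together.
pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Assumed search space: unigrams vs. unigrams+bigrams, and regularization strength.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# cv=2 only because the toy dataset is tiny; use more folds on real data.
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, stars)

predictions = search.predict(["The food was great"])
print(search.best_params_)
print(predictions)
```

Putting the vectorizer inside the pipeline means each cross-validation fold fits its vocabulary on the training portion only, which avoids leaking information from the held-out reviews.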

There are 8 total possible points in this sprint challenge, covering all four major NLP components from your modules!

Submission

To submit your Sprint Challenge:

Complete all requirements in the Sprint Challenge notebook
If using Google Colab, submit the sharing link to your completed notebook
If working locally, create a GitHub repository with your Jupyter notebook and submit the repository link
Ensure all cells run successfully and outputs are visible before submitting

Sprint Challenge: Natural Language Processing

Sprint Challenge Resources

Text Processing and Tokenization

Vector Representations and Classification

Topic Modeling and Visualization