Sprint Challenge
Overview
After a sprint of Natural Language Processing, you've learned some cool new techniques: how to process text, how to turn text into vectors, and how to model topics from documents. In this Sprint Challenge, you will apply your newly acquired skills to one of the most famous NLP datasets out there: Yelp. As part of the job selection process, many professionals are asked to create analyses of this dataset, so this challenge will give you a head start on a common industry task.
The real Yelp dataset is massive (almost 8 GB uncompressed), but we've sampled it down to something more manageable for this Sprint Challenge. If you're interested, you can analyze the full dataset as a stretch goal after completing the challenge.
Note: This challenge will test your ability to apply various NLP techniques to real-world data and extract meaningful insights from text.
Setup
Sprint Challenge GitHub Repository
Required Packages
For this challenge, you will need to import the following packages, modules, and classes:
- spaCy (with the 'en_core_web_sm' model)
- Pandas
- Seaborn
- Matplotlib
- sklearn.neighbors.NearestNeighbors
- sklearn.pipeline.Pipeline
- sklearn.feature_extraction.text.TfidfVectorizer
- sklearn.neighbors.KNeighborsClassifier
- sklearn.model_selection.GridSearchCV
- gensim.corpora
- gensim.models.LdaModel
- re (regular expressions)
Data Import
You will need to load the raw Yelp review data from the following URL:
Challenge Objectives
There are 8 total possible points in this sprint challenge. Successfully completing all objectives earns you full credit. Here are the main components of the challenge:
Part 1: Tokenize Yelp Reviews
Create a function that:
- Accepts one document (review text) at a time
- Returns a list of tokens
- Effectively processes text for further analysis
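As a rough starting point, the tokenizer described above could look like the sketch below. It uses only the `re` module from the package list so that it runs without downloading a model. The function name `tokenize`, the tiny `STOP_WORDS` set, and the sample sentence are assumptions made for illustration, not part of the challenge notebook. A spaCy-based version using the 'en_core_web_sm' model is an equally valid approach and would add lemmatization.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "of", "to", "was", "in", "i"}


def tokenize(doc):
    """Return a list of lowercase word tokens for one review."""
    # Lowercase, then keep runs of letters, digits, and apostrophes.
    words = re.findall(r"[a-z0-9']+", doc.lower())
    # Trim stray apostrophes left over from quotes.
    tokens = [w.strip("'") for w in words]
    # Drop empty strings and stop words.
    return [t for t in tokens if t and t not in STOP_WORDS]


sample = "The tacos were AMAZING, and the service was fast!"
tokens = tokenize(sample)
print(tokens)
```

The function takes one document at a time, so it can be applied to a whole column with something like `df['text'].apply(tokenize)`, assuming the review text lives in a column with that name.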
Part 2: Create Vector Representations
In this section, you will:
- Create a document-term matrix from the reviews
- Write a fake review for testing purposes
- Use a NearestNeighbors model to find the 10 most similar reviews to your fake review
- Display the text of these similar reviews
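The steps above can be sketched as follows. The twelve toy reviews and the fake review are made up for illustration and stand in for the sampled Yelp data. The choice of TF-IDF weighting and cosine distance is an assumption; a plain count-based document-term matrix would also satisfy the objective.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-in for the review text column.
reviews = [
    "Great tacos and friendly service",
    "The pizza was cold and the staff was rude",
    "Amazing coffee and cozy atmosphere",
    "Terrible service, waited an hour for tacos",
    "Best burger in town with crispy fries",
    "The sushi was fresh and the service was fast",
    "Overpriced coffee and slow service",
    "Loved the pizza, great crust and sauce",
    "Dirty tables and bland burger",
    "Friendly staff and delicious sushi",
    "Fries were soggy but the burger was great",
    "Cozy place with great coffee and pastries",
]

# Document-term matrix: one row per review, one column per term.
vectorizer = TfidfVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(reviews)

# Brute-force search works directly on the sparse matrix.
nn = NearestNeighbors(n_neighbors=10, metric="cosine", algorithm="brute")
nn.fit(dtm)

# The fake review must go through the SAME fitted vectorizer.
fake_review = "The service was friendly and the tacos were great"
fake_vec = vectorizer.transform([fake_review])

distances, indices = nn.kneighbors(fake_vec)
similar_reviews = [reviews[i] for i in indices[0]]

for dist, text in zip(distances[0], similar_reviews):
    print(f"{dist:.3f}  {text}")
```

Note that the fake review is transformed, not fit, so that its vector shares the vocabulary of the document-term matrix.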
Part 3: Classification Model
Your goal is to predict star ratings from review text:
- Create a pipeline with a vectorizer and a classifier
- Build parameter dictionaries for grid search
- Train the model to predict the 'stars' feature
- Use the model to predict a star rating for your fake review from Part 2
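One possible shape for this pipeline is sketched below. The eight toy reviews, their star labels, the step names `vect` and `clf`, and the specific grid values are all assumptions for illustration. On the real data you would search a wider grid and use more cross-validation folds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical stand-ins for the review text and 'stars' columns.
texts = [
    "Amazing food and friendly staff",
    "Best pizza in town, loved it",
    "Great service and delicious tacos",
    "Wonderful coffee, friendly and fast",
    "Terrible food and rude staff",
    "Worst pizza ever, hated it",
    "Awful service and cold tacos",
    "Horrible coffee, slow and rude",
]
stars = [5, 5, 5, 5, 1, 1, 1, 1]

# Vectorizer and classifier chained into one estimator.
pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", KNeighborsClassifier()),
])

# Keys follow the '<step name>__<parameter>' convention.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__n_neighbors": [1, 3],
}

# cv=2 only because the toy data set is tiny.
grid = GridSearchCV(pipe, param_grid, cv=2)
grid.fit(texts, stars)

fake_review = "The service was friendly and the tacos were great"
predicted_stars = int(grid.predict([fake_review])[0])

print(grid.best_params_)
print(predicted_stars)
```

Because the vectorizer sits inside the pipeline, it is refit on each training fold, which keeps vocabulary from the validation fold out of training.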
Part 4: Topic Modeling
Analyze what the Yelp reviews are discussing:
- Estimate an LDA topic model with 5 topics
- Create visualizations of the results (using both pyLDAvis and matplotlib)
- Write a brief analysis of your topic model results
Submission
Before submitting your notebook, you must:
- Restart your notebook's kernel
- Run all cells sequentially from top to bottom (Cell → Run All)
- Comment out the cell that generates the pyLDAvis visual
- Ensure all tests are passing
To submit your Sprint Challenge:
- Complete all requirements in the Sprint Challenge notebook
- Follow the submission instructions provided in the challenge
- Submit your completed work through the Sprint Challenge assignment
Note: After your first submission, you may resubmit as many times as needed until you reach the passing grade of 100%.
Resources
Text Processing and Tokenization
Vectorization and Classification
- Scikit-Learn: Text Feature Extraction
- Scikit-Learn: NearestNeighbors Documentation
- Scikit-Learn: Pipeline Documentation