Sprint Challenge

Overview

After a sprint of Natural Language Processing, you've learned some cool new techniques: how to process text, how to turn text into vectors, and how to model topics from documents. In this Sprint Challenge, you will apply those skills to one of the most famous NLP datasets out there: Yelp. Many employers ask candidates to analyze this dataset as part of their hiring process, so this challenge will give you a head start on a common industry task.

The real Yelp dataset is massive (almost 8 GB uncompressed), but we've sampled it down to something more manageable for this Sprint Challenge. If you're interested, you can analyze the full dataset as a stretch goal after completing the challenge.

Note: This challenge will test your ability to apply various NLP techniques to real-world data and extract meaningful insights from text.

Setup

Sprint Challenge GitHub Repository

Required Packages

For this challenge, you will need to import the following packages, modules, and classes:

  • spaCy (with the 'en_core_web_sm' model)
  • Pandas
  • Seaborn
  • Matplotlib
  • sklearn.neighbors.NearestNeighbors
  • sklearn.pipeline.Pipeline
  • sklearn.feature_extraction.text.TfidfVectorizer
  • sklearn.neighbors.KNeighborsClassifier
  • sklearn.model_selection.GridSearchCV
  • gensim.corpora
  • gensim.models.LdaModel
  • re (regular expressions)

Data Import

You will need to load the raw Yelp review data from the following URL:

Yelp Data
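As a rough sketch of the loading step: the snippet below assumes the sampled data is stored as JSON Lines (one review per line), as the full Yelp dataset is, and that it has "text" and "stars" fields. Both are assumptions, and the two records are hypothetical stand-ins so the example runs without the real file.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-record sample in JSON Lines form. The field names
# ("text", "stars") are assumptions, not confirmed by the challenge data.
sample = StringIO(
    '{"text": "Great tacos and fast service.", "stars": 5}\n'
    '{"text": "Cold food, long wait.", "stars": 1}\n'
)

# lines=True tells pandas to parse one JSON object per line.
# For the real data, pass the URL or file path instead of `sample`.
yelp = pd.read_json(sample, lines=True)

print(yelp.shape)
print(yelp[["text", "stars"]].head())
```

If the real file turns out to be CSV or regular JSON, pd.read_csv or pd.read_json without lines=True would be the call to try instead.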

Challenge Objectives

This Sprint Challenge is worth a total of 8 points. Successfully completing all objectives earns you full credit. The challenge has four parts:

Part 1: Tokenize Yelp Reviews

Create a function that:

  • Accepts one document (review text) at a time
  • Returns a list of tokens
  • Effectively processes text for further analysis
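One possible shape for such a function is sketched below, using only the re module from the package list. The cleaning rules (lowercasing, dropping punctuation) are illustrative assumptions rather than the required solution. A spaCy-based version could also lemmatize and remove stop words.

```python
import re


def tokenize(doc):
    """Return a list of lowercase alphanumeric tokens from one review."""
    # Replace anything that is not a letter, digit, or whitespace with a space.
    cleaned = re.sub(r"[^a-zA-Z0-9\s]", " ", doc)
    # Lowercase, then split on runs of whitespace.
    return cleaned.lower().split()


print(tokenize("Great food, GREAT service!!"))
# ['great', 'food', 'great', 'service']
```

Note that this simple rule splits contractions ("don't" becomes "don" and "t"), which is one reason to consider a proper tokenizer for the real reviews.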

Part 2: Create Vector Representations

In this section, you will:

  • Create a document-term matrix from the reviews
  • Write a fake review for testing purposes
  • Use a NearestNeighbors model to find the 10 most similar reviews to your fake review
  • Display the text of these similar reviews
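The steps above can be sketched on a toy corpus as follows. The five reviews and the fake review are hypothetical stand-ins, n_neighbors is set to 2 only because the toy corpus is tiny (the challenge asks for 10), and the cosine metric is one reasonable choice rather than a requirement.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-in reviews; the challenge uses the sampled Yelp data.
reviews = [
    "The pizza was hot and the crust was perfect",
    "Terrible service and a cold burger",
    "Great pizza with a crisp crust and fresh toppings",
    "The dentist was gentle and the office was clean",
    "Friendly staff and a quick oil change",
]

# Build a sparse document-term matrix of TF-IDF weights.
vectorizer = TfidfVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(reviews)

# Fit a nearest-neighbors index on the matrix (use n_neighbors=10 on real data).
nn = NearestNeighbors(n_neighbors=2, metric="cosine")
nn.fit(dtm)

# Vectorize the fake review with the SAME fitted vectorizer, then query.
fake_review = ["Amazing pizza, the crust was crisp and fresh"]
fake_vec = vectorizer.transform(fake_review)
distances, indices = nn.kneighbors(fake_vec)

# Look up and display the text of the most similar reviews.
similar_reviews = [reviews[i] for i in indices[0]]
for text in similar_reviews:
    print(text)
```

The key detail is calling transform (not fit_transform) on the fake review, so that it is projected into the vocabulary learned from the original reviews.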

Part 3: Classification Model

Your goal is to predict star ratings from review text:

  • Create a pipeline with a vectorizer and a classifier
  • Build parameter dictionaries for grid search
  • Train the model to predict the 'stars' column (the target)
  • Use the model to predict a star rating for your fake review from Part 2
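A minimal sketch of this workflow is shown below, assuming a TfidfVectorizer paired with a KNeighborsClassifier (both appear in the package list, but any classifier could be substituted). The eight labeled reviews are hypothetical, and the step names "vect" and "clf", the grid values, and cv=2 are illustrative choices sized for the toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data; the challenge uses the review text and 'stars' column.
texts = [
    "Amazing food and friendly staff",
    "Best pizza in town, great crust",
    "Wonderful service, will come back",
    "Delicious tacos and great prices",
    "Terrible service and cold food",
    "Rude staff, awful experience",
    "Worst burger ever, very greasy",
    "Dirty tables and a long wait",
]
stars = [5, 5, 5, 5, 1, 1, 1, 1]

# Pipeline: vectorizer first, classifier second.
pipe = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    ("clf", KNeighborsClassifier()),
])

# Grid keys use the "<step name>__<parameter>" convention.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__n_neighbors": [1, 3],
}

# cv=2 only because the toy data is tiny; use more folds on real data.
grid = GridSearchCV(pipe, param_grid, cv=2)
grid.fit(texts, stars)

# Predict a star rating for the fake review.
fake_review = ["Amazing pizza, the crust was crisp and fresh"]
predicted = grid.predict(fake_review)
print(grid.best_params_, predicted[0])
```

Because the vectorizer lives inside the pipeline, raw text can be passed straight to fit and predict, and the grid search refits the vectorizer within each fold.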

Part 4: Topic Modeling

Analyze what the Yelp reviews are discussing:

  • Estimate an LDA topic model with 5 topics
  • Create visualizations of the results with pyLDAvis and matplotlib
  • Write a brief analysis of your topic model results

Submission

Before submitting your notebook, you must:

  1. Restart your notebook's kernel
  2. Run all cells sequentially from top to bottom (Cell → Run All)
  3. Comment out the cell that generates the pyLDAvis visual
  4. Ensure all tests are passing

To submit your Sprint Challenge:

  1. Complete all requirements in the Sprint Challenge notebook
  2. Follow the submission instructions provided in the challenge
  3. Submit your completed work through the Sprint Challenge assignment

Note: If your first submission does not pass, you may resubmit as many times as needed until you achieve the passing grade of 100%.

Resources