Module 1: Natural Language Processing - Introduction

Module Overview

Natural Language Processing (NLP) is a field of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. In this module, we'll explore the foundational concepts of NLP, including text preprocessing techniques that are essential for any text-based analysis. We'll learn how to tokenize text, remove stop words, and apply stemming or lemmatization to prepare text data for more advanced NLP applications.

Learning Objectives

1. Tokenize Text

• Break text into individual tokens (words, sentences, etc.)
• Apply different tokenization methods (character, word, sentence)
• Use Python libraries for efficient tokenization
• Handle punctuation and special characters during tokenization
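
The tokenization objectives above can be sketched with nothing but the standard library. This is a minimal illustration, not the method used in the guided project: the function names `word_tokenize` and `sentence_tokenize` and the regular expressions are assumptions chosen for this example, and real projects typically rely on a library tokenizer instead.

```python
import re

def word_tokenize(text):
    """Lowercase the text and return word tokens.

    Punctuation is dropped, but apostrophes inside a word are kept
    so that contractions such as "didn't" stay in one piece.
    """
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())

def sentence_tokenize(text):
    """Split text after '.', '!' or '?' when followed by whitespace.

    This naive rule breaks on abbreviations such as "Dr."; it is only
    meant to show the idea of sentence-level tokens.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "The coffee here is great! I didn't like the wait, though."

char_tokens = list(sample)               # character-level tokens
word_tokens = word_tokenize(sample)      # word-level tokens
sentence_tokens = sentence_tokenize(sample)  # sentence-level tokens

print(word_tokens)
print(sentence_tokens)
```

Comparing the three token lists on the same input shows how the choice of granularity changes what later steps (stop word removal, counting) operate on.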

2. Remove Stop Words From a List of Tokens

• Identify common stop words in text
• Remove stop words to reduce noise and vocabulary size in later analysis
• Use pre-defined stop word lists
• Create custom stop word lists for domain-specific applications
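
As a rough sketch of the stop word objectives above, the example below filters a token list against a set. Both word sets are tiny hypothetical stand-ins invented for this illustration: a pre-defined list from an NLP library would be much longer, and the custom list depends entirely on your domain.

```python
# Hypothetical, deliberately small "standard" list for illustration only.
STANDARD_STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to", "in", "it", "i", "was"}

# Hypothetical domain-specific additions: words that appear in nearly
# every coffee shop review and so carry little distinguishing signal.
CUSTOM_STOP_WORDS = {"coffee", "shop"}

def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in stop_words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["the", "coffee", "is", "great", "and", "the", "staff", "was", "friendly"]

standard_only = remove_stop_words(tokens, STANDARD_STOP_WORDS)
with_custom = remove_stop_words(tokens, STANDARD_STOP_WORDS | CUSTOM_STOP_WORDS)

print(standard_only)
print(with_custom)
```

Storing stop words in a set keeps each membership check fast, and the set union makes it easy to layer a custom list on top of a standard one.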

3. Stem or Lemmatize Text

• Apply stemming algorithms to reduce words to a stem, which may not itself be a dictionary word
• Implement lemmatization to find the base dictionary form of words
• Compare stemming vs. lemmatization approaches
• Choose appropriate text normalization methods for different NLP tasks
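
To make the stemming vs. lemmatization comparison above concrete, here is a toy sketch. Neither function is a real algorithm: `toy_stem` is a crude suffix stripper (not the Porter stemmer), and `toy_lemmatize` uses a small hand-written lookup table (`LEMMA_LOOKUP`) where a real lemmatizer would use a vocabulary and part-of-speech information. All names here are assumptions made for this example.

```python
# Suffixes are checked in this order; the first match is stripped.
SUFFIXES = ("ing", "ies", "ed", "es", "s")

def toy_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lemmatizer's dictionary.
LEMMA_LOOKUP = {
    "studies": "study",
    "running": "run",
    "better": "good",
    "was": "be",
}

def toy_lemmatize(word):
    """Return the dictionary form if known, else the word unchanged."""
    return LEMMA_LOOKUP.get(word, word)

words = ["studies", "running", "better", "was"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

for word, stem, lemma in zip(words, stems, lemmas):
    print(f"{word:10} stem={stem:10} lemma={lemma}")
```

The stems ("stud", "runn") are not real words and irregular forms such as "better" pass through untouched, while the lemmas are valid dictionary entries. That trade-off between speed and simplicity on one side and linguistic accuracy on the other is the main thing to weigh when choosing a normalization method.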

Guided Project

NLP Introduction: Text Preprocessing

GitHub Repo | Slides

Guided Project File:

DS_411_Text_Data_Lecture_GP.ipynb

Module Assignment

Please read the assignment file in the GitHub repository for detailed instructions on completing your assignment tasks.

Assignment File:

DS_411_Text_Data_Assignment.ipynb

In this assignment, your goal is to identify the attributes of the best and worst coffee shops in the dataset. The text is fairly raw: the reviews contain dates, the star_rating column contains extra words, and so on. Clean these up before you start your analysis.

You will need to analyze the corpus of text using text visualizations of token frequency. Try cleaning the data as much as possible using techniques including:

  • Lemmatization
  • Custom stopword removal
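
Before applying those techniques, the raw fields need basic cleanup. The sketch below shows one possible approach using regular expressions. The input formats are assumptions, not facts about the assignment dataset: it assumes each review begins with a date written as MM/DD/YYYY and that star_rating values look like "5.0 star rating". Inspect the actual data and adjust the patterns accordingly. The helper names are hypothetical.

```python
import re

def strip_leading_date(review):
    """Remove a leading date, ASSUMING the form MM/DD/YYYY."""
    return re.sub(r"^\s*\d{1,2}/\d{1,2}/\d{4}\s*", "", review)

def parse_star_rating(raw):
    """Extract the first number from a rating string.

    ASSUMES values such as "5.0 star rating"; returns None when
    no number is present.
    """
    match = re.search(r"\d+(?:\.\d+)?", raw)
    return float(match.group()) if match else None

# Made-up example values, not rows from the dataset.
example_review = " 11/25/2016 Great latte and friendly staff"
example_rating = " 5.0 star rating "

clean_review = strip_leading_date(example_review)
rating = parse_star_rating(example_rating)

print(clean_review)
print(rating)
```

With a numeric rating in hand, you can split the reviews into high-rated and low-rated groups and compare their token frequencies.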

Assignment Solution Video

Check for Understanding

Complete the following items to test your understanding:

  • Tokenize a text document using different methods (word, sentence)
  • Remove standard and custom stop words from tokenized text
  • Apply stemming and lemmatization to a set of tokens
  • Compare the results of stemming vs. lemmatization
  • Create a visualization of token frequencies
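
For the last item, a token frequency visualization can start as simply as a text bar chart. This sketch uses only the standard library so it runs anywhere; the helper names `token_frequencies` and `text_bar_chart` are assumptions for this example, and a plotting library would be the usual choice for a polished chart.

```python
from collections import Counter

def token_frequencies(tokens, top_n=5):
    """Return the top_n (token, count) pairs, most frequent first."""
    return Counter(tokens).most_common(top_n)

def text_bar_chart(freqs):
    """Render (token, count) pairs as one '#' bar per line."""
    if not freqs:
        return ""
    width = max(len(token) for token, _ in freqs)
    return "\n".join(
        f"{token.ljust(width)} | {'#' * count} {count}"
        for token, count in freqs
    )

# Made-up tokens standing in for a cleaned, tokenized corpus.
tokens = ["latte", "great", "latte", "staff", "great", "latte"]

freqs = token_frequencies(tokens, top_n=3)
chart = text_bar_chart(freqs)
print(chart)
```

Running the same chart before and after stop word removal or lemmatization is a quick way to see whether the cleaning step actually surfaced more informative tokens.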

Additional Resources