Module 3: Document Classification
Module Overview
In this module, we'll explore document classification, a fundamental NLP task that involves categorizing text documents into predefined classes. We'll learn how to extract features from text data, implement classification pipelines, apply dimensionality reduction techniques like Latent Semantic Indexing (LSI), and benchmark different vectorization methods to optimize classification performance. These skills are essential for applications such as sentiment analysis, spam detection, and topic categorization.
Learning Objectives
- Extract text features and use them in classification pipelines
- Apply Latent Semantic Indexing (LSI) to a document classification problem
- Benchmark different vectorization methods in document classification tasks
Objective 01 - Extract Text Features and Use Them in Classification Pipelines
Overview
In Unit 2, we worked with pipelines a lot. We used them to create a consistent workflow of preprocessing and model-fitting steps. A pipeline is necessary to ensure that the same preprocessing is applied within each fold of the cross-validation, with the transformations fit only on that fold's training data.
To start this module, we will focus on two tasks: extracting features from text and then classifying the text with a simple logistic regression. We'll put these two tasks together in a pipeline and then use a grid search cross-validation to find the ideal parameters.
Follow Along
For this example, we'll be using the sentiment-labeled sentences dataset from the UCI Machine Learning Repository. To make this exercise a little simpler and the dataset a little smaller, we'll use only the reviews from Yelp.
First, let's vectorize the text data and then fit a classifier model.
# Imports
import pandas as pd
# Read in the locally saved file from the link above
df_yelp = pd.read_csv('yelp_labelled.txt', names=['sentence', 'label'], sep='\t')
df_yelp.head()
|   | sentence | label |
|---|----------|-------|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
# Import train-test split
from sklearn.model_selection import train_test_split
# Create the feature and target variables
sentences = df_yelp['sentence']
y = df_yelp['label']
# Train-test split
sentences_train, sentences_test, y_train, y_test = train_test_split(
sentences, y, test_size=0.25, random_state=42)
We now have a list of sentences, and we did the train-test split before we vectorized. If we had instead vectorized the whole dataset and then split it into train and test sets, the vectorizer (and therefore the training set) would contain information about the test set.
# Import the tf-idf vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Instantiate and fit the tf-idf vectorizer
vectorizer = TfidfVectorizer(stop_words='english', ngram_range = (2,2))
vectorizer.fit(sentences_train)
# Vectorize the training and testing data
X_train = vectorizer.transform(sentences_train)
X_test = vectorizer.transform(sentences_test)
# Display the properties of the vectorized text
X_train
<750x2864 sparse matrix of type '<class 'numpy.float64'>' with 3051 stored elements in Compressed Sparse Row format>
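If you're curious what those 2,864 columns represent, you can inspect the fitted vocabulary. This is a quick, optional check (get_feature_names_out is the method name in recent scikit-learn versions; older versions used get_feature_names):

# Inspect the bigram vocabulary learned from the training sentences
# (get_feature_names_out is available in scikit-learn >= 1.0)
feature_names = vectorizer.get_feature_names_out()
print(len(feature_names))   # 2864 - one column per bigram
print(feature_names[:5])    # a few example bigrams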
# Import the classifiers (RandomForestClassifier is only needed for the challenge below)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Instantiate and fit a model
classifier = LogisticRegression(solver='lbfgs')
classifier.fit(X_train, y_train)
score = classifier.score(X_test, y_test)
print("Accuracy:", score)
Accuracy: 0.588
We get modest accuracy with the logistic regression model. Now, if we want to optimize the model with cross-validation, we need to refit the vectorizer and the classifier within each fold. As we learned in Unit 2, the same transformations must be learned and applied inside each fold of the validation; otherwise, we could accidentally introduce data leakage (giving the model information it shouldn't have about the held-out data).
Our pipeline will have two steps: the vectorizer and the classifier.
from sklearn.pipeline import Pipeline
# Define the Pipeline
pipe = Pipeline([('vect', vectorizer), # vectorizer
('clf', classifier) # classifier
])
# Define the parameter space for the grid search
parameters = {'clf__C': [1, 10, 1000000]} # C: inverse of regularization strength
# Implement a grid search with cross-validation
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(pipe, parameters, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(sentences, y);
# Print out the best score
grid_search.best_score_
Fitting 5 folds for each of 3 candidates, totalling 15 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 out of  15 | elapsed:    2.3s remaining:    2.0s
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:    2.4s finished

0.611
The accuracy is improved compared to the version without a grid search. It's straightforward to adjust the parameters you'd like to search over. In this case, the only somewhat helpful parameter to search over is C, the inverse of the regularization strength. For classifiers with more parameters, you can add additional key-value pairs to the parameters dictionary, as in the sketch below.
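As a sketch (the specific values are illustrative, not tuned recommendations), the search space could be extended to cover vectorizer parameters as well:

# An expanded (illustrative) search space: step names match the pipeline above
parameters = {
    'vect__max_df': (0.9, 1.0),             # maximum document frequency cutoff
    'vect__ngram_range': [(1, 1), (2, 2)],  # unigrams vs. bigrams
    'clf__C': [1, 10, 1000000],             # inverse of regularization strength
}

grid_search = GridSearchCV(pipe, parameters, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(sentences, y)
print(grid_search.best_params_)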
Challenge
For this challenge, try using a different classifier in place of the logistic regression. Make sure to adjust the parameters dictionary to be consistent with the classifier you choose. Some suggested classifiers to begin with are a decision tree or a random forest.
Additional Resources
Objective 02 - Apply Latent Semantic Indexing (LSI) to a Document Classification Problem
Overview
We're going to continue to improve on our model from the previous objective. So far, we have used a pipeline with a vectorizer and a classifier. Another advantage of a pipeline is that it's easy to add additional tasks. For this objective, we will look at a technique called latent semantic analysis (LSA). It's also referred to as latent semantic indexing (LSI), and the terms are interchangeable.
When we're doing this analysis, we're looking for a set of common concepts for the documents in our corpus. So first, a word count (or some vector representation) is determined for each document. We can then assess document similarity by calculating the cosine similarity between two document vectors (cosine similarity is the normalized dot product).
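As a quick illustration (using made-up sentences, not the Yelp data), cosine similarity between tf-idf document vectors can be computed with scikit-learn:

# Toy example: cosine similarity between tf-idf document vectors
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the food was great",
        "the food was terrible",
        "great service and great food"]

# Vectorize the toy documents and compare the first one to the other two
tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf[0], tfidf[1:]))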
Singular Value Decomposition (SVD)
We won't go into all of the detailed math here, but we'll try to summarize the main ideas behind singular value decomposition. First, as we have seen in previous modules, we can create a matrix that represents the words in the corpus. In the last objective, we had a list of Yelp reviews and considered each one a document. For each review (document), we calculated a tf-idf vector. The resulting matrix has a row for each document and a column for each term in the vocabulary. This matrix can be very large, and we don't necessarily need all of the information in it.
To find the most important "parts" of the matrix, we can use SVD to reduce the thousands of term columns to a much smaller number of latent components ("concepts") while still preserving enough information for later comparisons, like cosine similarity. A minimal sketch of this reduction is shown below.
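Here's a minimal sketch of that reduction, reusing the X_train tf-idf matrix from the previous objective (the choice of 100 components is arbitrary):

# Reduce the 2,864 bigram columns to 100 latent components ("concepts")
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100)
X_train_lsa = svd.fit_transform(X_train)

print(X_train.shape)      # (750, 2864) - one row per document, one column per bigram
print(X_train_lsa.shape)  # (750, 100)  - each document described by 100 components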
In the following example, we'll do a latent semantic analysis using the Scikit-learn TruncatedSVD transformer. This transformer works on the tf-idf matrices returned by the vectorizers; when it is applied to this type of matrix, the technique is known as latent semantic analysis (LSA).
Follow Along
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
# Read in the locally saved file from the link above
df_yelp = pd.read_csv('yelp_labelled.txt', names=['sentence', 'label'], sep='\t')
df_yelp.head()
# Create the features and target
sentences = df_yelp['sentence']
y = df_yelp['label']
# Instantiate the tf-idf vectorizer
vectorizer = TfidfVectorizer(stop_words='english', ngram_range = (2,2))
# Instantiate the classifier (defaults)
classifier = LogisticRegression(solver='lbfgs')
# Instantiate the LSA (SVD) algorithm (defaults)
svd = TruncatedSVD()
Now we can add the SVD part to our pipeline; we'll separate it from the classifier part and call the combination of the vectorizer and SVD the "lsa" piece.
# Create the pipelines
from sklearn.pipeline import Pipeline
# LSA part
lsa = Pipeline([('vect', vectorizer), ('svd', svd)])
# Combine into one pipeline
pipe = Pipeline([('lsa', lsa), ('clf', classifier)])
# Define the parameter space for the grid search
parameters = {
'lsa__svd__n_components': (100,250),
'lsa__vect__max_df': (0.9, 1.0), # max document frequency
}
# Implement a grid search with cross-validation
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(pipe, parameters, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(sentences, y);
# Display the best score from the grid-search
grid_search.best_score_
Fitting 5 folds for each of 4 candidates, totalling 20 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 out of  20 | elapsed:    3.4s remaining:    0.4s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    3.4s finished

0.594
In the last objective, we achieved an accuracy of 0.61, and we're at about the same here. Our analysis didn't seem to benefit too much from adding in LSA/SVD. But, we did use a relatively small set of labeled sentences, and the resulting matrix likely didn't contain too much extra information that needed to be "decomposed."
Challenge
The UCI Sentiment Labeled Sentences dataset could include two other sources in the analysis: Amazon and IMDB. You can use a dataset that contains all three of these sources or select one of them. With that dataset, try running the analysis above. Did it make any difference to include SVD?
Additional Resources
Objective 03 - Benchmark and Compare Various Vectorization Methods in Document Classification Tasks
Overview
This objective will focus on building pipelines and adding different pieces to improve the modeling of our text data. In the previous examples, we vectorized our documents (in this case, individual review sentences) with tf-idf features. But this may not have been enough to capture the meaning of the words in the text. Our model accuracies are around 60%, and we'd like to improve on that.
One thing we haven't incorporated yet is word embeddings - remember those? Individual words can be represented as dense vectors whose components capture aspects of meaning. If we take a sentence of word vectors, we can represent the whole sentence by the average of its word vectors, as the quick check below illustrates.
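As an optional check (assuming the en_core_web_lg model used below is installed), we can confirm that the vector spaCy returns for a whole document is the mean of its token vectors:

# Optional check: a spaCy document vector is the mean of its token vectors
# (assumes the model is installed: python -m spacy download en_core_web_lg)
import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("The food was great")
token_vectors = np.array([token.vector for token in doc])

# For models with static word vectors, doc.vector averages the token vectors
print(np.allclose(doc.vector, token_vectors.mean(axis=0), atol=1e-5))  # True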
In the following example, we'll use spaCy and import the large pre-trained model, which includes word embeddings. Because we're vectorizing our text with spaCy rather than the scikit-learn TfidfVectorizer, we're not going to use a pipeline here. Instead, we'll do a train-test split and vectorize the training and testing sets separately. Let's improve our model.
Follow Along
# Import spaCy and the large pretrained model (includes word embeddings)
import spacy
nlp = spacy.load("en_core_web_lg")
# Imports
import pandas as pd
from sklearn.model_selection import train_test_split
# Read in the locally saved file from UCI website
df_yelp = pd.read_csv('yelp_labelled.txt', names=['sentence', 'label'], sep='\t')
df_yelp.head()
# Create the features and target
sentences = df_yelp['sentence']
y = df_yelp['label']
# Train-test split
sentences_train, sentences_test, y_train, y_test = train_test_split(
sentences, y, test_size=0.25, random_state=42)
# Function to return the vector for each sentence in a document
def get_word_vectors(docs):
    return [nlp(doc).vector for doc in docs]
# Get the vectors for each sentence (mean of all the word vectors)
X_train = get_word_vectors(sentences_train)
X_test = get_word_vectors(sentences_test)
from sklearn.linear_model import LogisticRegression
# Instantiate the classifier (defaults)
classifier = LogisticRegression(solver='lbfgs')
# Fit the model
classifier.fit(X_train, y_train)
score = classifier.score(X_test, y_test)
# Print out the accuracy score
print("Accuracy including word embeddings: ", score)
Accuracy including word embeddings: 0.856
We have improved our accuracy a lot here! The improvement suggests that for this type of text data (short to medium-length review sentences), word embeddings capture more of the words' meaning than the bigram tf-idf features did, and that richer representation let the model make better predictions.
Challenge
The UCI Sentiment Labeled Sentences dataset could include two other sources in the analysis: Amazon and IMDB. You can use a dataset that contains all three of these sources or select one of them. With that dataset, try running the analysis above. Did adding more sentences change how well your model could predict?
Another task you could try is to use a different classifier, like a decision tree or random forest. Does the performance improve?
Additional Resources
Guided Project
Open DS_413_Document_Classification_Lecture_GP.ipynb in the GitHub repository to follow along with the guided project.
Module Assignment
Participate in a Kaggle competition to classify whisky reviews using different NLP techniques. Apply text feature extraction, LSI, and word embeddings to optimize classification performance and achieve at least 80% accuracy.