Building NLP Models with Python: A Comprehensive Guide
Understanding the NLP Model Development Process
This article aims to reshape the perspective of newcomers eager to learn about natural language processing (NLP). When I first delved into NLP, I often wondered how to effectively apply the myriad concepts involved. A basic understanding of natural language concepts will help before diving in. In this article, we will work through the following steps:
- Reading sentiment text files
- Data exploration and text processing
- Data cleaning — Stopwords, stemming, and lemmatization
- Model building — Naive Bayes
- Saving and reloading the model
Reading Sentiment Text Files
To begin, we will import the necessary libraries:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
Next, we'll read the sentiment file (available for download from Kaggle) using pandas. The file is tab-separated, so we pass the tab character as the delimiter:
train_ds = pd.read_csv("sentiment_train", delimiter="\t")
train_ds.head(5)
The dataset consists of two columns: sentiment and text, where the sentiment column contains binary values, "0" for negative and "1" for positive.
To enhance our view of the text, we can adjust the column width:
pd.set_option('display.max_colwidth', 800)
Now, we can filter and view the positive (sentiment "1") and negative (sentiment "0") sentences:
train_ds[train_ds.sentiment == 1][0:5]
train_ds[train_ds.sentiment == 0][0:5]
Data Exploration and Text Processing
To examine the dataset's structure, we can utilize the info() method:
train_ds.info()
Next, we will analyze the distribution of sentiments using seaborn's count plot:
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
plt.figure(figsize=(6, 5))
ax = sn.countplot(x='sentiment', data=train_ds)
# Annotate each bar with its count
for p in ax.patches:
    ax.annotate(p.get_height(), (p.get_x() + 0.1, p.get_height() + 50))
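If you prefer the raw numbers over a plot, the same class distribution can be read directly from the sentiment column:
train_ds.sentiment.value_counts()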
#### Text Data Transformation
We'll convert the text data into numeric features using scikit-learn's CountVectorizer, which builds a vocabulary from the corpus and counts how often each word appears in each document:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
feature_vector = count_vectorizer.fit(train_ds.text)
To determine the total number of unique features:
# get_feature_names() was removed in recent scikit-learn; use get_feature_names_out()
word = list(feature_vector.get_feature_names_out())
print("Total number of features: ", len(word))
To sample some of these features:
import random
random.sample(word, 10)
Next, we will transform our features into a sparse matrix format:
train_ds_features = count_vectorizer.transform(train_ds.text)
To check the dimensions of this sparse matrix:
train_ds_features.shape
We can convert this sparse matrix into a dense DataFrame for easier analysis:
train_ds_df = pd.DataFrame(train_ds_features.todense())
train_ds_df.columns = word
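To make the idea behind CountVectorizer concrete, here is a minimal, self-contained sketch on two made-up sentences (not the Kaggle data); it shows how each document becomes a row of word counts:
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus purely for illustration
toy_corpus = ["the movie was great", "the movie was boring and too long"]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

print(toy_vectorizer.get_feature_names_out())  # learned vocabulary
print(toy_matrix.toarray())                    # one row per sentence, one column per word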
#### Counting Word Frequencies
To count the occurrences of each word, we can use:
words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts_df = pd.DataFrame(dict(features=word, counts=words_counts))
Visualizing the frequency distribution of words:
plt.figure(figsize=(12, 5))
plt.hist(feature_counts_df.counts, bins=50, range=(0, 2000))
plt.xlabel('Frequency of words')
plt.ylabel('Number of words')
To find words that appear only once:
len(feature_counts_df[feature_counts_df.counts == 1])
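Typically a large share of the vocabulary consists of such one-off words, which is exactly why we cap the vocabulary size in the next step. A quick illustrative check of that share:
rare_words = feature_counts_df[feature_counts_df.counts == 1]
print("Words occurring only once: ", len(rare_words))
print("Share of the vocabulary: ", round(len(rare_words) / len(feature_counts_df), 2))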
Next, we restrict the vectorizer to the 1,000 most frequent words and list the top 15:
count_vectorizer = CountVectorizer(max_features=1000)
feature_vector = count_vectorizer.fit(train_ds.text)
word = list(feature_vector.get_feature_names_out())
# Re-transform the text so the counts reflect the new 1,000-word vocabulary
train_ds_features = count_vectorizer.transform(train_ds.text)
word_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts_df = pd.DataFrame(dict(features=word, counts=word_counts))
feature_counts_df.sort_values('counts', ascending=False)[0:15]
Data Cleaning Techniques
#### Stopwords Removal
We need to eliminate stopwords, as they don't contribute meaningfully to sentiment analysis:
from sklearn.feature_extraction import text
my_stop_words = text.ENGLISH_STOP_WORDS
print("Few stop words: ", list(my_stop_words)[0:10])
You can also extend the list with custom, domain-specific stopwords; here we add frequent but sentiment-neutral words such as movie titles:
my_stop_words = text.ENGLISH_STOP_WORDS.union(['harry', 'potter', 'code', 'vinci', 'da', 'harri', 'mountain', 'movie', 'movies'])
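A quick check confirms the custom words were added to the set (the exact total depends on your scikit-learn version):
print("Stop words in the extended list: ", len(my_stop_words))
print('movie' in my_stop_words)  # True, since we added it above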
After setting the stopwords, we create a new DataFrame:
count_vectorizer = CountVectorizer(stop_words=my_stop_words, max_features=1000)
feature_vector = count_vectorizer.fit(train_ds.text)
train_ds_features = count_vectorizer.transform(train_ds.text)
word = list(feature_vector.get_feature_names_out())
words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts_df = pd.DataFrame(dict(features=word, counts=words_counts))
#### Stemming and Lemmatization
Next, we will stem words using the Porter Stemmer:
from nltk.stem.snowball import PorterStemmer
stemmer = PorterStemmer()
analyzer = CountVectorizer().build_analyzer()
def stem_words(doc):
    # Stem each token produced by the default analyzer, then drop stop words
    stemmed_words = set(stemmer.stem(w) for w in analyzer(doc))
    return list(stemmed_words - my_stop_words)
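As a quick sanity check, you can call the custom analyzer on a made-up sentence (the sentence below is only an illustration) to see the stemmed, stopword-free tokens it produces:
# Example sentence for illustration only
print(stem_words("I really loved this amazing story and its characters"))
# Returns stemmed tokens such as 'love', 'realli', 'stori', 'charact' (order may vary, since a set is used)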
We can then create a new DataFrame of the stemmed words:
count_vectorizer = CountVectorizer(analyzer=stem_words, max_features=1000)
feature_vector = count_vectorizer.fit(train_ds.text)
train_ds_features = count_vectorizer.transform(train_ds.text)
word = list(feature_vector.get_feature_names_out())
words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts_df = pd.DataFrame(dict(features=word, counts=words_counts))
Model Building: Naive Bayes Classifier
#### Training and Testing Data Preparation
We will split the data into training and test sets:
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(train_ds_features, train_ds.sentiment, test_size=0.3, random_state=42)
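Before training, it is worth confirming the split sizes; the exact numbers depend on the dataset, but the check itself is simple:
# 70% of the rows go to training, 30% to testing
print("Train: ", train_X.shape, " Test: ", test_X.shape)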
We'll use the Bernoulli Naive Bayes classifier, which treats each word as a binary present/absent feature (a good fit for sparse text data), to make predictions:
from sklearn.naive_bayes import BernoulliNB
nb_clf = BernoulliNB()
nb_clf.fit(train_X.toarray(), train_y)
To make predictions:
test_ds_predicted = nb_clf.predict(test_X.toarray())
To evaluate the model's performance, we can print a classification report:
from sklearn import metrics
print(metrics.classification_report(test_y, test_ds_predicted))
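If you want a single headline number alongside the full report, the accuracy can also be computed directly:
print("Accuracy: ", metrics.accuracy_score(test_y, test_ds_predicted))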
To visualize the confusion matrix:
cm = metrics.confusion_matrix(test_y, test_ds_predicted)
sn.heatmap(cm, annot=True, fmt='d')  # 'd' shows the counts as integers
plt.xlabel('Predicted label')
plt.ylabel('True label')
#### Saving and Reloading the Model
We can save our model using the pickle library:
import pickle
# Persist the trained classifier to disk
with open("Sentiment_Classifier_model", 'wb') as f:
    pickle.dump(nb_clf, f)
To load the model for future predictions:
with open("Sentiment_Classifier_model", 'rb') as f:
    loaded_model = pickle.load(f)
test_ds_predicted = loaded_model.predict(test_X.toarray())
print(metrics.classification_report(test_y, test_ds_predicted))
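Note that the saved classifier expects the same 1,000-dimensional vectors produced by count_vectorizer, so to score brand-new text you also need to persist the fitted vectorizer. A minimal sketch, assuming a hypothetical file name and example sentence:
# Save the fitted vectorizer alongside the model (hypothetical file name)
with open("Sentiment_Vectorizer", 'wb') as f:
    pickle.dump(count_vectorizer, f)

# Later: load both, vectorize a new raw sentence, and predict its sentiment.
# Because the vectorizer uses the custom stem_words analyzer, that function
# (and its imports) must be defined wherever you unpickle it.
with open("Sentiment_Vectorizer", 'rb') as f:
    loaded_vectorizer = pickle.load(f)

new_text = ["What a wonderful and uplifting film"]   # example sentence for illustration
new_features = loaded_vectorizer.transform(new_text)
print(loaded_model.predict(new_features.toarray()))  # 1 = positive, 0 = negative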
Conclusion
This guide provides fundamental insights into constructing natural language processing models and handling words as features for predictions. Other modeling techniques, such as TF-IDF and N-Grams, will be explored in subsequent articles.
I hope you found this article informative. Connect with me on LinkedIn and Twitter for more discussions.
Recommended Articles
- Understanding Lists as Big O and Comprehension with Python Examples
- Python Data Structures: Data Types and Objects
- Concepts of Exception Handling in Python
- Principal Component Analysis in Dimensionality Reduction with Python
- A Comprehensive Overview of K-means Clustering with Python
- In-depth Explanation of Linear Regression with Python
- Detailed Insights into Logistic Regression with Python
- Fundamentals of Time Series Analysis with Python
- Data Wrangling Techniques Using Python — Part 1
- Exploring the Confusion Matrix in Machine Learning