
Building NLP Models with Python: A Comprehensive Guide


Understanding the NLP Model Development Process

This article is aimed at newcomers who want to move from natural language processing (NLP) theory to practice. When I first delved into NLP, I often wondered how to effectively apply the many concepts involved, so this guide walks through a complete sentiment-classification workflow. A basic familiarity with natural language concepts will help before diving in. We will cover the following steps:

  • Reading sentiment text files
  • Data exploration and text processing
  • Data cleaning — Stopwords, stemming, and lemmatization
  • Model building — Naive Bayes
  • Saving and reloading the model

Reading Sentiment Text Files

To begin, we will import the necessary libraries:

import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

Next, we'll read the sentiment file (available for download from Kaggle) using pandas. The file is tab-separated:

train_ds = pd.read_csv("sentiment_train", delimiter="\t")
train_ds.head(5)

The dataset consists of two columns: sentiment and text, where the sentiment column contains binary values, "0" for negative and "1" for positive.
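As a quick check on the class balance (a small aside, using the columns described above), you can count the rows per label:

# number of positive (1) and negative (0) examples
train_ds.sentiment.value_counts()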

To enhance our view of the text, we can adjust the column width:

pd.set_option('display.max_colwidth', 800)

Now, we can filter and view the positive (sentiment "1") and negative (sentiment "0") sentences:

train_ds[train_ds.sentiment == 1][0:5]

train_ds[train_ds.sentiment == 0][0:5]

Data Exploration and Text Processing

To examine the dataset's structure, we can utilize the info() method:

train_ds.info()

Next, we will analyze the distribution of sentiments using seaborn's count plot:

import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

plt.figure(figsize=(6, 5))
ax = sn.countplot(x='sentiment', data=train_ds)

# label each bar with its count
for p in ax.patches:
    ax.annotate(p.get_height(), (p.get_x() + 0.1, p.get_height() + 50))

Text Data Transformation

We'll convert the text data into a format suitable for analysis using the Count Vectorizer:

from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
feature_vector = count_vectorizer.fit(train_ds.text)

To determine the total number of unique features:

# note: in scikit-learn >= 1.0 this method is get_feature_names_out()
word = feature_vector.get_feature_names()
print("Total number of features: ", len(word))

To sample some of these features:

import random

random.sample(word, 10)

Next, we will transform the documents into a sparse feature matrix:

train_ds_features = count_vectorizer.transform(train_ds.text)

To check the dimensions of this sparse matrix:

train_ds_features.shape

We can convert this sparse matrix into a dense DataFrame for easier analysis:

train_ds_df = pd.DataFrame(train_ds_features.todense())
train_ds_df.columns = word

#### Counting Word Frequencies

To count the occurrences of each word, we can use:

words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts_df = pd.DataFrame(dict(features=word, counts=words_counts))

Visualizing the frequency distribution of words:

plt.figure(figsize=(12, 5))
plt.hist(feature_counts_df.counts, bins=50, range=(0, 2000))
plt.xlabel('Frequency of words')
plt.ylabel('Number of words')

To find words that appear only once:

len(feature_counts_df[feature_counts_df.counts == 1])
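Words that occur only once rarely help a model generalize. The original pipeline does not remove them, but as an aside, CountVectorizer's min_df parameter could drop such rare terms at vectorization time (the variable names below are just for illustration):

# keep only tokens that appear in at least 2 documents (not part of the original pipeline)
count_vectorizer_min = CountVectorizer(min_df=2)
features_min = count_vectorizer_min.fit_transform(train_ds.text)
features_min.shape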

Identifying the most frequent words in the dataset:

# keep only the 1,000 most frequent tokens
count_vectorizer = CountVectorizer(max_features=1000)
feature_vector = count_vectorizer.fit(train_ds.text)
train_ds_features = count_vectorizer.transform(train_ds.text)
word = feature_vector.get_feature_names()
word_counts = np.sum(train_ds_features.toarray(), axis=0)
word_counts_df = pd.DataFrame(dict(features=word, counts=word_counts))
word_counts_df.sort_values('counts', ascending=False)[0:15]

Data Cleaning Techniques

Stopwords Removal

We need to eliminate stopwords, as they don't contribute meaningfully to sentiment analysis:

from sklearn.feature_extraction import text

my_stop_words = text.ENGLISH_STOP_WORDS
print("Few stop words: ", list(my_stop_words)[0:10])

You can also add custom stopwords, such as the frequent movie titles and character names in this dataset, which carry no sentiment of their own:

my_stop_words = text.ENGLISH_STOP_WORDS.union(['harry', 'potter', 'code', 'vinci', 'da', 'harri', 'mountain', 'movie', 'movies'])

After setting the stopwords, we rebuild the count vectorizer and the word-count DataFrame:

count_vectorizer = CountVectorizer(stop_words=my_stop_words, max_features=1000)
feature_vector = count_vectorizer.fit(train_ds.text)
train_ds_features = count_vectorizer.transform(train_ds.text)
word = feature_vector.get_feature_names()
words_counts = np.sum(train_ds_features.toarray(), axis=0)
word_counts_df = pd.DataFrame(dict(features=word, counts=words_counts))

Stemming and Lemmatization

Next, we will stem words using the Porter Stemmer:

from nltk.stem.snowball import PorterStemmer

stemmer = PorterStemmer()
analyzer = CountVectorizer().build_analyzer()

def stem_words(doc):
    # stem each token produced by the default analyzer
    stemmed = (stemmer.stem(w) for w in analyzer(doc))
    # drop stopwords; using set() keeps each stemmed token at most once per document
    return [w for w in set(stemmed) - set(my_stop_words)]

We can then create a new DataFrame of the stemmed words:

count_vectorizer = CountVectorizer(analyzer=stem_words, max_features=1000)
feature_vector = count_vectorizer.fit(train_ds.text)
train_ds_features = count_vectorizer.transform(train_ds.text)
word = feature_vector.get_feature_names()
words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts_df = pd.DataFrame(dict(features=word, counts=words_counts))
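The outline also mentions lemmatization. It is not used in the rest of this walkthrough, but a minimal sketch of a lemmatizing analyzer, assuming NLTK's WordNetLemmatizer (which requires the wordnet data to be downloaded), would mirror the stemming function above:

from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def lemma_words(doc):
    # lemmatize each token from the default analyzer, then drop stopwords
    lemmas = (lemmatizer.lemmatize(w) for w in analyzer(doc))
    return [w for w in set(lemmas) - set(my_stop_words)]

# this plugs into CountVectorizer the same way as stem_words:
# count_vectorizer = CountVectorizer(analyzer=lemma_words, max_features=1000)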

Model Building: Naive Bayes Classifier

#### Training and Testing Data Preparation

We will split the data into training and test sets:

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(train_ds_features, train_ds.sentiment, test_size=0.3, random_state=42)

We will use the Bernoulli Naive Bayes classifier, which models each word as a binary present/absent feature:

from sklearn.naive_bayes import BernoulliNB

nb_clf = BernoulliNB()
nb_clf.fit(train_X.toarray(), train_y)

To make predictions:

test_ds_predicted = nb_clf.predict(test_X.toarray())

To evaluate the model's performance, we can print a classification report:

from sklearn import metrics

print(metrics.classification_report(test_y, test_ds_predicted))

To visualize the confusion matrix:

cm = metrics.confusion_matrix(test_y, test_ds_predicted)
sn.heatmap(cm, annot=True, fmt='d')  # counts are integers, so use an integer format

#### Saving and Reloading the Model

We can save our model using the pickle library:

import pickle

# save the trained classifier to disk
with open("Sentiment_Classifier_model", 'wb') as f:
    pickle.dump(nb_clf, f)

To load the model for future predictions:

with open("Sentiment_Classifier_model", 'rb') as f:
    loaded_model = pickle.load(f)

test_ds_predicted = loaded_model.predict(test_X.toarray())
print(metrics.classification_report(test_y, test_ds_predicted))
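One caveat: the pickled classifier expects vectorized input, so to score new raw sentences later you also need to persist the fitted CountVectorizer. A minimal sketch, with a made-up filename and example sentences:

# save the fitted vectorizer alongside the model (filename chosen here for illustration)
# note: because this vectorizer uses the custom stem_words analyzer, that function
# must be importable when the pickle is loaded
with open("Sentiment_Vectorizer", 'wb') as f:
    pickle.dump(count_vectorizer, f)

# later: vectorize new sentences with the same vocabulary, then predict
new_sentences = ["this movie was brilliant", "what a waste of time"]  # hypothetical examples
new_features = count_vectorizer.transform(new_sentences)
print(loaded_model.predict(new_features.toarray()))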

Conclusion

This guide covered the fundamentals of building a natural language processing model and using word counts as features for prediction. Other techniques, such as TF-IDF and n-grams, will be explored in subsequent articles.
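As a brief preview of those techniques (not part of this article's pipeline), scikit-learn's TfidfVectorizer accepts an ngram_range argument that covers both ideas; the parameters below are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF weighting over unigrams and bigrams
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=1000, stop_words=list(my_stop_words))
tfidf_features = tfidf_vectorizer.fit_transform(train_ds.text)
tfidf_features.shape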

I hope you found this article informative. Connect with me on LinkedIn and Twitter for more discussions.
