Building NLP Models with Python: A Comprehensive Guide
Understanding the NLP Model Development Process
This article aims to reshape the perspective of newcomers eager to learn about natural language processing (NLP). When I first delved into NLP, I often wondered how to effectively apply the myriad concepts involved. A basic understanding of natural language concepts will help before diving in. In this article, we will work through the following steps:
- Reading sentiment text files
- Data exploration and text processing
- Data cleaning — Stopwords, stemming, and lemmatization
- Model building — Naive Bayes
- Saving and reloading the model
Reading Sentiment Text Files
To begin, we will import the necessary libraries:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
Next, we'll read the sentiment file (available for download from Kaggle) using pandas. The file is tab-separated, so we pass the tab character as the delimiter:
train_ds = pd.read_csv("sentiment_train", delimiter="\t")
train_ds.head(5)
The dataset consists of two columns: sentiment and text, where the sentiment column contains binary values, "0" for negative and "1" for positive.
To enhance our view of the text, we can adjust the column width:
pd.set_option('display.max_colwidth', 800)
Now, we can filter and view the positive (sentiment "1") and negative (sentiment "0") sentences:
train_ds[train_ds.sentiment == 1][0:5]
train_ds[train_ds.sentiment == 0][0:5]
Data Exploration and Text Processing
To examine the dataset's structure, we can utilize the info() method:
train_ds.info()
Next, we will analyze the distribution of sentiments using seaborn's count plot:
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
plt.figure(figsize=(6, 5))
ax = sn.countplot(x='sentiment', data=train_ds)
# Annotate each bar with its count
for p in ax.patches:
    ax.annotate(p.get_height(), (p.get_x() + 0.1, p.get_height() + 50))
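If you prefer the raw numbers over a plot, the same class distribution can be read directly from the sentiment column:
train_ds.sentiment.value_counts()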
#### Text Data Transformation
We'll convert the text data into numeric features using scikit-learn's CountVectorizer, which builds a vocabulary from the corpus and counts how often each word appears in each document:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
feature_vector = count_vectorizer.fit(train_ds.text)
To determine the total number of unique features:
# get_feature_names() was removed in recent scikit-learn; use get_feature_names_out()
word = list(feature_vector.get_feature_names_out())
print("Total number of features: ", len(word))
To sample some of these features:
import random
random.sample(word, 10)
Next, we will transform our features into a sparse matrix format:
train_ds_features = count_vectorizer.transform(train_ds.text)
To check the dimensions of this sparse matrix:
train_ds_features.shape
We can convert this sparse matrix into a dense DataFrame for easier analysis:
train_ds_df = pd.DataFrame(train_ds_features.todense())
train_ds_df.columns = word
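To make the idea behind CountVectorizer concrete, here is a minimal, self-contained sketch on two made-up sentences (not the Kaggle data); it shows how each document becomes a row of word counts:
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus purely for illustration
toy_corpus = ["the movie was great", "the movie was boring and too long"]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

print(toy_vectorizer.get_feature_names_out())  # learned vocabulary
print(toy_matrix.toarray())                    # one row per sentence, one column per word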
#### Counting Word Frequencies
To count the occurrences of each word, we can use:
words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts_df = pd.DataFrame(dict(features=word, counts=words_counts))
Visualizing the frequency distribution of words:
plt.figure(figsize=(12, 5))
plt.hist(feature_counts_df.counts, bins=50, range=(0, 2000))
plt.xlabel('Frequency of words')
plt.ylabel('Number of words')
To find words that appear only once:
len(feature_counts_df[feature_counts_df.counts == 1])
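Typically a large share of the vocabulary consists of such one-off words, which is exactly why we cap the vocabulary size in the next step. A quick illustrative check of that share:
rare_words = feature_counts_df[feature_counts_df.counts == 1]
print("Words occurring only once: ", len(rare_words))
print("Share of the vocabulary: ", round(len(rare_words) / len(feature_counts_df), 2))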
Next, we restrict the vectorizer to the 1,000 most frequent words and list the top 15:
count_vectorizer = CountVectorizer(max_features=1000)
feature_vector = count_vectorizer.fit(train_ds.text)
word = list(feature_vector.get_feature_names_out())
# Re-transform the text so the counts reflect the new 1,000-word vocabulary
train_ds_features = count_vectorizer.transform(train_ds.text)
word_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts_df = pd.DataFrame(dict(features=word, counts=word_counts))
feature_counts_df.sort_values('counts', ascending=False)[0:15]
Data Cleaning Techniques
#### Stopwords Removal
We need to eliminate stopwords, as they don't contribute meaningfully to sentiment analysis:
from sklearn.feature_extraction import text
my_stop_words = text.ENGLISH_STOP_WORDS
print("Few stop words: ", list(my_stop_words)[0:10])
You can also extend the list with custom, domain-specific stopwords; here we add frequent but sentiment-neutral words such as movie titles:
my_stop_words = text.ENGLISH_STOP_WORDS.union(['harry', 'potter', 'code', 'vinci', 'da', 'harri', 'mountain', 'movie', 'movies'])
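A quick check confirms the custom words were added to the set (the exact total depends on your scikit-learn version):
print("Stop words in the extended list: ", len(my_stop_words))
print('movie' in my_stop_words)  # True, since we added it above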
After setting the stopwords, we create a new DataFrame:
count_vectorizer = CountVectorizer(stop_words=my_stop_words, max_features=1000)
feature_vector = count_vectorizer.fit(train_ds.text)
train_ds_features = count_vectorizer.transform(train_ds.text)
word = list(feature_vector.get_feature_names_out())
words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts_df = pd.DataFrame(dict(features=word, counts=words_counts))
#### Stemming and Lemmatization
Next, we will stem words using the Porter Stemmer:
from nltk.stem.snowball import PorterStemmer
stemmer = PorterStemmer()
analyzer = CountVectorizer().build_analyzer()
def stem_words(doc):
    # Stem each token produced by the default analyzer, then drop stop words
    stemmed_words = set(stemmer.stem(w) for w in analyzer(doc))
    return list(stemmed_words - my_stop_words)
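As a quick sanity check, you can call the custom analyzer on a made-up sentence (the sentence below is only an illustration) to see the stemmed, stopword-free tokens it produces:
# Example sentence for illustration only
print(stem_words("I really loved this amazing story and its characters"))
# Returns stemmed tokens such as 'love', 'realli', 'stori', 'charact' (order may vary, since a set is used)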
We can then create a new DataFrame of the stemmed words:
count_vectorizer = CountVectorizer(analyzer=stem_words, max_features=1000)
feature_vector = count_vectorizer.fit(train_ds.text)
train_ds_features = count_vectorizer.transform(train_ds.text)
word = list(feature_vector.get_feature_names_out())
words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts_df = pd.DataFrame(dict(features=word, counts=words_counts))
Model Building: Naive Bayes Classifier
#### Training and Testing Data Preparation
We will split the data into training and test sets:
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(train_ds_features, train_ds.sentiment, test_size=0.3, random_state=42)
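Before training, it is worth confirming the split sizes; the exact numbers depend on the dataset, but the check itself is simple:
# 70% of the rows go to training, 30% to testing
print("Train: ", train_X.shape, " Test: ", test_X.shape)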
We'll use the Bernoulli Naive Bayes classifier, which treats each word as a binary present/absent feature (a good fit for sparse text data), to make predictions:
from sklearn.naive_bayes import BernoulliNB
nb_clf = BernoulliNB()
nb_clf.fit(train_X.toarray(), train_y)
To make predictions:
test_ds_predicted = nb_clf.predict(test_X.toarray())
To evaluate the model's performance, we can print a classification report:
from sklearn import metrics
print(metrics.classification_report(test_y, test_ds_predicted))
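If you want a single headline number alongside the full report, the accuracy can also be computed directly:
print("Accuracy: ", metrics.accuracy_score(test_y, test_ds_predicted))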
To visualize the confusion matrix:
cm = metrics.confusion_matrix(test_y, test_ds_predicted)
sn.heatmap(cm, annot=True, fmt='d')  # 'd' shows the counts as integers
plt.xlabel('Predicted label')
plt.ylabel('True label')
#### Saving and Reloading the Model
We can save our model using the pickle library:
import pickle
# Persist the trained classifier to disk
with open("Sentiment_Classifier_model", 'wb') as f:
    pickle.dump(nb_clf, f)
To load the model for future predictions:
with open("Sentiment_Classifier_model", 'rb') as f:
    loaded_model = pickle.load(f)
test_ds_predicted = loaded_model.predict(test_X.toarray())
print(metrics.classification_report(test_y, test_ds_predicted))
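Note that the saved classifier expects the same 1,000-dimensional vectors produced by count_vectorizer, so to score brand-new text you also need to persist the fitted vectorizer. A minimal sketch, assuming a hypothetical file name and example sentence:
# Save the fitted vectorizer alongside the model (hypothetical file name)
with open("Sentiment_Vectorizer", 'wb') as f:
    pickle.dump(count_vectorizer, f)

# Later: load both, vectorize a new raw sentence, and predict its sentiment.
# Because the vectorizer uses the custom stem_words analyzer, that function
# (and its imports) must be defined wherever you unpickle it.
with open("Sentiment_Vectorizer", 'rb') as f:
    loaded_vectorizer = pickle.load(f)

new_text = ["What a wonderful and uplifting film"]   # example sentence for illustration
new_features = loaded_vectorizer.transform(new_text)
print(loaded_model.predict(new_features.toarray()))  # 1 = positive, 0 = negative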
Conclusion
This guide provides fundamental insights into constructing natural language processing models and handling words as features for predictions. Other modeling techniques, such as TF-IDF and N-Grams, will be explored in subsequent articles.
I hope you found this article informative. Connect with me on LinkedIn and Twitter for more discussions.
Recommended Articles
- Understanding Lists as Big O and Comprehension with Python Examples
- Python Data Structures: Data Types and Objects
- Concepts of Exception Handling in Python
- Principal Component Analysis in Dimensionality Reduction with Python
- A Comprehensive Overview of K-means Clustering with Python
- In-depth Explanation of Linear Regression with Python
- Detailed Insights into Logistic Regression with Python
- Fundamentals of Time Series Analysis with Python
- Data Wrangling Techniques Using Python — Part 1
- Exploring the Confusion Matrix in Machine Learning