Understanding the meaning and context behind someone’s words is essential to understanding them and their thoughts. What if we could make machines understand the context of our text too, for example through sentiment analysis?
It would help us automate many tasks where humans have had to sit for hours and go through all the text manually, right? Well, it is already possible to some extent with today’s technology and the advances in Natural Language Processing. If you are curious, let’s go ahead, learn more about it, and build your own Sentiment Analysis Classifier model that classifies texts as positive or negative.
Table of contents
- What is Sentiment Analysis?
- Collecting and reading the dataset
- Importing the NLTK modules
- Lower casing the text
- Expanding contractions
- Removing punctuation and special characters
- Removing stopwords
- Data building and splitting into the train and test sets
- Applying TFIDF Vectorizer
- Fitting the Support Vector Machine Model
- Evaluation of model using Accuracy Score, Confusion Matrix, and Classification Report
What is Sentiment Analysis?
Sentiment Analysis is the process of working out the sentiment behind a sentence or text, that is, whether its tone is positive or negative. In this article, we are going to build a classifier model which, when provided with a piece of text, classifies it as positive or negative.
So let’s get going!
Collecting the Dataset for sentiment analysis
To build our model, we need some data to work with. We will use the Amazon Reviews dataset for our project. You can find the dataset on the Internet or just follow this GitHub link:
If you don’t have data handy or want to use your own dataset, you always have the option of web scraping the data. You can follow the link below for that.
Reading the dataset
First, we import pandas, since we need it to read the dataset. Then we use the read_csv() function to read the dataset into a DataFrame. If you have not come across a DataFrame before, it is a table-like 2D structure made up of rows and columns.
We will use the head() method to see the first 5 rows of the DataFrame. It looks something like this.
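Here is a minimal sketch of this step. Since the dataset file itself is not shown here, the example reads a tiny in-memory CSV instead. The column names review and label are the ones used later in this article; the three rows are made-up placeholders, not real data.

```python
import io

import pandas as pd

# Stand-in for the real file: three made-up rows with the two columns
# this article refers to later ("review" and "label").
sample_csv = io.StringIO(
    "review,label\n"
    "Great product and fast delivery,pos\n"
    "Stopped working after two days,neg\n"
    "Does exactly what it says,pos\n"
)

# With a real file you would pass its path instead, for example
# pd.read_csv("reviews.csv") (that file name is hypothetical).
df = pd.read_csv(sample_csv)

# head() returns the first 5 rows by default (this sample only has 3).
print(df.head())
```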
Import the NLTK modules
Natural Language ToolKit (NLTK) is a Python library that contains several useful packages for Natural Language Processing. In layman’s terms, NLTK makes it easier for us to help machines work with natural-language text. In case you want to explore it further, I am leaving a link to its documentation here.
We will import NLTK and all the required modules of NLTK that we need to clean and process our data.
Lower casing the text
It is our first step of data cleaning.
We search through all the text data and convert all the text into lower case letters. We use the built-in method lower() for this.
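On a pandas column, lower() is reached through the .str accessor. A small sketch, assuming the text lives in a column called review (the sample rows are made up):

```python
import pandas as pd

# Two made-up reviews with mixed casing.
df = pd.DataFrame({"review": ["GREAT Product", "Not Worth The Money"]})

# .str.lower() applies Python's str.lower() to every value in the column.
df["review"] = df["review"].str.lower()

print(df["review"].tolist())
```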
Expanding the Contractions
Contractions are shortened forms of words, such as “don’t” for “do not”. Unstructured, informal text contains a lot of them.
So, we will install and import the Python library contractions and use the fix() function to expand the contractions.
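A short sketch of how fix() can be used. The contractions package is a third-party install (pip install contractions). So that the example still runs without it, the code falls back to a tiny hand-written map; that fallback is only an illustration and is nowhere near complete.

```python
try:
    # Third-party package: pip install contractions
    import contractions

    def expand(text):
        """Expand contractions using the library's fix() function."""
        return contractions.fix(text)

except ImportError:
    # Fallback so the sketch runs without the package installed.
    # This map is a hand-written illustration, not the library's data.
    SMALL_MAP = {"you're": "you are", "don't": "do not", "it's": "it is"}

    def expand(text):
        """Expand the few contractions listed in SMALL_MAP."""
        return " ".join(SMALL_MAP.get(word, word) for word in text.split())


print(expand("you're happy now"))
```

To run it over the whole column, something like df["review"].apply(expand) should do, assuming the DataFrame and column names used in this article.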
Removing Punctuation marks and Special Characters
After expanding all the contracted words, we will focus on removing the special characters and punctuation marks. First, we import the string module, and then we write a function that goes through the text and replaces every symbol and punctuation mark with a space.
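One possible way to write such a function is sketched below. It uses str.translate() with a translation table instead of an explicit loop, and it also collapses the repeated spaces that are left behind; both choices are mine and not necessarily how the original function was written.

```python
import string

# Translation table that maps every ASCII punctuation character to a space.
# string.punctuation covers ASCII only, so curly quotes and similar
# characters would need to be added separately.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def remove_punctuation(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    spaced = text.translate(PUNCT_TO_SPACE)
    return " ".join(spaced.split())


print(remove_punctuation("wow!!! worth every $, really."))
```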
Removing the Stopwords
Our next step is to remove the stopwords: commonly used words that add little value to our text.
We download the list of stopwords in the English language and then we remove the words “no” and “not” from that list as they can prove to be valuable while classifying the sentiment of the text.
Finally, we write a lambda function and apply it to the ‘review’ column using apply(), so that every stopword it encounters in our text data is removed and replaced with a space.
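The steps above can be sketched as follows. The helper names (load_nltk_stop_words, build_stop_words, remove_stopwords) are hypothetical and exist only for this illustration. The demo at the bottom uses a short hand-written word list so that it runs without downloading anything; for real use you would pass in the NLTK list instead.

```python
def load_nltk_stop_words():
    """Download (if needed) and return NLTK's English stopword list."""
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)
    return stopwords.words("english")


def build_stop_words(base_words, keep=("no", "not")):
    """Return the stopwords as a set, with the negation words taken out."""
    return set(base_words) - set(keep)


def remove_stopwords(text, stop_words):
    """Drop every whitespace-separated word that is in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)


# Hand-written sample list so the demo runs offline. For the real list use
# build_stop_words(load_nltk_stop_words()).
sample_words = ["this", "is", "a", "the", "no", "not"]
stop_words = build_stop_words(sample_words)

print(remove_stopwords("this is not a good phone", stop_words))
```

On the DataFrame this would be applied with a lambda, along the lines of df["review"].apply(lambda text: remove_stopwords(text, stop_words)).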
Then we will break up the text into smaller pieces called tokens. This technique is called Tokenization.
First, we need to download the resource Punkt for applying word tokenization, where the sentence will be broken into a list of individual words. Then we will apply word_tokenize to the review column of our data frame.
The next step in preprocessing is called Lemmatization, where the tokenized words are converted into their root forms. We will download the wordnet resource and then call the WordNetLemmatizer() and apply it to the tokenized word lists in the review column.
After Lemmatization is complete, we will simply use astype() and convert the data type in the review column from list object to string.
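A sketch of tokenization, lemmatization, and the conversion back to a string, under a few assumptions. The helper names are hypothetical. The tokenizer and lemmatizer are passed in as plain functions, so the logic can be tried out without downloading the NLTK resources. The last step joins the lemmas with spaces instead of calling astype(), because converting a list with astype(str) keeps the brackets and quotes of the list’s printed form; that substitution is mine.

```python
def load_nltk_tools():
    """Download the NLTK resources and return (tokenize, lemmatize) functions."""
    import nltk
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    # Newer NLTK releases may ask for "punkt_tab" instead of "punkt".
    nltk.download("punkt", quiet=True)
    nltk.download("wordnet", quiet=True)
    return word_tokenize, WordNetLemmatizer().lemmatize


def preprocess(text, tokenize, lemmatize):
    """Tokenize the text, lemmatize each token, and join the result."""
    tokens = tokenize(text)
    lemmas = [lemmatize(token) for token in tokens]
    return " ".join(lemmas)


# Offline demo with crude stand-ins: str.split as the tokenizer and
# "strip a trailing s" as the lemmatizer. Neither is what NLTK does.
print(preprocess("cats chase mice", str.split, lambda word: word.rstrip("s")))
```

With NLTK available, you would call tokenize, lemmatize = load_nltk_tools() once and then df["review"].apply(lambda text: preprocess(text, tokenize, lemmatize)).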
Creating the Features and Target variables and splitting the data into Train and Test sets
We will create x as the features variable and assign the values of the review column to it using the iloc indexer. Similarly, we will create y as the target variable and assign the values of the label column to it.
After that, we will import train_test_split from sklearn.model_selection and use it to split the data into training and testing sets.
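A minimal sketch of both steps on made-up data. It assumes review is the first column and label the second; the test_size and random_state values are illustrative choices, not the ones used for the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows; "review" and "label" are the column names used in this article.
df = pd.DataFrame(
    {
        "review": ["great phone", "bad battery", "love it", "waste of money", "works well"],
        "label": ["pos", "neg", "pos", "neg", "pos"],
    }
)

x = df.iloc[:, 0]  # first column: the review text (features)
y = df.iloc[:, 1]  # second column: the label (target)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

print(len(x_train), len(x_test))
```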
Applying the TFIDF Vectorizer
TFIDF stands for Term Frequency-Inverse Document Frequency. It gives a word a higher weight the more often it appears in a document, and a lower weight the more documents it appears in. The machine cannot understand text data, so we need a way to convert the text into numerical data that it can work with. Using the TFIDF Vectorizer, the text is transformed into feature vectors, which can be provided as input to the Machine Learning estimator.
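We import the TfidfVectorizer from sklearn.feature_extraction.text, fit it on the training features using fit_transform(), and then apply the same fitted vectorizer to the test features using transform(). Calling fit_transform() on the test data as well would learn a separate vocabulary, and the columns of the two matrices would no longer match. We store the transformed values in two new variables, namely x_train_tfidf and x_test_tfidf.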
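A small sketch of the vectorizing step on made-up, already-cleaned text. Note that the test text goes through transform() with the vectorizer fitted on the training text, so that both matrices end up with the same columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already-cleaned reviews standing in for the real split.
x_train = ["great phone love it", "bad battery waste money", "works well great value"]
x_test = ["great battery", "waste"]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training text only...
x_train_tfidf = vectorizer.fit_transform(x_train)
# ...then reuse them for the test text, so both matrices share the same columns.
x_test_tfidf = vectorizer.transform(x_test)

print(x_train_tfidf.shape, x_test_tfidf.shape)
```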
Fitting the estimator and predicting the test set values
Now that our text data is cleaned, preprocessed, and converted into a form that the machine can understand, we will fit a Machine Learning model on it.
We will implement the Support Vector Machine (SVM) classifier from Scikit-Learn for this purpose. We fit the model on the training data using fit() and then predict the labels of the test data using predict().
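A minimal sketch of fitting and predicting, on made-up data. Scikit-Learn offers more than one SVM classifier; this example assumes SVC with a linear kernel, which may not be the exact class or settings used for the real model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Made-up, already-cleaned training and test text.
x_train = ["great phone love it", "bad battery waste money",
           "works well great value", "terrible quality broke fast"]
y_train = ["pos", "neg", "pos", "neg"]
x_test = ["great value", "terrible battery"]

vectorizer = TfidfVectorizer()
x_train_tfidf = vectorizer.fit_transform(x_train)
x_test_tfidf = vectorizer.transform(x_test)

# The linear kernel is an assumption; it is a common choice for sparse
# TF-IDF features.
classifier = SVC(kernel="linear")
classifier.fit(x_train_tfidf, y_train)

# One predicted label per test review.
y_pred = classifier.predict(x_test_tfidf)
print(list(y_pred))
```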
Evaluating the model
Once our sentiment analysis classifier is ready, we will evaluate it to check how well it performs. We will use three measures for this: Accuracy score, Confusion matrix, and Classification report. We will import the required functions from sklearn.metrics and evaluate our model on these metrics.
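The three measures can be tried out on a few hand-written labels, as sketched below. The y_test and y_pred values are made up for illustration and are not results from the real model.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up true labels and predictions (3 of the 4 predictions are correct).
y_test = ["pos", "neg", "pos", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]

# Fraction of predictions that match the true labels.
accuracy = accuracy_score(y_test, y_pred)

# Rows are true labels, columns are predicted labels, in the order given.
matrix = confusion_matrix(y_test, y_pred, labels=["neg", "pos"])

print(accuracy)
print(matrix)
# Precision, recall, and F1-score for each class.
print(classification_report(y_test, y_pred))
```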
Predicting the sentiment of any given text
Take any text or sentence, transform it using the fitted TF-IDF vectorizer, and predict the sentiment of the text as ‘pos’ or ‘neg’.
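A small end-to-end sketch of this last step. The training data is made up, predict_sentiment is a hypothetical helper name, and SVC with a linear kernel is an assumption about the classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Made-up, already-cleaned training data.
train_texts = ["great phone love it", "bad battery waste money",
               "works well great value", "terrible quality broke fast"]
train_labels = ["pos", "neg", "pos", "neg"]

vectorizer = TfidfVectorizer()
classifier = SVC(kernel="linear")  # kernel choice is an assumption
classifier.fit(vectorizer.fit_transform(train_texts), train_labels)


def predict_sentiment(text):
    """Vectorize one piece of text with the fitted vectorizer and classify it."""
    features = vectorizer.transform([text])  # transform() expects a list of texts
    return classifier.predict(features)[0]


print(predict_sentiment("love this great phone"))
```

For the prediction to be meaningful, the new text should go through the same cleaning steps as the training data before it is passed to the vectorizer.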
Congratulations! You have successfully built your first Sentiment Analysis Classifier model!
If you want to take a look at the full code, follow the link to the GitHub account.