top of page

Part two: Sentiment Analysis

In order to understand consumer reviews, codes below will help predict the review stars based on the review text.

Step one: Import packages & modules
Screen Shot 2021-12-05 at 2.50.47 PM.png
Step two: Data preparetion

Filtering out the noise information and cleaning the review text by separating abbreviations.

Screen Shot 2021-12-05 at 2.51.06 PM.png
Step three: Data splitting

I set 80% of the data to training and 10% for each validating and 10% for testing. I assigned a random number between 0 and 99 to each review and then said every number below 80 is part of the training group, the number between 80 and 90 is part of the validing group and greater and equal to 90 is part of the testing group.

Screen Shot 2021-12-05 at 2.51.16 PM.png
Step four: text vectorization

I used CountVectorizer to convert the text to a vector of counts. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text. We are then calling it to make every letter lowercase and passing arguments through our function call that determine the ngram_range, mad_df, and min_df.

Screen Shot 2021-12-05 at 2.51.28 PM.png
Screen Shot 2021-12-05 at 2.51.28 PM.png
Step five: TVT Split

Then, I splitted the data into the Training, Validating, and Testing groups. As shown above, 80% of our data is dedicated to training our models, and 10% is given to the validating and 10% for testing data. I checked that all the reviews are assigned to groups.

Screen Shot 2021-12-05 at 2.51.37 PM.png

Prediction through Machine Learning models

KNN Classifier

KNN classification is done by a majority vote to its neighbors. The data is assigned to the class which has the nearest neighbors. The number of nearest neighbors (K) is important. In order to find the best k, I used the elbow method, which is choosing the K with the lowest error rates.

Screen Shot 2021-12-11 at 9.34.45 PM.png

After finding the best k, we can use the KNN model to predict the results. In order to see how well the model actually works, I constructed the confusion matrix and printed the accuracy report. The accuracy of my model is 78%.

Screen Shot 2021-12-05 at 2.51.59 PM.png
Logistic regression

Then I used logistic regression as a classifier. However, by default, logistic regression is used for classification tasks that have two class labels. In my case, I need my model to be able to classify multiple classes. Instead, it requires modification to support multi-class classification problems. 

Screen Shot 2021-12-05 at 2.56.15 PM.png
Screen Shot 2021-12-05 at 2.56.25 PM.png

Finally, I construct the confusion matrix and classification report to see how my model perform. It shows the accuracy of my model is 86%.

Screen Shot 2021-12-05 at 2.56.35 PM.png
bottom of page