A machine learning project for classifying Amazon Fine Food Reviews as positive or negative using text preprocessing, feature extraction, and multiple classification algorithms. Includes EDA, model evaluation, and visualizations. Achieved 89.5% accuracy with Logistic Regression on real-world review data. Dataset: Amazon Fine Food Reviews (Kaggle)

KashifMoin1410/Amazon-Food-Review-Sentiment-Analysis

Amazon Food Review Sentiment Analysis

Overview:

This project performs sentiment analysis on the Amazon Fine Food Reviews dataset. The goal is to classify customer reviews as positive or negative using several machine learning techniques. The project covers data preprocessing, exploratory data analysis (EDA), feature extraction, model building, and evaluation.

Dataset:

  • Source: Kaggle - Amazon Fine Food Reviews
  • Size: 568,454 reviews
  • Attributes:
    • Id: Unique identifier for the review
    • ProductId: Unique identifier for the product
    • UserId: Unique identifier for the user
    • ProfileName: Name of the user
    • HelpfulnessNumerator: Number of users who found the review helpful
    • HelpfulnessDenominator: Number of users who indicated whether they found the review helpful
    • Score: Rating between 1 and 5
    • Time: Timestamp for the review
    • Summary: Brief summary of the review
    • Text: Full text of the review

Objective:

Transform the multiclass rating problem into a binary classification task:

  • Positive Reviews: Ratings of 4 or 5
  • Negative Reviews: Ratings of 1 or 2

Note: Reviews with a rating of 3 are considered neutral and are excluded from the analysis.
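The mapping above can be sketched as a small function. This is a minimal illustration, not the project's actual code: the function name `score_to_label` and the 1/0 encoding for positive/negative are assumptions made here.

```python
from typing import Optional


def score_to_label(score: int) -> Optional[int]:
    """Map a 1-5 star rating to a binary sentiment label.

    Returns 1 for positive (4 or 5), 0 for negative (1 or 2),
    and None for neutral (3), which is later dropped.
    The 1/0 encoding is an assumption for this sketch.
    """
    if score >= 4:
        return 1
    if score <= 2:
        return 0
    return None


# Illustrative ratings, not taken from the dataset
scores = [5, 1, 3, 4, 2]
labels = [score_to_label(s) for s in scores]

# Exclude neutral reviews (rating of 3) from the analysis
kept = [(s, y) for s, y in zip(scores, labels) if y is not None]
print(kept)
```

With pandas, the same function could be applied to the Score column via `df["Score"].map(score_to_label)`, followed by dropping the rows where the label is missing.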

Methodology:

  1. Data Preprocessing
    1. Text Cleaning:
      1. Removal of HTML tags
      2. Conversion to lowercase
      3. Removal of punctuation and special characters
      4. Tokenization
      5. Removal of stop words
      6. Stemming using the Snowball Stemmer
    2. Handling Class Imbalance:
      1. Analyzed the distribution of positive and negative reviews
      2. Implemented techniques to address any imbalance if necessary
  2. Exploratory Data Analysis (EDA)
    1. Visualized the distribution of review scores
    2. Generated word clouds for positive and negative reviews
    3. Analyzed the length of reviews and their correlation with sentiment
    4. Examined the most frequent words in each sentiment category
  3. Feature Extraction
    1. Bag of Words (BoW): Converted each review into a numerical vector of word counts
    2. Term Frequency-Inverse Document Frequency (TF-IDF): Weighted each word by its frequency in a review, scaled down for words that appear in many reviews
  4. Model Building
    1. Implemented and evaluated multiple machine learning models:
      1. Logistic Regression
      2. Support Vector Machine (SVM)
      3. Random Forest Classifier
      4. Naive Bayes Classifier
  5. Model Evaluation
    1. Metrics Used:
      1. Accuracy
      2. Precision
      3. Recall
      4. F1-Score
      5. Confusion Matrix
    2. Cross-Validation:
      1. Performed k-fold cross-validation to check model robustness
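The steps above can be sketched end to end as follows. This is a rough illustration under stated assumptions, not the repository's implementation: the toy reviews are made up, tokenization is a plain whitespace split, scikit-learn's built-in `ENGLISH_STOP_WORDS` stands in for a stop-word list (NLTK's list needs a separate corpus download), and the choices of TF-IDF features, `cv=3`, and `max_iter=1000` are picked only to keep the example small.

```python
import re

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

stemmer = SnowballStemmer("english")


def clean_text(text: str) -> str:
    """Apply the cleaning steps listed in the methodology, in order."""
    text = re.sub(r"<[^>]+>", " ", text)    # remove HTML tags
    text = text.lower()                      # convert to lowercase
    text = re.sub(r"[^a-z\s]", " ", text)   # remove punctuation, digits, special characters
    tokens = text.split()                    # whitespace tokenization (assumption)
    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]  # remove stop words
    return " ".join(stemmer.stem(t) for t in tokens)             # Snowball stemming


# Made-up toy reviews for illustration only (1 = positive, 0 = negative)
texts = [
    "<br />I LOVED these cookies, great taste!",
    "Delicious coffee and fast shipping.",
    "Best tea I have ever bought.",
    "Terrible flavor, stale and bland.",
    "Awful product, the bag arrived broken.",
    "Worst snack ever, total waste of money.",
]
labels = [1, 1, 1, 0, 0, 0]

cleaned = [clean_text(t) for t in texts]

# TF-IDF features feeding a Logistic Regression classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# k-fold cross-validation (k = 3 here only because the toy data is tiny)
scores = cross_val_score(model, cleaned, labels, cv=3, scoring="accuracy")
print(cleaned[0])
print(scores.mean())
```

Swapping `TfidfVectorizer` for `CountVectorizer` would give the Bag of Words variant, and the classifier slot in the pipeline could hold any of the other listed models, such as `LinearSVC`, `RandomForestClassifier`, or `MultinomialNB`. On six toy reviews the printed accuracy says nothing about real performance.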

Results:

  • Best Performing Model: Logistic Regression
  • Accuracy Achieved: 89.5%
  • Precision: 0.90
  • Recall: 0.88
  • F1-Score: 0.89

These metrics indicate that the Logistic Regression model classified the sentiment of Amazon food reviews well, with precision (0.90) and recall (0.88) close to each other.
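As a quick consistency check on the reported numbers, the F1-score is the harmonic mean of precision and recall. The helper name `f1_from` below is hypothetical; the inputs are the precision and recall listed above.

```python
def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall for Logistic Regression
f1 = f1_from(0.90, 0.88)
print(round(f1, 2))  # 0.89
```

The result rounds to 0.89, which agrees with the reported F1-score.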

Dependencies:

  • Python 3
  • pandas
  • numpy
  • matplotlib
  • seaborn
  • scikit-learn
  • nltk

Future Work:

  • Implement deep learning models such as LSTM and BERT, which may improve accuracy
  • Deploy the model using Flask or Streamlit for real-time sentiment analysis
  • Integrate the model into a web application for user-friendly interaction

Acknowledgements:
