This project offers a deep dive into the NLP and LLM tools and techniques that are most commonly used.
The Fake News Classifier is a machine learning project that identifies and categorizes news articles as either Real or Fake. By leveraging Natural Language Processing (NLP) techniques and supervised learning algorithms, the project preprocesses textual data and predicts its authenticity, achieving 90% accuracy on the test set.
- Pre-processing: NLTK, Regular Expressions, Stemming, Lemmatization, TF-IDF, Bag of Words (BoW), Count Vectorizer
- Models Used:
  - Feature extraction: Porter Stemmer, Count Vectorizer
  - Classification: Multinomial Naïve Bayes
- Dataset: custom or publicly available datasets, with feature extraction capped at 5,000 features
- Performance: achieves 90% accuracy on the test set
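As a concrete illustration of the 5,000-feature cap, here is a minimal sketch of TF-IDF feature extraction with scikit-learn; the two documents are made-up placeholders, not project data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in documents; the real project feeds in cleaned news articles.
docs = [
    "fake news spreads fast online",
    "official report confirms the budget figures",
]

# max_features=5000 keeps only the 5,000 highest-frequency terms,
# mirroring the feature cap described above.
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(docs)

print(X.shape)  # (number of documents, vocabulary size)
```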
- Python
- Pandas, NumPy, Matplotlib
- Scikit-learn
- NLTK
Data Preprocessing:
- Cleaned and tokenized the text using Regular Expressions.
- Removed stop words and punctuation.
- Applied Stemming and Lemmatization for word normalization.
- Extracted features using TF-IDF, BoW, and Count Vectorizer.
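The preprocessing steps above can be sketched roughly as follows. The sentences are placeholders, and scikit-learn's built-in English stop-word list stands in for NLTK's (so the snippet runs without downloading NLTK corpora):

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def clean_text(text):
    # Keep letters only, lowercase, drop stop words, then stem each word.
    words = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return " ".join(stemmer.stem(w) for w in words if w not in ENGLISH_STOP_WORDS)

corpus = [clean_text(t) for t in [
    "Breaking: Scientists discover water on Mars!",
    "You won't believe this one weird trick...",
]]

# Bag-of-Words features, capped at 5,000 terms as in the project setup.
vectorizer = CountVectorizer(max_features=5000)
X = vectorizer.fit_transform(corpus)
```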
Model Training:
- Utilized Porter Stemmer for feature mining.
- Applied Multinomial Naïve Bayes for classification.
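A minimal training sketch with Multinomial Naïve Bayes; the tiny repeated corpus and labels below are hypothetical stand-ins for the real preprocessed dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for the preprocessed articles; 1 = fake, 0 = real.
texts = [
    "aliens built the pyramids overnight",
    "senate passes annual budget bill",
    "miracle cure hidden by doctors",
    "central bank raises interest rates",
] * 25
labels = [1, 0, 1, 0] * 25

X = CountVectorizer(max_features=5000).fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0
)

clf = MultinomialNB(alpha=1.0)  # alpha=1.0 applies Laplace smoothing
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```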
Evaluation:
- Tested on a separate dataset.
- Visualized results with graphs for accuracy, precision, and recall.
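The evaluation metrics themselves can be computed with scikit-learn before plotting; the labels and predictions below are hypothetical examples:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels on a held-out test set (1 = fake, 0 = real).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows = actual, columns = predicted
print(cm)
print(accuracy_score(y_true, y_pred))   # 0.75
print(precision_score(y_true, y_pred))  # 0.75
print(recall_score(y_true, y_pred))     # 0.75
```

Each score could then be passed to Matplotlib for the bar charts mentioned above.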
Here are some visual insights into the project's performance.

A classification matrix, also known as a confusion matrix, is a table used to evaluate the performance of a classification model. It compares the labels predicted by the model with the actual labels (true values) from the data, providing a summary of prediction results for binary and multi-class classification problems.
For binary classification, the confusion matrix looks like this:
|                     | Predicted Positive (1) | Predicted Negative (0) |
|---------------------|------------------------|------------------------|
| Actual Positive (1) | True Positive (TP)     | False Negative (FN)    |
| Actual Negative (0) | False Positive (FP)    | True Negative (TN)     |

Where:
- True Positive (TP): the number of instances where the model correctly predicted the positive class.
- False Positive (FP): the number of instances where the model incorrectly predicted the positive class (Type I error).
- True Negative (TN): the number of instances where the model correctly predicted the negative class.
- False Negative (FN): the number of instances where the model incorrectly predicted the negative class (Type II error).

From the confusion matrix, several important performance metrics can be derived, such as accuracy, precision, recall, and F1-score.
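Using the four cells of the matrix, these metrics can be derived by hand; the counts here are arbitrary example values:

```python
# Example counts read off a hypothetical confusion matrix.
TP, FP, TN, FN = 3, 1, 3, 1

accuracy = (TP + TN) / (TP + TN + FP + FN)  # share of all predictions that were correct
precision = TP / (TP + FP)                  # share of predicted positives that were right
recall = TP / (TP + FN)                     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75
```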
- Confusion matrix with classification stats for the Passive model.
- Confusion matrix with classification stats for the Multinomial Naive Bayes model.
Distribution of top features extracted from the dataset.
Fake-News-Classifier/
├── data/ # Dataset files
├── notebooks/ # Jupyter notebooks for EDA and development
├── src/ # Python source files
│ ├── preprocessing.py # Data preprocessing code
│ ├── train_model.py # Model training script
│ └── predict.py # Prediction script
├── requirements.txt # Required libraries
├── README.md # Project documentation
└── results/ # Evaluation results and graphs