AI Text Classification System


Label large text datasets with AI and classical clustering methods. The system combines OpenAI models with scikit-learn, HDBSCAN and PySpark to process your data, add new category columns and generate detailed reports.

Quick Start

0. Clone this repository

git clone https://github.com/JuanLara18/Text-Classification-System.git
cd Text-Classification-System

1. Create and activate a virtual environment

python3 -m venv venv
source venv/bin/activate    # On Windows use: venv\Scripts\activate

2. Run the setup script

python setup.py

This command creates the basic folders, a sample config.yaml and a .env file.

3. Install dependencies

pip install -r requirements.txt

4. Add your OpenAI key

Edit .env and set:

OPENAI_API_KEY=sk-your-key-here

5. Configure the classification

Open config.yaml to tell the system how it should label your text. A perspective is a set of rules for one classification task. You can define several perspectives in the file and run them all at once.

AI classification with ChatGPT

  • Set type: "openai_classification".
  • List the columns you want to classify and the target_categories you expect (e.g. job titles or product types).
  • Optionally tweak the classification_config section to change the prompt or batch size.

Traditional clustering

  • Set type: "clustering" and choose an algorithm such as hdbscan or kmeans.
  • Provide algorithm parameters under params if needed.
  • Use this when you do not have predefined categories and want to discover groups automatically.

Hybrid approach

  • Mix both AI and clustering perspectives in the same file.
  • Useful for comparing results and refining your categories (see the example config below).

Remember to update input_file, output_file, text_columns and any category lists or algorithm settings relevant to your use case.
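
For example, a config.yaml fragment that mixes both perspective types could look like this (perspective, column and category names here are hypothetical):

clustering_perspectives:
  product_labels:                       # AI perspective
    type: "openai_classification"
    columns: ["product_description"]
    target_categories: ["Hardware", "Software", "Services"]
    output_column: "product_category"

  discovered_groups:                    # clustering perspective
    type: "clustering"
    algorithm: "hdbscan"
    columns: ["product_description"]
    params:
      min_cluster_size: 50
    output_column: "product_cluster"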

6. Run the classifier

python main.py --config config.yaml

Your classified data will be saved in the output/ folder.


Essential Commands

# Basic usage
python main.py --config config.yaml

# Force recalculation (ignore cache)
python main.py --config config.yaml --force-recalculate

# Test with debug output
python main.py --config config.yaml --log-level debug

# Process specific input/output files
python main.py --config config.yaml --input data/myfile.dta --output results/classified.dta

What You Get

The system adds new columns to your data and generates supporting outputs:

  • {output_column} - the assigned category from each perspective
  • Detailed HTML reports with visualizations
  • Cost tracking and performance metrics

Example: If you classify job positions, you'll get a new column like job_category with values like "Engineering", "Sales", etc.
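
To spot-check the results after a run, you can open the output with pandas (file and column names here are hypothetical):

import pandas as pd

# Load the classified output and inspect the added category column
df = pd.read_stata("output/classified.dta")
print(df[["job_title", "job_category"]].head())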


Configuration Guide

Basic Configuration Structure

# Required: Your data and columns
input_file: "path/to/your/data.dta"
output_file: "path/to/output.dta" 
text_columns: ["column1", "column2"]

# Required: Classification setup
clustering_perspectives:
  your_classifier_name:
    type: "openai_classification"
    columns: ["column1"]              # Which columns to classify
    target_categories: ["Cat1", "Cat2"]  # Your categories
    output_column: "new_category_column"  # Name of new column

Supported File Formats

  • Stata files (.dta) - Primary format
  • CSV files (.csv) - Basic support
  • Excel files (.xlsx) - Convert to Stata first (see the snippet below)
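
For Excel inputs, a one-off conversion with pandas is enough (paths are hypothetical; requires pandas and openpyxl):

import pandas as pd

# Convert an Excel sheet into the Stata format the system reads natively
pd.read_excel("data/input.xlsx").to_stata("data/input.dta", write_index=False)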

Cost Management

The system includes built-in cost controls:

ai_classification:
  cost_management:
    max_cost_per_run: 10.0  # Stop if cost exceeds $10

Typical costs:

  • Small dataset (1,000 rows): ~$0.50
  • Medium dataset (10,000 rows): ~$2.00
  • Large dataset (100,000 rows): ~$15.00
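
These figures scale roughly with the number of unique text values, the model, and the prompt length. As a back-of-envelope sketch (token counts and prices are illustrative assumptions, not the system's billing logic):

# Rough input-token cost estimate; every number here is an assumption
unique_texts = 10_000             # distinct values actually sent to the API
tokens_per_request = 150          # prompt template + text (assumed)
price_per_1m_input_tokens = 0.15  # e.g. gpt-4o-mini; verify current pricing

cost = unique_texts * tokens_per_request / 1_000_000 * price_per_1m_input_tokens
print(f"~${cost:.2f} in input tokens")  # output tokens add a bit more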

Performance Settings

performance:
  batch_size: 50        # Process 50 items at once
  parallel_jobs: 4      # Use 4 CPU cores
  cache_embeddings: true # Cache results to avoid reprocessing

Advanced Usage

Multiple Classifications

You can create multiple classification perspectives:

clustering_perspectives:
  job_categories:
    columns: ["job_title"]
    target_categories: ["Engineering", "Sales", "Marketing"]
    output_column: "job_category"
  
  skill_levels:
    columns: ["job_description"] 
    target_categories: ["Entry Level", "Mid Level", "Senior Level"]
    output_column: "skill_level"

Traditional Clustering (No AI)

For exploratory analysis without predefined categories:

clustering_perspectives:
  content_clusters:
    type: "clustering"              # Traditional clustering
    algorithm: "hdbscan"            # or "kmeans"
    columns: ["text_column"]
    params:
      min_cluster_size: 50
    output_column: "discovered_cluster"

Custom Prompts

Customize how the AI classifies your data:

classification_config:
  prompt_template: |
    You are an expert in job classification.
    Classify this job position into exactly one category.
    
    Categories: {categories}
    Job Position: {text}
    
    Consider the job title, responsibilities, and required skills.
    Respond with only the category name.

Troubleshooting

Common Issues

"API key not found"

# Check your .env file
cat .env
# Should show: OPENAI_API_KEY=sk-...

"File not found"

# Check your file path in config.yaml
ls input/  # List files in input directory

"Memory error"

# Reduce batch size in config.yaml
performance:
  batch_size: 25  # Reduce from 50

"Too expensive"

# Set lower cost limit
ai_classification:
  cost_management:
    max_cost_per_run: 5.0  # Reduce from 10.0

Getting Help

  1. Check logs: Look in logs/classification.log
  2. Enable debug mode: Use --log-level debug
  3. Test a small sample: Process just 100 rows first (see the sketch below)
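
One quick way to build a 100-row sample for a dry run (paths hypothetical; requires pandas):

import pandas as pd

# Slice the first 100 rows into a small test file
pd.read_stata("data/hr_data.dta").head(100).to_stata(
    "data/hr_data_sample.dta", write_index=False
)

Then point the classifier at the sample with --input, as shown in Essential Commands.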

Technical Details

System Architecture

  • Text Processing: NLTK + custom preprocessing
  • AI Classification: OpenAI GPT models with intelligent caching
  • Traditional Clustering: scikit-learn, HDBSCAN, UMAP
  • Data Processing: PySpark for large datasets
  • Unique Value Optimization: Dramatically reduces API calls by classifying only unique text values (sketched below)
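
The unique-value optimization boils down to classifying each distinct text once and mapping the results back to every row, roughly like this (a minimal sketch, not the system's actual code; classify_once is hypothetical):

import pandas as pd

def classify_once(text: str) -> str:
    # Stand-in for a single OpenAI call per unique value (hypothetical)
    return "Engineering" if "Engineer" in text else "Other"

df = pd.DataFrame({"job_title": ["Engineer", "Sales Rep", "Engineer"]})

# One call per distinct value instead of one per row
labels = {t: classify_once(t) for t in df["job_title"].dropna().unique()}
df["job_category"] = df["job_title"].map(labels)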

Resource Requirements

  • Minimum: 4GB RAM, 2 CPU cores
  • Recommended: 8GB RAM, 4 CPU cores
  • Large datasets: 16GB+ RAM, 8+ CPU cores

Feature Extraction Methods

feature_extraction:
  method: "embedding"  # or "tfidf", "hybrid"
  embedding:
    model: "sentence-transformers"
    sentence_transformers:
      model_name: "all-MiniLM-L6-v2"  # Fast, good quality
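
What the embedding step amounts to, as a standalone sketch (requires the sentence-transformers package; example texts are made up):

from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 maps each text to a 384-dimensional vector
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["Senior Data Scientist", "Account Executive"])
print(vectors.shape)  # (2, 384)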

Supported Algorithms

  • AI Classification: GPT-4o-mini, GPT-3.5-turbo, GPT-4
  • Traditional Clustering: K-Means, HDBSCAN, Agglomerative
  • Evaluation Metrics: Silhouette score, Davies-Bouldin index
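
Both evaluation metrics are available in scikit-learn; a standalone sketch on synthetic data (not the system's internal evaluation code):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = rng.random((200, 8))                      # stand-in feature matrix
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

print("silhouette:", silhouette_score(X, labels))          # higher is better
print("Davies-Bouldin:", davies_bouldin_score(X, labels))  # lower is better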

Examples

Example 1: Job Position Classification

input_file: "data/hr_data.dta"
text_columns: ["position_title", "job_description"]

clustering_perspectives:
  job_classifier:
    type: "openai_classification"
    columns: ["position_title"]
    target_categories:
      - "Software Engineering"
      - "Data Science" 
      - "Product Management"
      - "Sales"
      - "Marketing"
      - "Human Resources"
      - "Finance"
      - "Operations"
      - "Other"
    output_column: "job_category"

Example 2: Customer Feedback Analysis

input_file: "data/feedback.dta"
text_columns: ["customer_comment"]

clustering_perspectives:
  sentiment_classifier:
    type: "openai_classification"
    columns: ["customer_comment"]
    target_categories:
      - "Positive"
      - "Negative" 
      - "Neutral"
      - "Feature Request"
      - "Bug Report"
    output_column: "feedback_type"

License

MIT License - see LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

Need help? Create an issue or check the logs in logs/classification.log
