Label large text files with AI and classic clustering methods. The system uses OpenAI models together with scikit-learn, HDBSCAN, and PySpark to process your data, add new category columns, and generate detailed reports.
```bash
git clone https://github.com/JuanLara18/Text-Classification-System.git
cd Text-Classification-System

python3 -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate

python setup.py
```
This command creates the basic folders, a sample `config.yaml`, and a `.env` file.
```bash
pip install -r requirements.txt
```
Edit `.env` and set:

```bash
OPENAI_API_KEY=sk-your-key-here
```
Open `config.yaml` to tell the system how it should label your text. A *perspective* is a set of rules for one classification task. You can define several perspectives in the file and run them all at once.
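For orientation, here is a minimal sketch of a single perspective (the field names follow the full examples later in this file; the perspective name itself is up to you):

```yaml
clustering_perspectives:
  my_classifier:                   # any name you like
    type: "openai_classification"  # or "clustering"
    columns: ["text_column"]       # which columns to read
    target_categories: ["Cat A", "Cat B"]
    output_column: "assigned_category"
```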
**AI classification with ChatGPT**

- Set `type: "openai_classification"`.
- List the `columns` you want to classify and the `target_categories` you expect (e.g. job titles or product types).
- Optionally tweak the `classification_config` section to change the prompt or batch size.
**Traditional clustering**

- Set `type: "clustering"` and choose an `algorithm` such as `hdbscan` or `kmeans`.
- Provide algorithm parameters under `params` if needed.
- Use this when you do not have predefined categories and want to discover groups automatically.
**Hybrid approach**

- Mix AI and clustering perspectives in the same file (see the sketch below).
- Useful for comparing results and refining your categories.
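As a sketch, a hybrid configuration simply lists one perspective of each type under `clustering_perspectives` (the names and categories here are placeholders, reusing fields from the examples further down):

```yaml
clustering_perspectives:
  job_categories:                  # AI classification perspective
    type: "openai_classification"
    columns: ["job_title"]
    target_categories: ["Engineering", "Sales", "Marketing"]
    output_column: "job_category"

  content_clusters:                # traditional clustering perspective
    type: "clustering"
    algorithm: "hdbscan"
    columns: ["job_description"]
    params:
      min_cluster_size: 50
    output_column: "discovered_cluster"
```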
Remember to update `input_file`, `output_file`, `text_columns`, and any category lists or algorithm settings relevant to your use case.
```bash
python main.py --config config.yaml
```

Your classified data will be saved in the `output/` folder.
```bash
# Basic usage
python main.py --config config.yaml

# Force recalculation (ignore cache)
python main.py --config config.yaml --force-recalculate

# Test with debug output
python main.py --config config.yaml --log-level debug

# Process specific input/output files
python main.py --config config.yaml --input data/myfile.dta --output results/classified.dta
```
The system adds new columns to your data, along with supporting outputs:

- `{classification_name}` - the assigned category
- Detailed HTML reports with visualizations
- Cost tracking and performance metrics

Example: if you classify job positions, you'll get a new column like `job_category` with values like "Engineering", "Sales", etc.
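For instance, a classified jobs file might look like this (values are illustrative):

| position_title          | job_category |
|-------------------------|--------------|
| Senior Backend Engineer | Engineering  |
| Account Executive       | Sales        |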
```yaml
# Required: Your data and columns
input_file: "path/to/your/data.dta"
output_file: "path/to/output.dta"
text_columns: ["column1", "column2"]

# Required: Classification setup
clustering_perspectives:
  your_classifier_name:
    type: "openai_classification"
    columns: ["column1"]                  # Which columns to classify
    target_categories: ["Cat1", "Cat2"]   # Your categories
    output_column: "new_category_column"  # Name of new column
```
- Stata files (`.dta`) - Primary format
- CSV files (`.csv`) - Basic support
- Excel files (`.xlsx`) - Convert to Stata first (see the sketch below)
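One way to do that conversion, sketched with pandas (assumes `pandas` and `openpyxl` are installed; file names are placeholders):

```python
# Sketch: convert an Excel file to Stata so the system can read it.
import pandas as pd

df = pd.read_excel("input/myfile.xlsx")             # load the Excel sheet
df.to_stata("input/myfile.dta", write_index=False)  # write a .dta file
```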
The system includes built-in cost controls:

```yaml
ai_classification:
  cost_management:
    max_cost_per_run: 10.0  # Stop if cost exceeds $10
```
Typical costs:
- Small dataset (1,000 rows): ~$0.50
- Medium dataset (10,000 rows): ~$2.00
- Large dataset (100,000 rows): ~$15.00
```yaml
performance:
  batch_size: 50          # Process 50 items at once
  parallel_jobs: 4        # Use 4 CPU cores
  cache_embeddings: true  # Cache results to avoid reprocessing
```
You can create multiple classification perspectives:

```yaml
clustering_perspectives:
  job_categories:
    type: "openai_classification"
    columns: ["job_title"]
    target_categories: ["Engineering", "Sales", "Marketing"]
    output_column: "job_category"

  skill_levels:
    type: "openai_classification"
    columns: ["job_description"]
    target_categories: ["Entry Level", "Mid Level", "Senior Level"]
    output_column: "skill_level"
```
For exploratory analysis without predefined categories:

```yaml
clustering_perspectives:
  content_clusters:
    type: "clustering"    # Traditional clustering
    algorithm: "hdbscan"  # or "kmeans"
    columns: ["text_column"]
    params:
      min_cluster_size: 50
    output_column: "discovered_cluster"
```
Customize how the AI classifies your data:

```yaml
classification_config:
  prompt_template: |
    You are an expert in job classification.
    Classify this job position into exactly one category.

    Categories: {categories}

    Job Position: {text}

    Consider the job title, responsibilities, and required skills.
    Respond with only the category name.
```
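At run time the `{categories}` and `{text}` placeholders are filled in for each row, so the model receives something like this (illustrative):

```text
You are an expert in job classification.
Classify this job position into exactly one category.

Categories: Engineering, Sales, Marketing

Job Position: Senior Backend Engineer

Consider the job title, responsibilities, and required skills.
Respond with only the category name.
```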
"API key not found"
# Check your .env file
cat .env
# Should show: OPENAI_API_KEY=sk-...
"File not found"
# Check your file path in config.yaml
ls input/ # List files in input directory
"Memory error"
# Reduce batch size in config.yaml
performance:
batch_size: 25 # Reduce from 50
"Too expensive"
# Set lower cost limit
ai_classification:
cost_management:
max_cost_per_run: 5.0 # Reduce from 10.0
- Check logs: Look in `logs/classification.log`
- Enable debug mode: Use `--log-level debug`
- Test a small sample: Process just 100 rows first (see the sketch below)
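One way to create such a sample, sketched with pandas (file names are placeholders):

```python
# Sketch: write a 100-row sample of your data for a cheap test run.
import pandas as pd

pd.read_stata("input/myfile.dta").head(100).to_stata(
    "input/sample.dta", write_index=False
)
```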
- Text Processing: NLTK + custom preprocessing
- AI Classification: OpenAI GPT models with intelligent caching
- Traditional Clustering: scikit-learn, HDBSCAN, UMAP
- Data Processing: PySpark for large datasets
- Unique Value Optimization: Dramatically reduces API calls by processing only unique text values (illustrated below)
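The idea behind that optimization, in illustrative Python (not the system's actual code; `classify` stands in for a single API call):

```python
# Illustrative: classify each distinct value once, then map results back to all rows.
def classify(text: str) -> str:
    """Stand-in for a real OpenAI API call."""
    return "Engineering" if "Engineer" in text else "Other"

texts = ["Engineer II", "Engineer II", "Office Manager", "Engineer II"]
unique_labels = {t: classify(t) for t in set(texts)}  # 2 calls instead of 4
assigned = [unique_labels[t] for t in texts]          # map back to every row
print(assigned)
```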
- Minimum: 4GB RAM, 2 CPU cores
- Recommended: 8GB RAM, 4 CPU cores
- Large datasets: 16GB+ RAM, 8+ CPU cores
```yaml
feature_extraction:
  method: "embedding"  # or "tfidf", "hybrid"
  embedding:
    model: "sentence-transformers"
    sentence_transformers:
      model_name: "all-MiniLM-L6-v2"  # Fast, good quality
```
- AI Classification: GPT-4o-mini, GPT-3.5-turbo, GPT-4
- Traditional Clustering: K-Means, HDBSCAN, Agglomerative
- Evaluation Metrics: Silhouette score, Davies-Bouldin index (computed as sketched below)
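For reference, a minimal sketch of how those two metrics are computed with scikit-learn (random vectors stand in for text embeddings):

```python
# Sketch: silhouette and Davies-Bouldin scores for a K-Means clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

X = np.random.rand(200, 8)  # stand-in for embedding vectors
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("Silhouette:", silhouette_score(X, labels))          # higher is better
print("Davies-Bouldin:", davies_bouldin_score(X, labels))  # lower is better
```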
input_file: "data/hr_data.dta"
text_columns: ["position_title", "job_description"]
clustering_perspectives:
job_classifier:
type: "openai_classification"
columns: ["position_title"]
target_categories:
- "Software Engineering"
- "Data Science"
- "Product Management"
- "Sales"
- "Marketing"
- "Human Resources"
- "Finance"
- "Operations"
- "Other"
output_column: "job_category"
input_file: "data/feedback.dta"
text_columns: ["customer_comment"]
clustering_perspectives:
sentiment_classifier:
type: "openai_classification"
columns: ["customer_comment"]
target_categories:
- "Positive"
- "Negative"
- "Neutral"
- "Feature Request"
- "Bug Report"
output_column: "feedback_type"
MIT License - see LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
Need help? Create an issue or check the logs in `logs/classification.log`.