A Python tool that transforms clean datasets into realistic messy datasets for testing data cleaning and quality processes. The generator analyzes your input data and creates expanded datasets containing the types of data quality issues commonly found in real-world scenarios.
This tool is designed for:
- Data Engineers testing ETL pipelines and data validation rules
- Data Scientists training data cleaning models and algorithms
- QA Teams validating data processing systems
- Developers testing applications with realistic messy data
- Students learning data cleaning techniques with hands-on examples
- Automatically detects column types (numeric, datetime, categorical, text)
- Preserves data patterns and distributions from your original dataset
- Generates realistic variations based on existing data patterns
- Smart Duplicates: Creates both exact and near-duplicates with subtle variations
- Strategic Nulls: Introduces missing values with realistic patterns
- Range Violations: Generates out-of-bounds values for numeric columns
- Timestamp Corruption: Creates invalid dates and time anomalies
- Text Corruption: Introduces various text quality issues (case changes, special characters, truncation)
- Full control over messiness rates for each type of issue
- Configurable output size (scale up or down from original dataset)
- Flexible input/output formats (CSV, JSON)
Requirements:
- Python 3.7 or higher
- Minimum 2GB RAM (4GB+ recommended for large datasets)
- Disk space: ~3x the size of your input file
```bash
pip install pandas numpy
```
Core Dependencies:
- `pandas` >= 1.3.0: Data manipulation and analysis
- `numpy` >= 1.20.0: Numerical computing
- `datetime`: Date/time handling (built-in)
- `random`: Random data generation (built-in)
- `string`: String manipulation (built-in)
- `os`: Operating system interface (built-in)
- `argparse`: Command-line argument parsing (built-in)
1. Download `messy_data_generator.py`
2. Install dependencies: `pip install pandas numpy`
3. Run the script directly
```bash
git clone <repository-url>
cd messy-data-generator
pip install pandas numpy
```

```bash
# Create virtual environment
python -m venv messy_data_env

# Activate environment
# On Windows:
messy_data_env\Scripts\activate
# On macOS/Linux:
source messy_data_env/bin/activate

# Install dependencies
pip install pandas numpy
```
```bash
python messy_data_generator.py input_file.csv
```

```bash
python messy_data_generator.py input_file.csv \
  --output messy_output.csv \
  --rows 50000 \
  --duplicates 0.20 \
  --nulls 0.15 \
  --wrong-ranges 0.10 \
  --wrong-timestamps 0.08 \
  --text-corruption 0.12
```
| Argument | Short | Description | Default | Range |
|---|---|---|---|---|
| `input_file` | - | Path to input CSV/JSON file | Required | - |
| `--output` | `-o` | Output file path | `messy_data.csv` | - |
| `--rows` | `-r` | Target number of output rows | 10000 | 1+ |
| `--duplicates` | `-d` | Duplicate rate (fraction) | 0.15 | 0.0-1.0 |
| `--nulls` | `-n` | Null value rate (fraction) | 0.10 | 0.0-1.0 |
| `--wrong-ranges` | `-w` | Wrong range value rate | 0.08 | 0.0-1.0 |
| `--wrong-timestamps` | `-t` | Invalid timestamp rate | 0.05 | 0.0-1.0 |
| `--text-corruption` | `-c` | Text corruption rate | 0.05 | 0.0-1.0 |
```python
import pandas as pd
from messy_data_generator import AdvancedMessyDataGenerator

# Load your clean data
clean_df = pd.read_csv('clean_data.csv')

# Initialize generator
generator = AdvancedMessyDataGenerator(clean_df)

# Generate messy data
messy_df = generator.generate_messy_data(
    target_rows=15000,
    duplicate_rate=0.18,
    null_rate=0.12,
    wrong_range_rate=0.10,
    wrong_timestamp_rate=0.06,
    text_corruption_rate=0.08,
)

# Analyze the results
analysis = generator.analyze_data_quality(messy_df)

# Save results
messy_df.to_csv('generated_messy_data.csv', index=False)
```
The generator automatically identifies and handles:
- Numeric Columns: Integers, floats, percentages
- Datetime Columns: Timestamps, dates, times
- Categorical Columns: Limited unique values, categories
- Text Columns: Free-form text, names, descriptions
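As a rough illustration, this kind of column-type detection can be sketched with pandas' dtype checks plus a uniqueness ratio to separate categorical from free-form text. The function name and the 5% threshold below are assumptions for the sketch, not the tool's actual internals:

```python
import pandas as pd

def detect_column_types(df: pd.DataFrame, cat_threshold: float = 0.05) -> dict:
    """Classify each column as numeric, datetime, categorical, or text."""
    types = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            types[col] = "numeric"
        elif pd.api.types.is_datetime64_any_dtype(df[col]):
            types[col] = "datetime"
        # Few distinct values relative to row count suggests a category
        elif df[col].nunique() / max(len(df), 1) <= cat_threshold:
            types[col] = "categorical"
        else:
            types[col] = "text"
    return types
```

A column of mostly unique strings (names, descriptions) falls through to "text", while a low-cardinality string column is treated as categorical.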
- Exact Duplicates: Perfect copies of existing rows
- Near Duplicates: Rows with subtle differences like:
  - Trailing/leading spaces
  - Case variations (UPPER, lower, Mixed)
  - Missing spaces or punctuation
  - Minor spelling variations
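A near-duplicate of this kind can be produced by copying a row and applying one subtle string perturbation. The helper below is a minimal sketch of the idea; the function name and the specific variants chosen are illustrative, not the tool's implementation:

```python
import random

def make_near_duplicate(row: dict, rng: random.Random) -> dict:
    """Copy a row and perturb one text field slightly."""
    dup = dict(row)
    text_cols = [k for k, v in dup.items() if isinstance(v, str) and v]
    if not text_cols:
        return dup  # nothing to perturb in an all-numeric row
    col = rng.choice(text_cols)
    variant = rng.choice(["space", "case", "strip_spaces"])
    if variant == "space":
        dup[col] = " " + dup[col]             # leading space
    elif variant == "case":
        dup[col] = dup[col].upper()           # case variation
    else:
        dup[col] = dup[col].replace(" ", "")  # missing spaces
    return dup
```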
- Higher probability in text and categorical columns
- Clustered patterns mimicking real-world data loss
- Random distribution across all column types
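These missing-value patterns can be sketched as per-column random masking, with a higher probability for designated columns. The doubling factor and function signature below are assumptions made for the example:

```python
import numpy as np
import pandas as pd

def inject_nulls(df: pd.DataFrame, rate: float = 0.10,
                 boost_cols=(), seed: int = 0) -> pd.DataFrame:
    """Blank out roughly `rate` of each column, doubled for boost_cols."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        p = rate * (2.0 if col in boost_cols else 1.0)
        hit = rng.random(len(out)) < p
        out[col] = out[col].mask(hit)  # masked cells become NaN
    return out
```

Passing text or categorical column names in `boost_cols` reproduces the "higher probability" pattern described above.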
- Below minimum: Values lower than expected range
- Above maximum: Values higher than expected range
- Extreme outliers: Clearly invalid values (-999999, 999999)
- Edge cases: Zero values where inappropriate
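One way to generate such range violations is to push a fraction of values outside the observed [min, max] interval, including a sentinel outlier. This is a hedged sketch of the idea, not the tool's exact logic:

```python
import numpy as np

def corrupt_range(values, rate: float = 0.08, seed: int = 0) -> np.ndarray:
    """Replace a fraction of numeric values with out-of-range ones."""
    rng = np.random.default_rng(seed)
    arr = np.asarray(values, dtype=float).copy()
    lo, hi = arr.min(), arr.max()
    span = hi - lo if hi > lo else 1.0
    hit = rng.random(arr.size) < rate
    kind = rng.integers(0, 3, size=arr.size)
    arr[hit & (kind == 0)] = lo - span   # below minimum
    arr[hit & (kind == 1)] = hi + span   # above maximum
    arr[hit & (kind == 2)] = 999999      # extreme sentinel outlier
    return arr
```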
- Historical dates: Years like 1900, 1970 (epoch)
- Future dates: Unrealistic future timestamps
- Invalid dates: NaT (Not a Time) values
- Format inconsistencies: Mixed date formats
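The timestamp corruptions above can be sketched by swapping a fraction of values for an epoch date, a far-future date, or NaT. The specific replacement dates here are illustrative assumptions:

```python
import numpy as np
import pandas as pd

def corrupt_timestamps(ts: pd.Series, rate: float = 0.05,
                       seed: int = 0) -> pd.Series:
    """Replace a fraction of timestamps with epoch, future, or NaT values."""
    rng = np.random.default_rng(seed)
    out = pd.to_datetime(ts).copy()
    bad = [pd.Timestamp("1970-01-01"),   # epoch / historical date
           pd.Timestamp("2099-12-31"),   # unrealistic future date
           pd.NaT]                       # invalid (Not a Time)
    for i in range(len(out)):
        if rng.random() < rate:
            out.iloc[i] = bad[rng.integers(0, len(bad))]
    return out
```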
- Case corruption: Random upper/lowercase changes
- Character replacement: Random character substitution
- Text reversal: Backwards text strings
- Empty strings: Blank values
- Special characters: Addition of ???, !!!, etc.
- Text duplication: Repeated content
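Each of these text corruptions is a simple string transformation; picking one at random per affected value gives the mixed damage described. A minimal sketch (the operation list mirrors the bullets above, but the function itself is illustrative):

```python
import random

def corrupt_text(s: str, rng: random.Random) -> str:
    """Apply one randomly chosen text corruption to a string."""
    ops = [
        lambda t: t.swapcase(),                    # case corruption
        lambda t: t[::-1],                         # text reversal
        lambda t: "",                              # empty string
        lambda t: t + rng.choice(["???", "!!!"]),  # special characters
        lambda t: t + " " + t,                     # text duplication
    ]
    return rng.choice(ops)(s)
```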
- Main Output: `messy_data.csv` (or specified filename)
  - Contains the generated messy dataset
  - Same column structure as input
  - Expanded to specified row count
- Analysis Report: `messy_data_analysis.txt`
  - Data quality summary
  - Duplicate counts and percentages
  - Null value statistics by column
  - Memory usage information
  - Data type analysis
```
Data Quality Analysis Report
==============================
Original file: clean_sales.csv
Generated file: messy_sales.csv
Dataset shape: (10000, 8)
Duplicates: 1500
Null counts: {'customer_name': 450, 'email': 380, 'phone': 290}
Memory usage: 2.34 MB
```
```bash
python messy_data_generator.py data.csv \
  --duplicates 0.05 \
  --nulls 0.03 \
  --wrong-ranges 0.02 \
  --wrong-timestamps 0.01 \
  --text-corruption 0.02
```
```bash
python messy_data_generator.py data.csv \
  --duplicates 0.30 \
  --nulls 0.25 \
  --wrong-ranges 0.20 \
  --wrong-timestamps 0.15 \
  --text-corruption 0.20
```
```bash
python messy_data_generator.py small_sample.csv \
  --rows 1000000 \
  --output large_messy_dataset.csv
```
Problem: Out of memory errors with large datasets
Solution:
- Reduce the `--rows` parameter
- Process in smaller batches
- Increase system RAM
- Use more efficient data types
Problem: "Unsupported file format" error
Solution:
- Ensure input file is `.csv` or `.json`
- Check file encoding (UTF-8 recommended)
- Verify file is not corrupted
Problem: Slow generation speed
Solution:
- Reduce target row count
- Lower messiness rates
- Use SSD storage
- Close other applications
Problem: Rate parameters outside valid range
Solution:
- Keep all rates between 0.0 and 1.0
- Keep the sum of all rates reasonable (below 2.0)
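A pre-flight check for these constraints can be written in a few lines. This validator is a sketch under the bounds stated above; the function name is an assumption:

```python
def validate_rates(**rates: float) -> None:
    """Raise ValueError if any rate is outside [0, 1] or the total is >= 2.0."""
    for name, value in rates.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    total = sum(rates.values())
    if total >= 2.0:
        raise ValueError(f"combined rates {total:.2f} exceed the 2.0 sanity bound")
```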
| Error | Cause | Solution |
|---|---|---|
| `ValueError: Please provide a sample DataFrame` | No input data | Check input file path |
| `FileNotFoundError` | File doesn't exist | Verify file path and name |
| `MemoryError` | Insufficient RAM | Reduce dataset size |
| `UnicodeDecodeError` | File encoding issue | Convert to UTF-8 encoding |
| Input Size | Output Size | RAM Required | Processing Time |
|---|---|---|---|
| < 1MB | < 100K rows | 2GB | < 1 minute |
| 1-10MB | 100K-500K rows | 4GB | 1-5 minutes |
| 10-100MB | 500K-1M rows | 8GB | 5-15 minutes |
| > 100MB | > 1M rows | 16GB+ | 15+ minutes |
- Use appropriate data types in your input CSV
- Remove unnecessary columns before processing
- Start with smaller samples for testing
- Monitor memory usage during generation
- Use SSDs for faster I/O operations
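The "use appropriate data types" tip can be applied with standard pandas techniques before generation: downcasting numeric columns and converting low-cardinality strings to `category`. This helper is a general sketch, not part of the tool:

```python
import pandas as pd

def shrink(df: pd.DataFrame) -> pd.DataFrame:
    """Reduce memory use by downcasting numerics and categorizing strings."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
        # Only categorize columns with relatively few distinct values
        elif out[col].dtype == object and out[col].nunique() < 0.5 * len(out):
            out[col] = out[col].astype("category")
    return out
```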
```bash
# Test ETL pipeline with realistic messy data
python messy_data_generator.py clean_transactions.csv \
  --rows 100000 \
  --output etl_test_data.csv \
  --duplicates 0.20 \
  --nulls 0.15
```

```bash
# Generate training data for data cleaning models
python messy_data_generator.py labeled_dataset.csv \
  --rows 50000 \
  --output ml_training_messy.csv \
  --text-corruption 0.15
```

```bash
# Create stress test data for validation systems
python messy_data_generator.py production_sample.csv \
  --rows 25000 \
  --output qa_stress_test.csv \
  --wrong-ranges 0.25 \
  --wrong-timestamps 0.20
```
- Check existing issues first
- Provide input file sample (if possible)
- Include full error message
- Specify Python version and OS
- Additional messiness types
- New file format support
- Performance improvements
- Additional analysis features
This project is open source. Please check the license file for details.
For support and questions:
- Check this README first
- Review the troubleshooting section
- Create an issue with detailed information
- Include sample data and error messages
Happy Data Messing!