This repository provides a comprehensive analysis and utility suite for the Jane Street Real-Time Market Data Forecasting competition on Kaggle. It focuses on Exploratory Data Analysis (EDA), missing value inspection, temporal pattern discovery, and metadata understanding to support downstream modeling.
Build a forecasting model that accurately predicts `responder_6` for multiple financial instruments, using features extracted from market microstructure data, with special consideration for missing data, symbol/date evolution, and non-stationarity.
| File | Description |
|---|---|
| `train.parquet/` | 10-part training set with `date_id`, `time_id`, `symbol_id`, `weight`, 79 features, and 9 responders |
| `features.csv` | Metadata about each feature and their associated tags |
| `responders.csv` | Metadata about each responder (including `responder_6`) |
| `test.parquet/` | Mock single-time-batch test set used via API |
| `lags.parquet/` | Lag-1 values of responders for each `symbol_id` |
| `sample_submission.csv` | Example submission format for predictions |
- **Training Data** (`train.parquet`) – 10 partitions with 79 features and 9 responders.
- **Test Data** (`test.parquet`) – Single date-time batch for API prediction.
- **Lags Data** (`lags.parquet`) – Responder values from the previous `date_id`.
- **Metadata** – `features.csv`, `responders.csv`, and `sample_submission.csv`.
✅ End-to-end EDA pipeline for Kaggle’s market forecasting challenge
✅ Optimized with `polars` for fast loading and computation
✅ Ready-to-extend structure for modeling and feature engineering
✅ Fully reproducible scripts + Jupyter notebooks
✅ Aligned with API-specific submission formats
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/jane-street-market-forecasting.git
  cd jane-street-market-forecasting
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Use `polars` for efficient loading and querying.
- Partitioned data handled by `load_train_partition(partition_id)`; a sketch of such a helper is shown below.
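
The helper name comes from this repo, but the body below is only a sketch of what it might look like, assuming a Kaggle-style layout of `train.parquet/partition_id=<N>/part-0.parquet` under a local `data/` folder:

```python
import polars as pl

DATA_DIR = "data"  # assumed location of the competition files

def load_train_partition(partition_id: int) -> pl.LazyFrame:
    """Lazily scan one training partition (path layout is an assumption)."""
    path = f"{DATA_DIR}/train.parquet/partition_id={partition_id}/part-0.parquet"
    return pl.scan_parquet(path)

# Peek at a few columns of partition 0 without materializing the whole file.
preview = (
    load_train_partition(0)
    .select(["date_id", "time_id", "symbol_id", "weight", "responder_6"])
    .head(5)
    .collect()
)
print(preview)
```

Using `scan_parquet` keeps the read lazy, so column selections and filters are pushed down before anything is loaded into memory.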
- Focused on non-null `responder_6` rows.
- Shows missing count and ratio per feature with visual bar plots (see the sketch after this list).
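
A minimal sketch of the per-feature missing-value profile, assuming one partition loaded with `polars` (the path is illustrative):

```python
import polars as pl
import matplotlib.pyplot as plt

# Load one partition and keep only rows where the target is present.
df = pl.read_parquet("data/train.parquet/partition_id=0/part-0.parquet")
df = df.filter(pl.col("responder_6").is_not_null())

feature_cols = [c for c in df.columns if c.startswith("feature_")]

# Null ratio per feature, sorted so the most incomplete columns come first.
null_ratio = (
    df.select([pl.col(c).is_null().mean().alias(c) for c in feature_cols])
      .transpose(include_header=True, header_name="feature", column_names=["null_ratio"])
      .sort("null_ratio", descending=True)
)

plt.figure(figsize=(12, 4))
plt.bar(null_ratio["feature"].to_list(), null_ratio["null_ratio"].to_list())
plt.xticks(rotation=90, fontsize=6)
plt.ylabel("missing ratio")
plt.tight_layout()
plt.show()
```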
- Correlation heatmap of all 79 features.
- Highlights clusters and multicollinearity (see the sketch below).
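
A sketch of how such a heatmap can be produced, assuming one partition loaded with `polars` and plotted with `seaborn`:

```python
import polars as pl
import seaborn as sns
import matplotlib.pyplot as plt

# One training partition (illustrative path); correlations are computed pairwise,
# so rows with nulls in either column are ignored for that pair.
df = pl.read_parquet("data/train.parquet/partition_id=0/part-0.parquet")
feature_cols = [c for c in df.columns if c.startswith("feature_")]

corr = df.select(feature_cols).to_pandas().corr()

plt.figure(figsize=(14, 12))
sns.heatmap(corr, cmap="coolwarm", center=0, xticklabels=False, yticklabels=False)
plt.title("Feature-to-feature correlation (feature_00 to feature_78)")
plt.tight_layout()
plt.show()
```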
- Histograms for all 9 responders.
- Statistics (mean, std, min, max).
- Visual checks for clipped distributions (bounded between -5 and 5); see the sketch below.
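
A short sketch of the responder summary and histograms, assuming one partition loaded with `polars`:

```python
import polars as pl
import matplotlib.pyplot as plt

df = pl.read_parquet("data/train.parquet/partition_id=0/part-0.parquet")
responder_cols = [f"responder_{i}" for i in range(9)]

# Mean, std, min, max (and more) per responder.
print(df.select(responder_cols).describe())

# Histograms; clipping shows up as spikes at the -5 / +5 bounds.
fig, axes = plt.subplots(3, 3, figsize=(12, 9))
for ax, col in zip(axes.flat, responder_cols):
    ax.hist(df[col].drop_nulls().to_numpy(), bins=100, range=(-5, 5))
    ax.set_title(col, fontsize=8)
plt.tight_layout()
plt.show()
```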
- Distribution of `symbol_id` and `date_id` across partitions (see the sketch below).
- Ensures temporal and symbol consistency.
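
A minimal sketch of the per-symbol and per-date row counts (lazy scan, illustrative path):

```python
import polars as pl

lf = pl.scan_parquet("data/train.parquet/partition_id=0/part-0.parquet")

# How often each instrument appears in this partition.
symbol_counts = (
    lf.group_by("symbol_id").agg(pl.len().alias("n_rows"))
      .sort("n_rows", descending=True)
      .collect()
)

# How many rows each trading day contributes.
date_counts = (
    lf.group_by("date_id").agg(pl.len().alias("n_rows"))
      .sort("date_id")
      .collect()
)

print(symbol_counts.head(10))
print(date_counts.head(10))
```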
- Validates the structure of test sets and lag features.
- Aligns lag values for time-aware modeling.
This section provides a comprehensive breakdown of the exploratory insights derived from the Jane Street dataset.
- 🔢 Global null ratios are computed for all 79 features.
- 📊 Bar charts show availability vs missingness across usable samples.
- 🕒 Temporal null analysis tracks missing patterns by `date_id`, helping identify data degradation or dropouts over time (sketched below).
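
A sketch of the temporal null analysis for a single feature (`feature_00` is just an example column):

```python
import polars as pl
import matplotlib.pyplot as plt

lf = pl.scan_parquet("data/train.parquet/partition_id=0/part-0.parquet")

# Null ratio of one feature per date_id; a drift upward flags dropouts over time.
nulls_by_date = (
    lf.group_by("date_id")
      .agg(pl.col("feature_00").is_null().mean().alias("null_ratio"))
      .sort("date_id")
      .collect()
)

plt.plot(nulls_by_date["date_id"].to_numpy(), nulls_by_date["null_ratio"].to_numpy())
plt.xlabel("date_id")
plt.ylabel("null ratio of feature_00")
plt.tight_layout()
plt.show()
```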
- 🔗 Feature-to-feature correlation matrix highlights relationships across `feature_00` through `feature_78`.
- 🔁 Responder-to-responder heatmap reveals interdependencies between target variables.
- 🧱 Cluster detection allows dimensionality reduction and feature selection by identifying redundant variables (see the sketch below).
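
One way to detect such clusters is hierarchical clustering on an absolute-correlation distance. The sketch below uses `scipy` (installed alongside `scikit-learn`) with an illustrative partition path and threshold:

```python
import numpy as np
import polars as pl
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

df = pl.read_parquet("data/train.parquet/partition_id=0/part-0.parquet")
feature_cols = [c for c in df.columns if c.startswith("feature_")]

# Absolute pairwise correlations; missing correlations are treated as "far apart".
corr = df.select(feature_cols).to_pandas().corr().abs().to_numpy()
dist = 1.0 - np.nan_to_num(corr, nan=0.0)
np.fill_diagonal(dist, 0.0)
dist = (dist + dist.T) / 2  # enforce symmetry before condensing

# Features closer than 0.1 (i.e. |corr| > 0.9) end up in the same cluster.
labels = fcluster(
    linkage(squareform(dist, checks=False), method="average"),
    t=0.1,
    criterion="distance",
)

for cluster_id in np.unique(labels):
    members = [feature_cols[i] for i in np.where(labels == cluster_id)[0]]
    if len(members) > 1:
        print(cluster_id, members)  # candidate groups of redundant features
```

Keeping one representative per printed group is a simple way to reduce multicollinearity before modeling.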
- 📦 Histograms of all responders (`responder_0` to `responder_8`) illustrate distribution shape.
- 📈 For each responder, we compute:
  - Mean
  - Standard deviation
  - Minimum and maximum values
- 📌 Special attention is given to `responder_6`, the target for forecasting.
- 🪙 `symbol_id` frequency plots show how often each financial instrument appears.
- 🗓 `date_id` coverage checks ensure even temporal distribution across partitions.
- ⏱ Temporal alignment validation confirms proper sequencing for lag features.
- ⏮ `lags.parquet` provides lag-1 responder values for all symbols.
- 🧩 These are served at the first `time_id` of each new `date_id`.
- 📈 Visualization of `responder_6_lag_1` across symbols reveals carryover behavior and temporal consistency (see the sketch below).
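
A hedged sketch of inspecting one lags batch; the path layout (`lags.parquet/date_id=0/part-0.parquet`) and the `responder_6_lag_1` column name follow the competition's API format but should be verified against your local files:

```python
import polars as pl
import matplotlib.pyplot as plt

lags = pl.read_parquet("data/lags.parquet/date_id=0/part-0.parquet")

# Average lag-1 value of responder_6 per instrument for this batch.
per_symbol = (
    lags.group_by("symbol_id")
        .agg(pl.col("responder_6_lag_1").mean().alias("mean_lag"))
        .sort("symbol_id")
)

plt.bar(per_symbol["symbol_id"].to_list(), per_symbol["mean_lag"].to_list())
plt.xlabel("symbol_id")
plt.ylabel("mean responder_6_lag_1")
plt.tight_layout()
plt.show()
```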
You can use the following batch scripts:
- `scripts/run_null_analysis.py` ➤ Generates missing value profiles.
- `scripts/run_feature_corr.py` ➤ Creates a feature-feature correlation heatmap.
All analysis is also available as interactive notebooks under `notebooks/`, including:

- `01_eda_features.ipynb`
- `02_eda_responders.ipynb`
- `03_missing_value_analysis.ipynb`
- `04_symbol_date_distribution.ipynb`
Dependencies are listed in `requirements.txt`, including:
- pandas
- polars
- matplotlib
- seaborn
- numpy
- scikit-learn
- pyarrow
- jupyterlab
This repository provides a complete exploratory analysis and data processing framework for the Jane Street Real-Time Market Data Forecasting competition on Kaggle.
The objective is to predict `responder_6`, a proprietary market signal, using anonymized, high-frequency trading data across multiple instruments. The project focuses on:
- Efficient loading and handling of large `.parquet` files using `polars`
- Detailed EDA including missing value analysis, correlation insights, and statistical summaries
- Time-series structure validation across `date_id`, `time_id`, and `symbol_id`
- Visualization and understanding of lagged target values (`lags.parquet`)
- Modular and reproducible code base ready for feature engineering and model development
This serves as a strong foundation for building robust forecasting models in a real-time, API-driven environment.
This repository is distributed under the MIT License.