This project demonstrates my ability to build a fully serverless data engineering pipeline on AWS, covering everything from data ingestion to visual analytics using native AWS services. It uses the Superstore Dataset sourced from Kaggle, containing retail transactional data across multiple categories such as customer orders, product segments, sales, profits, shipping methods and regions.
I designed an automated pipeline to process incremental sales data from a retail superstore using the following AWS services:
- S3 – Cloud storage for raw files
- IAM – Role-based access and user provisioning
- Glue – Metadata cataloguing via crawlers
- Athena – Serverless SQL querying on S3 data
- QuickSight – Interactive dashboard for analysis
Raw CSV → S3 (Partitioned by snapshot_day)
↓
AWS Glue Crawler → Data Catalog
↓
Amazon Athena (SQL)
↓
Amazon QuickSight Dashboards
Tool/Service | Purpose |
---|---|
S3 | Raw data storage (partitioned folders by snapshot date) |
Glue | Metadata extraction and partition awareness |
Athena | Ad hoc querying with cost-efficient scanning |
QuickSight | Visual reporting and dashboard sharing |
IAM | Secure access management and service-level roles |
s3://<my-bucket>/orders/
├── snapshot_day=2017-01-01/
│ └── orders_1.csv
└── snapshot_day=2017-01-02/
└── orders_2.csv
- Created a separate IAM user for development (no root user usage).
- Assigned admin permissions and created a Glue-compatible IAM role.
- Cleaned and filtered daily order data from the Kaggle superstore dataset.
- Saved files in
.csv
format and uploaded to partitioned folders.
- Created a Glue database to store metadata.
- Configured a crawler to:
- Auto-detect schema
- Recognise partitions using folder names
- Maintain an up-to-date data catalog for Athena
- Connected Athena to the Glue data catalog
- Ran queries on raw S3 files without loading into any database
Partitioning by date drastically improves Athena query speed and reduces cost.
Using WHERE snapshot_day = ...
reduced scanned data by ~54%.
- Example query to show the improvement caused by partitioning:
SELECT * FROM "db_hkhosravi"."orders"
WHERE snapshot_day = '2017-01-01'
- Connected QuickSight to Athena as the data source
- Imported data into SPICE for faster performance
- Built:
- Bar chart: Sales by Category
- Pie chart: Profit by City
- AWS IAM + security best practices
- Efficient cloud data partitioning strategies
- Glue crawler configuration for schema detection
- Serverless querying with Athena
- Real-time dashboards with QuickSight (SPICE vs live mode)
- Add trigger-based automation using AWS Lambda
- Integrate with a real-time stream via Kinesis or EventBridge
- Add notification layer for insights using SNS