earthquake-data-pipeline

A complete enterprise-style data pipeline for earthquake data using Databricks and Delta Lake, following the medallion architecture (RAW → BRONZE → SILVER → GOLD). The data is ingested daily from the USGS Earthquake API and processed for analytics, reporting, and future machine learning use cases.

Earthquake Data Pipeline

This project is a complete end-to-end solution for ingesting, transforming, analyzing, and visualizing earthquake data from the USGS API using a modern Azure-based architecture. The design follows best practices for enterprise-grade data pipelines, using layered storage (medallion architecture), orchestrated workflows, and advanced monitoring and visualization.


🔍 Purpose

The primary goal of this pipeline is to:

  • Provide clean, enriched, and aggregated earthquake data for reporting and decision-making.
  • Implement a real-world medallion architecture with clean separation between RAW, BRONZE, SILVER, and GOLD layers.
  • Enable automated orchestration (Azure Data Factory), scalable compute (Databricks), alerting (Azure Monitor), and visualization (Power BI).

🔁 Pipeline Flow

Pipeline Diagram

The pipeline performs the following steps:

  1. Data ingestion (ADF): JSON data is retrieved daily from the USGS API.
  2. Raw zone: The raw API response is saved to Azure Data Lake Gen2.
  3. Bronze (Databricks): Parses the raw JSON and extracts structured fields such as ID, magnitude, and coordinates.
  4. Silver (Databricks): Deduplicates, enriches, and adds derived columns such as day_ratio and depth_category (the Bronze and Silver steps are sketched in PySpark after this list).
  5. Gold (Databricks): Aggregates by date and region and calculates KPIs such as avg_mag, total_eq, strong_eq, felt_pct, and tsunami_pct.
  6. Gold History: Stores all daily aggregations over time (append-only).
  7. Power BI: Reads CSV exports from the GOLD layer to create dashboards.
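
Below is a minimal, illustrative PySpark sketch of the Bronze and Silver steps, not the project's actual notebooks: the storage paths, column names, and depth thresholds are assumptions, and `spark` refers to the session Databricks provides inside a notebook.

```python
# Illustrative only: paths, schema, and thresholds are assumptions, not the
# project's actual notebooks. `spark` is the session Databricks provides.
from pyspark.sql import functions as F

base = "abfss://<container>@<account>.dfs.core.windows.net/earthquakes"  # hypothetical path

# BRONZE: flatten the raw USGS GeoJSON (features[] -> one row per earthquake)
raw = spark.read.option("multiLine", True).json(f"{base}/raw/")
bronze = (
    raw.select(F.explode("features").alias("f"))
       .select(
           F.col("f.id").alias("id"),
           F.col("f.properties.mag").alias("mag"),
           F.col("f.properties.place").alias("place"),
           (F.col("f.properties.time") / 1000).cast("timestamp").alias("event_time"),
           F.col("f.geometry.coordinates")[0].alias("longitude"),
           F.col("f.geometry.coordinates")[1].alias("latitude"),
           F.col("f.geometry.coordinates")[2].alias("depth_km"),
       )
)
bronze.write.format("delta").mode("overwrite").save(f"{base}/bronze/")

# SILVER: deduplicate and add derived columns such as depth_category
# (70 km / 300 km are the usual seismological depth boundaries)
silver = (
    spark.read.format("delta").load(f"{base}/bronze/")
         .dropDuplicates(["id"])
         .withColumn(
             "depth_category",
             F.when(F.col("depth_km") < 70, "shallow")
              .when(F.col("depth_km") < 300, "intermediate")
              .otherwise("deep"),
         )
)
silver.write.format("delta").mode("overwrite").save(f"{base}/silver/")
```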

☁️ Azure Components Used

  • Azure Data Factory: Executes the full ETL flow via linked notebooks and copy activities.
  • Azure Data Lake Storage Gen2: Stores all layers of data.
  • Azure Databricks: Runs transformation logic in PySpark notebooks.
  • Power BI Desktop: Visualizes daily and historical metrics.
  • Azure Monitor: Sends email alerts on pipeline success/failure.

🧱 Layered Storage Layout

/earthquakes
├── raw/              # raw USGS API output (JSON)
├── bronze/           # normalized flat structure (Delta)
├── silver/           # cleaned, enriched data
├── gold/             # snapshot (merge/update per region+day)
├── gold_history/     # append-only daily metrics
└── gold_exports/     # CSV output for Power BI
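
The comments on gold/ and gold_history/ above (snapshot merged per region+day, append-only history) map naturally onto a Delta Lake MERGE followed by an append, roughly as sketched below. The KPI formulas, the strong_eq threshold, the region_country column, and the assumption that the gold table already exists are all illustrative, not taken from the project notebooks.

```python
# Hedged sketch of the GOLD snapshot (merge per region+day) and the
# append-only GOLD history write. Schema, paths, and threshold are assumptions.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

base = "abfss://<container>@<account>.dfs.core.windows.net/earthquakes"  # hypothetical path

daily = (
    spark.read.format("delta").load(f"{base}/silver/")
         .groupBy(F.to_date("event_time").alias("event_date"), "region_country")
         .agg(
             F.avg("mag").alias("avg_mag"),
             F.count("*").alias("total_eq"),
             F.sum(F.when(F.col("mag") >= 5.0, 1).otherwise(0)).alias("strong_eq"),  # threshold assumed
         )
)

# GOLD: upsert the latest aggregation per region + day
# (assumes the gold Delta table was created on an earlier run)
gold = DeltaTable.forPath(spark, f"{base}/gold/")
(gold.alias("t")
     .merge(daily.alias("s"),
            "t.event_date = s.event_date AND t.region_country = s.region_country")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

# GOLD HISTORY: keep every daily aggregation (append-only)
(daily.withColumn("load_ts", F.current_timestamp())
      .write.format("delta").mode("append").save(f"{base}/gold_history/"))
```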

📊 Visualizations in Power BI

This pipeline supports a variety of visuals:

  • Bar chart: total_eq per region (colored by avg_mag)
  • Map: location-based earthquakes with size by magnitude
  • Line chart: magnitude trend over time
  • Pie chart: depth distribution (shallow, intermediate, deep)
  • Slicers: by region_country and date for filtering

These visuals are defined in the Power BI file: powerbi/earthquake-dashboard.pbix
A static PDF version is also included: earthquake-dashboard.pdf
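
One plausible way to produce the gold_exports/ CSVs that the .pbix file reads is sketched below; the single-file layout, header option, and paths are assumptions.

```python
# Export the GOLD snapshot as a single headered CSV for Power BI.
# Assumption: Power BI reads one CSV from gold_exports/; the path is hypothetical.
base = "abfss://<container>@<account>.dfs.core.windows.net/earthquakes"

(spark.read.format("delta").load(f"{base}/gold/")
      .coalesce(1)                      # single output file for easy import
      .write.mode("overwrite")
      .option("header", True)
      .csv(f"{base}/gold_exports/"))
```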


📣 Alerts and Monitoring

  • Success alert: adf_pipeline_success → sends an email when the pipeline completes successfully
  • Failure alert: adf_pipeline_failed → sends an email when any activity fails
  • Managed via Azure Monitor and Action Groups.

🧠 Possible Enhancements

  • Live dashboard via Power BI Service + Gateway
  • Training ML models on gold_history (for magnitude prediction)
  • Export GOLD to Synapse or Azure SQL for analytics
  • Custom alerts on strong_eq > threshold for public safety (a rough sketch follows this list)
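
The last enhancement could be approximated without new infrastructure by failing the Databricks notebook when a threshold is exceeded, so the existing adf_pipeline_failed alert sends the email. The threshold, column name, and path below are assumptions.

```python
# Rough sketch: raise inside the Databricks notebook when strong_eq exceeds a
# threshold, letting the existing ADF failure alert notify by email.
from pyspark.sql import functions as F

base = "abfss://<container>@<account>.dfs.core.windows.net/earthquakes"  # hypothetical path
STRONG_EQ_THRESHOLD = 10  # hypothetical value

total_strong = (spark.read.format("delta").load(f"{base}/gold/")
                     .agg(F.sum("strong_eq").alias("total"))
                     .collect()[0]["total"])

if total_strong is not None and total_strong > STRONG_EQ_THRESHOLD:
    raise Exception(f"strong_eq total {total_strong} exceeded threshold {STRONG_EQ_THRESHOLD}")
```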

📁 Project Structure

earthquake-data-pipeline/
├── PowerBI plots/              # Power BI reports and visual exports
├── dataset/                    # ADF datasets in JSON
├── factory/                    # ADF factory definition
├── linkedService/              # Linked service configs for ADF
├── notebooks-DataBricks/       # Databricks notebooks
├── pipeline/                   # ADF pipelines in JSON
├── trigger/                    # ADF trigger definition
├── README.md                   # Project documentation
├── .gitignore
└── publish_config.json

Contributors

Thanks to all contributors! 🙌
