
ECHO Pipeline Project Documentation

This project developed a data pipeline system that extends the archiving system and data science tools built by the Environmental Data & Governance Initiative (EDGI)'s Environmental Enforcement Watch program. It provides both an on-premise and a Dockerized Delta Lake platform built on current, stable open source technologies. All up-to-date datasets on the EPA Enforcement and Compliance History Online (ECHO) website can be automatically archived and curated in the Delta Lake system for data processing and query services. To work with this Delta Lake platform, EDGI's previously developed ECHO_modules was updated and added to this repository as a branch (ECHO_modules_delta). The project also includes a small, fast RESTful API server that receives query requests from clients using ECHO_modules_delta and returns query results from the on-premise Delta Lake system. Any data analytics tool using ECHO_modules_delta can access the ingested EPA ECHO datasets through this RESTful API service.
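For example, a client built on ECHO_modules_delta might query the service over HTTP. The endpoint path (/query), request body, and response format below are illustrative assumptions, not the actual echo-api-server contract:

```python
# Illustrative only: the endpoint, parameter names, and response shape are
# assumptions, not the real echo-api-server API.
import requests

API_URL = "http://localhost:8000"  # assumed host/port for the API server


def run_query(sql: str) -> list[dict]:
    """Send a query to the hypothetical /query endpoint and return the rows
    served from the on-premise Delta Lake system."""
    response = requests.post(f"{API_URL}/query", json={"sql": sql})
    response.raise_for_status()
    return response.json()


rows = run_query("SELECT * FROM ECHO_EXPORTER LIMIT 10")
print(rows[:3])
```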

This project consists of the following components:

1. echo-archive (Documentation)

This directory contains scripts for mirroring and archiving the ECHO Downloads site. It scans the website for downloadable file links, downloads them, and stores them in a local folder structure that mirrors the website's organization. The goal is to create a local archive for backup, offline access, or ingestion by the data-scraper tool described next.
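A minimal sketch of the mirroring idea, assuming the requests and beautifulsoup4 libraries; the listing URL, file-type filter, and directory layout are placeholders rather than the actual echo-archive implementation:

```python
# Sketch of mirroring one ECHO downloads listing page into a local folder
# structure that follows the URL paths on the site.
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://echo.epa.gov/files/echodownloads/"  # assumed listing page
ARCHIVE_ROOT = "echo-downloads"                          # local mirror root


def mirror_page(url: str) -> None:
    """Scan one page for downloadable file links and save each file under a
    local path that mirrors its URL path on the ECHO site."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for link in soup.find_all("a", href=True):
        file_url = urljoin(url, link["href"])
        if not file_url.endswith((".zip", ".csv")):
            continue
        local_path = os.path.join(ARCHIVE_ROOT, urlparse(file_url).path.lstrip("/"))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with requests.get(file_url, stream=True) as r, open(local_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)


mirror_page(BASE_URL)
```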

2. data-scraper (Documentation)

This directory contains scripts to download CSV files and scrape the matching schemas from the EPA data downloads site (https://echo.epa.gov/tools/data-downloads). The system is containerized using Docker for portability and reproducibility.
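As an illustration, a schema description page could be scraped into JSON roughly as follows; the page URL, table layout, and output file name are assumptions, and the real data-scraper scripts may work differently:

```python
# Sketch of turning an HTML schema table from an ECHO data-downloads page
# into a JSON schema file. Requires pandas plus an HTML parser such as lxml.
import json

import pandas as pd

# Example schema page; the actual pages and their layout may differ.
SCHEMA_PAGE = "https://echo.epa.gov/tools/data-downloads/icis-air-download-summary"

# read_html returns every HTML table on the page; assume the first one is the
# column-description table (element name, description, data type, ...).
tables = pd.read_html(SCHEMA_PAGE)
schema = tables[0].to_dict(orient="records")

with open("icis-air-schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```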

3. data-storer-dev (Documentation)

This directory contains scripts for ingesting the scraped data (e.g., CSV and JSON) from the data-scraper tool into a Delta Lake Docker container, based on the Delta Lake Quickstart Docker image, using PySpark. While this tool was created for the development environment, it can also be deployed as a local Delta Lake system so that analysts using ECHO_modules_delta can directly access the ingested ECHO tables locally.
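A minimal PySpark sketch of this ingestion step, assuming a Spark session configured with the Delta Lake extensions as in the Quickstart image; the CSV file name and Delta table path are illustrative, while the container-internal paths come from the mount layout described below:

```python
# Sketch of ingesting one scraped CSV into a Delta table with PySpark.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("echo-ingest")
    # Delta Lake extensions, as configured in the Delta Lake Quickstart Docker image
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read a scraped CSV (file name is a placeholder) and write it as a Delta table.
df = spark.read.option("header", True).csv("/app/echo-downloads/ECHO_EXPORTER.csv")
df.write.format("delta").mode("overwrite").save("/app/epa-data/delta/echo_exporter")
```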

4. data-storer-production (Documentation)

This directory contains scripts for ingesting the scraped data (e.g., CSV and JSON) from the data-scraper tool into an on-premise Delta Lake system using PySpark for the production environment. This on-premise Delta Lake deployment backs the ECHO API service.

5. echo-api-server (Documentation)

This directory contains scripts for deploying the API server, which receives query requests from clients using ECHO_modules_delta and returns query results from the on-premise Delta Lake system. The server is built with FastAPI. Refer to the directory's README.md for instructions on running the application.
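A minimal FastAPI sketch of such a query endpoint; the route name, request model, and data-access layer below are illustrative, not the actual echo-api-server code:

```python
# Illustrative FastAPI service that accepts a query and returns rows.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class QueryRequest(BaseModel):
    sql: str  # query text sent by an ECHO_modules_delta client


def execute_against_delta_lake(sql: str) -> list[dict]:
    # Placeholder for the real data-access layer (e.g., Spark SQL over the
    # on-premise Delta tables); returns an empty result set in this sketch.
    return []


@app.post("/query")
def run_query(request: QueryRequest) -> list[dict]:
    """Run the query against the Delta Lake system and return the rows."""
    return execute_against_delta_lake(request.sql)
```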

6. json (Documentation)

This directory contains the schema files and additional configuration files used by ECHO_Pipeline.

How to use

  1. Set up your .env file based on the provided .env.example file.

    The following variables are required (a minimal .env sketch follows the mount list below):

    • STORAGE_HOST_PATH: Path on the host machine where updated datasets and Delta tables will be stored. Example: /home/user/epa-data

    • LOCAL_ECHODOWNLOADS_HOST_PATH: Path on the host machine where raw ECHO downloads are stored. Example: /home/user/echo-downloads

    • JSON_DIR_HOST_PATH: Path on the host machine containing schema definition JSON files and other JSON files used in the pipeline. Example: /home/user/json

    These host paths are mounted into the containers at the following internal paths:

    • /app/echo-downloads ← LOCAL_ECHODOWNLOADS_HOST_PATH

    • /app/epa-data ← STORAGE_HOST_PATH

    • /app/json ← JSON_DIR_HOST_PATH
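
    A minimal .env sketch based on the example values above (adjust the paths to match your machine):

    STORAGE_HOST_PATH=/home/user/epa-data
    LOCAL_ECHODOWNLOADS_HOST_PATH=/home/user/echo-downloads
    JSON_DIR_HOST_PATH=/home/user/json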

  2. Build the Docker containers using the docker compose command:

    docker compose -f dev-compose.yaml build
  3. Start the containers:

    docker compose -f dev-compose.yaml up

This will start the containers defined in dev-compose.yaml, which are the scraper and storer services. The scraper's main script runs when its container starts, but the storer must be started manually.

Notes

  • Ensure that the paths defined in your .env file are accessible and correctly mounted within the containers.
  • For detailed usage and individual directory instructions, refer to the individual README.md files in each directory.

Code of Conduct

This repository falls under EDGI's Code of Conduct (https://github.com/edgi-govdata-archiving/overview/blob/main/CONDUCT.md). Please take a moment to review it before commenting on or creating issues and pull requests.

Contributors

License & Copyright

Copyright (C) Environmental Data and Governance Initiative (EDGI)

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.0.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the LICENSE file for details.
