# Echoflow: Streamlined Data Pipeline Orchestration

Welcome to **Echoflow**! Echoflow is a powerful data pipeline orchestration tool designed to simplify and enhance the execution of data processing tasks. Leveraging the capabilities of [Prefect 2.0](https://www.prefect.io/) and YAML configuration files, Echoflow caters to the needs of scientific research and data analysis. It provides an efficient way to define, configure, and execute complex data processing workflows.

Echoflow integrates with **echopype**, a renowned package for sonar data analysis, to provide a versatile solution for researchers, analysts, and engineers. With Echoflow, users can seamlessly process and analyze sonar data using a modular and user-friendly approach.
# Getting Started with Echoflow

This guide will walk you through the initial steps to set up and run your Echoflow pipelines.

## 1. Create a Virtual Environment

To keep your Echoflow environment isolated, it's recommended to create a virtual environment using Conda or Python's built-in `venv` module. Here's an example using Conda:

```bash
conda create --name echoflow-env python
conda activate echoflow-env
```

Or, using Python's venv:

```bash
python -m venv echoflow-env
source echoflow-env/bin/activate # On Windows, use `echoflow-env\Scripts\activate`
```

## 2. Clone the Project

Now that you have a virtual environment set up, you can clone the Echoflow project repository to your local machine using the following command:

```bash
git clone <repository_url>
```

## 3. Install the Package

Navigate to the project directory you've just cloned and install the Echoflow package. The `-e` flag is important: it installs the package in editable mode, which is especially helpful during development and testing. Now, take a moment and let the installation do its thing while you enjoy your coffee.

```bash
cd <project_directory>
pip install -e .
```
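If you want to confirm that the editable install landed in the active environment, a quick import check is enough (a minimal sanity check, assuming the package imports as `echoflow`, which matches the import used in step 8 below):

```bash
# Minimal sanity check: this should print the message without raising ImportError
python -c "import echoflow; print('echoflow imported successfully')"
```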
## 4. Echoflow and Prefect Initialization

To kickstart your journey with Echoflow and Prefect, follow these simple initialization steps:

### 4.1 Initializing Echoflow

Begin by initializing Echoflow with the following command:

```bash
echoflow init
```

This command sets up the groundwork for your Echoflow environment, preparing it for seamless usage.
### 4.2 Initializing Prefect

For Prefect, initialization involves a few extra steps, including secure authentication. Pick whichever of the following options fits your setup:

- If you have a Prefect Cloud account, provide your Prefect API key to securely link your account. Type your API key when prompted and press Enter.

```bash
prefect cloud login
```

- If you don't have a Prefect Cloud account yet, you can work with a local Prefect profile instead. This is especially useful for those who are just starting out and want to explore Prefect without an account.

```bash
prefect profile create echoflow-local
```

The initialization process ensures that both Echoflow and Prefect are properly set up and ready for you to dive into your workflows, locally or on the cloud.
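To double-check which Prefect setup you ended up with, the standard Prefect 2 CLI can report version information and list your profiles (these are generic Prefect commands, not Echoflow-specific ones):

```bash
# Show the installed Prefect version and related environment details
prefect version

# List available profiles and see which one is active
prefect profile ls
```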
## 5. Configure Blocks

Echoflow utilizes the concept of [blocks](./docs/configuration/blocks.md), which are secure containers for storing credentials and other sensitive data. If you're running the entire flow locally, feel free to bypass this step. To set up your cloud credentials, configure blocks according to your cloud provider. For detailed instructions, refer to the [Blocks Configuration Guide](./docs/configuration/blocks.md#creating-credential-blocks).
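Echoflow's credential blocks build on Prefect's block machinery (an inference from the Prefect integration described above; the block types and names Echoflow actually expects are spelled out in the linked guide). Purely as a rough sketch of that underlying mechanism, an AWS credentials block could be registered like this, assuming the optional `prefect-aws` package is installed and using a placeholder block name:

```python
# Illustrative sketch only -- follow docs/configuration/blocks.md for Echoflow's documented workflow.
# Assumes `pip install prefect-aws`; the block name below is a placeholder of your choosing.
from prefect_aws import AwsCredentials

AwsCredentials(
    aws_access_key_id="YOUR_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
).save(name="echoflow-aws-credentials", overwrite=True)
```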
## 6. Edit the Pipeline Configuration

Open the [pipeline.yaml](./docs/configuration/pipeline.md) file. This YAML configuration file defines the processes you want to execute as part of your pipeline. Customize it by adding the necessary stages and functions from echopype that you wish to run.
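The concrete schema is documented in the linked pipeline guide; purely as a hypothetical illustration of the idea (none of the keys or stage names below are guaranteed to match Echoflow's real configuration), a pipeline that converts raw files, combines them, and computes Sv might be sketched along these lines:

```yaml
# Hypothetical sketch only -- see docs/configuration/pipeline.md for the actual keys and stage names.
name: hake-processing
stages:
  - open_raw          # convert raw sonar files (echopype.open_raw)
  - combine_echodata  # merge the converted files (echopype.combine_echodata)
  - compute_Sv        # calibrate to volume backscattering strength (echopype.calibrate.compute_Sv)
```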
## 7. Define Data Sources and Destinations

Customize the [datastore.yaml](./docs/configuration/datastore.md) file to define the source and destination for your pipeline's data. This is where Echoflow will fetch and store data as it executes the pipeline.
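For a sense of what such a file can look like, here is an example adapted from an earlier revision of this README; the exact keys may have evolved since, so treat it as a sketch and follow the linked guide for the current schema:

```yaml
name: Bell_M._Shimada-SH1707-EK60
sonar_model: EK60
raw_regex: (.*)-?D(?P<date>\w{1,8})-T(?P<time>\w{1,6})
args:
  urlpath: s3://ncei-wcsd-archive/data/raw/{{ ship_name }}/{{ survey_name }}/{{ sonar_model }}/*.raw
  # Set default parameter values as found in urlpath
  parameters:
    ship_name: Bell_M._Shimada
    survey_name: SH1707
    sonar_model: EK60
  storage_options:
    anon: true
  transect:
    # Transect file spec; can be either single or multiple files
    file: ./hake_transects_2017.zip
output:
  urlpath: ./combined_files
  overwrite: true
```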
## 8. Execute the Pipeline

You're now ready to execute your Echoflow pipeline! Use the `echoflow_start` function, a central piece of Echoflow, to kick off your pipeline. Import it from Echoflow and provide the paths or URLs of the configuration files; you can also pass additional options or storage options as needed. Customize the paths, block name, storage type, and options based on your requirements. Here's an example:

```python
from echoflow import echoflow_start, StorageType, load_block

dataset_config = "<url or path of datastore.yaml>"
pipeline_config = "<url or path of pipeline.yaml>"
logfile_config = "<url or path of logging.yaml>"  # optional

# Load the credential block configured in step 5
aws = load_block(name="<block_name>", type=<StorageType>)

# Setting storage_options_override to True assigns this block for universal use,
# avoiding repetitive configuration when a single credential block is employed throughout the application.
options = {"storage_options_override": False}

data = echoflow_start(
    dataset_config=dataset_config,
    pipeline_config=pipeline_config,
    logging_config=logfile_config,
    storage_options=aws,
    options=options,
)
```
## License