
Commit 4022bac

Merge pull request #22 from OSOceanAcoustics/dev
Revamped Echoflow Design
2 parents 8d5342a + 90a1c39 commit 4022bac


85 files changed: +10105 additions, -1395 deletions

README.md

Lines changed: 90 additions & 64 deletions
**Removed (previous README content):**

# echoflow

**NOTE: This project is currently under heavy development, and has not been tested or deployed on the cloud**

Sonar conversion pipeline tool with echopype. This tool allows users to quickly set up sonar data processing pipelines, and deploy them locally or on the cloud. It uses [Prefect 2.0](https://www.prefect.io/), a data workflow orchestration tool, to run these flows on various platforms.

## Development

To develop the code, simply install the package in a python virtual environment in editable mode.

```bash
pip install -e .[all]
```

This will install all of the dependencies that the package needs.

**Check out the [Hake Flow Demo](./notebooks/HakeFlowDemo.ipynb) notebook to get started.**

## Package structure

All of the code lives in a directory called [echoflow](./echoflow/).

Under that directory, there are currently 4 main subdirectories:

- [settings](./echoflow/settings/): This is where pipeline configuration object models are found, as well as a home for any package configurations.
- [models](./echoflow/settings/models/): This sub-directory of `settings` contains [pydantic](https://docs.pydantic.dev/) models to validate the configuration file specified by the user. This can look like the following in [YAML](https://yaml.org/) format.

```yaml
name: Bell_M._Shimada-SH1707-EK60
sonar_model: EK60
raw_regex: (.*)-?D(?P<date>\w{1,8})-T(?P<time>\w{1,6})
args:
  urlpath: s3://ncei-wcsd-archive/data/raw/{{ ship_name }}/{{ survey_name }}/{{ sonar_model }}/*.raw
  # Set default parameter values as found in urlpath
  parameters:
    ship_name: Bell_M._Shimada
    survey_name: SH1707
    sonar_model: EK60
  storage_options:
    anon: true
  transect:
    # Transect file spec
    # can be either single or multiple files
    file: ./hake_transects_2017.zip
output:
  urlpath: ./combined_files
  overwrite: true
```

This yaml file turns into a `MainConfig` object that looks like:

```python
MainConfig(name='Bell_M._Shimada-SH1707-EK60', sonar_model='EK60', raw_regex='(.*)-?D(?P<date>\\w{1,8})-T(?P<time>\\w{1,6})', args=Args(urlpath='s3://ncei-wcsd-archive/data/raw/{{ ship_name }}/{{ survey_name }}/{{ sonar_model }}/*.raw', parameters={'ship_name': 'Bell_M._Shimada', 'survey_name': 'SH1707', 'sonar_model': 'EK60'}, storage_options={'anon': True}, transect=Transect(file='./hake_transects_2017.zip', storage_options={})), output=Output(urlpath='./combined_files', storage_options={}, overwrite=True), echopype=None)
```

- [stages](./echoflow/stages/): Within this directory lives the code for the various stages of the sonar data processing pipeline, which is currently sketched out and discussed [here](https://github.com/uw-echospace/data-processing-levels/blob/main/discussion_2022-07-12.md).
- [subflows](./echoflow/subflows/): Subflows contains flows that support processing-level flows. Essentially these are the individual smaller flows that need to run within a data processing level.
  - Currently, each subflow is a directory that contains the following python files:
    - `flows.py`: Code regarding flows lives here.
    - `tasks.py`: Code regarding tasks lives here.
    - `utils.py`: Code used for utility functions lives here.
    - `__init__.py`: File to import the flow so that the subflow directory becomes a module and the flow can be easily imported.
- [tests](./echoflow/tests/): Test code lives in this directory.

For more details about prefect, go to their extensive [documentation](https://docs.prefect.io/).

**Added (new README content):**

## Echoflow: Streamlined Data Pipeline Orchestration

Welcome to **Echoflow**! Echoflow is a powerful data pipeline orchestration tool designed to simplify and enhance the execution of data processing tasks. Leveraging the capabilities of [Prefect 2.0](https://www.prefect.io/) and YAML configuration files, Echoflow caters to the needs of scientific research and data analysis. It provides an efficient way to define, configure, and execute complex data processing workflows.

Echoflow integrates with **echopype**, a renowned package for sonar data analysis, to provide a versatile solution for researchers, analysts, and engineers. With Echoflow, users can seamlessly process and analyze sonar data using a modular and user-friendly approach.

# Getting Started with Echoflow

This guide will walk you through the initial steps to set up and run your Echoflow pipelines.

## 1. Create a Virtual Environment

To keep your Echoflow environment isolated, it's recommended to create a virtual environment using Conda or Python's built-in `venv` module. Here's an example using Conda:

```bash
conda create --name echoflow-env
conda activate echoflow-env
```

Or, using Python's venv:

```bash
python -m venv echoflow-env
source echoflow-env/bin/activate # On Windows, use `echoflow-env\Scripts\activate`
```

## 2. Clone the Project
Now that you have a virtual environment set up, you can clone the Echoflow project repository to your local machine using the following command:

```bash
git clone <repository_url>
```

## 3. Install the Package
Navigate to the project directory you've just cloned and install the Echoflow package. The `-e` flag is crucial as it enables editable mode, which is especially helpful during development and testing. Now, take a moment and let Echoflow do its thing while you enjoy your coffee.

```bash
cd <project_directory>
pip install -e .
```

## 4. Echoflow and Prefect Initialization

To kickstart your journey with Echoflow and Prefect, follow these simple initialization steps:

### 4.1 Initializing Echoflow
Begin by initializing Echoflow with the following command:

```bash
echoflow init
```

This command sets up the groundwork for your Echoflow environment, preparing it for seamless usage.
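To verify the initialization (an optional check; this assumes `echoflow init` writes its configuration under `~/.echoflow`, the same directory that holds the `credentials.ini` file described in the Blocks Configuration Guide), you can list that directory:

```bash
# Optional sanity check: list the Echoflow configuration directory.
# Assumes `echoflow init` writes under ~/.echoflow, as referenced in docs/configuration/blocks.md.
ls ~/.echoflow
```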
### 4.2 Initializing Prefect
For Prefect, initialization involves a few extra steps, including secure authentication. Choose one of the following options to set up Prefect:

- If you have a Prefect Cloud account, provide your Prefect API key to securely link your account. Type your API key when prompted and press Enter.

```bash
prefect cloud login
```

- If you don't have a Prefect Cloud account yet, you can use a local Prefect profile instead. This is especially useful for those who are just starting out and want to explore Prefect without an account.

```bash
prefect profile create echoflow-local
```

The initialization process ensures that both Echoflow and Prefect are properly set up and ready for you to dive into your workflows.
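If you created a local profile, you can switch to it and confirm it is active using the standard Prefect CLI (plain Prefect commands, not Echoflow-specific ones):

```bash
# Switch to the newly created local profile and list all available profiles
prefect profile use echoflow-local
prefect profile ls
```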
## 5. Configure Blocks
Echoflow uses the concept of [blocks](./docs/configuration/blocks.md), which are secure containers for storing credentials and sensitive data. If you're running the entire flow locally, feel free to skip this step. To set up your cloud credentials, configure blocks according to your cloud provider. For detailed instructions, refer to the [Blocks Configuration Guide](./docs/configuration/blocks.md#creating-credential-blocks).
## 6. Edit the Pipeline Configuration
Open the [pipeline.yaml](./docs/configuration/pipeline.md) file. This YAML configuration file defines the processes you want to execute as part of your pipeline. Customize it by adding the necessary stages and functions from echopype that you wish to run.
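As a rough illustration only (the keys below are hypothetical placeholders, not the authoritative schema; consult the [Pipeline Configuration Guide](./docs/configuration/pipeline.md) for the real structure), a pipeline definition generally lists the stages you want Echoflow to run in order:

```yaml
# Hypothetical sketch only -- the real keys are documented in docs/configuration/pipeline.md
pipeline:
  - stage_name: open_raw          # hypothetical stage wrapping echopype's open_raw
  - stage_name: combine_echodata  # hypothetical stage wrapping echopype's combine_echodata
```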
## 7. Define Data Sources and Destinations
Customize the [datastore.yaml](./docs/configuration/datastore.md) file to define the source and destination for your pipeline's data. This is where Echoflow will fetch and store data as it executes the pipeline.
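For orientation, here is a minimal sketch of a datastore configuration, trimmed from the fuller example in the [Datastore Configuration Guide](./docs/configuration/datastore.md); depending on your workflow you may also need the transect and storage-block sections shown there, and the output path below is a placeholder:

```yaml
name: Bell_M._Shimada-SH1707-EK60   # Name of the Echoflow run
sonar_model: EK60                   # Sonar model
raw_regex: (.*)-?D(?P<date>\w{1,8})-T(?P<time>\w{1,6})
args:
  urlpath: s3://ncei-wcsd-archive/data/raw/{{ ship_name }}/{{ survey_name }}/{{ sonar_model }}/*.raw
  parameters:
    ship_name: Bell_M._Shimada
    survey_name: SH1707
    sonar_model: EK60
  storage_options:
    anon: true
output:
  urlpath: ./combined_files         # Placeholder destination; point this at your own path or bucket
  overwrite: true
```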
## 8. Execute the Pipeline
You're now ready to execute your Echoflow pipeline! Use the `echoflow_start` function, which is a central piece of Echoflow, to kick off your pipeline. Import this function from Echoflow and provide the paths or URLs of the configuration files. You can also pass additional options or storage options as needed. Here's an example:

Customize the paths, block name, storage type, and options based on your requirements.

```python
from echoflow import echoflow_start, StorageType, load_block

dataset_config = "<url or path of datastore.yaml>"
pipeline_config = "<url or path of pipeline.yaml>"
logfile_config = "<url or path of logging.yaml>"  # Optional

# Fetch the credential block created earlier; replace the name and type with your own
aws = load_block(name="echoflow-aws-credentials", type=StorageType.AWS)

# Setting storage_options_override to True assigns the block for universal use,
# avoiding repetitive configuration when a single credential block is used throughout the application.
options = {"storage_options_override": False}

data = echoflow_start(
    dataset_config=dataset_config,
    pipeline_config=pipeline_config,
    logging_config=logfile_config,
    storage_options=aws,
    options=options,
)
```

## License

deployment/deploy_echoflow_worker.sh

Lines changed: 56 additions & 0 deletions
#!/bin/bash

# Step 1: Create a Python Virtual Environment
python3 -m venv $HOME/env/echoflow-prod
source $HOME/env/echoflow-prod/bin/activate

# Step 2: Clone the Echoflow Repository
cd $HOME/
git clone https://github.com/OSOceanAcoustics/echoflow.git
cd $HOME/echoflow

# Step 3: Checkout the Dev Branch and Update (Optional) - Skip if using Prod/main branch
git checkout dev
git pull origin dev

# Step 4: Install the Echoflow Project in Editable Mode
pip install -e .

# Step 5: Log in to Prefect Cloud and Set Your API Key - Use Step 5b instead if using Prefect locally
echo "Enter Prefect API key: "
read prefectKey
prefect cloud login -k $prefectKey

# Step 5b: Set up Prefect locally
# prefect profile create echoflow-local

# Step 6: Set Up the Prefect Worker as a Systemd Service
echo "Enter Work Pool Name: "
read workPool
cd /etc/systemd/system

# Create the prefect-worker.service file (tee is used so the redirect runs with sudo privileges)
sudo tee prefect-worker.service > /dev/null <<EOL
[Unit]
Description=Prefect-Worker

[Service]
User=$(whoami)
WorkingDirectory=$HOME/echoflow
ExecStart=$(which prefect) agent start --pool $workPool
Restart=always

[Install]
WantedBy=multi-user.target
EOL

# Step 7: Reload systemd so it picks up the new service
sudo systemctl daemon-reload

# Optionally, enable the service to start at boot
sudo systemctl enable prefect-worker.service

# Step 8: Start the Prefect Worker Service
sudo systemctl start prefect-worker.service

echo "Setup completed. The Echoflow worker is now running. Send tasks to $workPool using Prefect UI or CLI."

docs/configuration/blocks.md

Lines changed: 105 additions & 0 deletions
# Echoflow Configuration and Credential Blocks

Echoflow leverages the concept of "blocks" from Prefect, which serve as containers for storing various types of data, including credentials and sensitive information. Currently, Echoflow supports two types of blocks: Azure Cosmos DB Credentials Block and AWS Credentials Block. These blocks allow you to securely store sensitive data while benefiting from Prefect's robust integration capabilities.

For a deeper understanding of blocks, you can refer to the [Prefect documentation](https://docs.prefect.io/2.11.5/concepts/blocks/).

## Types of Blocks in Echoflow

In the context of Echoflow, there are two main categories of blocks:

### 1. Echoflow Configuration Blocks

These blocks serve as repositories for references to credential blocks, as well as for the various Prefect profiles that have been established using Echoflow's functions.

### 2. Credential Blocks

Credential blocks store sensitive information, such as authentication keys and tokens, securely. Echoflow integrates with Prefect's capabilities to ensure that sensitive data is protected.

## Creating Credential Blocks

Credential blocks can be conveniently created using an `.ini` file. By leveraging Prefect's integration, Echoflow ensures that the credentials stored in these blocks are handled securely. To create a credential block, follow these steps:

1. Open the `credentials.ini` file, which is located under the `.echoflow` directory in your home directory.
   ```bash
   # Terminal command
   cd ~/.echoflow
   ```
2. Place the necessary credential information within the `credentials.ini` file.
   ```bash
   # Terminal command
   nano credentials.ini # Or use any of your favourite editors
   ```
3. Store the updated `.ini` file in the `.echoflow` directory, which resides in your home directory.
4. Use the [echoflow load-credentials](../../echoflow/stages/subflows/echoflow.py#load_credential_configuration) command to generate a new credential block from the content of the `.ini` file.
   ```bash
   echoflow load-credentials
   ```
5. Add the name of the block in the pipeline or datastore YAML configuration files under the `storage_options` section, with the appropriate storage type (refer to [StorageType](../../echoflow/config/models/datastore.py#StorageType)).

   ```yaml
   # Example
   storage_options:
     block_name: echoflow-aws-credentials # Name of the block containing credentials
     type: AWS # Specify the storage type using StorageType enum
   ```

Providing the block name and storage type ensures that the correct block is used for storage operations and keeps the chosen storage type explicit.

Once a credential block is created, it can be managed through the Prefect Dashboard. Additionally, if needed, you can use the `echoflow load-credentials` command with the `--sync` argument to ensure your blocks stay up to date with any changes made in the Prefect UI. This keeps your configurations accurate and aligned across the application. **It is highly recommended to create new blocks whenever possible, as modifying existing blocks can lead to data loss or conflicts.**

## Considerations When Using `echoflow load-credentials`

When utilizing the `echoflow load-credentials` command, be aware of the following considerations:

- **Overwriting Values**: When using `echoflow load-credentials`, all the values from the `.ini` file will be written to the credential block, potentially overwriting existing values. Exercise caution when using this command to prevent unintentional data loss.
- **Creating New Blocks**: To maintain data integrity and security, it's advised to create new blocks rather than modifying existing ones. If editing an existing block becomes necessary, it should be done through the Prefect Dashboard.
- **Sync Argument**: The `--sync` argument is available in the `echoflow load-credentials` command. When set, this option syncs the credential block updates with the Prefect UI, as shown in the example below. This feature facilitates the seamless management of blocks through the dashboard, enhancing collaboration and control over credentials.

By adhering to these guidelines, you can ensure the secure management of sensitive information while effectively configuring and utilizing Echoflow within your projects.
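For example, a typical sequence after editing `credentials.ini` looks like the following (the `--sync` run is only needed when you also manage the block from the Prefect UI):

```bash
# Create or update the credential block from ~/.echoflow/credentials.ini
echoflow load-credentials

# Re-run with --sync to keep the block aligned with edits made in the Prefect UI
echoflow load-credentials --sync
```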
# Configuration File Explanation: credentials.ini

This Markdown file explains the structure and contents of the `credentials.ini` configuration file.

## AWS Section

The `[AWS]` section contains configuration settings related to AWS credentials.

- `aws_access_key_id`: Your AWS access key.
- `aws_secret_access_key`: Your AWS secret access key.
- `aws_session_token`: AWS session token (optional).
- `region_name`: AWS region name.
- `name`: Name of the AWS credentials configuration.
- `active`: Indicates if the AWS credentials are active (True/False).
- `options`: Additional options for AWS configuration.

## AzureCosmos Section

The `[AZCosmos]` section contains configuration settings related to Azure Cosmos DB credentials.

- `name`: Name of the Azure Cosmos DB credentials configuration.
- `connection_string`: Azure Cosmos DB connection string.
- `active`: Indicates if the Azure Cosmos DB credentials are active (True/False).
- `options`: Additional options for Azure Cosmos DB configuration.

Example:

```ini
[AWS]
aws_key = my-access-key
aws_secret = my-secret-key
token = my-session-token
region = us-west-1
name = my-aws-credentials
active = True
option_key = option_value

[AZCosmos]
name = my-az-cosmos-credentials
connection_string = my-connection-string
active = True
option_key = option_value
```

docs/configuration/datastore.md

Lines changed: 59 additions & 0 deletions
# Echoflow Run Configuration Documentation

This document provides detailed explanations for the keys used in the YAML configuration that defines an Echoflow run.

## Run Details

- `name`: This key specifies the name of the Echoflow run. It is used to identify and label the execution of the Echoflow process.
- `sonar_model`: This key indicates the model of the sonar used for data collection during the run.
- `raw_regex`: This key indicates the regex used while parsing the source directory to match the files to be processed.

## Input Arguments

- `urlpath`: This key defines the source data URL pattern for accessing raw data files. The pattern can contain placeholders that will be dynamically replaced during execution.
- `parameters`: This section holds parameters used in the source data URL. These parameters dynamically replace placeholders in the URL path.
- `storage_options`: This section defines storage options for accessing source data. It may include settings to anonymize access to the data.
- `transect`: This section provides information about the transect data, including the URL of the transect file and storage options.
- `json_export`: When set to true, this key indicates that raw JSON metadata of files should be exported for processing.
- `raw_json_path`: This key defines the path where the raw JSON metadata will be stored. It can be used to skip parsing files in the source directory and instead fetch files from this JSON.

## Output Arguments

- `urlpath`: This key defines the destination data URL where processed data will be stored.
- `overwrite`: When set to true, this key specifies that the data should overwrite any existing data in the output directory.
- `storage_options`: This section defines storage options for the destination data, which may include details such as the block name and type.

## Notes

- The provided configuration serves as a structured setup for executing an Echoflow run, allowing customization through the specified keys.
- Dynamic placeholders like `ship_name`, `survey_name`, and `sonar_model` are replaced with actual values based on the context.

Example:

```yaml
name: Bell_M._Shimada-SH1707-EK60 # Name of the Echoflow run
sonar_model: EK60 # Sonar model
raw_regex: (.*)-?D(?P<date>\w{1,8})-T(?P<time>\w{1,6}) # Regex to parse the filenames
args: # Input arguments
  urlpath: s3://ncei-wcsd-archive/data/raw/{{ ship_name }}/{{ survey_name }}/{{ sonar_model }}/*.raw # Source data URL
  parameters: # Source data URL parameters
    ship_name: Bell_M._Shimada
    survey_name: SH1707
    sonar_model: EK60
  storage_options: # Source data storage options
    anon: true
  transect: # Source data transect information
    file: ./x0007_fileset.txt # Transect file URL. Accepts .zip or .txt file
    storage_options: # Transect file storage options
      block_name: echoflow-aws-credentials # Block name. For more information on blocks, refer to blocks.md
      type: AWS # Block type
    default_transect_num: 1 # Set when not using a file to pass transect information
  json_export: true # Export raw JSON metadata of files to be processed
  raw_json_path: s3://echoflow-workground/combined_files/raw_json # Path to store the raw JSON metadata. Can also be used to skip parsing the source directory and instead fetch the files listed in this JSON.
output: # Output arguments
  urlpath: s3://echoflow-workground/combined_files_dask # Destination data URL
  overwrite: true # Flag to overwrite the data if present in the output directory
  storage_options: # Destination data storage options
    block_name: echoflow-aws-credentials
    type: AWS
```
