Commit a6d81d3

Merge pull request #24 from uwescience/lesson_content
Adding benchmarking and scaling sections and polishing content
2 parents 8bed483 + 89ed235 commit a6d81d3

7 files changed

+207
-14
lines changed

docs/_config.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
# Book settings
22
# Learn more at https://jupyterbook.org/customize/config.html
33

4-
title: GitHub Actions for Scientific Workflows (SciPy 2024)
4+
title: GitHub Actions for Scientific Data Workflows (SciPy 2024)
55
author: Valentina Staneva, George (Quinn) Brencher, Scott Henderson
66
logo: logo.png
77

docs/_toc.yml

Lines changed: 4 additions & 2 deletions
@@ -10,5 +10,7 @@ chapters:
1010
- file: caching
1111
- file: exporting-results
1212
- file: visualizing-results-webpage
13-
- file: ../glacier_image_correlation/README
14-
title: Batch Computing
13+
- file: batch-computing
14+
title: Scaling Workflows
15+
- file: model_benchmarking
16+
title: Collaborative Model Versioning and Benchmarking

docs/batch-computing.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
1+
# Scaling Workflows
2+
3+
We demonstrate how GitHub Actions can be used for scaling computationally expensive workflows through a use case aiming to measure glacier surface velocity from satellite
4+
imagery. This section shows:
5+
* how to perform batch computing by running many workflows in parallel
6+
* how to build complex pipelines by calling workflows from another workflow
7+
* how to specify parameters to run a workflow
8+
9+
10+
11+
# Measuring Glacier Surface Velocity
12+
#### Quinn Brencher, University of Washington
13+
14+
This set of GitHub Actions workflows allows you to measure horizontal glacier surface velocity from Sentinel-2 image pairs using [autoRIFT software](https://github.com/nasa-jpl/autoRIFT). No external accounts or API keys are required. These workflows were created for the GitHub Actions for Scientific Data Workflows workshop at the 2024 SciPy conference.
15+
16+
## Usage
17+
We use three workflows to batch process image pairs for glacier surface velocity. For demonstration purposes the workflows are only set up to work over the [Yazghil Glacier](https://earth.google.com/earth/d/1myewNJrDEM0tW1_xdpWCYaRCGDcOBwiy?usp=drive_link) in Pakistan. To run the workflows, simply fork this repository, visit the "Actions" tab, and choose the `batch_image_correlation` workflow (which runs the other two workflows as well).
18+
19+
![plot](https://github.com/uwescience/SciPy2024-GitHubActionsTutorial/blob/main/glacier_image_correlation/images/workflow_diagram.png)
20+
21+
### 1. `image_correlation_pair`
22+
This workflow calls a Python script (`image_correlation.py`) that runs autoRIFT on a pair of spatially overlapping [Sentinel-2 L2A](https://docs.sentinel-hub.com/api/latest/data/sentinel-2-l2a/) images. It requires the [product names](https://sentiwiki.copernicus.eu/web/s2-products) of the two images. The images are downloaded from AWS using the [Element 84 Earth Search API](https://element84.com/earth-search/). Only the near-infrared band (NIR, B08), which has a spatial resolution of 10 m, is used. autoRIFT is used to perform image correlation. Search distances are scaled with temporal baseline assuming a maximum surface velocity of 1000 m/yr, so images acquired farther apart in time take longer to process. Surface velocity maps are saved as GeoTIFFs and uploaded as [GitHub Artifacts](https://docs.github.com/en/actions/using-workflows/storing-workflow-data-as-artifacts).
23+
24+
![plot](https://github.com/uwescience/SciPy2024-GitHubActionsTutorial/blob/main/glacier_image_correlation/images/input_images.png)
25+
26+
### 2. `batch_image_correlation`
27+
This workflow can be used to create surface velocity maps from many pairs of Sentinel-2 images. Required inputs include maximum cloud cover percent, start month (recommend >=5 to minimize snow cover), end month (recommend <=10 to minimize snow cover), and number of pairs per image, e.g.:
28+
- 1 pair per image: (img<sub>i</sub>, img<sub>i+1</sub>), (img<sub>i+1</sub>, img<sub>i+2</sub>), (img<sub>i+2</sub>, img<sub>i+3</sub>), ...
29+
- 2 pairs per image: (img<sub>i</sub>, img<sub>i+1</sub>), (img<sub>i</sub>, img<sub>i+2</sub>), (img<sub>i+1</sub>, img<sub>i+2</sub>), ...
30+
- 3 pairs per image: (img<sub>i</sub>, img<sub>i+1</sub>), (img<sub>i</sub>, img<sub>i+2</sub>), (img<sub>i</sub>, img<sub>i+3</sub>), ...
31+
32+
Only the first suitable image is selected for each month. Once image pairs are identified, a matrix job is set up to run `image_correlation_pair` for each pair. Finally, `summary_statistics` is run.
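The pairing rule above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the repository: the function names and the `img1`/`img2` matrix keys are hypothetical. It assumes the images are already sorted by acquisition date, and that the matrix is handed to the matrix job as a JSON string (GitHub Actions can parse such a string with the `fromJSON` expression).

```python
import json


def make_pairs(image_ids, pairs_per_image):
    """Pair each image with up to `pairs_per_image` of the images that follow it."""
    pairs = []
    for i, reference in enumerate(image_ids):
        # Slicing stops at the end of the list, so the last images get fewer pairs.
        for secondary in image_ids[i + 1 : i + 1 + pairs_per_image]:
            pairs.append((reference, secondary))
    return pairs


def to_matrix_json(pairs):
    """Serialize the pairs as a matrix definition (key names are hypothetical)."""
    return json.dumps({"include": [{"img1": a, "img2": b} for a, b in pairs]})


image_ids = ["img0", "img1", "img2", "img3"]  # placeholder names, oldest first
pairs = make_pairs(image_ids, pairs_per_image=2)
print(pairs)
print(to_matrix_json(pairs))
```

Note that the number of matrix jobs grows roughly as the number of images times the number of pairs per image, which is worth keeping in mind when choosing the inputs.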
33+
34+
### 3. `summary_statistics`
35+
This workflow downloads all of the velocity maps created during a `batch_image_correlation` run and uses them to calculate and plot median velocity, standard deviation of velocity, and valid pixel count across all velocity maps. The summary statistics plot is uploaded as a GitHub Artifact.
36+
37+
![plot](https://github.com/uwescience/SciPy2024-GitHubActionsTutorial/blob/main/glacier_image_correlation/images/velocity_summary_statistics.png)
38+
39+
40+
## Acknowledgements
41+
- Scott Henderson developed many of the original ideas and much of the code used for this set of workflows
42+
- [University of Washington eScience Incubator Program 2024](https://escience.washington.edu/incubator-24-glacial-lakes/)
43+

docs/exporting-results.md

Lines changed: 4 additions & 6 deletions
@@ -32,14 +32,12 @@ gh run download
3232

3333
The workflow run also provides a publicly available link to the download artifact:
3434

35-
Artifact download URL: [https://github.com/uwescience/SciPy2024-
36-
GitHubActionsTutorial/actions/runs/9591972369/artifacts/1619380017](https://github.com/uwescience/SciPy2024-
35+
Artifact download URL: [`https://github.com/uwescience/SciPy2024-GitHubActionsTutorial/actions/runs/9591972369/artifacts/1619380017`](https://github.com/uwescience/SciPy2024-
3736
GitHubActionsTutorial/actions/runs/9591972369/artifacts/1619380017)
3837

39-
There is a `download-artifact` action to download the artifacts and share between jobs within a workflow run (note this is limited to the inidividual workflow run, for downloading across runs use the other options).
38+
There is a `download-artifact` action to download artifacts and share them between jobs within a workflow run (note that this is limited to the individual workflow run; for downloading across runs, use the other options).
4039

41-
[Here](Artifact download URL: https://github.com/uwescience/SciPy2024-
42-
GitHubActionsTutorial/actions/runs/9591972369/artifacts/1619380017) is more detailed documentation on GitHub Artifacts.
40+
[Here](https://docs.github.com/en/actions/using-workflows/storing-workflow-data-as-artifacts) you can find more detailed documentation on GitHub Artifacts.
4341

4442

4543

@@ -55,7 +53,7 @@ The approach consists of a few steps:
5553
* we will use [AnimMouse/setup-rclone](https://github.com/marketplace/actions/setup-rclone-action)
5654
* configure a Google Drive remote locally
5755
* encode the text in the config file and save it as a secret `RCLONE_CONFIG`
58-
* MacOX: `openssl base64 -in ~/.config/rclone/rclone_drive.conf`
56+
* `openssl base64 -in ~/.config/rclone/rclone_drive.conf`
5957
* run the `rclone` command to upload the plots to Google Drive
6058
* `rclone copy ambient_sound_analysis/img/broadband.png mydrive:rclone_uploads/`
6159

docs/getting-started.md

Lines changed: 10 additions & 4 deletions
@@ -1,19 +1,25 @@
11
# Setup
2-
* Fork this repo
3-
* Enable Github Actions:
2+
3+
* We expect all participants to have a GitHub account (if not you can make one here [https://github.com/login](https://github.com/login))
4+
* Fork [https://github.com/uwescience/SciPy2024-GitHubActionsTutorial](https://github.com/uwescience/SciPy2024-GitHubActionsTutorial)
5+
* Enable GitHub Actions:
46
* Settings -> Actions -> Allow actions and reusable workflows
57
* [Managing Permissions
68
Documentation](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/enabling-features-for-your-repository/managing-github-actions-settings-for-a-repository)
79

810

9-
All workflow configurations are stored in the [`.github/workflows`](https://github.com/uwescience/SciPy2024-GitHubActionsTutorial/tree/main/.github/workflows) and will go through them in the following order:
11+
All workflow configurations are stored in the [`.github/workflows`](https://github.com/uwescience/SciPy2024-GitHubActionsTutorial/tree/main/.github/workflows) folder and we will go through them in the following order:
1012

1113
1. [`python_env.yml`](https://github.com/uwescience/SciPy2024-GitHubActionsTutorial/blob/main/.github/workflows/python_env.yml)
1214
2. [`conda_env.yml`](https://github.com/uwescience/SciPy2024-GitHubActionsTutorial/blob/main/.github/workflows/conda_env.yml)
1315
3. [`noise_processing.yml`](https://github.com/uwescience/SciPy2024-GitHubActionsTutorial/blob/main/.github/workflows/noise_processing.yml)
1416
4. [`create_website_spectrogram.yml`](https://github.com/uwescience/SciPy2024-GitHubActionsTutorial/blob/main/.github/workflows/create_website_spectrogram.yml)
1517
5. [`create_website.yml`](https://github.com/uwescience/SciPy2024-GitHubActionsTutorial/blob/main/.github/workflows/create_website.yml)
16-
6. ...
18+
6. [`batch_image_correlation.yml`](https://github.com/uwescience/SciPy2024-GitHubActionsTutorial/blob/main/.github/workflows/batch_image_correlation.yml)
19+
7. [`image_correlation_pair.yml`](https://github.com/uwescience/SciPy2024-GitHubActionsTutorial/blob/main/.github/workflows/image_correlation_pair.yml)
20+
8. [`summary_statistics.yml`](https://github.com/uwescience/SciPy2024-GitHubActionsTutorial/blob/main/.github/workflows/summary_statistics.yml)
21+
9. [`model_benchmarking.yml`](https://github.com/uwescience/SciPy2024-GitHubActionsTutorial/blob/main/.github/workflows/model_benchmarking.yml)
22+
10. [`create_website_benchmarks.yml`](https://github.com/uwescience/SciPy2024-GitHubActionsTutorial/blob/main/.github/workflows/create_website_benchmarks.yml)
1723

1824

1925

docs/intro.md

Lines changed: 100 additions & 1 deletion
@@ -1,4 +1,103 @@
1-
# Welcome to GitHub Actions for Scientific Workflows
1+
# Welcome to GitHub Actions for Scientific Data Workflows
22

3+
4+
Tutorial presented at [SciPy 2024 Conference](https://www.scipy2024.scipy.org/)
5+
6+
Authors: Valentina Staneva, Quinn Brencher, Scott Henderson
7+
8+
## Abstract
9+
10+
In this tutorial we will introduce GitHub Actions to scientists as a tool for lightweight automation of scientific data workflows. We will
11+
demonstrate that GitHub Actions are not just a tool for software testing, but can be used in various ways to improve the reproducibility
12+
and impact of scientific analysis. Through a sequence of examples, we will demonstrate some of GitHub Actions' applications to scientific
13+
workflows, such as scheduled deployment of algorithms to sensor streams, updating visualizations based on new data, processing large
14+
datasets, model versioning and performance benchmarking. GitHub Actions can particularly empower Python scientific programmers who are not
15+
willing to build fully-fledged applications or set up complex computational infrastructure, but would like to increase the impact of their
16+
work. The goal is that participants will leave with their own ideas of how to integrate GitHub Actions in their own work.
17+
18+
## Description
19+
20+
GitHub Actions are quite popular within the software engineering community, but a scientific Python programmer may not have seen their use
21+
beyond a continuous integration framework for unit testing. We would like to increase their visibility through a scientific workflow lens.
22+
We will use examples that are relevant to the community: wrangling a messy realtime hydrophone data stream to display noise sounds from the
23+
Puget Sound (not far from the conference venue!) or processing hundreds of satellite radar images over glacial lakes in High-Mountain Asia
24+
to study flood hazards. We assume no knowledge of GitHub Actions and will start slowly with a “Hello World” step, but build quickly to
25+
create complex and exciting workflows. We will also showcase their value for scientific collaborations across institutions as a means to
26+
share reproducible workflows and computing infrastructure.
27+
28+
## Prerequisites
29+
GitHub account, familiarity with git (commits, versioning), GitHub (push, pull requests), and Python (conda, scipy, matplotlib), some maturity in manipulating scientific data and
30+
exposure to the challenges associated with it, ability to read code (our examples may use libraries not familiar to the audience, but the
31+
focus will be on the steps these libraries accomplish rather than the details)
32+
33+
## Installation Instructions
34+
Participants can make edits from the GitHub interface, but if they want to make updates locally, they need a working git installation
35+
([set up instructions](https://swcarpentry.github.io/git-novice/#installing-git))
36+
37+
## Outline
38+
39+
### Short Version
340
```{tableofcontents}
441
```
42+
43+
### Long Version (with approximate schedule)
44+
* Overview of GitHub Actions and Workflows and their popular uses in Python software development (examples of testing, linting,
45+
packaging) (20 min)
46+
* We will explain the main components of GitHub Actions and associated terminology
47+
* We will summarize their typical uses in software development
48+
* We will point to popular GitHub Actions used in Python software development and packaging (the focus of this tutorial will not be
49+
on them but rather on scientific pipelines)
50+
51+
* Setting up your first workflow: a scientific Python environment (20 min)
52+
* participants will update a workflow `.yml` file to create an environment with their favorite Python libraries
53+
* participants will inspect the GitHub interface to see the workflow runs
54+
55+
* Scheduled algorithm deployment to a realtime stream (30 min)
56+
* we will deploy a typical scientific workflow: reading data, converting to a new format, and making a visualization
57+
* participants will update the deployment schedule to trigger a new workflow and will monitor the progress in the GitHub interface
58+
59+
* Break (15 min)
60+
61+
* Exporting results (30 min)
62+
* participants will learn about various ways to store the results:
63+
* caching
64+
* committing to GitHub
65+
* creating GitHub artifacts
66+
* storing to personal storage
67+
* they will modify the code to make a new plot which will be automatically updated
68+
* they will use either matplotlib or an interactive library such as plotly
69+
70+
* Update results on a webpage (30 min)
71+
* we will overview different ways to display scientific results on a webpage
72+
* we will demonstrate the workflow to deploy the webpage
73+
* participants will rerender the webpage based on the updates in GitHub
74+
75+
* Large-scale data processing (45 min)
76+
* we will demonstrate a use-case of processing large data sets with GitHub Actions
77+
* participants will fiddle with problem size to understand the power and limits of the computational infrastructure
78+
* we will discuss connections to cluster/cloud computing
79+
80+
* Break (10 min)
81+
82+
* Model Versioning and Benchmarking (20 min)
83+
* we will introduce how to leverage GitHub’s version control to version different models and performance
84+
* participants can contribute a new model and check its performance
85+
* we will discuss how this can be used as a community network to share methods and results
86+
87+
* Recap and Discussion (or buffer time) (20 min)
88+
* we will have a discussion on potential uses of GitHub Actions within the work of the participants
89+
90+
91+
# References
92+
* [*GitHub Actions for Scientific Data Workflows*](https://github.com/valentina-s/GithubActionsTutorial-USRSE23), Valentina Staneva,
93+
[US-RSE 2023 Tutorial](https://us-rse.org/usrse23/program/tutorials/)
94+
* [*Characterizing glacial lake outburst flood hazard at regional scale using fused InSAR-speckle tracking surface displacement time
95+
series*](https://escience.washington.edu/2024-incubator-projects/), Quinn Brencher and Scott Henderson, eScience Institute Data Incubator
96+
Project, 2024, [[repo](https://github.com/relativeorbit/actions-batch-demo)]
97+
* [*GitHub Actions Workflows for Scheduled Algorithm
98+
Deployment*](https://summerofcode.withgoogle.com/archive/2021/projects/5026942771789824), Dmitry Volodin, Jesse Lopez, Scott Veirs, Val
99+
Veirs, Valentina Staneva, Orcasound Google Summer Of Code 2021 Project, [[repo]](https://github.com/orcasound/orca-action-workflow)
100+
* [*GitHub Actions Documentation*](https://docs.github.com/en/actions/learn-github-actions)
101+
102+
103+

docs/model_benchmarking.md

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
1+
# Collaborative Model Versioning and Benchmarking
2+
3+
Here we describe a scenario in which users submit different models to be applied to common data and compare the results. We leverage GitHub's core features for code versioning and collaborative development, and set up a GitHub Actions workflow that triggers the evaluation when a user creates a pull request with a new version of the model, then updates a table with the user's results and the corresponding commit SHA.
4+
5+
We will use a simple approach to approximate the number of ships passing during a time window by counting the number of peaks that appear above a threshold in the broadband plot. The threshold is set in the [`model_benchmarking.py`](https://github.com/uwescience/SciPy2024-GitHubActionsTutorial/blob/main/ambient_sound_analysis/model_benchmarking.py) script.
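As a rough illustration of the idea, the sketch below counts local maxima that rise above a threshold in a list of broadband levels. It is an assumption about how such a count could be done, not the contents of `model_benchmarking.py`: the function name `count_peaks` and the sample values are made up, and the actual script may use a different peak definition or a library routine.

```python
def count_peaks(levels, threshold):
    """Count local maxima in `levels` that are strictly above `threshold`."""
    count = 0
    for i in range(1, len(levels) - 1):
        # A plateau is counted once, at its first sample.
        is_local_max = levels[i - 1] < levels[i] >= levels[i + 1]
        if is_local_max and levels[i] > threshold:
            count += 1
    return count


# Made-up broadband levels; each peak above the threshold is treated as one ship.
broadband = [80, 82, 95, 84, 81, 99, 101, 90, 83, 85, 82]
print(count_peaks(broadband, threshold=90))
```

Raising the threshold discards the weaker peaks, which is why different threshold values submitted by different users lead to different ship counts.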
6+
7+
8+
## Model Versioning Workflow
9+
The workflow which triggers the model evaluation is in [`model_benchmarking.yml`](https://github.com/uwescience/SciPy2024-GitHubActionsTutorial/blob/main/.github/workflows/model_benchmarking.yml). It consists of the following steps:
10+
11+
1. it gets triggered on `pull_request`
12+
* the `synchronize` type ensures it gets triggered when somebody updates an existing pull request
13+
2. it runs the `model_benchmarking.py` script which creates a `.csv` file containing the estimated number of ships
14+
3. It appends extra metadata about the submission to the row with the number of ships: username, commit SHA, pull request title
15+
4. It stores the row in a `score_[SHA].csv` file
16+
5. It commits the 1-row file to the `ambient_sound_analysis/csv` folder
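Steps 3 and 4 can be sketched in Python as follows. This is only an assumed shape for the score file: the column names, the function name `write_score_row`, and the example values are hypothetical. In the workflow itself, the username, commit SHA, and pull request title would be passed in from the `github` context.

```python
import csv
import tempfile
from pathlib import Path


def write_score_row(n_ships, username, sha, pr_title, out_dir):
    """Write a one-row CSV named score_<sha>.csv and return its path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"score_{sha}.csv"
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["n_ships", "username", "sha", "pr_title"])  # hypothetical columns
        writer.writerow([n_ships, username, sha, pr_title])
    return path


# Example run in a temporary folder with placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    score_file = write_score_row(4, "octocat", "abc1234", "Lower the threshold", tmp)
    print(score_file.name)
    print(score_file.read_text())
```

Because each submission gets its own file keyed by the commit SHA, concurrent pull requests do not overwrite each other's results.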
17+
18+
19+
## Model Benchmarking Workflow
20+
21+
The next workflow follows the steps of the `create_website_spectrogram` workflow, which converts a notebook [`display_benchmarks`](https://github.com/uwescience/SciPy2024-GitHubActionsTutorial/blob/main/ambient_sound_analysis/display_benchmarks.ipynb) to a website. In this case, we have a very simple notebook which reads all `score_[SHA].csv` files and displays a "benchmark table" with the individual entries. This notebook is converted to a webpage ([https://uwescience.github.io/SciPy2024-GitHubActionsTutorial/display_benchmarks.html](https://uwescience.github.io/SciPy2024-GitHubActionsTutorial/display_benchmarks.html/)).
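The table-building step of such a notebook might look like the sketch below. It is an assumption, not the contents of `display_benchmarks.ipynb`: the helper name `load_scores` and the column names are hypothetical, and it uses only the standard library, whereas the actual notebook may rely on other tools.

```python
import csv
import tempfile
from pathlib import Path


def load_scores(folder):
    """Read every score_*.csv in `folder` and return the rows as dictionaries."""
    rows = []
    for path in sorted(Path(folder).glob("score_*.csv")):
        with path.open(newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows


# Demonstration with two made-up score files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for sha, n_ships in [("abc1234", 3), ("def5678", 5)]:
        (Path(tmp) / f"score_{sha}.csv").write_text(f"n_ships,sha\n{n_ships},{sha}\n")
    for row in load_scores(tmp):
        print(row["sha"], row["n_ships"])
```

Each new pull request adds one file, so the table grows by one row every time the website workflow rerenders the notebook.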
22+
23+
### Exercise
24+
25+
Create a branch and update the `model_benchmarking.py` file with a different threshold:
26+
27+
```
28+
# set threshold
29+
threshold = ??
30+
```
31+
32+
Submit a pull request from this branch to main and monitor the execution of the workflows. Check out the generated website at [https://uwescience.github.io/SciPy2024-GitHubActionsTutorial/display_benchmarks.html](https://uwescience.github.io/SciPy2024-GitHubActionsTutorial/display_benchmarks.html/).
33+
34+
35+
36+
37+
38+
39+
40+
41+
42+
43+
44+
45+
