Add A10g workflow to gain access to A10G GPU #2558
Closed
wants to merge 9 commits into from
12 changes: 12 additions & 0 deletions .ci/torchbench/check-ssh.sh
@@ -0,0 +1,12 @@
#!/bin/bash
set -eou pipefail

echo "Holding runner for 2 hours until all ssh sessions have logged out"
for _ in $(seq 1440); do
    # Break if no ssh session exists anymore
    if [ "$(who)" = "" ]; then
        break
    fi
    echo "."
    sleep 5
done
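
The loop above polls `who` up to 1440 times at 5-second intervals (1440 x 5 s = 7200 s, the 2 hours named in the message). As a minimal sketch only, the same polling pattern could be written as a function with the poll count, interval, and probe command as parameters. The function name `wait_for_sessions`, its arguments, and its return codes are hypothetical and are not part of this PR.

```shell
#!/bin/bash
# Hypothetical helper, not part of the PR: same polling idea as check-ssh.sh,
# with the poll budget, interval, and probe command passed in as arguments.
#
# wait_for_sessions MAX_POLLS INTERVAL_SECONDS [PROBE_CMD...]
#   Runs PROBE_CMD (default: who) until it prints nothing.
#   Returns 0 once the probe output is empty, 1 if MAX_POLLS is exhausted.
wait_for_sessions() {
    local max_polls="$1" interval="$2"
    shift 2
    # Fall back to `who`, as the original script does.
    if [ "$#" -eq 0 ]; then
        set -- who
    fi
    local i
    for ((i = 0; i < max_polls; i++)); do
        # Empty output means no login session is left.
        if [ -z "$("$@")" ]; then
            return 0
        fi
        sleep "${interval}"
    done
    return 1
}

# Example call matching the script's budget (commented out so that
# sourcing this file does not block):
# wait_for_sessions 1440 5 || echo "Sessions still active after 2 hours"
```

Unlike the original loop, this sketch reports whether sessions were still open when the budget ran out, which is the case the review comment in this PR asks about. What to do in that case is left to the caller.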
43 changes: 43 additions & 0 deletions .github/workflows/linux-test-a10g.yml
@@ -0,0 +1,43 @@
name: TorchBench PR Test on A10G
on:
  workflow_dispatch:
  pull_request:

jobs:
  linux-test-a10g:
    # Don't run on forked repos
    # Only run on PR labeled 'with-ssh'
    if: github.repository_owner == 'pytorch' && contains(github.event.pull_request.labels.*.name, 'with-ssh')
    runs-on: linux.g5.4xlarge.nvidia.gpu
    timeout-minutes: 240
    environment: docker-s3-upload
    env:
      CONDA_ENV: "pr-test-cuda"
      TEST_CONFIG: "cuda"
      HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
    steps:
      - name: Checkout TorchBench
        uses: actions/checkout@v3
      - name: Setup SSH (Click me for login details)
        uses: pytorch/test-infra/.github/actions/setup-ssh@main
        with:
          github-secret: ${{ secrets.TORCHBENCH_ACCESS_TOKEN }}
      - name: Install Conda
        run: |
          bash ./.ci/torchbench/install-conda.sh
      - name: Install TorchBench
        run: |
          bash ./.ci/torchbench/install.sh
      - name: Wait for SSH session to end
@atalman Do we need to kill SSH sessions that are still active after 2 hours?

        if: always()
        run: |
          bash ./.ci/torchbench/check-ssh.sh
      - name: Clean up Conda env
        if: always()
        run: |
          . ${HOME}/miniconda3/etc/profile.d/conda.sh
          conda remove -n "${CONDA_ENV}" --all

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
  cancel-in-progress: true
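
The `concurrency.group` expression joins three parts: the workflow name, the PR number (falling back to the commit SHA when there is no PR), and a boolean for whether the run was manually dispatched. The sketch below mimics that composition in bash to show which runs end up sharing a group, and so cancel each other under `cancel-in-progress: true`. The function `concurrency_group` and the sample SHA are hypothetical illustrations; GitHub Actions evaluates the real expression itself.

```shell
#!/bin/bash
# Hypothetical illustration of how the concurrency group string is composed.
#
# concurrency_group WORKFLOW PR_NUMBER SHA EVENT_NAME
#   PR_NUMBER may be empty (e.g. for workflow_dispatch), in which case SHA is
#   used, mirroring `github.event.pull_request.number || github.sha`.
concurrency_group() {
    local workflow="$1" pr_number="$2" sha="$3" event_name="$4"
    local is_dispatch="false"
    if [ "${event_name}" = "workflow_dispatch" ]; then
        is_dispatch="true"
    fi
    printf '%s-%s-%s\n' "${workflow}" "${pr_number:-${sha}}" "${is_dispatch}"
}

# Two pushes to the same PR produce the same group, so the older run is cancelled.
concurrency_group "TorchBench PR Test on A10G" "2558" "abc123" "pull_request"
# A manual dispatch has no PR number and gets its own group keyed by SHA.
concurrency_group "TorchBench PR Test on A10G" "" "abc123" "workflow_dispatch"
```

Under these assumptions, a manual run on a commit never shares a group with a PR-triggered run, because the trailing boolean and the PR-number-or-SHA component both differ.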