Commit 87405ce
[GNN] Adds example building dockerfile for H100s. (#737)
* adds updated Dockerfile for building
* renames Dockerfile to Dockerfile.h100, and restore old Dockerfile
* updates README
* adds a small commit to retrigger check
1 parent db0558a commit 87405ce

2 files changed: 52 additions and 1 deletion.

graph_neural_network/Dockerfile.h100

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+FROM nvcr.io/nvidia/pytorch:22.12-py3
+
+WORKDIR /workspace/repository
+
+RUN pip install scikit-learn==0.24.2
+RUN pip install torch_geometric==2.4.0
+RUN pip install torch_scatter==2.1.1 torch_sparse==0.6.17
+RUN pip install graphlearn-torch==0.2.2
+
+RUN apt update
+RUN apt install -y git
+RUN pip install git+https://github.com/mlcommons/logging.git
+
+# TF32 instead of FP32 for faster compute
+ENV NVIDIA_TF32_OVERRIDE=1
+
+COPY . .
+WORKDIR /workspace/repository
+
+RUN git clone https://github.com/alibaba/graphlearn-for-pytorch.git
+WORKDIR /workspace/repository/graphlearn-for-pytorch
+RUN git checkout 910cb55
+RUN git submodule update --init
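
Reviewer note: `Dockerfile.h100` pins five Python packages with `pip install name==version`. A minimal sketch of how those pins could be checked from inside the built image is shown below. It is not part of the commit. The helper name `find_mismatches` and the `PINNED` table are hypothetical, and the sketch assumes that `importlib.metadata` resolves the distribution names exactly as they are spelled in the Dockerfile. Local version suffixes (for example a `+...` build tag) are ignored, which is also an assumption. The README changes in this commit later rebuild GraphLearn-Torch from source with `--force-reinstall`, so a different `graphlearn-torch` version reported after that step is not necessarily a problem.

```python
from importlib import metadata

# Pins copied from the `RUN pip install` lines of Dockerfile.h100.
PINNED = {
    "scikit-learn": "0.24.2",
    "torch_geometric": "2.4.0",
    "torch_scatter": "2.1.1",
    "torch_sparse": "0.6.17",
    "graphlearn-torch": "0.2.2",
}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not met.

    `found` is None when the package is not installed at all.
    `get_version` is injectable so the logic can be exercised without
    the real packages being present.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            mismatches[name] = (expected, None)
            continue
        # Assumption: a local build suffix such as "+abc" is not a mismatch.
        if found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches(PINNED)
    if not problems:
        print("all pinned packages match")
    for name, (expected, found) in problems.items():
        print(f"{name}: expected {expected}, found {found}")
```

Run as a script inside the container, it prints one line per unmet pin, or a single confirmation line when every pin is satisfied.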

graph_neural_network/README.md

Lines changed: 29 additions & 1 deletion
@@ -31,6 +31,34 @@ cd training/gnn_node_classification/
 docker build -f Dockerfile -t training_gnn:latest .
 ```

+##### 2.1 Building on NVIDIA H100
+
+The official Dockerfile supports only NVIDIA A100 GPUs, and `Dockerfile.h100` helps build and run GNN reference on NVIDIA H100 machines. To build the image:
+
+```bash
+cd training/graph_neural_network
+docker build -f Dockerfile.h100 -t training_gnn:h100 .
+```
+
+Once the image is built, we need to run this image **on H100 machines with at least 1 GPU mounted in the container**:
+
+```bash
+docker run -it --rm --network=host --ipc=host --gpus all training_gnn:h100
+```
+
+Inside the container, we follow the same build process detailed in [GraphLearn-Torch's README](https://github.com/alibaba/graphlearn-for-pytorch):
+
+```bash
+# inside the current container image with H100 mounted:
+bash install_dependencies.sh
+
+python3 setup.py bdist_wheel
+pip install dist/* --force-reinstall
+```
+
+The container can now be used on H100 machines once the above installation steps are done. To verify, we can run `import graphlearn_torch as glt` in Python REPL. GLT is successfully installed for H100 if the import statement ends successfully without raising any error, and we can subsequently export the container with `docker commit` to save the container for future uses.
+
+Once this is done, we should `cd /workspace/repository` and follow the same training workflow from there.

 ### Steps to download and verify data
 Download the dataset:
@@ -167,4 +195,4 @@ This benchmark is a collaborative effort with contributions from Alibaba, Intel,

 - Alibaba: Li Su, Baole Ai, Wenting Shen, Shuxian Hu, Wenyuan Yu, Yong Li
 - Nvidia: Yunzhou (David) Liu, Kyle Kranen, Shriya Palasamudram
-- Intel: Kaixuan Liu, Hesham Mostafa, Sasikanth Avancha, Keith Achorn, Radha Giduthuri, Deepak Canchi
+- Intel: Kaixuan Liu, Hesham Mostafa, Sasikanth Avancha, Keith Achorn, Radha Giduthuri, Deepak Canchi
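
Reviewer note: the README text added in this commit verifies the install by running `import graphlearn_torch as glt` in a Python REPL and checking that no error is raised. The same check could be scripted so that it returns an exit code, which may be convenient before running `docker commit`. The sketch below is an illustration only and is not part of the commit. The helper name `can_import` is hypothetical, and catching every `Exception` (rather than only `ImportError`) is an assumption made so that failures raised while loading compiled extensions are also reported.

```python
import importlib


def can_import(module_name):
    """Try to import `module_name` and return (ok, detail).

    ok is True when the import raises no exception. detail is the
    module's `__version__` if it defines one, otherwise a placeholder,
    or a description of the exception on failure.
    """
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # assumption: any failure counts as "not installed"
        return False, f"{type(exc).__name__}: {exc}"
    return True, str(getattr(module, "__version__", "unknown version"))


if __name__ == "__main__":
    ok, detail = can_import("graphlearn_torch")
    status = "succeeded" if ok else "failed"
    print(f"graphlearn_torch import {status}: {detail}")
    # A non-zero exit code lets a shell script stop before `docker commit`.
    raise SystemExit(0 if ok else 1)
```

A passing import shows only that the package loads. It does not exercise any GPU code path, so a short training run remains the stronger check.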
