[GNN] Adds example building dockerfile for H100s. (#737)

Elnifio · web-flow · commit 87405ce77af1 · 2024-05-15T22:18:30.000-04:00
* adds updated Dockerfile for building

* renames Dockerfile to Dockerfile.h100, and restore old Dockerfile

* updates README

* adds a small commit to retrigger check
diff --git a/graph_neural_network/Dockerfile.h100 b/graph_neural_network/Dockerfile.h100
@@ -0,0 +1,23 @@
+FROM nvcr.io/nvidia/pytorch:22.12-py3
+
+WORKDIR /workspace/repository
+
+RUN pip install scikit-learn==0.24.2
+RUN pip install torch_geometric==2.4.0
+RUN pip install torch_scatter==2.1.1 torch_sparse==0.6.17
+RUN pip install graphlearn-torch==0.2.2
+
+RUN apt update
+RUN apt install -y git
+RUN pip install git+https://github.com/mlcommons/logging.git
+
+# TF32 instead of FP32 for faster compute
+ENV NVIDIA_TF32_OVERRIDE=1
+
+COPY . .
+WORKDIR /workspace/repository
+
+RUN git clone https://github.com/alibaba/graphlearn-for-pytorch.git
+WORKDIR /workspace/repository/graphlearn-for-pytorch
+RUN git checkout 910cb55
+RUN git submodule update --init
diff --git a/graph_neural_network/README.md b/graph_neural_network/README.md
@@ -31,6 +31,34 @@ cd training/gnn_node_classification/
 docker build -f Dockerfile -t training_gnn:latest .
 ```
 
+##### 2.1 Building on NVIDIA H100
+
+The official `Dockerfile` supports only NVIDIA A100 GPUs. `Dockerfile.h100` builds an image for running the GNN reference on NVIDIA H100 machines. To build the image:
+
+```bash
+cd training/graph_neural_network
+docker build -f Dockerfile.h100 -t training_gnn:h100 .
+```
+
+Once the image is built, run it **on an H100 machine with at least one GPU mounted in the container**:
+
+```bash
+docker run -it --rm --network=host --ipc=host --gpus all training_gnn:h100
+```
+
+Inside the container, follow the build process described in [GraphLearn-Torch's README](https://github.com/alibaba/graphlearn-for-pytorch):
+
+```bash
+# inside the running container, with an H100 GPU mounted:
+bash install_dependencies.sh
+
+python3 setup.py bdist_wheel
+pip install dist/* --force-reinstall
+```
+
+Once these installation steps are done, the container can be used on H100 machines. To verify, run `import graphlearn_torch as glt` in a Python REPL. GLT is installed correctly for H100 if the import completes without raising an error. The container can then be saved with `docker commit` for future use.
+
+After that, `cd /workspace/repository` and follow the same training workflow from there.
 
 ### Steps to download and verify data
 Download the dataset:
@@ -167,4 +195,4 @@ This benchmark is a collaborative effort with contributions from Alibaba, Intel,
 
 - Alibaba: Li Su, Baole Ai, Wenting Shen, Shuxian Hu, Wenyuan Yu, Yong Li
 - Nvidia: Yunzhou (David) Liu, Kyle Kranen, Shriya Palasamudram
-- Intel: Kaixuan Liu, Hesham Mostafa, Sasikanth Avancha, Keith Achorn, Radha Giduthuri, Deepak Canchi
+- Intel: Kaixuan Liu, Hesham Mostafa, Sasikanth Avancha, Keith Achorn, Radha Giduthuri, Deepak Canchi
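
The verification step in section 2.1 of the README change (importing `graphlearn_torch` in a Python REPL) could also be scripted. The following is a minimal sketch, not part of this commit: the helper name `check_import` is hypothetical, and it assumes only that a failed GLT build surfaces as an exception at import time, as the README text describes.

```python
import importlib


def check_import(module_name):
    """Try to import a module; return (ok, detail) instead of raising."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # ImportError, or a failure while loading native extensions
        return False, f"{type(exc).__name__}: {exc}"
    # __version__ is optional; fall back to a placeholder if the module lacks it
    return True, getattr(module, "__version__", "unknown version")


if __name__ == "__main__":
    # Hypothetical usage: run inside the container after the install steps
    ok, detail = check_import("graphlearn_torch")
    print("graphlearn_torch:", "OK" if ok else "FAILED", "-", detail)
```

Run inside the container after the wheel is installed, this reports success or failure without a traceback, which may be convenient before saving the container with `docker commit`.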