[GNN] Adds example building dockerfile for H100s. (#737)
* adds updated Dockerfile for building
* renames Dockerfile to Dockerfile.h100, and restore old Dockerfile
* updates README
* adds a small commit to retrigger check
The official Dockerfile supports only NVIDIA A100 GPUs; `Dockerfile.h100` builds and runs the GNN reference on NVIDIA H100 machines. To build the image:
Once the image is built, we need to run this image **on H100 machines with at least 1 GPU mounted in the container**:

```bash
docker run -it --rm --network=host --ipc=host --gpus all training_gnn:h100
```

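
Before building inside the container, it can help to confirm that a GPU is actually visible there. The following is a minimal Python sketch, not part of the reference implementation: it parses the output of `nvidia-smi -L` and assumes each GPU is listed on a line of the form `GPU 0: <name> (UUID: ...)`. The helper names `list_gpus`, `has_h100`, and `query_nvidia_smi` are hypothetical.

```python
import re
import shutil
import subprocess

# Assumed line format of `nvidia-smi -L`, for example:
#   GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-...)
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: ")


def list_gpus(listing):
    """Return the GPU model names found in `nvidia-smi -L` style output."""
    names = []
    for line in listing.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            names.append(match.group(2))
    return names


def has_h100(listing):
    """True if at least one listed GPU has 'H100' in its model name."""
    return any("H100" in name for name in list_gpus(listing))


def query_nvidia_smi():
    """Run `nvidia-smi -L`; return an empty string if the tool is missing."""
    if shutil.which("nvidia-smi") is None:
        return ""
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    return result.stdout


# Usage inside the container (hypothetical):
#   listing = query_nvidia_smi()
#   print(list_gpus(listing), "H100 visible:", has_h100(listing))
```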
Inside the container, we follow the same build process detailed in [GraphLearn-Torch's README](https://github.com/alibaba/graphlearn-for-pytorch):

```bash
# inside the current container image with H100 mounted:
bash install_dependencies.sh

python3 setup.py bdist_wheel
pip install dist/* --force-reinstall
```

The container can be used on H100 machines once the installation steps above are complete. To verify, run `import graphlearn_torch as glt` in a Python REPL: GLT is installed correctly for H100 if the import completes without raising an error. The container can then be saved for future use with `docker commit`.

Once this is done, `cd /workspace/repository` and follow the same training workflow from there.
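
As a scripted alternative to the REPL check described above, the import can be wrapped in a small helper that reports success or failure instead of raising. This is a minimal sketch under stated assumptions: `try_import` is a hypothetical helper, and the version lookup falls back to a placeholder because it is not assumed here that `graphlearn_torch` exposes `__version__`.

```python
import importlib


def try_import(module_name):
    """Try to import a module; return (ok, detail) instead of raising."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:
        # Covers ImportError as well as loader errors such as a
        # missing or incompatible shared library.
        return False, f"{type(exc).__name__}: {exc}"
    return True, str(getattr(module, "__version__", "unknown version"))


if __name__ == "__main__":
    ok, detail = try_import("graphlearn_torch")
    status = "succeeded" if ok else "failed"
    print(f"graphlearn_torch import {status}: {detail}")
```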
### Steps to download and verify data
Download the dataset:
@@ -167,4 +195,4 @@ This benchmark is a collaborative effort with contributions from Alibaba, Intel,
- Alibaba: Li Su, Baole Ai, Wenting Shen, Shuxian Hu, Wenyuan Yu, Yong Li