Commit 1c2c7b1

Split docker and docker swarm documentations
1 parent b9816cd commit 1c2c7b1

File tree

2 files changed: +163 −189 lines changed


README-docker_swarm.md (+157 lines)
@@ -0,0 +1,157 @@
# Tensorflow GPU Inference API with docker swarm

Please use **docker swarm** only if you need to:

* Provide redundancy in terms of API containers: in case a container goes down, incoming requests will be redirected to another running instance.

* Coordinate between the containers: Swarm will orchestrate between the APIs and choose one of them to listen to the incoming request.

* Scale up the Inference service in order to get faster predictions, especially if there is traffic on the service.

## Run the docker container

Docker swarm can scale up the API into multiple replicas and can be used on one or multiple hosts. In both cases, a docker swarm setup is required for all hosts.

#### Docker swarm setup

1- Enable the docker swarm GPU resource:

```sh
sudo nano /etc/nvidia-container-runtime/config.toml
```

Remove the `#` from the line `swarm-resource = "DOCKER_RESOURCE_GPU"` to enable it, then save and exit.
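
If you prefer not to edit the file by hand, a one-liner along these lines should do the same thing. This is only a convenience sketch; it assumes the line is present in config.toml and commented out with a leading `#`:

```sh
# Uncomment the swarm-resource line in place (keeps a backup as config.toml.bak)
sudo sed -i.bak 's/^#[[:space:]]*swarm-resource/swarm-resource/' /etc/nvidia-container-runtime/config.toml
```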

2- The `deploy` option is supported in compose file version 3.0+, while the `runtime` option in a compose file is only supported in compose file version 2.3. Since we won't be able to add `runtime` to our stack file, we will set the default runtime in the docker daemon json file instead:

```sh
sudo nano /etc/docker/daemon.json
```

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```

3- Finally restart docker:

```sh
sudo systemctl daemon-reload
sudo systemctl restart docker
```
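
As a quick sanity check (not a required step), you can ask the docker daemon which runtime it now uses by default:

```sh
# Should print "nvidia" once the daemon has restarted with the new default runtime
docker info --format '{{.DefaultRuntime}}'
```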

4- Initialize Swarm:

```sh
docker swarm init
```
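
If the manager host has more than one network interface, `docker swarm init` may ask you to choose an address explicitly. In that case pass it yourself; the IP below is only a placeholder for your manager's address on the shared network:

```sh
# Advertise the manager on a specific IP (placeholder address)
docker swarm init --advertise-addr 192.168.1.10
```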

5- On the manager host, open the gpu-inference.yaml file and specify the number of replicas needed. In case you are using multiple hosts (see the "With multiple hosts" section), the number of replicas will be divided across all hosts.

```yaml
version: "3"

services:
  api:
    environment:
      - "NVIDIA_VISIBLE_DEVICES=0"
    ports:
      - "4343:4343"
    image: tensorflow_inference_api_gpu
    volumes:
      - "/mnt/models:/models"
    deploy:
      replicas: 1
      update_config:
        parallelism: 2
        delay: 10s
      restart_policy:
        condition: on-failure
```

**Notes about gpu-inference.yaml:**

* The volumes field on the left of ":" should be an absolute path; it can be changed by the user and represents the models directory on your Operating System (an edited example follows these notes).
* The part of the volumes field to the right of ":", i.e. ":/models", should never be changed.
* NVIDIA_VISIBLE_DEVICES defines on which GPU you want the API to run.

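For illustration, here is a hypothetical variant of the stack file applying those notes: the host path `/home/user/models`, the choice of GPU 1, and the 4 replicas are placeholders, not values required by the API:

```yaml
# Hypothetical example: models stored in /home/user/models, API pinned to GPU 1, 4 replicas
version: "3"

services:
  api:
    environment:
      - "NVIDIA_VISIBLE_DEVICES=1"
    ports:
      - "4343:4343"
    image: tensorflow_inference_api_gpu
    volumes:
      - "/home/user/models:/models"
    deploy:
      replicas: 4
```
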
#### With one host

Deploy the API:

```sh
docker stack deploy -c gpu-inference.yaml tensorflow-gpu
```

![onehost](./docs/onehost.png)
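
Once the stack is deployed, you can check that the service and its replicas came up (exact output layout may differ between docker versions):

```sh
# List the services of the stack and their replica counts
docker stack services tensorflow-gpu
```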

#### With multiple hosts

1- **Make sure hosts are reachable on the same network**.

2- Choose a host to be the manager and run the following command on it to generate a token so the other hosts can join:

```sh
docker swarm join-token worker
```

A command will appear in your terminal; copy and paste it on the other hosts, as shown in the image below.
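
The generated command has the general shape shown here; the token and address are placeholders for whatever your manager prints:

```sh
# Run this on each worker host (placeholder token and manager address)
docker swarm join --token SWMTKN-1-xxxxxxxxxxxxxxxxxxxxxxxxx 192.168.1.10:2377
```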

3- Deploy your application using:

```sh
docker stack deploy -c gpu-inference.yaml tensorflow-gpu
```

![multhost](./docs/multhost.png)

#### Useful Commands

1- To scale the service up to 4 replicas, for example, use this command:

```sh
docker service scale tensorflow-gpu_api=4
```

2- To check the available workers:

```sh
docker node ls
```

3- To check on which node the container is running:

```sh
docker service ps tensorflow-gpu_api
```

4- To check the number of replicas:

```sh
docker service ls
```

## Benchmarking

Here are two graphs showing the prediction time for different numbers of simultaneous requests.

![GPU 20 req](./docs/TGPU20req.png)

![GPU 40 req](./docs/TGPU40req.png)

Both graphs show the same pattern regardless of how many requests arrive at the same time: increasing the number of workers (hosts) speeds up the inference. For example, in the last column we were able to process 40 requests in:

- 1.46 seconds with 4 replicas on 1 machine.
- 0.82 seconds with 4 replicas on each of the 2 machines.

Moreover, in case one of the machines goes down, the others are always ready to receive requests.

Finally, since we are predicting on GPU, scaling to more replicas means faster predictions.

 (0)