
Errors in deepops/slurm-exporter #1307


Closed
fa-ina-tic opened this issue May 27, 2024 · 5 comments

Comments

@fa-ina-tic

OS on Master Nodes: Ubuntu 22.04 LTS

Hello,

I've encountered some issues with monitoring our Slurm cluster. Below are the details:

Problem 1: Service Failure
The docker.slurm-exporter.service keeps failing with the following log output:

May 12 02:35:56 login01 systemd[1]: Starting Prometheus Slurm Exporter...
May 12 02:35:56 login01 docker[2222118]: Error response from daemon: No such container: docker.slurm-exporter.service
May 12 02:35:56 login01 docker[2222131]: Error: No such container: docker.slurm-exporter.service
May 12 02:35:57 login01 docker[2222143]: latest: Pulling from deepops/prometheus-slurm-exporter
May 12 02:35:57 login01 docker[2222143]: Digest: sha256:9a0c657a465fe20209093aec38d44e8bc55d5de282a62827e623698ee9944048
May 12 02:35:57 login01 docker[2222143]: Status: Image is up to date for deepops/prometheus-slurm-exporter:latest
May 12 02:35:57 login01 docker[2222143]: docker.io/deepops/prometheus-slurm-exporter:latest
May 12 02:35:42 login01 systemd[1]: Started Prometheus Slurm Exporter.
May 12 02:35:43 login01 docker[2222028]: time="2024-05-12T02:35:43Z" level=info msg="Starting Server: :8080" source="main.go:43"
May 12 02:35:56 login01 docker[2222028]: 2024/05/12 02:35:56 exit status 1
May 12 02:35:56 login01 systemd[1]: docker.slurm-exporter.service: Main process exited, code=exited, status=1/FAILURE
May 12 02:35:56 login01 systemd[1]: docker.slurm-exporter.service: Failed with result 'exit-code'.
May 12 02:35:56 login01 systemd[1]: docker.slurm-exporter.service: Scheduled restart job, restart counter is at 16205.
May 12 02:35:56 login01 systemd[1]: Stopped Prometheus Slurm Exporter.

To investigate problem 1, I ran the Docker image manually and tried the sinfo, squeue, and sdiag commands (which the service relies on). They produced the following errors:

root@mlops-slurm-master-01:/lib64# sdiag
sdiag: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by sdiag)
sdiag: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /usr/local/lib/slurm/libslurmfull.so)
sdiag: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /usr/local/lib/slurm/libslurmfull.so)
sdiag: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /usr/local/lib/slurm/libslurmfull.so)
It appears that the required GLIBC versions are not available in the container, causing the commands to fail.

I believe this GLIBC mismatch is the main reason docker.slurm-exporter.service keeps failing.
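One quick way to confirm the mismatch is to compare the glibc the container ships with the versions the binaries demand. A hedged sketch (the path to sdiag is an example, not the actual container layout):

```shell
# Print the glibc version the container image provides; the Slurm
# binaries need a libc at least as new as the GLIBC_2.34 they require.
ldd --version | head -n1

# To list every GLIBC symbol version a given binary requires, something
# like this works (path is an example, adjust to your install):
# objdump -T /usr/local/bin/sdiag | grep -o 'GLIBC_[0-9.]*' | sort -u
```

If the binaries were built on a newer host (Ubuntu 22.04 ships glibc 2.35) but run in a container with an older libc (Ubuntu 18.04 ships 2.27), they fail exactly like the output above.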

Could you provide guidance on resolving these issues? Any insights or suggestions for troubleshooting would be greatly appreciated.

Thank you!

@Yuming-Lee

I've encountered the same issue. I think the prometheus-slurm-exporter image is built on an old base OS (Ubuntu 18.04); see the Dockerfile: https://github.com/dholt/prometheus-slurm-exporter/blob/master/Dockerfile.
I tried to build a new one using Ubuntu 22.04, but it wasn't easy for me.

@opabjumbs

We are experiencing the same issue. Because of this, the Slurm dashboard in Grafana no longer works: Prometheus does not receive any slurm_* metrics. Other metrics work fine, as does the GPU Nodes dashboard. Has anybody found a solution? As @Yuming-Lee suggests, could someone rebuild the Docker image from a newer OS?

@fa-ina-tic
Author

The repository that @Yuming-Lee mentioned is based on this one. It simply compiles the service in a Docker environment and runs it. If using Docker isn't necessary for your setup, you can directly add the service to the node instead.

What I've done is: (1) compiled this repository (which is Golang-based), and (2) created a service daemon and added it to the system.

It could be somewhat burdensome if you can't use Ansible or other HPC management tools (since you'd need to add the service to each node manually), but it should be sufficient for a proof of concept.
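For anyone following the same route, a minimal systemd unit for step (2) might look roughly like this. This is a sketch only: the binary path, listen address, and restart policy are assumptions, not necessarily what I actually used:

```ini
# /etc/systemd/system/slurm-exporter.service (example)
[Unit]
Description=Prometheus Slurm Exporter (native build)
After=network.target

[Service]
# Path to the binary compiled from the Go sources; adjust as needed.
ExecStart=/usr/local/bin/prometheus-slurm-exporter -listen-address=:8080
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Then enable it with `systemctl daemon-reload && systemctl enable --now slurm-exporter`.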

@opabjumbs

Thank you @fa-ina-tic, I am indeed setting up an alternative Prometheus slurm exporter https://github.com/rivosinc/prometheus-slurm-exporter/ to get data to Grafana.
I have it working when installed manually on Ubuntu 22.04; it's not in Ansible yet.

Steps are as follows, all on slurm-master:

  • install Go https://go.dev/doc/install
  • setup GOROOT environment variable
  • setup PATH
  • run "go install github.com/rivosinc/[email protected]", changing 1.6.5 to the version you wish to use
  • run "prometheus-slurm-exporter -slurm.cli-fallback" to start exporting the data (better yet, wrap it in a service)
  • create a new Prometheus endpoint file identical to /etc/prometheus/endpoints/slurm-exporter.yml, just with port 9092
  • restart the docker.prometheus.service to reload the configuration
  • install the template into Grafana https://grafana.com/grafana/dashboards/19835-slurm-dashboardv2/ (template id 19835)
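For the endpoint step, I can't reproduce the contents of the existing slurm-exporter.yml here, but assuming it uses Prometheus file-based service discovery (file_sd), the new file would look something like this (the filename, hostname, and job label are placeholders):

```yaml
# /etc/prometheus/endpoints/slurm-exporter-rivos.yml (assumed file_sd format)
- targets:
    - 'slurm-master:9092'
  labels:
    job: slurm-exporter-rivos
```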

@jungyh0218
Contributor

I checked rivosinc's exporter. It is actively maintained, but it doesn't provide the same metrics as vpenso's original slurm exporter. Instead, I built a container image of vpenso's exporter using an Ubuntu 22.04 base image.
#1328
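For reference, a rebuild on a newer base could look roughly like the sketch below. This is untested; the Go toolchain package and build command are assumptions, and #1328 has the actual change:

```dockerfile
# Untested sketch: build vpenso/prometheus-slurm-exporter on Ubuntu 22.04
FROM ubuntu:22.04 AS build
RUN apt-get update && apt-get install -y git golang-go ca-certificates
RUN git clone https://github.com/vpenso/prometheus-slurm-exporter /src
WORKDIR /src
RUN go build -o /prometheus-slurm-exporter .

FROM ubuntu:22.04
COPY --from=build /prometheus-slurm-exporter /usr/local/bin/
# The Slurm client tools (sinfo, squeue, sdiag) must still be reachable
# inside the container, e.g. bind-mounted from the host, and their glibc
# requirements must be satisfied by this base image.
ENTRYPOINT ["/usr/local/bin/prometheus-slurm-exporter"]
```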
