
Errors in deepops/slurm-exporter #1307


Closed
fa-ina-tic opened this issue May 27, 2024 · 5 comments

Comments

@fa-ina-tic

OS on Master Nodes: Ubuntu 22.04 LTS

Hello,

I've encountered some issues with monitoring our Slurm cluster. Below are the details:

Problem 1: Service Failure
The docker.slurm-exporter.service keeps failing with the following log output:

May 12 02:35:56 login01 systemd[1]: Starting Prometheus Slurm Exporter...
May 12 02:35:56 login01 docker[2222118]: Error response from daemon: No such container: docker.slurm-exporter.service
May 12 02:35:56 login01 docker[2222131]: Error: No such container: docker.slurm-exporter.service
May 12 02:35:57 login01 docker[2222143]: latest: Pulling from deepops/prometheus-slurm-exporter
May 12 02:35:57 login01 docker[2222143]: Digest: sha256:9a0c657a465fe20209093aec38d44e8bc55d5de282a62827e623698ee9944048
May 12 02:35:57 login01 docker[2222143]: Status: Image is up to date for deepops/prometheus-slurm-exporter:latest
May 12 02:35:57 login01 docker[2222143]: docker.io/deepops/prometheus-slurm-exporter:latest
May 12 02:35:42 login01 systemd[1]: Started Prometheus Slurm Exporter.
May 12 02:35:43 login01 docker[2222028]: time="2024-05-12T02:35:43Z" level=info msg="Starting Server: :8080" source="main.go:43"
May 12 02:35:56 login01 docker[2222028]: 2024/05/12 02:35:56 exit status 1
May 12 02:35:56 login01 systemd[1]: docker.slurm-exporter.service: Main process exited, code=exited, status=1/FAILURE
May 12 02:35:56 login01 systemd[1]: docker.slurm-exporter.service: Failed with result 'exit-code'.
May 12 02:35:56 login01 systemd[1]: docker.slurm-exporter.service: Scheduled restart job, restart counter is at 16205.
May 12 02:35:56 login01 systemd[1]: Stopped Prometheus Slurm Exporter.

To investigate problem 1, I ran the Docker image manually and tried the sinfo, squeue, and sdiag commands (which the service relies on). They produced the following errors:

root@mlops-slurm-master-01:/lib64# sdiag
sdiag: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by sdiag)
sdiag: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /usr/local/lib/slurm/libslurmfull.so)
sdiag: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /usr/local/lib/slurm/libslurmfull.so)
sdiag: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /usr/local/lib/slurm/libslurmfull.so)
It appears that the required GLIBC versions are not available in the container, causing the commands to fail.

I believe this GLIBC mismatch is the main reason docker.slurm-exporter.service keeps failing.
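One quick way to confirm the mismatch is to compare the glibc the container ships with the versions the binaries demand. A hedged sketch (the path to sdiag is an example, not the actual container layout):

```shell
# Print the glibc version the container image provides; the Slurm
# binaries need a libc at least as new as the GLIBC_2.34 they require.
ldd --version | head -n1

# To list every GLIBC symbol version a given binary requires, something
# like this works (path is an example, adjust to your install):
# objdump -T /usr/local/bin/sdiag | grep -o 'GLIBC_[0-9.]*' | sort -u
```

If the binaries were built on a newer host (Ubuntu 22.04 ships glibc 2.35) but run in a container with an older libc (Ubuntu 18.04 ships 2.27), they fail exactly like the output above.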

Could you provide guidance on resolving these issues? Any insights or suggestions for troubleshooting would be greatly appreciated.

Thank you!

@Yuming-Lee

I've encountered the same issue. I think the prometheus-slurm-exporter image is built on an old base OS (Ubuntu 18.04); see the Dockerfile: https://github.com/dholt/prometheus-slurm-exporter/blob/master/Dockerfile.
I tried to build a new one using Ubuntu 22.04, but it wasn't easy for me.

@opabjumbs

We are experiencing the same issue. Because of this, the Slurm dashboard in Grafana no longer works: Prometheus does not receive any slurm_* metrics. Other metrics work fine, as does the GPU Nodes dashboard. Has anybody found a solution? As @Yuming-Lee suggests, could someone rebuild the Docker image from a newer OS?

@fa-ina-tic
Author

The repository that @Yuming-Lee mentioned is based on this one. It simply compiles the service in a Docker environment and runs it. If using Docker isn't necessary for your setup, you can directly add the service to the node instead.

What I've done is: (1) compiled this repository (which is Golang-based), and (2) created a service daemon and added it to the system.

It could be somewhat burdensome if you can't use Ansible or other HPC management tools (since you'd need to add the service to each node manually), but it should be sufficient for a proof of concept.
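For anyone following the same route, a minimal systemd unit for step (2) might look roughly like this. This is a sketch only: the binary path, listen address, and restart policy are assumptions, not necessarily what I actually used:

```ini
# /etc/systemd/system/slurm-exporter.service (example)
[Unit]
Description=Prometheus Slurm Exporter (native build)
After=network.target

[Service]
# Path to the binary compiled from the Go sources; adjust as needed.
ExecStart=/usr/local/bin/prometheus-slurm-exporter -listen-address=:8080
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Then enable it with `systemctl daemon-reload && systemctl enable --now slurm-exporter`.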

@opabjumbs

Thank you @fa-ina-tic, I am indeed setting up an alternative Prometheus slurm exporter https://github.com/rivosinc/prometheus-slurm-exporter/ to get data to Grafana.
I have it working when installed manually on Ubuntu 22.04; it's not in Ansible yet.

Steps are as follows, all on slurm-master:

  • install Go https://go.dev/doc/install
  • setup GOROOT environment variable
  • setup PATH
  • run "go install github.com/rivosinc/[email protected]", changing 1.6.5 to the version you wish to use
  • run "prometheus-slurm-exporter -slurm.cli-fallback" to start exporting the data (better yet, wrap it in a service)
  • create a new Prometheus endpoint file identical to /etc/prometheus/endpoints/slurm-exporter.yml, just with port 9092
  • restart the docker.prometheus.service to reload the configuration
  • install the template into Grafana https://grafana.com/grafana/dashboards/19835-slurm-dashboardv2/ (template id 19835)
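For the endpoint step, I can't reproduce the contents of the existing slurm-exporter.yml here, but assuming it uses Prometheus file-based service discovery (file_sd), the new file would look something like this (the filename, hostname, and job label are placeholders):

```yaml
# /etc/prometheus/endpoints/slurm-exporter-rivos.yml (assumed file_sd format)
- targets:
    - 'slurm-master:9092'
  labels:
    job: slurm-exporter-rivos
```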

@jungyh0218
Contributor

I checked rivosinc's exporter. It is actively maintained, but it doesn't provide the same metrics as vpenso's original slurm exporter. Instead, I built a container image of vpenso's exporter using an Ubuntu 22.04 base image.
#1328
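For reference, a rebuild on a newer base could look roughly like the sketch below. This is untested; the Go toolchain package and build command are assumptions, and #1328 has the actual change:

```dockerfile
# Untested sketch: build vpenso/prometheus-slurm-exporter on Ubuntu 22.04
FROM ubuntu:22.04 AS build
RUN apt-get update && apt-get install -y git golang-go ca-certificates
RUN git clone https://github.com/vpenso/prometheus-slurm-exporter /src
WORKDIR /src
RUN go build -o /prometheus-slurm-exporter .

FROM ubuntu:22.04
COPY --from=build /prometheus-slurm-exporter /usr/local/bin/
# The Slurm client tools (sinfo, squeue, sdiag) must still be reachable
# inside the container, e.g. bind-mounted from the host, and their glibc
# requirements must be satisfied by this base image.
ENTRYPOINT ["/usr/local/bin/prometheus-slurm-exporter"]
```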
