Errors in deepops/slurm-exporter #1307
Comments
I've encountered the same issue. I think the "prometheus-slurm-exporter" image is built on an old OS (Ubuntu 18.04). See the Dockerfile: https://github.com/dholt/prometheus-slurm-exporter/blob/master/Dockerfile
We are experiencing the same issue. Because of it, the SLURM dashboard in Grafana no longer works: Prometheus does not receive any slurm_* metrics. Other metrics work fine, as does the GPU Nodes dashboard. Has anybody found a solution? As @Yuming-Lee suggests, could someone rebuild the Docker image from a newer OS?
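A quick way to confirm the symptom on the exporter side, before looking at Prometheus or Grafana, is to read the exporter's metrics endpoint directly and list any slurm_* metric names it returns. The following is only a rough sketch: the URL http://localhost:8080/metrics is an assumption about where the exporter listens, so adjust it to your setup.

```python
import urllib.request


def slurm_metric_names(exposition_text):
    """Return the sorted, de-duplicated metric names that start with 'slurm_'.

    Expects text in the Prometheus text exposition format.
    """
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like "<name>{labels} <value>" or "<name> <value>".
        head = line.split("{", 1)[0].split()
        if head and head[0].startswith("slurm_"):
            names.add(head[0])
    return sorted(names)


def fetch_metrics(url="http://localhost:8080/metrics", timeout=5):
    """Fetch the raw metrics text. The default URL is an assumption."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    found = slurm_metric_names(fetch_metrics())
    if found:
        print("exporter is serving %d slurm_* metrics" % len(found))
    else:
        print("no slurm_* metrics served; the exporter itself is likely failing")
```

If the endpoint answers but the list is empty, the problem is probably inside the exporter (for example, the Slurm commands it calls are failing) rather than in the Prometheus scrape configuration.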
The repository that @Yuming-Lee mentioned is based on this one. It simply compiles the service in a Docker environment and runs it. If using Docker isn't necessary for your setup, you can directly add the service to the node instead. What I've done is: (1) compiled this repository (which is Golang-based), and (2) created a service daemon and added it to the system. It could be somewhat burdensome if you can't use Ansible or other HPC management tools (since you'd need to add the service to each node manually), but it should be sufficient for a proof of concept.
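For step (2), a minimal sketch of what the service definition might look like is shown here as a small script that renders a systemd unit. The binary path /usr/local/bin/prometheus-slurm-exporter and the unit description are hypothetical, not taken from this thread; a rendering script like this is also easy to call from whatever tooling copies the file to each node.

```python
def render_unit(binary_path):
    """Render a minimal systemd unit that runs the exporter binary directly."""
    lines = [
        "[Unit]",
        "Description=Prometheus Slurm exporter",
        "After=network-online.target",
        "",
        "[Service]",
        # binary_path is wherever the compiled exporter was installed.
        "ExecStart=" + binary_path,
        "Restart=on-failure",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical install location; replace with your own path.
    print(render_unit("/usr/local/bin/prometheus-slurm-exporter"), end="")
```

The output would typically be saved under /etc/systemd/system/ with a .service suffix, followed by systemctl daemon-reload and systemctl enable --now for that unit.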
Thank you @fa-ina-tic. I am indeed setting up an alternative Prometheus Slurm exporter, https://github.com/rivosinc/prometheus-slurm-exporter/, to get data to Grafana. Steps are as follows, all on slurm-master:
I checked rivosinc's exporter. It is actively maintained, but it does not provide the metrics exposed by vpenso's original Slurm exporter. Instead, I built a container image of vpenso's exporter using Ubuntu 22.04 as the base image.
OS on Master Nodes: Ubuntu 22.04 LTS
Hello,
I've encountered some issues with monitoring our Slurm cluster. Below are the details:
Problem 1: Service Failure
The docker.slurm-exporter.service keeps failing with the following log output:
To solve problem 1, I ran the Docker image manually and checked the sinfo, squeue, and sdiag commands (as the service attempts to do). The following error messages were encountered:
I think this is the main reason why docker.slurm-exporter.service failed.
Could you provide guidance on resolving these issues? Any insights or suggestions for troubleshooting would be greatly appreciated.
Thank you!
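The manual check described above (running sinfo, squeue, and sdiag the way the exporter does) can be repeated with a small script, which makes it easier to compare the host against the container. This is only a sketch: it assumes Python is available wherever it is run, and it reports whatever each command writes to stderr without interpreting it.

```python
import subprocess
import sys


def probe(cmd, timeout=10):
    """Run cmd and return (ok, detail); detail is empty on success."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except FileNotFoundError:
        return False, "command not found: " + cmd[0]
    except subprocess.TimeoutExpired:
        return False, "timed out after %ss" % timeout
    if result.returncode != 0:
        # Prefer the command's own error message; fall back to the exit code.
        return False, result.stderr.strip() or "exit code %d" % result.returncode
    return True, ""


if __name__ == "__main__":
    failed = False
    for name in ("sinfo", "squeue", "sdiag"):
        ok, detail = probe([name])
        print("%-7s %s %s" % (name, "OK" if ok else "FAIL", detail))
        failed = failed or not ok
    sys.exit(1 if failed else 0)
```

If the commands succeed on the host but fail inside the container, that points at the image (for example its OS or Slurm client version) rather than at the cluster itself.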