OCS/GCS v9.0.5 #56
ernst-bablick announced in Announcements
Prebuilt packages for OCS and fully QA'ed GCS product packages will be available within the next 12 hours: https://www.hpc-gridware.com/download-main/
Major Enhancements
v9.0.5
qtelemetry (Developer Preview in GCS)
This release introduces qtelemetry, a new metrics exporter for Gridware Cluster Scheduler (GCS). It allows administrators to easily collect and expose cluster metrics for monitoring and observability purposes.
Features:
[...] sge_qmaster, spooling filesystem information)
Quick Start:
By default, qtelemetry exports metrics on port 9464 from the /metrics endpoint.
Additional metrics sources can be enabled using command-line flags.
(Available in Gridware Cluster Scheduler only)
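As a rough illustration of consuming the endpoint described in the Quick Start, the following Python sketch fetches and parses the exported metrics. Only the port 9464 and the /metrics path come from these notes. The host name, the assumption that the output uses the Prometheus text format (one "name{labels} value" sample per line, no timestamps), and the function names are illustrative and not part of qtelemetry.

```python
from urllib.request import urlopen

# Port and path are taken from the release notes; "localhost" is an assumption.
DEFAULT_URL = "http://localhost:9464/metrics"


def parse_metrics(text):
    """Parse Prometheus-style text into {sample name with labels: value}.

    Assumes one "name{labels} value" sample per line without timestamps;
    lines starting with "#" (HELP/TYPE comments) are ignored.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split at the last space so label values containing spaces survive.
        name, _, value = line.rpartition(" ")
        if not name:
            continue  # no separator found, not a sample line
        try:
            samples[name.strip()] = float(value)
        except ValueError:
            continue  # skip lines that do not fit the assumed shape
    return samples


def fetch_metrics(url=DEFAULT_URL, timeout=5.0):
    """Fetch the endpoint and return the parsed samples."""
    with urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Calling fetch_metrics() against a running exporter would return a dictionary keyed by sample name. The actual metric names depend on the qtelemetry version and the enabled metrics sources.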
Out-of-the-Box Support for Various MPI Distributions
The $SGE_ROOT/mpi directory contains templates of the PE configuration for various MPI distributions; this release provides integrations for Open MPI (CS-342), MPICH (CS-1143), MVAPICH (CS-1144), and Intel MPI (CS-1145).
A template can be added by calling qconf -Ap <path to template>, which adds the PE configuration for running jobs with the given MPI in tight integration.
In addition, build scripts for mpich, mvapich, and openmpi give an example of how the MPI distribution can be built and installed. The build scripts are located in $SGE_ROOT/mpi/<mpi name>/build.sh.
$SGE_ROOT/mpi/examples contains an MPI example written in C. It can be run as a tightly integrated parallel job with any of the MPI distributions mentioned above and supports checkpointing and restart. It comes with documentation, a build script, a job script, and a template of a checkpointing environment.
(Available in Open Cluster Scheduler and Gridware Cluster Scheduler)
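For sites that want to register several PE templates in one go, a small wrapper around qconf -Ap might look like the following Python sketch. Only the qconf -Ap <path to template> call comes from these notes. The helper names, the dry-run default, and any template file names passed in are hypothetical, since the notes do not name the template files.

```python
import os
import subprocess


def qconf_add_pe_argv(template_path):
    """Return the argument vector for `qconf -Ap <path to template>`."""
    return ["qconf", "-Ap", os.fspath(template_path)]


def add_pe_templates(template_paths, dry_run=True):
    """Build one qconf call per template and run them unless dry_run is set.

    With dry_run=True (the default in this sketch) nothing is executed and
    the commands are only returned for inspection.
    """
    commands = [qconf_add_pe_argv(path) for path in template_paths]
    if not dry_run:
        for argv in commands:
            # check=True raises CalledProcessError if qconf reports a failure.
            subprocess.run(argv, check=True)
    return commands
```

Running it with dry_run=True first makes it easy to review the commands before changing the cluster configuration.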
Easier Creation of Configuration Templates
Configuration objects can now contain the additional special variables $sge_root and $sge_cell for paths to scripts, e.g. for
prolog and epilog in the global config and queue configuration
starter_method, suspend_method, resume_method, and terminate_method in the queue configuration
start_proc_args and stop_proc_args in the parallel environment configuration
ckpt_command, migr_command, restart_command, and clean_command in the checkpointing environment
This allows configuration templates to be used in different environments without the need to modify the paths before applying the configuration.
A list of all special variables is given in the sge_conf.5 man page in the prolog section.
(Available in Open Cluster Scheduler and Gridware Cluster Scheduler)
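To illustrate the effect of the new variables on a template, the following Python sketch expands $sge_root and $sge_cell in a configuration value. It only demonstrates the idea and is not how the scheduler performs the substitution. The function name and the example paths are hypothetical.

```python
from string import Template


def expand_special_vars(value, sge_root, sge_cell):
    """Replace $sge_root and $sge_cell in a configuration value.

    safe_substitute leaves any other $variable untouched, so variables
    that are resolved later (for example $job_id) stay in the string.
    """
    return Template(value).safe_substitute(sge_root=sge_root, sge_cell=sge_cell)


# The same template value resolves differently per installation.
prolog = "$sge_root/$sge_cell/common/prolog.sh $job_id"
site_a = expand_special_vars(prolog, "/opt/gcs", "default")
site_b = expand_special_vars(prolog, "/cluster/ocs", "prod")
```

Here site_a and site_b differ only in the installation-specific prefix, which is why a single template can be applied unchanged in both environments.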
Full List of Fixes
Release notes - Cluster Scheduler
v9.0.5
Improvement
CS-342 provide an openmpi integration
CS-343 provide an example and test program using MPI
CS-791 sge_root should be available as special variable in the configuration of prolog, epilog, queue, pe, ckpt
CS-914 Make ARCH script more robust
CS-1090 qstat -r shall report resource requests by scope
CS-1094 Update sge_pe.md to better explain PE_HOSTFILE
CS-1114 Add GPU monitoring examples to qtelemetry Grafana dashboard
CS-1115 Build qtelemetry in containers for lx-amd64 and lx-arm64
CS-1126 in the environment of tasks of tightly integrated parallel jobs set the pe_task_id
CS-1128 Add enroot to worker GPU VM image for GCP
CS-1143 provide a MPICH integration
CS-1144 provide a MVAPICH integration
CS-1145 provide an Intel MPI integration
CS-1146 cleanup and document the ssh wrapper MPI template and scripts
CS-1152 add a checktree_mpi to testsuite with configuration and tests making use of the various MPI integrations
CS-1158 Add qtelemetry Grafana dashboard to public Grafana Cloud Dashboards
New Feature
CS-1091 Clearly document the slots syntax in man5 sge_queue_conf.md
Sub-task
CS-697 Jenkins: enable issue_3013
CS-698 Jenkins: enable issue_3179
Task
CS-662 verify delayed job reporting of sge_execd after reconnecting to sge_qmaster
CS-1117 Add qtelemetry as developer preview to GCS distribution
CS-1118 Create a packer file which builds a GPU enabled VM with and without GCS for fast deployment on GCP
CS-1125 Provide basic examples of how enroot can be used with the GPU integration
CS-1134 message cutoff after 8 characters
CS-1136 add checktree_qtelemetry to all build environments + Jenkins setup
Bug
CS-430 booking of resources into advance reservations needs to distinguish between host and queue resources
CS-722 env_list in qstat should show NONE if not set
CS-1028 qtelemetry should support NVIDIA loadsensor values for hosts
CS-1085 BDB build error on lx-riscv64 after OS update.
CS-1096 USE_QSUB_GID functionality fails on FreeBSD 14
CS-1111 minimum and maximum thread counts in the bootstrap.5 man page are incorrect
CS-1131 wallclock time reported for tasks of a tightly integrated parallel job is incorrect
CS-1139 job deletion via JAPI/DRMAA fails if job ID exceeds INT_MAX
CS-1140 termination of event client via JAPI fails if event client ID exceeds INT_MAX
CS-1141 MacOS build broken due to unavailability of getgrouplist()
CS-1163 when a queue is signalled then additional invalid entries are created in the berkeleydb spooling database
This discussion was created from the release OCS/GCS v9.0.5.