
Commit 2a533f8

Add ParallelCluster 3.10.0, 3.10.1 support (#244)
Add support for ParallelCluster 3.10.0.
Add alinux2023 support.
Add support for external slurmdbd instance.
Update documentation.

Change the UID of the slurm user to 401 to match what ParallelCluster uses.
Otherwise munge flags security errors because the UID of the submitter doesn't match the head node.

Change the UpdateHeadNode lambda to only do the update via ssm if the cluster isn't already being updated.

Resolves #242

Change the installer so that it checks to make sure that the cluster stack isn't already being changed or in a bad state.

Resolves #221

Add support for ParallelCluster 3.10.1.

Resolves #243
1 parent 8ee5253 commit 2a533f8

File tree

9 files changed

+459
-86
lines changed

docs/config.md

+65-8
@@ -27,11 +27,16 @@ This project creates a ParallelCluster configuration file that is documented in
2727
<a href="#database">Database</a>:
2828
<a href="#databasestackname">DatabaseStackName</a>: str
2929
<a href="#fqdn">FQDN</a>: str
30-
<a href="#port">Port</a>: str
30+
<a href="#database-port">Port</a>: str
3131
<a href="#adminusername">AdminUserName</a>: str
3232
<a href="#adminpasswordsecretarn">AdminPasswordSecretArn</a>: str
33-
<a href="#clientsecuritygroup">ClientSecurityGroup</a>:
33+
<a href="#database-clientsecuritygroup">ClientSecurityGroup</a>:
3434
SecurityGroupName: SecurityGroupId
35+
<a href="#slurmdbd">Slurmdbd</a>:
36+
<a href="#slurmdbdstackname">SlurmdbdStackName</a>: str
37+
<a href="#slurmdbd-host">Host</a>: str
38+
<a href="#slurmdbd-port">Port</a>: str
39+
<a href="#slurmdbd-clientsecuritygroup">ClientSecurityGroup</a>: str
3540
<a href="https://docs.aws.amazon.com/parallelcluster/latest/ug/HeadNode-v3.html#HeadNode-v3-Dcv">Dcv:</a>
3641
<a href="https://docs.aws.amazon.com/parallelcluster/latest/ug/HeadNode-v3.html#yaml-HeadNode-Dcv-Enabled">Enabled</a>: bool
3742
<a href="https://docs.aws.amazon.com/parallelcluster/latest/ug/HeadNode-v3.html#yaml-HeadNode-Dcv-Port">Port</a>: int
@@ -304,13 +309,18 @@ See [https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html#p
304309

305310
Optional
306311

312+
**Note**: Starting with ParallelCluster 3.10.0, you should use slurm/ParallelClusterConfig/[Slurmdbd](#slurmdbd) instead of slurm/ParallelClusterConfig/Database.
313+
You cannot have both parameters.
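
The exclusivity rule in the note can be checked before deploying. The following is a minimal sketch, assuming the configuration file has already been loaded into a nested Python dict. The function name `check_accounting_config` is hypothetical and is not part of this project.

```python
from typing import Optional


def check_accounting_config(config: dict) -> Optional[str]:
    """Return the accounting section in use ("Database" or "Slurmdbd"), or None.

    Hypothetical helper: raises ValueError when both sections are present,
    because slurm/ParallelClusterConfig/Database and
    slurm/ParallelClusterConfig/Slurmdbd are mutually exclusive.
    """
    pc_config = config.get("slurm", {}).get("ParallelClusterConfig", {})
    has_database = "Database" in pc_config
    has_slurmdbd = "Slurmdbd" in pc_config
    if has_database and has_slurmdbd:
        raise ValueError(
            "Specify either ParallelClusterConfig/Database or "
            "ParallelClusterConfig/Slurmdbd, not both."
        )
    if has_slurmdbd:
        return "Slurmdbd"
    if has_database:
        return "Database"
    return None
```

The stack names passed to such a check would be whatever you chose when creating the stacks; no particular value is implied here.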
314+
307315
Configure the Slurm database to use with the cluster.
308316

309317
This is created independently of the cluster so that the same database can be used with multiple clusters.
310318

311-
The easiest way to do this is to use the [CloudFormation template provided by ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3) and then to just pass
312-
the name of the stack in [DatabaseStackName](#databasestackname).
313-
All of the other parameters will be pulled from the stack.
319+
See [Create ParallelCluster Slurm Database](../deployment-prerequisites#create-parallelcluster-slurm-database) on the deployment prerequisites page.
320+
321+
If you used the [CloudFormation template provided by ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3), then the easiest way to configure it is to pass
322+
the name of the stack in slurm/ParallelClusterConfig/Database/[DatabaseStackName](#databasestackname).
323+
All of the other parameters will be pulled from the outputs of the stack.
314324

315325
See the [ParallelCluster documentation](https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#Scheduling-v3-SlurmSettings-Database).
316326

@@ -330,7 +340,7 @@ The following parameters will be set using the outputs of the stack:
330340

331341
Used with the Port to set the [Uri](https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#yaml-Scheduling-SlurmSettings-Database-Uri) of the database.
332342

333-
##### Port
343+
##### Database: Port
334344

335345
type: int
336346

@@ -353,11 +363,56 @@ This password is used together with AdminUserName and Slurm accounting to authen
353363

354364
Sets the [PasswordSecretArn](https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#yaml-Scheduling-SlurmSettings-Database-PasswordSecretArn) parameter in ParallelCluster.
355365

356-
##### ClientSecurityGroup
366+
##### Database: ClientSecurityGroup
357367

358368
Security group that has permissions to connect to the database.
359369

360-
Required to be attached to the head node that is running slurmdbd so that the port connection to the database is allows.
370+
Required to be attached to the head node that is running slurmdbd so that the port connection to the database is allowed.
371+
372+
#### Slurmdbd
373+
374+
**Note**: This is not supported before ParallelCluster 3.10.0. If you specify this parameter then you cannot specify slurm/ParallelClusterConfig/[Database](#database).
375+
376+
Optional
377+
378+
Configure an external Slurmdbd instance to use with the cluster.
379+
The Slurmdbd instance provides access to the shared Slurm database.
380+
This is created independently of the cluster so that the same database can be used with multiple clusters.
381+
382+
This is created independently of the cluster so that the same slurmdbd instance can be used with multiple clusters.
383+
384+
See [Create Slurmdbd instance](../deployment-prerequisites#create-slurmdbd-instance) on the deployment prerequisites page.
385+
386+
If you used the [CloudFormation template provided by ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/external-slurmdb-accounting.html#external-slurmdb-accounting-step1), then the easiest way to configure it is to pass
387+
the name of the stack in slurm/ParallelClusterConfig/Slurmdbd/[SlurmdbdStackName](#slurmdbdstackname).
388+
All of the other parameters will be pulled from the parameters and outputs of the stack.
389+
390+
See the [ParallelCluster documentation for ExternalSlurmdbd](https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#Scheduling-v3-SlurmSettings-ExternalSlurmdbd).
391+
392+
##### SlurmdbdStackName
393+
394+
Name of the ParallelCluster CloudFormation stack that created the Slurmdbd instance.
395+
396+
The following parameters will be set using the outputs of the stack:
397+
398+
* Host
399+
* Port
400+
* ClientSecurityGroup
401+
402+
##### Slurmdbd: Host
403+
404+
IP address or DNS name of the Slurmdbd instance.
405+
406+
##### Slurmdbd: Port
407+
408+
Default: 6819
409+
410+
Port used by the slurmdbd daemon on the Slurmdbd instance.
411+
412+
##### Slurmdbd: ClientSecurityGroup
413+
414+
Security group that has access to use the Slurmdbd instance.
415+
This will be added as an extra security group to the head node.
361416

362417
### ClusterName
363418

@@ -373,6 +428,8 @@ For an existing secret can be the secret name or the ARN.
373428
If the secret doesn't exist one will be created, but won't be part of the cloudformation stack so that it won't be deleted when the stack is deleted.
374429
Required if your submitters need to use more than 1 cluster.
375430

431+
See [Create Munge Key](../deployment-prerequisites#create-munge-key) for more details.
432+
376433
### SlurmCtl
377434

378435
Configure the Slurm head node or controller.

docs/deploy-parallel-cluster.md

-18
@@ -10,24 +10,6 @@ The current latest version is 3.9.1.
1010

1111
See [Deployment Prerequisites](deployment-prerequisites.md) page.
1212

13-
### Create ParallelCluster UI (optional but recommended)
14-
15-
It is highly recommended to create a ParallelCluster UI to manage your ParallelCluster clusters.
16-
A different UI is required for each version of ParallelCluster that you are using.
17-
The versions are list in the [ParallelCluster Release Notes](https://docs.aws.amazon.com/parallelcluster/latest/ug/document_history.html).
18-
The minimum required version is 3.6.0 which adds support for RHEL 8 and increases the number of allows queues and compute resources.
19-
The suggested version is at least 3.7.0 because it adds configurable compute node weights which we use to prioritize the selection of
20-
compute nodes by their cost.
21-
22-
The instructions are in the [ParallelCluster User Guide](https://docs.aws.amazon.com/parallelcluster/latest/ug/install-pcui-v3.html).
23-
24-
### Create ParallelCluster Slurm Database
25-
26-
The Slurm Database is required for configuring Slurm accounts, users, groups, and fair share scheduling.
27-
It you need these and other features then you will need to create a ParallelCluster Slurm Database.
28-
You do not need to create a new database for each cluster; multiple clusters can share the same database.
29-
Follow the directions in this [ParallelCluster tutorial to configure slurm accounting](https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3).
30-
3113
## Create the Cluster
3214

3315
To install the cluster run the install script. You can override some parameters in the config file

docs/deployment-prerequisites.md

+100-11
@@ -99,6 +99,78 @@ The version that has been tested is in the CDK_VERSION variable in the install s
9999

100100
The install script will try to install the prerequisites if they aren't already installed.
101101

102+
## Create ParallelCluster UI (optional but recommended)
103+
104+
It is highly recommended to create a ParallelCluster UI to manage your ParallelCluster clusters.
105+
A different UI is required for each version of ParallelCluster that you are using.
106+
The versions are listed in the [ParallelCluster Release Notes](https://docs.aws.amazon.com/parallelcluster/latest/ug/document_history.html).
107+
The minimum required version is 3.6.0, which adds support for RHEL 8 and increases the number of allowed queues and compute resources.
108+
The suggested version is at least 3.7.0 because it adds configurable compute node weights which we use to prioritize the selection of
109+
compute nodes by their cost.
110+
111+
The instructions are in the [ParallelCluster User Guide](https://docs.aws.amazon.com/parallelcluster/latest/ug/install-pcui-v3.html).
112+
113+
## Create Munge Key
114+
115+
Munge is a package that Slurm uses to secure communication between servers.
116+
The munge service uses a preshared key that must be the same on all of the servers in the Slurm cluster.
117+
If you want to be able to use multiple clusters from your submission hosts, such as virtual desktops, then all of the clusters must be using the same munge key.
118+
This is done by creating a munge key and storing it in AWS Secrets Manager.
119+
The secret is then passed as a parameter to ParallelCluster so that it can use it when configuring munge on all of the cluster instances.
120+
121+
To create the munge key and store it in AWS Secrets Manager, run the following command.
122+
123+
```
124+
aws secretsmanager create-secret --name SlurmMungeKey --secret-string "$(dd if=/dev/random bs=1024 count=1 | base64 -w 0)"
125+
```
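
For readers who prefer to script this step, the key material can also be generated in Python. This is a minimal sketch, not the project's own tooling: `generate_munge_key` is a hypothetical helper, and it draws from the operating system's random source through the `secrets` module rather than reading `/dev/random` directly.

```python
import base64
import secrets


def generate_munge_key(num_bytes: int = 1024) -> str:
    """Return num_bytes of random data, base64-encoded on a single line.

    Mirrors the shape of the shell command above: 1024 random bytes,
    base64-encoded without line wrapping.
    """
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


munge_key = generate_munge_key()

# Storing the key is left as a comment because it needs AWS credentials.
# With boto3 installed, it could look like this:
#   import boto3
#   boto3.client("secretsmanager").create_secret(
#       Name="SlurmMungeKey", SecretString=munge_key
#   )
```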
126+
127+
Save the ARN of the secret for when you create the Slurmdbd instance and for when you create the configuration file.
128+
129+
See the [Slurm documentation for authentication](https://slurm.schedmd.com/authentication.html) for more information.
130+
131+
See the [ParallelCluster documentation for MungeKeySecretArn](https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#yaml-Scheduling-SlurmSettings-MungeKeySecretArn).
132+
133+
See the [MungeKeySecret configuration parameter](../config#mungekeysecret).
134+
135+
## Create ParallelCluster Slurm Database
136+
137+
The Slurm Database is required for configuring Slurm accounts, users, groups, and fair share scheduling.
138+
If you need these and other features then you will need to create a ParallelCluster Slurm Database.
139+
You do not need to create a new database for each cluster; multiple clusters can share the same database.
140+
Follow the directions in this [ParallelCluster tutorial to configure slurm accounting](https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3).
141+
142+
## Create Slurmdbd Instance
143+
144+
**Note**: Before ParallelCluster 3.10.0, the slurmdbd daemon that connects to the database was created on each cluster's head node.
145+
The recommended Slurm architecture is to have a shared slurmdbd daemon that is used by all of the clusters.
146+
Starting in version 3.10.0, ParallelCluster supports specifying an external slurmdbd instance when you create a cluster and provides a CloudFormation template to create it.
147+
148+
Follow the directions in this [ParallelCluster tutorial to configure slurmdbd](https://docs.aws.amazon.com/parallelcluster/latest/ug/external-slurmdb-accounting.html#external-slurmdb-accounting-step1).
149+
This requires that you have already created the slurm database.
150+
151+
Here are some notes on the required parameters and how to fill them out.
152+
153+
| Parameter | Description
154+
|--------------|------------
155+
| AmiId | You can get this using the ParallelCluster UI. Click on Images and sort on Operating system. Confirm that the version is at least 3.10.0. Select the AMI for alinux2023 and the arm64 architecture.
156+
| CustomCookbookUrl | Leave blank
157+
| DBMSClientSG | Get this from the DatabaseClientSecurityGroup output of the database stack.
158+
| DBMSDatabaseName | This is an arbitrary name. It must be alphanumeric. I use slurmaccounting
159+
| DBMSPasswordSecretArn | Get this from the DatabaseSecretArn output of the database stack
160+
| DBMSUri | Get this from the DatabaseHost output of the database stack. Note that if you copy and paste the link you should delete the https:// prefix and the trailing '/'.
161+
| DBMSUsername | Get this from the DatabaseAdminUser output of the database stack.
162+
| EnableSlurmdbdSystemService | Set to true. Note the warning. If the database already exists and was created with an older version of slurm then the database will be upgraded. This may break clusters using an older slurm version that are still using the database. Set to false if you don't want this to happen.
163+
| InstanceType | Choose an instance type that is compatible with the AMI. For example, m7g.large.
164+
| KeyName | Use an existing EC2 key pair.
165+
| MungeKeySecretArn | ARN of an existing munge key secret. See [Create Munge Key](#create-munge-key).
166+
| PrivateIp | Choose an available IP in the subnet.
167+
| PrivatePrefix | CIDR of the instance's subnet.
168+
| SlurmdbdPort | 6819
169+
| SubnetId | Preferably the same subnet where the clusters will be deployed.
170+
| VPCId | The VPC of the subnet.
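
The DBMSUri cleanup described in the table is easy to get wrong when pasting by hand. The following is a minimal sketch of that cleanup; `normalize_dbms_uri` is a hypothetical helper and the host name in the usage comment is a placeholder, not a real endpoint.

```python
def normalize_dbms_uri(value: str) -> str:
    """Strip a pasted 'https://' prefix and any trailing '/' from a host value."""
    value = value.strip()
    prefix = "https://"
    if value.startswith(prefix):
        value = value[len(prefix):]
    return value.rstrip("/")


# Example with a placeholder host name:
#   normalize_dbms_uri("https://db.example.com/")
```

A value that is already a bare host name passes through unchanged, so the helper is safe to apply unconditionally.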
171+
172+
The stack name will be used in the slurm/ParallelClusterConfig/Slurmdbd/[SlurmdbdStackName](../config#slurmdbdstackname) configuration parameter.
173+
102174
## Security Groups for Login Nodes
103175

104176
If you want to allow instances like remote desktops to use the cluster directly, you must define
@@ -111,25 +183,30 @@ I'll call the three security groups the following names, but they can be whateve
111183
* SlurmHeadNodeSG
112184
* SlurmComputeNodeSG
113185

186+
First create these security groups without any security group rules.
187+
The reason for this is that the security group rules reference the other security groups so the groups must all exist before any of the rules can be created.
188+
After you have created the security groups then create the rules as described below.
189+
114190
### Slurm Submitter Security Group
115191

116192
The SlurmSubmitterSG will be attached to your login nodes, such as your virtual desktops.
117193

118194
It needs at least the following inbound rules:
119195

120-
| Type | Port range | Source | Description
121-
|------|------------|--------|------------
122-
| TCP | 1024-65535 | SlurmHeadNodeSG | SlurmHeadNode ephemeral
123-
| TCP | 1024-65535 | SlurmComputeNodeSG | SlurmComputeNode ephemeral
124-
| TCP | 6000-7024 | SlurmComputeNodeSG | SlurmComputeNode X11
196+
| Type | Port range | Source | Description | Details
197+
|------|------------|--------------------|------------ |--------
198+
| TCP | 1024-65535 | SlurmHeadNodeSG | SlurmHeadNode ephemeral | Head node can use ephemeral ports to connect to the submitter
199+
| TCP | 1024-65535 | SlurmComputeNodeSG | SlurmComputeNode ephemeral | Compute node will connect to submitter using ephemeral ports to manage interactive shells
200+
| TCP | 6000-7024 | SlurmComputeNodeSG | SlurmComputeNode X11 | Compute node can send X11 traffic to submitter for GUI applications
125201

126202
It needs the following outbound rules.
127203

128-
| Type | Port range | Destination | Description
129-
|------|------------|-------------|------------
130-
| TCP | 2049 | SlurmHeadNodeSG | SlurmHeadNode NFS
131-
| TCP | 6818 | SlurmComputeNodeSG | SlurmComputeNode slurmd
132-
| TCP | 6819 | SlurmHeadNodeSG | SlurmHeadNode slurmdbd
204+
| Type | Port range | Destination | Description | Details
205+
|------|------------|--------------------|-------------|--------
206+
| TCP | 2049 | SlurmHeadNodeSG | SlurmHeadNode NFS | Mount the slurm NFS file system with binaries and config
207+
| TCP | 6818 | SlurmComputeNodeSG | SlurmComputeNode slurmd | Connect to compute node for interactive jobs
208+
| TCP | 6819 | SlurmHeadNodeSG | SlurmHeadNode slurmdbd | Connect to slurmdbd (accounting database) daemon on head node for versions before 3.10.0.
209+
| TCP | 6819 | SlurmdbdSG | Slurmdbd | Connect to external Slurmdbd instance. For versions starting in 3.10.0.
133210
| TCP | 6820-6829 | SlurmHeadNodeSG | SlurmHeadNode slurmctld
134211
| TCP | 6830 | SlurmHeadNodeSG | SlurmHeadNode slurmrestd
135212

@@ -142,7 +219,7 @@ It needs at least the following inbound rules:
142219
| Type | Port range | Source | Description
143220
|------|------------|--------|------------
144221
| TCP | 2049 | SlurmSubmitterSG | SlurmSubmitter NFS
145-
| TCP | 6819 | SlurmSubmitterSG | SlurmSubmitter slurmdbd
222+
| TCP | 6819 | SlurmSubmitterSG | SlurmSubmitter slurmdbd. If not using external Slurmdbd.
146223
| TCP | 6820-6829 | SlurmSubmitterSG | SlurmSubmitter slurmctld
147224
| TCP | 6830 | SlurmSubmitterSG | SlurmSubmitter slurmrestd
148225

@@ -152,6 +229,18 @@ It needs the following outbound rules.
152229
|------|------------|-------------|------------
153230
| TCP | 1024-65535 | SlurmSubmitterSG | SlurmSubmitter ephemeral
154231

232+
### External Slurmdbd Security Group
233+
234+
**Note**: ParallelCluster 3.10.0 added support for an external Slurmdbd instance.
235+
236+
The submitter must be able to directly access the Slurmdbd instance on port 6819 when running commands like `sacctmgr`.
237+
You must edit the inbound rules of the Slurmdbd instance's security group to allow the access.
238+
Add the following inbound rule.
239+
240+
| Type | Port range | Source | Description
241+
|------|------------|--------|------------
242+
| TCP | 6819 | SlurmSubmitterSG | SlurmSubmitter slurmdbd
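
The inbound rule in the table above can also be expressed programmatically. This is a minimal sketch that builds the permission structure accepted by the EC2 `authorize_security_group_ingress` API in boto3. The function name `slurmdbd_ingress_permission` and the security group IDs in the comments are hypothetical placeholders.

```python
def slurmdbd_ingress_permission(submitter_sg_id: str, port: int = 6819) -> dict:
    """Build one IpPermissions entry allowing TCP on `port` from the submitter SG."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "UserIdGroupPairs": [
            {"GroupId": submitter_sg_id, "Description": "SlurmSubmitter slurmdbd"}
        ],
    }


# Applying the rule needs AWS credentials, so it is shown only as a comment:
#   import boto3
#   boto3.client("ec2").authorize_security_group_ingress(
#       GroupId="<Slurmdbd instance security group id>",
#       IpPermissions=[slurmdbd_ingress_permission("<SlurmSubmitterSG id>")],
#   )
```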
243+
155244
### Slurm Compute Node Security Group
156245

157246
The SlurmComputeNodeSG will be specified in your configuration file for the slurm/InstanceConfig/AdditionalSecurityGroups parameter.
