Commit 8dff7cd

Add support for ParallelCluster versions 3.9.0 and 3.9.1 (#232)
* Add support for rhel9 and rocky9. Had to update some of the ansible playbooks to mimic the rhel8 changes. Resolves #229
* Set SubmitterInstanceTags based on RESEnvironmentName.
* Remove the SubmitterSecurityGroupIds parameter. This option added rules to existing security groups, and if those groups were used by multiple clusters then the number of security group rules would exceed the maximum allowed. Now that additional security groups can be added to the head and compute nodes, the customer should supply their own security groups that meet the Slurm cluster requirements, attach them to their login nodes, and configure them as additional security groups for the head and compute nodes. Resolves #204
* Update CallSlurmRestApiLambda from Python 3.8 to 3.9. Resolves #230
* Update the CDK version to 2.111.0. This is the latest version supported by nodejs 16. Really need to move to nodejs 20, but it isn't supported on Amazon Linux 2 or the RHEL 7 family; that would require either running in a Docker container or on a newer OS version. I think that I'm going to change the prerequisites for the OS distribution so that I can stay on the latest tools. For example, I can't update to Python 3.12 until I do this.
* Update DeconfigureRESUsersGroupsJson to pass if the last statement fails.
* Fix a bug in create_slurm_accounts.py. Resolves #231
1 parent ded618c commit 8dff7cd

File tree

15 files changed: +169 -118 lines changed

docs/config.md

+1 -9

@@ -76,8 +76,6 @@ This project creates a ParallelCluster configuration file that is documented in
     - str
   <a href="https://docs.aws.amazon.com/parallelcluster/latest/ug/HeadNode-v3.html#HeadNode-v3-Imds">Imds</a>:
     <a href="https://docs.aws.amazon.com/parallelcluster/latest/ug/HeadNode-v3.html#yaml-HeadNode-Imds-Secured">Secured</a>: bool
-  <a href="#submittersecuritygroupids">SubmitterSecurityGroupIds</a>:
-    SecurityGroupName: SecurityGroupId
   <a href="#submitterinstancetags">SubmitterInstanceTags</a>: str
     TagName:
     - TagValues

@@ -249,7 +247,7 @@ See the [ParallelCluster docs](https://docs.aws.amazon.com/parallelcluster/lates

 See the [ParallelCluster docs](https://docs.aws.amazon.com/parallelcluster/latest/ug/Image-v3.html#yaml-Image-CustomAmi) for the custom AMI documentation.

-**NOTE**: A CustomAmi must be provided for Rocky8.
+**NOTE**: A CustomAmi must be provided for Rocky8 or Rocky9.
 All other distributions have a default AMI that is provided by ParallelCluster.

 #### Architecture

@@ -491,12 +489,6 @@ Additional security groups that will be added to the head node instance.

 List of Amazon Resource Names (ARNs) of IAM policies for Amazon EC2 that will be added to the head node instance.

-### SubmitterSecurityGroupIds
-
-External security groups that should be able to use the cluster.
-
-Rules will be added to allow it to interact with Slurm.
-
 ### SubmitterInstanceTags

 Tags of instances that can be configured to submit to the cluster.

docs/deployment-prerequisites.md

+73 -1

@@ -99,6 +99,76 @@ The version that has been tested is in the CDK_VERSION variable in the install s

 The install script will try to install the prerequisites if they aren't already installed.

+## Security Groups for Login Nodes
+
+If you want to allow instances like remote desktops to use the cluster directly, you must define
+three security groups that allow connections between the instance, the Slurm head node, and the Slurm compute nodes.
+We call the instance that is connecting to the Slurm cluster a login node or a submitter instance.
+
+I'll call the three security groups the following names, but they can be whatever you want.
+
+* SlurmSubmitterSG
+* SlurmHeadNodeSG
+* SlurmComputeNodeSG
+
+### Slurm Submitter Security Group
+
+The SlurmSubmitterSG will be attached to your login nodes, such as your virtual desktops.
+
+It needs at least the following inbound rules:
+
+| Type | Port range | Source | Description
+|------|------------|--------|------------
+| TCP | 1024-65535 | SlurmHeadNodeSG | SlurmHeadNode ephemeral
+| TCP | 1024-65535 | SlurmComputeNodeSG | SlurmComputeNode ephemeral
+| TCP | 6000-7024 | SlurmComputeNodeSG | SlurmComputeNode X11
+
+It needs the following outbound rules.
+
+| Type | Port range | Destination | Description
+|------|------------|-------------|------------
+| TCP | 2049 | SlurmHeadNodeSG | SlurmHeadNode NFS
+| TCP | 6818 | SlurmComputeNodeSG | SlurmComputeNode slurmd
+| TCP | 6819 | SlurmHeadNodeSG | SlurmHeadNode slurmdbd
+| TCP | 6820-6829 | SlurmHeadNodeSG | SlurmHeadNode slurmctld
+| TCP | 6830 | SlurmHeadNodeSG | SlurmHeadNode slurmrestd
+
+### Slurm Head Node Security Group
+
+The SlurmHeadNodeSG will be specified in your configuration file for the slurm/SlurmCtl/AdditionalSecurityGroups parameter.
+
+It needs at least the following inbound rules:
+
+| Type | Port range | Source | Description
+|------|------------|--------|------------
+| TCP | 2049 | SlurmSubmitterSG | SlurmSubmitter NFS
+| TCP | 6819 | SlurmSubmitterSG | SlurmSubmitter slurmdbd
+| TCP | 6820-6829 | SlurmSubmitterSG | SlurmSubmitter slurmctld
+| TCP | 6830 | SlurmSubmitterSG | SlurmSubmitter slurmrestd
+
+It needs the following outbound rules.
+
+| Type | Port range | Destination | Description
+|------|------------|-------------|------------
+| TCP | 1024-65535 | SlurmSubmitterSG | SlurmSubmitter ephemeral
+
+### Slurm Compute Node Security Group
+
+The SlurmComputeNodeSG will be specified in your configuration file for the slurm/InstanceConfig/AdditionalSecurityGroups parameter.
+
+It needs at least the following inbound rules:
+
+| Type | Port range | Source | Description
+|------|------------|--------|------------
+| TCP | 6818 | SlurmSubmitterSG | SlurmSubmitter slurmd
+
+It needs the following outbound rules.
+
+| Type | Port range | Destination | Description
+|------|------------|-------------|------------
+| TCP | 1024-65535 | SlurmSubmitterSG | SlurmSubmitter ephemeral
+| TCP | 6000-7024 | SlurmSubmitterSG | SlurmSubmitter X11
+
 ## Create Configuration File

 Before you deploy a cluster you need to create a configuration file.

@@ -108,6 +178,7 @@ Ideally you should version control this file so you can keep track of changes.

 The schema for the config file along with its default values can be found in [source/cdk/config_schema.py](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L230-L445).
 The schema is defined in python, but the actual config file should be in yaml format.
+See [Configuration File Format](config.md) for documentation on all of the parameters.

 The following are key parameters that you will need to update.
 If you do not have the required parameters in your config file then the installer script will fail unless you specify the `--prompt` option.

@@ -120,7 +191,6 @@ You should save your selections in the config file.
 | [Region](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L368-L369) | Region where VPC is located | | `$AWS_DEFAULT_REGION`
 | [VpcId](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L372-L373) | The vpc where the cluster will be deployed. | vpc-* | None
 | [SshKeyPair](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L370-L371) | EC2 Keypair to use for instances | | None
-| [slurm/SubmitterSecurityGroupIds](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L480-L485) | Existing security groups that can submit to the cluster. For SOCA this is the ComputeNodeSG* resource. | sg-* | None
 | [ErrorSnsTopicArn](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L379-L380) | ARN of an SNS topic that will be notified of errors | `arn:aws:sns:{{region}}:{AccountId}:{TopicName}` | None
 | [slurm/InstanceConfig](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L491-L543) | Configure instance types that the cluster can use and number of nodes. | | See [default_config.yml](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/resources/config/default_config.yml)

@@ -137,7 +207,9 @@ all nodes must have the same architecture and Base OS.
 | CentOS 7 | x86_64
 | RedHat 7 | x86_64
 | RedHat 8 | x86_64, arm64
+| RedHat 9 | x86_64, arm64
 | Rocky 8 | x86_64, arm64
+| Rocky 9 | x86_64, arm64

 You can exclude instances types by family or specific instance type.
 By default the InstanceConfig excludes older generation instance families.
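
The security group rules in the tables above can be created with the console, CloudFormation, or CDK. As a rough illustration only (not part of this commit), here is a minimal CDK Python sketch; the stack and construct names, the placeholder VPC id, and the availability zone are hypothetical, and `Connections.allow_to()` adds the egress rule to the calling group and the matching ingress rule to the peer group.

```python
# Hypothetical sketch of the three login-node security groups described above.
# All ids and names below are placeholders; adjust them for your VPC.
from aws_cdk import App, Stack
from aws_cdk import aws_ec2 as ec2
from constructs import Construct


class SlurmLoginNodeSecurityGroups(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Placeholder VPC; replace with your VPC id and availability zones.
        vpc = ec2.Vpc.from_vpc_attributes(
            self, "Vpc", vpc_id="vpc-12345678", availability_zones=["us-east-1a"]
        )

        submitter_sg = ec2.SecurityGroup(self, "SlurmSubmitterSG", vpc=vpc, allow_all_outbound=False, description="Slurm login nodes (virtual desktops)")
        head_node_sg = ec2.SecurityGroup(self, "SlurmHeadNodeSG", vpc=vpc, allow_all_outbound=False, description="Slurm head node")
        compute_node_sg = ec2.SecurityGroup(self, "SlurmComputeNodeSG", vpc=vpc, allow_all_outbound=False, description="Slurm compute nodes")

        # Submitter -> head node: NFS, slurmdbd, slurmctld, slurmrestd
        submitter_sg.connections.allow_to(head_node_sg, ec2.Port.tcp(2049), "SlurmHeadNode NFS")
        submitter_sg.connections.allow_to(head_node_sg, ec2.Port.tcp(6819), "SlurmHeadNode slurmdbd")
        submitter_sg.connections.allow_to(head_node_sg, ec2.Port.tcp_range(6820, 6829), "SlurmHeadNode slurmctld")
        submitter_sg.connections.allow_to(head_node_sg, ec2.Port.tcp(6830), "SlurmHeadNode slurmrestd")

        # Submitter -> compute nodes: slurmd
        submitter_sg.connections.allow_to(compute_node_sg, ec2.Port.tcp(6818), "SlurmComputeNode slurmd")

        # Head node and compute nodes -> submitter: ephemeral ports and X11
        head_node_sg.connections.allow_to(submitter_sg, ec2.Port.tcp_range(1024, 65535), "SlurmSubmitter ephemeral")
        compute_node_sg.connections.allow_to(submitter_sg, ec2.Port.tcp_range(1024, 65535), "SlurmSubmitter ephemeral")
        compute_node_sg.connections.allow_to(submitter_sg, ec2.Port.tcp_range(6000, 7024), "SlurmSubmitter X11")


app = App()
SlurmLoginNodeSecurityGroups(app, "slurm-login-node-sgs")
app.synth()
```

However you create them, attach the SlurmSubmitterSG to your login nodes and pass the SlurmHeadNodeSG and SlurmComputeNodeSG ids to slurm/SlurmCtl/AdditionalSecurityGroups and slurm/InstanceConfig/AdditionalSecurityGroups respectively, as the RES and SOCA integration pages below describe.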

docs/res_integration.md

+7 -2

@@ -11,11 +11,12 @@ The intention is to completely automate the deployment of ParallelCluster and se
 |-----------|-------------|------
 | VpcId | VPC id for the RES cluster | vpc-xxxxxx
 | SubnetId | Subnet in the RES VPC. | subnet-xxxxx
-| SubmitterSecurityGroupIds | The security group names and ids used by RES VDIs. The name will be something like *EnvironmentName*-vdc-dcv-host-security-group | *EnvironmentName*-*VDISG*: sg-xxxxxxxx
 | SubmitterInstanceTags | The tag of VDI instances. | 'res:EnvironmentName': *EnvironmentName*'
 | ExtraMounts | The mount parameters for the /home directory. This is required for access to the home directory. |
 | ExtraMountSecurityGroups | Security groups that give access to the ExtraMounts. These will be added to compute nodes so they can access the file systems.

+You must also create security groups as described in [Security Groups for Login Nodes](deployment-prerequisites.md#security-groups-for-login-nodes) and specify the SlurmHeadNodeSG in the `slurm/SlurmCtl/AdditionalSecurityGroups` parameter and the SlurmComputeNodeSG in the `slurm/InstanceConfig/AdditionalSecurityGroups` parameter.
+
 When you specify **RESEnvironmentName**, a lambda function will run SSM commands to create a cron job on a RES domain joined instance to update the users_groups.json file every hour. Another lambda function will also automatically configure all running VDI hosts to use the cluster.

 The following example shows the configuration parameters for a RES with the EnvironmentName=res-eda.

@@ -51,11 +52,15 @@ slurm:
   Database:
     DatabaseStackName: pcluster-slurm-db-res

-  SlurmCtl: {}
+  SlurmCtl:
+    AdditionalSecurityGroups:
+      - sg-12345678 # SlurmHeadNodeSG

   # Configure typical EDA instance types
   # A partition will be created for each combination of Base OS, Architecture, and Spot
   InstanceConfig:
+    AdditionalSecurityGroups:
+      - sg-23456789 # SlurmComputeNodeSG
     UseSpot: true
     NodeCounts:
       DefaultMaxCount: 10

docs/soca_integration.md

+2 -1

@@ -11,7 +11,8 @@ Set the following parameters in your config file.
 | Parameter | Description | Value
 |-----------|-------------|------
 | VpcId | VPC id for the SOCA cluster | vpc-xxxxxx
-| SubmitterSecurityGroupIds | The ComputeNode security group name and id | *cluster-id*-*ComputeNodeSG*: sg-xxxxxxxx
+| slurm/SlurmCtl/AdditionalSecurityGroups | Security group ids that give desktop instances access to the head node and that give the head node access to VPC resources such as file systems.
+| slurm/InstanceConfig/AdditionalSecurityGroups | Security group ids that give desktop instances access to the compute nodes and that give compute nodes access to VPC resources such as file systems.
 | ExtraMounts | Add the mount parameters for the /apps and /data directories. This is required for access to the home directory. |

 Deploy your slurm cluster.

setup.sh

+10 -1

@@ -41,7 +41,16 @@ fi
 echo "Using python $python_version"

 # Check nodejs version
+# https://nodejs.org/en/about/previous-releases
 required_nodejs_version=16.20.2
+# required_nodejs_version=18.20.2
+# On Amazon Linux 2 and nodejs 18.20.2 I get the following errors:
+# node: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by node)
+# node: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by node)
+# required_nodejs_version=20.13.1
+# On Amazon Linux 2 and nodejs 20.13.1 I get the following errors:
+# node: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by node)
+# node: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by node)
 export JSII_SILENCE_WARNING_DEPRECATED_NODE_VERSION=1
 if ! which node &> /dev/null; then
     echo -e "\nnode not found in your path."

@@ -88,7 +97,7 @@ fi
 echo "Using nodejs version $nodejs_version"

 # Create a local installation of cdk
-CDK_VERSION=2.91.0 # If you change the CDK version here, make sure to also change it in source/requirements.txt
+CDK_VERSION=2.111.0 # When you change the CDK version here, make sure to also change it in source/requirements.txt
 if ! cdk --version &> /dev/null; then
     echo "CDK not installed. Installing global version of cdk@$CDK_VERSION."
     if ! npm install -g aws-cdk@$CDK_VERSION; then

source/cdk/cdk_slurm_stack.py

+5 -43

@@ -231,17 +231,6 @@ def override_config_with_context(self):
             logger.error(f"Must set --{command_line_switch} from the command line or {config_key} in the config files")
             exit(1)

-        config_key = 'SubmitterSecurityGroupIds'
-        context_key = config_key
-        submitterSecurityGroupIds_b64_string = self.node.try_get_context(context_key)
-        if submitterSecurityGroupIds_b64_string:
-            submitterSecurityGroupIds = json.loads(base64.b64decode(submitterSecurityGroupIds_b64_string).decode('utf-8'))
-            if config_key not in self.config['slurm']:
-                logger.info(f"slurm/{config_key:20} set from command line: {submitterSecurityGroupIds}")
-            else:
-                logger.info(f"slurm/{config_key:20} in config file overridden on command line from {self.config['slurm'][config_key]} to {submitterSecurityGroupIds}")
-            self.config['slurm'][config_key] = submitterSecurityGroupIds
-
     def check_config(self):
         '''
         Check config, set defaults, and sanity check the configuration.

@@ -425,6 +414,9 @@ def update_config_for_res(self):
         '''
         res_environment_name = self.config['RESEnvironmentName']
         logger.info(f"Updating configuration for RES environment: {res_environment_name}")
+
+        self.config['slurm']['SubmitterInstanceTags'] = {'res:EnvironmentName': [res_environment_name]}
+
         cloudformation_client = boto3.client('cloudformation', region_name=self.config['Region'])
         res_stack_name = None
         stack_statuses = {}

@@ -481,13 +473,6 @@ def update_config_for_res(self):
         self.config['SubnetId'] = subnet_ids[0]
         logger.info(f"    SubnetId: {self.config['SubnetId']}")

-        submitter_security_group_ids = []
-        if 'SubmitterSecurityGroupIds' not in self.config['slurm']:
-            self.config['slurm']['SubmitterSecurityGroupIds'] = {}
-        else:
-            for security_group_name, security_group_ids in self.config['slurm']['SubmitterSecurityGroupIds'].items():
-                submitter_security_group_ids.append(security_group_ids)
-
         # Get RES VDI Security Group
         res_vdc_stack_name = f"{res_stack_name}-vdc"
         if res_vdc_stack_name not in stack_statuses:

@@ -508,11 +493,6 @@ def update_config_for_res(self):
         if not res_dcv_security_group_id:
             logger.error(f"RES VDI security group not found.")
             exit(1)
-        if res_dcv_security_group_id not in submitter_security_group_ids:
-            res_dcv_security_group_name = f"{res_environment_name}-dcv-sg"
-            logger.info(f"    SubmitterSecurityGroupIds['{res_dcv_security_group_name}'] = '{res_dcv_security_group_id}'")
-            self.config['slurm']['SubmitterSecurityGroupIds'][res_dcv_security_group_name] = res_dcv_security_group_id
-            submitter_security_group_ids.append(res_dcv_security_group_id)

         # Get cluster manager Security Group
         logger.debug(f"Searching for cluster manager security group id")

@@ -535,11 +515,6 @@ def update_config_for_res(self):
         if not res_cluster_manager_security_group_id:
             logger.error(f"RES cluster manager security group not found.")
             exit(1)
-        if res_cluster_manager_security_group_id not in submitter_security_group_ids:
-            res_cluster_manager_security_group_name = f"{res_environment_name}-cluster-manager-sg"
-            logger.info(f"    SubmitterSecurityGroupIds['{res_cluster_manager_security_group_name}'] = '{res_cluster_manager_security_group_id}'")
-            self.config['slurm']['SubmitterSecurityGroupIds'][res_cluster_manager_security_group_name] = res_cluster_manager_security_group_id
-            submitter_security_group_ids.append(res_cluster_manager_security_group_id)

         # Get vdc controller Security Group
         logger.debug(f"Searching for VDC controller security group id")

@@ -564,11 +539,6 @@ def update_config_for_res(self):
         if not res_vdc_controller_security_group_id:
             logger.error(f"RES VDC controller security group not found.")
             exit(1)
-        if res_vdc_controller_security_group_id not in submitter_security_group_ids:
-            res_vdc_controller_security_group_name = f"{res_environment_name}-vdc-controller-sg"
-            logger.info(f"    SubmitterSecurityGroupIds['{res_vdc_controller_security_group_name}'] = '{res_vdc_controller_security_group_id}'")
-            self.config['slurm']['SubmitterSecurityGroupIds'][res_vdc_controller_security_group_name] = res_vdc_controller_security_group_id
-            submitter_security_group_ids.append(res_vdc_controller_security_group_id)

         # Configure the /home mount from RES if /home not already configured
         home_mount_found = False

@@ -1025,7 +995,7 @@ def create_parallel_cluster_lambdas(self):
             ],
             compatible_runtimes = [
                 aws_lambda.Runtime.PYTHON_3_9,
-                aws_lambda.Runtime.PYTHON_3_10,
+                # aws_lambda.Runtime.PYTHON_3_10, # Doesn't work: No module named 'rpds.rpds'
                 # aws_lambda.Runtime.PYTHON_3_11, # Doesn't work: No module named 'rpds.rpds'
             ],
         )

@@ -1694,7 +1664,7 @@ def create_callSlurmRestApiLambda(self):
             function_name=f"{self.stack_name}-CallSlurmRestApiLambda",
             description="Example showing how to call Slurm REST API",
             memory_size=128,
-            runtime=aws_lambda.Runtime.PYTHON_3_8,
+            runtime=aws_lambda.Runtime.PYTHON_3_9,
             architecture=aws_lambda.Architecture.ARM_64,
             timeout=Duration.minutes(1),
             log_retention=logs.RetentionDays.INFINITE,
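
The CallSlurmRestApiLambda that this hunk moves to Python 3.9 is described in the code as an example of calling the Slurm REST API. For readers unfamiliar with slurmrestd, a minimal client sketch follows; it is not the lambda's code, and the head node address, user name, JWT token, and API version path are placeholders. It assumes the slurmrestd port 6830 opened by the security group rules above and the standard slurmrestd JWT headers.

```python
# Hypothetical slurmrestd client sketch; host, user, token, and API version are placeholders.
import json
import urllib.request


def ping_slurm_rest_api(head_node: str, user_name: str, jwt_token: str) -> dict:
    # slurmrestd listens on TCP 6830 per the head node security group rules above.
    url = f"http://{head_node}:6830/slurm/v0.0.38/ping"
    request = urllib.request.Request(
        url,
        headers={
            # Standard slurmrestd JWT authentication headers.
            "X-SLURM-USER-NAME": user_name,
            "X-SLURM-USER-TOKEN": jwt_token,
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    print(ping_slurm_rest_api("head-node.example.com", "slurm", "<jwt-token>"))
```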
@@ -1842,14 +1812,6 @@ def create_security_groups(self):
         Tags.of(self.slurm_submitter_sg).add("Name", self.slurm_submitter_sg_name)
         self.suppress_cfn_nag(self.slurm_submitter_sg, 'W29', 'Egress port range used to block all egress')
         self.submitter_security_groups[self.slurm_submitter_sg_name] = self.slurm_submitter_sg
-        for slurm_submitter_sg_name, slurm_submitter_sg_id in self.config['slurm']['SubmitterSecurityGroupIds'].items():
-            (allow_all_outbound, allow_all_ipv6_outbound) = self.allow_all_outbound(slurm_submitter_sg_id)
-            self.submitter_security_groups[slurm_submitter_sg_name] = ec2.SecurityGroup.from_security_group_id(
-                self, f"{slurm_submitter_sg_name}",
-                security_group_id = slurm_submitter_sg_id,
-                allow_all_outbound = allow_all_outbound,
-                allow_all_ipv6_outbound = allow_all_ipv6_outbound
-            )

         self.slurm_rest_api_lambda_sg = ec2.SecurityGroup(self, "SlurmRestLambdaSG", vpc=self.vpc, allow_all_outbound=False, description="SlurmRestApiLambda to SlurmCtl Security Group")
         self.slurm_rest_api_lambda_sg_name = f"{self.stack_name}-SlurmRestApiLambdaSG"
