`macos-m1-12` maintenance plan

Since launching the M1 runners on AWS, we've known that they would eventually need a maintenance plan. This is a high-level issue to organize the work around maintaining these nodes. Ideally we can maintain them using GHA, but Ansible is also an option (it's what the nodes were originally provisioned with, [code here](https://github.com/fairinternal/pytorch-gha-infra/tree/main/macos-runners)).

# Proposed maintenance plan 1 (preferred, maybe more long term)

Use GitHub Actions to run regular maintenance on these nodes, and set up alerting through it as well so that we know when a node fails certain health checks.

Steps needed for this approach:
- [ ] Label all existing nodes with their name as a label (to utilize later in our GHA workflow)
- [ ] Create a workflow that does the following:
  - Generates a matrix from the labels created above
  - Iterates over that matrix, running cleanup steps and health checks on each node
- [ ] Integrate that workflow's signal into our alerting system on HUD
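
To make the matrix and health-check steps above more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the node names in `NODE_LABELS` are hypothetical placeholders, and the free-disk-space check with its 10% threshold is just one example of a health check, not something this plan has settled on.

```python
import json
import shutil

# Hypothetical node names; the real list would come from the labels
# applied to each runner in the first step of this plan.
NODE_LABELS = ["macos-m1-12-node-a", "macos-m1-12-node-b"]


def build_matrix(labels):
    """Return a GHA-style matrix dict with one entry per node label."""
    return {"include": [{"runner": label} for label in sorted(labels)]}


def disk_health(path="/", min_free_fraction=0.10):
    """Return (ok, free_fraction) for the filesystem containing `path`.

    The threshold is an assumed value, not an agreed-upon limit.
    """
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total
    return free_fraction >= min_free_fraction, free_fraction


if __name__ == "__main__":
    # Emit the matrix as a single `name=value` line of JSON.
    print("matrix=" + json.dumps(build_matrix(NODE_LABELS)))

    # Run the example health check on the current machine.
    ok, free = disk_health()
    print(f"disk ok={ok} free={free:.1%}")
```

In a workflow, the `matrix=...` line could be appended to the file named by `$GITHUB_OUTPUT` in a setup job and read back with `fromJSON` in a later job's `strategy.matrix`, with `runs-on` set to the matrix entry so each maintenance job lands on one specific node.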

## Advantages to approach 1
- Using GHA itself to do the maintenance ensures that no other jobs are running on the machine during maintenance, since the maintenance job occupies the runner
- More transparency over what's going on since we can view the logs here

# Proposed maintenance plan 2 (more immediate, but hacky)

Create a new Ansible playbook that does the regular maintenance, but in a more manual way.

- [ ] Write a new Ansible playbook to do maintenance cleanup
- [ ] (Maybe) Automate the Ansible playbook so this maintenance runs regularly

## Advantages to approach 2
- Ansible is fairly easy to write, so this can be done quickly

## Disadvantages to approach 2
- Running this outside of GHA means jobs could be running on a node during maintenance, so we could clean up dependencies that a running workflow needs and cause it to fail
- Automation would need to live in the private repository, since it contains the node IPs needed for the SSH access that Ansible requires

# Relevant issues
* https://github.com/pytorch/pytorch/issues/84841
