Description
Since launching the M1 runners on AWS, we've known that they would eventually need a maintenance plan. This is a high-level issue to organize the work of maintaining these nodes. Ideally we can maintain them using GHA, but there is also the option of using Ansible (which is what they were originally provisioned with, code here).
Proposed maintenance plan 1 (preferred, maybe more long term)
Use GitHub Actions to run regular maintenance on these nodes, and set up alerting through it as well so that we know when a node fails its health checks.
Steps needed for this approach:
- Label all existing nodes with their name as a label (to utilize later in our GHA workflow)
- Create a workflow that does the following:
  - Generates a matrix using labels created above
  - Iterates over that matrix doing clean up steps / health checks
- Integrate that workflow's signal into our alerting system on HUD
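
The matrix-generation step above could look something like the following. This is only a minimal sketch: the node names in `NODE_LABELS`, the `runner` matrix key, and the `build_matrix` helper are all hypothetical, and the real labels would be whatever we end up tagging the nodes with.

```python
import json

# Hypothetical node names (assumption); the real list would mirror the
# labels applied to the nodes in the first step.
NODE_LABELS = ["m1-runner-01", "m1-runner-02", "m1-runner-03"]


def build_matrix(labels):
    """Return a matrix as a JSON string, with one entry per node label."""
    # Sort and de-duplicate so the matrix is stable between runs.
    unique = sorted(set(labels))
    return json.dumps({"runner": unique})


if __name__ == "__main__":
    # A workflow step could append this line to the file named by
    # $GITHUB_OUTPUT so that a later job can read it as a job output.
    print(f"matrix={build_matrix(NODE_LABELS)}")
```

A downstream job could then presumably feed that output through `fromJSON` into `strategy.matrix` and set `runs-on` to the matrix value, so that each maintenance job lands on exactly one node.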
Advantages to approach 1
- Using GHA itself to do the maintenance ensures that no jobs are running on the machine when we attempt to run maintenance
- More transparency over what's going on since we can view the logs here
Proposed maintenance plan 2 (more immediate, but hacky)
Create a new Ansible playbook to do the regular maintenance, but in a more manual way.
- Write a new Ansible playbook to do the maintenance cleanup
- ? Automate the Ansible playbook to run this maintenance regularly
Advantages to approach 2
- Ansible playbooks are fairly easy to write, so this can be done quickly
Disadvantages to approach 2
- Running maintenance outside of GHA means that jobs could be running on a node while we clean it up, so we could remove dependencies out from under a workflow and cause it to fail
- Automation would have to live in the private repository, since it contains the node IPs that Ansible needs for SSH access
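
One possible way to soften the first disadvantage above would be to check which runners are idle before running the playbook against them. The sketch below assumes a payload shaped like the response of GitHub's "list self-hosted runners" REST endpoint, where each runner has `name`, `status`, and `busy` fields; the `idle_runners` helper and the sample data are hypothetical. It does not close the gap entirely, because a job could still be picked up between the check and the maintenance run.

```python
def idle_runners(payload):
    """Return the names of runners that are online and not running a job."""
    return [
        runner["name"]
        for runner in payload.get("runners", [])
        # Treat a missing "busy" field as busy, to err on the safe side.
        if runner.get("status") == "online" and not runner.get("busy", True)
    ]


if __name__ == "__main__":
    # Hypothetical sample data (assumption), shaped like the API response.
    sample = {
        "total_count": 3,
        "runners": [
            {"name": "m1-runner-01", "status": "online", "busy": True},
            {"name": "m1-runner-02", "status": "online", "busy": False},
            {"name": "m1-runner-03", "status": "offline", "busy": False},
        ],
    }
    # Only runners returned here would be candidates for maintenance.
    print(idle_runners(sample))
```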