macos-m1-12 maintenance plan #692

Open
@seemethere

Since launching the M1 runners on AWS, we've known that these nodes would eventually need a maintenance plan. This is a high-level issue to organize the work of maintaining them. Ideally we can maintain them using GHA, but maintaining them with Ansible is also an option (Ansible is what they were originally provisioned with, code here).

Proposed maintenance plan 1 (preferred, maybe more long term)

Use GitHub Actions to run regular maintenance on these nodes, and set up alerting through it as well, so that we know when a node fails certain health checks.
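As an illustration of what a single health check might look like, here is a minimal sketch of a free-disk-space check. The issue does not say which checks are wanted, so the check itself, the 10% threshold, and the function names are all assumptions.

```python
import shutil

# Hypothetical threshold: flag a node when less than 10% of its disk is free.
MIN_FREE_FRACTION = 0.10


def has_enough_free_space(total_bytes, free_bytes,
                          min_free_fraction=MIN_FREE_FRACTION):
    """Return True when the free share of the disk meets the threshold."""
    if total_bytes <= 0:
        return False
    return free_bytes / total_bytes >= min_free_fraction


def check_disk(path="/"):
    """Run the disk check against the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return has_enough_free_space(usage.total, usage.free)


if __name__ == "__main__":
    print("disk check:", "ok" if check_disk() else "FAILED")
```

In a workflow step, a script like this would probably exit non-zero when a check fails, so that the job is marked as failed and alerting has something to key off.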

Steps needed for this approach:

  • Label all existing nodes with their name as a label (to utilize later in our GHA workflow)
  • Create a workflow that does the following
    • Generates a matrix using labels created above
    • Iterates over that matrix doing clean up steps / health checks
  • Integrate that workflow's signal into our alerting system on HUD
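The matrix-generation step above could be sketched roughly as follows. This is only a sketch: the node names are made up, and it assumes the matrix is handed to later jobs as a JSON step output written to the file named by the `GITHUB_OUTPUT` environment variable.

```python
import json
import os

# Hypothetical node names. Per the first step, each node is assumed to carry
# its own name as a runner label.
NODE_LABELS = ["macos-m1-12-node-a", "macos-m1-12-node-b"]


def build_matrix(labels):
    """Return a job matrix with one entry per unique node label."""
    return {"runner": sorted(set(labels))}


def emit_matrix(labels, output_path=None):
    """Serialize the matrix and append it to the step-output file, if any."""
    line = "matrix=" + json.dumps(build_matrix(labels))
    # GITHUB_OUTPUT is set by the runner for each step; outside of GHA it is
    # normally unset, in which case nothing is written.
    output_path = output_path or os.environ.get("GITHUB_OUTPUT")
    if output_path:
        with open(output_path, "a", encoding="utf-8") as fh:
            fh.write(line + "\n")
    return line


if __name__ == "__main__":
    print(emit_matrix(NODE_LABELS))
```

A downstream job could then load this output with `fromJSON(...)` as its `strategy.matrix` and set `runs-on: ${{ matrix.runner }}`, so that each matrix entry lands on the node carrying that label.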

Advantages to approach 1

  • Using GHA itself to do the maintenance ensures that no other jobs are running on the machine during maintenance, since the maintenance job occupies the runner while it runs
  • More transparency over what's going on since we can view the logs here

Proposed maintenance plan 2 (more immediate, but hacky)

Create a new Ansible playbook that does the regular maintenance, but run in a more manual way.

  • Write a new Ansible playbook to do the maintenance clean up
  • (Open question) Automate the Ansible playbook so that this maintenance runs regularly

Advantages to approach 2

  • Ansible is fairly easy to write so this can be done quickly

Disadvantages to approach 2

  • Doing this outside of GHA means that jobs could be running on a node while maintenance is performed, so we could end up cleaning dependencies out from under a running workflow and causing it to fail
  • Automation would need to live in the private repository, since that repository contains the node IPs needed for the SSH access that Ansible requires
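One possible way to reduce the first risk, not proposed in this issue, would be to skip nodes that are currently running a job. The sketch below assumes runner state has already been fetched from GitHub's "list self-hosted runners" REST endpoint, whose response reports a `status` and a `busy` flag per runner. The sample payload and runner names are made up.

```python
import json

# Made-up sample shaped like the "list self-hosted runners" response.
SAMPLE_RESPONSE = """
{
  "total_count": 3,
  "runners": [
    {"id": 1, "name": "node-a", "status": "online", "busy": false},
    {"id": 2, "name": "node-b", "status": "online", "busy": true},
    {"id": 3, "name": "node-c", "status": "offline", "busy": false}
  ]
}
"""


def idle_runner_names(payload):
    """Return names of runners that are online and not running a job."""
    return [
        runner["name"]
        for runner in payload.get("runners", [])
        # Treat a missing "busy" field as busy, to err on the safe side.
        if runner.get("status") == "online" and not runner.get("busy", True)
    ]


if __name__ == "__main__":
    print(idle_runner_names(json.loads(SAMPLE_RESPONSE)))
```

This check is inherently racy: a job can start on a node between the check and the maintenance run, so it does not give the guarantee that approach 1 does.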

Relevant issues

Labels: aws (Issues pertaining to our AWS infrastructure), gha infra (Related to our self-hosted GitHub Actions infrastructure)
