Maxtext llama2-7b on 32 nodes has issue running on A3Mega #2

Open
@NinaCai

I followed the instructions in this repo to run the 32-node MaxText llama2-7b workload on A3Mega.

The run fails with the following error:

Stopping coordination service as cluster registration failed. This may be due to 1) some tasks crashed earlier before connecting, 2) some tasks were never scheduled, or 3) scheduling delays. Consider setting a longer initialization timeout if such delays are expected, the timeout is currently set to: 1h.

Original error: DEADLINE_EXCEEDED: Barrier timed out. Id: [Init]Wait_for_all_tasks_to_register::0. This usually happens because a task triggered the barrier too early or too slowly. Please look at the task logs (both timed out and first task) to debug further.
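
The message above lists tasks that crashed, were never scheduled, or were delayed as possible causes. One further thing that may be worth ruling out is whether each node can open a TCP connection to the coordinator at all. The following is a minimal sketch of such a check using only the Python standard library. The function name `coordinator_reachable` and the host and port passed on the command line are illustrative assumptions, not part of this repo or of MaxText, and a successful connection does not by itself prove that a task registered.

```python
import socket
import sys


def coordinator_reachable(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s.

    Hypothetical helper for debugging; it only tests basic network
    reachability, not whether the coordination service accepted the task.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Usage (values are placeholders): python check_coordinator.py <host> <port>
    if len(sys.argv) == 3:
        host, port = sys.argv[1], int(sys.argv[2])
        status = "reachable" if coordinator_reachable(host, port) else "NOT reachable"
        print(f"{host}:{port} is {status}")
```

Running this from every node against the coordinator address used by the job would show which nodes, if any, cannot reach it. Nodes that can reach it but still never register would point back to the crash or scheduling causes named in the error.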
