Open
Description
Following the instructions in this repo for running 32-node maxtext llama2-7b workload.
I have the following error
Stopping coordination service as cluster registration failed. This may be due to 1) some tasks crashed earlier before connecting, 2) some tasks were never scheduled, or 3) scheduling delays. Consider setting a longer initialization timeout if such delays are expected, the timeout is currently set to: 1h.
Original error: DEADLINE_EXCEEDED: Barrier timed out. Id: [Init]Wait_for_all_tasks_to_register::0. This usually happens because a task triggered the barrier too early or too slowly. Please look at the task logs (both timed out and first task) to debug further.
Metadata
Metadata
Assignees
Labels
No labels