[Llama 3.1] Updates dataset, logging, and checkpoint resume. #787
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
@@ -103,7 +104,7 @@ After the download is complete, you should see five files under `TOKENIZER_PATH`

 ### Training and test data separation

-We use the default split from the C4 dataset. This means that we use `c4-train.<x>-of-01024.json.gz` files for training and `c4-validation.<x>-of-00008.json.gz` files for evaluation.
+We use the default split from the C4 dataset. This means that we use `c4-train.<x>-of-01024.json.gz` files (where `768 <= x <= 1023`) for training, and we use our customized `c4-validation-91205-samples.en.json.gz`, which contains the first 91205 samples from the unshuffled C4 validation dataset, for evaluation.
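For context, a minimal sketch of how such a validation file could be assembled, assuming the validation shards are newline-delimited JSON (`.json.gz`) read in their natural order. The authoritative procedure is `utils/consolidate_data.sh` (discussed below); file names and paths here are illustrative only.

```python
# Illustrative only: take the first 91205 samples from the unshuffled
# C4 validation shards and write them to a single gzipped JSON-lines file.
# The real steps live in utils/consolidate_data.sh; paths are assumptions.
import glob
import gzip

NUM_SAMPLES = 91205
shards = sorted(glob.glob("c4-validation.*-of-00008.json.gz"))

written = 0
with gzip.open("c4-validation-91205-samples.en.json.gz", "wt", encoding="utf-8") as out:
    for shard in shards:
        if written >= NUM_SAMPLES:
            break
        with gzip.open(shard, "rt", encoding="utf-8") as f:
            for line in f:
                if written >= NUM_SAMPLES:
                    break
                out.write(line)
                written += 1

print(f"Wrote {written} validation samples")
```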
Can you explain why and how 91205 samples were chosen?
This is discussed in `utils/consolidate_data.sh`. I can copy the lines here to make it clearer in the README.
I have added the description in this commit
Thanks! LGTM
@mlcommons/wg-training @Elnifio Are we good to approve and merge this PR? My team is waiting for this PR to be merged before removing duplicate data from R2. Thanks.
Thanks!
This PR includes the following changes:
- Adds `STEP_TIME_ATOL` so that, if a training step takes longer than this threshold, we actively kill the job. Defaults to 2 hours per step (illustrative sketch below).
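For illustration only, a minimal sketch of how a per-step watchdog like this could behave, assuming the threshold is read from a `STEP_TIME_ATOL` environment variable in seconds; the actual hook in this PR may be wired differently (for example, as a callback inside the training framework), and a production version would likely need a separate watchdog thread to catch steps that hang entirely.

```python
# Illustrative sketch of a step-time watchdog. The 2-hour-per-step default
# comes from the PR description; reading it from an environment variable
# named STEP_TIME_ATOL is an assumption made for this example.
import os
import sys
import time

STEP_TIME_ATOL = float(os.environ.get("STEP_TIME_ATOL", 7200))  # seconds (2 hours)

def train_step(batch):
    """Placeholder for the real training step."""
    ...

def training_loop(batches):
    for step, batch in enumerate(batches):
        start = time.time()
        train_step(batch)
        elapsed = time.time() - start
        if elapsed > STEP_TIME_ATOL:
            # Actively kill the job so a stalled run does not keep holding the cluster.
            print(
                f"Step {step} took {elapsed:.0f}s > STEP_TIME_ATOL={STEP_TIME_ATOL:.0f}s; aborting.",
                file=sys.stderr,
            )
            sys.exit(1)
```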