[Llama 3.1] Updates dataset, logging, and checkpoint resume. #787

Merged
merged 8 commits into mlcommons:master on Mar 18, 2025

Conversation

Contributor

@Elnifio Elnifio commented Mar 5, 2025

This PR includes the following changes:

  • Llama 3.1:
    1. Updated the validation dataset so that it now contains exactly 5760 sequences. This impacts the preprocessing script, the README downloading instructions, and the pretraining script's dataset loading section.
    2. Updated logging to report the number of sequences instead of the number of tokens. This impacts the README descriptions, the pretrain launch script's arguments, and the callback's logging.
    3. Updated the dataset location on the S3 bucket in the README.
    4. Added a STEP_TIME_ATOL knob so that, if a training step takes longer than this threshold, the job is actively killed (see the sketch after this list). Defaults to 2 hours per step.
    5. Updated how checkpoints are resumed across different experiment partitions when starting from the HF checkpoint.
    6. Updated the optimizer logging to AdamW.
  • Mixtral 8x22b: Updated the dataset downloading section in the README.
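
For reference, a minimal sketch of what a step-time watchdog like the one described in item 4 could look like, assuming a PyTorch Lightning-style callback. The class name `StepTimeWatchdog` and the way `STEP_TIME_ATOL` is read from the environment are illustrative assumptions, not the PR's actual code.

```python
# Illustrative sketch only; the real knob is wired through the launch script.
import os
import time

from pytorch_lightning import Callback


class StepTimeWatchdog(Callback):
    """Abort the run if a single training step exceeds the allowed step time."""

    def __init__(self, step_time_atol: float = 7200.0):
        # STEP_TIME_ATOL: maximum tolerated seconds per step (default: 2 hours).
        self.step_time_atol = step_time_atol
        self._step_start = None

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        self._step_start = time.monotonic()

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if self._step_start is None:
            return
        step_time = time.monotonic() - self._step_start
        if step_time > self.step_time_atol:
            # Actively kill the job instead of letting it hang indefinitely.
            raise RuntimeError(
                f"Step {batch_idx} took {step_time:.0f}s, exceeding "
                f"STEP_TIME_ATOL={self.step_time_atol:.0f}s"
            )


# Typical hookup (hypothetical): pass the threshold in via the environment.
# trainer = Trainer(callbacks=[StepTimeWatchdog(float(os.environ.get("STEP_TIME_ATOL", 7200)))])
```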

@Elnifio Elnifio requested a review from a team as a code owner March 5, 2025 23:07

github-actions bot commented Mar 5, 2025

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@@ -103,7 +104,7 @@ After the download is complete, you should see five files under `TOKENIZER_PATH`

### Training and test data separation

-We use the default split from the C4 dataset. This means that we use `c4-train.<x>-of-01024.json.gz` files for training and `c4-validation.<x>-of-00008.json.gz` files for evaluation.
+We use the default split from the C4 dataset. This means that we use `c4-train.<x>-of-01024.json.gz` files (where `768 <= x <= 1023`) for training, and we use our customized `c4-validation-91205-samples.en.json.gz`, which contains the first 91205 samples from the unshuffled C4 validation dataset, for evaluation.
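
For illustration, a rough Python sketch of the consolidation step described above. The actual logic lives in `utils/consolidate_data.sh`, so the shard naming pattern and output path here are assumptions: the idea is simply to take the first 91205 lines from the unshuffled validation shards, in order, and write them to a single file.

```python
# Hypothetical equivalent of the consolidation step; paths are assumptions.
import glob
import gzip

NUM_SAMPLES = 91205  # first N samples of the unshuffled C4 validation split

written = 0
with gzip.open("c4-validation-91205-samples.en.json.gz", "wt") as out:
    # Process validation shards in their original (unshuffled) order.
    for shard in sorted(glob.glob("c4-validation.*-of-00008.json.gz")):
        with gzip.open(shard, "rt") as f:
            for line in f:  # one JSON document per line
                if written >= NUM_SAMPLES:
                    break
                out.write(line)
                written += 1
        if written >= NUM_SAMPLES:
            break
```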
Contributor

Can you explain why and how 91205 samples were chosen?

Contributor Author

This is discussed in utils/consolidate_data.sh. I can copy the lines here to make it clearer in the README.

Contributor Author

I have added the description in this commit

ShriyaRishab previously approved these changes Mar 6, 2025

@suexu1025 suexu1025 left a comment


Thanks! LGTM

@nathanw-mlc
Member

@mlcommons/wg-training @Elnifio Are we good to approve and merge this PR? My team is waiting for this PR to be merged before removing duplicate data from R2. Thanks.

@ShriyaRishab ShriyaRishab merged commit 637c82f into mlcommons:master Mar 18, 2025
1 check passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 18, 2025
@nathanw-mlc
Member

Thanks!
