[Llama 3.1] Updates dataset, logging, and checkpoint resume. #787

Merged
merged 8 commits into mlcommons:master on Mar 18, 2025

Conversation

Contributor

@Elnifio Elnifio commented Mar 5, 2025

This PR includes the following changes:

  • Llama 3.1:
    1. Updated the validation dataset so that it now contains exactly 5760 sequences. This impacts the preprocessing script, the README downloading instructions, and the pretraining script's dataset loading section.
    2. Updated logging to report the number of sequences instead of the number of tokens. This impacts the README descriptions, the pretrain launch script's arguments, and the callback's logging.
    3. Updated the dataset location on the S3 bucket in the README.
    4. Added a STEP_TIME_ATOL knob so that, if a training step takes longer than this threshold, the job is actively killed (see the sketch after this list). Defaults to 2 hours per step.
    5. Updated how checkpoints are resumed across different experiment partitions when starting from the HF checkpoint.
    6. Updated the optimizer logging to AdamW.
  • Mixtral 8x22b: Updated the dataset downloading section in the README.
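
For reference, a minimal sketch of what a step-time watchdog like the one described in item 4 could look like, assuming a PyTorch Lightning-style callback. The class name `StepTimeWatchdog` and the way `STEP_TIME_ATOL` is read from the environment are illustrative assumptions, not the PR's actual code.

```python
# Illustrative sketch only; the real knob is wired through the launch script.
import os
import time

from pytorch_lightning import Callback


class StepTimeWatchdog(Callback):
    """Abort the run if a single training step exceeds the allowed step time."""

    def __init__(self, step_time_atol: float = 7200.0):
        # STEP_TIME_ATOL: maximum tolerated seconds per step (default: 2 hours).
        self.step_time_atol = step_time_atol
        self._step_start = None

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        self._step_start = time.monotonic()

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if self._step_start is None:
            return
        step_time = time.monotonic() - self._step_start
        if step_time > self.step_time_atol:
            # Actively kill the job instead of letting it hang indefinitely.
            raise RuntimeError(
                f"Step {batch_idx} took {step_time:.0f}s, exceeding "
                f"STEP_TIME_ATOL={self.step_time_atol:.0f}s"
            )


# Typical hookup (hypothetical): pass the threshold in via the environment.
# trainer = Trainer(callbacks=[StepTimeWatchdog(float(os.environ.get("STEP_TIME_ATOL", 7200)))])
```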

@Elnifio Elnifio requested a review from a team as a code owner March 5, 2025 23:07

github-actions bot commented Mar 5, 2025

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@@ -103,7 +104,7 @@ After the download is complete, you should see five files under `TOKENIZER_PATH`

### Training and test data separation

-We use the default split from the C4 dataset. This means that we use `c4-train.<x>-of-01024.json.gz` files for training and `c4-validation.<x>-of-00008.json.gz` files for evaluation.
+We use the default split from the C4 dataset. This means that we use `c4-train.<x>-of-01024.json.gz` files (where `768 <= x <= 1023`) for training, and we use our customized `c4-validation-91205-samples.en.json.gz`, which contains the first 91205 samples from the unshuffled C4 validation dataset, for evaluation.
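
For illustration, a rough Python sketch of the consolidation step described above. The actual logic lives in `utils/consolidate_data.sh`, so the shard naming pattern and output path here are assumptions: the idea is simply to take the first 91205 lines from the unshuffled validation shards, in order, and write them to a single file.

```python
# Hypothetical equivalent of the consolidation step; paths are assumptions.
import glob
import gzip

NUM_SAMPLES = 91205  # first N samples of the unshuffled C4 validation split

written = 0
with gzip.open("c4-validation-91205-samples.en.json.gz", "wt") as out:
    # Process validation shards in their original (unshuffled) order.
    for shard in sorted(glob.glob("c4-validation.*-of-00008.json.gz")):
        with gzip.open(shard, "rt") as f:
            for line in f:  # one JSON document per line
                if written >= NUM_SAMPLES:
                    break
                out.write(line)
                written += 1
        if written >= NUM_SAMPLES:
            break
```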
Contributor

Can you explain why and how 91205 samples were chosen?

Contributor Author

This is discussed in utils/consolidate_data.sh. I can copy the lines here to make it clearer in the README.

Contributor Author

I have added the description in this commit

ShriyaRishab previously approved these changes Mar 6, 2025

@suexu1025 suexu1025 left a comment


Thanks! LGTM

@nathanw-mlc
Member

@mlcommons/wg-training @Elnifio Are we good to approve and merge this PR? My team is waiting for this PR to be merged before removing duplicate data from R2. Thanks.

@ShriyaRishab ShriyaRishab merged commit 637c82f into mlcommons:master Mar 18, 2025
1 check passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 18, 2025
@nathanw-mlc
Member

Thanks!
