
[Llama 3.1] Updates MLLOG tags #790


Merged
merged 3 commits into from
Apr 8, 2025

Conversation

Contributor

@Elnifio Elnifio commented Apr 3, 2025

No description provided.

@Elnifio Elnifio requested a review from a team as a code owner April 3, 2025 15:58

github-actions bot commented Apr 3, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

ShriyaRishab previously approved these changes Apr 3, 2025

ShriyaRishab (Contributor)

@Elnifio can you verify that the logs from this reference indeed pass the latest checkers?

@@ -124,11 +124,11 @@ def log_metrics(self, metrics, step):
         assert step_time <= self.train_step_time_atol, f"Logged train step time ({step_time}) is slower than tolerable ({self.train_step_time_atol}). "

     def log_validation_loss(self, metrics, step):
-        consumed_tokens = (step - self.init_global_step) * self.gbs * self.seq_len
+        consumed_tokens = step * self.gbs
Contributor

shouldn't that be called consumed_samples?

Contributor Author

Updated in the latest commit. Thanks for catching that!
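Based on this thread, the helper presumably ends up computing and logging a sample count. A minimal sketch of that final shape, where the class wrapper, the `gbs` value, and the return are illustrative rather than taken from the reference code:

```python
# Hypothetical sketch of the helper after the rename discussed above:
# the value reported with validation loss is a sample count
# (global step * global batch size), not a token count.

class MetricsLogger:
    def __init__(self, gbs):
        self.gbs = gbs  # global batch size (samples consumed per training step)

    def log_validation_loss(self, metrics, step):
        # step * gbs counts samples, not tokens, hence the rename
        consumed_samples = step * self.gbs
        return consumed_samples  # e.g. attached as samples_count metadata
```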

@@ -202,12 +202,12 @@ def on_train_end(self, trainer, pl_module):

     @rank_zero_only
     def on_validation_start(self, trainer, pl_module):
-        mllogger.end(key=constants.BLOCK_STOP, metadata={'epoch_num': self.consumed_tokens(trainer)})
-        mllogger.start(key=constants.EVAL_START, metadata={'epoch_num': self.consumed_tokens(trainer)})
+        mllogger.end(key=constants.BLOCK_STOP, metadata={constants.SAMPLES_COUNT: self.consumed_tokens(trainer)})
Contributor

not sure where self.consumed_tokens() is defined, but does it return tokens or samples? If tokens, we need to switch it to samples. If samples, we need to rename it

Contributor Author

Updated in the latest commit.
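The pattern after this change is that MLLOG block/eval boundaries carry a samples count instead of `'epoch_num'`. A sketch of that pattern with stand-in names: the string value of `SAMPLES_COUNT` and the plain-function shape are assumptions here, not the actual mlperf-logging API surface:

```python
# Sketch of the metadata change discussed above: block/eval boundary
# events keyed by a samples count rather than 'epoch_num'.
# SAMPLES_COUNT mirrors constants.SAMPLES_COUNT; the emitter functions
# stand in for mllogger.end / mllogger.start.

SAMPLES_COUNT = "samples_count"  # assumed value of the constant

def emit_validation_boundary(mllog_end, mllog_start, consumed_samples):
    # close the current training block, then open the eval block,
    # both stamped with the number of samples consumed so far
    mllog_end(key="block_stop", metadata={SAMPLES_COUNT: consumed_samples})
    mllog_start(key="eval_start", metadata={SAMPLES_COUNT: consumed_samples})
```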

Contributor Author

Elnifio commented Apr 4, 2025

@ShriyaPalsamudram @mmarcinkiewicz the latest run fails the RCP checker with the following errors:

  • common.yaml line 128: missing train_samples - fixed in my latest commit, where I renamed trained_samples to train_samples
  • closed_llama31_405b.yaml line 35: incorrect opt_learning_rate_decay_schedule value - fixed in my latest commit, where I renamed "cosine with linear warmups" to "cosine with linear warmup"
  • common.yaml lines 97 and 103: missing epoch_num in the epoch_start and epoch_end tags - I'm wondering whether, similar to other tags, we should accept both epoch_num and samples_count in the metadata. What do you think?
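The relaxation proposed in the last bullet can be sketched as a metadata check that accepts either key. This is an illustrative stand-in, not the real compliance checker's rule syntax; only the key names come from the discussion above:

```python
# Hypothetical sketch of the proposed relaxed check: an epoch_start /
# epoch_end tag is valid if its metadata carries either 'epoch_num'
# or 'samples_count'.

def check_epoch_metadata(metadata):
    allowed_keys = ("epoch_num", "samples_count")
    if not any(key in metadata for key in allowed_keys):
        return False, f"missing one of {allowed_keys} in metadata"
    return True, ""
```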

Contributor Author

Elnifio commented Apr 7, 2025

Following up on my comment above: I have run the RCP checker against the latest compliance checker from PR 414 and it passed without any errors or warnings, so I'd say both PRs are ready to merge.

@ShriyaRishab ShriyaRishab merged commit ece3d15 into mlcommons:master Apr 8, 2025
1 check passed
@github-actions github-actions bot locked and limited conversation to collaborators Apr 8, 2025