
Commit 002b43e

addresses comments

1 parent 2d69703

large_language_model_pretraining/nemo/README.md

Lines changed: 4 additions & 2 deletions
@@ -102,13 +102,15 @@ After the download is complete, you should see five files under `TOKENIZER_PATH`
 
 We use the default split from the C4 dataset. This means that we use `c4-train.<x>-of-01024.json.gz` files (where `768 <= x <= 1023`) for training, and we use our customized `c4-validation-91205-samples.en.json.gz`, which contains the first 91205 samples from the unshuffled C4 validation dataset, for evaluation.
 
+Notice that we use the first 47,185,920 tokens (5,760 sequences) from the validation dataset to perform validation. According to our experiments, the first 91,205 samples from the unshuffled C4 validation dataset yield 47,186,855 tokens, which is the smallest number of samples needed to reach the 47,185,920-token target. Thus, we have chosen the first 91,205 samples as our validation dataset.
+
 ### Training data order
 
 We randomly shuffle the **last 256 of 1024 shards** for the benchmarking area.
 
 ### Test data order
 
-We use the first 47M tokens in the validation dataset for validation. We **do not shuffle** the validation dataset.
+We use the first 47,185,920 tokens in the validation dataset for validation. We **do not shuffle** the validation dataset.
 
 # 4. Model
 ### Publication/Attribution
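
For readers checking the arithmetic in the paragraph added above, here is a minimal sketch of how the 91,205-sample cutoff can be derived. It assumes the 8,192-token sequence length implied by 47,185,920 / 5,760 = 8,192; the `token_counts` input and the `smallest_sample_count` helper are hypothetical illustrations, not part of the benchmark code.

```python
# Minimal sketch of the sample-count derivation; not part of the benchmark code.
SEQ_LEN = 8192                               # implied by 47,185,920 / 5,760
NUM_VAL_SEQUENCES = 5760
TARGET_TOKENS = SEQ_LEN * NUM_VAL_SEQUENCES  # 47,185,920

def smallest_sample_count(token_counts):
    """Return (num_samples, total_tokens) for the smallest prefix of samples
    whose cumulative token count reaches TARGET_TOKENS.

    token_counts[i] is the tokenized length of the i-th sample of the
    unshuffled C4 validation set (hypothetical input)."""
    total = 0
    for i, n in enumerate(token_counts, start=1):
        total += n
        if total >= TARGET_TOKENS:
            return i, total
    raise ValueError("validation set has fewer than TARGET_TOKENS tokens")

# Per the README text above, on the real data this would return
# (91205, 47186855): the first 91,205 samples yield 47,186,855 tokens.
```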
@@ -159,7 +161,7 @@ Validation log perplexity = 5.6
 
 ### Evaluation frequency
 
-We perform evaluation every **46080** sequences.
+We perform evaluation every **46,080** sequences.
 
 ### Evaluation thoroughness
 
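
The training data order described in the first hunk (shuffling the last 256 of 1024 shards) can likewise be pictured with a short sketch. The shard-name pattern comes from the README; the zero-padding of the shard index, the seed, and the use of Python's `random` module are illustrative assumptions, and the actual benchmark implementation may differ.

```python
# Illustrative sketch of the training-shard ordering; seed handling and
# shuffle mechanics in the actual benchmark code may differ.
import random

# Training uses the last 256 of the 1024 C4 shards (768 <= x <= 1023);
# 5-digit zero-padding of the shard index is assumed here.
train_shards = [f"c4-train.{x:05d}-of-01024.json.gz" for x in range(768, 1024)]

rng = random.Random(0)     # hypothetical seed
rng.shuffle(train_shards)  # shard order is randomized for the benchmarking area
```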