Commit f38b11e

uses sequences here, instead of tokens
1 parent 002b43e commit f38b11e

File tree

  • large_language_model_pretraining/nemo

1 file changed: +2 −2 lines changed

large_language_model_pretraining/nemo/README.md

Lines changed: 2 additions & 2 deletions
@@ -102,15 +102,15 @@ After the download is complete, you should see five files under `TOKENIZER_PATH`
 
 We use the default split from the C4 dataset. This means that we use `c4-train.<x>-of-01024.json.gz` files (where `768 <= x <= 1023`) for training, and we use our customized `c4-validation-91205-samples.en.json.gz`, which contains the first 91205 samples from the unshuffled C4 validation dataset, for evaluation.
 
-Notice here that we are using the first 47,185,920 tokens (5760 sequences) from the validation dataset to perform the validation. According to our experiments, the first 91205 samples from the unshuffled C4 dataset yields 47,186,855 tokens, which is the smallest amount of samples needed to yield 47,185,920 tokens. Thus, we have chosen the first 91205 samples as our validation dataset.
+Notice here that we are using the first 5760 sequences (47,185,920 tokens) from the validation dataset to perform the validation. According to our experiments, the first 91205 samples from the unshuffled C4 dataset yields 47,186,855 tokens, which is the smallest amount of samples needed to yield 47,185,920 tokens. Thus, we have chosen the first 91205 samples as our validation dataset.
 
 ### Training data order
 
 We randomly shuffle the **last 256 of 1024 shards** for the benchmarking area.
 
 ### Test data order
 
-We use the first 47,185,920 tokens in the validation dataset for validation. We **do not shuffle** the validation dataset.
+We use the first 5,760 sequences (91,205 untokenized samples) in the validation dataset for validation. We **do not shuffle** the validation dataset.
 
 # 4. Model
 ### Publication/Attribution
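
For reference, the arithmetic behind the changed lines can be sanity-checked with a short Python sketch. This is not part of the commit: the 8192-token sequence length is an assumption inferred from 47,185,920 / 5760, and the zero-padded shard file names are assumed from the `c4-train.<x>-of-01024.json.gz` pattern; both should be confirmed against the actual benchmark config.

```python
# Sketch only: sanity-check the sequence/token arithmetic in the changed README lines.

ASSUMED_SEQ_LEN = 8192                        # assumed tokens per sequence (47,185,920 / 5760)
NUM_VAL_SEQUENCES = 5760                      # sequences used for validation
VAL_TOKEN_BUDGET = NUM_VAL_SEQUENCES * ASSUMED_SEQ_LEN
assert VAL_TOKEN_BUDGET == 47_185_920         # token count quoted in the README

# The first 91,205 unshuffled C4 validation samples were measured to yield
# 47,186,855 tokens, the smallest sample prefix covering the 47,185,920-token budget.
assert 47_186_855 >= VAL_TOKEN_BUDGET

# Training shards from the default C4 split: c4-train.<x>-of-01024.json.gz for
# 768 <= x <= 1023, i.e. the last 256 of 1024 shards (the ones that get shuffled).
# Five-digit zero padding of the shard index is assumed here.
train_shards = [f"c4-train.{x:05d}-of-01024.json.gz" for x in range(768, 1024)]
assert len(train_shards) == 256
```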
