Commit f38b11e

uses sequences here, instead of tokens
1 parent 002b43e commit f38b11e

File tree

  • large_language_model_pretraining/nemo

1 file changed: +2 −2 lines changed

large_language_model_pretraining/nemo/README.md

Lines changed: 2 additions & 2 deletions
@@ -102,15 +102,15 @@ After the download is complete, you should see five files under `TOKENIZER_PATH`
 
 We use the default split from the C4 dataset. This means that we use `c4-train.<x>-of-01024.json.gz` files (where `768 <= x <= 1023`) for training, and we use our customized `c4-validation-91205-samples.en.json.gz`, which contains the first 91205 samples from the unshuffled C4 validation dataset, for evaluation.
 
-Notice here that we are using the first 47,185,920 tokens (5760 sequences) from the validation dataset to perform the validation. According to our experiments, the first 91205 samples from the unshuffled C4 dataset yields 47,186,855 tokens, which is the smallest amount of samples needed to yield 47,185,920 tokens. Thus, we have chosen the first 91205 samples as our validation dataset.
+Notice here that we are using the first 5760 sequences (47,185,920 tokens) from the validation dataset to perform the validation. According to our experiments, the first 91205 samples from the unshuffled C4 dataset yields 47,186,855 tokens, which is the smallest amount of samples needed to yield 47,185,920 tokens. Thus, we have chosen the first 91205 samples as our validation dataset.
 
 ### Training data order
 
 We randomly shuffle the **last 256 of 1024 shards** for the benchmarking area.
 
 ### Test data order
 
-We use the first 47,185,920 tokens in the validation dataset for validation. We **do not shuffle** the validation dataset.
+We use the first 5,760 sequences (91,205 untokenized samples) in the validation dataset for validation. We **do not shuffle** the validation dataset.
 
 # 4. Model
 ### Publication/Attribution
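
For reference, the arithmetic behind the changed lines can be sanity-checked with a short Python sketch. This is not part of the commit: the 8192-token sequence length is an assumption inferred from 47,185,920 / 5760, and the zero-padded shard file names are assumed from the `c4-train.<x>-of-01024.json.gz` pattern; both should be confirmed against the actual benchmark config.

```python
# Sketch only: sanity-check the sequence/token arithmetic in the changed README lines.

ASSUMED_SEQ_LEN = 8192                        # assumed tokens per sequence (47,185,920 / 5760)
NUM_VAL_SEQUENCES = 5760                      # sequences used for validation
VAL_TOKEN_BUDGET = NUM_VAL_SEQUENCES * ASSUMED_SEQ_LEN
assert VAL_TOKEN_BUDGET == 47_185_920         # token count quoted in the README

# The first 91,205 unshuffled C4 validation samples were measured to yield
# 47,186,855 tokens, the smallest sample prefix covering the 47,185,920-token budget.
assert 47_186_855 >= VAL_TOKEN_BUDGET

# Training shards from the default C4 split: c4-train.<x>-of-01024.json.gz for
# 768 <= x <= 1023, i.e. the last 256 of 1024 shards (the ones that get shuffled).
# Five-digit zero padding of the shard index is assumed here.
train_shards = [f"c4-train.{x:05d}-of-01024.json.gz" for x in range(768, 1024)]
assert len(train_shards) == 256
```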
