
Commit 002b43e

addresses comments

1 parent 2d69703

large_language_model_pretraining/nemo/README.md

Lines changed: 4 additions & 2 deletions
@@ -102,13 +102,15 @@ After the download is complete, you should see five files under `TOKENIZER_PATH`
 
 We use the default split from the C4 dataset. This means that we use `c4-train.<x>-of-01024.json.gz` files (where `768 <= x <= 1023`) for training, and we use our customized `c4-validation-91205-samples.en.json.gz`, which contains the first 91205 samples from the unshuffled C4 validation dataset, for evaluation.
 
+Notice that we use the first 47,185,920 tokens (5,760 sequences) from the validation dataset to perform validation. According to our experiments, the first 91,205 samples from the unshuffled C4 validation dataset yield 47,186,855 tokens, which is the smallest number of samples needed to reach the 47,185,920-token target. Thus, we have chosen the first 91,205 samples as our validation dataset.
+
 ### Training data order
 
 We randomly shuffle the **last 256 of 1024 shards** for the benchmarking area.
 
 ### Test data order
 
-We use the first 47M tokens in the validation dataset for validation. We **do not shuffle** the validation dataset.
+We use the first 47,185,920 tokens in the validation dataset for validation. We **do not shuffle** the validation dataset.
 
 # 4. Model
 ### Publication/Attribution
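
For readers checking the arithmetic in the paragraph added above, here is a minimal sketch of how the 91,205-sample cutoff can be derived. It assumes the 8,192-token sequence length implied by 47,185,920 / 5,760 = 8,192; the `token_counts` input and the `smallest_sample_count` helper are hypothetical illustrations, not part of the benchmark code.

```python
# Minimal sketch of the sample-count derivation; not part of the benchmark code.
SEQ_LEN = 8192                               # implied by 47,185,920 / 5,760
NUM_VAL_SEQUENCES = 5760
TARGET_TOKENS = SEQ_LEN * NUM_VAL_SEQUENCES  # 47,185,920

def smallest_sample_count(token_counts):
    """Return (num_samples, total_tokens) for the smallest prefix of samples
    whose cumulative token count reaches TARGET_TOKENS.

    token_counts[i] is the tokenized length of the i-th sample of the
    unshuffled C4 validation set (hypothetical input)."""
    total = 0
    for i, n in enumerate(token_counts, start=1):
        total += n
        if total >= TARGET_TOKENS:
            return i, total
    raise ValueError("validation set has fewer than TARGET_TOKENS tokens")

# Per the README text above, on the real data this would return
# (91205, 47186855): the first 91,205 samples yield 47,186,855 tokens.
```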
@@ -159,7 +161,7 @@ Validation log perplexity = 5.6
 
 ### Evaluation frequency
 
-We perform evaluation every **46080** sequences.
+We perform evaluation every **46,080** sequences.
 
 ### Evaluation thoroughness
 
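
The training data order described in the first hunk (shuffling the last 256 of 1024 shards) can likewise be pictured with a short sketch. The shard-name pattern comes from the README; the zero-padding of the shard index, the seed, and the use of Python's `random` module are illustrative assumptions, and the actual benchmark implementation may differ.

```python
# Illustrative sketch of the training-shard ordering; seed handling and
# shuffle mechanics in the actual benchmark code may differ.
import random

# Training uses the last 256 of the 1024 C4 shards (768 <= x <= 1023);
# 5-digit zero-padding of the shard index is assumed here.
train_shards = [f"c4-train.{x:05d}-of-01024.json.gz" for x in range(768, 1024)]

rng = random.Random(0)     # hypothetical seed
rng.shuffle(train_shards)  # shard order is randomized for the benchmarking area
```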