
Commit 1ec961f

[LLM] Adds Llama 3.1 reference implementation code (#781)
* Initial commit of Llama 3.1 405B ref
* removes comments
* adds checkpoint loading and full C4 dataset loading
* updates checkpointing and instructions
* adds MLPerf callbacks
* Changes the dataset sources and adds multiple seeds
* Resolves comments
* updates instructions
* patches to download instructions
* renames folder
1 parent 4733379 commit 1ec961f

9 files changed: +1362 -0 lines changed
Dockerfile: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

ARG NEMO_BASE_IMAGE=nvcr.io/nvidia/nemo:24.12-rc0
FROM ${NEMO_BASE_IMAGE} AS nemo-base-image

RUN pip uninstall transformers -y
RUN pip install transformers==4.47.1 blobfile==3.0.0
RUN pip install prettytable==3.12.0
RUN pip install git+https://github.com/mlcommons/[email protected]

# setup workspace
WORKDIR /workspace/llama31
COPY . .

# Fixes the validation dataset order
RUN patch --directory=/opt/megatron-lm -p1 < mcore.patch
README.md: 203 additions & 0 deletions
@@ -0,0 +1,203 @@
# 1. Problem

Large Language Model pretraining - Llama 3.1 405B

# 2. Directions

### Steps to configure machine

To use this repository, please install a supported version of PyTorch with GPU support (Python 3.10, PyTorch 2.4, CUDA 12.5, and NCCL 2.22.3 or later) and NVIDIA APEX. **Slurm-based clusters are required to run the reference**.
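
If you want to confirm that the environment inside the container meets these requirements, a quick check such as the following can help (a minimal sketch; it only prints the versions that PyTorch reports):

```bash
# Print the Python, PyTorch, CUDA, and NCCL versions visible to PyTorch.
python3 --version
python3 -c "import torch; print('torch:', torch.__version__)"
python3 -c "import torch; print('cuda :', torch.version.cuda)"
python3 -c "import torch; print('nccl :', torch.cuda.nccl.version())"
```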

We recommend using the latest NeMo FW container. The latest tested compatible version is `nvcr.io/nvidia/nemo:24.12-rc0`.

#### Container Setup

All of the following commands are assumed to be run within a container. A [Dockerfile](./Dockerfile) is available for building containers on top of `nvcr.io/nvidia/nemo:24.12-rc0`.

To build the container:

```bash
docker build -t <tag> -f Dockerfile .
```

To launch the container:

```bash
docker run -it --rm \
    --network=host --ipc=host \
    -v ~/.ssh:/root/.ssh \
    <tag> bash
```

Note: it's recommended to map your `.ssh` folder into the container, so that it's easier for the code to set up remote cluster access.
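
Depending on where you keep the dataset, tokenizer, and checkpoints on the host, you will likely also want to mount those directories when launching the container. The flags below are illustrative only; the host paths and in-container mount points are assumptions, not fixed by the reference:

```bash
docker run -it --rm \
    --network=host --ipc=host \
    -v ~/.ssh:/root/.ssh \
    -v /path/to/preprocessed_c4:/data/c4 \
    -v /path/to/tokenizer:/data/tokenizer \
    <tag> bash
```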

### Steps to download and verify data

The current codebase is still using GPT3's train/val datasets and SentencePieceModel tokenizer. Please refer to [GPT3 instructions](https://github.com/mlcommons/training/tree/master/large_language_model/megatron-lm#preprocessed-data-download) to download **the raw C4 dataset** that we can preprocess later.
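
For reference, one way to pull the raw C4 files from HuggingFace is via git-lfs, restricting the checkout to the `en` subset. This is a sketch under the assumption that git-lfs is installed and that the `allenai/c4` repository layout has not changed; follow the GPT3 instructions linked above if they differ:

```bash
git lfs install
# Clone the repository metadata only, then fetch just the English subset.
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "en/*"
```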

### Steps to run and time

To train Llama 3.1 405B, we need to fill out all fields in [config.sh](./config.sh). This file holds the Slurm cluster access and job submission settings, directory mappings, container image, and model configuration.
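
The exact variable names are defined in `config.sh` itself; the sketch below only illustrates the kinds of values you are expected to fill in. Every name here is hypothetical and is not the actual interface used by `run_llama31.sh`:

```bash
# Hypothetical example only -- use the field names that config.sh actually defines.
export SLURM_ACCOUNT="my_account"        # Slurm account used for job submission
export SLURM_PARTITION="batch"           # Slurm partition to submit to
export CONT_IMAGE_URL="/path/to/image"   # container built from the Dockerfile above
export PREPROCESSED_PATH="/data/c4"      # preprocessed .bin/.idx files
export TOKENIZER_PATH="/data/tokenizer"  # Mixtral 8x22B tokenizer files
```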

Once `config.sh` is properly filled in, run the following code snippet **inside the container**:

```bash
source config.sh
bash run_llama31.sh
```

# 3. Dataset/Environment

### Publication/Attribution

We use the c4/en/3.0.1 dataset from [HuggingFace/AllenAI](https://huggingface.co/datasets/allenai/c4).

We use the Mixtral 8x22B tokenizer from [HuggingFace/MistralAI](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1).

### Preprocessed data download

The pre-tokenized dataset and the tokenizer are available for download from the MLCommons S3-compatible bucket. You can download this data from the bucket using Rclone as follows:

To run Rclone on Windows, you can download the executable from https://rclone.org/downloads/. To install Rclone on Linux/macOS/BSD systems, run:

```bash
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```

Once Rclone is installed, run the following command to authenticate with the bucket:

```bash
rclone config create mlc-training s3 provider=Cloudflare access_key_id=76ea42eadb867e854061a1806220ee1e secret_access_key=a53625c4d45e3ca8ac0df8a353ea3a41ffc3292aa25259addd8b7dc5a6ce2936 endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```
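
To confirm that the remote was configured correctly before starting a large transfer, you can list the benchmark directory first (a quick sanity check; the path is the one referenced by the copy commands below):

```bash
# Should list the llama3_1 dataset folders (e.g. preprocessed_c4 and tokenizer).
rclone lsd mlc-training:mlcommons-training-wg-public/llama3_1/datasets
```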

You can then navigate in the terminal to your desired download directory and run the following commands to download the dataset and the tokenizer:

#### Dataset

```bash
# Replace this path with your desired path on the machine
export PREPROCESSED_PATH="./"
rclone copy mlc-training:mlcommons-training-wg-public/llama3_1/datasets/preprocessed_c4 $PREPROCESSED_PATH -P
```

After the download is complete, you should see files with the following naming conventions under `PREPROCESSED_PATH`, ending with both `.idx` and `.bin` (a quick check is sketched below):
- Training partitions: `c4-train.en_<number>_text_document`
- Validation partitions: `c4-validation.en_text_document`
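
A minimal sanity check on the downloaded dataset, assuming the naming conventions above (the exact shard count is not pinned down here, so just confirm that every `.bin` has a matching `.idx`):

```bash
# Count training shards and make sure the .bin and .idx counts match.
ls "$PREPROCESSED_PATH"/c4-train.en_*_text_document.bin | wc -l
ls "$PREPROCESSED_PATH"/c4-train.en_*_text_document.idx | wc -l
# The validation partition should be present as a single .bin/.idx pair.
ls "$PREPROCESSED_PATH"/c4-validation.en_text_document.{bin,idx}
```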

#### Tokenizer

```bash
# Replace this path with your desired path on the machine
export TOKENIZER_PATH="./"
rclone copy mlc-training:mlcommons-training-wg-public/llama3_1/datasets/tokenizer $TOKENIZER_PATH -P
```

After the download is complete, you should see five files under `TOKENIZER_PATH`:
- `special_tokens_map.json`
- `tokenizer.json`
- `tokenizer.model`
- `tokenizer.model.v1`
- `tokenizer_config.json`

### Training and test data separation

To be determined. For now, we are using the default split from the C4 dataset.

### Training data order

To be determined. The current plan is to use the last 256 of the 1024 files (shards 6 and 7) for the benchmarked area.

### Test data order

To be determined.

# 4. Model

### Publication/Attribution

The model largely follows the Llama 3.1 405B [paper](https://arxiv.org/abs/2407.21783). The main difference is that the model parameters are *to be determined from experiments*.

### Model details

| Config | Value |
| :-- | :-- |
| Embedding | RoPE + parameter adjustments |
| # Layers | 126 |
| Attention Type | GQA |
| # Attn Heads | 128 |
| Key/Value Heads | 8 |
| Model Dimension | 16,384 |
| Hidden Dimension | 53,248 |
| Activation | SwiGLU |
| Normalization | RMSNorm |
| Tokenizer | TikTokenizer |
| Vocab size | 128,000 |
| Context Length | 8,192 |
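
As a rough cross-check that this configuration lands near 405B parameters, the table values can be plugged into a back-of-the-envelope count (a sketch that ignores norm weights and assumes untied input/output embeddings):

```bash
# Back-of-the-envelope parameter count from the table above.
LAYERS=126 D_MODEL=16384 D_FF=53248 HEADS=128 KV_HEADS=8 VOCAB=128000
HEAD_DIM=$((D_MODEL / HEADS))
# Q and O are full projections; K and V are shared across query groups (GQA).
ATTN=$((2 * D_MODEL * D_MODEL + 2 * D_MODEL * KV_HEADS * HEAD_DIM))
# SwiGLU uses three projections: gate, up, and down.
MLP=$((3 * D_MODEL * D_FF))
# Untied input embedding plus output head.
EMBED=$((2 * VOCAB * D_MODEL))
echo "approx. total parameters: $((LAYERS * (ATTN + MLP) + EMBED))"
```

This evaluates to roughly 405.8 billion parameters, consistent with the 405B model size.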

### Checkpoint download and conversion

To be determined. For now, we are not using the Llama 3.1 default checkpoint.

~~To experiment with a given checkpoint, we have added a `--ckpt` argument that loads the pretrained checkpoint from a **NeMo checkpoint path**, which requires some checkpoint format conversion if the original checkpoint is in LlamaStack or HuggingFace format.~~

#### Saving and restoring a checkpoint

Large runs might need to span multiple Slurm jobs, so we need to save and load checkpoints with contexts so that training can resume between jobs. To support this, we have added some environment variables. Please refer to `config.sh` for more details.

### Optimizer

Adam

# 5. Quality

### Quality metric

Log Perplexity

### Quality target

To be determined.

### Evaluation frequency

To be determined.

### Evaluation thoroughness

To be determined.

# 6. Other

### Data Preprocessing

Here are the instructions to prepare the preprocessed dataset from scratch. Data preprocessing is already done, and the final dataset can be accessed by following the instructions in the [Preprocessed data download](#preprocessed-data-download) section above.

#### Tokenizer

We use the Mixtral 8x22B tokenizer in this benchmark. Tokenizer files can be downloaded [here](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1/tree/main). Only the five files containing tokenizer-related contents (`special_tokens_map.json`, `tokenizer.json`, `tokenizer.model`, `tokenizer.model.v1`, `tokenizer_config.json`) are needed.
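
One way to fetch just those five files is with the HuggingFace CLI. This is a sketch; it assumes `huggingface_hub` is installed, and the repository may require you to be logged in and to have accepted its terms:

```bash
pip install -U "huggingface_hub[cli]"
# huggingface-cli login   # only needed if the repository is gated for your account
huggingface-cli download mistralai/Mixtral-8x22B-v0.1 \
    special_tokens_map.json tokenizer.json tokenizer.model tokenizer.model.v1 tokenizer_config.json \
    --local-dir "$TOKENIZER_PATH"
```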

#### Run Data preprocessing

Run the following commands to merge all 1024 training files into 8 `json.gz` files and all 8 validation files into a single `json.gz` file. Each of the `json.gz` files will be preprocessed into a pair of Megatron dataset files (`.bin` and `.idx`).

```bash
# path to the raw C4 dataset downloaded earlier
export C4_PATH=""
# path where the merged json.gz files will be written
export MERGED_C4_PATH=""

bash consolidate_data.sh
```

After the data consolidation is done, we can run this [script](./utils/preprocess.sh) to perform preprocessing. To run the preprocessing script, use the following commands:

```bash
# fill in the built container path here
export CONT_IMAGE_URL=""
# pass in the folder path that contains the Mixtral tokenizer here
# please refer to the tokenizer section above for more details
export TOKENIZER_PATH=""
# pass in the merged file path here
export MERGED_C4_PATH=""
# this path is used for storing the preprocessed .bin and .idx files
export PREPROCESSED_PATH=""

sbatch preprocess.sh
```
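
Since preprocessing runs as a Slurm batch job, you can track it and confirm the outputs afterwards with something like the following (a sketch; the job name and log location depend on `preprocess.sh` and your cluster):

```bash
# Watch the preprocessing job until it finishes.
squeue -u "$USER"
# Once it completes, the preprocessed dataset should appear as .bin/.idx pairs.
ls "$PREPROCESSED_PATH"/*.bin "$PREPROCESSED_PATH"/*.idx
```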
