# 1. Problem

Large Language Model pretraining - Llama 3.1 405B

# 2. Directions

### Steps to configure machine

To use this repository, please install a supported version of PyTorch with GPU support (Python 3.10, PyTorch 2.4, CUDA 12.5, and NCCL 2.22.3 or above) and NVIDIA APEX. **A Slurm-based cluster is required to run the reference.**
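
As an optional sanity check (not part of the reference workflow), you can confirm that the environment inside the container reports the expected versions:

```bash
# Print the PyTorch, CUDA, and NCCL versions visible to PyTorch
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.nccl.version())"
```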

We recommend using the latest NeMo FW container. The latest tested compatible version is `nvcr.io/nvidia/nemo:24.12-rc0`.

#### Container Setup

All of the following commands are assumed to be run within a container. A [Dockerfile](./Dockerfile) is available for building containers on top of `nvcr.io/nvidia/nemo:24.12-rc0`.

To build the container:

```bash
docker build -t <tag> -f Dockerfile .
```

To launch the container:

```bash
docker run -it --rm \
  --network=host --ipc=host \
  -v ~/.ssh:/root/.ssh \
  <tag> bash
```

Note: it is recommended to mount your `.ssh` folder into the container so that the code can more easily set up remote cluster access.

### Steps to download and verify data

The current codebase still uses GPT3's train/val datasets and SentencePieceModel tokenizer. Please refer to the [GPT3 instructions](https://github.com/mlcommons/training/tree/master/large_language_model/megatron-lm#preprocessed-data-download) to download **the raw C4 dataset**, which we can preprocess later.

### Steps to run and time

To train Llama 3.1 405B, we need to fill out all fields in [config.sh](./config.sh). This file contains all configurations for Slurm cluster access and job submission, directory mappings, containers, and the model.
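
The variable names below are purely illustrative (the authoritative list of fields lives in `config.sh` itself); this is only a sketch of the kinds of values that need to be provided:

```bash
# Hypothetical example values -- consult config.sh for the actual variable names.
export SLURM_HOST="login.my-cluster.example.com"   # cluster login node (assumption)
export SLURM_PARTITION="batch"                     # partition to submit jobs to (assumption)
export CONT_IMAGE_URL="<tag>"                      # container image built in the previous step
export PREPROCESSED_PATH="/path/to/preprocessed_c4"
export TOKENIZER_PATH="/path/to/tokenizer"
```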

Once `config.sh` is properly filled out, run the following commands **inside the container**:

```bash
source config.sh
bash run_llama31.sh
```
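
If the run is submitted as a Slurm job (as the Slurm settings in `config.sh` suggest), progress can be followed with standard Slurm tooling; the job ID below is a placeholder, and the actual job name and log locations are determined by `config.sh`:

```bash
squeue -u $USER                                   # list your queued and running jobs
sacct -j <jobid> -o JobID,JobName,State,Elapsed   # inspect a job after it finishes
```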

# 3. Dataset/Environment
### Publication/Attribution

We use the c4/en/3.0.1 dataset from [HuggingFace/AllenAI](https://huggingface.co/datasets/allenai/c4).

We use the Mixtral 8x22B tokenizer from [HuggingFace/MistralAI](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1).

### Preprocessed data download

The pre-tokenized dataset and the tokenizer are available to download from an S3-compatible storage bucket. You can download this data from the bucket using Rclone as follows:

To run Rclone on Windows, you can download the executable from the Rclone website. To install Rclone on Linux/macOS/BSD systems, run:

```
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```

Once Rclone is installed, run the following command to authenticate with the bucket:

```
rclone config create mlc-training s3 provider=Cloudflare access_key_id=76ea42eadb867e854061a1806220ee1e secret_access_key=a53625c4d45e3ca8ac0df8a353ea3a41ffc3292aa25259addd8b7dc5a6ce2936 endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```

You can then navigate in the terminal to your desired download directory and run the following commands to download the dataset and the tokenizer:

#### Dataset

```
# Replace this path with your desired path on the machine
export PREPROCESSED_PATH="./"
rclone copy mlc-training:mlcommons-training-wg-public/llama3_1/datasets/preprocessed_c4 $PREPROCESSED_PATH -P
```

After the download is complete, you should see files with the following naming conventions under `PREPROCESSED_PATH`, ending with both `.idx` and `.bin`:
- Training partitions: `c4-train.en_<number>_text_document`
- Validation partitions: `c4-validation.en_text_document`
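
As an optional quick check (a sketch assuming the naming convention above), you can confirm that every partition arrived with its matching `.bin`/`.idx` pair:

```bash
# Count the training .bin and .idx files (the two counts should match)
ls $PREPROCESSED_PATH/c4-train.en_*_text_document.bin | wc -l
ls $PREPROCESSED_PATH/c4-train.en_*_text_document.idx | wc -l
# The validation partition should have both files as well
ls $PREPROCESSED_PATH/c4-validation.en_text_document.{bin,idx}
```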

#### Tokenizer

```
# Replace this path with your desired path on the machine
export TOKENIZER_PATH="./"
rclone copy mlc-training:mlcommons-training-wg-public/llama3_1/datasets/tokenizer $TOKENIZER_PATH -P
```

After the download is complete, you should see five files under `TOKENIZER_PATH`:
- `special_tokens_map.json`
- `tokenizer.json`
- `tokenizer.model`
- `tokenizer.model.v1`
- `tokenizer_config.json`

### Training and test data separation

To be determined. For now, we are using the default split from the C4 dataset.

### Training data order

To be determined. The current plan is to use the last 256 of the 1024 files (shards 6 and 7) for the benchmarked area.

### Test data order

To be determined.

# 4. Model
### Publication/Attribution

The model largely follows the Llama 3.1 405B [paper](https://arxiv.org/abs/2407.21783). The main difference is that the model parameters are *to be determined from experiments*.

### Model details

| Config | Value |
| :-- | :-- |
| Embedding | RoPE + parameter adjustments |
| # Layers | 126 |
| Attention Type | GQA |
| # Attn Heads | 128 |
| Key/Value Heads | 8 |
| Model Dimension | 16,384 |
| Hidden Dimension | 53,248 |
| Activation | SwiGLU |
| Normalization | RMSNorm |
| Tokenizer | TikTokenizer |
| Vocab size | 128,000 |
| Context Length | 8,192 |
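
As a back-of-the-envelope cross-check (a sketch based only on the table above, assuming untied input/output embeddings, no biases, and ignoring the small RMSNorm weights), this configuration lands at roughly 405B parameters:

```bash
D=16384; H=53248; L=126; V=128000          # model dim, FFN hidden dim, layers, vocab size
KV=$((8 * D / 128))                        # 8 KV heads x per-head dim (GQA)
ATTN=$((D*D + D*KV + D*KV + D*D))          # Q, K, V, O projections per layer
MLP=$((3 * D * H))                         # SwiGLU: gate, up, and down projections per layer
TOTAL=$((L * (ATTN + MLP) + 2 * V * D))    # all layers + input/output embeddings
echo "~$((TOTAL / 1000000000))B parameters"   # prints ~405B
```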

### Checkpoint download and conversion

To be determined. For now, we are not using the default Llama 3.1 checkpoint.

~~To experiment with a given checkpoint, we have added a `--ckpt` argument that loads the pretrained checkpoint from a **NeMo checkpoint path**, which requires some checkpoint format conversion if the original checkpoint is in LlamaStack or HuggingFace format.~~

#### Saving and restoring a checkpoint

Large runs might need to span multiple Slurm jobs, so we need to save and load checkpoints, together with their training context, so that training can resume across jobs. To support this, we have added some environment variables; please refer to `config.sh` for more details.

### Optimizer

Adam

# 5. Quality
### Quality metric

Log Perplexity

### Quality target

To be determined.

### Evaluation frequency

To be determined.

### Evaluation thoroughness

To be determined.


# 6. Other

### Data Preprocessing

Here are the instructions to prepare the preprocessed dataset from scratch. Data preprocessing has already been done, and the final dataset can be accessed by following the instructions in the [Preprocessed data download](#preprocessed-data-download) section.

#### Tokenizer

We use the Mixtral 8x22B tokenizer in this benchmark. The tokenizer files can be downloaded [here](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1/tree/main). Only the five files containing tokenizer-related contents (`special_tokens_map.json`, `tokenizer.json`, `tokenizer.model`, `tokenizer.model.v1`, `tokenizer_config.json`) are needed.
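
Any download method works; as one option (assuming `huggingface-cli` is installed and that you have access to the gated Mixtral repository), the five files can be fetched directly:

```bash
# Download only the tokenizer-related files into TOKENIZER_PATH
huggingface-cli download mistralai/Mixtral-8x22B-v0.1 \
  special_tokens_map.json tokenizer.json tokenizer.model tokenizer.model.v1 tokenizer_config.json \
  --local-dir $TOKENIZER_PATH
```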

#### Run Data preprocessing

Run the following commands to merge all 1024 training files into 8 `json.gz` files and all 8 validation files into a single `json.gz` file. Each of the `json.gz` files will be preprocessed into a pair of Megatron dataset files (`.bin` and `.idx`).

```bash
# path to the raw C4 dataset downloaded earlier
export C4_PATH=""
# path where the merged json.gz files will be written
export MERGED_C4_PATH=""

bash consolidate_data.sh
```

After the data consolidation is done, we can run the preprocessing [script](./utils/preprocess.sh) with the following commands:

```bash
# fill in the built container path here
export CONT_IMAGE_URL=""
# pass in the folder path that contains the Mixtral tokenizer here
# please refer to the tokenizer section above for more details
export TOKENIZER_PATH=""
# pass in the merged file path here
export MERGED_C4_PATH=""
# this path is used for storing the preprocessed .bin and .idx files
export PREPROCESSED_PATH=""

sbatch preprocess.sh
```