```python
from llm_trainer import create_dataset, LLMTrainer

create_dataset(save_dir="data")  # Generate the default FineWeb dataset
model = ...                      # Define or load your model (GPT, xLSTM, Mamba...)
trainer = LLMTrainer(model)      # Initialize the trainer with default settings
trainer.train(data_dir="data")   # Start training on the dataset
```
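For instance, a Hugging Face causal LM fits the expected interface out of the box. A minimal sketch using a small GPT-2 configuration (the `transformers` dependency and the layer sizes are illustrative choices, not defaults of this library):

```python
from transformers import GPT2Config, GPT2LMHeadModel

from llm_trainer import LLMTrainer

# A deliberately small GPT-2 for quick experiments (sizes are illustrative).
config = GPT2Config(n_layer=4, n_head=4, n_embd=256, n_positions=128)
model = GPT2LMHeadModel(config)

# Hugging Face models return an output object with a `.logits` attribute,
# which matches the trainer's default `model_returns_logits=False` (see below).
trainer = LLMTrainer(model)
trainer.train(data_dir="data")
```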
🔴 YouTube Video: Train LLMs in code, spelled out
> [!NOTE]
> Explore usage examples.
```bash
pip install llm-trainer
```
```python
from llm_trainer import create_dataset

create_dataset(save_dir="data",     # Where to save the created dataset
               chunks_limit=1_500,  # Maximum number of files (chunks) with tokens to create
               chunk_size=int(1e6)) # Number of tokens per chunk
```
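With these settings, the dataset is capped at 1,500 chunks × 1,000,000 tokens each, i.e. at most 1.5 billion tokens.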
- Your dataset should be structured as a JSON array, where each entry contains a "text" field. You can store your data in one or multiple JSON files.
Example JSON file:

```json
[
    {"text": "Learn about LLMs: https://www.youtube.com/@_NickTech"},
    {"text": "Open-source python library to train LLMs: https://github.com/Skripkon/llm_trainer."},
    {"text": "My name is Nikolay Skripko. Hello from Russia (2025)."}
]
```
- Run the following code to convert your JSON files into a tokenized dataset:

```python
from llm_trainer import create_dataset_from_json

create_dataset_from_json(save_dir="data",        # Where to save the created dataset
                         json_dir="json_files",  # Path to your JSON files
                         chunks_limit=1_500,     # Maximum number of files (chunks) with tokens to create
                         chunk_size=int(1e6))    # Number of tokens per chunk
```
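If your corpus lives elsewhere (plain text files, a database, etc.), writing compatible JSON takes only the standard library. A minimal sketch (the directory and file names are illustrative):

```python
import json
import os

docs = [
    "First document of my corpus...",
    "Second document of my corpus...",
]

# Write the documents in the expected format: a JSON array of {"text": ...} entries.
os.makedirs("json_files", exist_ok=True)
with open("json_files/corpus_00.json", "w", encoding="utf-8") as f:
    json.dump([{"text": doc} for doc in docs], f, ensure_ascii=False, indent=2)
```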
You can train ANY LLM that expects a tensor `X` of shape `(batch_size, context_window)` as input and returns logits during the forward pass.
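For example, the following toy model satisfies that contract: it takes token IDs of shape `(batch_size, context_window)` and returns logits of shape `(batch_size, context_window, vocab_size)`. This is a sketch; the architecture and sizes are arbitrary:

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """A toy LLM: X of shape (batch_size, context_window) -> logits."""

    def __init__(self, vocab_size: int = 50_257, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(self.embed(x))  # (batch_size, context_window, d_model)
        return self.head(h)              # (batch_size, context_window, vocab_size)
```

Because this model returns raw logits rather than an object with a `.logits` attribute, pass `model_returns_logits=True` when constructing the trainer (see the parameter list below).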
You need to create an `LLMTrainer` object and call `.train()` on it. Its parameters are described below:
```python
model: torch.nn.Module = None,                           # The neural network model to train
optimizer: torch.optim.Optimizer = None,                 # Optimizer responsible for updating model weights
scheduler: torch.optim.lr_scheduler.LRScheduler = None,  # Learning rate scheduler for dynamic adjustment
tokenizer: PreTrainedTokenizer | AutoTokenizer = None,   # Tokenizer for generating text (used if verbose > 0 during training)
model_returns_logits: bool = False                       # Whether model(X) returns logits directly or an object with a `logits` attribute
```
You must specify only the `model`. The other arguments are optional and will be set to default values if not specified.
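A sketch that passes every argument explicitly (the optimizer, scheduler, and tokenizer choices here are illustrative, not the library's defaults):

```python
import torch
from transformers import AutoTokenizer

from llm_trainer import LLMTrainer

model = TinyLM()  # the toy model sketched above
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5_000)
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # used when printing generated samples

trainer = LLMTrainer(model=model,
                     optimizer=optimizer,
                     scheduler=scheduler,
                     tokenizer=tokenizer,
                     model_returns_logits=True)  # TinyLM returns raw logits
```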
| Parameter | Type | Description | Default value |
|---|---|---|---|
| `max_steps` | `int` | The maximum number of training steps | 5,000 |
| `save_each_n_steps` | `int` | The interval of steps at which to save model checkpoints | 1,000 |
| `print_logs_each_n_steps` | `int` | The interval of steps at which to print training logs | 1 |
| `BATCH_SIZE` | `int` | The total batch size for training | 256 |
| `MINI_BATCH_SIZE` | `int` | The mini-batch size for gradient accumulation | 16 |
| `context_window` | `int` | The context window size for the data loader | 128 |
| `data_dir` | `str` | The directory containing the training data | `"data"` |
| `logging_file` | `Union[str, None]` | The file path for logging training metrics | `"logs_training.csv"` |
| `generate_each_n_steps` | `int` | The interval of steps at which to generate and print text samples | 200 |
| `prompt` | `str` | Beginning of the sentence that the model will continue | `"Once upon a time"` |
| `save_dir` | `str` | The directory to save model checkpoints | `"checkpoints"` |
Every parameter has a default value, so you can start training simply by calling `LLMTrainer.train()`.
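Putting the table together, here is a sketch that sets a few of these parameters explicitly. Note that `BATCH_SIZE` is split into mini-batches for gradient accumulation: 256 / 16 = 16 accumulation steps per optimizer update.

```python
trainer.train(max_steps=5_000,
              BATCH_SIZE=256,       # effective (total) batch size
              MINI_BATCH_SIZE=16,   # 256 / 16 = 16 gradient-accumulation steps
              context_window=128,
              data_dir="data",
              save_each_n_steps=1_000,
              generate_each_n_steps=200,
              prompt="Once upon a time",
              save_dir="checkpoints")
```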
- Fork the repository.
- Set up the environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install poetry
poetry install
```

- Make your changes.
- Run the linter:

```bash
pip install pylint==3.3.5
pylint $(git ls-files '*.py')
```

- Run the tests locally:

```bash
pip install pytest
poetry run pytest
```

- Commit and push your changes.
- Create a pull request from your fork to the main repository.