In chapter 4.6, on page 120, it is written that
Hmm, interesting. My GPT model also deviates from the book in trainable parameters; however, if I remember correctly, I did not make an exact 1:1 copy. My gpt2-small has 163,059,793 trainable parameters. I suggest you install `torch-summary` from pip and run your model through both `print(model)` and `summary(model)` to compare the layer-by-layer breakdown against yours.
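The collapsed config and output blocks from this reply weren't captured in the thread. As a supplement, here is a minimal sketch (helper name is my own, not from the thread) of counting trainable parameters directly in PyTorch, which is useful for a quick cross-check before reaching for `torch-summary`:

```python
import torch.nn as nn

def count_trainable(model: nn.Module) -> int:
    """Sum the element counts of all parameters that require gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Stand-in module for illustration; substitute your GPT model instance.
toy = nn.Linear(768, 768)
print(count_trainable(toy))  # weight 768*768 + bias 768 = 590,592
```

Comparing this single number first tells you whether a layer-by-layer `summary(model)` comparison is even needed.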
Thanks for the comments! It's interesting and odd that the numbers differ. I suspect this is due to some minor code difference that is easy to overlook.
@Jessen-Li could you try to run the following standalone code and see what you get? If it is 163,009,536 then there's perhaps some small discrepancy in your code. If not, we can investigate further.
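The standalone code itself wasn't captured in this thread, but the reference figure of 163,009,536 can be cross-checked arithmetically. A sketch assuming the GPT-2-small configuration used in the book (50,257-token vocabulary, 1,024-token context, 768-dim embeddings, 12 layers, `qkv_bias=False`, untied output head with no bias):

```python
vocab_size, ctx_len, emb_dim, n_layers = 50257, 1024, 768, 12

tok_emb = vocab_size * emb_dim           # token embedding matrix
pos_emb = ctx_len * emb_dim              # learned positional embeddings

# One transformer block (qkv_bias=False, so Q/K/V carry no bias terms)
attn = 3 * emb_dim * emb_dim             # Q, K, V projection weights
attn += emb_dim * emb_dim + emb_dim      # output projection weight + bias
ffn = emb_dim * 4 * emb_dim + 4 * emb_dim    # expansion layer weight + bias
ffn += 4 * emb_dim * emb_dim + emb_dim       # contraction layer weight + bias
norms = 2 * 2 * emb_dim                  # two LayerNorms (scale + shift each)
block = attn + ffn + norms

final_norm = 2 * emb_dim
out_head = emb_dim * vocab_size          # untied output head, no bias

total = tok_emb + pos_emb + n_layers * block + final_norm + out_head
print(f"{total:,}")  # 163,009,536
```

Incidentally, the earlier figure of 163,059,793 exceeds this by exactly 50,257, i.e. one bias value per vocabulary entry, which would be consistent with a bias term on the output head.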