@Anindyadeep | 40-minute read

In the last blog post, we introduced sequence modeling, the attention mechanism, and related concepts, and discussed the theoretical details of the different parts of the transformer architecture in general and GPT-2 in particular.

<aside> 💡

Disclaimer: I will not re-explain topics that were already discussed in depth in the previous blog post. So if something is unclear, please check out the previous blog post for the theoretical background.

This blogpost can be thought of as my notes / annotations on the famous video by Andrej Karpathy: Let’s reproduce GPT-2 (124M). It roughly covers timestamps 13:47 to 1:13:47. I haven’t included some parts, like training the model or building the DataLoader; I will cover those in the next part of the blogpost. So if you are watching the video, this blog can serve as a good guide for revisiting concepts, and if you are too reluctant to watch the video, or you are a reading person, then hopefully this blogpost fills that gap.

</aside>

In this blog, we are going to get our hands dirty and work through a step-by-step implementation of the GPT-2 architecture. We will follow the same naming convention as Hugging Face, so that we can port the learned weights from the HF model into our own PyTorch implementation. We will discuss each block in detail and then put everything together at the end. So without further ado, let’s start.

<aside> ☝

Last but not least: I have made some hand-drawn images (sorry for the bad drawing); if you cannot see them clearly, just double-tap the image to zoom. Also, if you spot any discrepancies, if something does not make sense, or if I missed anything, feel free to reach out to me on my Twitter (X).

</aside>

A Config to keep things structured and simple

We will start by defining a simple config, which will contain all the essential information such as the context length, number of layers, etc. Here is how we define it:

from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50257
    n_layer: int = 12
    n_head: int = 12
    n_embed: int = 768
    bias: bool = True
    dropout_ratio: float = 0.0

Let's understand each of these configuration parameters:

- block_size: the maximum context length (1024 tokens for GPT-2).
- vocab_size: the number of tokens in the vocabulary; GPT-2 uses 50257 (50,000 BPE merges + 256 byte tokens + 1 special <|endoftext|> token).
- n_layer: the number of transformer blocks stacked on top of each other (12 for GPT-2 124M).
- n_head: the number of attention heads in each block (12).
- n_embed: the dimensionality of the token embeddings and hidden states (768).
- bias: whether the Linear and LayerNorm layers use a bias term (GPT-2 does).
- dropout_ratio: the dropout probability (0.0, i.e. no dropout by default).
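As a quick sanity check on the config, here is a small sketch showing the defaults and how fields can be overridden (the tiny override values below are hypothetical, just to illustrate a debugging-sized model):

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024       # maximum context length
    vocab_size: int = 50257      # number of BPE tokens
    n_layer: int = 12            # number of transformer blocks
    n_head: int = 12             # attention heads per block
    n_embed: int = 768           # embedding dimension
    bias: bool = True
    dropout_ratio: float = 0.0

# Default GPT-2 (124M) configuration
config = GPTConfig()

# Each attention head will work on n_embed // n_head = 64 dimensions,
# so n_embed must be divisible by n_head
assert config.n_embed % config.n_head == 0

# Overriding fields for a tiny debugging model (hypothetical values)
tiny = GPTConfig(block_size=64, vocab_size=1000, n_layer=2, n_head=2, n_embed=32)
```

Keeping everything in one dataclass means we can later scale to larger GPT-2 variants (medium, large, XL) by changing only these numbers.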

We will be taking an approach where we define the whole architecture skeleton first (without the nitty-gritty details) and then slowly work through each detail. So let’s define the whole architecture skeleton.

The Initial model skeleton

We will split the code into two parts: first we understand the architecture (code-wise), and then we will see how the forward pass and the cross-entropy loss are computed. Here’s our rough architecture:

import torch.nn as nn

class GPT(nn.Module):
    def __init__(self, config: GPTConfig) -> None:
        super().__init__()
        self.config = config
        self.transformer: nn.ModuleDict = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embed),  # token embeddings
            wpe = nn.Embedding(config.block_size, config.n_embed),  # positional embeddings
            h = nn.ModuleList(
                [Block(config) for _ in range(config.n_layer)]
            ),
            ln_f = nn.LayerNorm(config.n_embed)                     # final layer norm
        ))
        # Projects hidden states (n_embed) back to vocabulary logits
        self.lm_head = nn.Linear(config.n_embed, config.vocab_size, bias=False)

        # Shared weight in the first and the last layer
        self.lm_head.weight = self.transformer.wte.weight

Compare the architecture code with Figure 1 on the right.

Figure 1: GPT-2 architecture
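The weight sharing in the last line ties the token-embedding matrix to the output projection: both have shape (vocab_size × n_embed), so a single parameter tensor can serve both roles, saving roughly 50257 × 768 ≈ 38.6M parameters. A minimal sketch of the idea, using small hypothetical sizes:

```python
import torch
import torch.nn as nn

vocab_size, n_embed = 100, 16  # hypothetical small sizes for illustration

wte = nn.Embedding(vocab_size, n_embed)               # weight shape: (vocab_size, n_embed)
lm_head = nn.Linear(n_embed, vocab_size, bias=False)  # weight shape: (vocab_size, n_embed)

# Tie them: both modules now point to the same Parameter
lm_head.weight = wte.weight
assert lm_head.weight.data_ptr() == wte.weight.data_ptr()

# A gradient flowing through the head also reaches the embedding
idx = torch.tensor([1, 2, 3])
logits = lm_head(wte(idx))  # shape: (3, vocab_size)
logits.sum().backward()
assert wte.weight.grad is not None
```

Note that the tying only works because `nn.Linear(n_embed, vocab_size)` stores its weight transposed, as (out_features, in_features), matching the embedding table exactly.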

The config (from GPTConfig) makes it easy to set the numbers (or hyper-parameters) and change them as needed. We start by defining the transformer architecture as a torch nn.ModuleDict. We have our word token embedding layer (size: 50257 × 768) and a word positional embedding layer (size: 1024 × 768). Then we define h as our list of hidden blocks. For now, let’s define Block like this: