---
license: mit
language: en
tags:
  - gpt2
  - causal-lm
  - pytorch
  - transformer
  - pretraining
  - sft
  - question-answering
  - ultra-fineweb
  - custom-dataset

model-index:
  - name: gpt2-124m-qa
    results:
      - task:
          name: Question Answering
          type: text-generation
        dataset:
          name: Custom QA Dataset (JSONL)
          type: jsonl
        metrics:
          - name: Loss
            type: loss
            value: 0.65
---

<p align="center">

<a href="https://huggingface.co/shubharthak/gpt2-124m-qa">
  <img alt="Model Size" src="https://img.shields.io/badge/Model%20Size-124M-blue">
</a>

<a href="https://huggingface.co/shubharthak/gpt2-124m-qa">
  <img alt="Downloads" src="https://img.shields.io/huggingface/dl-daily/shubharthak/gpt2-124m-qa">
</a>

<a href="https://huggingface.co/shubharthak/gpt2-124m-qa">
  <img alt="Likes" src="https://img.shields.io/badge/HuggingFace-Likes-yellow">
</a>

<a href="https://huggingface.co/spaces/yuntian-deng/flash-attention">
  <img alt="Flash Attention" src="https://img.shields.io/badge/Flash%20Attention-Enabled-brightgreen">
</a>

<a href="https://pytorch.org/">
  <img alt="PyTorch" src="https://img.shields.io/badge/Framework-PyTorch-red">
</a>

<a href="https://huggingface.co/docs">
  <img alt="Task" src="https://img.shields.io/badge/Task-QA%20%2F%20CausalLM-purple">
</a>

</p>


# GPT-2 124M — Pretrained on Ultra-FineWeb Edu + QA SFT

This repository contains two trained checkpoints of a custom **GPT-2 124M** model:

- **Pretrained Model:** `model_09535.pt`  
  → Trained *from scratch* on **Ultra-FineWeb Edu (5B token subset)**  
- **QA SFT Model:** `qa-sft_best.pt`  
  → Fine-tuned using **Supervised Fine-Tuning (SFT)** on a curated **custom Q&A dataset**

This model was implemented with a **from-scratch GPT-2 training pipeline**, *inspired by Andrej Karpathy’s engineering approach*, but trained independently on different datasets and with different objectives.

---

## 📦 Model Versions

### **1. Pretrained Model (`model_09535.pt`)**
| Feature | Details |
|--------|---------|
| Parameters | 124M |
| Layers | 12 |
| Heads | 12 |
| Hidden size | 768 |
| Sequence length | 1024 |
| Vocab size | 50304 |
| Dataset | Ultra-FineWeb Edu (educational, high-quality web text) |
| Purpose | General language modeling |

**Goal:** Build a clean GPT-2 Small from scratch to understand and implement a full LLM training pipeline.

---

### **2. QA SFT Model (`qa-sft_best.pt`)**
| Feature | Details |
|--------|---------|
| Base | The pretrained model above |
| Method | Supervised Fine-Tuning (SFT) |
| Dataset | Custom JSONL Q&A dataset |
| Domain | Australian facts, general knowledge, definitions, reasoning |
| Use-case | QA-style interactive chatbot |

Demo available at:  
👉 **https://gpt2.devshubh.me**

---

# 🧠 Model Architecture

This model follows the **GPT-2 Small** architecture:

- Decoder-only transformer  
- Multi-Head Self-Attention  
- GELU activations  
- LayerNorm (Pre-Norm)  
- Flash Attention enabled during training  
- Positional embeddings  
- Weight decay + AdamW (fused)  
- Mixed Precision (AMP FP16)  
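
For reference, these hyperparameters map onto a config object roughly like the sketch below. The field names are illustrative and may not match the repo's actual config class:

```python
from dataclasses import dataclass

# Illustrative GPT-2 Small configuration mirroring the tables above.
# Field names are hypothetical; the repo's own config class may differ.
@dataclass
class GPT2SmallConfig:
    block_size: int = 1024   # maximum sequence length
    vocab_size: int = 50304  # GPT-2 BPE vocab (padded)
    n_layer: int = 12        # number of transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # hidden size
```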

---

# 🛠️ Training Details

## **Pretraining on Ultra-FineWeb Edu (5B token subset)**

- **Dataset:** Ultra-FineWeb Edu (educational, high-quality text)  
- **Tokenizer:** GPT-2 BPE (50304 vocab)  
- **Steps:** Several thousand optimization steps on a Kaggle T4 GPU  
- **Techniques used:**
  - Flash Attention  
  - Gradient Accumulation  
  - FP16 AMP  
  - Cosine Learning Rate Decay  
  - Warmup  
  - Fused AdamW  
  - Weight Decay  
  - Checkpointing every 500 steps  
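
As a sketch, the warmup plus cosine decay schedule looks like the following. The step counts and learning rates are placeholders, not the values used in the actual run:

```python
import math

# Placeholder hyperparameters -- not the actual run's settings.
max_lr, min_lr = 6e-4, 6e-5
warmup_steps, max_steps = 500, 10_000

def get_lr(step: int) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```

The fused AdamW with weight decay corresponds to something like `torch.optim.AdamW(model.parameters(), lr=max_lr, betas=(0.9, 0.95), weight_decay=0.1, fused=True)`, again with illustrative values.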

---

## **Supervised Fine-Tuning (SFT) for QA**

- **Dataset:** Custom QA JSONL  
- **Format:** `{"instruction": "...", "response": "..."}`  
- **Loss:** Cross-entropy  
- **Goal:** Improve chat quality + correctness for QA  
- **Result:** Stable ~0.6–0.7 loss, improved reasoning  
- **Tokens:** ~100K–200K from curated dataset  
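
For illustration, one way such a JSONL record can be turned into a training string is shown below. The `Q:`/`A:` template mirrors the inference example later in this card, and the file name is hypothetical:

```python
import json

def format_example(line: str) -> str:
    # Each line is a JSON object: {"instruction": "...", "response": "..."}
    record = json.loads(line)
    return f"Q: {record['instruction']}\nA: {record['response']}"

# Hypothetical file name for the curated QA dataset.
with open("qa_dataset.jsonl", encoding="utf-8") as f:
    training_texts = [format_example(line) for line in f if line.strip()]
```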

---

# 📚 Datasets Used

### **Pretraining Dataset: Ultra-FineWeb Edu**
- Educational subset of Ultra-FineWeb  
- High-quality English text  
- Filtered for correctness  
- Contains textbook-like explanations  
- Clean enough to bootstrap small LLMs  

### **Fine-Tuning Dataset: Custom QA JSONL**
- Australian knowledge  
- Definitions  
- Technology facts  
- Simple reasoning questions  
- Clean short answers  

---

# 🔤 Tokenizer

- GPT-2 BPE  
- 50,304-token vocab (the standard 50,257 GPT-2 tokens padded to a multiple of 64 for efficiency)  
- Same encoding as the standard GPT-2 tokenizer  
- Tokenization done via `tiktoken`  
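
A minimal encode/decode round trip with `tiktoken` (the padded embedding rows above the 50,257 base tokens are never produced by the tokenizer itself):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # standard GPT-2 BPE, 50,257 tokens
tokens = enc.encode("What is the capital of Australia?")
print(tokens)              # list of integer token ids
print(enc.decode(tokens))  # round-trips back to the original string
```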

---

# 💻 How to Use (Karpathy Repo)

### **1. Clone the repo**
```bash
git clone https://github.com/shubharthaksangharsha/karpathy
cd karpathy/chapter-9-sft-rhlf-dpo-gpt2-124m
```

### **2. Run inference**
```python
import torch
from model import GPT

# Load the checkpoint on CPU and rebuild the model from its stored config.
ckpt = torch.load("model_09535.pt", map_location="cpu")
model = GPT(config=ckpt['config'])
model.load_state_dict(ckpt['model'])
model.eval()

# Generate a completion for a prompt.
out = model.generate("Who is the prime minister of Australia?", max_new_tokens=60)
print(out)
```

### **To run the QA model instead:**
```python
import torch
from model import GPT

ckpt = torch.load("qa-sft_best.pt", map_location="cpu")
model = GPT(config=ckpt['config'])
model.load_state_dict(ckpt['model'])
model.eval()

out = model.generate("What is the capital of Australia?", max_new_tokens=60)
print(out)
```

---

# 🤗 How to Use (Hugging Face Transformers)

Because this is a **Karpathy-format checkpoint**, you cannot load it directly using:

```python
AutoModelForCausalLM.from_pretrained(...)
```

Instead, rebuild the model with the repo's `GPT` class and load the state dict manually:

```python
import torch
from model import GPT

ckpt = torch.load("model_09535.pt", map_location="cpu")
model = GPT(config=ckpt["config"])
model.load_state_dict(ckpt["model"])  # "model" holds the raw state dict, not a module
```

⚠️ A conversion script is required for full HF `.from_pretrained()` compatibility.
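
If you want to experiment with a conversion yourself, the outline would look roughly like the sketch below. The key renaming and weight transposition are left as a TODO because the custom checkpoint's parameter names are not documented here:

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

ckpt = torch.load("model_09535.pt", map_location="cpu")

# Target Hugging Face architecture matching the training config (GPT-2 Small).
hf_config = GPT2Config(vocab_size=50304, n_positions=1024,
                       n_embd=768, n_layer=12, n_head=12)
hf_model = GPT2LMHeadModel(hf_config)

# TODO: map the keys in ckpt["model"] to HF's parameter names and transpose
# the attention/MLP projection weights (HF's GPT-2 uses Conv1D layers), then:
#   hf_model.load_state_dict(converted_state_dict)
#   hf_model.save_pretrained("gpt2-124m-qa-hf")
```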

---

# 📝 Example Inference (QA Model)

```python
import torch
from model import GPT
from tokenizer import GPT2Tokenizer

tokenizer = GPT2Tokenizer()

ckpt = torch.load("qa-sft_best.pt", map_location="cpu")
model = GPT(config=ckpt['config'])
model.load_state_dict(ckpt['model'])
model.eval()

prompt = "Q: What is the capital of Australia?\nA:"
tokens = tokenizer.encode(prompt)
out = model.generate(tokens, max_new_tokens=60)
print(tokenizer.decode(out))
```

---

# ⚠️ Limitations
- Only 124M parameters (not SOTA)  
- Limited reasoning ability  
- Trained on small custom QA set  
- Not RLHF-finetuned (only SFT)  
- Not safety-aligned or filtered  

---

# 📄 License
This model is released under the **MIT License**. It is based on Andrej Karpathy’s "Neural Networks: Zero to Hero" course and follows the same educational license.