Molbap (HF Staff) committed
Commit 34245c7 · 1 Parent(s): 8f4b9d8

add curmbs

Files changed (2):
  1. content/article.md +83 -0
  2. dist/index.html +55 -0
content/article.md CHANGED
@@ -146,6 +146,11 @@ We needed to separate both principles that were so far intertwined, [repetition]
 
 What was the solution to this?
 
+<div class="crumbs">
+<strong>Breadcrumb</strong> — Read the code in one place (<a href="#one-model-one-file">One Model, One File</a>). Keep semantics local (<a href="#standardize-dont-abstract">Standardize, Don’t Abstract</a>). Allow strategic duplication for end users (<a href="#do-repeat-yourself">DRY*</a>). Keep the public surface minimal and stable (<a href="#minimal-user-api">Minimal API</a>, <a href="#backwards-compatibility">Backwards Compatibility</a>, <a href="#consistent-public-surface">Consistent Surface</a>). Next: how modular transformers honor these while removing boilerplate.
+</div>
+
+
 ## <a id="modular"></a> Modular transformers
 
 Transformers is an opinionated library. The previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page, and the [blog post](https://huggingface.co/blog/transformers-design-philosophy) were already pointing at the drawbacks mentioned just above, which have been iteratively addressed. [`modular` transformers were introduced](https://huggingface.co/docs/transformers/en/modular_transformers), allowing a form of inheritance without breaking [One model, One file](#one-model-one-file).
@@ -168,6 +173,11 @@ When `AutoModel.from_pretrained(...)` is called, it is indeed the modeling (righ
 
 What does that give us?
 
+<div class="crumbs">
+<strong>Breadcrumb</strong> — What changed: a small <code>modular_*.py</code> declares reuse; the expanded modeling file stays visible (<a href="#one-model-one-file">tenet kept</a>). Why it matters: reviewers and contributors maintain the shard, not the repetition. Next: the measurable effect on effective LOC and maintenance cost.
+</div>
+
+
 ### A maintainable control surface
 
 The effect of modular can be measured straight from git history: at every commit, we look under the model directory.
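To picture the breadcrumb above, here is a minimal sketch of what such a `modular_*.py` shard can look like; the model and class names are illustrative, only the pattern of inheriting from a reference architecture matters:

```python
# modular_mymodel.py (illustrative sketch): real shards live under
# src/transformers/models/<model>/ and are expanded into a full modeling file.
from transformers.models.llama.modeling_llama import LlamaAttention, LlamaMLP


class MyModelMLP(LlamaMLP):
    pass  # identical to the Llama reference, so nothing needs to be respelled


class MyModelAttention(LlamaAttention):
    pass  # reuse wholesale; only the methods that actually differ would be overridden
```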
@@ -192,6 +202,10 @@ The _attention computation_ itself happens at a _lower_ level of abstraction tha
 
 However, we were adding specific torch operations for each backend (sdpa, flash-attention iterations, flex attention) but it wasn't a [minimal user api](#minimal-user-api).
 
+<div class="crumbs">
+<strong>Breadcrumb</strong> — Evidence: effective LOC drops ~15× when counting shards instead of expanded modeling. Less to read, fewer places to break. Related cleanups: attention backends moved behind a function interface. Next: how the attention interface stays standard without hiding semantics.
+</div>
+
 ### <a id="attention-classes"></a> External Attention classes
 
 We moved to an [attention interface](https://huggingface.co/docs/transformers/en/attention_interface) that allowed the following:
@@ -219,6 +233,9 @@ MyModelOutputAnnotated = Annotated[MyModelOutput, "shape: (B, C, H, W)"]
 ```
 
 
+<div class="crumbs">
+<strong>Breadcrumb</strong> — Semantics remain in <code>eager_attention_forward</code>; faster backends are opt-in via config. We inform via types/annotations rather than enforce rigid kwargs, preserving integrations. Next: distribution concerns are declared as a plan, not model surgery.
+</div>
 
 ### <a id="simpler-tensor-parallelism"></a> Configurable Tensor Parallelism
 
@@ -246,6 +263,9 @@ Which allows a user to run with multiple processes per node, e.g. 4 GPUs:
 
 Semantics stay in the model (a Linear stays a Linear), distribution is orthogonal and declared via strings: "colwise" splits columns of weights/bias across ranks; "rowwise" splits rows; packed variants shard fused weights; the mapping keys accept glob patterns like `layers.*.mlp.down_proj` to target repeated submodules.
 
+<div class="crumbs">
+<strong>Breadcrumb</strong> — Sharding is configuration (<code>tp_plan</code>), not edits to <code>Linear</code>s. Glob patterns target repeated blocks; modeling semantics stay intact. Next: per-layer attention/caching schedules declared in config, not hardcoded.
+</div>
 
 ### <a id="layers-attentions-caches"></a> Layers, attentions and caches
 
@@ -276,6 +296,11 @@ and the configuration can be _explicit_ about which attention type is in which l
 
 This is [minimal](#minimal-user-api) to implement on the user side, and allows to keep the modeling untouched. It is also easy to tweak.
 
+<div class="crumbs">
+<strong>Breadcrumb</strong> — Allowed layer types are explicit; schedules (e.g., sliding/full alternation) live in config. This keeps the file readable and easy to tweak. Next: speedups come from kernels that don’t change semantics.
+</div>
+
+
 ### <a id="community-kernels"></a>Community Kernels
 
 The same principle extends to normalization, activation, and other code paths. The model defines **semantics**; a kernel defines **how** to execute them faster. We annotate the module to borrow a community‑provided forward, keeping a [consistent public surface](#consistent-public-surface)
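A minimal sketch of the per-layer schedule mentioned in the breadcrumb above, assuming a config object with a `layer_types` list (field names are illustrative):

```python
from types import SimpleNamespace

# Illustrative config object: alternate sliding-window and full attention per layer.
config = SimpleNamespace(num_hidden_layers=8, sliding_window=4096)
config.layer_types = [
    "sliding_attention" if i % 2 == 0 else "full_attention"
    for i in range(config.num_hidden_layers)
]

# The modeling code can then pick the behaviour per layer without hardcoding it.
for i, kind in enumerate(config.layer_types):
    window = config.sliding_window if kind == "sliding_attention" else None
    print(f"layer {i}: {kind}, window={window}")
```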
@@ -290,6 +315,11 @@ Plus, this opened another angle of contribution for the community. People who ar
 
 Even more resources have been added, like the formidable [kernel builder](https://github.com/huggingface/kernel-builder) with its connected resources to [help you build kernels with it](https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md) and [with nix](https://github.com/huggingface/kernel-builder/blob/main/docs/nix.md).
 
+
+<div class="crumbs">
+<strong>Breadcrumb</strong> — Models define semantics; kernels define how to run them faster. Use annotations to borrow community forwards while keeping a consistent public surface. Next: what modularity looks like across the repo.
+</div>
+
 ## Modular developments
 
 Now, we have a form of inheritance in our codebase. Some models become standards, and model contributors are given the opportunity to _define standards_. Pushing the boundaries of scientific knowledge can translate into the boundaries of engineering if this effort is made, and we're striving for it.
@@ -312,6 +342,13 @@ As you can see, there is a small DETR island, a little llava pocket, and so on,
 
 Another problem is, this is only for `modular` models. Several models do NOT have a modular file.
 
+How do we spot them, and how do we identify modularisable models?
+
+<div class="crumbs">
+<strong>Breadcrumb</strong> — Graph reading guide: nodes are models; edges are modular imports. Llama-lineage is a hub; several VLMs remain islands — engineering opportunity for shared parents. Next: timeline + similarity signals to spot candidates.
+</div>
+
+
 ### Many models, but not enough yet, are alike
 
 So I looked into Jaccard similarity, which we use to measure set differences. I know that code is more than a set of characters stringed together. I also used code embedding models to check out code similarities, and it yielded better results, for the needs of this blog post I will stick to Jaccard index.
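For reference, the Jaccard index used here is just intersection over union of two sets; a minimal version over two modeling files (paths are illustrative) looks like this:

```python
from pathlib import Path


def jaccard_similarity(file_a: str, file_b: str) -> float:
    """Intersection-over-union of the two files' sets of whitespace-separated tokens."""
    tokens_a = set(Path(file_a).read_text().split())
    tokens_b = set(Path(file_b).read_text().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# e.g. jaccard_similarity("models/llava/modeling_llava.py",
#                         "models/llava_video/modeling_llava_video.py")
```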
@@ -322,6 +359,9 @@ It is interesting, for that, to look at _when_ we deployed this modular logic an
 
 If you've checked out llava, you've seen that llava_video is a red node, connected by a red edge to llava: it's a candidate, something that we can _likely_ remodularize, [not touching the actual model](#backwards-compatibility) but being much more readable with [DRY*](#do-repeat-yourself).
 
+<div class="crumbs">
+<strong>Breadcrumb</strong> — Similarity (Jaccard; embeddings tried separately) surfaces likely parents; the timeline shows consolidation after modular landed. Red nodes/edges = candidates (e.g., <code>llava_video</code> → <code>llava</code>) for refactors that preserve behavior. Next: concrete VLM choices that avoid leaky abstractions.
+</div>
 
 ### VLM improvements, avoiding abstraction
 
@@ -386,6 +426,11 @@ The following [Pull request to standardize placeholder masking](https://github.c
 
 But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move away from it, because it'd break the [self-contained logic](#one-model-one-file) of the model.
 
+<div class="crumbs">
+<strong>Breadcrumb</strong> — Keep VLM embedding mix in the modeling file (semantics), standardize safe helpers (e.g., placeholder masking), don’t migrate behavior to <code>PreTrainedModel</code>. Next: pipeline-level wins that came from PyTorch-first choices (fast processors).
+</div>
+
+
 ### On image processing and processors
 
 Choosing to be a `torch`-first software meant relieving a tremendous amount of support from `jax` and `TensorFlow`, and it also meant that we could be more lenient about the amount of torch-dependent utilities that we were able to add. One of these is the _fast processing_ of images. Where they were before assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
@@ -395,6 +440,10 @@ The gains in performance are immense, up to 20x speed for most models when compi
 ![Fast Image Processors Performance](static/fast_image_processors.png)
 <p class="figure-legend">Thanks <a href="https://huggingface.co/yonigozlan">Yoni Gozlan</a> for the great work!</p>
 
+<div class="crumbs">
+<strong>Breadcrumb</strong> — Torch-first lets processors assume torch/torchvision and run the whole pipeline on GPU; big per-model speedups. Next: how this lowers friction for contributors and downstream users.
+</div>
+
 
 ## Reduce barrier to entry/contribution
 
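A hedged usage sketch of the fast image-processing path described above; the checkpoint name is illustrative, the point is that inputs can stay as `torch` tensors end to end:

```python
import torch
from transformers import AutoImageProcessor

# use_fast=True selects the torchvision-backed "fast" processor when one exists.
processor = AutoImageProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", use_fast=True)

images = [torch.randint(0, 256, (3, 336, 336), dtype=torch.uint8) for _ in range(2)]
batch = processor(images=images, return_tensors="pt")
print(batch["pixel_values"].shape)
```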
@@ -406,6 +455,12 @@ Among the most valuable contributions to `transformers` is of course the additio
 
 A second one is the ability to fine-tune and pipeline these models into many other software packages. Check here on the hub how many finetunes are registered for [gpt-oss 120b](https://huggingface.co/models?other=base_model:finetune:openai/gpt-oss-120b), despite its size!
 
+
+<div class="crumbs">
+<strong>Breadcrumb</strong> — The shape of a contribution: add a model (or variant) with a small modular shard; the community and serving stacks pick it up immediately. Popularity trends (encoders/embeddings) guide where we invest. Next: power tools enabled by a consistent API.
+</div>
+
+
 ### <a id="encoders-ftw"></a> Models popularity
 
 Talking about dependencies, we can take a look at the number of downloads for transformer models popularity. One thing we see is the prominence of encoders: This is because the usage of encoders lies in embeddings, just check out [EmbeddingGemma](https://huggingface.co/blog/embeddinggemma) for a modern recap. Hence, it is vital to keep the encoders part viable, usable, fine-tune-able.
@@ -419,6 +474,11 @@ In that regard, we DO want to be a modular toolbox, being [minimal](#minimal-use
 
 So, how do these design choices, these "tenets" influence development of models and overall usage of transformers?
 
+<div class="crumbs">
+<strong>Breadcrumb</strong> — Encoders remain critical for embeddings and retrieval; maintaining them well benefits the broader ecosystem (e.g., Sentence Transformers, FAISS). Next: dev tools that leverage unified attention APIs and PyTorch-only internals.
+</div>
+
+
 ## A surgical toolbox for model development
 
 ### Attention visualisation
@@ -429,6 +489,10 @@ One particular piece of machinery is the `attention mask`. Here you see the famo
 
 {{{fragment-attention-visualizer}}}
 
+<div class="crumbs">
+<strong>Breadcrumb</strong> — Uniform attention APIs enable cross-model diagnostics (e.g., PaliGemma prefix bidirectionality vs causal). Next: whole-model tracing for ports and regressions.
+</div>
+
 
 ### Logging entire model activations
 
@@ -438,6 +502,13 @@ It just works with PyTorch models and is especially useful when aligning outputs
 
 ![Model debugger interface](static/model_debugger.png)
 
+
+<div class="crumbs">
+<strong>Breadcrumb</strong> — Forward interception and nested JSON logging align ports to reference implementations, reinforcing “Source of Truth.” Next: CUDA warmup reduces load-time stalls without touching modeling semantics.
+</div>
+
+
+
 ### Cooking faster CUDA warmups
 
 Having a clean _external_ API allows us to work on the [true inner workings of transformers](#code-is-product). One of the few recent additions was the _CUDA warmup_ via `caching_allocator_warmup` which improved massively the loading footprint by pre-allocating GPU memory to avoid malloc bottlenecks during model loading, achieving a 7x factor for an 8B model, 6x for a 32B, you can check out [the source](https://github.com/huggingface/transformers/pull/36380)!
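The idea behind the warmup can be pictured with a deliberately simplified sketch; the real `caching_allocator_warmup` in the linked PR is more careful about device maps and dtypes:

```python
import torch


def naive_allocator_warmup(total_bytes: int, device: str = "cuda") -> None:
    """Reserve one big block so later per-layer allocations reuse memory the
    caching allocator already owns, instead of hitting cudaMalloc repeatedly
    while checkpoint shards are copied in."""
    block = torch.empty(total_bytes, dtype=torch.uint8, device=device)
    del block  # freed back to the caching allocator's pool, not to the driver


# e.g. roughly the size of the checkpoint, before calling from_pretrained:
# naive_allocator_warmup(16 * 1024**3)
```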
@@ -446,6 +517,11 @@ Having a clean _external_ API allows us to work on the [true inner workings of t
 
 It's hard to overstate how much of a lifesaver that is when you're trying to load a model as fast as possible, as it's the narrowest bottleneck for your iteration speed.
 
+<div class="crumbs">
+<strong>Breadcrumb</strong> — Pre-allocating GPU memory removes malloc spikes (e.g., 7× for 8B, 6× for 32B in the referenced PR). Next: serving benefits directly from consistent interfaces and modularity.
+</div>
+
+
 ### Transformers-serve and continuous batching
 
 Having all these models readily available allows us to use all of them with transformers-serve, and to interface with them with an OpenAI-like API pattern. As a reminder, the hub also opens access to various [inference providers](https://huggingface.co/docs/inference-providers/en/index) if you're interested in model deployment in general.
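Because `transformers serve` exposes an OpenAI-compatible API, any client that speaks that protocol can target it; a minimal sketch with the `openai` Python client, where the base URL and model name are illustrative:

```python
from openai import OpenAI

# Point an OpenAI-compatible client at a locally running `transformers serve`.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

reply = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # illustrative model id
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(reply.choices[0].message.content)
```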
@@ -462,6 +538,10 @@ This provides an OpenAI-compatible API with features like [continuous batching](
 
 Continuous batching is in itself very much linked to the great work of vLLM with the `paged attention kernel`, further justifying the facilitation of [external kernels](#community-kernels).
 
+<div class="crumbs">
+<strong>Breadcrumb</strong> — OpenAI-compatible surface + continuous batching; kernels/backends slot in because the modeling API stayed stable. Next: reuse across vLLM/SGLang relies on the same consistency.
+</div>
+
 
 ## Community reusability
 
@@ -475,6 +555,9 @@ Adding a model to transformers means:
 
 This cements the need even more for a [consistent public surface](#consistent-public-surface): we are now a backend, and there's more optimized software than us to handle serving. At the time of writing, more effort is done in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), [here for GLM4 video support](https://github.com/huggingface/transformers/pull/40696/files), and here for [MoE support](https://github.com/huggingface/transformers/pull/40132) for instance.
 
+<div class="crumbs">
+<strong>Breadcrumb</strong> — Being a good backend consumer requires a consistent public surface; modular shards and configs make that stability practical. Next: what changes in v5 without breaking the promise of visible semantics.
+</div>
 
 ## What is coming next
 
dist/index.html CHANGED
@@ -131,6 +131,9 @@
 <p>This comes as a great cost. Enter the <code>#Copied from...</code> mechanism: for a long time, these comments were indicating that some code was copied from another model, saving time both for the reviewers and for the CI. But the LOC count kept creeping up. Each new model copied over hundreds of lines that we considered largely boilerplate, yet, we could not remove them.</p>
 <p>We needed to separate both principles that were so far intertwined, <a href="#do-repeat-yourself">repetition</a> and <a href="#one-model-one-file">hackability</a>.</p>
 <p>What was the solution to this?</p>
+<div class="crumbs">
+<strong>Breadcrumb</strong> — Read the code in one place (<a href="#one-model-one-file">One Model, One File</a>). Keep semantics local (<a href="#standardize-dont-abstract">Standardize, Don’t Abstract</a>). Allow strategic duplication for end users (<a href="#do-repeat-yourself">DRY*</a>). Keep the public surface minimal and stable (<a href="#minimal-user-api">Minimal API</a>, <a href="#backwards-compatibility">Backwards Compatibility</a>, <a href="#consistent-public-surface">Consistent Surface</a>). Next: how modular transformers honor these while removing boilerplate.
+</div>
 <h2><a id="modular"></a> Modular transformers</h2>
 <p>Transformers is an opinionated library. The previous <a href="https://huggingface.co/docs/transformers/en/philosophy">philosophy</a> page, and the <a href="https://huggingface.co/blog/transformers-design-philosophy">blog post</a> were already pointing at the drawbacks mentioned just above, which have been iteratively addressed. <a href="https://huggingface.co/docs/transformers/en/modular_transformers"><code>modular</code> transformers were introduced</a>, allowing a form of inheritance without breaking <a href="#one-model-one-file">One model, One file</a>.</p>
 <p>We amended the principle of <a href="#do-repeat-yourself">DRY*</a> by removing progressively all pieces of code that were “copied from” another file.</p>
@@ -290,6 +293,9 @@ class GlmRMSNorm(nn.Module):
 <p>What is the consequence? When adding a model, we do not need to go over the entire modeling file. The modular (left side above) is enough.</p>
 <p>When <code>AutoModel.from_pretrained(...)</code> is called, it is indeed the modeling (right side) that is run, and all the tests are run on the modeling code.</p>
 <p>What does that give us?</p>
+<div class="crumbs">
+<strong>Breadcrumb</strong> — What changed: a small <code>modular_*.py</code> declares reuse; the expanded modeling file stays visible (<a href="#one-model-one-file">tenet kept</a>). Why it matters: reviewers and contributors maintain the shard, not the repetition. Next: the measurable effect on effective LOC and maintenance cost.
+</div>
 <h3>A maintainable control surface</h3>
 <p>The effect of modular can be measured straight from git history: at every commit, we look under the model directory.
 If it only has a modeling file, we add its LOC count.
@@ -303,6 +309,9 @@ However, if a model has a modular_<em>.py and a corresponding automatically gene
 <p>A related optimization was the following one. You’ve likely heard about <a href="https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention">flash attention</a> and its several variants.</p>
 <p>The <em>attention computation</em> itself happens at a <em>lower</em> level of abstraction than the model itself.</p>
 <p>However, we were adding specific torch operations for each backend (sdpa, flash-attention iterations, flex attention) but it wasn’t a <a href="#minimal-user-api">minimal user api</a>.</p>
+<div class="crumbs">
+<strong>Breadcrumb</strong> — Evidence: effective LOC drops ~15× when counting shards instead of expanded modeling. Less to read, fewer places to break. Related cleanups: attention backends moved behind a function interface. Next: how the attention interface stays standard without hiding semantics.
+</div>
 <h3><a id="attention-classes"></a> External Attention classes</h3>
 <p>We moved to an <a href="https://huggingface.co/docs/transformers/en/attention_interface">attention interface</a> that allowed the following:</p>
 <p>We keep a <code>Callable</code> for the naive implementation of the attention, called “eager” computation. This Callable is named <code>eager_attention_forward</code>, and can be run as long as the user had <code>torch</code> installed, which is a requirement in any case.</p>
@@ -318,6 +327,9 @@ if self.config._attn_implementation != &quot;eager&quot;:
 
 MyModelOutputAnnotated = Annotated[MyModelOutput, &quot;shape: (B, C, H, W)&quot;]
 </code></pre>
+<div class="crumbs">
+<strong>Breadcrumb</strong> — Semantics remain in <code>eager_attention_forward</code>; faster backends are opt-in via config. We inform via types/annotations rather than enforce rigid kwargs, preserving integrations. Next: distribution concerns are declared as a plan, not model surgery.
+</div>
 <h3><a id="simpler-tensor-parallelism"></a> Configurable Tensor Parallelism</h3>
 <p>If you’re not familiar with the different flavours of parallelism, I recommend checking out <a href="https://huggingface.co/blog/accelerate-nd-parallel">this blog post</a> first, and of course a full <a href="https://huggingface.co/spaces/nanotron/ultrascale-playbook">dive into the ultra-scale playbook</a> is always recommended.</p>
 <p>The essential part is that, as <a href="https://huggingface.co/docs/transformers/v4.56.2/perf_train_gpu_many#tensor-parallelism">the documentation states</a> when tensors get too large to fit on a single GPU, they are sliced along a particular dimension and every slice is sent to a different GPU.</p>
@@ -354,6 +366,9 @@ out = model(**inputs)</code></pre></p>
 <p>Which allows a user to run with multiple processes per node, e.g. 4 GPUs:</p>
 <p><code>torchrun --nproc-per-node 4 demo.py</code></p>
 <p>Semantics stay in the model (a Linear stays a Linear), distribution is orthogonal and declared via strings: “colwise” splits columns of weights/bias across ranks; “rowwise” splits rows; packed variants shard fused weights; the mapping keys accept glob patterns like <code>layers.*.mlp.down_proj</code> to target repeated submodules.</p>
+<div class="crumbs">
+<strong>Breadcrumb</strong> — Sharding is configuration (<code>tp_plan</code>), not edits to <code>Linear</code>s. Glob patterns target repeated blocks; modeling semantics stay intact. Next: per-layer attention/caching schedules declared in config, not hardcoded.
+</div>
 <h3><a id="layers-attentions-caches"></a> Layers, attentions and caches</h3>
 <p>Following the same logic, the <em>nature</em> of attention and caching per layer of a model should not be hardcoded. We should be able to specify in a configuration-based fashion how each layer is implemented. Thus we defined a mapping that can be then</p>
 <pre><code class="language-python">ALLOWED_LAYER_TYPES = (
@@ -374,6 +389,9 @@ out = model(**inputs)</code></pre></p>
 ],
 </code></pre>
 <p>This is <a href="#minimal-user-api">minimal</a> to implement on the user side, and allows to keep the modeling untouched. It is also easy to tweak.</p>
+<div class="crumbs">
+<strong>Breadcrumb</strong> — Allowed layer types are explicit; schedules (e.g., sliding/full alternation) live in config. This keeps the file readable and easy to tweak. Next: speedups come from kernels that don’t change semantics.
+</div>
 <h3><a id="community-kernels"></a>Community Kernels</h3>
 <p>The same principle extends to normalization, activation, and other code paths. The model defines <strong>semantics</strong>; a kernel defines <strong>how</strong> to execute them faster. We annotate the module to borrow a community‑provided forward, keeping a <a href="#consistent-public-surface">consistent public surface</a></p>
 <pre><code class="language-python">@use_kernel_forward_from_hub(&quot;RMSNorm&quot;)
400
  </code></pre>
401
  <p>Plus, this opened another angle of contribution for the community. People who are GPU whisperers can now contribute optimized kernels. You can check on the <a href="https://huggingface.co/blog/hello-hf-kernels">kernel community blog post</a> to learn more about it!</p>
402
  <p>Even more resources have been added, like the formidable <a href="https://github.com/huggingface/kernel-builder">kernel builder</a> with its connected resources to <a href="https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md">help you build kernels with it</a> and <a href="https://github.com/huggingface/kernel-builder/blob/main/docs/nix.md">with nix</a>.</p>
403
+ <div class="crumbs">
404
+ <strong>Breadcrumb</strong> — Models define semantics; kernels define how to run them faster. Use annotations to borrow community forwards while keeping a consistent public surface. Next: what modularity looks like across the repo.
405
+ </div>
406
  <h2>Modular developments</h2>
407
  <p>Now, we have a form of inheritance in our codebase. Some models become standards, and model contributors are given the opportunity to <em>define standards</em>. Pushing the boundaries of scientific knowledge can also push the boundaries of engineering, if that effort is made, and we’re striving for it.
408
  It’s hard to conceptualize very large libraries and how their components interact with each other, no matter how good you are at juggling abstractions.
 
419
  <p>However, even if llava defines a few VLMs, there are far too many vision-based architectures that are not yet defined as modulars of other existing architectures. In other words, there is no strong software reference point for vision models.
420
  As you can see, there is a small DETR island, a little llava pocket, and so on, but it’s not comparable to the centrality observed for llama.</p>
421
  <p>Another problem: this only covers <code>modular</code> models. Several models do NOT have a modular file.</p>
422
+ <p>How do we spot them, and how do we identify modularisable models?</p>
423
+ <div class="crumbs">
424
+ <strong>Breadcrumb</strong> — Graph reading guide: nodes are models; edges are modular imports. Llama-lineage is a hub; several VLMs remain islands — engineering opportunity for shared parents. Next: timeline + similarity signals to spot candidates.
425
+ </div>
426
  <h3>Many models, but not enough yet, are alike</h3>
427
  <p>So I looked into Jaccard similarity, which we use to measure set overlap. I know that code is more than a set of characters strung together, and I also used code-embedding models to measure code similarity, which yielded better results; for the needs of this blog post I will stick to the Jaccard index.</p>
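  <p>For reference, the index is just intersection over union of two sets; a toy version over whitespace tokens (the file paths below are placeholders) looks like this:</p>
  <pre><code class="language-python">from pathlib import Path

def jaccard(code_a, code_b):
    # crude token sets; a real comparison should at least strip comments and normalize names
    a, b = set(code_a.split()), set(code_b.split())
    return len(a &amp; b) / len(a | b) if a | b else 0.0

# placeholder paths: point these at two modeling files you want to compare
similarity = jaccard(Path(&quot;modeling_a.py&quot;).read_text(), Path(&quot;modeling_b.py&quot;).read_text())
print(f&quot;Jaccard index: {similarity:.2f}&quot;)
</code></pre>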
428
  <p>It is interesting to look at <em>when</em> we deployed this modular logic and what its ripple effect on the library was. You can check the <a href="https://huggingface.co/spaces/Molbap/transformers-modular-refactor">larger space</a> to play around, but the gist is: adding modular allowed us to connect more and more models to solid reference points. We still have a lot of gaps to fill.</p>
429
  <p> <iframe src="https://molbap-timeline-1.hf.space" style="width:100%; height:680px; border:0" allow="clipboard-read; clipboard-write; fullscreen" referrerpolicy="no-referrer-when-downgrade"></iframe></p>
430
  <p>If you’ve checked out llava, you’ve seen that llava_video is a red node, connected by a red edge to llava: it’s a candidate, something that we can <em>likely</em> remodularize, <a href="#backwards-compatibility">without touching the actual model</a> while making it much more readable with <a href="#do-repeat-yourself">DRY*</a>.</p>
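  <p>Such a refactor boils down to a small modular file that subclasses the reference implementation and only spells out the deltas; something in the spirit of the purely hypothetical shard below:</p>
  <pre><code class="language-python"># Hypothetical modular shard: reuse llava and only override what differs.
# The real class names and deltas would come from the existing llava_video modeling file.
from transformers.models.llava.modeling_llava import LlavaForConditionalGeneration

class LlavaVideoForConditionalGeneration(LlavaForConditionalGeneration):
    # video-specific behaviour (e.g. frame handling) would be overridden here;
    # everything else is inherited, and the generated modeling file stays self-contained
    pass
</code></pre>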
431
+ <div class="crumbs">
432
+ <strong>Breadcrumb</strong> — Similarity (Jaccard; embeddings tried separately) surfaces likely parents; the timeline shows consolidation after modular landed. Red nodes/edges = candidates (e.g., <code>llava_video</code> → <code>llava</code>) for refactors that preserve behavior. Next: concrete VLM choices that avoid leaky abstractions.
433
+ </div>
434
  <h3>VLM improvements, avoiding abstraction</h3>
435
  <p>We don’t have a cookbook for common VLM patterns (image token scatter, multi‑tower encoders, cross‑attn bridges). This is one of the main areas where we can improve.</p>
436
  <p>For instance, we thought of abstracting away the mixing of <code>inputs_embeds</code>, the tensor fed into the LLM decoder in 95% of the existing VLMs. It would have looked something like this:</p>
 
482
  return special_image_mask, special_video_mask
483
  </code></pre>
484
  <p>But this lives <em>within</em> the modeling file, not in the <code>PreTrainedModel</code> base class. It will not move out of it, because that would break the <a href="#one-model-one-file">self-contained logic</a> of the model.</p>
485
+ <div class="crumbs">
486
+ <strong>Breadcrumb</strong> — Keep VLM embedding mix in the modeling file (semantics), standardize safe helpers (e.g., placeholder masking), don’t migrate behavior to <code>PreTrainedModel</code>. Next: pipeline-level wins that came from PyTorch-first choices (fast processors).
487
+ </div>
488
  <h3>On image processing and processors</h3>
489
  <p>Choosing to be a <code>torch</code>-first library meant shedding a tremendous amount of support code for <code>jax</code> and <code>TensorFlow</code>, and it also meant that we could be more liberal with the torch-dependent utilities we add. One of these is the <em>fast processing</em> of images. Where inputs were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing <code>torch</code>- and <code>torchvision</code>-native inputs allowed us to massively speed up the processing time for each model.</p>
490
  <p>The gains in performance are immense: up to 20x faster processing for most models when using compiled torchvision ops. It also allows the whole pipeline to run solely on GPU.</p>
491
  <p><img src="static/fast_image_processors.png" alt="Fast Image Processors Performance"></p>
492
  <p class="figure-legend">Thanks <a href="https://huggingface.co/yonigozlan">Yoni Gozlan</a> for the great work!</p>
493
+ <div class="crumbs">
494
+ <strong>Breadcrumb</strong> — Torch-first lets processors assume torch/torchvision and run the whole pipeline on GPU; big per-model speedups. Next: how this lowers friction for contributors and downstream users.
495
+ </div>
496
  <h2>Reduce barrier to entry/contribution</h2>
497
  <p>This is an overall objective: there’s no <code>transformers</code> without its community.</p>
498
  <p>Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.</p>
499
  <p>Among the most valuable contributions to <code>transformers</code> is of course the addition of new models. Very recently, <a href="https://huggingface.co/blog/welcome-openai-gpt-oss">OpenAI added GPT-OSS</a>, which prompted the addition of many new features to the library in order to support <a href="https://huggingface.co/openai/gpt-oss-120b">their model</a>.</p>
500
  <p>A second one is the ability to fine-tune and pipeline these models into many other pieces of software. Check on the Hub how many fine-tunes are registered for <a href="https://huggingface.co/models?other=base_model:finetune:openai/gpt-oss-120b">gpt-oss 120b</a>, despite its size!</p>
501
+ <div class="crumbs">
502
+ <strong>Breadcrumb</strong> — The shape of a contribution: add a model (or variant) with a small modular shard; the community and serving stacks pick it up immediately. Popularity trends (encoders/embeddings) guide where we invest. Next: power tools enabled by a consistent API.
503
+ </div>
504
  <h3><a id="encoders-ftw"></a> Models popularity</h3>
505
  <p>Talking about dependencies, we can take a look at model download counts as a proxy for popularity. One thing we see is the prominence of encoders: they remain the workhorse for embeddings, just check out <a href="https://huggingface.co/blog/embeddinggemma">EmbeddingGemma</a> for a modern recap. Hence, it is vital to keep encoder support viable, usable, and fine-tune-able.</p>
506
  <p><html>
 
4391
  <p>As the codebase grows, we also need to look after our friend codebase <a href="https://huggingface.co/sentence-transformers">Sentence Transformers</a>. Retrieval use-cases and smart databases, like FAISS-based indexing, rely on it, and thus indirectly on transformers.</p>
4392
  <p>In that regard, we DO want to be a modular toolbox: <a href="#minimal-user-api">minimal</a> enough and well documented enough that any ML/AI developer can use <code>transformers</code> without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.</p>
4393
  <p>So, how do these design choices, these “tenets”, influence the development of models and the overall usage of transformers?</p>
4394
+ <div class="crumbs">
4395
+ <strong>Breadcrumb</strong> — Encoders remain critical for embeddings and retrieval; maintaining them well benefits the broader ecosystem (e.g., Sentence Transformers, FAISS). Next: dev tools that leverage unified attention APIs and PyTorch-only internals.
4396
+ </div>
4397
  <h2>A surgical toolbox for model development</h2>
4398
  <h3>Attention visualisation</h3>
4399
  <p>All models share the same internal API for attention computation, thanks to <a href="#external-attention-classes">the externalisation of attention classes</a>. It allows us to build cool tools to visualize the inner workings of the attention mechanism.</p>
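  <p>Because the interface is uniform, you can hook every model the same way. A minimal sketch adapted from the attention interface documentation, on a recent transformers version; the wrapped function only logs and then delegates to SDPA:</p>
  <pre><code class="language-python">from transformers import AutoModelForCausalLM, AttentionInterface
from transformers.integrations.sdpa_attention import sdpa_attention_forward

def logged_sdpa(module, query, *args, **kwargs):
    # same signature for every model, so the hook is model-agnostic
    print(f&quot;attention in {module.__class__.__name__}: query shape {tuple(query.shape)}&quot;)
    return sdpa_attention_forward(module, query, *args, **kwargs)

AttentionInterface.register(&quot;logged_sdpa&quot;, logged_sdpa)
model = AutoModelForCausalLM.from_pretrained(&quot;Qwen/Qwen2.5-0.5B&quot;, attn_implementation=&quot;logged_sdpa&quot;)
</code></pre>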
 
4443
  </div>
4444
  </div>
4445
  </p>
4446
+ <div class="crumbs">
4447
+ <strong>Breadcrumb</strong> — Uniform attention APIs enable cross-model diagnostics (e.g., PaliGemma prefix bidirectionality vs causal). Next: whole-model tracing for ports and regressions.
4448
+ </div>
4449
  <h3>Logging entire model activations</h3>
4450
  <p>Further, because it is all PyTorch (even more so now that we support only PyTorch), we can easily <a href="https://huggingface.co/docs/transformers/internal/model_debugging_utils">debug any model</a> when we want to add it to transformers. We now have a power-user tool for porting or adding models that wraps a forward pass, intercepts every submodule call, and logs shapes, dtypes, and sample statistics of inputs/outputs to nested JSON.</p>
4451
  <p>It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, aligned with our <a href="#source-of-truth">core guideline</a>.</p>
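  <p>Usage is a thin context manager around a forward pass; a sketch based on the linked documentation, where the checkpoint, inputs, and output directory are placeholders:</p>
  <pre><code class="language-python">import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.model_debugging_utils import model_addition_debugger_context

model_id = &quot;Qwen/Qwen2.5-0.5B&quot;  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer(&quot;Hello there&quot;, return_tensors=&quot;pt&quot;)
with model_addition_debugger_context(model, debug_path=&quot;debug_traces&quot;):
    with torch.no_grad():
        model(**inputs)  # every submodule call is logged to nested JSON under debug_traces/
</code></pre>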
4452
  <p><img src="static/model_debugger.png" alt="Model debugger interface"></p>
4453
+ <div class="crumbs">
4454
+ <strong>Breadcrumb</strong> — Forward interception and nested JSON logging align ports to reference implementations, reinforcing “Source of Truth.” Next: CUDA warmup reduces load-time stalls without touching modeling semantics.
4455
+ </div>
4456
  <h3>Cooking faster CUDA warmups</h3>
4457
  <p>Having a clean <em>external</em> API allows us to work on the <a href="#code-is-product">true inner workings of transformers</a>. One recent addition was the <em>CUDA warmup</em> via <code>caching_allocator_warmup</code>, which massively improved loading by pre-allocating GPU memory to avoid malloc bottlenecks, achieving a 7x speedup for an 8B model and 6x for a 32B one; check out <a href="https://github.com/huggingface/transformers/pull/36380">the source</a>!</p>
4458
  <p><style>.warmup-demo body{background-color:#f5f5f5;margin:0;padding:20px;font-family:Segoe UI,Tahoma,Geneva,Verdana,sans-serif}.warmup-demo .container{background:#fff;border-radius:12px;max-width:1200px;margin:0 auto;padding:30px;box-shadow:0 4px 6px #0000001a}.warmup-demo h1{text-align:center;color:#333;margin-bottom:10px}.warmup-demo .subtitle{text-align:center;color:#666;margin-bottom:30px;font-size:16px}.warmup-demo .demo-container{gap:40px;margin-bottom:30px;display:flex}.warmup-demo .side{background:#fafafa;border:2px solid #ddd;border-radius:8px;flex:1;padding:20px}.warmup-demo .side h2{text-align:center;color:#333;margin-top:0}.warmup-demo .no-warmup h2{color:#d63384}.warmup-demo .with-warmup h2{color:#198754}.warmup-demo .memory-area{background:#fff;border:2px dashed #ccc;border-radius:6px;height:400px;margin:20px 0;padding:10px;position:relative;overflow:hidden}.warmup-demo .layer-box{background:#fff;border:2px solid #666;border-radius:4px;width:80px;height:30px;margin:3px;transition:all .3s;display:inline-block;position:relative}.warmup-demo .layer-box.allocating{background:#e9ecef;border-color:#adb5bd}.warmup-demo .layer-box.allocating:after{content:"malloc";color:#666;font-size:10px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .layer-box.loaded{background:#d1e7dd;border-color:#198754}.warmup-demo .layer-box.loaded:after{content:"data";color:#198754;font-size:10px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .warmup-container{background:#fff;border:3px solid #666;border-radius:6px;width:100%;height:60px;margin-bottom:20px;position:relative;overflow:hidden}.warmup-demo .warmup-container.allocated{background:#e7f1ff;border-color:#0d6efd}.warmup-demo .warmup-container:before{content:"Pre-allocated Memory Pool";color:#666;z-index:1;font-size:14px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .warmup-container.allocated:before{color:#0d6efd}.warmup-demo .warmup-fill{z-index:2;background:linear-gradient(90deg,#198754,#20c997);border-radius:3px;width:0%;height:100%;transition:width .5s;position:relative}.warmup-demo .warmup-fill:after{content:"Layer Data Loading";color:#fff;white-space:nowrap;font-size:12px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .timing{text-align:center;min-height:30px;margin:15px 0;font-size:24px;font-weight:700}.warmup-demo .no-warmup .timing{color:#d63384}.warmup-demo .with-warmup .timing{color:#198754}.warmup-demo .controls{text-align:center;margin:30px 0}.warmup-demo .btn{color:#fff;cursor:pointer;background:#0d6efd;border:none;border-radius:6px;margin:0 10px;padding:12px 24px;font-size:16px;transition:background .3s}.warmup-demo .btn:hover{background:#0b5ed7}.warmup-demo .btn:disabled{cursor:not-allowed;background:#6c757d}.warmup-demo .description{background:#f8f9fa;border-radius:6px;margin-top:15px;padding:15px;font-size:14px;line-height:1.5}.warmup-demo .phase-indicator{color:#666;text-align:center;min-height:20px;margin-top:10px;font-size:14px}.warmup-demo .layer-counter{text-align:center;color:#495057;margin:10px 0;font-size:16px}</style>
 
4507
 
4508
  <script>let animationSpeed=1/2.4,isRunning=!1,totalLayers=10;function startDemo(){isRunning||(isRunning=!0,document.getElementById("startBtn").disabled=!0,document.getElementById("resetBtn").disabled=!0,Promise.all([animateNoWarmup(),animateWithWarmup()]).then(()=>{isRunning=!1,document.getElementById("startBtn").disabled=!1,document.getElementById("resetBtn").disabled=!1}))}function resetDemo(){isRunning||(document.getElementById("noWarmupArea").innerHTML="",document.getElementById("warmupLayers").innerHTML="",document.getElementById("warmupFill").style.width="0%",document.getElementById("warmupContainer").classList.remove("allocated"),document.getElementById("noWarmupTime").textContent="0.00s",document.getElementById("warmupTime").textContent="0.00s",document.getElementById("noWarmupCounter").textContent="Layers loaded: 0/10",document.getElementById("warmupCounter").textContent="Layers loaded: 0/10",document.getElementById("noWarmupPhase").textContent="",document.getElementById("warmupPhase").textContent="")}async function animateNoWarmup(){let e=document.getElementById("noWarmupArea"),t=document.getElementById("noWarmupTime"),n=document.getElementById("noWarmupCounter"),a=document.getElementById("noWarmupPhase"),m=0,o=200/animationSpeed;a.textContent="Loading model layers...";for(let a=0;a<10;a++){let d=document.createElement("div");d.className="layer-box",e.appendChild(d),await sleep(.3*o),d.classList.add("allocating"),t.textContent=(m+=.08).toFixed(2)+"s",await sleep(.7*o),d.classList.remove("allocating"),d.classList.add("loaded"),t.textContent=(m+=.12).toFixed(2)+"s",n.textContent=`Layers loaded: ${a+1}/10`}a.textContent="Complete!"}async function animateWithWarmup(){let e=document.getElementById("warmupLayers"),t=document.getElementById("warmupTime"),n=document.getElementById("warmupCounter"),a=document.getElementById("warmupPhase"),m=document.getElementById("warmupContainer"),o=document.getElementById("warmupFill"),d=0,l=200/animationSpeed;a.textContent="Warming up allocator...",await sleep(2*l),m.classList.add("allocated"),t.textContent=(d+=.3).toFixed(2)+"s",a.textContent="Loading model layers...";for(let a=0;a<10;a++){let m=document.createElement("div");m.className="layer-box loaded",m.style.width="40px",m.style.height="20px",e.appendChild(m);let i=(a+1)/10*100;o.style.width=i+"%",await sleep(.5*l),t.textContent=(d+=.08).toFixed(2)+"s",n.textContent=`Layers loaded: ${a+1}/10`}a.textContent="Complete!"}function sleep(e){return new Promise(t=>setTimeout(t,e))}</script></p>
4509
  <p>It’s hard to overstate how much of a lifesaver that is when you’re trying to load a model as fast as possible, since loading is often the tightest bottleneck for your iteration speed.</p>
4510
+ <div class="crumbs">
4511
+ <strong>Breadcrumb</strong> — Pre-allocating GPU memory removes malloc spikes (e.g., 7× for 8B, 6× for 32B in the referenced PR). Next: serving benefits directly from consistent interfaces and modularity.
4512
+ </div>
4513
  <h3>Transformers-serve and continuous batching</h3>
4514
  <p>Having all these models readily available allows us to serve all of them with transformers-serve, and to interface with them through an OpenAI-compatible API. As a reminder, the Hub also gives access to various <a href="https://huggingface.co/docs/inference-providers/en/index">inference providers</a> if you’re interested in model deployment in general.</p>
4515
  <pre><code class="language-bash">transformers serve
 
4520
  </code></pre>
4521
  <p>This provides an OpenAI-compatible API with features like <a href="https://github.com/huggingface/transformers/pull/38085">continuous batching</a> (also check <a href="https://github.com/huggingface/transformers/pull/40426">here</a>) for better GPU utilization.</p>
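  <p>Because the surface is OpenAI-compatible, any OpenAI-style client can talk to it; a minimal sketch where the port and model name are assumptions to adjust to your <code>transformers serve</code> setup:</p>
  <pre><code class="language-python">from openai import OpenAI

# transformers serve exposes an OpenAI-compatible endpoint; base_url and model are placeholders
client = OpenAI(base_url=&quot;http://localhost:8000/v1&quot;, api_key=&quot;unused&quot;)
response = client.chat.completions.create(
    model=&quot;Qwen/Qwen2.5-0.5B-Instruct&quot;,
    messages=[{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Say hello from transformers serve&quot;}],
)
print(response.choices[0].message.content)
</code></pre>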
4522
  <p>Continuous batching is itself closely linked to the great work of vLLM on the <code>paged attention kernel</code>, further justifying support for <a href="#community-kernels">external kernels</a>.</p>
4523
+ <div class="crumbs">
4524
+ <strong>Breadcrumb</strong> — OpenAI-compatible surface + continuous batching; kernels/backends slot in because the modeling API stayed stable. Next: reuse across vLLM/SGLang relies on the same consistency.
4525
+ </div>
4526
  <h2>Community reusability</h2>
4527
  <p>Transformers-serve is transformers-first, for sure, but the library is made first and foremost to be <em>reused</em> broadly by the open-source ecosystem.</p>
4528
  <p>Adding a model to transformers means:</p>
 
4531
  <li>having it immediately usable in vLLM, <a href="https://huggingface.co/blog/transformers-backend-sglang">SGLang</a>, and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures <a href="https://blog.vllm.ai/2025/04/11/transformers-backend.html">as seen in this great vLLM x HF blog post.</a></li>
4532
  </ul>
4533
  <p>This cements the need for a <a href="#consistent-public-surface">consistent public surface</a> even more: we are now a backend, and there is software more optimized than ours for serving. At the time of writing, more effort is being put in that direction. We already have vLLM-compatible configs for VLMs (say that three times fast), for example <a href="https://github.com/huggingface/transformers/pull/40696/files">for GLM4 video support</a> and <a href="https://github.com/huggingface/transformers/pull/40132">for MoE support</a>.</p>
4534
+ <div class="crumbs">
4535
+ <strong>Breadcrumb</strong> — Being a good backend consumer requires a consistent public surface; modular shards and configs make that stability practical. Next: what changes in v5 without breaking the promise of visible semantics.
4536
+ </div>
4537
  <h2>What is coming next</h2>
4538
  <p>The next major version of <code>transformers</code> is just around the corner. When v5 is released, we will try to keep <a href="#backwards-compatibility">backwards compatibility</a> as solid as possible. The changes we are making now are there to ensure this.</p>
4539
  <p>What we aim to be is much more of a modular toolbox. What we are not is a framework: you should not be FORCED to rewrite every modeling file, but it is <em>better</em> for your model to inherit from PreTrainedModel and get tensor parallelism, from_pretrained, sharding, push_to_hub, loss computation, as well as PEFT/TRL/SGLang/vLLM compatibility and other fine-tuning and fast inference options.</p>
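  <p>To illustrate what inheriting buys you, here is a deliberately tiny model (purely illustrative, not a real architecture) that gets serialization and loading for free:</p>
  <pre><code class="language-python">import torch.nn as nn
from transformers import PretrainedConfig, PreTrainedModel

class TinyConfig(PretrainedConfig):
    model_type = &quot;tiny_demo&quot;

    def __init__(self, hidden_size=32, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size

class TinyModel(PreTrainedModel):
    config_class = TinyConfig

    def __init__(self, config):
        super().__init__(config)
        self.proj = nn.Linear(config.hidden_size, config.hidden_size)

    def forward(self, x):
        return self.proj(x)

model = TinyModel(TinyConfig())
model.save_pretrained(&quot;tiny-demo&quot;)                 # config + safetensors, no extra code
reloaded = TinyModel.from_pretrained(&quot;tiny-demo&quot;)   # push_to_hub works the same way
</code></pre>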