---
license: apache-2.0
tags:
  - diffusion
  - autoencoder
  - image-reconstruction
  - decoder-only
  - flux-compatible
  - pytorch
---

# data-archetype/capacitor_decoder

**Capacitor decoder**: a faster, lighter FLUX.2-compatible latent decoder built
on the
[SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae)
architecture.

## Decode Speed

| Resolution | Speedup vs FLUX.2 | Peak VRAM Reduction | capacitor_decoder (ms/image) | FLUX.2 VAE (ms/image) | capacitor_decoder peak VRAM | FLUX.2 peak VRAM |
|---:|---:|---:|---:|---:|---:|---:|
| `512x512` | `1.85x` | `59.3%` | `11.40` | `21.14` | `391.6 MiB` | `961.9 MiB` |
| `1024x1024` | `3.28x` | `79.1%` | `26.31` | `86.24` | `601.4 MiB` | `2876.4 MiB` |
| `2048x2048` | `4.70x` | `86.4%` | `86.29` | `405.84` | `1437.4 MiB` | `10531.4 MiB` |

These measurements are decode-only: each image is encoded once with the same
FLUX.2 encoder, the resulting latents are cached in memory, and both decoders
are then timed over the same cached latent set.
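The decode-only timing methodology above can be sketched as a small harness. This is an illustrative reconstruction, not the benchmark code used for the table; `time_decoder_ms` and its signature are assumptions, and on GPU you would additionally need to synchronize the device before reading the clock.

```python
import time
from typing import Callable, Sequence


def time_decoder_ms(
    decode_fn: Callable[[object], object],
    cached_latents: Sequence[object],
    warmup: int = 2,
) -> float:
    """Mean decode time in ms/image over a fixed, pre-encoded latent set.

    Encoding is done once elsewhere; only decode_fn is timed here, so both
    decoders see identical inputs. For CUDA decoders, call
    torch.cuda.synchronize() before each perf_counter() read so queued
    kernels are included in the measurement.
    """
    # Warm-up passes exclude one-time costs (allocator growth, kernel JIT).
    for z in cached_latents[:warmup]:
        decode_fn(z)
    start = time.perf_counter()
    for z in cached_latents:
        decode_fn(z)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(cached_latents)
```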

## 2k PSNR Benchmark

| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | Min (dB) | P5 (dB) | P95 (dB) | Max (dB) |
|---|---:|---:|---:|---:|---:|---:|---:|
| FLUX.2 VAE | 36.28 | 4.53 | 36.07 | 22.73 | 28.89 | 43.63 | 47.38 |
| capacitor_decoder | 36.34 | 4.50 | 36.29 | 23.28 | 29.06 | 43.66 | 47.43 |

| Delta vs FLUX.2 | Mean (dB) | Std (dB) | Median (dB) | Min (dB) | P5 (dB) | P95 (dB) | Max (dB) |
|---|---:|---:|---:|---:|---:|---:|---:|
| capacitor_decoder - FLUX.2 | 0.055 | 0.531 | 0.062 | -1.968 | -0.811 | 0.886 | 2.807 |

Evaluated on `2000` validation images: roughly `2/3`
photographs and `1/3` book covers. Each image is encoded once with FLUX.2 and
the resulting latent is reused for both decoders.

[Results viewer](https://huggingface.co/spaces/data-archetype/capacitor_decoder-results)
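For reference, per-image PSNR as reported in the tables above can be computed as follows. This is a standard-definition sketch, not the card's evaluation script; it assumes images in `[-1, 1]` (peak-to-peak range of `2.0`), which the card does not state explicitly.

```python
import numpy as np


def psnr_db(reference: np.ndarray, reconstruction: np.ndarray, peak: float = 2.0) -> float:
    """Peak signal-to-noise ratio in dB.

    peak is the full dynamic range of the signal: 2.0 for images in
    [-1, 1], 1.0 for [0, 1], 255.0 for uint8. MSE is accumulated in
    float64 to avoid precision loss on large images.
    """
    diff = reference.astype(np.float64) - reconstruction.astype(np.float64)
    mse = float(np.mean(diff * diff))
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak * peak / mse)
```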

## Usage

```python
import torch
from diffusers.models import AutoencoderKLFlux2

from capacitor_decoder import CapacitorDecoder, CapacitorDecoderInferenceConfig


def flux2_patchify_and_whiten(
    latents: torch.Tensor,
    vae: AutoencoderKLFlux2,
) -> torch.Tensor:
    b, c, h, w = latents.shape
    if h % 2 != 0 or w % 2 != 0:
        raise ValueError(f"Expected even FLUX.2 latent grid, got H={h}, W={w}")
    z = latents.reshape(b, c, h // 2, 2, w // 2, 2)
    z = z.permute(0, 1, 3, 5, 2, 4).reshape(b, c * 4, h // 2, w // 2)
    mean = vae.bn.running_mean.view(1, -1, 1, 1).to(device=z.device, dtype=torch.float32)
    var = vae.bn.running_var.view(1, -1, 1, 1).to(device=z.device, dtype=torch.float32)
    std = torch.sqrt(var + float(vae.config.batch_norm_eps))
    return (z.to(torch.float32) - mean) / std


device = "cuda"
flux2 = AutoencoderKLFlux2.from_pretrained(
    "BiliSakura/VAEs",
    subfolder="FLUX2-VAE",
    torch_dtype=torch.bfloat16,
).to(device)
decoder = CapacitorDecoder.from_pretrained(
    "data-archetype/capacitor_decoder",
    device=device,
    dtype=torch.bfloat16,
)

image = ...  # [1, 3, H, W] in [-1, 1], with H and W divisible by 16

with torch.inference_mode():
    posterior = flux2.encode(image.to(device=device, dtype=torch.bfloat16)).latent_dist
    latent_mean = posterior.mean

    # Default path: match the usual FLUX.2 convention.
    # Whiten here, then let capacitor_decoder unwhiten internally before decode.
    latents = flux2_patchify_and_whiten(latent_mean, flux2)
    recon = decoder.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=CapacitorDecoderInferenceConfig(num_steps=1),
    )
```

Whitening and unwhitening are optional, but they **must** stay consistent. The
default above matches the usual FLUX.2 pipeline behavior. If your upstream path
already gives you raw patchified decoder-space latents instead, skip whitening
upstream and call `decode(..., latents_are_flux2_whitened=False)`.
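The raw-latent path uses the same 2x2 patchify step but skips the BN whitening. A minimal sketch, reusing the reshape logic from the usage example above (the commented `decode` call mirrors the raw-latent flag named in this card; `H` and `W` stand in for the target image size):

```python
import torch


def flux2_patchify(latents: torch.Tensor) -> torch.Tensor:
    """Patchify a FLUX.2 latent grid [B, C, H, W] -> [B, 4C, H/2, W/2].

    Identical to flux2_patchify_and_whiten minus the BN whitening step:
    each 2x2 spatial block is folded into the channel dimension.
    """
    b, c, h, w = latents.shape
    if h % 2 != 0 or w % 2 != 0:
        raise ValueError(f"Expected even FLUX.2 latent grid, got H={h}, W={w}")
    z = latents.reshape(b, c, h // 2, 2, w // 2, 2)
    return z.permute(0, 1, 3, 5, 2, 4).reshape(b, c * 4, h // 2, w // 2)


# Raw-latent mode: skip whitening upstream and tell the decoder so.
# recon = decoder.decode(
#     flux2_patchify(latent_mean),
#     height=H,
#     width=W,
#     latents_are_flux2_whitened=False,
#     inference_config=CapacitorDecoderInferenceConfig(num_steps=1),
# )
```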

## Details

- Default input contract: FLUX.2 patchified latents with FLUX.2 BN whitening still applied.
- Default decoder behavior: unwhiten with saved FLUX.2 BN running stats, then decode.
- Optional raw-latent mode: disable whitening upstream and call `decode(..., latents_are_flux2_whitened=False)`.
- Reused decoder architecture: [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae)
- [Technical report](technical_report_capacitor_decoder.md)
- [SemDisDiffAE technical report](https://huggingface.co/data-archetype/semdisdiffae/blob/main/technical_report_semantic.md)

## Citation

```bibtex
@misc{capacitor_decoder,
  title   = {Capacitor Decoder: A Faster, Lighter FLUX.2-Compatible Latent Decoder},
  author  = {data-archetype},
  email   = {data-archetype@proton.me},
  year    = {2026},
  month   = apr,
  url     = {https://huggingface.co/data-archetype/capacitor_decoder},
}
```