---
license: apache-2.0
tags:
- diffusion
- autoencoder
- image-reconstruction
- decoder-only
- flux-compatible
- pytorch
---
# data-archetype/capacitor_decoder
**Capacitor decoder**: a faster, lighter FLUX.2-compatible latent decoder built
on the
[SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae)
architecture.
## Decode Speed
| Resolution | Speedup vs FLUX.2 | Peak VRAM Reduction | capacitor_decoder (ms/image) | FLUX.2 VAE (ms/image) | capacitor_decoder peak VRAM | FLUX.2 peak VRAM |
|---:|---:|---:|---:|---:|---:|---:|
| `512x512` | `1.85x` | `59.3%` | `11.40` | `21.14` | `391.6 MiB` | `961.9 MiB` |
| `1024x1024` | `3.28x` | `79.1%` | `26.31` | `86.24` | `601.4 MiB` | `2876.4 MiB` |
| `2048x2048` | `4.70x` | `86.4%` | `86.29` | `405.84` | `1437.4 MiB` | `10531.4 MiB` |
These measurements are decode-only. Each image is first encoded once with the
same FLUX.2 encoder, latents are cached in memory, and then both decoders are
timed over the same cached latent set.
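A minimal version of that decode-only timing harness could look like the sketch below. The `decode_fn` callable, warm-up/iteration counts, and stand-in latents are illustrative assumptions, not the actual benchmark code; on GPU you would additionally need `torch.cuda.synchronize()` around each timed region.

```python
import time
from statistics import mean


def time_decoder(decode_fn, cached_latents, warmup=2, iters=5):
    """Time a decoder over a fixed set of pre-encoded, cached latents.

    decode_fn: callable taking one cached latent and returning an image.
    Returns mean milliseconds per image across the timed iterations.
    """
    # Warm-up passes so one-time costs (allocation, compilation) are excluded.
    for _ in range(warmup):
        for z in cached_latents:
            decode_fn(z)
    per_image_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        for z in cached_latents:
            decode_fn(z)
        elapsed = time.perf_counter() - start
        per_image_ms.append(1000.0 * elapsed / len(cached_latents))
    return mean(per_image_ms)


# Stand-in "decoder" and latents for illustration only.
fake_latents = list(range(4))
ms_per_image = time_decoder(lambda z: z, fake_latents)
```

Encoding once and caching the latents keeps the FLUX.2 encoder's cost out of both timings, so the comparison isolates the decoders.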
## 2k PSNR Benchmark
| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | Min (dB) | P5 (dB) | P95 (dB) | Max (dB) |
|---|---:|---:|---:|---:|---:|---:|---:|
| FLUX.2 VAE | 36.28 | 4.53 | 36.07 | 22.73 | 28.89 | 43.63 | 47.38 |
| capacitor_decoder | 36.34 | 4.50 | 36.29 | 23.28 | 29.06 | 43.66 | 47.43 |

| Delta vs FLUX.2 | Mean (dB) | Std (dB) | Median (dB) | Min (dB) | P5 (dB) | P95 (dB) | Max (dB) |
|---|---:|---:|---:|---:|---:|---:|---:|
| capacitor_decoder - FLUX.2 | 0.055 | 0.531 | 0.062 | -1.968 | -0.811 | 0.886 | 2.807 |
Evaluated on `2000` validation images: roughly `2/3`
photographs and `1/3` book covers. Each image is encoded once with FLUX.2 and
reused for both decoders.
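PSNR here is the standard per-image metric. A NumPy sketch, assuming images scaled to `[-1, 1]` so the peak-to-peak signal range is `2.0` (this mirrors the metric, not the benchmark's exact code):

```python
import numpy as np


def psnr_db(reference: np.ndarray, reconstruction: np.ndarray, peak: float = 2.0) -> float:
    """PSNR in dB; `peak` is the full signal range (2.0 for [-1, 1] images)."""
    mse = np.mean((reference.astype(np.float64) - reconstruction.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak**2 / mse)


ref = np.zeros((3, 64, 64))
noisy = ref + 0.02  # constant error of 0.02 -> MSE = 4e-4
print(round(psnr_db(ref, noisy), 2))  # -> 40.0
```

The per-image values are then aggregated into the mean, median, and percentile columns above.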
[Results viewer](https://huggingface.co/spaces/data-archetype/capacitor_decoder-results)
## Usage
```python
import torch
from diffusers.models import AutoencoderKLFlux2
from capacitor_decoder import CapacitorDecoder, CapacitorDecoderInferenceConfig
def flux2_patchify_and_whiten(
    latents: torch.Tensor,
    vae: AutoencoderKLFlux2,
) -> torch.Tensor:
    b, c, h, w = latents.shape
    if h % 2 != 0 or w % 2 != 0:
        raise ValueError(f"Expected even FLUX.2 latent grid, got H={h}, W={w}")
    z = latents.reshape(b, c, h // 2, 2, w // 2, 2)
    z = z.permute(0, 1, 3, 5, 2, 4).reshape(b, c * 4, h // 2, w // 2)
    mean = vae.bn.running_mean.view(1, -1, 1, 1).to(device=z.device, dtype=torch.float32)
    var = vae.bn.running_var.view(1, -1, 1, 1).to(device=z.device, dtype=torch.float32)
    std = torch.sqrt(var + float(vae.config.batch_norm_eps))
    return (z.to(torch.float32) - mean) / std


device = "cuda"

flux2 = AutoencoderKLFlux2.from_pretrained(
    "BiliSakura/VAEs",
    subfolder="FLUX2-VAE",
    torch_dtype=torch.bfloat16,
).to(device)

decoder = CapacitorDecoder.from_pretrained(
    "data-archetype/capacitor_decoder",
    device=device,
    dtype=torch.bfloat16,
)

image = ...  # [1, 3, H, W] in [-1, 1], with H and W divisible by 16

with torch.inference_mode():
    posterior = flux2.encode(image.to(device=device, dtype=torch.bfloat16))
    latent_mean = posterior.latent_dist.mean

    # Default path: match the usual FLUX.2 convention.
    # Whiten here, then let capacitor_decoder unwhiten internally before decode.
    latents = flux2_patchify_and_whiten(latent_mean, flux2)

    recon = decoder.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=CapacitorDecoderInferenceConfig(num_steps=1),
    )
```
Whitening and dewhitening are optional, but they **must** stay consistent. The
default above matches the usual FLUX.2 pipeline behavior. If your upstream path
already gives you raw patchified decoder-space latents instead, skip whitening
upstream and call `decode(..., latents_are_flux2_whitened=False)`.
## Details
- Default input contract: FLUX.2 patchified latents with FLUX.2 BN whitening still applied.
- Default decoder behavior: unwhiten with saved FLUX.2 BN running stats, then decode.
- Optional raw-latent mode: disable whitening upstream and call `decode(..., latents_are_flux2_whitened=False)`.
- Reused decoder architecture: [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae)
- [Technical report](technical_report_capacitor_decoder.md)
- [SemDisDiffAE technical report](https://huggingface.co/data-archetype/semdisdiffae/blob/main/technical_report_semantic.md)
## Citation
```bibtex
@misc{capacitor_decoder,
title = {Capacitor Decoder: A Faster, Lighter FLUX.2-Compatible Latent Decoder},
author = {data-archetype},
email = {data-archetype@proton.me},
year = {2026},
month = apr,
url = {https://huggingface.co/data-archetype/capacitor_decoder},
}
```