# mdiffae_v1

**DEPRECATED**: This model is superseded by SemDisDiffAE, which offers better reconstruction quality, better downstream diffusion convergence, and slightly faster inference.
mDiffAE v2 is also available but is likewise superseded by SemDisDiffAE. It offers substantially better reconstruction (+1.7 dB mean PSNR) with the same or better downstream convergence.

| Version | Mean PSNR (2k images) | Bottleneck | Decoder |
|---|---|---|---|
| mDiffAE v2 (recommended) | 35.81 dB | 96ch (8x) | 8 blocks (skip-concat) |
| mDiffAE v1 (this repo) | 34.15 dB | 64ch (12x) | 4 blocks (flat) |
**mDiffAE** (Masked Diffusion AutoEncoder): a fast, single-GPU-trainable diffusion autoencoder with a 64-channel spatial bottleneck. Uses decoder token masking as an implicit regularizer instead of REPA alignment.

This variant (`mdiffae_v1`): 81.4M parameters, 310.6 MB. Bottleneck: 64 channels at patch size 16 (compression ratio 12x).
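The 12x compression figure follows directly from the patch size and bottleneck width; a quick sanity check using only the numbers above:

```python
# Compression ratio of the mdiffae_v1 bottleneck, from the numbers above.
in_channels = 3   # RGB input
patch = 16        # patch size
bottleneck = 64   # bottleneck channels

values_per_patch = in_channels * patch * patch   # 3 * 16 * 16 = 768 input values
ratio = values_per_patch / bottleneck            # 768 / 64 = 12
print(ratio)  # 12.0

# Latent grid for a 256x256 image: one 64-channel vector per 16x16 patch.
H = W = 256
latent_shape = (1, bottleneck, H // patch, W // patch)
print(latent_shape)  # (1, 64, 16, 16)
```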
## Documentation

- Technical Report: architecture, masking strategy, and results
- iRDiffAE Technical Report: full background on VP diffusion, DiCo blocks, patchify encoder, AdaLN
- Results (interactive viewer): full-resolution side-by-side comparison
## Quick Start

```python
import torch
from m_diffae import MDiffAE

# Load from the HuggingFace Hub (or a local path)
model = MDiffAE.from_pretrained("data-archetype/mdiffae_v1", device="cuda")

# Encode
images = ...  # [B, 3, H, W] in [-1, 1], H and W divisible by 16
latents = model.encode(images)

# Decode (1 step by default; PSNR-optimal)
recon = model.decode(latents, height=H, width=W)  # H, W from the input images

# Reconstruct (encode + 1-step decode)
recon = model.reconstruct(images)
```

Note: requires `pip install huggingface_hub safetensors` for Hub downloads. You can also pass a local directory path to `from_pretrained()`.
## Architecture
| Property | Value |
|---|---|
| Parameters | 81,410,624 |
| File size | 310.6 MB |
| Patch size | 16 |
| Model dim | 896 |
| Encoder depth | 4 |
| Decoder depth | 4 |
| Decoder topology | Flat sequential (no skip connections) |
| Bottleneck dim | 64 |
| MLP ratio | 4.0 |
| Depthwise kernel | 7 |
| AdaLN rank | 128 |
| PDG mechanism | Token-level masking (ratio 0.75) |
| Training regularizer | Decoder token masking (75% ratio, 50% apply prob) |
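The token-masking regularizer in the table (75% ratio, 50% apply probability) can be sketched in plain Python. The function name and boolean-mask return format below are illustrative assumptions, not the repo's API:

```python
import random

def sample_token_mask(num_tokens, mask_ratio=0.75, apply_prob=0.5, rng=None):
    """Sketch of the decoder token-masking regularizer: with probability
    `apply_prob` per training step, mark a random `mask_ratio` fraction of
    decoder token positions as masked (True)."""
    rng = rng or random.Random()
    if rng.random() >= apply_prob:
        return [False] * num_tokens  # masking not applied this step
    k = int(num_tokens * mask_ratio)
    masked = set(rng.sample(range(num_tokens), k))
    return [i in masked for i in range(num_tokens)]

mask = sample_token_mask(256, rng=random.Random(1))
print(sum(mask))  # 192 of 256 tokens masked (the apply coin-flip landed True)
```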
Encoder: Deterministic. Patchify (PixelUnshuffle + 1x1 conv) followed by DiCo blocks (depthwise conv + compact channel attention + GELU MLP) with learned residual gates.
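The patchify stem's shape bookkeeping can be traced without any framework code. A sketch, with `model_dim=896` taken from the table above:

```python
def patchify_shape(B, C, H, W, patch=16, model_dim=896):
    """Shape bookkeeping for the patchify stem (PixelUnshuffle + 1x1 conv).
    PixelUnshuffle folds each patch x patch pixel block into channels, then a
    1x1 conv projects to the model dimension. Shapes only; not the repo's code."""
    assert H % patch == 0 and W % patch == 0
    # after PixelUnshuffle(patch): [B, C*patch*patch, H/patch, W/patch]
    unshuffled = (B, C * patch * patch, H // patch, W // patch)
    # after the 1x1 conv: [B, model_dim, H/patch, W/patch]
    projected = (B, model_dim, H // patch, W // patch)
    return unshuffled, projected

u, p = patchify_shape(1, 3, 256, 256)
print(u, p)  # (1, 768, 16, 16) (1, 896, 16, 16)
```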
Decoder: VP diffusion conditioned on encoder latents and timestep via shared-base + per-layer low-rank AdaLN-Zero. 4 flat sequential blocks (no skip connections).
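One motivation for the shared-base + low-rank AdaLN is parameter count. Assuming the conditioning embedding has the model dimension (896) and AdaLN-Zero emits the usual 6 modulation vectors per block (both are assumptions; the technical report has the exact shapes), a rank-128 factorization cuts the per-layer projection roughly six-fold:

```python
dim = 896      # model dim (from the table)
rank = 128     # AdaLN rank (from the table)
out = 6 * dim  # AdaLN-Zero: shift/scale/gate for attention and MLP branches

# Full per-layer projection: cond -> 6*dim
full_params = dim * out              # 4,816,896
# Low-rank per-layer projection: cond -> rank -> 6*dim
lowrank_params = rank * (dim + out)  # 802,816
print(full_params / lowrank_params)  # 6.0: ~6x fewer per-layer AdaLN params
```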
Compared to iRDiffAE: iRDiffAE uses an 8-block decoder (2 start + 4 middle + 2 end) with skip connections and 128 bottleneck channels (needed partly because REPA occupies half the channels). mDiffAE uses 4 flat blocks with no skip connections and 64 bottleneck channels (12x compression vs iRDiffAE's 6x), which gives better channel utilisation.
## Key Differences from iRDiffAE
| Aspect | iRDiffAE v1 | mDiffAE v1 |
|---|---|---|
| Bottleneck dim | 128 | 64 |
| Decoder depth | 8 (2+4+2 skip-concat) | 4 (flat sequential) |
| PDG mechanism | Block dropping | Token masking |
| Training regularizer | REPA + covariance reg | Decoder token masking |
## Recommended Settings

Best quality is achieved with 1 DDIM step and PDG disabled. PDG can sharpen images but should be kept very low (1.01-1.05).
| Setting | Default |
|---|---|
| Sampler | DDIM |
| Steps | 1 |
| PDG | Disabled |
| PDG strength (if enabled) | 1.05 |
```python
from m_diffae import MDiffAEInferenceConfig

# PSNR-optimal (fast, 1 step)
cfg = MDiffAEInferenceConfig(num_steps=1, sampler="ddim")
recon = model.decode(latents, height=H, width=W, inference_config=cfg)
```
## Citation

```bibtex
@misc{m_diffae,
  title  = {mDiffAE: A Fast Masked Diffusion Autoencoder},
  author = {data-archetype},
  year   = {2026},
  month  = mar,
  url    = {https://huggingface.co/data-archetype/mdiffae_v1},
}
```
## Dependencies
- PyTorch >= 2.0
- safetensors (for loading weights)
## License
Apache 2.0