Spartacus-1B-Instruct: Causal Monoid Language Model
A 1.3B-parameter language model that replaces softmax attention with causal monoid state compression, achieving O(1) time per token and O(1) memory at inference, regardless of sequence length.
SFT Training Curves
Core Mechanism
Key Properties
| Property | Transformer (Llama) | Spartacus (Monoid) |
|---|---|---|
| Inference time per token | O(T): scans full KV-cache | O(1): single state update |
| Inference memory per layer | O(T): stores all past K,V | O(1): fixed d×d state matrix |
| Sequence length extrapolation | Degrades beyond training length | No architectural limit: state size is constant |
| Causality | Imposed via attention mask | Built into the recurrence |
| Training complexity | O(T²) | O(T) via parallel prefix scan |
The Monoid Recurrence
Standard attention computes:
o_t = Σ_{i≤t} softmax(q_t · k_i) v_i    (requires an O(T) KV-cache)
Monoid attention compresses the entire causal history into a fixed-size state matrix S_t per head:
S_t = diag(α_t) · S_{t-1} + k_t ⊗ v_t    (vector-decay monoid recurrence)
o_t = q_t · S_t                           (state readout)
This is a monoid because the binary operator (α, S) ⊕ (β, X) = (α·β, diag(β)·S + X) is associative and has the identity element (1, 0), where α·β is the elementwise product. Associativity enables an O(T) parallel prefix scan for training and an O(1) sequential update for inference.
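The operator can be checked numerically. The following is a minimal NumPy sketch, not the repository's implementation: `monoid_op` and `token_element` are names chosen here for illustration, and the dimensions are toy values. It verifies that grouping does not matter and that a left fold from the identity reproduces the sequential recurrence.

```python
import numpy as np

def monoid_op(left, right):
    """Combine two (decay, state) pairs; `left` is earlier in time than `right`."""
    a, S = left
    b, X = right
    # Decays multiply elementwise; the earlier state is decayed row-wise by b.
    return a * b, b[:, None] * S + X

def token_element(alpha, k, v):
    """Monoid element contributed by a single token: (alpha_t, outer(k_t, v_t))."""
    return alpha, np.outer(k, v)

rng = np.random.default_rng(0)
d, T = 4, 6
alphas = rng.uniform(0.1, 0.99, size=(T, d))
ks = rng.normal(size=(T, d))
vs = rng.normal(size=(T, d))
elems = [token_element(alphas[t], ks[t], vs[t]) for t in range(T)]

# Reference: S_t = diag(alpha_t) S_{t-1} + outer(k_t, v_t), starting from S_0 = 0.
S_seq = np.zeros((d, d))
for t in range(T):
    S_seq = alphas[t][:, None] * S_seq + np.outer(ks[t], vs[t])

# Left fold with the monoid operator, starting from the identity (1, 0).
acc = (np.ones(d), np.zeros((d, d)))
for e in elems:
    acc = monoid_op(acc, e)

# Associativity: both groupings of three elements give the same result.
x, y, z = elems[:3]
lhs = monoid_op(monoid_op(x, y), z)
rhs = monoid_op(x, monoid_op(y, z))
print(np.allclose(acc[1], S_seq), np.allclose(lhs[1], rhs[1]))
```

Because any grouping gives the same result, partial products over disjoint chunks of the sequence can be computed independently and merged afterwards, which is what a parallel prefix scan exploits.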
Vector Decay: Per-Dimension Memory Lifetimes
Unlike scalar decay (one α per head), Spartacus uses vector decay: each of the d key dimensions (the rows of S) has its own independent decay rate α_t[i] ∈ (0, 1):
S_t[i,j] = α_t[i] · S_{t-1}[i,j] + k_t[i] · v_t[j]
This allows different feature dimensions to specialize:
- Fast-decaying dimensions (α ≈ 0): local syntax, punctuation, function words
- Slow-decaying dimensions (α ≈ 1): entity memory, topic tracking, long-range facts
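To give a feel for what "fast" and "slow" mean in tokens, here is a small sketch that converts a decay rate into a half-life. It assumes a constant α for simplicity, whereas in the model α_t depends on the input at every step; `half_life` is a helper name introduced here.

```python
import math

def half_life(alpha: float) -> float:
    """Steps after which a stored value has decayed to half its size, for constant alpha."""
    return math.log(0.5) / math.log(alpha)

for alpha in (0.5, 0.95, 0.999):
    print(f"alpha={alpha}: half-life of about {half_life(alpha):.1f} tokens")
```

Under this simplification, a dimension near α = 0.5 forgets within a token or two, while one near α = 0.999 retains information over several hundred tokens.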
The decay gate uses a sigmoid activation:
α_t = σ(W·x_t + b)
| Property | Value |
|---|---|
| Range | α ∈ (0, 1): bounded, no explosion |
| Perfect memory | W·x → +∞ ⟹ σ → 1 (lossless retention) |
| Full forgetting | W·x → -∞ ⟹ σ → 0 (complete reset) |
| Stability | α < 1 by construction: no divergence regardless of input magnitude |
| Bias init | b = 3.0 ⟹ σ(3) ≈ 0.95, model starts in "mostly remember" mode |
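The gate and its initialization can be sketched in plain Python. This is an illustration under assumptions, not the model's code: `decay_gate` is a hypothetical helper that computes the gate for a single state dimension from plain lists, with the bias defaulting to the documented 3.0.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def decay_gate(w, x, b=3.0):
    """alpha = sigmoid(w . x + b) for one state dimension; w and x are plain lists."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# With zero weights only the bias acts, which mimics the state at initialization.
alpha_init = decay_gate([0.0, 0.0], [1.5, -2.0])
# A strongly negative pre-activation drives the gate toward forgetting.
alpha_forget = decay_gate([10.0, 10.0], [-5.0, -5.0])
print(round(alpha_init, 4), alpha_forget)
```

The first value is σ(3), so every dimension starts out retaining roughly 95% of its state per step until training moves the weights.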
Attention Mask: Padding-Aware Recurrence
The monoid recurrence correctly handles attention_mask for padded batches (e.g., left-padding during generate()). For PAD positions (mask=0):
α = α * mask + (1 - mask)       → α = 1 (state preserved unchanged)
k = k * mask, v = v * mask      → kv = 0 (no information injected)
Net effect: S_t = 1·S_{t-1} + 0 = S_{t-1}. PAD acts as the monoid identity element and is invisible to the recurrence, so the state after a sequence is the same whether or not the input is padded.
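The masking rule can be exercised on a toy example. The sketch below is a sequential NumPy stand-in written for this card (the names `run_recurrence` and `left_pad` are not from the repository). It left-pads a sequence with arbitrary values and checks that the final state matches the unpadded run.

```python
import numpy as np

def run_recurrence(alpha, k, v, mask, S0):
    """Sequential vector-decay recurrence with the masking rule applied per position."""
    S = S0.copy()
    for t in range(len(mask)):
        m = mask[t]
        a = alpha[t] * m + (1.0 - m)        # PAD position: decay becomes exactly 1
        kv = np.outer(k[t] * m, v[t] * m)   # PAD position: update becomes zero
        S = a[:, None] * S + kv
    return S

rng = np.random.default_rng(0)
d, T, n_pad = 4, 5, 3
alpha = rng.uniform(0.1, 0.99, size=(T, d))
k = rng.normal(size=(T, d))
v = rng.normal(size=(T, d))
S0 = rng.normal(size=(d, d))  # stands in for the learnable initial state h0

def left_pad(x, n):
    """Prepend n rows of arbitrary values that the mask is expected to neutralize."""
    junk = rng.normal(size=(n,) + x.shape[1:])
    return np.concatenate([junk, x], axis=0)

mask_padded = np.concatenate([np.zeros(n_pad), np.ones(T)])
S_plain = run_recurrence(alpha, k, v, np.ones(T), S0)
S_padded = run_recurrence(left_pad(alpha, n_pad), left_pad(k, n_pad),
                          left_pad(v, n_pad), mask_padded, S0)
print(np.allclose(S_plain, S_padded))
```

Note that the initial state passes through the padded prefix untouched, so a non-zero h0 is preserved as well.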
Design Choices
- SiLU-activated keys: k = SiLU(k_proj(x)) keeps keys mostly non-negative (SiLU is bounded below at about -0.28, so it is not strictly non-negative). This limits "feature erasure", where one token's contribution cancels another's
- QK-Norm: RMSNorm on both q and k before readout, stabilizing the scale of q·S when the state matrix accumulates many outer products
- Output Norm: RMSNorm on the readout o after q·S, further stabilizing scale before gating
- Output Gate: gate = SiLU(gate_proj(x)) modulates the multi-head readout before o_proj (similar to GLA/RetNet). It lets the model suppress or amplify specific head outputs conditioned on the current input
- Sigmoid decay gate: Ensures α ∈ (0, 1) by construction, allowing near-perfect memory (α → 1) while preventing state explosion (α > 1). Bias initialized to 3.0 so σ(3) ≈ 0.95, starting in high-retention mode
- Learnable h0: The initial state S₀ = h0 is a learnable parameter (zero-initialized), acting as a compressed "system prompt"
- Log-space decay in scan: The parallel prefix scan works in log-space log(α) to avoid numerical underflow when computing cumulative products over long sequences
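The motivation for the log-space choice can be illustrated with a short NumPy experiment. This is a sketch of the general numerical issue, not the kernel's actual code path: it uses a constant decay of 0.9 in float32 as an assumed example and compares the decay between two late positions computed from a cumulative product with the same quantity computed from a cumulative sum of logs.

```python
import numpy as np

T = 2000
alpha = np.full(T, 0.9, dtype=np.float32)

# Naive route: the running product sinks into the subnormal range (or to zero),
# so ratios between late positions lose all accuracy.
cumprod = np.cumprod(alpha)

# Log-space route: a running sum of log(alpha) stays comfortably in range.
log_cum = np.cumsum(np.log(alpha))

i, j = 1900, 1950
decay_ref = 0.9 ** (j - i)                          # exact decay over steps i+1..j
decay_log = float(np.exp(log_cum[j] - log_cum[i]))  # same quantity via log-space
with np.errstate(all="ignore"):
    decay_naive = float(cumprod[j] / cumprod[i])    # unreliable after underflow

print(f"reference={decay_ref:.6f} log-space={decay_log:.6f} naive={decay_naive}")
```

The log-space value tracks the reference closely, while the naive ratio does not, even though the true decay over 50 steps is an ordinary number.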
Three Forward Paths
| Path | Condition | Complexity | Description |
|---|---|---|---|
| Training | use_cache=False | O(T) parallel scan | Vectorized outer products → parallel prefix scan → vectorized readout |
| Inference prefill | use_cache=True, T>1 | O(T) parallel scan | Same as training + extracts final state S_T for cache |
| Inference decode | use_cache=True, T=1 | O(1) monoid_op | Single monoid_op to fold new token into state → one matmul readout |
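The relationship between the prefill and decode paths can be sketched as follows. The functions `prefill` and `decode_step` are hypothetical names for this illustration, and `prefill` uses a plain sequential loop as a stand-in for the parallel scan. The check is that prefill followed by one decode step gives the same readout as processing all tokens at once.

```python
import numpy as np

def prefill(alpha, k, v, S0):
    """Process a whole prompt and return the final state (sequential stand-in for the scan)."""
    S = S0
    for t in range(alpha.shape[0]):
        S = alpha[t][:, None] * S + np.outer(k[t], v[t])
    return S

def decode_step(S, q, alpha, k, v):
    """Fold one new token into the cached state, then read out with q."""
    S_new = alpha[:, None] * S + np.outer(k, v)  # O(d^2) work, independent of history length
    o = q @ S_new                                # one (d,) by (d, d) product
    return o, S_new

rng = np.random.default_rng(0)
d, T = 4, 7
alpha = rng.uniform(0.1, 0.99, size=(T + 1, d))
q, k, v = (rng.normal(size=(T + 1, d)) for _ in range(3))
S0 = np.zeros((d, d))

cache = prefill(alpha[:T], k[:T], v[:T], S0)            # prompt of T tokens
o, cache = decode_step(cache, q[T], alpha[T], k[T], v[T])
o_full = q[T] @ prefill(alpha, k, v, S0)                # all T + 1 tokens in one pass
print(np.allclose(o, o_full), cache.shape)
```

The cache keeps the same (d, d) shape after every decode step, which is the source of the constant per-token cost.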
Model Details
| Parameter | Value |
|---|---|
| Model | NoesisLab/Spartacus-1B-Instruct |
| Architecture | MonoidForCausalLM |
| Parameters | ~1.34B (tied embeddings) |
| Hidden size | 2048 |
| Intermediate size (MLP) | 8192 |
| Layers | 16 |
| Attention heads | 32 |
| Head dimension | 64 |
| Decay gate | Vector decay (Sigmoid), d=64 per head |
| State matrix per head | 64 × 64 = 4,096 floats |
| Vocabulary | 128,256 (Llama-3.2 tokenizer) |
| Precision | bfloat16 |
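From these figures the size of the recurrent state can be estimated with simple arithmetic. The sketch below assumes the state is held in bfloat16 (2 bytes per value) and counts only the per-head state matrices for one sequence. The KV-cache comparison is for a hypothetical Transformer with the same layer, head, and head-dimension counts and no grouped-query attention, which is an assumption made purely for illustration.

```python
LAYERS, HEADS, HEAD_DIM = 16, 32, 64   # from the Model Details table
BYTES_PER_VALUE = 2                    # bfloat16 (assumed storage precision for the state)

def monoid_state_bytes() -> int:
    """Recurrent state: one HEAD_DIM x HEAD_DIM matrix per head per layer."""
    return LAYERS * HEADS * HEAD_DIM * HEAD_DIM * BYTES_PER_VALUE

def kv_cache_bytes(seq_len: int) -> int:
    """KV-cache of a hypothetical same-shape Transformer without grouped-query attention."""
    return 2 * LAYERS * HEADS * HEAD_DIM * seq_len * BYTES_PER_VALUE

print(f"monoid state: {monoid_state_bytes() / 2**20:.0f} MiB at any sequence length")
for seq_len in (32, 2048, 8192):
    print(f"KV-cache at T={seq_len}: {kv_cache_bytes(seq_len) / 2**20:.0f} MiB")
```

Under these assumptions the state occupies 4 MiB per sequence, the same as the hypothetical KV-cache at 32 tokens, and the KV-cache then grows linearly while the state does not.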
Benchmarks (0-shot)
| Task | Metric | Value | Stderr |
|---|---|---|---|
| ARC-Challenge | acc_norm | 0.3063 | ±0.0135 |
| ARC-Easy | acc | 0.5518 | ±0.0102 |
| HellaSwag | acc_norm | 0.4610 | ±0.0050 |
| PIQA | acc_norm | 0.6915 | ±0.0108 |
| WinoGrande | acc | 0.5225 | ±0.0140 |
Comparison with ~1B Baselines (acc_norm, 0-shot)
| Task | Spartacus-1B | TinyLlama-1.1B | Llama 3.2-1B | Mamba-1.4B | RWKV-6-1.6B |
|---|---|---|---|---|---|
| ARC-C | 0.3063 | 0.3268 | ~0.359 | 0.284 | ~0.301 |
| ARC-E | 0.5518 | 0.5547 | ~0.752 | 0.512 | ~0.530 |
| HellaSwag | 0.4610 | 0.4670 | ~0.546 | 0.435 | ~0.450 |
| PIQA | 0.6915 | 0.7210 | ~0.740 | 0.655 | ~0.670 |
| WinoGrande | 0.5225 | 0.5040 | ~0.592 | 0.510 | ~0.515 |
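Spartacus scores above the listed values for the other sub-quadratic models (Mamba-1.4B, RWKV-6-1.6B) on all five tasks, trails TinyLlama-1.1B on four of five, and trails Llama 3.2-1B on every task, while maintaining O(1) inference time and memory per token. The Spartacus values are taken from the benchmark table above, so ARC-E and WinoGrande report acc rather than acc_norm. Scores marked with ~ are approximate community-reported values.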
Parallel Scan Implementation
The monoid_scan_cuda.py module provides a Triton JIT-compiled parallel prefix scan for the vector-decay monoid:
- Grid: (B*H*D_k, ceil(D_v/BLOCK_DV)), one program per state matrix row
- Forward: Sequential scan along T per row, parallelized across all (batch, head, d_k) dimensions
- Backward: Reverse-order adjoint scan with per-row D_v reduction (minimal atomic_add)
- Fallback: Pure PyTorch sequential scan for CPU/MPS
- Auto-dispatch: CUDA → Triton kernel, otherwise → PyTorch fallback
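The row-wise layout works because each row of the state matrix evolves independently of the others. The NumPy sketch below illustrates that property with a batched sequential scan. The function name, argument layout, and shapes are assumptions made for this example and are not the signatures used in monoid_scan_cuda.py.

```python
import numpy as np

def sequential_scan(log_decay, kv, h0=None):
    """log_decay: (B, H, T, Dk); kv: (B, H, T, Dk, Dv) -> all states, shape (B, H, T, Dk, Dv)."""
    B, H, T, Dk, Dv = kv.shape
    S = np.zeros((B, H, Dk, Dv), dtype=kv.dtype) if h0 is None else h0.copy()
    decay = np.exp(log_decay)
    out = np.empty_like(kv)
    for t in range(T):  # sequential along T, vectorized over batch, head, and rows
        S = decay[:, :, t, :, None] * S + kv[:, :, t]
        out[:, :, t] = S
    return out

rng = np.random.default_rng(0)
B, H, T, Dk, Dv = 2, 3, 5, 4, 6
log_decay = np.log(rng.uniform(0.1, 0.99, size=(B, H, T, Dk)))
kv = rng.normal(size=(B, H, T, Dk, Dv))
states = sequential_scan(log_decay, kv)

# Recompute a single (batch, head, row) in isolation, as one grid program would.
b, h, i = 1, 2, 3
row = np.zeros(Dv)
for t in range(T):
    row = np.exp(log_decay[b, h, t, i]) * row + kv[b, h, t, i]
print(np.allclose(states[b, h, -1, i], row))
```

Since a row needs only its own decay sequence and its own slice of the outer products, rows can be assigned to separate programs without any communication in the forward pass.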
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code is needed because the architecture is defined in the repo's MonoidForCausalLM.py.
model = AutoModelForCausalLM.from_pretrained(
    "NoesisLab/Spartacus-1B-Instruct",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Spartacus-1B-Instruct")

# Build the prompt with the chat template, then generate and decode.
messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
File Structure
MonoidForCausalLM.py # Model architecture (MonoidConfig, MonoidAttention, MonoidForCausalLM)
monoid_scan_cuda.py # Triton JIT parallel prefix scan (vector decay) + PyTorch fallback
model.safetensors # Model weights (bfloat16)
config.json # Model configuration
tokenizer.json # Llama-3.2 tokenizer
ARCH.png # Core mechanism diagram (monoid recurrence + parallel scan)
ACC_SPAR.png # SFT accuracy curve
LOSS_SPAR.png # SFT loss curve
Citation
@software{spartacus2025,
  title={Spartacus: Causal Monoid Language Model with O(1) Inference},
  author={NoesisLab},
  year={2025},
  url={https://huggingface.co/NoesisLab/Spartacus-1B-Instruct},
  description={Replaces softmax attention with vector-decay monoid state compression for constant-time, constant-memory autoregressive generation}
}
License
Apache 2.0


