GEOLIP CaptionBERT-8192-anchored

This is the real prototype; fingerprinting was the earlier approach, and the full upcoming prototype is ready to train.

https://huggingface.co/AbstractPhil/geolip-axis-prototype

The example code and prototype axis modulators are available there as-is, and they will be used throughout the upcoming experiments.

For CaptionBERT, upcoming checkpoints will be pushed once the process succeeds; roughly 1 hour per epoch for about 5 epochs should be more than enough.

This marks the first use of a new prototype object dubbed AnchorBank, designed specifically to house the representations the model is distilled with while aligning the expected distillation targets into the bank itself.

This potentially allows the model to solve nth-token lookup without a head, so adding a head will enable fine-tuning. If successful, the anchor bank will contain all the knowledge the model needs to geometrically represent its data in expanded structures, provided the losses and training process are correctly aligned to the task.

Hopefully, after this refit the structure will be capable of NIL-head (head-free) token prediction; if not, I'll work with a different small LLM project and then determine the potential utility of directly integrating the two in an MoE pipeline instead of a full collective behavioral approach.

If that goes well, the MoE can be adapted into collective behavior if the systems align correctly, but that's a different process.

GEOLIP CaptionBERT-8192-fingerprinted

The next iteration will require an expanded fingerprinting axis-based relational bank, aligned specifically to the data and the teachers at training time.

Differentiating what is learned from what is retained, expert-to-expert, will enable this fingerprint to preserve the student model's integrity, which should allow cross-entropy training without complete geometric collapse and rapid overfitting.

As it stands, this model is too rigid to train heads on, but I will improve it directly today and instill a core memory of geometry.

This geometry will be ever-learning: whenever the core model trains from any expert, the bank trains as well. It houses the entire internalized, anchor-based geometric fingerprinting spectrum, and it will likely evolve over the coming hours until the functional prototype comes to full fruition. Wish me luck as I design the reusable, compact mechanism.

The final state of this will be a transparent embedding system with a transformer, specifically aligned stepwise.

No tricks, no gimmicks, just pure alignment math through solid and careful hypersphere rigidity analysis.

This alignment will let the student learn independently, without collapsing into overfitting by exceeding its internal utility, while the external heads still have more than enough information to access.

GEOLIP CaptionBERT-8192

A 26M-parameter caption encoder whose embedding space is the geometric intersection of five independently trained language models. Trained from scratch via consensus distillation: no pretrained weights, no expert models at inference.

Benchmarks

Evaluated against all five consensus teachers on STS-B, SICK-R, and MRPC. All models use mean-pooled embeddings with cosine similarity. No fine-tuning on any benchmark task.

Semantic Textual Similarity (STS-B)

Model Params Spearman ρ Pearson r
DistilBERT-base 66M 0.5717 -
RoBERTa-base 125M 0.5436 -
CaptionBERT-8192 26M 0.5032 0.5100
ALBERT-base-v2 12M 0.4784 -
BERT-base 110M 0.4729 -
ModernBERT-base 149M 0.4215 -

Beats BERT-base (4.2× larger) and ModernBERT-base (5.7× larger) on general sentence similarity despite being trained exclusively on image captions.

SICK-R (Compositional Similarity)

Model Params Spearman ρ Pearson r
DistilBERT-base 66M 0.6424 -
RoBERTa-base 125M 0.6296 -
CaptionBERT-8192 26M 0.6138 0.6645
BERT-base 110M 0.5865 -
ModernBERT-base 149M 0.5479 -
ALBERT-base-v2 12M 0.5364 -

#3/6 on compositional/syntactic similarity. Beats BERT-base, ModernBERT-base, and ALBERT on a task requiring structural language understanding.

MRPC (Paraphrase Detection)

Model Params F1 Accuracy Threshold
RoBERTa-base 125M 0.8122 - -
CaptionBERT-8192 26M 0.8068 0.6881 0.71
ALBERT-base-v2 12M 0.8067 - -
BERT-base 110M 0.8062 - -
DistilBERT-base 66M 0.8055 - -
ModernBERT-base 149M 0.8038 - -

#2/6 on paraphrase detection. 0.005 F1 behind RoBERTa, ahead of every other teacher. No classification head: pure cosine similarity with an auto-discovered threshold. A model that has never seen a paraphrase pair during training nearly wins paraphrase detection.
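The auto-discovered threshold can be found with a simple sweep. A minimal sketch (the function name and step count are illustrative, not from the released code): pick the cosine-similarity cutoff that maximizes F1 on a labeled development set.

```python
import numpy as np

def best_f1_threshold(cos_sims, labels, steps=101):
    """Sweep candidate thresholds over [0, 1] and return the one
    maximizing F1 for 'is-paraphrase' (label 1) predictions."""
    best_t, best_f1 = 0.0, 0.0
    for t in np.linspace(0.0, 1.0, steps):
        pred = (cos_sims >= t).astype(int)
        tp = int(((pred == 1) & (labels == 1)).sum())
        fp = int(((pred == 1) & (labels == 0)).sum())
        fn = int(((pred == 0) & (labels == 1)).sum())
        if tp == 0:
            continue
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1

# Toy example: similarities separate cleanly around 0.7
sims = np.array([0.9, 0.8, 0.75, 0.4, 0.3, 0.65])
labels = np.array([1, 1, 1, 0, 0, 0])
t, f1 = best_f1_threshold(sims, labels)
```

The same sweep on MRPC dev similarities would yield a cutoff like the reported 0.71.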

Caption Embedding Quality

Metric Value
Self-similarity mean 0.0040
Self-similarity max 0.7181
Top-1 retrieval cosine 0.5477
Top-5 retrieval cosine 0.4853

Near-zero average self-similarity across 1000 random captions: the embedding space has excellent discrimination. Every caption occupies its own distinct region on the hypersphere.
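As a sketch of how such a discrimination metric can be computed (the function name is hypothetical): embed a batch of captions, take the cosine-similarity matrix, and report the mean and max over distinct pairs. Random high-dimensional vectors illustrate the near-zero baseline that a well-spread embedding space approaches.

```python
import numpy as np

def off_diagonal_similarity(embeddings):
    """Mean and max cosine similarity between distinct embeddings.
    A near-zero mean means captions spread across the hypersphere."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ e.T
    mask = ~np.eye(len(e), dtype=bool)   # drop self-pairs (always 1.0)
    off = sims[mask]
    return off.mean(), off.max()

rng = np.random.default_rng(0)
# Random vectors in high dimension are nearly orthogonal on average
emb = rng.standard_normal((1000, 768))
mean_sim, max_sim = off_diagonal_similarity(emb)
```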

Consensus Fidelity

Metric Value
Val cosine to consensus 0.862
Val R@1 1.000
Pentachoron CV 0.082
Training data 500K CC12M captions
Epochs 30
Position capacity 8,192 tokens
Parameters 25,958,016
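Assuming the pentachoron CV in the table is the coefficient of variation of the 10 pairwise distances among 5 anchor points (a pentachoron is a 4-simplex with 5 vertices), it can be sketched as follows; the function name is illustrative.

```python
import numpy as np
from itertools import combinations

def pentachoron_cv(points):
    """Coefficient of variation (std / mean) of the 10 pairwise
    distances among 5 points; 0.0 means a perfectly regular 4-simplex."""
    dists = np.array([np.linalg.norm(points[i] - points[j])
                      for i, j in combinations(range(len(points)), 2)])
    return dists.std() / dists.mean()

# A regular 4-simplex: the rows of the 5x5 identity are mutually
# equidistant (every edge has length sqrt(2)), so the CV is zero.
cv = pentachoron_cv(np.eye(5))
```

A CV of 0.082 thus means the five anchors sit close to, but not exactly at, the vertices of a regular simplex.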

How It Works

Five language models were aligned into a shared geometric space via whitened Procrustes rotation. Their normalized centroid, the geometric consensus, was proven to be a mathematical constant: five different random seeds produced the same consensus point to three decimal places.

This model was trained from scratch to reproduce that consensus directly from text. It distills the geometric intersection of five experts into a single small transformer.

The distillation is not standard knowledge distillation. It is multi-teacher geometric consensus distillation: the target is not any single teacher's output but the fixed point where all five teachers agree. Individual model errors cancel. What remains is the structural invariant of language understanding that five different architectures and training objectives independently discovered.

The alignment itself is directly distillable. The geometric structure is so robust that a from-scratch model learns it with R@1=1.000 from 18K examples in 80 seconds. The consensus manifold has pentachoron CV=0.084 (the tightest geometric regularity measured across all GEOLIP experiments), which means the function from text to embedding is smooth enough that sparse sampling covers it completely.
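A minimal sketch of the alignment pipeline, assuming ZCA whitening followed by standard orthogonal Procrustes via SVD (the function names and the toy setup are illustrative, not the released code):

```python
import numpy as np

def whiten(X, eps=1e-6):
    """Zero-mean, identity-covariance (ZCA) transform."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt.T @ np.diag(1.0 / (S + eps)) @ Vt * np.sqrt(len(X))

def procrustes_rotation(A, B):
    """Orthogonal matrix R minimizing ||A @ R - B||_F."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

def consensus(expert_embeddings, reference):
    """Rotate each whitened expert space onto the reference, then take
    the L2-normalized per-sample centroid as the consensus target."""
    aligned = [whiten(E) @ procrustes_rotation(whiten(E), reference)
               for E in expert_embeddings]
    c = np.mean(aligned, axis=0)
    return c / np.linalg.norm(c, axis=1, keepdims=True)

rng = np.random.default_rng(0)
ref = whiten(rng.standard_normal((100, 16)))
# Each toy "expert" sees the reference through its own random rotation
# plus independent noise; the noise cancels in the centroid.
experts = []
for _ in range(5):
    Q, _ = np.linalg.qr(rng.standard_normal((16, 16)))
    experts.append(ref @ Q + 0.1 * rng.standard_normal((100, 16)))
cons = consensus(experts, ref)
```

In this toy setup the consensus rows recover the reference directions almost exactly, which is the noise-cancellation property the prose describes.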

5 Expert Models (frozen)
    │
    ├── BERT-base-uncased        (110M, MLM)
    ├── ModernBERT-base          (149M, MLM + rotary, 8192 ctx)
    ├── RoBERTa-base             (125M, MLM + dynamic masking)
    ├── ALBERT-base-v2           (12M, MLM + SOP + factorized)
    └── DistilBERT-base          (66M, distilled from BERT)
        │
        ├── Extract pooled embeddings on 500K CC12M captions
        ├── Whitened Procrustes alignment to shared space
        ├── Consensus = normalized centroid (geometric constant)
        │
        └── Train student with:
            ├── InfoNCE(student, consensus)   - retrieval alignment
            ├── MSE(student, consensus)       - direct regression
            └── Pentachoron CV → 0.084        - geometric regularity
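The two student losses in the diagram above can be sketched as a single objective (the temperature and weighting are illustrative, not the published hyperparameters):

```python
import torch
import torch.nn.functional as F

def consensus_distillation_loss(student, consensus,
                                temperature=0.07, mse_weight=1.0):
    """InfoNCE (in-batch retrieval toward the consensus targets)
    plus direct MSE regression on the normalized embeddings."""
    s = F.normalize(student, dim=-1)
    c = F.normalize(consensus, dim=-1)
    logits = s @ c.T / temperature            # (B, B) similarity matrix
    targets = torch.arange(len(s))            # i-th caption matches i-th target
    info_nce = F.cross_entropy(logits, targets)
    mse = F.mse_loss(s, c)
    return info_nce + mse_weight * mse

# Identical student/consensus embeddings give zero MSE and a small
# InfoNCE term driven only by inter-sample similarity.
torch.manual_seed(0)
emb = F.normalize(torch.randn(8, 768), dim=-1)
loss = consensus_distillation_loss(emb, emb.clone())
```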

Planned Task Heads

The 768-dim consensus embedding serves as a frozen feature extractor. Linear heads trained on task-specific data snap on top.

Priority Heads

Head Architecture Training Data Use Case
NLI / Entailment cat(a, b, |a-b|, a*b) → Linear(3072, 3) MNLI, SNLI Agent reasoning validation
Semantic Similarity Linear(768, 1) → sigmoid×5 STS-B train Push STS-B toward 0.80+
Multi-Label Tagging Linear(768, n_tags) → sigmoid COCO categories, Visual Genome Predict objects/attributes from captions
Paraphrase Detection cos(a, b) → threshold (already works) MRPC, QQP Deduplication, reformulation detection
Sentiment Linear(768, n_classes) SST-2, IMDB Content routing, sentiment analysis
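The NLI row above corresponds to a small module like the following sketch (the class name is hypothetical; the 3072 input width is 4 × 768 from the concatenated features):

```python
import torch
import torch.nn as nn

class NLIHead(nn.Module):
    """Entailment head over frozen sentence embeddings, using the
    cat(a, b, |a-b|, a*b) feature layout from the table above."""
    def __init__(self, dim=768, n_classes=3):
        super().__init__()
        self.linear = nn.Linear(4 * dim, n_classes)  # 3072 -> 3 for dim=768

    def forward(self, a, b):
        feats = torch.cat([a, b, (a - b).abs(), a * b], dim=-1)
        return self.linear(feats)

head = NLIHead()
a, b = torch.randn(4, 768), torch.randn(4, 768)
logits = head(a, b)   # (4, 3): one score per class
```

Because the backbone stays frozen, only the 3072×3 linear layer trains on MNLI/SNLI.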

Extended Heads

Head Architecture Training Data Use Case
Caption Quality Linear(768, 2) Hallucination-annotated captions Filter AI-generated training data
Cross-Encoder Reranker cat(query, doc) → Linear(1536, 1) MS MARCO Two-stage retrieval scoring
Clustering Linear(768, 256) → normalize Unsupervised Caption taxonomy, dataset organization
Relation Extraction cat(subj_emb, obj_emb) → Linear(1536, n_rel) Visual Genome relationships Structured scene understanding
Caption-Image Score Linear(768, 256) → cos with CLIP visual CC12M image-caption pairs Cross-modal retrieval without CLIP

Consensus Head Distillation

The same consensus trick applies to task heads. Train five separate NLI heads on the five frozen expert models, take the consensus prediction, distill into a single head on CaptionBERT. The head learns where all five experts agree on entailment: same noise cancellation, one layer instead of five.
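A sketch of that recipe, assuming the consensus prediction is the mean of the five experts' class probabilities and the distillation loss is a KL divergence (both are assumptions; the source does not specify the exact loss):

```python
import torch
import torch.nn.functional as F

def consensus_head_targets(expert_logits):
    """Average the five expert heads' class probabilities into one
    soft consensus target per example (the noise-cancellation step)."""
    probs = torch.stack([F.softmax(l, dim=-1) for l in expert_logits])
    return probs.mean(dim=0)

def distill_step(student_logits, expert_logits):
    """KL divergence from the consensus target to the student head."""
    target = consensus_head_targets(expert_logits)
    return F.kl_div(F.log_softmax(student_logits, dim=-1), target,
                    reduction="batchmean")

torch.manual_seed(0)
experts = [torch.randn(4, 3) for _ in range(5)]  # five frozen NLI heads
student = torch.randn(4, 3)                      # one head on CaptionBERT
loss = distill_step(student, experts)
```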

Training Datasets β€” Current and Planned

Current

Dataset Samples Used Content Notes
CC12M LLaVA-Next 500K Re-captioned CC12M with LLaVA-Next Primary training data, mean ~92 tokens

Planned β€” Caption Saturation

The model tokenizes to 512 but has 8,192 position capacity. Longer, more complex captions will exercise the full context window and push v_cos beyond 0.862.

Dataset Size Content Why
ShareGPT4V 1.2M GPT-4V detailed image descriptions Longer captions (200-500 tokens), richer vocabulary
DOCCI 15K Expert-written dense image descriptions Extremely detailed, 100-300 words per image
Localized Narratives 850K Spoken descriptions with mouse traces Narrative structure, temporal ordering
DenseCap 5.4M Region-level dense captions Fine-grained spatial descriptions
TextCaps 145K Captions requiring OCR reading Text-in-image understanding
VizWiz 32K Captions from blind/low-vision users Diverse, real-world, often longer descriptions
COCO Captions 600K 5 captions per image, human-written Short but high-quality, broad coverage
SBU Captions 1M Web-crawled image-caption pairs Scale and diversity

Planned β€” Domain Extension

Dataset Size Content Why
BookCorpus 11K books Long-form narrative text Exercise 8K context, literary language
Wikipedia 6M articles Encyclopedic text General knowledge, factual density
Natural Questions 300K Question-answer pairs QA capability for retrieval heads
MS MARCO 1M Passages + queries Retrieval training for reranker head

Architecture

Input text
    │
    ├── BERT WordPiece tokenizer (30,522 vocab)
    ├── Token embeddings (384-dim)
    ├── Position embeddings (8,192 capacity)
    │
    ├── 6× Transformer Encoder Layer
    │   (384-dim, 6 heads, 1536 FFN, GELU, pre-norm)
    │
    ├── Mean pool over non-padding tokens
    ├── Projection: 384 → 384 → GELU → LN → 768
    └── L2 normalize
        │
        └── (B, 768) consensus-aligned embedding
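The diagram maps onto a forward pass like this minimal sketch (this is not the released CaptionEncoder implementation; details such as dropout placement may differ):

```python
import torch
import torch.nn as nn

class CaptionEncoderSketch(nn.Module):
    """Minimal sketch of the architecture diagram above."""
    def __init__(self, vocab_size=30522, max_len=8192, d_model=384,
                 n_heads=6, n_layers=6, d_ff=1536, output_dim=768,
                 pad_token_id=0):
        super().__init__()
        self.pad_token_id = pad_token_id
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, d_ff, activation="gelu",
            batch_first=True, norm_first=True)   # pre-norm
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.LayerNorm(d_model), nn.Linear(d_model, output_dim))

    def forward(self, input_ids, attention_mask):
        pos = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.tok(input_ids) + self.pos(pos)
        x = self.encoder(x, src_key_padding_mask=attention_mask == 0)
        # Mean pool over non-padding tokens only
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (x * mask).sum(1) / mask.sum(1).clamp(min=1)
        return torch.nn.functional.normalize(self.proj(pooled), dim=-1)

model = CaptionEncoderSketch()
model.eval()
ids = torch.randint(1, 30522, (2, 16))
mask = torch.ones(2, 16, dtype=torch.long)
out = model(ids, mask)   # (2, 768), unit-norm rows
```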

Usage

import torch
from transformers import AutoTokenizer
from caption_encoder import CaptionEncoder

# Load
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = CaptionEncoder(
    vocab_size=30522, max_len=8192, d_model=384,
    n_heads=6, n_layers=6, d_ff=1536, output_dim=768,
    dropout=0.0, pad_token_id=0)
model.load_state_dict(torch.load("best_model.pt", weights_only=True))
model.eval()

# Encode
texts = ["A cat sitting on a windowsill", "A dog playing fetch on the beach"]
tokens = tokenizer(texts, max_length=512, padding="max_length",
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(tokens["input_ids"], tokens["attention_mask"])

# embeddings: (2, 768) L2-normalized
similarity = embeddings[0] @ embeddings[1]
print(f"Similarity: {similarity:.3f}")

Training Curve

Epoch t_cos v_cos v_cv Time
1 0.804 0.803 0.104 689s
5 0.819 0.819 0.086 689s
10 0.831 0.829 0.087 689s
15 0.842 0.840 0.078 688s
20 0.851 0.849 0.078 690s
25 0.860 0.859 0.092 689s
30 0.863 0.862 0.082 689s

R@1=1.000 and t_acc=1.000 throughout all 30 epochs. Train/val gap < 0.002: no overfitting on 500K samples.

GEOLIP Family

System Type Params Output
CLIP-L ctx576 Memory bank 34M pooled (768,)
CLIP-L seq77 Memory + sequence 53M pooled + seq (77, 768)
Meridian bigG Memory + sequence 167M pooled + seq (77, 1280)
Conduit v0 Multi-expert hub 8.8M aligned (1024,)
CaptionBERT-8192 Consensus distilled 26M consensus (768,)

Citation

See Geometric Memory Part I and Part II for the full methodology, including the pentachoron consensus proof, whitened Procrustes alignment, compositional convolution experiments, and the path from accumulation-based memory to alignment-based distillation.

License

Apache 2.0
