SCimilarity — Extended Model

An extended version of SCimilarity, a metric-learning model for single-cell RNA-seq that maps cells to a unified 128-dimensional embedding space. The original model and method are described in:

Heimberg et al., "A cell atlas foundation model for scalable search of similar human cells", Nature, 2024. https://doi.org/10.1038/s41586-024-08411-y

What's different here

The original SCimilarity was trained on ~7.9 million annotated cells from 56 studies. This model was retrained from scratch on a corpus roughly five times larger (39.5 million cells) extracted from CZ CELLxGENE Discover, using the same filtering criteria as the original paper (human cells, non-cancerous tissue, 10x Genomics platform).

Metric	Original	This model
Training cells	7.9 M	39.5 M
Search index cells	23.4 M	45.5 M

Repository contents

├── encoder.ckpt            # encoder weights (use this for embedding)
├── decoder.ckpt            # decoder weights (reconstruction)
├── gene_order.tsv          # 28,231 gene symbols the model expects as input
├── layer_sizes.json        # network architecture
├── hyperparameters.json    # training hyperparameters
├── label_ints.csv          # cell type label → integer mappings
├── metadata.json           # dataset metadata
├── reference_labels.tsv    # per-cell metadata for all reference cells
│                           # (cell type, donor, tissue, dataset)
├── annotation/
│   └── labelled_kNN.bin    # kNN index for cell type annotation
└── cellsearch/
    └── full_kNN.bin        # kNN index for similarity search

The index files (annotation/ and cellsearch/) are large (~160 GB combined) but optional. To embed cells into the latent space (for clustering, visualization, or building your own index), encoder.ckpt, gene_order.tsv, and layer_sizes.json are sufficient.
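Before loading the model, it can help to confirm that a local directory holds the three files listed above for encoder-only use. The following is a minimal sketch using only the Python standard library; the helper name missing_encoder_files is hypothetical and not part of the scimilarity package.

```python
from pathlib import Path

# Files this README lists as sufficient for encoder-only embedding.
ENCODER_ONLY_FILES = ("encoder.ckpt", "gene_order.tsv", "layer_sizes.json")


def missing_encoder_files(model_dir):
    """Return the encoder-only files that are absent from model_dir.

    Hypothetical helper for illustration; it only checks that the files
    exist and does not validate their contents.
    """
    root = Path(model_dir)
    return [name for name in ENCODER_ONLY_FILES if not (root / name).is_file()]
```

An empty list means the directory is ready for embedding; any returned names still need to be downloaded.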

Installation

pip install scimilarity

Or from source:

git clone https://github.com/Genentech/scimilarity
cd scimilarity
pip install -e .

Usage

For full usage examples including cell type annotation and similarity search, see the original SCimilarity notebooks. Simply point model_path to your local copy of this repository instead of the original model directory.

Encoder-only (no index required)

If you want to embed cells without downloading the full index:

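import scanpy as sc
from scimilarity import CellEmbedding
from scimilarity.utils import align_dataset, lognorm_counts

# Load the encoder from a local copy of this repository
ce = CellEmbedding(model_path="/path/to/model_v0")

# Align genes to the order the model expects, then log-normalize the counts
adata = sc.read_h5ad("your_data.h5ad")
adata = align_dataset(adata, ce.gene_order)
adata = lognorm_counts(adata)

# Embed each cell and store the 128-dimensional embeddings with the data
embeddings = ce.get_embeddings(adata.X)
adata.obsm["X_scimilarity"] = embeddings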
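Once embeddings are available, a small reference set can be searched without the prebuilt index files. The sketch below is a brute-force cosine-similarity search in NumPy. It uses random stand-in vectors rather than real SCimilarity embeddings, and the function name top_k_cosine is hypothetical. It is only practical for reference sets that fit in memory; it is not the approach used by the bundled kNN indices.

```python
import numpy as np


def top_k_cosine(query, reference, k=5):
    """Return indices and cosine similarities of the k nearest reference rows.

    Hypothetical helper for illustration. Rows are re-normalized defensively,
    so the dot product below equals cosine similarity.
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    sims = q @ r.T                                # shape (n_query, n_reference)
    k = min(k, r.shape[0])
    idx = np.argsort(-sims, axis=1)[:, :k]        # most similar first
    return idx, np.take_along_axis(sims, idx, axis=1)


# Stand-in data: 1,000 random 128-dimensional "reference" vectors, and three
# queries that are slightly perturbed copies of the first three references.
rng = np.random.default_rng(0)
reference = rng.normal(size=(1000, 128))
query = reference[:3] + 0.01 * rng.normal(size=(3, 128))

idx, sims = top_k_cosine(query, reference, k=5)
```

With real data, reference and query would be arrays such as adata.obsm["X_scimilarity"] from two datasets embedded with the same encoder.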

Model architecture

Parameter	Value
Input genes	28,231
Hidden layers	3 × 1,024
Embedding dimension	128
Normalization	L2 (unit hypersphere)
Loss	Triplet (semi-hard) + MSE reconstruction
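To make the last two rows of the table concrete, here is a minimal pure-Python sketch of L2 normalization, the semi-hard triplet condition, and a combined triplet plus reconstruction objective. It is an illustration under assumptions, not the training code: Euclidean distance, the margin value, the weight alpha, and the weighted-sum form of combined_loss are all hypothetical choices and are not taken from hyperparameters.json.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin):
    """Hinge loss that pushes the negative at least `margin` farther than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def is_semi_hard(anchor, positive, negative, margin):
    """Semi-hard negative: farther than the positive, but still inside the margin."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return d_ap < d_an < d_ap + margin


def mse(x, x_hat):
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def combined_loss(triplet, reconstruction, alpha):
    """Weighted sum of the two terms; this weighting form is an assumption."""
    return alpha * triplet + (1.0 - alpha) * reconstruction


# Toy 2-dimensional example with a hypothetical margin of 0.5.
anchor = l2_normalize([1.0, 0.0])
positive = l2_normalize([0.8, 0.6])
negative = l2_normalize([0.6, 0.8])
loss = triplet_loss(anchor, positive, negative, margin=0.5)
semi_hard = is_semi_hard(anchor, positive, negative, margin=0.5)
```

In this toy case the negative is farther from the anchor than the positive but still within the margin, so it counts as semi-hard and contributes a positive loss.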
