SCimilarity: Extended Model

An extended version of SCimilarity, a metric-learning model for single-cell RNA-seq that maps cells to a unified 128-dimensional embedding space. The original model and method are described in:

Heimberg et al., "A cell atlas foundation model for scalable search of similar human cells", Nature, 2024. https://doi.org/10.1038/s41586-024-08411-y


What's different here

The original SCimilarity was trained on ~7.9 million annotated cells from 56 studies. This model was retrained from scratch on a corpus roughly five times larger (39.5 million training cells) extracted from CZ CELLxGENE Discover, using the same filtering criteria as the original paper (human cells, non-cancerous tissue, 10x Genomics platform).

                      Original    This model
Training cells        7.9 M       39.5 M
Search index cells    23.4 M      45.5 M

Repository contents

├── encoder.ckpt            # encoder weights (use this for embedding)
├── decoder.ckpt            # decoder weights (reconstruction)
├── gene_order.tsv          # 28,231 gene symbols the model expects as input
├── layer_sizes.json        # network architecture
├── hyperparameters.json    # training hyperparameters
├── label_ints.csv          # cell type label → integer mappings
├── metadata.json           # dataset metadata
├── reference_labels.tsv    # per-cell metadata for all reference cells
│                           # (cell type, donor, tissue, dataset)
├── annotation/
│   └── labelled_kNN.bin    # kNN index for cell type annotation
└── cellsearch/
    └── full_kNN.bin        # kNN index for similarity search

The index files (annotation/ and cellsearch/) are large (~160 GB combined) but optional. If you only need to embed cells into the latent space (for clustering, visualization, or building your own index), you need only encoder.ckpt, gene_order.tsv, and layer_sizes.json.
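
Before loading the model, you can confirm that the encoder-only files are present with a few lines of standard-library Python. This is a minimal sketch: the helper name `missing_encoder_files` and the example path are hypothetical and are not part of the scimilarity package.

```python
from pathlib import Path

# Files this README lists as sufficient for embedding without the kNN indexes.
ENCODER_ONLY_FILES = ("encoder.ckpt", "gene_order.tsv", "layer_sizes.json")


def missing_encoder_files(model_dir):
    """Return the encoder-only files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in ENCODER_ONLY_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Hypothetical local path; replace it with your own copy of this repository.
    missing = missing_encoder_files("/path/to/model_v0")
    print("missing files:", missing or "none")
```

An empty list means the directory should be usable for encoder-only embedding.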


Installation

pip install scimilarity

Or from source:

git clone https://github.com/Genentech/scimilarity
cd scimilarity
pip install -e .

Usage

For full usage examples including cell type annotation and similarity search, see the original SCimilarity notebooks. Simply point model_path to your local copy of this repository instead of the original model directory.

Encoder-only (no index required)

If you want to embed cells without downloading the full index:

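import scanpy as sc
from scimilarity import CellEmbedding
from scimilarity.utils import align_dataset, lognorm_counts

# Load the encoder from your local copy of this repository.
ce = CellEmbedding(model_path="/path/to/model_v0")

# Read your data, then reorder genes to match the model's expected input.
adata = sc.read_h5ad("your_data.h5ad")
adata = align_dataset(adata, ce.gene_order)
# Log-normalize the counts before embedding.
adata = lognorm_counts(adata)

# Compute the 128-dimensional embeddings and store them on the AnnData object.
embeddings = ce.get_embeddings(adata.X)
adata.obsm["X_scimilarity"] = embeddings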
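
If you skip the prebuilt indexes, the embeddings can still be searched directly. The sketch below is a brute-force cosine-similarity lookup in NumPy over a matrix such as `adata.obsm["X_scimilarity"]`. The function name `top_k_similar` and the random stand-in data are illustrative assumptions, and this is not the kNN index format shipped in cellsearch/.

```python
import numpy as np


def top_k_similar(embeddings, query, k=5):
    """Return indices and cosine similarities of the k rows most similar to query."""
    # Re-normalize defensively so the dot product equals cosine similarity.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = emb @ q
    # Sort by descending similarity and keep the first k.
    order = np.argsort(-sims)[:k]
    return order, sims[order]


if __name__ == "__main__":
    # Random stand-in for adata.obsm["X_scimilarity"] (1,000 cells x 128 dims).
    rng = np.random.default_rng(0)
    fake_embeddings = rng.normal(size=(1000, 128))
    idx, sims = top_k_similar(fake_embeddings, fake_embeddings[42], k=5)
    print(idx, sims)
```

An exhaustive scan like this is only practical for modest cell counts. For atlas-scale data, an approximate nearest-neighbor library would likely be a better fit.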

Model architecture

Parameter              Value
Input genes            28,231
Hidden layers          3 × 1,024
Embedding dimension    128
Normalization          L2 (unit hypersphere)
Loss                   Triplet (semi-hard) + MSE reconstruction
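
To make the normalization and loss rows concrete, here is a minimal NumPy sketch of a margin-based triplet loss and the usual semi-hard negative criterion on L2-normalized embeddings. The margin value, the function names, and the choice of Euclidean distance are assumptions for illustration only; the actual training settings are in hyperparameters.json and the original paper, and the MSE reconstruction term is omitted here.

```python
import numpy as np


def l2_normalize(x):
    """Project vectors onto the unit hypersphere along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def triplet_loss(anchor, positive, negative, margin=0.05):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, float(d_ap - d_an + margin))


def is_semi_hard(anchor, positive, negative, margin=0.05):
    """Semi-hard negative: farther than the positive, but still within the margin."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return bool(d_ap < d_an < d_ap + margin)


if __name__ == "__main__":
    # Toy 2-D unit vectors standing in for 128-dimensional cell embeddings.
    anchor = l2_normalize(np.array([1.0, 0.0]))
    positive = np.array([np.cos(0.10), np.sin(0.10)])
    negative = np.array([np.cos(0.12), np.sin(0.12)])
    print(is_semi_hard(anchor, positive, negative))
    print(triplet_loss(anchor, positive, negative))
```

Semi-hard negatives are the ones that still produce a positive loss while already lying farther from the anchor than the positive does.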