
scRNA-seq

Here, you’ll learn how to manage a growing number of scRNA-seq datasets as a single queryable collection:

  1. create a dataset (an Artifact) and seed a Collection (scrna1/6)

  2. append a new dataset to the collection (scrna2/6)

  3. query & analyze individual datasets (scrna3/6)

  4. load the collection into memory (scrna4/6)

  5. iterate over the collection to train an ML model (scrna5/6)

  6. concatenate the collection to a single tiledbsoma array store (scrna6/6)

If you’re only interested in using a large curated scRNA-seq collection, see the CELLxGENE guide.

# pip install lamindb
!lamin init --storage ./test-scrna --modules bionty
 initialized lamindb: testuser1/test-scrna
import lamindb as ln
import bionty as bt

ln.track()
 connected lamindb: testuser1/test-scrna
 created Transform('nNa3XkxK4lxS0000', key='scrna.ipynb'), started new Run('deaO2tdaHoWLcTgi') at 2026-01-27 17:31:57 UTC
 notebook imports: bionty==2.1.0 lamindb==2.0.1
 recommendation: to identify the notebook across renames, pass the uid: ln.track("nNa3XkxK4lxS")

Populate metadata registries based on an artifact

Let us look at the standardized data of Conde et al., Science (2022), available from CELLxGENE. anndata_human_immune_cells() loads a subsampled version:

adata = ln.core.datasets.anndata_human_immune_cells()
adata
AnnData object with n_obs × n_vars = 1648 × 36503
    obs: 'donor', 'tissue', 'cell_type', 'assay'
    var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
    uns: 'default_embedding'
    obsm: 'X_umap'

To validate & annotate a dataset, we need to define valid features.

ln.Feature(name="donor", dtype=str).save()
ln.Feature(name="tissue", dtype=bt.Tissue).save()
ln.Feature(name="cell_type", dtype=bt.CellType).save()
ln.Feature(name="assay", dtype=bt.ExperimentalFactor).save()
Feature(uid='a2LmC63z2Ozp', is_type=False, name='assay', _dtype_str='cat[bionty.ExperimentalFactor]', unit=None, description=None, array_rank=0, array_size=0, array_shape=None, synonyms=None, default_value=None, nullable=True, coerce=None, branch_id=1, space_id=1, created_by_id=3, run_id=1, type_id=None, created_at=2026-01-27 17:31:59 UTC, is_locked=False)
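
As a rough illustration of what defining features buys us, the following library-free sketch checks table columns against declared features. A feature with dtype `str` accepts any text, while a categorical feature only accepts labels present in its registry. The names `TISSUE_REGISTRY`, `features`, and `invalid_values`, as well as the example values, are hypothetical and do not reflect how lamindb implements validation.

```python
# Hypothetical stand-in for a label registry such as bt.Tissue.
TISSUE_REGISTRY = {"blood", "spleen"}

# Each feature maps a column name to a dtype: `str` for free text, or a set
# of registered labels for a categorical feature.
features = {"donor": str, "tissue": TISSUE_REGISTRY}


def invalid_values(column, values, features):
    """Return the values of `column` that its declared feature rejects."""
    dtype = features[column]
    if dtype is str:
        # Free-text feature: only the Python type is checked.
        return [v for v in values if not isinstance(v, str)]
    # Categorical feature: every value must be a registered label.
    return [v for v in values if v not in dtype]


# Illustrative metadata columns; "splen" is a deliberate typo.
obs = {"donor": ["D1", "D2"], "tissue": ["blood", "splen"]}

report = {col: invalid_values(col, vals, features) for col, vals in obs.items()}
print(report)  # {'donor': [], 'tissue': ['splen']}
```

In this sketch, the free-text column always passes, whereas the categorical column surfaces values that would need to be fixed or registered.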

Let’s attempt saving this dataset as a validated & annotated artifact.

try:
    artifact = ln.Artifact.from_anndata(
        adata, schema="ensembl_gene_ids_and_valid_features_in_obs"
    ).save()
except ln.errors.ValidationError:
    # expected: two cell type terms don't validate yet (see the output below)
    pass
 writing the in-memory object into cache
 loading artifact into memory for validation
! 2 terms not validated in feature 'cell_type' in slot 'obs': 'mucosal invariant T cell', 'animal cell'
    1 synonym found: "mucosal invariant T cell" → "mucosal-associated invariant T cell"
    → curate synonyms via: .standardize("cell_type")
    for remaining terms:
    → fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('cell_type')

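Two cell type terms aren’t validated: 'mucosal invariant T cell' is a synonym of a registered term, and 'animal cell' isn’t part of the CellType registry. Let’s standardize the first and create a record for the second.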

adata.obs["cell_type"] = bt.CellType.standardize(adata.obs["cell_type"])
bt.CellType(name="animal cell").save()
CellType(uid='2Go5sf8V9UYpwB', name='animal cell', ontology_id=None, abbr=None, synonyms=None, description=None, branch_id=1, space_id=1, created_by_id=3, run_id=1, source_id=None, created_at=2026-01-27 17:32:03 UTC, is_locked=False)
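
To make the two fixes above concrete, here is a minimal, library-free sketch of what standardizing and then registering a term achieves conceptually. The registry contents, the synonym table, and the helpers `standardize` and `unvalidated` are hypothetical illustrations, not bionty's implementation.

```python
# Hypothetical, simplified stand-in for a label registry: canonical names plus
# a synonym table. This is NOT how bionty stores or resolves terms.
REGISTRY = {"mucosal-associated invariant T cell", "classical monocyte"}
SYNONYMS = {"mucosal invariant T cell": "mucosal-associated invariant T cell"}


def standardize(values):
    """Map known synonyms to their canonical name; leave other values as-is."""
    return [SYNONYMS.get(v, v) for v in values]


def unvalidated(values, registry):
    """Return the distinct values absent from the registry, in first-seen order."""
    missing = []
    for v in values:
        if v not in registry and v not in missing:
            missing.append(v)
    return missing


observed = ["mucosal invariant T cell", "animal cell", "classical monocyte"]

# Step 0: both problematic terms fail validation.
before = unvalidated(observed, REGISTRY)

# Step 1: standardizing resolves the synonym.
standardized = standardize(observed)
after_standardize = unvalidated(standardized, REGISTRY)

# Step 2: registering the remaining term corresponds to creating a new record.
registry_extended = REGISTRY | {"animal cell"}
after_register = unvalidated(standardized, registry_extended)

print(before)             # ['mucosal invariant T cell', 'animal cell']
print(after_standardize)  # ['animal cell']
print(after_register)     # []
```

The order matters in this sketch: standardizing first means only terms that are genuinely missing end up being registered as new records.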

We can now save the dataset.

# runs ~10sec because it imports 40k Ensembl gene IDs from a public ontology
artifact = ln.Artifact.from_anndata(
    adata,
    key="datasets/conde22.h5ad",
    schema="ensembl_gene_ids_and_valid_features_in_obs",
).save()
 writing the in-memory object into cache
 loading artifact into memory for validation
 starting creation of 35459 Gene records in batches of 10000
! 1044 terms not validated in feature 'columns' in slot 'var.T': 'ENSG00000238009', 'ENSG00000230699', 'ENSG00000241180', 'ENSG00000236948', 'ENSG00000226849', 'ENSG00000272482', 'ENSG00000224621', 'ENSG00000234166', 'ENSG00000261135', 'ENSG00000264443', 'ENSG00000284602', 'ENSG00000225643', 'ENSG00000264078', 'ENSG00000237899', 'ENSG00000287400', 'ENSG00000228452', 'ENSG00000284700', 'ENSG00000242396', 'ENSG00000234810', 'ENSG00000237352', ...
    → fix typos, remove non-existent values, or save terms via: curator.slots['var.T'].cat.add_new_from('columns')
 not annotating with 35459 features for slot var.T as it exceeds 1000 (ln.settings.annotation.n_max_records)

Some Ensembl gene IDs don’t validate because they stem from an older version of Ensembl. If you want to be 100% sure that all gene identifiers are valid Ensembl IDs, you can import the genes from the older Ensembl version into the Gene registry (see ). You can also enforce this through the .var.T schema by setting schema.maximal_set=True, which prohibits any non-valid features in the dataframe.
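
The difference between the default behavior and a maximal set can be illustrated with a small library-free sketch. The function `check_features`, its `maximal_set` argument, and the placeholder identifiers are hypothetical; they only mirror the behavior described above and are not lamindb's implementation.

```python
def check_features(observed, valid, maximal_set=False):
    """Return the observed identifiers that are not in `valid`.

    With maximal_set=True, any such identifier raises instead of being
    reported, mirroring a schema that prohibits non-valid features.
    """
    unknown = [f for f in observed if f not in valid]
    if maximal_set and unknown:
        raise ValueError(f"{len(unknown)} feature(s) not in schema: {unknown}")
    return unknown


# Placeholder identifiers, not real Ensembl IDs.
valid_ids = {"GENE_A", "GENE_B"}
observed_ids = ["GENE_A", "GENE_B", "GENE_OLD"]

# Permissive mode: the dataset passes and the unknown ID is only reported.
reported = check_features(observed_ids, valid_ids)

# Strict mode: the same dataset is rejected.
try:
    check_features(observed_ids, valid_ids, maximal_set=True)
    rejected = False
except ValueError:
    rejected = True

print(reported)  # ['GENE_OLD']
print(rejected)  # True
```

The permissive mode corresponds to what happened in the cell above, where the artifact was saved despite the warning about non-validated gene IDs.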

artifact.describe()
Artifact: datasets/conde22.h5ad (0000)
├── uid: RZp8Xk9F17AOSXdT0000            run: deaO2td (scrna.ipynb)
│   kind: dataset                        otype: AnnData
│   hash: oHb_G_zCRDhZTJpW_Z5_sm         size: 54.9 MB
│   branch: main                         space: all
│   created_at: 2026-01-27 17:32:23 UTC  created_by: testuser1
│   n_observations: 1648
├── storage/path:
│   /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/RZp8Xk9F17AOSXdT0000.h5ad
├── Dataset features
│   ├── obs (4)
│   │   assay                          bionty.ExperimentalFactor            10x 3' v3, 10x 5' v1, 10x 5' v2
│   │   cell_type                      bionty.CellType                      CD16-negative, CD56-bright natural kil…
│   │   donor                          str
│   │   tissue                         bionty.Tissue                        blood, bone marrow, caecum, duodenum, …
│   └── var.T (35459 bionty.Gene.ens…
└── Labels
    └── .tissues                       bionty.Tissue                        blood, thoracic lymph node, spleen, lu…
        .cell_types                    bionty.CellType                      classical monocyte, T follicular helpe…
        .experimental_factors          bionty.ExperimentalFactor            10x 3' v3, 10x 5' v2, 10x 5' v1        

Seed a collection

Let’s create a first version of a collection that will encompass many h5ad files when more data is ingested.

Note

To see the result of the incremental growth, take a look at the CELLxGENE Census guide for an instance with ~1k h5ads and ~50 million cells.

collection = ln.Collection(artifact, key="scrna/collection1").save()

In this first version, the collection contains exactly one artifact, so the two match each other. They are nonetheless tracked independently and queryable through their own registries:

collection.describe()
Collection: scrna/collection1 (0000)
└── uid: ytmWmfb7QG4ieX6B0000            run: deaO2td (scrna.ipynb)
    branch: main                         space: all                
    created_at: 2026-01-27 17:32:23 UTC  created_by: testuser1     
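
The relationship between the two records can be pictured with a small library-free sketch: two separate registries, with the collection holding a reference to the artifact rather than a copy. The classes `ArtifactRecord` and `CollectionRecord` and the dictionary-based registries are hypothetical simplifications and not lamindb's data model.

```python
from dataclasses import dataclass, field


@dataclass
class ArtifactRecord:
    """Hypothetical, minimal stand-in for an artifact record."""
    key: str
    n_observations: int


@dataclass
class CollectionRecord:
    """Hypothetical, minimal stand-in for a collection record."""
    key: str
    artifacts: list = field(default_factory=list)


# Two separate registries, each queryable by key.
artifact_registry = {}
collection_registry = {}

artifact = ArtifactRecord(key="datasets/conde22.h5ad", n_observations=1648)
artifact_registry[artifact.key] = artifact

collection = CollectionRecord(key="scrna/collection1", artifacts=[artifact])
collection_registry[collection.key] = collection

# Each record can be looked up on its own ...
found_artifact = artifact_registry["datasets/conde22.h5ad"]
found_collection = collection_registry["scrna/collection1"]

# ... and the collection references the artifact rather than copying it.
same_object = found_collection.artifacts[0] is found_artifact
total_observations = sum(a.n_observations for a in found_collection.artifacts)

print(same_object)         # True
print(total_observations)  # 1648
```

In this picture, ingesting more data only adds references to the collection; the individual artifact records stay queryable on their own.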

Access the underlying artifacts like so:

collection.artifacts.to_dataframe()
id              1
uid             RZp8Xk9F17AOSXdT0000
key             datasets/conde22.h5ad
description     None
suffix          .h5ad
kind            dataset
otype           AnnData
size            57612943
hash            oHb_G_zCRDhZTJpW_Z5_sm
n_files         None
n_observations  1648
version_tag     None
is_latest       True
is_locked       False
created_at      2026-01-27 17:32:23.135000+00:00
branch_id       1
space_id        1
storage_id      3
run_id          1
schema_id       3
created_by_id   3

See data lineage:

collection.view_lineage()
[Figure: data lineage graph of the collection]

Finish the run and save the notebook.

ln.finish()
 finished Run('deaO2tdaHoWLcTgi') after 27s at 2026-01-27 17:32:25 UTC