
Standardize and append a dataset

Here, we’ll learn:

  • how to standardize a less well-curated dataset

  • how to append it to the growing versioned collection

import lamindb as ln
import bionty as bt

ln.track()
 connected lamindb: testuser1/test-scrna
 created Transform('5e5BRC8lFax90000', key='scrna2.ipynb'), started new Run('7MMoNatWFBCvhBME') at 2026-01-27 17:32:28 UTC
 notebook imports: bionty==2.1.0 lamindb==2.0.1
 recommendation: to identify the notebook across renames, pass the uid: ln.track("5e5BRC8lFax9")

Let’s now consider a less well-curated dataset:

adata = ln.core.datasets.anndata_pbmc68k_reduced()
# we don't trust the cell type annotation in this dataset
adata.obs.rename(columns={"cell_type": "cell_type_untrusted"}, inplace=True)
# this is our dataset
adata
AnnData object with n_obs × n_vars = 70 × 765
    obs: 'cell_type_untrusted', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

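We can’t save it in validated form yet: the obs columns aren’t validated features, and the var index holds gene symbols rather than the Ensembl gene IDs that the schema name suggests it expects.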

try:
    ln.Artifact.from_anndata(
        adata,
        key="scrna/dataset2.h5ad",
        schema="ensembl_gene_ids_and_valid_features_in_obs",
    ).save()
except SystemExit as e:
    print("Error captured:", e)
 writing the in-memory object into cache
 loading artifact into memory for validation
! 4 terms not validated in feature 'columns' in slot 'obs': 'cell_type_untrusted', 'n_genes', 'percent_mito', 'louvain'
    → fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('columns')
! no values were validated for columns!
Error captured: `organism` is required to get Source record for Gene!
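The warning above lists the obs columns that could not be validated. As a rough mental model only (this is not lamindb's implementation), validation can be pictured as a membership check of column names against a registry of known features. The `registered_features` set and the `split_validated` helper are hypothetical names introduced for this sketch; that a `cell_type` feature is already registered is inferred from the "returning feature with same name" message later in this notebook.

```python
# Hypothetical registry of known feature names (assumption for illustration).
registered_features = {"cell_type"}

# The obs columns of the dataset at this point in the notebook.
obs_columns = ["cell_type_untrusted", "n_genes", "percent_mito", "louvain"]


def split_validated(names, registry):
    """Split names into those found in the registry and those that are not."""
    validated = [name for name in names if name in registry]
    non_validated = [name for name in names if name not in registry]
    return validated, non_validated


validated, non_validated = split_validated(obs_columns, registered_features)
print("validated:", validated)
print("not validated:", non_validated)

# Once a "cell_type" column exists, it is the only one that passes the check.
validated_later, _ = split_validated(obs_columns + ["cell_type"], registered_features)
print("validated after adding cell_type:", validated_later)
```

In this toy model, none of the four original columns pass, which mirrors the "4 terms not validated" message.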

Let’s convert the gene symbols to Ensembl gene IDs via standardize(). Note that this mapping is non-unique: a symbol can match more than one Ensembl ID, and only the first match is kept because the keep parameter of .standardize() defaults to "first":

adata.var["ensembl_gene_id"] = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
    organism="human",
)
# keep the original symbols in a "symbol" column and use ensembl_gene_id as the index
adata.var.index.name = "symbol"
adata.var = adata.var.reset_index().set_index("ensembl_gene_id")
! found 6 symbols in public source: ['C9orf142', 'GPX1', 'IGLL5', 'SNORD3B-2', 'RN7SL1', 'SOD2']
  please add corresponding Gene records via: `.from_values(['C9orf142', 'GPX1', 'IGLL5', 'SNORD3B-2', 'RN7SL1', 'SOD2'])`
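To make the "first match is kept" behavior of a non-unique mapping concrete, here is a minimal pure-Python sketch. It does not call bionty; the gene symbols, the IDs, and the helpers `first_match_mapping` and `standardize_symbols` are made up for illustration. Leaving unmapped symbols unchanged is an assumption of this sketch, chosen because it matches the later warning in this notebook, where some symbols remain in the var index.

```python
# Toy (symbol, ensembl_gene_id) pairs; all values are invented for illustration.
records = [
    ("GENE_A", "ENSG_TOY_0001"),
    ("GENE_B", "ENSG_TOY_0002"),
    ("GENE_A", "ENSG_TOY_0003"),  # a second ID for the same symbol
]


def first_match_mapping(pairs):
    """Build a symbol -> ID dict that keeps only the first ID seen per symbol."""
    mapping = {}
    for symbol, gene_id in pairs:
        # setdefault only stores the value if the key is not present yet
        mapping.setdefault(symbol, gene_id)
    return mapping


def standardize_symbols(symbols, mapping):
    """Map symbols to IDs; symbols without a match are returned unchanged."""
    return [mapping.get(symbol, symbol) for symbol in symbols]


symbol_to_id = first_match_mapping(records)
standardized = standardize_symbols(["GENE_A", "GENE_B", "GENE_C"], symbol_to_id)
print(standardized)
```

Here `GENE_A` resolves to its first ID and the second one is silently dropped, which is why a non-unique mapping deserves a quick sanity check when the choice of ID matters.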

None of the cell type names are valid.

adata.obs["cell_type_untrusted"].unique()
['Dendritic cells', 'CD19+ B', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD4+/CD25 T Reg', 'CD14+ Monocytes', 'CD56+ NK', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD34+']
Categories (9, object): ['CD4+/CD25 T Reg', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD8+/CD45RA+ Naive Cytotoxic', ..., 'CD19+ B', 'CD34+', 'CD56+ NK', 'Dendritic cells']

Let’s look up the non-validated cell types using the values of the public ontology and create a mapping.

cell_types = bt.CellType.public().lookup()
name_mapping = {
    "Dendritic cells": cell_types.dendritic_cell.name,
    "CD19+ B": cell_types.b_cell_cd19_positive.name,
    "CD4+/CD45RO+ Memory": cell_types.effector_memory_cd45ra_positive_alpha_beta_t_cell_terminally_differentiated.name,
    "CD8+ Cytotoxic T": cell_types.cd8_positive_alpha_beta_cytotoxic_t_cell.name,
    "CD4+/CD25 T Reg": cell_types.cd4_positive_cd25_positive_alpha_beta_regulatory_t_cell.name,
    "CD14+ Monocytes": cell_types.cd14_positive_monocyte.name,
    "CD56+ NK": cell_types.cd56_positive_cd161_positive_immature_natural_killer_cell_human.name,
    "CD8+/CD45RA+ Naive Cytotoxic": cell_types.cd8_positive_alpha_beta_memory_t_cell_cd45ro_positive.name,
    "CD34+": cell_types.cd34_positive_cd56_positive_cd117_positive_common_innate_lymphoid_precursor_human.name,
    "CD38-positive naive B cell": cell_types.cytotoxic_t_cell.name,
}

Now standardize the cell type names using this mapping:

adata.obs["cell_type"] = adata.obs["cell_type_untrusted"].map(name_mapping)
adata.obs["cell_type"].unique()
['dendritic cell', 'B cell, CD19-positive', 'effector memory CD45RA-positive, alpha-beta T..., 'CD8-positive, alpha-beta cytotoxic T cell', 'CD4-positive, CD25-positive, alpha-beta regul..., 'CD14-positive monocyte', 'CD56-positive, CD161-positive immature natura..., 'CD8-positive, alpha-beta memory T cell, CD45R..., 'CD34-positive, CD56-positive, CD117-positive ...]
Categories (9, object): ['CD4-positive, CD25-positive, alpha-beta regul..., 'effector memory CD45RA-positive, alpha-beta T..., 'CD8-positive, alpha-beta cytotoxic T cell', 'CD8-positive, alpha-beta memory T cell, CD45R..., ..., 'B cell, CD19-positive', 'CD34-positive, CD56-positive, CD117-positive ..., 'CD56-positive, CD161-positive immature natura..., 'dendritic cell']
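Because a dictionary-based mapping turns any label without a key into a missing value, it can help to compare the labels against the mapping keys before applying it. The sketch below uses the nine labels and ten mapping keys shown in this notebook; the helper names `unmapped_labels` and `unused_keys` are hypothetical and not part of lamindb or pandas.

```python
# The untrusted labels found in the dataset (copied from the output above).
labels = [
    "Dendritic cells",
    "CD19+ B",
    "CD4+/CD45RO+ Memory",
    "CD8+ Cytotoxic T",
    "CD4+/CD25 T Reg",
    "CD14+ Monocytes",
    "CD56+ NK",
    "CD8+/CD45RA+ Naive Cytotoxic",
    "CD34+",
]

# The keys of the name mapping; the ontology names are omitted in this sketch.
mapping_keys = labels + ["CD38-positive naive B cell"]


def unmapped_labels(values, keys):
    """Labels that have no entry in the mapping and would become missing values."""
    return sorted(set(values) - set(keys))


def unused_keys(values, keys):
    """Mapping keys that never occur in the data."""
    return sorted(set(keys) - set(values))


missing = unmapped_labels(labels, mapping_keys)
unused = unused_keys(labels, mapping_keys)
print("labels without a mapping:", missing)
print("mapping keys not present in the data:", unused)
```

For this dataset every label is covered, and the only unused key is "CD38-positive naive B cell", so it has no effect on the mapped column.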

Define the corresponding feature:

ln.Feature(name="cell_type", dtype=bt.CellType).save()
 returning feature with same name: 'cell_type'
Feature(uid='Y2ZoGYqvX2hO', is_type=False, name='cell_type', _dtype_str='cat[bionty.CellType]', unit=None, description=None, array_rank=0, array_size=0, array_shape=None, synonyms=None, default_value=None, nullable=True, coerce=None, branch_id=1, space_id=1, created_by_id=3, run_id=1, type_id=None, created_at=2026-01-27 17:31:59 UTC, is_locked=False)

Save the artifact with cell type and gene annotations:

artifact_trusted = ln.Artifact.from_anndata(
    adata,
    key="scrna/dataset2.h5ad",
    description="10x reference adata, trusted cell type annotation",
    schema="ensembl_gene_ids_and_valid_features_in_obs",
).save()
artifact_trusted.describe()
 writing the in-memory object into cache
 loading artifact into memory for validation
! 4 terms not validated in feature 'columns' in slot 'obs': 'cell_type_untrusted', 'n_genes', 'percent_mito', 'louvain'
    → fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('columns')
! 11 terms not validated in feature 'columns' in slot 'var.T': 'RP11-782C8.1', 'RP11-277L2.3', 'RP11-156E8.1', 'RP3-467N11.1', 'RP11-390E23.6', 'RP11-489E7.4', 'RP11-291B21.2', 'RP11-620J15.3', 'TMBIM4-1', 'AC084018.1', 'CTD-3138B18.5'
    → fix typos, remove non-existent values, or save terms via: curator.slots['var.T'].cat.add_new_from('columns')
Artifact: scrna/dataset2.h5ad (0000)
|   description: 10x reference adata, trusted cell type annotation
├── uid: zhJzVEdVjvLdR8Ul0000            run: 7MMoNat (scrna2.ipynb)
kind: dataset                        otype: AnnData             
hash: -Finjf36qwZVQLR1AzouyA         size: 835.8 KB             
branch: main                         space: all                 
created_at: 2026-01-27 17:32:33 UTC  created_by: testuser1      
n_observations: 70                                              
├── storage/path: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/zhJzVEdVjvLdR8Ul0000.h5ad
├── Dataset features
├── obs (1)                                                                                                    
│   cell_type                      bionty.CellType                      B cell, CD19-positive, CD14-positive m…
└── var.T (754 bionty.Gene.ensem…                                                                              
    AGTRAP                         num                                                                         
    ATP5IF1                        num                                                                         
    C1QA                           num                                                                         
    C1QB                           num                                                                         
    CD52                           num                                                                         
    EFHD2                          num                                                                         
    FGR                            num                                                                         
    GALE                           num                                                                         
    HES4                           num                                                                         
    HNRNPR                         num                                                                         
    HP1BP3                         num                                                                         
    MAD2L2                         num                                                                         
    NECAP2                         num                                                                         
    PARK7                          num                                                                         
    RBP7                           num                                                                         
    SRM                            num                                                                         
    SSU72                          num                                                                         
    STMN1                          num                                                                         
    TNFRSF1B                       num                                                                         
    TNFRSF4                        num                                                                         
└── Labels
    └── .cell_types                    bionty.CellType                      CD8-positive, alpha-beta memory T cell…

Query the previous collection:

collection_v1 = ln.Collection.get(key="scrna/collection1")

Create a new version of the collection by appending the new artifact, so that the collection is sharded across the new artifact and the artifact underlying version 1:

collection_v2 = collection_v1.append(artifact_trusted).save()
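The idea of appending to a versioned collection can be pictured with a small immutable toy model: appending returns a new version and leaves the previous one untouched. This is a conceptual sketch only, not how lamindb implements `Collection.append()`. The `ToyCollection` class is hypothetical, and the name `scrna/dataset1.h5ad` for the first artifact is an assumption, since the notebook does not show it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCollection:
    """Hypothetical immutable stand-in for a versioned collection."""

    key: str
    version: int
    artifacts: tuple

    def append(self, artifact):
        """Return a new version that also contains the given artifact."""
        return ToyCollection(
            key=self.key,
            version=self.version + 1,
            artifacts=self.artifacts + (artifact,),
        )


# "scrna/dataset1.h5ad" is an assumed name for the artifact behind version 1.
toy_v1 = ToyCollection(key="scrna/collection1", version=1, artifacts=("scrna/dataset1.h5ad",))
toy_v2 = toy_v1.append("scrna/dataset2.h5ad")

print(toy_v1)
print(toy_v2)
```

Version 1 still refers to a single artifact, while version 2 spans both, which is the sense in which the new version is sharded across the two artifacts.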

View the data lineage:

collection_v2.view_lineage()
[Figure: data lineage graph of collection_v2]