Standardize and append a dataset

Here, we’ll learn how to standardize a less well-curated dataset and how to append it to the growing versioned collection.

import lamindb as ln
import bionty as bt

ln.track()

Let’s now consider a less well-curated dataset:

adata = ln.core.datasets.anndata_pbmc68k_reduced()
# we don't trust the cell type annotation in this dataset
adata.obs.rename(columns={"cell_type": "cell_type_untrusted"}, inplace=True)
# this is our dataset
adata

We can’t save it in validated form yet: the schema expects Ensembl gene ids, but the var index still holds gene symbols.

try:
    ln.Artifact.from_anndata(
        adata,
        key="scrna/dataset2.h5ad",
        schema="ensembl_gene_ids_and_valid_features_in_obs",
    ).save()
except SystemExit as e:
    print("Error captured:", e)

Let’s convert gene symbols to Ensembl ids via standardize(). Note that this mapping is non-unique: when a symbol matches several ids, only the first match is kept, because the keep parameter of .standardize() defaults to "first":

adata.var["ensembl_gene_id"] = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
    organism="human",
)
# keep the original symbols as a "symbol" column and use ensembl_gene_id as the index
adata.var.index.name = "symbol"
adata.var = adata.var.reset_index().set_index("ensembl_gene_id")
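
To make "the first match is kept" concrete, here is a minimal sketch in plain Python. It does not call bionty: the lookup table, the placeholder ids, and the helper map_keep_first are all made up for illustration, and passing unmapped symbols through unchanged is an assumption of this sketch.

```python
# Hypothetical (symbol, Ensembl id) records; the ids are placeholders, not real entries.
records = [
    ("GENE_A", "ENSG_PLACEHOLDER_1"),
    ("GENE_A", "ENSG_PLACEHOLDER_2"),  # second match for GENE_A, ignored below
    ("GENE_B", "ENSG_PLACEHOLDER_3"),
]


def map_keep_first(symbols, records):
    """Map each symbol to the first id recorded for it (illustrative helper)."""
    first_match = {}
    for symbol, ensembl_id in records:
        # setdefault only stores the id if the symbol has not been seen yet
        first_match.setdefault(symbol, ensembl_id)
    # assumption of this sketch: symbols without a match are returned unchanged
    return [first_match.get(symbol, symbol) for symbol in symbols]


mapped = map_keep_first(["GENE_A", "GENE_B", "GENE_C"], records)
print(mapped)
```

Because only one id survives per symbol, it can be worth inspecting ambiguous symbols by hand if the choice matters downstream.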

None of the cell type names are valid.

adata.obs["cell_type_untrusted"].unique()

Let’s look up the non-validated cell types using the values of the public ontology and create a mapping.

cell_types = bt.CellType.public().lookup()
name_mapping = {
    "Dendritic cells": cell_types.dendritic_cell.name,
    "CD19+ B": cell_types.b_cell_cd19_positive.name,
    "CD4+/CD45RO+ Memory": cell_types.effector_memory_cd45ra_positive_alpha_beta_t_cell_terminally_differentiated.name,
    "CD8+ Cytotoxic T": cell_types.cd8_positive_alpha_beta_cytotoxic_t_cell.name,
    "CD4+/CD25 T Reg": cell_types.cd4_positive_cd25_positive_alpha_beta_regulatory_t_cell.name,
    "CD14+ Monocytes": cell_types.cd14_positive_monocyte.name,
    "CD56+ NK": cell_types.cd56_positive_cd161_positive_immature_natural_killer_cell_human.name,
    "CD8+/CD45RA+ Naive Cytotoxic": cell_types.cd8_positive_alpha_beta_memory_t_cell_cd45ro_positive.name,
    "CD34+": cell_types.cd34_positive_cd56_positive_cd117_positive_common_innate_lymphoid_precursor_human.name,
    "CD38-positive naive B cell": cell_types.cytotoxic_t_cell.name,
}

And standardize cell type names using this name mapping:

adata.obs["cell_type"] = adata.obs["cell_type_untrusted"].map(name_mapping)
adata.obs["cell_type"].unique()
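
Mapping a pandas Series with a dictionary turns every label that is missing from the dictionary into NaN, so it can help to check coverage first. A minimal sketch in plain Python, where find_unmapped is a hypothetical helper and the labels and mapping are toy values:

```python
def find_unmapped(labels, mapping):
    """Return the labels that have no entry in the mapping, sorted (illustrative helper)."""
    return sorted(set(labels) - set(mapping))


# Toy labels and a deliberately incomplete mapping, for illustration only.
labels = ["CD19+ B", "CD56+ NK", "Dendritic cells", "CD19+ B"]
mapping = {
    "CD19+ B": "B cell, CD19-positive",
    "Dendritic cells": "dendritic cell",
}

missing = find_unmapped(labels, mapping)
print(missing)
```

On the dataset above, the same check could be run as find_unmapped(adata.obs["cell_type_untrusted"].unique(), name_mapping); an empty result means every label has a target name.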

Define the corresponding feature:

ln.Feature(name="cell_type", dtype=bt.CellType).save()

Save the artifact with cell type and gene annotations:

artifact_trusted = ln.Artifact.from_anndata(
    adata,
    key="scrna/dataset2.h5ad",
    description="10x reference adata, trusted cell type annotation",
    schema="ensembl_gene_ids_and_valid_features_in_obs",
).save()
artifact_trusted.describe()


→ loading artifact into memory for validation

! 4 terms not validated in feature 'columns' in slot 'obs': 'louvain', 'percent_mito', 'n_genes', 'cell_type_untrusted'
    → fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('columns')

! 11 terms not validated in feature 'columns' in slot 'var.T': 'RP11-782C8.1', 'RP11-277L2.3', 'RP11-156E8.1', 'RP3-467N11.1', 'RP11-390E23.6', 'RP11-489E7.4', 'RP11-291B21.2', 'RP11-620J15.3', 'TMBIM4-1', 'AC084018.1', 'CTD-3138B18.5'
    → fix typos, remove non-existent values, or save terms via: curator.slots['var.T'].cat.add_new_from('columns')

Artifact: scrna/dataset2.h5ad (0000)
|   description: 10x reference adata, trusted cell type annotation
├── uid: kbsjNgsCRLgH5f1L0000            run: 7eOKbyZ (scrna2.ipynb)
│   kind: dataset                        otype: AnnData             
│   hash: -Finjf36qwZVQLR1AzouyA         size: 835.8 KB             
│   branch: main                         space: all                 
│   created_at: 2026-03-10 10:11:42 UTC  created_by: testuser1      
│   n_observations: 70                                              
├── storage/path: 
│   /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/kbsjNgsCRLgH5f1L0000.h5ad
├── Dataset features
│   ├── obs (1)                                                                                                    
│   │   cell_type                      bionty.CellType                      B cell, CD19-positive, CD14-positive m…
│   └── var.T (754 bionty.Gene.ensem…                                                                              
│       AGTRAP                         num                                                                         
│       ATP5IF1                        num                                                                         
│       C1QA                           num                                                                         
│       C1QB                           num                                                                         
│       CD52                           num                                                                         
│       EFHD2                          num                                                                         
│       FGR                            num                                                                         
│       GALE                           num                                                                         
│       HES4                           num                                                                         
│       HNRNPR                         num                                                                         
│       HP1BP3                         num                                                                         
│       MAD2L2                         num                                                                         
│       NECAP2                         num                                                                         
│       PARK7                          num                                                                         
│       RBP7                           num                                                                         
│       SRM                            num                                                                         
│       SSU72                          num                                                                         
│       STMN1                          num                                                                         
│       TNFRSF1B                       num                                                                         
│       TNFRSF4                        num                                                                         
└── Labels
    └── .cell_types                    bionty.CellType                      CD8-positive, alpha-beta memory T cell…

Query the previous collection:

collection_v1 = ln.Collection.get(key="scrna/collection1")

Create a new version of the collection by sharding it across the new artifact and the artifact underlying version 1 of the collection:

collection_v2 = collection_v1.append(artifact_trusted).save()
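
As a mental model only, appending can be pictured as creating a new version that references the old artifacts plus the new one, while version 1 stays untouched. The sketch below is a toy in plain Python, not lamindb's implementation: ToyCollection and the artifact file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCollection:
    """Toy stand-in for a versioned collection (not the lamindb class)."""

    key: str
    artifacts: tuple
    version: int = 1

    def append(self, artifact):
        # return a new version instead of mutating the existing one
        return ToyCollection(self.key, self.artifacts + (artifact,), self.version + 1)


# Placeholder artifact names, for illustration only.
v1 = ToyCollection(key="scrna/collection1", artifacts=("dataset1.h5ad",))
v2 = v1.append("dataset2.h5ad")
print(v1.artifacts, v2.artifacts)
```

The frozen dataclass makes the point explicit: the earlier version cannot be changed in place, so both versions remain available side by side.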

View the data lineage:

collection_v2.view_lineage()