Curate datasets¶

Data curation with LaminDB ensures your datasets are validated and queryable. This guide shows you how to transform data into clean, annotated datasets.

Curating a dataset with LaminDB means three things:

Validate that the dataset matches a desired schema.
Standardize the dataset (e.g., by fixing typos, mapping synonyms) or update registries if validation fails.
Annotate the dataset by linking it against metadata entities so that it becomes queryable.
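
The three steps above can be pictured with a toy in-memory registry. This is only a sketch in plain Python: `REGISTRY`, `SYNONYMS`, and the three helper functions are hypothetical names chosen for illustration and are not part of the LaminDB API.

```python
# Toy stand-ins for a registry of valid labels and their known synonyms.
# These names are hypothetical and are not part of the LaminDB API.
REGISTRY = {"B cell", "T cell"}
SYNONYMS = {"B-cell": "B cell", "T-cell": "T cell"}


def validate(values):
    """Return the values that are missing from the registry, in first-seen order."""
    return [v for v in dict.fromkeys(values) if v not in REGISTRY]


def standardize(values):
    """Map known synonyms onto their registered names; leave other values untouched."""
    return [SYNONYMS.get(v, v) for v in values]


def annotate(values):
    """Collect the distinct registered labels that a dataset can be queried by."""
    return sorted(set(values) & REGISTRY)


column = ["B-cell", "T cell", "T cell"]
non_validated = validate(column)  # "B-cell" is not registered
column = standardize(column)      # "B-cell" becomes "B cell"
labels = annotate(column)         # labels to link to the dataset
print(non_validated, column, labels)
```

After standardization, a second call to `validate` returns an empty list, which is the point at which annotation becomes meaningful.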

In this guide we’ll curate common data structures. Here is a guide for the underlying low-level API.

Note: If you know either pydantic or pandera, here is an FAQ that compares LaminDB with both of these tools.

# pip install lamindb
!lamin init --storage ./test-curate --modules bionty

import lamindb as ln

ln.track()

Schema design patterns¶

A Schema in LaminDB is a specification that defines the expected structure, data types, and validation rules for a dataset. It is similar to pydantic.BaseModel for dictionaries and to pandera.Schema and pyarrow.lib.Schema for tables, but it supports more complex data structures.

Schemas ensure data consistency by defining:

What Features (dimensions) exist in your dataset
What data types those features should have
What values are valid for categorical features
Which Features are required vs optional
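
To make the four points above concrete, here is a minimal sketch of what such a specification captures, written with a plain dataclass. `FeatureSpec` and `check_record` are hypothetical illustrations and do not mirror how LaminDB implements `ln.Feature` or `ln.Schema`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical stand-in for one schema entry; not a LaminDB class."""

    name: str
    dtype: type
    categories: frozenset = frozenset()  # empty means "any value of dtype"
    required: bool = True


def check_record(record, specs):
    """Return a list of human-readable problems found in one record (a dict)."""
    problems = []
    for spec in specs:
        if spec.name not in record:
            if spec.required:
                problems.append(f"missing required feature '{spec.name}'")
            continue
        value = record[spec.name]
        if not isinstance(value, spec.dtype):
            problems.append(f"'{spec.name}' should be {spec.dtype.__name__}")
        elif spec.categories and value not in spec.categories:
            problems.append(f"'{value}' is not a valid value for '{spec.name}'")
    return problems


specs = [
    FeatureSpec("cell_type", str, frozenset({"B cell", "T cell"})),
    FeatureSpec("treatment", str),
    FeatureSpec("donor", str, required=False),
]
ok = check_record({"cell_type": "B cell", "treatment": "DMSO"}, specs)
bad = check_record({"cell_type": "B-cell"}, specs)
print(ok)
print(bad)
```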

An example schema:

import bionty as bt

schema = ln.Schema(
    name="experiment_schema",           # human-readable name
    features=[                          # required features
        ln.Feature(name="cell_type", dtype=bt.CellType),
        ln.Feature(name="treatment", dtype=str),
    ],
    otype="DataFrame",                  # object type (DataFrame, AnnData, etc.)
)

For composite data structures using slots:

# AnnData with multiple "slots"
adata_schema = ln.Schema(
    otype="AnnData",
    slots={
        "obs": cell_metadata_schema,     # cell annotations
        "var.T": gene_id_schema          # gene-derived features  
    }
)

Before diving into curation, let’s understand the different schema approaches and when to use each one. Think of schemas as rules that define what valid data should look like.

Flexible schema¶

Use when: You want to validate those columns whose names match feature names in your Feature registry.

import lamindb as ln

schema = ln.Schema(name="valid_features", itype=ln.Feature).save()

Minimal required schema¶

Use when: You need certain columns but want flexibility for additional metadata.

import lamindb as ln

schema = ln.Schema(
    name="Mini immuno schema",
    features=[
        ln.Feature.get(name="perturbation"),
        ln.Feature.get(name="cell_type_by_model"),
        ln.Feature.get(name="assay_oid"),
        ln.Feature.get(name="donor"),
        ln.Feature.get(name="concentration"),
        ln.Feature.get(name="treatment_time_h"),
    ],
    flexible=True,  # _additional_ columns in a dataframe are validated & annotated
).save()

Strict schema¶

Use when: You need complete control over data structure and values.

# Only allows the specified columns
schema = ln.Schema(
    features=[...],
    minimal_set=True,  # all passed features are required
    maximal_set=True,  # no additional features are allowed
)
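
The difference between the three approaches can be read as a comparison of two sets of names: the dataframe's columns and the schema's features. The sketch below assumes that `minimal_set=True` means every listed feature must be present and that `maximal_set=True` means no other column may appear; `check_columns` is a hypothetical helper, not the LaminDB implementation.

```python
def check_columns(columns, features, minimal_set=True, maximal_set=False):
    """Compare dataframe column names with schema feature names.

    Assumed semantics (hypothetical helper, not the LaminDB implementation):
    - minimal_set=True: every listed feature must be present
    - maximal_set=True: no column outside the listed features may be present
    """
    columns, features = set(columns), set(features)
    errors = []
    if minimal_set and (missing := features - columns):
        errors.append(f"missing: {sorted(missing)}")
    if maximal_set and (extra := columns - features):
        errors.append(f"unexpected: {sorted(extra)}")
    return errors


columns = ["perturbation", "donor", "sample_note"]
features = ["perturbation", "donor", "concentration"]

print(check_columns(columns, features, minimal_set=False))  # nothing enforced
print(check_columns(columns, features))                     # required features only
print(check_columns(columns, features, maximal_set=True))   # strict
```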

DataFrame¶

Step 1: Load and examine your data¶

We’ll be working with the mini immuno dataset:

df = ln.examples.datasets.mini_immuno.get_dataset1(
    with_cell_type_synonym=True, with_cell_type_typo=True
)
df

	ENSG00000153563	ENSG00000010610	ENSG00000170458	perturbation	sample_note	cell_type_by_expert	cell_type_by_model	assay_oid	concentration	treatment_time_h	donor	donor_ethnicity
sample1	1	3	5	DMSO	was ok	B-cell	B cell	EFO:0008913	0.1%	24	D0001	[Chinese, Singaporean Chinese]
sample2	2	4	6	IFNG	looks naah	CD8-pos alpha-beta T cell	T cell	EFO:0008913	200 nM	24	D0002	[Chinese, Han Chinese]
sample3	3	5	7	DMSO	pretty! 🤩	CD8-pos alpha-beta T cell	T cell	EFO:0008913	0.1%	6	None	[Chinese]

Step 2: Set up your metadata registries¶

Before creating a schema, ensure your registries have the right features and labels:

import bionty as bt

import lamindb as ln

# define valid labels
perturbation_type = ln.Record(name="Perturbation", is_type=True).save()
ln.Record(name="DMSO", type=perturbation_type).save()
ln.Record(name="IFNG", type=perturbation_type).save()
bt.CellType.from_source(name="B cell").save()
bt.CellType.from_source(name="T cell").save()

# define valid features
ln.Feature(name="perturbation", dtype=perturbation_type).save()
ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save()
ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save()
ln.Feature(name="assay_oid", dtype=bt.ExperimentalFactor.ontology_id).save()
ln.Feature(name="concentration", dtype=str).save()
ln.Feature(name="treatment_time_h", dtype="num", coerce=True).save()
ln.Feature(name="donor", dtype=str, nullable=True).save()
ln.Feature(name="donor_ethnicity", dtype=list[bt.Ethnicity]).save()

Step 3: Create your schema¶

schema = ln.examples.datasets.mini_immuno.define_mini_immuno_schema_flexible()
schema.describe()

Schema: Mini immuno schema
├── uid: RPYPLcLtrZ7RQd6j                run: GC6WlxC (curate.ipynb)
│   itype: Feature                       otype: None                
│   hash: Vk8JF50Quo76up6KbLaFMg         ordered_set: False         
│   maximal_set: False                   minimal_set: True          
│   branch: main                         space: all                 
│   created_at: 2026-01-26 16:02:08 UTC  created_by: testuser1      
└── Features (6)
    └── name                dtype                                  optional  nullable  coerce  default_value
        perturbation        Record[hVKTH8inNfQ5YCoB]               ✗         ✓         ✗       unset        
        cell_type_by_model  bionty.CellType                        ✗         ✓         ✗       unset        
        assay_oid           bionty.ExperimentalFactor.ontology_id  ✗         ✓         ✗       unset        
        donor               str                                    ✗         ✓         ✗       unset        
        concentration       str                                    ✗         ✓         ✗       unset        
        treatment_time_h    num                                    ✗         ✓         ✓       unset

Step 4: Initialize Curator and first validation¶

If you expect the validation to pass, you can directly register an artifact by providing the schema:

artifact = ln.Artifact.from_dataframe(df, key="examples/my_curated_dataset.parquet", schema=schema).save()

The validate() method validates that your dataset adheres to the criteria defined by the schema. It identifies which values are already validated (exist in the registries) and which are potentially problematic (do not yet exist in our registries).

try:
    curator = ln.curators.DataFrameCurator(df, schema)
    curator.validate()
except ln.errors.ValidationError as error:
    print(error)

Step 5: Fix validation issues¶

# check the non-validated terms
curator.cat.non_validated

For cell_type_by_expert, we saw that 2 terms are not validated.

First, let’s standardize the synonym “B-cell” as suggested:

curator.cat.standardize("cell_type_by_expert")

# now we have only one non-validated cell type left
curator.cat.non_validated

For “CD8-pos alpha-beta T cell”, let’s understand which cell type in the public ontology might be the actual match.

# to check the correct spelling of categories, pass `public=True` to get a lookup object from public ontologies
# use `lookup = curator.cat.lookup()` to get a lookup object of existing records in your instance
lookup = curator.cat.lookup(public=True)
lookup

# here is an example for the "cell_type_by_expert" column
cell_types = lookup["cell_type_by_expert"]
cell_types.cd8_positive_alpha_beta_t_cell

# fix the cell type name
df["cell_type_by_expert"] = df["cell_type_by_expert"].cat.rename_categories(
    {"CD8-pos alpha-beta T cell": cell_types.cd8_positive_alpha_beta_t_cell.name}
)
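
When a typo is not a registered synonym, a fuzzy string match against the known names can narrow down the candidates before you confirm the term in the lookup object. The sketch below uses Python's standard `difflib`; this is a swapped-in technique for illustration and not what `curator.cat.lookup()` does internally. The `suggest` helper and the short candidate list are assumptions.

```python
from difflib import get_close_matches

# A few cell type names as they appear in the Step 6 output of this guide;
# a real lookup would draw on the full public ontology.
ontology_names = ["B cell", "T cell", "CD8-positive, alpha-beta T cell"]


def suggest(term, candidates, cutoff=0.6):
    """Return the closest candidate name for a term, or None if nothing is close."""
    matches = get_close_matches(term, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


suggestion = suggest("CD8-pos alpha-beta T cell", ontology_names)
print(suggestion)
```

A suggestion found this way is only a hint; the ontology term should still be checked by a person before renaming categories.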

For perturbation, we want to add the new values: “DMSO”, “IFNG”

# this adds perturbations that were _not_ validated
curator.cat.add_new_from("perturbation")

ln.Feature.get(name="perturbation")

Feature(uid='1Bdw0JapwgGp', is_type=False, name='perturbation', _dtype_str='cat[Record[hVKTH8inNfQ5YCoB]]', unit=None, description=None, array_rank=0, array_size=0, array_shape=None, synonyms=None, default_value=None, nullable=True, coerce=None, branch_id=1, space_id=1, created_by_id=3, run_id=1, type_id=None, created_at=2026-01-26 16:02:08 UTC, is_locked=False)

# validate again
curator.validate()

Step 6: Save your curated dataset¶

artifact = curator.save_artifact(key="examples/my_curated_dataset.parquet")

artifact.describe()

Artifact: examples/my_curated_dataset.parquet (0000)
├── uid: VCE4d2GSrHRsfmlN0000            run: GC6WlxC (curate.ipynb)
│   kind: dataset                        otype: DataFrame           
│   hash: xnLdi2kUCdOAe61uR8O7CA         size: 10.1 KB              
│   branch: main                         space: all                 
│   created_at: 2026-01-26 16:02:11 UTC  created_by: testuser1      
│   n_observations: 3                                               
├── storage/path: /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/VCE4d2GSrHRsfmlN0000.parquet
├── Dataset features
│   └── columns (8)                                                                                                
│       assay_oid                      bionty.ExperimentalFactor.ontology…  EFO:0008913                            
│       cell_type_by_expert            bionty.CellType                      B cell, CD8-positive, alpha-beta T cell
│       cell_type_by_model             bionty.CellType                      B cell, T cell                         
│       concentration                  str                                                                         
│       donor                          str                                                                         
│       donor_ethnicity                list[bionty.Ethnicity]               ['Chinese', 'Singaporean Chinese', 'Ha…
│       perturbation                   Record[Perturbation]                 DMSO, IFNG                             
│       treatment_time_h               num                                                                         
└── Labels
    └── .records                       Record                               DMSO, IFNG                             
        .cell_types                    bionty.CellType                      B cell, T cell, CD8-positive, alpha-be…
        .experimental_factors          bionty.ExperimentalFactor            single-cell RNA sequencing             
        .ethnicities                   bionty.Ethnicity                     Chinese, Singaporean Chinese, Han Chin…

Common fixes¶

This section covers the most frequent curation issues and their solutions. Use this as a reference when validation fails.

Feature validation issues¶

Issue: “Column not in dataframe”

"column 'treatment' not in dataframe. Columns in dataframe: ['drug', 'timepoint', ...]"

Solutions:

# Solution 1: Rename dataframe columns to match the schema
df = df.rename(columns={
    'drug': 'treatment',
    'timepoint': 'time',
    # ...
})

# Solution 2: Create missing columns
df['treatment'] = 'unknown'  # Add with default value (or define Feature.default_value)

# Solution 3: Modify schema to match your data
schema = ln.Schema(
    features=[
        ln.Feature.get(name="drug"),  # Use actual column name
        ln.Feature.get(name="timepoint"),
    ],
    # ...
)

Value validation issues¶

Issue: “Terms not validated in feature ‘cell_type’”

2 terms not validated in feature 'cell_type': 'B-cell', 'CD8-pos alpha-beta T cell'
    1 synonym found: "B-cell" → "B cell"
    → curate synonyms via: .standardize("cell_type")
    for remaining terms:
    → fix typos, remove non-existent values, or save terms via: curator.cat.add_new_from('cell_type')

Solutions:

# Solution 1: Use automatic standardization if the message suggests it (handles synonyms)
curator.cat.standardize('cell_type')

# Solution 2: Manual mapping for complex cases
value_mapping = {
    'T-cells': 'T cell',
    'B-cells': 'B cell',
}
df['cell_type'] = df['cell_type'].map(value_mapping).fillna(df['cell_type'])

# Solution 3: Use public ontology lookup for correct names
lookup = curator.cat.lookup(public=True)
cell_types = lookup["cell_type"]
df['cell_type'] = df['cell_type'].cat.rename_categories({
    'CD8-pos alpha-beta T cell': cell_types.cd8_positive_alpha_beta_t_cell.name
})

# Solution 4: Add new legitimate terms
curator.cat.add_new_from("cell_type")
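
The manual mapping in Solution 2 relies on a fallback: values without an entry in the mapping are kept as they are. In plain Python the same idea looks as follows; the example values and the `needs_review` list are illustrative assumptions.

```python
value_mapping = {"T-cells": "T cell", "B-cells": "B cell"}
values = ["T-cells", "B cell", "B-cells", "NK cell"]

# dict.get(v, v) keeps values that have no entry in the mapping,
# which is the same effect as map(...).fillna(...) on a pandas column
fixed = [value_mapping.get(v, v) for v in values]

# values that are not among the mapping targets still need a decision
known = set(value_mapping.values())
needs_review = [v for v in fixed if v not in known]
print(fixed, needs_review)
```

Anything left in `needs_review` is a candidate either for a further fix or for being added as a new legitimate term.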

Data type issues¶

Issue: “Expected categorical data, got object”

TypeError: Expected categorical data for cell_type, got object

Solutions:

# Solution 1: Convert to categorical
df['cell_type'] = df['cell_type'].astype('category')

# Solution 2: Use coercion in feature definition
ln.Feature(name="cell_type", dtype=bt.CellType, coerce=True).save()
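
Coercion generally means attempting a conversion to the declared type before checking values, instead of rejecting data stored under a different type. The sketch below shows that idea for numeric values in plain Python; `coerce_numeric` is a hypothetical helper and does not describe how LaminDB applies `coerce=True`.

```python
def coerce_numeric(values):
    """Convert values to float where possible and report those that cannot be converted.

    Hypothetical helper illustrating coercion; not part of LaminDB.
    """
    coerced, failed = [], []
    for value in values:
        try:
            coerced.append(float(value))
        except (TypeError, ValueError):
            coerced.append(None)
            failed.append(value)
    return coerced, failed


coerced, failed = coerce_numeric(["24", 24, "6", "0.1%"])
print(coerced, failed)
```

Values such as "0.1%" cannot be coerced, which is why a column like concentration is better declared as a string feature.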

External data validation¶

Since not all metadata is always stored within the dataset itself, it is also possible to validate external metadata.

curate_dataframe_external_features.py¶

import lamindb as ln
from datetime import date

df = ln.examples.datasets.mini_immuno.get_dataset1(otype="DataFrame")

temperature = ln.Feature(name="temperature", dtype=float).save()
date_of_study = ln.Feature(name="date_of_study", dtype=date).save()
external_schema = ln.Schema(features=[temperature, date_of_study]).save()

concentration = ln.Feature(name="concentration", dtype=str).save()
donor = ln.Feature(name="donor", dtype=str, nullable=True).save()
schema = ln.Schema(
    features=[concentration, donor],
    slots={"__external__": external_schema},
    otype="DataFrame",
).save()

artifact = ln.Artifact.from_dataframe(
    df,
    key="examples/dataset1.parquet",
    features={"temperature": 21.6, "date_of_study": date(2024, 10, 1)},
    schema=schema,
).save()
artifact.describe()

!python scripts/curate_dataframe_external_features.py

→ connected lamindb: testuser1/test-curate

→ returning feature with same name: 'concentration'
→ returning feature with same name: 'donor'
! no run & transform got linked, call `ln.track()` & re-run
→ writing the in-memory object into cache
→ returning artifact with same hash: Artifact(uid='VCE4d2GSrHRsfmlN0000', version_tag=None, is_latest=True, key='examples/my_curated_dataset.parquet', description=None, suffix='.parquet', kind='dataset', otype='DataFrame', size=10354, hash='xnLdi2kUCdOAe61uR8O7CA', n_files=None, n_observations=3, branch_id=1, space_id=1, storage_id=3, run_id=1, schema_id=1, created_by_id=3, created_at=2026-01-26 16:02:11 UTC, is_locked=False); to track this artifact as an input, use: ln.Artifact.get()
! key examples/my_curated_dataset.parquet on existing artifact differs from passed key examples/dataset1.parquet, keeping original key; update manually if needed or pass skip_hash_lookup if you want to duplicate the artifact
→ loading artifact into memory for validation

! 4 terms not validated in feature 'columns': 'sample_note', 'ENSG00000153563', 'ENSG00000010610', 'ENSG00000170458'
    → fix typos, remove non-existent values, or save terms via: curator.cat.add_new_from('columns')

Artifact: examples/my_curated_dataset.parquet (0000)
├── uid: VCE4d2GSrHRsfmlN0000            run: GC6WlxC (curate.ipynb)
│   kind: dataset                        otype: DataFrame           
│   hash: xnLdi2kUCdOAe61uR8O7CA         size: 10.1 KB              
│   branch: main                         space: all                 
│   created_at: 2026-01-26 16:02:11 UTC  created_by: testuser1      
│   n_observations: 3                                               
├── storage/path: 
│   /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/VCE4d2GSrHRsfmlN
│   0000.parquet
├── Dataset features
│   └── columns (2)                                                             
│       concentration       str                                                 
│       donor               str                                                 
├── External features
│   └── assay_oid           bionty.ExperimentalFac…  EFO:0008913                
│       cell_type_by_expe…  bionty.CellType          B cell, CD8-positive, alph…
│       cell_type_by_model  bionty.CellType          B cell, T cell             
│       donor_ethnicity     list[bionty.Ethnicity]   ['Chinese', 'Singaporean C…
│       perturbation        Record[Perturbation]     DMSO, IFNG                 
│       date_of_study       date                     2024-10-01                 
│       temperature         float                    21.6                       
└── Labels
    └── .records            Record                   DMSO, IFNG                 
        .cell_types         bionty.CellType          B cell, T cell, CD8-positi…
        .experimental_fac…  bionty.ExperimentalFac…  single-cell RNA sequencing 
        .ethnicities        bionty.Ethnicity         Chinese, Singaporean Chine…
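
Conceptually, validating external features amounts to checking a dictionary of values against the expected names and types. The following plain-Python sketch mirrors the `temperature` and `date_of_study` features from the script above; `check_external` is a hypothetical helper and not the LaminDB implementation.

```python
from datetime import date

# Expected external features and their types, mirroring the script above.
external_spec = {"temperature": float, "date_of_study": date}


def check_external(features, spec):
    """Return problems found when comparing a features dict against a type spec."""
    problems = [f"missing '{name}'" for name in spec if name not in features]
    for name, value in features.items():
        if name not in spec:
            problems.append(f"unexpected '{name}'")
        elif not isinstance(value, spec[name]):
            problems.append(f"'{name}' should be {spec[name].__name__}")
    return problems


good = check_external(
    {"temperature": 21.6, "date_of_study": date(2024, 10, 1)}, external_spec
)
bad = check_external({"temperature": "warm"}, external_spec)
print(good, bad)
```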

AnnData¶

AnnData, like all other data structures that follow, is a composite structure that stores different arrays in different slots.

Allow a flexible schema¶

We can also allow a flexible schema for an AnnData and only require that it’s indexed with Ensembl gene IDs.

curate_anndata_flexible.py¶

import lamindb as ln

ln.examples.datasets.mini_immuno.define_features_labels()
adata = ln.examples.datasets.mini_immuno.get_dataset1(otype="AnnData")
artifact = ln.Artifact.from_anndata(
    adata,
    key="examples/mini_immuno.h5ad",
    schema="ensembl_gene_ids_and_valid_features_in_obs",
).save()
artifact.describe()

Let’s run the script.

!python scripts/curate_anndata_flexible.py

→ connected lamindb: testuser1/test-curate

→ returning record with same name: 'Perturbation'
→ returning record with same name: 'DMSO'
→ returning record with same name: 'IFNG'
→ returning feature with same name: 'perturbation'
→ returning feature with same name: 'cell_type_by_expert'
→ returning feature with same name: 'cell_type_by_model'
→ returning feature with same name: 'assay_oid'
→ returning feature with same name: 'concentration'

→ returning feature with same name: 'treatment_time_h'
→ returning feature with same name: 'donor'
→ returning feature with same name: 'donor_ethnicity'

! no run & transform got linked, call `ln.track()` & re-run
→ writing the in-memory object into cache

→ loading artifact into memory for validation

! 1 term not validated in feature 'columns' in slot 'obs': 'sample_note'
    → fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('columns')

Artifact: examples/mini_immuno.h5ad (0000)
├── uid: LVhVndXnXz7JQt4c0000            run:                 
│   kind: dataset                        otype: AnnData       
│   hash: FB3CeMjmg1ivN6HDy6wsSg         size: 30.9 KB        
│   branch: main                         space: all           
│   created_at: 2026-01-26 16:02:23 UTC  created_by: testuser1
│   n_observations: 3                                         
├── storage/path: 
│   /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/LVhVndXnXz7JQt4c
│   0000.h5ad
├── Dataset features
│   ├── obs (7)                                                                 
│   │   assay_oid           bionty.ExperimentalFac…  EFO:0008913                
│   │   cell_type_by_expe…  bionty.CellType          B cell, CD8-positive, alph…
│   │   cell_type_by_model  bionty.CellType          B cell, T cell             
│   │   concentration       str                                                 
│   │   donor               str                                                 
│   │   perturbation        Record[Perturbation]     DMSO, IFNG                 
│   │   treatment_time_h    num                                                 
│   └── var.T (3 bionty.G…                                                      
│       CD14                num                                                 
│       CD4                 num                                                 
│       CD8A                num                                                 
└── Labels
    └── .records            Record                   DMSO, IFNG                 
        .cell_types         bionty.CellType          B cell, T cell, CD8-positi…
        .experimental_fac…  bionty.ExperimentalFac…  single-cell RNA sequencing 

Under the hood, this uses the following built-in schema (anndata_ensembl_gene_ids_and_valid_features_in_obs()):

import bionty as bt

import lamindb as ln

obs_schema = ln.examples.schemas.valid_features()
varT_schema = ln.Schema(
    name="valid_ensembl_gene_ids", itype=bt.Gene.ensembl_gene_id
).save()
schema = ln.Schema(
    name="anndata_ensembl_gene_ids_and_valid_features_in_obs",
    otype="AnnData",
    slots={"obs": obs_schema, "var.T": varT_schema},
).save()

This schema transposes the var DataFrame during curation, so that one validates and annotates the columns of var.T, i.e., [ENSG00000153563, ENSG00000010610, ENSG00000170458]. Without the transposition, one would annotate the columns of var, i.e., [gene_symbol, gene_type].

https://lamin-site-assets.s3.amazonaws.com/.lamindb/gLyfToATM7WUzkWW0001.png
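
The effect of the transposition can be reproduced with a toy table stored as nested dictionaries. The Ensembl IDs and the column names `gene_symbol` and `gene_type` come from the text above; the cell values are placeholders, and `columns_of` and `transpose` are hypothetical helpers.

```python
def columns_of(table):
    """Column labels of a table stored as {row label: {column label: value}}."""
    return list(next(iter(table.values())))


def transpose(table):
    """Swap rows and columns of a {row: {column: value}} table."""
    transposed = {}
    for row, cells in table.items():
        for column, value in cells.items():
            transposed.setdefault(column, {})[row] = value
    return transposed


# a toy `var` table indexed by Ensembl gene IDs; cell values are placeholders
var = {
    "ENSG00000153563": {"gene_symbol": "symbol_a", "gene_type": "type_a"},
    "ENSG00000010610": {"gene_symbol": "symbol_b", "gene_type": "type_b"},
    "ENSG00000170458": {"gene_symbol": "symbol_c", "gene_type": "type_c"},
}
print(columns_of(var))             # columns of var
print(columns_of(transpose(var)))  # columns of var.T
```

Only the second list contains gene identifiers, which is why the gene schema is attached to the "var.T" slot.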

Fix validation issues¶

adata = ln.examples.datasets.mini_immuno.get_dataset1(
    with_gene_typo=True, with_cell_type_typo=True, otype="AnnData"
)
adata

Check the slots of a schema:

schema.slots

curator = ln.curators.AnnDataCurator(adata, schema)
try:
    curator.validate()
except ln.errors.ValidationError as error:
    print(error)

As above, we leverage a lookup object with valid cell types to find the correct name.

valid_cell_types = curator.slots["obs"].cat.lookup()["cell_type_by_expert"]
adata.obs["cell_type_by_expert"] = adata.obs[
    "cell_type_by_expert"
].cat.rename_categories(
    {"CD8-pos alpha-beta T cell": valid_cell_types.cd8_positive_alpha_beta_t_cell.name}
)

The validated AnnData can be subsequently saved as an Artifact:

adata.obs.columns

Index(['perturbation', 'sample_note', 'cell_type_by_expert',
       'cell_type_by_model', 'assay_oid', 'concentration', 'treatment_time_h',
       'donor'],
      dtype='object')

curator.slots["var.T"].cat.add_new_from("columns")

! 1 term not validated in feature 'columns' in slot 'var.T': 'GeneTypo'
    → fix typos, remove non-existent values, or save terms via: curator.slots['var.T'].cat.add_new_from('columns')

curator.validate()

! 1 term not validated in feature 'columns' in slot 'obs': 'sample_note'
    → fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('columns')

artifact = curator.save_artifact(key="examples/my_curated_anndata.h5ad")

Access the schema for each slot:

artifact.features.slots

The saved artifact has been annotated with validated features and labels:

artifact.describe()

Artifact: examples/my_curated_anndata.h5ad (0000)
├── uid: q83qW513hnXIoCQo0000            run: GC6WlxC (curate.ipynb)
│   kind: dataset                        otype: AnnData             
│   hash: yeNWx0-dOGGkANQbocU4Sg         size: 30.9 KB              
│   branch: main                         space: all                 
│   created_at: 2026-01-26 16:02:28 UTC  created_by: testuser1      
│   n_observations: 3                                               
├── storage/path: /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/q83qW513hnXIoCQo0000.h5ad
├── Dataset features
│   ├── obs (7)                                                                                                    
│   │   assay_oid                      bionty.ExperimentalFactor.ontology…  EFO:0008913                            
│   │   cell_type_by_expert            bionty.CellType                      B cell, CD8-positive, alpha-beta T cell
│   │   cell_type_by_model             bionty.CellType                      B cell, T cell                         
│   │   concentration                  str                                                                         
│   │   donor                          str                                                                         
│   │   perturbation                   Record[Perturbation]                 DMSO, IFNG                             
│   │   treatment_time_h               num                                                                         
│   └── var.T (3 bionty.Gene.ensembl…                                                                              
│       CD4                            num                                                                         
│       CD8A                           num                                                                         
└── Labels
    └── .records                       Record                               DMSO, IFNG                             
        .cell_types                    bionty.CellType                      B cell, T cell, CD8-positive, alpha-be…
        .experimental_factors          bionty.ExperimentalFactor            single-cell RNA sequencing

Unstructured dictionaries¶

Most data structures support unstructured metadata stored as dictionaries:

Pandas DataFrames: .attrs
AnnData: .uns
MuData: .uns and modality:uns
SpatialData: .attrs

Here, we show by example how to curate such metadata for AnnData:

define_schema_anndata_uns.py¶

import lamindb as ln

from define_schema_df_metadata import study_metadata_schema

anndata_uns_schema = ln.Schema(
    otype="AnnData",
    slots={
        "uns:study_metadata": study_metadata_schema,
    },
).save()

!python scripts/define_schema_anndata_uns.py
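
A slot key such as "uns:study_metadata" can be read as a path into the object: first the `uns` dictionary, then its `study_metadata` entry. The sketch below mimics that reading with plain dictionaries. `resolve_slot` is a hypothetical helper, the metadata values are made up, and LaminDB's own resolution logic may differ.

```python
def resolve_slot(container, slot_key):
    """Follow a colon-separated slot key through nested dictionaries.

    Hypothetical helper: it only mimics how a key like "uns:study_metadata"
    points at a nested object; LaminDB's own resolution logic may differ.
    """
    target = container
    for part in slot_key.split(":"):
        target = target[part]
    return target


# a plain-dict stand-in for an AnnData object with unstructured metadata;
# the values are made up for illustration
adata_like = {
    "uns": {"study_metadata": {"temperature": 21.6, "experiment": "Experiment 1"}},
}
metadata = resolve_slot(adata_like, "uns:study_metadata")
print(sorted(metadata))
```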

curate_anndata_uns.py¶

import lamindb as ln

ln.examples.datasets.mini_immuno.define_features_labels()
adata = ln.examples.datasets.mini_immuno.get_dataset1(otype="AnnData")
schema = ln.Schema.get(name="Study metadata schema")
artifact = ln.Artifact.from_anndata(
    adata, schema=schema, key="examples/mini_immuno_uns.h5ad"
)
artifact.describe()

!python scripts/curate_anndata_uns.py

→ connected lamindb: testuser1/test-curate

→ returning record with same name: 'Perturbation'
→ returning record with same name: 'DMSO'
→ returning record with same name: 'IFNG'

→ returning feature with same name: 'perturbation'
→ returning feature with same name: 'cell_type_by_expert'
→ returning feature with same name: 'cell_type_by_model'
→ returning feature with same name: 'assay_oid'
→ returning feature with same name: 'concentration'
→ returning feature with same name: 'treatment_time_h'
→ returning feature with same name: 'donor'
→ returning feature with same name: 'donor_ethnicity'

! no run & transform got linked, call `ln.track()` & re-run
→ writing the in-memory object into cache
→ returning artifact with same hash: Artifact(uid='LVhVndXnXz7JQt4c0000', version_tag=None, is_latest=True, key='examples/mini_immuno.h5ad', description=None, suffix='.h5ad', kind='dataset', otype='AnnData', size=31672, hash='FB3CeMjmg1ivN6HDy6wsSg', n_files=None, n_observations=3, branch_id=1, space_id=1, storage_id=3, run_id=None, schema_id=8, created_by_id=3, created_at=2026-01-26 16:02:23 UTC, is_locked=False); to track this artifact as an input, use: ln.Artifact.get()

! key examples/mini_immuno.h5ad on existing artifact differs from passed key examples/mini_immuno_uns.h5ad, keeping original key; update manually if needed or pass skip_hash_lookup if you want to duplicate the artifact
→ loading artifact into memory for validation

Traceback (most recent call last):
  File "/home/runner/work/lamindb/lamindb/docs/scripts/curate_anndata_uns.py", line 6, in <module>
    artifact = ln.Artifact.from_anndata(
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/lamindb/lamindb/lamindb/models/artifact.py", line 2126, in from_anndata
    curator = AnnDataCurator(artifact, schema)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/lamindb/lamindb/lamindb/curators/core.py", line 1002, in __init__
    raise InvalidArgument("Schema otype must be 'AnnData'.")
lamindb.errors.InvalidArgument: Schema otype must be 'AnnData'.

The script fails because ln.Schema.get(name="Study metadata schema") retrieves the slot-level schema, whose otype is None, whereas from_anndata requires a schema with otype="AnnData", such as the composite anndata_uns_schema defined above.

MuData¶

curate_mudata.py¶

import lamindb as ln
import bionty as bt

from docs.scripts.define_schema_df_metadata import study_metadata_schema

# define labels
perturbation = ln.Record(name="Perturbation", is_type=True).save()
ln.Record(name="Perturbed", type=perturbation).save()
ln.Record(name="NT", type=perturbation).save()

replicate = ln.Record(name="Replicate", is_type=True).save()
ln.Record(name="rep1", type=replicate).save()
ln.Record(name="rep2", type=replicate).save()
ln.Record(name="rep3", type=replicate).save()

# define the global obs schema
obs_schema = ln.Schema(
    name="mudata_papalexi21_subset_obs_schema",
    features=[
        ln.Feature(name="perturbation", dtype="cat[Record[Perturbation]]").save(),
        ln.Feature(name="replicate", dtype="cat[Record[Replicate]]").save(),
    ],
).save()

# define the ['rna'].obs schema
obs_schema_rna = ln.Schema(
    name="mudata_papalexi21_subset_rna_obs_schema",
    features=[
        ln.Feature(name="nCount_RNA", dtype=int).save(),
        ln.Feature(name="nFeature_RNA", dtype=int).save(),
        ln.Feature(name="percent.mito", dtype=float).save(),
    ],
).save()

# define the ['hto'].obs schema
obs_schema_hto = ln.Schema(
    name="mudata_papalexi21_subset_hto_obs_schema",
    features=[
        ln.Feature(name="nCount_HTO", dtype=float).save(),
        ln.Feature(name="nFeature_HTO", dtype=int).save(),
        ln.Feature(name="technique", dtype=bt.ExperimentalFactor).save(),
    ],
).save()

# define ['rna'].var schema
var_schema_rna = ln.Schema(
    name="mudata_papalexi21_subset_rna_var_schema",
    itype=bt.Gene.symbol,
    dtype=float,
).save()

# define composite schema
mudata_schema = ln.Schema(
    name="mudata_papalexi21_subset_mudata_schema",
    otype="MuData",
    slots={
        "obs": obs_schema,
        "rna:obs": obs_schema_rna,
        "hto:obs": obs_schema_hto,
        "rna:var": var_schema_rna,
        "uns:study_metadata": study_metadata_schema,
    },
).save()

# curate a MuData
mdata = ln.examples.datasets.mudata_papalexi21_subset(with_uns=True)
bt.settings.organism = "human"  # set the organism to map gene symbols
curator = ln.curators.MuDataCurator(mdata, mudata_schema)
artifact = curator.save_artifact(key="examples/mudata_papalexi21_subset.h5mu")
assert artifact.schema == mudata_schema

!python scripts/curate_mudata.py

→ connected lamindb: testuser1/test-curate

→ returning feature with same name: 'temperature'

→ returning feature with same name: 'experiment'
→ returning schema with same hash: Schema(uid='iOvFkvewdoog4TY9', is_type=False, name='Study metadata schema', description=None, n_members=2, coerce=None, flexible=False, itype='Feature', otype=None, hash='Hk8-vgcUhQ46cviAlYnkVQ', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=3, run_id=None, type_id=None, created_at=2026-01-26 16:02:31 UTC, is_locked=False)

→ returning record with same name: 'Perturbation'

! rather than passing a string 'cat[Record[Perturbation]]' to dtype, consider passing a Python object
→ returning feature with same name: 'perturbation'
! rather than passing a string 'cat[Record[Replicate]]' to dtype, consider passing a Python object

! you are trying to create a record with name='nFeature_HTO' but a record with similar name exists: 'nFeature_RNA'. Did you mean to load it?

! auto-transposed `var` for backward compat, please indicate transposition in the schema definition by calling out `.T`: slots={'var.T': itype=bt.Gene.ensembl_gene_id}

! 37 terms not validated in feature 'columns' in slot 'obs': 'adt:S.Score', 'gdo:guide_ID', 'gdo:gene_target', 'adt:perturbation', 'adt:gene_target', 'hto:NT', 'adt:HTO_classification', 'gdo:Phase', 'gdo:perturbation', 'gdo:S.Score', 'hto:HTO_classification', 'hto:MULTI_ID', 'hto:G2M.Score', 'gdo:orig.ident', 'adt:Phase', 'hto:percent.mito', 'gdo:NT', 'hto:Phase', 'adt:MULTI_ID', 'adt:orig.ident', ...
    → fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('columns')

! 96 terms not validated in feature 'columns' in slot 'rna:var': 'RP5-827C21.6', 'XX-CR54.1', 'RP11-379B18.5', 'RP11-778D9.12', 'RP11-703G6.1', 'AC005150.1', 'RP11-717H13.1', 'CTC-498J12.1', 'CTC-467M3.1', 'HIST1H4K', 'RP11-524H19.2', 'AC006042.7', 'AC002066.1', 'AC073934.6', 'RP11-268G12.1', 'U52111.14', 'RP11-235C23.5', 'RP11-12J10.3', 'CASC1', 'RP11-324E6.9', ...
    12 synonyms found: "CTC-467M3.1" → "MEF2C-AS2", "HIST1H4K" → "H4C12", "CASC1" → "DNAI7", "LARGE" → "LARGE1", "NBPF16" → "NBPF15", "C1orf65" → "CCDC185", "IBA57-AS1" → "IBA57-DT", "KIAA1239" → "NWD2", "TMEM75" → "LINC02912", "AP003419.16" → "RPS6KB2-AS1", "FAM65C" → "RIPOR3", "C14orf177" → "LINC02914"
    → curate synonyms via: .standardize("columns")
    for remaining terms:
    → fix organism 'Organism(uid='1dpCL6TduFJ3AP', name='human', ontology_id='NCBITaxon:9606', abbr=None, synonyms=None, description=None, scientific_name='Homo sapiens', branch_id=1, space_id=1, created_by_id=3, run_id=None, source_id=34, created_at=2026-01-26 16:02:20 UTC, is_locked=False)', fix typos, remove non-existent values, or save terms via: curator.slots['rna:var'].cat.add_new_from('columns')
! no run & transform got linked, call `ln.track()` & re-run
→ writing the in-memory object into cache

→ returning schema with same hash: Schema(uid='eDOpzMZa6GbPFd3v', is_type=False, name='mudata_papalexi21_subset_obs_schema', description=None, n_members=2, coerce=None, flexible=False, itype='Feature', otype=None, hash='sz4L6FrFOSWktQ-CyKwADw', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=3, run_id=None, type_id=None, created_at=2026-01-26 16:02:39 UTC, is_locked=False)
→ returning schema with same hash: Schema(uid='NLV4EsGUZy4ARU8I', is_type=False, name='mudata_papalexi21_subset_rna_obs_schema', description=None, n_members=3, coerce=None, flexible=False, itype='Feature', otype=None, hash='8Q0Re3inChS8FnGLTi90nA', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=3, run_id=None, type_id=None, created_at=2026-01-26 16:02:39 UTC, is_locked=False)
→ returning schema with same hash: Schema(uid='MbM4jCo6P041SncC', is_type=False, name='mudata_papalexi21_subset_hto_obs_schema', description=None, n_members=3, coerce=None, flexible=False, itype='Feature', otype=None, hash='cIErgmfQDsz3RrI58CWdow', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=3, run_id=None, type_id=None, created_at=2026-01-26 16:02:39 UTC, is_locked=False)

→ returning schema with same hash: Schema(uid='iOvFkvewdoog4TY9', is_type=False, name='Study metadata schema', description=None, n_members=2, coerce=None, flexible=False, itype='Feature', otype=None, hash='Hk8-vgcUhQ46cviAlYnkVQ', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=3, run_id=None, type_id=None, created_at=2026-01-26 16:02:31 UTC, is_locked=False)
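The log above reports 12 synonyms for gene symbols in the `rna:var` slot and suggests curating them via `.standardize("columns")`. As a rough, conceptual illustration of what such a step amounts to, the sketch below maps a few of the synonym pairs from the log onto their current symbols in plain Python. The `standardize` and `split_validated` helpers and the small `registry` set are hypothetical and written only for this example; they are not LaminDB's implementation or API.

```python
# Conceptual sketch only, NOT LaminDB's implementation.
# The synonym pairs are copied from the curation log above.
SYNONYMS = {
    "CTC-467M3.1": "MEF2C-AS2",
    "HIST1H4K": "H4C12",
    "CASC1": "DNAI7",
    "LARGE": "LARGE1",
}


def standardize(symbols, synonyms):
    """Replace known synonyms with their current symbol; keep other values as-is."""
    return [synonyms.get(symbol, symbol) for symbol in symbols]


def split_validated(symbols, registry):
    """Split symbols into those found in the registry and those that are not."""
    validated = [s for s in symbols if s in registry]
    remaining = [s for s in symbols if s not in registry]
    return validated, remaining


# Hypothetical registry: here simply the set of current symbols from the log.
registry = set(SYNONYMS.values())

raw = ["CASC1", "HIST1H4K", "RP5-827C21.6"]
standardized = standardize(raw, SYNONYMS)
validated, remaining = split_validated(standardized, registry)
print(validated)  # symbols that now match the registry
print(remaining)  # terms that still need a manual decision
```

Terms that remain unvalidated after standardization correspond to the "for remaining terms" part of the log: they need a typo fix, removal, or explicit registration.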

SpatialData¶

define_schema_spatialdata.py¶

import lamindb as ln
import bionty as bt


attrs_schema = ln.Schema(
    features=[
        ln.Feature(name="bio", dtype=dict).save(),
        ln.Feature(name="tech", dtype=dict).save(),
    ],
).save()

sample_schema = ln.Schema(
    features=[
        ln.Feature(name="disease", dtype=bt.Disease, coerce=True).save(),
        ln.Feature(
            name="developmental_stage",
            dtype=bt.DevelopmentalStage,
            coerce=True,
        ).save(),
    ],
).save()

tech_schema = ln.Schema(
    features=[
        ln.Feature(name="assay", dtype=bt.ExperimentalFactor, coerce=True).save(),
    ],
).save()

obs_schema = ln.Schema(
    features=[
        ln.Feature(name="sample_region", dtype="str").save(),
    ],
).save()

uns_schema = ln.Schema(
    features=[
        ln.Feature(name="analysis", dtype="str").save(),
    ],
).save()

# maximal_set=True: only registered Ensembl gene IDs are valid
varT_schema = ln.Schema(itype=bt.Gene.ensembl_gene_id, maximal_set=True).save()

sdata_schema = ln.Schema(
    name="spatialdata_blobs_schema",
    otype="SpatialData",
    slots={
        "attrs:bio": sample_schema,
        "attrs:tech": tech_schema,
        "attrs": attrs_schema,
        "tables:table:obs": obs_schema,
        "tables:table:var.T": varT_schema,
    },
).save()

!python scripts/define_schema_spatialdata.py
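The slot keys used in this guide (`"attrs:bio"`, `"tables:table:obs"`, `"tables:table:var.T"`) read as colon-separated paths into the composite object, with a trailing `.T` indicating that the slot is validated in transposed form. The sketch below spells out that reading with a small parser. `SlotKey` and `parse_slot_key` are hypothetical names introduced for illustration and are not part of LaminDB; the parsing rule is an assumption based on the keys shown in this guide.

```python
from typing import NamedTuple


class SlotKey(NamedTuple):
    """Hypothetical container for a parsed slot key (illustration only)."""

    path: tuple[str, ...]  # nested location, e.g. ("tables", "table", "var")
    transposed: bool  # True if the key ends with ".T"


def parse_slot_key(key: str) -> SlotKey:
    """Split a slot key into its path components and a transposition flag."""
    transposed = key.endswith(".T")
    if transposed:
        key = key[: -len(".T")]
    return SlotKey(path=tuple(key.split(":")), transposed=transposed)


# Slot keys taken from the schemas defined in this guide.
for key in ["attrs", "attrs:bio", "tables:table:obs", "tables:table:var.T", "ms:RNA.T"]:
    print(key, parse_slot_key(key))
```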

curate_spatialdata.py¶

import lamindb as ln

spatialdata = ln.examples.datasets.spatialdata_blobs()
sdata_schema = ln.Schema.get(name="spatialdata_blobs_schema")
curator = ln.curators.SpatialDataCurator(spatialdata, sdata_schema)
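try:
    curator.validate()
except ln.errors.ValidationError:
    # expected here: the example dataset contains a gene ID that is not registered
    pass

# drop the unregistered gene ID so that validation can pass
spatialdata.tables["table"].var.drop(index="ENSG00000999999", inplace=True)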

# validate again (must pass now) and save artifact
artifact = ln.Artifact.from_spatialdata(
    spatialdata, key="examples/spatialdata1.zarr", schema=sdata_schema
).save()
artifact.describe()

!python scripts/curate_spatialdata.py
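The script above follows a validate, fix, validate-again pattern: the first validation fails, the offending gene ID is dropped, and the second validation passes while saving the artifact. The plain-Python sketch below mimics that control flow for a `maximal_set`-style check, where every value must be a member of a known registry. The `REGISTERED` set, the `ValidationError` class and `validate_maximal_set` are stand-ins defined for this example only; they are assumptions for illustration and not LaminDB objects.

```python
# Illustrative stand-in for a registry of valid Ensembl gene IDs.
REGISTERED = {"ENSG00000010610", "ENSG00000153563", "ENSG00000170458"}


class ValidationError(Exception):
    """Stand-in exception for this sketch (not ln.errors.ValidationError)."""


def validate_maximal_set(ids, registry):
    """Raise if any ID is missing from the registry; return None otherwise."""
    unknown = [i for i in ids if i not in registry]
    if unknown:
        raise ValidationError(f"{len(unknown)} term(s) not validated: {unknown}")


var_index = ["ENSG00000010610", "ENSG00000153563", "ENSG00000999999"]

# First pass: fails because one ID is not in the registry.
try:
    validate_maximal_set(var_index, REGISTERED)
except ValidationError as err:
    print(f"validation failed: {err}")

# Fix: drop the unregistered ID, mirroring the `var.drop(...)` call above.
var_index = [i for i in var_index if i != "ENSG00000999999"]

# Second pass: succeeds silently.
validate_maximal_set(var_index, REGISTERED)
```

Printing the error instead of silently passing makes the reason for the first failure visible, which is often helpful when adapting the pattern to a new dataset.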

TiledbsomaExperiment¶

curate_soma_experiment.py¶

import lamindb as ln
import bionty as bt
import tiledbsoma as soma
import tiledbsoma.io

# write an example AnnData object to a local tiledbsoma store
adata = ln.examples.datasets.mini_immuno.get_dataset1(otype="AnnData")
tiledbsoma.io.from_anndata("small_dataset.tiledbsoma", adata, measurement_name="RNA")

obs_schema = ln.Schema(
    name="soma_obs_schema",
    features=[
        ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save(),
        ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save(),
    ],
).save()

var_schema = ln.Schema(
    name="soma_var_schema",
    features=[
        ln.Feature(name="var_id", dtype=bt.Gene.ensembl_gene_id).save(),
    ],
    coerce=True,
).save()

soma_schema = ln.Schema(
    name="soma_experiment_schema",
    otype="tiledbsoma",
    slots={
        "obs": obs_schema,
        "ms:RNA.T": var_schema,
    },
).save()

with soma.Experiment.open("small_dataset.tiledbsoma") as experiment:
    curator = ln.curators.TiledbsomaExperimentCurator(experiment, soma_schema)
    curator.validate()
    artifact = curator.save_artifact(
        key="examples/soma_experiment.tiledbsoma",
        description="SOMA experiment with schema validation",
    )
assert artifact.schema == soma_schema
artifact.describe()

!python scripts/curate_soma_experiment.py


→ connected lamindb: testuser1/test-curate

→ returning feature with same name: 'cell_type_by_expert'
→ returning feature with same name: 'cell_type_by_model'

! 1 term not validated in feature 'columns': 'sample_note'
    → fix typos, remove non-existent values, or save terms via: curator.cat.add_new_from('columns')

! no run & transform got linked, call `ln.track()` & re-run

→ returning schema with same hash: Schema(uid='UCCJtOb4WOqdcTFP', is_type=False, name='soma_obs_schema', description=None, n_members=2, coerce=None, flexible=False, itype='Feature', otype=None, hash='Uz22R2o-9mxeMd8GXS5yhA', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=3, run_id=None, type_id=None, created_at=2026-01-26 16:02:58 UTC, is_locked=False)

→ returning schema with same hash: Schema(uid='B3RBeyEhSm13MBQG', is_type=False, name='soma_var_schema', description=None, n_members=1, coerce=True, flexible=False, itype='Feature', otype=None, hash='ZhkHa0DyH03aWUqfYzgYYA', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=3, run_id=None, type_id=None, created_at=2026-01-26 16:02:58 UTC, is_locked=False)

Artifact: examples/soma_experiment.tiledbsoma (0000)
|   description: SOMA experiment with schema validation
├── uid: k0aBGoqYqqIrOC7v0000            run:                 
│   kind: dataset                        otype: tiledbsoma    
│   hash: 5_q1SDSAsHarmO0Y_PQ9uw         size: 23.9 KB        
│   branch: main                         space: all           
│   created_at: 2026-01-26 16:02:59 UTC  created_by: testuser1
│   n_files: 68                          n_observations: 3    
├── storage/path: 
│   /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/k0aBGoqYqqIrOC7v
│   .tiledbsoma
├── Dataset features
│   ├── obs (2)                                                                 
│   │   cell_type_by_expe…  bionty.CellType          B cell, CD8-positive, alph…
│   │   cell_type_by_model  bionty.CellType          B cell, T cell             
│   └── ms:RNA.T (1)                                                            
│       var_id              bionty.Gene.ensembl_ge…  ENSG00000010610, ENSG00000…
└── Labels
    └── .genes              bionty.Gene              CD8A, CD4, CD14            
        .cell_types         bionty.CellType          B cell, T cell, CD8-positi…

Other data structures¶

If you have other data structures, read: How do I validate & annotate arbitrary data structures?