Curate AnnData based on the CELLxGENE schema¶
This guide shows how to curate an AnnData object against the CELLxGENE schema v5.2.0 with the help of laminlabs/cellxgene.
# pip install 'lamindb[bionty,jupyter]' pronto
# cellxgene-schema has pinned dependencies. Therefore we recommend installing it into a separate environment using `uv` or `pipx`
# uv tool install cellxgene-schema==5.2.3
!lamin init --storage ./test-cellxgene-curate --modules bionty
→ initialized lamindb: testuser1/test-cellxgene-curate
import lamindb as ln
import bionty as bt
ln.track()
→ connected lamindb: testuser1/test-cellxgene-curate
→ created Transform('SUYhawHskKnI0000'), started new Run('dWi9IA01...') at 2025-07-21 11:35:06 UTC
→ notebook imports: bionty==1.6.1rc1 lamindb==1.9.0
• recommendation: to identify the notebook across renames, pass the uid: ln.track("SUYhawHskKnI")
The CELLxGENE schema¶
As a first step, we generate the CELLxGENE schema for version 5.2.0. This also adds any missing sources to the instance:
cxg_schema = ln.examples.cellxgene.get_cxg_schema("5.2.0")
→ source added!
→ source added!
→ source added!
→ referenced read-only storage location at s3://bionty-assets, is managed by instance with uid 2WgKqzPc1eW3
→ source added!
cxg_schema.describe()
Schema(uid='O3YGSIW9aGaxCmA5', name='AnnData of CELLxGENE version 5.2.0', n=-1, is_type=False, itype='Composite', otype='AnnData', dtype='num', hash='WgZN6EKAaHHWhGKLoktRMg', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=1, run_id=1, created_at=2025-07-21 11:35:15 UTC)
var: Schema(uid='2X0XzfcgMnt9vlXP', name='var of CELLxGENE version 5.2.0', n=2, is_type=False, itype='Feature', dtype='DataFrame', hash='FRoYHDfugiRsaAUPr8Xnsw', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=1, run_id=1, created_at=2025-07-21 11:35:14 UTC)
obs: Schema(uid='JEWnt3qBsfMPenL6', name='obs of CELLxGENE version 5.2.0', n=12, is_type=False, itype='Feature', otype='DataFrame', hash='TCQ9ciVSrPmnMfB8ZiKJow', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=1, run_id=1, created_at=2025-07-21 11:35:15 UTC)
The schema has two components, one for the var slot and one for the obs slot:
cxg_schema.slots["var"].describe()
Schema
├── .uid = '2X0XzfcgMnt9vlXP'
├── .name = 'var of CELLxGENE version 5.2.0'
├── .itype = 'Feature'
├── .ordered_set = False
├── .maximal_set = False
├── .minimal_set = True
├── .created_by = testuser1 (Test User1)
├── .created_at = 2025-07-21 11:35:14
└── Feature • 2
    └── name              dtype                                                  opti…  null…  coerce_…  default_v…
        var_index         cat[bionty.Gene.ensembl_gene_id[source__uid='5dmX950…  ✗      ✓      ✗         unset
        feature_is_filt…  bool                                                   ✗      ✓      ✗         unset
cxg_schema.slots["obs"].describe()
Schema DataFrame
├── .uid = 'JEWnt3qBsfMPenL6'
├── .name = 'obs of CELLxGENE version 5.2.0'
├── .itype = 'Feature'
├── .ordered_set = False
├── .maximal_set = False
├── .minimal_set = True
├── .created_by = testuser1 (Test User1)
├── .created_at = 2025-07-21 11:35:15
└── Feature • 12
    └── name                              dtype                                                       coe…  def…
        assay_ontology_term_id            cat[bionty.ExperimentalFactor.ontology_id[source__uid='2v…  ✗     uns…
        cell_type_ontology_term_id        cat[bionty.CellType.ontology_id[source__uid='3Uw2Va7a']]    ✗     uns…
        development_stage_ontology_term…  cat[bionty.DevelopmentalStage.ontology_id[source__uid='1G…  ✗     uns…
        disease_ontology_term_id          cat[bionty.Disease.ontology_id[source__uid='4a3ejKuf']]     ✗     uns…
        self_reported_ethnicity_ontolog…  cat[bionty.Ethnicity.ontology_id[source__uid='MJRqduf9']]   ✗     uns…
        sex_ontology_term_id              cat[bionty.Phenotype.ontology_id[source__uid='3ox8Ekgl']]   ✗     uns…
        tissue_ontology_term_id           cat[bionty.Tissue.ontology_id[source__uid='MUtAGdL4']]      ✗     uns…
        organism_ontology_term_id         cat[bionty.Organism.ontology_id[source__uid='4tsksCMX']]    ✗     uns…
        donor_id                          str                                                         ✗     unk…
        is_primary_data                   cat[ULabel]                                                 ✗     uns…
        suspension_type                   cat[ULabel]                                                 ✗     uns…
        tissue_type                       cat[ULabel]                                                 ✗     uns…
In the following, we will validate a dataset against the CELLxGENE schema and curate it.
Validate and curate metadata¶
Let’s start with an AnnData object that we would like to curate. We write it to disk so that we can run CZI’s cellxgene-schema CLI tool, which verifies whether an on-disk h5ad dataset adheres to all requirements of CELLxGENE, including the CELLxGENE schema.
adata = ln.core.datasets.small_dataset3_cellxgene(with_obs_typo=True)
adata.write_h5ad("small_cxg.h5ad")
adata
AnnData object with n_obs × n_vars = 3 × 3
obs: 'disease_ontology_term_id', 'development_stage_ontology_term_id', 'organism', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type'
var: 'feature_is_filtered'
uns: 'title'
obsm: 'X_pca'
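Comparing the obs columns printed above with the 12 obs features of the schema already hints at what validation will complain about. The following is a minimal sketch in plain Python; both sets are copied by hand from the outputs above, and the variable names are chosen here for illustration only.

```python
# The 12 obs features listed in the obs slot of the schema.
required_obs = {
    "assay_ontology_term_id",
    "cell_type_ontology_term_id",
    "development_stage_ontology_term_id",
    "disease_ontology_term_id",
    "self_reported_ethnicity_ontology_term_id",
    "sex_ontology_term_id",
    "tissue_ontology_term_id",
    "organism_ontology_term_id",
    "donor_id",
    "is_primary_data",
    "suspension_type",
    "tissue_type",
}

# The obs columns of the example AnnData object.
present_obs = {
    "disease_ontology_term_id",
    "development_stage_ontology_term_id",
    "organism",
    "sex_ontology_term_id",
    "tissue_ontology_term_id",
    "cell_type",
    "self_reported_ethnicity",
    "donor_id",
    "is_primary_data",
    "suspension_type",
    "tissue_type",
}

missing = sorted(required_obs - present_obs)      # required by the schema, absent in obs
unexpected = sorted(present_obs - required_obs)   # present in obs, not part of the schema

print("missing:", missing)
print("not in schema:", unexpected)
```

The three name-based columns (cell_type, organism, self_reported_ethnicity) are exactly the ones whose ontology ID counterparts are missing.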
Initially, CZI’s cellxgene-schema validator does not pass, so we need to curate the dataset.
!MPLBACKEND=agg uvx cellxgene-schema validate small_cxg.h5ad
Loading dependencies
Loading validator modules
Starting validation...
WARNING: Dataframe 'var' only has 3 rows. Features SHOULD NOT be filtered from expression matrix.
WARNING: Validation of raw layer was not performed due to current errors, try again after fixing current errors.
ERROR: Add labels error: Column 'cell_type' is a reserved column name of 'obs'. Remove it from h5ad and try again.
ERROR: Add labels error: Column 'organism' is a reserved column name of 'obs'. Remove it from h5ad and try again.
ERROR: Add labels error: Column 'self_reported_ethnicity' is a reserved column name of 'obs'. Remove it from h5ad and try again.
ERROR: Could not infer organism from feature ID 'invalid_ensembl_id' in 'var', make sure it is a valid ID.
ERROR: Could not infer organism from feature ID 'invalid_ensembl_id' in 'raw.var', make sure it is a valid ID.
ERROR: Dataframe 'obs' is missing column 'cell_type_ontology_term_id'.
ERROR: Dataframe 'obs' is missing column 'assay_ontology_term_id'.
ERROR: Dataframe 'obs' is missing column 'organism_ontology_term_id'.
ERROR: 'UBERON:0002048XXX' in 'tissue_ontology_term_id' is not a valid ontology term id of 'UBERON'. When 'tissue_type' is 'tissue' or 'organoid', 'tissue_ontology_term_id' MUST be a descendant term id of 'UBERON:0001062' (anatomical entity).
ERROR: Dataframe 'obs' is missing column 'self_reported_ethnicity_ontology_term_id'.
ERROR: Checking values with dependencies failed for adata.obs['development_stage_ontology_term_id'], this is likely due to missing dependent column in adata.obs.
ERROR: Checking values with dependencies failed for adata.obs['suspension_type'], this is likely due to missing dependent column in adata.obs.
Validation complete in 0:00:00.516340 with status is_valid=False
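With this many messages it can help to group them by severity before working through them. The sketch below assumes only what the output above shows, namely that each message line starts with WARNING: or ERROR:. The function name group_by_severity is made up for this example and is not part of cellxgene-schema.

```python
from collections import defaultdict

# A few lines copied from the validator output above.
log = """\
WARNING: Dataframe 'var' only has 3 rows. Features SHOULD NOT be filtered from expression matrix.
ERROR: Dataframe 'obs' is missing column 'cell_type_ontology_term_id'.
ERROR: Dataframe 'obs' is missing column 'assay_ontology_term_id'.
ERROR: Add labels error: Column 'organism' is a reserved column name of 'obs'. Remove it from h5ad and try again.
"""


def group_by_severity(text):
    """Collect message texts under their WARNING/ERROR prefix."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        # Split only at the first ": " so that colons inside the message survive.
        severity, sep, message = line.partition(": ")
        if sep and severity in {"WARNING", "ERROR"}:
            grouped[severity].append(message)
    return dict(grouped)


grouped = group_by_severity(log)
for severity, messages in grouped.items():
    print(severity, len(messages))
```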
CELLxGENE requires all observations to be annotated.
If information for a specific column like disease_ontology_term_id is not available, CELLxGENE requires falling back to default values like “normal” or “unknown”.
Let’s save these defaults to the instance using lamindb.examples.cellxgene.save_cxg_defaults():
ln.examples.cellxgene.save_cxg_defaults()
! you are trying to create a record with name='tissue' but a record with similar name exists: 'TissueType'. Did you mean to load it?
! you are trying to create a record with name='cell' but a record with similar name exists: 'cell culture'. Did you mean to load it?
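Saving the defaults registers them in the instance; it does not fill gaps in a dataset. On the dataframe side, falling back to a default could look like the following pandas sketch. The small obs table and the defaults mapping are assumptions made for illustration: PATO:0000461 is the ontology ID CELLxGENE uses for “normal”, and “unknown” is used as a literal value, as in the example dataset.

```python
import pandas as pd

# Hypothetical obs table with missing annotations (None).
obs = pd.DataFrame(
    {
        "disease_ontology_term_id": ["MONDO:0004975", None, None],
        "sex_ontology_term_id": ["PATO:0000383", "PATO:0000384", None],
    },
    index=["barcode1", "barcode2", "barcode3"],
)

# Per-column fallback values, chosen for this example.
defaults = {
    "disease_ontology_term_id": "PATO:0000461",  # "normal"
    "sex_ontology_term_id": "unknown",
}

# fillna with a dict fills each column with its own default.
obs_filled = obs.fillna(value=defaults)
print(obs_filled)
```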
Now we can start curating the dataset:
curator = ln.curators.AnnDataCurator(adata, cxg_schema)
try:
    curator.validate()
except ln.errors.ValidationError:
    pass
! using default organism = human
! using default organism = human
! 1 term not validated in feature 'index' in slot 'var': 'invalid_ensembl_id'
→ fix typos, remove non-existent values, or save terms via: curator.slots['var'].cat.add_new_from('index')
The message shows that the dataset contains a gene identifier that could not be validated.
Let’s remove it from both the adata and adata.raw objects:
adata = adata[
    :, ~adata.var.index.isin(curator.slots["var"].cat.non_validated["index"])
].copy()
if adata.raw is not None:
    raw_data = adata.raw.to_adata()
    raw_data = raw_data[
        :, ~raw_data.var.index.isin(curator.slots["var"].cat.non_validated["index"])
    ].copy()
    adata.raw = raw_data
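The subsetting above relies on a boolean mask over the var index. The following sketch reproduces that mechanism with pandas and NumPy only. The var table, the matrix X, and the non_validated list are hypothetical stand-ins for adata.var, adata.X, and curator.slots["var"].cat.non_validated["index"].

```python
import numpy as np
import pandas as pd

# Stand-in for adata.var: two valid Ensembl gene IDs and one invalid entry.
var = pd.DataFrame(
    {"feature_is_filtered": [False, False, False]},
    index=["ENSG00000000419", "ENSG00000139618", "invalid_ensembl_id"],
)

# Stand-in for adata.X: 3 cells x 3 genes.
X = np.arange(9).reshape(3, 3)

# Stand-in for the list of non-validated index values reported by the curator.
non_validated = ["invalid_ensembl_id"]

# True for every gene that should be kept.
keep = ~var.index.isin(non_validated)

var_clean = var.loc[keep]
X_clean = X[:, keep]  # drop the matching column of the matrix

print(list(var_clean.index))
print(X_clean.shape)
```

The same mask has to be applied to every object that shares the gene axis, which is why the guide filters adata.raw separately.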
As we’ve subsetted the AnnData object, we have to recreate the AnnDataCurator to validate again:
curator = ln.curators.AnnDataCurator(adata, cxg_schema)
try:
    curator.validate()
except ln.errors.ValidationError as e:
    print(e)
{
"SCHEMA": {
"COLUMN_NOT_IN_DATAFRAME": [
{
"schema": null,
"column": null,
"check": "column_in_dataframe",
"error": "column 'assay_ontology_term_id' not in dataframe. Columns in dataframe: ['disease_ontology_term_id', 'development_stage_ontology_term_id', 'organism', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type']"
},
{
"schema": null,
"column": null,
"check": "column_in_dataframe",
"error": "column 'cell_type_ontology_term_id' not in dataframe. Columns in dataframe: ['disease_ontology_term_id', 'development_stage_ontology_term_id', 'organism', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type']"
},
{
"schema": null,
"column": null,
"check": "column_in_dataframe",
"error": "column 'self_reported_ethnicity_ontology_term_id' not in dataframe. Columns in dataframe: ['disease_ontology_term_id', 'development_stage_ontology_term_id', 'organism', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type']"
},
{
"schema": null,
"column": null,
"check": "column_in_dataframe",
"error": "column 'organism_ontology_term_id' not in dataframe. Columns in dataframe: ['disease_ontology_term_id', 'development_stage_ontology_term_id', 'organism', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type']"
}
]
}
}
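Because the printed error is JSON, the missing column names can be pulled out programmatically. This is a sketch under the assumption that the error text always has the shape shown above; the message format may differ between lamindb versions. The helper missing_columns is made up for this example, and the message below is a shortened copy with two of the four entries and without the "Columns in dataframe" suffix.

```python
import json
import re

# Shortened copy of the error printed above.
message = """
{
  "SCHEMA": {
    "COLUMN_NOT_IN_DATAFRAME": [
      {"schema": null, "column": null, "check": "column_in_dataframe",
       "error": "column 'assay_ontology_term_id' not in dataframe."},
      {"schema": null, "column": null, "check": "column_in_dataframe",
       "error": "column 'cell_type_ontology_term_id' not in dataframe."}
    ]
  }
}
"""


def missing_columns(error_text):
    """Return the column names reported as missing, in order of appearance."""
    report = json.loads(error_text)
    entries = report.get("SCHEMA", {}).get("COLUMN_NOT_IN_DATAFRAME", [])
    names = []
    for entry in entries:
        match = re.match(r"column '([^']+)' not in dataframe", entry["error"])
        if match:
            names.append(match.group(1))
    return names


missing = missing_columns(message)
print(missing)
```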
The validation error tells us that we’re missing several columns.
The reason is simple: CELLxGENE requires ontology-backed obs metadata to be stored as ontology IDs in entity_ontology_term_id columns, not as names.
Therefore, we first translate the name-based obs columns into the required format.
adata.obs
 | disease_ontology_term_id | development_stage_ontology_term_id | organism | sex_ontology_term_id | tissue_ontology_term_id | cell_type | self_reported_ethnicity | donor_id | is_primary_data | suspension_type | tissue_type |
---|---|---|---|---|---|---|---|---|---|---|---|
barcode1 | MONDO:0004975 | unknown | human | PATO:0000383 | UBERON:0002048XXX | T cell | South Asian | -1 | False | cell | tissue |
barcode2 | MONDO:0004980 | unknown | human | PATO:0000384 | UBERON:0002048XXX | B cell | South Asian | 1 | False | cell | tissue |
barcode3 | MONDO:0004980 | unknown | human | unknown | UBERON:0000948 | B cell | South Asian | 2 | False | cell | tissue |
# Add missing assay column
adata.obs["assay_ontology_term_id"] = "EFO:0005684"
# Add `entity_ontology_term_id` columns by translating names to ontology IDs
standardization_map = {
    "organism": (bt.Organism, "organism_ontology_term_id"),
    "self_reported_ethnicity": (
        bt.Ethnicity,
        "self_reported_ethnicity_ontology_term_id",
    ),
    "cell_type": (bt.CellType, "cell_type_ontology_term_id"),
}
for col, (bt_class, new_col) in standardization_map.items():
    adata.obs[new_col] = bt_class.standardize(
        adata.obs[col], field="name", return_field="ontology_id"
    )
# Drop the name columns because CELLxGENE disallows them
adata.obs = adata.obs.drop(columns=list(standardization_map.keys()))
! found 1 name in public source: ['South Asian']
please add corresponding Ethnicity records via: `.from_values(['South Asian'])`
! found 2 names in public source: ['T cell', 'B cell']
please add corresponding CellType records via: `.from_values(['T cell', 'B cell'])`
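The translation step can be pictured as a lookup from names to ontology IDs followed by dropping the name columns. The pandas sketch below uses small hand-written lookup tables purely for illustration. In the guide the lookup is done by standardize() against ontology records, and this sketch makes no claim about how that method works internally.

```python
import pandas as pd

# Hypothetical excerpt of the obs table with name-based columns.
obs = pd.DataFrame(
    {
        "organism": ["human", "human", "human"],
        "cell_type": ["T cell", "B cell", "B cell"],
    },
    index=["barcode1", "barcode2", "barcode3"],
)

# Hand-written name -> ontology ID tables (illustration only).
name_to_id = {
    "organism": ("organism_ontology_term_id", {"human": "NCBITaxon:9606"}),
    "cell_type": (
        "cell_type_ontology_term_id",
        {"T cell": "CL:0000084", "B cell": "CL:0000236"},
    ),
}

for col, (new_col, lookup) in name_to_id.items():
    # Fail loudly instead of silently writing NaN for unknown names.
    unmapped = set(obs[col]) - set(lookup)
    if unmapped:
        raise ValueError(f"no ontology ID known for {sorted(unmapped)} in '{col}'")
    obs[new_col] = obs[col].map(lookup)

# Drop the name columns once their ID counterparts exist.
obs = obs.drop(columns=list(name_to_id))
print(obs)
```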
# recreating the object because we dropped `obs` columns
curator = ln.curators.AnnDataCurator(adata, cxg_schema)
try:
    curator.validate()
except ln.errors.ValidationError:
    pass
! 1 term not validated in feature 'tissue_ontology_term_id' in slot 'obs': 'UBERON:0002048XXX'
→ fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('tissue_ontology_term_id')
Validation fails for the tissue label “UBERON:0002048XXX” because of a typo: the trailing XXX is not part of the ontology ID.
Let’s fix it:
adata.obs["tissue_ontology_term_id"] = adata.obs["tissue_ontology_term_id"].replace(
    {"UBERON:0002048XXX": "UBERON:0002048"}
)
/tmp/ipykernel_4393/59938913.py:1: FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.
adata.obs["tissue_ontology_term_id"] = adata.obs["tissue_ontology_term_id"].replace(
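The FutureWarning itself points to an alternative for categorical columns: renaming the category instead of replacing values. A minimal pandas sketch on a stand-in series (not the actual adata.obs column) could look like this. It works here because the corrected ID is not already one of the categories.

```python
import pandas as pd

# Stand-in for the categorical adata.obs["tissue_ontology_term_id"] column.
tissue = pd.Series(
    ["UBERON:0002048XXX", "UBERON:0002048XXX", "UBERON:0000948"],
    index=["barcode1", "barcode2", "barcode3"],
    dtype="category",
)

# Rename the misspelled category; all rows using it are updated at once.
fixed = tissue.cat.rename_categories({"UBERON:0002048XXX": "UBERON:0002048"})

print(fixed.tolist())
print(list(fixed.cat.categories))
```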
Now validation should pass:
curator.validate()
Save artifact¶
We can now save the curated artifact:
artifact = curator.save_artifact(key="examples/dataset-curated-against-cxg.h5ad")
→ returning existing schema with same hash: Schema(uid='JEWnt3qBsfMPenL6', name='obs of CELLxGENE version 5.2.0', n=12, is_type=False, itype='Feature', otype='DataFrame', hash='TCQ9ciVSrPmnMfB8ZiKJow', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=1, run_id=1, created_at=2025-07-21 11:35:15 UTC)
artifact.describe()
Artifact .h5ad · AnnData · dataset
├── General
│   ├── key: examples/dataset-curated-against-cxg.h5ad
│   ├── uid: RjaBvxeYkchtkpjL0000          hash: 5sXvIQzdfepyq4kEDzXuPw
│   ├── size: 43.4 KB                      transform: cellxgene-curate.ipynb
│   ├── space: all                         branch: all
│   ├── created_by: testuser1              created_at: 2025-07-21 11:35:41
│   ├── n_observations: 3
│   └── storage path:
│       /home/runner/work/cellxgene-lamin/cellxgene-lamin/docs/test-cellxgene-curate/examples/dataset-curated-against-cxg.h5ad
├── Dataset features
│   ├── var • 2 [Feature]
│   │   var_index                       cat[bionty.Gene.ensembl_gene_id[…  BRCA2, DPM1
│   │   feature_is_filtered             bool
│   └── obs • 12 [Feature]
│       assay_ontology_term_id          cat[bionty.ExperimentalFactor.on…  RNA-seq of coding RNA from single cells
│       cell_type_ontology_term_id      cat[bionty.CellType.ontology_id[…  B cell, T cell
│       development_stage_ontology_te…  cat[bionty.DevelopmentalStage.on…  unknown
│       disease_ontology_term_id        cat[bionty.Disease.ontology_id[s…  Alzheimer disease, atopic eczema
│       organism_ontology_term_id       cat[bionty.Organism.ontology_id[…  human
│       self_reported_ethnicity_ontol…  cat[bionty.Ethnicity.ontology_id…  South Asian
│       sex_ontology_term_id            cat[bionty.Phenotype.ontology_id…  female, male, unknown
│       suspension_type                 cat[ULabel]                        cell
│       tissue_ontology_term_id         cat[bionty.Tissue.ontology_id[so…  heart, lung
│       tissue_type                     cat[ULabel]                        tissue
│       donor_id                        str
│       is_primary_data                 cat[ULabel]
└── Labels
    └── .organisms                      bionty.Organism                    human
        .genes                          bionty.Gene                        DPM1, BRCA2
        .tissues                        bionty.Tissue                      heart, lung
        .cell_types                     bionty.CellType                    T cell, B cell
        .diseases                       bionty.Disease                     Alzheimer disease, atopic eczema
        .phenotypes                     bionty.Phenotype                   unknown, female, male
        .experimental_factors           bionty.ExperimentalFactor          RNA-seq of coding RNA from single cells
        .developmental_stages           bionty.DevelopmentalStage          unknown
        .ethnicities                    bionty.Ethnicity                   South Asian
        .ulabels                        ULabel                             tissue, cell
Validating using cellxgene-schema¶
To validate the now curated AnnData object using CZI’s cellxgene-schema CLI tool, we need to write the AnnData object to disk.
adata.write("small_cxg_curated.h5ad")
# %%bash -e
!MPLBACKEND=agg uvx cellxgene-schema validate small_cxg_curated.h5ad
Loading dependencies
Loading validator modules
Starting validation...
WARNING: Dataframe 'var' only has 2 rows. Features SHOULD NOT be filtered from expression matrix.
WARNING: Data contains assay(s) that are not represented in the 'suspension_type' schema definition table. Ensure you have selected the most appropriate value for the assay(s) between 'cell', 'nucleus', and 'na'. Please contact cellxgene@chanzuckerberg.com during submission so that the assay(s) can be added to the schema definition document.
Validation complete in 0:00:02.545513 with status is_valid=True
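In a script it can be handy to turn the final status line into a boolean. The sketch below parses only the "status is_valid=..." text shown in the output above. The helper name is_valid_from_log is made up for this example, and whether the CLI also signals failure through its exit code is not assumed here.

```python
import re


def is_valid_from_log(text: str) -> bool:
    """Extract the validation status from captured validator output."""
    match = re.search(r"status is_valid=(True|False)", text)
    if match is None:
        raise ValueError("no validation status line found")
    return match.group(1) == "True"


# Final lines copied from the two validator runs in this guide.
before = "Validation complete in 0:00:00.516340 with status is_valid=False"
after = "Validation complete in 0:00:02.545513 with status is_valid=True"

print(is_valid_from_log(before), is_valid_from_log(after))
```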
Note
The AnnDataCurator used in this guide is designed to validate all metadata for adherence to ontologies. It does not reimplement all rules of the CELLxGENE schema, so we recommend running the cellxgene-schema CLI tool if full adherence beyond metadata is required.