How do I validate & annotate arbitrary data structures?

This guide walks through the low-level API that lets you validate iterables.

You can then use the records created or inferred during validation to annotate a dataset.

How do I validate based on a public ontology?

LaminDB makes it easy to validate categorical variables based on registries that inherit from CanCurate.

CanCurate methods validate against the registries in your LaminDB instance. In Manage biological ontologies, you’ll see how to extend standard validation to validation against public references using a PublicOntology object, e.g., via public_genes = bt.Gene.public(). By default, from_values() treats a match in a public reference as a validated value for any bionty entity.

# pip install 'lamindb[zarr]'
!lamin init --storage ./test-curate-any --modules bionty
 initialized lamindb: testuser1/test-curate-any

Define a test dataset.

import lamindb as ln
import bionty as bt
import zarr
import numpy as np

data = zarr.open_group(store="data.zarr", mode="a")

data.create_dataset(name="temperature", shape=(3,), dtype="float32")
data.create_dataset(name="knockout_gene", shape=(3,), dtype=str)
data.create_dataset(name="disease", shape=(3,), dtype=str)

data["knockout_gene"][:] = np.array(
    ["ENSG00000139618", "ENSG00000141510", "ENSG00000133703"]
)
data["disease"][:] = np.random.default_rng().choice(
    ["MONDO:0004975", "MONDO:0004980"], 3
)
 connected lamindb: testuser1/test-curate-any

Validate and standardize vectors

Read the disease array from the zarr group into memory.

disease = data["disease"][:]

validate() validates vector-like values against reference values in a registry. It returns a boolean vector that is True wherever a value has an exact match in the reference values.

bt.Disease.validate(disease, field=bt.Disease.ontology_id)
! Your Disease registry is empty, consider populating it first!
   → use `.import_source()` to import records from a source, e.g. a public ontology
array([False, False, False])
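
To make the all-False result above concrete, here is a minimal pure-Python sketch of exact-match validation. It is an illustration only, not LaminDB's implementation: the function name validate_exact and the two reference sets are hypothetical, and a plain set stands in for a registry.

```python
from typing import Iterable, List, Set


def validate_exact(values: Iterable[str], reference: Set[str]) -> List[bool]:
    """Return one boolean per input value: True only on an exact match."""
    return [value in reference for value in values]


# Hypothetical reference values; the empty set mimics an empty registry.
empty_registry: Set[str] = set()
populated_registry = {"MONDO:0004975", "MONDO:0004980"}

disease_ids = ["MONDO:0004975", "MONDO:0004980", "MONDO:0004975"]

mask_empty = validate_exact(disease_ids, empty_registry)
mask_populated = validate_exact(disease_ids, populated_registry)

print(mask_empty)      # [False, False, False]
print(mask_populated)  # [True, True, True]
```

In this sketch, matching is case-sensitive and position-preserving: the output has one entry per input value, including duplicates.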

When validation fails, you can call inspect() to figure out what to do.

inspect() applies the same definition of validation as validate(), but returns a richer InspectResult object. Most importantly, it logs the recommended curation steps that would make the data pass validation.

Note: you can use standardize() to standardize synonyms.

bt.Disease.inspect(disease, field=bt.Disease.ontology_id)
! received 2 unique terms, 1 empty/duplicated term is ignored
! 2 unique terms (100.00%) are not validated for ontology_id: 'MONDO:0004975', 'MONDO:0004980'
   detected 2 Disease terms in public source for ontology_id: 'MONDO:0004975', 'MONDO:0004980'
→  add records from public source to your Disease registry via .from_values()
<lamin_utils._inspect.InspectResult at 0x7f971686cc50>
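
The log above can be read as a three-way split: terms are deduplicated, checked against the registry, and the non-validated ones are then looked up in the public source. The following sketch mimics that logic in plain Python under stated assumptions: InspectSketch and inspect_sketch are hypothetical names, sets stand in for the registry and the public source, and none of this is the actual InspectResult class.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class InspectSketch:
    """Hypothetical stand-in for an inspection result."""

    validated: List[str] = field(default_factory=list)
    non_validated: List[str] = field(default_factory=list)
    in_public_source: List[str] = field(default_factory=list)


def inspect_sketch(
    values: Iterable[str], registry: Set[str], public_source: Set[str]
) -> InspectSketch:
    # Drop empty terms and duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(v for v in values if v))
    result = InspectSketch()
    for value in unique:
        if value in registry:
            result.validated.append(value)
        else:
            result.non_validated.append(value)
            # A public-source hit suggests the term can be imported.
            if value in public_source:
                result.in_public_source.append(value)
    return result


result = inspect_sketch(
    ["MONDO:0004975", "MONDO:0004980", "MONDO:0004975"],
    registry=set(),  # empty registry, as in the example above
    public_source={"MONDO:0004975", "MONDO:0004980"},
)
print(result.validated)         # []
print(result.non_validated)     # ['MONDO:0004975', 'MONDO:0004980']
print(result.in_public_source)  # ['MONDO:0004975', 'MONDO:0004980']
```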

Bulk-creating records with from_values() returns only validated records. Here both disease IDs match the public source, so both records are returned and can be saved.

diseases = bt.Disease.from_values(disease, field=bt.Disease.ontology_id).save()

Repeat the process for more labels:

experiments = ln.Record.from_values(
    ["Experiment A", "Experiment B"],
    field=ln.Record.name,
    create=True,  # create non-validated labels
).save()
genes = bt.Gene.from_values(
    data["knockout_gene"][:], field=bt.Gene.ensembl_gene_id
).save()
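
The difference between the default behavior and create=True can be sketched as follows. This is a simplified model, not LaminDB's code: LabelSketch and from_values_sketch are hypothetical, and the known set stands in for everything that counts as validated (registry entries and, for bionty entities, public-reference matches).

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class LabelSketch:
    """Hypothetical record: a name plus whether it matched a known value."""

    name: str
    validated: bool


def from_values_sketch(
    values: Iterable[str], known: Set[str], create: bool = False
) -> List[LabelSketch]:
    records = []
    for value in dict.fromkeys(values):  # deduplicate, keep input order
        if value in known:
            records.append(LabelSketch(value, validated=True))
        elif create:
            # Only with create=True do unmatched values become records.
            records.append(LabelSketch(value, validated=False))
    return records


known = {"Experiment A"}
labels = ["Experiment A", "Experiment B"]

default_records = from_values_sketch(labels, known)
created_records = from_values_sketch(labels, known, create=True)

print([r.name for r in default_records])  # ['Experiment A']
print([r.name for r in created_records])  # ['Experiment A', 'Experiment B']
```

Free-text labels such as experiment names have no public reference to match against, which is why the example above passes create=True for them.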

Annotate the dataset

Register the dataset as an artifact:

artifact = ln.Artifact("data.zarr", key="my_dataset.zarr").save()
! no run & transform got linked, call `ln.track()` & re-run

Annotate with features:

ln.Feature(name="experiment", dtype=ln.Record).save()
ln.Feature(name="disease", dtype=bt.Disease.ontology_id).save()
ln.Feature(name="knockout_gene", dtype=bt.Gene.ensembl_gene_id).save()
artifact.features.add_values(
    {"experiment": experiments, "knockout_gene": genes, "disease": diseases}
)
artifact.describe()
Artifact: my_dataset.zarr (0000)
├── uid: uRZ2s9JybxnSPugG0000            run:
│   hash: TIIZZt03kFoWa8VVp2MKiQ         size: 1.2 KB
│   branch: main                         space: all
│   created_at: 2026-01-26 16:02:06 UTC  created_by: testuser1
│   n_files: 6
├── storage/path: /home/runner/work/lamindb/lamindb/docs/faq/test-curate-any/.lamindb/uRZ2s9JybxnSPugG.zarr
├── Features
│   └── disease                        bionty.Disease.ontology_id           MONDO:0004975, MONDO:0004980
│       experiment                     Record                               Experiment A, Experiment B
│       knockout_gene                  bionty.Gene.ensembl_gene_id          ENSG00000133703, ENSG00000139618, ENSG…
└── Labels
    └── .records                       Record                               Experiment A, Experiment B             
        .genes                         bionty.Gene                          BRCA2, TP53, KRAS                      
        .diseases                      bionty.Disease                       Alzheimer disease, atopic eczema       
# clean up test instance
!rm -r data.zarr
!rm -r ./test-curate-any
!lamin delete --force test-curate-any