Gene Ontology (GO)
In this notebook, we manage a pathway registry based on the “GO Biological Process 2025” ontology. We’ll walk you through registering pathways and linking them to genes.
In the Cell type annotation and pathway analysis notebook, we’ll demonstrate how to perform a pathway enrichment analysis and track the dataset with LaminDB.
# pip install lamindb gseapy
!lamin init --storage ./use-cases-registries --modules bionty
→ initialized lamindb: testuser1/use-cases-registries
import lamindb as ln
import bionty as bt
import gseapy as gp
→ connected lamindb: testuser1/use-cases-registries
Fetch GO pathways annotated with human genes using Enrichr
First, we fetch the "GO_Biological_Process_2025" pathways for humans using GSEApy, which wraps GSEA and Enrichr.
go_bp = gp.get_library(name="GO_Biological_Process_2025", organism="Human")
print(f"Number of pathways {len(go_bp)}")
Number of pathways 5341
go_bp["ATF6-mediated Unfolded Protein Response (GO:0036500)"]
['MBTPS1', 'MBTPS2', 'XBP1', 'ATF6B', 'MANF', 'DDIT3', 'CREBZF']
Parse the ontology ID out of each key and convert the library into the format {ontology_id: (name, genes)}:
def parse_ontology_id_from_keys(key):
    """Parse out the ontology id.

    "ATF6-mediated Unfolded Protein Response (GO:0036500)" -> ("GO:0036500", "ATF6-mediated Unfolded Protein Response")
    """
    # Split on the last " (" so that parentheses inside the name are preserved
    name, ontology_id = key.rsplit(" (", 1)
    return ontology_id.rstrip(")"), name


go_bp_parsed = {
    parse_ontology_id_from_keys(k)[0]: (parse_ontology_id_from_keys(k)[1], v)
    for k, v in go_bp.items()
}
go_bp_parsed["GO:0036500"]
('ATF6-mediated Unfolded Protein Response',
['MBTPS1', 'MBTPS2', 'XBP1', 'ATF6B', 'MANF', 'DDIT3', 'CREBZF'])
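The parsing step above relies on every key ending in " (GO:...)". As a minimal sketch, assuming that key shape holds, a stricter variant can validate the ID format with a regular expression and fail loudly on unexpected keys. The name parse_go_key, the pattern, and the second example key are hypothetical and only serve as illustration.

```python
import re

# Assumed key shape: "<name> (GO:<7 digits>)", as in the example above.
GO_KEY_PATTERN = re.compile(r"^(?P<name>.+) \((?P<ontology_id>GO:\d{7})\)$")


def parse_go_key(key: str) -> tuple[str, str]:
    """Return (ontology_id, name), raising ValueError on unexpected keys."""
    match = GO_KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"unexpected key format: {key!r}")
    return match["ontology_id"], match["name"]


# The second key is made up to show a name that itself contains parentheses.
examples = [
    "ATF6-mediated Unfolded Protein Response (GO:0036500)",
    "Hypothetical Process (With Parentheses) (GO:0000001)",
]
parsed = [parse_go_key(key) for key in examples]
print(parsed)
```

Raising on malformed keys makes a silent mismatch between library keys and ontology IDs visible before anything is registered.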
Register pathway ontology in LaminDB
source = bt.Source.get(name="go")
source
Source(uid='2UZHts8n', entity='bionty.Pathway', organism='all', name='go', version='2025-10-10', in_db=False, currently_used=True, description='Gene Ontology', url='http://purl.obolibrary.org/obo/go/releases/2025-10-10/extensions/go-plus.owl', md5=None, source_website='http://geneontology.org', branch_id=1, space_id=1, created_by_id=3, run_id=None, dataframe_artifact_id=None, created_at=2026-01-27 17:31:50 UTC, is_locked=False)
public_pathways = bt.Pathway.public(source=source)
public_pathways
PublicOntology
Entity: Pathway
Organism: all
Source: go, 2025-10-10
#terms: 80453
Next, we register all the pathways and genes in LaminDB to finally link pathways to genes.
Register pathway terms
To register the pathways, we use .from_values to parse the annotated GO ontology IDs directly into LaminDB Pathway records.
pathways = bt.Pathway.from_values(go_bp_parsed.keys(), bt.Pathway.ontology_id).save()
! ontology ID BFO:0000015 not found in DataFrame
→ starting creation of 10454 Pathway_parents records in batches of 10000
Register gene symbols
Similarly, we use .from_values to register all pathway-associated genes with LaminDB.
all_genes = bt.Gene.standardize(sum(go_bp.values(), []), organism="human")
genes = bt.Gene.from_values(all_genes, organism="human").save()
! found 35 synonyms in public source (output truncated): [np.str_('C17ORF99'), np.str_('C6ORF89'), np.str_('C9ORF78'), np.str_('C15ORF62'), np.str_('C2ORF69'), np.str_('C19ORF12'), np.str_('CPAP'), np.str_('C9ORF72'), np.str_('C12ORF57'), np.str_('HEMK2'), '...']
please add corresponding Gene records via (output truncated): `.from_values([np.str_('C17ORF99'), np.str_('C6ORF89'), np.str_('C9ORF78'), np.str_('C15ORF62'), np.str_('C2ORF69'), np.str_('C19ORF12'), np.str_('CPAP'), np.str_('C9ORF72'), np.str_('C12ORF57'), np.str_('HEMK2'), '...'])`
! ambiguous validation in Bionty for 1006 records: 'GART', 'HSPA1L', 'HSPA1A', 'HSPA1B', 'CCT8', 'KGD4', 'ATAT1', 'TRIM71', 'DHX36', 'CMTR2', 'PKLR', 'LDHA', 'SLC25A24', 'ATF6B', 'MAGEL2', 'TRIM27', 'PTPRC', 'GPS2', 'AKAP17A', 'SLC39A7', ...
→ starting creation of 16187 Gene records in batches of 10000
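Because a gene can belong to many pathways, the flattened symbol list contains many duplicates. The following is a small sketch, using a toy dictionary with made-up pathway names and illustrative gene lists, of one way to flatten and deduplicate such a library with the standard library instead of sum(..., []). It is an alternative for illustration, not what the notebook itself runs.

```python
from itertools import chain

# Toy stand-in for the library dict; names and gene lists are illustrative only.
toy_library = {
    "Pathway A (GO:0000001)": ["XBP1", "ATF6B", "DDIT3"],
    "Pathway B (GO:0000002)": ["DDIT3", "MANF"],
    "Pathway C (GO:0000003)": ["XBP1"],
}

# chain.from_iterable walks every list once, whereas sum(..., []) builds a
# new, longer list for each pathway it adds.
flat = list(chain.from_iterable(toy_library.values()))

# dict.fromkeys drops duplicates while keeping first-seen order.
unique_genes = list(dict.fromkeys(flat))

print(len(flat), len(unique_genes))
print(unique_genes)
```

For a library with thousands of pathways, the single pass avoids the repeated list copying that sum performs.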
Manually register the 32 non-validated symbols:
inspect_result = bt.Gene.inspect(all_genes, organism="human")
organism = bt.Organism.get(name="human")
nonval_genes = []
for g in inspect_result.non_validated:
    nonval_genes.append(bt.Gene(symbol=g, organism=organism))
ln.save(nonval_genes)
! received 14217 unique terms, 154953 empty/duplicated terms are ignored
! 32 unique terms (0.20%) are not validated for symbol: 'LOC112694756', 'LOC102724971', 'IGL', 'LOC102723407', 'APOBEC3A_B', 'LOC102724560', 'TNFAIP8L2-SCNM1', 'TRA', 'LOC124905743', 'CCL4L1', ...
couldn't validate 32 terms: 'LOC102724652', 'DUX1', 'TRA', 'DNAAF19', 'CCL3L1', 'LOC102724971', 'LOC112268384', 'RBMY1C', 'LOC100533997', 'LOC102724560', 'TMEM278', 'LOC102725023', 'FSAF1', 'LOC124905743', 'VMA22', 'CHLSN', 'LOC102723407', 'LOC107987479', 'LOC112694756', 'SLC67A1', ...
→ if you are sure, create new records via Gene() and save to your registry
! you are trying to create a record with name='IGL' but records with similar symbols exist: 'IGLL5', 'IGLL1', 'PIGL'. Did you mean to load one of them?
! you are trying to create a record with name='TRA' but records with similar symbols exist: 'TRAF5', 'TRAF6', 'TRAPPC2L'. Did you mean to load one of them?
! you are trying to create a record with name='CCL4L1' but records with similar symbols exist: 'CCL4', 'CCL4', 'CCL4'. Did you mean to load one of them?
! you are trying to create a record with name='CCL3L1' but records with similar symbols exist: 'CCL3', 'CCL3', 'CCL3'. Did you mean to load one of them?
! you are trying to create a record with name='DNAAF19' but records with similar symbols exist: 'DNAAF4', 'DNAAF3', 'DNAAF2'. Did you mean to load one of them?
! you are trying to create a record with name='TMEM278' but records with similar symbols exist: 'TMEM230', 'TMEM203', 'TMEM231'. Did you mean to load one of them?
! you are trying to create a record with name='RBMY1C' but records with similar symbols exist: 'RBMY1B', 'RBMY1F', 'RBMY1E'. Did you mean to load one of them?
Link pathways to genes
Now that all pathway and gene records are tracked, we can link them to make the pathways even more queryable.
symbols_genes = {record.symbol: record for record in genes}
for pathway in pathways:
    pathway_genes = go_bp_parsed.get(pathway.ontology_id)[1]
    # Keep only symbols that have a matching Gene record in the lookup
    pathway_genes_records = [
        symbols_genes[gene] for gene in pathway_genes if gene in symbols_genes
    ]
    pathway.genes.set(pathway_genes_records)
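A symbol-to-record lookup only covers symbols that appear as keys, so a symbol that was standardized to a different name or registered separately has no entry. As a hedged sketch with toy data (plain strings stand in for Gene records, and all names here are hypothetical), one could list the unmatched symbols per pathway before linking:

```python
# Toy data; ontology IDs, symbols, and records are placeholders for illustration.
toy_pathway_genes = {
    "GO:0000001": ["XBP1", "ATF6B", "NOT_REGISTERED"],
    "GO:0000002": ["MANF"],
}
# Stand-in for a {symbol: record} lookup; strings play the role of Gene records.
toy_symbol_to_record = {
    "XBP1": "rec-XBP1",
    "ATF6B": "rec-ATF6B",
    "MANF": "rec-MANF",
}

# Collect, per pathway, the symbols that have no entry in the lookup.
missing_by_pathway = {}
for ontology_id, symbols in toy_pathway_genes.items():
    missing = [s for s in symbols if s not in toy_symbol_to_record]
    if missing:
        missing_by_pathway[ontology_id] = missing

print(missing_by_pathway)
```

An empty result would indicate that every annotated symbol can be resolved to a record.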
Now genes are linked to pathways. Here, pathway still refers to the last record from the loop above:
pathway.genes.to_list("symbol")
['PGK1',
'CTSH',
'OMA1',
'GGT1',
'FURIN',
'FGA',
'FGB',
'CTSL',
'PRSS3',
'FGG',
'F12',
'THBD',
'PLAT',
'DLC1',
'MMP14',
'PLAU',
'ZNF160',
'C1R',
'VSIR',
'KLKB1',
'F9',
'HGFAC',
'C1RL',
'HPR',
'KLK2',
'KLK1',
'CUZD1']
pathway.genes.to_list("ensembl_gene_id")
['ENSG00000102144',
'ENSG00000103811',
'ENSG00000162600',
'ENSG00000100031',
'ENSG00000140564',
'ENSG00000171560',
'ENSG00000171564',
'ENSG00000135047',
'ENSG00000010438',
'ENSG00000171557',
'ENSG00000131187',
'ENSG00000178726',
'ENSG00000104368',
'ENSG00000288673',
'ENSG00000157227',
'ENSG00000122861',
'ENSG00000170949',
'ENSG00000288512',
'ENSG00000107738',
'ENSG00000164344',
'ENSG00000101981',
'ENSG00000109758',
'ENSG00000288124',
'ENSG00000261701',
'ENSG00000167751',
'ENSG00000167748',
'ENSG00000138161']