## lamindb.examples.datasets

Example datasets.

# The mini immuno dataset

| Module | Description |
| --- | --- |
| "mini_immuno" | Two "mini immuno" datasets. |

* lamindb.examples.datasets.mini_immuno

  * Datasets

    * "get_dataset1()"

    * "get_dataset2()"

  * Schemas

    * "define_features_labels()"

    * "define_mini_immuno_schema_flexible()"

  * Utilities

    * "save_mini_immuno_datasets()"

# Small in-memory datasets

lamindb.examples.datasets.anndata_with_obs()

 Create a mini anndata with cell_type, disease and tissue.

 Return type:
 "AnnData"

# Files

lamindb.examples.datasets.file_fcs()

 Example FCS artifact.

 Return type:
 "Path"

lamindb.examples.datasets.file_fcs_alpert19(populate_registries=False)

 FCS file from Alpert19.

 Parameters:
 **populate_registries** ("bool", default: "False") -- pre-populate
 metadata records to simulate existing registries

 Return type:
 "Path"

lamindb.examples.datasets.file_tsv_rnaseq_nfcore_salmon_merged_gene_counts(populate_registries=False)

 Gene counts table from nf-core RNA-seq pipeline.

 Output of: https://nf-co.re/rnaseq

 Return type:
 "Path"

lamindb.examples.datasets.file_jpg_paradisi05()

 JPG file example.

 Originally from:
 https://upload.wikimedia.org/wikipedia/commons/2/28/Laminopathic_nuclei.jpg

 Return type:
 "Path"

lamindb.examples.datasets.file_tiff_suo22()

 Image file from Suo22.

 Pair with "anndata_suo22_Visium10X()".

 Return type:
 "Path"

lamindb.examples.datasets.file_fastq(in_storage_root=False)

 Mini mock fastq artifact.

 Return type:
 "Path"

lamindb.examples.datasets.file_bam(in_storage_root=False)

 Mini mock bam artifact.

 Return type:
 "Path"

lamindb.examples.datasets.file_mini_csv(in_storage_root=False)

 Mini csv artifact.

 Return type:
 "Path"

# Directories

lamindb.examples.datasets.dir_scrnaseq_cellranger(sample_name, basedir='./', output_only=True)

 Mock cell ranger outputs.

 Parameters:
 * **sample_name** ("str") -- name of the sample

 * **basedir** ("str" | "Path", default: "'./'") -- run directory

 * **output_only** ("bool", default: "True") -- only return
 output files

 Return type:
 "Path"

lamindb.examples.datasets.dir_iris_images()

 Directory with 3 studies of the Iris flower: 405 images & metadata.

 Provenance:
 https://lamin.ai/laminlabs/lamindata/transform/3q4MpQxRL2qZ5zKv

 The same artifact was also ingested by the downstream demo notebook:
 https://lamin.ai/laminlabs/lamindata/transform/NJvdsWWbJlZS5zKv

 As a result, the UI shows the artifact as an output of the downstream
 demo notebook rather than of the upstream curation notebook. The
 lineage information should still be captured by
 https://github.com/laminlabs/lnschema-core/blob/a90437e91dfbd6b9002f18c3e978bd0f9c9a632d/lamindb/models.py#L2050-L2052
 but the UI does not use it yet.

 Return type:
 "UPath"

# Dictionary, Dataframe, AnnData, MuData, SpatialData

lamindb.examples.datasets.dict_cellxgene_uns()

 An example CELLxGENE AnnData ".uns" dictionary.

 Return type:
 "dict"["str", "Any"]

lamindb.examples.datasets.df_iris()

 The iris collection as in sklearn.

 Original code:

 sklearn.datasets.load_iris(as_frame=True).frame

 Return type:
 "DataFrame"

lamindb.examples.datasets.df_iris_in_meter()

 The iris collection with lengths in meters.

 Return type:
 "DataFrame"

lamindb.examples.datasets.df_iris_in_meter_study1()

 The iris collection with lengths in meters.

 Return type:
 "DataFrame"

lamindb.examples.datasets.df_iris_in_meter_study2()

 The iris collection with lengths in meters.

 Return type:
 "DataFrame"

lamindb.examples.datasets.anndata_mouse_sc_lymph_node(populate_registries=False)

 Mouse lymph node scRNA-seq collection from EBI.

 Subsampled to 10k genes.

 From: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-8414/

 Parameters:
 **populate_registries** ("bool", default: "False") -- pre-populate
 metadata records to simulate existing registries

 Return type:
 "AnnData"

lamindb.examples.datasets.anndata_human_immune_cells(populate_registries=False)

 Cross-tissue immune cell analysis reveals tissue-specific features
 in humans.

 From:
 https://cellxgene.cziscience.com/collections/62ef75e4-cbea-454e-a0ce-998ec40223d3

 Collection: Global

 Return type:
 "AnnData"

 To reproduce the subsample:
 >>> import scanpy as sc
 >>> adata = sc.read('Global.h5ad')
 >>> adata.obs = adata.obs[['donor_id', 'tissue', 'cell_type', 'assay', 'tissue_ontology_term_id', 'cell_type_ontology_term_id', 'assay_ontology_term_id']].copy()
 >>> sc.pp.subsample(adata, fraction=0.005)
 >>> del adata.uns["development_cache_ontology_term_id_colors"]
 >>> del adata.uns["sex_ontology_term_id_colors"]
 >>> adata.write('human_immune.h5ad')

lamindb.examples.datasets.anndata_pbmc68k_reduced()

 Modified from scanpy.datasets.pbmc68k_reduced().

 This code was run:

 import scanpy as sc

 pbmc68k = sc.datasets.pbmc68k_reduced()
 pbmc68k.obs.rename(columns={"bulk_labels": "cell_type"}, inplace=True)
 pbmc68k.obs["cell_type"] = pbmc68k.obs["cell_type"].cat.rename_categories(
     {"Dendritic": "Dendritic cells", "CD14+ Monocyte": "CD14+ Monocytes"}
 )
 del pbmc68k.obs["G2M_score"]
 del pbmc68k.obs["S_score"]
 del pbmc68k.obs["phase"]
 del pbmc68k.obs["n_counts"]
 del pbmc68k.var["dispersions"]
 del pbmc68k.var["dispersions_norm"]
 del pbmc68k.var["means"]
 del pbmc68k.uns["rank_genes_groups"]
 del pbmc68k.uns["bulk_labels_colors"]
 sc.pp.subsample(pbmc68k, fraction=0.1, random_state=123)
 pbmc68k.write("scrnaseq_pbmc68k_tiny.h5ad")

 Return type:
 "AnnData"

lamindb.examples.datasets.anndata_file_pbmc68k_test()

 Modified from scanpy.datasets.pbmc68k_reduced().

 Additional slots were added for testing purposes. Returns the
 filepath.

 To reproduce:

 import lamindb as ln
 from scipy import sparse

 pbmc68k = ln.examples.datasets.anndata_pbmc68k_reduced()
 pbmc68k_test = pbmc68k[:30, :200].copy()
 pbmc68k_test.raw = pbmc68k_test[:, :100]
 pbmc68k_test.obsp["test"] = sparse.eye(pbmc68k_test.shape[0], format="csr")
 pbmc68k_test.varp["test"] = sparse.eye(pbmc68k_test.shape[1], format="csr")
 pbmc68k_test.layers["test"] = sparse.csr_matrix(pbmc68k_test.shape)
 pbmc68k_test.layers["test"][0] = 1.
 pbmc68k_test.write("pbmc68k_test.h5ad")

 Return type:
 "Path"

lamindb.examples.datasets.anndata_pbmc3k_processed()

 Modified from scanpy.datasets.pbmc3k_processed().

 Return type:
 "AnnData"

lamindb.examples.datasets.anndata_suo22_Visium10X()

 AnnData from Suo22 generated by 10x Visium.

lamindb.examples.datasets.anndata_visium_mouse_cellxgene()

 Visium samples of thymus from wild type B6 mice 3-6 weeks old.

 The dataset is a CELLxGENE schema 7.0.0 validated dataset.

 Return type:
 "AnnData"

lamindb.examples.datasets.mudata_papalexi21_subset(with_uns=False)

 A subsetted MuData from papalexi21.

 Return type:
 "MuData"

 To reproduce the subsetting:
 >>> !wget https://figshare.com/ndownloader/files/36509460
 >>> import mudata as md
 >>> import scanpy as sc
 >>> mdata = md.read_h5mu("36509460")
 >>> mdata = sc.pp.subsample(mdata, n_obs=200, copy=True)[0]
 >>> mdata[:, -300:].copy().write("papalexi21_subset_200x300_lamindb_demo_2023-07-25.h5mu")

lamindb.examples.datasets.schmidt22_crispra_gws_IFNG(basedir='.')

 CRISPRa screen collection of Schmidt22.

 Originally from: https://zenodo.org/record/5784651

 Return type:
 "Path"

lamindb.examples.datasets.schmidt22_perturbseq(basedir='.')

 Perturb-seq collection of Schmidt22.

 Subsampled and converted to h5ad from R file:
 https://zenodo.org/record/5784651

 To reproduce the subsample:
 >>> adata = sc.read('HuTcellsCRISPRaPerturbSeq_Re-stimulated.h5ad')
 >>> adata.obs = adata.obs[['cluster_name']]
 >>> del adata.obsp
 >>> del adata.var['features']
 >>> del adata.obsm['X_pca']
 >>> del adata.uns
 >>> del adata.raw
 >>> del adata.varm
 >>> adata.obs = adata.obs.reset_index()
 >>> del adata.obs['index']
 >>> sc.pp.subsample(adata, 0.03)
 >>> adata.write('schmidt22_perturbseq.h5ad')

 Return type:
 "Path"

lamindb.examples.datasets.spatialdata_blobs()

 Example SpatialData dataset for tutorials.

 Return type:
 "SpatialData"

# Other

lamindb.examples.datasets.fake_bio_notebook_titles(n=100)

 A fake collection of study titles.

 Return type:
 "list"["str"]