RxRx: cell imaging
rxrx.ai hosts high-throughput cell imaging datasets generated by Recursion.
Large numbers of fluorescence microscopy images characterize cellular phenotypes in vitro, based on morphology and protein expression (5-10 stains), across a range of conditions.
In this guide, you’ll see how to query some of these data using LaminDB. If you’d like to transfer data into your own LaminDB instance, see the transfer guide.
# !pip install 'lamindb[gcp]' duckdb
!lamin init --modules bionty,wetlab --storage ./test-rxrx
→ initialized lamindb: testuser1/test-rxrx
import lamindb as ln
import bionty as bt
import wetlab as wl
→ connected lamindb: testuser1/test-rxrx
Create the central query object for this instance:
lamindata_db = ln.QueryDB("laminlabs/lamindata")
Search & look up metadata
We’ll find all genetic treatments in the GeneticPerturbation registry:
df = lamindata_db.genetic_perturbations.to_dataframe()
df.shape
(100, 13)
Let us create a lookup object for siRNAs so that we can auto-complete queries involving them:
sirnas = lamindata_db.genetic_perturbations.filter(system="siRNA").lookup(
    return_field="name"
)
We’re also interested in cell lines & wells:
cell_lines = lamindata_db.cell_lines.lookup(return_field="abbr")
wells = lamindata_db.wells.lookup(return_field="name")
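To make the idea of a lookup object concrete, here is a minimal sketch in plain Python. It is not LaminDB's implementation: the helper `make_lookup`, its key-normalization rule, and the toy records (including the name "Hep G2 cell") are assumptions for illustration only. It shows how record names can become attribute-style keys that return the chosen `return_field`, with the first record winning when names collide.

```python
import re
from types import SimpleNamespace


def make_lookup(records, key_field="name", return_field="name"):
    """Build an attribute-style lookup (hypothetical helper, not LaminDB's API).

    Keys are derived from `key_field`; values come from `return_field`.
    When two records normalize to the same key, the first one is kept.
    """
    entries = {}
    for record in records:
        # Lowercase and replace runs of non-word characters with underscores.
        key = re.sub(r"\W+", "_", str(record[key_field]).strip().lower()).strip("_")
        if not key:
            continue  # skip records whose name has no usable characters
        if key[0].isdigit():
            key = f"_{key}"  # attribute names cannot start with a digit
        entries.setdefault(key, record[return_field])
    return SimpleNamespace(**entries)


# Toy records; the field values are assumptions, not data from the instance.
toy_cell_lines = [
    {"name": "Hep G2 cell", "abbr": "HEPG2"},
    {"name": "Hep G2 cell", "abbr": "HEPG2-duplicate"},
]
toy_lookup = make_lookup(toy_cell_lines, return_field="abbr")
print(toy_lookup.hep_g2_cell)
```

Because the lookup returns plain values such as the abbreviation, the result can be compared directly against DataFrame columns.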
Load the collection
This is RxRx1: 125k images for 1138 siRNA perturbations across 4 cell lines, reading out 5 stains; the image dimensions are 512x512x6.
Let us get the corresponding object and some information about it:
collection = lamindata_db.collections.get("Br2Z1lVSQBAkkbbt7ILu")
collection.view_lineage()
collection.describe()
Collection: Annotated RxRx1 images (1)
└── uid: Br2Z1lVSQBAkkbbt7ILu
    run: 2024-06-17T12:31:43.923373+00:00 (01-rxrx1-ingest.ipynb)
    branch: main
    space: all
    created_at: 2024-06-17 12:43:02 UTC
    created_by: sunnyosun
The dataset consists of a metadata file and a folder path pointing to the image files:
collection.meta_artifact.load().head()
! run input wasn't tracked, call `ln.track()` and re-run
| | site_id | well_id | cell_line | split | experiment | plate | well | site | well_type | sirna | sirna_id | path |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w1.png |
| 1 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w2.png |
| 2 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w3.png |
| 3 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w4.png |
| 4 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w5.png |
Query image files
Because we chose not to register each image as a record in the Artifact registry, we have to query the images through the metadata file of the dataset:
df = collection.meta_artifact.load()
! run input wasn't tracked, call `ln.track()` and re-run
We can select a subset of images by combining values from the metadata registries with pandas boolean indexing:
query = df[
    (df.cell_line == cell_lines.hep_g2_cell)
    & (df.sirna == sirnas.s15652)
    & (df.well == wells.m15)
    & (df.plate == 1)
    & (df.site == 2)
]
query
! 5 records found for 's15652'. Returning based on keep='first'.
| | site_id | well_id | cell_line | split | experiment | plate | well | site | well_type | sirna | sirna_id | path |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3066 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w1.png |
| 3067 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w2.png |
| 3068 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w3.png |
| 3069 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w4.png |
| 3070 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w5.png |
| 3071 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w6.png |
To access the individual images from this query result, first look up the storage root of the image artifact:
collection.data_artifact.storage.root
'gs://rxrx1-europe-west4'
images = [f"{collection.data_artifact.storage.root}/{key}" for key in query.path]
images
['gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w1.png',
'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w2.png',
'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w3.png',
'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w4.png',
'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w5.png',
'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w6.png']
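The image paths appear to encode the same metadata as the table columns. The sketch below parses a path back into its parts, which can be handy for sanity-checking a selection. The layout `images/<split>/<experiment>/Plate<plate>/<well>_s<site>_w<channel>.png` is an assumption inferred from the rows shown above, and `ImageKey` and `parse_image_path` are hypothetical names, not part of LaminDB.

```python
import re
from typing import NamedTuple


class ImageKey(NamedTuple):
    """Metadata recovered from an image path (hypothetical helper type)."""

    split: str
    experiment: str
    plate: int
    well: str
    site: int
    channel: int


# Assumed layout: images/<split>/<experiment>/Plate<plate>/<well>_s<site>_w<channel>.png
PATH_PATTERN = re.compile(
    r"images/(?P<split>[^/]+)/(?P<experiment>[^/]+)/Plate(?P<plate>\d+)/"
    r"(?P<well>[A-Z]+\d+)_s(?P<site>\d+)_w(?P<channel>\d+)\.png$"
)


def parse_image_path(path: str) -> ImageKey:
    """Split an image path or URI into its metadata fields."""
    match = PATH_PATTERN.search(path)
    if match is None:
        raise ValueError(f"unexpected image path layout: {path!r}")
    parts = match.groupdict()
    return ImageKey(
        split=parts["split"],
        experiment=parts["experiment"],
        plate=int(parts["plate"]),
        well=parts["well"],
        site=int(parts["site"]),
        channel=int(parts["channel"]),
    )


example = parse_image_path(
    "gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w1.png"
)
print(example)
```

Since the pattern is anchored only at the end of the string, it works for both the relative `path` column values and the full `gs://` URIs.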
Download an image to disk:
path = ln.UPath(images[1])
path.download_to(".")
from IPython.display import Image
Image(f"./{path.name}")
Use DuckDB to query metadata
As an alternative to pandas, we could use DuckDB to query image metadata.
import duckdb
features = lamindata_db.features.lookup(return_field="name")
# Named filter_expr to avoid shadowing Python's built-in filter()
filter_expr = (
    f"{features.cell_line} == '{cell_lines.hep_g2_cell}' and {features.sirna} =="
    f" '{sirnas.s15652}' and {features.well} == '{wells.m15}' and "
    f"{features.plate} == '1' and {features.site} == '2'"
)
parquet_data = duckdb.from_parquet(
    collection.meta_artifact.path.as_posix() + "?s3_region=us-east-1"
)
parquet_data.filter(filter_expr)
┌──────────────────┬────────────────┬───────────┬─────────┬────────────┬───────┬─────────┬───────┬──────────────────┬─────────┬──────────┬───────────────────────────────────────────┐
│ site_id │ well_id │ cell_line │ split │ experiment │ plate │ well │ site │ well_type │ sirna │ sirna_id │ path │
│ varchar │ varchar │ varchar │ varchar │ varchar │ int64 │ varchar │ int64 │ varchar │ varchar │ int64 │ varchar │
├──────────────────┼────────────────┼───────────┼─────────┼────────────┼───────┼─────────┼───────┼──────────────────┼─────────┼──────────┼───────────────────────────────────────────┤
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2 │ test │ HEPG2-08 │ 1 │ M15 │ 2 │ positive_control │ s15652 │ 1114 │ images/test/HEPG2-08/Plate1/M15_s2_w1.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2 │ test │ HEPG2-08 │ 1 │ M15 │ 2 │ positive_control │ s15652 │ 1114 │ images/test/HEPG2-08/Plate1/M15_s2_w2.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2 │ test │ HEPG2-08 │ 1 │ M15 │ 2 │ positive_control │ s15652 │ 1114 │ images/test/HEPG2-08/Plate1/M15_s2_w3.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2 │ test │ HEPG2-08 │ 1 │ M15 │ 2 │ positive_control │ s15652 │ 1114 │ images/test/HEPG2-08/Plate1/M15_s2_w4.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2 │ test │ HEPG2-08 │ 1 │ M15 │ 2 │ positive_control │ s15652 │ 1114 │ images/test/HEPG2-08/Plate1/M15_s2_w5.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2 │ test │ HEPG2-08 │ 1 │ M15 │ 2 │ positive_control │ s15652 │ 1114 │ images/test/HEPG2-08/Plate1/M15_s2_w6.png │
└──────────────────┴────────────────┴───────────┴─────────┴────────────┴───────┴─────────┴───────┴──────────────────┴─────────┴──────────┴───────────────────────────────────────────┘
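Hand-written filter strings get fragile once values contain quotes or the list of conditions grows. As a minimal sketch, the helper below assembles an equivalent SQL expression from a dictionary, using standard SQL `=` and quoting identifiers and string literals. `sql_literal`, `quote_identifier`, and `build_filter` are hypothetical names introduced here, not DuckDB or LaminDB functions, and the example values are taken from the query result shown above.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal, doubling embedded single quotes."""
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    return "'" + str(value).replace("'", "''") + "'"


def quote_identifier(name):
    """Wrap a column name in double quotes, doubling embedded double quotes."""
    return '"' + str(name).replace('"', '""') + '"'


def build_filter(conditions):
    """Join column/value equality tests with AND (hypothetical helper)."""
    if not conditions:
        raise ValueError("at least one condition is required")
    return " AND ".join(
        f"{quote_identifier(column)} = {sql_literal(value)}"
        for column, value in conditions.items()
    )


example_filter = build_filter(
    {"cell_line": "HEPG2", "sirna": "s15652", "well": "M15", "plate": 1, "site": 2}
)
print(example_filter)
```

The resulting string can be passed to the relation's `filter` method in place of a manually formatted expression; integer columns such as `plate` and `site` are then compared against integer literals rather than strings.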