RxRx: cell imaging

rxrx.ai hosts high-throughput cell imaging datasets generated by Recursion.

Large numbers of fluorescence microscopy images characterize cellular phenotypes in vitro, based on morphology and protein expression (5-10 stains), across a range of conditions.

In this guide, you’ll see how to query some of these data using LaminDB. If you’d like to transfer data into your own LaminDB instance, see the transfer guide.

# !pip install 'lamindb[gcp]' duckdb
!lamin init --modules bionty,wetlab --storage ./test-rxrx
 initialized lamindb: testuser1/test-rxrx
import lamindb as ln
import bionty as bt
import wetlab as wl
 connected lamindb: testuser1/test-rxrx

Create the central query object for this instance:

lamindata_db = ln.QueryDB("laminlabs/lamindata")

Search & look up metadata

We’ll find all genetic treatments in the GeneticPerturbation registry:

df = lamindata_db.genetic_perturbations.to_dataframe()
df.shape
(100, 13)

Let us create a lookup object for siRNAs so that queries involving them can be auto-completed:

sirnas = lamindata_db.genetic_perturbations.filter(system="siRNA").lookup(
    return_field="name"
)

We’re also interested in cell lines & wells:

cell_lines = lamindata_db.cell_lines.lookup(return_field="abbr")
wells = lamindata_db.wells.lookup(return_field="name")
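
To make the idea behind lookup objects concrete, here is a minimal, library-independent sketch. It is not LaminDB's implementation: the helpers `to_identifier` and `make_lookup`, the normalization rule (lowercase, runs of non-word characters replaced by underscores), and the example records are all assumptions made for illustration. The sketch maps record names to attribute-friendly keys that an editor can auto-complete, and keeps the first record when several share a key.

```python
import re
from types import SimpleNamespace


def to_identifier(name: str) -> str:
    """Turn an arbitrary record name into a valid Python attribute name."""
    key = re.sub(r"\W+", "_", name.strip().lower()).strip("_")
    # Identifiers cannot start with a digit, so prefix those keys.
    return f"_{key}" if key[:1].isdigit() else key


def make_lookup(records: list[dict], field: str, return_field: str) -> SimpleNamespace:
    """Build an attribute-style lookup; on duplicate keys the first record wins."""
    lookup = {}
    for record in records:
        lookup.setdefault(to_identifier(record[field]), record[return_field])
    return SimpleNamespace(**lookup)


# Hypothetical records, made up for this example.
records = [
    {"name": "Hep G2 cell", "abbr": "HEPG2"},
    {"name": "HUVEC cell", "abbr": "HUVEC"},
    {"name": "Hep G2 cell", "abbr": "HEPG2-duplicate"},
]
demo_lookup = make_lookup(records, field="name", return_field="abbr")
print(demo_lookup.hep_g2_cell)
```

Typing `demo_lookup.` in an interactive session then offers `hep_g2_cell` and `huvec_cell` as completions, which is the convenience the lookup objects above provide for registry records.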

Load the collection

This is RxRx1: 125k images for 1138 siRNA perturbations across 4 cell lines, reading out 5 stains. Each image has dimensions 512x512x6.

Let us get the corresponding object and some information about it:

collection = lamindata_db.collections.get("Br2Z1lVSQBAkkbbt7ILu")
collection.view_lineage()
collection.describe()
_images/7d310607bf7a7032e4659ce194c871a0b48338b05e904981ff2b8eca7955145c.svg
Collection: Annotated RxRx1 images (1)
└── uid: Br2Z1lVSQBAkkbbt7ILu            run: 2024-06-17T12:31:43.923373+00:00 (01-rxrx1-ingest.ipynb)
    branch: main                         space: all                                                   
    created_at: 2024-06-17 12:43:02 UTC  created_by: sunnyosun                                        

The dataset consists of a metadata file and a folder path pointing to the image files:

collection.meta_artifact.load().head()
! run input wasn't tracked, call `ln.track()` and re-run
            site_id         well_id  cell_line  split  experiment  plate  well  site         well_type  sirna  sirna_id                                       path
0  HEPG2-08_1_B02_1  HEPG2-08_1_B02      HEPG2   test    HEPG2-08      1   B02     1  negative_control  EMPTY      1138  images/test/HEPG2-08/Plate1/B02_s1_w1.png
1  HEPG2-08_1_B02_1  HEPG2-08_1_B02      HEPG2   test    HEPG2-08      1   B02     1  negative_control  EMPTY      1138  images/test/HEPG2-08/Plate1/B02_s1_w2.png
2  HEPG2-08_1_B02_1  HEPG2-08_1_B02      HEPG2   test    HEPG2-08      1   B02     1  negative_control  EMPTY      1138  images/test/HEPG2-08/Plate1/B02_s1_w3.png
3  HEPG2-08_1_B02_1  HEPG2-08_1_B02      HEPG2   test    HEPG2-08      1   B02     1  negative_control  EMPTY      1138  images/test/HEPG2-08/Plate1/B02_s1_w4.png
4  HEPG2-08_1_B02_1  HEPG2-08_1_B02      HEPG2   test    HEPG2-08      1   B02     1  negative_control  EMPTY      1138  images/test/HEPG2-08/Plate1/B02_s1_w5.png

Query image files

Because we chose not to register each image as a record in the Artifact registry, we have to query the images through the metadata file of the dataset:

df = collection.meta_artifact.load()
! run input wasn't tracked, call `ln.track()` and re-run

We can query a subset of images using metadata registries & pandas query syntax:

query = df[
    (df.cell_line == cell_lines.hep_g2_cell)
    & (df.sirna == sirnas.s15652)
    & (df.well == wells.m15)
    & (df.plate == 1)
    & (df.site == 2)
]
query
! 5 records found for 's15652'. Returning based on keep='first'.
               site_id         well_id  cell_line  split  experiment  plate  well  site         well_type   sirna  sirna_id                                       path
3066  HEPG2-08_1_M15_2  HEPG2-08_1_M15      HEPG2   test    HEPG2-08      1   M15     2  positive_control  s15652      1114  images/test/HEPG2-08/Plate1/M15_s2_w1.png
3067  HEPG2-08_1_M15_2  HEPG2-08_1_M15      HEPG2   test    HEPG2-08      1   M15     2  positive_control  s15652      1114  images/test/HEPG2-08/Plate1/M15_s2_w2.png
3068  HEPG2-08_1_M15_2  HEPG2-08_1_M15      HEPG2   test    HEPG2-08      1   M15     2  positive_control  s15652      1114  images/test/HEPG2-08/Plate1/M15_s2_w3.png
3069  HEPG2-08_1_M15_2  HEPG2-08_1_M15      HEPG2   test    HEPG2-08      1   M15     2  positive_control  s15652      1114  images/test/HEPG2-08/Plate1/M15_s2_w4.png
3070  HEPG2-08_1_M15_2  HEPG2-08_1_M15      HEPG2   test    HEPG2-08      1   M15     2  positive_control  s15652      1114  images/test/HEPG2-08/Plate1/M15_s2_w5.png
3071  HEPG2-08_1_M15_2  HEPG2-08_1_M15      HEPG2   test    HEPG2-08      1   M15     2  positive_control  s15652      1114  images/test/HEPG2-08/Plate1/M15_s2_w6.png
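
The six rows returned by this query are the six channel files of a single imaging site. As a small sketch of how to work with them, the snippet below parses file names of the form `<well>_s<site>_w<channel>.png` and orders the paths of each site by channel. The naming pattern is inferred from the `path` column rather than taken from an official specification, and the helper names `parse_image_name` and `channels_by_site` are hypothetical.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Pattern inferred from paths such as "images/test/HEPG2-08/Plate1/M15_s2_w1.png".
NAME_PATTERN = re.compile(r"^(?P<well>[A-Z]\d{2})_s(?P<site>\d+)_w(?P<channel>\d+)\.png$")


def parse_image_name(path: str) -> dict:
    """Extract well, site and channel from an image path."""
    match = NAME_PATTERN.match(PurePosixPath(path).name)
    if match is None:
        raise ValueError(f"unexpected image name: {path}")
    return {
        "well": match["well"],
        "site": int(match["site"]),
        "channel": int(match["channel"]),
    }


def channels_by_site(paths: list[str]) -> dict:
    """Group paths by (directory, well, site) and sort each group by channel."""
    groups = defaultdict(list)
    for path in paths:
        info = parse_image_name(path)
        key = (str(PurePosixPath(path).parent), info["well"], info["site"])
        groups[key].append((info["channel"], path))
    return {key: [p for _, p in sorted(items)] for key, items in groups.items()}


# The paths are deliberately shuffled to show that the result is channel-ordered.
paths = [f"images/test/HEPG2-08/Plate1/M15_s2_w{c}.png" for c in (3, 1, 2, 6, 5, 4)]
grouped = channels_by_site(paths)
print(grouped[("images/test/HEPG2-08/Plate1", "M15", 2)])
```

Grouping by directory as well as by well and site keeps sites from different plates or experiments apart, since well and site identifiers repeat across plates.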

To access the individual images based on this query result:

collection.data_artifact.storage.root
'gs://rxrx1-europe-west4'
images = [f"{collection.data_artifact.storage.root}/{key}" for key in query.path]
images
['gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w1.png',
 'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w2.png',
 'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w3.png',
 'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w4.png',
 'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w5.png',
 'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w6.png']
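
The f-string join works here because the storage root has no trailing slash and the keys have no leading slash. A slightly more defensive sketch, using a hypothetical helper `join_storage_key`, normalizes both sides before joining:

```python
def join_storage_key(root: str, key: str) -> str:
    """Join a storage root and an object key with exactly one slash between them."""
    return f"{root.rstrip('/')}/{key.lstrip('/')}"


root = "gs://rxrx1-europe-west4"
keys = [f"images/test/HEPG2-08/Plate1/M15_s2_w{channel}.png" for channel in range(1, 7)]
image_urls = [join_storage_key(root, key) for key in keys]
print(image_urls[1])
```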

Download an image to disk:

path = ln.UPath(images[1])
path.download_to(".")
from IPython.display import Image

Image(f"./{path.name}")
_images/e9ab80eeba21bdcf86c18651e2665c5a5406cd56b4860eaa76eb961fa3a225fd.png

Use DuckDB to query metadata

As an alternative to pandas, we could use DuckDB to query image metadata.

import duckdb

features = lamindata_db.features.lookup(return_field="name")

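# Build a SQL filter expression from the looked-up feature names and values.
# plate and site are integer columns, so their values are not quoted.
# The variable is named filter_expr to avoid shadowing the built-in filter().
filter_expr = (
    f"{features.cell_line} == '{cell_lines.hep_g2_cell}' and {features.sirna} =="
    f" '{sirnas.s15652}' and {features.well} == '{wells.m15}' and "
    f"{features.plate} == 1 and {features.site} == 2"
)

# Read the Parquet metadata file directly from object storage.
parquet_data = duckdb.from_parquet(
    collection.meta_artifact.path.as_posix() + "?s3_region=us-east-1"
)

parquet_data.filter(filter_expr)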
┌──────────────────┬────────────────┬───────────┬─────────┬────────────┬───────┬─────────┬───────┬──────────────────┬─────────┬──────────┬───────────────────────────────────────────┐
│     site_id      │    well_id     │ cell_line │  split  │ experiment │ plate │  well   │ site  │    well_type     │  sirna  │ sirna_id │                   path                    │
│     varchar      │    varchar     │  varchar  │ varchar │  varchar   │ int64 │ varchar │ int64 │     varchar      │ varchar │  int64   │                  varchar                  │
├──────────────────┼────────────────┼───────────┼─────────┼────────────┼───────┼─────────┼───────┼──────────────────┼─────────┼──────────┼───────────────────────────────────────────┤
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w1.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w2.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w3.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w4.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w5.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w6.png │
└──────────────────┴────────────────┴───────────┴─────────┴────────────┴───────┴─────────┴───────┴──────────────────┴─────────┴──────────┴───────────────────────────────────────────┘
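
Hand-written filter strings are easy to get wrong when values contain quotes or when numeric and string columns are mixed. The following sketch builds an equivalent expression from a dictionary. The helpers `quote_identifier`, `sql_literal` and `build_filter` are hypothetical and not part of DuckDB or LaminDB, and the sketch only covers equality conditions on strings, booleans and finite numbers.

```python
def quote_identifier(name: str) -> str:
    """Wrap a column name in double quotes, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def sql_literal(value) -> str:
    """Render a Python value as a SQL literal (strings quoted, numbers bare)."""
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    # Escape embedded single quotes by doubling them, as standard SQL requires.
    return "'" + str(value).replace("'", "''") + "'"


def build_filter(conditions: dict) -> str:
    """Combine column/value pairs into an AND-ed equality filter."""
    return " AND ".join(
        f"{quote_identifier(column)} = {sql_literal(value)}"
        for column, value in conditions.items()
    )


filter_sql = build_filter(
    {"cell_line": "HEPG2", "sirna": "s15652", "well": "M15", "plate": 1, "site": 2}
)
print(filter_sql)
```

The resulting string can be passed to `parquet_data.filter(...)` in place of the hand-written expression.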