Recursion is releasing the RxRx1 dataset to kickstart a flurry of innovation in machine learning on large biological datasets to impact drug discovery and development. RxRx1 is a dataset consisting of 296 GB of 16-bit fluorescent microscopy images, the result of the same experimental design being run multiple times with the primary differences between experiments being technical noise unrelated to the underlying biology. As such, RxRx1 provides a significant sample of controlled biological variability that is prime for training models to discern classes of cell morphology, independent from experimental batch variation. It’s important to note that RxRx1 has been created in a controlled manner to provide the appropriate data for discerning biological variation in its common context of changing experimental conditions.
To date, we have generated over 2 petabytes of image data. RxRx1 represents a glimpse into the massive and truly unique dataset that is being generated at Recursion. RxRx1 is approximately 296 GB, consisting of 125,510 total images representing 1,108 classes. This is comparable to datasets such as ImageNet (ILSVRC2012), which is approximately 155 GB with 1.2 million images in 1,000 classes, and to other biological datasets such as BBBC017 (among others) from the Broad Institute of MIT and Harvard, which is about 56 GB with 64,512 total images representing 4,903 classes.
Artificial Intelligence has the potential to dramatically reframe the challenge of understanding how drugs interact with human cells. Recursion is reinventing drug discovery and development using machine learning and rich biological datasets generated in-house, built for-purpose for machine learning algorithms. RxRx1 is a curated sample of this data that represents less than 1% of Recursion’s current weekly data generation.
RxRx1 presents such a task. Figure 1 demonstrates the complexity of identifying relevant biological variation and separating it from technical noise caused by batch effects. Even when experiments are designed to control for technical variables such as temperature, humidity, and reagent concentration, batch effects unavoidably enter into the data, resulting in images that contain factors of variation due to either biologically relevant variables or irrelevant technical variables. Batch effects threaten to confound any set of experiments across the entire field of biology. Machine disentanglement of batch effects from relevant biological variables would be applicable across the field and could have broad impacts on accelerating drug discovery and development.
The experiment uses a modified Cell Painting staining protocol (Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes, Bray et al., 2016), which uses 6 different stains that adhere to different parts of the cell. The stains fluoresce at different wavelengths and are therefore captured by different imaging channels; thus there are 6 images per imaging site in a well. Each image captures different morphology of the same segment of the well, like layers of a three-dimensional structure.
The images in RxRx1 are generated by carrying out biological experiments using reagents known as siRNAs. A small interfering RNA (siRNA) is a biological reagent used to knock down a particular gene, and every genetic perturbation used in the RxRx1 dataset is carried out via an siRNA. To understand these biological reagents, it’s important to review some key biological concepts.
Recall that each gene in our DNA encodes for a specific protein (sometimes several), and the process by which proteins are created involves transcription (reading the DNA to create complementary mRNA) and translation (reading the mRNA to link amino acids together to create a protein). These chains of amino acids are folded and further modified to yield functional proteins. This notion of information flowing from DNA to mRNA to proteins is called the central dogma of biology.
There are a number of ways to model the loss of function of a gene experimentally, one of which is to directly target the DNA and create a mutation that leads to a lack of protein production. Alternatively, one can target the mRNA, degrading it before it is translated into protein by the ribosome. This is, effectively, what siRNAs do, and so they provide a useful construct by which to model the loss-of-function of a particular gene (see Fig. 3). An siRNA is a 21-nucleotide RNA strand that is designed to be fully complementary to a specific section of an mRNA, enabling efficient binding and ultimately cleavage of the target mRNA. This allows siRNAs to target specific genes and significantly reduce the amount of the corresponding mRNA in a cellular system, effectively reducing the total amount of the associated protein.
However, siRNAs are known to have severe off-target effects: they not only degrade the targeted mRNA but can also block translation of hundreds of additional mRNAs. This happens via the miRNA pathway, and such off-target effects are driven by the seed region (nucleotides 2-8) of the siRNA. These seed-based off-target effects dominate the signal in any siRNA-involved study. To model gene loss-of-function effectively, one must therefore use multiple siRNAs targeting each gene, together with computational methods, to determine whether there is any gene-driven effect in an assay.
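As a small illustration of the seed region described above, the following sketch extracts nucleotides 2-8 from an siRNA strand so that siRNAs sharing a seed can be compared. The function name `seed_region` and both sequences are hypothetical and purely illustrative; they are not taken from the RxRx1 reagents.

```python
def seed_region(sirna: str) -> str:
    """Return nucleotides 2-8 (1-based, inclusive) of an siRNA strand."""
    if len(sirna) < 8:
        raise ValueError("sequence too short to contain a seed region")
    # Python slices are 0-based and end-exclusive, so [1:8] covers
    # positions 2 through 8 in the usual 1-based numbering.
    return sirna[1:8].upper()


# Two invented 21-nucleotide strands (not real reagents) that share a seed.
sirna_a = "ACGUACGUACGUACGUACGUA"
sirna_b = "UCGUACGUGGGGGGGGGGGGG"

seed_a = seed_region(sirna_a)
seed_b = seed_region(sirna_b)
print(seed_a, seed_b, seed_a == seed_b)
```

Two siRNAs with the same seed would be expected to share seed-driven off-target effects even when they are designed against different genes, which is why the seed is a natural grouping key when analysing siRNA data.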
The purpose of this explanation is not to dive into the details of how to mitigate off-target effects, but rather to ensure that researchers working with the RxRx1 dataset sufficiently understand the underlying biology so as to focus on the right problems that can be addressed via this dataset. As no gene is targeted by more than 1 siRNA in the RxRx1 dataset, this dataset should not be used to try to identify gene-specific knockdown effects.
The combined effects of targeted knockdown and seed-based effects lead to observable morphology of a cell culture called a phenotype. Figure 4 shows 4 phenotypes of distinct siRNAs from a single experiment and plate. The phenotype is sometimes visually recognizable from the images, but often the difference in cell morphology is subtle and hard to detect by the human eye.
Since the images in RxRx1 are generated by carrying out biological experiments using siRNAs, which are designed to target and knock down a specific gene, it is tempting to use this data to identify gene-specific morphological changes. However, since siRNAs are known to have significant off-target effects, you would need data from many different siRNAs targeting the same gene, combined with computational methods for deconvolving the target signal from off-target effects. Since this dataset includes only one siRNA per gene, the data provided is insufficient for drawing gene-specific morphological conclusions.
RxRx1 includes data from 51 instances of the same experiment design executed in different experimental batches. In this experiment, we use 1,108 different siRNAs to knock down 1,108 different genes.
The experiment uses 384-well plates (see Fig. 5) to isolate populations of cells into wells where exactly one of 1,108 different siRNAs is introduced into the well to create distinct genetic conditions. A well is like a single test tube at a small scale, 3.3 mm². The outer rows and columns of the plate are not used because they are subject to greater environmental effects, so there are 308 used wells on each plate. Thus the experiment consists of 4 total plates. Each plate holds the same 30 control siRNA conditions, 277 different non-control siRNAs, and one untreated well. The location of each of the 1,108 non-control siRNA conditions is randomized in each experiment to prevent confounding effects of the location of a particular well (see Plate Effects). Each well in each plate contains two 512 x 512 x 6 images. The images were acquired from two non-overlapping regions of each well. Each of the 6 channels can be assigned a consistent color and composited for ease of reviewing (see Fig. 6); however, RxRx1 contains the 6-channel images and not the composite images.
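The plate arithmetic above can be checked with a few lines of Python. This is only a sanity-check sketch: the variable names are illustrative, the 16 x 24 grid is the standard 384-well layout, and the count of 51 experiments comes from the experiment description in this document.

```python
ROWS, COLS = 16, 24            # a 384-well plate is a 16 x 24 grid
CONTROLS = 30                  # control siRNA wells per plate
UNTREATED = 1                  # untreated well per plate
PLATES = 4                     # plates per experiment
SITES = 2                      # imaged sites per well
EXPERIMENTS = 51               # experimental batches in RxRx1

# Dropping the outer rows and columns removes one well on every edge.
used_wells = (ROWS - 2) * (COLS - 2)
treatment_wells = used_wells - CONTROLS - UNTREATED
total_sirnas = treatment_wells * PLATES

# Image count if every used well of every plate had both sites imaged.
nominal_images = EXPERIMENTS * PLATES * used_wells * SITES

print(used_wells, treatment_wells, total_sirnas, nominal_images)
```

The nominal image count from this calculation is slightly higher than the 125,510 images reported for the dataset; the text does not explain the difference, so the nominal figure should be treated as an upper bound rather than the actual file count.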
Each batch represents a single cell type: 24 in HUVEC, 11 in RPE, 11 in HepG2, and 5 in U2OS. Figure 7 shows the phenotype of a single siRNA in the four different cell types. For each image, the accompanying metadata provides the following information about the associated well: 1) its cell type, 2) its experiment, 3) its plate within the experiment, 4) its location on the plate, and 5) its siRNA. Since each of the 51 experiments was run in different batches, the images exhibit technical effects common to their batch and distinct from other batches; these batch effects are discussed further below.
When the images were originally created by Recursion, they were of size 2048 x 2048 x 6, but in order to make the dataset size more manageable, they were downsampled by a side-length factor of 2 and only the center 512 x 512 crop is provided.
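The size reduction described above can be sketched with NumPy as follows. The text does not say which resampling method was used, so the 2 x 2 block averaging here is an assumption, and `downsample_and_crop` is a hypothetical helper name. With the real sizes, a 2048 x 2048 x 6 image becomes 1024 x 1024 x 6 after downsampling, and the 512 x 512 center crop then covers the central quarter of the original field of view.

```python
import numpy as np


def downsample_and_crop(image: np.ndarray, factor: int = 2, crop: int = 512) -> np.ndarray:
    """Downsample an H x W x C image by block averaging, then center-crop."""
    h, w, c = image.shape
    if h % factor or w % factor:
        raise ValueError("height and width must be divisible by the factor")
    # Group pixels into factor x factor blocks and average each block
    # (assumed method; the source does not specify the resampling).
    small = image.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))
    if crop > min(small.shape[:2]):
        raise ValueError("crop is larger than the downsampled image")
    top = (small.shape[0] - crop) // 2
    left = (small.shape[1] - crop) // 2
    return small[top:top + crop, left:left + crop, :]


# Tiny stand-in for a 2048 x 2048 x 6 image so the example runs quickly.
demo = np.arange(8 * 8 * 6, dtype=np.float64).reshape(8, 8, 6)
out = downsample_and_crop(demo, factor=2, crop=2)
print(out.shape)
```

Calling `downsample_and_crop(image)` with the default arguments on a 2048 x 2048 x 6 array would return a 512 x 512 x 6 array, matching the image size provided in the dataset.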
As described above, each of the 51 experiments was executed in a different experimental batch. A batch is a set of experiment plates that are executed together, at the same time with the same materials. This means that all the plates within a batch are similar in their reagent synthesis, environmental conditions, etc., and plates from one batch differ from those from another batch in a consistent way. There are changes from batch to batch in environmental and experimental conditions that cause these effects. Examples of environmental conditions include humidity and temperature. Examples of experimental conditions include synthesis and concentration of reagents, as well as cell culture density. As seen in Figure 8, the batch effects are more visually salient than the relevant biological variation introduced by different siRNAs.
These batch effects are an inherent feature of experimentation and are unavoidably introduced into data collected across multiple batches. Any scientific conclusions drawn from such data should rely on the relevant biological variation in the data rather than on these incidental effects. A machine learning approach to separating batch effects from biological variation could be used widely in the field to extend the comparability of large image sets without a biologist needing to deconvolve the biological variation manually. Hence RxRx1 has the potential to spur innovation in models that overcome a problem that plagues the pharmaceutical industry.
One particular set of metadata descriptors worth discussing more fully is experiment, plate, well, and site (see Fig. 5). These describe information about the physical location of each image in terms of the data generation process. Every image is taken of a particular site of a cell culture well on a 384-well plate. These cell cultures are distributed across a 16x24 grid of wells on a plate, and there are 4 plates per experiment in the RxRx1 dataset. Each experiment (set of 4 plates) was run in a different batch than the other experiments in RxRx1, such that the experimental noise that occurs due to slightly different conditions in the lab will take on a different form for each experiment. These are the batch effects referenced above. But there can be additional noise within an experiment driven by both inter- and intra-plate effects. An inter-plate effect is any effect primarily driven by the plate assignment within a batch (differences between plates), and an intra-plate effect is any effect primarily driven by the well assignment within a plate (differences between wells, or locations, within the same plate). All three of these sources of experimental variation may prove important to properly model the RxRx1 data, and the dataset has been generated in such a way that there are very few instances where a perturbation will be in the same well twice.
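To make the plate geometry concrete, the sketch below maps a well label to its row and column on the 16x24 grid and flags wells on the outer edge. The label format (a row letter A-P followed by a column number, such as "B02") is an assumption based on common plate conventions and is not specified in this text; `parse_well` and `is_outer` are hypothetical helper names.

```python
import string

ROWS, COLS = 16, 24
ROW_LETTERS = string.ascii_uppercase[:ROWS]   # "A" through "P"


def parse_well(label: str) -> tuple[int, int]:
    """Convert a label such as 'B02' to zero-based (row, column) indices."""
    row = ROW_LETTERS.index(label[0].upper())  # ValueError for an unknown row
    col = int(label[1:]) - 1
    if not 0 <= col < COLS:
        raise ValueError(f"column out of range in well label {label!r}")
    return row, col


def is_outer(row: int, col: int) -> bool:
    """True for wells in the outermost rows or columns of the plate."""
    return row in (0, ROWS - 1) or col in (0, COLS - 1)


# Wells that are not on the edge of the plate.
inner_wells = [(r, c) for r in range(ROWS) for c in range(COLS) if not is_outer(r, c)]
print(parse_well("B02"), len(inner_wells))
```

Row and column indices obtained this way could serve as features when modeling intra-plate effects, since they encode where on the plate a well sits.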
In each experiment, the same 30 siRNAs appear on every plate as positive controls. In addition, there is one well per plate that is left untreated as a negative control. The 30 control siRNAs target 30 different genes and produce a variety of morphological effects. Together, these wells provide a set of reference controls on each plate.
Generalization is an obvious area of interest, as this dataset (like any biological dataset) contains non-random experimental effects that make generalization challenging. This dataset is well suited for tasks such as transfer learning (e.g., to a new cell type), domain adaptation (treating a new batch as a new target domain), and K-shot learning (a number of perturbations are present across every plate). While generalizability is important in every ML problem, it is of particular importance in working with biological datasets, as mentioned above.
Given the metadata associated with each image, the RxRx1 dataset provides a good opportunity for further research in context modeling. This could include using contexts such as cell types, plate and well assignments. The exploration of methods to use these contexts to enhance machine learning methods in their ability to represent the biological perturbations is an additional avenue of research with RxRx1.
While much research has been done in computer vision across many domains, this dataset is large and rich and presents a very different data distribution than is found in most publicly available imaging datasets. Some of these differences include the relative independence of many of the channels (unlike RGB images) and the fact that each example is one of a population of objects treated similarly as opposed to singletons. The RxRx1 dataset presents an opportunity for further fundamental research in computer vision techniques.
The task is to correctly classify the perturbation present in each image in a held out set of experiments that were run in batches different from the experiments in the training set. Thus, in order for the classifier to generalize well to unseen batches, it must learn to separate biological and technical factors and make predictions only on the biology of the perturbation.
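A validation split that mirrors this setup holds out entire experiments rather than individual images. The following is a minimal sketch of that idea, assuming each metadata record is a dictionary with an `experiment` key; the record layout, the experiment names, and the helper `split_by_experiment` are all hypothetical and do not describe the official train/test split.

```python
import random


def split_by_experiment(records, n_holdout, seed=0):
    """Hold out whole experiments so no batch appears on both sides."""
    experiments = sorted({r["experiment"] for r in records})
    if not 0 < n_holdout < len(experiments):
        raise ValueError("n_holdout must leave at least one training experiment")
    rng = random.Random(seed)
    holdout = set(rng.sample(experiments, n_holdout))
    train = [r for r in records if r["experiment"] not in holdout]
    test = [r for r in records if r["experiment"] in holdout]
    return train, test, holdout


# Invented metadata rows: 4 experiments with 3 images each.
records = [
    {"experiment": f"EXP-{e:02d}", "sirna": s}
    for e in range(1, 5)
    for s in range(3)
]
train, test, holdout = split_by_experiment(records, n_holdout=1)
print(len(train), len(test), sorted(holdout))
```

Splitting at the image level instead would let images from the same batch land in both sets, so a model could score well by exploiting batch effects rather than biology.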
The evaluation metric will be the siRNA classification accuracy averaged over images. This metric is useful as an overall measure of the goodness of the classifier since the training and hold-out sets are approximately balanced across the siRNA classes, and the metric improves with each correctly classified image. And because the hold-out experiments are from entirely different batches than the training experiments, classifiers will have to generalize well to unseen experimental batches in order to score well on accuracy. In addition, since we will not have a separate task for accuracy on individual cell types, results will improve as the classifiers learn to do well on each cell type.
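The metric itself is simple to compute. The sketch below shows accuracy averaged over images for a handful of invented labels and predictions; the function name `accuracy` and the example values are illustrative and not part of any official scoring code.

```python
def accuracy(y_true, y_pred):
    """Fraction of images whose predicted siRNA class matches the label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Invented siRNA class ids for four images; one prediction is wrong.
labels = [3, 17, 42, 42]
predictions = [3, 17, 8, 42]
score = accuracy(labels, predictions)
print(score)
```

Because every image carries equal weight, cell types with more held-out images contribute proportionally more to the overall score.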
This competition will be of interest to the rapidly growing community of researchers looking to apply machine learning methods to complex biological datasets, and especially those working on biological images. The specific task of removing experimental batch effects is highly relevant to the broader life sciences community and can provide insights that enable researchers to develop improved methods for working with other experimental datasets. The competition should also be of great interest to the larger community of machine learning researchers, since the image set is large, systematically produced, and useful in more general areas of machine learning research, as mentioned above.
The competition was held on Kaggle. Visit the Kaggle site to check out the leaderboard and forums.
Our bold ambition is no less than to create a map of human cellular biology. Along the way, we’re leveraging our work to find novel treatments for disease and then partnering with the world’s most successful development companies to get them to patients as quickly as possible.
We’ve already developed a massive database of biological images, each of which is relatable over time to all the others we produce. As the massive search space of biological perturbations, both genetic and otherwise, is filled with data from our image sets and analysis, we’ve started to understand and model the complex interactions that compounds have with various conditions.
Google Cloud is widely recognized as a global leader in delivering a secure, open and intelligent enterprise cloud platform. Our technology is built on Google’s private network and is the product of nearly 20 years of innovation in security, network architecture, collaboration, artificial intelligence and open source software. We offer a simply engineered set of tools and unparalleled technology across Google Cloud Platform and G Suite that help bring people, insights and ideas together. Customers across more than 150 countries trust Google Cloud to modernize their computing environment for today’s digital world.
You have the cloud and we have your back. For nearly a decade, we’ve been helping businesses build and scale cloud solutions with our world-class cloud engineering support.
We help our customers with technical support and consulting on building and operating complex large-scale distributed systems, developing better machine learning models and setting up big data solutions using Google Cloud, Amazon AWS and Microsoft Azure.
NVIDIA’s (NASDAQ: NVDA) invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing — with the GPU acting as the brain of computers, robots and self-driving cars that can perceive and understand the world. More information at http://nvidianews.nvidia.com.
Lambda provides Deep Learning workstations, servers, and GPU cloud services. Lambda Deep Learning infrastructure is used by the world's leading AI research & development organizations including Apple, Microsoft, MIT, Stanford, and the US Government. To learn more, visit www.lambdalabs.com.
The RxRx1 dataset is closely related to other datasets released by Recursion, although there are some key differences. For ease of comparison and understanding, we provide the following table highlighting the primary differences:
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
A CSV containing the experiment design, e.g. what cell type and treatment are in each well. The schema is provided in the README.
A large CSV file containing all of the deep learning embeddings for each image.
125,510 8-bit PNG 512x512x6 images. The directory structure is explained in the README.