Learning regulatory models for cell development from single cell transcriptomic data
Introduction
Advances in genomics technologies now enable high-throughput screening of many biomolecules within single cells. Perhaps the best developed techniques are those allowing measurement of gene expression levels in up to thousands of individual cells, enabling us to explore how cellular transcriptional states vary during e.g. developmental processes. These techniques are widely applied in stem cell and developmental biology research, where they are revolutionizing our understanding of the changes that occur as cells progress through development. Single cell data have, for example, enabled identification of rare cell types, provided insight into the subpopulation structure of developing cell populations, and challenged existing models of developmental hierarchies (see Refs. 1, 2, 3 for recent reviews).
Precisely controlled spatial and temporal patterns of gene expression accompany the differentiation of cells from multipotent progenitor states towards specialized cell lineages, as a multicellular organism develops from a single fertilized egg. Single cell transcriptomic data, i.e. cross-sectional data comprising ‘snapshots’ of mRNA expression levels generated using single cell RNA (scRNA) sequencing or quantitative PCR, provide unprecedented insights into individual cells and how they respond to environmental, developmental and physiological cues. However, analysing these data poses new computational and statistical challenges due to the technical and biological heterogeneity that characterises such data. Numerous methods specifically tailored for single cell data have been developed for pre-processing (including normalizing) and visualizing these high-dimensional data, characterising cell types and subpopulation structure, and detecting differentially expressed genes (reviewed in Refs. 4, 5, 6, 7, 8). Here, we focus on recent computational methods for analysing scRNA data that address the challenge of learning temporal dynamics from static measurements, allowing us to examine potential functional interactions between genes, and move towards developing mathematical models describing the gene regulatory mechanisms controlling cell development and differentiation.
Gene regulatory networks and models of cell development
Complex gene regulatory networks (GRNs) comprising activating and repressing interactions between transcription factors and their targets control the transcriptional state of cells. In a dynamical systems framework, such networks are viewed as regulating the probability of cells occupying different gene expression states. Stable ‘attractor’ states are associated with discrete cell types observed experimentally, and the potential landscape determines probable transition routes between states 9, 10, ∗11. The analogy of landscapes that dictate cellular developmental pathways has long been used as a conceptual framework for describing differentiation processes [12].
Single cell experiments effectively provide snapshots of these notional landscapes, enabling us to quantitatively assess the distributions in gene expression space of cell populations undergoing differentiation. While the landscape analogy implies smooth, continuous transitions between stable states, single cell data allow much more detailed examination of the moments when cells commit to certain lineages and have led to proposals that we should refine our descriptions of these key bifurcation or cell fate decision events. Rather than smooth transitions, these may be discontinuous, stochastic transition events driven by the dynamic nature of the landscape (which changes in response to GRN activity and extracellular signals) and reflected in the observed increased transcriptional heterogeneity at these points ∗11, 13, ∗14, 15. There is great interest in analysing single cell data to understand the transcriptional changes that occur as cells differentiate and the genes and regulatory mechanisms controlling these processes 1, 5, 6, 8, 16.
Defining cell types and subpopulation structure
Single cell transcriptomic data are high-dimensional, comprising information on up to thousands of genes and cells depending on the experimental protocol, so most analyses start by visualizing and exploring structure in these data using clustering and dimensionality reduction algorithms. The assumptions and inherent biases of different algorithms (e.g. the relative emphasis on preserving local versus global structure when reducing dimensions, or how to define similarities between cells or genes) affect our conclusions about structure and patterns in these data 6, 7, 17, 18; choices made during the preliminary steps will therefore influence any subsequent downstream analyses relying on these results (Figure 1).
Figure 1. Common workflows for analysing single cell data. A wide array of computational methods are available for analysing single cell transcriptomic data, with the complementary aims of characterising cell subpopulations and their gene expression patterns, identifying genetic drivers of transition events and inferring (mechanistic) models of gene interactions. Following pre-processing steps such as eliminating poor quality data, normalizing, and correcting technical errors (not depicted), dimension reduction and clustering help characterise cell subtypes, and may provide the initial steps for pseudotemporal ordering. Using clustering or pseudotemporal ordering, it is then possible to identify genes that are differentially expressed in different states, or more ambitiously, infer a gene regulatory network (the arrows indicate common, but not all possible, analysis workflows).
When studying cell differentiation, detecting genes showing variable expression across different developmental stages is a first step towards identifying putative GRN components. Clustering genes by expression profile similarity (or bi-clustering by both gene and cell similarities) can identify gene modules showing coordinated expression changes associated with developmental progression (e.g. Refs. 19, 20). Several approaches to detect differential expression of genes between cell subsets are specifically tailored to deal with the complexities of single cell data, e.g. by accounting for the prevalence of ‘dropouts’ (where gene expression is undetected in a given cell due to low mRNA capture rates) 21, 22, 23, 24, 25.
These approaches are helpful for many downstream analyses — particularly by revealing any subpopulation structure — but may only provide cursory mechanistic insights. Here, we focus on efforts to gain more insight into the precise dynamics and regulation of gene expression changes.
Inferring temporal progression through development
Longitudinal scRNA data cannot be collected straightforwardly (since cells are lysed for mRNA quantification) but, under certain assumptions, we can use samples of populations of cells undergoing differentiation to reconstruct probable developmental trajectories. The asynchronous behaviour of cells means that even when we trigger differentiation artificially, experimental sampling times will not necessarily reflect the extent of a cell's developmental progression. Instead, by assuming that transcriptional state reflects developmental stage, and that cells follow common trajectories, we can order cells according to similarities in expression state and infer their relative progression; individual cells are assigned a ‘pseudotime’ depending on their position in this inferred order. Many pseudotemporal ordering algorithms exist, with different capabilities in terms of the types of trajectories they can infer and requirements for prior knowledge (see Table 1) ∗26, 27, ∗28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39.
Table 1. Overview of single cell pseudotemporal ordering algorithms. Summary of algorithms developed to infer lineage hierarchies and temporal ordering of cells from single cell transcriptomic data. A brief overview of each method and its key features is provided; see the original references for more detail. N.B. This is not a comprehensive list of all algorithms, but aims to provide an overview of the most commonly used and recent algorithms, and illustrate the wide range of methods that have been applied to this type of problem. Abbreviations used in table: k-nearest neighbour graph (k-NNG), Gaussian process (GP), minimal spanning tree (MST), Latent Dirichlet Allocation (LDA).
Algorithm | Method summary | Key features | References |
---|---|---|---|
GPfates | First infers pseudotimes (and potentially reduces dimensions) using a GP latent variable model, using experimental capture times as prior information. Then identifies bifurcation using a nonparametric temporal mixture model. | Uses experimental cell capture times; provides uncertainty estimates; currently limited to a single bifurcation event. | Lönnberg 2017 [26] |
Monocle 2 | Selects genes showing differential expression between cell clusters. Uses reversed graph embedding to learn a mapping between high and low-dimensional space, and a spanning tree connecting cell clusters in low-dimensions. | Unsupervised method to select informative genes and branching structure; scalable for large datasets. | Qiu 2017 [27] |
scTDA | Uses a topological data analysis (TDA) algorithm to construct low-dimensional network representation, where nodes represent cell clusters, and edges connect nodes with cells in common. | Scalable; infers multi-branching lineages; does not enforce common differentiation trajectories, allows more complex topological structures. | Rizvi 2017 [28] |
Slingshot | Constructs MST between cell clusters to identify number and location of branch points. Infers pseudotemporal trajectories by fitting principal curves to each lineage (option for user to define lineage endpoints). | Infers multi-lineage structures; compatible with any upstream dimensionality reduction/clustering methods. | Street 2017 [29] |
Mpath | Hierarchical clustering identifies ‘landmark’ cell clusters representing different states. Constructs a network between landmarks, with edges weighted by the number of transitioning cells. | Infers multi-branching lineages; relies on observing transitioning cells (i.e. assumes continuum). | Chen 2016 [30] |
CellTree | Bayesian method to infer branching hierarchies and gene sets associated with developmental stages. Based on LDA – a model that assumes a mixture of unobserved ‘topics’ (gene sets) can explain the observed cell states. | Requires user-defined number of topics (provides heuristic guide); directly links gene expression patterns to cellular hierarchy. | duVerle 2016 [31] |
Diffusion pseudotime | Introduces a distance metric describing transition probabilities between any cell pair by considering random walks of all lengths between cells in gene expression space. Automatically detects branching points. | Scalable to large datasets; identifies branch points and metastable cell states; requires user-defined number of branches. | Haghverdi 2016 [32] |
TSCAN | Averages expression values in each cell for genes with similar expression profiles. Clusters cells in reduced dimensions by fitting mixture of multivariate normal distributions, and constructs MST linking clusters. | Uses clustering to improve robustness; (optionally) uses prior knowledge (e.g. no. of clusters or branches); allows multi-branching lineages. | Ji 2016 [33] |
SCOUP | Initial temporal ordering based on MST in reduced dimensions. Refines ordering by optimising a mixture Ornstein-Uhlenbeck (OU) process model (models variables moving towards an attractor state with Brownian motion). | Currently applicable to linear or bifurcating trajectories; identifies putative regulatory interactions through correlation analysis. | Matsumoto 2016 [34] |
deLorean | Uses GPs to model gene expression profiles and infer pseudotimes within a Bayesian inference framework, using experimental cell capture times as prior knowledge. | Uses experimental cell capture times; only infers linear trajectories; not scalable; provides uncertainty estimates. | Reid & Wernisch 2016 [35] |
Wishbone | Reduces dimensions using diffusion maps and constructs a k-NNG between cells. Initial cell ordering based on shortest-paths. Refines trajectory and identifies branching structure using randomly selected ‘waypoint’ cells. | Limited to single bifurcation; relies on gene ontology annotations to select informative diffusion components; scalable to large datasets. | Setty 2016 [36] |
SLICER | Selects genes varying systematically across cell population. Constructs k-NNG between cells in reduced dimensions. Infers pseudotimes and branching structure using geodesic distances and entropy respectively. | Unsupervised method to select informative genes and branching structure; allows multiple differentiation routes between two points. | Welch 2016 [37] |
Waterfall | Reduces dimensions before clustering cells using a k-means algorithm. Constructs MST to link cluster centres, and assigns cell pseudotimes by projection onto trajectory. | Uses cell clustering to improve robustness; linear trajectories only. | Shin 2015 [38] |
SCUBA | Fits a smooth curve in reduced dimensions using principal curve analysis. Divides cells into temporal clusters, before iteratively clustering cells at each time and mapping between different times to infer hierarchical structure. | Automatically infers trajectory endpoints and number of lineages; can detect multiple branching events. | Marco 2014 [39] |
This re-ordering provides candidate temporal trajectories for each gene that, if correct, give a clearer view of developmental gene expression dynamics. This can show the relative timing of expression changes, reveal sets of genes with coordinated dynamics, identify gene expression signatures of specific developmental lineages, or indicate cellular processes that change systematically during development ∗26, 27, ∗28, 30, 31, 32, 40, 41, 42. We can also study the transcriptional changes associated with bifurcation or cell fate decision events, enabling us to identify putative regulators of specific transitions from gene expression changes that accompany or immediately precede such events ∗26, ∗28, 30, 40. Determining the number and location of these bifurcation events remains a challenging problem ∗26, ∗28, 29, 30, 32, 36, 37. Comparing gene expression dynamics may of course indicate the directionality of any regulatory interactions and, in a few cases, pseudotemporal trajectories have been used to infer dynamical models of GRNs 43, 44.
Pseudotemporal ordering algorithms have generated much interest and provided insight into developmental processes. However, as with all inference and modelling, we should bear in mind the limitations and underlying assumptions: many algorithms rely on initial clustering and/or dimensionality reduction steps, and some require or can incorporate prior knowledge (e.g. number of lineages or experimental sampling times); these methods and choices will influence our results. Comparisons demonstrate that inferred trajectories can differ substantially between algorithms due to e.g. different susceptibilities to noise and data sparsity ∗26, 27, ∗28, 30. While such algorithm-dependent influences can be assessed through quantitative comparisons, we should always consider whether the assumptions are appropriate for the particular system under study. While initial analyses such as dimensionality reduction often appear to depict cells undergoing smooth, continuous changes in transcriptional state, the homogeneity, parsimony and irreversibility of developmental transitions are in fact assumptions that may or may not accurately reflect the true biological processes. In addition there may be a host of other factors — apart from development — that affect changing gene expression patterns ∗11, 16, ∗28.
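To make the core idea of pseudotemporal ordering concrete, the following minimal sketch orders cells along the first principal component of their expression profiles. This is only the simplest linear-trajectory stand-in and is not an implementation of any algorithm in Table 1; the function name `pca_pseudotime` and the simulated data (hidden time, per-gene slopes, noise level) are assumptions made purely for illustration.

```python
import numpy as np

def pca_pseudotime(expr):
    """Assign each cell a pseudotime in [0, 1] from its first principal component score.

    expr: array of shape (n_cells, n_genes), e.g. log-normalised expression.
    The direction of the ordering is arbitrary (PC signs are not identifiable).
    """
    centred = expr - expr.mean(axis=0)
    # Left singular vectors scaled by singular values are the PC scores.
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    pc1 = u[:, 0] * s[0]
    return (pc1 - pc1.min()) / (pc1.max() - pc1.min())

# Simulated data (an assumption for illustration): 200 cells, 30 genes whose
# expression changes linearly with a hidden developmental time, plus noise.
rng = np.random.default_rng(0)
true_time = rng.uniform(0, 1, size=200)
gene_slopes = rng.normal(0, 2, size=30)
expr = np.outer(true_time, gene_slopes) + rng.normal(0, 0.3, size=(200, 30))

pseudotime = pca_pseudotime(expr)

# Rank agreement with the hidden time; absolute value because direction is arbitrary.
ranks_true = np.argsort(np.argsort(true_time))
ranks_inferred = np.argsort(np.argsort(pseudotime))
agreement = abs(np.corrcoef(ranks_true, ranks_inferred)[0, 1])
print(f"|Spearman correlation| with simulated time: {agreement:.3f}")
```

The sketch recovers the ordering here only because the simulated cells satisfy the assumptions discussed above (a single shared, continuous trajectory on which expression state tracks progression); branching, cell cycle effects or strong dropout would require the more elaborate methods listed in Table 1.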
Developing gene regulatory network models
While approaches used for network inference from bulk transcriptomic data have been directly applied to single-cell data, newer dedicated methods have also been developed. Despite the challenges posed by technical noise, single cell data offer several potential advantages for inferring regulatory relationships: sample sizes are larger; inherent biological heterogeneity provides the variability necessary to infer relationships without needing perturbation experiments; and the ability to visualise subpopulation structure helps avoid the confounding effects of analysing mixed populations of cell types 3, 8, 9, 16.
A common (and simple) approach is to calculate pairwise correlations between gene expression states, generating an undirected network indicating statistical relationships between genes that are interpreted as putative (co-)regulatory interactions. While this method has successfully identified regulatory relationships from single cell developmental data (e.g. Refs. 45, 46, 47, 48), it only detects linear (or monotonic) relationships and thus may overlook many biological interactions. Information theory provides alternative measures, e.g. mutual information, that can capture more complex non-linear statistical dependencies between variables and are widely applied for GRN inference from bulk data [49]. Calculating information measures typically requires estimating joint probability distributions from experimental data so these methods benefit hugely from the larger sample sizes afforded by new single cell technologies, particularly when using measures that quantify relationships between three or more variables 49, 50. Both pairwise and higher-order information theoretical measures have been successfully integrated into network inference algorithms to infer regulatory interactions from single cell data ∗51, ∗52. Without making further assumptions (e.g. temporal ordering) or integrating other data (e.g. transcription factor binding) these statistical models (whether correlation or information theoretic based) do not indicate the directionality of interactions.
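The contrast between correlation and mutual information can be illustrated with a small sketch. The example below uses a simple histogram-based ('plug-in') estimate of mutual information, which is an assumption chosen for brevity and not the estimator used by any of the cited methods; the gene names, bin count and simulated values are likewise hypothetical.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Plug-in estimate of mutual information (in bits) from a 2D histogram.

    Discretises both variables into equal-width bins; the estimate is
    biased upwards for small samples.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    independent = p_x @ p_y                 # product of marginals
    nonzero = p_xy > 0                      # skip empty cells (0 * log 0 = 0)
    return float(np.sum(p_xy[nonzero] * np.log2(p_xy[nonzero] / independent[nonzero])))

# Toy centred expression values for three hypothetical genes across 2000 cells.
rng = np.random.default_rng(1)
n_cells = 2000
gene_x = rng.uniform(-1, 1, size=n_cells)
gene_y = gene_x**2 + rng.normal(0, 0.05, size=n_cells)  # non-monotonic dependence on gene_x
gene_z = rng.normal(0, 1, size=n_cells)                 # independent of gene_x

pearson_xy = float(np.corrcoef(gene_x, gene_y)[0, 1])
mi_xy = mutual_information(gene_x, gene_y)
mi_xz = mutual_information(gene_x, gene_z)
print(f"Pearson(x, y) = {pearson_xy:.3f}, MI(x, y) = {mi_xy:.3f} bits, MI(x, z) = {mi_xz:.3f} bits")
```

In this toy case the Pearson correlation between `gene_x` and `gene_y` is close to zero even though the two are strongly dependent, whereas the mutual information separates the dependent pair from the independent one. Neither score indicates which gene regulates the other.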
Other methods aim to infer mathematical models of GRNs that represent the mechanistic nature and directionality of interactions, and allow system dynamics to be studied using Boolean or ordinary differential equation (ODE) models (see Table 2) 43, 44, ∗51, ∗52, 53, 54, 55, 56, 57. Boolean models rely on discretized data, which may increase their robustness to noise and improve computational efficiency; compared with ODE models, they also make fewer assumptions about the nature of interactions and avoid the need to infer many parameters. However, Boolean discretization inherently results in some data loss and may be overly simplistic, and the method chosen to learn model structure may require certain dataset features: for example, one of the earliest algorithms successfully applied to single cell data constructs a state-transition graph from binarised cell expression states and thus requires large numbers (e.g. thousands) of cells [56]. Several ODE-based models have been inferred from single cell data using differing assumptions about the nature of relationships, but all rely on temporal gene expression data (either inferred pseudotemporal orderings or experimental sampling times) and the assumptions associated with these 43, 44, 53, 57. Temporal assumptions, and assumptions of irreversibility in particular, have strong implications for ODE-based analyses because they directly inform the directionality of inferred relationships. So far, mechanistic models inferred from single-cell data tend to be limited to smaller GRNs (comprising tens of genes) than the statistical models outlined previously (which can easily scale to hundreds of genes), but they do provide the capability to simulate GRN dynamics and thus allow predictions of system behaviour under different scenarios.
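As an illustration of what simulating a Boolean GRN model involves, the sketch below enumerates the asynchronous state transition graph of a toy three-gene network and finds its stable states. The update rules (mutual repression between genes A and B, with C activated by A and repressed by B) are hypothetical and are not taken from any of the cited studies; the sketch shows forward simulation of a given model only, not the inference of rules from data.

```python
from itertools import product

# Hypothetical update rules for a toy three-gene network (illustration only).
RULES = {
    "A": lambda s: not s["B"],             # A is repressed by B
    "B": lambda s: not s["A"],             # B is repressed by A
    "C": lambda s: s["A"] and not s["B"],  # C needs A present and B absent
}
GENES = tuple(RULES)

def async_successors(state):
    """Return the states reachable by updating exactly one gene (asynchronous scheme)."""
    current = dict(zip(GENES, state))
    successors = []
    for i, gene in enumerate(GENES):
        new_value = int(bool(RULES[gene](current)))
        if new_value != state[i]:
            successors.append(state[:i] + (new_value,) + state[i + 1:])
    return successors

# State transition graph over all 2^3 binarised expression states.
state_transition_graph = {
    state: async_successors(state) for state in product((0, 1), repeat=len(GENES))
}

# Fixed points (states with no outgoing transitions) play the role of stable attractors.
fixed_points = sorted(s for s, nxt in state_transition_graph.items() if not nxt)
print("Stable states (A, B, C):", fixed_points)
```

Even this small example shows why exhaustive approaches scale poorly: the number of states doubles with every gene added, which is one reason mechanistic models tend to be restricted to small networks.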
Table 2. Overview of single cell network/model inference algorithms. Summary of the different types of statistical and modelling approaches that have recently been applied to single cell transcriptomic data in order to infer mathematical models of putative gene regulatory mechanisms. An overview of each class of method is given, with a few examples described in greater detail; some of the relative advantages and disadvantages of these approaches are noted. Abbreviations used in table: transcription factor (TF), ordinary differential equation (ODE).
Model | Method summary | Notes |
---|---|---|
Correlation/relevance networks | ||
Overview | Undirected edges connect pairs of genes exhibiting co-ordinated expression. Gene pairs are ranked by Pearson (or Spearman) correlation; positive or negative values indicate activation or repression respectively. | Simple to interpret and fast to calculate. Limited to detecting linear (or monotonic) relationships. |
Information theory | ||
Overview | Undirected edges connect pairs of genes showing statistically dependent expression profiles. Dependence quantified using information theoretic measures, often mutual information or a more complex variant. | Relatively computationally efficient. Can detect non-linear dependencies thus avoids assumptions about the nature of interactions. Often requires discretising data, which may reduce noise, but may lose information. Perform best with large sample sizes. |
PIDC [51] | Exploits large sample sizes to estimate a three-variable information measure; aims to distinguish direct interactions by decomposing mutual information between a pair of genes into contributions that are unique to that pair or shared with other genes. | Avoids several common assumptions regarding state space and temporal progression. An information-based approach that is applicable to networks of hundreds of genes. |
MAGIC [52] | Imputes missing expression values using a diffusion process through cells, then estimates conditional densities for pairs of genes using a k-nearest neighbour approach. Calculates mutual information from conditional densities to score putative interactions. | Alleviates influence of dropouts and aims to recover information from sparsely populated regions of expression space. Assumes data inherently low-dimensional, and that signal overcomes noise in sparse regions. |
ODE model inference | ||
Overview | ODEs represent the regulatory interactions controlling the expression of each gene; algorithms aim to infer parameters for these ODEs in a network where directed edges connect transcription factors to their targets. | Infer detailed mechanistic networks capturing direction and strength of regulation; provide information on system dynamics. May be computationally complex. Assume specific mathematical forms for interactions, and often rely on inferred temporal trajectories. |
Jang et al., 2017 [53] | First identifies discrete cell states, transition lineages, and key gene modules. Creates a step-function based model of gene module interactions; estimates parameter probability distributions using linear programming with observed cell states determining constraints. | Coarse-resolution network connects modules of genes with similar expression profiles. Uses binarised data which may reduce noise or may lose information. |
SCODE [44] | Infers linear ODE-model of regulatory interactions between TFs from pseudotemporal trajectories. Develops an efficient parameter estimation algorithm that relies on linear regression and a lower-dimensional transformation of the data. | Assumes linear relationships. Computationally efficient, compared to similar approaches. |
Ocone et al., 2015 [43] | Modular approach combines dimension reduction, pseudotemporal ordering and network inference. Infers initial coarse network using random forest and correlation methods, then infers a Hill-function ODE model using Bayesian model selection and parameter inference. | Modularity allows substitution of different algorithms. Multiple steps introduce multiple sets of assumptions. Limited to small models. |
Boolean model inference | ||
Overview | Binarised gene expression levels are governed by update functions that describe the combinatorial action of regulating genes in terms of Boolean logic rules. A state transition graph comprises the possible cell states arising from the governing Boolean network. | Infer detailed mechanistic networks capturing direction of interactions and combinatorial regulation; provide information on system dynamics. Binarising data may reduce sensitivity to noise or may lose information. |
BTR [54] | Infers an asynchronous Boolean model, by iterative optimisation starting from an initial model (random or prior knowledge) using a swarming hill climbing strategy. Scores proposed models by comparing the experimental data and model state spaces. | Avoids making assumptions about temporal progression. Search strategy optimised for local searches, so performs best when informed by prior knowledge of regulatory interactions. |
SingCellNet [55] | Infers an asynchronous probabilistic Boolean network. Uses genetic algorithms to optimise network topology and probabilities of Boolean update rules, based on consistency with known cell lineage hierarchies; update rules based on prior knowledge. | Limited to small networks. Assumes knowledge of cell lineage hierarchy and putative update rules. |
SCNS [56] | Creates state transition graph from observed binarised transcriptional states of cells; initial/final cell states defined using prior knowledge. Infers Boolean update functions for each gene that are consistent with the observed transitions. | Assumes observed cell states represent state space of Boolean network. Requires large datasets (thousands of cells) to construct a connected state transition graph. |
Linear regression | ||
SINCERITIES [57] | Infers directed network of TF-target gene interactions using linear regression. Assumes change in TF expression distribution causes proportional change in target gene distribution at subsequent time point. Distinguishes activation and repression using partial correlation. | Relies on experimental sampling times (i.e. cross-sectional time series data). Assumes linear relationships between changes at consecutive times. |
All these statistical or mechanistic models of GRNs provide a set of putative functional interactions between genes (or modules of co-varying genes). While we of course aim to develop methods that provide the most reliable inference results, these networks should be viewed as hypotheses about the underlying regulatory mechanisms. These can guide further investigation and experiments, and allow us to test our current understanding but, like any models, they should be continually refined and improved as new information emerges. These models rely on some key assumptions: firstly, that mRNA expression levels are indicative of the corresponding protein levels (thus ignoring potential post-transcriptional influences), but also that differentiation is the dominant process driving the observed gene expression dynamics. Our choice of cells and genes to include in our analyses is critical to ensure this latter assumption is appropriate.
There are of course technical limits to what we can learn about a biological system by experimentally observing the system state. Some interactions will not be inferable, e.g. if they do not drive observable expression changes, or these changes are too transient to detect associations between the corresponding genes. It can be difficult to distinguish certain regulatory topologies, such as indirect versus direct regulation, depending on the inference method. Finally, while GRNs are sometimes defined as the complete collection of possible gene regulatory interactions within a given cell, we can of course only hope to infer the subset of interactions active under our specific experimental conditions. We expect to infer distinct networks using different cell subsets, depending on the variability present in the selected cell population, and thus should carefully choose which data to analyse 45, ∗51, 58.
Combining computational analyses
In general, we should use several of the approaches outlined above to gain insights into the regulatory mechanisms driving cell differentiation: they provide complementary information, and using them in combination can help redress some of the limitations and biases inherent to each method.
A preliminary descriptive analysis, using e.g. clustering, dimensionality reduction, and bioinformatics annotation of differentially expressed genes, can provide important information about any subpopulation structure within the data. This helps us to choose cell subsets to analyse that will be most informative about the biological process of interest — e.g. those where we believe that differentiation is the dominant driver of transcriptional variation. When detecting statistical relationships between genes for GRN inference, we might select cells undergoing a specific developmental transition as some statistical dependencies may be masked within more complex datasets comprising cells in multiple developmental lineages. Clustering or pseudotemporal ordering can help to identify subsets of non-responsive cells to exclude from subsequent analyses. Cells are likely to be simultaneously affected by multiple biological processes, so we may aim to account for any potential confounding factors, e.g. large-scale transcriptional changes associated with cell cycle stage may mask the variability linked to differentiation [59].
Although scRNA-sequencing provides information about thousands of genes, analysis is greatly aided by careful pre-processing and biologically guided selection of relevant genes. Basic filtering metrics allow us to remove genes expressed at very low levels (and therefore dominated by technical noise) or those showing little variation in expression. Clustering and pseudotemporal ordering can help us select genes associated with the biological process of interest, or identify gene modules demonstrating similar dynamics. Removing non-informative genes (or cells) aids all downstream analyses, but particularly those that seek to develop mechanistic – ODE and Boolean network – models.
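The basic filtering step described above might look like the following sketch, which keeps genes whose mean expression and variance across cells both exceed a threshold. The function name `filter_genes` and the threshold values are assumptions for illustration; appropriate cut-offs depend on the platform, normalisation and biological system.

```python
import numpy as np

def filter_genes(expr, min_mean=0.5, min_variance=0.1):
    """Return a boolean mask over genes (columns) that pass both filters.

    expr: array of shape (n_cells, n_genes) of normalised expression values.
    min_mean and min_variance are hypothetical thresholds.
    """
    mean = expr.mean(axis=0)
    variance = expr.var(axis=0)
    return (mean >= min_mean) & (variance >= min_variance)

# Toy matrix: 4 cells x 3 genes.
expr = np.array([
    [0.0, 5.0, 0.0],
    [0.0, 5.0, 3.0],
    [0.1, 5.0, 6.0],
    [0.0, 5.0, 9.0],
])
# Column 0: barely expressed; column 1: expressed but constant; column 2: variable.
mask = filter_genes(expr)
filtered = expr[:, mask]
print("Genes kept:", np.flatnonzero(mask).tolist())
```

Filters of this kind are deliberately crude, so they complement rather than replace the biologically guided gene selection discussed above.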
Ideally, we should also consider the limitations and assumptions of each of the methods we include in our analysis, and aim to explore (at least to some extent) how our algorithm choices influence our conclusions. Many of the algorithms applied to single cell data do not allow us to quantify uncertainties in the outputs of earlier stages of analyses (e.g. clustering, dimensionality reduction, or inferred temporal orderings), but conclusions from any downstream analyses (e.g. inferred regulatory networks and models) are necessarily conditional on the accuracy of these initial results. In the absence of reliable methods to propagate uncertainties through the different stages of analysis, perhaps a pragmatic solution is to verify whether our conclusions are robust to some variation in the methods selected during earlier steps – e.g. we could explore how much inferred temporal orderings or network models vary when we subsample our data or use different dimensionality reduction methods.
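One way to carry out the subsampling check suggested above is sketched below: the ordering inferred from all cells is compared with orderings inferred from random subsets, using the rank correlation over the shared cells. The helper names (`subsample_stability`, `pc1_order`), the 80% subsampling fraction and the simulated data are assumptions for illustration, and `pc1_order` is a deliberately simple placeholder for whichever ordering algorithm is actually being assessed.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation for tie-free vectors (Pearson correlation of ranks)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

def subsample_stability(expr, order_fn, n_repeats=20, frac=0.8, seed=0):
    """Compare the full-data ordering with orderings inferred from random cell subsets.

    order_fn maps an (n_cells, n_genes) array to one score per cell.
    Returns one absolute rank correlation per repeat.
    """
    rng = np.random.default_rng(seed)
    full_order = order_fn(expr)
    n_cells = expr.shape[0]
    scores = []
    for _ in range(n_repeats):
        idx = rng.choice(n_cells, size=int(frac * n_cells), replace=False)
        sub_order = order_fn(expr[idx])
        # Absolute value because the direction of an ordering is arbitrary.
        scores.append(abs(spearman(full_order[idx], sub_order)))
    return np.array(scores)

def pc1_order(expr):
    """Placeholder ordering method: score cells by their first principal component."""
    centred = expr - expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, 0] * s[0]

# Simulated data (an assumption): expression changes linearly with a hidden time.
rng = np.random.default_rng(2)
hidden_time = rng.uniform(0, 1, size=200)
expr = np.outer(hidden_time, rng.normal(0, 2, size=30)) + rng.normal(0, 0.3, size=(200, 30))

scores = subsample_stability(expr, pc1_order)
print(f"Rank correlation across subsamples: min {scores.min():.3f}, mean {scores.mean():.3f}")
```

Low or highly variable scores would suggest that the inferred ordering, and therefore any network model built on it, is sensitive to which cells happened to be sampled. The same loop can be reused to compare different dimensionality reduction or ordering methods by swapping `order_fn`.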
Conclusions
The biological questions we seek to address using scRNA data are complex. We should carefully consider how to analyse such data — ideally prior to data collection to ensure suitable experimental design — and make optimal use of them by incorporating multiple analytical approaches. Particularly while these technologies and methods are relatively new, we should explore and compare different analytical frameworks, and continue to elaborate them. Flexible, open-source software implementations are essential to allow such comparisons and ensure algorithms are easy to adapt and integrate with other complementary methods.
To gain a more comprehensive picture of the regulatory mechanisms controlling differentiation, we need to incorporate other sources of information. We can design experiments to test and refine our putative hypotheses, and to verify that conclusions drawn from in vitro data correspond to in vivo observations. We should develop ways of integrating other types of data into our analyses, such as incorporating information on transcription factor binding or chromatin accessibility when inferring GRN models. Genomics technologies are being adapted to measure multiple characteristics (e.g. chromatin accessibility and methylation) at single cell level with recent success at quantifying several features within the same cells 2, 16. These datasets will provide much richer information about the underlying biological processes and will demand dedicated computational and statistical methods to combine information from heterogeneous data types.
We need to develop effective ways to integrate and compare data generated from independent experiments on similar biological systems, to ensure the conclusions we draw are robust and biologically meaningful. As experimental technologies advance we expect to see improved performance of many methods: for example, pseudotemporal ordering and network inference methods should benefit from increasing sample sizes and become more robust to noise. Larger datasets will offer more comprehensive sampling of different cell states, providing better resolution of sparsely populated regions of gene expression space (e.g. during rapid state transitions). Finally, existing approaches that require large sample sizes for model inference [56, 60] or data imputation [52] will become feasible to apply more widely.
Acknowledgements
ACB gratefully acknowledges support through a BBSRC Future Leaders Fellowship (Grant reference BB/N011597/1). TEC is funded through a BBSRC DTP PhD studentship.
References
- 1. V. Moignard, B. Göttgens. Dissecting stem cell differentiation using single cell expression profiling. Curr Opin Cell Biol, 43 (2016), pp. 78-86, 10.1016/j.ceb.2016.08.005
- 2. L. Wen, F. Tang. Single-cell sequencing in stem cell biology. Genome Biol, 17 (2016), pp. 1-12, 10.1186/s13059-016-0941-0
- 3. P. Kumar, Y. Tan, P. Cahan. Understanding development and stem cells using single cell-based analyses of gene expression. Development, 144 (2017), pp. 17-32, 10.1242/dev.133058
- 4. D. Grün, A. van Oudenaarden. Design and analysis of single-cell sequencing experiments. Cell, 163 (2015), pp. 799-810, 10.1016/j.cell.2015.10.039
- 5. O. Stegle, S.A. Teichmann, J.C. Marioni. Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet, 16 (2015), pp. 133-145, 10.1038/nrg3833
- 6. S. Woodhouse, V. Moignard, B. Göttgens, J. Fisher. Processing, visualising and reconstructing network models from single-cell data. Immunol Cell Biol, 94 (2016), pp. 256-265, 10.1038/icb.2015.102
- 7. R. Bacher, C. Kendziorski. Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biol, 17 (2016), pp. 1-14, 10.1186/s13059-016-0927-y
- 8. A. Wagner, A. Regev, N. Yosef. Revealing the vectors of cellular identity with single-cell genomics. Nat Biotechnol, 34 (2016), pp. 1145-1160, 10.1038/nbt.3711
- 9. C. Trapnell. Defining cell types and states with single-cell genomics. Genome Res, 25 (2015), pp. 1491-1498, 10.1101/gr.190595.115
- 10. C. Marr, J.X. Zhou, S. Huang. Single-cell gene expression profiling and cell state dynamics: collecting data, correlating data points and connecting the dots. Curr Opin Biotechnol, 39 (2016), pp. 207-214, 10.1016/j.copbio.2016.04.015
- ∗11. N. Moris, C. Pina, A.M. Arias. Transition states and cell fate decisions in epigenetic landscapes. Nat Rev Genet, 17 (2016), pp. 693-703, 10.1038/nrg.2016.98
  A thorough review of how scRNA data is shaped by the epigenetic landscape, and how it can help us elucidate mechanisms.
- 12. C.H. Waddington. Canalization of development and the inheritance of acquired characters. Nature, 150 (1942), pp. 563-565
- 13. P. Rue, A. Martinez Arias. Cell dynamics and gene expression control in tissue homeostasis and development. Mol Syst Biol, 11 (2015), p. 792, 10.15252/msb.20145549
- ∗14. M. Mojtahedi, A. Skupin, J. Zhou, I.G. Castaño, R.Y.Y. Leong-Quong, H. Chang, et al. Cell fate decision as high-dimensional critical state transition. PLoS Biol, 14 (2016), Article e2000640, 10.1371/journal.pbio.2000640
  This study highlights how theoretical concepts can be used in the analysis of scRNA data on developmental processes.
- 15. A. Richard, L. Boullu, U. Herbach, A. Bonnafoux, V. Morin, E. Vallin, et al. Single-cell-based analysis highlights a surge in cell-to-cell molecular variability preceding irreversible commitment in a differentiation process. PLoS Biol, 14 (2016), Article e1002585, 10.1371/journal.pbio.1002585
- 16. A. Tanay, A. Regev. Scaling single-cell genomics from phenomenology to mechanism. Nature, 541 (2017), pp. 331-338, 10.1038/nature21350
- 17. T. Ronan, Z. Qi, K.M. Naegle. Avoiding common pitfalls when clustering biological data. Sci Signal, 9 (2016), p. re6, 10.1126/scisignal.aad1932
- 18. L. Haghverdi, F. Buettner, F.J. Theis. Diffusion maps for high-dimensional single-cell analysis of differentiation data. Bioinformatics, 31 (2015), pp. 2989-2998, 10.1093/bioinformatics/btv325
- 19. A. Olsson, M. Venkatasubramanian, V.K. Chaudhri, B.J. Aronow, N. Salomonis, H. Singh, et al. Single-cell analysis of mixed-lineage states leading to a binary cell fate choice. Nature, 537 (2016), pp. 698-702, 10.1038/nature19348
- 20. B. Treutlein, D.G. Brownfield, A.R. Wu, N.F. Neff, G.L. Mantalas, F.H. Espinoza, et al. Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature, 509 (2014), pp. 371-375, 10.1038/nature13173
- 21. P.V. Kharchenko, L. Silberstein, D.T. Scadden. Bayesian approach to single-cell differential expression analysis. Nat Meth, 11 (2014), pp. 740-742, 10.1038/nmeth.2967
- 22. K.D. Korthauer, L.-F. Chu, M.A. Newton, Y. Li, J. Thomson, R. Stewart, et al. A statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome Biol (2016), pp. 1-15, 10.1186/s13059-016-1077-y
- 23. J. Fan, N. Salathia, R. Liu, G.E. Kaeser, Y.C. Yung, J.L. Herman, et al. Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat Meth, 13 (2016), pp. 241-244, 10.1038/nmeth.3734
- 24. G. Finak, A. McDavid, M. Yajima, J. Deng, V. Gersuk, A.K. Shalek, et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol (2015), pp. 1-13, 10.1186/s13059-015-0844-5
- 25. C. Vallejos. Beyond comparisons of means: understanding changes in gene expression at the single-cell level. Genome Biol (2016), pp. 1-14, 10.1186/s13059-016-0930-3
- ∗26. T. Lönnberg, V. Svensson, K.R. James. Single-cell RNA-seq and computational analysis using temporal mixture modelling resolves Th1/Tfh fate bifurcation in malaria. Science, 2 (2017), Article eaal2192, 10.1126/sciimmunol.aal2192
  This paper illustrates how a combination of experimental and computational approaches, including development of a new pseudotemporal ordering algorithm, can be used to characterise the transcriptional changes accompanying cell fate decisions and differentiation.
- 27. X. Qiu, Q. Mao, Y. Tang, L. Wang, R. Chawla, H. Pliner, et al. Reversed graph embedding resolves complex single-cell developmental trajectories. bioRxiv (2017), 10.1101/110668
- ∗28. A.H. Rizvi, P.G. Camara, E.K. Kandror, T.J. Roberts, I. Schieren, T. Maniatis, et al. Single-cell topological RNA-seq analysis reveals insights into cellular differentiation and development. Nat Biotechnol, 35 (2017), pp. 551-560, 10.1038/nbt.3854
  This paper introduces a powerful topology-based computational method for unsupervised pseudotemporal ordering that makes fewer assumptions about the nature of developmental trajectories than many existing algorithms.
- 29. K. Street, D. Risso, R.B. Fletcher, D. Das, J. Ngai, N. Yosef, et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. bioRxiv (2017), pp. 1-21, 10.1101/128843
- 30. J. Chen, A. Schlitzer, S. Chakarov, F. Ginhoux, M. Poidinger. Mpath maps multi-branching single-cell trajectories revealing progenitor cell progression during development. Nat Commun, 7 (2016), Article 11988, 10.1038/ncomms11988
- 31. D.A. duVerle, S. Yotsukura, S. Nomura, H. Aburatani, K. Tsuda. CellTree: an R/bioconductor package to infer the hierarchical structure of cell populations from single-cell RNA-seq data. BMC Bioinf, 17 (2016), p. 363, 10.1186/s12859-016-1175-6
- 32. L. Haghverdi, M. Buttner, F.A. Wolf, F. Buettner, F.J. Theis. Diffusion pseudotime robustly reconstructs lineage branching. Nat Meth (2016), pp. 1-6, 10.1038/nmeth.3971
- 33. Z. Ji, H. Ji. TSCAN: pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res, 44 (2016), p. e117, 10.1093/nar/gkw430
- 34. H. Matsumoto, H. Kiryu. SCOUP: a probabilistic model based on the Ornstein-Uhlenbeck process to analyze single-cell expression data during differentiation. BMC Bioinf, 17 (2016), p. 232, 10.1186/s12859-016-1109-3
- 35. J.E. Reid, L. Wernisch. Pseudotime estimation: deconfounding single cell time series. Bioinformatics, 32 (2016), pp. 2973-2980, 10.1093/bioinformatics/btw372
- 36. M. Setty, M.D. Tadmor, S. Reich-Zeliger, O. Angel, T.M. Salame, P. Kathail, et al. Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat Biotechnol, 34 (2016), pp. 637-645, 10.1038/nbt.3569
- 37. J. Welch, A. Hartemink, J.F. Prins. SLICER: inferring branched, nonlinear cellular trajectories from single cell RNA-seq data. Genome Biol, 17 (2016), p. 106, 10.1186/s13059-016-0975-3
- 38. J. Shin, D.A. Berg, Y. Zhu, J.Y. Shin, J. Song, M.A. Bonaguidi, et al. Single-cell RNA-seq with waterfall reveals molecular cascades underlying adult neurogenesis. Stem Cell, 17 (2015), pp. 360-372, 10.1016/j.stem.2015.07.013
- 39. E. Marco, R.L. Karp, G. Guo, P. Robson, A.H. Hart, L. Trippa, et al. Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape. Proc Natl Acad Sci, 111 (2014), pp. E5643-E5650, 10.1073/pnas.1408993111
- 40. D. Cacchiarelli, X. Qiu, S. Srivatsan, M. Ziller, E. Overbey, J. Grimsby, et al. Aligning single-cell developmental and reprogramming trajectories identifies molecular determinants of reprogramming outcome. bioRxiv (2017), 10.1101/122531
- 41. X. Qiu, A. Hill, J. Packer, D. Lin, Y.-A. Ma, C. Trapnell. Single-cell mRNA quantification and differential analysis with census. Nat Meth, 14 (2017), pp. 309-315, 10.1038/nmeth.4150
- 42. C. Trapnell, D. Cacchiarelli, J. Grimsby, P. Pokharel, S. Li, M. Morse, et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol, 32 (2014), pp. 381-386, 10.1038/nbt.2859
- 43. A. Ocone, L. Haghverdi, N.S. Mueller, F.J. Theis. Reconstructing gene regulatory dynamics from high-dimensional single-cell snapshot data. Bioinformatics, 31 (2015), pp. i89-96, 10.1093/bioinformatics/btv257
- 44. H. Matsumoto, H. Kiryu, C. Furusawa, M.S.H. Ko, S.B.H. Ko, N. Gouda, et al. SCODE: an efficient regulatory network inference algorithm from single-cell RNA-Seq during differentiation. Bioinformatics, 33 (2017), pp. 2314-2321, 10.1093/bioinformatics/btx194
- 45. V. Moignard, I.C. Macaulay, G. Swiers, F. Buettner, J. Schütte, F.J. Calero-Nieto, et al. Characterization of transcriptional networks in blood stem and progenitor cells using high-throughput single-cell gene expression analysis. Nat Cell Biol, 15 (2013), pp. 363-372, 10.1038/ncb2709
- 46. C. Pina, J. Teles, C. Fugazza, G. May, D. Wang, Y. Guo, et al. Single-cell network analysis identifies DDIT3 as a nodal lineage regulator in hematopoiesis. Cell Rep, 11 (2015), pp. 1503-1510, 10.1016/j.celrep.2015.05.016
- 47. M. Crow, A. Paul, S. Ballouz, Z.J. Huang, J. Gillis. Exploiting single-cell expression to characterize co-expression replicability. Genome Biol, 17 (2016), pp. 1-19, 10.1186/s13059-016-0964-6
- 48. A.A. Kolodziejczyk, J.K. Kim, J.C.H. Tsang, T. Ilicic, J. Henriksson, K.N. Natarajan, et al. Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell, 17 (2015), pp. 471-485, 10.1016/j.stem.2015.09.011
- 49. A. Villaverde, J. Ross, J. Banga. Reverse engineering cellular networks with information theoretic methods. Cells, 2 (2013), pp. 306-329, 10.3390/cells2020306
- 50. N. Timme, W. Alford, B. Flecker, J.M. Beggs. Synergy, redundancy, and multivariate information measures: an experimentalist's perspective. J Comput Neurosci, 36 (2013), pp. 119-140, 10.1007/s10827-013-0458-4
- ∗51. T.E. Chan, M. Stumpf, A.C. Babtie. Gene regulatory network inference from single-cell data using multivariate information measures. Cell Syst (2017), in press
  This paper introduces a powerful information-theoretic based network inference algorithm targeted at scRNA data.
- ∗52. D. van Dijk, J. Nainys, R. Sharma, P. Kathail, A.J. Carr, K.R. Moon, et al. MAGIC: a diffusion-based imputation method reveals gene-gene interactions in single-cell RNA-sequencing data. bioRxiv (2017), 10.1101/111591
  This paper introduces a method for data imputation to overcome the issues of technical noise in scRNA data, and shows this enhances detection of cell clusters, developmental trajectories and gene regulatory interactions.
- 53. S. Jang, S. Choubey, L. Furchtgott, L.N. Zou, A. Doyle. Dynamics of embryonic stem cell differentiation inferred from single-cell transcriptomics show a series of transitions through discrete cell states. eLife (2017), 10.7554/eLife.20487.001
- 54. C.Y. Lim, H. Wang, S. Woodhouse, N. Piterman, L. Wernisch, J. Fisher, et al. BTR: training asynchronous Boolean models using single-cell expression data. BMC Bioinf (2016), pp. 1-18, 10.1186/s12859-016-1235-y
- 55. H. Chen, J. Guo, S.K. Mishra, P. Robson, M. Niranjan, J. Zheng. Single-cell transcriptional analysis to uncover regulatory circuits driving cell fate decisions in early mouse development. Bioinformatics, 31 (2015), pp. 1060-1066, 10.1093/bioinformatics/btu777
- 56. V. Moignard, S. Woodhouse, L. Haghverdi, A.J. Lilly, Y. Tanaka, A.C. Wilkinson, et al. Decoding the regulatory network of early blood development from single-cell gene expression measurements. Nat Biotechnol, 33 (2015), pp. 269-276, 10.1038/nbt.3154
- 57. N.P. Gao, M. Ud-Dean, R. Gunawan. SINCERITIES: inferring gene regulatory networks from time-stamped single cell transcriptional expression profiles. bioRxiv (2016), 10.1101/089110
- 58. P.S. Stumpf, R.C.G. Smith, M. Lenz, A. Schuppert, F.-J. Müller, A. Babtie, et al. Stem cell differentiation is a stochastic process with memory. Cell Syst (2017), 10.1016/j.cels.2017.08.009, in press
- 59. F. Buettner, K.N. Natarajan, F.P. Casale, V. Proserpio, A. Scialdone, F.J. Theis, et al. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat Biotechnol, 33 (2015), pp. 155-160, 10.1038/nbt.3102
- 60. J. Fisher, A.S. Köksal, N. Piterman, S. Woodhouse. Synthesising executable gene regulatory networks from single-cell gene expression data. In: D. Kroening, C. Păsăreanu (Eds.), Computer aided verification, Lecture notes in computer science, vol. 9206, Springer (2015), pp. 544-560