Causal network reconstruction from time series is an emerging topic in many fields of science. Beyond inferring directionality between two time series, the goal of causal network reconstruction or causal discovery is to distinguish direct from indirect dependencies and common drivers among multiple time series. Here, the problem of inferring causal networks including time lags from multivariate time series is recapitulated from the underlying causal assumptions to practical estimation problems. Each aspect is illustrated with simple examples including unobserved variables, sampling issues, determinism, stationarity, nonlinearity, measurement error, and significance testing. The effects of dynamical noise, autocorrelation, and high dimensionality are highlighted in comparison studies of common causal reconstruction methods. Finally, method performance evaluation approaches and criteria are suggested. The article is intended to briefly review and accessibly illustrate the foundations and practical problems of time series-based causal discovery and stimulate further methodological developments.
Reconstructing interaction networks from observed time series is a common problem in fields where active experiments are impossible, unethical, or expensive. Pairwise association networks, for example based on correlations, cannot be interpreted causally. The goal of causal network reconstruction or causal discovery is to distinguish direct from indirect dependencies and common drivers among multiple time series. Here, we briefly recapitulate the theoretical assumptions underlying causal discovery from time series, discuss practical estimation problems, and illustrate each aspect with accessible examples.
I. Introduction
Reconstructing the causal relations behind the phenomena we observe is a fundamental problem in all fields of science. The traditional approach is to conduct active experiments, but in many fields such as Earth system science or neuroscience, manipulations of the complex system under study are either impossible, unethical, or very expensive. On the other hand, modern science generates an ever-growing amount of data from these systems, in particular time series data. Concurrently, novel computing hardware today allows efficient processing of massive amounts of data. These developments have led to emerging interest in the problem of reconstructing causal networks or causal discovery from observational time series.
In the past few decades, a number of original causality concepts have been developed, such as Granger causality (Granger, 1969) or transfer entropy (Schreiber, 2000). Since the 1990s, computer scientists, statisticians, and philosophers have grounded causal reasoning and inference in a robust mathematical framework (Pearl, 2000; Spirtes et al., 2000). The (quite natural) definition of causality underlying this framework is interventional: a variable causally influences another if an intervention (in Pearl's notation, a do-experiment) on the former changes the distribution of the latter.
Fig. 1. Causal network reconstruction. Consider a time series dataset (panel A) from a complex system of which we try to reconstruct the underlying causal dependencies (panel B), accounting for linear and nonlinear dependencies and including their time lags (link labels). Causal discovery aims to unveil spurious associations (gray arrows), which necessarily emerge due to common drivers or transitive indirect paths. Correlation matrices are, therefore, often very dense, while causal networks are typically sparse. (Panel C) The time series graph defined in Definition 1 also resolves the time-dependence structure up to some maximum time lag. A link (black edge) exists if the two lagged variables are not independent conditionally on the past of the whole process (gray boxes).
There are different sets of assumptions that allow us to identify a causal graph. Here, we focus on time-lagged causal discovery in the framework of conditional independence testing using the assumptions of time-order, Causal Sufficiency, the Causal Markov Condition, and Faithfulness, among others, which are all discussed thoroughly in this paper. But some of these assumptions can be replaced. Recent work (Peters et al., 2017) shows ways to use assumptions on the noise structure and dependency types in the framework of structural causal models which can complement the approach studied here and we will include references to recent work from this framework throughout the sections.
The paper is organized as follows: In Sec. II, we relate Granger causality and similar concepts to the conditional-independence framework (Spirtes et al., 2000). Section III provides the necessary definitions and notation, and in Sec. IV we recapitulate the assumptions underlying time-lagged causal discovery from time series alongside illustrative examples. The practical estimation aspect, from introducing some causal discovery algorithms to significance testing, is presented in Sec. V, while Sec. VI discusses suggestions for performance evaluation. In Sec. VII, we present some comparison studies of common causal methods and conclude the paper with a brief discussion (Sec. VIII). The paper is accompanied by a Python Jupyter notebook on https://github.com/jakobrunge/tigramite to reproduce some of the examples.
II. From Granger causality to conditional independence
Granger (1969), based on work by Wiener (1956), was the first to propose a practical, operational definition of causality based on prediction improvement. The underlying idea is to measure whether the past of one process helps to predict another process over and above the information contained in the latter's own past.
where
where
Tests for causality are then based on testing whether a particular CMI is greater than zero. By the definition of CMI, I(X; Y | Z) = 0 if and only if X and Y are independent conditionally on Z, so such tests are conditional independence tests.
TEbiv and its advancements essentially test for conditional independence of the past of the source process and the present value of the target, conditionally on the target's own past.
where
III. Definitions and notation
A. Definition of time series graphs
Consider a multivariate process comprising N component time series observed at discrete times t.
(Definition of links). For a lag τ > 0, a link from the lagged variable of one component to the present variable of another (or the same) component exists if and only if the two are dependent conditionally on the past of the whole process excluding the lagged variable itself.
Here, we define links with lags τ > 0 only; contemporaneous links are excluded under the assumption of no instantaneous effects discussed in Sec. IV D.
The parents of a node are the set of all nodes in the time series graph with a directed link towards that node.
In the following,
B. Separation
When considering the dependency between two variables in a graphical model, the notion of separation formalizes which sets of nodes block all paths between them.
(Separation). Two nodes are separated given a set of nodes S if every path between them is blocked by S.
Intuitively, if two nodes are separated, no information is shared between the two; Fig. 1(c) illustrates this for the time series graph.
IV. Assumptions of causal discovery from observational time series
Causal information cannot be obtained from associations of measured variables without some assumptions. A variety of different assumptions have been shown to be sufficient to estimate the true causal graph (Spirtes et al., 2000; Peters et al., 2017). Here, we focus on three main assumptions under which the time series graph represents causal relations: Causal Sufficiency, the Causal Markov Condition, and Faithfulness. For time-lagged causal discovery from observational time series, we also need the assumptions of no instantaneous effects and stationarity. Further added to these are dependence type assumptions (e.g., linear or nonlinear) and no measurement error, and we will also assume that the joint distribution of the process has a positive density. All of these are discussed in the following.
For illustrating some of the assumptions in this paper, we will estimate the time series graphs by directly testing Definition 1 via Eq. (6) [Fig. 1(c)], mostly in the partial correlation version.
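Such a lag-specific partial correlation test can be sketched in a few lines of Python. The toy model below (a common driver acting on two variables at different lags) and all variable names are illustrative assumptions, not the accompanying notebook's code:

```python
import numpy as np

rng = np.random.default_rng(42)
T = 2000
Z = rng.standard_normal(T)
X = np.zeros(T)
Y = np.zeros(T)
for t in range(2, T):
    X[t] = 0.8 * Z[t - 1] + 0.3 * rng.standard_normal()  # Z -> X at lag 1
    Y[t] = 0.8 * Z[t - 2] + 0.3 * rng.standard_normal()  # Z -> Y at lag 2

def parcorr(x, y, z):
    """Partial correlation of x and y given z via OLS residuals."""
    A = np.column_stack([z, np.ones(len(x))])
    rx = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
    ry = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# X(t-1) and Y(t) are strongly associated via the common driver Z(t-2) ...
r_marginal = np.corrcoef(X[1:-1], Y[2:])[0, 1]
# ... but conditionally independent given Z(t-2): the partial correlation vanishes,
# so no link X(t-1) -> Y(t) is entered into the graph.
r_partial = parcorr(X[1:-1], Y[2:], Z[:-2])
```

In a full algorithm, the conditioning set would be the whole relevant past of the process rather than the single variable used here for illustration.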
Fig. 2. The problem of latent (unobserved) variables. Here, a latent confounder drives both observed variables and leads to spurious links between them.
Fig. 3. Sub-sampled time series. If the time series of the true underlying process shown in the top panels (time series graph on the left, aggregated process graph on the right) is sampled at a coarser time interval, the time series graph of the sub-sampled process (bottom left panel; note that the time index here refers to the sub-sampled time) here even has a reversed causal loop.
A. Causal sufficiency
As Granger (1969) already notes, “[t]he one completely unreal aspect of the above definitions is the use of the series U_t, representing all available information.”
(Causal Sufficiency). A set of variables is causally sufficient if it includes all common causes of every pair of variables in the set.
(Unobserved variables)
What happens if such an unobserved (latent) confounder exists? Consider the example depicted in Fig. 2 where we assume no autodependencies. Here, the latent variable is a common driver of the two observed variables, so a spurious link between them is detected even though neither causes the other.
The Fast Causal Inference (FCI) algorithm (Spirtes et al., 2000; Zhang, 2008) does not assume causal sufficiency and allows us to partially identify which links are spurious due to unobserved confounders and also for which links confoundedness cannot be determined. The underlying idea is that if conditional independence holds between two variables for any subset (including the empty set) of the remaining observed variables, then there can be no direct causal link between them.
(Sub-sampled time series)
Causal sufficiency can also be violated if all variables are observed, but they are sampled at too coarse time intervals relative to the causal links. Consider a process with the time series graph depicted in the top panel of Fig. 3 featuring a causal loop: as the bottom panel shows, the graph estimated from the sub-sampled process can even contain a causal loop with reversed orientation.
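The effect can be illustrated numerically. In the following sketch (model coefficients are arbitrary illustrative choices, not the paper's example), a purely lagged linear process is observed only at every second time step; a partial correlation test then finds a contemporaneous dependence that no lagged conditioning can remove:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 4000
X = np.zeros(T)
Y = np.zeros(T)
for t in range(1, T):
    X[t] = 0.9 * X[t - 1] + rng.standard_normal()
    Y[t] = 0.8 * X[t - 1] + 0.5 * Y[t - 1] + rng.standard_normal()

def parcorr(x, y, Zc):
    """Partial correlation of x and y given the columns of Zc."""
    A = np.column_stack([Zc, np.ones(len(x))])
    rx = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
    ry = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# Full resolution: conditioning on the lagged variables removes the
# contemporaneous association -- the graph has only the lagged link.
r_full = parcorr(X[1:], Y[1:], np.column_stack([X[:-1], Y[:-1]]))

# Sub-sampled data: the unobserved intermediate time step induces a
# contemporaneous dependence that lagged conditioning cannot remove.
Xs, Ys = X[::2], Y[::2]
r_sub = parcorr(Xs[1:], Ys[1:], np.column_stack([Xs[:-1], Ys[:-1]]))
```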
B. Causal Markov condition
All independence-based causal discovery methods necessitate the Causal Markov Condition (Spirtes et al., 2000), which constitutes a close relationship between the conditional independencies of the process and the separation properties of its graph.
(Causal Markov Condition). Separation in the time series graph implies conditional independence in the process; in particular, every variable is independent of its non-descendants given its parents.

Note that if Causal Sufficiency is not fulfilled, then also generally the Markov condition will not hold (Spirtes et al., 2000). Intuitively, the Causal Markov Condition implies that once we know the values of a variable's parents, all other variables in its past become irrelevant for predicting it.
(Non-Markovian processes)
(Time aggregation)
Another example where noise terms become dependent is time-aggregation. Consider a causal chain of processes in which each variable drives the next at some lag: if the time series are aggregated over intervals longer than these lags, the noise terms of the aggregated variables become mutually dependent.
Time aggregation is an important issue in many applied fields. For example, in climate research time series are frequently measured daily and then aggregated to a monthly scale to investigate dependencies (Runge et al., 2014). Ideally, the time resolution is at least as short as the shortest causal time lag (assuming no instantaneous effects, see Sec. IV D). See Breitung and Swanson (2002) and Barnett and Seth (2015) for a discussion on temporal aggregation in time series models. Ignoring time order, in some cases the recent methods discussed in Peters et al. (2017) can help.
Fig. 4. Time aggregation. If the time series of the true underlying process shown in the top panel is aggregated over a coarser time interval, the time series graph of the aggregated process (bottom panel; note that the time index here refers to the aggregated time) has contemporaneous dependencies due to the too coarse time resolution compared to the lags of the causal links, but also many spurious directed links.
C. Faithfulness
The Causal Markov Condition guarantees that separation in the graph implies independence in the process. But what can be concluded from an estimated conditional independence relation, that is, the reverse direction? Faithfulness guarantees that the graph entails all conditional independence relations that are implied by the Markov condition.
(Faithfulness). Every conditional independence relation in the process corresponds to a separation in the graph.

The combination of Faithfulness and the Markov property implies that conditional independence in the process holds if and only if the corresponding nodes are separated in the graph.
(Counteracting mechanisms)
Fig. 5. Example of an unfaithfully fine-tuned process where two variables are independent even though they are connected in the graph. Here, the indirect mechanism with a positive effect counteracts the direct link with a negative effect.
(Determinism)
One may argue that we live in a deterministic world and the assumption of “an independent noise term” that is pertinent to statistics is unrealistic. On the other hand, for a given observed variable, the many unresolved degrees of freedom and the coarse-graining inherent in any measurement effectively act as noise, so the noise term can be read as summarizing unmodeled influences rather than fundamental indeterminism.
The former example only illustrated a static case of determinism. The field of nonlinear dynamics studies the properties of nonlinear and chaotic dynamical processes, which has led to a plethora of nonlinear time series analysis methods (Kantz and Schreiber, 2003), often from an information-theoretic perspective (Hlavácková-Schindler et al., 2007) including transfer entropy. Many of these methods build on the assumption that no system is perfectly deterministic, for example, due to the coarse-graining of the system’s phase-space in the measurement process. In Sec. VII A, we study the effect of dynamical noise on several common time-series based causal discovery approaches for chaotic systems.
(Non-pairwise dependencies)
Next to the fine-tuned example on counteracting mechanisms, Faithfulness can also be violated for a dependency realized in an XOR-gate. Suppose, as shown in Fig. 7 among several “faithful” pairwise dependencies, we have that one variable is the XOR of two binary parent variables. Then the variable is pairwise independent of each parent, and only the joint dependence reveals the links.
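A minimal simulation of the XOR case (binary variables, names purely illustrative) shows the pairwise associations vanishing while the joint dependence is perfect:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2000
X = rng.integers(0, 2, N)   # independent fair coin
Y = rng.integers(0, 2, N)   # independent fair coin
Z = X ^ Y                   # Z is the XOR of X and Y

# Pairwise, Z is independent of each input: both correlations vanish.
r_xz = np.corrcoef(X, Z)[0, 1]
r_yz = np.corrcoef(Y, Z)[0, 1]

# Jointly, however, X and Y determine Z exactly: within each stratum
# of Y, Z is a deterministic function of X.
r_given_y0 = np.corrcoef(X[Y == 0], Z[Y == 0])[0, 1]   # Z = X within stratum
r_given_y1 = np.corrcoef(X[Y == 1], Z[Y == 1])[0, 1]   # Z = 1 - X within stratum
```

A discovery algorithm testing only pairwise (unconditional) dependencies would therefore miss both links into the XOR variable.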
Another form of synergy without violation of Faithfulness is the case that two drivers jointly carry more information about a target than the sum of their individual contributions, while each driver alone still shares some information with the target.
As pointed out in James et al. (2016) for synergistic dependencies, the problem is that the concept of a pairwise dependency graphical model does not apply, but hyper-graphs are needed to represent such dependencies. Causal discovery of such graphs, however, carries the problem of combinatorial explosion if links between sets of nodes are considered.
D. Instantaneous effects
Granger causality and the definition of time series graphs are examples for lagged definitions of causality. To guarantee that the lagged parents defined in Eq. (8) are sufficient for the Causal Markov Condition to hold, we need to assume that there are no instantaneous (contemporaneous) causal effects, i.e., that causal influences take at least one time step at the given sampling interval.
E. Stationarity
To estimate the time series graph defined in Definition 1 from time series data, we assume stationarity. Another option would be to utilize independent ensembles of realizations of lagged processes. Here, we define stationarity with respect to a time index set over which the (in)dependence relations defining the graph are required to hold.
(Causal stationarity). The time series graph is the same for all times in the considered time index set, that is, the conditional (in)dependence relations defining the links do not change over time.
This actually constitutes a weaker form of stationarity than the common definition of stationarity in mean, variance, spectral properties, or of the value of individual coefficients in a linear model. For example, one could require that all CMIs are constant over time, which is a much stronger statement. The strength of causal mechanisms may fluctuate over time and the causal stationarity assumption only requires conditional independence to be stationary.
(Non-stationarity due to confounding)
Consider the data shown in Fig. 8 and suppose we first only have access to the driven variables but not to the common signal. Their shared nonstationarity then acts like an unobserved confounder, and the estimated causal graph is fully connected.
A typical example of a common nonstationarity, albeit not the same as in our example, is found in climate time series, which are usually all driven by solar forcing leading to a common seasonal signal. In climate research, the time series are typically anomalized, that is, the seasonal signal is estimated and subtracted from the data (Storch and Zwiers, 1999), which is equivalent to regressing out its influence. But this is not always possible; in our example, the common signal is not purely periodic and cannot easily be estimated from the data. Another option for the case of piecewise stationary processes is to include background knowledge on the stationary regimes and estimate the graphs separately for the stationary subsets of time indices.
Now suppose we actually have access to the common signal and can include it in the analysis. Conditioning on it removes the spurious dependencies, and the correct causal graph is identified.
The key point is that the causal structure, that is, the time series graph, of the whole process including the common driver is stationary even though individual statistical properties, such as the mean, are not.
Fig. 8. Nonstationarity due to confounding. (Top) Example time series that are nonstationary in mean and spectral properties due to a common signal. (Bottom) Access to only the driven variables results in a fully connected causal graph, while including the common signal allows us to identify the correct causal graph.
F. Dependency type assumptions
To test conditional independence hypotheses, an assumption about the type of dependencies is unavoidable; the main classes are regression-based tests, ranging from linear to non-parametric regression, and fully model-free tests.
Regression-based conditional independence tests of two variables given a conditioning set proceed in three steps. Firstly, for centered variables, both variables are regressed on the conditioning set, either with a linear model or with non-parametric regression. Secondly, from the estimated functions the residuals of both variables are computed. Finally, the dependence between the residuals can be tested with different pairwise association tests. For partial correlation, this is a simple correlation test on the residuals of two linear regressions.
The other extreme to partial correlation are model-free methods that directly test conditional independence. The most prominent test statistic is CMI as defined in Eq. (3), for which non-parametric estimators based on nearest-neighbor statistics exist (Kraskov et al., 2004; Frenzel and Pompe, 2007; Vejmelka and Palus, 2008; Póczos and Schneider, 2012) [see also Gao et al. (2015) and Lord et al. (2018) for recent progress on nearest-neighbor entropy estimators]. Other possible conditional independence tests are Kernel Conditional Independence Tests (Zhang et al., 2011; Strobl et al., 2017) which essentially test for zero Hilbert-Schmidt norm of the partial cross-covariance operator or conditional distance correlation (Wang et al., 2015). Some new recent tests are based on neural networks (Sen et al., 2017) or decision tree regression (Chalupka et al., 2018). In Runge (2018), a conditional independence test based on CMI is introduced.
(Nonlinearity)
For the linear case (first row in Fig. 9), we consider a common driver model with linear dependencies and additive noise; here, all considered tests correctly identify the causal graph. For the nonlinear additive noise case with a quadratic dependency on the common driver, the linear partial correlation test yields a spurious link, while non-parametric regression and model-free tests remain valid. Finally, if the dependencies are multiplicative (bottom row), the additive-noise assumption underlying regression-based tests is violated and only the model-free test correctly identifies the graph.
Model-free methods in principle can deal with all these cases, which might lead to the conclusion that they are superior. But the “no-free-lunch-theorem” tells us that such generality has a price and model-free methods are very data-hungry and computationally expensive. If expert knowledge pointing to a linear or otherwise parametric dependency is available, then regression-based methods will typically greatly outperform model-free methods.
Fig. 9. Illustration of the applicability of different conditional independence methods (linear and non-parametric regression-based and model-free) on different types of linear and nonlinear common driver models. Black arrows denote correctly identified causal links and dashed gray arrows indicate spurious links. The gray scatterplots with red fit line illustrate the regressions of the two driven variables on the common driver, and the black scatterplots the dependency between the regression residuals. The three-dimensional scatterplot with red cubes in the right column depicts the model-free CMIknn test, which is based on data-adaptive nearest-neighbor estimation (the cubes are smaller for denser regions).
Fig. 10. Effect of observational noise on causal network reconstruction. Shown left is the linear example from Fig. 9. Very strong observational noise on the common driver (right panel) here leads to vanishing correlations between the driver and each of the driven variables. Since the effect of the driver can then no longer be regressed out, we also obtain a spurious link between the two driven variables.
G. Measurement error
Measurement error, unlike dynamical noise, contaminates the variables between which we seek to reconstruct dependencies and constitutes a difficult problem in causal network reconstruction (Scheines and Ramsey, 2016).
(Observational noise)
Here, we only discuss measurement error in its simple form as observational noise, which can be modeled as the true variable plus an independent noise term.
Firstly, observational noise attenuates true associations and, therefore, lowers detection power. This is because the noise inflates the variance of the observed variable without contributing to its covariance with other variables. Secondly, as illustrated in Fig. 10, noise on a common driver can additionally induce spurious links because the driver's influence can no longer be fully conditioned out.
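Both effects can be reproduced in a toy common driver model; the coefficients and noise level below are chosen for illustration only:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 2000
Z = rng.standard_normal(N)                 # true common driver
X = 0.8 * Z + 0.3 * rng.standard_normal(N)
Y = 0.8 * Z + 0.3 * rng.standard_normal(N)
Z_obs = Z + 3.0 * rng.standard_normal(N)   # heavily noisy measurement of Z

def parcorr(x, y, z):
    A = np.column_stack([z, np.ones(len(z))])
    rx = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
    ry = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# Attenuation: the association of X with the noisy driver is much weaker
# than with the true driver.
r_true = np.corrcoef(Z, X)[0, 1]
r_noisy = np.corrcoef(Z_obs, X)[0, 1]

# Spurious link: conditioning on the noisy measurement no longer removes
# the common-driver effect, leaving a strong X-Y partial correlation.
r_cond_true = parcorr(X, Y, Z)
r_cond_noisy = parcorr(X, Y, Z_obs)
```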
V. Practical estimation
The previous sections concerned fundamental assumptions of time-lagged causal discovery based on conditional independence relations. We now turn to the topic of practical estimation where we introduce several common causal discovery methods and discuss their consistency, significance testing, and computational complexity. We restrict the analysis to the class of conditional independence approaches, which flexibly allows us to use different independence tests. But graphical models, in general, can also be estimated with score-based Bayesian methods, e.g., the max-min hill-climbing algorithm (Tsamardinos et al., 2006).
A. Causal discovery algorithms
1. Granger causality/transfer entropy/FullCI
Transfer entropy, as introduced by Schreiber (2000), is a direct information-theoretic implementation of Granger causality (Barnett et al., 2009). In a lag-specific implementation, as given by FullCI [Eq. (6)], it tests for conditional independence between each lagged variable and the target variable, conditionally on the whole past of the process up to some maximum time lag.
2. Optimal causation entropy
Sun and Bollt (2014) and Sun et al. (2015) developed a discovery algorithm based on the information-theoretic optimal causation entropy principle (algorithm abbreviated as OCE), which reconstructs the lagged parents of a variable with a forward-selection stage followed by a backward-elimination stage.
The significance of CMIs can be tested with a nearest-neighbor CMI estimator (Kraskov et al., 2004; Frenzel and Pompe, 2007; Vejmelka and Palus, 2008) in combination with a permutation test where the values of the lagged variable are randomly shuffled to generate the distribution under the null hypothesis.
3. PC algorithm
An alternative to this forward-backward scheme is the PC algorithm (named after its inventors Peter and Clark) (Spirtes and Glymour, 1991). The original PC algorithm was formulated for general random variables without assuming a time order. It consists of several phases where first, in the skeleton-discovery phase, an undirected graphical model (Lauritzen, 1996) is estimated whose links are then oriented using a set of logical rules (Spirtes and Glymour, 1991; Spirtes et al., 2000). A later improvement led to the more robust modification called PC-stable (Colombo and Maathuis, 2014).
For the case of time series, we can use the information of time order, which naturally provides an orientation rule for links. The algorithm then is as follows: For every variable, start with the set of all lagged variables up to a maximum lag as candidate parents. Then iteratively test each candidate for conditional independence with the target, conditioning on subsets of the remaining candidates of increasing cardinality, and remove the candidate as soon as the independence hypothesis cannot be rejected at a significance threshold. The procedure stops when no further conditions can be tested.
The forward-backward scheme of OCE conducts conditional independence tests only using the conditions with highest CMI in the preceding stage and quickly increases the number of conditions. This can lead to wrong parents being kept in the estimated parent set.
4. PCMCI
A more recent approach that addresses some of the shortcomings of the PC algorithm above is PCMCI (Runge et al., 2018). PCMCI is a two-step approach which uses a version of the PC-algorithm only as a condition-selection step to estimate a superset of the parents of each variable.
The second step is the momentary conditional independence (MCI) test, which tests each lagged link conditionally on both the estimated parents of the target variable and the time-shifted estimated parents of the lagged source variable.
The MCI test is the most important difference to the PC algorithm and the approach by Sun and Bollt (2014). The additional conditioning on the parents of the lagged source variable yields well-calibrated tests even for highly autocorrelated time series.
B. Consistency
Consistency is an important property of causal methods that tells us whether the method provably converges to the true causal graph in the limit of infinite sample size. Consistency concerns the conditional independence tests on the one hand, but also the causal algorithm in the case of iterative approaches such as those discussed in Sec. V A.
For example, for the consistency of the non-parametric regression independence tests in Eq. (23), we need to assume that the function estimator converges to the true function, that the noise in the model is additive and independent, and finally that we have a consistent unconditional independence test for the residuals. With a consistent test, the time series graph can be directly estimated based on Definition 1. For iterative causal algorithms, we can define universal consistency as follows.
(Universal causal consistency).
That is, the probability of estimating the wrong graph becomes arbitrarily small if enough data are available, for any distribution of the process that is consistent with the assumptions.
However, universal consistency is a weaker statement than, for example, uniform consistency, which bounds the error probability as a function of the sample size and, therefore, also permits statements about finite-sample behavior.
(An inconsistent causal algorithm)
Consider again the forward-selection stage of the OCE algorithm (Sun and Bollt, 2014; Sun et al., 2015) introduced in Sec. V as a standalone method to reconstruct the parents of a variable. As Fig. 11 illustrates, the variable most strongly associated with the target need not be a causal driver, so forward selection without a subsequent backward stage can converge to a wrong parent set.
Fig. 11. Example where for certain coefficient ranges the mutual information between a non-causal driver and the target can be larger than the MI of any of the true causal drivers. Thus, the variable most strongly associated with the target is actually not a causal driver.
C. Significance testing
How can we assess the statistical significance of conditional independence tests on which the causal algorithms in Sec. V A are based, such as the tests discussed in Sec. IV F? Using a test statistic such as partial correlation or CMI, we test the null hypothesis of conditional independence versus the general alternative of conditional dependence. To assess the significance of an outcome of such a test using a finite sample, we need the distribution of the test statistic under the null hypothesis, which can be derived analytically under suitable assumptions or generated by a permutation scheme. Given the null distribution, the p-value of an observed test statistic is the probability of obtaining a value at least as extreme under the null hypothesis. There are two types of errors we can make. Rejecting the null hypothesis although it is true constitutes a false positive, and failing to reject it although it is false constitutes a false negative. If the test statistic has a continuous distribution, then under the null hypothesis the p-values are uniformly distributed in [0, 1]; a test that rejects at p < α then has a false positive rate of exactly α and is called well-calibrated.
(Permutation testing)
Permutation testing is straightforward in the bivariate independence test case. To create an estimate of the null distribution of a test statistic under independence, the values of one variable are randomly shuffled, which destroys any dependence between the two variables while preserving their marginal distributions.
To achieve a test under the correct null hypothesis, we can use a local permutation scheme that preserves the associations between the permuted variable and the conditioning variables, for example, by shuffling each value only among its nearest neighbors in the subspace of the conditions.
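A bivariate permutation test is easy to sketch; the shuffle-based p-value below is a generic textbook construction, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 500
X = rng.standard_normal(N)
Y = 0.6 * X + rng.standard_normal(N)   # dependent pair

def perm_pvalue(x, y, n_perm=200, rng=rng):
    """Permutation p-value for |corr(x, y)| under the independence null."""
    obs = abs(np.corrcoef(x, y)[0, 1])
    null = np.empty(n_perm)
    for b in range(n_perm):
        # shuffling x destroys any x-y dependence but keeps the marginals
        null[b] = abs(np.corrcoef(rng.permutation(x), y)[0, 1])
    # fraction of permuted statistics at least as extreme as the observed one
    return (1 + np.sum(null >= obs)) / (1 + n_perm)

p_dep = perm_pvalue(X, Y)                            # small: reject independence
p_indep = perm_pvalue(rng.standard_normal(N), Y)     # roughly uniform under H0
```

The conditional case replaces the global shuffle by the local, neighborhood-restricted shuffle described above.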
Fig. 12. Permutation approach to conditional independence testing. (Top left) Example sample drawn from a common driver scheme. (Top right) Permuted sample with randomly shuffled data points, which destroys the association between the two tested variables, but also between the permuted variable and the conditioning variable, leading to ill-calibrated tests. (Bottom) Schematic of the local permutation scheme: each sample point's value is mapped randomly to one of its nearest neighbors in the subspace of the conditioning variable, preserving the dependence between the permuted variable and the condition.
(Non-independent samples)
A basic assumption underlying many conditional independence tests is that the samples are independent and identically distributed (i.i.d.). Unfortunately, time series are typically dependent in time. To take this dependence into account, one can either adapt the distribution under the null hypothesis, for example, by correcting the effective degrees of freedom of the partial correlation test, or use block-based permutation schemes that preserve the autocorrelation structure.
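The miscalibration caused by autocorrelation can be made tangible with a small experiment (AR parameters chosen for illustration): the null distribution of the correlation between two independent AR(1) processes is much wider than the spread of about 1/sqrt(T) that an i.i.d.-based test assumes, so such a test rejects far too often:

```python
import numpy as np

rng = np.random.default_rng(5)
T, a = 500, 0.9   # sample size and AR(1) coefficient

def ar1(T, a, rng):
    """Generate one AR(1) series x[t] = a*x[t-1] + noise."""
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = a * x[t - 1] + rng.standard_normal()
    return x

# Correlations between *independent* AR(1) pairs form the true null
# distribution for autocorrelated data.
null_corrs = np.array([
    np.corrcoef(ar1(T, a, rng), ar1(T, a, rng))[0, 1] for _ in range(200)
])
iid_std = 1 / np.sqrt(T)        # spread assumed by an i.i.d.-based test
null_std = null_corrs.std()     # actual spread: much larger
```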
(Sequential testing of causal links)
The preceding discussion concerned tests of an individual conditional independence relationship. Directly testing causal links via Definition 1, thus, gives us a well-calibrated test if all assumptions are fulfilled.
However, in iterative causal algorithms (Sec. V A) such as the PC algorithm, OCE, or the condition-selection step of PCMCI, each link is effectively decided by a whole sequence of conditional independence tests, so the false positive rate of the overall procedure is no longer simply given by the level α of the individual tests.
Figure 13 depicts an illustrative numerical example where the combined false positive rate is much lower than the nominal level α of the individual tests.
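A toy simulation illustrates the effect; the two-test sequence and the linear model below are illustrative simplifications of the multi-stage testing in causal algorithms, not the setup of Fig. 13. A link is kept only if every test in the sequence rejects, so the combined false positive rate stays at or below that of each individual test:

```python
import numpy as np

rng = np.random.default_rng(6)
N, trials = 200, 2000
r_crit = 1.96 / np.sqrt(N)   # approximate two-sided 5% threshold

def parcorr(x, y, z):
    A = np.column_stack([z, np.ones(len(z))])
    rx = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
    ry = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

rej1 = np.zeros(trials, dtype=bool)
rej2 = np.zeros(trials, dtype=bool)
for i in range(trials):
    X = rng.standard_normal(N)
    Y = rng.standard_normal(N)              # H0 true: no X-Y link
    Z1 = 0.7 * X + rng.standard_normal(N)   # two different conditioning sets
    Z2 = 0.7 * X + rng.standard_normal(N)
    rej1[i] = abs(parcorr(X, Y, Z1)) > r_crit
    rej2[i] = abs(parcorr(X, Y, Z2)) > r_crit

rate1 = rej1.mean()               # individual test: close to the 5% level
rate2 = rej2.mean()
combined = (rej1 & rej2).mean()   # sequence: at or below each individual rate
```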
As a side remark on the previously discussed OCE and PCMCI causal discovery approaches, in the backward-elimination stage [Eq. (26)] OCE tests only the significance of each of the parents given the remaining parents, while in PCMCI every link is decided by a single final MCI test whose p-value can be interpreted in the usual way.
Fig. 13. The problem of sequential testing for a link conditionally on other variables (gray boxes). While the false positive rates of each individual test are as expected at the 5% level, the combined false positive rate of the sequence of tests is much lower. Similarly, the combined true positive rate is lower than the minimal true positive rate among the individual tests. Gaussian noise model; rates estimated from 5000 realizations.
D. Computational complexity
Application areas of causal discovery methods vary in the typical numbers of variables as well as in the available sample sizes, and both strongly affect the computational cost.
For directly testing causal links via Definition 1 (FullCI), the computational complexity depends on the complexity of a single high-dimensional conditional independence test. In the linear partial correlation case, OLS regression scales linearly in the sample size and polynomially in the dimension of the conditioning set, while model-free tests are typically considerably more expensive.
The methods PC, OCE, and PCMCI (Sec. V A), based on a condition-selection step, avoid high-dimensional conditional independence estimation by conducting more tests with lower-dimensional conditioning sets. Their theoretical complexities are difficult to evaluate [for numerical evaluations, see Sun et al. (2015) and Runge et al. (2018)], but their runtime typically scales polynomially in the number of variables.
The other major challenge with high dimensionality is detection power as analyzed in Example VII C.
VI. Performance evaluation criteria
How can causal discovery methods, such as those described in Sec. V, be evaluated? Typically, we want to know which method performs best on data from the kind of system we wish to study. Ideally, we would like to compare different methods on a data sample where the underlying causal truth is known or evaluate methods by experimentally manipulating a system, i.e., actually performing the do-experiment (Pearl, 2000) mentioned in the introduction which forms the theoretical basis of the present concept of causality. Since both of these options are usually unavailable, an alternative is to construct synthetic model data where the underlying ground truth is known. These can then be used to study the performance of causal methods for realistic finite sample situations.
A. Models
To evaluate causal methods on synthetic data, several aspects for constructing model systems are relevant:
Model realism: The model systems should mimic the domain-specific properties of real data in terms of nonlinearity, autocorrelation, spectral properties, noise structure (dynamical as well as observational), etc.
Model diversity: To avoid biased conclusions, a large number of different randomly selected connectivity structures should be tested [including link density as well as properties such as small-worldness (Watts and Strogatz, 1998)]. For example, the aforementioned forward-selection approach failed for the example shown in Fig. 11 but works for many other graphs. But even consistent methods may exhibit finite-sample biases, as studied in Runge et al. (2018).
Model dimensionality: As studied in Fig. 16, a method may perform well only for a small number of variables and the performance for high-dimensional settings (large networks) can quickly degrade.
Sample sizes: The comparative performance of different methods may vary widely for different sample sizes, which needs to be studied if no uniform consistency results are available.
B. Metrics
The performance of a causal method on a single realization of a model does not allow for reliable conclusions. Therefore, each model needs to be evaluated from many realizations. Then the most straightforward evaluation metric is to measure false positive rates and true positive rates for a given significance level, aggregated over all links and realizations.
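Given true and estimated graphs as Boolean adjacency arrays, these rates reduce to simple counts; the (variables × variables × lags) layout below is an illustrative convention:

```python
import numpy as np

# True and estimated lagged adjacency arrays: entry [i, j, k] means
# a link from variable i at lag k+1 to variable j.
true_g = np.zeros((3, 3, 2), dtype=bool)
true_g[0, 1, 0] = True   # X0 -> X1 at lag 1
true_g[1, 2, 1] = True   # X1 -> X2 at lag 2

est_g = np.zeros_like(true_g)
est_g[0, 1, 0] = True    # correctly detected link
est_g[0, 2, 0] = True    # false alarm

tp = np.logical_and(est_g, true_g).sum()
fp = np.logical_and(est_g, ~true_g).sum()
tpr = tp / true_g.sum()      # fraction of true links recovered
fpr = fp / (~true_g).sum()   # fraction of absent links falsely detected
```

In a full evaluation, these counts would additionally be averaged over many model realizations.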
Next to the true and false positives of a causal method for finite samples, another performance criterion is computational runtime, though this may strongly depend on a given implementation.
VII. Comparison studies
In this section, we compare several common causal discovery methods in three numerical comparison studies highlighting the effect of dynamical noise in deterministic chaotic systems, autocorrelation, and high dimensionality.
A. Dynamical noise in deterministic chaotic systems
In Example 6, we studied a static example of determinism. Here, we evaluate the effect of dynamical noise in a system of three coupled chaotic logistic maps with uniformly distributed independent dynamical noise whose strength is varied in the experiments.
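A system of this type can be simulated as follows; the precise coupling scheme, noise injection, and wrapping to the unit interval below are illustrative assumptions and not the paper's exact specification:

```python
import numpy as np

def noisy_logistic_system(T, sigma, c=0.2, seed=0):
    """Three logistic maps: X0 drives X1 and X2 (a common driver structure),
    with uniform dynamical noise of strength sigma, wrapped to [0, 1)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.2, 0.8, size=3)
    out = np.empty((T, 3))
    for t in range(T):
        f = 4 * x * (1 - x)                  # chaotic logistic update
        new = f.copy()
        new[1] = (1 - c) * f[1] + c * f[0]   # coupling X0 -> X1
        new[2] = (1 - c) * f[2] + c * f[0]   # coupling X0 -> X2
        # add uniform dynamical noise and wrap back into the unit interval
        x = (new + sigma * rng.uniform(size=3)) % 1.0
        out[t] = x
    return out

data = noisy_logistic_system(500, sigma=0.1)
```

Setting sigma to zero recovers the purely deterministic regime, while larger values move the system towards the stochastic regime compared in Fig. 14.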
We compare three methods from an information-theoretic framework (FullCI, OCE, PCMCI) with convergent cross mapping (CCM, Sugihara et al., 2012) (see also Arnhold et al., 1999; Hirata et al., 2016) as a nonlinear dynamics-inspired approach. The significance of CMIs in FullCI and OCE is tested with a nearest-neighbor CMI estimator (Kraskov et al., 2004; Frenzel and Pompe, 2007; Vejmelka and Palus, 2008) in combination with a permutation test where the values of the lagged variable are randomly shuffled to generate the null distribution.
Fig. 14. Comparison of CCM, FullCI, OCE, and two versions of PCMCI on a common driver system of three coupled chaotic logistic maps. The top panel shows the true graph structure. In the left and right graphs, for low and high dynamical noise levels, respectively, the width of arrows denotes the detection rate and gray edges depict false links (only false positive rates shown). The center panels depict average true (black, left axis) and false positive rates (red, right axis) for different strengths of dynamical noise. The gray line marks the 5% significance threshold.
Figure 14 shows that in the purely deterministic regime with vanishing dynamical noise, CCM, which was designed for deterministic dynamical systems, recovers the links best, while the conditional-independence-based methods lack power.
For higher dynamical noise levels, interesting behavior emerges: FullCI and CCM have continuously decreasing power (and slightly decreasing false positives), dropping to almost zero at the highest noise levels.
How can these results be understood? CCM attempts to reconstruct the attractor manifolds underlying the observed time series; with increasing dynamical noise, the deterministic attractor structure that cross mapping relies on is progressively washed out.
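To make this concrete, here is a minimal sketch of the cross-mapping step at the heart of CCM: delay-embed one series, find nearest neighbors on its shadow manifold, and use them to predict the other series. The embedding parameters and exponential weighting follow the usual simplex-projection recipe, but this is a simplified illustration; full implementations (e.g., following Sugihara et al., 2012) also vary the library length to test for convergence.

```python
import numpy as np

def embed(x, E=2, tau=1):
    """Time-delay embedding of a univariate series."""
    n = len(x) - (E - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(E)])

def ccm_skill(x, y, E=2, tau=1):
    """Cross-map x from the shadow manifold of y: if x drives y,
    then y's attractor encodes x, and the cross-map estimate of x
    from y-neighbors correlates with the true x (minimal sketch)."""
    My = embed(y, E, tau)
    target = x[(E - 1) * tau:]
    # pairwise distances on the shadow manifold of y
    d = np.linalg.norm(My[:, None, :] - My[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    est = np.empty(len(target))
    for i in range(len(target)):
        nn = np.argsort(d[i])[: E + 1]  # E+1 nearest neighbors
        # exponential weights relative to the nearest neighbor
        w = np.exp(-d[i, nn] / max(d[i, nn[0]], 1e-12))
        est[i] = np.sum(w * target[nn]) / np.sum(w)
    return np.corrcoef(est, target)[0, 1]
```

High cross-map skill is taken as evidence of a causal influence; with strong dynamical noise, neighbors on the shadow manifold no longer carry information about the other variable and the skill collapses.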
The MCI test statistic of PCMCI, on the other hand, conditions on the parents of both variables of the link under consideration and thereby tests their momentary conditional independence.
Given at least some dynamical noise to satisfy the Faithfulness condition, and together with the other assumptions discussed in this paper, OCE and PCMCI provably converge to the true causal graph in the limit of infinite sample size (Runge et al., 2018; Sun et al., 2015; see also Sec. V B), while no such theoretical result is available for CCM. In the infinite sample limit, the inflated false positives of OCE due to time-dependent samples also vanish. For finite samples, on the other hand, among other factors, consistency depends on how well-calibrated the significance test is. PCMCI here always yields the expected levels of false positives. The inflated false positives of FullCI and OCE over a wide range of noise levels point to ill-calibrated significance tests.
B. Autocorrelation
In Fig. 15, we evaluate the previously introduced causal algorithms (Sec. V A) FullCI, OCE, and PCMCI on autocorrelated data, which is a ubiquitous feature of real-world time series. The full model setup is described in Appendix B 2. In short, here we only evaluate the false positive rates (for links absent in the true graph) and the true positive rates.
Comparison of FullCI, three versions of OCE, and PCMCI under strong autocorrelation. (Top panel) Model time series graph with labels denoting linear coefficients. The full model setup is described in Appendix B 2. FullCI has the conditioning set comprising the whole past of all variables, OCE has the conditioning set depicted by gray and blue boxes, and PCMCI has the conditioning set depicted by gray, blue, and red boxes. False positive rates and true positive rates are shown for two different dimensions and various autocorrelation strengths.
We compare partial correlation implementations of all methods, with the partial correlation coefficient as test statistic.
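The linear test statistic used here can be sketched in a few lines: the partial correlation of two variables given a conditioning set is the correlation between their residuals after regressing out that set via ordinary least squares.

```python
import numpy as np

def partial_corr(x, y, Z):
    """Partial correlation of x and y given conditioning matrix Z
    (samples x conditions), computed as the correlation of OLS
    residuals. Minimal sketch of the linear test statistic."""
    Zc = np.column_stack([np.ones(len(x)), Z])  # include intercept
    rx = x - Zc @ np.linalg.lstsq(Zc, x, rcond=None)[0]
    ry = y - Zc @ np.linalg.lstsq(Zc, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]
```

For a common driver z of x and y, the raw correlation of x and y is large while the partial correlation given z is close to zero, which is precisely the distinction between association networks and causal discovery emphasized throughout.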
In the bottom four panels of Fig. 15, we depict the false and true positive rates for the two different dimensions and various autocorrelation strengths.
C. Curse of dimensionality
In Fig. 16, we evaluate the causal algorithms (Sec. V A) for high-dimensional data using the same model as in Fig. 15 described in Appendix B 2. Here only FullCI, OCE, and PCMCI are compared, all of them again based on partial correlation.
Comparison of FullCI, OCE, and PCMCI under high dimensionality. The model setup is the same as in Fig. 15 and is detailed in Appendix B 2. Shown are false positive rates and true positive rates for different dimensions and no autocorrelation.
As shown in Fig. 16, FullCI severely suffers from the curse of dimensionality, and the OLS solver even becomes ill-conditioned for the highest dimensions considered.
The problem becomes even more severe for non-parametric tests such as multivariate transfer entropy. An alternative to OLS partial correlation estimation are regularization techniques such as ridge regression (Tikhonov, 1963; Hoerl et al., 1970) or the lasso (Tibshirani, 1996), but these come with other difficulties, for example, regarding significance testing. These issues are further analyzed in Runge et al. (2018).
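As a sketch of the regularized alternative, ridge regression adds a penalty term to the normal equations, which keeps them well-conditioned even when the number of regressors approaches or exceeds the sample size; the penalty strength `lam` is a free parameter here, chosen purely for illustration.

```python
import numpy as np

def ridge_coefs(X, y, lam=1.0):
    """Ridge-regularized regression coefficients: adds lam * I to
    X^T X, so the system stays solvable even when X^T X is singular
    (e.g., more regressors than samples). Minimal sketch."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# An under-determined problem (30 samples, 50 regressors) where
# plain OLS normal equations would be singular:
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 50))
y = X[:, 0] + 0.1 * rng.normal(size=30)
beta = ridge_coefs(X, y)
```

The difficulty mentioned above remains, however: unlike for OLS partial correlations, no simple analytical null distribution is available for regularized coefficients, which complicates significance testing.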
VIII. Discussion and conclusions
The long preceding list of theoretical assumptions in Sec. IV may make causal discovery seem a daunting task. In most real-data applications, we will not have all common drivers measured, hence violating the causal sufficiency assumption. The Markov assumption, too, may be violated in many cases due to time aggregation. A conclusion on the existence of a causal link thus rests on a number of partially strong assumptions. So what can we learn from an estimated causal graph? Let us consider which assumptions underlie a conclusion on the absence of a causal link.
Note that, for practical estimation from finite data, we also need to assume that all dependencies among lagged variables manifest within the maximum time lag considered.
Firstly, since we assume an error-free measurement process, conditional independencies estimated from the measured time series directly reflect conditional independencies among the underlying system variables.
The second set of assumptions important for causal discovery are those underlying significance testing (Sec. V C). Failing to properly take autocorrelation into account, or using overly simple permutation schemes, implies ill-calibrated significance tests, leading to inflated false positives beyond those expected at the chosen significance level (see the comparison studies in Sec. VII). Next to the theoretical causal assumptions, the statistical reliability of reconstructed networks is thus an important aspect for drawing causal conclusions.
This paper is intended to recapitulate the main concepts of time-lagged causal discovery from observational time series data and to accessibly illustrate important challenges. But many more challenges exist: for example, we have not considered selection bias or issues with the definition of variables, as elaborated on in Spirtes et al. (2000). We also have not discussed the topic of determining causal effects (Pearl, 2000) (causal quantification) or mediation (VanderWeele, 2015; Runge et al., 2015a, 2015b), as opposed to the mere existence or absence of causal links presented here.
Our focus was on time series, which make the causal discovery problem easier in some aspects (e.g., time order can be exploited) but more difficult in others, especially regarding statistical testing. We have briefly mentioned recent works based on different sets of assumptions in the framework of structural causal modeling (Peters et al., 2017), which do not require time order. Many more techniques and insights from the conditional independence framework (Spirtes et al., 2000) can also be utilized in the time series case. An important conclusion is that causal discovery is a very active area of research in many fields, from mathematics, computer science, and physics to the applied sciences, and methodological progress can greatly benefit from more interdisciplinary exchange.
ACKNOWLEDGMENTS
J.R. thanks C. Glymour and F. Eberhardt for comments and D. Sejdinovic, K. Zhang, and J. Peters for helpful discussions. Special thanks to C. Linstead for help with high-performance computing. J.R. received funding from a postdoctoral award by the James S. McDonnell Foundation and gratefully acknowledges the European Regional Development Fund (ERDF), the German Federal Ministry of Education and Research, and the Land Brandenburg for supporting this project by providing resources on the high-performance computer system at the Potsdam Institute for Climate Impact Research. Software is available online under https://github.com/jakobrunge/tigramite.
Appendix A: Inconsistent causal algorithm
For the example graph shown in Fig. 11, consider the following decomposition of a mutual information [via the chain rule (Cover and Thomas, 2006), dropping time indices for readability]:
Alternatively, one can decompose the same MI as
where the last term vanishes due to the conditional independencies implied by the graph.
Hence, the wrong parent is not detected as spurious, which illustrates the inconsistency of the algorithm.
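Both decompositions above are instances of the chain rule for mutual information, which in generic form (abstracting the example's variables to $X$, $Y$, $Z$) reads:

```latex
I(X; Y, Z) \;=\; I(X; Z) + I(X; Y \mid Z) \;=\; I(X; Y) + I(X; Z \mid Y)
```

Equating the two expansions shows how a term that vanishes by conditional independence in one decomposition redistributes the same total information onto the remaining terms of the other, which is the mechanism by which a wrong parent can absorb the dependence of the true one.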
Appendix B: Implementation details for numerical examples
B.1. Dynamical noise model
CCM was estimated with a fixed embedding dimension and embedding delay.
B.2. Model for examples on autocorrelation and high-dimensionality
The model time series graph is depicted in Fig. 15. In the model setup, all links correspond to linear dependencies, and the autocorrelation is varied via the coefficients of each variable's auto-dependency. This setup guarantees that the unconditional dependence stays the same and we only investigate the effect of increasing autocorrelation and higher dimensionality. All noise terms are additive, independent, and Gaussian.
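A minimal sketch of such a linear setup, with illustrative auto-dependency and coupling coefficients rather than the exact values of this appendix, is:

```python
import numpy as np

def linear_autocorr_model(T=500, N=5, a=0.9, c=0.5, seed=0):
    """Generate N time series where each variable has an AR(1)
    auto-dependency with coefficient a, variable 0 drives variable 1
    with coefficient c at lag 1, and all noise terms are additive,
    independent Gaussians. Coefficients are illustrative only."""
    rng = np.random.default_rng(seed)
    data = np.zeros((T, N))
    for t in range(1, T):
        eps = rng.normal(size=N)           # additive Gaussian noise
        data[t] = a * data[t - 1] + eps    # AR(1) auto-dependencies
        data[t, 1] += c * data[t - 1, 0]   # causal link 0 -> 1 at lag 1
    return data

data = linear_autocorr_model()
```

Varying `a` controls the autocorrelation strength, and varying `N` the dimensionality, without touching the causal coupling, mirroring the design described above.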
B.3. Pre-whitening and block-shuffling
We also tested the OCE test using a pre-whitening (OCEpw) and a block-shuffle permutation test (OCEbs). For the pre-whitening test, we preprocessed all time series by fitting an AR(1) model to each series and retaining the residuals.
The OCE test is then applied to these residuals.
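A minimal sketch of this AR(1) pre-whitening step, assuming zero-mean series, is:

```python
import numpy as np

def prewhiten_ar1(x):
    """Remove lag-1 autocorrelation: fit an AR(1) coefficient by
    least squares and return the residuals, which the conditional
    independence test is then applied to (minimal sketch)."""
    phi = np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1])
    return x[1:] - phi * x[:-1]
```

If the series were truly AR(1), the residuals are close to the serially independent innovations; for more general dependence structures, pre-whitening only removes part of the autocorrelation.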
Another remedy is a block-shuffle permutation test, which is based on a block-shuffle surrogate test following Peifer et al. (2005) and Mader et al. (2013). Here, the null distribution of the test statistic is generated by shuffling blocks of consecutive samples, which preserves the autocorrelation within blocks while destroying the dependency under test.
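A minimal sketch of block-shuffle surrogate generation, with a fixed illustrative block length (implementations typically choose the block length from the autocorrelation structure of the data, following the cited papers):

```python
import numpy as np

def block_shuffle(x, block_length=20, rng=None):
    """Block-shuffle surrogate: cut the series into consecutive
    blocks and permute the block order. Within-block autocorrelation
    is preserved while longer-range structure is destroyed, so the
    surrogate null distribution respects serial dependence."""
    rng = np.random.default_rng() if rng is None else rng
    n_blocks = len(x) // block_length
    blocks = [x[i * block_length:(i + 1) * block_length]
              for i in range(n_blocks)]
    perm = rng.permutation(n_blocks)
    return np.concatenate([blocks[i] for i in perm])
```

Repeating this many times and recomputing the test statistic on each surrogate yields a null distribution against which the original statistic can be compared.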