Graphical LASSO based Model Selection for Time Series

Publisher: IEEE

Abstract:
We propose a novel graphical model selection scheme for high-dimensional stationary time series or discrete time processes. The method is based on a natural generalization of the graphical LASSO algorithm, introduced originally for the case of i.i.d. samples, and estimates the conditional independence graph of a time series from a finite length observation. The graphical LASSO for time series is defined as the solution of an $\ell_1$-regularized maximum (approximate) likelihood problem. We solve this optimization problem using the alternating direction method of multipliers. Our approach is nonparametric as we do not assume a finite dimensional parametric model, but only require the process to be sufficiently smooth in the spectral domain. For Gaussian processes, we characterize the performance of our method theoretically by deriving an upper bound on the probability that our algorithm fails. Numerical experiments demonstrate the ability of our method to recover the correct conditional independence graph from a limited number of samples.
Published in: IEEE Signal Processing Letters ( Volume: 22, Issue: 10, October 2015)
Page(s): 1781 - 1785
Date of Publication: 22 April 2015
INSPEC Accession Number: 15212418

SECTION I.

Introduction

We consider the problem of inferring the conditional independence graph (CIG) of a stationary high-dimensional discrete time process or time series $\mathbf{x}[n]$ from observing $N$ samples $\mathbf{x}[1], \ldots, \mathbf{x}[N]$. This problem is referred to as graphical model selection (GMS) and is of great practical interest, e.g., for gene analysis, econometrics, environmental monitoring and medical diagnosis [1]–[5].

Most existing work on GMS for time series is based on finite dimensional parametric models [6], [7]. By contrast, a first nonparametric GMS method for high-dimensional time series has been proposed recently [8]. This approach is based on performing neighborhood regression (cf. [9]) in the frequency domain.

In this paper, we present an alternative nonparametric GMS scheme for time series based on generalizing the graphical LASSO (gLASSO) [10]–​[12] to stationary time series. The resulting algorithm is implemented using the alternating direction method of multipliers (ADMM) [13], for which we derive closed-form update rules. Typically, ADMM-based methods allow for straightforward distributed implementations, e.g., in wireless sensor networks [13], [14].

While algorithmically our approach is similar to the joint gLASSO proposed in [15], the deployment of a gLASSO type algorithm for GMS of time series is new.

Our main analytical contribution is a performance analysis which yields an upper bound on the probability that our scheme fails to correctly identify the CIG. The effectiveness of our GMS method is also verified numerically.

Notation. Given a natural number $F$, we define $[F] := \{1, \ldots, F\}$. For a square matrix $\mathbf{X} \in \mathbb{C}^{p \times p}$, we denote by $\overline{\mathbf{X}}$, $\mathbf{X}^H$, $\mathrm{tr}\{\mathbf{X}\}$ and $|\mathbf{X}|$ its elementwise complex conjugate, its Hermitian transpose, its trace and its determinant, respectively. We also need the matrix norm $\|\mathbf{X}\|_\infty := \max_{i,j} |X_{i,j}|$. By writing $\mathbf{X} \preceq \mathbf{Y}$ we mean that $\mathbf{Y} - \mathbf{X}$ is a positive semidefinite (psd) matrix. A strictly positive definite matrix $\mathbf{X}$ is indicated by $\mathbf{0} \prec \mathbf{X}$.

We denote by $\mathcal{H}^p_{[F]}$ the set of all length-$F$ sequences $\mathbf{X}[\cdot] := (\mathbf{X}[1], \ldots, \mathbf{X}[F])$ with Hermitian matrices $\mathbf{X}[f] \in \mathbb{C}^{p \times p}$. For a sequence $\mathbf{X}[\cdot] \in \mathcal{H}^p_{[F]}$, we define the entrywise norms $\|X_{i,j}[\cdot]\|^2 := (1/F)\sum_{f \in [F]} |X_{i,j}[f]|^2$, its squared generalized Frobenius norm $\|\mathbf{X}[\cdot]\|_F^2 := \sum_{i,j} \|X_{i,j}[\cdot]\|^2$ and its $\ell_1$-norm as $\|\mathbf{X}\|_1 := \sum_{i,j} \|X_{i,j}[\cdot]\|$. We equip the set $\mathcal{H}^p_{[F]}$ with the inner product $\langle \mathbf{A}[\cdot], \mathbf{B}[\cdot] \rangle := (1/F)\sum_{f \in [F]} \mathrm{tr}\{\mathbf{A}[f]\mathbf{B}[f]\}$.

For a sequence $\mathbf{X}[\cdot] \in \mathcal{H}^p_{[F]}$ and some subset $S \subseteq [p] \times [p]$, we denote by $\mathbf{X}_S[\cdot]$ the matrix sequence obtained by zeroing, separately for each index $f \in [F]$, all entries of the matrix $\mathbf{X}[f]$ except those in $S$. The generalized support of a sequence $\mathbf{X}[\cdot] \in \mathcal{H}^p_{[F]}$ is defined as $\mathrm{gsupp}(\mathbf{X}[\cdot]) := \{(i,j) \in [p] \times [p] : (\mathbf{X}[f])_{i,j} \neq 0 \text{ for some } f \in [F]\}$. We also use $(a)_+ := \max\{a, 0\}$.
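To fix ideas, these definitions translate directly into code. The following NumPy sketch (our own illustration, not part of the paper; it assumes a sequence $\mathbf{X}[\cdot] \in \mathcal{H}^p_{[F]}$ is stored as a complex array of shape (F, p, p)) implements the entrywise norms, the $\ell_1$-norm, the generalized Frobenius norm and the generalized support:

```python
import numpy as np

# Illustrative helpers (not from the paper): a sequence X[.] in H^p_[F]
# is stored as a complex NumPy array of shape (F, p, p).

def block_norm(X):
    """Entrywise norms ||X_{i,j}[.]||, i.e. sqrt((1/F) sum_f |X_{i,j}[f]|^2)."""
    F = X.shape[0]
    return np.sqrt(np.sum(np.abs(X) ** 2, axis=0) / F)

def l1_norm(X):
    """||X||_1 = sum_{i,j} ||X_{i,j}[.]||."""
    return block_norm(X).sum()

def gen_frob_norm_sq(X):
    """||X[.]||_F^2 = sum_{i,j} ||X_{i,j}[.]||^2."""
    return (block_norm(X) ** 2).sum()

def gsupp(X, tol=1e-12):
    """Generalized support: pairs (i, j) with X_{i,j}[f] != 0 for some f."""
    return {(int(i), int(j)) for i, j in zip(*np.nonzero(block_norm(X) > tol))}
```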

SECTION II.

Problem Formulation

Consider a $p$-dimensional real-valued zero-mean stationary time series $\mathbf{x}[n] = (x_1[n], \ldots, x_p[n])^T$, for $n \in \mathbb{Z}$. We assume its autocorrelation function (ACF) $\mathbf{R}[m] := \mathrm{E}\{\mathbf{x}[m]\mathbf{x}^T[0]\}$ to be absolutely summable, i.e., $\sum_{m=-\infty}^{\infty} \|\mathbf{R}[m]\|_\infty < \infty$, such that we can define the spectral density matrix (SDM) $\mathbf{S}(\theta)$ via a Fourier transform:

$$\mathbf{S}(\theta) := \sum_{m=-\infty}^{\infty} \mathbf{R}[m] \exp(-j 2\pi m \theta), \qquad \theta \in [0,1). \tag{1}$$

We require the eigenvalues of the SDM to be uniformly bounded as

$$L \le \lambda_i(\mathbf{S}(\theta)) \le U, \qquad \text{for all } \theta \in [0,1), \tag{2}$$

where, without loss of generality, we will assume $L = 1$ in what follows. The upper bound in (2) is valid if the ACF $\mathbf{R}[m]$ is summable; the lower bound ensures certain Markov properties of the CIG [3], [16], [17].

Our approach is based on the assumption that the SDM is a smooth function. Due to the Fourier relationship (1), this smoothness will be quantified via certain ACF moments

$$\mu_x^{(h)} := \sum_{m=-\infty}^{\infty} \|\mathbf{R}[m]\|_\infty\, h[m]. \tag{3}$$

Here, $h[m]$ denotes a weight function which typically increases with $|m|$. For a process with sufficiently small moment $\mu_x^{(h)}$, thereby enforcing smoothness of the SDM, we are allowed to base our considerations on a discretized version of the SDM, given by $\mathbf{S}[f] = \mathbf{S}(\theta_f)$, with $\theta_f := (f-1)/F$, for $f \in [F]$. The number $F$ of sampling points is a design parameter which has to be chosen suitably large compared to the ACF moment $\mu_x^{(h)}$ (cf. [8, Lemma 2.1]).
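As a concrete illustration of this discretization, the following sketch evaluates (1) at the sampling points $\theta_f = (f-1)/F$ for a hypothetical, finitely supported ACF (the example values are ours, chosen only for illustration):

```python
import numpy as np

def sdm_from_acf(acf, F):
    """Evaluate the SDM (1) at theta_f = (f-1)/F for a finitely supported ACF,
    given as a dict {m: R[m]} with R[-m] = R[m].T for a real process."""
    p = next(iter(acf.values())).shape[0]
    S = np.zeros((F, p, p), dtype=complex)
    for f in range(F):                    # 0-based f corresponds to theta_{f+1}
        theta = f / F
        for m, R_m in acf.items():
            S[f] += R_m * np.exp(-2j * np.pi * m * theta)
    return S

# Hypothetical p = 2 ACF supported on |m| <= 1, for illustration only.
R0 = np.array([[1.0, 0.2], [0.2, 1.0]])
R1 = np.array([[0.3, 0.0], [0.1, 0.3]])
S = sdm_from_acf({0: R0, 1: R1, -1: R1.T}, F=4)   # discretized SDM S[f]
```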

The CIG of a process $\mathbf{x}[n]$ is a simple undirected graph $G = (V, E)$ with node set $V = [p]$. Each node $r \in V$ represents a single scalar component process $x_r[n]$. An edge between nodes $r$ and $r'$ is absent, i.e., $(r, r') \notin E$, if the component processes $x_r[n]$ and $x_{r'}[n]$ are conditionally independent given all remaining component processes [3].

If the process $\mathbf{x}[n]$ is Gaussian, the CIG can be conveniently characterized via the inverse SDM $\mathbf{K}[f] := \mathbf{S}^{-1}[f]$. More specifically, it can be shown that, for sufficiently small $\mu_x^{(h)}$ with $h[m] = |m|$ [3], [8],

$$(i,j) \in E \iff (i,j) \in \mathrm{gsupp}(\mathbf{K}[\cdot]). \tag{4}$$

Thus, the edge set $E$ of the CIG is determined by the generalized support of the inverse SDM $\mathbf{K}[f]$, for $f \in [F]$.

Our goal is to robustly estimate the CIG from a finite-length observation, which incurs unavoidable estimation errors. Therefore, we have to require that, in addition to (4), the non-zero off-diagonal entries of $\mathbf{K}[f]$ are sufficiently large, such that a certain amount of estimation error is tolerable. To this end, we define the (un-normalized) minimum global partial spectral coherence of the process as

$$\rho_x := \min_{(i,j) \in E} \|K_{i,j}[\cdot]\|.$$

For the analysis of our GMS scheme we require

Assumption II.1:

We have $\rho_x \ge \rho_{\min}$ for a known $\rho_{\min} > 0$.

Our approach to GMS in the high-dimensional regime exploits a specific problem structure induced by the assumption that the true CIG is sparse.

Assumption II.2:

The CIG of the observed process $\mathbf{x}[n]$ is sparse, such that $|E| \le s$ for some small $s \ll p(p-1)/2$.

The performance analysis of the proposed GMS algorithm requires quantifying the conditioning of SDM sub-matrices. In particular, we will use the following assumption, which is a natural extension of the (multitask) compatibility condition, originally introduced in [11] to analyze the LASSO for the ordinary sparse linear (multitask) model.

Assumption II.3:

Given a process $\mathbf{x}[n]$ whose CIG contains no more than $s$ edges, indexed by $S \subseteq [p] \times [p]$, we assume that there exists a positive constant $\phi$ such that

$$\frac{1}{F}\sum_{f \in [F]} \mathrm{vec}\{\mathbf{X}[f]\}^H \big(\overline{\mathbf{S}}[f] \otimes \mathbf{S}[f]\big)\, \mathrm{vec}\{\mathbf{X}[f]\} \ge \frac{\phi}{s}\,\|\mathbf{X}_S\|_1^2 \tag{5}$$

holds for all $\mathbf{X}[\cdot] \in \mathcal{H}^p_{[F]}$ with $\|\mathbf{X}_{S^c}\|_1 \le 3\|\mathbf{X}_S\|_1$.

The constant $\phi$ in (5) is essentially a lower bound on the eigenvalues of small sub-matrices of the SDM. As such, Assumption II.3 is closely related to the concept of the restricted isometry property (RIP) [18].

As is easily verified, Assumption II.3 is always valid with $\phi = L$ for a process satisfying (2). However, for processes having a sparse CIG, we typically expect $\phi \gg L$.
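For small examples, the quadratic form on the left-hand side of (5) can be evaluated numerically. The sketch below is our own illustration under the (F, p, p) storage convention; it uses the identity $\mathrm{vec}\{\mathbf{A}\mathbf{X}\mathbf{B}\} = (\mathbf{B}^T \otimes \mathbf{A})\,\mathrm{vec}\{\mathbf{X}\}$, which implies $\mathrm{vec}\{\mathbf{X}\}^H(\overline{\mathbf{S}} \otimes \mathbf{S})\mathrm{vec}\{\mathbf{X}\} = \mathrm{tr}\{\mathbf{X}^H \mathbf{S} \mathbf{X} \mathbf{S}\}$ for Hermitian $\mathbf{S}$, so both evaluations are cross-checked:

```python
import numpy as np

def compat_quadratic_form(S, X):
    """Left-hand side of (5) for matrix sequences S, X of shape (F, p, p)."""
    F = S.shape[0]
    total = 0.0
    for f in range(F):
        kron = np.kron(S[f].conj(), S[f])      # conj(S[f]) (x) S[f]
        v = X[f].reshape(-1, order="F")        # vec{X[f]} (column stacking)
        q = (v.conj() @ kron @ v).real
        # equivalent trace form, useful for larger p where the Kronecker
        # product would be too big to materialize:
        assert np.isclose(q, np.trace(X[f].conj().T @ S[f] @ X[f] @ S[f]).real)
        total += q
    return total / F

# illustrative random positive definite SDM samples and a Hermitian sequence
rng = np.random.default_rng(0)
p, F = 3, 4
A = rng.standard_normal((F, p, p)) + 1j * rng.standard_normal((F, p, p))
S = np.einsum("fij,fkj->fik", A, A.conj()) + np.eye(p)   # S[f] pos. definite
X = (A + A.conj().transpose(0, 2, 1)) / 2                # X[f] Hermitian
print(compat_quadratic_form(S, X))
```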

SECTION III.

Graphical LASSO for Time Series

The graphical least absolute shrinkage and selection operator (gLASSO) [10]–[12], [19] is an algorithm for estimating the inverse covariance matrix $\mathbf{K} := \mathbf{C}^{-1}$ of a high-dimensional Gaussian random vector $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{C})$ based on i.i.d. samples. In particular, gLASSO is based on optimizing an $\ell_1$-penalized log-likelihood function and can therefore be interpreted as regularized maximum likelihood estimation.

A. Extending gLASSO to Stationary Time Series

A natural extension of gLASSO to the case of stationary Gaussian time series is obtained by replacing the objective function for the i.i.d. case (which can be interpreted as a penalized log-likelihood) with the corresponding penalized log-likelihood function for a stationary Gaussian process. However, since the exact likelihood lacks a simple closed-form expression (in terms of the SDM), we will use the “Whittle approximation” [20], [21] to arrive at the following gLASSO estimator for general stationary time series:

$$\widehat{\mathbf{K}}[\cdot] := \operatorname*{argmin}_{\mathbf{X}[\cdot] \in \mathcal{C}} -A\{\mathbf{X}\} + \langle \widehat{\mathbf{S}}[\cdot], \mathbf{X}[\cdot] \rangle + \lambda \|\mathbf{X}[\cdot]\|_1 \tag{6}$$

with $A\{\mathbf{X}\} := (1/F)\sum_{f \in [F]} \log|\mathbf{X}[f]|$ and constraint set

$$\mathcal{C} := \{\mathbf{X}[\cdot] \in \mathcal{H}^p_{[F]} : \mathbf{0} \prec \mathbf{X}[f] \preceq \mathbf{I} \text{ for all } f \in [F]\}. \tag{7}$$
This constraint set is reasonable since (i) the function $A\{\mathbf{X}\}$ is finite only if $\mathbf{0} \prec \mathbf{X}[f]$ and (ii) the true inverse SDM satisfies $\mathbf{K}[f] \preceq \mathbf{I}$, for all $f \in [F]$, due to (2) (with $L = 1$).

The formulation (6) involves an estimator $\widehat{\mathbf{S}}[f]$ of the SDM values $\mathbf{S}(\theta_f)$, for $f \in [F]$. While in principle any reasonable estimator could be used, we restrict the choice to a multivariate Blackman–Tukey (BT) estimator [22]

$$\widehat{\mathbf{S}}[f] = \sum_{m=-N+1}^{N-1} w[m]\, \widehat{\mathbf{R}}[m] \exp(-j 2\pi m \theta_f) \tag{8}$$

with the standard biased autocorrelation estimate $\widehat{\mathbf{R}}[m] = (1/N)\sum_{n=m+1}^{N} \mathbf{x}[n]\mathbf{x}^T[n-m]$ for $m = 0, \ldots, N-1$. Enforcing the symmetry $\widehat{\mathbf{R}}[-m] = \widehat{\mathbf{R}}^H[m]$, we obtain the ACF estimate for $m = -N+1, \ldots, -1$. The window function $w[m]$ in (8) is a design parameter, which can be chosen freely as long as it yields a psd estimate $\widehat{\mathbf{S}}[f]$. A sufficient condition for $\widehat{\mathbf{S}}[f]$ to be psd is non-negativity of the Fourier transform $W(\theta) := \sum_{m=-\infty}^{\infty} w[m]\exp(-j 2\pi \theta m)$ [22, Sec. 2.5.2].
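A minimal sketch of the BT estimator (8), assuming samples are stored row-wise in an array of shape (N, p) and the lag window is passed as a callable (the Gaussian window here is an arbitrary illustrative choice, not a recommendation):

```python
import numpy as np

def bt_sdm_estimate(x, F, w):
    """Multivariate Blackman-Tukey estimate (8) at theta_f = (f-1)/F.

    x : samples of shape (N, p); w : lag window, callable on m >= 0
    (assumed symmetric, w[-m] = w[m], with nonnegative Fourier transform).
    """
    N, p = x.shape
    S_hat = np.zeros((F, p, p), dtype=complex)
    for m in range(N):
        # biased ACF estimate R_hat[m] = (1/N) sum_{n=m+1}^{N} x[n] x[n-m]^T
        R_m = x[m:].T @ x[: N - m] / N
        for f in range(F):
            term = w(m) * R_m * np.exp(-2j * np.pi * m * f / F)
            # the lag -m enters via R_hat[-m] = R_hat[m]^H
            S_hat[f] += term if m == 0 else term + term.conj().T
    return S_hat

# usage sketch on white noise, purely for illustration
rng = np.random.default_rng(1)
x = rng.standard_normal((256, 8))
S_hat = bt_sdm_estimate(x, F=4, w=lambda m: np.exp(-m ** 2))
```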

The existence of a minimizer in (6) is guaranteed for any choice of $\lambda \ge 0$, as the optimization problem (6) is equivalent to the unconstrained problem $\min_{\mathbf{X}[\cdot]} -A\{\mathbf{X}\} + \langle \widehat{\mathbf{S}}[\cdot], \mathbf{X}[\cdot] \rangle + \lambda \|\mathbf{X}[\cdot]\|_1 + I_{\mathcal{C}}(\mathbf{X}[\cdot])$, where $I_{\mathcal{C}}(\mathbf{X}[\cdot])$ is the indicator function of the constraint set $\mathcal{C}$. Existence of a minimizer of this equivalent problem is guaranteed by [23, Theorem 27.2]: the objective function is a closed proper convex function and is finite only on the bounded set $\mathcal{C}$ (cf. (7)), which trivially implies that the objective function has no direction of recession [23].

We will present in Section III-C a specific choice for the tuning parameter $\lambda$ in (6) which ensures that the gLASSO estimator $\widehat{\mathbf{K}}[\cdot]$ is accurate, i.e., that the estimation error $\mathbf{\Delta}[\cdot] := \widehat{\mathbf{K}}[\cdot] - \mathbf{K}[\cdot]$ is small. Based on the gLASSO (6), an estimate for the edge set of the CIG may then be obtained by thresholding:

$$\widehat{E}(\eta) := \{(i,j) : \|\widehat{K}_{i,j}[\cdot]\| \ge \eta\}. \tag{9}$$

Obviously, under Assumption II.1, if

$$\|\mathbf{\Delta}[\cdot]\|_1 \le \rho_{\min}/2, \tag{10}$$

we have $\widehat{E}(\rho_{\min}/2) = E$, i.e., the CIG is recovered perfectly.
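The thresholding rule (9) is a one-liner in code; the sketch below (ours, for illustration) restricts attention to pairs $i < j$, since the CIG is a simple undirected graph:

```python
import numpy as np

def edge_estimate(K_hat, eta):
    """Edge set estimate (9) from a sequence K_hat of shape (F, p, p)."""
    F, p, _ = K_hat.shape
    norms = np.sqrt(np.sum(np.abs(K_hat) ** 2, axis=0) / F)  # ||K_hat_{i,j}[.]||
    return {(i, j) for i in range(p) for j in range(i + 1, p)
            if norms[i, j] >= eta}
```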

B. ADMM Implementation

An efficient numerical method for solving convex optimization problems of the type (6) is the alternating direction method of multipliers (ADMM). Defining the augmented Lagrangian [13] of the problem (6) as

$$L_\rho(\mathbf{X}[\cdot], \mathbf{Z}[\cdot], \mathbf{U}[\cdot]) := -A\{\mathbf{X}\} + \langle \widehat{\mathbf{S}}[\cdot], \mathbf{X}[\cdot] \rangle + \lambda \|\mathbf{Z}\|_1 + (\rho/2)\|\mathbf{X}[\cdot] - \mathbf{Z}[\cdot] + \mathbf{U}[\cdot]\|_F^2,$$

the (scaled) ADMM method iterates, starting from arbitrary initializations $\mathbf{X}^{(0)}[\cdot]$, $\mathbf{Z}^{(0)}[\cdot]$ and $\mathbf{U}^{(0)}[\cdot]$, the following update rules:

$$\mathbf{X}^{(k+1)}[\cdot] = \operatorname*{argmin}_{\mathbf{X}[\cdot] \in \mathcal{C}} L_\rho(\mathbf{X}[\cdot], \mathbf{Z}^{(k)}[\cdot], \mathbf{U}^{(k)}[\cdot]) \tag{11}$$
$$\mathbf{Z}^{(k+1)}[\cdot] = \operatorname*{argmin}_{\mathbf{Z}[\cdot] \in \mathcal{H}^p_{[F]}} L_\rho(\mathbf{X}^{(k+1)}[\cdot], \mathbf{Z}[\cdot], \mathbf{U}^{(k)}[\cdot]) \tag{12}$$
$$\mathbf{U}^{(k+1)}[\cdot] = \mathbf{U}^{(k)}[\cdot] + \big(\mathbf{X}^{(k+1)}[\cdot] - \mathbf{Z}^{(k+1)}[\cdot]\big). \tag{13}$$

It can be shown (cf. [13, Sec. 3.1]) that for any $\rho > 0$ the iterates $\mathbf{X}^{(k)}[\cdot]$ converge to a solution of (6), i.e., $\lim_{k \to \infty} \mathbf{X}^{(k)}[\cdot] = \widehat{\mathbf{K}}[\cdot]$. Thus, while the precise choice of $\rho$ has some influence on the convergence speed of ADMM [13, Sec. 3.4.1], it is not crucial. We used $\rho = 100$ in all of our experiments (cf. Section IV).

Closed forms for updates (11) and (12) are stated in

Proposition III.1:

Let us denote the eigenvalue decomposition of the matrix $\widehat{\mathbf{S}}[f] + \rho(\mathbf{U}^{(k)}[f] - \mathbf{Z}^{(k)}[f])$ by $\mathbf{V}_f \mathbf{D}_f \mathbf{V}_f^H$, with the diagonal matrix $\mathbf{D}_f$ composed of the eigenvalues $d_{f,i}$, sorted non-increasingly. Then, the ADMM update rule (11) is accomplished by setting, separately for each $f \in [F]$,

$$\mathbf{X}^{(k+1)}[f] = \mathbf{V}_f \widetilde{\mathbf{D}}_f \mathbf{V}_f^H \tag{14}$$

with the diagonal matrix $\widetilde{\mathbf{D}}_f$ whose $i$th diagonal element is given by $\tilde{d}_{f,i} = \min\big\{(1/(2\rho))\big[-d_{f,i} + \sqrt{d_{f,i}^2 + 4\rho}\,\big],\ 1\big\}$.

If we define the block-thresholding operator $\mathbf{W}[\cdot] = \mathcal{S}_\kappa(\mathbf{Y}[\cdot])$ via $W_{i,j}[f] := \big(1 - \kappa/\|Y_{i,j}[\cdot]\|\big)_+\, Y_{i,j}[f]$, the update rule (12) results in

$$\mathbf{Z}^{(k+1)}[\cdot] = \mathcal{S}_{\lambda/\rho}\big(\mathbf{X}^{(k+1)}[\cdot] + \mathbf{U}^{(k)}[\cdot]\big). \tag{15}$$

Proof:

Since the minimization problem (12) is equivalent to the ADMM update for a group LASSO problem [13, Sec. 6.4.2], the explicit form (15) follows from the derivation in [13, Sec. 6.4.2].

For the verification of (14), note that the optimization problem (11) splits into $F$ separate subproblems, one for each $f \in [F]$. The subproblem for a specific frequency bin $f$ is (omitting the index $f$)

$$\min_{\mathbf{0} \prec \mathbf{X} \preceq \mathbf{I}} -\log|\mathbf{X}| + \big\langle \mathbf{X}, \widehat{\mathbf{S}} + \rho(\mathbf{U}^{(k)} - \mathbf{Z}^{(k)}) \big\rangle + (\rho/2)\|\mathbf{X}\|_F^2. \tag{16}$$

Let us denote the non-increasing eigenvalues of the Hermitian matrices $\mathbf{X}$ and $\mathbf{Y} := \widehat{\mathbf{S}} + \rho(\mathbf{U}^{(k)} - \mathbf{Z}^{(k)})$ by $x_i$ and $d_i$, for $i \in [p]$, respectively. According to [24, Lemma II.1], we have the trace inequality $\langle \mathbf{X}, \mathbf{Y} \rangle \ge \sum_{i \in [p]} x_i d_{p-i+1}$, with equality if $\mathbf{X}$ is of the form $\mathbf{X} = \mathbf{V}\,\mathrm{diag}\{x_i\}\mathbf{V}^H$ with a unitary matrix $\mathbf{V}$ whose columns are eigenvectors of $\mathbf{Y}$. Due to this trace inequality, a lower bound on (16) is

$$\min_{0 \le x_i \le 1,\ i \in [p]}\ \sum_{i \in [p]} -\log x_i + x_i d_{p-i+1} + (\rho/2)x_i^2. \tag{17}$$

The minimum in (17) is achieved by the choice $\tilde{x}_i = h(d_{p-i+1})$ with $h(z) := \min\big\{\big(-z + \sqrt{z^2 + 4\rho}\big)/(2\rho),\ 1\big\}$. However, for the choice $\mathbf{X} = \mathbf{V}\,\mathrm{diag}\{\tilde{x}_i\}\mathbf{V}^H$ (which is (14)), the objective function in (16) achieves the lower bound (17), certifying optimality.
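Both closed-form updates, together with the dual update (13), are straightforward to implement. The following sketch is our own illustration of (14), (15) and (13) on an (F, p, p) array; the default values of rho and the iteration count are arbitrary placeholders, not recommendations from the paper:

```python
import numpy as np

def x_update(S_hat, Z, U, rho):
    """Closed-form X-update (14), solved separately per frequency bin."""
    X = np.empty_like(Z)
    for f in range(S_hat.shape[0]):
        d, V = np.linalg.eigh(S_hat[f] + rho * (U[f] - Z[f]))
        d_tilde = np.minimum((-d + np.sqrt(d ** 2 + 4 * rho)) / (2 * rho), 1.0)
        X[f] = (V * d_tilde) @ V.conj().T
    return X

def z_update(Y, kappa):
    """Block-thresholding update (15): Z = S_kappa(Y) with Y = X + U."""
    F = Y.shape[0]
    norms = np.sqrt(np.sum(np.abs(Y) ** 2, axis=0) / F)   # ||Y_{i,j}[.]||
    scale = np.maximum(1.0 - kappa / np.maximum(norms, 1e-15), 0.0)
    return Y * scale[None, :, :]

def admm_gms(S_hat, lam, rho=100.0, T=50):
    """Iterate (14), (15) and the dual update (13) for T rounds."""
    F, p, _ = S_hat.shape
    X = np.stack([np.eye(p, dtype=complex)] * F)
    Z, U = X.copy(), np.zeros_like(X)
    for _ in range(T):
        X = x_update(S_hat, Z, U, rho)
        Z = z_update(X + U, lam / rho)
        U = U + X - Z                                     # dual update (13)
    return X
```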

We summarize our GMS method in

Algorithm 1

Given samples $\mathbf{x}[1], \ldots, \mathbf{x}[N]$, parameters $T$, $F$, $\eta$, $\lambda$ and a window function $w[m]$, perform the steps:

  • Step 1)

    For each f[F], compute the SDM estimate Sˆ[f] according to (8).

  • Step 2)

Approximate the gLASSO estimate $\widehat{\mathbf{K}}[\cdot]$ (cf. (6)) by iterating (14), (15) and (13) for a fixed number of iterations $T$, yielding $\mathbf{X}^{(T)}[\cdot]$.

  • Step 3)

Estimate the edge set via $\widehat{E}(\eta) = \{(i,j) : \|X^{(T)}_{i,j}[\cdot]\| \ge \eta\}$.
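Putting the three steps together, a hedged end-to-end sketch of Algorithm 1 might look as follows; it reuses bt_sdm_estimate, admm_gms and edge_estimate from the sketches above, and all numerical values (N, p, F, lam, eta, the window) are illustrative placeholders rather than tuned choices:

```python
import numpy as np

# Illustrative driver for Algorithm 1; relies on bt_sdm_estimate, admm_gms
# and edge_estimate defined in the earlier sketches.
rng = np.random.default_rng(2)
x = rng.standard_normal((256, 8))              # stand-in for N samples of x[n]

S_hat = bt_sdm_estimate(x, F=4, w=lambda m: np.exp(-m ** 2))  # Step 1, cf. (8)
X_T = admm_gms(S_hat, lam=0.1, rho=100.0, T=10)               # Step 2, cf. (6)
E_hat = edge_estimate(X_T, eta=0.05)                          # Step 3, cf. (9)
```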

C. Performance Analysis

Let us for simplicity assume that the ADMM iterates $\mathbf{X}^{(k)}[\cdot]$ have converged perfectly to the gLASSO estimate $\widehat{\mathbf{K}}[\cdot]$ given by (6), i.e., $\mathbf{X}^{(T)}[\cdot] = \widehat{\mathbf{K}}[\cdot]$. Then, under Assumption II.1, a sufficient condition for our GMS method to recover the edge set of the CIG $G$ is (10).

We will now derive an upper bound on the probability that (10) fails to hold. This will be accomplished by (i) showing that the norm $\|\mathbf{\Delta}\|_1$ can be bounded in terms of the SDM estimation error $\mathbf{E}[f] := \mathbf{S}[f] - \widehat{\mathbf{S}}[f]$ and (ii) controlling the probability that the error $\mathbf{E}[f]$ is sufficiently small.

Upper bounding $\|\mathbf{\Delta}\|_1$. By definition of $\widehat{\mathbf{K}}[\cdot]$ (cf. (6)),

$$-A\{\widehat{\mathbf{K}}\} + \langle \mathbf{\Delta}[\cdot], \widehat{\mathbf{S}}[\cdot] \rangle + \lambda\big(\|\widehat{\mathbf{K}}\|_1 - \|\mathbf{K}\|_1\big) \le -A\{\mathbf{K}\}. \tag{18}$$

Combining with $\operatorname*{argmin}_{\mathbf{X}[\cdot] \in \mathcal{C}} -A\{\mathbf{X}\} + \langle \mathbf{X}[\cdot], \mathbf{S}[\cdot] \rangle = \mathbf{K}[\cdot]$,

$$\lambda\|\widehat{\mathbf{K}}\|_1 \le \lambda\|\mathbf{K}\|_1 + \langle \mathbf{\Delta}[\cdot], \mathbf{E}[\cdot] \rangle. \tag{19}$$
View SourceRight-click on figure for MathML and additional features.
Let us, for the moment, assume validity of the condition

$$\|\mathbf{E}\|_\infty := \max_{f \in [F]} \|\mathbf{E}[f]\|_\infty \le \lambda/2, \tag{20}$$

implying $|\langle \mathbf{\Delta}[\cdot], \mathbf{E}[\cdot] \rangle| \le (\lambda/2)\|\mathbf{\Delta}\|_1$ and, in turn via (19),

$$\lambda\|\widehat{\mathbf{K}}\|_1 \le \lambda\|\mathbf{K}\|_1 + (\lambda/2)\|\mathbf{\Delta}\|_1. \tag{21}$$

Below, we present a specific choice for the tuning parameter $\lambda$ in (6) such that (20) holds with high probability. Using $\|\widehat{\mathbf{K}}\|_1 \stackrel{(*)}{=} \|\widehat{\mathbf{K}}_{S^c}\|_1 + \|\mathbf{\Delta}_S + \mathbf{K}_S\|_1$ and $\|\mathbf{\Delta}\|_1 = \|\mathbf{\Delta}_S\|_1 + \|\mathbf{\Delta}_{S^c}\|_1$ and the (reverse) triangle inequality, (21) yields

$$\|\mathbf{\Delta}_{S^c}\|_1 \stackrel{(*)}{=} \|\widehat{\mathbf{K}}_{S^c}\|_1 \le 3\|\mathbf{\Delta}_S\|_1, \tag{22}$$

where $(*)$ is due to $S = \mathrm{gsupp}(\mathbf{K}[\cdot])$. Thus, the estimation error $\mathbf{\Delta}[\cdot]$ tends to be sparse.
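For completeness, the step leading to (22) can be spelled out. Splitting $\|\widehat{\mathbf{K}}\|_1$ along $S$ and $S^c$ and inserting into (21) gives

$$\lambda\big(\|\widehat{\mathbf{K}}_{S^c}\|_1 + \|\mathbf{\Delta}_S + \mathbf{K}_S\|_1\big) \le \lambda\|\mathbf{K}_S\|_1 + (\lambda/2)\big(\|\mathbf{\Delta}_S\|_1 + \|\mathbf{\Delta}_{S^c}\|_1\big).$$

Since $\mathbf{K}_{S^c}[\cdot] = \mathbf{0}$, we have $\|\widehat{\mathbf{K}}_{S^c}\|_1 = \|\mathbf{\Delta}_{S^c}\|_1$ and $\|\mathbf{K}\|_1 = \|\mathbf{K}_S\|_1$, while the reverse triangle inequality gives $\|\mathbf{\Delta}_S + \mathbf{K}_S\|_1 \ge \|\mathbf{K}_S\|_1 - \|\mathbf{\Delta}_S\|_1$. Rearranging yields $(1/2)\|\mathbf{\Delta}_{S^c}\|_1 \le (3/2)\|\mathbf{\Delta}_S\|_1$, which is (22).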

As a next step, we rewrite (18) as

$$-\big(A\{\widehat{\mathbf{K}}\} - A\{\mathbf{K}\}\big) + \langle \mathbf{\Delta}[\cdot], \mathbf{S}[\cdot] \rangle \le (3\lambda/2)\|\mathbf{\Delta}\|_1, \tag{23}$$

where we used (20). Let us denote the eigenvalues of the Hermitian matrix $\mathbf{\Gamma}[f] := \mathbf{S}^{1/2}[f]\,\mathbf{\Delta}[f]\,\mathbf{S}^{1/2}[f]$ by $\gamma_{f,i}$. We can then rewrite the LHS of (23) as

$$-\big(A\{\widehat{\mathbf{K}}\} - A\{\mathbf{K}\}\big) + \langle \mathbf{\Delta}[\cdot], \mathbf{S}[\cdot] \rangle = \frac{1}{F}\sum_{f,i} -\log(1 + \gamma_{f,i}) + \gamma_{f,i}.$$
Since $\widehat{\mathbf{K}}[f], \mathbf{K}[f] \preceq \mathbf{I}$ and $\mathbf{S}[f] \preceq U\mathbf{I}$ (cf. (7), (2)), we have $\gamma_{f,i} \le 2U$. Using $\log(1+x) = x - \frac{x^2}{2(1 + \varepsilon x)^2}$, with some $\varepsilon \in [0,1]$, we further obtain

$$-\big(A\{\widehat{\mathbf{K}}\} - A\{\mathbf{K}\}\big) + \langle \mathbf{\Delta}[\cdot], \mathbf{S}[\cdot] \rangle \ge \|\mathbf{\Gamma}[\cdot]\|_F^2 / \big(4(2U+1)^2\big).$$

Applying [25, Lemma 4.3.1(b)] to the RHS,

$$-\big(A\{\widehat{\mathbf{K}}\} - A\{\mathbf{K}\}\big) + \langle \mathbf{\Delta}[\cdot], \mathbf{S}[\cdot] \rangle \ge \frac{1}{4F(2U+1)^2}\sum_{f \in [F]} \mathrm{vec}\{\mathbf{\Delta}[f]\}^H \big(\overline{\mathbf{S}}[f] \otimes \mathbf{S}[f]\big)\, \mathrm{vec}\{\mathbf{\Delta}[f]\}. \tag{24}$$
Combining (24) with (23) and Assumption II.3 (which can be invoked due to (22)), we arrive at

$$\|\mathbf{\Delta}\|_1 \le 96(2U+1)^2 (s/\phi)\lambda. \tag{25}$$
Controlling the SDM estimation error. It remains to control the probability that condition (20) is valid, i.e., that the SDM estimation error $\mathbf{E}[f]$ incurred by the BT estimator (8) is sufficiently small. According to [26, Lemma IV.4], for any $\nu \in [0, 1/2)$,

$$\mathrm{P}\big\{\|\mathbf{E}\|_\infty \ge \nu + \mu_x^{(h_1)}\big\} \le 2\, e^{-(1/32) N \nu^2 \|w[\cdot]\|_1^{-2} U^{-2} + 2\log(2pN)}, \tag{26}$$

where $\mu_x^{(h_1)}$ is the ACF moment (3) with $h_1[m] := |1 - w[m](1 - |m|/N)|$ for $|m| < N$ and $h_1[m] := 1$ otherwise.

The main result. Recall that a sufficient condition for $\widehat{E}(\rho_{\min}/2)$, given by (9), to coincide with the true edge set is (10). Under the condition (20), implying validity of (25), the inequality (10) will be satisfied if $\lambda$ is chosen as

$$\lambda = \phi\rho_{\min}/\big(192 s (2U+1)^2\big). \tag{27}$$

Using (26) to bound the probability for (20) to hold yields

Proposition III.2:

Consider a stationary Gaussian zero-mean time series $\mathbf{x}[n]$ satisfying (2) and Assumptions II.1–II.3. Then, using the choice (27) in (6), and if the conditions

$$\mu_x^{(h_1)} \le \phi\rho_{\min}/\big(384 s (2U+1)^2\big) \tag{28}$$
$$N/\log(4Np^2/\delta) \ge 10^8 (2U+1)^6 s^2 \phi^{-2} \rho_{\min}^{-2} \|w[\cdot]\|_1^2 \tag{29}$$

are satisfied, we have $\mathrm{P}\{\widehat{E}(\rho_{\min}/2) \ne E\} \le \delta$.

In order to satisfy condition (28), the window function $w[\cdot]$ in (8) has to be chosen as the indicator function of the effective support of the ACF $\mathbf{R}[m]$. Thus, the factor $\|w[\cdot]\|_1^2$ in (29) corresponds to a scaling of the sample size with the square of the effective ACF width. Moreover, the sufficient condition (29) scales inversely with the square of the minimum partial spectral coherence $\rho_{\min}$, which agrees with the scaling of the sufficient condition obtained for the neighborhood regression approach in [8]. Note that, while our performance analysis applies only to Gaussian time series, the GMS method in Algorithm 1 can also be applied to non-Gaussian time series. However, for a non-Gaussian time series the resulting edge estimate $\widehat{E}(\eta)$ (cf. (9)) is no longer related to a CIG but to a partial correlation graph [3].

Fig. 1. ROC curves of the gLASSO-based GMS method.

SECTION IV.

Numerical Results

We generated a Gaussian time series $\mathbf{x}[n]$ of dimension $p = 64$ by applying a finite impulse response filter $g[n]$ of length 2 to a zero-mean, stationary, white Gaussian noise process $\mathbf{e}[n] \sim \mathcal{N}(\mathbf{0}, \mathbf{C}_0)$. We chose the covariance matrix $\mathbf{C}_0$ such that the resulting CIG $G = ([p], E)$ is a star graph containing a hub node with $|E| = 4$ neighbors. The corresponding precision matrix $\mathbf{K}_0 = \mathbf{C}_0^{-1}$ has main-diagonal entries equal to 0.5 and off-diagonal entries equal to 0.1. The filter coefficients $g[n]$ are such that the magnitude of the associated transfer function is uniformly bounded from above and below by positive constants, thereby ensuring that conditions (2) and (4) are satisfied with $L = 1$, $U = 10$ and $F = 4$. Thus, the generated time series satisfies Assumptions II.1–II.3 with $\rho_{\min} = 0.1\sum_{m=-\infty}^{\infty} r_g^2[m]$, $s = 4$ and $\phi = 1$. Here, $r_g[m] = \sum_{n=-\infty}^{\infty} g[n+m]g[n]$ denotes the autocorrelation sequence of $g[n]$.

Based on $N \in \{128, 256\}$ observed samples, we estimated the edge set of the CIG using Algorithm 1 with $T = 10$ ADMM iterations, $F = 4$ frequency points and window function $w[m] = \exp(-m^2)$. The gLASSO parameter $\lambda$ (cf. (6) and (27)) was varied in the range $[c_1, c_2] \times \phi\rho_{\min}/\big(192 s (2U+1)^2\big)$, where the constants $c_1$ and $c_2$ were tuned empirically.

In Fig. 1, we show receiver operating characteristic (ROC) curves with the empirical false-alarm rate $P_{\mathrm{fa}} := \frac{1}{M}\sum_{i \in [M]} \frac{|\widehat{E}_i \setminus E|}{p(p-1)/2 - |E|}$ and the empirical detection probability $P_{\mathrm{d}} := \frac{1}{M}\sum_{i \in [M]} \frac{|\widehat{E}_i \cap E|}{|E|}$, both averaged over $M = 100$ independent simulation runs. Here, $\widehat{E}_i$ denotes the edge-set estimate in the $i$-th simulation run.
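For reference, the two empirical rates can be computed from edge-set estimates represented as Python sets of pairs $(i, j)$ with $i < j$; a small sketch of the averaging used above, with names of our own choosing:

```python
def roc_point(E_hats, E_true, p):
    """Empirical false-alarm rate and detection probability over M runs."""
    M = len(E_hats)
    n_non_edges = p * (p - 1) // 2 - len(E_true)
    p_fa = sum(len(E_hat - E_true) for E_hat in E_hats) / (M * n_non_edges)
    p_d = sum(len(E_hat & E_true) for E_hat in E_hats) / (M * len(E_true))
    return p_fa, p_d
```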

    References

    1.
    J. Gohlke, O. Armant, F. Parham, M. Smith, C. Zimmer, D. Castro, et al., "Characterization of the proneural gene regulatory network during mouse telencephalon development", BMC Biology, vol. 6, no. 1, pp. 15, 2008.
    2.
E. Davidson and M. Levin, "Gene regulatory networks", Proc. Nat. Acad. Sci., vol. 102, no. 14, Apr. 2005.
    3.
    R. Dahlhaus, "Graphical interaction models for multivariate time series", Metrika, vol. 51, pp. 151-172, 2000.
    4.
    A. Abdelwahab, O. Amor and T. Abdelwahed, "The analysis of the interdependence structure in international financial markets by graphical models", Int. Res. J. Finance Econ., pp. 291-306, 2008.
    5.
J. Dauwels, H. Yu, X. Wang, F. Vialatte, C. Latchoumane, J. Jeong, et al., "Inferring brain networks through graphical models with hidden variables", NIPS 2011 Workshop on Machine Learning and Interpretation in Neuroimaging, Dec. 2011.
    6.
    A. Bolstad, B. D. van Veen and R. Nowak, "Causal network inference via group sparse regularization", IEEE Trans. Signal Process., vol. 59, no. 6, pp. 2628-2641, Jun. 2011.
    7.
    J. Songsiri and L. Vandenberghe, "Topology selection in graphical models of autoregressive processes", J. Mach. Learn. Res., vol. 11, pp. 2671-2705, 2010.
    8.
A. Jung, R. Heckel, H. Bölcskei and F. Hlawatsch, "Compressive nonparametric graphical model selection for time series", Proc. IEEE ICASSP, May 2014.
    9.
    N. Meinshausen and P. Bühlmann, "High-dimensional graphs and variable selection with the Lasso", Ann. Statist., vol. 34, no. 3, pp. 1436-1462, 2006.
    10.
J. Friedman, T. Hastie and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso", Biostatistics, vol. 9, no. 3, pp. 432-441, Jul. 2008.
    11.
    P. Bühlmann and S. van de Geer, Statistics for High-Dimensional Data, New York:Springer, 2011.
    12.
D. M. Witten, J. Friedman and N. Simon, "New insights and faster computations for the graphical lasso", J. Comput. Graph. Stat., vol. 20, no. 4, pp. 892-900, 2011.
    13.
    S. Boyd, N. Parikh, E. Chu, B. Peleato and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers", Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1-122, 2011.
    14.
    A. Wiesel and A. O. Hero, "Distributed covariance estimation in Gaussian graphical models", IEEE Trans. Signal Process., vol. 60, no. 1, pp. 211-220, Jan. 2012.
    15.
    P. Danaher, P. Wang and D. M. Witten, "The joint graphical lasso for inverse covariance estimation across multiple classes", J. R. Statist. Soc. B, vol. 76, pp. 373-397, 2014.
    16.
    S. L. Lauritzen, Graphical Models, U.K., Oxford:Clarendon, 1996.
    17.
    M. Eichler, Graphical Models in Time Series Analysis, 1999.
    18.
    S. A. van de Geer and P. Bühlmann, "On the conditions used to prove oracle results for the Lasso", Electron. J. Statist., vol. 3, pp. 1360-1392, 2009.
    19.
P. Ravikumar, M. J. Wainwright, G. Raskutti and B. Yu, "High-dimensional covariance estimation by minimizing $\ell_1$-penalized log-determinant divergence", Electron. J. Statist., vol. 5, pp. 935-980, 2011.
    20.
    F. R. Bach and M. I. Jordan, "Learning graphical models for stationary time series", IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2189-2199, Aug. 2004.
    21.
P. Whittle, "Estimation and information in stationary time series", Ark. Mat., vol. 2, pp. 423-434, 1953.
    22.
    P. Stoica and R. Moses, Introduction to Spectral Analysis, USA, NJ, Englewood Cliffs:Prentice Hall, 1997.
    23.
    R. T. Rockafellar, Convex Analysis, USA, NJ, Princeton:Princeton Univ. Press, 1970.
    24.
    J. B. Lasserre, "A trace inequality for matrix product", IEEE Trans. Automat. Contr., vol. 40, no. 8, Aug. 1995.
    25.
    R. A. Horn and C. R. Johnson, Topics in Matrix Analysis, U.K., Cambridge:Cambridge Univ. Press, 1991.
    26.
    A. Jung, "Learning the conditional independence structure of stationary time series: A multitask learning approach", IEEE Trans. Signal Process..