An introduction to variational methods for graphical models (1999)
Venue: Machine Learning
Citations: 864 (63 self)
BibTeX
@ARTICLE{Jordan99anintroduction,
author = {Michael I. Jordan and Zoubin Ghahramani and others},
title = {An introduction to variational methods for graphical models},
journal = {Machine Learning},
volume = {37},
year = {1999},
pages = {183--233}
}
Abstract
This paper presents a tutorial introduction to the use of variational methods for inference and learning in graphical models (Bayesian networks and Markov random fields). We present a number of examples of graphical models, including the QMR-DT database, the sigmoid belief network, the Boltzmann machine, and several variants of hidden Markov models, in which it is infeasible to run exact inference algorithms. We then introduce variational methods, which exploit laws of large numbers to transform the original graphical model into a simplified graphical model in which inference is efficient. Inference in the simplified model provides bounds on probabilities of interest in the original model. We describe a general framework for generating variational transformations based on convex duality. Finally, we return to the examples and demonstrate how variational algorithms can be formulated in each case.
Citations
9062 |
Elements of Information Theory
- Cover, Thomas
- 1991
Citation Context ...E) ln [P(H, E) / Q(H | E)]: (43) The difference between the left and right hand sides of this equation is easily seen to be the KL divergence D(Q ‖ P). Thus, by the positivity of the KL divergence (Cover & Thomas, 1991), the right-hand side of Eq. (43) is a lower bound on P(E). Moreover, by choosing λ according to Eq. (41), we obtain the tightest lower bound. 6.1. Convex duality and the KL divergence We can also ju...
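The bound this excerpt refers to can be restated compactly (a sketch in the paper's notation; the inequality is Jensen's inequality, and the slack is exactly the KL divergence):
\[
\ln P(E) \;=\; \ln \sum_{\{H\}} P(H, E) \;\ge\; \sum_{\{H\}} Q(H \mid E)\,\ln \frac{P(H, E)}{Q(H \mid E)},
\]
\[
\ln P(E) \;-\; \sum_{\{H\}} Q(H \mid E)\,\ln \frac{P(H, E)}{Q(H \mid E)}
\;=\; \sum_{\{H\}} Q(H \mid E)\,\ln \frac{Q(H \mid E)}{P(H \mid E)}
\;=\; D(Q \,\|\, P) \;\ge\; 0.
\]
Maximizing the right-hand side over Q (or over its variational parameters λ) therefore tightens the lower bound on ln P(E).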
8800 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...and the state nodes Xi are hidden. An HMM is trained by treating the output nodes as evidence nodes and the state nodes as hidden nodes. An expectation-maximization (EM) algorithm (Baum et al., 1970; Dempster, Laird, & Rubin, 1977) is generally used to update the parameters A, B, π; this algorithm involves a simple iterative procedure having two alternating steps: (1) run an inference algorithm to calculate the conditional pr...
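As a concrete illustration of the two alternating steps described in this excerpt, here is a minimal Baum-Welch sketch for a discrete-output HMM with transition matrix A, emission matrix B, and initial distribution pi, following the excerpt's notation; the code is an illustrative sketch, not the algorithm as presented in the paper.

import numpy as np

def forward_backward(A, B, pi, obs):
    # E-step inference: scaled forward-backward pass returning the posterior
    # state marginals gamma[t, i] and pairwise marginals xi[t, i, j].
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K)); beta = np.zeros((T, K)); scale = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = np.zeros((T - 1, K, K))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] /= xi[t].sum()
    return gamma, xi, np.log(scale).sum()  # third value is the log-likelihood

def baum_welch(obs, K, V, n_iter=50, seed=0):
    # obs: integer array of observed symbols in {0, ..., V-1}; K hidden states.
    rng = np.random.default_rng(seed)
    A = rng.dirichlet(np.ones(K), size=K)   # state-transition probabilities
    B = rng.dirichlet(np.ones(V), size=K)   # emission probabilities
    pi = np.ones(K) / K                     # initial state distribution
    for _ in range(n_iter):
        gamma, xi, _ = forward_backward(A, B, pi, obs)   # (1) inference step
        pi = gamma[0]                                    # (2) parameter updates
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.zeros_like(B)
        for t, o in enumerate(obs):
            B[:, o] += gamma[t]
        B /= gamma.sum(axis=0)[:, None]
    return A, B, pi

# Example: A, B, pi = baum_welch(np.array([0, 1, 2, 1, 0, 2, 2, 1]), K=2, V=3)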
7401 |
Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
Citation Context ...ce has arisen in applications of graphical model inference to error-control decoding (McEliece, MacKay, & Cheng, 1998). In particular, Kim and Pearl’s algorithm for singly-connected graphical models (Pearl, 1988) has been used successfully as an iterative approximate method for inference in non-singly-connected graphs. Another approach to the design of approximation algorithms involves making use of Monte Ca...
3594 | Convex Analysis
- Rockafellar
- 1970
Citation Context ...many of the variational transformations that have been utilized in the literature on graphical models are examples of the general principle of convex duality. It is a general fact of convex analysis (Rockafellar, 1972) that a concave function f(x) can be represented via a conjugate or dual function as follows: f(x) = min_λ {λ^T x − f*(λ)}, (21) where we now allow x and λ to be vectors. The conjugate function f*...
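The logarithm is perhaps the simplest instance of this conjugate representation (a standard worked example; the minimizing λ is 1/x):
\[
\ln x \;=\; \min_{\lambda > 0}\,\{\lambda x - \ln \lambda - 1\},
\]
so that fixing any particular λ > 0 yields a linear upper bound, ln x ≤ λx − ln λ − 1, which can then be tightened by optimizing over λ.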
3021 |
UCI repository of machine learning databases, ⟨http://www.ics.uci.edu/mlearn/MLRepository.html⟩: Last modification
- Blake, Merz
- 2004
Citation Context ...he expectations computed in the first phase. The procedure then returns to the first phase and iterates. Ghahramani and Jordan (1997) reported results on fitting an FHMM to the Bach chorale data set (Merz & Murphy, 1996). They showed that significantly larger effective state spaces could be fit with the FHMM than with an unstructured HMM, and that performance in terms of probability of the test set was an order of m...
964 |
An Introduction to Bayesian Networks
- Jensen
- 1996
Citation Context ...represent the set of evidence nodes, we wish to calculate P(H | E): P(H | E) = P(H, E) / P(E): (1) General exact inference algorithms have been developed to perform this calculation (Jensen, 1996; Shachter, Andersen, & Szolovits, 1994; Shenoy, 1992); these algorithms take systematic advantage of the conditional independencies present in the joint distribution as inferred from the pattern of m...
891 | A tutorial on learning with Bayesian networks
- Heckerman
- 1998
Citation Context ...ian methods. Variational inference can be applied to the general problem of Bayesian parameter estimation. Indeed we can quite generally treat parameters as additional nodes in a graphical model (cf. Heckerman, 1999) and thereby treat Bayesian inference on the same footing as generic probabilistic inference in a graphical model. This probabilistic inference problem is often intractable, and variational approxima...
835 |
A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains
- Baum, Petrie, et al.
- 1970
Citation Context ...e training process and the state nodes Xi are hidden. An HMM is trained by treating the output nodes as evidence nodes and the state nodes as hidden nodes. An expectation-maximization (EM) algorithm (Baum et al., 1970; Dempster, Laird, & Rubin, 1977) is generally used to update the parameters A, B, π; this algorithm involves a simple iterative procedure having two alternating steps: (1) run an inference algorithm...
799 | A view of the EM algorithm that justifies incremental, sparse, and other variants - Neal, Hinton - 1998 |
589 | Probabilistic inference using Markov chain Monte Carlo methods (Tech - Neal - 1993 |
375 |
Modern Quantum Mechanics
- Sakurai
- 1994
Citation Context ...ial case. 4. Basics of variational methodology Variational methods are used as approximation methods in a wide variety of settings, including finite element analysis (Bathe, 1996), quantum mechanics (Sakurai, 1985), statistical mechanics (Parisi, 1988), and statistics (Rustagi, 1976). In each of these cases the application of variational methods converts a complex problem into a simpler problem, where the simp...
355 |
Finite Element Procedures
- Bathe
- 1996
Citation Context ... HMDT includes the FHMM as a special case. 4. Basics of variational methodology Variational methods are used as approximation methods in a wide variety of settings, including finite element analysis (Bathe, 1996), quantum mechanics (Sakurai, 1985), statistical mechanics (Parisi, 1988), and statistics (Rustagi, 1976). In each of these cases the application of variational methods converts a complex problem int...
339 | Turbo decoding as an instance of Pearl’s belief propagation algorithm
- McEliece, MacKay, et al.
- 1998
Citation Context ...ver, given the exponential growth in complexity of the exact algorithms. A related approach to approximate inference has arisen in applications of graphical model inference to error-control decoding (McEliece, MacKay, & Cheng, 1998). In particular, Kim and Pearl’s algorithm for singly-connected graphical models (Pearl, 1988) has been used successfully as an iterative approximate method for inference in non-singly-connected grap...
301 |
Learning and re-learning in boltzmann machines
- Hinton, Sejnowski
- 1986
Citation Context ...ricted set of potential functions (see figure 7). In particular, the clique potentials are formed by taking products of “Boltzmann factors”—exponentials of terms that are at most quadratic in the Si (Hinton & Sejnowski, 1986). Thus each clique potential is a product of factors exp{θij Si Sj} and factors exp{θi0 Si}, where Si ∈ {0, 1}. A given pair of nodes Si and Sj can appear in multiple, overlapping cliques. For ea...
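Multiplying these Boltzmann factors over all cliques gives the familiar Boltzmann machine joint distribution on the binary variables Si (a restatement of the construction in the excerpt; Z denotes the partition function):
\[
P(S) \;=\; \frac{1}{Z}\,\exp\Bigl\{\sum_{i<j} \theta_{ij} S_i S_j \;+\; \sum_{i} \theta_{i0} S_i\Bigr\},
\qquad
Z \;=\; \sum_{\{S\}} \exp\Bigl\{\sum_{i<j} \theta_{ij} S_i S_j \;+\; \sum_{i} \theta_{i0} S_i\Bigr\}.
\]
It is the sum over exponentially many configurations in Z, and in the conditionals derived from it, that makes exact inference intractable for densely connected Boltzmann machines.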
293 | Bucket elimination: A unifying framework for probabilistic inference - Dechter - 1996 |
261 | Approximating probabilistic inference in Bayesian belief networks is NP-hard
- Dagum, Luby
- 1993
Citation Context ... use of Monte Carlo methods. A variety of Monte Carlo algorithms have been developed (see Neal, 1993) and applied to the inference problem in graphical models (Dagum & Luby, 1993; Fung & Favero, 1994; Gilks, Thomas, & Spiegelhalter, 1994; Jensen, Kong, & Kjærulff, 1995; Pearl, 1988). Advantages of these algorithms include their simplicity of implementation and theoretical gua...
250 |
Statistical Field Theory
- Parisi
- 1988
Citation Context ...odology Variational methods are used as approximation methods in a wide variety of settings, including finite element analysis (Bathe, 1996), quantum mechanics (Sakurai, 1985), statistical mechanics (Parisi, 1988), and statistics (Rustagi, 1976). In each of these cases the application of variational methods converts a complex problem into a simpler problem, where the simpler problem is generally characterized...
229 | The wake-sleep algorithm for unsupervised neural networks
- Hinton, Dayan, et al.
- 1995
Citation Context ...model and the recognition model and rely on the parameter estimation process to bring these parameterizations into register. This is the basic idea behind the “Helmholtz machine” (Dayan et al., 1995; Hinton et al., 1995). The key advantage of the recognition-model approach is that the calculation of P(H | E) is reduced to an elementary feedforward calculation that can be performed quickly. There are some disadvantag...
190 |
Connectionist learning of belief networks
- Neal
- 1992
Citation Context ...(1 − ξi)μi(1 − μi): (68) Note that there is no need to calculate variational parameters under the unconditional distribution, P(S | θ), as in the case of the Boltzmann machine (a fact first noted by Neal, 1992). Figure 23. The leftmost figure shows examples of the images used by Saul and Jordan (1999) in training their handwritten digit classifier. The rightmost figure shows examples of i...
Probabilistic independence networks for hidden Markov probability models
- Smyth, Heckerman, et al.
- 1997
Citation Context ...ions about the state space and the transition probabilities that are not available within the simple HMM framework. A number of structured variations on HMMs have been considered in recent years (see Smyth et al., 1997); generically these variations can be viewed as “dynamic belief networks” (Dean & Kanazawa, 1989; Kanazawa, Koller, & Russell, 1995). Here we consider a particularly simple variation on the HMM theme...
155 | Stochastic simulation algorithms for dynamic probabilistic networks
- Kanazawa, Koller, et al.
- 1995
Citation Context ... number of structured variations on HMMs have been considered in recent years (see Smyth et al., 1997); generically these variations can be viewed as “dynamic belief networks” (Dean & Kanazawa, 1989; Kanazawa, Koller, & Russell, 1995). Here we consider a particularly simple variation on the HMM theme known as the “factorial hidden Markov model” (Ghahramani & Jordan, 1997; Williams & Hinton, 1991). The graphical model for a factor...
145 |
A mean field theory learning algorithm for neural networks
- Peterson, Anderson
- 1987
Citation Context ...th the intractability of the Boltzmann machine (Hinton & Sejnowski, 1986). The sampling algorithms are overly slow, however, and more recent work has considered the faster “mean field” approximation (Peterson & Anderson, 1987). We will describe the mean field approximation for Boltzmann machines later in the paper—it is a special form of the variational approximation approach that provides lower bounds on marginal probabi...
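To make the "mean field" idea concrete, the following is a minimal sketch of the usual fixed-point updates for a Boltzmann machine with couplings θij and biases θi0 as in the Boltzmann factors quoted earlier (illustrative code, not from the paper; theta[i, j] is the coupling for the pair {i, j}, stored symmetrically with zero diagonal, and mu[i] approximates the marginal P(Si = 1)).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(theta, theta0, n_iter=100):
    # theta: symmetric (N, N) couplings with zero diagonal; theta0: (N,) biases.
    # Returns mean-field marginals mu, with mu[i] approximating P(S_i = 1).
    N = theta.shape[0]
    mu = np.full(N, 0.5)
    for _ in range(n_iter):
        for i in range(N):
            # Coordinate update: each unit is set to its expected value given
            # the current mean values of the other units.
            mu[i] = sigmoid(theta[i] @ mu + theta0[i])
    return mu

# Example: mean_field(np.array([[0.0, 1.5], [1.5, 0.0]]), np.array([-0.5, 0.5]))

These are the updates μi = σ(Σj θij μj + θi0); iterating them to a fixed point gives the kind of lower-bounding approximation the excerpt refers to.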
132 | Keeping neural networks simple by minimizing the description length of the weights - Hinton, Camp - 1993 |
126 |
A Language and Program for Complex Bayesian Modelling
- Gilks, Thomas, et al.
- 1994
Citation Context ... Monte Carlo algorithms have been developed (see Neal, 1993) and applied to the inference problem in graphical models (Dagum & Luby, 1993; Fung & Favero, 1994; Gilks, Thomas, & Spiegelhalter, 1994; Jensen, Kong, & Kjærulff, 1995; Pearl, 1988). Advantages of these algorithms include their simplicity of implementation and theoretical guarantees of convergence. The disadvantages of the Monte Carl...
126 | Mean field theory for sigmoid belief networks - Saul, Jaakkola, et al. - 1996 |
Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base. II: Evaluation of diagnostic performance. Methods of Information in Medicine
- Shwe, Middleton, et al.
- 1991
Citation Context ...noisy-OR model; see Pearl, 1988). Unfortunately, these coupling terms can lead to an exponential growth in inferential complexity. Considering a set of standard diagnostic cases (the “CPC cases”; see Shwe et al., 1991), Jaakkola and Jordan (1999b) found that the median size of the maximal clique of the moralized QMR-DT graph is 151.5 nodes. Thus even without considering the triangulation step, we see that diagnost...
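For reference, the noisy-OR conditional mentioned in this excerpt has the standard form below (the notation fi for a finding, dj for its parent diseases, and qij for the inhibition probabilities is assumed here for illustration, not quoted from the paper):
\[
P(f_i = 0 \mid d) \;=\; (1 - q_{i0}) \prod_{j \in \mathrm{pa}(i)} (1 - q_{ij})^{d_j}
\;=\; \exp\Bigl\{-\theta_{i0} - \sum_{j \in \mathrm{pa}(i)} \theta_{ij}\, d_j\Bigr\},
\qquad \theta_{ij} = -\ln(1 - q_{ij}).
\]
It is the positive findings fi = 1, whose probability is one minus this product, that couple the diseases and drive up the clique sizes described above.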
116 |
Valuation-based systems for Bayesian decision analysis
- Shenoy
- 1992
Citation Context ...late P(H | E): P(H | E) = P(H, E) / P(E): (1) General exact inference algorithms have been developed to perform this calculation (Jensen, 1996; Shachter, Andersen, & Szolovits, 1994; Shenoy, 1992); these algorithms take systematic advantage of the conditional independencies present in the joint distribution as inferred from the pattern of missing edges in the graph. We often also wish to calc...
103 | Exploiting tractable substructures in intractable networks
- Saul, Jordan
- 1996
Citation Context ...hods in a way that emphasizes their links to exact methods. Indeed, as we will see, exact methods often appear as subroutines within an overall variational approximation (cf. Jaakkola & Jordan, 1996; Saul & Jordan, 1996). It should be acknowledged at the outset that there is as much “art” as there is “science” in our current understanding of how variational methods can be applied to probabilistic inference. Variatio...
95 | Bounded conditioning: Flexible inference for decisions under scarce resources - Horvitz, Suermondt, et al. - 1989 |
61 | Bayesian Methods for Mixtures of Experts
- Waterhouse, McKay, et al.
- 1996
Citation Context ...garithm of the marginal likelihood; a key quantity in Bayesian model selection and model averaging. More recently, the ensemble learning approach has been applied to mixture of experts architectures (Waterhouse, MacKay, & Robinson, 1996) and hidden Markov models (MacKay, 1997). One interesting aspect of these applications is that they do not assume any particular parametric family for Q, rather they make the nonparametric assumption...
59 | Variational methods for inference and estimation in graphical models. Doctoral dissertation
- Jaakkola
- 1997
Citation Context ...1. Convex duality and the KL divergence We can also justify the choice of KL divergence by making an appeal to convex duality theory, thereby linking the block approach with the sequential approach (Jaakkola, 1997). Consider, for simplicity, the case of discrete-valued nodes H. The distribution Q(H | E, λ) can be viewed as a vector of real numbers, one for each configuration of the variables...
48 |
Backward simulation in Bayesian networks
- Fung, Favero
- 1994
Citation Context ...methods. A variety of Monte Carlo algorithms have been developed (see Neal, 1993) and applied to the inference problem in graphical models (Dagum & Luby, 1993; Fung & Favero, 1994; Gilks, Thomas, & Spiegelhalter, 1994; Jensen, Kong, & Kjærulff, 1995; Pearl, 1988). Advantages of these algorithms include their simplicity of implementation and theoretical guarantees of convergenc...
45 | Hidden Markov decision trees
- Jordan, Ghahramani, et al.
- 1997
Citation Context ... variational algorithm for the higher-order HMM see Saul and Jordan (1996). 3.7. Hidden Markov decision trees Finally, we consider a model in which a decision tree is endowed with Markovian dynamics (Jordan, Ghahramani, & Saul, 1997). A decision tree can be viewed as a graphical model by modeling the decisions in the tree as multinomial random variables, one for each level of the decision tree. Referring to figure 13, and focusi...
45 | Global conditioning for probabilistic inference in belief networks
- Shachter, Andersen, et al.
- 1994
Citation Context ...set of evidence nodes, we wish to calculate P(H | E): P(H | E) = P(H, E) / P(E): (1) General exact inference algorithms have been developed to perform this calculation (Jensen, 1996; Shachter, Andersen, & Szolovits, 1994; Shenoy, 1992); these algorithms take systematic advantage of the conditional independencies present in the joint distribution as inferred from the pattern of missing edges in the graph. We often als...
44 | Localized partial evaluation of belief networks - Draper, Hanks - 1994 |
42 |
Search-Based Methods Bound Diagnostic Probabilities in Very Large Belief Nets
- Henrion
- 1991
Citation Context ...entify and exploit such situations. Examples include the pruning algorithms of Kjærulff (1994), the “bounded conditioning” method of Horvitz, Suermondt, and Cooper (1989), search-based methods (e.g., Henrion, 1991), and the “localized partial evaluation” method of Draper and Hanks (1994). A virtue of all of these methods is that they are closely tied to the exact methods and thus are able to take full advantag...
42 | Computing upper and lower bounds on likelihoods in intractable networks
- Jaakkola, Jordan
- 1996
Citation Context ...in slightly different ways and obtain different update equations. (Yet another related variational approximation for the sigmoid belief network, including both upper and lower bounds, is presented in Jaakkola and Jordan, 1996). Finally, we can compute the gradient with respect to the parameters θij for fixed variational parameters μ and ξ. The result obtained by Saul and Jordan (1999) takes the following form: Δθij ∝ (μ...
41 | Switching state-space models
- Ghahramani, Hinton
- 1996
Citation Context ...presented by Saul and Jordan (1996), as a refined version of mean field theory for Markov random fields, and has been developed further in a number of recent studies (e.g., Ghahramani & Jordan, 1997; Ghahramani & Hinton, 1996; Jordan et al., 1997). In the block approach, we begin by identifying a substructure in the graph of interest that we know is amenable to exact inference methods (or, more generally, to efficient app...
41 | Improving the mean field approximation via the use of mixture distributions
- Jaakkola, Jordan
- 1997
Citation Context ... so as to make the approximation as tight as possible. It is not difficult to verify that products of the expression in Eq. (32) yield an overall bound that is a convex function of the λi parameters (Jaakkola & Jordan, 1999b). Thus standard optimization algorithms can be used to find good choices for the λi. Figure 18 shows results from Jaakkola and Jordan (1999b) for approximate inference on four of the “CPC cases” th...
30 | Approximating posterior distributions in belief networks using mixtures - Bishop, Lawrence, et al. - 1998 |
30 | A hierarchical community of experts - Hinton, Sallans, et al. - 1999 |
29 | Recursive algorithms for approximating probabilities in graphical models
- Jaakkola, Jordan
- 1997
Citation Context ... bounds on marginal probabilities. We will also discuss a more general variational algorithm that provides upper and lower bounds on probabilities (marginals and conditionals) for Boltzmann machines (Jaakkola & Jordan, 1997a). 3.4. Hidden Markov models In this section, we briefly review hidden Markov models. The hidden Markov model (HMM) is an example of a graphical model in which exact inference is tractable; our purpo...
29 |
An empirical analysis of likelihood-weighting simulation on a large, multiply connected medical belief network
- Shwe, Cooper
- 1991
Citation Context ...ndings were treated exactly. Note that accurate values were obtained in less than a second. The figure also shows results from a state-of-the-art sampling algorithm (the likelihood-weighted sampler of Shwe and Cooper, 1991). The sampler required significantly more computer time than the variational method to obtain roughly comparable accuracy. Jaakkola and Jordan (1999b) also presented results for the entire corpus of...
24 | A statistical approach to decision tree modeling - Jordan - 1994 |
24 | Learning in Boltzmann trees - Saul, Jordan - 1994 |
21 |
Variational methods in statistics
- Rustagi
- 1976
Citation Context ... used as approximation methods in a wide variety of settings, including finite element analysis (Bathe, 1996), quantum mechanics (Sakurai, 1985), statistical mechanics (Parisi, 1988), and statistics (Rustagi, 1976). In each of these cases the application of variational methods converts a complex problem into a simpler problem, where the simpler problem is generally characterized by a decoupling of the degrees...
20 |
Blocking-Gibbs sampling in very large probabilistic expert systems
- Jensen, Kong, et al.
- 1995
Citation Context ...loped (see Neal, 1993) and applied to the inference problem in graphical models (Dagum & Luby, 1993; Fung & Favero, 1994; Gilks, Thomas, & Spiegelhalter, 1994; Jensen, Kong, & Kjærulff, 1995; Pearl, 1988). Advantages of these algorithms include their simplicity of implementation and theoretical guarantees of convergence. The disadvantages of the Monte Carlo approach are that the algorith...
18 | Reduction of computational complexity in Bayesian networks through removal of weak dependences - Kjærulff - 1994 |
18 |
Mean field networks that learn to discriminate temporally distorted strings
- Williams, Hinton
- 1991
Citation Context ... (Dean & Kanazawa, 1989; Kanazawa, Koller, & Russell, 1995). Here we consider a particularly simple variation on the HMM theme known as the “factorial hidden Markov model” (Ghahramani & Jordan, 1997; Williams & Hinton, 1991). The graphical model for a factorial HMM (FHMM) is shown in figure 9. The system is composed of a set of M chains indexed by m. Let the state node for the mth chain at time i be represented by X_i^(m)...
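The factorial structure sketched in this excerpt corresponds to the following joint distribution over the M hidden chains and the observations Yi (a standard FHMM factorization over T time steps, written here with assumed notation rather than quoted from the paper):
\[
P\bigl(\{X\},\{Y\}\bigr) \;=\;
\prod_{m=1}^{M} \Bigl[\, P\bigl(X_1^{(m)}\bigr) \prod_{i=2}^{T} P\bigl(X_i^{(m)} \mid X_{i-1}^{(m)}\bigr) \Bigr]
\;\prod_{i=1}^{T} P\bigl(Y_i \mid X_i^{(1)}, \ldots, X_i^{(M)}\bigr).
\]
Each chain evolves independently a priori; the chains become coupled only through the shared observations Yi, which is why the effective state space, and hence exact inference, scales exponentially in M.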
17 |
The limitations of deterministic Boltzmann machine learning
- Galland
- 1993
Citation Context ...ines and Boltzmann machines with “frustrated” interactions; these are networks whose potential functions embody constraints between neighboring nodes that cannot be simultaneously satisfied (see also Galland, 1993). In the case of sparse networks, exact algorithms can provide help; indeed, this observation led to the use of exact algorithms as subroutines within the “structured mean field” approach pursued by...