Exact and Monte Carlo calculations of integrated likelihoods for the latent class model

  • a CNRS & Université de Lille 1, Villeneuve d’Ascq, France
  • b INRIA, Orsay, France
  • c CNRS & Université de Technologie de Compiègne, Compiègne, France

Abstract

The latent class model, or multivariate multinomial mixture, is a powerful approach for clustering categorical data. It relies on a conditional independence assumption of the variables given the latent class to which each statistical unit belongs. In this paper, we exploit the fact that a fully Bayesian analysis with Jeffreys non-informative prior distributions presents no technical difficulty, and propose an exact expression of the integrated complete-data likelihood, which is known to be a meaningful model selection criterion in a clustering perspective. Similarly, a Monte Carlo approximation of the integrated observed-data likelihood can be obtained in two steps: an exact integration over the parameters is followed by an approximation of the sum over all possible partitions through an importance sampling strategy. The exact and approximate criteria are then compared experimentally with their standard asymptotic BIC approximations for choosing the number of mixture components. Numerical experiments on simulated data and a biological example show that the asymptotic criteria are usually dramatically more conservative than the non-asymptotic criteria presented here, not only for moderate sample sizes as expected, but also for quite large sample sizes. This research highlights that standard asymptotic criteria may often fail to select interesting structures present in the data.

Keywords

  • Categorical data;
  • Bayesian model selection;
  • Jeffreys conjugate prior;
  • Importance sampling;
  • EM algorithm;
  • Gibbs sampler

1. Introduction

The standard model for clustering observations described by categorical variables is the so-called latent class model (see for instance Goodman, 1974). This model assumes that the observations arise from a mixture of multivariate multinomial distributions and that the variables are conditionally independent given the clusters. It has proved successful in many practical situations (see for instance Aitkin et al., 1981).
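To make the conditional independence assumption concrete, here is a minimal sketch of sampling from such a model: a latent class is drawn for each unit, then each categorical variable is drawn independently given that class. All numerical values (mixing proportions, class-conditional probabilities) are illustrative and not taken from the paper.

```python
# Illustrative sketch: simulate from a latent class model (multivariate
# multinomial mixture with conditional independence given the class).
import random

random.seed(0)

pi = [0.4, 0.6]                      # mixing proportions over g = 2 classes
# alpha[k][j][h]: probability of level h for variable j within class k
alpha = [
    [[0.8, 0.2], [0.1, 0.6, 0.3]],   # class 1: one binary, one 3-level variable
    [[0.3, 0.7], [0.5, 0.2, 0.3]],   # class 2
]

def draw(probs):
    """Sample an index according to the probability vector `probs`."""
    u, cum = random.random(), 0.0
    for h, p in enumerate(probs):
        cum += p
        if u < cum:
            return h
    return len(probs) - 1

def simulate(n):
    data, labels = [], []
    for _ in range(n):
        k = draw(pi)                                        # latent class
        # conditional independence: each variable drawn separately given k
        x = [draw(alpha[k][j]) for j in range(len(alpha[k]))]
        data.append(x)
        labels.append(k)
    return data, labels

data, labels = simulate(500)
```

In a clustering context only `data` would be observed; the labels in `labels` play the role of the latent partition to be recovered.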

In this paper, we consider the problem of choosing a relevant latent class model. In the Gaussian mixture context, the BIC criterion (Schwarz, 1978) gives a reasonable answer to the important problem of choosing the number of mixture components (see for instance Fraley and Raftery, 2002). However, previous work on the latent class model in the binary case (see for instance Nadif and Govaert, 1998) suggests that, in practice, BIC needs a particularly large sample size to reach its expected asymptotic behaviour, and any criterion relying on the asymptotic BIC approximation may suffer from the same limitation. In this paper, we take advantage of the possibility of avoiding asymptotic approximations of integrated likelihoods to propose alternative non-asymptotic criteria.
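As a reminder of the asymptotic criterion at stake, the following sketch computes BIC for a latent class model with g classes and categorical variables with m_j levels each; the log-likelihood value and the level counts are placeholders, not values from the paper.

```python
# Illustrative sketch: the BIC criterion for a latent class model.
# `loglik` would come from an EM fit; here it is a placeholder value.
import math

def latent_class_dim(g, levels):
    """Number of free parameters: (g - 1) mixing proportions plus
    g * sum_j (m_j - 1) conditional multinomial probabilities."""
    return (g - 1) + g * sum(m - 1 for m in levels)

def bic(loglik, g, levels, n):
    """BIC approximation of the log integrated likelihood (to be maximised):
    loglik - (nu / 2) * log(n), with nu the number of free parameters."""
    nu = latent_class_dim(g, levels)
    return loglik - 0.5 * nu * math.log(n)

# e.g. comparing g = 2 vs g = 3 on n = 500 units with levels (2, 3, 3):
bic_2 = bic(-1450.0, 2, (2, 3, 3), 500)
bic_3 = bic(-1440.0, 3, (2, 3, 3), 500)
```

The penalty term grows with both the number of classes and the sample size, which is precisely where the asymptotic approximation can become too conservative for moderate n.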

Indeed, a conjugate Jeffreys non-informative prior distribution is available for the latent class model parameters (contrary to what happens for Gaussian mixture models), and integrating the complete-data likelihood leads to a closed-form formula. Thus, the integrated complete-data likelihood proposed as a Bayesian clustering criterion in Biernacki et al. (2000) can be computed exactly and easily, without any BIC-type approximation. Moreover, the integrated observed-data likelihood, more commonly called the marginal likelihood (see for instance Frühwirth-Schnatter, 2006), can be approximated non-asymptotically in two steps: an exact integration of the complete-data distribution over the parameters is followed by an approximation of the sum over all possible partitions, which yields the marginal distribution of the observed data. This approximation relies on a Bayesian importance sampling strategy, whose instrumental distribution is derived in a natural way from the Gibbs sampler that, thanks to conjugacy, efficiently implements Bayesian inference for this model.
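The closed-form integration mentioned above can be sketched as follows: with Dirichlet(1/2, ..., 1/2) Jeffreys priors, conjugacy reduces the integrated complete-data likelihood to a product of ratios of Dirichlet normalising constants, one for the mixing proportions and one per (class, variable) multinomial. The notation below is ours and the function names are illustrative, not necessarily those of the paper.

```python
# Illustrative sketch: exact integrated complete-data likelihood under
# Jeffreys Dirichlet(1/2, ..., 1/2) priors, via conjugacy.
from math import lgamma

def log_dirichlet_ratio(counts, a=0.5):
    """log of the multinomial likelihood integrated against a symmetric
    Dirichlet(a, ..., a) prior, given the observed level counts."""
    s = sum(counts)
    m = len(counts)
    return (lgamma(m * a) - m * lgamma(a)
            + sum(lgamma(c + a) for c in counts) - lgamma(s + m * a))

def log_integrated_complete_likelihood(data, z, g, levels):
    """data: rows of categorical codes; z: class labels in 0..g-1;
    levels: number of levels m_j of each variable j."""
    n_k = [sum(1 for zk in z if zk == k) for k in range(g)]
    total = log_dirichlet_ratio(n_k)              # term for the labels
    for k in range(g):
        for j, m in enumerate(levels):
            counts = [0] * m
            for x, zk in zip(data, z):
                if zk == k:
                    counts[x[j]] += 1
            total += log_dirichlet_ratio(counts)  # variable j within class k
    return total
```

For a sanity check, with a single class and one binary variable observed at levels 0 and 1, the function returns log(1/8), which matches the direct Beta(1/2, 1/2) integral of theta * (1 - theta).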

The main purpose of this paper is to present these non-asymptotic Bayesian model selection criteria for the latent class model and to compare them with their asymptotic versions. It also gives the opportunity to highlight the important difference between the complete-data and observed-data criteria.

The paper is organised as follows. In Section 2, the standard latent class model is described, and maximum likelihood (ML) and non-informative Bayesian inference are briefly sketched. The exact integrated complete-data likelihood and the approximate integrated observed-data likelihood are described in Sections 3 and 4, respectively. Numerical experiments on both simulated and real data sets for selecting a relevant number of mixture components are presented in Section 5. A discussion section ends the paper by summarising the pros and cons of each evaluated strategy, in order to help practitioners make their choice; it also outlines some possible extensions of this work.