Ensemble learning is a relatively new concept suggested by Hinton and van Camp in 1993 [23]. It allows the true posterior distribution to be approximated with a tractable distribution that is fitted to the actual probability mass, without any intermediate point estimates.
The posterior distribution $p(\theta \mid X, \mathcal{H})$ of the model parameters $\theta$ is approximated with another distribution, or approximating ensemble, $q(\theta)$. The objective function chosen to measure the quality of the approximation is essentially the same cost function as the one for the EM algorithm in Equation (3.10) [38]
$$
C = \mathrm{E}_{q(\theta)}\!\left[ \ln \frac{q(\theta)}{p(X, \theta \mid \mathcal{H})} \right] = \int q(\theta) \ln \frac{q(\theta)}{p(X, \theta \mid \mathcal{H})} \, d\theta .
$$
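For concreteness, the expectation above can be estimated simply by sampling from the approximating ensemble. The following sketch assumes a toy model (a Gaussian prior on a scalar parameter, a Gaussian likelihood and a Gaussian $q$); the model, the data and all parameter values are illustrative choices, not part of the original derivation.
\begin{verbatim}
# Monte Carlo sketch of the ensemble-learning cost function
#   C = E_q[ ln q(theta) - ln p(X, theta | H) ]
# for an assumed toy model: prior theta ~ N(0, 1), likelihood x_i ~ N(theta, 1),
# and approximating ensemble q(theta) = N(m, s^2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(0.5, 1.0, size=20)            # illustrative data

def cost(m, s, n_samples=50_000):
    """Estimate C for q(theta) = N(m, s^2) by sampling from q."""
    theta = rng.normal(m, s, size=n_samples)
    log_q = stats.norm.logpdf(theta, m, s)
    log_prior = stats.norm.logpdf(theta, 0.0, 1.0)
    log_lik = stats.norm.logpdf(X[:, None], theta, 1.0).sum(axis=0)
    return np.mean(log_q - (log_prior + log_lik))

print(cost(X.mean(), 0.3))                   # cost of one candidate ensemble
\end{verbatim}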
Ensemble learning is based on finding an optimal function to approximate another function. Such optimisation methods are called variational methods and therefore ensemble learning is sometimes also called variational learning [30].
A closer look at the cost function shows that, since $p(X, \theta \mid \mathcal{H}) = p(\theta \mid X, \mathcal{H})\, p(X \mid \mathcal{H})$, it can be represented as a sum of two simple terms
$$
C = \int q(\theta) \ln \frac{q(\theta)}{p(\theta \mid X, \mathcal{H})} \, d\theta - \ln p(X \mid \mathcal{H}) = D\!\left( q(\theta) \,\|\, p(\theta \mid X, \mathcal{H}) \right) - \ln p(X \mid \mathcal{H}). \tag{3.12}
$$
The first term in Equation (3.12) is the Kullback-Leibler divergence between the approximate posterior $q(\theta)$ and the true posterior $p(\theta \mid X, \mathcal{H})$. A simple application of Jensen's inequality [52] shows that the Kullback-Leibler divergence $D(q \,\|\, p)$ between two distributions $q(x)$ and $p(x)$ is always nonnegative:
$$
D(q \,\|\, p) = \int q(x) \ln \frac{q(x)}{p(x)} \, dx = -\int q(x) \ln \frac{p(x)}{q(x)} \, dx \geq -\ln \int q(x) \frac{p(x)}{q(x)} \, dx = -\ln \int p(x) \, dx = 0. \tag{3.13}
$$
The Kullback-Leibler divergence is not symmetric and does not obey the triangle inequality, so it is not a metric. Nevertheless, it can be considered a kind of distance measure between probability distributions [12].
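Both properties are easy to verify for univariate Gaussians, for which the divergence has a well-known closed form. The short sketch below uses arbitrarily chosen parameters and only serves to illustrate nonnegativity and asymmetry.
\begin{verbatim}
# Kullback-Leibler divergence between univariate Gaussians,
#   D(N(m1, s1^2) || N(m2, s2^2))
#     = ln(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2.
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

print(kl_gauss(0.0, 1.0, 1.0, 2.0))   # D(q || p): nonnegative
print(kl_gauss(1.0, 2.0, 0.0, 1.0))   # D(p || q): different value, so not symmetric
\end{verbatim}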
Using the inequality in Equation (3.13) we find that the cost function is bounded from below by the negative logarithm of the evidence
$$
C \geq -\ln p(X \mid \mathcal{H}).
$$
Looking at this the other way round, the cost function gives a lower bound on the model evidence with
$$
p(X \mid \mathcal{H}) \geq e^{-C}.
$$
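This bound can be checked numerically on a model whose evidence is available in closed form. The sketch below uses an assumed conjugate Gaussian toy model (prior $\theta \sim N(0, \tau^2)$, likelihood $x_i \sim N(\theta, \sigma^2)$), for which both the exact evidence and the cost of a Gaussian ensemble can be written analytically; $\exp(-C)$ never exceeds the evidence and reaches it exactly when $q$ equals the true posterior.
\begin{verbatim}
# Numerical check of p(X | H) >= exp(-C) on an assumed conjugate Gaussian model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
tau2, sigma2, n = 1.0, 0.5, 10
X = rng.normal(0.3, np.sqrt(sigma2), size=n)

# Exact log evidence: X is jointly Gaussian with covariance sigma2*I + tau2*11^T.
cov = sigma2 * np.eye(n) + tau2 * np.ones((n, n))
log_evidence = stats.multivariate_normal.logpdf(X, mean=np.zeros(n), cov=cov)

def cost(m, s2):
    """Closed-form C for the Gaussian ensemble q(theta) = N(m, s2)."""
    e_log_q = -0.5 * np.log(2 * np.pi * np.e * s2)          # E_q[ln q] = -entropy
    e_log_prior = -0.5 * np.log(2 * np.pi * tau2) - (m**2 + s2) / (2 * tau2)
    e_log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                       - ((X - m)**2 + s2) / (2 * sigma2))
    return e_log_q - e_log_prior - e_log_lik

# Exact posterior of the conjugate model.
s2_post = 1.0 / (1.0 / tau2 + n / sigma2)
m_post = s2_post * X.sum() / sigma2

for m, s2 in [(0.0, 1.0), (m_post, 2 * s2_post), (m_post, s2_post)]:
    print(f"-C = {-cost(m, s2):.4f}  <=  ln p(X|H) = {log_evidence:.4f}")
\end{verbatim}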
An important feature for the practical use of ensemble learning is that the cost function and its derivatives with respect to the parameters of the approximating distribution can be evaluated easily for many models. Hinton and van Camp [23] used a separable Gaussian approximating distribution for a single-hidden-layer MLP network. Since then, many authors have used the method in different applications.
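As a minimal illustration (continuing the assumed Gaussian toy model of the sketches above, not a model from the original text), the derivatives of the cost with respect to the mean $m$ and variance $s^2$ of a Gaussian ensemble $q(\theta) = N(m, s^2)$ are available in closed form,
$$
\frac{\partial C}{\partial m} = \frac{m}{\tau^2} + \sum_{i=1}^{n} \frac{m - x_i}{\sigma^2},
\qquad
\frac{\partial C}{\partial s^2} = -\frac{1}{2 s^2} + \frac{1}{2 \tau^2} + \frac{n}{2 \sigma^2},
$$
and setting them to zero recovers exactly the mean and variance of the true posterior, which in this simple conjugate case is itself Gaussian.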