In their original paper [23], Hinton and van Camp
approached ensemble learning from an information-theoretic point of
view by using the Minimum Description Length (MDL)
Principle [61]. They developed a new coding method
for noisy parameter values, which led to the cost function of
Equation (3.11). This allows interpreting the cost
in Equation (3.11) as a description length
for the data under the chosen model.
The MDL principle asserts that the best model for the given data is the
one that attains the shortest description of the data. The
description length is measured in bits and represents the
length of the message needed to transmit the data. The idea is that
one builds a model for the data and then sends the description of that
model together with the residual of the data that could not be modelled. Thus
the total description length is
\begin{equation*}
  L(\text{data}) = L(\text{model}) + L(\text{error}).
\end{equation*}
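To make the trade-off in this sum concrete, consider a purely illustrative example that is not from the original text: fitting noisy observations with a polynomial of degree $k$, the two terms pull in opposite directions,
\begin{equation*}
  L(\text{data}) = \underbrace{L(\text{model})}_{\text{grows with } k}
                 + \underbrace{L(\text{error})}_{\text{shrinks with } k},
\end{equation*}
and the MDL-optimal degree is the one that minimises the sum: an overly complex model is penalised by the first term, an overly simple one by the second.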
The code length is related to probability because, according to the
coding theorem, an event having probability $p$ can be
coded using $-\log_2 p$ bits, assuming both the sender and the
receiver know the distribution $p$.
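As a concrete check of this relation (an illustration of ours, not from the paper), consider a source emitting three symbols with probabilities $1/2$, $1/4$ and $1/4$:
\begin{equation*}
  -\log_2 \tfrac{1}{2} = 1 \text{ bit}, \qquad
  -\log_2 \tfrac{1}{4} = 2 \text{ bits},
\end{equation*}
so the expected code length is $\tfrac{1}{2} \cdot 1 + \tfrac{1}{4} \cdot 2 + \tfrac{1}{4} \cdot 2 = 1.5$ bits per symbol, which is exactly the Shannon entropy of the distribution and the lower bound on the average code length.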
In their article, Hinton and van Camp developed a method for encoding the parameters of the model in such a way that the expected code length is exactly the cost given by Equation (3.11). The derivation of this result can be found in the original paper by Hinton and van Camp [23] or in the doctoral thesis of Harri Valpola [57].
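The following sketch outlines why such a noisy encoding yields this expected code length. It assumes, as is standard in ensemble learning, that Equation (3.11) is the Kullback--Leibler based cost; the notation $q(\boldsymbol{\theta})$ for the noisy encoding distribution, $p(\boldsymbol{\theta})$ for the prior and $p(X \mid \boldsymbol{\theta})$ for the likelihood is ours:
\begin{align*}
  L &= \underbrace{E_q\!\left[ -\log_2 p(X \mid \boldsymbol{\theta}) \right]}_{\text{residual of the data}}
     + \underbrace{E_q\!\left[ \log_2 \frac{q(\boldsymbol{\theta})}{p(\boldsymbol{\theta})} \right]}_{\text{net cost of the noisy parameters}} \\
    &= E_q\!\left[ \log_2 \frac{q(\boldsymbol{\theta})}{p(X, \boldsymbol{\theta})} \right].
\end{align*}
The second term is the Kullback--Leibler divergence between $q(\boldsymbol{\theta})$ and the prior, that is, the cost of transmitting the noisy parameters once the random ``bits back'' carried by the noise have been refunded to the sender.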