The theory of unsupervised learning is much less developed, and more complicated, than the theory of supervised learning, especially for multi-layer networks. Therefore, a lot of work on such feature learning uses supervised learning: labels or categories are somehow obtained for each image, and the network learns the connection between the images and their categories by ordinary supervised learning. The bottleneck here is obtaining sufficient amounts of such data with category labels. It is difficult because somebody has to say what each photograph depicts; if the labels are given by humans, that is a lot of work (although a simple approximation is to extract the labels from the captions that are sometimes attached to images on the internet).

Current research is therefore strongly focused on finding methods to train multi-layer neural networks without labels, that is, in an unsupervised way. A particularly promising approach is called self-supervised learning, which means performing unsupervised learning by reformulating the problem as supervised learning. Basically, you create hypothetical outputs, or a hypothetical classification problem, and use them to train your ordinary supervised, input–output neural network. The possibilities are unlimited: you could define the input to the neural network to be a degraded version of your data and the output to be your real data, where the degraded version could be obtained by adding noise, or by making a colour image black-and-white (Vincent et al., 2008; Larsson et al., 2017). Or, the “degraded” data could actually be artificially generated: then you train the neural network to distinguish between the real and the artificial data (Gutmann and Hyvärinen, 2012). For example, in video data, you could randomly shuffle the time frames in a video, or scramble the audio in a video with sound, and train the neural network to classify such scrambled data versus the original data (Hyvärinen and Morioka, 2017; Misra et al., 2016; Arandjelovic and Zisserman, 2017). In each case, the neural network has to learn something about the structure of the data in order to perform the task, that is, to reconstruct the original images from degraded ones, or to tell which data is real and which is noise. The multi-layer processing thus learned is reasonably similar to what is computed in the brain (Zhuang et al., 2021). However, it should be noted that self-supervised learning in itself gives only features; it does not give a proper Bayesian prior model except in some special cases, such as the “noise-contrastive estimation” of Gutmann and Hyvärinen (2012), and nonlinear versions of independent component analysis (Hyvärinen and Morioka, 2016; Khemakhem et al., 2020).
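The denoising idea above can be sketched in a few lines of code. This is a minimal illustration, not any of the cited authors' actual methods: the synthetic two-dimensional data, the noise level, and the use of a simple linear map in place of a multi-layer network are all assumptions made here to keep the example self-contained. The point is only that the "labels" come for free, because the target is the clean data itself.

```python
import numpy as np

# Minimal sketch of denoising as self-supervision (in the spirit of
# Vincent et al., 2008). We corrupt the data with additive noise and
# train a model to map the corrupted version back to the original.
# A linear denoiser stands in for a real multi-layer network; the
# synthetic data below is a hypothetical stand-in for images.

rng = np.random.default_rng(0)

# "Clean" data: 1000 correlated 2-D samples driven by one latent factor.
n, d = 1000, 2
latent = rng.normal(size=(n, 1))
clean = np.hstack([latent, latent]) + 0.1 * rng.normal(size=(n, d))

# Degraded version: clean data plus noise (the self-supervised input).
noisy = clean + 0.5 * rng.normal(size=(n, d))

# "Supervised" training step: least-squares fit from noisy to clean.
W, *_ = np.linalg.lstsq(noisy, clean, rcond=None)

# The learned map exploits the correlation structure of the data,
# so it reconstructs better than leaving the noisy input untouched.
err_noisy = np.mean((noisy - clean) ** 2)
err_denoised = np.mean((noisy @ W - clean) ** 2)
print(err_denoised < err_noisy)  # prints True
```

Because the identity map is one candidate solution of the least-squares problem, the fitted denoiser can never do worse than the raw noisy input; any improvement beyond that reflects structure the model has picked up from the data, which is exactly what self-supervised feature learning relies on.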