Convolutional neural network


A convolutional neural network is a type of neural network in which data is “scanned” from different vantage points. For purposes of illustration, this description presumes it is being used with image data to perform image recognition, although it also has applications in other domains including natural language processing. Its main distinguishing features are:

  • Input neurons are arranged and connected in a way that reflects the input data, e.g. in spatial dimensions if the input data consists of images.
  • Individual features (e.g. typically occurring parts of a boat) are learned using filters that play a similar role to the probability-based neurons in the hidden layer of a Restricted Boltzmann Machine. Filters are fed information about an image through a filter-sized window that is slid across that image, and this window typically covers an area much smaller than the input images. If the input image is 3000x3000 pixels and the filters are 5x5 pixels, a large number of 5x5 pixel sections of the image (which may or may not overlap with one another, depending on how the network has been configured) will be presented individually to the filters. (Stride is one of several hyperparameters and refers to the size of the steps the filter-sized window takes as it slides across the image.) Filters are initialized randomly and then trained using backpropagation, so that each filter converges upon whichever frequently occurring feature in the training data happens to be closest to its initial state. For example, if the training data consists of handwritten letters, the individual filters might learn the various lines and curves that can form parts of letters. The output of a filter is a scalar number expressing how well what the filter has just been shown matches whatever it had previously learned.
  • The outputs from the first convolutional layer (row of filters) are typically presented to second and subsequent convolutional layers with increasing filter sizes. A second-layer filter processing a given area of an image receives the information that the first-layer filters produced when they processed the smaller areas contained within that area. If filters in the first convolutional layer have recognized a nose, a mouth, eyes and ears, and all these outputs are passed through to a second convolutional layer whose filter size is 10-20 times that used in the first convolutional layer, a suitably trained second-layer filter will be able to recognize a face.
  • Left unchecked, the fact that each pixel in a typically already high-resolution image is processed multiple times by each convolutional layer would make backpropagation infeasible because of the amount of relevant information each layer would send to the subsequent layer. Pooling is a technique used to generalize the inputs to a convolutional layer in order to get rid of superfluous information. The idea is the same as what happens in any situation where the size of a digital image is reduced: if the size in each dimension is halved, each four-pixel square in the original image has to be somehow transformed into a single pixel in the new image. This obvious type of pooling can be used in a CNN to simplify the input to the first convolutional layer, but more important is its use between convolutional layers. In our example above, the ‘face’ filter in the second layer does not need to process all the outputs of the smaller ‘nose’ filter for every square where the ‘nose’ filter was positioned within the area that the ‘face’ filter is now analyzing: it only needs the highest output that the ‘nose’ filter produced anywhere within the larger area as a measure of how likely it is that that larger area contains a nose. In accordance with this intuition, the most frequent type of pooling used in between convolutional layers is max-pooling, where the highest input value is used as a generalized output value.
  • The final convolutional layer then acts as input to subsequent layers of a standard multi-layer perceptron.
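The sliding-window mechanics described above can be sketched in a few lines of NumPy. This is a simplified illustration, not a production implementation: a single filter, a single channel, and a hypothetical vertical-edge detector chosen purely for demonstration. At each window position the filter produces one scalar expressing how well that patch matches the filter's pattern, and the stride parameter controls how far the window moves between positions.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide a filter-sized window across the image; at each position,
    output one scalar: the dot product of the window with the filter."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = image[i * stride:i * stride + kh,
                           j * stride:j * stride + kw]
            out[i, j] = np.sum(window * kernel)
    return out

# A 6x6 image containing a vertical edge, and a 3x3 vertical-edge filter.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
kernel = np.array([[-1., 0., 1.],
                   [-1., 0., 1.],
                   [-1., 0., 1.]])
response = convolve2d(image, kernel, stride=1)  # shape (4, 4)
```

The response is strongest (3.0) at the window positions straddling the edge and zero in the flat regions; a larger stride simply produces a smaller output map, e.g. `stride=3` yields a 2x2 response for this image.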
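Max-pooling, as described in the bullet list above, can be sketched as follows: each non-overlapping window of a feature map is replaced by its highest value, keeping the strongest evidence that a feature occurred anywhere in that window while discarding its exact position. The 2x2 window size here is an assumption for illustration (it is also a common choice in practice).

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Replace each size-by-size block of the feature map with its maximum."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # trim to a multiple of the window size
    trimmed = feature_map[:h, :w]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

# A 4x4 feature map (e.g. a filter's responses) pooled down to 2x2.
fm = np.array([[1., 3., 2., 0.],
               [4., 2., 1., 1.],
               [0., 1., 5., 2.],
               [2., 0., 1., 3.]])
pooled = max_pool(fm)  # each output value is the max of one 2x2 block
```

Halving each dimension this way quarters the amount of data passed to the next convolutional layer, which is exactly the generalization step the text describes between the 'nose' and 'face' layers.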

A standard convolutional neural network has at least two important limitations:

  • It can only recognise images of a similar size to those with which it was trained. The scale-invariant convolutional neural network or SiCNN attempts to solve this problem using a scale transformation applied to the filters.
  • It can only recognise a face as the coexistence of a nose, mouth, eyes and ears within the same area: a modern-art painting where the nose is on the side and the mouth is at the top will be classified in the same way as a real face. Given that this would seem to be a very serious shortcoming, standard CNNs actually perform surprisingly well. However, they are increasingly being outperformed by capsule networks or CapsNets. A capsule is a group of neurons with a similar function to a filter in a standard CNN, but information about where capsules in a given layer were triggered is passed through to subsequent layers, enabling them to model spatial relationships between features.

    This precludes using the simple max-pooling method described above. Instead, a capsule network “explains away” superfluous information using dynamic routing: the output transmitted from each capsule in a given row to each capsule in the subsequent row is weighted according to how well that output contributes to whatever feature the higher-level recipient capsule has learned compared to how well the same output contributes to the features learned by other higher-level recipient capsules in the same row.

    The mechanics of dynamic routing involve a feedback mechanism that somewhat counter-intuitively can still be trained using backpropagation. This is made mathematically possible because the output of each capsule is a normalized vector whose length expresses the probability that the capsule has just processed whatever feature it had previously learned (1 = certain, 0 = impossible): it is this normalization that facilitates comparison between capsules in the same row. The dimensions of each capsule tend to progress from representing simple pixel-based features in the earlier capsule layers to representing higher-order graphical features in the later layers, in a fashion that recalls the progression from low-level machine code to high-level object-oriented programming languages. This gives a capsule network the ability to learn to recognize the same 3D object viewed from different angles, a task at which standard CNNs typically fail.
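    The routing and normalization mechanics described above can be sketched as follows. This is an assumed, simplified version of routing-by-agreement: the `squash` nonlinearity and the fixed number of routing iterations follow the commonly described scheme rather than anything specified in this text, and the prediction vectors are random stand-ins for real capsule outputs. Each lower capsule's contribution to each higher capsule is weighted by a softmax, and the weights are raised wherever a prediction agrees with the higher capsule's emerging consensus.

```python
import numpy as np

def squash(v, axis=-1, eps=1e-9):
    """Normalize a vector so its length lies in [0, 1): length near 1 means
    the capsule is confident it has seen its feature, near 0 means it has not.
    The direction of the vector is preserved."""
    sq = np.sum(v ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * v / np.sqrt(sq + eps)

def dynamic_routing(u_hat, iterations=3):
    """u_hat: predictions from each lower capsule for each higher capsule,
    shape (n_lower, n_higher, dim). Returns the higher-capsule output
    vectors, shape (n_higher, dim)."""
    n_lower, n_higher, _ = u_hat.shape
    b = np.zeros((n_lower, n_higher))  # routing logits, start uniform
    for _ in range(iterations):
        # Each lower capsule distributes its output across the higher
        # capsules: softmax over the higher-capsule axis ("explaining away").
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = np.einsum('ij,ijd->jd', c, u_hat)  # weighted sum per higher capsule
        v = squash(s)
        # Raise the logit where a prediction agrees with the consensus.
        b += np.einsum('ijd,jd->ij', u_hat, v)
    return v

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 3, 4))   # 6 lower capsules, 3 higher, 4-dim
v = dynamic_routing(u_hat)
lengths = np.linalg.norm(v, axis=1)  # each length lies in [0, 1)
```

    Because every output length is squashed into [0, 1), the outputs of all capsules in a row are directly comparable as probabilities, which is the normalization the paragraph above relies on.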
