A perceptron is a type of neural network used for classification. The input is a vector of binary or real-valued numbers that is fed to a layer of input neurons; the output is a classification that is read from a layer of output neurons. The aim is that, when data is fed to the input neurons, each output neuron is activated to an extent that represents the probability that the input belongs to the class that neuron stands for.
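The mapping from input neurons to output probabilities can be sketched as follows. This is a minimal illustration for a single-layer network, not a reference implementation: the use of a softmax to turn output activations into probabilities is an assumption (the text only says that activations represent probabilities), and the weights, biases and function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw output-neuron scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(inputs, weights, biases):
    """Single-layer forward pass: one weighted sum per output neuron."""
    scores = [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)

# Hypothetical example: 3 input neurons, 2 output classes.
weights = [[0.5, -0.2, 0.1],
           [-0.3, 0.8, 0.0]]
biases = [0.0, 0.1]
probs = forward([1.0, 0.0, 1.0], weights, biases)
predicted_class = probs.index(max(probs))  # index of the most active output
```

Here each row of `weights` belongs to one output neuron, and the predicted class is simply the output neuron with the highest probability.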

Strictly speaking, the word perceptron refers to a single-layer network that corresponds to a simple linear classifier, but the term is commonly used to mean what should more properly be called a multilayer perceptron: perceptrons in productive use today have one or more hidden layers of neurons between the input layer (also called the retina) and the output layer. It can be mathematically proven that a perceptron with three layers (an input layer, a single hidden layer and an output layer) is capable, given enough hidden neurons, of representing a classification function of practically any complexity and structure. For this reason, for a long time most research was only carried out with a maximum of three layers, until it was discovered that adding more layers could make it easier for a perceptron to learn a complex function.

Perceptrons are typically trained using an iterative technique: the output a layer provides for a given input is compared to the target output (training goal) for that layer; the difference between the two, the error, is measured; the weights of the neurons within that layer are adjusted to reduce that error; and the procedure is repeated until the weights converge on an optimal set of values. One of the problems with perceptrons is that convergence may occur on a local optimum rather than on the global optimum, just as going downhill from a mountain peak will simply land you in the nearest valley rather than at the lowest point in the mountain range.
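The compare, measure, adjust and repeat loop described above can be sketched for the simplest case, a single output neuron with a binary threshold activation. The update used here is the classic perceptron learning rule, which is only one of several possible update rules; the function names, the learning rate of 1.0 and the choice of logical AND as training data are illustrative assumptions.

```python
def step(z):
    """Binary threshold activation."""
    return 1 if z >= 0 else 0

def train_single_layer(samples, learning_rate=1.0, max_epochs=100):
    """Train one output neuron with the perceptron learning rule."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in samples:
            output = step(sum(w * x for w, x in zip(weights, inputs)) + bias)
            error = target - output  # difference from the training goal
            if error != 0:
                mistakes += 1
                # Nudge each weight in the direction that reduces the error.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * error
        if mistakes == 0:  # a full pass without errors: converged
            break
    return weights, bias

# Logical AND is linearly separable, so a single layer can learn it.
and_samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_single_layer(and_samples)
predictions = [step(sum(w * x for w, x in zip(weights, inputs)) + bias)
               for inputs, _ in and_samples]
```

Because the data is linearly separable, the loop stops as soon as one full pass over the samples produces no errors. For data that is not linearly separable, this rule would simply run until `max_epochs` is reached.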

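Backpropagation is a mathematical technique that allows tuning to occur across multiple layers at once: the error measured at the output layer is passed backwards through the network using the chain rule, which yields the contribution of each weight in each layer to that error. One important restriction placed by backpropagation is that it cannot work if the binary threshold function is being used as the activation function, because the derivative of that function is zero everywhere except at the threshold itself, where it is undefined, so no usable error signal can be passed back. There are many variants on backpropagation as well as techniques that increase its performance, which should normally form a standard part of whatever library is being used as an implementation of the perceptron algorithm. With training based on backpropagation, the best results are usually obtained if the neurons within a perceptron are initialized with individually generated random weights, since neurons that start with identical weights receive identical updates and never learn to respond differently.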
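A minimal sketch of backpropagation for a perceptron with one hidden layer is shown below, trained on the XOR function, which a single-layer perceptron cannot learn. Everything specific here is an assumption made for illustration: the tanh hidden activation, the sigmoid output, the squared-error loss, plain per-sample gradient descent, and all names and default values. Library implementations typically add many refinements that are omitted here.

```python
import math
import random

XOR_DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

def train_xor(hidden_size=4, learning_rate=0.5, epochs=5000, seed=0):
    """Train a 2-input, one-hidden-layer network; return the loss per epoch."""
    rng = random.Random(seed)
    # Individually generated random weights break the symmetry between neurons.
    w_hidden = [[rng.uniform(-1.0, 1.0) for _ in range(2)]
                for _ in range(hidden_size)]
    b_hidden = [0.0] * hidden_size
    w_out = [rng.uniform(-1.0, 1.0) for _ in range(hidden_size)]
    b_out = 0.0
    losses = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for x, target in XOR_DATA:
            # Forward pass: input -> hidden (tanh) -> output (sigmoid).
            hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                      for row, b in zip(w_hidden, b_hidden)]
            z = sum(w * h for w, h in zip(w_out, hidden)) + b_out
            output = 1.0 / (1.0 + math.exp(-z))
            epoch_loss += 0.5 * (output - target) ** 2
            # Backward pass: the chain rule gives one error term per neuron.
            delta_out = (output - target) * output * (1.0 - output)
            delta_hidden = [delta_out * w * (1.0 - h * h)
                            for w, h in zip(w_out, hidden)]
            # Gradient-descent updates for both layers.
            w_out = [w - learning_rate * delta_out * h
                     for w, h in zip(w_out, hidden)]
            b_out -= learning_rate * delta_out
            for j in range(hidden_size):
                w_hidden[j] = [w - learning_rate * delta_hidden[j] * xi
                               for w, xi in zip(w_hidden[j], x)]
                b_hidden[j] -= learning_rate * delta_hidden[j]
        losses.append(epoch_loss)
    return losses

losses = train_xor()
```

The factors `output * (1.0 - output)` and `1.0 - h * h` are the derivatives of the sigmoid and tanh functions. With a binary threshold activation the corresponding factor would be zero, so every update would vanish, which is the restriction described above.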

Important considerations when using a perceptron with backpropagation include:

  • How many neurons to use in each layer. Too few neurons will lead to a perceptron that is unable to learn the classifications, while too many neurons will lead to a perceptron that will tend to overfit the training data.
  • How many hidden layers to use. As explained above, adding hidden layers can make complex functions easier to learn. However, each additional hidden layer tends to make backpropagation less effective, because the error signal weakens as it is passed back through more layers.
  • The learning rate, or the size of the reweightings that occur after each training step. It is good practice to reduce the learning rate as a perceptron converges on an optimum. One common recommendation is to do so in discrete steps, for example of one order of magnitude at a time, although schedules that reduce the rate continuously are also in use. Discriminative fine tuning is a promising technique that involves using a progressively higher learning rate for hidden layers closer to the output layer than for hidden layers closer to the input layer.
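  • The activation function to use. Backpropagation needs a function whose derivative is defined (at least almost everywhere) and is non-zero over a substantial part of its input range. The binary threshold function fails this test because its derivative is zero wherever it is defined. The rectified linear unit (ReLU) and hyperbolic tangent functions have been found to be appropriate for a range of common perceptron tasks, even though the derivative of ReLU is zero for negative inputs.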
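Two of the considerations above lend themselves to a short sketch: the derivatives of candidate activation functions, and learning-rate handling. The helper names, the drop interval of 30 epochs and the per-layer factor of 2.0 are hypothetical values chosen for illustration, not recommendations from the text.

```python
import math

def threshold_grad(x):
    """Derivative of the binary threshold: 0 wherever it is defined."""
    return 0.0

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, largest near x = 0."""
    return 1.0 - math.tanh(x) ** 2

def step_decay(initial_rate, epoch, epochs_per_drop=30, factor=0.1):
    """Cut the learning rate by `factor` every `epochs_per_drop` epochs."""
    return initial_rate * factor ** (epoch // epochs_per_drop)

def layer_rates(output_rate, n_layers, decay=2.0):
    """Discriminative rates, ordered from the input side to the output side.

    The layer nearest the output gets `output_rate`; each layer further
    towards the input gets a rate smaller by the factor `decay`.
    """
    return [output_rate / decay ** (n_layers - 1 - i) for i in range(n_layers)]

rates = layer_rates(0.01, 3)  # lowest rate first, highest rate last
```

With the default `factor` of 0.1, `step_decay` lowers the rate by one order of magnitude at each drop and keeps it constant in between.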

If a perceptron used in deep learning contains more hidden layers than can successfully be trained using the standard backpropagation procedure explained above, it is also possible to train the layers one by one using stacked autoencoders or restricted Boltzmann machines. Optionally, layers pretrained in this way can be used as the starting point for backpropagation, which is then much less likely to converge on a poor local minimum.

In a recurrent multi-layer perceptron, output information from one operation of the network is remembered in context neurons and reused as input information in the subsequent operation of the network. A Jordan network has one context neuron per output neuron and the context neurons feed back into the network as additional input neurons, while an Elman network has one context neuron per hidden or output neuron and each context neuron feeds back to all the non-context neurons within its own layer. The addition of memory yields perceptrons that are able to perform tasks with a temporal element such as speech recognition. However, these networks often fail to learn working models because inputs from older loops have progressively less impact on the final output, which reduces the effectiveness of backpropagation (a problem often referred to as the vanishing gradient problem). This means that a long short-term memory network is generally a better choice for modelling time-series data.
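The role of context neurons can be sketched with a simplified Elman-style network in which only the hidden layer has context neurons. This is an illustrative assumption rather than a full implementation of either network type: the weights are hypothetical, biases and training are omitted, and the function names are invented for this example.

```python
import math

def elman_step(x, context, w_in, w_context, w_out):
    """One operation of the network: returns (output, new_context)."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(w_in[j], x))
                  + sum(w * c for w, c in zip(w_context[j], context)))
        for j in range(len(w_in))
    ]
    output = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    # The hidden activations are remembered as the next step's context.
    return output, hidden

def run_sequence(sequence, w_in, w_context, w_out):
    """Feed a sequence of input vectors through the network in order."""
    context = [0.0] * len(w_in)  # empty memory before the first input
    outputs = []
    for x in sequence:
        output, context = elman_step(x, context, w_in, w_context, w_out)
        outputs.append(output)
    return outputs

# Hypothetical weights: 1 input neuron, 2 hidden neurons, 1 output neuron.
w_in = [[0.5], [-0.5]]
w_context = [[0.1, 0.2], [0.3, -0.1]]
w_out = [[1.0, -1.0]]
outputs = run_sequence([[1.0], [1.0]], w_in, w_context, w_out)
```

The same input is presented twice, but the two outputs differ because the second operation also sees the hidden activations remembered from the first. A Jordan-style variant would instead store the output values as context.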

MLP; multi-layer perceptron; Recurrent multi-layer perceptron; Elman network; Jordan network
has functional building block
has input data type: IDT_Binary vector; IDT_Vector of categorical variables; IDT_Vector of quantitative variables
has internal model: INM_Neural network
has output data type: ODT_Classification; ODT_Probability
has learning style
has parametricity: PRM_Nonparametric with hyperparameter(s)
has relevance
sometimes supports
mathematically similar to