2016 marks the 30th anniversary of publication of a remarkable paper. The seminal article entitled “Learning representations by back-propagating errors”, by David E. Rumelhart, Geoffrey E. Hinton and Ronald J. William, underlines the beginning of a new era in artificial intelligence.

That unpresumptuous letter published in the 323th edition of Nature, in October 1986, shed some light to a fundamental enigma that troubled many scientists for many preceding decades: how to make an Artificial Neural Network (ANN) learn just like a human brain?

At that time, analogies to biological neural circuits were the trend. The article humbly concludes with the following statement: “The learning procedure, in its current form, is not a plausible model of learning in brains. However...”

The proposed algorithm, today known simply as Backpropagation, has been time-honored as the standard training method for neural networks (up until recently).

How to accomplish a supervised training of a ANN when all is known is the state of input neurons and the desired network answer? How to correct a wrong answer? Imagine the complexity of measuring that single error and discovering the exact quantity of “guilt” shared by every single neuron in a net composed by many layers of neurons.

The solution is to back propagate that error from the last to the first layer of neurons. That propagation is realized through a successive derivation of carefully chosen mathematical functions that compose the network.

Roughly speaking an artificial neural network is made of 2 or more layers of neurons:

**Input Layer -> (optional) Hidden Layer(s) -> Output Layer**

Each layer is connected to the next by two mathematical functions (a linear and a non-linear function):

The linear function is the weighted sum of the *“evoked action potentials”* of all neurons in one layer connecting to a particular neuron in the next layer. Every connection has a weight.

The non-linear function is the *“threshold mediated”* response of that next layer neuron in reaction to these *dendrites* inflow (yes, this is another intentional similarity to a organic neuron).

The function measuring the network error is basically a modified version of the distance of the network answer to the desired answer.

The Backpropagation algorithm carefully chooses the *“threshold-mediated non-linear function”* and the *“error measuring function”*. Both need be *differentiable*. The “magic” is done by:

- finding the derivative of the error function in relation to the network output;
- deriving the network output function (i.e the non-linear function) in respect to the linear function;
- resolving the derivative of the linear function in respect to each weight that composes this linear function ->
**At this point we know the exact quantity of error in each of these neuron pair connections (put differently, how the net error changes in relation to each weight change);** - and finally adjusting each weight with that last piece of information;

The algorithm is recursive, meaning that, when the above steps are completed, step number 1 restarts from the penultimate neuron layer towards first. That presents another problem: how to obtain the network error to start again the algorithm since now the error is diluted through all neurons in the penultimate layer?

You must sum the derivatives of the error in respect to each weight binding the penultimate to the ante-penultimate layer. Now you can begin exactly as before.

In order to pay a tribute to that remarkable technology breakthrough I decided to implement the algorithm being as faithful to the article as possible. All 9 mathematical formulas are there *ipsis litteris* (or should I say *ipsis numeris*), including the elegantly chain-rule deduced derivatives. In code, I make brief comments of these formulas using the same notation references as on the paper from 1986. To be simple to read I refrained from using any third party libraries (including matrix libraries), so one should not expect much efficiency. Also, I decided to follow an object-oriented approach (in Java).

This simple implementation accepts any number of layers, with any size of neurons. The training can be tweaked with the alfa decay factor parameter which accelerates convergence to the “learnt” state of the network. It assumes all training data to be normalized between 0 and 1.

As predicted by the letter, the Backpropagation algorithm, not being a second order derivative aware method, sometimes stutters on local minima and does not guarantee to reach global minima.