The full source code for the neural network and the interactive demonstrations can be found on GitHub.
Increasing numbers of us are turning to solutions based on neural networks to help with a wide variety of tasks. From accelerating software development through to getting advice on personal issues, they are becoming almost the go-to tool for many people.
But how do they actually work?
I figured it would be fun, and perhaps helpful to others, to explore this through a series of practical, interactive, explained, examples of increasing complexity over a fairly short (5 part) series. And I promise that by part 5 we’ll have something pretty cool!
The first thing we’re going to see is that they are, in fact, fundamentally really simple and have absolutely nothing to do with actual neurons at all. The cynic in me can’t help but think that the AI folk love to dress up what they do in humanistic and biological language to convince people they are doing more profound work than they are. Also: it sells.
As a simple example we’re going to look at a neural network that can XOR two numbers together, but before we get to the interactive example a little bit of background is useful.
XOR
XOR is a simple bitwise operation that, if you’re a crusty old developer like me, you might remember as being a handy and performant way of drawing and removing sprites on 8-bit machines so that they didn’t erase the background or require a load of CPU time.
Given two input bits the output of the XOR operation is shown in the table below:
| Input A | Input B | Output |
|---------|---------|--------|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
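For comparison, computing XOR the conventional way is a single bitwise operator. This quick snippet (in the same TypeScript we’ll use throughout) reproduces the truth table directly:

```typescript
// XOR the conventional way: one bitwise operator, no neurons required
function xor(a: number, b: number): number {
  return a ^ b;
}

console.log(xor(0, 0), xor(0, 1), xor(1, 0), xor(1, 1)); // 0 1 1 0
```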
In this example we’re going to set up and train a neural network to handle an XOR calculation. Gross overkill, sure, and I don’t think this would be an effective way of drawing sprites on an 8-bit machine, but it makes for a great first example.
Neurons
First, neurons. A neuron is essentially a node in the network with one or more weighted inputs and a bias of its own. The inputs, the weights and the bias are all numbers. The neuron works by multiplying each input by its weight, adding the bias and then squashing the result into a number in the range 0 to 1. Without that squashing step the neuron is just doing basic arithmetic, and when we look at the network we’ll see that we stack layers of neurons; if we just stack layers of basic arithmetic we achieve nothing more than a single layer could. The squash is what gives the network its power.
For our simple XOR example we’re going to use a squashing function called sigmoid. These squashing functions are called activation functions and that’s how we’ll refer to them from here on. So for a neuron with two inputs, all it does is run this formula:
output = sigmoid((input1 * weight1) + (input2 * weight2) + bias)
While this isn’t going to become a maths fest, going forwards we’ll use mathematical notation, which would express the above like this:
$$output = \sigma\left((input_1 \times weight_1) + (input_2 \times weight_2) + bias\right)$$
At this point you might be asking: what’s this sigmoid function? It’s this:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Don’t worry too much about the formula — all it does is take any number, no matter how large or small, and map it to a value between 0 and 1. Large positive inputs give values close to 1, large negative inputs give values close to 0, and zero maps to exactly 0.5. It has a characteristic shape that you can see below:
In code the entire neuron is surprisingly compact:
```typescript
// Sigmoid: squash any number into 0..1
function sigmoid(x: number): number {
  return 1 / (1 + Math.exp(-x));
}

// A neuron: multiply each input by its weight, add bias, squash
function neuron(inputs: number[], weights: number[], bias: number): number {
  let sum = 0;
  for (let i = 0; i < inputs.length; i++) {
    sum += inputs[i] * weights[i];
  }
  sum += bias;
  return sigmoid(sum);
}
```
That’s it. Every neuron in every neural network, no matter how large, does this same operation.
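To make that concrete, here’s the formula running with some hand-picked example values. The weights and bias below are arbitrary illustrations, not trained values, and the two functions are repeated so the snippet runs on its own:

```typescript
function sigmoid(x: number): number {
  return 1 / (1 + Math.exp(-x));
}

function neuron(inputs: number[], weights: number[], bias: number): number {
  let sum = bias;
  for (let i = 0; i < inputs.length; i++) {
    sum += inputs[i] * weights[i];
  }
  return sigmoid(sum);
}

// With inputs of zero the weighted sum is just the bias, so a zero bias
// lands on sigmoid(0) = 0.5, dead centre of the output range
console.log(neuron([0, 0], [0.8, -0.4], 0)); // 0.5

// Any mix of inputs, weights and bias still lands strictly inside 0..1
console.log(neuron([1, 1], [0.8, -0.4], 0.1)); // ≈ 0.622
```

Whatever numbers go in, the output always lands strictly between 0 and 1 — that’s the squash at work.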
The Network
Ok, so that’s the neural part - what about the network part?
Essentially, neurons are arranged in layers: an input layer, one or more hidden layers, and an output layer, with every neuron in one layer connected to every neuron in the next. “Hidden layers” sounds very mystical but there’s nothing much hidden about them unless you are treating the network as a black box. They are simply the layers of neurons between the inputs and the output.
Our input neurons are really just numbers - they have no weights or biases - and for our XOR example we have two of them, one for each XOR input. We only need a single output neuron that, when trained, should give us something close to 0 or 1 as our answer. And we’re going to have a single hidden layer of 4 neurons.
This gives us our network topology - 2 input neurons, 4 hidden layer neurons each connected to both inputs, and a single output neuron connected to the 4 hidden layer neurons. When the network runs we run the calculation from earlier for each neuron in the layer and then, when all neurons in the layer have calculated their output, we move on to the next layer.
In code we represent this as arrays of neurons organised into layers. Each neuron stores its weights, bias, and its most recent output. The network is created with random weights — we’ll see why that matters shortly:
```typescript
interface Neuron {
  weights: number[];
  bias: number;
  output: number;
  net: number; // the weighted sum before sigmoid — we need this for backprop later
  delta: number; // the neuron's share of the blame, set during backprop
}

type Layer = Neuron[];

interface Network {
  layers: Layer[];
  inputCount: number;
}

function createNeuron(inputCount: number): Neuron {
  return {
    weights: Array.from({ length: inputCount }, () => Math.random() * 2 - 1),
    bias: Math.random() * 2 - 1,
    output: 0,
    net: 0,
    delta: 0,
  };
}

function createNetwork(topology: number[]): Network {
  const layers: Layer[] = [];
  for (let i = 1; i < topology.length; i++) {
    const inputCount = topology[i - 1];
    const layer = Array.from({ length: topology[i] }, () =>
      createNeuron(inputCount)
    );
    layers.push(layer);
  }
  return { layers, inputCount: topology[0] };
}

// Create our XOR network: 2 inputs, 4 hidden, 1 output — 17 parameters total
const network = createNetwork([2, 4, 1]);
```
The forward pass — running inputs through the network — is just applying the neuron formula to each layer in sequence:
```typescript
function forward(network: Network, inputs: number[]): number[] {
  let current = inputs;
  for (const layer of network.layers) {
    const next: number[] = [];
    for (const neuron of layer) {
      let sum = neuron.bias;
      for (let i = 0; i < neuron.weights.length; i++) {
        sum += current[i] * neuron.weights[i];
      }
      neuron.net = sum;
      neuron.output = sigmoid(sum);
      next.push(neuron.output);
    }
    current = next;
  }
  return current;
}
```
If you walk through the example above you will see each neuron running the formula we looked at earlier, with the output neuron giving us the answer 0.53. That is about the worst answer we could have hoped for: the output should be 0 or 1 and we’re square in the middle. This is because the weights are all random at the moment, so we are literally just pushing numbers through random multipliers.
Before the network can give us credible results we need to train it, and it needs to learn. That’s what we’ll cover next.
Back propagation - the learning
The network learns by pushing the error back through the network and adjusting the weights and biases. Conceptually it does this by distributing blame proportionally - the neurons that had the biggest impact on the output have their weights and biases adjusted the most. In our examples this means the weights will be adjusted the most on the connections represented by the thickest lines.
The network also has a learning rate - this is a multiplier applied to the proportioned level of blame that basically scales how big an adjustment we’ll make to the weights and biases. Too big and our corrections will overshoot and too small and the network will take longer to converge on accurate answers.
Working this back through the network is known as back propagation, and the idea is that, by tweaking these numbers over many runs (each run is known as an epoch), the measure of error, the loss, should converge towards 0.
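Written out for a single neuron, the nudge applied during that pass (with $\eta$ as the learning rate and $\delta$ as the neuron’s share of the blame) looks like this:

$$weight_i \leftarrow weight_i - \eta \times \delta \times input_i$$

$$bias \leftarrow bias - \eta \times \delta$$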
The code for backpropagation is the most involved part but the structure mirrors the forward pass — just working backwards. We need one extra piece: the sigmoid derivative, which tells us how steep the S-curve is at a given neuron’s operating point. Where the curve is steep the neuron is sensitive to changes and absorbs more blame. Where it’s flat — near 0 or 1 — the neuron is saturated and barely learns:
```typescript
function sigmoidDerivative(net: number): number {
  const s = sigmoid(net);
  return s * (1 - s);
}

function backward(
  network: Network,
  inputs: number[],
  targets: number[],
  learningRate: number
): number {
  const { layers } = network;

  // Step 1: How wrong is the output, and how sensitive is it to changes?
  const outputLayer = layers[layers.length - 1];
  for (let i = 0; i < outputLayer.length; i++) {
    const neuron = outputLayer[i];
    const error = neuron.output - targets[i];
    neuron.delta = error * sigmoidDerivative(neuron.net);
  }

  // Step 2: Propagate blame backward through hidden layers
  for (let l = layers.length - 2; l >= 0; l--) {
    const layer = layers[l];
    const nextLayer = layers[l + 1];
    for (let i = 0; i < layer.length; i++) {
      let downstreamBlame = 0;
      for (const nextNeuron of nextLayer) {
        downstreamBlame += nextNeuron.delta * nextNeuron.weights[i];
      }
      layer[i].delta = downstreamBlame * sigmoidDerivative(layer[i].net);
    }
  }

  // Step 3: Nudge every weight and bias
  for (let l = 0; l < layers.length; l++) {
    const layerInputs =
      l === 0 ? inputs : layers[l - 1].map((n) => n.output);
    for (const neuron of layers[l]) {
      for (let w = 0; w < neuron.weights.length; w++) {
        neuron.weights[w] -= learningRate * neuron.delta * layerInputs[w];
      }
      neuron.bias -= learningRate * neuron.delta;
    }
  }

  // Return the loss so we can track progress
  let loss = 0;
  for (let i = 0; i < outputLayer.length; i++) {
    const diff = outputLayer[i].output - targets[i];
    loss += diff * diff;
  }
  return loss / outputLayer.length;
}
```
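As a standalone sanity check of these updates, here’s a single neuron with one input being nudged towards a target of 1. The starting weight, bias and step count are hand-picked for illustration; the delta is computed exactly as in the backward pass:

```typescript
function sigmoid(x: number): number {
  return 1 / (1 + Math.exp(-x));
}

// One neuron, one input fixed at 1, target output of 1.
// Deliberately poor starting values so there is something to learn.
let weight = -0.5;
let bias = -0.5;
const input = 1;
const target = 1;
const learningRate = 0.5;

const before = sigmoid(input * weight + bias);

for (let step = 0; step < 100; step++) {
  const net = input * weight + bias;
  const output = sigmoid(net);
  // Error times the sigmoid slope: the neuron's share of the blame
  const delta = (output - target) * output * (1 - output);
  weight -= learningRate * delta * input;
  bias -= learningRate * delta;
}

const after = sigmoid(input * weight + bias);
console.log(before, "->", after); // climbs from ~0.27 to much closer to 1
```

Each step shrinks the error a little less than the one before, because as the output approaches 1 the sigmoid flattens out and the blame signal shrinks.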
In the simulation below you can see this being applied from where we left off above, and if you’re interested in the mathematics, that’s included too.
The simulation
At this point we have everything we need to train and then use our XOR neural network. Training is just running the forward pass and backward pass on every XOR input, thousands of times:
```typescript
const xorData = [
  { inputs: [0, 0], targets: [0] },
  { inputs: [0, 1], targets: [1] },
  { inputs: [1, 0], targets: [1] },
  { inputs: [1, 1], targets: [0] },
];

for (let epoch = 0; epoch < 20000; epoch++) {
  for (const sample of xorData) {
    forward(network, sample.inputs);
    backward(network, sample.inputs, sample.targets, 0.5);
  }
}
```
That’s the entire training loop. Each epoch feeds all four XOR cases through, adjusting weights after each one. If you run the simulation below you’ll see the model train itself over 20,000 epochs with a learning rate of 0.5.
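If you’d like to run all of this yourself outside the interactive page, the pieces above condense into one self-contained script. It keeps the same 2-4-1 topology and 0.5 learning rate; the helper names are my own shorthand, and the exact outputs will differ from run to run because the starting weights are random:

```typescript
type Neuron = { weights: number[]; bias: number; output: number; net: number; delta: number };

const sigmoid = (x: number) => 1 / (1 + Math.exp(-x));
const rand = () => Math.random() * 2 - 1;

// A layer is just an array of randomly initialised neurons
const makeLayer = (size: number, inputCount: number): Neuron[] =>
  Array.from({ length: size }, () => ({
    weights: Array.from({ length: inputCount }, rand),
    bias: rand(),
    output: 0,
    net: 0,
    delta: 0,
  }));

// 2 inputs -> 4 hidden neurons -> 1 output neuron
const layers = [makeLayer(4, 2), makeLayer(1, 4)];

function forward(inputs: number[]): number[] {
  let current = inputs;
  for (const layer of layers) {
    current = layer.map((n) => {
      n.net = n.bias;
      for (let i = 0; i < n.weights.length; i++) n.net += n.weights[i] * current[i];
      n.output = sigmoid(n.net);
      return n.output;
    });
  }
  return current;
}

function backward(inputs: number[], targets: number[], lr: number): void {
  // Output layer blame: error times sigmoid slope (s * (1 - s))
  const out = layers[layers.length - 1];
  out.forEach((n, i) => {
    n.delta = (n.output - targets[i]) * n.output * (1 - n.output);
  });
  // Hidden layer blame, collected from the layer downstream
  for (let l = layers.length - 2; l >= 0; l--) {
    layers[l].forEach((n, i) => {
      let blame = 0;
      for (const next of layers[l + 1]) blame += next.delta * next.weights[i];
      n.delta = blame * n.output * (1 - n.output);
    });
  }
  // Nudge every weight and bias
  layers.forEach((layer, l) => {
    const ins = l === 0 ? inputs : layers[l - 1].map((n) => n.output);
    for (const n of layer) {
      for (let i = 0; i < n.weights.length; i++) n.weights[i] -= lr * n.delta * ins[i];
      n.bias -= lr * n.delta;
    }
  });
}

const xorData = [
  { inputs: [0, 0], targets: [0] },
  { inputs: [0, 1], targets: [1] },
  { inputs: [1, 0], targets: [1] },
  { inputs: [1, 1], targets: [0] },
];

for (let epoch = 0; epoch < 20000; epoch++) {
  for (const { inputs, targets } of xorData) {
    forward(inputs);
    backward(inputs, targets, 0.5);
  }
}

for (const { inputs, targets } of xorData) {
  console.log(inputs.join(" XOR "), "=>", forward(inputs)[0].toFixed(4), "target", targets[0]);
}
```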
What you’ll probably immediately notice is that at the end of this process the neural network does not give us perfect answers. What we’re seeing is something like this:
| Input A | Input B | Target | Output |
|---------|---------|--------|--------|
| 0 | 0 | 0 | 0.0094 |
| 0 | 1 | 1 | 0.9890 |
| 1 | 0 | 1 | 0.9872 |
| 1 | 1 | 0 | 0.0121 |
If you’re used to thinking in more classical modes of computation then I think this, particularly, is a key takeaway about neural networks: they provide approximations. Or perhaps what might be best called probabilistic answers.
If you zoom in on the interesting part of the loss curve you might notice it looks like an inverted sigmoid — but it’s coincidental rather than causal. The loss curve isn’t a sigmoid, it just has a similar shape. This pattern of slow start, rapid progress, then diminishing returns shows up across all kinds of optimisation problems, not just neural networks.
It’s interesting to play with the learning rate and the number of epochs — you can get the network to converge on a more accurate result but it will never land on exact values. These are fundamentally approximation machines.
And you’re probably starting to see why these systems can be so expensive to train: as the number of neurons multiplies the number of calculations required grows quickly and you need vast numbers of epochs to converge over really large training sets. You can probably also see why GPUs, and similar architectures, are so good at this. It’s basically multiplication at a massive scale.
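To put a rough number on that growth, here’s a small illustrative helper that counts the weights and biases in a fully connected topology. The 784-128-10 shape is my own example (the classic image-classifier layout, not something from this series) and already runs to six figures:

```typescript
// Count weights + biases for a fully connected topology
function parameterCount(topology: number[]): number {
  let total = 0;
  for (let i = 1; i < topology.length; i++) {
    // each neuron in this layer: one weight per input, plus a bias
    total += topology[i] * (topology[i - 1] + 1);
  }
  return total;
}

console.log(parameterCount([2, 4, 1])); // 17
console.log(parameterCount([784, 128, 10])); // 101770
```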
In the next part we’re going to build on these basics and get a neural network to do something a bit more complicated - but the concepts will be exactly the same.