If you have spent even twenty minutes reading about AI, you have hit the phrase "neural network" — usually with a picture of colourful dots connected by lines, usually followed by a hand-wavy explanation about how it is "inspired by the human brain." That picture is mostly for the marketing department. Under the hood, a neural network is an elegant, mathematically tractable machine that most people can understand perfectly well without a maths degree. This guide strips the metaphor away and walks through what a neural network actually is, how it learns, where it wins, and where it is still beaten by much simpler tools in 2026.

The one-sentence version

A neural network is a stack of simple math blocks, tuned automatically by exposure to examples, that turns some input into some output. That is literally it. The word "neuron" is a historical accident — the building blocks look vaguely like diagrams of biological neurons that researchers in the 1950s drew to pitch their work, but the resemblance is cosmetic. Every serious implementation is a very large matrix multiplication, a non-linearity, another matrix multiplication, another non-linearity, repeated many times. If you remember that, you will understand more about neural networks than 90% of the people who talk about them.

The reason this humble structure is so powerful is not any single neuron's cleverness but the sheer number of them, the way they are connected, and the training process that nudges their weights until the whole thing computes something useful. In that sense a neural network is less like a brain and more like a very large, very trainable calculator.

The single neuron: multiply, add, squash, repeat

Start with one neuron. It takes in several numbers, multiplies each by a learned weight, adds them up, adds a learned bias, and passes the total through a simple non-linear function called an activation. That is the entire computation of a single neuron: multiply, add, squash.

To make that concrete, imagine a neuron that tries to predict whether a photo contains a cat. Its inputs might be pixel values from a small patch of the image — say 100 numbers. Each of those 100 numbers has a weight, and there is one bias. The neuron multiplies each pixel by its corresponding weight, sums the products, adds the bias, and passes the total through, say, the ReLU activation function (which zeros out any negative number and keeps positive ones unchanged). The output is a single number — call it an "activation."

That one neuron has 101 parameters to learn: 100 weights plus 1 bias. Before training, all 101 are random. After training, they have settled into values that make the neuron output high numbers for cat-like pixel patches and low numbers for non-cat-like ones. Nothing magical has happened; the training process just discovered that certain combinations of pixel intensities tend to co-occur with cats.
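The whole multiply-add-squash computation fits in a few lines of plain Python. A minimal sketch, with three inputs standing in for the hundred, and hand-picked weights rather than learned ones:

```python
def relu(x):
    """ReLU activation: zero out negatives, keep positives unchanged."""
    return max(0.0, x)

def neuron(inputs, weights, bias):
    """One neuron: weighted sum of inputs, plus bias, then squash."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return relu(total)

# Toy example: 3 inputs instead of 100, with made-up weights.
activation = neuron([0.5, 0.2, 0.9], [1.0, -2.0, 0.5], bias=0.1)
print(activation)  # 0.5*1.0 + 0.2*(-2.0) + 0.9*0.5 + 0.1, roughly 0.65
```

Scale the input list up to 100 entries and this is, computationally, the cat-patch neuron from the text.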

A single neuron is not much. It can learn a linear boundary — essentially, the same thing a logistic regression learns. By itself, it is useless for interesting problems. The power comes from stacking many of these tiny units into a network.

From one neuron to a network

A neural network is arranged in layers. The first layer takes the raw input — say a 28x28 pixel image of a handwritten digit, flattened into 784 numbers. The next layer has some number of neurons, each taking all 784 inputs and producing one activation. So a hidden layer of 128 neurons takes 784 inputs and produces 128 outputs. The next layer takes those 128 outputs and produces its own activations. And so on.

Each layer transforms the shape of the data. The first layer turns 784 pixel intensities into 128 "early features." The second layer turns those 128 features into 64 "middle features." Eventually, the final layer outputs something task-specific — say, 10 numbers, one per possible digit (0 through 9), with the largest one representing the network's guess.

What makes this interesting is that no human designs what each layer should compute. The whole structure is just a pile of connected multiply-add-squash units with random weights at the start. Training discovers, on its own, which combinations of pixel intensities are useful for telling a "3" from an "8." Early layers typically learn simple features (edges, corners, curves); middle layers learn mid-level shapes (loops, strokes); final layers learn full-concept features (digit identity). This hierarchical discovery of features is what makes deep networks so powerful on messy perceptual data.
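A layered network is just the single-neuron computation repeated. A sketch with made-up sizes (4 inputs, a hidden layer of 3, an output layer of 2) standing in for 784/128/64; the `layer` and `random_layer` helpers here are illustrative, not from any library:

```python
import random

random.seed(0)  # make the "random" starting point repeatable

def relu(x):
    return max(0.0, x)

def layer(inputs, weights, biases):
    """One fully connected layer: every neuron sees every input."""
    return [relu(sum(x * w for x, w in zip(inputs, neuron_w)) + b)
            for neuron_w, b in zip(weights, biases)]

def random_layer(n_in, n_out):
    """Random weights and zero biases: the pre-training starting point."""
    return ([[random.uniform(-0.5, 0.5) for _ in range(n_in)]
             for _ in range(n_out)],
            [0.0] * n_out)

w1, b1 = random_layer(4, 3)   # 4 inputs -> 3 hidden activations
w2, b2 = random_layer(3, 2)   # 3 hidden -> 2 outputs
hidden = layer([0.1, 0.9, 0.3, 0.7], w1, b1)
output = layer(hidden, w2, b2)
print(len(hidden), len(output))  # 3 2
```

With random weights the outputs are meaningless noise, exactly as the text describes; training is what turns them into something useful.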

Activation functions: why non-linearity matters

The squash step — the activation function — is small-looking but essential. Without it, stacking many layers would be pointless: any composition of linear operations is still a linear operation, and you would be left with nothing more expressive than a single regression.

The activation function introduces non-linearity, which lets the network bend the function it is computing into whatever shape the training data demands. A handful of common ones dominate in practice.

  • ReLU — Rectified Linear Unit — is by far the most common in modern networks. It outputs 0 for any negative input and leaves positive inputs unchanged. It is trivial to compute, avoids certain training pathologies that plagued earlier activations, and generally just works.
  • Sigmoid squashes any real number into the range 0 to 1. It was historically popular and is still used at the final layer for binary classification.
  • Tanh, a scaled version of sigmoid, outputs values from -1 to 1. Less used in modern hidden layers but still common in some recurrent networks.
  • GELU is a smoother relative of ReLU, and SwiGLU pairs a smooth activation with a gating mechanism. Both appear in modern transformers and large language models, where they train slightly better than plain ReLU for certain architectures.

The choice of activation matters but rarely as much as people imagine. ReLU is a reasonable default and lets you stop thinking about it.
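Each of the activations above is a one-liner (GELU shown in its common tanh-based approximation):

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def gelu(x):
    """Widely used tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

# Compare their behaviour on a negative, zero, and positive input.
for f in (relu, sigmoid, tanh, gelu):
    print(f.__name__, [round(f(x), 3) for x in (-2.0, 0.0, 2.0)])
```

Note how all four treat negative inputs differently: ReLU kills them outright, sigmoid and tanh squash them smoothly, GELU lets a small amount through.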

How a network learns: the training loop

Training is where the magic actually happens. At the start, every weight and bias in the network is initialised to a small random value. Given any input, the network produces an essentially random output. The training loop consists of four steps, repeated millions or billions of times.

First, forward pass. Take a training example, push it through the network layer by layer, and get a prediction.

Second, compute the loss. Compare the prediction to the correct answer using a loss function — typically mean squared error for regression or cross-entropy for classification. The loss is a single number that says how wrong the network was on this example.

Third, backward pass. Use a technique called backpropagation (more on this in a moment) to calculate how much each weight and bias in the network contributed to the loss. The output is a "gradient" — a direction, for each parameter, that indicates which way to nudge it to reduce the loss.

Fourth, update. Nudge every parameter in the direction its gradient suggests, multiplied by a small number called the learning rate.

Repeat for the next training example. And the next. For a dataset of a million images fed one at a time, the network makes a million updates per epoch (a full pass through the data); in practice examples are grouped into small batches, which divides the update count by the batch size but leaves the loop unchanged. Training typically runs for anywhere from a few to hundreds of epochs. Slowly, what started as random noise becomes a network that produces sensible outputs. This is what "training a model" means in practice.
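The four steps can be run in miniature: one linear neuron (no activation, so the gradients stay simple) learning y = 2x + 1 from five examples. For squared-error loss the derivatives work out to d(loss)/dw = 2 × error × x and d(loss)/db = 2 × error, so no autograd library is needed:

```python
# Tiny training loop: one neuron learning y = 2x + 1.
data = [(x, 2 * x + 1) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]
w, b = 0.3, -0.2          # "random" starting parameters
learning_rate = 0.05

for epoch in range(200):               # many full passes over the data
    for x, y_true in data:
        y_pred = w * x + b             # 1. forward pass
        error = y_pred - y_true        # 2. loss is error**2
        grad_w = 2 * error * x         # 3. backward pass
        grad_b = 2 * error
        w -= learning_rate * grad_w    # 4. update
        b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))  # settles close to 2.0 and 1.0
```

A thousand tiny nudges turn the arbitrary starting point (0.3, -0.2) into the parameters the data demanded. Real training is this loop with billions of parameters instead of two.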

Backpropagation in plain English

Backpropagation is the algorithm that makes training work, and it is worth a minute of your attention even if the maths looks intimidating on a blackboard.

The intuition is this. The loss is a single number at the end of the network. Every parameter in the network contributed a little bit to that loss. If you could figure out exactly how much each parameter contributed — whether nudging it up or down would have reduced the loss — you could then nudge them all in the right direction simultaneously.

Backpropagation does exactly this, using the chain rule from calculus. It starts at the output and works backwards through the network, one layer at a time, computing for each parameter how changing it would have changed the loss. This is why it is called "back" propagation: the gradients flow from the output layer back to the input layer.

The genius of the algorithm, popularised in the 1980s, is that this whole process can be done efficiently. Instead of evaluating each parameter independently (which would be prohibitively expensive for billion-parameter networks), backpropagation reuses the forward pass's intermediate activations to compute all gradients in a single backward sweep. This is what makes modern deep learning computationally feasible. Without backpropagation, training a neural network with more than a few hundred parameters would be impractical.
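The backward sweep can be checked against a numerical nudge. For the toy loss (relu(w·x) − y)², multiplying the three local derivatives from the output back to w, exactly as the chain rule dictates, gives the same number as a finite-difference estimate:

```python
def forward(w, x, y):
    h = max(0.0, w * x)       # neuron output (ReLU)
    return (h - y) ** 2       # squared-error loss

def backprop_grad(w, x, y):
    """Walk backwards: d(loss)/dh, then d(h)/dz, then d(z)/dw."""
    z = w * x
    h = max(0.0, z)
    dloss_dh = 2 * (h - y)            # derivative of (h - y)**2
    dh_dz = 1.0 if z > 0 else 0.0     # derivative of ReLU
    dz_dw = x                         # derivative of w * x
    return dloss_dh * dh_dz * dz_dw   # chain rule: multiply them together

w, x, y = 0.8, 1.5, 2.0
analytic = backprop_grad(w, x, y)
eps = 1e-6
numeric = (forward(w + eps, x, y) - forward(w - eps, x, y)) / (2 * eps)
print(round(analytic, 4), round(numeric, 4))  # the two gradients agree
```

Real networks do this for millions of parameters at once, but every gradient is built the same way: a chain of local derivatives multiplied together from the loss backwards.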

The zoo of neural network types

Once you understand the basic multiply-add-squash structure, the menagerie of named neural network variants becomes much less intimidating. Here are the ones worth knowing.

  • Fully connected networks, also called multi-layer perceptrons (MLPs), are the most basic form: every neuron in one layer connects to every neuron in the next. Good for tabular data, rarely used on their own for images or text.
  • Convolutional neural networks (CNNs) dominate image processing. They reuse the same small weight patterns across a whole image, which dramatically reduces parameter count and lets them detect shapes regardless of position. CNNs powered most of the deep-learning image breakthroughs from 2012 to 2020.
  • Recurrent neural networks (RNNs) and LSTMs process sequences one element at a time, maintaining an internal "memory." They dominated language and speech until transformers displaced them.
  • Transformers are the dominant architecture of the 2020s. They handle sequences in parallel using the attention mechanism. Virtually every modern LLM is a transformer.
  • Graph neural networks work on data with arbitrary connection structure, like social networks or molecules. Niche but powerful for their domain.
  • Diffusion models generate images by learning to reverse a noising process. Midjourney, Stable Diffusion, and Imagen are all diffusion models under the hood.

All of these share the same underlying idea — stacked multiply-add-squash — with different connection patterns and training objectives.

A concrete example: training a digit recogniser end to end

To tie all the moving pieces together, walk through a classic small project: building a neural network that reads a handwritten digit (0 through 9) from a 28x28 pixel image.

You start with a dataset — MNIST is the traditional one — containing 60,000 training images and 10,000 test images, each with its correct digit label. You build a network with one input layer (784 neurons, one per pixel), two hidden layers (say 128 and 64 neurons), and an output layer (10 neurons, one per possible digit). Initialise all weights and biases to small random numbers. That is your starting point: a network that outputs essentially noise on every input.
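Counting the learnable parameters of that architecture is a useful sanity check: each layer contributes in × out weights plus out biases, and the input layer itself has none. A quick tally, using the layer sizes from the text:

```python
# Layer sizes for the digit recogniser: 784 -> 128 -> 64 -> 10.
sizes = [784, 128, 64, 10]

total = sum(n_in * n_out + n_out            # weights + biases per layer
            for n_in, n_out in zip(sizes, sizes[1:]))
print(total)  # 109386 parameters
```

Just over 100,000 of those parameters sit in the first layer alone, which hints at why weight-sharing architectures like CNNs scale so much better on larger images.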

Now train. For each batch of, say, 64 training images, run the forward pass, compute the cross-entropy loss between the predicted digit probabilities and the true labels, backpropagate gradients, and update every weight and bias using an optimiser like Adam. One pass through the full 60,000 images is one epoch. After a few epochs, accuracy climbs above 90%. After twenty or thirty epochs with a well-tuned learning rate, a network this simple can hit 98% accuracy on the test set.

What just happened? The network, starting from nothing, learned — through nothing but forward passes, loss gradients, and weight updates — to recognise handwritten digits well enough to rival human accuracy. That is the entire arc of deep learning in miniature: data, loss, gradients, updates, emergent capability. Every frontier AI system scales up the same recipe with bigger networks, more data, and much more compute.

The hardware behind modern neural networks

A worthwhile aside: neural networks are what they are today in part because GPUs exist. A single matrix multiplication — the core operation inside every layer — is embarrassingly parallel. GPUs, originally built to render game graphics, happen to be extremely good at exactly this kind of parallel math, and they are orders of magnitude faster than CPUs for training deep networks. The entire modern deep-learning revolution would have been impossible without commodity GPU compute, and the current frontier depends on ever-larger GPU clusters — tens of thousands of chips training a single frontier model for months. TPUs (Google's custom AI chips), Nvidia H200s and B200s, and the new wave of AI accelerators from AMD and emerging startups all compete for the same workload. When someone says "training a model cost a hundred million dollars," this is where the money went.

Where neural networks still get beaten by simpler tools

Despite their glamour, neural networks are not the right tool for every job. Classical models continue to win in three important scenarios.

Tabular data with modest row counts. If your problem is predicting from tabular data with fewer than a million rows, a gradient-boosted tree (XGBoost, LightGBM) will almost always match or beat a neural network, train in a fraction of the time, and serve predictions at a hundredth of the cost.

Problems with tight interpretability requirements. Financial regulators, medical device authorities, and government agencies often require that automated decisions be explainable. A small decision tree is legible; a neural network with millions of parameters is not. Until interpretability research matures, simpler models win these domains.

Problems where you only have a few hundred examples. Neural networks are data-hungry. If your training set has two hundred rows, you will be overfitting before you finish writing the config file. A well-tuned logistic regression or a tree-based model will do far better on tiny data.

Using a neural network when a simpler model would do is a classic junior-engineer mistake. Match the tool to the problem.

What neural networks are not

A final round of myth-busting.

A neural network is not a brain. It contains no biology, no electrical signals, no neurotransmitters. The analogy is historical and largely inaccurate.

A neural network does not "think." It computes a fixed function — input to output — that was tuned during training. It has no persistent internal state between inferences unless explicitly designed (as in RNNs or transformers with context).

A neural network is not magic. Everything it computes is the result of weights multiplying inputs and being squashed. Where those weights came from is the training process. What they mean is whatever pattern the training data encoded.

A neural network is not guaranteed to work on your problem. You can train a network on bad data, with a bad architecture, with too little compute, and get a useless result. Deep learning is a craft.

A neural network is a pile of multiply-add-squash units with learned weights. Everything else — CNNs, RNNs, transformers, diffusion models — is a recipe for how to arrange and train that pile.

The short version

A neural network is a stack of layers, each layer a bunch of multiply-add-squash units, each unit's parameters learned automatically by exposure to examples. Training uses forward passes and backpropagation to nudge parameters toward configurations that produce useful outputs. Different connection patterns give rise to different network types — CNNs, transformers, diffusion models — but the core idea is the same. Neural networks excel at unstructured data where patterns are too messy for hand-crafted rules, and they are the building block underneath almost every modern AI product. Understand that one idea, and the rest of the field becomes navigable.
