Understanding the Basics of Artificial Intelligence (AI). Dive into Neural Networks, the backbone of modern AI, understand its mathematics, ...
What are Neural Networks?
From Perceptrons to Deep Learning
Neural networks started with something called a perceptron in 1958, thanks to Frank Rosenblatt. This was a basic neural network meant for simple yes-or-no-type tasks. From there, we built more complex networks, like multi-layer perceptrons (MLPs), which can understand more complicated data relationships thanks to having one or more hidden layers.
Then came deep learning, which is all about neural networks with lots of layers. These deep neural networks are capable of learning from huge piles of data, and they’re behind a lot of the AI breakthroughs we hear about, from beating human Go players to powering self-driving cars.
Understanding Through Patterns
One of the biggest strengths of neural networks is their ability to learn patterns in data without being directly programmed for specific tasks. This process, called “training,” lets neural networks pick up on general trends and make predictions or decisions based on what they’ve learned.
Thanks to this capability, neural networks are super versatile and can be used for a wide array of applications, from image recognition to language translation, to forecasting stock market trends. They’re proving that tasks once thought to require human intelligence can now be tackled by AI.
Types of Neural Networks
Before diving into their structure and math, let’s take a look at the most popular types of Neural Networks we may find today. This will give us a better understanding of their potential and capabilities. I will try to cover all of them in future articles.
Feedforward Neural Networks (FNN): Starting with the basics, the Feedforward Neural Network is the simplest type. It’s like a one-way street for data — information travels straight from the input, through any hidden layers, and out the other side to the output. These networks are the go-to for simple predictions and sorting things into categories.
Convolutional Neural Networks (CNN): CNNs are the big guns in the world of computer vision. They’ve got a knack for picking up on the spatial patterns in images, thanks to their specialized layers. This ability makes them stars at recognizing images, spotting objects within them, and classifying what they see. They’re the reason your phone can tell a dog from a cat in photos.
Recurrent Neural Networks (RNN): RNNs have a memory of sorts, making them great for anything involving sequences of data, like sentences, DNA sequences, handwriting, or stock market trends. They loop information back around, allowing them to remember previous inputs in the sequence. This makes them ace at tasks like predicting the next word in a sentence or understanding spoken language.
Long Short-Term Memory Networks (LSTM): LSTMs are a special breed of RNNs built to remember things for longer stretches. They’re designed to solve the problem of RNNs forgetting stuff over long sequences. If you’re dealing with complex tasks that need to hold onto information for a long time, like translating paragraphs or predicting what happens next in a TV series, LSTMs are your go-to.
Generative Adversarial Networks (GAN): Imagine two AIs in a cat-and-mouse game: one generates fake data (like images), and the other tries to catch what’s fake and what’s real. That’s a GAN. This setup allows GANs to create incredibly realistic images, music, text, and more. They’re the artists of the neural network world, generating new, realistic data from scratch.
The Architecture of Neural Networks
At the core of neural networks are what we call neurons or nodes, inspired by the nerve cells in our brains. These artificial neurons are the workhorses that handle the heavy lifting of receiving, crunching, and passing along information. Let’s dive into how these neurons are built.
The Structure of a Neuron: A neuron gets its input either directly from the data we’re interested in or from the outputs of other neurons. These inputs are like a list, with each item on the list representing a different characteristic of the data.
For each input, the neuron does a little math: it multiplies the input by a “weight.” Think of weights as the neuron’s way of deciding how important an input is. The neuron then sums up all these weighted inputs and adds a single “bias,” a tweak to make sure the neuron’s output fits just right. During the network’s training, it adjusts these weights and biases to get better at its job. Finally, the neuron runs the total through a special function called an activation function.
This step is where the magic happens, allowing the neuron to tackle complex patterns by bending and stretching the data in nonlinear ways. Popular choices for this function are ReLU, Sigmoid, and Tanh, each with its own way of tweaking the data.
Neural networks are structured in layers, sort of like a layered cake, with each layer made up of multiple neurons. The way these layers stack up forms the network’s architecture:
Input Layer: This is where the data enters the network. Each neuron here corresponds to one feature of the data. In the image above the input layer is the first layer on the left holding two nodes.
Hidden Layers: These are the layers sandwiched between the input and output, as we can see from the image above. You might have just one or a bunch of these hidden layers, doing the grunt work of computations and transformations. The more layers (and neurons in each layer) you have, the more intricate patterns the network can learn. But, this also means more computing power is needed and a higher chance of the network getting too caught up in the training data, a problem known as overfitting.
Output Layer: This is the network’s final stop, where it spits out the results. Depending on the task, like if it’s classifying data, this layer might have a neuron for each category, using something like the softmax function to give probabilities for each category. In the image above, the last layer holds only one node, suggesting it is used for a regression task.
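To make the layer idea concrete, here is a minimal sketch of data flowing through stacked layers in plain Python. The helper `dense_layer`, the choice of three hidden neurons, and all weight values are made up for illustration; only the two inputs and the single regression output mirror the description above.

```python
def relu(z):
    """Pass positive values through, turn negatives into zero."""
    return max(0.0, z)

def identity(z):
    """No activation: typical for a regression output neuron."""
    return z

def dense_layer(inputs, weights, biases, activation):
    """Compute one output per neuron; weights[j] holds the weights of neuron j."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

# Hypothetical network: 2 inputs -> 3 hidden neurons (ReLU) -> 1 output neuron
hidden_w = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[1.0, -1.0, 0.5]]
output_b = [0.2]

x = [1.0, 2.0]                                      # input layer: one value per feature
hidden = dense_layer(x, hidden_w, hidden_b, relu)   # hidden layer activations
output = dense_layer(hidden, output_w, output_b, identity)
print(hidden, output)
```

Each layer simply takes the previous layer's outputs as its inputs, which is all that "stacking" means here.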
The Role of Layers in Learning
The hidden layers are the network’s feature detectives. As data moves through these layers, the network gets better at spotting and combining input features, layering them into a more complex understanding of the data. With each layer the data passes through, the network can pick up on more intricate patterns. Early layers might learn basic stuff like shapes or textures, while deeper layers get the hang of more complex ideas, like recognizing objects or faces in pictures.
Weighted Sum: The first step in the neural computation process involves aggregating the inputs to a neuron, each multiplied by its respective weight, and then adding a bias term. This operation is known as the weighted sum or linear combination. Mathematically, it is expressed as:
$$z = \sum_{i=1}^{n} w_i x_i + b$$
where:
- $z$ is the weighted sum,
- $w_i$ represents the weight associated with the i-th input,
- $x_i$ is the i-th input to the neuron,
- $n$ is the number of inputs to the neuron,
- $b$ is the bias term, a learnable offset that shifts the weighted sum independently of the inputs.
The weighted sum is crucial because it constitutes the raw input signal to a neuron before any non-linear transformation. It allows the network to perform a linear transformation of the inputs, adjusting the importance (weight) of each input in the neuron’s output.
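As a quick illustration, the weighted sum of a single neuron can be sketched in a few lines of Python. The function name `weighted_sum` and the numbers are placeholders chosen for this example, not values from a real model.

```python
def weighted_sum(inputs, weights, bias):
    """Return z = sum_i(w_i * x_i) + b for a single neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(w * x for w, x in zip(weights, inputs)) + bias

# Illustrative values only
x = [0.5, -1.0]   # inputs x_1, x_2
w = [0.8, 0.2]    # weights w_1, w_2
b = 0.1           # bias

z = weighted_sum(x, w, b)
print(z)  # approximately 0.3 (0.8*0.5 + 0.2*(-1.0) + 0.1)
```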
Activation Functions: As we said before, activation functions play a pivotal role in determining the output of a neural network. They are mathematical equations that determine whether a neuron should be activated or not. Activation functions introduce non-linear properties to the network, enabling it to learn complex data patterns and perform tasks beyond mere linear classification, which is essential for deep learning models. Here, we delve into several key types of activation functions and their significance:
Sigmoid: This function squeezes its input into a narrow range between 0 and 1. It’s like taking any value, no matter how large or small, and translating it into a probability.
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
You’ll see sigmoid functions in the final layer of binary classification networks, where you need to decide between two options: yes or no, true or false, 1 or 0.
Tanh: tanh stretches the output range to between -1 and 1. This centers the data around 0, making it easier for layers down the line to learn from it.
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
It’s often found in the hidden layers, helping to model more complex data relationships by balancing the input signal.
ReLU: ReLU is like a gatekeeper that passes positive values unchanged but blocks negatives, turning them to zero. This simplicity makes it very efficient and helps overcome some tricky problems in training deep neural networks.
$$\mathrm{ReLU}(z) = \max(0, z)$$
Its simplicity and efficiency have made ReLU incredibly popular, especially in convolutional neural networks (CNNs) and deep learning models.
Leaky ReLU: Leaky ReLU allows a tiny, non-zero gradient when the input is less than zero, which keeps neurons alive and kicking even when they’re not actively firing.
$$\mathrm{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \le 0 \end{cases}$$
Here α is a small positive slope (0.01 is a common choice). It’s a tweak to ReLU used in cases where the network might suffer from “dead neurons,” ensuring all parts of the network stay active over time.
ELU: ELU smooths out the function for negative inputs (using a parameter α for scaling), allowing for negative outputs but with a gentle curve. This can help the network maintain a mean activation closer to zero, improving learning dynamics.
$$\mathrm{ELU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha \left(e^{z} - 1\right) & \text{if } z \le 0 \end{cases}$$
Useful in deeper networks where ReLU’s sharp threshold could slow down learning.
Softmax: The softmax function turns logits, the raw output scores from the neurons, into probabilities by exponentiating and normalizing them. It ensures that the output values sum up to one, making them directly interpretable as probabilities.
$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$
It’s the go-to for the output layer in multi-class classification problems, where each neuron corresponds to a different class, and you want to pick the most likely one.
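The activation functions described above are short enough to sketch with Python's standard `math` module. This is one possible way to write them, not a reference implementation: the default `alpha` values are common choices assumed here, and the sigmoid and softmax use standard tricks to avoid numeric overflow.

```python
import math

def sigmoid(z):
    """Squash z into (0, 1); branch on sign so exp() never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def tanh(z):
    """Squash z into (-1, 1), centered at 0."""
    return math.tanh(z)

def relu(z):
    """Pass positives unchanged, turn negatives into 0."""
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but negatives keep a small slope alpha (assumed default)."""
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    """Smooth curve for negatives that saturates at -alpha (assumed default)."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(logits)                      # subtract the max for numeric stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(sigmoid(0.0), relu(-2.0), leaky_relu(-2.0), probs)
```

Note that softmax works on a whole layer's outputs at once, while the other functions are applied to each neuron independently.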
The Core of Neural Learning
Backpropagation, short for “backward propagation of errors,” is a method for efficiently calculating the gradient of the loss function with respect to all weights in the network. It consists of two main phases: a forward pass, where the input data is passed through the network to generate an output, and a backward pass, where the output is compared to the target value, and the error is propagated back through the network to obtain the gradient used to update each weight.
The essence of backpropagation is the chain rule of calculus, which expresses the gradient of the loss with respect to each weight as a product of local derivatives along the path from that weight to the output. This product reveals how much each weight contributes to the error, providing a clear path for its adjustment. For a single weight w, the chain rule can be written as:
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$$
where:
- ∂L/∂a is the gradient of the loss function with respect to the activation,
- ∂a/∂z is the gradient of the activation function with respect to the weighted input z,
- ∂z/∂w is the gradient of the weighted input with respect to the weight w,
- z represents the weighted sum of inputs and a is the activation.
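To see the chain rule at work, here is a small sketch for a single sigmoid neuron. The squared-error loss and all numbers are assumptions made for this example. The analytic gradient is the product of the three local derivatives, and a finite-difference estimate serves as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Squared-error loss of one sigmoid neuron (assumed for this example)."""
    a = sigmoid(w * x + b)
    return 0.5 * (a - y) ** 2

def grad_w(w, b, x, y):
    """dL/dw as the product dL/da * da/dz * dz/dw."""
    z = w * x + b
    a = sigmoid(z)
    dL_da = a - y            # derivative of 0.5 * (a - y)^2 with respect to a
    da_dz = a * (1.0 - a)    # derivative of the sigmoid
    dz_dw = x                # derivative of w*x + b with respect to w
    return dL_da * da_dz * dz_dw

# Illustrative values only
w, b, x, y = 0.5, -0.2, 1.5, 1.0

analytic = grad_w(w, b, x, y)

# Central finite difference: nudge w slightly and measure the change in loss
eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)

print(analytic, numeric)  # the two values should agree closely
```

Comparing against a numeric estimate like this is a common way to catch mistakes in hand-derived gradients.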
Gradient Descent: Optimizing the Weights
Gradient Descent is an optimization algorithm used for minimizing the loss function in a neural network. It works by iteratively moving the weights in the direction of the steepest decrease in loss. The amount by which the weights are adjusted in each iteration is determined by the learning rate, a hyperparameter that controls the size of the steps.
Mathematically, the weight update rule in gradient descent can be expressed as:
$$w_{\text{new}} = w_{\text{old}} - \eta \frac{\partial L}{\partial w}$$
where:
- $w_{\text{new}}$ and $w_{\text{old}}$ represent the updated (new) and current (old) values of the weight, respectively,
- η is the learning rate, a hyperparameter that controls the size of the step taken in the direction of the negative gradient,
- ∂L/∂w is the gradient of the loss function with respect to the weight.
In practice, backpropagation and gradient descent are performed in tandem. Backpropagation computes the gradient (the direction and magnitude of the error) for each weight in the network, and gradient descent uses this information to update the weights to minimize the loss. This iterative process continues until the model converges to a state where the loss is minimized or a criterion is met.
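The update rule is easiest to see on a toy problem. The sketch below minimizes a made-up one-parameter loss, L(w) = (w - 3)^2, whose gradient is known in closed form. The loss, the starting point, and the learning rate are all assumptions for illustration; in a real network the gradient would come from backpropagation.

```python
def loss(w):
    """Toy loss with its minimum at w = 3 (assumed for this example)."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w_init, learning_rate, n_steps):
    """Repeatedly apply w_new = w_old - learning_rate * dL/dw."""
    w = w_init
    history = [loss(w)]
    for _ in range(n_steps):
        w = w - learning_rate * grad(w)
        history.append(loss(w))
    return w, history

w_final, history = gradient_descent(w_init=0.0, learning_rate=0.1, n_steps=100)
print(w_final, history[0], history[-1])  # w_final ends up very close to 3
```

With this learning rate the distance to the minimum shrinks by a constant factor on every step; a much larger rate would overshoot and could diverge.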
Step-by-Step Example
Let’s explore an example involving backpropagation and gradient descent in a simple neural network. This neural network will have a single hidden layer. We’ll work through a single iteration of training with one data point to understand how these processes update the network’s weights.
Network Structure:
- Inputs: x1, x2 (2-dimensional input vector)
- Hidden Layer: 2 neurons, with activation function f(z)=ReLU(z)=max(0,z)
- Output Layer: 1 neuron, with activation function $g(z) = \sigma(z) = \frac{1}{1 + e^{-z}}$ (Sigmoid function for binary classification)
- Loss Function: Binary Cross-Entropy Loss.
Forward Pass
Given inputs x1, x2, weights w, and biases b, the forward pass calculates the network’s output. The process for a single hidden layer network with ReLU activation in the hidden layer and a sigmoid activation in the output layer is as follows:
Input to Hidden Layer
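Let the initial weights from the input to the hidden layer be w11, w12, w21, w22, and the biases be b1, b2 for the two hidden neurons, respectively. We take $w_{jk}$ to denote the weight from input $x_k$ to hidden neuron j.
Given an input vector [x1, x2], the weighted sum for each neuron in the hidden layer is:
$$z_1 = w_{11} x_1 + w_{12} x_2 + b_1, \qquad z_2 = w_{21} x_1 + w_{22} x_2 + b_2$$
Applying the ReLU activation function:
$$a_1 = \mathrm{ReLU}(z_1) = \max(0, z_1), \qquad a_2 = \mathrm{ReLU}(z_2) = \max(0, z_2)$$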
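Hidden Layer to Output
Let the weights from the hidden layer to the output neuron be w31, w32, and the bias be b3.
The weighted sum at the output neuron is:
$$z_3 = w_{31} a_1 + w_{32} a_2 + b_3$$
Applying the Sigmoid activation function for the output:
$$\hat{y} = \sigma(z_3) = \frac{1}{1 + e^{-z_3}}$$
Loss Calculation (Binary Cross-Entropy):
$$L = -\left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right]$$
where y is the true label (0 or 1) and ŷ is the predicted probability.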
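The forward pass of this 2-2-1 network can be sketched in plain Python as follows. The weight values, the input, and the label are invented for illustration, and the index convention (`w11` and `w12` feed the first hidden neuron) is an assumption of this sketch.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(p, x1, x2):
    """Forward pass: 2 inputs -> 2 ReLU hidden neurons -> 1 sigmoid output."""
    z1 = p["w11"] * x1 + p["w12"] * x2 + p["b1"]
    z2 = p["w21"] * x1 + p["w22"] * x2 + p["b2"]
    a1, a2 = relu(z1), relu(z2)
    z3 = p["w31"] * a1 + p["w32"] * a2 + p["b3"]
    y_hat = sigmoid(z3)
    return z1, z2, a1, a2, z3, y_hat

def binary_cross_entropy(y, y_hat):
    """Loss for a single example with label y in {0, 1}."""
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))

# Hypothetical parameters and data point
params = {"w11": 0.1, "w12": 0.2, "b1": 0.0,
          "w21": -0.5, "w22": 0.1, "b2": 0.0,
          "w31": 0.5, "w32": -0.6, "b3": 0.1}
x1, x2, y = 1.0, 2.0, 1.0

z1, z2, a1, a2, z3, y_hat = forward(params, x1, x2)
loss = binary_cross_entropy(y, y_hat)

# z2 is negative for these values, so ReLU sets a2 to 0
print(z1, z2, a1, a2, z3, y_hat, loss)
```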
Backward Pass (Backpropagation):
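Now things get a bit more complex, as we need to differentiate the formulas we applied in the forward pass.
Output Layer Gradients
Let’s start with the output layer. With a sigmoid output ŷ = σ(z3), a true label y, and the binary cross-entropy loss, the derivative of the loss function with respect to z3 is:
$$\frac{\partial L}{\partial z_3} = \hat{y} - y$$
The gradients of the loss with respect to the weights and bias of the output layer:
$$\frac{\partial L}{\partial w_{31}} = \frac{\partial L}{\partial z_3} a_1, \qquad \frac{\partial L}{\partial w_{32}} = \frac{\partial L}{\partial z_3} a_2, \qquad \frac{\partial L}{\partial b_3} = \frac{\partial L}{\partial z_3}$$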
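The compact form of the output-layer derivative, ŷ − y, is worth a short derivation. The sketch below assumes ŷ = σ(z3) and the binary cross-entropy loss used in this example, and applies the chain rule through ŷ:

```latex
\begin{aligned}
L &= -\left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right] \\
\frac{\partial L}{\partial \hat{y}}
  &= -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}
   = \frac{\hat{y} - y}{\hat{y}\,(1 - \hat{y})} \\
\frac{\partial \hat{y}}{\partial z_3}
  &= \sigma(z_3)\bigl(1 - \sigma(z_3)\bigr) = \hat{y}\,(1 - \hat{y}) \\
\frac{\partial L}{\partial z_3}
  &= \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_3}
   = \hat{y} - y
\end{aligned}
```

The factor ŷ(1 − ŷ) cancels, which is one reason the sigmoid and binary cross-entropy are so often paired.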
Hidden Layer Gradients
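The gradients of the loss with respect to the hidden layer activations (chain rule applied):
$$\frac{\partial L}{\partial a_1} = \frac{\partial L}{\partial z_3} w_{31}, \qquad \frac{\partial L}{\partial a_2} = \frac{\partial L}{\partial z_3} w_{32}$$
The gradients of the loss with respect to the weights and biases of the hidden layer, for the weight $w_{jk}$ from input $x_k$ to hidden neuron j:
$$\frac{\partial L}{\partial w_{jk}} = \frac{\partial L}{\partial a_j} \cdot \mathrm{ReLU}'(z_j) \cdot x_k, \qquad \frac{\partial L}{\partial b_j} = \frac{\partial L}{\partial a_j} \cdot \mathrm{ReLU}'(z_j)$$
where ReLU'(z) is 1 if z > 0 and 0 otherwise.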
After the weights and biases are updated with these gradients, the forward and backward passes are repeated until a stopping criterion is met, such as a maximum number of epochs.
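Putting the pieces together, one full training iteration for this small network might look like the sketch below. The parameter values, the data point, and the learning rate are hypothetical, and the index convention (`w11` and `w12` feed the first hidden neuron) is an assumption of the sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(p, x1, x2):
    """Forward pass: 2 inputs -> 2 ReLU hidden neurons -> 1 sigmoid output."""
    z1 = p["w11"] * x1 + p["w12"] * x2 + p["b1"]
    z2 = p["w21"] * x1 + p["w22"] * x2 + p["b2"]
    a1, a2 = max(0.0, z1), max(0.0, z2)
    z3 = p["w31"] * a1 + p["w32"] * a2 + p["b3"]
    return z1, z2, a1, a2, sigmoid(z3)

def bce(y, y_hat):
    """Binary cross-entropy for a single example."""
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))

def backward(p, x1, x2, y):
    """Backpropagation: gradient of the loss for every weight and bias."""
    z1, z2, a1, a2, y_hat = forward(p, x1, x2)
    dz3 = y_hat - y                                   # dL/dz3
    grads = {"w31": dz3 * a1, "w32": dz3 * a2, "b3": dz3}
    # dL/dz_j = dL/da_j * ReLU'(z_j), with dL/da_j = dL/dz3 * w3j
    dz1 = dz3 * p["w31"] * (1.0 if z1 > 0 else 0.0)
    dz2 = dz3 * p["w32"] * (1.0 if z2 > 0 else 0.0)
    grads.update({"w11": dz1 * x1, "w12": dz1 * x2, "b1": dz1,
                  "w21": dz2 * x1, "w22": dz2 * x2, "b2": dz2})
    return grads

def gradient_descent_step(p, grads, learning_rate):
    """Apply w_new = w_old - learning_rate * dL/dw to every parameter."""
    return {name: p[name] - learning_rate * grads[name] for name in p}

# Hypothetical parameters and data point
params = {"w11": 0.1, "w12": 0.2, "b1": 0.0,
          "w21": -0.5, "w22": 0.1, "b2": 0.0,
          "w31": 0.5, "w32": -0.6, "b3": 0.1}
x1, x2, y = 1.0, 2.0, 1.0

loss_before = bce(y, forward(params, x1, x2)[-1])
grads = backward(params, x1, x2, y)
new_params = gradient_descent_step(params, grads, learning_rate=0.1)
loss_after = bce(y, forward(new_params, x1, x2)[-1])

print(loss_before, loss_after)  # the loss should go down after the update
```

For these values the second hidden neuron has a negative weighted sum, so ReLU blocks it and its incoming weights receive a zero gradient in this iteration.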
Improvements
While the basic idea of Gradient Descent is simple — take small steps in the direction that reduces error the most — several tweaks and improvements have been made to this method to enhance its efficiency and effectiveness.
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) takes the core idea of gradient descent but changes the approach by using just one training example at a time to calculate the gradient and update the weights. This method is similar to making decisions based on quick, individual observations rather than waiting to gather everyone’s opinion. It can make the learning process much faster because the model updates more frequently and with less computational burden.
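Here is a minimal sketch of that one-example-at-a-time loop. To keep it short, it fits a simple linear model y = w·x + b with a squared-error loss rather than a full neural network. The dataset, learning rate, epoch count, and the function name `sgd_fit` are all assumptions made for this example.

```python
import random

def sgd_fit(xs, ys, learning_rate=0.05, epochs=300, seed=0):
    """Fit y = w*x + b by updating after every single training example."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)           # visit the examples in a new random order
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2 for this one example only
            w -= learning_rate * error * xs[i]
            b -= learning_rate * error
    return w, b

# Synthetic, noise-free data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = sgd_fit(xs, ys)
print(w, b)  # should end up close to 2 and 1
```

Each update only looks at one example, so individual steps are cheap but noisy; shuffling the order every epoch is a common way to keep that noise from following a fixed pattern.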
Conclusion
Diving into the world of neural networks opens our eyes to the incredible potential these models hold within the realm of artificial intelligence. Starting with the basics, like how neural networks use weighted sums and activation functions to process information, we’ve seen how techniques like backpropagation and gradient descent empower them to learn from data.
Especially in areas like image recognition, we’ve witnessed firsthand how neural networks are solving complex challenges and pushing technology forward. Looking ahead, it’s clear we are only at the beginning of a long journey called “Deep Learning”. In the next articles, we will talk about more advanced deep learning architectures, fine-tuning methods, and much more!