Like Marvel superheroes with their signature powers, AI has its own set of “superpowers,” and self-learning is arguably the most impactful. For many, autonomous learning is a mysterious black box: we see what goes in and what comes out, but the inner workings are hidden. In this article, we open that box by exploring PyTorch, a widely used machine learning framework, with the goal of clarifying how a model learns and updates its parameters.
PyTorch is an open source framework for building, training, and deploying deep learning models. It provides optimizers such as SGD and Adam, which update model weights using gradients over repeated training cycles. We will demonstrate the stochastic gradient descent (SGD) algorithm with a simple example.
To start with, Stochastic Gradient Descent (SGD) is an optimization algorithm used to train machine learning models by minimizing the loss function. It is a variant of standard gradient descent that, instead of using the entire data set to calculate the gradient for each update, uses a small, randomly selected subset of data called a mini-batch.
Model equation:
Enough said, let’s move on to an example. We will consider the following simple linear regression model:
ypred = wx + b – Equation 1
where ypred = predicted output and x = input, w = weight, b = bias.
The goal is to accurately predict the outputs for a given set of input values, based on a reference model/equation. This is achieved by finding the values of weight (w) and bias (b) that make the predicted output as close as possible to the actual output.
Loss function:
Just like humans, the algorithm needs to understand how close it is to calculating optimal values. This is achieved using a loss function (l), where after each step, you quantify the error between the model prediction and the actual value. The goal is to minimize this loss function by iteratively modifying the values of w and b.
For simplicity, we choose the mean squared error (MSE) as our loss function, given by the following equation.
$$l = \frac{1}{n}\sum_{i=0}^{n-1}\bigl(y_{\text{pred}}(i) - y(i)\bigr)^2$$ – Equation 2
where n = No. of training instances,
y(i) = Actual output of the ith input, ypred(i) = Predicted output of the ith input
Expanding ypred according to Equation 1, we have l as follows:
$$l = \frac{1}{n}\sum_{i=0}^{n-1}\bigl(w\,x(i) + b - y(i)\bigr)^2$$ – Equation 3
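As a minimal sketch of Equation 3 in plain Python, the loss for one batch could be computed as shown below. The function name `mse_loss` is an illustrative choice, not something from PyTorch, and the sample values are borrowed from the first batch of the numerical example later in this article.

```python
def mse_loss(w, b, xs, ys):
    """Mean squared error of the linear model w*x + b over one batch (Equation 3)."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


# With w = 0 and b = 0 every prediction is 0, so the loss is the mean of y squared.
loss = mse_loss(0.0, 0.0, [0.0, 1.0, 2.0], [10.0, 12.0, 14.0])
print(round(loss, 2))  # 146.67
```

When w and b match the equation that generated the data, every squared term is zero and so is the loss.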
Methodology:
We start by assigning initial values to w and b, for example, w = 0 and b = 0. Next, we train the model by feeding it sets of inputs and their corresponding outputs. Each of these sets is known as a training batch.
For each training batch (i), the algorithm calculates the gradient of the loss (l) with respect to w (dl/dw) and b (dl/db). Based on the calculated gradients, it then updates the values of w and b as follows:
w(i+1) = w(i) – (dl/dw) * lr – Equation 4
b(i+1) = b(i) – (dl/db) * lr – Equation 5
where lr = learning rate
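Equations 4 and 5 translate almost directly into code. The sketch below is one possible plain-Python rendering; the helper name `sgd_update` and the gradient values passed to it are made up for illustration.

```python
def sgd_update(w, b, grad_w, grad_b, lr):
    """Apply one SGD step (Equations 4 and 5) and return the new (w, b)."""
    return w - grad_w * lr, b - grad_b * lr


# A negative gradient means the loss falls as the parameter grows,
# so the update increases the parameter.
w, b = sgd_update(w=0.0, b=0.0, grad_w=-4.0, grad_b=-2.0, lr=0.5)
print(w, b)  # 2.0 1.0
```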
Learning rate is a hyperparameter that controls how much a model updates its parameters at each training step. If it is too large, updates can overshoot the minimum and the loss may diverge; if it is too small, training converges very slowly.
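To see the effect of the learning rate in isolation, consider a toy problem that is not part of this article's regression example: minimizing f(w) = w^2, whose gradient is 2w. The function `descend` and the learning rates below are assumptions chosen purely for illustration.

```python
def descend(lr, steps=10, w=1.0):
    """Minimize f(w) = w**2 by gradient descent; the gradient is 2 * w."""
    for _ in range(steps):
        w -= lr * 2 * w
    return w


print(abs(descend(lr=0.1)))  # shrinks toward the minimum at 0
print(abs(descend(lr=1.1)))  # grows, because each step overshoots the minimum
```

Each step multiplies w by (1 - 2 * lr), so the iterate shrinks only when that factor has magnitude below 1.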
We repeat the above steps for each training batch until we have exhausted all training batches.
Gradient calculation:
The gradients of the loss with respect to w and b, i.e. dl/dw and dl/db, are calculated once the loss metric (l) is known. The loss metric (l) is first calculated by forward propagation, which compares ypred with y. This is followed by backpropagation, in which the partial derivatives of the loss metric with respect to weight (w) and bias (b) are calculated.
We will examine both methods in detail in light of our loss function (l).
Forward Propagation:
As part of the forward pass, the algorithm calculates the loss function for a given set of input variables. It keeps a record of the data and all executed operations (along with the resulting new variables) in a directed acyclic graph (DAG) consisting of function objects. In this DAG, leaves are the input variables and roots are the output variables. By tracing this graph from roots to leaves, the algorithm can automatically calculate gradients using the chain rule.
The following is a visual representation of our loss function calculation, with the variables as nodes and the math operations as functions:
To ensure simplicity, each variable node represents a unit function, as follows:
z = l
l = (1/n) * e
e = (d)^2
d = c – y
c = a + b
a = w * x
Tracing the DAG again would give us Equation 3.
Using a bottom-up approach, the algorithm starts from the lowest nodes in the tree and propagates upward through each step until it reaches the top of the tree to calculate the loss function (l). As you go up the tree, at each step you essentially store the computed function and the gradient function of the operation in memory, which is then used during the backpropagation flow.
The algorithm also adds an additional variable (z) at the end, which is essentially the same as l (i.e., z = l). This is to aid in the calculation of the gradient of l during backpropagation, as we will see below.
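The unit functions listed above can be evaluated one at a time in plain Python. This is only a sketch of the idea: the `forward` helper and its dictionary of intermediates are hypothetical and merely mimic, in spirit, the bookkeeping that PyTorch performs internally.

```python
def forward(w, b, xs, ys):
    """Evaluate the loss one unit operation at a time, keeping every intermediate."""
    n = len(xs)
    a = [w * x for x in xs]                   # a = w * x
    c = [a_i + b for a_i in a]                # c = a + b
    d = [c_i - y for c_i, y in zip(c, ys)]    # d = c - y
    e = [d_i ** 2 for d_i in d]               # e = d^2
    l = sum(e) / n                            # l = (1/n) * sum of e
    z = l                                     # z = l
    return {"a": a, "c": c, "d": d, "e": e, "l": l, "z": z}


cache = forward(w=1.0, b=0.0, xs=[1.0, 2.0], ys=[3.0, 2.0])
print(cache["d"], cache["l"])  # [-2.0, 0.0] 2.0
```

Keeping `d` around is what makes the backward pass cheap, since the derivative of e with respect to d is simply 2d.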
Backward propagation:
As the name suggests, backpropagation uses a top-down approach, unlike the bottom-up forward pass. It starts from the top node (z) and calculates the partial derivative of each node/variable with respect to the variable beneath it. Using the functions stored during forward propagation, the algorithm calculates the gradient of each function and, via the chain rule, propagates it down to the leaf nodes.
Below is a visual representation of the DAG in our example. In the graph, the blue arrows are in the direction of the forward pass versus the green arrows in the opposite direction, representing the backward pass. The green nodes represent the corresponding backward functions of each operation in the forward pass.
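The record-then-replay idea can be sketched with a toy scalar class in plain Python. This is not PyTorch's implementation: the `Node` class, its attributes, and the single training instance used here are all hypothetical and exist only to show a graph being built during the forward pass and walked in reverse during the backward pass.

```python
class Node:
    """A scalar value that remembers how it was computed (hypothetical helper)."""

    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # nodes this value was computed from
        self.local_grads = local_grads  # d(self)/d(parent) for each parent
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Node(self.value - other.value, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    (other.value, self.value))

    def backward(self):
        """Walk from this root to the leaves, applying the chain rule."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # derivative of the root with respect to itself
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += node.grad * local


# One training instance: x = 2, y = 14, starting from w = 0 and b = 0.
w, b = Node(0.0), Node(0.0)
x, y = Node(2.0), Node(14.0)
d = w * x + b - y   # the forward pass builds the graph
loss = d * d        # squared error for a single instance (n = 1)
loss.backward()
print(w.grad, b.grad)  # -56.0 -28.0
```

The printed values agree with the hand calculation for a single instance: dl/dw = 2 * d * x and dl/db = 2 * d, with d = -14.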
Calculation of weight gradient:
Backpropagation is how the algorithm calculates gradients based on the partial derivative chain rule. Applying the chain rule to our example to calculate the weight gradient, we obtain the following:
$$\frac{dl}{dw} = \frac{dl}{de}\cdot\frac{de}{dd}\cdot\frac{dd}{dc}\cdot\frac{dc}{da}\cdot\frac{da}{dw}$$ – Equation 6
where l = (1/n) * e
e = (d)^2
d = c – y
c = a + b
a = w * x
The partial derivatives of each intermediate variable are the following:
dl/de = 1/n,
de/dd = 2d,
dd/dc = 1,
dc/da = 1,
da/dw = x
Substituting these values into Equation 6, we obtain the following expression:
$$\frac{dl}{dw} = \frac{1}{n}\sum_{i=0}^{n-1}\bigl(2\,d(i)\cdot 1\cdot 1\cdot x(i)\bigr)$$ – Equation 7
Expanding each intermediate variable, this is equivalent to:
$$\frac{dl}{dw} = \frac{2}{n}\sum_{i=0}^{n-1}\bigl(w\,x(i) + b - y(i)\bigr)\,x(i)$$ – Equation 8
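One way to check the derivation is to compute dl/dw twice: once as the product of the local derivatives from Equations 6 and 7, and once from the expanded form in Equation 8. The function names below are illustrative, and the inputs are the first batch of the numerical example that follows.

```python
def grad_w_chain(w, b, xs, ys):
    """dl/dw as a product of local derivatives per instance (Equations 6 and 7)."""
    n = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        d = w * x + b - y
        de_dd, dd_dc, dc_da, da_dw = 2 * d, 1.0, 1.0, x
        total += de_dd * dd_dc * dc_da * da_dw
    return total / n  # the remaining factor is dl/de = 1/n


def grad_w_closed(w, b, xs, ys):
    """dl/dw in expanded form (Equation 8)."""
    n = len(xs)
    return (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))


xs, ys = [0.0, 1.0, 2.0], [10.0, 12.0, 14.0]
print(round(grad_w_chain(0.0, 0.0, xs, ys), 2))   # -26.67
print(round(grad_w_closed(0.0, 0.0, xs, ys), 2))  # -26.67
```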
Calculation of bias gradient:
Similarly, the bias gradient is calculated by the following steps:
$$\frac{dl}{db} = \frac{dl}{de}\cdot\frac{de}{dd}\cdot\frac{dd}{dc}\cdot\frac{dc}{db}$$ – Equation 9
where l = (1/n) * e
e = (d)^2
d = c – y
c = a + b
The partial derivatives of each intermediate variable are the following:
dl/de = 1/n,
de/dd = 2d,
dd/dc = 1,
dc/db = 1
Substituting these values into Equation 9, we obtain the following expression:
$$\frac{dl}{db} = \frac{1}{n}\sum_{i=0}^{n-1}\bigl(2\,d(i)\cdot 1\cdot 1\bigr)$$ – Equation 10
Expanding each intermediate variable, this is equivalent to:
$$\frac{dl}{db} = \frac{2}{n}\sum_{i=0}^{n-1}\bigl(w\,x(i) + b - y(i)\bigr)$$ – Equation 11
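A common sanity check for a hand-derived gradient is to compare it with a finite-difference estimate. The sketch below does this for Equation 11. The helper names and the step size `eps` are assumptions made for illustration, and the inputs are the first batch of the numerical example that follows.

```python
def loss(w, b, xs, ys):
    """Mean squared error of w*x + b over one batch (Equation 3)."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grad_b(w, b, xs, ys):
    """Analytic dl/db (Equation 11)."""
    return (2 / len(xs)) * sum(w * x + b - y for x, y in zip(xs, ys))


def numeric_grad_b(w, b, xs, ys, eps=1e-6):
    """Central finite-difference estimate of dl/db."""
    return (loss(w, b + eps, xs, ys) - loss(w, b - eps, xs, ys)) / (2 * eps)


xs, ys = [0.0, 1.0, 2.0], [10.0, 12.0, 14.0]
print(round(grad_b(0.0, 0.0, xs, ys), 2))
print(round(numeric_grad_b(0.0, 0.0, xs, ys), 2))
```

The two values should agree up to floating-point error, because the loss is quadratic in b.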
Numerical analysis:
Now let’s apply some numbers to demonstrate the above method.
The first step is to create some sample data. For this purpose, we use the following reference equation relating the input (X) to the true output (Y):
y = 2 * x + 10 – Equation 12
Therefore, the goal of the optimization algorithm is to find weight (w) and bias (b) values that are as close as possible to 2 and 10, respectively.
Based on this reference equation, we create a sample of input and output data as follows:
X = (0,1,2,3,4,5,6,7,8)
Y = (10,12,14,16,18,20,22,24,26) (based on Equation 12)
Now we need to assign initial values to w and b for the algorithm to start the calculation process. Let’s set w = 0 and b = 0. Additionally, we assume a learning rate (lr) = 0.01 and a batch size of 3. By choosing a batch size of 3, our X and Y values are split into the following batches:
Batch 1:
x: 0,1,2
y: 10,12,14
Batch 2:
x: 3,4,5
y: 16,18,20
Batch 3:
x: 6,7,8
y: 22,24,26
Each batch is run serially, starting from Batch 1.
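The split into batches can be sketched in a few lines of plain Python. The helper `make_batches` is a hypothetical name; it slices the data in order, without shuffling, which matches the serial batches listed above.

```python
def make_batches(xs, ys, batch_size):
    """Split paired inputs and outputs into consecutive batches, in order."""
    return [
        (xs[i:i + batch_size], ys[i:i + batch_size])
        for i in range(0, len(xs), batch_size)
    ]


X = [0, 1, 2, 3, 4, 5, 6, 7, 8]
Y = [2 * x + 10 for x in X]  # reference relation from Equation 12
for xb, yb in make_batches(X, Y, batch_size=3):
    print(xb, yb)
```

In practice, SGD usually draws batches at random; the fixed order here simply keeps the hand calculation reproducible.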
By applying the calculations to the data set and assumptions above, we obtain the following values for each batch:
Batch 1:
n: 3
dl/dw: -26.67
dl/db: -24
lr: 0.01
w1: 0.267
b1: 0.24
loss: 146.67
Batch 2:
n: 3
dl/dw: -135.86
dl/db: -33.39
lr: 0.01
w2: 1.63
b2: 0.57
loss: 280.67
Batch 3:
n: 3
dl/dw: -169.19
dl/db: -24.10
lr: 0.01
w3: 3.32
b3: 0.81
loss: 145.28
For each batch, the gradients, parameter updates, and loss were calculated according to the following equations (the loss is evaluated before that batch's update is applied):
dl/dw according to Equation 8
dl/db according to Equation 11
w(i) according to Equation 4
b(i) according to Equation 5
loss according to Equation 3
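The per-batch arithmetic can be reproduced without any framework. The following plain-Python sketch applies Equations 3, 4, 5, 8, and 11 to each batch in turn; the function name `train_one_epoch` and the layout of its return value are illustrative choices.

```python
def train_one_epoch(xs, ys, batch_size, lr, w=0.0, b=0.0):
    """Run mini-batch SGD for one pass over the data; return per-batch records."""
    history = []
    for start in range(0, len(xs), batch_size):
        xb = xs[start:start + batch_size]
        yb = ys[start:start + batch_size]
        n = len(xb)
        d = [w * x + b - y for x, y in zip(xb, yb)]                # residuals
        loss = sum(d_i ** 2 for d_i in d) / n                      # Equation 3
        grad_w = (2 / n) * sum(d_i * x for d_i, x in zip(d, xb))   # Equation 8
        grad_b = (2 / n) * sum(d)                                  # Equation 11
        w -= lr * grad_w                                           # Equation 4
        b -= lr * grad_b                                           # Equation 5
        history.append((loss, grad_w, grad_b, w, b))
    return history


X = [float(x) for x in range(9)]
Y = [2 * x + 10 for x in X]
for i, (loss, gw, gb, w, b) in enumerate(train_one_epoch(X, Y, 3, 0.01), start=1):
    print(f"Batch {i}: loss={loss:.2f} dl/dw={gw:.2f} dl/db={gb:.2f} "
          f"w={w:.3f} b={b:.3f}")
```

Rounded to two decimals, the printed values agree with the per-batch figures listed above.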
Verification:
We broke down all the calculations by hand and ran three batches to obtain the values above. To confirm their accuracy, let’s verify the results using the same optimization technique in PyTorch with the identical data set. The code snippet is shown below:
    import torch
    from torch import optim
    from torch.utils.data import TensorDataset, DataLoader
    import torch.nn as nn

    # 1. Sample data (9 instances, 1 feature) generated from y = 2x + 10
    x = (0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)
    y = (10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0)
    tensor_x = torch.tensor(x)
    tensor_y = torch.tensor(y)
    dataset = TensorDataset(tensor_x, tensor_y)
    # shuffle=False keeps the batches in the same order as the manual calculation
    dataloader = DataLoader(dataset, batch_size=3, shuffle=False)

    # 2. Define hyper-parameters
    lr = 0.01   # Learning rate
    epochs = 1  # A single epoch, for illustrative purposes

    # 3. Define model (1 linear layer: y = wx + b), with w and b starting at 0
    class Simple_model(nn.Module):
        def __init__(self):
            super().__init__()
            self.weight = nn.Parameter(torch.zeros(1))
            self.bias = nn.Parameter(torch.zeros(1))

        def forward(self, xb):
            return xb * self.weight + self.bias

    model = Simple_model()

    # Define loss metric and optimizer
    criterion = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=lr)

    # 4. Training loop
    for epoch in range(epochs):
        for batch_idx, (data, target) in enumerate(dataloader):
            # Print starting gradients, weight, and bias for the present batch
            print(f"---------- Epoch {epoch}, Batch {batch_idx}----------")
            print(f"Starting W_grad: {model.weight.grad}, "
                  f"Starting B_grad: {model.bias.grad}, "
                  f"Starting Weight: {model.weight.item():.4f}, "
                  f"Starting Bias: {model.bias.item():.4f}")

            # Forward pass
            output = model(data)
            loss = criterion(output, target)

            # Backward pass (gradient calculation)
            loss.backward()

            # Update weights
            optimizer.step()

            # Access gradients for this batch
            weight_grad = model.weight.grad
            bias_grad = model.bias.grad

            print(f"Updated W_grad: {weight_grad.item():.4f}, "
                  f"Updated B_grad: {bias_grad.item():.4f}, "
                  f"Loss: {loss.item():.4f}, "
                  f"Updated Weight: {model.weight.item():.4f}, "
                  f"Updated Bias: {model.bias.item():.4f}")

            # Reset gradients so they do not accumulate into the next batch
            optimizer.zero_grad()
Running the above code reproduces, for each batch, the results calculated above.





