Backpropagation is a well-established approach to training a neural network. The main goal of training a neural network is to determine the optimum values of the connection weights (parameters) of the network. The following equation is used to update a connection weight:

$$w_i \leftarrow w_i - \alpha \, \frac{\partial E(w_i)}{\partial w_i}$$

where $w_i$ is the connection weight and $\alpha$ is the learning rate. The learning rate $\alpha$ controls the magnitude of the change in the connection weight at each step of the training process.

In the above equation, the partial derivative of $E(w_i)$ w.r.t. $w_i$ gives the direction in which the weight should be modified to reduce the error. The only computationally expensive operation in the equation is computing this partial derivative. Without the backpropagation approach, the computational complexity of the partial derivative increases exponentially with the number of layers of the neural network, because the same intermediate terms are recomputed again and again. Using the backpropagation technique, the partial derivative of the error w.r.t. each connection weight can be computed efficiently.
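As a minimal sketch (my own illustration, not part of the original post), the update rule above can be written in plain Python; `grad_E` is a stand-in name for whatever routine returns the partial derivatives $\frac{\partial E}{\partial w_i}$:

```python
def gradient_descent_step(weights, grad_E, alpha=0.1):
    """One update of every connection weight: w_i <- w_i - alpha * dE/dw_i.

    weights : list of floats (the w_i)
    grad_E  : function mapping the weight list to the list of partial
              derivatives dE/dw_i (backpropagation is what makes this cheap)
    alpha   : learning rate controlling the step size
    """
    grads = grad_E(weights)
    return [w - alpha * g for w, g in zip(weights, grads)]
```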


Refresher: Intuitive meaning of derivative and partial derivative

Let’s assume $t$ is the time allocated by a person for exercise and $W$ is the weight of the person. The derivative of $W$ w.r.t. $t$ tells us how much the weight will change if the person makes a small change in the time allocated for exercise. Mathematically, the derivative of $W$ w.r.t. $t$ is represented as $\frac{dW}{dt}$.

We know that diet ($d$) also has an effect on the weight of the person. The change in the weight of the person w.r.t. a small change in the time allocated for exercise, keeping the diet the same, is given by the partial derivative of $W$ w.r.t. $t$. Mathematically, the partial derivative of $W$ w.r.t. $t$ is represented as $\frac{\partial W}{\partial t}$. Similarly, if the person keeps the exercise time the same and makes a small change in the diet, the effect is given by the partial derivative of $W$ w.r.t. $d$, which is represented as $\frac{\partial W}{\partial d}$.
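As a toy illustration of this refresher (my own, with a made-up `body_weight` function used purely for illustration), a partial derivative can be approximated numerically by nudging one variable while holding the other fixed:

```python
def body_weight(exercise_time, diet_calories):
    # Hypothetical toy model: more exercise lowers weight, more calories raise it.
    return 80.0 - 2.0 * exercise_time + 0.01 * diet_calories

def partial_wrt_exercise(t, d, eps=1e-6):
    # Finite-difference approximation of dW/dt with diet held constant.
    return (body_weight(t + eps, d) - body_weight(t, d)) / eps

def partial_wrt_diet(t, d, eps=1e-6):
    # Finite-difference approximation of dW/dd with exercise time held constant.
    return (body_weight(t, d + eps) - body_weight(t, d)) / eps

print(partial_wrt_exercise(1.0, 2000.0))  # about -2.0
print(partial_wrt_diet(1.0, 2000.0))      # about 0.01
```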



Figure 1: Toy neural network

Figure 1 shows the simplest possible neural network, which is good enough to understand the details of the backpropagation algorithm. In the figure, the input layer is layer-1 ($L_1$), the hidden layer is layer-2 ($L_2$) and the output layer is layer-3 ($L_3$). $x_1$ and $x_2$ are the inputs to the neural network, the blue nodes are the nodes of the hidden layer and the red node is the output node. There are six connection weights, $w_1$ to $w_6$. In the training process we want to determine the values of the connection weights which minimize the error. In this post a regression error is minimized to explain the backpropagation technique, but the idea remains the same for a classification problem as well.

Square Error:

$$E = \frac{1}{2}\,(y - \hat{y})^2$$

where $y$ is the target output and $\hat{y}$ is the output produced by the network (the activation of the red output node).

As a constant does not affect the optimization problem, the square of the error is divided by 2 just to make the computation clean.
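To see concretely why the factor of 1/2 is convenient (a one-line derivation added here, not in the original), differentiate the squared error w.r.t. the prediction $\hat{y}$: the 2 from the power rule cancels the 1/2,

$$\frac{\partial E}{\partial \hat{y}} = \frac{\partial}{\partial \hat{y}}\left[\frac{1}{2}(y - \hat{y})^2\right] = \frac{1}{2}\cdot 2\,(y - \hat{y})\cdot(-1) = -(y - \hat{y})$$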

In Figure 1, each node (red/blue) represents three computations: multiplication of each input by a weight, summation of all the multiplication results, and finally the activation score, which is a non-linearity applied to the summation. Figure 2 shows a clear and detailed view of the neural network, consisting of multipliers, summation units and non-linearity units. The combination of a multiplier, a summation and a non-linearity function in Figure 2 is equivalent to a node in Figure 1.


Figure 2: Equivalent detailed view of the toy neural network

$z_1^{(2)}$ is the z of node-1 in layer-2 (note: subscript => item number and superscript => layer number) and $a_1^{(2)}$ is the activation score of node-1 in layer-2. The following equations are used to compute $z_1^{(2)}$ and $a_1^{(2)}$ in Figure 2:

$$z_1^{(2)} = w_1 x_1 + w_2 x_2$$

$$a_1^{(2)} = f(z_1^{(2)})$$

where $f(\cdot)$ is a non-linearity function. Usually, the non-linear function $f(\cdot)$ is sigmoid, tanh or ReLU.
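Here is a small Python sketch of this forward computation (my own; it assumes $w_1, w_2$ feed hidden node-1, $w_3, w_4$ feed hidden node-2, $w_5, w_6$ feed the output node, and uses sigmoid as $f$ — the figure may number the weights differently):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w):
    """Forward pass of the toy network in Figure 2.

    w is a dict {1: w1, ..., 6: w6}. Returns every intermediate value,
    because the backward pass will reuse them.
    """
    z1_2 = w[1] * x1 + w[2] * x2      # summation of node-1, layer-2
    a1_2 = sigmoid(z1_2)              # activation of node-1, layer-2
    z2_2 = w[3] * x1 + w[4] * x2      # summation of node-2, layer-2
    a2_2 = sigmoid(z2_2)              # activation of node-2, layer-2
    z1_3 = w[5] * a1_2 + w[6] * a2_2  # summation of the output node (layer-3)
    a1_3 = sigmoid(z1_3)              # network output y_hat
    return {"z1_2": z1_2, "a1_2": a1_2, "z2_2": z2_2, "a2_2": a2_2,
            "z1_3": z1_3, "a1_3": a1_3}
```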

Backpropagation is all about computing partial derivatives efficiently. Using backpropagation, we can calculate $\frac{\partial E}{\partial w_i}$ effectively, where $i \in \{1, 2, 3, 4, 5, 6\}$.

The partial derivative of the error with respect to a weight gives the amount by which the error will change if there is a small change in the weight. To calculate the partial derivative of the error w.r.t. a weight, the chain rule is used. For example, for $w_1$:

$$\frac{\partial E}{\partial w_1} = \frac{\partial E}{\partial a_1^{(3)}} \cdot \frac{\partial a_1^{(3)}}{\partial z_1^{(3)}} \cdot \frac{\partial z_1^{(3)}}{\partial a_1^{(2)}} \cdot \frac{\partial a_1^{(2)}}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial w_1}$$
Coming up with this expression directly from Figure 1 can be confusing. However, if we construct a detailed view of the neural network as in Figure 2, then getting the expression is simply a matter of following the chain rule from the output to the input (note that this is backward). For $w_2$ to $w_6$, the partial derivatives of the error with respect to the weights can be calculated in the same way.

OMG!! The expression for such a simple neural network is so complex. There will be a total of six partial derivatives, so how can we compute them efficiently? The answer is dynamic programming. Dynamic programming is the fancy name for **memorization and reutilization of computation**. We use the chain rule to get the expressions for the partial derivatives of the error w.r.t. the weights, and dynamic programming is used to compute these complex expressions efficiently.
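To make "memorization and reutilization of computation" concrete, here is a generic aside of my own (not tied to the network above) contrasting naive recomputation with memoized reuse:

```python
from functools import lru_cache

def fib_naive(n):
    # Recomputes the same subproblems over and over (exponential time),
    # just like expanding every chain-rule product from scratch.
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_memo(n):
    # Each subproblem is computed once and reused (linear time),
    # just like computing each delta once and reusing it.
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)
```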


### Layer-3

For ease of representation we represent $\frac{\partial E}{\partial z_1^{(3)}}$ as $\delta_1^{(3)}$. Note that $\delta_1^{(3)}$ and $z_1^{(3)}$ have the same subscript (item number) and superscript (layer number). Recalling that the network output $\hat{y}$ is $a_1^{(3)}$,

$$\delta_1^{(3)} = \frac{\partial E}{\partial z_1^{(3)}} = \frac{\partial E}{\partial a_1^{(3)}} \cdot \frac{\partial a_1^{(3)}}{\partial z_1^{(3)}} = -(y - a_1^{(3)})\, f'(z_1^{(3)})$$

To evaluate $\frac{\partial E}{\partial w_5}$ and $\frac{\partial E}{\partial w_6}$, the computation of $\delta_1^{(3)}$ can be utilized (memorized and reutilized), hence the repeated computation is avoided.

$$\frac{\partial E}{\partial w_5} = \delta_1^{(3)} \cdot \frac{\partial z_1^{(3)}}{\partial w_5} = \delta_1^{(3)}\, a_1^{(2)}$$

First the chain rule is used to obtain the expression, and then the computation of $\delta_1^{(3)}$ is utilized to compute $\frac{\partial E}{\partial w_5}$. Following similar steps, $\frac{\partial E}{\partial w_6} = \delta_1^{(3)}\, a_2^{(2)}$ can be computed.
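Continuing the Python sketch started after the forward pass (sigmoid assumed as $f$, so $f'(z) = f(z)\,(1 - f(z))$), the layer-3 computation might look like this:

```python
def layer3_gradients(cache, y):
    """Output-layer part of the backward pass for the toy network.

    cache : dict of intermediate values returned by forward()
    y     : target output
    Returns delta_1_3 plus dE/dw5 and dE/dw6.
    """
    a1_3 = cache["a1_3"]
    # dE/da = -(y - a); for sigmoid, da/dz = a * (1 - a)
    delta_1_3 = -(y - a1_3) * a1_3 * (1.0 - a1_3)
    dE_dw5 = delta_1_3 * cache["a1_2"]   # dz1_3/dw5 = a1_2
    dE_dw6 = delta_1_3 * cache["a2_2"]   # dz1_3/dw6 = a2_2
    return delta_1_3, dE_dw5, dE_dw6
```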

### Layer-2

The computations of layer-3 are reused for the computations of layer-2. Note that the computation starts from the output and moves toward the input using the chain rule. The reutilization of the computation can be seen as message passing in the backward direction.

$$\delta_1^{(2)} = \frac{\partial E}{\partial z_1^{(2)}} = \delta_1^{(3)} \cdot \frac{\partial z_1^{(3)}}{\partial a_1^{(2)}} \cdot \frac{\partial a_1^{(2)}}{\partial z_1^{(2)}} = \delta_1^{(3)}\, w_5\, f'(z_1^{(2)})$$

Similarly,

$$\delta_2^{(2)} = \frac{\partial E}{\partial z_2^{(2)}} = \delta_1^{(3)}\, w_6\, f'(z_2^{(2)})$$

It is observed that the term $\delta_1^{(3)}$, which appears in the computation of $\delta_1^{(2)}$, is memorized and reutilized in the computation of $\delta_2^{(2)}$.
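And the layer-2 step, again as my own sketch building on the functions above; note that `delta_1_3` is computed once in the layer-3 step and reused for both hidden-node deltas:

```python
def layer2_gradients(cache, w, delta_1_3, x1, x2):
    """Hidden-layer part of the backward pass for the toy network.

    delta_1_3 (computed once in the layer-3 step) is reused for both
    hidden-node deltas -- this reuse is the dynamic-programming trick.
    """
    a1_2, a2_2 = cache["a1_2"], cache["a2_2"]
    delta_1_2 = delta_1_3 * w[5] * a1_2 * (1.0 - a1_2)  # sigmoid derivative
    delta_2_2 = delta_1_3 * w[6] * a2_2 * (1.0 - a2_2)
    return {
        1: delta_1_2 * x1,  # dE/dw1
        2: delta_1_2 * x2,  # dE/dw2
        3: delta_2_2 * x1,  # dE/dw3
        4: delta_2_2 * x2,  # dE/dw4
    }
```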

Let’s calculate the partial derivative of the error with respect to weight $w_1$.

Let’s say $\delta_1^{(3)} = \frac{\partial E}{\partial a_1^{(3)}} \cdot \frac{\partial a_1^{(3)}}{\partial z_1^{(3)}}$ and $\delta_1^{(2)} = \delta_1^{(3)} \cdot \frac{\partial z_1^{(3)}}{\partial a_1^{(2)}} \cdot \frac{\partial a_1^{(2)}}{\partial z_1^{(2)}}$. Then,

$$\frac{\partial E}{\partial w_1} = \frac{\partial E}{\partial a_1^{(3)}} \cdot \frac{\partial a_1^{(3)}}{\partial z_1^{(3)}} \cdot \frac{\partial z_1^{(3)}}{\partial a_1^{(2)}} \cdot \frac{\partial a_1^{(2)}}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial w_1} = \delta_1^{(2)} \cdot \frac{\partial z_1^{(2)}}{\partial w_1}$$

We started with the full chain-rule expression, which looks like a lot of computation. But, utilizing dynamic programming, the computation in each layer is minimized. Now,

$$\frac{\partial E}{\partial w_1} = \delta_1^{(2)}\, x_1$$

which looks much simpler and is also computationally very efficient.

Similarly, the expressions for the partial derivatives of the error w.r.t. the remaining weights can be computed. We observed that if we break a neural network down into a detailed diagram as in Figure 2, it becomes easier to come up with the chain rule expressions. Once we have the expressions, the long and complex computations can be carried out efficiently using dynamic programming.
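To tie the sketches together, the snippet below (my own, using the hypothetical `forward`, `layer3_gradients` and `layer2_gradients` functions defined earlier, with arbitrary example values) collects all six partial derivatives and checks each one against a finite-difference estimate:

```python
def all_gradients(x1, x2, w, y):
    # Forward pass once, then reuse the cached values in the backward pass.
    cache = forward(x1, x2, w)
    delta_1_3, dE_dw5, dE_dw6 = layer3_gradients(cache, y)
    grads = layer2_gradients(cache, w, delta_1_3, x1, x2)
    grads[5], grads[6] = dE_dw5, dE_dw6
    return grads

def numerical_gradient(x1, x2, w, y, i, eps=1e-6):
    # Finite-difference check: nudge w_i and see how the error changes.
    error = lambda ws: 0.5 * (y - forward(x1, x2, ws)["a1_3"]) ** 2
    w_plus = dict(w)
    w_plus[i] += eps
    return (error(w_plus) - error(w)) / eps

w = {1: 0.1, 2: -0.2, 3: 0.4, 4: 0.3, 5: -0.5, 6: 0.2}
grads = all_gradients(0.7, 0.3, w, y=1.0)
for i in range(1, 7):
    # The two numbers printed per weight should agree to several decimal places.
    print(i, grads[i], numerical_gradient(0.7, 0.3, w, 1.0, i))
```

If the backpropagated gradients match the numerical estimates, the chain-rule bookkeeping above is consistent.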