Demystifying Deep Learning: Part 11

Backpropagation through, well, anything!

17th September 2018

Introduction

So far in this series, we have looked at the general principle of gradient descent, computed backpropagation for each layer in a feedforward neural network, and then generalised this to the different types of layers in a CNN.

Now we will take a step back and look at backpropagation in a more general sense - through a computation graph. Through this we'll get a general intuition for how deep learning frameworks compute gradients automatically.

We'll use the LSTM cell as our motivating example, continuing the task of sentiment analysis on the IMDB review dataset - you can find the code in the accompanying Jupyter notebook.

General Backpropagation Principles

Let's look back at the principles we've used in this series:

  • Partial Derivative Intuition: Think of $\frac{\partial{y}}{\partial{x}}$ loosely as quantifying how much $y$ would change if you gave the value of $x$ a little "nudge" at that point.

  • Breaking down computations - we can use the chain rule to aid us in our computation - rather than trying to compute the derivative in one fell swoop, we break up the computation into smaller intermediate steps.

  • Computing the chain rule - when thinking about which intermediate values to include in our chain rule expression, think about the immediate outputs of equations involving $x$ - which other values get directly affected when we slightly nudge $x$?

  • One element at a time - rather than worrying about the entire matrix $A$, we'll instead look at an element $A_{ij}$. One equation we will refer to time and time again is:

    $$C_{ij} = \sum_k A_{ik}B_{kj} \iff C = A.B$$

    A useful tip when trying to go from one element to a matrix is to look for summations over repeated indices (here it was $k$) - this suggests a matrix multiplication.

    Another useful equation is the element-wise product of two matrices:

    $$C_{ij} = A_{ij}B_{ij} \iff C = A*B$$

  • Sanity check the dimensions - check that the dimensions of the matrices all match (the derivative matrix should have the same dimensions as the original matrix, and all matrices being multiplied together should have dimensions that align).
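The two matrix identities above are easy to verify numerically. Here is a quick NumPy sanity check (illustrative only, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
D = rng.standard_normal((3, 4))

# C_ij = sum_k A_ik B_kj  <=>  C = A.B (matrix multiplication)
C_loop = np.zeros((3, 5))
for i in range(3):
    for j in range(5):
        for k in range(4):
            C_loop[i, j] += A[i, k] * B[k, j]
assert np.allclose(C_loop, A.dot(B))

# C_ij = A_ij B_ij  <=>  C = A * B (elementwise product)
E_loop = np.array([[A[i, j] * D[i, j] for j in range(4)] for i in range(3)])
assert np.allclose(E_loop, A * D)
```

Spotting the summation over the repeated index $k$ in the triple loop is exactly the cue that the whole computation collapses into a single matrix multiplication.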

A computation graph allows us to clearly break down the computations, as well as see the immediate outputs when computing our chain rule.

We will use the computation graph representation (shown above) of the LSTM to compute the gradients using backpropagation through time.

The LSTM Computation Graph

Forward Propagation equations

From the previous post, the forward propagation equations for one timestep in the LSTM are:

$$\Gamma_i = \sigma(W_i.[a^{<t-1>}, x^{<t>}]+b_i)$$

$$\Gamma_f = \sigma(W_f.[a^{<t-1>}, x^{<t>}]+b_f)$$

$$\Gamma_o = \sigma(W_o.[a^{<t-1>}, x^{<t>}]+b_o)$$

$$\tilde{c}^{<t>} = \tanh(W_c.[a^{<t-1>}, x^{<t>}]+b_c)$$

$$c^{<t>} = \Gamma_i*\tilde{c}^{<t>} + \Gamma_f*c^{<t-1>}$$

$$a^{<t>} = \Gamma_o*\tanh c^{<t>}$$

Notation used:

$[a^{<t-1>}, x^{<t>}]$ denotes a concatenation of the two matrices to form a $(n_a+n_x) \times m$ matrix. $A.B$ denotes matrix multiplication of $A$ and $B$, whereas $A*B$ denotes elementwise multiplication. $\Gamma$ refers to a gate - see the previous post defining the LSTM for a full breakdown of the notation used.

To backpropagate through the cell, given the gradients with respect to $a^{<t>}$ and $c^{<t>}$ from the backprop through the next timestep, we need to compute the gradients for each of the weights $W_i, W_f, W_o, W_c$ and biases $b_i, b_f, b_o, b_c$, and finally we will need to backpropagate to the previous timestep by computing the gradients with respect to $a^{<t-1>}$ and $c^{<t-1>}$.

This is a lot of partial derivatives to compute - and as our neural networks get more complicated, there will be even more partial derivatives to calculate.

How can we use our computation graph to break this down?

Firstly, since the gate equations are identical in form, we can combine them: instead we have a $3n_a \times m$ matrix $\Gamma$ containing the three gates' outputs, where the first third of the matrix is $\Gamma_i$ and the other two thirds are $\Gamma_f$ and $\Gamma_o$ respectively. Then we have one $3n_a \times (n_a + n_x)$ matrix of weights for the gates, $W_g$, and one $3n_a \times 1$ bias vector $b_g$.

Next, we want to document all intermediate stages in calculation (every node in the graph).

So, walking through the computation graph node-by-node in the forward step:

  • We concatenate $a^{<t-1>}$ and $x^{<t>}$ to form the $(n_a + n_x) \times m$ concatenated input matrix $[a^{<t-1>}, x^{<t>}]$.

  • We calculate the weighted input matrix for the gates, $Z_g$, using the weights $W_g$ and bias $b_g$.

  • Likewise we calculate the weighted input $Z_c$ for the candidate memory $\tilde{c}^{<t>}$ using the weight matrix $W_c$ and bias $b_c$.

(NB: the diagram uses one weight matrix $W$, but it helps to think about these weights separately because of the different activation functions used.)

  • We apply the sigmoid activation function to $Z_g$ to get the gate matrix $\Gamma$ (denoted by f, i, o in the diagram), and the tanh activation function to $Z_c$ to get $\tilde{c}^{<t>}$ (denoted by g in the diagram).

  • Since elementwise multiplication and addition are straightforward operations, for brevity we won't give the intermediate outputs $\Gamma_i*\tilde{c}^{<t>}$ and $\Gamma_f*c^{<t-1>}$ their own symbols.

  • Let us denote the $\tanh c^{<t>}$ intermediate output as $\tilde{a}^{<t>}$.

Now that we have broken the computation graph down into steps and added our intermediate variables, we have the equations:

  1. $Z_g = W_g.[a^{<t-1>}, x^{<t>}]+b_g$

  2. $Z_c = W_c.[a^{<t-1>}, x^{<t>}]+b_c$

  3. $\Gamma = \sigma(Z_g)$

  4. $\tilde{c}^{<t>} = \tanh Z_c$

  5. $c^{<t>} = \Gamma_i*\tilde{c}^{<t>} + \Gamma_f*c^{<t-1>}$

  6. $\tilde{a}^{<t>} = \tanh c^{<t>}$

  7. $a^{<t>} = \Gamma_o*\tilde{a}^{<t>}$

These equations correspond to the nodes in the graph - the left-hand-side variable is the output edge of the node, and the right-hand-side variables are the input edges to the node.
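The seven forward equations can be sketched as a single NumPy function. This is a minimal sketch, not the notebook's code: the parameter names `Wg`, `bg`, `Wc`, `bc`, the cache layout, and the gate ordering (input, forget, output thirds) are chosen to mirror the `backward_step` code later in the post:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_step(a_prev, c_prev, x_t, parameters):
    """One LSTM timestep, following equations 1-7 above."""
    n_a = a_prev.shape[0]

    # (1)-(2): concatenate the inputs and compute the weighted inputs
    input_concat = np.concatenate((a_prev, x_t), axis=0)          # (n_a+n_x) x m
    Z_g = parameters["Wg"].dot(input_concat) + parameters["bg"]   # 3*n_a x m
    Z_c = parameters["Wc"].dot(input_concat) + parameters["bc"]   # n_a x m

    # (3)-(4): sigmoid for the gates, tanh for the candidate memory
    IFO_gates = sigmoid(Z_g)      # rows [0:n_a]=i, [n_a:2n_a]=f, [2n_a:]=o
    c_candidate = np.tanh(Z_c)

    # (5): new cell state; (6)-(7): new hidden state
    c_next = IFO_gates[:n_a] * c_candidate + IFO_gates[n_a:2*n_a] * c_prev
    a_next = IFO_gates[2*n_a:] * np.tanh(c_next)

    # cache everything the backward pass will need
    cache = (a_next, c_next, input_concat, c_prev, c_candidate, IFO_gates)
    return a_next, c_next, cache
```

Note how each line maps to exactly one node of the graph, and the cache stores each node's output edge for reuse during backprop.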

Backpropagation in a Computation Graph:

These equations allow us to clearly see the immediate outputs with respect to a variable when computing the chain rule - e.g. for $[a^{<t-1>}, x^{<t>}]$ the immediate outputs are $Z_g$ and $Z_c$.

Looking at the equations / computation graph, we can group the nodes by type of operation, and then reuse the same identities:

  • Addition: if $C = A + B$ then $\frac{\partial{C}}{\partial{A}} = 1$ and $\frac{\partial{C}}{\partial{B}} = 1$

  • Elementwise multiplication: if $C = A * B$ then $\frac{\partial{C}}{\partial{A}} = B$ and $\frac{\partial{C}}{\partial{B}} = A$

  • tanh: if $C = \tanh A$ then $\frac{\partial{C}}{\partial{A}} = 1 - \tanh^2 A = 1 - C^2$

  • sigmoid: if $C = \sigma(A)$ then $\frac{\partial{C}}{\partial{A}} = \sigma(A) - \sigma^2(A) = C*(1-C)$

  • Weighted input: this is the same equation as in the feedforward neural network. If $Z = W.X + b$ then:

$$\frac{\partial{J}}{\partial{W}} = \frac{1}{m} \frac{\partial{J}}{\partial{Z}}.X^{T}$$

$$\frac{\partial{J}}{\partial{b}} = \frac{1}{m} \sum_{i=1}^{m}\frac{\partial{J}}{\partial{Z^{(i)}}}$$

$$\frac{\partial{J}}{\partial{X}} = W^{T}.\frac{\partial{J}}{\partial{Z}}$$
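The three weighted-input identities translate directly into a few lines of NumPy. Below is a sketch (the helper name `dense_backward` is mine, not from the post), assuming the post's convention that the $m$ training examples are the columns of $X$:

```python
import numpy as np

def dense_backward(dZ, W, X):
    """Given dJ/dZ for Z = W.X + b, return the gradients w.r.t. W, b and X."""
    m = X.shape[1]                                    # batch size (columns)
    dW = (1 / m) * dZ.dot(X.T)                        # dJ/dW = (1/m) dJ/dZ . X^T
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)  # dJ/db sums over examples
    dX = W.T.dot(dZ)                                  # dJ/dX = W^T . dJ/dZ
    return dW, db, dX
```

Each output has the same shape as the thing it is the gradient of - the dimension sanity check from the principles list above.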

Armed with these general computation graph principles, we can apply the chain rule, elementwise multiplying (*) the partial derivatives:

$$\frac{\partial{J}}{\partial{A}} = \frac{\partial{J}}{\partial{B}}*\frac{\partial{B}}{\partial{A}}$$

Also note that we sum the partial derivatives coming from each of the immediate outputs.

So if $B = f(A)$ and $C = g(A)$, i.e. $B$ and $C$ are immediate outputs of $A$ in the computation graph, then we sum the partial derivatives:

$$\frac{\partial{J}}{\partial{A}} = \frac{\partial{J}}{\partial{B}}*\frac{\partial{B}}{\partial{A}} + \frac{\partial{J}}{\partial{C}}*\frac{\partial{C}}{\partial{A}}$$
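As a toy illustration of the summing rule (my example, not from the post): take $B = 2A$ and $C = A^2$ with $J = B + C$, so the two backward contributions should sum to $\frac{\partial{J}}{\partial{A}} = 2 + 2A$:

```python
A = 3.0
B = 2 * A        # immediate output 1: dB/dA = 2
C = A ** 2       # immediate output 2: dC/dA = 2A
J = B + C        # addition identity: dJ/dB = 1 and dJ/dC = 1

# Chain rule: sum the contributions flowing back through B and C
dJ_dA = 1 * 2 + 1 * (2 * A)   # = 8.0

# Numerical check: nudge A a little and watch how much J changes
eps = 1e-6
J_nudged = 2 * (A + eps) + (A + eps) ** 2
assert abs((J_nudged - J) / eps - dJ_dA) < 1e-4
```

Dropping either contribution would give the wrong gradient - which is exactly why $\frac{\partial{J}}{\partial{c^{<t>}}}$ below picks up terms from both $c^{<t+1>}$ and $a^{<t>}$.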

In a deep learning framework like TensorFlow or Keras, there will be identities like this for each of the differentiable operations.

Backpropagation Through Time in an LSTM Cell

When trying to compute $\frac{\partial{J}}{\partial{A}}$, we'll use the general equation:

$$\frac{\partial{J}}{\partial{A}} = \frac{\partial{J}}{\partial{B}}*\frac{\partial{B}}{\partial{A}}$$

For brevity, we'll substitute the value of $\frac{\partial{B}}{\partial{A}}$ using the operations' identities above.

The equations are thus as follows:

From equation 7:

$$\frac{\partial{J}}{\partial{\tilde{a}^{<t>}}} = \frac{\partial{J}}{\partial{a^{<t>}}}*\Gamma_o$$

$$\frac{\partial{J}}{\partial{\Gamma_o}} = \frac{\partial{J}}{\partial{a^{<t>}}}*\tilde{a}^{<t>}$$

Using equation 6, and writing equation 5 as an equation for $c^{<t+1>}$ instead of $c^{<t>}$ (i.e. adding 1 to the timestep):

$$\frac{\partial{J}}{\partial{c^{<t>}}} = \frac{\partial{J}}{\partial{c^{<t+1>}}}*\Gamma_f + \frac{\partial{J}}{\partial{\tilde{a}^{<t>}}}*(1-(\tilde{a}^{<t>})^2)$$

Also using equation 5:

$$\frac{\partial{J}}{\partial{\tilde{c}^{<t>}}} = \frac{\partial{J}}{\partial{c^{<t>}}}*\Gamma_i$$

$$\frac{\partial{J}}{\partial{\Gamma_i}} = \frac{\partial{J}}{\partial{c^{<t>}}}*\tilde{c}^{<t>}$$

$$\frac{\partial{J}}{\partial{\Gamma_f}} = \frac{\partial{J}}{\partial{c^{<t>}}}*c^{<t-1>}$$

From equations 3 and 4 respectively:

$$\frac{\partial{J}}{\partial{Z_g}} = \frac{\partial{J}}{\partial{\Gamma}}*\Gamma*(1-\Gamma)$$

$$\frac{\partial{J}}{\partial{Z_c}} = \frac{\partial{J}}{\partial{\tilde{c}^{<t>}}}*(1-(\tilde{c}^{<t>})^2)$$

Equations 1 and 2 are identical in form, so their partial derivatives are too, differing only in subscript:

$$\frac{\partial{J}}{\partial{W_g}} = \frac{1}{m} \frac{\partial{J}}{\partial{Z_g}}.[a^{<t-1>}, x^{<t>}]^T$$

$$\frac{\partial{J}}{\partial{b_g}} = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial{J}}{\partial{Z_g^{(i)}}}$$

$$\frac{\partial{J}}{\partial{W_c}} = \frac{1}{m} \frac{\partial{J}}{\partial{Z_c}}.[a^{<t-1>}, x^{<t>}]^T$$

$$\frac{\partial{J}}{\partial{b_c}} = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial{J}}{\partial{Z_c^{(i)}}}$$

$$\frac{\partial{J}}{\partial{[a^{<t-1>}, x^{<t>}]}} = W_g^T.\frac{\partial{J}}{\partial{Z_g}} + W_c^T.\frac{\partial{J}}{\partial{Z_c}}$$

So by splitting the computation graph into many nodes, we reduce the calculation to smaller, simpler steps that just use the operations' derivative identities mentioned above.

Code:

The motivating example we've looked at uses an LSTM network for sentiment analysis on a dataset of IMDB reviews. The backward step through one LSTM cell, implementing the equations above, is:

import numpy as np

def backward_step(dA_next, dC_next, cache, parameters):
    #unpack the values cached during the forward pass
    (a_next, c_next, input_concat, c_prev, c_candidate, IFO_gates) = cache
    n_a, m = a_next.shape

    dC_next += dA_next* (IFO_gates[2*n_a:]*(1-np.tanh(c_next)**2))
    #we compute dC<t> in two backward steps since we need both dC<t+1> and dA<t>

    dC_prev = dC_next * IFO_gates[n_a:2*n_a]
    dC_candidate =  dC_next * IFO_gates[:n_a]

    #derivative with respect to the output of the IFO gates - abuse of notation to call it dIFO
    dIFO_gates = np.zeros_like(IFO_gates)
    dIFO_gates[:n_a] = dC_next * c_candidate
    dIFO_gates[n_a:2*n_a]= dC_next * c_prev
    dIFO_gates[2*n_a:] = dA_next * np.tanh(c_next)

    #derivative with respect to the unactivated output of the gate (before the sigmoid is applied)
    dZ_gate =  dIFO_gates* (IFO_gates*(1-IFO_gates))
    dA_prev =  (parameters["Wg"].T).dot(dZ_gate)[:n_a]
    dWg = (1/m)*dZ_gate.dot(input_concat.T)
    dbg = (1/m)*np.sum(dZ_gate,axis=1, keepdims=True)

    dZ_c = dC_candidate * (1-c_candidate**2)
    dA_prev +=  (parameters["Wc"].T).dot(dZ_c)[:n_a]
    dWc = (1/m)*dZ_c.dot(input_concat.T)
    dbc = (1/m)*np.sum(dZ_c,axis=1, keepdims=True)

    return dA_prev, dC_prev, dWg, dbg, dWc, dbc

Practical Considerations:

When checking the equations for the backprop, it helps to have a numerical checker - I've written one in the accompanying Jupyter notebook.
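A generic checker can be sketched like this (a minimal version using central differences; the notebook's own checker may differ):

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-7):
    """Approximate df/dx for a scalar-valued f by nudging each element of x."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + eps          # nudge up
        f_plus = f(x)
        x[idx] = old - eps          # nudge down
        f_minus = f(x)
        x[idx] = old                # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return grad
```

Comparing `np.max(np.abs(analytic - numerical))` against a small tolerance (e.g. 1e-6) is then a quick way to catch a sign error or a missing term in the backprop equations.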

Conclusion

This seems like a good juncture to recap the series so far.

We started the series by looking at the most commonly used terminology, followed by simple machine learning algorithms in linear and logistic regression, building up the intuition and maths behind gradient descent as we worked up to a feedforward neural network.

Next we looked at the learning process itself, and how we could improve gradient descent itself, as well as debug our model to see whether it was learning or not.

Finally, we moved onto more specialised neural networks - CNNs and recurrent neural nets, not only looking at their theory but the motivation behind them. We also looked at the maths behind them, deriving the CNN backprop equations from scratch.

Now that we're at the point that we're able to understand backprop in a general computation graph, we can use the abstractions of the deep learning frameworks in subsequent posts.
