Introduction to RNNs

March 26, 2025

This post aims to give a quick and dirty intro to RNNs, with a natural emphasis on mathematical formulations, so that one has just enough working knowledge to implement a basic model. Traditional neural networks treat each input as independent. But in many real-world scenarios, the order and context of data matter. A sequence of words in a sentence, a time series of stock prices, or a series of sensor readings all have inherent temporal dependencies.

RNNs solve this by introducing a "hidden state" - think of it as the network's working memory. This hidden state acts like a messenger, carrying important information from previous time steps to the current one.

Recall that the traditional feedforward neural network architecture consists of an input layer, some hidden middle layers, and an output layer, with connections between consecutive layers. The RNN architecture adds memory by introducing self-loops at the nodes in the hidden layers, which means that information from previous steps can be fed back into the network.

In particular, this makes them much better at sequential classification tasks than feedforward networks, which are better suited to independent classification tasks (e.g. image classification).

To implement this idea, we have the following setup. Suppose our labeled training data is sequential, i.e. (x^1, y^1), \ldots, (x^T, y^T). We refer to the t^{\text{th}} datapoint as the point at timestep t.

At each t, we have an input given by x^t, a hidden vector h^t, and an output vector \hat y^t (the hat denotes prediction). The hidden layer represents the information that we want to pass from one timestep to the next, and is a function of the input at time t as well as the information from the previous timestep t-1. We can summarize this as follows.

  • x^t \in \mathbb{R}^d = input vector
  • h^t \in \mathbb{R}^{\ell} = hidden vector
  • y^t \in \mathbb{R}^m = output vector

We will describe a simple recurrent network (SRN) to illustrate the basic idea. We model the relationship as follows.

  • h^t = \sigma_h(W_h x^t + U_h h^{t-1} + b_h)
  • \hat y^t = \sigma_y(W_y h^t + b_y)

The functions \sigma_h, \sigma_y are activation functions (e.g. \mathrm{softmax}), W_h, W_y, U_h are the weight matrices, and b_h, b_y are the biases. Collectively, the weights and biases are referred to as the parameters \theta. Intuitively, we are just updating our internal memory h^t as a function of the input at time t and our memory at the previous timestep, and our prediction for y at time t is just a function of our memory at time t.
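To make the recurrence concrete, here is a minimal NumPy sketch of a single SRN step. The helper name srn_step and the use of tanh and sigmoid for \sigma_h and \sigma_y are my own choices for illustration (the post fixes these same choices below); any activations of the right shape would work.

```python
import numpy as np

def srn_step(x_t, h_prev, W_h, U_h, b_h, W_y, b_y):
    """One step of the simple recurrent network: update memory, then predict."""
    h_t = np.tanh(W_h @ x_t + U_h @ h_prev + b_h)       # h^t = sigma_h(W_h x^t + U_h h^{t-1} + b_h)
    y_hat_t = 1.0 / (1.0 + np.exp(-(W_y @ h_t + b_y)))  # y_hat^t = sigma_y(W_y h^t + b_y)
    return h_t, y_hat_t
```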

Simple Cell RNN Forward Pass

At each timestep t, we compute the hidden state h^t and the prediction \hat y^t from the input x^t in the following fashion. We first randomly initialize \theta. For the hidden layer, we use the \sigma_h(z) = \tanh(z) activation function to get an output in the range (-1,1), although other activation functions can also be used. For a vector z we just take \sigma_h(z)_i = \sigma_h(z_i) coordinate-wise, and we have the following dimensions for the weights and biases (although you can just assume we make everything the correct dimension); a small initialization sketch follows the list.

  • W_h \in \mathbb{R}^{\ell \times d} = input weights for hidden layer
  • U_h \in \mathbb{R}^{\ell \times \ell} = hidden layer weights for hidden layer
  • b_h \in \mathbb{R}^{\ell} = bias for hidden layer
  • W_y \in \mathbb{R}^{m \times \ell} = hidden layer weights for prediction
  • b_y \in \mathbb{R}^{m} = bias for prediction
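As promised, here is one way to randomly initialize \theta with the shapes above. The helper name init_params and the small Gaussian scale are assumptions for the sketch, not a prescription from the post.

```python
import numpy as np

def init_params(d, ell, m=1, scale=0.01, seed=0):
    """Randomly initialize theta = (W_h, U_h, b_h, W_y, b_y) with the shapes listed above."""
    rng = np.random.default_rng(seed)
    W_h = scale * rng.standard_normal((ell, d))    # input weights for hidden layer
    U_h = scale * rng.standard_normal((ell, ell))  # hidden-to-hidden weights
    b_h = np.zeros(ell)                            # hidden bias
    W_y = scale * rng.standard_normal((m, ell))    # hidden-to-output weights
    b_y = np.zeros(m)                              # output bias
    return W_h, U_h, b_h, W_y, b_y
```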

Then just as before,

h^t_{\theta} = \sigma_h(W_{h} x^t + U_{h} h^{t-1}_{\theta} + b_h)

We will assume that we are performing binary classification, i.e. y^t \in \{0,1\}. Since the classification is binary, m = 1, and we compute our prediction for y^t from this hidden layer using the sigmoid function \sigma_y(z) = 1/(1+\exp(-z)).

Some example applications include language problems like sentiment analysis: given an embedding of a sequence of words, determine whether it is positive or negative. It's possible to use other output activation functions for different tasks, like the identity for continuous modeling or softmax for multiclass classification.

\hat y^t_{\theta} = \sigma_y(W_{y} h^t_{\theta} + b_y)

We can compute the cross-entropy loss at time t with the following formula.

J^t(\theta) = - y^t \log(\hat y^t_{\theta}) - (1-y^t) \log(1-\hat y^t_{\theta})
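In code the loss at time t is a one-liner; the clipping by a small eps is an implementation safeguard against log(0), not part of the formula.

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy J^t(theta) for a label y in {0,1} and a prediction y_hat in (0,1)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # guard against log(0)
    return -y * np.log(y_hat) - (1.0 - y) * np.log(1.0 - y_hat)
```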

But which timestep t should we use to compute the loss and update the weights? The answer is to compute the loss at each timestep t using a sliding context window of a fixed number of previous timesteps, and perform backpropagation over that window. This is referred to as backpropagation through time.

Backpropagation Through Time (BPTT)

The idea behind backpropagation here is to unfold the RNN through some number of timesteps and run backpropagation in the usual sense. That is, we don't just want to run backprop on the data from time t, but also on the memory carried over from times t-1, t-2, \ldots, t-k.

Fix some k, which we will determine later. Starting at t = k, set h^0 = 0 and forward propagate x^t, \ldots, x^1, h^0 through the unfolded network according to the previous section. That is,

\begin{align*} h^1_{\theta} &= \sigma_h(W_h x^1 + U_h h^{0}_{\theta} + b_h) \\ &\vdots \\ h^t_{\theta} &= \sigma_h(W_h x^t + U_h h^{t-1}_{\theta} + b_h) \\ \hat y^t_{\theta} &= \sigma_y(W_y h^t_{\theta} + b_y) \end{align*}
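A sketch of this unrolled forward pass, under the same assumed helpers as above (tanh hidden activation, sigmoid output, m = 1). The optional h0 argument anticipates the sliding-window scheme below, where a carried-over hidden state replaces h^0 = 0.

```python
import numpy as np

def forward_window(params, xs, h0=None):
    """Forward-propagate a window of inputs x^1, ..., x^k through the unrolled network.
    Returns every hidden state (needed for BPTT) and the prediction at the final step."""
    W_h, U_h, b_h, W_y, b_y = params
    h = np.zeros(U_h.shape[0]) if h0 is None else h0  # h^0 = 0 unless a state is carried in
    hs = [h]
    for x in xs:
        h = np.tanh(W_h @ x + U_h @ h + b_h)          # h^s = sigma_h(W_h x^s + U_h h^{s-1} + b_h)
        hs.append(h)
    y_hat = 1.0 / (1.0 + np.exp(-(W_y @ h + b_y)))    # sigmoid output, shape (m,) = (1,)
    return hs, float(y_hat[0])
```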

Then we compute the gradient of the loss and perform the usual backpropagation (using vectorized notation for the chain rule).

\begin{align*} \frac{\partial J^t(\theta)}{\partial W_y} = \frac{\partial J^t(\theta)}{\partial \hat y^t_{\theta}} \frac{\partial \hat y^t_{\theta}}{\partial W_y} &= \left(-y^t \frac{1}{\hat y^t_{\theta}} + (1-y^t) \frac{1}{1 - \hat y^t_{\theta}}\right) \left(\hat y^t_{\theta} (1-\hat y^t_{\theta})\right) {h^t_{\theta}}^{\top} \\ &= \left(-y^t(1-\hat y^t_{\theta}) + (1-y^t)\hat y^t_{\theta}\right){h^t_{\theta}}^{\top} \\ &= \left(\hat y^t_{\theta} - y^t\right){h^t_{\theta}}^{\top} \in \mathbb{R}^{1 \times \ell} \end{align*}

which implies by a symmetric calculation that

\frac{\partial J^t(\theta)}{\partial b_{y}} = \hat y^t_{\theta} - y^t \in \mathbb{R}

For U_h, we have

\begin{align*} \frac{\partial J^t(\theta)}{\partial U_h} = \frac{\partial J^t(\theta)}{\partial \hat y^t_{\theta}} \frac{\partial \hat y^t_{\theta}}{\partial h^t_{\theta}} \frac{\partial h^t_{\theta}}{\partial U_h} &= (\hat y^t_{\theta} - y^t) W_y^{\top} \frac{\partial h^t_{\theta}}{\partial U_h} \in \mathbb{R}^{\ell \times \ell} \end{align*}

Note \sigma_h'(z) = \tanh'(z) = 1 - \tanh^2(z) = 1 - \sigma_h(z)^2, so for

z^t = W_h x^t + U_h h^{t-1}_{\theta} + b_h

we have

\sigma_h'(z^t)_i = 1 - \sigma_h(z^t)_i^2 = 1 - [h^t_{i;\theta}]^2

We apply \sigma_h entry-wise, so when we take the chain rule, we let \sigma_h' denote the entry-wise derivatives which we just showed how to compute, and take the entry-wise (Hadamard) product \odot. We compute the transpose of the partial to save a lot of ugly transpose notation.

\begin{align*} \left[\frac{\partial h^t_{\theta}}{\partial U_h}\right]^{\top} &= \left[\frac{\partial h^t_{\theta}}{\partial z^t} \odot \frac{\partial z^t}{\partial U_h}\right]^{\top} = \sigma_h'(z^t) \odot \left({h^{t-1}_{\theta}} + U_h\left[\frac{\partial h^{t-1}_{\theta}}{\partial U_h}\right]^{\top}\right) \\ &= \sigma_h'(z^t) \odot \left({h^{t-1}_{\theta}} + U_h\left[\sigma_h'(z^{t-1}) \odot \left({h^{t-2}_{\theta}} + U_h \left[\frac{\partial h^{t-2}_{\theta}}{\partial U_h} \right]^{\top}\right)\right]\right) \\ &= \sum_{\gamma=0}^{k-2} \left[\bigodot_{i=0}^{\gamma} \sigma_h'(z^{t-i})\right] \odot U_h^{\gamma}{h^{t-\gamma-1}_{\theta}} \in \mathbb{R}^{\ell \times 1} \end{align*}

Similar computations give W_h and b_h. Then we just update our parameters, where \eta is our learning rate.

\theta \leftarrow \theta - \eta \nabla_{\theta} J^t(\theta)
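Rather than materializing \partial h^t / \partial U_h explicitly, a standard implementation accumulates the same products by walking backwards through the unrolled steps. Here is a minimal sketch, assuming the forward_window output from above and a loss attached only at the final step of the window; the helper name bptt_step and the in-place SGD update are my own choices.

```python
import numpy as np

def bptt_step(params, xs, hs, y, y_hat, lr=0.1):
    """One truncated-BPTT update. xs: window inputs x^1..x^k; hs: [h^0, ..., h^k] from the
    forward pass; y: binary label at the final step; y_hat: scalar prediction."""
    W_h, U_h, b_h, W_y, b_y = params
    dW_h, dU_h, db_h = np.zeros_like(W_h), np.zeros_like(U_h), np.zeros_like(b_h)

    delta = y_hat - y                     # dJ/d(pre-sigmoid output), a scalar
    dW_y = delta * hs[-1][None, :]        # (y_hat - y) h^T, shape (1, ell)
    db_y = np.array([delta])              # y_hat - y
    dh = delta * W_y[0]                   # dJ/dh^k, shape (ell,)

    for s in range(len(xs) - 1, -1, -1):  # walk back through the unrolled steps
        dz = dh * (1.0 - hs[s + 1] ** 2)  # tanh'(z^s) = 1 - tanh^2(z^s), entry-wise
        dW_h += np.outer(dz, xs[s])
        dU_h += np.outer(dz, hs[s])
        db_h += dz
        dh = U_h.T @ dz                   # send the gradient to the previous hidden state

    # theta <- theta - eta * grad J^t(theta)
    W_y -= lr * dW_y; b_y -= lr * db_y
    W_h -= lr * dW_h; U_h -= lr * dU_h; b_h -= lr * db_h
    return W_h, U_h, b_h, W_y, b_y
```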

Nice! We have successfully performed backprop through time. Note we only did this for t = k, but the same idea applies for t \geq k. That is, start with t = k and input x^k, \ldots, x^1, h^0 = 0. Perform the above forward and backward pass to update the parameters \theta, and set (with our newly updated parameters)

h^1 = h^1_{\theta} = \sigma_h(W_h x^1 + U_h h^{0} + b_h)

Then perform the passes again with input x^{k+1}, \ldots, x^2, h^1 to update \theta, and so on, until we reach the last iteration on x^T, \ldots, x^{T-k+1}, h^{T-k}.
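Putting the pieces together, here is a sketch of this sliding-window training loop using the hypothetical helpers from earlier (init_params, forward_window, bptt_step). After each update, the carried hidden state is recomputed with the fresh parameters, matching the scheme described above.

```python
import numpy as np

def train(xs, ys, d, ell, k=8, lr=0.1):
    """xs: list of T input vectors x^1..x^T; ys: list of T labels y^1..y^T in {0, 1}."""
    params = init_params(d, ell)
    h_carry = np.zeros(ell)                       # h^0 = 0
    for t in range(k, len(xs) + 1):               # windows ending at t = k, ..., T
        window = xs[t - k:t]                      # x^{t-k+1}, ..., x^t
        hs, y_hat = forward_window(params, window, h0=h_carry)
        params = bptt_step(params, window, hs, ys[t - 1], y_hat, lr)
        W_h, U_h, b_h, _, _ = params
        h_carry = np.tanh(W_h @ xs[t - k] + U_h @ h_carry + b_h)  # carry the hidden state forward one step
    return params
```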

A question you might have is: what is a reasonable choice for k? If we pick k large, notice that the contribution of older information decays geometrically over time in the above formula for \partial J^t(\theta) / \partial U_h; notice the U_h^{\gamma} factor, where \gamma ranges from 0 to k-2. Thus the gradient becomes too small and the parameter updates aren't able to train the network effectively. This is known as the vanishing gradient problem. The solution to this problem, known as long short-term memory (LSTM), is the subject of the next post.

The other possibility is that the gradients become unreasonably large rather than vanishing. This is the exploding gradients problem and can be addressed by defining some threshold such that when the gradient blows up past this threshold, we rescale the gradient back down (gradient clipping). Another approach that could help with both issues at once is to initialize the weight matrix U_h to be orthogonal, since products of orthogonal matrices don't explode or vanish, but this limits our weights and leads to more restrictive models.
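Both of these fixes are short enough to sketch directly: clipping the gradients when their overall norm crosses a threshold, and drawing U_h from an orthogonal initialization via a QR decomposition. The threshold value 5.0 below is an arbitrary placeholder; it is a tunable hyperparameter.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays when their combined norm exceeds max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads

def orthogonal_init(ell, seed=0):
    """Initialize U_h as an orthogonal ell x ell matrix via the QR decomposition."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((ell, ell)))
    return q
```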

In practice, usually k = 8 or 9 works well, although one is better off using more advanced RNNs to mitigate this issue.

Conclusion

RNNs are a very powerful framework for training neural networks on sequential data and represent an intuitive adaptation of the ideas used to build feedforward neural networks to solve these tasks. In the next post we will discuss some more complex cells that we can choose to solve the vanishing gradients problem.

Namely, instead of using a simple \tanh activation on the input x^t and the previous h^{t-1}, we keep track of five different gates and states: the internal memory of the cell, the hidden state that we output across time, the input gate that determines how much of the input flows to the cell memory, the forget gate that determines how much of the input and previous cell memory flows to the cell memory, and the output gate that determines how much of the input and previous cell memory flows into the output hidden state.