RNNs & LSTMs

Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) are the core architectures for modeling sequential data — text, speech, and time series. Unlike fully-connected or convolutional networks, they process input step-by-step and maintain hidden state across time steps.

Why Not FC or ConvNets for Sequences?

Fully-connected (FC) networks and ConvNets process a paragraph as a whole: they require a fixed-size input and produce a fixed-size output, so they cannot naturally handle variable-length sequences.
RNNs handle variable-length sequences, share weights across time steps, and maintain state.

Text Preprocessing Pipeline

Before feeding text into an RNN, raw text is transformed:

Tokenization → Encoding → Alignment

Tokenization (word-level): split text into tokens, build a frequency-sorted vocabulary.
Encoding: map tokens to integer indices. Infrequent tokens are dropped for two reasons:
- Computational cost: a bigger vocabulary means higher-dimensional one-hot vectors.
- Low information value: typos and rare named entities mostly contribute noise.
One-hot encoding (char-level) or embeddings (word-level): a character vocabulary is small enough for one-hot vectors, whereas a word vocabulary is too large, so word-level models map indices to dense embedding vectors.
Alignment: pad/truncate sequences to the same length.
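The three stages above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`build_vocab`, `encode`, `align`), whitespace tokenization, the reserved padding index 0, and padding/truncating at the front are all assumptions made for the example.

```python
from collections import Counter

def build_vocab(texts, max_words):
    """Keep the max_words most frequent tokens; index 0 is reserved for padding."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    # most_common returns tokens sorted by descending frequency
    return {tok: i + 1 for i, (tok, _) in enumerate(counts.most_common(max_words))}

def encode(text, vocab):
    """Map tokens to integer indices, dropping tokens that are not in the vocabulary."""
    return [vocab[tok] for tok in text.lower().split() if tok in vocab]

def align(seq, length, pad_value=0):
    """Keep the last `length` indices and left-pad shorter sequences."""
    seq = seq[-length:]
    return [pad_value] * (length - len(seq)) + seq

texts = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocab(texts, max_words=4)       # infrequent tokens fall outside the vocabulary
batch = [align(encode(t, vocab), length=5) for t in texts]
print(vocab)
print(batch)                                  # every row now has length 5
```

Every row of `batch` has the same length, so the rows can be stacked into a single integer tensor and passed to an embedding layer.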

Simple RNN

The recurrence relation at each time step:

$h_{t} = \tanh(A \cdot [h_{t-1}, x_{t}] + b)$

$h_{t}$: hidden state at step $t$
$x_{t}$: embedding vector at step $t$ (the output of the embedding layer, not the one-hot vector)
$A$: shared weight matrix of shape $\dim(h) \times (\dim(h) + \dim(x))$
$b$: shared bias vector of length $\dim(h)$
$\tanh$ keeps values bounded in $(-1, 1)$, preventing exploding activations.
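The recurrence can be unrolled in a few lines of NumPy. The sketch below is illustrative only: the function name `simple_rnn`, the zero initial state, and the random weights and inputs are assumptions, not values from a trained model.

```python
import numpy as np

def simple_rnn(xs, A, b, h0=None):
    """Run h_t = tanh(A @ [h_{t-1}, x_t] + b) over a sequence of input vectors."""
    dim_h = b.shape[0]
    h = np.zeros(dim_h) if h0 is None else h0
    states = []
    for x in xs:                                      # one step per sequence element
        h = np.tanh(A @ np.concatenate([h, x]) + b)   # the same A and b at every step
        states.append(h)
    return np.stack(states)                           # shape (T, dim_h)

rng = np.random.default_rng(0)
dim_h, dim_x, T = 4, 3, 6
A = rng.normal(scale=0.5, size=(dim_h, dim_h + dim_x))
b = np.zeros(dim_h)
xs = rng.normal(size=(T, dim_x))                      # stand-in for embedding vectors

states = simple_rnn(xs, A, b)
print(states.shape)                                   # (6, 4)
print(states[-1])                                     # final hidden state h_T
```

Because `A` and `b` are reused at every step, the same function works for any sequence length `T`.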

Parameter Count

$\text{Params} = \dim(h) \times (\dim(h) + \dim(x)) + \dim(h)$

The first term counts the entries of $A$; the trailing $\dim(h)$ counts the bias $b$.

Important: $\dim(x)$ is the embedding dimension (e.g., 32), not the vocabulary size.
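As a worked instance, take the embedding dimension of 32 mentioned above and assume a hidden size of 64 (an illustrative choice):

$$\text{Params} = 64 \times (64 + 32) + 64 = 6144 + 64 = 6208$$

The vocabulary size does not appear here; it only affects the embedding layer, which would hold (vocabulary size) × 32 parameters.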

`return_sequences`

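`return_sequences`	Output	Effect on a following Dense layer
`False` (default)	Last hidden state $h_{T}$ only	Dense sees a single vector per sequence
`True`	All hidden states $h_{1}, \dots, h_{T}$	Dense is applied at each step with shared weights

The recurrent layer's parameter count is the same in both cases. A Dense layer applied directly to the sequence output also keeps its parameter count, because Keras applies it to the last axis. Only if the states are flattened first does the Dense layer's parameter count grow, since its input size becomes $T \times \dim(h)$.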

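from tensorflow import keras

vocab_size = 10_000  # assumed vocabulary size; set this from your tokenizer

model = keras.Sequential([
    keras.Input(shape=(None,)),                                   # variable-length sequences of token indices
    keras.layers.Embedding(input_dim=vocab_size, output_dim=32),  # indices -> 32-dim vectors
    keras.layers.SimpleRNN(64, return_sequences=False),           # keep only the last hidden state
    keras.layers.Dense(10, activation='softmax')                  # 10-class output
])
model.summary()  # the explicit Input lets summary() report shapes and parameter counts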

Shortcomings

Vanishing gradient / long-term dependency problem: gradients decay over long sequences, so early tokens are effectively forgotten.

Chaos and Stability

In the language of dynamical systems, the training of RNNs is a balance between stability and chaos:

Vanishing Gradients: Represent a “stable” but “damped” system. The influence of the initial state $h_{0}$ decays exponentially.
Exploding Gradients: Represent a chaotic system. Small changes in the initial state or parameters lead to massive, unpredictable changes in the output (the Butterfly Effect).
Edge of Chaos: RNNs are often reported to train best when initialized near the “edge of chaos”, a regime where the system is sensitive enough to remember the past but stable enough that noise does not explode.
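The damped regime can be checked numerically by tracking how far two nearby initial states drift apart. The sketch below is a toy experiment under stated assumptions: an input-free recurrence $h_{t} = \tanh(W h_{t-1})$, a random matrix `W0` rescaled to spectral norm 1, and arbitrarily chosen gains. Since $\tanh$ is 1-Lipschitz, a spectral norm below 1 guarantees that the perturbation shrinks at every step; for larger gains it may grow instead.

```python
import numpy as np

def perturbation_growth(W, steps=30, eps=1e-6, seed=0):
    """Ratio ||h_T - h'_T|| / ||h_0 - h'_0|| for two nearby initial states."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    h = rng.normal(size=n)
    delta = rng.normal(size=n)
    h_perturbed = h + eps * delta / np.linalg.norm(delta)   # initial distance is eps
    for _ in range(steps):                                  # input-free recurrence
        h = np.tanh(W @ h)
        h_perturbed = np.tanh(W @ h_perturbed)
    return np.linalg.norm(h - h_perturbed) / eps

rng = np.random.default_rng(1)
n = 100
W0 = rng.normal(size=(n, n))
W0 /= np.linalg.norm(W0, 2)                                 # rescale to spectral norm 1

for gain in (0.5, 2.0, 4.0):                                # illustrative gains
    ratio = perturbation_growth(gain * W0)
    print(f"gain={gain}: growth ratio = {ratio:.3e}")
```

A ratio far below 1 means the influence of $h_{0}$ has been forgotten; a ratio far above 1 means small differences have been amplified.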

TIP

For more on the math behind this, see 7. Chaos Theory.

LSTM (Long Short-Term Memory)

LSTM introduces a cell state $C_{t}$ (the “conveyor belt”) and three gates to selectively retain or discard information.

Gates

Gate	Formula	Role
Forget gate	$f_{t} = \sigma(W_{f} \cdot [h_{t-1}, x_{t}])$	How much of $C_{t-1}$ to keep (0 = forget, 1 = keep)
Input gate	$i_{t} = \sigma(W_{i} \cdot [h_{t-1}, x_{t}])$	How much of the new candidate to write
Candidate	$\tilde{C}_{t} = \tanh(W_{C} \cdot [h_{t-1}, x_{t}])$	The actual candidate content to add
Cell update	$C_{t} = f_{t} \odot C_{t-1} + i_{t} \odot \tilde{C}_{t}$	Updated cell state
Output gate	$o_{t} = \sigma(W_{o} \cdot [h_{t-1}, x_{t}])$	How much of the cell state to expose as $h_{t}$
Hidden state	$h_{t} = o_{t} \odot \tanh(C_{t})$	Output hidden state
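The six rows of the table translate almost line for line into NumPy. This is a sketch that follows the table as written (no bias terms); the function names and the random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_f, W_i, W_C, W_o):
    """One LSTM step following the gate table (bias terms omitted, as in the table)."""
    z = np.concatenate([h_prev, x])        # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z)                   # forget gate
    i = sigmoid(W_i @ z)                   # input gate
    c_tilde = np.tanh(W_C @ z)             # candidate content
    c = f * c_prev + i * c_tilde           # cell update (elementwise products)
    o = sigmoid(W_o @ z)                   # output gate
    h = o * np.tanh(c)                     # hidden state
    return h, c

rng = np.random.default_rng(0)
dim_h, dim_x = 4, 3
W_f, W_i, W_C, W_o = (rng.normal(scale=0.5, size=(dim_h, dim_h + dim_x)) for _ in range(4))

h, c = np.zeros(dim_h), np.zeros(dim_h)
for x in rng.normal(size=(5, dim_x)):      # five stand-in embedding vectors
    h, c = lstm_step(x, h, c, W_f, W_i, W_C, W_o)
print(h.shape, c.shape)                    # (4,) (4,)
```

The cell update is additive: when `f` is close to 1 and `i` is close to 0, `c` passes through the step almost unchanged, which is the "conveyor belt" behaviour.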

Parameter Count

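$\text{Params} = 4 \times [\dim(h) \times (\dim(h) + \dim(x)) + \dim(h)]$

There are 4 weight matrices, $W_{f}, W_{i}, W_{C}, W_{o}$, each with the same shape as the SimpleRNN matrix $A$, hence 4× the parameters of an equivalent SimpleRNN. The trailing $\dim(h)$ counts the bias vector that Keras adds to each gate by default; the gate formulas above omit it for brevity.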
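As a worked instance, assume a hidden size of 128 and an embedding dimension of 64 (the sizes of the first LSTM layer in the example in this section). Counting weights and biases separately:

$$\underbrace{4 \times 128 \times (128 + 64)}_{\text{weights}} = 98304, \qquad \underbrace{4 \times 128}_{\text{biases}} = 512$$

With Keras's default per-gate biases, the layer therefore holds $98304 + 512 = 98816$ parameters.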

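model = keras.Sequential([
    keras.layers.Embedding(input_dim=vocab_size, output_dim=64),
    keras.layers.LSTM(128, return_sequences=True),  # 4 * (128 * (128 + 64) + 128) = 98,816 params
    keras.layers.LSTM(64),                          # 4 * (64 * (64 + 128) + 64) = 49,408 params
    keras.layers.Dense(1, activation='sigmoid')     # binary classification output
])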

Stacked RNNs / LSTMs

Multiple RNN/LSTM layers are stacked on top of each other. Every recurrent layer except the last must set `return_sequences=True` so that it passes the full sequence of hidden states to the next layer. Stacking may improve performance when the dataset is large.

model = keras.Sequential([
    keras.layers.Embedding(vocab_size, 64),
    keras.layers.LSTM(128, return_sequences=True),  # passes full sequence
    keras.layers.LSTM(64),                           # final layer
    keras.layers.Dense(num_classes, activation='softmax')
])

Bidirectional RNN

A Bidirectional RNN runs two independent RNNs over the sequence, one forward and one backward, and concatenates their hidden states: $[h_{t}, h'_{t}]$, where $h_{t}$ is the forward state and $h'_{t}$ the backward state.

Use when: the full input sequence is available (e.g., text classification, encoding).
Cannot use as decoder: the backward pass requires future tokens, which don’t exist yet during autoregressive generation.

model = keras.Sequential([
    keras.layers.Embedding(vocab_size, 64),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(1, activation='sigmoid')
])

Pretrained Embeddings

The embedding layer often accounts for most of a model's trainable parameters (vocabulary size × embedding dimension). When labeled data is scarce, load a pretrained embedding (e.g., GloVe, Word2Vec) and freeze it to reduce the number of trainable parameters and leverage knowledge from a large corpus.
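One way to prepare such a layer is to build a weight matrix whose row $i$ holds the pretrained vector of the word with index $i$. The sketch below uses a tiny hypothetical `pretrained` dictionary in place of a real GloVe or Word2Vec file; the helper name `build_embedding_matrix` and the random fallback for missing words are assumptions.

```python
import numpy as np

def build_embedding_matrix(word_index, pretrained, dim, seed=0):
    """Row i holds the pretrained vector of the word with index i.

    Row 0 (padding) stays zero; words missing from `pretrained` get small
    random vectors.
    """
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, idx in word_index.items():
        vec = pretrained.get(word)
        matrix[idx] = vec if vec is not None else rng.normal(scale=0.1, size=dim)
    return matrix

# Hypothetical toy data standing in for a real pretrained lookup.
pretrained = {"cat": np.array([0.1, 0.2, 0.3]), "dog": np.array([0.2, 0.1, 0.4])}
word_index = {"cat": 1, "dog": 2, "zyzzyva": 3}

embedding_matrix = build_embedding_matrix(word_index, pretrained, dim=3)
print(embedding_matrix.shape)   # (4, 3)

# In Keras the matrix could then initialise a frozen layer, for example:
# keras.layers.Embedding(4, 3,
#     embeddings_initializer=keras.initializers.Constant(embedding_matrix),
#     trainable=False)
```

With `trainable=False`, the embedding weights are excluded from gradient updates, so only the recurrent and Dense layers are fitted to the small labeled dataset.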

Best Practices Summary

Always use LSTM instead of SimpleRNN.
Use Bi-RNN instead of unidirectional RNN whenever possible.
Stack RNN layers for larger datasets.
Use a pretrained embedding layer when labeled data is scarce.

Text Generation (Char-Level)

Slice text into overlapping segments; each segment is input, the next character is the label.
Formulated as multi-class classification (one class per character).
Choosing the next character:
- Greedy: always pick the highest-probability character — too deterministic.
- Multinomial sampling: sample from the distribution — too random.
- Temperature-scaled sampling (best): adjust the sharpness of the distribution.

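import numpy as np

def sample_with_temperature(predictions, temperature=1.0):
    """Sample an index from `predictions` after temperature scaling (temperature > 0)."""
    predictions = np.asarray(predictions).astype("float64")
    # p ** (1/T) equals softmax(log(p) / T) once the result is renormalized
    predictions = predictions ** (1.0 / temperature)
    predictions = predictions / np.sum(predictions)  # renormalize so the probabilities sum to 1
    return np.random.choice(len(predictions), p=predictions)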

temperature < 1: more deterministic (sharper distribution).
temperature > 1: more random (flatter distribution).
temperature = 1: standard multinomial sampling.
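The segment-slicing step from the start of this section can be sketched as follows. The helper name `make_segments`, the segment length, and the stride are illustrative assumptions; real training text would be far longer than this toy string.

```python
import numpy as np

def make_segments(text, seg_len, stride):
    """Return overlapping segments and the character that follows each one."""
    segments, next_chars = [], []
    for start in range(0, len(text) - seg_len, stride):
        segments.append(text[start:start + seg_len])
        next_chars.append(text[start + seg_len])
    return segments, next_chars

text = "hello world"
seg_len = 5
segments, next_chars = make_segments(text, seg_len=seg_len, stride=3)
print(segments)     # ['hello', 'lo wo']
print(next_chars)   # [' ', 'r']

# One-hot encode: X has shape (segments, seg_len, chars), y has shape (segments, chars)
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
X = np.zeros((len(segments), seg_len, len(chars)), dtype=bool)
y = np.zeros((len(segments), len(chars)), dtype=bool)
for n, (seg, nxt) in enumerate(zip(segments, next_chars)):
    for t, ch in enumerate(seg):
        X[n, t, char_to_idx[ch]] = True
    y[n, char_to_idx[nxt]] = True
print(X.shape, y.shape)   # (2, 5, 8) (2, 8)
```

Each row of `y` is a one-hot label over the character set, which is what makes the task a multi-class classification problem.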

Machine Translation: Seq2Seq

Architecture: LSTM Encoder → final states $(h, c)$ → LSTM Decoder

Two separate tokenizers/dictionaries (source and target languages have different vocabularies).
Loss: Cross-Entropy.

Improvements

Technique	Why it helps
Bi-LSTM Encoder	Longer memory — doesn’t forget early tokens in long sentences
Word-level tokenization	Shorter sequences → less forgetting; BUT requires more data
Multi-task learning	Additional supervision signal
Attention	Decoder can focus on relevant encoder states

Why can’t the Bi-LSTM be the decoder? Causality. Decoding is autoregressive — you generate one token at a time. The backward LSTM requires seeing future tokens, which don’t exist yet.

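# Assumed vocabulary sizes; set these from the source and target tokenizers
src_vocab_size, tgt_vocab_size = 8000, 8000

# Encoder: Bi-LSTM whose final states summarize the source sentence
encoder_inputs = keras.Input(shape=(None,))
enc_emb = keras.layers.Embedding(src_vocab_size, 256)(encoder_inputs)
encoder_lstm = keras.layers.Bidirectional(keras.layers.LSTM(256, return_state=True))
# Returns: output, forward h, forward c, backward h, backward c
enc_out, fwd_h, fwd_c, bwd_h, bwd_c = encoder_lstm(enc_emb)
state_h = keras.layers.Concatenate()([fwd_h, bwd_h])  # shape (batch, 512)
state_c = keras.layers.Concatenate()([fwd_c, bwd_c])  # shape (batch, 512)

# Decoder: unidirectional LSTM with 512 units to match the concatenated encoder states
decoder_inputs = keras.Input(shape=(None,))
dec_emb = keras.layers.Embedding(tgt_vocab_size, 256)(decoder_inputs)
decoder_lstm = keras.layers.LSTM(512, return_sequences=True, return_state=True)
dec_out, _, _ = decoder_lstm(dec_emb, initial_state=[state_h, state_c])
# Per-step distribution over the target vocabulary
decoder_outputs = keras.layers.Dense(tgt_vocab_size, activation='softmax')(dec_out)

model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')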
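The model above takes the decoder's input sequence as a second input and predicts one target token per step, so training pairs are commonly built by shifting the target sentence by one position. The sketch below shows that preparation under assumptions of my own: the special-token indices `START`, `END`, and `PAD`, the helper name `make_decoder_pairs`, and padding at the end are all hypothetical choices.

```python
import numpy as np

START, END, PAD = 1, 2, 0          # hypothetical special-token indices

def make_decoder_pairs(target_ids, max_len):
    """Build (decoder_input, decoder_target) for one target sentence."""
    full = [START] + list(target_ids) + [END]
    dec_in, dec_out = full[:-1], full[1:]          # the target is the input shifted by one
    pad = [PAD] * (max_len - len(dec_in))
    return np.array(dec_in + pad), np.array(dec_out + pad)

dec_in, dec_out = make_decoder_pairs([7, 8, 9], max_len=6)
print(dec_in)    # [1 7 8 9 0 0]
print(dec_out)   # [7 8 9 2 0 0]
```

At step $t$ the decoder sees only tokens up to position $t$ and is asked for the token at position $t + 1$, which mirrors the one-token-at-a-time generation used at inference. In practice the padded positions would usually be masked out of the loss.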

Related Notes

Attention & Transformers - Mechanism that allows decoders to focus on relevant encoder states.
4. RNNs & CNNs for Text Classification - NLP perspective on RNN architectures.
5. Tokenization - Text preprocessing details.
3. Word Vectors - Embedding representations used as RNN input.
