Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) are the core architectures for modeling sequential data: text, speech, and time series. Unlike fully-connected or convolutional networks, they process input step by step and maintain a hidden state across time steps.
Why Not FC or ConvNets for Sequences?
- FC Nets / ConvNets process a paragraph as a whole, require fixed-size input, and produce fixed-size output, so they cannot naturally handle variable-length sequences.
- RNNs handle variable-length sequences, share weights across time steps, and maintain state.
Text Preprocessing Pipeline
Before feeding text into an RNN, raw text is transformed:
Tokenization → Encoding → Alignment
- Tokenization (word-level): split text into tokens, build a frequency-sorted vocabulary.
- Encoding: map tokens to integer indices. Infrequent words/tokens are dropped for two reasons:
  - Computational cost: a bigger vocabulary means higher-dimensional one-hot vectors.
  - Low information value: typos and rare named entities contribute noise.
- One-Hot Encoding (char-level) or Embeddings (word-level): word-level requires embeddings because the vocabulary is too large for one-hot.
- Alignment: pad/truncate sequences to the same length.
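The three stages above can be sketched in plain Python. This is a minimal illustration, not a production tokenizer: whitespace splitting, the helper names (`build_vocab`, `encode`, `align`), reserving index 0 for padding, and padding on the left are all assumptions made for the example.

```python
from collections import Counter

def build_vocab(texts, max_size):
    """Frequency-sorted vocabulary; index 0 is reserved for padding (assumption)."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    # Keep only the most frequent tokens; rarer ones are dropped at encoding time.
    return {tok: i + 1 for i, (tok, _) in enumerate(counts.most_common(max_size))}

def encode(text, vocab):
    """Map tokens to integer indices, skipping out-of-vocabulary tokens."""
    return [vocab[tok] for tok in text.lower().split() if tok in vocab]

def align(seq, length, pad_value=0):
    """Keep the last `length` items, then left-pad with `pad_value`."""
    seq = seq[-length:]
    return [pad_value] * (length - len(seq)) + seq

texts = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocab(texts, max_size=3)
batch = [align(encode(t, vocab), length=4) for t in texts]
print(batch)
```

Every row of `batch` has the same length, so the batch can be fed to an embedding layer as a rectangular integer array.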
Simple RNN
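The recurrence relation at each time step:
$$h_t = \tanh\left(W \, [h_{t-1}; x_t] + b\right)$$
- $h_t$: hidden state at step $t$
- $x_t$: embedding vector at step $t$ (not the one-hot vector, but the output of the embedding layer)
- $W$: weight matrix shared across all time steps, applied to the concatenation of $h_{t-1}$ and $x_t$ ($b$ is the bias)
$\tanh$ keeps values bounded in $(-1, 1)$, preventing exploding activations.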
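To make the recurrence concrete, here is a minimal NumPy sketch of the forward pass. It assumes the common formulation in which a single weight matrix acts on the concatenation of the previous hidden state and the current embedding; the function name, the sizes, and the random weights are illustrative only.

```python
import numpy as np

def simple_rnn_forward(x_seq, W, b, h0=None):
    """Run h_t = tanh(W @ [h_{t-1}; x_t] + b) over a sequence.

    x_seq: (T, d_x) array of embedding vectors
    W:     (d_h, d_h + d_x) weight matrix shared by every step
    b:     (d_h,) bias
    Returns all hidden states, shape (T, d_h).
    """
    d_h = b.shape[0]
    h = np.zeros(d_h) if h0 is None else h0
    states = []
    for x_t in x_seq:
        # The same W and b are reused at every time step.
        h = np.tanh(W @ np.concatenate([h, x_t]) + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, d_x, d_h = 5, 32, 64
x_seq = rng.normal(size=(T, d_x))
W = rng.normal(scale=0.1, size=(d_h, d_h + d_x))
b = np.zeros(d_h)

all_states = simple_rnn_forward(x_seq, W, b)  # every h_t, shape (5, 64)
last_state = all_states[-1]                   # final h_T only, shape (64,)
```

Because the loop reuses the same `W` and `b`, the function works for any sequence length `T` without adding parameters.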
Parameter Count
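$$\text{params} = d_h \times (d_h + d_x) + d_h$$
Here $d_h$ is the hidden-state size and $d_x$ is the size of the input vector at each step.
Important: $d_x$ is the embedding dimension (e.g., 32), not the vocabulary size.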
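As a quick arithmetic check, the sketch below assumes the count is one weight matrix of shape (d_h, d_h + d_x) plus a bias of length d_h; the function name is made up for the example. The values 64 and 32 mirror the SimpleRNN and Embedding sizes used in this section's Keras model.

```python
def simple_rnn_param_count(d_h, d_x):
    """Weights of shape (d_h, d_h + d_x) plus a bias of length d_h."""
    return d_h * (d_h + d_x) + d_h

print(simple_rnn_param_count(d_h=64, d_x=32))  # 6208
```

Passing the vocabulary size instead of the embedding dimension as `d_x` would greatly overestimate the count, which is the mistake the note above warns about.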
return_sequences
| return_sequences | Output shape | Effect on Dense layer |
|---|---|---|
| False (default) | Last $h_T$ only | Dense sees a single vector |
| True | All $h_1, \dots, h_T$ | Dense is applied at each step with the same weights; only the Dense layer's parameter count changes, and only if the sequence is flattened first |
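```python
import tensorflow as tf
from tensorflow import keras

vocab_size = 10000  # assumed vocabulary size; set this from your tokenizer

model = keras.Sequential([
    keras.Input(shape=(None,)),  # variable-length sequences of token indices
    keras.layers.Embedding(input_dim=vocab_size, output_dim=32),
    keras.layers.SimpleRNN(64, return_sequences=False),  # emit only the last hidden state
    keras.layers.Dense(10, activation='softmax')
])
model.summary()
```
Shortcomings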
- Vanishing gradient / long-term dependency problem: gradients decay over long sequences, so early tokens are effectively forgotten.
Chaos and Stability
In the language of dynamical systems, the training of RNNs is a balance between stability and chaos:
- Vanishing Gradients: Represent a "stable" but "damped" system. The influence of the initial state decays exponentially.
- Exploding Gradients: Represent a chaotic system. Small changes in the initial state or parameters lead to massive, unpredictable changes in the output (the Butterfly Effect).
- Edge of Chaos: Researchers have found that RNNs perform best when initialized at the "edge of chaos", a regime where the system is sensitive enough to remember the past but stable enough not to let noise explode.
TIP
For more on the math behind this, see 7. Chaos Theory.
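The exponential decay or growth described in this section can be seen in a toy setting. The sketch below makes a strong simplifying assumption: it drops the tanh and uses a linear recurrence h_t = W h_{t-1} with W equal to a scaled random orthogonal matrix, so the gradient of h_T with respect to h_0 is simply W raised to the power T. The function name and sizes are illustrative.

```python
import numpy as np

def gradient_norm_through_time(scale, steps, d_h=16, seed=0):
    """Spectral norm of d h_T / d h_0 for the linear recurrence h_t = W h_{t-1},
    where W = scale * Q and Q is a random orthogonal matrix."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(d_h, d_h)))
    W = scale * Q
    jacobian = np.linalg.matrix_power(W, steps)  # product of `steps` identical Jacobians
    return np.linalg.norm(jacobian, 2)

for scale in (0.9, 1.0, 1.1):
    print(scale, gradient_norm_through_time(scale, steps=50))
```

With `scale` below 1 the norm shrinks toward zero (vanishing), above 1 it grows rapidly (exploding), and at exactly 1 it stays at 1, which is the linear analogue of the boundary between the two regimes.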
LSTM (Long Short-Term Memory)
LSTM introduces a cell state (the "conveyor belt") and three gates to selectively retain or discard information.
Gates
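| Gate | Formula | Role |
|---|---|---|
| Forget gate | $f_t = \sigma(W_f [h_{t-1}; x_t] + b_f)$ | How much of $c_{t-1}$ to keep (0 = forget, 1 = keep) |
| Input gate | $i_t = \sigma(W_i [h_{t-1}; x_t] + b_i)$ | How much of the new candidate to write |
| Candidate | $\tilde{c}_t = \tanh(W_c [h_{t-1}; x_t] + b_c)$ | The actual candidate content to add |
| Cell update | $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ | Updated cell state |
| Output gate | $o_t = \sigma(W_o [h_{t-1}; x_t] + b_o)$ | How much of cell state to expose as $h_t$ |
| Hidden state | $h_t = o_t \odot \tanh(c_t)$ | Output hidden state |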
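A single LSTM step can be written out directly from the gate definitions. This is a minimal NumPy sketch assuming the standard formulation in which every gate acts on the concatenation of the previous hidden state and the current input; the `params` dictionary layout and the function names are choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step.

    params maps each of "f", "i", "c", "o" to a (W, b) pair, with
    W of shape (d_h, d_h + d_x) and b of shape (d_h,).
    """
    z = np.concatenate([h_prev, x_t])
    (W_f, b_f), (W_i, b_i) = params["f"], params["i"]
    (W_c, b_c), (W_o, b_o) = params["c"], params["o"]

    f_t = sigmoid(W_f @ z + b_f)        # forget gate: how much of c_prev to keep
    i_t = sigmoid(W_i @ z + b_i)        # input gate: how much of the candidate to write
    c_tilde = np.tanh(W_c @ z + b_c)    # candidate content
    c_t = f_t * c_prev + i_t * c_tilde  # cell update (the "conveyor belt")
    o_t = sigmoid(W_o @ z + b_o)        # output gate
    h_t = o_t * np.tanh(c_t)            # exposed hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
d_x, d_h = 8, 4
params = {k: (rng.normal(scale=0.1, size=(d_h, d_h + d_x)), np.zeros(d_h))
          for k in ("f", "i", "c", "o")}
h_t, c_t = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), params)
```

The cell state is only rescaled elementwise and added to, never passed through a weight matrix, which is what lets information persist over many steps.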
Parameter Count
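There are 4 weight matrices ($W_f$, $W_i$, $W_c$, $W_o$), each with its own bias, hence 4× the parameters of an equivalent SimpleRNN: $4 \times [d_h \times (d_h + d_x) + d_h]$, where $d_h$ is the hidden size and $d_x$ the input size.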
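The 4× relationship is easy to verify numerically. The sketch below assumes each of the four weight sets has the same shape as a SimpleRNN's single set; the function name is illustrative. The two calls use the layer sizes of the stacked model in this section (a 64-dimensional embedding feeding LSTM(128), which then feeds LSTM(64)).

```python
def lstm_param_count(d_h, d_x):
    """Four (weight, bias) sets, each the size of a SimpleRNN's single set."""
    per_gate = d_h * (d_h + d_x) + d_h
    return 4 * per_gate

print(lstm_param_count(d_h=128, d_x=64))  # 98816
print(lstm_param_count(d_h=64, d_x=128))  # 49408
```

Note that for a stacked layer, `d_x` is the hidden size of the layer below, not the embedding dimension.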
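```python
# Assumes `keras` is imported as above and `vocab_size` is set from your tokenizer.
model = keras.Sequential([
    keras.layers.Embedding(input_dim=vocab_size, output_dim=64),
    keras.layers.LSTM(128, return_sequences=True),  # emits every h_t for the next LSTM
    keras.layers.LSTM(64),                          # emits only the final hidden state
    keras.layers.Dense(1, activation='sigmoid')     # binary classification head
])
```
Stacked RNNs / LSTMs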
Multiple RNN/LSTM layers are stacked on top of each other. Every layer except the last must set return_sequences=True so that it passes a full sequence to the next layer. Stacking may improve performance when the dataset is large.
```python
# Assumes `keras` is imported as above; `vocab_size` and `num_classes` come from your data.
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, 64),
    keras.layers.LSTM(128, return_sequences=True),  # passes full sequence
    keras.layers.LSTM(64),                          # final layer
    keras.layers.Dense(num_classes, activation='softmax')
])
```
Bidirectional RNN
A Bidirectional RNN runs two independent RNNs over the sequence, one forward and one backward, and concatenates their hidden states: $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$.
- Use when: the full input sequence is available (e.g., text classification, encoding).
- Cannot be used as a decoder: the backward pass requires future tokens, which don't exist yet during autoregressive generation.
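The concatenation, and the reason a bidirectional model cannot decode, can both be shown in a few lines of NumPy. This sketch assumes plain tanh RNNs for both directions; the function names and sizes are illustrative. It perturbs only the last input and checks which half of the first step's output changes.

```python
import numpy as np

def rnn_states(x_seq, W, b):
    """Hidden states of a tanh RNN, h_t = tanh(W @ [h_{t-1}; x_t] + b)."""
    h = np.zeros(b.shape[0])
    out = []
    for x_t in x_seq:
        h = np.tanh(W @ np.concatenate([h, x_t]) + b)
        out.append(h)
    return np.stack(out)

def bidirectional_states(x_seq, fwd_params, bwd_params):
    """Concatenate forward states with time-realigned backward states."""
    fwd = rnn_states(x_seq, *fwd_params)
    bwd = rnn_states(x_seq[::-1], *bwd_params)[::-1]  # flip back so index t matches
    return np.concatenate([fwd, bwd], axis=-1)

rng = np.random.default_rng(0)
T, d_x, d_h = 3, 3, 4
make_params = lambda: (rng.normal(size=(d_h, d_h + d_x)), np.zeros(d_h))
fwd_params, bwd_params = make_params(), make_params()

x = rng.normal(size=(T, d_x))
x_changed = x.copy()
x_changed[-1] += 1.0  # perturb only the last ("future") input

out = bidirectional_states(x, fwd_params, bwd_params)
out_changed = bidirectional_states(x_changed, fwd_params, bwd_params)

# The forward half at step 0 is untouched; the backward half depends on the future.
forward_diff = np.abs(out[0, :d_h] - out_changed[0, :d_h]).max()
backward_diff = np.abs(out[0, d_h:] - out_changed[0, d_h:]).max()
print(forward_diff, backward_diff)
```

The forward difference is zero while the backward difference is not, so the output at step 0 already depends on a token that an autoregressive decoder would not have generated yet.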
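```python
# Assumes `keras` is imported as above and `vocab_size` is set from your tokenizer.
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, 64),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),  # output size is 2 * 64 = 128
    keras.layers.Dense(1, activation='sigmoid')
])
```
Pretrained Embeddings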
The embedding layer often accounts for most of the trainable parameters, because its size is the vocabulary size times the embedding dimension. When labeled data is scarce, freeze a pretrained embedding (e.g., GloVe, Word2Vec) to reduce trainable parameters and leverage large-corpus knowledge.
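One common way to set this up is to copy pretrained vectors into a matrix whose rows follow the tokenizer's indices. The sketch below is illustrative only: the `pretrained` dictionary is a toy stand-in with made-up values for vectors that would normally be loaded from a GloVe or Word2Vec file, and the helper name and the zero-row convention for padding and unknown tokens are assumptions.

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim):
    """Rows follow vocab indices; tokens missing from `pretrained` stay zero."""
    matrix = np.zeros((len(vocab) + 1, dim))  # row 0 reserved for padding (assumption)
    for token, idx in vocab.items():
        vec = pretrained.get(token)
        if vec is not None:
            matrix[idx] = vec
    return matrix

# Toy stand-in for vectors loaded from a pretrained file (hypothetical values).
pretrained = {"cat": np.array([0.1, 0.2, 0.3]), "dog": np.array([0.4, 0.5, 0.6])}
vocab = {"cat": 1, "dog": 2, "axolotl": 3}
matrix = build_embedding_matrix(vocab, pretrained, dim=3)
print(matrix.shape)  # (4, 3)
```

In Keras, such a matrix can be passed to an `Embedding` layer through `embeddings_initializer=keras.initializers.Constant(matrix)` together with `trainable=False`, so the layer contributes no trainable parameters.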
Best Practices Summary
- Always use LSTM instead of SimpleRNN.
- Use Bi-RNN instead of unidirectional RNN whenever possible.
- Stack RNN layers for larger datasets.
- Pretrain the embedding layer when labeled data is small.
Text Generation (Char-Level)
- Slice text into overlapping segments; each segment is input, the next character is the label.
- Formulated as multi-class classification (one class per character).
- Choosing the next character:
  - Greedy: always pick the highest-probability character; too deterministic.
  - Multinomial sampling: sample from the distribution; too random.
  - Temperature-scaled sampling (best): adjust the sharpness of the distribution.
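```python
import numpy as np

def sample_with_temperature(predictions, temperature=1.0):
    """Sample an index from `predictions` after temperature scaling."""
    predictions = np.asarray(predictions).astype("float64")
    # Raising to 1/temperature sharpens (T < 1) or flattens (T > 1) the distribution.
    predictions = predictions ** (1.0 / temperature)
    predictions = predictions / np.sum(predictions)  # renormalize so it sums to 1
    return np.random.choice(len(predictions), p=predictions)
```
- temperature < 1: more deterministic (sharper distribution).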
- temperature > 1: more random (flatter distribution).
- temperature = 1: standard multinomial sampling.
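The segment slicing described at the top of this section can be sketched as follows. The function name, the `stride` parameter, and the toy string are illustrative; in practice the segments and labels would then be one-hot encoded per character before training.

```python
def make_char_dataset(text, seg_len, stride):
    """Return (segments, next_chars): each segment's label is the character after it."""
    segments, next_chars = [], []
    for start in range(0, len(text) - seg_len, stride):
        segments.append(text[start:start + seg_len])
        next_chars.append(text[start + seg_len])
    return segments, next_chars

segments, next_chars = make_char_dataset("hello world", seg_len=4, stride=3)
print(segments)    # ['hell', 'lo w', 'worl']
print(next_chars)  # ['o', 'o', 'd']
```

A stride smaller than the segment length makes the segments overlap, which yields more training pairs from the same text.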
Machine Translation: Seq2Seq
Architecture: LSTM Encoder → final states → LSTM Decoder
- Two separate tokenizers/dictionaries (source and target languages have different vocabularies).
- Loss: Cross-Entropy.
Improvements
| Technique | Why it helps |
|---|---|
| Bi-LSTM Encoder | Longer memory: doesn't forget early tokens in long sentences |
| Word-level tokenization | Shorter sequences, so less forgetting, but requires more data |
| Multi-task learning | Additional supervision signal |
| Attention | Decoder can focus on relevant encoder states |
Why can't the Bi-LSTM be the decoder? Causality. Decoding is autoregressive: you generate one token at a time. The backward LSTM requires seeing future tokens, which don't exist yet.
```python
# Assumes `keras` is imported as above; the vocabulary sizes are placeholders.
src_vocab_size = 10000  # assumed source vocabulary size
tgt_vocab_size = 10000  # assumed target vocabulary size

# Encoder
encoder_inputs = keras.Input(shape=(None,))
enc_emb = keras.layers.Embedding(src_vocab_size, 256)(encoder_inputs)
encoder_lstm = keras.layers.Bidirectional(keras.layers.LSTM(256, return_state=True))
# Bidirectional returns: output, forward h, forward c, backward h, backward c
enc_out, fwd_h, fwd_c, bwd_h, bwd_c = encoder_lstm(enc_emb)
state_h = keras.layers.Concatenate()([fwd_h, bwd_h])  # shape (batch, 512)
state_c = keras.layers.Concatenate()([fwd_c, bwd_c])  # shape (batch, 512)

# Decoder: 512 units so its states match the concatenated encoder states
decoder_inputs = keras.Input(shape=(None,))
dec_emb = keras.layers.Embedding(tgt_vocab_size, 256)(decoder_inputs)
decoder_lstm = keras.layers.LSTM(512, return_sequences=True, return_state=True)
dec_out, _, _ = decoder_lstm(dec_emb, initial_state=[state_h, state_c])
decoder_outputs = keras.layers.Dense(tgt_vocab_size, activation='softmax')(dec_out)

model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```
Related Notes
- Attention & Transformers - Mechanism that allows decoders to focus on relevant encoder states.
- 4. RNNs & CNNs for Text Classification - NLP perspective on RNN architectures.
- 5. Tokenization - Text preprocessing details.
- 3. Word Vectors - Embedding representations used as RNN input.