Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) are the core architectures for modeling sequential data: text, speech, and time series. Unlike fully-connected or convolutional networks, they process input step-by-step and maintain hidden state across time steps.

Why Not FC or ConvNets for Sequences?

  • FC Nets / ConvNets process a paragraph as a whole, require fixed-size input, and produce fixed-size output, so they cannot naturally handle variable-length sequences.
  • RNNs handle variable-length sequences, share weights across time steps, and maintain state.

Text Preprocessing Pipeline

Before feeding text into an RNN, raw text is transformed:

Tokenization → Encoding → Alignment

  1. Tokenization (word-level): split text into tokens, build a frequency-sorted vocabulary.
  2. Encoding: map tokens to integer indices. Infrequent tokens are dropped for two reasons:
    • Computational cost: a bigger vocabulary → higher-dimensional one-hot vectors.
    • Low information value: typos and rare named entities contribute noise.
  3. One-Hot Encoding (char-level) or Embeddings (word-level): word-level requires embeddings because the vocabulary is too large for one-hot.
  4. Alignment: pad/truncate sequences to the same length.
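
The pipeline above can be sketched in plain Python. This is only an illustration under stated assumptions: whitespace tokenization, index 0 reserved for padding, and left-padding/left-truncation (the default behaviour of Keras `pad_sequences`). The helper names `build_vocab`, `encode`, and `align` are hypothetical, not part of any library.

```python
from collections import Counter

def build_vocab(texts, max_words):
    """Map the max_words most frequent tokens to indices 1..max_words (0 = padding)."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    return {tok: i + 1 for i, (tok, _) in enumerate(counts.most_common(max_words))}

def encode(text, vocab):
    """Convert text to integer indices, dropping tokens outside the vocabulary."""
    return [vocab[tok] for tok in text.lower().split() if tok in vocab]

def align(seq, length):
    """Keep the last `length` indices and left-pad with 0 if the sequence is shorter."""
    seq = seq[-length:]
    return [0] * (length - len(seq)) + seq

texts = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocab(texts, max_words=3)          # infrequent tokens fall out of the vocabulary
encoded = [align(encode(t, vocab), length=4) for t in texts]
print(encoded)
```

Every row of `encoded` has the same length, so the rows can be stacked into a single integer array and passed to an embedding layer.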

Simple RNN

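The recurrence relation at each time step:

  $h_t = \tanh(W \cdot [h_{t-1}; x_t] + b)$

  • $h_t$: hidden state at step $t$
  • $x_t$: embedding vector at step $t$ (not the one-hot vector, but the output of the embedding layer)
  • $W$: shared weight matrix, reused at every time step ($b$ is the matching bias vector)
  • tanh keeps values bounded in $(-1, 1)$, preventing exploding activations.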
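
As a rough illustration, here is a NumPy sketch of a tanh recurrence in which one shared weight matrix acts on the concatenation of the previous hidden state and the current input. The sizes, the random initialization, and the function name `simple_rnn_forward` are assumptions made for the example.

```python
import numpy as np

def simple_rnn_forward(x_seq, W, b):
    """Run h_t = tanh(W @ [h_{t-1}; x_t] + b) over a sequence, starting from h_0 = 0."""
    h = np.zeros(b.shape[0])
    states = []
    for x_t in x_seq:                                  # one embedding vector per time step
        h = np.tanh(W @ np.concatenate([h, x_t]) + b)  # same W and b at every step
        states.append(h)
    return np.stack(states)                            # shape: (T, d_h)

rng = np.random.default_rng(0)
d_x, d_h, T = 4, 3, 5                                  # illustrative sizes
W = rng.normal(scale=0.5, size=(d_h, d_h + d_x))
b = np.zeros(d_h)
states = simple_rnn_forward(rng.normal(size=(T, d_x)), W, b)
print(states.shape)
```

The loop works for any sequence length `T`, which is what lets an RNN accept variable-length input with a fixed number of parameters.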

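Parameter Count

With hidden size $d_h$ and input size $d_x$, a SimpleRNN layer has $d_h \times (d_h + d_x) + d_h$ parameters (the shared weight matrix plus a bias vector).

Important: $d_x$ is the embedding dimension (e.g., 32), not the vocabulary size.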
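
A quick arithmetic check, assuming a 32-dimensional embedding, 64 hidden units, and a layer with one bias vector (the helper name `simple_rnn_params` is hypothetical):

```python
def simple_rnn_params(d_h, d_x, use_bias=True):
    """Weights acting on [h_{t-1}; x_t], plus an optional bias vector."""
    return d_h * (d_h + d_x) + (d_h if use_bias else 0)

print(simple_rnn_params(d_h=64, d_x=32))  # 64 * 96 + 64 = 6208
```

The vocabulary size does not appear anywhere in this count; it only affects the embedding layer, which has vocabulary size × embedding dimension parameters of its own.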

return_sequences

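| return_sequences | Output shape | Effect on Dense layer |
|---|---|---|
| False (default) | Last $h_T$ only: (batch, units) | Dense sees a single vector |
| True | All $h_1, \dots, h_T$: (batch, steps, units) | Dense is applied at each step with shared weights; the RNN's parameter count is unchanged, and the Dense layer's count changes only if the sequence is flattened first |
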
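import tensorflow as tf
from tensorflow import keras

vocab_size = 10000  # assumed vocabulary size; use your tokenizer's value

model = keras.Sequential([
    keras.Input(shape=(None,)),                                   # variable-length sequences of token indices
    keras.layers.Embedding(input_dim=vocab_size, output_dim=32),  # token index -> 32-dim vector
    keras.layers.SimpleRNN(64, return_sequences=False),           # keep only the last hidden state
    keras.layers.Dense(10, activation='softmax')                  # 10-class output
])
model.summary()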

Shortcomings

  • Vanishing gradient / long-term dependency problem: gradients decay over long sequences, so early tokens are effectively forgotten.

Chaos and Stability

In the language of dynamical systems, the training of RNNs is a balance between stability and chaos:

  • Vanishing Gradients: Represent a "stable" but "damped" system. The influence of the initial state decays exponentially.
  • Exploding Gradients: Represent a chaotic system. Small changes in the initial state or parameters lead to massive, unpredictable changes in the output (the Butterfly Effect).
  • Edge of Chaos: Researchers have found that RNNs perform best when initialized at the "edge of chaos", a regime where the system is sensitive enough to remember the past but stable enough to not let noise explode.
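
The three regimes can be seen in a deliberately simplified setting: a linear recurrence $h_t = W h_{t-1}$ with $W$ equal to a scaled orthogonal matrix, so that the sensitivity of $h_T$ to $h_0$ is exactly the scale raised to the number of steps. This sketch ignores the tanh nonlinearity and the input term, and the function name `jacobian_norm` is hypothetical.

```python
import numpy as np

def jacobian_norm(scale, steps, d_h=8, seed=0):
    """Spectral norm of d h_T / d h_0 for the linear recurrence h_t = W h_{t-1}."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(d_h, d_h)))  # random orthogonal matrix
    W = scale * Q                                     # all singular values equal `scale`
    J = np.linalg.matrix_power(W, steps)              # product of `steps` identical Jacobians
    return np.linalg.norm(J, 2)

for scale in (0.9, 1.0, 1.1):
    print(scale, jacobian_norm(scale, steps=50))
```

With a scale of 0.9 the norm shrinks to roughly 0.005 after 50 steps (vanishing), with 1.1 it grows past 100 (exploding), and with 1.0 it stays at 1, the boundary between the two.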

TIP

For more on the math behind this, see 7. Chaos Theory.


LSTM (Long Short-Term Memory)

LSTM introduces a cell state (the "conveyor belt") and three gates to selectively retain or discard information.

Gates

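| Gate | Formula | Role |
|---|---|---|
| Forget gate | $f_t = \sigma(W_f \cdot [h_{t-1}; x_t] + b_f)$ | How much of $c_{t-1}$ to keep (0 = forget, 1 = keep) |
| Input gate | $i_t = \sigma(W_i \cdot [h_{t-1}; x_t] + b_i)$ | How much of the new candidate to write |
| Candidate | $\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}; x_t] + b_c)$ | The actual candidate content to add |
| Cell update | $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ | Updated cell state |
| Output gate | $o_t = \sigma(W_o \cdot [h_{t-1}; x_t] + b_o)$ | How much of cell state to expose as $h_t$ |
| Hidden state | $h_t = o_t \odot \tanh(c_t)$ | Output hidden state |

Here $\sigma$ is the logistic sigmoid and $\odot$ is element-wise multiplication.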
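
A minimal NumPy sketch of one LSTM step, following the standard gate equations (forget, input, candidate, cell update, output, hidden state). The parameter layout (a dict keyed by `'f'`, `'i'`, `'c'`, `'o'`) and the sizes are assumptions chosen for readability, not how any particular library stores its weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params maps 'f', 'i', 'c', 'o' to (W, b) pairs."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(params["f"][0] @ z + params["f"][1])        # forget gate
    i = sigmoid(params["i"][0] @ z + params["i"][1])        # input gate
    c_tilde = np.tanh(params["c"][0] @ z + params["c"][1])  # candidate content
    c = f * c_prev + i * c_tilde                            # cell update
    o = sigmoid(params["o"][0] @ z + params["o"][1])        # output gate
    h = o * np.tanh(c)                                      # exposed hidden state
    return h, c

rng = np.random.default_rng(0)
d_x, d_h = 4, 3                                             # illustrative sizes
params = {k: (rng.normal(scale=0.5, size=(d_h, d_h + d_x)), np.zeros(d_h)) for k in "fico"}
h, c = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), params)
print(h.shape, c.shape)
```

Note that the cell update is additive: when the forget gate is close to 1, the previous cell state passes through almost unchanged, which is what lets information survive many steps.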

Parameter Count

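There are 4 weight matrices: $W_f$, $W_i$, $W_c$, $W_o$, each with its own bias. With hidden size $d_h$ and input size $d_x$ the count is $4 \times [d_h \times (d_h + d_x) + d_h]$, hence 4× an equivalent SimpleRNN.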
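
As a worked check, assuming a 64-dimensional embedding feeding an LSTM with 128 units, followed by a second LSTM with 64 units (the helper name `lstm_params` is hypothetical):

```python
def lstm_params(d_h, d_x):
    """Four gate/candidate blocks, each with a weight matrix and a bias vector."""
    return 4 * (d_h * (d_h + d_x) + d_h)

first = lstm_params(d_h=128, d_x=64)    # input is the 64-dim embedding
second = lstm_params(d_h=64, d_x=128)   # input is the first layer's 128-dim hidden state
print(first, second)                    # 98816 49408
```

The second layer's input size is the first layer's hidden size, not the embedding dimension.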

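# Two LSTM layers for binary classification; vocab_size is the tokenizer's vocabulary size.
model = keras.Sequential([
    keras.layers.Embedding(input_dim=vocab_size, output_dim=64),
    keras.layers.LSTM(128, return_sequences=True),   # full sequence for the next LSTM
    keras.layers.LSTM(64),                           # last hidden state only
    keras.layers.Dense(1, activation='sigmoid')      # binary output
])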

Stacked RNNs / LSTMs

Multiple RNN/LSTM layers can be stacked on top of each other. Every layer except the last must set return_sequences=True so that it passes a full sequence to the next layer. Stacking may improve performance when the dataset is large.

# vocab_size and num_classes are placeholders for the tokenizer's vocabulary size and the number of labels.
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, 64),
    keras.layers.LSTM(128, return_sequences=True),  # passes full sequence
    keras.layers.LSTM(64),                           # final layer
    keras.layers.Dense(num_classes, activation='softmax')
])

Bidirectional RNN

A Bidirectional RNN runs two independent RNNs over the sequence, one forward and one backward, and concatenates their hidden states: $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$.

  • Use when: the full input sequence is available (e.g., text classification, encoding).
  • Cannot use as decoder: the backward pass requires future tokens, which don't exist yet during autoregressive generation.

model = keras.Sequential([
    keras.layers.Embedding(vocab_size, 64),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(1, activation='sigmoid')
])
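
To make the concatenation concrete, here is a NumPy sketch that runs a simple tanh RNN in both directions and joins the states per time step. It is an illustration only: the plain tanh cell, the sizes, and the helper names `run_rnn`, `bidirectional`, and `make_params` are assumptions.

```python
import numpy as np

def run_rnn(x_seq, W, b):
    """Return all hidden states of a tanh RNN, one row per time step."""
    h = np.zeros(b.shape[0])
    states = []
    for x_t in x_seq:
        h = np.tanh(W @ np.concatenate([h, x_t]) + b)
        states.append(h)
    return np.stack(states)

def bidirectional(x_seq, fwd_params, bwd_params):
    """Concatenate forward states with backward states re-aligned to input order."""
    h_fwd = run_rnn(x_seq, *fwd_params)
    h_bwd = run_rnn(x_seq[::-1], *bwd_params)[::-1]   # run on reversed input, then flip back
    return np.concatenate([h_fwd, h_bwd], axis=-1)

rng = np.random.default_rng(0)
d_x, d_h, T = 4, 3, 5                                 # illustrative sizes

def make_params():
    return rng.normal(scale=0.5, size=(d_h, d_h + d_x)), np.zeros(d_h)

fwd, bwd = make_params(), make_params()               # two independent parameter sets
x_seq = rng.normal(size=(T, d_x))
h = bidirectional(x_seq, fwd, bwd)
print(h.shape)                                        # (T, 2 * d_h)
```

The backward half of the state at the first position has already seen the entire sequence, which is exactly why this construction needs the full input up front.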

Pretrained Embeddings

The embedding layer often accounts for most of a model's trainable parameters (vocabulary size × embedding dimension). When labeled data is scarce, load a pretrained embedding (e.g., GloVe, Word2Vec) and freeze it to reduce trainable parameters and leverage large-corpus knowledge.
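
One way to prepare such an embedding is to copy pretrained vectors into a matrix indexed by the tokenizer's vocabulary. The sketch below uses hypothetical toy data in place of a real GloVe/Word2Vec lookup, and the helper name `build_embedding_matrix` is an assumption.

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim):
    """Row i holds the pretrained vector for the token with index i; unknown tokens stay zero."""
    matrix = np.zeros((len(vocab) + 1, dim))   # row 0 is reserved for padding
    for token, idx in vocab.items():
        vec = pretrained.get(token)
        if vec is not None:
            matrix[idx] = vec
    return matrix

# Hypothetical toy data standing in for a tokenizer vocabulary and a pretrained lookup.
vocab = {"the": 1, "cat": 2, "zzyzx": 3}
pretrained = {"the": np.array([0.1, 0.2]), "cat": np.array([0.3, 0.4])}
matrix = build_embedding_matrix(vocab, pretrained, dim=2)
print(matrix.shape)
```

In Keras, a matrix like this can be used to initialize an `Embedding` layer, and setting the layer's `trainable` attribute to `False` keeps it frozen during training.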


Best Practices Summary

  1. Always use LSTM instead of SimpleRNN.
  2. Use Bi-RNN instead of unidirectional RNN whenever possible.
  3. Stack RNN layers for larger datasets.
  4. Pretrain the embedding layer when labeled data is small.

Text Generation (Char-Level)

  1. Slice text into overlapping segments; each segment is input, the next character is the label.
  2. Formulated as multi-class classification (one class per character).
  3. Choosing the next character:
    • Greedy: always pick the highest-probability character; too deterministic.
    • Multinomial sampling: sample from the distribution; too random.
    • Temperature-scaled sampling (best): adjust the sharpness of the distribution.
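
import numpy as np

def sample_with_temperature(predictions, temperature=1.0):
    """Sample an index from `predictions` after temperature scaling."""
    predictions = np.asarray(predictions).astype("float64")
    # p ** (1/T) equals exp(log(p) / T): T < 1 sharpens, T > 1 flattens the distribution.
    predictions = predictions ** (1.0 / temperature)
    predictions = predictions / np.sum(predictions)   # renormalize so the values sum to 1
    return np.random.choice(len(predictions), p=predictions)
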
  • temperature < 1: more deterministic (sharper distribution).
  • temperature > 1: more random (flatter distribution).
  • temperature = 1: standard multinomial sampling.
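
Returning to step 1 of the recipe, slicing text into overlapping segments with next-character labels can be sketched as follows. The segment length, the stride, and the function name `make_segments` are assumptions for illustration.

```python
def make_segments(text, seg_len, stride):
    """Return (segments, next_chars): each segment is paired with the character that follows it."""
    segments, next_chars = [], []
    for start in range(0, len(text) - seg_len, stride):
        segments.append(text[start:start + seg_len])
        next_chars.append(text[start + seg_len])
    return segments, next_chars

segments, next_chars = make_segments("hello world", seg_len=5, stride=3)
print(segments)     # ['hello', 'lo wo']
print(next_chars)   # [' ', 'r']
```

A stride smaller than the segment length makes the segments overlap, which yields more training pairs from the same text.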

Machine Translation: Seq2Seq

Architecture: LSTM Encoder → final states → LSTM Decoder

  • Two separate tokenizers/dictionaries (source and target languages have different vocabularies).
  • Loss: Cross-Entropy.

Improvements

| Technique | Why it helps |
|---|---|
| Bi-LSTM Encoder | Longer memory: doesn't forget early tokens in long sentences |
| Word-level tokenization | Shorter sequences → less forgetting, but requires more data |
| Multi-task learning | Additional supervision signal |
| Attention | Decoder can focus on relevant encoder states |
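
To illustrate the attention row, here is a NumPy sketch of one simple variant, dot-product attention: the decoder state is scored against every encoder state, the scores are turned into weights with a softmax, and the weighted sum becomes a context vector. The notes do not say which attention variant is meant, so this choice and the function name `attention_context` are assumptions.

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Return (context, weights) for one decoder step using dot-product scores."""
    scores = encoder_states @ decoder_state        # one score per encoder step, shape (T,)
    weights = np.exp(scores - scores.max())        # numerically stable softmax
    weights /= weights.sum()
    context = weights @ encoder_states             # weighted sum of encoder states, shape (d,)
    return context, weights

encoder_states = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
decoder_state = np.array([0.0, 3.0])
context, weights = attention_context(decoder_state, encoder_states)
print(weights.argmax())                            # 1: the encoder state most aligned with the decoder state
```

Because the context is recomputed at every decoder step, the decoder no longer depends on a single final encoder state to carry the whole sentence.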

Why can't the Bi-LSTM be the decoder? Causality. Decoding is autoregressive: you generate one token at a time. The backward LSTM requires seeing future tokens, which don't exist yet.

# Vocabulary sizes are assumed values; use the sizes of your source/target tokenizers.
src_vocab_size, tgt_vocab_size = 10000, 10000

# Encoder: the Bi-LSTM returns its output plus (h, c) for each direction
encoder_inputs = keras.Input(shape=(None,))
enc_emb = keras.layers.Embedding(src_vocab_size, 256)(encoder_inputs)
encoder_lstm = keras.layers.Bidirectional(keras.layers.LSTM(256, return_state=True))
enc_out, fwd_h, fwd_c, bwd_h, bwd_c = encoder_lstm(enc_emb)
state_h = keras.layers.Concatenate()([fwd_h, bwd_h])   # 512-dim hidden state
state_c = keras.layers.Concatenate()([fwd_c, bwd_c])   # 512-dim cell state

# Decoder: unidirectional, with 512 units to match the concatenated encoder states
decoder_inputs = keras.Input(shape=(None,))
dec_emb = keras.layers.Embedding(tgt_vocab_size, 256)(decoder_inputs)
decoder_lstm = keras.layers.LSTM(512, return_sequences=True, return_state=True)
dec_out, _, _ = decoder_lstm(dec_emb, initial_state=[state_h, state_c])
decoder_outputs = keras.layers.Dense(tgt_vocab_size, activation='softmax')(dec_out)

# Integer targets pair with sparse categorical cross-entropy
model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
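
A model with separate encoder and decoder inputs is commonly trained by feeding the decoder the target sentence shifted right by one position, so that at each step it predicts the next token. The sketch below assumes integer-encoded targets and dedicated start/end token ids; neither the ids nor the helper name `make_decoder_pair` come from the notes.

```python
def make_decoder_pair(target_ids, start_id, end_id):
    """Return (decoder_input, decoder_target) for one target sentence."""
    decoder_input = [start_id] + target_ids    # what the decoder sees at each step
    decoder_target = target_ids + [end_id]     # what it should predict at each step
    return decoder_input, decoder_target

dec_in, dec_out = make_decoder_pair([5, 6, 7], start_id=1, end_id=2)
print(dec_in)    # [1, 5, 6, 7]
print(dec_out)   # [5, 6, 7, 2]
```

Position `t` of the target is always the token that follows position `t` of the decoder input, which keeps training consistent with one-token-at-a-time generation.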