Your phone just suggested the word “tomorrow” before you finished typing “see you to”. That suggestion didn’t come from a transformer. It came from a model that fits in 50KB, runs in microseconds, and is older than the smartphone you’re holding.
A lot of working software still runs on the AI that came before the AI. This post is about that AI.
In the previous post we looked at what might come after the transformer. This post goes the other way. Before BERT, before word2vec, before deep learning was the default, NLP ran on a small set of statistical and probabilistic models that did genuinely useful work – work some of them still do today, in places where the cost, latency, or interpretability profile of a transformer would be the wrong trade-off.
These aren’t museum pieces. They’re production tools. You should know about them because they’re often the correct answer, especially for problems with tight latency budgets, small datasets, or auditability requirements.
n-gram language models
An n-gram model is a language model in the most literal sense: it estimates the probability of the next word given the previous n-1 words.
A bigram model (n=2) estimates P(word | previous word). “The cat sat on the ___” – given the model has seen “on the mat” enough times in training data, it estimates a high probability for “mat” given “the.” A trigram model uses two preceding words. A 5-gram model uses four.
The model itself is just a giant table of counts: count how many times each n-gram appeared in your training corpus, divide by the count of the prefix, and that’s your probability estimate. No neural network. No gradient descent. No GPU. Just a hash table.
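A minimal sketch of that table, for the bigram case – the toy corpus and function names here are illustrative, not from any particular library:

```python
from collections import Counter

# Toy corpus; a production model would count millions of words.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))  # count(prev, word)
prefix_counts = Counter(corpus[:-1])              # count(prev)

def p_bigram(word: str, prev: str) -> float:
    """Maximum-likelihood estimate: count(prev, word) / count(prev)."""
    if prefix_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / prefix_counts[prev]

print(p_bigram("mat", "the"))  # 0.25 – "the" precedes cat/mat/dog/rug equally here
```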
This sounds laughably primitive in 2026. It’s also how Google’s mobile keyboard worked for years, how speech recognition worked for years, and how machine translation worked for years – and the n-gram model was state of the art at all three.
Why n-gram models still ship
Three reasons.
First, they’re tiny. A 5-gram model trained on a few million words of domain-specific text fits in megabytes. It runs on a phone, on an embedded device, in a process that wakes up for one millisecond at a time.
Second, they’re fast. Lookup is a single hash-table query. The latency is nanoseconds. There’s no model to load, no GPU to wait for.
Third, they’re deterministic and auditable. If your spam filter or autocomplete makes a mistake, you can find out exactly which n-gram triggered the decision and which counts produced the probability. There’s no opaque embedding to introspect.
The trade-off is severe: n-gram models can’t generalise beyond what they’ve literally seen. “The dog sat on the mat” might be a familiar pattern; “The aardvark sat on the mat” is brand new and the model has nothing useful to say. They suffer from the sparsity problem – most plausible n-grams never appear in the training data at all, even with a large corpus.
A lot of the cleverness in classical n-gram modelling went into smoothing techniques (Kneser-Ney, Good-Turing) that estimate plausible probabilities for n-grams the model never saw, by backing off to shorter n-grams. These methods are mature and well-understood, and they’re still the foundation of fast statistical models for autocompletion, predictive text, and parts of speech recognition pipelines.
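Kneser-Ney itself takes some care to implement, but the backoff idea fits in a few lines. Here’s a sketch using the simpler “stupid backoff” scheme (Brants et al., 2007) as a stand-in for full smoothing – `counts` is assumed to map word tuples of every order to their frequencies:

```python
ALPHA = 0.4  # fixed backoff penalty, the value used by Brants et al.

def score(word, context, counts, total_words):
    """Stupid-backoff score of `word` given a tuple of preceding words.

    Not a true probability (scores don't sum to 1), but good enough
    for ranking candidates, and far cheaper than full Kneser-Ney.
    """
    if not context:
        return counts.get((word,), 0) / total_words  # unigram base case
    ngram = context + (word,)
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[context]
    # Unseen n-gram: drop the oldest context word and back off, with a penalty.
    return ALPHA * score(word, context[1:], counts, total_words)
```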
Where you’ll find them
- Mobile autocomplete and predictive text in keyboards that need to run offline.
- Speech recognition language models – the acoustic part is now neural, but a fast n-gram language model is often the rescoring layer that picks between candidate transcriptions.
- Spell checkers and grammar checkers, especially for languages where there isn’t a large neural model available.
- Search query understanding for tail queries where you want a fast statistical signal, not a 200ms LLM round trip.
Hidden Markov Models
A Hidden Markov Model (HMM) is the next conceptual rung up. It models a sequence of observations that are generated by an underlying sequence of hidden states, where each state depends only on the previous state and each observation depends only on the current state.
The classical example: part-of-speech tagging. The observation sequence is the words you can see. The hidden sequence is the part-of-speech tag for each word – noun, verb, adjective. The HMM models two things:
- Transition probabilities: how likely is each tag to follow each other tag? (e.g. determiners are often followed by nouns)
- Emission probabilities: how likely is each word to be generated by each tag? (e.g. “run” can be a noun or a verb, with different probabilities for each)
Given a sentence, you find the most likely sequence of tags by running the Viterbi algorithm – a dynamic programming procedure that’s been the standard textbook example since the 1970s.
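The whole algorithm is short. A sketch, assuming log-probability tables keyed by tag (the table layout and the -1e9 floor for unseen words are my simplifications; a real tagger would smooth):

```python
def viterbi(words, tags, log_start, log_trans, log_emit):
    """Most likely hidden tag sequence for `words` under an HMM.

    log_start[t]    : log P(first tag is t)
    log_trans[p][t] : log P(tag t follows tag p)
    log_emit[t][w]  : log P(word w emitted by tag t)
    """
    UNSEEN = -1e9  # crude floor for words a tag never emitted
    # best[i][t]: score of the best tag path for words[:i+1] ending in t
    best = [{t: log_start[t] + log_emit[t].get(words[0], UNSEEN) for t in tags}]
    back = [{}]
    for i, w in enumerate(words[1:], start=1):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + log_trans[p][t])
            best[i][t] = best[i - 1][prev] + log_trans[prev][t] + log_emit[t].get(w, UNSEEN)
            back[i][t] = prev
    # Walk the backpointers from the best final tag.
    tag = max(best[-1], key=best[-1].get)
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = back[i][tag]
        path.append(tag)
    return path[::-1]
```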
HMMs were the dominant approach to:
- Part-of-speech tagging, until CRFs (next section) and then neural taggers replaced them.
- Speech recognition acoustic modelling, until deep learning replaced them in the early 2010s.
- Bioinformatics gene prediction, where they’re still widely used because biology has structural assumptions that match HMMs well.
- Chunking and shallow parsing.
Why HMMs still ship
Two reasons.
First, biology. Genes have a structure that maps cleanly onto hidden states (intron, exon, promoter, terminator) and HMMs have decades of domain tuning for biology baked into them. Tools like HMMER for protein sequence analysis are everywhere in computational biology, and they’re not getting replaced by transformers any time soon.
Second, speed and tractability for low-resource languages. Training a neural POS tagger requires a lot of labelled data and a lot of compute. Training an HMM tagger requires hundreds of labelled sentences and a laptop. For low-resource language pipelines, an HMM is often the actual production tool.
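That training step really is just counting. A toy sketch (the two-sentence corpus is obviously illustrative), whose tables feed straight into the Viterbi sketch above:

```python
from collections import Counter

# Hand-labelled sentences: lists of (word, tag) pairs.
tagged = [
    [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("ran", "VERB")],
]

start, trans, trans_total = Counter(), Counter(), Counter()
emit, tag_total = Counter(), Counter()
for sent in tagged:
    start[sent[0][1]] += 1
    for word, tag in sent:
        emit[(tag, word)] += 1
        tag_total[tag] += 1
    for (_, t1), (_, t2) in zip(sent, sent[1:]):
        trans[(t1, t2)] += 1
        trans_total[t1] += 1

# Maximum-likelihood estimates; a real tagger would smooth these.
p_emit = {(t, w): c / tag_total[t] for (t, w), c in emit.items()}
p_trans = {(t1, t2): c / trans_total[t1] for (t1, t2), c in trans.items()}
```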
Conditional Random Fields
A Conditional Random Field (CRF) is the more flexible cousin of the HMM. The idea: instead of modelling the joint probability of observations and hidden states (HMM-style, which makes strong independence assumptions), model the conditional probability of the hidden states given the observations directly.
In practice this lets you incorporate arbitrary features – not just “the current word” but “is the current word capitalised?”, “does it end in -ing?”, “is the previous word ‘to’?”, “what’s the gazetteer match?” – without breaking the model’s mathematical structure. CRFs work by combining many weak features through learned weights, much like logistic regression for sequences.
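In code, those features are usually just a dictionary per token. A sketch in the style of sklearn-crfsuite (the feature names are mine; the library accepts any string-keyed dict):

```python
def token_features(sent, i):
    """Hand-crafted features for token i – exactly the kind of signals
    a CRF combines through learned weights."""
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),            # capitalised?
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],                      # catches -ing, -ion, ...
        "prev.word": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next.word": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

# With sklearn-crfsuite, X is a list of per-sentence feature-dict lists
# and y the matching label sequences:
#   crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1)
#   crf.fit(X_train, y_train)
```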
CRFs were the standard for sequence labelling tasks throughout the 2010s:
- Named entity recognition (people, places, organisations, dates).
- Information extraction from semi-structured text.
- Slot filling in dialogue systems.
- Biomedical entity tagging (gene names, drug names, diseases).
The classical pipeline – hand-craft good features, train a CRF on labelled data, deploy – produced systems that ran on CPUs at thousands of sentences per second with high accuracy. Many production NER systems still run a CRF either as the primary tagger or as a final layer on top of a neural model.
When a CRF still wins
CRFs are a good answer when:
- You have moderate amounts of labelled data (say, 1k-10k sentences) – enough to learn meaningful weights, not enough to fine-tune a transformer well.
- You need high precision on a fixed set of labels – regulatory keyword matching, structured-record extraction, controlled vocabularies.
- You need to explain decisions – which features contributed to which label.
- Latency matters – a CRF tagger runs in microseconds per sentence. A transformer NER model runs in milliseconds.
- You’re working in a specialised domain with idiosyncratic vocabulary – medical, legal, scientific. Hand-crafted features encode domain knowledge that a generic transformer doesn’t have.
Word embeddings: the bridge
Between the n-gram era and the transformer era there was a brief but enormously influential phase where word embeddings became the primary research tool. Word2Vec (Mikolov et al., Google, 2013) and GloVe (Stanford, 2014) trained dense vectors for words that captured semantic relationships – the famous “king - man + woman = queen” arithmetic.
These models are no longer state of the art, but their descendants live everywhere. Modern sentence embeddings (BGE, E5, see The Other Transformers) are direct conceptual descendants. Many smaller production NLP systems still use word2vec-style embeddings as a fast feature backbone, sometimes feeding into a CRF or a small classifier rather than a transformer.
If you’ve ever trained vectors with gensim’s Word2Vec or pulled in GloVe vectors as a baseline before reaching for a transformer, you’ve used this generation of model.
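Poking at this generation of model takes a few lines (gensim ≥ 4 assumed; the toy corpus is mine):

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)

vec = model.wv["cat"]  # a dense 50-dimensional vector

# On a model trained on real text, the famous analogy looks like:
#   model.wv.most_similar(positive=["king", "woman"], negative=["man"])
#   -> [("queen", ...), ...]
```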
A decision table
| If your task is… | Reach for… | Why not a transformer? |
|---|---|---|
| Predictive text on an offline device | A 5-gram language model with smoothing | Megabytes vs gigabytes; nanosecond latency |
| Rescoring speech recognition hypotheses | An n-gram LM | Streaming + low latency requirements |
| Predicting protein-coding regions in DNA | A profile HMM (HMMER) | Decades of domain tuning; biological structure matches the model |
| POS tagging a low-resource language | An HMM with a small labelled corpus | No transformer pre-training in that language |
| Extracting drug names from clinical notes | A CRF with hand-crafted features and a gazetteer | High precision; auditability; low latency on a CPU |
| Building a chatbot | A transformer LLM | n-grams and HMMs cannot generate fluent multi-turn text |
| Understanding ambiguous, context-rich queries | A transformer | Classical models struggle with long-range context |
The story of NLP often gets told as a march of progress where each new generation makes the previous one obsolete. The actual picture is more layered. n-gram models still suggest the next word on your phone, still rescore speech-recognition hypotheses, still run inside spell checkers because they fit in megabytes and answer in nanoseconds. HMMs still dominate computational biology because gene structure maps cleanly onto hidden states and decades of domain tuning don’t transfer to a transformer overnight. CRFs are still the right answer when you have a thousand labelled sentences, a regulated domain, and a need to explain every decision the system makes.
Pre-transformer doesn’t mean obsolete. It means a different cost-benefit curve. The classical tools win where their curve dominates: on devices that can’t load a GPU, on languages without pre-training, on tasks that need to run in microseconds, and in front of auditors who want to see the features and the weights. Reach for a transformer when you need the long-range context and the generative fluency. Reach for one of these when you don’t.
The next chapter, The Boring Baseline That Wins, publishes around 23 May.