The Other Transformers

April 25, 2026 · 10 min read

You have a backlog of 80,000 support tickets and you need to tag each one with one of fourteen categories. Someone suggests using an LLM. You write the prompt, you wire up the API, you run the numbers – and the bill comes back at $1,400 just for the categorisation. You haven’t even started doing anything with the categories yet.

There’s a better tool for this. It’s also a transformer. It’s just not the one everyone talks about.

In To LLMs… and Beyond! we treated “transformer” as one thing – the engine behind Claude, GPT, Llama. That was useful for a tour of the field, but it elided a real distinction. The transformer architecture comes in three structural shapes, and only one of them is the autoregressive text-generator that the AI conversation has fixated on.

The other two are still in production at every serious AI shop. They’re cheaper, faster, and often more accurate for the jobs they were designed to do. This post is about when to reach for them instead.

Three shapes from one paper

The 2017 paper Attention Is All You Need introduced the transformer with a specific job in mind: machine translation. English in, French out. The architecture had two halves – an encoder that read the English sentence and produced an internal representation of its meaning, and a decoder that consumed that representation and produced French one token at a time.

Almost immediately, researchers noticed you could use the halves separately.

  • Encoder-only models keep just the encoder. They take text in and produce a representation – a vector, a label, a span. They never generate text. BERT (2018) is the headline example.
  • Decoder-only models keep just the decoder. They take text in and produce more text, one token at a time. GPT, Claude, and Llama are all this shape.
  • Encoder-decoder models keep both halves. They take text in, encode it, and decode something different out. T5 and BART are the headline examples.

The shape determines what the model is good at. And it determines what it costs.

Encoder-only: BERT and friends

BERT stands for Bidirectional Encoder Representations from Transformers. The “bidirectional” is the part that matters. A decoder-only model like GPT processes text left-to-right, one token at a time – when it’s predicting the next token, it can only see what came before. An encoder-only model processes the entire sequence at once, and every token can attend to every other token in both directions.

This makes encoder-only models worse at generating fluent text – in fact, they don’t generate text at all in the usual sense – but better at understanding it. When BERT looks at the word “bank” in “I sat by the bank of the river,” it can see “river” three tokens later, and that informs its representation of “bank.” A left-to-right model has to commit to a meaning before it has all the evidence.

What encoder-only models actually output is a sequence of vectors – one per input token. You can use those vectors directly (as embeddings for similarity search) or you can stick a tiny classification head on top (a single linear layer that maps a vector to a label) and get a classifier.

The big BERT-family models you’ll encounter:

| Model | Made by | Notable for |
|---|---|---|
| BERT | Google, 2018 | The original. Set state of the art on a dozen benchmarks overnight. |
| RoBERTa | Meta, 2019 | BERT trained better – more data, longer, with the masking strategy fixed. Usually beats BERT. |
| DeBERTa | Microsoft, 2020–2021 | Disentangled attention. Strong on classification benchmarks, often the default for new projects. |
| DistilBERT | Hugging Face, 2019 | A 40%-smaller BERT that's 60% faster and keeps 97% of the accuracy. The pragmatic choice. |
| ModernBERT | Answer.AI, 2024 | BERT with the last six years of architectural improvements bolted on. Long context, fast inference. |

These are all small. BERT-base has 110 million parameters, DistilBERT has 66 million, ModernBERT-large has 395 million. Compare that to a frontier LLM at hundreds of billions. They run on a CPU. They run on your laptop. They run on a Raspberry Pi if you don’t mind waiting.
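The "runs on your laptop" claim is simple arithmetic: parameter count times bytes per parameter. A quick sketch, assuming fp16 weights (2 bytes per parameter) and ignoring activation memory:

```python
# Back-of-envelope memory footprints: parameters x bytes per parameter.
# Parameter counts are from the table above; 2 bytes/param assumes fp16
# weights, and activation memory is ignored.

def footprint_gb(params, bytes_per_param=2):
    """Approximate weight-storage footprint in gigabytes."""
    return params * bytes_per_param / 1e9

models = {
    "DistilBERT":       66e6,
    "BERT-base":        110e6,
    "ModernBERT-large": 395e6,
    "70B-class LLM":    70e9,   # for scale; frontier models are larger still
}

for name, params in models.items():
    print(f"{name:>16}: {footprint_gb(params):7.2f} GB")
```

The whole BERT family fits in well under a gigabyte; a 70B-parameter LLM needs around 140 GB before you've processed a single token. That gap is the entire economic argument of this post.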

What encoder-only models are good at

Anything where the answer is shorter than the input. Specifically:

  • Classification. Sentiment, intent, topic, language detection, content moderation, spam, urgency triage. One label out per input.
  • Multi-label classification. Tagging a document with several categories at once.
  • Named entity recognition (NER). Picking out people, places, organisations, dates from text. One label per token.
  • Span extraction. “Find the answer to this question inside this document.” The model points at the start and end positions of the span. SQuAD-style question answering.
  • Sentence embeddings. Producing a fixed-size vector that represents the meaning of a piece of text. The foundation of semantic search and RAG.
  • Pairwise classification. “Are these two sentences saying the same thing?” “Does sentence A entail sentence B?”

For all of these, an LLM will also work. It will just cost roughly a hundred times more, take roughly ten times longer, and – in many cases – be less accurate.
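To make the "hundred times" concrete, here is the back-of-envelope arithmetic behind the $1,400 bill from the opening. Every price and throughput figure below is an assumed round number for illustration, not any vendor's actual rate – and note that the self-hosting comparison is extremely sensitive to those assumptions:

```python
# Back-of-envelope cost of classifying 80,000 support tickets.
# All prices and throughputs are assumed round numbers for illustration.

tickets = 80_000
tokens_per_ticket = 350          # prompt + ticket text + label, assumed

# Option 1: hosted LLM, paying per token.
llm_price_per_mtok = 50.0        # $ per million tokens, assumed blended rate
llm_cost = tickets * tokens_per_ticket / 1e6 * llm_price_per_mtok

# Option 2: self-hosted DistilBERT classifier, paying for CPU time.
tickets_per_cpu_hour = 40_000    # assumed throughput on one vCPU
cpu_price_per_hour = 0.05        # $ per vCPU-hour, assumed
bert_cost = tickets / tickets_per_cpu_hour * cpu_price_per_hour

print(f"LLM:  ${llm_cost:,.2f}")   # prints "LLM:  $1,400.00"
print(f"BERT: ${bert_cost:,.2f}")
```

Under these assumptions the encoder route costs pennies. The exact ratio depends heavily on how you host the small model – a managed inference endpoint narrows the gap – but it rarely gets anywhere near parity.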

Why an LLM is often worse, not just more expensive

Counterintuitive but real: a fine-tuned BERT often outperforms a frontier LLM at classification tasks the BERT was specifically trained for.

The reason is task alignment. An LLM is trained to predict the next token across the entirety of internet text. A fine-tuned classifier is trained on labelled examples of exactly the task you care about – ten thousand support tickets with their correct categories, say. The LLM has read the universe and has a vague sense of what “billing” means; the classifier has stared at your specific definition of “billing” for a thousand epochs.

The LLM also has to speak its answer, which introduces failure modes the classifier doesn’t have. Will it return “billing” or “Billing” or “billing/payments” or a polite refusal because the ticket mentions a credit card? The classifier returns one of fourteen integers. Always.

There’s an obvious counter: what if you don’t have ten thousand labelled examples? That’s a genuine constraint, and it’s where LLMs shine – zero-shot or few-shot classification from a prompt is a real superpower when you’re starting from nothing. But the moment you’ve labelled enough data to fine-tune a small encoder, the cost-quality curve usually flips.

Encoder-decoder: T5, BART, FLAN

The encoder-decoder shape is for jobs where the output is structured but isn’t a free-form essay – a transformation of the input rather than a continuation of it.

The flagship example is Google’s T5 (Text-to-Text Transfer Transformer, 2019), which framed every NLP task as text-in, text-out:

  • Translation: input “translate English to German: That is good.” → output “Das ist gut.”
  • Summarisation: input “summarize: <article>” → output “<summary>”
  • Classification: input “cola sentence: The course is jumping well.” → output “not acceptable”
  • Question answering: input “question: What is the capital of France? context: …” → output “Paris”

The shape is well-suited to anything that has a deterministic-ish target – a translation, a summary, a structured output, a SQL query generated from a natural-language question. The encoder reads the whole input once, builds a rich representation, and the decoder produces the (usually short) output guided by that representation.
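The "everything is text-to-text" framing is just input serialisation: a task prefix tells the one model which transformation to perform. A sketch of the serialisers behind the examples above (the prefix strings mirror the T5 paper's examples; treat the exact formatting as illustrative):

```python
# T5 frames every task as text-in, text-out by prepending a task prefix.
# The model itself is unchanged across tasks - only the input string
# tells it what to do. Prefixes follow the T5 paper's examples.

def translation_input(text, src="English", tgt="German"):
    return f"translate {src} to {tgt}: {text}"

def summarisation_input(article):
    return f"summarize: {article}"

def qa_input(question, context):
    return f"question: {question} context: {context}"

print(translation_input("That is good."))
# prints "translate English to German: That is good."
print(qa_input("What is the capital of France?",
               "France's capital and largest city is Paris."))
```

That uniformity is the whole trick: one model, one training objective, one decoding loop, and the task lives entirely in the input string.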

The other notable encoder-decoder family is BART (Meta, 2019), which was trained on a denoising objective – corrupt the input, recover the original – and is particularly strong at summarisation.

The fine-tuned descendants – FLAN-T5 (instruction-tuned, in sizes up to XXL) and BART-large-CNN (tuned for summarisation) – are still common backbones for production summarisation and translation pipelines, especially when you want to fine-tune on your own data.

What encoder-decoder models are good at

  • Translation. The original use case, still strong.
  • Summarisation. Extractive (copy spans) or abstractive (rewrite). BART-large-CNN was the production default for years.
  • Structured generation. Text-to-SQL, text-to-JSON, text-to-API-call. The encoder grounds the output in the input.
  • Grammar correction. Input: messy sentence. Output: clean sentence.
  • Question answering with generation. Where the answer isn’t necessarily a span in the document and needs to be paraphrased.

The boundary with decoder-only LLMs has blurred. Modern LLMs do all of the above competently, often better than older T5 models, and the simplicity of “one model for everything” has pulled a lot of work toward the decoder-only side. But for pipelines where you need something small, fast, deterministic, and fine-tuneable, T5-family models still pull their weight.

A decision table

| If your task is… | Reach for… | Why not an LLM? |
|---|---|---|
| Tag each item with one of N categories | DeBERTa or DistilBERT, fine-tuned | 100x cheaper, often more accurate, no parsing of free-text output |
| Find people, places, dates in text | A BERT-family NER model (e.g. spaCy's transformer) | Token-level precision, no hallucinated entities |
| Embed sentences for semantic search | A sentence-transformers model (BGE, E5, GTE) | LLMs don't natively produce sentence embeddings; encoder models do this as their primary job |
| Translate between languages at scale | A T5- or NLLB-family model, fine-tuned if needed | Per-token cost matters at translation volumes; specialised models still lead |
| Convert natural language to SQL or JSON | A code-fine-tuned T5, or an LLM if accuracy matters more than cost | Mixed – LLMs win on hard cases, encoder-decoders win on cost at scale |
| Decide if a comment is toxic | A fine-tuned encoder classifier (e.g. Detoxify) | Real-time moderation needs millisecond latency, not 800ms API round-trips |
| Have a free-form conversation | An LLM | Encoder models cannot generate fluent multi-turn text |
| Reason through a multi-step problem | An LLM, ideally a reasoning model | Encoder models have no chain-of-thought; they produce one answer in one pass |

The pragmatic stack

In production AI systems, you’ll often see encoder, encoder-decoder, and decoder-only models working together rather than competing.

A typical retrieval-augmented chat application:

  1. Bi-encoder (BERT-family) embeds the user’s query and finds the top 100 candidate documents from the vector database. Cheap, parallel, fast.
  2. Cross-encoder (BERT-family) re-ranks those 100 down to the top 5 by reading each query-document pair carefully. We’ll cover this in the next post.
  3. Decoder-only LLM consumes the top 5 documents alongside the query and writes a fluent answer.

Each stage uses the right tool for its job. The encoder does the cheap, high-throughput retrieval and ranking. The LLM does the expensive, low-throughput generation, but only after the encoder has narrowed the search space by three orders of magnitude.

This is the pattern that matters. It’s not “LLM vs BERT.” It’s “use BERT to make the LLM step efficient enough to be worth doing.”

Where to find them

  • Hugging Face is the de facto registry. bert-base-uncased, roberta-large, microsoft/deberta-v3-large, distilbert-base-uncased, answerdotai/ModernBERT-large, t5-base, facebook/bart-large-cnn, google/flan-t5-xl – all available, all free to download.
  • sentence-transformers is the library for using BERT-family models as embedding models. all-MiniLM-L6-v2 is the gateway drug – 22 million parameters, runs on a phone, and is the right starting point for 80% of semantic-search projects.
  • spaCy wraps fine-tuned encoder models for NER, POS tagging, and similar pipelines, with an API designed for production use rather than research.
  • Cohere, OpenAI, Voyage sell hosted embedding APIs if you want the model without the operations.

What’s worth remembering

  1. “Transformer” is a family, not a model. Encoder-only, encoder-decoder, and decoder-only are three different shapes for three different jobs.
  2. LLMs are decoder-only. That’s what makes them generative. It’s also what makes them expensive.
  3. Encoder-only models are for understanding, not generating. Classification, embedding, NER, span extraction.
  4. Encoder-decoder models are for transforming. Translation, summarisation, structured output.
  5. Fine-tuning a small encoder beats prompting a big LLM when you have labelled data and a stable task.
  6. Most production AI is a stack, not a single model. Encoders feed LLMs, not replace them.
  7. The right question isn’t “should I use an LLM?” It’s “where in my pipeline does the LLM earn its cost?”

The next post in this series picks up on the “stack” point. Modern RAG doesn’t end with an embedding lookup – there’s a second pass most explanations skip, and it’s the one that actually makes retrieval work.

The next chapter, The Reranker You Didn't Know You Needed, publishes around 2 May.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.