The Story That Explains Text-to-Text AI
The remarkable thing? The translator never learned a separate skill for each task. They learned one meta-skill: reading text and producing the right text back. Every task — summarising, translating, answering, classifying — was reframed as "given this input text, what output text should I write?"
That is the entire philosophy of T5 (Text-To-Text Transfer Transformer) and BART (Bidirectional and Auto-Regressive Transformers).
Before T5 and BART, the NLP world was fragmented. You had one model for sentiment analysis, a different architecture for translation, a separate system for summarisation, and yet another for question answering. Each task required its own output head, its own loss function, its own fine-tuning recipe. Then, in 2019–2020, two landmark papers changed everything.
Both T5 and BART treat every NLP task as a text generation problem. Classification becomes: "Input: 'The movie was terrible' → Output: 'negative'". Summarisation becomes: "Input: [long article] → Output: [short summary]". The model always reads text in and writes text out — hence text-to-text. This single unification unlocked massive improvements across the board.
The Foundation — Encoder-Decoder Architecture
Both T5 and BART are built on the encoder-decoder Transformer, the same architecture introduced in the original "Attention is All You Need" paper (2017). Understanding this backbone is non-negotiable before studying either model.
Both T5 and BART use this encoder-decoder blueprint. The key difference lies in how they were pre-trained.
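One way to picture the encoder-decoder blueprint is through its attention masks. The sketch below is a simplified illustration in plain Python, not code from either model; the three function names are hypothetical helpers. A 1 means "this position may attend to that position".

```python
def encoder_mask(n):
    """Bidirectional self-attention: every position may attend to every other."""
    return [[1] * n for _ in range(n)]


def decoder_mask(n):
    """Causal self-attention: position i may attend only to positions <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def cross_mask(target_len, source_len):
    """Cross-attention: each decoder position may attend to every encoder output."""
    return [[1] * source_len for _ in range(target_len)]


for row in decoder_mask(4):
    print(row)
```

The encoder reads the whole input at once, while the decoder writes left to right and consults the encoder's outputs at every step.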
T5 — Text-To-Text Transfer Transformer
T5's creators at Google asked: what if all NLP tasks are really the same task? Each one comes down to "transform input text into output text." So they designed a single pan (one model, one training objective, one loss function) that can cook any NLP dish. They just prepend a text instruction telling it which dish to make.
T5 was introduced by Google in 2019 (Raffel et al., "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"). Its defining feature is radical simplicity through unification: every task is cast as sequence-to-sequence generation by adding a task prefix to the input.
The Task Prefix System
| Task | T5 Input Format | T5 Output |
|---|---|---|
| Summarisation | summarize: [article text] | Short summary |
| Translation (EN→DE) | translate English to German: [text] | German text |
| Sentiment Classification | sst2 sentence: [text] | positive or negative |
| Question Answering | question: [q] context: [passage] | Answer string |
| Grammatical Acceptability (CoLA) | cola sentence: [text] | acceptable or unacceptable |
| NLI (Entailment) | mnli hypothesis: [h] premise: [p] | entailment / contradiction / neutral |
The prefix is not a hack; it is the training signal. T5's unsupervised pre-training ran on the C4 dataset (Colossal Clean Crawled Corpus, about 750 GB of cleaned web text), and the released checkpoints were also trained on a mixture of supervised tasks, each marked with its own prefix. That is where the model learned that different prefixes call for different transformation patterns. At inference time, the prefix conditions the model to generate the appropriate type of output. You can also introduce your own prefixes for custom tasks during fine-tuning.
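The prefix formats in the table can be captured in a small helper. This is a minimal sketch: the templates are copied from the table above, while `T5_TEMPLATES` and `build_t5_input` are hypothetical names invented for illustration and are not part of any library.

```python
# Templates copied from the task prefix table; the short task keys are
# hypothetical labels chosen for this sketch.
T5_TEMPLATES = {
    "summarize": "summarize: {text}",
    "translate_en_de": "translate English to German: {text}",
    "sst2": "sst2 sentence: {text}",
    "qa": "question: {question} context: {context}",
    "cola": "cola sentence: {text}",
    "mnli": "mnli hypothesis: {hypothesis} premise: {premise}",
}


def build_t5_input(task: str, **fields: str) -> str:
    """Format raw fields into the prefixed string that T5 expects for a task."""
    if task not in T5_TEMPLATES:
        raise KeyError(f"No template registered for task {task!r}")
    return T5_TEMPLATES[task].format(**fields)


print(build_t5_input("sst2", text="The movie was terrible"))
print(build_t5_input("qa", question="Who wrote it?", context="Ann wrote it."))
```

Keeping the templates in one place makes it harder to fine-tune with one prefix spelling and serve with another.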
How T5 Was Pre-trained — Span Corruption
T5 uses a pre-training objective called Span Corruption (also called masked span prediction). This is different from BERT's masked language modelling and GPT's next-token prediction.
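Random contiguous spans (covering about 15% of tokens, with an average span length of 3) are each replaced with a single sentinel token, written <X>, <Y>, <Z> in the paper and <extra_id_0>, <extra_id_1>, ... in the Hugging Face tokenizer. The decoder must reconstruct only the missing spans, not the whole sentence. Because the targets are short, this is cheaper to train than reconstructing the full input.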
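The input and target construction can be sketched in a few lines of plain Python. This is an illustration rather than the T5 preprocessing code: `span_corrupt` is a hypothetical helper, and the spans are fixed by hand here, whereas real pre-training samples them at random.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (input, target).

    `spans` must be sorted, non-overlapping, half-open index ranges.
    """
    corrupted, target = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        corrupted.extend(tokens[cursor:start])  # keep text before the span
        corrupted.append(sentinel)              # one sentinel per dropped span
        target.append(sentinel)                 # target echoes the sentinel...
        target.extend(tokens[start:end])        # ...followed by the dropped tokens
        cursor = end
    corrupted.extend(tokens[cursor:])
    target.append(f"<extra_id_{len(spans)}>")   # closing sentinel ends the target
    return " ".join(corrupted), " ".join(target)


tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = span_corrupt(tokens, [(2, 4), (8, 9)])
print(corrupted)  # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(target)     # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Note how the target contains only the dropped tokens plus sentinels, which is why it stays much shorter than the input.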
T5 Model Variants
T5 in Python — Complete Code Examples
Example 1: Summarisation with T5
from transformers import T5ForConditionalGeneration, T5Tokenizer
# Load pretrained T5-small (downloads ~240 MB on first run)
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
# Article to summarise — note the task prefix!
article = """summarize: The Amazon rainforest, often referred to as the \
"lungs of the Earth", produces 20% of the world's oxygen and is home to \
10% of all known species. Scientists warn that deforestation rates have \
increased by 22% in the past year, threatening biodiversity and \
accelerating climate change. Governments are urged to implement \
stronger protection measures immediately."""
# Tokenise — T5 uses max_length for encoder truncation
inputs = tokenizer.encode(
    article,
    return_tensors="pt",
    max_length=512,
    truncation=True
)
# Generate summary with beam search
summary_ids = model.generate(
    inputs,
    max_length=60,
    min_length=20,
    num_beams=4,              # beam search for quality
    length_penalty=2.0,       # penalise very short outputs
    early_stopping=True,
    no_repeat_ngram_size=3    # avoid repetitive phrases
)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(f"Summary: {summary}")
Example 2: Fine-tuning T5 on Custom Summarisation Data
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer, Seq2SeqTrainer
from transformers import Seq2SeqTrainingArguments, DataCollatorForSeq2Seq
from datasets import load_dataset
# Load model and tokeniser
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
# Load CNN/DailyMail dataset (a standard summarisation benchmark)
dataset = load_dataset("cnn_dailymail", "3.0.0")
def preprocess(examples):
    # Add the task prefix (critical for T5)
    inputs = ["summarize: " + doc for doc in examples["article"]]
    targets = examples["highlights"]
    model_inputs = tokenizer(
        inputs, max_length=512, truncation=True, padding="max_length"
    )
    # text_target tokenises the summaries as decoder targets; it replaces
    # the deprecated tokenizer.as_target_tokenizer() context manager
    labels = tokenizer(
        text_target=targets, max_length=128, truncation=True, padding="max_length"
    )
    # Replace padding token id with -100 so the loss ignores it
    model_inputs["labels"] = [
        [(-100 if t == tokenizer.pad_token_id else t) for t in label]
        for label in labels["input_ids"]
    ]
    return model_inputs
# Tokenise every split; for a quick demo, shrink each split first,
# e.g. dataset["train"].select(range(1000))
tokenised = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)
# Training configuration
args = Seq2SeqTrainingArguments(
    output_dir="./t5-finetuned-cnn",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    predict_with_generate=True,        # Use generation for evaluation
    fp16=torch.cuda.is_available(),    # Mixed precision if GPU available
    save_strategy="epoch",
    evaluation_strategy="epoch",       # spelled eval_strategy in newer transformers releases
    load_best_model_at_end=True,
)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["validation"],
    tokenizer=tokenizer,               # spelled processing_class in newer transformers releases
    data_collator=data_collator,
)
trainer.train()
model.save_pretrained("./t5-finetuned-cnn")
1. Always add the task prefix — fine-tuning without it breaks the learned prefix routing.
2. Set labels padding to -100 — the cross-entropy loss ignores these positions, so you don't penalise padding.
3. Use predict_with_generate=True — T5 generates token-by-token at eval time; without this flag, the Trainer uses teacher-forcing outputs which look better than reality.
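To see why the -100 rule matters, here is a small plain-Python sketch of a cross-entropy that skips ignored positions. It is an illustration of the idea, not PyTorch's implementation; `masked_cross_entropy` is a hypothetical helper, and the logits and the padding id of 2 are made-up toy values.

```python
import math

IGNORE_INDEX = -100  # the default ignore_index of PyTorch's cross-entropy loss


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, counted = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padding position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(x) for x in row))
        total += log_norm - row[label]
        counted += 1
    return total / counted if counted else 0.0


# Three decoder positions over a 3-token vocabulary; the last one is padding.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [5.0, 0.0, 0.0]]
labels_masked = [0, 1, -100]   # padding replaced with -100
labels_unmasked = [0, 1, 2]    # toy case: padding id 2 left in the labels

loss_masked = masked_cross_entropy(logits, labels_masked)
loss_unmasked = masked_cross_entropy(logits, labels_unmasked)
print(f"masked:   {loss_masked:.4f}")
print(f"unmasked: {loss_unmasked:.4f}")
```

With the padding left in, the loss is dominated by a position that carries no information, so the gradient pushes the model toward predicting padding.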
BART — Bidirectional and Auto-Regressive Transformers
BART was designed with exactly this philosophy. During pre-training, it takes clean text, deliberately corrupts it in several different ways — deleting words, scrambling sentences, rotating the text — and then trains the model to recover the original. By learning to repair every kind of damage, it becomes extraordinarily good at any task that requires understanding and regenerating text — which is most of NLP.
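BART was introduced by Facebook AI (Lewis et al., 2019). It is also an encoder-decoder Transformer, but its pre-training scheme differs fundamentally from T5's. While T5 uses only span corruption, the BART paper evaluates five different noise functions (token masking, token deletion, text infilling, sentence permutation, and document rotation), and the released model is trained with text infilling combined with sentence permutation. This makes it a more general denoising model.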
BART's Five Noise Functions (Pre-training)
The key difference: T5 only reconstructs the corrupted spans (short output). BART reconstructs the entire document (full output). BART's approach makes it better at open-ended generation tasks like abstractive summarisation and dialogue.
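The five noise functions studied in the BART paper (token masking, token deletion, text infilling, sentence permutation, document rotation) can each be sketched in a line or two of plain Python. This is a simplified illustration, not the original training code: the function names and the 15% rate are assumptions for the demo, and the infilling span is fixed by hand, whereas the paper samples span lengths at random.

```python
import random

MASK = "<mask>"


def token_masking(tokens, rng, p=0.15):
    """Replace random tokens with a mask; length is unchanged."""
    return [MASK if rng.random() < p else t for t in tokens]


def token_deletion(tokens, rng, p=0.15):
    """Drop random tokens; the model must work out where they were."""
    return [t for t in tokens if rng.random() >= p]


def text_infilling(tokens, start, length):
    """Replace a whole span with ONE mask, hiding how many tokens are missing."""
    return tokens[:start] + [MASK] + tokens[start + length:]


def sentence_permutation(sentences, rng):
    """Shuffle sentence order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def document_rotation(tokens, start):
    """Rotate the document so it begins at a chosen token."""
    return tokens[start:] + tokens[:start]


rng = random.Random(0)
tokens = "the quick brown fox jumps over the lazy dog".split()
print(token_masking(tokens, rng))
print(token_deletion(tokens, rng))
print(text_infilling(tokens, 1, 3))
print(sentence_permutation(["First.", "Second.", "Third."], rng))
print(document_rotation(tokens, 4))
```

In every case the training target is the original, uncorrupted text, which is what distinguishes BART's full reconstruction from T5's span-only targets.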
BART in Python — Complete Code Examples
Example 1: Abstractive Summarisation with BART
from transformers import BartForConditionalGeneration, BartTokenizer
# facebook/bart-large-cnn is fine-tuned specifically for summarisation
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)
article = """
NASA's James Webb Space Telescope has captured the most detailed image of
the Pillars of Creation ever recorded. The new image reveals thousands of
previously unseen stars in the iconic star-forming region located 6,500
light-years from Earth. Scientists say the observations will help rewrite
our understanding of how stars and planetary systems form. The telescope's
infrared cameras can pierce through the dense dust clouds that blocked
previous observations, revealing the inner workings of stellar nurseries.
"""
# BART does NOT use task prefixes — just pass the raw text
inputs = tokenizer.encode(
    article,
    return_tensors="pt",
    max_length=1024,
    truncation=True
)
# Generate with beam search and length controls
summary_ids = model.generate(
    inputs,
    max_length=130,
    min_length=30,
    num_beams=4,
    length_penalty=2.0,
    early_stopping=True,
    no_repeat_ngram_size=3,
    forced_bos_token_id=tokenizer.bos_token_id
)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(f"Summary:\n{summary}")
Unlike T5, BART was not pre-trained with text instruction prefixes.
Do not add "summarize:" or similar prefixes to BART inputs —
it will confuse the model. The task is implicit in the fine-tuned checkpoint you load.
Use facebook/bart-large-cnn for summarisation, facebook/bart-large-mnli
for zero-shot classification, etc.
Example 2: Zero-Shot Text Classification with BART
from transformers import pipeline
# BART-large-mnli is fine-tuned on Natural Language Inference
# It can classify text into ANY labels — no fine-tuning needed!
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli"
)
text = "The new electric vehicle can travel 400 miles on a single charge."
# Define candidate labels (these can be anything)
candidate_labels = [
    "technology", "environment", "transportation",
    "finance", "health", "politics"
]
result = classifier(text, candidate_labels, multi_label=False)
for label, score in zip(result["labels"], result["scores"]):
    bar = "█" * int(score * 30)
    print(f"{label:15s} {bar} {score:.3f}")
BART-large-mnli was fine-tuned to decide whether a premise entails a hypothesis. The trick: the input text serves as the premise, and each candidate label is slotted into a hypothesis template (the pipeline's default is "This example is {}."; pass hypothesis_template to change it). The label whose hypothesis receives the highest entailment probability wins. No extra training is needed, and it works on any labels you invent at runtime. This is one of BART's most practical real-world applications.
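The entailment trick can be sketched without loading a model. In the plain-Python illustration below, the hypothesis template is the pipeline's default, but the helper names are hypothetical and the entailment logits are made-up numbers standing in for what the NLI model would return. The softmax across labels mirrors the single-label case (multi_label=False).

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(labels, entailment_logits):
    """Normalise entailment logits across labels and sort best-first."""
    probs = softmax(entailment_logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


labels = ["technology", "transportation", "finance"]
hypotheses = build_hypotheses(labels)
# HYPOTHETICAL entailment logits, one per (premise, hypothesis) pair.
made_up_logits = [1.2, 2.5, -1.0]
ranked = rank_labels(labels, made_up_logits)
for hypothesis in hypotheses:
    print(hypothesis)
for label, prob in ranked:
    print(f"{label:15s} {prob:.3f}")
```

Because each label is just a sentence handed to the NLI model, changing the label set costs nothing beyond one extra forward pass per label.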
Example 3: Fine-tuning BART on Dialogue Summarisation
from transformers import BartForConditionalGeneration, BartTokenizer
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
from transformers import DataCollatorForSeq2Seq
from datasets import load_dataset
import numpy as np
# SAMSum is a dataset of messenger-style dialogues + summaries
dataset = load_dataset("samsum")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
def preprocess_bart(examples):
    # BART: no prefix, just the raw dialogue
    inputs = tokenizer(
        examples["dialogue"],
        max_length=1024, truncation=True, padding="max_length"
    )
    # text_target replaces the deprecated as_target_tokenizer() context manager
    labels = tokenizer(
        text_target=examples["summary"],
        max_length=128, truncation=True, padding="max_length"
    )
    # Mask padding in labels
    inputs["labels"] = [
        [(-100 if t == tokenizer.pad_token_id else t) for t in seq]
        for seq in labels["input_ids"]
    ]
    return inputs
tokenised = dataset.map(
    preprocess_bart, batched=True,
    remove_columns=["dialogue", "summary", "id"]
)
training_args = Seq2SeqTrainingArguments(
    output_dir="./bart-samsum",
    num_train_epochs=4,
    per_device_train_batch_size=4,     # BART-base is larger than T5-small
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,     # effective batch = 8
    learning_rate=3e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    predict_with_generate=True,
    fp16=True,                         # requires a CUDA GPU
    save_strategy="epoch",
    evaluation_strategy="epoch",       # spelled eval_strategy in newer transformers releases
    load_best_model_at_end=True,
    # The best checkpoint is picked by eval loss. Setting
    # metric_for_best_model="rouge1" would also need a compute_metrics
    # function that returns a "rouge1" key, which this example does not define.
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["validation"],
    tokenizer=tokenizer,               # spelled processing_class in newer transformers releases
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
model.save_pretrained("./bart-samsum-final")
tokenizer.save_pretrained("./bart-samsum-final")
T5 vs BART — Full Comparison
Both models are powerful and share the encoder-decoder Transformer backbone, but they make different design choices that make each one better suited for certain tasks.
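BART uses the same byte-level BPE vocabulary as GPT-2, which suits English open-domain text. T5's SentencePiece vocabulary was built to cover German, French, and Romanian as well as English, and the same approach extends to many more languages in the multilingual mT5.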
| Property | T5 | BART |
|---|---|---|
| Architecture | Encoder-Decoder | Encoder-Decoder |
| Pre-training Objective | Span Corruption only | Multiple noise functions (text infilling best) |
| Input Paradigm | Prefix-based: "summarize: ..." | No prefix — checkpoint-specific |
| Position Embeddings | Relative (better for long docs) | Absolute (max 1024 tokens) |
| Vocabulary | 32,000 (SentencePiece) | 50,265 (BPE, GPT-2 style) |
| Zero-shot Classification | Not designed for it | Excellent (MNLI fine-tune) |
| Multi-task from one checkpoint | Yes — change the prefix | Requires task-specific checkpoints |
| Abstractive Summarisation | Very strong | State-of-the-art on CNN/DM at publication |
| Machine Translation | Excellent (mT5 is the multilingual variant) | Good for high-resource pairs |
| Dialogue Generation | Good | Excellent (open-domain) |
| Best Recommended Use | Multi-task fine-tuning, translation, QA | Summarisation, generation, zero-shot NLI |
Evaluating Summarisation — The ROUGE Metric
from evaluate import load
import numpy as np
rouge = load("rouge")
predictions = [
    "Markets dropped sharply amid investor fears of rising inflation.",
    "The rover discovered ancient lake bed deposits on Mars."
]
references = [
    "Stock markets fell sharply as investors worried about inflation.",
    "NASA's Mars rover found evidence of an ancient lake on the red planet."
]
results = rouge.compute(
    predictions=predictions,
    references=references,
    use_stemmer=True  # reduces/reducing/reduced → reduc → fairer match
)
for metric, score in results.items():
    print(f"{metric:10s}: {score:.4f}")
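To make the metric less of a black box, here is a hand-rolled ROUGE-1 in plain Python. It is a simplified sketch, assuming whitespace tokenisation and no stemming, so its numbers will not match the library's exactly; `rouge1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1(prediction: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 between two strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
for name, value in scores.items():
    print(f"{name:10s}: {value:.4f}")
```

Five of the six words overlap in this toy pair, so precision, recall and F1 all come out at 5/6. ROUGE-2 applies the same counting to bigrams, and ROUGE-L uses the longest common subsequence instead.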
Generation Strategies — How Both Models Produce Text
Both T5 and BART are autoregressive at decode time — they generate one token at a time. The strategy used to select each next token dramatically affects output quality, creativity, and speed. Understanding these strategies is essential for tuning your models in production.
from transformers import BartForConditionalGeneration, BartTokenizer
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
text = "Quantum computing uses quantum mechanical phenomena like superposition..."
inputs = tokenizer.encode(text, return_tensors="pt", max_length=512, truncation=True)
# ── Strategy 1: Greedy ─────────────────────────
greedy = model.generate(inputs, max_length=60)
# ── Strategy 2: Beam Search ────────────────────
beam = model.generate(
    inputs, max_length=60, num_beams=5,
    early_stopping=True, no_repeat_ngram_size=3
)
# ── Strategy 3: Sampling (creative / diverse) ──
sampled = model.generate(
    inputs, max_length=60,
    do_sample=True,
    top_p=0.92,              # nucleus sampling
    top_k=50,                # limit to top-50 candidates
    temperature=0.7,         # sharpen distribution slightly
    num_return_sequences=3   # generate 3 alternatives
)
print("GREEDY:", tokenizer.decode(greedy[0], skip_special_tokens=True))
print("BEAM:  ", tokenizer.decode(beam[0], skip_special_tokens=True))
for i, s in enumerate(sampled):
    print(f"SAMPLE {i+1}:", tokenizer.decode(s, skip_special_tokens=True))
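What temperature, top_k and top_p do to a single next-token distribution can be sketched in plain Python. This is a simplified illustration of the idea, not the library's implementation; the function names, the four-word vocabulary and the logits are all made up for the demo.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Keep the top_k likeliest tokens, then the smallest prefix of them whose
    renormalised cumulative probability reaches top_p. Returns {token_id: prob}."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


def greedy_pick(probs):
    """Greedy decoding: always take the single likeliest token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def sample_pick(probs, rng, top_k=0, top_p=1.0):
    """Sampling: draw from the filtered, renormalised distribution."""
    filtered = top_k_top_p_filter(probs, top_k, top_p)
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]


vocab = ["the", "a", "quantum", "banana"]
logits = [2.0, 1.5, 1.0, -1.0]  # HYPOTHETICAL next-token scores
probs = softmax(logits, temperature=0.7)
rng = random.Random(0)
print("greedy :", vocab[greedy_pick(probs)])
print("sampled:", [vocab[sample_pick(probs, rng, top_k=3, top_p=0.92)] for _ in range(5)])
```

Beam search is not shown here; it differs in kind, since it tracks several partial sequences at once instead of changing how one token is chosen.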
Which Model Should You Choose?
Golden Rules — T5 & BART in Production
1. Set labels padding to -100, not pad_token_id. PyTorch's cross-entropy loss ignores index -100 by default. If you forget this, the loss also counts padding positions, so the model wastes capacity learning to reproduce padding tokens; training will be slower and accuracy lower.
2. Use predict_with_generate=True during evaluation. Without it, the Trainer evaluates using teacher forcing (it feeds ground-truth previous tokens), which can artificially inflate metrics by 15–30%. Real inference is autoregressive, so errors compound. Always evaluate the way you will serve.
3. Use beam search (num_beams=4) for quality tasks. Greedy decoding is fine for speed tests but produces noticeably worse summaries. For production summarisation, num_beams=4 with no_repeat_ngram_size=3 is the standard starting configuration.
4. Use gradient accumulation when GPU memory is tight: gradient_accumulation_steps=4 with batch size 2 gives an effective batch of 8 on an 8 GB GPU. Pair with fp16=True for additional memory savings.
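The effective-batch arithmetic behind gradient accumulation can be checked with a toy model. The sketch below uses a one-weight linear model and made-up data in plain Python; `grad_mse` is a hypothetical helper. It assumes equal-sized micro-batches and a loss that is averaged per batch.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 toy examples of y = 3x
w = 0.0

# One big step over all 8 examples
full_batch_grad = grad_mse(w, data)

# Four micro-batches of 2, each scaled by 1/accumulation_steps and summed
micro_batch_size, accumulation_steps = 2, 4
accumulated = 0.0
for step in range(accumulation_steps):
    micro = data[step * micro_batch_size:(step + 1) * micro_batch_size]
    accumulated += grad_mse(w, micro) / accumulation_steps

effective_batch = micro_batch_size * accumulation_steps
print("effective batch :", effective_batch)
print("full-batch grad :", full_batch_grad)
print("accumulated grad:", accumulated)
```

The two gradients agree, which is why accumulation trades extra forward and backward passes for lower peak memory without changing the update.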
T5 and BART were foundational models that influenced the design of later language models. Google's Flan-T5 is an instruction-tuned T5, and Meta's later dialogue models build on the same encoder-decoder ideas. The practice of steering one model with plain-text instructions also carries over to systems like ChatGPT, even though those use decoder-only architectures. Understanding T5 and BART gives you a solid grounding in modern AI text generation, from instruction following to the summarisation APIs powering news apps worldwide.