
T5 & BART

A comprehensive, story-driven tutorial on T5 and BART — the two landmark text-to-text Transformer models. Covers architecture, pre-training objectives, fine-tuning recipes, generation strategies, and production golden rules, with full Python code examples and visual diagrams.

Section 01

The Story That Explains Text-to-Text AI

The Universal Translator at the United Nations
Imagine a translator at the United Nations who is fluent in every language on Earth. Someone walks up and says: "Summarise this 50-page report into two sentences." The translator reads it and writes two crisp sentences. The next delegate asks: "Translate this speech into French." Same translator. Same brain. Then a scientist asks: "Answer this chemistry question." Still the same translator.

The remarkable thing? The translator never learned a separate skill for each task. They learned one meta-skill: reading text and producing the right text back. Every task — summarising, translating, answering, classifying — was reframed as "given this input text, what output text should I write?"

That is the entire philosophy of T5 (Text-To-Text Transfer Transformer) and BART (Bidirectional and Auto-Regressive Transformers).

Before T5 and BART, the NLP world was fragmented. You had one model for sentiment analysis, a different architecture for translation, a separate system for summarisation, and yet another for question answering. Each task required its own output head, its own loss function, its own fine-tuning recipe. Then, in 2019–2020, two landmark papers changed everything.

🌐
The Unifying Insight

Both T5 and BART treat every NLP task as a text generation problem. Classification becomes: "Input: 'The movie was terrible' → Output: 'negative'". Summarisation becomes: "Input: [long article] → Output: [short summary]". The model always reads text in and writes text out — hence text-to-text. This single unification unlocked massive improvements across the board.


Section 02

The Foundation — Encoder-Decoder Architecture

Both T5 and BART are built on the encoder-decoder Transformer, the same architecture introduced in the original "Attention Is All You Need" paper (2017). Understanding this backbone is non-negotiable before studying either model.

🏗️ Encoder-Decoder Architecture — How Data Flows
INPUT
Raw text fed as tokens: "Summarise: The stock market fell sharply today..."
ENCODER
Reads the entire input simultaneously using bidirectional self-attention. Produces a rich contextual representation of the full input: one vector per input token, not a single summary vector.
CROSS-ATTN
The decoder attends to the encoder's output at every generation step. This is how the output "knows" what the input said.
DECODER
Generates output tokens one at a time, left to right, using causal (masked) self-attention so it can only see previously generated tokens.
OUTPUT
Final sequence: "Markets dropped sharply amid investor fears."
🔷 Encoder–Decoder Architecture Diagram
ENCODER (× N layers): Input Text Tokens → Self-Attention (Bidirectional) → Feed-Forward Network → Context Vectors
DECODER (× N layers): Masked Self-Attention → Cross-Attention (← Encoder) → Feed-Forward Network, followed by Linear + Softmax → Output Tokens (one at a time)
TASKS: Summarisation, Translation, Q&A, Classification, Paraphrasing
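The three attention patterns in the flow above can be sketched as visibility masks. This is a minimal illustration in plain Python, not how any library stores its masks; the function names are hypothetical, and a 1 simply means "this row position may attend to this column position".

```python
def bidirectional_mask(n):
    """Encoder self-attention: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder self-attention: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def cross_attention_mask(tgt_len, src_len):
    """Cross-attention: each decoder position may attend to all encoder positions."""
    return [[1] * src_len for _ in range(tgt_len)]


encoder_mask = bidirectional_mask(4)
decoder_mask = causal_mask(4)
cross_mask = cross_attention_mask(3, 4)

for name, mask in [("encoder", encoder_mask), ("decoder", decoder_mask), ("cross", cross_mask)]:
    print(name)
    for row in mask:
        print(" ", row)
```

The lower-triangular decoder mask is what forces left-to-right generation, while the all-ones cross-attention mask is what lets every output step consult the whole input.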

Both T5 and BART use this encoder-decoder blueprint. The key difference lies in how they were pre-trained.


Section 03

T5 — Text-To-Text Transfer Transformer

The Chef Who Cooks Everything With One Pan
A great chef notices that despite cooking thousands of different dishes — risottos, soufflés, stir-fries, sauces — they always use the same fundamental pan on the same stove. The technique changes, the ingredients change, but the process is always: heat the pan, add ingredients, control temperature, plate the result.

T5's creators at Google thought: what if NLP tasks are the same? They are all really "transform input text into output text." So they designed a single pan — one model, one training objective, one loss function — that can cook any NLP dish. They just prepend a text instruction telling it which dish to make.

T5 was introduced by Google in 2019 (Raffel et al., "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"). Its defining feature is radical simplicity through unification: every task is cast as sequence-to-sequence generation by adding a task prefix to the input.

The Task Prefix System

Task | T5 Input Format | T5 Output
Summarisation | summarize: [article text] | Short summary
Translation (EN→DE) | translate English to German: [text] | German text
Sentiment Classification (SST-2) | sst2 sentence: [text] | positive or negative
Question Answering | question: [q] context: [passage] | Answer string
Linguistic Acceptability (CoLA) | cola sentence: [text] | acceptable or unacceptable
NLI (Entailment) | mnli hypothesis: [h] premise: [p] | entailment / contradiction / neutral
🔑
Why Prefixes Are Powerful

The prefix is not a hack: it is part of the training signal. T5's unsupervised pre-training ran on the C4 dataset (Colossal Clean Crawled Corpus, about 750 GB of clean web text), and the released checkpoints were additionally trained on a mixture of supervised tasks, each marked by its prefix. That is where the model learned that different prefixes correspond to different transformation patterns. At inference time, the prefix conditions the model to generate the appropriate type of output. You can also define your own prefixes for custom tasks during fine-tuning.
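To make the prefix system concrete, here is a small sketch that builds T5 input strings from the formats listed in the table above. The dictionary keys and the helper name `to_t5_input` are hypothetical conveniences for this example, not part of any library.

```python
# Prefix templates taken from the task table; the short keys are made up for this sketch.
T5_PREFIXES = {
    "summarize": "summarize: {text}",
    "translate_en_de": "translate English to German: {text}",
    "sst2": "sst2 sentence: {text}",
    "cola": "cola sentence: {text}",
    "qa": "question: {question} context: {context}",
    "mnli": "mnli hypothesis: {hypothesis} premise: {premise}",
}


def to_t5_input(task, **fields):
    """Render one example as the plain string the T5 encoder would receive."""
    if task not in T5_PREFIXES:
        raise KeyError(f"unknown task {task!r}; a custom prefix needs fine-tuning first")
    return T5_PREFIXES[task].format(**fields)


print(to_t5_input("sst2", text="The movie was terrible"))
print(to_t5_input("qa", question="Who wrote Hamlet?", context="Hamlet is a play by Shakespeare."))
```

Because the task lives entirely in the string, switching tasks means changing the template, not the model or the loss.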

How T5 Was Pre-trained — Span Corruption

T5 uses a pre-training objective called Span Corruption (also called masked span prediction). This is different from BERT's masked language modelling and GPT's next-token prediction.

🔴 T5 Span Corruption — Pre-training Objective
ORIGINAL TEXT: The quick brown fox jumps over the lazy dog near the river bank
ENCODER INPUT (corrupted): The quick <X> jumps over <Y> near <Z> bank
DECODER TARGET (fill in spans): <X> brown fox <Y> the lazy dog <Z> the river <EOS>

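Random contiguous spans (about 15% of tokens, with an average span length of 3) are replaced with sentinel tokens <X>, <Y>, <Z> (named <extra_id_0>, <extra_id_1>, and so on in the Hugging Face tokenizer). The decoder must reconstruct only the missing spans, not the whole sentence. Because the target is much shorter than the input, each training step is cheaper than with objectives that regenerate the full text.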
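The corruption step can be sketched in a few lines. This is a simplified illustration, not T5's actual preprocessing code: real pre-training samples the spans randomly and works on subword ids, whereas here the spans are passed in by hand and tokens are whitespace-split words so the result is reproducible. The function name `span_corrupt` is hypothetical.

```python
def span_corrupt(tokens, spans):
    """Replace each half-open (start, end) span with a sentinel token.

    Returns (encoder_input, decoder_target). Spans must be sorted and
    non-overlapping; they are supplied explicitly in this sketch.
    """
    encoder_input, decoder_target = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        encoder_input.extend(tokens[cursor:start])  # keep text before the span
        encoder_input.append(sentinel)              # the span collapses to one sentinel
        decoder_target.append(sentinel)             # target announces which span follows
        decoder_target.extend(tokens[start:end])    # then spells out the dropped tokens
        cursor = end
    encoder_input.extend(tokens[cursor:])
    decoder_target.append("</s>")
    return encoder_input, decoder_target


tokens = "The quick brown fox jumps over the lazy dog near the river bank".split()
enc_in, dec_tgt = span_corrupt(tokens, [(2, 4), (6, 9), (10, 12)])
print(" ".join(enc_in))
print(" ".join(dec_tgt))
```

With these hand-picked spans the output mirrors the diagram, with `<extra_id_0>`, `<extra_id_1>`, `<extra_id_2>` standing in for `<X>`, `<Y>`, `<Z>`.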

T5 Model Variants

🔹
T5-Small
60M parameters
6-layer encoder, 6-layer decoder, d_model=512. Fast inference, fits on CPU. Good for prototyping and edge deployment.
🔷
T5-Base
220M parameters
12-layer encoder, 12-layer decoder, d_model=768. The sweet spot for most fine-tuning tasks. Comparable to BERT-Base in encoder size.
🏔️
T5-Large / XL / XXL
770M → 11B parameters
Production-grade performance. T5-11B set state-of-the-art on many benchmarks in 2019. Requires multi-GPU setup for fine-tuning.

Section 04

T5 in Python — Complete Code Examples

Example 1: Summarisation with T5

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load pretrained T5-small (downloads ~240 MB on first run)
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Article to summarise — note the task prefix!
article = """summarize: The Amazon rainforest, often referred to as the \
"lungs of the Earth", produces 20% of the world's oxygen and is home to \
10% of all known species. Scientists warn that deforestation rates have \
increased by 22% in the past year, threatening biodiversity and \
accelerating climate change. Governments are urged to implement \
stronger protection measures immediately."""

# Tokenise — T5 uses max_length for encoder truncation
inputs = tokenizer.encode(
    article,
    return_tensors="pt",
    max_length=512,
    truncation=True
)

# Generate summary with beam search
summary_ids = model.generate(
    inputs,
    max_length=60,
    min_length=20,
    num_beams=4,           # beam search for quality
    length_penalty=2.0,    # penalise very short outputs
    early_stopping=True,
    no_repeat_ngram_size=3  # avoid repetitive phrases
)

summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(f"Summary: {summary}")
OUTPUT
Summary: deforestation rates have increased by 22% in the past year, threatening biodiversity and accelerating climate change. scientists warn that stronger protection measures are needed immediately.

Example 2: Fine-tuning T5 on Custom Summarisation Data

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer, Seq2SeqTrainer
from transformers import Seq2SeqTrainingArguments, DataCollatorForSeq2Seq
from datasets import load_dataset

# Load model and tokeniser
model_name = "t5-small"
tokenizer  = T5Tokenizer.from_pretrained(model_name)
model      = T5ForConditionalGeneration.from_pretrained(model_name)

# Load CNN/DailyMail dataset (a standard summarisation benchmark)
dataset = load_dataset("cnn_dailymail", "3.0.0")

def preprocess(examples):
    # Add the task prefix — critical for T5!
    inputs  = ["summarize: " + doc for doc in examples["article"]]
    targets = examples["highlights"]

    model_inputs = tokenizer(
        inputs, max_length=512, truncation=True, padding="max_length"
    )
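    # text_target tokenises the targets; it replaces the deprecated
    # as_target_tokenizer() context manager in recent transformers releases
    labels = tokenizer(
        text_target=targets, max_length=128, truncation=True, padding="max_length"
    )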
    # Replace padding token id with -100 so loss ignores it
    labels["input_ids"] = [
        [(-100 if t == tokenizer.pad_token_id else t) for t in label]
        for label in labels["input_ids"]
    ]
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Tokenise all splits (this processes the full dataset and can take a while)
tokenised = dataset.map(preprocess, batched=True,
                         remove_columns=dataset["train"].column_names)

# Training configuration
args = Seq2SeqTrainingArguments(
    output_dir="./t5-finetuned-cnn",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    predict_with_generate=True,   # Use generation for evaluation
    fp16=torch.cuda.is_available(),  # Mixed precision if GPU available
    save_strategy="epoch",
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
    load_best_model_at_end=True,
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
model.save_pretrained("./t5-finetuned-cnn")
Critical T5 Fine-tuning Rules

1. Always add the task prefix, and use the same prefix at training and inference time. A missing or mismatched prefix moves the input away from what the checkpoint saw during training.
2. Set labels padding to -100. The cross-entropy loss ignores these positions, so you don't penalise padding.
3. Use predict_with_generate=True. T5 generates token-by-token at eval time; without this flag, the Trainer scores teacher-forced predictions, which look better than real generation.
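Rule 2 is easier to trust once you see the arithmetic. The sketch below re-implements a tiny cross-entropy in plain Python that skips label -100, mirroring the default `ignore_index` behaviour of PyTorch's loss. The helper names, the 3-token vocabulary, and the logits are all made up for illustration.

```python
import math

IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross-entropy


def mask_padding(label_ids, pad_token_id):
    """Swap pad ids for IGNORE_INDEX so the loss skips those positions."""
    return [IGNORE_INDEX if t == pad_token_id else t for t in label_ids]


def cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX."""
    total, counted = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        log_norm = math.log(sum(math.exp(x) for x in row))
        total += log_norm - row[label]
        counted += 1
    return total / counted if counted else 0.0


pad_id = 0
label_ids = [2, 1, 0, 0]  # two real tokens followed by two pad tokens
logits = [                # hypothetical decoder outputs over a 3-token vocabulary
    [0.0, 0.0, 5.0],
    [0.0, 4.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]

masked = mask_padding(label_ids, pad_id)
loss_masked = cross_entropy(logits, masked)
loss_unmasked = cross_entropy(logits, label_ids)
print(masked)
print(f"masked loss:   {loss_masked:.4f}")
print(f"unmasked loss: {loss_unmasked:.4f}")
```

With masking, the loss depends only on the two real tokens; without it, the two padding positions contribute to the average and pull gradients toward predicting pad tokens.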


Section 05

BART — Bidirectional and Auto-Regressive Transformers

The Restorer Who First Destroys, Then Rebuilds
There is a famous technique in art restoration: before you can truly understand a painting, you study what happens when it breaks. Restorers sometimes deliberately introduce controlled damage — remove patches, rotate sections, blur areas — then train themselves to reconstruct the original perfectly.

BART was designed with exactly this philosophy. During pre-training, it takes clean text, deliberately corrupts it in several different ways — deleting words, scrambling sentences, rotating the text — and then trains the model to recover the original. By learning to repair every kind of damage, it becomes extraordinarily good at any task that requires understanding and regenerating text — which is most of NLP.

BART was introduced by Facebook AI (Lewis et al., 2019). It is also an encoder-decoder Transformer, but its pre-training scheme is fundamentally different from T5. While T5 uses only span corruption, the BART paper evaluates five different noise functions, and the released model combines two of them (text infilling and sentence permutation), making it a more generalist denoising model.

BART's Five Noise Functions (Pre-training)

Token Deletion
Random tokens removed
Tokens are deleted entirely (not masked). The model must infer where text is missing — harder than BERT's [MASK] which marks the position.
🔀
Token Masking
Like BERT's MLM
Random tokens replaced with [MASK]. Same as BERT but applied to the encoder of a seq2seq model. The decoder reconstructs the full original.
📦
Text Infilling
Spans → single mask
Spans of text (lengths drawn from a Poisson distribution with λ = 3, so zero-length spans are possible) are each replaced by a single [MASK] token. The model must infer how many tokens are missing.
🔄
Sentence Permutation
Sentences shuffled
The document's sentences are randomly shuffled. The decoder learns document structure — ideal for long-form generation and coherent multi-sentence output.
🔃
Document Rotation
Rotated start point
The document is rotated so that a random token becomes the first token. The decoder must identify the original start — teaches beginning-of-document understanding.
Best Objective Found
Text Infilling wins
Lewis et al. found that text infilling was the most consistently strong single objective in their ablations. The released BART models combine text infilling (masking 30% of tokens) with sentence permutation.
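The five noise functions are simple enough to sketch directly. This is an illustrative toy, not BART's training code: positions and spans are passed in by hand instead of being sampled, tokens are whitespace-split words, and all function names are hypothetical.

```python
import random

MASK = "<mask>"


def token_masking(tokens, positions):
    """Replace the tokens at the given positions with a mask token."""
    return [MASK if i in positions else t for i, t in enumerate(tokens)]


def token_deletion(tokens, positions):
    """Drop the tokens at the given positions, leaving no marker behind."""
    return [t for i, t in enumerate(tokens) if i not in positions]


def text_infilling(tokens, spans):
    """Replace each half-open (start, end) span with ONE mask; start == end inserts a mask."""
    out, cursor = [], 0
    for start, end in spans:
        out.extend(tokens[cursor:start])
        out.append(MASK)
        cursor = end
    out.extend(tokens[cursor:])
    return out


def sentence_permutation(sentences, rng):
    """Return the sentences in a shuffled order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def document_rotation(tokens, start):
    """Rotate the document so that tokens[start] becomes the first token."""
    return tokens[start:] + tokens[:start]


tokens = "The quick brown fox jumps over the lazy dog".split()
print(" ".join(token_masking(tokens, {1, 4})))
print(" ".join(token_deletion(tokens, {1, 4})))
print(" ".join(text_infilling(tokens, [(1, 3), (4, 6)])))
print(" ".join(document_rotation(tokens, 3)))
print(sentence_permutation(["First.", "Second.", "Third."], random.Random(0)))
```

In every case the decoder target is the clean original text; only the encoder input changes.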
⚖️ T5 vs BART — Pre-training Objective Comparison
T5 (SPAN CORRUPTION)
Input to encoder: The <X> fox jumps <Y> lazy dog
Decoder target: <X> quick brown <Y> over the <EOS>
Reconstruct only masked spans. Efficient: output is short. Sentinel tokens <X>, <Y> used.
BART (TEXT INFILLING)
Input to encoder (corrupted): The [MASK] fox [MASK] [MASK] dog
Decoder target: The quick brown fox jumps over the lazy dog
Reconstruct the full original sentence. Thorough: learns full document context. Standard [MASK] token used.

The key difference: T5 reconstructs only the corrupted spans (short output), while BART reconstructs the entire document (full output). Because BART's decoder practises producing complete, fluent text during pre-training, it tends to do well on open-ended generation tasks such as abstractive summarisation and dialogue.
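A rough back-of-the-envelope calculation shows how different the two decoder targets are in length. The sketch below assumes T5's settings of 15% corruption and a mean span length of 3, and counts one sentinel per span plus an end-of-sequence token; the function names are hypothetical and the result is an estimate, not an exact figure from either paper.

```python
def t5_target_length(n_tokens, corruption_rate=0.15, mean_span_length=3):
    """Estimated decoder target length under span corruption."""
    n_corrupted = round(n_tokens * corruption_rate)
    n_spans = max(1, round(n_corrupted / mean_span_length))
    return n_corrupted + n_spans + 1  # dropped tokens + one sentinel per span + EOS


def bart_target_length(n_tokens):
    """BART regenerates the whole original sequence, plus EOS."""
    return n_tokens + 1


for n in (128, 512):
    print(n, t5_target_length(n), bart_target_length(n))
```

For a 512-token input this gives a T5 target of about 104 tokens against 513 for BART, which is where T5's lower decoding cost per pre-training step comes from.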


Section 06

BART in Python — Complete Code Examples

Example 1: Abstractive Summarisation with BART

from transformers import BartForConditionalGeneration, BartTokenizer

# facebook/bart-large-cnn is fine-tuned specifically for summarisation
model_name = "facebook/bart-large-cnn"
tokenizer  = BartTokenizer.from_pretrained(model_name)
model      = BartForConditionalGeneration.from_pretrained(model_name)

article = """
NASA's James Webb Space Telescope has captured the most detailed image of
the Pillars of Creation ever recorded. The new image reveals thousands of
previously unseen stars in the iconic star-forming region located 6,500
light-years from Earth. Scientists say the observations will help rewrite
our understanding of how stars and planetary systems form. The telescope's
infrared cameras can pierce through the dense dust clouds that blocked
previous observations, revealing the inner workings of stellar nurseries.
"""

# BART does NOT use task prefixes — just pass the raw text
inputs = tokenizer.encode(
    article,
    return_tensors="pt",
    max_length=1024,
    truncation=True
)

# Generate with beam search and length controls
summary_ids = model.generate(
    inputs,
    max_length=130,
    min_length=30,
    num_beams=4,
    length_penalty=2.0,
    early_stopping=True,
    no_repeat_ngram_size=3,
    forced_bos_token_id=tokenizer.bos_token_id
)

summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(f"Summary:\n{summary}")
OUTPUT
Summary: NASA's James Webb Space Telescope has captured the most detailed image of the Pillars of Creation ever recorded. The new image reveals thousands of previously unseen stars in the star-forming region 6,500 light-years from Earth. Scientists say the observations will help rewrite our understanding of how stars and planetary systems form.
⚠️
BART Does NOT Use Task Prefixes

Unlike T5, BART was not pre-trained with text instruction prefixes. Do not add "summarize:" or similar prefixes to BART inputs — it will confuse the model. The task is implicit in the fine-tuned checkpoint you load. Use facebook/bart-large-cnn for summarisation, facebook/bart-large-mnli for zero-shot classification, etc.

Example 2: Zero-Shot Text Classification with BART

from transformers import pipeline

# BART-large-mnli is fine-tuned on Natural Language Inference
# It can classify text into ANY labels — no fine-tuning needed!
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli"
)

text = "The new electric vehicle can travel 400 miles on a single charge."

# Define candidate labels — these can be anything
candidate_labels = [
    "technology", "environment", "transportation",
    "finance", "health", "politics"
]

result = classifier(text, candidate_labels, multi_label=False)

for label, score in zip(result["labels"], result["scores"]):
    bar = "█" * int(score * 30)
    print(f"{label:15s} {bar} {score:.3f}")
OUTPUT
transportation  ██████████████████████████ 0.871
technology      ██ 0.082
environment       0.031
finance           0.009
health            0.005
politics          0.002
🎯
Zero-Shot Classification — How It Works

BART-large-mnli was fine-tuned to determine if a premise entails a hypothesis. The trick: for each label, the pipeline builds a hypothesis from a template (by default "This example is [label].") and checks whether the input text entails it. The label with the highest entailment probability wins. No extra training needed: it works on any labels you invent at runtime. This is one of BART's most practical real-world applications.
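The ranking step can be sketched without loading a model. In the example below the entailment logits are hypothetical numbers standing in for what an NLI model would return for each (text, hypothesis) pair; the helper names are made up, and the softmax over labels corresponds to single-label mode.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]


def rank_labels(labels, entailment_logits):
    """Single-label mode: normalise entailment logits across labels, best first."""
    probs = softmax(entailment_logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


labels = ["technology", "transportation", "finance"]
hypotheses = build_hypotheses(labels)
# Hypothetical entailment logits, one per hypothesis; a real run would read
# these from the NLI model's output for the input text.
logits = [1.2, 3.4, -2.0]
ranked = rank_labels(labels, logits)

for hypothesis in hypotheses:
    print(hypothesis)
for label, prob in ranked:
    print(f"{label:15s} {prob:.3f}")
```

Changing the template wording (for example to a topic-specific sentence) changes the hypotheses the model judges, so it is worth testing a couple of phrasings on your own data.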

Example 3: Fine-tuning BART on Dialogue Summarisation

from transformers import BartForConditionalGeneration, BartTokenizer
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
from transformers import DataCollatorForSeq2Seq
from datasets import load_dataset
import numpy as np

# SAMSum is a dataset of messenger-style dialogues + summaries
dataset   = load_dataset("samsum")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model     = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def preprocess_bart(examples):
    # BART: no prefix — just raw dialogue
    inputs = tokenizer(
        examples["dialogue"],
        max_length=1024, truncation=True, padding="max_length"
    )
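    # text_target tokenises the summaries; it replaces the deprecated
    # as_target_tokenizer() context manager in recent transformers releases
    labels = tokenizer(
        text_target=examples["summary"],
        max_length=128, truncation=True, padding="max_length"
    )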
    # Mask padding in labels
    label_ids = labels["input_ids"]
    label_ids = [
        [(-100 if t == tokenizer.pad_token_id else t) for t in seq]
        for seq in label_ids
    ]
    inputs["labels"] = label_ids
    return inputs

tokenised = dataset.map(
    preprocess_bart, batched=True,
    remove_columns=["dialogue", "summary", "id"]
)

training_args = Seq2SeqTrainingArguments(
    output_dir="./bart-samsum",
    num_train_epochs=4,
    per_device_train_batch_size=4,   # BART-base is larger than T5-small
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,   # effective batch = 8
    learning_rate=3e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    predict_with_generate=True,
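    fp16=True,                       # requires a CUDA GPU; set to False on CPU
    save_strategy="epoch",
    evaluation_strategy="epoch",     # renamed to eval_strategy in newer transformers releases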
    load_best_model_at_end=True,
    metric_for_best_model="rouge1",
)

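# metric_for_best_model="rouge1" only works if compute_metrics returns a "rouge1" key
import evaluate
rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    # -100 marks ignored positions; swap in the pad id so the ids can be decoded
    preds  = np.where(preds  != -100, preds,  tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds  = tokenizer.batch_decode(preds,  skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    compute_metrics=compute_metrics,
)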

trainer.train()
model.save_pretrained("./bart-samsum-final")
tokenizer.save_pretrained("./bart-samsum-final")

Section 07

T5 vs BART — Full Comparison

Both models are powerful and share the encoder-decoder Transformer backbone, but they make different design choices that make each one better suited for certain tasks.

🔷 Architecture Philosophy — T5 vs BART
T5: Relative Position Embeddings (no absolute positional encoding) · Task prefix system (text instructions) · Span corruption pre-training (C4 dataset) · SentencePiece tokenizer (32,000 vocab)
BART: Learned Absolute Position Embeddings (like GPT-2 and RoBERTa) · No prefix, task implicit in checkpoint · Text infilling pre-training (Books + CC) · BPE tokenizer (50,265 vocab, like GPT-2)

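BART uses the same byte-level BPE tokenizer style as GPT-2 and RoBERTa, which suits English open-domain text well. T5's SentencePiece vocabulary was built from English plus German, French, and Romanian text, so it covers those languages out of the box.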

Property | T5 | BART
Architecture | Encoder-Decoder | Encoder-Decoder
Pre-training Objective | Span Corruption only | Multiple noise functions (text infilling best)
Input Paradigm | Prefix-based: "summarize: ..." | No prefix, checkpoint-specific
Position Embeddings | Relative (better for long docs) | Absolute (max 1024 tokens)
Vocabulary | 32,000 (SentencePiece) | 50,265 (BPE, GPT-2 style)
Zero-shot Classification | Not designed for it | Excellent (MNLI fine-tune)
Multi-task from one checkpoint | Yes, change the prefix | Requires task-specific checkpoints
Abstractive Summarisation | Very strong | State-of-the-art (CNN/DM)
Machine Translation | Excellent (multilingual T5) | Good for high-resource pairs
Dialogue Generation | Good | Excellent (open-domain)
Best Recommended Use | Multi-task fine-tuning, translation, QA | Summarisation, generation, zero-shot NLI

Section 08

Evaluating Summarisation — The ROUGE Metric

The Essay Grader With a Red Pen
A teacher grades essays by underlining words and phrases that match the model answer. The more overlapping content between a student's answer and the reference, the higher the grade. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) does exactly this — it counts n-gram overlaps between generated and reference summaries. It is not perfect, but it is the industry standard for comparing summarisation systems.
ROUGE-1
Overlap of single words (unigrams)
Measures basic vocabulary coverage. Did the summary use the right words?
ROUGE-2
Overlap of word pairs (bigrams)
Measures phrase-level accuracy. Did adjacent words appear in the right order?
ROUGE-L
Longest Common Subsequence
Measures sentence-level structure. Rewards in-order sequences without requiring contiguity.
BERTScore
Semantic similarity via BERT embeddings
Modern alternative. Catches paraphrases that ROUGE misses. Better for abstractive summaries.
from evaluate import load
import numpy as np

rouge = load("rouge")

predictions = [
    "Markets dropped sharply amid investor fears of rising inflation.",
    "The rover discovered ancient lake bed deposits on Mars."
]

references = [
    "Stock markets fell sharply as investors worried about inflation.",
    "NASA's Mars rover found evidence of an ancient lake on the red planet."
]

results = rouge.compute(
    predictions=predictions,
    references=references,
    use_stemmer=True   # reduces/reducing/reduced → reduc → fairer match
)

for metric, score in results.items():
    print(f"{metric:10s}: {score:.4f}")
OUTPUT
rouge1    : 0.4831
rouge2    : 0.0476
rougeL    : 0.3961
rougeLsum : 0.3961
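To see what the library is counting, here is a from-scratch sketch of ROUGE-N and ROUGE-L as F1 scores. It is deliberately simplified: lowercase whitespace tokenisation, no stemming, and no aggregation across examples, so its numbers will not match the `evaluate` output exactly. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(hits, pred_total, ref_total):
    """Harmonic mean of precision (hits/pred_total) and recall (hits/ref_total)."""
    if hits == 0:
        return 0.0
    precision, recall = hits / pred_total, hits / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n=1):
    """F1 over clipped n-gram overlap."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    hits = sum((pred & ref).values())  # Counter & keeps the minimum count per n-gram
    return f1(hits, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(prediction, reference):
    """F1 over the longest common subsequence."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))


pred = "the cat sat on the mat"
ref = "the cat lay on the mat"
print(f"ROUGE-1: {rouge_n(pred, ref, 1):.4f}")
print(f"ROUGE-2: {rouge_n(pred, ref, 2):.4f}")
print(f"ROUGE-L: {rouge_l(pred, ref):.4f}")
```

Five of six unigrams match, three of five bigrams match, and the longest in-order subsequence has five words, which is why ROUGE-2 is the strictest of the three here.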

Section 09

Generation Strategies — How Both Models Produce Text

Both T5 and BART are autoregressive at decode time — they generate one token at a time. The strategy used to select each next token dramatically affects output quality, creativity, and speed. Understanding these strategies is essential for tuning your models in production.

🎯
Greedy Decoding
Always pick the top token
Fastest. At each step, pick the single most probable next token. Tends to produce repetitive, bland output. Rarely the best choice for summarisation.
🌳
Beam Search
num_beams=4 or 5
Keep the top-K partial sequences at each step. Far better quality than greedy. Standard for summarisation and translation. Slightly slower.
🎲
Sampling (Top-p / Top-k)
temperature + top_p
Sample from a distribution instead of always taking the max. Produces diverse, creative output. Better for dialogue and story generation.
from transformers import BartForConditionalGeneration, BartTokenizer

model     = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

text   = "Quantum computing uses quantum mechanical phenomena like superposition..."
inputs = tokenizer.encode(text, return_tensors="pt", max_length=512, truncation=True)

# facebook/bart-large-cnn ships its own generation defaults (including beam
# search), so num_beams is set explicitly in every call below.

# ── Strategy 1: Greedy ─────────────────────────
greedy = model.generate(inputs, max_length=60, num_beams=1, do_sample=False)

# ── Strategy 2: Beam Search ────────────────────
beam = model.generate(
    inputs, max_length=60, num_beams=5,
    early_stopping=True, no_repeat_ngram_size=3
)

# ── Strategy 3: Sampling (creative / diverse) ──
sampled = model.generate(
    inputs, max_length=60,
    num_beams=1,        # pure sampling, no beams
    do_sample=True,
    top_p=0.92,         # nucleus sampling
    top_k=50,           # limit to top-50 candidates
    temperature=0.7,    # sharpen distribution slightly
    num_return_sequences=3  # generate 3 alternatives
)

print("GREEDY:", tokenizer.decode(greedy[0], skip_special_tokens=True))
print("BEAM:  ", tokenizer.decode(beam[0],   skip_special_tokens=True))
for i, s in enumerate(sampled):
    print(f"SAMPLE {i+1}:", tokenizer.decode(s, skip_special_tokens=True))
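Why beam search can beat greedy decoding is easiest to see on a toy model. The sketch below uses a hand-written next-token table (entirely hypothetical, conditioned only on the previous token) and plain probabilities; real decoders work with log-probabilities, length penalties, and full-prefix conditioning. The helper names are made up for this example.

```python
# Hypothetical next-token distributions, keyed by the previous token.
PROBS = {
    "<s>": {"the": 0.5, "a": 0.4, "</s>": 0.1},
    "the": {"cat": 0.4, "dog": 0.35, "</s>": 0.25},
    "a": {"cat": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def greedy_decode(probs, start="<s>", end="</s>", max_steps=10):
    """Always take the single most probable next token."""
    seq, score = [start], 1.0
    while seq[-1] != end and len(seq) <= max_steps:
        token, p = max(probs[seq[-1]].items(), key=lambda kv: kv[1])
        seq.append(token)
        score *= p
    return seq, score


def beam_search(probs, num_beams=2, start="<s>", end="</s>", max_steps=10):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [([start], 1.0)]
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:
                candidates.append((seq, score))  # finished hypotheses carry over unchanged
                continue
            for token, p in probs[seq[-1]].items():
                candidates.append((seq + [token], score * p))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        if all(seq[-1] == end for seq, _ in beams):
            break
    return beams[0]


def top_p_filter(dist, top_p):
    """Keep the smallest high-probability set whose mass reaches top_p, renormalised."""
    kept, mass = {}, 0.0
    for token, p in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        mass += p
        if mass >= top_p:
            break
    return {t: p / mass for t, p in kept.items()}


greedy_seq, greedy_score = greedy_decode(PROBS)
beam_seq, beam_score = beam_search(PROBS, num_beams=2)
print("greedy:", " ".join(greedy_seq), round(greedy_score, 3))
print("beam:  ", " ".join(beam_seq), round(beam_score, 3))
print("top-p: ", top_p_filter(PROBS["the"], 0.7))
```

Greedy commits to "the" because it is the best first step, then finds only mediocre continuations; beam search keeps "a" alive and ends with the more probable sequence overall. The `top_p_filter` helper shows the truncation that nucleus sampling applies before drawing a token.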

Section 10

Which Model Should You Choose?

01
Do you need multiple tasks from a single model?
If yes → T5. Its prefix system lets you do summarisation, QA, translation, and classification with a single checkpoint by simply changing the text prefix. BART requires a different fine-tuned checkpoint for each task.
02
Is your primary task abstractive summarisation?
If yes → BART (facebook/bart-large-cnn). BART set the state-of-the-art on CNN/DailyMail summarisation at the time of its release and tends to produce fluent, natural-sounding summaries on English news text. T5 is competitive, but BART's full-text denoising pre-training is a close match for the summarisation task.
03
Do you need zero-shot text classification?
If yes → BART (facebook/bart-large-mnli). This is BART's killer feature. You can classify text into arbitrary labels without any fine-tuning data. T5 was not designed for this pattern.
04
Is compute budget very limited?
If yes → T5-Small (60M params). It can run on CPU and still performs respectably on summarisation and QA. BART-base is 140M — smaller than BART-large's 400M but still larger than T5-small.
05
Do you need multilingual support?
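If yes → mT5 (multilingual T5, 101 languages). The original BART was pre-trained on English text; its multilingual counterpart is mBART, a separate set of checkpoints. The multilingual T5 variants (mT5-small through mT5-XXL) cover 101 languages, though like any pre-trained model they still need fine-tuning for your task.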

Section 11

Golden Rules — T5 & BART in Production

🌿 Non-Negotiable Rules for Text-to-Text Models
1
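Always add the task prefix for T5, never for BART. A missing or mismatched prefix is a common cause of "my T5 doesn't work" reports: the original T5 checkpoints were trained with prefixes, so leaving one out moves the input away from what the model saw in training. Adding a prefix to BART injects text it was never trained to interpret.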
2
Set labels padding to -100, not pad_token_id. PyTorch's cross-entropy loss ignores index -100 by default. If you forget this, your model wastes capacity learning to reproduce padding tokens — training will be slower and accuracy lower.
3
Use predict_with_generate=True during evaluation. Without it, the Trainer evaluates using teacher-forcing (it feeds ground-truth previous tokens), which can make metrics look substantially better than real generation quality. Real inference is autoregressive, so errors compound. Always evaluate the way you will serve.
4
Use beam search (num_beams=4) for quality tasks. Greedy decoding is fine for speed tests but produces noticeably worse summaries. For production summarisation, num_beams=4 with no_repeat_ngram_size=3 is the standard starting configuration.
5
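Watch max_length, for both encoder and decoder. T5 was pre-trained on 512-token inputs and its Hugging Face tokenizer defaults to that limit; its relative position embeddings technically allow longer inputs, at a higher memory cost. BART's learned absolute position embeddings cap inputs at 1024 tokens. Silently truncating long documents kills recall in the summary. Always log how many examples exceed your max_length during preprocessing.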
6
Use gradient accumulation for large BART models on limited GPU memory. Fine-tuning BART-large (400M params) at full precision with a batch size of 8 can need roughly 16 GB of GPU RAM, depending on sequence length and optimizer. Setting gradient_accumulation_steps=4 with batch size 2 gives the same effective batch of 8 with a much smaller per-step memory footprint. Pair with fp16=True for additional memory savings.
7
Fine-tune on task-specific data — do not use base models in production. Pre-trained T5 and BART are strong but not production-ready. Even 1,000–5,000 labeled examples of your specific task will produce dramatically better results than a zero-shot base model. Always fine-tune when data is available.
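Rules 5 and 6 both come down to simple counting, sketched below. The helper names are hypothetical, and the default token counter splits on whitespace purely as a stand-in: for real numbers, pass a function built on your actual tokenizer, since subword counts are usually higher than word counts.

```python
def truncation_report(texts, max_length, count_tokens=lambda t: len(t.split())):
    """Return (number_truncated, fraction_truncated) for a list of texts.

    count_tokens defaults to whitespace splitting as a rough stand-in;
    supply a tokenizer-based counter for accurate results.
    """
    if not texts:
        return 0, 0.0
    too_long = sum(1 for t in texts if count_tokens(t) > max_length)
    return too_long, too_long / len(texts)


def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return per_device_batch * accumulation_steps * num_devices


docs = ["a b c", "a b c d e f", "a"]
n_truncated, fraction = truncation_report(docs, max_length=4)
print(f"{n_truncated} of {len(docs)} documents exceed max_length ({fraction:.0%})")
print("effective batch:", effective_batch_size(per_device_batch=2, accumulation_steps=4))
```

Logging the truncation fraction once per dataset is cheap and makes it obvious when a max_length setting is discarding a meaningful share of the input.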
🚀
Where These Models Live Today

T5 and BART were foundational encoder-decoder models whose ideas carried into later systems. Google's Flan-T5 is an instruction-tuned T5, and framing every task as "text in, text out" is the same idea behind today's instruction-following assistants and the summarisation APIs powering news apps. Understanding T5 and BART means understanding a large part of the DNA of modern AI text generation.
