This article is incomplete. Currently, I am using this page just to collect papers that can support my claim.
papers
comparison
- [2510.26622] Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model
- enc-dec lags behind without instruction tuning, but after instruction tuning, it becomes comparable to dec-only
- “significantly better inference efficiency”
- enc-dec benefits from bidirectional attention in encoders
- I don’t like that they just state: “We further split each sequence in the middle, resulting in 1024 input and 1024 target tokens (i.e., k = 1024), for training RedLLM.”
- because how to split the sequence seems like an important decision (e.g., it determines how many loss terms are computed per sample), but it is not discussed in the paper (see the token-count sketch after this list)
- “Note the effective target token count for RedLLM is 0.8T, half the size used for DecLLM due to the architectural difference”
- enc-dec may be doing better at long context after pre-training
- the cross-attention seems to behave differently in long contexts, which seems interesting
- same size, same training data, but it attends differently; how different is it, and how does that influence downstream tasks? (see the attention-mask sketch after this list)
- efficiency comparison by FLOPs and empirical analysis (examples per second during training and inference)
- and shows that enc-dec has a “clear advantage”
- T5Gemma: A new collection of encoder-decoder Gemma models - Google Developers Blog
- What Language Model Architecture and Pretraining Objective Works Best for Zero-Shot Generalization?
- Examining Scaling and Transfer of Language Model Architectures for Machine Translation
- [2304.04052] Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder
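
To make the attention contrast above concrete (the bidirectional-encoder note and the cross-attention note), here is a minimal NumPy sketch of the three mask patterns involved. It is my own illustration, not code from any of the papers; the shapes and the k = 1024 comment are only examples.

```python
# Minimal sketch of the attention patterns contrasted above (illustrative only).
import numpy as np

def encoder_self_attention_mask(src_len: int) -> np.ndarray:
    # Bidirectional: every input token may attend to every other input token.
    return np.ones((src_len, src_len), dtype=bool)

def decoder_self_attention_mask(tgt_len: int) -> np.ndarray:
    # Causal: target position i may only attend to target positions <= i.
    return np.tril(np.ones((tgt_len, tgt_len), dtype=bool))

def cross_attention_mask(tgt_len: int, src_len: int) -> np.ndarray:
    # Every target position may attend to the entire, already-encoded input.
    return np.ones((tgt_len, src_len), dtype=bool)

if __name__ == "__main__":
    # With k = 1024, an enc-dec setup like RedLLM would use src_len = tgt_len = 1024,
    # while a dec-only model applies one 2048x2048 causal mask to the whole sequence,
    # so its input tokens never get bidirectional context.
    print(encoder_self_attention_mask(4).astype(int))
    print(decoder_self_attention_mask(4).astype(int))
    print(cross_attention_mask(4, 4).astype(int))
```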
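
And a back-of-the-envelope check of the quoted 0.8T figure: splitting each sequence in the middle means only the second half of every sequence carries prediction targets, which is where the halving comes from. The 1.6T DecLLM number below is inferred from “half the size used for DecLLM”, not stated directly in my notes; the rest is arithmetic.

```python
# Back-of-the-envelope target-token accounting for the mid-sequence split.
seq_len = 2048            # full sequence length per example
k = 1024                  # split point: first k tokens become encoder input only
dec_llm_targets = 1.6e12  # assumed: decoder-only counts (almost) every token as a target

# RedLLM computes the loss only on tokens after the split,
# i.e. (seq_len - k) / seq_len of the tokens it sees.
red_llm_targets = dec_llm_targets * (seq_len - k) / seq_len

print(f"DecLLM effective targets: {dec_llm_targets:.1e}")  # 1.6e+12
print(f"RedLLM effective targets: {red_llm_targets:.1e}")  # 8.0e+11, i.e. 0.8T
```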
encoder-decoder design
- Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation
- [2501.16273v2] Return of the Encoder: Maximizing Parameter Efficiency for SLMs
- Bidirectional Language Models Are Also Few-shot Learners | OpenReview
- [2205.05131] UL2: Unifying Language Learning Paradigms