This article is incomplete. Currently, I am using this page just to collect papers that can support my claim.
papers
comparison
- [2510.26622] Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model
- enc-dec lags behind without instruction tuning, but after instruction tuning, it becomes comparable to dec-only
- “significantly better inference efficiency”
- enc-dec benefits from bidirectional attention in encoders
- I don’t like that they just state: “We further split each sequence in the middle, resulting in 1024 input and 1024 target tokens (i.e., k = 1024), for training RedLLM.”
- because how to split the sequence seems like an important decision (e.g., it determines how many loss terms are computed per sample), but it is not discussed in the paper (see the token-count sketch after this list)
- “Note the effective target token count for RedLLM is 0.8T, half the size used for DecLLM due to the architectural difference”
- enc-dec may be doing better at long context after pre-training
- the cross-attention seems to behave differently in long contexts, which seems interesting
- same size, same training data, but it attends differently; how different is it, and how does that influence downstream tasks? (see the attention-mask sketch after this list)
- efficiency comparison by FLOPs and empirical analysis (examples per second during training and inference)
- and shows that enc-dec has a “clear advantage”
- T5Gemma: A new collection of encoder-decoder Gemma models - Google Developers Blog
- What Language Model Architecture and Pretraining Objective Works Best for Zero-Shot Generalization?
- Examining Scaling and Transfer of Language Model Architectures for Machine Translation
- [2304.04052] Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder
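
To make the attention contrast above concrete (the bidirectional-encoder note and the cross-attention note), here is a minimal NumPy sketch of the three mask patterns involved. It is my own illustration, not code from any of the papers; the shapes and the k = 1024 comment are only examples.

```python
# Minimal sketch of the attention patterns contrasted above (illustrative only).
import numpy as np

def encoder_self_attention_mask(src_len: int) -> np.ndarray:
    # Bidirectional: every input token may attend to every other input token.
    return np.ones((src_len, src_len), dtype=bool)

def decoder_self_attention_mask(tgt_len: int) -> np.ndarray:
    # Causal: target position i may only attend to target positions <= i.
    return np.tril(np.ones((tgt_len, tgt_len), dtype=bool))

def cross_attention_mask(tgt_len: int, src_len: int) -> np.ndarray:
    # Every target position may attend to the entire, already-encoded input.
    return np.ones((tgt_len, src_len), dtype=bool)

if __name__ == "__main__":
    # With k = 1024, an enc-dec setup like RedLLM would use src_len = tgt_len = 1024,
    # while a dec-only model applies one 2048x2048 causal mask to the whole sequence,
    # so its input tokens never get bidirectional context.
    print(encoder_self_attention_mask(4).astype(int))
    print(decoder_self_attention_mask(4).astype(int))
    print(cross_attention_mask(4, 4).astype(int))
```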
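
And a back-of-the-envelope check of the quoted 0.8T figure: splitting each sequence in the middle means only the second half of every sequence carries prediction targets, which is where the halving comes from. The 1.6T DecLLM number below is inferred from “half the size used for DecLLM”, not stated directly in my notes; the rest is arithmetic.

```python
# Back-of-the-envelope target-token accounting for the mid-sequence split.
seq_len = 2048            # full sequence length per example
k = 1024                  # split point: first k tokens become encoder input only
dec_llm_targets = 1.6e12  # assumed: decoder-only counts (almost) every token as a target

# RedLLM computes the loss only on tokens after the split,
# i.e. (seq_len - k) / seq_len of the tokens it sees.
red_llm_targets = dec_llm_targets * (seq_len - k) / seq_len

print(f"DecLLM effective targets: {dec_llm_targets:.1e}")  # 1.6e+12
print(f"RedLLM effective targets: {red_llm_targets:.1e}")  # 8.0e+11, i.e. 0.8T
```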
encoder-decoder design
- Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation
- [2501.16273v2] Return of the Encoder: Maximizing Parameter Efficiency for SLMs
- Bidirectional Language Models Are Also Few-shot Learners | OpenReview
- [2205.05131] UL2: Unifying Language Learning Paradigms