Background

I don’t know how confused other summarization researchers are, but I am close to panicking at how well these LLMs can generate summaries, and at the speed of development within the research community and the larger tech community.

To find out whether I should just stop working on summarization or whether there is still something interesting and meaningful to do, I started maintaining this page, which I will keep updating until I go crazy.

This post contains the following:

  • A list of papers about LLMs on summarization
  • My thoughts at the moment

Both will change over time, and this git history page will show you how it has been updated.

Paper list

Most of them are arXiv papers, not peer-reviewed work, so use this list at your own risk. TLDRs are generated by a “traditionally” fine-tuned BART-large model. My comments focus only on the question mentioned above. I give a reference number to each paper, but the order has no meaning.
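
In case it is useful, here is a minimal sketch of how such TLDRs can be produced with a fine-tuned BART model; `facebook/bart-large-cnn` is just a public stand-in checkpoint, not the exact model used for this list.

```python
# Minimal sketch of generating a TLDR with a fine-tuned BART-large model.
# NOTE: "facebook/bart-large-cnn" is a public stand-in checkpoint, not
# necessarily the model used for the TLDRs on this page.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

abstract = (
    "Large language models have recently shown strong zero-shot summarization "
    "ability, raising questions about the value of fine-tuned summarization models."
)
tldr = summarizer(abstract, max_length=60, min_length=10, do_sample=False)[0]["summary_text"]
print(tldr)
```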

If you know of other relevant papers, I would appreciate it if you could let me know by raising an issue here.

  1. News Summarization and Evaluation in the Era of GPT-3
    • TLDR: We show that humans overwhelmingly prefer GPT-3 summaries, which also do not suffer from common dataset-specific issues such as poor factuality.
  2. Benchmarking Large Language Models for News Summarization
    • TLDR: We show that instruction tuning, not model size, is the key to the LLM’s zero-shot summarization capability.
    • Comments
    • The number of evaluated datasets may be too small to draw their conclusion, given that these two datasets are known to have some critical problems.
    • But I agree that, at least in this relatively simple setup, i.e., single-document news summarization, large models can generate human-level summaries.
  3. On Learning to Summarize with Large Language Models as References
    • TLDR: We propose a novel reward-based learning paradigm for text summarization models that considers the LLMs as the reference or the gold-standard oracle on commonly used summarization datasets such as the CNN/DailyMail dataset.
  4. LMGQS: A Large-scale Dataset for Query-focused Summarization
    • TLDR: We propose a large-scale query-focused summarization dataset for query-centric summarization and demonstrate state-of-the-art performance on multiple existing QFS benchmarks.
    • Finally we have a large-scale QFS dataset, obtained by generating it (?)
  5. Multi-Dimensional Evaluation of Text Summarization with In-Context Learning
    • TLDR: We study the efficacy of large language models as multi-dimensional evaluators using in-context learning, obviating the need for large training datasets.

Thoughts

Reference summary quality issues

As repeatedly mentioned in paper 2, reference summaries in the existing datasets have quality issues, and the authors show that zero-shot LLMs can easily generate better summaries. This makes sense because these references (especially the ones in the XSUM dataset) are not really summaries but “the first sentence of a news article”. Therefore, it is natural that they are sometimes inconsistent with the rest of the article or contain information that does not appear in the rest of the article.
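
This observation can be checked roughly, for example by measuring how many reference-summary tokens never appear in the article body. A minimal sketch, assuming the Hugging Face hub copy of XSUM:

```python
# Rough sketch: estimate how "novel" XSUM reference summaries are, i.e. the
# fraction of summary tokens that never appear in the article body.
# Assumes the Hugging Face hub copy of XSUM; the dataset id may differ
# depending on your `datasets` version.
from datasets import load_dataset

xsum = load_dataset("EdinburghNLP/xsum", split="validation[:200]")

def novel_token_ratio(document: str, summary: str) -> float:
    doc_tokens = set(document.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    return sum(tok not in doc_tokens for tok in summary_tokens) / len(summary_tokens)

ratios = [novel_token_ratio(ex["document"], ex["summary"]) for ex in xsum]
print(f"mean novel-token ratio over {len(ratios)} examples: {sum(ratios) / len(ratios):.2f}")
```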

This “problem” results in two super critical issues that can mislead research outcomes:

  • fine-tuned models learn to generate “the first sentence of the article”, not summaries,
  • automatic metrics (e.g., ROUGE-x) don’t correlate well with human evaluations (see the sketch right after this list).
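
On the second point: ROUGE only measures n-gram overlap with the reference, so when the reference is just the article’s lead sentence, an output that copies the lead can outscore a genuinely better summary. A minimal sketch using the `rouge-score` package:

```python
# Minimal sketch: ROUGE only measures n-gram overlap with the reference, so a
# "lead sentence" reference rewards lead-copying outputs over better summaries.
# Uses the `rouge-score` package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

reference = "The first sentence of the article, used as the reference summary."
lead_copy = "The first sentence of the article, used as the reference summary."
better_summary = "A concise summary that actually covers the whole article."

print(scorer.score(reference, lead_copy))       # near-perfect overlap
print(scorer.score(reference, better_summary))  # lower score despite being a better summary
```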

It’s a fact that LLMs can generate high-quality summaries. However, due to the first issue, fine-tuned models may be underestimated when they are evaluated by human annotators, because the annotators are asked to judge the outputs as summaries, which is not what the models were trained to produce.

There has been a lot of work proposing a “better evaluation metric”, but what we may actually need is datasets whose reference summaries actually summarize the input documents. I strongly believe that we need such datasets to re-evaluate the tools we have now.

Ignoring instructions

Paper 2 mentions that LLMs often ignore instructions, such as by generating text that is not a summary or by exceeding the length limit. Smaller fine-tuned models sometimes produce low-quality text, but their outputs are still the “summaries” that the models are trained to generate. Ignoring instructions is unacceptable for many applications, e.g., ones where users can’t interact with the model to actually make it generate summaries, such as my paper TLDR pages.

So, at the moment, my opinion is that if there is a dataset that fits the application’s needs, it is safer to use a fine-tuned model.
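
For such applications, one pragmatic pattern is to validate the generated output against the constraint and fall back when it keeps being violated. A hypothetical sketch (the `generate` callable is a placeholder, not any specific API):

```python
# Hypothetical sketch: retry generations that ignore a length limit, and signal
# a fallback when they keep violating it. `generate` is a placeholder for
# whichever model backs the application (LLM API call, local model, ...).
from typing import Callable, Optional

def summarize_with_length_limit(
    document: str,
    generate: Callable[[str], str],
    max_words: int = 50,
    retries: int = 2,
) -> Optional[str]:
    for _ in range(retries + 1):
        summary = generate(document)
        if len(summary.split()) <= max_words:
            return summary
    return None  # caller can fall back to a fine-tuned model or skip the item

# Example with a dummy generator that respects the limit:
print(summarize_with_length_limit("some article text", lambda doc: "A short summary.", max_words=10))
```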

Evaluate models by models

(Still thinking…)

If you have an LLM that can evaluate other models (almost) zero-shot, can’t that LLM just generate “perfect” summaries itself? (ref. paper 5)
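
For reference, the setup in paper 5, prompting an LLM as a multi-dimensional evaluator via in-context learning, can be roughly sketched as follows; the prompt wording and `call_llm` are placeholders, not the paper’s actual implementation.

```python
# Rough sketch of using an LLM as a multi-dimensional summary evaluator via
# in-context learning (in the spirit of paper 5). The prompt wording and
# `call_llm` are placeholders, not the paper's actual setup.
from typing import Callable, Dict

DIMENSIONS = ["coherence", "consistency", "fluency", "relevance"]

PROMPT_TEMPLATE = """You are evaluating a summary along one dimension: {dimension}.

Example:
Article: {demo_article}
Summary: {demo_summary}
{dimension} score (1-5): {demo_score}

Now evaluate:
Article: {article}
Summary: {summary}
{dimension} score (1-5):"""

def evaluate_summary(article: str, summary: str, call_llm: Callable[[str], str]) -> Dict[str, str]:
    # One in-context demonstration per prompt; a real setup would use curated examples.
    demo = {
        "demo_article": "<demonstration article>",
        "demo_summary": "<demonstration summary>",
        "demo_score": "4",
    }
    scores = {}
    for dim in DIMENSIONS:
        prompt = PROMPT_TEMPLATE.format(dimension=dim, article=article, summary=summary, **demo)
        scores[dim] = call_llm(prompt).strip()
    return scores
```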