We created a dataset for aspect-based summarization from papers published in ACL conferences, ensuring its quality through manual validation.

Find our paper here; it will appear in the NAACL 2024 conference proceedings soon.

Quality in Summarization Datasets

In text summarization research, the quality of the data we use is crucial. However, existing datasets pose challenges, mainly because they are often created by (semi-)automatically pulling information from the web. This approach makes it easy to gather a lot of data, but it often leads to summaries that don’t accurately reflect the original texts. For example, some summaries are more about grabbing attention than providing a true overview because they are the lead sentences of news articles. These datasets can also include unwanted elements like web links, which reduce their quality. Sometimes they even contain texts in an unexpected language. Legal issues and restricted access are additional problems with some well-known datasets.


We present ACLSum, a carefully crafted and evaluated summarization dataset for the scientific domain. ACLSum contains 250 papers from ACL conferences, complemented with abstractive and extractive multi-aspect summaries of challenges, approaches, and outcomes. Each summary contains fewer than 25 words (the average sentence length in papers) so that researchers can quickly grasp an overview of the paper. We created the summaries with a lot of love and time to keep the quality high. After the summaries were written, different experts checked their quality.

Overall, the quality of the reference summaries is very good, and inter-annotator agreement on this quality check is also quite high. Fluency lags slightly behind (I, the annotator who wrote the summaries, am not very fluent in English), but the summaries achieve high quality on the other, more important aspects, namely relevance and consistency.

In addition to the abstractive summaries, we also provide manually annotated extractive labels for each aspect. This type of label is uncommon, since in most extractive summarization datasets the labels are automatically induced.


We tried a few experiments with ACLSum:

  • Comparison between end-to-end summarization and two-step extract-then-abstract summarization
  • Instruction tuning of Llama 2
  • Evaluation of the common heuristic to induce extractive labels from abstractive summaries using our gold extractive annotations
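
To give a feel for the last item: a widely used way to induce extractive labels is to greedily pick the source sentences that best overlap with the abstractive summary. The snippet below is only a minimal sketch of that idea, not necessarily the exact heuristic we evaluated in the paper. It assumes whitespace tokenization and a simple ROUGE-1-style unigram F1, and the names `unigram_f1`, `induce_extractive_labels`, and `max_sentences` are made up for illustration.

```python
from collections import Counter


def unigram_f1(candidate_tokens, reference_tokens):
    """ROUGE-1-style F1 based on clipped unigram overlap (simplified)."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def induce_extractive_labels(sentences, summary, max_sentences=3):
    """Greedily select sentence indices that keep improving overlap with the summary.

    Assumption: lowercased whitespace tokens stand in for a real tokenizer.
    """
    reference = summary.lower().split()
    tokenized = [s.lower().split() for s in sentences]
    selected = []
    best_score = 0.0
    while len(selected) < max_sentences:
        best_idx = None
        for i, tokens in enumerate(tokenized):
            if i in selected:
                continue
            # Score the already selected sentences plus this candidate.
            candidate = [t for j in selected + [i] for t in tokenized[j]]
            score = unigram_f1(candidate, reference)
            if score > best_score:
                best_score, best_idx = score, i
        if best_idx is None:
            # No remaining sentence improves the score, so stop early.
            break
        selected.append(best_idx)
    return sorted(selected)


sentences = [
    "we propose a new parser",
    "the weather is nice today",
    "our parser improves accuracy",
]
labels = induce_extractive_labels(sentences, "a new parser improves accuracy")
print(labels)
```

With manually annotated extractive labels like ours, the indices returned by such a heuristic can be compared directly against what a human marked as relevant.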

Check out our paper for more details.


As a follow-up project, we built a web-based paper explorer for papers published at ACL conferences, called GenGO (which means “language” in Japanese).

GenGO presents papers with multi-aspect summaries, and of course they are generated by models trained on ACLSum. GenGO also implements a few NLP-powered features, such as NER-based filtering and text-embedding-based semantic search and recommendation. Check out how our summarization models perform on fantastic papers published at ACL conferences.
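
For readers curious about what embedding-based semantic search means in general, here is a rough sketch: each paper is represented by a vector, and papers are ranked by cosine similarity to the query vector. This is not GenGO's actual code. The three-dimensional vectors and the names `cosine`, `semantic_search`, and `paper_vecs` are hypothetical; in a real system the vectors would come from a text embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_search(query_vec, paper_vecs, top_k=2):
    """Return the ids of the top_k papers most similar to the query vector."""
    ranked = sorted(
        paper_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [paper_id for paper_id, _ in ranked[:top_k]]


# Toy vectors standing in for real embeddings (hypothetical values).
paper_vecs = {
    "paper_a": [0.9, 0.1, 0.0],
    "paper_b": [0.0, 1.0, 0.2],
    "paper_c": [0.7, 0.3, 0.1],
}
results = semantic_search([1.0, 0.0, 0.0], paper_vecs)
print(results)
```

Recommendation can be sketched the same way by using a paper's own vector as the query and skipping the paper itself in the results.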


Thank you for reading this post until the end. Constructing a summarization dataset takes a lot of time and energy, but it is a lot of fun to build models using a custom dataset.

Feel free to contact me about the dataset or about GenGO, including feature requests!