Yale Scisumm Dataset

Background: What is Scisumm?

A summary of a scientific paper should ideally incorporate the paper's impact on the research community, as reflected by its citations. To facilitate research in citation-aware scientific paper summarization (Scisumm), the CL-Scisumm shared task has been organized since 2014 for papers in the computational linguistics and NLP domain.

The latest CL-Scisumm 2018 task contains 40 NLP papers with citation sentences and human-annotated reference summaries. Participants develop systems that automatically produce summaries using the original papers and their citation information.

The ScisummNet Corpus

At the Yale LILY lab, we have expanded the CL-Scisumm project and developed the first large-scale, human-annotated Scisumm dataset, ScisummNet. It provides over 1,000 papers in the ACL anthology network with their citation networks (e.g., citation sentences and citation counts) and comprehensive, manually written summaries.

The following paper (AAAI 2019) introduces the corpus in detail and shows how ScisummNet enables the training of data-driven summarization models for scientific papers.
Read the paper

Getting Started

Download the dataset (distributed under the CC BY-SA 4.0 license):
ScisummNet ver1.0 (15 MB)
When unzipped, the package contains a dataset description and subdirectories for the 1,000 papers. Each paper directory contains the paper's PDF file, XML file, annotated citation information (in JSON format), and manual summary. Please see the included documentation for more details.
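As a minimal sketch of how one might load a single paper directory into memory, the snippet below assumes illustrative filenames (`citations.json`, `summary.txt`, `paper.xml`); the actual file layout and names are given in the documentation included with the download.

```python
import json
import tempfile
from pathlib import Path


def load_paper(paper_dir):
    """Load one paper directory into a dict.

    NOTE: the filenames below are hypothetical placeholders --
    consult the dataset's included documentation for the real layout.
    """
    paper_dir = Path(paper_dir)
    citations = json.loads((paper_dir / "citations.json").read_text())
    summary = (paper_dir / "summary.txt").read_text()
    xml = (paper_dir / "paper.xml").read_text()
    return {"citations": citations, "summary": summary, "xml": xml}


# Demo on a mock paper directory standing in for an unzipped paper folder.
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp) / "P05-1012"
    d.mkdir()
    (d / "citations.json").write_text(json.dumps([{"cite_text": "..."}]))
    (d / "summary.txt").write_text("A gold summary.")
    (d / "paper.xml").write_text("<paper/>")
    paper = load_paper(d)
    print(paper["summary"])
```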

If you use the corpus in your work, please consider citing the following paper:

@inproceedings{yasunaga2019scisummnet,
    title = {{ScisummNet}: A Large Annotated Corpus and Content-Impact Models for Scientific Paper Summarization with Citation Networks},
    author = {Michihiro Yasunaga and Jungo Kasai and Rui Zhang and Alexander Fabbri and Irene Li and Dan Friedman and Dragomir Radev},
    booktitle = {Proceedings of AAAI 2019},
    year = {2019}
}


We thank the members of the CL-Scisumm team, Kokil Jaidka, Muthu Kumar Chandrasekaran, and Min-Yen Kan, for their help on this project. We are also grateful to the developers of the SQuAD website, from which this website design is adapted.