PURE: A Dataset of Public Requirements Documents


Abstract:

This paper presents PURE (PUblic REquirements dataset), a dataset of 79 publicly available natural language requirements documents collected from the Web. The dataset includes 34,268 sentences and can be used for natural language processing tasks that are typical in requirements engineering, such as model synthesis, abstraction identification and document structure assessment. It can be further annotated to work as a benchmark for other tasks, such as ambiguity detection, requirements categorisation and identification of equivalent requirements. In the paper, we present the dataset and we compare its language with generic English texts, showing the peculiarities of the requirements jargon, made of a restricted vocabulary of domain-specific acronyms and words, and long sentences. We also present the common XML format to which we have manually ported a subset of the documents, with the goal of facilitating replication of NLP experiments.
Date of Conference: 04-08 September 2017
Date Added to IEEE Xplore: 25 September 2017
ISBN Information:
Electronic ISSN: 2332-6441
Conference Location: Lisbon, Portugal

I. Introduction

Requirements are normally expressed with the most flexible of the communication codes, which is natural language (NL) [1]. Several authors have applied natural language processing (NLP) techniques in requirements engineering (RE) to address multiple tasks, including model synthesis [2], classification of requirements into functional/non-functional categories [3], classification of online product reviews [4], traceability [5], [6], ambiguity detection [7]–[9], structure assessment [10], detection of equivalent requirements [11], completeness evaluation [12], and information extraction [13]–[15]. With some exceptions, most of the works use proprietary or domain-specific documents as benchmarks, and replication of the experiments as well as generalisation of the results have always been an issue [16].

This paper presents PURE (PUblic REquirements dataset), a dataset of 79 publicly available requirements documents retrieved from the Web. The dataset is oriented to replication of NLP experiments and generalisation of results. The documents cover multiple domains, have different degrees of abstraction, and range from product standards, to documents of public companies, to university projects. We also defined a general XML schema file (XSD) to represent these different documents in a uniform format. We have currently ported a subset of the documents to this format to ease rigorous comparison of NLP experiments.

The paper extends a recent conference contribution [17]. With respect to this previous work, we provide statistical information on the NL content of the dataset, we present the XSD schema adopted to format the documents, and we provide recommendations on the usage of the dataset.
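As an illustration of the kind of statistical information mentioned above, the following minimal sketch computes sentence count, vocabulary size, and mean sentence length for a plain-text requirements document. It is not the tooling used to build PURE: the function names (`split_sentences`, `tokenize`, `text_statistics`), the regex-based sentence splitter, and the sample text are all assumptions made for this example, and a real study would likely rely on a dedicated NLP toolkit.

```python
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercased alphanumeric tokens; inner hyphens and apostrophes are kept."""
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", sentence.lower())


def text_statistics(text):
    """Return basic counts that allow two corpora to be compared side by side."""
    sentences = split_sentences(text)
    tokens = [tok for s in sentences for tok in tokenize(s)]
    n_sent = len(sentences)
    n_tok = len(tokens)
    vocab = set(tokens)
    return {
        "sentences": n_sent,
        "tokens": n_tok,
        "vocabulary": len(vocab),
        "mean_sentence_length": n_tok / n_sent if n_sent else 0.0,
        "type_token_ratio": len(vocab) / n_tok if n_tok else 0.0,
    }


# Hypothetical sample text, not taken from the dataset.
sample = (
    "The system shall log every login attempt. "
    "The system shall lock the account after three failed attempts."
)
stats = text_statistics(sample)
print(stats)
```

Running the same function over a requirements document and over a generic English text gives comparable figures for vocabulary size and sentence length, which is the type of comparison the paper describes.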
