The data collections that we created and described in our publications are available to researchers in the field.

The Nijmegen 2011 query intent data set

(added March 2013)

The dataset consists of:

  • A txt-file containing 605 queries with annotations according to our multi-dimensional intent classification scheme;
  • A txt-file (README.txt) containing documentation.

You can download the data as a zipped archive here. If you use the data, please refer to this paper:

Why-questions with snippets from Bing and relevance assessments for each snippet

(added March 2011)

This download set (as described in Verberne et al., 2011) consists of:

  • A txt-file containing 238 questions with 10 snippets per question + relevance assessments on a 3-point scale;
  • A txt-file (00about.txt) containing documentation.

You can download the data as a gzipped tar archive after filling in your name and affiliation in this form.

Why-questions and answers with relevance labels for machine learning (learning-to-rank) purposes

(added 2010)

This download set (as described in Verberne et al., 2010) consists of:

  • A txt-file containing 186 questions with 150 candidate answers per question + labels for your own feature extraction;
  • A txt-file containing 37 feature values for 150*186 answers + labels in SVMlight format for machine learning purposes;
  • A txt-file containing documentation.
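
For readers who want to load the feature file without extra libraries, the following is a minimal sketch of a reader for the general SVMlight line layout (`<label> qid:<id> <index>:<value> ... # comment`). It assumes, without confirmation from the documentation file, that the feature file uses a `qid` field to group answers per question; the function names and the sample lines are made up for illustration and are not taken from the data set. If the file holds one line per candidate answer, it should yield 150*186 = 27,900 rows.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# (label, qid, {feature index: feature value}); qid is None if absent.
Row = Tuple[float, Optional[str], Dict[int, float]]


def parse_svmlight_line(line: str) -> Optional[Row]:
    """Parse one SVMlight-style line; return None for blank or comment-only lines."""
    # In SVMlight files, everything after '#' is a comment.
    body = line.split("#", 1)[0].strip()
    if not body:
        return None
    tokens = body.split()
    label = float(tokens[0])
    qid: Optional[str] = None
    features: Dict[int, float] = {}
    for token in tokens[1:]:
        key, value = token.split(":", 1)
        if key == "qid":
            # Assumption: answers are grouped per question via qid.
            qid = value
        else:
            features[int(key)] = float(value)
    return label, qid, features


def parse_svmlight(lines: Iterable[str]) -> List[Row]:
    """Parse an iterable of lines (for example an open file) into rows."""
    rows: List[Row] = []
    for line in lines:
        row = parse_svmlight_line(line)
        if row is not None:
            rows.append(row)
    return rows


# Hypothetical sample lines, NOT taken from the actual data set.
sample = [
    "1 qid:1 1:0.5 2:0.0 37:1.25 # hypothetical answer id",
    "0 qid:1 1:0.1 37:0.75",
    "",
]
rows = parse_svmlight(sample)
```

To read the real file, pass an open file handle to `parse_svmlight` instead of `sample`. Features missing from a line are simply absent from the dictionary, which follows the sparse convention of the format.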

You can download the data as a gzipped tar archive after filling in your name and affiliation in this form.

Why-questions from the Webclopedia collection with Wikipedia answer fragments

(added March 2007)

This download set (as described in Verberne et al., 2007b) consists of:

  • An Excel sheet with 400 randomly selected why-questions from the Webclopedia set (questions submitted to the online QA system answers.com, gathered by Hovy et al.) and, for each question, a Wikipedia text fragment giving the answer and a pointer to the complete Wikipedia document;
  • A zip-file containing all complete Wikipedia documents that are referred to in the Excel sheet;
  • A zip-file containing all answer fragments in context (complete paragraph and sometimes also the previous paragraph or heading), manually annotated with RST structures (Carlson et al. 2003);
  • A readme file.

You can download the data after filling in your name and affiliation in this form.

Why-questions and answers formulated to RST-annotated WSJ-texts

(added January 2007)

This download set (as described in Verberne et al., 2007) consists of:

  • Seven documents selected from the RST Treebank (Carlson et al., 2003), both the annotated and the unannotated versions, used for elicitation;
  • All 372 why-questions and the corresponding answers, formulated by native speakers;
  • A readme file.

All files are plain text files.

You can download the data after filling in your name and affiliation in this form.

Why-questions and answers formulated to newspaper texts

(added March 2006)

This download set (as described in Verberne et al., 2006) consists of:

  • The source documents from Reuters and Guardian news archives, used for elicitation;
  • All 395 why-questions and the corresponding answers, formulated by native speakers;
  • 211 user-formulated paraphrases and the 166 corresponding questions;
  • A readme file.

All files are plain text files.

You can download the data after filling in your name and affiliation in this form.