PQ-decaNLP

Paraphrases of decaNLP questions, for example:

What is the synopsis?
Give me a condensed version
Sum up the article
What would be a good summary of the article?
Sum it up

Dataset for the paper "Ask me in your own words: paraphrasing for multitask question answering".

Using with the decaNLP code

The templates folder contains the question paraphrases gathered from Amazon Mechanical Turk, one per line. These have already been split 70:30 into train and test sets.

To use the dataset with decaNLP, you first need to slightly modify the decaNLP code to dump the task data as jsonl files. First apply the patch:

git clone https://github.com/salesforce/decaNLP
cd decaNLP
git apply ../save_jsonl.patch

Then follow the decaNLP instructions to run the train.py or evaluate.py script; this is only needed to download the data the first time. The decaNLP/.data directory should then also contain .jsonl files whose records have question, context, and answer keys.
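
To confirm that the patched code dumped the data as expected, you can load one of the .jsonl files and check each record for the three keys. This is a minimal sketch: the helper name check_jsonl is hypothetical and not part of this repository, and only the key names come from the description above.

```python
import json
from pathlib import Path

# Keys the dumped records are described as containing.
EXPECTED_KEYS = {"question", "context", "answer"}


def check_jsonl(path):
    """Count the records in a .jsonl file, raising if one lacks an expected key.

    Hypothetical helper for illustration; not part of decaNLP or this repo.
    """
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = EXPECTED_KEYS - set(record)
            if missing:
                raise ValueError(
                    f"{path}:{line_no} is missing keys: {sorted(missing)}"
                )
            count += 1
    return count
```

Calling check_jsonl on a dumped file returns the number of records it contains, or raises a ValueError naming the first record that is missing a key.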

Running:

python makeCorpus.py --data decaNLP/.data --templates templates/ --output paraphrase_corpus

will expand the templates using the decaNLP data. You'll end up with the following structure:

paraphrase_corpus/
├── train/
│   ├── WOZ0
│   │    └── train.jsonl
│   ├── WOZ1
│   │    └── train.jsonl
│   ...
└── test/
    ├── WOZ0
    │    └── val.jsonl
    ├── WOZ1
    │    └── val.jsonl
    ...
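
The layout above suggests one numbered folder per task and paraphrase. The sketch below shows one way such an expansion could work. It is not the implementation in makeCorpus.py. It assumes each task has a single fixed question that is replaced wholesale by the paraphrase, and the function name expand_templates and its parameters are hypothetical.

```python
import json
from pathlib import Path


def expand_templates(records, templates, task, split_dir, filename):
    """Write one copy of `records` per template into <split_dir>/<task><i>/<filename>.

    Assumption: every record's question is replaced by the template text.
    This illustrates the folder layout only; makeCorpus.py may differ.
    """
    split_dir = Path(split_dir)
    for i, template in enumerate(templates):
        out_dir = split_dir / f"{task}{i}"
        out_dir.mkdir(parents=True, exist_ok=True)
        with (out_dir / filename).open("w", encoding="utf-8") as f:
            for record in records:
                # Copy the record so the caller's data is left untouched.
                paraphrased = {**record, "question": template}
                f.write(json.dumps(paraphrased) + "\n")
```

With two templates and task "WOZ", this would produce WOZ0 and WOZ1 folders, each holding the same instances under a different question wording.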

Each set can then be used for training or evaluation on its own. Alternatively, the sets can be merged with makeTraining.sh into a single training set per task, built by picking a random paraphrase for each training instance.
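
The merging idea, picking a random paraphrase for each training instance, can be sketched in Python as follows. This is an illustration rather than a port of makeTraining.sh. It assumes the per-paraphrase sets are index-aligned, so that record i in every set is the same instance, and the function name pick_random_paraphrases is hypothetical.

```python
import random


def pick_random_paraphrases(paraphrase_sets, seed=0):
    """Build one training set by choosing, per instance, a random paraphrase set.

    paraphrase_sets: list of lists of records, one inner list per paraphrase.
    Assumption: all inner lists are the same length and aligned by index.
    """
    if not paraphrase_sets:
        return []
    n = len(paraphrase_sets[0])
    if any(len(s) != n for s in paraphrase_sets):
        raise ValueError("all paraphrase sets must have the same length")
    rng = random.Random(seed)  # fixed seed makes the merge reproducible
    return [rng.choice(paraphrase_sets)[i] for i in range(n)]
```

The merged set has the same number of instances as each input set, so training cost is unchanged while the model sees a mix of question wordings.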

Annotated data

annotated.json contains paraphrase annotations for each example in the test set, labelled with the types of paraphrase phenomena used (see the paper for full details).

Trained T5 checkpoints

The trained T5 checkpoints used to evaluate this corpus are available here.

Citation

If you use this dataset in your work, please cite:

@article{hudson2021askme,
 title = {Ask me in your own words: paraphrasing for multitask question answering},
 author = {G. Thomas Hudson and Noura Al Moubayed},
 doi = {10.7717/peerj-cs.759},
 year = 2021,
 publisher = {{PeerJ}},
 volume = {7},
 pages = {e759},
} 

The dataset in this project (files under the templates/ dir and the annotation.json file) is licensed under CC-BY-4.0, and the underlying source code used to create and process that data is licensed under the MIT license.

Dataset Metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

property value
name PQ-decaNLP
alternateName Paraphrase Questions - decaNLP
url https://github.com/ghomasHudson/paraphraseDecanlpCorpus
description Multitask learning has led to significant advances in Natural Language Processing, including the decaNLP benchmark where question answering is used to frame 10 natural language understanding tasks in a single model. PQ-decaNLP is a crowd-sourced corpus of paraphrased questions, annotated with paraphrase phenomena. This enables analysis of how transformations such as swapping the class labels and changing the sentence modality lead to a large performance degradation.

This repository contains question templates and scripts for using this with the decaNLP code.

citation https://doi.org/10.7717/peerj-cs.759
creator
property value
name Thomas Hudson
sameAs https://orcid.org/0000-0003-3562-3593
license
property value
name CC BY 4.0
url https://creativecommons.org/licenses/by/4.0/