A style change detection dataset built from AO3 (Archive of Our Own) fanfiction. Inspired by the PAN 21 Style Change Detection task, but with much longer documents.
Note: Due to the nature of the fanfiction source, much of the text will be NSFW.
We pick 4 relationships from different popular fandoms on AO3:
- Sherlock Holmes/John Watson
- Castiel/Dean Winchester
- Steve Rogers/Tony Stark
- Draco Malfoy/Harry Potter (used as the test set)
For each pairing, we collect English-language stories that include it. We collate these by author and randomly generate documents containing paragraphs from 1-4 authors.
To quickly use this dataset in your code, use the Hugging Face Datasets loader:
import datasets
ds = datasets.load_dataset("ghomasHudson/ao3_style_change")
print(ds["train"][0])
>> {"site": "Castiel/Dean Winchester", "authors": 4, "structure": ["Author1", "Author2", ...], "multi-author": 1, "changes": [0,0,...]...}
We use the same data format as the PAN 21 task, with 2 files for each problem instance x:
- problem-x.txt containing the text
- truth-problem-x.json containing the ground truth (labels), e.g.
{
"site": "Sherlock Holmes/John Watson",
"authors": 3,
"multi-author": 1,
"structure": ["Username1", "Username2", "Username1", "Username3"],
"changes": [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...],
"paragraph-authors": [1, 1, 1, 1, 1, 2, 2, 2, 2, ...]
}
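The label fields in a ground-truth file can be cross-checked against each other. The following is a minimal sketch, not part of the dataset's tooling: the helper names `author_switch_positions` and `check_truth` are hypothetical. It assumes that `authors` counts the distinct values in `paragraph-authors` and that the multi-author flag is 1 exactly when more than one author is present. It looks the flag up under both spellings, and it deliberately makes no assumption about how indices in `changes` line up with paragraphs.

```python
import json


def author_switch_positions(paragraph_authors):
    """Return indices of paragraphs whose author differs from the previous paragraph's."""
    return [
        i
        for i in range(1, len(paragraph_authors))
        if paragraph_authors[i] != paragraph_authors[i - 1]
    ]


def check_truth(truth):
    """Cross-check the label fields of one truth dict; return a list of problems found."""
    problems = []
    para_authors = truth["paragraph-authors"]
    n_distinct = len(set(para_authors))
    # Assumption: "authors" is the number of distinct paragraph authors.
    if truth["authors"] != n_distinct:
        problems.append("authors does not match paragraph-authors")
    # Assumption: the flag may be stored under either spelling.
    flag = truth.get("multi-author", truth.get("multiauthor"))
    if flag != int(n_distinct > 1):
        problems.append("multi-author flag does not match author count")
    return problems


# Small hand-made record for illustration (not real data).
example = json.loads(
    '{"authors": 2, "multi-author": 1, "paragraph-authors": [1, 1, 1, 2, 2, 1]}'
)
print(author_switch_positions(example["paragraph-authors"]))  # [3, 5]
print(check_truth(example))  # []
```

To check a real instance, load `truth-problem-x.json` with `json.load` and pass the resulting dict to `check_truth`.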
2 Python files are provided which were used when scraping the data:
- main.py iterates through the list of character pairings, downloading fics in the following structure:
fanfics/
├── pairing1
│ ├── Username1
│ │ ├── 3b6ff2cadcaedf11d5eaaefd1e998d49c493c45f.json
│ │ ├── 3b6ff2cadcaedf11d5eaaefd1e998d49c493c45f.txt
│ │ ├── ab35ee7ceb06ee97c94cd042d8874f1eab99bd1a.json
│ │ ├── ab35ee7ceb06ee97c94cd042d8874f1eab99bd1a.txt
│ │ └── ...
│ ├── Username2
│ │ └── ...
│ ...
└── pairing2
│ ├── Username3
│ │ └── ...
│ ├── Username4
│ │ └── ...
...
- to_style_change.py turns this into a style change task by randomly creating a structure and filling it with random paragraphs.
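The idea of "randomly creating a structure and filling it with random paragraphs" can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code in to_style_change.py: the function `make_document`, its parameters (`max_extra_segments`, `max_segment_len`), and the in-memory input (a dict mapping each author to a list of paragraphs) are all hypothetical. It produces only some of the label fields; `site` and `changes` are left out.

```python
import random


def make_document(paragraphs_by_author, max_authors=4, max_extra_segments=3,
                  max_segment_len=3, rng=None):
    """Build one synthetic document from 1..max_authors authors.

    paragraphs_by_author: dict mapping author name -> non-empty list of paragraphs.
    Returns (paragraphs, truth).
    """
    rng = rng or random.Random()
    n_authors = rng.randint(1, min(max_authors, len(paragraphs_by_author)))
    chosen = rng.sample(sorted(paragraphs_by_author), n_authors)

    # Random structure: every chosen author gets one segment, then a few
    # extra segments are appended, each by a different author than the last.
    structure = list(chosen)
    if n_authors > 1:
        for _ in range(rng.randint(0, max_extra_segments)):
            structure.append(rng.choice([a for a in chosen if a != structure[-1]]))

    # Fill each segment with random paragraphs from its author.
    paragraphs, paragraph_authors, ids = [], [], {}
    for author in structure:
        author_id = ids.setdefault(author, len(ids) + 1)  # 1-based, by first appearance
        pool = paragraphs_by_author[author]
        for paragraph in rng.sample(pool, rng.randint(1, min(max_segment_len, len(pool)))):
            paragraphs.append(paragraph)
            paragraph_authors.append(author_id)

    truth = {
        "authors": len(ids),
        "multi-author": int(len(ids) > 1),
        "structure": structure,
        "paragraph-authors": paragraph_authors,
    }
    return paragraphs, truth


# Toy input for illustration (not real data).
toy = {"Username1": ["a1", "a2", "a3"], "Username2": ["b1", "b2"], "Username3": ["c1"]}
paragraphs, truth = make_document(toy, rng=random.Random(0))
print(len(paragraphs), truth["authors"], truth["structure"])
```

Passing a seeded `random.Random` makes a generated document reproducible, which is convenient when regenerating splits.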
run_baseline.sh will train a simple baseline model based on chunking the data.
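How the baseline chunks the data is not described here. As a rough illustration of what chunking a long document can look like, the sketch below splits a list of paragraphs and their author labels into consecutive windows. The function `chunk_paragraphs` and its `chunk_size` and `overlap` parameters are hypothetical and are not taken from run_baseline.sh.

```python
def chunk_paragraphs(paragraphs, paragraph_authors, chunk_size=8, overlap=1):
    """Split a document into windows of at most chunk_size paragraphs.

    Consecutive windows share `overlap` paragraphs, so an author switch that
    falls on a window boundary is still visible inside the next window.
    """
    if chunk_size < 1 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size >= 1 and 0 <= overlap < chunk_size")
    if len(paragraphs) != len(paragraph_authors):
        raise ValueError("paragraphs and paragraph_authors must have equal length")

    chunks = []
    step = chunk_size - overlap
    start = 0
    while start < len(paragraphs):
        end = start + chunk_size
        chunks.append({
            "start": start,  # index of the window's first paragraph in the document
            "paragraphs": paragraphs[start:end],
            "paragraph-authors": paragraph_authors[start:end],
        })
        if end >= len(paragraphs):
            break  # the last window already reaches the end of the document
        start += step
    return chunks


# Toy document for illustration (not real data).
doc = ["p0", "p1", "p2", "p3", "p4"]
labels = [1, 1, 2, 2, 1]
for chunk in chunk_paragraphs(doc, labels, chunk_size=3, overlap=1):
    print(chunk["start"], chunk["paragraphs"], chunk["paragraph-authors"])
```

Each window can then be scored independently and the per-window predictions mapped back to document positions through `start`.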