
ao3_style_change




AO3 Style Change Detection

A style change detection dataset built from AO3 (Archive of Our Own) fanfiction. It is inspired by the PAN 21 Style Change Detection task, but uses much longer documents.

Note: Due to the nature of the fanfiction source, much of the text will be NSFW.

Dataset construction methodology

We pick 4 relationships from different popular fandoms on AO3:

  • Sherlock Holmes/John Watson
  • Castiel/Dean Winchester
  • Steve Rogers/Tony Stark
  • Draco Malfoy/Harry Potter (used as the test set)

For each pairing, we collect the stories that include it and are written in English. We group these by author and randomly generate documents that contain paragraphs from 1-4 authors.
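The generation step described above could look roughly like the sketch below. It is an illustration only, not the repository's code: the function name `make_document`, its parameters (`n_segments`, `segment_len`, `seed`) and the idea of fixed-length segments are all assumptions made for the example.

```python
import random


def make_document(paragraphs_by_author, max_authors=4, n_segments=4,
                  segment_len=3, seed=None):
    """Build one synthetic document from per-author paragraph pools.

    Hypothetical helper: names and parameters are illustrative
    assumptions, not the repository's API.
    """
    rng = random.Random(seed)
    # Pick between 1 and max_authors distinct authors for this document.
    n_authors = rng.randint(1, min(max_authors, len(paragraphs_by_author)))
    authors = rng.sample(sorted(paragraphs_by_author), n_authors)
    # Random structure: each segment is attributed to one of the chosen authors.
    structure = [rng.choice(authors) for _ in range(n_segments)]
    paragraphs, writers = [], []
    for author in structure:
        pool = paragraphs_by_author[author]
        for _ in range(segment_len):
            # Fill the segment with random paragraphs by that author.
            paragraphs.append(rng.choice(pool))
            writers.append(author)
    return paragraphs, writers, structure
```

Here `writers` records which author each paragraph came from, so the per-paragraph labels can be derived from it afterwards. Because segments are drawn at random, a document may end up using fewer authors than were initially picked.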

Quickstart

To use this dataset in your code, load it with the Hugging Face Datasets loader:

import datasets
ds = datasets.load_dataset("ghomasHudson/ao3_style_change")
print(ds["train"][0])
# Example output (truncated):
# {"site": "Castiel/Dean Winchester", "authors": 4, "structure": ["Author1", "Author2", ...], "multi-author": 1, "changes": [0,0,...]...}

Data Format

We use the same data format as the PAN 21 task, with two files for each problem instance x:

  • problem-x.txt containing the text
  • truth-problem-x.json containing the ground truth (labels), e.g.
{
    "site": "Sherlock Holmes/John Watson",
    "authors": 3,
    "multi-author": 1,
    "structure": ["Username1", "Username2", "Username1", "Username3"],
    "changes": [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...],
    "paragraph-authors": [1, 1, 1, 1, 1, 2, 2, 2, 2, ...]
}
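As a rough sketch of how these files might be consumed, the snippet below loads one problem instance and derives `changes` from `paragraph-authors`. The helper names are hypothetical. In the example above, the `1` in `changes` lines up with the first paragraph by the new author, which suggests one entry per paragraph with a leading `0`. The README does not state this explicitly, so the sketch also supports the other common convention of one entry per pair of consecutive paragraphs.

```python
import json
from pathlib import Path


def changes_from_paragraph_authors(paragraph_authors, leading_zero=True):
    """Flag each position where the author differs from the previous paragraph.

    leading_zero=True yields one entry per paragraph (the first is always 0);
    leading_zero=False yields one entry per consecutive pair of paragraphs.
    Which convention the dataset uses is an assumption to verify.
    """
    pairs = zip(paragraph_authors, paragraph_authors[1:])
    changes = [int(prev != cur) for prev, cur in pairs]
    return [0] + changes if leading_zero else changes


def load_problem(directory, problem_id):
    """Read problem-<id>.txt and truth-problem-<id>.json from `directory`."""
    directory = Path(directory)
    text = (directory / f"problem-{problem_id}.txt").read_text(encoding="utf-8")
    truth_path = directory / f"truth-problem-{problem_id}.json"
    with open(truth_path, encoding="utf-8") as f:
        truth = json.load(f)
    return text, truth
```

Comparing the derived list with the stored `changes` field on a few instances is a quick way to confirm which convention the files follow.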

Gathering new data

Two Python scripts that were used to scrape the data are provided:

  • main.py iterates through the list of character pairings, downloading fics in the following structure:
fanfics/
├── pairing1
│   ├── Username1
│   │    ├── 3b6ff2cadcaedf11d5eaaefd1e998d49c493c45f.json
│   │    ├── 3b6ff2cadcaedf11d5eaaefd1e998d49c493c45f.txt
│   │    ├── ab35ee7ceb06ee97c94cd042d8874f1eab99bd1a.json
│   │    ├── ab35ee7ceb06ee97c94cd042d8874f1eab99bd1a.txt
│   │    └── ...
│   ├── Username2
│   │    └── ...
│   ...
└── pairing2
    ├── Username3
    │    └── ...
    ├── Username4
    │    └── ...
    ...
  • to_style_change.py turns this into a style change task by randomly creating an author structure and filling it with randomly chosen paragraphs.
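To illustrate how the `fanfics/` layout above could be read back in and grouped by author, here is a minimal sketch. It is not taken from `to_style_change.py`: the function name `collect_paragraphs` is hypothetical, and it assumes each `.txt` file holds one paragraph per line, which the README does not confirm.

```python
from collections import defaultdict
from pathlib import Path


def collect_paragraphs(root):
    """Return {pairing: {author: [paragraph, ...]}} from <root>/<pairing>/<author>/*.txt.

    Assumption: paragraphs are separated by newlines; blank lines are skipped.
    """
    result = defaultdict(lambda: defaultdict(list))
    for txt in sorted(Path(root).glob("*/*/*.txt")):
        author = txt.parent.name          # e.g. "Username1"
        pairing = txt.parent.parent.name  # e.g. "pairing1"
        for line in txt.read_text(encoding="utf-8").split("\n"):
            paragraph = line.strip()
            if paragraph:
                result[pairing][author].append(paragraph)
    return result
```

The per-author lists returned for one pairing are the kind of paragraph pools from which multi-author documents can then be sampled.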

Baseline model (WIP)

run_baseline.sh will train a simple baseline model based on chunking the data.