character-type-identification

Commits

List of commits on branch master.
bae49af87c4150fba52c863f133ba84b473303ea

Made datasets version use gold for test

ghomasHudson committed 3 years ago

35110f446f3896151a8fd8f2857070df418d4569

Add gold labels

ghomasHudson committed 3 years ago

4c863ffe6f93c9bc4e9ea09663a389367fb44647

Added unit quality score

ghomasHudson committed 4 years ago

cc36fa0d1c24eac6431de8dc320cca572b872777

Fixed sets

ghomasHudson committed 4 years ago

f4c618c7603224bccd879a2fcade6f0a74ab4b36

Added word counts

ghomasHudson committed 4 years ago

ce562f6dc5135c80938845a9c9c627aa1da3b6a5

Bigger (>900) dataset

ghomasHudson committed 4 years ago

README

The README file for this repository.

Character-type Identification

This repository contains the character type identification dataset.

For more details, see the paper TBD.

Files

  • documents.csv - contains document metadata in the format document_id, set, script_url, script_file_size, script_word_count, script_start, script_end, wiki_url, wiki_title.
  • summaries.csv - contains Wikipedia summaries in the format document_id, set, summary.
  • character_labels.csv - contains the character type annotations in the format document_id, set, character_name, character_type.
  • download_scripts.py - downloads the full scripts.
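
As a rough illustration of how the annotation file could be read, the sketch below groups character annotations by document_id using only the Python standard library. It assumes that character_labels.csv starts with a header row naming the columns listed above; that is an assumption, not something this README confirms. The helper name load_character_labels and the sample rows are hypothetical and are not taken from the real dataset.

```python
import csv
import io
from collections import defaultdict


def load_character_labels(csv_file):
    """Group (character_name, character_type) pairs by document_id.

    csv_file is an open text file. Assumption: its first row is a header
    with the columns document_id, set, character_name, character_type.
    """
    labels = defaultdict(list)
    for row in csv.DictReader(csv_file):
        labels[row["document_id"]].append(
            (row["character_name"], row["character_type"])
        )
    return dict(labels)


# Hypothetical rows for illustration only; not real dataset content.
sample = io.StringIO(
    "document_id,set,character_name,character_type\n"
    "00001,train,Alice,Hero\n"
    "00001,train,Bob,Villain\n"
)
example_labels = load_character_labels(sample)
```

With the real file, the same helper could be called as `load_character_labels(open("character_labels.csv", newline="", encoding="utf-8"))`, provided the header assumption holds.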

Using the Dataset

Due to licensing issues, the full scripts aren't included in this repository. They can be downloaded to /path/to/repo/tmp by running:

pip install -r requirements.txt
python download_scripts.py

Alternatively, the dataset can be conveniently loaded using huggingface/datasets:

import datasets
ds = datasets.load_dataset("character_type_id")
print(ds["train"][0])
# Example output (truncated):
# {"document_id": "00001", "summary":{"title": "Name of Movie (film)", "text": "The movie begins..."},...

Citation

@article{characterTypeID,
author = {TBD},
title = {TBD},
journal = {TBD},
url = {https://TBD},
year = {2021},
pages = {TBD},
}