

Overview

Head-to-Tail is a benchmark consisting of question-answer (QA) pairs about head, torso, and tail facts, categorized by popularity. For more details, please refer to this paper (NAACL 2024).

Data

We provide scripts to generate QA pairs using third-party domain-specific Knowledge Graphs (KGs), such as IMDb. Due to licensing restrictions, we do not directly distribute the data generated by these scripts.

Users need to download the following files manually.

  • IMDb: title.basics.tsv.gz, title.ratings.tsv.gz, title.crew.tsv.gz, title.principals.tsv.gz, name.basics.tsv.gz
  • Goodreads: goodreads_books.json.gz, goodreads_book_authors.json.gz
  • MAG: ConferenceInstances.txt.zst, Journals.txt.zst, Papers.txt.zst, PaperAuthorAffiliations.txt.zst, Authors.txt.zst
  • DBLP: dblp.xml (extracted from dblp.xml.gz)
  • DBpedia: mappingbased-objects_lang=en.ttl.bz2
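As a sanity check before running a script, one could verify that all required dump files are present in the KG directory. This helper is not part of the repository; it is a minimal sketch using the IMDb file names from the list above:

```python
import os

# Expected IMDb dump files (from the list above); adjust for other KGs.
IMDB_FILES = [
    "title.basics.tsv.gz",
    "title.ratings.tsv.gz",
    "title.crew.tsv.gz",
    "title.principals.tsv.gz",
    "name.basics.tsv.gz",
]

def missing_kg_files(kg_dir, expected=IMDB_FILES):
    """Return the expected dump files that are absent from kg_dir."""
    return [f for f in expected if not os.path.exists(os.path.join(kg_dir, f))]
```

If the returned list is non-empty, the corresponding files still need to be downloaded into the directory passed via `--kg-dir`.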

Usage: python head_to_tail_{imdb,goodreads,mag,dblp,dbpedia}.py --kg-dir DIRECTORY_CONTAINING_KG_FILES

The scripts will generate head_to_tail_{imdb,goodreads,mag,dblp,dbpedia}.json whose format is as follows.

{
 "head": [
  [
   relevant entity or entity id,
   template id,
   question,
   answer
  ],
  ...
 ],
 "torso": [
  ...
 ],
 "tail": [
  ...
 ]
}
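For illustration, the generated file can be loaded with the standard `json` module. This snippet is not part of the repository; it only assumes the bucket names and file naming shown above:

```python
import json

def qa_counts(path):
    """Count QA pairs per popularity bucket in a generated head_to_tail_*.json file."""
    with open(path) as f:
        data = json.load(f)
    return {bucket: len(data.get(bucket, [])) for bucket in ("head", "torso", "tail")}
```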

Please note that the QA pairs generated by these scripts differ from those used in our paper for several reasons: 1) the KGs have been updated over time, 2) the sampling process is random, and 3) the released scripts omit the final manual checking (and, where applicable, cleaning) that ensured high data quality. In addition, for the open domain, we do not release the original question templates, which were developed with the assistance of ChatGPT, due to licensing restrictions; the script released here serves the same purpose but uses templates created with the assistance of Llama. Despite these differences, the overall distribution and statistical features of the QA pairs should closely resemble those discussed in our paper. Users should keep these factors in mind when using the scripts in their work.

Citations

@inproceedings{sun2024headtotail,
      title={Head-to-Tail: How Knowledgeable are Large Language Models (LLMs)? A.K.A. Will LLMs Replace Knowledge Graphs?},
      author={Kai Sun and Yifan Ethan Xu and Hanwen Zha and Yue Liu and Xin Luna Dong},
      booktitle={Proceedings of the NAACL-HLT},
      year={2024},
      url={https://arxiv.org/abs/2308.10168}
}

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). This license permits sharing and adapting the work, provided the work is not used for commercial purposes and appropriate credit is given. For a quick overview, see the Creative Commons license summary.