gpt-2-output-dataset

Public repository: 1953 stars, 549 forks, 31 issues

Commits

List of commits on branch master:

  • b76f67c651fea691b37abe01a6de762bc43330e7 (verified): Update README.md, committed by WWuTheFWasThat a year ago
  • 2c102400c7e4e698acd3f0e51d3b6cf1c637c0fe (unverified): update the download URLs to azure CDN, committed by jjongwook 4 years ago
  • d6f4e2956bceed24d37e3f157e3cc61281898aa2 (unverified): move to azure, committed by WWuTheFWasThat 4 years ago
  • ddfecb39328f0a9857cd09b40e55819d1f9ad512 (verified): add LICENSE, committed by jjongwook 5 years ago
  • 6d90da539b6fac84eec403e16d5299c195ea8926 (unverified): using sys.executable for subprocess calls (fixes #8), committed by jjongwook 5 years ago
  • 12459ab3ed239895558beb7063ec95ffc46cd796 (verified): Update the blog and report links, committed by jjongwook 5 years ago

README

The README file for this repository.

gpt-2-output-dataset

This dataset contains:

  • 250K documents from the WebText test set
  • For each GPT-2 model (trained on the WebText training set), 250K random samples (temperature 1, no truncation) and 250K samples generated with Top-K 40 truncation

We look forward to the research produced using this data!

Download

For each model, we have a training split of 250K generated examples, as well as validation and test splits of 5K examples.

The data was originally hosted in Google Cloud Storage under the directory gs://gpt-2/output-dataset/v1, but has since been migrated to Azure: https://openaipublic.blob.core.windows.net/gpt-2/output-dataset/v1/

There, you will find files:

  • webtext.${split}.jsonl
  • small-117M.${split}.jsonl
  • small-117M-k40.${split}.jsonl
  • medium-345M.${split}.jsonl
  • medium-345M-k40.${split}.jsonl
  • large-762M.${split}.jsonl
  • large-762M-k40.${split}.jsonl
  • xl-1542M.${split}.jsonl
  • xl-1542M-k40.${split}.jsonl

where ${split} is one of train, valid, or test.

We've provided a script to download all of them, in download_dataset.py.
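
For illustration, here is a minimal sketch of fetching a single split directly from the Azure CDN and reading it as JSON Lines. It assumes each line is a JSON object with a "text" field, and it is not a replacement for download_dataset.py.

    # Minimal sketch: download one split from the Azure CDN and parse it as JSON Lines.
    # Assumes each line is a JSON object containing a "text" field.
    import json
    import urllib.request

    BASE = "https://openaipublic.blob.core.windows.net/gpt-2/output-dataset/v1"
    filename = "webtext.test.jsonl"  # any ${model}.${split}.jsonl from the list above

    with urllib.request.urlopen(f"{BASE}/{filename}") as resp:
        records = [json.loads(line) for line in resp.read().decode("utf-8").splitlines()]

    print(len(records), "documents")
    print(records[0]["text"][:200])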

Finetuned model samples

Additionally, we encourage research on detection of finetuned models. We have released data under gs://gpt-2/output-dataset/v1-amazonfinetune/ with samples from a GPT-2 full model finetuned to output Amazon reviews.
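
If these samples are mirrored under the same Azure container as the main dataset (an assumption; only the gs:// path is given above), the available files could be enumerated with the standard Azure Blob Storage "List Blobs" call, as in this hypothetical sketch:

    # Hypothetical sketch: list the finetuned-sample files via the Azure Blob
    # "List Blobs" REST API. Assumes the data is mirrored under the same Azure
    # container as the main dataset and that anonymous listing is enabled;
    # only the gs:// path is stated above, so this prefix is an assumption.
    import urllib.request
    import xml.etree.ElementTree as ET

    LIST_URL = (
        "https://openaipublic.blob.core.windows.net/gpt-2"
        "?restype=container&comp=list&prefix=output-dataset/v1-amazonfinetune/"
    )

    with urllib.request.urlopen(LIST_URL) as resp:
        tree = ET.fromstring(resp.read())

    for name in tree.iter("Name"):
        print(name.text)  # one blob path per finetuned-sample file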

Detectability baselines

We're interested in seeing research on the detectability of GPT-2 model family generations.

We provide some initial analysis of two baselines, as well as code for the better baseline.

Overall, we are able to achieve accuracies in the mid-90s for Top-K 40 generations, and mid-70s to high-80s (depending on model size) for random generations. We also find some evidence that adversaries can evade detection via finetuning from released models.
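
As a point of reference for what a simple detector looks like, the sketch below trains a logistic regression over TF-IDF features to separate WebText documents from Top-K 40 samples of the largest model. It is an illustration only, assuming the file layout and "text" field described above, and is not the repository's baseline code.

    # Sketch of a simple detection baseline: TF-IDF features + logistic regression,
    # trained to separate WebText documents from GPT-2 xl-1542M Top-K 40 samples.
    # File names and the "text" field follow the dataset layout described above;
    # this is an illustrative sketch, not the repository's own baseline.
    import json
    import urllib.request

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    BASE = "https://openaipublic.blob.core.windows.net/gpt-2/output-dataset/v1"

    def load_texts(name, limit=5000):
        """Download a .jsonl file and return the 'text' field of each record."""
        with urllib.request.urlopen(f"{BASE}/{name}.jsonl") as resp:
            lines = resp.read().decode("utf-8").splitlines()
        return [json.loads(line)["text"] for line in lines[:limit]]

    # Human-written vs. generated text; small valid/test splits keep this quick.
    real_train = load_texts("webtext.valid")
    fake_train = load_texts("xl-1542M-k40.valid")
    real_test = load_texts("webtext.test")
    fake_test = load_texts("xl-1542M-k40.test")

    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=2**16)
    X_train = vectorizer.fit_transform(real_train + fake_train)
    y_train = [0] * len(real_train) + [1] * len(fake_train)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)

    X_test = vectorizer.transform(real_test + fake_test)
    y_test = [0] * len(real_test) + [1] * len(fake_test)
    print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))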

Data removal requests

If you believe your work is included in WebText and would like us to remove it, please let us know at webtextdata@openai.com.