Implied NLI

Dataset Overview

Implied NLI, or INLI is a corpus of 10k premises that mirror real-world communication and 40k hypotheses that are implied, explicit, neutral, and contradictory.

Number of examples: 40,000
Number of labels: 4 (implied entailment entailment, explicit, neutral, contradictory)

On top of the raw data, we also include all prompts used to generate INLI, all human annotations described in the paper, and the outputs of the LLM experiments described in the paper.

For more details on the design and content of the dataset, please see our paper

Data Format

Our raw dataset is split into three csv files.

Size of training dataset: 32,000
Size of test dataset: 4,000
Size of validation dataset: 4,000

In each of these files, one row represents a premise and four corresponding hypotheses. These files include the following columns:

dataset: The original dataset containing the conversation or social norm used to create the premise
premise: A premise
implied_entailment: A hypothesis implicitly entailed in the premise
explicit_entailment: A hypothesis explicitly entailed in the premise
neutral: A hypothesis neither entailed nor contradicted by the premise
contradiction: A hypothesis contradicted by the premise

Citation

Contact

Shreya Havaldar

Disclaimer

INLI may contain premises and hypotheses involving sensitive content, e.g. descriptions of violence, harassment, profanity, etc. All such content originates from the datasets INLI is built off of (Ludwig, Circa, NormBank, SocialChem) and was not introduced by Gemini during the generation of INLI. All such content is purely hypothetical and not based on real people or events. Anyone using this dataset should be aware of these limitations of the dataset.

inli

Commits

renamed resources folder

changed License

added License and updated README

adding license + disclaimer

uploading INLI data

README