TOAD

This software project accompanies the research paper, TOAD: Task-Oriented Automatic Dialogs with Diverse Response Styles, accepted to the Findings of ACL 2024.

TOAD is a synthetic task-oriented dialog (TOD) dataset that simulates realistic app-context interactions and provides multiple system response styles (verbosity and mirroring of user expressions).

Run Data Synthesis

Preparation:

  • Install dependencies from requirements.txt.
  • We use an OpenAI-compatible API to make requests to LLMs. Set the environment variables OPENAI_API_KEY, BASE_URL (optional), and ENGINE (e.g. "gpt-3.5-turbo") to configure the backend LLM. You can put these in a dotenv (.env) file; see the sketch below.
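
A minimal sketch of this configuration, assuming python-dotenv and the openai client are provided by requirements.txt; the repository's actual loading code may differ:

import os
from dotenv import load_dotenv   # reads variables from a local .env file
from openai import OpenAI

# Hypothetical .env contents:
#   OPENAI_API_KEY=sk-...
#   BASE_URL=https://api.openai.com/v1   # optional, for OpenAI-compatible servers
#   ENGINE=gpt-3.5-turbo
load_dotenv()

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.getenv("BASE_URL"),      # None falls back to the default endpoint
)
engine = os.getenv("ENGINE", "gpt-3.5-turbo")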

Synthesis: The data synthesis pipeline is divided into 3 steps. The generated files will be stored in data/.

Step 1: Context generation

  1. Run python -m context_generation.occupation_generator to synthesize occupations.json (you can skip this step and re-use the existing file).
  2. Run python -m context_generation.persona_generator to synthesize personas.jsonl using occupations.
  3. Run python -m context_generation.context_generator to synthesize contexts.jsonl using personas (a sketch for inspecting the generated files follows this list).
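
To sanity-check the intermediate outputs, here is a quick sketch for previewing the generated JSONL files. The paths are assumptions based on the filenames and the data/ output directory mentioned above, and the record fields are whatever the generators produce:

import json

# Preview the generated files (paths assumed; adjust to your setup).
for path in ("data/personas.jsonl", "data/contexts.jsonl"):
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    print(f"{path}: {len(records)} records")
    if records:
        print(json.dumps(records[0], indent=2)[:500])  # first record, truncated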

Step 2: Dialog generation

  1. Run the code in dialog_generation to synthesize dialogs based on the contexts. Example command (a batch-run sketch follows the flag descriptions):
python -m dialog_generation.main \
    --phenomena='compound' \
    --output_dir='data/dialogs' \
    --number_of_data=1000 \
    --full_options_mode \
    --thread_num=15
  • --phenomena specifies the phenomena to use in dialog generation; it can be one of compound, compositional, or none.
  • --output_dir specifies the path to save the generated dialogs.
  • --number_of_data specifies the number of dialogs to generate.
  • --full_options_mode requests generation of all six response style options.
  • --thread_num specifies the number of threads to run in parallel.
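
If you want to generate dialogs for every phenomenon setting, a minimal batch sketch using the flags documented above (writing each run to its own subdirectory is an assumption for illustration, not a repository convention):

import subprocess

# Run dialog generation once per phenomenon setting.
for phenomena in ("compound", "compositional", "none"):
    subprocess.run(
        [
            "python", "-m", "dialog_generation.main",
            f"--phenomena={phenomena}",
            f"--output_dir=data/dialogs/{phenomena}",
            "--number_of_data=1000",
            "--full_options_mode",
            "--thread_num=15",
        ],
        check=True,  # stop if a run fails
    )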

To customize dialog generation by modifying schema.json, refer to the documentation in the dialog_generation directory.

Step 3: Quality control

  1. Run python -m quality_control.main to filter out inconsistent dialogs using the backend LLM.

Citation

@inproceedings{liu2024toad,
    title = "{TOAD}: Task-Oriented Automatic Dialogs with Diverse Response Styles", 
    author = "Liu, Yinhong  and
      Fang, Yimai  and
      Vandyke, David  and
      Collier, Nigel",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
    year = "2024",
    url = "https://arxiv.org/abs/2402.10137"
}