summarize_from_feedback_details

This is the follow-up work to https://huggingface.co/blog/the_n_implementation_details_of_rlhf_with_ppo

Prerequisites:

  • A Slurm cluster with 8xH100 nodes (we are thinking of adding LoRA support)

Get started

Install the dependencies

# with poetry (recommended)
poetry install
# or with pip
pip install -r requirements.txt

Run inference

python visualize_tokens.py

[asciicast demo of the inference output]

To run a hello-world example, run the hello_world.sh script. For the full scaling-behavior experiments, run

mkdir -p slurm/logs
sft_job_id=$(sbatch --parsable sbatches/sft.sbatch)
rm_job_id=$(sbatch --parsable --dependency=afterany:$sft_job_id sbatches/reward.sbatch)
ppo_job_id=$(sbatch --parsable --dependency=afterany:$rm_job_id sbatches/ppo_left_padding.sbatch)

The commands above run end-to-end RLHF experiments with 4 random seeds. We then run the following scripts to fetch the experiments and generate plots:

cd eval
python sft_rm_scale.py
python rlhf_scaling_plot.py
Generated plots: ROUGE score (sft.py), reward model (reward.py), and RLHF policy (ppo_left_padding.py).
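
For reference, ROUGE scores like the ones plotted for the SFT models can be computed with the Hugging Face evaluate library. This is only a minimal illustrative sketch, not necessarily how sft.py computes them:

# Minimal ROUGE sketch using the `evaluate` library; illustrative only,
# not necessarily how sft.py computes its scores.
import evaluate

rouge = evaluate.load("rouge")
predictions = ["the cat sat on the mat"]        # model-generated summaries
references = ["a cat was sitting on the mat"]   # human reference summaries
print(rouge.compute(predictions=predictions, references=references))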

Dataset Information

We use our pre-built TL;DR datasets.

You can optionally build them yourself with:

poetry run python summarize_from_feedback_details/tldr_dataset.py \
    --base_model=EleutherAI/pythia-1b-deduped \
    --tldr_params.max_sft_response_length=53 \
    --tldr_params.max_sft_query_response_length=562 \
    --tldr_params.max_rm_response_length=169 \
    --tldr_params.max_rm_query_response_length=638 \
    --cnndm_params.max_rm_response_length=155 \
    --cnndm_params.max_rm_query_response_length=2021 \
    --tldr_params.padding="pad_token" \
    --cnndm_params.padding="pad_token"
    # --push_to_hub # you can optionally push to hub
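
Once built (or pushed to the Hub), the datasets can be loaded with the datasets library. The Hub path below is a hypothetical placeholder, not the actual dataset name; substitute the repository of the pre-built datasets:

# Sketch of loading a pre-built TL;DR dataset; the Hub path is a
# hypothetical placeholder, not the real dataset name.
from datasets import load_dataset

tldr = load_dataset("your-username/tldr-sft-dataset")  # hypothetical path
print(tldr)
print(tldr["train"][0].keys())  # assuming a "train" split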

Note that these datasets use the same OpenAI preprocessing as the original paper (summarize-from-feedback/tasks.py#L98-L165); it does things like:

  • make sure the query is at most 512 tokens (pad if shorter, and "smartly" truncate if longer, e.g., truncate before the last \n instead of doing a hard cut; see the sketch below)
  • make sure the number of response tokens is limited
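
For illustration, the "smart" truncation idea looks roughly like the following sketch. It is not the repository's actual code (that lives in the linked tasks.py), and the helper name is hypothetical:

# Rough sketch of "smart" query truncation: queries longer than the limit are
# cut at the last newline token inside the window rather than mid-sentence.
# Hypothetical helper; the authoritative logic is in the linked tasks.py.
def truncate_query(query_tokens, newline_token_id, max_length=512):
    if len(query_tokens) <= max_length:
        return query_tokens  # short queries are padded up to max_length elsewhere
    truncated = query_tokens[:max_length]
    if newline_token_id in truncated:
        # index of the last newline inside the truncated window
        last_newline = len(truncated) - 1 - truncated[::-1].index(newline_token_id)
        truncated = truncated[:last_newline]
    return truncated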

Citation

@inproceedings{huang2024the,
  title={The N+ Implementation Details of {RLHF} with {PPO}: A Case Study on {TL};{DR} Summarization},
  author={Shengyi Huang and Michael Noukhovitch and Arian Hosseini and Kashif Rasul and Weixun Wang and Lewis Tunstall},
  booktitle={First Conference on Language Modeling},
  year={2024},
  url={https://openreview.net/forum?id=kHO2ZTa8e3}
}