neuron-lightning-ray

Commits

List of commits on branch main.
  • e784fc14824ba71a92ffaa123518f242546d768d (Verified)
    Changes to make ray train work on pytorch 2+ (#3)
    PradeepKadubandi committed 4 months ago

  • f61c3995dddc608c59ce8832a7bbf99ab09dd6d0 (Unverified)
    Refactor llama 2 config to a separate folder to facilitate adding other versions later
    PradeepKadubandi committed 5 months ago

  • 14dc3e49a5f5ca2cc83f8e423a9ecab97a422a6f (Verified)
    more changes to readme (#2)
    sindhupalakodety committed 5 months ago

  • b6b477f353fd5720b276dd6579b5e3b79ad5437a (Unverified)
    Read me changes for clarity
    PradeepKadubandi committed 5 months ago

  • 162923fb77f35f694bb0857ddb0c6e9fb1fc4e91 (Verified)
    Merge pull request #1 from PradeepKadubandi/bert-example
    sindhupalakodety committed 5 months ago

  • 38cd83969e83ed03d0f50bc9b817332c0ae2d093 (Unverified)
    Committing BERT Examples
    sindhupalakodety committed 5 months ago

README

About

This repository is a self-contained sample for running PyTorch Lightning (PTL from here on) on AWS Trainium. It was developed by extracting only the relevant code from this tutorial. The original tutorial uses AWS ParallelCluster to run the workload on a multi-node cluster; this simpler sample can train the Llama 2 7B model on a single trn1.32xlarge instance with a reduced batch size, using all cores for tensor parallelism. It also contains a hacky implementation to make Ray work with PTL on Trainium.
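
The Ray-with-PTL wiring lives in the code rather than in this README, but its overall shape is roughly the following. This is a minimal sketch assuming a Ray 2.x Train API; NeuronXlaStrategy and LlamaLightningModule are hypothetical stand-ins for the repository's actual classes.

    # Minimal sketch, not the repository's actual wiring: each Ray Train worker
    # runs its own PTL Trainer, bound to a NeuronCore through torch-xla.
    import pytorch_lightning as pl
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_loop_per_worker(config):
        model = LlamaLightningModule()         # hypothetical module
        trainer = pl.Trainer(
            max_steps=config["max_steps"],
            strategy=NeuronXlaStrategy(),      # hypothetical custom PTL strategy
            enable_checkpointing=False,
        )
        trainer.fit(model)

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"max_steps": 100},
        scaling_config=ScalingConfig(num_workers=32),  # one worker per NeuronCore
    )
    trainer.fit()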

Common Prerequisites / Setup (needed for all subsequent sections)

  • Launch a single trn1.32xlarge AWS Trainium EC2 instance, using the Neuron DLAMI when creating the instance (refer to EC2-Instance-Screenshot.png if you are new to this).
  • Clone this repository (the commands below assume the repository is cloned into the home directory, /home/ubuntu; if it is cloned to a different path, adjust the commands accordingly).

Llama 2 Prerequisites:

  • Download the dataset. The command below downloads the dataset to /home/ubuntu/examples_datasets/, and subsequent commands in this README assume that path. If the script is modified or the data is placed elsewhere, adjust the commands accordingly. (A quick way to verify the download is sketched after this list.)
cd neuron-lightning-ray/data && python get_dataset.py --llama-version 2
  • Install the requirements
pip install -r requirements.txt
  • Install neuronx-distributed
python -m pip install neuronx_distributed --extra-index-url https://pip.repos.neuron.amazonaws.com
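
To verify the dataset downloaded correctly, a check like the one below should work, assuming get_dataset.py saves a Hugging Face datasets dataset to disk (an assumption; the wikicorpus_llama2_tokenized_4k path is taken from the Docker command later in this README).

    # Sanity check: load the tokenized dataset and print its summary.
    # Adjust the path if the data was downloaded elsewhere.
    from datasets import load_from_disk

    ds = load_from_disk("/home/ubuntu/examples_datasets/wikicorpus_llama2_tokenized_4k")
    print(ds)  # splits, features, and row counts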

Llama 2: Running directly on the instance

  • Do one of the following:
    • For a PTL-only run (the torchrun launch step in commands.txt), source the PyTorch 2 environment in your terminal.
      source /opt/aws_neuronx_venv_pytorch_2_1/bin/activate
      
    • To run Ray with PTL instead (launched using python in commands.txt), source the PyTorch 1 environment. Note that this relies on a custom PTL strategy implementation for now.
      source /opt/aws_neuronx_venv_pytorch_1_13/bin/activate
      
  • Run the relevant commands from commands.txt. (A quick check that the environment can see the Neuron devices is sketched below.)
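
Before kicking off a long run, it can help to confirm that the sourced environment actually sees the NeuronCores. A minimal check, assuming torch-xla is installed in the venv (it is in the Neuron venvs):

    # Sanity check: acquire a NeuronCore-backed XLA device and run a tiny op.
    import torch
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()       # returns an XLA device backed by a NeuronCore
    t = torch.ones(2, 2, device=device)
    print(device, t.sum().item())  # expect something like: xla:0 4.0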

Llama 2: Using a Docker container (PTL on Trainium only)

  • Build the Docker image, replacing <name> and <tag> with consistent values in all the commands below.
cd neuron-lightning-ray && docker build . -f Dockerfile -t <name>:<tag>
  • Run the container with the command below (using the same <name> and <tag> as above) to run PTL on the Trainium cores; it maps all 16 Neuron devices into the container. (A helper for generating the --device flags is sketched at the end of this section.)
docker run \
  --device=/dev/neuron0 --device=/dev/neuron1 --device=/dev/neuron2 --device=/dev/neuron3 \
  --device=/dev/neuron4 --device=/dev/neuron5 --device=/dev/neuron6 --device=/dev/neuron7 \
  --device=/dev/neuron8 --device=/dev/neuron9 --device=/dev/neuron10 --device=/dev/neuron11 \
  --device=/dev/neuron12 --device=/dev/neuron13 --device=/dev/neuron14 --device=/dev/neuron15 \
  -v /home/ubuntu/examples_datasets:/examples_datasets \
  -itd <name>:<tag> \
  torchrun --nproc_per_node 32 main.py \
    --model_path /neuron-lightning-ray/config.json \
    --data_dir /examples_datasets/wikicorpus_llama2_tokenized_4k \
    --tensor_parallel_size 32 --train_batch_size 1 --max_steps 100 --warmup_steps 5 \
    --lr 3e-4 --grad_accum_usteps 4 --seq_len 4096 --use_sequence_parallel 0 \
    --use_selective_checkpoint 1 --use_fp32_optimizer 0 --use_zero1_optimizer 1 \
    --scheduler_type 'linear' --use_flash_attention 0
  • To see the output of the above command, run these two steps:
docker ps # get the container id from this command
docker logs <container_id> -f # this will show the output
  • You can run neuron-top in a terminal on the host instance to see memory and utilization of the accelerator cores.
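
Typing 16 --device flags by hand is error-prone. A small helper (not part of the repository, just a convenience) can generate them from whatever /dev/neuron* device nodes are present on the host:

    # Convenience helper (not in the repo): print one docker --device flag per
    # Neuron device node found on the host, in numeric order.
    import glob

    devices = sorted(glob.glob("/dev/neuron*"),
                     key=lambda p: int(p.rsplit("neuron", 1)[1]))
    print(" ".join(f"--device={d}" for d in devices))  # paste into docker run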

BERT: Running directly on the instance

  • For the PTL-only example, run the following (a sketch of the training script's rough structure appears after this list):

    source /opt/aws_neuronx_venv_pytorch_2_1/bin/activate
    export HF_TOKEN="<REPLACE_WITH_YOUR_HUGGINGFACE_TOKEN>"
    cd neuron-lightning-ray 
    pip install -r requirements.txt
    # Install neuronx-distributed
    python -m pip install neuronx_distributed --extra-index-url https://pip.repos.neuron.amazonaws.com
    # Install accelerate
    pip install accelerate
    torchrun --nproc_per_node=32 bert/bert-ptl.py
    
  • For the Ray Train with PTL example, run:

    source /opt/aws_neuronx_venv_pytorch_1_13/bin/activate
    export HF_TOKEN="<REPLACE_WITH_YOUR_HUGGINGFACE_TOKEN>"
    cd neuron-lightning-ray
    pip install -r requirements.txt
    # Install neuronx-distributed
    python -m pip install neuronx_distributed --extra-index-url https://pip.repos.neuron.amazonaws.com
    # Install accelerate
    pip install accelerate
    python -m bert.bert-raytrain-ptl
    
  • You can run neuron-top in a terminal on the host instance to see memory and utilization of the accelerator cores.
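
For orientation, a PTL BERT fine-tuning script typically has the shape of a LightningModule wrapping a Hugging Face model, roughly like the sketch below. This is an illustrative outline under assumed names and hyperparameters, not the actual contents of bert/bert-ptl.py.

    # Illustrative sketch only -- not the repository's bert-ptl.py.
    # Typical shape of a PTL module fine-tuning a Hugging Face BERT model.
    import pytorch_lightning as pl
    import torch
    from transformers import BertForSequenceClassification

    class BertFineTuner(pl.LightningModule):
        def __init__(self, lr=2e-5):                     # assumed learning rate
            super().__init__()
            self.lr = lr
            self.model = BertForSequenceClassification.from_pretrained(
                "bert-base-uncased", num_labels=2        # assumed checkpoint/labels
            )

        def training_step(self, batch, batch_idx):
            # batch: dict with input_ids, attention_mask, labels;
            # the HF model returns the loss when labels are supplied.
            outputs = self.model(**batch)
            self.log("train_loss", outputs.loss)
            return outputs.loss

        def configure_optimizers(self):
            return torch.optim.AdamW(self.parameters(), lr=self.lr)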