SimCLR and MoCo v3 ViT implementation

This repo implements the SimCLR and MoCo v3 algorithms on Vision Transformers (ViT) for both GPUs and TPUs, with hyperparameters following An Empirical Study of Training Self-Supervised Vision Transformers.

Installation

Install PyTorch (and its dependencies). Install PyTorch/XLA if running on TPUs.

Finally, install timm for vision transformers: pip3 install timm.
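A minimal setup sketch (assuming torchvision is used for the ImageNet data pipeline; adjust package versions and the PyTorch/XLA wheel to your environment, following the official PyTorch and PyTorch/XLA installation instructions):

# install PyTorch plus timm for the ViT backbones (torchvision assumed for data loading)
pip3 install torch torchvision timm
# on TPU VMs, additionally install the torch_xla wheel matching your torch version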

Download ImageNet-1k to a shared directory that can be accessed from all nodes (e.g. /checkpoint/ronghanghu/megavlt_paths/imagenet-1k); it should have the following structure.

/checkpoint/ronghanghu/megavlt_paths/imagenet-1k
|_ train
|  |_ <n0......>
|  |  |_<im-1-name>.JPEG
|  |  |_...
|  |  |_<im-N-name>.JPEG
|  |_ ...
|  |_ <n1......>
|  |  |_<im-1-name>.JPEG
|  |  |_...
|  |  |_<im-M-name>.JPEG
|  |  |_...
|  |  |_...
|_ val
|  |_ <n0......>
|  |  |_<im-1-name>.JPEG
|  |  |_...
|  |  |_<im-N-name>.JPEG
|  |_ ...
|  |_ <n1......>
|  |  |_<im-1-name>.JPEG
|  |  |_...
|  |  |_<im-M-name>.JPEG
|  |  |_...
|  |  |_...
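
A quick sanity check on the layout (assuming the standard ImageNet-1k synset folders): each split should contain 1000 class directories.

# count the class folders in each split; both should print 1000
ls /checkpoint/ronghanghu/megavlt_paths/imagenet-1k/train | wc -l
ls /checkpoint/ronghanghu/megavlt_paths/imagenet-1k/val | wc -l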

Running MoCo v3 ViT training on ImageNet-1k

Launch the training on GPUs or TPUs as follows. It trains MoCo v3 for 300 epochs by default. (Append num_epochs=100 to the commands below to train only for 100 epochs.)

Make sure SAVE_DIR is a shared directory that can be accessed from all nodes. For TPUs, one can use an NFS directory on GCP.

On GPUs (e.g. using 64 V100 GPUs):

SAVE_DIR="/private/home/ronghanghu/workspace/simclr_vit_release/save_mocov3_gpu64"

srun \
  --mem=300g --nodes=8 --gres=gpu:8 --partition=learnlab,learnfair \
  --time=4300 --constraint=volta32gb --cpus-per-task=40 \
python3 run_mocov3_vit.py \
  world_size=64 \
  ckpt_dir=$SAVE_DIR \
  data_dir=/checkpoint/ronghanghu/megavlt_paths/imagenet-1k \
  batch_size=4096 lr=2.4e-3 weight_decay=0.1  # lr is already scaled by batch size

(append use_pytorch_amp=True to the command above to use automatic mixed precision)
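
The learning rates in these commands look like they follow the linear scaling rule lr = base_lr * batch_size / 256 (an assumption; the base values are not stated here). With batch_size=4096 the factor is 16, so 2.4e-3 corresponds to a base learning rate of 1.5e-4 for MoCo v3, and the SimCLR value of 3.2e-3 below to 2.0e-4. If you change batch_size, rescale lr accordingly.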

On TPUs (e.g. using a v3-256 TPU pod):

SAVE_DIR="/checkpoint/ronghanghu/workspace/simclr_vit_release/save_mocov3_tpu_v3-256"

TPU_NAME=megavlt-256  # change to your TPU name
# use absolute paths with torch_xla.distributed.xla_dist
sudo mkdir -p $SAVE_DIR && sudo chmod -R 777 $SAVE_DIR  # workaround for permission issue
python3 -m torch_xla.distributed.xla_dist \
  --tpu=${TPU_NAME} --restart-tpuvm-pod \
  --env LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 \
  -- \
python3 $(realpath run_mocov3_vit.py) \
  device=xla \
  ckpt_dir=$SAVE_DIR \
  data_dir=/checkpoint/ronghanghu/megavlt_paths/imagenet-1k \
  batch_size=4096 lr=2.4e-3 weight_decay=0.1  # lr is already scaled by batch size

Running SimCLR ViT training on ImageNet-1k

Launch the training on GPUs or TPUs as follows. It trains SimCLR for 300 epochs by default. (Append num_epochs=100 to the commands below to train only for 100 epochs.)

Make sure SAVE_DIR is a shared directory that can be accessed from all nodes. For TPUs, one can use an NFS directory on GCP.

On GPUs (e.g. using 64 V100 GPUs):

SAVE_DIR="/private/home/ronghanghu/workspace/simclr_vit_release/save_simclr_gpu64"

srun \
  --mem=300g --nodes=8 --gres=gpu:8 --partition=learnlab,learnfair \
  --time=4300 --constraint=volta32gb --cpus-per-task=40 \
python3 run_simclr_vit.py \
  world_size=64 \
  ckpt_dir=$SAVE_DIR \
  data_dir=/checkpoint/ronghanghu/megavlt_paths/imagenet-1k \
  batch_size=4096 lr=3.2e-3 weight_decay=0.1  # lr is already scaled by batch size

(append use_pytorch_amp=True to the command above to use automatic mixed precision)

On TPUs (e.g. using a v3-256 TPU pod):

SAVE_DIR="/checkpoint/ronghanghu/workspace/simclr_vit_release/save_simclr_tpu_v3-256"

TPU_NAME=megavlt-256  # change to your TPU name
# use absolute paths with torch_xla.distributed.xla_dist
sudo mkdir -p $SAVE_DIR && sudo chmod -R 777 $SAVE_DIR  # workaround for permission issue
python3 -m torch_xla.distributed.xla_dist \
  --tpu=${TPU_NAME} --restart-tpuvm-pod \
  --env LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 \
  -- \
python3 $(realpath run_simclr_vit.py) \
  device=xla \
  ckpt_dir=$SAVE_DIR \
  data_dir=/checkpoint/ronghanghu/megavlt_paths/imagenet-1k \
  batch_size=4096 lr=3.2e-3 weight_decay=0.1  # lr is already scaled by batch size

Running linear evaluation on the trained MoCo v3 or SimCLR models

Suppose the final checkpoint from the previous step is PRETRAINED_MODEL (e.g. /private/home/ronghanghu/workspace/simclr_vit_release/save_simclr_gpu64/vit_b16_epoch_300.ckpt or any checkpoint trained by SimCLR or MoCo v3 above). Let's evaluate it as follows.

  • For MoCo v3 (ImageNet-1k 300 epochs by default), expected linear evaluation accuracy is around 0.765 for both GPUs and TPUs.
  • For SimCLR (ImageNet-1k 300 epochs by default), expected linear evaluation accuracy is around 0.739 for both GPUs and TPUs.
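
Before launching the evaluation, it can help to confirm that the checkpoint file loads (with PRETRAINED_MODEL set as in the commands below). The checkpoint's internal layout is not documented here, so this sketch only prints its top-level keys:

python3 -c "import torch; print(list(torch.load('$PRETRAINED_MODEL', map_location='cpu').keys()))"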

Make sure SAVE_DIR is a shared directory that can be accessed from all nodes. For TPUs, one can use an NFS directory on GCP.

On GPUs (e.g. using 64 V100 GPUs):

PRETRAINED_MODEL=/private/home/ronghanghu/workspace/simclr_vit_release/save_simclr_gpu64/vit_b16_epoch_300.ckpt
# SAVE_DIR can be the same or a different directory from SSL training
SAVE_DIR="/private/home/ronghanghu/workspace/simclr_vit_release/save_simclr_gpu64"

srun \
  --mem=300g --nodes=8 --gres=gpu:8 --partition=learnlab,learnfair \
  --time=4300 --constraint=volta32gb --cpus-per-task=40 \
python3 $(realpath run_linear_eval_vit.py) \
  world_size=64 \
  ckpt_dir=$SAVE_DIR \
  data_dir=/checkpoint/ronghanghu/megavlt_paths/imagenet-1k \
  linear_eval.pretrained_ckpt_path=$PRETRAINED_MODEL

On TPUs (e.g. using a v3-256 TPU pod):

PRETRAINED_MODEL=/checkpoint/ronghanghu/workspace/simclr_vit_release/save_simclr_tpu_v3-256/vit_b16_epoch_300.ckpt
# SAVE_DIR can be the same or a different directory from SSL training
SAVE_DIR="/checkpoint/ronghanghu/workspace/simclr_vit_release/save_simclr_tpu_v3-256"

TPU_NAME=megavlt-256  # change to your TPU name
# use absolute paths with torch_xla.distributed.xla_dist
sudo mkdir -p $SAVE_DIR && sudo chmod -R 777 $SAVE_DIR  # workaround for permission issue
python3 -m torch_xla.distributed.xla_dist \
  --tpu=${TPU_NAME} --restart-tpuvm-pod \
  --env LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 \
  -- \
python3 $(realpath run_linear_eval_vit.py) \
  device=xla \
  ckpt_dir=$SAVE_DIR \
  data_dir=/checkpoint/ronghanghu/megavlt_paths/imagenet-1k \
  linear_eval.pretrained_ckpt_path=$PRETRAINED_MODEL