multilingual-t5


Commits

List of commits on branch master.
9dcd60fc43c31a8651461f9a21894a134ba22166

refactor: depend on seqio instead of t5.data.

stephanwlee committed 2 years ago

f9c8fa4a22666432603a3a17ceabe414167b9f0c

Refactor the tasks so that it can be registered with other output features

hwchung27 committed 2 years ago

86a3135977446d5e67e4133440aaaf5b38b943eb

Fix pawsx translate train original mixture

sharannarang committed 2 years ago

db5d6fdcf2401b6d8ba58160a51326de9f18bbef

Split up the vocab component of tasks.py into it's own file so teams can depend on the vocab without pulling in all the tasks dependencies.

a-googler committed 2 years ago

625e1ca79b12299ffb7a4920041b4aa72639522d

Remove unused comments related to Python 2 compatibility.

yilei committed 3 years ago

23a4899e26067736de847007b542099c95b534be

Fix a bug in XQUAD helper function.

lintingxue committed 3 years ago

mT5: Multilingual T5

Multilingual T5 (mT5) is a massively multilingual pretrained text-to-text transformer model, trained with a recipe similar to that of T5. This repo can be used to reproduce the experiments in the mT5 paper.

Languages covered

mT5 is pretrained on the mC4 corpus, covering 101 languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish, Yoruba, Zulu.

Results

mT5 achieves state-of-the-art performance on many cross-lingual NLP tasks, as of November 2020. For example, on XTREME zero-shot classification, structured prediction and QA tasks (showing F1 scores):

Model      XNLI  PAWS-X  WikiAnn-NER  XQuAD  MLQA  TyDiQA-GoldP
mBERT      65.4  81.9    62.2         64.5   61.4  59.7
XLM        69.1  80.9    61.2         59.8   48.5  43.6
InfoXLM    81.4  -       -            -      73.6  -
X-STILTs   80.4  87.7    64.7         77.2   72.3  76.0
XLM-R      79.2  86.4    65.4         76.6   71.6  65.1
VECO       79.9  88.7    65.7         77.3   71.7  67.6
RemBERT    80.8  87.5    70.1         79.6   73.1  77.0
mT5-Small  67.5  82.4    50.5         58.1   54.6  36.4
mT5-Base   75.4  86.4    55.7         67.0   64.6  59.1
mT5-Large  81.1  88.9    58.5         77.8   71.2  68.4
mT5-XL     82.9  89.6    65.5         79.5   73.5  77.8
mT5-XXL    85.0  90.0    69.2         82.5   76.0  82.0

Usage

Training

To run this code, you need to install the t5 library. General instructions for training, fine-tuning, evaluation, and exporting models for inference can be found in the t5 repo. To use the additional mT5 tasks provided in this library with the t5_mesh_transformer command, run it from this directory and add the flag --module_import="multilingual_t5.tasks". mT5 is also supported in HuggingFace; see the instructions in the T5 repo.
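
The working directory matters because the value of --module_import is loaded as a Python module, so the multilingual_t5 package must be findable from where the command is launched (unless it is otherwise on the Python path). A minimal pre-flight sketch, assuming the module lives at multilingual_t5/tasks.py relative to the repository root (inferred from the module name); in_repo_root is a hypothetical helper and not part of this repo or of t5:

```shell
# Hypothetical helper: succeeds only if the given directory (default: the
# current one) contains the file that --module_import is expected to load.
# Assumption: the module path is multilingual_t5/tasks.py.
in_repo_root() {
  [ -f "${1:-.}/multilingual_t5/tasks.py" ]
}

if in_repo_root; then
  echo "ok: found multilingual_t5/tasks.py"
else
  echo "warning: multilingual_t5/tasks.py not found; run from the repository root" >&2
fi
```

The check only tests that the file exists; it does not verify that the t5 library itself is installed.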

To train an mT5-Large model on the mc4 task from scratch as described in the paper:

export PROJECT=yourproject
export ZONE=yourzone
export BUCKET=yourbucket
export TPU=yourtpu

ctpu up --name=$TPU --project=$PROJECT --zone=$ZONE --tpu-size=v3-256 --tpu-only --noconf

TASK=mc4
MODEL_DIR="${BUCKET}/${TASK}"

python -m t5.models.mesh_transformer_main \
  --tpu="${TPU}" \
  --gcp_project="${PROJECT}" \
  --tpu_zone="${ZONE}" \
  --model_dir="${MODEL_DIR}" \
  --gin_file="models/t5.1.1.large.gin" \
  --gin_param="MIXTURE_NAME = '${TASK}'" \
  --gin_param="utils.run.sequence_length = {'inputs': 1024, 'targets': 256}" \
  --gin_param="utils.run.batch_size = ('tokens_per_batch', 1048576)" \
  --gin_param="utils.run.learning_rate_schedule=@learning_rate_schedules.rsqrt_no_ramp_down" \
  --gin_param="run.train_steps = 1000000" \
  --gin_param="utils.tpu_mesh_shape.model_parallelism = 1" \
  --gin_param="utils.tpu_mesh_shape.tpu_topology = 'v3-256'" \
  --eval_mode="perplexity_eval" \
  --eval_gin_param="mesh_eval_dataset_fn.num_eval_examples = 10000" \
  --t5_tfds_data_dir="${BUCKET}/t5-tfds" \
  --module_import="multilingual_t5.tasks"
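
The batch size in the command above is given in tokens rather than sequences, so the number of sequences per batch and the total token budget follow by arithmetic. A small sketch using only values from the command; the helper names are illustrative, and the calculation assumes the token count is divided by the longer of the two sequence lengths (the 1024-token inputs):

```shell
# Values copied from the training command above.
TOKENS_PER_BATCH=1048576
INPUT_LENGTH=1024
TRAIN_STEPS=1000000

# Illustrative helpers (not part of t5): integer arithmetic only.
sequences_per_batch() { echo $(( $1 / $2 )); }
total_tokens() { echo $(( $1 * $2 )); }

echo "sequences per batch: $(sequences_per_batch "$TOKENS_PER_BATCH" "$INPUT_LENGTH")"
echo "input tokens seen:   $(total_tokens "$TOKENS_PER_BATCH" "$TRAIN_STEPS")"
```

With these values the sketch reports 1024 sequences per batch and 1048576000000 input tokens over the full run, roughly one trillion.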

Fine-Tuning

The example below shows how to fine-tune the mT5-Large model on the XNLI zero-shot task. See finetune_mt5_tasks.sh for hyperparameter settings for other tasks.

export PROJECT=yourproject
export ZONE=yourzone
export BUCKET=yourbucket
export TPU=yourtpu

ctpu up --name=$TPU --project=$PROJECT --zone=$ZONE --tpu-size=v3-256 --tpu-only --noconf

TASK=mt5_xnli_zeroshot
SEQUENCE_LENGTH_GIN=xnli
PRETRAINED_DIR=gs://t5-data/pretrained_models/mt5/large
PRETRAINED_STEPS=1000000
FINETUNE_STEPS=20000
MODEL_DIR="${BUCKET}/${TASK}"

# Run fine-tuning
python -m t5.models.mesh_transformer_main \
  --tpu="${TPU}" \
  --gcp_project="${PROJECT}" \
  --tpu_zone="${ZONE}" \
  --model_dir="${MODEL_DIR}" \
  --gin_file="${PRETRAINED_DIR}/operative_config.gin" \
  --gin_file="sequence_lengths/${SEQUENCE_LENGTH_GIN}.gin" \
  --gin_param="utils.tpu_mesh_shape.tpu_topology = 'v3-256'" \
  --gin_param="MIXTURE_NAME = '${TASK}'" \
  --gin_param="utils.run.train_steps=$((PRETRAINED_STEPS+FINETUNE_STEPS))" \
  --gin_param="utils.run.init_checkpoint='${PRETRAINED_DIR}/model.ckpt-${PRETRAINED_STEPS}'" \
  --t5_tfds_data_dir="${BUCKET}/t5-tfds" \
  --module_import="multilingual_t5.tasks" \
  --gin_param="utils.run.batch_size = ('tokens_per_batch', 1048576)" \
  --gin_location_prefix="multilingual_t5/gin/"
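
The train_steps value in the command above is the sum of the pretrained and fine-tuning steps, which suggests it is an absolute step count rather than a number of additional steps. A short sketch of how the two derived values come out for the settings shown; the helper names are illustrative and not part of t5:

```shell
# Values copied from the fine-tuning command above.
PRETRAINED_DIR=gs://t5-data/pretrained_models/mt5/large
PRETRAINED_STEPS=1000000
FINETUNE_STEPS=20000

# Illustrative helpers: the final target step and the checkpoint to start from.
final_step() { echo $(( $1 + $2 )); }
init_checkpoint() { echo "$1/model.ckpt-$2"; }

echo "train_steps:     $(final_step "$PRETRAINED_STEPS" "$FINETUNE_STEPS")"
echo "init_checkpoint: $(init_checkpoint "$PRETRAINED_DIR" "$PRETRAINED_STEPS")"
```

For the settings above this gives a target of 1020000 steps, starting from the checkpoint saved at step 1000000.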

The remaining experiments are shown in the tasks.py file.

Released Model Checkpoints

We have released the following checkpoints for pre-trained models described in our paper:

How to Cite

If you extend or use this work, please cite the paper where it was introduced:

@inproceedings{xue-etal-2021-mt5,
    title = "m{T}5: A Massively Multilingual Pre-trained Text-to-Text Transformer",
    author = "Xue, Linting  and
      Constant, Noah  and
      Roberts, Adam  and
      Kale, Mihir  and
      Al-Rfou, Rami  and
      Siddhant, Aditya  and
      Barua, Aditya  and
      Raffel, Colin",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.naacl-main.41",
    doi = "10.18653/v1/2021.naacl-main.41",
    pages = "483--498"
}