GitXplorer

CLAfICLe

public · 3 stars · 0 forks · 0 issues

Commits

List of commits on branch main:

- da70ed68e64e8fb4de438836888aacddbb533ad5 (Verified): "add note that the paper was not submitted for publication" (thesofakillers, a year ago)
- d5006ac928a3bb966348ed92c1ff425ae54d1796 (Unverified): "link pdf in readme" (thesofakillers, 2 years ago)
- 9867a5e33536c3f0f3add21ce9511fdfad37ce7d (Unverified): "add notes to presentation" (thesofakillers, 2 years ago)
- 462ed0dc6bce34763c2273317f1675a6c75a3da1 (Unverified): "presentation" (thesofakillers, 2 years ago)
- db5a007d864cddfc2497a3828757c4cc65a8e579 (Unverified): "final fixes" (thesofakillers, 2 years ago)
- 66918de6456bedee1594fbca4afbaee30c2380bf (Unverified): "basic usage instructions" (thesofakillers, 2 years ago)

README

The README file for this repository.

CLAfICLe

Read our paper:

Cross-Lingual Adaptation for In-Context Learning [PDF] (Not submitted for publication)

Contents

Requirements and Setup

Required Packages

Details such as the required Python version and package versions can be found in pyproject.toml and the generated poetry.lock file.

We recommend using an environment manager such as conda. After setting up your environment with the correct Python version, please proceed with the installation of the required packages.
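For example, with conda (the environment name below is arbitrary; check pyproject.toml for the exact Python version to pin):

    # create and activate a fresh environment for the project
    # (add python=<version> to match the version pinned in pyproject.toml)
    conda create -n claficle python
    conda activate claficle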

For poetry users, getting set up is as easy as running

    poetry install

We also provide a requirements.txt file for pip users who do not wish to use poetry. In this case, simply run

    pip install -r requirements.txt

This requirements.txt file is generated by running the following:

    sh gen_pip_reqs.sh
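The script is most likely a thin wrapper around poetry's export command; a minimal sketch of what it does (the flags in the actual script may differ):

    # export poetry's locked dependencies to a pip-compatible requirements.txt
    poetry export -f requirements.txt --output requirements.txt --without-hashes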

Checkpoints

If you wish to run evaluation without first training the model, we provide our checkpoints via The Internet Archive at this link. Please unzip the archive and organize it such that the checkpoints end up in the checkpoints/ folder at the root of this repository.
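For example, assuming the downloaded archive is named claficle-checkpoints.zip (the actual filename may differ):

    # extract at the repository root so that checkpoints/ sits alongside claficle/
    unzip claficle-checkpoints.zip -d .
    ls checkpoints/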

We do not provide the bare hr_to_lr MetaICL model checkpoint. For this checkpoint, please refer to the instructions on the MetaICL repo for downloading their metaicl model in the hr_to_lr setting. Once downloaded, rename this to metaicl.pt and place it in the relevant checkpoints directory.
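For example, assuming the file obtained from the MetaICL repo is named model.pt and belongs in the top-level checkpoints/ directory (adjust both paths to your layout):

    # rename the hr_to_lr MetaICL checkpoint and move it into place
    mv model.pt checkpoints/metaicl.pt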

Model Reference

The following table provides a reference for the models evaluated in our paper.

| Model Name | Evaluation Languages | Description |
| --- | --- | --- |
| metaicl | en | direct hr_to_lr checkpoint from the MetaICL repo |
| sandwich-{lang} | fr, de | metaicl sandwiched in a translation API for lang, serving as a baseline |
| metaicl-gewechselt-{lang}-clm | fr, de | metaicl adapted to lang (fr or de) using WECHSEL, zero-shot or with the additional recommended CLM training |
| gpt2-gewechselt-{lang}-clm | not evaluated | gpt2 adapted to lang (fr or de) using WECHSEL with the additional recommended CLM training; not evaluated directly, used only as a base |
| {base}-metaicla | fr, de | a base (any of the gpt2-gewechselt-{lang}-clm models) with a MetaICL adapter, trained the standard way |
| {base}-metaiclva | fr, de | a base (any of the gpt2-gewechselt-{lang}-clm models) with a MetaICL vessel adapter, trained with targeted distillation |

Usage

We use hydra for configuring our project.

To download/process the data, run claficle/data/oscar.py or claficle/data/benchmark.py for OSCAR and our multilingual multi-task benchmark respectively. You may have to configure or override claficle/conf/setup_data.yaml accordingly. We suggest inspecting slurm/data/ for examples of how we ran these.
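For example (the override keys shown are assumptions; consult claficle/conf/setup_data.yaml for the real names):

    # download/process OSCAR; hydra overrides use key=value syntax
    python claficle/data/oscar.py lang=fr
    # build the multilingual multi-task benchmark
    python claficle/data/benchmark.py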

Note that to process the French and German OSCAR data you need the trained tokenizers produced by WECHSEL initialization. You can either download these along with our checkpoints or run WECHSEL initialization yourself via claficle/models/gewechselt.py, configured with claficle/conf/wechsel_init.yaml. We have examples of how we ran this in slurm/wechsel/.
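For example (the override key is an assumption; see claficle/conf/wechsel_init.yaml for the actual schema):

    # initialize a French model with WECHSEL
    python claficle/models/gewechselt.py target_lang=fr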

Once the data is downloaded, run evaluation with claficle/run/eval.py, configured with claficle/conf/eval.yaml. Examples at slurm/eval/.
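For example (the override names are assumptions; see claficle/conf/eval.yaml):

    # evaluate the metaicl checkpoint on the French benchmark
    python claficle/run/eval.py model=metaicl lang=fr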

Of course, to run evaluation you need trained checkpoints. You can once again either download these or train them yourself. For geWECHSELt models, run claficle/run/train.py. For MetaICLVA, run claficle/run/distil.py. For MetaICLA, please refer to our MetaICL fork. As always, these are configured with the relevant files in claficle/conf/ and are accompanied by examples of how we ran them in slurm/.
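For example (any overrides you need are defined in the corresponding files under claficle/conf/):

    # train a geWECHSELt model
    python claficle/run/train.py
    # distil a MetaICL vessel adapter (MetaICLVA)
    python claficle/run/distil.py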

Project Organization

    ├── LICENSE
    ├── README.md          <- The top-level README
    ├── data/
    │   ├── interim/       <- Intermediate data that has been transformed.
    │   ├── processed/     <- The final, canonical data sets for modeling.
    │   └── raw/           <- The original, immutable data dump.
    ├── checkpoints/       <- Trained and serialized models.
    ├── notebooks/         <- Jupyter notebooks.
    ├── slurm/             <- SLURM scripts
    ├── logs/              <- logs
    ├── reports/           <- Generated analysis as HTML, PDF, LaTeX, etc.
    ├── pyproject.toml     <- project metadata, handled by poetry.
    ├── poetry.lock        <- resolving and locking dependencies, handled by poetry.
    ├── requirements.txt   <- for non-poetry users.
    ├── gen_pip_reqs.sh    <- for generating the pip requirements.txt file
    └── claficle/          <- Source code for use in this project.
        ├── __init__.py    <- Makes claficle a Python module
        ├── data/          <- Scripts to download or generate data
        ├── models/        <- Model definitions
        ├── run/           <- scripts to train, evaluate and use models
        ├── conf/          <- config files
        ├── utils/         <- miscellaneous utils
        └── visualization/ <- Scripts for visualization

The project structure is largely based on the cookiecutter data-science template. This is purposely opinionated so that paths align across collaborators without having to edit config files. Users may find the cookiecutter data-science opinions page of relevance.

The top-level data/ and checkpoints/ directories are in version control only to show structure. Their contents will not be committed and are ignored via .gitignore.