
MLM_Filter

public
43 stars
1 forks
1 issues

Commits

List of commits on branch main.
Unverified
1d73a71c8b635a58ad99909fee440b89627b79fb

update README

Victorwz committed 19 days ago
Unverified
46d2efa9919d77f411dc3b6d17b01fe58cdd5864

1. completely integrate into LLaVA-Unified repo; 2. add qwen2-based mlm-filter

Victorwz committed 19 days ago
Unverified
0c47b17b1e63dc429d422bc00b8f4dc5059d96ff

change lib name

Victorwz committed 3 months ago
Unverified
0ff0fbcd1815d0691ee131da24e8d44d3ebb3299

enable float16 inference for model

Victorwz committed 3 months ago
Unverified
a7e6e4188f9681c509d124d735eeaa11be393672

update builder

Victorwz committed 3 months ago
Unverified
8aa5a46f94537e0fb59f855a4fd695b98adbcbdc

fix bug in builder

Victorwz committed 3 months ago

README

The README file for this repository.

MLM Filter

Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters".

Release

  • [12/30/2024] 🔥 We released a new-generation MLM-Filter model based on Qwen2.5-1.5B, mlm-filter-qwen2.5-1.5b-gpt4o. The instruction data were regenerated with GPT-4o. With the much smaller LLM backbone, inference speed has improved significantly. The LLaVA code for MLM-Filter model inference has been removed from this repository and integrated into LLaVA-Unified.
  • [10/24/2024] 🔥 We released two new MLM-Filter models based on Llama 3, mlm-filter-llama-3-8b and mlm-filter-llama-3.2-3b.
  • [2/25/2024] 🔥 We released Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters. We propose adopting fine-tuned Multimodal Language Models as effective and efficient data filters to select high-quality image-text pairs from large-scale web-crawled image-text data. Check out the paper.

Project Structure

Install

We strongly recommend using Python 3.10, i.e.,

conda create -n mlm_filter python=3.10

Then install the dependencies for quality score generation:

pip install git+https://github.com/Victorwz/LLaVA-Unified.git

Quality Score Generation

Inference on Single Image

python mlm_filter_scoring_single_image.py --image-path /path/to/image --caption "text caption"

Parameters to note:

  • --metric: quality scoring metric for generation, select among image_text_matching, object_detail_fulfillment, caption_text_quality, semantic_understanding, all
  • --image-path: path to image file or image url
  • --caption: text caption
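To score the same pair under a specific metric from Python, the command above can be assembled programmatically. The following is a minimal sketch: `build_scoring_command` is a hypothetical helper, not part of the repository, and it only uses the script name, flags, and metric names listed above.

```python
import sys

# Metric names as listed in the README.
METRICS = (
    "image_text_matching",
    "object_detail_fulfillment",
    "caption_text_quality",
    "semantic_understanding",
    "all",
)


def build_scoring_command(image_path, caption, metric):
    """Return the argv list for scoring one image-caption pair.

    Hypothetical helper: it validates the metric name and builds the
    command line, but does not run the model itself.
    """
    if metric not in METRICS:
        raise ValueError(f"unknown metric {metric!r}; expected one of {METRICS}")
    return [
        sys.executable,
        "mlm_filter_scoring_single_image.py",
        "--metric", metric,
        "--image-path", image_path,
        "--caption", caption,
    ]


if __name__ == "__main__":
    cmd = build_scoring_command("/path/to/image", "text caption", "image_text_matching")
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Building an argv list rather than a shell string avoids quoting problems when captions contain spaces or quotes.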

Inference on Webdataset Large-Scale Data

bash run_inference.sh ${GPU_START_ID} ${Metric} ${Model_Path} ${Data_Path} ${Tars_Per_GPU} ${Num_GPU}

Parameters to note:

  • GPU_START_ID: for large-scale score generation across multiple machines, specify the index of the machine
  • Metric: quality scoring metric for generation, select among image_text_matching, object_detail_fulfillment, caption_text_quality, semantic_understanding, all
  • Model_Path: path to the mlm filter model checkpoint
  • Data_Path: path to the webdataset image-text tars
  • Tars_Per_GPU: the number of webdataset image-text tars for a single GPU to run inference on
  • Num_GPU: the number of GPUs for one machine, e.g. 1, 8, 16
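To make the interplay of these parameters concrete, here is a rough sketch of one way tars could be split across GPUs. It is an illustration only: the actual sharding logic lives in run_inference.sh and may differ. The sketch assumes that GPU_START_ID acts as the global index of the first GPU on the current machine and that each GPU processes a contiguous block of Tars_Per_GPU tars; both are assumptions, and the function names are hypothetical.

```python
def tar_range_for_gpu(gpu_start_id, local_gpu, tars_per_gpu):
    """Half-open range [start, end) of tar indices for one GPU.

    Assumed scheme (not taken from run_inference.sh): GPUs are numbered
    globally, and GPU k handles tars k * tars_per_gpu up to
    (k + 1) * tars_per_gpu, exclusive.
    """
    global_gpu = gpu_start_id + local_gpu
    start = global_gpu * tars_per_gpu
    return start, start + tars_per_gpu


def plan_machine(gpu_start_id, tars_per_gpu, num_gpu):
    """Map each local GPU index on one machine to its assumed tar range."""
    return {
        local_gpu: tar_range_for_gpu(gpu_start_id, local_gpu, tars_per_gpu)
        for local_gpu in range(num_gpu)
    }


if __name__ == "__main__":
    # Second machine of an 8-GPU-per-machine setup, 100 tars per GPU.
    for gpu, (start, end) in plan_machine(8, 100, 8).items():
        print(f"local GPU {gpu}: tars [{start}, {end})")
```

Under this assumed scheme, the total number of tars covered by a run is the number of machines times Num_GPU times Tars_Per_GPU, which gives a quick check that the chosen values cover the whole dataset.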

Fine-Tuning MLM as Data Filter

  1. Prepare data

Please download the 50k multimodal instructions and save them to ./data/mlm_filter_instruct_50k_gpt4v_cc12m_4k.json.

Please download the images from the constituent datasets:

After downloading all of them, organize the data as follows in ./data/images,

├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
├── vg
│   ├── VG_100K
│   └── VG_100K_2
└── cc12m
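Before starting a training run, it can help to confirm that the layout above is in place. The following minimal sketch checks for the listed sub-directories; `missing_image_dirs` is a hypothetical helper, not part of the repository, and it only checks that the directories exist, not that their contents are complete.

```python
from pathlib import Path

# Sub-directories from the layout above, relative to ./data/images.
EXPECTED_DIRS = (
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
    "cc12m",
)


def missing_image_dirs(root="./data/images"):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_image_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Image directory layout looks complete.")
```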

We repacked the OCR-VQA images ourselves so that none of the images included in the LLaVA-v1.5-665k instruction dataset fail to download.

  2. Start training!

Please refer to LLaVA-Unified for more fine-tuning guidance.

Training script with DeepSpeed ZeRO-3: LLaVA_Unified/scripts/mlm_filter/finetune.sh.

Our Best CLIP Model on DataComp-Medium

We also open-sourced our pre-trained CLIP-ViT-B/32 checkpoint under the DataComp-Medium Benchmark Controlled Setting in weizhiwang/clip_datacomp_medium_itm_th_66_AND_odf_th_20_gpt4v. Our best model is trained on the data filtered by both the ITM and ODF Quality Scores.
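Filtering by both scores can be sketched as keeping only the pairs that pass an ITM cutoff and an ODF cutoff. The snippet below is an illustration under stated assumptions, not the authors' filtering code: the values 66 and 20 are read off the checkpoint name, treating them as minimum score cutoffs with a `>=` comparison is an assumption, and the sample records and the `keep_pair` helper are hypothetical.

```python
def keep_pair(scores, itm_threshold=66, odf_threshold=20):
    """Return True if a pair passes both cutoffs.

    Assumption: the thresholds are minimum scores and both must be met.
    """
    return (
        scores["image_text_matching"] >= itm_threshold
        and scores["object_detail_fulfillment"] >= odf_threshold
    )


if __name__ == "__main__":
    # Hypothetical scored samples, for illustration only.
    samples = [
        {"uid": "a", "image_text_matching": 85, "object_detail_fulfillment": 40},
        {"uid": "b", "image_text_matching": 85, "object_detail_fulfillment": 10},
        {"uid": "c", "image_text_matching": 50, "object_detail_fulfillment": 90},
    ]
    kept = [s["uid"] for s in samples if keep_pair(s)]
    print(kept)
```

Requiring both conditions yields the intersection of the two filtered subsets, so a pair with a high score on one metric is still dropped if it falls below the cutoff on the other.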

License

Code License Data License
Usage and License Notices: The data and checkpoint are intended and licensed for research use only. They are also restricted to uses that follow the license agreement of LLaMA, Vicuna and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.

Contacts

For any questions or issues, please feel free to contact weizhiwang@ucsb.edu or submit a GitHub issue.

Citation

Please cite our paper if you find this repository interesting or helpful in your research:

@article{mlm-filter,
    title={Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters}, 
    author={Wang, Weizhi and Mrini, Khalil and Yang, Linjie and Kumar, Sateesh and Tian, Yu and Yan, Xifeng and Wang, Heng},
    journal={arXiv preprint arXiv:2403.02677},
    year={2024},
}

Credits

MLM-Filter is developed based on

  • Vicuna: foundation language model for LLaVA
  • LLaVA: the codebase for fine-tuning LLaVA as image-text data filters
  • DataComp: the codebase for data filtering and CLIP pre-training