
laion-prepro

Get billions of image+url pairs from the laion datasets and preprocess them.

This repository can be run on:

  • for laion400m: one machine with 32GB of RAM, 8TB of disk, 16 i7 cores, and a 1Gbps connection.
  • for laion5B: 10 machines similar to the laion400m one.
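At its core, the preprocessing described above means turning metadata rows of (url, caption) into image files on disk. As a rough illustration only (this is not the repository's actual pipeline; the csv layout and file naming are assumptions), a minimal version of that step might look like:

```python
import csv
import os
import urllib.request


def read_pairs(metadata_path):
    """Parse a csv of (url, caption) rows, skipping malformed lines."""
    pairs = []
    with open(metadata_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            # keep only well-formed rows whose first field looks like a url
            if len(row) == 2 and row[0].startswith("http"):
                pairs.append((row[0], row[1]))
    return pairs


def download_pair(url, caption, out_dir, index):
    """Fetch one image and store it next to its caption text file."""
    data = urllib.request.urlopen(url, timeout=10).read()
    with open(os.path.join(out_dir, f"{index}.jpg"), "wb") as f:
        f.write(data)
    with open(os.path.join(out_dir, f"{index}.txt"), "w", encoding="utf-8") as f:
        f.write(caption)
```

At billion-pair scale the real pipeline needs parallel downloads, retries, resizing, and sharded output formats; this sketch only shows the shape of the task.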

What is laion?

The laion project aims to use commoncrawl to retrieve billions of aligned image+text pairs. It is composed of a central server that tracks the progress of decentralized workers (run by anyone) that each process small chunks of commoncrawl. Currently, 5B such pairs have already been retrieved. Read more about it in the laion 400M release post.

What can be done with these datasets?

Vision and language modeling took off in 2021. Here are some pointers on what this kind of image+text dataset unlocks and why it is so interesting:

  • 6 months ago OpenAI released two blog posts and papers: clip and dall-e. Both models rely on a large amount of (text, image) pairs. They used an unreleased 400M-pair dataset.
    • CLIP is a model that computes how related a text and an image are. This makes it possible to build large text-to-image search engines, and it enables the kind of striking text-to-image art seen in clip-art. They released a small and a medium version of the model but no training code.
    • DALL-E is a model that generates images directly from text. As can be seen from the blog post, it achieves very impressive results that could have a direct impact on the world, for anything that needs drawings and illustrations. OpenAI did not release any model, even through an API.
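The retrieval idea behind CLIP-powered text-to-image search can be sketched without the model itself: embed the query text and every image into a shared vector space, then rank images by cosine similarity to the query. The embeddings below are made-up toy vectors, purely for illustration:

```python
import numpy as np


def rank_images(text_emb, image_embs):
    """Return image indices sorted by cosine similarity to the text embedding."""
    # normalize so that the dot product equals cosine similarity
    t = text_emb / np.linalg.norm(text_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = imgs @ t
    return np.argsort(-sims)  # best match first


# toy example: three "image" embeddings and one "text" embedding
image_embs = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
text_emb = np.array([0.5, 0.9])
order = rank_images(text_emb, image_embs)
```

In a real search index the embeddings come from CLIP's image and text encoders, and the ranking is done with an approximate-nearest-neighbor index rather than a full dot product.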

Since then, several efforts have been organized to replicate DALL-E. People initially organized around the awesome DALL-E replication repository DALLE-pytorch, with some nice results that can be seen in its readme. More recently, as part of a Hugging Face event, new results have been achieved (see the dalle mini report), and an online demo is now available: dalle-mini demo.

The replication effort is still far from matching the performance of the original DALL-E, and it seems possible to go even further. Some people also want to build a better CLIP to produce even better generated art.

A large part of what such models can achieve comes from data. Large amounts of data. Before laion 400M, the largest open datasets of (image, text) pairs were on the order of 10M pairs (see DALLE-datasets), which is enough to train okay models but not enough to reach the best performance. Having a public dataset with hundreds of millions of pairs will help a lot in building these image+text models.

Visualization of the dataset

Check the colab and the web demo

laion5B

laion5B and laion400m processing is overall the same, but since laion5B is 10x larger, everything had to be made distributed.

Read more at laion5B/README.md
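Making the pipeline distributed largely comes down to splitting the work into chunks that independent machines can process. A minimal sketch of that idea (a hypothetical helper, not the repository's actual scheduler) splits a list of items into roughly equal contiguous shards, one per worker:

```python
def shard(items, n_workers):
    """Split items into n_workers roughly equal contiguous chunks."""
    k, r = divmod(len(items), n_workers)
    shards, start = [], 0
    for i in range(n_workers):
        # the first r shards get one extra item so every item is assigned
        end = start + k + (1 if i < r else 0)
        shards.append(items[start:end])
        start = end
    return shards
```

Each of the 10 machines would then process only its own shard of the url list, with a central server (as described above) tracking which shards are done.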

laion400m

See laion400m/README.md