stablefused

Commits

List of commits on branch main.
e345b21b137facd2e341f2bb8c00a71d8e3c79b0 (Verified)
Support for StoryBook (Text + Video + Audio) (#5)
a-r-r-o-w committed a year ago

1af8caa94b7fcea1373d4414167d1f8f25b813ec (Verified)
bump version
a-r-r-o-w committed a year ago

702d5864431e5cef6107603f2946a64c00e2ef06 (Verified)
Inpaint Support (#3)
a-r-r-o-w committed a year ago

ad8c799a69cd506cb353c7321b13c5ebfe62176d (Verified)
Merge pull request #1 from a-r-r-o-w/dev
a-r-r-o-w committed a year ago

25d7b8bd9ac6fad98526effe2ca42af55698690c (Verified)
update README.md
a-r-r-o-w committed a year ago

f1241c10a90dd15ff404b2810e1d6367d9bc0bd7 (Verified)
bump version
a-r-r-o-w committed a year ago

README

The README file for this repository.

StableFused

StableFused is a toy library for experimenting with Stable Diffusion, inspired by 🤗 diffusers and various other sources! One of the main reasons I'm working on this project is to learn more about Stable Diffusion, and generative models in general. It is my current area of research at university.

Installation

It is recommended to use a virtual environment. You can use venv or conda to create one.

Unix:

python -m venv venv
source venv/bin/activate

Windows:

python -m venv venv
venv\Scripts\activate

For usage, install the package from PyPI.

pip install stablefused

For development, fork the repository, clone it and install the package in editable mode.

git clone https://github.com/<YOUR_USERNAME>/stablefused.git
cd stablefused
pip install -e ".[dev]"
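
To confirm that the install landed in the active environment, a quick check like the one below can help. This is only a sketch: `installed_version` is a hypothetical helper, and it relies solely on the standard library's `importlib.metadata` plus the distribution name `stablefused` from the `pip` command above.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("stablefused")
if version is None:
    print("stablefused is not installed in this environment")
else:
    print(f"stablefused {version} is installed")
```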

Usage

Check out the examples folder for notebooks 🥰

Contributing

Contributions are welcome! Note that this project is not a serious implementation for training/inference/fine-tuning diffusion models. It is a toy library. I am working on it for fun and experimentation purposes (and because I'm too stupid to modify large codebases and understand what's going on).

As I'm not an expert in this field, I have probably made a lot of mistakes. If you find any, please open an issue or a PR. I'll be happy to learn from you!

Acknowledgements/Resources

The following sources have been very helpful to me in understanding Stable Diffusion. I highly recommend checking them out!

Results

Visualization of diffusion process

Refer to the notebooks for more details and enjoy the denoising process!

Text to Image

These results are generated using the Text to Image notebook.

Image to Image

These results are generated using the Image to Image notebook.

Source Image: The Renaissance Astronaut
Denoising Diffusion Process:
Prompt 1: High quality and colorful photo of Robert J Oppenheimer, father of the atomic bomb, in a spacesuit, galaxy in the background, universe, octane render, realistic, 8k, bright colors
Prompt 2: Stylistic photorealisic photo of Margot Robbie, playing the role of astronaut, pretty, beautiful, high contrast, high quality, galaxies, intricate detail, colorful, 8k
P.S. The results from Image to Image diffusion don't look great in my experiments. There might be a bug in my implementation, which I'll have to look into later...

Text to Video

There is a lot of ongoing research on the generation of videos from text prompts. It is also my current area of research at university. The implementation here is adapted from AnimateDiff.

There is immense potential in developing this kind of technology, and its possible use cases are nearly unlimited: personalized educational content, marketing and advertising, creativity and art, to name a few. Imagine a world where you have your own personal ChatGPT/Bard-like assistants for visual learning - a model that can generate 3Blue1Brown style videos explaining science topics, or depict a story! Current models are not that capable yet, but I think this is where we are headed, and it is what my team and I are researching. The future of this technology will be fascinating to witness!

Text to Video

These results are generated using the Text to Video notebook.

An astronaut floating in space, interstellar, black background with stars, photorealistic, high quality, 8k
A mighty pirate ship sailing through the sea, unpleasant, thundering roar, dark night, starry night, high quality, photorealistic, 8k

Inpainting

Image inpainting is a technique that aims to fill in missing or damaged parts of an image. It is used to restore or repair images by extrapolating the surrounding information to recreate the missing regions seamlessly.
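
One common ingredient of mask-based inpainting is a compositing step: keep the known pixels (or latents) from the source and take only the masked region from the model's output. The snippet below is a minimal sketch of that step with NumPy, not StableFused's actual implementation. The name `composite_with_mask` and the mask convention (1 marks the region to fill, 0 marks the region to keep) are assumptions made for illustration.

```python
import numpy as np


def composite_with_mask(original, generated, mask):
    """Blend two arrays: `generated` where mask == 1, `original` where mask == 0."""
    mask = mask.astype(original.dtype)
    return mask * generated + (1.0 - mask) * original


# Toy 4x4 "images": a flat source and a flat stand-in for model output.
original = np.full((4, 4), 0.2)
generated = np.full((4, 4), 0.9)

# Mask out the central 2x2 patch; only this region should change.
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0

result = composite_with_mask(original, generated, mask)
print(result)
```

Fractional mask values (between 0 and 1) would give a soft blend at the boundary, which is one way to avoid visible seams.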

These results are generated using the Inpainting notebook.

Inpainting using a fixed mask and different prompts

Prompt 1: Digital illustration of a mythical creature, high quality, realistic, 8k
Prompt 2: Digital illustration of a mythical creature, high quality, realistic, 8k
Prompt 3: Digital illustration of a dragon, high quality, realistic, octane render, 8k
Prompt 4: Digital illustration of a ferocious lion, high quality, realistic, octane render, 8k
Prompt 5: Digital illustration of an evil white rabbit, high quality, realistic, 8k
Prompt 6: Digital illustration of samurai with a moon-like object in the background, high quality, realistic, octane render, 8k

Image Mask
Infinite Zoom In

Prompt: A painting of a cat, in the style of Vincent Van Gogh, hanging in a room

Pan and Zoom Out

Prompt: Post-apocalyptic world with ruins, overgrown vegetation, and a lone survivor


Understanding the effect of Guidance Scale

Guidance scale is a parameter that comes from the paper Classifier-Free Diffusion Guidance (CFG). An explanation of how CFG works is out of scope here, but there are many online sources where you can read about it (linked below).

In short, guidance scale is a value that controls the amount of "guidance" used in the diffusion process. That is, the higher the value, the more closely the diffusion process follows the prompt. A lower guidance scale allows the model to be more creative and deviate slightly from the exact prompt. Beyond a certain threshold, the results start to get worse, becoming blurry and noisy.

Guidance scale values, in practice, are usually in the range 6-15, and the default value of 7.5 is used in many inference implementations. However, manipulating it can lead to some very interesting results. It also only makes sense when it is set to 1.0 or higher, which is why many implementations use a minimum value of 1.0.

But... what happens when we set guidance scale to 0? Or negative? Let's find out!

When you use a negative value for the guidance scale, the model will try to generate images that are the opposite of what you specify in the prompt. For example, if you prompt the model to generate an image of an astronaut, and you use a negative guidance scale, the model will try to generate an image of everything but an astronaut. This can be a fun way to generate creative and unexpected images (sometimes NSFW or absolutely horrendous stuff if you are not using a safety-checker model, which is the case with StableFused).
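
To make this concrete, here is a minimal sketch of the guidance step as it is commonly written in diffusers-style pipelines: the model predicts noise once without the prompt and once with it, and the guidance scale extrapolates along the difference. The function name `apply_guidance` and the random arrays standing in for model outputs are assumptions for illustration; StableFused's own code may differ in detail.

```python
import numpy as np


def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward (or away from) the prompt-conditioned prediction."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)


rng = np.random.default_rng(0)
noise_uncond = rng.standard_normal((4, 8, 8))  # stand-in: prediction for an empty prompt
noise_cond = rng.standard_normal((4, 8, 8))    # stand-in: prediction for the real prompt
direction = noise_cond - noise_uncond          # the "prompt direction"

for scale in (-7.5, 0.0, 1.0, 7.5):
    guided = apply_guidance(noise_uncond, noise_cond, scale)
    # Signed distance travelled along the prompt direction, in units of its length.
    progress = np.sum((guided - noise_uncond) * direction) / np.sum(direction**2)
    print(f"guidance_scale={scale:+.1f} -> progress along prompt direction: {progress:.2f}")
```

Under this formulation, a scale of 0 ignores the prompt entirely, a scale of 1 uses the conditioned prediction as-is, larger values overshoot toward the prompt, and negative values push the prediction away from it.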

Results

The original images produced are too large to display in high quality here. You can find them in my Drive. These images are compressed from ~30 MB to ~6 MB in order for GitHub to accept uploads.

Effect of Guidance Scale on Different Prompts
Each image is sampled with the same prompt and seed to ensure only the guidance scale plays a role.
Column 1: Artistic image, very detailed cute cat, cinematic lighting effect, cute, charming, fantasy art, digital painting, photorealistic
Column 2: A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k
Column 3: A grand city in the year 2100, atmospheric, hyper realistic, 8k, epic composition, cinematic, octane render
Column 4: Starry Night, painting style of Vincent van Gogh, Oil paint on canvas, Landscape with a starry night sky, dreamy, peaceful
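
The reason a fixed seed isolates the guidance scale is that the seed fully determines the initial noise the sampler starts from. A small sketch of that idea with NumPy follows; the helper `initial_latent`, the latent shape, the seed value, and the list of scales are all illustrative assumptions, not values taken from the notebooks.

```python
import numpy as np


def initial_latent(seed, shape=(4, 64, 64)):
    """Draw the starting noise for sampling from a generator seeded with `seed`."""
    return np.random.default_rng(seed).standard_normal(shape)


guidance_scales = [-7.5, 0.0, 1.0, 7.5, 15.0]

# Re-create the starting noise for every run from the same seed.
latents = {scale: initial_latent(seed=1337) for scale in guidance_scales}

reference = latents[guidance_scales[0]]
same_start = all(np.array_equal(reference, z) for z in latents.values())
print("all runs start from identical noise:", same_start)
```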
Effect of Guidance Scale with increased number of inference steps
Columns have number of inference steps set to 3, 6, 12, 20, 25.
Prompt: Photorealistic illustration of a mystical alien creature, magnificent, strong, atomic, tyrannic, predator, unforgiving, full-body image

Latent Walk

Generative models, like the ones used in Stable Diffusion, learn a latent representation of the world. A latent representation is a low-dimensional vector space embedding of the world. In the case of SD, this latent representation is learnt by training on text-image pairs. This representation is used to generate samples given a prompt and a random noise vector. The model tries to predict and remove noise from the random noise vector, while also aligning the vector to the prompt. This results in some interesting properties of the latent space.

Stable Diffusion models (at least, the models used here) learn two latent representations - one of the NLP space for prompts, and one of the image space. These latent representations are continuous. If we choose two vectors in the latent space to sample from, the two resulting images will be more or less similar depending on how far apart the chosen vectors are. This is the basis of latent walking. We can choose two vectors in the latent space, and sample from the latent path between them. This results in a smooth transition between the two images.
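
As a rough sketch of what "sampling from the path between two vectors" can look like, the snippet below interpolates between two random latents with NumPy. It shows plain linear interpolation and spherical linear interpolation (slerp), which is often preferred for Gaussian noise because it keeps intermediate vectors at a similar norm. The function names and the latent shape are assumptions for illustration; this is not necessarily how StableFused interpolates.

```python
import numpy as np


def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return (1.0 - t) * a + t * b


def slerp(a, b, t, eps=1e-7):
    """Spherical linear interpolation between a and b for t in [0, 1]."""
    a_flat, b_flat = a.ravel(), b.ravel()
    cos_omega = np.dot(a_flat, b_flat) / (np.linalg.norm(a_flat) * np.linalg.norm(b_flat))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))  # angle between the two vectors
    if np.sin(omega) < eps:
        return lerp(a, b, t)  # (anti)parallel vectors: fall back to linear interpolation
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)


def latent_walk(start, end, num_steps, interpolate=slerp):
    """Return `num_steps` latents evenly spaced along the path from start to end."""
    return [interpolate(start, end, t) for t in np.linspace(0.0, 1.0, num_steps)]


rng = np.random.default_rng(42)
start = rng.standard_normal((4, 64, 64))
end = rng.standard_normal((4, 64, 64))

path = latent_walk(start, end, num_steps=8)
print(len(path), path[0].shape)
```

Each latent in `path` would then be denoised and decoded into one frame. The same idea applies to prompt embeddings when walking between prompts.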

Similar Image Generation by sampling latent space

The results below show just how information-rich the latent space of these Stable Diffusion models is.

Source Image Latent Walks
Large futuristic mechanical robot in the foreground of a baroque-style battle scene, photorealistic, high quality, 8k
Generating Latent Walk videos
Prompt 1: A dog chasing a cat in a thrilling backyard scene, high quality and photorealistic
Prompt 2: A determined dog in hot pursuit, with stunning realism, octane render
Prompt 3: A thrilling chase, dog behind the cat, octane render, exceptional realism and quality
Prompt 4: The exciting moment of a cat outmaneuvering a chasing dog, high-quality and photorealistic detail
Prompt 5: A clever cat escaping a determined dog and soaring into space, rendered with octane render for stunning realism
Prompt 6: The cat's escape into the cosmos, leaving the dog behind in a scene,high quality and photorealistic style

Note that these results aren't very good. I tried different seeds but for this story, I couldn't make a great video. I did try some other prompts and got better results, but I like this story so I'm sticking with it 🤓 You can improve the results by using better prompts and increasing the number of interpolation and inference steps.

Future

At the moment, I'm not sure if I'll continue to expand on this project, but if I do, here are some things I have in mind (in no particular order, and for documentation purposes):

  • Add support for more techniques of inference - explore new sampling techniques and optimize diffusion paths
  • Implement and stay up-to-date with the latest papers in the field
  • Remove 🧨 diffusers as a dependency by implementing all required components myself
  • Create user-friendly web demos or GUI tools to make experimentation easier
  • Add LoRA, training and fine-tuning support
  • Improve codebase, documentation and tests
  • Extend support beyond Stable Diffusion to other diffusion techniques, including but not limited to audio and video

License

MIT