GitHub - cloneofsimo/lora: Using Low-rank adaptation to quickly fine-tune diffusion models.
Low-rank Adaptation for Fast Text-to-Image Diffusion Fine-tuning
- Fine-tune Stable Diffusion models twice as fast as the Dreambooth method, using Low-rank Adaptation
- Get an insanely small end result (3MB for just unet, 4MB for both unet + clip + token embedding), easy to share and download.
- Easy to use, compatible with
- Sometimes even better performance than full fine-tuning (but left as future work for extensive comparisons)
- Merge checkpoints + Build recipes by merging LoRAs together
- Pipeline to fine-tune CLIP + Unet + token to gain better results.
- Integrated into Huggingface Spaces using Gradio. Try out the Web Demo
Easy colab running example of Dreambooth by @pedrogengo
UPDATES & Notes
- Pivotal Tuning now available with
- More utilities added, such as datasets and the patch_pipe function to patch CLIP, Unet, Token all at once.
- Adjustable Ranks, Fine-tuning Feed-forward layers.
- More example notebooks added.
- You can now fine-tune text_encoder as well! Enabled with simple
- Converting to CKPT format for A1111's repo consumption! (Thanks to jachiam's conversion script)
- Img2Img Examples added.
- Please use a large learning rate! Around 1e-4 worked well for me; around 1e-6 is far too small and the model will not learn anything.
Thanks to the generous work of Stability AI and Huggingface, so many people have enjoyed fine-tuning stable diffusion models to fit their needs and generate higher fidelity images. However, the fine-tuning process is very slow, and it is not easy to find a good balance between the number of steps and the quality of the results.
Also, the final result (a fully fine-tuned model) is very large. Some people instead work with textual inversion as an alternative. But clearly this is suboptimal: textual inversion only creates a small word embedding, and the final image is not as good as that of a fully fine-tuned model.
Well, what's the alternative? In the domain of LLMs, researchers have developed efficient fine-tuning methods. LoRA, especially, tackles the very problem the community currently has: end users of the open-sourced Stable Diffusion model want to try the various fine-tuned models created by the community, but each model is too large to download and use. LoRA instead attempts to fine-tune the "residual" of the model rather than the entire model: i.e., train the ΔW instead of W.
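The fine-tuned weight is then $W' = W + \Delta W$, where we can further decompose $\Delta W$ into low-rank matrices: $\Delta W = A B^T$, with $A \in \mathbb{R}^{n \times d}$, $B \in \mathbb{R}^{m \times d}$, and $d \ll n$. This is the key idea of LoRA. We can then fine-tune $A$ and $B$ instead of $W$. In the end, you get an insanely small model, as $A$ and $B$ are much smaller than $W$.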
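To make the size argument concrete, here is a minimal NumPy sketch of the decomposition. The layer shape (320 x 768) and the rank 4 are illustrative assumptions, not values taken from this repo.

```python
import numpy as np

def lora_delta(A, B):
    """Return the low-rank update ΔW = A Bᵀ."""
    return A @ B.T

# Hypothetical shapes: W is n x m, the LoRA rank is d (d << n).
n, m, d = 320, 768, 4
rng = np.random.default_rng(0)

W = rng.standard_normal((n, m))          # pretrained weight, kept frozen
A = rng.standard_normal((n, d)) * 0.01   # trainable factor, n x d
B = rng.standard_normal((m, d)) * 0.01   # trainable factor, m x d

delta_W = lora_delta(A, B)               # same shape as W, rank at most d
W_adapted = W + delta_W

print("rank of ΔW:", np.linalg.matrix_rank(delta_W))
print("parameters in W:", W.size)
print("parameters in A and B:", A.size + B.size)
```

Only A and B need to be stored and shared, which is where the small file size comes from.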
Also, not all of the parameters need tuning: they found that tuning only Q, K, V, O (i.e., the attention layers) of the transformer model is often enough. (This is also the reason why the end result is so small.) This repo follows the same idea.
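As a rough sketch of what "only tune the attention projections" means, the snippet below attaches A, B factors only to layers whose names look like attention projections. The layer names, shapes, and the helper make_lora_factors are hypothetical and chosen for illustration; they are not this repo's API.

```python
import numpy as np

# Hypothetical layer names and (n, m) weight shapes for one transformer block.
layers = {
    "attn.to_q": (320, 320),
    "attn.to_k": (320, 768),
    "attn.to_v": (320, 768),
    "attn.to_out": (320, 320),
    "ff.proj_in": (2560, 320),
    "ff.proj_out": (320, 1280),
}
ATTENTION_SUFFIXES = ("to_q", "to_k", "to_v", "to_out")

def make_lora_factors(layers, rank=4, seed=0):
    """Create (A, B) factors only for the attention projection layers."""
    rng = np.random.default_rng(seed)
    factors = {}
    for name, (n, m) in layers.items():
        if name.endswith(ATTENTION_SUFFIXES):
            A = rng.standard_normal((n, rank)) * 0.01
            B = np.zeros((m, rank))  # zero init, so ΔW = A Bᵀ starts at 0
            factors[name] = (A, B)
    return factors

factors = make_lora_factors(layers)
trainable = sum(A.size + B.size for A, B in factors.values())
total = sum(n * m for n, m in layers.values())
print(f"trainable LoRA parameters: {trainable} of {total} frozen parameters")
```

The feed-forward layers are left untouched here, so they contribute nothing to the saved file.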
Now, how would we actually use this to update a diffusion model? First, we will use Stable Diffusion from Stability AI. Their model is nicely ported through the Huggingface API, so this repo has built various fine-tuning methods around it. In detail, there are three subtle but important distinctions in methods to make this work out.
First, there is LoRA applied to Dreambooth. The idea is to use prior-preservation class images to regularize the training process, and to use low-occurring tokens. This keeps the model's generalization capability while keeping high fidelity. If you turn off prior preservation and train the text encoder embedding as well, it becomes naive fine-tuning.
Second, there is textual inversion. There is no room to apply LoRA here, but it is worth mentioning. The idea is to instantiate a new token and learn its embedding via gradient descent. This is a very powerful method, and it is worth trying out if your use case is not focused on fidelity but rather on inverting conceptual ideas.
The last method (although originally proposed for GANs) takes the best of both worlds. Simply apply textual inversion to get a matching token embedding, then use the token embedding + prior-preserving class images to fine-tune the model. This two-fold nature makes it a strict generalization of both methods.
Enough of the lengthy introduction, let's get to the code.
Fine-tuning Stable diffusion with LoRA.
Basic usage is as follows: prepare sets of A,B matrices in an unet model, and fine-tune them.
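The idea of "prepare A, B and fine-tune them" can be sketched on a toy linear layer. This is not the repo's training code: it assumes a plain least-squares objective, a single weight matrix instead of a unet, and hand-written gradients in NumPy. The function name train_lora and all hyperparameters are hypothetical.

```python
import numpy as np

def train_lora(W, X, Y, rank=2, lr=0.02, steps=2000, seed=0):
    """Fit ΔW = A Bᵀ by gradient descent while W stays frozen."""
    rng = np.random.default_rng(seed)
    n, m = W.shape
    A = rng.standard_normal((n, rank)) * 0.1
    B = np.zeros((m, rank))                  # ΔW starts at zero
    losses = []
    for _ in range(steps):
        residual = X @ (W + A @ B.T).T - Y   # shape (N, n)
        losses.append(float(np.mean(np.sum(residual**2, axis=1))))
        G = 2.0 * residual.T @ X / len(X)    # gradient w.r.t. ΔW, shape (n, m)
        grad_A, grad_B = G @ B, G.T @ A      # chain rule through ΔW = A Bᵀ
        A -= lr * grad_A
        B -= lr * grad_B
    return A, B, losses

# Toy data: the "fine-tuned" target differs from W by a rank-2 update.
rng = np.random.default_rng(1)
n, m, d, N = 8, 6, 2, 64
W = rng.standard_normal((n, m))
W_frozen = W.copy()
target_delta = (rng.standard_normal((n, d)) * 0.5) @ (rng.standard_normal((m, d)) * 0.5).T
X = rng.standard_normal((N, m))
Y = X @ (W + target_delta).T

A, B, losses = train_lora(W, X, Y, rank=d)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.6f}")
```

W is never modified; only the two small factors receive gradient updates.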
A working example of this, applied to Dreambooth, can be found in train_lora_dreambooth.py. Run this example with
Another Dreambooth example, with text_encoder training turned on, can be run with:
Loading, merging, and interpolating trained LORAs with CLIs.
We've seen that people have been merging different checkpoints with different ratios, and this seems to be very useful to the community. LoRA is extremely easy to merge.
By the nature of LoRA, one can interpolate between different fine-tuned models by adding different A,B matrices.
Currently, the LoRA CLI has three options: merge a full model with LoRA, merge LoRA with LoRA, or merge a full model with LoRA and convert it to ckpt format (original format).
Merging full model with LoRA
path_1 can be either a local path or a Huggingface model name. When adding LoRA to the unet, alpha is the constant $\alpha$ in $W' = W + \alpha \Delta W$.
So, set alpha to 1.0 to fully add LoRA. If the LoRA seems to have too much effect (i.e., overfitted), set alpha to a lower value. If the LoRA seems to have too little effect, set alpha higher; values slightly greater than 1.0 also work. You can tune this value to your needs.
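A minimal sketch of what merging with alpha amounts to, assuming the merged weight is W + alpha * A Bᵀ. The function merge_lora_into_weight is a hypothetical helper for illustration, not the CLI's implementation.

```python
import numpy as np

def merge_lora_into_weight(W, A, B, alpha=1.0):
    """Return W + alpha * A Bᵀ; alpha scales how strongly the LoRA is applied."""
    return W + alpha * (A @ B.T)

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 5))
A = rng.standard_normal((6, 2))
B = rng.standard_normal((5, 2))

W_off = merge_lora_into_weight(W, A, B, alpha=0.0)     # LoRA has no effect
W_full = merge_lora_into_weight(W, A, B, alpha=1.0)    # LoRA fully added
W_strong = merge_lora_into_weight(W, A, B, alpha=1.2)  # slightly amplified
```

Because the update is linear in alpha, the change from the base weight at alpha=1.2 is exactly 1.2 times the change at alpha=1.0.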
Merging full model with LoRA and changing to original CKPT format
TESTED WITH V2, V2.1 ONLY!
Everything same as above, but with mode
Merging LoRA with LoRA
alpha is the ratio of the first model to the second model.
Set alpha to 0.5 to get the average of the two models. Set alpha close to 1.0 to get more of the effect of the first model, and close to 0.0 to get more of the effect of the second model.
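One way to read this, and it is an assumption rather than a statement about the CLI's internals, is that alpha linearly interpolates the factor matrices of the two LoRAs. The helper interpolate_loras below is hypothetical and assumes both LoRAs share the same shapes and rank.

```python
import numpy as np

def interpolate_loras(A1, B1, A2, B2, alpha=0.5):
    """Blend two LoRAs factor by factor: alpha weights the first, (1 - alpha) the second."""
    A = alpha * A1 + (1.0 - alpha) * A2
    B = alpha * B1 + (1.0 - alpha) * B2
    return A, B

rng = np.random.default_rng(0)
A1, A2 = rng.standard_normal((6, 2)), rng.standard_normal((6, 2))
B1, B2 = rng.standard_normal((5, 2)), rng.standard_normal((5, 2))

A_avg, B_avg = interpolate_loras(A1, B1, A2, B2, alpha=0.5)      # average of the two
A_first, B_first = interpolate_loras(A1, B1, A2, B2, alpha=1.0)  # first LoRA only
merged_delta = A_avg @ B_avg.T  # update that would be added to the base weight
```

Note that blending the factors is not the same as blending the two ΔW matrices directly, since the product of averaged factors contains cross terms between the two LoRAs.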
More bash examples with Text Encoder Lora:
This will build a merged_model.ckpt with LoRA merged with α=1.2 and text encoder LoRA.
Making Text2Img Inference with trained LoRA
See scripts/run_inference.ipynb for an example of how to make inference with LoRA.
Making Img2Img Inference with LoRA
See scripts/run_img2img.ipynb for an example of how to make inference with LoRA.
Merging Lora with Lora, and making inference dynamically using
See scripts/merge_lora_with_lora.ipynb for an example of how to merge Lora with Lora, and make inference dynamically using
Above results are from merging
lora_kiriko.pt with both 1.0 as weights and 0.5 as α.
Tips and Discussions
Training tips in general
I'm curating a list of tips and discussions here. Feel free to add your own tips and discussions with a PR!
- Discussion by @nitrosocke, can be found here
- Configurations by @xsteenbrugge, Using Clip-interrogator to get a decent prompt seems to work well for him, https://twitter.com/xsteenbrugge/status/1602799180698763264
- Super easy colab running example of Dreambooth by @pedrogengo
- Amazing in-depth analysis on the effect of rank, αunet, αclip, and training configurations from brian6091!
How long should you train?
Effect of fine-tuning (both Unet + CLIP) can be seen in the following image, where each image is another 500 steps. Trained with 9 images, with lr of 1e-4 for unet, and 5e-5 for CLIP. (You can adjust this with
You can see that with 2500 steps, you already get somewhat good results.
What is a good learning rate for LoRA?
People using Dreambooth are used to an lr around 1e-6, but this is way too small for training LoRAs. I've tried using 1e-4, and it is OK. I think these values should be explored more systematically.
What happens to Text Encoder LoRA and Unet LoRA?
Let's see: the following is only using Unet LoRA:
And the following is only using Text Encoder LoRA:
So they learn different aspects of the dataset, but they are not mutually exclusive. You can use both of them to get better results, and tune them separately to get even better results.
With LoRA Text Encoder, Unet, all the schedulers, guidance scale, negative prompt, etc., you have so much to play around with to get the best result you want. For example, with αunet=0.6, αtext=0.9, you get a better result compared to αunet=1.0, αtext=1.0 (default). Check out below:
Here is an extensive visualization on the effect of αunet, αtext, by @brian6091 from his analysis
"a photo of (S*)", trained with 21 images, with rank 16 LoRA. More details can be found
- Make this more user friendly for non-programmers
- Make a better CLI
- Make a better documentation
- Kronecker product, like LoRA [https://arxiv.org/abs/2106.04647]
- Time-aware fine-tuning.
- Test alpha scheduling. I think it will be meaningful.
This work was heavily influenced by, and originated from, these awesome research works. I'm just applying them here.