This is an automated archive made by the Lemmit Bot.
The original was posted on /r/stablediffusion by /u/ninjasaid13 on 2024-10-15 05:42:07+00:00.
Disclaimer: I am not the author.
Paper:
Code:
Weights:
Abstract
Recently, large-scale diffusion models have made impressive progress in text-to-image (T2I) generation. To further equip these T2I models with fine-grained spatial control, approaches like ControlNet introduce an extra network that learns to follow a condition image. However, for every single condition type, ControlNet requires independent training on millions of data pairs with hundreds of GPU hours, which is quite expensive and makes it challenging for ordinary users to explore and develop new types of conditions. To address this problem, we propose the CtrLoRA framework, which trains a Base ControlNet to learn the common knowledge of image-to-image generation from multiple base conditions, along with condition-specific LoRAs to capture distinct characteristics of each condition. Utilizing our pretrained Base ControlNet, users can easily adapt it to new conditions, requiring as few as 1,000 data pairs and less than one hour of single-GPU training to obtain satisfactory results in most scenarios. Moreover, our CtrLoRA reduces the learnable parameters by 90% compared to ControlNet, significantly lowering the threshold to distribute and deploy the model weights. Extensive experiments on various types of conditions demonstrate the efficiency and effectiveness of our method. Codes and model weights will be released at this https URL.
This paper presents CtrLoRA, a framework for creating controllable image generation models with minimal data and resources. It trains a Base ControlNet together with condition-specific LoRAs, then adapts the base to new conditions with additional LoRAs, reducing data requirements and speeding up training. The resulting models can be integrated with community models for multi-condition generation without extra training. One limitation the authors note is that color-related tasks, such as Palette and Lineart, converge more slowly than spatial tasks, likely due to network architecture limitations. Future improvements may come from more advanced transformer diffusion backbones such as Stable Diffusion V3 and Flux.1.
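To make the core idea concrete, here is a minimal illustrative sketch (not the authors' code, and `LoRALinear` is a hypothetical name) of the general LoRA mechanism the framework builds on: a shared base weight stays frozen, while each new condition only trains a small low-rank adapter, which is why the per-condition parameter count drops so sharply.

```python
import numpy as np

# Illustrative sketch, NOT the CtrLoRA implementation: a LoRA-style
# low-rank adapter over a frozen base weight. In CtrLoRA the frozen part
# is the shared Base ControlNet; each condition trains only its adapter.
class LoRALinear:
    def __init__(self, base_weight, rank=4, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = base_weight                      # frozen base weight, shape (out, in)
        out_dim, in_dim = base_weight.shape
        # Only A and B are trained per condition; W is never updated.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))        # zero init: adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # y = W x + scale * B (A x)
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

W = np.eye(3)
layer = LoRALinear(W, rank=2)
x = np.array([1.0, 2.0, 3.0])
y = layer.forward(x)   # B is zero-initialized, so this equals W @ x initially
```

Note the parameter economy: a full weight has `out * in` entries, while the adapter adds only `rank * (in + out)`, which is the kind of reduction behind the reported ~90% fewer learnable parameters per condition.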