DreamTuner: Single Image is Enough for
Subject Driven Generation

Miao Hua* Jiawei Liu* Fei Ding* Wei Liu Jie Wu Qian He

ByteDance Inc.
* Equal Contribution



A single image is enough for surprising subject-driven image generation!

[Paper]     

Abstract

Large diffusion-based models have demonstrated impressive capabilities in text-to-image generation and are expected to support personalized applications that require generating customized concepts from one or a few reference images, i.e., subject-driven generation. However, existing methods based on fine-tuning necessitate a trade-off between subject learning and maintaining the generation capabilities of pretrained models, while other methods based on additional image encoders tend to lose important details of the subject due to encoding compression. To address these issues, we propose DreamTuner, a novel method that injects the reference information of the customized subject from coarse to fine. A subject encoder is first proposed for coarse subject-identity preservation, where compressed general subject features are introduced through an additional attention layer placed before the visual-text cross-attention. Then, noting that the self-attention layers within pretrained text-to-image models naturally perform detailed spatial contextual association, we modify them into self-subject-attention layers to refine the details of the target subject, where the generated image queries detailed features from both the reference image and itself. It is worth emphasizing that self-subject-attention is an elegant, effective, and training-free method for maintaining the detailed features of customized concepts, and it can be used as a plug-and-play solution at inference time. Finally, with additional fine-tuning on only a single image, DreamTuner achieves remarkable performance in subject-driven image generation, controlled by text or other conditions such as pose.

Approach

We propose DreamTuner as a novel framework for subject-driven image generation based on both fine-tuning and an image encoder, which maintains the subject identity from coarse to fine. DreamTuner consists of three stages: subject-encoder pre-training, subject-driven fine-tuning, and subject-driven inference. First, a subject encoder is trained for coarse identity preservation. The subject encoder is an image encoder that provides compressed image features to the generation model, and a frozen ControlNet is utilized to decouple content and layout. Then we fine-tune the whole model on the reference image and some generated regular images, as in DreamBooth. Note that the subject encoder and self-subject-attention are used when generating these regular images, in order to refine the regular data. At the inference stage, the subject encoder, self-subject-attention, and the subject word [S*] obtained through fine-tuning are used for subject-identity preservation from coarse to fine. A pre-trained ControlNet can also be used for layout-controlled generation.
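For reference, the division of labor across the three stages described above can be summarized in a short illustrative snippet; the grouping and field names below are editorial, not the released code:

```python
# Illustrative summary of the three DreamTuner stages described above.
DREAMTUNER_STAGES = {
    "1_subject_encoder_pretraining": {
        "trained": ["subject encoder", "subject-encoder-attention (S-E-A) layers"],
        "frozen": ["CLIP image encoder", "text-to-image U-Net", "depth ControlNet"],
        "purpose": "coarse identity preservation; the frozen ControlNet decouples layout from content",
    },
    "2_subject_driven_finetuning": {
        "data": [
            "single reference image",
            "regular images generated with the subject encoder + self-subject-attention",
        ],
        "trained": ["whole model", "subject word [S*]"],
        "purpose": "DreamBooth-style adaptation to the specific subject",
    },
    "3_subject_driven_inference": {
        "used": [
            "subject encoder (coarse)",
            "self-subject-attention (fine)",
            "subject word [S*]",
            "optional pre-trained ControlNet (e.g. pose)",
        ],
        "purpose": "coarse-to-fine subject-identity preservation during sampling",
    },
}
```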



We propose the subject encoder, a kind of image encoder that provides a coarse reference for subject-driven generation. A frozen CLIP image encoder is used to extract compressed features of the reference image, and a Salient Object Detection (SOD) or segmentation model is used to remove the background of the input image and emphasize the subject. Several residual blocks (ResBlocks) are then introduced for domain shift: the multi-layer features extracted by CLIP are concatenated along the channel dimension and adjusted to the same dimension as the generated features through the residual blocks. The encoded reference features of the subject encoder are injected into the text-to-image model using additional subject-encoder-attention (S-E-A) layers. The subject-encoder-attention layers are added before the visual-text cross-attention, because the cross-attention layers are the modules that control the general appearance of generated images. We build the subject-encoder attention with the same settings as the cross-attention and zero-initialize its output layers. An additional coefficient $\beta$ is introduced to adjust the influence of the subject encoder. Besides, we found that the subject encoder provides both the content and the layout of the reference image for text-to-image generation; in most cases, however, layout is not required in subject-driven generation. Thus we further introduce ControlNet to help decouple content and layout. Specifically, we train the subject encoder along with a frozen depth ControlNet. Since the ControlNet already provides the layout of the reference image, the subject encoder can focus more on the subject content.
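As a rough illustration of this injection, the sketch below shows one possible PyTorch form of the ResBlock mapping and of a subject-encoder-attention layer with a zero-initialized output and the coefficient $\beta$. Module names, layer counts, and attention settings are assumptions for illustration, not the released implementation.

```python
# Minimal PyTorch sketch of the subject-encoder-attention (S-E-A) injection.
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Maps concatenated multi-layer CLIP features to the U-Net feature
    dimension (the 'domain shift' step)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        self.body = nn.Sequential(
            nn.LayerNorm(out_dim),
            nn.Linear(out_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)
        return x + self.body(x)


class SubjectEncoderAttention(nn.Module):
    """Extra attention layer inserted *before* the visual-text cross-attention.
    Query: U-Net features of the generated image.
    Key/Value: encoded subject features from the (background-removed) reference.
    The output projection is zero-initialized so the pretrained model is
    unchanged at the start of training."""

    def __init__(self, query_dim: int, subject_dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            query_dim, heads, kdim=subject_dim, vdim=subject_dim, batch_first=True
        )
        self.to_out = nn.Linear(query_dim, query_dim)
        nn.init.zeros_(self.to_out.weight)  # zero init -> no effect initially
        nn.init.zeros_(self.to_out.bias)

    def forward(self, hidden_states, subject_features, beta: float = 1.0):
        # subject_features: multi-layer CLIP tokens concatenated on the channel
        # dimension and mapped by ResBlock(s); background removed by SOD first.
        attn_out, _ = self.attn(hidden_states, subject_features, subject_features)
        # beta adjusts how strongly the subject encoder influences generation.
        return hidden_states + beta * self.to_out(attn_out)
```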

Since the subject encoder provides the general appearance of the specific subject for the generation process, we further propose self-subject-attention, built on the original self-attention layers, for fine subject-identity preservation. Features of the reference image extracted by the pre-trained text-to-image U-Net are injected into the self-attention layers. These reference features provide a refined and detailed reference because they share the same resolution as the features of the generated image. Specifically, the reference image is noised through the diffusion forward process at each time step $t$, and the reference features before each self-attention layer are extracted from the noised reference image, so that they share the same data distribution as the generated-image features at time step $t$. The original self-attention layers are modified into self-subject-attention layers by utilizing the reference features: the features of the generated image are taken as the query, and the concatenation of the generated-image features and the reference-image features is taken as the key and value. To eliminate the influence of the background of the reference image, a Salient Object Detection (SOD) model is used to create a foreground mask, which uses 0 and 1 to indicate background and foreground. The mask can also adjust the scale of the impact of the reference image through a weight strategy, i.e., multiplying the mask by an adjustment coefficient $\omega_{ref}$. Because the mask works as an attention bias, a log function is used as preprocessing.
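The sketch below illustrates one self-subject-attention layer under these rules, assuming the reference features have already been extracted from the noised reference image at the same time step and the SOD foreground mask has been flattened to the reference token resolution; all names and shapes are illustrative.

```python
# Minimal sketch of a self-subject-attention layer.
import math
import torch


def self_subject_attention(q_proj, k_proj, v_proj, x_gen, x_ref, fg_mask,
                           num_heads=8, omega_ref=3.0):
    """
    q_proj/k_proj/v_proj: projection layers (e.g. the pretrained self-attention nn.Linear)
    x_gen:   (B, N_gen, C)  generated-image features before self-attention
    x_ref:   (B, N_ref, C)  reference features at the same layer and time step
    fg_mask: (B, N_ref)     1 for subject (foreground), 0 for background
    """
    B, N_gen, C = x_gen.shape
    head_dim = C // num_heads
    fg_mask = fg_mask.float()  # ensure float for the log/bias arithmetic

    q = q_proj(x_gen)                              # query: generated image only
    kv_input = torch.cat([x_gen, x_ref], dim=1)    # key/value: self + reference
    k, v = k_proj(kv_input), v_proj(kv_input)

    def split(t):  # (B, N, C) -> (B, heads, N, head_dim)
        return t.view(B, -1, num_heads, head_dim).transpose(1, 2)

    q, k, v = split(q), split(k), split(v)

    # Attention bias: self tokens keep weight 1 (bias 0); reference tokens are
    # reweighted by log(omega_ref * mask); background tokens are masked out.
    ref_bias = torch.log(omega_ref * fg_mask.clamp(min=1e-8))
    ref_bias = ref_bias.masked_fill(fg_mask == 0, float("-inf"))
    bias = torch.cat([torch.zeros(B, N_gen, device=x_gen.device), ref_bias], dim=1)
    bias = bias[:, None, None, :]                  # broadcast over heads and queries

    attn = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim) + bias
    out = attn.softmax(dim=-1) @ v
    return out.transpose(1, 2).reshape(B, N_gen, C)
```

Because the query, key, and value projections can simply reuse the pretrained self-attention weights, this refinement introduces no new parameters and can be enabled purely at inference time, which is what makes it training-free and plug-and-play.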


The original classifier-free guidance method is also modified as follows:



where $\mathbf{z}_t$ is the generated image at time step $t$, $\mathbf{c}$ is the condition, $\mathbf{uc}$ is the unconditional (undesired) condition, $\mathbf{r}_{t-\Delta t}$ and $\mathbf{r}_{t+\Delta t'}$ are the diffusion-noised reference images at time steps $t-\Delta t$ and $t+\Delta t'$, $\Delta t$ and $\Delta t'$ are small time-step biases, $\omega_{r}$ and $\omega_{c}$ are the guidance scales, and $\hat{\mathbf{\epsilon}}_t$ is the final output. The first equation emphasizes the guidance of the reference image and the second emphasizes the guidance of the condition, where $p_{r}$ controls the probability of selecting the first one.
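Written out, a two-branch guidance consistent with these definitions could take a form like the following; this is an editorial sketch only, and the exact arrangement of terms may differ from the paper:

$$\hat{\mathbf{\epsilon}}_t = \mathbf{\epsilon}(\mathbf{z}_t, \mathbf{uc}, \mathbf{r}_{t+\Delta t'}) + \omega_{r}\big(\mathbf{\epsilon}(\mathbf{z}_t, \mathbf{uc}, \mathbf{r}_{t-\Delta t}) - \mathbf{\epsilon}(\mathbf{z}_t, \mathbf{uc}, \mathbf{r}_{t+\Delta t'})\big) + \omega_{c}\big(\mathbf{\epsilon}(\mathbf{z}_t, \mathbf{c}, \mathbf{r}_{t-\Delta t}) - \mathbf{\epsilon}(\mathbf{z}_t, \mathbf{uc}, \mathbf{r}_{t-\Delta t})\big)$$

$$\hat{\mathbf{\epsilon}}_t = \mathbf{\epsilon}(\mathbf{z}_t, \mathbf{uc}, \mathbf{r}_{t+\Delta t'}) + \omega_{c}\big(\mathbf{\epsilon}(\mathbf{z}_t, \mathbf{c}, \mathbf{r}_{t+\Delta t'}) - \mathbf{\epsilon}(\mathbf{z}_t, \mathbf{uc}, \mathbf{r}_{t+\Delta t'})\big) + \omega_{r}\big(\mathbf{\epsilon}(\mathbf{z}_t, \mathbf{c}, \mathbf{r}_{t-\Delta t}) - \mathbf{\epsilon}(\mathbf{z}_t, \mathbf{c}, \mathbf{r}_{t+\Delta t'})\big)$$

Here the first (reference-emphasized) branch is selected with probability $p_{r}$ and the second (condition-emphasized) branch otherwise; $\mathbf{r}_{t-\Delta t}$ (less noised) is assumed to act as the stronger reference signal and $\mathbf{r}_{t+\Delta t'}$ (more noised) as the weaker one.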

Visualizations of the self-subject-attention maps

We visualize the attention maps of self-subject-attention at the middle time step (t=25) and the last time step (t=0) of the generation process, with the text "1girl [S*], Sitting at the table with a cup of tea in hands, sunlight streaming through the window". We choose the attention maps at Encoder layers 7, 8 and Decoder layers 4, 5 of the diffusion U-Net, i.e., the layers with a feature resolution of 16×16 when the resolution of the generated image is 512×512. The generated image is shown on the left and the reference image on the right. The attention map appears red in areas with strong influence and blue in areas with weak influence; the red box represents the query. Some of the key attention maps at Decoder layer 5 are shown below. It can be seen that the generated image queries refined subject information from the reference image.

All of the attention maps are visualized as videos:

Results of Text-Controlled Subject-Driven Image Generation on Anime Characters

These results display text-controlled subject-driven image generation focused on anime characters. Both local editing (such as the expression editing in the first row) and global editing (including the scene and action editing in the subsequent five rows) were performed, producing highly detailed images even with complex text inputs. Notably, the generated images accurately preserve the details of the reference images.

Results of Text-Controlled Subject-Driven Image Generation on Natural Images

Our method is evaluated on the DreamBooth dataset, where one image of each subject is used as the reference image. Through the subject encoder and self-subject-attention, a refined reference is provided, which enables DreamTuner to produce high-fidelity images that are consistent with the textual input while retaining crucial subject details, such as the white stripes on the puppy's head, the logos on the bag, and the patterns and text on the can.


Results of Pose-Controlled Subject-Driven Image Generation on Characters

Our method can be combined with ControlNet to extend its applicability to various conditions such as pose. In the following example, only one image is used for DreamTuner fine-tuning, and the pose of the reference image is used as the reference condition. To ensure inter-frame coherence, both the reference image and the previously generated frame are used for self-subject-attention, with reference weights of 10 and 1, respectively.
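A small sketch of how such multi-reference weighting can be expressed, extending the attention-bias idea from the self-subject-attention sketch above; the helper name and mask handling are assumptions for illustration.

```python
# Illustrative helper: build a self-subject-attention bias for several
# references with different weights, e.g. the subject reference image
# (omega = 10) and the previously generated frame (omega = 1).
import torch


def multi_reference_bias(n_gen: int, refs: list[tuple[torch.Tensor, float]]) -> torch.Tensor:
    """refs: list of (foreground_mask, omega) pairs, each mask of shape (B, N_ref),
    with 1 for tokens to attend to and 0 for tokens to ignore.
    Returns a bias to add to the attention logits, matching a key/value
    concatenation order of [generated tokens, ref_1 tokens, ref_2 tokens, ...]."""
    batch = refs[0][0].shape[0]
    parts = [torch.zeros(batch, n_gen)]  # self tokens keep weight 1 (log 1 = 0)
    for mask, omega in refs:
        mask = mask.float()
        bias = torch.log(omega * mask.clamp(min=1e-8))
        parts.append(bias.masked_fill(mask == 0, float("-inf")))
    return torch.cat(parts, dim=1)


# Example: subject reference weighted 10x, previous frame weighted 1x.
ref_mask = torch.ones(1, 256)         # SOD foreground mask of the reference (flattened)
prev_frame_mask = torch.ones(1, 256)  # previous generated frame, used in full
bias = multi_reference_bias(n_gen=256, refs=[(ref_mask, 10.0), (prev_frame_mask, 1.0)])
```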