Upload QwenImageLayeredModularPipeline

#3
by YiYiXu HF Staff - opened
Files changed (1)
  1. README.md +93 -0
README.md CHANGED
@@ -0,0 +1,93 @@
---
library_name: diffusers
tags:
- modular-diffusers
- diffusers
- qwenimage-layered
- text-to-image
---

This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

**Pipeline Type**: QwenImageLayeredAutoBlocks

**Description**: Auto Modular pipeline for layered denoising tasks using QwenImage-Layered.

This pipeline uses a 4-block architecture that can be customized and extended.

## Example Usage

[TODO]

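Pending the official example, here is a minimal loading sketch assuming the standard modular-diffusers entry point (`ModularPipeline.from_pretrained`); the repo id, dtype, prompt, and input file below are placeholders, not this repository's verified values:

```python
import torch
from PIL import Image
from diffusers import ModularPipeline

# Placeholder repo id -- substitute this repository's actual id.
pipe = ModularPipeline.from_pretrained("<repo-id>", trust_remote_code=True)
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")

output = pipe(
    image=Image.open("input.png"),  # image to decompose into layers
    prompt="a product shot with separable foreground and background",
    layers=4,
    resolution=640,
    num_inference_steps=50,
    output="images",
)
```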
## Pipeline Architecture

This modular pipeline is composed of the following blocks:

1. **text_encoder** (`QwenImageLayeredTextEncoderStep`)
   - QwenImage-Layered text encoder step that encodes the text prompt; if no prompt is provided, one is generated from the input image.
   - *resize*: `QwenImageLayeredResizeStep`
     - Image resize step that resizes the image to a target area (defined by the user-provided `resolution` parameter) while maintaining the aspect ratio.
   - *get_image_prompt*: `QwenImageLayeredGetImagePromptStep`
     - Auto-caption step that generates a text prompt from the input image if none is provided.
   - *encode*: `QwenImageTextEncoderStep`
     - Text encoder step that generates text embeddings to guide the image generation.
2. **vae_encoder** (`QwenImageLayeredVaeEncoderStep`)
   - VAE encoder step that encodes the image inputs into their latent representations.
   - *resize*: `QwenImageLayeredResizeStep`
     - Image resize step that resizes the image to a target area (defined by the user-provided `resolution` parameter) while maintaining the aspect ratio.
   - *preprocess*: `QwenImageEditProcessImagesInputStep`
     - Image preprocess step. Images need to be resized first.
   - *encode*: `QwenImageVaeEncoderStep`
     - VAE encoder step that converts `processed_image` into latent representations (`image_latents`).
   - *permute*: `QwenImageLayeredPermuteLatentsStep`
     - Permutes image latents from (B, C, 1, H, W) to (B, 1, C, H, W) for layered packing.
3. **denoise** (`QwenImageLayeredCoreDenoiseStep`)
   - Core denoising workflow for the QwenImage-Layered img2img task.
   - *input*: `QwenImageLayeredInputStep`
     - Input step that prepares the inputs for the layered denoising step.
   - *prepare_latents*: `QwenImageLayeredPrepareLatentsStep`
     - Prepares initial random noise of shape (B, layers+1, C, H, W) for the generation process.
   - *set_timesteps*: `QwenImageLayeredSetTimestepsStep`
     - Sets the timesteps for QwenImage-Layered with a custom mu calculation based on `image_latents`.
   - *prepare_rope_inputs*: `QwenImageLayeredRoPEInputsStep`
     - Prepares the RoPE inputs for the denoising process. Should be placed after the prepare_latents step.
   - *denoise*: `QwenImageLayeredDenoiseStep`
     - Denoise step that iteratively denoises the latents.
   - *after_denoise*: `QwenImageLayeredAfterDenoiseStep`
     - Unpacks latents from (B, seq, C*4) to (B, C, layers+1, H, W) after denoising.
4. **decode** (`QwenImageLayeredDecoderStep`)
   - Decodes unpacked latents (B, C, layers+1, H, W) into layer images.

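The latent bookkeeping above can be sketched with plain arrays. This is shapes-only illustration: the channel count, the 2x2 patch size, and the ordering inside the packed C*4 dimension are assumptions, not the pipeline's verified internals:

```python
import numpy as np

B, C, H, W, layers = 1, 16, 48, 48, 4  # toy sizes; C=16 is an assumption

# vae_encoder/permute: (B, C, 1, H, W) -> (B, 1, C, H, W)
image_latents = np.random.randn(B, C, 1, H, W)
permuted = image_latents.transpose(0, 2, 1, 3, 4)
assert permuted.shape == (B, 1, C, H, W)

# denoise/prepare_latents: noise for every layer plus the composite image
latents = np.random.randn(B, layers + 1, C, H, W)

# packing into a token sequence with assumed 2x2 patches: (B, seq, C*4)
patched = latents.reshape(B, layers + 1, C, H // 2, 2, W // 2, 2)
packed = patched.transpose(0, 1, 3, 5, 2, 4, 6).reshape(
    B, (layers + 1) * (H // 2) * (W // 2), C * 4
)

# after_denoise: unpack back to (B, C, layers+1, H, W)
unpacked = (
    packed.reshape(B, layers + 1, H // 2, W // 2, C, 2, 2)
    .transpose(0, 4, 1, 2, 5, 3, 6)
    .reshape(B, C, layers + 1, H, W)
)
# the round trip recovers the values with layer/channel axes swapped
assert np.allclose(unpacked, latents.transpose(0, 2, 1, 3, 4))
```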
## Model Components

1. image_resize_processor (`VaeImageProcessor`)
2. text_encoder (`Qwen2_5_VLForConditionalGeneration`)
3. processor (`Qwen2VLProcessor`)
4. tokenizer (`Qwen2Tokenizer`): The tokenizer to use
5. guider (`ClassifierFreeGuidance`)
6. image_processor (`VaeImageProcessor`)
7. vae (`AutoencoderKLQwenImage`)
8. pachifier (`QwenImageLayeredPachifier`)
9. scheduler (`FlowMatchEulerDiscreteScheduler`)
10. transformer (`QwenImageTransformer2DModel`)

## Input/Output Specification

**Inputs:**

- `image` (`Image | list`): Reference image(s) for denoising. Can be a single image or a list of images.
- `resolution` (`int`, *optional*, defaults to `640`): The target area to resize the image to; can be `640` or `1024`.
- `prompt` (`str`, *optional*): The prompt or prompts to guide image generation.
- `use_en_prompt` (`bool`, *optional*, defaults to `False`): Whether to use the English prompt template.
- `negative_prompt` (`str`, *optional*): The prompt or prompts not to guide the image generation.
- `max_sequence_length` (`int`, *optional*, defaults to `1024`): Maximum sequence length for prompt encoding.
- `generator` (`Generator`, *optional*): Torch generator for deterministic generation.
- `num_images_per_prompt` (`int`, *optional*, defaults to `1`): The number of images to generate per prompt.
- `latents` (`Tensor`, *optional*): Pre-generated noisy latents for image generation.
- `layers` (`int`, *optional*, defaults to `4`): Number of layers to extract from the image.
- `num_inference_steps` (`int`, *optional*, defaults to `50`): The number of denoising steps.
- `sigmas` (`list`, *optional*): Custom sigmas for the denoising process.
- `attention_kwargs` (`dict`, *optional*): Additional kwargs for attention processors.
- `**denoiser_input_fields` (*optional*): Conditional model inputs for the denoiser, e.g. `prompt_embeds`, `negative_prompt_embeds`, etc.
- `output_type` (`str`, *optional*, defaults to `pil`): Output format: `'pil'`, `'np'`, or `'pt'`.

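Since `resolution` sets a target area rather than fixed dimensions, aspect-preserving resizing can be sketched as follows. The rounding granularity (`multiple=16`) is an illustrative assumption, not the pipeline's exact rule:

```python
import math


def resize_to_area(width: int, height: int, resolution: int = 640, multiple: int = 16):
    """Scale (width, height) so the area is ~resolution**2, keeping aspect ratio.

    Snapping to `multiple` is an assumption for illustration; the pipeline's
    actual rounding behavior may differ.
    """
    scale = math.sqrt((resolution * resolution) / (width * height))
    new_w = max(multiple, round(width * scale / multiple) * multiple)
    new_h = max(multiple, round(height * scale / multiple) * multiple)
    return new_w, new_h


print(resize_to_area(1280, 720))  # (848, 480): area ~= 640**2, aspect preserved
```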
91
+ **Outputs:**
92
+
93
+ - `images` (`list`): Generated images.