Upload QwenImageLayeredModularPipeline (#3), opened by YiYiXu (HF Staff)

README.md (added):
---
library_name: diffusers
tags:
- modular-diffusers
- diffusers
- qwenimage-layered
- text-to-image
---
This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

**Pipeline Type**: QwenImageLayeredAutoBlocks

**Description**: Auto modular pipeline for layered denoising tasks using QwenImage-Layered.

This pipeline uses a 4-block architecture that can be customized and extended.

## Example Usage

[TODO]
## Pipeline Architecture

This modular pipeline is composed of the following blocks:

1. **text_encoder** (`QwenImageLayeredTextEncoderStep`)
   - Text encoder step that encodes the text prompt; if no prompt is provided, one is generated from the input image.
   - *resize*: `QwenImageLayeredResizeStep`
     - Image resize step that resizes the image to a target area (set by the user's `resolution` parameter) while maintaining the aspect ratio.
   - *get_image_prompt*: `QwenImageLayeredGetImagePromptStep`
     - Auto-caption step that generates a text prompt from the input image if none is provided.
   - *encode*: `QwenImageTextEncoderStep`
     - Text encoder step that generates text embeddings to guide the image generation.
2. **vae_encoder** (`QwenImageLayeredVaeEncoderStep`)
   - VAE encoder step that encodes the image inputs into their latent representations.
   - *resize*: `QwenImageLayeredResizeStep`
     - Image resize step that resizes the image to a target area (set by the user's `resolution` parameter) while maintaining the aspect ratio.
   - *preprocess*: `QwenImageEditProcessImagesInputStep`
     - Image preprocessing step. Images need to be resized first.
   - *encode*: `QwenImageVaeEncoderStep`
     - VAE encoder step that converts `processed_image` into its latent representation `image_latents`.
   - *permute*: `QwenImageLayeredPermuteLatentsStep`
     - Permutes image latents from (B, C, 1, H, W) to (B, 1, C, H, W) for layered packing.
3. **denoise** (`QwenImageLayeredCoreDenoiseStep`)
   - Core denoising workflow for the QwenImage-Layered img2img task.
   - *input*: `QwenImageLayeredInputStep`
     - Input step that prepares the inputs for the layered denoising step.
   - *prepare_latents*: `QwenImageLayeredPrepareLatentsStep`
     - Prepares the initial random noise (B, layers+1, C, H, W) for the generation process.
   - *set_timesteps*: `QwenImageLayeredSetTimestepsStep`
     - Sets the timesteps for QwenImage-Layered with a custom `mu` calculation based on `image_latents`.
   - *prepare_rope_inputs*: `QwenImageLayeredRoPEInputsStep`
     - Prepares the RoPE inputs for the denoising process. Should be placed after the `prepare_latents` step.
   - *denoise*: `QwenImageLayeredDenoiseStep`
     - Denoise step that iteratively denoises the latents.
   - *after_denoise*: `QwenImageLayeredAfterDenoiseStep`
     - Unpacks latents from (B, seq, C*4) to (B, C, layers+1, H, W) after denoising.
4. **decode** (`QwenImageLayeredDecoderStep`)
   - Decodes unpacked latents (B, C, layers+1, H, W) into layer images.
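The shape bookkeeping across the blocks above can be sketched with plain array operations. This is an illustrative sketch only: the real `QwenImageLayeredPachifier` may order axes differently, but the idea is the same, a permute of the frame and channel axes, then folding 2x2 spatial patches into the channel axis to form a token sequence, and inverting that after denoising.

```python
import numpy as np

# Illustrative sizes only; not tied to any actual checkpoint.
B, C, layers, H, W = 1, 16, 4, 8, 8
F = layers + 1  # one composite plus the per-layer latents

# vae_encoder "permute" step: (B, C, 1, H, W) -> (B, 1, C, H, W)
image_latents = np.random.randn(B, C, 1, H, W).astype(np.float32)
permuted = image_latents.transpose(0, 2, 1, 3, 4)
assert permuted.shape == (B, 1, C, H, W)

# Packing for the transformer: fold 2x2 spatial patches into the
# channel axis, producing a (B, seq, C*4) token sequence.
x = np.random.randn(B, F, C, H, W).astype(np.float32)
packed = (
    x.reshape(B, F, C, H // 2, 2, W // 2, 2)
    .transpose(0, 1, 3, 5, 2, 4, 6)
    .reshape(B, F * (H // 2) * (W // 2), C * 4)
)

# "after_denoise" unpack: (B, seq, C*4) -> (B, C, layers+1, H, W)
unpacked = (
    packed.reshape(B, F, H // 2, W // 2, C, 2, 2)
    .transpose(0, 4, 1, 2, 5, 3, 6)
    .reshape(B, C, F, H, W)
)
assert unpacked.shape == (B, C, layers + 1, H, W)
# The pack/unpack pair is lossless (up to the frame/channel permute).
assert np.allclose(unpacked, x.transpose(0, 2, 1, 3, 4))
```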
## Model Components

1. image_resize_processor (`VaeImageProcessor`)
2. text_encoder (`Qwen2_5_VLForConditionalGeneration`)
3. processor (`Qwen2VLProcessor`)
4. tokenizer (`Qwen2Tokenizer`)
5. guider (`ClassifierFreeGuidance`)
6. image_processor (`VaeImageProcessor`)
7. vae (`AutoencoderKLQwenImage`)
8. pachifier (`QwenImageLayeredPachifier`)
9. scheduler (`FlowMatchEulerDiscreteScheduler`)
10. transformer (`QwenImageTransformer2DModel`)

## Input/Output Specification
**Inputs:**

- `image` (`Image | list`): Reference image(s) for denoising. Can be a single image or a list of images.
- `resolution` (`int`, *optional*, defaults to `640`): The target area to resize the image to; can be `1024` or `640`.
- `prompt` (`str`, *optional*): The prompt or prompts to guide image generation.
- `use_en_prompt` (`bool`, *optional*, defaults to `False`): Whether to use the English prompt template.
- `negative_prompt` (`str`, *optional*): The prompt or prompts not to guide the image generation.
- `max_sequence_length` (`int`, *optional*, defaults to `1024`): Maximum sequence length for prompt encoding.
- `generator` (`Generator`, *optional*): Torch generator for deterministic generation.
- `num_images_per_prompt` (`int`, *optional*, defaults to `1`): The number of images to generate per prompt.
- `latents` (`Tensor`, *optional*): Pre-generated noisy latents for image generation.
- `layers` (`int`, *optional*, defaults to `4`): Number of layers to extract from the image.
- `num_inference_steps` (`int`, *optional*, defaults to `50`): The number of denoising steps.
- `sigmas` (`list`, *optional*): Custom sigmas for the denoising process.
- `attention_kwargs` (`dict`, *optional*): Additional kwargs for attention processors.
- `**denoiser_input_fields` (*optional*): Conditional model inputs for the denoiser, e.g. `prompt_embeds`, `negative_prompt_embeds`, etc.
- `output_type` (`str`, *optional*, defaults to `pil`): Output format: `'pil'`, `'np'`, or `'pt'`.
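Note that `resolution` sets a target pixel *area* rather than exact output dimensions. A minimal sketch of this kind of aspect-preserving resize (illustrative only; the actual `QwenImageLayeredResizeStep` logic, including the rounding multiple assumed to be 32 here, may differ):

```python
import math

def resize_to_area(width: int, height: int, resolution: int = 640,
                   multiple: int = 32) -> tuple:
    """Pick output dimensions with roughly resolution**2 pixels,
    preserving aspect ratio and snapping each side to a multiple
    (32 is an assumption, not the documented behavior)."""
    scale = math.sqrt((resolution * resolution) / (width * height))
    new_w = max(multiple, round(width * scale / multiple) * multiple)
    new_h = max(multiple, round(height * scale / multiple) * multiple)
    return new_w, new_h

print(resize_to_area(1920, 1080))      # landscape input, 640 target area
print(resize_to_area(512, 512, 1024))  # square input, 1024 target area
```

The output keeps the input's aspect ratio while landing near `resolution**2` total pixels, which is why a 1920x1080 input comes back much wider than tall.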
**Outputs:**

- `images` (`list`): Generated images.