Upload QwenImageLayeredModularPipeline (#3), opened by YiYiXu (HF Staff)

README.md (added):
---
library_name: diffusers
tags:
- modular-diffusers
- diffusers
- qwenimage-layered
- text-to-image
---
This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

**Pipeline Type**: QwenImageLayeredAutoBlocks

**Description**: Auto modular pipeline for layered denoising tasks using QwenImage-Layered.

This pipeline uses a 4-block architecture that can be customized and extended.

## Example Usage

[TODO]
## Pipeline Architecture

This modular pipeline is composed of the following blocks:

1. **text_encoder** (`QwenImageLayeredTextEncoderStep`)
   - Text encoder step that encodes the text prompt; if no prompt is provided, one is generated from the input image.
   - *resize*: `QwenImageLayeredResizeStep`
     - Image resize step that resizes the image to a target area (set by the user's `resolution` parameter) while maintaining the aspect ratio.
   - *get_image_prompt*: `QwenImageLayeredGetImagePromptStep`
     - Auto-caption step that generates a text prompt from the input image if none is provided.
   - *encode*: `QwenImageTextEncoderStep`
     - Text encoder step that generates text embeddings to guide the image generation.
2. **vae_encoder** (`QwenImageLayeredVaeEncoderStep`)
   - VAE encoder step that encodes the image inputs into their latent representations.
   - *resize*: `QwenImageLayeredResizeStep`
     - Image resize step that resizes the image to a target area (set by the user's `resolution` parameter) while maintaining the aspect ratio.
   - *preprocess*: `QwenImageEditProcessImagesInputStep`
     - Image preprocessing step. Images need to be resized first.
   - *encode*: `QwenImageVaeEncoderStep`
     - VAE encoder step that converts `processed_image` into its latent representation `image_latents`.
   - *permute*: `QwenImageLayeredPermuteLatentsStep`
     - Permutes image latents from (B, C, 1, H, W) to (B, 1, C, H, W) for layered packing.
3. **denoise** (`QwenImageLayeredCoreDenoiseStep`)
   - Core denoising workflow for the QwenImage-Layered img2img task.
   - *input*: `QwenImageLayeredInputStep`
     - Input step that prepares the inputs for the layered denoising step.
   - *prepare_latents*: `QwenImageLayeredPrepareLatentsStep`
     - Prepares the initial random noise (B, layers+1, C, H, W) for the generation process.
   - *set_timesteps*: `QwenImageLayeredSetTimestepsStep`
     - Sets the timesteps for QwenImage-Layered with a custom `mu` calculation based on `image_latents`.
   - *prepare_rope_inputs*: `QwenImageLayeredRoPEInputsStep`
     - Prepares the RoPE inputs for the denoising process. Should be placed after the `prepare_latents` step.
   - *denoise*: `QwenImageLayeredDenoiseStep`
     - Denoise step that iteratively denoises the latents.
   - *after_denoise*: `QwenImageLayeredAfterDenoiseStep`
     - Unpacks latents from (B, seq, C*4) to (B, C, layers+1, H, W) after denoising.
4. **decode** (`QwenImageLayeredDecoderStep`)
   - Decodes unpacked latents (B, C, layers+1, H, W) into layer images.
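The shape bookkeeping across the blocks above can be sketched with plain array operations. This is an illustrative sketch only: the real `QwenImageLayeredPachifier` may order axes differently, but the idea is the same, a permute of the frame and channel axes, then folding 2x2 spatial patches into the channel axis to form a token sequence, and inverting that after denoising.

```python
import numpy as np

# Illustrative sizes only; not tied to any actual checkpoint.
B, C, layers, H, W = 1, 16, 4, 8, 8
F = layers + 1  # one composite plus the per-layer latents

# vae_encoder "permute" step: (B, C, 1, H, W) -> (B, 1, C, H, W)
image_latents = np.random.randn(B, C, 1, H, W).astype(np.float32)
permuted = image_latents.transpose(0, 2, 1, 3, 4)
assert permuted.shape == (B, 1, C, H, W)

# Packing for the transformer: fold 2x2 spatial patches into the
# channel axis, producing a (B, seq, C*4) token sequence.
x = np.random.randn(B, F, C, H, W).astype(np.float32)
packed = (
    x.reshape(B, F, C, H // 2, 2, W // 2, 2)
    .transpose(0, 1, 3, 5, 2, 4, 6)
    .reshape(B, F * (H // 2) * (W // 2), C * 4)
)

# "after_denoise" unpack: (B, seq, C*4) -> (B, C, layers+1, H, W)
unpacked = (
    packed.reshape(B, F, H // 2, W // 2, C, 2, 2)
    .transpose(0, 4, 1, 2, 5, 3, 6)
    .reshape(B, C, F, H, W)
)
assert unpacked.shape == (B, C, layers + 1, H, W)
# The pack/unpack pair is lossless (up to the frame/channel permute).
assert np.allclose(unpacked, x.transpose(0, 2, 1, 3, 4))
```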
## Model Components

1. image_resize_processor (`VaeImageProcessor`)
2. text_encoder (`Qwen2_5_VLForConditionalGeneration`)
3. processor (`Qwen2VLProcessor`)
4. tokenizer (`Qwen2Tokenizer`)
5. guider (`ClassifierFreeGuidance`)
6. image_processor (`VaeImageProcessor`)
7. vae (`AutoencoderKLQwenImage`)
8. pachifier (`QwenImageLayeredPachifier`)
9. scheduler (`FlowMatchEulerDiscreteScheduler`)
10. transformer (`QwenImageTransformer2DModel`)

## Input/Output Specification
**Inputs:**

- `image` (`Image | list`): Reference image(s) for denoising. Can be a single image or a list of images.
- `resolution` (`int`, *optional*, defaults to `640`): The target area to resize the image to; can be `1024` or `640`.
- `prompt` (`str`, *optional*): The prompt or prompts to guide image generation.
- `use_en_prompt` (`bool`, *optional*, defaults to `False`): Whether to use the English prompt template.
- `negative_prompt` (`str`, *optional*): The prompt or prompts not to guide the image generation.
- `max_sequence_length` (`int`, *optional*, defaults to `1024`): Maximum sequence length for prompt encoding.
- `generator` (`Generator`, *optional*): Torch generator for deterministic generation.
- `num_images_per_prompt` (`int`, *optional*, defaults to `1`): The number of images to generate per prompt.
- `latents` (`Tensor`, *optional*): Pre-generated noisy latents for image generation.
- `layers` (`int`, *optional*, defaults to `4`): Number of layers to extract from the image.
- `num_inference_steps` (`int`, *optional*, defaults to `50`): The number of denoising steps.
- `sigmas` (`list`, *optional*): Custom sigmas for the denoising process.
- `attention_kwargs` (`dict`, *optional*): Additional kwargs for attention processors.
- `**denoiser_input_fields` (*optional*): Conditional model inputs for the denoiser, e.g. `prompt_embeds`, `negative_prompt_embeds`, etc.
- `output_type` (`str`, *optional*, defaults to `pil`): Output format: `'pil'`, `'np'`, or `'pt'`.
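Note that `resolution` sets a target pixel *area* rather than exact output dimensions. A minimal sketch of this kind of aspect-preserving resize (illustrative only; the actual `QwenImageLayeredResizeStep` logic, including the rounding multiple assumed to be 32 here, may differ):

```python
import math

def resize_to_area(width: int, height: int, resolution: int = 640,
                   multiple: int = 32) -> tuple:
    """Pick output dimensions with roughly resolution**2 pixels,
    preserving aspect ratio and snapping each side to a multiple
    (32 is an assumption, not the documented behavior)."""
    scale = math.sqrt((resolution * resolution) / (width * height))
    new_w = max(multiple, round(width * scale / multiple) * multiple)
    new_h = max(multiple, round(height * scale / multiple) * multiple)
    return new_w, new_h

print(resize_to_area(1920, 1080))      # landscape input, 640 target area
print(resize_to_area(512, 512, 1024))  # square input, 1024 target area
```

The output keeps the input's aspect ratio while landing near `resolution**2` total pixels, which is why a 1920x1080 input comes back much wider than tall.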
**Outputs:**

- `images` (`list`): Generated images.