Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment

Microsoft, Stony Brook University
ICLR 2025

*Indicates Equal Contribution

Hummingbird aligns the generated image with the multimodal context input (reference image + text guidance), ensuring the synthetic image is diverse w.r.t. the reference image while exhibiting high fidelity (i.e., preserving the scene attributes of the reference image in relation to the text guidance). (a-f) By doing so, for VQA and HOI Reasoning, Hummingbird enables the answer to the question in the text guidance to remain consistent between the reference and generated images. (g) Hummingbird also preserves both diversity and fidelity without the trade-off exhibited by existing methods.

Abstract

While diffusion models are powerful in generating high-quality, diverse synthetic data for object-centric tasks, existing methods struggle with scene-aware tasks such as Visual Question Answering (VQA) and Human-Object Interaction (HOI) Reasoning, where it is critical to preserve scene attributes in generated images consistent with a multimodal context, i.e., a reference image with an accompanying text guidance query. To address this, we introduce Hummingbird, the first diffusion-based image generator which, given a multimodal context, generates highly diverse images w.r.t. the reference image while ensuring high fidelity by accurately preserving scene attributes, such as object interactions and spatial relationships from the text guidance. Hummingbird employs a novel Multimodal Context Evaluator that simultaneously optimizes our formulated Global Semantic and Fine-grained Consistency Rewards to ensure generated images preserve the scene attributes of reference images in relation to the text guidance while maintaining diversity. As the first model to address the task of maintaining both diversity and fidelity given a multimodal context, we introduce a new benchmark formulation incorporating MME Perception and Bongard HOI datasets. Benchmark experiments show that Hummingbird outperforms all existing methods by achieving superior fidelity while maintaining diversity, validating Hummingbird's potential as a robust multimodal context-aligned image generator in complex visual tasks.

Method Overview


Given text guidance \( \mathbf{g} \) and reference image \( \mathbf{x} \) (together, the multimodal context \( \mathcal{M} \)), Hummingbird crafts an instruction prompt \( p \) to feed to an MLLM and obtain the Context Description \( \mathcal{C} \). It then embeds \( \mathbf{x} \) and \( \mathcal{C} \) via CLIP and feeds them to the UNet Denoiser of SDXL to generate image \( \mathbf{\hat{x}} \). To improve the fidelity of \( \mathbf{\hat{x}} \) with respect to \( \mathcal{M} \) while preserving diversity, Hummingbird introduces a Multimodal Context Evaluator that simultaneously maximizes two novel rewards, the Global Semantic and Fine-grained Consistency Rewards, to align \( \mathbf{\hat{x}} \) with the scene attributes provided in \( \mathcal{M} \).
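To make the pipeline concrete, below is a minimal, hedged Python sketch of the inference flow using off-the-shelf Hugging Face components. It is an illustration rather than the released implementation: LLaVA stands in for the MLLM, the instruction prompt template and file names are assumptions, and the SDXL img2img pipeline approximates Hummingbird's CLIP-embedding-based conditioning of the UNet on the reference image and Context Description.

# Minimal sketch of the inference flow above (illustrative assumptions, not the
# released code): LLaVA stands in for the MLLM, the prompt template is invented,
# and SDXL img2img approximates the CLIP-embedding conditioning on the reference image.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration
from diffusers import StableDiffusionXLImg2ImgPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Multimodal context M: reference image x and text guidance g.
reference = Image.open("reference.jpg").convert("RGB")
guidance = "Is the person riding the bicycle?"  # example text guidance

# Step 1: instruction prompt p -> MLLM -> Context Description C.
mllm_id = "llava-hf/llava-1.5-7b-hf"
mllm = LlavaForConditionalGeneration.from_pretrained(mllm_id, torch_dtype=torch.float16).to(device)
mllm_processor = AutoProcessor.from_pretrained(mllm_id)
prompt = ("USER: <image>\nDescribe the scene attributes of this image that are "
          f"relevant to the question: {guidance}\nASSISTANT:")  # assumed template
inputs = mllm_processor(images=reference, text=prompt, return_tensors="pt").to(device, torch.float16)
output_ids = mllm.generate(**inputs, max_new_tokens=96)
context_description = mllm_processor.decode(output_ids[0], skip_special_tokens=True)
context_description = context_description.split("ASSISTANT:")[-1].strip()  # keep only C

# Step 2: condition SDXL on the reference image and C to generate x_hat.
sdxl = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to(device)
generated = sdxl(prompt=context_description, image=reference, strength=0.7).images[0]
generated.save("generated.png")  # x_hat, to be scored by the Multimodal Context Evaluator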

Fine-tuning with Multimodal Context Rewards


Hummingbird's Multimodal Context Evaluator leverages the pre-trained BLIP-2 Q-Former. It simultaneously maximizes our novel Global Semantic and Fine-grained Consistency Rewards to align the generated image \( \mathbf{\hat{x}} \) with the Context Description \( \mathcal{C} \) corresponding to the multimodal context \( \mathcal{M} \).
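To make the reward computation concrete, here is a minimal, hedged sketch of how two such scores could be obtained from BLIP-2's Q-Former using the LAVIS library. It assumes the Global Semantic Reward behaves like an ITC-style contrastive similarity and the Fine-grained Consistency Reward like an ITM-style matching probability, and it uses the public blip2_image_text_matching checkpoint with placeholder inputs; the paper's exact reward formulations and weighting may differ. The snippet shows only forward scoring; during fine-tuning these scores are maximized by backpropagating through the frozen evaluator into the diffusion model.

# Minimal sketch of scoring the two rewards with BLIP-2's Q-Former via LAVIS.
# Assumptions: ITC head ~ Global Semantic Reward, ITM head ~ Fine-grained
# Consistency Reward; checkpoint, inputs, and weighting are illustrative.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
evaluator, vis_proc, txt_proc = load_model_and_preprocess(
    name="blip2_image_text_matching", model_type="pretrain", device=device, is_eval=True)

generated = Image.open("generated.png").convert("RGB")  # x_hat from the SDXL denoiser
context_description = "a cyclist riding a bicycle down a city street"  # C (illustrative)

image = vis_proc["eval"](generated).unsqueeze(0).to(device)
text = txt_proc["eval"](context_description)

# Global Semantic Reward (assumed ITC head): contrastive similarity between the
# Q-Former's image queries and the text embedding.
global_semantic_reward = evaluator({"image": image, "text_input": text}, match_head="itc")

# Fine-grained Consistency Reward (assumed ITM head): matching probability from
# the Q-Former, which jointly attends over image queries and text tokens.
itm_logits = evaluator({"image": image, "text_input": text}, match_head="itm")
fine_grained_reward = torch.softmax(itm_logits, dim=1)[:, 1]

# Fine-tuning would maximize a weighted sum of both rewards; the weights here
# are hyperparameters, not values taken from the paper.
total_reward = global_semantic_reward + fine_grained_reward
print(float(total_reward))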

Qualitative Comparison with SOTA Methods

Generated image comparison between Hummingbird and SOTA methods on MME Perception and HOI Reasoning. Hummingbird achieves the highest fidelity while maintaining high diversity.

Diversity Analysis

Hummingbird exhibits high diversity across different random seeds while producing high-fidelity images each time w.r.t. the multimodal context (reference image + text guidance).

Effectiveness of Fine-tuning


Fine-tuning with Multimodal Context Rewards improves fidelity in generated images.

BibTeX

@inproceedings{le2025hummingbird,
  title={Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment},
  author={Minh-Quan Le and Gaurav Mittal and Tianjian Meng and A S M Iftekhar and Vishwas Suryanarayanan and Barun Patra and Dimitris Samaras and Mei Chen},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=6kPBThI6ZJ}
}