While diffusion models are powerful at generating high-quality, diverse synthetic data for object-centric
tasks, existing methods struggle with scene-aware tasks such as Visual Question Answering (VQA) and
Human-Object Interaction (HOI) Reasoning, where it is critical that generated images preserve scene attributes
consistent with a multimodal context, i.e.~a reference image with an accompanying text guidance query.
To address this, we introduce Hummingbird, the first diffusion-based image generator which, given
a multimodal context, generates highly diverse images w.r.t. the reference image while ensuring high
fidelity by accurately preserving scene attributes, such as object interactions and spatial relationships,
specified in the text guidance. Hummingbird employs a novel Multimodal Context Evaluator that simultaneously
optimizes our proposed Global Semantic and Fine-grained Consistency Rewards to ensure that generated images
preserve the scene attributes of the reference image in relation to the text guidance while maintaining
diversity. Since Hummingbird is the first model to address the task of maintaining both diversity and fidelity
given a multimodal context, we also introduce a new benchmark formulation incorporating the MME Perception and
Bongard HOI datasets. Benchmark experiments show that Hummingbird outperforms all existing methods, achieving
superior fidelity while maintaining diversity, and validate its potential as a robust multimodal
context-aligned image generator for complex visual tasks.