Method Overview
Given text guidance \( \mathbf{g} \) and a reference image \( \mathbf{x} \), which together form the multimodal context \( \mathcal{M} \), Hummingbird crafts an instruction prompt \( p \) and feeds it to an MLLM to obtain a Context Description \( \mathcal{C} \). It then embeds \( \mathbf{x} \) and \( \mathcal{C} \) via CLIP and feeds the embeddings to the UNet Denoiser of SDXL to generate an image \( \mathbf{\hat{x}} \). To improve the fidelity of \( \mathbf{\hat{x}} \) with respect to \( \mathcal{M} \) while preserving diversity, Hummingbird introduces a Multimodal Context Evaluator to simultaneously maximize two novel rewards, the Global Semantic and Fine-Grained Consistency Rewards, which align \( \mathbf{\hat{x}} \) with the scene attributes provided in \( \mathcal{M} \).
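The data flow described above can be summarized in a short Python sketch. It is illustrative only: the function `hummingbird_forward`, the `HummingbirdOutput` container, and the five callables (`craft_prompt`, `mllm`, `clip_embed`, `unet_denoise`, `context_evaluator`) are hypothetical placeholders rather than names from the Hummingbird code, and the sketch assumes that the MLLM receives both \( p \) and \( \mathbf{x} \) and that the evaluator returns the two rewards as a pair.

```python
from typing import Any, Callable, NamedTuple, Tuple


class HummingbirdOutput(NamedTuple):
    """Hypothetical container for the quantities named in the overview."""
    generated_image: Any          # x_hat
    context_description: Any      # C
    rewards: Tuple[float, float]  # (global semantic, fine-grained consistency)


def hummingbird_forward(
    g: Any,
    x: Any,
    craft_prompt: Callable[[Any], Any],
    mllm: Callable[[Any, Any], Any],
    clip_embed: Callable[[Any], Any],
    unet_denoise: Callable[[Any], Any],
    context_evaluator: Callable[[Any, Any, Any], Tuple[float, float]],
) -> HummingbirdOutput:
    """Trace one pass through the pipeline; every callable is a stand-in."""
    # Craft the instruction prompt p from the text guidance g.
    p = craft_prompt(g)
    # Assumption: the MLLM sees both p and the reference image x.
    context_description = mllm(p, x)
    # Embed the reference image and the Context Description with CLIP.
    conditioning = (clip_embed(x), clip_embed(context_description))
    # The SDXL UNet denoiser generates x_hat from the conditioning.
    x_hat = unet_denoise(conditioning)
    # Assumption: the evaluator scores x_hat against the context M = (g, x)
    # and returns the two rewards as a pair.
    rewards = context_evaluator(x_hat, g, x)
    return HummingbirdOutput(x_hat, context_description, rewards)
```

In training, the two returned rewards would be the quantities to maximize; how they are computed and combined is not specified in this overview, so the sketch leaves that to the placeholder `context_evaluator`.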