ICML 2026

PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards

¹Microsoft ²Stony Brook University *Equal contribution
PISCES qualitative teaser comparing baseline and OT-aligned post-training

PISCES aligns annotation-free reward supervision with the real-video distribution at both global and token levels, improving visual quality, temporal coherence, and prompt fidelity.

Abstract

Text-to-video (T2V) generation aims to synthesize videos with high visual quality and temporal consistency that are semantically aligned with input text. Reward-based post-training has emerged as a promising direction to improve the quality and semantic alignment of generated videos. However, recent methods either rely on large-scale human preference annotations or operate on misaligned embeddings from pre-trained vision-language models, leading to limited scalability or suboptimal supervision.

We present PISCES, an annotation-free post-training algorithm that addresses these limitations with a novel Dual Optimal Transport (OT)-aligned Rewards module. To align reward signals with human judgment, PISCES uses OT to bridge text and video embeddings at both the distributional level and the discrete token level, yielding two complementary rewards: (i) a Distributional OT-aligned Quality Reward that captures overall visual quality and temporal coherence; and (ii) a Discrete Token-level OT-aligned Semantic Reward that enforces semantic and spatio-temporal correspondence between text and video tokens.
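To make the token-level idea concrete, the sketch below computes an entropic-regularized OT plan between a handful of text-token and video-token embeddings using Sinkhorn iterations. It is only an illustration of discrete OT in general: the cosine cost, the uniform marginals, the regularization strength `epsilon`, and the function names `cosine_cost` and `sinkhorn_plan` are all assumptions, and the spatio-temporal constraints and the solver actually used by PISCES are not reproduced here.

```python
import math


def cosine_cost(text_tokens, video_tokens):
    """Cost matrix C[i][j] = 1 - cosine similarity (an assumed cost, not the paper's)."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    t = [unit(v) for v in text_tokens]
    w = [unit(v) for v in video_tokens]
    return [[1.0 - sum(a * b for a, b in zip(ti, wj)) for wj in w] for ti in t]


def sinkhorn_plan(cost, epsilon=0.5, n_iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn scaling."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass over text tokens (assumption)
    b = [1.0 / m] * m  # uniform mass over video tokens (assumption)
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: 2 text tokens and 3 video tokens in a 2-D embedding space.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
video_tokens = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.0]]
plan = sinkhorn_plan(cosine_cost(text_tokens, video_tokens))
for row in plan:
    print([round(p, 3) for p in row])
```

Each row of `plan` says how one text token's mass is spread across video tokens. In this toy case the first text token sends more mass to the first and third video tokens, which point in a similar direction, than to the second.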

To our knowledge, PISCES is the first method to improve annotation-free reward supervision in generative post-training through the lens of OT. Experiments on both short- and long-video generation show that PISCES outperforms annotation-based and annotation-free methods alike on the VBench Quality and Semantic scores, and human preference studies further validate its effectiveness. We also show that the Dual OT-aligned Rewards module is compatible with multiple optimization paradigms, including direct backpropagation and reinforcement learning fine-tuning.

Method

PISCES dual optimal transport aligned rewards framework

PISCES T2V Post-Training. We introduce a Dual OT-aligned Rewards module: (i) a distributional OT map $\mathbf{T}^{\star}$ for the Quality Reward, computed via [CLS] representation similarity, and (ii) a discrete OT plan $\mathbf{P}^{\star}$ with spatio-temporal constraints for the Semantic Reward, computed via a Video-Text Matching (VTM) classifier. The rewards module supervises fine-tuning of the T2V denoiser and can be used with either direct backpropagation or RL fine-tuning (GRPO).
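For the RL route, GRPO scores each sampled video relative to the other samples drawn for the same prompt. The sketch below shows that group-relative normalization applied to a combined reward. The scalar reward values, the weighted-sum combination, and the names `combined_reward`, `w_quality`, and `w_semantic` are illustrative assumptions. The page does not state how PISCES weights or combines its two rewards.

```python
import statistics


def combined_reward(quality, semantic, w_quality=1.0, w_semantic=1.0):
    """Weighted sum of the two rewards (hypothetical combination and weights)."""
    return w_quality * quality + w_semantic * semantic


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one prompt's sample group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four videos sampled for the same prompt, with made-up reward values.
quality_rewards = [0.62, 0.70, 0.55, 0.81]
semantic_rewards = [0.40, 0.52, 0.35, 0.60]
rewards = [combined_reward(q, s) for q, s in zip(quality_rewards, semantic_rewards)]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Samples that score above the group mean get positive advantages and are reinforced, while those below the mean are discouraged. With direct backpropagation, the reward would instead be differentiated through the generator, and no group normalization is needed.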

Qualitative Results

HunyuanVideo generations at 720p. Each carousel uses the same six MovieGenBench prompts for direct comparison.

Pre-trained: HunyuanVideo
PISCES: OT + quality + semantic + LoRA
No OT: quality + semantic + LoRA
T2V-Turbo: quality + LoRA

Pre-trained HunyuanVideo

The original generator before reward-based post-training.

PISCES

Dual OT alignment + quality reward + semantic reward + LoRA, after 192 post-training steps.

PISCES without OT

Quality reward + semantic reward + LoRA after 192 post-training steps, without distributional or token-level OT alignment.

T2V-Turbo

Quality reward + LoRA after 192 post-training steps, without OT alignment or the fine-grained semantic reward.

BibTeX

@inproceedings{le2026pisces,
  title     = {PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards},
  author    = {Le, Minh-Quan and Mittal, Gaurav and Zhao, Cheng and Gu, David and Samaras, Dimitris and Chen, Mei},
  booktitle = {Forty-third International Conference on Machine Learning},
  year      = {2026},
  url       = {https://openreview.net/forum?id=wSfc8mDEjM}
}