Multimodal Diffusion Language Models for Thinking-Aware Editing and Generation
TL;DR
The article introduces MMaDA-Parallel, a parallel multimodal diffusion framework designed to improve thinking-aware editing and generation by strengthening cross-modal alignment and semantic consistency between text and image outputs. The model is first trained with supervised fine-tuning and then optimized with Parallel Reinforcement Learning (ParaRL), which rewards cross-modal consistency between the generated reasoning and the generated image. Experiments show a 6.9% improvement in Output Alignment on the ParaBench benchmark over the state-of-the-art model Bagel, establishing a more robust approach to thinking-aware image synthesis. The authors have released the code and models for MMaDA-Parallel, including two 8B models.
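To make the cross-modal consistency idea behind ParaRL concrete, the sketch below shows one plausible reward signal: scoring a generated reasoning text against the generated image with CLIP similarity. This is an illustrative assumption, not the reward actually used in MMaDA-Parallel; the CLIP checkpoint name and the `consistency_reward` helper are chosen here for demonstration only.

```python
# Hedged sketch of a cross-modal consistency reward for ParaRL-style training.
# Assumption: CLIP text-image similarity stands in for whatever consistency
# signal the authors actually optimize.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def consistency_reward(text: str, image: Image.Image) -> float:
    """Return cosine similarity between text and image embeddings as a scalar reward."""
    inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    # Higher similarity -> text and image tell the same story -> larger reward.
    return float((text_emb * img_emb).sum())
```

In a ParaRL-style loop, such a reward would be computed on each sampled (text, image) pair and used to update the policy so that the reasoning trace and the synthesized image stay semantically aligned.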