From Unified MLLM to Multimodal Generalist

Future human-level or superhuman AI will be a multimodal generalist that can perceive, reason, and create across modalities while preserving deep synergy between modalities, tasks, and paradigms.

01

Beyond language-only intelligence

Future AI should natively operate over text, images, video, audio, 3D, and embodied signals rather than treating language as the only privileged interface.

02

From unified MLLM to generalist

The target is the unification of modality, task, and paradigm, spanning both comprehension and generation.

03

Synergy is the missing axis

A real multimodal generalist should exhibit generalization across modalities, across tasks, and across paradigms, rather than hosting isolated skills side by side.

Concept illustration of multimodal intelligence linking comprehension and generation

Flagship Research

General-Level · General-Bench · ICML 2025 (Oral / Spotlight)

On Path to Multimodal Generalist:

General-Level and General-Bench

A principled evaluation framework for measuring how far current multimodal large language models have progressed towards genuine multimodal generalists.

  • Defines General-Level, a five-level scale that maps the capability and generality of multimodal systems.
  • Positions Synergy as a core criterion: models should stay consistently strong across comprehension, generation, and multiple modalities.
  • Introduces General-Bench with 700+ tasks and 325,800 instances spanning broad skills, formats, and capabilities.
  • Benchmarks 100+ state-of-the-art MLLMs to expose both the progress and the remaining gap towards AGI.
Overview figure for General-Level and General-Bench
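The synergy criterion above can be made concrete with a toy aggregation rule. The capability grouping and the min-vs-mean discount below are illustrative assumptions, not the paper's actual General-Level scoring formula; they only show why a balanced generalist can outrank a model with isolated peak skills.

```python
# Illustrative sketch of synergy-aware aggregation.
# NOTE: the grouping and the weakest-capability discount are assumptions
# for illustration, not General-Level's exact scoring formula.
from statistics import mean

def synergy_aware_score(scores_by_capability: dict) -> float:
    """Average per-capability scores, then discount models whose weakest
    capability lags far behind the rest (isolated skills, low synergy)."""
    per_capability = {cap: mean(vals) for cap, vals in scores_by_capability.items()}
    overall = mean(per_capability.values())
    weakest = min(per_capability.values())
    # A generalist keeps its weakest capability close to its overall level;
    # the ratio acts as a synergy discount on the plain average.
    return overall * (weakest / overall) if overall > 0 else 0.0

# A balanced model beats a lopsided one despite lower peak scores.
balanced = synergy_aware_score({
    "image comprehension": [0.80, 0.82],
    "image generation": [0.78],
    "video comprehension": [0.79],
})
lopsided = synergy_aware_score({
    "image comprehension": [0.95, 0.97],
    "image generation": [0.20],
    "video comprehension": [0.90],
})
```

Under this toy rule the balanced model scores higher, mirroring the paper's point that hosting strong but disconnected skills side by side is not generality.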
Unified-R1 · AD-Loop · ICLR 2026

Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking

A unified reasoning paradigm where understanding and generation alternate inside one problem-solving loop instead of remaining parallel but disconnected capabilities.

  • Introduces the Analyzing-Drafting Loop, an interleaved process that alternates textual thoughts and visual thoughts before the final output.
  • Trains the loop in two stages: supervised imitation of interleaved thinking, then reinforcement learning for adaptive control over when to analyze or draft.
  • Delivers consistent gains on both understanding and generation benchmarks across multiple unified vision-language model architectures.
  • Provides evidence that multimodal generalist behavior benefits from explicit interaction between comprehension and creation.
Overview figure for the interleaved analyzing-drafting loop
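The alternation described above can be sketched as a control loop. The `analyze` and `draft` helpers below are hypothetical stand-ins (the real model produces textual and visual thoughts, and learns the stop/switch decision via supervised imitation followed by reinforcement learning); the sketch only shows the interleaved structure.

```python
# Minimal sketch of an interleaved analyzing-drafting loop.
# The helpers are hypothetical stubs for illustration; AD-Loop itself
# learns when to analyze, draft, or stop via SFT followed by RL.

def analyze(prompt, current_draft):
    """Textual thought (stub): critique the draft against the prompt."""
    missing = [w for w in prompt.split()
               if current_draft is None or w not in current_draft]
    return "ok" if not missing else f"missing: {missing}"

def draft(prompt, critique, previous):
    """Visual thought stand-in (stub): revise the draft using the critique."""
    if previous is None:
        return prompt.split()[0]        # begin with a partial first draft
    return " ".join(prompt.split())     # repair everything the critique flagged

def ad_loop(prompt, max_steps=4):
    current = None
    for _ in range(max_steps):
        critique = analyze(prompt, current)   # understanding step
        if critique == "ok":                  # learned stop decision in the real model
            break
        current = draft(prompt, critique, current)  # generation step
    return current

result = ad_loop("a red cube on a blue table")
```

The point of the structure is that generation is conditioned on an explicit analysis of its own previous output, rather than understanding and generation running as parallel, disconnected heads.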

Workshop

ACM MM 2025 Dublin, Ireland

The 1st International Workshop @ ACM MM 2025: MLLM for Unified Comprehension and Generation

MUCG focuses on closing the gap between multimodal comprehension and generation inside unified MLLM frameworks rather than treating them as independent capabilities.

  • Targets three connected fronts: sophisticated multimodal comprehension, controllable content generation, and unified frameworks for semantic alignment.
  • Emphasizes shared architectures, bidirectional knowledge transfer, and end-to-end training for jointly capable models.
  • Positions unified understanding and generation as a distinct research agenda for the next phase of MLLM development.

Focus for papers

  • MLLM for multimodal comprehension & reasoning
  • MLLM for multimodal content generation
  • Unified MLLM understanding and generation
IJCV special issue

Special Issue @ IJCV 2026:

MLLMs for Unified Comprehension and Generation

The special issue frames MUCG as a closed-loop paradigm where perceptual grounding, internal reasoning, and generative outputs must remain aligned across heterogeneous modalities.

  • Covers unified architectures, training and alignment, reasoning and controllability, evaluation and benchmarks, efficiency, scalability, and real-world applications.
  • Pushes for shared vocabulary, rigorous comparative studies, unified metrics, and more reproducible research practices in large-scale multimodal modeling.
  • Serves as an archival venue for consolidating advances that are currently scattered across conferences and rapidly moving project threads.

Key scopes

  • Unified modeling architectures
  • Training & alignment strategies
  • Reasoning, grounding & controllability
  • Evaluation & benchmarks
  • Efficiency & scalability
  • Applications

Tutorial

The first tutorial series for the MLLM community, charting the evolution from multimodal large language models to human-level AI, with different editions focusing on benchmarks, architecture, reasoning, hallucination, efficiency, and broader open problems.

CVPR 2025 Nashville, TN, USA

Evaluations and Benchmarks in the Context of Multimodal LLM

ACM MM 2024 Melbourne, Australia

From Multimodal LLM to Human-level AI: Architecture, Modality, Function, Instruction, Hallucination, Evaluation, Reasoning and Beyond

CVPR 2024 Seattle, WA, USA

From Multimodal LLM to Human-level AI: Modality, Instruction, Reasoning and Beyond

LREC-COLING 2024 Turin, Italy

From Multimodal LLM to Human-level AI: Modality, Instruction, Reasoning, Efficiency and Beyond