Path to Multimodal Generalist:

Levels and Benchmarks



(* Equal contribution    † Correspondence)

1National University of Singapore, 2Skywork AI, Singapore,
3Nanyang Technological University, 4Wuhan University,
5Zhejiang University, 6Shanghai Jiao Tong University,
7University of Science and Technology of China

🔔News

🔥[2024-11-15]: Paper, benchmark datasets, and the leaderboard are fully released!

🔥[2024-06-16]: We released the first version of the benchmark datasets and the leaderboard!

Overview

The Multimodal Large Language Model (MLLM) community is currently experiencing rapid growth, driven by the advanced capabilities of language-based LLMs. Unlike their specialist predecessors, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to not only comprehend but also generate and edit across modalities. Their capabilities have expanded from coarse-grained to fine-grained multimodal understanding and from supporting single modalities to accommodating a wide array of, or even arbitrary, modalities. To assess the capabilities of various MLLMs, a diverse array of benchmark test sets has been proposed. This leads to a critical question: can we simply assume that higher performance across diverse tasks and support for more modalities indicate a stronger MLLM, bringing us closer to human-level AI?

We argue that the answer is not as straightforward. In this project, we introduce a framework to delineate the capabilities and behaviors of current MLLM generalists. This framework, named General-Level, establishes levels of MLLM performance and generality, offering a methodology to compare MLLMs and gauge the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI. Central to our framework is the use of Synergy as the evaluative criterion, categorizing capabilities based on whether MLLMs preserve synergy in and across comprehension and generation, as well as across multimodal interactions. To evaluate the comprehensive abilities of various MLLMs as generalists, we propose a holistic benchmark, General-Bench, which encompasses a broader spectrum of tasks, modalities, formats, and capabilities. The evaluation results involving over 20 state-of-the-art MLLMs uncover the capability rankings of generalists, highlighting the challenges in reaching genuine AI.


Figure 1: Illustrating that most existing MLLMs are based on language intelligence (a), whereas the ideal intelligence mode is maintaining synergy and generalization across all modalities and tasks (b).

General-Level: Levels of Multimodal Generalists

Principles

Before determining the Levels of Multimodal Generalists, let us first clarify a few preliminaries about MLLMs.

  1. Multimodal Comprehension vs. Multimodal Comprehension + Generation

    An MLLM that only has multimodal comprehension capabilities represents the most basic and primitive level; we believe that the more powerful an MLLM is, the more advanced functionalities it should support, being capable of both comprehending and generating content across various modalities.

  2. Support for More and Broader Modalities and Task Paradigms

    This aligns with our understanding of the capabilities of MLLMs: the stronger the MLLM and the closer it is to AGI, the more task types it can support, the more modalities it can handle, and the stronger its task performance.

  3. A Strong Synergy Effect is the Core Aspect of an MLLM

    One could imagine using a simple LLM as a task dispatcher that combines all existing specialists, thereby constructing a multimodal agent or generalist capable of supporting various task types and modalities. However, we argue that merely unifying different tasks does not make a system a generalist: such a simple integration of specialists can, at best, only match the capabilities of the individual specialists. If so, what is the difference from directly using the separate specialists? What is the point of integration?

    Instead, we suggest that synergy is the most critical aspect when assessing whether a multimodal generalist is stronger. An MLLM should achieve a synergistic effect where 1+1 > 2: for example, what it learns from understanding one modality and task should transfer to understanding other tasks and modalities, much as the ChatGPT series achieves robust generalization from minimal training examples. This aligns with our expectations of a generalist.

    To this end, we focus on examining synergy capabilities: first, we evaluate the synergy among various multimodal tasks (within comprehension or generation); then we explore the synergistic effects between the families of multimodal comprehension and generation tasks; and finally we investigate the presence of an 'everything synergy'.

Definition

Based on the above principles, let's assume we have a benchmark of various modalities and tasks. We can categorize tasks under these modalities into Comprehension tasks and Generation tasks as illustrated in the diagram, where each symbol represents a specific task (dataset). Notably, we also isolate NLP tasks.

[Diagram: benchmark tasks grouped into Comprehension, Generation, and NLP task groups]

Let’s denote the number of datasets or tasks within the Comprehension task group by \( M \); the number within the Generation task group by \( N \); and the number of NLP tasks by \( T \).

Now, we can provide the specific definition and calculation of each level:

Level-1: Specialist
Definition: Various current models, each fine-tuned on a specific task or dataset of specific modalities, are task-specific players (i.e., state-of-the-art (SoTA) specialists). This covers all kinds of AI processing tasks, such as recognition, classification, text generation, image generation, video segmentation, grounding, inpainting, and more.
Scoring: For the \(i\)-th task in the benchmark, record the current SoTA specialist's score: $$\sigma_i^{sota}$$
Example: SAM, DINO, DALL-E, ChatGPT

↓ Upgrading Conditions: LLM as the intelligence medium (Comprehension and/or Generation)

Level-2: Generalist of Unified Comprehension and Generation
Definition: Models are task-unified players, e.g., MLLMs, capable of supporting different modalities and tasks. Such MLLMs can integrate various models through existing encoding and decoding technologies to aggregate and unify various modalities and tasks (both comprehension and generation).
Scoring: The average score across all datasets is the model's score at this level. A model that supports a task, i.e., scores non-zero on the corresponding dataset, is considered capable of that task. The more tasks a model supports and the higher its scores, the higher its overall score: $$ S_{2} = \frac{1}{M+N} \sum_{i=1}^{M+N} \sigma_i $$
Example: GPT-4V, LLaVA, LVM

↓ Upgrading Conditions: Realizing synergy via multi-task joint learning

Level-3: Generalist with Synergy in Comprehension and Generation
Definition: Models are task-unified players, with synergy within Comprehension and/or Generation: through joint learning across multiple tasks, the MLLM pushes the performance of several tasks beyond the corresponding SoTA scores, thanks to the synergy effect.
Scoring: Assign a mask weight of 0 or 1 to each task: mask = 1 only if the corresponding score exceeds the SoTA specialist's score, otherwise mask = 0. Then average over all tasks. The more tasks on which a model surpasses the SoTA specialist, the higher its score at this level: $$ S_{3} = \frac{1}{M+N} \sum_{i=1}^{M+N} \begin{cases} \sigma_i, & \sigma_i \geq \sigma_i^{sota} \\ 0, & \text{otherwise} \end{cases} $$
Example: MM-GPT, SALMONN, Midjourney

↓ Upgrading Conditions: The reconstruction loss for generation should be disentangled from the compression learning loss

Level-4: Generalist with Synergy across Comprehension and Generation
Definition: Models are task-unified players, with synergy across Comprehension and Generation.
Scoring: Calculate the average scores exceeding the SoTA specialists separately in the Comprehension and Generation groups, obtaining \( S_c \) and \( S_g \), and then compute their harmonic mean. The stronger a model is in both Comprehension and Generation tasks, the higher its score at this level: $$ S_{4} = \frac{2 S_c S_g}{S_c + S_g}, \;\; \text{where} \\ S_c = \frac{1}{M} \sum_{i=1}^{M} \begin{cases} \sigma_i, & \sigma_i \geq \sigma_i^{sota} \\ 0, & \text{otherwise} \end{cases}, \\ S_g = \frac{1}{N} \sum_{j=1}^{N} \begin{cases} \sigma_j, & \sigma_j \geq \sigma_j^{sota} \\ 0, & \text{otherwise} \end{cases} $$
Example: Emu2, NExT-GPT, SEED

↓ Upgrading Conditions: Acquiring the capability of abductive reasoning, context consistency, and everything synergy

Level-5: Generalist with Total Synergy across Comprehension, Generation, and NLP
Definition: Models are task-unified players, preserving the synergy effect across Comprehension, Generation, and NLP. In other words, the model not only achieves cross-modality synergy between the Comprehension and Generation groups, but also realizes synergy with language: NLP intelligence can enhance multimodal intelligence, and, vice versa, understanding multimodal information can aid in understanding language.
Scoring: First, calculate the model's average score exceeding the SoTA NLP specialists on the NLP benchmark data, normalize it to a [0, 1] weight, and multiply it by the Level-4 score to obtain the Level-5 score: $$ S_{5} = S_4 \times w_L, \;\; \text{where} \\ w_L = \frac{S_L}{S_{total}}, \\ S_L = \frac{1}{T} \sum_{k=1}^{T} \begin{cases} \sigma_k, & \sigma_k \geq \sigma_k^{sota} \\ 0, & \text{otherwise} \end{cases} $$
Example: None yet; this is our goal!
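To make the level scoring above concrete, below is a minimal Python sketch (our own illustration, not an official evaluation script). It assumes every per-task score \(\sigma_i\) has already been normalized to the 0-100 range (see the score mapping in the Leaderboard section), and that \(S_{total}\) in the Level-5 weight is the maximum attainable score of 100; the simple list-of-pairs data layout is likewise an assumption made only for this sketch.

```python
# A minimal sketch of the General-Level scoring rules (S_2 to S_5), assuming all
# per-task scores are already mapped to the 0-100 range. The data layout (plain
# lists of (score, sota_score) pairs per group) is illustrative only.

def _masked(score: float, sota: float) -> float:
    """Keep a score only if it reaches the SoTA specialist score, else count 0."""
    return score if score >= sota else 0.0

def level2(comp, gen):
    """S_2: plain average over all comprehension + generation tasks."""
    scores = [s for s, _ in comp + gen]
    return sum(scores) / len(scores)

def level3(comp, gen):
    """S_3: average over all tasks, counting only scores that reach or exceed SoTA."""
    pairs = comp + gen
    return sum(_masked(s, sota) for s, sota in pairs) / len(pairs)

def level4(comp, gen):
    """S_4: harmonic mean of the masked averages of the two groups (S_c and S_g)."""
    s_c = sum(_masked(s, sota) for s, sota in comp) / len(comp)
    s_g = sum(_masked(s, sota) for s, sota in gen) / len(gen)
    return 0.0 if s_c + s_g == 0 else 2 * s_c * s_g / (s_c + s_g)

def level5(comp, gen, nlp, s_total=100.0):
    """S_5: Level-4 score re-weighted by the masked NLP average (w_L in [0, 1]).
    s_total = 100 is an assumption here; the exact normalizer is not specified above."""
    s_l = sum(_masked(s, sota) for s, sota in nlp) / len(nlp)
    return level4(comp, gen) * (s_l / s_total)

if __name__ == "__main__":
    # Toy numbers: (model score, SoTA specialist score) per task.
    comp = [(60.0, 50.0), (30.0, 80.0)]   # beats SoTA on 1 of 2 comprehension tasks
    gen = [(40.0, 35.0), (20.0, 90.0)]    # beats SoTA on 1 of 2 generation tasks
    nlp = [(70.0, 60.0), (10.0, 95.0)]    # beats SoTA on 1 of 2 NLP tasks
    print(level2(comp, gen))       # (60+30+40+20)/4 = 37.5
    print(level3(comp, gen))       # (60+0+40+0)/4 = 25.0
    print(level4(comp, gen))       # harmonic mean of 30 and 20 = 24.0
    print(level5(comp, gen, nlp))  # 24.0 * (35/100) = 8.4
```

The toy run at the bottom illustrates the intended behavior: tasks that fail to reach the SoTA specialist contribute nothing from Level-3 onward, and a model that is weak in either the Comprehension or the Generation group is penalized by the harmonic mean at Level-4.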

General-Bench: Benchmark of Multimodal Generalist

Existing benchmarks for evaluating MLLMs have the following issues:

  1. Limited by Current MLLM Capabilities

    Existing benchmarks rephrase and simplify all tasks into a fixed multiple-choice format. While this simplifies the evaluation of MLLMs, it limits assessments to the models' multimodal comprehension capabilities. As discussed above, a true multimodal generalist should not only support comprehension but also possess various other abilities, such as multimodal generation, editing, and more.

  2. Limited Modal Coverage

    The majority of existing benchmarks focus primarily on the image modality, neglecting comprehensive modal coverage.

  3. Insufficient Task Granularity

    Existing benchmarks are limited to coarse-grained understanding of multimodal information and do not adequately assess the understanding of more fine-grained details.

In response, this project proposes a new benchmark, named General-Bench.


Data Highlights

General-Bench places particular emphasis on the diversity of its evaluation data, covering a wide range of fields and scenarios to assess different aspects of model capability. First, the dataset spans a variety of domains and disciplines, incorporating 14 major areas across both the physical sciences (e.g., Physics, Geometry, Biology) and the social sciences (e.g., Humanities, Linguistics, History). The evaluation of multimodal generalist skills and capabilities is categorized into universal modality-invariant abilities and modality-specific skills. The modality-invariant abilities comprise 8 categories: content recognition, commonsense understanding, reasoning ability, causality discrimination, affective analysis, interactive capability, creativity and innovation, and problem-solving.


Example

Some typical examples of the tasks in General-Bench are given below, which may help researchers better understand the proposed benchmark.


1. Comprehension

[Example figures: Semantic Segmentation, Visual Grounding, Image Captioning, OCR]

2. Generation

[Example figures: Text-to-Image Generation, Sketch-to-Image Generation, Image Inpainting, Semantic Image Synthesis]

Contribute your dataset!

If you have a multimodal task and dataset and would like existing MLLMs to be evaluated on your benchmark, please consider contributing your dataset (testing set) to General-Bench!

Leaderboard

Generalist Levels

The evaluated multimodal generalist levels of existing popular MLLMs are shown below.

Multimodal Tasks

Below we list the original quantitative performance (zero-shot evaluation) of all MLLMs across all tasks.

Task Dataset SoTA Specialist Metrics SoTA Score GPT-4V GPT-4o LLaVA 1.5(7B) Mini-Gemini(7B) Qwen-VL-Plus InternVL MoE-LLaVA-QWen-1.8B-4e Yi-VL(6B) BLIP-2 MiniGPT-4 Emu2(37B) SEED-LLaMA(14B-sft) InternLM-XComposer2(7B) LaVIT-V2(7B) GPT4ROI Osprey NextChat GlaMM OMG-LLaVA-InternLM(20B) AnyGPT NExT-GPT Vitron
Object Counting FSC147 counTR MAE 14.49 0 0 0 0 0 0 0 0 0 0 25.20 44.20 34.20 27.00 0 0 0 0 0 0 0 0
CARPK 6.39 0 0 0 0 0 0 0 0 0 0 58.10 61.42 49.89 56.20 0 0 0 0 0 0 0 0
OCR TextOCR parseq ANLS 94.79 73.01 73.21 84.17 79.39 91.01 86.10 34.09 86.14 74.50 15.74 90.44 37.27 54.78 64.67 8.54 62.17 59.52 1.00 47.80 51.22 49.32 65.85
Image Classification CIFAR-100 Astroformer ACC 86.30 57.00 80.67 52.33 38.67 49.66 76.00 29.87 64.67 17.30 26.66 40.00 21.00 54.66 33.67 12.33 12.67 10.00 43.60 6.00 37.62 46.33 57.80
Text-to-Image Retrieval Flickr30k CLIP Recall@1 81.33 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Image Captioning COCO-Caption GRIT ROUGE-L 62.75 40.80 40.36 47.11 23.68 29.24 45.24 42.40 28.93 50.71 23.79 46.30 18.90 24.40 46.32 43.10 46.60 49.30 56.30 23.20 23.75 46.55 50.32
Depth Estimation NYU-v2 TransDepth ACC 87.30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Image Matting AM-2k GFM MSE 0.29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
KeyPoint Detection OCHuman BUCTD AP 43.9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Object Detection MSCOCO DINO 55.56 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 35.60
Visual Grounding RefCOCO polygon-former AP@0.1 90.9 0 87.67 0 0 9.33 0 0 11.67 0 0 0 0 3.04 0 0 0 39.33 97.00 96.00 0 0 56.45
RefCOCO+ 85.97 0 18.00 0 0 7.00 0 0 7.67 0 0 0 0 2.49 0 0 0 31.33 63.00 89.00 0 0 67.41
RefCOCOg 84.76 0 14.70 0 0 7.00 0 0 8.67 0 0 0 0 2.99 0 0 0 31.00 55.70 86.70 0 0 63.85
Semantic Segmentation PASCAL VOC 2012 SegCLIP mIoU 51.8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 35.14
Instance Segmentation Cityscapes OneFormer AP 45.68 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 23.45
ADE20k 35.95 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 23.45
Panoptic Segmentation Cityscapes PQ 67.33 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17.89
ADE20k 48.34 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 29.78
Anomaly Detection MVTecAD DMAD AUC 97.5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
VQA GQA GIT ACC 44.67 25.67 25.00 70.00 69.67 68.33 72.67 67.00 55.33 22.30 33.66 74.00 59.11 56.15 47.90 0 64.66 31.67 0 57.70 43.95 56.58 68.40
VQAv2 70.67 24.7 53.67 77.33 77.33 75.00 77.67 70.00 65.33 22.33 13.33 78.33 61.62 36.02 68.30 0 68.33 60.00 13.70 67.30 63.74 67.60 75.60
VizWiz 14.67 54.00 55.00 22.33 38.33 52.51 44.67 21.00 40.33 10.33 3.66 25.33 52.33 35.00 41.00 1.60 14.33 7.60 0 18.30 11.22 20.33 50.40
OK-VQA 36.67 33.67 47.67 54.67 56.00 45.82 59.00 48.00 46.00 16.33 0.90 55.33 48.00 54.00 55.90 0.30 37.66 29.00 14.30 45.70 28.03 40.30 52.03
DocVQA Donut ANLS 60.75 68.99 78.66 15.94 55.87 82.58 86.39 23.53 85.13 3.40 1.66 38.93 7.65 79.33 21.57 2.60 13.48 10.79 14.50 14.00 17.56 36.66 25.80
Text-to-Image MultiModal CelebA-HQ Lafite FID 149.861 0 0 0 131.60 0 0 0 0 0 0 112.33 0 0 0 0 0 0 0 0 141.64 123.30 130.60
Imagenet VQ-diffusion 126.694 0 0 0 133.09 0 0 0 0 0 0 224.63 0 0 0 0 0 0 0 0 227.83 163.88 143.05
MSCOCO Lafite 280.924 0 0 0 148.07 0 0 0 0 0 0 146.89 0 0 0 0 0 0 0 0 126.84 198.01 169.30
CUB 121.544 0 0 0 54.17 0 0 0 0 0 0 71.32 0 0 0 0 0 0 0 0 74.67 82.30 100.72
Sketch-to-Image Scribble PITI 312.12 0 0 0 350.76 0 0 0 0 0 0 406.01 0 0 0 0 0 0 0 0 357.91 505.66 324.25
SketchyCOCO 274.39 0 0 0 232.41 0 0 0 0 0 0 171.87 0 0 0 0 0 0 0 0 171.91 326.40 208.35
Layout-to-Image COCOStuff LayoutDiffusion 141.34 0 0 0 187.08 0 0 0 0 0 0 269.25 0 0 0 0 0 0 0 0 141.35 350.30 231.17
Visual Genome 131.79 0 0 0 0 0 0 0 0 0 0 381.31 0 0 0 0 0 0 0 0 171.70 410.10 269.02
SuperResolution Set14 HAT PSNR 35.29 0 0 0 0 0 0 0 0 0 0 28.00 0 0 0 0 0 0 0 0 0 0 0
BSD100 32.74 0 0 0 0 0 0 0 0 0 0 28.10 0 0 0 0 0 0 0 0 0 0 0
Manga109 41.01 0 0 0 0 0 0 0 0 0 0 28.68 0 0 0 0 0 0 0 0 0 0 0
Urban 35.09 0 0 0 0 0 0 0 0 0 0 28.13 0 0 0 0 0 0 0 0 0 0 0
Image Inpainting Places2 mat FID 40.38 0 0 0 0 0 0 0 0 0 0 111.67 0 0 0 0 0 0 0 0 93.12 153.06 123.56
CelebA-HQ 12.69 0 0 0 0 0 0 0 0 0 0 47.86 0 0 0 0 0 0 0 0 81.65 76.33 64.02
FFHQ 16.91 0 0 0 0 0 0 0 0 0 0 64.49 0 0 0 0 0 0 0 0 107.05 88.41 34.80
Text-based Image Editing PIE-Bench Prompt-to-Prompt LPIPS 33.91 0 0 0 37.58 0 0 0 0 0 0 41.61 0 0 0 0 0 0 0 0 0 0 36.74
VITON-HD mgd FID 30.69 0 0 0 0 0 0 0 0 0 0 166.07 0 0 0 0 0 0 0 0 0 0 65.33
Semantic Image Synthesis ADE20k INADE 48.28 0 0 0 0 0 0 0 0 0 0 312.10 217.36 0 0 0 0 0 0 0 0 0 186.07
Cityscapes 87.16 0 0 0 0 0 0 0 0 0 0 226.01 181.96 0 0 0 0 0 0 0 0 0 175.33
Low-light Image Enhancement LOL WaveNet PSNR 25.44 0 0 0 0 0 0 0 0 0 0 27.91 0 0 0 0 0 0 0 0 28.33 0 34.86
Image Denoising SIDD HINet 39.80 0 0 0 0 0 0 0 0 0 0 28.16 0 0 0 0 0 0 0 0 0 0 0
Image deblurring GoPro LAKDNet 33.48 0 0 0 0 0 0 0 0 0 0 27.97 0 0 0 0 0 0 0 0 27.91 56.31 40.66
HIDE 32.60 0 0 0 0 0 0 0 0 0 0 27.93 0 0 0 0 0 0 0 0 27.88 51.03 38.27
Text Classification 20 Newsgroups Flan-T5-XL F1 48.30
AG News 90.80
IMDB 96.70
SST 96.20
Yelp 55.80
TREC 93.60
DBpedia 79.80
FakeNewsNet 58.33
SNLI 86.90
Quora 83.10
Information Extraction NER Flan-T5-XL F1 19.85
CoNLL-2003 13.28
OntoNotes 5.0 10.24
semeval-2010-task-8 33.7
DocRed 0.21
DialogRE 10.93
ACE 2005 0.96
Semeval 14 43.47
CoNLL 2014 0
SNIPS 94.14
UPB 0
SQuAD 2.0 43.90
Text Generation HotpotQA Flan-T5-XL F1 48.85
CoQA 55.10
NewsQA 23.62
RACE 85.60
ReCoRD 55.89
MS MARCO 35.49
MultiRC 41.44
CNN-Daily Mail 24.47

Most task evaluation scores, despite using different metrics, fall within the 0-100 range (e.g., F1, Accuracy); however, some metrics yield scores outside this range. Thus, we design the following score mappings to standardize the various scores into the 0-100 range, facilitating the calculation of the level scores above (a minimal implementation sketch follows the list).

  • Normalizing FID:

    $$ y= \text{sigmoid} (100/x)$$

  • Normalizing MAE:

    $$ y= \text{sigmoid} (50/x)$$

  • Normalizing LPIPS:

    $$ y= \text{sigmoid} (80/x)$$

  • Normalizing PSNR:

    $$ y= \text{sigmoid} (x/15)$$
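As a concrete reference, here is a small Python sketch of these mappings (an illustration under stated assumptions, not released evaluation code). The sigmoid itself outputs values in (0, 1); whether the result is further scaled by 100 to match the 0-100 range of the other metrics is not specified above, so that final scaling, and the pass-through for percentage-style metrics, are our assumptions.

```python
import math

# A minimal sketch of the score normalization above. Lower-is-better metrics
# (FID, MAE, LPIPS) are inverted by placing the raw value in the denominator;
# PSNR is higher-is-better. The final *100 scaling is our assumption, made so
# that normalized scores share the 0-100 range of F1/Accuracy-style metrics.

def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def normalize(metric: str, x: float) -> float:
    """Map a raw metric value to a 0-100 score."""
    if metric == "FID":      # lower is better
        y = _sigmoid(100.0 / x)
    elif metric == "MAE":    # lower is better
        y = _sigmoid(50.0 / x)
    elif metric == "LPIPS":  # lower is better
        y = _sigmoid(80.0 / x)
    elif metric == "PSNR":   # higher is better
        y = _sigmoid(x / 15.0)
    else:                    # F1, Accuracy, etc. are assumed to pass through unchanged
        return x
    return 100.0 * y

# Example: an FID of 40.38 maps to roughly 100 * sigmoid(100 / 40.38) ≈ 92.3.
```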

Submit your MLLM!

If you would like to see your MLLM ranked on the above leaderboard, please submit your MLLM and evaluation results to us.

Contact

For any inquiries regarding the techniques or collaboration on this project, please feel free to reach out to Hao Fei, Yuan Zhou, and Hanwang Zhang, or create an issue on GitHub.

BibTeX

If you find this project useful for your research, please kindly cite our paper using the following BibTeX entry. Thanks!

@article{hao2024path2generalist,
      title={Path to Multimodal Generalist: Levels and Benchmarks},
      author={Hao Fei and Yuan Zhou and Juncheng Li and Xiangtai Li and Yucheng Han and Wentao Hu and Liyu Jia and Shengqiong Wu and Peng Zhou and Lin Liu and Haobo Yuan and Tao Zhang and Bobo Li and Zixiang Meng and Chengjie Zhou and Minghe Gao and Kaihang Pan and Yaobo Ye and Mingze Zhou and Zhiqi Ge and Hanwang Zhang and Shuicheng Yan},
      year={2024},
      eprint={2404.123456},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}