A benchmark for evaluating atomic visual world knowledge in Multimodal LLMs.
Authors: Kimi Team
We are releasing WorldVQA, a new benchmark designed to measure the factual correctness of Multimodal Large Language Models (MLLMs). While recent models have demonstrated impressive capabilities in visual reasoning and description, measuring their reliability regarding visual world knowledge remains a challenge.
WorldVQA focuses on a critical question: Does the model actually recognize the specific entity it sees, or is it merely hallucinating based on visual patterns?
Our results show that WorldVQA poses a significant challenge for frontier models. Even state-of-the-art models struggle on long-tail visual knowledge, often falling below 50% accuracy. This benchmark aims to drive progress toward more factually reliable and knowledgeable multimodal AI.
The dataset consists of 3,500 high-quality image-question pairs, distributed to test a model's encyclopedic breadth across the world. The dataset distinguishes itself through three core design principles.
Note on Quality: To ensure the benchmark is a reliable gold standard, all images and question-answer pairs underwent rigorous multi-stage human verification to filter out noise and ambiguity.
What bird is in the picture?
Answer: Chestnut Shortwing
What's the name of the flower in the picture?
Answer: Freesia
Which heritage site does the content/artifact shown in the image belong to?
Answer: 善化寺 (Shanhua Temple)
What is the name of the natural landmark shown in the image?
Answer: Cape of Good Hope
What is the title of the dance performance shown in the picture?
Answer: Swan Lake
What treasured artifact is shown in this image?
Answer: 战国水晶杯 (Warring States crystal cup)
What style of bag is shown in the picture?
Answer: Shell bag
What electronic consumer product is shown in the image? Provide the exact name and model number.
Answer: iPhone 17 Pro
What model is the aircraft in the image?
Answer: 中国歼-20战斗机 (Chinese J-20 fighter jet)
What specific attachment or accessory is this for the vehicle?
Answer: Roll cage
What is the name of the character in the picture?
Answer: Bayle the Dread
Which film or TV series is this image from?
Answer: Your Name
What is the medium (carrier) of the advertisement in this image?
Answer: Direct-mail advertisement
What is the name of the trademark or logo shown in the image?
Answer: EgyptAir
What track-and-field or gymnastics event is shown in the picture? Please be as specific as possible.
Answer: Floor exercise
Which sports venue is the building in the image?
Answer: 上海体育场 (Shanghai Stadium)
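The examples above are graded as short factoid answers. As a rough illustration only (this is not the released evaluation script, and the official grading protocol may instead use an LLM judge to handle aliases and paraphrases), a minimal exact-match grader with Unicode normalization could look like:

```python
# Hypothetical grading sketch. WorldVQA answers are short entity names, so a
# normalized exact match illustrates the grading idea; the released evaluation
# may instead use a model-based judge for alias/paraphrase handling.
import unicodedata

def normalize(text: str) -> str:
    # Fold Unicode compatibility forms (e.g. fullwidth letters), lowercase,
    # and collapse runs of whitespace.
    text = unicodedata.normalize("NFKC", text).lower().strip()
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    # A prediction is correct iff it matches the gold answer after normalization.
    return normalize(prediction) == normalize(gold)
```

NFKC normalization matters for a bilingual benchmark, since fullwidth Latin characters and halfwidth forms otherwise fail a byte-level comparison.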
| Statistics | Number |
|---|---|
| Data | |
| Total | 3,500 |
| Chinese (CN) | 1260 (36%) |
| English (EN) | 2240 (64%) |
| Categories | |
| Total categories | 9 |
| Nature & Environment (Nature) | 9.31% |
| Locations & Architecture (Geography) | 14.63% |
| Culture, Arts & Crafts (Culture) | 14.46% |
| Objects & Products (Objects) | 12.49% |
| Vehicles, Craft & Transportation (Transportation) | 8.74% |
| Entertainment, Media & Gaming (Entertainment) | 14.60% |
| Brands, Logos & Graphic Design (Brands) | 7.43% |
| Sports, Gear & Venues (Sports) | 4.06% |
| Notable People & Public Figures (People) | 14.29% |
| Difficulty | |
| Easy | 31.16% |
| Medium | 40.77% |
| Hard | 28.07% |
| Benchmark | Kimi K2.5 | Gemini-3-pro | Gemini-2.5-pro | Seed-1.5-vision-pro | Claude-opus-4.5 | Claude-sonnet-4.5 | GPT-5.2 | GPT-5.1 | GPT-4o | Grok-4.1-fast-reasoning | Grok-4-fast-reasoning | Kimi-VL-16B-A3B | Qwen3-VL-235B-A22B-Instruct | Qwen3-VL-32B-Instruct | GLM-4.6V | GLM-4.6V-Flash |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Overall results |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Accuracy | 46.3 | 47.4 | 36.9 | 34.9 | 36.8 | 20.0 | 28.0 | 24.5 | 22.2 | 21.1 | 18.9 | 12.0 | 23.5 | 17.7 | 19.0 | 14.8 |
| Not Attempted | 2.1 | 0.6 | 0.1 | 1.6 | 3.4 | 8.0 | 5.4 | 16.3 | 9.1 | 0.1 | 0.2 | 3.3 | 0.0 | 0.0 | 0.0 | 0.1 |
| Correct Given Attempted | 47.3 | 47.7 | 36.9 | 35.5 | 38.1 | 21.8 | 29.5 | 29.3 | 24.4 | 21.1 | 19.0 | 12.4 | 23.5 | 17.7 | 19.0 | 14.8 |
| F-score | 46.8 | 47.5 | 36.9 | 35.2 | 37.5 | 20.9 | 28.7 | 26.7 | 23.3 | 21.1 | 18.9 | 12.2 | 23.5 | 17.7 | 19.0 | 14.8 |
| F-score on 9 task categories |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Nature | 40.6 | 45.1 | 37.1 | 41.4 | 32.5 | 19.4 | 24.3 | 27.3 | 25.6 | 18.4 | 17.8 | 11.2 | 26.1 | 18.1 | 24.5 | 16.0 |
| Geography | 46.8 | 44.7 | 33.8 | 36.1 | 36.5 | 21.0 | 29.1 | 25.1 | 20.6 | 23.6 | 19.0 | 13.9 | 24.8 | 18.0 | 21.5 | 16.3 |
| Culture | 43.0 | 47.2 | 32.6 | 33.4 | 34.1 | 17.4 | 26.7 | 22.5 | 17.8 | 20.2 | 18.6 | 10.1 | 22.9 | 16.8 | 17.8 | 13.2 |
| Objects | 44.7 | 48.1 | 39.6 | 32.8 | 39.6 | 22.9 | 26.6 | 26.6 | 19.1 | 25.2 | 22.0 | 10.8 | 26.1 | 19.0 | 19.2 | 14.9 |
| Transportation | 47.4 | 45.1 | 39.9 | 35.0 | 43.5 | 24.8 | 30.7 | 31.6 | 26.2 | 23.5 | 20.3 | 13.5 | 28.8 | 19.0 | 18.6 | 19.0 |
| Entertainment | 48.1 | 47.6 | 34.2 | 33.6 | 29.0 | 11.6 | 24.8 | 18.5 | 19.1 | 11.4 | 8.3 | 7.9 | 15.5 | 12.1 | 12.5 | 7.8 |
| Brands | 52.6 | 52.4 | 38.8 | 32.3 | 47.6 | 32.2 | 39.1 | 36.0 | 35.2 | 25.8 | 26.6 | 20.8 | 22.3 | 23.8 | 20.4 | 18.8 |
| Sports | 64.8 | 59.4 | 54.2 | 43.7 | 54.9 | 31.0 | 40.8 | 45.4 | 44.5 | 30.3 | 34.5 | 17.7 | 26.1 | 20.4 | 23.2 | 20.4 |
| People | 50.9 | — | — | — | — | — | — | — | — | — | — | 7.4 | 26.2 | 13.1 | 10.7 | 8.2 |
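The aggregate metrics above follow a SimpleQA-style scheme: models may decline to answer, Correct Given Attempted conditions accuracy on attempted questions, and the reported numbers are consistent with the F-score being the harmonic mean of overall accuracy and correct-given-attempted. A minimal sketch of this aggregation (grade labels are illustrative, not the released script's schema):

```python
def score(results):
    """Aggregate per-question grades into SimpleQA-style metrics.

    `results` is a list of grades: "correct", "incorrect", or "not_attempted".
    """
    n = len(results)
    correct = sum(r == "correct" for r in results)
    not_attempted = sum(r == "not_attempted" for r in results)
    attempted = n - not_attempted

    accuracy = correct / n                         # correct over all questions
    na_rate = not_attempted / n                    # abstention rate
    cga = correct / attempted if attempted else 0.0  # correct given attempted
    # F-score: harmonic mean of overall accuracy and correct-given-attempted,
    # rewarding models that are both accurate and willing to attempt.
    f = 2 * accuracy * cga / (accuracy + cga) if (accuracy + cga) else 0.0
    return {"accuracy": accuracy, "not_attempted": na_rate,
            "correct_given_attempted": cga, "f_score": f}
```

For example, on 1,000 questions with 463 correct and 21 not attempted, this yields accuracy 46.3, not attempted 2.1, correct given attempted 47.3, and F-score 46.8, matching the Kimi K2.5 column.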
In our experiments comparing model confidence with actual accuracy, we used two key metrics, Expected Calibration Error (ECE) and calibration slope, to measure the alignment between a model's stated confidence and its objective performance.
Calibration and Confidence Distribution Analysis. Left: Reliability diagrams plotting Actual Accuracy against Stated Confidence. To ensure statistical significance, only bins containing more than 20 samples are visualized. The size of each data point is proportional to the number of samples in that bin. The black dashed diagonal (y=x) represents perfect calibration, while colored dashed lines indicate the weighted average slope for each model. Right: The distribution of stated confidence scores across the full dataset (without sample thresholding). The plots reveal a severe overconfidence trend, with most models concentrating their predictions in the 90-100% confidence range.
Our experiments reveal that all evaluated models are currently far from the ideal state, exhibiting a universal tendency toward overconfidence.
While Kimi-K2.5 achieves the best performance on both metrics, with an ECE of 37.9% and a slope of 0.550, there remains a significant gap to bridge in the pursuit of "honesty" and alignment. Enhancing multimodal models' awareness of their own knowledge boundaries represents a critical direction for future exploration.
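ECE can be computed from per-question stated confidences and correctness grades. A minimal sketch, assuming 10 equal-width confidence bins with sample-weighted gaps (the exact binning and the weighted-slope fit used in our plots are not reproduced here):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted average |accuracy - confidence| over bins.

    confidences: stated confidences in [0, 1]
    correct: 1/0 (or bool) grade for each answer
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # Bin i covers (i/n_bins, (i+1)/n_bins]; confidence 0 falls in bin 0.
        i = min(n_bins - 1, max(0, int(c * n_bins - 1e-12)))
        bins[i].append((c, float(y)))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            ece += (len(b) / n) * abs(acc - avg_conf)
    return ece
```

A perfectly calibrated model has ECE 0; the overconfidence pattern in the figure corresponds to bins where stated confidence far exceeds bin accuracy.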
WorldVQA is a simple but challenging benchmark for evaluating the atomic visual knowledge of frontier models. Improving performance on WorldVQA is a necessary step for the next generation of AI agents. We are open-sourcing the WorldVQA dataset and evaluation scripts to help the community address the visual knowledge gap.