Introducing WorldVQA

A benchmark for evaluating atomic visual world knowledge in Multimodal LLMs.

Authors: Kimi Team


Overview


We are releasing WorldVQA, a new benchmark designed to measure the factual correctness of Multimodal Large Language Models (MLLMs). While recent models have demonstrated impressive visual reasoning and description capabilities, reliably measuring their visual world knowledge remains a challenge.

WorldVQA focuses on a critical question: Does the model actually recognize the specific entity it sees, or is it merely hallucinating based on visual patterns?

Our results show that WorldVQA poses a significant challenge for frontier models: even state-of-the-art systems struggle with long-tail visual knowledge, often scoring below 50% accuracy. This benchmark aims to drive progress toward more factually reliable and knowledgeable multimodal AI.

The Dataset

The dataset consists of 3,500 high-quality image-question pairs whose distribution is designed to test a model's encyclopedic breadth of visual world knowledge. It distinguishes itself through three core design principles:

  • Factuality & Unambiguity: Every question has a single, verifiable ground-truth answer. We exclude subjective questions or ambiguous visual scenarios.
  • Rich Taxonomy: The dataset spans 9 categories to ensure broad coverage of world knowledge.
  • Head vs. Tail Distribution: We explicitly separate data into Head (common knowledge) and Tail (rare/long-tail knowledge). This allows us to measure how model performance degrades as knowledge becomes more obscure.

Note on Quality: To ensure the benchmark is a reliable gold standard, all images and question-answer pairs underwent rigorous multi-stage human verification to filter out noise and ambiguity.
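
To make the Head/Tail split concrete, the sketch below scores a model separately on the two splits. This is a minimal illustration under our own assumptions: the JSONL record layout, field names, and the normalized exact-match scoring are hypothetical, not the released schema or official scorer.

    import json
    from collections import defaultdict

    # Hypothetical record layout (the released schema may differ):
    # {"image": "images/00123.jpg", "question": "...", "answer": "...",
    #  "category": "...", "split": "head" or "tail"}

    def accuracy_by_split(path, predict):
        """Score a model callable `predict(image_path, question) -> str`
        on Head vs. Tail questions using normalized exact-match."""
        hits, totals = defaultdict(int), defaultdict(int)
        with open(path) as f:
            for line in f:
                ex = json.loads(line)
                split = ex["split"]
                totals[split] += 1
                guess = predict(ex["image"], ex["question"])
                if guess.strip().lower() == ex["answer"].strip().lower():
                    hits[split] += 1
        return {s: hits[s] / totals[s] for s in totals}

Comparing the two resulting numbers directly quantifies how much a model's accuracy degrades as knowledge moves from Head to Tail.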


[Figure: Distribution of tasks per category]

Using WorldVQA to Compare Models


Measuring Calibration: Confidence vs. Accuracy

To compare a model's confidence with its actual accuracy, we use two key metrics that measure the alignment between the model's subjective belief and its objective performance:

  • ECE (Expected Calibration Error): Measures the average gap between the model's subjective confidence and its objective accuracy. The ideal value is 0.
  • Slope (Weighted Average Slope): Measures the correlation and sensitivity between the model's accuracy and its own confidence. The ideal value is 1.0. (A computation sketch for both metrics follows this list.)
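
As a concrete reference, here is a minimal sketch of how these two metrics can be computed from per-question confidences and correctness flags. The equal-width binning and the weighted least-squares reading of the Slope metric are our assumptions; the benchmark's exact definitions may differ.

    import numpy as np

    def _bin_stats(confidences, correct, n_bins=10):
        """Group predictions into equal-width confidence bins; return
        per-bin mean confidence, accuracy, and sample fraction."""
        conf = np.asarray(confidences, dtype=float)
        corr = np.asarray(correct, dtype=float)
        ids = np.minimum((conf * n_bins).astype(int), n_bins - 1)
        rows = []
        for b in range(n_bins):
            mask = ids == b
            if mask.any():
                rows.append((conf[mask].mean(), corr[mask].mean(), mask.mean()))
        return np.array(rows).T  # mean confidence, accuracy, weight

    def ece(confidences, correct, n_bins=10):
        """Expected Calibration Error: sample-weighted mean of the
        per-bin |accuracy - confidence| gap. Ideal value: 0."""
        c, a, w = _bin_stats(confidences, correct, n_bins)
        return float(np.sum(w * np.abs(a - c)))

    def weighted_slope(confidences, correct, n_bins=10):
        """Sample-weighted least-squares slope of per-bin accuracy
        against mean confidence. Ideal value: 1.0."""
        c, a, w = _bin_stats(confidences, correct, n_bins)
        c_bar, a_bar = np.average(c, weights=w), np.average(a, weights=w)
        cov = np.average((c - c_bar) * (a - a_bar), weights=w)
        var = np.average((c - c_bar) ** 2, weights=w)
        return float(cov / var)

A well-calibrated model keeps ece near 0 and weighted_slope near 1.0; systematic overconfidence shows up as a large ECE together with a slope well below 1.0.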

Our experiments reveal that all evaluated models are currently far from the ideal state, exhibiting a universal tendency toward overconfidence.

While Kimi-K2.5 achieves the best performance on both metrics, recording an ECE of 37.9% and a Slope of 0.550, a significant gap remains to be bridged in the pursuit of "honesty" and "alignment." Enhancing multimodal models' awareness of the boundaries of their own knowledge is a critical direction for future exploration.

Conclusion

WorldVQA is a simple but challenging benchmark for evaluating the atomic visual knowledge of frontier models. Improving performance on WorldVQA is a necessary step toward the next generation of AI agents. We are open-sourcing the WorldVQA dataset and evaluation scripts to help the community close the visual knowledge gap.