
KITTEN: A Knowledge-Integrated Evaluation of Image Generation on Visual Entities

  • Hsin Ping Huang
  • Xinyi Wang
  • Yonatan Bitton
  • Hagai Taitelbaum
  • Gaurav Singh Tomar
  • Ming Wei Chang
  • Xuhui Jia
  • Kelvin C.K. Chan
  • Hexiang Hu
  • Yu Chuan Su
  • Ming Hsuan Yang

Research output: Contribution to journal, Article, peer-review

Abstract

Recent advances in text-to-image generation have improved the quality of synthesized images, but evaluations mainly focus on aesthetics or alignment with text prompts. Thus, it remains unclear whether these models can accurately represent a wide variety of realistic visual entities. To bridge this gap, we propose Kitten, a benchmark for Knowledge-InTegrated image generaTion on real-world ENtities. Using Kitten, we conduct a systematic study of recent text-to-image models, retrieval-augmented models, and unified understanding and generation models, focusing on their ability to generate real-world visual entities such as landmarks and animals. Analyses using carefully designed human evaluations, automatic metrics, and MLLMs as judges show that even advanced text-to-image and unified models fail to generate accurate visual details of entities. While retrieval-augmented models improve entity fidelity by incorporating reference images, they tend to over-rely on them and struggle to create novel configurations of the entities in creative text prompts. The dataset and evaluation code are publicly available at https://kitten-project.github.io.

Original language: English
Journal: Transactions on Machine Learning Research
Volume: 2026-January
State: Published - 2026
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2026, Transactions on Machine Learning Research. All rights reserved.
