Evaluation Metrics for Generative Models: An Empirical Study

Eyal Betzalel, Coby Penso, Ethan Fetaya

Research output: Contribution to journal › Article › peer-review

Abstract

Generative models such as generative adversarial networks, diffusion models, and variational auto-encoders have become prevalent in recent years. While these models have shown remarkable results, evaluating their performance is challenging. This issue is of vital importance for pushing research forward and for distinguishing meaningful improvements from random noise. Currently, heuristic metrics such as the Inception Score (IS) and Fréchet Inception Distance (FID) are the most common evaluation metrics, but what they measure is not entirely clear, and there are questions about how meaningful their scores actually are. In this work, we propose a novel evaluation protocol for likelihood-based generative models, based on generating a high-quality synthetic dataset on which classical metrics can be estimated for comparison. This scheme exploits knowledge of the underlying likelihood values of the data to measure the divergence between the model-generated data and the synthetic dataset. Our study shows that while FID and IS correlate with several f-divergences, their rankings of close models can vary considerably, making them problematic for fine-grained comparison. We further use this experimental setting to study which evaluation metric best correlates with our probabilistic metrics.
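For background on the metrics discussed in the abstract: FID is the Fréchet (2-Wasserstein) distance between two Gaussians fitted to Inception features of real and generated samples. The sketch below is a minimal illustration of that standard computation, not code from the paper; `feats_real` and `feats_gen` are assumed to be hypothetical [N, d] arrays of pre-extracted Inception activations.

```python
# Minimal FID sketch between two sets of pre-extracted Inception features.
# Feature extraction itself is omitted; feats_real / feats_gen are assumed
# to be [N, d] NumPy arrays (hypothetical names, not from the paper).
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    # Fit a Gaussian (mean, covariance) to each feature set.
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # Frechet distance between the two Gaussian fits:
    # ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 (Sigma_r Sigma_g)^{1/2})
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical noise
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```

By contrast, when the ground-truth likelihoods are available, as in the synthetic-data protocol the abstract describes, f-divergences can be estimated directly; for example, a Monte Carlo estimate of the KL divergence is KL(p‖q) ≈ (1/N) Σᵢ [log p(xᵢ) − log q(xᵢ)] with xᵢ sampled from p. This is the general form of such estimators, not necessarily the exact procedure used in the paper.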
Original language: English
Pages (from-to): 1531-1544
Number of pages: 14
Journal: Machine Learning and Knowledge Extraction
Volume: 6
Issue number: 3
State: Published - Sep 2024

Bibliographical note

Publisher Copyright:
© 2024 by the authors.

Keywords

  • generative models
  • performance evaluation
  • synthetic dataset
