TY - JOUR
T1 - Evaluation Metrics for Generative Models: An Empirical Study
T2 - Machine Learning and Knowledge Extraction
AU - Betzalel, Eyal
AU - Penso, Coby
AU - Fetaya, Ethan
N1 - Publisher Copyright:
© 2024 by the authors.
PY - 2024/9
Y1 - 2024/9
AB - Generative models such as generative adversarial networks, diffusion models, and variational auto-encoders have become prevalent in recent years. While these models have shown remarkable results, evaluating their performance is challenging. This issue is of vital importance for pushing research forward and separating meaningful gains from random noise. Currently, heuristic metrics such as the inception score (IS) and Fréchet inception distance (FID) are the most common evaluation metrics, but what they measure is not entirely clear. Additionally, there are questions regarding how meaningful their scores actually are. In this work, we propose a novel evaluation protocol for likelihood-based generative models, based on generating a high-quality synthetic dataset on which we can estimate classical metrics for comparison. This new scheme harnesses the advantage of knowing the underlying likelihood values of the data by measuring the divergence between the model-generated data and the synthetic dataset. Our study shows that while FID and IS correlate with several f-divergences, their rankings of close models can vary considerably, making them problematic for fine-grained comparison. We further use this experimental setting to study which evaluation metric best correlates with our probabilistic metrics.
KW - generative models
KW - performance evaluation
KW - synthetic dataset
UR - http://www.scopus.com/inward/record.url?scp=85205220350&partnerID=8YFLogxK
U2 - 10.3390/make6030073
DO - 10.3390/make6030073
M3 - Article
SN - 2504-4990
VL - 6
SP - 1531
EP - 1544
JO - Machine Learning and Knowledge Extraction
JF - Machine Learning and Knowledge Extraction
IS - 3
ER -