TY - JOUR
T1 - Optimization of molecular representativeness
AU - Yosipof, Abraham
AU - Senderowitz, Hanoch
PY - 2014/6/23
Y1 - 2014/6/23
AB - Representative subsets selected from within larger data sets are useful in many chemoinformatics applications including the design of information-rich compound libraries, the selection of compounds for biological evaluation, and the development of reliable quantitative structure-activity relationship (QSAR) models. Such subsets can overcome many of the problems typical of diverse subsets, most notably the tendency of the latter to focus on outliers. Yet only a few algorithms for the selection of representative subsets have been reported in the literature. Here we report on the development of two algorithms for the selection of representative subsets from within parent data sets based on the optimization of a newly devised representativeness function either alone or simultaneously with the MaxMin function. The performances of the new algorithms were evaluated using several measures representing their ability to produce (1) subsets which are, on average, close to data set compounds; (2) subsets which, on average, span the same space as spanned by the entire data set; (3) subsets mirroring the distribution of biological indications in a parent data set; and (4) test sets which are well predicted by qualitative QSAR models built on data set compounds. We demonstrate that for three data sets (containing biological indication data, logBBB permeation data, and Plasmodium falciparum inhibition data), subsets obtained using the new algorithms are more representative than subsets obtained by hierarchical clustering, k-means clustering, or the MaxMin optimization at least in three of these measures.
UR - http://www.scopus.com/inward/record.url?scp=84903291466&partnerID=8YFLogxK
U2 - 10.1021/ci400715n
DO - 10.1021/ci400715n
M3 - Article
C2 - 24802762
AN - SCOPUS:84903291466
SN - 1549-9596
VL - 54
SP - 1567
EP - 1577
JO - Journal of Chemical Information and Modeling
JF - Journal of Chemical Information and Modeling
IS - 6
ER -