k-nearest neighbors optimization-based outlier removal

Abraham Yosipof, Hanoch Senderowitz

Research output: Contribution to journal, Article, peer-review

23 Scopus citations

Abstract

Datasets of molecular compounds often contain outliers, that is, compounds that differ from the rest of the dataset. Outliers, while often interesting, may affect data interpretation, model generation, and decision making, and should therefore be removed from the dataset prior to modeling efforts. Here, we describe a new method for the iterative identification and removal of outliers based on a k-nearest neighbors optimization algorithm. We demonstrate for three different datasets that removing outliers with the new algorithm provides filtered datasets that are better than those provided by four alternative outlier removal procedures, as well as by random compound removal, in two important aspects: (1) they better maintain the diversity of the parent datasets; (2) they give rise to quantitative structure-activity relationship (QSAR) models with much better prediction statistics. The new algorithm is, therefore, suitable for the pretreatment of datasets prior to QSAR modeling.
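The abstract does not spell out the optimization procedure, so the following is only a rough illustration of the general idea of iterative, distance-based outlier removal, and not the authors' algorithm. It swaps in a simple greedy heuristic: at each iteration, every compound is scored by its mean Euclidean distance to its k nearest neighbors in descriptor space, and the highest-scoring compound is dropped. The function names `knn_scores` and `remove_outliers_greedy`, the choice of Euclidean distance, and the parameters `k` and `n_remove` are all assumptions made for this sketch.

```python
import numpy as np


def knn_scores(X, k):
    """Mean Euclidean distance from each row of X to its k nearest neighbors."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nearest = np.sort(dist, axis=1)[:, :k]
    return nearest.mean(axis=1)


def remove_outliers_greedy(X, k=3, n_remove=1):
    """Greedy stand-in heuristic (hypothetical, not the published method).

    Repeatedly drops the row with the largest kNN score, recomputing the
    scores after every removal. Returns (kept indices, removed indices),
    both referring to rows of the original X; removed is in removal order.
    """
    X = np.asarray(X, dtype=float)
    if k < 1 or n_remove < 0 or len(X) - n_remove < k:
        raise ValueError("need k >= 1 and at least k rows left after removal")
    keep = np.arange(len(X))
    removed = []
    for _ in range(n_remove):
        scores = knn_scores(X[keep], k)
        worst = int(np.argmax(scores))
        removed.append(int(keep[worst]))
        keep = np.delete(keep, worst)
    return keep, removed
```

Calling `remove_outliers_greedy(X, k=2, n_remove=1)` on a small descriptor matrix `X` returns the indices of the retained rows together with the index of the single most isolated row. Recomputing the scores after each removal is what makes the scheme iterative: once one compound is dropped, the neighborhoods of the remaining compounds can change.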

Original language: English
Pages (from-to): 493-506
Number of pages: 14
Journal: Journal of Computational Chemistry
Volume: 36
Issue number: 8
State: Published - 30 Mar 2015

Bibliographical note

Publisher Copyright:
© 2014 Wiley Periodicals, Inc.

Keywords

  • Distance-based method
  • Optimization
  • Outlier detection
  • Outlier removal
  • Quantitative structure activity relationship
  • k-nearest neighbors
