TY - JOUR
T1 - Discrete profile comparison using information bottleneck
AU - O'Rourke, Sean
AU - Chechik, Gal
AU - Friedman, Robin
AU - Eskin, Eleazar
PY - 2006/3/20
Y1 - 2006/3/20
N2 - Sequence homologs are an important source of information about proteins. Amino acid profiles, representing the position-specific mutation probabilities found in profiles, are a richer encoding of biological sequences than the individual sequences themselves. However, profile comparisons are an order of magnitude slower than sequence comparisons, making profiles impractical for large datasets. Also, because they are such a rich representation, profiles are difficult to visualize. To address these problems, we describe a method to map probabilistic profiles to a discrete alphabet while preserving most of the information in the profiles. We find an informationally optimal discretization using the Information Bottleneck approach (IB). We observe that an 80-character IB alphabet captures nearly 90% of the amino acid occurrence information found in profiles, compared to the consensus sequence's 78%. Distant homolog search with IB sequences is 88% as sensitive as with profiles compared to 61% with consensus sequences (AUC scores 0.73, 0.83, and 0.51, respectively), but like simple sequence comparison, is 30 times faster. Discrete IB encoding can therefore expand the range of sequence problems to which profile information can be applied to include batch queries over large databases like SwissProt, which were previously computationally infeasible.
AB - Sequence homologs are an important source of information about proteins. Amino acid profiles, representing the position-specific mutation probabilities found in profiles, are a richer encoding of biological sequences than the individual sequences themselves. However, profile comparisons are an order of magnitude slower than sequence comparisons, making profiles impractical for large datasets. Also, because they are such a rich representation, profiles are difficult to visualize. To address these problems, we describe a method to map probabilistic profiles to a discrete alphabet while preserving most of the information in the profiles. We find an informationally optimal discretization using the Information Bottleneck approach (IB). We observe that an 80-character IB alphabet captures nearly 90% of the amino acid occurrence information found in profiles, compared to the consensus sequence's 78%. Distant homolog search with IB sequences is 88% as sensitive as with profiles compared to 61% with consensus sequences (AUC scores 0.73, 0.83, and 0.51, respectively), but like simple sequence comparison, is 30 times faster. Discrete IB encoding can therefore expand the range of sequence problems to which profile information can be applied to include batch queries over large databases like SwissProt, which were previously computationally infeasible.
UR - http://www.scopus.com/inward/record.url?scp=33947322857&partnerID=8YFLogxK
U2 - 10.1186/1471-2105-7-S1-S8
DO - 10.1186/1471-2105-7-S1-S8
M3 - ???researchoutput.researchoutputtypes.contributiontojournal.article???
C2 - 16723011
AN - SCOPUS:33947322857
SN - 1471-2105
VL - 7
JO - BMC Bioinformatics
JF - BMC Bioinformatics
IS - SUPPL.1
M1 - S8
ER -