Abstract
Sequence libraries that cover all k-mers enable universal, unbiased measurements of binding to both oligonucleotides and peptides. While the number of k-mers grows exponentially in k, space on all experimental platforms is limited. Here, we shrink k-mer library sizes by using joker characters, which represent all characters in the alphabet simultaneously. We present the JokerCAKE (joker covering all k-mers) algorithm for generating a short sequence such that each k-mer appears at least p times with at most one joker character per k-mer. By running our algorithm on a range of parameters and alphabets, we show that JokerCAKE produces near-optimal sequences. Moreover, through comparison with data from hundreds of DNA-protein binding experiments and with new experimental results for both standard and JokerCAKE libraries, we establish that accurate binding scores can be inferred for high-affinity k-mers using JokerCAKE libraries. JokerCAKE libraries allow researchers to search a significantly larger sequence space using the same number of experimental measurements and at the same cost.

We present a new compact sequence design that covers all k-mers utilizing joker characters and develop an efficient algorithm to generate such designs. We show through simulations and experimental validation that these sequence designs are useful for identifying high-affinity binding sites at significantly reduced cost and space.
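The coverage property stated in the abstract (each k-mer appears at least p times, with at most one joker character per k-mer) can be illustrated with a small checker. This is a minimal sketch and not the JokerCAKE algorithm itself: it only verifies a given sequence and does not generate one. The function names `kmer_coverage` and `covers_all`, the joker symbol `N`, and the default DNA alphabet are assumptions made for illustration, and the sketch ignores any platform-specific details such as reverse complements.

```python
from itertools import product

JOKER = "N"  # assumed joker symbol: stands for every alphabet character


def kmer_coverage(seq, k, alphabet="ACGT", joker=JOKER):
    """Count how many windows of `seq` cover each k-mer over `alphabet`.

    A length-k window covers a k-mer if every position holds either the
    same character or a joker. Windows containing more than one joker
    are skipped, mirroring the "at most one joker per k-mer" constraint.
    """
    counts = {"".join(chars): 0 for chars in product(alphabet, repeat=k)}
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        n_jokers = window.count(joker)
        if n_jokers > 1:
            continue
        if n_jokers == 0:
            candidates = [window]
        else:
            pos = window.index(joker)
            # One joker expands to one candidate k-mer per alphabet character.
            candidates = [window[:pos] + c + window[pos + 1:] for c in alphabet]
        for kmer in candidates:
            if kmer in counts:  # ignore windows with out-of-alphabet symbols
                counts[kmer] += 1
    return counts


def covers_all(seq, k, p=1, alphabet="ACGT", joker=JOKER):
    """Return True if every k-mer is covered at least `p` times."""
    counts = kmer_coverage(seq, k, alphabet, joker)
    return all(count >= p for count in counts.values())
```

For the two-letter alphabet `AB` and k = 2, a joker-free sequence such as `AABBA` needs five characters to cover all four 2-mers, while `ANBA` covers them in four, because the window `AN` covers both `AA` and `AB`, and `NB` covers both `AB` and `BB`.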
Original language | English |
---|---|
Pages (from-to) | 230-236.e5 |
Journal | Cell Systems |
Volume | 5 |
Issue number | 3 |
State | Published - 27 Sep 2017 |
Externally published | Yes |
Bibliographical note
Publisher Copyright: © 2017 The Authors
Funding
This work was supported by the NIH (grant R01GM081871 to B.B., grant R00GM09984804 to P.F.). Part of this work was done while Y.O. was visiting the Simons Institute for the Theory of Computing. Part of this work was done while R.K. was visiting the Research Science Institute and was supported by the Center for Excellence in Education and their sponsors. P.F. is a Chan Zuckerberg Biohub Investigator and also acknowledges the support of a Gabilan and McCormick Fellowship for this work. An early version of this paper was submitted to and peer reviewed at the 2017 Annual International Conference on Research in Computational Molecular Biology (RECOMB). The manuscript was revised and then independently further reviewed at Cell Systems.
Funders | Funder number |
---|---|
National Institutes of Health | R00GM09984804 |
National Institute of General Medical Sciences | R01GM081871 |
Keywords
- de Bruijn graph
- microarray design
- sequence libraries