Abstract
G-quadruplexes are non-B-DNA structures that form in the genome facilitated by Hoogsteen bonds between guanines in single or multiple strands of DNA. The functions of G-quadruplexes are linked to various molecular and disease phenotypes, and thus researchers are interested in measuring G-quadruplex formation genome-wide. Experimentally measuring G-quadruplexes is a long and laborious process. Computational prediction of G-quadruplex propensity from a given DNA sequence is thus a long-standing challenge. Unfortunately, despite the availability of high-throughput datasets measuring G-quadruplex propensity in the form of mismatch scores, extant methods to predict G-quadruplex formation either rely on small datasets or are based on domain-knowledge rules. We developed G4mismatch, a novel algorithm to accurately and efficiently predict G-quadruplex propensity for any genomic sequence. G4mismatch is based on a convolutional neural network trained on almost 400 millions human genomic loci measured in a single G4-seq experiment. When tested on sequences from a held-out chromosome, G4mismatch, the first method to predict mismatch scores genome-wide, achieved a Pearson correlation of over 0.8. When benchmarked on independent datasets derived from various animal species, G4mismatch trained on human data predicted G-quadruplex propensity genome-wide with high accuracy (Pearson correlations greater than 0.7). Moreover, when tested in detecting G-quadruplexes genome-wide using the predicted mismatch scores, G4mismatch achieved superior performance compared to extant methods. Last, we demonstrate the ability to deduce the mechanism behind G-quadruplex formation by unique visualization of the principles learned by the model.
Original language | English |
---|---|
Article number | e1010948 |
Journal | PLoS Computational Biology |
Volume | 19 |
Issue number | 3 |
DOIs | |
State | Published - Mar 2023 |
Bibliographical note
Publisher Copyright:Copyright: © 2023 Barshai et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.