TY - JOUR
T1 - A novel approach to T-cell receptor beta chain (TCRB) repertoire encoding using lossless string compression
AU - Konstantinovsky, Thomas
AU - Yaari, Gur
N1 - Publisher Copyright: © The Author(s) 2023.
PY - 2023/7/1
Y1 - 2023/7/1
N2 - Motivation: T-cell receptor beta chain (TCRB) repertoires are crucial for understanding immune responses. However, their high diversity and complexity present significant challenges in representation and analysis. The main motivation of this study is to develop a unified and compact representation of a TCRB repertoire that can efficiently capture its inherent complexity and diversity and allow for direct inference. Results: We introduce a novel approach to TCRB repertoire encoding and analysis, leveraging the Lempel-Ziv 76 algorithm. This approach allows us to create a graph-like model, identify specific sequence features, and produce a new encoding approach for an individual’s repertoire. The proposed representation enables various applications, including generation probability inference, informative feature vector derivation, sequence generation, a new measure for diversity estimation, and a new sequence centrality measure. The approach was applied to four large-scale public TCRB sequencing datasets, demonstrating its potential for a wide range of applications in big biological sequencing data.
AB - Motivation: T-cell receptor beta chain (TCRB) repertoires are crucial for understanding immune responses. However, their high diversity and complexity present significant challenges in representation and analysis. The main motivation of this study is to develop a unified and compact representation of a TCRB repertoire that can efficiently capture its inherent complexity and diversity and allow for direct inference. Results: We introduce a novel approach to TCRB repertoire encoding and analysis, leveraging the Lempel-Ziv 76 algorithm. This approach allows us to create a graph-like model, identify specific sequence features, and produce a new encoding approach for an individual’s repertoire. The proposed representation enables various applications, including generation probability inference, informative feature vector derivation, sequence generation, a new measure for diversity estimation, and a new sequence centrality measure. The approach was applied to four large-scale public TCRB sequencing datasets, demonstrating its potential for a wide range of applications in big biological sequencing data.
UR - http://www.scopus.com/inward/record.url?scp=85164843045&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btad426
DO - 10.1093/bioinformatics/btad426
M3 - Article
C2 - 37417959
AN - SCOPUS:85164843045
SN - 1367-4803
VL - 39
JO - Bioinformatics
JF - Bioinformatics
IS - 7
M1 - btad426
ER -