TY - JOUR
T1 - Self-supervised learning of T cell receptor sequences exposes core properties for T cell membership
AU - Kabeli, Romi Goldner
AU - Zevin, Sarit
AU - Abargel, Avital
AU - Zilberberg, Alona
AU - Efroni, Sol
N1 - Publisher Copyright:
© 2024 The Authors.
PY - 2024/4/26
Y1 - 2024/4/26
N2 - The T cell receptor (TCR) repertoire is an extraordinarily diverse collection of TCRs essential for maintaining the body's homeostasis and response to threats. In this study, we compiled an extensive dataset of more than 4200 bulk TCR repertoire samples, encompassing 221,176,713 sequences, alongside 6,159,652 single-cell TCR sequences from over 400 samples. From this dataset, we then selected a representative subset of 5 million bulk sequences and 4.2 million single-cell sequences to train two specialized Transformer-based language models for bulk (CVC) and single-cell (scCVC) TCR repertoires, respectively. We show that these models successfully capture TCR core qualities, such as sharing, gene composition, and single-cell properties. These qualities are emergent in the encoded TCR latent space and enable classification into TCR-based qualities such as public sequences. These models demonstrate the potential of Transformer-based language models in TCR downstream applications.
UR - http://www.scopus.com/inward/record.url?scp=85191625522&partnerID=8YFLogxK
U2 - 10.1126/sciadv.adk4670
DO - 10.1126/sciadv.adk4670
M3 - Article
C2 - 38669334
AN - SCOPUS:85191625522
SN - 2375-2548
VL - 10
JO - Science Advances
JF - Science Advances
IS - 17
M1 - eadk4670
ER -