TY - JOUR
T1 - Immune2vec
T2 - Embedding B/T Cell Receptor Sequences in ℝ^N Using Natural Language Processing
AU - Ostrovsky-Berman, Miri
AU - Frankel, Boaz
AU - Polak, Pazit
AU - Yaari, Gur
N1 - Publisher Copyright: © 2021 Ostrovsky-Berman, Frankel, Polak and Yaari.
PY - 2021/7/22
Y1 - 2021/7/22
AB - The adaptive branch of the immune system learns pathogenic patterns and remembers them for future encounters. It does so through dynamic and diverse repertoires of T- and B-cell receptors (TCRs and BCRs, respectively). These huge immune repertoires in each individual present investigators with the challenge of extracting meaningful biological information from multi-dimensional data. The ability to embed these DNA and amino acid textual sequences in a vector space is an important step towards developing effective analysis methods. Here we present Immune2vec, an adaptation of a natural language processing (NLP)-based embedding technique for BCR repertoire sequencing data. We validate Immune2vec on amino acid 3-gram sequences, continuing to longer BCR sequences, and finally to entire repertoires. Our work demonstrates Immune2vec to be a reliable low-dimensional representation that preserves relevant information of immune sequencing data, such as n-gram properties and IGHV gene family classification. Applying Immune2vec along with machine learning approaches to patient data exemplifies how distinct clinical conditions can be effectively stratified, indicating that the embedding space can be used for feature extraction and exploratory data analysis.
KW - BCR repertoire
KW - NLP
KW - biological sequence embedding
KW - computational immunology
KW - word2vec
UR - http://www.scopus.com/inward/record.url?scp=85112657810&partnerID=8YFLogxK
U2 - 10.3389/fimmu.2021.680687
DO - 10.3389/fimmu.2021.680687
M3 - Article
C2 - 34367141
AN - SCOPUS:85112657810
SN - 1664-3224
VL - 12
JO - Frontiers in Immunology
JF - Frontiers in Immunology
M1 - 680687
ER -