simAIRR: simulation of adaptive immune repertoires with realistic receptor sequence sharing for benchmarking of immune state prediction methods

Chakravarthi Kanduri, Lonneke Scheffer, Milena Pavlović, Knut Dagestad Rand, Maria Chernigovskaya, Oz Pirvandy, Gur Yaari, Victor Greiff, Geir K. Sandve

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Background: Machine learning (ML) has gained significant attention for classifying immune states in adaptive immune receptor repertoires (AIRRs) to support the advancement of immunodiagnostics and therapeutics. Simulated data are crucial for the rigorous benchmarking of AIRR-ML methods. Existing approaches to generating synthetic benchmarking datasets result in the generation of naive repertoires missing the key feature of many shared receptor sequences (selected for common antigens) found in antigen-experienced repertoires.

Results: We demonstrate that a common approach to generating simulated AIRR benchmark datasets can introduce biases, which may be exploited for undesired shortcut learning by certain ML methods. To mitigate undesirable access to true signals in simulated AIRR datasets, we devised a simulation strategy (simAIRR) that constructs antigen-experienced-like repertoires with a realistic overlap of receptor sequences. simAIRR can be used for constructing AIRR-level benchmarks based on a range of assumptions (or experimental data sources) for what constitutes receptor-level immune signals. This includes the possibility of making or not making any prior assumptions regarding the similarity or commonality of immune state-associated sequences that will be used as true signals. We demonstrate the real-world realism of our proposed simulation approach by showing that basic ML strategies perform similarly on simAIRR-generated and real-world experimental AIRR datasets.

Conclusions: This study sheds light on the potential shortcut learning opportunities for ML methods that can arise with the state-of-the-art way of simulating AIRR datasets. simAIRR is available as a Python package: https://github.com/KanduriC/simAIRR.

Original language: English
Article number: giad074
Journal: GigaScience
Volume: 12
DOIs
State: Published - 28 Dec 2022

Bibliographical note

Publisher Copyright:
© 2023 The Author(s). Published by Oxford University Press GigaScience.

Funding

Supported by the Leona M. and Harry B. Helmsley Charitable Trust (#2019PG-T1D011, to V.G.), UiO World-Leading Research Community (to V.G.), UiO: LifeScience Convergence Environment Immunolingo (to V.G. and G.K.S.), EU Horizon 2020 iReceptorplus (#825821) (to V.G.), a Norwegian Cancer Society Grant (#215817, to V.G.), Research Council of Norway projects (#300740, #331890 to V.G.), a Research Council of Norway IKTPLUSS project (#311341, to V.G. and G.K.S.), and Stiftelsen Kristian Gerhard Jebsen (K. G. Jebsen Coeliac Disease Research Centre, to G.K.S.). Some of the analyses in this work were performed using the Immunohub eInfrastructure funded by the University of Oslo and operated by the authors in close collaboration with the University Senter for Information Technology (USIT), University of Oslo.

Funders    Funder number
Leona M. and Harry B. Helmsley Charitable Trust
Universitetet i Oslo
University Senter for Information Technology

    Keywords

    • AIRR
    • ML
    • adaptive immune receptor repertoires
    • benchmarking of machine learning methods
    • shortcut learning
    • simulation of AIRR data
