bioinformatics: state of art tools for NGS immunogenetics

Nikos Darzentas, Ph.D.
CEITEC MU, Brno, Czech Republic
bat.infspire.org
nikos.darzentas@gmail.com
Ministry of Health of the Czech Republic, grant# 16 34272A; CEITEC MU; ESLHO::EuroClonality NGS; MetaCentrum Virtual Organization of CESNET

Novel Computational Tools for Identifying Stereotyped Subsets in Chronic Lymphocytic Leukemia
Nikos Darzentas, Computational Genomics Unit (CGU), Centre for Research and Technology Hellas (CERTH), Greece, ndarz@certh.gr
Educational Workshop on Immunoglobulin Gene Analysis in Chronic Lymphocytic Leukemia, June 14-15, Uppsala, Sweden

e.g. eventually link to:
* ARResT/AssignSubsets
* other related tools
* diverse datasets / projects
* literature

unique challenges of NGS immunogenetics
* enormous inherent complexity, huge diversity and temporal variation of immune responses
* highly non-trivial annotation vs. multiple germline sequences (from IMGT!) and of the rearrangement junction
* wide variety of applications, many with their own needs: basic research, technology development (e.g. primers and assays), diagnostic / clinical, MRD and clonality assessment and monitoring, repertoire studies
* errors and biases of protocols and humans, in data and results

generic and specific challenges for bioinformatics
* modularity and flexibility, to support the many applications
* multiplexing, i.e. many receptors and chains and junction classes (e.g. incomplete)
* interpretation
  * clonotype definition, with implications for assessment of clonality
  * thresholds and cut-offs and normalisations, incl. what to consider for relative abundances
* visualisation and user interaction
* detailed logging and reporting, for development and troubleshooting, but also interpretation and record keeping
* efficiency, although this is not as big a challenge as with e.g. full genome NGS
* foolproofing, esp. challenging for very diverse applications, data, and users
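The effect of the clonotype definition on relative abundances can be illustrated with a small sketch. It assumes, purely for illustration, that a clonotype is either the junction amino acid sequence alone or the junction plus V and J genes, and that abundance is relative to all annotated reads; neither is claimed to be the definition used by ARResT/Interrogate. The names `Read`, `clonotype_key` and `relative_abundances` are hypothetical, and the junction sequences are made up.

```python
from collections import Counter
from typing import NamedTuple


class Read(NamedTuple):
    v_gene: str
    j_gene: str
    junction_aa: str


def clonotype_key(read: Read, use_genes: bool = True) -> tuple:
    """Hypothetical clonotype definition: junction alone, or junction plus V and J genes."""
    if use_genes:
        return (read.v_gene, read.j_gene, read.junction_aa)
    return (read.junction_aa,)


def relative_abundances(reads, use_genes=True, min_fraction=0.0):
    """Count reads per clonotype and return fractions of all reads, largest first."""
    counts = Counter(clonotype_key(r, use_genes) for r in reads)
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = [(key, n / total) for key, n in counts.most_common()]
    # The cut-off is applied after normalisation, so it is a relative threshold.
    return [(key, frac) for key, frac in ranked if frac >= min_fraction]


# Made-up example reads: the same junction appears with two different V genes.
reads = [
    Read("IGHV1-69", "IGHJ6", "CARDGYYYYGMDVW"),
    Read("IGHV1-69", "IGHJ6", "CARDGYYYYGMDVW"),
    Read("IGHV3-21", "IGHJ6", "CARDGYYYYGMDVW"),
    Read("IGHV4-34", "IGHJ4", "CARGFDYW"),
]
with_genes = relative_abundances(reads, use_genes=True)
junction_only = relative_abundances(reads, use_genes=False)
```

With these reads the top clonotype accounts for half of the reads under the gene-aware definition and three quarters under the junction-only one, which is the kind of shift that matters for a clonality call.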

junction classes
normal
* IG VJ : Vh (Dh) Jh
* IG VJ : Vk Jk
* IG VJ : Vl Jl
* TR VJ : Va Ja
* TR VJ : Va Jd
* TR VJ : Vb (Db) Jb
* TR VJ : Vd (Dd) Ja
* TR VJ : Vd (Dd) Jd
* TR VJ : Vg Jg
incomplete
* IG DJ : Dh Jh
* TR DD : Dd2 Dd3
* TR DJ : Db Jb
* TR DJ : Dd2 Jd
* TR DJ : Dd Ja
special
* IG INTRON KDE
* IG Vk KDE
* TR VD : Vd Dd3
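The junction classes listed above could be held in a simple lookup structure, sketched here. The dictionary content is taken from the slide; the layout and the `classify` helper are hypothetical and only show how a pair of 5' and 3' segments might be mapped to its class.

```python
# Junction classes as listed on the slide; optional D segments are in parentheses.
JUNCTION_CLASSES = {
    "normal": {
        "IG VJ": ["Vh (Dh) Jh", "Vk Jk", "Vl Jl"],
        "TR VJ": ["Va Ja", "Va Jd", "Vb (Db) Jb", "Vd (Dd) Ja", "Vd (Dd) Jd", "Vg Jg"],
    },
    "incomplete": {
        "IG DJ": ["Dh Jh"],
        "TR DD": ["Dd2 Dd3"],
        "TR DJ": ["Db Jb", "Dd2 Jd", "Dd Ja"],
    },
    "special": {
        "IG": ["INTRON KDE", "Vk KDE"],
        "TR VD": ["Vd Dd3"],
    },
}


def classify(five_prime: str, three_prime: str):
    """Return (category, group) for a 5'/3' segment pair, or None if it is not listed."""
    for category, groups in JUNCTION_CLASSES.items():
        for group, patterns in groups.items():
            for pattern in patterns:
                # Ignore the optional, parenthesised D segment when matching.
                segments = [s for s in pattern.split() if not s.startswith("(")]
                if (segments[0], segments[-1]) == (five_prime, three_prime):
                    return category, group
    return None


total_classes = sum(
    len(patterns) for groups in JUNCTION_CLASSES.values() for patterns in groups.values()
)
```

For example, `classify("Dh", "Jh")` resolves to the incomplete IG DJ class, while an unlisted pair such as `classify("Vh", "Jk")` returns `None`.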

ARResT, or Antigen Receptors Research Tool: bioinformatic platform
bat.infspire.org/arrest/
focused on low throughput sequences, and CLL, mainly with IgCLL:
* ARResT/Teiresias: discovering new subsets of stereotyped sequences
* ARResT/SeqCure: curating antigen receptor sequences
* ARResT/AssignSubsets: assigning new members to existing subsets of stereotyped sequences
specifically for NGS, and within ESLHO's EuroClonality NGS consortium (coordinated by Ton):
* ARResT/Interrogate: web accessible, interactive, and integrating a data producing pipeline and a results browser

user experience
* user interactivity, esp. when users and questions can be diverse, as is the case here


user experience
* user messaging system, which will react to user actions and share info, advice, notes, tips, warnings, and errors
* user modes, e.g. simple, advanced, don't even bother, diagnostics, etc.
* application specific modes, e.g. clonality and MRD

visualisations

a) heatmaps: sample dynamics (samples: diagnostic, prior to SCT, donor, after SCT); you can then directly mix and match sample and feature
b) line chart: MRD kinetics after SCT (plot labels: single read / NGS depth)
Graft versus leukemia effects in T prolymphocytic leukemia: evidence from MRD kinetics and TCR repertoire analyses. Sellner, Brüggemann, Schlitt, Knecht, Herrmann, Reigl, Krejci, Bystry, Darzentas et al. (submitted)

sequence forensics : sequence search + network of sequence differences
or assessment, monitoring and quantification of rearrangements of interest
* sensitive (no heuristics)
* smart (rearrangement network aware distance calculation)
* NGS enabled (normalisation based on experimental setup, incl. MRD spike ins)
* adaptive (small/big, e.g. MRD/clonality, data)
* interactive (user control of final results)
(plot: reads vs. distance to target)

sequence forensics : user control
change distance threshold, add more sequences to clone, get final %s
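The "reads vs. distance to target" view and the user-controlled threshold can be sketched as follows. This is a minimal illustration that swaps in plain Levenshtein distance; it is not the rearrangement network aware distance mentioned on the slides, and it applies no spike-in normalisation. The function names and the sequences are hypothetical.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (substitutions, insertions, deletions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # match or substitution
            ))
        previous = current
    return previous[-1]


def reads_by_distance(target: str, read_counts: dict) -> Counter:
    """Total reads at each distance from the target sequence."""
    histogram = Counter()
    for sequence, count in read_counts.items():
        histogram[edit_distance(target, sequence)] += count
    return histogram


def clone_percentage(histogram: Counter, threshold: int, total_reads: int) -> float:
    """Percentage of all reads within `threshold` edits of the target."""
    in_clone = sum(n for d, n in histogram.items() if d <= threshold)
    return 100.0 * in_clone / total_reads


# Made-up target and read counts, for illustration only.
target = "TGTGCGAGAGATGGTTAC"
read_counts = {
    "TGTGCGAGAGATGGTTAC": 900,  # exact match
    "TGTGCGAGAGATGGTTAT": 60,   # one substitution
    "TGTGCGAGAGGTGGTTAT": 30,   # two substitutions
    "TGTGCCAAAGATCGGTAC": 10,   # further away
}
histogram = reads_by_distance(target, read_counts)
total = sum(read_counts.values())
```

Calling `clone_percentage(histogram, threshold, total)` with different thresholds mimics the user moving the distance cut-off and reading off the final percentage for the clone.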

sequence forensics : with network connected interactive multiple alignments, and differences highlighted

sequence forensics : and the ability to simplify the network and reduce the data to a manageable summary

primers
specific functionality of the pipeline to:
* identify primers, also taking into account expected coordinates
* report their frequencies and characteristics, i.e. score and position statistics
* trim sequenced reads to before or after the primer, i.e. leaving on or removing primer
leaving primer on, even if artificial, might help in identifying rearrangements
then, primer development (figure: IGHV1, IGHV2, IGHV3, IGHV4 under experimental conditions A1 and A2)
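The primer steps above (identify near an expected coordinate, then trim with the primer left on or removed) can be sketched as below. This is a minimal illustration assuming a forward-strand primer near the 5' end of the read and mismatch counting by simple position-wise comparison; the actual scoring used by the pipeline is not described in the slides. `find_primer` and `trim` are hypothetical names, and the primer and read sequences are made up.

```python
from typing import Optional, Tuple


def find_primer(read: str, primer: str, expected_start: int,
                window: int = 3, max_mismatches: int = 1) -> Optional[Tuple[int, int]]:
    """Return (start, mismatches) of the best primer hit near expected_start, or None."""
    best = None
    lo = max(0, expected_start - window)
    hi = min(len(read) - len(primer), expected_start + window)
    for start in range(lo, hi + 1):
        candidate = read[start:start + len(primer)]
        mismatches = sum(a != b for a, b in zip(candidate, primer))
        # Keep the hit with the fewest mismatches; ties go to the earliest position.
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best


def trim(read: str, primer: str, start: int, keep_primer: bool) -> str:
    """Trim everything 5' of the primer; optionally remove the primer as well."""
    return read[start:] if keep_primer else read[start + len(primer):]


# Made-up primer and read, for illustration only.
primer = "CAGGTGCAGCTGGTG"
read = "AT" + primer + "CAGTCTGGGGCT"
hit = find_primer(read, primer, expected_start=0)
with_primer = trim(read, primer, hit[0], keep_primer=True)
without_primer = trim(read, primer, hit[0], keep_primer=False)
```

Collecting the `(start, mismatches)` tuples over all reads would give the position and score statistics mentioned above.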

primers
also usable as controls for the health of an NGS run, compared to a gold standard dataset

data quality
our current strategy: keep as much as you can until the end, e.g. paired end joining, sequence length, sequence quality
* PCR and NGS errors can hurt specific applications, e.g. SHM and evolution
* error correction is (arguably) a rather theoretical exercise unless helped by lab work (e.g. unique molecular identifiers / barcodes)
* contamination, wet but also digital: usual demultiplexing does not handle noise well, leading to assignment of reads to no or wrong samples
=> more statistics
* strength in numbers, and replicates, and experimental design in general
* normalize and/or filter on abundance with experimental information (spike in controls, number of cells, or amount of DNA)
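Normalising abundance with experimental information can be sketched with simple arithmetic. The sketch assumes, as an illustration only, that a spike-in control with a known copy number was added to the sample and that reads scale linearly with template copies; the formula and the names `copies_per_read` and `normalised_level` are hypothetical, not a validated quantification method, and the numbers are made up.

```python
def copies_per_read(spike_in_copies: float, spike_in_reads: int) -> float:
    """Conversion factor from reads to template copies, derived from the spike-in."""
    if spike_in_reads <= 0:
        raise ValueError("no spike-in reads: cannot normalise this sample")
    return spike_in_copies / spike_in_reads


def normalised_level(clone_reads: int, spike_in_copies: float,
                     spike_in_reads: int, input_cells: float) -> float:
    """Estimated clone copies per input cell (illustrative formula only)."""
    factor = copies_per_read(spike_in_copies, spike_in_reads)
    return clone_reads * factor / input_cells


# Made-up numbers: 100 spike-in copies gave 2,000 reads; the clone gave 50 reads;
# the sample corresponds to 100,000 cells.
level = normalised_level(clone_reads=50, spike_in_copies=100,
                         spike_in_reads=2_000, input_cells=100_000)
```

The same structure works with amount of DNA in place of number of cells, as long as the unit of the denominator is stated alongside the result.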

EuroClonality NGS
standardisation, and SOPs (standard operating procedures)
this can involve, for example:
* predetermined computations with predetermined options, i.e. locked scenarios and even sample sheets with complete control of complicated runs
* centrally available, curated sequences
  * primers, for development work, and batch quality control
  * spike ins + copy numbers, see MRD quantification

capture based enrichment and NGS
* a conceptually elegant and practically useful approach that can create expert panels of genes
* if probes (or primers) for the IG locus are designed, IG rearrangements could be sequenced as well, incl. incomplete ones
* two main challenges (if applicable):
  * with probe based capture, fragments and thus reads are not centered around the same area, and thus reconstruction of the rearrangements might be needed
  * with paired end NGS, depending on read and fragment lengths, identifying a junction might be difficult, e.g. with non overlapping reads, but still reporting the rearrangement as a translocation event may be useful
* data seen so far show neither breadth nor depth of rearrangements, but major clones are usually found

access to ARResT/Interrogate
* manuscript for browser under revision, pipeline+browser to follow (he said)
* code on GitHub (not the whole platform yet, but soon)
* contact us: nikos.darzentas@gmail.com, bat@infspire.org, bat.infspire.org
* eventually, also shared as a EuroClonality NGS validated and standardised platform
for other ARResT tools, and an overview: bat.infspire.org/arrest

acknowledging
all people in my bioinformatics team: Vojta Bystry, Tomas Reigl, Adam Krejci, Andrea Grioni, and previously Bara and Martin
and all our friends, collaborators and colleagues, including many you probably already know: Lesley, Anastasia, Andreas, Vassilis, Panagiotis, et al.
and across many networks:
C. Belessi (Athens), F. Davi (Paris), P. Ghia (Milan), R. Rosenquist (Uppsala), K. Stamatopoulos (Thessaloniki), M-P. Lefranc, V. Giudicelli (Montpellier)
BIOMED II: A. Langerak, J. van Dongen (Rotterdam+), M. Brüggemann, C. Pott (Kiel), G. Cazzaniga (Monza), F. Davi (Paris), D. Gonzalez (London), P. Groenen (Nijmegen), M-P. Lefranc, V. Giudicelli (Montpellier), K. Stamatopoulos (Thessaloniki)