TCR Repertoire Diversity Background information

Contents
Introduction to TCR Repertoire Sequencing
Glossary
Note on TCR Gene Nomenclature
V(D)J Recombination
Decombinator and Aho-Corasick String Matching
TCR Collapsing
Diversity Metrics Overview
Diversity Metrics of Non-Collapsed Repertoires
References

Introduction to TCR Repertoire Sequencing

Jawed vertebrates have evolved complex adaptive immune systems to protect themselves from the host of parasites and toxins that could cause pathology. This immunity is made possible through the production of a large number of variable antigen receptors, through a pseudorandom process of somatic recombination. This V(D)J recombination rearranges germline-encoded gene segments to produce a novel receptor gene (see below). This is the process by which B cells produce immunoglobulins, and T cells produce TCRs.

There are many V, D and J segments from which a recombining cell can 'choose', and each expressed antigen receptor is a heterodimer of two different, separately recombined chains, thus generating recombinatorial diversity. Furthermore, there is non-templated addition and deletion of nucleotides at recombining gene junctions, massively driving up diversity. Although a simple calculation taking these parameters into account would predict that such a system could produce billions of possible heterodimeric TCRs (by either definition of billion), more conservative estimates place the number of different TCRs within an individual on the order of millions, still exceeding the number of genes in the germline several fold (Arstila et al., 1999).

Historically, TCRs were studied with combinations of flow cytometry, receptor spectratyping and cloning of individual receptor chains (Six et al., 2013). High-throughput DNA sequencing offers us the chance for the most in-depth, high-resolution and unbiased investigation into TCR repertoires to date (Benichou et al., 2012; Six et al., 2013).

Glossary

6N: Hexamer of random nucleotides
AC: Aho-Corasick
ART: Antiretroviral therapy
cDNA: Complementary DNA (product of reverse transcription)
Clonotype: A unique TCR, be it represented as a nucleotide sequence, CDR3 amino acid sequence or Decombinator identifier
DCR: Decombinator (Thomas et al., 2013)
ds (DNA): Double-stranded DNA
P5/P7: Adapter sequences required for binding to Illumina flowcells
SP1/2: Sequencing primers and/or their binding sequences
ss (DNA): Single-stranded DNA
TCR: T cell receptor
v1: First bleed of our HIV patients, immediately before ART begins
v2: Second bleed of our HIV patients, after three months of ART
V(D)J (C): Variable (Diversity) Joining (Constant) genes

Note on TCR Gene Nomenclature

We follow the IMGT standard of antigen receptor nomenclature (Lefranc et al., 2003). In this system, genes are referred to first by receptor type (TCR or immunoglobulin), then by chain (alpha, beta, delta or gamma), then by gene type (variable, diversity, joining or constant), and finally by a number giving the gene family and specific allele. For example, all beta chain joining regions can be referred to as TRBJ, while TRAV1-2*01 refers to the prototypical allele of the variable alpha gene TRAV1-2 (where the hyphen indicates a subfamily). Note that all V, D, J and C regions are also referred to as genes or gene segments.
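To make the naming scheme above concrete, the following is a minimal Python sketch that splits a TCR gene name into the fields just described. The names `TcrGene` and `parse_tcr_gene` are hypothetical and not part of IMGT or Decombinator, and the pattern is deliberately simplified: it only handles TCR names of the form shown in the examples above and rejects anything else.

```python
import re
from typing import NamedTuple, Optional

# Letter codes used in the name, as described in the text above.
CHAINS = {"A": "alpha", "B": "beta", "D": "delta", "G": "gamma"}
GENE_TYPES = {"V": "variable", "D": "diversity", "J": "joining", "C": "constant"}

# Simplified, assumed pattern: TR + chain + gene type, then an optional
# family number, optional "-member" and optional "*allele".
_PATTERN = re.compile(r"^TR([ABDG])([VDJC])(?:(\d+)(?:-(\d+))?)?(?:\*(\d+))?$")


class TcrGene(NamedTuple):
    chain: str
    gene_type: str
    family: Optional[str]
    member: Optional[str]
    allele: Optional[str]


def parse_tcr_gene(name: str) -> TcrGene:
    """Split a simple TCR gene name such as 'TRAV1-2*01' into its fields."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a simple TCR gene name: {name!r}")
    chain, gene_type, family, member, allele = match.groups()
    return TcrGene(CHAINS[chain], GENE_TYPES[gene_type], family, member, allele)


print(parse_tcr_gene("TRAV1-2*01"))
print(parse_tcr_gene("TRBJ"))
```

Names that carry extra qualifiers, or immunoglobulin genes, are outside the scope of this sketch and raise a `ValueError`.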

V(D)J Recombination

Figure S1 shows an example schematic recombination of the human TCR β chain.

Figure S1. Layout and example rearrangement of a human TCR beta chain. A: Scaled schematic of the unrearranged human TRB germline locus, with dotted red lines indicating a magnified view of the TRBJ and TRBC genes. Genes marked with an asterisk are those used in this sample recombination, and are the same as those found in the beta chain of the leukaemia T cell line Jurkat. B: Following V(D)J recombination, the V, D and J genes are brought into direct contact, the intervening DNA having been excised. Red dotted lines show the magnified rearrangement. C: An example transcript of the rearranged TCR, where non-rearranged TRBJ genes and TRBC exons have been spliced out. The CDR3 label is an approximation, but it will always sit around the hypervariable region encoded by the recombination junction(s). This schematic is available on figshare (Heather, 2013c).

For a review of V(D)J recombination, see Alt et al., 1992.

Decombinator and Aho-Corasick String Matching

While there are many ways one might analyse high-throughput TCR data, our lab has developed software called Decombinator, which finds rearrangements by looking for 'tag' sequences in sequencing reads, the presence of which uniquely denotes a specific V or J gene (Thomas et al., 2013). Having found these tag sequences, Decombinator then uses their positions to fill out a five-field identifier that describes that TCR, and from which the entire nucleotide sequence of that TCR can be recapitulated.

The crucial aspect of Decombinator is its speed: it employs an Aho-Corasick (AC) string matching algorithm to find the V/J tags, which runs orders of magnitude faster than alignment-based approaches (Aho & Corasick, 1975; Thomas et al., 2013). The AC algorithm uses a given set of keywords, in our case either the V or J tags of a TCR chain, to generate a trie which can be navigated by a finite state automaton, where a failure to traverse a given branch of the trie navigates to the longest available match instead of returning to the start node (see Figure S2). This permits searching for every keyword or tag in the trie in one pass of the query string, hugely increasing speed.

Figure S2. Example Aho-Corasick finite-state automaton, which can be used to search for DNA patterns. Black lines indicate standard traversals of the trie, which correspond to successfully finding the next appropriate character. Blue lines indicate failure functions, where if the correct character is not found, the automaton can then navigate to an alternative branch, using the longest matched suffix on the current branch to form the longest possible prefix on another. This schematic is available on figshare (Heather, 2013a), as is an animation of the process (Heather, 2013b).
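The trie and failure functions described above can be sketched in a few lines of Python. This is a minimal illustration of the general Aho-Corasick technique, not Decombinator's own code: the function names `build_automaton` and `search` are hypothetical, and the three short keywords stand in for real V or J tags, which are not given here.

```python
from collections import deque


def build_automaton(keywords):
    """Build the trie (goto), failure links (fail) and per-state outputs."""
    goto = [{}]    # goto[state][char] -> next state; state 0 is the start node
    fail = [0]     # fail[state] -> state to fall back to on a mismatch
    output = [[]]  # output[state] -> keywords that end at this state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)

    # Breadth-first pass: each failure link points to the state for the
    # longest proper suffix of the current path that is also a trie prefix.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def search(text, automaton):
    """Return (start_index, keyword) for every match, in one pass of text."""
    goto, fail, output = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]  # follow failure links instead of restarting
        state = goto[state].get(ch, 0)
        for word in output[state]:
            hits.append((i - len(word) + 1, word))
    return hits


# Illustrative keywords only; these are not real V or J tags.
automaton = build_automaton(["GATTA", "TTAC", "ACA"])
print(search("GGATTACA", automaton))
# -> [(1, 'GATTA'), (3, 'TTAC'), (5, 'ACA')]
```

In this example the three overlapping keywords are all found in a single left-to-right pass: after matching GATTA, the failure link moves to the state for TTA, so TTAC is reported without re-reading earlier characters.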

TCR Collapsing

Despite the notion of using random barcodes to alleviate error entering the adaptive repertoire literature at an early point (Weinstein et al., 2009), it is yet to be widely adopted. Figure S3 depicts the stages of the error-correcting, frequency-collapsing code used to process the data shown in figures 1-3 of the poster. Most analysis steps, including the modified Decombinator and TCR collapsing, are coded in Python, while the diversity measures are calculated and plotted in R.

Figure S3: Flowchart depicting the various steps involved in processing the raw output of a modified, verbose Decombinator into an error-corrected and frequency-collapsed repertoire.
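The general idea of frequency collapsing with random barcodes can be illustrated with a short Python sketch. This is not the pipeline shown in Figure S3, whose individual steps are not reproduced here: the function name `collapse_by_barcode` and the toy reads are hypothetical, the error-correction stage is omitted entirely, and the sketch assumes that reads sharing both a clonotype and a barcode are copies of one original molecule.

```python
from collections import defaultdict


def collapse_by_barcode(reads):
    """Count distinct barcodes per clonotype.

    reads: iterable of (clonotype, barcode) pairs, one pair per sequencing read.
    Returns a dict mapping each clonotype to its collapsed frequency.
    """
    barcodes_seen = defaultdict(set)
    for clonotype, barcode in reads:
        barcodes_seen[clonotype].add(barcode)
    # Reads with the same clonotype and barcode are counted once.
    return {clonotype: len(barcodes) for clonotype, barcodes in barcodes_seen.items()}


# Toy input: clonotype "B" has more reads, but they all share one barcode.
reads = [
    ("A", "ACGTAC"), ("A", "ACGTAC"), ("A", "TTGCAA"),
    ("B", "GGATCC"), ("B", "GGATCC"), ("B", "GGATCC"),
]
print(collapse_by_barcode(reads))
# -> {'A': 2, 'B': 1}
```

Here the raw read counts would rank clonotype B (three reads) above A, whereas the collapsed counts suggest A derived from two input molecules and B from one.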

Diversity Metrics Overview

As others have before us, in order to assess the diversity of TCR repertoires we have borrowed metrics from various fields. For a review of such diversity measures applied to TCR repertoires, see Six et al., 2013.

The Gini index (Ceriani & Verme, 2011) is often used in economics to measure the equality of the distribution of wealth among countries; we consider each country analogous to a unique TCR (either sequence, Decombinator assignation or CDR3 sequence) and wealth analogous to the proportion of total TCR frequency made up by that clone. A Gini index of one equates to total inequality, i.e. one country with all the money or only one clone of TCR in the population, whereas a Gini index of zero represents total equality, where every country has the same amount of money or every TCR is present in equal numbers. Thus for our TCR repertoires, the lower the Gini index, the more equal the distribution of TCR frequencies. As we don't include TCRs which didn't appear in a sample (i.e. those with a frequency of zero), and tend to obtain numbers of unique clonotypes within about an order of magnitude of one another for most samples, we can consider a lower Gini index to represent a greater diversity.

Shannon entropy was developed for use in information theory, where it is used to represent the information content of a given message (Shannon, 1948). However, it can be used more broadly to measure variability or diversity, such as is necessary to understand variable antigen receptor repertoires (Stewart et al., 1997).

The Simpson diversity index (and its derivatives) was developed for use when classifying individuals within groups (Simpson, 1949). It is popular in ecology for studying the diversity of species in an environment, which translates very well to measuring the diversity of different clonotypes.
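As a rough illustration of the three metrics above, the following Python sketch computes each from a list of clonotype counts. It is not the R code used for the poster: the function names are hypothetical, the formulas are common textbook forms that may differ from the exact variants used there, and the entropy base (2, giving bits) is an assumption.

```python
import math


def gini_index(counts):
    """Gini index of clonotype counts: 0 means all clonotypes equally frequent."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    # Rank-weighted form: G = sum_i (2i - n - 1) * x_i / (n * total),
    # with i = 1..n over counts sorted in ascending order.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


def shannon_entropy(counts, base=2):
    """Shannon entropy of the clonotype frequency distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base) for c in counts if c > 0)


def simpson_index(counts):
    """Probability that two reads drawn with replacement share a clonotype."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


even = [25, 25, 25, 25]  # four equally frequent clonotypes
skewed = [97, 1, 1, 1]   # one dominant clonotype

for name, counts in (("even", even), ("skewed", skewed)):
    print(name, gini_index(counts), shannon_entropy(counts), simpson_index(counts))
```

For the even repertoire the Gini index is 0 and the entropy is at its maximum for four clonotypes (2 bits), while the skewed repertoire gives a higher Gini index, lower entropy and a higher Simpson index. Note that with this sample formula the Gini index of n observed clonotypes cannot exceed (n - 1)/n, so it only approaches one when a dominant clone is accompanied by many rare ones.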

Diversity Metrics of Non-Collapsed Repertoires

As an adjunct to figure 2 of the poster, Figure S4 shows the same plots but with the original, uncollapsed raw data included as well. Collapsing repertoires causes a clear reduction in diversity in all cases, often accompanied by an increase in the ability to distinguish healthy from HIV, or HIV bleed 1 from HIV bleed 2.

Figure S4: Modified version of figure 2, wherein the original, uncollapsed repertoires are also shown (green) to compare against the collapsed repertoires (blue). As with the poster, Wilcoxon test significance levels (unpaired, one-sided) are: p < 1 x 10^-5, ***; p < 1 x 10^-3, **; p < 5 x 10^-2, *.

References

Aho, A. V., & Corasick, M. J. (1975). Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6), doi: /
Alt, F., Oltz, E., Young, F., Gorman, J., Taccioli, G., & Chen, J. (1992). VDJ recombination. Immunology Today, 13(8), doi: / (92)
Arstila, T. P., Casrouge, A., Baron, V., Even, J., Kanellopoulos, J., & Kourilsky, P. (1999). A Direct Estimate of the Human T Cell Receptor Diversity. Science, 286(5441), doi: /science
Benichou, J., Ben-Hamo, R., Louzoun, Y., & Efroni, S. (2012). Rep-Seq: uncovering the immunological repertoire through next-generation sequencing. Immunology, 135(3), doi: /j x
Ceriani, L., & Verme, P. (2011). The origins of the Gini index: extracts from Variabilità e Mutabilità (1912) by Corrado Gini. The Journal of Economic Inequality, 10(3), doi: /s x
Heather, J. (2013a). Aho Corasick Finite State Machine for DNA String Matching. figshare. doi: /m9.figshare
Heather, J. (2013b). Aho Corasick String Matching Video. doi: /m9.figshare
Heather, J. (2013c). V(D)J recombination in the human Tcrb locus. figshare. doi: /m9.figshare
Lefranc, M.-P., Pommié, C., Ruiz, M., Giudicelli, V., Foulquier, E., Truong, L., Lefranc, G. (2003). IMGT unique numbering for immunoglobulin and T cell receptor variable domains and Ig superfamily V-like domains. Developmental and Comparative Immunology, 27(1), Retrieved from
Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, XXVII(3), 379.
Simpson, E. H. (1949). Measurement of Diversity. Nature, 163, 688.
Six, A., Mariotti-Ferrandiz, M. E., Chaara, W., Magadan, S., Pham, H.-P., Lefranc, M.-P., Boudinot, P. (2013). The Past, Present, and Future of Immune Repertoire Biology - The Rise of Next-Generation Repertoire Analysis. Frontiers in Immunology, 4(November), doi: /fimmu
Stewart, J. J., Lee, C. Y., Ibrahim, S., Watts, P., Shlomchik, M., Weigert, M., & Litwin, S. (1997). A Shannon entropy analysis of immunoglobulin and T cell receptor. Molecular Immunology, 34(15), Retrieved from
Thomas, N., Heather, J., Ndifon, W., Shawe-Taylor, J., & Chain, B. (2013). Decombinator: a tool for fast, efficient gene assignment in T-cell receptor sequences using a finite state machine. Bioinformatics, 29(5), doi: /bioinformatics/btt004
Weinstein, J. A., Jiang, N., White, R. A., Fisher, D. S., & Quake, S. R. (2009). High-throughput sequencing of the zebrafish antibody repertoire. Science (New York, N.Y.), 324(5928), doi: /science