Nature Immunology: doi: /ni Supplementary Figure 1. Data-processing pipeline.




Supplementary Figure 1 Data-processing pipeline. Steps for processing data from multiple sorted B cell populations derived from a single individual at a single time point are shown. Parameters used are indicated to the right of the corresponding box. These steps are implemented by a combination of publicly available software (fastq-joiner), public resources (IMGT/HighV-QUEST), and custom software written in Perl and Matlab by the authors. After raw reads are joined, each sequence is confirmed to meet a length threshold of 200 bp and is eliminated if it contains more than 0, 10, or 15 bp with quality scores below Q10, Q20, and Q30, respectively. A subset of 150,000 sequences is then randomly chosen from the total to reduce the computational load, and these sequences are aligned with the IMGT/HighV-QUEST tool (IMGT.org). Alignment results are filtered by removing "unproductive" and "unknown" sequences. Custom software is then used to identify clusters of sequences based on the clonal identification metric (see Methods). From the final set, 50,000 sequences are then randomly chosen, again to reduce the computational load of the downstream analyses, to retain a similar number of sequences in each data set, and for display purposes. All IMGT/HighV-QUEST output is retained throughout the process and used for mutation calculations and alignment analyses.
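The length and quality filter and the random subsampling described in this legend can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' Perl/Matlab code: the function names `passes_quality` and `subsample` are hypothetical, the 200 bp threshold is assumed to be a minimum length, and quality values are assumed to be per-base Phred scores already decoded to integers.

```python
import random

# (Phred cutoff, maximum number of bases allowed below that cutoff),
# taken from the legend: more than 0 / 10 / 15 bp below Q10 / Q20 / Q30
# eliminates the read.
QUALITY_LIMITS = ((10, 0), (20, 10), (30, 15))


def passes_quality(seq, quals, min_len=200, limits=QUALITY_LIMITS):
    """Return True if a joined read passes the length and quality filter.

    seq   -- nucleotide string of the joined read
    quals -- per-base Phred scores (integers), same length as seq
    Assumption: "length threshold of 200 bp" means at least 200 bp.
    """
    if len(seq) < min_len:
        return False
    for q_cut, max_low in limits:
        low_bases = sum(1 for q in quals if q < q_cut)
        if low_bases > max_low:
            return False
    return True


def subsample(items, n, seed=0):
    """Randomly choose n items without replacement (all items if fewer than n).

    The fixed default seed is an assumption made here for reproducibility.
    """
    items = list(items)
    if len(items) <= n:
        return items
    return random.Random(seed).sample(items, n)
```

Under these assumptions, the two subsampling steps in the pipeline would correspond to `subsample(reads, 150_000)` before alignment and `subsample(clustered, 50_000)` after clustering.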

Supplementary Figure 2 Validation of identity thresholds for the clonality identification metric. Grouped lineages of ASCs between sample replicates, cell populations and individual subjects are shown under different analytical requirements. (a) Circos plots display connectivity between: a single CD138- ASC sample split into 3 separate fractions; a CD138+ ASC sample from the same individual (FLU 4, gray); and one CD138- ASC sample from a different individual (FLU 1, blue). Percentages in the upper left corner of each Circos plot indicate the level of CDR3 sequence similarity used to assign membership within the same clone. The two right-hand plots also require identical junction regions, defined as the 3 bp before and after the VH-D split and the 3 bp before and after the D-JH split. The clone labeled 1 in the 85% threshold plot is split into multiple clones (1a, 1b, and 1c) when the identical junction requirement is used. (b) Plot of lineage sizes for two replicates of CD138- ASCs from FLU 4, based on the 85% threshold, shows a high correlation between the sizes of the largest clones in each data set. Deviation from similar-sized clones within the pair of data sets occurs primarily in small clones, which are more prone to reflect sampling and/or sequencing error. The high correlation of large clones within replicates further validates the ability to identify expanded clones in the circulating B cell and ASC populations. Blue numbers indicate the number of sequences per clone (non-log) for very small clones.

Supplementary Figure 3 Effect of the requirement for identical V-D and D-J junctions on the clonal identification metric. (a) Alignment of a sample of VH sequences from the top FLU 4 CD138- ASC clone defined in Supplementary Fig. 2a using a requirement of 85% HCDR3 identity. Consistent with a shared clonal origin, this alignment demonstrates both shared and unique mutations accumulating in a step-wise fashion within VH rearrangements that share a highly conserved HCDR3, including a conserved VH-D junction. However, the added requirement of complete conservation of both the VH-D and D-JH junctions splits these sequences into 3 separate clones. Red indicates base differences from germline VH4-38 and differences within the clonally related HCDR3 sequences. (b) Alignments of example sequences from each of the lineages in a, focusing on the CDR3 and surrounding region to illustrate the degree of HCDR3 and junctional conservation.

Supplementary Figure 4 Distribution of lineage sizes for 13 samples. Left: CD138- ASC. Right: CD138+ ASC. Lineages are lined up in size order from bottom to top along the extent of the y-axis, which represents 100% of all the sequences. Horizontal lines delineate the individual lineages. The x-axis is the normalized lineage size (percentage of the total number of sequences). Identical sequences are included in lineage composition.

Supplementary Figure 5 Distribution of lineage sizes for 13 samples. Left: naïve B cells. Right: IgD- memory cells. Lineages are lined up in size order from bottom to top along the extent of the y-axis, which represents 100% of all the sequences. Horizontal lines delineate the individual lineages. The x-axis is the normalized lineage size (percentage of the total number of sequences). Identical sequences are included in lineage composition.

Supplementary Figure 6 VH ASC clonal lineages at multiple time points following a flare. Time points are from patient SLE-3 at early flare and 5, 6, 7 and 8 weeks thereafter. Each vertical bar represents the repertoire of ASCs (CD138- and CD138+) at a single time point, with lineages ordered from smallest to largest and the y-axis representing the cumulative % of sequences. The 10 largest VH4-34 clones found at active flare are indicated by colored curves connecting corresponding lineages at multiple time points. All but one of these were found at all five time points and, in all cases, the abundance of these VH4-34 clones within the most expanded clones diminishes beyond the early flare.

Supplemental Table 1. Summary statistics for main set of samples described in this study.

Columns: Flaring SLE (SLE1, SLE2, SLE3, SLE4, SLE5); Influenza Vaccine (FLU1, FLU2, FLU3, FLU4); Tetanus Vaccine (TET1, TET2, TET3, TET4).
Rows: Cells ('000), Sequences, Lineages, D20 and D50, each reported for Na, SwM, CD138- ASC and CD138+ ASC.

The D20 (D50) measure is the number of the largest lineages in a size-ordered list that span 20% (50%) of the sequences. Na: naïve B cells; SwM: IgD- memory B cells; ASC: antibody-secreting cell. See text for a description of how lineages were computed.
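As an illustration of the D20 and D50 definition in the table caption, the measure can be computed from a list of lineage sizes as follows. This is a minimal sketch; the function name `d_measure` is hypothetical, and "span" is assumed to mean that the cumulative count reaches at least the stated percentage of all sequences.

```python
def d_measure(lineage_sizes, percent):
    """Number of largest lineages needed to span `percent`% of all sequences.

    lineage_sizes -- sequence counts per lineage, in any order
    percent       -- integer percentage, e.g. 20 for D20 or 50 for D50
    Assumption: "span" means cumulative size >= percent% of the total.
    """
    total = sum(lineage_sizes)
    if total == 0:
        return 0
    covered = 0
    ordered = sorted(lineage_sizes, reverse=True)  # largest lineages first
    for count, size in enumerate(ordered, start=1):
        covered += size
        # Integer arithmetic avoids floating-point edge cases at the boundary.
        if covered * 100 >= percent * total:
            return count
    return len(ordered)
```

For example, with lineage sizes 40, 30, 20 and 10, the largest lineage alone holds 40% of the sequences, so D20 is 1, while two lineages are needed to reach 50%, so D50 is 2. A low D20 or D50 therefore indicates a repertoire dominated by a few expanded lineages.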

Supplementary Note 1 Clonal assignments were made based on matching V and J regions, identical HCDR3 length, and 85% sequence similarity throughout the HCDR3 sequence. This identity threshold was chosen after a thorough analysis of samples, conducted to determine an optimal metric for ASC clonal identification that would identify clonally related sequences with different degrees of mutation while excluding similar sequences of distinct clonal origin. To that end, CD138- ASCs from an influenza-vaccinated subject, 7 days post-vaccination, were sorted into 3 separate tubes of cells each. The sequences obtained from these 3 tubes were then compared to similar numbers of CD138+ ASCs sorted from the same subject and CD138- ASCs sorted from a separate subject. Using the split samples from the same subject as positive markers and the sample from the separate subject as a negative marker, we analyzed sequences both through our automated analysis platform and manually to determine which percentage of HCDR3 identity proved most efficient. A separate metric was also tested which required, in addition to the previous criteria, identical junctional segments, defined as the last 3 nucleotides in the V region, the first 3 nucleotides in the HCDR3, the last 3 nucleotides in the HCDR3 and the first 3 nucleotides in the J region. Results are presented in Supplementary Figs. 2, 3. Using greater than or equal to 85% HCDR3 similarity to identify clones resulted in a large overlap between identical samples (for same-subject split samples: 47.0% lineage connectivity between Tubes 1 and 2, 46.1% between 1 and 3, and 46.9% between 2 and 3) and minimized the connectivity between different subjects (average 0.3% connectivity).
In these examples, any difference between separate tubes from the same subject is likely due to limited sampling depth, as 100% connectivity was found between the separate tubes of the same individual when only the top 50% of clones were considered. Any role for PCR or sequencing error was examined by testing separate sequencing runs of the same samples. We found that the vast majority of errors were concentrated in the smallest clones, without significant difference in the overall clonality or the identity of the largest clones. In all, we considered that, given the predictable accumulation of somatic hypermutation that occurs as naïve and memory cells differentiate into ASCs, the use of complete or near-complete sequence identity inevitably diminishes the degree of actual clonal connectivity between any two given populations. The 85% threshold finally chosen takes this factor into account by allowing the HCDR3 within a single clone to accumulate a level of somatic hypermutation commensurate with the average mutation rate found in the CDR1 and CDR2 of all ASC samples. In contrast, identity thresholds higher than 85% frequently split clonally related sequences, verified as such by sequence alignments demonstrating shared and unique mutations throughout the entire V(D)J rearrangement, into multiple smaller clones. This limitation was clearly illustrated in the analysis of antigen-driven oligoclonal responses triggered by vaccination (Supplementary Fig. 2). The same problem was identified when a requirement for identical junctional segments was imposed.
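The clonal identification criteria described in this note (matching V and J assignments, identical HCDR3 length, and at least 85% HCDR3 identity) can be sketched as below. This is an illustration under stated assumptions, not the authors' custom software: the names `hcdr3_identity` and `assign_clones` are hypothetical, identity is assumed to be a position-by-position comparison of equal-length HCDR3 strings, and sequences are assumed to be merged by single linkage (any pair meeting the threshold joins two groups), since the note does not specify the clustering procedure.

```python
from collections import defaultdict


def hcdr3_identity(a, b):
    """Fraction of identical positions between two equal-length HCDR3 strings."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def assign_clones(records, threshold=0.85):
    """Assign a clone label to each record.

    records -- list of (v_gene, j_gene, hcdr3) tuples
    Returns a list of integer labels; records sharing a label form one clone.
    Assumption: single-linkage merging via union-find.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Only sequences with the same V gene, J gene and HCDR3 length are compared.
    buckets = defaultdict(list)
    for idx, (v_gene, j_gene, hcdr3) in enumerate(records):
        buckets[(v_gene, j_gene, len(hcdr3))].append(idx)

    for members in buckets.values():
        for pos, i in enumerate(members):
            for k in members[pos + 1:]:
                if hcdr3_identity(records[i][2], records[k][2]) >= threshold:
                    root_i, root_k = find(i), find(k)
                    if root_i != root_k:
                        parent[root_k] = root_i

    return [find(i) for i in range(len(records))]
```

Single linkage allows two sequences that differ by more than 15% to fall in the same clone when an intermediate sequence links them, which fits the step-wise accumulation of mutations described above. Raising `threshold` in this sketch reproduces the splitting behaviour noted for stricter identity requirements.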