SUPPLEMENTARY INFORMATION


Contents

De novo assembly
  Assembly statistics for all 150 individuals
  HHV6b integration
  Comparison of assemblers
Variant calling and genotyping
Protein truncating variants (PTV)
De novo variation
Resolving the extended MHC region and the Y chromosome
Genome-graph representations
Cohort selection
Raw data

De novo assembly

Assembly statistics for all 150 individuals

See Excel file: stable.1.150allpathlg_otherhumangenomes_assemblystatistics.xlsx

HHV6b integration

We identified two individuals with a high number of reads mapping to the HHV6b genome (Supplementary figure 1). A BLAST search of the HHV6b genome against these assemblies revealed that a single scaffold from each individual aligned with the HHV6b reference genome and contained a complete assembly of the viral genome (Supplementary figure 2).

Supplementary figure 1: Number of reads mapped to viral genomes.

Supplementary figure 2: Dot plot of the two scaffolds versus the HHV6b reference genome.

Comparison of assemblers

See Excel file: stable.2.assembly.comparison.xlsx

Variant calling and genotyping

Supplementary figure 3: BayesTyper validation rates. (a) Validation rates for insertions (red) and deletions (blue) as a function of allele length. (b) The number of variants used to calculate the rates shown in (a). (c) Validation rates for insertions that both have a low allele frequency (allele count <= 15) and are repetitive (triangles), and for the complementary set of variants (circles). (d) The number of variants used to calculate the rates shown in (c).

Supplementary figure 4: Variant allele redundancy. Variant alleles were labeled as redundant if another variant allele with similarity above 90% was found within a window extending one variant length upstream of the variant start and one variant length downstream of the variant end. (a) Percentage of non-redundant variant alleles for each variant allele class: SNV, Ins (<10), Del (<10), Ins (>=10, <100), Del (>=10, <100), Ins (>=100), Del (>=100), Inversion, Complex. (b) Percentage of non-redundant alleles (% of total) as a function of variant allele length (alternative allele length minus reference allele length, 10 nt bins) for insertions and deletions.
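The redundancy rule described in the caption of Supplementary figure 4 can be illustrated with a minimal Python sketch. Several details are assumptions and are not taken from the source: the `VariantAllele` record and the function names are hypothetical, "variant length" is taken to be the length of the alternative allele sequence, "found within a window" is read as overlapping the window, and similarity is approximated with `difflib.SequenceMatcher` because the source does not state which similarity measure was used.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class VariantAllele:
    """Hypothetical record for one variant allele (not from the source)."""
    chrom: str
    start: int   # 0-based start on the reference
    end: int     # exclusive end on the reference (equals start for insertions)
    allele: str  # alternative allele sequence


def similarity(a: str, b: str) -> float:
    """Assumed similarity measure: SequenceMatcher ratio in [0, 1]."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def is_redundant(query, others, threshold=0.9):
    """Return True if another allele overlapping the window around `query`
    has similarity above `threshold`."""
    length = max(len(query.allele), 1)  # assumed definition of variant length
    win_start = query.start - length    # one variant length upstream of start
    win_end = query.end + length        # one variant length downstream of end
    for other in others:
        if other is query or other.chrom != query.chrom:
            continue
        overlaps = other.start < win_end and other.end > win_start
        if overlaps and similarity(query.allele, other.allele) > threshold:
            return True
    return False
```

With this sketch, two identical insertions a few bases apart would flag each other as redundant, while the same insertion several kilobases away would not.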

Protein truncating variants (PTV)

Supplementary figure 5: Protein truncating variants (PTV). (a) The number of heterozygous and homozygous PTVs found in each of the 50 children. Each variant is colour-coded by type of mutation: SNPs (purple) account for 41% of LOF variants and indels (blue, red) account for 59%. On average, each individual carries 68 homozygous LOF variants. (b) The same variants as in (a), subdivided into categories based on type of PTV.

Supplementary figure 6: PTV length. Indels account for a substantial proportion of the PTVs. The majority of the indels are small, and we may miss some long insertions (cause unknown).

De novo variation

Supplementary figure 7: Size distribution of de novo indels. The size distribution of the identified de novo indels as estimated by comparison with GRCh

Resolving the extended MHC region and the Y chromosome

Supplementary figure 8: Number of scaffolds aligning to the MHC. Frequency distribution of the number of scaffolds aligning to the MHC region (pgf) or any of the alternative reference haplotypes (apd, cox, dbb, mann, mcf, qbl, ssto) in alignment blocks of at least 50 kb. (Panel title: Scaffolds aligning to the MHC region >50kb. Axes: Count, Number of scaffolds. Plot annotation: Median: 4.)

Validation of HLA variants

See Excel file: stable.3.hla_validation.xlsx

Length of novel sequence in MHC haplotypes

Columns: Total novel MHC sequences (bp); Potential non-MHC sequences (bp); Potential non-MHC sequences with 100% query coverage, >=98% identity (bp)

Supplemental Table 4: The amount of novel sequence found in the HLA haplotypes, together with the portion of it that shows strong homology to other parts of the genome.

Validation of chromosome Y variants

See Excel file: stable.5.chromosome_y_validation.xlsx

Genome-graph representations

Supplementary figure 9: Comparison of the original and publicly available Danish pan-genome. (a) Correlation of minor allele frequency for the first 7,500 variants on chromosome 9 between the 150 real individuals and 150 individuals sampled from the genome-graph. (b) Comparison of the R-squared measure of linkage disequilibrium between pairs of variants in the real individuals versus the sampled individuals, for all pairs with an R-squared value above 0.2 within a sliding window of eight variants. Each variant pair is represented as a dot that is colored and size-scaled by its proximity stratum (number of intervening variants), ranging from proximal (red/small) to distal (green/large). For each proximity stratum, a linear model is fitted and shown as a line of the corresponding color, demonstrating a slight deterioration of the correlation as a function of variant pair proximity. The dotted line shows the linear model fitted for all pairs regardless of distance.
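The two quantities compared in Supplementary figure 9 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the procedure used in the study: genotypes are assumed to be coded as alternative-allele counts (0, 1 or 2) per diploid individual, the function names are hypothetical, and linkage disequilibrium is approximated by the squared Pearson correlation of genotype counts, since the source does not state how R-squared was computed.

```python
from math import sqrt


def minor_allele_frequency(genotypes):
    """Minor allele frequency for one variant.

    `genotypes` is a list of alternative-allele counts (0, 1 or 2),
    one per diploid individual (assumed coding).
    """
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def ld_r_squared(genotypes_a, genotypes_b):
    """Assumed LD measure: squared correlation of genotype counts."""
    r = pearson_r(genotypes_a, genotypes_b)
    return r * r
```

For panel (a), one would compute `minor_allele_frequency` per variant in the real cohort and in the sampled cohort and pass the two lists to `pearson_r`. For panel (b), `ld_r_squared` would be computed for the same variant pairs in both cohorts and compared.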

Ambiguous nucleotides

Supplementary figure 10: Ambiguous nucleotide distribution. (a) Fraction of candidate and called insertions with ambiguous nucleotides in the alternative allele. (b) Fraction of candidate and called complex alleles with ambiguous nucleotides in the alternative allele. Only insertions and complex alleles had ambiguous nucleotides. (Axes: percent ambiguous nucleotides in 5% bins from (0,5] to (95,100]; called insertion alleles (% of total) in (a) and called complex alleles (% of total) in (b).)

Cohort selection

Supplementary figure 10: PCA plot of the first two principal components of the analysis conducted using SNPRelate. Black circles refer to Danish, red circles to German, blue circles to Norwegian and green circles to Swedish reference samples, while filled dots are the 120 parents of the 60 trios of this study.

Supplementary figure 11: Principal component analysis of the 100 unrelated parents of the 50 trios. The first four PCs are plotted against each other in the panels below the diagonal, and the distribution of samples on each PC is shown on the diagonal. The PCA was conducted using SNPRelate and 76,991 bi-allelic SNVs retained after LD pruning (using a threshold of r^2 = ) and filtering the total BayesTyper call set on minor allele frequency (0.05) and missing data. Inner and outer ellipses show the 95% and 99% confidence limits of the t-distribution of pairwise PCs, estimated using the R package ellipse. The PCA identified two outliers among the 100 unrelated parents, possibly related to sequence quality.

Raw data

Supplementary figure 12: Sequencing depth. The sequencing depth for each type of insert size library over the 150 individuals after cleaning. (Axes: average depth by insert-size library; recoverable library labels: 500bp, 800bp, 2000bp, 5000bp.)

Sequence coverage of libraries per individual

See Excel file: stable.6.table.data.production.xlsx

Supplementary figure 13: Allele balance for de novo SNVs (left) and indels (right). We used a cutoff of 0.3 to discriminate between somatic and germline mutations.
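The allele balance cutoff mentioned for Supplementary figure 13 can be illustrated with a short Python sketch. The following points are assumptions rather than statements from the source: allele balance is taken to be the fraction of reads supporting the alternative allele, values below the cutoff are treated as somatic (mosaic) and values at or above it as germline, and the function name is hypothetical.

```python
def classify_by_allele_balance(alt_reads, total_reads, cutoff=0.3):
    """Classify a de novo call as 'somatic' or 'germline'.

    Allele balance is assumed to be alt_reads / total_reads.
    Calls below `cutoff` are assumed to be somatic; the handling of a
    value exactly equal to the cutoff is also an assumption.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    balance = alt_reads / total_reads
    return "somatic" if balance < cutoff else "germline"
```

For example, a call with 5 supporting reads out of 40 has an allele balance of 0.125 and would be classed as somatic under these assumptions, while 20 out of 40 would be classed as germline.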

Mendelian inheritance of PTVs

Cross (Child = Father X Mother)    Number of observations across all trios
0/0 = 0/0 X 0/
/0 = 0/1 X 0/
/0 = 0/1 X 0/
/1 = 0/0 X 0/
/1 = 0/0 X 1/
/1 = 0/1 X 0/
/1 = 0/1 X 0/
/1 = 0/1 X 1/
/1 = 1/1 X 0/
/1 = 1/1 X 0/
/1 = 0/1 X 0/
/1 = 0/1 X 1/
/1 = 1/1 X 0/
/1 = 1/1 X 1/
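A tally of crosses like the one in the table above, together with a Mendelian consistency check, could be sketched in Python as shown here. This is an illustration only: the function names are hypothetical, genotypes are assumed to be diploid strings such as "0/1", and the rule applied is the standard one that a child must carry one allele from each parent. It does not reproduce the counts from the study.

```python
from collections import Counter


def parse_genotype(genotype):
    """Split a diploid genotype string such as '0/1' or '0|1' into alleles."""
    return tuple(genotype.replace("|", "/").split("/"))


def is_mendelian_consistent(child, father, mother):
    """True if the child genotype can be formed from one paternal allele
    and one maternal allele."""
    c1, c2 = parse_genotype(child)
    paternal = parse_genotype(father)
    maternal = parse_genotype(mother)
    return (c1 in paternal and c2 in maternal) or (
        c2 in paternal and c1 in maternal
    )


def tally_crosses(trios):
    """Count observations per cross, keyed as 'Child = Father X Mother'.

    `trios` is an iterable of (child, father, mother) genotype strings.
    """
    return Counter(f"{c} = {f} X {m}" for c, f, m in trios)
```

Passing every (child, father, mother) genotype triple for a variant set to `tally_crosses` gives a table in the same layout, and `is_mendelian_consistent` flags crosses such as `1/1 = 0/0 X 0/1` that cannot arise by inheritance alone.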