Normal-Tumor Comparison using Next-Generation Sequencing Data


Chun Li, Vanderbilt University
Taichung, March 16, 2011

Next-Generation Sequencing
First-generation (Sanger sequencing): 115 kb per day per instrument in 2001 (ABI 3730).
NGS (massively parallel sequencing):
  ~5 Gb per day per instrument in 2009 (Illumina GAIIx)
  ~30 Gb in 2010 (Illumina HiSeq2000)
  $2000 per exome, $5000 per genome (March 2011)
Third-generation: being developed.
Human genome: ~3 Gb. Human exome: ~1% of the genome.

Raw Data
Paired-end short reads. For each subject:
  Exome sequencing (Illumina GAIIx): 69,120 B/W pictures (~2 TB) for 72 bp paired-end reads; ~5 GB in gzipped FASTQ format.
  Whole genome at the same depth: ~150 GB per subject.
FASTQ format:
  @HWUSI-EAS1758_0003:1:1:1064:20619#0/2
  ACTATCCAAAGAATTTGAAATACCACTATTGTCAGAAACTTTACCTGTAAAATGCAGGTAGGCTTCTGGTCA
  +HWUSI-EAS1758_0003:1:1:1064:20619#0/2
  :DDDBDDD-BB6-@>@CA5CA-A>5;@?>@A?DDDD-DBD;-@@?>A?A-?=@@C-AC5A=B.B.CCC:=D?
  @HWUSI-EAS1758_0003:1:1:1066:3504#0/2
  TGTAAAAAATATTTGCAAATTCCAATATTTTTCCTCTGGGCGAAACTTTAAATCCTTTTGGCGCTCCTTTAA
  +HWUSI-EAS1758_0003:1:1:1066:3504#0/2
  ?>AA>57+(*11<766@@,?@=@>.BC?A-A=8--1==-=+??=>-B.@6>??B->C?##############
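The four-line FASTQ layout shown above (header, bases, `+` line, quality string) can be read with a few lines of Python. This is only a minimal sketch: the names `FastqRecord` and `parse_fastq` are hypothetical, and the two toy reads are made up for illustration rather than taken from the data.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str       # header text without the leading '@'
    sequence: str   # called bases
    quality: str    # one quality character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ block, streaming over the input."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        rest = [next(it, None) for _ in range(3)]
        if None in rest:
            raise ValueError("truncated FASTQ record")
        sequence, plus, quality = (s.rstrip("\n") for s in rest)
        header = header.rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield FastqRecord(header[1:], sequence, quality)


# Toy input (hypothetical reads, not from the study).
toy_fastq = """@read1/2
ACGTAC
+read1/2
DDDB#A
@read2/2
TTGCA
+read2/2
?>AA#
"""

records = list(parse_fastq(toy_fastq.splitlines()))
for record in records:
    print(record.name, len(record.sequence))
```

For the gzipped files mentioned on the slide, the same function could be fed from `gzip.open(path, "rt")`, since it only needs an iterable of text lines.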

Depth is Needed for Variant Calling
Depth of coverage:
  Target: median 30x (85% ≥ 8x)
  1000 Genomes Project Pilot 1: 4x
Example: depth = 10, genotype = A/C
  2.1% chance of observing {10A, 9A/1C, 1A/9C, 10C}
  8.8% chance of observing {8A/2C, 2A/8C}
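The two percentages in the depth-10 example follow from a Bin(10, 0.5) model for reads at a heterozygous site. The sketch below reproduces them; the function name `split_probability` is hypothetical, and the model assumes each read samples either allele with probability 0.5 and no sequencing error.

```python
from math import comb


def split_probability(depth: int, minor_count: int) -> float:
    """Probability that the rarer allele is seen exactly `minor_count` times
    at a heterozygous site, with reads drawn as Bin(depth, 0.5)."""
    if not 0 <= minor_count <= depth // 2:
        raise ValueError("minor_count must be between 0 and depth // 2")
    ways = comb(depth, minor_count)
    if 2 * minor_count != depth:
        ways *= 2  # either allele can be the rarer one
    return ways / 2 ** depth


# {10A, 9A/1C, 1A/9C, 10C}: rarer allele seen 0 or 1 times, about 2.1%
print(round(100 * (split_probability(10, 0) + split_probability(10, 1)), 1))
# {8A/2C, 2A/8C}: rarer allele seen exactly twice, about 8.8%
print(round(100 * split_probability(10, 2), 1))
```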

Base Quality
Not all bases are read equally well. For each base, assess the probability \(p\) that it is miscalled.
Phred score = \(-10 \log_{10}(p)\)
  Phred = 10: p = 0.1
  Phred = 20: p = 0.01
  Phred = 30: p = 0.001
  Phred = 40: p = 0.0001
The character with ASCII code Int(Phred) + 33 is used in FASTQ. Examples:
  #: Phred < 2.5, p > 0.56
  A: 31.5 ≤ Phred < 32.5, \(5.6 \times 10^{-4} < p \le 7.1 \times 10^{-4}\)
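The Phred conversions above are easy to check numerically. The following is a small sketch with hypothetical helper names; it assumes the offset of 33 stated on the slide and decodes each character to its integer score.

```python
import math


def phred_to_error_prob(phred: float) -> float:
    """Miscall probability p for a Phred score: p = 10^(-Phred/10)."""
    return 10 ** (-phred / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for a miscall probability p: -10 * log10(p)."""
    return -10 * math.log10(p)


def decode_quality(quality: str, offset: int = 33) -> list:
    """Integer Phred scores for a FASTQ quality string (ASCII code minus offset)."""
    return [ord(char) - offset for char in quality]


for phred in (10, 20, 30, 40):
    print(phred, phred_to_error_prob(phred))

# '#' has ASCII code 35 and 'A' has ASCII code 65.
print(decode_quality("#A"))  # [2, 32]
```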

Data Processing
Initial alignment: using a reference genome (e.g., the NCBI reference). A mapping quality score is given for each mapped read. SAM and BAM formats.
Mark duplicates: duplicate reads may be PCR artifacts.
Local realignment: incorrectly aligned indels lead to wrong SNP calls.
Recalibration: the machine's assessment of quality can be too optimistic.

Calling Variants
SNPs (VCF format)
Indels: insertions and deletions
CNVs
Structural variations
Annotation
Mardis ER (2010) The $1,000 genome, the $100,000 analysis? Genome Medicine 2:84

Tumor
Somatic changes in tumor DNA:
  Site mutations: driver mutations, passenger mutations
  Loss of heterozygosity (LOH)
  Other structural changes (inversions, translocations, fusions, etc.)
Tumor DNA sample:
  A mixture of normal, stromal, and tumor cells. Laser capture microdissection may not always be applicable.
  Tumor cells are heterogeneous.
  Low mutation rate. LOH is not clear.

Data
Normal-tumor pairs, sequenced separately. They should be sequenced side by side to avoid batch effects.
We have exome data for 8 blood-tumor pairs from breast cancer subjects (from the Shanghai Breast Cancer Study).
We only consider bases:
  whose base quality score is ≥ 20;
  whose read has mapping quality score ≥ 20.

Detection of Tumor Mutations
For each subject:
  Consider sites with a single allele in blood but multiple alleles in tumor.
  Calculate the tumor mutation rate.
Select:
  sites with a high mutation rate across subjects;
  genes with many mutation sites across subjects.
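The per-subject site selection can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function name `tumor_mutation_counts` is hypothetical, allele counts are assumed to be already quality-filtered, and any allele with a nonzero count is treated as observed.

```python
from collections import Counter
from typing import Optional, Tuple


def tumor_mutation_counts(blood: Counter, tumor: Counter) -> Optional[Tuple[int, int]]:
    """Return (x, n) for a candidate site, or None if the site does not qualify.

    A site qualifies when blood shows a single allele and tumor shows several.
    n is the tumor depth; x counts tumor reads that differ from the blood allele.
    """
    blood_alleles = [allele for allele, count in blood.items() if count > 0]
    tumor_alleles = [allele for allele, count in tumor.items() if count > 0]
    if len(blood_alleles) != 1 or len(tumor_alleles) < 2:
        return None
    germline = blood_alleles[0]
    depth = sum(tumor.values())
    mutated = depth - tumor[germline]  # Counter returns 0 for a missing allele
    return mutated, depth


# Hypothetical counts for one site: blood is all G, tumor has some T reads.
print(tumor_mutation_counts(Counter({"G": 30}), Counter({"G": 24, "T": 6})))  # (6, 30)
```

The returned pair feeds directly into a rate calculation as the number of mutated alleles and the depth at the site.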

Calculation of Mutation Rate
Notation and goal:
  \(n_i\): depth at site \(i\)
  \(x_i\): number of mutated alleles at site \(i\)
  Goal: estimate the underlying mutation rate \(\theta_i\)
Raw rate: \(x_i / n_i\)
But 3/10 and 30/100 are the same raw rate with different accuracy!
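One way to see the difference in accuracy is to attach a binomial standard error to each raw rate. This is a sketch using the usual plug-in formula \(\sqrt{\hat\theta(1-\hat\theta)/n}\), which is an assumption added here for illustration; the name `raw_rate_with_se` is hypothetical.

```python
import math


def raw_rate_with_se(mutated: int, depth: int):
    """Raw rate x/n with its plug-in binomial standard error."""
    rate = mutated / depth
    standard_error = math.sqrt(rate * (1 - rate) / depth)
    return rate, standard_error


for mutated, depth in ((3, 10), (30, 100)):
    rate, se = raw_rate_with_se(mutated, depth)
    print(f"{mutated}/{depth}: rate = {rate:.2f}, standard error = {se:.3f}")
```

Both sites have rate 0.30, but the standard error at depth 10 is larger by a factor of \(\sqrt{10}\), about 3.2.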

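Empirical Bayes Method
Random effect model: for each tumor sample, assume \(x_i \sim \mathrm{Bin}(n_i, \theta_i)\) and \(\theta_i \sim \mathrm{Beta}(\alpha, \beta)\).
Estimate \(\alpha\) and \(\beta\) as the MLE of the marginal likelihood \(f(x_i \mid \alpha, \beta, n_i)\), where
\[ f(x \mid \alpha, \beta, n) = \int_0^1 \binom{n}{x} \theta^{x} (1-\theta)^{n-x} \, \frac{1}{B(\alpha, \beta)} \, \theta^{\alpha-1} (1-\theta)^{\beta-1} \, d\theta \]
It is called empirical Bayes when \(\mathrm{Beta}(\alpha, \beta)\) is viewed as a prior and the Bayesian machinery is used to derive the posterior \(\mathrm{Beta}(x_i + \alpha, \; n_i - x_i + \beta)\).
EB-adjusted mutation rate: \((x_i + \alpha) / (n_i + \alpha + \beta)\)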
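The integral in the marginal likelihood has the closed form \(\binom{n}{x} B(x+\alpha, n-x+\beta) / B(\alpha, \beta)\), the beta-binomial distribution, so the MLE can be found numerically. The sketch below uses a crude coordinate search on \((\ln\alpha, \ln\beta)\); that optimizer is a stand-in chosen to keep the example dependency-free, not the method used in the talk, and the site counts are hypothetical.

```python
import math


def log_beta(a: float, b: float) -> float:
    """log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal(x: int, n: int, alpha: float, beta: float) -> float:
    """log f(x | alpha, beta, n) for the beta-binomial marginal."""
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return log_choose + log_beta(x + alpha, n - x + beta) - log_beta(alpha, beta)


def total_log_likelihood(sites, alpha: float, beta: float) -> float:
    """Sum of log marginals over (x, n) pairs, one pair per site."""
    return sum(log_marginal(x, n, alpha, beta) for x, n in sites)


def fit_alpha_beta(sites, start=(1.0, 1.0), tol=1e-6, max_iter=10_000):
    """Crude coordinate search for the MLE of (alpha, beta)."""
    log_a, log_b = math.log(start[0]), math.log(start[1])
    best = total_log_likelihood(sites, *start)
    step = 1.0
    for _ in range(max_iter):
        if step < tol:
            break
        moved = False
        for d_a, d_b in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            if abs(log_a + d_a) > 20 or abs(log_b + d_b) > 20:
                continue  # keep the search in a numerically safe box
            value = total_log_likelihood(
                sites, math.exp(log_a + d_a), math.exp(log_b + d_b))
            if value > best:
                log_a, log_b, best = log_a + d_a, log_b + d_b, value
                moved = True
                break
        if not moved:
            step /= 2.0  # no neighbor improves: refine the step size
    return math.exp(log_a), math.exp(log_b)


# Hypothetical (x, n) pairs for one tumor sample.
example_sites = [(0, 20), (1, 25), (0, 30), (3, 10), (30, 100), (2, 40), (0, 15), (5, 35)]
alpha_hat, beta_hat = fit_alpha_beta(example_sites)
print(alpha_hat, beta_hat)
```

In practice a general-purpose optimizer would likely be preferable; the point here is only that the likelihood needs nothing beyond log-gamma functions.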

Example: Top Mutation Sites
For one sample, α = 1.62, β = 11.34:
  3/10 → 4.62/22.96 = 0.201
  30/100 → 31.62/112.96 = 0.280
Result for top mutation sites:

ARate1  ARate2  RawRate  Samples  Perc    Chr  Position   Ref  Allele
0.1537  0.1815  0.2154   4        20.38%  5    112141747  G    G/T
0.1500  0.1683  0.1918   4        16.96%  17   42568932   A    A/C
0.1207  0.1279  0.1289   7        32.65%  17   42589653   T    G/T
0.1202  0.1268  0.1274   4        34.68%  18   100332     T    A/C/G/T
0.1155  0.1272  0.1305   4        29.90%  18   98601      T    A/C/G/T
0.1130  0.1210  0.1238   4        29.41%  18   16767047   C    C/G/T
0.1124  0.1219  0.1258   5        27.79%  2    132744647  G    A/G
0.1088  0.1147  0.1153   6        35.84%  17   42589657   T    G/T
0.1050  0.1155  0.1169   4        33.46%  9    67923477   A    A/C
0.1036  0.1086  0.1077   4        34.42%  2    132744634  A    A/G
0.1028  0.1117  0.1137   4        31.77%  18   16767073   A    A/T
0.1012  0.1057  0.1052   5        36.94%  18   16773012   A    A/G
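The two adjusted rates quoted for this sample can be reproduced directly from the posterior-mean formula \((x + \alpha)/(n + \alpha + \beta)\). The function name below is hypothetical; the values of α and β are those given on the slide.

```python
def eb_adjusted_rate(mutated: int, depth: int, alpha: float, beta: float) -> float:
    """Posterior mean of the mutation rate under a Beta(alpha, beta) prior."""
    return (mutated + alpha) / (depth + alpha + beta)


ALPHA, BETA = 1.62, 11.34  # estimates quoted for one sample

print(f"{eb_adjusted_rate(3, 10, ALPHA, BETA):.3f}")    # 0.201
print(f"{eb_adjusted_rate(30, 100, ALPHA, BETA):.3f}")  # 0.280
print(f"{ALPHA / (ALPHA + BETA):.3f}")                  # prior mean, 0.125
```

Both raw rates are 0.30, and both are pulled toward the prior mean of 0.125; the site with depth 10 is pulled much further because it carries less information.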

Example: Top Mutation Genes
Result for top mutation genes:

SNPs  Samples  SNPs/bp  ARate1  ARate2  RawRate  Perc    genename  geneid  Strand  chr  from       to
42    1.67     6.15e-4  0.1006  0.1105  0.1192   41.20%  CDC27     996     -       17   42553299   42621536
33    1.27     1.49e-3  0.1277  0.1442  0.1687   31.39%  FRG1      2483    +       4    191099158  191121277
27    1.04     7.12e-4  0.0290  0.0272  0.0228   84.01%  MUC17     140453  +       7    100450136  100488044
22    1.00     2.66e-3  0.0227  0.0212  0.0170   89.89%  FLG2      388698  -       1    150589709  150597983
20    1.00     1.57e-3  0.0242  0.0228  0.0194   84.67%  FLG       2312    -       1    150541799  150554555
19    1.11     1.87e-3  0.0501  0.0537  0.0556   71.01%  HRNR      388697  -       1    150452175  150462352
18    1.33     6.02e-5  0.1189  0.1324  0.1459   34.12%  MLL3      58508   -       7    151464849  151763803
15    1.20     3.99e-3  0.0457  0.0446  0.0404   68.72%  MAGEC1    9947    +       X    140820524  140824284
15    1.13     1.03e-5  0.0919  0.1009  0.1144   43.76%  MTMR11    10903   -       1    146715794  148174680
15    1.00     7.75e-4  0.0318  0.0295  0.0241   84.88%  AHNAK     79026   -       11   62040791   62060145
13    1.00     8.94e-6  0.0822  0.0905  0.0939   48.64%  GPATCH4   54865   -       1    153380688  154835461
10    1.00     5.73e-6  0.0786  0.0916  0.1058   63.75%  CRB1      23418   +       1    193969199  195713631
10    1.00     5.27e-6  0.0575  0.0577  0.0518   60.26%  LRP1B     53353   -       2    140707224  142604767

Detection of LOH
For each subject:
  Consider sites that are heterozygous in the normal sample.
  Check whether the tumor shows a significant departure from 50:50.
Select:
  sites with a high departure from 50:50 across subjects;
  genes with many such sites across subjects.

LOH Phred Score
When there is no LOH, the observed allele counts follow Bin(n, 0.5).
Calculate the departure from 50:50 using a two-tailed p-value, convert it to the Phred scale, \(-10 \log_{10}(p)\), and cap it at 99.
Sum the Phred scores across subjects.
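A per-site LOH Phred score can be sketched as below. The slide does not say how the two-tailed p-value is formed; this sketch assumes twice the smaller tail of Bin(n, 0.5), truncated at 1, and the function names are hypothetical.

```python
from math import comb, log10

PHRED_CAP = 99.0


def two_tailed_p(count_a: int, count_b: int) -> float:
    """Two-tailed p-value for departure from 50:50 under Bin(n, 0.5).

    Assumption: p = min(1, 2 * P(X <= min(count_a, count_b))).
    """
    depth = count_a + count_b
    smaller = min(count_a, count_b)
    lower_tail = sum(comb(depth, j) for j in range(smaller + 1)) / 2 ** depth
    return min(1.0, 2 * lower_tail)


def loh_phred(count_a: int, count_b: int) -> float:
    """Phred-scaled p-value, -10 * log10(p), capped at 99."""
    p = two_tailed_p(count_a, count_b)
    if p == 0.0:
        return PHRED_CAP  # p underflowed at very high depth
    return min(PHRED_CAP, max(0.0, -10 * log10(p)))


print(loh_phred(5, 5))    # balanced counts: no evidence of LOH
print(loh_phred(10, 0))   # all reads from one allele at depth 10
print(loh_phred(100, 0))  # extreme imbalance at depth 100 hits the cap
```

Summing `loh_phred` over subjects at the same site then gives the cross-subject score described on the slide.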

Adding Phred across Subjects
When \(p \sim \mathrm{Unif}[0,1]\), \(-2 \ln p \sim \chi^2_2\).
When multiple independent tests of the same hypothesis are performed, their information can be aggregated:
\[ -2 \ln p_1 - \cdots - 2 \ln p_k \sim \chi^2_{2k} \quad \text{under the null} \]
Phred score: \( s_i = -10 \log_{10} p_i = \frac{5}{\ln 10} \left[ -2 \ln p_i \right] \)
The sum of Phred scores is \( s_1 + \cdots + s_k \sim \frac{5}{\ln 10} \, \chi^2_{2k} \), which reflects the overall evidence for LOH.
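Because the summed Phred score is a rescaled \(\chi^2_{2k}\) variable under the null, it can be turned back into a combined p-value (Fisher's method). The sketch below uses the closed-form survival function of a chi-square with even degrees of freedom; the function names are hypothetical, and it assumes uncapped scores from continuous, independent p-values, so it is only approximate for capped or discrete ones.

```python
import math


def chi2_sf_even_df(x: float, k: int) -> float:
    """P(chi-square with 2k degrees of freedom > x), using
    exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!."""
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return math.exp(-half) * total


def combined_p_from_phred(phred_scores) -> float:
    """Combined p-value for k independent Phred scores."""
    k = len(phred_scores)
    chi2_stat = sum(phred_scores) * math.log(10) / 5.0  # undo the 5 / ln(10) scaling
    return chi2_sf_even_df(chi2_stat, k)


# A single score of 20 maps back to p = 0.01.
print(combined_p_from_phred([20.0]))
# Two subjects, each with p = 0.05 (Phred about 13).
print(combined_p_from_phred([-10 * math.log10(0.05)] * 2))
```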

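Expected Phred Score
\(0 \le \text{Phred} \le 99\) (cap), so the score is nonnegative even with no LOH.
Expected Phred: \( \text{ephred} = \sum_{s=0}^{99} \Pr(s \mid n) \, s \)
Adjusted Phred: \( \text{Phred} - \text{ephred} \)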
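The expected score under no LOH depends only on the depth n, so it can be computed by enumerating all possible allele counts. This is a sketch under assumptions: it uses a two-tailed p-value defined as twice the smaller Bin(n, 0.5) tail, leaves Phred scores unrounded (the slide's sum over s = 0..99 suggests integer scores), and all function names are hypothetical.

```python
from math import comb, log10

PHRED_CAP = 99.0


def loh_phred(count_a: int, count_b: int) -> float:
    """Capped Phred score of the two-tailed Bin(n, 0.5) p-value."""
    depth = count_a + count_b
    smaller = min(count_a, count_b)
    lower_tail = sum(comb(depth, j) for j in range(smaller + 1)) / 2 ** depth
    p = min(1.0, 2 * lower_tail)
    if p == 0.0:
        return PHRED_CAP
    return min(PHRED_CAP, max(0.0, -10 * log10(p)))


def expected_phred(depth: int) -> float:
    """Expected Phred score at a given depth when there is no LOH."""
    total = 0.0
    for count_a in range(depth + 1):
        probability = comb(depth, count_a) / 2 ** depth  # Bin(depth, 0.5) pmf
        total += probability * loh_phred(count_a, depth - count_a)
    return total


def adjusted_phred(count_a: int, count_b: int) -> float:
    """Observed Phred score minus its expectation under no LOH."""
    return loh_phred(count_a, count_b) - expected_phred(count_a + count_b)


for depth in (2, 10, 50):
    print(depth, round(expected_phred(depth), 2))
print(round(adjusted_phred(10, 0), 2))
```

Subtracting the expectation keeps sites from ranking highly merely because many subjects contribute a small positive score each.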

Example: Top LOH Sites
Results for top LOH sites:

samples  AsumPhr  sumphr  EsumPh  maxphr  DPBlood  GQBlood  Chr  Position   Ref  Allele
8        760.6    792     31.4    99      82.8     99.0     10   98551355   C    C/G
7        665.5    693     27.5    99      305.9    99.0     19   32423898   A    A/C/G/T
7        665.1    693     27.9    99      310.4    99.0     1    121186939  C    A/C/G/T
7        662.0    690     28.0    99      692.6    99.0     4    191113251  G    A/G/T
8        657.3    688     30.7    99      136.8    99.0     6    58886926   T    C/G/T
8        636.2    666     29.8    99      155.4    99.0     17   42569591   T    G/T
8        635.0    665     30.0    99      175.5    99.0     17   42569581   A    A/G
8        630.9    659     28.1    99      86.5     65.1     12   11177516   T    C/T
8        615.3    645     29.7    99      147.8    83.5     2    132735799  A    A/G
8        605.1    634     28.9    99      133.0    99.0     12   25848414   C    A/C/T
7        592.9    620     27.1    99      263.7    99.0     2    132736769  A    A/C
8        591.6    622     30.4    99      140.9    99.0     2    132736818  T    C/T
8        583.1    613     29.9    99      182.6    99.0     5    110313318  C    A/C/T

Example: Top LOH Genes
Results for top LOH genes:

SNPs  Samples  SNPs/bp   AsumPhr  SumPhr  DPBlood  GQBlood  gname     geneid  Strand  chr  from       to
150   2.14     9.76e-05  6.60     13.33   42.9     93.0     FCRLB     127943  +       1    158427569  159964075
109   2.54     2.87e-04  29.92    38.06   60.5     81.6     HYDIN     54768   -       16   69398983   69778298
99    2.13     5.04e-05  5.75     12.43   46.8     92.7     ALG2      85365   -       9    99059770   101023996
96    1.75     4.87e-05  4.48     9.75    43.1     94.2     PPP2R4    5524    +       9    128978316  130949563
93    1.99     5.33e-05  11.35    18.10   58.2     94.1     CRB1      23418   +       1    193969199  195713631
80    2.10     3.98e-05  10.75    17.54   47.5     92.4     MRRF      92399   +       9    122112724  124124716
79    2.00     1.56e-04  5.27     11.73   61.1     97.1     SYNE1     23345   -       6    152485263  152991158
75    1.68     3.82e-05  5.32     10.36   40.4     90.7     PTGES     9536    -       9    129590542  131555111
74    2.72     3.49e-04  20.02    28.85   78.8     94.9     C21orf29  54084   -       21   44744093   44955856
70    3.06     5.49e-03  81.95    93.57   314.5    99.0     FLG       2312    -       1    150541799  150554555
70    2.76     4.80e-05  45.31    54.26   42.7     73.8     MTMR11    10903   -       1    146715794  148174680