Normal-Tumor Comparison using Next-Generation Sequencing Data

1 Normal-Tumor Comparison using Next-Generation Sequencing Data Chun Li Vanderbilt University Taichung, March 16, 2011

2 Next-Generation Sequencing. First-generation (Sanger sequencing): 115 kb per day per instrument in 2001 (ABI 3730). NGS (massively parallel sequencing): ~5 Gb per day per instrument in 2009 (Illumina GAIIx); ~30 Gb in 2010 (Illumina HiSeq2000); $2000 per exome, $5000 per genome (March 2011). Third-generation: being developed. Human genome: ~3 Gb. Human exome: ~1% of genome.

3 Raw Data. Paired-end short reads. For each subject, exome sequencing (Illumina GAIIx): 69,120 B/W pictures (~2 TB); 72 bp paired-end reads; ~5 GB in gzipped FASTQ format. Whole genome at the same depth: ~150 GB per subject. FASTQ: ACTATCCAAAGAATTTGAAATACCACTATTGTCAGAAACTTTACCTGTAAAATGCAGGTAGGCTTCTGGTCA TGTAAAAAATATTTGCAAATTCCAATATTTTTCCTCTGGGCGAAACTTTAAATCCTTTTGGCGCTCCTTTAA

4 Depth is Needed for Variant Call. Depth of coverage: target median 30x (85% ≥ 8x); 1000 Genomes Project Pilot 1: 4x. Example: depth = 10, genotype = A/C. There is a 2.1% chance to have {10A, 9A/1C, 1A/9C, 10C} and an 8.8% chance to have {8A/2C, 2A/8C}.
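The two percentages on this slide can be checked with a short calculation. The sketch below assumes each read at a heterozygous A/C site shows A or C with probability 0.5, independently, so the count of A reads follows Bin(10, 0.5); the function name `binom_pmf` is illustrative only.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float = 0.5) -> float:
    """Probability of seeing k reads of one allele out of n reads."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

depth = 10

# At most one read from either allele: {10A, 9A/1C, 1A/9C, 10C}
p_extreme = sum(binom_pmf(k, depth) for k in (0, 1, 9, 10))

# Exactly two reads from either allele: {8A/2C, 2A/8C}
p_two = sum(binom_pmf(k, depth) for k in (2, 8))

print(f"{100 * p_extreme:.1f}%")  # 22/1024 of all outcomes
print(f"{100 * p_two:.1f}%")      # 90/1024 of all outcomes
```

At depth 10, roughly one heterozygous site in nine shows two or fewer reads from one of its alleles, which is why low depth makes heterozygous calls unreliable.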

5 Base Quality. Not all bases are read equally well. For each base, assess the probability p that it was miscalled. Phred score = \(-10\log_{10}(p)\): Phred = 10 ⇔ p = 0.1; Phred = 20 ⇔ p = 0.01; Phred = 30 ⇔ p = 0.001; Phred = 40 ⇔ p = 0.0001. The character with ASCII code Int(Phred) + 33 is used for FASTQ. Examples: # Phred 2.5 p 0.56; A 31.5 Phred p
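The conversion between error probability, Phred score, and FASTQ quality character can be sketched as follows. The function names are illustrative, and the encoding shown is the Phred+33 convention stated on the slide.

```python
import math

def phred_from_p(p: float) -> float:
    """Phred score for a miscall probability p: -10 * log10(p)."""
    return -10.0 * math.log10(p)

def p_from_phred(q: float) -> float:
    """Miscall probability for a Phred score q: 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)

def fastq_char(q: float) -> str:
    """Phred+33 encoding: the character with ASCII code int(q) + 33."""
    return chr(int(q) + 33)

def phred_from_char(c: str) -> int:
    """Decode a Phred+33 quality character back to an integer Phred score."""
    return ord(c) - 33

for q in (10, 20, 30, 40):
    print(q, p_from_phred(q), fastq_char(q))
```

Because the encoding truncates to an integer, decoding a character recovers the score only to the nearest whole Phred unit.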

7 Data Processing. Initial alignment: using a reference genome (e.g., the NCBI reference); mapping quality score for each mapped read; SAM and BAM formats. Mark duplicates: duplicate reads may be PCR artifacts. Local realignment: incorrectly aligned indels lead to wrong SNP calls. Recalibration: the machine's assessment of quality can be too optimistic.

8 Calling Variants. SNPs (VCF format). Indels: insertions and deletions. CNVs. Structural variations. Annotation. Mardis ER (2010) The $1,000 genome, the $100,000 analysis? Genome Medicine 2:84.

9 Tumor. Somatic changes in tumor DNA: site mutations (driver mutations, passenger mutations); loss of heterozygosity (LOH); other structural changes (inversions, translocations, fusions, etc.). Tumor DNA sample: a mixture of normal, stromal, and tumor cells; laser capture microdissection may not always be applicable. Tumor cells are heterogeneous. Low mutation rate; LOH not clear.

10 Data. Normal-tumor pairs, sequenced separately; they should be sequenced side-by-side to avoid batch effects. We have exome data for 8 blood-tumor pairs of breast cancer subjects (from the Shanghai Breast Cancer Study). We only consider bases whose base score is ≥ 20 and whose read has mapping quality score ≥ 20.

11 Detection of Tumor Mutations. For each subject: consider sites with a single allele in blood but multiple alleles in tumor; calculate the tumor mutation rate. Select sites with a high mutation rate across subjects, and genes with many mutation sites across subjects.
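The per-subject site selection might look like the sketch below. The data layout (a dict from site to per-allele read counts) and the function name `candidate_mutation_sites` are hypothetical, and counting every tumor read that differs from the single blood allele as "mutated" is an assumption, not something the slides specify.

```python
def candidate_mutation_sites(blood, tumor):
    """Return (site, x, n) for sites with one allele in blood but several in tumor.

    blood, tumor: dict mapping site -> dict mapping allele -> read count.
    x is the number of tumor reads not matching the blood allele (assumed
    definition of "mutated"); n is the tumor depth at that site.
    """
    selected = []
    for site, blood_counts in blood.items():
        tumor_counts = tumor.get(site)
        if tumor_counts is None:
            continue  # site not covered in the tumor sample
        blood_alleles = [a for a, c in blood_counts.items() if c > 0]
        tumor_alleles = [a for a, c in tumor_counts.items() if c > 0]
        if len(blood_alleles) == 1 and len(tumor_alleles) > 1:
            n = sum(tumor_counts.values())
            x = n - tumor_counts.get(blood_alleles[0], 0)
            selected.append((site, x, n))
    return selected

# Toy counts for illustration only (not data from the study).
blood = {("1", 100): {"A": 12}, ("1", 200): {"A": 5, "G": 6}, ("2", 50): {"C": 9}}
tumor = {("1", 100): {"A": 7, "T": 3}, ("1", 200): {"A": 5, "G": 5}, ("2", 50): {"C": 11}}
sites = candidate_mutation_sites(blood, tumor)
print(sites)
```

In the toy input, only the first site qualifies: the second is already heterozygous in blood, and the third shows a single allele in both samples.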

12 Calculation of Mutation Rate. Notation: \(n_i\) is the depth at site \(i\); \(x_i\) is the number of mutated alleles at site \(i\). Goal: estimate the underlying mutation rate \(\theta_i\). Raw rate: \(x_i/n_i\). But 3/10 and 30/100 give the same raw rate with different accuracy!

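13 Empirical Bayes Method. Random effect model: for each tumor sample, assume \(x_i \sim \mathrm{Bin}(n_i, \theta_i)\) and \(\theta_i \sim \mathrm{Beta}(\alpha, \beta)\). Estimate \(\alpha\) and \(\beta\) as the MLE of the marginal likelihood, \(\prod_i f(x_i \mid \alpha, \beta, n_i)\), where
\[ f(x \mid \alpha, \beta, n) = \int \binom{n}{x} \theta^{x} (1-\theta)^{n-x} \, \frac{1}{B(\alpha, \beta)} \theta^{\alpha-1} (1-\theta)^{\beta-1} \, d\theta . \]
It is called empirical Bayes when \(\mathrm{Beta}(\alpha, \beta)\) is viewed as a prior and the Bayesian machinery is used to derive the posterior \(\mathrm{Beta}(x_i + \alpha,\; n_i - x_i + \beta)\). EB-adjusted mutation rate: \((x_i + \alpha) / (n_i + \alpha + \beta)\).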
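A minimal sketch of the empirical Bayes calculation follows. It uses the closed form of the marginal (beta-binomial) likelihood, \(f(x \mid \alpha, \beta, n) = \binom{n}{x} B(x+\alpha, n-x+\beta) / B(\alpha, \beta)\), which results from evaluating the integral. The grid search in `fit_grid` is a crude stand-in for a proper optimizer and is not the method used in the talk; all function names and the toy data are assumptions.

```python
from math import lgamma, exp

def log_beta(a: float, b: float) -> float:
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(x: int, n: int, alpha: float, beta: float) -> float:
    """log f(x | alpha, beta, n) for the beta-binomial marginal."""
    log_choose = lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
    return log_choose + log_beta(x + alpha, n - x + beta) - log_beta(alpha, beta)

def marginal_pmf(x: int, n: int, alpha: float, beta: float) -> float:
    return exp(log_marginal(x, n, alpha, beta))

def total_log_lik(data, alpha: float, beta: float) -> float:
    """Sum of log marginal likelihoods over (x_i, n_i) pairs."""
    return sum(log_marginal(x, n, alpha, beta) for x, n in data)

def fit_grid(data, grid):
    """Crude MLE: pick the (alpha, beta) on a grid with the highest likelihood."""
    pairs = ((a, b) for a in grid for b in grid)
    return max(pairs, key=lambda ab: total_log_lik(data, ab[0], ab[1]))

def eb_rate(x: int, n: int, alpha: float, beta: float) -> float:
    """Posterior mean of theta: (x + alpha) / (n + alpha + beta)."""
    return (x + alpha) / (n + alpha + beta)

# Toy (x_i, n_i) pairs for illustration only.
data = [(3, 10), (30, 100), (1, 20), (0, 15), (2, 40)]
alpha_hat, beta_hat = fit_grid(data, [0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0])
print(alpha_hat, beta_hat)
print(eb_rate(3, 10, alpha_hat, beta_hat), eb_rate(30, 100, alpha_hat, beta_hat))
```

The two sites with raw rate 0.3 receive different adjusted rates: the estimate from depth 10 is pulled toward the prior mean more strongly than the estimate from depth 100.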

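14 Example: Top Mutation Sites. For one sample, α = 1.62 (β and the worked rate comparisons, one with denominator 22.96, are not recoverable from this transcription). Result for top mutation sites. Columns: ARate1, ARate2, RawRate, Samples, Perc (%), Chr, Position, Ref, Allele. Numeric columns are not recoverable; Ref and Allele per row: G, G/T; A, A/C; T, G/T; T, A/C/G/T; T, A/C/G/T; C, C/G/T; G, A/G; T, G/T; A, A/C; A, A/G; A, A/T; A, A/G.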

15 Example: Top Mutation Genes. Result for top mutation genes. Columns: SNPs, Samples, SNPs/bp, ARate1, ARate2, RawRate, Perc (%), genename, geneid, Strand, chr, from, to. Numeric columns are not recoverable; gene names as transcribed: CDC, FRG, MUC, FLG, FLG, HRNR, MLL, MAGEC (chr X), MTMR, AHNAK, GPATCH, CRB, LRP1B.

16 Detection of LOH. For each subject: consider sites heterozygous in the normal sample; check if the tumor has a significant departure from 50:50. Select sites with high departure from 50:50 across subjects, and genes with many such sites across subjects.

17 LOH Phred Score. When there is no LOH, the observed allele counts follow Bin(n, 0.5). Calculate the departure from 50:50 as a two-tailed p-value, convert it to the Phred scale, \(-10\log_{10}(p)\), and cap it at 99. Sum the Phred scores across subjects.
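One way to compute the per-site LOH score is sketched below. It assumes the two-tailed p-value is twice the smaller tail of Bin(n, 0.5), capped at 1, which is a common convention for a symmetric null; the slides do not say which two-tailed definition was used. Function names are illustrative.

```python
from math import comb, log10

def two_tailed_p(a: int, n: int) -> float:
    """Two-tailed p-value for a reads of one allele out of n under Bin(n, 0.5).

    By symmetry, this is twice the lower tail at min(a, n - a), capped at 1.
    """
    k = min(a, n - a)
    tail = sum(comb(n, j) for j in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)

def loh_phred(a: int, n: int, cap: float = 99.0) -> float:
    """Phred-scaled departure from 50:50, -10 * log10(p), capped at `cap`."""
    p = two_tailed_p(a, n)
    if p >= 1.0:
        return 0.0
    if p <= 0.0:  # p underflowed to zero at very high depth
        return cap
    return min(cap, -10.0 * log10(p))

print(loh_phred(5, 10))   # balanced counts give a score of 0
print(loh_phred(1, 10))   # 1 vs 9 reads
print(loh_phred(0, 400))  # extreme imbalance hits the cap
```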

18 Adding Phred across Subjects. When \(p \sim \mathrm{Unif}[0,1]\), \(-2\ln p \sim \chi^2_2\). When multiple independent tests of the same hypothesis are performed, their information can be aggregated: \(-2\ln p_1 - \cdots - 2\ln p_k \sim \chi^2_{2k}\) under the null. Phred score: \(s_i = -10\log_{10} p_i = \frac{5}{\ln 10}\,[-2\ln p_i]\). The sum of Phred scores is \(s_1 + \cdots + s_k \sim \frac{5}{\ln 10}\,\chi^2_{2k}\), which reflects the overall evidence for LOH.
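The relation between the summed Phred score and the \(\chi^2_{2k}\) distribution can be turned into a combined p-value, as in the sketch below. It uses the closed-form upper tail of a chi-square distribution with an even number of degrees of freedom, so no external library is needed. It assumes uncapped scores from continuous, uniformly distributed p-values; capped or discrete binomial p-values make the result approximate. Function names are illustrative.

```python
from math import exp, log

def chi2_sf_even(x: float, k: int) -> float:
    """P(X > x) for X ~ chi-square with 2k degrees of freedom.

    Closed form for even df: exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    """
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return exp(-half) * total

def combined_p_from_phred(scores) -> float:
    """Combined p-value for k independent Phred scores s_i = -10 log10(p_i)."""
    k = len(scores)
    # Rescale the Phred sum back to the statistic sum_i(-2 ln p_i).
    stat = sum(scores) * log(10) / 5.0
    return chi2_sf_even(stat, k)

print(combined_p_from_phred([20.0]))        # one test: recovers its own p-value
print(combined_p_from_phred([10.0, 10.0]))  # two tests, each with p = 0.1
```

Two tests with p = 0.1 each combine to a p-value near 0.056, which is weaker than a single test at the same summed score of 20 (p = 0.01) because the reference distribution has more degrees of freedom.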

19 Expected Phred Score. \(0 \le \text{Phred} \le 99\) (cap), even with no LOH. Expected Phred: \(\text{ePhred} = \sum_{s=0}^{99} \Pr(s \mid n)\, s\). Adjusted Phred: \(\text{Phred} - \text{ePhred}\).
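The expected score under the null can be computed by enumerating the possible allele counts at depth n, as sketched below. Summing over counts a is equivalent to summing over score values s weighted by Pr(s | n). This sketch keeps the scores unrounded and uses twice the smaller tail as the two-tailed p-value; both choices are assumptions, since the slides do not state whether scores were rounded to integers.

```python
from math import comb, log10

def loh_phred(a: int, n: int, cap: float = 99.0) -> float:
    """Capped Phred score of the two-tailed p-value under Bin(n, 0.5)."""
    k = min(a, n - a)
    p = min(1.0, 2.0 * sum(comb(n, j) for j in range(k + 1)) / 2**n)
    if p >= 1.0:
        return 0.0
    if p <= 0.0:
        return cap
    return min(cap, -10.0 * log10(p))

def expected_phred(n: int) -> float:
    """Expected LOH Phred score at depth n when there is no LOH."""
    return sum(comb(n, a) / 2**n * loh_phred(a, n) for a in range(n + 1))

def adjusted_phred(a: int, n: int) -> float:
    """Observed Phred minus its expectation under the null at the same depth."""
    return loh_phred(a, n) - expected_phred(n)

for n in (10, 30, 100):
    print(n, expected_phred(n))
```

Subtracting the depth-specific expectation keeps sites from scoring high across subjects merely because they contribute many small positive scores.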

20 Example: Top LOH Sites. Results for top LOH sites. Columns: samples, AsumPhr, sumphr, EsumPh, maxphr, DPBlood, GQBlood, Chr, Position, Ref, Allele. Numeric columns are not recoverable; Ref and Allele per row: C, C/G; A, A/C/G/T; C, A/C/G/T; G, A/G/T; T, C/G/T; T, G/T; A, A/G; T, C/T; A, A/G; C, A/C/T; A, A/C; T, C/T; C, A/C/T.

21 Example: Top LOH Genes. Results for top LOH genes. Columns: SNPs, Samples, SNPs/bp, AsumPhr, SumPhr, DPBlood, GQBlood, gname, geneid, Strand, chr, from, to. Numeric columns are not recoverable; gene names as transcribed: FCRLB, HYDIN, ALG, PPP2R, CRB, MRRF, SYNE, PTGES, C21orf, FLG, MTMR.