Characterization of Allele-Specific Copy Number in Tumor Genomes

Similar documents
Estimation problems in high throughput SNP platforms

Single Tumor-Normal Pair Parent-Specific Copy Number Analysis

Statistical Methods for Quantitative Trait Loci (QTL) Mapping

Approaches for the Assessment of Chromosomal Alterations using Copy Number and Genotype Estimates

Statistical analysis of Single Nucleotide Polymorphism microarrays in cancer studies

Agilent CytoGenomics 2.0

Human SNP haplotypes. Statistics 246, Spring 2002 Week 15, Lecture 1

CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016

Getting high-quality cytogenetic data is a SNP.

Description of expands

Emma Huxley. Principal Clinical Scientist West Midlands Regional Genetics Laboratory

Detecting copy-neutral LOH in cancer using Agilent SurePrint G3 Cancer CGH+SNP Microarrays

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014

Lab 1: A review of linear models

Crash-course in genomics

Normal-Tumor Comparison using Next-Generation Sequencing Data

Package falcon. April 21, 2016

Genotype Prediction with SVMs

Using the Trio Workflow in Partek Genomics Suite v6.6

MoGUL: Detecting Common Insertions and Deletions in a Population

Course Announcements

Cornell Probability Summer School 2006

Guillaume Assié, 1 Thomas LaFramboise, 1,3 Petra Platzer, 1 Jérôme Bertherat, 5 Constantine A. Stratakis, 6 and Charis Eng 1,2,3,4, * Introduction

Linking Genetic Variation to Important Phenotypes: SNPs, CNVs, GWAS, and eqtls

THE HEALTH AND RETIREMENT STUDY: GENETIC DATA UPDATE

Introduction to Quantitative Genomics / Genetics

Finding Recurrent Copy Number Alteration Regions: A Review of Methods

Population Genetics II. Bio

Philippe Hupé 1,2. The R User Conference 2009 Rennes

Using the Association Workflow in Partek Genomics Suite

arxiv: v1 [q-bio.gn] 2 Mar 2015

Computational Genomics

Runs of Homozygosity Analysis Tutorial

Linking Genetic Variation to Important Phenotypes

Package falconx. February 24, 2017

Haplotypes, linkage disequilibrium, and the HapMap

Human Genetic Variation. Ricardo Lebrón Dpto. Genética UGR

Measurement of Molecular Genetic Variation. Forces Creating Genetic Variation. Mutation: Nucleotide Substitutions

Genome-wide association studies (GWAS) Part 1

Introduction to Genome Wide Association Studies 2014 Sydney Brenner Institute for Molecular Bioscience/Wits Bioinformatics Shaun Aron

SNP-array based Copy number analysis in R R course,

Dan Geiger. Many slides were prepared by Ma ayan Fishelson, some are due to Nir Friedman, and some are mine. I have slightly edited many slides.

High-density SNP Genotyping Analysis of Broiler Breeding Lines

GCTA/GREML. Rebecca Johnson. March 30th, 2017

QTL Mapping, MAS, and Genomic Selection

How to design an HMM for a new problem. HMM model structure. Inherent limitation of HMMs. Duration modeling. Duration modeling

CUMACH - A Fast GPU-based Genotype Imputation Tool. Agatha Hu

CMSC423: Bioinformatic Algorithms, Databases and Tools. Some Genetics

Quality Control Assessment in Genotyping Console

Linking Genetic Variation to Important Phenotypes: SNPs, CNVs, GWAS, and eqtls

Structure, Measurement & Analysis of Genetic Variation

The Diploid Genome Sequence of an Individual Human

Computational Haplotype Analysis: An overview of computational methods in genetic variation study

Algorithms for Genetics: Introduction, and sources of variation

STATISTICAL METHODS FOR GENOTYPE ASSAY DATA. Soo Yeon Cheong. MSc in Statistics, Seoul National University, South Korea, 2003

Introducing combined CGH and SNP arrays for cancer characterisation and a unique next-generation sequencing service. Dr. Ruth Burton Product Manager

Why do we need statistics to study genetics and evolution?

5/18/2017. Genotypic, phenotypic or allelic frequencies each sum to 1. Changes in allele frequencies determine gene pool composition over generations

An introductory overview of the current state of statistical genetics

Wu et al., Determination of genetic identity in therapeutic chimeric states. We used two approaches for identifying potentially suitable deletion loci

AN EVALUATION OF POWER TO DETECT LOW-FREQUENCY VARIANT ASSOCIATIONS USING ALLELE-MATCHING TESTS THAT ACCOUNT FOR UNCERTAINTY

Relative Fluorescent Quantitation on Capillary Electrophoresis Systems:

Identifying copy number alterations and genotype with Control-FREEC

DNA Collection. Data Quality Control. Whole Genome Amplification. Whole Genome Amplification. Measure DNA concentrations. Pros

Agilent s CGH Platform: Recent Expansions for Cytogenetic Research

Cancer Genetics Solutions

Traditional Genetic Improvement. Genetic variation is due to differences in DNA sequence. Adding DNA sequence data to traditional breeding.

GenSap Meeting, June 13-14, Aarhus. Genomic Selection with QTL information Didier Boichard

Introduction to Animal Breeding & Genomics

Potential of human genome sequencing. Paul Pharoah Reader in Cancer Epidemiology University of Cambridge

Answers to additional linkage problems.

CS-E5870 High-Throughput Bioinformatics Microarray data analysis

Supplementary Note: Detecting population structure in rare variant data

Microsatellite markers

Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Supplementary information

HISTORICAL LINGUISTICS AND MOLECULAR ANTHROPOLOGY

On polyclonality of intestinal tumors

B) You can conclude that A 1 is identical by descent. Notice that A2 had to come from the father (and therefore, A1 is maternal in both cases).

Cell Stem Cell, volume 9 Supplemental Information

January 6, 2005 Bio 107/207 Winter 2005 Lecture 2 Measurement of genetic diversity

GREG GIBSON SPENCER V. MUSE

Why learn linkage analysis?

Introduction to Genome Wide Association Studies 2015 Sydney Brenner Institute for Molecular Bioscience Shaun Aron

Analysis of genome-wide genotype data

NGS in Pathology Webinar

CS 262 Lecture 14 Notes Human Genome Diversity, Coalescence and Haplotypes

Familial Breast Cancer

Overview. Methods for gene mapping and haplotype analysis. Haplotypes. Outline. acatactacataacatacaatagat. aaatactacctaacctacaagagat

Mathematics of Forensic DNA Identification

Understanding Accuracy in SMRT Sequencing

12/8/09 Comp 590/Comp Fall

Surely Better Target Enrichment from Sample to Sequencer

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies

H3A - Genome-Wide Association testing SOP

Question. In the last 100 years. What is Feed Efficiency? Genetics of Feed Efficiency and Applications for the Dairy Industry

Shaare Zedek Medical Center (SZMC) Gaucher Clinic. Peripheral blood samples were collected from each

QTL Mapping, MAS, and Genomic Selection

Whole Genome Sequencing. Biostatistics 666

arxiv: v2 [q-bio.qm] 5 Nov 2015

Supplementary Methods Illumina Genome-Wide Genotyping Single SNP and Microsatellite Genotyping. Supplementary Table 4a Supplementary Table 4b

Transcription:

Characterization of Allele-Specific Copy Number in Tumor Genomes Hao Chen 2 Haipeng Xing 1 Nancy R. Zhang 2 1 Department of Statistics Stonybrook University of New York 2 Department of Statistics Stanford University Statistical Genomics Workshop, June 2009

Human chromosomes come in pairs.

Trisomy of 21 in Downs syndrome

Copy Number Changes at a Smaller Scale

Change in total copy number in cancer

Which chromosome hasgained or lost copies?

Which chromosome hasgained or lost copies?

Loss of Heterozygosity A large fraction of LOH events don t involve change in total copy number. Data from 24 pancreatic cancer cell lines, Calhoun et al., Genes, Chromosomes and Cancer 45:1070

Total versus Allele-specific Copy Numbers Each individual has two copies of every chromosome, the maternal copy and the paternal copy.

Total versus Allele-specific Copy Numbers Each individual has two copies of every chromosome, the maternal copy and the paternal copy. 1. Certain experimental platforms, such as Agilent and Nimblegen, provides only total copy number estimates, that is, the sum of the copy numbers of both maternal and paternal chromosomes.

Total versus Allele-specific Copy Numbers Each individual has two copies of every chromosome, the maternal copy and the paternal copy. 1. Certain experimental platforms, such as Agilent and Nimblegen, provides only total copy number estimates, that is, the sum of the copy numbers of both maternal and paternal chromosomes. 2. Other platforms, such as genotyping arrays (Illumina Beadarray, Affymetrix) can provide esimates of the copy number of each allele at selected polymorphic loci.

Data from SNP arrays (Illumina, Affymetrix)

Data from SNP arrays (Illumina, Affymetrix) Colored points: population control, black triangle: target sample.

Data from SNP arrays (Illumina, Affymetrix) Colored points: population control, black triangle: target sample.

Data from SNP arrays (Illumina, Affymetrix) Colored points: population control, black triangle: target sample.

Data from SNP arrays (Illumina, Affymetrix) Colored points: population control, black triangle: target sample.

Why do we care about parent-specific copy number? 1. Many loss-of-heterozygosity events do not cause a change in total copy number.

Why do we care about parent-specific copy number? 1. Many loss-of-heterozygosity events do not cause a change in total copy number. 2. Allele-specific nature of amplification and deletion often has biological relevance.

Why do we care about parent-specific copy number? 1. Many loss-of-heterozygosity events do not cause a change in total copy number. 2. Allele-specific nature of amplification and deletion often has biological relevance. 3. Making use of information from both alleles can improve CNV detection accuracy.

Why do we care about parent-specific copy number? 1. Many loss-of-heterozygosity events do not cause a change in total copy number. 2. Allele-specific nature of amplification and deletion often has biological relevance. 3. Making use of information from both alleles can improve CNV detection accuracy. 4. Teasing apart the parent-specific copy numbers allow us to quantify the fraction of cells in the sample that contain the copy number aberration. This makes possible the quantitative study of tumor clonality.

Why do we care about parent-specific copy number? 1. Many loss-of-heterozygosity events do not cause a change in total copy number. 2. Allele-specific nature of amplification and deletion often has biological relevance. 3. Making use of information from both alleles can improve CNV detection accuracy. 4. Teasing apart the parent-specific copy numbers allow us to quantify the fraction of cells in the sample that contain the copy number aberration. This makes possible the quantitative study of tumor clonality. How should this data be modeled and visualized?

More on the Clonality of Cancer Samples Image from: http://science.kennesaw.edu/ mhermes/cisplat/cisplat19.htm

Using information in both alleles can improve power Illumina outputs SNP-normalized data: R = (A + B)/[E(A + B)], b = (2/π) arctan(b/a), Normalized to population controls...

Copy number estimation using SNP chip data Existing approaches: 1. Segment sequence based on total copy number, then cluster segments based on allele information (LaFramboise et al., 2005).

Copy number estimation using SNP chip data Existing approaches: 1. Segment sequence based on total copy number, then cluster segments based on allele information (LaFramboise et al., 2005). This misses copy neutral loss of heterozygosity events.

Copy number estimation using SNP chip data Existing approaches: 1. Segment sequence based on total copy number, then cluster segments based on allele information (LaFramboise et al., 2005). This misses copy neutral loss of heterozygosity events. 2. Use existing segmentation tools to separately segment the log ratio and some thresholded version of B-frequency (Staaf et al. (2008), Assié et al. (2008)).

Copy number estimation using SNP chip data Existing approaches: 1. Segment sequence based on total copy number, then cluster segments based on allele information (LaFramboise et al., 2005). This misses copy neutral loss of heterozygosity events. 2. Use existing segmentation tools to separately segment the log ratio and some thresholded version of B-frequency (Staaf et al. (2008), Assié et al. (2008)). B-frequency is a mixture residing within [0, 1]...

Copy number estimation using SNP chip data Existing approaches: 1. Segment sequence based on total copy number, then cluster segments based on allele information (LaFramboise et al., 2005). This misses copy neutral loss of heterozygosity events. 2. Use existing segmentation tools to separately segment the log ratio and some thresholded version of B-frequency (Staaf et al. (2008), Assié et al. (2008)). B-frequency is a mixture residing within [0, 1]... 3. Hidden Markov model with hidden states representing genotypes A, B, AA, AB, BB, AAB, ABB,... PennCNV, QuantiSNP.

Copy number estimation using SNP chip data Existing approaches: 1. Segment sequence based on total copy number, then cluster segments based on allele information (LaFramboise et al., 2005). This misses copy neutral loss of heterozygosity events. 2. Use existing segmentation tools to separately segment the log ratio and some thresholded version of B-frequency (Staaf et al. (2008), Assié et al. (2008)). B-frequency is a mixture residing within [0, 1]... 3. Hidden Markov model with hidden states representing genotypes A, B, AA, AB, BB, AAB, ABB,... PennCNV, QuantiSNP. Cancer samples are often a mixture of sub-populations with different genotypes.

Fractional changes

Fractional changes

Fractional changes

Fractional changes

Fractional changes

Fractional changes

A helpful observation For any homogeneous segment, the (A, B) allele intensities must take on either 2, 3, or 4 clusters. The locations of these clusters depend on the quantities of the maternal and paternal chromosome. The cluster membership of any specific SNP depends on the genotype.

Two-chromosome Markov jump model 1. Model description. 2. Estimation Procedure. 3. Algorithmic details. 4. Results on data.

Parental allele configuration s t Interpretation AA Both parental chromosomes carry A. AB Maternal carries A, paternal carries B. BA Maternal carries B, paternal carries A. BB Both parental chromosomes carry B.

Parental allele configuration s t Interpretation AA Both parental chromosomes carry A. AB Maternal carries A, paternal carries B. BA Maternal carries B, paternal carries A. BB Both parental chromosomes carry B. Parent-specific copy numbers at location t: θ t,1, θ t,2.

Parental allele configuration s t Interpretation AA Both parental chromosomes carry A. AB Maternal carries A, paternal carries B. BA Maternal carries B, paternal carries A. BB Both parental chromosomes carry B. Parent-specific copy numbers at location t: θ t,1, θ t,2. Allele specific copy numbers: x t,1, x t,2. s t x t,1 x t,2 AA θ t,1 + θ t,2 0 AB θ t,1 θ t,2 BA θ t,2 θ t,1 BB 0 θ t,1 + θ t,2

Model Overview

Model Overview θ t : Markov jump process,

Model Overview θ t : Markov jump process, y t = x t + ɛ, ɛ N(0, Σ st ).

Model Overview θ t : Markov jump process, y t = x t + ɛ, ɛ N(0, Σ st ). When somatic changes in copy number occur, s t remains fixed.

Parental copy numbers θ t θ t can belong to two states: Baseline state: θ t = ( B 2, B 2 ), changed state: θt N(µ, V ).

Parental copy numbers θ t θ t can belong to two states: Baseline state: θ t = ( B 2, B 2 ), changed state: θt N(µ, V ). At each position, θ t can: 1. remain at the same value as θ t 1, 2. jump to a changed state (if at baseline), or a new changed state (if already at changed state), 3. jump to baseline (if at changed state).

Parental copy numbers θ t θ t can belong to two states: Baseline state: θ t = ( B 2, B 2 ), changed state: θt N(µ, V ). At each position, θ t can: 1. remain at the same value as θ t 1, 2. jump to a changed state (if at baseline), or a new changed state (if already at changed state), 3. jump to baseline (if at changed state). This can be modeled with a 3-state Markov model with transition matrix: 1 p 1 2 p 1 2 p P = c a b. c b a

Parental allele configuration s t s t can be modeled as: i.i.d. multinomial distribution with known parameters (p AA t, p BA, p AB t t, p BB t ). Markov with transition probabilities between two SNPs estimated using Haplotype data. These parameters can be estimated from the appropriate population controls.

Data transformation

Data transformation X = (log R + C)b, Y = (log R + C)(1 b), plus some necessary symmetrization, variance stabilization work well.

Smoothing equations If s t were known, the posterior distribution of θ t given the complete data sequence is a mixture of normals: θ t (Y 1,n, s 1,n ) α t δ B + β ijt N(µ ij, V ij ). 1 i t j n where α t, β ijt, µ ij, V ij can be explicitly computed via recursion.

Smoothing equations If s t were known, the posterior distribution of θ t given the complete data sequence is a mixture of normals: θ t (Y 1,n, s 1,n ) α t δ B + β ijt N(µ ij, V ij ). 1 i t j n where α t, β ijt, µ ij, V ij can be explicitly computed via recursion. Then we can estimate θ t by its posterior mean. E(θ t Y 1,n, s 1,n ) = α t B + 1 i t j n β ijt µ ij.

Smoothing equations If s t were known, the posterior distribution of θ t given the complete data sequence is a mixture of normals: θ t (Y 1,n, s 1,n ) α t δ B + β ijt N(µ ij, V ij ). 1 i t j n where α t, β ijt, µ ij, V ij can be explicitly computed via recursion. Then we can estimate θ t by its posterior mean. E(θ t Y 1,n, s 1,n ) = α t B + 1 i t j n β ijt µ ij. Using BCMIX approximations (Lai et al., 2005), this smoothing step can be done in O(n) computations, where n is number of SNPs. This translates to 15 seconds for 30000 SNPs (4 minutes for 500K chip).

A Twist on the EM Algorithm Set s 1 to some initial value. Let i = 1.

A Twist on the EM Algorithm Set s 1 to some initial value. Let i = 1. Repeat: 1. Expectation step: Given s i, set θt i to its posterior mean.

A Twist on the EM Algorithm Set s 1 to some initial value. Let i = 1. Repeat: 1. Expectation step: Given s i, set θt i to its posterior mean. 2. Maximization step: Given θ i, set s i+1 to its maximum a posteriori value s i+1 = arg max P(s θ i, y). (3.1) s S This can be done easily because given θ i, y t is a mixture of Gaussians at each t, and s i+1 t is simply the identifier for each mixture component.

A Twist on the EM Algorithm Set s 1 to some initial value. Let i = 1. Repeat: 1. Expectation step: Given s i, set θt i to its posterior mean. 2. Maximization step: Given θ i, set s i+1 to its maximum a posteriori value s i+1 = arg max P(s θ i, y). (3.1) s S This can be done easily because given θ i, y t is a mixture of Gaussians at each t, and s i+1 t is simply the identifier for each mixture component. 3. If θ i+1 θ i < δ, stop and report θ i+1, s i+1. Otherwise, set i i + 1 and go back to step 1.

Smoothed data: birds eye view

Segmented data: birds eye view Segmentation by thresholding: if difference > δ then segment. Plus other rules for min SNPs, min heterozygosity, etc.

Transform back to Illumina AB Coordinates

End Result: Estimates of Major and Minor Copy Numbers

Specificity using simulated titration data of Staaf et al. (2008)

Sensitivity using simulated titration data of Staaf et al. (2008)

Distribution of types of aberrations across patients 1. 223 Glioblastoma samples studied by the TCGA Research Network. 2. Stringent cut offs were used to identify CNAs. 3. Percentage of bases in each of the aberration categories: Event type % Gain/Gain 3 Gain/Normal 28 Gain/Loss 15 Loss/Normal 51 Loss/Loss 3

Variation of Aberration Profiles Across Patients

Example Regions

Example Regions

Example Regions

Example Regions

Example Regions

Example Regions

Example Regions Unbalanced gain/loss, followed by fractional gain.

Example Regions

Example Regions

Example Regions

Example Regions

Example Regions

Example Regions

Example Regions Copy neutral loss of heterozygosity, followed by alternating fractional gain loss.

What s next? We can now segment a genome into regions of homogeneous parent-specific copy number. For each region, this gives estimates of the relative quantities of the major and minor chromosome.

What s next? We can now segment a genome into regions of homogeneous parent-specific copy number. For each region, this gives estimates of the relative quantities of the major and minor chromosome. This makes possible many new types of analyses: 1. Estimate normal cell contamination? 2. Quantify clonality? 3. Cross-sample analysis to identify allele-specific gains/losses to distinguish between passenger and driver mutations.