Structural variation analysis using NGS sequencing

Similar documents
Transcription:

Structural variation analysis using NGS sequencing Victor Guryev NBIC NGS taskforce meeting April 15th, 2011

Scale of genomic variants Scale 1 bp 10 bp 100 bp 1 kb 10 kb 100 kb 1 Mb Variants SNPs Short indels Indels S t r u c t u r a l v a r i a n t s Tools G A P S S Solutions S A P 4 2 P I N D E L G A T K a C G H FISH, Cytogenetics

Depth-of-coverage (DOC) analysis Fragment library, sample vs genome reference Ref. Heterozygous deletion? Sequenceability issue? Mappability issue? Sample vs sample comparison Fragment library 1 Ref. Fragment library 2 Ref.

DOC with dynamic windows Fragment library 1 Test Fragment library 2 Ref Static windows, e.g. 1kb Window 4 Dynamic windows, 10 (100, 1000) reads in the reference set Next-step: Fine-mapping to determine the exact breakpoints

DWAC-Seq: CNV analysis using tag density analysis fedor21.hubrecht.eu/dwac-seq Google already knows DWAC-Seq

Validation of DWAC-Seq with acgh DNA Source: 2 inbred rat strains (BN, M520) acgh, Nimblegen 2.1M probes (experimentally independent) NGS data: SOLID v3.5 DWAC-Seq 520M reads for M520; (Dynamic window) 850M reads for BN [Koval et al, submitted] CNV-Seq (Static window) [Xie & Tammi, 2009] Sensitivity: 96,0% 68,7% Specificity: 0.06 0.12 (FP rate) Performance: 20 CPU hours @2.66 GHz

Validation of DWAC-Seq by Simulated simulation set: RNO18. 100 CNVs. Sizes: 1 kb-100 kb. Copy number: 0 5 (1 = diploid) with step = 0.25 Sensitivity, % False positives, % Precision of breakpoint mapping, bp Error in Copy-number estimation

Other applications of DWAC-Seq BAM(test) segmenting fine-mapping CNVs CNVs BAM(ref) window size, mapping ambiguity (windows) (bp level) Exome enrichment data (testing) RNA-Seq

DOC visualization Hlbert curves Raw data Segmented profile Duplications (red) and deletions (blue)

Two flavours of paired tags Fragmentation, size selection Adaptors ligation Internal adaptor ligation, circularization Paired-end library Insert size 200-400 bp Mate-pair library Insert size 0.6-25 kb Adaptors ligation

Checking MP/PE libraries Size of insert Secondary peaks Proportion of di-tags, % Insert sizes, bp Sharpness of size distribution

Common artefacts of MP / PE Clonal reads Possible causes: PCR overamplification of a library; several beads in one reactor After amplification Solution: Remove di-tags that have Reactor with 2 beads, exactly the same mapping positions But single DNA molecule Chimaeric clones Possible causes: PCR or ligation artifacts, sequencing errors Solution: Real events should be supported by multiple independent events

Mapping of pairs: together or apart? Independent mapping of tags F tag R tag + Any mapping tool can be used + Low false-negative rate - Overall coverage is lower Ref. - Lower coverage for (moderate) repeats map map Simultaneous mapping of tags + Better overall coverage + Repeat regions can be covered - Not every tool supports this mode - Higher false-negative rate Ref. F tag map R tag

Signatures of Structural Varaints Normal sequenced reference /mapped Inversion Tandem duplication Insertion Deletion Translocation Chr7 Chr5

Analysis of MP / PE data Mapping of di-tags Sorting by pattern of mapping Remote Inverted Everted Too far Too close C L U S T E R I N G Translocations Inversions Tandem duplication s Deletions Insertions

Population-based approach to SV calling WKY BN-Lx F344 BN SHR Discovery SV calls SV1 SV2 SV3 Genotyping WKY SV1, SV3 BN-Lx SV3 SV3 BN SHR SV2, SV3 F344 SV1, SV3

SV detection pipeline: 1-2-3-SV Design principles: Universality (MP, PE; SOLiD, Solexa; OpenSource); Usability (little dependency, BAM->SV) ; Scalability (CPU/RAM/Time) Csfasta (F3) Qual (F3) Csfasta (R3) Qual (R3) SAP42, BWA Fastq /1 Fastq /2 BAM (F3) BAM (R3) Step 1 Insert size distribution s Potential SV regions Multiple libraries Step 2 Fine-mapping, visualization Verification assays Step 3 SVs

Vizualization of paired tags http://genome.ucsc.edu/goldenpath/help/bam.html

Composite pattern of a deletion (di-tag + coverage) Paired-end, 150 bp inserts Zero coverage (deletion or mappability issue?) Mate-Pair, 1.5 kbp inserts

Composite patterns of other SVs Tandem duplication: Everted orientation of di-tags + increased coverage Large Insertion: Hanging ends + small gap in coverage Local rearrangement: Clustered everted and too-distant di-tags

Repeat instability and retrotransposition as cause of SVs Use different insert sizes to catch insertions of all mobile elements Example: PE 200 bp, MP 1-2 kb and MP 7-10 kb

Fine-mapping of SV breakpoints?? deletion Normal 1 3 Too distant 2 Hanging ends 4 5 Fetching all di-tags from the region 1F 1R 2F 2R 3F 3R 4F 4R 5F 5R Local de novo assembly Align to reference, determine exact breakpoints Reference deletion Assembled >10,000 fine-mapped

Complex structural variations, chromotrypsis 55793180 55793182 55792170 chr10 57519917 57521100 57521088 57523787 57519913 57523805 57524597 50761470 50761463 chr1 102287386 102287791 102287798 =DNA double strand breaks chr4 105745953 104738996 105025700 105036708 105745783 104738136 105028395 105035150 105745828 105029770 105036735 105028400

Complex structural variations, chromotrypsis(2)

Closing gaps in genome assembly Step 1. Array design Step 2. Array enrichment

Quality of assembly is critical

Complementarity of DOC and di-tags DOC and Paired-tag analyses are complimentary for detection of deletions and duplications Unique sequence in SV breakpoints Repetitive sequence in SV breakpoints Unique sequence inside a breakpoint DOC, Paired-tag DOC Repetitive sequence inside a breakpoint Paired-tag?

Software for MP/PE analysis PEMer Korbel et al, 2009 (translocations) SegSeq Chiang et al, 2009 (copy-number) VariationHunter Hormozdiari et al, 2009(tandem duplications) MoDIL Lee et al, 2009 Pindel Ye et al, 2009 (split mapping) BreakDancer Chen et al, 2009 (hanging reads) ABI Tools McKernan et al, 2009(copy-number, invers.) SVDetect Zeitouni et al, 2010 (anomalous di-tag clustering) HYDRA Quinlan et al, 2010 (clustering discordant di-tags) CNVer Medvedev et al, 2010 (CNV + MP signature)

Acknowlegements Cuppen Lab @ Hubrecht Institute, Utrecht

GvNL data A4A sample 3 Solexa lanes Mapping with BWA 0.5.9 k 2 l 25 n 10 34.7 CPU days 123SV Step 1 4 CPU hours 123SV Step 2 1 CPU day Inversions Insertions Deletions Duplication(Evertions) Translocation No coverage 45 CPU minutes