Structural variation analysis using NGS sequencing


1 Structural variation analysis using NGS sequencing Victor Guryev NBIC NGS taskforce meeting April 15th, 2011

2 Scale of genomic variants. Scale: 1 bp, 10 bp, 100 bp, 1 kb, 10 kb, 100 kb, 1 Mb. Variants: SNPs, short indels, indels, structural variants. Tools/solutions: GAPSS, SAP42, PINDEL, GATK, aCGH, FISH, cytogenetics

3 Depth-of-coverage (DOC) analysis. Fragment library, sample vs genome reference: heterozygous deletion? Sequenceability issue? Mappability issue? Sample vs sample comparison: fragment library 1 vs fragment library 2, both mapped to the reference.

4 DOC with dynamic windows. Fragment library 1: test; fragment library 2: reference. Static windows: fixed size, e.g. 1 kb. Dynamic windows: 10 (100, 1000) reads in the reference set. Next step: fine-mapping to determine the exact breakpoints
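The dynamic-window idea can be sketched in a few lines. This is a minimal illustration, not the DWAC-Seq implementation: the function name `dynamic_window_ratios`, the plain lists of read start positions, and the library-size scaling are assumptions made for the example.

```python
from bisect import bisect_left


def dynamic_window_ratios(test_positions, ref_positions, reads_per_window=10):
    """Return (start, end, ratio) per window on one chromosome.

    Each window is sized so that it holds `reads_per_window` reads of the
    reference set; the ratio is the library-size-scaled test/reference
    read count. Windows are half-open intervals [start, end).
    Assumes both position lists are non-empty.
    """
    test_sorted = sorted(test_positions)
    ref_sorted = sorted(ref_positions)
    # Hypothetical normalisation: correct for different library sizes.
    scale = len(ref_sorted) / len(test_sorted)

    results = []
    # The trailing reads that cannot close a full window are ignored.
    for i in range(0, len(ref_sorted) - reads_per_window, reads_per_window):
        start = ref_sorted[i]
        end = ref_sorted[i + reads_per_window]
        if end <= start:
            continue  # pile-up of identical positions, no usable window
        n_test = bisect_left(test_sorted, end) - bisect_left(test_sorted, start)
        n_ref = bisect_left(ref_sorted, end) - bisect_left(ref_sorted, start)
        results.append((start, end, scale * n_test / n_ref))
    return results
```

Windows become narrow where the reference set is dense and wide where it is sparse, so every ratio rests on a comparable number of reference reads. A ratio near 0 suggests a loss in the test sample, and a ratio well above 1 suggests a gain.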

5 DWAC-Seq: CNV analysis using tag density analysis fedor21.hubrecht.eu/dwac-seq Google already knows DWAC-Seq

6 Validation of DWAC-Seq with aCGH. DNA source: 2 inbred rat strains (BN, M520). aCGH: Nimblegen 2.1M probes (experimentally independent). NGS data: SOLiD v3.5; 520M reads for M520, 850M reads for BN. DWAC-Seq (dynamic window) [Koval et al, submitted] vs CNV-Seq (static window) [Xie & Tammi, 2009]: sensitivity 96.0% vs 68.7%. Specificity: (FP rate) Performance: 20 CPU GHz

7 Validation of DWAC-Seq by simulation. Simulated set: RNO CNVs. Sizes: 1 kb-100 kb. Copy number: 0 to 5 (1 = diploid) with step = 0.25. Measured: sensitivity (%), false positives (%), precision of breakpoint mapping (bp), error in copy-number estimation

8 Other applications of DWAC-Seq. Workflow: BAM(test) + BAM(ref) -> segmenting (window size, mapping ambiguity) -> CNVs (windows) -> fine-mapping -> CNVs (bp level). Exome enrichment data (testing), RNA-Seq

9 DOC visualization: Hilbert curves. Raw data, segmented profile. Duplications (red) and deletions (blue)
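A Hilbert curve folds a one-dimensional coverage profile into a square so that neighbouring genomic bins stay neighbours in the picture. The sketch below uses the standard index-to-coordinate conversion; the function names and the list-of-bins input are assumptions for illustration, and the colouring step (red for gains, blue for losses) is left out.

```python
def hilbert_d2xy(order, d):
    """Map index d (0 .. 4**order - 1) to (x, y) on a Hilbert curve
    that fills a 2**order x 2**order grid."""
    x = y = 0
    t = d
    s = 1
    side = 1 << order
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the sub-square before swapping the axes.
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_grid(bin_values, order):
    """Place consecutive coverage bins along the curve; unused cells stay None."""
    side = 1 << order
    grid = [[None] * side for _ in range(side)]
    for d, value in enumerate(bin_values[: side * side]):
        x, y = hilbert_d2xy(order, d)
        grid[y][x] = value
    return grid
```

The resulting grid can be handed to any image-plotting routine; with per-bin log ratios as values, a contiguous CNV shows up as a compact patch rather than a thin stripe.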

10 Two flavours of paired tags Fragmentation, size selection Adaptors ligation Internal adaptor ligation, circularization Paired-end library Insert size bp Mate-pair library Insert size kb Adaptors ligation

11 Checking MP/PE libraries Size of insert Secondary peaks Proportion of di-tags, % Insert sizes, bp Sharpness of size distribution
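The library checks listed above (insert size, sharpness of the distribution, secondary peaks) can be approximated from a list of observed insert sizes. This is a rough sketch under assumed names and thresholds: `insert_size_summary`, the 50 bp bin width and the 10% peak threshold are illustrative choices, not values from the talk.

```python
from collections import Counter
from statistics import median


def insert_size_summary(insert_sizes, bin_width=50, peak_fraction=0.1):
    """Summarise an insert-size distribution.

    Returns the median, the median absolute deviation (a robust measure
    of sharpness) and the positions of secondary peaks, i.e. local
    maxima of the binned histogram holding at least `peak_fraction`
    of the main peak's count.
    """
    sizes = list(insert_sizes)
    med = median(sizes)
    mad = median([abs(s - med) for s in sizes])

    hist = Counter(s // bin_width for s in sizes)
    main_bin, main_count = hist.most_common(1)[0]

    secondary = []
    for b, count in hist.items():
        if b == main_bin or count < peak_fraction * main_count:
            continue
        # Keep the bin only if it is higher than both of its neighbours.
        if count > hist.get(b - 1, 0) and count > hist.get(b + 1, 0):
            secondary.append(b * bin_width)

    return {"median": med, "mad": mad, "secondary_peaks": sorted(secondary)}
```

A mate-pair library with a secondary peak at a few hundred bp would typically point to contaminating paired-end-like fragments, which is worth knowing before choosing the "too close" and "too far" cut-offs.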

12 Common artefacts of MP / PE. Clonal reads. Possible causes: PCR overamplification of a library; several beads in one reactor (reactor with 2 beads but a single DNA molecule, after amplification). Solution: remove di-tags that have exactly the same mapping positions. Chimaeric clones. Possible causes: PCR or ligation artifacts, sequencing errors. Solution: real events should be supported by multiple independent events
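Removing clonal reads amounts to keeping one di-tag per unique pair of mapping positions. A minimal sketch, assuming each di-tag is already reduced to chromosome, position and strand for both tags (the `DiTag` record and the function name are hypothetical):

```python
from typing import NamedTuple


class DiTag(NamedTuple):
    """Mapping positions of the two tags of one di-tag (illustrative record)."""
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str


def remove_clonal_ditags(ditags):
    """Keep the first di-tag for every distinct pair of mapping positions."""
    seen = set()
    unique = []
    for tag in ditags:
        # NamedTuples compare and hash by value, so identical mappings collide.
        if tag not in seen:
            seen.add(tag)
            unique.append(tag)
    return unique
```

Only exact duplicates are dropped, so two di-tags that differ by a single base in either position both survive and still count as independent support for an event.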

13 Mapping of pairs: together or apart? Independent mapping of tags (F tag and R tag mapped separately): + Any mapping tool can be used; + Low false-negative rate; - Overall coverage is lower; - Lower coverage for (moderate) repeats. Simultaneous mapping of tags (F tag and R tag mapped together): + Better overall coverage; + Repeat regions can be covered; - Not every tool supports this mode; - Higher false-negative rate

14 Signatures of Structural Variants Normal sequenced reference /mapped Inversion Tandem duplication Insertion Deletion Translocation Chr7 Chr5

15 Analysis of MP / PE data. Mapping of di-tags, sorting by pattern of mapping, clustering. Remote -> translocations; inverted -> inversions; everted -> tandem duplications; too far -> deletions; too close -> insertions
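Sorting di-tags by mapping pattern can be illustrated with a small classifier. The sketch assumes an Illumina-style paired-end convention (upstream tag on "+", downstream tag on "-"); SOLiD mate-pairs use a different expected orientation, so the strand tests would need adjusting. `DiTag`, `classify_ditag` and the insert-size limits are hypothetical names and parameters.

```python
from typing import NamedTuple


class DiTag(NamedTuple):
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str


# Pattern-to-event mapping as listed on the slide.
PATTERN_TO_SV = {
    "remote": "translocation",
    "inverted": "inversion",
    "everted": "tandem duplication",
    "too far": "deletion",
    "too close": "insertion",
}


def classify_ditag(tag, min_insert, max_insert):
    """Assign one di-tag to a mapping pattern.

    Assumes the normal layout is: left tag on '+', right tag on '-',
    separated by min_insert..max_insert bp.
    """
    if tag.chrom1 != tag.chrom2:
        return "remote"
    if tag.strand1 == tag.strand2:
        return "inverted"
    left_strand = tag.strand1 if tag.pos1 <= tag.pos2 else tag.strand2
    if left_strand == "-":
        return "everted"  # tags point away from each other
    span = abs(tag.pos2 - tag.pos1)
    if span > max_insert:
        return "too far"
    if span < min_insert:
        return "too close"
    return "normal"
```

In practice the limits would come from the library's observed insert-size distribution, and only patterns supported by a cluster of independent di-tags would be reported.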

16 Population-based approach to SV calling. Discovery across WKY, BN-Lx, F344, BN, SHR: SV calls SV1, SV2, SV3. Genotyping: WKY: SV1, SV3; BN-Lx: SV3; BN: SV3; SHR: SV2, SV3; F344: SV1, SV3
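The two phases, discovery on all strains together and genotyping per strain, can be sketched as follows. This is an illustrative simplification, not the pipeline's code: discordant di-tags of one pattern are reduced to a (chromosome, position) pair, and `max_gap`, `min_support` and `min_per_strain` are hypothetical thresholds.

```python
from collections import Counter


def discover_and_genotype(tags_by_strain, max_gap=500, min_support=3,
                          min_per_strain=2):
    """Cluster pooled discordant di-tags, then list the strains carrying each call.

    tags_by_strain: dict mapping strain name -> list of (chrom, pos).
    """
    pooled = sorted(
        (chrom, pos, strain)
        for strain, tags in tags_by_strain.items()
        for chrom, pos in tags
    )

    # Discovery: single-linkage clustering along each chromosome.
    clusters, current = [], []
    for item in pooled:
        if current and (item[0] != current[-1][0]
                        or item[1] - current[-1][1] > max_gap):
            clusters.append(current)
            current = []
        current.append(item)
    if current:
        clusters.append(current)

    # Genotyping: count the support each strain contributes to a call.
    calls = []
    for cluster in clusters:
        if len(cluster) < min_support:
            continue
        counts = Counter(strain for _, _, strain in cluster)
        carriers = sorted(s for s, c in counts.items() if c >= min_per_strain)
        calls.append({
            "chrom": cluster[0][0],
            "start": cluster[0][1],
            "end": cluster[-1][1],
            "carriers": carriers,
        })
    return calls
```

Pooling lets a variant shared by several strains reach the discovery threshold even when no single strain has enough supporting di-tags on its own.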

17 SV detection pipeline: 123SV. Design principles: universality (MP, PE; SOLiD, Solexa; open source); usability (few dependencies, BAM->SV); scalability (CPU/RAM/time). Input: csfasta and qual (F3, R3) or fastq /1 and /2, mapped with SAP42 or BWA to BAM (F3) and BAM (R3). Step 1: insert size distributions, potential SV regions, multiple libraries. Step 2: fine-mapping, visualization. Verification assays. Step 3: SVs

18 Visualization of paired tags

19 Composite pattern of a deletion (di-tag + coverage) Paired-end, 150 bp inserts Zero coverage (deletion or mappability issue?) Mate-Pair, 1.5 kbp inserts

20 Composite patterns of other SVs Tandem duplication: Everted orientation of di-tags + increased coverage Large Insertion: Hanging ends + small gap in coverage Local rearrangement: Clustered everted and too-distant di-tags

21 Repeat instability and retrotransposition as cause of SVs Use different insert sizes to catch insertions of all mobile elements Example: PE 200 bp, MP 1-2 kb and MP 7-10 kb

22 Fine-mapping of SV breakpoints. Di-tags around a deletion: normal (1, 3), too distant (2), hanging ends (4, 5). Fetch all di-tags from the region (1F 1R 2F 2R 3F 3R 4F 4R 5F 5R), perform local de novo assembly, align the assembled contig to the reference, and determine the exact breakpoints. >10,000 fine-mapped
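The last step, reading breakpoints off the alignment of an assembled contig against the reference, can be shown with a deliberately naive example. It assumes the contig and the reference segment start and end at the same flanking positions and differ only by one clean deletion; a real pipeline would use a proper aligner. The function name is hypothetical.

```python
def deletion_breakpoints(reference, contig):
    """Return (start, end) of the reference interval missing from the contig.

    Coordinates are 0-based and half-open, relative to `reference`.
    The shared prefix is extended greedily, so in the presence of
    microhomology the breakpoint is reported at its rightmost position.
    """
    limit = min(len(reference), len(contig))

    prefix = 0
    while prefix < limit and reference[prefix] == contig[prefix]:
        prefix += 1

    suffix = 0
    while (suffix < limit - prefix
           and reference[-1 - suffix] == contig[-1 - suffix]):
        suffix += 1

    return prefix, len(reference) - suffix
```

If the shared prefix and suffix together do not cover the whole contig, the event is not a clean deletion (for example, it carries inserted bases at the junction) and needs a real alignment.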

23 Complex structural variations, chromothripsis (figure legend: = DNA double strand breaks)

24 Complex structural variations, chromothripsis (2)

25 Closing gaps in genome assembly Step 1. Array design Step 2. Array enrichment

26 Quality of assembly is critical

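27 Complementarity of DOC and di-tags. DOC and paired-tag analyses are complementary for detection of deletions and duplications. Unique sequence inside a breakpoint, unique sequence in SV breakpoints: DOC, paired-tag. Unique sequence inside a breakpoint, repetitive sequence in SV breakpoints: DOC. Repetitive sequence inside a breakpoint, unique sequence in SV breakpoints: paired-tag. Repetitive sequence inside a breakpoint, repetitive sequence in SV breakpoints: ?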

28 Software for MP/PE analysis: PEMer, Korbel et al, 2009 (translocations); SegSeq, Chiang et al, 2009 (copy-number); VariationHunter, Hormozdiari et al, 2009 (tandem duplications); MoDIL, Lee et al, 2009; Pindel, Ye et al, 2009 (split mapping); BreakDancer, Chen et al, 2009 (hanging reads); ABI Tools, McKernan et al, 2009 (copy-number, invers.); SVDetect, Zeitouni et al, 2010 (anomalous di-tag clustering); HYDRA, Quinlan et al, 2010 (clustering discordant di-tags); CNVer, Medvedev et al, 2010 (CNV + MP signature)

29 Acknowledgements Cuppen Hubrecht Institute, Utrecht

30 GvNL data: A4A sample, 3 Solexa lanes. Mapping with BWA k 2 l 25 n CPU days; 123SV Step 1: 4 CPU hours; 123SV Step 2: 1 CPU day. Inversions, insertions, deletions, duplications (evertions), translocations. No coverage: 45 CPU minutes