Building a platinum human genome assembly from single haplotype human genomes. Karyn Meltz Steinberg PacBio UGM December,


1 Building a platinum human genome assembly from single haplotype human genomes Karyn Meltz Steinberg PacBio UGM December,

2 Single haplotype from hydatidiform mole: enucleated egg (no maternal DNA); paternal DNA doubles; tumor-like growth; ONLY paternal DNA present

3 Last year Steinberg et al, 2014

4 This year [Figure: Contig Number vs. Contig N for assemblies CHM13 Draft, CHM1 PB_2, CHM1 PB_1, CHM1_1.1, HuRef, ALLPATHS, YH_2.0]

5 This year [Figure: Contig Number vs. Contig N, log scale, for assemblies CHM13 Draft, CHM1 PB_2, CHM1 PB_1, CHM1_1.1, HuRef, ALLPATHS, YH_2.0]

6 We combine PacBio with other technologies to construct the assembly

7 How do we define platinum and gold standards?
Columns: GRCh38, Platinum (CHM1), Gold (NA19240)
% Reference genome covered: [values missing]
% Assigned chromosomes: [values missing]
% gene models covered (>95% id, >90% length): [values missing]
Contig N: [missing] Mb, 26.9 Mb, 6.0 Mb
Number of gaps: 875, 3,640, 3,568
Total assembled size: [missing] Gb, [missing] Gb, [missing] Gb
% haplotype blocks (>1kb) resolved: NA, >95, >80
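The gene-model criterion on this slide (>95% identity, >90% length) can be expressed as a simple filter. The following is a minimal sketch, not the speaker's method: the function names, the tuple layout `(identity_pct, aligned_len, model_len)`, and the toy values are all hypothetical, and the strict inequalities are an assumption read directly from the ">" signs on the slide.

```python
def gene_model_covered(identity_pct, aligned_len, model_len,
                       min_identity=95.0, min_length_frac=0.90):
    """True if an alignment passes both thresholds (strictly greater than)."""
    if model_len <= 0:
        raise ValueError("model_len must be positive")
    return (identity_pct > min_identity
            and aligned_len / model_len > min_length_frac)


def percent_gene_models_covered(alignments):
    """Percentage of gene models whose best alignment passes the thresholds.

    alignments: list of (identity_pct, aligned_len, model_len) tuples,
    one per gene model (hypothetical layout).
    """
    if not alignments:
        raise ValueError("alignments must not be empty")
    covered = sum(gene_model_covered(*a) for a in alignments)
    return 100.0 * covered / len(alignments)


# Toy data: only the first and last models pass both thresholds.
toy_alignments = [
    (99.0, 950, 1000),    # passes
    (96.0, 850, 1000),    # fails the length threshold
    (94.0, 1000, 1000),   # fails the identity threshold
    (99.5, 2000, 2000),   # passes
]
print(percent_gene_models_covered(toy_alignments))
```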

8 CHM13 Draft Assembly (GCA_ ): 60X PacBio (P5 and P6 chemistry); average read length ~11 kb; Daligner/Falcon v 0.2; total sequence length 2,851,367,788; number of contigs 2,873; contig N50 12,981,785; contig L50 68
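The contig N50 and L50 reported on this slide can be computed from a list of contig lengths. A minimal sketch, assuming the usual definitions: sort contigs from longest to shortest and accumulate lengths until at least half of the total assembly size is reached; N50 is the length of the contig at which that happens, and L50 is the number of contigs needed to get there. The function name and the toy lengths are illustrative, not taken from the talk.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for an iterable of contig lengths in bp."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            # `length` is the shortest contig in the set covering half the assembly.
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Toy assembly: total = 300, so half = 150; 80 + 70 reaches it.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50_l50(toy_contigs))  # (70, 2)
```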

9 Short read sequence analysis: 100X Illumina sequence; align with BWA-MEM to ordered and oriented assembly; variant calling via SpeedSeq (Chiang et al, 2015). SNVs, indels: FreeBayes. SVs: LUMPY, SVTyper. CNV: CNVnator
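As an illustration of the alignment step only, the sketch below builds a `bwa mem` command line for paired-end reads. It is an assumption-laden example, not the pipeline used in the talk: the file names are hypothetical placeholders, the thread count is arbitrary, and the SpeedSeq variant-calling steps are deliberately left out. The command is constructed but not executed; running it would require BWA to be installed and the reference to have been indexed with `bwa index`.

```python
def bwa_mem_command(reference, reads1, reads2, threads=4):
    """Build an argument list for paired-end alignment with `bwa mem`.

    All paths are supplied by the caller; nothing is run here.
    """
    return ["bwa", "mem", "-t", str(threads), reference, reads1, reads2]


# Hypothetical file names, for illustration only.
cmd = bwa_mem_command("chm13_assembly.fa", "reads_R1.fq.gz", "reads_R2.fq.gz",
                      threads=8)
print(" ".join(cmd))

# To actually run it (SAM output goes to stdout), one could use for example:
#   import subprocess
#   with open("aligned.sam", "w") as out:
#       subprocess.run(cmd, stdout=out, check=True)
```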

10 CHM13 Illumina data aligned to CHM13 assembly: 202,016 SNVs/indels on unplaced scaffolds [Table: SV types (deletions, inversions, duplications, total) by size class (>10kb, 5-10kb, 1-5kb, <1kb); counts missing]
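The table on this slide groups structural variant calls by type and size class. A minimal sketch of that tabulation follows, assuming calls are available as `(sv_type, length_bp)` pairs. The tuple layout, the function names, the toy calls, and the exact handling of boundary values (1,000 bp goes to "1-5kb", 5,000 bp and 10,000 bp go to "5-10kb") are assumptions; the slide does not say which bins are inclusive.

```python
from collections import Counter


def size_class(length_bp):
    """Assign an SV length (bp) to one of the slide's four size classes."""
    if length_bp < 1_000:
        return "<1kb"
    if length_bp < 5_000:
        return "1-5kb"
    if length_bp <= 10_000:
        return "5-10kb"
    return ">10kb"


def tabulate_svs(calls):
    """Count calls per (sv_type, size_class).

    calls: iterable of (sv_type, length_bp); negative lengths, as some
    callers report for deletions, are treated by absolute value.
    """
    table = Counter()
    for sv_type, length_bp in calls:
        table[(sv_type, size_class(abs(length_bp)))] += 1
    return table


# Toy calls, not data from the talk.
toy_calls = [("DEL", 500), ("DEL", -2500), ("INV", 10_000),
             ("DUP", 25_000), ("DEL", 999)]
sv_table = tabulate_svs(toy_calls)
for key in sorted(sv_table):
    print(key, sv_table[key])
```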

11 BioNano can be used to size gaps and identify structural variants [Figure labels: Collapse, Expansion in Assembly, PacBio Assembly, Gap in Sequence, BioNano Map] BioNano alignment to CHM13, SV types: deletions 41; inversions 10; insertions 15; total 66

12 BioNano reveals collapse in PacBio assembly [Figure tracks: PacBio Assembly, BioNano Map]

13 Illumina data aligned to PacBio assembly also shows collapse

14 BioNano reveals collapse in PacBio assembly due to highly homologous segmental duplications [Figure labels: BioNano Map; SD = 96%; PacBio Assembly; CHR W LBHZ; CHR N gap; CHR W LBHZ]

15 This region is rich in medically relevant genes. This locus has an assigned GRC issue due to unresolved variation and may be a candidate locus for alternative representation in the reference.

16 CHM13 Hybrid Scaffold [Figure tracks: Hybrid Scaffold, PacBio Contigs, BioNano Contigs]

17 CHM13 Hybrid Scaffolds Improve Contiguity
Columns: BioNano Map, PacBio Assembly, Hybrid Scaffold
# of Contigs: [missing]*, 254
Min Contig Length: 0.08 Mb, [missing] Mb
Median Contig Length: 0.61 Mb, 0.06 Mb, 4.35 Mb
Mean Contig Length: 0.78 Mb, 1.78 Mb, 9.68 Mb
Contig N: [missing] Mb, [missing] Mb, [missing] Mb
Max Contig Length: 5.27 Mb, [missing] Mb, [missing] Mb
Total Contig Length: 2812 Mb, 2824 Mb, [missing] Mb
*Number of contigs used in hybrid scaffolding

18 Reference-based analyses: 100X Illumina sequence from CHM13; align to GRCh37 and GRCh38 with BWA-MEM; variant calling via SpeedSeq (Chiang et al, 2015). SNVs, indels: FreeBayes. SVs: LUMPY, SVTyper. CNV: CNVnator

19 Similar number of variants per chromosome [Figure panels: GRCh37.p15, GRCh38.p2]

20 Similar annotation of variants [Figure panels: GRCh37.p15, GRCh38.p2]


22 [Figure panels: GRCh37.p15, GRCh38.p2]

23 SRGAP2 region resolved in GRCh38 [Figure: patch alignment to chromosome 1; labeled loci 1q32, 1q21, 1p21]

24 [Figure panels: GRCh37.p15, GRCh38.p2]

25 PRIM2 region resolved in GRCh38

26 tl;dpa*: The reference genome assembly is constantly being improved. New PacBio-based assemblies are orders of magnitude more contiguous than previous WGS assemblies. Integration of other data (e.g. BioNano, Dovetail) can improve contiguity even further and can be used to identify structurally variant haplotypes that can be added to the reference as alternative loci. Platinum genome sequences integrated into GRCh38 have greatly improved read mapping and variant calling. (*too long; didn't pay attention)

27 Acknowledgements
The McDonnell Genome Institute at Washington University in St. Louis: Rick Wilson, Bob Fulton, Wes Warren, Tina Graves-Lindsay, Vince Magrini, Sean McGrath, Derek Albracht, Milinn Kremitzki, Susan Rock, Debbie Scheer, Aye Wollam, The Finishing and Bioinformatics Teams at The Genome Institute
Pacific Biosciences: Jason Chin, Nick Sisneros
University of Pittsburgh School of Medicine (CHM cell lines): Urvashi Surti
University of Washington: Evan Eichler, John Huddleston, Archana Raja
BioNano Genomics: Palak Sheth
NCBI: Valerie Schneider
Personalis: Deanna Church
