Nature Biotechnology: doi: /nbt Supplementary Figure 1. Number and length distributions of the inferred fosmids.

Supplementary Figure 1 Number and length distributions of the inferred fosmids. Fosmids were inferred by mapping each pool's sequence reads to hg19. We retained only those reads that mapped to within a 3~50 kb region. (a) Fosmid number in each pool. On average, there were ~32 fosmids per pool. (b) Fosmid size. The average length was 36.8 kb.

Supplementary Figure 2 Fosmid physical coverage distribution. The blue curve denotes the theoretical coverage distribution at an average coverage of 8x, and the red curve denotes the actual coverage. The average fosmid coverage was 8x, with a median of 7x. About 7% of YHref was not covered by fosmids, which may be due to a bias in the fosmid library construction and/or sequencing.
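
The gap between the theoretical and actual curves can be illustrated with a small calculation. The caption does not say which model underlies the theoretical curve; the sketch below assumes a Poisson distribution, which is a common choice for random clone coverage, and the function name is illustrative only.

```python
import math

def poisson_pmf(depth: int, mean: float) -> float:
    """Probability of observing exactly `depth` clones at a site (Poisson assumption)."""
    return math.exp(-mean) * mean ** depth / math.factorial(depth)

MEAN_COVERAGE = 8.0  # average fosmid coverage stated in the caption

# Expected fraction of the genome with zero fosmid coverage under the assumed model.
uncovered = poisson_pmf(0, MEAN_COVERAGE)
print(f"Expected uncovered fraction: {uncovered:.4%}")

# A few points of the assumed theoretical curve.
for depth in (4, 8, 12, 16):
    print(depth, f"{poisson_pmf(depth, MEAN_COVERAGE):.4f}")
```

Under this assumption, only a few hundredths of a percent of the genome would be expected to lack coverage at 8x, far less than the ~7% observed, which is consistent with the caption attributing the shortfall to library construction and/or sequencing bias.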

Supplementary Figure 3 Completeness of assembled sequence in each fosmid pool. The horizontal axis represents the percentage of the fosmid sequence that was assembled in each pool. The vertical axis represents the proportion of fosmid pools at that given percentage. In total, 88.5% of the assembled pools contained at least 80% of the fosmid sequence, and 53.2% of the assembled pools contained at least 95% of the fosmid sequence.

Supplementary Figure 4 Contiguity of assembled sequence for individual fosmids. The horizontal axis represents the ratio of the longest assembled sequence vs the inferred length of each defined fosmid. The vertical axis represents the proportion of fosmids at the given ratio. 54.7% of fosmids had a longest assembled sequence equal to, or longer than, half of the fosmid length. About 18% of the fosmids were completely assembled.

Supplementary Figure 5 Construction of the haplotype-resolved sequence. The top (orange) bar represents the non-phased YHref sequence and the bottom (multi-color) bar represents the haplotype-resolved output. The middle (blue) bars represent the fosmid assembled haploid (FAH) sequences belonging to the same haplotype.

Supplementary Figure 6 Theoretical N50 length of haplotype phasing and long homozygous regions. a. Long homozygous regions (>=20 kb) for different populations in the 1000 Genomes Project. Asians have more long homozygous regions than other populations. This might be why YH had a shorter haplotype N50 than other individuals sequenced at a comparable fosmid depth. b. The theoretical N50 length distribution of haplotype phasing using the method of the current study, in 4 different individuals. Heterozygous marker numbers are shown at the top-left. The haplotype N50 of YH is expected to be 510 kb with a fosmid coverage of 4x per haplotype (or 8x for a 3 Gb genome).
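
For readers unfamiliar with the N50 statistic used above, a minimal sketch of the standard definition follows: the block length at which the cumulative length of blocks, sorted from longest to shortest, first reaches half of the total. The function name and the example block lengths are hypothetical and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of block lengths (standard definition)."""
    if not lengths:
        raise ValueError("n50 requires at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length

# Hypothetical phased-block lengths in kb, for illustration only.
example_blocks_kb = [2, 3, 4, 5, 6]
print(n50(example_blocks_kb))
```

Long homozygous regions contain no heterozygous markers to link adjacent blocks, so they break phased blocks apart and lower this statistic.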

Supplementary Figure 7 HDG coverage on hg19 and RefSeq genes. Our HDG sequence was aligned to the hg19 genome using Lastz. Coverage of the chromosomes and gene regions was calculated. "Both" means covered by the two assembled haplotypes (blue); "Single" means covered by just one assembled haplotype (red). a. Coverage information for each chromosome. b. Proportion of RefSeq genes at a given coverage.

Supplementary Figure 8 Length distributions of insertions and deletions. a. Length distribution of short indels (<10 bp). Peaks at multiples of 3 bp in the exon distribution are expected because such indels do not disturb the reading frame. b. Length distribution of long indels (100 bp~1 kb). The peak at ~300 bp is due to an enrichment for Alu element insertions and deletions. Note that there is no bias between insertions and deletions, which is an improvement over previous studies. c. Distribution of long indels (100 bp~1 kb) in unique versus repeat regions. As expected, there are more indels in the repeat regions and the peak at ~300 bp is more pronounced. d. Length distribution of homozygous and heterozygous long indels (100 bp~1 kb).
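
The reasoning behind the peaks at multiples of 3 bp in panel a can be made concrete with a short sketch. Codons are 3 bp long, so a coding indel whose length is divisible by 3 leaves the downstream reading frame intact. The function names and the example lengths are hypothetical.

```python
from collections import Counter

def frame_effect(indel_length: int) -> str:
    """Classify a coding indel by its effect on the reading frame."""
    if indel_length <= 0:
        raise ValueError("indel length must be positive")
    return "in-frame" if indel_length % 3 == 0 else "frameshift"

def summarize(lengths):
    """Count in-frame versus frameshift indels in a list of lengths (bp)."""
    return Counter(frame_effect(n) for n in lengths)

# Hypothetical exonic indel lengths in bp, for illustration only.
example_lengths = [1, 2, 3, 3, 6, 4, 9]
print(summarize(example_lengths))
```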

Supplementary Figure 9 SNP detection and intersection from different methods/platforms. A total of ~4.0 M SNPs were detected by three different methods/platforms. The majority (68.2%) of these were consistent between all three datasets. However, there were still tens of thousands of method/platform-specific calls.

Supplementary Figure 10 Indel detection and intersection from different platforms. We show the number of small indels detected by each method/platform and their intersection, at a flank size of 50 bp. For the ~1 M indels detected, there was only 27.6% concordance.

Supplementary Figure 11 Example of a heterozygous deletion located inside a gene. This heterozygous deletion was detected by the ASV method but was difficult to find by either WGS resequencing method. The yellow block in the reference is the region that was missing from hap2. Below are the WGS reads aligned to this region. This 151 bp deletion covered the 5' UTR and a part of exon 1 of the gene PSMD1.

Supplementary Figure 12 Example of a heterozygous insertion located inside a gene. This heterozygous insertion was detected by the ASV method but was difficult to find by either WGS resequencing method. The yellow block in hap1 is the region that was missing from the reference. Below are the WGS reads aligned to this region. Near the breakpoint there were very few reads, perhaps because the inserted sequence interfered with the alignment. This 54 bp insertion covered exon 3 of the gene LATS2.

Supplementary Figure 13 Variation rate for YH vs hg19 and heterozygosity between the two haplotypes of YH. The curves at the top and the right summarize the distribution of heterozygosity rates for the two haplotypes of YH and the variation between YH and hg19, respectively. The black line indicates the 99% cutoff for each distribution.

Supplementary Figure 14 The classification of novel gene sequences. a. Classification of different types of novel and gap covered sequences: i) novel insertion; ii-iv) novel haplotypes; v-vii) gap covered sequences; viii) orphan scaffolds. b. Distribution of novel sequences based on their length and number, in 100 bp bins. Novel sequences of length >1000 bp accounted for 93% of the total length. The longest was 123 kb. c. Distribution of breakpoints for novel sequences. Most of the novel sequences were in non-coding (intron, repeat and intergenic) regions. Only 0.8% were in CDS regions. These distributions are subdivided by the length of the sequence, represented by the color bars. d. Repeat content based on RepeatMasker.

Supplementary Figure 15 Examples of cis- and trans-acting genes. a. Cis-acting gene DSPP on 4q22.1 encoding dentin sialophosphoprotein. Mutations in DSPP are associated with Dentinogenesis imperfecta, Shields type II, and deafness. b. Trans-acting gene CA9 on 9p13.3. Diseases associated with mutations in CA9 include horseshoe kidney and renal cell carcinoma. GO annotations include carbonate dehydratase activity.

Supplementary Figure 16 Allele specific methylation and expression. Venn diagram showing the relationship between allele specific methylation (ASM) and allele specific expression (ASE). The numbers refer to the gene count. The red/brown circle inside the larger ASM circle represents genes where ASM was detected in the promoter region.

Supplementary Figure 17 Construction of the fosmid libraries. Approximately 30 fosmid clones were cultured together to form a single fosmid pool. Then, 3 µg of DNA from each pool was digested, and fragments with insert sizes ranging from 180 to 800 bp were selected. Adapters containing the 11 bp barcode were ligated to these selected fragments to form a single pooled-fosmid library. Barcoded fragments from 60~320 single pooled-fosmid libraries were pooled again (evenly) to create a Stage I barcode library. DNA fragments of sizes between 180 bp and 650 bp (lengths exclude barcode) from each Stage I barcode library were used to construct two independent libraries (one with small insert sizes and one with intermediate insert sizes). Each library was then PCR amplified with index primers, each of which contained an 8 bp barcode, to form a Stage II barcode library.

Supplementary Figure 18 Indel positional concordance as a function of flank size for the different methods of detection. To determine the best flank size for use in indel detection, we plotted the concordance between the ASV and resequencing-based analyses. The results stabilize above 50 bp.
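
One way to picture concordance as a function of flank size is sketched below. It is an assumption about how positional matching might be defined, not the study's stated procedure: a call from one set counts as concordant if the other set has a call on the same chromosome within `flank` bp. The function name and the example coordinates are hypothetical.

```python
from bisect import bisect_left

def concordance(calls_a, calls_b, flank):
    """Fraction of calls_a with a call in calls_b on the same chromosome within `flank` bp.

    Each call is a (chromosome, position) tuple.
    """
    positions_by_chrom = {}
    for chrom, pos in calls_b:
        positions_by_chrom.setdefault(chrom, []).append(pos)
    for positions in positions_by_chrom.values():
        positions.sort()

    matched = 0
    for chrom, pos in calls_a:
        positions = positions_by_chrom.get(chrom, [])
        # First position in calls_b that is >= pos - flank.
        i = bisect_left(positions, pos - flank)
        if i < len(positions) and positions[i] <= pos + flank:
            matched += 1
    return matched / len(calls_a) if calls_a else 0.0

# Hypothetical indel positions, for illustration only.
asv_calls = [("chr1", 100), ("chr1", 500), ("chr2", 300)]
reseq_calls = [("chr1", 110), ("chr1", 560), ("chr2", 1000)]

for flank in (5, 10, 60):
    print(flank, round(concordance(asv_calls, reseq_calls, flank), 3))
```

Sweeping `flank` over a range of values and plotting the result would give a curve of the kind described in the caption, where concordance rises with flank size and then levels off.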

Supplementary Figure 19 Length distributions for method-specific short indels. Short indels (1-50 bp) detected by only one method/platform were selected and plotted according to their length. The top-right inset shows indels with lengths between 10 and 50 bp.

Supplementary Figure 20 Example of an ASV-specific indel supported by fosmid aligned reads. This was a 3 bp heterozygous deletion in a region covered by fosmids from eight independent pools, two of which supported the deletion.

SUPPLEMENTARY TABLES

Supplementary Table 1. Summary of sequenced genome data

Sequencing type    Insert size (bp)             Read length (bp)    Number of reads (M)    Raw data (Gbp)
Fosmid library     Small (180~300 bp)           93                  9,416                  875
Fosmid library     Intermediate (450~650 bp)    93                  9,004                  837
WGS-seq, IL        200                          100                 1,196                  119
WGS-seq, IL        500                          100                 287                    29
WGS-seq, IL        ~2K                          90                  177                    16
WGS-seq, IL        ~5K                          90                  157                    14
WGS-seq, IL        ~10K                         90                  154                    14
WGS-seq, IL        ~20K                         90                  128                    12
WGS-seq, CG        500                          26                  12,731                 331 (a)
Total              -                            -                   33,250                 2,246

(a) Gross mapping yield.

Supplementary Table 2. Genes in hypervariable regions.xlsx
Supplementary Table 3. Annotation of predicted novel genes.xlsx
Supplementary Table 4. Cis and Trans genes annotation.xlsx
Supplementary Table 5. ASE and ASM gene analysis.xlsx

Supplementary Table 6. Parameters and filter criteria used in Lastz alignment.

Variation types: SNP; Short indel; Inversion/Translocation; Long indel

Alignment parameters (two sets, in the order given in the table):
  --strand=both --hspthresh=9000 --chain --ambiguous=iupac --gapped --identity=90 --step=8 --word=31 --seed=12of19
  --strand=both --hspthresh=10000 --chain --ambiguous=iupac --gapped --ydrop=50000 --gap=2000,1 --identity=90 --step=19 --word=31 --seed=12of19

Filter criteria:
  SNP
    1. In the 50 bp flanking region: disallow consecutive N.
    2. The distance between any two SNPs must be over 5 bp.
  Short indel
    1. In the 50 bp flanking region: disallow consecutive N; disallow any other indel; mismatches <3 bp.
    2. It should not be located at the boundary of an alignment block.
  Inversion/Translocation
    The neighboring alignment blocks must be in a good linear relation.
  Long indel
    1. In the 50 bp flanking region: disallow consecutive N; disallow any other indel over 10 bp.
    2. It should not be located at the boundary of an alignment block.

Supplementary Table 7. Summary of re-sequencing based variations.

                     CG           HS
SNP
  All                3,411,305    3,365,182
  Ti/Tv              2.12         2.05
  hete/homo          1.37         1.65
  dbSNP137           3,368,094    3,322,168
  novel              43,211       43,014
  coding             21,054       19,763
  Nonsynonymous      9,583        9,184
Indel
  All                510,944      633,679
  hete/homo          2.24         1.20
  dbSNP137           331,912      459,733
  novel              179,032      173,946
  coding             451          321
  frameshift         154          120
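
The counts in Supplementary Table 7 can be cross-checked with simple arithmetic. The sketch below uses only values from the table; the dictionary layout and function name are illustrative. It confirms that the dbSNP137 and novel counts sum to the "All" row for each platform, and computes the novel fraction.

```python
# Values copied from Supplementary Table 7 (CG and HS columns).
table7 = {
    ("SNP", "CG"): {"all": 3_411_305, "dbsnp137": 3_368_094, "novel": 43_211},
    ("SNP", "HS"): {"all": 3_365_182, "dbsnp137": 3_322_168, "novel": 43_014},
    ("Indel", "CG"): {"all": 510_944, "dbsnp137": 331_912, "novel": 179_032},
    ("Indel", "HS"): {"all": 633_679, "dbsnp137": 459_733, "novel": 173_946},
}

def novel_fraction(row):
    """Share of calls that are absent from dbSNP137."""
    return row["novel"] / row["all"]

for (variant, platform), row in table7.items():
    # Known plus novel calls should account for every call in the "All" row.
    consistent = row["dbsnp137"] + row["novel"] == row["all"]
    print(variant, platform, consistent, f"{novel_fraction(row):.2%}")
```

The novel fraction comes out at roughly 1.3% for SNPs on both platforms but between about 27% and 35% for indels.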

Supplementary Table 8. Summary of detected novel sequence

Summary                 Hap1 (Logic)             Hap2 (Logic)             XY
                        Number    Length (bp)    Number    Length (bp)    Number    Length (bp)
All (non-redundant)     1,367     3,934,838      1,335     3,344,145      54        209,484
Novel insertion         613       96,199         616       96,614         41        5,702
Novel haplotype         706       2,913,736      672       2,287,638      11        91,563
Novel (nomadic)         50        925,156        48        960,017        2         112,219
Cover reference 'N'     420       3,149,398      305       3,016,239      213       913,082

* 1,183,474 bp of novel sequence is shared between the two haplotypes.