Next-Generation Genome Sequencing

1 Next-Generation Genome Sequencing Jarkko Salojärvi, D.Sc. (Tech) Department of Biosciences, Division of Plant Biology Department of Veterinary Biosciences, Veterinary Microbiology and Epidemiology University of Helsinki

2 Topics Today Quick recap of the relevant parts of the SOLiD and Solexa/Illumina technologies. Sequencing in practice. Properties of the data. What the raw data looks like. Re-sequencing. Sequencing de novo. Some examples throughout the presentation...

3 Where it all began: Sanger sequencing Developed in Read length up to bases. Current platforms allow 96 concurrent reads. Still needed for closing gaps, sequencing long repeated fragments Frederick Sanger twice Nobel laureate

4 High-throughput sequencing technologies - summary

5 Throughput of different technologies (most recent versions)
Platform | SOLiD 4 | Solexa HiSeq | 454 Titanium
Runtime | 5-10 days | 8 days | 10 h
Read length (bp) | | |
Raw sequence (Gb/run) | | |
Accuracy | >99.94% | >98.5% | 99% (at 400 bp)

6 Sequencing in Practice

7 What do you get from a sequencing service? Huge text files, usually reporting two things: 1. Base calls for each read. 2. Read qualities. In SOLiD these are in two separate files, .csfasta and .qv.qual. In Solexa, both are reported in the same .fastq file. At Biocenter Viikki: SOLiD 4 and 454 Titanium. At FIMM Meilahti: Solexa. Commercial services are available, price thousand(s)/run.

8 Assembly task Given: A text file with lots of short reads, nucleotide sequences. Task: Align these, either with respect to each other (de novo) or a reference genome (re-sequencing). Essential: Coverage: Number of overlapping reads. Depth: Number of reads on a single nucleotide. Composition of reads (fragments/mate-pair) Is there a reference sequence of the organism?
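The depth and coverage notions above can be illustrated with a minimal sketch. It assumes the reads have already been aligned and are given as hypothetical (start, length) pairs on a toy reference; the function names and the input format are illustrative and do not come from the slides.

```python
def per_base_depth(ref_len, alignments):
    """Count how many reads cover each reference position.

    alignments: iterable of (start, length) pairs, 0-based start
    (an assumed toy format, not a real aligner output).
    """
    depth = [0] * ref_len
    for start, length in alignments:
        for pos in range(max(start, 0), min(start + length, ref_len)):
            depth[pos] += 1
    return depth


def mean_coverage(ref_len, alignments):
    """Average depth: total aligned bases divided by reference length."""
    return sum(per_base_depth(ref_len, alignments)) / ref_len


# Toy example: three 4-nt reads on a 10-nt reference.
toy_alignments = [(0, 4), (2, 4), (3, 4)]
toy_depth = per_base_depth(10, toy_alignments)
```

Positions with depth 0 in `toy_depth` are the gaps that either a reference genome or additional reads would have to bridge.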

9 Fragments vs. mate-pairs 1. Individual fragments. 2. Paired reads. Mate-pair: genomic DNA is fragmented, and size-selected inserts are circularized and linked by means of an internal adaptor. Paired end: fragmentation of genomic DNA into short segments, followed by sequencing of both ends of the segment (but not the part in between). The end result is the same: you know the reads from both ends, plus the average distance between the reads. A large fraction of short reads are difficult to map uniquely to the genome, and the second read of a pair can be used to find the correct location.

10 Mate-pair reads [Korbel et al.07] Library preparation from human genomic DNA: i) shearing and size selection; ii) protection and adapter ligation; iii) circularization; iv) random cleavage; v) linker(+) read isolation; vi) sequencing of >30 million paired ends with 454 technology; vii) computational analysis and mapping of structural variants (SVs); viii) histogram of count vs. span of paired ends (i.e. distance between mapped ends [bp]), with cutoff I and cutoff D marked. Result: a snapshot of the full genome every ~3 kbp. Can be used to align contigs from a standard sequencing run.

11 Example: paired end mapping to reveal structural variation (SV) in human genome [Korbel et al.07] Paired ends from the individual (sample) sequence are placed at their best positions in the human reference genome, and the span of the paired ends in the reference is compared with the cutoffs. No SV: normally mapped. Deletion (region deleted from sample genome): end distance > cutoff D. Insertion, simple (region inserted in sample genome): end distance < cutoff I. Inversion breakpoint (region inverted in sample genome): altered end orientation, i.e. an end maps in inverted orientation relative to the original (sample) locus. Insertion, mated and insertion, unmated: insertion of sequence from a distant locus.
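The cutoff logic of paired-end mapping can be sketched as a small classifier. This is a simplified illustration, not the method of Korbel et al.: the function name, the boolean orientation flag and the cutoff values (2000 and 4000 bp around a ~3 kbp insert) are assumptions, and insertions from distant loci are not handled.

```python
def classify_pair(span, orientation_ok, cutoff_i, cutoff_d):
    """Classify one mapped read pair by its span on the reference.

    span: distance between the mapped ends in bp.
    orientation_ok: False if one end maps in inverted orientation.
    cutoff_i, cutoff_d: hypothetical lower and upper span cutoffs.
    """
    if not orientation_ok:
        return "inversion breakpoint"
    if span > cutoff_d:
        return "deletion"      # sample lacks sequence present in reference
    if span < cutoff_i:
        return "insertion"     # sample has extra sequence between the ends
    return "no SV"


# Assumed cutoffs for a library with ~3 kbp inserts.
CUTOFF_I, CUTOFF_D = 2000, 4000
calls = [classify_pair(s, ok, CUTOFF_I, CUTOFF_D)
         for s, ok in [(3000, True), (6000, True), (500, True), (3000, False)]]
```

In practice several independent pairs supporting the same event would be required before calling a variant; a single aberrant pair is weak evidence.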

12 De novo vs. re-sequencing of genome In de novo assembly, reads are assembled into contigs: contiguous sequences of DNA created by assembling overlapping sequenced fragments of a chromosome. Reference assembly = re-sequencing. If the genome/template is known: reference assembly. If the genome/template is unknown: de novo assembly. Reference assembly: where there is a gap in sequence coverage, the reference genome tells that the sequences are from the same contig or genomic region; tolerates short read length. De novo assembly: the same original contig (for example an mRNA) may be split into multiple short contigs (contig 1, contig 2); longer reads provide more overlap for connecting individual reads.

13 SOLiD raw data

14 SOLiD probes Probes designed for reading two nucleotides at a time. Four different colors. Resulting sequence in colorspace...but also CG,GC and TA are red?!

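15 SOLiD sequence decoding Key to decoding: the known last base of the adapter oligo sequence. Known: position 0 = A. Transition 0-1: candidates CA, AC, GT, TG. Transition 1-2: candidates CA, AC, GT, TG. Transition 2-3: candidates AA, CC, GG, TT. Decoded bases: A C A. Petri Auvinen, DNA Sequencing and Genomi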

16 SOLiD raw data files: raw reads in .csfasta SOLiD generally gives out two files: the reads in color space (.csfasta) and the read qualities (.QV.qual). <filename>.csfasta overall format: the last base of the adapter oligo sequence, followed by the color space calls presented as numbers. Encoding table: 1st nucleotide (A, C, G, T) against 2nd nucleotide (A, C, G, T). Example: >1_88_1830_R3 G >1_89_1562_R3 G (each record is a >TAG_ID header line followed by the color-space line).

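17 SOLiD raw data files - qualities in QV.qual Quality values are in <filename>.qv.qual, as phred-like scores for each read. Score: q = -10*log10(p), where p is the probability that the call is wrong. Example: >97_2040_1850_F >97_2040_1898_F (each record is a >TAG_ID header line followed by the quality values). [Table: p vs. q]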
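The phred-like score q = -10*log10(p) is easy to convert in both directions. The following is a minimal sketch; the function names are illustrative only.

```python
import math


def phred_from_error_prob(p):
    """Phred-like quality q for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def error_prob_from_phred(q):
    """Error probability p corresponding to a phred-like quality q."""
    return 10 ** (-q / 10)


# A 1% chance of a wrong call corresponds to q = 20,
# and q = 30 corresponds to a 0.1% chance.
q_for_one_percent = phred_from_error_prob(0.01)
p_for_q30 = error_prob_from_phred(30)
```

Every 10 quality units reduce the error probability by a factor of ten, which makes thresholds such as q20 easy to interpret.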

18 Benefit: Complementation in color space One benefit of color space is that it is self-complementing: a sequence and its complement give the same colors. [Encoding table: 1st nucleotide (A, C, G, T) against 2nd nucleotide (A, C, G, T).] Sequence, base space: G T A G C T C G T C G T G C A G. Complemented, base space: C A T C G A G C A G C A C G T C. The color space rows of the two are identical.
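The self-complementing property can be checked directly. The sketch assumes the standard SOLiD encoding (color = XOR of the 2-bit codes A=0, C=1, G=2, T=3); complementing a base flips both bits, so the XOR of two neighbouring bases is unchanged. The helper names are illustrative.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def encode_colors(seq):
    """Encode a nucleotide sequence as one color digit per adjacent pair."""
    return "".join(str(BASE_CODE[a] ^ BASE_CODE[b])
                   for a, b in zip(seq, seq[1:]))


def complement(seq):
    """Base-wise complement (not reversed)."""
    return "".join(COMPLEMENT[b] for b in seq)


sequence = "GTAGCTCGTCGTGCAG"
complemented = complement(sequence)

# Both strands give the same color string.
same_colors = encode_colors(sequence) == encode_colors(complemented)
```

For the reverse complement the color string is simply reversed, so reads from either strand can be compared in color space without decoding.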

19 Downside One incorrect color call can corrupt the rest of the read in decoding. Correct read: transition 0-1 from (CA, AC, GT, TG), transition 1-2 from (CA, AC, GT, TG), transition 2-3 from (AA, CC, GG, TT), decoded as A C A A. With an error at transition 1-2, now read from (TA, GC, CG, AT): decoded as A C G G. In colorspace there is still only one error -> Alignment MUST be done in colorspace!
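The error propagation on this slide can be reproduced with a short sketch, again assuming the standard XOR-based SOLiD encoding (A=0, C=1, G=2, T=3). The color strings "110" and "130" correspond to the correct and the erroneous read above; the helper names are illustrative.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def decode(first_base, colors):
    """Decode color digits into bases, starting from a known base."""
    out, prev = [], BASE_CODE[first_base]
    for c in colors:
        prev ^= int(c)
        out.append("ACGT"[prev])
    return "".join(out)


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


true_colors, bad_colors = "110", "130"   # one wrong color call
true_bases = decode("A", true_colors)
bad_bases = decode("A", bad_colors)

color_errors = mismatches(true_colors, bad_colors)
base_errors = mismatches(true_bases, bad_bases)
```

A single color error changes every base downstream of it, whereas a real SNP changes two adjacent colors and leaves the rest of the read intact; this difference is what color-space aligners exploit.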

20 Solexa raw data

21 Solexa pipeline

22 Solexa output in fastq file Solexa raw data comes in one text file, named by default after the flowcell lane and read direction, for example s_7_1_sequence.txt. Four lines per read: 1. @sequence identifier. 2. Raw sequence letters (A, T, C, G, N). 3. +same sequence identifier. 4. Read quality codes (Phred-like).
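The four-line layout can be read with a small generator. This is a minimal sketch that assumes exactly four lines per read, as described above (no line-wrapped sequences); the function name and the example record are hypothetical.

```python
def read_fastq(lines):
    """Yield (identifier, sequence, qualities) from FASTQ lines.

    Assumes the simple four-line-per-read layout.
    """
    it = iter(lines)
    while True:
        header = next(it, None)
        if header is None:
            return
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        seq = next(it, "").strip()
        plus = next(it, "").strip()
        qual = next(it, "").strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], seq, qual


# Hypothetical record, for illustration only.
example_lines = ["@read_1\n", "ACGTN\n", "+read_1\n", "IIII#\n"]
records = list(read_fastq(example_lines))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, which keeps memory use low for multi-gigabyte files.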

23 Fastq quality scores Quality scores are reported as ASCII characters, which saves disk space.
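Decoding the ASCII quality string is a one-line operation once the offset is known. The sketch below takes the offset as a parameter because it depends on the pipeline version: Sanger-style files use offset 33, while Illumina pipeline versions 1.3 to 1.7 used offset 64 (and the earliest Solexa files used a different, log-odds based scale). Which one applies to a given file is an assumption that should be checked against the pipeline documentation.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of integer scores.

    offset: 33 for Sanger-style files, 64 for Illumina 1.3-1.7 files.
    """
    return [ord(ch) - offset for ch in quality_string]


sanger_scores = decode_qualities("I#")            # offset 33
illumina_scores = decode_qualities("h", offset=64)
```

One character per base call is what saves the disk space: a score such as 40 takes a single byte instead of two digits plus a separator.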

24 Re-sequencing

25 Requirements All alignment programs are designed for Unix/Linux platforms; Windows is too slow. Written in C, some parts in Perl. Need a lot of memory: for human-sized genomes, at least 8 GB of RAM. Need a lot of disk space: data files are now ~5 GB. Runs take from tens of minutes to hours to complete. An account at CSC or some other computation facility is recommended. No graphical user interfaces. Command line interface example: >assemble.pl reads_in.csfasta read_qualities_f3_qv.qual ref_file TAIR9_chr.fas -ref_type nt -NO_CORRECTION

26 Re-sequencing pipeline Proceeds in a similar manner for all platforms: 1. Create an index to be used for searching the reference genome. 2. Using the index, align reads to reference. 3. Form a consensus sequence. 4. Identify SNPs etc.

27 Short read alignment Because of the huge amount of data, BLAST is too slow, and faster alignment methods have been developed. The faster methods use shortcuts based on indexing, so that only a small part of the sequences is searched: hashing-based indexing and the Burrows-Wheeler transform. Progress is rapid: methods published 2 years ago are now old.

28 Hashing-based aligners First generation of read aligners. Extend the idea of BLAST. Indexing: divide reads of length L into bins based on their first n nucleotides (n is roughly 20). Alignment: for each position p in the reference genome: (1) pick the next L nucleotides as the reference sequence; (2) find the appropriate bin; (3) match the remaining reference sequence to the reads in the bin. Software: no gaps: Eland, Maq. Gaps allowed: Elandv2, SOAP, GenomeMapper (part of SHORE).
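The indexing and alignment steps above can be sketched in a few lines. This is a toy illustration of the idea, not the implementation of Eland or Maq: the function names are hypothetical, all reads are assumed to have the same length L, mismatches are only tolerated outside the first n nucleotides, and n is set to 3 instead of ~20 to keep the example small.

```python
from collections import defaultdict


def build_read_index(reads, n):
    """Bin read indices by the first n nucleotides of each read."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        index[read[:n]].append(read_id)
    return index


def align_reads(reference, reads, n, max_mismatches=0):
    """Return {read_id: [reference positions]} for ungapped matches."""
    read_len = len(reads[0])          # assumes one common read length L
    index = build_read_index(reads, n)
    hits = defaultdict(list)
    for p in range(len(reference) - read_len + 1):
        window = reference[p:p + read_len]
        # Only reads whose seed equals the window's first n bases are checked.
        for read_id in index.get(window[:n], ()):
            read = reads[read_id]
            mism = sum(a != b for a, b in zip(window[n:], read[n:]))
            if mism <= max_mismatches:
                hits[read_id].append(p)
    return dict(hits)


toy_reference = "ACGTACGTTAGC"
toy_reads = ["ACGTT", "GTTAG", "TACGA"]
exact_hits = align_reads(toy_reference, toy_reads, n=3)
one_mismatch_hits = align_reads(toy_reference, toy_reads, n=3, max_mismatches=1)
```

The speed-up comes from the dictionary lookup: at each reference position only the reads in one bin are compared, instead of all reads.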

29 Burrows-Wheeler transform Next generation of sequence aligners. Searches for sequence matches using a prefix trie. Results in a fast read alignment method. Requires less memory. Small gaps allowed. Software: Bowtie (no gaps), BWA, SOAP2, 2-way BWT. Example: reference ^GOOGOL; task: find a match to LOL, given at most one mismatch.
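The GOOGOL/LOL task can be reproduced with a small sketch of Burrows-Wheeler backward search. Note the assumptions: it appends a `$` sentinel to the end of the text (a common textbook convention) rather than using the `^` start marker and prefix trie of the slide, it builds the suffix array by naive sorting, and it handles the single mismatch by trying every one-letter substitution, which real aligners do far more efficiently. All function names are illustrative.

```python
def build_fm_index(text):
    """Return (suffix array, first-column offsets, occurrence table)."""
    text += "$"                      # sentinel, sorts before all letters
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    first, total = {}, 0             # first[c]: characters smaller than c
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, first, occ


def backward_search(pattern, sa, first, occ):
    """Exact matching: sorted start positions of pattern in the text."""
    lo, hi = 0, len(sa)
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


def search_one_mismatch(pattern, sa, first, occ):
    """Return {position: mismatches} allowing at most one substitution."""
    hits = {pos: 0 for pos in backward_search(pattern, sa, first, occ)}
    letters = [c for c in first if c != "$"]
    for i, original in enumerate(pattern):
        for c in letters:
            if c != original:
                variant = pattern[:i] + c + pattern[i + 1:]
                for pos in backward_search(variant, sa, first, occ):
                    hits.setdefault(pos, 1)
    return hits


index = build_fm_index("GOOGOL")
lol_hits = search_one_mismatch("LOL", *index)
```

Here `lol_hits` reports position 3 (the substring GOL) with one mismatch, which is the answer to the task on the slide.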

30 SHort Read Mapping Package = SHRiMP k-mer hashing step + very efficient implementation of the Smith-Waterman algorithm. Can be used for letter space and color space reads. Slower than the others, but gives the optimal local alignment.
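For reference, the Smith-Waterman recurrence itself is short. The sketch below is a plain, unoptimized version with a linear gap penalty; the scoring values (match 2, mismatch -1, gap -2) are assumptions for illustration and are unrelated to SHRiMP's defaults or its vectorized implementation.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local alignment score, (end_i, end_j))."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    return best, best_pos


best_score, end_position = smith_waterman("TTACGTT", "GGACGGG")
```

The table has len(a) x len(b) cells, which is why the hashing step is needed first: the expensive dynamic programming is run only on candidate regions, not on the whole genome.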

31 Performance comparison Homer N, Merriman B, Nelson SF (2009) BFAST: An Alignment Tool for Large Scale Genome Resequencing. PLoS ONE 4(11): e7767. doi: /journal.pone

32 Re-sequencing in colorspace Colorspace has its own pros and cons, which can be taken into account in sequence alignment. Translation into nucleotides as the last step after alignment! Use software that supports colorspace. In practice, this is just an option you give to the alignment program.

33 Using read qualities in alignment? Not all programs use them! (Check the manuals.) Most new methods use read qualities. Of the old ones: Maq. Let's look at how the assembly goes with Maq... It has been used a lot in early papers and is a benchmark for new methods regarding speed and alignment.

34 Maq - workflow
Convert the reference sequences to the binary fasta format: maq fasta2bfa ref.fasta ref.bfa
Convert the reads to the binary fastq format: maq fastq2bfq reads.fastq reads-1.bfq
Align the reads to the reference: maq match reads-1.map ref.bfa reads-1.bfq
Statistics from the alignment: maq mapcheck ref.bfa reads-1.map >mapcheck.txt
Build the mapping assembly: maq assemble consensus.cns ref.bfa reads-1.map 2>assemble.log
Extract consensus sequences and qualities: maq cns2fq consensus.cns >cns.fq
Extract list of SNPs: maq cns2snp consensus.cns >cns.snp

35 Maq - workflow

36 Aligning reads - analysing results Software for visualization: Maqviewer, SHOREmap, IGV, Tablet. Visualization is VERY important! See the real quality of the data, alignment, coverage etc. Helps to identify errors. Helps to evaluate SNP calls, identify gaps etc. There are wings, a propeller, and a pilot - it must be...

37 Maqviewer Only basic functionality.

38 Tablet viewer Coded in JAVA Graphical interface May be slow

39 SNP calling After aligning the short reads to the reference genome, identify nucleotides that differ from the reference. Make a consensus sequence of the reads. Simplest: choose the most common base at each position. Better: use quality values in the voting. [Figure: reference, consensus and individual reads, with a SNP marked.]
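The two voting schemes can be compared on a toy pileup. This is a deliberately simple sketch: summing Phred scores per base is an assumption made for illustration, not the statistical model used by Maq or other SNP callers, and the function names and the example pileup are hypothetical.

```python
from collections import defaultdict


def call_base(pileup, use_qualities=True):
    """Pick the consensus base at one reference position.

    pileup: list of (base, phred_quality) pairs from the aligned reads.
    With use_qualities=False every read counts as one vote.
    """
    votes = defaultdict(float)
    for base, qual in pileup:
        votes[base] += qual if use_qualities else 1
    return max(votes, key=votes.get)


def is_snp(reference_base, pileup, use_qualities=True):
    """True if the consensus base differs from the reference base."""
    return call_base(pileup, use_qualities) != reference_base


# Three low-quality A calls against two high-quality G calls.
toy_pileup = [("A", 5), ("A", 4), ("A", 6), ("G", 30), ("G", 35)]
majority_call = call_base(toy_pileup, use_qualities=False)
weighted_call = call_base(toy_pileup, use_qualities=True)
```

The example shows why the choice matters: the simple majority follows the three unreliable reads, while the quality-weighted vote follows the two reliable ones.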

40 SNP calling Software for SNP detection: SOAPsnp, Maq, probhd, SHOREmap, MUMmer. Maq computes a phred-type score for the SNPs. SNPs are hard to define; usually some thresholds are given based on depth and the number of differing nucleotides.

41 Example: Sequencing of A.thaliana genome, mutations induced by EMS Two different mutants sequenced with two platforms: SOLiD + Maq: SNPs Solexa + Maq: SNPs Roughly 1 SNP per every 10,000 bases.

42 Further analysis Which SNP is responsible for the mutant phenotype? Usually some window of the genome is known. To identify the SNP, SNP locations need to be combined with the genome annotation: Is the SNP in a coding sequence, exon, intron, promoter, 3'/5' UTR, or junk? Is the SNP disruptive? Does translation result in a stop codon or an altered amino acid sequence? Each splice variant can be different, and the variants are not known. Location in the protein? If done properly, this would require protein structure prediction (very hard).
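Whether a coding SNP is disruptive can be checked by translating the affected codon before and after the change. The sketch below uses the standard genetic code and assumes the codon is already given on the coding strand and in the correct reading frame; it ignores splice variants and everything outside coding sequence. The function name and the example codons are illustrative.

```python
# Standard genetic code, codons ordered by base order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def snp_effect(codon, pos_in_codon, alt_base):
    """Classify a single-base change inside one codon (positions 0-2)."""
    mutated = codon[:pos_in_codon] + alt_base + codon[pos_in_codon + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "synonymous"
    if after == "*":
        return "stop gained"
    if before == "*":
        return "stop lost"
    return "missense"


# G->A changes, the kind of transition typically induced by EMS.
effects = [snp_effect("TGG", 2, "A"),   # Trp -> stop
           snp_effect("GGA", 1, "A"),   # Gly -> Glu
           snp_effect("GAG", 2, "A")]   # Glu -> Glu
```

A premature stop codon is usually the strongest candidate for a loss-of-function phenotype, so such calls are a natural first filter within the known genomic window.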

43 Sequencing de novo

44 How long must reads be for de novo genome assembly? Key problem: repeats in the genome that are longer than the read length. Theoretical analysis: E. coli: with 30 bp read length, 75% of the genome is covered with contigs >10,000 bp. C. elegans: with 50 bp read length, 51% of the genome is covered with contigs >10,000 bp. Human: with 50 bp read length, ~15% is covered with contigs >10,000 bp (chromosome 1). Re-sequencing and de novo sequencing of the majority of a bacterial genome is theoretically possible with read lengths of bp. With longer genomes, significant proportions are left uncovered.

45 Percentage of the E.coli genome covered by contigs greater than a threshold length as a function of read length Whiteford, N. et al. Nucl. Acids Res :e171; doi: /nar/gni170

46 [Figure, panel (b): the same analysis for C. elegans and human; x-axis: read length l (nt).] Use paired-end mapping to connect the longer contigs.

47 Genome assembly using paired end reads Figure 1: An illustration of the Paired End assembly process. Paired End reads are used to order and orient the contigs derived from the Newbler assembly. The large blue lines represent contigs generated from the whole genome shotgun sequencing and assembly. The multiple blue and grey lines represent Paired End information. The blue segments represent the two 20 nucleotide regions that were sequenced while the dotted grey line represents the distance between those two sequenced regions. [454 Technical note 1]

48 Software for de novo genome assembly Software: ALLPATHS, Edena, Velvet, ABySS and SOAPdenovo. Based on de Bruijn graphs. Fujimoto et al.: Illumina/Solexa sequencing of a human genome, 200-bp insert libraries, 12 runs, read length 51-76 nt, 40x coverage. De novo assembly of unmapped reads, comparison of contigs longer than 100 bp: ABySS 6,535 (violet), SOAPdenovo 4,826 (yellow), Velvet 6,617 (green). Contigs that were aligned with more than 90% identity were considered shared contigs. Fujimoto et al. (2010) Whole-genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing. Nature Genetics 42,

49 Velvet assembly - de Bruijn graphs Split reads into k-mers. Align all k-mers in the reads (here 5-mers). de Bruijn graph: each node represents a series of overlapping k-mers; the final nucleotides of the k-mers make up the sequence of the node. The last k-mer of an arc's origin overlaps with the first k-mer of its destination. Reads are mapped as paths through the graph. D. R. Zerbino and E. Birney (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18:
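The idea of assembling through a de Bruijn graph can be shown on a toy example. The sketch uses the simple textbook formulation, in which nodes are (k-1)-mers and every k-mer in a read contributes one edge; Velvet's own node structure (series of overlapping k-mers, paired with their reverse complements) is more elaborate. The function names, the reads and k = 4 are assumptions for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from start while there is exactly one way forward."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:           # stop on cycles (repeats)
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Three overlapping toy reads from the sequence ATGGCGTGCA.
toy_reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
toy_graph = build_de_bruijn(toy_reads, k=4)
toy_contig = walk_unambiguous_path(toy_graph, "ATG")
```

The walk stops as soon as a node has zero or several successors; in real data such branch points come from repeats and sequencing errors, and they are where contigs end.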

50 Example: de novo fragment assembly of one SOLiD run of A. thaliana Data: 68,073,401 reads. Read length 50 nt. 12x coverage. Assembly: at least 3x coverage: 29 M reads. Number of contigs: 295,203. Median contig length: 276 nt. Longest contig: 3399 nt. Shortest contig: 82 nt. Sum of contig lengths: 98,457,405 nt, ~62.7% of the genome. [Figure: contig length (nt) vs. index.]

51 SNP analysis of de novo vs. re-sequencing The sequenced genome was A. thaliana, Cvi ecotype. De novo, Velvet + MUMmer: 2,371,409 SNPs. Re-sequencing, TAIR9 + Maq: 183,811 SNPs. Published SNP list: 810,205 SNPs. Warning: the indices of the published list of Cvi SNPs do not match the TAIR9 reference genome! The consensus sequence of Cvi is not published ("released").

52 Example: Sequencing of the giant panda genome Li et al. (2010) The sequence and de novo assembly of the giant panda genome. Nature 463,

53 Genome assembly using paired end reads v2 Several libraries with different insert lengths. Can use the same sequencing technology for the whole assembly. Strategy: join reads with short insert lengths into contigs. Make these into scaffolds by mapping unpaired ends to other contigs. Scaffold = set of contigs with spaces between. Use longer insert libraries for arranging contigs.

54 Sequencing setup for Giant Panda 37 paired-end insert libraries with insert sizes of 150 bp, 500 bp, 2 kb, 5 kb and 10 kb. Illumina Genome Analyzer platform. 176 Gb of usable sequence, 73x coverage. Average read length of 52 bp.

55 Summary of Assembly Final contig size 2.24 Gb Estimated genome size 2.40 Gb.
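The panda figures are consistent with a simple coverage calculation: coverage is the amount of usable sequence divided by the genome size. The short sketch below redoes the arithmetic with the numbers from the slides (176 Gb of usable sequence, 2.40 Gb estimated genome size, 2.24 Gb of contigs); the variable names are illustrative.

```python
usable_sequence_gb = 176.0     # usable sequence from the sequencing setup
estimated_genome_gb = 2.40     # estimated genome size
contig_total_gb = 2.24         # final contig size

# Average coverage: sequenced bases per genome base.
coverage = usable_sequence_gb / estimated_genome_gb

# Fraction of the estimated genome represented in contigs.
assembled_fraction = contig_total_gb / estimated_genome_gb
```

The result is about 73x coverage, matching the slide, and roughly 93% of the estimated genome is contained in the assembled contigs.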

56 Genome annotation de novo Gene finding: align known genes of model species against the new genome. Hidden Markov Model-based prediction of genes: Genscan, Augustus, HMMgene. Gene annotation: the functions of known genes that can be aligned to the new genome give some clue. Gene orthologues: InParanoid, MultiParanoid.

57 Structure of the umami receptor T1R1 gene Heterodimer T1R1/T1R3 may be the sole receptor for umami taste. Umami: detection of the carboxylate anion of glutamic acid, a naturally occurring amino acid common in meats, cheese, broth, stock and other protein-heavy foods. In panda T1R1 is a pseudogene. Recent mutation, may explain the diet?

58 Example application: nucleosome positioning

59 Chromatin structure Chromatin=combination of DNA, RNA, and protein that makes up chromosomes. Functions: Package DNA, strengthen the DNA to allow mitosis and meiosis Serves as a mechanism to control expression and DNA replication. Changes in chromatin structure are affected by chemical modifications of histone proteins such as methylation (DNA and proteins) and acetylation (proteins), and by non-histone, DNA-binding proteins.

60 Predicting nucleosome positions Separate DNA into nucleosome vs. linker DNA parts. Sequence these with 454. Nucleosome ~146 bp, linker DNA ~ bp. Construct a model to predict nucleosome positions. [Field et al. 08]

61 Computational model Nucleosomes: estimate a (position-specific) di-nucleotide model P_N over all nucleotide sequences. Linker DNA: estimate a 5-mer model P_L for linker DNA vs. nucleosome.
$$\mathrm{Score}(S) = \log\frac{P_N(S)}{P_L(S)} = \log\frac{P_{N,1}(S[1])\,\prod_{i=2}^{147} P_{N,i}(S[i] \mid S[i-1])}{P_L(S[1])\,\prod_{i=2}^{147} P_L(S[i] \mid S[\max(1,i-4)],\ldots,S[i-1])}$$
Estimate the score for the whole DNA, taking into account all legal configurations of nucleosome positioning. Normalize to get probabilities:
$$P(W_c[S]) = \frac{W_c[S]}{\sum_{c' \in C} W_{c'}[S]}$$

62 Result Nucleosome localization can be predicted from DNA sequence. Two different types of regulation by chromatin in yeast promoters: Nucleosome-depleted areas: genes showing relatively low cell-to-cell expression variability, or transcriptional noise. Nucleosome-rich areas: transcription factors need to compete with nucleosomes for access to the DNA => variability in gene expression.

63 Further uses for high-throughput sequencing? Cataloging sequences and their variation: between individuals and species; SNPs, quantitative trait loci; copy number variations; mutations and genome rearrangements; metagenomics; evolution at an individual level; phylogeny. Epigenetics: DNA methylation (using ChIP-seq); chromatin structure. Transcriptome: digital gene expression; ChIP-seq; splice variants; microRNA; cell-specific gene expression.

64 What can high-throughput sequencing do for you? [Kahvejian et al. 08]

65 References
Li, H., Homer, N. (2010) A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics 11(5):
Fujimoto et al. (2010) Whole-genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing. Nature Genetics 42,
Li et al. (2010) The sequence and de novo assembly of the giant panda genome. Nature 463,
Magi, A. et al. (2010) Bioinformatics for Next Generation Sequencing Data. Genes 1:
Vera, J.C., Wheat, C.W., Fescemyer, H.W., Frilander, M.J., Crawford, D.L., Hanski, I., and Marden, J.H. (2008) Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Molecular Ecology 17:
Field, Y., Kaplan, N., Fondufe-Mittendorf, Y., Moore, I.K., Sharon, E., et al. (2008) Distinct Modes of Regulation by Chromatin Encoded through Nucleosome Positioning Signals. PLoS Comput Biol 4(11): e doi: / journal.pcbi
Kahvejian, A., Quackenbush, J., Thompson, J.F. (2008) What would you do if you could sequence everything? Nature Biotechnology 26(10):
Zerbino, D.R. and Birney, E. (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18:
Korbel et al. (2007) Paired-end mapping reveals extensive structural variation in the human genome. Science 318:
Whole Genome Assembly using Paired End Reads in E. coli, B. licheniformis, and S. cerevisiae. 454 Application note 1,
Whiteford, N. et al. (2005) An analysis of the feasibility of short read sequencing. Nucleic Acids Res. 33, e171.


More information

Introduction to Bioinformatics and Gene Expression Technologies

Introduction to Bioinformatics and Gene Expression Technologies Vocabulary Introduction to Bioinformatics and Gene Expression Technologies Utah State University Fall 2017 Statistical Bioinformatics (Biomedical Big Data) Notes 1 Gene: Genetics: Genome: Genomics: hereditary

More information

Genomic resources. for non-model systems

Genomic resources. for non-model systems Genomic resources for non-model systems 1 Genomic resources Whole genome sequencing reference genome sequence comparisons across species identify signatures of natural selection population-level resequencing

More information

Compute- and Data-Intensive Analyses in Bioinformatics"

Compute- and Data-Intensive Analyses in Bioinformatics Compute- and Data-Intensive Analyses in Bioinformatics" Wayne Pfeiffer SDSC/UCSD August 8, 2012 Questions for today" How big is the flood of data from high-throughput DNA sequencers? What bioinformatics

More information

L3: Short Read Alignment to a Reference Genome

L3: Short Read Alignment to a Reference Genome L3: Short Read Alignment to a Reference Genome Shamith Samarajiwa CRUK Autumn School in Bioinformatics Cambridge, September 2017 Where to get help! http://seqanswers.com http://www.biostars.org http://www.bioconductor.org/help/mailing-list

More information

Genome 373: Mapping Short Sequence Reads II. Doug Fowler

Genome 373: Mapping Short Sequence Reads II. Doug Fowler Genome 373: Mapping Short Sequence Reads II Doug Fowler The final Will be in this room on June 6 th at 8:30a Will be focused on the second half of the course, but will include material from the first half

More information

Short Read Alignment to a Reference Genome

Short Read Alignment to a Reference Genome Short Read Alignment to a Reference Genome Shamith Samarajiwa CRUK Summer School in Bioinformatics Cambridge, September 2018 Aligning to a reference genome BWA Bowtie2 STAR GEM Pseudo Aligners for RNA-seq

More information

Genomics and Transcriptomics of Spirodela polyrhiza

Genomics and Transcriptomics of Spirodela polyrhiza Genomics and Transcriptomics of Spirodela polyrhiza Doug Bryant Bioinformatics Core Facility & Todd Mockler Group, Donald Danforth Plant Science Center Desired Outcomes High-quality genomic reference sequence

More information

Introductie en Toepassingen van Next-Generation Sequencing in de Klinische Virologie. Sander van Boheemen Medical Microbiology

Introductie en Toepassingen van Next-Generation Sequencing in de Klinische Virologie. Sander van Boheemen Medical Microbiology Introductie en Toepassingen van Next-Generation Sequencing in de Klinische Virologie Sander van Boheemen Medical Microbiology Next-generation sequencing Next-generation sequencing (NGS), also known as

More information
