Lecture 7. Next-generation sequencing technologies

1 Lecture 7 Next-generation sequencing technologies

2 Next-generation sequencing technologies

3 General principles of short-read NGS
1. Construct a library of fragments.
2. Generate clonal template populations.
3. Run massively parallel DNA sequencing reactions.
4. Analyze the data.

4 Library preparation
Library preparation prepares sample nucleic acids for sequencing. It starts with fragmentation and generates double-stranded DNA flanked by Illumina adapters. Every protocol generates the same general template structure, but the variables include insert size, adapter type, and index for multiplexing.

5 Library preparation: Overview
1. Start from purified genomic DNA.
2. Fragment the DNA (fragments < 800 bp).
3. Repair the ends (blunt-end fragments with 5'-phosphorylated ends).
4. Add an A to the 3' ends.
5. Ligate paired-end adapters.
6. Size-select on a gel.
7. Amplify by PCR (amplified DNA with adapters).
8. QC the library to obtain the final genomic DNA library.

6 Library preparation: Fragmentation

7 Library preparation: Fragmentation
The size of the target DNA fragments in the final library is a key parameter for NGS library construction. Optimal library size depends on:
1. The process of cluster generation: short products amplify more efficiently than longer products, and longer library inserts generate larger, more diffuse clusters than short inserts.
2. The sequencing application: for example, paired-end (PE) reads for exome sequencing, since more than 80% of human exons are shorter than 200 bp.

8 Library preparation: Repair Ends

9 Library preparation: A-tailing

10 Library preparation: A-tailing
A-tailing serves two purposes:
1. To facilitate ligation to the sequencing adapters.
2. To prevent self-ligation between blunt-ended template molecules (concatemers) or between adapters (adapter dimers).
(Figure: A-tailed fragments with 5' phosphates (P) pairing with T-overhang adapters.)

11 Library preparation: Adapter ligation

12 Library preparation: Y-shaped adapters

13 Library preparation: Y-shaped adapters
(Figure panels: Y-shaped adapters; non-Y-shaped adapters.)

14 Library preparation: Size-select on gel
(Gel images: 600 bp area excised; 300 bp area excised.)

15 Library preparation: PCR
PCR serves two purposes:
1. To selectively enrich DNA fragments that carry adapters on both ends.
2. To amplify the amount of DNA in the library.

16 Library preparation: PCR

17 Library preparation: QC
Library QC with the Agilent Bioanalyzer confirms fragment size and visualizes unwanted products. (Trace labels: lower marker 15 bp; upper marker 1500 bp.)

18 General principles of short-read NGS
1. Construct a library of fragments.
2. Generate clonal template populations.
3. Run massively parallel DNA sequencing reactions.
4. Analyze the data.

19 Cluster amplification: Flow cells

20 Cluster amplification: Flow cells
Adapter-ligated library elements hybridize to complementary oligonucleotides on the surface of a flow cell. Each attached library fragment acts as a seed and is amplified to generate a clonal cluster containing thousands of identical fragments. Ideally, clusters are of similar size and spaced well apart from each other to achieve accurate resolution during imaging. In reality, DNA clusters are randomly distributed across the flow cell. If the sample is overloaded, many clusters lie in close proximity to their neighbors, which makes it difficult to distinguish individual clusters from one another and reduces the amount of information generated during the run.

21 Cluster amplification: Patterned flow cells

22 Cluster amplification: Patterned flow cells
Patterned flow cell technology provides even cluster spacing and uniform feature size to deliver extremely high cluster densities. Clusters can only form in the nanowells, allowing accurate resolution of clusters during imaging.

23 Cluster amplification

24 Cluster amplification

25 Cluster amplification: Hybridization and extension

26 Cluster amplification: Denaturation

27 Cluster amplification: Anchor the template to the surface

28 Cluster amplification: Bridge amplification

29 Cluster amplification: Bridge amplification

30 Cluster amplification: Denaturation

31 Cluster amplification: Bridge amplification

32 Cluster amplification: Bridge amplification

33 Cluster amplification: P5 Linearization

34 Cluster amplification: P5 Linearization

35 Cluster amplification: Blocking

36 Cluster amplification: Read1 sequencing

37 General principles of short-read NGS
1. Construct a library of fragments.
2. Generate clonal template populations.
3. Run massively parallel DNA sequencing reactions.
4. Analyze the data.

38 Sequencing by synthesis

39 Sequencing by synthesis

40 Single read, paired-end and read lengths
Single read: program the system to sequence a specific number of bases ( bases).
Paired-end: sequence the strands from both directions to achieve a total of, e.g., 600 bases (2 × 300 bases).

41 Paired-end sequencing
Longer read lengths improve 1) the overall length of contiguous sequence that can be assembled, and 2) the certainty of short-read alignments. Several next-generation sequencers have offered increases in read length over time. Another improvement has come from paired-end sequencing, which produces sequence data from both ends of each library fragment. Read pairs can be obtained by one of two mechanisms: 1) paired ends or 2) mate pairs.

42 Paired-end sequencing

43 Paired-end sequencing

44 Paired-end sequencing

45 Paired-end sequencing: P7 linearization

46 Paired-end sequencing

47 Paired-end sequencing
Read-pair type | Fragment length | Advantage
(a) Paired-end | < 1000 bp | Higher accuracy of alignments than a single-end read of the same length
(b) Mate-pair | > 1000 bp | Provides a scaffold for de novo sequencing through long-range order and orientation

48 Illumina: Summary

49 Illumina platforms: Benchtop sequencers

50 Illumina platforms: Production-scale sequencers

51 Choosing a library type
Single-read library: unidirectional sequencing; compatible only with single-read flow cells. Applications: ChIP-seq, mRNA-seq for quantification, low-coverage resequencing.

52 Choosing a library type
Paired-end library: uni- or bidirectional sequencing; compatible with both single-read and paired-end flow cells. This is the most common library type. Applications: de novo assembly, structural variant detection, high-coverage resequencing.

53 Choosing a library type
Indexed libraries: uni- or bidirectional sequencing; allow multiple libraries per lane.
Single-indexed libraries add up to 48 unique 6-base index 1 (i7) sequences to generate up to 48 uniquely tagged libraries.
Dual-indexed libraries add up to 24 unique 8-base index 1 (i7) sequences and up to 16 unique 8-base index 2 (i5) sequences to generate up to 384 (24 × 16) uniquely tagged libraries.
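The dual-index count can be checked with a short sketch. The index labels below (i7_01, i5_01, and so on) are hypothetical placeholders, not real index sequences from any kit.

```python
from itertools import product

# Hypothetical labels; a real kit defines a specific 8-base sequence per index.
i7_indexes = [f"i7_{n:02d}" for n in range(1, 25)]  # 24 index 1 (i7) adapters
i5_indexes = [f"i5_{n:02d}" for n in range(1, 17)]  # 16 index 2 (i5) adapters

# Each (i7, i5) combination tags one library uniquely.
dual_index_pairs = list(product(i7_indexes, i5_indexes))
print(len(dual_index_pairs))  # 24 * 16 = 384
```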

54 Single-indexed sequencing
The single-indexed sequencing workflow applies to all Illumina sequencing platforms.

55 Dual-indexed sequencing on a paired-end flow cell
Dual-indexed sequencing includes 2 index reads.

56 Dual-indexed adapters

57 Reads and coverage
The number of reads that cover a specific region is called the depth, or coverage, of that region.
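Averaged over a whole target, depth is commonly estimated as (number of reads × read length) / target size. The sketch below assumes that standard estimate; the function name and the example numbers are illustrative and do not come from the lecture.

```python
def mean_coverage(num_reads: int, read_length: int, target_size: int) -> float:
    """Estimate average depth as total sequenced bases divided by target size."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    return num_reads * read_length / target_size

# Illustrative numbers: 1,000,000 reads of 100 bases over a 5,000,000-base target.
print(mean_coverage(1_000_000, 100, 5_000_000))  # 20.0
```

This is an average only; the depth at any single position can be much higher or lower.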

58 FASTA/FASTQ

59 Overview
Nucleotide (and protein) sequences are stored in two plaintext formats that are widespread in bioinformatics: FASTA and FASTQ. We will discuss each format and its limitations, and then see some tools for working with data in these formats.

60 FASTA
The FASTA format originates from the FASTA alignment suite, created by William R. Pearson and David J. Lipman. The FASTA format is used to store any sort of sequence data that does not require per-base quality scores. This includes reference genome files, protein sequences, coding DNA sequences (CDS), transcript sequences, and so on.

61 FASTA
FASTA files are composed of sequence entries, each containing two parts: a description and the sequence data. The description line begins with a greater-than symbol (>) and contains the sequence identifier and other optional information. The sequence data begins on the line after the description and continues until there's another description line.
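The two-part record structure can be sketched as a small reader. This is a minimal illustration rather than a full parser; the function name read_fasta and the two example records are made up.

```python
import io
from typing import Iterator, TextIO, Tuple

def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from an open FASTA file."""
    description, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            # A new description line closes the previous entry, if any.
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif line and description is not None:
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)

# Made-up records, for illustration only.
example = io.StringIO(">seq1 hypothetical record\nACGTAC\nGTTA\n>seq2\nTTGCA\n")
records = list(read_fasta(example))
print(records)
```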

62 FASTA
(Figure: an example FASTA file.)
The FASTA format's simplicity and flexibility come with an unfortunate downside: FASTA is a loosely defined, ad hoc format.

63 FASTA
In general, the following rules should be observed:
1. Sequence lines should not be too long. While a FASTA file that contains the sequence of the entire human chromosome 1 on a single line is valid, most tools run on such a file would fail.
2. Some tools may accept data containing symbols beyond the alphabet they know how to handle. For example, the standard nucleotide alphabet contains A, T, G, and C. An extended alphabet may also contain N (A, T, G, or C) and W (A or T). Search the web for "IUPAC nucleotides" to get a list of all such symbols.
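One way to check rule 2 before running a tool is to list the symbols in a sequence that fall outside an expected alphabet. The sketch below uses the standard IUPAC nucleotide codes for DNA; the function name and the test sequences are illustrative.

```python
# IUPAC nucleotide codes (DNA) and the bases each one stands for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def unexpected_symbols(sequence: str, alphabet=IUPAC_DNA) -> list:
    """Return the sorted symbols in `sequence` that are not in `alphabet`."""
    return sorted({base for base in sequence.upper() if base not in alphabet})

print(unexpected_symbols("ACGTNNWA"))        # []
print(unexpected_symbols("ACGTXZ"))          # ['X', 'Z']
print(unexpected_symbols("ACGTN", "ACGT"))   # ['N'] with the strict 4-letter alphabet
```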

64 FASTA
3. The sequence lines should always wrap at the same width, with the exception of the last line. Some tools will fail to operate correctly, and may not even warn the user, if this condition is not satisfied. A file with unevenly wrapped lines is technically valid FASTA, but it may cause various problems and should be reformatted to a uniform width. (Figures: an unevenly wrapped record and its reformatted version.)
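Re-wrapping a record to a uniform width can be sketched as follows. The function name, the default width of 60, and the example record are assumptions made for illustration.

```python
def wrap_fasta_record(description: str, sequence: str, width: int = 60) -> str:
    """Return one FASTA record whose sequence lines all share `width`,
    except possibly the last line."""
    if width <= 0:
        raise ValueError("width must be positive")
    joined = "".join(sequence.split())  # drop existing line breaks and spaces
    lines = [joined[i:i + width] for i in range(0, len(joined), width)]
    return "\n".join([">" + description] + lines)

# A made-up record with uneven line widths (8, 2, and 12 bases).
uneven = "ACGTACGT\nAC\nGTACGTACGTAC"
print(wrap_fasta_record("seq1", uneven, width=10))
```

With width=10 the 22 bases come out as two full lines of 10 and a final line of 2.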

65 FASTA
4. Use upper-case letters. Although both lower-case and upper-case letters are allowed by the specification, capitalization may carry additional meaning, and some tools and methods operate differently when encountering upper- or lower-case letters. Some communities (e.g. Ensembl) use lower-case letters to mark repeats and low-complexity regions.
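Before upper-casing a sequence, it can be useful to see how much of it is lower-case, since that information is lost afterwards. A minimal sketch, with an illustrative function name and a made-up sequence:

```python
def lowercase_fraction(sequence: str) -> float:
    """Fraction of letters in `sequence` that are lower-case."""
    letters = [char for char in sequence if char.isalpha()]
    if not letters:
        return 0.0
    return sum(char.islower() for char in letters) / len(letters)

seq = "ACGTacgtAC"                  # made-up sequence with a lower-case stretch
print(lowercase_fraction(seq))      # 0.4
print(seq.upper())                  # ACGTACGTAC
```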

66 FASTQ
The FASTQ format extends FASTA by including a numeric quality score for each base in the sequence. The FASTQ format is widely used to store high-throughput sequencing data, which is reported with a per-base quality score indicating the confidence of each base call. It is the de facto standard by which all sequencing instruments represent data.

67 FASTQ
A FASTQ record consists of:
Line 1: The description line, beginning with @. This contains the record identifier and other information.
Line 2: Sequence data, which can be on one or many lines.
Line 3: The line beginning with +, which indicates the end of the sequence.
Line 4: Quality data, which can also be on one or many lines but must be the same length as the sequence. Each numeric base quality is encoded as an ASCII character.
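The record layout can be sketched as a reader for the common case in which every record occupies exactly four lines. That four-line assumption, the function name, and the example record are illustrative; records wrapped over more lines need a more careful parser.

```python
import io
from typing import Iterator, TextIO, Tuple

def read_fastq_four_line(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (description, sequence, quality), assuming four lines per record."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip()
        plus = handle.readline().rstrip()
        quality = handle.readline().rstrip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality

# Made-up record; note that '@' is also a legal quality character.
example = io.StringIO("@read1 hypothetical\nACGT\n+\nII@I\n")
records = list(read_fastq_four_line(example))
print(records)
```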

68 FASTQ
The FASTQ format is a multi-line format, just as the FASTA format is. In the early days of high-throughput sequencing, instruments always produced the entire FASTQ sequence on a single line. The FASTQ format suffers from the unexpected flaw that the @ sign is both a FASTQ record separator and a valid value in the quality string. For that reason it is a little more difficult to design a correct FASTQ parsing program.

69 bioawk
bioawk is a program that extends awk's powerful processing of tabular data to tasks involving common bioinformatics formats such as FASTA/FASTQ, GTF/GFF, BED, SAM, and VCF. (Slide shows the commands for installing bioawk.)

70 Counting FASTA/FASTQ entries
Counting FASTA entries, and a first approach for counting FASTQ entries. (Slide shows the commands.)

71 Counting FASTA/FASTQ entries

72 Counting FASTA/FASTQ entries
Second approach for counting FASTQ entries. (Slide shows the command.) If you're unsure whether some of your FASTQ entries wrap across many lines, a more robust way to count sequences is with bioawk.
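The commands from the slides are not part of this transcript, so the following Python sketch only illustrates the usual counting ideas, on the assumption that they match the slides: count description lines for FASTA, and for FASTQ compare counting lines that start with @ against dividing the line count by four. All function names and example data are made up.

```python
fasta_text = ">seq1\nACGT\nACGT\n>seq2\nTTGA\n"
# The first quality string starts with '@', which is a legal quality character.
fastq_text = "@read1\nACGT\n+\n@III\n@read2\nTTGA\n+\nIIII\n"

def count_fasta_entries(text: str) -> int:
    """Count description lines, i.e. lines starting with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))

def count_fastq_by_at_sign(text: str) -> int:
    """Naive count of lines starting with '@'; overcounts when a quality
    line happens to start with '@'."""
    return sum(1 for line in text.splitlines() if line.startswith("@"))

def count_fastq_by_line_count(text: str) -> int:
    """Divide the line count by four; valid only for unwrapped records."""
    lines = text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("line count is not a multiple of four")
    return len(lines) // 4

print(count_fasta_entries(fasta_text))        # 2
print(count_fastq_by_at_sign(fastq_text))     # 3 (one too many)
print(count_fastq_by_line_count(fastq_text))  # 2
```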

73 Base qualities in FASTQ
Each sequence base of a FASTQ entry has a corresponding numeric quality score (Phred score) in the quality line(s). Each base quality score is encoded as a single ASCII character; ASCII characters are represented as integers between 0 and 127. Because not all ASCII characters are printable to the screen (e.g. echoing character 07 makes a "ding" noise), qualities are restricted to the printable ASCII characters, ranging from 33 to 126. The idea behind this encoding is to map two-digit numbers to single characters so that the quality string stays the same length as the sequence.

74 Base qualities in FASTQ
Converting these ASCII characters to meaningful quality scores can be tricky because there are three different quality schemes: Sanger, Solexa, and Illumina.
Name | ASCII range | Offset | Quality score type | Quality score range
Sanger, Illumina (≥ 1.8) | 33-126 | 33 | PHRED | 0-93
Solexa, Illumina (< 1.3) | 59-126 | 64 | Solexa | -5 to 62
Illumina (1.3-1.7) | 64-126 | 64 | PHRED | 0-62

75 Base qualities in FASTQ
Fortunately, the bioinformatics field seems to have finally settled on the Sanger encoding. A Phred score $Q$ is related to the probability $P$ of a base call being incorrect by $P = 10^{-Q/10}$, or equivalently $Q = -10 \log_{10} P$.

76 Base qualities in FASTQ
To convert ASCII characters encoded with the Sanger scheme to Phred scores:
1. Subtract the offset (33 for the Sanger scheme) from each character's ASCII code to obtain its Phred quality score. In Python, ord() converts an ASCII character to its decimal representation.
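Step 1 can be sketched with ord() as follows. The function name and the quality string are made up for illustration; the default offset of 33 is the Sanger offset.

```python
def sanger_to_phred(quality: str, offset: int = 33) -> list:
    """Convert an ASCII-encoded quality string to a list of Phred scores."""
    return [ord(char) - offset for char in quality]

qual = "II5!~"  # made-up quality string
print(sanger_to_phred(qual))  # [40, 40, 20, 0, 93]
```

The characters '!' (ASCII 33) and '~' (ASCII 126) are the two ends of the printable range, so they decode to the lowest and highest possible scores.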

77 Base qualities in FASTQ
2. Apply the formula $P = 10^{-Q/10}$ to convert each quality score to the probability that the base call is incorrect.
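Step 2 can be sketched in both directions. The function names are illustrative. Note that the formula gives the probability that a base call is wrong, so the probability that it is correct is one minus that value.

```python
import math

def phred_to_error_probability(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is incorrect."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p: float) -> float:
    """Q = -10 * log10(P): inverse of the formula above."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    p_error = phred_to_error_probability(q)
    print(q, p_error, 1 - p_error)  # score, P(incorrect), P(correct)
```

A score of 20 corresponds to an error probability of about 1 in 100, and a score of 30 to about 1 in 1000.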