Alignment J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014

From reads to molecules

Why align?

Reads from Individual A and Individual B, aligned beneath the reference sequence (top line):

ATGATAGCATCGTCGGGTGTCTGCTCAATAATAGTGCCGTATCATGCTGGTGTTATAATCGCCGCATGACATGATCAATGG
                         CAATAAAAGTGCCGTATCATGCTGGTGTTACAATCGCCGCA
                                     CGTATCATGCTGGTGTTACAATCGCCGCATGACATGATCAATGG
                 TGTCTGCTCAATAAAAGTGCCGTATCATGCTGGTGTTACAATC
        ATCGTCGGGTGTCTGCTCAATAAAAGTGCCGTATCATG--GGTGTTATAA
                       CTCAATAAGAGTGCCGTATCATG--GGTGTTATAATCGCCGCA
                                                   GTTATAATCGCCGCATGACATGATCAATGG

To measure variation.

Why align?

Short Read Aligners: choices...
Fall '12 - Apr '13: ... now 150-180 Gbp / day!*
* http://www.illumina.com/systems/hiseq_2500_1500/performance_specifications.ilmn

Burrows-Wheeler Aligners
The Burrows-Wheeler Transform (BWT) is used in the bzip2 file compression tool. The FM-index (Ferragina & Manzini) allows efficient finding of substring matches within compressed text. The algorithm is sub-linear with respect to the time and storage space required for a given set of input data (essentially, the reference 'ome). The result: reduced memory footprint, faster execution.
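To make the FM-index idea concrete, here is a toy sketch in Python. It is not how BWA or Bowtie are implemented: it keeps the full suffix array and full count tables in memory, whereas real aligners store sampled, compressed versions. The function names and the example sequence are made up for illustration.

```python
def bwt_index(text):
    """Build a toy FM-index: suffix array, BWT, C table, and Occ table."""
    text += "$"  # sentinel; sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each suffix
    alphabet = sorted(set(text))
    # c_table[ch] = number of characters in text that sort before ch
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i] = number of times ch occurs in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return sa, bwt, c_table, occ


def find(pattern, index):
    """Backward search: return sorted 0-based start positions of pattern."""
    sa, bwt, c_table, occ = index
    lo, hi = 0, len(bwt)  # half-open range of suffix array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


index = bwt_index("GATTACATTAG")
print(find("TTA", index))
```

Each pattern character costs two table lookups, so the search time depends on the pattern length rather than on the length of the indexed text.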

BWA
BWA is fast, and can do gapped alignments. When run without seeding, it will find all hits within a given edit distance. The long read aligner is also fast, and can perform well for 454, Ion Torrent, Sanger, and PacBio reads. BWA is actively developed and has a strong user / developer community.

bio-bwa.sourceforge.net

Short reads (under 200 bp): Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60. [PMID: 19451168]

Long reads (over 200 bp; chimeric alignments built-in): Li H. and Durbin R. (2010) Fast and accurate long read alignment with Burrows-Wheeler Transform. Bioinformatics, 26:589-95. [PMID: 20080505]

Don't forget to join the mailing groups!
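What "all hits within a given edit distance" means can be shown with a small dynamic-programming sketch. This is plain semi-global alignment scoring, not BWA's BWT-based search, and the function name and sequences are hypothetical. It reports every reference position where the read can end with at most `max_edits` substitutions, insertions, or deletions.

```python
def hits_within_edit_distance(read, reference, max_edits):
    """Return (end_position, edits) pairs; end_position is 1-based."""
    # prev[i] = cost of aligning read[:i] before any reference base is used
    prev = list(range(len(read) + 1))
    hits = []
    for j, ref_base in enumerate(reference, start=1):
        cur = [0]  # a hit may start at any reference position for free
        for i, read_base in enumerate(read, start=1):
            cur.append(min(
                prev[i - 1] + (read_base != ref_base),  # match or substitution
                prev[i] + 1,                            # reference base skipped
                cur[i - 1] + 1,                         # read base unmatched
            ))
        if cur[-1] <= max_edits:
            hits.append((j, cur[-1]))
        prev = cur
    return hits


print(hits_within_edit_distance("GTA", "ACGTACGT", max_edits=0))
```

Raising `max_edits` adds near matches around each exact hit, which is why exhaustive search gets expensive quickly and why aligners use seeding to limit it.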

Bowtie
Bowtie (now Bowtie 2) is probably faster than BWA for some types of alignment, but it may not find the best alignments (see discussions on sensitivity and accuracy on SEQanswers.com). Bowtie is part of a suite of tools (Bowtie, TopHat, Cufflinks, CummeRbund) that address RNA-seq experiments.

http://bowtie-bio.sourceforge.net

Langmead B., Trapnell C., Pop M., and Salzberg S.L. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10:R25. [PMID: 19261174]

Don't forget to join the mailing groups!

Alignment concepts / parameters Paired-End reads Mate-Paired reads

Alignment concepts / parameters 454 "Paired-End" reads Single End Construct

Alignment concepts / parameters

File Format: SAM / BAM / CRAM (new!)
http://samtools.sourceforge.net/ - deprecated!
http://www.htslib.org/ - SAMtools 1.0 and up

Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]

SAM specification (currently v1; the version numbering was reset, so v1 is newer than the old v1.4)
samtools man page
example workflow(s)
mailing list!

File Format: SAM

File Format: SAM
SAM Format Specification v1.4-r985
Fields 7, 8: formerly MRNM, MPOS (mate reference name, mate position)
Field 9: formerly ISIZE ("insert" size)
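As a rough illustration of the eleven mandatory SAM columns, including the renamed fields 7-9 (RNEXT, PNEXT, TLEN), here is a minimal Python parser. The record below is invented for the example, and the helper name is hypothetical; for real work a dedicated library is the safer choice.

```python
from collections import namedtuple

# The 11 mandatory SAM columns, in order; fields 7-9 use their current names.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

SamRecord = namedtuple("SamRecord", [f.lower() for f in SAM_FIELDS] + ["tags"])


def parse_sam_line(line):
    """Split one tab-separated alignment line into a SamRecord."""
    cols = line.rstrip("\n").split("\t")
    values = [int(v) if name in INT_FIELDS else v
              for name, v in zip(SAM_FIELDS, cols)]
    return SamRecord(*values, tags=cols[11:])  # optional TAG:TYPE:VALUE fields


# Made-up record: first read of a properly paired pair (FLAG 99).
line = "read1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII\tNM:i:0"
rec = parse_sam_line(line)
print(rec.rnext, rec.pnext, rec.tlen)
print("proper pair:", bool(rec.flag & 0x2))
```

An RNEXT of "=" means the mate aligned to the same reference sequence as the read itself.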

File Format: SAM google "Heng Li slides" - Challenges and Solutions in the Analysis of Next Generation Sequencing Data (2010)

File Format: BAM
BAMs are compressed SAMs (so they are binary, not human-readable text; don't look directly at them!). They can be indexed to allow rapid extraction of information, so alignment viewers do not need to uncompress the whole BAM file in order to look at information for a particular read or coordinate range somewhere in the file. Indexing your BAM file, mycoolbamfile.bam, will create an index file, mycoolbamfile.bam.bai, which is needed (in addition to the BAM file) by viewers and other downstream tools. An occasional downstream tool will require an index called mycoolbamfile.bai (notice that the .bai replaces the .bam, instead of being appended after it).
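The two index naming conventions can be handled with a few lines of Python. This is only a sketch using the standard library; the helper names are hypothetical, and it looks for an existing index rather than creating one.

```python
from pathlib import Path


def bam_index_candidates(bam_path):
    """Return both index names: appended (.bam.bai) and replaced (.bai)."""
    bam = Path(bam_path)
    return [Path(str(bam) + ".bai"), bam.with_suffix(".bai")]


def find_bam_index(bam_path):
    """Return the first index file that exists on disk, or None."""
    for candidate in bam_index_candidates(bam_path):
        if candidate.exists():
            return candidate
    return None


for name in bam_index_candidates("mycoolbamfile.bam"):
    print(name)
```

If a tool insists on the second naming style, copying or symlinking the existing index under the other name is a common workaround.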

File Format: CRAM
CRAM is available as of SAMtools 1.0 and, like BAM, is a binary format. It uses data-specific compression tools (i.e. compressing letters is different from compressing numbers), specifically reference-based compression (e.g. for aligned reads, only mismatching bases need to be stored). It can also employ lossy compression of base qualities, which appears to have a negligible effect on, say, variant calling (see Illumina white paper). Indexing your CRAM file, mycoolbamfile.cram, will create an index file, mycoolbamfile.cram.crai, which is needed (in addition to the CRAM file) by viewers and other downstream tools. This is a very recent development, so it may be a while before tools are CRAM-capable.
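The idea behind reference-based compression can be sketched as follows. This is a toy illustration, not the CRAM encoding itself: it handles only gap-free alignments and ignores qualities, read names, and the actual entropy coding. The function names are hypothetical; the sequences are the reference and the first read from the "Why align?" slide.

```python
def encode_read(reference, pos, read):
    """Store a gap-free aligned read as (position, length, mismatches)."""
    mismatches = [(i, base) for i, base in enumerate(read)
                  if reference[pos + i] != base]
    return pos, len(read), mismatches


def decode_read(reference, encoded):
    """Rebuild the read from the reference plus the stored mismatches."""
    pos, length, mismatches = encoded
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = ("ATGATAGCATCGTCGGGTGTCTGCTCAATAATAGTGCCGTATCATGCTGG"
             "TGTTATAATCGCCGCATGACATGATCAATGG")
read = "CAATAAAAGTGCCGTATCATGCTGGTGTTACAATCGCCGCA"

encoded = encode_read(reference, 25, read)  # 0-based alignment start
print(encoded)
assert decode_read(reference, encoded) == read
```

The more closely the reads match the reference, the less there is to store, which is also why the same reference is needed again to decode the file.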

Alignment Viewers
IGV (Integrative Genomics Viewer) www.broadinstitute.org/igv/
BAMview, tview (in SAMtools), IGB, GenomeView, SAMscope...
UCSC Genome Browser, GBrowse

IGV
red box: indicates region of reference in view below
coverage track: read coverage depth plot
read alignments (various view styles; squished shown here): read positions, orientations, pairing, sequence that disagrees with reference highlighted, improper pairs highlighted, etc.
annotation tracks (GTF, BED, etc.)

IGV
colored bases where they disagree with reference (substitution, indel, etc.)
improper pairs (mate aligns far away, in wrong orientation, or on another chromosome)
reference sequence, reading frames, etc.

Variant Calling - VCF format
One main application of read alignment. A.k.a. "resequencing", SNP / indel discovery. VCF (Variant Call Format) is now the standard format for variant reporting.
http://vcftools.sourceforge.net/specs.html... VCF poster

Variant Call Format
##fileformat=VCFv4.1
##fileDate=20130825
##source=freebayes v9.9.2-9-gfbf46fc-dirty
##reference=../results/8/8.fa
##phasing=none
##commandline="../tools/freebayes/bin/freebayes -f ../results/8/8.fa --min-alternate-fraction 0.03 --min-mapping-quality 20 --min-base-quality 20 --ploidy 1 --pooled-continuous --use-best-n-alleles 4 --use-mapping-quality --min-alternate-fraction 0.04 --min-alternate-count 1 ../results/8/8.bam"
##INFO=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count, with partial observations recorded fractionally">
##INFO=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observations, with partial observations recorded fractionally">
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 8
8_PB1 26 . TGTTACGCG GCTTTTGC,TGTTTCTAC 27.2619 . AO=1,2;RO=0;TYPE=complex,complex GT:DP:RO:QR:AO:QA:GL 2:3:0:0:1,2:31,70:-4.46,-1.65,0
8_PB1 38 . TCA ACG,TA,AGA 0.0495692 . AO=1,1,1;RO=3;TYPE=complex,del,mnp GT:DP:RO:QR:AO:QA:GL 2:6:3:101:1,1,1:31,37,34:0,-4.556,-4.004,-4.28
8_PB1 42 . G A 3.94171e-14 . AO=8;RO=128;TYPE=snp GT:DP:RO:QR:AO:QA:GL

Variant Call Format
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 8
8_PB2 407 . A G 3935.83 . AO=149;RO=21;TYPE=snp GT:DP:RO:QR:AO:QA:GL 1:170:21:788:149:5579:-5,0

CHROM = 8_PB2
POS = 407
ID = .
REF = A
ALT = G
QUAL = 3935.83
FILTER = .
INFO = AO=149;RO=21;TYPE=snp
FORMAT = GT:DP:RO:QR:AO:QA:GL
8 = 1:170:21:788:149:5579:-5,0
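The same breakdown can be done programmatically. The following is a minimal Python sketch for a single data line, assuming tab-separated columns as in a real VCF file; the function name is hypothetical, and it skips validation that a proper VCF library would perform.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT",
                 "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_vcf_record(line, sample_names):
    """Split one VCF data line into fixed fields, INFO, and per-sample values."""
    cols = line.rstrip("\n").split("\t")
    rec = dict(zip(FIXED_COLUMNS, cols[:9]))
    rec["POS"] = int(rec["POS"])
    rec["QUAL"] = None if rec["QUAL"] == "." else float(rec["QUAL"])
    rec["ALT"] = rec["ALT"].split(",")
    # INFO is a ';'-separated list of KEY=VALUE pairs (or bare flags).
    info = {}
    for item in rec["INFO"].split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    rec["INFO"] = info
    # FORMAT names the ':'-separated values in every sample column.
    keys = rec["FORMAT"].split(":")
    rec["samples"] = {name: dict(zip(keys, col.split(":")))
                      for name, col in zip(sample_names, cols[9:])}
    return rec


line = "\t".join(["8_PB2", "407", ".", "A", "G", "3935.83", ".",
                  "AO=149;RO=21;TYPE=snp", "GT:DP:RO:QR:AO:QA:GL",
                  "1:170:21:788:149:5579:-5,0"])
rec = parse_vcf_record(line, sample_names=["8"])
print(rec["INFO"])
print(rec["samples"]["8"])
```

Values are left as strings here; the Type and Number given in the header lines say how each one should be converted.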

Variant Call Format

Variant Call Format
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 8
8_PB2 407 . A G 3935.83 . AO=149;RO=21;TYPE=snp GT:DP:RO:QR:AO:QA:GL 1:170:21:788:149:5579:-5,0

Each key in the FORMAT column is defined by a header line. For example, DP (170 for sample 8) is defined by:
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">

Variant Call Format
##INFO=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count, with partial observations recorded fractionally">
##INFO=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observations, with partial observations recorded fractionally">
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">

Variant Call Format
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
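Header lines like these follow a regular pattern, so they can be read into a lookup table. Below is a small Python sketch using a regular expression; it assumes the four attributes appear in the order ID, Number, Type, Description, as they do in the lines shown here, and the function name is hypothetical.

```python
import re

# Matches ##INFO and ##FORMAT lines with attributes in the order shown above.
META_RE = re.compile(
    r'##(INFO|FORMAT)=<ID=([^,]+),Number=([^,]+),Type=([^,]+),'
    r'Description="(.*)">'
)


def parse_meta_line(line):
    """Return a dict describing one ##INFO/##FORMAT line, or None."""
    match = META_RE.fullmatch(line.strip())
    if match is None:
        return None
    kind, key, number, value_type, description = match.groups()
    return {"kind": kind, "id": key, "number": number,
            "type": value_type, "description": description}


header = [
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">',
    '##phasing=none',
]
definitions = {}
for line in header:
    meta = parse_meta_line(line)
    if meta is not None:
        definitions[(meta["kind"], meta["id"])] = meta

print(definitions[("FORMAT", "DP")]["description"])
```

Number=1 means one value per sample, Number=A one value per alternate allele, and Number=G one value per possible genotype.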

Variant Call Format
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
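These four per-sample counts are enough to derive simple summary numbers. As an illustrative sketch (the function names are hypothetical, and this is not something freebayes itself reports), the Python below computes the fraction of observations supporting each alternate allele and the mean base quality behind the reference and alternate observations.

```python
def alt_allele_fractions(sample):
    """Fraction of observations supporting each alternate allele."""
    ro = int(sample["RO"])
    ao = [int(x) for x in sample["AO"].split(",")]  # Number=A: one per ALT
    total = ro + sum(ao)
    return [a / total if total else 0.0 for a in ao]


def mean_qualities(sample):
    """Mean quality of reference observations and of each alternate allele."""
    ro = int(sample["RO"])
    ref_mean = int(sample["QR"]) / ro if ro else None
    alt_means = [int(q) / int(a) if int(a) else None
                 for q, a in zip(sample["QA"].split(","),
                                 sample["AO"].split(","))]
    return ref_mean, alt_means


# Sample values from the 8_PB2 position 407 record.
sample = {"RO": "21", "QR": "788", "AO": "149", "QA": "5579"}
print(alt_allele_fractions(sample))
print(mean_qualities(sample))
```

For this record, 149 of 170 observations carry the alternate allele, which is consistent with the high QUAL value reported for the call.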

Variant Effect Prediction
snpEff
Variant Effect Predictor (EMBL)
SIFT

VCF after Effect Prediction
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 8
8_PB2 407 . A G 3935.83 . AO=149;RO=21;TYPE=snp;EFF=SYNONYMOUS_CODING (LOW SILENT gaa/gag E123 759 PB2 CODING Tr_PB2 1 1) GT:DP:RO:QR:AO:QA:GL 1:170:21:788:149:5579:-5,0

CHROM = 8_PB2
POS = 407
ID = .
REF = A
ALT = G
QUAL = 3935.83
FILTER = .
INFO = AO=149;RO=21;TYPE=snp;EFF=SYNONYMOUS_CODING (LOW SILENT gaa/gag E123 759 PB2 CODING Tr_PB2 1 1)
FORMAT = GT:DP:RO:QR:AO:QA:GL
8 = 1:170:21:788:149:5579:-5,0

VCF after Effect Prediction
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
##INFO=<ID=EFF,Number=.,Type=String,Description="Predicted effects for this variant. Format: 'Effect ( Effect_Impact Functional_Class Codon_Change Amino_Acid_change Amino_Acid_length Gene_Name Transcript_BioType Gene_Coding Transcript_ID Exon GenotypeNum [ ERRORS WARNINGS ] )' ">

INFO = AO=149;RO=21;TYPE=snp;EFF=SYNONYMOUS_CODING (LOW SILENT gaa/gag E123 759 PB2 CODING Tr_PB2 1 1)
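An EFF value can be unpacked into named sub-fields with a short Python sketch. Two assumptions are made here and should be checked against your own files: that the sub-fields inside the parentheses are separated by "|" characters (the separators are not visible in the record as printed above), and that Transcript_BioType is empty for this variant, since the record shows ten values for eleven field names. The function name is hypothetical.

```python
# Sub-field names as listed in the EFF header description.
EFF_FIELDS = ["Effect_Impact", "Functional_Class", "Codon_Change",
              "Amino_Acid_change", "Amino_Acid_length", "Gene_Name",
              "Transcript_BioType", "Gene_Coding", "Transcript_ID",
              "Exon", "GenotypeNum"]


def parse_eff(eff_value):
    """Split 'Effect(a|b|...)' into a dict; assumes '|' separators."""
    effect, _, rest = eff_value.partition("(")
    values = rest.strip().rstrip(")").split("|")
    parsed = {"Effect": effect.strip()}
    parsed.update(zip(EFF_FIELDS, (v.strip() for v in values)))
    return parsed


# Assumed pipe-delimited form of the annotation on 8_PB2 position 407;
# the empty Transcript_BioType slot is an inference, not from the source.
eff = "SYNONYMOUS_CODING(LOW|SILENT|gaa/gag|E123|759|PB2||CODING|Tr_PB2|1|1)"
for key, value in parse_eff(eff).items():
    print(key, "=", value)
```

Read this way, the annotation says the A to G change at position 407 leaves codon 123 of PB2 coding for glutamate (gaa to gag), so the variant is silent.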
