RNAseq and Variant discovery


1 RNAseq and Variant discovery

2 RNAseq
Gene discovery
Gene validation: training gene prediction programs
Gene expression studies
Paris japonica
Gene discovery
Understanding physiological processes
Dissecting signal/response pathways
Understanding disease causation and symptomatology

3 RNAseq library production
Add adaptors and sequence

4 RNAseq data: what next?
No reference genome
  De novo assembly: Trinity, Oases (contigs)
  Annotation: BLAST, BLAST2GO (functions?)
Reference genome available
  Map RNAseq reads to reference: TopHat, STAR
  Improve an existing genome annotation: Cufflinks, Cuffmerge
    Confirm gene predictions, identify novel variants, discover new genes
  Compare gene expression levels: Cuffdiff

5 Reference-based assembly vs. de novo assembly
http:// F1.gif

6 Mapping NGS sequences to a reference genome
Resequencing studies (DNA)
  Variant discovery (SNPs, indels)
  Structural variation: insertions, deletions, duplications, inversions, translocations
ChIPseq
MeDIPseq
Bisulfite sequencing (C → U → T; mC → C → C)
  Map to a bisulfite-converted reference
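Mapping bisulfite reads to a converted reference can be illustrated with a small in-silico conversion. This is only a sketch: the function name `bisulfite_convert` and the two-strand dictionary layout are assumptions for illustration, not the method of any particular bisulfite aligner.

```python
def bisulfite_convert(reference: str) -> dict:
    """Return fully converted copies of a reference sequence.

    Unmethylated C reads as T after bisulfite treatment and PCR, so reads
    from the forward strand are compared with a C->T converted reference.
    Reads from the reverse strand show the same change as G->A when written
    in forward-strand coordinates.
    """
    ref = reference.upper()
    return {
        "C_to_T": ref.replace("C", "T"),  # forward-strand conversion
        "G_to_A": ref.replace("G", "A"),  # reverse-strand conversion
    }


converted = bisulfite_convert("ACGTCCG")
print(converted["C_to_T"])
print(converted["G_to_A"])
```

A methylated C stays C in the read, so it shows up as a C/T difference against the converted reference at that position.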

7 BLAST: a common alignment tool
TOO SLOW! Different alignment algorithms are necessary
Burrows-Wheeler alignment
  Sequence database (genome) is transformed to produce an index (Burrows-Wheeler transformation)
  Individual sequence reads are searched against this index
STAR aligner (Dobin et al. 2012, Bioinformatics)
  Uncompressed suffix arrays
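Searching a read against a Burrows-Wheeler index can be sketched with a toy backward search. This is a simplified illustration, not how Bowtie or BWA are implemented: the helper names `build_fm_index` and `count_occurrences` are hypothetical, the index is built by naive suffix sorting, and occurrence counts are recomputed on every step instead of being precomputed.

```python
def build_fm_index(text: str):
    """Build a toy index: the BWT string plus the first row of each character."""
    text = text + "$"  # "$" is the end marker and sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    first_row = {}
    for rank, i in enumerate(suffix_array):
        first_row.setdefault(text[i], rank)  # first sorted row starting with this character
    return bwt, first_row


def count_occurrences(pattern: str, bwt: str, first_row: dict) -> int:
    """Count exact matches of pattern by backward search over the BWT."""
    lo, hi = 0, len(bwt)  # current range of sorted suffixes, half-open
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + bwt[:lo].count(ch)
        hi = first_row[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


bwt, first_row = build_fm_index("banana")
print(count_occurrences("ana", bwt, first_row))
```

The search cost depends on the read length rather than on scanning the genome, which is why this kind of index is much faster than BLAST for short reads.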

8 BWT of banana
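The transform of "banana" can be worked through with a naive implementation. This sketch sorts all rotations explicitly, which is fine for a short word but not for a genome; the function names and the use of "$" as the end marker are assumptions for illustration.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform by sorting all rotations of text + '$'."""
    text = text + "$"  # end marker, sorts before all letters
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)  # last column


def inverse_bwt(last_column: str) -> str:
    """Recover the original text by repeatedly prepending and re-sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(ch + row for ch, row in zip(last_column, table))
    original = next(row for row in table if row.endswith("$"))
    return original[:-1]  # drop the end marker


print(bwt("banana"))          # annb$aa
print(inverse_bwt("annb$aa"))  # banana
```

The transform groups identical characters together ("nn", "aa"), and it is fully reversible, so no sequence information is lost when the genome is indexed.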

9 TopHat2
Based on the Bowtie alignment engine
  Bowtie: matching with no gaps
  TopHat2: gapped matches (necessary for RNAs)
Aligns reads to a Burrows-Wheeler transformed index of the genome
  1st pass → non-gapped matches
  2nd pass → splits unmapped reads and attempts to align the fragments

10 The STAR Aligner
Start at the first base of the sequence read
Find the Maximal Mappable Prefix (MMP)
Repeat the process using the unmapped portion of the read
50x faster than other aligners
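The MMP idea can be sketched on a toy genome. This is a deliberately naive illustration using plain string search: STAR itself searches a suffix array and also handles mismatches, seed clustering and stitching, none of which is modelled here. The function names are hypothetical and positions are 0-based.

```python
def find_all(genome: str, pattern: str) -> list:
    """Return all 0-based start positions of pattern in genome."""
    positions, start = [], genome.find(pattern)
    while start != -1:
        positions.append(start)
        start = genome.find(pattern, start + 1)
    return positions


def maximal_mappable_prefix(read: str, genome: str):
    """Longest prefix of read that occurs exactly in genome, with its positions."""
    best_len, best_pos = 0, []
    for length in range(1, len(read) + 1):
        hits = find_all(genome, read[:length])
        if not hits:
            break
        best_len, best_pos = length, hits
    return best_len, best_pos


def split_read(read: str, genome: str) -> list:
    """Repeat the MMP search on the unmapped remainder of the read."""
    seeds, offset = [], 0
    while offset < len(read):
        length, positions = maximal_mappable_prefix(read[offset:], genome)
        if length == 0:
            offset += 1  # assumption: simply skip a base that maps nowhere
            continue
        seeds.append((offset, length, positions))
        offset += length
    return seeds


# Two "exons" separated by a run of T standing in for an intron.
genome = "AAACCCGGG" + "T" * 10 + "ACGTACGT"
print(split_read("CCCGGGACGTAC", genome))
```

A spliced read falls out as two seeds whose genome positions are separated by the intron, which is how the splice junction is located without a separate junction database.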

11 OUTPUTS
TopHat (Bowtie)
  .bam file (binary alignment/map)
  .sam file (sequence alignment/map)
Single .sam file entry:
M01478:116: AB84U:1:2101:18669: Chromosome_ M * 0 0 TGTGGAAGTCGTCAAGGGCTGTCGCTGAATTCTTGAAGTTTTCAGCCGGGTACCACGTGTCGTCTGGATCACATCCTTGCCATGCGACCTGGTATTGCAATGTCTTACTCCGGCCAAATAATCGGGACGCTAAAACTTTATCGACCAC HHHHHGGGGGHHHGHGHHF@BBDHHHHHHHHHGHG3HGHHGHGECFDHHHHHEHHGGGGGGGHHHHHHHHHHHHGGHGHHGGGGGGFGHHHFHEHHGGHHHHHHHHGGGGGGHHHHHHGGGGGGEEFGGGGGGGFGFAABCBFBBAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:148 YT:Z:UU NH:i:11 CC:Z:Chromosome_8.2 CP:i: HI:i:0

12 .sam fields
Field   Regular expression       Range           Description
QNAME   [^ \t\n\r]+                              Query pair NAME if paired; or Query NAME if unpaired
FLAG    [0-9]+                   [0,2^16-1]      bitwise FLAG (Section 2.2.2)
RNAME   [^ \t\n\r@=]+                            Reference sequence NAME
POS     [0-9]+                   [0,2^29-1]      1-based leftmost POSition/coordinate of the clipped sequence
MAPQ    [0-9]+                   [0,2^8-1]       MAPping Quality (phred-scaled posterior probability that the mapping position of this read is incorrect)
CIGAR   ([0-9]+[MIDNSHP])+|\*                    extended CIGAR string
MRNM    [^ \t\n\r@]+                             Mate Reference sequence NaMe; = if the same as <RNAME>
MPOS    [0-9]+                   [0,2^29-1]      1-based leftmost Mate POSition of the clipped sequence
ISIZE   -?[0-9]+                 [-2^29,2^29]    inferred Insert SIZE
SEQ     [acgtnACGTN.=]+|\*                       query SEQuence; = for a match to the reference; n/N/. for ambiguity; cases are not maintained
QUAL    [!-~]+|\*                [0,93]          query QUALity; ASCII-33 gives the Phred base quality
TAG     [A-Z][A-Z0-9]                            TAG
VTYPE   [AifZH]                                  Value TYPE
VALUE   [^\t\n\r]+                               match <VTYPE> (space allowed)
Example:
I8MVR:53: _dna:chromosome M * 0 0 TAACTACGAATACCTGTCGAT **%-**,00%-*-%---*-*- NM:i:7 XX:Z:C5T3C2T2CT2C XM:Z:h..H...h.H...x...h XR:Z:CT XG:Z:CT
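The field layout can be made concrete with a small parser. This is a minimal sketch that assumes a tab-separated alignment line with the eleven mandatory columns named as in the table; the example record is made up for illustration and is not taken from the course data. Real pipelines would normally use samtools or a library such as pysam instead.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}


def parse_sam_line(line: str) -> dict:
    """Split one SAM alignment line into named fields and optional tags."""
    columns = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(SAM_FIELDS, columns[:11]):
        record[name] = int(value) if name in INT_FIELDS else value
    tags = {}
    for item in columns[11:]:  # optional TAG:VTYPE:VALUE fields
        tag, vtype, value = item.split(":", 2)
        if vtype == "i":
            value = int(value)
        elif vtype == "f":
            value = float(value)
        tags[tag] = value
    record["TAGS"] = tags
    # ASCII code minus 33 gives the Phred base quality.
    qual = record["QUAL"]
    record["PHRED"] = [] if qual == "*" else [ord(c) - 33 for c in qual]
    return record


# Hypothetical record, for illustration only.
example = "read1\t99\tchr1\t100\t60\t5M\t=\t200\t105\tACGTA\tIIIII\tNM:i:0"
parsed = parse_sam_line(example)
print(parsed["RNAME"], parsed["POS"], parsed["CIGAR"], parsed["PHRED"])
```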

13 .sam flags
Bit     Decimal  Description
0x1     1        template having multiple segments in sequencing
0x2     2        each segment properly aligned according to the aligner
0x4     4        segment unmapped
0x8     8        next segment in the template unmapped
0x10    16       SEQ being reverse complemented
0x20    32       SEQ of the next segment in the template being reversed
0x40    64       the first segment in the template
0x80    128      the last segment in the template
0x100   256      secondary alignment
0x200   512      not passing quality controls
0x400   1024     PCR or optical duplicate
0x800   2048     supplementary alignment

14 .sam flag decoded
M01478:116: AB84U:1:2101:18669: Chromosome_ M * 0 0 TGTGGAAGTCGTCAAGGGCTGTCGCTGAATTCTTGAAGTTTTCAGCCGGGTACCACGTGTCGTCTGGATCACATCCTTGCCATGCGACCTGGTATTGCAATGTCTTACTCCGGCCAAATAATCGGGACGCTAAAACTTTATCGACCAC HHHHHGGGGGHHHGHGHHF@BBDHHHHHHHHHGHG3HGHHGHGECFDHHHHHEHHGGGGGGGHHHHHHHHHHHHGGHGHHGGGGGGFGHHHFHEHHGGHHHHHHHHGGGGGGHHHHHHGGGGGGEEFGGGGGGGFGFAABCBFBBAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:148 YT:Z:UU NH:i:11 CC:Z:Chromosome_8.2 CP:i: HI:i:0
409 = read paired; mate unmapped; read reverse strand; second in pair; not primary alignment
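The decoding of 409 is just a bitwise test against each flag bit (409 = 256 + 128 + 16 + 8 + 1). A minimal sketch, with the bit meanings taken from the SAM specification and a hypothetical helper name `decode_flag`:

```python
SAM_FLAG_BITS = [
    (0x1, "template having multiple segments in sequencing"),
    (0x2, "each segment properly aligned according to the aligner"),
    (0x4, "segment unmapped"),
    (0x8, "next segment in the template unmapped"),
    (0x10, "SEQ being reverse complemented"),
    (0x20, "SEQ of the next segment in the template being reversed"),
    (0x40, "the first segment in the template"),
    (0x80, "the last segment in the template"),
    (0x100, "secondary alignment"),
    (0x200, "not passing quality controls"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag: int) -> list:
    """Return the description of every bit that is set in a SAM FLAG value."""
    return [description for bit, description in SAM_FLAG_BITS if flag & bit]


for description in decode_flag(409):
    print(description)
```

For a paired-end read, "last segment in the template" corresponds to "second in pair", and "secondary alignment" to "not primary alignment".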

15 CIGAR format
I8MVR:104: _dna:chromosome M1I14M * 0 0 GGTTTTTTGGAAGAGTAGTTCGCGTTTCATTAATTAGTTATTTTTTAGTTTTTAAATAAAATAAAATTTTAAAAAAA
op  Description
M   Alignment match (can be a sequence match or mismatch)
I   Insertion to the reference
D   Deletion from the reference
N   Skipped region from the reference
S   Soft clip on the read (clipped sequence present in <seq>)
H   Hard clip on the read (clipped sequence NOT present in <seq>)
P   Padding (silent deletion from the padded reference sequence)
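A CIGAR string can be split into (length, operation) pairs and used to work out how many read bases and how many reference bases an alignment covers. The sketch below handles only the seven operations listed on the slide; the CIGAR strings in the example are made up for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP])")


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, op) pairs; '*' means unavailable."""
    if cigar == "*":
        return []
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def query_length(cigar: str) -> int:
    """Bases present in SEQ: matches, insertions and soft clips."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS")


def reference_span(cigar: str) -> int:
    """Reference bases covered: matches, deletions and skipped regions."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN")


print(parse_cigar("10M1I14M"))
print(query_length("10M1I14M"), reference_span("10M1I14M"))
print(query_length("3S5M100N5M"), reference_span("3S5M100N5M"))
```

In RNAseq alignments the N operation is the one to watch: it marks an intron skipped by a spliced read, so the reference span can be far longer than the read.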

16 Quantifying alignments
How many reads overlap a given interval on a chromosome (scaffold)?
How do these regions correspond to known genes? (.gtf file)
How many transcripts from my gene of interest?
How confident can I be about a variant call?

17 Annotate regions - GTF files
Chromosome_8.1 Cufflinks transcript gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM " "; frac " "; conf_lo " "; conf_hi " "; cov " ";
Chromosome_8.1 Cufflinks exon gene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "1"; FPKM " "; frac " "; conf_lo " "; conf_hi " "; cov " ";
Chromosome_8.1 Cufflinks exon gene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "2"; FPKM " "; frac " "; conf_lo " "; conf_hi " "; cov " ";
Chromosome_8.1 Cufflinks transcript gene_id "CUFF.2"; transcript_id "CUFF.2.1"; FPKM " "; frac " "; conf_lo " "; conf_hi " "; cov " ";
GTF fields
1. Sequence ID
2. Source
3. Feature
4. Start
5. End
6. Score
7. Strand
8. Frame
9. Attribute
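The ninth GTF field packs several key/value pairs into one column, so it usually needs a second parsing step. A minimal sketch, assuming Cufflinks-style attributes where every value is double-quoted and terminated by a semicolon; the function name is hypothetical.

```python
import re

ATTRIBUTE = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_gtf_attributes(attribute_column: str) -> dict:
    """Turn 'key "value"; key "value";' into a dictionary of strings."""
    return dict(ATTRIBUTE.findall(attribute_column))


attributes = parse_gtf_attributes(
    'gene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "1";'
)
print(attributes["gene_id"], attributes["transcript_id"], attributes["exon_number"])
```

All values come back as strings, so numeric attributes such as FPKM need an explicit `float()` conversion before they are compared.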

18 Variant Calling
The .bam/.sam file contains the aligned reads needed to call variants
Variant calls can't be extracted from the .bam file alone
Must provide the genome sequence
I8MVR:53: _dna:chromosome M * 0 0 TAACTACGAATACCTGTCGAT **%-**,00%-*-%---*-*- NM:i:7 XX:Z:C5T3C2T2CT2C XM:Z:h..H...h.H...x...h XR:Z:CT XG:Z:CT
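Why the genome sequence is needed can be seen in a toy pileup: a variant is only a variant relative to the reference base at that position. The sketch below is a naive majority-vote illustration with ungapped reads and made-up data; it is not the statistical model used by samtools or any other variant caller, and the function name and `min_depth` parameter are assumptions.

```python
from collections import Counter, defaultdict


def naive_snp_calls(reference: str, alignments: list, min_depth: int = 2) -> list:
    """Report positions where the most common read base differs from the reference.

    alignments is a list of (1-based leftmost position, read sequence) pairs,
    assumed to align without gaps.
    """
    pileup = defaultdict(Counter)
    for pos, seq in alignments:
        for offset, base in enumerate(seq):
            pileup[pos + offset][base] += 1
    calls = []
    for pos in sorted(pileup):
        depth = sum(pileup[pos].values())
        base, _count = pileup[pos].most_common(1)[0]
        ref_base = reference[pos - 1]  # the reference is required here
        if depth >= min_depth and base != ref_base:
            calls.append((pos, ref_base, base, depth))
    return calls


reference = "ACGTACGT"
reads = [(1, "ACGAAC"), (3, "GAACGT")]  # both reads carry A at position 4
print(naive_snp_calls(reference, reads))
```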

19 Today's exercises
RNAseq workflow
1. Create bowtie index of ref genome → .bwt2 files
2. Map RNA seqs to indexed reference → .bam files
3. Assemble transcripts → .gtf files
4. Merge assemblies → merged.gtf
5. Compare gene expression levels → .diff files
6. Retrieve IDs for differentially expressed genes
Variant discovery
1. Identify sequence variants → .vcf file

20 VCF file format
VCF: variant call format
##fileformat=VCFv4.1
##samtoolsVersion= (r982:295)
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias (v2) for filtering splice-site artefacts in RNA-seq data. Note: this version may be broken.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
scaffold C T DP=2;VDB=0.0060;AF1=1;AC1=2;DP4=0,0,1,1;MQ=20;FQ=-33 GT:PL:GQ 1/1:40,6,0:8
scaffold T G DP=2;VDB=0.0060;AF1=1;AC1=2;DP4=0,0,1,1;MQ=20;FQ=-33 GT:PL:GQ 1/1:40,6,0:8
scaffold A G DP=2;VDB=0.0060;AF1=1;AC1=2;DP4=0,0,1,1;MQ=20;FQ=-33 GT:PL:GQ 1/1:40,6,0:8
scaffold T C DP=2;VDB=0.0060;AF1=1;AC1=2;DP4=0,0,1,1;MQ=20;FQ=-33 GT:PL:GQ 1/1:40,6,0:8
scaffold G A 18. DP=3;VDB=0.0034;AF1=1;AC1=2;DP4=0,0,2,1;MQ=20;FQ=-36 GT:PL:GQ 1/1:50,9,0:15
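The INFO and per-sample columns are compact key/value encodings that are easy to unpack. A minimal sketch using the values from the first data row; the function names are hypothetical, and values are left as strings except where noted.

```python
def parse_info(info_column: str) -> dict:
    """Split 'KEY=value;KEY=value;FLAG' into a dictionary (flags map to True)."""
    info = {}
    for item in info_column.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def parse_sample(format_column: str, sample_column: str) -> dict:
    """Pair the FORMAT keys (e.g. GT:PL:GQ) with one sample's values."""
    return dict(zip(format_column.split(":"), sample_column.split(":")))


info = parse_info("DP=2;VDB=0.0060;AF1=1;AC1=2;DP4=0,0,1,1;MQ=20;FQ=-33")
sample = parse_sample("GT:PL:GQ", "1/1:40,6,0:8")

depth = int(info["DP"])
# DP4: ref-forward, ref-reverse, alt-forward, alt-reverse high-quality bases.
ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in info["DP4"].split(","))
# PL: Phred-scaled likelihoods for genotypes 0/0, 0/1, 1/1 (0 = most likely).
likelihoods = [int(x) for x in sample["PL"].split(",")]

print(depth, alt_fwd + alt_rev, sample["GT"], likelihoods)
```

Here the call is homozygous alternate (1/1), but it rests on only two reads (DP=2) with mapping quality 20, which is the kind of evidence worth checking before trusting a variant.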

21 Introducing Dr. Eric Rouchka
KBRIN Bioinformatics Core Director
Department of Computer Engineering and Computer Science, University of Louisville
Kentucky Biomedical Research Infrastructure Network