Genomics AGRY Michael Gribskov Hock 331


Genomics AGRY 60000 Michael Gribskov gribskov@purdue.edu Hock 331

Computing Essentials: Resources
In this course we will assemble and annotate both genomic and transcriptomic sequences.
We will use a wiki, http://rna.genomics.purdue.edu, to handle course logistics.
We will use computing facilities at the Rosen Center for Advanced Computing (RCAC), in particular the server scholar.rcac.purdue.edu.
You must have access to a computer, and preferably bring it to class on "computational days".
You must understand how to use your computer to connect to RCAC.

NGS Sequence Analysis: General Process
Simpler version of the first two bubbles of fig. 1 in Ekblom:
Sample Preparation -> Sequencing -> Data Cleaning -> Quality Control -> Contig Assembly -> Scaffold Assembly -> Validation and QC -> Annotation -> Draft Genome

Genome Assembly: Original plan for the human genome
Isolate chromosomes.
Clone in bacterial artificial chromosome (BAC) vectors.
Find the "golden path" to minimize sequencing.
Subclone BACs into plasmids and sequence using dideoxy chain-terminating nucleotides (Sanger sequencing).
Optimistically estimated to cost $3 billion and take 15 years.
Initiated in 1990; claimed to be the largest collaborative project.

Genome Assembly: Whole Genome Shotgun (WGS) Assembly
1998, Craig Venter: Why mess around with all the subcloning and tiling? Why not just fragment the whole genome randomly and sequence all the pieces?
1998, NIH: It'll never work. You have to sequence too many clones. You won't be able to put it together.
2000: Celera completes a draft sequence using the WGS approach. Funding: $300 million.

Genome Assembly: How much sequence do you need (ca. 2004)?
Lander ES, Waterman MS, Genomic mapping by fingerprinting random clones: a mathematical analysis, Genomics 2: 231-239 (1988)
Depends on genome size (G), sequence length (L), and number of sequences (N):
coverage = L N / G
In 1st generation sequencing, 16X coverage was a good target.
For the human genome: 15X coverage, 500-base reads, 3.3 x 10^9 bp
N = 15 x 3.3 x 10^9 / 500 = 99 million reads
= $3 billion @ $30/read, = $9.9 million @ $0.10/read
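The coverage arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function names `reads_needed` and `coverage` are made up for illustration, and the per-read prices are the two figures quoted above.

```python
import math


def reads_needed(target_coverage, genome_size, read_length):
    """Solve coverage = L * N / G for N, rounding up to whole reads."""
    return math.ceil(target_coverage * genome_size / read_length)


def coverage(read_length, n_reads, genome_size):
    """Coverage = L * N / G."""
    return read_length * n_reads / genome_size


# Human genome example from the slide: 15X, 500-base reads, 3.3 x 10^9 bp.
n = reads_needed(15, 3_300_000_000, 500)
print(f"reads needed: {n:,}")
for price_per_read in (30, 0.10):
    print(f"cost at ${price_per_read}/read: ${n * price_per_read:,.0f}")
```

The same `coverage` function can be applied to any data set once G, L and N are known; for a paired-end run, N is the total number of reads, counting both ends.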


Genome Assembly: Monascus purpureus
Used to make red yeast rice (beni-koji and ang-kak).
Also produces statins.
More on the wiki: http://rna.genomics.purdue.edu/2014_genomics_(agry60000)/sequence_data
Genome size about 50 Mb?
Has introns and other typical eukaryotic features.
Data:
  149,983,522 DNA reads (JGI), 150-base TruSeq paired-end
  230 k RNA reads

Genome Assembly: Illumina TruSeq System
[Diagram: universal adapter, primer, insert, primer, index adapter with bar code]
Paired-end reads, each 150 bases.
Vocabulary: paired-end, mate-pair, contig, scaffold
Ekblom, fig. 2 (partial)

Quality and Cleaning: TruSeq Adapters
Universal adapter, 58 bases, same for all sequences, primer location:
>TruSeq_Universal_Adapter
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>TruSeq_Universal_Adapter_Reversed
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
Index adapter, 63 bases, contains barcode, primer location:
>TruSeq_Index_Adapter-GTAGAG
TCGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTAGAGATCTCGTATGCCGTCTTGCTTG
>TruSeq_Index_Adapter-GTAGAG_Reversed
CAAGCAGAAGACGGCATACGAGATCTCTACGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC
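The "Reversed" universal adapter above is the reverse complement of the forward sequence, which is easy to confirm. A minimal sketch in Python follows; the helper name `reverse_complement` is illustrative and not part of any course tool.

```python
# Translation table for complementing DNA; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


UNIVERSAL = "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"
UNIVERSAL_REVERSED = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT"

print(len(UNIVERSAL))
print(reverse_complement(UNIVERSAL) == UNIVERSAL_REVERSED)
```

The same function gives the 22-base search strings used later for the grep checks, since an adapter read through from the other strand appears as its reverse complement.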

Quality and Cleaning What should we clean?

Quality and Cleaning: Is it important?
Depends on the assembler.
Depends on the level of contamination.
Depends on depth.
Zhou & Rokas, Molec. Ecol. 23, 1679-1700, 2014.

Quality and Cleaning: What is quality?
Introduced in the Phred (Phil Green, UWa) program.
Quality score: Q = -10 log10(ε), where ε is the expected error rate (probability of calling an incorrect base).
Q = 20 (ε = 0.01) is a commonly used cutoff.

Phred quality score   Error probability   Base call accuracy
10                    1:10                90%
20                    1:100               99%
30                    1:1,000             99.9%
40                    1:10,000            99.99%
50                    1:100,000           99.999%
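The table follows directly from the definition Q = -10 log10(ε). A short Python sketch of the conversion in both directions is shown here; the function names are chosen for illustration only.

```python
import math


def phred_to_error(q):
    """Error probability for a Phred quality score: eps = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(eps):
    """Phred quality score for an error probability: Q = -10 log10(eps)."""
    return -10 * math.log10(eps)


# Reproduce the rows of the table above.
for q in (10, 20, 30, 40, 50):
    eps = phred_to_error(q)
    print(f"Q{q}: error 1:{round(1 / eps):,}  accuracy {100 * (1 - eps):.3f}%")
```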

Genome Assembly: Sequence files (fastq)
Combines sequence and quality.
The beginning of a sequence is marked with @; the rest of the line has the sequence ID and documentation.
The quality section begins with +.
Quality values are converted to letters in the ASCII alphabet by adding 33 to the Phred quality: typical ASCII value = quality + 33.
Some companies use 64 instead of 33.

Quality and Cleaning Typical 1 st gen quality >jgi JGI_CAOP10014.rev JGI_CAOP10014.rev 15 11 13 11 13 20 19 19 26 35 42 40 37 37 37 37 35 35 35 40 40 35 32 32 35 35 42 42 37 35 35 35 35 35 40 42 34 28 28 26 24 23 29 28 30 28 33 30 29 29 30 33 25 29 26 26 26 30 32 35 33 33 29 30 33 31 31 21 21 21 26 26 33 33 32 32 32 35 33 42 42 35 30 35 35 35 37 44 42 35 35 35 31 31 24 24 15 15 15 33 31 35 31 37 31 35 35 37 42 42 41 41 41 41 41 42 42 42 42 47 47 44 50 37 35 35 37 50 42 44 44 44 42 42 33 33 21 21 21 33 33 35 35 35 35 35 35 37 37 37 35 41 41 41 41 41 41 44 42 37 35 37 35 35 35 35 35 35 35 41 41 33 33 21 21 21 24 24 24 33 33 35 42 42 33 33 18 33 33 35 33 33 33 33 35 35 37 37 37 33 21 33 33 50 50 50 50 44 41 35 42 35 35 35 42 37 44 44 42 42 37 35 35 35 33 50 37 27 27 33 37 35 37 37 42 42 50 37 35 35 35 50 37 37 35 33 33 33 21 21 19 33 24 24 27 33 33 37 33 33 27 27 27 33 33 42 42 42 42 42 42 37 37 44 50 50 33 33 27 23 23 23 23 27 30 33 33 50 37 37 27 27 21 33 33 35 33 33 33 33 37 42 42 42 42 42 42 42 42 35 35 35 22 25 13 13 15 36 33 35 35 35 35 42 37 44 44 42 33 33 27 33 33 37 33 35 35 37 37 44 37 37 21 21 18 36 18 21 21 37 37 44 50 50 50 50 27 27 27 42 33 35 35 37 35 41 41 42 35 35 35 35 35 39 33 33 27 27 27 31 31 35 35 35 35 31 33 24 35 35 42 50 50 37 37 30 30 30 50 42 44 47 33 33 17 17 17 33 33 44 39 37 37 37 37 37 44 44 50 44 44 44 44 44 35 35 35 35 37 37 39 35 33 28 23 23 31 26 24 27 21 22 22 36 33 28 28 23 27 23 37 37 42 37 42 30 30 37 48 42 30 30 30 33 28 24 24 21 21 21 25 21 31 21 21 21 28 25 36 27 23 23 19 20 28 30 33 42 33 29 29 33 33 22 22 22 31 31 37 42 42 33 33 28 28 28 31 33 37 37 34 34 34 30 30 26 19 19 16 16 22 18 18 15 25 25 42 42 44 30 27 27 22 19 21 17 17 18 30 27 28 28 27 28 27 27 19 13 18 23 18 20 9 10 10 14 19 27 27 18 18 17 14 12 9 9 18 23 23 21 22 20 20 20 31 31 28 22 20 18 22 19 29 23 27 27 32 27 27 27 21 21 20 20 20 14 12 9 13 12 15 25 27 20 20 22 13 11 8 8 15 20 19 24 18 14 14 17 9 9 9 18 18 30 17 17 13 15 17 13 13 11 11 11 9 14 9 8 8 10 12 9 
14 14 13 13 9 14 12 15 12 10 9 18

Quality and Cleaning: Fastq format
@HISEQ02:319:C22FKACXX:2:1101:1699:1972 1:N:0:GTAGAG
GACCCATCCATTGTTGGACAGCTGAAGACGGGACGATCGTGCTCGTGTTTTGAATGCGAGAATCCCTGCAGAGGCTGCCTGCTTCGGNNNNNNNNNNTCCTCGACAGCC
+
CCCFFFFFHHHHHJIJJJJGIJJJJJJJJJJJIIJIJJJIIJIIHAFGIJJEHHHHFFFDCDDDDDDCDDDDDDBBDDDDDDCCDDB##########++28<<@BB>BD
I = ASCII 73: Quality = 73 - 33 = 40; Quality = -10 log10(ε), so ε = 10^-4
# = ASCII 35: Q = 35 - 33 = 2; ε = 10^-0.2 = 0.63 = totally bogus
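The decoding shown for `I` and `#` can be applied to a whole quality string. The sketch below assumes the +33 offset described earlier (pass `offset=64` for data that uses 64 instead); the function names and the short example string are made up for illustration.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def count_low_quality(quality_string, cutoff=20, offset=33):
    """Number of bases whose Phred score is below the cutoff."""
    return sum(1 for q in decode_qualities(quality_string, offset) if q < cutoff)


example = "CCCFFHJI##"  # short made-up quality string
scores = decode_qualities(example)
print(scores)
print([10 ** (-q / 10) for q in scores])  # per-base error probabilities
print(count_low_quality(example), "bases below Q20")
```

In the read above, the run of `#` characters lines up with the run of N base calls, which is why those positions carry essentially no information.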

Quality and Cleaning: FastQC
Available on RCAC servers. You will use it.
[Figure: a good data set. Zhou & Rokas, Molec. Ecol. 23, 1679-1700, 2014.]

Quality and Cleaning: FastQC
[Figure panels, from Zhou & Rokas, Molec. Ecol. 23, 1679-1700, 2014:]
a, b: before and after quality trimming
c: sequence composition bias at 5' end
d: k-mer enrichment (adapter dimer?)
e: non-random priming in RNA-seq

Quality and Cleaning FastQC - Monpu1.genome.rawReads.fastq

Quality and Cleaning: Quick and Dirty check for primers
Universal primer reverse, first 22 bases: expected for read 2 (from index adapter).
99.99% are in read 2 (62228/62232).
[Diagram: universal adapter, forward primer, reverse primer, index adapter]

Quality and Cleaning: Quick and Dirty check for primers
Sequence: Monpu1.genome.rawReads.fastq
TruSeq index adapter (forward), first 22 bases:
TCGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTAGAGATCTCGTATGCCGTCTTGCTTG
92% are read 1.
[Diagram: universal adapter, forward primer, reverse primer, index adapter]


Quality and Cleaning: Quick and Dirty check for primers
Sequence: Monpu1.genome.rawReads.fastq
TruSeq index adapter (forward), first 22 bases:
TCGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTAGAGATCTCGTATGCCGTCTTGCTTG
TruSeq index adapter (reverse):
CAAGCAGAAGACGGCATACGAGATCTCTACGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC

Quality and Cleaning: Quick and Dirty check for primers
Sequence: Monpu1.genome.rawReads.fastq
TruSeq universal primer:
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
Command:
  grep AATGATACGGCGACCACCGAGA Monpu1.genome.rawReads.fastq | more
Output:
AATGATACGGCGACCACCGAGATCTCGGATGCCGTCTTCTGCTTGAAAAATTAAGGATGATGAACTGCCGCGCAAGATCTTGTTAGAAATCTTGCTGCTGCGGGTACTTTCGGGGGAAATATTTCCTTGCAATCGGGGCCGAGCTTTGGG
AATGATACGGCGACCACCGAGATCTACACTCTTTCTTCTTCTACTTCTCCTCCTTAACCACTCTCCTCTTTTCTCTTTCTACTTCTCCTTCTACCACTCTTCTACCACTTCTCCCTTTTCTCCCTCTCTGTCTTCCTCCACTTCTCCTTC
AATGATACGGCGACCACCGAGATCTACACTCCTTCCCCAACCACTCCTCCACTCTTCCCCTTCTACTTCTCCTCCCCCACCACTTCTTTACTTCACCTCTCTTACACCTCCCATCTTTCTTCTTCTTCTCATCCTTCTCCTTCTACCACC
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCGTCCGTTCCCTACGCTCCATATTTCTCAACCCCCCGGCCTTGGACGGGGGGGGCGGACCGGCCCGGCGGAGCCCACGCGGCGCAGCTTGCTGCTCCTCGTGGTCGCGGCAACA
AATGATACGGCGACCACCGAGATCTCGCATGCCGTCTTCTGCTTGAAAAATAAAGCCGTAGAGGGAGAGCGGATGGTCGACGTTGTGCAGCAACCGGCACGGCATGCTGGCGTTGGTGGTGGTCACGGAGTGGGAGACGGTTTAGGGAAG
AATGATACGGCGACCACCGAGATCTACACTCTTCTCCTTCTCCTCTTTCTTCTTCTCCTTCTACCACTCTTCTCTTCTCCTTCTACTTCTTCTTCTTCTCCTTCTCCCTAACCACTCTTTCACTTCTCCTTCCTCCTCTTCTCCCTTACC
AATGATACGGCGACCACCGAGATCTCTACGTGTCTTGACATCCCCCGCCTCCTCTTCCGACCCAATCCCTTTTTCAAAAACACCCCAGCGGGTGGGGAGGAACACCCTACACTTCCTTCCACCCCACCCTTTCCCAAACACAAAACCTCC
AATGATACGGCGACCACCGAGATCTCGTATGCCGTCTTCTGCTTGAAAAAAATCAAACAGATGTCCGCGACGTCGCAACGCCCCGTTTCGCAGCCGTCGGCTCGGGAACCTGCCCAAGCACACCAACAGACGGCAAGCCACCATCACGAA
AATGATACGGCGACCACCGAGATCTTCACCCTTTCCCTTCACATCTAAATCCACTCAGGTGGATACCAAATCGTTCTTTTTCAATTCTCCCCCCTCCCCCCGTACATCTTCGTACTTTTACTCACGTGCTACTGTCACGCACTCGTCCAC
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCTGCTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTAGAGATCTCGTATGCCGTCTTCTGCTTGAAAAAAACAACACAACAGCAGCGTCTGCC
AATGATACGGCGACCACCGAGATCTACACTCTATATGGTATCCCTGGAGGATCAATCCTCGCATACACGCAAGGATTGATCCTCCAGGGATACCATATAGAGTGTTCCGATCAATGTGTTACGGCATAGAGGGATGTAAGGAATGCAGCG
AATGATACGGCGACCACCGAGATCTACACTATTTTCTCCGTTCTGAGCTCTTACTGCTCTTACTGGTTCACCAGGTGTCCATCTGGGTCGGCGTTTTTGGGGAATCAATGCCTACGTATTAATATCATGTGCCCTCTAAAGACTGTTTTT
AATGATACGGCGACCACCGAGATCTACACTCCTTTCCTCCTTCCTTCCCTCTTATTTGATCTTTTTGTATTTAAACATGGAGTAGAGTGCAGTAAAATTTTAAGACCTTCCTTTATATTAGTAATAAAGATTATTAAATACGCTGGAAGC
AATGATACGGCGACCACCGAGATCTACACTCTTTCTTCTCCTGACTACCTTTCCCCTCATGTTCGTGACCCTGCTCTTCCCTGGCTTCACTTTCTGGGTCGCCCAAAAAAGCGCCGCCCGCAAAGACTGCGGGCTACCACGCTACATCCT
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCATACCAGCCCAGACACCCCTCCACCTACTGTCAGCGAGAAAAGGTTAAAATGATGGAGCTTCTGAAAACACACCTGAATAACATCAAGATCCTCTGCACGCGCGCGGACAAAC
AATGATACGGCGACCACCGAGATCTTCACTCTTTTCTTGCTCTTCTGATAATCTGGTGTTGGTTGGTGGTCGTTCTCATAGATTTGATCTTGCGCTTCAACCGGCGGTGGCGGTCCCGGCCTGCGGCTGGAGCGCTCGTGGCAACCGTCC
AATGATACGGCGACCACCGAGATCTACACTCTGTCCCCACAGCTACGCTCTTACTCAAATCCTGATGTCTTCGGATGGATTTGAGTAAGAGCGGAGCTGTGGGGACCCGGAAGATGGTGGAGCATTGCTCAATATGGCGCAACAAGAGGA
AATGATACGGCGACCACCGAGATCTACACTCTCCCCCTATGCATAGCTCCGACGTTGACGAGAAGGGTACTCTTCGCTCCCATCCCCTCGTCCTCCCCCCCAAAACCGTTCTGCGAGTGACCCCGCTGGCCCCTTTGTGCGCCGCACACC
AATGATACGGCGACCACCGAGATCTACACTCTTTCTCTCTATTTATTCTTTATTTTTCTCTTTTTTCATTTCTCTTCCTCCCCAACCACTCTCTCTTTTTTCCACTTCTCTTTCCCCTTTAATATTCTATTTTTCTTCTCTTTATTCATT
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCCACCTCATCCTTCGCAAGCCACACCCCGTACAAACTACCCAGCTCTTATTTTCTCCTCGCGGTTGTTGGGGGCTGGTCGCCCCCGGGGCTCGGGCGCCGCTTTCGCCTCCCGA
AATGATACGGCGACCACCGAGATCTACACTCTATCTCCTCAACCAGTCGAGGAGATAGAGGGTCAGACACTTCCGGTTCAGGGTCCTTGGGGCATGTTCCGGGCTAGGGCTGTGGGGGGTGGGGGTGATGCTGCTATTCTTCTTGGCACG
AATGATACGGCGACCACCGAGATCTACACTCTTTTCTTAGCCCCTCTAAAGTCTTTTGATCTTGGGGTTGGGGCTGTCGTGTAAACCGAAACATCAAAGTGAGCCACTGGCAAAAAAACTTTTTACCAACCCTGCCTCCAGACCCACAAA
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCTGCTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTAGAGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAACAAACTAACATTCCTAGACCG
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCCCCACCCCCCCCCTATCTTTGTCACAACTGCACCTACAACCCCCGCCCCTCTCCTTTTCGGGGGCAGGATGCGCCCACACTTTCTCTTGTTTAATACAGTTCTTTTCCACCCC
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCCGCGACTGTAATTCGTCAAAGCCTCGACCTTTTCTCTTTGGAATATTAGCTTCCTGTCTTCTTTTTTCTTCTTCTCATTCTCCCTCACCTCATAATTCTGTCTTCAACATATA
AATGATACGGCGACCACCGAGATCTACATTCCTTTTCTCCCTTCGTCTTAGTCCTACGTCGACTTGGTGAAGTCGACGTAGGACTAAGACGAAGGGAGAAAAGGAATGTTAGCAAGTTCCGCGCGTTTAACGCTAGGAAAGGAGAGGAAA
CAAGCAGAAGACGGCATACGAGATCTCTACGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTAATGATACGGCGACCACCGAGATCTCGTATGCCCGCTTCTGCTTGAAAAAAAAAAAAAGAGGCGACGAGACGTACGAAGACCGCAC
AATGATACGGCGACCACCGAGATCTACACTCTTTCTTCTCATCATCGTCCTCCCTTTCGAAGTAGGGATGCATCTTTTTTGGCCCTTTTAGCTTTGTGCTGAAAAATACTATGTTTCTCATAATTCTTTTGTGAACACCATCCACTCCAC

Quality and Cleaning: matches to reverse of index adapter
[Diagram: universal adapter, forward primer, reverse primer, index adapter]

Quality and Cleaning: reverse of index adapter
The adapter should end 41 bases to the right.
Quality characters 5-9, :;<=>?@ and A-Z correspond to 1% error or less (Q >= 20).
CAAGCAGAAGACGGCATACGAGATCTCTACGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC

Quality and Cleaning: How many have exact 22-base matches?
universal adapter, forward:
  grep -c AATGATACGGCGACCACCGAGA Monpu1.genome.rawReads.fastq
  2694
universal adapter, reverse:
  grep -c AGATCGGAAGAGCGTCGTGTAG Monpu1.genome.rawReads.fastq
  62232
index adapter, forward:
  grep -c GATCGGAAGAGCACACGTCTGA Monpu1.genome.rawReads.fastq
  95109
index adapter, reverse:
  grep -c CAAGCAGAAGACGGCATACGAG Monpu1.genome.rawReads.fastq
  6887
2694 + 62232 + 95109 + 6887 = 166,922 / 149,983,353 total reads = 0.11%
Most are what is expected for small or no insert in the expected orientation.
What about mismatches?
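The four grep counts can also be collected in a single pass over the file. The Python sketch below is one possible way to do it, not the method used in class: it assumes every FASTQ record occupies exactly four lines (no wrapped sequences), and the function name and dictionary keys are invented for illustration. Unlike grep, it looks only at sequence lines.

```python
import io

# First 22 bases of each adapter, as used in the grep commands above.
ADAPTER_PREFIXES = {
    "universal_forward": "AATGATACGGCGACCACCGAGA",
    "universal_reverse": "AGATCGGAAGAGCGTCGTGTAG",
    "index_forward": "GATCGGAAGAGCACACGTCTGA",
    "index_reverse": "CAAGCAGAAGACGGCATACGAG",
}


def count_adapter_reads(handle, prefixes=ADAPTER_PREFIXES):
    """Count reads containing each exact adapter prefix.

    Returns (counts, total_reads). Assumes four lines per record,
    so the sequence is always the second line of each group of four.
    """
    counts = {name: 0 for name in prefixes}
    total = 0
    for line_number, line in enumerate(handle):
        if line_number % 4 != 1:
            continue  # skip header, '+' and quality lines
        total += 1
        seq = line.strip()
        for name, prefix in prefixes.items():
            if prefix in seq:
                counts[name] += 1
    return counts, total


# Tiny made-up example; for real data pass open("Monpu1.genome.rawReads.fastq").
demo = io.StringIO(
    "@r1\nAATGATACGGCGACCACCGAGATCTACAC\n+\n" + "I" * 29 + "\n"
    "@r2\nACGTACGT\n+\n" + "I" * 8 + "\n"
)
counts, total = count_adapter_reads(demo)
print(counts, total)
```

Exact matching misses adapters that contain a sequencing error, which is the point of the "What about mismatches?" question on the slide.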

[Figure: 5' adapter matches vs. length. Series: Universal forward, Universal reverse, Index forward, Index reverse, Random Expected. x axis: length (0 to 70); y axis: matches (0 to 5000).]

RCAC Modules
On RCAC servers many bioinformatics tools have been installed. These are referred to as modules. This is a system specific to RCAC and Purdue.
Module commands:
  module avail: show available modules. To see available modules you must first run module use /apps/group/bioinformatics/modules. Put this in your .bash_profile.
  module load: load a module
  module list: show currently loaded modules
  module show: show details about the installation of a module
The module system makes it unnecessary to set specific paths, environment symbols and program names on your own.
Because bioinformatics modules change rapidly, multiple versions are often available.

RCAC Modules: module avail
When there are different versions, one is the default.
The default is sometimes, but not always, shown.
[Screenshot: module avail output showing different versions]

RCAC Other programs
You are not limited to modules; you can download and run programs on your own. This is the main use for your home directory.
For instance sickle, which is available on GitHub.

RCAC Other Programs
I want to download onto the scholar server, not my PC.
Option one: download on PC/Mac and transfer.
Option two: right-click on the Download ZIP button to copy the URL https://github.com/najoshi/sickle/archive/master.zip
In my home directory, type:
  wget https://github.com/najoshi/sickle/archive/master.zip
Unzip the resulting file.
[Screenshot: result]

RCAC Other programs
Read the file README.md. This will tell you, amongst a lot of other stuff:
  To build Sickle, enter: make
After running make I have a new file called sickle.
Try sickle -h or sickle --help; this will usually give you some brief directions.
Also look for a doc directory.
Also look for files named README or MANUAL.

RCAC Batch jobs not using a module
The RCAC servers are not designed to run jobs interactively. Instead, jobs are submitted to a queuing system called PBS (or Torque).
Since you cannot run jobs on the frontend systems, you will need to make job files to submit your jobs.

RCAC Batch job using a module

Assignment: Adapter Cleaning
See the wiki page: http://rna.genomics.purdue.edu/2014_genomics_(agry60000)/ngs_data_cleaning
For 5 pts, choose one of the installed module programs and run it on the Monascus sequences. Be prepared to explain why you chose the settings you chose.
For 10 pts, install and run one of the non-module adapter cleaners. You may use one not on the wiki page. Explain why you chose the settings.
You may do both of the above.
Check how well it worked:
  Using grep as shown in class
  Using FastQC (installed module, run as batch job)
Upload relevant information onto your group page.
Add comments about download sites and papers to the wiki page.