SANBio BIOINFORMATICS TRAINING COURSE THE MICROBIOME: ANALYSIS OF NGS DATA CBIO-PIPELINE SAMSON, KM

1 SANBio BIOINFORMATICS TRAINING COURSE THE MICROBIOME: ANALYSIS OF NGS DATA CBIO-PIPELINE SAMSON, KM 10/23/2017 Microbiome : Analysis of NGS Data 1

2 Outline
Background
Wet lab
Raw reads
Quality assessment
Quality control
Merging and filtering
OTU picking
Decontamination, annotation and BIOM

3 WET LAB
Garbage in, garbage out.
Good laboratory practice is needed to produce reliable data for downstream processing.
If we mess up in the wet lab, it cannot be corrected in the dry lab.

4 1. PCR and sequencing flow chart
Stage 1: PCR
Stage 2: QC analysis
Stage 3: Sequencing

5 Why the V4 region of 16S rRNA?
Pros:
Well-established protocols
Full overlap of forward and reverse reads
Fewer errors during assembly
Greatly reduced sequencing noise
Cons:
Hypervariable regions only
Less information
Limited resolution in Bacillus*

6 2. Raw Sequence Reads Quality Assessment

7 Raw sequences
A FASTQ file always has 4 lines per sequence.
The first line starts with @ and shows the sequence ID and an optional description.
The second line contains the sequence of nucleotides.
The third line generally holds only a + symbol and, occasionally, the same ID and description as the first line.
The fourth line encodes the quality score of each nucleotide on the second line, that is, the probability of a sequencing error at each position.
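The four-line record layout described on this slide can be read with a few lines of Python. This is a minimal sketch rather than part of the CBIO pipeline: the function name `read_fastq` and the sample record are made up for illustration, and it assumes strictly four-line (unwrapped) records.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        # Line 1 must start with '@', line 3 with '+'.
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        # Line 4 carries exactly one quality symbol per base on line 2.
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality

# Hypothetical record, for illustration only.
example = io.StringIO("@read1 demo\nACGT\n+\nIIII\n")
records = list(read_fastq(example))
```

Each yielded tuple holds the ID line without its leading @, the bases, and the still-encoded quality string.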

8 For example, if the probability of an error (p) equals 0.01, the corresponding quality score (Q) is 20; if p = 0.001, then Q = 30.
Quality values are encoded as ASCII characters so that each score takes a single symbol rather than two or three digits.
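The two examples on this slide agree with the standard Phred definition Q = -10 x log10(p). The sketch below shows the conversion in both directions and decodes a quality string. The ASCII offset of 33 (Phred+33) is an assumption, since the slides do not state which encoding the reads use, and the function names are illustrative.

```python
import math

def error_prob_to_phred(p):
    """Phred quality score for an error probability p: Q = -10 * log10(p)."""
    return -10.0 * math.log10(p)

def phred_to_error_prob(q):
    """Error probability for a Phred quality score q: p = 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)

def decode_quality(quality_string, offset=33):
    """Turn a FASTQ quality string into integer scores.

    ASSUMPTION: Phred+33 encoding; older Illumina data used an offset of 64.
    """
    return [ord(ch) - offset for ch in quality_string]

q20 = round(error_prob_to_phred(0.01))    # p = 0.01, as on the slide
q30 = round(error_prob_to_phred(0.001))   # p = 0.001, as on the slide
scores = decode_quality("I5+")            # arbitrary example symbols
```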

9 VISUALIZE FASTQ FILE SEQUENCE QUALITY
FastQC package (Andrews S., 2010)
    fastqc_base/fastqc --extract $fastq -f fastq -o $out_dir -t $fastqc_threads
    fastqc --extract -f fastq -o $fastqc_dir -t 6 $raw_reads_dir/*
    fastqc_combine_base/fastqc_combine.pl -v --out $out_dir --skip --files "$out_dir/*_fastqc"

10 Raw Sequences: Sample Dog8_R1

11 Raw Sequences: Sample Dog8_R2

12 What about this quality?

13 3. Processing of 16S rRNA NGS data

14 Some tools available
CBIO-PIPELINE integrates some tools from UPARSE and QIIME to process NGS microbiome data.

15 3.1 Merging paired-end reads
The UPARSE pipeline uses USEARCH commands (Edgar, 2010).
usearch9 fastq_mergepairs; maxdiff=3
R1:     ATGGATCCCGGAGGGGCGCGAAAAGAGAGAGATTCTCC...300bp
R2:     300bp..ATGGATCCCTGAGGCGCGCGAAAGGAGAGAGATCTCTCC
Merged: ATGGATCCCTGAGGGGCGCGAAAGGAGAGAGATCTCTCC
If a base differs between R1 and R2, the base that appears in the merged sequence should have a quality score 3 times higher than the other; otherwise the position is called N (ambiguous).
If R1 and R2 differ at more than 3 positions, the pair is rejected.
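The two rules stated on this slide can be sketched as follows. This follows the slide's description only and is not USEARCH's actual implementation: it assumes the overlapping parts of R1 and R2 are already aligned, in the same orientation and of equal length, and the name `merge_overlap` and the parameter `min_quality_ratio` are hypothetical.

```python
def merge_overlap(fwd, rev, fwd_q, rev_q, maxdiff=3, min_quality_ratio=3.0):
    """Merge two pre-aligned overlap strings using the slide's rules.

    Returns the merged string, or None when the pair has more than
    `maxdiff` mismatches and is therefore rejected.
    """
    if not (len(fwd) == len(rev) == len(fwd_q) == len(rev_q)):
        raise ValueError("sequences and quality lists must have equal length")
    diffs = sum(1 for a, b in zip(fwd, rev) if a != b)
    if diffs > maxdiff:
        return None  # too many differences: reject the pair
    merged = []
    for a, b, qa, qb in zip(fwd, rev, fwd_q, rev_q):
        if a == b:
            merged.append(a)
        elif qa >= min_quality_ratio * qb:
            merged.append(a)   # forward base is clearly more reliable
        elif qb >= min_quality_ratio * qa:
            merged.append(b)   # reverse base is clearly more reliable
        else:
            merged.append("N")  # neither base wins: ambiguous call
    return "".join(merged)

# Toy inputs, for illustration only.
accepted = merge_overlap("ACGTAC", "ACCTAC", [30] * 6, [30, 30, 8, 30, 30, 30])
ambiguous = merge_overlap("ACGTAC", "ACCTAC", [30] * 6, [30] * 6)
rejected = merge_overlap("AAAAAA", "CCCCAA", [30] * 6, [30] * 6)
```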

16 Merged summary output
Fwd /researchdata/fhgfs/cbio/cbio/courses/ibs5003z/samson/uparse/renamed/dog10_r1.fastq
Rev /researchdata/fhgfs/cbio/cbio/courses/ibs5003z/samson/uparse/renamed/dog10_r2.fastq
Totals:
  Pairs (79.3k)
  Merged (70.3k, 88.59%)
  Alignments with zero diffs (62.90%)
  8990 Too many diffs (> 3) (11.33%)
  0 Fwd tails Q <= 2 trimmed (0.00%)
  174 Rev tails Q <= 2 trimmed (0.22%)
  0 Fwd too short (< 64) after tail trimming (0.00%)
  38 Rev too short (< 64) after tail trimming (0.05%)
  27 No alignment found (0.03%)
  0 Alignment too short (< 16) (0.00%)
  Staggered pairs (99.75%) merged & trimmed
  Mean alignment length
  Mean merged length
  0.29 Mean fwd expected errors
  2.23 Mean rev expected errors
  0.03 Mean merged expected errors

17 3.2 Filtering merged reads
Filtering generally involves three steps:
Filtering on the error contribution of each nucleotide base (maxee)
Primer stripping (nowadays done by the sequencing platform)
Length truncation
Filtering based on maximum expected error (maxee = 0.1): uparse_filter_fastq_maxee=0.1
This is the maximum expected error allowed per nucleotide in a DNA sequence. A 100 bp sequence is therefore rejected only if its total expected error is > 10 (0.1 x 100), and a 250 bp sequence only if it is > 25.
What if maxee = 0.5? A 250 bp sequence would be rejected only if its total expected error were > 250 x 0.5 = 125!
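The arithmetic on this slide can be checked with a short sketch. The expected number of errors in a read is the sum of the per-base error probabilities, 10^(-Q/10). The filter below follows the slide's per-base reading of the threshold; USEARCH itself offers both a per-read total (`-fastq_maxee`) and a per-base rate (`-fastq_maxee_rate`), and the slides do not say which one the pipeline parameter maps to. Function names and the toy reads are illustrative.

```python
def expected_errors(phred_scores):
    """Sum of per-base error probabilities, p = 10 ** (-Q / 10)."""
    return sum(10.0 ** (-q / 10.0) for q in phred_scores)

def passes_filter(phred_scores, maxee_per_base=0.1):
    """Keep a read if its total expected error is at most maxee_per_base * length.

    ASSUMPTION: the threshold is a per-base rate, as explained on the slide.
    """
    return expected_errors(phred_scores) <= maxee_per_base * len(phred_scores)

good_read = [30] * 250  # every base Q30: 250 * 0.001 = 0.25 expected errors
poor_read = [5] * 250   # every base Q5: roughly 79 expected errors

keep_good = passes_filter(good_read)             # 0.25 <= 25
keep_poor = passes_filter(poor_read)             # ~79 > 25
keep_poor_lenient = passes_filter(poor_read, 0.5)  # ~79 <= 125
```

With the threshold raised to 0.5, even a read made entirely of Q5 bases passes, which is the slide's point about overly lenient settings.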

18 Was Quality Control Effective?

19 3.3 FastQC of Merged, Trimmed and Filtered Reads

20 4. Uparse_downstream

21 4.1. De-replication
Full-length de-replication is done to find the set of unique sequences. Sequences are compared letter by letter.
Sample result:
>524e5df45a66fb616ef4a553473dd833dedff0ca;size=2;
AACACAGGGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGTCTGTTAAGTCGGATG
TGAAATGTAAGGGCTCAACCCTTAACGTGCATCCGATACTGGCAGACTTGAGTGCGGAAGAGGCAAGTGGAA
TTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTGCTGGGCCGTA
ACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
>459142c21f9a9981d43f98e53cc276b781ad2c6a;size=5;
AACATAAGGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGTCTGTTAAGTCGGATGT
GAAATGTAAGGGCTCAACCCTTAACGTGCATCCGATACTGGCAGACTTGAGTGCGGAAGAGGCAAGTGGAAT
TCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTGCTGGGCCGTAA
CTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
>4ea702c69516c b5d1a3b59b5d9c6;size=1;
AACATAGAGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGTCTGTTAAGTCGGATGT
GAAATGTAGGGGCTCAACCCTTAACGTGCATCCGATACTGGCAGACTTGAGTGCGGAAGAGGCAAGTGGAAT
TCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTGCTGGGCCGTAA
CTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
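Full-length de-replication amounts to counting identical strings. The sketch below is an illustration, not the USEARCH implementation. The labels in the sample result look like 40-character hexadecimal digests, so the sketch labels each unique sequence with its SHA-1 digest; that choice is an assumption, as is the function name `dereplicate`.

```python
import hashlib
from collections import Counter

def dereplicate(sequences):
    """Collapse identical sequences into (label, sequence) records.

    Labels follow the 'digest;size=N;' pattern seen in the sample result.
    ASSUMPTION: the digest is the SHA-1 of the sequence text.
    """
    counts = Counter(sequences)  # exact, letter-by-letter comparison
    records = []
    for seq, size in counts.most_common():  # most abundant first
        digest = hashlib.sha1(seq.encode("ascii")).hexdigest()
        records.append((f"{digest};size={size};", seq))
    return records

# Toy reads, for illustration only.
reads = ["ACGT", "ACGT", "ACGA", "ACGT"]
unique = dereplicate(reads)
```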

22 4.2. Sort sequences by size
usearch9 -sortbysize command, min_size=2
Sample result:
>2fd264476fe1367dbe062db6e5bdcc7d384a8487;size=190716;
TACGTAGGGGGCTAGCGTTATCCGGATTTACTGGGCGTAAAGGGTGCGTAGGCGGTCTTTCAAGTCAGGAGTTAAAGGCTAC
GGCTCAACCGTAGTAAGCTCCTGATACTGTCTGACTTGAGTGCAGGAGAGGAAAGCGGAATTCCCAGTGTAGCGGTGAAATG
CGTAGATATTGGGAGGAACACCAGTAGCGAAGGCGGCTTTCTGGACTGTAACTGACGCTGAGGCACGAAAGCGTGGGGAGC
AAACAGG
>dfcca28a6795cdd3c43b2fbfd5d1f7f64ead1fa8;size=161971;
TACGGAAGGTCCAGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGCAGGCGGACTCTTAAGTCAGTTGTGAAATACGGC
GGCTCAACCGTCGGACTGCAGTTGATACTGGGAGTCTTGAGTGCACACAGGGATGCTGGAATTCATGGTGTAGCGGTGAAAT
GCTCAGATATCATGAAGAACTCCGATCGCGAAGGCAGGTATCCGGGGTGCAACTGACGCTGAGGCTCGAAAGTGCGGGTATC
AAACAGG
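Sorting by size with a minimum size of 2 drops singletons and puts the most abundant unique sequences first. A minimal sketch, assuming labels carry a `;size=N;` annotation as in the sample result; the function name and the toy records are made up.

```python
import re

SIZE_RE = re.compile(r";size=(\d+);")

def sort_by_size(records, min_size=2):
    """Keep records with size >= min_size, ordered by decreasing size."""
    sized = []
    for label, seq in records:
        match = SIZE_RE.search(label)
        if match is None:
            raise ValueError(f"no size annotation in label {label!r}")
        size = int(match.group(1))
        if size >= min_size:
            sized.append((size, label, seq))
    sized.sort(key=lambda item: item[0], reverse=True)
    return [(label, seq) for _, label, seq in sized]

# Toy records, for illustration only.
records = [("a;size=1;", "ACGA"), ("b;size=5;", "ACGT"), ("c;size=2;", "ACGG")]
kept = sort_by_size(records)
```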

23 4.3. De novo OTU picking
usearch9 cluster_otus, otu_radius_pct=3
Performs 97% OTU clustering using the UPARSE-OTU algorithm (Edgar, R.C., 2013).
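To make the idea of a 3% radius concrete, here is a deliberately simplified greedy centroid clustering. It is not the UPARSE-OTU algorithm, which uses proper alignments and also handles chimeras. This sketch assumes equal-length, pre-aligned sequences supplied in order of decreasing abundance and measures identity as the fraction of matching positions. All names are illustrative.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this simplified sketch needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(sequences_by_abundance, radius_pct=3.0):
    """Assign each sequence to the first centroid within the radius.

    A sequence that matches no existing centroid becomes a new centroid (OTU).
    """
    threshold = 1.0 - radius_pct / 100.0
    centroids = []
    assignment = {}
    for seq in sequences_by_abundance:
        for index, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                assignment[seq] = index
                break
        else:
            centroids.append(seq)
            assignment[seq] = len(centroids) - 1
    return centroids, assignment

# Toy 100 bp sequences, for illustration only.
base = "ACGT" * 25
one_off = "T" + base[1:]   # 1 mismatch: 99% identical to base
far = "TTTTT" + base[5:]   # 4 mismatches: 96% identical to base
centroids, assignment = greedy_cluster([base, one_off, far])
```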

24 4.4. Chimera detection and removal
usearch9 -uchime2_ref, gold_db
Chimeric sequences are detected and removed.
Sample output:
59Mb 100.0% Reading /scratch/db/bio/qiime/uchime/gold.fa
26Mb 100.0% Converting to upper case
27Mb 100.0% Word stats
27Mb 100.0% Alloc rows
86Mb 100.0% Build index
93Mb 100.0% Chimeras 5/184 (2.7%), in db 27 (14.7%), not matched 152 (82.6%)

25 4.5. OTU table generation
De-dereplication and QIIME-compatible otu_table: usearch9 -usearch_global
The usearch_global command counts how many times each OTU appears in each sample and then generates a QIIME-compatible otu_table.
Table columns: OTUId, Dog10/1, Dog15/1, Dog16/1, Dog17/1, Dog1/1, Dog22/1, Dog24/1, Dog29/1, Dog2/1
[Five OTU rows follow on the slide; their OTU numbers and read counts are missing from this transcript.]
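Conceptually, the OTU table is a count of reads per OTU per sample. The sketch below builds such a table from a list of (sample, OTU) assignments. It illustrates the counting step only and does not reproduce the search that `usearch_global` performs; the function name and the toy hits are made up.

```python
from collections import defaultdict

def build_otu_table(hits):
    """Build a table of read counts from (sample_id, otu_id) pairs.

    Returns a list of rows; the first row is the header.
    """
    counts = defaultdict(lambda: defaultdict(int))
    samples = set()
    for sample_id, otu_id in hits:
        counts[otu_id][sample_id] += 1
        samples.add(sample_id)
    ordered_samples = sorted(samples)
    rows = [["OTUId"] + ordered_samples]
    for otu_id in sorted(counts):
        # Samples without reads for this OTU get a count of 0.
        rows.append([otu_id] + [counts[otu_id][s] for s in ordered_samples])
    return rows

# Toy read assignments, for illustration only.
hits = [("Dog1", "OTU_1"), ("Dog1", "OTU_1"), ("Dog2", "OTU_1"), ("Dog2", "OTU_2")]
table = build_otu_table(hits)
```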

26 5. Decontamination

27 5.1 Overview of decontamination
We need to identify OTUs that may have been contributed by contamination from the reagents used for sampling, DNA extraction and purification, and from the environment and personnel where the DNA was extracted.
This is critical, especially for clinical samples. Why?
These OTUs must be subtracted from the biological samples to retain a true representation of the OTUs in the sample of interest.
To achieve this, reagents / blanks [controls] are spiked with known bacteria at the same DNA concentrations as those used in the samples under study.

28 5.2 Assign taxonomy to controls
Note: this assumes that QC has been conducted as explained in the previous slides.
Taxonomy of the [spiked] controls:
    assign_taxonomy.py -i otus_repsetout.fa -o tax -r gg_db/rep_set/97_otus.fasta -t gg_db/taxonomy/97_otu_taxonomy.txt -m uclust
From the output we can tell the spiked OTUs from the contaminant OTUs.
In most cases, the spiked OTUs will be the most abundant.
The spiked OTU sequences are removed from the controls, so the remaining sequences are contaminants.
Contaminant OTU sequences are aligned to the biological sample sequences at 100% identity over their entire length.

29 5.3 Example of spiked-control OTU reads
[Table: read counts for 22 OTUs (rows, OTUId) in four spiked controls (columns): sequencing_control primestorecyano P1_G06, P2_G06, P3_G06 and P4_G06. The OTU numbers and counts are missing from this transcript.]

30 5.4. Example of spiked-control taxonomy table
Columns: OTU, Tax, % ID, Av_reads (values missing from this transcript are left blank; truncated taxonomy strings are kept as they appear)
OTU_1   Bacteria; Cyanobacteria; Cyanobacteria; SubsectionIII; FamilyI; Arthrospira
OTU_10  Bacteria; Bacteroidetes; Flavobacteria; Flavobacteriales; Cryomorphaceae; Fluviicola   88   1
OTU_11  Bacteria; Deinococcus-Thermus; Deinococci; Deinococcales; Trueperaceae; Truepera
OTU_12  Bacteria; Proteobacteria; Gammaproteobacteria; Oceanospirillales; Alcanivoracaceae;
OTU_13  Bacteria; Proteobacteria; Gammaproteobacteria; Order_Incertae_Sedis; Family_Incertae_
OTU_14  Bacteria; Bacteroidetes; Cytophagia; Order_III_Incertae_Sedis; ML310M-34; g
OTU_15  Bacteria; Verrucomicrobia; Opitutae; Puniceicoccales; Puniceicoccaceae; g   72   4
OTU_16  Bacteria; Actinobacteria; Acidimicrobiia; Acidimicrobiales; Acidimicrobiaceae; g   86   4
OTU_17  Bacteria   99   1
OTU_18  Bacteria; Bacteroidetes; Cytophagia; Order_III_Incertae_Sedis; ML310M-34; g   95   1
OTU_19  Bacteria; Proteobacteria; Alphaproteobacteria; Caulobacterales; Hyphomonadaceae; O
OTU_2   Bacteria; Proteobacteria; Gammaproteobacteria; Pasteurellales; Pasteurellaceae; Haem   79   2
OTU_20  Bacteria; Proteobacteria; Gammaproteobacteria; Xanthomonadales; Sinobacteraceae;   52   7
OTU_21  Bacteria; Lentisphaerae; Lentisphaeria; SS1-B-03-39; f; g   60   1
OTU_22  Bacteria; Verrucomicrobia; Opitutae; Puniceicoccales; Puniceicoccaceae; g
OTU_3   Bacteria; Proteobacteria; Alphaproteobacteria; Rhodobacterales; Rhodobacteraceae
OTU_4   Bacteria; Bacteroidetes; Cytophagia; Cytophagales; Cyclobacteriaceae; g
OTU_5   Bacteria; Proteobacteria; Gammaproteobacteria; Alteromonadales; Alteromonadaceae;_
OTU_6   Bacteria; Proteobacteria; Alphaproteobacteria; Rhizobiales; Phyllobacteriaceae; Pseuda   79   1
OTU_7   Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales; Moraxellaceae; M
OTU_8   Bacteria; Firmicutes; Bacilli; Lactobacillales; Carnobacteriaceae; Dolosigranulum
OTU_9   Bacteria; Proteobacteria; Alphaproteobacteria; Rhizobiales; Bradyrhizobiaceae; Salinari
Note: before calculating the average reads of the control replicates, establish whether the replicates are comparable by calculating the % of each OTU in each spiked control.
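The comparability check in the note can be sketched as a conversion from read counts to percentages within each control. The control names, counts and function name below are made up for illustration, and the slides do not say how close the percentages must be for replicates to count as comparable.

```python
def relative_abundance(counts_by_control):
    """Convert {control: {otu: reads}} into {control: {otu: percent of reads}}."""
    percentages = {}
    for control, counts in counts_by_control.items():
        total = sum(counts.values())
        if total == 0:
            raise ValueError(f"control {control!r} has no reads")
        percentages[control] = {
            otu: 100.0 * reads / total for otu, reads in counts.items()
        }
    return percentages

# Hypothetical replicate controls with different sequencing depths.
controls = {
    "P1": {"OTU_1": 900, "OTU_2": 100},
    "P2": {"OTU_1": 450, "OTU_2": 50},
}
pct = relative_abundance(controls)
```

Here the two replicates differ twofold in depth but have the same composition (90% and 10%), so averaging their reads would be reasonable.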

31 5.6 Search for contaminant OTUs in biological samples
    align_seqs.py -i $indir/conta.fa -o $outdir/decon100 -t $indir/otus_repsetout.fa -e 250 -p 100 -m pynast
Output columns: candidate sequence ID (contaminant OTUs), candidate nucleotide count, template ID (biological OTUs), BLAST percent identity to template, candidate nucleotide count post-NAST
OTU_1 250 OTU_
OTU_2 250 OTU_
OTU_3 250 OTU_
OTU_4 250 OTU_
OTU_5 250 OTU_
OTU_6 250 OTU_
OTU_7 250 OTU_
OTU_8 250 No search results.
OTU_9 250 OTU_
[Three further rows follow on the slide, one of them reporting "No search results."; the template OTU numbers and the remaining values are missing from this transcript.]

32 5.7. Removing contaminant sequences
If a contaminant OTU matches an OTU in the biological sample (BS) at 100%, that contaminant is present in the biological sample; otherwise it is absent.
If present, the action taken is:
1. If the average number of reads in the contaminant is similar to that in the biological sample, the OTU is completely removed from the biological sample.
2. If the average number of reads in the contaminant is higher than in the biological sample, the OTU is again completely removed from the biological sample.
3. If the average number of reads in the contaminant is lower than in the biological sample, the equivalent number of reads is removed from the BS.
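The three rules above can be sketched as a single function applied to each OTU that matched a contaminant at 100%. The slide does not define "similar", so the `tolerance` parameter is a hypothetical way of expressing it; the function name is also illustrative.

```python
def decontaminate(sample_reads, contaminant_reads, tolerance=0.0):
    """Reads left in the biological sample for one matched OTU.

    ASSUMPTION: counts are 'similar' when the contaminant reaches at least
    (1 - tolerance) of the sample count; the slide gives no exact definition.
    """
    if contaminant_reads >= sample_reads * (1.0 - tolerance):
        return 0  # rules 1 and 2: remove the OTU completely
    return sample_reads - contaminant_reads  # rule 3: subtract equivalent reads

# Made-up read counts, for illustration only.
rule_similar = decontaminate(100, 95, tolerance=0.1)
rule_more = decontaminate(5, 7)
rule_less = decontaminate(100, 7)
```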

33 What if the % alignment is lowered, e.g. to 99%?
This gives contaminants that could not match the BS at 100% a 1% margin in which to match.
The risk is that we lose sequences / OTUs from the BS that we would otherwise keep at 100%.
Note: since DNA competes during PCR amplification, it makes sense that, unless there is serious contamination, very few contaminant sequences will match the BS at 100%.

34 5.8 Product of decontamination
After removing contaminant OTUs / sequences, an otu_table is generated that is free from contaminants and is a true representation of the sample under investigation.
Taxonomy is assigned to the clean OTUs using the same approach as explained for the spiked controls.
A phylogenetic tree is generated after aligning sequences to the reference database (Greengenes / SILVA).

35 6. Taxonomy assignment of BS
Taxonomy is assigned to the biological sequences using the same approach as explained for the controls.
A phylogenetic tree is generated after aligning sequences to the reference database.

36 SUMMARY
Raw sequence data -> CBIO-PIPELINE -> BIOM file, otus_repsetout.fa, phylogenetic tree, taxonomy file, otu_table -> statistician

37 Acknowledgement
CBIO Team: Prof Nicola Mulder, Gerrit and Katie, for their constructive input

38 Thank you for your attention