A base composition analysis of natural patterns for the preprocessing of metagenome sequences


1 A base composition analysis of natural patterns for the preprocessing of metagenome sequences
Oliver Bonham-Carter, Dhundy Bastola, Hesham Ali
College of Information Science & Technology, School of Interdisciplinary Informatics, Peter Kiewit Institute, University of Nebraska at Omaha, Omaha, NE, USA
26 April 2013

2
1 Introduction: Problem Motivation
2 Contribution: Our Study
3 Materials and Methods: Spectrum Sets; Examples of Spectrum Sets; Proportions; Data; Experiment and Flow Chart; Association
4 Results: Phylogeny
5 Conclusions
References

3 A Preprocessing Step to de novo Sequencing The reconstruction of a genetic sequence is done by merging smaller pieces (reads) together to make the whole. Contigs are made of combined reads.

4 Sequence Assembly: Similar to a Jigsaw Puzzle Smaller pieces come together to build the whole.

5 Mixing Pieces Makes a Harder Jigsaw Puzzle Puzzle building is frustrated by the addition of foreign pieces in the mix.

6 Assembly of multiple sequences by de novo technologies Often there are multiple sequences present in the sequencing pool.

7 Sequencing Alignment End-regions of reads are analyzed for adjacency properties. Increased analysis is now necessary due to the added foreign reads.

8 Contribution of this Study: Base Composition Analysis We propose a statistical method to cluster related sequence data into groups. This step reduces the search space when aligning the individual reads of the pool. The method is verified with synthetic and biological data.

9 Contribution of this Study: Base Composition Analysis

10 Spectrum Sets from Restriction Sites for Statistical Analysis Restriction sites are short sequences in foreign DNA (e.g., viral DNA) where bacterial enzymes cut to destroy the DNA of an invading threat.

11 Spectrum Sets from Restriction Sites for Statistical Analysis Restriction sites are used to create lists of DNA words (spectrum sets). The proportional content of all these words (motifs) is used to determine sequence relatedness.

12 Four Spectrum Sets From All Known Restriction Sites (RSs) of Length 6

13 Examples of Home Grown Spectrum Sets The RSs used by Clostridium and Staphylococcus are different.

14 Collecting Proportions of Motifs Over Sequence Data: Length 6 Motifs Where $m_i$ is a motif, $S_L$ is a read sequence, $\mathrm{count}(m_i)$ is the number of occurrences of $m_i$ found in $S_L$, and $|m_i|$ and $|S_L|$ are the lengths of the motif and sequence, respectively. Since we are not using the entire sample space (all possible length-$n$ motifs), proportions were appropriate.
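The proportion formula itself did not survive in this transcript. The sketch below is one reading that is consistent with the symbol definitions on the slide: it assumes the proportion of a motif is $\mathrm{count}(m_i) / (|S_L| - |m_i| + 1)$, that is, the number of occurrences divided by the number of windows of motif length in the read. The denominator, the overlapping count, and the function names are assumptions made for illustration, not the authors' stated method.

```python
def count_overlapping(sequence, motif):
    """Count occurrences of motif in sequence, allowing overlaps."""
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        start = sequence.find(motif, start + 1)
    return count


def motif_proportions(sequence, motifs):
    """Map each motif m_i to count(m_i) / (|S_L| - |m_i| + 1).

    The denominator (number of sliding windows) is an assumed
    normalization; the slide's own formula is not available.
    """
    proportions = {}
    for motif in motifs:
        windows = len(sequence) - len(motif) + 1
        if windows <= 0:
            # Read is shorter than the motif: no window can match.
            proportions[motif] = 0.0
        else:
            proportions[motif] = count_overlapping(sequence, motif) / windows
    return proportions


# Toy read, not real sequence data.
read = "AAATTTAAATTT"
print(motif_proportions(read, ["AAATTT", "ATTTAA"]))
```

For the toy read there are 7 windows of length 6, so a motif found twice gets a proportion of 2/7. Each read then yields one vector of proportions per spectrum set, which is the quantity the later slides compare.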

15 Organisms
Organism                     Contig Originator   Division
Bifidobacterium longum       NC                  Actinobacteria
Mycobacterium bovis          NC                  Actinobacteria
Clostridium tetani           NC                  Firmicutes
Staphylococcus aureus        NC                  Firmicutes
Burkholderia pseudomallei    NC                  Proteobacteria
Campylobacter jejuni         NC                  Proteobacteria
Ten trials of the $\binom{6}{2} = 15$ pairwise combinations give 10 × 15 = 150 experiments, each with fresh sequence reads. The Contig Originator column displays the fully assembled sequences processed via MetaSim [1] to make synthetic reads.
[1] MetaSim:
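The experiment count can be checked with a few lines of code. This sketch assumes, as the binomial on the slide suggests, that each experiment pairs two of the six organisms; the variable names are illustrative.

```python
from itertools import combinations

organisms = [
    "Bifidobacterium longum",
    "Mycobacterium bovis",
    "Clostridium tetani",
    "Staphylococcus aureus",
    "Burkholderia pseudomallei",
    "Campylobacter jejuni",
]

# All unordered pairs of the six organisms: C(6, 2).
pairs = list(combinations(organisms, 2))
trials = 10

print(len(pairs))           # number of pairings
print(trials * len(pairs))  # total experiments over ten trials
```

The two printed values are 15 and 150, matching the totals on the slide.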

16 Flowchart of Algorithm

17 Association DNA sequences appear to naturally have a unique base composition. Related sequences cluster (associate) together.
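As a rough illustration of association, the sketch below assigns an unknown read to the group whose average proportion vector is closest. The slides do not state which distance or clustering method was used, so the Euclidean distance, the centroid rule, the group names, and the toy numbers are all assumptions chosen for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length proportion vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def nearest_group(vector, groups):
    """Return the name of the group whose centroid is closest to vector."""
    return min(groups, key=lambda name: euclidean(vector, centroid(groups[name])))


# Made-up proportion vectors (one entry per motif); not measured data.
groups = {
    "group_a": [[0.010, 0.002], [0.012, 0.001]],
    "group_b": [[0.001, 0.009], [0.002, 0.011]],
}
unknown = [0.011, 0.002]
print(nearest_group(unknown, groups))
```

With these toy values the unknown vector lies next to the first group's centroid, so it is assigned to `group_a`. A hierarchical clustering over the same vectors would be an equally reasonable choice.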

18 Clostridium and Staphylococcus Genomes, CCCGGG-Spectrum Set (figure key: Clostridium, Staphylococcus)

19 Clostridium and Staphylococcus Genomes, AAATTT-Spectrum Set (figure key: Clostridium, Staphylococcus)

20 An Unknown Sequence Joins the Pool Party

21 An Unknown Sequence Joins the Pool Party: Addition of Clostridium Sequence, CCCGGG-Spectrum Set (figure key: Clostridium, Staphylococcus)

22 An Unknown Sequence Joins the Pool Party: Addition of Clostridium Sequence, AAATTT-Spectrum Set (figure key: Clostridium, Staphylococcus)

23 Clostridium and Staphylococcus, CCCGGG-Spectrum Set, with Clostridium Contigs (figure key: Clostridium, Staphylococcus)

24 Mixed Contigs: Clostridium, Staphylococcus and Burkholderia (Bacterial Genomes) AAATTT-Spectrum Set: There is a high contrast between one of the three and the other two. Remove this set and rerun the test. (figure key: Clostridium, Staphylococcus, Burkholderia)

25 Phylogeny Remark We successfully used our method to assign the first chromosomes of the following organisms to their rightful phylogenetic groupings.
Organism                      Common Name
Caenorhabditis elegans        Worm
Canis lupus familiaris        Dog
Drosophila melanogaster       Fruit fly
Mus musculus                  Mouse
Mycoplasma hyorhinis GDL-1    Bacteria
Oryctolagus cuniculus         Rabbit
Rattus norvegicus             Rat

26 Taxonomy Tree from Genbank

27 CCGGAT - Best Tree Note: The heatmap graphic is removed to show only the tree.

28 AAATTT - Second Best Tree Note: The heatmap graphic is removed to show only the tree.

29 Limitations
Information Based: Successful phylogenetic grouping requires ample sequence data (> 700 bp). Next-generation sequencing trends indicate that read lengths grow each year as the technology improves. Contigs, the longer sequences made from combined reads, are also suitable.
Spectrum Set Behavior: Spectrum sets do not perform similarly on every data set. Contrast-based analysis requires knowledge of each organism's natural use of restriction sites.

30 Conclusions: Preprocessing for Multi-Sequence Assembly The Separation of Mixed Reads. We proposed a binning preprocessing method that separates and partitions related sequence data. This method can reduce the search space when aligning reads and so expedite the sequence assembly process. The structural properties of sequence material can be used to infer phylogenetic properties.

31 References
Bonham-Carter O, Ali H, Bastola D, A base composition analysis of natural patterns for the pre-processing of metagenome sequences, BMC Bioinformatics. (accepted, 2013)
Bonham-Carter O, Ali H, Bastola D, A Meta-genome Sequencing and Assembly Preprocessing Algorithm Inspired by Restriction Site Base Composition, 2012 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW)
Wang Y, Leung HC, Yiu SM, Chin FY, MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample, Bioinformatics. 2012 Sep 15;28(18):i356-i362.

32 We would like to thank the students, faculty and staff in the UNO Bioinformatics Core Facility for their support. This project has been funded by grants from the National Center for Research Resources (5P20RR016469) and the National Institute of General Medical Sciences (NIGMS) (8P20GM103427). Thank You! Questions? IS&T Bioinformatics

33 Motifs
Set Seed    Available Motifs
AAATTT      12
CCCGGG      12
AATTCG      156
CCGGAT      156
The numbers of available motifs belonging to each spectrum set. The motifs in each spectrum set are non-palindromic and are permutations of the set seed.
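The counts in this table can be reproduced with a short sketch. It assumes that "palindromic" carries its usual restriction-site meaning, a word equal to its own reverse complement, and that a spectrum set holds every distinct permutation of the seed that is not such a palindrome. The function names are illustrative.

```python
from itertools import permutations

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Return the reverse complement of a DNA word."""
    return motif.translate(COMPLEMENT)[::-1]


def spectrum_set(seed):
    """Distinct permutations of seed that are not reverse-complement palindromes."""
    words = {"".join(p) for p in permutations(seed)}
    return sorted(w for w in words if w != reverse_complement(w))


for seed in ("AAATTT", "CCCGGG", "AATTCG", "CCGGAT"):
    print(seed, len(spectrum_set(seed)))
```

Under this reading the sketch yields 12, 12, 156 and 156 motifs, in agreement with the table. The seeds AAATTT and CCCGGG are themselves reverse-complement palindromes, so they do not appear in their own spectrum sets.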