Regina Bohnert. Friedrich Miescher Laboratory of the Max Planck Society Tübingen, Germany. July 18, 2008

1 Revealing Sequence Variation Patterns in Rice with Machine Learning Methods
Regina Bohnert
Friedrich Miescher Laboratory of the Max Planck Society, Tübingen, Germany
July 18, 2008
[Slide footer: Regina Bohnert (FML), Sequence Variation Patterns in Rice, July 18, 2008]

2 Motivation
What distinguishes the sequences of subpopulations with different traits?
- Identify sequence variations within one species
- Basis for further evolutionary and functional studies
Genome-wide identification of sequence polymorphisms
- High-density oligonucleotide microarrays for high-throughput resequencing
- Array-based resequencing has been applied to:
  - Human (Hinds et al., Science, 2005)
  - Arabidopsis thaliana (Clark et al., Science, 2007)
  - Mouse (Frazer et al., Nature, 2007)
  - Oryza sativa (rice)

6 Oryza sativa
- Prominent model organism
- Most important food source
- Representative of grass family
- Closely related to other cereals
- 372 Mb genome on 12 chromosomes
Challenges relative to A. thaliana
- Different experimental design
- Data not as clean
- No gold standard set of labelled sequences
[Image: illustration from Koehler's Medicinal-Plants, 1887]

8 The Resequencing Data
Tiling arrays
[Figure: tiling-array schematic. Oligonucleotides on a glass surface (array surface); each quartet consists of a reference probe and SNP probes; ssDNA labelled with fluorescence serves as target DNA; the hybridisation signal is shown for the A, C, G and T probes.]
  First quartet probes:  TACCGGTCGGAAATCGATCGGTTGA, TACCGGTCGGAACTCGATCGGTTGA, TACCGGTCGGAAGTCGATCGGTTGA, TACCGGTCGGAATTCGATCGGTTGA
  First quartet target:  ATGGCCAGCCTTGAGCTAGCCAACTTGAAT
  Second quartet probes: TCAACCGATCGAATTCCGACCGGTA, TCAACCGATCGACTTCCGACCGGTA, TCAACCGATCGAGTTCCGACCGGTA, TCAACCGATCGATTTCCGACCGGTA
  Second quartet target: AGTTGGCTAGCTCAAGGCTGGCCATAGGTA
Rice resequencing arrays
- Tiling strategy with 1 bp resolution
- Each base queried with a forward and reverse quartet
- 800 million oligos on 246 arrays for each of the 20 cultivars
- 32 % of the genome represented
- Target DNA amplified by long-range PCR
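
The quartet design on this slide can be sketched in a few lines of Python. This is only an illustration, not the array manufacturer's design code: the function names are hypothetical, the probe length of 25 is taken from the example oligos on the slide, and the strand convention (which strand counts as "forward") is an assumption.

```python
# Sketch: build the forward and reverse probe quartets that query one base.
# Assumptions (not from the talk): function names, and that the reverse
# quartet is derived from the reverse complement of the forward window.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def quartet(window):
    """Four probes that differ only at the central base of the window."""
    mid = len(window) // 2
    return [window[:mid] + base + window[mid + 1:] for base in "ACGT"]


def probe_quartets(reference, pos, probe_len=25):
    """Forward and reverse quartets for the base at 0-based index pos."""
    half = probe_len // 2
    if pos - half < 0 or pos + half >= len(reference):
        raise ValueError("position too close to the sequence end")
    window = reference[pos - half:pos + half + 1]
    return quartet(window), quartet(reverse_complement(window))


if __name__ == "__main__":
    forward, reverse = probe_quartets("TACCGGTCGGAACTCGATCGGTTGA", 12)
    for probe in forward + reverse:
        print(probe)
```

Applied to the 25-mer from the slide, the sketch yields the eight oligos listed in the figure, which is why each queried base costs eight probes on the array.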

14 Polymorphism Detection
[Figure: log mean intensity of the A, C, G and T probes along the sequence, shown for the reference, cultivar A and cultivar B]
  Reference:  ATGCTTTCTGGACTTCAGAAAAATACTGTCATCAT
  Cultivar A: ATGCTTTCTGGACTTCAGCAAAATACTGTCATCAT (SNP A:C)
  Cultivar B: ATGCTTTCTGGACTTCTGCAAAATACTGTCATCAT (SNPs A:T and A:C)
Data analysis challenge
- Hybridisation signal dependent on sequence properties of oligomer, repeats, amplicon, cultivar
- Measurement noise
- Machine learning based SNP calling
Problematic cases
- Highly polymorphic regions
- Deletions and insertions
- Margin-based prediction of polymorphic regions
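
To make the A:C and A:T labels in the figure concrete, the following sketch compares the three sequences shown on the slide base by base. It assumes the sequences are already known and aligned, which is exactly what the arrays do not give directly; the function name is hypothetical and this is not the SNP caller described in the talk.

```python
# Sketch: list the substitutions between a reference and a cultivar sequence.
# Assumption: both sequences are already known and have the same length.

def find_snps(reference, cultivar):
    """Return (1-based position, reference base, cultivar base) per mismatch."""
    if len(reference) != len(cultivar):
        raise ValueError("sequences must have the same length")
    return [
        (i + 1, ref_base, cul_base)
        for i, (ref_base, cul_base) in enumerate(zip(reference, cultivar))
        if ref_base != cul_base
    ]


REFERENCE = "ATGCTTTCTGGACTTCAGAAAAATACTGTCATCAT"
CULTIVAR_A = "ATGCTTTCTGGACTTCAGCAAAATACTGTCATCAT"
CULTIVAR_B = "ATGCTTTCTGGACTTCTGCAAAATACTGTCATCAT"

if __name__ == "__main__":
    for name, seq in (("A", CULTIVAR_A), ("B", CULTIVAR_B)):
        for pos, ref_base, cul_base in find_snps(REFERENCE, seq):
            print(f"Cultivar {name}: position {pos} {ref_base}:{cul_base}")
```

Cultivar B carries two substitutions only two bases apart, the kind of closely spaced variation that the slide lists under problematic cases.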

16 SNP Calling: ML Approach
[Figure: plot with axes Feature 1 and Feature 2]
Support Vector Machines (SVM)
- Extract features
  - Array data
  - Sequence
  - Repetitiveness
- Labelled data generated by sequencing of randomly selected fragments
- Apply SVMs using RBF kernel in a two-layered approach
  - 2nd layer exploits information across cultivars
cf. Clark et al., Science, 2007
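
As a reminder of what "SVM with RBF kernel" computes, here is a minimal sketch of the kernel and the resulting decision function. The support vectors, coefficients, bias and gamma below are hand-picked, hypothetical values in a two-dimensional feature space; they are not taken from the trained models in the talk, and the two-layered set-up is not reproduced.

```python
import math

# Sketch: RBF kernel and SVM decision function with made-up parameters.


def rbf_kernel(x, y, gamma=0.5):
    """k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def svm_decision(x, support_vectors, dual_coefs, bias, gamma=0.5):
    """f(x) = sum_i (alpha_i * y_i) * k(sv_i, x) + b; the sign is the class."""
    return sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    ) + bias


# Hypothetical model: one support vector per class.
SUPPORT_VECTORS = [(0.0, 0.0), (2.0, 2.0)]
DUAL_COEFS = [-1.0, 1.0]  # alpha_i * y_i
BIAS = 0.0

if __name__ == "__main__":
    for point in [(0.1, 0.2), (1.9, 2.1)]:
        score = svm_decision(point, SUPPORT_VECTORS, DUAL_COEFS, BIAS)
        print(point, "SNP" if score > 0 else "no SNP")
```

In the approach on the slide, the feature vector would combine array, sequence and repetitiveness features, and a second classifier would take information across cultivars as additional input.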

23 SNP Calling: Results
Predicted SNPs
- 1.2 M MB SNP calls
- 1.3 M ML SNP calls
- 760,000 SNPs in MB ∩ ML at 160,000 positions

            Recall   Precision
MB          14 %     91 %
ML          21 %     92 %
MB ∩ ML     11 %     97 %

[Figure: recall [%] and precision [%] of MB and ML for all, coding and noncoding sites]
MB: Model based approach by Perlegen Sciences
ML: Proposed machine learning approach
$\mathrm{Precision} = \frac{TP}{TP + FP}$, $\mathrm{Recall} = \frac{TP}{P}$
Visit Poster M09 and P28
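
The definitions of precision and recall, and the effect of intersecting two call sets, can be sketched with Python sets. The positions below are toy data chosen for illustration only; they are not the evaluation data behind the table.

```python
# Sketch: precision and recall of SNP call sets against a labelled set.
# All positions are made-up toy data.


def precision_recall(calls, true_snps):
    """Precision = TP / (TP + FP), Recall = TP / P."""
    tp = len(calls & true_snps)
    precision = tp / len(calls) if calls else 0.0
    recall = tp / len(true_snps) if true_snps else 0.0
    return precision, recall


TRUE_SNPS = set(range(1, 11))   # P = 10 known SNP positions
MB_CALLS = {1, 2, 3, 11}        # one false positive (11)
ML_CALLS = {2, 3, 4, 5, 12}     # one false positive (12)
BOTH = MB_CALLS & ML_CALLS      # intersection of the two call sets

if __name__ == "__main__":
    for name, calls in (("MB", MB_CALLS), ("ML", ML_CALLS), ("MB and ML", BOTH)):
        precision, recall = precision_recall(calls, TRUE_SNPS)
        print(f"{name}: precision {precision:.2f}, recall {recall:.2f}")
```

In this toy case the intersection has the highest precision and the lowest recall, the same qualitative pattern as in the table.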

26 SNP Calling: Database

27 Predicting Polymorphic Regions
[Figure: log max intensity along the sequence (bp) for the reference and cultivar A; tracks: known polymorphisms, labels, predictions; legend: SNP, deletion, insertion, not polymorphic, PR]
- Difficulties when SNPs occur in vicinity
- Approach: Predict genomic segments with label sequence learning algorithm
  - Conserved and polymorphic regions (PRs)
cf. Zeller et al., Genome Research, 2008
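
The idea of labelling whole segments rather than single positions can be illustrated with a small Viterbi decoder over two states. This is a deliberately simplified stand-in, not the label sequence learning algorithm of Zeller et al.: the per-position scores, the switch penalty and all names are assumptions made for this sketch, and nothing is learned from data.

```python
from itertools import groupby

# Sketch: segment a non-empty list of per-position scores into conserved and
# polymorphic regions. Positive scores favour "polymorphic", negative scores
# favour "conserved"; each change of state costs switch_penalty.

STATES = ("conserved", "polymorphic")


def emission(state, score):
    return score if state == "polymorphic" else -score


def viterbi_path(scores, switch_penalty=2.0):
    """Highest-scoring state sequence under the switch penalty."""
    best = {st: emission(st, scores[0]) for st in STATES}
    backpointers = []
    for score in scores[1:]:
        new_best, pointers = {}, {}
        for st in STATES:
            def arrive(prev):
                return best[prev] - (switch_penalty if prev != st else 0.0)
            prev = max(STATES, key=arrive)
            new_best[st] = arrive(prev) + emission(st, score)
            pointers[st] = prev
        best = new_best
        backpointers.append(pointers)
    state = max(STATES, key=lambda st: best[st])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def segments(path):
    """Collapse a state path into (start, end, label) runs, end exclusive."""
    out, start = [], 0
    for label, run in groupby(path):
        length = len(list(run))
        out.append((start, start + length, label))
        start += length
    return out


if __name__ == "__main__":
    scores = [-1, -1, 3, -0.5, 3, -1, -1]
    print(segments(viterbi_path(scores)))
```

With these toy scores the weakly negative position between the two strong signals is absorbed into a single polymorphic region, whereas a per-position threshold would split it in two.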

29 Predicting PRs: Results
- Between 65,000 and 203,000 PRs predicted per cultivar
- 27 % recall at a precision of 80 %
- Between 1.7 % and 5.1 % of the genome covered
[Figure: precision plotted against recall for all, coding, UTRs + introns and intergenic regions]
$\mathrm{Precision} = \frac{TP}{TP + FP}$, $\mathrm{Recall} = \frac{TP}{P}$

32 Long Deletion Example
Disease resistance protein SlVe1 precursor on chromosome 12
[Figure: view of chromosome 12 from position 6,222,000 to 6,228,000 with a gene track (Os12g) and one track per cultivar: LTH, M 202, Tainung 67, Azucena, Cypress, Moroberekan, Dom_Sufid, Dular, FR 13A, N 22, Rayada, Aswina, Minghui 63, IR 64, Pokkali, Sadu Cho, SHZ2, Swarna, Zhenshan 97]

33 Conclusions
- Created the first whole-genome inventory of polymorphisms for rice
  - Highly polymorphic: 0.2 % in SNPs, 2.4 % in PRPs
- Intersection of MB and ML calls provides highly reliable SNP predictions
  - Used to genotype many more rice cultivars
- Polymorphic region predictions
  - Important for more detailed analyses (e.g. dideoxy sequencing)
  - Useful for primer design to increase PCR success rates

36 Acknowledgements
- Friedrich Miescher Laboratory: Gunnar Rätsch, Georg Zeller, Gabriele Schweikert
- MPI for Developmental Biology: Detlef Weigel, Richard Clark
- Michigan State University, USA: Robin Buell, Kevin Childs
- Perlegen Sciences, USA: Renee Stokowski, Dennis Ballinger, Kelly Frazer, David Cox
- IRRI, The Philippines: Kenneth McNally, Victor Ulat, Hei Leung
- Colorado State University, USA: Jan Leach