Uncovering the dynamics of genome function at the organismal scale using machine vision and computation

Size: px
Start display at page:

Download "Uncovering the dynamics of genome function at the organismal scale using machine vision and computation"

Transcription

1 Uncovering the dynamics of genome function at the organismal scale using machine vision and computation Tessa Durham Brooks, Ph.D. Doane College Department of Biology

2 Anticipation at the dawn of the Genomic Era Within the next few years, technologies developed for the Human Genome Project and similar sequencing efforts will revolutionize medicine, agriculture, crimefighting, and other fields. Gwynne and Page, Science, 2000 Genetronic Replicator Star Trek: TNG, 1992

3 The genomic powerhouses For reference: Humans have about 25,000 genes, 3.2 bil base pairs. ~25,000 genes 100 mil base pairs Sequenced finished 2000 ~20,000 genes 97 mil base pairs Sequence finished 1998 ~22,000 genes 137 mil base pairs Sequence finished 2000

4 The problem: at best functional roles have been assigned for 15% of predicted genes of a genome For reference: Humans have about 25,000 genes, 3.2 bil base pairs. ~25,000 genes 100 mil base pairs Sequenced finished 2000 ~20,000 genes 97 mil base pairs Sequence finished 1998 ~22,000 genes 137 mil base pairs Sequence finished 2000

5 Why has determining gene function in multicellular organisms been difficult?

6 Why has determining gene function in multicellular organisms been difficult?

7 Our goal: Describe genome function at the organismal scale. ~25,000 genes Requirements: Observations should be made at sufficiently high spatial and temporal resolution Methods should be relatively highthroughput to allow genomic survey Observations should be able to be made over time and in many environmental contexts

8 Root gravitropism: a model for image analysis approaches in functional genomics

9 Scale Scale Tip Angle (deg) Root gravitropism: a model for image analysis approaches in functional genomics Second Order (acceleration) glr3.3-1 vs wt glr3.3-2 vs wt First Order (swing rate) Time (h) glr3.3-1 glr3.3-2 SalkCol Time (h) Miller, Durham Brooks, and Spalding 2010, Genetics

10 Developing tools to detect genome function at the organismal scale. ~25,000 genes Requirements: Observations should be made at sufficiently high spatial and temporal resolution Methods should be relatively highthroughput to allow genomic survey Observations should be able to be made over time and in many environmental contexts

11 Automation and High Throughput ver. 1

12 Doane Phytomorph Automation and High Throughput ver. 2

13 High throughput genetic stocks Recombinant inbred lines (RILs) S T S T Ecotypes (e.g genomes project) QTL Analysis Matthieu Reymond Max-Planck Institute

14 Developing tools to detect genome function at the organismal scale. ~25,000 genes Requirements: Observations should be made at sufficiently high spatial and temporal resolution Method should be relatively highthroughput to allow genomic survey Observations should be able to be made over time and in many environmental contexts

15 Phenotypes are plastic One ecotype Durham Brooks, Miller, and Spalding 2010, Plant Physiology

16 Seedling Age Position (cm) Genomics analysis in a multidimensional condition space LOD Seed Size Small Large 2d 164 lines X 15 indiv. Time (minutes) 3d Moore, et al., unpublished result 4d

17 Developing tools to detect genome function at the organismal scale. ~25,000 genes Requirements: Observations should be made at sufficiently high resolution Method should be relatively highthroughput to allow genomic survey Observations should be able to be made over time and in many environmental contexts Cyberinfrastructure must facilitate the above

18 Workflow Data Capture 0.5 TB/day Data Capture Data Compression Data Storage (30 TB) Data Storage (X TB) Feature Extraction Data Compression and Analysis Schorr Center and OSG QTL analysis

19 Data Compression Scan Root Response Time Series of 220 uncompressed TIF s ~225 MB each Auto Crop & Time Stamp (Condor on OSG) Equalize dimensions Insert the time stamp into the least significant bits of the first 14 pixels of each image Video Compression (Local Grid) Time series of lossless - compressed PNG s ~195 MB each Image Compression (Condor on OSG) Compress to video using FFV1 codec Uses a lossless intraframe codec ~160 MB/frame Ship out to UW

20 Feature Extraction Compressed Data Tip angle Measurement Decompress Data Currently using FFV1 codec Isolate and track root tip Currently image curvature features are used Machine Learning Isolate plant s Arial and ground tissue from image background Currently Bayesian and SVM methods work well Root tip Identification Linear regression on the root s meristematic tissue Ship out to Doane

21 QTL Analysis - Association of a phenotypic value (e.g. root tip angle) with a genetic element Compile Tip Angle Data Ship out to Doane Additional analysis Find mean tip angle at each time point for each genetic line Launch analysis locally Bleed out to OSG Choose detection method Optimize maximum likelihoods of trait data Determine significance (LOD) Determine Significance Threshold (Condor on OSG) Permutation testing Haley Knott or Multiple imputation (0.24 vs 63 CPU years per time point) Determine max likelihood and LOD score of randomized data QTL Detection (Local Grid) Interactions, additive effects Plasticity QTL Conditional QTL (GxE effects) Ship out to Doane

22 Progress and Future Directions One small college has collected over 14,500 individual root gravitropic responses in six conditions (32 TB) in RIL population in 6 mo. We will finish collection from NILs (nearisogenic lines) - an additional 8,700 individuals, 19 TB in 1-2 mo. Begin image analysis and QTL analysis dataset opens new doors in visualizing genomes Interdisciplinary undergraduate training opportunities

23 Acknowledgments Doane College Mike Carpenter (CIO), David Andersen, Dan Schneider Chris Wentworth (Physics) and Alec Engebretson (IST) Students: Amy Craig and Brad Higgins (Physics), Autumn Longo and Grant Dewey (Biochemistry), Tracy Guy, Miles Mayer, Halie Smith, Anthony Bieck, Sarah Merithew, Devon Niewohner, Muijj Ghani, Julie Wurdeman (Biology) University of Wisconsin Edgar Spalding s Laboratory: Nathan Miller, Candace Moore, Logan Johnson Miron Livny (CHTC) UNL Schorr Center and HCC David Swanson, Brian Bockelman, Carl Lundstedt University of Florida Mark Settles Computing Resources Funding PGRP EPSCoR - URE NE-INBRE

24 Questions?