Bioinformatics and computational tools Etienne P. de Villiers (PhD) International Livestock Research Institute Nairobi, Kenya
International Livestock Research Institute, Nairobi, Kenya. ILRI works at the crossroads of livestock and poverty, bringing high quality science and capacity building to bear on poverty reduction and sustainable development. It is one of 15 centers supported by the Consultative Group on International Agricultural Research (CGIAR) that conduct food and environmental research to help alleviate poverty and increase food security. ILRI biotech facilities: molecular biology laboratories (>6,000 sqm); state-of-the-art biosciences equipment; 2 ABI sequencers (3130, 3730); 1 Roche 454 GS FLX; bioinformatics unit; 64-CPU high performance compute cluster; BSL3 laboratory; flow cytometry and microscopy; diagnostics (nucleotide and protein based); vaccine technology/immunology; small and large animal units. http://www.ilri.org http://hub.africabiosciences.org/
Central dogma of molecular biology
Bioinformatics Bioinformatics is the application of information technology and computer science to the field of molecular biology.
Bioinformatics. Genome (DNA) sequence: ACGGTGCGTAACGTCAGTCAGGTCAGTCAG. Bioinformatics or computational biology yields: gene or protein properties; comparative analysis; protein structure and function prediction.
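As a minimal sketch of what deriving "properties" from a sequence can mean, the snippet below computes base counts, GC content and the reverse complement of the sequence shown on the slide. The function names are illustrative and not taken from any particular bioinformatics package.

```python
from collections import Counter

# Sequence as shown on the slide.
SEQ = "ACGGTGCGTAACGTCAGTCAGGTCAGTCAG"

# Watson-Crick base pairing for DNA.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def base_counts(seq):
    """Count how often each base occurs in the sequence."""
    return Counter(seq.upper())


def gc_content(seq):
    """Fraction of bases that are G or C."""
    counts = base_counts(seq)
    return (counts["G"] + counts["C"]) / len(seq)


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


print(dict(base_counts(SEQ)))
print(f"GC content: {gc_content(SEQ):.3f}")
print(reverse_complement(SEQ))
```

Simple statistics like these are usually the first step before comparative analysis or structure prediction.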
The Sequencing Revolution: Next Generation Sequencing. High-throughput sequencing in 2000: 96 sequences per hour. High-throughput sequencing in 2010: 2.6 million sequences per hour.
The Sequencing Revolution: Third Generation Sequencing. Single-molecule sequencing: Pacific Biosciences, Oxford Nanopore. ~3,000 wells per chip; 1,500 bp per well; 10 bp per second; $1,000 human genome.
Sequencing the Human Genome (figure: log10 of price against year, 2000 to 2010). 2001: Human Genome Project, $3 billion, 11 years. 2001: Celera, $100 million, 3 years. 2007: 454, $1M, 3 months. 2008: ABI SOLiD, $60K, 2 weeks. 2009: Illumina, $40-50K. 2010: $5K, a few days.
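The price points on this slide can be put on the same log10 scale as the figure with a few lines of Python. This is a rough sketch: the 2009 entry uses 45,000, the midpoint of the quoted 40-50K range, which is an assumption, and the 2010 label is a placeholder because the slide names no platform.

```python
import math

# (year, label, approximate cost in US dollars) as listed on the slide.
# Assumption: 45,000 stands in for the quoted 40-50K range in 2009.
GENOME_COSTS = [
    (2001, "Human Genome Project", 3_000_000_000),
    (2001, "Celera", 100_000_000),
    (2007, "454", 1_000_000),
    (2008, "ABI SOLiD", 60_000),
    (2009, "Illumina", 45_000),
    (2010, "(platform not named)", 5_000),
]


def log10_cost(cost):
    """Cost on the log10 scale used by the figure's price axis."""
    return math.log10(cost)


def fold_reduction(first_cost, last_cost):
    """How many times cheaper the last entry is than the first."""
    return first_cost / last_cost


for year, label, cost in GENOME_COSTS:
    print(f"{year}  {label:<22} log10(cost) = {log10_cost(cost):.2f}")

overall = fold_reduction(GENOME_COSTS[0][2], GENOME_COSTS[-1][2])
print(f"Overall reduction: {overall:,.0f}-fold")
```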
Next Generation Sequencing: Current Projects. 1000 Genomes project (www.1000genomes.org): sequence genomes from 2500 people from diverse backgrounds to 4x coverage to identify human genetic variation. Ensembl genomes (www.ensemblgenomes.org): 234 species sequenced, from mammals and birds to parasites; >400 bacterial species sequenced. Plant genomes: 18 sequenced (www.phytozome.org/). BGI (China) (www.genomics.cn): 1,000 plant and animal reference genome project.
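To give a feel for what "4x coverage" implies, the sketch below uses the standard relation coverage = (number of reads x read length) / genome size. The genome size of about 3 billion bp and the 100 bp read length are assumptions for illustration; the slide does not state either.

```python
import math

# Assumptions (not from the slide): approximate human genome size
# and an illustrative short-read length.
HUMAN_GENOME_BP = 3_000_000_000
READ_LENGTH_BP = 100


def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Reads required so that each base is covered `coverage` times on average."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


def total_bases(genome_size_bp, coverage, n_genomes):
    """Total sequenced bases for a project of `n_genomes` genomes."""
    return genome_size_bp * coverage * n_genomes


per_genome = reads_needed(HUMAN_GENOME_BP, coverage=4, read_length_bp=READ_LENGTH_BP)
project = total_bases(HUMAN_GENOME_BP, coverage=4, n_genomes=2500)
print(f"Reads per genome at 4x: {per_genome:,}")
print(f"Bases for 2500 genomes at 4x: {project:,}")
```

Low coverage per genome is workable here because variation is identified across many individuals rather than from one deeply sequenced genome.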
Cost of Computing (figure: performance in GigaFLOPS; values shown on the figure: 140, 10, 3, 1). 1988: Cray YMP, $40,000,000. 1998: Sun HPC1000, $1,000,000. 2010: Intel Core i7 desktop, $1,000.
World Internet Connections
Cloud Computing. Cloud computing is a general term for computation as a service. Computation as a service means that customers rent the hardware and the storage only for the time needed to achieve their goals. Amazon Elastic Compute Cloud (Amazon EC2) provides resizable compute capacity in the cloud, including high performance computing (HPC) on demand: 23 GB of memory; 64 compute nodes; 1.7 terabytes of storage; $1.60 per hour or $5,000 per year.
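The two prices quoted for EC2 invite a quick break-even calculation. The sketch below assumes that $1.60 per hour and $5,000 per year are alternative ways of paying for the same resource; the slide does not say so explicitly.

```python
# Prices as quoted on the slide (US dollars).
HOURLY_RATE = 1.60
YEARLY_RATE = 5000.0
HOURS_PER_YEAR = 365 * 24


def on_demand_cost(hours):
    """Cost of renting by the hour for the given number of hours."""
    return hours * HOURLY_RATE


def break_even_hours():
    """Hours of use per year at which both pricing options cost the same."""
    return YEARLY_RATE / HOURLY_RATE


def cheaper_option(hours):
    """Return 'hourly' or 'yearly' for a planned number of hours per year."""
    return "hourly" if on_demand_cost(hours) < YEARLY_RATE else "yearly"


print(f"Break-even: {break_even_hours():.0f} hours per year")
print(f"Share of the year: {break_even_hours() / HOURS_PER_YEAR:.0%}")
print(f"Full year at the hourly rate: ${on_demand_cost(HOURS_PER_YEAR):,.2f}")
```

Under that assumption, a group that needs the cluster only for occasional analyses pays far less by the hour than by the year.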
Distributed computing. Distributed computing is any computing that involves multiple computers remote from each other. A central server sends and receives the work units (essentially just protein structures and sequences). The client uses the spare CPU cycles on a user's computer to run the simulation algorithm on the assigned structure. Results are automatically returned and exchanged for a new work unit on a daily basis. Client locations: home, lab/office, anywhere.
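The exchange of work units between a central server and its clients can be sketched in a few lines. Everything below is a toy, in-process model under stated assumptions: `WorkServer`, `client_step` and the work-unit contents are hypothetical, `simulate` is a placeholder for the real simulation algorithm, and no networking is involved.

```python
from collections import deque


class WorkServer:
    """Hypothetical central server: hands out work units, collects results."""

    def __init__(self, work_units):
        self.pending = deque(work_units)
        self.results = {}

    def request_work(self):
        """Return the next work unit, or None when the queue is empty."""
        return self.pending.popleft() if self.pending else None

    def submit_result(self, unit_id, client, result):
        """Record which client returned which result."""
        self.results[unit_id] = (client, result)


def simulate(sequence):
    """Placeholder for the real simulation: just returns the sequence length."""
    return len(sequence)


def client_step(server, client):
    """One client cycle: fetch a unit, compute, return the result."""
    unit = server.request_work()
    if unit is None:
        return False
    unit_id, sequence = unit
    server.submit_result(unit_id, client, simulate(sequence))
    return True


def run_round_robin(server, clients):
    """Let the clients take turns until no work units remain."""
    active = True
    while active:
        active = False
        for client in clients:
            if client_step(server, client):
                active = True


# Made-up work units: (identifier, protein sequence).
server = WorkServer([("wu-1", "MKTAYIAK"), ("wu-2", "GSHMLE"),
                     ("wu-3", "ACDEFG"), ("wu-4", "KLMN")])
run_round_robin(server, ["home", "lab/office", "anywhere"])
print(server.results)
```

The key design point is that clients pull work when they have spare cycles, so the server never needs to know in advance which machines are available.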
Distributed computing: Folding@home. Aim: understand how existing proteins attain their specific, functional three-dimensional structures. Uses distributed computing through installation of a screensaver on the user's computer. In 2009 it was running on 40,000 CPUs, or 5 PFLOPS. For comparison, the fastest standalone supercomputer is Tianhe-1A at 2.5 PFLOPS.
Metagenomics. Metagenomics is the sequencing and analysis, using next generation sequencing technologies, of DNA from organisms recovered directly from an environment, without the need to culture them. Examples: the Sargasso Sea community survey; acid mine drainage film; human gut communities; symbiotic community from a marine worm; the AVID project.
From Sequence (genomics/metagenomics) to impact. (Meta)genome sequencing; databases: compilation of complete genomes and metagenomes, annotation and curation of metadata; extraction of important biological information. Analyses: phylogenetic analysis; geographical mapping; protein modeling; sequence variation analysis; discovery of new microorganisms and pathways. Impact: diagnostics; global disease surveillance; vaccine development; primer, microarray; drug development; improved drug selection; environmental sustainability; better control tools.
AVID: Arbovirus Incident & Diversity project. A Google.org Predict and Prevent funded project. Pilot project on Rift Valley Fever virus: the virus is transmitted by mosquitoes and infects both animals and humans; it is deadly to both humans and livestock; outbreaks occur every 5-6 years. A complex mix of species, subspecies and populations. Can we understand its dynamics?
AVID Questions. Where is the virus (between outbreaks)? Environment; vectors; reservoirs. What is the diversity of: virus; vector; reservoir? And how do these interact? Distribution of other pathogens? Novel pathogens and variants? For example: does a particular virus variant occur in a particular vector variant associated with a particular mammalian variant? Viral gene flow.
AVID Strategy. Samples are collected in specific areas: human blood, livestock, wildlife, mosquitoes, water, soil. Each sample is collected with a full metadata description (location, date/time, eco-geo-socio descriptors). Amplify sequences from multiple points on multiple possible genomes: virus, insect, mammal, others. Sequence these amplicons simultaneously from 1,000s of samples using next generation sequencing. Analyse sequences: look for distribution and co-occurrence. Refine primers for a simple (RT-)PCR approach. Move diagnostic sequences on to high-throughput PCR diagnostics.
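The "distribution and co-occurrence" step can be illustrated with a small counting sketch. The sample identifiers and the taxa detected in each sample below are entirely made up for illustration; this is not the project's pipeline, only one simple way to tabulate how often taxa are found together.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: taxa detected in each sample after sequencing.
# RVFV stands for Rift Valley Fever virus; all entries are invented.
SAMPLES = {
    "sample-001": {"RVFV", "Aedes", "cattle"},
    "sample-002": {"RVFV", "Aedes"},
    "sample-003": {"Culex", "cattle"},
    "sample-004": {"RVFV", "Culex", "sheep"},
}


def occurrence_counts(samples):
    """Distribution: in how many samples each taxon was detected."""
    counts = Counter()
    for taxa in samples.values():
        counts.update(taxa)
    return counts


def cooccurrence_counts(samples):
    """Co-occurrence: in how many samples each pair of taxa appears together."""
    pairs = Counter()
    for taxa in samples.values():
        # Sorting gives each pair one canonical order, e.g. ("Aedes", "RVFV").
        pairs.update(combinations(sorted(taxa), 2))
    return pairs


for pair, n in cooccurrence_counts(SAMPLES).most_common(3):
    print(pair, n)
```

With real data, the per-sample metadata (location, date/time) would allow the same counts to be broken down by place and season.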
AVID Data management and BioBANK. Data management is one of the biggest challenges. The project cannot achieve its goals without great data integration. All samples are biobanked with full data descriptors. Opportunity to share samples across projects? Wildlife samples are very expensive and everyone is collecting them for their own purposes!
Thank You