Computing with large data sets


1 Computing with large data sets Richard Bonneau, spring 2009 Lecture 14 (week 8): genomics 1

2 Central dogma. Gene expression: DNA -> RNA -> Protein. v : computing with data, Richard Bonneau Lecture 14

3 places we can measure / query. Single gene level: DNA (sequencing); RNA (Northern blotting, Q-RT-PCR); Protein (Western blots)

4 places we can measure / query. Single gene level: DNA (sequencing); RNA (Northern blotting, Q-RT-PCR); Protein (Western blots). Genome-wide: Genome (shotgun sequencing, next-generation sequencing: polony seq., pyrosequencing, nanopore seq.); Transcriptome (microarrays, direct sequencing of transcripts); Proteome (2D protein gels? Mass spec! Protein chips?)

5 many points of regulation in biological systems

6 measuring mRNA, RNA: microarrays. A flash movie description:

7 measuring mRNA, RNA: microarrays. Spotting on glass slides (Pat Brown, Stanford): DNA is transferred from multi-well plates (96- or 384-well plates) to slides using a robot

8 measuring mRNA, RNA: microarrays. Targets: RNA (condition 1); RNA (condition 2)

9 measuring mRNA, RNA: microarrays. Targets: RNA (condition 1) + Cy3-dUTP; RNA (condition 2) + Cy5-dUTP -> reverse transcription

10 measuring mRNA, RNA: microarrays. Targets: RNA (condition 1) + Cy3-dUTP; RNA (condition 2) + Cy5-dUTP -> reverse transcription. Alternative: dye incorporated after RT by cross-linking (aminoallyl-dUTP)

11 measuring mRNA, RNA: microarrays. Targets: RNA (condition 1) + Cy3-dUTP; RNA (condition 2) + Cy5-dUTP -> reverse transcription -> hybridization to probes (DNA chip). Alternative: dye incorporated after RT by cross-linking (aminoallyl-dUTP)

12 measuring mRNA, RNA: microarrays. Targets: RNA (condition 1) + Cy3-dUTP; RNA (condition 2) + Cy5-dUTP -> reverse transcription -> hybridization to probes (DNA chip) -> laser 1, laser 2 -> emission. Alternative: dye incorporated after RT by cross-linking (aminoallyl-dUTP)

13 measuring mRNA, RNA: microarrays. Targets: RNA (condition 1) + Cy3-dUTP; RNA (condition 2) + Cy5-dUTP -> reverse transcription -> hybridization to probes (DNA chip) -> laser 1, laser 2 -> emission -> data analysis. Alternative: dye incorporated after RT by cross-linking (aminoallyl-dUTP)

14 measuring mRNA, RNA: microarrays. Bacillus subtilis: genome of 4,106 protein-coding genes, one spot per gene. PCR-amplified probes printed on aminosilane-coated slides, UV-crosslinked

15 measuring mRNA, RNA: microarrays. Bacillus subtilis: genome of 4,106 protein-coding genes, one spot per gene. PCR-amplified probes printed on aminosilane-coated slides, UV-crosslinked. Spotting inconsistencies

16 measuring mRNA, RNA: microarrays, Affymetrix. Short probes (25-mers) synthesized directly on the chip. Specific sequences are built up on oligonucleotide chips by successive rounds of photoactivated chemical reactions, in which A, C, G, or T nucleotides are selectively added in different cells using a series of masking steps. We trade spotting inconsistencies for other problems

17 measuring mRNA, RNA: microarrays

18 measuring mRNA, RNA: microarrays. polyA-RNA (eukaryotes)

19 measuring mRNA, RNA: microarrays. Affymetrix chip: time spent on experiment ~7 days; cost of experiment $150-$600

20 measuring mRNA, RNA: microarrays. Scatter plot, axes: Condition 1, Condition 2. Regions: higher expression in condition 1; majority of spots (no change in expression); higher expression in condition 2
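
A minimal sketch of how one spot on the two-colour array could be placed in one of the three regions of this plot. It assumes condition 1 was labelled with Cy3 and condition 2 with Cy5, as on the earlier slides. The function names, the intensity floor, the two-fold threshold, and the spot values are all hypothetical choices for illustration, not part of the lecture.

```python
import math

def log_ratio(cy5, cy3, floor=1.0):
    """Return (M, A) for one spot.

    M = log2(Cy5 / Cy3) is the log fold change (condition 2 over condition 1).
    A is the mean log2 intensity of the two channels.
    `floor` (an assumed value) avoids taking the log of zero.
    """
    red = max(cy5, floor)
    green = max(cy3, floor)
    m = math.log2(red / green)
    a = 0.5 * (math.log2(red) + math.log2(green))
    return m, a

def classify(m, threshold=1.0):
    """Label a spot using an assumed two-fold cutoff (|M| >= 1)."""
    if m >= threshold:
        return "higher in condition 2"
    if m <= -threshold:
        return "higher in condition 1"
    return "no change"

# Made-up (Cy5, Cy3) intensities for three spots.
spots = {
    "geneA": (800.0, 200.0),
    "geneB": (150.0, 1200.0),
    "geneC": (500.0, 480.0),
}
calls = {name: classify(log_ratio(cy5, cy3)[0]) for name, (cy5, cy3) in spots.items()}
for name, call in calls.items():
    print(name, call)
```

Real data would need background subtraction and dye-bias normalization before ratios like these are trusted; the sketch skips both.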

21 different platforms have different systematic error. Single molecule detection: way too early to tell if it will scale to 500,000. MPSS and direct sequencing: expensive, but you can detect things you did not expect, and genetic variants. Affy: 25-mers are short; oligo synthesis is 98-99% (percent yield) per step, so the average Affy spot is 0.98^25 ≈ 0.60 to 0.99^25 ≈ 0.78 pure. NimbleGen: all Affy's problems, more flexible, no mask = stray light (oops!). Printed oligo arrays: hard to scale up past 500,000 features; DNA can be purified before printing.
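
The purity arithmetic for in-situ synthesized probes can be checked with a few lines. This sketch follows the slide in raising the per-step yield to the 25th power; the function name is made up for illustration.

```python
def full_length_fraction(step_yield, length):
    """Fraction of probes that are full length if every synthesis step
    succeeds independently with probability `step_yield`."""
    return step_yield ** length

for step_yield in (0.98, 0.99):
    fraction = full_length_fraction(step_yield, 25)
    print(f"{step_yield:.2f} per step -> {fraction:.2f} full-length 25-mers")
```

Because the yield compounds at every step, the same calculation shows why much longer probes are hard to synthesize directly on the chip.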

22 what's next. Single molecule detection: NanoString (Seattle)(?): small sample, no amplification, good on lower end of dynamic range. Next generations of: qPCR (unlikely?): still need to guess transcripts beforehand. SAGE (?) + MPSS (Lynx) (?): still expensive, new; they are ideal for when you're not sure of your probes or your transcripts. For T. parva, MPSS was done on key stages of pathogenesis during the genome sequencing project; the MPSS was then used to comment on which genes had evidence of expression, and new islands were found in the telomeres that were not suspected prior and are now thought to be critical in host recognition. Direct sequencing: now that sequencing is so cheap we can directly sequence transcripts.

23 transcription factors control expression of genes. Multiple antibiotic resistance regulator bound to promoter fragment: 1. Major groove interactions 2. Minor groove interactions 3. Backbone contacts

24 transcription factors control expression of genes

25 ChIP-chip

26 Chromatin Immunoprecipitation followed by chip Laub (2002) Proc. Natl. Acad. Sci. USA 99,

27 ChIP-chip. Diagram labels: binding site; ChIP; IP-enriched DNA vs. unenriched DNA; transcription (positive regulation, negative regulation). Possible because most regulatory sequences in bacteria are located within 200 bp upstream of the start codon and the length of DNA fragments after IP and sonication is ~300 bp, so overlap is possible
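
The overlap argument can be made concrete with a small calculation. This is a sketch under simplifying assumptions that are not from the lecture: the binding site is treated as a single point, every fragment has the same length, and fragment start positions are uniformly distributed among those that cover the site. The function name is hypothetical.

```python
def overlap_fraction(site_upstream_bp, fragment_len_bp=300):
    """Fraction of fragments covering a binding site that also extend past
    the start codon, and so can hybridize to a probe for the gene.

    The site sits `site_upstream_bp` upstream of the start codon. A fragment
    of length L covering the site can start anywhere in a window of width L;
    it reaches the coding region only for L - d of those positions.
    """
    if site_upstream_bp < 0:
        raise ValueError("distance upstream must be non-negative")
    reach = fragment_len_bp - site_upstream_bp
    return max(0.0, reach / fragment_len_bp)

for distance in (50, 100, 200, 300):
    print(f"site {distance} bp upstream: {overlap_fraction(distance):.2f} of fragments overlap the gene")
```

Under these assumptions even a site at the 200 bp limit leaves about a third of the ~300 bp fragments overlapping the gene, which is why spotted gene probes can report on upstream binding.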

28 measuring proteins and protein modifications. Proteins do not hybridize, so microarray technology doesn't work. What can we measure to quantitate protein levels in the cell: - mass of peptides from protein degradation products - binding to antibodies - migration on a gel or in an electric field - genetically modified fluorescent proteins. Movie 1st:

29 2D-SDS PAGE gel. The first dimension (separation by isoelectric focusing): - gel with an immobilised pH gradient - electric current causes charged proteins to move until they reach their isoelectric point (the position in the pH gradient where their net charge is 0). The second dimension (separation by mass): - pH gel strip is loaded onto an SDS gel - SDS denatures the protein (to make movement solely dependent on mass, not shape) and masks its intrinsic charge with a uniform negative charge.

30 2D-SDS PAGE gel

31 2D-gel technique example

32 Current Mass Spec Technologies. Proteome profiling/separation: 2D SDS PAGE (identify proteins); 2-D LC/LC (high-throughput analysis of lysates; LC = liquid chromatography); 2-D LC/MS (MS = mass spectrometry). Protein identification: peptide mass fingerprint; tandem mass spectrometry (MS/MS). Quantitative proteomics: ICAT (isotope-coded affinity tag); iTRAQ

33 MALDI: matrix-assisted laser desorption ionization

34 Electrospray Ionization-MS. Ion source -> mass analyzer -> detector. Quadrupole time-of-flight (Q-TOF): CID (collision-induced dissociation) spectra are obtained from the MS/MS mass analyzer. (2) Aebersold, R.; Mann, M. Nature 2003, 422,

35 Mass Spectrometry (MS): 1. Introduce sample to the instrument 2. Generate ions in the gas phase 3. Separate ions on the basis of differences in m/z with a mass analyzer 4. Detect ions
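
Since the mass analyzer separates ions by m/z rather than by mass, the same peptide shows up at different positions depending on its charge state. A minimal sketch, assuming positive-mode ionization by protonation; the function name and the example mass are made up for illustration.

```python
PROTON_MASS = 1.007276  # mass of a proton in daltons

def mz(neutral_mass, charge):
    """m/z of an ion formed by adding `charge` protons to a neutral molecule."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge

# A hypothetical peptide with a neutral mass of 1000.0 Da.
for z in (1, 2, 3):
    print(f"z = {z}: m/z = {mz(1000.0, z):.4f}")
```

Electrospray typically produces several charge states of the same peptide, so one peptide can appear as a series of peaks like these.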

36 2D - LC/LC. Study protein complexes without gel electrophoresis. Proteins are digested (trypsin); peptides all bind to cation exchange column; successive elution with increasing salt gradients separates peptides by charge; peptides are then separated by hydrophobicity on reverse phase column. Complex mixture is simplified prior to MS/MS by 2D LC
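
The trypsin step can be sketched in silico. The code below assumes the commonly used cleavage rule (cut after K or R unless the next residue is P) and no missed cleavages; the function name and the example sequence are hypothetical.

```python
def trypsin_digest(sequence):
    """Split a protein sequence into tryptic peptides.

    Assumed rule: cleave on the C-terminal side of K or R, except when the
    following residue is P. Missed cleavages are not modelled.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Whatever follows the last cleavage site is the C-terminal peptide.
        peptides.append(sequence[start:])
    return peptides

example = "MKWVTFISLLRPGAKDE"  # made-up sequence for illustration
print(trypsin_digest(example))
```

Each resulting peptide ends in K or R (apart from the C-terminal one), which gives tryptic peptides at least one basic residue to carry positive charge on the cation exchange column.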

37 2D - LC/MS

38 Methods for protein identification