Measuring and Understanding Gene Expression

1 Measuring and Understanding Gene Expression. Dr. Lars Eijssen, Dept. of Bioinformatics, BiGCaT. Sciences programme 2014.

2 Why are genes interesting? [Diagram: genome (genomics), via transcription, to transcriptome (transcriptomics), via translation, to proteome (proteomics), via metabolism, to metabolome (metabolomics)]

3 Gene expression is an indirect measure of effect. DNA → RNA → protein → phenotype. Gene expression is measured at the RNA level (via cDNA) and does not take into account: RNA transport; translation to protein; post-translational modification; protein degradation.

4 Gene expression only measures the production of the RNA. We assume this corresponds to the amount of protein. But if there's an even bigger increase in degradation, then more transcription doesn't equal more protein. [Diagram: gene expression (what we measure), protein, degradation]

5 Microarrays for gene expression. Image: labmate-online.com

6 Each spot represents a gene. Slide: J. van Delft

7 DNA hybridisation: a fishing expedition. Probes and targets. In reality, each gene is detected by many, many probes.

8 Two-colour versus one-colour arrays. Two-colour arrays are useful for e.g. paired designs.

9 A scan overlay image of both channels (Agilent). Spot signal quantification (scanner software): intensities + quality measures (e.g. spot size, evenness). Images (l to r): agilent.com, chem.agilent.com

10 One- and two-colour experiments. In two-colour experiments, two samples are hybridised to the same slide, as shown on the previous slide. In one-colour experiments, just one sample is hybridised per array. For data analysis, this distinction is very important. The specific platform used is also important.

11 Example one colour: Affymetrix

12 Affymetrix chips. [Image of hybridized probe array; scale label: 1.28 cm]

13 Overview of microarray procedures. Steps, in order: biological question; experimental design; biological experiment; RNA extraction; amplification; labelling; hybridisation; scanning (lab); pre-processing (background correction, normalisation, filtering); determination of differentially expressed genes; further processing (clustering, classification, pathway analysis) (computer); validation; publication.

14 Preprocessing. Microarray scans → image analysis (determining the intensity of each spot) → raw data → quality control → normalization → normalized data → statistical analysis → list of regulated genes → pattern analysis, pathway analysis, literature data → results.

15 Quality control. Quality control should allow you to detect chips where problems have affected the reliability of their measurements. Problems can include: staining, hairs, scratches; air bubbles; low hybridisation rates; imperfect washing; differences in signal between arrays.

16 Why is computer-aided QC needed? Example file after scanning and image analysis, with columns such as background intensity and foreground intensity, and lots more!

17 Example of staining (artificial colours)

18 Probes and probesets. On Affymetrix arrays, each gene is measured by a set of oligonucleotide probes (dozens; the number varies between chips). Each probe targets a different part of the gene's sequence. So each gene is measured by a so-called probeset. The probes of each probeset are distributed randomly over the array. This ensures that the gene can still be measured even if there is a stain or other aberration on the array.

19 Background correction. Measure the intensity of the background around the spot as well as the intensity of the spot itself. Reported intensity = spot intensity - background intensity.
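
A minimal Python sketch of the subtraction above. The `floor` parameter is an assumption added here (not on the slide) so that a spot dimmer than its surroundings does not end up with a zero or negative value before a later log transformation; the intensities are made-up numbers.

```python
def background_correct(foreground, background, floor=1.0):
    """Subtract the local background from each spot intensity.

    `floor` is an illustrative assumption: values that would drop
    below it after subtraction are clamped to it.
    """
    return [max(fg - bg, floor) for fg, bg in zip(foreground, background)]


# Hypothetical spot and local-background intensities for three spots.
spot_intensity = [1200.0, 310.0, 95.0]
local_background = [100.0, 90.0, 110.0]

corrected = background_correct(spot_intensity, local_background)
print(corrected)
```

The third spot is dimmer than its background, so it is clamped to the floor rather than reported as a negative intensity.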

20 Uneven background (image)

21 Normalisation. After discarding bad arrays and spots, the remaining within- and between-array differences that are not related to the biology need to be corrected for. For example, some chips may have an overall lower (or higher) intensity than others. We correct for this by normalisation, which aims to make the values from different chips comparable.

22 The need for normalisation

23 Two-colour arrays: dye bias. [Plots: foreground intensity; background intensity]

24 Red and green foreground intensity. For dual-channel arrays, it is relevant to check whether effects cancel out between channels.

25 How we normalise: making the profiles as similar as possible. The assumption is that most genes don't change!

26 Log transformation. Generally, the intensities are first log2-transformed. The distribution of the logged intensities is closer to normal than on the original scale. We are normally interested in the ratio of a gene's expression between experimental groups, called the fold change, a / b. This transforms to a difference on the log scale, the log fold change: log2(a) - log2(b).

27 The log fold change. The logFC spreads out the data and offers symmetry. Raw ratio (FC): ½, 1, 2. Log ratio (logFC), the log2 of the raw ratio: -1, 0, 1.
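
A short Python sketch of the log fold change, using made-up expression values. It illustrates the symmetry point: halving and doubling give log fold changes of the same size with opposite sign, whereas the raw ratios ½ and 2 are not symmetric around 1.

```python
import math


def log2_fold_change(a, b):
    """Log fold change of expression a relative to b: log2(a) - log2(b)."""
    return math.log2(a) - math.log2(b)


# Hypothetical intensities: halved, unchanged, and doubled relative to 100.
for a, b in [(50.0, 100.0), (100.0, 100.0), (200.0, 100.0)]:
    print(f"FC = {a / b:4.2f}   logFC = {log2_fold_change(a, b):+.2f}")
```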

28 Normalisation is cyclic. Several QC plots are made before and after normalisation. Whether normalisation can correct an artifact may influence the decision to discard an array or not. After data selection, the complete QC should be run again, because some aberrations may have been masked by larger ones.

29 Pre-processing for Affymetrix chips. A specific extra step is summarisation of probe values into one value for each probeset. Well-known methods for pre-processing Affymetrix chips include: RMA (Robust Multi-array Average), which includes both background correction and (quantile) normalisation; and GC-RMA, which is like RMA but also takes into account GC content. The proportion of G and C nucleotides determines hybridisation properties (a G-C base pair has three hydrogen bonds, an A-T pair only two).
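
To make the quantile normalisation step concrete, here is a minimal pure-Python sketch. It is not the RMA implementation: it covers only the quantile step, breaks ties by position (a simplification), and uses made-up log-intensities for three hypothetical arrays.

```python
def quantile_normalise(arrays):
    """Give every array the same distribution of values.

    arrays: list of equal-length lists, one list of log-intensities per array.
    Each value is replaced by the mean, across arrays, of the values that
    share its rank. Ties are broken by position (a simplification).
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across arrays at each rank.
    reference = [
        sum(s[rank] for s in sorted_arrays) / len(arrays) for rank in range(n)
    ]
    normalised = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices, lowest value first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalised.append(out)
    return normalised


# Hypothetical log-intensities: three arrays, four probesets each.
chips = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalised = quantile_normalise(chips)
for row in normalised:
    print(row)
```

After this step every array contains exactly the same set of values, while the ordering of probesets within each array is preserved.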

30 How do I know which gene is where? Mostly, annotation files are provided by the manufacturer of the microarray used. These may be outdated. Nowadays, the probe sequences are mostly provided by the manufacturer, so one can update the annotation based on the most recent builds of genome databases. Some labs provide regularly updated annotations, such as BrainArray in the case of Affymetrix.

31 Low intensity filtering. [Plots before and after filtering: difference between groups against average intensity]. Low-intensity spots are more subject to noise. Filtering can be done at a later stage.
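
One simple way to filter on intensity, sketched in Python. The rule used here (keep a gene if its mean log2 intensity across samples reaches a threshold) and the threshold value are assumptions for illustration; the slide does not prescribe a particular rule. Gene names and values are made up.

```python
def filter_low_intensity(expression, threshold):
    """Keep genes whose mean log2 intensity across samples is >= threshold.

    expression: dict mapping gene name -> list of log2 intensities.
    """
    kept = {}
    for gene, values in expression.items():
        if sum(values) / len(values) >= threshold:
            kept[gene] = values
    return kept


# Hypothetical log2 intensities for three genes in four samples.
expression = {
    "geneA": [10.2, 10.5, 9.8, 10.1],
    "geneB": [4.1, 3.8, 4.4, 3.9],   # close to background, noisy
    "geneC": [7.0, 7.3, 6.9, 7.1],
}
kept = filter_low_intensity(expression, threshold=6.0)
print(sorted(kept))
```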

32 ArrayAnalysis.org: automated QC and processing of data sets. [Diagram: local machine, web server, calculation server]

34 Boxplots

35 Virtual (spatial) images. MA plots.

36 Microarrays: not only for gene expression! Large-scale measurements by microarray/chip technologies: gene expression (mRNA): used for a long time; SNP (genetic variations): used actively; DNA sequence: used actively; ChIP-on-chip: used actively; methylation (MeDIP-chip): used actively; miRNA: used actively; splice variants (tiling array, exon array): somewhat in use, difficult to analyse; protein: not really well-developed yet.

37 Next Generation Sequencing (NGS). A more recent development is the use of Next-Generation Sequencing (NGS), or High Throughput Sequencing (HTS), to measure mRNA. We call this RNA-seq. It reads all the mRNA fragments provided, giving us counts of the frequency of each mRNA across the whole genome. Note: NGS/HTS is also used for sequencing DNA in order to find genetic variations between individuals!
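
The idea of "counts of the frequency of each mRNA" can be sketched in Python as follows. This assumes the mapping step has already been done and that each read comes with the name of the gene it mapped to, or `None` if it did not map; real pipelines use dedicated alignment and counting tools, and the gene names here are made up.

```python
from collections import Counter


def count_reads_per_gene(read_assignments):
    """Count how many reads were assigned to each gene.

    read_assignments: iterable with one entry per read, either a gene
    name or None for a read that could not be mapped to a gene.
    """
    return Counter(gene for gene in read_assignments if gene is not None)


# Hypothetical assignments for six reads.
assignments = ["geneA", "geneB", "geneA", None, "geneC", "geneA"]
counts = count_reads_per_gene(assignments)
print(counts.most_common())
```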

38 Next generation sequencing (2004). DNA is fragmented. Adapters are attached to the DNA fragments. Fragments are amplified. Fluorescently tagged nucleotides are incorporated. Cyclic readout on the array.

39 Bridge amplification. DNA fragments are flanked with adaptors. A flat surface is coated with two types of primers, corresponding to the adaptors. Amplification proceeds in cycles, with one end of each bridge attached to the surface. Used by Solexa (Illumina).

40 Processing the data.
Microarrays: microarray scans → image analysis → raw data → quality control → normalization → normalized data → statistical analysis.
RNA-seq: reads → genome mapping → counts of mapped genes → quality control → normalization → normalized data → statistical analysis.
The statistical analysis uses different methods for the two technologies. In both cases: list of regulated genes → pattern analysis, pathway analysis, literature data → results.

41 Comparing RNA-seq and microarrays.
Price: microarray, cheapest; RNA-seq, more expensive.
Genes measured: microarray, known genes on chip; RNA-seq, all genes, including isoforms.
Accurate measurements? Microarray, above a certain background level and below saturation; RNA-seq, at all levels.
Data analysis: microarray, straightforward, many tools available; RNA-seq, more complex.

42 Finding interesting genes. Once we have numbers for the measurements of the genes, we need a way to find the genes which are interesting to us. Several different filters are regularly used: significance in a statistical test looking for differences between two groups (e.g. t-test or Poisson); fold change between two conditions; correlation to another feature of interest; multiple testing correction. Often significance and fold change are combined, so we want to find genes with a particular fold change AND a significant t-test result.

43 Using fold change and statistical significance. Often people use both fold change and statistical significance between two groups to determine the list of significant genes. [Diagram: genes divided by significant versus non-significant and by high versus low fold change]
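
Combining the two criteria can be sketched in Python as below. The cut-offs (an absolute log2 fold change of at least 1 and p < 0.05) are common choices but are assumptions here, as are the gene names and numbers; the p-values are taken as already computed by a statistical test.

```python
def select_genes(results, min_abs_logfc=1.0, alpha=0.05):
    """Return genes that pass both the fold-change and significance cut-offs.

    results: dict mapping gene name -> (log2 fold change, p-value).
    """
    return sorted(
        gene
        for gene, (logfc, p_value) in results.items()
        if abs(logfc) >= min_abs_logfc and p_value < alpha
    )


# Hypothetical results, one gene per quadrant of the diagram.
results = {
    "geneA": (2.1, 0.001),   # high fold change, significant
    "geneB": (0.2, 0.002),   # low fold change, significant
    "geneC": (-1.8, 0.300),  # high fold change, not significant
    "geneD": (0.1, 0.700),   # low fold change, not significant
}
selected = select_genes(results)
print(selected)
```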

44 Why do we need to worry about multiple testing? If we have 10,000 measurements for each item in 2 groups and use a t-test to find measurements that differ between the two groups, we expect 500 of the measurements to be significantly different in the t-test (p<0.05) by chance alone, even when there is no real difference (10,000 * 0.05 = 500). With 10,000+ genes measured by each microarray, we can get many false positive results.
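
A small Python simulation of this effect. It relies on the fact that, when there is no real difference, p-values are uniformly distributed between 0 and 1, so each test is modelled as one uniform random draw rather than an actual t-test. The seed is arbitrary and only makes the run repeatable.

```python
import random


def count_false_positives(n_tests, alpha=0.05, seed=0):
    """Simulate n_tests null tests and count how many give p < alpha.

    Each p-value is a uniform random number, which is how p-values
    behave when there is no true difference between the groups.
    """
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_tests))


n_genes = 10_000
expected = n_genes * 0.05
observed = count_false_positives(n_genes)
print(f"expected about {expected:.0f} false positives, simulated {observed}")
```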

45 How do we deal with multiple testing? Examples of methods: Bonferroni: a very strict correction; very few false positives remain, but we will discount many true positives too. Adjusted p-value = calculated p-value * number of tests done. E.g. we test 100 genes to see if they are different between the two groups. BRCA1 gives a p-value of 0.002; the adjusted p-value is 0.002 * 100 = 0.20, not significant. Benjamini-Hochberg: we set the % of results which we can tolerate as false positives.
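
Both corrections can be sketched in a few lines of Python. The Benjamini-Hochberg function below returns adjusted p-values (each p-value times the number of tests divided by its rank, made monotone from the largest p-value downwards), which can then be compared with the tolerated false discovery rate. Apart from the BRCA1 example from the slide, the p-values are made up.

```python
def bonferroni(p_values):
    """Bonferroni: multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(p * m, 1.0) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Slide example: 100 tests, one of which (BRCA1) has p = 0.002.
# The other 99 p-values are placeholders.
p_values = [0.002] + [0.5] * 99
print(bonferroni(p_values)[0])

# A smaller made-up example for Benjamini-Hochberg.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```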

46 What next? Once we have found a list of genes which are correlated or significantly changed between two groups, we often still have 1,000s of genes to consider. We may use these genes to build models of the differences between the groups. Or search the literature. Or ...

47 Visualising our data. Understanding data is much easier when we can visualise it. E.g. even a small table of numbers is much harder to understand than a graph showing the same data. With microarray data we can have 10,000s of measurements for 100s of samples, so we need other techniques to visualise the data. PCA (principal components analysis) lets us see which samples are most similar by displaying our dataset in 2 or 3 dimensions. Clustering also gives us a visualisation of which samples are similar and which genes cause this similarity (next lecture).

48 Clustering and PCA plots

49 PCA. [Plots: samples on Gene 1 / Gene 2 axes with the PC1 and PC2 directions; percentage of variation explained by PC1 and PC2]
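
For the two-gene case in the plot, PCA can be worked out by hand. The Python sketch below computes the direction of PC1 and the fraction of variation explained by each component from the 2 x 2 covariance matrix. It is restricted to two genes for illustration (real data sets need a linear algebra library), and the sample values are made up.

```python
import math


def pca_2d(points):
    """PCA for samples measured on two genes.

    points: list of (gene1, gene2) expression pairs, one per sample.
    Returns (pc1_direction, explained), where explained holds the fraction
    of total variance along PC1 and PC2. Assumes the samples are not all
    identical, so the total variance is non-zero.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Eigenvalues of the covariance matrix [[sxx, sxy], [sxy, syy]].
    centre = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    lam1, lam2 = centre + spread, centre - spread
    # Angle of the leading eigenvector (the PC1 direction).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(theta), math.sin(theta))
    total = lam1 + lam2
    return pc1, (lam1 / total, lam2 / total)


# Hypothetical samples in which gene 1 and gene 2 rise together.
samples = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
pc1, explained = pca_2d(samples)
print("PC1 direction:", pc1)
print("variance explained (PC1, PC2):", explained)
```

Because the two genes are strongly correlated in this example, almost all of the variation lies along PC1.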

50 Pathway analysis. After retrieving lists or clusters of differentially expressed genes, one usually performs pathway analysis to find out which processes are affected most. This will be covered in the next lecture!

51 Biological confirmation. Transcriptomic experiments can be thought of as hypothesis-generating experiments. The differential up- or down-regulation of specific genes can be technically confirmed using independent assays (RT-PCR). We may use knockout/knockdown experiments to examine the effects of specific genes of interest from the transcriptomics experiment. Findings can also be confirmed on other levels (e.g. protein abundance, protein localisation detection with antibodies, enzyme activity). Findings can be related to the literature and information in online resources.

52 Online storage of microarray results. When publishing papers based on microarray data, one is encouraged (or even obliged) to store the data in online databases. Standards have been developed to describe microarray experiments and data, such as MIAME: Minimum Information About a Microarray Experiment. Two main databases exist: ArrayExpress at EBI (European Bioinformatics Institute) and Gene Expression Omnibus (GEO) at NCBI (USA).

53 Acknowledgement. Dr. Rachel Cavill, Department of Toxicogenomics, for providing some of these slides.