Podiplomski študij Biomedicine

Size: px
Start display at page:

Download "Podiplomski študij Biomedicine"

Transcription

1 Podiplomski študij Biomedicine Genetika: Genetski koncepti ( ) Uravnavanje izražanja genov, transkriptomika in drugi pristopi po-genomskega obdobja 1. Osnove uravnavanje izražanja genov. Prof. dr. Damjana Rozman 2. Študije genoma in transkriptoma z DNA mikromrežami (čipi). 3. Študije genoma in transkriptoma z novo generacijo sekveniranja. 4. Pomen informatike pri študijah omov. 5. Primeri uporabe omov : kemogenomika, toksikogenomika, farmakogenomika.

2 2008

3 Različne ravni uravnavanja

4 Molekularni dogodki pri izražanju genov Prenos signala 30 nm vlakno razdružitev nukleosomov Premik nukleosomov s tkivno-specifičnimi faktorji 10 nm vlakno Prostorsko razporejanje DNA Aktivatorji prepisovanja polii Preiniciacijski kompleks

5 The chromosome territory interchromatin compartment (CT IC) model Lanctôt et al. Nature Reviews Genetics 8, (February 2007) doi: /nrg2041 Chromatin is organized in distinct CTs. Nuclear topography remains a subject of debate, expecially with regard to the extent of chromatin loops expanding into the IC and intermingling between neighbouring CTs and chromosomal subdomains.

6 Three hypothetical chromosome territories Chromatin) might become repositioned over time; Subchromosomal 1 Mb domains (corresponding to DNA replication units that are labelled during early S phase32) are depicted as filled circles. The lighter-green circles indicate gene-poor chromatin; Dark-coloured circles indicate gene-dense chromatin. Areas that contain transcriptionally active genes are indicated by dark background shading. Chromatin mobility is constrained in several different ways. The chromatin mobility is constrained by the tendency of transcriptionally active and transcriptionally inactive chromatin regions to cluster in three-dimensional space. Lanctôt et al. Nature Reviews Genetics 8, (February 2007) doi: /nrg2041

7 Chromatin mobility allows dynamic interactions between genomic loci and between loci and other nuclear structures. a) Movement of chromatin from one compartment to another leads to changes in expression of the corresponding genomic regions. Activation is triggered by repositioning of the gene loci to an activating compartment, away from silencing compartments. b) Compartments are transient self-organizing entities. Gene activation leads to dissolution of the silencing compartments, changes in gene positioning and de novo assembly of an activating compartment. Lanctôt et al. Nature Reviews Genetics 8, (February 2007) doi: /nrg2041

8 Uravnavanje izražanja genov RNA polimeraze, osnovni aparat in cikel transkripcije Mehanizmi prepisovanja in aktivacije Kontrola prepisovanja preko spremembe strukture kromatina Epigenetska kontrola Kombinirani dogodki pri prenosu signala Metode za študij interakcij DNA-proteini

9 Osnovni aparat RNA polimeraza Transkripcijski aparat Regulatorji (aktivatorji, represorji; trans-delujoči faktorji) Cis-elementi Cis-regulatorni moduli (CRM); vzpodbujevalci (enhacers), utiševalci (silencers)

10 Generating a high-resolution atlas of mesodermal CRMs. RP Zinzen et al. Nature 462, (2009) doi: /nature08531

11 Bakterije 1 RNA polimeraza Evkarionti RNA pol I (45s pre-rrna ) -RNA pol II (mrna kodirani geni, mirna) -RNA pol III (trna, 5S rrna)

12 Tovarna RNA polimeraze II A.J. Courey, Mechanisms in transcriptional regulation, 2008)

13 Komunikacija med CRM in promotorjem A.J. Courey, Mechanisms in transcriptional regulation, 2008)

14 Tipični pol II gen ima promotor, ki zavzema področje cca 200 bp navzgor od mesta pričetka prepisovanja. Lahko vsebuje tudi pospeševalec (enhancer), ki je oddaljen poljubno (???) B. Lewin, GenesVII, 2000

15 Struktura gena razreda pol II A.J. Courey, Mechanisms in transcriptional regulation, 2008)

16 Core promoter-selective RNA polymerase II transcription Gross P, Oelgeschlager T, Transcription Laboratory, Marie Curie Research Institute, The Chart, Oxted, Surrey RH8 OTL, UK. The core promoter: -minimal DNA region that is sufficient to direct low levels of activator-independent (basal) transcription by RNAP II in vitro; -typically extends approx. 40 bp up- and downstream of the start site of transcription; -can contain several distinct core promoter sequence elements; -contains TATA box or INR (initiator) element. Computational analysis of metazoan genomes suggests that the prevalence of the TATA box has been overestimated in the past and that the majority of human genes are TATA-less. While TATA-mediated transcription initiation has been studied in great detail and is very well understood, very little is known about the factors and mechanisms involved in the function of the INR and other core promoter elements. Biochem Soc Symp. 2006;(73):

17 TATA-less promoter example CYP19 (aromatase) 3, (November 2003)

18 TATA-less promoter - example The human CYP19 (P450arom) gene is located in the chromosome 15q21.2 region and is comprised of a 30 kb coding region and a 93 kb regulatory region. The unusually large regulatory region contains 10 tissue-specific promoters that are alternatively used in various cell types. Each promoter is regulated by a distinct set of regulatory sequences in DNA and transcription factors that bind to these specific sequences. The most recently characterized promoter Is TATA-less and accounts for the transcription of 29-54% of P450arom mrnas in breast cancer tissues and contains endothelial-type cis-acting elements that interact with endothelial-type transcription factors, e.g. GATA-2. This promoter may upregulate aromatase expression in vascular endothelial cells. The in vivo cellular distribution and physiologic roles of this promoter in healthy tissues is not known. J Steroid Biochem Mol Biol Sep;86(3-5):219-24

19 TATA-less promoters Molina and Grotewold BMC Genomics :25

20 Transkripcijski cikel A.J. Courey, Mechanisms in transcriptional regulation, 2008)

21 Transkripcijski mehurček A.J. Courey, Mechanisms in transcriptional regulation, 2008)

22 RNA polimeraza se lahko premika v obe smeri A.J. Courey, Mechanisms in transcriptional regulation, 2008)

23 Vezavno mesto za DNA-RNA hibrid Matrična veriga Kodirajoča veriga Cramer et al., Science 288, (2000)

24 Model of the pol II transcription initiation complex with the addition of electron microscope structures of TFIIE, TFIIF, and TFIIH. Bushnell et al., Science 303, (2004)

25 Splošni transkripcijski faktorji RNA polimeraze II TFIID: velik multiproteinski kompleks, ki prične sestavljanje transkripcijskega aparata. Vsebuje TBP in TAFe TFIIB: interagira z TBP in z DNA (analog bakterijskih σ faktorjev) TFIIF: udeležen pri iniciaciji kot pri podaljševanju verige. TFIIE: veže TFIIH TFIIH ima kinazno aktivnost, je helikaza, sodeluje tudi pri popravljanja DNA. Predstavlja povezavo med prepisovanjem in popravljanjem DNA in uravnavanjem celičnega cikla. TFIIA - po nekaterih izsledkih ne spada med splošne transkripcijske faktorje temveč med specializirane TFIID koaktivatorje. V pogojih in vitro veže TBP-DNA kompleks in povzroči širjenje DNA-protein kontaktov, obstajajo pa tudi dokazi, da ima pomen le pri kontaktiranju z gensko-specifičnimi transkripcijskimi faktorji.

26 Kromatin in prepisovanje

27 N-terminalni repki histonov so gibljivi in obrnjeni navzven. Modifikacija histonskih repkov vodi do strukturnih sprememb, potrebnih pri pričetku prepisovanja GenesVII

28

29 keithr/fig1pt2.html

30 Three hypothetical chromosome territories Chromatin) might become repositioned over time; Subchromosomal 1 Mb domains (corresponding to DNA replication units that are labelled during early S phase32) are depicted as filled circles. The lighter-green circles indicate gene-poor chromatin; Dark-coloured circles indicate gene-dense chromatin. Areas that contain transcriptionally active genes are indicated by dark background shading. Chromatin mobility is constrained in several different ways. The chromatin mobility is constrained by the tendency of transcriptionally active and transcriptionally inactive chromatin regions to cluster in three-dimensional space. Lanctôt et al. Nature Reviews Genetics 8, (February 2007) doi: /nrg2041

31 Transkripcijske (nukleolarne) Tovarne v celicah HeLa Nascentni transkripti označeni z Br-UTP, Br-RNA imunobarvana s fluorescin izotiocianatom (zelena), nukleinske kisline označene s TO-TO-3 (rdeča) Nukleolarna tovarna Nukleoplazemska tovarna Millerjeva razporeditev rrna transkripcijska enota (pol I) s ~125 prepisi Pol II transkripcijska enota z enim prepisom Cook P.R., Science 284, (1999)

32 Transkripcijski cikel RNA II polimeraza ima večjo potezno moč kot kinezin in miozin. Energijo hidrolize ATP pretvori v mehansko in kinetično energijo Cook P.R., Science 284, (1999)

33 RNA polimeraze, osnovni aparat in cikel transkripcije povzetek Cikel transkripcije vsebuje začetek (iniciacijo), podaljševanje (elongacijo) in zaključek (terminacijo) Transkripcijo katalizirajo evolucijsko ohranjene RNA polimeraze. Evkarionti imajo RNA poi I, II in III. Potrebni pa so še drugi proteini (RNA polimeazni kompleks) Jedro RNA polimeraze vsebuje par klešč (clamp), ki tvorijo režo aktivnega centra encima za sintezo RNA. V stopnji podaljševanja DNA doseže aktivno mesto skozi kanal med kleščama, se denaturira (mehurček), rastoča RNA veriga pa tvori heterodupleks z matrično verigo DNA. NTP-ji dosežejo aktivno mesto skozi sekundarni kanal. Če polimeraza naleti na oviro, se lahko premakne nazaj (back-tracking). TBP je univerzalni transkripcijski faktot, ki prepozna TATA, potreben pa je tudi za prepisovanje s promotorjev brez TATA.

34 CRM in kromatin - povzetek Cis-regulatorni modul (CRM) je področje DNA, kamor se vežejo številni transkripcijski faktorji in uravnavajo izražanje bližnjih genov. Posamezni gen ima pri živalih lahko do 20 CRM. CRM se običajno nahajajo v 5 neprevedenih regijah gena ali v intronih. Prepisovanje genov se dogaja v nukleolarnih tovarnah, ki predstavljajo transkripcijsko aktivna področja jedra. Acetilacija (deacetilacija) uravnava kondenzacijo kromatina. Za prepisovanje primeren je acetiliran kromatin.

35 2. Študije transkriptoma z DNA mikromrežami (čipi), povezave z drugimi omi The suffix -ome- originated as a back-formation from genome, a word formed in analogy with chromosome. Because genome refers to the complete genetic makeup of an organism, some people have made the inference that there exists some root, * -ome-, of Greek origin referring to wholeness or to completion. Because of the success of large-scale quantitative biology projects such as genome sequencing, the suffix "-ome-" has migrated to a host of other contexts. Bioinformaticians and molecular biologists figured amongst the first scientists to start to apply the "-ome" suffix widely.

36 employees.csbsju.edu/.../ olbindtransciption.html

37 The ome world The transcriptome, the mrna complement of an entire organism, tissue type, or cell; with its associated field transcriptomics The metabolome, the totality of metabolites in an organism; with its associated field metabolomics The metallome, the totality of metal and metalloid species; with its associated field metallomics The lipidome, the totality of lipids; with its associated field Lipidomics The glycome, the totality of glycans, carbohydrate structures of an organism, a cell or tissue type. Glycomics: The associated field of study. The interactome, the totality of the molecular interactions in an organism a once proposed field of interactomics has generally become known as systems biology The spliceome (see spliceosome), the totality of the alternative splicing protein isoforms; with its associated field spliceomics. The ORFeome refers to the totality of DNA sequences that begin with the initiation codon ATG, end with a nonsense codon, and contain no stop codon. Such sequences may therefore encode part or all of a protein. The Phenome - the organism itself. The Phenome is to the genome what the phenotype is to the genotype. Also, the complete list of phenotypic mutants available for a species. The Exposome - the collection of an individual's environmental exposures.

38

39 Genomics Genomics is the study of an organism's entire genome. The field includes intensive efforts to determine the entire DNA sequence of organisms and fine-scale genetic mapping efforts.

40 Študije transkriptoma z DNA čipi Analiza genoma (genotpizacija enojnih nukleotidnih polimorfizmov SNP, analiza variance števila kopij CNV oz. CGH, resekvenciranje). Analiza transkriptoma (ekspresijsko profiliranje: 3' ekspresijski, eksonski ali genski čipi, čipi za sledenje izražanja mirna). Študije uravnavanja izražanja genov (kromatinska imunoprecipitacija na čipu ChIP-on-Chip, mapiranje prepisov).

41 A DNA microarray A DNA microarray (also commonly known as gene chip, DNA chip, or biochip) is a collection of microscopic DNA spots attached to a solid surface, such as glass, plastic or silicon chip forming an array for the purpose of monitoring of expression levels for thousands of genes simultaneously. The affixed DNA segments are known as probes (although some sources will use different nomenclature), thousands of which can be used in a single DNA microarray. Microarray technology evolved from Southern blotting, where fragmented DNA is attached to a substrate and then probed with a known gene or fragment. Measuring gene expression using microarrays is relevant to many areas of biology and medicine, such as studying treatments, disease, and developmental stages. For example, microarrays can be used to identify disease genes by comparing gene expression in disease and normal cells.

42 Different parts of the gene as probes on the array Robust, simple representation focusing on the 3' ends. Genome-wide, exon-level analysis on a single array a survey of alternative splicing and gene expression. High-density tiled microarrays for transcript mapping.

43 Parts of a gene represented in different types of microarrays

44 Hybridization on the microarray

45

46 Analisys of genome - Genotyping microarrays DNA microarrays can be used to read the sequence of a genome in particular positions. SNP microarrays are a particular type of DNA microarrays that are used to identify genetic variation in individuals and across populations. Short oligonucleotide arrays can be used to identify the single nucleotide polymorphisms (SNPs) that are thought to be responsible for genetic variation and the source of susceptibility to genetically caused diseases. Amplifications and deletions can also be detected using comparative genomic hybridization in conjuction with microarrays. Resequencing arrays have also been developed to sequence portions of the genome in individuals. These arrays may be used to evaluate germline mutations in individuals, or somatic mutations in cancer. Genome tiling arrays include overlapping oligonucleotides designed to blanket an entire genomic region of interest. Many companies have successfully designed tiling arrays that cover whole human chromosomes.

47 If every human genome is different, what does it mean to sequence "the" human genome? The complete human genome sequence announced in June 2000 is a "representative" genome sequence based on the DNA of just a few individuals. Over the longer term, scientists will study DNA from many different people to identify where and what variations between individual genomes exist. Sequencing a genome is such a Herculean task that capturing its person-toperson variability on the first pass would be next to impossible. Since every person's genome is unique, no one person is any more or less "representative" than any other and it hardly matters whose genome is sequenced first. The vast majority of the genome's sequence is the same from one person to the next, with the same genes in the same places. Chp4_1.shtml

48 DNA polymorphisms Single Base Pair Differences, Single Nucleotide Polymorphisms (SNPs), Microsatellites (short sequence repeats), Minisatellites (long sequence repeats), Deletions, Duplications.

49 High density Afymetrix microarrays Affymetrix uses a unique combination of photolithography and combinatorial chemistry to manufacture GeneChip Arrays.

50 SNP chips To determine which alleles are present, genomic DNA from an individual is isolated, fragmented, tagged with a fluorescent dye, and applied to the chip. The genomic DNA fragments anneal only to these oligos to which they are perfectly complementary. A computer reads the position of the two fluorescent tags and identifies the individual as a C / T heterozygote. The single spots in the other three columns indicate that the individual is homozygous at the three corresponding SNP positions.

51 >1,500 SNPs arrayed Genotyping

52 Comparative genomic hybridization Array comparative genomic hybridization (CGH) is a tool for the assessment of DNA copy number variation (CNV) within any given DNA sample. The technique is derived from the concept of conventional CGH where patient and reference DNA (the latter derived from a normal male/female) is differentially labelled and co-hybridized to a normal metaphase spread.

53 Some diseases are associated with sections of chromosomes that are erroneously replicated or deleted. The technique of array-based genomic hybridisation (array-comparative genomic hybridization CGH or copy number variation CNV) allows highthroughput, rapid identification and mapping of these genomic DNA copy number changes, with higher resolution than is possible using traditional, non-array methods.

54 Expression profiling and CGH

55 Array-based Comparative genome hybridization (CGH) In array-based CGH, large insert clones like BACs, containing human DNA, replace the metaphase spread as target. The clones are spotted in three- four- or fivefold on a microscope slide. Depending on the contents of the array, the collection of clones that is printed on the slide, specific chromosome parts or even the entire genome can be analyzed in one experiment.

56 Resequencing Arrays Tiling arrays - tools for resequencing, based on the concept of interrogating the genome in a systematic, unbiased fashion. Before tiling arrays came along, other methods were used to predict genes that were unknown, including expressed-sequence tag (EST) sequencing, comparison with known protein-coding sequences, and ab initio prediction. Tiling arrays are designed to address this problem by taking an empirical measurement across the entire genomic DNA. This allows a researcher to probe for to probe for transcribed sequences without a predetermined concept of thier location Using the tiling chips it is possible to perform the most comprehensive analysis of the whole genome combined with transcript mapping. The probes are designed (tiled) at an average resolution of 35-bp throughout the whole genome, thus enabling a more accurate view of the transcription and discovery of novel RNA transcripts.

57

58 Analiza transkriptoma

59 Ekspresijsko profiliranje z DNA mikromrežami dvojno označevanje Principle of expression profiling Geni zastopani s cdna ali z oligonukleotidi mikromreže s cdna in dolgimi oligonukleotidi s 3 -konca Test Referenca Laser 1 čitalec Laser 2 Obratno prepisovanje označevanje Emisija Robot-nanašalec (spotter) Prekrivanje ko-hibridizacija Računalniška analiza

60 Exon arrays GeneChip Exon Arrays provide the most comprehensive coverage of the genome and enable two complementary levels of analysis: gene expression and alternative splicing. Multiple probes per exon enable "exon-level" analysis and allow you to distinguish between different isoforms of a gene. T his exon-level analysis on a whole-genome scale opens the door to detecting specific alternative splicing events that may play a central role in disease mechanism and etiology. You can also perform "gene-level" analysis, in which multiple probes on different exons are summarized into a single expression value of all transcripts from the same gene, enabling similar analysis workflows to 3'-gene expression.

61 Exon arrays example I Complex organisation and structure of the ghrelin antisense strand gene GHRLOS, a candidate noncoding RNA gene Seim I et al., BMC Molecular Biology 2008, 9:95

62 Gene arrays Human Gene Arrays (April 2007), interrogate approximately 29,000 well annotated genes by using an average of 26 probes per gene. Probes are designed to cover the entire transcript length, unlike the 3 focused arrays which contain probes targeting the transcript s 3 end. Probes on the array are perfect match (PM) only; mismatch probes found on the 3 focused arrays have been replaced on the Gene Arrays with a set of approximately 20,000 generic background probes.

63 mirna Profiling MicroRNAs (mirnas) are nucleotide long non-coding RNA molecules that have been shown to regulate the stability or translational efficiency of target messenger RNAs. They have been shown to regulate diverse processes such as Fragile X, cancer proliferation and fat metabolism. At present over 530 different mirna sequences have been validated in the human (Sanger Institute mirbase database) although recent research suggests that the number of unique mirnas could exceed 800. It is also thought that mirna genes regulate protein production for 10% or more of all human genes making them an increasingly important area of research

64 mirna isolation The starting material for Agilent mirna Arrays is total RNA (not small RNA) by isolated with i.e. TRIZOL still containing small RNA molecules. As such, they should not have been cleaned up using columns as this will invariably remove the small RNAs which the arrays are trying to measure. Just 100ng of total RNA is required for labelling. Once isolated, samples can be quantified using the Nanodrop Spectrophotometer. Because mirnas are so short, they are less prone to degradation then mrnas. Screen capture of Agilent 2100 Bioanalyzer electropherograms of a TRIZOL Reagent isolated total RNA. Note the pronounced peak of RNA at <200nt, typically not seen in column cleaned-up samples. Source: Microarray Centre

65 Interactome by Chromatine immunoprecipitation on a chip (CHIP-Chip)

66 The "ChIP to Chip" Assay

67 Chip-chip example I Detection of Nkx2-5 binding to the Actc1 promoter and intron 1 (A) and the Nppa promoter (B) using the Integrated Genome Browser

68

69 Povzetek DNA čipi so zbirka mikroskopskih DNA točk, cdna ali oligonukleotidov, pritrjenih na trdo podlago. Uporabljajo se za: Analizo genomske DNA (genotipizacija, SNP analiza, CHG, sekvenciranje). Analizo izražanja genov (ekspresijsko profiliranje), kjer probe lahko predstavljajo 3' konce genov, eksone ali različne dele genov. Posebni čipi pa so za sledenje izražanja mirna. Študije uravnavanja izražanja genov (določanje vezavnih mest transkripcijskih faktorjev (Chip-chip), metilacija kromatina), kjer probe predstavljajo 5 -neprevedene dele in nerepetitivne dele genov.

70 Nova generacija sekvenciranja

71 Thu Nov 5, :24pm EST Want to know your entire DNA sequence? A California company has done it for as little as $1,700. Privately held Complete Genomics says it can do a better quality, usable genome map for about $4, compared with the $100 million the Human Genome Project spent to complete the first sequencing of the human genome in "Whole-genome sequencing costs have dropped from the more than $100 million cost of the first human genomes to the point where individual labs have generated genome sequences in a matter of months for material costs of as low as $48,000," the company's Radoje Drmanac and colleagues reported in the journal Science. "This high-quality, cost-effective approach to genome sequencing will allow researchers to study complete genomes from hundreds of patients with a disease to advance the understanding of the genetic causes of that disease, with an end to preventing and treating common human ailments," said Cliff Reid, chief executive officer of Complete Genomics. Two of the people whose DNA was mapped had taken part in an international sequencing project called the International HapMap project -- a man of European descent and a Yoruban female. The third came from a white man taking part in the Personal Genome Project, an online registry in which people are asked to donate both their DNA and a little money. Genome sequencing is still early stage science. While researchers can get the code, figuring out what it means is a different matter.genomics pioneer Craig Venter had his own genome sequenced -- at a cost of "several million" dollars -- and found the analysis could only show he was likely to have blue eyes, for instance. Venter does have blue eyes. Last month Pauline Ng of the J. Craig Venter Institute in San Diego and Sarah Murray of Scripps Translational Science Institute in La Jolla, California, tested kits provided by California-based firms Navigenics Inc, a private company, and 23andMe, backed by Google Inc. They found they varied in predicting disease risk. Complete Genomics and The Institute for Systems Biology said earlier this week they plan to sequence the genomes of 100 people to try and find insights into Huntington's disease. Scientists also use a technique called genome-wide association to try to find genes that no one suspected were involved in various diseases. (Reporting by Maggie Fox; Editing by Eric Wals)

72 Nova (naslednja) generacija sekvenciranja - Sekvenciranje človeškega genoma je trajalo več let, z uporabo cca 20 kb BAC klonov, ki so vsebovali cca 100 kb dolge tarčne fragmente, in 8-kratnega pokrivanja vsakega dela tarče. Analiza s kapilarno elektroforezo. -Nadaljnji razvoj sekvenciranja je temeljil na sočasnem sekvenciranju celotnega genoma (WHS, angl. whole genome sequencing), ki je bil vstavljen v vektorje. Metoda je hitrejša, pušča pa velike praznine v zelo polimorfnih ali repetitivnih genomih. Analiza je potekala s kapilarno elektroforezo. -Naslednja generacija sekvenciranja (2004) visokozmogljivostno paralelno čitanje odsekov DNA na ravni celega genoma preko PCR pomnoževanja enoverižnih fragmentov genomske knjižnice.

73 Next-generation of DNA and RNA sequencing methods - The bead-amplification sequencing (Roche/454FLX) -Sequencing by synthesis (Illumina/Solexa Genome analyzer) -Sequencing by ligation (Applied Biosystems SOLID System) -Helicos Helioscope (2008) -Pacific Biosciences SMRT (2010) Common features: -A compex interplay of enzymology, chemistry, software, hardware, optics engineering ) -A streamline of sample preparation prior to sequencing (time saving) -Preparation of fragment libraries of the DNA of interest by annealing for platform-specific linkers and amplification -Amplification of single stranded fragment library and performing sequencing on amplified fragments -Single molecule sequencing just arrived or is under development

74 Comparison between capillary sequencers and novel generation sequencers -A huge difference in the throughput of a single run: 96 capillaries of about 750 bp compared to several thousand (Roche) to tens of millions (Illumina, ABi) shorter reads. -Time runs are longer that by next generation sequencers (8h 10days) - Longer run times are required to image the massive parallel sequencing reactions -Due to the streamline preparation and long reading times a single human operator can operate several machines of next generation at the full capacity. ABI Solexa

75 Classical Sanger dideoxy sequencing method Dideoxynucleotide sequencing represents only one method of sequencing DNA. It is commonly called Sanger sequencing since Sanger devised the method. This technique utilizes 2',3'- dideoxynucleotide triphospates (ddntps), molecules that differ from deoxynucleotides by the having a hydrogen atom attached to the 3' carbon rather than an OH group. These molecules terminate DNA chain elongation because they cannot form a phosphodiester bond with the next deoxynucleotide. campus.queens.edu/faculty/jannr/molecular/

76 General concepts for clonal-array generation and sequencing a Bead-chips. Genomic DNA is fragmented and adaptors are ligated to create an insert library that is flanked by two universal priming sites. Because of the random fragmentation, the complexity of this signature sequence library is equivalent to the genome. This library is cloned on beads using emulsion PCR technology. A water-in-oil emulsion is created from a PCR mix that contains a limiting dilution of DNA and beads. The emulsion creates micro-compartments with, on average, a single bead and single DNA template each. After PCR, beads with clones are affinity selected and assembled onto a planar substrate. A subsequent cycle-sequencing reaction is used to read out the sequence on the clones. b Sequencing by synthesis (SBS). A common anchor primer is annealed to a constant sequence (universal priming site) that is contained within the library clones that are located on the polony (clonal bead) array (the orientation of the immobilized target might vary depending on the platform that is used). The sequence is read out by polymerase extension in a base-by-base fashion using either reversible terminators or sequential nucleotide addition (pyrosequencing). After incorporation of a single base or base type, the incorporated base is identified by fluorescence (laser) or chemiluminescence (no laser required). c Sequencing by ligation. The polony array set-up is similar to SBS in which a common primer is annealed to an arrayed polony library and used to read out the sequence through a stepwise ligation of random oligomers. The labelled oligomers are designed to have random bases inserted at every site except the query site. The query site has one of four base substitutions, each matched to a particular fluorescent label on the oligonucleotide. After read-out of each ligation event, the primer and the ligated oligomer are stripped, a new primer reannealed and the process repeated with an oligomer that contains a query base at a different position. Fan et al. Nature Reviews Genetics 7, (August 2006) doi: /nrg1901

77 General concepts for clonal-array generation and sequencing Bead chips Roche pyrosequencing Sequencing by synthesis Solexa / illumina Sequencing by Ligation Applied biosystems Fan et al. Nature Reviews Genetics 7, (August 2006) doi: /nrg1901

78 Roche/454 FLX Pyrosequencer Library fragments are mixed with agarose beads with oligos complementary to adapter sequences on the library. Each bead is associated with a single fragment. Each fragment-bead complex is isolated into individual oil:water micelles with PCR mixture. Thermal cycling of this emulsion PCR of the micelles produces amplified unique sequences on the bead surface. En mass sequencing of PCR products on picotiter plates (PTP) with single beads in each picowell. Enzyme/substrate containing beads for the pyrosequencing reaction are added to wells that act as floww cells for addition of individual pure nucleotide solutions. The CCD camera records the light emitted at each bead.

79 Principles of Pyrosequencing A sequencing primer is hybridized to a single-stranded PCR amplicon that serves as a template. Mixtures incubated with the enzymes, DNA polymerase, ATP sulfurylase, luciferase, and apyrase as well as the substrates, adenosine 5' phosphosulfate (APS), and luciferin. ATP sulfurylase converts PPi to ATP in the presence of adenosine 5' phosphosulfate (APS). ATP drives the luciferase-mediated conversion of luciferin to oxyluciferin that generates visible light in amounts that are proportional to the amount of ATP. Apyrase, a nucleotide-degrading enzyme, continuously degrades unincorporated nucleotides and ATP. When degradation is complete, another nucleotide is added pyrogram

80 Watson and Cricks pyrosequencing readout 2008

81

82 Timeline of the pyrosequencing development October 2005 Release of the Genome Sequencer 20, the first next-generation sequencing system on the market October 2005 Collaboration agreement signed with Roche Diagnostics. January 2007 Release of the Genome Sequencer FLX System March 2007 Roche Diagnostics completes integration with 454 Life Sciences May 2007 Complete sequence of Jim Watson published in Nature. First genome to be sequenced for less than $1 million. November 2007 Announcement of the 100th peer-reviewed publication enabled by 454 Sequencing June Joins the 1000 Genome Project, an international effort to build the most detailed map to date of human genetic variation as a tool for medical research September 2008 Announcement of the 250th peer-reviewed publication enabled by 454 Sequencing October 2008 Release of Genome Sequencer FLX Titanium Series reagents, featuring 1 million reads at 400 base pairs in length

83 sequencing technology Is based on arrays of randomly assembled glass (silica) beads; The beads have oligonucleotides covalently attached to the surface; Each bead has about one million oligos on its surface; All oligos on each bead have the same sequence. Attached DNA fragments are extended and bridge amplified to create an ultra-high density sequencing flow cell with million clusters, each containing ~1,000 copies of the same template. These templates are sequenced using a robust four-color DNA sequencing-by-synthesis technology that employs reversible terminators with removable fluorescent dyes. This novel approach ensures high accuracy and true base-by-base sequencing, eliminating sequence-context specific errors and enabling sequencing through homopolymers and repetitive sequences. The beads are randomly assembled on the arrays, and the location of a particular probe is initially unknown;a process called decoding is used to find the location of each bead;

84 Illumina sequencing by synthesis

85 Illumina sequence - decoding

86

87 Approaches to create highly parallel genotyping assays by novel generation sequencing Fan et al. Nature Reviews Genetics 7, (August 2006) doi: /nrg1901

88 Fan et al. Nature Reviews Genetics 7, (August 2006) doi: /nrg1901

89 Nova generacija sekvenciranja - povzetek Nova generacija visokozmogljivostnega sekvenciranje omogoča razpoznavanje zaporedij DNA na ravni celega genoma, z resolucijo posameznega baznega para. Iz vsakega vzorca se pripravi z adaptorji ligirana knjižnica, ki vsebuje vse v vzorcu prisotne fragmente DNA ali RNA (cdna). Vse platforme bazirajo na ligaciji adaptorjev in pomnoževanju, imajo pa različne pristope sekvenciranja: -Pirosekvenciranje (Roche-Nimblegen) -Sekvenciranje s sintezo (Illumina-Solexa) -Sekvenciranje z ligacijo (ABI) Razvijajo se tudi metode, ki pred sekvenciranjem ne potrebujejo pomnoževanja. Aplikacije so enake kot pri klasičnih mikromrežah (ekspresijsko profiliranje oz. SAGE, genotipizacija SNP in CNV, kroamtinska imunoprecipitacija, metilacija kromatina, itd.). Prednost pred klasičnimi mikromrežami je v preprosti pripravi vzorca in zmožnosti procesiranja velikega števila vzorcev v kratkem času. Procesiranje velikega števila vzorcev na eni ali več aparaturah lahko upravlja en človek.

90 Bioinformatika v raziskavah omov

91 Technical Issues Involved in Obtaining Reliable Data from Microarray Experiments Standardization and beyond

92 Bioinformatics Wikipedia Making sense of the huge amounts of DNA data produced by gene sequencing projects. Bioinformatics and computational biology involve the use of techniques from applied mathematics, informatics, statistics, and computer science to solve biological problems. Research in computational biology often overlaps with systems biology. Major research efforts in the field include sequence alignment, gene finding, genome assembly, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, and the modeling. The terms bioinformatics and computational biology are often used interchangeably, although the former typically focuses on algorithm development and specific computational methods, while the latter focuses more on hypothesis testing and discovery in the biological domain.

93 Microarrays and bioinformatics Standardization The lack of standardization in arrays presents an interoperability problem in bioinformatics, which hinders the exchange of array data. Various projects are attempting to facilitate the exchange and analysis of data produced with non-proprietary chips. The "Minimum Information About a Microarray Experiment" (MIAME) XML based standard for describing a microarray experiment is being adopted by many journals as a requirement for the submission of papers incorporating microarray results.

94

95 Statistical analysis The analysis of DNA microarrays poses a large number of statistical problems, including the normalisation of the data. From a hypothesis-testing perspective, the large number of genes present on a single array means that the experimenter must take into account a multiple testing problem: even if each gene is extremely unlikely to randomly yield a result of interest, the combination of all the genes is likely to show at least one or a few occurrences of this result which are false positives.

96 Steps in microarray experiments irfgc.irri.org/cropbioportal/index.php?option...

97 Data mining includes gene ontology Gene ontology is a controlled vocabulary used to describe the biology of a gene product in any organism. There are 3 independent sets of vocabularies, or ontologies, that describe the: -molecular function of a gene product, - the biological process in which the gene product participates, -and the cellular component where the gene product can be found.

98 Experimental design is crucial for a succesfull omic experiment A proper experimental design is crucial for obtaining useful conclusions from a project. The choice of design ideally includes an assessment of the biological variation, the technical variation, the cost and duration of the experiment, and the availability of biological material. The experimental design can also depend on the methods that will be used to analyse the data afterwards. In certain cases, the parameters needed to find the optimal design must be obtained by a pilot experiment. A related problem is the comparison of different competing experimental methods or devices. Here, a proper test design is crucial as well to be able to make a firm conclusion in favor of one or the other method. Dere E. et al., BMC Genomics 2006, 7:80doi lunabiosciences.com/experimentaldesign.html

99 Experimenal design an example Changes in cholesterol homeostasis and drug metabolism caused by different factors in mouse liver T. Rezen et al., BMC Genomics 2008, 9:76

100 Primary data analysis experimental example Images of the Sterolgene v0 microarrays were analyzed by Array-Pro Analyzer 4.5 (Media Cybernetics, Bethesda, MD, USA). The median feature and local background intensities were extracted together with the estimates of their standard deviation. Only features with foreground to background ratio higher than 1.5 and coefficient of variation (CV, ratio between standard deviation of the background and the median feature intensity) lower than 0.5 in both channels were used for further analysis. Log2 ratios were normalized using LOWESS fit to spike in control RNAs according to their average intensity. Two types of spike in controls were used: custom-made (Firefly luciferase) and commercial ArrayControl Spikes (Ambion, Austin, TX, USA). In phenobarbital and cholesterol-feeding experiments data were additionally standardized (median-centered and scaled by median absolute deviation) in order to reduce inter-array variability. All data analysis were done in Orange software [37] Images of Agilent microarrays were analyzed using Array-Pro Analyzer 4.5 (Media Cybernetics, Bethesda, MD, USA). Features, with CV>0.39, were filtered out and data were normalized using LOWESS fit to all genes according to their average intensity. Filtration and normalization was done in BASE softwar]. Affymetrix data were normalized by the Robust Multichip Average (RMA) algorithm. After transformation to non-logarithmic data, the expression estimates were scaled to the average expression levels in the control group analyzed using GeneSpring software (Silicon Genetics, Redwood, USA). Classification of the differentially expressed genes was done in Orange using single-factor ANOVA or two tailed Student s t-test. For Sterolgene microarrays probability of type I error αs=0.05 or αs=0.1 was used. For Agilent microarrays complementary probability of type I error was calculated according to Bonferroni correction for multiple testing (αa=αs*ns/na), but final comparisons were made using more relaxed criteria αa=0.001 and αa=0.01. For Affymetrix microarrays complementary probability αaf= was calculated, but also a more relaxed criterion αaf=0.001 was used. Additional data analyses were done only using common genes between platforms, which were matched using unigene, refseq or gene symbol. On this gene list another classification of differentially expressed genes was done in Orange using ANOVA for Affymetrix and Agilent platforms. A probability for type I error was selected as in a complementary Sterolgene experiment (α=0.1 in Affymetrix analyses and α=0.05 in Agilent analyses). Pearson s product moment correlation coefficient and a scatterplot between log2 ratios from common genes were calculated in SPSS All data have been submitted to GEO (Gene expression omnibus) under accession codes: GSE6271 (Affymetrix data), GSE6317 (Agilent data), GSE6447 (Sterolgene phenobarbital and high cholesterol diet data), and GSE6423 (Sterolgene fasting and inflammation data). T. Rezen et al., BMC Genomics 2008, 9:76

101 Pomen bioinformatike v raziskavh omov - povzetek - Matematično-informatični pristopi v bioloških poskusih pogenomske dobe (bioinformatika) omogočajo razbiranje biološkega smisla v enormni količini podatkov. -Bioinformatika (računska biologija, matematična biologija) za reševanje bioloških problemov uporablja metode uporabne matematike. Informatike, statistike in računalniških znanosti. -Načrtovanje bioloških poskusov zahteva sodelovanje eksperimentatorjev in informatikov že od vsega začetka. -Zasnova poskusa zahteva definicijo biolškega vprašanja, izbiro platforme za analizo, določitev števila bioloških in tehničnih replik in načrt serije poskusov (hibridizacij ali sekvenciranj). -Po tehnični izvedbi poskusa sledita statistična in informatična obdelava ter rudarjenje podatkov. - Statistično-informatična obdelava obsega ekstrakcijo intenzitet signala, normalizacijo podatkov, ter različne statistične teste, da pridobimo listo diferencialno izraženih genov ali zaporedij DNA. - Sekundarna informatična analiza in rudarjenje podatkov obsegata gručanje in razpoznavanje vzorcev, študije seznama genov z genskimi ontologijami, iskanje regulatornih vzorcev, kot tudi načrtovanje validacijskih eksperimentov. -Matematično-informatične metode za povezovanje podatkov različnih omov (npr. transkriptom, proteom, metabolom) so še v razvoju.

102 Kemogenomika, toksikogenomika in farmakogenomika primeri uporabe omskih tehnologij v praksi

103 What s the difference? Chemogenomics screening a compound library to detect drug candidates Toxicogenomics investigates the effect of a chemical substance (drug) on the genome Pharmacogenomics investigates inherited basis for difference in drug response among individuals

104

105 A blockbuster drug is a drug generating more than $1 billion of revenue for its owner each year. The search for blockbusters has been the foundation of the R&D strategy adopted by big pharmaceutical companies, but this looks set to change. New advances in genomics, and the promise of personalized medicine, are likely to fragment the pharmaceutical market.

106 Chemogenomics: structuring the drug discovery process to gene families C.J. Harris and A. P. Stevens,Drug Discov Today., 11, , In the post-genomic era, if all proteins in a gene family can putatively be identified, how can drug discovery effectively tackle so many novel targets that might lack structural and small-molecule inhibitory data? In response, chemogenomics, a new approach that guides drug discovery based on gene families, has been developed. By integrating all information available within a protein family (sequence, SAR data, protein structure), chemogenomics can efficiently enable cross-sar exploitation, directed compound selection and early identification of optimum selectivity panel members.

107 Aims of chemogenomics - the druggable genome Chemogenomics aims to identify systematically all ligands and modulators for all the gene products expressed and allows the accelerated exploration of their biological function. The subject brings together diverse disciplines including chemistry, genetics, chemo- and bioinformatics, structural biology, and biological screening in phenotypic and target-based assays.

108 Expert Review of Proteomics June 2007, Vol. 4, No. 3, Pages Chemogenomics approaches to novel target discovery L Alex Gaither Chemogenomics involves the combination of a compound s effect on biological targets together with modern genomics technologies. The merger of these two methodologies is creating a new way to screen for compound target interactions, as well as map chemical and biological space in a parallel fashion. The challenge associated with mining complex databases has initiated the development of many novel in silico tools to profile and analyze data in a systematic way. The ability to analyze the combinatorial effects of chemical libraries on biological systems will aid the discovery of new therapeutic entities. Chemogenomics provides a tool for the rapid validation of novel targeted therapeutics, where a specific molecular target is modulated by a small molecule. Along with targeted therapies comes the ability to discovery pathway nodes where a single molecular target might be an essential component of more than one disease. Several disease areas will benefit directly from the chemogenomics approach, the most advanced being cancer. A genetic loss-of-function screen can be modulated in the presence of a compound to search for genes or pathways involved in the compound s activity. Several recent papers highlight how chemogenomics is changing with RNA interference-based screening and shaping the discovery of new targeted therapies. Together, chemical and RNA interference-based screens open the door for a new way to discovery disease-associated genes and novel targeted therapies.

109 Steps of chemogenomics 1. Formation/existance of a compound library 2. Chosen representative biological system 3. A reliable readout (i.e. gene expression, protein expression, HTP binding or functional assays, etc.) Rognan D. Chemogenomic approaches to rational drug design.br J Pharmacol. 2007, 152: Epub 2007 May 29. Review.

110 Analysing chemogenomic data is a never ending learning process aimed at completing a 2D-matrix, where columns represent targets/genes and rows chemical compounds. Reported values are binding constants or functional effects. Assumptions of chemogenomic-based approach: 1. compounds sharing some chemical similarity should also share targets, 2. targets sharing similar ligands should share similar patterns (binding sites). Defining the ligand space by descriptors (molecular weight, atom and bound counts, topological fingerprints, shape, field, spectra, etc.) Defining the target space by descriptors (databases for sequence, patterns secondary structure fold, atomic coordinates, binding site, etc.)

111 Ligand space: molecular descriptors Rognan D. J Pharmacol. 2007, 152: Epub 2007 May 29. Review.

112 Representation of ligand and target space correlations - example

113 Correlation matrices - networks

114 In silico target fishing approaches Machine learning algorithm using Bayesian statistics to predict target profiles from connectivity fingerprints of compounds from the biologically annotated database.

115 The key to improving drug discovery is to manage the whole value chain of research. Information needs to flow from fundamental analysis of genomes, proteins and chemistry through to managing clinical trials and drug submissions. Integrating so many fields of life sciences presents a massive task of integrating knowledge and information. Needed are bridges which help researchers in one area to communicate with others.

116

117 The TIP Database - an example The Target Informatics Platform's content includes more than 80,000 high resolution protein structures with reliably annotated small molecule binding sites, including >44,000 comparative models of human proteins, covering every major drug target family including proteases, kinases, phosphatases, phosphodiesterases, nuclear receptors, and GPCRs. In addition to TIP's high quality structural annotation, it is the only database of its kind that stores all similarity relationships between every sequence, structure, and binding site, making it an incredibly powerful system for structural and comparative proteomics.

118 Challenges in Chemogenomics Conceptual challenges Multidisciplinary project teams Data from chemistry, biology, genomics, Data challenges Disparate data sources, formats and identifiers Multiple data types gene expression, chemical structures, clinical chemistry, histopathology, receptor binding assays, Incomplete and missing data Analysis challenges Should the data be normalized? Does it make sense to cluster the data? How to relate chemical structures and gene expression data? Identify correlations between chemical properties and biological function Brian Prather, Ph.D., Application Specialist, Spotfire, Third Virtual Conference on Genomics and Bioinformatics

119 Conceptual challenges Multidisciplinary Drug Compound Candidate Project Team Chemistry Genomics Shared Analysis CAS ABase Affymetrix Toxicology CodeLink ISIS Biology Information Technology

120 Toxicogenomics is a form of analysis by which the activity of a particular toxin or chemical substance on living tissue can be identified based upon a profiling of its known effects on genetic material. Toxicogenomics may also be of use as a preventative measure to predict adverse "side", i.e. toxic, effects, of pharmaceutical drugs on susceptible individuals. This involves using genomic techniques such as gene expression level profiling and single-nucleotide polymorphism analysis of the genetic variation of individuals. Studies of those types are then correlated to adverse toxicological effects in clinical trials so that suitable diagnostic markers (measurable signs) for these adverse effects can be developed. There are many well-publicized cases in which popular drugs were pulled from the market because of toxic effects experienced by a small percentage of patients, with a cost of many billions of dollars to the companies responsible.

121 Some drugs recently withdrawn form the market Rofecoxib developed by Merck & Co. to treat osteoarthritis, acute pain conditions, and dysmenorrhoea. Rofecoxib was approved as safe and effective by the FDA in 1999 and was marketed under the brand name Vioxx, Ceoxx and Ceeoxx. Worldwide, over 80 million people were prescribed rofecoxib at some time. In 2004, Merck voluntarily withdrew rofecoxib from the market because of concerns about increased risk of heart attack and stroke associated with long-term, high-dosage use. Rofecoxib was one of the most widely used drugs ever to be withdrawn from the market. Fen-phen was an anti-obesity medication which consisted of two drugs fenfluramine and phentermine. After reports of valvular heart disease and pulmonary hypertension, primarily in women who had been undergoing treatment with Fen-phen, the FDA requested its withdrawal from the market in September As of 2004 Fen-phen is no longer widely available. In More than 50,000 product liability lawsuits had been filed by alleged Fen-phen victims. Estimates of total liability run as high as $14 billion. Cerivastatin (Baycol, Lipobay ) is a synthetic member of the class of statins, used to lower cholesterol and prevent cardiovascular disease. It was withdrawn from the market in 2001 because of the high rate of serious side-effects.cerivastatin was marketed by the Bayer A.G. During post-marketing surveillance, 52 deaths were reported in patients using cerivastatin, mainly from rhabdomyolysis and its resultant renal failure. Risks were higher in patients using fibrates and in patients using the high (0.8 mg/day) dose of cerivastatin. Another 385 nonfatal cases of rhabdomyolysis were reported. This put the risk of this (rare) complication at 5-10 times that of the other statins.

122 Toxicogenomic approaches

123

124 Pharmacogenomics is the branch of pharmacology which deals with the influence of genetic variation on drug response in patients by correlating gene expression or single-nucleotide polymorphisms with a drug's efficacy or toxicity. Pharmacogenomics aims to develop rational means to optimise drug therapy, with respect to the patients' genotype, to ensure maximum efficacy with minimal adverse effects. Such approaches promise the advent of "personalized medicine", in which drugs and drug combinations are optimised for each individual's unique genetic makeup. Pharmacogenomics is the whole genome application of pharmacogenetics, which examines the single gene interactions with drugs.

125 Adverse drug reactions An adverse drug reaction (abbreviated ADR) is an expression that describes the unwanted, negative consequences associated with the use of given medications. An ADR is a particular type of adverse effect. The meaning of this expression differs from the meaning of "side effect", as this last expression might also imply that the effects can be beneficial. The study of ADRs is the concern of the field known as pharmacovigilance. In 1994, an estimated 2,216,000 hospital patients in the US, suffered a serious ADR, leading to approximately 106,000 patient deaths, making ADR fatalities the fourth to sixth leading cause of death in the US in The current regime of one dose fits all is not ideal for patients and is not costeffective for the health service.

126 Why is every human genome different? Where are genome variations found? What kinds of genome variations are there?

127 Polymorphisms Single Base Pair Differences, Single Nucleotide Polymorphisms (SNPs), Microsatellites (short sequence repeats), Minisatellites (long sequence repeats), Deletions, Duplications.

128 Pharmacogenomic principle Pharmacogenomics principle: To identify patients at risk for toxicity or reduced response to therapy for optimal medication and/or dose selection

129 Pharmacogenomic approaches

130 SNP chips To determine which alleles are present, genomic DNA from an individual is isolated, fragmented, tagged with a fluorescent dye, and applied to the chip. The genomic DNA fragments anneal only to those oligos to which they are perfectly complementary. A computer reads the position of the two fluorescent tags and identifies the individual as a C / T heterozygote. The single spots in the other three columns indicate that the individual is homozygous at the three corresponding SNP positions.