NGS, Cancer and Bioinformatics. 5/3/2015 Yannick Boursin

2 NGS and Clinical Oncology
- NGS in hereditary cancer genome testing: BRCA1/2 (breast/ovarian cancer), XPC (melanoma), ERCC1 (colorectal cancer)
- NGS for personalized cancer treatment: clinical trials MOSCATO (GR), SAFIR (GR), SHIVA (Curie); Ipilimumab (anti-CTLA4), Nivolumab (anti-PD1), Trastuzumab (anti-HER2), Cetuximab (anti-EGFR)
- Detection of chimeric transcripts: Chronic Myeloid Leukemia, Philadelphia chromosome (BCR/ABL); Non-Small-Cell Lung Cancer, EML4-ALK

3 NGS and Oncology
NGS is now widely used as:
- A research tool to screen a large number of cancer samples
- A clinical/diagnostic tool in daily practice
These projects require dedicated bioinformatics integration to access and analyse this huge amount of data.

4 Why do we need computers for NGS?
Figure: sequencing data size evolution.
Needs to address:
- Store petabytes of data (1 PB is 1000 TB)
- Share data around the world through networks
- Analyze huge amounts of data with complex algorithms

5 Bioinformatics and Oncology Problem: finding, extracting, and presenting relevant information. Partial solution: designing workflows to ease data analysis.

6 Interdisciplinary collaboration Bioinformatics acts as a hub between the different fields. Trust between partners is needed, as is training, for efficient mutual understanding.

7 Standard Workflow for NGS Analysis A typical NGS workflow 5/3/2015 Yannick Boursin 7

8 Step 1: Quality Check and improvements 5/3/2015 Yannick Boursin 8

9 Standard Workflow for NGS Analysis A typical NGS workflow 5/3/2015 Yannick Boursin 9

10 NGS Data: what do they look like? A raw data file (.fastq, .sff, .fa, .csfasta/.qual) contains millions of short reads, either all of the same size (SOLiD, HiSeq) or of different sizes (Ion PGM/Proton). Figure: enhanced view of the reads in a FASTQ file.

11 FASTQ format 1 sequence = 1 read = 4 lines in the file First line = sequence identifier 5/3/2015 Yannick Boursin 11

12 FASTQ format Fourth line = quality, ASCII-encoded (reduces the file size)

13 Sequence quality encoding 5/3/2015 Yannick Boursin 13
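
The four-line layout and the ASCII quality encoding described above can be sketched as follows. This is a minimal illustration, not a production parser: it assumes the Phred+33 offset (Sanger / Illumina 1.8+; older Illumina files use +64), and the record content and function names are made up for the example.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading "@"
    sequence: str    # line 2
    quality: str     # line 4, one ASCII character per base


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, "").rstrip("\n")
        separator = next(it, "").rstrip("\n")  # line 3 starts with "+"
        quality = next(it, "").rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string (assumed Phred+33) into integer scores."""
    return [ord(char) - offset for char in quality]


def error_probability(phred: int) -> float:
    """Phred definition: Q = -10 * log10(P), hence P = 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10)


example = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
record = next(read_fastq(example))
print(record.identifier, phred_scores(record.quality))  # read1 [40, 40, 20, 0]
```

A Phred score of 20 corresponds to a 1% chance that the base call is wrong, and a score of 40 to 0.01%.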

14 Why look at sequencing quality?
Quality of data is very important for various downstream analyses:
- Sequence assembly or mapping
- Variant detection
- Gene expression studies...
If the quality of the data is poor:
- Try to find a reason
- Can we correct/improve the quality?
- It may lead to erroneous conclusions

15 Quality controls on raw reads: which metrics to check?
Mainly: quality score per base and over the reads.
But also:
- Read length distribution
- Sequence content per base and % of GC
- K-mer content
- Overrepresented sequences
- Duplicated reads

16 Quality scores
- Per base (box-and-whisker plot) -> to see whether base calls fall into low quality (commonly towards the end of a read)
- Per sequence (mean quality distribution) -> to see if a subset of your sequences has universally low quality values
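
The two views above can be sketched in a few lines. This is a simplified illustration: it reports only the mean per position (a real QC report would also show the median and quartiles of the box plot), it assumes Phred+33 quality strings, and the function names are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def per_base_mean_quality(quality_strings: Sequence[str], offset: int = 33) -> List[float]:
    """Mean Phred score at each read position (reads may differ in length)."""
    totals: List[int] = []
    counts: List[int] = []
    for quality in quality_strings:
        for position, char in enumerate(quality):
            if position == len(totals):  # first read reaching this position
                totals.append(0)
                counts.append(0)
            totals[position] += ord(char) - offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


def per_read_mean_quality(quality_strings: Sequence[str], offset: int = 33) -> List[float]:
    """Mean Phred score of each non-empty read."""
    return [mean(ord(char) - offset for char in quality)
            for quality in quality_strings if quality]


qualities = ["II5", "I5"]
print(per_base_mean_quality(qualities))   # [40.0, 30.0, 20.0]
```

A drop of the per-position mean towards the end of the reads is the pattern the per-base plot is meant to reveal; a cluster of reads with a low per-read mean is what the per-sequence plot reveals.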

17 Quality scores 5/3/2015 Yannick Boursin 17

18 Quality scores 5/3/2015 Yannick Boursin 18

19 Standard Workflow for NGS Analysis A typical NGS workflow 5/3/2015 Yannick Boursin 19

20 Reads cleaning: removing bad quality bases After QC, we need to remove bad-quality bases. This is often done by scanning reads with a sliding-window algorithm. Figure: read-end trimming by a quality trimming algorithm (in red: bad quality bases; in blue: good quality bases).
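
A minimal sketch of one possible sliding-window trimmer follows. The window size, the threshold and the rule "cut at the start of the first window whose mean quality is too low" are assumptions chosen for the illustration; actual trimming tools differ in the details.

```python
from typing import List, Tuple


def sliding_window_trim(sequence: str, qualities: List[int],
                        window: int = 4, min_mean_quality: float = 20.0) -> Tuple[str, List[int]]:
    """Scan from the 5' end and cut the read where a window's mean quality drops."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    if not sequence:
        return sequence, qualities
    window = min(window, len(sequence))  # short reads: one window covers the read
    for start in range(len(sequence) - window + 1):
        scores = qualities[start:start + window]
        if sum(scores) / window < min_mean_quality:
            # Keep everything before the first low-quality window.
            return sequence[:start], qualities[:start]
    return sequence, qualities


read = "ACGTACGTAC"
quals = [38, 38, 37, 36, 35, 30, 12, 10, 8, 5]
print(sliding_window_trim(read, quals))  # ('ACGTA', [38, 38, 37, 36, 35])
```

Averaging over a window makes the trimmer tolerant of a single poor base in an otherwise good stretch, which a per-base cutoff would not be.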

21 Reads cleaning: adapters removal
An adapter is a small piece of known DNA located at the end of the reads.
Adapter roles:
- Attach the read to the sequencer flowcell
- Allow a specific PCR enrichment of reads carrying the adapter
- Used in multiplex sequencing (mixed samples)
Available tools to trim adapters: Cutadapt, Trimmomatic, RmAdapter
Figure: in blue, adapters; in orange, the informative part of the read.
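
The idea of 3' adapter removal can be sketched as below. This is a deliberately simplified illustration that only handles exact matches; the tools listed above also tolerate sequencing errors in the adapter. The adapter sequence, the minimum overlap and the function name are assumptions made for the example.

```python
def trim_3prime_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a 3' adapter (exact match only) and everything after it."""
    # Case 1: the full adapter occurs inside the read.
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    # Case 2: the read ends in the middle of the adapter, so only a prefix
    # of the adapter is present. Try the longest possible overlap first.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    return read


ADAPTER = "AGATCGGAAGAGC"  # illustrative adapter sequence
print(trim_3prime_adapter("ACGTACGTAGATCGGAAGAGCTTT", ADAPTER))  # ACGTACGT
print(trim_3prime_adapter("ACGTACGTAGATC", ADAPTER))             # ACGTACGT
```

The minimum overlap is a trade-off: a small value removes more adapter remnants but also trims genuine bases that happen to match the start of the adapter.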

22 Standard Workflow for NGS Analysis A typical NGS workflow 5/3/2015 Yannick Boursin 22

23 Step 2: Short Reads Alignment 5/3/2015 Yannick Boursin 23

24 Standard Workflow for NGS Analysis A typical NGS workflow 5/3/2015 Yannick Boursin 24

25 Reads alignment - Vocabulary
- Reference genome: a known sequence, supposed to be as close as possible to the input genome, which is used as an anchor to organize the single-read information.
- Alignment (mapping): transforms the single-read information into an organized and reduced set of information by giving each read a genomic position.
- Mismatch: disagreement between two aligned nucleotides
- Gap: break within the read alignment (i.e. small insertion/deletion)
- Indel: insertion or deletion relative to the reference genome
- Mappability: uniqueness of a region (repeated region = low mappability, unique region = good mappability)

26 Reads alignment - Two strategies
The reads alignment aims at transforming the single-read information into an organized and reduced set of information. Two strategies can be applied:
- De novo reads assembly: used when no reference genome is available. It aims at reconstructing long scaffolds from single-read information.
- Alignment on a reference genome: the reads are directly compared to a known reference genome.

27 Alignment on a reference genome The reference genome is a known sequence, supposed to be as close as possible to the input genome, and which is used as an anchor to organize the single reads information. Alignment of reads against reference genome 5/3/2015 Yannick Boursin 27

29 Alignment on a reference genome - Challenges
New alignment algorithms must address the requirements and characteristics of NGS reads:
- Millions of reads per run (30x genome coverage)
- Reads of different sizes (35 bp - 200 bp)
- Different types of reads (single-end, paired-end, mate-pair, etc.)
- Base-calling quality factors
- Sequencing errors (~1%)
- Repetitive regions
- Sequenced organism vs. reference genome
- Must adjust to evolving sequencing technologies and data formats

30 Alignment on a reference genome Bioinformatics tools 5/3/2015 Yannick Boursin 30

31 Finding the best alignment - Rationale
Given a reference and a set of reads, report at least one good local alignment for each read if one exists.
What is good? For now, we concentrate on:
- Fewer mismatches are better
- Failing to align a low-quality base is better than failing to align a high-quality base
This is based on a scoring system, e.g. score for a match (1), mismatch penalty (3), gap open penalty (5), gap extension penalty (2). The best alignment is the one with the highest score.
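
The scoring system above can be made concrete with a small function that scores an already-aligned read. It uses the values from the slide (match 1, mismatch 3, gap open 5, gap extension 2) and assumes that a gap of length k costs the open penalty plus k times the extension penalty; aligners differ on that convention. The function name and the "-" gap notation are choices made for the illustration.

```python
def alignment_score(read_aln: str, ref_aln: str, match: int = 1, mismatch: int = 3,
                    gap_open: int = 5, gap_extend: int = 2) -> int:
    """Score two aligned strings of equal length, where '-' marks a gap."""
    if len(read_aln) != len(ref_aln):
        raise ValueError("aligned strings must have the same length")
    score = 0
    gap_row = None  # which row holds the gap currently being extended
    for read_base, ref_base in zip(read_aln, ref_aln):
        if read_base == "-" and ref_base == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if read_base == "-" or ref_base == "-":
            row = "read" if read_base == "-" else "ref"
            if gap_row != row:       # a new gap starts here
                score -= gap_open
                gap_row = row
            score -= gap_extend      # paid for every gapped position
        else:
            gap_row = None
            score += match if read_base == ref_base else -mismatch
    return score


print(alignment_score("ACGTACGT", "ACGTACGT"))  # 8: eight matches
print(alignment_score("ACGTTCGT", "ACGTACGT"))  # 4: seven matches, one mismatch
print(alignment_score("ACG--CGT", "ACGTACGT"))  # -3: six matches, one gap of length 2
```

With these penalties, a single mismatch costs less than opening a gap, so the aligner prefers to explain a small difference as a substitution rather than as an indel.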

32 Alignment key parameters - Repeats Approximately 50% of the human genome consists of repeats. Source: Treangen T.J. and Salzberg S.L., Nature Reviews Genetics 13.

33 Alignment key parameters - Repeats Close proximity with genes : intergenic and intragenic positions BRCA2: a mosaic of repeated regions 5/3/2015 Yannick Boursin 33

34 Alignment key parameters - Repeats: three strategies
-1- Report only unique alignments
-2- Report best alignments and randomly assign reads across equally good loci
-3- Report all (best) alignments
Source: Treangen T.J. and Salzberg S.L., Nature Reviews Genetics 13.
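
The three reporting strategies can be sketched as a single function applied to the candidate alignments of one read. This is a simplified illustration: "unique" is interpreted here as "exactly one alignment reaches the best score", and the candidate format (locus, score) and the strategy names are assumptions made for the example.

```python
import random
from typing import List, Optional, Tuple

Hit = Tuple[str, int]  # (locus, alignment score)


def report_alignments(candidates: List[Hit], strategy: str,
                      rng: Optional[random.Random] = None) -> List[Hit]:
    """Apply one of the three multi-read strategies to a read's candidates."""
    if not candidates:
        return []
    best_score = max(score for _, score in candidates)
    best_hits = [hit for hit in candidates if hit[1] == best_score]
    if strategy == "unique":   # -1- drop reads with several equally good loci
        return best_hits if len(best_hits) == 1 else []
    if strategy == "random":   # -2- pick one of the equally good loci
        rng = rng or random.Random()
        return [rng.choice(best_hits)]
    if strategy == "all":      # -3- keep every best alignment
        return best_hits
    raise ValueError(f"unknown strategy: {strategy!r}")


hits = [("chr1:100", 50), ("chr5:900", 50), ("chr7:42", 31)]
print(report_alignments(hits, "unique"))  # []
print(report_alignments(hits, "all"))     # [('chr1:100', 50), ('chr5:900', 50)]
```

Strategy 1 loses coverage in repeats, strategy 2 keeps the coverage but places individual reads arbitrarily, and strategy 3 leaves the decision to the downstream analysis.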

35 Alignment on a reference genome - Key points
- The alignment is a crucial step of the NGS analysis.
- The reference genome has to be carefully chosen.
- The mappability of the region of interest has to be taken into account (primer design).
- The scoring method has to be chosen according to the sequencing error rate and the quality of the raw reads.
- The alignment parameters have to be set properly.

36 Limitations of Alignment Tools
Even though we now have some nice tools to align reads on a reference genome, several issues remain important:
- Homopolymer mapping
- Efficient alignment of small indels
- Alignment on several genomes
- Alignment on repeated sequences

37 Alignment formats Many formats exist: SAM, BAM, ELAND (Illumina specific), MAQ map. SAM and BAM are now the standard for aligned data.

38 SAM format SAM stands for Sequence Alignment/Map. It is a tab-delimited text file with one line per read; each line is composed of at least 11 fields.
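
As an illustration of the 11 mandatory fields, the sketch below splits one alignment line into named fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL, as named in the SAM specification). The example line and the Python names are made up for the illustration; header lines starting with "@" are not handled.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description, e.g. "8M"
    rnext: str   # reference name of the mate ("=" means same as rname)
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # ASCII-encoded base qualities
    tags: List[str]  # optional fields beyond the 11 mandatory ones


def parse_sam_line(line: str) -> SamRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        rnext=fields[6], pnext=int(fields[7]), tlen=int(fields[8]),
        seq=fields[9], qual=fields[10], tags=fields[11:],
    )


line = "read1\t99\tchr1\t1000\t60\t8M\t=\t1200\t208\tACGTACGT\tIIIIIIII\tNM:i:0"
record = parse_sam_line(line)
print(record.rname, record.pos, record.mapq)  # chr1 1000 60
```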

39 SAM format 5/3/2015 Yannick Boursin 39

40 SAM format The second field (the FLAG) can be used to quickly filter the file, with Samtools (command line) and its -f and -F options. Useful webpage:
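
The logic behind flag-based filtering can be sketched as follows: with samtools view, -f keeps alignments that have all the given bits set and -F drops alignments that have any of the given bits set. The bit values come from the SAM specification; the Python function and dictionary names are hypothetical.

```python
from typing import List

FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> List[str]:
    """List the names of the bits set in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def passes_flag_filter(flag: int, require: int = 0, exclude: int = 0) -> bool:
    """Mimic -f (require) and -F (exclude) for a single flag value."""
    return (flag & require) == require and (flag & exclude) == 0


print(decode_flag(99))  # ['paired', 'proper_pair', 'mate_reverse', 'first_in_pair']
print(passes_flag_filter(99, require=0x2))  # True: properly paired
print(passes_flag_filter(77, exclude=0x4))  # False: the read is unmapped
```

For example, excluding 0x4 keeps only mapped reads, and requiring 0x2 keeps only properly paired reads.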

41 BAM format
BAM stands for Binary Alignment/Map.
- Corresponds to the SAM format compressed as BGZF
- Reduces the size of the alignment file by a factor of 5
- Not directly readable like the SAM format: requires Samtools
- Best format for sharing alignment files
- Paired with an index file (BAI), which avoids a sequential read of the complete file

42 Standard Workflow for NGS Analysis A typical NGS workflow 5/3/2015 Yannick Boursin 42

43 QC 3: Which metrics to check?
In practice, how do I validate my alignment?
- Be aware of the mapping strategy used
- Look at simple descriptive statistics: number of aligned reads, coverage/depth, mapping quality, number of normal/abnormal pairs for paired-end data...
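
Two of these descriptive statistics, depth and coverage, can be sketched from a list of aligned intervals. This is a toy illustration: positions are assumed to be 0-based and half-open, the per-base loop would be too slow for a whole genome, and the function names are hypothetical.

```python
from typing import List, Tuple


def depth_profile(alignments: List[Tuple[int, int]],
                  region_start: int, region_end: int) -> List[int]:
    """Number of reads covering each base of [region_start, region_end)."""
    depth = [0] * (region_end - region_start)
    for start, end in alignments:
        # Only the part of the read that falls inside the region counts.
        for position in range(max(start, region_start), min(end, region_end)):
            depth[position - region_start] += 1
    return depth


def coverage_summary(depth: List[int], min_depth: int = 1) -> Tuple[float, float]:
    """Return (mean depth, fraction of bases covered at least min_depth times)."""
    mean_depth = sum(depth) / len(depth)
    breadth = sum(1 for d in depth if d >= min_depth) / len(depth)
    return mean_depth, breadth


reads = [(0, 5), (3, 8), (20, 25)]          # aligned intervals on one chromosome
profile = depth_profile(reads, 0, 10)
print(profile)                    # [1, 1, 1, 2, 2, 1, 1, 1, 0, 0]
print(coverage_summary(profile))  # (1.0, 0.8)
```

Mean depth and breadth answer different questions: a region can have a high mean depth and still contain bases that no read covers.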

44 NGS Analysis: How can I work with my NGS data?
Difficult on a personal computer (lack of resources):
- 1 alignment = 4 processors + 15 GB of RAM (to be multiplied by the number of samples)
- Impossible to open the files in software such as a text editor
- Need for a very large storage capacity
- Data backup administration
Application server connected to a computing cluster and a storage array:
- Commercial solutions (CLC Bio, NextGene, ...)
- Galaxy server:

45 Standard Workflow for NGS Analysis A typical NGS workflow 5/3/2015 Yannick Boursin 45

46 After sequencing : Data analysis Main challenges : The rapid evolution of the high-throughput technologies The rapid evolution of the bioinformatics solutions The rapid evolution of the biological/medical knowledge 5/3/2015 Yannick Boursin 46

47 Data analysis
- Chimeric transcript search
- Alternative transcripts study
- Differential expression study
- Methylation study
- Detection of genomic variants
- Detection of copy-number variation

48 Chimeric transcripts Do the tumor cells express any chimeric transcripts? Figure: history of the BCR-ABL fusion.

49 Alternative transcripts 5/3/2015 Yannick Boursin 49

50 Differential expression Are there genes that would be strongly expressed in one kind of tumor that are not in the other kind? Can we group tumors according to their expression profiles? Clustering differential expression in breast tumours. 5/3/2015 Yannick Boursin 50

51 Methylome Is there any difference between DNA methylation in tumors and in normal cells? How does methylation promote cancer?

52 Detection of copy-number variations Are there any copy-number alterations (gain or loss of chromosomal regions, amplifications) that could explain tumorigenesis? Figure: copy-number variations in cancer; MYC and KRAS are amplified.

53 Detection of genomic variants Are there mutational events that are specific to the tumoral genome? Could the tumorigenesis be explained by those? Is there any drug targeting those mutations? Pancreas adenocarcinoma: from normal cells to tumoral cells 5/3/2015 Yannick Boursin 53

54 Limitations: Detection of genomic variants Between 1.4 and 8.9 % of the variants are technology specific 5/3/2015 Yannick Boursin 54

55 Limitations: Detection of genomic variants Common genomic variants between different variant callers 5/3/2015 Yannick Boursin 55

56 Conclusion
- Nowadays, NGS is widely used in cancer centers to categorize cancers and link patients with personalized treatments (Precision Medicine).
- NGS is also used in cancer research, to discover new oncogenetic mechanisms, to understand the way a treatment works, and to link biological and genetic characteristics.
- Using NGS might not answer your questions. It is important to know that the technique is limited:
  A) by the question you asked at first. If a cancer cannot be explained by mutational events, it might be explained by other mechanisms. Still, sometimes there is nothing to be found in the data.
  B) by technical issues. Sequencers and software are prone to errors. Statistically, there will be at least one error in any analysis.
- You can often limit the effects of these limitations by making biological and technical replicates.

58 Paired-end mapping - Insert-size checking
- % of "All Good" = both reads in the pair have aligned and the pair is properly aligned, meaning that the two reads mapped within a proper distance from each other
- % of "All Bad" = neither the read nor its mate mapped
- % of "Only one read maps" = only one read in a pair is mapped
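
These categories can be derived from the SAM flag of each paired read, as in the sketch below. It relies on the standard flag bits (0x1 paired, 0x2 properly paired, 0x4 read unmapped, 0x8 mate unmapped); the category labels and the extra "both mapped but not properly paired" class are choices made for the illustration.

```python
from collections import Counter
from typing import Iterable

PAIRED, PROPER_PAIR, UNMAPPED, MATE_UNMAPPED = 0x1, 0x2, 0x4, 0x8


def classify_pair(flag: int) -> str:
    """Classify a paired-end read from its SAM flag."""
    if not flag & PAIRED:
        raise ValueError("flag does not describe a paired read")
    read_unmapped = bool(flag & UNMAPPED)
    mate_unmapped = bool(flag & MATE_UNMAPPED)
    if read_unmapped and mate_unmapped:
        return "all_bad"            # neither the read nor its mate mapped
    if read_unmapped or mate_unmapped:
        return "one_mapped"         # only one read of the pair mapped
    if flag & PROPER_PAIR:
        return "all_good"           # both mapped, at a proper distance
    return "both_mapped_improper"   # both mapped, abnormal distance or orientation


def pair_statistics(flags: Iterable[int]) -> Counter:
    return Counter(classify_pair(flag) for flag in flags)


print(pair_statistics([99, 147, 77, 141, 73, 65]))
```

Dividing each count by the total number of paired reads gives the percentages shown on the slide.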

59 Alignment key parameters - Using single or paired-end reads?
The type of sequencing (i.e. single or paired-end reads) is often driven by the application. Example: finding large indels, genomic rearrangements, ... However, in most cases, the pair information can improve the mapping specificity:
- Single-end alignment: repeated sequence
- Paired-end alignment: unique sequence
Figure: alignment of reads against reference genome.

60 NGS Toolkit: SAMtools
Interacting with the SAM/BAM format. SAMtools provides the following commands:
- view: transform and filter SAM or BAM data
- sort: sort a BAM file by genomic location or by name
- index: create a new index file that allows fast look-up of data in a (sorted) SAM or BAM
- mpileup: SNV/indel detection
- rmdup: remove duplicated reads
- flagstat: compute statistics on the SAM/BAM file
- ...

61 NGS Toolkit: BEDTools
Addresses common genomics tasks such as finding feature overlaps and computing coverage.
- Can manage BED, GFF/GTF, VCF and SAM/BAM
- Unix-like commands
- Fast
- All intersection or annotation tasks can be done with BEDTools
Quinlan AR and Hall IM, BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6).
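
To show what "finding feature overlaps" means, here is a naive pure-Python sketch of interval intersection. It is not how BEDTools is implemented: it compares every pair of features, which is only acceptable for small lists. Intervals follow the BED convention (0-based start, end excluded); the function name and the example features are made up.

```python
from typing import List, Tuple

Feature = Tuple[str, int, int]  # (chromosome, start, end), BED-style coordinates


def intersect(features_a: List[Feature], features_b: List[Feature]) -> List[Feature]:
    """Return the overlapping part of every pair of features from A and B."""
    overlaps = []
    for chrom_a, start_a, end_a in features_a:
        for chrom_b, start_b, end_b in features_b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # a non-empty overlap
                overlaps.append((chrom_a, start, end))
    return overlaps


targets = [("chr1", 100, 200), ("chr2", 50, 80)]
variants = [("chr1", 150, 250), ("chr1", 300, 400)]
print(intersect(targets, variants))  # [('chr1', 150, 200)]
```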

62 How to visualise data? IGV: Integrative Genomics Viewer
- Java application (local version)
- Annotations available on the Broad server
- Batch command line
- Supports many different file formats (variant visualization)
- Easy to use
- Limited in terms of annotations
Screencast: How to use IGV (French)

63 How to visualise data? UCSC Genome Browser
- Hundreds of annotation datasets
- Hundreds of public (ENCODE) profiles
- Table functions
- Fully online (sessions)
- Uploading big data files can be difficult (new formats: bigBed, bigWig, etc.)
Screencast: How to use the UCSC genome browser (French)

64 Sequence length distribution Sequencers generate either sequence fragments of uniform length or reads of widely varying lengths. The length distribution helps to identify and remove reads with abnormal length.
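
A minimal sketch of this check: count how many reads have each length, then keep only reads inside an accepted range. The length bounds are assumptions to be chosen per experiment, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def length_distribution(reads: List[str]) -> Dict[int, int]:
    """Map each read length to the number of reads with that length."""
    return dict(Counter(len(read) for read in reads))


def filter_by_length(reads: List[str], min_length: int, max_length: int) -> List[str]:
    """Keep reads whose length lies in [min_length, max_length]."""
    return [read for read in reads if min_length <= len(read) <= max_length]


reads = ["ACGT", "ACGTAC", "AC", "ACGT"]
print(length_distribution(reads))      # {4: 2, 6: 1, 2: 1}
print(filter_by_length(reads, 4, 6))   # ['ACGT', 'ACGTAC', 'ACGT']
```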

65 Sequence length distribution 5/3/2015 Yannick Boursin 65

66 Sequence content
- Proportion of each base (A, C, G, T) called at each position
- GC content at each position -> in random libraries, little to no difference is expected between the different base positions
- N content per base -> an N is reported when the sequencer is unable to make a base call with sufficient confidence
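
The per-position base composition, and from it the GC and N content, can be computed as in the following sketch. The function names are hypothetical, and GC content is defined here as (G + C) over all called characters at that position, which is one possible convention.

```python
from collections import Counter
from typing import Dict, List


def per_position_composition(reads: List[str]) -> List[Dict[str, float]]:
    """Fraction of A, C, G, T and N observed at each read position."""
    counts: List[Counter] = []
    for read in reads:
        for position, base in enumerate(read.upper()):
            if position == len(counts):  # first read reaching this position
                counts.append(Counter())
            counts[position][base] += 1
    return [{base: column[base] / sum(column.values()) for base in "ACGTN"}
            for column in counts]


def per_position_gc(reads: List[str]) -> List[float]:
    """GC fraction at each position, derived from the composition."""
    return [column["G"] + column["C"] for column in per_position_composition(reads)]


reads = ["ACGT", "ACGN", "AGGT"]
composition = per_position_composition(reads)
print(composition[0]["A"])     # 1.0
print(per_position_gc(reads))  # [0.0, 1.0, 1.0, 0.0]
```

A composition that changes strongly along the read, or a position with many N calls, is the kind of signal this check is meant to flag.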

67 Sequence content 5/3/2015 Yannick Boursin 67

68 Over-represented sequences The sequences that are highly duplicated in your library, as well as any primer and/or adapter dimers that were present in the original library. Run A: Sequence: GACTCGGCAGCATCTCCATCCAAACTTTTCATTTCTGCTTTTAAAGGAAA, Count: 37, Percentage: 0.1%
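
A report like the one above (sequence, count, percentage) can be produced by counting identical read sequences, as in this sketch. The threshold is a parameter: 0.1% is used as the default here only because it matches the percentage shown in the example, and the function name is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def overrepresented_sequences(reads: List[str],
                              min_fraction: float = 0.001) -> List[Tuple[str, int, float]]:
    """Return (sequence, count, fraction) for sequences above the threshold."""
    if not reads:
        return []
    total = len(reads)
    counts = Counter(reads)
    # most_common() sorts from the most to the least frequent sequence.
    return [(sequence, count, count / total)
            for sequence, count in counts.most_common()
            if count / total >= min_fraction]


reads = ["AAAA"] * 3 + ["CCCC", "GGGG"]
for sequence, count, fraction in overrepresented_sequences(reads, min_fraction=0.5):
    print(sequence, count, f"{fraction:.1%}")  # AAAA 3 60.0%
```

Comparing the reported sequences with known adapter and primer sequences helps to tell contamination apart from genuinely abundant fragments.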

69 Duplicate reads
- Different reads which have the same sequence
- A duplicate can be a PCR artifact, the result of reading the same fragment twice, or a consequence of enrichment
- Reads which align to the identical location on the reference
Remove duplicates? It depends on the application. Example: for targeted sequencing, you do not want duplicates to be removed.
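
The position-based definition of a duplicate can be sketched as follows: reads that share the same chromosome, start position and strand are grouped, and all but the best-quality read in each group are flagged. The tuple layout, the "keep the highest mean quality" rule and the function name are assumptions for the illustration; dedicated tools use more elaborate criteria, especially for paired-end data.

```python
from typing import Dict, Iterable, Set, Tuple

Alignment = Tuple[str, str, int, str, float]  # (name, chrom, pos, strand, mean quality)


def find_duplicates(alignments: Iterable[Alignment]) -> Set[str]:
    """Return the names of reads flagged as duplicates."""
    best: Dict[Tuple[str, int, str], Tuple[str, float]] = {}
    duplicates: Set[str] = set()
    for name, chrom, pos, strand, quality in alignments:
        key = (chrom, pos, strand)
        if key not in best:
            best[key] = (name, quality)
        elif quality > best[key][1]:
            duplicates.add(best[key][0])  # the previous best becomes a duplicate
            best[key] = (name, quality)
        else:
            duplicates.add(name)
    return duplicates


alignments = [
    ("r1", "chr1", 100, "+", 30.0),
    ("r2", "chr1", 100, "+", 35.0),
    ("r3", "chr1", 100, "-", 20.0),
    ("r4", "chr2", 100, "+", 30.0),
    ("r5", "chr1", 100, "+", 10.0),
]
print(sorted(find_duplicates(alignments)))  # ['r1', 'r5']
```

In targeted (amplicon) sequencing, many genuine reads start at the same position by design, which is why removing duplicates is not wanted there.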

70 Duplicate reads 5/3/2015 Yannick Boursin 70

71 Data analysis
- Motif search
- Chimeric transcript search
- Microbiota study
- Alternative transcript search
- Differential expression study
- Methylation study
- Detection of genomic variants
- Detection of copy-number variation

72 Motif search How does my protein interact with DNA?

73 Chimeric transcripts Are there any chimeric transcripts? 5/3/2015 Yannick Boursin 73

74 Microbiota What kind of species grows in the human gut? Could those species be associated with tumorigenesis? 5/3/2015 Yannick Boursin 74

75 Alternative transcripts Are there any differences between normal cells and tumor cells regarding splicing events?

76 Differential expression Are there genes that would be strongly expressed in one kind of tumor that are not in the other kind? Can we group tumors according to their expression profiles? 5/3/2015 Yannick Boursin 76

77 Methylome Is there any difference between DNA methylation in tumors and in normal cells? 5/3/2015 Yannick Boursin 77

78 Detection of copy-number variations Are there any copy-number alterations (gain or loss of chromosomal regions, amplifications) that could explain tumorigenesis?

79 Detection of genomic variants Are there mutational events that are specific to the tumor genome? Could the tumorigenesis be attributed to them?

80 Quality controls on raw reads: let's start after sequencing
A raw read is characterized by three parameters:
- Its length
- Its sequence
- Its per-base quality
Figure: raw reads.