14 March, 2016: Introduction to Genomics

1 14 March, 2016: Introduction to Genomics

2 Genome

3 Genome within Ensembl browser

4 Genome within Ensembl browser (figure callouts: 1, 2, 3)

5 Genome: genes, variation, repeats

6 Genes

7 Variation: SNP or SNV (single-nucleotide polymorphism/variation); indels (insertions and deletions); structural variation: CNV (copy-number variation), inversions, translocations

8 Variation: caused by mutations; visible in DNA sequence; the proportion of variable sites depends on evolutionary distance (within species: little; between species: lots).
seq1 CGATGCGCGATACATCGACGTGCA
seq2 CGATGCGCGGTACATCGACGTGCA
seq3 CGATGCGCGATACATCGACGTGCA
seq4 CGATGCGCGATACATCGACGTGCA
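
The idea of "proportion of variable sites" can be made concrete with the four sequences on this slide. The following is a minimal sketch, assuming the sequences are already aligned and of equal length; the function name `variable_sites` is made up for illustration.

```python
# Toy alignment copied from the slide; all sequences have equal length.
alignment = {
    "seq1": "CGATGCGCGATACATCGACGTGCA",
    "seq2": "CGATGCGCGGTACATCGACGTGCA",
    "seq3": "CGATGCGCGATACATCGACGTGCA",
    "seq4": "CGATGCGCGATACATCGACGTGCA",
}


def variable_sites(seqs):
    """Return 0-based column indices where the sequences do not all agree."""
    columns = zip(*seqs)  # iterate over alignment columns
    return [i for i, col in enumerate(columns) if len(set(col)) > 1]


sites = variable_sites(list(alignment.values()))
length = len(alignment["seq1"])
proportion = len(sites) / length

print(sites)       # columns that differ between sequences
print(proportion)  # variable sites divided by alignment length
```

Here only one column differs (seq2 carries G where the others carry A), so one site out of 24 is variable.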

9 Repeats: many types. Alu elements are the most abundant transposable elements in the human genome: ~300 bases long, ~1 million copies, making up ~11% of the human genome. Repeat copies are similar to each other and cause trouble in genome assembly and short-read mapping; they are often useless and ignored where possible.
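
As a rough sanity check on the numbers for Alu elements, the copy count times the element length can be compared with the genome size. This is a back-of-the-envelope sketch; the genome size of 3 x 10^9 bp is the approximate human figure quoted later in the deck, and the variable names are illustrative.

```python
# Approximate figures from the slides (all values are rounded estimates).
alu_length_bp = 300
alu_copies = 1_000_000
genome_size_bp = 3_000_000_000  # human genome, ~3 x 10^9 bp

alu_total_bp = alu_length_bp * alu_copies
alu_fraction = alu_total_bp / genome_size_bp

print(f"Alu elements cover about {alu_fraction:.0%} of the genome")
```

The rounded inputs give roughly 10%, which is consistent with the ~11% quoted on the slide.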

10 Genome: genes, variation, repeats (correlations!)

11 Variation data: mainly interested in SNPs and indels observed between samples; distribution of variation across sites; distribution among samples. Here, the reference sequence is known and the sample data are multiple genomes. To us, the data come from a magic box. Data: millions to billions of ~150 bp fragments. (Figure label: R)

12 Variation data: mainly interested in SNPs and indels observed between samples; distribution of variation across sites; distribution among samples. Here, the reference sequence is known and the sample data are multiple genomes. To us, the data come from a magic box. Data: millions to billions of ~100 bp fragments. Sample genome: millions to billions of bp long (human ~3 x 10^9 bp). (Figure label: R)

13 Variation data: mainly interested in SNPs and indels observed between samples; distribution of variation across sites; distribution among samples. Here, the reference sequence is known and the sample data are multiple genomes. To us, the data come from a magic box. Sample genome: millions to billions of bp long (human ~3 x 10^9 bp). DNA fragmentation (... bp). Data: millions to billions of ~100 bp fragments. (Figure label: R)

14 Variation data: mainly interested in SNPs and indels observed between samples; distribution of variation across sites; distribution among samples. Here, the reference sequence is known and the sample data are multiple genomes. To us, the data come from a magic box. Sample genome: millions to billions of bp long (human ~3 x 10^9 bp). DNA fragmentation (... bp). DNA sequencing: 100 bp known, ... bp unknown, 100 bp known. Data: millions to billions of ~100 bp fragments. (Figure label: R)
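
The "magic box" of fragmentation and sequencing can be imitated on a toy scale. The sketch below assumes a random 1,000 bp genome and a hypothetical fragment length of 300 bp (the slide's own fragment length is not preserved); each fragment yields two 100 bp reads from its ends, with the middle left unread. All function names are made up for illustration.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(genome, n_fragments, fragment_len, read_len, seed=0):
    """Sample random fragments and read read_len bases from each end."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_fragments):
        start = rng.randrange(len(genome) - fragment_len + 1)
        fragment = genome[start:start + fragment_len]
        read1 = fragment[:read_len]                      # known start
        read2 = reverse_complement(fragment)[:read_len]  # known end, opposite strand
        pairs.append((read1, read2))
    return pairs


def mean_coverage(n_pairs, read_len, genome_len):
    """Average number of reads covering each genome position."""
    return 2 * n_pairs * read_len / genome_len


rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(1000))  # toy genome
pairs = paired_end_reads(genome, n_fragments=50, fragment_len=300, read_len=100)
coverage = mean_coverage(len(pairs), 100, len(genome))
print(len(pairs), coverage)
```

With 50 fragments and two 100 bp reads each, the toy genome is covered about ten times over on average, even though the middle 100 bp of every fragment stays unknown.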

15 Variation data: mainly interested in SNPs and indels observed between samples; distribution of variation across sites; distribution among samples. Here, the reference sequence is known and the sample data are multiple genomes. To us, the data come from a magic box. Data: millions to billions of ~100 bp fragments, followed by short read mapping and genomic analyses. (Figure label: R)
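
Short read mapping means finding where each read fits on the known reference. The sketch below is a deliberately naive illustration, not how production mappers work (those rely on indexed data structures rather than scanning every position). It slides a read along the reference and reports positions with at most a given number of mismatches. The reference is seq1 from the earlier variation slide, the read is a 10 bp piece of seq2, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(reference, read, max_mismatches=1):
    """Return (start, mismatches) for every placement within the mismatch limit."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        distance = hamming(window, read)
        if distance <= max_mismatches:
            hits.append((start, distance))
    return hits


reference = "CGATGCGCGATACATCGACGTGCA"  # seq1 from the variation slide
read = "GCGGTACATC"                      # 10 bp taken from seq2

hits = map_read(reference, read, max_mismatches=1)
print(hits)
```

The read maps to a single position with one mismatch, and that mismatch is the candidate SNP. A read from a repeat would instead produce several hits, which is why repeats cause trouble in short-read mapping.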

16 Variation data: mainly interested in SNPs and indels observed between samples, because (1) detection of CNVs (and structural variation) using short read data is tricky, and (2) the evolution of CNVs is unclear, whereas population genetics theory is best developed for SNP data. Followed by short read mapping and genomic analyses. (Figure label: R)

17 Illumina sequencing: the most common sequencing machines. Illumina reads have systematic errors; some errors can be accounted for.

18 Illumina
1:N:0:4
ACAACNCGCCCGTGNTGCAGGACTGGGTCACGGCCACTGACATCCGCGTGGCCTTCCGCCGCCTGCACACGTTCGGTGACGAGAACGAGGCCGACTCCGAGCTGGCGCGCGCCTCGTACTTCTACGCCGTGTCCGACCT
+
1:N:0:4
GCCACNAAAATTTANAACTAGAGCTGCCCTATGCCCCAGCAATTGCACTCCTGGGTATTTACCCCAAAGACACAGATGTAGTGAAAAGAAGGGCCATATTCACCCCAACGTTCATAGCAGCAAAGTCCACTATAGCCAA
+
<AA.A#A.7A.F<F#FFFF.AAFFA.FFFAFFF7FFFFFFFFFF.7)FFFFF.AFFAAAF.F)FFF<AFAF7FAF.FAFFFFFFFFA..)FFF.FF7FF)A...FF..<.<FF7)<FFFF<7F.)F.FFF.7FFFFFF7
Data are output in fastq format. Per-base sequence content shows contamination and fragment enrichment. Base call qualities reflect sample DNA quality and issues in the sequencing run. Base qualities are taken into account in later analyses, so there is typically no need for clean-up. (Figure panels: good data, bad data)
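
To show how the quality line of a fastq record is read, here is a minimal sketch that parses four-line records and decodes quality characters into Phred scores. It assumes the Phred+33 encoding (the usual one for recent Illumina output, but an assumption here), uses only the first 15 bases and qualities of the second read on the slide, and invents the read name `hypothetical_read` because the original header is truncated.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from 4-line fastq records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@'


def phred_scores(qual, offset=33):
    """Decode a quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in qual]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# First 15 bases and qualities of the second read on the slide;
# the read name is hypothetical.
example = io.StringIO(
    "@hypothetical_read 1:N:0:4\n"
    "GCCACNAAAATTTAN\n"
    "+\n"
    "<AA.A#A.7A.F<F#\n"
)

name, seq, qual = next(read_fastq(example))
scores = phred_scores(qual)
n_scores = [q for base, q in zip(seq, scores) if base == "N"]
print(scores)
print(n_scores)
```

The uncalled bases (N) carry the lowest score, written as `#`, while `F` marks the most confident calls. A Phred score of 20 corresponds to a 1% chance of a wrong base call.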

19 Overview of resequencing data analysis: each sample yields fastq data; mapping turns fastq data into bam data; variant calling turns bam data into vcf data (alleles such as C, G, T, A and per-sample genotypes such as 0/1 and 1/1); the vcf data feed summary statistics and further analyses.
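
The genotype codes in the vcf step lend themselves to simple summary statistics. In a vcf genotype, 0 denotes the reference allele and 1 the first alternative allele, so 0/1 is heterozygous and 1/1 is homozygous for the alternative. The sketch below computes the alternative-allele frequency from the genotype strings shown on the slide, treating them as eight samples at one site purely for illustration; the function name is hypothetical.

```python
def alt_allele_frequency(genotypes):
    """Fraction of called alleles that are non-reference (anything but '0')."""
    alt = 0
    total = 0
    for gt in genotypes:
        # Genotypes may be unphased ('/') or phased ('|'); '.' means missing.
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue
            total += 1
            if allele != "0":
                alt += 1
    return alt / total if total else float("nan")


# Genotypes as listed on the slide, assumed here to be one site in eight samples.
genotypes = ["0/1", "0/1", "1/1", "1/1", "1/1", "1/1", "1/1", "1/1"]

freq = alt_allele_frequency(genotypes)
n_het = sum(gt in ("0/1", "1/0") for gt in genotypes)
print(freq, n_het)
```

Two heterozygotes and six alternative homozygotes give 14 alternative alleles out of 16.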