Genome variation - part 1

Similar documents
Sequence variation Introductory bioinformatics for human genomics workshop, UNSW

Population description. 103 CHB Han Chinese in Beijing, China East Asian EAS. 104 JPT Japanese in Tokyo, Japan East Asian EAS

Human Populations: History and Structure

De novo human genome assemblies reveal spectrum of alternative haplotypes in diverse

I/O Suite, VCF (1000 Genome) and HapMap

Analysing Alu inserts detected from high-throughput sequencing data

Browsing Genes and Genomes with Ensembl

Eric Green NHGRI Director

Introduction to human genomics and genome informatics

A review of intelligence GWAS hits: their relationship to country IQ and the issue of spatial autocorrelation.

Supplemental materials. Table S1 Population names and abbreviations

Resources at HapMap.Org

Mining GWAS Catalog & 1000 Genomes Dataset. Segun Fatumo

S G. Design and Analysis of Genetic Association Studies. ection. tatistical. enetics

Browsing Genes and Genomes with Ensembl

Evidence of selection on human stature inferred from spatial distribution of allele frequencies.

Haplotypes, linkage disequilibrium, and the HapMap

SUPPLEMENTAL MATERIAL

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014

Genotyping Technology How to Analyze Your Own Genome Fall 2013

UKPMC Funders Group Author Manuscript Nature. Author manuscript; available in PMC 2011 April 1.

THE HEALTH AND RETIREMENT STUDY: GENETIC DATA UPDATE

Introduction to the UCSC genome browser

SNP calling and VCF format

Supplementary Figure 1. Principle component analysis based on the GWAS subjects and the HapMap Phase 2 populations. (A) Distributions of all subjects

Amapofhumangenomevariationfrom population-scale sequencing

Evaluation of whole exome sequencing as an alternative to BeadChip and whole genome sequencing in human population genetic analysis

Bioinformatic Analysis of SNP Data for Genetic Association Studies EPI573

Introduction to Next Generation Sequencing (NGS) Andrew Parrish Exeter, 2 nd November 2017

Genetic Variation and Genome- Wide Association Studies. Keyan Salari, MD/PhD Candidate Department of Genetics

Genomics: Human variation

Ben Gurion University of the Negev, Beersheba, Israel; Tel.:

VEGAS2: Gene-based test software using 1000 Genomes reference sets. User Manual

ARTICLE Contrasting X-Linked and Autosomal Diversity across 14 Human Populations

UK Biobank Axiom Array

Infinium Global Screening Array-24 v1.0 A powerful, high-quality, economical array for population-scale genetic studies.

Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project

Analytics Behind Genomic Testing

Update on the Genomics Data in the Health and Re4rement Study. Sharon Kardia Jennifer A. Smith University of Michigan April 2013

Understanding Human Variation

In silico variant analysis: Challenges and Pitfalls

Training materials.

Population differentiation analysis of 54,734 European Americans reveals independent evolution of ADH1B gene in Europe and East Asia

Briefly, this exercise can be summarised by the follow flowchart:

Variant prioritization in NGS studies: Annotation and Filtering "

Introducing combined CGH and SNP arrays for cancer characterisation and a unique next-generation sequencing service. Dr. Ruth Burton Product Manager

Nature Genetics: doi: /ng.3143

After the association: Functional and Biological Validation of Variants

Annotating your variants: Ensembl Variant Effect Predictor (VEP) Helen Sparrow Ensembl EMBL-EBI 2nd November 2016

Population structure, heritability, and polygenic risk

Applied Bioinformatics

Derrek Paul Hibar

Human Genetics and Gene Mapping of Complex Traits

Niemann-Pick Type C Disease Gene Variation Database ( )

Analysis of neo-antigens to identify T-cell neo-epitopes in human Head & Neck cancer. Project XX1001. Customer Detail

The Whole Genome TagSNP Selection and Transferability Among HapMap Populations. Reedik Magi, Lauris Kaplinski, and Maido Remm

Global Screening Array (GSA)

Variant calling workflow for the Oncomine Comprehensive Assay using Ion Reporter Software v4.4

Axiom mydesign Custom Array design guide for human genotyping applications

Supplementary Figures

SUPPLEMENTARY INFORMATION

Fast, Accurate and Sensitive DNA Variant Detection from Sanger Sequencing:

Using VarSeq to Improve Variant Analysis Research

Finding Enrichments of Functional Annotations for Disease-Associated Single- Nucleotide Polymorphisms

Implications of human genetic variation in CRISPR-based therapeutic genome editing

Prioritization: from vcf to finding the causative gene

Supplementary Material for Extremely low-coverage whole genome sequencing in South Asians captures population genomics information

Supplementary Figures and Data

Personal Genomics Platform White Paper Last Updated November 15, Executive Summary

Human Genetic Variation. Ricardo Lebrón Dpto. Genética UGR

Guided tour to Ensembl

Variant calling in NGS experiments

C3BI. VARIANTS CALLING November Pierre Lechat Stéphane Descorps-Declère

Chapter 5. Structural Genomics

The Human Genome. The raw data. The repeat content. Composition of the human genome bases. A s T s C s and G s and N s.

CHARACTERIZING genetic diversity across individuals

From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow

Analysing Linkage Disequilibrium with Ensembl

14 March, 2016: Introduction to Genomics

IL1B-CGTC haplotype is associated with colorectal cancer in. admixed individuals with increased African ancestry

Ensembl Tools. EBI is an Outstation of the European Molecular Biology Laboratory.

For more information about how to cite these materials visit

Genomes contain all of the information needed for an organism to grow and survive.

Human Population Differentiation Is Strongly Correlated with Local Recombination Rate

CITATION FILE CONTENT / FORMAT

Supplemental Figure 1: Nisoldipine Concentration-Response Relationship on ipsc- CMs. Supplemental Figure 2: Exome Sequencing Prioritization Strategy

BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

SNPs - GWAS - eqtls. Sebastian Schmeier

ChroMoS Guide (version 1.2)

Chapter 14: Genes in Action

Weighted likelihood inference of genomic autozygosity patterns in dense genotype data

VCGDB: A Virtual and Dynamic Genome Database of the Chinese Population

NGS in Pathology Webinar

Redefine what s possible with the Axiom Genotyping Solution

Identification of Single Nucleotide Polymorphisms and associated Disease Genes using NCBI resources

USER MANUAL for the use of the human Genome Clinical Annotation Tool (h-gcat) uthors: Klaas J. Wierenga, MD & Zhijie Jiang, P PhD

Week 1 BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

Introduction to Genetics and Pharmacogenomics

Human Population Differentiation is Strongly Correlated With Local Recombination Rate

Genetic Variation in an Individual Human Exome

Transcription:

Genome variation - part 1 Dr Jason Wong Prince of Wales Clinical School Introductory bioinformatics for human genomics workshop, UNSW Day 2 Friday 21 th January 2016

Aims of the session Introduce major human genome variation databases. dbsnp 1000genomes Basic variant annotation using UCSC. We will look at ExAC/gnomAD in the next session.

Types of variation Cytological level: Chromosome numbers Segmental duplications, rearrangements, and deletions Sub-chromosomal level: Transposable Elements Short Deletions/Insertions, Tandem Repeats Sequence level: Single Nucleotide Polymorphisms (SNPs) Small Nucleotide Insertions and Deletions (Indels)

Why study sequence variation? Rare disease Determine disease risk Response to therapy Forensics Evolution

Single nucleotide polymorpisms (SNP) Typically refers to single bases substitution. There are ~40 M common SNPs in human population. A given individual would expect to differ from reference genome by 1% (i.e. 3 million SNPs)

Types of SNPs Genic, coding SNPs Frameshift Splice site Non-synonymous (missense, nonsense) Synonymous (splice enhancer/suppressor?) Genic, non-coding SNPs Untranslated region Regulatory SNPs Intronic SNPs Intergenic

Predicting effect of coding SNPs Functional importance of SNPs usually based on: Sequence conservation. Frequency in population. Alter protein 2D/3D structure. Within protein motifs. Many tools are now available for coding SNP function prediction BUT is still far from perfect.

Non-coding SNPs Traditionally more difficult to annotate as >98% of the genome is non-coding. Want to find SNPs that is associated with gene expression (eqtls). With the ENCODE/Epigenome project, it is easier (but still very difficult) to find potential functional non-coding SNPs.

Insertion/deletion (Indels) Typically defined as gain or loss of 1-50 bps Less frequent than SNPs (~10% of all sequence variation). But if in coding sequence can additionally cause frameshift mutations.

dbsnp Online database from NCBI for cataloguing all SNPs submitted by the scientific community. www.ncbi.nlm.nih.gov/snp/

Reliability of dbsnp Problem with dbsnp is that anyone can upload variants and therefore it is claimed that there is a false positive rate of perhaps > 10%. Furthermore, some somatic mutations have also found their way into dbsnp. Therefore, use dbsnp BUT ideally only SNPs from 1000genomes project.

1000 Genomes project Goal of the project is to find virtually all genetic variants with frequency of at least 1% in the human population. Ultimate aims was to sequence ~2500 humans at 4x whole-genome coverage www.1000genomes.org

Major populations Total samples East Asian (ASN) 523 South Asian (SAN) 494 African (AFR) 691 European (EUR) 514 Americas (AMR) 355 Total 2,577 1000 Genomes samples 26 sub-populations Population Code Population Description Super Population CHB Han Chinese in Bejing, China ASN JPT Japanese in Tokyo, Japan ASN CHS Southern Han Chinese ASN CDX Chinese Dai in Xishuangbanna, China ASN KHV Kinh in Ho Chi Minh City, Vietnam ASN CEU Utah Residents (CEPH) with Northern and Western European ancestry EUR TSI Toscani in Italia EUR FIN Finnish in Finland EUR GBR British in England and Scotland EUR IBS Iberian population in Spain EUR YRI Yoruba in Ibadan, Nigera AFR LWK Luhya in Webuye, Kenya AFR GWD Gambian in Western Divisons in The Gambia AFR MSL Mende in Sierra Leone AFR ESN Esan in Nigera AFR ASW Americans of African Ancestry in SW USA AFR ACB African Carribbeans in Barbados AFR MXL Mexican Ancestry from Los Angeles USA AMR PUR Puerto Ricans from Puerto Rico AMR CLM Colombians from Medellin, Colombia AMR PEL Peruvians from Lima, Peru AMR GIH Gujarati Indian from Houston, Texas SAN PJL Punjabi from Lahore, Pakistan SAN BEB Bengali from Bangladesh SAN STU Sri Lankan Tamil from the UK SAN ITU Indian Telugu from the UK SAN

Usage of variant data GWAS studies have typically only been able to find associations of variants of frequency of > 5%. 1000 Genomes project enables screening of variants discovered in sequencing of patients with genetic disorders and in cancer genomes.

Per individual variant load (Nature 491, 56-65)

State of 1000 genomes project Phase 1 Completed in 2012 1,092 humans. 14 populations 36.7 M SNPs 1.38 M Indels Phase 3 Completed in 2014 2,535 humans. 26 populations. 78.1 M SNPs 3.1 M Indels 1.9 M SNPS are not shared Some samples not shared. Different sequencing platforms. Change in variant calling pipeline.

Accessing 1000 genomes data www.1000genomes.org contains variant calls (VCF format), aligned BAM files and RAW files. Almost all (if not all) SNPs from 1000 genomes also catalogued in dbsnp. UCSC has a track containing all 1000 genome Phase 1 and 3 data.

Accessing 1000genomes from UCSC Aim: Visualise all coding 1000 genome SNPs over the APOE gene and highlighting non-synonymous SNPs To load session, user: jasewong session name: bioinf_workshop_snp_2016

Accessibility tracks Shows regions of the genome where variant calls can be reliably made using whole genome NGS data. Note that some regions still have variant calls because 1000 genomes didn t just use whole gnome NGS. Useful for raising caution if your variants come from these regions. https://genome.ucsc.edu/cgi-bin/hgtrackui?g=tgpphase3accessibility

Good for downloading SNPs, but not good for visualisation for this bring up dbsnp tracks

Bring up dbsnp 144 All SNPs, Common SNPs, Flagged SNPs and Multi. SNP from the Variation section of tracks. Select dense for each one.

SNP definitions Common SNPs - SNPs with >= 1% minor allele frequency (MAF), mapping only once to reference assembly. Flagged SNPs - SNPs < 1% minor allele frequency (MAF) (or unknown), mapping only once to reference assembly, flagged in dbsnp as "clinically associated" -- not necessarily a risk allele! (These are rare SNPs that with known clinical function). Mult. SNPs - SNPs mapping in more than one place on reference assembly. All SNPs - all SNPs from dbsnp mapping to reference assembly. https://genome.ucsc.edu/cgi-bin/hgtrackui?g=snp144

Need to configure track to only show 1000 genome coding SNPs Right-click anywhere on the All SNPs track and select Configure All SNPs(144)

1. Set Single Nucleotide Polymorphism only 2. Set 1000 Genomes Project only 3. Remove all except missense variant

Select only clinical associated variants

Why are there less clinically associated, missensed, 1000 genome SNPs than Flagged SNPs?

APOE contains two well known Alzhimer s disease risk associated SNPs rs429358 and rs7412 Bring up GWAS Catalog track (under Phenotype and Literature) to find where these are.

Note: 1. The two SNPs are NOT in the Flagged SNPs track because their MAF >= 1% 2. The two SNPs lie in regions inaccessible to short read NGS.

Change All SNPs track to packed and click on rs7412 for more information

Click to link out to dbsnp

Exporting specific SNPs using table browser Aim is to download all 1000genome clinically associated missense SNPs over APOE Type APOE and click go.

Select Table browser from Tools menu Select track Select filter Check position. The coordinates should be where the browser last was.

Filter 1. Check 1000genomes 2. Check missense 3. Clinically-assoc

Select BED as the output format Optionally type in a name for the output file to download the file.

BED file from UCSC should be 8 SNPs in total

Annotating variants in UCSC

The VCF file format Most common format to store variant/mutation information. In text format, but difficult to view in text editor/excel. Header contains useful information. https://samtools.github.io/hts-specs/vcfv4.2.pdf

Upload APOE_1000genomes.vcf

Select variants here (since we only have one uploaded this will be the default/only option.

Change to HTML for easier viewing

Further reading/resources 1000 genomes project (www.1000genomes.org/) Phase 1 paper (www.ncbi.nlm.nih.gov/pubmed/23128226) Phase 3 paper (www.ncbi.nlm.nih.gov/pubmed/2643224)