BICF Variant Analysis Tools. Using the BioHPC Workflow Launching Tool Astrocyte


1 BICF Variant Analysis Tools Using the BioHPC Workflow Launching Tool Astrocyte

2 Prioritization of Variants SNP INDEL SV

3 Astrocyte BioHPC Workflow Platform Allows groups to give easy access to their analysis pipelines via the web: Standardized Workflows; Simple Web Forms; Online documentation & results visualization. Workflows run on the HPC cluster without the developer or user needing cluster knowledge. astrocyte.biohpc.swmed.edu Slide contribution: David

4

5 Alignments FASTQ -> Trim Galore (trim adapters; trim low-quality ends (Q < 25); remove short reads (< 35 bp)) -> Trim FASTQ -> BWA -> Picard -> Dedup BAM -> GATK Realignment & Base Recalibration -> Realigned, Recalibrated BAM

6 Types of Variation Germline Somatic

7 Germline Workflows Realigned, Recalibrated BAM Dedup BAM Speed Seq Lumpy GATK Haplotype Caller Samtools Mpileup Platypus SV VCF GATK VCF SAM VCF Hotspot VCF Platypus VCF SS VCF = Union VCF
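
The "Union VCF" step on this slide can be pictured as merging the per-caller call sets while remembering which callers reported each variant. The sketch below is only an illustration under assumptions: variants are reduced to hypothetical `(chrom, pos, ref, alt)` tuples, the caller names and toy calls are made up, and it is not the workflow's actual merging code.

```python
from collections import defaultdict


def union_calls(calls_by_caller):
    """Merge per-caller variant sets into one union.

    Returns a dict mapping each variant to the set of callers that reported it.
    """
    union = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for variant in variants:
            union[variant].add(caller)
    return dict(union)


# Hypothetical toy call sets keyed by (chrom, pos, ref, alt).
calls = {
    "gatk": {("chr1", 100, "A", "G"), ("chr2", 200, "C", "T")},
    "platypus": {("chr1", 100, "A", "G")},
    "samtools": {("chr1", 100, "A", "G"), ("chr3", 300, "G", "GA")},
}

union = union_calls(calls)

# Variants supported by two or more callers.
multi_caller = {v for v, callers in union.items() if len(callers) >= 2}
```

Keeping the caller set per variant makes a later "called by 2+ callers" check a simple length test.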

8 Key Files VCF file of SNPs/Indels for each sample: SampleID.annot.vcf.gz; Coverage Histogram for each sample: SampleID.coverage_histogram.png; Cumulative Distribution Plot for all samples: coverage_cdf.png; QC for all samples: sequence.stats.txt; Structural Variants (unfiltered): SampleID.sssv.sv.vcf.gz.annot.txt

9 Recommended Filtering for Germline Testing ExAC POPMAX AF ( ) - depends on rarity of the phenotype of the proband; Depth > 10; LOF or Missense (Coding Changes); Alt Read Ct > 3; Mutation Allele Frequency (MAF) > 0.15; If novel: Called by 2+ callers
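
As a rough illustration, the germline criteria above could be applied to one variant at a time as follows. This is a minimal sketch, not the workflow's code: the record is a plain dict with hypothetical key names, and because the slide does not give the ExAC POPMAX AF cutoff, it is passed in as a parameter (the 0.01 used in the example is an arbitrary placeholder, and "at or below the cutoff" is an assumed reading).

```python
def passes_germline_filter(v, max_popmax_af):
    """Return True if a variant record (dict with hypothetical keys) passes.

    max_popmax_af is user-chosen; the slide says it depends on how rare
    the proband's phenotype is.
    """
    if v["exac_popmax_af"] > max_popmax_af:      # too common in the population
        return False
    if v["depth"] <= 10:                          # Depth > 10
        return False
    if v["effect"] not in {"LOF", "missense"}:    # coding changes only
        return False
    if v["alt_read_ct"] <= 3:                     # Alt Read Ct > 3
        return False
    if v["maf"] <= 0.15:                          # MAF > 0.15
        return False
    if v["novel"] and v["n_callers"] < 2:         # novel: 2+ callers
        return False
    return True


# Hypothetical example record.
example = {
    "exac_popmax_af": 0.0005,
    "depth": 42,
    "effect": "missense",
    "alt_read_ct": 18,
    "maf": 0.43,
    "novel": False,
    "n_callers": 1,
}
kept = passes_germline_filter(example, max_popmax_af=0.01)
```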

10 Accuracy in GIAB Sample Sample Fixed Adapters SNV-SN Indel-SN SNV/Indel PPV NA12878_1_HFVC2BBXX Fresh % 100% 98.9% NA12878_2_HFVC2BBXX Fresh % % NA12878_1_HFYWMBBXX Fresh % % NA12878_2_HFYWMBBXX Fresh % % GM12878_Fresh_1adapter Fresh % % GM12878_Fresh_4adapter Fresh % % GM12878_FFPE_1adapter FFPE % % GM12878_FFPE_4adapter FFPE % %

11 Tumors are Heterogeneous Normal Tumor

12 Somatic Workflows Realigned, Recalibrated BAM Dedup BAM MuTect2 VarScan Shimmer Virmid Speed Seq Check Mate QC Pairs Same Subject Mutect VCF VarScan VCF Virmid VCF SS VCF Shimmer VCF = Union VCF

13 Key Files VCF file of SNPs/Indels for each sample: TumorID_NormalID.annot.vcf.gz; Match Check File: TumorID_NormalID_matched.txt

14 Recommended Filtering for Somatic Mutations ExAC POPMAX AF > 0.01; Depth < 20; LOF or Missense; MAF (Normal) * 10 < MAF (Tumor); In COSMIC > 5 Subject Tumor: Alt Read Ct < 3 Tumor: MAF < 0.01; Others Tumor: Alt Read Ct < 8 Tumor: MAF < 0.05 Tumor: Called by 2+ callers
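
One possible reading of the somatic slide, sketched in code. The interpretation is an assumption: thresholds written with ">" or "<" on population frequency, depth, alt read count and tumor MAF are treated as reasons to drop a variant, "LOF or Missense" and "MAF (Normal) * 10 < MAF (Tumor)" as requirements to keep it, and the 2+ callers requirement is applied to variants that are not COSMIC hotspots. The dict keys are hypothetical.

```python
def passes_somatic_filter(v):
    """Return True if a tumor/normal variant record passes (assumed reading)."""
    if v["exac_popmax_af"] > 0.01:                # common in the population
        return False
    if v["depth"] < 20:                           # too shallow
        return False
    if v["effect"] not in {"LOF", "missense"}:    # coding changes only
        return False
    if not v["normal_maf"] * 10 < v["tumor_maf"]:  # must be enriched in tumor
        return False
    if v["cosmic_subjects"] > 5:                  # known hotspot: relaxed cutoffs
        return v["tumor_alt_read_ct"] >= 3 and v["tumor_maf"] >= 0.01
    # All other variants: stricter cutoffs plus agreement between callers.
    return (
        v["tumor_alt_read_ct"] >= 8
        and v["tumor_maf"] >= 0.05
        and v["n_callers"] >= 2
    )


# Hypothetical low-frequency variant seen in 12 COSMIC subjects.
hotspot = {
    "exac_popmax_af": 0.0,
    "depth": 150,
    "effect": "missense",
    "normal_maf": 0.0,
    "tumor_maf": 0.02,
    "cosmic_subjects": 12,
    "tumor_alt_read_ct": 4,
    "n_callers": 1,
}
# The same evidence for a variant that is not in COSMIC.
novel = dict(hotspot, cosmic_subjects=0)
```

With these toy records the hotspot is kept and the otherwise identical novel variant is dropped, which is the point of having two tiers of thresholds.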

15 Simulated Datasets to Evaluate Sensitivity and Specificity of Somatic Mutation Calling We generated 3 sets of 18 SNVs and 16 Indels. We inserted each set into 4 normal alignment files (1 cell line (Depth of Coverage) and 3 Saliva samples (Depth of Coverage)) using BamSurgeon. We calculated the observed mutation allele frequency (MAF) using bamreadct. We ran our somatic mutation workflow using the original bam (Normal) and the altered bam (Tumor).
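
For reference, the observed MAF at a site is conventionally the fraction of reads supporting the alternate allele. The helper below is a minimal sketch of that arithmetic; it assumes the alt and total read counts have already been extracted from a read-count report, and the function name is hypothetical.

```python
def observed_maf(alt_reads, total_reads):
    """Fraction of reads at a site that support the alternate allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must be between 0 and total_reads")
    return alt_reads / total_reads


# Example: 12 alt-supporting reads out of 240 reads at the site.
maf = observed_maf(12, 240)
```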

16 Bioinformatics Somatic Mutation Sensitivity Somatic Germline FP SNV Obs MAF > 5% 100% novel and known hotspots 80.5% novel, 88.3% known hotspots Germline: 0; Somatic: 0 Indel MAF > 5% and Alt Read CT > % novel, 95.4% known hotspots 86.3% novel, 87.5% known hotspots Germline: 0; Somatic: 0 Indel MAF > 10% and Alt Read CT > % novel, 100% known hotspots 100% novel and known hotspots Germline: 0; Somatic: 0

17 Create a new project

18 Add data to your project

19 Add data to your project For NGS experiments, this is recommended.

20 Make your design file: Germline Workflow
SampleID: This ID will be used to name all workflow-produced files, e.g. S0001 will produce S0001.bam
FullPathToFqR1: Name of the fastq file R1 (not the full path)
FullPathToFqR2: Name of the fastq file R2 (not the full path)
SampleID   FullPathToFqR1                 FullPathToFqR2
GM12877    GM12877_S124_R1_001.fastq.gz   GM12877_S124_R2_001.fastq.gz
GM12878    GM12878_S124_R1_001.fastq.gz   GM12878_S124_R2_001.fastq.gz
GM12879    GM12879_S124_R1_001.fastq.gz   GM12879_S124_R2_001.fastq.gz

21 Tips on making your design file Use tab as the delimiter (Excel: save as Text (tab delimited)). If no SubjectID, use same number/character for all rows SampleID and SampleName. If no FqR2, leave them empty. For all contents, no "-". For all contents, no spaces. Column names MUST be exactly the same as documented.
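
The tips above lend themselves to a quick pre-upload check. The following is a minimal sketch, not part of Astrocyte: it takes the design file's text, assumes the germline column names shown on the previous slide, and reports missing columns, ragged rows, and values containing a space or "-". The function name and messages are hypothetical.

```python
import csv
import io

# Column names as shown on the germline design-file slide.
REQUIRED = ["SampleID", "FullPathToFqR1", "FullPathToFqR2"]


def check_design(text, required=REQUIRED):
    """Return a list of problems found in tab-delimited design-file text."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ["file is empty"]
    problems = []
    header = rows[0]
    missing = [c for c in required if c not in header]
    if missing:
        problems.append(f"missing columns: {missing}")
    for n, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"line {n}: expected {len(header)} fields, got {len(row)}"
            )
        for value in row:
            if " " in value or "-" in value:
                problems.append(f"line {n}: '{value}' contains a space or '-'")
    return problems


good = (
    "SampleID\tFullPathToFqR1\tFullPathToFqR2\n"
    "GM12878\tGM12878_S124_R1_001.fastq.gz\tGM12878_S124_R2_001.fastq.gz\n"
)
# The same content saved with spaces instead of tabs.
bad = good.replace("\t", " ")

good_problems = check_design(good)
bad_problems = check_design(bad)
```

A space-delimited file is caught twice: the header no longer contains the required column names, and every value contains a space.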

22 Select your data files and set up workflow and submit SELECT YOUR FILES

23 Project is running

24 Timeline of the whole run

25 Working with the output Export all output to Astrocyte Outgoing Directory

26 How to Transfer Data to Run the Somatic Workflow Mount BioHPC on your computer (see BioHPC Introduction slides). Log in to the cluster.

27 Make your design file: Somatic Workflow
TumorID, NormalID: The TumorID and NormalID are used for naming the files TumorID_NormalID.annot.vcf.gz
TumorBam: Name of the bam file for the Tumor sample
NormalBam: Name of the bam file for the Normal sample
TumorID          NormalID          TumorBAM       NormalBAM
Patient1_tumor   Patient1_normal   p1_tumor.bam   p1_normal.bam
Patient2_tumor   Patient2_normal   p2_tumor.bam   p2_normal.bam

28 Common errors and solutions Make sure the delimiter is tab. Make sure the column names are the same as given in the documentation. Make sure the file names match.
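
The "file names match" check can be done before submitting. Below is a minimal sketch under assumptions: the design file is tab-delimited, the file-name columns are called `TumorBAM` and `NormalBAM` as in the table on the somatic design-file slide, and the set of uploaded file names is supplied by hand. The function name is hypothetical.

```python
import csv
import io


def missing_files(design_text, available, file_columns):
    """List file names referenced in the design file that are not available."""
    reader = csv.DictReader(io.StringIO(design_text), delimiter="\t")
    missing = []
    for row in reader:
        for column in file_columns:
            name = row.get(column) or ""
            if name and name not in available:
                missing.append(name)
    return missing


design = (
    "TumorID\tNormalID\tTumorBAM\tNormalBAM\n"
    "Patient1_tumor\tPatient1_normal\tp1_tumor.bam\tp1_normal.bam\n"
    "Patient2_tumor\tPatient2_normal\tp2_tumor.bam\tp2_normal.bam\n"
)
# Hypothetical upload in which one normal BAM was forgotten.
uploaded = {"p1_tumor.bam", "p1_normal.bam", "p2_tumor.bam"}

not_found = missing_files(design, uploaded, ["TumorBAM", "NormalBAM"])
```

In practice the `uploaded` set could be built from a directory listing, for example with `os.listdir`.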

29 Common errors and solutions Not all files are uploaded: it's about the proxy setting. Use auto-detect proxy.

30 Downstream visualization of variants from a user perspective Ling Cai QBRC/CRI-GMDP

31 Visualization tools IGV Somatic mutation example from a cancer sample gene.iobio Germline mutation examples from genetic disease patient samples IGV user guide: gene.iobio tutorials:

32 Using IGV on BioHPC: getting started 1. Launch a WebGUI session from Web Visualization under Cloud Services on the BioHPC portal. 2. Open a terminal and type the command module load IGV/2.3.90; igv.sh 3. Specify the genome (it should match the reference genome against which the variants were called).

33 Using IGV on BioHPC: loading files and search 1. File -> Load from File -> Select. Have the index files in the same folder! 2. Search for: a locus (for example, chr5:90,339,000-90,349,000); a gene symbol or other feature identifier (e.g., DPYD or NM_ ); a mutation (EGFR:T790M or EGFR:2369C>T)

34 Using IGV on BioHPC customizing visualization Collapse tracks, display alternative reading frames

35 Using IGV on BioHPC customizing visualization Color alignments by read strand Sort alignments by base

36 Using IGV on BioHPC getting detailed information Variant, bam coverage, read, nucleotide position on different transcripts

37 Using gene.iobio to visualize variants for genetic diseases Genetic diseases Inherited Autosomal Sex-linked De novo

38 Using gene.iobio loading files Selection of index files is required during upload

39 Using gene.iobio viewing ranked variants and call variants

40 Using gene.iobio examining variants

41 Using gene.iobio transcript specific annotation

42 Using gene.iobio multi-gene analysis (import gene list)

43 Using gene.iobio multi-gene analysis (Viewing result)

44 Using gene.iobio multi-gene analysis (Filtering result) Loss/Gain of function mutations: Splice; Stop gain/loss; Start gain/loss; Coding frame shifts. Non-synonymous Mutations: Amino Acid Changes. Variants Likely to Change Expression: Transcription Factor Binding Sites; miRNA Targets

45 Using gene.iobio multi-gene analysis (bookmarks)

46 Determining Genetic Causes of Disease in Exomes is Not Trivial The causal variant is identified from exomes in only about 30% of Mendelian (inherited) disease cases. A genetic mutation can express a range of phenotypes (Penetrance). Not all functional mutations are in coding regions (ncRNAs or regulatory regions). Sporadic genetic diseases often have polygenic causes, sometimes with a combination of inherited and somatic (de novo) mutations. Mutations can be localized to a particular tissue type or region of the body (Mosaicism).

47 GUI tools for variant filtering (not free)

48 Command line tools for variant filtering (free!) vcftools, bcftools: manipulate VCF files. peddy: ped-file correspondence check, ancestry check, and sex check, run directly and quickly on a VCF.