IMPACT User Manual. Version 1.0

Table of Contents:
  Overview
  Dependencies
  Preparation
  Download
  Quick Start
  Module 1: Somatic Variants Detection
  Module 2: Copy Number Alteration Detection
  Module 3: Drug Prediction
  Module 4: Tumor Heterogeneity Analysis
  Contact Details

OVERVIEW:

IMPACT (Integrating Molecular Profiles with ACtionable Therapeutics) is a novel Whole Exome Sequencing (WES) data analysis pipeline that integrates both single nucleotide variants and copy number alterations from WES data to identify a list of candidate genes for therapeutic targets. From the list of candidate genes, IMPACT returns a prioritized list of drugs predicted to target these cancer genes, using our recently published comprehensive drug-target database. The pipeline also lets users explore the tumor heterogeneity of a sample, so that clonal dynamics can be followed from WES data over the course of treatment, or clinically similar samples can be compared with each other. The IMPACT analysis pipeline and its four modules are illustrated in Figure 1.

Figure 1: IMPACT Analysis Pipeline.

IMPACT is designed to be completed in four modules.

Module 1: Variant Detection Module
  o This module takes fastq.gz files as input and outputs a VCF file containing deleterious mutations.

Module 2: Copy Number Analysis Module
  o This module takes either fastq.gz files or bam files and determines copy number changes in the tumor relative to a normal sample.

Module 3: Drug Prediction Module
  o This module takes a list of genes from either or both of Modules 1 and 2 and links them to drugs that target those genes.

Module 4: Tumor Heterogeneity Analysis Module
  o This module takes VCF files from matched normal and cancer samples and outputs the allele frequency of somatic variants.

Dependencies:
  Perl v
  SAMTOOLS v1.1 & BCFtools v1.1
  BWA v0.7.8-r455
  Picard-Tools version
  Annovar version
  VarScan v2.3.8
  JAVA
  R version

Preparation:

Make sure that the fastq files are gzipped and that JAVA, R, and Perl are in $PATH. This program was designed to take in paired-end exome sequencing data. Normal samples are not required for Modules 1, 3, and 4.

Edit the following directory path statements in the Module 1 and Module 2 scripts so that they point to your local installations:

    my $path_to_hg_index = "/path/to/bwa_index/hg19_exome_ref/";
    my $path_to_picardtools = "/path/to/picard-tools-1.119/";
    my $path_to_ref_exome = "/path/to/hg19";
    my $path_to_bcftools = "/path/to/bcftools-1.2/";
    my $path_to_annovar_humandb = "/path/to/annovar/humandb/";

Download:

IMPACT analysis pipeline and example files available at

Example files:

  Melanoma Pre-treatment Tumor FASTQ files (paired-end)
    MB_1295_C_P0061.R1.fastq.gz
    MB_1295_C_P0061.R2.fastq.gz

  Melanoma Normal FASTQ files (paired-end)
    MB_1294_N_P0061.R1.fastq.gz
    MB_1294_N_P0061.R2.fastq.gz

To unpack:

    $ tar xvf IMPACT.gz.tar
    $ cd IMPACT
    $ ls
    module1 module2 module3 module4
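The preparation checks above can be scripted before a run. The following is a minimal sketch and is not part of IMPACT: the function names missing_tools and is_gzipped are hypothetical helpers, and the gzip check assumes that inspecting the two-byte gzip magic number (1f 8b) is sufficient.

```shell
# Hypothetical helpers; they are not shipped with IMPACT.

# Print the name of every listed command that is NOT found on $PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Succeed only if the first two bytes of the file are the gzip magic number.
is_gzipped() {
    [ "$(od -An -tx1 -N2 "$1" | tr -d ' \n')" = "1f8b" ]
}

# Example use (commented out so that the block has no side effects):
#   missing_tools java R perl
#   is_gzipped MB_1295_C_P0061.R1.fastq.gz || echo "not gzipped"
```

An empty result from missing_tools means that every listed command was found on $PATH.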

Quick Start:

Place the fastq.gz files into the folder IMPACT. Run each of the following from the folder IMPACT/.

To execute Module 1:

    $ cd module1
    $ ./Module_1_fastq_vcf.pl ../MB_1295_C_P0061.R1.fastq.gz ../MB_1295_C_P0061.R2.fastq.gz ../MB_1294_N_P0061.R1.fastq.gz ../MB_1294_N_P0061.R2.fastq.gz

To execute Module 2:

    $ cd module2
    $ ./Module_2_bam_varscan.pl ../MB_1295_C_P0061.R1.fastq.gz ../MB_1295_C_P0061.R2.fastq.gz ../MB_1294_N_P0061.R1.fastq.gz ../MB_1294_N_P0061.R2.fastq.gz

To execute Module 3:

    $ cd module3
    $ ./Module_3_drug_linking.pl ../MB_1295_C_P0061.R1.fastq.gz ../MB_1295_C_P0061.R2.fastq.gz ../MB_1294_N_P0061.R1.fastq.gz ../MB_1294_N_P0061.R2.fastq.gz
    $ . PermRHypertest.sh
    $ PermRHypertest hypertest.txt > p_values.txt
    $ ./Module_3_last_step.pl ../MB_1295_C_P0061.R1.fastq.gz ../MB_1295_C_P0061.R2.fastq.gz ../MB_1294_N_P0061.R1.fastq.gz ../MB_1294_N_P0061.R2.fastq.gz

To execute Module 4:

    $ cd module4
    $ ./Module_4_allele_freq.pl ../MB_1295_C_P0061.R1.fastq.gz ../MB_1295_C_P0061.R2.fastq.gz ../MB_1294_N_P0061.R1.fastq.gz ../MB_1294_N_P0061.R2.fastq.gz

Module 1: Somatic Variants Detection

To run:

    $ cd module1
    $ ./Module_1_fastq_vcf.pl ../Tumor_R1.fastq.gz ../Tumor_R2.fastq.gz ../Normal_R1.fastq.gz ../Normal_R2.fastq.gz

This module takes FASTQ files as input. Users must provide WES data for the tumor sample; the normal sample is optional. The module performs sequence alignment using BWA, and variant detection and annotation using Samtools and ANNOVAR. The output from Module 1 is a list of variants.

Pseudocode and parameters

1) BWA aligns fastq.gz files to the HG19 exome
    bwa aln -t 5 hg19.fa Tumor_R1.fastq.gz > Tumor_R1.sai

2) BWA sampe combines the R1 & R2 sai files into a sam file
    bwa sampe hg19.fa Tumor_R1.sai Tumor_R2.sai Tumor_R1.fastq.gz Tumor_R2.fastq.gz > Tumor.sam

3) Samtools sam to bam
    samtools view -Sb Tumor.sam > Tumor.bam

4) Sort bam file
    samtools sort Tumor.bam Tumor_sort.bam

5) Picard re-sort as preparation step for marking duplicates
    java -Xmx4g -Djava.io.tmpdir=/tmp -jar SortSam.jar SO=coordinate INPUT=Tumor_sort.bam OUTPUT=Tumor_sort_pic1.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

6) Picard mark duplicates
    java -Xmx4g -Djava.io.tmpdir=/tmp/ -jar MarkDuplicates.jar INPUT=Tumor_sort_pic1.bam OUTPUT=Tumor_dedpuped.bam METRICS_FILE=Tumor.metric CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT

7) Samtools mpileup piped to bcftools to call variants in a VCF file
    samtools mpileup -C 0 -A -B -d v -u -f hg19_exome.fa Tumor_dedpuped.bam | bcftools call -O v -v -c -n p 1 -A -o Tumor.vcf

8) Change genotype values of less covered reads
    ./change_gt_values.pl Tumor.vcf > Tumor_cleaned.vcf

9) Convert to Annovar Step 1
    convert2annovar.pl -format vcf4 --includeinfo -coverage 20 -fraction 0.05 Tumor_cleaned.vcf > Tumor_cleaned.avinput

10) Convert to Annovar Step 2
    table_annovar.pl Tumor_cleaned.avinput annovar/humandb/ -buildver hg19 -out Tumor_cleaned -remove -protocol knowngene,refgene,cosmic70,ljb26_all,esp6500si_all,snp138,1000g2014oct_all -operation g,g,f,f,f,f,f

11) Compile dinucleotide variants
    ./collect_dinuc.pl Tumor_annovar.txt > Tumor_annovar2.txt

12) Detect somatic variants if there is a normal sample
    ./array_compare.cancer.pl Normal_annovar2.txt Tumor_annovar2.txt

13) Remove synonymous and intronic variants
    ./single_remove_synon.pl Tumor.only_cancer.txt > Tumor.only_cancer.nonsyn.txt

14) Remove 1000G and dbSNP SNPs with frequency over 1%
    ./remove_over.01.one.pl Tumor.only_cancer.nonsyn.txt > Tumor.only_cancer.nonsyn.rm1.txt
    ./remove_over.01.two.pl Tumor.only_cancer.nonsyn.rm1.txt > Tumor.only_cancer.nonsyn.rm2.txt

15) Keep deleterious mutations found by 6 predictors, or by 2 predictors and also in COSMIC
    ./get_deleterious.pl Tumor.only_cancer.nonsyn.rm2.txt > Tumor_final_delet_muts.txt
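The retention rule in step 15 can be illustrated as a small predicate. This is a sketch only and is not the logic of get_deleterious.pl itself: the function name keep_variant is hypothetical, "found by 6 predictors" is assumed to mean "at least 6", and the predictor count and COSMIC status are passed in directly because the column layout of the annotated file is not described here.

```shell
# keep_variant N_PREDICTORS IN_COSMIC   (hypothetical helper, not part of IMPACT)
#   N_PREDICTORS : number of predictors calling the variant deleterious
#   IN_COSMIC    : 1 if the variant is listed in COSMIC, 0 otherwise
# Succeeds when the variant would be kept under the assumed reading of step 15.
keep_variant() {
    n_pred=$1
    in_cosmic=$2
    [ "$n_pred" -ge 6 ] || { [ "$n_pred" -ge 2 ] && [ "$in_cosmic" -eq 1 ]; }
}

# Example use (commented out):
#   keep_variant 3 1 && echo "keep"
```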

Module 2: Copy Number Alteration Detection

To run:

    $ cd module2
    $ ./Module_2_bam_varscan.pl ../Tumor_R1.fastq.gz ../Tumor_R2.fastq.gz ../Normal_R1.fastq.gz ../Normal_R2.fastq.gz 1

This module requires a normal sample, unlike Modules 1, 3, and 4. The last parameter acts as a flag indicating whether Module 1 has finished. If Module 1 has finished, use 1. If Module 1 has not completed, use 0 so that the Module 1 processes needed by Module 2 are run first. The output from Module 2 is a list of CNAs.

Pseudocode and parameters

1) Run a different Samtools mpileup for this module
    samtools mpileup -f hg19_exome.fa Tumor_dedpuped.bam > Tumor.mpileup
    samtools mpileup -f hg19_exome.fa Normal_dedpuped.bam > Normal.mpileup

2) Adjust the samtools mpileup output for VarScan input
    ./find_delet.pl Tumor.mpileup > Tumor.fix.mpileup
    ./find_delet.pl Normal.mpileup > Normal.fix.mpileup

3) Run VarScan
    java -jar VarScan.v2.3.8.jar copynumber Normal.fix.mpileup Tumor.fix.mpileup Normal_vs_Tumor

4) Annotate copy number calls by gene
    ./make_call.pl Normal_vs_Tumor.copynumber Tumor.raw.gene.txt

5) Make directory for output and store CNA calls for each gene
    ./get_gene_coverage.pl Tumor.raw.gene.txt

6) Sort gene calls and get unique gene calls
    ./get_uniq.pl Tumor.raw.gene.txt

7) Get the coverage of each exon for each gene
    ./exon_coverage.pl Tumor.raw.gene.txt

8) Get unique exon coverage calls
    ./get_uniq_outfile.pl Tumor.raw.gene.txt

9) Calculate total coverage, amplification, and deletion
    ./calc_real.pl Tumor.raw.gene.txt > Tumor.finished_CNA.txt

10) Output of amplification and deletion calls from the CNA file
    Tumor.finished_CNA_amp_del.txt

Module 3: Drug Prediction

To run:

    $ cd module3
    $ ./Module_3_drug_linking.pl ../Tumor_R1.fastq.gz ../Tumor_R2.fastq.gz ../Normal_R1.fastq.gz ../Normal_R2.fastq.gz
    $ . PermRHypertest.sh
    $ PermRHypertest.sh hypertest.txt > p_values.txt
    $ ./Module_3_last_step.pl ../Tumor_R1.fastq.gz

Module 3 takes the deleterious variants found by Modules 1 and 2 and outputs actionable therapeutics. The output is separated into two levels. Level 1 reports actionable therapeutics from NCI Match Clinical Trials, MD Anderson Personalized Cancer Therapy, and DSigDB FDA-approved kinase inhibitors. Level 2 uses the DSigDB database to identify FDA-approved drugs and the genes they inhibit. A hypergeometric test is performed for each drug, comparing the number of gene targets hit to the total number of gene targets. Module 3 outputs a text file listing the drug predictions and an HTML file whose drug names link to DSigDB for further drug information.

Module 3 example output (Pre-Treatment):

LEVEL 1: Actionable Therapeutics

  NCI Match Clinical Trials
    Mutation      Actionable Therapeutic(s)
    BRAF V600E    Dabrafenib and Trametinib

  MD Anderson Personalized Cancer Therapy
    Mutation      Actionable Therapeutic(s)
    BRAF V600E    Dabrafenib,Dasatinib,Regorafenib,Sorafenib,Trametinib,Vemurafenib

  DsigDB FDA approved Kinase Inhibitors
    Mutation      Actionable Therapeutic(s)
    BRAF V600E    Sorafenib,Vemurafenib,Dabrafenib

LEVEL 2: Actionable Therapeutics from DsigDB Database

  Potential Gene Targets(91):
    ABCD1 ACTR10 ACTR3 ACTR5 ADCY8 ALDOC ANKS4B APP ARL5B BPNT1 BRAF BRPF1 C1QTNF8 CACNA1F CACNB1 CAD CAPN11 CCDC6 CFTR CHRNB3 CTBP1 CYR61 DCBLD2 DHRS1 DIAPH2 EFNA2 EGFL8,PPT2-EGFL8 ERO1L FAM208A FBLN2 FKBPL FOXO4 GJB1 GMPR2 GNL1 GPSM1 HMGB3 IGF2BP3 IMMP1L IQCA1 KIAA1199 KLHL5 KLK13 KRT222 LDLRAD3 LGI2 LRP1B MST1 MSTO1 MXRA5 NID2 NOS1 NOTCH4 NRP1 OR10H1 OR10R2 PCSK4 PEX5L PHKB POTEG POU6F1 PRKCQ PRKG2 PTGER2 PTPRE RBM47 RYR3 SAR1B SASH3 SCN1A SCUBE3 SGCB SLC17A2 SLC6A17 SMAD3 SPOCK2 TAF6L TENM1 TENM3 THBS4 TIMM8A TKTL1 TNS3 TNXB TRIM15 TRPC5 TSEN34 TTC28 VAV2 XPNPEP3

  Columns: Drug, Targets Hit, Potential Targets, P-value (hypergeometric test), P-value (Permutation test)

  Drugs listed:
    Vemurafenib
    Ibrutinib
    Epirubicin hydrochloride
    TOPOTECAN HYDROCHLORIDE
    thalidomide
    Sorafenib tosylate
    Doxorubicin Hydrochloride
    dexamethasone
    Trifluridine
    Regorafenib
    Vorinostat
    Vincristine sulfate
    Vinblastine sulfate
    Dasatinib
    Nilotinib
    Bosutinib

Module 3 example output in HTML (Pre-Treatment):

In the HTML output, each drug name links to the DSigDB website, where the molecule information and drug targets can be explored. For example, clicking Vemurafenib opens the Vemurafenib page in DSigDB.
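To make the hypergeometric test in Module 3 concrete, the sketch below computes an upper-tail hypergeometric p-value with awk. It is an illustration under stated assumptions, not the code in PermRHypertest.sh: the function name hyper_pvalue is hypothetical, the background size N is a parameter the manual does not specify, and the use of the upper tail P(X >= k) is an assumption about how "targets hit" is compared with "potential targets". The permutation test is not covered.

```shell
# hyper_pvalue N K n k   (hypothetical helper, not part of IMPACT)
#   N = number of genes in the background set (assumed; not given in the manual)
#   K = number of genes targeted by the drug
#   n = number of candidate genes from the sample
#   k = number of drug targets hit among the candidate genes
# Prints P(X >= k) for X ~ Hypergeometric(N, K, n).
hyper_pvalue() {
    awk -v N="$1" -v K="$2" -v n="$3" -v k="$4" '
    # Natural log of the binomial coefficient C(a, b), summed term by term.
    function lchoose(a, b,    i, s) {
        s = 0
        for (i = 1; i <= b; i++) s += log(a - b + i) - log(i)
        return s
    }
    BEGIN {
        hi = (K < n) ? K : n            # largest possible overlap
        lo = n - (N - K)                # smallest possible overlap
        if (lo < 0) lo = 0
        if (k < lo) k = lo
        p = 0
        for (x = k; x <= hi; x++)
            p += exp(lchoose(K, x) + lchoose(N - K, n - x) - lchoose(N, n))
        printf "%.6f\n", p
    }'
}

# Example use (commented out): a drug with 2 targets in a background of 5 genes,
# 2 candidate genes, at least 1 hit.
#   hyper_pvalue 5 2 2 1
```

Working in log space keeps the binomial coefficients from overflowing when the background set contains thousands of genes.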

Module 4: Tumor Heterogeneity Analysis

To run:

    $ cd module4
    $ ./Module_4_allele_freq.pl ../Tumor_R1.fastq.gz ../Tumor_R2.fastq.gz ../Normal_R1.fastq.gz ../Normal_R2.fastq.gz

This is a standalone program that can be run after Module 1 has finished. It extracts the allele frequency from a VCF file for the somatic variants present in the tumor sample. It needs two inputs: the VCF file of the tumor sample, and the coordinates of the tumor somatic variants (tab-separated: chr, Start_position, End_position), which are generated within the module. Allele frequency is calculated from the DP4 tag, which gives the number of reads supporting the reference and alternate alleles on each strand. The output is the variant information from the VCF, with the allele frequency appended as the last column.
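The DP4-based calculation can be sketched as follows. This is an illustration and not the code of Module_4_allele_freq.pl: the function name dp4_allele_freq is hypothetical, the frequency is assumed to be alternate-supporting reads divided by all four DP4 counts, and the restriction to somatic variant coordinates described above is omitted. In samtools/bcftools VCF output, DP4 lists reference-forward, reference-reverse, alternate-forward, and alternate-reverse read counts.

```shell
# dp4_allele_freq   (hypothetical helper, not part of IMPACT)
# Reads VCF records on stdin. For each record whose INFO column (column 8)
# carries a DP4 tag, prints the record followed by a tab and the
# alternate-allele frequency (alt_fwd + alt_rev) / (sum of all four counts).
dp4_allele_freq() {
    awk -F '\t' '
    /^#/ { next }                                  # skip VCF header lines
    {
        if (match($8, /(^|;)DP4=[0-9]+,[0-9]+,[0-9]+,[0-9]+/)) {
            tag = substr($8, RSTART, RLENGTH)
            sub(/^;?DP4=/, "", tag)                # leave only the four counts
            split(tag, d, ",")
            total = d[1] + d[2] + d[3] + d[4]
            af = (total > 0) ? (d[3] + d[4]) / total : 0
            printf "%s\t%.4f\n", $0, af
        }
    }'
}

# Example use (commented out):
#   dp4_allele_freq < Tumor.vcf > Tumor_with_af.txt
```

Records without a DP4 tag are dropped in this sketch; a real run would also filter by the somatic coordinate file.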

CONTACT DETAILS

  Jennifer Hintzsche
  Aik-Choon Tan