Oral Cleft Targeted Sequencing Project



Oral Cleft Group
January, 2013

Contents

I Quality Control
  1 Summary of Multi-Family vcf File, Jan. 11, 2013
  2 Analysis Group Quality Control (Proposed Protocol)
    2.1 vcftools
  3 Targeted Region Capture & Read Generation
  4 Sequence Alignment and Processing
  5 Sample QC
  6 Relationship QC
  7 De novo and inherited variant calling
  8 Generation of Multi-family SNP genotypes

II Descriptive Statistics
  9 SNPs and SNVs
  10 MAFs and Heterozygosity

III Gen. Epi.
  11 Polymorphisms (maf > 0.01)
    11.1 LD & Ethnicity
    11.2 Linkage & Association
      11.2.1 snpstats
      11.2.2 trio
  12 Rare Variants (maf < 0.01)
    12.1 Scan Statistic for Rare Variants in Trios
    12.2 de novo mutations

List of Tables
  1 Ethnicities
  2 genotype flags

List of Figures
  1 Missingness
  2 GQ histogram

Part I
Quality Control

1 Summary of Multi-Family vcf File, Jan. 11, 2013

In total, the vcf file contains 4,495 individuals and 175,189 markers. The target regions span 6.7 Mb, for a marker density of 1 marker per 38 bp. Of the 4,495 subjects in the vcf file, only 4,139 are also contained in the pedigree file. The breakdown of subjects in the vcf and pedigree files by ethnicity is given in Table 1. Figure 1 displays the missingness per subject and per site, and Figure 2 the genotypic quality (GQ) per marker, averaged across subjects.

2 Analysis Group Quality Control (Proposed Protocol)

We begin by defining the characteristics on which to filter and the criterion for exclusion on each. We divide the filters into three types (as does vcftools): genotype, subject and site filters. The following is an outline of our initial protocol.

1. Genotype Filters
   - Remove all non-PASS flagged genotypes (see Table 2)
       vcftools --remove-filtered-geno FLAGNAME
   - Remove genotypes with GQ < 40 or depth < 10
       vcftools --mingq 40 --mindp 10
2. Subject Filters
   - Remove subjects with high missingness
       vcftools --mind
   - Remove subjects with average coverage < 20
       vcftools --min-indv-meandp 20
3. Site Filters
   - Remove markers with missingness > 0.05
       vcftools --geno 0.95
   - Remove markers with mean depth < 20
       vcftools --min-meandp 20
   - Di-allelic variants only
       vcftools --vcf file1.vcf --min-alleles 2 --max-alleles 2

Quality, Filter and Info are not available in our vcf file. Should we remove markers with mean GQ < 90? Perhaps we should use the median instead. We have not implemented a site-wide filter on GQ as of Jan. 11. We should also investigate Mendelian inconsistencies and Hardy-Weinberg equilibrium.
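To make the thresholds in the protocol concrete, here is a minimal Python sketch that applies the three filter types to a toy genotype table. It is not how vcftools works internally. The record layout, the helper names (`call`, `mask_genotypes`, `keep_subjects`, `keep_sites`), the genotype, subject, site ordering, the use of all depths (including masked genotypes) in the mean-depth filters, and the subject missingness cutoff of 0.5 are all assumptions made for illustration. The protocol does not state a subject missingness cutoff, so it is left as a required argument.

```python
from statistics import mean

# Thresholds quoted in the protocol above.
MIN_GQ, MIN_DP = 40, 10          # genotype filters
MIN_SUBJECT_MEAN_DP = 20         # subject filter
MAX_SITE_MISSINGNESS = 0.05      # site filter (vcftools --geno 0.95)
MIN_SITE_MEAN_DP = 20            # site filter


def call(gt, gq, dp, flag="PASS"):
    """Build one toy genotype record (hypothetical layout, not a VCF parser)."""
    return {"gt": gt, "gq": gq, "dp": dp, "flag": flag}


def mask_genotypes(calls):
    """Genotype filters: set gt to None unless PASS, GQ >= 40 and DP >= 10."""
    masked = {}
    for subject, row in calls.items():
        masked[subject] = {}
        for site, c in row.items():
            ok = c["flag"] == "PASS" and c["gq"] >= MIN_GQ and c["dp"] >= MIN_DP
            masked[subject][site] = c if ok else {**c, "gt": None}
    return masked


def missing_fraction(cells):
    """Fraction of genotype records whose call is missing."""
    cells = list(cells)
    return sum(c["gt"] is None for c in cells) / len(cells)


def keep_subjects(calls, max_subject_missingness):
    """Subject filters; the missingness cutoff must be supplied by the caller."""
    return {
        s: row for s, row in calls.items()
        if missing_fraction(row.values()) <= max_subject_missingness
        and mean(c["dp"] for c in row.values()) >= MIN_SUBJECT_MEAN_DP
    }


def keep_sites(calls):
    """Site filters: missingness <= 0.05 and mean depth >= 20 over kept subjects."""
    if not calls:
        return {}
    rows = list(calls.values())
    good = [
        site for site in rows[0]
        if missing_fraction(r[site] for r in rows) <= MAX_SITE_MISSINGNESS
        and mean(r[site]["dp"] for r in rows) >= MIN_SITE_MEAN_DP
    ]
    return {s: {site: row[site] for site in good} for s, row in calls.items()}


# Toy data only; these are not project genotypes.
toy = {
    "s1": {"snp1": call("0/1", 99, 30), "snp2": call("0/0", 99, 25)},
    "s2": {"snp1": call("0/0", 95, 28), "snp2": call("0/1", 35, 40)},  # snp2 fails GQ
    "s3": {"snp1": call("1/1", 90, 8), "snp2": call("0/0", 99, 9)},    # low depth
}

masked = mask_genotypes(toy)
subjects_kept = keep_subjects(masked, max_subject_missingness=0.5)  # illustrative cutoff
final = keep_sites(subjects_kept)
print(sorted(final), sorted(next(iter(final.values()))))
```

In this toy table, subject s3 is dropped because both of its genotypes fall below the depth threshold, and snp2 is then dropped because half of the remaining subjects are missing at that site.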

2.1 vcftools

We implement the above filters in a single run of vcftools, using the following command.

    vcftools --gzvcf $vcf \
        --remove-filtered-geno NRC \
        --remove-filtered-geno SB1 \
        --remove-filtered-geno IRC \
        --remove-filtered-geno MMQSD50 \
        --remove-filtered-geno PB10 \
        --remove-filtered-geno MQD30 \
        --remove-filtered-geno DETP20 \
        --remove-filtered-geno MVC4 \
        --remove-filtered-geno HPMR5 \
        --remove-filtered-geno MVF5 \
        --remove-filtered-geno RLD25 \
        --mingq 40 \
        --mindp 10 \
        --mind \
        --min-indv-meandp 20 \
        --geno 0.95 \
        --min-meandp 20 \
        --min-alleles 2 \
        --max-alleles 2 \
        --recode \
        --out $out \
        1> $logfile 2> $errfile
    bgzip $out.recode.vcf
    tabix -p vcf $out.recode.vcf.gz

The run took a little over six hours of wallclock time (06:18:23) but very little memory.

    Job (qc) Complete
    User           = syounkin
    Queue          = gwas.q@compute-0-43.local
    Host           = compute-0-43.local
    Start Time     = 01/10/ :59:15
    End Time       = 01/11/ :17:38
    User Time      = 06:14:17
    System Time    = 00:00:55
    Wallclock Time = 06:18:23
    CPU            = 06:15:12
    Max vmem       = M
    Exit Status    = 0

I do not know the order in which the filters were processed, and the order could make a difference. I suppose running the output through the same filters a second time could alleviate some of those concerns. Although the results could still be order-dependent, the differences are likely to be negligible.
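The concern about filter order can be illustrated with a small Python sketch. It uses a toy missingness table rather than our data, and it says nothing about the order vcftools actually uses. The function names and the shared cutoff of 0.2 are hypothetical, chosen only so that the two orders disagree.

```python
def drop_subjects(missing, max_frac):
    """Drop subjects whose fraction of missing genotypes exceeds max_frac."""
    return {s: row for s, row in missing.items()
            if sum(row.values()) / max(len(row), 1) <= max_frac}


def drop_sites(missing, max_frac):
    """Drop sites whose fraction of missing genotypes exceeds max_frac."""
    rows = list(missing.values())
    if not rows:
        return {}
    keep = [site for site in rows[0]
            if sum(r[site] for r in rows) / len(rows) <= max_frac]
    return {s: {site: row[site] for site in keep} for s, row in missing.items()}


# True marks a missing genotype. Subject D is missing at two of four sites.
toy = {
    "A": {"snp1": False, "snp2": False, "snp3": False, "snp4": False},
    "B": {"snp1": False, "snp2": False, "snp3": False, "snp4": False},
    "C": {"snp1": False, "snp2": False, "snp3": False, "snp4": False},
    "D": {"snp1": True, "snp2": True, "snp3": False, "snp4": False},
}
CUTOFF = 0.2  # hypothetical cutoff used for both filters

# Order 1: subject filter first, then site filter.
subject_first = drop_sites(drop_subjects(toy, CUTOFF), CUTOFF)
# Order 2: site filter first, then subject filter.
site_first = drop_subjects(drop_sites(toy, CUTOFF), CUTOFF)

print(sorted(subject_first), sorted(subject_first["A"]))
print(sorted(site_first), sorted(site_first["A"]))
```

Filtering subjects first removes D and keeps all four sites, while filtering sites first removes snp1 and snp2 and keeps all four subjects. In this toy case each result is unchanged by a second pass, so a second run would not by itself reconcile the two orders. Whether this matters at the scale of our file would need to be checked on the real data.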

Ethnicity               Count
European                  968
Chinese                 1,371
Filipino                1,776
Guatemalan                 24
In VCF and Pedigree     4,139
In VCF                  4,495
In Pedigree             4,998

Table 1: Ethnicities

Flag      Description
NRC       Unable to grab readcounts for variant allele
SB1       Reads supporting the variant have less than 0.01 fraction of the reads on one strand, but reference supporting reads are not similarly biased
IRC       Unable to grab any sort of readcount for either the reference or the variant allele
MMQSD50   Difference in average mismatch quality sum between variant and reference supporting reads is greater than 50
PB10      Average position on read less than 0.10 or greater than 0.9 fraction of the read length
MQD30     Difference in average mapping quality sum between variant and reference supporting reads is greater than 30
DETP20    Average distance of the variant base to the effective 3' end is less than 0.20
MVC4      Less than 4 high quality reads support the variant
HPMR5     Variant is flanked by a homopolymer of the same base and of length greater than or equal to 5
MVF5      Variant allele frequency is less than 0.05
RLD25     Difference in average clipped read length between variant and reference supporting reads is greater than 25

Table 2: Flags for genotypes found in vcf file. All genotypes with any of these flags were removed with vcftools. (Presumably, to remove a genotype the call is set to missing.)

Figure 1: Missingness. Histograms of missing.snp and missing.subject (y-axis: Frequency).

Figure 2: GQ histogram. Histogram of mean GQ per marker (y-axis: Frequency; plot title: Cleft Targeted Sequencing).

3 Targeted Region Capture & Read Generation

Do we need to discuss the methods behind the physical targeting of the regions? I'm curious to know how we created the fragments for sequencing.

4 Sequence Alignment and Processing

Reads are aligned to the GRCh37-lite reference sequence with BWA v0.5.9, using quality trimming (-q 5) to remove low-quality bases at the ends of reads. Data from individual runs are merged, if necessary, with Picard v1.46. All reads are deduplicated using Picard MarkDuplicates.

5 Sample QC

Coverage across the target regions is evaluated using RefCov, and more than 70% of targets must reach an average coverage of 20X for a sample to pass QC. If genotypes from another platform are available, a genotyping concordance QC is performed by comparing genotypes called using Samtools to those from the outside platform. Any sample with an overall concordance below 90% is flagged. Columns from this QC report are listed below:

1. SNPs called: SNPs reported by Samtools.
2. With Genotype: The SNPs called are compared to the imported SNPs by position, so only the SNPs in common with the external data can be compared.
3. MetMinDepth: The SNP sites must have a minimum depth of coverage of 20X at that position. Anything with lower coverage is ignored in the concordance check.
4. Reference: How many SNP calls match the reference sequence (i.e., build 37).
5. RefMatch: How many of the SNP sites that match the reference sequence also match the external array data.
6. Variant: How many SNP calls differ from the reference sequence (i.e., build 37).
7. VarMatch: How many variant SNP sites match the external array data.

For both reference mismatches and variant calls, we also evaluate whether the discordant calls changed from heterozygous to homozygous or vice versa. Finally, the % concordance is calculated as: (RefMatch + VarMatch)/MetMinDepth.

6 Relationship QC

All offspring are required to have a significant relationship with their parents.
To evaluate this, BEAGLE's fastIBD command is used to calculate the identity by descent between children and their expected parents. This is done at the family level using both common and private SNPs within the target region. Variant sites are included in the calculation if they meet the following criteria: the site is in the target region, is variant in at least one individual, and has 20X coverage in all individuals. After fastIBD evaluation of these sites, the number of shared markers between each parent-child pair is calculated. If every marker is shared, then those two individuals share 50% of their genome (this is the maximum that fastIBD can detect, since it doesn't consider both haplotypes together when comparing individuals). If less than 40% of the target region is shared between parent and child in this way, the family is flagged as failing QC.

If a family fails the initial, family-level QC evaluation (i.e., one parent is not highly related to the child), then that entire family is subsequently evaluated as part of a pool containing all families failing the initial QC. For this cross-family IBD assessment, sites are selected as follows: the site is in the target region and is variant in any individual, and individual genotypes are set to missing if coverage in that individual is <20X. As with the initial QC, if less than 40% of the target region is shared between parent and child in this way, the family is flagged as failing the QC. If this cross-family QC identifies high IBD sharing between two ostensibly unrelated samples, manual checking is performed to confirm a sample swap.

7 De novo and inherited variant calling

De novo variants and inherited variants were called using polymutt 0.11 ( com/ernfrid/polymutt), with the calling restricted to chromosomes containing target regions and all other options set to their defaults. GLF files were generated for input to polymutt using samtools-0.1.7a-hybrid with BAQ applied, as in the following command:

    samtools-hybrid view -uh some.bam | samtools-hybrid calmd -Aur - refseq.fa 2> /dev/null | samtools-hybrid pileup - -g -r refseq.fa > output.glf

Polymutt has two modes of variant calling, one for standard calling and one for de novo mutation calling. The VCF files for both of these modes were merged into a single VCF for each family and filters were applied. We used bam-readcount v0.4 with a minimum base quality of 15 (-b 15) to generate metrics (for both de novo and germline variant calls) and marked sites as filtered based on the following requirements:

1. Minimum variant base frequency at the site of 5%
2. Percent of reads supporting the variant on the plus strand ≥ 1% and ≤ 99% (variants failing these criteria are filtered only if the reads supporting the reference do not show a similar bias)
3. Minimum variant base count of 4
4. Variant falls within the middle 90% of the aligned portion of the read
5. Maximum difference between the quality sum of mismatching bases in reads supporting the variant and reads supporting the reference of 50
6. Maximum mapping quality difference between reads supporting the variant and reads supporting the reference of 30
7. Maximum difference in aligned read length between reads supporting the variant base and reads supporting the reference base of 25
8. Minimum average distance to the effective 3' end of the read for variant supporting reads of 20% of the sequenced read length
9. Maximum length of a flanking homopolymer run of the variant base of 5

In addition to the above filters, a binomial test filter is applied to the de novo calls in order to remove likely false positive de novo calls. The input to the test is calculated by generating readcounts with base quality ≥ 15 and mapping quality ≥ 20 for all unaffected family members at a putative de novo mutation location. For each individual, reads are divided into those supporting the de novo allele and those supporting some other allele. Using an assumed error rate of 0.01, the probability that the reads came from a binomial distribution where p = 0.01 (the fraction of reads supporting the de novo allele) is calculated. If the resulting p-value is less than 10^-4 for any one unaffected family member, the de novo prediction is marked as failing the filter.

8 Generation of Multi-family SNP genotypes

Once all variant sites in all samples were predicted, sites were limited to the precise target space of the capture product, buffered by 500 bp on either side, and aggregated into a list of segregating sites for the cohort. Each segregating site was (re)genotyped in all samples using polymuttvcf-0.01 (polymutt pos). The resulting genotypes were added to sites missing from the original VCF for each sample in order to distinguish between missing data and homozygous reference calls. The resulting single-sample VCF files, containing genotypes for all segregating sites, were subsequently merged using joinx 1.6. All variant calls from this process are included in the final files.
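Two of the numeric rules in Part I are easy to state as code: the concordance formula from Sample QC (Section 5) and the binomial filter on de novo calls (Section 7). The Python sketch below is only an illustration under stated assumptions and is not the pipeline's actual implementation. In particular, the text does not say which tail of the binomial distribution is used, so the upper tail P(X >= k) is an assumption, as are the function names and the input layout.

```python
from math import comb


def concordance(ref_match, var_match, met_min_depth):
    """(RefMatch + VarMatch) / MetMinDepth, returned as a fraction in [0, 1]."""
    if met_min_depth <= 0:
        raise ValueError("no SNP sites met the minimum depth")
    return (ref_match + var_match) / met_min_depth


def is_flagged(ref_match, var_match, met_min_depth, threshold=0.90):
    """Flag a sample whose overall concordance is below 90%."""
    return concordance(ref_match, var_match, met_min_depth) < threshold


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p). Using the upper tail is an assumption."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def denovo_fails_filter(unaffected_counts, error_rate=0.01, alpha=1e-4):
    """Return True if any unaffected relative shows too many de novo-supporting reads.

    unaffected_counts: list of (reads_supporting_de_novo_allele, total_reads),
    one pair per unaffected family member (hypothetical input layout).
    """
    return any(
        n > 0 and binom_upper_tail(k, n, error_rate) < alpha
        for k, n in unaffected_counts
    )


# Made-up counts for illustration only.
print(concordance(900, 50, 1000))
print(denovo_fails_filter([(0, 40), (1, 50)]))
print(denovo_fails_filter([(0, 40), (8, 50)]))
```

With an error rate of 0.01, one supporting read out of 50 in an unaffected relative is consistent with sequencing error, while eight out of 50 is not, so only the second made-up example fails the filter.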

Part II
Descriptive Statistics

9 SNPs and SNVs

Here we can break down the distribution of polymorphic markers (SNPs) vs. non-polymorphic markers (SNVs).

10 MAFs and Heterozygosity

Here we can display the distribution of minor allele frequencies and heterozygosity across ethnicities.
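As a starting point for these summaries, the following Python sketch computes the minor allele frequency and the observed and expected heterozygosity for one marker within each ethnicity. It assumes genotypes are coded as alternate-allele counts (0, 1, 2) with None for missing, and that expected heterozygosity is the Hardy-Weinberg value 2pq. The function name and the genotype values are made up for illustration and are not project data.

```python
def allele_summary(genotypes):
    """Summarise one di-allelic marker from genotypes coded 0/1/2 (None = missing)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None  # nothing to summarise if every genotype is missing
    n = len(called)
    alt_freq = sum(called) / (2 * n)          # frequency of the alternate allele
    return {
        "n_called": n,
        "maf": min(alt_freq, 1 - alt_freq),   # minor allele frequency
        "obs_het": sum(g == 1 for g in called) / n,
        "exp_het": 2 * alt_freq * (1 - alt_freq),  # Hardy-Weinberg expectation
    }


# One marker, grouped by ethnicity (toy values).
by_ethnicity = {
    "European": [0, 1, 1, 2, None, 0],
    "Chinese": [0, 0, 0, 1, 0, 0],
}
summaries = {eth: allele_summary(g) for eth, g in by_ethnicity.items()}
for eth, s in summaries.items():
    print(eth, s)
```

Running the same function over every marker and every ethnicity would give the per-group distributions that this section proposes to display.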

Part III
Gen. Epi.

We begin with the common variants. Note that we refer to variants with estimated population allele frequency greater than 0.01 as polymorphisms.

11 Polymorphisms (maf > 0.01)

11.1 LD & Ethnicity

Here we can investigate LD structure and how it differs across ethnicities. Likewise with heterozygosity.

11.2 Linkage & Association

We begin with four different formulations of the classic Transmission Disequilibrium Test. The TDT tests for an increase in the transmission rate from parent to offspring. If that rate is significantly greater than the expected rate under Mendelian inheritance of 1/2, then we conclude that transmission is preferred over non-transmission among the affected offspring, and that the genetic marker is near the causal locus.

11.2.1 Clayton's regression-based test

Allelic or genotypic. R Package: snpStats

Given large-scale SNP data for families comprising both parents and one or more affected offspring, this function computes 1 df tests (the TDT test) and a 2 df test based on observed and expected transmissions of genotypes. Tests based on imputation rules can also be carried out.

11.2.2 Holger's gTDT and standard TDT

Allelic or genotypic. R Package: trio

12 Rare Variants (maf < 0.01)

12.1 Scan Statistic for Rare Variants in Trios

We have received and compiled the C++ source code for scan-trios from Dr. Iuliana Ionita-Laza at the Columbia University Dept. of Biostatistics. Now we need to make the input files. The required input files are: pedigree, map, regions and weights. The pedigree file is slightly atypical, as the rows must be ordered to indicate trio structure. The program does not use the information present in the pedigree file to order the data, so the ordering must be done manually. I do not know what the pedigree files that we have made using vcftools and plink look like.

12.2 de novo mutations
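Returning to the TDT described in Section 11.2, the classic allelic form of the test can be sketched in a few lines of Python. This is the textbook McNemar-type statistic, not the snpStats or trio implementation, and the function and argument names are hypothetical. The chi-square p-value is two-sided, whereas the text above describes a test for an increase in transmission, so the direction has to be read from the counts.

```python
from math import erfc, sqrt


def allelic_tdt(transmitted, untransmitted):
    """Classic allelic TDT from heterozygous parents of affected offspring.

    transmitted:   times the allele of interest was passed to the affected child
    untransmitted: times it was not passed
    Returns (chi-square statistic with 1 df, two-sided p-value).
    """
    b, c = transmitted, untransmitted
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Made-up counts: 60 transmissions vs. 40 non-transmissions.
chi2, p = allelic_tdt(60, 40)
print(chi2, p, 60 / (60 + 40))
```

Under Mendelian inheritance the transmission rate b / (b + c) should be close to 1/2, so the statistic grows as the observed rate moves away from that value in either direction.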
