John Hammond Targeted genomic enrichment and SMRT sequencing of immune-related gene complexes

1 John Hammond Targeted genomic enrichment and SMRT sequencing of immune-related gene complexes

2 Innate immune gene variation: germ-line encoded NK cell receptors and MHC class I. This arm of the immune system is critical in controlling and resolving viral infection. Diverse NK cell receptor systems are rapidly evolving under intense selection pressure from rapidly evolving pathogens. Haplotype variation, high polymorphism and variegated expression create NK cell subsets with different specificities and functions. [Figure labels: LRC, NKC, MHC, CD8 T cells.] Genetically defined and inbred animals are key to dissecting genomic function.

3 Highly repetitive regions are difficult to sequence with short-read technology. The reference genome is not well resolved in these regions. SNP coverage is poor in hard-to-assemble repetitive regions. The reference genome presents only one haplotype of many.

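4 Human MHC class I is highly diverse but haplotypes do not vary in gene content. [Table: numbers of alleles, proteins and nulls for genes A, B, C, E, F and G. Surviving values: alleles 4,200 (A), 5,091 (B); proteins 2,923 (A), 3,664 (B); the remaining values are truncated.] There is a greater degree of structural variation in cattle.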

5 Sequence identity comparison of two cattle class I genomic haplotypes

6 The current HD SNP chip does not interrogate MHC variation. [Figure label: MHC class II.]

7 The cattle KIR complex has expanded and demonstrates all the features of a functional immune complex. [Figure axis: identity.] [Table: key properties of KIR loci, compared between human KIR and cattle KIR: inhibitory and activating; activating genes disarmed; functionally variable haplotypes?; polymorphic; paired activating and inhibitory receptors.]

8

9 The cattle NKC is largely correct in the reference assembly, as determined by BAC clones and a new cattle reference assembly (Schwartz et al., Immunogenetics).

10 The cattle natural killer complex is missing SNP variation over the most diverse region: ~280 kb containing 8 SNPs and 17 genes. [Plot: distance between SNPs against SNP position in the genome.]
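The distance-between-SNPs measure on this slide can be illustrated with a minimal sketch. The coordinates below are hypothetical and are not taken from the cattle SNP chip; the function names are also illustrative only.

```python
def snp_gaps(positions):
    """Return the distances between consecutive SNP positions.

    Positions are sorted first, so the input order does not matter.
    """
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def largest_gap(positions):
    """Return the largest distance between neighbouring SNPs (0 if fewer than 2)."""
    gaps = snp_gaps(positions)
    return max(gaps) if gaps else 0


# Hypothetical SNP coordinates (bp) along a region; assumption for illustration only.
example_positions = [1_000, 41_000, 95_000, 120_000, 400_000, 410_000]

if __name__ == "__main__":
    print(snp_gaps(example_positions))
    print(largest_gap(example_positions))
```

A large gap in this list corresponds to a stretch of the region with no markers, which is the situation the slide describes for the most diverse part of the NKC.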

11 The identity between genes and gene blocks is too high to map short reads over the KLRC region. [Plot: % of reads different from UMD3.1 against location on BTA5; Illumina 250 bp PE reads.]

12 Enrichment and de novo assembly of immune-related gene complexes in cattle for SNP discovery. Cattle are arguably the most important livestock species: they provide humans with meat, milk, hides, traction, manure, status and security. Reducing the burden of disease can have an enormous positive impact on food security and welfare. Complex immune traits are phenotypically diverse, making breeding/selection processes challenging, but there are many opportunities!

13 Where are the immune genes known to be involved in bTB? (Prof Liz Glass, Roslin.)

14 Targeted enrichment of immune-related gene clusters with Roche NimbleGen probes. Used the Roche (NimbleGen) SeqCap EZ system. Library prep needed considerable optimisation. The average pull-down fragment was 5.5 kb.

15 Four rounds of probe design and optimisation
- Illumina set: first design to use masking as a way to deal with multiple variant targets for a single genomic region, thus reducing over-capture of non-variant subregions.
- NG1-Pilot set (~9 Mb, with 50 known matches): update included an increased number of target regions and an increased number of variant inputs per region. Match level stretched to 50 for coverage. 2 animals.
- NG2 (~5 Mb, with 50 known matches): similar strategy to _BTAU_TPI_NiGen2_EZ_HX3. Size of NKC target decreased, and MHCIIb, TPI, RP and IG regions removed. PacBio sequencing; 23 animals.
- NG3 (~5 Mb): main aim is to reduce off-target mapping of [...] %. Redesign of _BTAU_DH_TPI_EZ_HX3. Match levels reduced to 3, mapping targets against reference included, efficiencies estimated, probes in NKC region replicated 3x and MHC replicated 2x.

16 NG1 probe set: good enrichment but still much off-target capture. [Plot labels: NKC, MHC, custom chromosome.]

17 NG2 probe set: better enrichment but NKC dropped out. [Plot labels: NKC, MHC, custom chromosome.]

18 Probe performance NKC (NG2), chr5/nkc. [Bar chart; vertical axis from 0 to 120,000,000.] Blue = nucleotides binding to whole chromosome; green = nucleotides binding to target area.

19 Probe performance LRC (NG2), chr18/lrc. [Bar chart; vertical axis from 0 to 120,000,000.] Blue = nucleotides binding to whole chromosome; green = nucleotides binding to target area.

20 Probe performance MHC (NG2), chr23/mhc. [Bar chart; vertical axis from 0 to 250,000,000.] Blue = nucleotides binding to whole chromosome; green = nucleotides binding to target area.
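One way to summarise the blue and green bars in the probe performance charts is as a ratio. The sketch below assumes that the chromosome-wide count includes the target-area nucleotides; that assumption, the function name and the example numbers are illustrative and are not read from the charts.

```python
def on_target_fraction(target_nt, chromosome_nt):
    """Fraction of chromosome-bound nucleotides that fall in the target area.

    Assumes chromosome_nt already includes target_nt (an assumption, not
    something the slides state).
    """
    if chromosome_nt <= 0:
        raise ValueError("chromosome_nt must be positive")
    if not 0 <= target_nt <= chromosome_nt:
        raise ValueError("target_nt must lie between 0 and chromosome_nt")
    return target_nt / chromosome_nt


# Hypothetical counts for one animal; not values from the presentation.
example_fraction = on_target_fraction(30_000_000, 120_000_000)

if __name__ == "__main__":
    print(f"{example_fraction:.0%} of chromosome-bound nucleotides are on target")
```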

21 Probe design summary. On-target efficiency was sacrificed for overlapping probe coverage, which was not entirely necessary. Many off-target regions are unsupported by probe sequence: multimapping and polymorphism/variation. Masking does not adequately reduce the redundancy of the inputs, but it does allow probes with similar sequences to hybridize to similar haplotype sequences, resulting in greater depth of coverage.

22 Enrichment and de novo assembly. At least [...] subreads from 2 SMRT cells were combined for de novo assembly with Canu, using filtered subreads as input. Default parameters were used, with minReadLength > 1 kb. For the MHC, minReadLength > 3 kb improves assembly; for the LRC this does not improve the assemblies. The GFA file output by Canu was screened with Bandage for contigs that contain MHC or KIR genes/haplotypes, which were then extracted and mapped to known MHC/KIR haplotypes.
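The contig screening step was done interactively with Bandage. As a rough, scriptable complement, the sketch below lists contig lengths from a GFA file so that long contigs can be picked out before inspection. It assumes GFA version 1 segment lines (`S`, name, sequence, optional `LN:i:` tag); the contig names and the tiny example graph are hypothetical and are not Canu output.

```python
def gfa_segment_lengths(gfa_text):
    """Map each GFA1 segment (contig) name to its length in bases.

    Uses the sequence field when present; falls back to the LN:i: tag
    when the sequence is given as '*'.
    """
    lengths = {}
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3 or fields[0] != "S":
            continue  # skip header, link and other record types
        name, seq = fields[1], fields[2]
        if seq != "*":
            lengths[name] = len(seq)
            continue
        for tag in fields[3:]:
            if tag.startswith("LN:i:"):
                lengths[name] = int(tag[len("LN:i:"):])
                break
    return lengths


def contigs_at_least(lengths, min_len):
    """Return contig names with length >= min_len, longest first."""
    keep = [name for name, length in lengths.items() if length >= min_len]
    return sorted(keep, key=lambda name: -lengths[name])


# Hypothetical miniature graph: two segments and one link.
EXAMPLE_GFA = (
    "H\tVN:Z:1.0\n"
    "S\ttig00000001\tACGTACGT\n"
    "S\ttig00000002\t*\tLN:i:170000\n"
    "L\ttig00000001\t+\ttig00000002\t+\t0M\n"
)

if __name__ == "__main__":
    lengths = gfa_segment_lengths(EXAMPLE_GFA)
    print(contigs_at_least(lengths, 1_000))
```

Length alone does not identify MHC or KIR contigs; that still needs a sequence search or alignment against known genes, as described on the slide.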

23 A18 (gene 6) reconstructed haplotype. [Figure: three assemblies, each spanning NC1, Gene 6 (6*01301) and TRIM26. Animal 252: 2 contigs (170 kb, 53 kb). Second animal: 9 contigs. Third animal (label T): 7 contigs.]

24 A31 (gene 1+2) reconstructed haplotype. [Figure: two assemblies, each spanning NC1, Gene 1 (*02101), Gene 2 (*02201) and TRIM26, and each assembled in 8 contigs.]

25 Heterozygous A18/A31 (mixed reads from 252 and [...] as input for de novo assembly). Shared regions assemble into contigs that are more similar to the haplotype with more reads. Alleles for gene 6, gene 2 and gene 1 are identical to those from the previous FALCON assembly.

26 MHC class I full-length bovine haplotypes. [Figure: gene maps for haplotypes A14, ARS14, Angus, Brahman, A18 and A31, showing P3, NC1, the class I genes and TRIM26; interval sizes shown are 103 kb, 35 kb, 62 kb, 67 kb, 69 kb, 58 kb and 20 kb.]

27 Most likely haplotype based on alleles
Columns: Breed, ID, known haplotype, de novo haplotype, allele haplotype, alleles
Hereford Dominette? 02*07001; 05*07201
Friesian 252 A18 A18 A18 06*01301
Friesian A31 A31 A31 01*02101; 02*02201
Friesian A18 A18 A18 06*01301
Friesian Herman A14/? A14? 01*02301; 04*02401; 02*02501; no 06*
Friesian A14 A14 A14? 01*02301; 04*02401; 02*02501; no 06*
Friesian A31 A31 A31 01*02101; 02*02201
Hereford Domino? 02*06001; 06*04001; 05*07201
Friesian T A18 A18 A18 06*01301
Highland 8052? new? 01*03102; new02*?
Sahiwal 83H? new? new03*?
Friesian ? A14/A14 A14 01*02301; 04*02401; 02*02501; 06*04001
Friesian ? A14? new01*; 02*02501; 04*02401
Friesian ? new? 01*01901; 02*02501
Friesian ? A14/het? A14/? 01*02301; 04*02401; 02*02501; 06*04001
Friesian A10/A14 01*02301; 04*02401; 03*00201; new02*
Friesian A10/A14 01*02301; 04*02401; 02*02501; 03*00201;
Friesian dried706823? new? 01*02101; 04*02401; 02*02501;
Friesian ? new? new 01*; 02*00801; 04*02401
Friesian 159 A31 A31 01*02101; 02* bp differences to allele

28 KIR haplotype contains blocks A and B. First assembly: reads from 2 SMRT cells* (>940,000 subreads, >1 kb length). Second assembly: reads from 2 SMRT cells* (>625,000 subreads, >3 kb length). [Label: _NG1+Sequel.] *Includes one Sequel run.

29 KIR haplotype from 252: missing block B? 252 reads from 1 SMRT cell (>486,000 subreads, >1 kb length): 8 contigs. [Label: 252_NG1.] 252 reads from 4 SMRT cells* (>863,000 subreads, >3 kb length): 8 contigs, longest 118 kb. [Label: one contig.] *Includes one Sequel run.

30 De novo assembly of KIR from two other A18 animals. First animal (label T): reads from 2 SMRT cells (>776,000 subreads, >1 kb length). [Label: T_NG2+2rep.] Second animal: reads from 2 SMRT cells (>606,000 subreads, >1 kb length). [Label: _NG2+2rep.] Also missing block B?

31 SNP selection using de novo assembled haplotypes
- Illumina reads from 125 Holstein bulls mapped to immune-related gene cluster haplotypes (BWA)
- NKC, LRC, MHC
- SNPs called with x variant caller
- Filtered SNPs: QUAL > 900, strictly biallelic (no INDELs)
o Called for all individuals; alternative allele frequencies between 5% and 95% (only NKC)
- Selected SNPs 10-15 kb apart across the region and based on representation of mapping data
- Checked the 50 bp flanking region for repeats within the haplotype
o Transferability to other haplotypes (according to SNP coordinate) checked
o SNPs checked for haplotype specificity, and gene specificity (MHC only)
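The filtering and spacing criteria on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline used in the work: the record layout (`pos`, `qual`, `ref`, `alts`, `alt_af`) is hypothetical, the allele frequency filter is applied to every record although the slide applies it to the NKC only, and the 10-15 kb spacing is simplified to a greedy minimum spacing of 10 kb. The flanking-repeat, transferability and specificity checks are not covered.

```python
def passes_filters(snp, min_qual=900, min_af=0.05, max_af=0.95):
    """Keep high-quality, strictly biallelic SNPs (no INDELs) within the AF range."""
    return (
        snp["qual"] > min_qual
        and len(snp["ref"]) == 1          # single-base reference allele
        and len(snp["alts"]) == 1         # exactly one alternative allele
        and len(snp["alts"][0]) == 1      # single-base alternative, so not an INDEL
        and min_af <= snp["alt_af"] <= max_af
    )


def thin_by_spacing(snps, min_spacing=10_000):
    """Greedily pick SNPs so that neighbours are at least min_spacing bp apart."""
    selected = []
    last_pos = None
    for snp in sorted(snps, key=lambda s: s["pos"]):
        if last_pos is None or snp["pos"] - last_pos >= min_spacing:
            selected.append(snp)
            last_pos = snp["pos"]
    return selected


# Hypothetical candidate calls on one haplotype; values are for illustration only.
candidates = [
    {"pos": 5_000, "qual": 1200, "ref": "A", "alts": ["G"], "alt_af": 0.30},
    {"pos": 9_000, "qual": 950, "ref": "C", "alts": ["T"], "alt_af": 0.50},
    {"pos": 16_000, "qual": 850, "ref": "G", "alts": ["A"], "alt_af": 0.40},
    {"pos": 18_000, "qual": 1500, "ref": "G", "alts": ["A", "T"], "alt_af": 0.40},
    {"pos": 21_000, "qual": 1000, "ref": "T", "alts": ["TA"], "alt_af": 0.40},
    {"pos": 30_000, "qual": 990, "ref": "C", "alts": ["G"], "alt_af": 0.02},
    {"pos": 32_000, "qual": 2000, "ref": "A", "alts": ["C"], "alt_af": 0.60},
]

selected = thin_by_spacing([s for s in candidates if passes_filters(s)])

if __name__ == "__main__":
    print([s["pos"] for s in selected])
```

A greedy pass keeps the first passing SNP and then the next one far enough away, which favours even coverage over picking the highest-quality marker in each window; either choice could be reasonable depending on the panel design.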

32 A new SNP panel over 3 different gene complexes is being used to increase the power of GWAS for complex disease traits. [Figure labels: the new cattle LRC SNP panel; the single SNP on the current Illumina SNP chip.]

33 MHC haplotype SNP selection. [Figure labels: UMD3.1, pink SNPs, A11, blue SNPs, A14, orange SNPs, A18, A31.]

34 NKC haplotype SNP selection. [Figure: ~280 kb region with genes KLRA, MAGOHB, KLRC1-3 and KLRJ.]

35 First round of SNP selection successful (~70% success). Established segregating markers in a cohort of 1,500 extreme-phenotype bTB-resistant cattle; GWAS is currently under way.

36 Acknowledgments. Immunogenetics: Nick Sanderson, Alasdair Allan, Mark Gibson, John Schwartz, Rebecca Philp, Clare Grant, Karen Billington, Juan Medrano, Liz Glass, John Young, Richard Borne, Doro Harrison, Elizabeth Morecroft, Kevan Hanson, William Mwangi, Giuseppe Maccari, Derek Bickhart, Timothy Smith, William Thompson, Paul Norman, Libby Guethlein, Peter Parham, Farbod Babrzadeh, Denise Raterman, Cynthia Moehlenkamp, George Mayhew. BB/M027155/1, BB/J006211/1 GCRF Databases and Resources