Targeted PacBio sequencing of wild zebrafish immune gene families. Jaanus Suurväli University of Cologne Institute for Genetics

1 Targeted PacBio sequencing of wild zebrafish immune gene families Jaanus Suurväli University of Cologne Institute for Genetics Leiden, 12. June 2018

2 Cyprinidae
~3000 species of cyprinids, ~9-10% of all fish species

3 10 most harvested fish in the world 11: Source: Wikipedia

4 10 most harvested fish in the world
6 of the 10 most harvested fish belong to the family Cyprinidae!
11: Source: Wikipedia

5 Zebrafish
Cyprinid fish, genus Danio
Model vertebrate for research
Also popular as an aquarium pet
The common lab strains are inbred and often of unclear origin
Reference genome: 1.4 Gb
Average difference from the reference: 0.5%

6 Zebrafish immune genes
Many unknowns even after decades of research
Adaptive immune system:
  MHC I and II unlinked (as in all teleosts)
  MHC loci scattered across the genome
Innate immune system:
  B30.2 domain attached to TRIMs and NLRs, both multiplied to hundreds of copies
  Hundreds of small Ig-based receptors
Figure: Closest neighbour graph of the NLRs. Different colors mark different NLR subtypes. Howe K, Schiffer P et al (2016). Open Biology 6:

7 What is an NLR?
NOD-Like Receptor
NACHT and Leucine-rich Repeats
Nucleotide-binding domain and Leucine-rich Repeats (used mostly in plants)
Fish NLRs: B30.2, FISNA (220 bp) - NACHT (500 bp) - helixes (1100 bp)
Figure adapted from the Invivogen website,

8 Hundreds of NLRs in fish
Tørresen, OK et al (2018) BMC Genomics 19:
Note: this table might still be an underestimate. NLR coding exons can be up to 100% identical to each other, so short-read approaches cannot reliably distinguish them. PacBio or Nanopore sequencing is required here, either with WGS or target enrichment.

9 Methods
The following protocol is based on Witek & Jones (2016). SMRT RenSeq protocol. Protocol Exchange doi: /protex
Overview:
  Extract genomic DNA
  Perform target enrichment and multiplex
  Sequence on the PacBio Sequel
Bait design:
  Custom 120 bp biotinylated baits with 2x coverage for all targeted regions
  Bait specificity was first tested in silico; nonspecific baits were excluded
Targets:
  400 exons of ~2 kb: FISNA-NACHT-helixes
  600 exons of ~0.6 kb: B30.2
  All exons of the Class I and II MHC genes, TLRs, IFNs and selected other genes*
Final baitset: ~19,500 baits (~16k of these unique)
Workflow:
  DNA extraction
  Covaris shearing
  DNA repair
  Add amplification adapters and barcodes (NEBNext Ultra II)
  Enrichment with hybridization baits (Arbor Biosciences)
  Amplify the library to get 2-5 ug for Template Prep
  5 kb SMRTbell Template Prep
  PacBio Sequel
* All paralogs of: IRGs, GBPs, MX, NFKb, TLRs, RLRs, IFNs, PTGS. In addition, IL-1b, TNFa, DHX9, CTCF, IRF3, IRF5, IRF7, and a few others
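The bait design above (120 bp baits, 2x coverage) can be illustrated with a minimal tiling sketch. It assumes that "2x coverage" means baits are placed every 60 bp so that interior positions are covered by two baits, and that one extra bait is anchored at the end of each target. The function names and the idealized exon lengths (exactly 2000 bp and 600 bp) are assumptions made for illustration, not the layout used for the actual baitset.

```python
def tile_bait_starts(target_len, bait_len=120, step=60):
    """Return 0-based start positions of baits tiled across one target.

    Assumption: a step of bait_len / 2 is what gives 2x bait coverage.
    """
    if target_len <= bait_len:
        return [0]  # a single bait covers (or overhangs) a short target
    last_start = target_len - bait_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # anchor a final bait at the target end
    return starts


def min_interior_depth(target_len, bait_len=120, step=60):
    """Lowest bait depth over positions at least `step` bp from either end.

    Only meaningful for targets longer than 2 * step.
    """
    depth = [0] * target_len
    for start in tile_bait_starts(target_len, bait_len, step):
        for pos in range(start, min(start + bait_len, target_len)):
            depth[pos] += 1
    return min(depth[step:target_len - step])


nlr_baits = len(tile_bait_starts(2000))   # idealized ~2 kb FISNA-NACHT-helixes exon
b302_baits = len(tile_bait_starts(600))   # idealized ~0.6 kb B30.2 exon
print(nlr_baits, b302_baits)                  # 33 9
print(400 * nlr_baits + 600 * b302_baits)     # 18600
```

Under these assumptions the two exon classes alone would need about 18,600 baits, which is the same order of magnitude as the ~19,500 baits in the final baitset.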

10 Zebrafish samples
The pie charts show population substructure of wild zebrafish, calculated from RADseq data with the R package LEA. The populations in the red circle are targeted for PacBio.
Collection of CHT samples has been previously described in: Whiteley et al (2011). Molecular Ecology 20:
Coalescent tree built from ~4500 independent loci using SNAPP

11 Enrichment efficiency
4 fish per SMRTcell
> 95% of the reads successfully demultiplexed with lima
~60% of the data is on target
1.2 Gb data on average per fish (2.3 Gb with LR)
60% on target = 0.7 Gb per fish (1.4 Gb with LR)
Zebrafish is diploid: 0.35 Gb per haploid genome (0.7 Gb with LR)
1200 targets, assuming an average length of 5 kb = 6 Mb of target sequence
Coverage: ~60x per haploid genome (~120x with LR)
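The coverage estimate on this slide is a short chain of multiplications and divisions, reproduced below as a sketch. The function name and parameter names are made up for illustration; the input values are the ones quoted on the slide.

```python
def per_haplotype_coverage(yield_gb, on_target_fraction, n_targets,
                           mean_target_bp, ploidy=2):
    """Estimate mean coverage per haploid genome copy for one fish."""
    on_target_bp = yield_gb * 1e9 * on_target_fraction
    target_space_bp = n_targets * mean_target_bp  # 1200 x 5 kb = 6 Mb
    return on_target_bp / ploidy / target_space_bp


standard = per_haplotype_coverage(1.2, 0.60, 1200, 5000)
with_lr = per_haplotype_coverage(2.3, 0.60, 1200, 5000)
print(round(standard), round(with_lr))  # 60 115
```

The unrounded LR figure comes out at about 115x; the slide's ~120x follows from rounding the intermediate values (1.4 Gb on target, 0.7 Gb per haploid genome).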

12 Methods 2 (data analysis)
Lima (demultiplexing)
CCS (getting the circular consensus)
BLASR (mapping the subreads)
Canu (de novo assembly)
Arrow (polishing of the assembly, variant calling)
Mapping/aligning the assembled reads: blastn, minimap2
Predicting protein domains: EMBOSS transeq, followed by hmmer3
Multiple alignments: clustalo, mafft
Trees: MEGA, RAxML

13 Genetic variation
Many de novo assembled contigs have > 99.5% identity to the reference. This is what would be expected from zebrafish, given the average difference of 0.5% from the reference.
Getting information on haplotypes and heterozygous SNPs is a work in progress.
CCS reads with >= 20 passes from a single wild fish were mapped to the reference with BLASR. The targeted-phasing-consensus approach described in the PacBio wiki was used to separate the haplotypes.
Figure: the output for one of the NLRs on Chromosome 4, visualized in IGV (many others look similar).
Many NLR-aligning reads get a very high MAPQ, yet still look like the above.

14 4 haplotypes were mapped to the same gene.
Zebrafish is diploid, so this is not biologically possible for a single locus.
Previously undescribed NLR copies, possibly from recent duplications?
Identities: 99%, 97%, 94%, 96%
Looking at the data, some genomic NLRs have a mapping coverage of up to 700x. Others have no primary alignments from the data at all.
An indication of strain-specific copy number variation?
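One rough way to turn the coverage observation into a copy-number hint is to divide each target's mapped depth by the depth expected for a single haploid copy (~60x according to the enrichment estimate). The sketch below makes strong assumptions: that enrichment is uniform across targets, that the depths are measured on the same kind of reads as the 60x estimate, and that PCR duplicates have been removed. The target names, depths, and the tolerance threshold are hypothetical.

```python
def haploid_copy_estimate(depth, haploid_depth=60.0):
    """Mapped depth expressed in units of one haploid gene copy."""
    return depth / haploid_depth


def classify(depth, haploid_depth=60.0, ploidy=2, tolerance=0.5):
    """Label a target relative to the depth expected for one diploid gene."""
    if depth == 0:
        return "absent"  # no primary alignments at all
    copies = haploid_copy_estimate(depth, haploid_depth)
    if copies > ploidy * (1 + tolerance):
        return "excess"   # more reads than one diploid locus should give
    if copies < ploidy * (1 - tolerance):
        return "reduced"
    return "expected"


# Hypothetical per-target depths, for illustration only.
example_depths = {"nlr_a": 700.0, "nlr_b": 120.0, "nlr_c": 0.0}
for name, depth in example_depths.items():
    print(name, round(haploid_copy_estimate(depth), 1), classify(depth))
```

With these made-up numbers, a 700x target corresponds to roughly 12 haploid copies collapsing onto one reference gene, which is the kind of signal that would be compatible with recent duplications or strain-specific copy number variation.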

15 MHC haplotypes
AB strain, TU strain, CG strain
McConnell et al (2016). PNAS 113(34): E5014-E5023.
Four Class I U MHC sequences were assembled from wild fish KG35. The closest matches in NCBI databases are shown with % identities:
UBA 83%, UKA 77%
UKA 87%, UBA 83%
UEA 92%, UIA 87%, UDA 82%
UIA 85%, UEA 81%, UDA 74%

16 Non-reference NLRs in lab strains 98%

17 Conclusions
We have established a pipeline for targeted sequencing of zebrafish immune genes
We can see variation from SNPs to new genes
There are three types of possible new genes in our data:
  Genes that clearly differ from anything in the reference genome
  Cases of multiple genes mapping to a single gene in the reference with high confidence (recent duplicates)
  MHC haplotypes, which in zebrafish can sometimes mean distinct sets of genes

18 Plans and perspectives
Sequence the (partial) immune repertoire of a total of 96 zebrafish
Build a new reference for mapping the reads
Get rid of PCR duplicates in the data
Call all variation, including heterozygous variants
Phase the data into haplotype blocks and use it for population genetics

19 Acknowledgements
University of Cologne, Cologne, Germany: Maria Leptin, Thomas Wiehe, Katja Palitzsch
Max Planck Genome Centre, Cologne, Germany: Bruno Hüttel
University of Montana, Missoula, MT, USA: Andrew Whiteley
University College London, London, UK: Philipp Schiffer
The Sainsbury Laboratory, Norwich, UK: Jonathan Jones, Kamil Witek, Oliver Furzer