HLA and Next Generation Sequencing: it's all about the Data

John Ord, NHSBT Colindale and University of Cambridge
BSHI Annual Conference, Manchester, September 2014

Introduction
In 2003 the first full public version of the human genome sequence was announced.
It took 13 years to complete, at a cost of $3 billion.
The sequencing was all done using Sanger methodology.
Today we can do the same job overnight using Next Generation Sequencing (NGS) technology, at a cost of about $3000.
NGS can achieve this because it sequences hundreds of thousands of separate DNA templates in parallel and at high speed.
However, one NGS run generates vast amounts of data, which require a range of powerful bioinformatics tools to analyse.

Next Generation Sequencing for HLA Genes
Brief overview of NGS
How NGS can improve HLA typing
Data analysis for NGS HLA sequences

Next Generation Sequencing By Clonal Amplification
[Figure labels: Genomic DNA, PCR, Amplicons, Single Strand Binding, Substrate, Clonal Amplification, Strong Sequencing Signal]

Paired End Sequencing
Extending the data from fixed length sequencing runs
[Figure: a 300 bp DNA fragment read from both ends. Sequence 1 (250 bp) starts at Seq Primer 1 and Sequence 2 (250 bp) starts at Seq Primer 2. Combined paired sequence: 300 bp]
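The arithmetic behind the combined sequence can be sketched in a few lines of Python. This is only an illustration and not the method used by any particular sequencer or software package: the function names `reverse_complement`, `expected_overlap` and `merge_pair`, the `min_overlap` parameter and the toy fragment are all hypothetical, and the toy reads are far shorter than the 250 bp reads on the slide. It assumes error-free reads that overlap exactly.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def expected_overlap(read_len, fragment_len):
    """Bases sequenced twice when two reads of read_len cover one fragment."""
    return max(0, 2 * read_len - fragment_len)


def merge_pair(read1, read2, min_overlap=5):
    """Merge a read pair into one sequence, or return None if no overlap is found.

    read2 is given as it comes off the sequencer, i.e. read from the
    opposite strand, so it is reverse-complemented before merging.
    Assumes error-free reads (exact match in the overlap).
    """
    rc2 = reverse_complement(read2)
    # Try the longest possible overlap first.
    for k in range(min(len(read1), len(rc2)), min_overlap - 1, -1):
        if read1[-k:] == rc2[:k]:
            return read1 + rc2[k:]
    return None


# Slide example: two 250 bp reads on a 300 bp fragment share 200 bp.
overlap_on_slide = expected_overlap(250, 300)

# Toy example (hypothetical sequence): 25-base reads from each end of a short fragment.
fragment = "ACGTTGCATGCCTAGGATCCGATTACAGGC"
read1 = fragment[:25]
read2 = reverse_complement(fragment[-25:])
merged = merge_pair(read1, read2)
```

Real read-merging tools also have to tolerate sequencing errors in the overlap, which this sketch does not attempt.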

HLA typing standards
The gold standard for HLA is typing to a single allele level, and this is recommended for unrelated haematopoietic stem cell transplantation.
Typing Stem Cell Registry donors to allele level precludes the need for extended re-typing and can speed up the matching and donor selection process.
Sanger sequencing (SBT) can achieve allelic level typing in some circumstances, but often requires further steps using Group Specific Primers because of the problem of assigning heterozygous base combinations (phase ambiguity or the cis/trans problem).
SBT using targeted exons of HLA genes can give rise to ambiguities where differences between alleles lie outside the regions sequenced.
Can HLA typing based on Next Generation Sequencing give us accurate and reliable one-step allele level HLA typing?

Phasing Ambiguity (the Cis/Trans Problem)
[Figure: heterozygous positions compared under Sanger sequencing, where the phase of the bases is unresolved (????), and under NGS systems, where the bases on each strand are read together]
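A toy Python sketch may help show why reads from single templates resolve the cis/trans problem. It is a simplified illustration, not the algorithm used by any of the software discussed in this talk: the function name `phase_two_sites`, the read layout and the example reads are hypothetical. Each read comes from one DNA template, so a read that spans two heterozygous positions reports which bases sit together on the same chromosome. A Sanger trace of the same sample would only show C/G at the first position and T/A at the second, without saying which pairs go together.

```python
from collections import Counter


def phase_two_sites(reads, pos_a, pos_b):
    """Count the base pairs seen together at two heterozygous positions.

    reads is a list of (start, sequence) tuples, where start is the
    0-based position of the first base of the read on the reference.
    Only reads that span both positions contribute.
    """
    pairs = Counter()
    for start, seq in reads:
        end = start + len(seq)
        if start <= pos_a < end and start <= pos_b < end:
            pairs[(seq[pos_a - start], seq[pos_b - start])] += 1
    return pairs


# Hypothetical reads from a sample that is heterozygous at positions 1 and 3.
reads = [
    (0, "ACGTA"),  # spans both sites: C at 1, T at 3
    (0, "AGGAA"),  # spans both sites: G at 1, A at 3
    (1, "CGTAC"),  # spans both sites: C at 1, T at 3
    (1, "GGAAC"),  # spans both sites: G at 1, A at 3
    (2, "GTACC"),  # starts after position 1, so it carries no phase information
]

pairs = phase_two_sites(reads, pos_a=1, pos_b=3)
# Only C-T and G-A are observed, so the two haplotypes are C...T and G...A.
```

In practice the heterozygous positions may be further apart than one read, and phase then has to be chained across overlapping reads or read pairs.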

Allele Ambiguity
Where differences between alleles lie outside the region sequenced, you will get allele ambiguity.
For example, the difference between C*03:20N and C*03:03:01 is a single base change (C>T) in exon 1. This generates a premature stop codon, which results in non-expression of the protein (null allele).
gDNA               20         30         40         50
C*03:03:01 TGGCGCCCCG AACCCTCATC CTGCTGCTCT CGGGAGCCCT
C*03:20N   --------T- ---------- ---------- ----------
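The alignment above can be read programmatically. The Python sketch below expands the dash notation (a dash means "same as the reference allele") and lists the differences. It is an illustration only: the function names are hypothetical, and it assumes that the ruler numbers mark the last base of each 10-base block, so the first base shown is position 11, and that the reading frame starts at position 1.

```python
def expand_alignment(reference, aligned):
    """Expand dash notation, where '-' means identical to the reference."""
    ref = reference.replace(" ", "")
    aln = aligned.replace(" ", "")
    if len(ref) != len(aln):
        raise ValueError("aligned sequences must have the same length")
    return "".join(r if a == "-" else a for r, a in zip(ref, aln))


def differences(reference, aligned, first_position):
    """Return (position, reference_base, variant_base) for each mismatch."""
    ref = reference.replace(" ", "")
    full = expand_alignment(reference, aligned)
    return [
        (first_position + i, r, v)
        for i, (r, v) in enumerate(zip(ref, full))
        if r != v
    ]


C_03_03_01 = "TGGCGCCCCG AACCCTCATC CTGCTGCTCT CGGGAGCCCT"
C_03_20N = "--------T- ---------- ---------- ----------"

# Assumption: the first base shown is position 11 (see the lead-in).
FIRST_POSITION = 11
diffs = differences(C_03_03_01, C_03_20N, FIRST_POSITION)

# Extract the codon containing the change, assuming the frame starts at position 1.
position = diffs[0][0]
codon_start = position - (position - 1) % 3      # first base of the codon
offset = codon_start - FIRST_POSITION             # index into the displayed sequence
ref_codon = C_03_03_01.replace(" ", "")[offset:offset + 3]
var_codon = expand_alignment(C_03_03_01, C_03_20N)[offset:offset + 3]
is_stop = var_codon in {"TAA", "TAG", "TGA"}
```

Under these assumptions the single difference falls in a CGA codon that becomes TGA, which is a stop codon, consistent with the null allele described on the slide.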

NGS Target Preparation
[Figure: gene map (5' UTR, Ex 1, Ex 2, Ex 3, Ex 4, Ex 5, 3' UTR) shown twice, comparing exon based and fragmentation based target preparation]

NGS Data Example
The data from the sequencer can come in several formats; we use FASTQ.
This is a text file containing data from one sample. Each sequence is recorded as four lines:
A sequence identifier
The DNA sequence itself (with the indexes and linker sequences stripped off)
A separator character
The quality score for each base (Phred score), e.g. FF10F = Q37, Q37, Q16, Q15, Q37
A typical FASTQ file for HLA-A, B, C, DRB1 contains ~250,000 to 300,000 sequences and is 20 to 40 MB in size when compressed.
Each paired end sequencing run generates 2 FASTQ files per sample: up to 80 MB of data.
Each of the half million resulting sequences needs to be assessed, aligned to the genome and then to the IMGT/HLA allele database to determine the HLA type.
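As a small illustration of how the quality line is read, the Python sketch below decodes a quality string into Phred scores. It assumes the Phred+33 encoding (each score is the character's ASCII code minus 33); the function names and the example record, including its identifier, are hypothetical. Under that assumption the string FF10F decodes to five scores, one per character.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores.

    Assumes Phred+33 encoding unless another offset is given.
    """
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# A hypothetical four-line FASTQ record.
record = [
    "@example_read_1",  # sequence identifier (made up for this example)
    "ACGTA",            # the DNA sequence
    "+",                # separator
    "FF10F",            # one quality character per base
]

scores = phred_scores(record[3])
worst = min(scores)
```

A Phred score of 20 corresponds to a 1 in 100 chance of a wrong base call, and a score of 30 to 1 in 1000.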

NGS HLA Data Analysis
NGS produces lots of data very quickly, but it consists of hundreds of thousands of relatively short sequences (typically 150 to 400 bp), and this presents a major challenge for the bioinformatician.
The target preparation method can affect the approach to analysis.
Once the sequencing has been completed there are several questions to answer for each piece of data:
Is the sequence long enough and of high enough quality?
Which gene/part of the gene does the sequence come from? (alignment)
Which haplotype does the sequence come from? (phasing)
How much of the gene has been sequenced? (breadth of coverage)
How many times has each base been sequenced? (depth of coverage)
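The first of these questions can be sketched as a simple filter. The Python example below is a minimal illustration under stated assumptions: the function name `passes_filter` and the default thresholds (`min_length=100`, `min_mean_q=20.0`) are hypothetical and not taken from the talk, and the quality string is assumed to use Phred+33 encoding.

```python
def passes_filter(sequence, quality, min_length=100, min_mean_q=20.0, offset=33):
    """Return True if a read is long enough and of high enough mean quality.

    The thresholds are illustrative defaults, not recommended values.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must be the same length")
    if not sequence or len(sequence) < min_length:
        return False
    mean_q = sum(ord(ch) - offset for ch in quality) / len(quality)
    return mean_q >= min_mean_q


# Toy read (far shorter than a real read), mean quality 28.4.
keep_short = passes_filter("ACGTA", "FF10F", min_length=5)
keep_strict = passes_filter("ACGTA", "FF10F", min_length=5, min_mean_q=30.0)
keep_default = passes_filter("ACGTA", "FF10F")
```

Mean quality is only one possible criterion; trimming low-quality bases from the ends of reads is a common alternative.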

Colindale NGS HLA Project
Long range PCR to amplify whole genes for HLA-A, -B, -C (2) and -DQB1 (1)
Two amplicons for DRB1: 5' Exon 1 + Intron 1, and Exon 2 to 3' end (2)
Illumina Nextera enzyme-based tagmentation kit to fragment and index the long range PCR products
Illumina MiSeq platform for sequencing a pool of 95 samples for 5 loci
Commercial software for alignment and allele assignment
(1) Hosomichi et al, BMC Genomics 2013, 14:355
(2) Shiina et al, Tissue Antigens 2012, 80:305-316

An approach to analysis of NGS HLA data
Paired end FASTQ sequence file
Align to HLA gene in hg19 reference
Detect SNPs, insertions and deletions
Phase reads using heterozygous SNPs
Generate 2 consensus sequences
Align each consensus to IMGT/HLA data
HLA genotype
Analysis pipeline tools: Genome Analysis Toolkit, BWA, Picard, Samtools, in-house Perl scripts, Integrative Genomics Viewer
Based on Hosomichi et al, BMC Genomics 2013, 14:355
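The first step of this workflow (alignment of paired end reads with BWA, followed by sorting and indexing with Samtools) might be scripted roughly as follows. This is a sketch under stated assumptions, not the pipeline used in the project: all file names are hypothetical, the function names are made up for this example, and the later steps (variant detection, phasing, consensus generation) are not shown. The commands are only built as lists here; `run_steps` would execute them and requires BWA and Samtools to be installed.

```python
import subprocess


def build_alignment_steps(reference, fastq_r1, fastq_r2, prefix):
    """Return (argv, stdout_path) pairs for aligning one paired end sample.

    stdout_path is the file that should receive the command's standard
    output, or None if the command writes its own output files.
    """
    sam = prefix + ".sam"
    bam = prefix + ".sorted.bam"
    return [
        (["bwa", "index", reference], None),                  # index the reference once
        (["bwa", "mem", reference, fastq_r1, fastq_r2], sam),  # alignments go to stdout
        (["samtools", "sort", "-o", bam, sam], None),          # coordinate-sort into BAM
        (["samtools", "index", bam], None),                    # index the sorted BAM
    ]


def run_steps(steps):
    """Run each step in order, stopping at the first failure."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, stdout=out, check=True)


# Hypothetical file names for one sample.
steps = build_alignment_steps(
    "hla_region.fa", "sample01_R1.fastq.gz", "sample01_R2.fastq.gz", "sample01"
)
```

Keeping command construction separate from execution makes it easy to print and check the commands before running them on a whole plate of samples.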

Determine HLA Genotype
IMGT/HLA reference fully sequenced: phased sequence including non-coding regions
IMGT/HLA reference data missing from non-coding regions: phased sequence with just the coding regions

[Figure: sequencing reads of each fragment aligned to a reference sequence, showing breadth of coverage and example read depths of 12X and 29X]
Depth of coverage or read depth (X):
- Number of reads covering a single base
- Average number of reads covering the target
Breadth of coverage or sensitivity (%):
- Proportion of genomic target covered to a pre-determined depth
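These definitions translate directly into a short calculation. The Python sketch below is illustrative only: the function names are hypothetical, reads are represented simply as (start, end) intervals on the target, and the toy target is 10 bases long.

```python
def per_base_depth(reads, target_length):
    """Count the reads covering each base of the target.

    reads is a list of (start, end) intervals, 0-based with end exclusive.
    """
    depth = [0] * target_length
    for start, end in reads:
        for pos in range(max(start, 0), min(end, target_length)):
            depth[pos] += 1
    return depth


def mean_depth(depth):
    """Average number of reads covering the target."""
    return sum(depth) / len(depth)


def breadth_of_coverage(depth, min_depth):
    """Percentage of the target covered by at least min_depth reads."""
    covered = sum(1 for d in depth if d >= min_depth)
    return 100.0 * covered / len(depth)


# Three hypothetical 6-base reads on a 10-base target.
reads = [(0, 6), (2, 8), (4, 10)]
depth = per_base_depth(reads, 10)
average = mean_depth(depth)
breadth_at_2x = breadth_of_coverage(depth, min_depth=2)
```

In this toy example every base is covered at least once, but only 60% of the target reaches a depth of 2, which shows why breadth is always quoted together with a depth threshold.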

Integrated Analysis Packages
Omixon Target
Ion Torrent HLA Plug In
GenDx NGSengine
Conexio
Wellcome Trust Centre for Human Genetics
NextGENe HLA module

Omixon Target Result Screen

HLA-A Alignment GENDX

HLA-C Alignment GENDX

Choice of Analysis Software There are a growing number of high quality commercial software packages available. Choice of package should be based on suitability for your platform and requirements and in-house expertise. There are also some non-commercial packages which may do the job but are less likely to have a unified user interface. Choice of package should also be based on careful validation with a panel of well-characterised samples.

Data storage
MiSeq run: 66 GB (short term storage)
95 donors, 5 loci, 190 FASTQ files: 6 GB
Sanger data, 95 donors: 250 MB
10,000 donors: 631 GB, stored for 30 years
Microsoft Azure cloud: 125 TB
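The figures on this slide appear to be linked by simple scaling, which the Python sketch below checks. It rests on an assumption about how the slide's labels pair up, namely that the 6 GB figure is the FASTQ data for the run of 95 donors (190 files); the function name is hypothetical, and decimal units (1 GB = 1000 MB) are used.

```python
def scale_storage(size_gb, donors_measured, donors_target):
    """Scale a measured storage requirement linearly to another number of donors."""
    return size_gb * donors_target / donors_measured


# Assumption: 6 GB of FASTQ data for 95 donors (190 files).
projected_gb = scale_storage(6, 95, 10_000)

# Average size per FASTQ file in MB, using decimal units.
per_file_mb = 6 * 1000 / 190
```

Under that assumption the projection for 10,000 donors comes to roughly 631 to 632 GB, in line with the 631 GB on the slide, and the average file size of about 32 MB falls within the 20 to 40 MB range quoted earlier for a compressed FASTQ file.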

Summing Up
Next Generation Sequencing will deliver reliable and accurate HLA types.
Choice of analysis software is of prime importance.
Not yet cost effective for routine low throughput or urgent typing.
Could be useful for research projects where samples can be batched.
Need to consider carefully what data to keep and how to store it.

NGS: feel the fear and do it anyway

Dramatis Personae
NHSBT: Cristina Navarrete, Lisa Creary, John Girdlestone, Sue Davey, Colin Brown, Zareen Goburdhun, Monica Kyriacou, John Ord
University of Cambridge: John Todd, Howard Martin, Kim Brugger, Sam Haldenby, Anthony Rogers, (Willem Ouwehand)

NGS Selected Reading
Hosomichi et al: Phase-defined complete sequencing of the HLA genes by next-generation sequencing. BMC Genomics 2013, 14:355
Lange et al: Cost-efficient high-throughput HLA typing by MiSeq amplicon sequencing. BMC Genomics 2014, 15:63
Shiina et al: Super high resolution for single molecule-sequence-based typing of classical HLA loci at the 8-digit level using next generation sequencers. Tissue Antigens 2012, 80:305-316
Erlich H: HLA DNA typing: past, present, and future. Tissue Antigens 2012, 80:1-11
Wang et al: High-throughput, high-fidelity HLA genotyping with deep sequencing. PNAS 2012, 109:8676-8681
Metzker ML: Sequencing technologies - the next generation. Nature Reviews Genetics 2010, 11:31-46

NGS Analysis Software Sources
http://www.omixon.com/hla/
http://www.gendx.com/products/ngsengine
http://www.conexio-genomics.com/
http://www.softgenetics.com/nextgene.html
http://www.lifetechnologies.com/