Finishing Drosophila ananassae Fosmid 2410F24

Nick Spies
Research Explorations in Genomics
Finishing Report
Elgin, Shaffer and Leung
23 February 2013

Abstract:
Finishing Drosophila ananassae fosmid clone 2410F24 proved to be quite an undertaking. The initial assembly contained 6 large and 7 smaller contigs, along with an intimidating number of inconsistent mate pairs. The method that proved most helpful was a sequence anchoring procedure that allowed repeat copies to be placed correctly in a project where repeats are a major feature. Through sequence anchoring and force joins of individual and grouped contigs, a final assembly with only one gap was obtained.

Introduction:
The primary aim of this year's Bio4342 class, Research Explorations in Genomics, is to finish the remaining regions of the fourth chromosome (F element) and the Muller D element of Drosophila ananassae. While Drosophila chromosome 4 is usually referred to as the dot chromosome, the D. ananassae fourth chromosome has expanded so much that it no longer appears as a dot under a microscope. A sequence analysis of this chromosome, followed by a comparative genomics approach, will shed light on the mechanisms by which such a large expansion occurred. Using Drosophila melanogaster as a reference, the data will be compared on the basis of gene and repeat structure as well as regulatory motifs. This paper discusses the work done toward finishing fosmid clone 2410F24.

Workflow:

Initial Assembly:
Figure 1 shows the initial assembly of fosmid 2410F24. There are 6 major contigs, along with 7 minor contigs not shown. Contig 7 proved to be the major problem region, as it contains large tandem and inverted repeat regions. Contigs 6 and 4 both contain regions of the major repeat structure. The left end read is in contig 5, while the right end read is in contig 2.
Drosophila ananassae fosmid clone 2410F24 1

Figure 1: Initial Assembly view for clone 2410F24

Solution Attempt #1:
The first method attempted was to join contigs wherever a join was possible. This did not yield any progress, so the next effort was to tear out many of the inconsistent reads and put them back at their appropriate sites in the assembly. To do so, I took high quality sequence from the left end of each minor contig, ran a search for string on it, and attempted any joins suggested by the matches to major contigs (and, in some cases, to other minor contigs).
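The string search used to find join candidates can be illustrated with a minimal Python sketch. It is not the Consed search itself: the function names and the toy contig sequences are hypothetical, and it only reports exact matches of a seed on either strand.

```python
# Sketch only: exact-match search for a seed taken from the end of a minor
# contig. Contig names and sequences are made up for illustration.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_seed_matches(seed, contigs):
    """Return (contig_name, position, strand) for every exact seed match."""
    seed = seed.upper()
    rc = reverse_complement(seed)
    # A palindromic seed equals its reverse complement; search it only once.
    queries = [("+", seed)] if rc == seed else [("+", seed), ("-", rc)]
    hits = []
    for name, seq in contigs.items():
        seq = seq.upper()
        for strand, query in queries:
            start = seq.find(query)
            while start != -1:
                hits.append((name, start, strand))
                start = seq.find(query, start + 1)  # allow overlapping hits
    return hits


toy_contigs = {
    "Contig7": "ACGTTGCAAGGCTTACCGGA",
    "Contig4": "TTTCCGGTAAGCCTTGAAA",
}
for hit in find_seed_matches("AAGGCTTACCGG", toy_contigs):
    print(hit)
```

A seed that matches in more than one place, or on both strands, is a warning that it lies inside a repeat and that a join made from it needs extra scrutiny.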

Figure 2: Assembly View of post-torn assembly (left), without replacing torn reads.

In making a few of the joins between contigs, I created several high quality discrepancies, each of which had to be examined individually. Most of these discrepancies show groups of reads disagreeing with each other (Fig. 3). These observations led to the hypothesis of a multiple repeat structure whose copies have diverged in sequence over evolutionary time. A simple tandem repeat is unlikely, as there is no pattern in the discrepant reads: it was not possible to group them in such a way that any one tandem repeat copy was free of discrepancies. As the project progressed, it became clear that the repeat structure is extremely complex, with a combination of tandem and inverted repeats.

Figure 3: High Quality Discrepancy Example

Solution Attempt #2:
After telling phrap not to overlap reads at many of the high quality discrepancies, miniassembly was run to see what progress had been made with the newly built scaffold (Figure 4). Notable features of this scaffold are the inverted sequence matches in contigs 2, 3, and 5, the gap in mate pair density at 11,000 bp in contig 7, and the concentration of inconsistent mate pairs in the same region as the repeat contained in three of the contigs. The next step was to remove these inconsistent reads, in an attempt to place them in the correct position and orientation.
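As a reminder of what makes a mate pair inconsistent, the sketch below classifies a pair of placed reads by orientation and separation. It is a simplified illustration, not Consed's own logic: the `PlacedRead` type and the insert-size bounds are assumptions chosen for the example, and the real bounds depend on the subclone library.

```python
from dataclasses import dataclass


@dataclass
class PlacedRead:
    contig: str
    start: int     # leftmost consensus position covered by the read
    end: int       # rightmost consensus position covered by the read
    forward: bool  # True if the read aligns to the top strand


def classify_mate_pair(a, b, min_insert=1000, max_insert=6000):
    """Label a mate pair; the insert-size bounds are hypothetical defaults."""
    if a.contig != b.contig:
        return "different contigs"
    left, right = (a, b) if a.start <= b.start else (b, a)
    # A consistent pair points inward: left read forward, right read reverse.
    if not (left.forward and not right.forward):
        return "inconsistent orientation"
    span = right.end - left.start
    if span < min_insert or span > max_insert:
        return "inconsistent distance"
    return "consistent"


pair = (PlacedRead("Contig7", 100, 800, True),
        PlacedRead("Contig7", 2500, 3200, False))
print(classify_mate_pair(*pair))
```

In a collapsed repeat, reads from different copies pile onto one location, so their mates tend to land at the wrong distance or in the wrong orientation, which is consistent with the cluster of inconsistent pairs seen in the repeat region.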

Figure 4: Miniassembled structure

After pulling out the inconsistent reads, it was possible to close the gaps between contigs 3, 4 and 6. Miniassembly was run again, and many of the remaining individual reads came together into the major contigs. There is clearly a problem area at the left end of contig 12, as well as a large build-up of reads between 17 and 18 kb on contig 18.

Figure 5: Miniassembly with consolidated individual reads.

Solution Attempt #3:
Inconsistent reads were removed from the assembly again, yielding this assembly.

Figure 6: Assembly before anchoring began

This is where the anchoring method began. Anchoring reads can be described as follows. Swipe about 25 bp of sequence and search for that string. If it matches elsewhere in the assembly, add a comment tag reading "match found elsewhere in high quality". If it has no other match, give the sequence a "unique" comment tag. When a unique region is found, highlight the reads that have high quality data in that region, as shown in Figure 7, scroll until those reads are no longer high quality, then search again downstream and repeat the procedure.

Figure 7: Anchored reads example

When I reached the right end of my leftmost contig, I used the Consed main window to search for mate pairs that I could add to this contig to extend it. I used only pairs that were anchored in unique sequence, in order to be sure I was not adding incorrect sequence to the contig. I ended up adding 8 reads, which I first miniassembled into their own contig.
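The anchoring procedure was carried out by hand in Consed, but its logic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool used in the project: the function names are hypothetical, only exact matches are counted, and the toy example uses a 4 bp window instead of 25 bp so the result is easy to check by eye.

```python
# Sketch only: tag fixed-size windows of one contig as unique or repeated
# by counting exact matches (both strands) across all contigs.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_overlapping(text, query):
    """Count exact occurrences of query in text, allowing overlaps."""
    count, start = 0, text.find(query)
    while start != -1:
        count += 1
        start = text.find(query, start + 1)
    return count


def tag_windows(contig_name, contigs, window=25, step=25):
    """Return (position, tag) for each window of the named contig."""
    target = contigs[contig_name].upper()
    tags = []
    for pos in range(0, len(target) - window + 1, step):
        probe = target[pos:pos + window]
        rc = reverse_complement(probe)
        total = 0
        for seq in contigs.values():
            seq = seq.upper()
            total += count_overlapping(seq, probe)
            if rc != probe:  # do not double count palindromic probes
                total += count_overlapping(seq, rc)
        # The probe always matches its own location once.
        tag = "unique" if total == 1 else "match found elsewhere"
        tags.append((pos, tag))
    return tags


toy = {"c1": "AAAACCCCGGGA", "c2": "TTGGGGTT"}
print(tag_windows("c1", toy, window=4, step=4))
```

Reads whose high quality bases span a window tagged "unique" are the ones that can be trusted to stay in place; the manual procedure additionally requires the match to fall in high quality sequence, which this sketch does not model.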

This extended my project to the assembly below. I could join contig 21 to contig 24, extending the contig as shown in Figure 8a.

Figure 8a: Extended Unique-anchored left end contig.

Figure 9: Roadblock in anchoring reads method.

At this point another miniassembly was run to see what phrap could put together, given the new input structure. The result of that miniassembly is shown in Figure 10. There were many possible joins, as suggested by consistent mate pairs. Many of the join interfaces looked like the lower panel of Figure 10, with discrepancies concentrated in low quality regions, so the joins were made.

Figure 10: New Miniassembly with join candidates (upper). Typical join interface with low quality discrepancies (lower).

After all of the joins of sufficient quality were made, the assembly in Figure 11 remained. The current contigs sum to 34 kb, whereas the project as a whole should be 52 kb (as concluded from the digest sums, including vector), and there is a large build-up of reads at the right end of my major contig. We therefore hypothesized that most of what is currently contig 56 is actually a compression of a very large repeat structure. This would explain the inability to get a good phrap assembly, the many inconsistent mate pairs aligning to this region, and the 18 kb shortage of sequence.
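The two pieces of evidence behind the collapsed-repeat hypothesis, a size shortfall and a local pile-up of reads, can be expressed as a short Python sketch. The 52 kb and 34 kb figures come from the text above; the depth profile and the 1.8x cutoff are made-up illustration values, not measurements from this project.

```python
from statistics import median


def size_shortfall(expected_bp, contig_lengths):
    """Expected clone size minus the summed contig lengths."""
    return expected_bp - sum(contig_lengths)


def high_depth_intervals(depth, factor=1.8):
    """Return half-open [start, end) intervals with depth >= factor * median.

    The factor is a hypothetical cutoff: two collapsed repeat copies would
    be expected to show roughly twice the typical read depth.
    """
    cutoff = factor * median(depth)
    intervals, start = [], None
    for i, d in enumerate(depth):
        if d >= cutoff and start is None:
            start = i
        elif d < cutoff and start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:
        intervals.append((start, len(depth)))
    return intervals


# 52 kb expected from the digests versus 34 kb assembled.
print(size_shortfall(52_000, [34_000]))

# Toy per-position read depth with two pile-ups.
toy_depth = [5, 5, 5, 5, 12, 12, 5, 5, 11, 11]
print(high_depth_intervals(toy_depth))
```

Regions flagged this way are only candidates; high depth can also reflect uneven subclone coverage, so the mate pair evidence still has to agree.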

Figure 11: Assembly with possible collapsing of two repeats into one large contig.

Solution Attempt #4:
Having formed this collapsed repeat hypothesis, I removed an arbitrary set of reads from contig 56, chosen so that they could form another completely contiguous sequence without causing the current contig to fall apart. To do this, I opened a new text document in the terminal, started at the left edge of my contig, and copied in the names of reads that together would let me extract a full-length contig from the data in my current sequence, as shown in Figure 12.
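Choosing a set of reads that spans the contig from left to right is an interval-covering problem. The sketch below uses a greedy tiling path, which is a swapped-in technique for illustration and not necessarily how the reads were picked by hand; the read names, coordinates, and the `min_overlap` parameter are hypothetical.

```python
def tiling_path(reads, region_start, region_end, min_overlap=0):
    """Greedily pick reads that cover [region_start, region_end).

    reads is a list of (name, start, end) tuples in consensus coordinates.
    Each chosen read must overlap the sequence already covered by at least
    min_overlap bases. Raises ValueError if the reads leave a gap.
    """
    ordered = sorted(reads, key=lambda r: r[1])
    path, covered_to, i = [], region_start, 0
    while covered_to < region_end:
        # The first read only needs to reach the region start.
        limit = covered_to if not path else covered_to - min_overlap
        best = None
        while i < len(ordered) and ordered[i][1] <= limit:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]  # keep the read extending furthest right
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"no read extends coverage past {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path


toy_reads = [
    ("r1", 0, 500), ("r2", 100, 700), ("r3", 450, 1000),
    ("r4", 650, 1200), ("r5", 900, 1500),
]
print(tiling_path(toy_reads, 0, 1500, min_overlap=40))
```

Because the path uses as few reads as possible, the reads left behind (here r2 and r4) still overlap one another in most places, which is what keeps the original contig from falling apart once the selected reads are pulled out.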

Figure 12: Extracting an arbitrary contig from the hypothesized compression region.

After finishing this procedure I obtained the assembly below. The most notable feature of this assembly is the breadth of the sequence matches, which was to be expected, as the two largest contigs are simply copies of each other. This new assembly had many high quality discrepancies, which I proceeded to deal with individually.

Figure 13: Post-extracted repeat assembly (above), and library of high quality discrepancies (below).

Solution Attempt #5:
Next, I navigated through the high quality discrepancies found in the project. I cleaned the assembly up as much as possible, then pulled in reads whose mate pairs were not yet in my project. Searching the NCBI database for such reads returned 52 hits, and I added these reads to the assembly. My next task was to navigate through the new discrepancies that resulted from adding the new reads. I first miniassembled only the new reads, to make putting them into place more efficient. This resulted in the assembly pictured below. I navigated many discrepancies that looked like the one below, with one or two reads (mostly new) discrepant from a group of the old ones. In these cases I used the do not overlap command and removed the new reads.

Figure 14: Assembly with new reads (above), HQD navigation (below).

Solution Attempt #6:
Over the weekend, professional finishers worked on the project. I reviewed the .ace files in order to get a clearer understanding of how they proceeded. Starting from the assembly in Figure 12, they made a join between three of the smaller contigs, then joined contig 49 to that group of three to make contig 83, giving the assembly below. There are clearly more sequence matches in the post-addition assemblies than in the previous ones, suggesting either that we have added good data that helps elucidate the compressed structure, or that we have torn apart a good sequence into more copies than we should have.

Figure 15: Assembly view of professional finisher's first major joins.

The next assembly is the result of a series of large contig joins made by starting at the left end read and searching for sequence matches throughout the contig. This resulted in the creation of a 20 kb left end contig with a few high quality discrepancies and a few inconsistent mate pairs. Clearly quite a bit of progress has been made. There are still quite a few single-read contigs that are not shown in this assembly.

Figure 16: Professional finisher's assembly after building on the left end contig.

The next assembly shows some very clear progress, with joins having been made between many of the major contigs present in the previous assembly. A notable feature of this assembly is that the largest, main contig is separated from both of the end contigs by gaps. This is because the fosmid end reads had quite a few high quality discrepancies with the danafos reads in the main contig. The danafos reads themselves also contained quite a few discrepancies with the other reads. This is likely because they come from other copies of the repeat elsewhere in the genome and were matched into my project.

Figure 17: Nearly contiguous assembly

Finally, the ends were rejoined to the major contig. In the process, another large contig was created. The figure below shows the finisher's final assembly, which I took to work on over the weekend. I attempted to close the gaps between contigs 202 and 203, and between contigs 203 and 181, but could not find a way to do so without adding a large number of high quality discrepancies. I will treat this assembly as my final assembly and base my conclusions on its sequence.

Figure 18: Final Assembly

Conclusions:
The digests clearly show that the sequence is not assembled correctly. The EcoRV digest shows bands both larger and smaller than those predicted by the in silico digest. The bands that are the correct size all result from contig 181.

Figure 19: EcoRV digests
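For readers unfamiliar with in silico digests, the following Python sketch shows the basic idea: find every recognition site in a contig and report the fragment sizes between cuts. It is a simplified stand-in for the digest comparison in Consed, and the function name and toy sequence are hypothetical. The recognition sites and cut positions (EcoRV GAT^ATC, HindIII A^AGCTT) are the standard ones for these enzymes.

```python
# Recognition site and cut offset within the site (top strand).
ENZYMES = {
    "EcoRV": ("GATATC", 3),    # GAT^ATC, blunt
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
}


def in_silico_digest(seq, enzyme):
    """Return fragment sizes (largest first) for a linear sequence."""
    site, offset = ENZYMES[enzyme]
    seq = seq.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + offset)
        pos = seq.find(site, pos + 1)
    # Both sites are palindromic, so searching one strand is sufficient.
    bounds = [0] + cuts + [len(seq)]
    return sorted((b - a for a, b in zip(bounds, bounds[1:])), reverse=True)


toy = "CCCC" + "GATATC" + "TTTTTTTT" + "AAGCTT" + "GGGGG"
print(in_silico_digest(toy, "EcoRV"))
print(in_silico_digest(toy, "HindIII"))
```

Fragments that run off the end of a contig are not real bands, since their true size depends on sequence across the gap or in the vector; only fragments bounded by two cut sites within one contig can be compared directly to the gel.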

The HindIII digests look somewhat better, and they also allow me to estimate the size of the gap. In the real digests, there is a 1336 bp fragment. The in silico digests show a 1020 bp fragment in contig 181. The difference, 1336 - 1020 = 316 bp, suggests that the gap is around 300 bp.

Figure 20: HindIII digests

Concluding Remarks:

Several obstacles kept me from finishing the fosmid. In retrospect, the repeat structure in the project made my initial attempts futile. I could not prime a reaction to close the gap because each side of the gap is flanked by multiple repeat series that match in multiple other places in the fosmid. Earlier in the project, I ran Autofinish to see if it would offer any insights. The results were not at all helpful, as the suggested primers not only matched elsewhere in the genome but were also composed mostly of AT-rich sequence. I decided not to call any of these reactions, as they were unlikely to yield good results. The figure below shows the final repeat structure of the fosmid. There is a series of tandem and inverted repeats, interspersed with regions of unique sequence.

Figure 21: Full assembly with sequence matches

In conclusion, I was unable to properly finish the 2410F24 fosmid due to the repeat structure throughout it. I was unable to call any reactions, nor was I able to run a BLAST search to check for contamination.
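The two problems noted above with the Autofinish primers, AT-rich composition and matches at more than one site, can be checked with a short Python sketch. This is not how Autofinish evaluates primers: the function names, the toy sequence, and the 65% AT cutoff are assumptions made for illustration, and only exact matches are counted.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_overlapping(text, query):
    """Count exact occurrences of query in text, allowing overlaps."""
    count, start = 0, text.find(query)
    while start != -1:
        count += 1
        start = text.find(query, start + 1)
    return count


def primer_problems(primer, contigs, max_at=0.65):
    """Return a list of reasons a candidate primer looks unreliable.

    max_at is a hypothetical cutoff on the fraction of A and T bases.
    """
    primer = primer.upper()
    problems = []
    at_fraction = (primer.count("A") + primer.count("T")) / len(primer)
    if at_fraction > max_at:
        problems.append("AT-rich")
    rc = reverse_complement(primer)
    sites = 0
    for seq in contigs.values():
        seq = seq.upper()
        sites += count_overlapping(seq, primer)
        if rc != primer:  # do not double count palindromic primers
            sites += count_overlapping(seq, rc)
    if sites == 0:
        problems.append("no binding site")
    elif sites > 1:
        problems.append("multiple binding sites")
    return problems


toy = {"c1": "ATATATTAGGCCGGATATATTA"}
print(primer_problems("ATATATTA", toy))
print(primer_problems("GGCCGGAT", toy))
```

A primer that passes both checks against the fosmid could still match elsewhere in the genome, so this kind of screen is a necessary condition rather than a guarantee.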

APPENDIX:
After this report was due and submitted, I continued to work on finishing my assembly. I made one join between my two largest contigs, giving me this assembly.

Figure 22: Final Assembly (above) with highlighted gap flanked by repeats (below).

The digests are modified just a little by this join, and now appear as follows. These two digests were the most consistent of the digests: EcoRV on the left, and HindIII on the right. The sum of my in silico bands is still short of the stated size (44,000 bp versus 48,000 bp). Clearly quite a bit of progress still needs to be made, but this assembly seems significantly closer to a finished product.

In walking through the final checklist, there are some notable features. There are quite a few single-strand, single-chemistry regions, but they are concentrated in the single-read contigs, as shown in Figure 23. Many regions where low quality data dictates the consensus are present, including long stretches on either side of the gap. There is one homopolymer run of 15 T's; however, it is in a very low quality region. There are also many high quality discrepancies still remaining in the assembly.

Acknowledgments:
I would very much like to thank Lee Trani and Sara Kohlberg for all the help they offered in assembling this project. I would also like to thank Wilson Leung, Dr. Shaffer and Dr. Elgin for making this class and this research possible.

Figure 23: Navigation window of low depth coverage regions (< 3 reads)

Figure 24: Low quality consensus region

Figure 25: Homopolymer run of T's in low quality data

Figure 26: High quality discrepancy navigation window