



Adam Schefkind
Bio 434W
03/08/2014

Finishing of Fosmid 1042D14

Abstract

Project 1042D14 is a roughly 40 kb segment of Drosophila ananassae genomic DNA. Through analysis of forward-reverse mate pairs, tandem repeat homology, and PacBio data, the fosmid was assembled into two major Contigs. However, additional sequence data, from Polymerase Chain Reaction (PCR) or the Trace Archive, is likely required to complete this project and close the remaining 1 kb gap.

Introduction

The D. melanogaster F element, or dot chromosome, is rather unusual. It appears heterochromatic, with a repeat density of 30%. At the same time, it has roughly 80 transcriptionally active genes, a typical gene density for its 1.2 Mb size. In contrast, the D. ananassae dot chromosome is hugely expanded relative to that of the classical reference species, Drosophila melanogaster. In an effort to better understand these odd properties, and chromatin in general, we have set out to sequence the D. ananassae dot chromosome for comparison with related species.

Initial Assembly

Figure 1: Initial Assembly with Crossmatch

The initial Assembly View for project 1042D14 (Figure 1) revealed a single 40,409 bp Contig containing a high repeat density and numerous inconsistent mate pairs toward its 3' end. Both clone end tags lay on this Contig, suggesting that the project was already at least partially assembled correctly. It appeared unlikely that any severe gaps existed. However, the tandem repeats beginning at around 35 kb were likely misassembled, given the mate pair information as well as multiple high quality discrepancies in this region (Figure 2).

Figure 2: High Quality Discrepancies in Repeat-rich Region of Contig 19

Initial Sorting

In the absence of restriction digest information, the first step in the sorting process was to attempt to rectify the forward-reverse mate pair issues. Interestingly, the majority of the unresolved forward-reverse mate pairs (shown by slanted red lines in Figure 1) had one member anchored in a unique region. A reasonable explanation is that the mate within the repetitive region was misassembled, either because Contig 19 had a collapsed repeat and required expansion, or because these reads were simply placed in an incorrect repeat copy in this region. In support of the latter hypothesis, a homology search of the non-unique mates showed that they matched several kb upstream (Figure 3). Successful placement of the reads at that location could rectify the mate pair problems as well as the high quality discrepancies.
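The notion of an "inconsistent" mate pair used above can be made concrete with a small sketch. This is an illustration only, not the logic Consed itself applies: the `Read` type, the function name, and the insert-size limits (1 to 5 kb) are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Read:
    name: str
    start: int   # leftmost consensus position covered by the read
    end: int     # rightmost consensus position covered by the read
    strand: str  # '+' if the read points rightward, '-' if leftward


def mate_pair_status(fwd, rev, min_insert=1000, max_insert=5000):
    """Classify a forward-reverse mate pair placed on one contig.

    The insert-size limits are hypothetical; real limits depend on the
    subclone library used for the project.
    """
    # Order the two reads by their position on the consensus.
    left, right = (fwd, rev) if fwd.start <= rev.start else (rev, fwd)
    # Consistent mates point toward each other.
    if left.strand != "+" or right.strand != "-":
        return "wrong_orientation"
    span = right.end - left.start
    if span < min_insert:
        return "too_close"
    if span > max_insert:
        return "too_far"
    return "consistent"
```

A pair flagged `too_far` or `wrong_orientation` with one mate in unique sequence is the pattern described above: the unique mate is probably placed correctly, so the repeat-region mate is the one to re-examine.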

Figure 3: An example of potential matches for a problematic mate pair

Figure 4: List of mate pairs removed from Contig 19

Accordingly, the 7 mate pairs (14 reads) that appeared misplaced (Figure 4) were removed from Contig 19 and, using Miniassembly, placed into their own Contigs 20 and 21. Contig 20 contained the mates that were previously anchored in unique sequence. As expected, crossmatch showed that both of these new Contigs matched toward the end of Contig 19; Contig 20 matched in one place while Contig 21 matched in three places (Figure 5). Unfortunately, none of the reported sequence matches between Contig 21 and Contig 19 appeared real; all attempted alignments revealed numerous discrepancies. No joins could be made at this time,

so an inspection of high quality discrepancies was conducted for further clues.

Figure 5: Assembly View with 2 new Contigs, 20 and 21.

High Quality Discrepancies

Most of the aforementioned high quality discrepancies (Figure 2) were deemed insignificant. They either fell in a low quality region or were indicative of sequencing errors. For example, the highlighted * in Figure 5 was mistakenly called. An examination of the trace window revealed irregular spacing; the four A's recorded for read E13.g1 should actually have been 5 A's, as shown by the overly broad peak between base pairs and (Figure 6). Discrepancies of this nature were tagged with a comment.

Figure 6: An example of an insignificant high quality discrepancy and its Trace Window

However, there was one discrepancy that could not be dismissed in this way. Whereas in other cases only one read deviated from the consensus and the trace revealed a clear base-calling mistake, at site roughly half of all reads showed a discrepant C, while the consensus was an A at this site (Figure 7).

The traces of these reads did not reveal a mistake in signal recording. Rather, the quality of these discrepancies was quite high and the spacing between peaks seemed normal. There are a few possible explanations for this finding. It could be a polymorphic site at which different genomes in a population happen to contain one base or the other. Alternatively, it could reflect contamination from E. coli or some other bacterial species during the cloning process. However, given the high repeat density of the surrounding regions, half of the reads may have belonged somewhere else, in a different repeat.

Figure 7: A genuine discrepancy at site

To test this, a homology search was performed with a string of bases containing the C, and a string of bases containing the A at site. Both searches revealed several hits. However, movement of the reads containing a C at bp would rectify several of the remaining inconsistent forward-reverse pairs. For this reason, these reads were pulled out of Contig 19 and placed into their own Contig 22 using Miniassembly. At this point, Contigs 21 and 22 represented portions of the repeated region in the main Contig. Contig 20 contained the mates of the reads in Contig 21. Whether the main Contig contained collapsed repeats or whether some copies of the repeat belonged outside this assembly altogether remained unclear.
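The reasoning used to separate a lone base-calling error from a site where roughly half the reads disagree can be sketched as a simple column tally. This is a minimal illustration under stated assumptions: the function name, the quality cutoff of 30, and the 25% minor-allele threshold are hypothetical choices, not values taken from the finishing software.

```python
from collections import Counter


def classify_column(bases, quals, min_qual=30, minor_fraction=0.25):
    """Classify one consensus column from the reads covering it.

    bases: base called by each read at this position
    quals: Phred quality of each of those calls (same order)
    Thresholds are illustrative assumptions.
    """
    # Only high quality calls are considered informative.
    counts = Counter(b for b, q in zip(bases, quals) if q >= min_qual)
    total = sum(counts.values())
    if total == 0:
        return "no_high_quality_data"
    ranked = counts.most_common()
    if len(ranked) == 1:
        return "consistent"
    # Fraction of high quality reads supporting the second most common base.
    second = ranked[1][1] / total
    if second >= minor_fraction:
        return "possible_repeat_or_polymorphism"
    return "likely_sequencing_error"
```

A column flagged `possible_repeat_or_polymorphism` cannot be resolved by the tally alone; as in the text, mate pair information and homology searches are still needed to decide whether the minority reads belong to another repeat copy.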

Initial Resolution of Gaps

As expected, Contig 20, which had originally been anchored in unique sequence, clearly matched one place in the main Contig. The only discrepancies fell in low quality regions and could thus be ignored (Figure 8). A join was made, placing Contig 20 back at around 33 kb in Contig 19. With Contig 20 now placed, its mates in Contig 21 had to be inserted within roughly 3 kb. As implied previously, no suitable matches appeared in this range. A 97.2% similarity match at around 36 kb was ideal except for a string of pads: Contig 21 had 31 extra base pairs, as seen in Figure 9.

Figure 8: Contig 19 and Contig 20 can be joined as discrepancies fall only in low quality regions

Figure 9: Unsuitable match between Contigs 19 and 21
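A match that scores well overall but contains one long run of pads, like the 31-base insertion described here, can be summarized with two numbers: percent identity and the longest pad run. The sketch below assumes two already-aligned strings of equal length with `*` as the pad character; the function name and the identity definition (matches over alignment length) are assumptions and will not necessarily reproduce crossmatch's own score.

```python
def alignment_summary(a, b, pad="*"):
    """Return (percent identity, longest pad run) for two aligned strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        raise ValueError("alignment is empty")
    # A column counts as a match only if both bases agree and neither is a pad.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != pad)
    longest = run = 0
    for x, y in zip(a, b):
        if x == pad or y == pad:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    identity = 100.0 * matches / len(a)
    return identity, longest


# Toy example: a perfect match interrupted by a 3-base insertion.
identity, longest_pad_run = alignment_summary("ACGT***ACGT", "ACGTTTTACGT")
```

A high identity combined with a single long pad run is the signature discussed in the text: the flanks agree, but one sequence carries extra bases, which is what a collapsed repeat copy would produce.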

Given that the assembly was only roughly 37 kb long at this point, it seemed reasonable that at least one copy of the tandem repeat was collapsed. Mate pair information, as well as the perfect match preceding the string of pads between Contigs 19 and 21 (Figure 9), suggested that Contig 21 was a collapsed repeat copy belonging in the middle of Contig 19. Consequently, Contig 19 was torn at this position. The result was Contigs 24 and 25.

Figure 10: Assembly View after tearing the major Contig apart

As expected, Contig 21 matched suitably well with the very end of the major Contig 24. In addition, placement here would fix the original mate pair mismatch. This evidence corroborated the idea that Contig 21 was the second copy of a collapsed repeat within the major Contig, so Contig 21 was joined here. In addition, Contig 22, the other repeat copy that was originally discrepant, was found to match Contig 25 perfectly, and was placed there (Figure 11). All smaller

Contigs were now joined back into larger Contigs.

Figure 11: A suitable match between Contigs 22 and 25

The net result of these joins was the addition of about 1 kb of repeat sequence into the middle of the major Contig. The resulting Assembly View showed two major Contigs, 27 and 28, with many inconsistent mate pairs that could hopefully be remedied by a Contig join (Figure 12).

Figure 12: Assembly View after several Contig joins

Figure 13: Alignment of the two major Contigs

After one mate pair was moved, it became clear that these two major Contigs matched very well, with no gaps. The alignment showed a 100% match, so a join was justified. The result was a single Contig 31. At this point the assembly seemed nearly finished. However, there were still a few inconsistent mate pairs. Table 1 shows how these inconsistencies were rectified.

Figure 14 shows the apparently completed assembly. However, several components still required inspection.

Figure 14: Assembly View after join to make one Contig.

Table 1: A list of removed reads and where they were placed

Remaining Checklist Components

Most checklist items were resolved without problem. There were no unacceptable mononucleotide runs. There were no N's or X's in the consensus. A BLAST search revealed no traces of E. coli in the sequence. Any region covered by only one sequencing direction or one chemistry showed a Phred score of over 30.
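For reference, the Phred scale relates a quality score Q to an estimated error probability P by Q = -10 log10(P), so the threshold of 30 used in the checklist corresponds to an estimated error rate of 1 in 1,000 bases. The short sketch below applies that standard definition; the function names are chosen for illustration.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to an estimated error probability."""
    return 10 ** (-q / 10)


def expected_errors(quals):
    """Expected number of miscalled bases over a stretch of consensus."""
    return sum(phred_to_error_probability(q) for q in quals)
```

For example, a 1,000-base stretch in which every consensus base has a score of exactly 30 would be expected to contain about one error.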

Figure 14: The top image shows 3 reads at the low consensus area. All three indicate spacing of only 4 A's. The bottom image shows typical base pair spacing from a nearby region for comparison.

One problematic area was the consensus quality at base. An examination of neighboring bases revealed that the consensus sequence from to was of relatively low quality. However, this region lay in the middle of a 5 kb tandem repeat, so successful PCR seemed improbable here. A BLASTn search was attempted to obtain more reads to add here; however, all suggested reads already existed in project 1042D14, so no reads could be added. Despite this, an investigation of the traces of the reads at the low quality consensus revealed some information. The consensus originally showed 5 A's between and. However, the traces seemed to show spacing indicating only 4 A's (Figure 14). Most reads showed a pad at position 39646; only 1 read showed a low quality A. For these reasons, the consensus was changed to a pad at

this position. Even so, the low quality of the consensus could not be rectified at this time. The consensus was tagged between and with a comment describing this. A similar procedure was used for low consensus quality at position.

Another major problem was discovered when checking the assembly against PacBio data. As shown in Figure 15, most reads matched the project with high fidelity, denoted by a red bar. However, read 0_12144 (Figure 16) revealed a deviation from the project for roughly 1 kb. This suggests that either this PacBio read was faulty or my assembly was incorrect at this point.

Figure 15: Comparison of PacBio data with project 1042D14

Figure 16: A roughly 1 kb region lacking homology is evident.
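One way to put a number on a deviation like the one in Figure 16 is to compare the distance between the two flanking matches on the long read with the distance between them on the consensus. The sketch below is an illustration only: the tuple layout, the function name, and the coordinates in the example are hypothetical and are not taken from the project.

```python
def estimate_missing_bases(left_flank, right_flank):
    """Estimate how many bases the consensus lacks between two flanking matches.

    Each flank is (read_start, read_end, cons_start, cons_end) for one block
    where the long read aligns to the consensus, both on the same strand.
    """
    read_gap = right_flank[0] - left_flank[1]   # unaligned stretch on the read
    cons_gap = right_flank[2] - left_flank[3]   # corresponding stretch on the consensus
    # A positive value means the read carries sequence the consensus lacks.
    return read_gap - cons_gap


# Hypothetical coordinates: the read has 1,000 bases between flanks that
# sit directly next to each other on the consensus.
missing = estimate_missing_bases((0, 3000, 30000, 33000), (4000, 6000, 33000, 35000))
```

Because raw long reads carry many insertion and deletion errors, an estimate of this kind is approximate and is better treated as a rough gap size than as an exact count.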

A review of mate pair information within the project assembly showed a worrisome lack of coverage in the discrepant area (Figure 17). In fact, only 2 reads spanned the region from bp to 32990, one of which was a danafos read. Given this evidence, it was concluded that the assembly could not be trusted here; rather, the PacBio data was likely correct, and the deviation legitimate. Since the PacBio read matched the consensus both before and after the deviation, there was probably a ~1 kb sequence missing from the main Contig.

Figure 17: Mate pair information is severely lacking in the problematic region

PacBio Data Incorporation

Accordingly, the PacBio reads from Figure 14 were added to the project. The Contig was torn at the position of low coverage in hopes of locating and eventually filling the gap indicated by the PacBio data. Henceforth, the two Contigs that resulted from this tear will be referred to as Contig A and Contig B (Figure 18).
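The coverage check described above amounts to counting the reads that extend across the whole suspect interval. A minimal sketch follows; the read tuples, function name, and coordinates are hypothetical and chosen only to illustrate the test.

```python
def spanning_reads(reads, region_start, region_end):
    """Return names of reads that cover the entire region.

    reads: iterable of (name, start, end) in consensus coordinates.
    """
    return [
        name
        for name, start, end in reads
        if start <= region_start and end >= region_end
    ]


# Hypothetical placement of three reads around a region from 100 to 300.
example_reads = [("r1", 50, 400), ("r2", 150, 500), ("r3", 0, 1000)]
covering = spanning_reads(example_reads, 100, 300)
```

A region spanned by only one or two reads, as reported here, gives little independent support for the consensus, which is why the PacBio evidence was given more weight.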

The PacBio reads were unfortunately of quite low quality; their exact sequences could not be used as a new consensus, and they could not be used to close the gap directly. However, the gap size was still confidently approximated at 1 kb based on the dot plot in Figure 15. While PCR could not be run successfully at this location because of the high repeat density, the Trace Archive could be used to pull in new reads that might cover the gap.

Figure 18: Assembly View after tear at low coverage region. The large Contig containing mostly unique sequence is labeled Contig A. The smaller, repetitious Contig is labeled Contig B

The strategy devised was to walk in from Contig B; new reads could be added onto the 5' end of this Contig, extending it into the gap. Subsequently, more reads could be retrieved based on this extended sequence. Indeed, a Trace Archive search found 21 reads and their mate pairs (Figure 19). They were

added to the project, and several were placed at the start of Contig B using Miniassembly. The mates of the added reads also appeared to belong within the assembly, lending weight to the proposed hypothesis.

Figure 19: A list of reads added from the Trace Archive

Conclusion

Unfortunately, time ran out for this project at this point. However, the procedure just described will provide a useful starting point for future finishing attempts. The final assembly is shown in Figure 20. The second image is a diagram of the hypothesized correct complete assembly. While this fosmid was left with a 1 kb gap, the rest of the assembly appeared to be of quite high quality and essentially finished.

Figure 20: Final Assembly View. Contigs 102 and 115 are the two relevant Contigs.

Acknowledgements

Thank you to all involved with the Bio 4342 project. The faculty, finishers, and teaching assistants were invaluable to the progress made on this project; I am sincerely appreciative of their guidance and expertise.