ABSTRACT. LEENAY, RYAN THOMAS. Discovering, Characterizing, and Applying CRISPR-Cas Systems in Bacteria. (Under the direction of Dr. Chase Beisel.


ABSTRACT

LEENAY, RYAN THOMAS. Discovering, Characterizing, and Applying CRISPR-Cas Systems in Bacteria. (Under the direction of Dr. Chase Beisel.)

CRISPR-Cas (clustered regularly interspaced short palindromic repeats-CRISPR-associated) systems are adaptive bacterial immune systems that target nucleic acids through an RNA-guided base-pairing mechanism. The simplicity of this mechanism has driven the development of numerous applications, such as genome editing, cellular imaging, gene regulation, and targeted antimicrobials. These applications primarily utilize a single Cas protein, even though CRISPR-Cas variants are found in up to 50% of sequenced bacterial genomes. I present here novel methods to expedite the rate at which researchers can mine the diversity of CRISPR-Cas systems in nature for the development of new biological tools. I first developed a screen to define protospacer-adjacent motifs (PAMs) for CRISPR-Cas systems, essential sequences that determine whether a Cas protein can bind to a desired target. As a proof of principle, I defined new targeting rules for four completely different systems and devised a novel display scheme to better analyze new and old datasets on CRISPR PAMs. The assay was then modified to enrich for weak PAMs, providing a holistic view of all possible sequences that a Cas protein can recognize. To apply these systems, a genome editing pipeline was developed in a gut-residing strain of Lactobacillus plantarum to study the strain's biological impact on its host. This pipeline also produced multiple tools for future analysis of the Lactobacillus genus. Together, these studies have laid the groundwork for the rapid characterization and application of CRISPR-Cas systems in bacteria.

2 Discovering, Characterizing, and Applying CRISPR-Cas Systems in Bacteria by Ryan Thomas Leenay A dissertation submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy Chemical and Biomolecular Engineering Raleigh, North Carolina 2017 APPROVED BY: Dr. Chase Beisel Committee Chair Dr. Robert Kelly Minor Member Dr. Amy Grunden Dr. Rodolphe Barrangou

3 DEDICATION To my wife who has supported me through this journey with constant love, support, and an enduring confidence in me. To my son who smiles and laughs me through every day. ii

4 BIOGRAPHY Ryan Thomas Leenay was born in Princeton, Minnesota in November of He spent his next 18 years here, and first thought he wanted to be an architect when he grew up. After taking an architecture class in high school and discovering that all of his designed houses looked the same, he decided to pursue a more practical engineering degree instead. He then moved to Madison, Wisconsin where he attended the University of Wisconsin-Madison and received his B.S. in chemical engineering in At UW-Madison, Ryan optimized mix design on the concrete canoe team and performed research under the direction of both Dr. Tom Record and Dr. Brian Pfleger. These applied science experiences sparked his interest in research, so Ryan moved to North Carolina to pursue a graduate degree at North Carolina State University. His studies were performed under the direction of Dr. Chase Beisel, with a focus on developing tools based on CRISPR-Cas technology. In the year 2015, Ryan interned at one of the flagship CRISPR-based companies: Caribou Biosciences. After returning, he married the love of his life, Chelsea Leenay, and worked towards finishing his doctoral studies. In January of 2017, Ryan and Chelsea welcomed their happy baby boy Jack into the world. Upon receiving his Doctorate of Philosophy, Ryan will be starting a job at the Chan Zuckerberg Biohub to pursue biotechnological solutions to medical problems. He hopes his career focuses on developing solutions to human health problems through the use of biotechnology. iii

ACKNOWLEDGEMENTS

I would first like to thank my wife for her unending grace, faith, support, advice, and love over the years. This achievement would not have been possible without her, and she made the entire journey so much more enjoyable. I would also like to thank my son, whose laugh instantly becomes the highlight of my day. Next, I would like to thank my family, both the Leenays and the Polks. They have raised me, loved me, and always supported my family and me throughout this adventure. I would also like to thank all of my co-workers, collaborators, and fellow Beisel lab members from over the years. There is not a single project that was done without input or direct assistance from one of you. I would especially like to thank all of my coauthors; our manuscripts pushed the CRISPR field in new directions. Finally, I would like to thank my advisor Chase Beisel. He constantly provided direction, assistance, or independence as my experience in the laboratory grew. This research experience would not have been nearly as successful without his guidance throughout the entire process.

6 TABLE OF CONTENTS LIST OF TABLES... viii LIST OF FIGURES... ix CHAPTER 1 Deciphering, communicating, and engineering the CRISPR PAM... 1 ABSTRACT INTRODUCTION CRISPR BIOLOGY PAM CHARACTERISTICS PAM DETERMINATION Protospacer identification in silico Plasmid clearance in vivo DNA cleavage in vitro DNA binding in vivo PAM REPORTING PAM ENGINEERING PERSPECTIVES REFERENCES CHAPTER 2 Identifying and visualizing functional PAM diversity across CRISPR-Cas systems ABSTRACT INTRODUCTION DESIGN Limitations of current PAM screens An in vivo, positive, tunable screen for functional PAMs Limitations of current PAM representation schemes Conveying functional PAM sequences and activities with PAM wheels RESULTS PAM wheels convey PAM landscapes from existing high-throughput screens Comprehensive screening reveals the PAM landscape for the E. coli type I-E system Stringency tuning uncovers the weak NAG PAM for the S. pyogenes type II-A system Individual and comprehensive screening results reveals functional PAMs for the S. thermophilus type II-A system Comprehensive screening reveals PAM bias for the F. novicida type V system Comprehensive screening reveals the PAM landscape for the compact B. halodurans type I-C system DISCUSSION LIMITATIONS EXPERIMENTAL PROCEDURES Strains, plasmids, oligonucleotides Growth conditions Flow cytometry analysis Fluorescence-activated cell sorting Next-generation sequencing v

7 2.6.6 Cell killing assay ACCESSION NUMBERS AUTHOR CONTRIBUTIONS ACKNOWLEDGEMENTS REFERENCES SUPPLEMENTARY INFORMATION SUPPLEMENTAL EXPERIMENTAL PROCEDURES DETAILED PROTOCOL CHAPTER 3 A growth-based PAM screen reveals non-canonical PAMs for the S. pyogenes Cas ABSTRACT INTRODUCTION DESIGN RESULTS Xylose media growth optimization Characterizing growth-based screening with SpydCas SpyCas9 growth-based PAM enrichment data presents non-canonical PAMs CONCLUSIONS EXPERIMENTAL PROCEDURES Strains, plasmids, oligonucleotides Plasmid generation Growth conditions Flow cytometry Xylose growth assay Deep sequencing PAM representation Plasmid clearance experiments REFERENCES SUPPLEMENTARY INFORMATION CHAPTER 4 Advancing tools for genome editing in Lactobacillus plantarum for in vivo gut studies ABSTRACT INTRODUCTION Lactobacillus as an impactful gut microbe Genome editing tools in Lactobacillus plantarum RESULTS Plasmid generation Characterization of the genome editing plasmids Oligo-based genome editing in Lactobacillus plantarum Repair template editing in Lactobacillus plantarum CONCLUSIONS EXPERIMENTAL PROCEDURES Strains, plasmids, oligonucleotides Plasmid generation Standard growth conditions Electroporation protocol for L. plantarum Colony qpcr REFERENCES SUPPLEMENTARY INFORMATION vi

8 CHAPTER 5 Conclusions and future work SUMMARY FUTURE WORK Developing a genetic system in characteristic gut bacteria Improving the health benefit of Lactobacillus plantarum to its host Top-down screen for beneficial genomic mutations in L. plantarum Bottom-up approach to improving growth promotion CLOSING THOUGHTS AND REFLECTIONS REFERENCES vii

LIST OF TABLES

Table 1.1 Consensus PAMs for representative CRISPR-Cas systems
Table 2.S1 Strains, plasmids, and oligonucleotides used in this work
Table 3.S1 Strains, plasmids, and oligonucleotides used in this work
Table 4.S1 Strains, plasmids, and oligonucleotides used in this work

LIST OF FIGURES

Figure 1.1 Function of the CRISPR PAM
Box 1.1 Standardizing the orientation of the CRISPR PAM
Figure 1.2 Two orientations for reporting PAM sequences
Figure 1.3 PAM orientation and targets for different types of CRISPR-Cas systems
Figure 1.4 Methods for PAM determination
Figure 1.5 Reporting the CRISPR PAM
Figure 1.6 Engineering Cas proteins with altered or relaxed PAM recognition
Figure 2.1 Overview of the PAM Screen Achieved by NOT-Gate Repression PAM-SCANR
Figure 2.2 Representing PAMs and their enrichment with the PAM wheel
Figure 2.3 Comprehensive screening of the type I-E CRISPR-Cas system in E. coli
Figure 2.4 Stringency tuning and screening for the S. pyogenes and S. thermophilus type II-A Cas9s
Figure 2.5 Comprehensive screening of the type V-A Cpf1 protein from F. novicida
Figure 2.6 Comprehensive screening of the uncharacterized type I-C system from B. halodurans
Figure 2.S1 The PAM-SCANR platform
Figure 2.S2 Stringency tuning of PAM-SCANR with IPTG
Figure 2.S3 Library coverage and flow cytometry histograms for pre-sorted and post-sorted cultures
Figure 2.S4 PAM-SCANR results for the S. thermophilus CRISPR1 dCas9
Kronas 2.S1-7 Link for interactive Krona plots for PAM wheels
Figure 2.S2.1 Example dataset representing fold enrichment of a PAM library
Figure 2.S2.2 Example dataset of a 2-nucleotide library
Figure 2.S2.3 Example dataset represented in a Krona plot, opened by a web browser
Figure 2.S2.4 Analysis of Example 1 of all sequences that begin with A
Figure 2.S2.5 Krona plot for the E. coli dataset seen in Figure
Figure 2.S2.6 A modified E. coli Krona plot called the PAM wheel
Figure 3.1 Growth-based circuit for a PAM assay using D-xylose selection
Figure 3.2 Optimizing growth conditions for SpyCas9 PAMs
Figure 3.3 Sequencing results for SpyCas9 after two rounds of D-xylose selection
Figure 3.4 Analysis of individual PAMs inserted back into the PAM-SCANR genetic circuit
Figure 3.5 Transcriptional repression of a novel protospacer
Figure 3.S1 Testing a growth-based PAM circuit with ampicillin antibiotic resistance
Figure 3.S2 Media optimization with supplemental carbon sources
Figure 3.S3 PAM wheels for each library enrichment tested
Figure 3.S4 GFP fold enrichment with IPTG for AAGTG and ACGTC PAMs
Figure 4.1 Lactobacillus plantarum mutation after directed evolution
Figure 4.2 Plasmid generation for oligo-based genome editing in Lactobacillus
Figure 4.3 Confirmation of genome-editing plasmid activity in Lactobacillus plantarum
Figure 4.4 Attempted oligo-based editing of the acetate kinase gene in NIZO.G
Figure 4.5 Genome editing in Lactobacillus plantarum with a dsDNA repair template
Figure 4.S1 Larval growth size for a variety of L. plantarum strains after passaging through the Drosophila gut

12 CHAPTER 1 Deciphering, communicating, and engineering the CRISPR PAM Ryan T Leenay, Chase L. Beisel Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh NC Original publication in Journal of Molecular Biology 429(2): ,

13 ABSTRACT Clustered regularly interspaced short palindromic repeat (CRISPR) loci and their flanking CRISPR-associated (cas) genes make up RNA-guided, adaptive immune systems in prokaryotes whose effector proteins have become powerful tools for basic research and biotechnology. While the Cas effector proteins are remarkably diverse, they commonly rely on protospacer-adjacent motifs (PAMs) as the first step in target recognition. PAM sequences are known to vary considerably between systems and have proven to be difficult to predict, spurring the need for new tools to rapidly identify and communicate these sequences. Recent advances have also shown that Cas proteins can be engineered to alter PAM recognition, opening new opportunities to develop CRISPR-based tools with enhanced targeting capabilities. In this review, we discuss the properties of the CRISPR PAM and the emerging tools for determining, visualizing, and engineering PAM recognition. We also propose a standard means of orienting the PAM to simplify how its location and sequence are communicated. 2

1.1 INTRODUCTION

Over the past 11 years, CRISPR (clustered regularly interspaced short palindromic repeats) loci and their associated Cas (CRISPR-associated) proteins have transitioned from curious prokaryotic immune systems to revolutionary tools for fundamental biomolecular research, biotechnology, agriculture, and medicine 1–5. A key driver has been the ease with which the effector proteins of these systems can be utilized as programmable nucleases to specifically bind and cleave selected DNA or RNA sequences. Because these proteins and their guide RNAs are easier and cheaper to implement than all comparable technologies, CRISPR technologies have become a standard for applications in genome editing, gene regulation, and DNA imaging and are being explored for gene drives and sequence-specific antimicrobials. CRISPR-Cas immune systems have also proven to be remarkably diverse, with new and emerging systems poised to further advance existing applications or drive entirely new ones. Despite this diversity, CRISPR-Cas systems rely on a common set of rules for target recognition: complementarity between the guide RNA and the target sequence, and a protospacer-adjacent motif (PAM) flanking the target.

This review details the nature of the PAM and recent efforts to identify, disseminate, and alter the recognized PAM sequences for different CRISPR-Cas systems. We describe the field's current understanding of what defines a PAM as well as available methods to identify and communicate these sequences. We also discuss efforts to re-engineer PAM recognition and generate CRISPR-Cas systems with altered or improved DNA recognition capabilities.

1.2 CRISPR BIOLOGY

CRISPR-Cas systems naturally function as adaptive immune systems that protect bacteria and archaea from foreign genetic material such as bacteriophages or plasmids. The ability to uniquely recognize foreign sequences stems from the CRISPR array, a short stretch of DNA composed of alternating conserved repeats and target-specific spacers 1. Each spacer is directly acquired from a fragment of the foreign genetic material called the protospacer, allowing the CRISPR array to possess heritable memory of the infection 30. The CRISPR array is transcribed and subsequently processed into individual units called CRISPR RNAs (crRNAs). These RNAs associate with the system's Cas effector proteins to form a ribonucleoprotein surveillance complex. Once assembled, the complex scans the cell for PAM sequences. Upon binding a PAM, the complex interrogates the extent of base pairing between the downstream sequence and the spacer portion of the crRNA. Extensive base pairing leads to the Cas proteins cleaving or degrading the target sequence, resulting in the clearance of the foreign invader.

CRISPR-Cas systems possess a diverse compilation of genes and are found throughout the prokaryotic world. Current estimates place CRISPR-Cas systems within ~50% of bacterial genomes and ~90% of archaeal genomes 18, although recent metagenomic sequencing analyses have suggested that the frequency of CRISPR-Cas systems in nature could be much lower 36. Furthermore, only a subset of the identified systems may be active. Roughly 100 protein families have been associated with these systems, where the varying prevalence and co-occurrence of these genes have spurred the development of numerous classification schemes 18,19,37,38. The most recent scheme breaks the systems into two classes, six types, and 19 subtypes 18,19. The two classes are differentiated by whether a protein complex (Class 1) or an individual protein (Class 2) serves as the effector of immune surveillance. The six types (I-VI) are defined by the presence of a signature gene encoding a protein responsible for nucleic-acid cleavage (e.g. Cas3 for Type I systems, Cas9 for Type II systems). The types also differ in their mechanisms of crRNA processing and target recognition, as well as whether the target is DNA (Types I, II, V), RNA (Type VI), or both (Type III) 39,40. The subtypes are named by addition of a letter to the type (e.g. I-A, II-C) and are defined based on the specific cas genes and their configuration. As the vast expanse of CRISPR-Cas systems in the prokaryotic world remains poorly characterized, more unique functions and systems are expected to be discovered and will ultimately yield new CRISPR tools and technologies.

1.3 PAM CHARACTERISTICS

The PAM was first observed in 2008, when Horvath and coworkers noticed conserved sequences that flanked protospacers acquired by Streptococcus thermophilus after being challenged with a lytic bacteriophage 21,22. The following year, Mojica and coworkers uncovered similar motifs for multiple CRISPR-Cas systems through bioinformatics analyses, which established PAMs as generalized features of these systems 23. Each report coined a different name for these motifs, CRISPR motifs and protospacer-adjacent motifs (PAMs), respectively, and the latter became the accepted terminology.
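The class, type, and target assignments described in Section 1.2 can be held in a small lookup structure. The following is only an illustrative sketch: the names `CrisprType`, `CRISPR_TYPES`, `parse_subtype`, and `types_targeting` are hypothetical, the class assignments follow Table 1.1, and Type IV is given no target because the review describes it as uncharacterized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrisprType:
    name: str          # type label, e.g. "II"
    crispr_class: int  # 1 = multi-protein effector complex, 2 = single effector protein
    targets: tuple     # nucleic acids targeted according to the text; empty if unknown


# Values transcribed from Section 1.2 and Table 1.1 of this review.
# Type IV has no listed target because it is described as uncharacterized.
CRISPR_TYPES = {
    "I": CrisprType("I", 1, ("DNA",)),
    "II": CrisprType("II", 2, ("DNA",)),
    "III": CrisprType("III", 1, ("DNA", "RNA")),
    "IV": CrisprType("IV", 1, ()),
    "V": CrisprType("V", 2, ("DNA",)),
    "VI": CrisprType("VI", 2, ("RNA",)),
}


def parse_subtype(subtype: str) -> CrisprType:
    """Return the type record for a subtype label such as 'I-E' or 'II-C'."""
    type_name = subtype.split("-")[0].strip().upper()
    return CRISPR_TYPES[type_name]


def types_targeting(nucleic_acid: str) -> list:
    """List the type labels whose recorded targets include the given nucleic acid."""
    return sorted(name for name, t in CRISPR_TYPES.items() if nucleic_acid in t.targets)
```

For example, `parse_subtype("I-E")` returns the Class 1, DNA-targeting Type I record, and `types_targeting("RNA")` returns Types III and VI.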

The first insights into the function of these motifs came from studies of the Type III CRISPR-Cas system in Staphylococcus epidermidis by Marraffini and Sontheimer 41, who demonstrated that these flanking sequences are essential for self/non-self discrimination (Figure 1.1). They specifically showed that the system uses these flanking sequences to differentiate between the CRISPR array (self) and the foreign invader (non-self), which both harbor a sequence perfectly complementary to the CRISPR RNA spacer. While the mechanism of discrimination appeared to rely on base pairing between the flanking regions of the spacer and protospacer 41, it established the theme that flanking sequences such as PAMs are critical for protospacer selection and target recognition. This insight was upheld as others showed that the PAM was an essential element for target recognition and cleavage 9,42.

Figure 1.1 Function of the CRISPR PAM. CRISPR-Cas systems naturally utilize PAMs to discriminate between self and non-self. The spacer portion of the CRISPR RNA is perfectly complementary to both its own CRISPR array (self) and the protospacer within the foreign invader's genetic material (non-self). The systems differentiate between these two through the flanking PAM that is absent within the CRISPR array (gray) and present within the invader's genetic material (red). (A) The PAM is used by the CRISPR acquisition machinery to select pieces of the foreign invader's DNA as new spacers. (B) Because the PAM is essential for target recognition, the CRISPR-Cas systems will target the invader's genetic material but not the CRISPR array.

Extensive structural and biochemical analyses have helped reveal how the PAM participates in target recognition. Cas effector proteins directly bind the PAM sequence through protein-DNA interactions and subsequently unzip the downstream DNA sequence. The effector proteins then interrogate the extent of base pairing between one strand of the DNA target and the spacer portion of the CRISPR RNA. Sufficient complementarity between the two drives target cleavage. Critically, if a nonfunctional PAM sequence is present, the effector proteins do not interrogate the downstream sequence even if it is perfectly complementary to the spacer. Aside from immunity, the PAM also plays an integral role in spacer acquisition. In this case, the acquisition proteins alone or in coordination with effector proteins recognize defined PAM sequences while acquiring new spacers, ensuring that each new spacer can recognize the invading DNA. As PAMs for acquisition and interference can be different 29,54,55, the associated sequences have been respectively termed spacer acquisition motifs (SAMs) and target interference motifs (TIMs) 56,57. While this distinction is important, we refer to both motifs as PAMs given our primary focus on immune defense and the limited adoption of SAMs and TIMs.

Despite the common role of the PAM in target recognition, its characteristics vary between the different types of CRISPR-Cas systems. One major difference is the location of the PAM. Using the non-target strand of the protospacer as a reference, the PAM is located on the 5′ end of the protospacer for type I and V systems and on the 3′ end of the protospacer for type II systems. Note that the target strand also has been used to specify the PAM 29,43,58,59, creating some confusion about the exact location and sequence of any reported PAM; see Box 1.1 and Figure 1.2 for more information on these orientations and why we recommend the guide-centric orientation used in this review. Figure 1.3 illustrates the location of the PAM for different CRISPR-Cas system types given this orientation. In the case of Type III and Type VI systems, emerging evidence suggests that the PAM is located within the target RNA 39,40. Because of this unique location, the PAM for these systems was renamed the RNA PAM (rPAM) or the protospacer-flanking sequence (PFS), respectively. Given that type III systems are thought to rely on base pairing between the crRNA 5′ handle and the region flanking the target DNA sequence 41, more work is needed to determine whether this mechanism or the rPAM is normally implemented and whether they occur separately or together.

Box 1.1 Standardizing the orientation of the CRISPR PAM. The PAM sequence flanks the protospacer within the target DNA. Because of the double-stranded nature of DNA, only one strand needs to be reported along with its location relative to the protospacer. To date, both strands have been used to report PAM sequences, where the selected strand often trends with the particular type of CRISPR-Cas system. The problems are that consensus PAM sequences are reported without the orientation and that the use of either strand creates confusion about the exact location of the PAM. Here, we describe both orientations, which we term target-centric and guide-centric, and argue for the guide-centric orientation to be universally adopted. Both orientations are illustrated in Figure 1.2. Under the target-centric orientation, the PAM is located on the same strand that base pairs with the guide RNA. In many cases, the PAM on this strand is specifically recognized by the Cas effector proteins 43,58, lending a mechanistic argument to this orientation. The target-centric orientation is regularly employed for Type I systems 29,43,58,59. Under the guide-centric orientation, the PAM is located on the strand that matches the sequence of the guide RNA. This match lends itself to guide-RNA design, where the sequence flanking the identified PAM is used to create the guide portion of the RNA. This orientation is used for type II and V systems 21–23,61,69. While the two orientations are equivalent, we propose the universal adoption of the guide-centric orientation: the Cas9 effector proteins from type II systems are the most widely recognized and published, and the PAMs for these proteins are always reported in the guide-centric orientation. Adopting this orientation would therefore have a smaller impact on the existing body of CRISPR literature.

Figure 1.2 Two orientations for reporting PAM sequences. The PAM sequence is located within double-stranded DNA, where either strand of the PAM can be reported along with its location relative to the protospacer. Under the target-centric orientation, the PAM is reported from the strand that base pairs with the guide RNA. Under the guide-centric orientation, the PAM is reported from the strand that matches the guide RNA and is used for guide-RNA design.
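Because the target-centric and guide-centric orientations report the two strands of the same duplex, a PAM can be converted between them by taking the reverse complement and swapping the reported side of the protospacer. The sketch below illustrates this under the assumption that both strands are written 5′ to 3′; the function names `reverse_complement` and `flip_orientation` are hypothetical, and the complement table uses the standard IUPAC ambiguity codes.

```python
# Complements for the standard IUPAC nucleotide codes (e.g. R = A/G pairs with Y = C/T).
_COMPLEMENT = str.maketrans("ACGTRYMKSWBDHVN", "TGCAYRKMSWVHDBN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence that may contain IUPAC ambiguity codes."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def flip_orientation(pam: str, side: str) -> tuple:
    """Convert a PAM between the guide-centric and target-centric orientations.

    `side` is "5'" or "3'": where the PAM sits relative to the protospacer
    on the strand being reported. Assumes both strands are written 5' to 3'.
    Returns the PAM as read on the opposite strand and its side on that strand.
    """
    if side not in ("5'", "3'"):
        raise ValueError("side must be \"5'\" or \"3'\"")
    other_side = "3'" if side == "5'" else "5'"
    return reverse_complement(pam), other_side
```

For instance, a guide-centric 5′ AAG PAM corresponds to a target-centric 3′ CTT, and applying the function twice returns the original report.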

Figure 1.3 PAM orientation and targets for different types of CRISPR-Cas systems. CRISPR-Cas systems are subdivided into two classes and six types. Representative illustrations of the effector proteins for each type are shown. The dsDNA orientation has been flipped from Figure 1.2 to accommodate the guide-centric PAM orientation. The effector proteins for the V-B and V-C subtypes require a tracrRNA similar to type II systems. For type III and VI systems, the crRNA binds target RNAs. Type III systems cleave transcribed dsDNA due to its proximity to the RNA target 40. Note that the mechanism by which type III and VI systems recognize their nucleic-acid targets is still under investigation. PAM orientations are presented for all systems except for type IV systems, which remain uncharacterized.

Aside from location, the composition of the PAM can vary widely. The composition includes the sequences comprising the PAM, the length of the linker (represented by N's, where N is any one of the four possible bases) separating the protospacer and the sequence-specific portion of the PAM, and the promiscuity in deviating from a defined consensus sequence. As one example, the widely used Type II-A system from Streptococcus pyogenes recognizes an NGG PAM, and to a lesser extent, an NAG PAM 9,23,60. Separately, one Type II-A system from Streptococcus thermophilus recognizes an NNAGAA PAM but has the ability to recognize other sequences such as NNGGAAA and may accommodate changes in its linker length of 2 nucleotides 21,22,61,62. Finally, the Type I-E system from Escherichia coli has one of the most promiscuous PAM recognition capabilities, with at least nine recognized PAM sequences (NAAG, NAGG, NATG, NGAG, NTAG, NAAC, NAAA, NAAT, NATA) and a strong nucleotide preference at the N position for some of these PAMs 29,58,63,64. Table 1.1 contains representative consensus sequences for the most active PAMs for each characterized subtype. Given that PAMs can vary widely even within a given subtype 21–23,65, more work is needed to fully interrogate the diversity of PAMs in nature.

Table 1.1 Consensus PAMs for representative CRISPR-Cas systems. The consensus PAM sequences represent the most active PAM sequences and are reported using the guide-centric orientation (see Box 1.1). Dashes indicate a subtype lacking any system with a characterized PAM.

Class  Subtype  Source organism                       PAM location  Consensus sequence*  Reference
1      I-A      P. furiosus                           5′            YCN                  59
1      I-B      C. difficile                          5′            CCW                  87
1      I-C      B. halodurans                         5′            YYC                  64
1      I-D      Cyanothece sp. PCC                    -             -                    -
1      I-E      E. coli                               5′            AWG                  63,64
1      I-F      P. aeruginosa                         5′            CC                   88
1      I-U      G. sulfurreducens                     -             -                    -
1      III-A    S. epidermidis                        None          None                 18
1      III-B    P. furiosus                           3′            MMA (RNA PAM)        40
1      III-C    M. thermautotrophicus str. Delta H    -             -                    -
1      III-D    Roseiflexus sp. RS                    -             -                    -
1      IV       A. ferrooxidans                       -             -                    -
2      II-A     S. pyogenes                           3′            NGG                  9,23,60
2      II-A     S. thermophilus (CRISPR1)             3′            NNAGAA               21,22,61
2      II-A     S. aureus                             3′            NNGRRT               65,76
2      II-B     L. pneumophila str. Paris             -             -                    -
2      II-C     N. meningitidis                       3′            NNNNGWWT             61
2      V-A      F. novicida                           5′            TTN                  64,69,73
2      V-B      A. acidoterrestris                    5′            TTN                  19
2      V-C      Metagenomic datasets                  -             -                    -
2      VI       L. shahii                             3′            D (RNA PAM)          39

*B = C, T, G; D = A, G, T; M = A, C; W = A, T; Y = C, T.
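The consensus sequences in Table 1.1 use degenerate nucleotide codes and a 5′ or 3′ location, which together are enough to search one strand of a DNA sequence for candidate protospacers. The following is a minimal sketch rather than an established tool: the function names, the default protospacer length of 20 nucleotides, and the choice to scan only the supplied strand are all assumptions made for illustration. It uses the standard IUPAC code table, which also covers codes such as R that appear in the table but not in its footnote.

```python
import re

# Standard IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def pam_to_regex(pam: str) -> str:
    """Translate a consensus PAM such as 'NNGRRT' into a regular expression."""
    return "".join(IUPAC[base] for base in pam.upper())


def find_protospacers(seq: str, pam: str, side: str, length: int = 20) -> list:
    """Find candidate protospacers on the given strand (guide-centric orientation).

    side="3'" expects protospacer followed by PAM (e.g. type II systems);
    side="5'" expects PAM followed by protospacer (e.g. type I and V systems).
    `length` is an assumed protospacer length. Returns a list of
    (protospacer_start, protospacer, matched_pam) tuples, overlaps included.
    """
    seq = seq.upper()
    pam_re = pam_to_regex(pam)
    proto_re = "[ACGT]{%d}" % length
    if side == "3'":
        pattern, proto_group, pam_group = "(?=(%s)(%s))" % (proto_re, pam_re), 1, 2
    elif side == "5'":
        pattern, proto_group, pam_group = "(?=(%s)(%s))" % (pam_re, proto_re), 2, 1
    else:
        raise ValueError("side must be \"5'\" or \"3'\"")
    # The lookahead lets overlapping sites be reported.
    return [
        (m.start(proto_group), m.group(proto_group), m.group(pam_group))
        for m in re.finditer(pattern, seq)
    ]
```

A complete search would also scan the reverse complement of the sequence, since a protospacer can lie on either strand.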

1.4 PAM DETERMINATION

The PAM is an essential feature of CRISPR-Cas systems, whether for the biological function of the system or for harnessing the system as a biomolecular tool. Determining the full set of functional PAM sequences, however, has proven difficult. This challenge has spurred the development of multiple methods to determine PAMs (Figure 1.4). Below, we describe the available methods along with their particular advantages and disadvantages. While all of these methods reproduce the same highly active PAM sequences, they can often identify differing suboptimal PAMs, which can impact target selection and off-target predictions. Thus, the best option for PAM determination will likely depend on the particular CRISPR-Cas system and its end use.


26 1.4.1 Protospacer identification in silico Mojica and coworkers introduced the first means of identifying PAMs as part of their original observation of this motif 23. Under this method, each spacer sequence from a natural CRISPR array is subjected to a nucleotide BLAST search for homologous sequences 66. Strong matches that appear to be derived from bacteriophages or plasmids are compiled, and the flanking sequences are aligned to discern a general motif. This method offers a rapid means of identifying potential functional PAM sequences that can be evaluated experimentally. With the recent development of automated online tools such as CRISPRTarget 67, the analysis can be completed in less than a day. One disadvantage is a strong dependence on available sequencing information, which may not contain the associated invader sequences. Another is the challenge of deciding whether a BLAST hit represents the true source of the spacer, particularly when mismatches are present between the spacer and the Figure 1.4 Methods for PAM determination. (A) In silico PAM determination. A BLAST search of metagenomic databases identifies potential protospacers. Matches found on foreign nucleic acid elements from bacteriophages and plasmids are aligned to elucidate a single consensus sequence. (B) PAM determination through plasmid clearance. Plasmids harboring a library of potential PAM sequences are transformed into cells expressing the Cas proteins and the targeting guide RNA, and functional PAM sequences are identified based on their depletion from the library. (C) PAM determination through bacteriophage clearance. A library of guide RNAs is designed so their targets tile along a lytic bacteriophage genome. Guide RNAs targeting a site flanked by a functional PAM protect the bacteriophage attack, as revealed by sequencing the enriched guide RNAs and mapping their locations and flanking sequences within the bacteriophage genome. (D) In vitro PAM determination through DNA cleavage. 
A large PAM library is incubated with purified Cas proteins and transcribed crrnas in a reaction buffer. Functional PAMs within this library are cleaved, and a sequencing adapter (shown in orange) can then be ligated for high-throughput sequencing. Alternatively, intact DNA can be sequenced, revealing PAM sequences that were depleted because of DNA cleavage. (E) In vivo PAM determination through DNA binding. Catalytically dead effector proteins are targeted to a PAM library upstream of the lac operon regulating GFP expression. If a functional PAM sequence is present, binding by the catalytically dead effector proteins represses expression of LacI, resulting in de-repression of GFP. Fluorescence cells are then isolated by fluorescent-activated cell sorting and sequenced. 15

27 putative protospacer sequence. Third, even if the protospacer is the source of the spacer, the protospacer may have accumulated mutations in the PAM. Finally, the PAMs associated with acquisition can represent a subset of those that elicit targeting 48,49,68, giving a narrow impression of the PAMs that elicit interference. For these many reasons, the in silico method represents a convenient starting point but often lacks sufficient sequence information and is less suited to obtain a comprehensive picture of the PAM for a given system Plasmid clearance in vivo The second and most common method screens for functional PAM sequencing using the natural ability of CRISPR-Cas systems to clear foreign genetic material. This method utilizes an effector protein that is either already present in its native host 59 or is imported into a convenient host such as Escherichia coli 60,61,69. To generate potential PAM sequences, a randomized nucleotide library is inserted next to a target sequence within a plasmid. Plasmids harboring the PAM library are transformed into the host and then subjected to next-generation sequencing. Any plasmids harboring functional PAM sequences would be cleared by the Cas effector proteins, resulting in a substantially lower frequency in the library. This method has been the most widely used because it recapitulates the natural function of CRISPR-Cas systems and has the potential to comprehensively determine all functional PAM sequences. One disadvantage is that the method employs a negative selection, which requires extensive library coverage to identify the missing sequences. Furthermore, because the screen is in vivo, there is a general limit on the library size, and the Cas proteins 16

must be functionally expressed in a non-native host. Finally, the readout of this method is the frequency of PAM escape, whether by mutating the target or by promoting DNA repair. This is particularly problematic for less-active PAM sequences, which may translate poorly to other CRISPR-based applications such as genome editing or gene regulation 54,64,70.

One recent variation on this method generates a library of guide RNAs that tile along the genome of a lytic bacteriophage 39. The host harboring the CRISPR-Cas system is transformed with the plasmids harboring the guide-RNA library and is then infected with the bacteriophage. Cells survive if the target location on the bacteriophage genome is flanked by a functional PAM, resulting in the enrichment of guide RNAs targeting functional PAM sequences. This strategy was applied to elucidate the PFS for the RNA-targeting Type VI effector protein C2c2 using the single-stranded RNA bacteriophage MS2 39. By selecting for cells targeting protospacers flanked by a functional PFS, the screen provided a positive selection. The major limitations of this variation are that the guide-RNA library is much smaller than the equivalent PAM library, not all possible PAMs may be sufficiently represented within the bacteriophage genome, and guide-RNA libraries are much more expensive to generate than PAM libraries. Therefore, this strategy offers some unique advantages over the traditional method if a sufficiently large library can be generated and the associated costs are acceptable.
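As a rough illustration of how the negative selection in a plasmid-clearance screen might be scored, the sketch below compares the frequency of each PAM before and after clearance and reports a log2 depletion score. The function name, the pseudocount, and the read counts are hypothetical and chosen only for illustration; they are not taken from any of the cited studies.

```python
import math

def depletion_scores(pre_counts, post_counts, pseudocount=1.0):
    """Return log2(pre frequency / post frequency) per PAM.

    Positive scores mean the PAM was depleted after clearance.
    The pseudocount (an assumption) avoids division by zero for
    PAMs that disappear entirely from the post-selection library.
    """
    pams = set(pre_counts) | set(post_counts)
    pre_total = sum(pre_counts.values()) + pseudocount * len(pams)
    post_total = sum(post_counts.values()) + pseudocount * len(pams)
    scores = {}
    for pam in pams:
        pre_freq = (pre_counts.get(pam, 0) + pseudocount) / pre_total
        post_freq = (post_counts.get(pam, 0) + pseudocount) / post_total
        scores[pam] = math.log2(pre_freq / post_freq)
    return scores

# Hypothetical read counts for a four-member toy library.
pre = {"AGG": 1000, "TGG": 1000, "AAA": 1000, "CCC": 1000}
post = {"AGG": 10, "TGG": 20, "AAA": 1900, "CCC": 2000}

scores = depletion_scores(pre, post)
for pam, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pam, round(score, 2))
```

Because the signal is the absence of reads, a PAM that is poorly covered in the starting library cannot be distinguished from one that was cleared, which is one way to see why negative selections require extensive library coverage.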

1.4.3 DNA cleavage in vitro

In contrast to the in vivo plasmid-clearance methods described above, methods have been developed to determine PAMs under in vitro conditions. These methods involve an in vitro cleavage reaction that combines purified Cas effector proteins, in vitro-transcribed guide RNAs, and a target DNA library of potential PAM sequences. Following the cleavage reaction, the PAM library is subjected to next-generation sequencing. The reaction can be conducted as a positive screen by ligating adapters to the ends of cleaved library members 70,71, or as a negative screen by sequencing the intact library members 69. In vitro methods offer several advantages over in vivo methods. For instance, the screened library can be multiple orders of magnitude larger for in vitro screens than for in vivo screens because there are no limitations from transformation efficiencies or cloning in vitro. Furthermore, in vitro reactions grant exquisite control over the assay conditions, such as component concentrations, reaction temperature, and reaction time. Finally, the ability to sequence the cleavage products represents a positive screen that can reliably identify functional PAM sequences. One downside is that reconstituting the complete system requires complete knowledge of the required components as well as protein purification. Another is that the ligation step requires a double-stranded break, which excludes type I systems that cleave and degrade the target. Finally, assay conditions can deviate from the cellular environment, whether in the buffer conditions, the relative stoichiometry of Cas effector proteins and targets, or the reaction times. This deviation can yield artificial PAM assignments, as recently highlighted when Karvelis and coworkers 71 varied the stoichiometry of Cas9 and the target DNA. These in vitro

methods thus provide powerful screens to comprehensively determine PAMs for many CRISPR-Cas systems, although the resulting PAMs may not translate well in vivo.

1.4.4 DNA binding in vivo

The high-throughput experimental methods described above all relied on target cleavage. However, the ability to generate catalytically-dead Cas effector proteins affords the development of PAM determination methods based on DNA binding. We recently reported an in vivo DNA-binding method that utilizes catalytically-dead effector proteins to regulate the expression of GFP in E. coli 64. As part of the screen, binding by the catalytically-dead effector proteins blocks expression of the LacI repressor, which would otherwise block expression of GFP. Based on this configuration, cells harboring a functional PAM sequence fluoresce, allowing these cells to be isolated by fluorescence-activated cell sorting and subjected to next-generation sequencing. The unique benefits of this method include a positive screen based on the expression of GFP only in the presence of a functional PAM, and the ability to tune the assay stringency by titrating in the LacI inhibitor IPTG. One limitation of the method is the need to identify and inactivate the catalytic domains of the desired CRISPR-Cas systems, particularly if the systems have not undergone initial characterization. Another is the limited library size that can be screened, as with any in vivo screen. Finally, the absence of nuclease activity can yield PAMs that promote efficient binding but not efficient cleavage, even though the PAMs elucidated to date using this method closely aligned with those determined by cleavage-based methods 61,63,64,69. The in vivo DNA binding method thus may be

best suited for better understanding the biophysics of DNA recognition or for applications centered on DNA binding or gene regulation. PAM determination methods that rely on DNA binding in vitro are also under development and could introduce the advantages of in vitro screens.

1.5 PAM REPORTING

PAM determination methods often generate a large collection of functional PAM sequences that vary in their extent of enrichment (for positive screens) or depletion (for negative screens). The question is how best to report this information without providing the complete list of all sequences and their associated enrichment or depletion scores. A number of reporting schemes have been described, each of which is illustrated in Figure 1.5 using published PAM determination data for the widely used Type II-A system from S. pyogenes and the well-characterized Type I-E system from Escherichia coli 64,72. As illustrated in this figure, each reporting scheme manages a trade-off between simplicity and information content.

Figure 1.5 Reporting the CRISPR PAM. (A-C) Visualizing the PAM for the S. pyogenes type II-A system. The reported PAMs are based on data from the plasmid-clearance method conducted by Kleinstiver and coworkers 72. (D-F) Visualizing the PAM for the E. coli type I-E system. The reported PAMs are based on data from the DNA-binding method conducted by Leenay and coworkers 64. (A,D) The consensus sequence and sequence logo. The consensus sequence represents a compilation of all highly-active functional PAM sequences. The sequence logo displays the sequence conservation of each base at each position in functional PAM sequences 89. The red box demonstrates the PAM location assuming a guide-centric orientation (see Figure 1.2). (B,E) The PAM table. The table displays the enrichment or depletion scores from the conducted screen for each possible sequence. Higher values represent more active PAMs. PAMs with similar activities are colored. (C,F) The PAM wheel. Individual sequences are read following a radius of the circle. The arrow shows the orientation of each base in relation to the protospacer. The larger the arc occupied by the sequence, the greater its enrichment or depletion score. The outer ring depicts a common functional PAM sequence. PAM wheels are also available as interactive .html files 64.

Sequence logos and consensus sequences have been the most common reporting schemes to date, starting with the original discovery of the PAM 21,22.

Sequence logos display the relative prevalence or activity of each base within each position: more prevalent or higher scoring bases appear as larger letters, while less prevalent or lower scoring bases appear as smaller letters. Consensus sequences report a single sequence that captures the most dominant set of functional PAM sequences. For instance, the consensus sequence for the S. pyogenes Cas9 is NGG, while the sequence logo shows a small but notable A in the middle position reflecting partial activity of an NAG PAM 23,60. Conversely, the consensus sequence for the promiscuous E. coli I-E system has been reported as AWG (where W is A or T), while the sequence logo shows other, smaller letters at each of the three positions 23,58. Both schemes are easy to interpret and can be expanded to any sequence length. However, they also sacrifice individual functional PAM sequences and their relative activity. This loss of information is manageable if the PAMs are simple or if only the most active functional PAM sequences are required for the final use of the interrogated CRISPR-Cas system, such as for designing highly active guide RNAs. However, by discarding or masking less active functional PAMs, these reporting schemes can obscure the true number of sites that can be targeted by the system and hinder the prediction of off-target effects. One workaround is reporting multiple consensus sequences that are classified as more or less active 61, although this scheme still misses the full range of functional PAM sequences and activities. In cases where the PAM is relatively simple (e.g., NGG for the S. pyogenes Cas9) or the user is only interested in the most active PAMs, sequence logos and consensus sequences are sufficient.
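To make the information loss concrete, the following minimal sketch computes per-position base frequencies and information content (the quantity that typically sets letter heights in a sequence logo) and calls a simple consensus from them. The toy PAM list, the activity weights, the 0.9 threshold, and the function names are all assumptions for illustration, and no small-sample correction is applied.

```python
import math
from collections import Counter

BASES = "ACGT"

def position_information(pams, weights=None):
    """Return (base frequencies, information content in bits) per position."""
    if weights is None:
        weights = [1.0] * len(pams)
    columns = []
    for i in range(len(pams[0])):
        totals = Counter()
        for pam, weight in zip(pams, weights):
            totals[pam[i]] += weight
        total = sum(totals.values())
        freqs = {base: totals[base] / total for base in BASES}
        # Shannon entropy of the column; 2 bits is the maximum for DNA.
        entropy = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
        columns.append((freqs, 2.0 - entropy))
    return columns

def consensus(columns, threshold=0.9):
    """Call a base where its frequency reaches the threshold, otherwise N."""
    calls = []
    for freqs, _ in columns:
        base, freq = max(freqs.items(), key=lambda kv: kv[1])
        calls.append(base if freq >= threshold else "N")
    return "".join(calls)

# Hypothetical toy data: four fully active NGG PAMs plus a weakly active AAG.
pams = ["AGG", "CGG", "GGG", "TGG", "AAG"]
activities = [1.0, 1.0, 1.0, 1.0, 0.2]

columns = position_information(pams, activities)
print(consensus(columns))
for position, (freqs, bits) in enumerate(columns, start=1):
    shown = {b: round(f, 2) for b, f in freqs.items() if f > 0}
    print(position, round(bits, 2), shown)
```

In this toy case the consensus collapses to NGG, while the per-position frequencies still show a minor A at the middle position. Neither output records that the A came from one specific sequence (AAG), which is the kind of sequence-level detail both schemes discard.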

A separate scheme that better captures individual functional PAM sequences and activities can be described as a PAM table (Figure 1.5B,E). The table is similar to a codon table in which each base of the codon appears along the edges and each of the 64 cells represents a distinct sequence. In the case of the PAM table, each cell conveys the activity or enrichment/depletion score for that specific PAM sequence. Cells with similar activities or scores can be colored, although the groupings are somewhat arbitrary. The upside of the table is that it displays all PAM sequences and activities in a relatively compact format, as illustrated for PAMs recognized by the S. pyogenes Cas9 and by multiple CRISPR-Cas systems present in Pyrococcus furiosus 59,60. The major downside is that the table is difficult to expand beyond three bases, greatly limiting the size of a PAM library that can be reported.

We recently developed a reporting scheme called the PAM wheel that also captures individual sequences and activities and can convey larger PAM libraries 64 (Figure 1.5C,F). PAM wheels are derived from interactive Krona plots 75. Each PAM sequence is read by moving along a radius of the circle, where the accompanying arrow specifies the location of each base in relation to the protospacer. The relative activity of each PAM sequence scales with the size of the radial arc. The outer, black rings designate functional PAM sequences. As a Krona plot, any sector of the wheel can be expanded to better view a defined subset of sequences, such as PAMs that begin with a G. The PAM wheel is more difficult to interpret than traditional schemes, although it is the only available scheme that is extensible to longer PAM lengths and fully preserves individual PAM sequences and activities. These individual activities communicate a comprehensive picture of the PAM landscape that is otherwise

obscured by the other available reporting schemes. For instance, the PAM wheel revealed a potential two-base linker for the S. pyogenes Cas9 and strong base preferences at the -4 position for some functional PAM sequences for the E. coli Type I-E system (Figure 1.5). The most notable disadvantage is that the PAM wheel can be difficult to interpret, particularly in comparison to the other described schemes. There is still room for additional reporting schemes that effectively capture and convey the higher-order information content of the PAM, and it will be up to the CRISPR research community to settle on a common scheme or set of schemes to convey PAM sequences and activities.

1.6 PAM ENGINEERING

While the PAM remains an essential feature for self/non-self discrimination, it also restricts which sequences can be targeted by a given CRISPR-Cas system and impacts the likelihood of off-target effects. Accordingly, there has been intense interest in modifying Cas proteins, particularly the widely used Cas9 effector proteins, to change the recognized PAM 68,72,76,77. This interest has been fueled by the growing number of structural studies that have pinpointed the PAM-interacting domains and how these domains interact with the PAM as part of target recognition. The PAM-interacting domains are relatively modular, allowing Nishimasu and coworkers 45 to swap these domains between the closely related S. pyogenes Cas9 and S. thermophilus CRISPR3 Cas9 proteins, thereby changing each Cas9's PAM recognition (Figure 1.6). Rationally mutating the protein residues involved in PAM binding has proven more difficult, as illustrated for the S. pyogenes Cas9 9, although

Anders and coworkers 77 successfully changed PAM recognition using a variant of this protein. Figure 1.6 Engineering Cas proteins with altered or relaxed PAM recognition.

(A) Changing PAM specificity for the Cas9 effector protein. This protein was engineered to recognize an alternative PAM but not the original PAM, thereby changing which sequences can be targeted by this Cas9. (B) Relaxing PAM specificity for the Cas1 and Cas2 acquisition proteins. These proteins still recognize the original PAM as well as additional PAMs, thereby broadening the sequences that can be acquired with these proteins.

The most successful PAM engineering efforts to date combined random mutagenesis of the DNA binding residues or the entire protein with a high-throughput dual selection. Kleinstiver and coworkers applied this combination to change PAM recognition for the S. pyogenes Cas9 and the compact Staphylococcus aureus Cas9. These efforts generated a variant of the S. pyogenes Cas9 with only three point mutations (Cas9 VQR) that changed the recognized PAM from NGG to NGA. The new Cas9 exhibited a lower propensity for off-target effects for the selected targets 72

despite more promiscuous PAM recognition 64. These efforts also generated a variant of the S. aureus Cas9 that relaxed its recognized PAM consensus sequence from NNGRRT (where R is A or G) to NNNRRT 76. Relaxing the requirement for a G did not impact the propensity for off-target effects despite the expectation that a more flexible PAM would reduce its contribution to targeting specificity 76,81. Applying similar strategies to other Cas effector proteins such as Cpf1 could generate an expanded set of proteins that are no longer limited to their natural PAM sequences, thereby expanding the types of sequences that can be targeted for genome editing and other applications.

Shipman and coworkers 68 recently extended PAM engineering to the acquisition proteins Cas1 and Cas2 from the E. coli Type I-E system. These proteins function together to integrate new spacers into the CRISPR array. While the work by Shipman and coworkers sought to utilize these proteins to incorporate synthetic spacers for long-term memory storage, the two proteins were also engineered to relax their PAM recognition requirements. The engineering was performed using error-prone PCR, followed by integrating spacers from non-canonical PAMs 68. The resulting variants of Cas1 and Cas2 still slightly preferred an AAG PAM, although to a lesser extent than the wild-type versions. Taken together, these accomplishments highlight the potential for engineering diverse Cas effector proteins and acquisition proteins for tightened, relaxed, or non-native PAM recognition.
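The practical effect of relaxing a consensus such as NNGRRT to NNNRRT can be estimated by counting how many concrete sequences each degenerate pattern covers. The sketch below expands IUPAC codes and reports the fraction of all possible sequences matched; it assumes uniformly random base composition, considers only the consensus rather than measured activities, and uses helper names invented for this example.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand(pattern):
    """List every concrete sequence covered by a degenerate PAM pattern."""
    return ["".join(bases) for bases in product(*(IUPAC[c] for c in pattern))]

def fraction_matched(pattern):
    """Fraction of all sequences of this length that match the pattern."""
    return len(expand(pattern)) / 4 ** len(pattern)

for label, pattern in [("NNGRRT consensus", "NNGRRT"),
                       ("NNNRRT consensus", "NNNRRT")]:
    count = len(expand(pattern))
    print(label, count, "sequences,", fraction_matched(pattern))
```

Under these assumptions, NNGRRT covers 64 of the 4,096 possible six-base sequences (1 in 64) and NNNRRT covers 256 (1 in 16), a fourfold increase in candidate sites per strand.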

1.7 PERSPECTIVES

Our knowledge of what comprises a PAM has been overwhelmingly shaped by a few well-characterized Cas effector proteins. However, these few proteins stand in contrast to the large diversity of systems currently spanning 19 subtypes and thousands of individual proteins 18,19 (Table 1.1). Furthermore, characterization of different Cas9 proteins from Type II-A systems revealed large variation in the PAM: for instance, those for Cas9 proteins from S. pyogenes (NGG PAM), S. thermophilus (NNAGAA PAM), and S. aureus (NNGRRT PAM). We predict that characterizing multiple systems within other subtypes has the potential to reveal similarly diverse PAMs. By performing bioinformatics and structural analyses of these effector proteins, we could learn how PAM recognition domains uniquely recognize different PAM sequences, informing future efforts to engineer PAM recognition. Elucidating PAMs within different subtypes could also reveal the full range of PAM sequences and lengths present in nature, highlighting which sequences may be more readily accessible through PAM engineering.

High-throughput methods have been invaluable tools to determine PAMs for varying CRISPR-Cas systems. One consistent insight from these methods is the plasticity of the PAM. Rather than a static sequence that is either correct or incorrect, the PAM comprises a range of sequences with varying activities. This is highlighted by the type V-A Cpf1 effector protein from Francisella novicida, which has a reported PAM consensus of NTTN but exhibits clear bias within the first T and both Ns 64,69,73. Another unique insight is the co-dependency between the PAM and the protospacer. Despite the general assumption that these entities operate independently, separate

studies with the type I-E system from E. coli have shown that the relative activity of different PAMs depended on the specific target sequence 54,64. Whether this co-dependency extends to other systems remains to be explored, and testing for it may need to become a standard part of any attempt at high-throughput PAM determination. Finally, there is evidence that PAM determination methods can predict slightly different PAMs for the same CRISPR-Cas system. For instance, characterization of the S. thermophilus CRISPR1 Cas9 using an in vivo plasmid-clearance assay, an in vitro cleavage assay, and an in vivo binding assay revealed different suboptimal PAMs 64,71,72. While these differences may be attributed to the selected PAM reporting schemes, they highlight how the available PAM determination methods can yield disparate results. These differences can be important, as elucidated PAMs may not be universally relevant across all applications. As a result, it may be best to choose the PAM determination method that best aligns with the final application (e.g., in vitro cleavage assays for genome editing; binding assays for transcriptional regulation).

The ability to engineer PAM recognition holds great potential for CRISPR technologies. One potential outcome is revising the process of guide-RNA design. The current process involves identifying a PAM within a genetic locus and then selecting the flanking sequence as the target. However, PAM engineering could be used to generate a collection of effector proteins that together recognize all possible PAM sequences. For instance, the collection could include 15 derivatives of the S. pyogenes Cas9 that each recognize a variant of the NGG consensus PAM. Separately, the PAM could be lengthened to recognize a more specific sequence. These mutations could be combined with those known to better reject mismatches

between the guide RNA and the target 85,86, potentially yielding highly specific Cas effector proteins with negligible off-target effects.
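The current guide-design process described in this section (find a PAM in the locus, then take the flanking sequence as the target) can be sketched in a few lines. The example below scans one strand for a 20-nt protospacer followed on its 3' side by NGG, which is the arrangement generally described for the S. pyogenes Cas9. The function name, the toy locus, and the single-strand simplification are assumptions. The second part enumerates the 15 PAMs implied by "15 derivatives," under the assumed reading that each derivative recognizes N followed by one of the non-GG dinucleotides.

```python
import re
from itertools import product

def find_targets(locus, pam_regex="[ACGT]GG", spacer_len=20):
    """Return (start, protospacer, PAM) for each site on the given strand.

    A lookahead is used so that overlapping candidate sites are all reported.
    Only the supplied strand is scanned (a simplification).
    """
    locus = locus.upper()
    pattern = re.compile("(?=([ACGT]{%d})(%s))" % (spacer_len, pam_regex))
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(locus)]

# Hypothetical toy locus containing a single NGG-flanked site.
locus = "A" * 20 + "TGG" + "C" * 5
sites = find_targets(locus)
for start, protospacer, pam in sites:
    print(start, protospacer, pam)

# Assumed interpretation: N followed by every dinucleotide except GG.
variant_pams = ["N" + a + b for a, b in product("ACGT", repeat=2) if a + b != "GG"]
print(len(variant_pams), variant_pams)
```

With such a collection, the design order could in principle be reversed: choose the target first, then select whichever effector protein recognizes the PAM that happens to flank it, for example by calling `find_targets` with a different `pam_regex` for each variant.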

REFERENCES

1. Barrangou, R. et al. CRISPR provides acquired resistance against viruses in prokaryotes. Science 315, (2007).
2. Doudna, J. A. & Charpentier, E. The new frontier of genome engineering with CRISPR-Cas9. Science 346, (2014).
3. Hsu, P. D., Lander, E. S. & Zhang, F. Development and applications of CRISPR-Cas9 for genome engineering. Cell 157, (2014).
4. van der Oost, J., Westra, E. R., Jackson, R. N. & Wiedenheft, B. Unravelling the structural and mechanistic basis of CRISPR-Cas systems. Nat. Rev. Microbiol. 12, (2014).
5. Barrangou, R. & Doudna, J. A. Applications of CRISPR technologies in research and beyond. Nat. Biotechnol. 34, (2016).
6. Mali, P. et al. RNA-guided human genome engineering via Cas9. Science 339, (2013).
7. Cong, L. et al. Multiplex genome engineering using CRISPR/Cas systems. Science 339, (2013).
8. Bikard, D. & Marraffini, L. A. Control of gene expression by CRISPR-Cas systems. F1000Prime Rep. 5, (2013).
9. Jinek, M. et al. A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science 337, (2012).
10. Luo, M. L., Mullis, A. S., Leenay, R. T. & Beisel, C. L. Repurposing endogenous type I CRISPR-Cas systems for programmable gene repression. Nucleic Acids Res. 43, (2014).
11. Esvelt, K. M., Smidler, A. L., Catteruccia, F. & Church, G. M. Concerning RNA-guided gene drives for the alteration of wild populations. Elife e03401, (2014).
12. Hammond, A. et al. A CRISPR-Cas9 gene drive system targeting female reproduction in the malaria mosquito vector Anopheles gambiae. Nat. Biotechnol. 34, (2016).
13. Bikard, D. et al. Exploiting CRISPR-Cas nucleases to produce sequence-specific antimicrobials. Nat. Biotechnol. 32, (2014).
14. Gomaa, A. A. et al. Programmable removal of bacterial strains by use of genome-targeting CRISPR-Cas systems. MBio 5, (2014).
15. Anton, T., Bultmann, S., Leonhardt, H. & Markaki, Y. Visualization of specific DNA sequences in living mouse embryonic stem cells with a programmable fluorescent CRISPR/Cas system. Nucleus 5, (2014).

16. Chen, B. et al. Dynamic imaging of genomic loci in living human cells by an optimized CRISPR/Cas system. Cell 155, (2013).
17. Mohanraju, P. et al. Diverse evolutionary roots and mechanistic variations of the CRISPR-Cas systems. Science 353, aad5147 (2016).
18. Makarova, K. S. et al. An updated evolutionary classification of CRISPR-Cas systems. Nat. Rev. Microbiol. 13, (2015).
19. Shmakov, S. et al. Discovery and functional characterization of diverse class 2 CRISPR-Cas systems. Mol. Cell 60, 1-13 (2015).
20. Luo, M. L., Leenay, R. T. & Beisel, C. L. Current and future prospects for CRISPR-based tools in bacteria. Biotechnol. Bioeng. 113, (2016).
21. Horvath, P. et al. Diversity, activity, and evolution of CRISPR loci in Streptococcus thermophilus. J. Bacteriol. 190, (2008).
22. Deveau, H. et al. Phage response to CRISPR-encoded resistance in Streptococcus thermophilus. J. Bacteriol. 190, (2008).
23. Mojica, F. J. M., Diez-Villasenor, C., Garcia-Martinez, J. & Almendros, C. Short motif sequences determine the targets of the prokaryotic CRISPR defence system. Microbiology 155, (2009).
24. Garneau, J. E. et al. The CRISPR/Cas bacterial immune system cleaves bacteriophage and plasmid DNA. Nature 468, (2010).
25. Gasiunas, G., Barrangou, R., Horvath, P. & Siksnys, V. Cas9-crRNA ribonucleoprotein complex mediates specific DNA cleavage for adaptive immunity in bacteria. Proc. Natl. Acad. Sci. U. S. A. 109, E (2012).
26. Hale, C. R. et al. RNA-guided RNA cleavage by a CRISPR RNA-Cas protein complex. Cell 139, (2009).
27. Jinek, M. et al. RNA-programmed genome editing in human cells. Elife 2013, 1-9 (2013).
28. Jore, M. M. et al. Structural basis for CRISPR RNA-guided DNA recognition by Cascade. Nat. Struct. Mol. Biol. 18, (2011).
29. Westra, E. R. et al. CRISPR immunity relies on the consecutive binding and degradation of negatively supercoiled invader DNA by Cascade and Cas3. Mol. Cell 46, (2012).
30. Mojica, F. J. M., Díez-Villaseñor, C., García-Martínez, J. & Soria, E. Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements. J. Mol. Evol. 60, (2005).

31. Jansen, R., Van Embden, J. D. A., Gaastra, W. & Schouls, L. M. Identification of genes that are associated with DNA repeats in prokaryotes. Mol. Microbiol. 43, (2002).
32. Bolotin, A., Quinquis, B., Sorokin, A. & Dusko Ehrlich, S. Clustered regularly interspaced short palindrome repeats (CRISPRs) have spacers of extrachromosomal origin. Microbiology 151, (2005).
33. Carte, J., Wang, R., Li, H., Terns, R. M. & Terns, M. P. Cas6 is an endoribonuclease that generates guide RNAs for invader defense in prokaryotes. Genes Dev. 22, (2008).
34. Brouns, S. J. J. et al. Small CRISPR RNAs guide antiviral defense in prokaryotes. Science 321, (2008).
35. Deltcheva, E., Chylinski, K., Sharma, C. M. & Gonzales, K. CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III. Nature 471, (2011).
36. Burstein, D. et al. Major bacterial lineages are essentially devoid of CRISPR-Cas viral defence systems. Nat. Commun. 7, (2016).
37. Makarova, K. S. et al. Evolution and classification of the CRISPR-Cas systems. Nat. Rev. Microbiol. 9, (2011).
38. Chylinski, K., Makarova, K. S., Charpentier, E. & Koonin, E. V. Classification and evolution of type II CRISPR-Cas systems. Nucleic Acids Res. 42, (2014).
39. Abudayyeh, O. O. et al. C2c2 is a single-component programmable RNA-guided RNA-targeting CRISPR effector. Science 353, aaf5573 (2016).
40. Elmore, J. R. et al. Bipartite recognition of target RNAs activates DNA cleavage by the Type III-B CRISPR-Cas system. Genes Dev. 30, (2016).
41. Marraffini, L. A. & Sontheimer, E. J. Self versus non-self discrimination during CRISPR RNA-directed immunity. Nature 463, (2010).
42. Wiedenheft, B. et al. RNA-guided complex from a bacterial immune system enhances target recognition through seed sequence interactions. Proc. Natl. Acad. Sci. 108, (2011).
43. Mulepati, S., Héroux, A. & Bailey, S. Crystal structure of a CRISPR RNA-guided surveillance complex bound to a ssDNA target. Science 345, (2014).
44. Jinek, M. et al. Structures of Cas9 endonucleases reveal RNA-mediated conformational activation. Science 343, (2014).
45. Nishimasu, H. et al. Crystal structure of Cas9 in complex with guide RNA and target DNA. Cell 156, (2014).

46. Anders, C., Niewoehner, O., Duerst, A. & Jinek, M. Structural basis of PAM-dependent target DNA recognition by the Cas9 endonuclease. Nature 513, (2014).
47. Sternberg, S. H., Redding, S., Jinek, M., Greene, E. C. & Doudna, J. DNA interrogation by the CRISPR RNA-guided endonuclease Cas9. Nature 507, (2014).
48. Yosef, I. et al. DNA motifs determining the efficiency of adaptation into the Escherichia coli CRISPR array. Proc. Natl. Acad. Sci. U. S. A. 110, (2013).
49. Savitskaya, E., Semenova, E., Dedkov, V., Metlitskaya, A. & Severinov, K. High-throughput analysis of type I-E CRISPR/Cas spacer acquisition in E. coli. RNA Biol. 10, (2013).
50. Levy, A. et al. CRISPR adaptation biases explain preference for acquisition of foreign DNA. Nature 520, (2015).
51. Heler, R. et al. Cas9 specifies functional viral targets during CRISPR-Cas adaptation. Nature 519, 1-16 (2015).
52. Wei, Y., Terns, R. M. & Terns, M. P. Cas9 function and host genome sampling in type II-A CRISPR-Cas adaptation. Genes Dev. 29, (2015).
53. Kunne, T. et al. Cas3-derived target DNA degradation fragments fuel primed CRISPR adaptation. Mol. Cell 63, (2016).
54. Xue, C. et al. CRISPR interference and priming varies with individual spacer sequences. Nucleic Acids Res. 43, (2015).
55. Xue, C. et al. Conformational control of Cascade interference and priming activities in CRISPR immunity. Mol. Cell 64, 1-9 (2016).
56. Mojica, F. J. M. & Díez-Villaseñor, C. Right of admission reserved, no matter the path. Trends Microbiol. 21, (2013).
57. Shah, S., Erdmann, S., Mojica, F. & Garrett, R. Protospacer recognition motifs: mixed identities and functional diversity. RNA Biol. 10, (2013).
58. Semenova, E. et al. Interference by clustered regularly interspaced short palindromic repeat (CRISPR) RNA is governed by a seed sequence. Proc. Natl. Acad. Sci. U. S. A. 108, (2011).
59. Elmore, J., Deighan, T., Westpheling, J., Terns, R. M. & Terns, M. P. DNA targeting by the type I-G and type I-A CRISPR-Cas systems of Pyrococcus furiosus. Nucleic Acids Res. 43, gkv1140 (2015).

60. Jiang, W., Bikard, D., Cox, D., Zhang, F. & Marraffini, L. A. RNA-guided editing of bacterial genomes using CRISPR-Cas systems. Nat. Biotechnol. 31, (2013).
61. Esvelt, K. M. et al. Orthogonal Cas9 proteins for RNA-guided gene regulation and editing. Nat. Methods 10, (2013).
62. Briner, A. E. et al. Guide RNA functional modules direct Cas9 activity and orthogonality. Mol. Cell 56, (2014).
63. Westra, E. R. et al. Type I-E CRISPR-Cas systems discriminate target from non-target DNA through base pairing-independent PAM recognition. PLoS Genet. 9, e (2013).
64. Leenay, R. T. et al. Identifying and visualizing functional PAM diversity across CRISPR-Cas systems. Mol. Cell 62, (2016).
65. Ran, F. A. et al. In vivo genome editing using Staphylococcus aureus Cas9. Nature 520, (2015).
66. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, (1990).
67. Biswas, A., Gagnon, J. N., Brouns, S. J. J., Fineran, P. C. & Brown, C. M. CRISPRTarget: Bioinformatic prediction and analysis of crRNA targets. RNA Biol. 10, (2013).
68. Shipman, S. L., Nivala, J., Macklis, J. D. & Church, G. M. Molecular recordings by directed CRISPR spacer acquisition. Science 1175, 1-16 (2016).
69. Zetsche, B. et al. Cpf1 is a single RNA-guided endonuclease of a class 2 CRISPR-Cas system. Cell 163, 1-13 (2015).
70. Pattanayak, V. et al. High-throughput profiling of off-target DNA cleavage reveals RNA-programmed Cas9 nuclease specificity. Nat. Biotechnol. 31, (2013).
71. Karvelis, T. et al. Rapid characterization of CRISPR-Cas9 protospacer adjacent motif sequence elements. Genome Biol. 16, 253 (2015).
72. Kleinstiver, B. P. et al. Engineered CRISPR-Cas9 nucleases with altered PAM specificities. Nature 523, (2015).
73. Fonfara, I., Richter, H., Bratovič, M., Le Rhun, A. & Charpentier, E. The CRISPR-associated DNA-cleaving enzyme Cpf1 also processes precursor CRISPR RNA. Nature 1-19 (2016). doi: /nature
74. Boyle, E. A. et al. High-throughput biochemical profiling reveals Cas9 off-target binding and unbinding heterogeneity. bioRxiv (2016). doi: /

CHAPTER 2

Identifying and visualizing functional PAM diversity across CRISPR-Cas systems

Ryan T. Leenay 1, Kenneth R. Maksimchuk 1, Rebecca A. Slotkowski 1, Roma N. Agrawal 1, Ahmed A. Gomaa 1,2, Alexandra E. Briner 3, Rodolphe Barrangou 3,*, and Chase L. Beisel 1,*

1 Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh NC
2 Chemical Engineering Department, Faculty of Engineering, Cairo University, Giza 12613, Egypt
3 Department of Food, Bioprocessing and Nutrition Sciences, North Carolina State University, Raleigh, NC 27695, USA
*Correspondence

Original publication in Molecular Cell 62(1):137-47,

ABSTRACT

CRISPR-Cas adaptive immune systems in prokaryotes boast a diversity of protein families and mechanisms of action, where most systems rely on protospacer-adjacent motifs (PAMs) for DNA target recognition. Here, we developed an in vivo, positive, and tunable screen termed PAM-SCANR (PAM screen achieved by NOT-gate repression) to elucidate functional PAMs as well as an interactive visualization scheme termed the PAM wheel to convey individual PAM sequences and their activities. PAM-SCANR and the PAM wheel identified known functional PAMs while revealing complex sequence-activity landscapes for the Bacillus halodurans I-C (Cascade), Escherichia coli I-E (Cascade), Streptococcus thermophilus II-A CRISPR1 (Cas9), and Francisella novicida V-A (Cpf1) systems. The PAM wheel was also readily applicable to existing high-throughput screens and garnered insights into SpyCas9 and SauCas9 PAM diversity. These tools offer powerful means of elucidating and visualizing functional PAMs toward accelerating our ability to understand and exploit the multitude of CRISPR-Cas systems in nature.

2.1 INTRODUCTION

CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) and their associated Cas (CRISPR associated) proteins form widespread adaptive immune systems in prokaryotes that have been harnessed as ubiquitous tools in biotechnology and medicine 1-6. Unlike adaptive immune systems in eukaryotes, CRISPR-Cas systems rely on the base-pairing potential of RNA guides for specific target recognition 7-9. Guide RNAs encoded within the CRISPR array employ their ~20-30 nucleotide spacer region to base pair with complementary protospacer sequences, leading the system's Cas effector proteins to cleave or degrade these targets. CRISPR-Cas systems can also acquire pieces of foreign genetic material as new spacers to confer immunity against future infections. The programmable and multiplexable nature of DNA/RNA targeting has led to a myriad of applications ranging from gene therapy to antimicrobials 5,6.

CRISPR-Cas systems have proven to be remarkably diverse despite their common role in adaptive immunity. The current classification system defines two classes, six main types, and sixteen subtypes 20,21. Each grouping is distinguished by its Cas proteins and by the mechanisms of RNA processing, target recognition, and target destruction. Across the six types, Type II systems have garnered the most attention to date because their machinery can be packaged into a portable two-component system 22. However, Type I systems are the most abundant and widespread in nature and can degrade DNA 7,20. Type III systems can bind and cleave DNA and/or RNA 12,23-25, and Type V and putative Type VI systems offer potential alternatives to Type II systems and the Cas9 effector protein 21,26. While our

understanding and use of CRISPR-Cas systems has centered on a few exemplary model systems, a plethora of other CRISPR-Cas systems remains to be characterized and harnessed.

When characterizing new CRISPR-Cas systems, one of the greatest challenges is elucidating the rules for guide RNA design and target selection. Aside from complementarity between the target and the spacer portion of the guide RNA, the protospacer must also be flanked on one side by defined sequences 27. These flanking sequences allow CRISPR-Cas immune systems to differentiate between self (the CRISPR array spacer) and non-self (the invader DNA) and have been heavily implicated in spacer acquisition. For all characterized systems besides Type III systems, this flanking sequence (called the protospacer-adjacent motif or PAM) initially drives DNA interrogation by the Cas effector proteins prior to DNA unwinding and base pairing with the loaded guide RNA 14. In the absence of a PAM, the Cas proteins cannot recognize the target, even if it is perfectly complementary to the spacer. The PAM thus plays a central role in target selection, both in the context of host immunity and CRISPR-based technologies.

2.2 DESIGN

2.2.1 Limitations of current PAM screens

The centrality of the PAM has spurred the development of multiple approaches to elucidate the functional PAM sequences necessary for DNA recognition by popular CRISPR-Cas systems. Originally, bioinformatic analyses of CRISPR spacers were used to identify matching bacteriophage and plasmid sequences 28,30,34. This technique

can identify a set of putative PAMs, although it remains limited by the availability of matching phage or plasmid DNA sequences in genomic databases, and obtained hits may include mutated escape PAMs. More recent efforts have developed high-throughput, experimental screens to determine functional PAMs based on the depletion of a target plasmid or on the introduction of a double-stranded break in vitro. However, plasmid removal screens are based on an irreversible binary event, implicitly measure the frequency of escape from killing, and require high library coverage to quantitatively identify depleted PAM sequences. Separately, in vitro DNA cleavage screens require the purification of active protein-RNA complexes, can be highly sensitive to the assay conditions 37, and are incompatible with Type I systems that cleave and degrade DNA when relying on adaptor ligation 26,36. These shortcomings highlight the need for novel screens that overcome many of these limitations.

2.2.2 An in vivo, positive, tunable screen for functional PAMs

We sought to develop a high-throughput in vivo screen with two distinct features: applicability across PAM-dependent CRISPR-Cas systems and the generation of a positive signal for functional PAMs (Figure 2.1A and Movie 2.S1). To develop a broadly applicable screen, we utilized gene repression as the basis for CRISPR function. Prior work demonstrated that type I and type II CRISPR-Cas systems lacking endonuclease activity could function as transcriptional regulators, either by removing the Cas3 protein from the type I-E system in E. coli or by mutating the catalytic residues of the HNH and RuvC endonuclease domains in type II-A Cas9

proteins. In either case, the Cas protein(s) tightly bind to but do not cleave the target DNA, thereby interfering with transcription at the bound locus. Separately, the Cpf1 effector protein for the type V-A system in Francisella novicida could be converted into a DNA-binding protein by mutating the first or second RuvC domain 26, although Cpf1 has yet to be used for gene regulation. Identifying RuvC and HNH domains or cas3 genes has proven straightforward using bioinformatics analyses, which facilitates the conversion of uncharacterized CRISPR-Cas systems into gene repressors 39,43.

Figure 2.1 Overview of the PAM Screen Achieved by NOT-Gate Repression, PAM-SCANR. (A) The screening platform consists of a library of potential PAM sequences cloned upstream of the lacI promoter. Immediately downstream of lacI is the LacI-dependent lacZ promoter controlling expression of GFP. A catalytically dead CRISPR-Cas system is targeted to a protospacer within the lacI promoter, resulting in GFP fluorescence only in the presence of a functional PAM. (B) Cells harboring a functional PAM can be isolated by fluorescence-activated cell sorting (FACS). The screen can identify functional PAMs comprehensively or individually. (C) Tuning the stringency of the screen with IPTG. The presence of IPTG reduces the fraction of LacI proteins capable of gene repression, thereby facilitating the upregulation of GFP. More IPTG lowers the threshold for GFP upregulation, in turn relaxing the stringency of the screen.
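The NOT-gate logic summarized in Figure 2.1 can be illustrated with a toy model. The sketch below is not the authors' model: the function names, the linear relationship between PAM strength and lacI repression, and the 0.5 threshold are all hypothetical choices made only to show why a functional PAM switches GFP on and why IPTG relaxes the stringency.

```python
def active_laci(pam_strength: float, iptg_inactivated_fraction: float = 0.0) -> float:
    """Relative amount of LacI able to repress the GFP promoter.

    pam_strength: 0.0 (non-functional PAM) to 1.0 (strong PAM); a stronger PAM
        means tighter binding at the lacI promoter and less LacI made.
        The linear form is an illustrative assumption.
    iptg_inactivated_fraction: fraction of LacI inactivated by IPTG (0.0 to 1.0).
    """
    laci_expression = 1.0 - pam_strength
    return laci_expression * (1.0 - iptg_inactivated_fraction)


def gfp_on(pam_strength: float, iptg_inactivated_fraction: float = 0.0,
           threshold: float = 0.5) -> bool:
    """GFP is expressed only when active LacI falls below a (hypothetical) threshold."""
    return active_laci(pam_strength, iptg_inactivated_fraction) < threshold


# Strong PAM: lacI is repressed, so GFP turns on without IPTG.
print(gfp_on(1.0))
# Weak PAM: hidden without IPTG, revealed at an intermediate IPTG level.
print(gfp_on(0.3), gfp_on(0.3, iptg_inactivated_fraction=0.5))
```

In this toy picture a non-functional PAM stays dark at the same intermediate IPTG level, which is the qualitative behavior the screen relies on to separate weak PAMs from background.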

To associate gene repression with a positive signal, we constructed a simple genetic circuit termed a NOT gate (Figure 2.1). The NOT gate is based on the CRISPR-Cas system blocking the -35 element within the promoter upstream of lacI and the LacI repressor blocking the promoter of gfp. The PAM library is placed upstream of the -35 element to minimally impact the transcription or translation of lacI. By targeting the top or bottom strand of the lacI promoter, PAMs located on the 5′ end (type I, V, VI systems) or on the 3′ end (type II systems) of the protospacer can be assessed (Figure 2.S1A) 21,26,28,44. Based on the configuration of the NOT gate, only a functional PAM would lead to lacI repression and reporter expression. Fluorescent cells can then be isolated through fluorescence-activated cell sorting (FACS). The screen is performed in an E. coli strain stripped of lacI-lacZ and its endogenous CRISPR-Cas system to avoid crosstalk. We termed the resulting screen PAM-SCANR (PAM screen achieved by NOT-gate repression).

By allowing the isolation of fluorescent cells, PAM-SCANR affords two modes of screening: comprehensive screening based on next-generation sequencing of pre-sorted and post-sorted PAM libraries, and individual screening based on Sanger sequencing of sorted fluorescent clones (Figure 2.1B). Furthermore, the stringency of PAM-SCANR can be tuned using intermediate concentrations of isopropyl β-D-1-thiogalactopyranoside (IPTG) to titrate active LacI repressors within the NOT gate, allowing the detection of weak functional PAMs or the use of poorly expressed or weakly active systems (Figure 2.1C and Figure 2.S2).
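The comprehensive mode compares PAM frequencies in the pre-sorted and post-sorted libraries. The text does not give the exact normalization, so the following is a minimal sketch under stated assumptions: enrichment is taken as the ratio of post-sort to pre-sort read frequency, and a pseudocount (a hypothetical choice) guards against PAMs that are absent from one library. The function name `fold_enrichment` is illustrative.

```python
def fold_enrichment(pre_counts: dict, post_counts: dict, pseudocount: float = 1.0) -> dict:
    """Return post-sort / pre-sort frequency ratio for every PAM seen in either library.

    pre_counts, post_counts: mapping of PAM sequence -> read count.
    pseudocount: added to every count so that missing PAMs do not cause
        division by zero (an assumption, not taken from the source).
    """
    pams = set(pre_counts) | set(post_counts)
    pre_total = sum(pre_counts.values()) + pseudocount * len(pams)
    post_total = sum(post_counts.values()) + pseudocount * len(pams)
    enrichment = {}
    for pam in pams:
        pre_freq = (pre_counts.get(pam, 0) + pseudocount) / pre_total
        post_freq = (post_counts.get(pam, 0) + pseudocount) / post_total
        enrichment[pam] = post_freq / pre_freq
    return enrichment


# Hypothetical read counts, for illustration only.
pre = {"AAG": 100, "CCC": 100}
post = {"AAG": 180, "CCC": 20}
for pam, score in sorted(fold_enrichment(pre, post, pseudocount=0.0).items()):
    print(pam, round(score, 2))
```

Values above 1 indicate PAMs that were enriched by sorting for fluorescent cells; values below 1 indicate PAMs that were depleted.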

2.2.3 Limitations of current PAM representation schemes

High-throughput PAM screens yield lists of all sequences within the screened library along with their relative enrichment or depletion. Each list is normally conveyed in a compact visualization scheme called a sequence logo that reports the conservation of a given nucleotide at each position 28. While sequence logos have been the standard for conveying functional PAMs 21,28, they inherently sacrifice two crucial details: the individual sequences representing functional PAMs and the relative activity of each sequence. Although these details are dispensable for a single consensus sequence, emerging evidence suggests that CRISPR PAMs are more complex. For instance, the Streptococcus pyogenes Cas9 recognizes a weak NAG PAM in addition to the standard NGG PAM 39,45, while the S. thermophilus CRISPR1 Cas9 recognizes multiple sequences that deviate from the consensus NNAGAAW PAM (where W is A or T) 28,46,47. What remains to be developed is a standard means of conveying functional PAM sequences and their activities.

2.2.4 Conveying functional PAM sequences and activities with PAM wheels

We sought to develop a means of representing functional PAMs that preserves both individual sequences and enrichment scores (Figure 2.2A). We selected interactive Krona plots that present hierarchical node-link diagrams 48. By processing individual sequences and enrichment scores, the Krona plot outputs a wheel similar to the codon wheel used to relate codons and amino acids. We oriented the wheel to be read from the inner to the outer ring, conveying the PAM's nucleotide sequence moving away from the protospacer. Besides capturing the PAM, the Krona plot also


conveys the relative enrichment of a given sequence as directly proportional to the area of the sector. We term the resulting plots PAM wheels. Krona plots are encapsulated in interactive HTML files, allowing interrogation of all sequences regardless of their enrichment score (Krona 2.S1-2.S8).
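As a rough illustration of how per-sequence enrichment scores could be arranged for a hierarchical wheel, the sketch below writes one tab-delimited row per PAM: the score first, followed by one nucleotide per ring from the innermost outward. This layout is assumed to suit a tab-delimited hierarchy importer such as the text importer shipped with KronaTools; the exact input format should be checked against that tool's documentation. The function name and the `pam_side` parameter are hypothetical.

```python
def pam_wheel_rows(enrichment: dict, pam_side: str = "3prime") -> list:
    """Build tab-delimited rows: score, then nucleotides from inner to outer ring.

    The inner ring is the nucleotide closest to the protospacer. For a PAM on
    the 3' side (written 5'->3'), that is the first character; for a PAM on the
    5' side, it is the last character, so the sequence is reversed.
    """
    if pam_side not in ("3prime", "5prime"):
        raise ValueError("pam_side must be '3prime' or '5prime'")
    rows = []
    for pam, score in sorted(enrichment.items()):
        rings = list(pam) if pam_side == "3prime" else list(reversed(pam))
        rows.append("\t".join([f"{score:g}"] + rings))
    return rows


# Hypothetical scores, for illustration only.
for row in pam_wheel_rows({"AGG": 2.5, "TGG": 3.0}):
    print(row)
```

Because sector area is proportional to the score in the first column, weakly enriched sequences remain present in the hierarchy rather than being averaged away as in a sequence logo.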

Figure 2.2 Representing PAMs and their enrichment with the PAM wheel. (A-C) The sequence logos and PAM wheels are shown for prior high-throughput depletion screens conducted with the wild-type type II-A Cas9 from S. pyogenes (A), the VQR variant of the S. pyogenes Cas9 (B), and the type II-A Cas9 from S. aureus (C) 47,49. The raw sequencing reads were processed to generate the standard sequence logo and the PAM wheel. For the PAM wheel, sequences from the inner to outer circle match the PAM read moving away from the protospacer. Colors correspond to the relative frequency of the innermost nucleotide. For a given sequence, the area of the sector in the PAM wheel is directly proportional to the relative enrichment in the library. For the S. aureus Cas9, the +1 and +2 positions of the PAM exhibited similar enrichments and therefore were treated as a 2-nt gap, as seen by the smaller circle in (C).

2.3 RESULTS

2.3.1 PAM wheels convey PAM landscapes from existing high-throughput screens

To illustrate the applicability of the PAM wheel, we compiled previously published datasets from high-throughput depletion assays conducted with the type II-A Cas9 from S. pyogenes, the engineered VQR variant of this Cas9 with altered PAM specificity, and the short type II-A Cas9 from Staphylococcus aureus (Figure 2.2) 47,49. To convert depletion into a positive output, we inverted the extent of depletion to assign enrichment scores to each PAM sequence in order to generate the PAM wheel and sequence logo. The PAM wheel for the S. pyogenes Cas9 captures the canonical NGG PAM as the most enriched and the weak NAG PAM as the next most enriched sequence (Figure 2.2A, Krona 2.S1) 35,39. The PAM wheel also suggested bias against a C in the +1 position and a potential 2-nt gap between the protospacer and the PAM (e.g., TAGG, ATGG). Similarly, the PAM wheel for the VQR variant of the S. pyogenes Cas9 captured the reported NGA PAM but also indicated other functional PAMs (e.g., AGGG, TGCG) and bias toward an A at the +1 position (Figure 2.2B, Krona 2.S2). These insights contrast with the sequence logos, which merely indicate a preference for a G over an A at the +2 position and for a G at the +3 position for the WT Cas9,

and a preference for a G at the +2 position and for an A over a G at the +3 position for the VQR variant.

We also generated a PAM wheel based on a high-throughput depletion assay conducted with the compact S. aureus type II-A Cas9 (Figure 2.2C, Krona 2.S3 and 2.S4) 47. The consensus PAM for this Cas9 has been reported to be NNGRR(T), where R represents an A or a G and the parentheses represent a beneficial, but not essential, nucleotide 49. The PAM wheel affirmed the NNGRR motif but also suggested a more nuanced bias for the T at the sixth position that applied to only some PAMs. Interestingly, this bias was more prevalent for non-canonical PAMs (e.g., AGGT, GGCT, GCGT in positions 3-6), suggesting that the T could compensate for deviations elsewhere in the PAM.

2.3.2 Comprehensive screening reveals the PAM landscape for the E. coli type I-E system

We first applied PAM-SCANR to the canonical type I-E CRISPR-Cas system native to E. coli (Figure 2.3A). Because this system is present in our chassis strain of E. coli, we deleted the endogenous cas3 gene and inserted a constitutive promoter upstream of the adjacent casABCDE operon (Figure 2.S1B). This modification eliminated DNA cleavage, allowing the Cascade complex encoded by the operon to reversibly bind target DNA and block transcription.
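Consensus PAMs such as NNGRR or NNAGAAW are written with IUPAC degenerate nucleotide codes (N is any base, R is A or G, W is A or T). As a small aid for checking whether an individual sequence from a screen falls within a reported consensus, the sketch below tests a sequence against a degenerate motif. The function name `matches_motif` is illustrative and not part of any tool used in this work.

```python
# Standard IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches_motif(sequence: str, motif: str) -> bool:
    """Return True if every base of `sequence` is allowed by the degenerate `motif`."""
    if len(sequence) != len(motif):
        return False
    return all(base in IUPAC[code]
               for base, code in zip(sequence.upper(), motif.upper()))


print(matches_motif("TGGAA", "NNGRR"))      # G, then A/G, then A/G at positions 3-5
print(matches_motif("TTAGAAT", "NNAGAAW"))  # W allows the final T
print(matches_motif("TTAGAAC", "NNAGAAW"))  # a final C falls outside W
```

A yes/no match of this kind is exactly the information a consensus sequence carries; it says nothing about relative activity, which is the gap the PAM wheel is meant to fill.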


Figure 2.3 Comprehensive screening of the type I-E CRISPR-Cas system in E. coli. (A) PAM wheel and sequence logo for comprehensive screening of the E. coli I-E Cascade with a 4-nt PAM library. See Figure 2.2 for more information about the PAM wheel. Comparisons between the two library replicates of the screen are shown. Related to Figure 2.S3 and Krona 2.S5. (B) Correlating library enrichment and PAM-SCANR fluorescence for individually cloned PAMs. (C) Validating representative PAMs by gene repression. The PAMs were cloned upstream of the -35 element of the lacZ promoter controlling expression of GFP. The fold repression is in relation to a non-targeting RNA. (D) Validating representative PAMs by cell killing. Representative PAMs and the PAM-SCANR protospacer were recombineered into E. coli's genome, and the cells were transformed with a plasmid encoding cas3. The fold reduction in the transformation efficiency of the guide RNA plasmid over the non-targeting plasmid is shown. Values represent the mean and SEM from experiments starting with at least three independent colonies.

Using E. coli's native I-E system as a model, we performed the comprehensive screen to elucidate the complete landscape of functional PAM sequences. We

selected a 4-nt library to fully capture the canonical 3-nt PAM along with any biases at the -4 position (Figure 2.S3A). No IPTG was used because of the potent gene repression previously observed for this system 40. The resulting PAM wheel (Krona 2.S5) and accompanying sequence logo for one of two library replicates are shown in Figure 2.3A. The screen revealed a remarkably diverse set of functional PAMs including the well-established 3-nt functional PAMs (NGAG, NTAG, NAAG, NAGG, NATG) 44. While these PAMs showed no observable bias in the -4 position, the PAM wheel revealed other functional PAMs with biases at the -4 position, indicating that the E. coli I-E system recognizes a longer motif than previously accepted. Furthermore, some functional PAMs were more enriched than others, suggesting that Cascade exhibits a range of PAM preferences. In contrast, the sequence logo gave a limited picture of the PAM spectrum, with a tendency toward an NAAG PAM (Figure 2.3A).

To initially test the observed sequence biases and variable library enrichment, we cloned single representative functional PAM sequences back into the PAM-SCANR reporter constructs and measured GFP fluorescence by flow cytometry (Figure 2.3B). We found a strong linear correlation between the library fold enrichment and mean GFP fluorescence. The correlation indicated not only that library fold enrichment was an accurate proxy for the relative activity of a given functional PAM and protospacer, but also that functional PAMs can vary widely in their activity.

We next asked how the observed PAM biases extend to a separate protospacer. To answer this question, we measured gene repression associated with representative PAMs by targeting the -35 element of the lacZ promoter controlling GFP in a lacI deletion strain. The PAMs were introduced immediately upstream of the

protospacer to minimize interference with transcription. Gene repression was then measured in the E. coli strain lacking cas3 by comparing fluorescence levels for the guide RNA and a non-targeting control. We measured diverse extents of fold repression that correlated with the enrichment scores from PAM-SCANR (Figure 2.3C). We also observed notable distinctions from the PAM-SCANR results: the CAAA PAM was the most active with ~1,000-fold repression, and the CATA PAM yielded greater repression than the CAAT PAM. These results confirm that E. coli's I-E system is sensitive to bias at the -4 position, expanding the accepted PAM length. They also suggest some dependence on the protospacer sequence, in line with recent work with the type I-E system reporting different PAM preferences for two protospacers 50.

Finally, we asked how the observed PAM bias translates into the DNA clearance assays commonly used for PAM determination. To limit any protospacer-specific biases, we recombineered the protospacer and representative PAMs from the PAM-SCANR reporter plasmid into the genome of the E. coli chassis strain and transformed a plasmid constitutively expressing the cas3 gene. We then measured the transformation efficiency of the guide RNA plasmid in comparison to a non-targeting plasmid. Any reduction in the transformation efficiency can be attributed to the lethality of genome targeting and the frequency of escape 15,51. Interestingly, we observed similar fold reductions in the transformation efficiency for all tested functional PAMs including the extremely weak AAAC and AAAA PAMs (Figure 2.3D), which may be explained by weak and strong PAMs all eliciting irreversible DNA damage and infrequent escape 15,45,51. Targeting was still PAM dependent, as we observed a negligible fold reduction for a non-functional PAM (Figure 2.3D).

2.3.3 Stringency tuning uncovers the weak NAG PAM for the S. pyogenes type II-A system

We next explored the use of PAM-SCANR with type II CRISPR-Cas systems and their widely exploited Cas9 effector proteins 52. We began with the popular type II-A Cas9 from S. pyogenes. We used the catalytically dead Cas9 (dCas9) containing point mutations to the RuvC and HNH endonuclease domains (D10A, H840A), which was shown to bind target DNA and block transcription in E. coli 38,39,41. We tested three PAM sequences representing the canonical NGG, the weak NAG, and a non-functional PAM within the PAM-SCANR platform 35,39. Flow cytometry analysis showed a high fluorescence signal for the canonical PAM and similarly low fluorescence signals for the weak and non-functional PAMs (Figure 2.4A, 0 µM IPTG). Because the weak NAG PAM is known to exhibit some activity 35, we asked if stringency tuning using sub-saturating concentrations of IPTG could temper LacI activity and help reveal this PAM (Figures 2.1C and 2.S2). As expected, the sub-saturating concentration of IPTG substantially increased the fluorescence for the NAG PAM over the non-functional PAM (Figure 2.4A, 10 µM IPTG). PAM-SCANR is therefore also applicable to type II systems, and stringency tuning with IPTG can allow the detection of weak functional PAMs.


Figure 2.4 Stringency tuning and screening for the S. pyogenes and S. thermophilus type II-A Cas9s. (A) Stringency tuning of PAM-SCANR constructs with the catalytically dead Cas9 (dCas9). A sub-saturating concentration of IPTG reveals the NAG weak PAM for the S. pyogenes Cas9. (B) Individual screening of the S. thermophilus CRISPR1 Cas9. A 5-nt PAM library with a 2-nt gap was used to identify individual PAM sequences. Sequences identified multiple times are specified out of 38 sequencing runs total, with the number of appearances in parentheses. The mean, single-cell fluorescence is shown for post-sorted cultures harboring the indicated PAM sequence. (C) Comprehensive screening of the S. thermophilus CRISPR1 Cas9. The PAM wheel and sequence logo are shown for the same replicate subjected to individual sequencing. See Figure 2.2 for more information about the PAM wheel.

2.3.4 Individual and comprehensive screening reveal functional PAMs for the S. thermophilus type II-A system

We next applied PAM-SCANR to the canonical Streptococcus thermophilus Type II-A system associated with the CRISPR1 locus. The system's extensively studied Cas9 protein is substantially shorter than the S. pyogenes Cas9, comes from a generally-regarded-as-safe bacterium, and has one of the longest-known and first-discovered PAM sequences (NNAGAAW, where W is A or T) 28. Previous depletion-based screens also identified a few functional PAMs that deviate from the consensus sequence 47.

We performed individual screening with PAM-SCANR using the catalytically dead version of the S. thermophilus CRISPR1 Cas9 (D9A, H599A) 38,39,41. We selected a 5-nt library spanning the consensus PAM sequence (Figure 2.4B) and applied an intermediate concentration of IPTG to reveal weaker PAMs (Figure 2.S3B). We sorted and plated the ~0.4% GFP-positive cells from the transformed library and subjected cultures of individual colonies to flow cytometry analysis. Sanger sequencing was then performed on 38 of the fluorescent cultures to determine individual PAMs. We obtained four reoccurring sequences (Figures 2.4B and 2.S4), all of which had been identified previously 28,46. The mean fluorescence values were relatively scattered but substantially above the values for a non-functional PAM, which we attribute to the state of the cells following cell sorting and subsequent culturing.

To further explore the PAM landscape for this Cas9, we performed comprehensive screening (Figure 2.4C, Krona 2.S6). The screen identified all four reoccurring PAMs from individual sequencing, which were the most enriched sequences in the PAM wheel. Interestingly, the +7 nucleotide in AGAAN was strongly biased toward a T and away from a C, a bias that was upheld by gene repression (Figure 2.S4C), indicating a more complex canonical PAM than NNAGAAW. The screen also revealed other less-enriched sequences that do not lend themselves to a single consensus sequence. One of the slightly enriched sequences was complementary to AGAAT, although this PAM sequence yielded negligible gene repression when targeting the separate protospacer (Figure 2.S4C). This collection of identified functional PAMs

illustrates the utility of the individual method with PAM-SCANR and demonstrates that the S. thermophilus CRISPR1 Cas9 preferentially recognizes a hierarchy of PAM sequences.

2.3.5 Comprehensive screening reveals PAM bias for the F. novicida type V system

The phylogenetic classification of CRISPR-Cas systems was recently expanded to six types 20,21. Of the newly classified types, the type V system has shown potential as an alternative to Cas9 26. Characterization of the type V-A Cpf1 protein from F. novicida U112 identified an NTTN PAM located on the 5′ end of the protospacer 26. To employ the F. novicida Cpf1 protein with PAM-SCANR, we generated a catalytically dead version of the protein that would still bind target DNA. Based on mutational analysis from the original report of Cpf1 26, we introduced two point mutations to the RuvC domains that were each implicated in DNA cleavage (D917A, E1006A).

We then performed comprehensive screening with the resulting catalytically dead Cpf1 (dCpf1). We selected a 4-nt library immediately flanking the protospacer based on the reported PAM 26, and applied an intermediate concentration of IPTG to reveal any weak PAMs (Figure 2.S3C). Figure 2.5A shows the resulting PAM wheel (Krona 2.S7) and sequence logo. In line with the previous characterization of Cpf1 26, the most enriched sequences fell within the NTTN motif, while some lightly enriched sequences matched the reported NCTN PAM. The screen indicated clear biases at the -1 and -4 positions and suggested new, weakly recognized PAMs that are similar to the complement of NTTN.
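Several observations in this chapter refer to the complement of a canonical PAM. The text does not state whether the position-by-position complement or the reverse complement is meant, so the minimal sketch below provides both; the function names are illustrative.

```python
# Complement table for the four bases plus the wildcard N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def complement(sequence: str) -> str:
    """Position-by-position complement, keeping the original left-to-right order."""
    return sequence.upper().translate(_COMPLEMENT)


def reverse_complement(sequence: str) -> str:
    """Complement read in the opposite direction (the other strand, 5'->3')."""
    return complement(sequence)[::-1]


print(complement("NTTN"))          # NAAN
print(reverse_complement("GTTC"))  # GAAC
```

For a symmetric motif such as NTTN the two operations give the same motif (NAAN), so the distinction only matters once the flanking positions are specified.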


Figure 2.5 Comprehensive screening of the type V-A Cpf1 protein from F. novicida. (A) PAM wheel and sequence logo for comprehensive screening of the V-A catalytically dead Cpf1 (dCpf1) with a 4-nt PAM library. See Figure 2.2 for more information about the PAM wheel. Comparisons between the two biological replicates of the screen are shown. (B) Validating representative PAMs by gene repression. See Figure 2.3C for detailed information. Values represent the mean and SEM from experiments starting with at least three independent colonies.

To validate the observed PAM biases, we measured the extent of gene repression by targeting the lacZ promoter upstream of gfp (Figure 2.5B). While the GTTC PAM yielded strong repression in comparison to a non-targeting RNA control, mutating the -1 or -4 position greatly impaired repression activity. We also measured limited repression for the previously validated GCTC PAM (Figure 2.5B), which mirrors its lower enrichment score (Figure 2.5A), and negligible repression for the non-canonical PAMs 47. These results affirm that strong biases exist within the consensus Cpf1 PAM. Furthermore, we demonstrate that Cpf1 can be readily repurposed for programmable gene regulation.

2.3.6 Comprehensive screening reveals the PAM landscape for the compact B. halodurans type I-C system

As a final extension of PAM-SCANR, we tested the ability of this method to identify de novo functional PAMs for the canonical type I-C system from Bacillus halodurans. Type I-C systems are the second most abundant subtype 20 and only require three proteins (Cas5d, Csd1, Csd2) for Cascade 20,53. However, no functional PAMs have been experimentally determined to date. Previous work demonstrated that Cascade from E. coli's I-E system could repress gene expression in the absence of the Cas3 endonuclease 40,42, although this remained to be demonstrated for any other subtype. We therefore imported the three genes for the B. halodurans I-C Cascade into our E. coli strain along with a designed guide RNA (Figures 2.S1A and 2.S1B). We also introduced a 4-nt library based on the putative NTTC PAM identified through bioinformatics analysis across type I-C systems (Figure 2.S3D) 54. Stringency tuning with IPTG was necessary to reveal a small, GFP-positive subpopulation (Figure 2.S3D). The resulting PAM wheel (Krona 2.S8) and accompanying sequence logo from the comprehensive method are shown in Figure 2.6A.
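The size of a randomized PAM library grows as 4 to the power of its length, which sets how many distinct sequences a screen must cover. The sketch below enumerates a fully randomized library; it assumes all four bases are allowed at every position, and the function name is illustrative.

```python
from itertools import product


def pam_library(length: int, alphabet: str = "ACGT") -> list:
    """Enumerate every sequence of the given length over the alphabet."""
    return ["".join(bases) for bases in product(alphabet, repeat=length)]


four_nt = pam_library(4)
print(len(four_nt))            # 4**4 = 256 sequences in a 4-nt library
print(len(pam_library(5)))     # 4**5 = 1024 sequences in a 5-nt library

# Members of the 4-nt library that fall within the putative NTTC motif.
print([pam for pam in four_nt if pam[1:] == "TTC"])
```

Only 4 of the 256 members of a 4-nt library match NTTC, which gives a sense of how small the expected GFP-positive subpopulation can be when a system recognizes a narrow motif.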


Figure 2.6 Comprehensive screening of the uncharacterized type I-C system from B. halodurans. (A) PAM wheel and sequence logo for comprehensive screening with a 4-nt PAM library. See Figure 2.2 for more information about the PAM wheel. Comparisons between the two biological replicates of the screen are shown. (B) Validating representative PAMs by gene repression. See Figure 2.3D for detailed information. Values represent the mean and SEM from experiments starting with at least three independent colonies.

The screen revealed a defined hierarchy of PAM sequences for the B. halodurans type I-C system (Figure 2.6A). The most enriched set of PAM sequences (NTTC) matched bioinformatics predictions 54. We also discovered new functional PAMs with incrementally lower enrichments. There was also a clear bias in the -4 position that again depended on the remaining nucleotides of the PAM. Interestingly, the I-C PAMs were the complement of the I-E PAMs (Figures 2.3A and 2.6A), suggesting that these systems recognize opposite strands of the PAM or exhibit distinct nucleotide recognition properties.

To validate the identified PAMs and sequence biases, we performed gene repression by targeting the lacZ promoter upstream of gfp (Figure 2.6B). As part of the validation, we tested the four identified functional PAMs and the observed bias at

the -4 position. We found that all tested functional PAMs yielded measurable gene repression. Furthermore, the extent of repression strongly correlated with each PAM's library fold enrichment, despite targeting a separate protospacer. The repression data therefore validated the functionality of PAMs deviating from previous bioinformatics analyses as well as strong bias at the -4 PAM position. Additionally, we established that the abundant and compact type I-C CRISPR-Cas systems can be harnessed for gene regulation.

2.4 DISCUSSION

In this work, we developed PAM-SCANR for the rapid identification of functional PAMs across diverse CRISPR-Cas systems. We were able to demonstrate the universality of the screen using five distinct CRISPR-Cas systems from three main types, including two (I-C and V-A) never before used for gene repression. The screen is likely amenable to other types and subtypes of PAM-dependent CRISPR-Cas systems, such as the putative type VI systems 21 or the other type I sub-types that form a stable Cascade-like complex 55,56, with the potential to identify functional PAMs for the wide assortment of CRISPR-Cas systems found across the prokaryotic world.

We also developed the PAM wheel to effectively capture the diversity and enrichment of functional PAMs without the loss of information. The wheel is based on interactive Krona plots that allow the user to interrogate individual sequences, including those with low enrichment (Krona 2.S1-2.S8) 48. PAM wheels are applicable to all existing high-throughput screens and offer a powerful alternative to current means of representing PAMs, including sequence logos that sacrifice the ability to

identify individual sequences or activities and abbreviated lists of functional PAMs designated as either strong or weak 21,28.

Our identification and visualization of functional PAMs across multiple CRISPR-Cas systems revealed remarkable flexibility and bias that lends itself better to a sequence-activity landscape than to a consensus sequence. We found this landscape to be heavily influenced by the periphery of the PAM as well as functional PAMs deviating from the consensus sequence, although other factors such as the protospacer sequence 50, the number of consecutive N nucleotides in the PAM following the protospacer 57,58, and possibly weak recognition of the complement of canonical PAMs could contribute. The concentration of Cas proteins may also influence PAM preferences, although this effect has only been reported under in vitro conditions 37. The landscape also varied widely between even related systems, as illustrated by the relative PAM flexibility and sequences for the E. coli I-E system and the B. halodurans I-C system. Performing PAM screens on a wider set of systems from different types and subtypes followed by structural analyses will shed light on the molecular basis of PAM recognition and how it varies across the diversity of CRISPR-Cas systems.

Aside from motivating structural studies to interrogate PAM recognition, these findings underscore the diversity of functional PAMs that would enable flexible invader targeting and potentially explain biases observed in viral escape of CRISPR targeting via PAM mutation 59. The remarkable diversity of functional PAMs may confound off-target predictions while also presenting potential opportunities to expand the pool of available DNA targets and to design RNA guides with tailored activities for the next generation of CRISPR-based technologies.

2.5 LIMITATIONS

PAM-SCANR relies on gene repression to distinguish functional and nonfunctional PAMs. One drawback of gene repression is that it lacks the nuclease activity associated with fully functioning systems and may therefore miss PAM-dependent allosteric changes that drive DNA cleavage or Cas3 recruitment 60,61. In addition, the variable PAM activities may be related to the ability of the Cas effector proteins to block transcription rather than to tightly bind DNA and elicit cleavage. However, PAM-SCANR faithfully reproduced canonical PAMs across multiple CRISPR-Cas types and systems, suggesting that DNA binding translates well to DNA cleavage. Further exploring the influence of PAMs on DNA binding and cleavage, particularly in the context of CRISPR immunity and genome editing, could help refine the utility of PAM-SCANR and the identified PAMs for downstream applications.

2.6 EXPERIMENTAL PROCEDURES

2.6.1 Strains, plasmids, oligonucleotides

Table 2.S1 lists all strains, plasmids, and oligonucleotides, and the supplemental experimental procedures (File 2.S1) provide an in-depth explanation of the construction of all strains and plasmids. All foreign Cas proteins were inserted into pBAD33 with the constitutive J23108 promoter. For the type I-E system, a previously reported expression system was used 40. The guide RNA plasmids were generated by inserting a CRISPR array or sgRNA downstream of the constitutive J23119 promoter in pBAD18. All experiments were conducted in derivatives of E. coli BW25113 lacking the region from the lacI promoter through the lacZ gene.

2.6.2 Growth conditions

E. coli strains were cultured at 37 °C and 250 rpm in Luria Bertani (LB) medium or M9 minimal medium with 0.4% glycerol and 0.2% casamino acids, or on LB agar. LB medium was used for all experiments except those with the E. coli I-E CRISPR-Cas system, because of limited growth in M9 minimal medium with three plasmids. Plasmids were maintained with ampicillin, chloramphenicol, and/or kanamycin as needed. Liquid media were supplemented with IPTG as specified.

2.6.3 Flow cytometry analysis

Overnight cultures were diluted to an ABS600 of 0.01 and cultured to an OD600 of ~0.2. Cultures were analyzed on an Accuri C6 Flow Cytometer with CFlow plate sampler (Becton Dickinson). Events were gated based on forward scatter and side scatter, and fluorescence was measured in FL1-H, with at least 30,000 gated events for data analysis. For gene repression with three plasmids, cultures were inoculated and grown for 16 hr prior to flow cytometry analysis. Fold repression was calculated as the ratio of the mean fluorescence values for the CRISPR plasmid over that of a non-targeting plasmid. For the individual screening, overnight cultures of single colonies were diluted to an ABS600 of 0.01 and grown for ~3 hr prior to flow cytometry analysis. Plasmids from cultures exhibiting fluorescence above that of a non-targeting control were then isolated, and the region flanking the nucleotide library was PCR amplified and submitted for Sanger sequencing.

2.6.4 Fluorescence-activated cell sorting

Overnight cultures were diluted to an ABS600 of ~0.01 and grown to an ABS600 of ~0.2 prior to sorting on a MoFlo XDP Cell Sorter (Beckman Coulter). A non-targeting strain for each system was analyzed to establish all necessary gating parameters for sorting fluorescent populations. Cultures were subjected to one or two rounds of sorting, resulting in at least 50,000 GFP-positive events. Sorted cells were diluted into LB medium supplemented with antibiotics and cultured overnight.

2.6.5 Next-generation sequencing

DNA was prepared for sequencing by amplifying the PAM library from plasmid DNA, isolating it from pre-sorted and post-sorted cultures, and indexing it with Nextera barcodes. Amplicons were then subjected to MiSeq sequencing (Illumina). Data were processed using command line text programs to trim out the PAM library and count the number of times each sequence occurred. The Detailed Protocol (File 2.S2) provides an in-depth explanation of the sequencing analysis and PAM wheel generation.

2.6.6 Cell killing assay

For the cell killing assays with the I-E CRISPR-Cas system, E. coli strains were modified by λ-red recombination to harbor protospacers and PAMs from the reporter plasmid. The strains were then electroporated with a targeting or non-targeting guide RNA plasmid, followed by recovery and dilution plating on LB agar. The fold reduction in the transformation efficiency was calculated as the ratio of the number of transformants for the non-targeting plasmid divided by that of the CRISPR plasmid.
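The trimming and counting step described for the sequencing data can be illustrated with a short Python sketch. It is not the command-line workflow of File 2.S2: the flanking sequences, the function names, and the normalization (a PAM's read fraction after sorting divided by its read fraction before sorting) are assumptions chosen only to show the idea behind a library fold enrichment.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def count_pams(reads: Iterable[str], upstream: str, downstream: str,
               pam_len: int) -> Counter:
    """Trim each read to the randomized PAM between two constant flanks
    (hypothetical flank sequences) and count how often each PAM occurs."""
    pattern = re.compile(
        re.escape(upstream.upper())
        + "([ACGT]{%d})" % pam_len
        + re.escape(downstream.upper())
    )
    counts: Counter = Counter()
    for read in reads:
        match = pattern.search(read.upper())
        if match:  # reads lacking the flanks are skipped
            counts[match.group(1)] += 1
    return counts


def fold_enrichment(pre: Counter, post: Counter) -> Dict[str, float]:
    """Assumed definition: (post-sort read fraction) / (pre-sort read fraction).
    PAMs absent from the pre-sorted library are left out."""
    pre_total = sum(pre.values())
    post_total = sum(post.values())
    if pre_total == 0 or post_total == 0:
        raise ValueError("both libraries need at least one counted read")
    return {
        pam: (post.get(pam, 0) / post_total) / (n_pre / pre_total)
        for pam, n_pre in pre.items()
        if n_pre > 0
    }
```

Normalizing by total reads keeps the comparison independent of sequencing depth; a pseudocount could be added for sparsely covered libraries, but that choice is left open here.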

See Supplemental Experimental Procedures (File 2.S1) for additional details on all experimental procedures.

ACCESSION NUMBERS

The accession number for all raw and processed reads is NCBI GEO: GSE

AUTHOR CONTRIBUTIONS

R.T.L. and C.L.B. devised PAM-SCANR. R.T.L., K.R.M., and C.L.B. developed the experiments. R.T.L., K.R.M., R.B., and C.L.B. analyzed the experimental data. R.T.L., K.R.M., R.A.S., R.N.A., and A.A.G. performed the experiments. R.T.L. and A.E.B. generated the PAM wheels and sequence logos. R.T.L., K.R.M., R.B., and C.L.B. wrote the manuscript.

ACKNOWLEDGEMENTS

We thank Stacie Meaux at Research Square for developing the video summary contained in Movie S1, Brooke McGirr for assistance with cloning and recombineering, and Michelle Luo for devising the PAM-SCANR acronym and for critical reading of the manuscript. We also thank Sarah Schuett at the NCSU CVM Cell Sorting Facility for all FACS work. The BhaloCascade plasmid was a gift from Ailong Ke, and the SpydCas9 plasmid was a gift from Lei Qi (Addgene #44249). The work was supported by funding from the National Science Foundation (CBET to C.L.B. and R.B., MCB to C.L.B.), the Kenan Institute of Engineering, Technology and Science (to C.L.B.), the National Institutes of Health (5T32GM to R.T.L.), and an NCSU undergraduate research grant (to R.A.S.). R.B. is a shareholder and advisor of Caribou Biosciences and a founder of Intellia Therapeutics and a member of its scientific advisory board. A.A.G. and C.L.B. are founders of Locus Biosciences and members of its scientific advisory board.

REFERENCES

1. Barrangou, R. et al. CRISPR provides acquired resistance against viruses in prokaryotes. Science 315, (2007).
2. Hsu, P. D., Lander, E. S. & Zhang, F. Development and applications of CRISPR-Cas9 for genome engineering. Cell 157, (2014).
3. van der Oost, J., Westra, E. R., Jackson, R. N. & Wiedenheft, B. Unravelling the structural and mechanistic basis of CRISPR-Cas systems. Nat Rev Microbiol 12, (2014).
4. Doudna, J. A. & Charpentier, E. The new frontier of genome engineering with CRISPR-Cas9. Science 346, (2014).
5. Cong, L. et al. Multiplex genome engineering using CRISPR/Cas systems. Science 339, (2013).
6. Mali, P. et al. RNA-guided human genome engineering via Cas9. Science 339, (2013).
7. Brouns, S. J. J. et al. Small CRISPR RNAs guide antiviral defense in prokaryotes. Science 321, (2008).
8. Marraffini, L. A. & Sontheimer, E. J. CRISPR interference limits horizontal gene transfer in Staphylococci by targeting DNA. Science 322, (2008).
9. Garneau, J. E. et al. The CRISPR/Cas bacterial immune system cleaves bacteriophage and plasmid DNA. Nature 468, (2010).
10. Jinek, M. et al. RNA-programmed genome editing in human cells. eLife 2013, 1-9 (2013).
11. Gasiunas, G., Barrangou, R., Horvath, P. & Siksnys, V. Cas9-crRNA ribonucleoprotein complex mediates specific DNA cleavage for adaptive immunity in bacteria. Proc. Natl. Acad. Sci. U. S. A. 109, E (2012).
12. Hale, C. R. et al. RNA-guided RNA cleavage by a CRISPR RNA-Cas protein complex. Cell 139, (2009).
13. Jore, M. M. et al. Structural basis for CRISPR RNA-guided DNA recognition by Cascade. Nat. Struct. Mol. Biol. 18, (2011).
14. Westra, E. R. et al. CRISPR immunity relies on the consecutive binding and degradation of negatively supercoiled invader DNA by Cascade and Cas3. Mol. Cell 46, (2012).
15. Gomaa, A. A. et al. Programmable removal of bacterial strains by use of genome-targeting CRISPR-Cas systems. MBio 5, (2014).
16. Bikard, D. et al. Exploiting CRISPR-Cas nucleases to produce sequence-specific antimicrobials. Nat. Biotechnol. 32, (2014).
17. Citorik, R. J., Mimee, M. & Lu, T. K. Sequence-specific antimicrobials using efficiently delivered RNA-guided nucleases. Nat. Biotechnol. 32, (2014).
18. Gilbert, L. A. et al. Genome-scale CRISPR-mediated control of gene repression and activation. Cell 159, (2015).
19. Hilton, I. B. et al. Epigenome editing by a CRISPR/Cas9-based acetyltransferase activates genes from promoters and enhancers. Nat. Biotechnol. 33, (2015).
20. Makarova, K. S. et al. An updated evolutionary classification of CRISPR-Cas systems. Nat. Rev. Microbiol. 13, (2015).
21. Shmakov, S. et al. Discovery and functional characterization of diverse class 2 CRISPR-Cas systems. Mol. Cell 60, 1-13 (2015).
22. Jinek, M. et al. A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science 337, (2012).
23. Samai, P. et al. Co-transcriptional DNA and RNA cleavage during Type III CRISPR-Cas immunity. Cell 161, (2015).
24. Staals, R. H. J. et al. RNA targeting by the type III-A CRISPR-Cas Csm complex of Thermus thermophilus. Mol. Cell 56, (2014).
25. Tamulaitis, G. et al. Programmable RNA shredding by the type III-A CRISPR-Cas system of Streptococcus thermophilus. Mol. Cell 56, (2014).
26. Zetsche, B. et al. Cpf1 is a single RNA-guided endonuclease of a class 2 CRISPR-Cas system. Cell 163, 1-13 (2015).
27. Deveau, H. et al. Phage response to CRISPR-encoded resistance in Streptococcus thermophilus. J. Bacteriol. 190, (2008).
28. Horvath, P. et al. Diversity, activity, and evolution of CRISPR loci in Streptococcus thermophilus. J. Bacteriol. 190, (2008).
29. Heler, R. et al. Cas9 specifies functional viral targets during CRISPR-Cas adaptation. Nature 519, 1-16 (2015).
30. Mojica, F. J. M., Diez-Villasenor, C., Garcia-Martinez, J. & Almendros, C. Short motif sequences determine the targets of the prokaryotic CRISPR defence system. Microbiology 155, (2009).
31. Marraffini, L. A. & Sontheimer, E. J. Self versus non-self discrimination during CRISPR RNA-directed immunity. Nature 463, (2010).
32. Sternberg, S. H., Redding, S., Jinek, M., Greene, E. C. & Doudna, J. DNA interrogation by the CRISPR RNA-guided endonuclease Cas9. Nature 507, (2014).
33. Semenova, E. et al. Interference by clustered regularly interspaced short palindromic repeat (CRISPR) RNA is governed by a seed sequence. Proc. Natl. Acad. Sci. U. S. A. 108, (2011).
34. Briner, A. E. & Barrangou, R. Lactobacillus buchneri genotyping on the basis of clustered regularly interspaced short palindromic repeat (CRISPR) locus diversity. Appl. Environ. Microbiol. 80, (2014).
35. Jiang, W., Bikard, D., Cox, D., Zhang, F. & Marraffini, L. A. RNA-guided editing of bacterial genomes using CRISPR-Cas systems. Nat. Biotechnol. 31, (2013).
36. Pattanayak, V. et al. High-throughput profiling of off-target DNA cleavage reveals RNA-programmed Cas9 nuclease specificity. Nat. Biotechnol. 31, (2013).
37. Karvelis, T. et al. Rapid characterization of CRISPR-Cas9 protospacer adjacent motif sequence elements. Genome Biol. 16, 253 (2015).
38. Bikard, D. et al. Programmable repression and activation of bacterial gene expression using an engineered CRISPR-Cas system. Nucleic Acids Res. 41, (2013).
39. Jinek, M. et al. A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science 337, (2012).
40. Luo, M. L., Mullis, A. S., Leenay, R. T. & Beisel, C. L. Repurposing endogenous type I CRISPR-Cas systems for programmable gene repression. Nucleic Acids Res. 43, (2014).
41. Qi, L. S. et al. Repurposing CRISPR as an RNA-guided platform for sequence-specific control of gene expression. Cell 152, (2013).
42. Rath, D., Amlinger, L., Hoekzema, M., Devulapally, P. R. & Lundgren, M. Efficient programmable gene silencing by Cascade. Nucleic Acids Res. 43, (2014).
43. Jackson, R. N., Lavin, M., Carter, J. & Wiedenheft, B. Fitting CRISPR-associated Cas3 into the helicase family tree. Curr. Opin. Struct. Biol. 24, (2014).
44. Westra, E. R. et al. Type I-E CRISPR-Cas systems discriminate target from non-target DNA through base pairing-independent PAM recognition. PLoS Genet. 9, e (2013).
45. Jiang, W. et al. Dealing with the evolutionary downside of CRISPR immunity: bacteria and beneficial plasmids. PLoS Genet. 9, e (2013).
46. Esvelt, K. M. et al. Orthogonal Cas9 proteins for RNA-guided gene regulation and editing. Nat. Methods 10, (2013).
47. Kleinstiver, B. P. et al. Engineered CRISPR-Cas9 nucleases with altered PAM specificities. Nature 523, (2015).
48. Ondov, B. D., Bergman, N. H. & Phillippy, A. M. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics 12, (2011).
49. Ran, F. A. et al. In vivo genome editing using Staphylococcus aureus Cas9. Nature 520, (2015).
50. Xue, C. et al. CRISPR interference and priming varies with individual spacer sequences. Nucleic Acids Res. 43, (2015).
51. Vercoe, R. B. et al. Cytotoxic chromosomal targeting by CRISPR/Cas systems can reshape bacterial genomes and expel or remodel pathogenicity islands. PLoS Genet. 9, e (2013).
52. Deltcheva, E. et al. CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III. Nature 471, (2011).
53. Nam, K. H. et al. Cas5d protein processes pre-crRNA and assembles into a Cascade-like interference complex in subtype I-C/Dvulg CRISPR-Cas system. Structure 20, (2012).
54. Sorek, R., Lawrence, C. M. & Wiedenheft, B. CRISPR-mediated adaptive immune systems in bacteria and archaea. Annu. Rev. Biochem. 82, (2013).
55. Brendel, J. et al. A complex of Cas proteins 5, 6, and 7 is required for the biogenesis and stability of clustered regularly interspaced short palindromic repeats (CRISPR)-derived RNAs (crRNAs) in Haloferax volcanii. J. Biol. Chem. 289, (2014).
56. Wiedenheft, B. et al. RNA-guided complex from a bacterial immune system enhances target recognition through seed sequence interactions. Proc. Natl. Acad. Sci. 108, (2011).
57. Briner, A. E. et al. Guide RNA functional modules direct Cas9 activity and orthogonality. Mol. Cell 56, (2014).
58. Chen, H., Choi, J. & Bailey, S. Cut site selection by the two nuclease domains of the Cas9 RNA-guided endonuclease. J. Biol. Chem. 289, (2014).
59. Paez-Espino, D. et al. Strong bias in the bacterial CRISPR elements that confer immunity to phage. Nat. Commun. 4, (2013).
60. Anders, C., Niewoehner, O., Duerst, A. & Jinek, M. Structural basis of PAM-dependent target DNA recognition by the Cas9 endonuclease. Nature 513, (2014).
61. Hochstrasser, M. L. et al. CasA mediates Cas3-catalyzed target degradation during CRISPR RNA-guided interference. Proc. Natl. Acad. Sci. 111, (2014).
62. Court, D. L. et al. Mini-λ: A tractable system for chromosome and BAC engineering. Gene 315, (2003).
63. Cherepanov, P. P. & Wackernagel, W. Gene disruption in Escherichia coli: TcR and KmR cassettes with the option of Flp-catalyzed excision of the antibiotic-resistance determinant. Gene 158, 9-14 (1995).
64. Datsenko, K. A. & Wanner, B. L. One-step inactivation of chromosomal genes in Escherichia coli K-12 using PCR products. Proc. Natl. Acad. Sci. USA 97, (2000).
65. Zaslaver, A. et al. A comprehensive library of fluorescent transcriptional reporters for Escherichia coli. Nat. Methods 3, (2006).
66. Glascock, C. B. & Weickert, M. J. Using chromosomal lacIQ1 to control expression of genes on high-copy-number plasmids in Escherichia coli. Gene 223, (1998).
67. Afroz, T., Biliouris, K., Kaznessis, Y. & Beisel, C. L. Bacterial sugar utilization gives rise to distinct single-cell behaviors. Mol. Microbiol. 93, 1-11 (2014).

SUPPLEMENTARY INFORMATION

Figure 2.S1. The PAM-SCANR platform. (A) Protospacer for both the PAM-SCANR and PAM validation by gene repression. The spacers for the CRISPR arrays for the Type I and V systems were designed to match the positive strand, while the guides for the sgRNAs for the Type II systems were designed to match the negative strand. The target strand was selected to ensure that the PAM (yellow) was located upstream of the -35 and -10 elements (in red boxes) of the lacI promoter (PAM-SCANR) or the lacZ promoter (gene repression). The associated protospacer overlapped the -35 element of the promoter, allowing the Cas proteins to interfere with the recruitment of RNA polymerase in the presence of a functional PAM. (B) The two-plasmid and three-plasmid expression constructs for the PAM-SCANR and gene repression. The two-plasmid system (guide RNA plasmid, reporter plasmid) was used for the Type I-E CRISPR-Cas system native to E. coli, which was engineered to delete cas3 and constitutively express casABCDE 40. The three-plasmid system (Cas plasmid, CRISPR plasmid, reporter plasmid) was used for all CRISPR-Cas systems that were imported into E. coli. For validating gene repression, the PAM-SCANR reporter plasmid and CRISPR plasmid were replaced with the pUA66-lacZ plasmid carrying an introduced PAM and a guide RNA plasmid targeting pUA66-lacZ. Related to Figure


Figure 2.S2. Stringency tuning of PAM-SCANR with IPTG. (A) Adding sub-saturating concentrations of IPTG to the medium can reduce the fraction of active LacI repressor molecules, thereby facilitating the de-repression of GFP by the employed CRISPR-Cas system. (B) Mean fluorescence values measured for the PAM-SCANR reporter construct targeted by the S. pyogenes dCas9. IPTG increases the mean fluorescence for the weak functional PAM (orange) and the non-functional PAM (gray) but has no effect on the strong functional PAM (yellow). Intermediate concentrations of IPTG yield the greatest separation between the fluorescence values of the three PAMs. (C) Corresponding histograms at the displayed concentration of IPTG. Related to Figure 2.1 and

Figure 2.S3. Library coverage and flow cytometry histograms for pre-sorted and post-sorted cultures. (A) Library coverage and flow cytometry histograms for pre-sorted cultures (gray) and post-sorted cultures (yellow) for the E. coli Type I-E system. The PAM-SCANR was conducted with a 4-nt PAM library. Cultures transformed with the library and non-targeting CRISPR plasmid (red) served as a negative control. (B) Library coverage and flow cytometry histograms for pre-sorted and post-sorted cultures for the S. thermophilus CRISPR1 Type II-A system. The PAM-SCANR was conducted with a 5-nt PAM library with a 2-nt gap. (C) Library coverage and flow cytometry histograms for pre-sorted and post-sorted cultures for the F. novicida Type V system. The PAM-SCANR was conducted with a 4-nt PAM library. (D) Library coverage and flow cytometry histograms for pre-sorted and post-sorted cultures for the B. halodurans Type I-C system. The PAM-SCANR was conducted with a 4-nt PAM library. Data are shown for one of up to two similar replicates. Related to Figures
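For orientation, the 4-nt and 5-nt libraries named in this legend contain 4^4 = 256 and 4^5 = 1,024 possible sequences, respectively. The Python sketch below enumerates such a library and computes one plausible coverage measure, the fraction of possible PAMs seen at least once. That definition, along with the function names, is an assumption for illustration; the figure may quantify coverage differently.

```python
from itertools import product
from typing import List, Mapping


def all_pams(length: int) -> List[str]:
    """Enumerate every possible PAM of the given length (4**length sequences)."""
    return ["".join(p) for p in product("ACGT", repeat=length)]


def library_coverage(counts: Mapping[str, int], length: int) -> float:
    """Assumed coverage measure: fraction of possible PAMs with >= 1 read."""
    expected = all_pams(length)
    observed = sum(1 for pam in expected if counts.get(pam, 0) > 0)
    return observed / len(expected)
```

A read-count mapping such as the output of a per-PAM counting step can be passed directly as `counts`.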


Figure 2.S4. PAM-SCANR results for the S. thermophilus CRISPR1 dCas9. (A) Protospacer and library spacing used to analyze the S. thermophilus CRISPR1 Cas9. A 5-nt library was used with a 2-nt gap. (B) Individual screening of the S. thermophilus CRISPR1 Cas9. A total of 38 colonies were sequenced, and the number of appearances of each sequence is given in parentheses. The mean single-cell fluorescence is shown for post-sorted cells harboring the indicated PAM. (C) Validation of three PAMs by transcriptional repression. See Figure 2.3C for detailed information. Related to Figure

Electronic files

Movie 2.S1 and Kronas 2.S1-2.S7 can all be found at the following hyperlink:

Movie 2.S1. Animated overview of the PAM-SCANR and the PAM wheel. Related to Figure 2.1.
Krona 2.S1. Interactive Krona plot for the depletion assay with the S. pyogenes Cas9. Depletion data from Kleinstiver et al. Related to Figure 2.2.
Krona 2.S2. Interactive Krona plot for the depletion assay with the VQR variant of the S. pyogenes Cas9. Depletion data from Kleinstiver et al. Related to Figure 2.2.
Krona 2.S3. Interactive Krona plot for the depletion assay with the S. aureus Cas9. Depletion data from Kleinstiver et al. Related to Figure 2.2.
Krona 2.S4. Interactive Krona plot for the PAM-SCANR with the E. coli Type I-E Cascade. Related to Figure 2.3.
Krona 2.S5. Interactive Krona plot for the PAM-SCANR with the S. thermophilus CRISPR1 dCas9. Related to Figure 2.4.
Krona 2.S6. Interactive Krona plot for the PAM-SCANR with the F. novicida Type V Cpf1. Related to Figure 2.5.
Krona 2.S7. Interactive Krona plot for the PAM-SCANR with the B. halodurans Type I-C Cascade. Related to Figure

Table 2.S1 Strains, plasmids, and oligonucleotides used in this work.

Name | Genotype | Source | Stock Name
NovaBlue | endA1 hsdR17 (rK12- mK12+) supE44 thi-1 recA1 gyrA96 relA1 lac F'[proA+B+ lacIqZΔM15::Tn10] (TetR) | EMD Millipore (CN# 70181) | CB405
BW25113 | lacIq rrnBT14 ΔlacZWJ16 hsdR514 ΔaraBADAH33 ΔrhaBADLD78 | E. coli Genetic Stock Center (CGSC#: 7636) | CB406
NM500 | E. coli F- λ- ilvG- rfb-50 rph-1 mini-λ::tet | N. Majdalani | CB407
BW25113 Δcas3::cat | BW25113 Δcas3-PcasA::cat-PJ23119 | This study | CB408
BW25113 Δcas3 | BW25113 Δcas3-PcasA::PJ23119 | This study | CB409
BW25113 ΔlacIlacZ::cat | BW25113 ΔPlacI-lacZ::cat | This study | CB410
BW25113 Δcas3 ΔlacIlacZ::cat | BW25113 Δcas3-PcasA::PJ23119 ΔPlacI-lacZ::cat | This study | CB411
One Shot ccdB Survival 2 T1R | F- mcrA Δ(mrr-hsdRMS-mcrBC) Φ80lacZΔM15 ΔlacX74 recA1 araΔ139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG fhuA::IS2 | Thermo Fisher (CN#A10460) | CB412
BW25113 Δcas3-CRISPR::cat | BW25113 Δcas3-CRISPR1::cat | This study | CB413
BW25113 Δcas3-CRISPR ΔlacIlacZ | BW25113 Δcas3-CRISPR1 ΔPlacI-lacZ | This study | CB414
BW25113 Δcas3 TAGG PAM | BW25113 Δcas3-PcasA::PJ23119 ΔmhpR::kanR-TAGG-lacIq | This study | CB415
BW25113 Δcas3 CAAC PAM | BW25113 Δcas3-PcasA::PJ23119 ΔmhpR::kanR-CAAC-lacIq | This study | CB416
BW25113 Δcas3 CAAA PAM | BW25113 Δcas3-PcasA::PJ23119 ΔmhpR::kanR-CAAA-lacIq | This study | CB417
BW25113 Δcas3 CAAT PAM | BW25113 Δcas3-PcasA::PJ23119 ΔmhpR::kanR-CAAT-lacIq | This study | CB418
BW25113 Δcas3 CATA PAM | BW25113 Δcas3-PcasA::PJ23119 ΔmhpR::kanR-CATA-lacIq | This study | CB419
BW25113 Δcas3 AAAC PAM | BW25113 Δcas3-PcasA::PJ23119 ΔmhpR::kanR-AAAC-lacIq | This study | CB420
BW25113 Δcas3 AAAA PAM | BW25113 Δcas3-PcasA::PJ23119 ΔmhpR::kanR-AAAA-lacIq | This study | CB421
BW25113 Δcas3 GTCC PAM | BW25113 Δcas3-PcasA::PJ23119 ΔmhpR::kanR-GTCC-lacIq | This study | CB422

Table 2.S1 Continued

Plasmid | Description | Resistance | Source | Stock
ppam-scanr | placi-laci-lacilacz upstream of GFP | Kan | This Study | pcb423
paag_pam-scanr | Cloned AAG PAM into CB423 | Kan | This Study | pcb424
pj23119_ecolirepterm | Constitutive Repeat Terminator Construct - E. coli | Amp | Luo et al. 2014 | pcb425
pj23119ecoli_pamscanr | Cloned PAMSCANR targeting spacer CB425 | Amp | This Study | pcb426
ppam-scanr_ccdb | Cloned ccdb toxin gene into CB423 | Kan | This Study | pcb427
ppam-scanr_4ntlib | Cloned 4 nucleotide library into CB423 | Kan | This Study | pcb428
ppam-scanr_4n4c | Cloned Library "Protospacer-NNNN-CCCC" into CB423 | Kan | This Study | pcb429
ppam-scanr_1c4n3c | Cloned Library "Protospacer-C-NNNN-CCC" into CB423 | Kan | This Study | pcb430
ppam-scanr_2c4n2c | Cloned Library "Protospacer-CC-NNNN-CC" into CB423 | Kan | This Study | pcb431
ppam-scanr_3c4n1c | Cloned Library "Protospacer-CCC-NNNN-C" into CB423 | Kan | This Study | pcb432
ppam-scanr_4c4n | Cloned Library "Protospacer-CCCC-NNNN" into CB423 | Kan | This Study | pcb433
ppam-scanr_2c5n1c | Cloned Library "Protospacer-CC-NNNNN-C" into CB423 | Kan | This Study | pcb434
ppam-scanr-tagg | Cloned TAGG PAM into CB423 | Kan | This Study | pcb435
ppam-scanr_caac | Cloned CAAC PAM into CB423 | Kan | This Study | pcb436
ppam-scanr_caaa | Cloned CAAA PAM into CB423 | Kan | This Study | pcb437
ppam-scanr_caat | Cloned CAAT PAM into CB423 | Kan | This Study | pcb438
ppam-scanr_cata | Cloned CATA PAM into CB423 | Kan | This Study | pcb439
ppam-scanr_aaac | Cloned AAAC PAM into CB423 | Kan | This Study | pcb440
ppam-scanr_aaaa | Cloned AAAA PAM into CB423 | Kan | This Study | pcb441
pbad33 | Cloning backbone - Compatible with other two origins of replication (CB423 and CB425) | Cm | Addgene #65098 | pcb442
psth1_cas9 | Streptococcus thermophilus Cas9 protein | Amp | Briner et al. | pcb443
pspy_dcas9 | Deactivated Streptococcus pyogenes Cas9 protein | Amp | Addgene #44249 | pcb444
psth1_cas9_h599a | Mutated HNH(H599) Nuclease domain of CB443 | Amp | This Study | pcb445
psth1_dcas9 | Mutate RuvC(D9A) nuclease domain of CB445 | Amp | This Study | pcb446
pbad33_sth1dcas9 | Cloned CB446 into CB442 | Cm | This Study | pcb447
pcbad33_sth1dcas9 | Added J23108 constitutive promoter into CB447 | Cm | This Study | pcb448
psth1_sgrna | Base construct for Sth1 sgrnas | Amp | Briner et al. 2014 | pcb449
psth1_sgrna_pamscanr | Sth1 sgrna targeting the PAMSCANR construct | Amp | This Study | pcb450
ppam-scanr_sth1pam | Inserted "GGAGAATG" recognized PAM into CB423 | Kan | This Study | pcb451
pbad33_spydcas9 | Cloned CB444 into CB442 | Cm | This Study | pcb452
pcbad33_spydcas9 | Added J23108 constitutive promoter into CB452 | Cm | This Study | pcb453
pspy_sgrna | Base construct for Spy sgrnas | Amp | Briner et al. 2014 | pcb454
pspy_sgrna_pamscanr | Spy sgrna targeting the PAMSCANR construct | Amp | This Study | pcb455
ppam-scanr_pspyaag | Inserted "AAG" recognized PAM into CB423 | Kan | This Study | pcb456
ppam-scanr_pspyagg | Inserted "AGG" recognized PAM into CB423 | Kan | This Study | pcb457
pbhalo_cascade | Bacillus halodurans Cascade Sequence | Kan | Nam et al. 2012 | pcb458
pbad33_bhalo | Cloned ORF from CB458 into CB442 | Cm | This Study | pcb459
pcbad33_bhalo | Added J23108 constitutive promoter into CB459 | Cm | This Study | pcb460
pj23119_bhalorepterm | Cloned in a Bhalo CRISPR repeat into CB425 | Amp | This Study | pcb461
pj23119_bhalo_pamscanr | Inserted PAMSCANR targeting spacer into CB461 | Amp | This Study | pcb462
placz_pua66 | LacZ promoter inserted upstream of GFP | Kan | Zaslaver et al. | pcb463
pj23119ecoli_dr | Inserted spacer targeting the LacZ promoter into CB425 | Amp | This Study | pcb464
pj23119_bhalo_dr | Inserted spacer targeting the LacZ promoter into CB461 | Amp | This Study | pcb465
LacZPua66_GTTC | Inserted GTTC PAM into CB463 | Kan | This Study | pcb466
LacZPua66_ATTC | Inserted ATTC PAM into CB463 | Kan | This Study | pcb467
LacZPua66_GCTC | Inserted GCTC PAM into CB463 | Kan | This Study | pcb468
LacZPua66_ACTC | Inserted ACTC PAM into CB463 | Kan | This Study | pcb469
LacZPua66_GTCC | Inserted GTCC PAM into CB463 | Kan | This Study | pcb470
LacZPua66_ATCC | Inserted ATCC PAM into CB463 | Kan | This Study | pcb471
LacZPua66_GTTT | Inserted GTTT PAM into CB463 | Kan | This Study | pcb472
LacZPua66_TAGG | Inserted TAGG PAM into CB463 | Kan | This Study | pcb473
LacZPua66_CAAC | Inserted CAAC PAM into CB463 | Kan | This Study | pcb474
LacZPua66_CAAA | Inserted CAAA PAM into CB463 | Kan | This Study | pcb475
LacZPua66_CAAT | Inserted CAAT PAM into CB463 | Kan | This Study | pcb476
LacZPua66_CATA | Inserted CATA PAM into CB463 | Kan | This Study | pcb477
LacZPua66_AAAC | Inserted AAAC PAM into CB463 | Kan | This Study | pcb478
LacZPua66_AAAA | Inserted AAAA PAM into CB463 | Kan | This Study | pcb479
pcbad33_ecolicas3 | Cloned E. coli Cas3 onto CB460 | Cm | This Study | pcb480
pbad33_dcpf1 | Cloned a codon optimized, deactivated, Cpf1 protein into CB442 via Gibson assembly | Cm | This Study | pcb481
pcbad33_dcpf1 | Inserted J23108 upstream of dcpf1 protein | Cm | This Study | pcb482
pcpf1_pamscanr_spacer | Inserted a repeat - 30 nucleotide spacer - repeat for Cpf1 into CB3 (dcpf1_pamscanrgblock) | Amp | This Study | pcb483
pdcpf1_dr_spacer | Inserted a repeat - 30 nucleotide spacer - repeat for Cpf1 into CB3 (dcpf1_drgblock) | Amp | This Study | pcb484
LacZPua66_GCCA | Inserted GCCA into CB463 | Kan | This Study | pcb485
LacZPua66_AGAAT | Inserted AGGAT into CB463 on the bottom strand | Kan | This Study | pcb486
LacZPua66_AGAAC | Inserted AGGAC into CB463 on the bottom strand | Kan | This Study | pcb487
LacZPua66_TCTTA | Inserted TCTTA into CB463 on the bottom strand | Kan | This Study | pcb488

Table 2.S1 Continued

Shorthand | Name | Sequence
orl1 | LacOper.Fwd | ataactcgagtgcaaaacctttcgcggtatggc
orl2 | LacOper.Rev | TtaaGGATCCCTAattaaatgtgagcgagtaacaacccgtc
orl3 | Pam.Screen.fwd | ttagatttcatacacggtgc
orl4 | Pam.Screen.Rev | CATGGGAGAAAATAATACTGTTG
orl5 | Lac.KO.Rev | ACGTTGACACCATCGAATGGCGCAAAACCTTTCGCGGTtccatatgaatatcctccttag
orl6 | Lac.KO.Fwd | TGTAGTCGGTTTATGCAGCAACGAGACGTCACGGAAAATtgtaggctggagctgctt
orl7 | Lac.KO.Screen.Rev | CGT TTC ACC CTG CCA TAA AGA AAC TGT
orl8 | AGG.PAM.Fwd | CTAAGAAACCATTATTATCATGACATTAACCTATAAGG
orl9 | AGG.PAM.Rev | TCGACCTTATAGGTTAATGTCATGATAATAATGGTTTCTTAGACGT
orl10 | E.coli.pLacI.Spacer.Fwd | CACCTCGAGTTCCCCGCGCCAGCGGGGATAAACCGGTCGAGTGCAAAACCTTTCGCGGTATGGCA
orl11 | E.coli.pLacI.Spacer.Rev | TCGATGCCATACCGCGAAAGGTTTTGCACTCGACCGGTTTATCCCCGCTGGCGCGGGGAACTCGAGGTGGTA
orl12 | SpyCas9 fwd | ATGGATAAGAAATACTCAATAGGCT
orl13 | SpyCas9 rev | TATAAACGCAGAAAGGCCCA
orl14 | AAG.Screen.Fwd | CAT TAT TAT CAT GAC ATT AAC CTA TAA GGTCG
orl15 | AAG.Screen.Rev | TCA TCC AGC GGA TAG TTA ATG ATC AG
orl16 | B.halo.ORF.Screen.Fwd | ACCTATACGAAACCTATGAAGCCAATC
orl17 | B.halo.ORF.Screen.Rev | CCCTGTGACGAAACAAATATCCTCT
orl18 | PAM.Nlib.Fwd | ATATGACGTCCGTTCATTAAAAATTGAATTGACATTAACCTATAAAAATAGGCGT
orl19 | PAM.N2.Rev | AAATGTCGACNNGTGAAGACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA
orl20 | PAM.N3.Rev | AAATGTCGACNNNTGAAGACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA
orl21 | PAM.N4.Rev | AAATGTCGACNNNNGAAGACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA
orl22 | PAMLib.Screen.Fwd | ACGTCCGTTCATTAAAAATTGAAT
orl23 | ccdb gblock fwd | agttacgccagagcttgac
orl24 | ccdb.rev | cattggcaatggttctcgagttttttaa
orl25 | LacZstop.KanR.Fwd | CAATCCATCTTGTTCAATCATGCGA
orl26 | H599A.CR1.Fwd | TGAAGTAGATgcgATTTTACCTCTTTCTATCACATTC
orl27 | H599A.CR1.Rev | AACTGATTAGAATTATTTATCAAATCATG
orl28 | D9A.CR1.Fwd | TTTAGGACTTgcgATCGGTATAGGTTC
orl29 | D9A.CR1.Rev | ACTAAGTCACTCATAGATCC
orl30 | Sth1.dCas9.pBad33.Fwd | atttcccgggaggaggatcagatgagtgacttagttttaggacttgcga
orl31 | Sth1.dcas9.pBad33.Rev | tgatctgcagttaaaaatctagcttaggcttatcaccctcatttttg
orl32 | B.halo.33.Fwd | atttgagctcaggaggatcagatgagaaacgaagtccaatttgagcta
orl33 | B.halo.33.Rev | tgatctgcagttactggccatcaatcacttcaaca
orl34 | B.halo.RepTerm.Fwd | CACGACGTCGCACTCTTCATGGGTGCGTGGATTGAAATaaaaaaaaaccccgcccctgacagggcggggttttttttA
orl35 | B.halo.RepTerm.Rev | AGCTTaaaaaaaaccccgccctgtcaggggcggggtttttttttATTTCAATCCACGCACCCATGAAGAGTGCGACGTCGTGGTAC
orl36 | ccdb.pbad33.fwd | gattggtacctttacagctagctcagtcctaggtat
orl37 | ccdb.pbad33.rev | cctatctagataaaaaaaaccccgccctgtc
orl38 | Sth1.PAMSCANR.sgRNA.fwd | CTAGTCGAAAGGTTTTGCACTCGACGTTTTTGTACTCTGGTAC
orl39 | STH1.PAMSCANR.sgRNA.rev | CAGAGTACAAAAACGTCGAGTGCAAAACCTTTCGA
orl40 | Sth1.PAM.pua66.Rev | AAATGTCGACGGAGAATGACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA
orl41 | Illumina.PCR1.Fwd | TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGttcattaaaaattgaattgacattaacct
orl42 | Illumina.PCR1.Rev | GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGattcaccaccctgaattgact
orl43 | Index.501.fwd | AATGATACGGCGACCACCGAGATCTACACtagatcgcTCGTCGGCAGCGTC
orl44 | Index.502.fwd | AATGATACGGCGACCACCGAGATCTACACctctctatTCGTCGGCAGCGTC
orl45 | Index.707.Rev | CAAGCAGAAGACGGCATACGAGATgtagagagGTCTCGTGGGCTCGG
orl46 | Index.708.Rev | CAAGCAGAAGACGGCATACGAGATcctctctgGTCTCGTGGGCTCGG
orl47 | Index.709.Rev | CAAGCAGAAGACGGCATACGAGATagcgtagcGTCTCGTGGGCTCGG
orl48 | S.pyog.pBad33.Fwd | aaatcccgggaaaagatctaaagaggagaaaggatctatggataag
orl49 | S.pyog.pBad33.Rev | gtttctgcagcgccgacgtctccctaggtat
orl50 | S.pyog.GG.Rev | AAATGTCGACAGGTGAAGACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA

orl51 | S.pyog.AG.Rev | AAATGTCGACAAGTGAAGACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA
orl52 | Spy.PAMSCANR.sgRNA.Fwd | CTAGTCGAAAGGTTTTGCACTCGACGTTTTAGAGCTAGGTAC
orl53 | Spy.PAMSCANR.sgRNA.Rev | CTAGCTCTAAAACGTCGAGTGCAAAACCTTTCGA
orl54 | B.halo.Pua66.Stop.Fwd | GTCGCACTCTTCATGGGTGCGTGGATTGAAATGTCGAGTGCAAAACCTTTCGCGGTATGGCATGAacgt
orl55 | B.halo.Pua66.Stop.Rev | TCATGCCATACCGCGAAAGGTTTTGCACTCGACATTTCAATCCACGCACCCATGAAGAGTGCGACgtac
orl56 | pbad33.j23108.fwd1 | cgatctgacagctagctcagtcctaggtataatgctagcc
orl57 | pbad33.j23108.rev1 | ccggggctagcattatacctaggactgagctagctgtcagat
orl58 | pbad33.j23108.fwd2 | tctgacagctagctcagtcctaggtataatgctagcgagct
orl59 | pbad33.j23108.rev2 | cgctagcattatacctaggactgagctagctgtcagatgca
orl60 | pbad33.seq.fwd | aagtggcgagcccgatcttc
orl61 | tagg.pam.rev | AAATGTCGACcctaGAAGACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA
orl62 | caac.pam.rev | AAATGTCGACgttgGAAGACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA
orl63 | caaa.pam.rev | AAATGTCGACtttgGAAGACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA
orl64 | caat.pam.rev | AAATGTCGACattgGAAGACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA
orl65 | cata.pam.rev | AAATGTCGACtatgGAAGACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA
orl66 | aaac.pam.rev | AAATGTCGACgtttGAAGACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA
orl67 | aaaa.pam.rev | AAATGTCGACttttGAAGACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA
orl68 | 4C4N.Proto.Rev | AAATGTCGACNNNNCCCCACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA
orl69 | 3C4N1C.Proto.Rev | AAATGTCGACCNNNNCCCACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA
orl70 | 2C4N2C.Proto.Rev | AAATGTCGACCCNNNNCCACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA
orl71 | 1C4N3C.Proto.Rev | AAATGTCGACCCCNNNNCACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA
orl72 | 4N4C.Proto.Rev | AAATGTCGACCCCCNNNNACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA
orl73 | 1C5N2C.Proto.Rev | AAATGTCGACCCNNNNNCACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA
orl74 | B.halo.DR.spacer.fwd | GTCGCACTCTTCATGGGTGCGTGGATTGAAATCTTTACACTTTATGCTTCCGGCTCGTATGTTGTacgt
orl75 | B.halo.DR.spacer.rev | ACAACATACGAGCCGGAAGCATAAAGTGTAAAGATTTCAATCCACGCACCCATGAAGAGTGCGACgtac
orl76 | B.halo.PAM.DR.rev | TGAGTGAGCTAACTCACATTAATTG
orl77 | B.halo.GTTC.DR.fwd | TTAGGCACCCgttcCTTTACACTTTATGCTTC
orl78 | B.halo.ATTC.DR.fwd | TTAGGCACCCattcCTTTACACTTTATGC
orl79 | B.halo.GCTC.DR.fwd | TTAGGCACCCgctcCTTTACACTTTATGCTTC
orl80 | B.halo.ACTC.DR.fwd | TTAGGCACCCactcCTTTACACTTTATGC
orl81 | B.halo.GTCC.DR.fwd | TTAGGCACCCgtccCTTTACACTTTATGCTTC
orl82 | B.halo.ATCC.DR.fwd | TTAGGCACCCatccCTTTACACTTTATGC
orl83 | B.halo.GTTT.DR.fwd | TTAGGCACCCgtttCTTTACACTTTATGC
orl84 | E.coli.DR.Spacer.fwd | CACCTCGAGTTCCCCGCGCCAGCGGGGATAAACCG CTTTACACTTTATGCTTCCGGCTCGTATGT
orl85 | E.coli.DR.Spacer.rev | TCGA ACATACGAGCCGGAAGCATAAAGTGTAAAG CGGTTTATCCCCGCTGGCGCGGGGAACTCGAGGTGGTAC
orl86 | E.coli.DR.PAM.rev | TGAGTGAGCTAACTCACATTAATTG
orl87 | E.coli.DR.tagg.fwd | TTAGGCACCCtaggCTTTACACTT
orl88 | E.coli.DR.caaa.fwd | TTAGGCACCCcaaaCTTTACACTTTATG
orl89 | E.coli.DR.cata.fwd | TTAGGCACCCcataCTTTACACTTTATG
orl90 | E.coli.DR.aaac.fwd | TTAGGCACCCaaacCTTTACACTTTATGC
orl91 | E.coli.DR.caac.fwd | TTAGGCACCCcaacCTTTACACTTTATG
orl92 | E.coli.DR.caat.fwd | TTAGGCACCCcaatCTTTACACTTTATG
orl93 | E.coli.DR.aaaa.fwd | TTAGGCACCCaaaaCTTTACACTTTATG
orl94 | dcpf1_gblock1.fwd | CTCTCTACTGTTTCTCCATACCCGTTTTTTTGGGCTAGCGAATTCgagc
orl95 | dcpf1_gblock1.rev | GTTTATAAACGATTTTCTTGTAACCTTCACCCTTATTTTC
orl96 | dcpf1_gblock2.fwd | CAAGAAAAACAACAAAATCTTTGACGACAAAGCCATCAAG
orl97 | dcpf1_gblock2.rev | TGTATCAGGCTGAAAATCTTCTCTCATCCGCCAAAACAGCCaagcttTTAG
orl98 | Cpf1.gcca.fwd | TTAGGCACCCgccaCTTTACACTTTATGC
orl99 | Cpf1.gcca.rev | TGAGTGAGCTAACTCACATTAATTG
orl100 | Cas3.HR.fwd | GCCCGAATGTGCACCAGGTGCACCACGTTGTTTTAACTATGTGTAGGCTGGAGCTGCTTC

orl101 Cas3.HR.tagg CTATCATGCCATACCGCGAAAGGTTTTGCACTCGACcctaATTCCGGGGATCCGTCGACC
orl102 Cas3.HR.caac CTATCATGCCATACCGCGAAAGGTTTTGCACTCGACgttgATTCCGGGGATCCGTCGACC
orl103 Cas3.HR.caaa CTATCATGCCATACCGCGAAAGGTTTTGCACTCGACtttgATTCCGGGGATCCGTCGACC
orl104 Cas3.HR.caat CTATCATGCCATACCGCGAAAGGTTTTGCACTCGACattgATTCCGGGGATCCGTCGACC
orl105 Cas3.HR.cata CTATCATGCCATACCGCGAAAGGTTTTGCACTCGACtatgATTCCGGGGATCCGTCGACC
orl106 Cas3.HR.aaac CTATCATGCCATACCGCGAAAGGTTTTGCACTCGACgtttATTCCGGGGATCCGTCGAAC
orl107 Cas3.HR.aaaa CTATCATGCCATACCGCGAAAGGTTTTGCACTCGACttttATTCCGGGGATCCGTCGACC
orl108 Cas3.HR.gtcc CTATCATGCCATACCGCGAAAGGTTTTGCACTCGACggacATTCCGGGGATCCGTCGAAC
orl109 EcHRxtend.for TTCGGCTACAATCAAAACATGCCCGAATGTGCACCAGGTGCAC
orl110 EcHRxtend.rev TTGACTCTCTTCCGGGCGCTATCATGCCATACCGCGAAAG
orl111 Cas3.EcHR1.seq.fwd ATCAGAGCAGCCGATTGTC
orl112 Cas3.EcHR2.seq.rev CATCGCCGCTTCCAC
orl113 q5.agaat.fwd TCATTAGGCAattctGGCTTTACACTTTATGC
orl114 q5.agaac.fwd TCATTAGGCAgttctGGCTTTACACTTTATGC
orl115 q5.ggagg.fwd TCATTAGGCAcctccGGCTTTACAC
orl116 q5.tctta.fwd TCATTAGGCAtaagaGGCTTTACACTTTATGC
orl117 q5.pam.rev GTGAGCTAACTCACATTAATTG

File 2.S1: Supplemental experimental procedures

SUPPLEMENTAL EXPERIMENTAL PROCEDURES

2.S1.1 Strain generation

The BW25113 ΔPlacI-lacZ::cat strain was generated by PCR amplifying the cat cassette from pKD3 with orl5-6 and conducting λ-red mediated recombineering in NM500 as described previously 62. The BW25113 Δcas3-CRISPR1 ΔlacIlacZ::cat strain was generated by P1 transducing ΔPlacI-lacZ::cat into BW25113 ΔCRISPR 40. Successful recombinants were screened by PCR and sequenced. The cat resistance cassette in BW25113 ΔCRISPR-Cas ΔPlacI-lacZ::cat was excised by FRT-mediated recombination with pCP20 63, resulting in BW25113 Δcas3-CRISPR1 ΔlacIlacZ. This strain was used as the screening and validation strain for all imported CRISPR-Cas systems. Each BW25113 strain used for the cell killing assays was generated by first PCR amplifying the kanR cassette from pKD13 with orl100 paired with orl , where each primer pair adds a PAM, protospacer, and homology arms for recombination into mhpR 64. The PCR product was then recombineered into BW25113 Δcas3::cat with the pKD46 plasmid by λ-red mediated recombineering.

2.S1.2 Plasmid generation

The PAM-SCANR reporter plasmid was constructed by first amplifying the lacI promoter through the 129th codon of lacZ from E. coli K-12 MG1655 genomic DNA with orl1 and orl2. The PCR product was then inserted into the XhoI and BamHI restriction sites in pua The forward primer introduced the lacIq mutation, increasing the production of LacI 66. The reverse primer inserted a stop codon to prevent translation beyond the 129th lacZ codon. To ensure the efficient insertion of PAM sequences and libraries, a ccdB expression construct was chemically synthesized as a gBlock (ccdb_gblock) and inserted into the AatII and SalI restriction sites of the plasmid, which was propagated in the ccdB-resistant strain 2 T1R (ThermoFisher). Individual PAM sequences or PAM libraries were then generated by primer-dimer PCR with orl18 and a complementary reverse primer (orl19-21, 40, 50, 51, 61-73). PCR products were then digested with AatII and XhoI and inserted into the AatII and SalI restriction sites in the plasmid, replacing the ccdB gene. All plasmids were verified by colony PCR and by Sanger sequencing. Plasmid libraries were analyzed by colony PCR, where each library consisted of at least 95% positive clones (see PAM library synthesis).

Each CRISPR targeting plasmid was constructed by inserting a designed CRISPR array or single guide RNA downstream of a constitutive promoter in place of araC-PBAD in pBAD18. The CRISPR array for the Escherichia coli Type I-E system was generated by annealing orl10-11 or orl84-85, 5′ phosphorylating with PNK (NEB), and then inserting the DNA product into the KpnI and XhoI sites of J23119 pj23119_ecolirepterm 40. The insertion introduces a CRISPR repeat-spacer downstream of the constitutive J23119 promoter and upstream of a repeat and rho-independent terminator. pj23119_ecolirepterm was used as the non-targeting CRISPR plasmid for all experiments with the CRISPR-Cas I-E system. The sgRNAs for Streptococcus pyogenes were cloned into pspy_sgrna 57. The spacer sequence was then introduced by annealing and 5′ phosphorylating orl52-53 and inserting into the SpeI and KpnI restriction sites. The sgRNAs for Streptococcus thermophilus were introduced into psth1_sgrna by annealing and 5′ phosphorylating orl38-39 and inserting into the SpeI and KpnI restriction sites. The psth1_sgrna plasmid encodes a T4 phage-targeting spacer and was used as the non-targeting CRISPR plasmid. The CRISPR array for the Francisella novicida Cpf1 protein was generated by inserting dcpf1_pamscanrgblock or dcpf1_drgblock into the KpnI and HindIII restriction sites in pj23119_ecolirepterm, replacing the E. coli array. Each gBlock encoded a designed CRISPR repeat-spacer-repeat upstream of a rho-independent terminator. pj23119_ecolirepterm was used as the non-targeting control for this Type V system. The CRISPR repeat-terminator construct for B. halodurans was inserted downstream of the J23119 promoter in the E. coli base construct by annealing and 5′ phosphorylating orl34-35 to replace the E. coli array using the KpnI and HindIII restriction sites, forming pj23119_bhalorepterm. Repeat-spacers were then inserted by 5′ phosphorylating oligos (orl54-55 and orl74-75) and inserting them into J23119 pj23119_bhalorepterm using AatII and KpnI.

Each Cas protein expression plasmid was constructed by inserting the necessary cas gene(s) for a CRISPR-Cas system into pBAD33 followed by inserting an upstream constitutive promoter. The deactivated version of the Streptococcus pyogenes cas9 was PCR amplified with orl48-49 from pspy_dcas9 (Addgene #44249) and inserted into the XmaI and PstI restriction sites in pBAD33. The S. thermophilus CRISPR1 cas9 from psth1_cas9 was deactivated with the D9A and H599A mutations by Q5 mutagenesis (NEB) with orl26-29 followed by PCR amplification with orl30-31 and insertion into pBAD33 using XmaI and PstI 57. The deactivated version of the Francisella novicida cpf1 gene 26 was chemically synthesized as two gBlocks (dcpf1_gblock1 and 2) and inserted into the SacI and HindIII restriction sites of pBAD33 by Gibson assembly. The cas5d, csd1, and csd2 genes were amplified as an operon from pbhalo_cascade with orl32-33 and inserted into the SacI and PstI restriction sites within pbad The J23108 constitutive promoter (BBa_J23108 from the registry of standardized biological parts) was generated by annealing orl56-57 or orl58-59 and 5′ phosphorylating with PNK (NEB). The resulting DNA product was respectively inserted into the ClaI and XmaI restriction sites to generate pcbad33_spydcas9 and pcbad33_sth1dcas9 or into the NsiI and SacI restriction sites to generate pcbad33_bhalo and pcbad33_dcpf1.

The reporter plasmids used to validate gene repression were based on placz_pua Each PAM sequence was introduced by Q5 mutagenesis (NEB) upstream of the -35 element (Figure 2.S1B) with orl76 paired with orl77-83 or orl86 paired with orl

2.S1.3 PAM library synthesis

To generate each PAM library, oligos were chemically synthesized by IDT with random nucleotides within the location of the library (orl19-21, orl68-73). These individual reverse oligos were amplified by primer-dimer PCR paired with orl18, digested with SalI and AatII, and inserted in place of the cytotoxic ccdB gene in ppamscanr_ccdb. The ligation mix was then transformed into ccdB-sensitive NovaBlue cells. The efficiency of the library cloning was analyzed by plating the cells and performing colony PCR on 52 colonies. Correct band sizes were seen in greater than 95% of the screened colonies. The necessary amount of ligation mix was electroporated into NovaBlue cells to achieve at least ten times as many correct clones as the size of the nucleotide library. After transformation, the cells were recovered for one hour in SOC medium at 37 °C and diluted 1:100 into LB medium containing antibiotics. The plasmid library was then isolated from the overnight culture to sequence with the post-sorted library DNA as part of comprehensive screening.
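The coverage rule in the library synthesis procedure (at least ten correct clones per library member) is simple arithmetic. The following is a minimal Python sketch, assuming a fully randomized library with four possible bases at every randomized position; the function name is illustrative and not part of the original protocol.

```python
def min_correct_clones(n_random_positions, fold_coverage=10):
    """Minimum number of correct clones for a randomized PAM library.

    Assumption of this sketch: every randomized position can hold any of
    the 4 bases, so the library contains 4**n distinct sequences.
    """
    if n_random_positions < 1:
        raise ValueError("need at least one randomized position")
    library_size = 4 ** n_random_positions
    return fold_coverage * library_size
```

Under these assumptions a 4-nt library (256 sequences) calls for at least 2,560 correct clones and a 5-nt library (1,024 sequences) for at least 10,240.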

2.S1.4 Growth conditions

All experiments in this work were performed in derivatives of E. coli BW All strains were grown in 14-ml round-bottom polypropylene tubes at 37 °C and shaken at 250 RPM at volumes of 5 ml or less. Larger volumes were grown in baffled 125-ml Erlenmeyer glass flasks. Strains were cultured in LB medium (10 g/l NaCl, 5 g/l yeast extract, 10 g/l tryptone), in M9 minimal medium (1X M9 salts, 2 mM MgSO4, 0.1 mM CaCl2, 10 µg/ml thiamine) supplemented with 0.4% glycerol and 0.2% casamino acids, or on LB agar (LB medium with 1.2% agar). Ampicillin, chloramphenicol, and kanamycin were used as necessary at working concentrations of 50 µg/ml, 34 µg/ml, and 50 µg/ml, respectively, in both solid and liquid medium.

2.S1.5 Flow cytometry analysis

All experiments with the E. coli CRISPR-Cas I-E system were performed in M9 minimal medium similar to previous work 67. Briefly, single colonies were inoculated and cultured overnight, followed by measuring ABS600 on a NanoDrop 2000c Spectrophotometer. Cells were diluted to an ABS600 of 0.01 and grown for ~4.5 hours. Cultures were then diluted 1:50 in 1X phosphate buffered saline (PBS) and analyzed on an Accuri C6 Flow Cytometer (Becton Dickinson) equipped with a CFlow plate sampler, a 488-nm laser, and a 530 +/- 15-nm bandpass filter. Forward scatter (cut-off of 15,500) and side scatter (cut-off of 600) were used to exclude non-cellular events, and a gate was set for E. coli cells based on previous work 67. Fluorescence of gated events was recorded using FL1-H, with a minimum of 30,000 collected events for data analysis.

To analyze the three-plasmid systems (all other CRISPR-Cas systems analyzed in this work), cells were cultured in LB medium due to limited growth in M9 minimal medium. To analyze the extent of transcriptional repression, individual colonies were inoculated into overnight cultures and grown for 16 hours. These cultures were then diluted 1:500 in 1X PBS and run on the flow cytometer with the same conditions described for the E. coli I-E CRISPR-Cas system. Fold-repression was calculated as the ratio of the fluorescence values for the targeting spacer over a non-targeting control for each PAM.

2.S1.6 Fluorescence-activated cell sorting

One day prior to sorting, E. coli cells harboring the CRISPR plasmid and Cas plasmid (or constitutively expressing the I-E Cascade) were electroporated with the corresponding PAM library and recovered overnight in 25 ml of liquid medium. The following morning, plasmid DNA was isolated from an aliquot of the overnight cultures, representing the pre-sorted library. Cells were then back-diluted in their respective medium (with IPTG for all foreign systems) and cultured to an ABS600 of ~0.2. Following flow cytometry analysis, the cultures were transported on ice to the NC State College of Veterinary Medicine Cell Sorting Facility and sorted by fluorescence-activated cell sorting (FACS) on a Beckman Coulter MoFlo XDP Cell Sorter. A control strain harboring a non-targeting CRISPR plasmid was also analyzed for each system to establish the side and forward scatter gating for all samples. If possible, samples were double sorted, first collecting 500,000 events followed by a second sort to collect 50,000 events. Because the sorter could not rapidly separate small enrichments, targeting samples with lower enrichment were single-sorted, collecting 100,000 events. Sorted cells were transported back to the lab on ice, back-diluted into 25 ml of antibiotic-selective LB medium, and grown overnight. The following morning, plasmid DNA was isolated from the sorted overnight cultures (post-sorted DNA). Some of the cells were back-diluted and subjected to flow cytometry analysis to measure the post-sorted fluorescence distribution. Histograms from cells prior to and following sorting can be found in Figure 2.S3.

2.S1.7 Next-generation sequencing library preparation

DNA was prepared for deep sequencing by first amplifying a 134 base-pair region from the reporter plasmid isolated from pre-sorted and post-sorted cultures. The amplification oligos contained the necessary adapters for analysis by Illumina MiSeq:

Fwd 5′-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-3′
Rev 5′-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG-3′

Once amplified, the DNA fragments were purified with AMPure XP magnetic beads (Beckman Coulter). A 0.9X bead purification was used, followed by two 80% ethanol washes. After washing, the beads were dried and the DNA was eluted in twice the original volume with 10 mM Tris-HCl. These amplified DNA fragments were then indexed using an 8-cycle PCR. Nextera indices 501, 502, 707, 708, and 709 were used in this work and are encoded within orl Once amplified, the PCR products were purified with AMPure XP magnetic beads as in the first PCR, but eluted in the original PCR volume. These samples were quantified on a NanoDrop 2000c spectrophotometer and by agarose gel electrophoresis. DNA was then diluted to 10 nM, and samples with separate indices were pooled for a total volume of 20 µl. Samples were run with the 150-bp single-end read kit on an Illumina MiSeq. Every sample generated at least 550,000 reads, although most samples generated more than 1 million reads. Raw and processed sequencing reads were uploaded to NCBI GEO (Accession # GSE75718). An in-depth procedure of how the reads were analyzed can be found in the Detailed Protocol. Data analyzed from Kleinstiver et al. were already in Excel format, with depletion based on sequence. These depletion values were inverted to create an enrichment score, which was imported into the Krona plot Excel document.

Krona plot generation. A detailed procedure for generating Krona plots 48 can be found in the Detailed Protocol document.
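The dilution of each indexed sample to 10 nM before pooling can be worked through with a short calculation. The sketch below is illustrative only: it assumes the common approximation of 660 g/mol per base pair of double-stranded DNA, which is not stated in the procedure above, and the function names and any concentrations or amplicon lengths passed in are hypothetical.

```python
AVG_BP_MASS_G_PER_MOL = 660.0  # assumed average mass of one dsDNA base pair


def dsdna_nanomolar(conc_ng_per_ul, length_bp):
    """Convert a dsDNA mass concentration (ng/µl) to molarity (nM)."""
    if length_bp <= 0:
        raise ValueError("length_bp must be positive")
    # ng/µl equals mg/l; dividing by g/mol and scaling by 1e6 gives nmol/l.
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS_G_PER_MOL * length_bp)


def dilution_volumes(stock_nm, target_nm, final_ul):
    """Return (µl of stock, µl of diluent) from C1 * V1 = C2 * V2."""
    if stock_nm < target_nm:
        raise ValueError("stock is already more dilute than the target")
    stock_ul = target_nm * final_ul / stock_nm
    return stock_ul, final_ul - stock_ul
```

For example, a hypothetical 100 nM stock needs 2 µl of stock plus 18 µl of diluent to reach 10 nM in 20 µl.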
2.S1.8 Sequence logo generation

Sequence logos were generated from PAM reads with a 10-fold or greater increase between the library and the depletion assay. The PAM enrichment values were normalized to the lowest enriched PAM from this subset before being entered into the WebLogo software.

2.S1.9 Cell killing assay

For the cell killing assays with the I-E CRISPR-Cas system, E. coli strains were modified by λ-red recombination to harbor PAMs from the reporter plasmid and transformed with pcbad33_ecolicas3. The strains were then electroporated with 50 ng of pj23119ecoli_pamscanr Spacer or pj23119_ecolirepterm, followed by recovery in SOC medium and dilution plating on LB agar. The fold-reduction in the transformation efficiency was calculated as the number of transformants for the non-targeting plasmid divided by that for the CRISPR plasmid.
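The two calculations described in 2.S1.8 and 2.S1.9 can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis actually used: the function names, the dictionary layout, and the example values in the usage note are assumptions.

```python
def logo_weights(enrichment, min_fold=10.0):
    """Keep PAMs enriched at least `min_fold` and rescale the survivors.

    The weakest PAM that passes the cut-off receives weight 1.0, mirroring
    the normalization to the lowest enriched PAM in the subset.
    """
    kept = {pam: e for pam, e in enrichment.items() if e >= min_fold}
    if not kept:
        return {}
    floor = min(kept.values())
    return {pam: e / floor for pam, e in kept.items()}


def fold_reduction(nontargeting_transformants, targeting_transformants):
    """Transformants with the non-targeting plasmid over the CRISPR plasmid."""
    if targeting_transformants <= 0:
        raise ValueError("targeting count must be positive to form a ratio")
    return nontargeting_transformants / targeting_transformants
```

With hypothetical enrichments of 40, 10, and 2 for three PAMs, only the first two pass the 10-fold cut-off and receive weights of 4 and 1.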

File 2.S2 Detailed protocol

DETAILED PROTOCOL

2.S2.1 Generation of PAM enrichment data

To generate enrichment scores, the PAM sequences must be extracted from the raw .fastq files. To perform this trimming, the .fastq files were analyzed using the following command:

    grep 'TTCATTAAAAATTGAATTGACATTAACCTATAAAAATAGGCGTCGAGGCCCTTTCGTCTTC[TCAG][TCAG][TCAG][TCAG]GTCGAGTGCA' Sample1.fastq | cut -c | sort | uniq -c | sort -nr | less > Sample1List.txt

The larger 5-nt library had a slightly different sequence, so those reads were trimmed from the raw .fastq files using the following command:

    grep 'TTCATTAAAAATTGAATTGACATTAACCTATAAAAATAGGCGTCGAGGCCCTTTCGTG[TCAG][TCAG][TCAG][TCAG][TCAG]GGGTCGAGTGCA' Sample1.fastq | cut -c | sort | uniq -c | sort -nr | less > Sample1List.txt

Note that Sample1.fastq and Sample1List.txt are filenames and will differ depending on the sequencing sample. Also note that the two commands differ only in the sequence and in the character numbers after the cut -c command. The grep and cut commands require that the library region and the neighboring 12 nucleotides of the protospacer were accurately sequenced, removing any adapter-only reads or other sequencing miscalls. The rest of the command outputs a tab-delimited list of the read counts and their associated PAM sequences. This list was imported into Microsoft Excel and sorted by sequence. Enrichment was calculated as the ratio of the frequency of a given sequence within the library before and after cell sorting. An example enrichment dataset is presented below in Figure 2.S1.1. The highlighted cells are some of the PAM sequences within this dataset that showed high enrichment over the unsorted control.
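The same counting and enrichment steps can also be sketched in Python. The example below is not the original analysis, which used grep and Microsoft Excel. It assumes the standard four-line FASTQ record layout, reuses the flanking sequences from the 4-nt grep pattern, captures the PAM with a regular expression instead of fixed character positions, and takes enrichment as the post-sort frequency divided by the pre-sort frequency. All function names are hypothetical.

```python
import re
from collections import Counter

# Flanks copied from the 4-nt grep pattern; the capture group is the PAM.
PAM_4NT = re.compile(
    "TTCATTAAAAATTGAATTGACATTAACCTATAAAAATAGGCGTCGAGGCCCTTTCGTCTTC"
    "([ACGT]{4})"
    "GTCGAGTGCA"
)


def count_pams(fastq_lines, pattern=PAM_4NT):
    """Count PAMs found in the sequence lines of a FASTQ stream.

    Assumes 4 lines per record, so the sequence is the second line of each
    record. Reads without the exact flanks are skipped.
    """
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:
            continue
        match = pattern.search(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


def enrichment(pre_counts, post_counts):
    """Post-sort frequency over pre-sort frequency for each pre-sort PAM."""
    pre_total = sum(pre_counts.values())
    post_total = sum(post_counts.values())
    if pre_total == 0 or post_total == 0:
        raise ValueError("both samples need at least one matching read")
    return {
        pam: (post_counts.get(pam, 0) / post_total) / (n / pre_total)
        for pam, n in pre_counts.items()
    }
```

In use, `count_pams(open("Sample1.fastq"))` would be run once on the pre-sorted sample and once on the post-sorted sample, and the two counters passed to `enrichment`. The 5-nt library would need its own pattern built from the second grep command.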

Figure 2.S1.1: Example dataset representing fold enrichment of a PAM library

2.S1.2 Generation of PAM wheels using Krona plots

To generate the Krona plot, there are two options. The first is to download the source code and run it on a local machine. In this work, we instead used the KronaExcelTemplate. This is a macro-enabled Excel document that automatically saves an interactive .html file when data are inserted into the correct cells. The native document contains a worked example based on nutrients to demonstrate how the hierarchy is created for the categories. When the Create chart button in the upper left corner is pressed, Excel brings up a Save as prompt, where the .html file can be saved. Once saved, this file can be opened and explored with any modern web browser. To use this plotting software for sequence data, the categories were switched from nutrients to nucleotides. Presented below in Figure 2.S2.2 is an example 2-nucleotide library entered into the Krona Excel sheet. In this dataset, the sequence with the highest score, 0.9, is CA.

Figure 2.S2.2: Example dataset of a 2-nucleotide library. Category 1 represents a given nucleotide, while Category 2 is the nucleotide adjacent to the first.

To better demonstrate this dataset, the Krona plot can be opened, and is presented below in Figure 2.S

Figure 2.S2.3: Example dataset represented in a Krona plot, opened by a web browser. Colors correspond to the enrichment of the innermost nucleotide.

As seen in this plot, the CA sequence is the most enriched sequence. Note that Category 1 is always the nucleotide that determines the hierarchy of the remaining sequences, which is why the C nucleotide is the innermost nucleotide in the CA sequence. Another helpful feature of the Krona plot is the ability to explore a given sequence space. Within the blue A wedge, the AC sequence is not visualized due to its extremely low enrichment score of 0.02. If the blue A is clicked twice, the plot zooms in to all sequences that begin with that A, allowing for direct interrogation of sequence space. The resulting zoomed plot is presented below in Figure 2.S

Figure 2.S2.4: Analysis of Example 1 of all sequences that begin with A.

As seen in Figure 2.S2.4, the AC sequence now becomes visible as a very small fraction. The percentages now shown represent the enrichment of each nucleotide given that the first nucleotide is an A. Thus, the 3% shown for C means that 3% of the enrichment among sequences beginning with A belongs to sequences with a C in the second position. This plotting software also reports a total enrichment, seen in the top right corner. 18% of the Example 1 dataset is shown within this plot, so the other 82% starts with a nucleotide other than A.

These basic principles are how the PAM wheels were created for this manuscript. Using enrichment data, the libraries were expanded to 4 and 5 nucleotides, instead of the 2 used in the previous example. All of these Krona plot .html files are available with the manuscript. In these Krona plots, each nucleotide was concatenated onto the previous one to assist in reading the sequence. This also assists in orientation when looking at a large dataset. An example system is presented below in Figure 2.S2.5, the E. coli dataset seen in Figure 2.3 of the main text. As previously described, this sequence space can be probed and investigated by clicking different sequences within the layers of the wheel, which is extremely useful when analyzing the less enriched samples.
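The percentages that appear after zooming into a wedge are conditional fractions, which the short Python sketch below reproduces. The dataset is hypothetical: only the CA score of 0.9 and the AC score of 0.02 come from the example above, and the remaining values were chosen so that the fractions land near the quoted 18% and 3%. The function name is illustrative and this is not part of the Krona software.

```python
# Hypothetical 2-nt dataset; only CA = 0.9 and AC = 0.02 come from the text.
EXAMPLE = {
    "AA": 0.30, "AC": 0.02, "AG": 0.20, "AT": 0.15,
    "CA": 0.90, "CC": 0.30, "CG": 0.40, "CT": 0.25,
    "GA": 0.20, "GC": 0.10, "GG": 0.20, "GT": 0.10,
    "TA": 0.20, "TC": 0.10, "TG": 0.20, "TT": 0.10,
}


def zoom(scores, prefix):
    """Mimic zooming into one wedge of the plot.

    Returns the wedge's share of the whole dataset and, for each sequence
    starting with `prefix`, its share within that wedge.
    """
    total = sum(scores.values())
    subset = {seq: s for seq, s in scores.items() if seq.startswith(prefix)}
    subset_total = sum(subset.values())
    if subset_total == 0:
        raise ValueError("no enrichment under this prefix")
    within = {seq: s / subset_total for seq, s in subset.items()}
    return subset_total / total, within
```

Calling `zoom(EXAMPLE, "A")` gives a wedge share of about 18% of the whole dataset and an AC share of about 3% within the A wedge.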

Figure 2.S2.5: Krona plot for the E. coli dataset seen in Figure 2.3

As they stand, these .html files are well suited to explore the PAM space of a given system. However, they are not easily read as single plots. Thus, for each system we simplified the plots using Adobe Illustrator and added a guide that corresponds to the position within the PAM. Figure 2.S2.6 is a modified version of Figure 2.S2.5 that makes this single plot easier to comprehend.

Figure 2.S2.6: A modified E. coli Krona plot called the PAM wheel. Here, additional guides and sequences were added to assist the reader in orientation of the wheel.

This same procedure was used on all the data presented in this manuscript. Of note, since the innermost nucleotide of the PAM wheel determines the color scheme and subsequent hierarchies, we chose this position to be the nucleotide adjacent to the protospacer. When viewing these wheels, this allowed for the most accurate representation of the biases seen between individual nucleotide sequences. Additionally, some wheels (such as those presented in Figure 2.2) represent only the functional regions of the PAM, not the N spacing gap that is seen in many Class II, Type II systems. These N regions only increase the complexity of the plots and do not add any insight (as seen in the S. aureus Cas9 data in Figure 2.2, panel C).
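The hierarchy described above (innermost ring adjacent to the protospacer, each ring labeled with the nucleotides concatenated so far) can be sketched as a small helper that turns enrichment scores into rows of a score followed by category labels. This is one reading of the procedure, not the original workflow: the function name, the `adjacent_end` parameter, and the row layout are assumptions, and the column order expected by a given Krona input should be checked before use.

```python
def krona_rows(enrichment, adjacent_end="first"):
    """Build (score, category 1, category 2, ...) rows for a PAM wheel.

    `enrichment` maps PAM sequences written 5' to 3' to their scores.
    `adjacent_end` names the end of the PAM that touches the protospacer:
    "first" if the protospacer lies 5' of the PAM, "last" if it lies 3'.
    Category k holds the k nucleotides nearest the protospacer, kept in
    5' to 3' reading order.
    """
    rows = []
    ranked = sorted(enrichment.items(), key=lambda kv: kv[1], reverse=True)
    for pam, score in ranked:
        n = len(pam)
        if adjacent_end == "first":
            labels = [pam[:k] for k in range(1, n + 1)]
        elif adjacent_end == "last":
            labels = [pam[-k:] for k in range(1, n + 1)]
        else:
            raise ValueError("adjacent_end must be 'first' or 'last'")
        rows.append((score, *labels))
    return rows
```

For a PAM located 5' of the protospacer, `krona_rows({"AAG": 5.0}, adjacent_end="last")` places G on the innermost ring, followed by AG and AAG.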

CHAPTER 3

A growth-based PAM screen reveals non-canonical PAMs for the S. pyogenes Cas9

Ryan T. Leenay, Rebecca A. Slotkowski, Brooke McGirr, and Chase L. Beisel

Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh NC

ABSTRACT

CRISPR-Cas systems have been engineered into a multitude of genetic tools that manipulate nucleic acids through a base-pairing mechanism. One of the only targeting rules for these systems is that the target be flanked by a protospacer-adjacent motif (PAM), a sequence that is unique to each Cas protein found in nature. In this work we present a growth-based assay to expedite the rate at which PAMs can be identified. This PAM screen manipulated E. coli carbon metabolism so that growth only occurred if a PAM was recognized by a deactivated Cas effector protein. The assay was characterized with the ubiquitous Streptococcus pyogenes Cas9 nuclease, and instead of presenting its most functional PAMs, the screen enriched for non-canonical PAMs. We used this screen to discover and later demonstrate that the S. pyogenes Cas9 is able to functionally target three non-canonical motifs. Recognition of these PAMs was validated using both transcriptional repression and plasmid clearance, although they proved to be less functional than the canonical NGG motif. Altogether, we expand the targeting rules for S. pyogenes Cas9 in bacteria through the development of a new tool that enriches for non-canonical PAMs for CRISPR-Cas systems.

3.1 INTRODUCTION

CRISPR-Cas systems have been engineered into commonplace tools in both biotechnology and medicine 1–4, and their popularity has driven the rapid discovery of unique properties and applications. Despite this pace, only a handful of systems to date are completely characterized beyond just demonstrating their activity in vivo or in vitro 11,12. To completely define a novel CRISPR-Cas system, one must first identify the CRISPR-associated (Cas) proteins responsible for cleavage and the CRISPR RNAs (crRNAs) required for target recognition. Both of these components can be defined through bioinformatic approaches 11,16 by comparing bacterial genomes to more characterized systems 11,17.

Targeting requirements for CRISPR-Cas systems are based on two primary rules: complementarity between the target and the crRNA, and recognition of a protospacer-adjacent motif (PAM) by the Cas effector proteins. Thus, to completely understand the targeting rules for a novel CRISPR-Cas system, the PAM must be identified. Upon recognition of a PAM, the Cas proteins unwind the DNA double helix and allow the crRNA to probe the adjacent sequence for complementarity. Due to this function, the protein-DNA interaction can be difficult to accurately predict a priori 12.

A number of assays currently exist to screen for functional PAM sequences recognized by a given Cas effector protein 12,28. Our lab has developed one previously (see Chapter 2) that utilized deactivated Cas effector proteins to enrich GFP populations based on recognition of a PAM sequence 29. Binding of a PAM activated GFP expression, and fluorescing cells containing PAMs were subsequently isolated using a fluorescence-activated cell sorter (FACS). However, selecting with GFP presented drawbacks 12. The screen required a FACS, a piece of equipment that may not be available to every biotechnology lab. Additionally, the cell sorting process was both expensive and time-consuming, which is not well suited for a high-throughput approach to characterize numerous CRISPR-Cas systems simultaneously. These drawbacks led us to design a modified version of the assay.

3.2 DESIGN

To simplify the selection for the PAM-SCANR platform, a genetic circuit regulating bacterial propagation was chosen to positively select for CRISPR PAMs, removing the need for the rate-limiting FACS sorting. The previously designed circuit controlled gfp expression through the regulation of both the lacI and lacZ promoters. A deactivated Cas effector was directed to bind to the promoter of lacI in a PAM-dependent manner 29–34, preventing lacI transcription and allowing for gfp expression. To simplify the PAM selection, the gfp gene could be replaced with a gene controlling bacterial growth. Thus, E. coli cells containing a PAM would grow faster than cells with non-PAMs, enriching for recognized sequences using only cellular propagation. Bacterial growth can be regulated either through an inhibitor such as an antibiotic or through a growth-promoting component such as a metabolic enzyme. As a first attempt to create this circuit, the gfp gene was replaced with the antibiotic resistance gene bla, which would enrich for PAM-containing cells through introduction of the antibiotic into the medium. Unfortunately, we observed no growth regulation, likely due to expression of the antibiotic resistance gene even when no repression of lacI was occurring (Figure 3.S1).

To instead control propagation through expression of a growth-promoting gene, the characterized D-xylose metabolic pathway was selected 35. Previous work in our lab linearized the response to D-xylose by changing the expression of different sugar utilization genes, ultimately disrupting the control loops set by the cell 35. To function, a growth-based circuit would only express the necessary D-xylose enzymes when a correct PAM was recognized by a CRISPR-Cas system, and catabolic genes would not be expressed if a non-PAM was present (Figure 3.1). The gfp gene in the previously used genetic circuit 29 was replaced with the xylAB operon, which encodes the enzymes responsible for converting D-xylose into D-xylulose 5-phosphate before it enters the pentose phosphate pathway 29,36,37. Thus, when this genetic system is grown in a medium where D-xylose is the sole carbon source, only circuits with functional PAMs would be expected to grow. These PAM-containing cells could then be lysed, and the resulting DNA would be sequenced to identify any recognized PAM sequences.
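The intended behavior of this growth-based circuit can be summarized as Boolean logic. The Python sketch below is an idealization written for illustration only: it treats every step as fully on or off and therefore ignores leaky expression, which the failed bla attempt shows can matter in practice. The optional `iptg` argument reflects the well-known role of IPTG as an inducer that inactivates LacI. The function names are hypothetical.

```python
def xylab_expressed(pam_recognized, iptg=False):
    """Idealized on/off model of the xylAB selection circuit.

    A deactivated Cas effector bound at the lacI promoter (functional PAM)
    blocks LacI production; IPTG inactivates any LacI that is made. With no
    active LacI, the lacZ promoter is free and xylAB is transcribed.
    """
    laci_made = not pam_recognized
    laci_active = laci_made and not iptg
    return not laci_active


def grows_on_xylose(pam_recognized, iptg=False):
    """Growth on D-xylose as the sole carbon source requires xylAB."""
    return xylab_expressed(pam_recognized, iptg)
```

In this model only cells with a recognized PAM grow without IPTG, while adding IPTG permits growth regardless of the PAM.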

106 Figure 3.1 Growth-based circuit for a PAM assay using D-xylose selection. The xylab operon was inserted downstream of the laci promoter, laci gene, and lacz promoter to regulate cell growth. The PAM library was placed directly adjacent to the -35 site of the P laci promoter, allowing a deactivated Cas effector protein to bind within the promoter. If a non-functional PAM is present, the Cas protein will not bind, LacI will be expressed, bind the lacz promoter, and prevent xylab expression and cell growth in xylose media. If a functional PAM is present, the Cas protein binds the laci promoter and allows for expression of xylab and growth in a minimal medium supplemented with D-xylose as the sole carbon source. 3.3 RESULTS Xylose media growth optimization To start, the growth conditions in minimal medium with D-xylose as the sole carbon source were optimized for the xylab genetic circuit. To mimic a CRISPR-Cas system binding a functional PAM, Isopropyl b-d-1-thiogalactopyranoside (IPTG) was added into the media to bind LacI and allow for xylab expression 29. E. coli cells supplemented with IPTG expressed the xylab genes and successfully propagated in selective xylose-only media, in comparison a fully active LacI protein (Figure 3.S2). However, the turbidity of this positive control was only increased three-fold, a limiting dynamic range for a positive screen. Due to the low turbidity of the bacteria after selection, we hypothesized that the selective D-xylose media was placing too much metabolic stress on the bacteria. The medium was then supplemented with glycerol 94

107 and cas amino acids to increase the fitness of only the the xylab expressing cells. At these lower concentrations of glycerol and casamino acids (0.5% and 10% of standard culture concentrations respectively), the difference between our positive and negative controls increased to approximately ~10 fold. When cells were cultured with standard concentrations of either glycerol or cas amino acids, enriched turbidity of cells with IPTG was not observed, most likely due to importation of these components as primary carbon sources. After optimization, these culture conditions were used to analyze the PAM recognition for a CRISPR-Cas system Characterizing growth-based screening with SpydCas9 After confirmation that the xylose growth circuit was successfully functioning using IPTG, a Cas nuclease was evaluated using this growth-based PAM screen. Streptococcus pyogenes dcas9 was chosen to further characterize the assay due to its broad applications and potent gene repression 38,39, Previous work established that this protein recognizes an NGG PAM sequence, and to a much lower extent an NAG PAM 39,34. Both sequences were inserted directly upstream of the -35 site of the laci promoter to ensure that each of them could be detected with the screen. This location within the promoter allows S. pyogenes Cas9 (SpyCas9) to control expression of laci, xyla and xylb, and ultimately cell growth. Additionally, previous results required IPTG to lower the threshold for GFP expression in order to observe a phenotypic response from NAG PAM recognition 29. IPTG was thus added in varying concentrations to analyze both NGG and NAG PAMs to determine if SpyCas9 was able to successfully bind the laci promoter and allow for growth in xylose only media 95


(Figure 3.2). The starting OD600 of the culture was also varied to determine if the starting concentrations of xylAB-expressing cells affected the final turbidity.

Figure 3.2 Optimizing growth conditions for SpyCas9 PAMs. Cells were grown for 16 hours in M9 minimal media supplemented with 0.002% glycerol, 0.02% casamino acids, and 0.2% xylose. (A) IPTG concentrations and starting ODs were varied to optimize growth conditions for the NGG consensus sequence. Cells were washed in M9 salts, back-diluted, and grown for 16 hours. (B) IPTG concentration and starting ODs were varied for the NAG PAM motif. Cells were washed in M9 salts, back-diluted, and grown for 16 hours. No separation between a targeting and non-targeting control was observed until the starting OD600 was

The AGG PAM showed differential growth across all starting OD600 values and IPTG concentrations. However, the AAG PAM only showed a difference in turbidity when the starting OD600 was 0.001, and showed a degree of improvement when IPTG was added. The growth experiments demonstrated large variability, which is

especially seen in the AAG results, possibly due to minute changes in xylAB expression. Altogether, these results confirmed that the xylose-based genetic circuit was functional and working as expected for SpyCas9 PAM recognition. To completely characterize the assay, individual PAMs were replaced with a five-nucleotide library to evaluate the screen with S. pyogenes Cas9.

3.3.3 SpyCas9 growth-based PAM enrichment data presents non-canonical PAMs

After media optimization and confirmation that the assay selects for PAM recognition, a five-nucleotide library was inserted adjacent to the -35 site of the lacI promoter to analyze SpyCas9. The PAM library was transformed into cells expressing both SpyCas9 and a targeting sgRNA and recovered in LB media. These cells were then washed multiple times with M9 media to remove residual carbon sources, grown in M9 media with glycerol, washed again, and then grown in selective M9 media with D-xylose, with and without IPTG (Figure 3.S3). To further enrich for functional PAMs, a second wash and growth for the IPTG-free sample was performed in D-xylose media, again without IPTG. After D-xylose selection, DNA from both the D-xylose and glycerol M9 minimal medium was isolated, using the glycerol M9 DNA as a negative control for PAM enrichment. The PAM-containing region was amplified from each sample and subsequently deep sequenced. To calculate PAM enrichment, sequenced reads from the D-xylose samples were compared to the non-selective glycerol sample. Upon analysis of these enrichments, we did not observe the expected NGG motif as the most enriched PAM (Figure 3.3, Figure 3.S3). Instead, the most enriched PAM sequences fell under

alternate motifs, many of them containing an NNGG motif, one nucleotide further from the protospacer than the expected NGG. Additionally, the weakly recognized NAG PAM sequence was enriched more than the expected NGG PAM.

Figure 3.3 Sequencing results for SpyCas9 after two rounds of D-xylose selection. The PAM wheel represents all PAMs sequenced from the library. Top PAMs are highlighted with an exterior label; all other PAMs are located within the gray region and can be further explored with the interactive krona.html file. This wheel is read from the center radially outwards, representing the PAM moving away from the protospacer. A sequence logo is presented at the bottom representing nucleotide conservation at each position.

To ensure the observed PAMs were functional, five highly enriched PAMs from the sequencing analysis were cloned back into the original xylAB construct. In line with the sequencing results, the CCGGG, GTGGC, and TAGGG PAMs all demonstrated significantly more growth than the canonical NGG PAM (Figure 3.4A). Similar to the

original assay characterization (Figure 3.2), this growth seemed to be extremely variable between biological replicates. Both the AAGTG and ACGTC PAMs showed no growth enrichment except for a single replicate of ACGTC that was enriched ~2-fold. These same characteristic PAMs were also inserted into the original PAM-SCANR construct with a gfp reporter gene instead of xylAB, and activation of fluorescence was measured. Differing from the growth assay but aligning with previously published results 28,34,40, recognition of the NGG PAM allowed for significantly more expression of GFP than any of the putative PAMs from the screen (Figure 3.4B). However, binding of the CCGGG, GTGGC, and TAGGG PAMs all activated GFP expression, suggesting that SpyCas9 may recognize additional non-canonical PAMs in a bacterial system. These activation results for growth and GFP enrichment were obtained in the absence of IPTG, explaining the lack of AAGTG enrichment, similar to previous results with this construct 29. When GFP enrichment for AAGTG and ACGTC was analyzed with IPTG, AAGTG demonstrated GFP activation, while ACGTC remained inert (Figure 3.S4). This led us to believe that ACGTC was a false positive from the original assay, since it showed great variability in cell growth (Figure 3.4A), but no activity in other experiments (Figure 3.4B-C, Figure 3.S4).


Figure 3.4 Analysis of individual PAMs inserted back into the PAM-SCANR genetic circuit. Standard deviation of each biological triplicate is presented. (A) Statistical significance in comparison to the AGGTG motif (*) was calculated using a 2-tailed t-test with a requirement value of Individual PAMs were inserted back into the original circuit and final OD600 measurements were recorded compared to a non-targeting control to calculate fold enrichment. Enrichment was measured in the absence of IPTG. (B) Statistical significance comparing AGGTG to all other PAMs (**) was calculated using a 2-tailed t-test with a requirement value of Individual PAMs were inserted into the same genetic growth circuit, but with a gfp reporter in place of xylAB. GFP fluorescence was measured in comparison to a non-targeting control to calculate fold enrichment. Enrichment was measured in the absence of IPTG. (C) The same PAM constructs analyzed in 3.4B were targeted with a fully active Cas9. Transformation efficiencies were calculated by transforming in a targeting and non-targeting sgRNA and comparing CFUs.

Previous results have discovered that there are allosteric changes within the SpyCas9 protein that occur during DNA cleavage 41-43, which would not be taken into


account using an assay dependent on deactivated Cas9. To analyze this potential difference in Cas9 PAM recognition, the gfp reporter construct described in Figure 3.4B was targeted with a fully active Cas9. Except for the false positive ACGTC, every PAM could be targeted by active Cas9. These plasmid clearance results further validate the activity of the non-canonical motifs and confirm that there were no observable disconnects between PAM recognition in the assay and dsDNA cleavage. To further validate the activity of these sub-optimal PAMs, each motif was inserted next to a constitutive promoter driving gfp expression on a different plasmid. Binding this novel target sequence with dCas9 would prevent transcription and reduce GFP levels in the cell. S. pyogenes dCas9 was directed to the various PAMs upstream of this target and the extent of transcriptional repression was calculated in comparison to a non-targeting control (Figure 3.5).

Figure 3.5 Transcriptional repression of a novel protospacer. Statistical significance (*, p<0.05, n=4) was calculated comparing AGGTG to all other tested PAMs. PAMs were inserted upstream of a constitutive promoter driving expression of a gfp reporter. SpyCas9 was directed to a novel target sequence adjacent to these PAMs and the fold-repression was calculated in reference to a non-targeting control.
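The fold-repression described in the Figure 3.5 caption can be sketched as a ratio of fluorescence values. This is a minimal illustration, not the analysis code used in this work: it assumes that fold-repression is the non-targeting fluorescence divided by the targeting fluorescence and that replicates are paired by index. The function names and the example values in the usage note are hypothetical.

```python
from statistics import mean, stdev


def fold_repression(non_targeting, targeting):
    """Per-replicate fold-repression (assumed: non-targeting / targeting fluorescence)."""
    if len(non_targeting) != len(targeting):
        raise ValueError("replicate lists must have the same length")
    return [nt / t for nt, t in zip(non_targeting, targeting)]


def summarize(values):
    """Mean and sample standard deviation across biological replicates."""
    return mean(values), stdev(values)
```

For example, `summarize(fold_repression([100.0, 200.0], [10.0, 50.0]))` summarizes two hypothetical replicates with fold-repression values of 10 and 4.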

As presented in Figure 3.5, S. pyogenes dCas9 was able to directly repress gfp expression for the same five characteristic PAMs that demonstrated significant activity in Figure 3.4A-C. Similar to Figure 3.4B-C and previous results 28,34,40, NGG was recognized significantly more than all the other tested sequences. Both NAG motifs were also bound successfully by SpyCas9, and the false positive ACGTC motif did not demonstrate significant transcriptional repression. Interestingly, all three PAMs falling under the NNGG motif were bound successfully by SpyCas9, confirming with a novel target that the three non-canonical sequences are recognized by this ubiquitous protein.

3.4 CONCLUSIONS

Here, we report a novel screen to positively select for CRISPR PAMs using a bacterial growth circuit. Instead of presenting the most functional PAMs, the assay output weakly recognized PAMs for S. pyogenes Cas9. Weak PAMs typically are not used for Cas effector-based tools, but they have important implications in off-target predictions or the determination of available landing spots for Cas proteins. We discovered that the S. pyogenes Cas9 was able to target an expanded set of PAMs, three of which fell under an NNGG motif. Although SpyCas9 showed a weaker affinity for these PAMs in GFP activation and transcriptional repression, their activity was still significant over any non-recognized sequences. A plasmid clearance assay demonstrated that these non-canonical PAM-containing plasmids were cleared, suggesting that there is not a disconnect between the data taken from the screen and

a fully active Cas effector. Together, these results confirm that SpyCas9 recognizes alternative motifs in bacteria, expanding the possible target sites for this system's many applications in bacteria, including high-throughput screening, tunable metabolic regulation, and use as an anti-microbial agent 5,8. Although these PAMs would need to be validated in a eukaryotic host, these results could confound off-target predictions, which are primarily based on the recognition of an NGG motif alone. We designed and characterized a screen that presented weakly active PAMs for the S. pyogenes Cas9, expanding the available target space for this system in a bacterial host. This assay could now be used in conjunction with another PAM screen to fully characterize the PAMs for a CRISPR-Cas system, providing both the highly recognized and the least functional sequences. Although a false positive was observed in our dataset, novel weak PAMs can be rapidly screened for activity using any of these described methods or other high-throughput approaches 47. We hope that this screen can expand the available target space and understanding for all CRISPR-Cas systems.

3.5 EXPERIMENTAL PROCEDURES

3.5.1 Strains, plasmids, oligonucleotides

Table 3.S1 contains all plasmids, strains, and oligonucleotides used in this work. The xylose knockout strain was generated by amplifying the cat cassette from the previously used xylAB P con -xylFGH strain and recombining it into BW25113 ΔCRISPR ΔlacI-lacZ using λ-red mediated recombineering in NM500 as described previously 48. Successful recombinants were screened by colony PCR and sequenced.

The cat resistance in each recombined mutant was excised by FRT-mediated recombination with pCP20 49, resulting in BW25113 ΔCRISPR1 ΔlacI-lacZ xylAB P con -xylFGH. This strain was used for screening and validation for all experimental datasets.

3.5.2 Plasmid generation

The PAM-SCANR2.0 xylose growth circuit (CB618) was created by amplifying xylA through xylB from E. coli K-12 MG1655 using ors17-18, amplifying the original PAM-SCANR backbone with ors19-20 to remove the gfp gene, and stitching the pieces together using Gibson cloning 29. Silent point mutations were introduced to remove two AatII cut sites in xylA and xylB by amplifying with ors21-ors24 and re-circularizing the plasmid with Q5 mutagenesis. To remain consistent with previous work, the AGG and AAG PAMs were inserted into the PAM-SCANR_xylAB growth circuit (CB621 and CB622, respectively) by primer dimer PCR with ors32-ors33, followed by digestion of the PCR product and backbone with AatII and XhoI and ligation with T4 DNA ligase (NEB CN# M0202S). The five-nucleotide PAM library (CB622) was inserted into the plasmid by amplifying CB618 with ors7-ors8 and re-circularizing the plasmid with Q5 mutagenesis (NEB CN# E0554S). Additional PAMs (CB625-CB629) were inserted into the PAM-SCANR_xylAB growth circuit by amplifying with ors1-ors6 and re-circularizing with Q5 mutagenesis. PAMs were cloned into the PAM-SCANR gfp-based circuit (CB630-CB634) by amplifying the PAM-SCANR GFP circuit (CB423) with ors1-ors6 and re-circularizing with Q5

mutagenesis. PAMs were cloned into the direct repression GFP circuit (CB463) by amplifying with ors9-ors16 and re-circularizing with Q5 mutagenesis. The sgRNA targeting the direct repression circuit (CB643) was created by phosphorylating and annealing primers ors25-ors26, digesting the sgRNA scaffold (CB454) with KpnI and SpeI 50, and stitching the pieces together with T4 DNA ligase. The single plasmid expression circuit for SpyCas9 and the PAM-SCANR targeting sgRNA was assembled by amplifying the sgRNA (CB455) with ors29-30, amplifying the SpyCas9 expressing backbone with ors27-28, and re-circularizing the plasmid by Gibson assembly. The non-targeting sgRNA control plasmid was created in the same manner, by amplifying the unrecognized S. thermophilus targeting sgRNA (CB450) and inserting it into the SpyCas9 expressing backbone (CB625).

3.5.3 Growth conditions

All E. coli strains used for data generation were derivatives of E. coli BW They were cultured at 37 °C and 250 RPM in LB medium (10 g/l NaCl, 5 g/l yeast extract, 10 g/l tryptone) in 14-mL round-bottom polypropylene tubes in volumes of 5 mL or less. Larger volumes were grown in 125 mL Erlenmeyer glass flasks with baffles. M9 minimal media was used for all growth assays, and specific concentrations of components are defined in the main text. Base M9 minimal media for washing cells consisted of 1X M9 salts, 2 mM MgSO4, 0.1 mM CaCl2, µg/ml thiamine. Plasmids were maintained with ampicillin, chloramphenicol, and/or kanamycin in concentrations of 50 µg/ml, 34 µg/ml, and 50 µg/ml, respectively, on both solid and liquid medium.

3.5.4 Flow cytometry

For GFP activation, single colonies were inoculated and cultured overnight in base M9 media. Transcriptional repression was performed in LB medium. The next morning, the ABS600 was measured on a NanoDrop 2000c Spectrophotometer. Cells were diluted to an ABS600 of 0.01 and grown for ~4.5 hours, reaching an ABS600 of ~0.2. Cultures were then diluted 1:50 in 1X phosphate buffered saline (PBS) and analyzed on an Accuri C6 Flow Cytometer (Becton Dickinson) equipped with a CFlow plate sampler, a 488-nm laser, and a 530 ± 15 nm bandpass filter. Forward scatter (cut-off of 18,500) and side scatter (cut-off of 600) were used to cut out non-cellular events and a gate was set for E. coli cells based on previous work 30. Fluorescence of gated events was recorded using FL1-H, with a minimum number of 30,000 collected events for data analysis.

3.5.5 Xylose growth assay

Single colonies were inoculated overnight in LB media with appropriate chloramphenicol and kanamycin antibiotics. The following day, 1.5 mL of cells were spun down and washed three times with 1 mL of 1X PBS. After washing, cells were resuspended in 1 mL of M9 base media (M9 salts with full concentrations of glycerol and casamino acids). The ABS600 of the culture was measured and these cells were back-diluted into an overnight base M9 culture to an ABS600 of The following day, these cells were again washed three times in 1X PBS. Again, the cells were

overnighted at an ABS600 of in M9 media instead containing 0.002% glycerol, 0.02% casamino acids, and 0.2% xylose. The final ABS600 of this culture was then measured the following morning. To enrich for functional PAMs within the nucleotide library, the same culture conditions were used. Instead of single colonies, the nucleotide library was freshly transformed into the host strain expressing pspy_dcas9_spysgrna_pamscanr and recovered in LB media with kanamycin and chloramphenicol to ensure the entire library diversity was preserved. The following two days of washing proceeded in the exact same manner as previously described, except all cultures were grown in 25 mL culture flasks to ensure there were enough cells for freezer stocks. After the cells were grown in base M9 (with full concentration of glycerol and casamino acids), DNA was isolated from a fraction of the cells and saved for sequencing to determine the base library diversity prior to PAM enrichment. Growth enrichment was then performed with M9+xylose, and enriched PAMs were isolated from the grown cells.

3.5.6 Deep sequencing

DNA was prepared for deep sequencing by first amplifying a 134 base-pair region from the reporter plasmid isolated from pre- and post-xylose cultures. These oligos contained the necessary adapters for analysis by Illumina MiSeq:

Fwd 5'-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-3'
Rev 5'-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG-3'

Once amplified, the DNA fragments were purified with AMPure XP magnetic beads (Beckman Coulter). A 0.9X bead purification was used followed by two 80% ethanol

washes. After washing, the beads were dried and the DNA was eluted in twice the original volume with 10 mM Tris-HCl. These amplified DNA fragments were then indexed using an 8-cycle PCR. Nextera indices 501, 502, 707, 708, and 709 were used in this work and are encoded within ors Once amplified, the PCR products were purified with AMPure XP magnetic beads similarly to the first PCR, but instead eluting in the original PCR volume. These samples were quantified on a 2000c NanoDrop spectrophotometer and by agarose gel electrophoresis. DNA was then diluted to 10 nM and samples with separate indices were pooled to a total volume of 20 μl. Samples were run on the 150-bp single end read kit on an Illumina MiSeq. Every sample generated at least 550,000 reads, although most samples generated more than 1 million reads.

3.5.7 PAM representation

A detailed procedure for generating Krona plots 51 can be found in the Detailed Protocol document of our previous manuscript 29. In short, the PAM counts were inserted into the Krona Microsoft Excel template document, PAMs were separated into individual nucleotides in every column of the Excel sheet, and the corresponding enrichment was used to generate the plot. The sequence logos were generated by pasting the PAM counts into the Weblogo software (
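The enrichment and Krona layout steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this work: enrichment is taken to be the ratio of each PAM's read frequency in the D-xylose sample to its frequency in the glycerol control, the pseudocount is an added safeguard against PAMs missing from one sample, and the row layout (value first, then one nucleotide per column) is an assumed arrangement for a Krona-style spreadsheet. All function and parameter names are hypothetical.

```python
def pam_enrichment(selected, control, pseudocount=1.0):
    """Frequency ratio (selected / control) for every PAM seen in either sample.

    selected, control: dicts mapping PAM sequence -> read count.
    pseudocount: added to every count so unseen PAMs do not divide by zero
    (an assumption of this sketch, not a step stated in the text).
    """
    pams = set(selected) | set(control)
    sel_total = sum(selected.values()) + pseudocount * len(pams)
    ctl_total = sum(control.values()) + pseudocount * len(pams)
    enrichment = {}
    for pam in pams:
        sel_freq = (selected.get(pam, 0) + pseudocount) / sel_total
        ctl_freq = (control.get(pam, 0) + pseudocount) / ctl_total
        enrichment[pam] = sel_freq / ctl_freq
    return enrichment


def krona_rows(enrichment):
    """Rows sorted by descending enrichment: [value, nt1, nt2, ...].

    nt1 is the first PAM position, so the columns follow the wheel from the
    center outwards, moving away from the protospacer.
    """
    ranked = sorted(enrichment.items(), key=lambda item: item[1], reverse=True)
    return [[value, *pam] for pam, value in ranked]
```

With hypothetical counts such as `{"AGGTG": 30, "ACGTC": 10}` after selection and `{"AGGTG": 10, "ACGTC": 30}` in the control, `krona_rows(pam_enrichment(...))` lists AGGTG first, with its five nucleotides in separate columns.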

3.5.8 Plasmid clearance experiments

The pspy_cas9 plasmid was used to perform plasmid clearance on the ppamscanr (CB630-CB634) plasmids containing various PAMs, based on previous work 50. In short, strains containing both pspy_cas9 (CB339) and the reporter plasmids were electroporated with 50 ng of targeting plasmid (pspy_sgrna_pamscanr) and the non-targeting psth1_sgrna_pamscanr (pcb455 and pcb449, respectively). Following transformation, cells were recovered in SOC medium and plated on LB agar containing all three antibiotics. The fold-reduction in the transformation efficiency was calculated as the ratio of the number of transformants for the non-targeting plasmid divided by that of the targeting plasmid.
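The fold-reduction defined above is a simple ratio of colony counts. The sketch below is illustrative only; the function name is hypothetical, and returning infinity when no targeting transformants are recovered is an assumption of this sketch rather than a rule stated in the text (in practice a detection limit would be reported instead).

```python
import math


def fold_reduction(non_targeting_cfu, targeting_cfu):
    """Transformants with the non-targeting plasmid divided by those with the targeting plasmid."""
    if non_targeting_cfu < 0 or targeting_cfu < 0:
        raise ValueError("colony counts cannot be negative")
    if targeting_cfu == 0:
        # No colonies recovered with the targeting plasmid: clearance exceeds
        # what the plate can resolve (assumed handling, see lead-in).
        return math.inf
    return non_targeting_cfu / targeting_cfu
```

For example, 5000 non-targeting colonies against 50 targeting colonies gives a 100-fold reduction.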

REFERENCES

1. Barrangou, R. et al. CRISPR provides acquired resistance against viruses in prokaryotes. Science 315, (2007).
2. Barrangou, R. & Doudna, J. A. Applications of CRISPR technologies in research and beyond. Nat. Biotechnol. 34, (2016).
3. Hsu, P. D., Lander, E. S. & Zhang, F. Development and applications of CRISPR-Cas9 for genome engineering. Cell 157, (2014).
4. van der Oost, J., Westra, E. R., Jackson, R. N. & Wiedenheft, B. Unravelling the structural and mechanistic basis of CRISPR-Cas systems. Nat. Rev. Microbiol. 12, (2014).
5. Bikard, D. et al. Exploiting CRISPR-Cas nucleases to produce sequence-specific antimicrobials. Nat. Biotechnol. 32, (2014).
6. Citorik, R. J., Mimee, M. & Lu, T. K. Sequence-specific antimicrobials using efficiently delivered RNA-guided nucleases. Nat. Biotechnol. 32, (2014).
7. Cong, L. et al. Multiplex genome engineering using CRISPR/Cas systems. Science 339, (2013).
8. Gomaa, A. A. et al. Programmable removal of bacterial strains by use of genome-targeting CRISPR-Cas systems. MBio 5, (2014).
9. Hilton, I. B. et al. Epigenome editing by a CRISPR/Cas9-based acetyltransferase activates genes from promoters and enhancers. Nat. Biotechnol. 33, (2015).
10. Mali, P. et al. RNA-guided human genome engineering via Cas9. Science 339, (2013).
11. Koonin, E. V., Makarova, K. S. & Zhang, F. Diversity, classification and evolution of CRISPR-Cas systems. Curr. Opin. Microbiol. 37, (2017).
12. Leenay, R. T. & Beisel, C. L. Deciphering, communicating, and engineering the CRISPR PAM. J. Mol. Biol. 429, (2017).
13. Brouns, S. J. J. et al. Small CRISPR RNAs guide antiviral defense in prokaryotes. Science 321, (2008).
14. Marraffini, L. A. & Sontheimer, E. J. CRISPR interference limits horizontal gene transfer in Staphylococci by targeting DNA. Science 322, (2008).
15. Garneau, J. E. et al. The CRISPR/Cas bacterial immune system cleaves bacteriophage and plasmid DNA. Nature 468, (2010).
16. Biswas, A., Gagnon, J. N., Brouns, S. J. J., Fineran, P. C. & Brown, C. M. CRISPRTarget: Bioinformatic prediction and analysis of crRNA targets. RNA Biol. 10, (2013).
17. Shmakov, S. et al. Diversity and evolution of class 2 CRISPR-Cas systems. Nat. Rev. Microbiol. (2017). doi: /nrmicro
18. Deveau, H. et al. Phage response to CRISPR-encoded resistance in Streptococcus thermophilus. J. Bacteriol. 190, (2008).
19. Heler, R. et al. Cas9 specifies functional viral targets during CRISPR-Cas adaptation. Nature 519, 1-16 (2015).
20. Horvath, P. et al. Diversity, activity, and evolution of CRISPR loci in Streptococcus thermophilus. J. Bacteriol. 190, (2008).
21. Mojica, F. J. M., Diez-Villasenor, C., Garcia-Martinez, J. & Almendros, C. Short motif sequences determine the targets of the prokaryotic CRISPR defence system. Microbiology 155, (2009).
22. Marraffini, L. A. & Sontheimer, E. J. Self versus non-self discrimination during CRISPR RNA-directed immunity. Nature 463, (2010).
23. Semenova, E. et al. Interference by clustered regularly interspaced short palindromic repeat (CRISPR) RNA is governed by a seed sequence. Proc. Natl. Acad. Sci. U. S. A. 108, (2011).
24. Sternberg, S. H., Redding, S., Jinek, M., Greene, E. C. & Doudna, J. DNA interrogation by the CRISPR RNA-guided endonuclease Cas9. Nature 507, (2014).
25. Westra, E. R. et al. CRISPR immunity relies on the consecutive binding and degradation of negatively supercoiled invader DNA by Cascade and Cas3. Mol. Cell 46, (2012).
26. Briner, A. E. & Barrangou, R. Deciphering and shaping bacterial diversity through CRISPR. Curr. Opin. Microbiol. 31, (2016).
27. Paez-Espino, D. et al. Strong bias in the bacterial CRISPR elements that confer immunity to phage. Nat. Commun. 4, (2013).
28. Karvelis, T. et al. Rapid characterization of CRISPR-Cas9 protospacer adjacent motif sequence elements. Genome Biol. 16, 253 (2015).
29. Leenay, R. T. et al. Identifying and visualizing functional PAM diversity across CRISPR-Cas systems. Mol. Cell 62, (2016).
30. Luo, M. L., Mullis, A. S., Leenay, R. T. & Beisel, C. L. Repurposing endogenous type I CRISPR-Cas systems for programmable gene repression. Nucleic Acids Res. 43, (2014).
31. Bikard, D. & Marraffini, L. A. Control of gene expression by CRISPR-Cas systems. F1000Prime Rep. 5, (2013).
32. Bikard, D. et al. Programmable repression and activation of bacterial gene expression using an engineered CRISPR-Cas system. Nucleic Acids Res. 41, (2013).
33. Qi, L. S. et al. Repurposing CRISPR as an RNA-guided platform for sequence-specific control of gene expression. Cell 152, (2013).
34. Jiang, W., Bikard, D., Cox, D., Zhang, F. & Marraffini, L. A. RNA-guided editing of bacterial genomes using CRISPR-Cas systems. Nat. Biotechnol. 31, (2013).
35. Afroz, T., Biliouris, K., Kaznessis, Y. & Beisel, C. L. Bacterial sugar utilization gives rise to distinct single-cell behaviors. Mol. Microbiol. 93, 1-11 (2014).
36. Rosenfeld, S. A., Stevis, P. E. & Ho, N. W. Y. Cloning and characterization of the xyl genes from Escherichia coli. Mol. Gen. Genet. 194, (1984).
37. Lawlis, V. B., Dennis, M. S., Chen, E. Y., Smith, D. H. & Henner, D. J. Cloning and sequencing of the xylose isomerase and xylulose kinase genes of Escherichia coli. Appl. Environ. Microbiol. 47, (1984).
38. Deltcheva, E. et al. CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III. Nature 471, (2011).
39. Jinek, M. et al. A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science 337, (2012).
40. Pattanayak, V. et al. High-throughput profiling of off-target DNA cleavage reveals RNA-programmed Cas9 nuclease specificity. Nat. Biotechnol. 31, (2013).
41. Anders, C., Niewoehner, O., Duerst, A. & Jinek, M. Structural basis of PAM-dependent target DNA recognition by the Cas9 endonuclease. Nature 513, (2014).
42. Sternberg, S. H., LaFrance, B., Kaplan, M. & Doudna, J. A. Conformational control of DNA target cleavage by CRISPR-Cas9. Nature 527, (2015).
43. Palermo, G., Miao, Y., Walker, R. C., Jinek, M. & McCammon, J. A. CRISPR-Cas9 conformational activation as elucidated from enhanced molecular simulations. (2017).
44. Marshall, R. et al. Rapid and scalable characterization of CRISPR technologies using an E. coli cell-free transcription-translation system. BioRxiv 1-32 (2017).
45. Court, D. L. et al. Mini-λ: A tractable system for chromosome and BAC engineering. Gene 315, (2003).
46. Cherepanov, P. P. & Wackernagel, W. Gene disruption in Escherichia coli: TcR and KmR cassettes with the option of Flp-catalyzed excision of the antibiotic-resistance determinant. Gene 158, 9-14 (1995).
47. Briner, A. E. et al. Guide RNA functional modules direct Cas9 activity and orthogonality. Mol. Cell 56, (2014).
48. Ondov, B. D., Bergman, N. H. & Phillippy, A. M. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics 12, (2011).

SUPPLEMENTARY INFORMATION

Table 3.S1: Strains, plasmids, and oligonucleotides used in this work.

Stock Name | Strain | Shorthand | Source
pcb545 | E. coli NovaBlue | NovaBlue | Novagen CN#
pcb406 | lacIq rrnBT14 ΔlacZWJ16 hsdR514 ΔaraBADAH33 ΔrhaBADLD78 | BW25113 | E. coli genetic stock center (CGSC#: 7636)
pcb414 | BW25113 ΔCRISPR ΔLacI-LacZ | BW25113_CRISPRKO_LacKO | Leenay et al. 2016
pcb615 | MG1655 ΔxylAB ΔCM PconXylFGH | MG1655_XylKO | Afroz et al. 2015
pcb623 | BW25113 ΔCRISPR ΔLacI-LacZ xylAB::Pcon-xylFGH | BW25113_XylKO_CRISPRLacKO | This study

Plasmid | Description | Resistance | Source | Stock
ppam-scanr | placi-laci-lacilacz upstream of GFP | Kan | Leenay et al 2015 | pcb423
ppam-scanr_pspyaag | Cloned AAG PAM into CB423 | Kan | Leenay et al 2015 | pcb456
ppam-scanr_pspyagg | Cloned AGG PAM into CB423 | Kan | Leenay et al 2015 | pcb457
ppam-scanr_pspytaggg | Cloned TAGGG PAM into CB423 | Kan | This Study | pcb634
ppam-scanr_pspyccggg | Cloned CCGGG PAM into CB423 | Kan | This Study | pcb633
ppam-scanr_pspygagcg | Cloned GAGCG PAM into CB423 | Kan | This Study | pcb632
ppam-scanr_pspyacgtc | Cloned ACGTC PAM into CB423 | Kan | This Study | pcb631
ppam-scanr_pspygtggc | Cloned GTGGC PAM into CB423 | Kan | This Study | pcb630
pspy_sgrna_pamscanr | Spy sgRNA targeting the PAMSCANR construct | Amp | Leenay et al 2015 | pcb455
psth1_sgrna_pamscanr | Sth1 sgRNA targeting the PAMSCANR construct | Amp | Leenay et al 2015 | pcb450
pspy_cas9 | Streptococcus pyogenes Cas9 protein | Cm | Addgene #42876 | pcb339
pspy_dcas9_spysgrna_pamscanr | Deactivated Streptococcus pyogenes Cas9 protein and Spy sgRNA targeting the PAMSCANR plasmid | Cm | This Study | pcb624
pspy_dcas9_sth1sgrna_pamscanr | Deactivated Streptococcus pyogenes Cas9 protein and Sth1 sgRNA targeting the PAMSCANR plasmid | Cm | This Study | pcb625
pspy_sgrna | Base construct for Spy sgRNAs | Amp | Briner et al. 2014 | pcb454
placz_pua66 | LacZ promoter inserted upstream of GFP | Kan | Zaslaver et al | pcb463
pspy_sgrna_placzpua66 | Spy sgRNA targeting the LacZ construct | Amp | This Study | pcb643
psth1_sgrna | Base construct for Sth1 sgRNAs | Amp | Briner et al. 2014 | pcb449
pspy_dcas9 | Deactivated Streptococcus pyogenes Cas9 protein | Cm | Leenay et al 2015 | pcb453
placzpua66_pspyccggg | Cloned CCGGG PAM into CB463 | Kan | This Study | pcb641
placzpua66_pspytaggg | Cloned TAGGG PAM into CB463 | Kan | This Study | pcb635
placzpua66_pspygagcg | Cloned GAGCG PAM into CB463 | Kan | This Study | pcb642
placzpua66_pspyacgtc | Cloned ACGTC PAM into CB463 | Kan | This Study | pcb636
placzpua66_pspygtggc | Cloned GTGGC PAM into CB463 | Kan | This Study | pcb638
placzpua66_pspyaggtg | Cloned AGGTG PAM into CB463 | Kan | This Study | pcb637
placzpua66_pspyaagtg | Cloned AAGTG PAM into CB463 | Kan | This Study | pcb639
ppamscanr_xylose | placi-laci-lacilacz upstream of xylAB | Kan | This Study | pcb618
ppamscanr_xylose_pspytaggg | Cloned TAGGG PAM into pcb618 | Kan | This Study | pcb629
ppamscanr_xylose_pspyccggg | Cloned CCGGG PAM into pcb618 | Kan | This Study | pcb628
ppamscanr_xylose_pspygagcg | Cloned GAGCG PAM into pcb618 | Kan | This Study | pcb627
ppamscanr_xylose_pspyacgtc | Cloned ACGTC PAM into pcb618 | Kan | This Study | pcb626
ppamscanr_xylose_pspygtggc | Cloned GTGGC PAM into pcb618 | Kan | This Study | pcb625
ppamscanr_xylose_pspyaggtg | Cloned AGGTG PAM into pcb618 | Kan | This Study | pcb621
ppamscanr_xylose_pspyaagtg | Cloned AAGTG PAM into pcb618 | Kan | This Study | pcb620
ppamscanr_xylose_5n | Cloned 5 nucleotide PAM library (NNNNN) into pcb618 | Kan | This Study | pcb622
ppam_amp_laczstop | placi-laci-lacilacz upstream of ampicillin | Tet, Amp | This Study | pcb

Table 3.S1 Continued

Shorthand  Name  Sequence
ors1  Spy.q5.cccta.f  TTTCGTCTTCccctaGTCGAGTGCAAAACC
ors2  Spy.q5.cccgg.f  TTTCGTCTTCcccggGTCGAGTGCAAAACC
ors3  Spy.q5.cgctc.f  TTTCGTCTTCcgctcGTCGAGTGCAAAACC
ors4  Spy.q5.gacgt.f  TTTCGTCTTCgacgtGTCGAGTGCAAAACC
ors5  Spy.q5.gccac.f  TTTCGTCTTCgccacGTCGAGTGCAAAACC
ors6  Spy.q5.rev  gggcctcgacgcctattt
ors7  RAS.5nt.q5fwd  TTTCGTCTTCnnnnnGTCGAGTGCAAAACCTTTCGCGGTATG
ors8  RAS.5nt.q5rev  gggcctcgacgcctattt
ors9  spy.dr.taggg.fwd  ATTAGGCACCccctaCTTTACACTTTATG
ors10  spy.dr.ccggg.fwd  ATTAGGCACCcccggCTTTACACTT
ors11  spy.dr.aggtg.fwd  ATTAGGCACCcacctCTTTACACTTTATGC
ors12  spy.dr.gtggc.fwd  ATTAGGCACCgccacCTTTACACTTTATGCTTCC
ors13  spy.dr.aagtg.fwd  ATTAGGCACCcacttCTTTACACTTTATGC
ors14  spy.dr.cgctc.fwd  ATTAGGCACCcgctcCTTTACACTTTATGC
ors15  spy.dr.acgtc.f  ATTAGGCACCgacgtCTTTACACTTTATG
ors16  spy.pam.rev  gagtgagctaactcacattaattg
ors17  PS2.Xyl.rev  CATGCCTGCAGTCTGGACATTTACGCCATTAATGGCAG
ors18  PS2.Xyl.fwd  ACTCGCTCACATTTAATTAGGGAGTTCAATATGCAAGC
ors19  PS2.BB.fwd  ATGTCCAGACTGCAGGCA
ors20  PS2.BB.rev  CTAATTAAATGTGAGCGAGTAACAAC
ors21  xyloseq5.1.fwd  actggagtgatgtcatgctgc
ors22  xyloseq5.1.rev  cacgctttgcgacatcca
ors23  xyloseq5.2.fwd  cgctggggacctcgggggtct
ors24  xyloseq5.2.rev  ataacattgcctgattagcatcaacc
ors25  spyannealdr.f  ctagtcggaagcataaagtgtaaaggttttagagctaggtac
ors26  spyannealdr.r  ctagctctaaaacctttacactttatgcttccga
ors27  Spy.gibson.BB.fwd  CGCCGGACGCATCGTGGC
ors28  Spy.gibson.BB.rev  TTATCATCGATCTGACAGC
ors29  Spy.gibson.gBlock.fwd  acgatgcgtccggcggaattctaaagatctctgacagc
ors30  Spy.gibson.gBlock.rev  ctgtcagatcgatgataaggcgctattcagatcctc
ors31  Sp.Dimer.fwd  ATATGACGTCCGTTCATTAAAAATTGAATTGACATTAACCTATAAAAATAGGCGT
ors32  Sp.AGG.dimer.rev  AAATGTCGACAGGTGAAGACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA
ors33  Sp.AAG.dimer.rev  AAATGTCGACAAGTGAAGACGAAAGGGCCTCGACGCCTATTTTTATAGGTTAATGTCA
ors34  Library.PCR1.Fwd  TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGttcattaaaaattgaattgacattaacct
ors35  Library.PCR1.Rev  GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGattcaccaccctgaattgact
ors36  Index.501.fwd  AATGATACGGCGACCACCGAGATCTACACtagatcgcTCGTCGGCAGCGTC
ors37  Index.502.fwd  AATGATACGGCGACCACCGAGATCTACACctctctatTCGTCGGCAGCGTC
ors38  Index.707.Rev  CAAGCAGAAGACGGCATACGAGATgtagagagGTCTCGTGGGCTCGG
ors39  Index.708.Rev  CAAGCAGAAGACGGCATACGAGATcctctctgGTCTCGTGGGCTCGG
ors40  Index.709.Rev  CAAGCAGAAGACGGCATACGAGATagcgtagcGTCTCGTGGGCTCGG
ors41
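In the spy.dr.*.fwd primers listed in Table 3.S1, the lowercase five-base insert appears to be the reverse complement of the PAM given in the primer name. The short Python sketch below checks that reading for five of the primers. The sequences are copied from the table; the helper names (`reverse_complement`, `lowercase_insert`, `encoded_pam`) are hypothetical and introduced only for illustration, and the interpretation of the lowercase bases as the PAM insert is an assumption.

```python
# Sketch only: primer names and sequences are copied from Table 3.S1.
# Helper names are hypothetical; reading the lowercase run as the PAM
# insert is an assumption, not something stated in the table itself.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence, preserving case."""
    return seq.translate(COMPLEMENT)[::-1]


def lowercase_insert(primer):
    """Collect the lowercase bases embedded in an otherwise uppercase primer."""
    return "".join(base for base in primer if base.islower())


def encoded_pam(primer):
    """PAM read on the opposite strand from the primer's lowercase insert."""
    return reverse_complement(lowercase_insert(primer)).upper()


PRIMERS = {
    "spy.dr.taggg.fwd": "ATTAGGCACCccctaCTTTACACTTTATG",
    "spy.dr.ccggg.fwd": "ATTAGGCACCcccggCTTTACACTT",
    "spy.dr.aggtg.fwd": "ATTAGGCACCcacctCTTTACACTTTATGC",
    "spy.dr.gtggc.fwd": "ATTAGGCACCgccacCTTTACACTTTATGCTTCC",
    "spy.dr.aagtg.fwd": "ATTAGGCACCcacttCTTTACACTTTATGC",
}

for name, seq in PRIMERS.items():
    named_pam = name.split(".")[2].upper()  # e.g. "TAGGG" from the primer name
    print(name, encoded_pam(seq), encoded_pam(seq) == named_pam)
```

A check of this kind can catch a mistyped insert before a primer is ordered, since a mismatch between the name and the encoded PAM shows up immediately.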

Figure 3.S1 Testing a growth-based PAM circuit with ampicillin antibiotic resistance. (A) Genetic circuit used for ampicillin selection. GFP from PAM-SCANR was replaced with a bla resistance gene and tested using the synthetic inducer IPTG. Cells were challenged with a gradient of ampicillin selection. (B) Growth results from ampicillin-based selection. IPTG did not improve growth over a no-IPTG control, preventing this construct from being used for a growth-based PAM assay.


Figure 3.S2 Media optimization with supplemental carbon sources. (A) Optimization of supplemental glycerol. IPTG was used to mimic a CRISPR-Cas system by binding the LacI complex. The typical concentration of 0.4% showed no selectable difference in growth between IPTG and a no-IPTG control. A concentration of 0.002% was selected as optimal to minimize the growth of the negative cells. (B) Optimization of supplemental casamino acids (CasAA). The standard concentration of 0.2% gave a small selectable difference, as xylose was still the preferred carbon source over CasAA. The 0.02% dilution was selected to enrich for positive growth without allowing the negative growth to flourish.


Figure 3.S3 PAM wheels for each library enrichment tested. (A) Xylose selection was performed for 16 hours in the absence of IPTG before library isolation. (B) Xylose selection was performed for 16 hours with 10 µM IPTG before library isolation. (C) Xylose selection was performed for 16 hours, washed, and then performed for an additional 16 hours, all in the absence of IPTG.


Figure 3.S4 GFP fold enrichment with IPTG for AAGTG and ACGTC PAMs. The experiment in Figure 3.4B was repeated with IPTG added to the medium. In line with previous results 27, AAGTG demonstrated significant GFP enrichment when IPTG was added. ACGTC did not show significant enrichment of GFP with or without IPTG, supporting the conclusion that it was a false positive from the xylose-based assay.

CHAPTER 4

Advancing tools for genome editing in Lactobacillus plantarum for in vivo gut studies

Ryan T. Leenay 1, Malay Shah 1, Maria Martino 2, Francois Leulier 2, and Chase L. Beisel 1

1 Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh NC
2 Institut de Génomique Fonctionnelle de Lyon, Université de Lyon, Lyon, France

ABSTRACT

Lactobacillus is a widely studied genus that has been used in both the food and probiotic industries, and has been shown to combat some gut-residing infections. Recent studies have discovered that specific strains of Lactobacillus plantarum provide a growth-promoting benefit to their host when ingested into the gut. It was also discovered that a single mutation to an acetate kinase gene further increased this benefit to a Drosophila host, which prompted a study to revert the mutation to confirm the phenotype. In this work, we report a tool development pipeline for bacterial strains that are not as tractable as their well-characterized relatives. Genome editing tools were imported, characterized, and improved for the gut-residing NIZO21 strain, and its acetate kinase mutation was successfully reverted to wild-type. The genomic reversion presented a clear phenotype, mimicking the wild-type strain. The methods and tools presented here lay the foundation for future studies in this impactful gut-residing strain of Lactobacillus.

4.1 INTRODUCTION

Lactobacillus as an impactful gut microbe

Lactobacillus is a widely studied genus due to its prevalent use in the fermentation of food products. The microbes are gram-positive, rod-shaped, and typically facultative anaerobes that are naturally found in multiple communities within the human microbiome 1-3. Another defining characteristic of Lactobacillus is its natural ability to produce lactic acid. Studies have shown that these bacteria are useful probiotics, being found in many varieties of yogurt 4,5, and recent work has demonstrated their efficacy against a variety of infections 6,7. Lactobacillus plantarum is a species within the Lactobacillus genus that has been used as a human probiotic, mostly to combat gut-residing infections 7,8. Interestingly, specific strains of gut-residing L. plantarum were recently discovered to provide a distinct growth phenotype in infant mice 9. L. plantarum strains WJL and NIZO2877 provided a significant fitness benefit to the mouse, which was observed in both weight gain and body length compared to the germ-free control. This phenotype was strain dependent within the L. plantarum species, with WJL showing a greater benefit than NIZO2877. The growth phenotype was further accentuated when the mouse was raised on a depleted diet that was low in proteins, fats, and vitamins. Leulier et al. hypothesized a mechanism in which gut-residing L. plantarum enhanced the sensitivity of the host to growth hormone, affecting its growth rate and weight gain.


Figure 4.1 Lactobacillus plantarum mutation after directed evolution. Faster growing Drosophila melanogaster larvae were selected, and their gut bacteria were isolated and sequenced. An in-frame deletion within the acetate kinase (ackA) gene was found in this faster growth-promoting mutant. Gut passaging of Lactobacillus was performed by M. Martino in the Leulier lab.

To further investigate the microbe-host growth phenotype, the Leulier lab performed a directed evolution study within Drosophila melanogaster larvae 10,11 (Figure 4.1). M. Martino chose L. plantarum NIZO21 as the base strain for evolution, as it provided a minimal growth promotion that was inferior to the effect provided by L. plantarum WJL. This gave NIZO21 room for phenotypic improvement. To perform the evolution experiment, M. Martino placed L. plantarum NIZO21 in the Drosophila nutrient-limiting feed and passaged it through the host's gut. The fastest growing larvae were removed from the mixed population, and microbes from their intestines were then isolated and re-passaged multiple times in the same manner. After genomic sequencing of these passaged L. plantarum strains, M. Martino discovered a single in-frame deletion in the acetate kinase gene (ackA) in a strain isolated from these faster growing Drosophila larvae. The resulting L. plantarum NIZO21 strain was named G2, or generation 2, as the mutation arose after two passages through the Drosophila intestine. To investigate this unique phenotype, they sought to synthetically revert this ackA mutation from NIZO21.G2 back to the wild-type NIZO21 sequence. This reverted strain could then be directly compared to the wild-type by

measuring their growth benefit to the host, providing insight into the importance of the acetate kinase mutation.

Genome editing tools in Lactobacillus plantarum

The Lactobacillus genus has a number of established genetic tools in place, but efficiencies have depended strongly on the specific species 2. Electroporation protocols have been developed, and a number of plasmids have been successfully engineered and utilized. Together, these provide the baseline requirements for the use of advanced genetic tools such as genome editing. Specific mutations to the Lactobacillus genome have been successfully made by integration cassettes, although these typically leave an antibiotic resistance marker or genomic scar site 19,20. More recently, scar-less editing has been established in Lactobacillus by electroporating cells with a targeting oligonucleotide that incorporates into the genome with the help of an inducible single-stranded DNA recombinase 12. Although efficiencies were significantly lower than in other species, L. plantarum demonstrated the ability to recombine a targeted oligonucleotide into its genome. The success rate in L. plantarum was multiple orders of magnitude lower than in the most tractable species, L. reuteri, which had an average mutation rate of ~1% 12. To address this low efficiency, CRISPR-Cas systems were implemented to select against the non-mutated strains through self-targeting cell death 21. Clustered regularly interspaced short palindromic repeats (CRISPR)-Cas (CRISPR-associated) systems naturally exist as adaptive prokaryotic immune systems that have been engineered into genome editing tools. Their simplicity has

driven their prevalent usage, where the guide RNA (gRNA) sequence simply needs to be complementary to virtually any nucleic acid target. Upon successful base pairing, the Cas proteins then perform the cleavage. The only rule for target selection is that the target sequence be flanked by a protospacer-adjacent motif (PAM), which for the ubiquitous SpCas9 is the commonly occurring NGG motif 23,28. For bacterial genome editing, Cas9 is designed to self-target the genome and cleave any wild-type, un-mutated genomic sequence, causing cell death and allowing only the desired mutants to survive 23. Both single- and double-stranded DNA have been used to create genomic mutations, either transformed simultaneously or contained on a plasmid 23,29,30. To perform CRISPR-Cas9 genome editing in Lactobacillus reuteri, Cas9 was used in combination with the same oligo-based recombination performed previously 12,21. This method was used to edit multiple targets with a significantly higher success rate than with oligonucleotides alone. We thus set out to utilize the same method in L. plantarum to further investigate the genomic mutations observed after Drosophila passaging.

4.2 RESULTS

Plasmid generation

As a first step, oligonucleotide recombination was attempted with the inducible RecT protein used previously in L. plantarum 12. The plasmid was transformed into two characteristic L. plantarum strains: the highly tractable WCFS1 strain for validation and the NIZO21 strain used for in vivo gut studies. Unfortunately, after electroporation of a targeted oligonucleotide, we observed no measurable recombination activity. Due


to the wide usage of the RecT protein, we hypothesized that the sakacin pathway was not functioning in our specific strains of Lactobacillus, so we instead searched for an alternative inducible system 12. One of the most prevalent inducible promoters in Lactobacillus is based on the nisin antibacterial peptide, where the nisin peptide binds to the NisR and NisK proteins, forming an activation complex 13,15. This complex subsequently binds the nisin promoter, activating transcription. This system was partially available on two plasmids, pJP005 and pMSP3545, so the components were combined into a single inducible plasmid that activates recT expression 12,15, removing the need for NisR and NisK to be present on the genome or on a second plasmid 21 (Figure 4.2).

Figure 4.2 Plasmid generation for oligo-based genome engineering in Lactobacillus. (A) Generation of a nisin-inducible system to express the RecT recombinase. All cloning was performed in L. plantarum WCFS1. (B) Construction of a SpCas9 system that replicates in E. coli and Lactobacillus. Two sequential steps were used: first a Cas9-only plasmid was generated, and then a Lactobacillus-targeting gRNA was introduced.

To create our CRISPR targeting plasmid, the base shuttle vector pMSP3545 was selected as the backbone so that new spacers could be rapidly cloned in E. coli. This would also allow for advanced future experiments, such as spacer library cloning or other high-throughput experiments that would be limited if cloning were performed in Lactobacillus. To create this plasmid, S. pyogenes Cas9 and its tracrRNA were first inserted into the backbone, creating the non-targeting p3545cas9 (Figure 4.2B) 23. A repeat-spacer-repeat array was then synthesized as a gBlock under the Ppgm constitutive promoter to drive expression of the Lactobacillus-targeting spacer (Figure 4.2B, Table 4.S1) 31. A CRISPR array was chosen over the standard single-guide RNA because we could not locate a constitutive Lactobacillus promoter that had a defined transcriptional start site. This is important because we did not want additional non-complementary 5′ bases on the single-guide RNA 28. Additionally, the CRISPR array allows for multiplexing targets through the expression of multiple spacers, something the sgRNA cannot perform without accessory factors 32. After successful creation of these plasmids, all necessary components were available for oligo-based genome engineering in Lactobacillus.

Characterization of the genome editing plasmids

To test the efficacy of the generated nisin-inducible recombination plasmid, L. plantarum strains containing this plasmid were electroporated with a previously characterized oligonucleotide that confers rifampicin resistance upon a single base change within the rpoB gene 12. This allows for a selective screen for recombination

efficiency based on rifampicin-resistant CFUs. Successful oligonucleotide recombination and recT induction were observed in both WCFS1 and NIZO21 using the generated plasmid prectnisrk (Figure 4.3A). There was an observable difference in each strain's natural recombination efficiency, suggesting that WCFS1 is able to recombine single-stranded oligonucleotides without an imported recombinase but NIZO21 is not (Figure 4.3A). Comparing the RecT-based recombination to previously published results, WCFS1 demonstrated an order of magnitude fewer recombinants than L. reuteri, and NIZO21 showed approximately two orders of magnitude fewer 12. The electroporation efficiency was also evaluated for both NIZO21 and WCFS1 using the created p3545cas9 plasmid. We observed two orders of magnitude more CFUs for L. plantarum WCFS1 than NIZO21 when transforming this Cas9-only shuttle vector, again suggesting that NIZO21 is less tractable than WCFS1 (Figure 4.3B). Furthermore, NIZO21 did not survive electroporation when transformed with DNA prepared from the E. coli NovaBlue strain, but successful transformants were observed when DNA was isolated from a variant of E. coli that has its methyltransferases removed (EC135) 33. Based on the EC135 transformation data, we speculate that NIZO21 has an active restriction-modification system that impacts electroporation, but further studies are required to support this hypothesis. Taken together, these CFUs from recombination and electroporation suggest that electroporation of oligonucleotides and plasmids into NIZO21 was optimized, but all efficiencies were significantly lower than in the more tractable WCFS1.
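The comparisons above rest on a simple ratio: rifampicin-resistant CFUs divided by total CFUs, with strains compared in orders of magnitude. A minimal Python sketch of that arithmetic follows. The function names and the colony counts are hypothetical and chosen only to illustrate the calculation; the only value taken from the text is the ~1% average rate reported for L. reuteri.

```python
import math

# Sketch only: function names and colony counts are hypothetical.
# The ~1% reference rate for L. reuteri is the figure quoted in the text.


def recombination_efficiency(resistant_cfu, total_cfu):
    """Fraction of viable cells that became rifampicin resistant."""
    if total_cfu <= 0:
        raise ValueError("total_cfu must be positive")
    return resistant_cfu / total_cfu


def orders_of_magnitude_below(reference, observed):
    """How many powers of ten the observed efficiency sits below the reference."""
    return math.log10(reference / observed)


reuteri_rate = 0.01  # ~1% average rate reported for L. reuteri
example_rate = recombination_efficiency(resistant_cfu=50, total_cfu=500_000)

print(f"efficiency: {example_rate:.1e}")
print(f"orders below reference: {orders_of_magnitude_below(reuteri_rate, example_rate):.1f}")
```

Plating the same culture without the oligonucleotide and computing the same ratio gives the background rate of spontaneous resistance, which is what the no-oligo control in Figure 4.3A accounts for.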

Figure 4.3 Confirmation of genome-editing plasmid activity in Lactobacillus plantarum. (A) L. plantarum cells expressing the RecT recombinase were transformed with an oligonucleotide targeting the rpoB gene to confer resistance to rifampicin (RIF^R). The ratio of RIF^R colonies to total CFUs was plotted with a no-oligo control to account for any strain differences. (B) Transformation efficiencies for plasmid DNA from E. coli into WCFS1 and NIZO21. DNA from EC135 contains no methylated sites.

A Cas9 plasmid containing a self-targeting gRNA was also transformed into both NIZO21 and WCFS1, targeting a gene conserved in both genomes. We observed a three order of magnitude drop in transformants when this gRNA was included on the Cas9 plasmid, suggesting that the SpCas9 plasmid successfully cleaves genomic DNA in L. plantarum NIZO21 and WCFS1 and causes cell death. With each genome-editing plasmid characterized, we moved to the next step of editing the ackA gene.

Oligo-based genome editing in Lactobacillus

To edit the ackA gene, a single-stranded oligonucleotide was designed to revert NIZO.G2 back to the wild-type sequence (Table 4.S1, orl16). A self-targeting spacer was also cloned into the p3545cas9+rsr plasmid to target and cleave the mutated


NIZO.G2 ackA gene, allowing for programmed cell death in any strains that did not successfully recombine. To perform the editing, the recombination plasmid prectnisrk was transferred to NIZO.G2, induced with nisin, and co-transformed with the ackA oligo and the self-targeting Cas9 plasmid in a 1-step editing approach 21,34. Resulting CFUs from this method are presented in Figure 4.4A.

Figure 4.4 Attempted oligo-based editing of the acetate kinase gene in NIZO.G2. (A) CFUs for NIZO.G2 after transformation of a non-targeting Cas9, a self-targeting gRNA, and an oligonucleotide conferring a genomic change to the ackA gene. (B) Multiple rounds of oligonucleotides were transformed into NIZO.G2 to increase the prevalence of the ackA mutation in the population. Ten successive rounds of transformations were performed. Cas9 targeting was introduced after the 10th round. (C) Colony-based qPCR design used to rapidly screen any survivors of Cas9 targeting. Two primer pairs were used: one to amplify the NIZO21 genome locus and one to amplify the NIZO.G2 locus. (D) Colony-based qPCR results for NIZO21 and NIZO.G2 control strains. Successful amplification was observed when each primer set was paired with its matching strain. When the primer sets were swapped, neither strain amplified. Threshold values (Ct) are presented on the plot. (E) qPCR results for three representative survivors of the 23 total survivors after Cas9 selection. None of the 23 survivors showed the desired NIZO21 genotype after surviving CRISPR targeting.
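The two-primer-set screen in Figure 4.4C-E amounts to a small decision rule: a colony is called by whichever allele-specific primer set amplifies. The Python sketch below illustrates one way to express that rule. The function name, the use of `None` for a reaction that never crosses threshold, and the Ct cutoff of 30 cycles are all assumptions made for illustration; the source does not state the cutoff that was used.

```python
from typing import Optional

# Sketch only: the function name, the None convention for "no amplification",
# and the 30-cycle cutoff are illustrative assumptions, not values from the text.


def call_genotype(ct_nizo21: Optional[float],
                  ct_g2: Optional[float],
                  ct_cutoff: float = 30.0) -> str:
    """Call a colony from the Ct values of the two allele-specific primer sets."""
    nizo21_amplified = ct_nizo21 is not None and ct_nizo21 < ct_cutoff
    g2_amplified = ct_g2 is not None and ct_g2 < ct_cutoff

    if nizo21_amplified and not g2_amplified:
        return "NIZO21"
    if g2_amplified and not nizo21_amplified:
        return "NIZO.G2"
    if nizo21_amplified and g2_amplified:
        return "ambiguous"  # both sets amplified: mixed colony or non-specific priming
    return "no amplification"


# Hypothetical Ct values for three colonies.
print(call_genotype(ct_nizo21=18.2, ct_g2=None))   # NIZO21
print(call_genotype(ct_nizo21=36.5, ct_g2=19.0))   # NIZO.G2
print(call_genotype(ct_nizo21=None, ct_g2=None))   # no amplification
```

Keeping an explicit "ambiguous" outcome is a deliberate choice here: a colony in which both primer sets amplify is better flagged for Sanger sequencing than forced into one genotype.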

We were unable to observe any survivors of Cas9 cleavage with the 1-step editing approach that was previously the most effective 21,34 (Figure 4.4A). Because both recombination and self-targeting were successful on their own (Figure 4.3), it was hypothesized that there were too few recombination events within the entire population, preventing Cas9 from being successfully transformed into a cell containing a recombined ackA site. A total of ten successive transformations of the ackA oligo were then performed in order to increase the prevalence of the mutation within the NIZO.G2 population. After these successive rounds of mutant generation, the selective Cas9 plasmid was again transformed. No colonies appeared after the standard 48-hour incubation period, but a total of 23 colonies over six separate transformations appeared when the incubation was extended by an additional 24 to 48 hours. To rapidly screen these survivors, a colony-based qPCR procedure was developed (Section 4.4.5). Two primer sets were designed: one to amplify only the NIZO21 genotype (containing the CCT codon), and another to amplify only the NIZO.G2 genotype lacking the CCT codon (Figure 4.4C). To validate the selective amplification, each primer set was used to amplify its designed genotype as well as the competing genotype (Figure 4.4D). Each primer set was only able to amplify its designed genotype. Using this qPCR method, all 23 survivors of Cas9 cleavage were screened, and three representative amplification curves are plotted in Figure 4.4E. None of the 23 survivors showed the expected NIZO21 amplification. Sanger sequencing of 5 colonies validated these results. It is hypothesized that these negative survivors appeared because the L. plantarum cells were able to successfully remove the selective Cas9 plasmid,

antibiotics within the plate degraded, or the colonies carried inactive plasmids, a result that has previously been reported 35,36. Moving forward, we sought an alternate method to increase the prevalence of cells that contained the wild-type ackA sequence, one that would accommodate the low transformation efficiency in the L. plantarum NIZO strains.

Repair template editing in Lactobacillus plantarum

To allow for successful editing, a double-stranded DNA repair template was designed as the donor for repair of the genomic DNA break after Cas9 cleavage 23,29,30,37. This repair template would be placed on a selectable plasmid within L. plantarum, providing every cell with the opportunity to successfully undergo editing. This is in comparison to giving this opportunity to 1 in every 60,000 cells (Figure 4.3A, NIZO21). To create this repair plasmid, a 2 kb region of the ackA gene from NIZO21, centered at the CCT mutation, was amplified and inserted into the base pJP005 plasmid, replacing the recT gene (Figure 4.5A) 12. After confirmation that the repair template was inserted and correct, it was transferred to NIZO.G2 (Figure 4.5A). The NIZO.G2 strain containing the repair template was then made electrocompetent, and the Cas9 plasmid targeting the ackA site in NIZO.G2 was transformed to edit the genome.
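The contrast between 1 in 60,000 cells and every cell can be made concrete with a short calculation. The Python sketch below estimates how many Cas9 transformants would be expected to land in a cell that already carries the edit (or an edit template), and the chance of getting at least one. It assumes that transformation and recombination are independent events; the function names and the transformant count of 1,000 are hypothetical, and only the 1-in-60,000 fraction comes from the text.

```python
# Sketch only: assumes each Cas9 transformant independently lands in an
# editable cell with probability `editable_fraction`. Function names and the
# transformant count are hypothetical; 1/60,000 is the figure quoted in the text.


def expected_editable(transformants, editable_fraction):
    """Expected number of transformants that land in an editable cell."""
    return transformants * editable_fraction


def prob_at_least_one(transformants, editable_fraction):
    """Probability that at least one transformant lands in an editable cell."""
    return 1.0 - (1.0 - editable_fraction) ** transformants


oligo_fraction = 1 / 60_000   # oligo recombination in NIZO21 (Figure 4.3A)
plasmid_fraction = 1.0        # every cell carries the repair-template plasmid
transformants = 1_000         # hypothetical number of Cas9 transformants

for label, fraction in [("oligo", oligo_fraction), ("repair plasmid", plasmid_fraction)]:
    print(label,
          f"expected={expected_editable(transformants, fraction):.3f}",
          f"P(>=1)={prob_at_least_one(transformants, fraction):.3f}")
```

Under these assumptions the oligo route yields far fewer than one expected editable transformant per experiment, which is consistent with the absence of correct survivors reported above. Carrying the template in every cell removes that bottleneck, although it does not by itself guarantee that a given cell is edited.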


Figure 4.5 Genome editing in Lactobacillus plantarum with a dsDNA repair template. (A) Construction of the repair template plasmid containing the dsDNA template. Following successful construct generation, cells containing the repair plasmid were transformed with the self-targeting Cas9 plasmid. (B) Transformation results after Cas9 self-targeting with the repair template plasmid. Presence of the repair template allowed for a total of 15 survivors. (C) Sequencing results for 10 of the survivors. Two survivors contained un-edited genomes, and one did not successfully amplify from the genome. Seven survivors contained the edited ackA gene. (D) Plasmid removal after editing. Successfully edited cells were passaged multiple times through non-selective media to remove the genome editing plasmids. After validation of plasmid removal, strains were shipped to the Leulier lab for in vivo validation.