THE SEARCH AND ANALYSIS OF A CRISPR-CAS SYSTEM IN ESCHERICHIA COLI HS WITH SUBSEQUENT SCANNING FOR THE CORRESPONDING PHAGE RACES BASED ON THE SPACERS OF THE DETECTED CRISPR ARRAY USING BIOINFORMATIC METHODS

E. I. Ivanova, Yu. P. Dzhioev, A. Yu. Borisenko, N. P. Peretolchina, L. A. Stepanenko, A. I. Paramonov, E. V. Grigorova, U. M. Nemchenko, T. V. Tunik, E. A. Kungurtseva

The Escherichia coli species comprises multiple biotypes. Some of them are commensal colonizers of the mammalian (including human) gut. Others are pathogenic and cause disease. One of the most significant causative agents of intestinal infections is enterohemorrhagic Escherichia coli O157:H7, whereas an important representative of commensals is E. coli HS. Infection caused by E. coli O157:H7 can provoke hemolytic uremic syndrome (HUS), which is characterized by progressive renal failure. E. coli O157:H7 is a serotype capable of producing Shiga toxins [1][2][3]. No specific treatment has yet proved effective against this syndrome. Only supportive care is recommended during the acute stage of the disease. The use of antibiotics for treating infections caused by Shiga-toxin-producing E. coli (Stx-E. coli) remains controversial [4,5]. It has been shown that antibiotic therapy prescribed to patients with acute gastrointestinal infection caused by Stx-E. coli increases the risk of developing HUS 17-fold [6]. Disruption of the bacterial membrane by antibiotics can stimulate progression to the acute stage because the bacteria start to release the toxin in large quantities [7].
Therefore, we need novel alternatives to antibiotics to combat pathogenic bacteria. Phage therapy holds great promise here [8][9][10]. Further development of this approach relies on fundamental knowledge about the genetic basis of the interactions between bacteria and bacteriophages. This knowledge, in turn, can be obtained only if bacterial and phage genomes, as well as new analytical methods, are at the researcher's disposal. Currently available bioinformatics software allows the researcher to process large sets of genomic data and extract new information about bacterial genomes [11].
Besides the advances in bioinformatics, another significant event of the past few years is the discovery of specific adaptive immunity in prokaryotes. It was long believed that bacteria could not resist phage attacks, but in 1987 an unusual region consisting of multiple repeats was discovered in the E. coli genome [12]. However, it was not until 2005 that it became clear that the sequences alternating with those repeats were often identical to sequences found in bacteriophage and plasmid genomes [13,14]. The discovered structures were termed CRISPR-Cas (Clustered Regularly Interspaced Short Palindromic Repeats and CRISPR-associated proteins). They are a specific adaptive defense of bacteria and archaea against foreign genetic material, mostly derived from phages and plasmids [15][16][17][18]. A CRISPR array is a set of palindromic repeats of 21-47 base pairs separated by unique spacers. Spacers are complementary to regions in the phage or plasmid genomes the bacterium is immune to [13]. Located in close proximity to a CRISPR locus are the cas-genes. Their products ensure proper functioning of the CRISPR locus. According to the current classification, CRISPR-Cas systems are grouped into 3 types based on their mechanism of action and the cas-proteins present in the genome [19].
Bioinformatic methods are employed to detect and identify CRISPR-Cas systems in bacterial genomes [20,21]. For example, they can help to identify bacteriophage races by bacterial spacer sequences and, therefore, to assess bacterial resistance to certain phages or plasmids [22][23][24]. This is an important research field, because such screening can provide a solution to the practical challenges faced in the therapy of infections and contribute to the study of evolution within and between bacterial species [17,22]. For many bacterial species, however, the interactions between them and their phages mediated by CRISPR-Cas and anti-CRISPR-Cas systems remain poorly studied. Therefore, it is reasonable to start with the development of an efficient algorithm for the bioinformatics-based search and analysis of bacterial CRISPR-Cas loci and their structural components and then proceed to the screening of phage races using bacterial CRISPR arrays. Considering the above, we aimed to search the genome of Escherichia coli HS for CRISPR-Cas loci, study the detected components and then identify the corresponding bacteriophage races through screening using bacterial CRISPR arrays and an original bioinformatics-based search algorithm.

METHODS
The object of our study was the strain Escherichia coli HS. GenBank stores two of its genomes: NC_009800.1 sequenced in 2017 and CP000802 sequenced in 2014. E. coli HS represented in GenBank by the genome NC_009800.1 was cultured using a reference strain from the collection of the Center for Vaccine Development (USA) [25]. For our study we selected the genome CP000802 of a reference strain [26] isolated from the gastrointestinal tract of a healthy human who showed no clinical symptoms of colonization [25].
To detect CRISPR-Cas loci in the bacterial genome, we used MacSyFinder (Macromolecular System Finder, ver. 1.0.2), a program for bioinformatic modelling [27]. This software requires a protein profile of genomic sequences encoded as hidden Markov models (HMM) available in PFAM, TIGRFAM and PRODOM databases. Sequence homology searches were conducted using makeblastdb (ver. 2.2.28) and HMMER (ver. 3.0); the same software allowed us to obtain structural and functional characteristics of cas-proteins detected in each analyzed genome [28]. Visual representation of the results returned by MacSyFinder was generated in MacSyView. The programming language used was Python (ver. 2.7) [29].
The obtained CRISPR arrays were run against the online database CRISPI: a CRISPR Interactive database (Gen Ouest BioInformatics Platform, http://genouest.org/) for structural analysis. Bacterial and archaeal genomes were downloaded from the NCBI FTP Server and processed in C and Java (ver. 1.5.0.12) [30]. The detection algorithm was based on imposing a limitation on the number of closest matches. To avoid detection of unrelated structures, the minimally required percent identity was set to 60%. The web-page was implemented in PHP (ver. 4.3.9) and Java (ver. 1.5.0.12). For phage identification, the obtained spacer sequences were run against the GenBank-Phage database using the search algorithm BLASTn [31]. The following online services were used: CRISPRTarget (http://bioanalysis.otago.ac.nz/CRISPRTarget/crispr_analysis.html) and Mycobacteriophage Database (http://phagesdb.org/blast/).
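The 60% identity cutoff mentioned above can be illustrated with a minimal Python sketch. It assumes a simple ungapped, position-by-position comparison of two equal-length sequences; the text does not describe how CRISPI actually computes identity, so the function names and the comparison scheme here are illustrative assumptions only.

```python
MIN_IDENTITY = 60.0  # percent; the threshold stated in the text


def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two equal-length sequences.

    Assumption: ungapped comparison; real tools typically align first.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a.upper(), b.upper()) if x == y)
    return 100.0 * matches / len(a)


def passes_threshold(a: str, b: str, min_identity: float = MIN_IDENTITY) -> bool:
    """True if the pair reaches the minimum percent identity."""
    return percent_identity(a, b) >= min_identity


# Toy sequences (hypothetical, not taken from the E. coli HS genome)
print(percent_identity("ACGTACGTAC", "ACGTACGTTT"))  # 8 of 10 positions match
print(passes_threshold("AAAAA", "AATTT"))            # 2 of 5 positions match
```

A pair of candidate repeats below the cutoff would be treated as unrelated and discarded under this scheme.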

RESULTS
The screening of the E. coli HS genome CP000802 revealed the presence of a CRISPR-Cas system at positions 2920652-2921839, i.e. its length was 1,187 bp. Structurally, this CRISPR-Cas system belonged to CAS-Type-IE.
Using MacSyFinder, we identified and visualized the following regions of the E. coli HS genome coding for Cas proteins:
- mandatory genes, whose presence in the genome indicates the presence of a CRISPR-Cas system (Fig. 1);
- accessory genes that may be found in more than one system and are hard to identify using only one protein profile; however, they also signal the presence of a CRISPR-Cas system in a bacterial genome.
Using MacSyFinder, we were able to detect cas-genes in the CRISPR-Cas system of the analyzed E. coli HS genome and get a visual representation of the obtained XML in MacSyView. Examples of cas-genes and their location in the genome of the studied strain are shown in Fig. 1.
Using HMMER (ver. 3.0) and makeblastdb (ver. 2.2.28), we obtained structural and functional characteristics of the cas-proteins detected in each analyzed genome, namely: gene (the gene corresponding to the profile), system (the system the gene belongs to), hit id (the sequence identifier), hit seq length (length of the sequence), replicon name (the name of the replicon), position hit (the rank of the sequence matched in the input dataset file), i-eval (independent e-value), score (the score of the hit), profile coverage (percentage of the profile that matches the hit sequence), sequence coverage (percentage of the hit sequence that matches the profile), begin match (the position in the sequence where the profile match begins), and end match (the position in the sequence where the profile match ends) (Fig. 2).
The obtained CRISPR arrays were analyzed in real time in CRISPI: a CRISPR Interactive database, which uses homology of repeated regions to return information about the sequence structure. Using this online tool, 11 repeats were identified in the CRISPR array of the studied strain. The consensus view is provided in Fig. 3. After the repeats were detected, 10 spacers were identified in the CRISPR array (Table 1). Visual representations of the CRISPR array and cas-genes detected in the studied bacterial genome were implemented in Java (Fig. 4).
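To make the hit characteristics listed above more concrete, the following Python sketch models a hit record with a subset of those fields and keeps only confident hits. The record layout, the threshold values, and the example entries (gene names, identifiers, numbers) are hypothetical illustrations and are not taken from the actual report shown in Fig. 2.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """One profile hit; field names mirror the characteristics listed above."""
    gene: str
    system: str
    hit_id: str
    i_eval: float             # independent e-value
    score: float
    profile_coverage: float   # percent of the profile matched
    sequence_coverage: float  # percent of the hit sequence matched
    begin_match: int
    end_match: int


def filter_hits(hits: List[Hit],
                max_i_eval: float = 1e-3,
                min_profile_coverage: float = 50.0) -> List[Hit]:
    """Keep hits passing both (assumed) thresholds, best score first."""
    kept = [h for h in hits
            if h.i_eval <= max_i_eval
            and h.profile_coverage >= min_profile_coverage]
    return sorted(kept, key=lambda h: h.score, reverse=True)


# Hypothetical example records, for illustration only
example_hits = [
    Hit("cas3", "CAS-Type-IE", "protein_A", 1e-50, 300.0, 95.0, 90.0, 5, 880),
    Hit("cas1", "CAS-Type-IE", "protein_B", 1e-80, 410.0, 98.0, 97.0, 1, 300),
    Hit("cas2", "CAS-Type-IE", "protein_C", 0.5, 12.0, 30.0, 25.0, 10, 40),
]

for hit in filter_hits(example_hits):
    print(hit.gene, hit.score, hit.i_eval)
```

Sorting by score is one possible design choice; ranking by i-eval would serve equally well for a quick review of the strongest matches.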
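The relationship between the number of repeats and the number of spacers (11 repeats flanking 10 spacers) can be sketched in Python as follows. The sketch assumes that every repeat copy is identical to the consensus, which is a simplification: tools such as CRISPI rely on homology and tolerate variation between repeats. The sequences are toy examples, not the repeats or spacers of E. coli HS.

```python
from typing import List


def extract_spacers(array_seq: str, repeat: str) -> List[str]:
    """Return the sequences lying between consecutive exact repeat copies.

    Assumption: repeats are exact and non-overlapping. With n repeats,
    splitting yields n + 1 pieces; the first and last are flanking
    sequence, so n - 1 spacers remain.
    """
    pieces = array_seq.split(repeat)
    return pieces[1:-1]


# Toy array: 4 copies of a hypothetical repeat separating 3 spacers
repeat = "GTTCC"
toy_spacers = ["AAAT", "CCGA", "TTAG"]
array_seq = "AA" + repeat + repeat.join(toy_spacers) + repeat + "AA"

found = extract_spacers(array_seq, repeat)
print(array_seq.count(repeat), "repeats,", len(found), "spacers")
print(found)
```

Applied to an array with 11 exact repeats, the same function would return 10 spacers.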

DISCUSSION
Last year the Escherichia coli HS genome NC_009800.1 was annotated in the GenBank database. The annotation contained information about three CRISPR-Cas loci in this genome. In the CRISPR-Cas database (http://crispr.i2bc.paris-saclay.fr/crispr/) these loci are represented by a few variants. Our study demonstrates that, on the whole, the structural units of the CRISPR array detected in the E. coli HS genome (CP000802, sequenced in 2014) coincide with the structural units of the E. coli HS strain (NC_009800_6, sequenced in 2017).
Using the spacer sequences detected in the CRISPR array of the studied strain, we attempted to identify the corresponding phages (Table 2). Of the 10 spacer sequences, only 4 (spacers 1, 5, 7, and 10) were complementary to the protospacers of the phage races presented in the table. The identified phage races are typical for a wide range of bacterial hosts. This may be a result of the horizontal transfer of CRISPR-Cas systems between different types of bacteria throughout the long history of development of their adaptive immunity. Further research is expected to yield new knowledge about the nature of the antagonistic relationship between bacteria and their phages. Based on the detected phage races, one can infer the degree of immune protection and the viability of bacteria throughout their evolution.
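The notion of a spacer being complementary to a protospacer can be illustrated with a short Python sketch that looks for a spacer on either strand of a phage sequence. This is an exact-match simplification and not the BLASTn search used in the study, which also reports imperfect matches; the sequences and function names are hypothetical.

```python
from typing import Optional, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_protospacer(spacer: str, phage_seq: str) -> Optional[Tuple[str, int]]:
    """Return (strand, 0-based start) of the first exact match, or None.

    '+' means the spacer itself occurs in phage_seq; '-' means its
    reverse complement does, i.e. the match is on the opposite strand.
    """
    spacer, phage_seq = spacer.upper(), phage_seq.upper()
    pos = phage_seq.find(spacer)
    if pos != -1:
        return ("+", pos)
    pos = phage_seq.find(reverse_complement(spacer))
    if pos != -1:
        return ("-", pos)
    return None


# Toy phage fragment and spacers (hypothetical sequences)
phage = "TTGACCATGGCATTAC"
for spacer in ("GACCAT", "AATGCC", "AAAAAA"):
    print(spacer, find_protospacer(spacer, phage))
```

A spacer for which the function returns None would correspond, in this simplified picture, to the spacers that found no matching phage in Table 2.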

CONCLUSIONS
The successful detection of a CRISPR-Cas array in the genome of the E. coli HS strain (CP000802, sequenced in 2014) and its structural analysis show that bioinformatics-based methods are effective for the search of CRISPR-Cas structures in sequenced bacterial genomes. This type of search yields valuable information. The presence of "mandatory" Cas proteins suggests high anti-phage activity of the CRISPR-Cas system of the studied strain. The number of detected spacers reflects the duration of the strain's evolution. The comparative analysis of spacers in the two CRISPR arrays detected in the CP000802 genome of E. coli HS sequenced in 2014 and in the NC_009800.1 genome of the same strain sequenced in 2017 demonstrates that the number of spacers in the CRISPR array detected in NC_009800.1 has increased to 19. The number of spacers in CP000802 is only 10. We assume that this increase in the number of spacers was possible due to their accumulation following frequent passaging or because of frequent contamination by phages. In any case, it can be indicative of high CRISPR-Cas activity in E. coli HS.

Fig. 1. Cas-genes (A) and their location in the genome (B) of E. coli HS (CP000802) detected by MacSyFinder and visualized in MacSyView

Fig. 4. Location of cas-genes and the CRISPR array in the genome of E. coli HS (CP000802)

Table 1. The list of nucleotide sequences in the CRISPR array: spacers separated by repeat units detected in CRISPI: a CRISPR Interactive database in the genome of E. coli HS (CP000802)

Table 2. Spectrum of the phage races revealed by the complementary structures of the spacer sequences of the CRISPR array of the E. coli HS strain (CP000802)