-
Salmonella is an important foodborne intestinal pathogen that is transmitted through food and water and causes enteritis, bloodstream infections, and other serious outcomes. Plasmids in Salmonella enterica generally range from 2 kb to 200 kb in size, and their size distribution varies by serotype (1). As important mobile genetic elements (MGEs), plasmids endow Salmonella strains with many biological characteristics, including toxin production, heavy metal resistance, antibiotic resistance conferred by antibiotic resistance genes (ARGs), and prophage integration (2-4). The spread of plasmid-borne ARGs has become a global public health problem because plasmids, as reservoirs of ARGs, can spread rapidly between different species, including human pathogens (5-6). Therefore, monitoring the ARGs carried by plasmids is necessary for evaluating ARG transmission.
Salmonella genome analysis based on next-generation sequencing has become an important tool for infectious disease surveillance, prevention and control, and food safety management. However, without long-read sequencing it remains challenging to separate chromosomal sequences from plasmid sequences in an assembled genome. Obtaining the complete sequences of these MGEs is very important for understanding plasmid origins and their contributions to strain adaptability. To address this problem, several plasmid sequence prediction methods have been developed, including Kraken (7), cBar (8), PlasFlow (9), RFPlasmid (10), mlplasmids (11), and PlasmidFinder (12). Kraken is an ultra-fast and highly accurate species classification program for sequences, and in the prediction of plasmid sequences in Klebsiella pneumoniae, a Kraken classifier-based method achieved the highest accuracy and the most balanced sensitivity and specificity among the compared methods (13).
In this study, three customized Kraken databases were constructed from three different plasmid datasets and a Salmonella chromosomal dataset, yielding three different Kraken classifiers. Five-fold cross-validation was used to evaluate the performance of the three classifiers on two different benchmark datasets. Finally, the optimal Kraken classifier was used to predict plasmid contigs in the genomes of Salmonella strains isolated in China, and the prevalence of plasmid carriage and of plasmid-borne ARGs was estimated.
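The five-fold cross-validation mentioned above can be illustrated with a minimal sketch. It assumes that each labeled sequence has a unique identifier and that, in each round, four folds would be used to build the classifier database while the fifth is held out for evaluation; the function name `five_fold_splits` and the example identifiers are hypothetical and are not taken from the study.

```python
import random


def five_fold_splits(sequence_ids, k=5, seed=0):
    """Yield (train_ids, test_ids) pairs for k-fold cross-validation."""
    ids = list(sequence_ids)
    random.Random(seed).shuffle(ids)          # reproducible shuffle
    folds = [ids[i::k] for i in range(k)]     # k folds of near-equal size
    for i in range(k):
        test_ids = folds[i]                   # held-out fold
        train_ids = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train_ids, test_ids


# Hypothetical sequence labels, for illustration only.
example_ids = [f"seq_{n:03d}" for n in range(23)]
splits = list(five_fold_splits(example_ids))
```

Each sequence appears in exactly one held-out fold, so every sequence is evaluated once by a classifier that did not see it during database construction.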
-
Evaluation results for the three Kraken classifiers showed that the third Kraken classifier, C, which was built from complete Salmonella genomes and all NCBI bacterial plasmids, had the highest accuracy (98.94%) and the highest recall (97.67%), with high precision (99.94%) and specificity (99.95%). The other two classifiers had lower accuracy and recall, although their precision and specificity were marginally higher (100.00%) (Table 1).
| Dataset | Classifier type | Accuracy | Precision | Recall | Specificity | False predictive value |
|---|---|---|---|---|---|---|
| Benchmark dataset I | Kraken classifier A | 98.41% | 100.00% | 96.42% | 100.00% | 98.57% |
| Benchmark dataset I | Kraken classifier B | 98.89% | 100.00% | 97.49% | 100.00% | 98.96% |
| Benchmark dataset I | Kraken classifier C | 98.94% | 99.94% | 97.67% | 99.95% | 98.86% |
| Benchmark dataset II | Kraken classifier A | 99.20% | 99.80% | 91.23% | 99.65% | 99.87% |
| Benchmark dataset II | Kraken classifier B | 99.25% | 99.64% | 92.38% | 99.65% | 99.90% |
| Benchmark dataset II | Kraken classifier C | 99.28% | 99.48% | 92.68% | 99.66% | 99.90% |

Table 1. Evaluation results for Kraken classifier-based plasmid sequence prediction.
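For readers who want to see how metrics of this kind are derived, the following is a minimal sketch using the standard definitions of accuracy, precision, recall, and specificity from a confusion matrix. It assumes that plasmid sequences are the positive class and chromosomal sequences the negative class, which the text does not state explicitly. The counts are made up for illustration and do not reproduce Table 1, and the "False predictive value" column is omitted because its definition is not given.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    Assumption: 'positive' means a sequence predicted or labeled as plasmid.
    """
    def ratio(numerator, denominator):
        # Avoid division by zero when a class is empty.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Hypothetical counts: 100 plasmid and 100 chromosomal sequences.
example = classification_metrics(tp=90, fp=0, tn=100, fn=10)
```

With no false positives, precision and specificity are both 100% even though recall is lower, which is the same pattern seen for classifiers A and B on benchmark dataset I.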
-
Here, the complete genomes in benchmark dataset I were broken into fragments according to the empirical contig length distribution of Salmonella draft genomes in the NCBI database, producing a simulated draft genome benchmark dataset (benchmark dataset II). The length distributions of chromosomal and plasmid fragments in this simulated dataset were similar to the contig length distribution of 1,000 randomly selected Salmonella draft genomes from GenBank, indicating that benchmark dataset II is a good simulation of actual data (Figure 2A).
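One way such a simulation could be carried out is sketched below. It assumes that fragment lengths are drawn with replacement from a list of observed contig lengths and that each replicon is cut into consecutive, non-overlapping pieces; the exact procedure used in the study is not described, so the function `fragment_sequence` and the example lengths are hypothetical.

```python
import random


def fragment_sequence(sequence, empirical_lengths, rng):
    """Cut `sequence` into consecutive fragments whose lengths are sampled
    (with replacement) from `empirical_lengths` (positive integers).

    The final fragment is truncated at the end of the sequence.
    """
    fragments = []
    pos = 0
    while pos < len(sequence):
        length = rng.choice(empirical_lengths)
        fragments.append(sequence[pos:pos + length])
        pos += length
    return fragments


rng = random.Random(42)
# Hypothetical contig lengths standing in for an empirical distribution.
empirical_lengths = [500, 1200, 3000, 8000]
# Random 20 kb sequence standing in for a complete replicon.
replicon = "".join(rng.choice("ACGT") for _ in range(20000))
fragments = fragment_sequence(replicon, empirical_lengths, rng)
```

Because each fragment inherits the chromosome or plasmid label of its parent replicon, the simulated contigs retain a known ground truth for evaluation.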
Figure 2. The application and evaluation of the Kraken classifier for Salmonella strains isolated in China. (A) Distribution of contig lengths of 1,000 randomly selected Salmonella draft genomes compared to the chromosomal and plasmid length distributions in our simulated draft genome benchmark dataset (benchmark dataset II). (B) Number of plasmid contigs per strain for Salmonella with plasmids isolated from China. (C) Total length distribution of plasmid contigs per strain for Salmonella with plasmids isolated from China. (D) Venn diagram displaying the overlap of strains containing plasmids predicted by a replicon-based method (PlasmidFinder) and the Kraken classifier. (E) Comparison of the number of ARGs carried by plasmids predicted by the Kraken classifier and the replicon-based method, respectively. (F) Comparison of the number of ARGs located in plasmids and those located in the chromosome.
Results showed that the third Kraken classifier, C, which was built from all bacterial plasmids and complete Salmonella genomes in NCBI, had the highest accuracy (99.28%). Its recall (92.68%) and specificity (99.66%) were also the highest of the three classifiers, although its precision (99.48%) was slightly lower than that of classifiers A and B (Table 1). Therefore, Kraken classifier C was selected as the optimal Kraken classifier in this study.
-
A total of 4,036 draft genomes of Salmonella strains isolated in China were collected from GenBank. The optimal Kraken classifier was then used to predict plasmid contigs in them. Among all strains, 3,673 (91.01%) were predicted to have plasmid contigs, with a median of five plasmid contigs per strain [95% confidence interval (CI): 1–21] (Figure 2B) and a median total plasmid length of 93,740 bp per strain (95% CI: 4,657–267,721 bp) (Figure 2C).
To compare the Kraken classifier established in this study with a conventional replicon-based method, PlasmidFinder was also used to predict plasmid contigs. Among the 4,036 Salmonella draft genomes, PlasmidFinder predicted that 3,145 strains (72.72%) contained plasmid contigs. The Kraken classifier identified plasmids in 556 strains that PlasmidFinder did not, while the replicon-based method found 24 strains that the Kraken classifier did not (Figure 2D). Of these 24 strains, four had very long (>4 Mb) contigs, which may reflect the integration of plasmids into chromosomes. In the other 20 strains, the contigs carrying replicons were quite short (<5 kb) and harbored extensive mobile genetic elements, making it difficult to determine whether these contigs belong to chromosomes or plasmids or are the result of assembly errors.
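The strain-level comparison behind a Venn diagram such as Figure 2D reduces to set operations on the strains each method flags as plasmid-positive. The sketch below uses made-up strain identifiers; the function name `compare_predictions` is hypothetical and the counts do not correspond to those reported above.

```python
def compare_predictions(kraken_strains, replicon_strains):
    """Partition strains by which method predicted plasmid contigs."""
    kraken = set(kraken_strains)
    replicon = set(replicon_strains)
    return {
        "both": kraken & replicon,           # found by both methods
        "kraken_only": kraken - replicon,    # found only by the Kraken classifier
        "replicon_only": replicon - kraken,  # found only by the replicon-based method
    }


# Hypothetical strain identifiers, for illustration only.
kraken_positive = {"s1", "s2", "s3", "s4"}
replicon_positive = {"s3", "s4", "s5"}
overlap = compare_predictions(kraken_positive, replicon_positive)
```

The strains in the `replicon_only` group are the ones that merit manual inspection, as was done for the 24 strains discussed above.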
The two methods were also compared in their ability to detect plasmid-borne ARGs. With the replicon-based method, the median number of plasmid-borne ARGs per strain was zero (95% CI: 0–5), whereas with the Kraken classifier it was three (95% CI: 0–14). The difference was significant (P value <0.001, Kolmogorov-Smirnov test) (Figure 2E), suggesting that the Kraken classifier established in this study can identify more plasmid-borne ARGs than the replicon-based method.
Based on the Kraken classifier predictions, the median number of chromosome-borne ARGs per strain was one (95% CI: 1–7), and the median number of plasmid-borne ARGs was three (95% CI: 1–14). The distribution of ARGs thus differed significantly between chromosomes and plasmids in these Salmonella strains (P value <0.001, Kolmogorov-Smirnov test) (Figure 2F).
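The Kolmogorov-Smirnov comparisons above rest on the maximum gap between two empirical cumulative distribution functions. The following is a minimal sketch of that statistic only, written in plain Python with made-up per-strain ARG counts; it does not compute a P value, for which a statistics library would normally be used, and it is not the authors' analysis code.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D: the largest absolute
    difference between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical per-strain ARG counts, for illustration only.
chromosomal_args = [1, 1, 1, 2, 3]
plasmid_args = [0, 3, 3, 5, 14]
d = ks_statistic(chromosomal_args, plasmid_args)
```

A larger D indicates that the two count distributions diverge more strongly at some point along their range, which is what the significance test then evaluates against the sample sizes.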
Quinolones and third-generation cephalosporins are commonly used antibiotics in clinical practice. In Salmonella, the corresponding ARGs can be carried on either the chromosome or plasmids. Here, the Kraken classifier was used to predict the chromosomal or plasmid location of these ARGs in the 4,036 Salmonella strains. Acquired quinolone-related resistance genes were found on chromosomes in 1.88% of the strains and on plasmids in 11.90% of the strains. In addition, 7.71% of the strains carried third-generation cephalosporin-related resistance genes on chromosomes, while 62.61% carried them on plasmids (Table 2). The number of strains carrying acquired quinolone-related or third-generation cephalosporin-related resistance genes on plasmids was significantly higher than the number carrying the corresponding genes on chromosomes (P value <0.001, Fisher's exact test).
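| Antibiotic type | ARG | Number of ARGs | Number of ARGs isolated on chromosome | Number of ARGs isolated on plasmids | Number of ARGs isolated on both chromosome and plasmids | Undefined | P value |
|---|---|---|---|---|---|---|---|
| Quinolone resistance | qnrA | 4 | 0 | 4 | 0 | 0 | 0.02 |
| Quinolone resistance | qnrB | 182 | 0 | 182 | 0 | 0 | <0.001 |
| Quinolone resistance | qnrD | 3 | 0 | 3 | 0 | 0 | 0.06 |
| Quinolone resistance | qnrS | 1,054 | 19 | 778 | 2 | 259 | <0.001 |
| Quinolone resistance | qnrVC | 4 | 0 | 4 | 0 | 0 | 0.02 |
| Quinolone resistance | qepA | 29 | 0 | 29 | 0 | 0 | <0.001 |
| Quinolone resistance | aac(6')-Ib-cr | 942 | 76 | 299 | 3 | 570 | <0.001 |
| Quinolone resistance | oqxA | 797 | 13 | 221 | 0 | 563 | <0.001 |
| Quinolone resistance | oqxB | 798 | 13 | 225 | 0 | 560 | <0.001 |
| Third-generation cephalosporins resistance | blaTEM | 1,607 | 94 | 838 | 7 | 682 | <0.001 |
| Third-generation cephalosporins resistance | blaCTX-M | 863 | 192 | 408 | 17 | 280 | <0.001 |
| Third-generation cephalosporins resistance | blaOXA | 854 | 80 | 205 | 2 | 571 | <0.001 |
| Third-generation cephalosporins resistance | blaCMY | 27 | 1 | 23 | 1 | 4 | <0.001 |
| Third-generation cephalosporins resistance | blaDHA | 24 | 0 | 24 | 0 | 0 | <0.001 |
| Third-generation cephalosporins resistance | blaNDM | 10 | 2 | 8 | 0 | 0 | 0.02 |
| Third-generation cephalosporins resistance | blaSHV | 5 | 0 | 5 | 0 | 0 | 0.01 |

Abbreviation: ARGs=antibiotic resistance genes.

Table 2. Comparison of quinolone and third-generation cephalosporin-related ARGs prediction results.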
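Fisher's exact test, used above for the chromosome-versus-plasmid comparisons, can be sketched in plain Python for a 2×2 table. The layout of the contingency tables behind the reported P values is not described, so the example below uses a small made-up table and should be read only as an illustration of the two-sided test, not as a reproduction of Table 2; the function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    probs = [prob(x) for x in range(lowest, highest + 1)]
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(p for p in probs if p <= p_observed * (1 + 1e-7))
    return min(1.0, p_value)


# Hypothetical table: rows = location (plasmid, chromosome),
# columns = (gene present, gene absent).
p = fisher_exact_two_sided(3, 1, 1, 3)
```

For this toy table the test gives a P value of 34/70, or about 0.486, so no difference would be claimed; with sample sizes in the thousands, much smaller proportional differences become significant.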
Classifier Evaluation Based on the Complete Genome Benchmark Dataset (Benchmark Dataset I)
Classifier Evaluation Based on the Simulated Draft Genomes Benchmark Dataset (Benchmark Dataset II)
Analysis of Plasmid Carrying Prevalence and Plasmid Carrying ARGs for Salmonella Isolated From China