The training and test sets contained 62 and 11 species, respectively, and the test set was held out from the training process. To screen for models with stable performance, we trained five machine learning approaches on each of three groups of site information (group 20, group 24, and group 117). Finally, the predictions of the three groups were combined, and the combination of six models with the highest precision, out of a total of 408 combinations, was chosen as our prediction pipeline; this pipeline reached an in silico precision of circa 87.5% (Figure 1B) and was used for subsequent analysis. We used this pipeline to generate a prediction score for each ACE2 sequence, equal to the number of models predicting that it bound to the viral spike divided by the total number of models.
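The prediction score defined above is a simple vote fraction. The following is a minimal sketch of that calculation, not the authors' code; the function name `prediction_score` and the example votes are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def prediction_score(votes: Sequence[bool]) -> float:
    """Return the fraction of models that predict binding for one ACE2 sequence.

    `votes` holds one boolean per model (True = predicted to bind).
    This helper is illustrative only; it is not taken from the study's code.
    """
    if not votes:
        raise ValueError("at least one model vote is required")
    return sum(bool(v) for v in votes) / len(votes)


# Hypothetical votes from a six-model pipeline for a single ACE2 sequence.
example_votes = [True, True, False, True, True, False]
score = prediction_score(example_votes)
print(f"prediction score = {score:.3f}")
```

With four of six hypothetical models voting for binding, the score is the ratio 4/6.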
Bat species of the order Chiroptera were of the highest interest for tracing the origin and studying the host range of SARS-CoV-2, as bats harbor multiple coronavirus species, including the SARS virus. One of the coronavirus strains most closely related to SARS-CoV-2, RaTG13, was found in horseshoe bats (Rhinolophus affinis) (14). Thus, we applied our pipeline to the bat species with available ACE2 sequences (59 in total) and predicted their ability to bind the SARS-CoV-2 spike protein. We then tested the precision of our predictions on two experimentally validated datasets, in which ACE2s with a prediction score >0.5 were considered likely to bind to the viral spike. For the first dataset, we selected 12 bat ACE2s, expressed the proteins, and tested their ability to bind the viral spike using surface plasmon resonance (SPR) and flow cytometry (Supplementary Table S2). Overall, 4 of the 6 ACE2s predicted to bind to the SARS-CoV-2 spike were validated as binders (Figure 2B and Supplementary Figure S1), and 5 of the 6 ACE2s predicted not to bind were confirmed as non-binders. Here we achieved a precision of 80% (Figure 1C). Then, using another dataset of 46 bat species from Yan et al. (6), after excluding the 2 sequences contained in our training set, we predicted the binding capacity and achieved 78.26% precision, as shown in Figure 1C. Thus, our unified pipeline, which incorporates multiple machine learning models and different sets of input sites, can confidently predict binding between bat ACE2s and viral spikes.
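As an illustration of the evaluation step described above, the sketch below applies the >0.5 score threshold and computes precision in its standard sense, TP / (TP + FP), against experimental labels. The scores and labels are toy values, not the study's data, and the helper names `classify` and `precision` are hypothetical; the metric definition used in the study may differ.

```python
from typing import List, Sequence


def classify(scores: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Call an ACE2 a predicted binder when its score is strictly above the threshold."""
    return [s > threshold for s in scores]


def precision(predicted: Sequence[bool], observed: Sequence[bool]) -> float:
    """Standard precision: true positives / all predicted positives."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    tp = sum(p and o for p, o in zip(predicted, observed))
    fp = sum(p and not o for p, o in zip(predicted, observed))
    if tp + fp == 0:
        raise ValueError("no positive predictions; precision is undefined")
    return tp / (tp + fp)


# Toy data for illustration only (not taken from the study).
toy_scores = [1.0, 0.83, 0.67, 0.5, 0.33, 0.0]
toy_observed = [True, True, False, False, False, False]

toy_predicted = classify(toy_scores)
toy_precision = precision(toy_predicted, toy_observed)
print(f"precision = {toy_precision:.3f}")
```

Note that a score of exactly 0.5 is not counted as a predicted binder under the strict >0.5 rule.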
Prediction and validations of ACE2 across species in binding to SARS-CoV-2 spike. (A) The predicted range of species with ACE2 capable of binding to SARS-CoV-2; (B) SPR and flow cytometry validation for multiple species’ ACE2 in binding to SARS-CoV-2 spike; (C) KD in nmol/L of the species shown in (B).
Note: For families with multiple species, the branch is collapsed and the proportion predicted to bind is shown in Figure 2A. Blue species/families are those predicted not to bind. Abbreviations: ACE2=angiotensin I converting enzyme 2; SARS-CoV-2=severe acute respiratory syndrome coronavirus 2; SPR=surface plasmon resonance; KD=binding affinity.
During validation, we also noticed that the ACE2 sequences from Pteropus alecto and Pteropus vampyrus have identical AAs at all 117 sites we selected as input; however, P. alecto ACE2 bound the SARS-CoV-2 spike in our experimental system whereas P. vampyrus ACE2 had no detectable binding, suggesting that additional AAs affect binding capacity. We compared the ACE2 sequences of these 2 species and identified a total of 22 sites that differ between them. Of these sites, 16 are identical to human ACE2 (12 for P. alecto and 4 for P. vampyrus) (Figure 1D and Figure 2C). This comparison suggested that one or more of the AAs that differ among P. alecto, P. vampyrus, and human ACE2 underlie the differences in binding to the viral spike protein but have not been reported in available studies. Closer investigation revealed that this set of AAs is not directly involved in binding the viral spike protein; their influence is therefore indirect, likely acting through the structure of the ACE2 protein or through post-translational modifications, including glycosylation.
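The pairwise comparison described above can be sketched as follows: list the positions at which two aligned sequences differ, then check which of those positions match a reference sequence. This is a minimal illustration, assuming pre-aligned sequences of equal length; the function names and the short toy fragments are hypothetical and are not real ACE2 sequences.

```python
from typing import List


def differing_sites(seq_a: str, seq_b: str) -> List[int]:
    """Return 1-based positions where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [i + 1 for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


def sites_matching_reference(sites: List[int], seq: str, ref: str) -> List[int]:
    """Of the given 1-based sites, keep those where `seq` has the same residue as `ref`."""
    return [pos for pos in sites if seq[pos - 1] == ref[pos - 1]]


# Toy aligned fragments for illustration only (not real ACE2 sequences).
toy_species_a = "MSTQLKDE"
toy_species_b = "MSAQLRDN"
toy_reference = "MSTQLRDQ"

sites = differing_sites(toy_species_a, toy_species_b)
a_like_reference = sites_matching_reference(sites, toy_species_a, toy_reference)
b_like_reference = sites_matching_reference(sites, toy_species_b, toy_reference)
print(sites, a_like_reference, b_like_reference)
```

In this toy case, a differing site can match the reference in one species, in the other, or in neither, which mirrors how 16 of the 22 differing sites in the text were assigned to one species or the other.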
Finally, we refined our models by incorporating the modified list of AAs as input and performed predictions on the available ACE2 sequences from mammalian species (Supplementary Table S3; 204 in total, belonging to 69 families). This identified ACE2s of interest (likely to bind to the SARS-CoV-2 spike) in a total of 144 species spread across 47 families (60.87%, Figure 2A). It is worth noting that this wide range of potential mammalian hosts agrees with the emerging evidence of SARS-CoV-2 presence across mammals. Aside from 5 species of Hominidae (primates), ACE2s were predicted to bind to the viral spike protein in 13 species of Cercopithecidae (old world monkeys), 8 species of Pteropodidae (old world fruit bats), 7 species of Felidae (cats), 7 species of Bovidae (ruminants), 7 species of Mustelidae (containing minks), 6 species of Canidae (dogs), 3 species of Equidae (horses), 6 species of Cricetidae (muroid rodents), 4 species of Sciuridae (squirrels), and 3 species of Ursidae (bears). Even in all 3 families of marine mammals, ACE2s had a high likelihood of binding to the SARS-CoV-2 spike (in all 4 species of Phocidae, 4 of Delphinidae, and 3 of Otariidae, Figure 2B). Our prediction was supported by emerging reports that white-tailed deer (family Cervidae) were positive for antibodies against SARS-CoV-2 in 2021, in addition to reports of dogs, cats, and minks being viable hosts for this virus. In summary, based on ACE2 sequence features, our study suggests that SARS-CoV-2 has an extremely large range of potential hosts and indicates the importance of investigating wild animals for the presence of the virus and monitoring its spread.