Methods and Applications: Machine Learning Approach Effectively Predicts Binding Between SARS-CoV-2 Spike and ACE2 Across Mammalian Species — Worldwide, 2021

  • Abstract

    Introduction

    Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a recently emerged coronavirus of natural origin that caused the coronavirus disease (COVID-19) pandemic. The study of its natural origin and host range is of particular importance for source tracing, monitoring of this virus, and prevention of recurrent infections. One major approach is to test the ability of ACE2, the viral receptor, from various hosts to bind the SARS-CoV-2 spike protein, but covering a large collection of species this way is time-consuming and labor-intensive.

    Methods

    In this paper, we applied state-of-the-art machine learning approaches and created a pipeline reaching >87% accuracy in predicting binding between ACE2 from different species and the SARS-CoV-2 spike.

    Results

    We further validated our prediction pipeline using 2 independent test sets involving >50 bat species and achieved >78% accuracy. A large-scale screening of 204 mammal species revealed 144 species (or 61%) were susceptible to SARS-CoV-2 infections, highlighting the importance of intensive monitoring and studies in mammalian species.

    Discussion

    In short, our study employed machine learning models to create an important tool for predicting potential hosts of SARS-CoV-2 and achieved, to our knowledge, the highest precision in experimental validation. This study also predicted that a wide range of mammals are capable of being infected by SARS-CoV-2.

  • Funding: The Strategic Priority Research Programs of the Chinese Academy of Sciences (XDB29020000), the National Natural Science Foundation of China (32041009) and Key R&D Program of Shandong Province (2020CXGC011305)
  • [1] Wacharapluesadee S, Tan CW, Maneeorn P, Duengkae P, Zhu F, Joyjinda Y, et al. Evidence for SARS-CoV-2 related coronaviruses circulating in bats and pangolins in Southeast Asia. Nat Commun 2021;12(1):972. http://dx.doi.org/10.1038/s41467-021-21240-1
    [2] Kreye J, Reincke SM, Kornau HC, Sánchez-Sendin E, Corman VM, Liu HJ, et al. A therapeutic non-self-reactive SARS-CoV-2 antibody protects from lung pathology in a COVID-19 hamster model. Cell 2020;183(4):1058-69.e19. http://dx.doi.org/10.1016/j.cell.2020.09.049
    [3] U.S. Food & Drug Administration. Coronavirus (COVID-19) update: FDA authorizes monoclonal antibody for treatment of COVID-19. 2020. https://www.fda.gov/news-events/press-announcements/coronavirus-covid-19-update-fda-authorizes-monoclonal-antibody-treatment-covid-19. [2021-8-22].
    [4] Cao LX, Goreshnik I, Coventry B, Case JB, Miller L, Kozodoy L, et al. De novo design of picomolar SARS-CoV-2 miniprotein inhibitors. Science 2020;370(6515):426-31. http://dx.doi.org/10.1126/science.abd9909
    [5] Damas J, Hughes GM, Keough KC, Painter CA, Persky NS, Corbo M, et al. Broad host range of SARS-CoV-2 predicted by comparative and structural analysis of ACE2 in vertebrates. Proc Natl Acad Sci USA 2020;117(36):22311-22. http://dx.doi.org/10.1073/pnas.2010146117
    [6] Yan H, Jiao HW, Liu QY, Zhang Z, Xiong Q, Wang BJ, et al. ACE2 receptor usage reveals variation in susceptibility to SARS-CoV and SARS-CoV-2 infection among bat species. Nat Ecol Evol 2021;5(5):600-8. http://dx.doi.org/10.1038/s41559-021-01407-1
    [7] Huang SJ, Cai NG, Pacheco PP, Narrandes S, Wang Y, Xu W. Applications of support vector machine (SVM) learning in cancer genomics. Cancer Genomics Proteomics 2018;15(1):41-51. http://dx.doi.org/10.21873/cgp.20063
    [8] Liang L, Rasmussen MLH, Piening B, Shen XT, Chen SJ, Röst H, et al. Metabolic dynamics and prediction of gestational age and time to delivery in pregnant women. Cell 2020;181(7):1680-92.e15. http://dx.doi.org/10.1016/j.cell.2020.05.002
    [9] Toth R, Schiffmann H, Hube-Magg C, Büscheck F, Höflmayer D, Weidemann S, et al. Random forest-based modelling to detect biomarkers for prostate cancer progression. Clin Epigenetics 2019;11(1):148. http://dx.doi.org/10.1186/s13148-019-0736-8
    [10] Chan KK, Dorosky D, Sharma P, Abbasi SA, Dye JM, Kranz DM, et al. Engineering human ACE2 to optimize binding to the spike protein of SARS coronavirus 2. Science 2020;369(6508):1261-5. http://dx.doi.org/10.1126/science.abc0870
    [11] Wang QH, Zhang YF, Wu LL, Niu S, Song CL, Zhang ZY, et al. Structural and functional basis of SARS-CoV-2 entry by using human ACE2. Cell 2020;181(4):894-904.e9. http://dx.doi.org/10.1016/j.cell.2020.03.045
    [12] Liu YH, Hu GW, Wang YY, Ren WL, Zhao XM, Ji FS, et al. Functional and genetic analysis of viral receptor ACE2 orthologs reveals a broad potential host range of SARS-CoV-2. Proc Natl Acad Sci USA 2021;118(12):e2025373118. http://dx.doi.org/10.1073/pnas.2025373118
    [13] Wu LL, Chen Q, Liu KF, Wang J, Han PC, Zhang YF, et al. Broad host range of SARS-CoV-2 and the molecular basis for SARS-CoV-2 binding to cat ACE2. Cell Discov 2020;6:68. http://dx.doi.org/10.1038/s41421-020-00210-9
    [14] Liu KF, Pan XQ, Li LJ, Yu F, Zheng AQ, Du P, et al. Binding and molecular basis of the bat coronavirus RaTG13 virus to ACE2 in humans and other species. Cell 2021;184(13):3438-51.e10. http://dx.doi.org/10.1016/j.cell.2021.05.031
  • FIGURE 1.  Overview of methodology and model performance of this study. (A) Schematic representation of the workflow; (B) The distribution of precision from all 408 potential combinations of models/input data; (C) Distribution of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) in our models’ prediction in two experimentally validated datasets; (D) Distribution of different AAs in human (Homo sapiens) and two bat species (P. alecto and P. vampyrus).

    Note: After sequence alignment, information from the chosen sites was transformed into vectors and fed to five different models, from which the optimal combination was chosen as the pipeline and used to predict binding for available ACE2 sequences. After the prediction, we selected some of the sequences for experimental validation. Figure 1B shows that multiple combinations reached high precision on our testing dataset. Figure 1D shows AAs that we presume to influence binding between ACE2 and the viral spike protein as well, based on the observation that the ACE2s of the two bat species differ in binding to the viral spike. Abbreviations: ACE2=angiotensin I converting enzyme 2; DT=decision tree; RF=random forest; GBRT=gradient boosting regression tree; ADA=adaboost; SVM=support vector machine.

    FIGURE 2.  Prediction and validations of ACE2 across species in binding to SARS-CoV-2 spike. (A) The predicted range of species with ACE2 capable of binding to SARS-CoV-2; (B) SPR and flow cytometry validation for multiple species’ ACE2 in binding to SARS-CoV-2 spike; (C) KD in nmol/L of the species shown in (B).

    Note: For families with multiple species, the branch is collapsed and the proportion predicted to bind is shown in Figure 2A. Blue species/families are those predicted not to bind. Abbreviations: ACE2=angiotensin I converting enzyme 2; SARS-CoV-2=severe acute respiratory syndrome coronavirus 2; SPR=surface plasmon resonance; KD=binding affinity.


  • 1. CAS Key Laboratory of Pathogen Microbiology and Immunology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
  • 2. School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
  • 3. State Key Laboratory for Molecular Virology and Genetic Engineering, National Institute for Viral Disease Control and Prevention, Chinese Center for Disease Control and Prevention, Beijing, China
  • 4. Institute of Physical Science and Information, Anhui University, Hefei, Anhui, China
  • 5. State Key Laboratory of Virology, Modern Virology Research Center, College of Life Sciences, Wuhan University, Wuhan, Hubei, China
  • Corresponding authors:

    Jun Wang, junwang@im.ac.cn

    Qihui Wang, wangqihui@im.ac.cn

  • Online Date: November 12 2021
    Issue Date: November 12 2021
    doi: 10.46234/ccdcw2021.235
    • Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused the ongoing pandemic of coronavirus disease (COVID-19), with more than 229 million infections and 4.7 million deaths as of September 23, 2021 (https://covid19.who.int). Despite a large number of investigations into the biology and pathology of SARS-CoV-2, as well as the treatment of COVID-19, the virus and the pandemic still pose a tremendous threat to global health and stability. The natural origin of this virus has gained consensus among scientific communities, but the available evidence still falls short of being conclusive. For instance, bats and pangolins have been proposed as sources, but disputes remain (1), leaving room for misinformation and abuse. Identifying the host species susceptible to SARS-CoV-2, including its source and intermediate species, is still one of the central scientific objectives of COVID-19 research. It will help provide information for monitoring and containing potential viral reservoirs and for preventing recurring zoonoses, as in the case of influenza viruses.

      The entry of SARS-CoV-2 into host cells requires the binding of its spike protein to host angiotensin I converting enzyme 2 (ACE2), a process that has been intensively investigated. Blocking this binding with neutralizing monoclonal antibodies (mAbs) has been demonstrated to effectively prevent viral entry into cells in vitro and in vivo (2), and several mAbs were approved for clinical treatment of COVID-19 patients (3). Short peptides mimicking the structure of the ACE2 region that binds the viral spike protein have also been developed; these bind the receptor binding domain (RBD) of the spike protein with picomolar affinity and are effective in cell assays (4). Besides serving as a target for treatment, the ability of ACE2 from non-human species to bind the SARS-CoV-2 spike indicates the susceptibility of those species to SARS-CoV-2 and, combined with ecological data and evolutionary evidence, might identify key species as probable origins and/or intermediate hosts of SARS-CoV-2.

      Screening the binding between ACE2 from a large collection of species and the SARS-CoV-2 spike protein is thus highly desirable; in practice, however, the cost and time required for experimental verification are major constraints. Alternatively, bioinformatic approaches capable of predicting binding between the two proteins with high precision can help prioritize species of interest and exclude very unlikely species, reducing the cost and time required. Based on sequence similarity in ACE2 across species, Damas et al. (5) proposed a score predicting binding to the SARS-CoV-2 spike; since then, the ACE2 of many species has been tested, and in retrospect it is clear that the approach is limited in its precision. Specifically, ACE2 from all bat species (36 in total in their prediction) was predicted to be “low” or “very low” in binding to the SARS-CoV-2 spike, but later experiments demonstrated that the ACE2 of 20 species (55.56%) could bind to the viral spike (6). Alongside bats, 17 of 29 (58.62%) other mammals whose ACE2 was considered unlikely to bind to the SARS-CoV-2 spike were in fact able to bind as well (Supplementary Table S1). Thus, the currently available bioinformatic approach has an extremely high false negative rate and falls short of precisely predicting binding between the SARS-CoV-2 spike protein and ACE2 across species.

    • We therefore applied machine learning approaches to address the remaining challenges (see Supplementary Materials). Machine learning methods can combine diverse and complex data and automatically learn features for prediction, classification, and regression. In biology, they have been successfully applied to build predictive and classification models using genomic features (7), metabolic markers (8), and many more (9). In our study, we selected five representative machine learning methods to perform classification (i.e., prediction of binding vs. non-binding), namely Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Adaboost (ADA), and Gradient Boosting Regression Tree (GBRT). As single estimators we chose SVM and DT because they are suitable for small training sets. However, single estimators tend to generalize poorly or lack robustness. To reduce this issue, we chose three additional ensemble methods (RF, ADA, and GBRT) for the construction of the prediction model.

      The five models were further equipped with a priori information to establish a combined prediction pipeline. A study on human ACE2 introduced mutations at 117 amino acid (AA) sites individually; at each site the AA was mutated to all possible alternative AAs, and the change in affinity for SARS-CoV-2 (relative to wild-type ACE2) was examined experimentally, providing quantitative reference data (10). Further, studies from Wang et al. (11) and Liu et al. (12) identified subsets of 24 and 20 AAs, respectively, in human ACE2 as important sites for interaction with the SARS-CoV-2 spike protein, which can be used as qualitative information to reduce model complexity and potential over-fitting. Based on reported experimental verifications of the ACE2 protein from 90 species (73 unique species; 27 from Wu et al. (13), 49 from Liu et al. (12), and 14 from our lab, currently being considered for independent publication), we aligned the ACE2 sequences of those species to human ACE2, extracted the AAs at the 117, 24, and 20 sites, and replaced each with its log2 enrichment ratio as the input data format (Figure 1A). We have deposited this pipeline and details of the method at https://github.com/mayuefine/Binding-prediction.
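      The encoding step described above can be sketched roughly as follows. This is a minimal illustration, not the deposited pipeline: the function names, the toy alignment, the site list, and the enrichment values are all hypothetical, and treating an unlisted (site, AA) pair as 0.0 is an assumption made only for this sketch.

```python
from typing import Dict, List, Tuple


def human_site_to_column(aligned_human: str) -> Dict[int, int]:
    """Map 1-based human residue numbers to 0-based alignment columns."""
    mapping: Dict[int, int] = {}
    residue_number = 0
    for column, aa in enumerate(aligned_human):
        if aa != "-":  # gaps in the human row do not consume a residue number
            residue_number += 1
            mapping[residue_number] = column
    return mapping


def encode_species(
    aligned_human: str,
    aligned_species: str,
    sites: List[int],
    log2_enrichment: Dict[Tuple[int, str], float],
    missing: float = 0.0,  # assumption: unknown (site, AA) pairs get 0.0
) -> List[float]:
    """Replace the AA found at each chosen human site with its log2 ratio."""
    columns = human_site_to_column(aligned_human)
    vector = []
    for site in sites:
        aa = aligned_species[columns[site]]
        vector.append(log2_enrichment.get((site, aa), missing))
    return vector


# Hypothetical toy alignment: the human row carries one gap column.
aligned_human = "MS-SWL"
aligned_species = "MTASWF"
sites = [2, 5]  # human residue numbers chosen for illustration only
table = {(2, "S"): 0.0, (2, "T"): -1.5, (5, "L"): 0.0, (5, "F"): 0.8}

features = encode_species(aligned_human, aligned_species, sites, table)
print(features)
```

      Indexing sites by human residue number, rather than by alignment column, keeps the site lists stable when a different alignment introduces gaps.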

    • The training and test sets contained 62 and 11 species, respectively, and the test set was set aside from the training process. To screen for models with stable performance, we trained five models on three groups of site information (group 20, group 24, and group 117, each group containing 5 machine learning approaches). Finally, the predictions of the three groups were combined, and a combination of six models with the highest precision was chosen as our prediction pipeline out of a total of 408 combinations; this pipeline reached an in silico precision of circa 87.5% (Figure 1B) and was used for subsequent analysis. We used this pipeline to generate a prediction score for each ACE2 sequence, equal to the number of models predicting that it bound to the viral spike divided by the total number of models.
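      The prediction score defined above is a simple vote fraction. The sketch below assumes each model returns a boolean (True for "binds"); the function names and the example votes are hypothetical, and the 0.5 cutoff is the one applied in the validation step of this study.

```python
from typing import Sequence


def prediction_score(votes: Sequence[bool]) -> float:
    """Fraction of models that predict binding for one ACE2 sequence."""
    if not votes:
        raise ValueError("at least one model vote is required")
    return sum(votes) / len(votes)


def predicted_to_bind(votes: Sequence[bool], threshold: float = 0.5) -> bool:
    """An ACE2 is considered likely to bind when its score exceeds the cutoff."""
    return prediction_score(votes) > threshold


# Hypothetical votes from a six-model pipeline for a single species.
votes = [True, True, True, True, False, False]
score = prediction_score(votes)
print(round(score, 3), predicted_to_bind(votes))
```

      Because the comparison is strict, a three-to-three split among six models (score of exactly 0.5) would not count as predicted binding.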

      Bat species of the order Chiroptera were of highest interest for tracing the origin and studying the host range of SARS-CoV-2, as bats harbor multiple coronavirus species, including the SARS virus. One of the strains most closely related to SARS-CoV-2, RaTG13, was found in horseshoe bats (Rhinolophus affinis) (14). Thus, we applied our pipeline to all bat species with ACE2 sequences available (59 in total) and predicted their ability to bind the SARS-CoV-2 spike protein. We then tested the precision of our prediction in two experimentally validated datasets, in which ACE2s with a prediction score >0.5 were considered likely to bind to the viral spike. We selected the ACE2 of 12 bats and expressed the proteins, then tested their ability to bind the viral spike with Surface Plasmon Resonance (SPR) and flow cytometry (Supplementary Table S2). Overall, 4 of the 6 ACE2s predicted to bind to the SARS-CoV-2 spike were validated to bind (Figure 2B and Supplementary Figure S1), and 5 of the 6 ACE2s predicted not to bind were confirmed not to bind. Here we achieved a precision of 80% (Figure 1C). Then, using another dataset of 46 bat species by Yan et al. (6), after excluding the 2 sequences contained in our training set, we predicted the binding capacity and achieved 78.26% precision, as shown in Figure 1C. Thus, our unified pipeline, which incorporates multiple machine learning models and different site sets as input, can confidently predict binding between bat ACE2s and viral spikes.
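      As a worked check of the first validation set, the counts stated above (6 predicted binders of which 4 bound, 6 predicted non-binders of which 5 did not bind) can be placed in a confusion matrix. The text does not spell out which ratio it reports as "precision", so this sketch, with a hypothetical helper name, simply prints several common metrics side by side.

```python
from typing import Dict


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Common summary ratios derived from a 2x2 confusion matrix."""
    total = tp + fp + tn + fn
    return {
        "tp_over_predicted_positive": tp / (tp + fp),
        "tp_over_actual_positive": tp / (tp + fn),
        "accuracy": (tp + tn) / total,
    }


# Counts read from the 12-bat validation described above:
# 4 true positives, 2 false positives, 5 true negatives, 1 false negative.
metrics = confusion_metrics(tp=4, fp=2, tn=5, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

      With these counts, TP/(TP+FN) evaluates to 80%, TP/(TP+FP) to about 66.67%, and overall accuracy to 75%.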

      It also drew our attention that, during our validation, the ACE2 sequences from Pteropus alecto and Pteropus vampyrus had identical AAs at all 117 sites we selected for input; however, P. alecto ACE2 could bind to the SARS-CoV-2 spike in our experimental system, whereas P. vampyrus ACE2 had no detectable binding, suggesting that additional AAs affected the binding capacity. We compared the ACE2 sequences of these 2 species and identified 22 sites of difference in total. Of these sites, 16 are identical to human ACE2 (12 for P. alecto and 4 for P. vampyrus) (Figure 1D and Figure 2C). This comparison indicates that one or more of the AAs that differ among P. alecto, P. vampyrus, and humans underlie the differences in binding to the viral spike protein but have not been discovered in available studies. Closer investigation revealed that this set of AAs was not involved in binding to the viral spike protein; their influence was therefore indirect and likely acted on the ACE2 protein structurally or through post-translational modifications, including glycosylation.
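      The pairwise comparison described above can be sketched as follows, assuming the three sequences are already aligned to equal length. The function name and the short toy sequences are hypothetical and do not represent real ACE2 data.

```python
from typing import List, Tuple


def differing_sites(
    seq_a: str, seq_b: str, reference: str
) -> List[Tuple[int, str, str, str, str]]:
    """List alignment columns where seq_a and seq_b differ.

    Each row records (column, AA in a, AA in b, AA in reference, which of
    the two sequences matches the reference: "a", "b", or "neither").
    """
    if not (len(seq_a) == len(seq_b) == len(reference)):
        raise ValueError("sequences must be aligned to the same length")
    rows = []
    for column, (a, b, ref) in enumerate(zip(seq_a, seq_b, reference)):
        if a != b:
            if a == ref:
                matches = "a"
            elif b == ref:
                matches = "b"
            else:
                matches = "neither"
            rows.append((column, a, b, ref, matches))
    return rows


# Hypothetical aligned fragments, for illustration only.
human = "QAKTFLDKF"
bat_a = "QAKTFLEKF"
bat_b = "QSKTYLNKF"

rows = differing_sites(bat_a, bat_b, human)
for row in rows:
    print(row)
print("sites where bat_a matches human:", sum(r[4] == "a" for r in rows))
```

      Tallying the last field by value gives the kind of split reported in the text (sites where one species, the other, or neither agrees with human ACE2).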

      Finally, we refined our models by incorporating the modified list of AAs as input and performed predictions on available ACE2 sequences from mammalian species (Supplementary Table S3, 204 in total and belonging to 69 families). This identified ACE2s of interest (likely to bind to the SARS-CoV-2 spike) in a total of 144 species, spread across 47 families (60.87%, Figure 2A). It is worth noting that this wide range of potential mammalian hosts agrees with the emerging evidence of SARS-CoV-2 presence across mammals. Aside from 5 species of Hominidae (primates), ACE2s were predicted to bind to the viral spike protein in: 13 species of Cercopithecidae (old world monkeys), 8 species of Pteropodidae (old world fruit bats), 7 species of Felidae (cats), 7 species of Bovidae (ruminants), 7 species of Mustelidae (containing minks), 6 species of Canidae (dogs), 3 species of Equidae (horses), 6 species of Cricetidae (muroid rodents), 4 species of Sciuridae (squirrels), and 3 species of Ursidae (bears). Even in all 3 families of marine mammals, the ACE2s had a high likelihood of binding to the SARS-CoV-2 spike (in all 4 species of Phocidae, 4 of Delphinidae, and 3 of Otariidae, Figure 2B). Our prediction was supported by emerging reports in 2021 that white-tailed deer (family Cervidae) were positive for antibodies against SARS-CoV-2, in addition to reports of dogs, cats, and minks being viable hosts for this virus. In summary, based on ACE2 sequence features, our study suggests that SARS-CoV-2 has an extremely large range of potential hosts and indicates the importance of investigating wild animals for the presence of the virus and monitoring its spread.

    • In conclusion, our study employed machine learning models suitable for analyzing sequence data, incorporated established functional data with multiple features extracted from sequences, and achieved high precision in predicting binding between ACE2s from different species and the spike protein of SARS-CoV-2. The precision within the test dataset was 87.5%, and in a total of 44 bat species, the group of mammals that has attracted the most concern, we achieved >78% precision as well, indicating that the model can be further expanded to predict the susceptibility of more bat species once genomic sequences or ACE2 sequences become available (Supplementary Table S4). With the same approach we also screened the available ACE2 sequences across a large range of mammals and found that a large range of mammals requires attention. Our pipeline is capable of determining species of interest for tracing and analysis to understand the potential origin and transmission routes of SARS-CoV-2.

      The performance of our pipeline can be improved further as more accurate machine learning models and/or more a priori information emerge. First, limited by the number of experimentally validated datasets and by the current understanding of ACE2-spike interactions, we had to restrict the AAs in the ACE2 sequences used for training and prediction; our results already indicate that AAs in other parts of the sequence carry critical information that is currently unavailable, as in the case of P. alecto and P. vampyrus. In addition, a growing concern amid the COVID-19 pandemic is the fast-emerging variants of SARS-CoV-2, especially since mutations in ACE2-interacting AAs of the spike protein have already been shown to change binding affinity to human ACE2; whether they lead to host range changes and even broader transmission remains to be investigated.

      In summary, our approach has the potential, and will need, to be expanded to analyze the binding abilities of different SARS-CoV-2 variants and ACE2s in order to forecast the potential spread of this virus and identify priority species for monitoring.
