Methods and Applications: An Open-Source Data Driven Hybrid Modeling System for Infectious Disease Surveillance and Early Warning

  • Abstract

    Introduction

    The increasing trend of globalization has led to a heightened risk of imported epidemics; however, existing surveillance systems remain fragmented and reliant on laboratory confirmation. We developed an open-source data-driven hybrid modeling system to provide earlier and more reliable alerts, designed to complement China’s multipoint trigger early-warning framework.

    Methods

    This system integrates heterogeneous signals, including official epidemiology, digital traces, mobility, meteorology, and pathogen genomics, using semantic harmonization and a hybrid analytic stack. Seasonality-adjusted baselines with anomaly detection, mobility- and climate-aware SEIR models, and short-horizon learners generated calibrated early-warning scores. Thresholds were constrained by positive predictive value. Pilot studies were conducted for coronavirus disease 2019 (COVID-19) in Yantai and severe fever with thrombocytopenia syndrome virus (SFTSV) in Shandong and Henan, with tuberculosis indicators embedded for programmatic use.

    Results

    Across deployments, the system achieved 83.3% sensitivity and 76.9% positive predictive value, providing a median lead time of 9.3 days before official confirmation. Forecasting accuracy reached 92.1% for COVID-19 in Yantai, 90.3% for SFTSV in Shandong, and 89.8% for SFTSV in Henan. Early warnings were aligned with subsequent confirmations and supported targeted screening and resource allocation.

    Conclusion

    An open-source data-driven hybrid modeling system can deliver calibrated and timely alerts across diverse pathogens. By broadening inputs, enabling cross-agency linkage, and offering operator-oriented dashboards, it serves as a practical complement to China’s national early-warning system and has the potential for scaling out with One Health inputs.

  • Conflict of Interest: No conflicts of interest.
  • Funding: Supported by the National Natural Science Foundation of China (72274210, 72174004)
  • [1] Yu XJ, Liang MF, Zhang SY, Liu Y, Li JD, Sun YL, et al. Fever with thrombocytopenia associated with a novel bunyavirus in China. N Engl J Med 2011;364(16):1523–32. https://doi.org/10.1056/NEJMoa1010095.
    [2] Cui HL, Shen SJ, Chen L, Fan ZY, Wen Q, Xing YW, et al. Global epidemiology of severe fever with thrombocytopenia syndrome virus in human and animals: a systematic review and meta-analysis. Lancet Reg Health West Pac 2024;48:101133. https://doi.org/10.1016/j.lanwpc.2024.101133.
    [3] Miao D, Dai K, Zhao GP, Li XL, Shi WQ, Zhang JS, et al. Mapping the global potential transmission hotspots for severe fever with thrombocytopenia syndrome by machine learning methods. Emerg Microbes Infect 2020;9(1):817–26. https://doi.org/10.1080/22221751.2020.1748521.
    [4] World Health Organization (WHO). Global tuberculosis report 2023. Geneva: WHO; 2023. https://www.who.int/publications/i/item/9789240083851.
    [5] Li T, Zhang B, Du X, Pei SJ, Jia ZW, Zhao YL. Recurrent pulmonary tuberculosis in China, 2005-2021. JAMA Netw Open 2024;7(8):e2427266. https://doi.org/10.1001/jamanetworkopen.2024.27266.
    [6] Sun HM, Hu WH, Wei YY, Hao YT. Drawing on the development experiences of infectious disease surveillance systems around the world. China CDC Wkly 2024;6(41):1065–74. https://doi.org/10.46234/ccdcw2024.220.
    [7] Ren X, Wang LP, Cowling BJ, Zeng LJ, Geng MJ, Wu P, et al. Systematic review: national notifiable infectious disease surveillance system in China. Online J Public Health Inform 2019;11(1):e62534. https://doi.org/10.5210/ojphi.v11i1.9897.
    [8] National Bureau of Disease Control and Prevention, National Health Commission of the People’s Republic of China, National Development and Reform Commission, Ministry of Education of the People’s Republic of China, Ministry of Civil Affairs of the People’s Republic of China, Ministry of Finance of the People’s Republic of China, et al. Guiding opinions on the establishment of a sound and intelligent multi-point trigger infectious disease surveillance and early warning system. 2024. https://www.gov.cn/zhengce/zhengceku/202408/content_6971481.htm. (In Chinese).
    [9] Yang WZ, Lan YJ, Lyu W, Leng ZW, Feng LZ, Lai SJ, et al. Establishment of multi-point trigger and multi-channel surveillance mechanism for intelligent early warning of infectious diseases in China. Chin J Epidemiol 2020;41(11):1753–57. https://doi.org/10.3760/cma.j.cn112338-20200722-00972.
    [10] Liu QQ, Li JH, Liu SY, Tang L, Wang XQ, Huang AD, et al. The Epidemiological Characteristics and Spatiotemporal Clustering of Measles—China, 2005-2022. China CDC Wkly 2024;6(27):665 − 9. https://doi.org/10.46234/ccdcw2024.123.
    [11] Levin-Rector A, Kulldorff M, Peterson ER, Hostovich S, Greene SK. Prospective spatiotemporal cluster detection using SaTScan: tutorial for designing and fine-tuning a system to detect reportable communicable disease outbreaks. JMIR Public Health Surveill 2024;10:e50653. https://doi.org/10.2196/50653.
    [12] Lutz CS, Huynh MP, Schroeder M, Anyatonwu S, Dahlgren FS, Danyluk G, et al. Applying infectious disease forecasting to public health: a path forward using influenza forecasting examples. BMC Public Health 2019;19(1):1659. https://doi.org/10.1186/s12889-019-7966-8.
    [13] Reich NG, Brooks LC, Fox SJ, Kandula S, McGowan CJ, Moore E, et al. A collaborative multiyear, multimodel assessment of seasonal influenza forecasting in the United States. Proc Natl Acad Sci USA 2019;116(8):3146–54. https://doi.org/10.1073/pnas.1812594116.
    [14] Vorisek CN, Lehne M, Klopfenstein SAI, Mayer PJ, Bartschke A, Haese T, et al. Fast Healthcare Interoperability Resources (FHIR) for interoperability in health research: systematic review. JMIR Med Inform 2022;10(7):e35724. https://doi.org/10.2196/35724.
    [15] Rabiei R, Bastani P, Ahmadi H, Dehghan S, Almasi S. Developing public health surveillance dashboards: a scoping review on the design principles. BMC Public Health 2024;24(1):392. https://doi.org/10.1186/s12889-024-17841-2.
  • FIGURE 1.  Workflow from data ingestion to alert dissemination.

    FIGURE 2.  Model performance across pilots. (A) Monthly SFTSV cases in Henan (2009–2014); (B) Monthly SFTSV cases in Shandong (2011–2015). (C) Posterior distributions of COVID-19 predicted outcome intervals; (D) Beta-distributed probability calibration for COVID-19 with 95% confidence intervals; (E) System alerts and official confirmations for true-positive events; (F) Aggregate detection metrics including sensitivity and positive predictive value.

    Abbreviation: PPV=positive predictive value; SFTSV=severe fever with thrombocytopenia syndrome virus; CI=confidence interval; TP=true positive; FN=false negative; FP=false positive; COVID-19=coronavirus disease 2019.

    FIGURE 3.  Dashboard (TB). (A) Incidence Trends and Projections; (B) Posterior Parameters and Model Application

    Note: The system showed high stability in robustness checks under parameter perturbations.

    Abbreviation: TB=tuberculosis.

    TABLE 1.  System- and site-level performance summary.

    | Setting | Pathogen | Outcome granularity | Detection sensitivity, % | PPV, % | Median lead time, days | Forecast accuracy, % (95% CI) | Peak timing accuracy, % (95% CI) | Peak magnitude accuracy, % (95% CI) | $R^{2}_{\text{total incidence}}$ | $R^{2}_{\text{RR-TB incidence}}$ | $R^{2}_{\text{TB deaths}}$ |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | Overall (all pilots) | Mixed | Event-level alerts | 83.3 | 76.9 | 9.3 | − | − | − | − | − | − |
    | Yantai, Shandong | COVID-19 | Community time series | − | − | − | 92.15 (86.99, 93.96) | 88.43 (88.26, 88.59) | 91.16 (91.04, 91.30) | − | − | − |
    | Shandong | SFTSV | Monthly incidence | − | − | − | 90.29 (85.79, 93.84) | − | − | − | − | − |
    | Henan | SFTSV | Monthly incidence | − | − | − | 89.81 (86.24, 93.08) | − | − | − | − | − |
    | China (National) | TB | National annual incidence | − | − | − | − | − | − | 0.95 | 0.99 | 0.82 |
    Note: “−” means no data. The indicators of TB are presented in the form of proportions.
    Abbreviation: COVID-19=coronavirus disease 2019; SFTSV=severe fever with thrombocytopenia syndrome virus; CI=confidence interval; TB=tuberculosis.

  • 1. Department of Global Health, School of Public Health, Peking University, Beijing, China
  • 2. Beijing Municipal Health Big Data and Policy Research Center, Beijing, China
  • 3. Division of Surveillance, Early Warning and Emergency Response, Heilongjiang Provincial Center for Disease Control and Prevention, Harbin City, Heilongjiang Province, China
  • Corresponding author:

    Zhongwei Jia, jiazw@bjmu.edu.cn

  • Online Date: February 20 2026
    Issue Date: February 20 2026
    doi: 10.46234/ccdcw2026.036
  • Globalization and increased human mobility have raised the risk of infectious disease spread. International tourist arrivals and global traffic have roughly doubled since 2000 (1). During the coronavirus disease 2019 (COVID-19) pandemic, imported cases repeatedly seeded local outbreaks in China, while the expanding distribution of severe fever with thrombocytopenia syndrome across East Asia illustrates the cross-border spread of vector-borne diseases (2–3). Tuberculosis (TB) remains a persistent global threat, with 10.6 million new cases and 1.3 million deaths in 2022, and rebounds in China underscore the need for improved prevention along travel corridors (4–5).

    China has developed a nationwide surveillance backbone, including the National Notifiable Infectious Disease Reporting System (NIDRIS) and the China Infectious Disease Automated-alert and Response System (CIDARS), which provide direct case reporting and rule-based signal generation for statutory notifiable diseases (6–7). More recently, national guidance has emphasized a multi-point trigger early-warning architecture aimed at integrating multiple data sources, enhancing interoperability, and supporting multi-agency collaboration (8–9). However, most current pilot studies and applications rely primarily on report-based analytics, such as space-time scan statistics, which identify spatiotemporal clusters but remain constrained by delayed confirmation, limited data inputs, and weak predictive power (10–11). These limitations reduce actionable lead time and restrict applicability to pathogens with long incubation periods or non-specific clinical presentations.

    Epidemic intelligence research has explored statistical, mechanistic, and machine learning approaches separately; however, few studies combine them in hybrid frameworks that balance interpretability and accuracy (12–13). Existing studies also often lack interoperability standards and operator-facing dashboards, limiting their scalability and usability in real-world decision-making environments.

    To address these gaps, we introduce an open-source data-driven hybrid modeling system designed to complement China’s national multi-point trigger early-warning architecture. The system integrates heterogeneous open and partner-shared signals (epidemiological reports, digital traces, mobility, meteorology, and pathogen genomics) through semantic harmonization and hybrid analytics, including seasonality-adjusted baselines, anomaly detection, mobility- and climate-aware SEIR models, and short-horizon sequence learners. Interoperable HL7 FHIR-aligned data contracts enable scalable integration with health, customs, and laboratory systems (14), while operator-oriented dashboards follow established design principles for interpretability and oversight (15). We present the system’s architecture and pilot evidence for COVID-19 and severe fever with thrombocytopenia syndrome virus (SFTSV) and show how the same framework embeds TB indicators for programmatic use, bridging open-source data intelligence with the national early-warning workflow.

    • We selected three pathogen targets to test the system’s One Health versatility across distinct transmission modes and timescales: COVID-19 (acute respiratory disease requiring rapid community forecasting), SFTSV (vector-borne disease requiring ecological integration), and tuberculosis (chronic disease requiring long-term strategic planning). The pilot sites were chosen based on disease burden and data feasibility. For SFTSV, Shandong and Henan provinces were selected as high-endemicity regions in China, providing sufficient case volume to validate vector-driven models. For COVID-19, Yantai was selected as a representative coastal port city that experienced distinct waves of local transmission triggered by importation; this setting offered the clear onset-to-suppression dynamics needed to validate the community forecasting model’s sensitivity to intervention measures. The system continuously ingests heterogeneous data, performs semantic harmonization, runs hybrid analytics (statistical baselines, mechanistic models, and deep-sequence learners), and emits a calibrated early warning score (EWS) for operations (Figure 1). Personally identifiable information was not collected or processed.
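
The statistical-baseline step of this workflow can be illustrated with a minimal sketch. It is not the system's implementation: the function name, the same-phase z-score rule, the minimum of three prior seasons, and the standard-deviation floor of 1.0 are all assumptions made here for illustration.

```python
from statistics import mean, stdev


def seasonal_anomaly_scores(counts, period=12, min_history=3, sd_floor=1.0):
    """Score each observation against earlier values at the same seasonal phase.

    counts      : case counts at a regular cadence (e.g., monthly)
    period      : observations per seasonal cycle (12 for monthly data)
    min_history : prior cycles required before a score is emitted (assumed)
    sd_floor    : lower bound on the spread, to avoid dividing by ~0 (assumed)
    Returns a list with None where history is insufficient, else a z-score.
    """
    scores = []
    for t, value in enumerate(counts):
        # Values observed exactly 1, 2, ... cycles earlier at the same phase.
        history = counts[t % period:t:period]
        if len(history) < min_history:
            scores.append(None)
            continue
        baseline = mean(history)
        spread = max(stdev(history), sd_floor)
        scores.append((value - baseline) / spread)
    return scores
```

An operator-facing rule could then flag months whose score exceeds a chosen cut-off (for example, 3), with the cut-off itself tuned against the PPV constraint rather than fixed in advance.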

    • In this study, the term “open-source data” refers to open-source intelligence (OSINT) and publicly available datasets that are accessible without proprietary restrictions. These include official bulletins, digital signals, meteorological records, and anonymized mobility data, distinct from internal hospital records or confidential line-list data.

      Official epidemiology relies on national and provincial bulletins and WHO/ECDC situation updates. Digital epidemiology integrates multichannel digital traces, such as search engine queries (Baidu Index, Wikipedia Pageviews), social media discussions (Weibo), and content from aggregators (Douyin/TikTok China, Toutiao), all carrying geotags and timestamps. Genomics provides sequence metadata for pathogen context. Context and covariates encompass human mobility, meteorology, and holiday markers. For the SFTSV vector signal, national tick index surveillance data are digitized from official graphs using scale-mean abstraction to create daily or weekly exogenous drivers, with monthly series derived through calendar aggregation.

    • In this study, we developed three distinct model components integrated through a hybrid framework, with detailed methodologies provided in the supplementary materials. The SFTSV model utilizes a network transmission approach where the vector driver is approximated by a Fourier series fitted to the 2018–2019 national tick index, assuming stationary seasonal phenology. For the COVID-19 model, an agent-based SEIR model was implemented on a dynamic contact graph; biological parameters were fixed to literature values to ensure identifiability, focusing calibration solely on the effective contact probability. Rifampicin-resistant tuberculosis (RR-TB) incidence was estimated following the WHO-recommended mathematical procedure (Supplementary Table S1). Finally, these outputs were integrated via hybrid fusion, employing logistic stacking as a meta-learner to weight mechanistic and deep learning signals based on their historical performance.
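
To make the idea of a seasonally driven mechanistic component concrete, the sketch below couples a first-order Fourier driver to a deterministic compartmental SEIR model. This is a deliberate simplification: the study describes a network model for SFTSV and an agent-based model for COVID-19, whereas this example is a homogeneous-mixing model, and the coefficients `a0`, `a1`, `b1` and all rate values are illustrative placeholders, not fitted values from the paper.

```python
import math


def seasonal_driver(t, period=365.0, a0=1.0, a1=0.5, b1=0.0):
    """First-order Fourier series, floored at zero so it can scale a rate."""
    w = 2.0 * math.pi * t / period
    return max(0.0, a0 + a1 * math.cos(w) + b1 * math.sin(w))


def simulate_seir(days, n, e0, beta0, sigma, gamma, dt=0.1):
    """Euler-integrated SEIR with a seasonally modulated transmission rate.

    beta0 : baseline transmission rate, multiplied by seasonal_driver(t)
    sigma : rate of progression from exposed to infectious (1 / latent period)
    gamma : recovery rate (1 / infectious period)
    Returns (daily new infectious onsets, final (S, E, I, R) state).
    """
    s, e, i, r = float(n - e0), float(e0), 0.0, 0.0
    steps_per_day = round(1.0 / dt)
    daily_onsets = []
    for day in range(days):
        onsets_today = 0.0
        for k in range(steps_per_day):
            beta = beta0 * seasonal_driver(day + k * dt)
            infections = beta * s * i / n * dt
            onsets = sigma * e * dt
            recoveries = gamma * i * dt
            s -= infections
            e += infections - onsets
            i += onsets - recoveries
            r += recoveries
            onsets_today += onsets
        daily_onsets.append(onsets_today)
    return daily_onsets, (s, e, i, r)


# Illustrative run with placeholder parameters.
incidence, final_state = simulate_seir(
    days=200, n=100_000, e0=10, beta0=0.4, sigma=0.2, gamma=1 / 7
)
```

In a calibration setting analogous to the one described above, `sigma` and `gamma` would be held at literature values and only the contact-related term (`beta0` here) would be fitted.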

    • Across pilots, the system operated at a pre-specified threshold tuned for decision utility (PPV constraint ≥0.70). Against officially confirmed events, the system achieved 83.30% sensitivity and 76.90% positive predictive value (PPV), with a median lead time of 9.30 days before first confirmation. Alerts and confirmatory timelines are illustrated in the dashboard traces (Figure 2E–F); adjudication logs indicate that most false positives arose from short sub-threshold anomalies that did not consolidate into confirmed events (Table 1).
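
The sketch below shows one way a PPV-constrained operating threshold and the three headline metrics could be computed from alert scores and confirmation dates. The matching rule (an alert counts as a true positive if a confirmation follows within `window` days), the window length, and the toy data are assumptions for illustration; the study's adjudication procedure is not specified at this level of detail.

```python
from statistics import median


def evaluate(alerts, events, threshold, window=14):
    """Return (sensitivity, PPV, median lead time) at a given score threshold.

    alerts : list of (day, score) pairs emitted by the system
    events : list of official confirmation days
    window : an alert matches an event confirmed 0..window days later (assumed)
    """
    fired = [day for day, score in alerts if score >= threshold]
    true_alerts = [d for d in fired if any(0 <= ev - d <= window for ev in events)]
    lead_by_event = {}
    for ev in events:
        leads = [ev - d for d in fired if 0 <= ev - d <= window]
        if leads:
            lead_by_event[ev] = max(leads)  # earliest matching alert
    ppv = len(true_alerts) / len(fired) if fired else None
    sensitivity = len(lead_by_event) / len(events) if events else None
    lead = median(lead_by_event.values()) if lead_by_event else None
    return sensitivity, ppv, lead


def pick_threshold(alerts, events, min_ppv=0.70, window=14):
    """Lowest threshold meeting the PPV floor with the highest sensitivity."""
    best = None
    for thr in sorted({score for _, score in alerts}):
        sens, ppv, lead = evaluate(alerts, events, thr, window)
        if ppv is not None and ppv >= min_ppv:
            if best is None or sens > best[1]:
                best = (thr, sens, ppv, lead)
    return best


# Toy data: (day, early-warning score) and confirmation days.
alerts = [(3, 0.9), (10, 0.4), (20, 0.8), (35, 0.3), (40, 0.35),
          (45, 0.25), (62, 0.75), (90, 0.2), (95, 0.3)]
events = [12, 28, 70]
best = pick_threshold(alerts, events, min_ppv=0.70, window=14)
```

On this toy data the lower thresholds are rejected because weak alerts with no subsequent confirmation pull PPV below 0.70, which mirrors the observation that most false positives arose from short sub-threshold anomalies.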

    • For SFTSV monthly incidence forecasting in Shandong and Henan, the model’s predictions closely tracked observed trends in both provinces, as illustrated in Figure 2A–B. Using the pre-specified accuracy metric with bootstrap 95% confidence intervals (CIs), Shandong (2013–2015) achieved 90.29% accuracy (95% CI: 85.79%, 93.84%), and Henan (2009–2014; including Xinyang) achieved 89.81% (95% CI: 86.24%, 93.08%).

      Peak months and troughs aligned with the seasonality captured by the mechanistic (tick- and human-driven) transmission terms, and the model reproduced the interannual amplitude differences without overfitting (Figure 2A–B).
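
A percentile bootstrap for a forecast-accuracy metric can be sketched as follows. The accuracy definition used here (100 × (1 − total absolute error / total observed cases)) is an assumption, since the pre-specified metric is defined in the supplementary materials rather than in this text; the resampling unit (individual months) and the number of replicates are likewise illustrative.

```python
import random


def accuracy(observed, predicted):
    """Assumed metric: 100 * (1 - sum|pred - obs| / sum(obs))."""
    total = sum(observed)
    if total == 0:
        raise ValueError("accuracy is undefined when observed cases sum to zero")
    abs_err = sum(abs(p - o) for o, p in zip(observed, predicted))
    return 100.0 * (1.0 - abs_err / total)


def bootstrap_ci(observed, predicted, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile bootstrap CI, resampling time points."""
    rng = random.Random(seed)
    n = len(observed)
    replicates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        replicates.append(
            accuracy([observed[j] for j in idx], [predicted[j] for j in idx])
        )
    replicates.sort()
    lower = replicates[int((alpha / 2) * n_boot)]
    upper = replicates[int((1 - alpha / 2) * n_boot) - 1]
    return accuracy(observed, predicted), lower, upper
```

Resampling individual months ignores autocorrelation in the series; a block bootstrap would be a natural alternative if that dependence matters for the interval width.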

      In the COVID-19 community forecasting conducted for Yantai, community-scale forecasts achieved 92.15% accuracy (95% CI: 86.99%, 93.96%) under the same definition. In peak-focused validation with 10,000 simulations (Poisson-drawn initial seeds within the 95% interval), the model achieved a peak timing accuracy of 88.43% (95% CI: 88.26%, 88.59%) and a peak magnitude accuracy of 91.16% (95% CI: 91.04%, 91.30%). The forecast trajectories and observed counts are shown in Figure 2C–D.

      At the PPV-constrained threshold, the overall median lead time was 9.3 days. Most detected events had ≥7 days’ advance notice; short-lead alerts (<7 days) clustered in late-season periods with compressed confirmation cycles (timeline examples in Figure 2E).

      Our TB model, adapted from the recurrent framework of Li et al. (5), closely reproduced historical trends ($R^{2}$=0.95 for total incidence, 0.99 for RR-TB incidence, and 0.82 for TB deaths), with a posterior mean force of infection of 2.35 per year (95% CI: 1.16, 3.58). Projections to 2030 indicated an incidence rate of 33.7 per 100,000 (95% CI: 30.80, 38.30), below Li’s estimate of 44.9 but above the End TB target of 13, suggesting that China’s 2024–2030 goal (43 per 100,000) is attainable. The model was implemented as an interactive Shiny application to support visualization and policy use.
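
For readers reproducing the goodness-of-fit figures, the sketch below computes the standard coefficient of determination between an observed and a fitted series. It assumes that the reported values use this conventional definition, which the text does not state explicitly.

```python
def r_squared(observed, fitted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined for a constant observed series")
    return 1.0 - ss_res / ss_tot
```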

    • This study demonstrates that the system can combine diverse open signals with hybrid models to produce calibrated early-warning scores constrained by positive predictive value, reducing false alerts while preserving sensitivity. We tested three pathogen contexts (COVID-19, SFTSV, and TB) and observed practical utility in both acute and chronic use cases.

      Semantic harmonization organized multi-source evidence into consistent geotemporal units, reducing ambiguity in sparse or fast-moving events. Hybrid modeling integrated statistical baselines, mobility- and climate-aware SEIR models (including a human-tick-human pathway for SFTSV), and short-horizon learners to preserve epidemiologic interpretability while capturing nonlinearity. PPV-constrained probability calibration translated model outputs into actionable alerts, improving resource allocation and limiting alert fatigue. Together, these choices enabled earlier, more precise alerts that aligned well with observed trends without overfitting to site-specific conditions.

    • While frameworks such as EWARS support outbreak management (8–9), their reliance on statutory reports limits their timeliness (6). Previous studies have often traded interpretability (statistical baselines) for short-term accuracy (machine learning), frequently lacking multisource integration. Our system advances the field by 1) hybridizing statistical, mechanistic, and sequence-based learners to balance interpretability with adaptability; 2) integrating open signals beyond statutory notifications; and 3) achieving high predictive accuracy and meaningful lead times relative to uncalibrated systems.

      Qualitatively, the hybrid framework offers distinct advantages over the single-method baselines. Mechanistic SEIR models capture long-term seasonal trends but lag during stochastic onsets, whereas deep-sequence learners offer high sensitivity but lack epidemiological transparency. By fusing these approaches, our system stabilizes forecasts during peaks while improving sensitivity during early onsets. Quantitatively, the system demonstrated a median lead time of 9.3 days relative to official confirmation. Given the inherent reporting lags in traditional passive surveillance (7), this represents a substantial window for pre-emptive intervention.
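
The fusion step can be illustrated with a small logistic stacking sketch, in which a meta-learner is trained on the historical outputs of the base models and learns how much weight each deserves. Everything here is hypothetical: the two-feature layout (mechanistic probability, sequence-learner probability), the plain gradient-descent fit, and the learning-rate and epoch settings are assumptions, not the system's training procedure.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_stacker(features, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression weights over base-model outputs.

    features : list of tuples, one value per base model (e.g., probabilities)
    labels   : 1 if an event was later confirmed, else 0
    Returns (weights, bias) learned by batch gradient descent on log-loss.
    """
    k = len(features[0])
    n = len(labels)
    weights = [0.0] * k
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * k
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(bias + sum(w * xj for w, xj in zip(weights, x)))
            err = p - y
            for j in range(k):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def early_warning_score(weights, bias, x):
    """Fused probability for one time point given base-model outputs x."""
    return sigmoid(bias + sum(w * xj for w, xj in zip(weights, x)))
```

Because the meta-learner is a plain logistic model, its weights remain directly inspectable, which is one reason stacking of this kind is compatible with the interpretability goal described above.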

    • A modular design allows flexible application across pathogen contexts. The same framework generates outbreak alerts for COVID-19 and SFTSV while embedding TB analytics to strengthen screening and continuity of care. This adaptability enables emergency response and long-term control through a unified operational surface.

      Operational utility differs across pathogen types. For acute outbreaks, such as COVID-19 and SFTSV, the system functions as a tactical early-warning tool, issuing short-horizon alerts (lead time <14 days) to trigger immediate containment measures such as targeted screening or vector control. For chronic diseases such as TB, the system serves a strategic forecasting function, projecting long-term trends (e.g., to 2030) to guide resource allocation and policy target setting. This multimodal capability aligns with the tiered surveillance architecture advocated in recent national guidance on intelligent multi-point trigger systems (8–9).

    • Embedding such a system into an operational setting can accelerate detection, improve the allocation of quarantine and laboratory resources, and better align vector control with clinical responses during high-risk periods. These functions align with national guidance on building a multi-point trigger early-warning architecture (8–9) and with international calls to strengthen public health forecasting (12). The system also benefits from interoperable data contracts, such as HL7 FHIR, which facilitate scalable integration across health, customs, and laboratory agencies (14), as well as operator-oriented dashboards designed for real-time decision support (15). In the short term, priorities include regular recalibration, expanded data exchange with partner agencies, and the incorporation of operator feedback loops. In the medium term, multisite evaluations are required to provide robust evidence of improved timeliness and efficiency.

      This study has several limitations. First, the data sources have known weaknesses: digital traces are susceptible to “media noise”, smartphone-derived mobility data may underrepresent the elderly, and meteorological data face spatiotemporal alignment challenges. Second, there are ecological and modeling constraints: the national tick index has limited local granularity, and structural simplifications, such as assuming uniform mixing for COVID-19 or simplified vector-host cycles for SFTSV, may overlook microenvironmental heterogeneity. Finally, coverage and parameter uncertainty remain concerns: pilot coverage was geographically limited, PPV thresholds need periodic recalibration, and the TB model depends on uncertain latent progression parameters.
