Developing Machine Learning Prediction Model for Daily Influenza Reported Cases Using Multichannel Surveillance Data — A City, Hubei Province, China, 2023–2025

Xinyue Zhang; Xinyi Sang; Beibei Liu; Quanyu Wang; Xiuran Zuo; Sheng Wei; Qi Wang

doi:10.46234/ccdcw2025.234

China CDC Weekly, 2025, 7(44): 1396-1402

Article type: Methods and Applications

Abstract
Introduction
Public health surveillance is crucial for decision-making. Given the significant threat of influenza to public health, developing predictive models using multichannel surveillance systems is imperative.
Methods
Data were collected from multichannel surveillance systems, including hospitals, search engines, and climatological and air pollutant surveillance systems, in a southern Chinese city from January 2023 to January 2025. Spearman’s correlation analysis assessed the relationships between variables and reported influenza cases. Several machine learning models were used to predict trends in reported cases.
Results
Correlation analysis showed that all four surveillance systems were related to influenza, with 27 variables correlated with daily reported cases. The Long Short-Term Memory model, built on the variables with the highest lagged correlations (5-day to 7-day lags) across the combined surveillance systems, outperformed other models for 5-day forecasts (R²=0.92; mean absolute error=156.92; mean absolute percentage error=79.95%; root mean squared error=292.33).
Conclusions
Data from various surveillance systems effectively track influenza epidemics. The model shows potential for infectious disease surveillance and epidemic preparedness.
Conflicts of interest: No conflicts of interest.
Funding: Supported by the National Key Research and Development Program of China (grant no. 2022YFC2305103)

Author Affiliations

1. Department of Epidemiology and Biostatistics, School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan City, Hubei Province, China
2. Health Information Center of Wuhan, Wuhan City, Hubei Province, China
3. School of Public Health and Emergency Management, Southern University of Science and Technology, Shenzhen City, Guangdong Province, China

Corresponding authors: Qi Wang, wangqi_tj@hust.edu.cn; Sheng Wei, weis@sustech.edu.cn
Online Date: October 31, 2025
Issue Date: October 31, 2025

References

[1]	Iuliano AD, Roguski KM, Chang HH, Muscatello DJ, Palekar R, Tempia S, et al. Estimates of global seasonal influenza-associated respiratory mortality: a modelling study. Lancet 2018;391(10127):1285–300. https://doi.org/10.1016/S0140-6736(17)33293-2.
[2]	Zhang YZ, Bambrick H, Mengersen K, Tong SL, Hu WB. Using internet-based query and climate data to predict climate-sensitive infectious disease risks: a systematic review of epidemiological evidence. Int J Biometeorol 2021;65(12):2203–14. https://doi.org/10.1007/s00484-021-02155-4.
[3]	Brownstein JS, Rader B, Astley CM, Tian HY. Advances in artificial intelligence for infectious-disease surveillance. N Engl J Med 2023;388(17):1597–607. https://doi.org/10.1056/NEJMra2119215.
[4]	Kraemer MUG, Tsui JLH, Chang SY, Lytras S, Khurana MP, Vanderslott S, et al. Artificial intelligence for modelling infectious disease epidemics. Nature 2025;638(8051):623–35. https://doi.org/10.1038/s41586-024-08564-w.
[5]	Dong YH, Wang LP, Burgner DP, Miller JE, Song Y, Ren X, et al. Infectious diseases in children and adolescents in China: analysis of national surveillance data from 2008 to 2017. BMJ 2020;369:m1043. https://doi.org/10.1136/bmj.m1043.
[6]	Nottmeyer LN, Sera F. Influence of temperature, and of relative and absolute humidity on COVID-19 incidence in England - A multi-city time-series study. Environ Res 2021;196:110977. https://doi.org/10.1016/j.envres.2021.110977.
[7]	Krymova E, Béjar B, Thanou D, Sun T, Manetti E, Lee G, et al. Trend estimation and short-term forecasting of COVID-19 cases and deaths worldwide. Proc Natl Acad Sci USA 2022;119(32):e2112656119. https://doi.org/10.1073/pnas.2112656119.
[8]	Prasanth S, Singh U, Kumar A, Tikkiwal VA, Chong PHJ. Forecasting spread of COVID-19 using Google Trends: a hybrid GWO-deep learning approach. Chaos Solitons Fractals 2021;142:110336. https://doi.org/10.1016/j.chaos.2020.110336.
[9]	Holmdahl I, Buckee C. Wrong but useful - what covid-19 epidemiologic models can and cannot tell us. N Engl J Med 2020;383(4):303–5. https://doi.org/10.1056/NEJMp2016822.
[10]	Chin V, Samia NI, Marchant R, Rosen O, Ioannidis JPA, Tanner MA, et al. A case study in model failure? COVID-19 daily deaths and ICU bed utilisation predictions in New York state. Eur J Epidemiol 2020;35(8):733–42. https://doi.org/10.1007/s10654-020-00669-6.
[11]	Morgan OW, Abdelmalik P, Perez-Gutierrez E, Fall IS, Kato M, Hamblion E, et al. How better pandemic and epidemic intelligence will prepare the world for future threats. Nat Med 2022;28(8):1526–8. https://doi.org/10.1038/s41591-022-01900-5.
[12]	Uyeki TM, Hui DS, Zambon M, Wentworth DE, Monto AS. Influenza. Lancet 2022;400(10353):693–706. https://doi.org/10.1016/S0140-6736(22)00982-5.
[13]	Milinovich GJ, Williams GM, Clements AC, Hu WB. Internet-based surveillance systems for monitoring emerging infectious diseases. Lancet Infect Dis 2014;14(2):160–8. https://doi.org/10.1016/S1473-3099(13)70244-5.
[14]	Xu L, Zhou C, Luo ST, Chan DK, McLaws ML, Liang WN. Modernising infectious disease surveillance and an early-warning system: the need for China’s action. Lancet Reg Health West Pac 2022;23:100485. https://doi.org/10.1016/j.lanwpc.2022.100485.
[15]	Zhang R, Lai KY, Liu WH, Liu YH, Ma XW, Webster C, et al. Associations between short-term exposure to ambient air pollution and influenza: an individual-level case-crossover study in Guangzhou, China. Environ Health Perspect 2023;131(12):127009. https://doi.org/10.1289/EHP12145.
[16]	Guo F, Zhang P, Do V, Runge J, Zhang K, Han ZS, et al. Ozone as an environmental driver of influenza. Nat Commun 2024;15(1):3763. https://doi.org/10.1038/s41467-024-48199-z.

FIGURE 1. Spearman correlation between variables from different surveillance systems and reported cases at a 7-day lag.

Abbreviation: Tmean=daily mean temperature; Pmean=daily mean air pressure; RHmean=daily mean relative humidity; AHmean=daily mean absolute humidity; WSmean=daily mean wind speed; VISmean=daily mean visibility; PRCPmean=daily mean precipitation.

FIGURE 2. Predictions of the LSTM model. (A) Entire monitoring period; (B) Test set period.

Abbreviation: LSTM=Long Short-Term Memory; CI=confidence interval.

TABLE 1. Comparison of model performance across different lag days.

Metric / model (lag in days)	−7	−6	−5	−4	−3	−2	−1	0
MAE
RF	337.87	338.28	341.30	334.08	312.13	300.46	288.00	242.29
XGBoost	287.21	278.68	306.39	296.40	256.80	270.32	262.13	194.60
LR	331.20	345.66	363.05	459.53	463.15	453.79	518.47	520.87
KNN	318.65	322.00	335.00	315.57	310.96	292.79	272.76	249.22
GRU	213.40	201.27	245.67	248.84	222.84	209.87	245.61	235.47
LSTM	200.40	229.21	156.92	238.00	244.77	258.08	214.40	170.59
MAPE (%)
RF	123.67	126.36	133.84	134.04	129.53	127.40	124.04	118.06
XGBoost	123.83	113.94	126.43	129.99	106.45	117.50	100.04	91.91
LR	290.16	299.56	290.85	405.42	409.27	396.86	466.83	553.25
KNN	101.26	110.71	112.25	101.42	104.63	95.35	85.36	89.93
GRU	92.66	80.78	99.19	70.50	74.83	73.84	104.70	87.31
LSTM	82.45	114.17	79.95	124.37	132.32	110.60	61.55	60.26
RMSE
RF	850.54	847.87	839.32	801.37	734.83	702.84	662.17	522.37
XGBoost	676.90	666.64	740.27	670.44	622.87	619.18	627.01	436.74
LR	563.26	575.52	607.19	665.98	659.59	653.02	699.16	601.48
KNN	835.86	824.83	840.50	789.46	792.10	753.85	721.15	572.74
GRU	416.80	505.86	611.00	646.34	553.33	509.26	574.51	581.57
LSTM	467.92	407.89	292.33	405.50	490.79	574.22	570.83	453.23
R²
RF	0.25	0.26	0.27	0.34	0.44	0.49	0.55	0.72
XGBoost	0.53	0.54	0.44	0.54	0.60	0.61	0.60	0.80
LR	0.67	0.66	0.62	0.54	0.55	0.56	0.50	0.63
KNN	0.28	0.30	0.27	0.36	0.35	0.41	0.46	0.66
GRU	0.84	0.76	0.65	0.61	0.71	0.76	0.69	0.68
LSTM	0.79	0.83	0.92	0.84	0.77	0.69	0.69	0.81
Abbreviation: MAE=mean absolute error; RF=Random Forest; XGBoost=eXtreme Gradient Boosting; SVM=Support Vector Machine; LR=Linear Regression; KNN=K-Nearest Neighbors; GRU=Gated Recurrent Unit; LSTM=Long Short-Term Memory; MAPE=mean absolute percentage error; RMSE=root mean square error; R²=coefficient of determination.

TABLE 2. Comparison of different combinations of data.

Model performance of different combinations	MAE	MAPE (%)	RMSE	R²
Model H	428.57	221.61	896.41	0.25
Model B	414.27	141.49	971.23	0.11
Model M	456.34	159.50	946.81	0.16
Model P	435.46	131.86	1000.15	0.06
Model H+B	370.35	168.43	793.79	0.41
Model H+M	248.17	125.64	533.21	0.73
Model H+P	286.82	105.25	670.63	0.58
Model B+M	275.89	102.85	651.88	0.60
Model B+P	317.25	109.17	703.38	0.54
Model M+P	254.46	102.41	548.00	0.72
Model H+B+M	215.49	108.40	480.24	0.78
Model H+B+P	197.02	52.24	522.32	0.74
Model H+M+P	201.06	63.64	418.51	0.84
Model B+M+P	267.35	118.07	550.24	0.72
Model All	156.92	79.95	292.33	0.92
Abbreviation: MAE=mean absolute error; MAPE=mean absolute percentage error; RMSE=root mean square error; R²=coefficient of determination.
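The 15 rows of Table 2 correspond to every non-empty combination of four data sources. As a minimal sketch (not the authors' code), the list of combinations can be generated as below. The single-letter labels are copied from the table; what each letter stands for is not spelled out in this section, so they are treated as opaque labels, and the variable names are illustrative.

```python
from itertools import combinations

# Labels as they appear in Table 2; their expansions are not given here,
# so they are kept as opaque identifiers (an assumption of this sketch).
sources = ["H", "B", "M", "P"]

# Every non-empty subset, ordered by size and then by position in `sources`.
subsets = [
    "+".join(combo)
    for size in range(1, len(sources) + 1)
    for combo in combinations(sources, size)
]

for name in subsets:
    print("Model " + name)
```

The final entry, H+B+M+P, is the row the table labels "Model All". Each subset would then be used to train and score one model under identical settings, so that differences in the metrics reflect the data sources rather than the tuning.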

Influenza, an acute respiratory infectious disease caused by the influenza virus, threatens global public health due to its high incidence, transmissibility, and severe complications (1). Effective surveillance is crucial for timely public health interventions.

Influenza spread is influenced by climate, human migration, social media, and socioeconomic status (2). Integrating multisource data for outbreak prediction remains challenging.

Artificial intelligence (AI) has opened new possibilities for disease prediction. Deep learning (DL), a subset of machine learning (ML), enables model optimization through self-supervised learning and has shown superior prediction performance (3). These technologies show potential for disease prediction by capturing complex patterns (4).

This study explored the relationship between influenza cases and surveillance systems in a southern Chinese city and used AI techniques to establish prediction models for influenza epidemics, with the aim of refining monitoring strategies and informing public health responses.

DISCUSSION

Prediction models were developed to track the influenza epidemic in a southern Chinese city by integrating data from multichannel surveillance systems. The LSTM model combining all surveillance data demonstrated high prediction accuracy, with R²=0.92, MAE=156.92, MAPE=79.95%, and RMSE=292.33, outperforming other models.
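For reference, the four metrics quoted above can be computed as in the following minimal sketch. It uses the standard textbook definitions, which is an assumption because this section does not state the exact formulas; the function name `regression_metrics`, the handling of zero-count days, and the toy counts are illustrative and not taken from the study.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MAPE (%), RMSE, and R² for paired observations."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    # MAPE divides by the observed value, so zero-count days are skipped
    # (an assumption; the source does not say how such days were handled).
    pct = [abs(e) / abs(t) for t, e in zip(y_true, errors) if t != 0]
    mape = 100.0 * sum(pct) / len(pct)
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MAPE": mape, "RMSE": rmse, "R2": r2}

# Made-up daily counts: two quiet days and two epidemic-peak days.
observed = [10, 20, 1000, 2000]
predicted = [20, 30, 1050, 1900]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

In this toy series MAE is 42.5 and MAPE is 40%, yet R² exceeds 0.99: an error of 10 cases on a day with 10 observed cases counts as 100%, while the peaks dominate the variance that R² measures. This is one possible reading of how a high R² and a large MAPE can appear together, not a diagnosis of the reported results.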

Compared to existing models, our approach shows improved prediction accuracy by integrating multiple data sources. Previous studies relied mainly on single-source data, such as clinical reports (7) or Google Trends (8), and were limited by reporting delays, definition changes, and data errors (9–10). Our models, which used electronic medical records (EMRs), social media, meteorological, and air pollutant data, were designed to mitigate forecasting errors. The combined model performed best, highlighting the value of diverse data in assessing influenza trends. The early phase of the influenza outbreak highlighted the inadequacy of confirmed case data for traditional surveillance (11). A multifaceted monitoring approach is essential to improve epidemic predictions.

Data quality and diversity significantly affect model performance. EMRs provide detailed clinical information. Fever clinic visits exhibited a weaker correlation (r=0.257), as fever clinic patients may have other diseases and some patients with influenza may not visit fever clinics. Symptoms such as myalgia, cough, and sore throat showed stronger positive correlations with influenza cases at shorter lags, underscoring the importance of symptom-based surveillance (12). Social media data provide real-time insights into potential outbreaks, with search volumes for symptom keywords correlating with influenza cases (13). Studies indicate that combining Internet-based queries and climate data improves the accuracy and timeliness of infectious disease warning systems (14). Our findings show positive correlations between air pollutants (SO₂, NO₂, PM₁₀, PM₂.₅, and CO) and influenza cases, and a negative correlation between O₃ and influenza cases, consistent with previous studies (15–16). This highlights the importance of integrating air pollutant data for accurate influenza forecasting.
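The lagged correlations discussed here can be illustrated with a small, dependency-free sketch of Spearman's rank correlation evaluated at several lags. All function names and the two series are hypothetical; the cases are constructed to echo the search signal two days later, so the data say nothing about the study's actual lags.

```python
def ranks(values):
    """Average ranks (1-based), with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's correlation is Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def lagged_spearman(signal, cases, lag):
    """Correlate the signal on day t - lag with cases on day t (lag >= 0)."""
    if lag == 0:
        return spearman(signal, cases)
    return spearman(signal[:-lag], cases[lag:])

# Hypothetical series: cases follow the search index with a two-day delay.
search_index = [5, 9, 14, 30, 22, 11, 7, 6, 12, 25]
cases = [0, 0] + [3 * v for v in search_index[:-2]]
by_lag = {lag: lagged_spearman(search_index, cases, lag) for lag in range(4)}
best_lag = max(by_lag, key=by_lag.get)
print(best_lag, by_lag)
```

Scanning a range of lags per variable and keeping the lag with the strongest correlation is one simple way to choose lagged inputs; whether the study selected its variables exactly this way is not stated in this section.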

The LSTM model demonstrated improved accuracy through multisource data. Hospital surveillance enhanced prediction performance, consistent with the correlation results. This approach provides an understanding of environmental factors, public health interventions, and disease dynamics. Although a 5–7-day lag generally performed well, some combinations performed worse because of influenza’s complex spread mechanisms, which involve virus survival and human behavior. Environmental variables can extend virus survival time, potentially causing delays between environmental changes and observable increases in influenza cases. Our multisource surveillance data integrated clinical, laboratory, and syndromic monitoring systems, with reporting delays contributing to the extended lag period. Despite similarities in the factors driving respiratory disease spread, significant heterogeneity existed. Early prediction improves response strategies, resource allocation, and outbreak management.
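To make the idea of a lagged forecast concrete, the sketch below shows one common way to frame daily surveillance data for a sequence model such as an LSTM: each sample is a block of consecutive past days, and its label is the case count a fixed number of days later. The windowing actually used in the study is not described in this section, so the function `make_windows`, the window length, and the toy data are assumptions.

```python
def make_windows(features, target, window, horizon):
    """Pair each block of `window` consecutive feature rows with the
    target value `horizon` days after the block's last day."""
    samples, labels = [], []
    last_start = len(features) - window - horizon
    for start in range(last_start + 1):
        end = start + window  # exclusive end of the input block
        samples.append(features[start:end])
        labels.append(target[end - 1 + horizon])
    return samples, labels

# Toy data: one row per day with two hypothetical signals, plus daily cases.
features = [[day, day * 2] for day in range(10)]
cases = [100 + day for day in range(10)]

# Three days of inputs are used to predict cases five days ahead.
X, y = make_windows(features, cases, window=3, horizon=5)
print(len(X), y)
```

With 10 days of data, a 3-day window, and a 5-day horizon, only 3 samples remain, which shows how longer lags shrink the usable training set.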

This study has certain limitations. The absence of population mobility and vaccination rates may restrict the capacity of the model to capture influenza transmission dynamics. Data quality from less reliable sources may affect performance, and the lack of external validation limits generalizability. Seasonal variations may lead to dispersed data patterns and higher noise levels during non-epidemic seasons. Future research should incorporate vaccination data and explore additional data sources like behavioral patterns and environmental factors. External validation and detailed data preprocessing, such as smoothing or cross-validation, could enhance generalization.
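The preprocessing steps suggested above can be sketched briefly. The example below shows a trailing moving average, which smooths noisy daily counts without looking ahead, and an expanding-window split, a form of cross-validation in which training data always precede test data. Both helpers, their parameters, and the counts are illustrative assumptions rather than procedures from the study.

```python
def moving_average(series, width):
    """Trailing mean over up to `width` most recent values (no look-ahead)."""
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - width + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def expanding_window_splits(n_samples, n_folds, test_size):
    """Yield (train_indices, test_indices); training always precedes test."""
    first_test_start = n_samples - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        test_start = first_test_start + fold * test_size
        yield (list(range(test_start)),
               list(range(test_start, test_start + test_size)))

# Made-up daily counts covering a small epidemic wave.
daily_cases = [4, 8, 6, 10, 30, 50, 40, 20, 12, 6, 4, 2]
smoothed = moving_average(daily_cases, width=3)
splits = list(expanding_window_splits(len(daily_cases), n_folds=3, test_size=2))

for train_idx, test_idx in splits:
    print("train through day", train_idx[-1], "-> test days", test_idx)
```

Ordinary shuffled k-fold validation would let a model train on days that come after its test days, which tends to overstate forecasting skill; an ordered split avoids that leak.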

This study demonstrates that multichannel data integration improves respiratory infectious disease prediction accuracy and timeliness, with implications for public health responses. Ongoing research will refine these models for other health threats.

Acknowledgments

We thank the staff members at the Center for Disease Prevention and Control and the Health Information Center for data verification.
