A New Integrated Interpolation Method for High Missing Unstable Disease Surveillance Data — 12 Urban Agglomerations, China, 2009–2020

Yuanhao Shi; Yilan Liao

doi:10.46234/ccdcw2024.124

China CDC Weekly, 2024, 6(27): 670-676


Yuanhao Shi^1,2; Yilan Liao^1

Abstract
Introduction
The prevalence of unstable and incomplete monitoring data significantly complicates syndromic analysis. Many data interpolation methods currently available demonstrate inadequate effectiveness in overcoming this issue.
Methods
To improve the accuracy of interpolation, we propose the integration of the SHapley Additive exPlanation model (SHAP) with the structural equation model (SEM), forming a combined SHAP-SEM approach. A case study is then performed to assess the enhanced performance of this novel model compared to traditional methods.
Results
The SHAP-SEM model was utilized to develop an interpolation model employing data from the Chinese respiratory syndrome surveillance database. We executed three distinct experiments to establish the model datasets, comprising a total of 100 replicates. The performance of the model was evaluated using the root mean square error (RMSE), correlation coefficient (r), and F-score. The findings demonstrate that the SHAP-SEM model consistently achieves superior accuracy in data interpolation, which is evident across different seasons and in overall performance.
Discussion
We conclude that the SHAP-SEM model demonstrates an exceptional capacity for accurately interpolating volatile and incomplete data. This capability is crucial for developing a comprehensive database that is essential for conducting risk assessments related to syndromes.
Funding: Supported by the Foundation of China (grant number 42171419) and National Science and Technology Major Project of China (grant number 2018ZX10713001)

Author Affiliations

1. The State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, China
2. University of Chinese Academy of Sciences, Beijing, China

Corresponding author: Yilan Liao, liaoyl@lreis.ac.cn
Online Date: July 05 2024
Issue Date: July 05 2024

References

[1]	Jia P, Yang SJ. China needs a national intelligent syndromic surveillance system. Nat Med 2020;26(7):990.
[2]	Koch T. Disease mapping and innovation: a history from wood-block prints to Web 3.0. Patterns 2022;3(6):100507.
[3]	Lee KY, Li LX. Functional structural equation model. J Roy Stat Soc Ser B Stat Methodol 2022;84(2):600–29.
[4]	Liu YC, Liu ZH, Luo X, Zhao HJT. Diagnosis of Parkinson's disease based on SHAP value feature selection. Biocybern Biomed Eng 2022;42(3):856–69.
[5]	Willmott CJ, Matsuura K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Res 2005;30(1):79–82.
[6]	Zhang WG, He YW, Wang LQ, Liu SL, Meng XY. Landslide susceptibility mapping using random forest and extreme gradient boosting: a case study of Fengjie, Chongqing. Geol J 2023;58(6):2372–87.
[7]	Li ZJ, Zhang HY, Ren LL, Lu QB, Ren X, Zhang CH, et al. Etiological and epidemiological features of acute respiratory infections in China. Nat Commun 2021;12(1):5026.
[8]	Wang JL, Chen T, Deng LL, Han YJ, Wang DY, Wang LP, et al. Epidemiological characteristics of imported respiratory infectious diseases in China, 2014–2018. Infect Dis Poverty 2022;11(1):22.
[9]	Zhao YJ, Lu RJ, Shen J, Xie ZD, Liu GS, Tan WJ. Comparison of viral and epidemiological profiles of hospitalized children with severe acute respiratory infection in Beijing and Shanghai, China. BMC Infect Dis 2019;19(1):729.
[10]	Xiao MY, Zhang GH, Breitkopf P, Villon P, Zhang WH. Extended Co-Kriging interpolation method based on multi-fidelity data. Appl Math Comput 2018;323:120–31.
[11]	Miller PC, Ren MD, Schlame M, Toth MJ, Phoon CKL. A Bayesian analysis to determine the prevalence of Barth syndrome in the pediatric population. J Pediatr 2020;217:139–44.
[12]	Wang JF, Haining R, Cao ZD. Sample surveying to estimate the mean of a heterogeneous surface: reducing the error variance through zoning. Int J Geogr Inf Sci 2010;24(4):523–43.
[13]	Mariano C, Mónica B. A random forest-based algorithm for data-intensive spatial interpolation in crop yield mapping. Comput Electron Agric 2021;184:106094.
[14]	Abdelaziz M, Wang TF, Elazab A. Alzheimer's disease diagnosis framework from incomplete multimodal data using convolutional neural networks. J Biomed Inform 2021;121:103863.

FIGURE 1. The virus structure spectrum of different urban agglomerations after interpolation.

Note: The letters “A” and “C” on the abscissa represent the spring and autumn quarters, respectively.

Abbreviation: HADV=human adenovirus; HCOV=human coronavirus; HMPV=human metapneumovirus; HPIV=human parainfluenza virus; HRSV=human respiratory syncytial virus; IFV=influenza virus.


TABLE 1. The average values of the three evaluation indicators under 100 repeated experiments for 5 models in each setting.

Methods	RMSE	r	F-score
Setting 1: Training=60%
SHAP-SEM	5.813	0.710***	0.752
SEM	9.424	0.614**	0.651
Cokriging	8.273	0.429	0.634
Bayesian	10.174	0.494**	0.693
Sandwich	7.235	0.539***	0.621
Setting 2: Training=70%
SHAP-SEM	5.157	0.734***	0.781
SEM	9.364	0.633**	0.684
Cokriging	8.047	0.457	0.634
Bayesian	9.154	0.518**	0.691
Sandwich	7.176	0.584***	0.633
Setting 3: Training=80%
SHAP-SEM	5.081	0.767***	0.792
SEM	9.331	0.657*	0.708
Cokriging	7.524	0.461**	0.642
Bayesian	8.699	0.523**	0.701
Sandwich	6.926	0.601***	0.651
Abbreviation: RMSE=root-mean-square error; SHAP-SEM=SHapley Additive exPlanation model with the structural equation model; SEM=structural equation model. * P<0.05; ** P<0.01; *** P<0.001.
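
The three training settings in Table 1 can be read as a repeated hold-out protocol. The sketch below is only an illustration under stated assumptions: the source does not say how records were split, so a random split is assumed, and `mean_interpolator`, `repeated_holdout`, and the synthetic `records` are hypothetical stand-ins rather than the authors' model or data.

```python
import random
from math import sqrt

def mean_interpolator(train):
    """Hypothetical placeholder model: always predicts the training mean."""
    values = [y for _, y in train]
    mean = sum(values) / len(values)
    return lambda x: mean

def repeated_holdout(records, fit, train_fraction, n_repeats=100, seed=0):
    """Average test RMSE over repeated random train/test splits (assumed protocol)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        shuffled = records[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train, test = shuffled[:cut], shuffled[cut:]
        predict = fit(train)
        squared_errors = [(y - predict(x)) ** 2 for x, y in test]
        scores.append(sqrt(sum(squared_errors) / len(squared_errors)))
    return sum(scores) / len(scores)

# Synthetic (site index, positive rate) pairs, for illustration only.
records = [(i, 10.0 + (i % 7)) for i in range(50)]

# One averaged RMSE per training fraction, mirroring Settings 1 to 3.
results = {f: repeated_holdout(records, mean_interpolator, f)
           for f in (0.6, 0.7, 0.8)}
```

Any interpolation model exposing the same `fit(train) -> predict` shape could be passed in place of the placeholder, which keeps the comparison across models on identical splits.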


TABLE 2. Average values of the three evaluation indicators in different seasons under 100 repeated experiments for 5 models in each setting.

Quarters	Method	RMSE	r	F-score
Spring	SHAP-SEM	4.214	0.781***	0.811
Spring	SEM	5.019	0.722**	0.736
Spring	Cokriging	5.105	0.581*	0.662
Spring	Bayesian	6.428	0.614**	0.705
Spring	Sandwich	6.832	0.637**	0.651
Summer	SHAP-SEM	6.194	0.703**	0.710
Summer	SEM	10.521	0.641**	0.682
Summer	Cokriging	7.113	0.519	0.627
Summer	Bayesian	9.245	0.588*	0.690
Summer	Sandwich	8.194	0.572*	0.638
Autumn	SHAP-SEM	7.237	0.651**	0.631
Autumn	SEM	9.144	0.601*	0.614
Autumn	Cokriging	6.184	0.467	0.596
Autumn	Bayesian	7.965	0.501	0.587
Autumn	Sandwich	8.229	0.523*	0.577
Winter	SHAP-SEM	4.057	0.722***	0.753
Winter	SEM	6.124	0.671***	0.707
Winter	Cokriging	5.016	0.514*	0.636
Winter	Bayesian	7.255	0.635**	0.641
Winter	Sandwich	5.417	0.642**	0.639
Abbreviation: MSE=mean-square error; RMSE=root-mean-square error; SHAP-SEM=SHapley Additive exPlanation model with the structural equation model; SEM=structural equation model. * P<0.05; ** P<0.01; *** P<0.001.




Syndromic surveillance is crucial for the rapid detection of, and alerting to, infectious disease outbreaks. Nonetheless, it often encounters challenges including uneven distribution of monitoring sites, irregular reporting schedules, and incomplete data (1). These factors hinder the ability to accurately delineate disease distribution in time and space, and to discern patterns and anomalies. Traditional spatiotemporal interpolation methods (2) are ill-suited to the volatility and gaps in disease data, particularly when significant influencing factors must be integrated. In contrast, the structural equation model (SEM) supports the analysis of complex data interactions to uncover underlying relationships among variables (3), thus enabling more precise interpolation. We propose a SHapley Additive exPlanation (SHAP)-based SEM for the spatiotemporal interpolation of unstable and incomplete data.
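
To make the SHAP idea concrete, the following minimal sketch computes exact Shapley values for a toy linear predictor by enumerating every feature coalition. It illustrates the general attribution technique only and is not the authors' SHAP-SEM implementation; `WEIGHTS`, `X`, `BASELINE`, and both function names are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition of the other features."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight shared by all coalitions of this size.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for coalition in combinations(others, size):
                members = set(coalition)
                gain = value_fn(members | {i}) - value_fn(members)
                phi[i] += weight * gain
    return phi

# Hypothetical linear predictor, instance, and baseline (illustration only).
WEIGHTS = [2.0, -1.0, 0.5]
X = [1.0, 3.0, 4.0]
BASELINE = [0.0, 1.0, 2.0]

def linear_value(coalition):
    """Prediction when coalition features take the instance's values
    and all remaining features take their baseline values."""
    return sum(w * (X[j] if j in coalition else BASELINE[j])
               for j, w in enumerate(WEIGHTS))

phi = shapley_values(linear_value, len(WEIGHTS))
```

For a linear predictor each attribution reduces to the weight times the feature's deviation from its baseline, and the attributions sum to the difference between the prediction and the baseline prediction. Features with larger absolute attributions would be the natural candidates to carry into a downstream model such as an SEM.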

DISCUSSION

Missing data frequently complicates syndromic surveillance, obstructing the analysis of disease patterns and trends, thereby impeding efforts in disease prevention and control. Developing methods for data interpolation in environments characterized by unstable monitoring and significant data gaps presents a formidable research challenge. Conventional interpolation techniques are often inadequate in contexts involving complex interactions between diseases and their determinants. These methods generally underperform in addressing missing values within sparse datasets.

This study employed interpolation techniques to estimate the prevalence of primary viruses associated with seasonal respiratory syndrome across 12 urban agglomerations in China between 2009 and 2020, accounting for sparse data and missing values. The accuracy of these estimates was assessed using RMSE, r, and F-score. The results indicate that the SHAP-SEM method surpasses the other approaches in the accuracy of interpolated data on primary respiratory syndrome viruses, with significant improvements in both overall and seasonal accuracy.
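
The three indicators named above can be sketched as follows. RMSE and the correlation coefficient follow their standard definitions; the source does not state how the F-score was derived from continuous estimates, so the threshold-based binarization in `f_score` is an assumption, and the sample values are hypothetical.

```python
from math import sqrt

def rmse(observed, predicted):
    """Root mean square error between observed and interpolated values."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def pearson_r(observed, predicted):
    """Pearson correlation coefficient."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / sqrt(var_o * var_p)

def f_score(observed, predicted, threshold):
    """F1 score after labelling values >= threshold as positive (assumed rule)."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o >= threshold and p >= threshold)
    fp = sum(1 for o, p in pairs if o < threshold and p >= threshold)
    fn = sum(1 for o, p in pairs if o >= threshold and p < threshold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical positive rates (%), for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 27.0, 33.0, 22.0]

error = rmse(observed, predicted)
corr = pearson_r(observed, predicted)
f1 = f_score(observed, predicted, threshold=25.0)
```

Lower RMSE and higher r and F-score indicate better interpolation, which is how Tables 1 and 2 should be read.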

Most spatiotemporal interpolation models for diseases incorporate both spatiotemporal autocorrelation and differentiation. The Co-kriging method (10) is a geostatistical technique that leverages correlations among variables across sites for spatial interpolation and prediction. However, using Co-kriging to estimate missing data increases uncertainty because its estimates depend heavily on the availability of complete datasets and on the spatial correlation among variables; gaps in a dataset can weaken spatial autocorrelation and render predictions unreliable. Bayesian hierarchical models (11) handle missing data under the assumption that data are lost at random. If the actual missing-data mechanism differs from the one assumed in the model, parameter estimates may be biased. The sandwich approach (12) treats missing and observed data as independent, disregarding any patterns or correlations in the data removal process. This can result in incorrect standard errors and inferences, particularly when the deletion mechanism is informative or directly linked to the variables of interest.

Additionally, the complex dynamics of disease occurrence, infection, and transmission interact in varying ways with different factors. To enhance spatiotemporal interpolation accuracy, machine learning techniques, including random forest models (13) and neural networks (14), have been utilized. Despite their effectiveness, these models are often complex and do not sufficiently address the multifaceted nature of disease prevention.

This study proposes integrating SEM with SHAP to identify crucial features and their interrelationships, thereby addressing the problems posed by unstable and fragmented health monitoring data. Addressing these problems makes it possible to derive more precise insights about syndromes, affected regions, and causal factors, and to offer scientifically based recommendations for effective local prevention and control strategies.
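
The concern raised above about missing-data mechanisms can be illustrated with a small simulation. This is a sketch on synthetic data, not an analysis from the study: `observed_mean`, the retention probabilities, and the simulated values are all hypothetical. It compares the mean of the retained records when loss is unrelated to the value with the mean when high values are more likely to be lost.

```python
import random

def observed_mean(values, keep_probability, rng):
    """Mean of the records that survive a given retention rule."""
    kept = [v for v in values if rng.random() < keep_probability(v)]
    return sum(kept) / len(kept)

rng = random.Random(42)

# Synthetic "true" positive rates, for illustration only.
true_values = [rng.gauss(20.0, 5.0) for _ in range(10000)]
true_mean = sum(true_values) / len(true_values)

# Loss unrelated to the value: every record is kept with probability 0.5.
random_loss_mean = observed_mean(true_values, lambda v: 0.5, rng)

# Informative loss: high values are kept far less often than low values.
informative_loss_mean = observed_mean(
    true_values, lambda v: 0.2 if v > 20.0 else 0.8, rng)

random_loss_bias = random_loss_mean - true_mean
informative_loss_bias = informative_loss_mean - true_mean
```

Under random loss the retained mean stays close to the true mean, whereas informative loss pulls it clearly downward. A model that assumes random loss would inherit that bias.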

However, this study is subject to some limitations. The SHAP-SEM model necessitates a large sample size, which may not always be feasible with incomplete syndrome surveillance data. Additionally, it is crucial to recognize that viral activity is influenced by a range of risk factors. Future endeavors to integrate incomplete syndrome monitoring data with the SHAP-SEM model should include more factors. It is also important to note that the scalability of SHAP-SEM models may be compromised when handling large datasets. As the size of the dataset expands, the computational and memory demands escalate, potentially leading to extended processing times and heightened complexity.

In further studies, comprehensive descriptions and analyses of the syndrome’s spatial and temporal distribution can be achieved through interpolation methods. Additionally, examining variations in viral activity and seasonal trends across different regions is possible. Building on this knowledge, identifying specific risk groups and areas becomes feasible, providing essential data to support targeted, time-sensitive, and location-specific prevention and control strategies.

Conflicts of interest

No conflicts of interest.

