Methods and Applications: Mapping the Characteristics of Respiratory Infectious Disease Epidemics in China Based on the Baidu Index from November 2022 to January 2023

  • Abstract

    Introduction

    Infectious diseases pose a significant global health and economic burden, underscoring the critical need for precise predictive models. The Baidu index provides enhanced real-time surveillance capabilities that augment traditional systems.

    Methods

    Baidu search engine data on the keyword “fever” were extracted from 255 cities in China from November 2022 to January 2023. Onset and peak dates for influenza epidemics were identified by testing various criteria that combined thresholds and consecutive days.

    Results

    The most effective scenario for indicating epidemic commencement involved a 90th percentile threshold exceeded for seven consecutive days, minimizing false starts. Peak detection was optimized using a 7-day moving average, balancing stability and precision.

    Discussion

    The use of internet search data, such as the Baidu index, significantly improves the timeliness and accuracy of disease surveillance models. This innovative approach supports faster public health interventions and demonstrates its potential for enhancing epidemic monitoring and response efforts.

  • Conflicts of interest: No conflicts of interest.
  • Funding: Supported by grants from the CAMS Innovation Fund for Medical Sciences (2021-I2M-1-044, 2023-I2M-3-011) and National Key Research and Development Program of China (2023YFC2308701)
  • [1] Nii-Trebi NI. Emerging and neglected infectious diseases: insights, advances, and challenges. BioMed Res Int 2017;2017:5245021. https://doi.org/10.1155/2017/5245021
    [2] Ellwanger JH, Kaminski VDL, Chies JAB. Emerging infectious disease prevention: where should we invest our resources and efforts? J Infect Public Health 2019;12(3):313-6. http://dx.doi.org/10.1016/j.jiph.2019.03.010
    [3] Ma SM, Yang SH. COVID-19 forecasts using Internet search information in the United States. Sci Rep 2022;12(1):11539. https://doi.org/10.1038/s41598-022-15478-y
    [4] Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L. Detecting influenza epidemics using search engine query data. Nature 2009;457(7232):1012-4. https://doi.org/10.1038/nature07634
    [5] Liang F, Guan P, Wu W, Huang DS. Forecasting influenza epidemics by integrating internet search queries and traditional surveillance data with the support vector machine regression model in Liaoning, from 2011 to 2015. PeerJ 2018;6:e5134. https://doi.org/10.7717/peerj.5134
    [6] Gong X, Han YY, Hou MC, Guo R. Online public attention during the early days of the COVID-19 pandemic: infoveillance study based on Baidu Index. JMIR Public Health Surveill 2020;6(4):e23098. https://doi.org/10.2196/23098
    [7] Barros JM, Duggan J, Rebholz-Schuhmann D. The application of internet-based sources for public health surveillance (Infoveillance): systematic review. J Med Internet Res 2020;22(3):e13680. https://doi.org/10.2196/13680
    [8] Xinhua. China Focus: China releases measures to optimize COVID-19 response. https://english.news.cn/20221111/d4399114a082438eaac32d08a02bf58d/c.html. [2024-8-30].
    [9] China CDC Weekly. Epidemic situation of novel coronavirus infection in China. https://www.chinacdc.cn/jkzt/crb/zl/szkb_11803/jszl_13141/202302/t20230201_263576.html. [2024-8-30].
    [10] Lu J, Lin AR, Jiang CM, Zhang AM, Yang ZZ. Influence of transportation network on transmission heterogeneity of COVID-19 in China. Transp Res Part C Emerg Technol 2021;129:103231. https://doi.org/10.1016/j.trc.2021.103231
    [11] Xiang W, Chen L, Yan XD, Wang B, Liu XB. The impact of traffic control measures on the spread of COVID-19 within urban agglomerations based on a modified epidemic model. Cities 2023;135:104238. https://doi.org/10.1016/j.cities.2023.104238
  • 1. School of Health Policy and Management, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
  • 2. School of Population Medicine and Public Health, Chinese Academy of Medical Sciences (CAMS) & Peking Union Medical College, Beijing, China
  • 3. Yichang Center for Disease Prevention and Control, Yichang City, Hubei Province, China
  • 4. Weifang Center for Disease Prevention and Control, Weifang City, Shandong Province, China
  • 5. Beijing Center for Disease Prevention and Control, Beijing, China
  • 6. School of Public Health, Dali University, Dali City, Yunnan Province, China
  • 7. Shanghai Pudong New Area Center for Disease Control and Prevention, Shanghai, China
  • Corresponding authors:

    Weizhong Yang, yangweizhong@cams.cn

    Chen Wang, cyh-birm@263.net

  • Online Date: September 13 2024
    Issue Date: September 13 2024
    doi: 10.46234/ccdcw2024.195
  • Infectious diseases are a leading cause of death and disability worldwide, imposing a significant burden on public health and economic stability (1). The recent increase in emerging infectious diseases highlights the urgent need for accurate disease propagation predictions (2-3). The widespread use of internet data, particularly from platforms like Baidu, provides complementary real-time insights that enhance traditional infectious disease surveillance mechanisms (4-5). The Baidu index, distinguished by its superior forecasting accuracy and stability (6), has emerged as an invaluable asset for enriching existing surveillance systems.

    This study aimed to develop a surveillance model for epidemiological trends using the Baidu index as a cornerstone. Precise trend detection promises to facilitate prompt and effective public health interventions (7).

    • This study used data from the Baidu search engine, extracted from the publicly accessible Baidu Index website, covering trend analyses across 31 provincial-level administrative divisions (PLADs) in the Chinese mainland. The research tracked the keyword “fever” (“fa re” and “fa shao” in Chinese) to analyze coronavirus disease 2019 (COVID-19)-related trends and compiled data from 255 cities. Notably, China enforced a dynamic COVID-zero strategy between 2019 and 2022 and initiated a pivotal policy transition on November 11, 2022, ultimately abandoning the COVID-zero strategy on December 7, 2022 (8).

      Baseline data were established using Baidu Index data from August to October 2020–2022 for cities without pandemic activity. The year with the lowest average index was selected. If outbreaks occurred during these months, data from May to July were used.

      This study investigated the 2023 influenza outbreak using data collected from November 1, 2022, to January 2, 2023. Data from the northern city of Weifang and the southern city of Yichang were used to identify regional variations. To avoid the influence of the COVID-19 pandemic, pre-2019 influenza-related internet search data were used. Data from October to December 2018 served as the baseline for adjusting changes in internet usage over time.

    • Various thresholds, paired with criteria for consecutive days, were used to pinpoint the start of the influenza epidemic. First, thresholds were set at the 70th, 80th, and 90th percentiles of the Baidu index on days with non-zero values, computed over the period from August 1 to November 1, 2020. Onset was then defined as the Baidu index exceeding a threshold for at least three consecutive days or, more strictly, for at least seven consecutive days. Combining the three thresholds with the two durations yielded six scenarios, each applied to all cities. To mitigate the influence of extraneous variables on the search index, a three-day moving average was incorporated into the analysis.
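The six onset scenarios can be sketched in a few lines of Python. This is a minimal illustration rather than the study's actual code: the function names are hypothetical, the percentile uses linear interpolation, the moving average is a trailing one, and the onset is reported as the first day of the qualifying run. All of these are assumptions that the text does not specify.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def moving_average(values, window):
    """Trailing moving average; early days average whatever is available."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def onset_index(series, threshold, run_length):
    """Index of the first day of the first run of `run_length` consecutive
    days strictly above `threshold`, or None if no such run exists."""
    run = 0
    for i, value in enumerate(series):
        run = run + 1 if value > threshold else 0
        if run == run_length:
            return i - run_length + 1
    return None


def onset_scenarios(baseline, current, percentiles=(70, 80, 90), runs=(3, 7)):
    """Evaluate every percentile x run-length combination (six by default)."""
    nonzero = [v for v in baseline if v > 0]  # thresholds use non-zero days only
    smoothed = moving_average(current, 3)     # three-day smoothing of the index
    return {
        (p, r): onset_index(smoothed, percentile(nonzero, p), r)
        for p in percentiles
        for r in runs
    }
```

Calling `onset_scenarios(baseline, current)` returns a dictionary keyed by `(percentile, run_length)`, so the scenarios can be compared side by side; a value of `None` means that scenario never triggered.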

    • The real-time Baidu index from November 1, 2022, to January 2, 2023, was compared to a historical baseline calculated as the moving average of the Baidu index from 2020 to 2021. The Baidu Excess Search index, which measures excess search activity, was calculated using the following formula:

      Baidu Excess Search index = $ \dfrac{\sum \left(\int x-\int n\right)}{\int m} $

      Here, x denotes the real-time Baidu index, n represents the n-day moving average of the Baidu index for the historical period, and m denotes the m-day moving average for the same period.
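One possible reading of this formula is sketched below in Python. It is an interpretation, not the authors' implementation: the integral signs are taken to denote the daily series values, the excess is computed per day rather than summed, the moving averages are trailing, and the historical series is assumed to be aligned day by day with the real-time series. The function names are hypothetical.

```python
def moving_average(values, window):
    """Trailing moving average; early days average whatever is available."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def daily_excess(realtime, historical, n, m):
    """Per-day excess: (x_t - n-day MA of history) / (m-day MA of history).

    Assumes both series cover the same calendar days and that the
    historical index is never zero (zero-index cities are excluded
    elsewhere in the study).
    """
    if len(realtime) != len(historical):
        raise ValueError("series must cover the same number of days")
    base_n = moving_average(historical, n)
    base_m = moving_average(historical, m)
    return [(x - bn) / bm for x, bn, bm in zip(realtime, base_n, base_m)]
```

Under this reading, a value of 0.5 on a given day means that searches were 50% above the historical baseline for that day.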

      To determine the peak date reflecting the highest surge in search volume attributable to new cases in the current outbreak, we adjusted for the influence of other factors and diseases on the keyword search index. The optimal “m” value for the moving average was determined by evaluating three scenarios: m=3, m=7, and m=31. The “m” value was selected by comparing the outcomes of these scenarios. To minimize the impact of extraneous factors, we applied the “m”-day moving average to the historical Baidu index.

      Subsequently, a city’s peak was identified by evaluating two candidate criteria and comparing their outcomes: after an abnormal rise from the onset date, the Baidu Excess Search index declined for either three or five consecutive days (1).
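The peak rule can be illustrated with a short Python sketch. It simplifies the criterion: the “abnormal rise” is reduced to a day that is higher than the previous day, and the peak is the first such day followed by the required number of consecutive day-over-day declines. The function name and this simplification are assumptions.

```python
def find_peak(excess, onset, decline_days=3):
    """Index of the first day at or after `onset` that rose over the
    previous day and is followed by `decline_days` consecutive declines.
    Returns None if the series ends before such a pattern completes."""
    for i in range(onset, len(excess) - decline_days):
        rose = i > onset and excess[i] > excess[i - 1]
        fell = all(excess[j + 1] < excess[j] for j in range(i, i + decline_days))
        if rose and fell:
            return i
    return None
```

Running the function with `decline_days=3` and `decline_days=5` on the same series shows the trade-off between the two criteria: the five-day rule needs more data after the peak and returns `None` when the observation window ends too soon.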

    • Baidu index data from 331 Chinese cities underwent a stringent quality assessment. This process excluded 76 cities based on two criteria, leaving 255 cities for analysis. First, cities with a Baidu index below the 30th percentile were excluded. Second, cities reporting a zero index for any week between August 1 and November 30, 2022, were removed (Figure 1). After establishing the methodology and confirming its efficacy using the testing data, influenza search data were used to corroborate its versatility.

      Figure 1.  Study structure.
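The two exclusion rules can be sketched as follows. The text does not say which statistic the 30th percentile is applied to or how weeks are delimited, so this example assumes the percentile is taken over each city's mean daily index and that weeks are consecutive seven-day blocks counted from the first day of the series. The function names are hypothetical.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def passes_quality(daily, cutoff):
    """True if the city's mean index reaches `cutoff` and no 7-day block is all zero."""
    if sum(daily) / len(daily) < cutoff:
        return False
    weeks = [daily[i:i + 7] for i in range(0, len(daily), 7)]
    return not any(sum(week) == 0 for week in weeks)


def filter_cities(city_series, q=30):
    """Keep cities that pass both rules; the cutoff is the q-th percentile
    of the per-city mean index across all candidate cities."""
    means = [sum(s) / len(s) for s in city_series.values()]
    cutoff = percentile(means, q)
    return {c: s for c, s in city_series.items() if passes_quality(s, cutoff)}
```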

    • Baseline threshold assessment. A comparative analysis of the average Baidu index and standard deviation was conducted across cities without pandemics during August–October 2020, 2021, and 2022. The year 2020, demonstrating the lowest average, was subsequently chosen as the reference for establishing the threshold (Supplementary Figure S1).

      The criterion combining the 90th percentile threshold with seven consecutive days exceeding this threshold identified the fewest epidemic onsets among the six scenarios. Therefore, this combination was adopted to define the commencement of an outbreak.

      Onset criteria evaluation revealed that the epidemic began on November 9 in more northern cities and PLADs, while southern cities and PLADs tended to see an onset date of December 28 (Figure 2A).

      Figure 2.  Onset dates and peak dates across 255 cities in various PLADs of China. (A) Onset dates evaluation across 255 cities in various PLADs of China; (B) Peak dates evaluation across 255 cities in various PLADs of China.

      Note: Each row corresponds to a PLAD, and each colored block represents a city. The color gradient indicates the timing: shades closer to red represent earlier times, while shades closer to green indicate later times.

      Abbreviation: PLAD=provincial-level administrative division.

    • Daily Baidu engine excess search values and peak dates were calculated for m-values of 3, 7, and 31. An m-value of 3 exhibited data fluctuations rather than a consistent trend (Supplementary Table S1). Conversely, an m-value of 31 blurred details and risked distortion. An m-value of 7 provided stable results and was chosen for its reliable measurement, minimizing noise and avoiding the loss of significant data variations.

      Peak criteria evaluation. A criterion of three consecutive days of decline adequately signaled the peak in Sanya, Kashi, and Baishan, but five consecutive days did not. Therefore, a three-day decline was designated as the criterion. Applying this criterion to 255 cities, eight cities had not reached their peak by January 2, 2023 (Figure 2B).

      Northern cities experienced a median ILI onset date of December 9 (interquartile range: December 4–10), slightly earlier than the median onset date of December 11 (interquartile range: December 9–13) observed in southern cities. These data suggest that the pandemic began earlier in northern cities than in southern cities.

      Northern cities peaked around December 18 (interquartile range: December 17–20), whereas southern cities peaked later on December 20 (interquartile range: December 17–23), indicating earlier pandemic intensification in the north.
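Summaries of this kind (median and interquartile range of per-city dates) can be computed by converting dates to ordinals. The sketch below uses Python's standard `statistics` module; the function name is hypothetical, the quartile method ("inclusive") is an assumption since the text does not state one, and the dates in the usage line are invented for illustration rather than taken from the study.

```python
from datetime import date
from statistics import median, quantiles


def summarize_dates(dates):
    """Return (Q1, median, Q3) of a list of dates, rounded to whole days."""
    ordinals = sorted(d.toordinal() for d in dates)
    q1, _, q3 = quantiles(ordinals, n=4, method="inclusive")
    return tuple(date.fromordinal(round(v)) for v in (q1, median(ordinals), q3))


# Invented onset dates for five hypothetical cities.
example = [date(2022, 12, d) for d in (12, 4, 9, 8, 10)]
q1, mid, q3 = summarize_dates(example)
```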

      Comparison with official reports. The research findings indicated a pandemic onset date of December 8, consistent with the China CDC finding that over 96% of cities passed their peak ILI activity before January 2023 (9). The results of this study were highly consistent with nucleic acid assay results. Additionally, the peak dates for the Beijing (December 13) and Tianjin (December 17) municipalities aligned with the China CDC reports of December 14 and 19, respectively.

      Furthermore, the study observed a sharp increase in search volume starting on December 9 in northern cities and December 11 in southern cities. This increase corresponds with the rise in ILI% reported by the China CDC from 824 sentinel hospitals during the 50th week (December 12–18) (10).

      Fever clinic visit data, as reported by the China CDC, peaked on December 23, 2022 (11). Our study found that the peak occurred from December 18 to 20, 2022, suggesting that the Baidu Index may be a leading indicator of healthcare utilization during an outbreak.
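The lead time implied by these dates follows from simple date arithmetic. The snippet below is only an illustration with hypothetical variable names; it uses the fever clinic peak and the median northern and southern search peaks reported above.

```python
from datetime import date

clinic_peak = date(2022, 12, 23)  # fever clinic visits peak (China CDC)
search_peaks = {
    "northern": date(2022, 12, 18),  # median Baidu-based peak, northern cities
    "southern": date(2022, 12, 20),  # median Baidu-based peak, southern cities
}

# Days by which the search-based peak precedes the fever clinic peak.
lead_days = {region: (clinic_peak - peak).days for region, peak in search_peaks.items()}
```

Under these figures the search-based peak precedes the fever clinic peak by five days in the north and three days in the south.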

      Overall, the China CDC data on ILI percentage and fever clinic visits support the study’s findings, validating the methodology and the use of search engine data as a reliable, real-time indicator of pandemic spread and intensity.

    • To test our model’s effectiveness, we analyzed the 2023 influenza outbreak (January 1 to April 1, 2023) in northern and southern PLADs, Yichang, and Weifang to demonstrate its utility across different regions. We used October to December 2018 as our baseline to account for changes in internet usage. Outbreak onset was determined using an 80th percentile threshold of search volume for seven consecutive days, a method that proved effective across all regions. For peak detection, we used a three-day continuous decline in search activity after an initial increase, with calculations based on a three-day moving average (m-value). This approach allowed for rapid, minimal data analysis while accurately capturing the outbreak peak.

    • The onset dates for Northern and Southern China were February 20 and February 26, respectively, and February 25 and February 27 for Weifang and Yichang, respectively. Table 1 presents the peak dates for Northern China, Southern China, Yichang, and Weifang.

      Table 1.  Evaluation and comparison of criteria for reaching the peak.

      | Region | Excess search index (m=3) | Peak date (m=3) | Excess search index (m=7) | Peak date (m=7) | Excess search index (m=31) | Peak date (m=31) |
      |---|---|---|---|---|---|---|
      | Northern PLADs | 36.69 | March 10 | 39.37 | March 16 | 42.11 | March 13 |
      | Southern PLADs | 24.00 | March 14 | 30.70 | March 9 | 24.07 | March 14 |
      | Yichang | 39.87 | March 16 | 36.75 | March 10 | 39.04 | March 16 |
      | Weifang | 32.46 | March 10 | 24.09 | March 14 | 30.66 | March 9 |

      Abbreviation: PLADs=provincial-level administrative divisions.

    • ILI percentages reported by the China CDC rose in Northern and Southern China during week 9 of 2023 (February 27–March 5), remained high in week 10 (March 6–12), and declined in week 11 (March 13–19). The onset and peak dates for ILI activity in these regions aligned with these results.

      Peak dates in Yichang (March 16) and Weifang (March 10) preceded the dates for the highest ILI% (March 29 and March 20, respectively). The median dates for the three highest ILI% (March 26 and March 16) were over 10 and 7 days later, respectively (Figure 3). Additionally, the epidemic onset period reported by the Yichang CDC and Weifang CDC was week 9 in 2023, consistent with our onset dates of February 27 and February 25.

      Figure 3.  Onset and peak dates of the influenza epidemic in 2023. (A) Onset and peak dates evaluation in northern PLADs. (B) Onset and peak dates evaluation in southern PLADs. (C) Onset and peak dates evaluation in Yichang. (D) Onset and peak dates evaluation in Weifang.

      Note: The green dotted line represents the onset date identified in this study, while the green block indicates the corresponding officially reported week. The red dotted line signifies the estimated peak time of the epidemic based on this study, and the red block represents the officially reported week for the peak.

      Abbreviation: PLAD=provincial-level administrative division.

    • This research underscores the vital role of search engine analytics in bolstering public health surveillance and early warning systems. By leveraging internet search data, this study demonstrates the potential for a more nuanced and immediate understanding of disease dynamics, facilitating the early identification of both pandemic outbreaks and seasonal epidemic patterns. Using varied threshold levels, this approach discerned the preliminary and peak phases of disease spread more rapidly and accurately than traditional methods.

      As a potential supplement to traditional surveillance systems, internet search data has shown promise in identifying trends and peak timing before official reports (December 12–18, December 23, 2022) (10). This earlier detection is attributable to the immediacy of internet data, which circumvents lengthy processing and validation steps required for official reporting, and its ability to reflect real-time shifts in public concern and interest.

      Validation in two cities under seasonal influenza scenarios in Northern and Southern China has confirmed that this procedure is extrapolatable for identifying the onset and peak of respiratory infectious diseases. The emergence of novel respiratory pathogens is unpredictable, and traditional surveillance systems often struggle to adapt quickly when a new pathogen spreads rapidly and causes a pandemic. Due to the ready availability of data and the simplicity of the method, this procedure can serve as an alternative option. Moreover, it remains timely and effective in detecting patterns even when epidemiological trends of seasonal respiratory diseases change.

      In application, different thresholds may need to be adopted based on the actual conditions in areas with varying population sizes, search behaviors, climates, and epidemiological characteristics of diseases. This study analyzed only 255 cities, excluding those with a continuous Baidu index of zero. This exclusion could be related to the scale of internet users and their online habits. Detection occurred earlier in northern cities than in southern cities, potentially due to the stable seasonal epidemic trends historically observed in the north, typically characterized by a single peak in cases. Additionally, population size might play a role in the timeliness of detection. Weifang has a population of 9.4 million compared to Yichang’s 4 million, suggesting that the earlier detection in Weifang might also be related to its larger population size.

      This study is subject to some limitations. First, due to the lack of referential data from surveillance systems during the pandemic, this study was unable to validate the results for the 255 cities. Second, as the study aimed to provide a scalable and simple tool, it did not account for other factors that could affect Baidu searches, which may impact the accuracy of the results.
