-
Data on confirmed cases in the mainland of China were obtained from the official websites of China’s National Health Commission and local health commissions. Hubei Province was excluded because its diagnostic criteria were revised (4). The numbers of cumulative confirmed cases in Guangdong, Henan, Zhejiang, Hunan, and Anhui provinces, as well as the total number of cases in the mainland of China (except Hubei), were used for developing the model. Provincial data from the first report date to February 24, 2020, and national data from January 19 to February 24, 2020, were included in the analysis. The daily cumulative numbers of confirmed cases in the US, Republic of Korea, and Italy were collected from the World Health Organization (WHO) COVID-19 situation reports and the Johns Hopkins University dashboard (5), and these data were used for model testing and trend analysis.
ARIMA and Holt models were applied for short-term prediction of the daily cumulative number of cases in China (except Hubei) and the selected provinces. The ARIMA model capitalizes on the associations among sequentially lagged values in the given dataset. The Holt method, also known as double exponential smoothing, is an extension of single exponential smoothing and can be used to analyze time series data with both a level and a trend. Model performance was compared across 5-day, 6-day, and 7-day prediction time spans. Mean absolute percentage error (MAPE) (Equation 1) was used to evaluate the agreement between the predicted and actual values, and the model with the lowest MAPE was selected.
$$ MAPE=\frac{\sum \frac{\left|A-P\right|}{A}\times 100}{N} \tag{1} $$

where A is the actual value, P is the predicted value, and N is the number of days predicted.
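As a small illustration of Equation 1, the following Python sketch computes MAPE for one prediction window. The function name `mape` and the case counts are hypothetical and chosen only for illustration; they are not data from the study, and the study itself was carried out in R.

```python
def mape(actual, predicted):
    """Mean absolute percentage error in percent, following Equation 1."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equally long")
    # Sum |A - P| / A over the predicted days, scale to percent, divide by N.
    total = sum(abs(a - p) / a for a, p in zip(actual, predicted))
    return total * 100 / len(actual)


# Hypothetical cumulative case counts for a 5-day window (not study data).
actual = [1000, 1100, 1200, 1300, 1400]
predicted = [990, 1111, 1200, 1326, 1386]
print(round(mape(actual, predicted), 2))
```

Because each error is divided by the actual value, the same absolute miss weighs more heavily early in an outbreak, when cumulative counts are small.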
The mainland of China (except Hubei) with a 5-day prediction time span is used here as an example. First, the daily cumulative case numbers from the first day to January 31, 2020 were used to predict February 1 to February 5. Then, data up to February 5 were used to predict February 6 to February 10, and the prediction looped to the next 5-day span until February 24. A prediction was thus made for every 5-day period, and each model was re-calibrated with the updated data.

The epidemic trend was analyzed by applying the optimal model to data from the US, Italy, and Republic of Korea. Predictions started once new cases had been reported for three consecutive days. Therefore, confirmed cases from the first report day to March 20, February 18, and February 23 were used as the first training sets for the US, Republic of Korea, and Italy, respectively. Subsequent predictions were made every 5 days until May 19, 2020 using the same method. Since the prediction error tends to increase as the prediction horizon lengthens, only the predicted values for every fifth day were kept, so that the trend analysis reflected a consistent forecast horizon. By analyzing the pattern of differences between predicted and observed trends (Equation 2), the time points that might reflect either the start of an intervention effect or the occurrence of unexpected incidents can be detected. For example, if a negative difference pattern (more actual cases than predicted) changed to a positive one (more predicted cases than actual), the change point might reflect the effects of interventions, since the predicted trend still followed the previous upward pattern whereas the real trend had flattened.
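The rolling re-calibration described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the forecaster is a hand-written Holt linear-trend method with fixed smoothing parameters `alpha` and `beta` (in practice these are usually estimated from the data), and the function names and the input series are hypothetical.

```python
def holt_forecast(series, horizon, alpha=0.8, beta=0.2):
    """Forecast `horizon` steps ahead with Holt's linear-trend smoothing.

    alpha and beta are fixed illustrative values (an assumption), not
    parameters estimated from the data.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        previous_level = level
        # Update the smoothed level, then the smoothed trend.
        level = alpha * y + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]


def rolling_origin_forecasts(series, first_train_len, horizon=5):
    """Refit on all data up to each origin and forecast the next window."""
    windows = []
    origin = first_train_len
    while origin < len(series):
        steps = min(horizon, len(series) - origin)  # last window may be shorter
        windows.append((origin, holt_forecast(series[:origin], steps)))
        origin += steps
    return windows


# Hypothetical cumulative series of 24 days (not study data).
series = [100 + 20 * day for day in range(24)]
for origin, forecasts in rolling_origin_forecasts(series, first_train_len=10):
    print(origin, [round(f) for f in forecasts])
```

With 24 days of data and a first training set of 10 days, the loop produces windows of 5, 5, and 4 days, mirroring how the final window in the example above (February 21 to 24) is shorter than the others.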
$$ Difference=P-A \tag{2} $$

Modelling analyses were performed using the auto.arima() and holt() functions in the forecast package in R software (version 3.6.2; RStudio Inc; US) (6).
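The difference pattern of Equation 2 can be illustrated with a short Python sketch. The helper names and the numbers are hypothetical, and flagging every negative-to-positive flip is a simplifying assumption made here; the text describes looking for a change in the overall pattern rather than a formal detection rule.

```python
def differences(predicted, actual):
    """Equation 2: predicted minus actual, one value per kept forecast day."""
    return [p - a for p, a in zip(predicted, actual)]


def negative_to_positive_changes(diffs):
    """Return indices where the difference turns positive after being negative."""
    changes = []
    last_sign = 0
    for i, d in enumerate(diffs):
        sign = (d > 0) - (d < 0)
        if sign == 0:
            continue  # a zero difference leaves the current pattern unchanged
        if last_sign < 0 and sign > 0:
            changes.append(i)
        last_sign = sign
    return changes


# Hypothetical fifth-day predictions and reported counts (not study data).
predicted = [95, 180, 290, 420, 560, 640]
actual = [100, 200, 300, 400, 500, 650]
diffs = differences(predicted, actual)
print(diffs)
print(negative_to_positive_changes(diffs))
```

A single flip can be noise, so in practice one would look for differences that stay positive over several consecutive windows before reading the change point as an intervention effect.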
-
For a 5-day prediction time span, both ARIMA and Holt models performed well, with overall MAPEs of roughly 5% or lower in every study region (Table 1). The overall prediction accuracy of the ARIMA model was slightly better than that of the Holt model (overall MAPEs: 3.07% vs. 4.11%) (Figure 1). For the ARIMA model, the MAPE was lower for the 5-day prediction (3.07%, range: 2.05%–5.05%) than for the 6-day (4.31%, range: 3.06%–6.72%) and the 7-day (5.13%, range: 2.02%–10.26%) predictions. The Holt model yielded a similar result and also favored the 5-day prediction span.
| Area | Model | February 1–5 | February 6–10 | February 11–15 | February 16–20 | February 21–24 | February 1–24 |
|---|---|---|---|---|---|---|---|
| The mainland of China | ARIMA | 0.99 | 4.14 | 1.28 | 1.07 | 6.25 | 2.60 |
| | Holt | 0.87 | 4.15 | 1.86 | 1.69 | 4.61 | 2.55 |
| Guangdong | ARIMA | 14.80 | 5.01 | 0.89 | 3.13 | 0.54 | 5.05 |
| | Holt | 14.80 | 4.95 | 1.21 | 4.76 | 0.63 | 5.47 |
| Zhejiang | ARIMA | 2.99 | 5.16 | 1.99 | 0.57 | 5.64 | 3.17 |
| | Holt | 3.82 | 5.16 | 3.45 | 0.64 | 5.64 | 3.66 |
| Henan | ARIMA | 4.94 | 8.58 | 0.40 | 1.71 | 0.14 | 3.28 |
| | Holt | 4.94 | 8.63 | 1.22 | 2.41 | 0.14 | 3.61 |
| Hunan | ARIMA | 4.91 | 2.84 | 1.62 | 0.32 | 0.17 | 2.05 |
| | Holt | 13.60 | 4.11 | 3.53 | 0.87 | 0.10 | 4.62 |
| Anhui | ARIMA | 2.03 | 2.02 | 2.21 | 4.36 | 0.20 | 2.25 |
| | Holt | 14.27 | 2.10 | 4.64 | 1.85 | 0.05 | 4.77 |

Abbreviations: ARIMA=autoregressive integrated moving average model; COVID-19=coronavirus disease 2019; Holt=Holt exponential smoothing model; MAPEs=mean absolute percentage errors.

Table 1. MAPEs (%) between reported and predicted numbers of COVID-19 cases in the mainland of China (excluding Hubei Province) and five provinces in China using ARIMA and Holt models.
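As a worked check, the overall MAPEs quoted in the text (3.07% vs. 4.11%) appear to be the unweighted means of the February 1–24 column of Table 1 across the six study regions. This is an inference from the numbers rather than something the text states. The short Python sketch below reproduces the arithmetic; the variable names are illustrative.

```python
# February 1-24 MAPEs (%) copied from Table 1, in the table's row order:
# mainland of China, Guangdong, Zhejiang, Henan, Hunan, Anhui.
arima_overall = [2.60, 5.05, 3.17, 3.28, 2.05, 2.25]
holt_overall = [2.55, 5.47, 3.66, 3.61, 4.62, 4.77]


def mean(values):
    """Unweighted arithmetic mean (assumed aggregation, see lead-in)."""
    return sum(values) / len(values)


print(round(mean(arima_overall), 2), min(arima_overall), max(arima_overall))
print(round(mean(holt_overall), 2), min(holt_overall), max(holt_overall))
```

The ARIMA values give a mean of 3.07 with a range of 2.05 to 5.05, and the Holt values give a mean of 4.11, matching the figures reported in the text.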
Figure 1. Comparison between reported and predicted numbers of COVID-19 cases in the preceding 5 days using the ARIMA model and the Holt model in the mainland of China (except Hubei) (A), Guangdong (B), Zhejiang (C), Henan (D), Hunan (E), and Anhui (F).
Abbreviations: COVID-19=coronavirus disease 2019; ARIMA=autoregressive integrated moving average model; Holt=Holt exponential smoothing model.

Based on these results, the ARIMA model with a 5-day prediction time span was further tested using data from the US, Italy, and Republic of Korea. ARIMA also performed well for these three countries in the later stages of their epidemics (Figure 2). Almost all the difference values were positive after February 7 in the mainland of China, March 5 in Republic of Korea, and April 27 in Italy. However, the differences between predicted and observed values still fluctuated between positive and negative in the US as of May 19.
Figure 2. Differences and MAPEs (%) between reported and predicted numbers of daily COVID-19 cumulative confirmed cases in the mainland of China (A), Republic of Korea (B), Italy (C), and the United States (D).
Abbreviations: COVID-19=coronavirus disease 2019; MAPEs=mean absolute percentage errors; Differences=the predicted number minus the reported number.