
Methods and Applications: Applying Machine Learning Approach to Explore Childhood Circumstances and Self-Rated Health in Old Age — China and the US, 2020–2021

  • Abstract

    Introduction

    Childhood circumstances impact senior health, prompting the introduction of machine learning methods to assess their individual and collective contributions to senior health.

    Methods

    Using the Health and Retirement Study (HRS) and the China Health and Retirement Longitudinal Study (CHARLS), we analyzed 2,434 American and 5,612 Chinese participants aged 60 and above. Conditional inference trees and forests were employed to estimate the influence of childhood circumstances on self-rated health (SRH).

    Results

    The conventional method estimated higher inequality of opportunity (IOP) values in both China (0.039, accounting for 22.67% of the total Gini coefficient 0.172) and the US (0.067, accounting for 35.08% of the total Gini coefficient 0.191). In contrast, the conditional inference tree yielded lower estimates (China: 0.022, accounting for 12.79% of 0.172; US: 0.044, accounting for 23.04% of 0.191), as did the forest (China: 0.035, accounting for 20.35% of 0.172; US: 0.054, accounting for 28.27% of 0.191). Childhood health, financial status, and regional differences were key determinants of senior health. The conditional inference forest consistently outperformed others in predictive accuracy, as demonstrated by lower out-of-sample mean squared error (MSE).

    Discussion

    The findings emphasize the need for early-life interventions to promote health equity in aging populations. Machine learning shows potential for identifying the contributing factors.

  • References

    [1] Moffitt TE, Belsky DW, Danese A, Poulton R, Caspi A. The longitudinal study of aging in human young adults: knowledge gaps and research agenda. J Gerontol A Biol Sci Med Sci 2017;72(2):210–5. https://doi.org/10.1093/gerona/glw191
    [2] Bor J, Cohen GH, Galea S. Population health in an era of rising income inequality: USA, 1980-2015. Lancet 2017;389(10077):1475–90. https://doi.org/10.1016/S0140-6736(17)30571-8
    [3] Carrieri V, Jones AM. Inequality of opportunity in health: a decomposition-based approach. Health Econ 2018;27(12):1981–95. https://doi.org/10.1002/hec.3814
    [4] Moody-Ayers S, Lindquist K, Sen S, Covinsky KE. Childhood social and economic well-being and health in older age. Am J Epidemiol 2007;166(9):1059–67. https://doi.org/10.1093/aje/kwm185
    [5] Strauss J, Witoelar F, Meng QQ, Chen XX, Zhao YH, Sikoki B, et al. Cognition and SES relationships among the mid-aged and elderly: a comparison of China and Indonesia. National Bureau of Economic Research; 2018 May Report No.: 24583. https://www.nber.org/papers/w24583.
    [6] Isen A, Rossin-Slater M, Walker WR. Every breath you take—every dollar you’ll make: the long-term consequences of the clean air act of 1970. J Polit Econ 2017;125(3):848–902. https://doi.org/10.1086/691465
    [7] Roemer JE. Equality of opportunity. Cambridge: Harvard University Press. 1998; p. 130.
    [8] Roemer JE, Trannoy A. Equality of opportunity: theory and measurement. J Econ Lit 2016;54(4):1288–332. https://doi.org/10.1257/jel.20151206
    [9] Marmot M, Friel S, Bell R, Houweling TAJ, Taylor S, Commission on Social Determinants of Health. Closing the gap in a generation: health equity through action on the social determinants of health. Lancet 2008;372(9650):1661–9. https://doi.org/10.1016/S0140-6736(08)61690-6
    [10] Brunori P, Hufe P, Mahler D. The roots of inequality: estimating inequality of opportunity from regression trees and forests. Scand J Econ 2023;125(4):900–32. https://doi.org/10.1111/sjoe.12530
    [11] Ferreira FHG, Gignoux J. The measurement of inequality of opportunity: theory and an application to Latin America. Rev Income Wealth 2011;57(4):622–57. https://doi.org/10.1111/j.1475-4991.2011.00467.x
    [12] Hufe P, Peichl A, Roemer J, Ungerer M. Inequality of income acquisition: the role of childhood circumstances. Soc Choice Welf 2017;49(3):499–544. https://doi.org/10.1007/s00355-017-1044-x
    [13] Ferreira FHG, Gignoux J. The measurement of educational inequality: achievement and opportunity. World Bank Econ Rev 2014;28(2):210–46. https://doi.org/10.1093/wber/lht004
    [14] Qi YJ. Random forest for bioinformatics. In: Zhang C, Ma YQ, editors. Ensemble machine learning: methods and applications. New York: Springer. 2012; p. 307–23. http://dx.doi.org/10.1007/978-1-4419-9326-7_11.
    [15] Schneider J, Hapfelmeier A, Thöres S, Obermeier A, Schulz C, Pförringer D, et al. Mortality Risk for Acute Cholangitis (MAC): a risk prediction model for in-hospital mortality in patients with acute cholangitis. BMC Gastroenterol 2016;16(1):15. https://doi.org/10.1186/s12876-016-0428-1


  • 1. Department of Health, Society & Behavior, Public Health, University of California, Irvine, CA, USA
  • 2. Department of Statistics and Data Science, Yale University, New Haven, CT, USA
  • 3. Department of Internal Medicine, Yale School of Medicine, New Haven, CT, USA
  • 4. Department of Health Policy and Management, Yale School of Public Health, New Haven, CT, USA
  • 5. Department of Economics, Yale University, New Haven, CT, USA
  • Corresponding author:

    Xi Chen, xi.chen@yale.edu

  • Funding: Supported by the U.S. National Institute on Aging (R01AG077529; P30AG021342; R01AG037031)
  • Online Date: March 15 2024
    Issue Date: March 15 2024
    doi: 10.46234/ccdcw2024.043
    • The global phenomenon of rapid population aging, coupled with the growing health burden among older adults, highlights the importance of investigating the long-term effects of early life stages on the aging process (1). Previous research in the fields of economics and epidemiology has consistently shown that childhood circumstances have a significant impact on later-life health outcomes. This suggests that childhood is a crucial period for implementing interventions aimed at reducing health disparities (2). These circumstances encompass a wide range of factors, including parental influences (3), family socioeconomic status (SES) (4), as well as community and environmental factors such as rural/urban status (5) and natural surroundings (6).

      Both early-life and later-life factors contribute to health outcomes in older age. However, childhood circumstances, particularly those beyond an individual’s control, are considered the most unacceptable and illegitimate sources of health inequality in older age (7,8). Inequality attributable to childhood circumstances is commonly referred to as inequality of opportunity (IOP). The focus on reducing IOP arises from a wide-ranging political and social discussion aimed at creating equal opportunities during the early stages of life and addressing the unfair health inequalities identified by the World Health Organization Commission on Social Determinants of Health (9).

      Despite the considerable amount of research on the impact of childhood circumstances on health outcomes, methodological challenges remain. These include the arbitrary selection of childhood circumstances and potential biases in estimating health inequality among older adults (10,11). In our study, we aimed to overcome these challenges by using machine learning techniques to identify the most relevant set of childhood circumstances. This approach lets the data inform our understanding of unequal childhood circumstances, minimizing the influence of researcher bias on the model specification (10,12). Furthermore, we compared our findings with those obtained using the conventional parametric Roemer method in order to highlight the improvements our approach offers in measuring inequality throughout an individual’s life.

    • Our study utilized data from the Health and Retirement Study (HRS) in the US and the China Health and Retirement Longitudinal Study (CHARLS) in China. We analyzed the 2020–2021 wave of HRS and the 2020 wave of CHARLS, both of which were matched with life history surveys. The final sample consisted of 2,434 Americans and 5,612 Chinese individuals aged 60 and above. Self-rated health (SRH) was used as the health outcome measure, assessed on a scale from excellent (=1) to poor (=5) in both surveys. The analysis included data on 43 childhood circumstances from HRS and 36 from CHARLS, categorized into seven domains such as birth environment, family SES, and childhood relationships (Supplementary Tables S1 and S2). While there were slight variations, the domains predominantly included the same core measures for both countries. The analysis was conducted using R (version 4.3.1; R Core Team, Vienna, Austria).

      Supplementary Material provides a comprehensive conceptual and analytic framework for this study. Initially, we used the Roemer method with Shapley value decomposition to estimate the individual and collective impact of childhood circumstances on health inequality in later life. This framework serves as a foundation for evaluating policy interventions. By partitioning the population into distinct, non-overlapping groups based on observable circumstances, such as parental education (high vs. low) and financial hardship (yes vs. no), we can derive a counterfactual distribution of health outcomes. The disparity in health across these groups can be attributed solely to differences in childhood circumstances, which we refer to as the IOP. In our study, we quantified the contribution of childhood circumstances to health inequality using the Gini coefficient (8,11). We also calculated the relative IOP by dividing this measure of absolute health inequality by the overall health inequality, which gives the proportion of health inequality explained by childhood circumstances. While not establishing causality, this analysis provides insight into how strongly childhood circumstances are statistically associated with health inequality (13).
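The absolute and relative IOP described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (the study was run in R and uses a parametric Roemer model with Shapley decomposition); it swaps in the simpler non-parametric version, where each person's outcome is replaced by the mean outcome of their circumstance type. The function names `gini`, `smoothed_distribution`, and `inequality_of_opportunity` are hypothetical, and the outcome is assumed to be a non-negative number.

```python
from collections import defaultdict


def gini(values):
    """Gini coefficient of non-negative values (0 means perfect equality)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted formula: G = 2 * sum(rank * x) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def smoothed_distribution(outcomes, types):
    """Counterfactual distribution: each outcome becomes its type's mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for y, t in zip(outcomes, types):
        sums[t] += y
        counts[t] += 1
    return [sums[t] / counts[t] for t in types]


def inequality_of_opportunity(outcomes, types):
    """Return (absolute IOP, relative IOP) for a given partition into types."""
    total = gini(outcomes)
    absolute = gini(smoothed_distribution(outcomes, types))
    relative = absolute / total if total > 0 else 0.0
    return absolute, relative


# Toy data (assumed, not from the study): two circumstance types, "a" and "b".
absolute, relative = inequality_of_opportunity([1, 2, 3, 4], ["a", "a", "b", "b"])
print(round(absolute, 3), round(relative, 3))
```

In this toy case the within-type variation is averaged away, so the remaining inequality is entirely between types; the relative IOP is the share of the overall Gini that survives the smoothing.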

      Conditional inference trees are particularly advantageous for analyzing the impact of childhood circumstances on IOP. They allow for sequential hypothesis tests and provide a visual representation for comparing different childhood circumstances. Each test examines IOP within a specific subset of the population, and the depth of the tree reflects the diversity of childhood circumstances within a society. Additionally, these trees address the issue of arbitrary variable and model selection that often arises in the IOP literature. They consider a comprehensive set of observed variables that qualify as childhood circumstances. In our study, we used these childhood circumstances to divide the population into distinct groups (terminal nodes) in the context of regression trees. We calculated the predicted outcome value for an individual observation as the average outcome of the group to which the individual was assigned, taking into account the number of observations in that group. Furthermore, we used 5-fold cross-validation to optimize the model parameters. We found that our results are consistent regardless of the choice of K.
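The sequential hypothesis tests mentioned above can be sketched for a single node. The Python below is an illustrative simplification, not the routine used in the study: it assumes binary (0/1) circumstances, uses the absolute difference in mean outcomes as the test statistic, approximates p-values by Monte Carlo permutation, and applies a Bonferroni adjustment. All names (`permutation_p_value`, `select_split`, the toy variables) are hypothetical.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def abs_mean_gap(outcomes, flags):
    """Absolute difference in mean outcome between flag == 1 and flag == 0."""
    ones = [y for y, f in zip(outcomes, flags) if f == 1]
    zeros = [y for y, f in zip(outcomes, flags) if f == 0]
    if not ones or not zeros:
        return 0.0
    return abs(mean(ones) - mean(zeros))


def permutation_p_value(outcomes, flags, n_perm=999, seed=0):
    """Monte Carlo p-value for independence of the outcome and one circumstance."""
    rng = random.Random(seed)
    observed = abs_mean_gap(outcomes, flags)
    shuffled = list(outcomes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs_mean_gap(shuffled, flags) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def select_split(outcomes, circumstances, alpha=0.05, n_perm=999, seed=0):
    """Pick the circumstance with the smallest Bonferroni-adjusted p-value.

    Returns (None, p) when no test is significant, i.e. the node stays terminal.
    """
    m = len(circumstances)
    best_name, best_p = None, 1.0
    for name, flags in circumstances.items():
        p = min(1.0, m * permutation_p_value(outcomes, flags, n_perm, seed))
        if p < best_p:
            best_name, best_p = name, p
    if best_p > alpha:
        return None, best_p
    return best_name, best_p


# Toy data (assumed): SRH coded 1 (excellent) to 5 (poor) for 20 people.
srh = [4, 5, 4, 5, 5, 4, 5, 4, 5, 4, 1, 2, 1, 2, 2, 1, 2, 1, 2, 1]
candidates = {
    "poor_childhood_health": [1] * 10 + [0] * 10,
    "unrelated_flag": [1, 0] * 10,
}
print(select_split(srh, candidates))
```

A full tree would apply `select_split` recursively to each resulting subgroup and stop where no circumstance passes the test, which is what makes the segmentation data-driven rather than chosen by the researcher.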

      Conditional inference trees provide a non-arbitrary segmentation of the population. However, they have limitations: they make limited use of the data, struggle with highly correlated childhood circumstances, and exhibit high prediction variance, which makes them sensitive to changes in the sample. To address these limitations, we employed a random forest. A random forest builds many decision trees from resampled data and uses a random selection of predictors at each split to reduce prediction variance, resulting in a more reliable model. In this study, 200 trees were used to predict outcomes, a number chosen to balance computational cost and prediction accuracy (Supplementary Figure S1). A 4-step method was applied, involving the random selection of half the observations in each tree, along with random data subsampling and subsets of circumstances, to determine optimal parameters through out-of-bag error minimization. Predictor importance for each childhood circumstance was evaluated using the residual sum of squares (RSS).
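The resampling scheme described above can be sketched as follows. This Python fragment is only a schematic of the bookkeeping, under stated assumptions: each tree is grown on half of the observations drawn without replacement, a random subset of predictors is considered at each split, and the out-of-bag error averages predictions only from trees that did not see an observation. The tree-growing step itself is omitted, and the helper names are hypothetical.

```python
import random


def subsample(n_obs, rng, fraction=0.5):
    """Draw in-bag indices WITHOUT replacement; the rest are out-of-bag."""
    in_bag = set(rng.sample(range(n_obs), int(n_obs * fraction)))
    out_of_bag = [i for i in range(n_obs) if i not in in_bag]
    return sorted(in_bag), out_of_bag


def split_candidates(predictors, k, rng):
    """Random subset of k predictors considered at a single split."""
    return rng.sample(list(predictors), k)


def out_of_bag_mse(outcomes, tree_predictions, out_of_bag_sets):
    """Out-of-bag MSE of the forest.

    tree_predictions[t][i] is tree t's prediction for observation i;
    out_of_bag_sets[t] holds the observations tree t was NOT trained on.
    """
    squared_errors = []
    for i, y in enumerate(outcomes):
        votes = [
            preds[i]
            for preds, oob in zip(tree_predictions, out_of_bag_sets)
            if i in oob
        ]
        if votes:  # skip observations that were in-bag for every tree
            squared_errors.append((y - sum(votes) / len(votes)) ** 2)
    return sum(squared_errors) / len(squared_errors)


rng = random.Random(1)
in_bag, oob = subsample(10, rng)
print(in_bag, oob, split_candidates(["health", "books", "region", "ses"], 2, rng))
```

Tuning would then amount to repeating this for different settings (for example, the number of candidate predictors per split) and keeping the setting with the lowest `out_of_bag_mse`.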

      To evaluate potential biases in measuring IOP in health that could affect the accuracy of predictions, we divided the dataset into a training set comprising 2/3 of the total sample size (N) and a test set comprising the remaining 1/3. The training set was used to train each model, while the test set was used to assess the performance of three methods: the conventional parametric Roemer method, conditional inference trees, and the conditional inference forest.
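A 2/3 versus 1/3 split like the one described above could be drawn as in this short Python sketch. The seed, the rounding rule, and the function name `train_test_split` are assumptions for illustration; the source does not state how the split was randomized.

```python
import random


def train_test_split(n_obs, train_fraction=2 / 3, seed=0):
    """Randomly assign observation indices to a training and a test set."""
    rng = random.Random(seed)
    indices = list(range(n_obs))
    rng.shuffle(indices)
    n_train = round(n_obs * train_fraction)  # rounding rule is an assumption
    return sorted(indices[:n_train]), sorted(indices[n_train:])


# Sample sizes reported in the text: 5,612 (CHARLS) and 2,434 (HRS).
for n in (5612, 2434):
    train, test = train_test_split(n)
    print(n, len(train), len(test))
```

Fitting on the training indices only and scoring on the held-out indices is what makes the later MSE comparison an out-of-sample one.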

    • First, the Gini coefficient indicated a higher level of inequality in self-rated health in the US than in China. We then used the Gini coefficients to measure the IOP in the counterfactual distribution. Figure 1 illustrates that the conventional parametric Roemer method yielded the highest estimates of IOP, followed by the conditional inference forest and the conditional inference tree. Specifically, under the Roemer method, IOP accounted for 22.67% of the inequality in self-rated health in China (0.039 out of 0.172 total Gini coefficient) and 35.08% in the US (0.067 out of 0.191 total Gini coefficient). In contrast, under the conditional inference tree, IOP accounted for 12.79% in China (0.022 out of 0.172 total Gini coefficient) and 23.04% in the US (0.044 out of 0.191 total Gini coefficient), while under the forest it accounted for 20.35% in China (0.035 out of 0.172 total Gini coefficient) and 28.27% in the US (0.054 out of 0.191 total Gini coefficient).

      Figure 1. Correlation of estimates by method.

      Note: The plot shows the estimates from each method (i.e., the conventional parametric Roemer method and the conditional inference trees) against the estimates from the conditional inference forest. The x-axis represents the scale of Gini coefficients for the forest method. Gini coefficients range between 0 and 1, with larger values indicating greater inequality. The y-axis represents the scale of Gini coefficients for the Roemer and tree methods. The black diagonal is the 45-degree line, on which all data points would align if the methods were perfectly congruent. The plot confirms that the conventional parametric Roemer method delivers higher estimates than the forest, while tree estimates are lower than those based on the forest.

      Abbreviation: SRH=self-rated health.
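The percentages reported in the Results follow directly from dividing each absolute IOP estimate by the total Gini coefficient of self-rated health. The worked arithmetic below uses only figures stated in the text; the symbol $G$ for the Gini coefficient is notation introduced here for illustration.

```latex
\[
\text{relative IOP} \;=\; \frac{G(\text{counterfactual SRH})}{G(\text{observed SRH})}
\]
\begin{align*}
\text{China, Roemer:} \quad & 0.039 / 0.172 \approx 22.67\% \\
\text{China, tree:}   \quad & 0.022 / 0.172 \approx 12.79\% \\
\text{China, forest:} \quad & 0.035 / 0.172 \approx 20.35\% \\
\text{US, Roemer:}    \quad & 0.067 / 0.191 \approx 35.08\% \\
\text{US, tree:}      \quad & 0.044 / 0.191 \approx 23.04\% \\
\text{US, forest:}    \quad & 0.054 / 0.191 \approx 28.27\%
\end{align*}
```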

      Figure 2A shows the structure of the IOP for self-rated health in China using a tree with five terminal nodes. The tree is formed by factors such as childhood health, birth region, and childhood family financial status. The most advantaged type (terminal node 5) includes people with good childhood health and good family financial status who were born in Eastern China. In contrast, the group with the worst self-rated health (terminal node 6) typically had poorer childhood health. In the US, as depicted in Figure 2B, individuals with poor childhood health fell into the disadvantaged circumstance type (terminal node 7). Individuals with certain favorable conditions, such as having more books at home, being healthy in childhood, and being White, generally reported better health in old age (terminal node 6).

      Figure 2. Conditional inference tree for self-rated health. (A) China; (B) the US.

      Figure 3A reveals that in China, using the conditional inference forest, the key factors affecting self-rated health are childhood health and being born in Eastern China, which corroborates findings from the conditional inference trees (Figure 2A). Additionally, parents’ health status (staying in bed for a long time) and relationship with parents also have a high impact on self-rated health in older ages. Similarly, Figure 3B demonstrates that in the US, childhood health, number of books at home at age 10, and race/ethnicity are important factors, which largely aligns with the results obtained through conditional inference trees (Figure 2B).

      Figure 3. Importance of childhood circumstances to self-rated health using conditional inference forest. (A) China; (B) the US.

      As previously mentioned, all tested models were designed to minimize the mean squared error (MSE). We derived 95% confidence intervals using 200 bootstrap re-samples of the test data. The MSE of the random forest model was standardized to 1 to facilitate comparison of prediction performance across models, so a standardized MSE greater than 1 indicates a poorer out-of-sample fit than the forest. For self-rated health, both the conditional inference tree and the parametric Roemer method performed worse than the conditional inference forest, as shown in Figure 4A and 4B. On average, the conditional inference trees had lower test errors than the conventional parametric Roemer method.

      Figure 4. Comparison of models’ test errors. (A) Parametric method vs. random forest; (B) Conditional inference trees vs. random forest.

      Note: All models aim to minimize the MSE. The MSE from the random forest is used as the reference. Ratios larger than 1 mean that the corresponding method and outcome measure generate a larger MSE than the random forest. The 95% confidence intervals are based on 200 bootstrapped re-samples of the test data.

      Abbreviation: MSE=mean squared error.
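The standardized test error and its bootstrap interval can be sketched in Python as below. This is a hedged illustration rather than the authors' code: it assumes a percentile bootstrap (the source does not say which interval type was used), resamples test observations with replacement 200 times, and reports the ratio of a model's MSE to the reference (forest) MSE. The function names and toy predictions are hypothetical.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted outcomes."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def mse_ratio_ci(y_true, pred_model, pred_reference, n_boot=200, level=0.95, seed=0):
    """MSE ratio (model / reference) with a percentile bootstrap interval."""
    rng = random.Random(seed)
    n = len(y_true)
    point = mse(y_true, pred_model) / mse(y_true, pred_reference)
    ratios = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample test rows
        ys = [y_true[i] for i in idx]
        numerator = mse(ys, [pred_model[i] for i in idx])
        denominator = mse(ys, [pred_reference[i] for i in idx])
        if denominator > 0:
            ratios.append(numerator / denominator)
    ratios.sort()
    tail = (1 - level) / 2
    lower = ratios[int(tail * (len(ratios) - 1))]
    upper = ratios[int((1 - tail) * (len(ratios) - 1))]
    return point, lower, upper


# Toy test set (assumed): the reference is off by 0.5, the other model by 1.0.
y = [1, 2, 3, 4, 5, 2, 3, 4]
reference = [v + 0.5 for v in y]
other = [v + 1.0 for v in y]
print(mse_ratio_ci(y, other, reference))
```

A ratio whose whole interval lies above 1 would indicate that the model predicts worse out of sample than the reference forest.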

    • This study utilized two machine learning methods, namely the conditional inference tree and forest, to investigate the effects of various childhood circumstances on health disparity among older adults in China and the US. We identified several key predictors of health conditions in older adults, including childhood health, socioeconomic status, number of books at home (in the case of Americans), and birth region (in the case of Chinese). By employing these methods, we aimed to address concerns regarding the arbitrary selection of childhood circumstances and mitigate potential biases in our estimates of the impact of childhood circumstances on health. Our findings emphasize the importance of mitigating health disparities stemming from childhood circumstances, and suggest the need for policy and intervention strategies to promote health equity in both China and the US. Implementing preventive measures during childhood can alleviate the economic burden of diseases, enhance quality of life, and improve longevity, particularly in the absence of effective treatments for chronic diseases like Alzheimer's, hypertension, and diabetes.

      The conditional inference forest (CIF) demonstrates superior out-of-sample performance compared with the other methods, yielding the most accurate estimates of the contribution of childhood circumstances to health inequality in old age. This finding is in line with previous studies in various fields (14,15). While conditional inference trees provide a simpler model and a visually accessible representation of childhood circumstances, the CIF leverages information on childhood circumstances more effectively, yielding results consistent with the trees in terms of importance and estimates of influence on health outcomes. These machine learning methods employ explicit algorithms to interpret health outcomes and do not rely on strong assumptions regarding the significance of specific childhood circumstances. The use of statistical techniques such as K-fold cross-validation and the bootstrap makes our modeling approach more transparent and generalizable.

      There are several limitations to this study. First, the life course approach used in this study only focuses on current older adults, which may not accurately reflect the experiences of younger cohorts. Therefore, future research should also consider monitoring younger cohorts. Second, it is important to note that the associations identified in this study should not be interpreted as causal. It is possible that unobservable childhood circumstances may introduce bias to our estimates. Therefore, further research is needed to identify the causal mechanisms at play. Lastly, the data used in this analysis are from the most recently released CHARLS (2020) and HRS (2020–2021) surveys, which overlap with the coronavirus disease 2019 (COVID-19) pandemic. This may introduce bias to self-rated health measures. However, our robustness checks using CHARLS/HRS pre-pandemic waves have yielded consistent results, providing reassurance.

      In conclusion, our study utilized a life course approach and machine learning techniques to identify key factors influencing health in older adults. We applied this approach to the two largest economies and aging societies in the world. Our findings underscore the importance of incorporating a life course perspective in public health research and policy development.

    • No conflicts of interest.

  • 1. Conventionally, researchers use bootstrapping to select the sample for each tree in a random forest. However, bootstrapping has been shown to lead to biased variable selection (Strobl et al., 2007).