Part of Health Survey for England predicting height, weight and body mass index from self-reported data
Developing and using prediction equations
Background
Previous studies have shown that misreporting of height and weight (and BMI derived from these) varies by socio-demographic and health-related factors. To address this, equations have been developed including variables predictive of misreporting to adjust self-reported values of height and weight so that they more closely approximate measured values of height and weight, using datasets such as the Health Survey for England (HSE) and the National Health and Nutrition Examination Survey (NHANES) in the USA that contain both self-report and measured data from the same participants.
Such equations can then be transferred for use with surveys that collect only self-report data on height and weight. In England, self-reported height and weight in the Active Lives Survey (ALS) datasets are currently adjusted by formulae based on HSE 2012-2014 data for the purpose of monitoring levels of excess weight (BMI of 25 kg/m² or more) across local authorities. For surveys such as the ALS, which collect only self-report data on height and weight, such adjustments are made in the expectation that the adjusted values more closely approximate measured values, and so, for example, improve BMI classification and obesity prevalence estimates, compared with the unadjusted estimates based on self-reported data alone.
For this report, using combined HSE 2011-2016 data, prediction equations were developed to correct self-reported height and weight to more closely approximate measured height and weight. This adjustment was assessed to see if it improved the accuracy of obesity prevalence estimates relative to those based only on BMI derived from self-reported height and weight.
A fuller description of the methods and findings of our study is provided in an accompanying academic paper (Scholes et al, 2022).
Methods
A set of adjustments to self-reported values of height and weight was based on linear regression models that estimated measured values of height and weight from self-reported values. In this report, two sets of equations are presented to enable researchers to evaluate for themselves whether either approach, and if so which, best suits their data and goals.
Model 1
The first approach (Model 1) estimated measured values of height and weight from self-reported values of height and weight, with additional adjustments for age group and any other socio-demographic and health-related variables independently associated with the difference between self-reported and interviewer-measured height and weight. Implementing this approach involved four main steps.
First, participants for whom the absolute difference between self-reported and interviewer-measured height and weight was more than four standard deviations away from the mean were considered outliers (with possibly unrealistic reported values). These participants were excluded from the analysis (189 with outlying height values; 276 with outlying weight values) to avoid potentially undue influence on the equations.
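The outlier rule in this first step can be sketched in Python as follows. This is a minimal illustration rather than the code used for the report: it assumes one reading of the rule (each participant's self-reported minus measured difference is compared with the mean difference, in units of the sample standard deviation), and the function name `exclude_outliers` and its arguments are hypothetical.

```python
from statistics import mean, stdev


def exclude_outliers(self_reported, measured, n_sd=4.0):
    """Return the indices of participants kept after the outlier rule.

    Assumption: a participant is an outlier when their difference
    (self-reported minus measured) lies more than n_sd sample standard
    deviations from the mean difference.
    """
    diffs = [s - m for s, m in zip(self_reported, measured)]
    centre = mean(diffs)
    spread = stdev(diffs)
    return [i for i, d in enumerate(diffs) if abs(d - centre) <= n_sd * spread]


# Toy data: 100 plausible height reports (cm) plus one implausible report.
measured = [170.0] * 101
self_reported = [170.0, 171.0] * 50 + [200.0]
kept = exclude_outliers(self_reported, measured)
```

Height and weight would each be screened separately in this way, which is consistent with the separate counts of excluded participants reported above.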
Second, the remaining sample (38,475 participants) was used to identify which socio-demographic and health-related variables, if any, other than age group and self-reported height and weight, were independently associated with the difference between self-reported and interviewer-measured height and weight (thereby meriting inclusion in prediction equations).
Height and weight were analysed separately. Multiple linear regression was used with the difference between the self-reported and interviewer-measured values of height and weight as the outcome variable. Socio-demographic and health-related variables, selected based on reviews of the literature and availability in HSE 2011-2016, were entered as independent variables. The independent variables considered for inclusion were as follows:
- self-reported height (for the height models);
- self-reported weight (for the weight models);
- age group (16 to 17, 18 to 19, and in five-year intervals up to 85+ years);
- ethnicity (White, Black, Asian, Mixed, Other);
- Government Office Region;
- healthy lifestyle behaviours (cigarette smoking status);
- general health;
- presence of a limiting longstanding illness; and
- indicators of socioeconomic status (highest educational qualification and Index of Multiple Deprivation (IMD) quintile).
As only categorical age is now provided on publicly available HSE datasets (to preserve anonymity of participants), using age group enables researchers to easily reproduce our results and revise or update equations accordingly.
To maximise sample sizes, missing values on independent variables were assigned to a separate missing category (if the number of participants with missing values was 30 or more) or the modal category (if the number of participants with missing values was less than 30).
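The missing-value rule described above can be sketched as a small Python helper. The function name `recode_missing`, the use of `None` to mark a missing value, and the label "Missing" are assumptions made for illustration; only the threshold of 30 participants comes from the text.

```python
from collections import Counter


def recode_missing(values, threshold=30, missing_label="Missing"):
    """Recode missing values (None) on one categorical variable.

    With `threshold` or more missing values, they form their own category;
    with fewer, they are assigned to the modal (most common) category.
    """
    n_missing = sum(v is None for v in values)
    observed = [v for v in values if v is not None]
    if n_missing >= threshold or not observed:
        fill = missing_label
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Five missing values (fewer than 30): assigned to the modal category "Good".
few_missing = recode_missing(["Good"] * 50 + ["Bad"] * 20 + [None] * 5)
# Thirty missing values: kept as a separate "Missing" category.
many_missing = recode_missing(["Good"] * 50 + [None] * 30)
```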
To allow for possible non-linearity, quadratic and cubic terms for self-reported height and weight were also assessed. Independent variables that were statistically significant (p<0.05) according to the Wald test were retained as candidates for inclusion in the prediction equations.
Third, having identified the initial set of variables significantly associated with the difference between self-reported and interviewer-measured height and weight, the analytical sample was randomly divided into two parts using a 70:30 ratio (hereafter referred to as split-samples A and B: 27,033 and 11,442 participants, respectively).
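A random 70:30 division of this kind could be sketched as below. The report does not describe the exact procedure used, so the shuffle-and-cut approach, the function name `split_sample` and the seed are all assumptions. An exact 70% cut of 38,475 participants would not reproduce the reported sizes of 27,033 and 11,442, so the sketch illustrates the idea only.

```python
import random


def split_sample(ids, fraction_a=0.7, seed=2022):
    """Randomly divide participant identifiers into split-samples A and B."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * fraction_a)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical participant identifiers 0..999.
sample_a, sample_b = split_sample(range(1000))
```

Fixing the seed means the same participants fall into each split-sample on every run, which matters when the equations are fitted on one part and assessed on the other.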
Model-fitting, model-refinement and parameter estimation (prediction equations) were performed on split-sample A. To derive the prediction equations using linear regression modelling, measured height and weight were the outcome variables, and self-reported height and weight (including any non-linear terms), age group, and those additional socio-demographic and health-related variables significantly associated with the difference between self-reported and measured values were the independent variables. In a final refinement step, only statistically significant independent variables were retained for the final equations. This refinement step was performed to ensure the models were parsimonious (achieved similar goodness of fit with as few independent variables as possible).
Finally, the prediction equations derived from split-sample A were applied to the data in split-sample B to provide an independent assessment of the predictive accuracy of the equations. Corrected BMI was derived using the predicted values of height and weight. Descriptive statistics were used to compare self-reported, corrected, and interviewer-measured mean height, weight and BMI. BMI status was also compared across the three sets of data. Using BMI status from interviewer-measured height and weight as the gold standard, estimates of sensitivity and specificity were calculated to quantify by how much the corrected values improved the classification accuracy of BMI, compared with BMI status derived from self-reported height and weight.
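The assessment step can be illustrated with a short Python sketch that derives BMI from height and weight and then computes sensitivity and specificity against the measured (gold standard) classification. The function names are hypothetical, the obesity threshold of 30 kg/m² is the conventional cut-off rather than a value quoted in this section, and the toy flags are invented purely to show the calculation.

```python
def bmi(weight_kg, height_cm):
    """Body mass index in kg/m², from weight in kg and height in cm."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


def is_obese(bmi_value, threshold=30.0):
    """Flag obesity using the conventional threshold of 30 kg/m²."""
    return bmi_value >= threshold


def sensitivity_specificity(predicted, reference):
    """Sensitivity and specificity of `predicted` flags against `reference`.

    Assumes `reference` (the measured classification) contains at least
    one positive and one negative case.
    """
    pairs = list(zip(predicted, reference))
    true_pos = sum(1 for p, r in pairs if p and r)
    false_neg = sum(1 for p, r in pairs if not p and r)
    true_neg = sum(1 for p, r in pairs if not p and not r)
    false_pos = sum(1 for p, r in pairs if p and not r)
    return true_pos / (true_pos + false_neg), true_neg / (true_neg + false_pos)


# Toy flags: obesity status from measured data and from another source.
measured_obese = [True, True, False, False]
other_obese = [True, False, True, False]
sens, spec = sensitivity_specificity(other_obese, measured_obese)
```

In practice the same function would be called twice, once with flags derived from self-reported values and once with flags derived from corrected values, so that the two sets of estimates can be compared.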
Model 2
The second approach (Model 2) to deriving prediction equations was based on linear regression modelling with the measured values of height and weight as the outcome variable and the self-reported values of height and weight plus age (in single-year bands but trimmed at 90 years) as independent variables. Continuous age was used to replicate the approach used to correct self-reported height and weight in the Active Lives Survey. Linear, quadratic and cubic terms were entered for self-reported height and weight, and linear and quadratic terms were entered for age. Each term was retained in the model (thereby included in the equation) irrespective of statistical significance to allow for possible associations in future datasets.
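The functional form of a Model 2 equation, as described above, can be written as a short Python function. This is a sketch of the structure only: the coefficient names in the dictionary are hypothetical, and the fitted coefficient values from the report are not reproduced here. A separate set of coefficients would be needed for each outcome (height or weight) and for each group for which an equation was fitted.

```python
def predict_measured(self_reported, age, coef, max_age=90):
    """Predict a measured value from a self-reported value and age.

    Uses linear, quadratic and cubic terms for the self-reported value,
    and linear and quadratic terms for age, with age trimmed at 90 years.
    `coef` is a dictionary of regression coefficients (hypothetical keys).
    """
    x = self_reported
    a = min(age, max_age)  # trim age at 90 years
    return (
        coef["intercept"]
        + coef["linear"] * x
        + coef["quadratic"] * x ** 2
        + coef["cubic"] * x ** 3
        + coef["age"] * a
        + coef["age_squared"] * a ** 2
    )


# Placeholder coefficients, NOT the report's estimates: they return the
# self-reported value unchanged, apart from an illustrative age term.
placeholder = {
    "intercept": 0.0, "linear": 1.0, "quadratic": 0.0,
    "cubic": 0.0, "age": 0.5, "age_squared": 0.0,
}
at_90 = predict_measured(170.0, 90, placeholder)
at_95 = predict_measured(170.0, 95, placeholder)  # same as at_90 after trimming
```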
As with Model 1, model-fitting was performed using split-sample A and the assessment of predictive accuracy was performed using split-sample B. This approach provides an updated set of HSE-derived correction factors for use by the Office for Health Improvement and Disparities (OHID) to adjust the self-reported height and weight of participants in the Active Lives Survey for the purpose of monitoring levels of excess weight (BMI of 25 kg/m² or more) across local authorities in England.
Results
Prediction equations for height and weight: Model 1
Height
Among men, based on the multiple linear regression analysis (outcome variable: self-reported minus interviewer-measured height), older age, lower educational status (below degree or no qualifications versus having a degree), being Asian (versus White), living in the North East, the North West and the West Midlands (versus the South East), and reporting bad/very bad general health (versus good/very good) were associated with greater overestimation of height.
Among women, older age, lower educational status (no qualifications versus having a degree), living in the North West (versus the South East), living in the most (versus least) deprived areas, and being in the Black, Asian, mixed ethnic or other groups (versus White) were associated with greater overestimation of height.
The regression coefficients (prediction equations) for the aforementioned variables based on the models with measured height as the outcome variable correct self-reported height upwards (positive signs) or downwards (negative signs) as appropriate. For example, compared with participants in the White group, the predicted measured height from the self-reported height of participants in the Asian group is corrected downwards by 0.61cm and 1.78cm among men and women, respectively.
Weight
The regression coefficients (prediction equations) for the aforementioned variables based on the models with measured weight as the outcome variable correct self-reported weight upwards (positive signs) or downwards (negative signs) as appropriate. For example, compared with current smokers, the predicted measured weight from the self-reported weight of participants who never smoked is corrected upwards by 0.65kg and 0.18kg among men and women, respectively.
The prediction equations for Model 1 are presented in Appendix 1.
Conclusions
In agreement with previous epidemiological studies, analyses of HSE 2011-2016 data showed that participants on average overestimate height and underestimate weight, leading to an underestimation of BMI and of obesity prevalence. It is not possible to know whether the differences between self-reported and interviewer-measured values arise due to participants’ lack of knowledge about their current height and weight or whether it is due to misreporting of information that is accurately known.
Corrected BMI, based on the prediction equations, was much better than BMI derived from self-reported height and weight at approximating interviewer-measured BMI. This was demonstrated by the reduced gap in obesity prevalence and the increased sensitivity of the obese category versus self-reported data. However, whilst the prediction equations can overcome the well-known limitations of BMI based on self-reported height and weight to some extent, there are caveats to their development and use that must be considered.
Caveats concerning the use of the prediction equations
Prediction equations are specific to time, place, target population and methods of data collection. As such, we do not assume that these equations developed using HSE 2011-2016 data will provide equivalently precise estimates when applied to surveys with more recent data, or different socio-demographic, health and self-reported anthropometric profiles. However, given the well-established tendency for self-reports to underestimate obesity, these equations may provide a better set of estimates than relying on self-report alone. That said, there are caveats that should be taken into account.
The first caveat concerns a limitation of this study: the sizeable number of participants excluded from the analyses due to missing data on height and/or weight. The findings presented in this report could be biased if those participants with valid values of self-reported and interviewer-measured height and weight were systematically different from those with missing data (for example, if those who refused to be measured were more likely than those who did not refuse to under-report their weight or to under-report to a greater extent, due at least partly to being heavier). Such bias could result in prediction equations that are inaccurate.
Second, HSE 2011-2016 participants are likely to have anticipated that interviewers would take direct measurements of height and weight, resulting in more ‘truthful’ reporting compared, for example, with a telephone interview where participants would not anticipate being measured. Previous studies have shown that misreporting of height (except among older adults) and weight was lower for interviews conducted face-to-face than for telephone interviews (Ezzati et al, 2006). Surveys conducted in person with direct measurements of height and weight may capture a lower bound of the bias associated with self-reported anthropometric measures (Hattori and Sturm, 2013). Applying these prediction equations to surveys that collect height and weight data by telephone interview or by self-completion web or postal questionnaire would be likely to underestimate measured values of BMI to a greater extent than shown in the present study.
Third, a small but non-trivial number of participants with large observed differences between self-reported and interviewer-measured values were excluded from the analyses to minimise the impact of large reporting errors. This exclusion limited the generalisability of analyses to some extent. Such cases would be impossible to identify and exclude in surveys that collect self-report but not measured data.
Fourth, compared with self-reported data, the prediction equations modestly reduced specificity in the overweight (including obesity) category, through erroneously shifting a proportion of healthy (normal) weight participants, according to measured data, to the overweight category. As a result, the equations slightly overestimated the proportion of adults classified as either overweight or obese. This finding was consistent with previous studies (Connor Gorber et al, 2008; Dutton and McLaren, 2014), and is likely to reflect higher levels of reporting accuracy of height and weight among healthy (normal) weight adults.
As reporting bias increases with higher measured BMI, the prediction equations developed for this report are not suitable for classifying adults into the five mutually exclusive BMI categories (underweight; healthy (normal) weight; overweight; obese grades I and II; morbidly obese). The equations are most suitable for classifying adults according to the more broadly defined dichotomous categories: either overweight or obese (versus neither overweight nor obese), and obese (versus not obese). For these specific BMI categories, the modest reduction in specificity compared with self-reported data is more than counterbalanced by the reduced gap in prevalence estimates and the increase in sensitivity.
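The two dichotomous classifications recommended above could be derived from a corrected BMI value along the following lines. This is a minimal sketch: the function name and the dictionary keys are hypothetical, the 25 kg/m² threshold is the excess weight definition used in this report, and the 30 kg/m² threshold for obesity is the conventional cut-off.

```python
def dichotomous_status(bmi_value):
    """Return the two broad BMI flags for one adult.

    Overweight or obese: BMI of 25 kg/m² or more.
    Obese: BMI of 30 kg/m² or more (conventional threshold).
    """
    return {
        "overweight_or_obese": bmi_value >= 25.0,
        "obese": bmi_value >= 30.0,
    }


# A corrected BMI of 27.0 is flagged as overweight or obese, but not obese.
status = dichotomous_status(27.0)
```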
Finally, although the results showed no steady rate of increase or decrease in misreporting, the prediction equations developed using 2011-2016 data might not be entirely applicable to more recent data. This might be the case if obesity prevalence has greatly increased or decreased, and/or the social desirability of having a healthy/normal BMI increased or decreased. Changes in the amount of ownership and/or accuracy of home scales, or in the up-to-date knowledge of one’s own height and weight (for example, if health-workers began to routinely measure height as part of BMI assessment, and relay that information to patients) could also affect the applicability of these equations to more recent data. Likewise, any potential increase in misreporting of weight associated with weight gain during the Covid-19 pandemic (for example, due to fewer opportunities for outdoor physical activity) is not taken into account by the equations.
Which, if any, equation to use?
The equations developed for this report can be transferred for use in any survey in England which contains the set of variables included for the purposes of predicting measured values of height and weight. Using the Model 1 equations requires self-reported values of height and weight; categorical age; Government Office Region; educational status; ethnicity; general health; IMD quintile; and current cigarette smoking status. Using the Model 2 equations requires self-reported values of height and weight plus age (in single-year bands but trimmed at 90 years).
The findings showed that the two sets of prediction equations produced very similar results, due to both using self-reported values of height and weight along with participants’ age to predict measured values of height and weight. No single model stood out as the best overall candidate as adding socio-demographic variables such as ethnic group and educational status to the equations improved predictive accuracy only minimally. It may be reasonably concluded that including additional variables such as educational status and ethnic group does not add enough predictive power to justify the added complexity of including them in prediction equations.
Last edited: 8 December 2022 11:39 am