Part of Health Survey for England predicting height, weight and body mass index from self-reported data

Developing and using prediction equations

Background

Previous studies have shown that misreporting of height and weight (and BMI derived from these) varies by socio-demographic and health-related factors. To address this, equations have been developed including variables predictive of misreporting to adjust self-reported values of height and weight so that they more closely approximate measured values of height and weight, using datasets such as the Health Survey for England (HSE) and the National Health and Nutrition Examination Survey (NHANES) in the USA that contain both self-report and measured data from the same participants.

Such equations can then be transferred for use with surveys that collect only self-report data on height and weight. In England, self-reported height and weight in the Active Lives Survey (ALS) datasets are currently adjusted by formulae based on HSE 2012-2014 data for the purpose of monitoring levels of excess weight (BMI 25kg/m² or more) across local authorities. For surveys such as the ALS, which collect only self-report data on height and weight, such adjustments are made in the expectation that the adjusted values more closely approximate measured values, and so, for example, improve BMI classification and obesity prevalence estimates, compared with the unadjusted estimates based on self-reported data alone.

For this report, using combined HSE 2011-2016 data, prediction equations were developed to correct self-reported height and weight to more closely approximate measured height and weight. This adjustment was assessed to see if it improved the accuracy of obesity prevalence estimates relative to those based only on BMI derived from self-reported height and weight.

A fuller description of the methods and findings of our study is provided in an accompanying academic paper (Scholes et al, 2022)².

Footnote

2. This is currently published on medRxiv, the preprint server for Health Sciences. Note that preprints are preliminary reports of work that have not been certified by peer review. They should not be relied on to guide clinical practice or health-related behaviour and should not be reported in news media as established information.

Methods

Adjustments to self-reported values of height and weight were based on linear regression models that estimated measured values of height and weight from the self-reported values. In this report, two sets of equations are presented so that researchers can evaluate for themselves whether either approach suits their data and goals and, if so, which suits them better.

Model 1

The first approach (Model 1) estimated measured values of height and weight from self-reported values of height and weight, with additional adjustments for age group and any other socio-demographic and health-related variables independently associated with the difference between self-reported and interviewer-measured height and weight. Implementing this approach involved four main steps.

First, participants for whom the absolute difference between self-reported and interviewer-measured height and weight was more than four standard deviations away from the mean were considered outliers (with possible unrealistic reported values). These participants were excluded from the analysis (189 with outlying height values; 276 with outlying weight values) to avoid potentially undue influence on the equations.
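The exclusion rule in this first step can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the function name `flag_outliers` is hypothetical, and it assumes that "the mean" refers to the mean reporting difference and that the sample standard deviation is used.

```python
from statistics import mean, stdev


def flag_outliers(self_reported, measured, n_sd=4.0):
    """Flag participants whose reporting difference is far from the mean difference.

    Assumption: the difference is self-reported minus measured, and the rule is
    applied to its distance from the mean difference, in sample SD units.
    """
    diffs = [s - m for s, m in zip(self_reported, measured)]
    centre = mean(diffs)
    spread = stdev(diffs)
    return [abs(d - centre) > n_sd * spread for d in diffs]


# Synthetic heights in cm (invented for illustration): 100 small errors, one gross error.
measured = [170.0] * 101
self_reported = [171.0 if i % 2 == 0 else 169.0 for i in range(100)] + [200.0]
flags = flag_outliers(self_reported, measured)
kept = [s for s, f in zip(self_reported, flags) if not f]
```

Height and weight would each be screened separately in this way, since the report gives separate counts of outlying height and weight values.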

Second, the remaining sample (38,475 participants) was used to identify which socio-demographic and health-related variables, if any, other than age group and self-reported height and weight, were independently associated with the difference between self-reported and interviewer-measured height and weight (thereby meriting inclusion in prediction equations).

Height and weight were analysed separately. Multiple linear regression was used with the difference between the self-reported and interviewer-measured values of height and weight as the outcome variable. Socio-demographic and health-related variables, selected based on reviews of the literature and availability in HSE 2011-2016, were entered as independent variables. The independent variables considered for inclusion were as follows:

self-reported height (for the height models);
self-reported weight (for the weight models);
age group (16 to 17, 18 to 19, and in five-year intervals up to 85+ years);
ethnicity (White, Black, Asian, Mixed, Other);
Government Office Region;
healthy lifestyle behaviours (cigarette smoking status);
general health;
presence of a limiting longstanding illness; and
indicators of socioeconomic status (highest educational qualification and Index of Multiple Deprivation (IMD) quintile).

As only categorical age is now provided on publicly available HSE datasets (to preserve anonymity of participants), using age group enables researchers to easily reproduce our results and revise or update equations accordingly.

To maximise sample sizes, missing values on independent variables were assigned to a separate missing category (if the number of participants with missing values was 30 or more) or the modal category (if the number of participants with missing values was less than 30).
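The missing-value rule can be expressed as a short Python helper. This is a sketch under stated assumptions: missing values are represented as `None`, and the function name `recode_missing` and the label "Missing" are hypothetical choices for illustration.

```python
from collections import Counter


def recode_missing(values, threshold=30, missing_label="Missing"):
    """Recode missing values (None) in one categorical variable.

    If at least `threshold` values are missing they form their own category;
    otherwise they are assigned to the modal (most common) observed category.
    """
    n_missing = sum(v is None for v in values)
    observed = [v for v in values if v is not None]
    if n_missing >= threshold or not observed:
        fill = missing_label
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Invented example data: 5 missing values go to the mode, 30 form a new category.
few_missing = recode_missing(["Good"] * 50 + ["Bad"] * 20 + [None] * 5)
many_missing = recode_missing(["Good"] * 50 + [None] * 30)
```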

To allow for possible non-linearity, quadratic and cubic terms for self-reported height and weight were also assessed. Independent variables that were statistically significant (p<0.05) according to the Wald test were retained as candidates for inclusion in the prediction equations.

Third, having identified the initial set of variables significantly associated with the difference between self-reported and interviewer-measured height and weight, the analytical sample was randomly divided into two parts using a 70:30 ratio (hereafter referred to as split-samples A and B: 27,033 and 11,442 participants, respectively).
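A random 70:30 division of this kind might be sketched as follows. The function name `split_sample` and the seed are assumptions made for illustration. The sketch cuts a shuffled list at exactly 70%, so it would not reproduce the split-sample sizes reported above.

```python
import random


def split_sample(participant_ids, fraction_a=0.7, seed=2022):
    """Randomly divide participants into split-sample A and split-sample B.

    The seed is an arbitrary value chosen here so that the split is repeatable.
    """
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * fraction_a)
    return shuffled[:cut], shuffled[cut:]


# Illustration with 1,000 hypothetical participant identifiers.
sample_a, sample_b = split_sample(range(1000))
```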

Model-fitting, model-refinement and parameter estimation (prediction equations) were performed on split-sample A. To derive the prediction equations using linear regression modelling, measured height and weight were the outcome variables, and self-reported height and weight (including any non-linear terms), age group, and those additional socio-demographic and health-related variables significantly associated with the difference between self-reported and measured values were the independent variables. In a final refinement step, only statistically significant independent variables were retained for the final equations. This refinement step was performed to ensure the models were parsimonious (achieved similar goodness of fit with as few independent variables as possible).
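To make the idea of a prediction equation concrete, the sketch below fits an ordinary least squares regression of measured height on self-reported height and one indicator variable. It is a pure-Python illustration: the helper `fit_ols` is hypothetical, the data are synthetic, and the coefficients are invented rather than taken from Appendix 1. The study's own models also include age group, any non-linear terms and the other retained variables.

```python
def fit_ols(rows, outcome):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    `rows` holds one list of predictor values per participant, including a
    leading 1.0 for the intercept. Solved by Gauss-Jordan elimination with
    partial pivoting; a collinear design would raise ZeroDivisionError.
    """
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, outcome)) for i in range(k)]
    aug = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Synthetic illustration only: each row is
# [intercept, self-reported height in cm, indicator for an older age group].
rows = [[1.0, 150.0, 0.0], [1.0, 160.0, 1.0], [1.0, 170.0, 0.0],
        [1.0, 180.0, 1.0], [1.0, 190.0, 0.0], [1.0, 165.0, 1.0]]
measured = [2.0 + 0.98 * r[1] - 1.5 * r[2] for r in rows]
coefficients = fit_ols(rows, measured)  # the fitted "prediction equation"
```

In practice a statistical package would be used for this step, since it also provides the standard errors and Wald tests needed for the refinement described above.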

Finally, the prediction equations derived from split-sample A were applied to the data in split-sample B to provide an independent assessment of the predictive accuracy of the equations. Corrected BMI was derived using the predicted values of height and weight. Descriptive statistics were used to compare self-reported, corrected, and interviewer-measured mean height, weight and BMI. BMI status was also compared across the three sets of data. Using BMI status from interviewer-measured height and weight as the gold standard, estimates of sensitivity and specificity were calculated to quantify by how much the corrected values improved the classification accuracy of BMI, compared with BMI status derived from self-reported height and weight.
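The accuracy assessment can be illustrated with a short Python sketch. BMI is weight in kilograms divided by the square of height in metres. The obesity cut-off of 30kg/m² used here is the conventional threshold rather than a value stated in this section, and the function names and example values are hypothetical.

```python
def bmi(weight_kg, height_cm):
    """Body mass index in kg/m² from weight in kg and height in cm."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


def sensitivity_specificity(test_positive, truly_positive):
    """Sensitivity and specificity of a classification against a gold standard.

    Both arguments are lists of booleans, one entry per participant. The gold
    standard here would be BMI status from interviewer-measured data.
    """
    pairs = list(zip(test_positive, truly_positive))
    tp = sum(1 for t, g in pairs if t and g)
    fn = sum(1 for t, g in pairs if not t and g)
    tn = sum(1 for t, g in pairs if not t and not g)
    fp = sum(1 for t, g in pairs if t and not g)
    return tp / (tp + fn), tn / (tn + fp)


# Invented BMI values for five participants.
corrected_bmi = [31.2, 30.4, 29.1, 24.0, 27.5]
measured_bmi = [32.0, 30.1, 30.6, 23.8, 27.9]
obese_corrected = [b >= 30.0 for b in corrected_bmi]
obese_measured = [b >= 30.0 for b in measured_bmi]
sens, spec = sensitivity_specificity(obese_corrected, obese_measured)
```

Running the same comparison with BMI from self-reported values in place of corrected values shows how much the correction changes classification accuracy.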

Model 2

The second approach (Model 2) to deriving prediction equations was based on linear regression modelling with the measured values of height and weight as the outcome variable and the self-reported values of height and weight plus age (in single-year bands but trimmed at 90 years) as independent variables. Continuous age was used to replicate the approach used to correct self-reported height and weight in the Active Lives Survey. Linear, quadratic and cubic terms were entered for self-reported height and weight, and linear and quadratic terms were entered for age. Each term was retained in the model (thereby included in the equation) irrespective of statistical significance to allow for possible associations in future datasets.
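The structure of a Model 2 equation can be sketched as below. This is an illustration only: the function names are hypothetical, "trimmed at 90 years" is interpreted as capping age at 90, and the coefficients in the usage lines are placeholders, not the estimates reported in Appendix 1.

```python
def model2_terms(self_reported, age_years, max_age=90):
    """Predictor terms for one participant: intercept, linear, quadratic and
    cubic self-reported value, then linear and quadratic age (capped)."""
    age = min(age_years, max_age)
    return [1.0, self_reported, self_reported ** 2, self_reported ** 3,
            age, age ** 2]


def predict(terms, coefficients):
    """Predicted measured value: the sum of each term times its coefficient."""
    return sum(t * c for t, c in zip(terms, coefficients))


# Placeholder coefficients that simply return the self-reported value unchanged.
identity = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
terms = model2_terms(175.0, 95)
predicted_height = predict(terms, identity)
```

Separate coefficient sets would be needed for height and weight and for men and women, as Model 2 results are reported by sex.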

As with Model 1, model-fitting was performed using split-sample A and the assessment of predictive accuracy was performed using split-sample B. This approach provides an updated set of HSE-derived correction factors for use by the Office for Health Improvement and Disparities (OHID) to adjust the self-reported height and weight of participants in the Active Lives Survey for the purpose of monitoring levels of excess weight (BMI 25kg/m² or more) across local authorities in England.

Results

Prediction equations for height and weight: Model 1

Height

Among men, based on the multiple linear regression analysis (outcome variable: self-reported minus interviewer-measured height), older age, lower educational status (below degree or no qualifications versus having a degree), being Asian (versus White), living in the North East, the North West and the West Midlands (versus the South East), and reporting bad/very bad general health (versus good/very good) were associated with greater overestimation of height.

Among women, older age, lower educational status (no qualifications versus having a degree), living in the North West (versus the South East), living in the most (versus least) deprived areas, and being in the Black, Asian, mixed ethnic or other groups (versus White) were associated with greater overestimation of height.

Table A1: Linear regression results for difference between self-reported and measured height in men (Model 1)

Table A2: Linear regression results for difference between self-reported and measured height in women (Model 1)

The regression coefficients (prediction equations) for the aforementioned variables based on the models with measured height as the outcome variable correct self-reported height upwards (positive signs) or downwards (negative signs) as appropriate. For example, compared with participants in the White group, the predicted measured height from the self-reported height of participants in the Asian group is corrected downwards by 0.61cm and 1.78cm among men and women, respectively.

Weight

Among men, being an ex-regular or never (versus current) smoker was associated with greater underestimation of weight, whilst lower educational status (no qualifications versus having a degree) was associated with lower underestimation. Among women, being an ex-regular or never (versus current) smoker and being in the Black (versus White) ethnic group were associated with greater underestimation of weight.

Table A3: Linear regression results for difference between self-reported and interviewer-measured weight in men (Model 1)

Table A4: Linear regression results for difference between self-reported and measured weight in women (Model 1)

The regression coefficients (prediction equations) for the aforementioned variables based on the models with measured weight as the outcome variable correct self-reported weight upwards (positive signs) or downwards (negative signs) as appropriate. For example, compared with current smokers, the predicted measured weight from the self-reported weight of participants who never smoked is corrected upwards by 0.65kg and 0.18kg among men and women respectively.

The prediction equations for Model 1 are presented in Appendix 1.

Prediction equations for height and weight: Model 2

The prediction equations based on the linear regression models that estimated the association between self-reported and interviewer-measured height and weight with age (in single-year bands) as the only predictor of misreporting are presented in Appendix 1.

Among both sexes, the prediction equations for height correct self-reported height downwards more sharply at older ages. Among men, the prediction equation for weight adjusts self-reported weight upwards progressively with age. Among women, the upward adjustment of self-reported weight showed no clear pattern with age.

Table A5: Linear regression results for measured height on self-reported height and age, by sex (Model 2)

Table A6: Linear regression results for measured weight on self-reported weight and age, by sex (Model 2)

Self-reported, interviewer-measured and corrected height, weight, BMI, overweight (including obesity) and obesity prevalence, by sex

The prediction equations for height and weight were applied to the participants in split-sample B to generate corrected values of height, weight and BMI. Corrected estimates for mean height, weight and BMI were closer to interviewer-measured mean values than self-reported values were. For example, among men, mean BMI derived from self-reported height and weight was 26.3kg/m², compared with 27.3kg/m² for both corrected and interviewer-measured mean BMI. The equivalent figures for women were 25.7kg/m², 26.8kg/m² and 26.7kg/m².

25% of men and 24% of women were classified as obese based on interviewer-measured BMI. Based on BMI from self-reported height and weight, 18% of men and 19% of women were classified as obese, compared with 24% of men and 23% of women who were classified as obese based on corrected BMI (Model 1). Thus, compared with interviewer-measured BMI, the underestimation of the proportion of adults classified as obese was lower using corrected BMI (1 percentage point for both sexes) than using BMI from self-report (6 percentage points for men, 5 percentage points for women). Estimates based on Model 2 were similar.

66% of men and 55% of women were classified as either overweight or obese based on interviewer-measured BMI. Based on BMI from self-reported height and weight, 58% of men and 47% of women were classified as either overweight or obese, compared with 67% of men and 56% of women who were classified as overweight or obese based on corrected BMI (Model 1). Thus, compared with interviewer-measured BMI, the proportion of adults classified as either overweight or obese was slightly overestimated using corrected BMI (Model 1: 1 percentage point for men, less than 1 percentage point for women) and underestimated using BMI from self-reported height and weight (8 percentage points for men, 9 percentage points for women). Estimates based on Model 2 were similar.

Table 6: Mean height, weight, BMI and BMI status from self-reported, interviewer-measured and corrected data

Figure 4: Difference between self-reported/corrected and measured mean height by age (Model 1)

Base: Men aged 16 and over (split-sample B)

Figure 5: Difference between self-reported/corrected and measured mean height by age (Model 1)

Base: Women aged 16 and over (split-sample B)

Figure 6: Difference between self-reported/corrected and measured mean weight by age (Model 1)

Base: Men aged 16 and over (split-sample B)

Figure 7: Difference between self-reported/corrected and measured mean weight by age (Model 1)

Base: Women aged 16 and over (split-sample B)

Figure 8: Mean BMI by age: from self-reported, corrected (Model 1) and interviewer-measured height and weight

Base: Men aged 16 and over (split-sample B)

Figure 9: Mean BMI by age: from self-reported, corrected (Model 1) and interviewer-measured height and weight

Base: Women aged 16 and over (split-sample B)

Footnote

3. For a fuller, more technical description please see the accompanying academic paper, Scholes et al, 2022.

Comparison of BMI status from self-reported, corrected and interviewer-measured height and weight, by sex

Table 7 compares BMI status from self-reported, corrected (Models 1 and 2) and interviewer-measured height and weight, using split-sample B.

Model 1

Based on the five category BMI classification, using interviewer-measured BMI status as the gold standard, 83% of men and 84% of women were correctly classified based on corrected BMI.
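The percentage correctly classified can be illustrated with a short Python sketch. The cut-offs used (18.5, 25, 30 and 40kg/m²) are the conventional thresholds and are an assumption here, as this section states only the 25kg/m² threshold. The function names and BMI values are hypothetical.

```python
from bisect import bisect_right

# Assumed conventional cut-offs in kg/m²; a BMI equal to a cut-off falls in the
# higher category (for example, exactly 25 counts as overweight).
CUT_OFFS = [18.5, 25.0, 30.0, 40.0]
CATEGORIES = ["underweight", "healthy weight", "overweight",
              "obese grades I and II", "morbidly obese"]


def bmi_category(bmi_value):
    """Return the five-category BMI status for a BMI value in kg/m²."""
    return CATEGORIES[bisect_right(CUT_OFFS, bmi_value)]


def percent_correctly_classified(test_bmi, measured_bmi):
    """Percentage of participants whose category matches the gold standard."""
    matches = sum(bmi_category(t) == bmi_category(m)
                  for t, m in zip(test_bmi, measured_bmi))
    return 100.0 * matches / len(measured_bmi)


# Invented values: two of the four participants fall in the same category.
agreement = percent_correctly_classified([24.0, 31.0, 26.0, 29.9],
                                         [24.5, 29.5, 27.0, 30.1])
```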

The sensitivity of the obese category based on BMI derived from self-reported height and weight was 71% among men and 75% among women. Sensitivity values were higher using corrected BMI, being 86% among men and 87% among women. In contrast, specificity values were slightly higher for BMI from self-report (99% for both sexes) than for corrected BMI (96% men, 97% women).

A similar pattern was found for the overweight (including obesity) category. Sensitivity values were higher for corrected BMI (94% men, 93% women) than for BMI derived from self-reported height and weight (85% men, 83% women). In contrast, specificity values were higher for BMI from self-report (94% men, 98% women) than for corrected BMI (85% men, 90% women).

Model 2

Based on the five category BMI classification, 83% of both sexes were correctly classified based on corrected BMI. Estimates of sensitivity and specificity were very similar to those described above for Model 1.

Table 7: Comparison of BMI status from self-reported, corrected (Models 1 and 2) and interviewer-measured height and weight, by sex

Conclusions

In agreement with previous epidemiological studies, analyses of HSE 2011-2016 data showed that participants on average overestimate height and underestimate weight, leading to an underestimation of BMI and of obesity prevalence. It is not possible to know whether the differences between self-reported and interviewer-measured values arise from participants’ lack of knowledge about their current height and weight or from misreporting of information that is accurately known.

Corrected BMI, based on the prediction equations, was much better than BMI derived from self-reported height and weight at approximating interviewer-measured BMI. This was demonstrated by the reduced gap in obesity prevalence and the increased sensitivity of the obese category versus self-reported data. However, whilst the prediction equations can overcome the well-known limitations of BMI based on self-reported height and weight to some extent, there are caveats to their development and use that must be considered.³

Caveats concerning the use of the prediction equations

Prediction equations are specific to time, place, target population and methods of data collection. As such, we do not assume that these equations developed using HSE 2011-2016 data will provide equivalently precise estimates when applied to surveys with more recent data, or different socio-demographic, health and self-reported anthropometric profiles. However, given the well-established tendency for self-reports to underestimate obesity, these equations may provide a better set of estimates than relying on self-report alone. That said, there are caveats that should be taken into account.

The first caveat concerns a limitation of this study, the sizeable number of participants excluded from the analyses due to missing data on height and/or weight. The findings presented in this report could be biased if those participants with valid values of self-reported and interviewer-measured height and weight were systematically different from those with missing data (for example, if those who refused to be measured were more likely than those who did not refuse to under-report their weight or to under-report to a greater extent, due at least partly to being heavier). Such bias could result in prediction equations that are inaccurate.

Second, HSE 2011-2016 participants are likely to have anticipated that interviewers would take direct measurements of height and weight, resulting in more ‘truthful’ reporting compared, for example, with a telephone interview where participants would not anticipate being measured. Previous studies have shown that misreporting of height (except for older adults) and weight was lower for interviews conducted face-to-face compared with telephone interviews (Ezzati et al, 2006). Surveys conducted in-person with direct measurements of height and weight may capture a lower bound of bias associated with self-reported anthropometric measures (Hattori and Sturm, 2013). Applying these prediction equations to surveys which collect height and weight data by telephone interview or by self-completion web or postal questionnaire would be likely to underestimate measured values of BMI to a greater extent than shown in the present study.

Third, a small but non-trivial number of participants with large observed differences between self-reported and interviewer-measured values were excluded from the analyses to minimise the impact of large reporting errors. This exclusion limited the generalisability of analyses to some extent. Such cases would be impossible to identify and exclude in surveys that collect self-report but not measured data.

Fourth, compared with self-reported data, the prediction equations modestly reduced specificity in the overweight (including obesity) category, through erroneously shifting a proportion of healthy (normal) weight participants, according to measured data, to the overweight category. As a result, the equations slightly overestimated the proportion of adults classified as either overweight or obese. This finding was consistent with previous studies (Connor Gorber et al, 2008; Dutton and McLaren, 2014), and is likely to reflect higher levels of reporting accuracy of height and weight among healthy (normal) weight adults.

As reporting bias increases with higher measured BMI⁴, the prediction equations developed for this report are not suitable for classifying adults into the five mutually exclusive BMI categories (underweight; healthy (normal) weight; overweight; obese grades I and II; morbidly obese). The equations are most suitable for classifying adults according to the more broadly defined dichotomous categories: either overweight or obese (versus neither overweight nor obese), and obese (versus not obese). For these specific BMI categories, the modest reduction in specificity compared with self-reported data is more than counterbalanced by the reduced gap in prevalence estimates and the increase in sensitivity.

Finally, although the results showed no steady rate of increase or decrease in misreporting, the prediction equations developed using 2011-2016 data might not be entirely applicable to more recent data. This might be the case if obesity prevalence has greatly increased or decreased, and/or the social desirability of having a healthy/normal BMI increased or decreased. Changes in the amount of ownership and/or accuracy of home scales, or in the up-to-date knowledge of one’s own height and weight (for example, if health-workers began to routinely measure height as part of BMI assessment, and relay that information to patients) could also affect the applicability of these equations to more recent data. Likewise, any potential increase in misreporting of weight associated with weight gain during the Covid-19 pandemic (for example, due to fewer opportunities for outdoor physical activity) is not taken into account by the equations.

Footnote

4. See Table S2 in Scholes et al, 2022.

Which, if any, equation to use?

The equations developed for this report can be transferred for use in any survey in England which contains the set of variables included for the purposes of predicting measured values of height and weight. Using the Model 1 equations requires self-reported values of height and weight; categorical age; Government Office Region; educational status; ethnicity; general health; IMD quintile; and current cigarette smoking status. Using the Model 2 equations requires self-reported values of height and weight plus age (in single-year bands but trimmed at 90 years).
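A simple check of which equations a given survey could support might look like the sketch below. The variable names are hypothetical labels for the items listed above, not the names used in any HSE or ALS dataset.

```python
# Hypothetical variable names standing in for the items each model requires.
MODEL_1_VARIABLES = {
    "self_reported_height", "self_reported_weight", "age_group",
    "government_office_region", "educational_status", "ethnicity",
    "general_health", "imd_quintile", "smoking_status",
}
MODEL_2_VARIABLES = {"self_reported_height", "self_reported_weight", "age_years"}


def usable_models(available_variables):
    """Return the models whose required variables are all available."""
    available = set(available_variables)
    models = []
    if MODEL_1_VARIABLES <= available:
        models.append("Model 1")
    if MODEL_2_VARIABLES <= available:
        models.append("Model 2")
    return models


# A survey with only self-reported height, weight and single-year age.
models_for_survey = usable_models(
    ["self_reported_height", "self_reported_weight", "age_years"])
```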

The findings showed that the two sets of prediction equations produced very similar results, due to both using self-reported values of height and weight along with participants’ age to predict measured values of height and weight. No single model stood out as the best overall candidate as adding socio-demographic variables such as ethnic group and educational status to the equations improved predictive accuracy only minimally. It may be reasonably concluded that including additional variables such as educational status and ethnic group does not add enough predictive power to justify the added complexity of including them in prediction equations.

Last edited: 8 December 2022 11:39 am
