Data validation and management


As described in earlier sections, the data were collected from two sources: an online questionnaire and a paper questionnaire. The online questionnaire included some built-in routing and checks, whereas the paper questionnaire relied on correct navigation by participants and there was no constraint on the answers they could give.

In addition, the online data in their raw form were available immediately to the research team. However, the paper questionnaire data had to be scanned and manually recorded as part of a separate process. Tick box answers were captured by scanning; numbers and other verbatim answers were manually recorded by the research team.


Data editing

The paper data were subject to errors introduced by participants when they did not follow instructions set out in the paper questionnaire. Many of these errors were dealt with through standard editing rules. For example, if a single code question had more than one category ticked, it was set to ‘don’t know’. For multi-code questions, if an exclusive option was coded alongside one or more multi-choice options then the exclusive option was disregarded. If a routed question was answered when it should not have been, then it was set to ‘not applicable’ and the original answer over-written.
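The three standard editing rules described above can be sketched as small functions. This is a minimal illustration only, not the research team's actual processing code: the numeric codes for ‘don’t know’ and ‘not applicable’, the function names and the use of None for a blank answer are all assumptions made for the example.

```python
# Hypothetical missing-value codes; the report does not state the values used.
DONT_KNOW = -8
NOT_APPLICABLE = -1


def edit_single_code(ticked):
    """Single-code question: more than one box ticked is set to 'don't know'."""
    if len(ticked) > 1:
        return DONT_KNOW
    # One tick is kept as given; an empty list means the question was left blank.
    return ticked[0] if ticked else None


def edit_multi_code(ticked, exclusive):
    """Multi-code question: an exclusive option ticked alongside other
    options is disregarded; on its own it is kept."""
    non_exclusive = [code for code in ticked if code not in exclusive]
    return non_exclusive if non_exclusive else list(ticked)


def edit_routed(answer, should_have_been_asked):
    """Routed question answered in error: overwrite with 'not applicable'."""
    if not should_have_been_asked and answer is not None:
        return NOT_APPLICABLE
    return answer
```

For example, under these assumptions a single-code question with boxes 1 and 3 ticked becomes DONT_KNOW, and a multi-code answer of [1, 2, 9] where 9 is the exclusive ‘none of these’ option becomes [1, 2].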

Some paper data required manual edits to improve their quality and to make them more consistent and easier to analyse. For example, participants were asked to record the hours and minutes spent walking and doing moderate or vigorous physical activity in the last seven days. Where hours were missing but minutes had been recorded as one to three, it was assumed that the participant had meant hours rather than minutes. These responses were manually edited by moving the value from minutes to hours, with minutes then recorded as ‘00’. It is important to note that edits of this type rest on assumptions made by the research team and may, in some cases, misinterpret the intention of the participant.
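The hours and minutes edit can be expressed as a simple rule. The sketch below is illustrative only: the edit was carried out manually by the research team, and the function name and the use of None for a blank field are assumptions.

```python
def edit_activity_time(hours, minutes):
    """Return (hours, minutes) after applying the assumed edit.

    None represents a field left blank on the paper questionnaire.
    """
    # Hours blank and minutes recorded as 1 to 3: assume hours were meant.
    if hours is None and minutes is not None and 1 <= minutes <= 3:
        return minutes, 0
    # Otherwise leave the response as recorded.
    return hours, minutes
```

A response with hours blank and minutes of 2 would be edited to 2 hours and 0 minutes, whereas a response with hours blank and minutes of 30 would be left unchanged.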

The online data required less editing because the checks and edits were embedded in the questionnaire. For example, extremely high values of time spent doing activities, and durations of less than 10 minutes, were checked. Where multiple answers were selected on single-code questions, participants were asked to correct their answers.
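A check of this kind might look like the sketch below. Only the 10-minute lower threshold comes from the text. The upper threshold, the decision not to query a duration of zero, and the function and constant names are assumptions made for illustration.

```python
MIN_MINUTES = 10       # from the text: durations under 10 minutes were checked
MAX_MINUTES = 16 * 60  # hypothetical upper bound; the report gives no figure


def duration_prompts(hours, minutes):
    """Return the checks an online questionnaire might raise for a duration."""
    total = hours * 60 + minutes
    prompts = []
    # Assumption: zero is read as 'no activity' and is not queried.
    if 0 < total < MIN_MINUTES:
        prompts.append("too_short")
    if total > MAX_MINUTES:
        prompts.append("too_long")
    return prompts
```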


Data quality and item non-response

Data quality and the level of item non-response [10] were considered for key questions in the HSE FS. This does not include comparisons between modes within the HSE FS (i.e. between paper and online data).

In summary, levels of item non-response did not appear to be higher for most questions in the HSE FS than for the equivalent questions in the HSE face-to-face survey. For example, item non-response was low (in the region of 1%) for key demographic questions such as date of birth and sex. Likewise, for key survey estimates such as general health (including limiting longstanding illness) and smoking (ever smoked a cigarette and smoked nowadays), ‘don’t know’ and ‘refusal’ rates were low (in the region of 1%). This was similar to the levels found in the HSE face-to-face survey.

Questions perceived to be more difficult to answer yielded a slightly higher level of item non-response. For example, questions about the frequency of alcohol drunk in the last 12 months contained detailed instructions to include or exclude certain types and strengths of alcoholic drinks. Levels of item non-response for these questions were in the region of 3%, compared with less than 1% in the HSE face-to-face survey. Similarly, questions within the HSE FS physical activity module that required computations about the time spent walking and doing moderate or vigorous physical activity yielded an item non-response rate in the region of 3.5%.
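An item non-response rate of the kind quoted above can be computed as the share of eligible answers that fall into the non-response categories listed in footnote 10. The sketch below is illustrative: the answer labels, the function name and the choice to exclude ‘not applicable’ answers from the denominator are assumptions, not details given in the report.

```python
# Non-response categories as listed in footnote 10.
NON_RESPONSE = {"don't know", "refusal", "prefer not to say"}


def item_non_response_rate(answers):
    """Percentage of eligible answers to one question that are non-response."""
    # Assumption: participants routed past the question are not eligible.
    eligible = [a for a in answers if a != "not applicable"]
    if not eligible:
        return 0.0
    missing = sum(1 for a in eligible if a in NON_RESPONSE)
    return 100.0 * missing / len(eligible)
```

With 99 substantive answers and one refusal, for instance, the function returns a rate of 1%.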

The online and paper questionnaires offered participants the option to record other types of alcoholic drinks consumed on the heaviest drinking day that were not covered in the existing response options. At the data processing stage, a number of verbatim responses were identified as ‘vague’ (e.g. ‘beer’ with no indication on whether it was strong or normal strength beer) and could not be coded into the existing code frame. These cases were coded as missing data.

In the HSE face-to-face survey, a visual aid for portion size is presented to participants in the form of a show card depicting spoon sizes. This was not included in the HSE FS paper questionnaire and portion sizes may, therefore, have been inaccurately recorded. Furthermore, in the HSE 2018 face-to-face survey, a more sophisticated level of fruit coding was conducted by the interviewer using a long list of answer categories. In the HSE FS, this was not possible due to the cognitive burden it could have placed on participants, and participants may therefore have recorded fruit in the incorrect categories. In the paper questionnaire, where answer options for portion sizes were left blank, the question was coded as a missing value. Some data may have been affected by participants leaving the questions blank instead of writing in ‘00’.


Data validation

With face-to-face surveys there is confidence that almost all the data are collected in a controlled manner and from the right individual, because an interviewer is present. With most self-completion survey methods there is no interviewer to provide this assurance, so it must be achieved by other means. With that in mind, a programme of post-fieldwork validation was implemented.

Duplication

For stage one, each household was provided with two log-in codes for completing the online survey and up to two paper questionnaires with the final reminder mailing. This could cause duplicate responses, where either a single participant completed the survey a second time, or where more than two people in a household completed the survey (for example, two completing the survey online and two different people completing the paper questionnaires).

Checks were undertaken to identify any potential duplicate cases. These included checks across modes (to see if a participant had completed the survey both online and on paper). However, it was impossible to distinguish between people who had disguised the duplication and genuinely different people within the same household completing the survey. 

Checking for duplicates was undertaken based on observing matches on the following criteria:

  • household serial number
  • full name (first name and surname)
  • date of birth / age
  • sex

If more than one case had the same full name, these were manually reviewed to determine if they were a cause for concern. Full names were manually checked against other demographic data such as date of birth and sex to see if they were valid duplicate cases. If cases appeared to be duplicates, one case was deleted based on the following criteria:

  • fully productive questionnaires prioritised over partially complete ones [11]
  • online responses prioritised over paper responses, as the online responses were considered to be more comprehensive and allowed for more complex routing
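The matching and prioritisation logic can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: in the study, potential duplicates were reviewed manually, whereas the sketch uses an exact match on all four criteria. The field names, the name normalisation and the order of precedence between the two retention criteria (completeness first, then mode) are assumptions.

```python
from collections import defaultdict


def match_key(case):
    """Build the key on which two cases are treated as potential duplicates."""
    # Assumption: names are compared ignoring case and surrounding spaces.
    name = case["name"].strip().lower()
    return (case["household"], name, case["dob"], case["sex"])


def keep_priority(case):
    """Sort key for choosing which case to keep; the lowest value wins."""
    # Fully productive before partial, then online before paper.
    return (case["status"] != "full", case["mode"] != "online")


def deduplicate(cases):
    """Keep one case per matching group, chosen by keep_priority."""
    groups = defaultdict(list)
    for case in cases:
        groups[match_key(case)].append(case)
    return [min(group, key=keep_priority) for group in groups.values()]


cases = [
    {"id": 1, "household": "H1", "name": "Ann Lee", "dob": "1980-01-01",
     "sex": "F", "mode": "paper", "status": "full"},
    {"id": 2, "household": "H1", "name": "ann lee ", "dob": "1980-01-01",
     "sex": "F", "mode": "online", "status": "partial"},
    {"id": 3, "household": "H1", "name": "Bob Lee", "dob": "1978-05-05",
     "sex": "M", "mode": "online", "status": "full"},
]
kept_ids = [case["id"] for case in deduplicate(cases)]
```

In this made-up example, cases 1 and 2 match on all four criteria. Case 1 is kept because a fully productive questionnaire takes priority over a partial one, and case 3 is retained as a different individual in the same household.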

In total, 24 genuine duplicate cases were removed. 

Following this process of deduplication, it was assumed that all responses in the dataset were from unique individuals. Even after this process, it was possible that unique responses had been provided from more than two individuals in the household. In these circumstances, ‘legitimate’ duplicate responses were retained in the data as they were considered valid unique responses. Three households submitted three unique responses to the survey (for example two adults completed the web survey and one completed a paper questionnaire).


Ensuring consistency in household-level data

Household-level variables are used to weight the data, and it is good practice to ensure that everyone within a household receives the same household-level weight.

To do this, appropriate household-level variables such as the number of adults and children living in the household, their sex and date of birth, household income and tenure needed to be consistent within a household. This step was also necessary for the selection of the stage two child sample.   

To determine which household data to use for the whole household (where there were inconsistencies) the following steps were applied to prioritise the responses:

  • online responses were prioritised over paper responses
  • if mode of completion was the same, then responses from the oldest participant were prioritised
  • if participants’ dates of birth were identical, then household-level responses were taken from the first serial number identified in the household
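These prioritisation steps amount to choosing one response per household as the source of household-level data and copying its values to every other response in that household. The sketch below illustrates this under assumptions: the field names are hypothetical, ‘first serial number’ is taken to mean the lowest serial number, and all household-level variables are taken from the same source response.

```python
from datetime import date

# Hypothetical names for the household-level variables.
HOUSEHOLD_FIELDS = ("n_adults", "n_children", "tenure")


def household_source(responses):
    """Choose the response whose household-level answers are used."""
    # Online before paper, then oldest (earliest date of birth),
    # then lowest serial number.
    return min(
        responses,
        key=lambda r: (r["mode"] != "online", r["dob"], r["serial"]),
    )


def harmonise(responses):
    """Copy the source response's household-level values to all responses."""
    source = household_source(responses)
    shared = {field: source[field] for field in HOUSEHOLD_FIELDS}
    return [{**r, **shared} for r in responses]


household = [
    {"serial": 2, "mode": "paper", "dob": date(1950, 1, 1),
     "n_adults": 3, "n_children": 0, "tenure": "own"},
    {"serial": 1, "mode": "online", "dob": date(1985, 6, 1),
     "n_adults": 2, "n_children": 0, "tenure": "rent"},
]
harmonised = harmonise(household)
```

In this made-up household the online response is used as the source even though the paper participant is older, so both records end up with two adults and a tenure of ‘rent’.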

For stage two sample selection the same principles were applied.

Footnotes

[10] Includes ‘don’t know’, ‘refusal’ and ‘prefer not to say’ (online survey only) options. 

[11] A fully productive questionnaire was defined as one where the participant completed the survey. A partially productive questionnaire was defined as one where at least the household demographics and the general health questions were completed but not the whole questionnaire (5%, 270 adults partially completed the online survey). A further 4% (689 cases) of the eligible issued sample clicked on the survey link but did not reach the point of being counted as partially productive. This level of detail is not known for the paper survey.

 


Last edited: 30 November 2021 1:06 pm