

Dentists’ Working Patterns, Motivation and Morale - 2018/19 and 2019/20 Methodology


Annex E – Regression Analysis and Key Assumptions

Multiple Linear Regression Analysis

Linear regression analysis attempts to model the relationship between a dependent variable y (in the case of this report, the ‘motivation index’) and an independent (explanatory) variable denoted x (such as average weekly hours or NHS/HS share). Multiple linear regression considers more than one explanatory variable at a time (x1, x2, x3…) and quantifies the strength of relationships between them (recorded as parameter estimates) and the dependent variable y.
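
In symbols, a model with k explanatory variables takes the standard linear form (the notation below is ours for illustration; the report does not print the equation):

    y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \varepsilon_i

where \beta_0 is the intercept, \beta_1 to \beta_k are the parameter estimates attached to each explanatory variable, and \varepsilon_i is the error term for observation i.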

Unlike other results in the report, the multiple linear regression analysis (and the logistic regression provided in the accompanying CSV files) is not weighted. Regression analysis normally assumes that the standard deviation of the error term is constant over all values of the explanatory variables. Weighting assigns more weight to certain observations (thereby giving them more influence on the calculated parameter estimates) because they are believed to be more accurate. Without weighting, all observations are treated equally, which is the preferred option for this report.
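
To illustrate the distinction, the sketch below fits an unweighted model using Python’s statsmodels library. The variable names echo the report’s (weekly hours, NHS share, motivation index), but the data are simulated purely for demonstration; the survey data themselves are not reproduced here.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(42)

    # Simulated stand-ins for the survey variables (illustrative only)
    n = 500
    weekly_hours = rng.normal(37, 6, n)
    nhs_share = rng.uniform(0, 100, n)
    motivation = 60 - 0.3 * weekly_hours - 0.1 * nhs_share + rng.normal(0, 8, n)

    X = sm.add_constant(np.column_stack([weekly_hours, nhs_share]))

    ols = sm.OLS(motivation, X).fit()   # unweighted: all observations treated equally
    print(ols.params)                   # parameter estimates
    print(ols.rsquared_adj)             # adjusted R-squared (discussed below)

    # A weighted analysis would instead pass a vector of weights, e.g.
    # sm.WLS(motivation, X, weights=w).fit(), giving more influence to
    # observations believed to be more accurate.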

The ‘goodness of fit’ of a multiple linear regression model can be assessed using the adjusted R-squared value produced in the analytical output. The adjusted R-squared value measures the proportion of the variation in the dependent variable accounted for by the explanatory variables. Values range up to 1, with 1 indicating that the regression line perfectly fits the data (unlike the ordinary R-squared, the adjusted value can fall slightly below 0 for very poor fits). The results in the report are generally lower than 0.2, which would traditionally be seen as low in many objective prediction models. However, R-squared values tend to be lower in prediction models used in psychology, because the goal is to determine which variables are statistically significant and how they relate to changes in the response variable; the adjusted R-squared value is therefore less important.
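
For reference, the adjusted R-squared is derived from the ordinary R-squared by penalising the number of explanatory variables p relative to the number of observations n:

    \bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}

so an additional explanatory variable raises the adjusted value only if it improves the fit by more than would be expected by chance.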

Assumption Testing of Linear Regression Models[1]

There are four principal assumptions supporting the use of linear regression models for purposes of inference:

  1. Linearity and additivity of the relationship between the dependent and independent variables
  2. Statistical independence of errors
  3. Constant variance of the errors
  4. Normality of the error distribution

Violations of linearity or additivity are extremely serious: if a linear model is fitted to data which are non-linearly or non-additively related, the predictions are likely to be in serious error. In multiple regression models, non-linearity or non-additivity may be detected by looking for systematic patterns in plots of the residuals versus individual explanatory variables: the points should be symmetrically distributed around the horizontal line with roughly constant variance. For statistical independence of errors, the residuals should be randomly and symmetrically distributed around zero under all conditions. To satisfy constant variance of the errors, the residuals should not become systematically larger in any one direction of the explanatory variables by a significant amount.
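
A minimal residual-diagnostic sketch in Python (statsmodels and matplotlib, on simulated data as before) plots the residuals against each explanatory variable; curvature would suggest non-linearity, and a funnel shape non-constant variance:

    import numpy as np
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 2))                 # two explanatory variables
    y = 1 + x @ np.array([0.5, -0.3]) + rng.normal(scale=0.8, size=200)
    fit = sm.OLS(y, sm.add_constant(x)).fit()

    # Residuals versus each explanatory variable: points should scatter
    # symmetrically about zero with roughly constant spread
    fig, axes = plt.subplots(1, 2, figsize=(8, 3))
    for i, ax in enumerate(axes):
        ax.scatter(x[:, i], fit.resid, s=8)
        ax.axhline(0, color="grey", linewidth=1)
        ax.set_xlabel(f"x{i + 1}")
        ax.set_ylabel("residual")
    plt.tight_layout()
    plt.show()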

Violations of normality create problems for determining whether model coefficients are significantly different from zero and for calculating confidence intervals for forecasts. Sometimes the error distribution is ‘skewed’ by the presence of a few large outliers. Since parameter estimation is based on the minimisation of squared error, a few extreme observations can exert a disproportionate influence on parameter estimates. Calculations of confidence intervals and the various significance tests for coefficients are all based on the assumption of normally distributed errors. If the error distribution is significantly non-normal, confidence intervals may be too wide or too narrow.

One of the best tests for normally distributed errors is a normal quantile plot of the residuals. This is a plot of the fractiles of the error distribution versus the fractiles of a normal distribution having the same mean and variance. If the distribution is normal, the points on such a plot should fall close to the diagonal reference line.
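
This check can also be sketched in Python: the statsmodels qqplot helper standardises the residuals and draws the diagonal reference line (again on simulated data, purely to illustrate the mechanics):

    import numpy as np
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    x = rng.normal(size=(200, 2))
    y = 1 + x @ np.array([0.5, -0.3]) + rng.normal(scale=0.8, size=200)
    fit = sm.OLS(y, sm.add_constant(x)).fit()

    # Normal quantile (Q-Q) plot of the residuals: points hugging the
    # 45-degree line indicate approximately normal errors
    sm.qqplot(fit.resid, line="45", fit=True)
    plt.show()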


[1] The majority of the material in this section is taken from the ‘Statistical forecasting: notes on regression and time series analysis’ website, Fuqua School of Business, Duke University: http://people.duke.edu/~rnau/testing.htm


