
Regression assumptions

The statistical hypothesis tests associated with regression analysis are predicated on some key assumptions about the data.

Linearity: This is usually checked by examining a scatter diagram of the data or the residual plot. If the model is appropriate, then the residuals should appear to be randomly scattered about zero, with no apparent pattern. If the residuals exhibit some well-defined pattern, such as a linear trend or a parabolic shape, then there is good evidence that some other functional form might better fit the data.
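
As a minimal sketch of this check, the following Python snippet (using hypothetical data; the variable names and model are assumptions for illustration) fits an ordinary least squares model with statsmodels and plots the residuals against the fitted values:

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)                 # hypothetical predictor
y = 2.0 + 1.5 * x + rng.normal(0, 1, 100)   # hypothetical response

model = sm.OLS(y, sm.add_constant(x)).fit()

# If the linear model is appropriate, the residuals scatter randomly
# about zero; a trend or parabolic shape suggests another functional form.
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="grey", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()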

Normality of errors: Regression analysis assumes that the errors for each individual value of X are normally distributed, with a mean of zero. This can be verified either by examining a histogram of the standardized residuals and inspecting for a bell-shaped distribution or by using more formal goodness-of-fit tests. It is usually difficult to evaluate normality with small sample sizes. However, regression analysis is fairly robust against departures from normality, so in most cases, this is not a serious issue.
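
Both approaches can be sketched as follows, again on hypothetical data; the Shapiro-Wilk test from scipy is used here as one example of a formal goodness-of-fit test, not the only option:

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 100)
model = sm.OLS(y, sm.add_constant(x)).fit()

# Histogram of (approximately) standardized residuals: look for a bell shape.
std_resid = model.resid / np.std(model.resid, ddof=1)
plt.hist(std_resid, bins=15)
plt.xlabel("Standardized residuals")
plt.show()

# Shapiro-Wilk test: a small p-value casts doubt on normality.
w_stat, p_value = stats.shapiro(model.resid)
print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {p_value:.3f}")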

Homoscedasticity: The variation about the regression line is constant for all values of the independent variable. This can also be evaluated by examining the residual plot and looking for large differences in the variances at different values of the independent variable.
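
Beyond visual inspection, a formal test can complement the residual plot. The sketch below uses the Breusch-Pagan test from statsmodels (one possible choice among several, applied to hypothetical data):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 100)
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Null hypothesis: constant error variance (homoscedasticity).
# A small p-value suggests the variance changes with the predictor.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
print(f"Breusch-Pagan LM p-value = {lm_pvalue:.3f}")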

In most situations, the model is derived from limited data, and multiple observations for different values of X are not available, making it difficult to draw definitive conclusions about homoscedasticity. If this assumption is seriously violated, then techniques other than least squares should be used for estimating the regression model.
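
One such alternative is weighted least squares. The sketch below assumes, purely for illustration, that the error standard deviation grows in proportion to X, so each observation is weighted by the inverse of its assumed error variance:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(1, 10, 100)
# Hypothetical data where the error standard deviation grows with x.
y = 2.0 + 1.5 * x + rng.normal(0, 0.5 * x)
X = sm.add_constant(x)

# Weight each observation by the inverse of its assumed error variance,
# so noisier observations count for less in the fit.
weights = 1.0 / x**2
wls_model = sm.WLS(y, X, weights=weights).fit()
print(wls_model.params)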

Independence of errors: Finally, residuals should be independent for each value of the independent variable. For cross-sectional data, this assumption is usually not a problem. However, when time is the independent variable, this is an important assumption. If a scatter plot of the errors against time shows a trend or pattern, then the errors are not independent.
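
A minimal sketch of this visual check on hypothetical time-ordered data:

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
t = np.arange(100)                          # hypothetical time index
y = 5.0 + 0.3 * t + rng.normal(0, 1, 100)
model = sm.OLS(y, sm.add_constant(t)).fit()

# Plot the residuals in time order; a trend or cyclical pattern
# indicates the errors are not independent.
plt.plot(t, model.resid, marker="o")
plt.axhline(0, color="grey", linestyle="--")
plt.xlabel("Time")
plt.ylabel("Residuals")
plt.show()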

If successive observations appear to be correlated, for example because successive errors become larger over time or exhibit a cyclical pattern, then this assumption is violated. Correlation among successive observations over time is called autocorrelation. Autocorrelation can be evaluated more formally using a statistical test based on a measure called the Durbin–Watson statistic: the ratio of the sum of squared differences between successive residuals to the sum of squares of all residuals,

D = Σ (e_t − e_{t-1})^2 / Σ e_t^2,

where e_t is the residual for observation t, the numerator sums over t = 2, ..., n, and the denominator sums over t = 1, ..., n.

D ranges from 0 to 4, with values near 2 indicating little or no autocorrelation.

Critical values for a formal test are tabulated in Durbin–Watson tables. For most practical purposes, values below 1 suggest positive autocorrelation; values between 1.5 and 2.5 suggest no autocorrelation; and values above 2.5 suggest negative autocorrelation.
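
The statistic can be computed directly with statsmodels; the sketch below applies it to hypothetical time-series data and interprets the result with the rule of thumb above:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(42)
t = np.arange(100)
y = 5.0 + 0.3 * t + rng.normal(0, 1, 100)
model = sm.OLS(y, sm.add_constant(t)).fit()

d = durbin_watson(model.resid)
print(f"Durbin-Watson D = {d:.2f}")

# Rule-of-thumb interpretation from the text above.
if d < 1:
    print("Possible positive autocorrelation")
elif d > 2.5:
    print("Possible negative autocorrelation")
elif 1.5 < d < 2.5:
    print("No strong evidence of autocorrelation")
else:
    print("Inconclusive; consult the Durbin-Watson tables")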
