Let’s first understand what regression means.
Regression is a process through which we estimate the relationship between dependent and independent variables using various statistical methods.
Types of regression:
- Linear Regression
- Logistic Regression
- Polynomial Regression
- Stepwise Regression
- Ridge Regression
- Lasso Regression
- ElasticNet Regression
In this post, I have mainly focused on ‘Linear Regression’.
Here I would like to clear up some common confusions about linear regression:
- If regression is about analyzing variables and predicting future values, can time series forecasting problems be solved using regression techniques?
- What if any of the assumptions of linear regression fails?
- There are many statistical tests to check linear regression’s assumptions; which one should we choose for which assumption?
- What is the cost function and loss function in linear regression?
- How can we extract information using various plots for linear regression?
Before applying linear regression, there are certain assumptions we need to check.
Let’s check all assumptions one by one:
- The relationship between the independent and dependent variables should be linear. In the simplest terms, as the value of an independent variable increases or decreases, the value of the dependent variable should also increase or decrease.
How to check linear relationship among variables:
This can be checked by correlation plot, scatter plot, pair plot or heat map.
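As a minimal sketch (using synthetic data, since the post does not include a dataset), the Pearson correlation coefficient gives a quick numerical companion to the scatter plot: a value close to 1 or -1 suggests a linear relationship.

```python
import numpy as np

# Hypothetical data: y depends linearly on x, plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=100)

# Pearson correlation close to 1 (or -1) suggests a linear relationship;
# a scatter plot of x vs. y would confirm it visually.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))
```

A correlation near zero would instead prompt a look at the scatter plot for a nonlinear pattern, since correlation only captures linear association.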
- There should not be multi-collinearity among the independent variables.
What is multi-collinearity?
Multi-collinearity is a state of very high inter-correlations or inter-associations among the independent variables.
What if multi-collinearity is present:
General Effect: If a model’s independent variables are correlated, it becomes tough to figure out the true relationship of the features (predictors) with the response variable, because it is difficult to tell which variable is contributing to the prediction of the target variable.
Statistical Effect: In the presence of correlated independent variables, the standard errors tend to increase. As the standard errors increase, the confidence intervals become wider, leading to less precise estimates of the slope parameters.
How to check multi-collinearity:
Multi-collinearity can be checked via a correlation plot or the VIF (Variance Inflation Factor).
For no multi-collinearity, the VIF should be low (commonly VIF < 2, though 2 is not a fixed criterion; it may be 3 or 4, depending on the data).
In case of a high VIF, look at the correlation table to find highly correlated variables and drop one of each correlated pair.
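A minimal sketch of computing VIF by hand with NumPy (statsmodels also provides a `variance_inflation_factor` helper). The data here is synthetic, with `x2` deliberately made almost a copy of `x1` so that both receive a high VIF:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (no intercept column).
    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
    on the remaining columns (with an intercept)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Hypothetical example: x2 is nearly a copy of x1, x3 is independent.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                   # independent predictor
X = np.column_stack([x1, x2, x3])
v = vif(X)
print(np.round(v, 1))
```

Here `x1` and `x2` get very large VIFs while `x3` stays close to 1, so we would drop one of the correlated pair before fitting.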
- Residual terms should have constant but unknown variance (homoscedasticity). Residuals are the error terms obtained after fitting the linear regression model.
How to check homoscedasticity?
We can check this by plotting residuals against fitted values. There should be no visible pattern in the plot.
What if heteroscedasticity exists:
If the variance of the error terms is not constant, it is known as heteroscedasticity. Usually, non-constant variance occurs in the presence of outliers or extreme leverage values. Such points get too much weight and thereby disproportionately influence the model’s performance. When heteroscedasticity occurs, the confidence interval for out-of-sample prediction tends to be unrealistically wide or narrow.
- All residuals should be normally distributed and their mean should be zero.
What if the residuals are not normally distributed:
If the error terms are non-normally distributed, confidence intervals may become too wide or narrow. Once the confidence intervals become unstable, it is difficult to estimate coefficients by minimizing least squares. Non-normality also suggests that there are a few unusual data points which must be studied closely to build a better model.
How to check whether the residuals are normally distributed:
Use a Q-Q plot, which is a scatter plot between the quantiles of the residuals and the quantiles of a normal distribution. If the residuals come from a normal distribution, the plot shows an almost straight line; deviations from that line indicate non-normality.
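A minimal sketch using SciPy’s `probplot`, which computes the Q-Q points along with the correlation `r` of the best-fit line through them (synthetic residuals assumed here); `r` close to 1 indicates the quantiles track a normal distribution.

```python
import numpy as np
from scipy import stats

# Hypothetical residuals from a fitted model; here drawn from a normal
# distribution so the Q-Q line should be nearly straight.
rng = np.random.default_rng(3)
resid = rng.normal(0, 1.5, size=300)

# probplot returns the ordered Q-Q points plus (slope, intercept, r)
# of the least-squares line through them.
(osm, osr), (slope, intercept, r) = stats.probplot(resid, dist="norm")
print(round(r, 3))
```

Passing `plot=plt` (with matplotlib imported) would draw the Q-Q plot itself; heavy-tailed or skewed residuals would pull `r` visibly below 1.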
Designing the model
Linear regression follows an equation of this form:
Y = a + bX + e
where Y is the target output, a is the y-intercept of the line, b is the slope of the line, and e is the error (residual). To get the most accurate model, we need to find a and b such that they give the optimal (least-error) solution.
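A minimal sketch of estimating a and b with the closed-form least-squares formulas, on synthetic data generated with known a = 2 and b = 3:

```python
import numpy as np

# Hypothetical data generated from Y = a + bX + e with a = 2, b = 3.
rng = np.random.default_rng(4)
X = np.linspace(0, 5, 100)
e = rng.normal(0, 0.5, size=100)
Y = 2.0 + 3.0 * X + e

# Closed-form least-squares estimates:
#   b = cov(X, Y) / var(X),  a = mean(Y) - b * mean(X)
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()
print(round(a, 1), round(b, 1))
```

The recovered estimates land close to the true a = 2 and b = 3; the residual noise e is what keeps them from matching exactly.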
How to check accuracy
To check the error obtained, we calculate the sum of squares of the errors:
Error = Σ(y_actual - y_predicted)²
The error calculated at an individual data point is known as the loss function, while the error combined over the whole dataset is known as the cost function. The cost function is basically a measure of how badly our model is performing; to achieve high accuracy, we need to minimize the cost function.
Now the question arises: how do we minimize the cost function?
In linear regression, the cost function is the sum of squared errors. To minimize it, the gradient descent algorithm is used.
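A minimal sketch of gradient descent on the mean squared error for Y = a + bX, with synthetic data and an assumed fixed learning rate:

```python
import numpy as np

# Hypothetical data from Y = 2 + 3X + noise.
rng = np.random.default_rng(5)
X = np.linspace(0, 1, 50)
Y = 2.0 + 3.0 * X + rng.normal(0, 0.1, size=50)

a, b = 0.0, 0.0   # start from zero parameters
lr = 0.1          # assumed fixed learning rate
for _ in range(5000):
    pred = a + b * X
    err = pred - Y
    # Gradients of the mean squared error with respect to a and b.
    a -= lr * 2 * err.mean()
    b -= lr * 2 * (err * X).mean()
print(round(a, 1), round(b, 1))
```

After enough iterations the parameters settle near the values the closed-form least-squares solution would give; too large a learning rate would make the updates diverge instead.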
How is time series different from linear regression?
Let’s understand the possible issues one by one:
- In time series analysis, future values depend on past values, i.e. lagged values. So if we add lagged terms as predictors in linear regression, it may cause a collinearity problem.
- Usually in time series, the dependent (target) variable is not stationary; its behavior varies with time, whereas regression assumes it does not.
- In time series data, the errors are usually autocorrelated (lagged errors), whereas in linear regression the errors should be randomly distributed.
In short, we can apply regression to time series data only when the assumptions of regression hold; otherwise, we cannot.