Step Regression

This package is designed to show how adding successive predictors to a regression modifies the residuals and mean-squared error. It is not meant as a replacement for the standard lm(y ~ x1, x2, x3) function, but instead a supplement. This package helps one show how regressions work, and how each successive variable in the design matrix helps explain the response vector. In this spirit, the focus is on how the residuals change with each successive variable added to a regression.

library(StepRegression)

Example One: mtcars

Let us take one of the most commonly used built-in datasets in R: mtcars. First, we’ll asign mpg to our response vector y, and the remaining columns to a design matrix x. Then, we’ll call our step_regression function, and plug the results into res_plotter_double() with addIntercept = TRUE.

res_plotter() and res_plotter_double()

There are two built-in plotting functions for this package: res_plotter() and res_plotter_double(). They both create residuals vs values sub-plots for each variable, with each corner containing the SSE after each variable is added to the regression, and the percent by which each variable reduced the SSE. The only difference between the two functions is that res_plotter_double has two columns of plots.

res_plotter_double(res,y,main="Residuals v MPG for Various Factors")

Utility

This function has two main uses. First, it helps identify which variables to include in a multivariate regression. Looking at the mtcars visual, it is interesting to note that several variables that seemed important when examined in isolation are found to do little to improve the SSE when others are included as well. The “disp” (displacement), “drat” (rear axle ratio), “vs” (v-shape vs inline), “carb” (number of carburetors) and “gear” all turned out to be less important than they initially seemed, while the opposite could be said for “hp” (horsepower), “am” (manual vs automatic), and “qsec” (quarter mile time).

The second use is helping to understand how regressions work. Rather than regress all of the variables together, step_regression regresses each out of y and the remaining x variables. Each step is essentially applying regression to the residuals leftover.

Swiss Example & res_plotter()

Here is a second example, this time using the Swiss dataset with the res_plotter function. Here, we are excluding “Infant.Mortality” as it does not seem an appropriate predictor of fertility. The information conveyed by res_plotter() is essentially the same as res_plotter_double(), albeit with just one column. One can once again see that the importance of variables changes when others are included–in this case the “examination” and “agriculture” have switches places of importance.

y<-swiss$Fertility
x<-swiss[,c(-1,-6)]
res<-step_regression(x,y,addIntercept=TRUE)
res_plotter(res,y,main="Residuals of Fertility by Various Factors")

Under the Hood using mtcars dataset

As mentiond above, this package works by regressing one variable out at a time from the response vector and all other variables in the design matrix. For each variable except the last, the function is essentially doing two regressions: the first on y, and the second on the other x variables. Then, the residuals for both y and the remaining x variables are calculated, and the function is then repeated, regressing one variable out at a time until none remain.

Below I’ve included a proof, checking each step with the lm() function. To avoid too much repetition, several variables will be regressed at once. For the first step, I’ve performed a regression on all of the variables in mtcars–one using linear algebra and the second using lm(). The second does two: the y-intercept and “cyl” column. Notice that while the coefficients match those for the lm() function, neither of these match the first multivariate regression performed. While doing a regression in steps like this can be useful for examining the work each variable is doing to explain y, getting correct coefficients for all of the variables would require doing all of the variables at once.

In each step though, whether regressing out 2, 6, 1, or any number of variables, the coefficients are the same (with room for floating point errors) as when one uses the lm() function to do the entire regression at once. However, results will vary between two different uses of stepRegression if one regresses the variables in different orders or different groups. A stepwise regression is more a tool for examining residuals than coefficients.

EXAMPLE 2: Regress two variables twice, all but the last, then the last variable

Regress intercept and number of cylinders

solveY <- solve(t(x[,1:2]) %*% x[,1:2]) %*% t(x[,1:2]) %*% y
print("Linear Algebra Result")
#> [1] "Linear Algebra Result"
t(solveY)
#>      Intercept      cyl
#> [1,]  37.88458 -2.87579
print("")
#> [1] ""
print("Builtin lm Function Result")
#> [1] "Builtin lm Function Result"
xTemp <- x[,2]
print(lm(y ~ xTemp)$coefficients)
#> (Intercept)       xTemp 
#>    37.88458    -2.87579

Regress displacement and horsepower

solveY <- solve(t(xError[,1:2]) %*% xError[,1:2]) %*% t(xError[,1:2]) %*% yError
solveX <- solve(t(xError[,1:2]) %*% xError[,1:2]) %*% t(xError[,1:2]) %*% xError[,3:ncol(xError)]
print("Linear Algebra Result")
#> [1] "Linear Algebra Result"
print(t(solveY))
#>             disp          hp
#> [1,] -0.01883809 -0.01467933
x0 <- x[,2:4]
print("")
#> [1] ""
print("Builtin lm Function Result")
#> [1] "Builtin lm Function Result"
print(lm(y ~ x0)$coefficients[3:4])
#>      x0disp        x0hp 
#> -0.01883809 -0.01467933

Regress rear axle ratio, weight, quarter-mile time, engine type, transmission, and number of forward gears

solveY <- solve(t(xError[,1:6]) %*% xError[,1:6]) %*% t(xError[,1:6]) %*% yError
solveX <- solve(t(xError[,1:6]) %*% xError[,1:6]) %*% t(xError[,1:6]) %*% xError[,7:ncol(xError)]
print("Linear Algebra Result")
#> [1] "Linear Algebra Result"
print(t(solveY))
#>           drat        wt      qsec        vs       am      gear
#> [1,] 0.7059008 -4.032142 0.8682852 0.3647043 2.550928 0.5029362
x0 <- x[,2:10]
print("")
#> [1] ""
print("Builtin lm Function Result")
#> [1] "Builtin lm Function Result"
print(lm(y ~ x0)$coefficients[5:10])
#>     x0drat       x0wt     x0qsec       x0vs       x0am     x0gear 
#>  0.7059008 -4.0321421  0.8682852  0.3647043  2.5509285  0.5029362

Regress number of carburetors

solveY <- solve(t(xError[,1]) %*% xError[,1]) %*% t(xError[,1]) %*% yError
print(solveY)
#>            [,1]
#> [1,] -0.1994193
x0 <- x[,2:11]
print("")
#> [1] ""
print("Builting lm Function Result")
#> [1] "Builting lm Function Result"
print(lm(y ~ x0)$coefficients[11])
#>     x0carb 
#> -0.1994193

Conclusion

I hope that you may fine this small package useful. It is designed to fill a narrow niche, but I hope that it fills it well.