Regression analysis employs algebraic formulas to estimate the value of a continuous random variable, called the dependent variable, from the value of another variable, called the independent variable. Statistical methods are used to find the best-fitting estimate of the dependent variable and to test whether the estimated relationship is meaningful at all.
Regressions may be used for a wide variety of purposes where estimation is important. For example, a marketer may employ a regression to determine how sales of products might be affected by investments in advertising. An employer may perform a similar analysis to estimate an employee's job evaluation scores based on the employee's performance on an aptitude test. A biologist can even use a regression to see how temperature changes might affect the rate of reproduction in frogs.
While closely related, regression differs from correlation analysis in an important way. Where regression is used to estimate the value of a dependent variable, correlation measures the degree of relationship between two variables. In other words, correlation analysis can indicate the strength of a linear relationship between variables, but it is left to regression analysis to provide predictions of the dependent variable based on values of an independent variable.
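The distinction can be illustrated with a minimal sketch. The function name pearson_r and both data sets below are hypothetical, chosen only for illustration. The two data sets are perfectly linear, so their correlation coefficients are the same, yet they imply very different predictions of Y because their slopes differ.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical observations: both data sets are exact straight lines.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys_a = [3.0, 5.0, 7.0, 9.0, 11.0]     # generated as Y = 1 + 2X
ys_b = [7.0, 17.0, 27.0, 37.0, 47.0]  # generated as Y = -3 + 10X

r_a = pearson_r(xs, ys_a)
r_b = pearson_r(xs, ys_b)
print(f"r for data set A: {r_a:.3f}")
print(f"r for data set B: {r_b:.3f}")
```

Correlation reports the same strength of relationship for both data sets. Only a fitted regression line would show that a one-unit change in X corresponds to a change of 2 in the first case and 10 in the second.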
A simple regression analysis is one in which a single independent variable is used to estimate a dependent variable. The relationship between the variables is assumed to be linear; that is, each unit change in X is associated with the same change in Y. Figure 1 shows examples of linear, nonlinear, and curvilinear scatter diagrams, as well as one where there is no consistent relationship between X and Y variables.
The equation that represents the simple linear regression is
$$Y_i = \alpha + \beta X_i + e_i$$
where $Y_i$ = the value of the dependent variable in a certain observation, i;
$X_i$ = the value of the independent variable in the observation i;
$\alpha$ = the value of Y when X is equal to zero, and may be thought of as the intercept (sometimes denoted $\beta_0$);
$\beta$ = the slope of the regression line;
$e_i$ = the random error in the observation i.
The values of both the independent variable X and the dependent variable Y are provided by a survey, or set of observed numerical samples. These values are kept as ordered pairs (X, Y); a given value of X may be paired with a range of observed values of Y. The value $e_i$ represents the sampling error associated with the dependent random variable Y.
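As a minimal sketch of how the intercept and slope might be estimated from such ordered pairs, the example below assumes the ordinary least squares method, which the text above does not name. The function fit_simple_regression and the aptitude and evaluation figures are hypothetical.

```python
def fit_simple_regression(xs, ys):
    """Estimate (alpha, beta) for Y = alpha + beta * X by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    # Intercept: forces the line through the point of means.
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Hypothetical ordered pairs: aptitude test score (X), job evaluation (Y).
aptitude = [1.0, 2.0, 3.0, 4.0, 5.0]
evaluation = [2.0, 4.0, 5.0, 4.0, 5.0]

alpha, beta = fit_simple_regression(aptitude, evaluation)

# The error term e_i is what remains after the fitted line is subtracted.
residuals = [y - (alpha + beta * x) for x, y in zip(aptitude, evaluation)]

print(f"intercept (alpha): {alpha:.2f}")
print(f"slope (beta): {beta:.2f}")
print("residuals:", [round(e, 2) for e in residuals])
```

With a least squares fit that includes an intercept, the residuals sum to zero, which is one quick check that the estimates were computed correctly.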
Some assumptions must be satisfied to perform the regression analysis. First, when the observations are plotted on a scatter diagram, the spread of the sampling error $e_i$ around its mean must be reasonably consistent for all values of X. In other words, for each value of X, the variation in the values of Y must be about the same. This property is called homoscedasticity.
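One informal way to look for a violation of this assumption is to compare the spread of the errors at low and high values of X. The sketch below is a rough heuristic rather than a formal statistical test. The function spread_ratio and the residual values are hypothetical.

```python
def spread_ratio(xs, residuals):
    """Compare mean squared residuals in the upper and lower halves of X.

    A ratio near 1 suggests a similar spread across X. A ratio far from 1
    suggests the spread changes with X. This is a heuristic, not a test.
    """
    pairs = sorted(zip(xs, residuals))  # order the observations by X
    half = len(pairs) // 2
    low = [e for _, e in pairs[:half]]
    high = [e for _, e in pairs[-half:]]
    ms_low = sum(e * e for e in low) / len(low)
    ms_high = sum(e * e for e in high) / len(high)
    return ms_high / ms_low


xs = [1, 2, 3, 4, 5, 6]

# Hypothetical residuals with a steady spread across all values of X.
steady = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]

# Hypothetical residuals that fan out as X grows.
fanning = [0.1, -0.1, 0.5, -0.5, 2.0, -2.0]

ratio_steady = spread_ratio(xs, steady)
ratio_fanning = spread_ratio(xs, fanning)
print(f"steady spread ratio: {ratio_steady:.2f}")
print(f"fanning spread ratio: {ratio_fanning:.2f}")
```

In practice the same pattern is often easier to spot by plotting the residuals against X and looking for a funnel shape.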
Second, the observed values of the random variable and the amounts of random error must be uncorrelated, a condition usually satisfied by random sampling of the dependent variable.
A simple regression analysis uses only one independent variable. There are many situations, however, where a dependent variable is determined by 2, 3, 5, or even 100 independent variables. As a result, it becomes difficult to represent the relationships between the variables in a visual model.
For example, a simple regression with two variables can be represented on a graph, with one variable measured on the X axis and the other on the Y axis. But add a third variable, and the graph requires a third dimension, X2. As a result, the regression line becomes a regression plane.
Add a fourth variable, and the regression can no longer be represented visually. Conceptually, it occupies four dimensions, and the regression plane becomes what is called a hyperplane. The same applies to regressions with even more variables; eight variables require eight dimensions.
These relationships can be expressed in complex mathematical formulas. They are no longer simple regressions, but multiple regressions.
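A multiple regression with k independent variables extends the simple equation to one intercept plus k slopes. The sketch below assumes a least squares fit obtained by solving the normal equations, a method the text does not specify. The function names and the advertising, price, and sales figures are hypothetical, and the solver does not handle perfectly collinear variables.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        # Move the row with the largest entry in this column into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_multiple_regression(rows, ys):
    """Return [alpha, beta_1, ..., beta_k] for Y = alpha + sum(beta_j * X_j)."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical observations: [advertising, price index] for each period.
predictors = [[1, 1], [2, 1], [3, 2], [4, 3], [5, 5], [6, 4]]
# Sales generated exactly as 1 + 2 * advertising + 3 * price index.
sales = [1 + 2 * a + 3 * p for a, p in predictors]

coefficients = fit_multiple_regression(predictors, sales)
print([round(c, 4) for c in coefficients])
```

Because the hypothetical sales figures contain no random error, the fit should recover the generating coefficients. With real data the estimates would only approximate the underlying relationship.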
[John Simley, updated by Kevin J. Murphy]