# DISCRIMINANT ANALYSIS


Discriminant analysis is a statistical method that researchers use to understand the relationship between a "dependent variable" and one or more "independent variables." A dependent variable is the variable that a researcher is trying to explain or predict from the values of the independent variables. Discriminant analysis is similar to regression analysis and analysis of variance (ANOVA). The principal difference between discriminant analysis and the other two methods lies in the nature of the dependent variable.

Discriminant analysis requires the researcher to have measures of the dependent variable and all of the independent variables for a large number of cases. In regression analysis and ANOVA, the dependent variable must be a "continuous variable." A continuous variable is a numeric measure that indicates the degree to which a subject possesses some characteristic, so that the higher the value of the variable, the greater the level of the characteristic. A good example of a continuous variable is a person's income.

In discriminant analysis, the dependent variable must be a "categorical variable." The values of a categorical variable serve only to name groups and do not necessarily indicate the degree to which some characteristic is present. An example of a categorical variable is a measure indicating to which one of several different market segments a customer belongs; another example is a measure indicating whether or not a particular employee is a "high potential" worker. The categories must be mutually exclusive; that is, a subject can belong to one and only one of the groups indicated by the categorical variable. While a categorical variable must have at least two values (as in the "high potential" case), it may have numerous values (as in the case of the market segmentation measure).

Because the mathematical methods used in discriminant analysis are complex, they are described here only in general terms, using a simple example in which the dependent variable has only two categories.

Discriminant analysis is most often used to help a researcher predict the group or category to which a subject belongs. For example, when individuals are interviewed for a job, managers will not know for sure how job candidates will perform on the job if hired. Suppose, however, that a human resource manager has a list of current employees who have been classified into two groups: "high performers" and "low performers." These individuals have been working for the company for some time, have been evaluated by their supervisors, and are known to fall into one of these two mutually exclusive categories. The manager also has information on the employees' backgrounds: educational attainment, prior work experience, participation in training programs, work attitude measures, personality characteristics, and so forth. This information was known at the time these employees were hired. The manager wants to be able to predict, with some confidence, which future job candidates are high performers and which are not. A researcher or consultant can use discriminant analysis, along with existing data, to help in this task.

There are two basic steps in discriminant analysis. The first involves estimating coefficients, or weighting factors, that can be applied to the known characteristics of job candidates (i.e., the independent variables) to calculate some measure of their tendency or propensity to become high performers. This measure is called a "discriminant function." Second, this information can then be used to develop a decision rule that specifies some cut-off value for predicting which job candidates are likely to become high performers.

The tendency of an individual to become a high performer can be written as a linear equation. The values of the various predictors of high performer status (i.e., independent variables) are multiplied by "discriminant function coefficients" and these products are added together to obtain a predicted discriminant function score. This score is used in the second step to predict the job candidate's likelihood of becoming a high performer. Suppose that you were to use three different independent variables in the discriminant analysis. Then the discriminant function has the following form:

$$D = B_1 X_1 + B_2 X_2 + B_3 X_3$$

where $D$ = discriminant function score,
$B_i$ = discriminant function coefficient relating independent variable $i$ to the discriminant function score,
$X_i$ = value of independent variable $i$.
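As a minimal sketch of this calculation, the Python snippet below multiplies each predictor value by its coefficient and adds the products to obtain one candidate's discriminant score. The function name, the coefficients, and the predictor values are hypothetical and chosen only for illustration; they are not taken from the article's figure.

```python
def discriminant_score(coefficients, values):
    """Return D = B1*X1 + B2*X2 + ... for a single subject."""
    if len(coefficients) != len(values):
        raise ValueError("need exactly one coefficient per independent variable")
    return sum(b * x for b, x in zip(coefficients, values))


# Hypothetical discriminant function coefficients for three predictors
B = [0.5, 0.25, -0.5]
# Hypothetical (standardized) predictor values for one job candidate
X = [1.0, 2.0, 0.5]

D = discriminant_score(B, X)
print(D)
```

A larger score would place the candidate closer to the high performer end of the scale, provided the coefficients were estimated so that high performers score higher on average.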

The equation is quite similar to a regression equation. Conventional regression analysis, however, should not be used in place of discriminant analysis: the dependent variable would have only two values (high performer and low performer) and would thus violate important assumptions of the regression model. Discriminant analysis does not have these limitations with respect to the dependent variable.

Estimation of the discriminant function coefficients requires a set of cases in which the values of the independent variables and the dependent variable are known. In the case described above, the company has this information for a current group of employees. There are several methods for estimating discriminant function coefficients, but all work on the same general principle: the values of the coefficients are selected so that differences between the groups defined by the dependent variable are maximized with regard to some objective function. One commonly used objective function is the F-ratio, which is defined as it is in ANOVA and regression problems. The coefficients are chosen to maximize the F-ratio when analysis of variance is performed on the resulting discriminant function, using the dependent variable (i.e., job performance) as the grouping variable. Most general statistical programs, such as the Statistical Package for the Social Sciences (SPSS), contain discriminant analysis modules.
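To make the estimation principle concrete, the sketch below uses one classical approach for the two-group case, Fisher's linear discriminant, which is an assumption here since the article does not name a specific method. It solves for coefficients from the pooled within-group sums of squares and the difference in group means, then checks the resulting F-ratio against a few arbitrary alternative weightings. The data (years of education and a motivation score for eight employees) and all function names are hypothetical.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def pooled_scatter(groups):
    """Within-group sums of squares and cross-products (two predictors)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in groups:
        m = mean_vector(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    return s


def fisher_coefficients(high, low):
    """Solve W b = (mean_high - mean_low) for b in the 2x2 case."""
    w = pooled_scatter([high, low])
    mh, ml = mean_vector(high), mean_vector(low)
    diff = [mh[0] - ml[0], mh[1] - ml[1]]
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    return [
        (w[1][1] * diff[0] - w[0][1] * diff[1]) / det,
        (-w[1][0] * diff[0] + w[0][0] * diff[1]) / det,
    ]


def f_ratio(coefs, high, low):
    """One-way ANOVA F-ratio of the discriminant scores for two groups."""
    sh = [sum(b * x for b, x in zip(coefs, r)) for r in high]
    sl = [sum(b * x for b, x in zip(coefs, r)) for r in low]
    grand = (sum(sh) + sum(sl)) / (len(sh) + len(sl))
    mh, ml = sum(sh) / len(sh), sum(sl) / len(sl)
    between = len(sh) * (mh - grand) ** 2 + len(sl) * (ml - grand) ** 2
    within = sum((s - mh) ** 2 for s in sh) + sum((s - ml) ** 2 for s in sl)
    df_within = len(sh) + len(sl) - 2  # between-groups df is 1 for two groups
    return between / (within / df_within)


# Hypothetical data: (years of education, motivation score)
high_performers = [(16, 7), (18, 6), (17, 8), (15, 9)]
low_performers = [(12, 5), (14, 4), (13, 6), (12, 3)]

b = fisher_coefficients(high_performers, low_performers)
best_f = f_ratio(b, high_performers, low_performers)
# Arbitrary alternative weightings, for comparison only
alternatives = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
print(b, best_f)
```

Multiplying all coefficients by the same nonzero constant leaves the F-ratio unchanged, which is why statistical packages typically rescale the coefficients to a standardized form before reporting them.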

There are various tests of significance that can be used in discriminant analysis. One widely used test statistic is based on Wilks' lambda, which provides an assessment of the discriminating power of the function derived from the analysis. If this value is found to be statistically significant, then the set of independent variables can be assumed to differentiate between the groups of the categorical variable. This test, which is analogous to the F-ratio test in ANOVA and regression, is useful in evaluating the overall adequacy of the analysis.
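As a rough illustration, Wilks' lambda can be computed as the ratio of the determinant of the within-group sums-of-squares-and-cross-products matrix to the determinant of the total matrix. The sketch below does this for two predictors only and stops at the statistic itself; converting it to a significance level requires a chi-square or F approximation that is not shown. The data and function names are hypothetical.

```python
def mean_vector(rows):
    n = len(rows)
    return (sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n)


def sscp(rows, center):
    """Sums of squares and cross-products of two predictors about `center`."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = (r[0] - center[0], r[1] - center[1])
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def wilks_lambda(groups):
    """Lambda = det(W) / det(T); values near 0 indicate strong separation."""
    all_rows = [r for g in groups for r in g]
    total = sscp(all_rows, mean_vector(all_rows))
    within = [[0.0, 0.0], [0.0, 0.0]]
    for g in groups:
        s = sscp(g, mean_vector(g))
        for i in range(2):
            for j in range(2):
                within[i][j] += s[i][j]
    return det2(within) / det2(total)


# Hypothetical data: (years of education, motivation score)
high_performers = [(16, 7), (18, 6), (17, 8), (15, 9)]
low_performers = [(12, 5), (14, 4), (13, 6), (12, 3)]

lam = wilks_lambda([high_performers, low_performers])
print(lam)
```

Lambda lies between 0 and 1. A value close to 1 means the group means barely differ on the predictors, while a value close to 0 means most of the total variation is between groups.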

Unfortunately, discriminant analysis does not generate estimates of the standard errors of the individual coefficients, as regression does, so it is not as simple to assess the statistical significance of each coefficient. Most discriminant analysis programs do, however, have a stepwise option, in which independent variables are entered into the equation one at a time. Again, Wilks' lambda can be used to assess the potential contribution of each variable to the explanatory power of the model. Variables from the set of independent variables are added to the equation until a point is reached at which additional variables provide no statistically significant increment in explanatory power.

Once the analysis is completed, the discriminant function coefficients can be used to assess the contributions of the various independent variables to the tendency of an employee to be a high performer. The discriminant function coefficients are analogous to regression coefficients, and in standardized form they typically range between -1.0 and 1.0. The first box in Figure 1 provides hypothetical results of the discriminant analysis. The second box provides the within-group averages for the discriminant function for the two categories of the dependent variable. Note that the high performers have an average score of 1.45 on the discriminant function, while the low performers have an average score of -0.89. The discriminant function is treated as a standardized variable, so it has a mean of zero and a standard deviation of one. The average values of the discriminant function scores are meaningful only in that they help us interpret the coefficients. Since the high performers are at the upper end of the scale, all of the positive coefficients indicate that the greater the value of those variables, the greater the likelihood of a worker being a high performer (e.g., education, motivation).

The magnitudes of the coefficients also tell us something about the relative contributions of the independent variables. The closer the value of a coefficient is to zero, the weaker it is as a predictor of the dependent variable. On the other hand, the closer the value of a coefficient is to either 1.0 or -1.0, the stronger it is as a predictor of the dependent variable. In this example, then, years of education and ability to handle stress both have positive coefficients, though the latter is quite weak. Finally, individuals who place high importance on family life are less likely to be high performers than those who do not.

The second step in discriminant analysis involves predicting to which group in the dependent variable a particular case belongs. A subject's discriminant score can be translated into a probability of being in a particular group by means of Bayes' rule. Separate probabilities are computed for each group, and the subject is assigned to the group with the highest probability. Another test of the adequacy of a model is the degree to which known cases are correctly classified. As in other statistical procedures, it is generally preferable to test the model on a set of cases that were not used to estimate the model's parameters, since this provides a more conservative test of the model. Thus, a set of cases should, if possible, be saved for this purpose. Once the analysis is complete, the results can be used to predict the work potential of job candidates and, ideally, to improve the selection process.
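The classification step can be sketched as follows. The group means of 1.45 and -0.89 are the ones quoted in the article's example; everything else is an assumption made for illustration: scores within each group are taken to be normally distributed with a standard deviation of one, the prior probabilities are set equal, and the held-out cases are invented.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_probabilities(score, group_means, priors, sd=1.0):
    """Bayes' rule: P(group | score) is proportional to prior * density."""
    weighted = {
        g: priors[g] * normal_pdf(score, m, sd) for g, m in group_means.items()
    }
    total = sum(weighted.values())
    return {g: w / total for g, w in weighted.items()}


def classify(score, group_means, priors, sd=1.0):
    """Assign the subject to the group with the highest posterior probability."""
    post = posterior_probabilities(score, group_means, priors, sd)
    return max(post, key=post.get)


group_means = {"high": 1.45, "low": -0.89}  # within-group averages from the example
priors = {"high": 0.5, "low": 0.5}          # assumed equal prior probabilities

# Hypothetical held-out cases: (discriminant score, known group)
holdout = [
    (1.9, "high"), (0.6, "high"), (0.1, "high"),
    (-0.3, "low"), (-1.2, "low"), (0.5, "low"),
]

hits = sum(classify(s, group_means, priors) == g for s, g in holdout)
accuracy = hits / len(holdout)
print(posterior_probabilities(0.6, group_means, priors), accuracy)
```

Under these assumptions the rule reduces to a cut-off at the midpoint between the two group means (0.28), so two of the six invented cases are misclassified. Unequal priors, for instance when high performers are known to be rare, would shift that cut-off toward the less common group's mean.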

There are more complicated cases, in which the dependent variable has more than two categories. For example, workers might have been divided into three groups: high performers, average performers, and low performers. Discriminant analysis allows for such a case, as well as for cases with many more categories. The interpretation of the discriminant function scores and coefficients, however, becomes more complex. The books included in the "Further Reading" section below explain in detail how to perform discriminant analysis with multiple categories and provide in-depth technical discussions.

[ John J. Lawler ]

## Further Reading

Huberty, Carl J. Applied Discriminant Analysis. New York: Wiley, 1994.

Klecka, William R. Discriminant Analysis for Social Sciences. Beverly Hills, CA: Sage Publications, 1980.

Lachenbruch, Peter A. Discriminant Analysis. New York: Hafner Press, 1975.

McLachlan, Geoffrey J. Discriminant Analysis and Statistical Pattern Recognition. New York: Wiley, 1992.
