Statistical analysis provides a framework for organizing and analyzing data and for examining business problems in a logical and systematic way. With the tremendous strides that have taken place in computer technology, businesses have access to more data than ever before. Statistical analysis gives managers the tools necessary to make sense of large quantities of data and to make more effective business decisions based on inferences drawn from those data.
Statistical methods may be broken into two broad categories: methods of description and methods of inference. Descriptive statistics consist of a variety of techniques, both mathematical and graphical, by which to organize and describe data. Two characteristics of great interest in data description are the central tendency and the degree of variation of a given variable. For example, a manager might want to know the average earnings of a group of workers, or whether there is much variation in the diameter of items produced in a production run.
To assess central tendency and variation graphically, the manager could chart data for a given variable using a frequency histogram. A histogram is a bar chart that breaks the range of a variable into sub-ranges (or classes), ordered from the lowest to the highest values of the data, and plots the frequency of occurrence in each class. Often the most frequently occurring values of the variable appear near the middle of the histogram, so the class containing them has the tallest bar. Variation is shown by how spread out the bars are. If the classes at the lower and upper ends of the data have bars without much height, then most observations lie near the center and the data are not very spread out or variable.
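The construction of a frequency histogram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `frequency_table`, the choice of equal-width classes, and the earnings figures are all hypothetical and do not come from the article.

```python
def frequency_table(values, num_classes):
    """Split the range of the data into equal-width classes and count
    how many observations fall into each class.

    Assumes the data are not all identical (the class width must be > 0).
    """
    low, high = min(values), max(values)
    width = (high - low) / num_classes
    counts = [0] * num_classes
    for v in values:
        # The largest value is placed in the last class, not a new one.
        index = min(int((v - low) / width), num_classes - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(num_classes + 1)]
    return edges, counts


# Hypothetical hourly earnings (in dollars) for twelve workers.
earnings = [10, 12, 18, 21, 22, 24, 25, 27, 29, 33, 38, 40]
edges, counts = frequency_table(earnings, 3)

# Print a crude text histogram: one '#' per observation in each class.
for i, count in enumerate(counts):
    print(f"{edges[i]:5.1f} to {edges[i + 1]:5.1f}: {'#' * count}")
```

With these made-up figures the middle class has the tallest bar and the two end classes are shorter, which is the pattern the paragraph above describes for data that are not very spread out.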
Though the frequency histogram is the most popular means by which to display data graphically, a variety of other techniques exist. These include stem-and-leaf plots, box plots, and pie charts. In addition to determining the central tendency and variation of a given variable, the manager may in other instances be interested in determining movement in data over time, or in ascertaining whether two variables bear any relationship to each other. In either instance, a scatterplot could be used to represent the data graphically. In the former instance, the variable in question would be charted against time itself (a time-series plot), while in the latter one variable would be plotted against the other.
Mathematical techniques also exist by which to describe data. These methods are usually used in conjunction with, rather than as an alternative to, the graphical techniques noted above. The chief measures of central tendency are the mean, median, and mode of a variable. The mean is obtained by taking the sum of the observations on the variable and dividing by the number of observations. The mean is strongly influenced by particularly large or particularly small values of the variable, and for this reason the median is sometimes a better measure of central tendency than the mean. The median is found by sorting values of the variable from lowest to highest and identifying the middle value in this ranking (with an even number of observations, the average of the two middle values is used). The mode is the most easily obtained measure of central tendency and is simply the most frequently occurring value in the data.
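The three measures of central tendency can be illustrated with Python's standard `statistics` module. The earnings figures below are hypothetical and were chosen so that one unusually large value shows how the mean and the median can diverge.

```python
import statistics

# Hypothetical hourly earnings; the last value is an unusually high earner.
earnings = [8, 9, 9, 10, 11, 12, 60]

mean = statistics.mean(earnings)      # sum of observations / number of observations
median = statistics.median(earnings)  # middle value after sorting
mode = statistics.mode(earnings)      # most frequently occurring value

print("mean:", mean)      # pulled upward by the single large value (equals 17)
print("median:", median)  # unaffected by how large the top value is (equals 10)
print("mode:", mode)      # 9 occurs twice, more often than any other value

# With an even number of observations the median is the average
# of the two middle values.
even_median = statistics.median([8, 9, 10, 11])
print("median of an even-sized sample:", even_median)
```

Here six of the seven workers earn less than the mean, so the median gives a better picture of typical earnings in this made-up group.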
The chief measures of variation are the variance, standard deviation, and range of the data. The variance is the sum of the squared deviations from the mean of the variable divided by the number of observations minus one (in other words, it is approximately the average squared deviation from the mean). The standard deviation is the square root of the variance. If the data are highly variable, then many observations will fall a considerable distance from the mean, and for this reason both the variance and the standard deviation will take on relatively large values. The range of the data is simply the largest value minus the smallest value. Though much easier to compute than the variance and standard deviation, the range can be misleading, particularly if the variable has a small number of very large or very small values that are unrepresentative of the general tendency of the data.
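The definitions above translate directly into code. The following sketch computes the sample variance by hand, dividing by the number of observations minus one as described, and then derives the standard deviation and the range. The helper name `sample_variance` and the data are hypothetical.

```python
from math import sqrt


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


# Hypothetical measurements with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = sample_variance(data)     # 32 / 7, roughly 4.57
std_dev = sqrt(variance)             # square root of the variance
data_range = max(data) - min(data)   # largest value minus smallest value

print(f"variance: {variance:.3f}")
print(f"standard deviation: {std_dev:.3f}")
print(f"range: {data_range}")
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same divisor of n minus one, so they can be used in place of the hand-written helper.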
In many cases, a manager may wish to go beyond mere description of a variable and instead draw some larger inference regarding the variable based on the available data. This issue becomes relevant when the data at hand are a sample drawn from some larger population. For example, one might have hourly earnings for a random sample of 100 high school graduates in a metropolitan area. The mean hourly earnings of this group could be calculated. The question is whether the average for this sample helps one infer the average for the population of high school graduates in the metropolitan area. It turns out that methods of statistical inference do indeed allow one to make inferences concerning the population (which is really the group of interest) on the basis of the information contained in the sample.
Statistical inference techniques may be broken into two broad categories: estimation and hypothesis testing. Underpinning both are the concepts of the random variable and the probability distribution, drawn from probability theory. With estimation, one is interested in estimating some population parameter (say, the mean of the population) on the basis of information in the sample of data. One could estimate the population parameter of interest using the corresponding quantity computed from the sample (known as a statistic), but on its own such an estimate is of limited use. This is because a sample may accidentally overrepresent the higher end (or the lower end) of the population due to the random nature of the sampling. In turn, the sample statistic usually either over- or underestimates the population parameter. Using results from probability theory, however, one can establish a range (known as a confidence interval) that, with a stated level of confidence (usually 90 percent, 95 percent, or 99 percent, depending on the degree of confidence desired), contains the population parameter of interest. As long as this range is reasonably narrow, the inference will be highly informative. So, returning to the hourly earnings example from above, suppose average earnings in the sample are $9.25 per hour. One could not be very confident that this is exactly the average for all high school graduates in the metropolitan area. But the methods of statistical inference allow one to take this measure (known as a point estimate) and use additional information concerning the variation of the data within the sample to infer, with 95 percent confidence, that mean earnings of high school graduates in the metropolitan area are, say, between $9.10 and $9.40 per hour.
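A confidence interval for a population mean can be sketched as follows. The sample mean of $9.25 and the sample size of 100 come from the example above; the sample standard deviation of $0.765 is an assumed value, not given in the article, chosen so that the result lands close to the illustrative $9.10 to $9.40 range. The sketch also assumes the sample is large enough for the normal approximation to be reasonable (with small samples the t distribution would normally be used), and the function name `mean_confidence_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    # Critical value: about 1.96 for 95 percent confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # The standard error shrinks as the sample size grows.
    margin = z * sample_sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Sample mean and size from the earnings example; the standard
# deviation (0.765) is an assumed value for illustration only.
low, high = mean_confidence_interval(9.25, 0.765, 100)
print(f"95% interval: {low:.2f} to {high:.2f}")

# Asking for more confidence widens the interval.
low99, high99 = mean_confidence_interval(9.25, 0.765, 100, confidence=0.99)
print(f"99% interval: {low99:.2f} to {high99:.2f}")
```

The point estimate sits at the center of the interval, and the width depends on the variation within the sample, the sample size, and the degree of confidence desired.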
In conducting a hypothesis test, one poses a null hypothesis that the population parameter in question equals some specific value against an open-ended alternative hypothesis that the parameter does not equal the value specified under the null (known as a two-tailed test), is greater than the value specified in the null (a one-tailed test), or is less than the value specified in the null (also a one-tailed test). Using information from the sample of data, one then determines whether enough evidence exists to reject the null hypothesis at a chosen level of significance.
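The logic of a hypothesis test about a population mean can be sketched with a large-sample z-test. Everything specific here is an assumption made for illustration: the null value of $9.00, the sample standard deviation of $0.765, the 5 percent significance level, and the function name `z_test_mean`. As with the interval above, the normal approximation is assumed to be adequate for a sample of 100.

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(sample_mean, sample_sd, n, null_value, alternative="two-sided"):
    """Return the z statistic and p-value for a test of a population mean."""
    z = (sample_mean - null_value) / (sample_sd / sqrt(n))
    std_normal = NormalDist()
    if alternative == "two-sided":    # parameter differs from the null value
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    elif alternative == "greater":    # parameter exceeds the null value
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":       # parameter is below the null value
        p_value = std_normal.cdf(z)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
    return z, p_value


# Hypothetical null hypothesis: mean hourly earnings equal $9.00.
z, p_value = z_test_mean(9.25, 0.765, 100, null_value=9.00)
alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject null: {reject_null}")
```

A small p-value means that a sample mean this far from the null value would be unlikely if the null hypothesis were true, which is the sense in which the sample provides evidence against it.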
Confidence intervals and hypothesis tests are powerful tools and can be applied to a variety of questions. The example above showed how these tools may be used to make an inference about a population mean. One may also use these tools to make inferences about population proportions, about standard deviations of populations, and so forth.
Additional useful tools in the manager's statistical analysis toolkit include analysis of variance and regression analysis. Analysis of variance allows one to determine whether the means of a number of populations (three or more) differ. Regression analysis allows one to estimate the impact of one or more variables (called independent variables) on some variable of interest (called the dependent variable). The concepts of the confidence interval and the hypothesis test are also used in the context of these techniques.
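As a minimal sketch of regression analysis, the simplest case (one independent variable, fitted by ordinary least squares) can be written out by hand. The function name `fit_line` and the data are hypothetical, and the sketch covers only the point estimates of the slope and intercept, not the confidence intervals or hypothesis tests that would accompany them in practice.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one independent variable."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-products and sum of squared deviations of x.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical data: an independent variable x and a dependent variable y.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept = fit_line(x, y)
print(f"estimated change in y per unit of x: {slope:.2f}")
print(f"estimated y when x is zero: {intercept:.2f}")
```

The slope is the estimated impact of the independent variable on the dependent variable; with several independent variables the same idea extends to multiple regression, which is usually fitted with statistical software rather than by hand.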
[ Kevin J. Murphy ]
Albright, S. C., W. Winston, and C. J. Zappe. Data Analysis and Decision Making with Microsoft Excel. Pacific Grove, CA: Duxbury Press, 1999.
Freund, J. E., F. J. Williams, and B. M. Perles. Elementary Business Statistics: The Modern Approach. 6th ed. Englewood Cliffs, NJ: Prentice Hall, 1993.
Neter, J., W. Wasserman, and G. A. Whitmore. Applied Statistics. 4th ed. Boston: Allyn & Bacon, 1993.