Similar in appearance to a column graph, a histogram illustrates the frequency of occurrence of some measurable event or property. Histograms are used to display statistics in business, economics, and other disciplines, and provide a useful tool for analyzing data and trends.
Histograms are most often employed to chart the distribution of values or results from a set of observations. In this respect they are related to the notion of a bell curve, or a skewed curve, as a description of a series of data. For example, a researcher at a company might use a histogram to summarize data collected from a survey of the household incomes of the company's customers. The horizontal axis could represent income, and the vertical axis could represent the number of respondents. A typical approach would be to determine meaningful ranges of values, or a scale, for the horizontal axis, which in this case is income. Thus, the researcher might choose to divide income into ranges of $10,000, creating a first category of $10,000 or less, a second category of more than $10,000 up to $20,000, and so forth, so that no income can fall into two categories. Using a scale of equal ranges means that each column in the histogram will have uniform width. The column heights would then be determined by the number of respondents falling into each income category. Either actual numbers from the survey or proportions or percentages of the whole may be used to represent the frequency, and hence the height, for each income category in the histogram. Once the data is prepared, the finished diagram may be rendered easily from a spreadsheet or graphing program on a personal computer.
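The binning step in the income example can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: the survey values, the function name bin_incomes, and the convention that a boundary value such as exactly $10,000 belongs to the lower range are all assumptions made for the example.

```python
import math

BIN_WIDTH = 10_000  # width of each income range, in dollars


def bin_incomes(incomes, width=BIN_WIDTH):
    """Count incomes per range; range k covers (k*width, (k+1)*width].

    Assumed convention: a value exactly on a boundary falls in the lower
    range, and an income of 0 is placed in the first range.
    """
    if not incomes:
        return []
    counts = {}
    for income in incomes:
        k = max(math.ceil(income / width) - 1, 0)
        counts[k] = counts.get(k, 0) + 1
    # Fill in empty ranges so every column up to the highest one appears.
    return [counts.get(k, 0) for k in range(max(counts) + 1)]


# Hypothetical survey responses (household income in dollars).
incomes = [8_500, 10_000, 12_000, 15_500, 20_000,
           23_000, 27_500, 28_000, 34_000, 41_000]
frequencies = bin_incomes(incomes)

# Render a rough text histogram: one row per range, one '#' per respondent.
for k, count in enumerate(frequencies):
    low, high = k * BIN_WIDTH, (k + 1) * BIN_WIDTH
    print(f"${low:,} to ${high:,}: {'#' * count} ({count})")
```

Each printed row plays the role of one column, and because every range has the same width, the number of marks in a row corresponds directly to column height.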
The above exercise entails all that is needed to create a simple histogram. Making it meaningful of course requires more effort. Certain kinds of data lend themselves better to histograms than others, and some types of information aren't appropriate for histograms at all.
The horizontal axis should only represent a concept with a definite numerical order and measurable scale. For example, a series of company names could not serve as the horizontal values; such a graph would simply be a column graph comparing discrete observations. In most instances measures of time, such as years, would also be inappropriate as a horizontal variable because these show sequence rather than scale. Since a histogram is used chiefly to illustrate the frequency and dispersion of values, it also makes little sense to choose a horizontal scale with very few ranges (e.g., little dispersion can be shown in just three categories). Despite the income example above, however, it is acceptable to define horizontal categories in unequal ranges along the scale, if the data suggests this is necessary to visually capture nuances.
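When ranges are unequal, raw counts can mislead, because a wide range collects more observations simply by being wide. One common convention, which goes beyond what the text above states, is to draw each column so that its area rather than its height is proportional to the frequency. The sketch below assumes that convention; the range boundaries, the sample values, and the function name density_heights are hypothetical.

```python
from bisect import bisect_left

# Hypothetical boundaries (dollars): narrow ranges where responses are
# dense, wider ranges for the sparse upper tail.
edges = [0, 10_000, 20_000, 30_000, 50_000, 100_000]


def density_heights(values, edges):
    """Return (counts, heights), where height = count per dollar of width.

    Range k covers (edges[k], edges[k+1]]; the lowest edge itself is
    placed in the first range.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        if not edges[0] <= v <= edges[-1]:
            raise ValueError(f"value {v} lies outside the defined ranges")
        k = max(bisect_left(edges, v) - 1, 0)
        counts[k] += 1
    heights = [c / (hi - lo) for c, lo, hi in zip(counts, edges, edges[1:])]
    return counts, heights


values = [5_000, 12_000, 18_000, 25_000, 26_000,
          40_000, 45_000, 60_000, 90_000, 95_000]
counts, heights = density_heights(values, edges)

for lo, hi, c, h in zip(edges, edges[1:], counts, heights):
    print(f"${lo:,} to ${hi:,}: count={c}, height={h * 10_000:.2f} per $10,000")
```

In this sample the widest range holds the most respondents yet gets the shortest column, which is the distortion the area convention is meant to correct.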
In contrast, the vertical axis normally represents indicators of either frequency or proportion, and therefore figures such as revenues and profits wouldn't be appropriate here. Finding a suitable scale is much simpler for the vertical axis, though. If actual frequencies are used, the scale may reflect their range of values. Similarly, proportions and percentages contain implicit scales: the proportions must sum to 1 and the percentages to 100, and thus no column can be taller than the respective sum. Also, because the vertical axis represents frequency or proportion, histograms never show negative values; zero is the lowest frequency or proportion possible.
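The implicit scale of proportions and percentages can be checked directly. The following sketch assumes a hypothetical list of counts per income range and a helper named to_proportions; it is an illustration, not part of any particular charting tool.

```python
def to_proportions(frequencies):
    """Convert non-negative counts into proportions of the whole."""
    if any(f < 0 for f in frequencies):
        raise ValueError("frequencies cannot be negative")
    total = sum(frequencies)
    if total == 0:
        raise ValueError("at least one observation is required")
    return [f / total for f in frequencies]


frequencies = [2, 3, 3, 1, 1]  # hypothetical counts per income range
proportions = to_proportions(frequencies)
percentages = [100 * p for p in proportions]

for f, p, pct in zip(frequencies, proportions, percentages):
    print(f"count={f}  proportion={p:.2f}  percentage={pct:.0f}%")
```

Whatever the counts, every resulting height lies between 0 and 1 (or 0 and 100), so the vertical scale is known before the data is plotted.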
While the information to be gleaned from a histogram varies widely with the nature of the data and the perspective of the beholder, a few generalizations may be made. The obvious points of interest in a histogram are its peaks and troughs. These show how widely and how evenly the variable measured on the horizontal axis is dispersed. In a so-called normal distribution, there is one broad peak in the middle category and the histogram's left and right sides are symmetrical. This need not be the case, however. There may be multiple peaks, and the distribution may be skewed toward either end of the scale.
Consider again the example of the customer income study. Suppose there were a peak in customer income in the $80,000-$90,000 category and virtually no values below $30,000. This would indicate that households in the company's customer base earn on average substantially more than the typical U.S. household. (This assumes of course that the sample was representative and that there's no reason to believe respondents misreported their incomes.)
Perhaps contrary to intuition, when a histogram displays a large, tight cluster of values in one range and very few values in other ranges, this means a fairly homogeneous distribution exists, at least in terms of the variable being measured. In fact, a perfectly homogeneous distribution would have just one large column, because all values would fall in the same category. By contrast, if values are widely dispersed and there is no clear peak, a highly heterogeneous distribution is suggested. A perfectly heterogeneous distribution would indeed produce a flat histogram. Again, depending on what's being measured and who's measuring it, these properties may be considered favorable or unfavorable.
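The two extremes described above can be made concrete with a crude indicator: the share of all observations that fall in the tallest column. This is an illustrative measure chosen for the example, not a standard statistic, and the function name peak_share and the sample counts are hypothetical.

```python
def peak_share(frequencies):
    """Fraction of all observations that fall in the tallest column."""
    total = sum(frequencies)
    if total == 0:
        raise ValueError("at least one observation is required")
    return max(frequencies) / total


one_column = [0, 0, 40, 0, 0]  # perfectly homogeneous: a single column
clustered = [1, 3, 30, 4, 2]   # tight cluster with a few stray values
flat = [8, 8, 8, 8, 8]         # perfectly heterogeneous: a flat histogram

for name, freqs in [("one column", one_column),
                    ("clustered", clustered),
                    ("flat", flat)]:
    print(f"{name}: peak share = {peak_share(freqs):.2f}")
```

A share of 1 corresponds to the single-column case, while a flat histogram with n columns gives the minimum possible share of 1/n; values in between indicate how strongly the data cluster in one range.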
As a final example, suppose the researcher studying customer income finds a highly homogeneous population of customers all in high income brackets. If the researcher seeks to target a marketing campaign at the company's most important demographic group, this homogeneity would likely be seen as positive. However, if the company is under criminal investigation for illegally discriminating against customers with low incomes, this histogram would most likely elicit a negative interpretation.