This blog is still alive, just in semi-hibernation.
When I want to write something longer than a tweet about something other than math or sci-fi, here is where I'll write it.
Wednesday, September 23, 2009
Wednesday Math, Vol. 89: Mode
There are several ways to determine the "center" of a set of data. The three most commonly taught methods are mean, median and mode. The mean, also known as average, can only be done with numerical data, since you have to add up all the values then divide by the number of things on the list. The median can be done with any data that can be put in order. Once put in order, find the middle position on the list, and the value in that position is the median. For example, if we put all the students in a college in order by class, freshmen, sophomores, juniors and seniors, because students drop out there are likely more freshmen than sophomores, more sophomores than juniors and more juniors than seniors. This means the median is likely a sophomore, unless the drop-out rate is so severe that the median is a freshman.
Then we come to mode, which means the most common value in a data set, as long as there are any duplicates on the list. I was checking sources about the definition. I nicked the definition description seen here from a website for teachers, and I defaced it because as far as I've been taught, it's wrong. Mode can be used with any kind of data, whether it's a set of numbers of a set of categories. For example, if we have a set of cars, we can have as our data the make of car, or the model or the color. For any of those non-numerical variables, we can discuss the mode, which would mean the most popular manufacturer or most popular model or most popular color, respectively. In the example of college students above, freshman should be the mode.
There are some types of numerical data where order does not have much meaning, especially coded data. Take zip codes, for example. If we take a set of zip codes, we can take the average, but it is a number without meaning. Likewise the median has no meaning. This is because two zip codes that are a single number apart can be very far apart on the map. Locally, 94501 and 94502 are right next to each other in Alameda, but 94503 is up in American Canyon, more than forty miles to the north. For such a set of numerical data, the mode is the only measure of center that tells us something useful, and that would be the most popular zip code on a list of data.
Besides the lack of agreement on whether mode is just for numbers and can be used on any data set, mode has exceptions to the rules, something that isn't true with average or median. A set needs to have duplicate values to have a mode. If the data was the social security numbers of a group of people, or the registration numbers of a set of cars, we would expect that all those numbers would be unique, so those sets would have no mode. If there are duplicates in a set, but there is a tie for the most frequent value, then we can have more than one mode. If we take the set of U.S. Presidents and look at the variable of first name, the most common value is James, as there have been six presidents with that first name if we include James Earl Carter. If instead we look at last names, we have several that have shown up twice but none that have shown up three times. The modes for last name are Adams, Johnson, Roosevelt, Harrison and Bush.
And this brings us to Excel and its mode function. The nice folks from Microsoft have decided that there is no mode when dealing with a non-numerical set. Whether checking for first names or last names of presidents, the answer Excel 2007 would give on such a data set is "N/A", or not applicable.
If you are working with a numerical data set with no repeats, Excel will also tell you "N/A", and this time that's actually the correct answer.
The big problem Excel has, which is why I've shown this picture of their dark side logo in colors opposite the green and white they usually use, is that it will not tell you when a data set has more than one mode. If a data set has two values of 3 and two values of 77 and no other duplicates, Excel might tell you the mode is 3 or might tell you the mode is 77, depending on how the data is sorted.
There are workarounds. You can take a data set, let's say it's in column A in your spreadsheet, and copy it into column B. There's a function under the data tab called "remove duplicates". If we perform that on the copy of the list now in column B, and Excel tells us no duplicates, then there is no mode. If there are duplicates, type in this formula into the first cell of column C. (We are assuming the data is in column A from row 1 to row 100.)
Now in cell C1, the number will tell you how many times the value in B1 shows up on the unadulterated list in column A. Click and drag that function down the C column, and that information can be found for all the distinct values, regardless of whether the original data was numerical or categorical. You can then select columns B and C together and sort by biggest number in column C. This will tell you the mode, and it will also tell you if there is more than one mode, if the number in C1 is the same as the number in C2. In the case of the presidential last names, columns B and C would start as follows, assuming the original list of presidents was sorted by chronological order of their administrations:
It's always nice when there is a workaround for your problems, but it's better when there's agreement on definitions that avoids the problems in the first place. The more I study statistics, the happier I am my degree is in mathematics.