Wednesday, September 23, 2009

Wednesday Math, Vol. 89: Mode

There are several ways to determine the "center" of a set of data. The three most commonly taught methods are mean, median and mode. The mean, also known as average, can only be done with numerical data, since you have to add up all the values then divide by the number of things on the list. The median can be done with any data that can be put in order. Once put in order, find the middle position on the list, and the value in that position is the median. For example, if we put all the students in a college in order by class, freshmen, sophomores, juniors and seniors, because students drop out there are likely more freshmen than sophomores, more sophomores than juniors and more juniors than seniors. This means the median is likely a sophomore, unless the drop-out rate is so severe that the median is a freshman.
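The position-counting rule for the median can be sketched in Python. The enrollment counts below are made up for illustration, and `ordinal_median` is a hypothetical helper name, not a library function. When the list has an even length and the two middle positions land in different categories, this sketch returns both categories rather than forcing a single answer.

```python
def ordinal_median(counts):
    """Return the categories at the two middle positions of an ordered list.

    `counts` is a list of (category, count) pairs, already in rank order.
    If both middle positions fall in one category, the pair repeats it.
    """
    total = sum(n for _, n in counts)
    if total == 0:
        raise ValueError("no data")
    # 1-based middle positions; they coincide when total is odd.
    lower_pos = (total + 1) // 2
    upper_pos = total // 2 + 1

    def category_at(position):
        running = 0
        for category, n in counts:
            running += n
            if position <= running:
                return category

    return category_at(lower_pos), category_at(upper_pos)


# Hypothetical enrollments, ordered by class.
mild_dropout = [("freshman", 400), ("sophomore", 300), ("junior", 200), ("senior", 100)]
severe_dropout = [("freshman", 600), ("sophomore", 200), ("junior", 120), ("senior", 80)]
even_split = [("freshman", 1000), ("sophomore", 500), ("junior", 300), ("senior", 200)]

print(ordinal_median(mild_dropout))    # ('sophomore', 'sophomore')
print(ordinal_median(severe_dropout))  # ('freshman', 'freshman')
print(ordinal_median(even_split))      # ('freshman', 'sophomore')
```

Working from counts per category instead of one row per student keeps the example short. The only thing the function needs is the order of the categories, never a numerical value for them.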

Then we come to mode, which means the most common value in a data set, as long as there are any duplicates on the list. I was checking sources about the definition. I nicked the definition seen here from a website for teachers, and I defaced it because as far as I've been taught, it's wrong. Mode can be used with any kind of data, whether it's a set of numbers or a set of categories. For example, if we have a set of cars, we can have as our data the make of car, or the model or the color. For any of those non-numerical variables, we can discuss the mode, which would mean the most popular manufacturer or most popular model or most popular color, respectively. In the example of college students above, freshman should be the mode.

There are some types of numerical data where order does not have much meaning, especially coded data. Take zip codes, for example. If we take a set of zip codes, we can take the average, but it is a number without meaning. Likewise the median has no meaning. This is because two zip codes that are a single number apart can be very far apart on the map. Locally, 94501 and 94502 are right next to each other in Alameda, but 94503 is up in American Canyon, more than forty miles to the north. For such a set of numerical data, the mode is the only measure of center that tells us something useful, and that would be the most popular zip code on a list of data.
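A small Python sketch of the zip code point. The list is hypothetical, built from the three zip codes named above. The arithmetic for mean and median goes through without complaint, but only the mode gives an answer that is actually a place on the list.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical mailing list, stored as text because zip codes are labels.
zips = ["94501", "94501", "94501", "94502", "94503", "94503"]

as_numbers = [int(z) for z in zips]
print(mean(as_numbers))    # computable, but not a zip code on the list
print(median(as_numbers))  # 94501.5, also not a zip code

most_common_zip, count = Counter(zips).most_common(1)[0]
print(most_common_zip, count)  # 94501 3
```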

Besides the lack of agreement on whether mode is just for numbers or can be used on any data set, mode has exceptions to the rules, something that isn't true with average or median. A set needs to have duplicate values to have a mode. If the data were the social security numbers of a group of people, or the registration numbers of a set of cars, we would expect that all those numbers would be unique, so those sets would have no mode. If there are duplicates in a set, but there is a tie for the most frequent value, then we can have more than one mode. If we take the set of U.S. Presidents and look at the variable of first name, the most common value is James, as there have been six presidents with that first name if we include James Earl Carter. If instead we look at last names, we have several that have shown up twice but none that have shown up three times. The modes for last name are Adams, Johnson, Roosevelt, Harrison and Bush.
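Those two exceptions, no mode when nothing repeats and several modes when there is a tie, can be written down as a short Python sketch. The function name `modes` and the sample lists are made up for illustration, and the rule it follows is the one described above, which not every textbook shares.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest count, or [] if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top < 2:
        return []  # all values unique: no mode under the rule above
    return [value for value, n in counts.items() if n == top]


car_colors = ["white", "red", "white", "blue", "red", "black"]
plates = ["4ABC123", "5XYZ789", "6LMN456"]  # made-up registration numbers

print(modes(car_colors))  # ['white', 'red']
print(modes(plates))      # []
```

Nothing in the function cares whether the values are numbers or categories, which is the point of the argument.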

And this brings us to Excel and its mode function. The nice folks from Microsoft have decided that there is no mode when dealing with a non-numerical set. Whether checking for first names or last names of presidents, the answer Excel 2007 would give on such a data set is "#N/A", its error code for no value available.

If you are working with a numerical data set with no repeats, Excel will also tell you "#N/A", and this time that's actually the correct answer.

The big problem Excel has, which is why I've shown this picture of their dark side logo in colors opposite the green and white they usually use, is that it will not tell you when a data set has more than one mode. If a data set has two values of 3 and two values of 77 and no other duplicates, Excel might tell you the mode is 3 or might tell you the mode is 77, depending on how the data is sorted.
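Python's standard `statistics` module makes a handy comparison. This is an analogy only and says nothing about how Excel works inside. In Python 3.8 and later, `mode` returns a single value, the first mode it encounters, so its answer depends on the order of the data, while `multimode` reports every tied value.

```python
from statistics import mode, multimode

# Two 3s, two 77s and no other duplicates, in two different orders.
data = [3, 12, 77, 3, 40, 77]
reordered = [77, 12, 3, 77, 40, 3]

print(mode(data))            # 3
print(mode(reordered))       # 77
print(multimode(data))       # [3, 77]
print(multimode(reordered))  # [77, 3]
```

One difference from the definition used in this post: on a list with no repeats, `multimode` returns every value rather than reporting no mode.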

There are workarounds. You can take a data set, let's say it's in column A in your spreadsheet, and copy it into column B. There's a function under the Data tab called "Remove Duplicates". If we perform that on the copy of the list now in column B, and Excel tells us it found no duplicates, then there is no mode. If there are duplicates, type this formula into the first cell of column C. (We are assuming the data is in column A from row 1 to row 100.)

=countif(a$1:a$100, b1)

Now in cell C1, the number will tell you how many times the value in B1 shows up on the unadulterated list in column A. Click and drag that function down the C column, and that information can be found for all the distinct values, regardless of whether the original data was numerical or categorical. You can then select columns B and C together and sort by biggest number in column C. This will tell you the mode, and it will also tell you if there is more than one mode, if the number in C1 is the same as the number in C2. In the case of the presidential last names, columns B and C would start as follows, assuming the original list of presidents was sorted by chronological order of their administrations:

Adams_____ 2
Harrison__ 2
Johnson___ 2
Roosevelt_ 2
Bush______ 2
Washington 1
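For anyone without Excel handy, the same remove-duplicates-and-count idea can be sketched in Python. The list of last names was typed in for this example (presidents through 2009, each person counted once, in chronological order) and is worth double-checking before relying on it.

```python
from collections import Counter

# Last names in chronological order, one entry per person, through 2009.
last_names = [
    "Washington", "Adams", "Jefferson", "Madison", "Monroe", "Adams",
    "Jackson", "Van Buren", "Harrison", "Tyler", "Polk", "Taylor",
    "Fillmore", "Pierce", "Buchanan", "Lincoln", "Johnson", "Grant",
    "Hayes", "Garfield", "Arthur", "Cleveland", "Harrison", "McKinley",
    "Roosevelt", "Taft", "Wilson", "Harding", "Coolidge", "Hoover",
    "Roosevelt", "Truman", "Eisenhower", "Kennedy", "Johnson", "Nixon",
    "Ford", "Carter", "Reagan", "Bush", "Clinton", "Bush", "Obama",
]

# Counter plays the role of column B (distinct values) plus column C (COUNTIF).
counts = Counter(last_names)

# sorted() is stable, so names with equal counts keep first-appearance order.
table = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

for name, n in table[:6]:
    print(f"{name:<10} {n}")
```

If the top two rows of `table` show the same count, the data set has more than one mode, exactly as with C1 and C2 in the spreadsheet.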

It's always nice when there is a workaround for your problems, but it's better when there's agreement on definitions that avoids the problems in the first place. The more I study statistics, the happier I am my degree is in mathematics.


Anonymous said...

Interesting comments! Can you help me out? What would the median be if there were 1000 freshmen and 1000 others (500 sophomores, 300 juniors, 200 seniors). This is puzzling me.

Matty Boy said...

Hi, anonymous. In that case, you would say the median lies between freshmen and sophomores. If these were replaced by grade numbers like 9th, 10th, 11th and 12th, you would say the median lies between 9th and 10th, or you could say it was 9.5.

Anonymous said...

Thanks. But what if the groups were maybe small, medium and large. Like maybe 100 small plastic bags, 60 medium plastic bags and 40 large plastic bags. What would it mean to say that the median is between a small bag and a medium bag? Would the median have an actual value then? I don't know if the median has to have an actual value. Or would the median be a sort of fuzzy or vague kind of thing.

Also, what you said about the mode is really interesting. Have you ever seen any books say that the mode can be talked about for a set of categories?

Matty Boy said...

If median is taken with numerical data, sometimes it is the average of two values, and therefore a value that doesn't actually occur in the data set. The same can happen with categorical data.

As for mode and categorical data, I pulled the first three stats texts off the shelf, and all three agree mode is okay with categorical. The three books are Triola, DeVeaux/Velleman/Bock and Larson/Farber. D/V/B says mode is better with categorical data than it is with numerical, as with numerical we often put the data into demographic categories before finding mode.

There are stats texts that disagree. This is one of the things that make statistics more annoying than math to me, because in math, definitions are much more solid. When two definitions disagree in math, usually one is obsolete.

Anonymous said...

Can u recommend one of those books for me to learn some statistics & maybe why you like it?

Wow -- statistics just isn't so clear. Doesn't it make it frustrating for you to teach it and students to learn it?

Here's what I still don't get but I bet you can explain it. Like -- what's between small and medium bags? Or, what about countries in this hemisphere: 3 are North (US, Canada, Mexico), and say 7 are Central and 10 are South. I don't know how many there really are. Then what's the median of 3 north, 7 central, and 10 south. I know the mode is south, but the median? There's no country between Panama and Colombia or whatever is the 1st one in south america.

Matty Boy said...

I use Triola, but you can find open source books that are much cheaper. If you are going to pay for a text, Johnson/Kuby may be the best value.

There are special cases in stats, but students often find it a little easier than other math courses because it has so many applications.

As to median: In ordered categorical data, you would be saying that half the bags are small and the other half are medium or large. In your second example, you would be saying there are as many countries in North and Central America combined as there are in South America. (Not actually true.)

Anonymous said...

Thanks again. Knew you'd be helpful. Do u know how to find an open source textbook? Cheaper is good. Are there any u like to use or recommend?

Median is still a puzzle, because for the countries in the 2nd example, I think the median doesn't have a value, or does it? Like, median = ? I'm thinking the same thing actually also for the bags in the 1st example, that median has no value. Or am I not seeing the value, & this is something that's easy that I don't see. Sorry not to be clear yet on this.

Anonymous said...

After thinking about it more I don't think you're right when you said that "the median can be done with any data that can be put in order." Because with ordered categorical data "north north central central south" has a median = central but "north north central south" doesn't have a median. I think the definition for median should give an answer for every set of ordered categorical data. Do you know of any book or source besides yourself that says that median can be applied to ordered categorical data? Daniel T.

Matty Boy said...

Check out Triola. Median can be used with ordinal data.

The answer "between category j and category k" is legitimate.