## Wednesday, February 20, 2008

### Wednesday Math, Vol. 16: John Tukey

Last week when I was thinking about publishing my post on obsolete math, I considered putting in the ideas of John Tukey, a mathematician and statistician who did a lot of his best work in the 1960s. I decided not to, because I didn't want to show disrespect to a guy who was a lot better at this stuff than I am. So let me list some of the ideas he is still remembered for.

A not-at-all-obsolete idea of John Tukey: Tukey shares credit with James Cooley for the Fast Fourier Transform, or FFT for short, an algorithm that quickly breaks a signal down into its component frequencies. This is huge; a list of the top ten algorithms of the 20th Century included this, as well it should. Quite simply, without data compression, the Internet wouldn't work, and Fourier-style transforms are a central concept in data compression. Good on ya, Tukey!
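For readers curious what the FFT actually computes, here is a minimal Python sketch of the divide-and-conquer idea: split the list into its even and odd positions, transform each half, then stitch the halves together. The names `fft` and `slow_dft` are my own, the input length is assumed to be a power of two, and this is a teaching sketch rather than anything you would use in production.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT; len(x) is assumed to be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # transform of the even-indexed samples
    odd = fft(x[1::2])    # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def slow_dft(x):
    """Direct definition of the discrete Fourier transform, for comparison."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

signal = [1, 2, 3, 4, 0, 0, 0, 0]
fast = fft(signal)
slow = slow_dft(signal)
print(all(abs(a - b) < 1e-9 for a, b in zip(fast, slow)))
```

Both functions give the same answer; the difference is that the direct version does work proportional to n squared, while the split-in-half version does work proportional to n times log n.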

A knack for names: Not every good mathematician is good at naming things, but two words Tukey coined are now in everyday usage.

He came up with the word "software" to signify the instructions that a computer uses, as distinct from the actual wires and circuits, which were already called hardware.

There is also the idea of the binary digit. In base 10 math, there are ten symbols for numbers: 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. In base 2 math, or binary, there are two symbols: 0 and 1. Back in the beginning of computer science, someone wanted to shorten "binary digit" to binit, pronounced either bin-it or bine-it. Tukey thought bit sounded better, and there wouldn't be any confusion about pronunciation. He was right, of course.
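As a small illustration of binary digits, here is a Python sketch that writes one ordinary base 10 number (36, picked arbitrarily) using only the symbols 0 and 1, then converts it back.

```python
wins = 36

# Write the number in base 2; each character of the result is one bit.
bits = format(wins, "b")
print(bits)          # 100100
print(len(bits))     # 6 bits are needed

# Reading the bits back as a base 2 number recovers the original value.
print(int(bits, 2))  # 36
```

Reading `100100` from the left, the places are worth 32, 16, 8, 4, 2 and 1, so the two 1s contribute 32 + 4 = 36.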

Two ideas of Tukey's from data representation: I present two of Tukey's inventions that still get taught a lot, but are not used that much. Both deal with ways to represent lists of numbers. Here are two lists of numbers, the numbers of wins by teams in the NBA as of a week ago Monday, split up into the East and the West, fifteen teams each.

West: 36, 34, 34, 33, 33, 32, 31, 30, 30, 28, 23, 16, 13, 13, 10
East: 39, 37, 32, 28, 27, 24, 23, 21, 21, 21, 20, 19, 18, 15, 9

Idea #1: Stem and leaf plot: The idea of the stem and leaf plot is to write the same data using less writing, and also to give some idea of where the data falls. What we do is clump all the numbers from 39 to 30 together, all the numbers from 29 to 20 together, etc. The tens digit they share in common is the stem, and it is written first, followed by a separator, for which I use the | symbol. We then write a list of the ones digits, here listed from low to high. Here is the data from above re-written in stem and leaf form.

West:
3|001233446
2|38
1|0336
East:
3|279
2|01113478
1|589
0|9

Notice among the leaves in the first stem in the West data, there are two 0s, two 3s and two 4s. This is because 30, 33 and 34 show up on the list twice each.

The other thing we see from this listing is that a large clump of the data in the West list is in the 30s, while the largest clump of data in the East list is in the 20s. What this shows is that the West has more successful teams than the East does, though the East has the top team, with 39 wins as of a week ago Monday.
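The grouping described above can be sketched in a few lines of Python. The function name `stem_and_leaf` is my own, and the sketch assumes non-negative whole numbers; it also skips any stem that has no leaves, which doesn't come up with this data.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-and-leaf rows, highest stem first, leaves low to high."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)   # tens digit is the stem, ones digit the leaf
    return [f"{stem}|{''.join(str(d) for d in leaves[stem])}"
            for stem in sorted(leaves, reverse=True)]

west = [36, 34, 34, 33, 33, 32, 31, 30, 30, 28, 23, 16, 13, 13, 10]
east = [39, 37, 32, 28, 27, 24, 23, 21, 21, 21, 20, 19, 18, 15, 9]

for name, wins in (("West", west), ("East", east)):
    print(f"{name}:")
    print("\n".join(stem_and_leaf(wins)))
```

Run on the two lists of wins, it prints the same rows as the hand-written plot.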

Idea #2: The Five Number Summary and the box and whiskers plot: The Five Number Summary of the data was a labor-saving way to show how the data was spread out. The numbers are High, Q3, Q2, Q1 and Low. Once the data is put in order, the median (the middle value) is Q2. This splits the list into the high half and the low half. With fifteen values, the median is the eighth, leaving seven values on each side. The median of the high half is Q3 and the median of the low half is Q1. The most used method of finding out about data spread in statistics is standard deviation, and back in the 1960s, before calculators and spreadsheets were readily available, standard deviation was incredibly labor intensive, even for lists of numbers as small as the ones here. Here are the five number summaries, West first and East second.

Hi: 36 39
Q3: 33 28
Q2: 30 21
Q1: 16 19
Lo: 10 9
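Here is a minimal Python sketch of the recipe above: sort, take the median, then take the median of each half, leaving the overall median out of both halves when the list has an odd length. The function name `five_number_summary` is my own. Calculators, spreadsheets and statistics libraries don't all use the same quartile rule, so their Q1 and Q3 may differ slightly from these.

```python
from statistics import median

def five_number_summary(values):
    """Return (High, Q3, Q2, Q1, Low) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2                 # size of each half; the middle value is excluded when n is odd
    low_half = data[:half]
    high_half = data[n - half:]
    return (data[-1], median(high_half), median(data), median(low_half), data[0])

west = [36, 34, 34, 33, 33, 32, 31, 30, 30, 28, 23, 16, 13, 13, 10]
east = [39, 37, 32, 28, 27, 24, 23, 21, 21, 21, 20, 19, 18, 15, 9]

print(five_number_summary(west))  # (36, 33, 30, 16, 10)
print(five_number_summary(east))  # (39, 28, 21, 19, 9)
```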

Tukey took these five numbers and turned them into a box and whiskers plot, a visual representation of the five numbers. The outside edges of the "box" are where the Q1 and Q3 numbers fall, and the dotted line is Q2. The whiskers extend out to the high and low numbers. The longer whiskers in the East show that the East contains both the best team around and the worst team around. The rightmost side of the West box and the dotted line of the West box are both to the right of the East box, which shows that the typical team in the West is doing much better than the typical team in the East.

Both stem-and-leaf and box-and-whiskers are still taught in math classes, and a Texas Instruments TI-83 calculator or better will give you the five number summary. Many spreadsheets have box-and-whisker options for data representation, but if you look up these topics on Google, you will find a lot more websites that teach these ideas than ones that actually use them to represent data.

One place where an idea that branched off from box-and-whiskers shows up is weekly financial charts. The scale is now up and down instead of left and right, but a week is represented by a dot and whiskers. The whiskers represent the highest and lowest values of the index that week, and the dot is the closing value. If you just connect the dots, you can follow the closing values, but the whiskers give an idea of how volatile the prices were in a particular week. In this chart, for example, we can see the prices staying the same for most of the early part of 2006, steadily rising in the second half of 2006, making a big jump early in 2007 with volatility growing slightly, then much greater volatility in late 2007, along with the market peaking and then losing value.

You can't blame Tukey for what's happening to the market, him being dead and all. But he is the originator of this better way to represent the data. Good on ya, Tukey!

Yay, Flags of many Lands! Yay, Cambodia!
The total number of visiting countries is now 121, which is 11x11, for you fans of perfect squares.
