In their breakthrough bestselling book Freakonomics: A Rogue Economist Explores the Hidden Side of Everything, Steven Levitt and Stephen Dubner provided many examples of how data can tell a story — and this was a few years before data got sumo-sized into Big Data.
One of their data-driven stories was about cheating in sports. More specifically, cheating to lose, which in the United States, where baseball is big, might conjure memories of the 1919 Chicago White Sox, the team that conspired with gamblers to throw the World Series, earning themselves the ignominious moniker of the Black Sox.
Levitt and Dubner looked at the premier sport of another great nation — sumo wrestling in Japan. To help put this into context for non-Japanese readers, they explained how “the incentive scheme that rules sumo is intricate and extraordinarily powerful. Each wrestler maintains a ranking that affects every slice of his life: how much money he makes, how large an entourage he carries, how much he gets to eat, sleep, and otherwise take advantage of his success. The sixty-six highest-ranked wrestlers in Japan make up the sumo elite. A wrestler near the top of this elite pyramid may earn millions and is treated like royalty. Any wrestler in the top forty earns at least $170,000 a year. The seventieth-ranked wrestler in Japan, meanwhile, earns only $15,000 a year. Life isn’t very sweet outside the elite. Low-ranked wrestlers must tend to their superiors, preparing their meals, cleaning their quarters, and even soaping up their hardest-to-reach body parts. So ranking is everything.”
In a sumo tournament, each wrestler has 15 bouts, one per day over 15 consecutive days. A winning record (8 or more victories) raises a wrestler's ranking, whereas a losing record lowers it. So a wrestler with a 7-7 record on the final day of a tournament has more to gain from a victory than an opponent with an 8-6 record.
Using a data set containing nearly every official match among top-ranked Japanese sumo wrestlers between January 1989 and January 2000, a total of 32,000 bouts fought by 281 different wrestlers, Levitt determined that 80 percent of the final-day matches between 7-7 wrestlers and 8-6 wrestlers were won by the 7-7 wrestler. And this was despite the fact that pre-match odds, based on previous matches against the same opponent, gave the 7-7 wrestler a less than 50 percent chance of winning. Also, in the very next match between the same two wrestlers (the first in the next tournament), the 7-7 wrestlers won only 40 percent of the time.
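For readers who want to see what that kind of tally looks like in practice, here is a minimal Python sketch. It is not Levitt's actual method or data: the `Bout` record, its field names, and the five toy bouts are all hypothetical, invented only to show how one might count final-day 7-7 versus 8-6 matchups and the share won by the 7-7 wrestler.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Bout:
    """One bout. Hypothetical layout, not the format of Levitt's data set."""
    day: int                   # tournament day, 1 through 15
    record_a: Tuple[int, int]  # (wins, losses) of wrestler A before the bout
    record_b: Tuple[int, int]  # (wins, losses) of wrestler B before the bout
    a_won: bool                # True if wrestler A won


def bubble_win_rate(bouts: List[Bout]) -> Optional[float]:
    """Share of final-day 7-7 vs 8-6 bouts won by the 7-7 wrestler.

    Returns None when the data contains no such bouts.
    """
    wins = 0
    total = 0
    for b in bouts:
        if b.day != 15:
            continue  # only the final day matters here
        if b.record_a == (7, 7) and b.record_b == (8, 6):
            total += 1
            wins += int(b.a_won)
        elif b.record_a == (8, 6) and b.record_b == (7, 7):
            total += 1
            wins += int(not b.a_won)
    return wins / total if total else None


# Made-up toy data for illustration only; these are not real results.
toy_bouts = [
    Bout(15, (7, 7), (8, 6), True),
    Bout(15, (8, 6), (7, 7), False),
    Bout(15, (7, 7), (8, 6), True),
    Bout(15, (7, 7), (8, 6), False),
    Bout(15, (8, 6), (7, 7), False),
    Bout(15, (9, 5), (7, 7), True),   # ignored: opponent is not 8-6
    Bout(14, (7, 6), (8, 5), True),   # ignored: not the final day
]

rate = bubble_win_rate(toy_bouts)
print(f"7-7 wrestlers won {rate:.0%} of the matching toy bouts")
```

The point of the sketch is how little machinery is involved: a filter and a ratio. The hard part, as the story shows, is knowing which slice of the data to filter on.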
Despite the story this data apparently showed, “no formal disciplinary action has ever been taken against a Japanese sumo wrestler for match rigging,” Levitt and Dubner explained. “Officials from the Japanese Sumo Association typically dismiss any charges as fabrications by disgruntled former wrestlers. In fact, the mere utterance of the words sumo and rigged in the same sentence can cause a national furor. People tend to get defensive when the integrity of their national sport is impugned.”
As I have previously blogged, people also tend to get defensive when business strategy is impugned by data science. But the sumo story is also telling in another way for those wrestling with the sheer size of big data.
“This information was always apparent. It existed in plain sight,” Kenneth Cukier and Viktor Mayer-Schonberger explained in their book Big Data: A Revolution That Will Transform How We Live, Work, and Think. “But random sampling of the bouts might have failed to reveal it. Even though it relied on basic statistics, without knowing what to look for, one would have no idea what sample to use. In contrast, Levitt and his colleague uncovered it by using a far larger set of data — striving to examine the entire universe of matches.”
“An investigation using big data,” Cukier and Mayer-Schonberger concluded, “is almost like a fishing expedition: it is unclear at the outset not only whether one will catch anything but what one may catch. The dataset need not span terabytes. In the sumo case, the entire dataset contained fewer bits than a typical digital photo these days. But as big data analysis, it looked at more than a typical random sample. When we talk about big data, we mean big less in absolute than in relative terms: relative to the comprehensive set of data.”
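A little arithmetic shows why a random sample could miss a pattern like this one. The sketch below takes the 32,000 bouts mentioned earlier as the population, but the number of relevant final-day 7-7 versus 8-6 bouts (200) and the sample size (500) are purely hypothetical figures chosen for illustration, since the actual count is not given here. The functions use the standard hypergeometric formulas for sampling without replacement.

```python
from math import comb


def expected_hits(population: int, relevant: int, sample_size: int) -> float:
    """Expected number of relevant bouts in a simple random sample."""
    return sample_size * relevant / population


def prob_no_hits(population: int, relevant: int, sample_size: int) -> float:
    """Probability that a random sample contains none of the relevant bouts
    (hypergeometric, sampling without replacement)."""
    return comb(population - relevant, sample_size) / comb(population, sample_size)


TOTAL_BOUTS = 32_000   # size of the data set described above
RELEVANT_BOUTS = 200   # hypothetical: number of final-day 7-7 vs 8-6 bouts
SAMPLE_SIZE = 500      # hypothetical: size of a random sample

print("Expected relevant bouts in the sample:",
      expected_hits(TOTAL_BOUTS, RELEVANT_BOUTS, SAMPLE_SIZE))
print("Chance the sample contains none at all:",
      round(prob_no_hits(TOTAL_BOUTS, RELEVANT_BOUTS, SAMPLE_SIZE), 3))
```

Under these assumed numbers, the sample would be expected to hold only a handful of the bouts that matter, which is far too few to separate an 80 percent win rate from a coin flip with any confidence. Looking at every bout sidesteps the problem.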
With the technological advancements powering big data analytics, sampling is senseless. However, that doesn’t mean sumo-sized data analytics always deals with big data sets. A small data set, as long as it is comprehensive, can deliver an analytical bark much bigger than its bytes.