The significance of variability in data-based estimates

The average is not the ‘only’ message. If individuals from all walks of life understand that, the data-based world will become more transparent to people.
Vital factor: Predicting the pandemic required knowledge of disease transmission variability. PTI

I once served as one of the three judges in a debate competition. My ranking prevailed even after the final rankings of the candidates were prepared by aggregating the scores given by all the judges. The other two judges were taken aback. As a statistician, however, I was aware of the cause. The other two judges' scores varied very little; they were typically in the range of, say, 60 to 65. My scores were far more variable, ranging from 5 to 95, and so they dominated the totals.

Ideally, each judge's scores should be divided by an appropriate measure of their variability before they are added up. Unfortunately, the rules had no such clause. The same is true of many other real-world scenarios.
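A minimal sketch of the judging problem described above, in Python. The scores are hypothetical, made up for illustration and not taken from the actual contest, and the standard deviation is only one possible choice of "appropriate measure of variability".

```python
from statistics import pstdev

# Hypothetical scores for four candidates (columns) from three judges (rows).
scores = {
    "wide_judge":     [95, 5, 50, 20],   # spreads scores from 5 to 95
    "narrow_judge_1": [60, 65, 63, 64],  # stays within 60 to 65
    "narrow_judge_2": [61, 65, 62, 64],  # stays within 60 to 65
}

def ranking(totals):
    """Return candidate indices ordered from highest to lowest total."""
    return sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)

def scale_by_spread(judge_scores):
    """Divide one judge's scores by that judge's own standard deviation."""
    spread = pstdev(judge_scores)
    return [s / spread for s in judge_scores]

# Plain aggregation: add the raw scores candidate by candidate.
raw_totals = [sum(col) for col in zip(*scores.values())]

# Variability-adjusted aggregation: scale each judge first, then add.
scaled = [scale_by_spread(v) for v in scores.values()]
scaled_totals = [sum(col) for col in zip(*scaled)]

print("wide judge alone:", ranking(scores["wide_judge"]))
print("raw totals:      ", ranking(raw_totals))
print("scaled totals:   ", ranking(scaled_totals))
```

With these made-up numbers, the raw totals simply reproduce the wide-ranging judge's order, while the scaled totals follow the order on which the two narrow-ranging judges agree.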

Statistics recognises several measures of central tendency, or "average", such as the mean, which is the total divided by the number of observations, and the median, which is the middle-most value. We often state that country X has a larger per capita income than, say, country Y.

However, is the average income a reliable measure of the state of the economy if social inequality is ignored?

In his 2013 book Capital in the Twenty-First Century, Thomas Piketty argued that the Gini coefficient, a single number between 0 and 1, is insufficient to explain economic inequalities and their evolution. He used quantiles instead, such as the share of total wealth held by the poorest half of the population, the top 10%, the top 5%, the top 1%, or even the top 0.1% or 0.01%. Depending on the type of inequality and the time period being studied, four or five of these quantiles are considered adequate to understand the key components of economic inequality and its development.
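To see why a single number can hide what quantile shares reveal, here is a small sketch in Python. The two societies are hypothetical and chosen only so that their Gini coefficients coincide; the function names `gini` and `share` are introduced here for illustration.

```python
def gini(wealth):
    """Gini coefficient: 0 means everyone holds the same amount."""
    xs = sorted(wealth)
    n = len(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * sum(xs)) - (n + 1) / n

def share(wealth, lo, hi):
    """Fraction of total wealth held between quantiles lo and hi (poorest first)."""
    xs = sorted(wealth)
    n = len(xs)
    return sum(xs[round(lo * n):round(hi * n)]) / sum(xs)

# Two hypothetical societies of 100 people each.
society_a = [0] * 50 + [1] * 50   # half own nothing, half own equal amounts
society_b = [1] * 75 + [9] * 25   # everyone owns something, a quarter own a lot

for name, society in (("A", society_a), ("B", society_b)):
    print(
        name,
        "Gini:", round(gini(society), 3),
        "bottom 50% share:", round(share(society, 0.0, 0.5), 3),
        "top 10% share:", round(share(society, 0.9, 1.0), 3),
    )
```

Both societies have the same Gini coefficient, yet the poorest half owns nothing in one and about a sixth of the wealth in the other, and the top 10% hold different shares as well.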

What we need, therefore, is a sense of how wealth is distributed, which is, in essence, variability. Unfortunately, most data-based estimates of political and socioeconomic indicators are published without adequately addressing the variability of those estimates.

Three researchers from the University of Southern Denmark have illustrated four classes of circumstances in which the conclusion reached based solely on the mean is qualitatively changed when variability is also taken into account. Their paper, "Variability Matters", was published in the International Journal of Environmental Research and Public Health in 2020.

Variability is a significant issue when it comes to economic disparities, health and longevity disparities among social groups and population selection potential, which is based on the fitness distribution's tails. This is becoming increasingly clear to experts working on data. It has also been suggested that predicting and comprehending the Covid-19 pandemic required an understanding of the disease transmission variability.

Havelock Ellis, an English physician and author, first put forward the theory of greater male variability in 1894 to explain the overabundance of males among the eminent and among the mentally ill. This theory has been used in attempts to explain, for instance, why Harvard has so few female mathematics professors.

The statement, "A statistician confidently waded through a river that was on average 50 cm deep. He drowned," is commonly credited to Dutch author and television personality Godfried Bomans. To be fair, though, a statistician would never do that because she understands the importance of variability and knows that the river's depth might exceed her height in midstream.

The significance of understanding variability in medical prognosis is demonstrated by the extraordinary story of renowned evolutionary biologist Stephen Jay Gould. Gould was diagnosed with abdominal mesothelioma, an incurable cancer, in 1982. From the medical literature, he learnt that the median survival duration was eight months. "I will probably be dead in eight months," he thought initially.

However, means and medians are abstractions of reality, as Gould was aware. Because he was an optimist and understood nature, statistics and the meaning of variation in life's processes, he took a very different approach to the mesothelioma statistics. After all, if the median is eight months, half of the patients survive longer than that. Given his youth, the relatively early detection of the disease, the first-rate medical care he received and a strong will to live, he did not give up. "The distribution of variation had to be right-skewed… the upper (or right) half can extend out for years and years… my favourable profile made me a good candidate for that part of the curve," he thought. Gould's essay "The Median Isn't the Message", first published in 1985, was included in his 1991 book Bully for Brontosaurus: Reflections on Natural History and has been reprinted numerous times for a wide range of readers. Twenty years (not 20 months!) after being diagnosed, Gould passed away due to a different cancer.
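Gould's reasoning about a right-skewed distribution can be sketched numerically. The Python snippet below assumes, purely for illustration, that survival times follow a lognormal distribution whose median is the eight months quoted above; the spread `SIGMA` is an invented value, not a figure from the medical literature, and real survival curves need not have this shape.

```python
from math import exp, log
from statistics import NormalDist

MEDIAN_MONTHS = 8.0  # median survival mentioned in the article
SIGMA = 1.0          # assumed spread of log-survival time (illustrative only)

# If survival time is lognormal, its logarithm is normally distributed,
# and the median of the survival time is exp(mu).
log_survival = NormalDist(mu=log(MEDIAN_MONTHS), sigma=SIGMA)

def prob_surviving_beyond(months):
    """Probability that survival exceeds the given number of months."""
    return 1.0 - log_survival.cdf(log(months))

# Mean of a lognormal distribution: exp(mu + sigma^2 / 2).
mean_months = exp(log_survival.mean + SIGMA ** 2 / 2)

print("P(survive > 8 months): ", round(prob_surviving_beyond(8), 3))
print("P(survive > 5 years):  ", round(prob_surviving_beyond(60), 3))
print("mean survival (months):", round(mean_months, 1))
```

Under these assumptions, exactly half of the patients outlive the median, the mean sits well above it, and a small but non-zero fraction lives for years, which is the long right tail Gould had in mind.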

Perhaps significant in this context is a quotation from William Winwood Reade that is cited in Arthur Conan Doyle's novel The Sign of the Four. "While the individual man is an insoluble puzzle, in the aggregate he becomes a mathematical certainty," Sherlock Holmes said to Watson. "You can, for example, never foretell what any one man will do, but you can say with precision what an average number will be up to," said Holmes.

Finally, consider a snapshot of election prediction. In the recent US presidential election, a last-minute survey indicated that Donald Trump would receive 47.2% of the national vote and Kamala Harris 48.4%. In the end, Trump and Harris received 49.9% and 48.4% of the vote, respectively. Was the poll prediction, at least for the vote shares, statistically far-fetched? Not really. Opinion poll predictions usually carry a margin of error of plus or minus 3 percentage points if they are designed with proper care. This means that if the same process is repeated many times, the true population value will, 95% of the time, fall within the sample estimate plus or minus 3 points. The margin is derived from the variability of the estimates, and sample sizes in opinion polls are ideally set accordingly. By this reckoning, the poll allowed Trump's vote share to reach as high as 50.2%. At least, the actual figure didn't go beyond that limit!
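The margin-of-error arithmetic can be sketched as follows. This Python snippet assumes a simple random sample and the usual normal approximation for a proportion; the poll's actual design and sample size are not given in the article, so the sample size computed here is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

# Critical value for a two-sided 95% interval (about 1.96).
Z_95 = NormalDist().inv_cdf(0.975)

def margin_of_error(p, n, z=Z_95):
    """Approximate margin of error for a proportion p from a sample of size n."""
    return z * sqrt(p * (1 - p) / n)

def sample_size_for(margin, p=0.5, z=Z_95):
    """Smallest sample size whose margin of error does not exceed `margin`."""
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)

def interval(estimate_pct, margin_pct=3.0):
    """Lower and upper ends of the estimate plus or minus the margin, in percent."""
    return estimate_pct - margin_pct, estimate_pct + margin_pct

n_needed = sample_size_for(0.03)          # worst case p = 0.5
trump_low, trump_high = interval(47.2)    # poll estimate from the article
harris_low, harris_high = interval(48.4)  # poll estimate from the article

# Actual vote shares as quoted in the article.
trump_inside = trump_low <= 49.9 <= trump_high
harris_inside = harris_low <= 48.4 <= harris_high

print("respondents needed for a 3-point margin:", n_needed)
print("Trump interval: ", (round(trump_low, 1), round(trump_high, 1)), "covers result:", trump_inside)
print("Harris interval:", (round(harris_low, 1), round(harris_high, 1)), "covers result:", harris_inside)
```

Under these assumptions, a 3-point margin needs a little over a thousand respondents, and both reported vote shares fall inside the poll's intervals.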

All said, the average is not the "only" message. If individuals from all walks of life understand that, the data-based world will become more transparent to people.
