The infodemic: Why most statistical analyses of Covid-19 are unreliable

Most of them are based on incomplete or misleading data.

By Ronak Borana

Six months into the coronavirus pandemic, the internet is rife with Medium articles and Twitter threads that use graphs and statistical trends to predict the course of the crisis. Most of these analyses are based on incomplete or misleading data. So, basing policy decisions on such patchy assessments, as India appears to be doing, has grave consequences in a pandemic that has already killed over four lakh people.

Underreporting of cases

Most of these predictions are based on numbers reported by the Indian government. But the official data often leaves out asymptomatic and mild cases of coronavirus infection. A study by the Centre for Mathematical Modelling of Infectious Diseases, or CMMID, at the London School of Hygiene and Tropical Medicine suggests that India might be reporting only about 35 percent of its total coronavirus infections. There’s no data on the remaining phantom cases. Then, there are the differences in testing rates. While states such as Telangana have dangerously low levels of testing, others are doing a better job. Using official data on infections without factoring in these wide discrepancies adds sampling bias to the model and skews the results.

CMMID’s estimation of the percentage of cases reported by countries with the highest Covid-19 burden.


The CMMID’s researchers used the delay-adjusted case fatality ratio, or CFR, to estimate the undercounting of infections. CFR, now a household figure, is the ratio of deaths to confirmed infections. Their model assumes a baseline CFR of 1.4 percent and uses it to calculate the likely underreporting of actual cases. The MRC Centre for Global Infectious Disease Analysis at Imperial College London assumes a fatality rate of one percent to back-calculate the actual number of infections.
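The back-calculation described above can be sketched in a few lines of Python. This is a simplified illustration, not the CMMID’s actual code: it skips the delay adjustment and assumes the whole gap between the observed and the baseline CFR comes from missed cases. The function name and the death and case counts are hypothetical.

```python
def estimated_reporting_fraction(deaths, confirmed_cases, baseline_cfr):
    """Estimate the share of infections that were officially reported.

    Simplifying assumption: if the true fatality ratio equals baseline_cfr,
    the implied number of infections is deaths / baseline_cfr.
    """
    if deaths <= 0 or confirmed_cases <= 0 or not 0 < baseline_cfr <= 1:
        raise ValueError("counts must be positive and baseline_cfr in (0, 1]")
    implied_infections = deaths / baseline_cfr
    # A country cannot report more than 100 percent of its infections.
    return min(1.0, confirmed_cases / implied_infections)


# Hypothetical counts: 4,000 deaths among 1,00,000 confirmed cases,
# i.e. an observed CFR of 4 percent.
deaths = 4_000
confirmed = 100_000

# The two baseline fatality rates mentioned in the text.
fractions = {
    baseline: estimated_reporting_fraction(deaths, confirmed, baseline)
    for baseline in (0.014, 0.01)
}

for baseline, fraction in fractions.items():
    print(f"baseline CFR {baseline:.1%}: about {fraction:.0%} of infections reported")
```

With these made-up counts, a lower assumed baseline yields a smaller reported share, which shows how sensitive the estimate is to the fatality rate it starts from.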

Given that estimating the fatality rate in a rapidly evolving outbreak is a perilous task, the CMMID’s model has its own limitations. The most reliable way to estimate the actual number of cases is to check for the presence of antibodies against the novel coronavirus that linger in the body long after the patient recovers. One study of 70,000 Spaniards suggests that Spain was not recording 90 percent of the actual cases, concurring with the CMMID’s assessment.

The latest crop of research suggests that the fatality rate of Covid-19 is 0.5-1 percent. This means that underreporting is more widespread than previously estimated. Most analyses based on official numbers, therefore, rely on incomplete data that doesn’t capture the true extent of the prevalence of the disease.

No data in real time

Many inferences are drawn from the daily numbers of tests, cases and deaths reported by the government. This incomplete data, plotted on per-day axes, isn’t real-time; a person testing positive today was likely infected several days ago. There is a lag in reporting infections because of differences in incubation period, testing criteria, result turnaround time and public disclosure. All these steps are capriciously staggered.

In Maharashtra, a mounting backlog means the turnaround time for Covid-19 tests in a government lab is between two and seven days. In a private lab, the result of the same test would come back in under a day. So, despite collecting samples on the same day, they would report results on different days. Such massive delays are consistent across states. Similarly, these trends are sensitive to several iffy variables. For instance, a change in the testing criteria led to a nine-fold increase in the number of cases in China on February 13.
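The effect of uneven turnaround times on a daily chart can be illustrated with a toy simulation. The turnaround ranges come from the paragraph above; everything else is an assumption made for illustration, including the sample counts, the idea that government lab delays are spread uniformly over two to seven days, and the choice to treat "under a day" as a same-day result.

```python
import random
from collections import Counter


def reported_day_counts(n_private, n_government, seed=0):
    """Count on which day results are reported for samples all collected on day 0."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    counts = Counter()
    # Private lab: result in under a day, assumed here to be reported on day 0.
    counts[0] += n_private
    # Government lab: turnaround of two to seven days (assumed uniform).
    for _ in range(n_government):
        counts[rng.randint(2, 7)] += 1
    return dict(sorted(counts.items()))


# Hypothetical split: 200 samples sent to private labs, 800 to government labs.
counts = reported_day_counts(n_private=200, n_government=800)
for day, n in counts.items():
    print(f"day {day}: {n} results reported")
```

Every sample in this sketch was collected on the same day, yet the results land on up to seven different reporting days, so the daily curve reflects lab logistics as much as the spread of the infection.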

Such bottlenecks and interventions result in sudden surges and ebbs in the number of cases that have little statistical meaning. It’s impossible to account for such interventions because city and state officials keep improvising policies in accordance with local needs.

After a coronavirus infection is recorded at the state level, there is a long chain of reporting. The Integrated Disease Surveillance Programme and the Indian Council of Medical Research have different channels of registering cases. The IDSP does it through states whereas the ICMR relies directly on labs. Investigations by the news website Article 14 show that by the end of April, when India had around 31,000 cases, both the databases had a discrepancy of 5,024 cases. Most of these were duplicates, unverified entries and data entry flaws.

Many analysts, including government researchers, often take their raw data from dubious secondary aggregators. This adds further noise to a dataset that was grainy to begin with.

Misleading comparisons

Comparing Covid-19 related data from different countries is misleading. Many analysts try to resolve this by adjusting their values for population. Even then, the number of cases or tests per lakh cannot be compared directly.

For instance, the health ministry recently said India had just 7.1 Covid-19 cases per lakh population as against 431 in the United States, 492 in Spain, and 372 in Italy. Lav Aggarwal, a joint secretary in the ministry tasked with communicating the government’s response to the pandemic, similarly praised India for having the lowest per lakh coronavirus cases and deaths in the world.

Such comparisons are only valid when the intervention has subsided or countries are battling the outbreak at the same time. But most major countries are at different stages of the outbreak. Some are in the midst of their worst phase, a few have lived through it, and others like Iran are anticipating a second wave.

Let’s look at Italy and India. Italy has already experienced its peak and is recording only a few hundred cases per day now. India, on the other hand, is clocking around 10,000 cases per day, and is rapidly moving towards its peak. So, comparing two countries, states or cities with different arcs of the outbreak is misleading, even disingenuous.

Covid-19 cases reported by Italy and India from mid-February to June 2020.


Many fallacies

Most Covid-19 numbers are subject to numerous overlapping policy interventions that are hard to tease apart. One example is the recovery rate. While it is a meaningful figure to estimate the disease burden at a given time, the recovery rate can’t be extrapolated to make sweeping predictions about the course of the pandemic.

Take, for example, the health ministry’s discharge policy. It was changed on May 8 to do away with the requirement of Covid-19 patients testing negative twice before being discharged. Under the modified policy, the majority of patients that exhibit mild symptoms will be discharged 10 days after the onset of symptoms. This will skew the recovery rate, making it look like a state is performing well even when it isn’t.

Such interventions make sub-population analyses pointless. Aggregating them to make nationwide assessments misses the fact that states are experiencing this outbreak differently. Take the test positivity metric, for instance. While the national average is constant, different states swing from one extreme to the other.
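A small worked example shows how a steady national test positivity rate can hide large swings at the state level. The two states and all the numbers below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical weekly data: state -> list of (positives, tests) per week.
weekly_data = {
    "State A": [(200, 10_000), (1_000, 10_000)],
    "State B": [(1_000, 10_000), (200, 10_000)],
}


def positivity(positives, tests):
    """Test positivity: share of tests that came back positive."""
    return positives / tests


n_weeks = 2

# Positivity per state, per week.
state_rates = {
    state: [positivity(p, t) for p, t in weeks]
    for state, weeks in weekly_data.items()
}

# National positivity: pool positives and tests across states for each week.
national_rates = []
for week in range(n_weeks):
    total_positives = sum(weeks[week][0] for weeks in weekly_data.values())
    total_tests = sum(weeks[week][1] for weeks in weekly_data.values())
    national_rates.append(positivity(total_positives, total_tests))

for state, rates in state_rates.items():
    print(state, [f"{r:.0%}" for r in rates])
print("National", [f"{r:.0%}" for r in national_rates])
```

In this sketch the pooled figure is 6 percent in both weeks, while one state climbs from 2 to 10 percent and the other falls from 10 to 2 percent.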

Stats, policy and media

Most of these analyses are unreliable because the data we have is too crude to be insightful. We are writing as we are experiencing this pandemic. Such fallibility is the nature of early assessments. The problem arises when these precarious numbers are cited as justification for political chest-thumping and formulating policies like lockdowns and travel bans.

The Indian government has repeatedly used bad statistics to back its assertions. In mid-April, it used dodgy projections to claim there would have been 8.2 lakh Covid-19 cases without the national lockdown. At a press briefing, Niti Aayog’s VK Paul suggested that the number of cases would go down to zero by May 16, providing only a poorly made graph as proof for his prediction. Recently, two researchers from the health ministry published a paper hinting that the pandemic will end in India by mid-September. That paper violated conflict of interest norms, was published in a predatory journal, had plagiarised text, used shoddy analysis, and committed almost every major sin in scientific publishing.

This misinterpretation of data to produce sensational but hollow headlines is rampant even in the media. The Economist’s claim that smokers are less likely to contract Covid-19 is a textbook example of collider bias. The New York Times article on smart thermometers and physical distancing infers causality without stating the evidence. The Indian media had a field day with the theory that the BCG vaccine has a protective effect against Covid-19, without presenting the convincing evidence against it.

Almost every classic statistics book has an adage, “All models are wrong. Some models are useful.” There are good models and predictions that offer meaningful insights out there. They all acknowledge the limitations of their assessments, and usually explore broader long-term trends instead of daily updates. They rarely make headlines though.

Along with the novel coronavirus, it’s important to curb and banish all the rest of the unreliable statistical analyses that fuel the coronavirus infodemic.

