Can India's Covid data be trusted? A Big Data investigation into what the numbers say (and hide)

Decoding Covid Data


From testing to positivity rates, there are lessons to be learned from states’ responses. We break it down.

By: Chintan Patel & Vivek Kaul


In this piece, we try to make sense of the large volume of Covid-related data that is generated day in and day out.

But before we get into the details, let’s first try and explain what we want to achieve through this piece.

The aim

The peak of the second wave of Covid-19 seems to be behind us. Both the daily number of Covid cases as well as deaths have been coming down over the last few weeks. Having said that, the lack of preparedness across all levels of the government in handling the second wave was stark and much of the loss of life was avoidable.

The collection of accurate data is challenging due to reasons we touch upon at the outset. Nevertheless, a thorough data-driven analysis of both the severity of the pandemic and the government’s response to it is critical in order to identify areas for improvement and be better prepared for any possible future waves.

This piece is one such attempt to try and make sense of the enormous amount of data that is collected related to Covid. We focus on two fundamental metrics: Covid testing and positivity rate, and the link between them.

There are two kinds of Covid tests, both of which are performed on nasal swabs: RT-PCR tests and antigen tests. Of the two, the RT-PCR test is more reliable, but both tests are used to detect Covid infections. The data reported includes results from both tests. The positivity rate is the ratio of the number of people who test positive for the infection to the total number of people tested.

The total number of tests conducted in a state, even in the absence of a significant number of Covid cases, is an important metric. It indicates the state’s ability to detect any uptick in infection and allows a window for early intervention. When the testing rate is high, more people are tested. This increases the chances of early detection if there is a surge in infections, compared to a scenario with less testing. Thus, the testing rate of a state can be viewed as one measure of its vigilance level, or the extent to which it is prepared to quickly catch any increase in the spread of Covid.

While increased testing is good in all scenarios, it becomes even more important in the face of rising Covid positive cases. Experts agree that as positivity increases, testing should be increased too. This is because a very high positivity rate indicates the possibility of only people with severe symptoms being tested. As a result, many others who may be infected, but are showing fewer symptoms or no symptoms at all, are likely not being tested. These undetected infections can cause rapid spread of the virus because people carrying these infections do not isolate themselves, given that they are unaware that they are carrying the virus in the first place.

Thus, the way a state responds when positivity rates start to surge, as was the case during the second wave peak in India, can tell us how responsive it was. Ideally, testing rates should increase significantly within a few days of a surge in positivity rates. This tells us that the government is aware of the prevailing situation and is trying to do something about it.

Using data collected from states between January 1, 2021 and June 10, 2021, we conducted a detailed analysis on testing numbers and positivity rates. We found wide differences between states across these two metrics, both in absolute numbers and the progression over time, which we will discuss in detail.

Interestingly, not all states that were vigilant were responsive. And not all states that responded well to rising cases were that vigilant. Of course, making such broad statements to describe what is a minefield of data is risky business. The analysis also reveals some questionable data from a few states in how much it differs from nationwide averages, pointing to potential manipulation. Flagging potential inaccuracies in case of Covid data – willful or not – can perhaps discourage such practices.

On the topic of data fudging, there has been a spate of recent reports highlighting massive underreporting of Covid deaths in many states. Investigative work by data scientist Rukmini S along with Chinmay Tumbe of IIM Ahmedabad reveals significant fudging of Covid death numbers by states. Taking a cue from them, other media reports have also surfaced showing similar fudging of numbers in more states. We had discussed this in detail in an earlier piece titled “But How Do You Hide the Dead”.

Our current analysis in a way confirms what the media is already highlighting about fatality numbers through other data and anecdotal evidence.

Because in the end, data is all we have got. So, let’s start.

***

The second wave of Covid, which had paralysed the nation for over one and a half months, is receding. The all-India number of new cases on a single day is down from a peak of 4,14,280 on May 6, 2021 to 37,070 on June 28, 2021, a decline of over 91 percent in less than two months, according to Covid19india.org. To give some context of how severe the second wave was, the highest number recorded for daily infections in the first wave was 97,680 on September 17, 2020. The steep drop in infections in recent weeks is encouraging.
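The arithmetic behind these figures is easy to check. The short Python sketch below recomputes the decline from the second-wave peak and compares that peak with the first-wave peak, using only the case counts quoted above. The function and variable names are ours, for illustration.

```python
def percent_change(old: float, new: float) -> float:
    """Percentage change from old to new (a negative value means a decline)."""
    return (new - old) / old * 100.0


# Daily new-case counts quoted in the article (source: Covid19india.org).
second_wave_peak = 414_280  # May 6, 2021
latest_count = 37_070       # June 28, 2021
first_wave_peak = 97_680    # September 17, 2020

# Flip the sign so that a fall in cases reads as a positive "decline".
decline = -percent_change(second_wave_peak, latest_count)
peak_ratio = second_wave_peak / first_wave_peak

print(f"Decline from second-wave peak: {decline:.1f} percent")
print(f"Second-wave peak relative to first-wave peak: {peak_ratio:.1f}x")
```

The decline works out to a little over 91 percent, and the second-wave peak was more than four times the first-wave peak.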

The big question in the mind of many is: can Covid data be trusted? Any scepticism towards the accuracy of Covid-data, and thus the utility of data-driven analysis of the pandemic, is understandable. For a variety of reasons, there are huge gaps in our ability to gather data on Covid infections. Let’s list them out one by one.

First, as we all know by now, not everyone who gets infected shows symptoms. Folks who show symptoms of Covid when infected are called symptomatic patients, whereas those who do not show symptoms despite being infected are called asymptomatic patients. This phenomenon of asymptomatic infections automatically causes a significant number of cases to go unreported. People who don’t feel sick will generally not get tested and hence, won’t be counted as being infected.

It is important to understand this, simply because asymptomatic patients also spread Covid. As Anirban Mahapatra writes in Covid 19: Separating Fact from Fiction: “During this pandemic, it became clear that people who were infected but not sick were spreading the disease silently. A significant proportion of spread of SARS-CoV-2 is by asymptomatic carriers who can spread virus-laden particles as aerosols from anywhere between three to twelve days.”

Second, the surge of sickness brought by the second wave clearly overwhelmed our health infrastructure, including testing capabilities. The system won’t record those it cannot serve. This became more important given that in some states, even getting admitted into a hospital was made difficult by the bureaucratic regulations that were in place.

In Uttar Pradesh, for instance, in order to get admitted into a hospital, a patient required a reference letter from the chief medical officer “who heads the integrated command and control centres set up by the government in all districts”. Due to this rule, patients were turned away from hospitals. And if such a patient died, they wouldn’t be counted in the Covid deaths. Of course, this was over and above whether medical infrastructure was available and the patient had the ability to access it in the first place.

Third, while the Covid testing infrastructure in urban and semi-urban areas is over-stretched, it is either absent or completely inadequate in rural India in many states. Thus, the spread of the virus in the hinterland does not show up in the numbers in a proper way.

As Dr Chandrakant Lahariya, a Delhi-based epidemiologist and public policy and health systems expert, told India Today in June: “In the absence of reliable Covid surveillance and data from rural India, we cannot be sure about the extent and severity of the pandemic. National aggregates may indicate a declining spread in urban settings, but it is possible the virus is still spreading in rural India.”

Fourth, even in places where testing is available, people often avoid getting tested due to the fear of restrictions imposed if they test positive. Prior beliefs, and beliefs shaped by WhatsApp forwards, are at play as well. This, coupled with the fact that there is a small (but not insignificant) chance of the test returning positive even if one is not infected – or what is referred to as a false positive – fuels a reluctance among people to get tested unless absolutely needed.

Finally, one can’t rule out the possibility of data being fudged by authorities to avoid embarrassment and/or public and political backlash. (Again, something we had documented in our earlier piece.)

Thus, data-driven analysis of Covid-19 testing and infections has quite a few limitations. Yet, this piece will do just that. Our rationale is simple. While the numbers do not reflect 100 percent reality, they are a useful proxy. Most of the limitations of data described above don’t change much over time. Thus, the data collected can inform about mitigation measures and provide insight into the severity of the disease, efficacy of the government response, and perhaps even flag instances of fudging.

All that said, let’s dive into some cold, hard numbers to understand what they tell us about how different states and regions have fared in the second wave. Specifically, we examine the data on testing, positive infections, and the dynamics that link them.

Let’s first start with some aggregate level data analysis. While there are many ways of grouping states to create aggregate data, we picked two.

Geographic division: We bundled together the data from the northern and the eastern part of the country, which is the lesser developed part, on one side, and the southern and the western part of the country, which is the more developed part, on the other.
Political division: The states governed by the National Democratic Alliance parties versus the non-NDA governed states. Obviously, almost all big NDA governed states are governed by the Bharatiya Janata Party except Bihar, where the party is in alliance with the Janata Dal (United).

The data, when cleaved in this fashion, is very striking.

(1) Geographic division

Uttar Pradesh, Bihar, Madhya Pradesh, Rajasthan, Delhi, Haryana, Punjab, West Bengal, Assam, Uttarakhand, Jammu and Kashmir, Jharkhand and Chhattisgarh comprise the north and east group.

Maharashtra, Tamil Nadu, Gujarat, Karnataka, Andhra Pradesh, Odisha, Telangana and Kerala comprise the south and west group.

In both the cases, we only considered states with a population of more than one crore.

First, let’s look at the total testing for these two groups from January 1, 2021 to June 10, 2021.

The total testing in the south and the west, adjusted for population, is much higher than in the north and the east. This can also be seen in the daily testing numbers shown in Figure 1 below.

As Figure 1 shows, the curves for both groups track each other pretty well especially in how testing was ramped up in early April in response to the second Covid wave. For the south and the west, the daily testing started to drop after peaking in early May and has plateaued since then.

Now, let’s look at the total positive cases (per one lakh of population) for these two groups.

As Table 2 shows, there is a stark difference in the total number of positive cases adjusted for population between the two groups. The number of positive cases adjusted for population in the south and west is more than double that in the north and east. This is partly a reflection of the fact that the states in the southern and the western part of the country tested more and as a result were able to identify more Covid cases.

Next, let’s see whether the daily positive numbers reflect this stark gap.

Figure 2 shows that the spike in cases started almost at the same time in both the regions. But the daily cases at the peak of the spike in the south and west were more than double those in the north and east. This two-fold difference has been maintained in late May and early June when daily cases declined from their early May peaks.

Over and above the fact that the south and the west carried out more tests, a possible explanation for this lies in the fact that some of the bigger states in the north and east may not have been declaring data properly. This can be inferred from the fact that the difference in the number of Covid cases between the two segments is much larger than the difference in testing. It might also be a function of the fact that the health systems in the south and west work much better than those in the north and particularly the east (states like Bihar, Jharkhand and large parts of Uttar Pradesh). This might imply that the chances of the health system catching new cases once testing is carried out are much better in the south and the west than in the north and the east.

Of course, more research is needed to say this with absolute certainty, but this seems to be the simplest possible explanation.

(2) Governing party

We grouped states as governed by the BJP-led NDA and those not governed by the NDA.

The states of Uttar Pradesh, Bihar, Madhya Pradesh, Haryana, Gujarat, Karnataka, Assam, Uttarakhand, and Jammu and Kashmir (given that the union territory is governed by a lieutenant governor and not an elected chief minister) comprise the NDA group.

The states of Maharashtra, Tamil Nadu, Rajasthan, Punjab, Andhra Pradesh, Odisha, Telangana, Jharkhand, Delhi, Chhattisgarh and Kerala, comprise the non-NDA group. Again, only states with a population of over one crore have been considered.

First, let’s look at the total testing for these two groups from January 1 to June 10.

The total testing that happened, adjusted for population, was slightly more in the non-NDA states. This difference is also seen in the daily testing for these two groups as seen in Figure 3 below.

As Figure 3 shows, the graphs for both groups track each other well with the non-NDA states having higher testing numbers throughout.

Now, let’s look at the total positive cases (per one lakh) for these two groups as well.

As Table 4 shows, there is a stark difference in the total number of positive cases between the two groups. The number of positive cases in non-NDA states, adjusted for population, is more than double that in the NDA states.

This wide difference is also seen in the daily positive cases shown in Figure 4 below.

Figure 4 shows that the spike in cases started almost at the same time in both groups and the shape of the graphs is very similar, except that the peak of the curve in the non-NDA states is nearly double that of the NDA states. Even with the decline in cases after early May’s peak, the daily cases reported in non-NDA states continued to be around double those of the NDA states, after we adjust them for population.

A small part of this would be because of the fact that non-NDA states carried out more tests. But the difference is too big to be just explained by more testing by non-NDA states. The simplest explanation for this lies in the fact that some of the bigger NDA states may have been fudging data. Also, their health systems might not be developed enough to capture data well.

Take, for instance, the fact that for quite some time, just the city of Delhi had more cases than neighbouring Uttar Pradesh, which has a population 10 times that of Delhi and a very weak health infrastructure to boot. Or the fact that the city of Nagpur, which is in Maharashtra and very close to the Madhya Pradesh border, for a while had more cases than the entire state of Madhya Pradesh. The data doesn’t pass the basic smell test.

Media in Gujarat has regularly highlighted how the total number of cremations and burials being carried out under the Covid protocol are significantly more than the death numbers being declared by the government.

Or take the case of Bihar. On June 9, 3,951 uncounted Covid-related deaths were added to the state’s overall tally and pushed the total number of deaths up to 9,429 deaths. This single addition made up nearly 42 percent of the total number of deaths in the state. As the state’s health secretary told reporters: “We had set up two committees, who reassessed Covid death toll, taking into account unaccounted deaths in private hospitals and other places.”

Similarly in Maharashtra, 8,756 more deaths were added for the period of May 1 to June 7 after a reconciliation exercise, taking the state’s death tally for that period to 22,099 and the state’s total death tally close to 1,00,000. The discrepancy that’s being corrected, though, has simply to do with delayed reporting rather than underreporting or misreporting of deaths. Also, unlike Bihar, the number of deaths added was not a huge proportion of the overall number.
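The contrast between the two revisions can be checked with a few lines of Python. The sketch below uses only the figures quoted above. For Maharashtra it takes the overall tally as roughly 1,00,000, since the article gives it only as “close to” that number, so that share is approximate.

```python
def share_percent(part: float, whole: float) -> float:
    """Share of `part` in `whole`, expressed as a percentage."""
    return part / whole * 100.0


# Bihar: deaths added on June 9 versus the revised state total.
bihar_added, bihar_total = 3_951, 9_429

# Maharashtra: deaths added after reconciliation versus the overall tally.
# The overall tally is approximate ("close to 1,00,000" in the article).
maharashtra_added, maharashtra_total_approx = 8_756, 100_000

bihar_share = share_percent(bihar_added, bihar_total)
maharashtra_share = share_percent(maharashtra_added, maharashtra_total_approx)

print(f"Bihar: added deaths are {bihar_share:.1f} percent of the revised total")
print(f"Maharashtra: added deaths are about {maharashtra_share:.1f} percent of the overall tally")
```

Bihar’s one-time addition comes to nearly 42 percent of its revised total, while Maharashtra’s comes to under 9 percent of its overall tally.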

These facts don’t pass the basic smell test. All the NDA states mentioned are governed by the BJP on its own except for Bihar, where it is in a coalition.

Now, let’s take a look at more detailed state-level data.

Testing

Extensive testing is essential to an informed public health response to Covid-19. Public health experts rely on testing data to understand the spread of the disease, determine the adequate response to an outbreak, assess whether containment measures are working, and get early warnings about impending waves. Obviously, it is not practical to test the entire population at any given time to get a completely accurate picture of disease prevalence. But the more data points there are, the better the resolution.

So, how much emphasis have different states placed on testing in recent months?

Figure 5 below charts the total tests conducted by each state (with population over one crore).

Some obvious observations from this data.

Delhi is an outlier, having tested more relative to its population than any other state. Kerala is the other state which has done well on this front. Other states that have high testing rates include Uttarakhand, Telangana and Karnataka.
The states of West Bengal, Madhya Pradesh, Rajasthan, Bihar and Jharkhand have conducted the fewest tests relative to their population.

Positivity rate

While the aggregate testing data gives us some insight into the comparative performance of states over a period, another more notable metric is daily testing and infection trends. It is obvious that when testing increases, the number of positive cases will also increase. So, the metric to look at to figure out the level of infection in a community is not the total number of positive cases, but the positivity rate or percent positive, which is the fraction of positive cases relative to the total tests carried out.

The higher the positivity rate, the more concerning the situation.

As a rule of thumb, one threshold for the positivity rate being “too high” is five percent. For example, the World Health Organisation recommended in May 2020 that the positivity rate should remain below five percent for at least two weeks before any government considers reopening the local economy. A higher positivity rate suggests higher infection levels and that there are likely to be more people with coronavirus in the community who haven’t been tested as yet.
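One way to turn this rule of thumb into a check on a daily positivity series is sketched below in Python. It assumes “at least two weeks” means the most recent 14 consecutive daily values, which is our reading and not an official specification. The function name and the sample series are invented for illustration.

```python
def below_threshold_for(rates, threshold=5.0, days=14):
    """Return True if the most recent `days` values are all below `threshold`.

    `rates` is a list of daily positivity rates in percent, oldest first.
    """
    if len(rates) < days:
        return False  # not enough history to make the call
    return all(rate < threshold for rate in rates[-days:])


# Hypothetical series: five high days followed by fourteen days under 5 percent.
settled = [8.0] * 5 + [4.0] * 14
# Hypothetical series: thirteen low days, then a day back above the threshold.
relapsed = [4.0] * 13 + [5.5]

print(below_threshold_for(settled))
print(below_threshold_for(relapsed))
```

The first series passes the check and the second does not, since a single day above the threshold resets the two-week clock.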

The link between testing and positivity rate is crucial. Let us try to understand this through an example.

Suppose we conduct 100 Covid tests in a day and 20 of them are positive, resulting in a daily positivity rate of 20 percent. Now, if we increase the daily tests to 1,000 and the number of people who test positive increases to 100, the positivity rate drops to 10 percent. Sticking with this example, if we only test 100 people a day, chances are that only patients with severe symptoms are getting tested and the positivity rate will be high.
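The same example can be written as a couple of lines of Python. The function name is ours, and the inputs are the numbers from the example above.

```python
def positivity_rate(positives: int, tests: int) -> float:
    """Percentage of tests that came back positive."""
    if tests <= 0:
        raise ValueError("tests must be a positive number")
    return positives * 100.0 / tests


narrow_testing = positivity_rate(20, 100)     # 100 tests, 20 positive
wider_testing = positivity_rate(100, 1_000)   # 1,000 tests, 100 positive

print(f"100 tests a day: {narrow_testing:.0f} percent positive")
print(f"1,000 tests a day: {wider_testing:.0f} percent positive")
```

Note that the number of detected cases rises five-fold (from 20 to 100) even as the positivity rate halves, which is why the two numbers have to be read together.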

As Ryan A Bourne writes in Economics in One Virus: “There’s good reason to think that there is an inverted-U shape of Covid-19 cases as testing numbers increase. At first, conducting more tests leads to finding additional cases.”

Of course, this makes governments look bad in the short run. This explains why some governments have been reluctant to test or some have even been reluctant to report the right result. In the short-term, testing increases the positivity rate. Only, once the number of people being tested keeps increasing, does the positivity rate fall.

As we expand that testing pool to a larger group (let’s say 1,000 in this example), folks with milder symptoms will also get tested. This expansion of testing to a larger group – one that is not limited to symptomatic folks – has two benefits.

One, it reveals cases that have gone undetected and thus have the potential to reduce the spread by asking people to isolate. As Bourne writes: “When testing is widespread and regular enough, conducting more tests actually reduces Covid-19 cases…[as]… potentially infectious people can isolate themselves immediately and notify those they have been in contact with sooner.” What this does is it minimises “the window of transmission between people becoming infectious and ultimately isolating – the time the person would likely be out spreading the disease.”

Two, more testing actually decreases the positivity rate since those showing milder/no symptoms will have lower chances of testing positive.

This two-fold benefit of expanding the testing pool explains why epidemiologists maintain that testing levels should be increased if the number of positive cases begins to increase in a locality.

Statewide data on testing and positivity rate

Armed with this understanding, let’s look at daily tests and daily positivity trends for different states to assess their preparedness and response to the pandemic. The extent of testing conducted by each state even before the second wave hit (baseline testing) is an indicator of how vigilant the state authorities were in monitoring the pandemic.

On the other hand, the rate at which a state ramped up its testing rate, once positive cases began to spike, is an indicator of its responsiveness to the second wave. The link between positivity rate and testing can be discerned by looking at the respective trends over time.

Obviously, an ideal response is one where the level of testing was high before the second wave started in March and the testing rate increased in line with increasing cases. As we will see, not all states that had relatively high testing levels in January and February, and hence had higher vigilance levels, were the most responsive when the cases began to surge.
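To make the two ideas concrete, the Python sketch below shows one possible way to quantify them. Vigilance is taken as the average daily testing over a baseline window, and responsiveness as the number of days between positivity crossing a surge threshold and testing rising well above the baseline. These definitions, the threshold of five percent, the ramp-up factor of 1.5 and the sample series are all our assumptions for illustration. They are not the exact method behind the charts in this piece.

```python
from statistics import mean


def baseline_testing(daily_tests, baseline_days):
    """Vigilance proxy: average daily tests over the first `baseline_days` days."""
    return mean(daily_tests[:baseline_days])


def response_lag(daily_tests, daily_positivity, baseline_days,
                 surge_threshold=5.0, ramp_factor=1.5):
    """Responsiveness proxy: days from the positivity surge to the testing ramp-up.

    Returns None if positivity never crosses `surge_threshold`, or if testing
    never reaches `ramp_factor` times the baseline after the surge begins.
    """
    baseline = baseline_testing(daily_tests, baseline_days)
    surge_day = next(
        (i for i, rate in enumerate(daily_positivity) if rate >= surge_threshold),
        None,
    )
    if surge_day is None:
        return None
    for day in range(surge_day, len(daily_tests)):
        if daily_tests[day] >= ramp_factor * baseline:
            return day - surge_day
    return None


# Hypothetical series: ten quiet days, then a surge with testing catching up.
tests = [100] * 10 + [100, 100, 120, 160, 200]
positivity = [2.0] * 10 + [6.0, 8.0, 10.0, 12.0, 14.0]

print(baseline_testing(tests, 10))
print(response_lag(tests, positivity, 10))
```

In this made-up series, testing crosses 1.5 times the baseline three days after positivity first exceeds five percent. A state could score well on one measure and poorly on the other, which is the distinction drawn above.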

We discuss individual states in the next two sections. We have divided the analysis into three groups of five states each, to minimise visual clutter.

Figures 6 and 7 below chart the daily Covid tests and the positivity rates for the five states with the highest testing rates: Delhi, Kerala, Uttarakhand, Telangana and Karnataka. Note that all the daily data graphs are seven-day moving averages to smoothen the day-to-day fluctuations.
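For readers unfamiliar with the technique, a seven-day moving average replaces each day’s value with the mean of a seven-day window. The Python sketch below uses a trailing window (the current day and the six before it). Whether the charts use a trailing or a centred window is not stated, so that choice is an assumption here.

```python
def moving_average(values, window=7):
    """Trailing moving average; returns one value per full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running_total = 0.0
    for index, value in enumerate(values):
        running_total += value
        if index >= window:
            # Drop the value that has just fallen out of the window.
            running_total -= values[index - window]
        if index >= window - 1:
            averages.append(running_total / window)
    return averages


# Hypothetical daily counts, for illustration only.
daily_counts = [1, 2, 3, 4, 5, 6, 7, 8]
print(moving_average(daily_counts))
```

The output has six fewer points than the input, since the first six days do not yet have a full week behind them. Smoothing of this kind also evens out weekly reporting patterns, such as lower testing on weekends.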

Here are the key observations we can make looking at Figures 6 and 7.

Delhi had an exceptionally high baseline level of daily testing even before the second surge started, which only increased marginally as the positivity rate spiked.
Except Telangana, the four other states had steep increases in positivity rates and high peak positivity rates. Telangana’s peak positivity rate is significantly lower than that of the other states in the chart. It is also low compared to nationwide metrics, as we shall see later in the article when comparing the peak positivity rates of all states.
Kerala and Karnataka were the only states where daily testing increased as soon as positivity rates began to increase. So, the two states seem to have responded well to the second wave, at least in terms of increased testing. This is hardly surprising given that these states have a much better health infrastructure in comparison to other parts of the country.
Delhi, as we mentioned earlier, had a minor increase in testing despite a huge spike of positivity. We shouldn’t judge Delhi too harshly for not ramping up testing when cases increased, since their testing levels were already very high when compared to other parts of the country.
Uttarakhand had a significant spike in daily testing around April 1, although the spike in positivity did not happen until over a month later. This can be explained by the timing of the Kumbh Mela in Haridwar, which started on April 1. Since a negative RT-PCR test was required for attending the event, the spike in testing around that date makes sense. Whether these increased tests were real or fake is now under the scanner. A recent report alleges that up to one lakh tests reported during the Kumbh Mela were fake. A private company responsible for Covid testing during the Kumbh Mela had allegedly fabricated test results to meet its daily quota of tests.

Now let’s look at the daily testing and positivity rate for the five states with lowest testing rates: West Bengal, Madhya Pradesh, Rajasthan, Bihar and Jharkhand.

The key observations that can be made looking at Figures 8 and 9 are as follows.

All five states had very low daily testing rates compared to the first group, with Jharkhand being the best among the five, albeit a bit choppy.
Interestingly, the ramp-up in testing in all the five states coincided closely with increased positivity rates. So, it appears that while these states performed poorly when it comes to the volume of testing, all of them increased testing in a timely manner.

Finally, let’s look at these two metrics for the five most populous states that haven’t figured in either of the earlier groups: Andhra Pradesh, Gujarat, Maharashtra, Tamil Nadu and Uttar Pradesh.

The key observations that can be made looking at Figures 10 and 11 are as follows:

Gujarat had high testing rates, with a peak daily testing rate comparable to that of Karnataka. But the total testing numbers were lower than those of the first group of five best performing states. Its positivity rate was among the lowest, with a peak positivity rate of less than 10 percent (9.71 percent to be precise), while the neighbouring states of Rajasthan, Madhya Pradesh and Maharashtra had positivity rates close to 25 percent. This raises some red flags. There have been several newspaper reports citing instances of manipulation of Covid-related data in Gujarat. While the main focus of those reports has been underreporting of Covid deaths, it appears that case numbers may also be fudged.
For instance, on June 5, the CEO of a top hospital in Ahmedabad was quoted as saying: “All the figures from daily case numbers to daily tests to cases of hospitalisation and deaths were lower than the actual count. That made our job more difficult because the numbers were kept artificially low, but in reality, there were no beds available anywhere in the city.”
Maharashtra has a unique positivity rate chart. All other states (including those in the previous two groups) show a sharp spike in positivity in late March or early April. Maharashtra, on the other hand, showed a more gradual uptick starting as early as the middle of February, reaching near-maximum levels in late March. Thereafter, instead of a sharp peak like most other states (implying falling positivity levels after a short peak), the positivity rate in Maharashtra plateaued for around six weeks, before beginning to decline in mid-May.
Uttar Pradesh is unique in how sharply its positivity rate declined after hitting the peak in late April. In fact, in this article by economist Omkar Goswami, the peak positivity rate of Uttar Pradesh has been compared against that of states like Maharashtra and Tamil Nadu to suggest a possibility of data fudging by Uttar Pradesh officials.
In our analysis, comparing peak positivity of all states, the Uttar Pradesh peak positivity figure of around 16 percent is low compared to many big states such as Maharashtra (24 percent), Rajasthan (25 percent), Karnataka (35 percent) and West Bengal (30 percent), but there are some states with lower positivity rates too, such as Bihar (15 percent), Gujarat (10 percent) and Punjab (14 percent). So, based on peak positivity rate alone, classifying Uttar Pradesh as the only suspicious outlier is not borne out by the data. It is possible that all states with significantly lower positivity rates could actually be fudging the data as well.
Also, the sharp decline from peak positivity in Uttar Pradesh is unique (the blue curve in Figure 11), and perhaps a more telling indicator of possible data manipulation in UP. Given the exceptional fall in positivity rates in the state not seen anywhere else in the country, testing result data from Uttar Pradesh does not pass the basic smell test.

Peak positivity

One metric that we have already discussed while interpreting the graphs above is worth comparing across states nationwide: peak positivity rate. The figure below charts the peak daily positivity rate for all states with a population over one crore, and the dates when the peaks occurred.
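Finding the peak and its date from a daily series is a simple scan. The Python sketch below assumes each record holds a date, the number of positives and the number of tests for that day, which is our assumed layout. The three records are invented for illustration and are not real state data. In practice the scan would be run on the smoothed series so that a single noisy day does not set the peak.

```python
from datetime import date


def peak_positivity(series):
    """Return (day, rate) for the highest positivity rate, or None if no valid day.

    `series` is a list of (day, positives, tests) tuples.
    """
    best = None
    for day, positives, tests in series:
        if tests <= 0:
            continue  # skip days with no reported tests
        rate = positives * 100.0 / tests
        if best is None or rate > best[1]:
            best = (day, rate)
    return best


# Hypothetical records for one state. These are NOT real figures.
records = [
    (date(2021, 4, 20), 150, 1_000),
    (date(2021, 4, 27), 240, 1_000),
    (date(2021, 5, 4), 180, 1_200),
]

peak_day, peak_rate = peak_positivity(records)
print(f"Peak positivity of {peak_rate:.1f} percent on {peak_day.isoformat()}")
```

Running this per state gives both quantities plotted in the chart: the height of the peak and when it occurred.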

Some observations from the chart above.

The northern states of the country experienced the peak earlier than the southern states, with all southern states except Telangana peaking within a two-week period between May 15 and May 30.
There are several states which reported positivity rates above 25 percent, with Karnataka having the highest peak. As mentioned earlier, Gujarat has the lowest peak positivity at 9.71 percent, compared to an average of 21.39 percent for the states included in this study. Given the much higher peak positivity in nearby states like Maharashtra and Rajasthan, this number might raise a few eyebrows. That said, Gujarat’s daily testing was higher than in those states, which can bring the positivity rate down to some extent, as we discussed earlier in the article.
Like Gujarat, Telangana’s positivity rate of 10.42 percent is exceptionally low compared to the average rate which casts a shadow on the accuracy of the data. Others have raised similar doubts based on data reported by a few districts in the state. For example, after comparing official Covid case count with numbers from local data sources in April, the News Minute concluded: “Telangana government is ‘officially’ underreporting the Covid-19 case count by more than 70% at least.”
This chart also illustrates an earlier point that while Uttar Pradesh’s peak positivity rate is lower than other populous states like Maharashtra, West Bengal and Tamil Nadu, the number is not an outlier as other states like Bihar, Punjab and Gujarat, have a lower corresponding figure.

Conclusion

So, what is the point of all the analysis in this piece?

The idea is not to rank different states or pit them against each other. But it is important to know how different states responded to the second wave when it came to testing and infections. Such analyses can help identify shortcomings that need remedial action and offer an opportunity of cross-pollination of ideas between states. That the analysis can sometimes reveal potential data manipulation, is incidental but also important in its own right, since calling out suspicious numbers could discourage blatant data manipulation.

Given that we are looking at several more months before enough people get vaccinated to build herd immunity, and that the threat of a third wave persists, testing will continue to be a critical part of the government response to the pandemic. Governments – both state and central – will do well to learn the lessons contained in the data collected so far. But for that to happen, someone needs to sift through the data that is being collected.

This is one such attempt.

This story is part of the NL Sena project which our readers contributed to. It was made possible by Nirvik Dey, Anupam Das, Suraj Kaul, Somok Gupta Roy, Aditya Deuskar, Sumeet M Moghe, Abhishek Kumar, Swarnava Sarkar, Karthikeya Muchinthaya, KV Radhakrishnan, Rajkumar Jindal, Rajdeep Adhikari, and other NL Sena members.

Contribute to our upcoming Sena project, Plunder of the Aravallis, and help to keep news free and independent.
