## Deaths v Cases – MI COVID data

Is there a relationship between the number of new cases reported and the number of subsequent deaths? It’s a difficult question because of uncertainties in the data (sparse testing early on, COVID deaths likely under-reported, etc.). Here’s an attempt at an analysis.

First, here’s a graph showing the number of new cases and the number of deaths throughout the bulk of the pandemic. Deaths are scaled to make their change over time more apparent; read the number of deaths from the y-axis on the right side. Note that the two variables do indeed tend to change together. However, the scaled deaths early on are much higher than the number of new cases, and the scaled deaths later are lower than new cases. This suggests what many have suspected: testing was probably missing many cases early in the pandemic. Serious cases, those that were symptomatic and more likely to lead to death, were being recorded, while asymptomatic cases were probably being missed.

Here’s my new look at the data regarding the relationship between cases and deaths. This was sparked by a question from my friend Cliff Harris: **“can you find a reliable correlation between cases and deaths? For instance, is there any number of days x where, the # of deaths/(# cases x days previous) is close to constant?”** The graphs below address the correlation question directly, and suggest an answer to his question about *x*.

I split the data somewhat arbitrarily into early and late periods, corresponding approximately to the point where the Cases and Death curves cross in my graph above. I did this under the assumption that a smaller number of tests early in the pandemic might produce different results than are seen with the larger and perhaps more reliable testing done later.

These graphs plot Cases against Deaths, and include information about the regression lines for the early and late data. It is clear that there is a relationship between cases and deaths, and it is also clear that this relationship differs between the early and the later data. The largest daily death counts are associated with low numbers of new cases early in the pandemic (blue points), when cases were probably largely undetected. Late in the pandemic (red points), testing was more widespread and more cases were reported, and the medical community had learned more about COVID-19 and was better able to prevent death, so deaths are lower relative to cases.

More interestingly, each graph varies the “lag” between Cases and Deaths. If Lag=0, the graph represents the relationship between Cases on a particular day and the number of deaths reported that same day; Lag=5 shows the relationship between Cases and the Deaths reported 5 days later, and so on. The Lag=0 and Lag=5 graphs include the exceedingly high case counts reported recently, which tend to increase the linearity of the data; these high numbers do not appear in the graphs for Lag=10 or greater because we are not yet 10 days out from them.
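
The lag pairing described above can be sketched in a few lines of Python. This is only an illustration: the function name `lagged_pairs` and the daily counts are made up, not taken from the Michigan data or from the tool used to draw the graphs. It also shows why a very recent spike in cases drops out at longer lags: the matching death counts have not been reported yet.

```python
def lagged_pairs(cases, deaths, lag):
    """Pair cases reported on day t with deaths reported on day t + lag."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if len(cases) != len(deaths):
        raise ValueError("both series must cover the same days")
    # The last `lag` days of cases have no matching death count yet.
    usable = len(cases) - lag
    if usable <= 0:
        return []
    return list(zip(cases[:usable], deaths[lag:]))


# Made-up daily counts for illustration; the last day is a sudden spike in cases.
cases = [100, 120, 90, 150, 200, 5000]
deaths = [2, 3, 2, 4, 3, 5]

pairs_lag0 = lagged_pairs(cases, deaths, 0)  # includes the spike day
pairs_lag2 = lagged_pairs(cases, deaths, 2)  # spike day has no partner yet
print(pairs_lag2)  # [(100, 2), (120, 4), (90, 3), (150, 5)]
```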

One thing to note is that the linear relationship between Cases and Deaths breaks down for the Early data starting around Lag=15, perhaps because of limited testing resulting in undercounting of Cases in this early phase. The linear relationship is largely maintained up through Lag=25 for the Late phase.

Here are the correlation coefficients (r) for the various lags, both Early and Late:

| Lag (days) | Early r | Late r |
|---|---|---|
| 0 | 0.777 | 0.908 |
| 5 | 0.806 | 0.897 |
| 10 | 0.710 | 0.864 |
| 15 | 0.614 | 0.770 |
| 20 | 0.523 | 0.814 |
| 25 | 0.458 | 0.686 |
| 30 | 0.427 | 0.483 |
| 35 | 0.381 | 0.359 |
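
For readers who want to try this on their own data, here is a minimal sketch of how a correlation coefficient at a given lag could be computed. The helper names `pearson_r` and `r_at_lag` are my own, and the series below is synthetic (deaths are built as exactly 2% of the cases reported 3 days earlier), so the numbers it produces have nothing to do with the values in the table.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(ss_x * ss_y)


def r_at_lag(cases, deaths, lag):
    """Correlate cases on day t with deaths on day t + lag."""
    usable = len(cases) - lag
    return pearson_r(cases[:usable], deaths[lag:])


# Synthetic data: deaths are exactly 2% of the cases from 3 days earlier.
cases = [100, 120, 90, 150, 200, 180, 160, 140, 130, 170]
deaths = [0.0, 0.0, 0.0] + [0.02 * c for c in cases[:7]]

r_by_lag = {lag: r_at_lag(cases, deaths, lag) for lag in (0, 3, 5)}
for lag, r in r_by_lag.items():
    print(f"lag {lag}: r = {r:.3f}")
```

In this toy series the correlation peaks at lag 3, the lag that was built into it. Real data are noisier, so the peak is broader, as in the table.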

In the late phase, when testing is more prevalent and the recent outliers are removed, lags of 10 through 20 yield correlation coefficients ranging from r=0.77 to r=0.86. Squaring these gives r² values of roughly 0.59 to 0.75, which suggests that about 2/3 of the variability in the number of deaths reported on a given day is accounted for by the number of new cases reported 10 to 20 days earlier. This is a strong relationship, but not a perfect one: other factors account for the remaining 1/3 of the variability. These might well include different time courses of the illness: two diagnoses on Day 1 that result in deaths on Days 15 and 18 would reduce the value of New Cases in predicting deaths exactly 15 days out. Nonetheless, this analysis suggests that the number of deaths will be related to the number of cases reported 10 to 20 days earlier.
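
The “about 2/3” figure comes from squaring the correlation coefficient, since r² is the share of variability that a straight-line fit accounts for. A quick check using the three late-phase values from the table (the variable names are just for illustration):

```python
# Late-phase correlation coefficients for lags of 10, 15 and 20 days,
# copied from the table of correlation coefficients.
late_r = {10: 0.864, 15: 0.770, 20: 0.814}

# r squared is the share of variance in deaths accounted for by lagged cases.
r_squared = {lag: r ** 2 for lag, r in late_r.items()}
mean_r_squared = sum(r_squared.values()) / len(r_squared)

for lag, value in r_squared.items():
    print(f"lag {lag}: r^2 = {value:.3f}")
print(f"mean r^2 = {mean_r_squared:.3f}")
# lag 10: r^2 = 0.746
# lag 15: r^2 = 0.593
# lag 20: r^2 = 0.663
# mean r^2 = 0.667
```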

So my short and imprecise answer to Cliff’s question: *10 < x < 20*.

[Addendum: additional info related to Cliff’s question: During the time when the number of cases was most stable, roughly 8/1 through 9/27, the value of *# of deaths/(# cases x days previous)* is about 0.0147, meaning that whether you choose a lag of 10, 15, or 20 days, some 1.5% of the people diagnosed will die that many days later. Note that this is lower than the state’s reported Case Fatality Rate of 3.2% because of the imprecision inherent in predicting the exact course of the illness.]
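
Here is a rough sketch of how the ratio in Cliff’s question could be computed. It assumes the ratio is averaged over individual days, which is only one possible choice (dividing total deaths by total lagged cases is another), and the function name `lagged_death_ratio` and the counts are invented for illustration. The made-up series has stable case counts, which shows why the choice of lag matters little during such a period.

```python
from statistics import mean


def lagged_death_ratio(cases, deaths, lag):
    """Mean of deaths[t] / cases[t - lag] over all days where both exist."""
    ratios = [
        deaths[t] / cases[t - lag]
        for t in range(lag, len(deaths))
        if cases[t - lag] > 0  # skip days with no reported cases
    ]
    if not ratios:
        raise ValueError("no usable days at this lag")
    return mean(ratios)


# Made-up stable period: about 700 new cases and 10 deaths per day for 40 days.
cases = [700, 710, 690, 705, 695] * 8
deaths = [10] * 40

ratio_by_lag = {lag: lagged_death_ratio(cases, deaths, lag) for lag in (10, 15, 20)}
for lag, ratio in ratio_by_lag.items():
    print(f"lag {lag}: deaths / lagged cases = {ratio:.4f}")
```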

*(Disclaimer: My case data come from the MI.gov site that reports daily new cases. My death data were extracted from the Cases and Deaths by County by Date of Onset of Symptoms and Date of Death spreadsheet that the state makes available for download; this spreadsheet offers deaths by county for each day, but does not offer a statewide death count for each day, so I had to calculate that. I accept all responsibility for any errors in this regard.)*
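
For anyone repeating the statewide calculation, the idea is simply to add up the county rows for each date. The sketch below assumes rows with hypothetical `county`, `date` and `deaths` fields and uses invented counts; the actual spreadsheet’s column names and layout may differ.

```python
from collections import defaultdict


def statewide_daily_deaths(rows):
    """Sum per-county death counts into one statewide total per date."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["date"]] += int(row["deaths"])
    # ISO-formatted dates sort correctly as plain strings.
    return dict(sorted(totals.items()))


# Invented rows for illustration; field names are assumptions.
rows = [
    {"county": "Wayne", "date": "2020-08-02", "deaths": "2"},
    {"county": "Wayne", "date": "2020-08-01", "deaths": "3"},
    {"county": "Oakland", "date": "2020-08-01", "deaths": "1"},
    {"county": "Kent", "date": "2020-08-02", "deaths": "0"},
]

daily_totals = statewide_daily_deaths(rows)
print(daily_totals)  # {'2020-08-01': 4, '2020-08-02': 2}
```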