Friday, April 24, 2020

Data Trend Model: Details

This post gives some details about the "Data Trend Model" I use for COVID-19 projections, so it's a bit technical. For a more casual description, please read this post.

Model Goals

The goal of the model is to answer the question "What will happen in the US if the current trends continue?" It tries to do this with a minimum number of assumptions. Data trends are calculated separately for each US state (or region, for places like Puerto Rico and Guam), and the resulting projections are added together to get projections for the entire US.


Day-to-day variances in reporting: The daily reports of new confirmed COVID-19 cases and deaths show a lot of variability; on occasion, daily numbers for some counties or even states are completely missing, and then presumably included in the next day's report.

Weekly cycles: Data for multiple states show strong weekly patterns, for example significantly lower case numbers on weekends, and higher numbers during the middle of the week.

"Special events": Reported numbers often show short dramatic increases that can be linked to specific causes. One example was a spike in the cases reported by Ohio when prisoners in the state were tested, and a very large fraction tested positive for COVID-19.


Data source: Daily case numbers are downloaded from the Johns Hopkins data page on GitHub.  County data for each state are summed up, and daily new cases are calculated from totals.
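As a rough illustration of this step, here is a minimal Python sketch. It is not the actual code behind the model; the function name and the input layout (one list of cumulative totals per county, one entry per day) are assumptions made for the example.

```python
def state_daily_new_cases(county_totals):
    """Turn cumulative county totals into daily new cases for a state.

    county_totals: list of per-county lists of cumulative confirmed cases.
    All lists are assumed to cover the same days, in the same order.
    """
    # Sum the counties for each day to get the cumulative state total.
    state_totals = [sum(day) for day in zip(*county_totals)]
    # Daily new cases are the day-to-day differences of the cumulative totals.
    return [today - yesterday
            for yesterday, today in zip(state_totals, state_totals[1:])]


# Made-up numbers for two counties over four days.
counties = [
    [10, 12, 15, 15],
    [3, 3, 7, 9],
]
print(state_daily_new_cases(counties))
```

Note that the result has one entry fewer than the input, because the first day has no previous total to subtract. A downward correction in the reported totals would show up as a negative value here; the sketch does nothing special about that.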

Data smoothing: To reduce the impact of day-to-day variations, daily new cases are smoothed with a 3-day block average (2-day average for the last day). In the future, I will re-evaluate using a 7-day average that would also remove most errors from weekly cycles; however, the longer smoothing will reduce the sensitivity of the model to recent changes.
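The smoothing step could look roughly like the sketch below. It assumes that the 3-day block is centered on each day (previous day, current day, next day), which is why only two values are available for the last day. Treating the first day the same way is also an assumption of this example.

```python
def smooth_3day(daily):
    """Centered 3-day average; the first and last day use a 2-day average."""
    smoothed = []
    n = len(daily)
    for i in range(n):
        # Window is [i-1, i, i+1], clipped at both ends of the series.
        window = daily[max(0, i - 1):min(n, i + 2)]
        smoothed.append(sum(window) / len(window))
    return smoothed


print(smooth_3day([3, 6, 9, 12]))
```

For the last day, the window contains only the last two values, so a single unusual report on the most recent day still moves the smoothed value noticeably.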

Trend lines: To extrapolate future changes, trend lines (linear regressions) are computed from the log of the smoothed data. Log numbers are used because new infections during steady phases of the epidemic are proportional to the number of active infections, which will give exponential increases or decreases that result in straight lines in the log graphs. To avoid distortions from weekly cycles, primary trend lines are calculated for 7-day and 15-day periods.
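To make the trend line idea concrete: a straight line in the log graph, ln(cases) = a + b * day, corresponds to cases multiplying by exp(b) each day. The sketch below fits such a line with ordinary least squares over the last 7 or 15 days. The function name is hypothetical, and the sketch assumes all smoothed values are greater than zero (the log of zero is undefined); how zero counts are handled in the actual model is not covered here.

```python
import math


def log_trend(smoothed, days):
    """Least-squares line through ln(cases) for the last `days` values.

    Returns (slope, intercept), with day 0 being the first day of the window.
    """
    window = smoothed[-days:]
    n = len(window)
    xs = list(range(n))
    ys = [math.log(v) for v in window]  # assumes every value is > 0
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Synthetic series growing by exactly 10% per day.
series = [100 * 1.1 ** d for d in range(15)]
slope_7, _ = log_trend(series, 7)
slope_15, _ = log_trend(series, 15)
print(math.exp(slope_7), math.exp(slope_15))  # daily growth factors
```

On this synthetic series both windows recover the same growth factor; on real data the 7-day and 15-day slopes will usually differ.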

Limits to growth rates: Several steps are taken to prevent over-estimates of growth rates from causing "runaway states" to dominate the statistics. When the 7-day trend line shows an increase in cases (a positive slope), the 15-day trend line is also considered, and the 15-day slope is used if it is lower. This reduces or eliminates some of the artifacts from "special events".
When estimating future case counts, the growth of daily case numbers in each state is limited to 14 days. After 14 days, the number of daily cases in the state remains constant. This reflects the expectation that governors will eventually impose stricter measures when reported cases increase consistently. It also reduces the effects of over-estimates from "special event" artifacts.
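The two limits described above can be sketched as follows. This is an illustration under stated assumptions, not the model's actual code: the function names are made up, the projection starts from the last smoothed value rather than from the trend line itself, and declining trends (negative slopes) are assumed to keep declining past day 14, since the limit is described only for growth.

```python
import math


def choose_slope(slope_7, slope_15):
    """Use the 15-day slope only if the 7-day slope is positive and higher."""
    if slope_7 > 0:
        return min(slope_7, slope_15)
    return slope_7


def project_daily_cases(last_value, slope, horizon, growth_days=14):
    """Project daily cases for `horizon` days from a log-linear slope.

    Growth (slope > 0) stops after `growth_days`; from then on the daily
    case number stays constant.
    """
    projection = []
    for day in range(1, horizon + 1):
        effective_day = min(day, growth_days) if slope > 0 else day
        projection.append(last_value * math.exp(slope * effective_day))
    return projection


slope = choose_slope(0.08, 0.05)  # the lower 15-day slope is used
cases = project_daily_cases(100.0, slope, 21)
print(round(cases[13], 1), round(cases[20], 1))  # day 14 and day 21 match
```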

Estimating deaths: Daily deaths are estimated using time-shifted case numbers (-9 days) and the observed time-corrected case fatality rate (7.14%).
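In code, this estimate amounts to a shift and a multiplication. The sketch below reads the -9 day shift as "deaths on a given day are estimated from the cases reported 9 days earlier"; the function name and the use of None for days without a matching case number are assumptions of the example.

```python
def estimate_daily_deaths(daily_cases, lag_days=9, cfr=0.0714):
    """Estimate daily deaths from daily cases reported `lag_days` earlier."""
    deaths = []
    for day in range(len(daily_cases)):
        source_day = day - lag_days
        if source_day < 0:
            deaths.append(None)  # no case data far enough back
        else:
            deaths.append(daily_cases[source_day] * cfr)
    return deaths


# Constant 1000 cases per day for 12 days.
estimates = estimate_daily_deaths([1000] * 12)
print(estimates[8], estimates[9])
```

Because projected case numbers feed into this calculation, any over- or under-estimate in the case projection carries over to the death estimate 9 days later.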


Steady state assumption: The model assumes that the transmission rates remain constant going back about 7 days and going forward. It does not predict the impact of future changes like relaxing or imposing new social distancing rules. However, the effect of such changes will eventually become visible through changes in the predictions.

Sensitivity to noise: Predicted trend lines are sensitive to random day-by-day variation, especially for the last day's data. This is reduced, but not eliminated, by data smoothing and the use of 7-day trend lines.

Test rate differences: Projections are based on assuming uniform test rates between states. This assumption is very unlikely to be correct, but accurate per-state information about what percentage of actual infections in each state is reported in the "confirmed cases" number is not available.

Hospital overloading: The model does not take into account the potential effects of limited hospital and intensive care capacity, which may increase fatality rates.

Under-reported deaths: It is likely that the reported number of COVID-19 deaths underestimates the actual number of deaths caused directly or indirectly by COVID-19, for example because death certificates for people who died without a prior COVID-19 test often do not show COVID-19 as a factor in the death. The model does not try to adjust for such under-reporting, and therefore is likely to under-estimate the number of COVID-19 related deaths.
