Dr. Ryuta Yoshimatsu, Dr. Simon Hefti
When we talk about Covid-19, we sometimes refer to the effective reproduction number, Rt. Rt is one of the key metrics to evaluate where we stand in terms of epidemic growth. Simply put, it indicates how many people an infected person infects on average at time t. When this value is above 1, the number of infections will grow exponentially and when it’s below 1, the opposite is expected. It’s hard to get the real-time estimate of the Rt due to the reporting inefficiency. The latest Rt estimate available today provided by the Swiss Science Task Force, for example, is that from 2 weeks ago.
In this blog post, we explore one possible way to bridge this gap and to give an estimate to today’s Rt. We will first look at the number of total reported Covid-19 cases in Switzerland [data] since the beginning of the epidemic and will study the dynamics of the infection waves. Then, we will model the time series and make predictions into the near-past (described below). Since we know that the Rt is a function of the daily number of cases together with model assumptions, we will use our prediction to estimate the Rt of the missing days including today.
We acknowledge that the reported case numbers we used in this work [data] only serves as a rough proxy for the actual number of infections. These numbers were sometimes largely under-reported due to overloading of the testing capacity and sometimes over-reported due testing errors and biases. Moreover, Rt estimates published by the Swiss Science Task Force, which we validated our prediction against, is an estimate based on yet another estimate. We are by no means trying to claim that we have developed a powerful model that captures the full dynamics of Covid-19 infections or a perfect prediction model to estimate the unknown Rt. We are just two curious data scientists trying to make sense of the data that we have at hand and share what we’ve learned. Enjoy!
Understanding the Reported Swiss Covid-19 Cases
The figure on the top shows the time evolution of the number of total (cumulative) reported Covid-19 cases and below is the number of total reported fatalities due to Covid-19 in Switzerland [data]. Looking at the plots, we spot periods of growth and plateau that come after the other.
This wave-like population growth is often modelled using the logistic equation. Logistic curves can conveniently capture the dynamics of an exponential growth followed by a linear growth and eventually a saturation (plateau). For Covid-19, there have already been a lot of studies published using some variant of this equation, in other words, the generalized logistic equations: e.g. paper1, paper2.
In this work, we used the stretched logistic equation (equation below), which also belongs to the family of the generalized logistic equations. In essence, the stretched logistic equation encodes time dependency in the exponential growth rate. The assumption here is that the growth rate decreases over time due to external contributions such as a state intervention (lockdown) or the rise in people’s awareness. Researchers have shown that the stretched logistic curve fits pretty well to the Covid-19 cases as well as to other epidemic spreads.
Since it seems like we have more than one cycle of growth and plateau, we take a linear combination of multiple stretched logistic curves to fit to the number of total cases:
The fit is performed using the least squares and the number of stretched logistic terms to be included (i in the above equation) in the model at time t is determined by comparing the square error produced by separate models including 1 to m number of terms. The best fitting model as of March 15, 2021 contains three stretched logistic curves. The animation below shows this dynamic model selection process. The grey and the blue curves are the observed number of total cases and the orange curve is the best fit using the above equation. The orange vertical dashed lines indicate t1, t2 and t3. Note that the fit deviates from the actual observation more at the onset of each growth. It’s generally understood that a good prediction could only be made in the second half of the sigmoid. Fortunately, the fit seems to be stable today (March 15, 2021) but at the onset of another growth period, we might have to find a more robust model.
Shown below are the snapshots of the fit as of March 15, 2021 using a linear combination of three stretched logistic curves. This model does pretty well on the overall trend.
Estimating the latest Rt
Rt of today can be inferred from a disease transmission model, an initial probability distribution of the Rt and the history of the daily number of cases including that of today using a Bayesian framework. The black curve in the plot below is the estimated Rt obtained using this approach and the red curve is the logarithm of daily number of cases. You can see in the graph that as soon as the Rt exceeds the threshold, the number of cases starts increasing and vice versa when it goes below.
The difficulty in estimating the latest Rt is in obtaining the daily number of cases of today (red box in the above figure). There is an unavoidable time lag between an actual event of infection and when this event is reported. For this reason, the Swiss Science Task Force waits typically for about 10–14 days until the reported number stabilises and then they estimate the Rt for that given day. This is why the Rt provided on the website today only reflects the infection dynamics of 10–14 days ago.
The animation above demonstrates how the reporting efficiency evolved over the entire history of the epidemic. On the horizontal axis, the number of days since the first report is plotted, which indicates how many days was needed until the number of reported cases stabilised. On the vertical axis is the normalized reported cases. The orange line is a fit to the observed data using an arctan function. We can see that especially at the onset of the new wave, the reporting efficiency suffered dramatically.
From the first section of this blog post, we think we have a good model that fits well to the data. Leveraging this model, we want to give a prediction to the number of daily cases in the last two weeks including today (forecast the near-past). Once we obtain this prediction, we feed this further into another model that gives us the estimate of Rt, which is our ultimate goal.
Let’s look one more time at the number of total cases (black curve in top figure top plot) and our fit using a linear combination of three logistic terms (red curve in top figure top plot). The model does a great job fitting to the overall trend but fails to capture the local structure: i.e. seasonality. You can see this in the first order derivative of the number of total cases or the daily number of cases (top figure bottom plot). There seems to be a strong seasonal dependency in the time series.
We use the red curve as a trend and subtract that from the black curve. The remaining is the “detrended” time series (above figure). Here, you can also clearly see a strong weekly seasonality (oscillation) and a volatility (time dependent variance). To give a good prediction, we need to capture the seasonal components in our model as well.
We decomposed the detrended time series into the second order trend component, the seasonality and the remainder assuming an additive model. The presence of a weekly seasonality was not surprising for us from looking at the original time series but the monthly seasonality was unexpected. The recent increase in the second order trend component may suggest the onset of the fourth wave and the structure in the remainder implies a potential problem of describing the time series using the additive model: e.g. not being able to capture the time dependent magnitude of the seasonal fluctuations or the volatility.
We performed generalized linear regression on the detrended time series with ARIMA errors and Fourier terms of base periodicities 7 (week) and 30.4 (month) as additional external regressors. The model selection (AR order, MA order, number of Fourier terms per base frequency, amplitudes of the harmonic terms, etc.) was carried out using the Akaike information criteria. We then used the best fit model to make a forecast on the detrended time series of the last two weeks (top right), superimposed those values on to the overall trend obtained from the stretched logistics curves, and finally arrived at our near-past forecast (red wiggly curve in the bottom plot). Visit our GitHub repository for the details of the implementation.
Now that we have a prediction of the daily number of cases of the recent days including today, we could solve for the Rt directly using the analytical expression given by the Bayesian update rule. Alternatively, we could reformulate the time series setting as a supervised learning problem and use machine learning to reverse engineer that update rule and give the prediction of the missing Rt.
We decided to take the latter approach to add another layer of fun (validation against the analytical solution shortly described).
Above is a snapshot of the dataset we constructed using the time series data: i.e. time, Rt and the number of daily cases. We engineered lag features, which are Rt and daily cases at prior time steps, and rolling mean features, which are an aggregation of Rt and daily cases over a fixed window of prior time steps. These features provide information about the serial correlations between the values to the dataset. Rt is our response variable and all the rest except for the time is our predictor. We performed Poisson regression using the gradient boosting algorithm. Visit our GitHub repository for the details of the implementation.
We validated the gradient boosting model using the walk-forward method. The top plot in the above figure shows the root mean square error (RMSE) of the predicted Rt of the following 14 days from day x, where x ranges from April 1, 2020 to March 1, 2021. At day x, the model is trained using the data only available up until that day. We can see that at the beginning of the epidemic, the model is struggling to make a good prediction with RMSE > 0.2, but as the time goes by and the more data become available, it starts learning the function and the metric stabilises. It did suffer again at the onset of the second growth but during the third growth period and thereafter the predictions worked well with RMSE < 0.005.