Policy Research Working Paper 9419

Modeling and Predicting the Spread of Covid-19
Comparative Results for the United States, the Philippines, and South Africa

Susmita Dasgupta
David Wheeler

Development Economics
Development Research Group
&
Urban, Disaster Risk Management, Resilience and Land Global Practice
October 2020

Abstract

A model of Covid-19 transmission among locations within a country has been developed that is (1) implementable anywhere spatially-disaggregated Covid-19 infection data are available; (2) scalable for locations of different sizes, from individual regions to countries of continental scale; (3) reliant solely on data that are free and open to public access; (4) grounded in a rigorous, proven methodology; and (5) capable of forecasting future hotspots with enough accuracy to provide useful alerts. Applications to the United States, the Philippines, and South Africa’s Western Cape province demonstrate the model’s usefulness. The model variables include indicators of interactions among infected residents, locally and at a greater distance, with infection dynamics captured by a Gompertz growth model. The model results for all three countries suggest that local infection growth is affected by the scale of infections in relatively distant places. Forecasts of hotspots 14 and 28 days in advance, using only information available on the first day of the forecast, indicate an imperfect but nonetheless informative identification of actual hotspots.

This paper is a product of the Development Research Group, Development Economics and the Urban, Disaster Risk Management, Resilience and Land Global Practice. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at sdasgupta@worldbank.org.
The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Susmita Dasgupta*
Lead Environmental Economist, World Bank

David Wheeler
Consultant, World Bank

* Authors’ names are in alphabetical order.

Keywords: Epidemic spread; Epidemic prediction; COVID-19; Gravity modeling; United States; Philippines; South Africa

JEL Classification: I10

Acknowledgement: Authors are grateful to Somik Lall for providing valuable guidance and funding for this research. Special thanks go to Michael Toman for his advice and support.

1. Introduction

The sudden emergence and rapid spread of Covid-19 have potentially disastrous implications for many developed and developing countries. The global research community has responded with a surge of research on the origins of the disease and its transmission potential under varying conditions.
In the first round, many studies focused on the initial outbreaks in China, the European Union and the United States. Recently, however, attention has shifted to the billions of people whose countries lack the resources available to the first-round nations. In developing Africa, Asia and Latin America, national and regional governments confront an urgent need to understand how the pandemic is spreading and where the next hotspots may emerge. To act effectively, they require the assistance of Covid-19 transmission models that are both feasible to implement and capable of forecasting with enough accuracy and timeliness to be useful. This paper attempts to contribute by confronting the modeling challenge on the terrain occupied by developing-country policy makers. To do so, we accept constraints that are not normally binding for empirical researchers. For our purposes, an optimal approach to Covid-19 modeling must be (1) implementable anywhere spatially-disaggregated Covid-19 infection data are available; (2) scalable for entities of arbitrary size, from individual regions to countries of continental scale; (3) reliant solely on data that are free and open to public access; (4) grounded in a rigorous, proven methodology; and (5) capable of forecasting future hotspots. In this paper, we develop and estimate a Covid-19 transmission model that meets conditions (1) - (4) for three pilot cases that differ by scale and region: the United States, the Philippines, and South Africa’s Western Cape province. We test the model’s forecasting accuracy and find that it is sufficiently robust to warrant trial use for early alerts. We conclude with a question for interested colleagues: Within the constraints that we have accepted, and given the accuracy benchmarks established by this paper, is there a modeling approach that demonstrably performs better? The remainder of the paper is organized as follows. 
Section 2 reviews prior research and motivates our selection of a modeling approach. In Section 3, we develop the theoretical foundations for our econometric model of Covid-19 transmission. Section 4 introduces the data for the model, which can be replicated from free public sources for any developing country that has spatially-disaggregated data on Covid-19 infections. We report our model estimates for the United States, the Philippines and Western Cape province in Section 5. Section 6 tests the accuracy of out-of-sample predictions for 14 and 28 days that rely solely on information known at the start of the prediction period. We provide some thoughts on possible extensions of the work in Section 7, while Section 8 summarizes and concludes the paper.

2. Prior Research

Covid-19 has prompted a large body of empirical research, and many journals have responded to the emergency by posting article submissions prior to peer review. Our approach to modeling the pandemic draws on recently posted and published research, as well as a substantial prior literature on modeling epidemic spread. We believe that our research strategy is appropriate, given the self-imposed conditions that we describe in the Introduction. In this section we provide the rationale for our approach in a brief review of the options for modeling pandemic spread, as well as recent examples in the literature.

SEIR and Agent-Based Models

These models can provide detailed forecasts for locales, but they are highly nonlinear and include so many parameters that the available data are generally insufficient for joint estimation by convergent methods. This is particularly true for locales in the first stages of the pandemic, when initial parameter values are imported from other places and initial predictions are adjusted using educated judgment.
SEIR models aggregate populations by functional class (thus the abbreviation: “Susceptible”, “Exposed”, “Infected”, “Recovered”) and define transition rules for movement between classes using assumptions about disease characteristics, social mixing and public health policies. SEIR models have been prominent in first-stage research on the pandemic outbreak in China (Chen et al. 2020; Hou et al. 2020; Prem et al. 2020) and Italy (Gatto et al. 2020; Chinazzi et al. 2020; López and Rodó 2020). In contrast, agent-based models simulate the outcomes of interactions among individuals, using detailed assumptions about their movements, mixing patterns, and public health interventions (Currie et al. 2020; Chang et al. 2020; Orazio et al. 2020).

SEIR and agent-based models are extremely valuable for modeling the impacts of specific behaviors and policy interventions in a locality. However, we are interested in a modeling approach that scales easily across a broad geographic range. SEIR and agent-based models can accommodate this in principle, and some of the previously-cited applications incorporate multiple locales. However, broader application requires more parameter values that may be untested locally and difficult to calibrate as an ensemble. This is particularly true for developing countries with sparse data availability.

Statistical Models

We believe that statistical models are better-suited for the problems that this paper addresses. The simplest approaches extrapolate the future path of infections in one locale, using variants of the generalized logistic function fitted to local historical data or data from areas deemed similar by some criteria. Roosa et al. (2020) provide a recent example of this approach for Guangdong and Zhejiang, China. It has the twin advantages of simplicity and immediate application for any locale that publishes infection data, but it does not explicitly incorporate new infections transmitted from other areas.
This introduces a serious aggregation error if the model is scaled beyond a single area without accounting for spatial interaction.

In more scalable approaches, localities are parts of spatial networks in which human interactions spread infections within and across areas. Models in this domain have advanced through three stages of technical progress. In the first stage, network infection transmission was modeled using variants of the gravity model first applied by Zipf (1946) to population movement and Isard (1954) to interregional trade analysis. In the gravity model of epidemic spread, infections are transmitted between two locales by travelers whose number is directly proportional to the product of their populations and inversely proportional to the distance between them.

This modeling approach held sway for decades,[1] and some recent studies of Covid-19 spread in China have incorporated it (e.g. Kang et al. 2020; Li and Ma 2020). However, two significant weaknesses have been clear from the outset. First, geographic distance is often a poor proxy for travel cost, principally related to travel time, because of great variation in topography and transport infrastructure quality. The difference is most pronounced in the developing countries that are the focus of the present exercise. Fortunately, this weakness has been overcome by global-scale collaborative work in the OpenStreetMap (OSM 2020) and Open Source Routing Machine (OSRM 2020) projects, which enable realistic estimation of point-to-point road distances and travel times for an arbitrary number of locales at no cost. Recent studies of Covid-19 spread in the European Union have employed road distances (e.g., Felbermayr, Chowdhry and Hinz 2020). This approximation for travel time is unlikely to introduce significant measurement errors in the European Union, given the density of its road network. For global applications, however, we believe that travel time should be the new measurement standard.
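As a minimal illustration of the gravity relationship described in this section, the sketch below computes the number of travelers between two locales in the simplest textbook form: proportional to the product of the two populations and inversely proportional to the distance between them. The function name, the constant k, and the population and distance figures are hypothetical and are not taken from any of the cited studies.

```python
def gravity_flow(pop_i, pop_j, distance, k=1.0):
    """Simplest gravity form: flow = k * P_i * P_j / distance.

    k is a hypothetical proportionality constant; any consistent
    distance unit (km, or minutes of travel time) can be used.
    """
    if distance <= 0:
        raise ValueError("distance must be positive")
    return k * pop_i * pop_j / distance


# Illustrative (made-up) populations; the same pair of locales placed
# 20 and 200 distance units apart.
near = gravity_flow(100_000, 50_000, distance=20.0)
far = gravity_flow(100_000, 50_000, distance=200.0)
ratio = near / far
print(f"near/far flow ratio: {ratio:.1f}")
```

Replacing geographic distance with travel time changes only the denominator, which is why the substitution can matter most where road quality and topography make the two measures diverge.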
The second weakness of epidemic gravity models is structural: They capture the ultimate results of human travel and interaction, but with composite parameters that incorporate five unobserved elements: travel by individuals from locale A to locale B; their infection rate; the frequency of their interaction with people in locale B; the infection rate in locale B; and the likelihood that each interaction of an infected person with an uninfected counterpart will transmit an infection.

In principle, technical progress in communications and infection testing has made four of the five elements susceptible to direct measurement. Where the requisite cell phone data are available, they can enable tracking of individuals between locales and (with sufficiently-precise coordinates and timestamps) identification of co-locations that are close enough for transmission to occur. Mass testing in each locale can produce relatively precise estimates of infection rates. If cell phone data and testing are fully mobilized, the five unknown elements of the composite gravity parameter are reduced to one for statistical estimation: the likelihood of infection per interaction. Even if testing data are not available, the statistical problem is reduced to finding a composite measure for the likelihood of infection. The need for a distance measure is eliminated, since travel and interactions are observed directly.

By implication, gravity-type studies of Covid-19 transmission in networks of places can be supplanted by models that use aggregative cell phone data when the following conditions hold: (1) Travelers carry cell phones with high probability; (2) access to the full set of time-and-location-stamped cell phone data for the entire network is free and open; and (3) local technical resources can accommodate the programming and computational demands imposed by very large-scale analysis of cell phone data.
No setting currently meets these conditions, to our knowledge, but more limited Covid-19 studies are appearing within geographic domains bounded by the researchers’ access to cell phone data (e.g. Pullano et al. (2020) for transmission within France; Jia et al. (2020) for transmission from Wuhan, China).

Having reviewed the available options, we return to our motivating conditions: An optimal modeling approach must be implementable immediately; in any network of places where Covid-19 infection data are available; scalable for entities of arbitrary size; solely reliant on free, publicly-available data; rigorously grounded in proven methodologies; and, far from least, capable of forecasting future Covid-19 hotspots with sufficient timeliness and accuracy to be useful for contingency planning. Satisfaction of the latter condition can be established only after a model is implemented and tested. Otherwise, the choice seems clear: The best feasible alternative at present is a gravity-modeling approach that uses freely-computable travel times instead of geographic distances. In the following sections, we develop, estimate and test the forecasting accuracy of a model that can be applied right now in any setting where spatially-disaggregated infection data are available. All other data requirements can be met from open and freely-available public sources.

[1] For a detailed review of applications in epidemiology, see Truscott and Ferguson (2012).

3. Theoretical Specification

Our specification of Covid-19 spread draws on models that are commonly employed in empirical studies of trade, population migration, epidemiology and technology diffusion. The core model views the spread of Covid-19 between locales as a function of their populations, infection rates, and the effective distance between them as measured by travel time.
Transmission of Infection between Two Locales

(1) $N_{ij} = \rho \, n_j \, I_{ij}$

where
$N_{ij}$ = People infected in locale i by interactions with people in locale j
$\rho$ = Probability that interaction with an infected person will lead to an infection
$n_j$ = Infection rate in locale j ($n_j = N_j / P_j$; $N_j$ = Infections in locale j; $P_j$ = Population of locale j)
$I_{ij}$ = Number of interactions between inhabitants of locales i and j

(2) $I_{ij} = \dfrac{P_i P_j}{t_{ij}^{\theta}}$

where
$P_i$, $P_j$ = Populations of locales i and j
$t_{ij}$ = Travel time between i and j
$\theta$ = Travel time exponent

Collecting terms:

(3) $N_{ij} = \rho \, \dfrac{N_j}{P_j} \, \dfrac{P_i P_j}{t_{ij}^{\theta}} = \rho \, P_i \, \dfrac{N_j}{t_{ij}^{\theta}}$

Across all areas j:

(4) $N_i = \rho \, P_i \sum_{j=1}^{J} \dfrac{N_j}{t_{ij}^{\theta}}$

Dividing by $P_i$, we obtain:

(5) $n_i = \dfrac{N_i}{P_i} = \rho \sum_{j=1}^{J} \dfrac{N_j}{t_{ij}^{\theta}}$

To summarize, the infection rate in locale i that is attributable to interactions with other locales is proportional to the sum of infections in those locales, each inverse-weighted by an exponential transformation of its travel time to locale i.

Infection transmission within a locale is obviously important as well. Our proxy variable is the population density of the locale, but we should acknowledge a significant caveat. We believe that this relatively crude measure of own-transmission potential is useful for cases involving broad variation across locales, from sparse to heavy population density. All three of our cases are in this category, as are most countries and regions of significant size. However, we are much less confident that this variable is an appropriate measure for heavily-populated sub-locales whose population densities are all relatively high by national or regional standards. Three recent treatments of this issue (Lall and Wahba 2020; Borjas 2020; Hamidi, Sabouri and Ewing 2020) all conclude that other variables account for inter-locale variation within high-density urban areas.

Infection Dynamics

Numerous variants of the generalized logistic function have been used to model dynamic processes related to technology diffusion (e.g., Dasgupta, Lall and Wheeler 2005), tumor growth (Vaghi et al. 2020), animal population growth (Nahashon et al. 2006), and epidemic spread (Liu et al. 2015).
We employ the Gompertz variant, which features an asymmetric approach to its upper and lower limits.[2] We have chosen the Gompertz model for two reasons. First, as illustrated by Figure 1, it performs well in tracking infection dynamics. Second, it is easily transformed into a logarithmic approximation that lends itself to econometric estimation.

[2] For comparisons with other epidemic spread models, see Burger et al. (2019).

Figure 1: Gompertz curve application for South Korea
Source: Eschenbach (2020)

Formally, we model infection rate change in an area with the following Gompertz specification:

(6) $\dot{n} = \alpha \, [\log n^{*} - \log n]$

where
$\dot{n}$ = Percent growth of the infection rate
$n^{*}$ = Upper limit for the infection rate
$n$ = Current infection rate

Chow (1967, 1983) provides an econometrically-tractable logarithmic approximation for (6):

(7) $\log n_t - \log n_{t-k} = \alpha \, [\log n^{*}_t - \log n_{t-k}] = \alpha \log n^{*}_t - \alpha \log n_{t-k}$

where n* is a function of covariates X:

(8) $\log n^{*}_t = \beta_0 + \sum_{m} \beta_m \log X_m$

Substituting and adding a random error term yields an estimating equation:

(9) $\log n_t - \log n_{t-k} = \alpha \, [\log n^{*}_t - \log n_{t-k}] = \alpha\beta_0 + \alpha \sum_{m} \beta_m \log X_m - \alpha \log n_{t-k} + \varepsilon_t$

In our case, we specify (9) as follows:

(10) $\log n_{it} - \log n_{it-k} = \alpha\beta_0 + \alpha\beta_1 \log W_{it} + \alpha\beta_2 \log D_i - \alpha \log n_{it-k} + \varepsilon_{it}$

where
$n_{it} = \dfrac{N_{it}}{P_i}$
$W_{it} = \sum_{j=1}^{J} \dfrac{N_{jt}}{t_{ij}^{\theta}}$
$D_i = \dfrac{P_i}{A_i}$
$N_{it}$, $N_{jt}$ = Total infections in locales i and j at time t
$P_i$ = Population of locale i
$A_i$ = Area of locale i
$t_{ij}$ = Travel time from locale i to locale j

4. Data

Selection of our three country cases was guided by the public availability of Covid-19 infection data and the desire for geographic diversity. In the United States, daily county-level data on Covid-19 cases are available from multiple sources, including the New York Times national Covid-19 reporting project (NYT 2020) and a website operated by Johns Hopkins University. The data draw on the same reporting sources, and we chose the New York Times database because it provides better information about the early period of the pandemic.
For the Philippines, we construct a daily database for municipalities from a public file of individual cases by date and location maintained by the Covid-19 Tracker website of the Philippines Department of Health (PDH, 2020). For South Africa’s Western Cape province, we construct a daily sub-district database from online daily reports by the Premier (WCP 2020).

We compute US county population densities from data files maintained by the US Census Bureau for county populations (USCB 2020a) and areas (USCB 2020b). For the Philippines and Western Cape province, we compute mean population densities for municipalities and sub-districts, respectively, from raster files at 100 m resolution maintained by the WorldPop project of the University of Southampton (WorldPop 2020).

We compute travel-time-weighted infections (Wit in model (10)) in a multi-step process. We base the computation on the centroids of the relevant locales (counties in the United States, municipalities in the Philippines, sub-districts in Western Cape province).[3] For the centroid of each locale, we select all neighboring locales whose centroids are within 200 km. We compute all centroid-to-centroid travel times using the Open Source Routing Machine (OSRM 2020), which uses road information from OpenStreetMap (OSM 2020) to compute the most time-efficient route between two points. Figures 2(a) and 2(b) display representative gradient maps for travel times within 200 km for five US counties and two Philippines municipalities.

[3] We use geographic centroids for US counties and Philippines municipalities. Some irregularly-shaped Western Cape sub-districts have geographic centroids lying outside the sub-districts, so we use centroids of the most densely-populated census wards within the sub-districts.

Figure 2: Travel time gradients
2(a) Five US counties
2(b) Two Philippines municipalities

We access the OSRM API server with a package in the R language; packages are also available in Python.
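The aggregation step that follows the travel-time computation can be sketched as below. This is an illustrative outline only, not the authors' code: it assumes that travel times and centroid distances have already been obtained, that the sum for locale i excludes locale i itself, and that travel times are in minutes. The function name, the dictionary layout and all numbers are hypothetical.

```python
def travel_time_weighted_infections(i, infections, travel_time, centroid_km,
                                    theta, radius_km=200.0):
    """Sum of N_j / t_ij**theta over locales j within radius_km of i.

    infections:  dict locale -> total infections N_j at time t
    travel_time: dict (i, j) -> travel time (assumed here in minutes)
    centroid_km: dict (i, j) -> centroid-to-centroid distance in km
    """
    total = 0.0
    for j, n_j in infections.items():
        if j == i:
            continue  # assumption: own infections are not part of W
        if centroid_km[(i, j)] > radius_km:
            continue  # outside the 200 km selection radius
        total += n_j / travel_time[(i, j)] ** theta
    return total


# Made-up example: locale "A" with three neighbors, one beyond 200 km.
infections = {"A": 10, "B": 100, "C": 400, "D": 1000}
travel_time = {("A", "B"): 60.0, ("A", "C"): 120.0, ("A", "D"): 400.0}
centroid_km = {("A", "B"): 50.0, ("A", "C"): 150.0, ("A", "D"): 350.0}

w_theta_1 = travel_time_weighted_infections("A", infections, travel_time,
                                            centroid_km, theta=1.0)
w_theta_07 = travel_time_weighted_infections("A", infections, travel_time,
                                             centroid_km, theta=0.7)
print(w_theta_1, w_theta_07)
```

With travel times above one minute, a smaller exponent gives distant infections more weight, so the second value exceeds the first.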
Computation of all centroid-to-centroid times involves many calculations, but these need to be performed only once for each country or region and they are freely provided by the OSRM project. Computed centroid-to-centroid travel times comprise tij in our regression model (10). For each locale i, Wit in (10) is computed as the sum of inverse-travel-time-weighted Covid-19 infections in each locale whose centroid is within 200 km of locale i. We compute the value of θ, the travel time exponent, via grid search for each country or region, using the standard RMSE criterion for selection. Each iteration of the search involves spatial econometric estimation of (10) with a different θ-value.

5. Econometric Results

Estimation Issues

The Covid-19 infection data in our three country databases have been collected in error-prone processes affected by differential case verification standards, errors in testing, variations in the human and technical resources of participating institutions, and underreporting of cases because so many are asymptomatic. The implications for econometric estimation depend on the nature of the errors. Construction of Wit involves aggregation across error-prone infection measures from many locales. Attenuation bias in parameter estimates will be minimized by the law of large numbers if reporting errors vary randomly. If all locales have the same degree of under-reporting, it introduces a self-canceling linear transformation on both the left- and right-hand sides of (10).

A problem of estimation bias would be introduced if Wit, measured for the entire surrounding region, was collinear with some excluded variable for locale i. This might be true for regional variables like temperature for continental entities like the United States that have very broad temperature ranges at each moment in time. But a temperature-related case would be harder to make for more climatically-homogeneous entities like the Philippines and Western Cape province.
Similar arguments could be posed for other variables, but they would take us further afield than seems warranted for this exercise. We will return to the question of excluded variables in a subsequent section.

While the potential impact of measurement error is debatable, the case for spatial autocorrelation seems firmly grounded. Locale boundaries are historically arbitrary, and there is no reason to suppose that variables measured at the county, municipality or sub-district level are not subject to spatial “spillovers”. For this reason, we believe that adjustment for spatial autocorrelation is important for this exercise. Accordingly, we have used an appropriate spatial econometric estimator for all of the regression work.

Results

Table 1 reports our estimation results for the United States, the Philippines and South Africa’s Western Cape province. For each country, we report 14- and 28-day results in separate columns. Estimates are for the periods of first rapid onset and spread of Covid-19: late March to mid-April in the United States and the Philippines; late April to mid-May in Western Cape province. All estimates incorporate adjustments for spatial autocorrelation, and the estimates for θ, the travel time exponent, have been obtained by grid search.[4] Our estimation process has followed Tukey’s precept, with all prior experimentation limited to a 1/3 random sample and estimation with the full sample delayed until the end (Tukey, 1977).

Model parameter estimates are appropriately signed and highly significant in all cases. The magnitudes for 28-day changes are uniformly greater in absolute value than the magnitudes for 14-day changes, as would be expected. Estimates for the Gompertz parameter α are greatest for the United States and least for the Philippines. The R2 values for the United States and the Philippines are quite large for percent change equations estimated with so many degrees of freedom.
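To make the role of the Gompertz parameter α concrete, the sketch below applies the log-linear approximation, in which the k-day change in the log infection rate equals α times the gap between the log upper limit and the lagged log rate. It uses the 14-day α estimates reported in Table 1 for the United States (0.567) and the Philippines (0.164). The starting rate and upper limit are hypothetical, and holding the upper limit fixed is a simplifying assumption, since the model treats it as a function of covariates.

```python
import math


def gompertz_step(log_n_prev, log_n_star, alpha):
    """One k-day step of the log-linear Gompertz approximation:
    log n_t = log n_{t-k} + alpha * (log n*_t - log n_{t-k})."""
    return log_n_prev + alpha * (log_n_star - log_n_prev)


# 14-day alpha estimates reported in Table 1.
ALPHA_US = 0.567
ALPHA_PH = 0.164

# Hypothetical starting and limiting infection rates (illustrative only).
log_start = math.log(1e-4)
log_limit = math.log(1e-2)

us_next = gompertz_step(log_start, log_limit, ALPHA_US)
ph_next = gompertz_step(log_start, log_limit, ALPHA_PH)

gap = log_limit - log_start
us_share_closed = (us_next - log_start) / gap
ph_share_closed = (ph_next - log_start) / gap
print(f"US: {us_share_closed:.3f} of the log gap closed in one 14-day step")
print(f"Philippines: {ph_share_closed:.3f} of the log gap closed in one 14-day step")
```

Read this way, α is the share of the remaining log gap that is closed over one estimation interval, other things equal, so a larger α implies faster convergence toward the locale's upper limit.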
The θ-values for all three countries are less than one, indicating that inverse weighting is less than proportional to travel time. We illustrate the implications in Figure 3, which plots infection influence (normalized to a maximum value of 100) against travel time in minutes. For contrast, we plot travel time gradients for three values of θ: 0.7 (the US value), 1.0 and 2.0. The implications are striking: While the influence of infections is quite small at a travel time of 80 minutes for inverse weighting by the square of travel time (θ=2.0), it remains significant at a travel time of 3 hours for θ=0.7 (our US estimate). Since both the Philippines and Western Cape province have θ-values less than 1.0, our results imply that changes in a locale’s infection rate can be significantly influenced by total infections in locales that are quite distant.

[4] The standard RMSE criterion has been used to determine best-fit values for θ.

Table 1: Regression results by time scale (a)
Dependent Variable: log[Infection Rate(it)] - log[Infection Rate(it-k)]

| Country | US | US | Philippines | Philippines | Western Cape | Western Cape |
|---|---|---|---|---|---|---|
| t | Apr 14 | Apr 14 | Apr 14 | Apr 14 | May 18 | May 18 |
| k | 14 | 28 | 14 | 28 | 14 | 28 |
| θ | 0.7 | 0.7 | 0.9 | 0.9 | 0.6 | 0.6 |
| log [Infection Rate(it-k)] (-α) | -0.567 | -0.946 | -0.164 | -0.441 | -0.392 | -0.615 |
| | (41.30)** | (33.62)** | (6.66)** | (5.06)** | (4.19)** | (5.64)** |
| log [W(it)] | 0.266 | 0.458 | 0.150 | 0.676 | 0.156 | 0.452 |
| | (5.07)** | (4.56)** | (3.14)** | (3.47)** | (2.33)* | (3.25)** |
| log [D(i)] | 0.328 | 0.771 | 0.134 | 0.265 | 0.138 | 0.212 |
| | (9.33)** | (19.93)** | (6.46)** | (9.89)** | (1.99)* | (2.17)* |
| Constant | 0.168 | -0.789 | -0.275 | 0.093 | 2.266 | 2.809 |
| | (0.82) | (3.63)** | (2.57)* | (1.05) | (4.48)** | (3.91)** |
| Obs. | 3,096 | 3,098 | 1,531 | 1,531 | 32 | 32 |
| R2 | 0.35 | 0.33 | 0.18 | 0.36 | 0.37 | 0.58 |

(a) Estimates incorporate spatial autocorrelation
Absolute value of t statistics in parentheses
* significant at 5%; ** significant at 1%

Figure 3: Effect of travel time on infection influence

In summary, our regression results indicate that an appropriately-specified model of spatial interaction within and between locales provides a very good fit to data on Covid-19 spread in places as different as the United States, the Philippines and South Africa. Model parameters differ by country, reflecting all the unobserved factors that govern the infection spread mechanism in the three countries.

6. Prediction Results

While our model estimates have high significance for all three countries, we believe that their ultimate utility depends upon their ability to predict future pandemic hotspots with enough lead time and accuracy to provide useful alerts. Accordingly, we use the model’s estimated parameters to perform 14- and 28-day predictions based solely on prior information. That is, using our 14- and 28-day regression results through day t and observations on the dependent variables for day t, we predict changes in Covid-19 infection rates by days t+14 and t+28.

We assess our results in two ways. First, we regress actual changes on predicted changes for each country. Table 2 presents our results, which have high significance in all cases.[5] The results are equally strong for 14-day and 28-day predictions.

Does statistical significance translate to robust identification of future hotspots? To address this question, we select the areas with the highest predicted infection growth and assess their match with areas where infections actually grew most quickly. For comparison, we also assess the match for “business as usual” predictions, which assume that areas with the fastest infection growth in the previous period will maintain their lead during the prediction period.
We limit this approach to 14-day predictions based on infection growth in the 14 previous days, because our panel is too limited to permit the 28-day variant in all cases. We also benchmark the exercise with a match for randomly-selected areas.

[5] We use robust regression for Western Cape Province to guard against outlier effects for the small sample.

Table 2: Prediction regression results by time scale
Dependent Variable: Actual ΔLC (= Log[Caserate(t)] - Log[Caserate(t-k)])

| Country | US | US | Philippines | Philippines | Western Cape (a) | Western Cape (a) |
|---|---|---|---|---|---|---|
| Forecast Period Days | 14 | 28 | 14 | 28 | 14 | 28 |
| Predicted ΔLC | 0.456 | 0.402 | 0.249 | 0.075 | 0.724 | 0.620 |
| | (23.52)** | (30.43)** | (9.69)** | (9.11)** | (8.04)** | (3.88)** |
| Constant | 0.157 | 0.271 | 0.005 | 0.059 | 0.294 | 0.850 |
| | (3.88)** | (5.27)** | (0.36) | (2.67)** | (3.00)** | (2.62)* |
| Observations | 3,084 | 3,092 | 1,531 | 1,531 | 31 | 32 |
| R2 | 0.15 | 0.23 | 0.06 | 0.05 | 0.69 | 0.33 |

(a) Robust regression
Absolute value of t statistics in parentheses
* significant at 5%; ** significant at 1%

Table 3 presents our results for the United States, the Philippines and Western Cape province, with test groups that vary in size because the sample sizes are so different (3,084 counties in the United States; 1,531 municipalities in the Philippines; 32 sub-districts in Western Cape province). We rank areas by predicted future infection growth, select the highest-ranking areas (80 for the United States; 40 for the Philippines; 10 for Western Cape province); and tabulate the actual growth rankings for those areas during the 14-day and 28-day prediction periods. For each country, the first three rows of Table 3 display the number of areas included; the final date of the prediction period; and the ranking categories used for comparison. In the United States, we focus on counties ranked in the top 50, 100 and 150 for actual infection growth during the prediction period. These counties rank in the top 1.6%, 3.2% and 4.8% of all US counties, respectively.
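The tabulation just described can be sketched in a few lines. This is an illustrative outline rather than the authors' procedure in detail: the function name, the tie-handling (ties are broken by input order) and the six-area example are hypothetical.

```python
def hotspot_hits(predicted, actual, n_selected, rank_cutoffs):
    """Count how many of the top-n predicted areas fall within each
    actual-growth rank cutoff.

    predicted, actual: dict area -> infection growth over the period
    Returns a dict mapping each cutoff to the number of matches.
    """
    top_predicted = sorted(predicted, key=predicted.get, reverse=True)[:n_selected]
    by_actual = sorted(actual, key=actual.get, reverse=True)
    actual_rank = {area: rank for rank, area in enumerate(by_actual, start=1)}
    return {
        cutoff: sum(1 for area in top_predicted if actual_rank[area] <= cutoff)
        for cutoff in rank_cutoffs
    }


# Made-up example with six areas: select the top 3 by predicted growth
# and check them against the actual top 2 and top 3.
predicted = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.7, "e": 0.2, "f": 0.3}
actual = {"a": 0.5, "b": 0.1, "c": 0.9, "d": 0.8, "e": 0.2, "f": 0.05}

hits = hotspot_hits(predicted, actual, n_selected=3, rank_cutoffs=[2, 3])
print(hits)
```

The same function can be applied to the "business as usual" benchmark by passing prior-period growth as the prediction, and to the random benchmark by passing randomly drawn scores.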
Rows 6, 7 and 8 tabulate our results for predictions based on model forecasts, past infection growth and random selection. For 80 counties identified by the model as the fastest-growing future hotspots in the 14-day prediction, row 6 shows that 14 ranked in the top 50 for actual infection growth, 28 in the top 100, and 42 in the top 150. As we would expect, forecast accuracy declines somewhat for the US 28-day predictions: 11 model-identified hotspots rank in the actual top 50; 21 in the top 100; and 33 in the top 150. In contrast, predictions based on infection growth in the previous 14 days identify 0 counties in the actual top 50, 0 in the top 100, and 2 in the top 150. Random selection actually does a little better, identifying 1 county in the top 50, 3 in the top 100, and 4 in the top 150.

To summarize the US case, 52.5% of model-identified hotspots are top-5% actual hotspots for 14-day predictions; 41.3% are actual top-5% hotspots for 28-day predictions. Can we conclude that model-based predictions at these accuracy levels provide useful alerts? For perspective, consider the problem confronting local authorities without such information. Our comparative results suggest that extrapolation from recent experience will not help. Viewing a Covid-19 outbreak as a natural disaster -- which is undeniably the case -- lends additional insight. How would local authorities respond to a credible forecast that their community would be struck by a natural disaster in two weeks with 52.5% probability, or in four weeks with 41.3% probability? Viewed this way, it seems very likely that such probabilities would prompt a significant response from the authorities.

Before responding, they might also want information about the consequences of ignoring an alert. These could be severe if the remaining top-80 counties have typically been in the top 10% for actual changes, but negligible (at least in the short run) if the remaining counties have little or no actual change.
For the United States, the latter case holds: While 52.5% of model-identified hotspots for 14-day predictions are in the top 5% of US counties for actual changes, the other 47.5% have little or no change during the prediction period. This extreme bifurcation suggests the exclusion of at least one powerful predictor from the model, but we have yet to identify it. Extensive experimentation has revealed no significant change in the bifurcation when the US model is augmented with locally-significant variables (e.g., percent of a county’s population that is African-American). This remains an important problem for future research. Prior knowledge of the bifurcation might lead local authorities to “hedge” their response to alerts, although they would still be confronted by a high likelihood of near-term disaster. Their actual reaction might well be context-specific.

Table 3: Predicted changes in Covid-19 infection rates [Out-of-sample predictions]

United States
Total Areas (Counties)           3,084
Prediction Date                  April 28 (14 Days)       May 12 (28 Days)         Predicted Top Areas
Actual Change Rank (in 3,084)    ≤ 50   ≤ 100   ≤ 150     ≤ 50   ≤ 100   ≤ 150
Percent of All Areas             1.6    3.2     4.8       1.6    3.2     4.8
Prediction Method                Number Predicted
Model                            14     28      42        11     21      33        80
Prior 14-Day Change              0      0       2         -      -       -         80
Random Selection                 1      3       4         1      3       4         80

Philippines
Total Areas (Municipalities)     1,531
Prediction Date                  April 28 (14 Days)       May 12 (28 Days)         Predicted Top Areas
Actual Change Rank (in 1,531)    ≤ 30   ≤ 60    ≤ 90      ≤ 30   ≤ 60    ≤ 90
Percent of All Areas             2.0    3.9     5.9       2.0    3.9     5.9
Prediction Method                Number Predicted
Model                            8      10      17        5      6       9         40
Prior 14-Day Change              0      5       10        -      -       -         40
Random Selection                 1      2       3         1      2       3         40

Western Cape Province, South Africa
Total Areas (Sub-Districts)      32
Prediction Date                  May 28 (14 Days)         June 12 (28 Days)        Predicted Top Areas
Actual Change Rank (in 32)       ≤ 10                     ≤ 10
Percent of All Areas             31.3                     31.3
Prediction Method                Number Predicted
Model                            7                        5                        10
Prior 14-Day Change              3                        -                        10
Random Selection                 3                        3                        10

For the Philippines, we use model predictions for 14 and 28 days to identify top-40
municipalities. Of the 14-day top 40, 8 are in the top 2% for actual infection growth, 10 are in the top 4%, and 17 are in the top 6%. The corresponding tallies for the 28-day predictions are 5, 6 and 9, respectively. Extrapolation from past infection growth yields 14-day predictions that are notably better than the US extrapolative predictions, but still significantly less accurate than the model-based predictions. To summarize, the model estimates for the Philippines identify hotspots in the top-6% of municipalities with 42.5% accuracy for 14 days and 22.5% accuracy for 28 days. Per our previous discussion for the United States, the model’s 14-day accuracy seems easily good enough to warrant alerts. Its 28-day accuracy is significantly lower, but still high enough to warrant consideration. As always, the relevant question remains, “Compared to what?”

For South Africa’s Western Cape province, we use model estimates to identify the future top-10 sub-districts in 14 and 28 days. We find that these estimates identify actual top-10 sub-districts with 70% accuracy for 14 days and 50% accuracy for 28 days. In contrast, extrapolation from past infection growth does no better than random selection. By our previous arguments, these results seem easily good enough to warrant alerts.

To summarize, our estimated models for the United States, the Philippines and South Africa’s Western Cape province produce 14-day and 28-day out-of-sample predictions that seem good enough to warrant their use for community alerts. We believe that all three models could be improved by adding relevant, locally-available variables that would help explain the observed variation in Covid-19 infection rates.
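One way to make the "Compared to what?" question concrete is to compare the model's hit counts with the number of hits expected from drawing the same number of areas at random. When n areas are drawn without replacement from N, of which K are actual hotspots, the expected number of hits is n·K/N. The sketch below applies this to the 14-day model counts in Table 3; the function names and the "lift" ratio are illustrative assumptions introduced here, not a measure used by the authors, and the expected values are not necessarily how the Random Selection rows of Table 3 were produced.

```python
def expected_random_hits(n_select, rank_cutoff, n_areas):
    """Expected hits when n_select areas are drawn at random (hypergeometric mean)."""
    return n_select * rank_cutoff / n_areas

def lift(model_hits, n_select, rank_cutoff, n_areas):
    """Ratio of model hits to the expected random hits (hypothetical summary measure)."""
    return model_hits / expected_random_hits(n_select, rank_cutoff, n_areas)

# 14-day model results at the widest rank cutoff, taken from Table 3:
# (label, model hits, areas selected, rank cutoff, total areas)
cases_14_day = [
    ("United States", 42, 80, 150, 3084),
    ("Philippines", 17, 40, 90, 1531),
    ("Western Cape", 7, 10, 10, 32),
]

for label, hits, n_select, cutoff, n_areas in cases_14_day:
    baseline = expected_random_hits(n_select, cutoff, n_areas)
    print(f"{label}: expected random hits {baseline:.2f}, "
          f"model hits {hits}, lift {lift(hits, n_select, cutoff, n_areas):.1f}")
```

Under these assumptions the random baseline is roughly 4 counties for the United States and about 3 sub-districts for Western Cape, so the smaller lift for Western Cape partly reflects the fact that its top-10 category already covers 31.3% of all areas.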
At the same time, we have found that the pattern illustrated by our 14-day US analysis holds for all three countries: Actual hotspots are identified with enough accuracy to warrant alerts, but more than a few predicted hotspots have little or no actual change, even if the core model is augmented with locally-significant variables. This persistent bifurcation is an important issue for future research.

For the present, we believe that our results suffice to convey the following message: Any country with a panel of spatially-disaggregated Covid-19 infection data can draw on free, globally-available databases to estimate a simple, theoretically-grounded infection spread model based on human spatial interaction. Our evidence from three very different country cases suggests that the results can produce useful alerts for future infections as far as one month in advance.

7. Possible Extensions of the Model

We have focused on developing and testing a core model that can be estimated with publicly-available data in any setting. In our model of Covid-19 growth dynamics, a locale’s population density, time-varying W-value, Gompertz α-value and travel time θ-value determine its growth rate toward a time-varying asymptote. The asymptote for a particular locale undoubtedly has other determinants that are country-specific, and there may well be determinants that belong in the core model because they are universally-available and affect infection transmission in all cases with high significance and the same sign. We have explored both possibilities in a large number of experiments conducted with our “Tukey 1/3” sample set aside for exploration. Among the variables explored in one or more countries are health risk (diabetes and obesity incidence), life expectancy (a proxy for general health conditions), income, ethnicity, employment-related infection risk, demography (percent elderly), temperature and humidity, wind conditions, altitude and air pollution.
Although we find significant variables in each locale, we find no variable that meets all of our criteria for the core model: universally available and relevant, statistically significant, and identically signed. Further exploration may well identify additional core variables, and this is an important domain for future research. For the purposes of this paper, however, we believe that presentation of diverse country-specific results that cannot be replicated elsewhere would distract attention from our central themes.

8. Summary and Conclusions

This paper has been motivated by the urgent need for a model of Covid-19 transmission between locales that can be applied anywhere in the developing world. To meet this challenge, we focus on developing a model that is (1) implementable anywhere spatially-disaggregated Covid-19 infection data are available; (2) scalable for entities of arbitrary size, from individual regions to countries of continental scale; (3) reliant solely on data that are free and open to public access; (4) grounded in a rigorous, proven methodology; and (5) capable of forecasting future hotspots with enough accuracy to provide useful alerts.

After a review of existing approaches and technical resources, we opt for a model that combines a gravity-type specification of spatial interaction; the use of travel time rather than geographic distance as a measure of spatial friction; and a model of growth dynamics based on the Gompertz variant of the generalized logistic model. Using free, publicly-available data for the United States, the Philippines and South Africa’s Western Cape province, we estimate a core model of infection growth in a locale that proxies local interactions with population density (D) and incorporates interactions with other locales in a theoretically-derived variable (W) that is the summation of their infections, weighted inversely by an exponential transformation of travel time.
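To illustrate the verbal description of W, the sketch below computes a gravity-type interaction term for each locale. It assumes the functional form W_i = Σ_{j≠i} C_j / T_ij^θ, where C_j is the infection count in locale j, T_ij is the travel time between locales i and j, and θ is the travel-time exponent. This form is an assumption made here for illustration; the function name `spatial_interaction_w` and the toy inputs are hypothetical, and the paper's exact specification may differ.

```python
import numpy as np

def spatial_interaction_w(cases, travel_time, theta):
    """Assumed form: W_i = sum over j != i of cases[j] / travel_time[i, j] ** theta."""
    cases = np.asarray(cases, dtype=float)
    travel_time = np.asarray(travel_time, dtype=float)
    n = len(cases)

    # Weight every other locale by travel time raised to -theta;
    # a locale's own infections are excluded (zero diagonal).
    weights = np.zeros((n, n))
    off_diagonal = ~np.eye(n, dtype=bool)
    weights[off_diagonal] = travel_time[off_diagonal] ** (-theta)

    return weights @ cases

# Toy example with three locales (illustrative numbers, not data from the paper).
toy_cases = [10, 20, 30]
toy_travel_time = [[0, 1, 4],
                   [1, 0, 9],
                   [4, 9, 0]]
toy_w = spatial_interaction_w(toy_cases, toy_travel_time, theta=0.5)
```

With an exponent below one, as in this toy call, the weights decline slowly with travel time, so infections in distant locales still contribute noticeably to W.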
We find high significance and the appropriate signs for D and W for all three countries, and for changes over 14 and 28 days. After extensive experimentation, we find no other variable that is common to the three countries, publicly available, statistically significant and identically signed. We also find that the exponential transformation of travel time is less than unitary in all three cases, which indicates that infection growth in a locale is affected by the scale of infections in relatively distant locales.

We assess the accuracy of the model by forecasting 14 and 28 days in advance, using only information available on the first day of the forecast. For the United States, we find that 52.5% of model-identified hotspots are top-5% actual hotspots for 14-day predictions; 41.3% are actual top-5% hotspots for 28-day predictions. For the Philippines, we find that 42.5% of model-identified hotspots are actual top-6% hotspots for 14-day predictions and 22.5% for 28-day predictions. In Western Cape province, the model identifies the actual top-10 sub-districts with 70% accuracy for 14-day predictions and 50% accuracy for 28-day predictions. We conclude that the accuracy levels are sufficient to warrant use of this approach for community alerts, with appropriate caveats about expected accuracy. Among these is the notable bifurcation of predictions that we have discussed above.

To conclude, we believe that this exercise has demonstrated the feasibility and accuracy of a modeling approach that relies solely on free, publicly-available data and can be applied in any developing country or region that reports spatially-disaggregated Covid-19 infection data. We do not doubt that improvements are possible, and we would urge colleagues in the policy research community to develop models that have higher forecasting accuracy while adhering to the constraints that have bounded this exercise.

References

Borjas, G. 2020.
Demographic Determinants of Testing Incidence and COVID-19 Infections in New York City Neighborhoods. NBER Working Paper No. 26952. April.

Burger, R., G. Chowell and L. Lara-Díaz. 2019. Comparative analysis of phenomenological growth models applied to epidemic outbreaks. Mathematical Biosciences and Engineering. 16(5): 4250–4273.

Chang, S., N. Harding, C. Zachreson et al. 2020. Modelling transmission and control of the COVID-19 pandemic in Australia. University of Sydney. https://arxiv.org/abs/2003.10218

Chen, Y., P. Luy and C. Chang. 2020. A Time-dependent SIR model for COVID-19 with Undetectable Infected Persons. National Tsing Hua University, Taiwan. http://gibbs1.ee.nthu.edu.tw/A TIME DEPENDENT SIR MODEL FOR COVID 19.PDF

Chinazzi, M., et al. 2020. The effect of travel restrictions on the spread of the 2019 novel coronavirus (COVID-19). Science 368(6489): 395-400.

Chow, G. 1967. Technological Change and the Demand for Computers. The American Economic Review. 57(5): 1117-1130.

Chow, G. 1983. Econometrics (New York: McGraw-Hill).

Currie, C., J. Fowler, K. Kotiadis et al. 2020. How simulation modelling can help reduce the impact of COVID-19. Journal of Simulation, 14(2): 83-97.

Dasgupta, D., S. Lall and D. Wheeler. 2005. Policy Reform, Economic Growth and the Digital Divide. Oxford Development Studies. 33(2): 229-243.

Eschenbach, W. 2020. The Math Of Epidemics. March 13. https://wattsupwiththat.com/2020/03/13/the-math-of-epidemics

Fan, C., S. Lee, Y. Yang et al. 2020. Effects of Population Co-location Reduction on Cross-county Transmission Risk of COVID-19 in the United States. arXiv: Physics and Society. June 1.

Felbermayr, G., S. Chowdhry and J. Hinz. 2020. Après-ski: The Spread of Coronavirus from Ischgl through Germany. CEPR Press - COVID ECONOMICS, 22:177.

Gatto, M., E. Bertuzzo, L. Mari et al. 2020. Spread and dynamics of the COVID-19 epidemic in Italy: Effects of emergency containment measures. PNAS May 12, 2020 117(19): 10484-10491.

Hamidi, S., S. Sabouri and R. Ewing. 2020. Does Density Aggravate the COVID-19 Pandemic? Early Findings and Lessons for Planners. Journal of the American Planning Association, DOI: 10.1080/01944363.2020.1777891.

Hou, C., J. Chan, Y. Zhou et al. 2020. The effectiveness of quarantine of Wuhan city against the Corona Virus Disease 2019 (COVID‐19): A well‐mixed SEIR model analysis. Journal of Medical Virology 92(7): 841-848.

Isard, W. 1954. Location Theory and Trade Theory: Short-Run Analysis. The Quarterly Journal of Economics, 68(2): 305-320.

Jia, J., X. Lu, Y. Yuan et al. 2020. Population flow drives spatio-temporal distribution of COVID-19 in China. Nature. https://doi.org/10.1038/s41586-020-2284-y (2020).

Kang, D., H. Choi, J. Kim and J. Choi. 2020. Spatial epidemic dynamics of the COVID-19 outbreak in China. International Journal of Infectious Diseases. 94(2020): 96-102.

Lall, S. and S. Wahba. 2020. No Urban Myth: Building Inclusive and Sustainable Cities in the Pandemic Recovery. Washington: World Bank. June 18. https://www.worldbank.org/en/news/immersive-story/2020/06/18/no-urban-myth-building-inclusive-and-sustainable-cities-in-the-pandemic-recovery

Li, B. and L. Ma. 2020. Migration, Transportation Infrastructure, and the Spatial Transmission of COVID-19 in China. National University of Singapore. May.

Liu, W., S. Tang and Y. Xiao. 2015. Model Selection and Evaluation Based on Emerging Infectious Disease Data Sets including A/H1N1 and Ebola. Comput Math Methods Med. 2015:207105.

López, L. and X. Rodó. 2020. A Modified SEIR Model to Predict the COVID-19 Outbreak in Spain and Italy: Simulating Control Scenarios and Multi-Scale Epidemics. http://dx.doi.org/10.2139/ssrn.3576802

Nahashon, S., S. Aggrey, N. Adefope et al. 2006. Characteristics of Pearl Grey Guinea Fowl as predicted by the Richards, Gompertz and logistic models. Poultry Science. 2006;85:359–63.

New York Times (NYT). 2020. Coronavirus in the U.S.: Latest Map and Case Count. https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv

Orazio, M., G. Bernardini and E. Quagliarini. 2020. How to restart? An agent-based simulation model towards the definition of strategies for COVID-19. Università Politecnica delle Marche. https://arxiv.org/ftp/arxiv/papers/2004/2004.12927.pdf

OSM. 2020. OpenStreetMap. https://www.openstreetmap.org/about

OSRM. 2020. Open Source Routing Machine; Modern C++ routing engine for shortest paths in road networks. http://project-osrm.org/

Philippines Department of Health (PDH). 2020. Covid-19 Tracker website. https://www.doh.gov.ph/covid19tracker

Prem, K., Y. Liu, T. Russell et al. 2020. The effect of control strategies to reduce social mixing on outcomes of the COVID-19 epidemic in Wuhan, China: a modelling study. The Lancet Public Health 5(5): e261-e270.

Pullano, G., E. Valdano, N. Scarpa et al. 2020. Population mobility reductions during COVID-19 epidemic in France under lockdown. medRxiv. doi: https://doi.org/10.1101/2020.05.29.20097097

Roosa, K., Y. Lee, R. Luo et al. 2020. Short-term Forecasts of the COVID-19 Epidemic in Guangdong and Zhejiang, China: February 13–23, 2020. J. Clin. Med. 2020, 9(2), 596. https://doi.org/10.3390/jcm9020596 - 22 Feb 2020.

Truscott, J. and N. Ferguson. 2012. Evaluating the Adequacy of Gravity Models as a Description of Human Mobility for Epidemic Modelling. PLoS Comput Biol 8(10): e1002699.

Tukey, J. 1977. Exploratory Data Analysis. Reading, Mass: Addison-Wesley.

US Census Bureau. 2020a. Annual Resident Population Estimates, Estimated Components of Resident Population Change, and Rates of the Components of Resident Population Change for States and Counties: April 1, 2010 to July 1, 2019. co-est2019-alldata.csv, available online at https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/

US Census Bureau. 2020b. Cartographic Boundary File - US Counties. cb_2018_us_county_20m.zip, available online at https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html

Vaghi, C., A. Rodallec, R. Fanciullino et al. 2020. Population modeling of tumor growth curves and the reduced Gompertz model improve prediction of the age of experimental tumors. PLoS Comput Biol. 2020;16(2):e1007178.

Western Cape Province, Republic of South Africa (WCP). 2020. Daily update on the coronavirus by the Premier. https://coronavirus.westerncape.gov.za/news

WorldPop. 2020a. The spatial distribution of population in 2020, Philippines. WorldPop (www.worldpop.org - School of Geography and Environmental Science, University of Southampton; Department of Geography and Geosciences, University of Louisville; Departement de Geographie, Universite de Namur) and Center for International Earth Science Information Network (CIESIN), Columbia University (2018). https://www.worldpop.org/geodata/summary?id=6316

WorldPop. 2020b. The spatial distribution of population in 2020, South Africa. WorldPop (www.worldpop.org - School of Geography and Environmental Science, University of Southampton; Department of Geography and Geosciences, University of Louisville; Departement de Geographie, Universite de Namur) and Center for International Earth Science Information Network (CIESIN), Columbia University (2018). https://www.worldpop.org/geodata/summary?id=6322

Yang Z., Z. Zeng, K. Wang et al. 2020. Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions. J Thorac Dis. 2020;12(3):165-174.

Zhang, C., L. Qian and J. Hu. 2020. Pathways of the COVID-19 Pandemic with Human Mobility across Countries. medRxiv. doi: https://doi.org/10.1101/2020.05.21.20108589

Zipf, G. 1946. The P1 P2/D Hypothesis: On the Intercity Movement of Persons. Am Sociol Rev 11: 677–686.