RESULTS IN RESILIENCE SERIES

PREDICTING URBAN EMPLOYMENT DISTRIBUTIONS
A toolkit for more targeted urban investment and planning decisions

The challenge: employment density maps, key to targeting urban investments, are often outdated, imprecise, or unavailable

Information on the spatial distribution of jobs, both formal and informal, in urban areas is a fundamental requirement for many project appraisals and analyses. Such information allows urban planners and developers to identify economic hubs within a city and take targeted measures to improve their productivity, connectivity, and resilience, for example by investing in infrastructure upgrades or flood protection systems, enhancing commuting options, and adapting urban planning decisions. Such measures support firms and yield city-wide benefits for the lives and livelihoods of workers and their communities.

In practice, business registries, employment censuses, or travel surveys are the most common sources for mapping the density and spatial distribution of jobs within a city. But such data are rarely available, and when they do exist, they tend to be incomplete, unreliable, or outdated. Recent initiatives have successfully leveraged mobile phone-derived data to document “meaningful” locations, including jobs. This is a breakthrough, especially given the increasingly ubiquitous use of mobile phones. And yet, accessing and processing mobile phone data is a difficult, lengthy, and often costly process.

In this note, we demonstrate a machine learning approach for high-resolution urban employment prediction using a robust and scalable approximation methodology that relies on widely available public data, making it a quick, low-cost solution. We show that in most developing countries, this approach can open new avenues for targeted urban investment and planning decisions that are based on systematic empirical evidence.

An analytical solution, in a nutshell

A machine learning approach for high-resolution urban employment prediction in developing countries

Relying on open-source data extracted from OpenStreetMap (OSM) and Google Earth Engine (GEE), we provide a new analytical toolkit to approximate the spatial distribution of jobs in urban areas of developing countries. Using machine learning algorithms, we show that it is possible to predict employment density based on urban form attributes, such as street density and amenities. Using this approach, we generate predicted employment density maps for 14 cities in Latin America and Sub-Saharan Africa, from Dakar in Senegal to Buenos Aires in Argentina (figure 1), validating the robustness of these maps against survey-based observed employment density data. Generally, we find that the approach predicts within-city employment concentrations with high resolution and accuracy, and can therefore be replicated in cities with no observed employment information to inform urban investments and planning decisions.

Figure 1. Observed and predicted employment density in urban Buenos Aires, Argentina (R² = 0.78)
Panels: (a) Observed employment density; (b) Predicted employment density. Legend: employment density (log), normalised.
Note: R², or goodness of fit measure, indicates the share of variation in the observed employment density that can be explained by the algorithm.

Methodology

We exploit existing correlations between urban form properties (such as nighttime light brightness and road intersection density) and job density to predict the spatial distribution of employment in urban areas of developing country cities. Our methodology relies entirely on open-source data leveraged from OSM and GEE and is tested for 14 cities in Latin America and Sub-Saharan Africa, chosen for having available employment data through business registries or travel surveys, which enabled us to validate predictions.

Figure 2. Illustration of selected features at 500x500-meter hexagonal grid cells
Panels: (a) Major roads; (b) Street intersections; (c) Public transport hubs; (d) Amenities; (e) Shops; (f) Night lights; (g) Population; (h) Urban land cover; (i) NDVI. Legend: feature distribution (log).
Note: Panels (a) to (e) are density-adjusted; Normalized Difference Vegetation Index (NDVI) is not logarithm-adjusted; all features are within-city scaled.

First, we use several methodologies to understand the explanatory power of various features of the data derived from OSM and GEE to predict employment distribution in the 14 urban areas. We find that population, nighttime lights, urban land cover, amenities, and road intersections are strongly and positively correlated with job locations, and that terrain roughness, water bodies, and vegetation indices are strongly and negatively correlated with the presence of jobs.

Second, we test several model specifications for employment predictions at grid cell level, including Penalized Linear Regression models and Ensemble Tree methods. We settle on a spatial variation of the Random Forest machine learning algorithm that can capture the clustering of economic activity to predict employment density at grid cell level.

Applications and insights

The approach enables two types of predictive analysis. The first predicts employment distribution in cities for which we have no employment data. The second predicts employment in cities for which we have some employment data and incomplete spatial coverage.

When there is no employment data

The first type of analysis aims to understand how well our methodology can predict employment density at grid cell level for entirely “unseen” cities, where the algorithm has not been trained and we have no employment information. To do this, we train the spatial Random Forest algorithm on data from 13 cities and hold back data from the city we are trying to predict. We repeat this 14 times, each time holding back data from a different city, to predict employment density for all the cities in the study. The R² (that is, the goodness of fit measure) indicates the share of variation in the observed employment density that can be explained by the predictions derived from the algorithm (Table 1).

Table 1. Performance comparison across Random Forest models with spatial effects for out-of-sample cities (R²)

Sub-Saharan Africa
  Abidjan          0.70
  Dakar            0.70
  Dar Es Salaam    0.71
  Douala           0.65
  Harare           0.30
  Kampala          0.54
  Kigali           0.81
  Kinshasa         0.55
  Nairobi          0.77

Latin America
  Belo Horizonte   0.77
  Bogota           0.52
  Buenos Aires     0.78
  Lima             0.42
  Mexico City      0.58

We find that our method can predict employment density for polygons in out-of-sample cities with medium to high accuracy (mean R² is 0.63, and maximum R² is 0.81). However, the results show heterogeneity in predictive performance, with R² values ranging from 0.30 for Harare to 0.81 for Kigali. Comparing the results by geographical region reveals that Latin American and Sub-Saharan African cities are comparably well predicted on average, with mean R² of 0.61 and 0.64 respectively. These results are in line with, but higher than, those of a similar analysis that used satellite imagery to predict consumption expenditure and household assets of spatial clusters across four Sub-Saharan African countries [1], and substantially above those obtained in the most closely related literature estimating employment factors, which reported R² values ranging from 0.24 to 0.33 [2].

Are the city-level R² differences a reason for concern? We can attribute the variation in out-of-sample prediction to a combination of city-specific relationships between employment and OSM and GEE features, and varying degrees of quality for employment and feature data across cities.

Figure 3. Employment density predictions across various cities
Panels: (a) Mumbai, India; (b) Karachi, Pakistan; (c) Port au Prince, Haiti; (d) Guayaquil, Ecuador. Legend: employment density (log), normalised.

When information is patchy

The second type of analysis trains the spatial Random Forest algorithm on 80 percent of all grid cells, pooling grid cells from the 14 cities for which we have employment data. It then applies the trained algorithm to the remaining 20 percent of grid cells in each city. This analysis is useful when some information on employment is available at the city level but is spatially incomplete. The algorithm is therefore trained on existing city-level data. Our algorithm’s predictive performance is measured by the quality of fit between predicted and observed values of employment density in the 20 percent of “untrained” grid cells. We find that our method can predict employment density in held-back within-city cells in our test cities with extremely high accuracy, as measured by an R² higher than 0.95 (figure 4). This shows that when we have spatially “patchy” or incomplete city-level employment data, we can approximate employment density in the areas without data.

Figure 4. Observed and predicted values for test data grid cells in urban Dakar, Senegal, using spatial Random Forest models
Panels: (a) Observed employment density; (b) Predicted employment density. Legend: employment density (log), normalised.
Note: Test cells are outlined in white.

Possible applications of the methodology

Our methodology’s satisfactory performance in predicting spatial employment distribution in “data patchy” or “unseen” cities opens up possibilities for analytical and operational applications. First, it allows for more systematic employment accessibility analyses in developing country cities. Such analyses aim to better measure the benefits of, and tailor, transport investments and land use interventions, and are mandatory for World Bank urban transport interventions. Second, it could help increase the development and application of quantitative spatial economic models, which need location and volume of employment as inputs. And third, it could help document the spatial structure of urban areas and improve the measurement of agglomeration forces in developing country cities.

Future work and caveats

We have tested this version of our methodology on Sub-Saharan African and Latin American cities. Future efforts should replicate this work in the other World Bank regions: East Asia and Pacific, Europe and Central Asia, Middle East and North Africa, and South Asia. Additional time and effort should also be invested in better understanding the heterogeneous predictive performance of the machine learning algorithms for “unseen” cities. When alternative data, such as employment data derived from employment censuses, travel surveys, or even mobile phones, are available, they should be preferred. We equally caution against the use of this prediction methodology to measure the evolution of job distribution over time without further validation.

Further reading

The interested reader is welcome to explore our technical paper:
Barzin, Samira, Paolo Avner, Jun Rentschler, and Neave O’Clery. 2022. “Where Are All the Jobs?: A Machine Learning Approach for High Resolution Urban Employment Prediction in Developing Countries.” World Bank Policy Research Working Paper, no. 9979. https://openknowledge.worldbank.org/handle/10986/37195

Notes

[1] Jean, Neal, Marshall Burke, Michael Xie, W. Matthew Davis, David B. Lobell, and Stefano Ermon. 2016. “Combining Satellite Imagery and Machine Learning to Predict Poverty.” Science 353 (6301): 790–94. https://doi.org/10.1126/science.aaf7894.
[2] Goldblatt, Ran, Kilian Heilmann, and Yonatan Vaizman. 2020. “Can Medium-Resolution Satellite Imagery Measure Economic Activity at Small Geographies? Evidence from Landsat in Vietnam.” The World Bank Economic Review 34 (3): 635–53. https://doi.org/10.1093/wber/lhz001.

Contacts

The Global Facility for Disaster Reduction and Recovery’s (GFDRR) Global Programs on Resilient Infrastructure and Disaster Risk Analytics can provide support in applying the operational analytics approach presented in this note to urban infrastructure and resilience projects.

For more information, or if you are interested in applying this methodology to your projects or analyses, please feel free to contact:
• Paolo Avner, Urban Economist, GFDRR: pavner@worldbank.org
• Jun Rentschler, Senior Economist, Office of the Chief Economist for Sustainable Development: jrentschler@worldbank.org
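
Illustrative sketch: leave-one-city-out validation

For readers who want to see the shape of the "unseen city" test described in this note (train on all cities but one, predict the held-back city, and score the prediction with R²), the following is a minimal sketch under stated assumptions. It is not the toolkit's code. The function names (`r_squared`, `fit_line`, `leave_one_city_out`), the city labels, and the grid-cell values are all hypothetical, and a one-feature least-squares line stands in for the spatial Random Forest purely to keep the example short and dependency-free.

```python
from statistics import mean


def r_squared(observed, predicted):
    """Share of variation in `observed` that is explained by `predicted`."""
    avg = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - avg) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def fit_line(xs, ys):
    """Stand-in model (assumption): least-squares line on a single feature.
    The note itself uses a spatial Random Forest, which is not shown here."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def leave_one_city_out(cells):
    """cells: list of (city, feature, log_employment_density) tuples.
    Returns {city: R2 obtained when that city is held back from training}."""
    scores = {}
    for held_out in sorted({city for city, _, _ in cells}):
        train = [(x, y) for city, x, y in cells if city != held_out]
        test = [(x, y) for city, x, y in cells if city == held_out]
        slope, intercept = fit_line([x for x, _ in train],
                                    [y for _, y in train])
        predicted = [slope * x + intercept for x, _ in test]
        scores[held_out] = r_squared([y for _, y in test], predicted)
    return scores


# Synthetic grid cells for three hypothetical cities (not real data).
cells = [
    ("City A", 0.0, 0.1), ("City A", 1.0, 1.0), ("City A", 2.0, 2.1),
    ("City B", 0.0, -0.1), ("City B", 1.0, 1.1), ("City B", 2.0, 1.9),
    ("City C", 0.0, 0.0), ("City C", 1.0, 0.9), ("City C", 2.0, 2.0),
]

scores = leave_one_city_out(cells)
for city, score in scores.items():
    print(f"{city}: out-of-sample R2 = {score:.2f}")
```

The "patchy" analysis differs only in how cells are held back: instead of removing one whole city, a random 20 percent of cells in each city would be set aside and the model trained on the pooled remainder.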