PREDICTING URBAN EMPLOYMENT DISTRIBUTIONS


    A toolkit for more targeted urban investment and planning decisions

The challenge: employment density maps, key to targeting urban investments, are often outdated,
imprecise, or unavailable

Information on the spatial distribution of jobs, both formal and informal, in urban areas is a
fundamental requirement for many project appraisals and analyses. Such information allows urban
planners and developers to identify economic hubs within a city and take targeted measures to
improve their productivity, connectivity, and resilience, for example by investing in
infrastructure upgrades or flood protection systems, enhancing commuting options, and adapting
urban planning decisions. Such measures support firms and yield city-wide benefits for the lives
and livelihoods of workers and their communities.

In practice, business registries, employment censuses, or travel surveys are the most common
sources for mapping the density and spatial distribution of jobs within a city. But such data are
rarely available and, when they do exist, they tend to be incomplete, unreliable, or outdated.
Recent initiatives have successfully leveraged mobile phone-derived data to document “meaningful”
locations, including jobs. This is a breakthrough, especially given the increasingly ubiquitous
use of mobile phones. And yet, accessing and processing mobile phone data is a difficult,
lengthy, and often costly process.

In this note, we demonstrate a machine learning approach for high-resolution urban employment
prediction using a robust and scalable approximation methodology that relies on widely available
public data, making it a quick, low-cost solution. We show that in most developing countries,
this approach can open new avenues for targeted urban investment and planning decisions that are
based on systematic empirical evidence.

An analytical solution, in a nutshell

A machine learning approach for high-resolution urban employment prediction in developing
countries

Relying on open-source data extracted from OpenStreetMap (OSM) and Google Earth Engine (GEE), we
provide a new analytical toolkit to approximate the spatial distribution of jobs in urban areas
of developing countries. Using machine learning algorithms, we show that it is possible to
predict employment density based on urban form attributes, such as street density and amenities.
Using this approach, we generate predicted employment density maps for 14 cities in Latin America
and Sub-Saharan Africa, from Dakar in Senegal to Buenos Aires in Argentina (figure 1), and
validate the robustness of these maps against survey-based observed employment density data.
Generally, we find that the approach predicts within-city employment concentrations with high
resolution and accuracy, and can therefore be replicated in cities with no observed employment
information to inform urban investments and planning decisions.


Figure 1. Observed and predicted employment density in urban Buenos Aires, Argentina (R2=0.78)
Panels: (a) Observed employment density; (b) Predicted employment density
Legend: Employment density (log), normalised, from -1 and below to 1.5 and above
Note: R2, or goodness of fit measure, indicates the share of variation in the observed employment
density that can be explained by the algorithm.




RESULTS IN RESILIENCE SERIES



Methodology
We exploit existing correlations between urban form properties (such as nighttime light
brightness and road intersection density) and job density to predict the spatial distribution of
employment in urban areas of developing country cities. Our methodology relies entirely on
open-source data leveraged from OSM and GEE and is tested for 14 cities in Latin America and
Sub-Saharan Africa, chosen for having available employment data through business registries or
travel surveys, which enabled us to validate predictions.
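
As a rough illustration of the kind of relationship the methodology exploits, the sketch below
computes the Pearson correlation between the log of one urban form feature and the log of job
counts across grid cells. The cell values and the `log1p` transform are assumptions made for
illustration; they are not data or modelling choices taken from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical grid cells (made-up values): street intersections and jobs per cell.
intersections = [2, 5, 9, 14, 30, 41]
jobs = [10, 40, 55, 160, 420, 390]

# Assumption: compare the two variables on a log scale, using log(1 + x)
# so that empty cells would not break the transform.
log_intersections = [math.log1p(v) for v in intersections]
log_jobs = [math.log1p(v) for v in jobs]

r = pearson(log_intersections, log_jobs)
print(f"correlation (log-log): {r:.2f}")
```

A value close to 1 would indicate that cells with more intersections also tend to hold more jobs;
a value close to -1 would indicate the opposite.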


Figure 2. Illustration of selected features at 500x500-meter hexagonal grid cells
Panels: (a) Major roads; (b) Street intersections; (c) Public transport hubs; (d) Amenities;
(e) Shops; (f) Night lights; (g) Population; (h) Urban land cover; (i) NDVI
Legend: Feature distribution (log), from -2 and below to 2 and above
Note: Panels (a) to (e) are density-adjusted; Normalized Difference Vegetation Index (NDVI) is
not logarithm-adjusted; all features are within-city scaled.
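
The note to Figure 2 mentions that features are log-adjusted and scaled within each city. One
plausible reading, sketched below, is to take logs and then standardise each feature to zero mean
and unit variance city by city, so that a cell is compared only with other cells in the same
city. The z-score choice, the `log1p` transform, the city labels, and the sample values are all
assumptions for illustration, not details confirmed by this note.

```python
import math
from collections import defaultdict


def scale_within_city(cells):
    """Return z-scores of log1p(value), computed separately for each city.

    cells: list of (city_label, raw_value) pairs, one per grid cell.
    """
    logs_by_city = defaultdict(list)
    for city, value in cells:
        logs_by_city[city].append(math.log1p(value))

    # Per-city mean and (population) standard deviation of the logged values.
    stats = {}
    for city, logs in logs_by_city.items():
        mean = sum(logs) / len(logs)
        variance = sum((v - mean) ** 2 for v in logs) / len(logs)
        stats[city] = (mean, math.sqrt(variance))

    scaled = []
    for city, value in cells:
        mean, sd = stats[city]
        # A city whose cells are all identical has sd == 0; map those cells to 0.
        scaled.append((math.log1p(value) - mean) / sd if sd > 0 else 0.0)
    return scaled


# Hypothetical cells for two hypothetical cities (made-up values).
cells = [
    ("city_a", 3), ("city_a", 12), ("city_a", 40),
    ("city_b", 1), ("city_b", 6), ("city_b", 20),
]
scaled = scale_within_city(cells)
for (city, value), z in zip(cells, scaled):
    print(f"{city}: raw={value}, scaled={z:+.2f}")
```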



First, we use several methodologies to understand the explanatory power of various features of
the data derived from OSM and GEE to predict employment distribution in the 14 urban areas. We
find that population, nighttime lights, urban land cover, amenities, and road intersections are
strongly and positively correlated with job locations, and that terrain roughness, water bodies,
and vegetation indices are strongly and negatively correlated with the presence of jobs.

Second, we test several model specifications for employment predictions at grid cell level,
including Penalized Linear Regression models and Ensemble Tree methods. We settle on a spatial
variation of the Random Forest machine learning algorithm that can capture the clustering of
economic activity to predict employment density at grid cell level.

Applications and insights
The approach enables two types of predictive analysis. The first predicts employment distribution
in cities for which we have no employment data. The second predicts employment in cities for
which we have some employment data and incomplete spatial coverage.

When there is no employment data
The first type of analysis aims to understand how well our methodology can predict employment
density at grid cell level for entirely “unseen” cities, where the algorithm has not been trained
and we have no employment information. To do this, we train the spatial Random Forest algorithm
on data from 13 cities and hold back data from the city we are trying to predict. We repeat this
14 times, each time holding back data from a different city, to predict employment density for
all the cities in the study. The R2 (that is, the goodness of fit measure) indicates the share of
variation in the observed employment density that can be explained by the predictions derived
from the algorithm (Table 1).

We find that our method can predict employment density for polygons in out-of-sample cities with
medium to high accuracy (mean R2 is 0.63, and maximum R2 is 0.81). However, the results show
heterogeneity in predictive performance, with R2s ranging from 0.30 for Harare to 0.81 for
Kigali. Comparing the results by geographical region reveals that Latin American and Sub-Saharan
African cities are similarly well predicted on average, with mean R2 of 0.61 and 0.64
respectively. These results are in line with, but above, a similar analysis using satellite
imagery undertaken to predict consumption expenditure and household assets of spatial clusters
across four SSA countries [1], and substantially above those obtained in the closest related
literature estimating employment factors, which obtained R2s ranging from 0.24 to 0.33 [2].

Table 1. Performance comparison across Random Forest models with spatial effects for
out-of-sample cities (R2)

    Sub-Saharan Africa
    City              R2
    Abidjan           0.70
    Dakar             0.70
    Dar Es Salaam     0.71
    Douala            0.65
    Harare            0.30
    Kampala           0.54
    Kigali            0.81
    Kinshasa          0.55
    Nairobi           0.77

    Latin America
    City              R2
    Belo Horizonte    0.77
    Bogota            0.52
    Buenos Aires      0.78
    Lima              0.42
    Mexico City       0.58

Are the city-level R2 differences a reason for concern? We can attribute the variation in
out-of-sample prediction to a combination of city-specific relationships between employment and
OSM and GEE features, and varying degrees of quality for employment and feature data across
cities.
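
The hold-out logic used for “unseen” cities (train on every city but one, predict the city left
out, repeat for each city) can be sketched in a few lines. To keep the example self-contained,
the sketch swaps the spatial Random Forest for a one-feature least-squares line, so it
illustrates only the leave-one-city-out loop and the R2 calculation, not the model used in this
note. The city labels and all numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (stand-in model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def r_squared(observed, predicted):
    """Share of variation in the observed values explained by the predictions."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Hypothetical data: per city, one (scaled feature, scaled log employment density)
# pair per grid cell. All values are made up for illustration.
cities = {
    "city_a": [(-1.0, -0.9), (0.0, 0.1), (1.0, 1.1), (2.0, 1.8)],
    "city_b": [(-1.0, -1.2), (0.0, 0.0), (1.0, 0.9), (2.0, 2.1)],
    "city_c": [(-1.0, -0.8), (0.0, 0.6), (1.0, 0.2), (2.0, 1.9)],
}

scores = {}
for held_out in cities:
    # Train on the cells of every city except the one being predicted.
    train = [cell for city, cells in cities.items() if city != held_out
             for cell in cells]
    slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])

    # Predict the held-out city and score the predictions against its observed values.
    test = cities[held_out]
    predicted = [slope * x + intercept for x, _ in test]
    scores[held_out] = r_squared([y for _, y in test], predicted)

for city, score in scores.items():
    print(f"{city}: R2 = {score:.2f}")
```

Because each city is scored by a model that never saw its cells, a city whose
feature-to-employment relationship differs from the others receives a lower R2, which is one way
the heterogeneity discussed above can arise.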

[1] Jean, Neal, Marshall Burke, Michael Xie, W. Matthew Davis, David B. Lobell, and Stefano
    Ermon. 2016. “Combining Satellite Imagery and Machine Learning to Predict Poverty.” Science
    353 (6301): 790–94. https://doi.org/10.1126/science.aaf7894.
[2] Goldblatt, Ran, Kilian Heilmann, and Yonatan Vaizman. 2020. “Can Medium-Resolution Satellite
    Imagery Measure Economic Activity at Small Geographies? Evidence from Landsat in Vietnam.”
    The World Bank Economic Review 34 (3): 635–53. https://doi.org/10.1093/wber/lhz001.


Figure 3. Employment predictions across various cities
Panels: (a) Mumbai, India; (b) Karachi, Pakistan; (c) Port au Prince, Haiti; (d) Guayaquil,
Ecuador
Legend: Employment density (log), normalised, from -1 and below to 1.5 and above


Figure 4. Observed and predicted values for test data grid cells in urban Dakar, Senegal, using
spatial Random Forest models
Panels: (a) Observed employment density; (b) Predicted employment density
Legend: Employment density (log), normalised, from -1 and below to 1.5 and above
Note: Test cells are outlined in white.


When information is patchy
The second type of analysis trains the spatial Random Forest algorithm on 80 percent of all grid
cells, pooling grid cells from the 14 cities for which we have employment data. It then applies
the trained algorithm to the remaining 20 percent of grid cells in each city. This analysis is
useful when some information on employment is available at the city level but is spatially
incomplete. The algorithm is therefore trained on existing city-level data. Our algorithm’s
predictive performance is measured by the quality of fit between the predicted and observed
values of employment density in the 20 percent of “untrained” grid cells. We find that our method
can predict employment density in held-back within-city cells in our test cities with extremely
high accuracy, as measured by an R2 higher than 0.95 (figure 4). This shows that when we have
spatially “patchy” or incomplete city-level employment data, we can approximate employment
density in the areas without data.

Possible applications of the methodology
Our methodology’s satisfactory performance in predicting spatial employment distribution in
“data patchy” or “unseen” cities opens up possibilities for analytical and operational
applications. First, it allows for more systematic employment accessibility analyses in
developing country cities. Such analyses aim to better measure the benefits of, and tailor,
transport investments and land use interventions, and are mandatory for World Bank urban
transport interventions. Second, it could help increase the development and application of
quantitative spatial economic models, which need location and volume of employment as inputs.
And third, it could help document the spatial structure of urban areas and improve the
measurement of agglomeration forces in developing country cities.

Future work and caveats
We have tested this version of our methodology on Sub-Saharan African and Latin American cities.
Future efforts should replicate this work in the other World Bank regions: East Asia and Pacific,
Europe and Central Asia, Middle East and North Africa, and South Asia. Additional time and effort
should also be invested in better understanding the heterogeneous predictive performance of the
machine learning algorithms for “unseen” cities. When alternative data, such as employment data
derived from employment censuses, travel surveys, or even mobile phones, are available, they
should be preferred. We equally caution against the use of this prediction methodology to measure
the evolution of job distribution over time without further validation.

Further reading
The interested reader is welcome to explore our technical paper:

Barzin, Samira, Paolo Avner, Jun Rentschler, and Neave O’Clery. 2022. “Where Are All the Jobs?:
A Machine Learning Approach for High Resolution Urban Employment Prediction in Developing
Countries.” World Bank Policy Research Working Paper, no. 9979.
https://openknowledge.worldbank.org/handle/10986/37195

Contacts
The Global Facility for Disaster Reduction and Recovery’s (GFDRR) Global Programs on Resilient
Infrastructure and Disaster Risk Analytics can provide support in applying the operational
analytics approach presented in this note to urban infrastructure and resilience projects.

For more information, or if you are interested in applying this methodology to your projects or
analyses, please feel free to contact:
• Paolo Avner, Urban Economist, GFDRR: pavner@worldbank.org
• Jun Rentschler, Senior Economist, Office of the Chief Economist for Sustainable Development:
  jrentschler@worldbank.org