Policy Research Working Paper 8972 A Novel Approach to the Automatic Designation of Predefined Census Enumeration Areas and Population Sampling Frames A Case Study in Somalia Sarchil Qader Veronique Lefebvre Amy Ninneman Kristen Himelein Utz Pape Linus Bengtsson Andy Tatem Tomas Bird Poverty and Equity Global Practice August 2019 Policy Research Working Paper 8972 Abstract Enumeration areas are the operational geographic units enumeration areas that were created in urban areas in Mog- for the collection, dissemination, and analysis of census adishu and Hargeisa for the United Nations Population data and are often used as a national sampling frame for Estimation Survey for Somalia in 2014. The automatically various types of surveys. Traditionally, enumeration areas generated enumeration areas are consistent with standard are created by manually digitizing small geographic units enumeration areas, including having identifiable boundar- on high-resolution satellite imagery or physically walking ies to field teams on the ground, and appropriate sizing and the boundaries of units, both of which are highly time, population for coverage by an enumerator. Furthermore, cost, and labor intensive. In addition, creating enumera- the automated urban enumeration areas have no gaps. The tion areas requires considering the size of the population paper extends this work to rural Somalia, for which no and area within each unit. This is an optimization prob- records exist of previous enumeration area demarcations. lem that can best be solved by a computer. This paper, for This work shows the time, labor, and cost-saving value of the first time, produces an automatic designation of pre- automated enumeration area delineation and points to defined census enumeration areas based on high-resolution the potential for broadly available tools that are suitable gridded population and settlement data sets and using pub- for low-income and data-poor settings but applicable to licly available natural and administrative boundaries. This potentially wider contexts. automated approach is compared with manually digitized This paper is a product of the Poverty and Equity Global Practice. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at upape@worldbank.org. The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent. Produced by the Research Support Team A Novel Approach to the Automatic Designation of Predefined Census Enumeration Areas and Population Sampling Frames: A Case Study in Somalia Sarchil Qader Veronique Lefebvre Amy Ninneman Kristen Himelein Utz Pape Linus Bengtsson Andy Tatem Tomas Bird JEL: C55, C83, C80 Keywords: Census; Enumeration Areas; Sampling Frame; Somalia 1. Introduction In most countries of the world, population data count and distribution data in the form of census enumerations are used for population segmentation, planning, and a myriad of other functions that support the government. In Low and Middle-Income Country (LMIC) settings, spatially referenced data on populations are crucial for targeting interventions, monitoring the impact of population growth, health and environmental applications, poverty mapping and transport and city planning (Hay et al. 2005; Balk et al. 2006; Tatem et al. 2007). Enumeration areas (EAs) are the operational geographic units for the collection of census data (United Nations Statistics Division 2010). The set of all EAs of a country constitutes a partition of that country, with EAs not overlapping with each other. In principle, EAs are designed such that each unit contains a similar population size and conforms to certain constraints imposed by the logistics of counting large numbers of people with limited resources. Given data on population density at a sufficiently high spatial resolution, the process of designing EAs can be automated, which could substantially accelerate census mapping. In some LMIC contexts, and particularly in conflicted affected areas, EAs are based on old population data or do not exist, a fact that has far-reaching consequences. In particular, the lack of properly-defined EAs means that there is no nationally representative sampling frame. National sampling frames are used to draw representative samples from the population to understand the geographic distribution of population characteristics (Turner 2003). Without sample weights provided by a national sampling frame, survey data sampled from a country’s population are likely to be biased, typically under- sampling vulnerable populations (Thomson et al. 2012; Ellard-Gray et al. 2015). Therefore, the creation of EAs to form a sampling frame creates significant benefits to generate nationally representative data to allow evidence-based analysis informing governments and NGOs. Especially in fragile contexts where data are even more important due to the need of timely and evidence-based planning in volatile situations, up-to-date sampling frames are necessary. However, defining new EAs is a challenging process that needs to consider a number of factors, including the quality of available data and the logistics of sampling; EAs need to make sense relative to the real-world distribution of people, buildings, and infrastructure. Good design of census EAs has a large impact on the efficiency of the census, a process which can cost on the order of hundreds of millions of dollars (Loots 2015). To better facilitate efficient census processes, EAs should lead to (i) better organization and management of the census itself and to (ii) small area statistics which meet common user needs (Martin 2002). Historically the creation of census EAs has been an expensive problem solved by brute-force approaches such as physically walking to map EA boundaries, which can require thousands of staff and many years to complete (Lu 2009; Yacyshyn and Swanson 2011). For instance, in Zambia, the 2010 census mapping exercise was expected to take about two years to be completed at a total cost of US $ 7 million (United Nation Secretariat 2007). Since the advent of Geographic Information Systems (GIS) and high-resolution satellite photography, census cartographers in some countries have been able to manually digitize EA boundaries from satellite imagery, leading to better control of EA delineation. However, manual digitization of EAs is still cost-, time-, and labor-intensive. It is also prone to human error and can have poor accuracy (Alazab et al. 2009; Balinski et al. 2010; Cocking et al. 2011). In addition, poor quality or outdated satellite imagery can make it difficult or impossible to identify the boundaries or house blocks that align with actual house blocks and EA boundaries that are misaligned with real-world structures while natural boundaries can cause significant challenges in the field enumeration process. Furthermore, drawing or growing EAs manually with respect to population size and area constraints is challenging and can result in ‘nonsensical’ boundaries (e.g. a single structure split into different EAs). Finally, EAs or sampling frames in countries with large population 2    displacements and rapid urbanization need regular updating, meaning that the manual digitization process must be replicated. Previously, automated zone design methods have been employed to group areas into zones for a range of purposes including investigation of neighborhood effects on health and release of census data (Cockings et al. 2011; Flowerdew et al. 2007; Haynes et al. 2007). To construct the created new zones some constraints were considered such as population size and building types. However, these works were conducted in developed countries where actual household data were available, and the method relied on previously-defined sub-units or EAs. In addition, in most cases, the created zone boundaries do not match well with the logical ground natural boundary. Several studies have also employed region merging techniques to derive the census enumeration areas. For instance, Folch & Spielman (2014) showed the advantage of applying the improved max-p algorithm on growing the irregular regions from census tracts. In New Zealand, the ArcGIS/AZTool toolkit, which is a region-merging tool, was developed to design new reporting geographic units using the 2006 census data as the main input, with significant improvements over traditional methods reported (Martin and Lyndon 2009). However, all these examples used existing EAs as their base unit and no works exist that have employed region merging techniques to generate new EAs in a country where an existing product is not available. Despite the many potential advantages of automated EA generation, the approach has not gained significant uptake in LMIC contexts where the kinds of spatial data previously used in tools like the AZTool (such as census tracts) are unavailable or outdated. Fortunately, many new and freely available data sources have become available, which can directly inform the creation of EAs. For example, crowdsourced road and natural feature data, which can be used to define logical boundaries for EAs, are increasingly available in repositories such as Open Street Maps (OSM). Globally, OSM is ~83% complete; more than 40% of countries (including several in the developing world) have a fully mapped street network (Barrington-Leigh & Millard-Ball 2017). In addition, global high-resolution population models such as WorldPop (www.worldpop.org, 2019) provide estimates of population density that can be used to help inform the creation of EAs. Here for the first time, we describe how these freely available data on population and georeferenced features can be combined to design a new full set of pre-defined census EAs, or to update existing EAs, using Somalia to demonstrate the process. Our automated process is based on a methodology we termed ‘split and merge’, inspired from the field of image processing, specifically image segmentation using mathematical morphology (e.g. watershed and waterfall algorithms; Beucher 1994; Damiand & Resch 2003). The process first splits the country into regions as small as possible that follow visible boundaries (e.g. roads), then progressively re-merges them so that they are as large as possible while respecting given constraints on the area and population size, and so that they do not cross obstacles and administrative boundaries. This process ensures creating EAs so that their boundaries do not cut across buildings and can be seen easily from the ground by surveyors. 2. Methodology At the request of the World Bank, we applied an automated EA delineation process to build a national sampling frame for Somalia. This section describes our approach and then details its application in the context of Somalia. 2.1 General approach We developed a novel automated process to create EAs for an entire country that falls within given ranges of population size and area, that does not cut through buildings, that do not cross obstacles and administrative boundaries, and that follow tangible borders visible from the ground as much as possible. 3    Two kinds of data are needed in our process: i) vector data of roads, natural boundaries and/or settlement boundaries with which to split the country area into sub-units and ii) high-resolution (100 m) estimates of population density. Our process first combines all sources of vector data to split the country into small sub-areas, then merges them into units matching constraints such as population size or area, a process we termed the split and merge algorithm. Details of data requirements and processing are discussed in a later section. In regions where data on geographic and manmade features are too sparse to obtain small enough regions after the splitting stage (either nonexistent e.g. desertic areas or unmapped), a quadtree algorithm is used to further split the country. In this case, not all EAs follow visible boundaries. In general, using our automated delineation method being developed into a tool we expect that a solution can always be found where no EA falls outside of the limits set. As long as space is partitioned in sufficiently small areas the region merging algorithm can have enough flexibility to create EAs that meet all the targets. Where there is a lack of geo-referenced features to obtain small enough areas, then the automated tool will use the quadtree algorithm to split. A flowchart of the process is given in Figure 1.   4      Figure 1. Schematic diagram of our automated EA delineation process. The first step is to split the territory into regions as small as possible using georeferenced features such as administrative and natural boundaries (e.g. rivers, terrain), settlement location and outlines, and road and path networks. We then compute the estimated population in each region thus obtained (using a very high spatial resolution population density map) and check if all regions have a population and an area below given thresholds. If not, we further split regions using the quadtree algorithm, until all regions are below the population and area thresholds. We then merge regions so that they exceed the given minimum area threshold and until they are as close as possible to but remain below the maximum population and area thresholds. The merging process does not merge regions across a set of specified boundaries (e.g. administrative boundaries and large rivers). The result is a partition of the country into EAs that follow visible boundaries, that are not across obstacles or administrative boundaries, and that comply with given ranges of population size and area. 5    2.2 Case study area: Somalia Somalia is situated in the Horn of Africa with an official population estimated at 12.3 million in 2014, up from the 1975 estimate of 4.1 million with slightly more males (6.2 million) than females (6.1 million) (UNFPA, 2016). In 2012, the first nationwide Population Estimation Survey (PESS) took place. The Somalian government, United Nations Population Fund (UNFPA), and United Nations Development Programme (UNDP) collaborated, prepared, and carried out this survey, aiming to use the PESS as a basis for a census. The PESS survey in 2014 estimated that 42% of the population was permanently settled in urban areas and 23% in rural areas, while 26% were nomadic people and 9% were internally displaced persons (IDPs) (UNFPA 2014). The Somali population is rapidly increasing with almost 3% population growth per year and a high fertility rate of 6.26 children per woman, which is the fourth highest in the world (Gure et al. 2015). However, the results of the PESS alone were not suitable for creating a nationally representative sampling frame, as the PESS created EAs in urban areas only. The risks associated with fieldwork and the lack of funding were just two hurdles this approach faced; significantly displaced populations exist in parts of Somalia, without any official population information available. Therefore, the World Bank recognized that re-building Somalia’s statistical infrastructure and capacity was key in supporting resilience efforts and proposed a spatial analysis approach as an innovative way to create a new sampling frame, especially given the barriers in this context (e.g. security risks, lack of funding). 2.3 Data sources To conduct this work, several data sets have been compiled and combined from various resources (table 1). Table 1. Data used for Somalia Enumeration Areas. Source Data description Road Data (OSM) Lines Waterway (OSM) Lines River (OSM) Lines Residential area (OSM) Part points and part polygons Building (OSM) polygons ‘places’, ‘hamlet’, and ‘villages’ (OSM) Points DLR Global Urban Footprint (GUF) Binary raster BMGF/DigitalGlobe (DG) population estimates scatter points World Bank/Flowminder/WorldPop building polygons counts from Google Satellite imagery UNFPA/PESS urban Enumeration Areas Polygons UNFPA/PESS urban Enumeration Areas (EAs) # households per EA. 6    household number BMGF/DigitalGlobe (DG) settlement outlines for polygons North Somalia UNFPA/PESS rural population estimates points Pre-war regions boundary polygons Waterbody (OSM) polygons   2.4 Data pre-processing 2.4.1 Definition of urban and rural stratum within 18 pre-war regions Defining major strata is the first step in generating EAs as the maximum population size and areas constraints of EAs may need to vary in different strata. For example, Urban versus Rural strata typically need very different constraints to account for the differing population densities. In Somalia, urban strata were defined using the previous urban EAs from PESS 2014 (UNFPA 2014). The previous urban EAs were dissolved using the dissolve tool in ArcGIS. The remaining area outside of the urban strata was considered rural. To define and compute the urban and rural strata for each Somalia pre-war regions, urban and rural strata were intersected with the 18 pre-war regions administrative boundary. Based on World Bank recommendations the urban and rural strata in Banadir were merged and considered as urban. 2.4.2 Settlement boundaries Data on settlement locations and boundaries are needed to inform EA delineation in the ‘merge’ part of the algorithm and are also used to create our refined 100m population density estimates. Recent increased availability of high-resolution satellite imagery and high-power computing resources with adequate image processing algorithm have advanced the development of high-resolution human settlement layers (Vijayaraj et al. 2007; Florczyk et al. 2016; Roy Chowdhury et al. 2018). Examples of recently developed human settlement layers include Global Urban FootPrint (GUF) (Esch et al. 2017), Global Human Settlement Layer (GHSL) (Pesaresi et al. 2013) and LandScan Settlement Layer (LandScan SL) (Cheriyadat et al. 2007). These data sets either are not available for Somalia or are incomplete for the country or they might not specifically be trained for Somalia since the country is facing a constant regional instability and severe drought that forced substantial migration within the region. In addition, during this work, several new data sets on settlements were collected from a variety of sources (table 1), each of which georeferencing settlements absent from the other existing settlement layers. This has motivated the current work to generate an improved high-resolution settlement map by combining the datasets listed in table 1. Appendix 1 gives more detail on the processing done on each data set. The resulting settlement map was used to improve population density predictions and EA delineation. 2.4.3 High-resolution population density estimates for Somalia A high spatial resolution map (100m x 100m) is necessary to estimate the population of each sub-region created during the automated EA delineation process. Multiple high to moderate resolution global modeled population datasets are freely accessible to download including WorldPop 7    (www.worldpop.org, 2018), Global Rural-Urban Mapping Project, Version 1 (GRUMP) (CIESIN et al. 2011), Gridded Population of the World Version 4 (GPWv4) (CIESIN, 2017), Gridded Population of the World, United Nation (GPW UNEP, 2006) and Global Human Settlement Population Grid (GHS- POP) (JRC and CIESIN 2015). However, none of these data sets on its own was sufficient for our purposes, as they were created without the use of the PESS 2014 data, or the final total population was not adjusted to match the PESS regional total. In addition, we had access to more recent data sets (high- resolution DigitalGlobe population estimates), which we wanted to use to ensure that our EA delineation of Somalia is based on the most up to date population estimates. Therefore, we produced a novel 100m x 100m population density map to calibrate our EA delineation. We give below an only succinct overview of the method employed as it is not the object of the present paper, and it is not relevant to the description and results of our novel automated process for EA delineation, which can accept as input any gridded data set of sufficiently high resolution. Appendices 2 and 3 list the data sources that we used for urban and rural areas respectively, and the transformations we applied in order to obtain a 100m x 100m raster for each. Data sources include information on building density, household density and population density. We used the World Bank survey (UNFPA 2014) to estimate a median number of people per building and per household to approximate population density from data on building and household densities. In places lacking data but identified as settled, we modeled population density based on the distribution of population estimates in similar settlements. We then set population density to zero in locations known to be not settled, and to a low value in locations that could be settled but for which we have no data (around known settlements). Finally, we rescaled the population density map thus obtained using the PESS 2014 regional totals (UNFPA 2014). 2.5 Split and merge algorithm for the creation of Enumeration Areas (EAs) 2.5.1 Splitting process The aim of the splitting process is to partition the country into regions that are as small as possible so that the subsequent merging process has enough flexibility to combine them into optimal EAs. We used three steps to do this. Step 1 - Splitting based on geo-referenced features to create regions with tangible boundaries The country was split using road data, rural settlement boundaries, waterway and river data and administrative boundaries from OpenStreetMap (OSM), using the feature to polygons tool in ArcGIS. These data sets where either lines or polygons, whose geometry will be used to create area features, were the input features to the tool. From the merging of these features, each small “closed” area became a feature in the output feature class (here called ‘Primary Units’ (PU). This first step results in a set of fully contiguous units that are much smaller than the target EA size, with no gaps or islands and with all regions delineated by georeferenced features. If the road data are complete, then the process ensures that no building will be cut. For the areas that do not meet the criteria, Thiessen polygons based on settlement locations were created. However, for sparsely populated areas or vast featureless areas, with only a few geo-referenced settlements or residential areas, areas created from this approach were still larger than 9km2. Step 2 - Estimate population and area With the PU feature set defined, we were able to then compute the population size for each PU using the high resolution gridded population data sets in the Zonal Statistics Tool in ArcGIS. 8    Optional step 3 - Splitting based on Quadtree algorithm Some areas do not contain sufficient line data to create small enough regions to meet the 9 km2 area threshold or the 750 person population threshold - this was particularly true in sparse and non-populated areas, which remained larger than 9 km2, despite having a population below 750. We further split any areas still larger than 9 km2 or containing more than 750 people after steps 1 and 2 into square grid cells using a quadtree algorithm (Finkel and Bentley 1974). A quadtree is a tree data structure in which each internal node in the underlying tree has exactly four children (Wanderer 2017). This approach is commonly employed to partition a two-dimension space by recursively decomposing it into four equal quadrants or regions (Feng and Watanabe 2015). Here, the algorithm splits the area and population into successively smaller quadrants by checking whether the content of each split is smaller than prescribed values (e.g., population> 750 & area < 9km2). Following this step, all shapes produced were smaller than 9 km2 and contained fewer than 750 people. Figure 2. This figure shows the small irregular polygons in rural areas with population size after the splitting process. 2.5.2 Merging process After splitting the area of interest (e.g. country) into small regions as small as possible and with population size and area smaller than the requested thresholds, the regions are merged until they match 9    the constraints given in table 2. The aim of the merging process is to obtain regions as close to the target as possible and with population below threshold (e.g. 750) and area within a specified range (e.g. 0.1 to 9km2) while ensuring that merged areas do not cut across obstacles or administrative borders. With a lower priority, the process also tends to produce shapes that are as compact as possible. By construction, the boundaries of the EAs resulting from the merging process will also follow geo-referenced features (or square sides, if quadtree squares are needed). The merging process takes as inputs: 1/ the PU features defined in the split process, 2/ ranges of target population and area values for EAs (table 2), 3/ the gridded population density dataset to compute the population for re-merged region at each step, and 4/ a set of specified boundaries across which regions should not be merged (e.g. large rivers, delineation of urban and rural strata, administrative boundaries). Since this is a computationally intensive process, the method was applied to every 18 pre-war regions separately. To complete the merging step, we used the Automated Zone-design Tool (AZTool) since it is publicly available, most compatible compare to the other tools for the purpose of this study and user-friendly. AZTool was developed by Martin et. (2002) and is based on Openshaw’s (1977) Automated Zoning Procedure (AZP). The AZP methodology was developed by the Office for National Statistics (ONS) for the 2001 census in England and Wales (Cocking et al 2011) and to revise census geography in New Zealand (Martin and Lyndon 2009). AZTool iteratively combines and recombines sets of geographic building blocks to generate larger zones optimized to meet a set of pre-defined user-specified constraints. Such specified constraints include population threshold (Min, Max and target) and compactness of the shape. Criteria and constraints used in the current study are reported in table 2. Compactness is a minor constraint used so that shapes ‘tend’ to be compact and avoid difficult shapes such as snake-like shapes. Donut is one output area surrounding another. The Area constraint was not originally included as a constraint in AZTool, however, we modified the .xml file to include the area constraint for the purpose of this study. Table 2. Illustrates the criteria were set in AZTool to generate sensible EAs in Somalia rural area. Criteria Hard Constraint Soft Constraint On or off Shape compactness Yes on Population Min=0 and Max=750 Target=650 on Area Min=0.1 km2 and Max=9 km2 Target=8.8 km2 on Donuts Yes on 2.6 Computation of EA probability of selection The split and merge algorithm results in a partition of the country into regions that satisfy the definition of an EA. We calculate the population of each EA using the population density map and compute the probability of selection. The probability of sampling for each EA was defined according to the expected population within each EA divided by the total population in the regional stratum, recalling that the country was divided into 18 pre-war regions and further subdivided into urban/rural strata. 10    ∑ PEA(1 1)+ PEA(2 1)+ PEA(3 1)+ ….+ PEA(n j)= 1 for each stratum within a pre-war region is a probability of selection for an EA in a specific stratum (Urban or rural) and pre-war regions (e.g. rural Bari), i is an EA number, j is the regional stratum type. is the population within an EA in a specific regional stratum, is the sum of EAs’ population within a regional stratum, n is the number of EAs within a regional stratum. 2.7 Comparison between Manual PESS urban EAs and our automated approach Urban EAs In UNFPA’s 2014 PESS, no Rural EAs were created, and the urban EAs were manually digitized based on high-resolution Google Earth Imagery. We compared the results of our automated methodology against those manually digitized EAs in urban areas in both Mogadishu and Hargeisa cities. Since the boundary of PESS urban EAs and their household size are confidential, we were unable to publish a complete comparison. Instead, we compared the distribution of population size and EA area between automatically and manually generated Rural EAs. The population and area were also computed for PESS urban EAs based on our high-resolution gridded population data sets and the populations and area of EAs within Hargeisa and Mogadishu cities were compared. 3. Results: Somalia case study 3.1 Rural EAs 3.1.1 Description of results from our split and merge algorithm A total of 253,833 polygons were created in Somalia rural areas after the splitting process. After applying the region merging technique in a rural area, 113,367 EAs were created, with population size ranging from 0 to 750 and a maximum area of 9 km2 (Figure 3a). In addition, the probability of EA selection proportional to the population size and area were also computed for each EA. Furthermore, the probabilities of selection were summed up in each regional stratum type and is equal to 1 (Figure 3b). 11    Figure 3. (a) Rural EAs in Somalia with the population size after the merging process, (b) Urban and rural EAs with their probability of selections proportional to the population size per regional strata. Outlines of some generated Rural EAs are overlaid on high-resolution Google Earth imagery in figure (4), showing that EA boundaries conform well to natural boundaries in populated areas. Three categories of EAs boundaries can be seen. The first category is in towns or highly populated areas, where EA boundaries are well matched with logical ground natural boundary such as roads ( Figure 4a, b, c, g, h). The second category can mostly be seen in very sparse or unpopulated areas, where roads and natural boundaries are still adhered to (Figure 4 d and f). Finally, the third category represents very sparsely populated or unpopulated areas, where natural boundaries do not contribute to EA shapes (Figure 4e). 12      Figure 4. Outlines of rural EAs in different locations over Somalia generated by the split-merge algorithm. 3.2 Urban EAs 3.2.1 Description of results from split and merge algorithm The automated approach was applied to both Mogadishu and Hargeisa cities. Figure 5 presents the results obtained from overlaying our automated urban EAs boundaries on high-resolution Google Earth imagery. Figure 5a, b, c, d, and e show the results for Hargeisa city while Figure 5f, g, h and i are illustrating the boundary of urban EAs in Mogadishu city. Importantly the boundary of EAs perfectly matches the natural demarcation on the ground particularly roads, reflecting the good quality of road data available for these areas.   13      Figure 5. Outlines of our automated urban EAs in different locations over Hargeisa and Mogadishu in Somalia. 3.2.2 Comparison with PESS manual urban EAs We compared our automated urban EAs to the manually digitized PESS 2014 urban EAs in terms of the population and area covered (Figure 6). The total number of PESS EAs is 1,380 and the total automated EAs is 1,775. If we consider the maximum urban population as 2,000 people per an EA and the preferred target is less 1,000 people per an EA, the automated EAs might have the same performance or even better in some cases compared to PESS urban EAs. While we cannot directly present the PESS urban EA data here, but we have noticed some issues when we have reviewed and generated the histograms. For the most part, PESS urban EAs follow roads well but we have found some examples where this is not the case as they cut the houses, likely due to changes in building layouts since the construction of the PESS EA data set. The minimum population size per PESS urban EA in Mogadishu, which was obtained from high gridded population data sets, ranges from zero to 17,000 and the minimum area ranges 5 m2 to around 7,000,000 m2. The zero values in population size and small area indicate the presence of gaps in the data sets. The large population and area size for some of the EAs indicate that the EAs may not be practical for a surveyor in the urban context as it may either cover a high-populated area or cover a large space. In the automated process, these constraints can be tuned based on user requirements. The minimum population size for the automated EAs was 150 and the 14    maximum was 2,000. The area constraints were tuned as well, and it starts from 2,000 m2 to around 4,000,000 m2.     Figure 6. Histogram distribution of population and area for urban PESS 2014’s EAs and automated urban EAs in Hargeisa and Mogadishu, Somalia. For the comparison, we assume the maximum population could be 2000 people per EA but the preferred target is less than 1000 people per EA. 4. Discussion Our split and merge approach using freely available data on human population density and boundary data provides an automated practical approach to generating EAs, compared with the manual digitization of EA boundaries. This approach can be used to generate EAs automatically in countries where EAs are not existent or exist but need regular updating because of large population movement/changes or in the preparation of census and are inaccessible due to insecurity restrictions. The accuracy and comparison between both approaches might be varied from an area to another but if adequate data sets are available for a country, our automatic approach can produce a similar result and  ensures all EAs will match the defined criteria, given the data the method has access to with a significant cost reduction. Using the methods shown in this paper, EAs for a country can be generated within ~2 weeks by a team of 2-3 GIS analysts if a good high-resolution population dataset and geospatial data on natural boundaries such as roads are available. Verification and validation, however, will need additional time. By contrast, adopting the traditional manual digitization to create EAs for a country can take years, hundreds of staff, a budget on the order of millions of dollars and often impose a significant risk to on-the-ground mapping staff. In addition, the task of accounting for appropriate 15    population size and area within EAs is much more easily handled algorithmically. By taking into account recent data on boundaries, roads and other features the algorithmic process also creates EAs that are more practical for fieldwork and can be updated as new data become available. Finally, because each EA is already associated with an estimation of population size, these EAs can be easily turned into a nationally representative sampling frame. One limitation of the method we have presented is that it relies on good quality data. Modeling population density is not the object of this work (Lloyd et al. 2017; Wardrop et al. 2018), however, reliable population estimates are crucial to use our method to obtain operational EAs. In addition, in Somalia, the method works well where OSM data sets and other ancillary data were available and of good quality - mostly in the highly populated areas or where volunteer OSM mapping effort has been highest. For instance, the boundaries of the first category (Figure 4a, b, c, g, h and i) are aligned with visible demarcations on the ground. This is since the used geospatial layers to automatize the EA delineation were derived from high-resolution satellite imagery. This pattern can be seen in the second category, but there is often a visible discrepancy to the actual natural boundaries (Figure 4 d and f) where natural boundary data appear not to exist. Finally, EA boundaries in the third category (Figure 4e) are devoid of boundary data and rely entirely on the quadtree algorithm to generate EAs. The lack of data or vast featureless areas is particularly problematic in rural areas because large stretches of land in Somalia are not or only sparsely populated. In addition, OSM road coverage does not exist in these areas. However, we note that the algorithm could easily be updated and re-run when new data become available. With the increasing prevalence of open-source data in development contexts (e.g. AidData.org, GRID3), these data gaps are likely to be filled before the next census. Manual digitization of EAs may be still a preferable option in some cases, where no data on roads or natural boundaries are available, or where local context weighs significantly in terms of choosing EAs. For example, areas where security or gerrymandering may be important considerations will require more hands-on control of EAs to include or exclude specific houses or neighborhoods. In addition, manual digitization does not rely on ancillary data sets as long as high-resolution satellite imagery is available, the EAs can be manually digitized. However, manual digitization is prone to many errors including leaving gaps, unclosed EAs and greater irregularity in actual population within EAs. Importantly, intensive multi-resource involvement in manual EA delineation must be considered. Although the actual household size per urban EA was one of the inputs of the population modeling, deviation still exists between actual population (based on HH size) and estimated population based on the modeling in manual urban EAs. This result may be explained by the fact that the modeling considered additional recent data sets besides the urban HH information to distribute the population within the grid cells (see appendixes 2 and 3). In addition, there were discrepancies between PESS 2014 HH number within urban EAs and regional population estimates. This means that, if HH size per EAs were considered to generate the total population per region, similar total population on the region could not be achieved as it was reported in the PESS 2014 particularly in Gedo. In addition, the very high population size based on the gridded population data set in the manual urban EAs (Figure 6) could be a result of their large spatial coverage. For instance, there is a case that an urban EA contains 1,500 HH (The HH were listed in the field based on PESS 2014). Our use of a pre-existing tool that was not designed for the application provided us with valuable learning about the relevant features of EAs, as well as the challenges associated with the process and areas where the approach can be optimized. For example, AZTool was not developed to aggregate the multitude of small regions that were created by splitting the map using so many different data sets. The AZTool has a soft compactness constraint to produce EAs that are not oddly shaped (e.g. elongated or convoluted), however, the constraint is not often satisfied. This might be because the input data set 16    contains many irregular shapes and donuts which make it difficult for the tool to merge the neighboring EAs considering all given criteria. In addition, some region merging techniques are better than others at ensuring all final regions match the desired criteria. In some cases, techniques become stuck in configurations with no solution (e.g. with small unmerged areas along the map borders or surrounded by fully formed regions that cannot be merged more). Such errors could be eliminated by incorporating harder constraints or additional constraints that are more appropriate to the use of many small areas. Our use of AZTool in the context of Somalia also pointed to potential extensions of the region-merging approach other criteria based on socio-economic variables such that socio-economic metrics are homogenous within the produced EAs. For instance, if the purpose of a survey is to collect data on poverty, the existence of high-resolution poverty maps (Tatem et al. 2014; Steele et al 2017) could be used to provide a weighting constraint which the tool could use to merge regions with the aim of ensuring that poverty levels with EAs were relatively homogeneous. This approach could help to ensure that selected samples are more representative, but still the priority is to have excellent population data, and if only low-resolution socio-economic data are available, the combination still can be made. In addition, a compactness metric based on the comparison of the longest length in a shape to the diameter of a circle of same area would provide a more robust selection criterion during the merging process and lead to more compact EAs, as it is insensitive to shape size, unlike the compactness metric used in the AZ tool. Furthermore, if available data have consistent and reliable labeling, the hierarchy approach will be adopted in terms of prioritizing the features to be merged in the merging process. By doing this, traversability will be accounted as well through ranking of split lines, and additional data sources such as the building delineation derived from satellite imagery (San and Turker 2010; Vakalopoulou et al. 2015) would supplement road data, especially in rural areas and could contribute to more targeted EAs by clustering building locations. Identified buildings could also help to inform the delineation of rural EAs, as well as providing spatial variables (building size and layout) that may be linked to socio- economic indicators. Based on the experience in the present study, we aim to develop a software tool that can be used to generate optimal and practical EAs with minimal user interaction and tailored to a wider set of needs. The tool has the capability of working directly with shapefiles and a wider. The tool will have the ability to consider several varieties of parameters to produce constructive EAs, as well as the integration of socio-economic variables as described above. It could also have broad development impact since it can update existing EAs with the condition if the old EA boundary should be kept or not or generate the new set of EAs and turning them into national sample frame within time, cost and quality constraints. Conclusion Somalia presents an example of a country where the need for basic spatial demographic data such as EAs is crucial to continued development, but where on-the-ground logistical constraints, security and data limitations remain significant barriers. Here, we show how a novel approach to creating predefined census EAs in such contexts where the census data are badly outdated and EA data are needed to progress development efforts. In addition, by highlighting existing freely-available, up-to-date high- resolution gridded population data and settlement maps, we hope to help remove a further barrier to broader uptake of this method. Future research will aim to improve the accessibility of the approach by providing user-friendly tools that will automate the data access, splitting and merging process according to user specifications. By presenting the approach in the challenging context of Somalia and providing the first up-to-date rural EAs in decades, we aim to demonstrate the feasibility of the approach and the potential for application in other contexts. 17    References Alazab, M., Islam, M., & Venkatraman, S. (2009). Towards Automatic Image Segmentation Using Optimised Region Growing Technique. In A. Nicholson & X. Li (Eds.), Ai 2009: Advances in Artificial Intelligence, Proceedings (Vol. 5866, pp. 131-+). Balinski M, Johnston R, McLean I,Young P, 2010 Drawing a New Constituency Map for the United Kingdom: The Parliamentary Voting System and Constituencies Bill 2010 (The British Academy, London). Balk, D. L., Deichmann, U., Yetman, G., Pozzi, F., Hay, S. I., & Nelson, A. (2006). Determining global population distribution: Methods, applications and data. In S. I. Hay, A. Graham, & D. J. Rogers (Eds.), Advances in Parasitology, Vol 62: Global Mapping of Infectious Diseases: Methods, Examples and Emerging Applications (Vol. 62, pp. 119-156). Barrington-Leigh, C., & Millard-Ball, A. (2017). The world's user-generated road map is more than 80% complete. Plos One, 12(8). doi:10.1371/journal.pone.0180698 Beucher, S. (1994). Watershed, hierarchical segmentation and waterfall algorithm (Vol. 2). Center for International Earth Science Information Network - CIESIN - Columbia University, International Food Policy Research Institute - IFPRI, The World Bank, and Centro Internacional de Agricultura Tropical - CIAT. 2011. Global Rural-Urban Mapping Project, Version 1 (GRUMPv1): Population Count Grid. Palisades, NY: NASA Socioeconomic Data and Applications Center (SEDAC). http://dx.doi.org/10.7927/H4VT1Q1H. Accessed 09 07 2018. Center for International Earth Science Information Network - CIESIN - Columbia University. 2017. Gridded Population of the World, Version 4 (GPWv4): Population Density, Revision 10. Palisades, NY: NASA Socioeconomic Data and Applications Center (SEDAC).https://doi.org/10.7927/H4DZ068D. Accessed 05 July 2018. Cheriyadat, A., Bright, E., Potere, D., & Bhaduri, B. (2007). Mapping of settlements in high- resolution satellite imagery using high performance computing. Geojournal, 69, 119-129. http://doi.org/10.1007/s10708-007-9101-0 Cockings, S., Harfoot, A., Martin, D., & Hornby, D. (2011). Maintaining existing zoning systems using automated zone-design techniques: methods for creating the 2011 Census output geographies for England and Wales. Environment and Planning A, 43(10), 2399-2418. doi:10.1068/a43601 Damiand, G., & Resch, P. (2003). Split-and-merge algorithms defined on topological maps for 3D image segmentation. Graphical Models, 65(1-3), 149-167. doi:10.1016/s1524- 0703(03)00009-2 Ellard-Gray, A., Jeffrey, N. K., Choubak, M., & Crann, S. E. (2015). Finding the Hidden Participant: Solutions for Recruiting Hidden, Hard-to-Reach, and Vulnerable Populations. International Journal of Qualitative Methods, 14. doi:10.1177/1609406915621420 Environmental Data Explorer: Gridded Population of the World. United Nations Environment Programme, Nairobi. 2006. http://geodata.grid.unep.ch/. Accessed 10 May 2018 18    Esch, T., Heldens, W., Hirner, A., Keil, M., Marconcini, M., Roth, A., Strano, E. (2017). Breaking new ground in mapping human settlements from space - The Global Urban Footprint. Isprs Journal of Photogrammetry and Remote Sensing, 134, 30-42. doi:10.1016/j.isprsjprs.2017.10.012 European Commission, Joint Research Centre (JRC); Columbia University, Center for International Earth Science Information Network - CIESIN (2015): GHS population grid, derived from GPW4, multitemporal (1975, 1990, 2000, 2015). European Commission, Joint Research Centre (JRC) PID: http://data.europa.eu/89h/jrc-ghsl-ghs_pop_gpw4_globe_r2015a Feng, J., & Watanabe, T. (2015). Index and Query Methods in Road Networks Index and Query Methods in Road Networks (Vol. 29, pp. 1-161). Finkel, R. A.; Bentley, J. L. (1974). "Quad Trees A Data Structure for Retrieval on Composite Keys". Acta Informatica. Springer-Verlag. 4: 1–9. doi:10.1007/bf00288933 Florczyk, A. J., Ferri, S., Syrris, V., Kemper, T., Halkia, M., Soille, P., & Pesaresi, M. (2016). A New European Settlement Map From Optical Remotely Sensed Data. Ieee Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 9(5), 1978-1992. doi:10.1109/jstars.2015.2485662 Flowerdew, R., Feng, Z. Q., & Manley, D. (2007). Constructing data zones for Scottish Neighbourhood Statistics. Computers Environment and Urban Systems, 31(1), 76-90. doi:10.1016/j.compenvurbsys.2005.07.008 Folch, D. C., & Spielman, S. E. (2014). Identifying regions based on flexible user-defined constraints. International Journal of Geographical Information Science, 28(1), 164-184. doi:10.1080/13658816.2013.848986 Gure, F., Yusuf, M., & Foster, A. M. (2015). Exploring Somali women's reproductive health knowledge and experiences: results from focus group discussions in Mogadishu. Reproductive Health Matters, 23(46), 136-144. doi:10.1016/j.rhm.2015.11.018 Hay, S. I., Guerra, C. A., Tatem, A. J., Atkinson, P. M., & Snow, R. W. (2005). Urbanization, malaria transmission and disease burden in Africa. Nature Reviews Microbiology, 3(1), 81-90. doi:10.1038/nrmicro1069 Haynes, R., Daras, K., Reading, R., & Jones, A. (2007). Modifiable neighbourhood units, zone design and residents ' perceptions. Health & Place, 13(4), 812-825. doi:10.1016/j.healthplace.2007.01.002 Lloyd, C. T., Sorichetta, A., & Tatem, A. J. (2017). High resolution global gridded data for use in population studies. Scientific Data, 4. doi:10.1038/sdata.2017.1 Loots H. (2015). Demarcation of census enumeration areas for the 2016 population and housing census in Lesotho. Geomatics Indaba Proceedings 2015 – Stream 1 http://www.ee.co.za/wp- content/uploads/2015/08/Hennie-Loots.pdf Lu, X. (2009). “Need a job? Apply to become a Census enumerator.” Wise Bread: Living on a Small Budget. Available at http://www.wisebread.com/need-a-job-apply-to-become-acensus- enumerator(Accessed January 2011). 19    Martin R. and Lyndon A. (2009). Optimised geographies for data reporting: zone design tools for Census output geographies (Statistics New Zealand Working Paper No 09–01). Wellington: Statistics New Zealand Martin, D. J, (2002), Geography for the 2001 Census in England and Wales, Population Trends, 108, pp7-15. Openshaw, S, (1977), A geographical solution to scale and aggregation problems in region-building, partitioning and spatial modelling, Transactions of the Institute of British Geographers, NS 2, pp459-472. Pesaresi, M., Guo, H. D., Blaes, X., Ehrlich, D., Ferri, S., Gueguen, L., Zanchetta, L. (2013). A Global Human Settlement Layer From Optical HR/VHR RS Data: Concept and First Results. Ieee Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 6(5), 2102-2131. doi:10.1109/jstars.2013.2271445 Roy Chowdhury, P. K., Bhaduri, B. L., & McKee, J. J. (2018). Estimating urban areas: New insights from very high-resolution human settlement data. Remote Sensing Applications: Society and Environment, 10, 93-103. doi:10.1016/j.rsase.2018.03.002 San, D. K., & Turker, M. (2010). BUILDING EXTRACTION FROM HIGH RESOLUTION SATELLITE IMAGES USING HOUGH TRANSFORM. In K. Kajiwara, K. Muramatsu, N. Soyama, T. Endo, A. Ono, & S. Akatsuka (Eds.), Networking the World with Remote Sensing (Vol. 38, pp. 1063-1068). Somalia National Development plan 2016. Somalia National Development plan (SNDP) – Towards recovery, democracy and prosperity 2017-2019. http://extwprlegs1.fao.org/docs/pdf/som169866.pdf Steele, J. E., Sundsoy, P. R., Pezzulo, C., Alegana, V. A., Bird, T. J., Blumenstock, J., . . . Bengtsson, L. (2017). Mapping poverty using mobile phone and satellite data. Journal of the Royal Society Interface, 14(127). doi:10.1098/rsif.2016.0690 Tatem AJ, Gething PW, Pezzulo C, Weiss D and Bhatt S, 2014, Development of High-Resolution Gridded Poverty Surfaces, Report for the Bill and Melinda Gates Foundation: http://www.worldpop.org.uk/resources/docs/Poverty-mapping-report.pdf Tatem, A. J., Noor, A. M., von Hagen, C., Di Gregorio, A., & Hay, S. I. (2007). High Resolution Population Maps for Low Income Nations: Combining Land Cover and Census in East Africa. Plos One, 2(12). doi:10.1371/journal.pone.0001298 Thomson, D. R., Hadley, M. B., Greenough, P. G., & Castro, M. C. (2012). Modelling strategic interventions in a population with a total fertility rate of 8.3: a cross-sectional study of Idjwi Island, DRC. Bmc Public Health, 12. doi:10.1186/1471-2458-12-959 Turner A. G. (2003). Sampling frames and master samples. Expert Group Meeting to Review the Draft Handbook on Designing of Household Sample Surveys. UNITED NATIONS SECRETARIAT, ESA/STAT/AC.93/3. https://unstats.un.org/UNSD/demographic/meetings/egm/Sampling_1203/docs/no_3.pdf UNFPA (2016). Population Composition and Demographic Characteristics of the Somali People. http://analyticalreports.org/pdf/UNFPA_PESS_Vol_2.pdf 20    UNFPA, Federal Republic of Somalia (2014) Population Estimation Survey 2014 for the Pre-War Regions of Somalia (PESS) (UNFPA, Nairobi). http://somalia.unfpa.org/sites/default/files/pub-pdf/Population-Estimation-Survey-of-Somalia- PESS-2013-2014.pdf UNITED NATIONS SECRETARIAT (UNS) 2007. Report of the Sub-regional Workshop on Census Cartography and Management. ESA/STAT/AC.144/L.3 United Nations Statistics Division 2010. Definition of the National Census Geography. UNSD- CELADE Regional Workshop on Census Cartography for the 2010 Latin America’s census round. https://www.cepal.org/celade/noticias/paginas/8/35368/pdfs/3UNSD.pdf Vakalopoulou, M., Karantzalos, K., Komodakis, N., Paragios, N., & Ieee. (2015). BUILDING DETECTION IN VERY HIGH RESOLUTION MULTISPECTRAL DATA WITH DEEP LEARNING FEATURES 2015 Ieee International Geoscience and Remote Sensing Symposium (pp. 1873-1876). Vijayaraj, V., Bright, E. A., Bhaduri, B. L., & Ieee. (2007). High resolution urban feature extraction for global population mapping using high performance computing Igarss: 2007 Ieee International Geoscience and Remote Sensing Symposium, Vols 1-12: Sensing and Understanding Our Planet (pp. 278-281). Wanderer, J. P. (2017). Analysis of Large and Complex Data. Anesthesia and Analgesia, 125(1), 345- 345. doi:10.1213/ane.0000000000002127 Wardrop, N. A., Jochem, W. C., Bird, T. J., Chamberlain, H. R., Clarke, D., Kerr, D., . . . Tatem, A. J. (2018). Spatially disaggregated population estimates in the absence of national population and housing census data. Proceedings of the National Academy of Sciences of the United States of America, 115(14), 3529-3537. doi:10.1073/pnas.1715305115 WorldPop Data. WorldPop, University of Southampton, Southampton UK. 2018. http://www.worldpop.org.uk/data/data_sources. Accessed 10 Mar 2018. Yacyshyn A. M. & Swanson D. A. (2011). The Costs of Conducting a National Census: Rationale for Re-Designing Current Census Methodology in Canada and the United States. The 21st Annual Warren E. Kalbach Population Conference, at the University of Alberta, Edmonton, Canada 21    Appendix 1. Data sources used to create the ‘rural settled areas’ raster (100m) 22    Appendix 2. Data sources used to obtain population estimates at 100m in urban areas. Appendix 3. Data sources used to obtain population estimate at 100m in settled rural areas. 23    24    25