Policy Research Working Paper 9160

Identifying Urban Areas by Combining Human Judgment and Machine Learning: An Application to India

Virgilio Galdo
Yue Li
Martin Rama

South Asia Region & Latin America and the Caribbean Region
February 2020

Abstract

This paper proposes a methodology for identifying urban areas that combines subjective assessments with machine learning, and applies it to India, a country where several studies see the official urbanization rate as an under-estimate. For a representative sample of cities, towns and villages, as administratively defined, human judgment of Google images is used to determine whether they are urban or rural in practice. Judgments are collected across four groups of assessors, differing in their familiarity with India and with urban issues, following two different protocols. The judgment-based classification is then combined with data from the population census and from satellite imagery to predict the urban status of the sample. The Logit model, and LASSO and random forests methods, are applied. These approaches are then used to decide whether each of the out-of-sample administrative units in India is urban or rural in practice. The analysis does not find that India is substantially more urban than officially claimed. However, there are important differences at more disaggregated levels, with “other towns” and “census towns” being more rural, and some southern states more urban, than is officially claimed. The consistency of human judgment across assessors and protocols, the easy availability of crowd-sourcing, and the stability of predictions across approaches, suggest that the proposed methodology is a promising avenue for studying urban issues.

This paper is a product of the Office of the Chief Economist, South Asia Region, and the Office of the Chief Economist, Latin America and the Caribbean Region.
It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at yli7@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Identifying Urban Areas by Combining Human Judgment and Machine Learning: An Application to India

Virgilio Galdo, Yue Li and Martin Rama 1

Key words: urban area; urbanization rate; human judgment; Google images; crowd sourcing; population census; satellite imagery; machine learning
JEL Classification: O1, O18, R1

1 Virgilio Galdo and Yue Li are with the Office of the Chief Economist for the South Asia Region and Martin Rama is with the Office of the Chief Economist for the Latin America and Caribbean Region at the World Bank. The corresponding author is Yue Li. Her contact information is yli7@worldbank.org. This research was funded by the World Bank and by the U.K. Department for International Development as part of the Sustainable Urban Development Multi-Donor Trust Fund. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors.
They do not necessarily represent the views of the World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent. Skillful research analysis was provided by Maria Florencia Pinto, Sutirtha Sinha Roy, and Wenqing Zhang. The authors thank Gilles Duranton, Hans Timmer, two anonymous reviewers and participants at the 8th European Meeting of the Urban Economics Association for very helpful comments and suggestions. The authors appreciate operational support provided by Ryan Engstrom and Richard Hinton from the Center for Urban and Environmental Research at George Washington University, and Charles Fox and Keith Garrett from the Geospatial Operational Support Team at the World Bank.

1. Introduction

What is a city? The most candid answer may be “I know one when I see it.” The subjectivity of the city concept is understandable because of the continuum between urban and rural spaces. The many terms used to describe this in-between (suburb, exurb, edge city, and urban fringe, among others) testify to the absence of a clear divide. Even allegedly urban localities can differ substantially in their key attributes, as they range from compact cities to sprawling low-density areas. In this context, just like beauty, a city is in the eye of the beholder.

The rural-urban continuum and the heterogeneity of urban settings pose an obvious challenge to identifying urban areas and measuring urbanization rates in a consistent way within and across countries. An objective methodology for distinguishing between urban and rural areas that is based on one or two metrics with fixed thresholds may not adequately capture the wide diversity of places. A richer combination of criteria would better describe the multifaceted nature of a city’s function and its environment, but the joint interpretation of these criteria may require an element of human judgment.
In this paper we turn this unavoidable subjectivity into an opportunity, by proposing a methodology to identify urban areas that combines human judgment with machine learning. Human judgment is used to classify a representative sample of places into urban or rural. Reliance on several, qualitatively different groups of assessors, as well as on different classification protocols, provides reassurance that the outcome is stable. This sample is in turn used as the training set for a machine learning exercise that classifies out-of-sample places as urban or rural. A comparison between the various classification approaches provides further reassurance that the prediction outcome is robust.

We illustrate the potential of this methodology by applying it to India. Two reasons motivate this choice. First is scale: accounting for almost a fifth of the world population, India encompasses states with incomes per capita comparable to Mexico’s and others similar to Benin’s. Second is the “messiness” of its urbanization process, characterized by a wide variation in the strength of local urban authorities, from high in state capitals to dismal in other cities. Scale and messiness result in an enormous diversity of places, ranging from a glamorous metropolis such as Mumbai, capital of Maharashtra, to a shabby town such as Bagula in West Bengal, to any among hundreds of thousands of villages.

With such wide spatial diversity, not surprisingly there is considerable debate as to how urbanized India actually is. The official urbanization rate for 2011 is 31.2 percent. Several studies using alternative population thresholds or satellite imagery of built-up cover have argued that many areas of India are misclassified as rural by the administrative definition. Depending on the methodology, the urbanization rates estimated by these studies range from 42.0 to 78.0 percent (Denis and Marius-Gnanou 2011; Dijkstra et al. 2018; Ministry of Finance 2017).
Conversely, a study in this special issue relies on a range of parameters to delineate urban markets and suggests that India could be even less urbanized than official figures imply. In this other study the urbanization rate ranges from 14.8 to 33.5 percent when using nighttime light data, and from 27.1 to 39.4 percent when using built-up cover data (Baragwanath et al. 2019). Shedding light on this debate is the second motivation for choosing India to illustrate our proposed methodology.

Subjective assessments are increasingly being used to complement more objective approaches across various fields in economics. For example, ratings of physical attractiveness by independent parties have been shown to be correlated with occupational sorting, earnings differentials and physical performance (Beller et al. 1994; Hamermesh and Biddle 1994; Postma 2014). Similarly, self-reported happiness has been adopted as a wellbeing indicator on the grounds that everybody has his or her own views on what a good life looks like, and what makes a good life may touch on dimensions for which no reliable indicator is available (Frey and Stutzer 2002; Veenhoven 2004).

In the urban economics literature, crowd-sourced assessments of street-level images have been used to determine how safe a neighborhood feels, how clean it looks, or how lively it seems. These are aspects of cities that standard measures, such as income levels, are unable to fully capture (Salesses et al. 2013). Further, this literature has applied machine learning to extend the assessment to other cities, not covered by the original crowd-sourcing (Naik et al. 2016). Machine learning has also been used to predict a neighborhood’s socioeconomic characteristics from its appearance (Glaeser et al. 2018; Naik et al. 2015). Human judgment is also behind imagery interpretation in the remote sensing literature.
Machine learning is often used in this literature to classify billions of “cells” of satellite imagery, for example in terms of their land use. However, satellite images are difficult to interpret, because they are two-dimensional, taken directly from above, and generally lack recognizable details. Therefore, an important step is to generate subjective assessments for a subset of cells that is then used as either a training sample or validation data for the machine learning exercise (Campbell and Wynne 2011).

Building on these precedents, our methodology applies subjective assessments to open-source images generated by Google for a representative sample of places in India. We overlay these images with the digitized boundaries of Indian cities, towns and villages, as administratively defined, and use the portion falling within the corresponding boundaries as the subject for human judgment. Relying on information at the level of cities, towns and villages to shed light on important topics in economic geography has precedents in the literature (Eeckhout 2004; Levy 2009; Michaels, Rauch and Redding 2012). Fairly disaggregated administrative units have also been shown to perform as well as gridded cells when dealing with critical topics in the economic geography literature (Briant, Combes and Lafourcade 2010). An admittedly preferable approach would be to let commuting pattern data delineate local labor markets (Duranton 2015; US Office of Management and Budget 2010). However, data of this sort are rarely available in developing countries, and India is not an exception in this respect.

In our methodology, four groups of assessors independently judge whether the Google images from a place correspond to an urban or a rural settlement. The four groups differ in their familiarity with India and in their expertise in urban issues. The most knowledgeable group consists of in-house research analysts.
A second group comprises university students from the US who have experience in land use classification but no exposure to India. The last two groups are made of crowdsourced anonymous viewers hired through Amazon Mechanical Turk (MTurk) who are unlikely to have expertise in urban issues. In the third group all viewers are from India, while in the fourth one they are all from the US.

Assessments also follow two different protocols. The more structured one requests the assessors to sequentially evaluate the density of construction, the nature of transport infrastructure, and the availability of urban amenities, before making their judgment. All four groups of assessors follow this protocol. But before doing so, the three outsider groups are also asked to make an impromptu judgment of the urban or rural nature of the place, without any guidance. By proceeding this way, every place in our sample of cities, towns and villages is classified in a total of seven rounds.

We find that the classifications are highly consistent, with two thirds of the localities having the same status regardless of the group of assessors and the protocol, and with almost 90 percent of them being classified in the same way in at least five of the seven rounds. We also show that the characteristics of the assessors and the protocol followed affect judgment results, but their overall influence is very small. Given this high level of consistency, we pool all seven rounds of assessments, which results in 50 or more human judgments for each place in the sample, and then classify every place as urban or rural based on a majority rule. We treat this pooled classification as the outcome of our human judgment exercise.

We then develop an approach to predict urban status, as assessed through human judgment, based on observable characteristics of the corresponding places. We emphasize four characteristics for each place.
Total population and population density, both drawn from the population census, are key indicators in the urban economics literature. Built-up cover and nighttime lights, both from satellite imagery, have been at the core of recent attempts to estimate a country’s “true” urbanization rate. We also consider other indicators, but the four key characteristics just listed remain the most important predictors of subjective urban status.

Several approaches, drawn from classical data modeling and from machine learning, are applied for the prediction exercise as well. We first use a standard Logit model linking the urban status of a place with the values of its characteristics. We then apply more sophisticated machine learning methods, including LASSO and random forests. We find that all three approaches yield a similar prediction accuracy, with random forests performing slightly better. These approaches are then used to predict whether each of the other places in the country is urban or rural in practice.

The urbanization rate emerging from this exercise for India is 29.9 percent. Unlike the remarkably high rates reported by some studies, this figure is quite close to the corresponding official estimate. However, we also find important gaps with the official estimates at more disaggregated levels. Consistent with those other, recent studies, we show that many places administratively defined as villages have urban characteristics. But we also find that a significant share of “statutory towns” and “census towns” would be better classified as rural. And there are differences across states as well. While these subnational gaps with official estimates could be interpreted as the outcome of prediction error, we show that the sign of the gaps is consistent with the fiscal incentives and statistical innovations characterizing the administrative classification of places in India.
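The kind of three-way comparison just described can be sketched with standard off-the-shelf tools. The snippet below is purely illustrative: it uses synthetic stand-in data rather than the paper’s census and satellite variables, and approximates the plain Logit with a near-unpenalized logistic regression.

```python
# Illustrative comparison of the three prediction approaches: a Logit
# model, a LASSO-penalized logistic regression, and a random forest.
# The data are synthetic stand-ins for the four key predictors (total
# population, population density, built-up cover, nighttime lights).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 1277  # size of the human-judgment training sample in the paper

X = rng.normal(size=(n, 4))  # four standardized place characteristics
# Synthetic urban/rural labels loosely driven by the characteristics
y = (X @ np.array([1.0, 1.5, 2.0, 1.0]) + rng.normal(size=n) > 0).astype(int)

models = {
    # A very large C makes the penalty negligible, approximating a plain Logit
    "logit": LogisticRegression(C=1e6, max_iter=1000),
    "lasso": LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, acc in accuracy.items():
    print(f"{name}: mean cross-validated accuracy = {acc:.3f}")
```

On the paper’s human-judgment sample all three approaches reach similar accuracy, with random forests slightly ahead; on this synthetic, linearly generated data the ranking may of course differ.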
We also show that our estimate of the urbanization rate is not an outlier relative to estimates based on global urban layer products, as the main discrepancy with them arises from the population data used, rather than from the underlying land use classification.

Beyond offering new insights into the debate on urbanization in India, the exercise in this paper illustrates several strengths of the proposed methodology. First, the methodology is holistic. Because it relies on human judgment to assess the “urbanness” of places, it allows the multiple characteristics of a place to be evaluated jointly. Second, it makes the most of existing data. A growing number of studies are using satellite imagery data to identify urban areas, contributing to urban studies (Donaldson and Storeygard 2016). However, our analysis shows that relying on built-up cover or nighttime lights alone can be insufficient. And third, our methodology is scalable. High-quality Google images are becoming available for an increasing number of places, and human judgment can be crowd-sourced efficiently nowadays.

A legitimate concern with our methodology is its reliability, in the sense that small changes to its design could result in important changes in its outcomes. However, the variants tried in this paper suggest that the results are robust. Our analysis shows that assessor characteristics and decision protocols do not significantly affect the classification of places. And the prediction approaches used to extend the classification beyond the original sample do not make a major difference either. Therefore, the proposed methodology may contribute to a growing literature in urban economics that applies innovative approaches to the identification of urban areas (de Bellefon et al. 2018; Dingel et al. 2019; Rozenfeld et al. 2011).

The paper is organized as follows. In Section 2 we present the stratified sampling strategy.
In Section 3 we describe the images used for assessment and the collection of human judgments from different assessors. We also compare classifications based on different human judgment exercises. In Section 4 we introduce the additional sources of data and apply three approaches—Logit model, and LASSO and random forests methods—to predict the urban status of places in the sample. In Section 5 we show the results on the estimated urbanization rate for India in 2011, and also for finer administrative categories and for states. Finally, we compare our results with urbanization rates from other studies.

2. A sample of places

Important information about cities, towns and villages is captured by the administrative boundaries of localities. These boundaries are generally built on historical data and tend to reflect the best knowledge available on the spatial distribution of economic activities and people. They also define the constituency to which each local government provides services and is accountable. In this study, we used the administrative units adopted by the Census of India as our unit of analysis.

The universe of places

To compile the universe of places we took advantage of newly digitized boundaries of administrative units in India, available down to the town or village level. These boundaries are based on India’s Administrative Atlas 2011 (ORGI 2011b). They were generated as part of a broader research project, the Spatial Database for South Asia (Li et al. 2015). The construction of the boundaries required scanning and georeferencing 6,598 physical maps: 35 maps of states and union territories, 640 district maps, and 5,923 subdistrict maps that present the boundaries of towns and villages. The location, size, and shape of these towns and villages were digitized in the form of 649,818 boundary polygons. Attributes such as place codes, names, and administrative types were added to the polygons.
Administrative units need to be integrated further in the case of metropolitan areas, which represent broader integrated labor markets. The preferred approach in this respect is to base the integration process on commuter pattern data (Duranton 2015; US Office of Management and Budget 2010). This type of data is unfortunately not available for most developing countries, including India. In its absence, we simply merged the polygons of cities spreading over multiple administrative units, based on their names and unique geo-codes.

We made two other adjustments to the digitized administrative boundaries. In the original Spatial Database some villages were presented as points, because their boundaries were unavailable on the source maps; they are mostly in the mountainous or forest areas of the states of Arunachal Pradesh, Chhattisgarh, Himachal Pradesh, and Meghalaya. And population information was not available in the 2011 Census of India for some villages, which raises concerns about the quality of the corresponding official maps. We therefore excluded villages without boundaries or population data.

As a result of these mergers and exclusions, the number of administrative units we retained as the universe of places for the analysis fell to 564,052.

Two concerns can be raised on the use of such administrative units as the unit of analysis. The first one, known as the modifiable areal unit problem (MAUP), is the potential sensitivity of results to their considerable diversity in shape and size. It has been shown that shape does not make much of a difference for three important topics in the economic geography literature (namely, spatial concentration, agglomeration economies, and inter-regional trade). As for size, administrative boundaries do not suffer significantly from the MAUP, except in the case of large-scale spatial units, such as states or provinces (Briant, Combes and Lafourcade 2010).
However, none of the administrative units considered here reaches such a large scale. A second concern is whether administrative units, self-standing or merged in the case of bigger cities, provide an accurate representation of integrated labor markets. An alternative is to delineate the urban extent by aggregating contiguous cells. Depending on the studies, the aggregation of cells is based on criteria such as the density of their built-up cover or the intensity of their nighttime light (Baragwanath et al. 2019; Ch et al. 2018; Rozenfeld et al. 2011). However, the nature of the links between cells in a broader spatial context may vary across localities. The aggregation process can introduce other biases (Bosker et al. 2018).

For selecting a representative sample, we further restricted the set of administrative units to those of the 21 largest states and union territories (hereafter states). These large states are covered by the high-quality household surveys whose data are needed for a proper stratification. Focusing exclusively on them resulted in 561,624 places being considered for sampling. The total population of these 561,624 polygons in 2011 was 1,174 million, or 3.0 percent less than reported by the 2011 Census of India for the country as a whole (ORGI 2011a). Almost a third of this population gap (about 11.3 million people) is accounted for by the small states excluded from the analysis. Some of these small states are richer than the Indian average and some are poorer, which suggests that their exclusion from the analysis may not bias results much. The rest of the gap is associated with the villages excluded because their boundaries or population figures were unavailable in the official sources. These villages are most likely rural in practice, and their population is small (about 200 people at the median), which implies that they too are unlikely to be an important source of bias.
A two-stage stratified sampling

We used a two-stage disproportionate stratification strategy to select the sample of administrative units for human judgment (figure 1). In the first stage, we divided the universe into seven strata by administrative categories, following the definitions of the 2011 Census of India. Four of these categories (state capitals and municipal corporations, municipalities, industrial towns, and other towns) are statutorily designated as urban. Two of the categories (census towns and outgrowths) are statutorily rural but recognized by the 2011 Census as urban. The seventh category is villages, or areas that are rural according to both the Constitution and the 2011 Census of India (ORGI 2011a).

Figure 1. Two-stage stratification sampling
Note: The figure depicts the sampling process. Starting from a universe of N = 561,624 places in 21 large states with high-quality household expenditure information, a first-stage stratification by administrative category and a second-stage stratification by state-level household expenditure, followed by random draws without replacement, yield a sample of n* = 1,277 places.

In the first stage we relied on a disproportionate sampling strategy. The 561,624 administrative units comprise 3,852 statutory urban places, 4,645 places that are urban in terms of the census, and 553,127 villages. If a proportionate sampling strategy had been adopted, the share of statutory and census urban places in the sample would have been a mere 2 percent. But villages are likely to have rural characteristics despite the limitations of administrative definitions, while misclassification is more likely for statutory and census urban places. It therefore made sense to overrepresent these two groups.

For each administrative category stratum, we chose a sample size sufficiently large to ensure the accuracy of subsequent analyses. There is a well-known trade-off in this respect. A small sample could fail to generate enough information, whereas a large sample would be costly to implement.
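To make this trade-off concrete, the calculation involved can be sketched as a normal-approximation sample size for a proportion with a finite-population correction. This is a hedged illustration rather than the paper’s exact formula (which is in its appendix 1), but at a 95 percent confidence level it comes close to several of the stratum sizes reported in table 1.

```python
# Sketch of a stratum-level sample-size calculation: normal-approximation
# (CLT) sample size for estimating a proportion, with a finite-population
# correction. An illustration, not the paper's appendix 1 formula.
import math


def stratum_sample_size(N, p, e, z=1.96):
    """Sample size for estimating a proportion with prior p within margin
    of error e at 95% confidence, corrected for a finite population N."""
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)  # infinite-population size
    n = n0 / (1 + (n0 - 1) / N)             # finite-population correction
    return round(n)


# Municipalities row of table 1: N = 905, prior p = 0.90, margin e = 0.05
print(stratum_sample_size(905, 0.90, 0.05))  # → 120, matching table 1
```

The same call reproduces, for instance, the capitals and municipal corporations stratum (N = 147, p = 0.97, e = 0.05 gives 34) and the census towns stratum (N = 3,710, p = 0.71, e = 0.05 gives 292), though not every row of the table.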
Building on the survey design literature, we rely on the Central Limit Theorem to determine the sample size of each stratum (appendix 1).

In the second stage, we divided each of the seven strata into three substrata defined by average consumption per capita at the state level. Consumption data were from the household consumer expenditure module of the 68th round of the National Sample Survey of India (NSSO 2012). Monthly household expenditure was divided by household size to estimate consumption per capita. To account for the spatial variation in prices, the result was deflated by a year-specific price index that differs across states and between rural and urban areas. We ranked states on the basis of their average consumption per capita and divided them into three groups of equal size: richer, intermediate, and poorer. The sample for each administrative stratum was split evenly across these three groups of states. The sampling process was completed by conducting random draws without replacement for each of the 21 substrata.

The resulting sample of places

The sample drawn through this process comprises 1,277 places, of which 405 are statutorily urban and 557 are urban in terms of the census; the remaining 315 places are villages (table 1). For the full sample we achieved a relative standard error of 8 percent, which is below the 10 percent threshold required for a survey design to be considered good (Knottnerus 2003; UN 2005). Recognizing that the Central Limit Theorem may not apply for a finite sample, we used bootstrapping to generate an approximation of the distribution through a Monte Carlo simulation (Cameron and Trivedi 2005). The analysis confirmed that the sample was sufficiently large for each administrative stratum. Standard errors were below 3 percent for six administrative strata, and around 5 percent for industrial towns.

Table 1.
The distribution of the sample across administrative categories

Administrative category               N        Margin    Prior estimate of   n*
                                               of error  the probability
                                                         of being urban
Capitals and Municipal Corporations   147      0.05      0.97                34
Municipalities                        905      0.05      0.90                120
Industrial Towns                      31       0.05      0.90                25
Other Towns                           2,769    0.05      0.80                226
Census Towns                          3,710    0.05      0.71                292
Outgrowths                            935      0.05      0.60                265
Villages                              553,127  0.03      0.10                315
Total                                 561,624                                1,277
Mean probability of being urban: 0.11
Relative standard error: 0.08
Note: N is the total number of places belonging to an administrative category and n* is the selected sample size for an administrative category.

Figure 2. The spatial distribution of sample places

Our final sample of 1,277 administrative units is quite evenly distributed across India’s territory (figure 2). The selected places cover five of the six regions of the country (northern, central, eastern, western, and southern) relatively well. Only for the northeastern region is the coverage of the sample limited. This is partially due to the exclusion from the universe of Manipur, Mizoram, Tripura, Meghalaya, and Assam, five small states with relatively poor household survey data. It is also due to the shortcomings of village boundaries in the states of Arunachal Pradesh and Meghalaya. Not surprisingly, the places in the sample are quite different in size, with their surface varying from 0.1 to 464.7 km2.

3. Human judgment

Places are usually classified as urban or rural based on a narrow set of indicators that most often includes population and administrative status. Sometimes additional metrics such as population density or the share of employment outside agriculture are considered as well. But even this may not be enough to ensure that the classification is consistent. Cities have diverse characteristics both across and within countries. This multiplicity of aspects to consider calls for an element of human judgment.
Google images as virtual site visits

A thorough subjective assessment would require visiting every city, town, and village in the sample and literally seeing what they look like. However, the time and cost implications of visiting 1,277 places scattered across a large country are dissuasive. To resolve a similar challenge, an emerging literature uses Google images as a proxy for information gathered through field surveys. Salesses et al. (2013) rely on over 4,000 Google street view images to crowdsource perceptions on safety and livability for two major cities of the US. In a scaling-up effort, Dubey et al. (2016) expand this approach to over 110,000 images from 56 cities. Naik et al. (2016) use the images associated with quantified human judgment as the training set and develop a model to further quantify neighborhood appearance for 19 cities in the US through machine learning. And Naik et al. (2015) apply the machine learning model to study the changes in neighborhood appearance and their relationship with neighborhood socioeconomic characteristics.

Following the literature, we relied on open-source images accessible through Google Earth and Google Maps as a proxy for real site visits. The images provided by Google are an integrated collection of processed satellite imagery, aerial photos and street view pictures. A zoom-in option allows exploring details with a resolution ranging from 15 cm to 15 m. With these images a layperson, with no special training in remote sensing, can judge the characteristics of large numbers of places. 2 The views convey information on the type of land cover, the characteristics of man-made structures, the layout of these structures, and transportation vehicles on roads. While three-dimensional images are not available for India, the combination of rotation and tilt options allows examining each place from different angles, giving a sense of the height and quality of buildings.
A street view option providing 360-degree panoramic ground-level photos is also available for selected places in India. And tags highlight amenities such as educational institutions, health facilities, religious buildings and recreational parks, among others.

To support the virtual assessment of the places in the sample, we overlaid the digitized boundary of each of these places on top of Google images, and used the images falling within the corresponding boundary as the subject for human judgment (figure 3). For the vast majority of places, we relied on images taken in 2010–12, to be consistent with the population data from the 2011 Census of India. However, when the images for these three years were of poor quality, we complemented them with images from more recent years. Similarly, when Google Earth images were not available, we used Google Maps images that were composites from recent years.

2 Information on these images is available at https://www.google.com/streetview/explore/, https://support.google.com/earth/answer and https://support.google.com/mapsdata/.

Figure 3. Examples of Google images overlaid with administrative boundaries
a. Jigani (Karnataka)
b. Bahadurganj (Bihar)
Note: The yellow lines depict the digitized administrative boundaries for the respective administrative units.

Diverse groups of assessors

To collect subjective assessments of Google images of the 1,277 places in the sample we mobilized a broad pool of assessors, emphasizing heterogeneity along two dimensions: familiarity with India and knowledge of urban issues. Heterogeneity is important to explore whether the assessors’ characteristics matter. The two dimensions considered led to the constitution of four distinct groups of assessors.

Our first group of assessors were three in-house research analysts who were both familiar with India and knowledgeable on urban issues.
These analysts all had graduate-level degrees in economics and were given background information on the purpose of the exercise. While they were originally from three different developing countries – China, India and Peru – they were all familiar with India, as they worked at the time in the South Asia Region of the World Bank.

The second group of assessors comprised 15 university students who were knowledgeable on urban issues but unfamiliar with India. They all had experience in land-use classification using satellite images, because of their activities at the Center for Urban and Environmental Research at George Washington University (GWU). These students participated on a voluntary basis, for a fixed payment upon completing the classification of all the places assigned to them for judgment. The students were all from the US.

The third and fourth groups of assessors were anonymous online workers mobilized through Amazon's Mechanical Turk (MTurk). This is a crowdsourcing marketplace that facilitates the virtual breakdown and distribution of manual, time-consuming tasks among thousands of anonymous workers, and it has become increasingly popular as a tool for research.3 While MTurk workers often have a college degree or above, they are unlikely to be urban experts (Ross et al. 2010). To ensure quality, we required participants in the third and fourth groups of assessors to have MTurk "master" accreditation, which reflects substantial experience and low rates of rejection. MTurk allows restricting the nationality of participants in crowdsourced tasks. Our third group of assessors was thus made up exclusively of MTurk workers from India, and the fourth one exclusively of MTurk workers from the US. In all, the third group had 72 members and the fourth one 207.

Two assessment protocols

All four groups of assessors followed a tightly structured protocol for the classification of places.
However, before being exposed to this structured approach, assessors in the second, third and fourth groups were also asked to provide an impromptu judgment. Without any guidance, they had to decide by themselves whether the Google images shared with them corresponded to urban or rural places. As a result, each of the places in the sample was subjected to seven rounds of classification: three of them impromptu and four following a structured protocol.

Under the structured protocol, assessors were requested to rely on a decision tree to classify the places allocated to them. The decision tree had three nodes, each involving the interpretation of Google images along one dimension. The first node referred to the distribution of land cover types. The second node was about the characteristics of the buildings and the relationship between them. And the third one concerned the presence of transportation networks and the availability of amenities such as schools, universities, hospitals or cultural sites (appendix 2).

The decision tree drew on the economic geography and urban economics literatures. Indeed, models of land use predict that the density of construction declines as one moves away from a city center to more rural areas (Alonso 1964; Mills 1967; Muth 1969; Brueckner 1987; Duranton and Puga 2015). It has also been shown that large infrastructure investments facilitating access to markets spur the concentration of economic activities (Duranton and Puga 2004; Fujita et al. 2001; Krugman 1991; Scotchmer 2002). Finally, services and amenities are crucial in explaining where firms prefer to operate, and where households prefer to live (Ahlfeldt et al. 2015; Combes et al. 2010; Straszheim 1987).

While the structured protocol was the same across all four groups, the diversity of the assessors required slight adjustments to its implementation in each case.
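The three-node protocol can be sketched in code. The node order follows the text (land cover, then buildings, then transport and amenities), but the exact criteria of appendix 2 are not reproduced here, so the boolean inputs below are illustrative placeholders rather than the actual decision rules.

```python
def structured_judgment(land_cover_mostly_built_up: bool,
                        buildings_dense_and_contiguous: bool,
                        has_transport_and_amenities: bool) -> str:
    """Illustrative sketch of the three-node decision tree used to classify
    a place from Google images. Actual node criteria are in appendix 2."""
    # Node 1: distribution of land cover types
    if not land_cover_mostly_built_up:
        return "rural"
    # Node 2: characteristics of the buildings and the relationship between them
    if not buildings_dense_and_contiguous:
        return "rural"
    # Node 3: transportation networks and availability of amenities
    return "urban" if has_transport_and_amenities else "rural"
```

Each node narrows the judgment sequentially, so a place must pass all three checks to be classified as urban under this sketch.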
With the first group, the sample of cities, towns, and villages was randomly split into three subsets, each allocated to one of the three analysts. A randomly selected subset of 10 places was reallocated from the original research analyst to a different one to assess the robustness of the classifications. When one research analyst could not reach an unambiguous conclusion on a specific place, another analyst was called in to consult and make a joint decision.

With the second group we used an open-source data collection tool, Collect Earth, to be able to work remotely. Collect Earth enables data gathering and image analysis through Google Earth.4 For each university student in the second group we created two Collect Earth "projects" including 640 randomly chosen places each, and we shared them sequentially. For places in the first project, the assessors were asked to decide without any guidance whether they were urban or rural. For the second project, on the other hand, they had to follow the structured protocol.

The procedure was similar for the anonymous MTurk workers in the third and fourth groups of assessors. We created two MTurk "tasks" including links to the Google images of all 1,277 places with overlaid administrative boundaries. The first task requested impromptu assessments, while for the second one the assessors had to follow the structured protocol. Because MTurk workers are anonymous, we could not ensure that the assessors participating in the first and second rounds were the same. But for each round, we could specify that each of the 1,277 places needed to be assessed exactly 10 times and that the same MTurk worker could not classify any place more than once.

3 Information on Amazon's Mechanical Turk is available through https://www.mturk.com/.
4 Information on Collect Earth is available through http://www.openforis.org/tools/collect-earth.html.
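The assignment constraint just described (each place assessed exactly 10 times, no worker seeing a place twice) can be met with a simple rotation over the pool of workers. The sketch below is hypothetical; MTurk enforces these constraints through its own task settings.

```python
from itertools import cycle

def assign_places(places, workers, k=10):
    """Assign each place to exactly k distinct workers, so that no worker
    assesses the same place more than once. Illustrative sketch only."""
    assert k <= len(workers)
    rotation = cycle(workers)
    assignment = {}
    for place in places:
        # k consecutive draws from a cycle of >= k workers are always distinct
        assignment[place] = [next(rotation) for _ in range(k)]
    return assignment
```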
Assessing consistency

Relying on human judgment to classify places raises understandable concerns about the consistency of the outcomes depending on the characteristics of the assessors and the protocols they follow. Relying on administrative boundaries as the unit of analysis also implies that places can have very different sizes, and this diversity could affect human judgment. We addressed these two concerns sequentially and found that the classification of places generated by our methodology was reassuringly stable.

First, we conducted a regression analysis linking the probability of a place being judged urban by an assessor with the characteristics of the assessor and with the protocol followed. We used a standard Logit model, with one observation per place and assessor, and with errors clustered at the place level. The two assessor characteristics considered were familiarity with India and knowledge of urban issues. The results show that more places are considered urban by assessors familiar with India, while fewer places are judged to be urban among assessors who are knowledgeable on urban issues (table 2). It also appears that following a structured protocol leads to more places being classified as urban. However, while all coefficients are significant, the overall explanatory power of the regression is very small, with the pseudo R-square never exceeding 0.01. This means that subjective biases exist but do not have a substantial impact on the classification of places.

Another way to check whether human judgment leads to consistent results is to compare the classification of places across the seven rounds of assessments. To determine whether a place is urban or rural, we applied the majority rule to all the available assessments within each round. The few places for which we faced a tie were excluded from the analysis. Overall, assessments are quite consistent.
Two thirds of places have the same status in all seven rounds, an additional 13 percent are classified identically in six rounds, and yet another 10 percent in five rounds. The classifications also appear to be highly consistent when considering pairwise comparisons between the seven rounds. The percentage of places assessed in the same way in two different rounds of classification varies between 82.3 and 96.7 (table 3), regardless of whether the few tied cases are included or not. The result is similar if tetrachoric correlation coefficients are considered instead, with the estimated coefficients hovering around 0.9.

Table 2. The relationship between classifications, assessor characteristics and protocols

                                    (1)                  (2)                  (3)                  (4)
                                coef.     dy/dx      coef.     dy/dx      coef.     dy/dx      coef.     dy/dx
Familiar with India            0.371***  0.091***                                              0.131***  0.032***
                              (0.015)   (0.004)                                               (0.017)   (0.004)
Knowledgeable on urban issues                       -0.296*** -0.073***                       -0.193*** -0.047***
                                                    (0.018)   (0.004)                         (0.018)   (0.004)
Structured protocol                                                       0.397***  0.097***   0.297***  0.072***
                                                                         (0.014)   (0.003)    (0.017)   (0.004)
Observations                   68,978               68,978                68,978               68,978
Log pseudolikelihood          -47030               -47129                -46904               -46857
Pseudo R2                      0.004                0.0019                0.0067               0.0077

Note: Results are from a Logit model with the dependent variable being whether a place is judged to be rural (0) or urban (1) by an assessor, with one observation per assessor and place. Coef. stands for estimated coefficient, dy/dx is the marginal effect, the numbers in parentheses are standard errors clustered at the place level, and statistical significance is reported by *** if p < 0.01, ** if p < 0.05, and * if p < 0.1.

Table 3.
Percentage of agreement between rounds of assessments

         (1)    (2)    (3)    (4)    (5)    (6)    (7)
(1)      100   82.3   83.7   85.6   88.2   87.6   89.1
(2)             100   96.7   91.9   86.0   89.3   87.0
(3)                    100   92.0   86.0   89.3   86.6
(4)                           100   89.5   93.5   91.3
(5)                                  100   92.6   93.0
(6)                                         100   93.6
(7)                                                100

Rounds: (1) World Bank analysts, structured protocol; (2) GWU students, impromptu judgment; (3) GWU students, structured protocol; (4) MTurk-India, impromptu judgment; (5) MTurk-India, structured protocol; (6) MTurk-USA, impromptu judgment; (7) MTurk-USA, structured protocol.

Given this high level of consistency across assessors and protocols, we pooled all seven rounds of assessments, which resulted in 50 or more human judgments for each place in the sample. We then classified every place as urban or rural based on a majority rule over these 50 or more judgments, and got only one tie among the 1,277 places in the sample.

Second, we assessed whether the size of administrative units affected human judgment. To do this, we classified the 1,277 places in the sample into 20 size quantiles and computed the share of places in each quantile that was considered urban based on the pooled classification. For the top three size quantiles the share of urban places is high and increases with their size. But for the other 17 quantiles there is no clear correlation between size and urban share (appendix 3). We also investigated the relationship between the size of a place and the share of its surface that is built-up, which provides the first node in the decision tree followed by the assessors under the structured protocol. We found that the median built-up share varies across quantiles, but it displays no consistent relationship with size. This finding gave us further assurance that human judgment is not affected by the size of administrative units.
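The pooling step (a majority rule over all available judgments for a place, with ties set aside) can be written as:

```python
from collections import Counter

def pooled_classification(judgments):
    """Majority rule over pooled human judgments ('urban'/'rural').
    Returns None on a tie, mirroring the exclusion of tied places."""
    counts = Counter(judgments)
    if counts["urban"] > counts["rural"]:
        return "urban"
    if counts["rural"] > counts["urban"]:
        return "rural"
    return None  # tie: only one such place among the 1,277 in the sample
```

With 50 or more judgments per place, ties are rare, which is consistent with the single tie found in the sample.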
The classification of the sample

Given the consistency of outcomes across assessors, protocols and sizes, we treated the classification of places resulting from the pooled assessments as the outcome of our human judgment exercise. This pooled classification turns out to be generally aligned with the official status of the places in the sample, but there are also some significant discrepancies. For example, Jigani, in Karnataka, is administratively rural but meets most of the criteria to be considered urban in practice (figure 4a). The built-up area covers a substantial share of the place's surface, the buildings are compact, there is a network of roads both inside the place and linking it to external markets, and signs of amenities are clearly visible. The place is classified as urban in 56 of the 57 assessments we have for it. Conversely, Bahadurganj, in Bihar, is administratively urban, but its cropland areas are vast, the built-up area is relatively small, buildings are scattered, amenities are few, and the road network is limited both within and outside the place (figure 4b). This place is considered rural in 42 of the 57 assessments available.

Figure 4. Examples of discrepancies with the administrative classification
a. Jigani (Karnataka) is reclassified as urban
b. Bahadurganj (Bihar) is reclassified as rural
Note: The yellow lines depict the digitized administrative boundaries for the respective administrative units.

According to the pooled classification, 43 percent of the 1,277 places in the sample are urban in practice, and 57 percent are rural (table 4). The sampling strategy therefore does ensure a balanced representation of rural and urban places. A non-negligible fraction of statutorily urban places is classified as rural in practice. Discrepancies are also large for places deemed urban in terms of the census, many of which are assessed as rural in practice.
But there are no substantial differences for most villages, whose rural status is confirmed by the subjective assessments.

Table 4. Classification based on human judgments versus administrative classification

Administrative classification    Assessed urban (percent)    Assessed rural (percent)    Total (percent)
Statutory towns                           23.1                        8.6                     31.7
Census towns                              20.0                       23.6                     43.6
Villages                                   0.2                       24.5                     24.7
Total                                     43.3                       56.7                    100.0

4. Machine learning

The urban status of places derived from human judgments can be used to infer the urban status of the other places in the universe. Doing this requires first learning from the sample and then using the results to make out-of-sample predictions. A first question is which indicators to consider for this exercise. Official measures of urbanization are based on traditional data sources, such as population censuses, whereas more recent studies rely on modern data, and especially on satellite imagery. A second question refers to the statistical approach to be used. A specific data generation process is assumed in the classical statistical tradition, but supervised machine learning is becoming increasingly popular. Our methodology brings together traditional and modern data, and compares the outcomes depending on whether classical econometric models or machine learning are used for the prediction.

Traditional and modern data

We chose as key covariates for the prediction a set of indicators that appear recurrently in the urban literature and are observable for all administrative units in India. Traditional data were from the 2011 Census of India. These data were georeferenced to the digitized boundaries of administrative units as part of the Spatial Database for South Asia project. The range of georeferenced indicators available for all places in the country as of 2011 is wide (Li et al. 2015).
Following the practice of national governments and international organizations, we selected population and population density at the town and village level as the key indicators. This choice is consistent with that of studies relying on high-quality microdata from the ground for more advanced economies (Michaels, Rauch and Redding 2012; Rozenfeld et al. 2011). Population density was computed as the ratio between the population of the administrative unit and the size of the area within the corresponding boundary.

Modern data were obtained from open-source products built on satellite imagery. Following recent global products, built-up cover and nighttime lights are the preferred data in this respect. For built-up cover, we selected the Landsat-based product provided by the Global Human Settlement Layer, hereafter referred to as GHSL (Pesaresi et al. 2016). The product is derived from images taken by sensors aboard the Landsat 7 and Landsat 8 satellites. As one of the first civilian satellite programs on land cover, Landsat has produced the longest continuous space-based record of the Earth's land surface. Land-cover products derived from the Landsat 7 and Landsat 8 satellites have been widely used for academic research, because of their high spatial resolution, high spectral resolution and long temporal coverage (Burchfield et al. 2006).5

We compared GHSL with two other products reporting built-up cover. One of them is based on imagery from MODIS and the other on imagery from the twin satellites TerraSAR-X and TanDEM-X (Friedl et al. 2010; Esch et al. 2017).6 Because of its higher resolution, GHSL captures built-up cover in forest areas, such as Kerala, better than MODIS. The correlation between built-up data based on GHSL and those based on TerraSAR-X and TanDEM-X imagery is very high. For the built-up share at the town and village level it reaches 0.96.
In light of this, and also given the wide use of the Landsat-based land-cover product, we chose GHSL 2013/14 as our data source. Our built-up share indicator was computed for every place as the ratio between its built-up area and its total area.

For nighttime lights, we relied on the Global Radiance Calibrated Nighttime Lights product. Data in this case are derived from images taken by the Operational Linescan System (OLS) sensors aboard satellites from the US Air Force Defense Meteorological Satellite Program (DMSP). The product has a higher quality than other products because it captures stable lights, addresses the saturation problem, and is intercalibrated to reduce inconsistencies across satellites (Elvidge et al. 2009; Hsu et al. 2015; NOAA 2014). The intensity of nighttime lights is reported in Digital Numbers (DN) as of 2011.

A well-known challenge when using data on nighttime lights is that even areas without human activity may appear to be lit up (Henderson et al. 2003; Li and Zhou 2017). This may be due to blooming, to the reflection of light from surrounding water, to the sensitivity of the sensors, or to georeferencing errors made while capturing and composing the data. To avoid an overestimation of the lit-up share, a threshold is generally set for nighttime light intensity, and only observations above the threshold are considered urban. However, the appropriate threshold varies considerably across countries, and it is correlated with their level of economic development. To identify the appropriate threshold, we reviewed a dozen studies that together present 16 threshold estimates for different countries. We then matched their results with the corresponding gross domestic product per capita. We found a positive and statistically significant correlation between threshold estimates and GDP per capita. Using the estimated relationship, we concluded that the appropriate threshold for India in 2011 was 15 DN (Galdo et al. 2018).
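Applied to a place's pixels, the 15 DN threshold translates into a lit-up share as follows. This is a sketch; the actual computation is carried out on georeferenced rasters clipped to each administrative boundary.

```python
def lit_up_share(pixel_dn, threshold=15):
    """Share (in percent) of a place's pixels whose nighttime-light
    intensity, in Digital Numbers, lies above the threshold.
    threshold=15 DN is the value estimated for India in 2011."""
    lit = sum(1 for dn in pixel_dn if dn > threshold)
    return 100.0 * lit / len(pixel_dn)
```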
Our lit-up share indicator was thus computed as the fraction of a place's surface with light intensity above this threshold.

Beyond these four indicators, our framework is flexible to the inclusion of other metrics providing information on how urban a place may be in practice. Including these other metrics in the prediction model helps assess whether the four indicators we consider are informative enough on their own. Following the literature, in some of the analyses we added the normalized difference vegetation index (NDVI) and the normalized difference water index (NDWI) to improve prediction accuracy. Both of these metrics are derived from Landsat satellite imagery, with NDVI related to the coverage of vegetation and NDWI related to the water content of the Earth's surface.

The summary statistics for our key indicators differ between the sample and the universe, because the former was designed so as to over-represent urban places (table 5). As expected, the mean values for population, population density, the built-up share and the lit-up share are substantially higher in the sample than in the universe. On the other hand, the mean values of NDVI and NDWI are not statistically different, suggesting that they may not help improve the prediction substantially.

5 The Landsat 7 satellite was launched in 1999. It has seven spectral bands at a spatial resolution of 30 m and a temporal frequency of 16 days. Landsat 8 was launched in 2013. It includes nine spectral bands with the same spatial resolution and temporal frequency as Landsat 7.
6 The MODIS-based product relies on images taken by sensors aboard satellites from the Earth Observing System, known as the Land Cover Type Yearly Grid. The product has a resolution of 500 m and is available from 2001 onward. The imagery from the twin satellites TerraSAR-X and TanDEM-X consists of radar images, which are part of the recent WorldDEM program. Built-up cover data for 2011 are available at a resolution of 12 m.

Table 5.
Summary statistics of key indicators

Selected sample                        Observations      Mean    Std. Dev.
Population size (thousands)                1,277        46.79      392.48
Population density (people per km2)        1,277        3,085       6,057
Built-up share (percent)                   1,277        13.05       19.99
Lit-up share (percent)                     1,277        52.97       45.45
NDVI (-1 to 1)                             1,277         0.29        0.10
NDWI (-1 to 1)                             1,277         0.09        0.09

All                                    Observations      Mean    Std. Dev.
Population size (thousands)              564,052         2.10       33.53
Population density (people per km2)      564,052          693       2,700
Built-up share (percent)                 564,052         0.81        4.43
Lit-up share (percent)                   564,052         8.57       25.93
NDVI (-1 to 1)                           564,052         0.31        0.11
NDWI (-1 to 1)                           564,052         0.08        0.09

Classical data modeling

Subjective assessments of the places in the sample can be used to infer the status of all other places in the universe. In line with the classical statistical tradition, this can be done using an econometric model, in which case an explicit probabilistic distribution is assumed for the data. The most common functional choices in the classical statistical tradition are the Logit and Probit models, which respectively assume a logistic and a normal distribution for the data (Cameron and Trivedi 2005).

Following the classical statistical tradition, we specified a probabilistic model of whether a place is urban or rural in practice. We used the standard Logit model linking the urban status of a place with the values of its key indicators. The specification took the following form:

y_i = \begin{cases} 1 & \text{if place } i \text{ is assessed as urban} \\ 0 & \text{otherwise} \end{cases} \qquad p_i = \Pr[y_i = 1 \mid x_i] = \Lambda(x_i'\beta) \qquad (1)

where y_i is the subjective assessment of place i, p_i is the expected likelihood that the place is urban given the values of its key covariates x_i, and \Lambda(\cdot) is the logistic distribution function. A linear function of x_i was assumed as the argument of \Lambda(\cdot).
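Equation (1) maps the linear index of covariates into a probability through the logistic function; in code (the coefficient values used in the test are illustrative, not the estimated ones):

```python
import math

def logistic(z):
    """Logistic distribution function Lambda(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def urban_probability(x, beta, constant):
    """p_i = Lambda(constant + x_i'beta): the predicted probability that a
    place with covariates x is urban in practice."""
    index = constant + sum(b * v for b, v in zip(beta, x))
    return logistic(index)
```

The probability is monotonically increasing in each covariate with a positive coefficient, which is why the estimated signs on population and built-up share can be read directly.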
We relied on maximum likelihood estimation, implying that the parameters were approximated by the values that maximize the log likelihood function \ln L(\beta \mid y, x):

\hat{\beta} = \arg\max_{\beta} \sum_{i=1}^{n} \left[ y_i \ln \Lambda(x_i'\beta) + (1 - y_i) \ln\left(1 - \Lambda(x_i'\beta)\right) \right]
            = \arg\min_{\beta} \sum_{i=1}^{n} \left[ \ln\left(1 + e^{x_i'\beta}\right) - y_i x_i'\beta \right] \qquad (2)

The Logit model was applied to the pooled classification of the 1,277 places in the sample. The key indicators were population, population density, built-up share, and lit-up share. The corresponding coefficients had the expected signs, and most of them were statistically significant regardless of the specification chosen (table 6). The specification including these four key indicators provided the benchmark for our analysis. However, to enrich the model for prediction, we further expanded the covariates to include NDVI and NDWI, quadratic terms of individual indicators, and terms interacting two indicators at a time. A comparison of results across specifications suggests that, overall, built-up share and population size are the most stable predictors of urban status.

The most critical step in the analysis is the classification of the out-of-sample places. Inferring prediction power from explanatory power is potentially misleading because there is a risk of overfitting the sample. Following the literature, we evaluated the different specifications on their prediction accuracy. A popular way of doing this is through cross-validation, which involves separating the sample into a training set, on which a prediction model is built, and a validation set, on which the model's performance is evaluated. In our analysis we randomly partitioned the sample of 1,277 places into 10 equally sized subsamples, took one of the subsamples as the validation data and the rest as training data, and computed the percentage of observations correctly predicted. The process was repeated ten times, once with each of the ten subsamples used as the validation data (Hastie et al. 2017; Mullainathan and Spiess 2017).
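The 10-fold partition underlying the cross-validation can be sketched as:

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Randomly partition n observations into k near-equal folds.
    Each fold serves once as the validation set, with the remaining
    k-1 folds used as training data."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # strided slicing after shuffling yields folds whose sizes differ by at most one
    return [indices[i::k] for i in range(k)]
```

Averaging the share of correctly predicted observations over the k held-out folds gives the prediction-accuracy figures reported below.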
Varying the number of folds or applying Monte Carlo cross-validation did not affect the conclusions.

The results of this 10-fold cross-validation confirmed that combining multiple sources of data increases prediction accuracy in the Logit model (figure 5a). The prediction accuracy of the benchmark specification reached 85.9 percent. This specification performed significantly better than the more parsimonious ones based exclusively on indicators from traditional data or from modern data. On the other hand, the richer specifications improved prediction accuracy over the benchmark specification only at the margin.7

7 We replicated all prediction exercises of all three approaches with even higher-order polynomials but found no improvement in prediction accuracy. Results are available upon request.

Table 6. Logit models on the probability of being classified as urban

Columns (1)-(3): traditional data; columns (4)-(8): modern data

                        (1)        (2)        (3)        (4)        (5)        (6)        (7)        (8)
Population           0.101***             0.0916***
                    (0.00715)            (0.00720)
Population density              0.346***  0.127***
                               (0.0291)  (0.0243)
Built-up share                                       0.134***                                      0.110***
                                                    (0.00925)                                     (0.00940)
Lit-up share                                                    0.0234***                          0.0107***
                                                               (0.00148)                          (0.00177)
NDVI                                                                      -2.816***
                                                                          (0.575)
NDWI                                                                                 -0.415
                                                                                     (0.639)
Constant            -1.562***  -1.235***  -1.794***  -1.519***  -1.603***   0.550***  -0.233***  -1.919***
                    (0.0977)   (0.0887)   (0.109)    (0.0929)   (0.112)    (0.176)    (0.0817)   (0.122)
Observations         1,276      1,276      1,276      1,276      1,276      1,276      1,276      1,276
Pseudo R2            0.282      0.148      0.306      0.317      0.170      0.0142     0.000242   0.338
Log likelihood      -626.5     -744       -606       -596.2     -724.3     -860.4     -872.6     -578.1
BIC                  1267       1502       1233       1207       1463       1735       1760       1178

Columns (9)-(12): Benchmark, Extended, Quadratic and Full specifications

                              Benchmark (9)  Extended (10)  Quadratic (11)  Full (12)
Population                      0.105***       0.106***       0.0994***      0.163***
                               (0.00848)      (0.00858)      (0.00984)      (0.0431)
Population density              0.0146         0.0171         0.144***       0.461***
                               (0.0139)       (0.0150)       (0.0447)       (0.142)
Built-up share                  0.0837***      0.0841***      0.168***       0.247***
                               (0.00877)      (0.00883)      (0.0163)       (0.0490)
Lit-up share                    0.0137***      0.0140***      0.0199*        0.0558***
                               (0.00215)      (0.00218)      (0.0120)       (0.0179)
NDVI                                           0.612         -3.820          2.389
                                              (1.154)        (5.164)        (8.896)
NDWI                                          -1.964          6.444**        7.253
                                              (1.217)        (2.906)        (7.259)
Population squared                                           -7.95e-06***   -9.92e-06
                                                             (1.38e-06)     (7.15e-06)
Population density squared                                   -0.00250***    -0.00596***
                                                             (0.000833)     (0.00164)
Built-up share squared                                       -0.00147***    -0.00216***
                                                             (0.000196)     (0.000258)
Lit-up share squared                                         -9.32e-05      -0.000215*
                                                             (0.000117)     (0.000126)
NDVI squared                                                  6.534          4.058
                                                             (8.223)        (14.25)
NDWI squared                                                -38.36***      -27.90*
                                                             (10.68)        (16.44)
Constant                       -3.330***      -3.361***      -3.276***      -6.541***
                               (0.201)        (0.370)        (0.768)        (1.584)
Interaction terms               No             No             No             Yes
Observations                    1,276          1,276          1,276          1,276
Pseudo R2                       0.506          0.507          0.553          0.597
Log likelihood                 -431.4         -429.9         -390.5         -351.8
BIC                             898.6          909.9          874            903.8

Note: The dependent variable is the urban status of the place according to the pooled classification. Benchmark refers to the four key indicators; Extended denotes the inclusion of NDVI and NDWI; Quadratic adds quadratic terms of individual indicators; and Full indicates that both quadratic terms and terms interacting two indicators at a time are included. Standard errors are reported in parentheses, and statistical significance is reported by *** if p < 0.01, ** if p < 0.05, and * if p < 0.1.

Figure 5. Prediction accuracy across approaches and specifications
a. Logit model
b. LASSO method
c. Random forests method
Note: The solid blue line reports prediction accuracy and the dotted red lines present its 95% confidence interval based on the 10-fold cross-validation. The x-axis presents the combination of covariates: Population denotes population size; Density denotes population density; Built-up denotes built-up share; Lit-up denotes lit-up share; Traditional represents population size and density; Modern represents built-up and lit-up shares; Benchmark represents the four key indicators; Extended denotes the inclusion of NDVI and NDWI; Quadratic adds quadratic terms of individual indicators; and Full indicates that both quadratic terms and terms interacting two indicators at a time are included.
Algorithmic modeling

Inferences from the sample can also be based on an algorithmic approach, which does not require any specific assumption on the data distribution. This approach, known in computer science as machine learning, has gained prominence in the last two decades thanks to the rapid increase in computing power and the availability of big data (Athey 2018; Mullainathan and Spiess 2017). In the case of urban economics, algorithmic modeling has been stimulated by the increasing availability of data measuring city characteristics at high frequencies and granular scales (Glaeser et al. 2016; Glaeser et al. 2018; Gorin et al. 2018; Varian 2014).

Following the literature, we also relied on machine learning to uncover generalizable patterns from the data and produce predictions on the outcome. To this effect we reviewed some of the tools most familiar to economists, including ridge regression, LASSO, nearest neighbors, decision trees, random forests, neural networks and ensemble methods (Breiman 2001b; Hastie et al. 2017). The review led us to select the Least Absolute Shrinkage and Selection Operator (LASSO) as a parametric method and random forests as a non-parametric method.

The LASSO method applies a regularized regression to estimate parameter values, in which an additional constraint is introduced to penalize large models and enhance prediction accuracy (Hastie et al. 2017; Mullainathan and Spiess 2017). Thus, parameters are now approximated by the values that obtain from maximizing the log likelihood function \ln L(\beta \mid y, x) subject to an additional constraint:

\hat{\beta}_{LASSO} = \arg\min_{\beta} \sum_{i=1}^{n} \left[ \ln\left(1 + e^{x_i'\beta}\right) - y_i x_i'\beta \right] \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t
                    = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left[ \ln\left(1 + e^{x_i'\beta}\right) - y_i x_i'\beta \right] + \lambda \sum_{j=1}^{p} |\beta_j| \right\} \qquad (3)

As the penalty term \lambda in the LASSO method becomes smaller, the optimization problem in equation (3) converges to that in equation (2) and \hat{\beta}_{LASSO} \rightarrow \hat{\beta}.
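The solver used for equation (3) is not specified in the text. A common ingredient of coordinate-descent LASSO solvers is the soft-thresholding operator, shown here as an illustration of how the penalty sets weak coefficients exactly to zero:

```python
def soft_threshold(b, lam):
    """Soft-thresholding operator S(b, lambda): shrinks a coefficient toward
    zero and sets it exactly to zero when |b| <= lambda. This is the mechanism
    by which LASSO drops covariates with little predictive power."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0
```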
But in the general case, the LASSO method identifies the covariates that have little predictive power and may contribute to overfitting, and penalizes them by setting their coefficients to zero. Given the ability of the LASSO method to significantly reduce the number of covariates, we considered rich initial specifications. The results were largely consistent with those of the Logit model, as the four key indicators, together with some quadratic terms, emerged as the most important predictors of urban status. However, the coefficients associated with these four key indicators cannot be interpreted literally, because the objective of the LASSO method is prediction, not explanation. Coefficients could change from sample to sample even if the out-of-sample prediction results remain similar. We applied the 10-fold cross-validation procedure to evaluate the performance of the LASSO method and found that its prediction accuracy was similar to that of the Logit model (figure 5b).

The random forests method, in turn, works as a combination of individual trees. In each tree, data are split into groups at various nodes, each defined by a threshold for a covariate. Unlike standard trees, the trees in random forests focus only on a randomly chosen subset of the covariates to develop their nodes. That way, the trees in the random forests method are less likely to correlate with each other and offer different information on the underlying data generation process. The random forests method essentially averages over many noisy but approximately unbiased and uncorrelated trees to reduce the variance of the predictor (Breiman 2001a; Hastie et al. 2017).

Following the literature, we considered a large number (B) of binary classification trees for our urban classification problem, each of them grown on an independently bootstrapped subsample from the sample of 1,277 places. The bootstrapped subsample thus served as training data.
For each tree we randomly selected m out of the k covariates in the full set, with the rule of thumb being that m should be smaller than √k to reduce the correlation between trees. We also set a predetermined number of nodes (J) for each tree. Parameters B, m and J characterize the forest. At each node of each tree, we chose the cutoff point of the selected covariate that ensured the greatest similarity among places within each of the two resulting groups, and the greatest diversity between places across the two groups. Similarity and diversity were evaluated based on the pooled classification of places, using the Gini index to identify the optimal cutoff point. We then considered the next node, again involving one of m randomly selected covariates, and repeated the same steps. We continued the process recursively until the predetermined number of nodes was reached. Proceeding this way, each of the B trees led to an unequivocal assessment of all places in the training subsample as either urban or rural. But such an assessment is not necessarily the same as that resulting from the pooled classification.

We then applied the trees developed on each of the independently bootstrapped subsamples to the entire universe. By proceeding this way, all places in the universe were assessed B times. The random forest predictor of the urban status of a place is derived by applying majority rule to these assessments.

We tuned our random forests method by varying the number of trees (B) and the number of covariates to be randomly selected to develop each node (m). We relied on the standard out-of-bag (OOB) error rate to evaluate the performance of different combinations of parameter values. We found that the OOB error rate is minimized as the number of trees considered increases above 300 and the number of covariates to be selected declines to two (appendix 4). In light of this, we chose 500 trees and two covariates at each node to implement the random forests model.
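To make the mechanics concrete, here is a deliberately stripped-down sketch in which each "tree" is a single node (a stump): it is grown on a bootstrap sample, splits at the Gini-best cutoff among m randomly chosen covariates, and the out-of-bag (OOB) observations of each tree are used to estimate the error rate by majority vote. The data are simulated; the paper's actual implementation grows full trees on the survey covariates:

```python
import numpy as np

rng = np.random.default_rng(0)

def gini_split(X, y, feats):
    """Pick the covariate (among feats) and cutoff minimizing pooled Gini impurity."""
    best = (np.inf, feats[0], 0.0, 0, 1)
    for j in feats:
        for c in np.quantile(X[:, j], np.linspace(0.1, 0.9, 9)):
            lo, hi = y[X[:, j] <= c], y[X[:, j] > c]
            if len(lo) == 0 or len(hi) == 0:
                continue
            gini = sum(len(g) * 2 * g.mean() * (1 - g.mean()) for g in (lo, hi)) / len(y)
            if gini < best[0]:
                best = (gini, j, c, int(round(lo.mean())), int(round(hi.mean())))
    return best[1:]  # covariate, cutoff, label below cutoff, label above cutoff

def forest_oob_error(X, y, B=200, m=2):
    """Grow B one-node trees on bootstrap samples, each restricted to m random
    covariates; score each observation only with trees that did not train on it."""
    n, k = X.shape
    votes = np.zeros((n, 2))
    for _ in range(B):
        boot = rng.integers(0, n, n)             # bootstrap indices (with replacement)
        oob = np.setdiff1d(np.arange(n), boot)   # out-of-bag indices for this tree
        feats = rng.choice(k, size=m, replace=False)
        j, c, y_lo, y_hi = gini_split(X[boot], y[boot], feats)
        votes[oob, np.where(X[oob, j] <= c, y_lo, y_hi)] += 1
    return float(np.mean(votes.argmax(axis=1) != y))  # majority rule vs. truth

X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # status driven by two of five covariates
oob_error = forest_oob_error(X, y)
```

Because each stump sees only m of the covariates, the trees are weakly correlated, and the OOB error provides the tuning criterion described in the text without a separate validation sample.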
To facilitate the comparison with the results from the Logit model, we also considered multiple combinations of covariates. We focused on the mean decrease in accuracy and the mean decrease in Gini impurity to evaluate their importance. Overall, built-up share and population size emerged as the most important predictors of urban status (appendix 5). This finding is consistent with the results from the Logit model. However, the relative importance of different covariates is only indicative. This is because the random forests method is a prediction tool, not an explanation tool, and the estimated relative importance of predictors is also sensitive to the sample and the process.

Again, we evaluated the prediction accuracy of the random forests method using the 10-fold cross-validation procedure and found that it outperforms the Logit model (figure 5c). The percentage of observations correctly predicted reached 88.2 percent, 1 percentage point higher than with the best-performing specification of the Logit model. And the confidence interval was comparable. However, the results from the random forests method are consistent with those from the Logit model in terms of the selection of covariates. Including all four key indicators significantly improves prediction accuracy over using indicators based exclusively on traditional or modern data. On the other hand, the addition of NDVI and NDWI indicators, and of higher-order terms, yields marginal gains.

Robustness to subjective bias

We applied both classical statistical modeling and algorithmic modeling to the classification of the sample based on the pooled classification. This exercise led to high consistency of predictions regardless of the specific approach used. Overall, the random forests method is the best performer, followed closely by the Logit model. But a relevant question is whether the prediction performance would have been similar had the sample classifications based on individual rounds of assessments been used instead.
We showed above that assessor characteristics and the nature of the protocol followed only had a modest impact on the classification of places in the sample, but their impact could potentially become significant when predicting the urban status of out-of-sample units. To evaluate this potential bias, we first trained the model on each of the eight classifications, including the pooled classification and the seven classifications based on individual rounds of assessments. To this effect we used the random forests method, given its higher prediction accuracy. We then generated a predicted classification and computed the percentage of agreement between this predicted classification and each of the other classifications. The prediction model performs well regardless of which subjective classification is used for training and regardless of the subjective classification used for validation (table 7). The average prediction accuracy is between 84.8 and 88.4 percent. Standard deviations are small in all cases. This confirms our previous finding of high consistency between classifications.

5. How urban is India?

Our proposed methodology can be used to shed light on the ongoing debate about India's "true" urbanization rate. This can be done by applying the trained models from the sample to predict the urban status of all administrative units in our universe of places in 2011 and then estimating the population living in those administrative units we deem urban. This prediction can be compared with official estimates, with the outcomes of other studies and with recent global products that estimate land use.

The urbanization rate in 2011

We chose the pooled classification of places in the sample as the benchmark to estimate India's urbanization rate. Also, because of its higher prediction accuracy, we relied on the random forests method with the full set of covariates (key indicators, NDVI, NDWI, quadratic terms and interactive terms) as our preferred approach.
If the predicted likelihood of being urban was above 0.5, we classified the place as urban; otherwise we considered it rural. According to the census, 31.2 percent of the population lived in urban areas in 2011. But this figure corresponds to the entire country, whereas our universe of places excludes small states as well as villages for which information is missing. If only the 564,052 places considered in our analyses were retained, the census would yield an urbanization rate of 31.6 percent. This is only slightly higher than the 29.9 percent rate predicted by the random forests method with the full set of covariates. 8

8 Predictions also fall within a close range of the official estimate when the Logit model or the LASSO method are used, or when smaller sets of covariates based on both types of data are considered. The results are available upon request.

Table 7. Cross-validation of predictions across classifications

Train \ Validation                Pooled  WB-S   GWU-I  GWU-S  MTI-I  MTI-S  MTU-I  MTU-S  Avg.   Std. dev.
Pooled                            88.1    84.7   86.5   88.3   87.8   86.6   87.8   88.1   87.2   1.24
World Bank analysts, structured   84.6    87.2   80.3   82.3   83.6   86.8   85.8   88.0   84.8   2.67
GWU students, impromptu           85.8    80.3   88.7   91.1   88.5   80.9   85.1   82.5   85.4   3.92
GWU students, structured          87.0    81.8   90.1   89.2   88.0   82.9   86.3   84.4   86.2   2.98
MTurk-India, impromptu            88.5    85.1   89.5   90.4   88.3   86.4   88.6   88.2   88.1   1.68
MTurk-India, structured           86.0    86.3   81.0   82.3   85.3   88.9   86.7   88.3   85.6   2.72
MTurk-USA, impromptu              89.2    87.1   87.5   88.9   88.7   88.1   87.6   90.2   88.4   1.04
MTurk-USA, structured             88.8    87.5   83.8   85.5   86.4   89.5   89.2   88.6   87.4   2.00

Note: Rows (train) indicate the classification used for prediction model development; columns (validation) indicate the classification used for prediction accuracy evaluation. WB-S = World Bank analysts, structured protocol; GWU-I and GWU-S = GWU students, impromptu judgment and structured protocol; MTI and MTU = MTurk-India and MTurk-USA, with impromptu judgment (I) and structured protocol (S). The random forests method is used.
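The classification rule described above (places with a predicted likelihood above 0.5 deemed urban) and the population-weighted aggregation into an urbanization rate can be sketched as follows; the probabilities and populations are hypothetical:

```python
import numpy as np

# Hypothetical predicted probabilities and populations for five places.
prob_urban = np.array([0.92, 0.61, 0.48, 0.07, 0.55])
population = np.array([250_000, 40_000, 35_000, 8_000, 12_000])

is_urban = prob_urban > 0.5  # the 0.5 classification threshold used in the paper
urban_rate = population[is_urban].sum() / population.sum()
```

The urbanization rate is thus the share of total population living in places classified as urban, not the share of places.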
However, when relying only on indicators from traditional data, the estimated urbanization rate becomes significantly higher (table 8). It reaches 35.2 percent of the population when applying the random forests model and 32.1 percent when applying the Logit model. On the other hand, when relying only on indicators from modern data, the estimate can be significantly smaller. For example, the urbanization rate falls to 26.4 percent according to the Logit model.

Table 8. The predicted urbanization rate of India in 2011

                               Urbanization rate    Jaccard index with census   Jaccard index with other approaches
Approach                       Area    Population   Area    Pop.                Area (st. dev.)   Pop. (st. dev.)
Logit (full)                   3.4     30.2         0.56    0.84                0.83 (0.043)      0.94 (0.014)
LASSO (full)                   3.3     29.7         0.53    0.83                0.83 (0.047)      0.94 (0.016)
Random forests (full)          3.2     29.9         0.54    0.85                0.83 (0.065)      0.94 (0.021)
Logit (traditional)            4.6     32.1         0.53    0.83                0.52 (0.054)      0.84 (0.017)
Random forests (traditional)   5.1     35.2         0.38    0.74                0.41 (0.018)      0.76 (0.012)
Logit (modern)                 2.9     26.4         0.38    0.74                0.62 (0.049)      0.84 (0.016)
Random forests (modern)        5.7     30.3         0.33    0.73                0.41 (0.017)      0.79 (0.010)

Note: The urbanization rate is reported as the percent of land area (Area) and the percent of population (Population) classified as urban. Other approaches include the Logit model (benchmark, quadratic and full), the LASSO method (quadratic and full) and the random forests method (benchmark, quadratic and full).

Estimations that come close at the national level may differ significantly at more disaggregated levels. Following de Bellefon et al. (2018), we use the Jaccard similarity index to measure the extent to which urban areas predicted by two different approaches overlap. The Jaccard index is the ratio between the size of the intersection of the urban areas predicted by two approaches and the size of the union of the two urban areas. When considering the full set of covariates, the average Jaccard index across approaches is 0.83 when size is measured in terms of surface, and 0.94 when it is measured in population terms.
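The Jaccard index just defined can be computed directly from two binary urban classifications and a weight (surface or population) for each place. A minimal sketch with hypothetical numbers:

```python
import numpy as np

def jaccard(urban_a, urban_b, weight):
    """Weighted size of the intersection of two urban delineations over the
    weighted size of their union."""
    inter = weight[urban_a & urban_b].sum()
    union = weight[urban_a | urban_b].sum()
    return float(inter / union)

# Six hypothetical places: the two approaches agree on two urban calls,
# and each classifies one additional place as urban that the other does not.
urban_a = np.array([1, 1, 1, 0, 0, 0], dtype=bool)
urban_b = np.array([1, 1, 0, 1, 0, 0], dtype=bool)
area = np.array([10.0, 20.0, 5.0, 5.0, 50.0, 60.0])            # surface of each place
pop = np.array([9000.0, 7000.0, 2000.0, 1000.0, 500.0, 500.0])  # population of each place

j_area = jaccard(urban_a, urban_b, area)  # 30 / 40
j_pop = jaccard(urban_a, urban_b, pop)    # 16000 / 19000
```

In this example the population-weighted index exceeds the area-weighted one because the disputed places are small in population, mirroring the 0.94 versus 0.83 pattern reported in table 8.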
These results reveal a high degree of coincidence in the delineation of the urban extent by our methodology, even at disaggregated levels. However, predictions are more volatile when relying exclusively on indicators from traditional or from modern data. For example, the average Jaccard index falls to 0.41 when measured by surface, and to 0.79 when measured by population, if the random forests prediction relies exclusively on modern data.

Urbanization by administrative category

The similarity between our predicted urbanization rate for 2011 and the census estimate could hide important differences at subnational levels. To explore this possibility more systematically, we first reran the comparison for places falling under each of the seven administrative categories considered by the 2011 Census of India (figure 6).

Figure 6. India's official and predicted urban population by administrative category

Note: The figure reports the official and predicted urban population for each of the seven administrative categories.

Our prediction suggests that many places deemed villages under the administrative classification have urban characteristics in practice. Based on the random forests method with the full set of covariates, about 7 percent of the 555,292 places administratively classified as villages could well be urban. The gap may seem small in relative terms, but it has nontrivial implications when estimating the urban population. Overall, 20 million people resided in villages with urban characteristics in 2011. Our results also reveal that many places administratively classified as towns could be deemed rural. Based on the random forests method with the full set of covariates, 48 percent of the 3,847 places labelled census towns, 48 percent of the 2,861 places administratively designated as other towns, and 16 percent of the 911 places deemed municipalities, could be considered rural.
Taken together, this is the equivalent of about 22 million people whose classification as urban residents could be questioned. Substantial gaps in urbanization rates relative to official figures are corroborated, in the case of municipalities, other towns, census towns and villages, when using other combinations of covariates for the random forests method, as well as when relying on predictions based on the Logit model and the LASSO method. 9

These gaps in urbanization rates by administrative category could be interpreted as the outcome of prediction error in both directions, around otherwise consistent urbanization rates at the aggregate level. However, it should be noted that the gap does not affect all administrative categories in the mid-range between rural and urban to the same extent. In relative terms, the gap is much larger among census towns and other towns than among municipalities and outgrowths. These uneven gaps suggest that there could be more than prediction error at play, and that the way places are administratively classified may matter as well.

Different biases by administrative level may indeed reflect an official classification of places that is not exclusively based on their urban characteristics but is also affected by fiscal incentives and statistical initiatives. In India, the implementation of central government programs and the financing available for the development of settlements depend on the urban or rural status of beneficiary places. Statutory towns are declared by the state governments and governed by urban local authorities. These urban local governments have taxation authority, but they are also subject to a complex set of rules, regulations, building bylaws and development planning controls known as "urban laws". Rural local governments have no taxation powers, but they receive a host of transfers in support of farmers and the rural poor.
As a result, there are numerous examples of statutory urban places being reclassified back and forth, especially other towns (Aijaz 2017; Mukhopadhyay et al. 2012; Ministry of Finance 2017). Census towns are not even the result of an explicit reclassification of places by local authorities, but rather a statistical attempt by the Census of India to generate a more accurate estimate of the urban population. Census towns started being defined consistently in 1971, as an addition to the statutory towns declared by the state governments. They correspond to places that are administratively rural, host more than 5,000 inhabitants, have a population density exceeding 400 persons per square kilometer, and where over 75 percent of the male working population is engaged in non-agricultural activities (2011 Census of India). These places are deemed urban by the Census of India, but not by local and state authorities.

Urbanization by state

There are also significant differences between our predicted urbanization rates and official figures at the state level. Again, using the random forests method with the full set of covariates, our estimate of the urban population falls short of the official estimate by roughly 1.5 million people in each of the states of Orissa, Gujarat and Rajasthan. The gap is even bigger in Madhya Pradesh, where it reaches 3.9 million people, and especially in Tamil Nadu, where it approaches 6.6 million people. Conversely, our estimates of the urban population exceed the official figures by 0.8 million people in both Bihar and Kerala, and by 0.7 million in Andhra Pradesh (figure 7). While these gaps in the estimated urban population could reflect prediction error, there are grounds to believe that they result, at least in part, from prevailing institutional arrangements. In India, urban development is a subject under the purview of state governments.
By default, all settlements are rural, and they become urban only after the state government converts them, following a well-specified legal process. Converted settlements are labelled statutory towns.

9 The conclusions also hold when using classifications based on different rounds of assessments. The results are available upon request.

Figure 7. Gap between India's predicted and official urban populations by state

[Bar chart: urban population difference in millions (vertical axis), by state (horizontal axis).]

Note: The figure presents the difference between the urban population by state based on our prediction and that based on the official census. The unit of measurement is million people.

There are guidelines for classifying a settlement as a statutory town, but they are vague and not binding. As a result, state governments exercise considerable discretion in their choices, and the criteria used to define a statutory town vary considerably across states. For example, in the southern coastal states the threshold population of statutory towns varies from around 2,000 in Tamil Nadu to over 20,000 in Kerala and above 30,000 in Andhra Pradesh (Aijaz 2018; Denis and Marius-Gnanou 2011). The first state has the least strict criteria in the country, while the last two have among the strictest, which is consistent with the sizeable gaps between our predicted urbanization rates and the corresponding official figures.

A comparison with other estimates

Our results contrast with those reported in several studies using traditional data.
Measures based on different thresholds for population size yield urbanization rates ranging from 47.2 to 64.9 percent for 2011, and those based on different cutoffs for population density produce urbanization rates of 55.0 and 78.0 percent (IDFC Institute 2015; Ministry of Finance 2017). In a World Bank report, a measure based on an agglomeration index that combines population density, the population of a "large" center and the estimated travel time to that center yields an urbanization rate of 55.3 percent in 2011 (Ellis and Roberts 2016).

The discrepancies between these measures and our predicted urbanization rates are substantial. But such discrepancies should not come as a surprise, as these other measures are more parsimonious in their definition of what makes a place urban. In light of these differences, we see these studies as offering interesting perspectives, but not fundamentally challenging our results.

The discrepancy is also substantial relative to some recent urban layer products based on satellite imagery that report high urbanization rates. The Geopolis project proposes a measure based on built-up cover, cell contiguity and a population threshold, and predicts an urbanization rate of 42.0 percent for 2011 (Denis and Marius-Gnanou 2011). The GHSL project applies thresholds on the share of built-up area, cell contiguity, population size and population density to define urban cores. Based on this approach, GHSL estimates that 53.6 percent of India's population was urban in 2015 (Dijkstra et al. 2018; Pesaresi et al. 2019). 10 These figures exceed our predicted urbanization rate, sometimes by a wide margin.

Meanwhile, other recent studies based on satellite imagery suggest the urbanization rate could be close to, or even lower than, our prediction. Baragwanath et al. (2019) apply an algorithmic approach with different distance buffers to define urban markets for India.
Their estimates of the share of India's population residing in urban markets in 2011 have a median value of 29.1 percent and a mean value of 27.6 percent. As another example, the Metropolitan Areas Extension Database, or BEAM project, uses carefully cleaned nighttime lights to identify urban areas (Ch et al. 2018). Based on its results, 23.9 percent of India's population lived in urban areas in 2010.

On closer inspection, however, the difference between our estimates and those in other recent studies does not arise from the underlying land classification, but rather from the population data used. This can be seen by comparing our estimates to two global urban layer products relying on satellite imagery and available as open source data, namely BEAM and GHSL. As mentioned above, a recent study based on BEAM leads to a lower urbanization rate than our methodology, while a study based on GHSL delivers a much higher urbanization rate. For the Indian places considered in our universe, the total area we classify as urban based on our methodology is quite close to that of the GHSL urban core and higher than the BEAM urban area (table 9). However, these urban areas translate into very different estimates of the urbanization rate depending on which gridded population data are used. We consider three products in this respect: GHSL for 2015, LandScan for 2011 and WorldPop for 2010. 11 Our predicted urbanization rate for India would have declined only marginally had we used population data from LandScan, instead of data from the 2011 Census of India. But it would have been much smaller if we had used population data from WorldPop, and much higher had we used population data from GHSL. The similarity of results, relative to the 2011 Census of India, makes the use of LandScan data more appealing than the alternatives.

10 GHSL country statistics can be generated from the Community pre-Release of GHS Data Package https://ghsl.jrc.ec.europa.eu/CFS.php.
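Mechanically, each of these comparisons amounts to overlaying a binary urban layer on a gridded population product and dividing the population inside urban cells by the total. The sketch below uses tiny hypothetical grids (not actual GHSL, LandScan or WorldPop data) to show how the same land classification yields different urbanization rates under different population grids:

```python
import numpy as np

# Hypothetical 3x3 grids: a binary urban layer and two population rasters.
urban_mask = np.array([[1, 1, 0],
                       [0, 1, 0],
                       [0, 0, 0]], dtype=bool)
pop_a = np.array([[900.0, 600.0, 50.0],
                  [40.0, 700.0, 30.0],
                  [20.0, 10.0, 650.0]])
pop_b = pop_a * np.array([1.2, 1.0, 0.6])  # same urban layer, different population product

rate_a = pop_a[urban_mask].sum() / pop_a.sum()  # urbanization rate under product A
rate_b = pop_b[urban_mask].sum() / pop_b.sum()  # urbanization rate under product B
```

Even with an identical urban delineation, the two population products give different urbanization rates, which is the point made in the text about BEAM- and GHSL-based estimates.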
11 Information on gridded population is provided by Schiavina et al. (2019) for GHSL, by LandScan™ (2011) for LandScan and by WorldPop (2017) for WorldPop.

Table 9. India's urban area and urban population by global urban layer products

              Urban area               Urban population (percent of population)
              (percent of land area)   Census    GHSL     LandScan   WorldPop
Prediction    3.2%                     29.9%     39.6%    28.8%      22.7%
GHSL          2.9%                     -         54.7%    29.5%      23.1%
BEAM          1.9%                     -         29.0%    26.9%      20.1%

Combining the GHSL's classification of urban cores with population data from LandScan yields an urbanization rate that is less than half a percentage point away from our preferred estimate (29.9 percent). Similarly, applying the same approach to BEAM urban areas leads to an urbanization rate that is 3 percentage points lower than our preferred estimate. Therefore, when relying on the same population data, the urbanization rates predicted by other recent studies are quite similar to the one we obtain with our methodology.

Discrepancies between our proposed methodology and global urban layer products become much wider when using other population data. All predicted urbanization rates decline when relying on WorldPop data and increase when using GHSL data instead. But regardless of the population data used, our predicted urbanization rate falls in between the BEAM-based and GHSL-based estimates. We interpret this as further evidence that our predicted urbanization rate is not an outlier.

6. Conclusion

When assessing the urban extent there is value in relying on what one "sees," especially in countries where urbanization is messy in nature. Subjective assessments can capture the multifaceted nature of cities—relatively large spaces with a higher density of construction, better access to transportation networks, and greater availability of residential amenities. In this paper, we develop what we believe is a credible methodology for the subjective assessment of urban status across very diverse places.
Human judgment could be dismissed as a tool on the grounds that much rests in the eyes of the beholder. However, our methodology convincingly addresses this potential shortcoming. It shows indeed that the classification of places only changes at the margin when involving assessors with different backgrounds or following different protocols.

The increased availability of satellite imagery, of crowdsourcing tools and of computing capability also makes human judgment a scalable methodology. Physically visiting thousands of places to assess their status would be costly and time-consuming. But Google images make a remote assessment possible in a growing number of countries. Crowdsourcing makes it efficient to collect judgments for a representative sample of places. And the availability of remote sensing data, especially of built-up cover and nighttime lights, makes it viable to predict the urban status of thousands of other places.

Building on the burgeoning literature that uses machine learning to study economic issues, we show that the outcome of the prediction is not significantly affected by the prediction approach used, whether statistical or algorithmic. It is not substantially affected by the number of indicators considered either: as long as traditional and modern data are combined, predictions are very similar.

In passing, our analyses shed light on the ongoing debate about India's urbanization rate. In recent years, studies using different population thresholds, or combining traditional indicators into more complex indices, have claimed that the actual urbanization rate is higher than official figures suggest. Analyses based on global urban layer products seemingly confirm this conclusion. However, we find relatively minor discrepancies with official figures, and identify the institutional and statistical mechanisms that could underlie the observed gaps.
Importantly, we show that the discrepancy with estimates based on global urban layer products is mainly due to the population data used, not to the resulting land classification.

The methodology we propose could also serve as the basis for further research. In this paper we apply it to the classification of places at one point in time, but it would be interesting to explore how to use it to assess changes in urbanization. Our results also suggest that the discrepancies in the assessment of urban areas vary across states. A relevant question is whether accuracy would increase if prediction approaches were fine-tuned for places operating under different institutional settings or facing diverse geographical conditions. Finally, our methodology uses administrative boundaries as the unit of analysis, whereas global products are built on grid cells of standard size. Digitized administrative boundaries (the polygons delimiting the jurisdiction of cities, towns, and villages) are an important anchor for data integration. They can be used as the basis for combining traditional and modern data, and therefore to extend prediction beyond the places virtually visited by assessors. A more systematic evaluation of the tradeoffs faced when relying on administrative boundaries, instead of grid cells, is therefore warranted.

References

Ahlfeldt, Gabriel M., Stephen J. Redding, Daniel M. Sturm, and Nikolaus Wolf. 2015. "The Economics of Density: Evidence from the Berlin Wall." Econometrica 83 (6): 2127–89.

Aijaz, R., 2017. Measuring Urbanisation in India. ORF Issue Brief, 218. Observer Research Foundation.

Alonso, W. 1964. "Location and Land Use: Toward a General Theory of Land Rent." Harvard University, Cambridge, MA.

Athey, S., 2018. The impact of machine learning on economics. In The economics of artificial intelligence: An agenda. University of Chicago Press.

Beller, A., Borjas, G., Tienda, M., Bloom, D. and Grenier, G., 1994.
“Beauty and the Labor Market.” The American Economic Review 84(5): 1174-1194. Bosker, Maarten, Jane Park and Mark Roberts, 2018. "Definition Matters: Metropolitan Areas and Agglomeration Economies in a Large Developing Country," CEPR Discussion Papers 13359, C.E.P.R. Discussion Papers. Breiman, Leo. 2001a. "Random Forests." Machine Learning 45 (1): 5–32. ___________. 2001b. "Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author)." Statistical Science 16 (3): 199–231. Briant, A., Combes, P.P. and Lafourcade, M., 2010. Dots to boxes: Do the size and shape of spatial units jeopardize economic geography estimations?. Journal of Urban Economics, 67(3), pp.287-302. Brueckner, Jan K. 1987. The structure of urban equilibria: A unified treatment of the Muth-Mills model. In Edwin S. Mills (ed.) Handbook of Regional and Urban Economics, volume 2. Amsterdam: North- Holland, 821–845. Cameron, A. Colin, and Pravin K. Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge, U.K.: Cambridge University Press. Campbell, J.B. and Wynne, R.H., 2011. Introduction to remote sensing. Guilford Press. Ch, Rafael, Diego Martin, and Juan F. Vargas. 2018. “Measuring cities with nighttime light data.” Processed, CAF Development Bank of Latin America. Combes, Pierre-Philippe, Gilles Duranton, Laurent Gobillon, and Sébastien Roux. 2010. “Estimating Agglomeration Economies with History, Geology, and Worker Effects.” In Agglomeration Economics, edited by Edward L. Glaeser, 15–66. Chicago: University of Chicago Press. de Bellefon, Marie-Pierre, Pierre-Philippe Combes, Gilles Duranton, and Laurent Gobillon. 2018. “Delineating Urban Areas using Building Density.” Working Paper No. 811, The Wharton School, University of Pennsylvania. Denis, Eric, and Kamala Marius-Gnanou. 2011. “Toward a Better Appraisal of Urbanization in India: A Fresh Look at the Landscape of Morphological Agglomerates.” Cybergeo: European Journal of Geography. 
Dingel, Jonathan, Antonio Miscio and Donald R. Davis. 2019. "Cities, Lights and Skills in Developing Economies." Journal of Urban Economics. In Press, Corrected Proof, Available online 23 May.

Dijkstra, Lewis, Aneta Florczyk, Sergio Freire, Thomas Kemper, and Martino Pesaresi. 2018. "Applying the degree of urbanisation to the globe: A new harmonised definition reveals a different picture of global urbanisation." Processed, European Commission.

Donaldson, Dave, and Adam Storeygard. 2016. "The View from Above: Applications of Satellite Data in Economics." Journal of Economic Perspectives 30 (4): 171–98.

Dubey, A., Naik, N., Parikh, D., Raskar, R. and Hidalgo, C.A., 2016, October. Deep learning the city: Quantifying urban perception at a global scale. In European conference on computer vision (pp. 196-212). Springer, Cham.

Duranton, G., and D. Puga. 2004. "Micro-foundations of Urban Agglomeration Economies." In Handbook of Regional and Urban Economics, Vol. 4, edited by J. V. Henderson and J. F. Thisse, 2063–117. Amsterdam: North-Holland.

______. 2015. "Urban Land Use." In Handbook of Regional and Urban Economics, Vol. 5, edited by Gilles Duranton, J. Vernon Henderson, and William Strange, 467–560. Amsterdam: North-Holland.

Eeckhout, J., 2004. Gibrat's law for (all) cities. American Economic Review, 94(5), pp.1429-1451.

Ellis, Peter and Roberts, Mark. 2016. "Leveraging Urbanization in South Asia: Managing Spatial Transformation for Prosperity and Livability". Washington, DC: World Bank.

Elvidge, Christopher D., Daniel Ziskin, Kimberly E. Baugh, Benjamin T. Tuttle, Tilottama Ghosh, Dee W. Pack, Edward H. Erwin, and Mikhail Zhizhin. 2009. "A Fifteen Year Record of Global Natural Gas Flaring Derived From Satellite Data." Energies 2 (3): 595–622.

Esch, Thomas, Wieke Heldens, Andreas Hirner, Manfred Keil, Mattia Marconcini, Achim Roth, Julian Zeidler, Stefan Dech, and Emanuele Strano. 2017.
"Breaking New Ground in Mapping Human Settlements from Space: The Global Urban Footprint." ISPRS Journal of Photogrammetry and Remote Sensing 134: 30–42.

Frey, Bruno S., and Alois Stutzer. 2002. "What Can Economists Learn from Happiness Research?" Journal of Economic Literature 40 (2): 402–35.

Friedl, Mark A., Damien Sulla-Menashe, Bin Tan, Annemarie Schneider, Navin Ramankutty, Adam Sibley, and Xiaoman Huang. 2010. "MODIS Collection 5 Global Land Cover: Algorithm Refinements and Characterization of New Datasets." Remote Sensing of Environment 114 (1): 168–82.

Fujita, M., P. R. Krugman, and A. J. Venables. 2001. The Spatial Economy: Cities, Regions, and International Trade. Cambridge, MA: MIT Press.

Galdo, Virgilio, Yue Li and Martin G. Rama. 2018. "Identifying Urban Areas by Combining Data from the Ground and from Outer Space: An Application to India," Policy Research Working Paper Series 8628, The World Bank.

Glaeser, E.L., Hillis, A., Kominers, S.D. and Luca, M., 2016. Crowdsourcing city government: Using tournaments to improve inspection accuracy. American Economic Review, 106(5), pp.114-18.

Glaeser, E.L., Kominers, S.D., Luca, M. and Naik, N., 2018. Big data and big cities: The promises and limitations of improved measures of urban life. Economic Inquiry, 56(1), pp.114-137.

Gorin, Clément, Pierre-Philippe Combes, Gilles Duranton, and Laurent Gobillon. 2018. A random forest approach to mining historical map data: Land use in 19th-century France. (Processed)

Hamermesh, D.S. and Biddle, J.E., 1994. "Beauty and the Labor Market". American Economic Review 84(5): 1174-1194.

Hastie, Trevor, Robert Tibshirani and Jerome Friedman. 2017. "The Elements of Statistical Learning: Data Mining, Inference and Prediction." Second Edition (Corrected 12th printing), Springer, New York.

Henderson, M., E. T. Yeh, P. Gong, C. Elvidge, and K. Baugh. 2003.
“Validation of Urban Boundaries Derived from Global Night-Time Satellite Imagery.” International Journal of Remote Sensing 24 (3): 595–609.

Hsu, Feng-Chi, Kimberly E. Baugh, Tilottama Ghosh, Mikhail Zhizhin, and Christopher D. Elvidge. 2015. “DMSP-OLS Radiance Calibrated Nighttime Lights Time Series with Intercalibration.” Remote Sensing 7: 1855–76.

IDFC Institute. 2015. “Chasing Definitions in India.” http://www.idfcinstitute.org/knowledge/publications/op-eds/chasing-definitions-in-india/.

Knottnerus, Paul. 2003. Sample Survey Theory. Berlin: Springer Science+Business Media.

Krugman, Paul. 1991. “Increasing Returns and Economic Geography.” Journal of Political Economy 99 (3): 483–99.

LandScan™. 2011. “High Resolution Global Population Data Set.” Copyrighted by UT-Battelle, LLC, operator of Oak Ridge National Laboratory under Contract No. DE-AC05-00OR22725 with the United States Department of Energy.

Levy, M. 2009. “Gibrat’s Law for (All) Cities: Comment.” American Economic Review 99 (4): 1672–75.

Li, X., and Y. Zhou. 2017. “Urban Mapping Using DMSP/OLS Stable Night-Time Light: A Review.” International Journal of Remote Sensing 38 (21): 6030–46.

Li, Yue, Martin Rama, Virgilio Galdo, and Maria Florencia Pinto. 2015. “A Spatial Database for South Asia.” World Bank, Washington, DC.

Michaels, G., F. Rauch, and S. J. Redding. 2012. “Urbanization and Structural Transformation.” Quarterly Journal of Economics 127 (2): 535–86.

Mills, E. S. 1967. “An Aggregative Model of Resource Allocation in a Metropolitan Area.” American Economic Review 57 (2): 197–210.

Ministry of Finance. 2017. Economic Survey 2016-17. Government of India, Ministry of Finance, Department of Economic Affairs.

Mullainathan, Sendhil, and Jann Spiess. 2017. “Machine Learning: An Applied Econometric Approach.” Journal of Economic Perspectives 31 (2): 87–106.

Muth, R. F. 1969. Cities and Housing: The Spatial Patterns of Urban Residential Land Use. Chicago: University of Chicago Press.
Mukhopadhyay, P., M. H. Zerah, and E. Denis. 2012. “Subaltern Urbanisation in India.” Economic and Political Weekly XLVII (30): 52–62.

Naik, N., S. D. Kominers, R. Raskar, E. L. Glaeser, and C. A. Hidalgo. 2015. “Do People Shape Cities, or Do Cities Shape People? The Co-evolution of Physical, Social, and Economic Change in Five Major US Cities.” NBER Working Paper 21620. National Bureau of Economic Research.

Naik, N., R. Raskar, and C. A. Hidalgo. 2016. “Cities Are Physical Too: Using Computer Vision to Measure the Quality and Impact of Urban Appearance.” American Economic Review 106 (5): 128–32.

NOAA (National Oceanic and Atmospheric Administration). 2014. Global Radiance Calibrated Night Lights. National Geophysical Data Center.

NSSO (National Sample Survey Office). 2012. “The 68th Round of National Sample Survey of India.” National Sample Survey Office, Ministry of Statistics and Programme Implementation, Government of India.

ORGI (Office of the Registrar General and Census Commissioner). 2011a. Census of India 2011. New Delhi: Ministry of Home Affairs, Government of India. http://censusindia.gov.in/

______. 2011b. Administrative Atlas 2011. New Delhi: Ministry of Home Affairs, Government of India. http://censusindia.gov.in/2011census/maps/administrative_maps/admmaps2011.html.

Pesaresi, Martino, Daniele Ehrlich, Stefano Ferri, Aneta Florczyk, Sergio Freire, Matina Halkia, Andreea Julea, Thomas Kemper, Pierre Soille, and Vasileios Syrris. 2016. Operating Procedure for the Production of the Global Human Settlement Layer from Landsat Data of the Epochs 1975, 1990, 2000, and 2014. Luxembourg: Publications Office of the European Union.

Pesaresi, Martino, Aneta Florczyk, Marcello Schiavina, Michele Melchiorri, and Luca Maffenini. 2019. “GHS Settlement Grid, Updated and Refined REGIO Model 2014, Multitemporal (1975-1990-2000-2015), R2019A.” European Commission, Joint Research Centre (JRC).

Postma, E. 2014.
“A Relationship between Attractiveness and Performance in Professional Cyclists.” Biology Letters 10 (2): 20130966.

Ross, J., L. Irani, M. Six Silberman, A. Zaldivar, and B. Tomlinson. 2010. “Who Are the Crowdworkers? Shifting Demographics in Amazon Mechanical Turk.” In CHI EA 2010, 2863–72.

Rozenfeld, H. D., D. Rybski, X. Gabaix, and H. A. Makse. 2011. “The Area and Population of Cities: New Insights from a Different Perspective on Cities.” American Economic Review 101 (5): 2205–25.

Salesses, Philip, Katja Schechtner, and César A. Hidalgo. 2013. “The Collaborative Image of the City: Mapping the Inequality of Urban Perception.” PLOS One 8 (7): e68400.

Schiavina, Marcello, Sergio Freire, and Kytt MacManus. 2019. “GHS Population Grid Multitemporal (1975, 1990, 2000, 2015), R2019A.” European Commission, Joint Research Centre (JRC).

Scotchmer, S. 2002. “Local Public Goods and Clubs.” Handbooks in Economics 4 (4): 1997–2044.

Straszheim, M. 1987. “The Theory of Urban Residential Location.” In Handbook of Regional and Urban Economics, Vol. 2, 717–57. Amsterdam: Elsevier.

UN (United Nations). 2005. “Designing Household Survey Samples: Practical Guidelines.” Department of Economic and Social Affairs, Statistics Division, United Nations, New York.

US Office of Management and Budget. 2010. “2010 Standards for Delineating Metropolitan and Micropolitan Statistical Areas; Notice.” Washington, DC: Federal Register.

Varian, H. R. 2014. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28 (2): 3–28.

Veenhoven, Ruut. 2004. “Happiness as a Public Policy Aim: The Greatest Happiness Principle.” In Positive Psychology in Practice, edited by P. A. Linley and S. Joseph. Hoboken, NJ: John Wiley.

WorldPop. 2017. “India 100m Population,” Version 2. University of Southampton. DOI: 10.5258/SOTON/WP00532.

Appendices

Appendix 1.
The formula for sample size determination

We use the following formula to determine the sample size for each administrative category:

n_i^* = \frac{z^2 p_i (1 - p_i) / e_i^2}{1 + \dfrac{z^2 p_i (1 - p_i)}{e_i^2 N_i}}    (1)

where n_i^* is the sample size for administrative category i; N_i is the total number of places belonging to the category; and p_i is the prior probability of a place in the category being urban in practice. In the absence of other information, the prior probability should be set equal to 0.5. However, administrative categories provide information on the likelihood that a place is urban or rural in practice, so the values of p_i should be correlated with the order of these administrative categories in the rural-urban gradation. The term e_i is the margin of error allowed for administrative category i, and z indicates the boundaries of the chosen confidence interval. For a conventional 95 percent confidence level, the value of z is 1.96, and e_i is equivalent to 1.96 times the standard error of p_i. Based on the visual analysis of a random sample of 250 administrative units, we assume that the standard error of the prior probability of being urban is much smaller for villages than for the seven other administrative categories. Therefore, e_i is set at 3 percent for villages and at 5 percent for all other strata.

Appendix 2. The structured assessment protocol

The structured protocol to assess the urban status of places in the sample is based on a decision tree involving three steps. First, the assessor focuses on land cover. If the built-up area appears to be extensive as a share of the overall area, the place is likely to be urban; it is likely to be rural if the built-up area is small. Between these two extremes, the assessment is inconclusive. Next, the assessor adjusts this preliminary judgment based on the relationship among buildings. Compactness or clustering of buildings increases the likelihood that the place is urban, whereas a scattered pattern of buildings suggests that the place is rural.
Finally, the assessor zooms in (and pulls out street views if available) to check the availability of amenities, high-quality buildings, and transportation networks. The availability of some of these structures confirms that the place is urban in practice.

Appendix 3. The impact of the size of places on judgment outcomes

a. The relationship between size and judgments

Note: The blue bars report the share of places judged to be urban for each size quantile.

b. The relationship between size and built-up share

Note: The blue bars report the range of the built-up share defined by the 25th and 75th percentiles for each size quantile. The blue lines within the bars report the median value of the built-up share for each size quantile.

Appendix 4. Tuning of random forests

Note: The figure reports the out-of-bag (OOB) error rate for different combinations of the number of trees and the number of covariates sampled to develop each node, when the four key indicators, NDVI, NDWI, the quadratic terms of individual indicators, and terms interacting two indicators at a time are included as covariates in the random forests analysis.

Appendix 5. The importance of covariates in the random forests analysis

Note: The figures report the Mean Decrease Accuracy and Mean Decrease Gini for the four key indicators when all of them are included as covariates in the random forests analysis. Mean Decrease Accuracy gives a rough estimate of the loss in prediction performance when a covariate is omitted from the training set. Mean Decrease Gini estimates how important a covariate is for splitting the data correctly. It does so by assessing the decrease in Gini, a measure of node purity, when a covariate is omitted.
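The tuning and importance diagnostics described in Appendixes 4 and 5 can be illustrated with a minimal sketch. This is not the paper's pipeline: it uses scikit-learn rather than the R randomForest implementation from which the Mean Decrease terminology originates, and the data here are synthetic stand-ins for the four key indicators. It shows the general pattern: grid over the number of trees and the number of covariates sampled per split, keep the combination with the lowest out-of-bag (OOB) error, and read off the impurity-based importances (analogous to Mean Decrease Gini).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-ins for four indicators; in the paper these would be
# census- and satellite-based covariates, not random draws.
X = rng.normal(size=(n, 4))
# Synthetic urban status, driven mainly by the first covariate.
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=n) > 0).astype(int)

best = None
for n_trees in (100, 300):              # number of trees in the forest
    for mtry in (1, 2):                  # covariates sampled per split
        rf = RandomForestClassifier(
            n_estimators=n_trees, max_features=mtry,
            oob_score=True, random_state=0,
        ).fit(X, y)
        oob_error = 1.0 - rf.oob_score_  # OOB error rate, as in Appendix 4
        if best is None or oob_error < best[0]:
            best = (oob_error, n_trees, mtry, rf)

oob_error, n_trees, mtry, rf = best
# Impurity-based importances (Gini decrease, normalized to sum to 1).
importances = rf.feature_importances_
print(f"best OOB error: {oob_error:.3f} (trees={n_trees}, mtry={mtry})")
print("Gini-based importances:", np.round(importances, 3))
```

Since the synthetic outcome loads most heavily on the first covariate, its importance dominates; with real data the ranking would instead reflect which indicators best separate urban from rural places.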