Policy Research Working Paper 7849

Retooling Poverty Targeting Using Out-of-Sample Validation and Machine Learning

Linden McBride
Austin Nichols

Development Economics Vice Presidency, Operations and Strategy Team
October 2016

Abstract: Proxy means test (PMT) poverty targeting tools have become common tools for beneficiary targeting and poverty assessment where full means tests are costly. Currently popular estimation procedures for generating these tools prioritize minimization of in-sample prediction errors; however, the objective in generating such tools is out-of-sample prediction. This paper presents evidence that prioritizing minimal out-of-sample error, identified through cross-validation and stochastic ensemble methods, in PMT tool development can substantially improve the out-of-sample performance of these targeting tools. The USAID poverty assessment tool and base data are used for demonstration of these methods; however, the methods applied in this paper should be considered for PMT and other poverty-targeting tool development more broadly.

JEL codes: C140, I320, O220, O150

Linden McBride (corresponding author) is a PhD candidate at Cornell's Dyson School of Applied Economics and Management; her email address is lem247@cornell.edu. Austin Nichols is a principal associate at Abt Associates; his email address is austinnichols@gmail.com. The authors gratefully acknowledge insights from Chris Barrett, Mark Schreiner, Daniel Fink, participants of the University of MN Trade and Development Seminar, participants of the Barrett Research Group Seminar, and two anonymous reviewers. The authors are especially grateful to Nicolai Meinshausen for his innovative quantile regression forest program. All errors are our own.

Accurate targeting is one of the most important components of an effective and efficient food security or social safety net intervention (Barrett and Lentz 2013; Coady, Grosh, and Hoddinott 2004). To achieve accurate targeting, project implementers seek to minimize rates of leakage (benefits reaching those who do not need them) and undercoverage (benefits not reaching those who do need them).
Full means tests for identification of project beneficiaries can include detailed expenditure and/or consumption surveys; while effective, such tests are also time consuming and expensive. Proxy means tests (PMTs), a shortcut to full means tests, were first developed for the targeting of social programs in Latin American countries during the 1980s. PMTs have become common tools for targeting and poverty assessment where full means tests are costly (Coady, Grosh, and Hoddinott 2004). Today they are used by USAID (United States Agency for International Development) microenterprise project implementing partners, the World Food Program, and the World Bank, among many others, for the purposes of poverty assessment, beneficiary targeting, and program monitoring and evaluation in developing countries (PAT 2014; WBG 2011).

PMT tools are typically developed by assignment of weights, or parameters, to a number of easily verifiable household characteristics via either regression or principal components analysis (PCA) in an available, nationally representative data set. In the regression approach, household-level income/expenditures or poverty status are regressed on household characteristics with the objective of selecting and parameterizing a subset of those characteristics to explain a significant proportion of the variation in expenditures/income or poverty status. In the PCA approach, the parameters are generated by extracting from a set of variables an orthogonal linear combination of a subset of those variables that captures most of the common variation (Filmer and Pritchett 2001; Hastie, Tibshirani, and Friedman 2009). Although each approach has its advocates, those interested solely in targeting tend to rely on regression approaches, while PCA has become popular among those interested in generating asset indices that may or may not be used for targeting. Note that the problem of developing tools for poverty targeting can be a fundamentally different problem from that of generating asset indices;[1] this paper speaks only to the problem of developing targeting tools.

[1] For example, we might be concerned about endogeneity but not concerned about out-of-sample performance when generating an asset index to estimate the relationship between school enrollment and wealth, as in Filmer and Pritchett (2001). We have no such endogeneity concern when generating targeting tools because we are not attempting causal inference; however, out-of-sample performance is a primary concern.

The regression approach to PMT tool development requires practitioners to select, from a large set of potential observables, a subset of household characteristics that can account for a substantial amount of the variation in the dependent variable. In practice, this is usually done through stepwise regression, and the best performing tool is selected as that which performs best in-sample; more recently, efforts to validate in-sample-generated tools via out-of-sample testing have also been introduced (Schreiner 2006).

Once a PMT tool has been developed from a sample from a particular population, the development practitioner can apply the tool to the subpopulation selected for intervention to rank or classify households according to PMT score. This process involves implementation of a brief household survey in the targeted subpopulation so as to assign values for each of the household characteristics identified during tool development. The observed household characteristics, $x_{ij}$, are then multiplied by the PMT tool weights, $\hat{\beta}_j$, for each characteristic $j$ to generate a PMT score, $\hat{y}_i$, for household $i$, as shown in equation (1):

$$\hat{y}_i = \sum_{j=1}^{J} \hat{\beta}_j x_{ij}. \qquad (1)$$
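In code, equation (1) is nothing more than a weighted sum of the short-survey responses. The following R sketch illustrates the scoring step; the weights and household characteristics shown are hypothetical, not those of any certified tool.

```r
# Minimal sketch of PMT scoring as in equation (1); the weights and survey
# responses below are hypothetical, not an actual certified tool.
weights <- c(intercept = 2.10, hhsize = -0.15, tv = 0.40, dirt_floor = -0.30)

hh <- data.frame(hhsize = c(6, 3), tv = c(0, 1), dirt_floor = c(1, 0))

# Score each household: intercept plus the weighted sum of its characteristics.
X <- cbind(intercept = 1, as.matrix(hh))
hh$pmt_score <- as.vector(X %*% weights[colnames(X)])

# Households can then be ranked by score, or counted against a threshold.
hh[order(hh$pmt_score), ]
```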
In many applications, the calculated PMT scores are used to rank households from poorest to wealthiest,[2] and the poorest households are selected as program beneficiaries. In the case of the USAID poverty assessment tools described below, the use is more conservative: the PMT scores are used to quantify the number of households above and below an identified poverty threshold so as to ensure proper allocation of USAID funds (PAT 2014). The methodological improvements we propose in this paper apply to both types of uses for PMT tools.

[2] There are several long-standing debates as to whether targeting tools, PCA-type asset indices, and/or the use of consumption or income data in the regression approach capture long-run economic status, permanent income, current consumption levels, current welfare, nonfood spending, or something else altogether. Lee (2014) points out that much of the theoretical support for these various claims is dubious and offers a theoretically grounded approach to the development of asset indices to measure poverty. As much as possible, we remain agnostic on the particular type of well-being that PMT tools capture while noting that the methods we discuss and the way in which we discuss them (e.g., their interpretation as capturing household poverty status) are standard in the literature and in practice.

Overall, the objective of a PMT tool is to quickly and accurately identify households meeting particular criteria in a new setting (but under the same data-generating process) using a model parameterized with previously available data. Therefore, for PMT tools to serve their purpose, it is important that they perform well not only within the data set or sample in which they were parameterized but also, especially, within the new data set or sample. In other words, high out-of-sample prediction accuracy must be prioritized in the development of PMT tools. In the fields of machine learning and predictive analytics, stochastic ensemble methods have been shown to perform very well out-of-sample due to the bias- and variance-reducing features of such methods.

In this paper, we present evidence that the prioritization of the out-of-sample performance of PMT targeting tools can substantially improve their out-of-sample accuracy. We propose two methods for this prioritization: (1) selecting a tool based on its cross-validation performance and (2) using stochastic ensemble methods, which have cross-validation built in, to develop the tool. Stochastic ensemble methods offer the additional feature, over and above traditional methods combined with cross-validation, of selecting the variables with which to build the tool, an otherwise time-consuming process. We take a set of PMT tools that have been developed by the University of Maryland IRIS Center (IRIS: Institutional Reform and Informal Sector) for the purpose of USAID poverty assessment for demonstration of these methods; however, the methods applied in this paper should be considered for PMT and other poverty-targeting tool development more broadly.
We next present the USAID poverty assessment tool development and accuracy evaluation criteria; we then introduce the stochastic ensemble algorithms, regression forests and quantile regression forests, that we apply to the problem of developing more accurate out-of-sample targeting tools; an explanation of our data and methods follows. We close with results and conclusions.

I. THE USAID POVERTY ASSESSMENT TOOL

The development of the USAID poverty assessment tool (PAT) dates from 2000, when the US Congress passed the Microenterprise for Self-Reliance and International Anti-Corruption Act, mandating that half of all USAID microenterprise funds benefit the very poor (PAT 2014). In the context of this legislation, the very poor are defined as those households living on less than the equivalent of a dollar per day or those households considered "among the poorest 50 percent of households below the country's own national poverty line" (IRIS Center 2005). Subsequent legislation required USAID to develop and certify low-cost tools to enable its microenterprise project-implementing partners[3] to assess the poverty status of microenterprise beneficiaries. USAID engaged the IRIS Center at the University of Maryland in 2003 to create the tools. To date, the IRIS Center has developed, and USAID has certified, tools for 38 countries.[4]

[3] The implementing partners who are required to make use of the PAT include "all projects and partner organizations receiving at least US$100,000 from USAID in a fiscal year for microenterprise activities in countries with a USAID-approved tool" (PAT 2014). In 2013, this entailed 71 partners receiving a total of 110 million dollars (USAID MRR).

[4] Albania, Azerbaijan, Bangladesh, Bolivia, Bosnia and Herzegovina, Cambodia, Colombia, East Timor, Ecuador, El Salvador, Ethiopia, Ghana, Guatemala, Haiti, India, Indonesia, Jamaica, Kazakhstan, Kenya, Kosovo, Liberia, Madagascar, Malawi, Mexico, Nepal, Nicaragua, Nigeria, Paraguay, Peru, the Philippines, Rwanda, Senegal, Serbia, Tanzania, Tajikistan, Uganda, Vietnam, and the West Bank.

Using existing Living Standards Measurement Study (LSMS) data as well as survey data collected by IRIS, the IRIS Center developed country-specific PAT tools following the general PMT development procedure: they first identified a subset of household characteristics (approximately 15), from the larger data set of 70–125 available observables, that accounted for the greatest variation in household-level income via an R-squared maximization routine, SAS MAXR;[5] they then selected for the final tool the parameters identified by the statistical model—whether ordinary least squares (OLS), quantile regression, logit, or probit—that produced the highest predictive accuracy in-sample. In some cases, but not all, out-of-sample validation tests were performed.

[5] The MAXR procedure operates by selecting and rejecting variables one by one with the objective of maximizing the improvement in a model's R² (SAS 2009).
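To give a concrete sense of such a routine, the R sketch below performs simple forward selection toward the same R²-maximizing goal. It is a simplified analogue of MAXR, which additionally considers swapping variables in and out at each model size; the data set and variable names in the usage comment are hypothetical.

```r
# Simplified forward-selection analogue of an R^2-maximizing search such as
# SAS MAXR (which also swaps variables in and out at each model size).
forward_r2 <- function(df, response, candidates, k = 15) {
  selected <- character(0)
  for (step in seq_len(min(k, length(candidates)))) {
    remaining <- setdiff(candidates, selected)
    # R^2 from adding each remaining candidate to the current set
    r2 <- sapply(remaining, function(v) {
      f <- reformulate(c(selected, v), response = response)
      summary(lm(f, data = df))$r.squared
    })
    selected <- c(selected, remaining[which.max(r2)])
  }
  selected
}

# Hypothetical use: pick 15 of the candidate proxies for log expenditure.
# vars <- forward_r2(lsms, "lnexp", setdiff(names(lsms), "lnexp"), k = 15)
```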
The predictive ability of the resulting PMT model was evaluated against a number of accuracy criteria—total accuracy, poverty accuracy, undercoverage, leakage, and the balanced poverty accuracy criterion—each of which is defined below. These criteria allow for ex ante evaluation of the generated poverty assessment tools via systematic consideration of each possible outcome/error type as presented in the confusion matrix in table 1: true positive (the true very poor, $p = 1$, are identified by the tool as very poor, $\hat{p} = 1$); false negative (the true very poor, $p = 1$, are identified by the tool as non very poor, $\hat{p} = 0$); false positive (the true non very poor, $p = 0$, are identified by the tool as very poor, $\hat{p} = 1$); and true negative (the true non very poor, $p = 0$, are identified by the tool as non very poor, $\hat{p} = 0$). The classification literature has developed many metrics based on confusion matrices, such as that presented in table 1, for the assessment of classification accuracy; the IRIS Center draws on standard metrics from the literature and has also developed a new metric for their evaluation of the PAT.

Following the IRIS Center and relying on the categories given in table 1, the accuracy criteria we use to assess PAT performance are defined as follows. Total accuracy (TA) is the sum of the correctly predicted very poor and the correctly predicted non very poor as a percentage of the total sample, TA = (TP + TN)/(TP + TN + FP + FN). Poverty accuracy (PA) is the correctly predicted very poor as a percentage of the total true very poor, PA = TP/(TP + FN). The undercoverage rate is the ratio of true very poor incorrectly predicted as non very poor to total true very poor, UC = FN/(TP + FN), while the leakage rate is the ratio of true non very poor incorrectly identified as very poor to total true very poor, LE = FP/(TP + FN). Finally, the balanced poverty accuracy criterion (BPAC) is the correctly predicted very poor as a percentage of the true very poor minus the absolute difference between the undercoverage and leakage rates, BPAC = TP/(TP + FN) − |FN/(TP + FN) − FP/(TP + FN)|. These accuracy criteria are summarized in table 2.

Total accuracy, or one minus the mean squared error (for a binary outcome, one minus the misclassification rate), is very familiar to economists as a metric for model assessment. However, there are several reasons why total accuracy might not be an adequate metric for assessing the accuracy of a poverty tool. Consider an example wherein a population of 100 includes 10 poor households. A tool that simply classifies the entire population as nonpoor would have a total accuracy rate of 90 percent, which seems quite good. However, this tool would have failed to identify a single poor household. Therefore, metrics beyond total accuracy are necessary for assessment of poverty tool performance; these additional metrics include poverty accuracy (known as recall, or sensitivity, in the classification and predictive analytics literature) and the undercoverage (false negative) and leakage (false positive) rates. In the example just given, the poverty accuracy of the tool would be 0 percent, and the undercoverage rate would be 100 percent. These additional metrics offer a better picture of the tool's performance than does total accuracy alone.

The BPAC combines these three metrics—poverty accuracy, undercoverage, and leakage—by penalizing the poverty accuracy rate with the extent to which the leakage and undercoverage rates exceed one another. The BPAC is an innovation of the IRIS Center; it was created to balance "the stipulations of the Congressional Mandate against the practical implications of the assessment tools" (IRIS 2005). The other criteria are standard in PMT development.
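The definitions in table 2 translate directly into code. The following R sketch computes all five criteria from true and predicted poverty indicators, using IRIS's leakage denominator (discussed below); the closing example reproduces the 100-household illustration above.

```r
# Accuracy criteria from table 2, computed from true (p) and predicted (p_hat)
# poverty indicators; note the IRIS leakage denominator (TP + FN), not (TP + FP).
pat_metrics <- function(p, p_hat) {
  TP <- sum(p == 1 & p_hat == 1)
  TN <- sum(p == 0 & p_hat == 0)
  FP <- sum(p == 0 & p_hat == 1)
  FN <- sum(p == 1 & p_hat == 0)
  TA   <- (TP + TN) / (TP + TN + FP + FN)
  PA   <- TP / (TP + FN)
  UC   <- FN / (TP + FN)
  LE   <- FP / (TP + FN)
  BPAC <- PA - abs(UC - LE)
  c(TA = TA, PA = PA, UC = UC, LE = LE, BPAC = BPAC)
}

# The example from the text: 100 households, 10 poor, everyone classified nonpoor.
pat_metrics(p = c(rep(1, 10), rep(0, 90)), p_hat = rep(0, 100))
# Returns TA = 0.90, PA = 0, UC = 1, LE = 0, and hence BPAC = -1.
```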
However, it should be noted that IRIS computes leakage in an unconventional manner.[6] PAT model selection for each country was ultimately made by IRIS based on the in-sample BPAC results. While we follow the prioritization of the BPAC criterion in the analysis that follows, the methods we propose can just as easily be used to meet other prioritized accuracy criteria.

[6] Whereas leakage rates are commonly computed as FP/(TP + FP), IRIS computes leakage rates as FP/(TP + FN). This adjustment to the denominator has two consequences: (1) it can lead to calculated leakage rates that are greater than one, producing a heavy penalty in the calculation of BPAC where leakage occurs (it is not clear that IRIS intended this outcome); (2) it keeps the denominator constant across the poverty accuracy, undercoverage, and leakage rates, allowing IRIS to easily perform the addition and subtraction necessary for the BPAC calculation. We assume this was IRIS's purpose in modifying the denominator.

II. STOCHASTIC ENSEMBLE METHODS: REGRESSION FORESTS AND QUANTILE REGRESSION FORESTS

Classification and regression trees are a class of supervised learning methods that produce predictive models via stratification of a feature space (in the case of poverty tool development, a feature is a variable or characteristic) into a number of regions following a decision rule (Hastie, Tibshirani, and Friedman 2009). A canonical and intuitive example of a classification tree is that of predicting, based on a number of features such as age, gender, and class, who survived the sinking of the Titanic.[7] While both classification and regression trees can be used to make predictions regarding the poverty status of households based on observable household characteristics, this paper focuses on regression forests and, in particular, quantile regression forests due to the advantages the latter offer in terms of making predictions about households concentrated at the lower end of the income distribution.

[7] See Varian (2014) for an example. Many examples and data are also available at The Comprehensive R Archive Network at http://cran.r-project.org.

Regression trees operate via a recursive binary splitting algorithm as follows (Hastie, Tibshirani, and Friedman 2009): for $N$ observations of a response variable, $y_i$, and a vector of characteristics, $x_i$, where $i = 1, 2, \ldots, N$ indexes observations and $j = 1, 2, \ldots, J$ indexes features, consider the splitting variable, $j$, and the split point, $s$, that define the half-planes $R_1$ and $R_2$, as indicated in equation (2):

$$R_1(j, s) = \{x \mid x_j \leq s\} \quad \text{and} \quad R_2(j, s) = \{x \mid x_j > s\}. \qquad (2)$$

The algorithm selects $j$ and $s$ to solve the minimization problem

$$\min_{j,\, s} \left[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \right], \qquad (3)$$

where the inner minimizations are solved by

$$\hat{c}_1 = \operatorname{ave}(y_i \mid x_i \in R_1(j,s)) \quad \text{and} \quad \hat{c}_2 = \operatorname{ave}(y_i \mid x_i \in R_2(j,s)). \qquad (4)$$

In words, the regression tree algorithm chooses the variable $j$ (the splitting variable) and the value $s$ of that variable (the split point) that minimize the summed squared distance between the mean response and the actual responses for the observations found in each of the resulting regions. In this manner, the algorithm effectively weights the response variables by the predictive value of the observations within each region (Lin and Jeon 2006). Once the optimal split in equation (3) is identified, the algorithm proceeds within the new partitions.
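A didactic R sketch of this split search, implementing equations (2) through (4) by exhaustive enumeration, is given below.

```r
# Exhaustive search for the (variable, split point) pair minimizing the summed
# squared deviations from the region means (equations 2-4); a didactic sketch.
best_split <- function(X, y) {
  best <- list(sse = Inf)
  for (j in seq_len(ncol(X))) {
    for (s in unique(X[, j])) {
      left  <- y[X[, j] <= s]
      right <- y[X[, j] > s]
      if (length(left) == 0 || length(right) == 0) next
      sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (sse < best$sse) best <- list(j = j, s = s, sse = sse)
    }
  }
  best  # the algorithm then recurses within the two resulting regions
}
```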
One way to think about a regression tree is as an OLS regression for which one knows in advance all of the split variables and split points across which to partition, and then conditionally partition, the feature space, and which therefore defines the appropriate binary variables and interaction terms to capture these partitions. Such an OLS regression would return the same results as a regression tree built over the same data. However, such split variables and split points are not known in advance; what the regression tree algorithm offers over and above OLS is therefore a heuristic method for the selection of those variables, split points, and conditional splits—the binary variables and their interactions—with which to build the model so as to minimize prediction error. To do this using OLS would require a stepwise regression that iterates and then conditionally iterates through each split point of each variable—a computationally intensive process.

The recursive binary splitting process of the regression tree can continue until a stopping criterion is reached; however, larger trees may overfit the data. In the case that we want to bootstrap over this algorithm—a good idea, as the algorithm may make different splitting decisions in different subsets of the data—it becomes apparent that a bias-for-variance trade-off is made as we allow the trees to grow large.[8] A collection of larger trees will have high variance but low bias, while a collection of smaller trees will have low variance but high bias.

[8] A variety of options for "pruning" trees exist to address these issues in a regression tree framework (Hastie, Tibshirani, and Friedman 2009). We don't discuss these here but move on instead to random forests, which address the problem without pruning.

Fortunately, in this setting, the bias-variance trade-off can be somewhat overcome via a process called bootstrap aggregation, or bagging. Bagging involves bootstrapping a number of approximately unbiased and identically distributed regression trees and then averaging across them so as to reduce the variance of the predictor. However, bagging cannot address the persistent variance that arises due to the fact that the trees themselves are correlated, as they were generated over the same feature space. Consider, for example, a set of $B$ identically distributed but correlated regression trees, each with variance $\sigma^2$. If $\rho$ represents the pairwise correlation between the trees, then the variance of the average of these trees is

$$\rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2.$$

As $B$ grows large, the second term will approach zero, reducing the overall variance. However, the first term, $\rho \sigma^2$, persists (Hastie, Tibshirani, and Friedman 2009). Reducing this persistent variance component of the bagged predictor is the innovation of random forests.

Introduced by Breiman (2001), regression forests improve the variance reduction feature of bagged regression trees by decorrelating the trees, and thereby reducing $\rho$, via a random selection of the features (variables) over which the algorithm may split. The number of random features available to the algorithm at any split is typically limited to one-third of the total number of features (Hastie, Tibshirani, and Friedman 2009); this is a tuning parameter of the algorithm. Critically, in a random forest algorithm, the mean squared error of the prediction is estimated in the "out-of-bag" (OOB) sample, the (on average) third of the training data set on which any given tree has not been built (Breiman 2001), in a manner similar to k-fold cross-validation. This OOB sample offers an unbiased estimate of the model's performance out-of-sample.
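In the randomForest package for R (Liaw and Wiener 2002), both the random feature subsetting and the OOB error estimate are directly exposed. A sketch with simulated (hypothetical) data:

```r
library(randomForest)  # Liaw and Wiener (2002)

# Hypothetical training data: log expenditure y and 15 household proxies X.
set.seed(1)
X <- data.frame(matrix(rnorm(500 * 15), ncol = 15))
y <- rowSums(X[, 1:5]) + rnorm(500)

# ntree = 500 trees; mtry = J/3 features tried at each split (the regression default).
rf <- randomForest(x = X, y = y, ntree = 500, mtry = floor(ncol(X) / 3))

# rf$mse holds the running out-of-bag MSE after 1, 2, ..., 500 trees: each
# observation is predicted only by the trees whose bootstrap sample omitted it.
plot(rf$mse, type = "l", xlab = "number of trees", ylab = "OOB MSE")
```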
The random forest training algorithm produces a collection of trees, denoted $T(x; \Theta_b)$, $b = 1, \ldots, B$, where $\Theta_b$ characterizes the $b$th tree. The regression forest predictor is then the bagged prediction

$$\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} T(x; \Theta_b). \qquad (5)$$

The regression forest algorithm is detailed in the Appendix.

It has been shown that regression forests offer consistent and approximately unbiased estimates of the conditional mean of a response variable (Breiman 2004; Hastie, Tibshirani, and Friedman 2009). However, as elaborated by Koenker (2005), among others, the conditional mean tells only part of the story of the conditional distribution of $y$ given $X$. Therefore, we also apply quantile regression forests, as developed by Meinshausen (2006), to our PMT tool development. Meinshausen (2006) draws on insights from Lin and Jeon (2006), who show that random forest predictors can be thought of as weighted means of the response variable, $y_i$, as shown in equation (6):

$$\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} T(x; \Theta_b) = \frac{1}{B} \sum_{b=1}^{B} \sum_{i=1}^{N} w_i(x; \Theta_b)\, y_i = \sum_{i=1}^{N} w_i(x)\, y_i. \qquad (6)$$

In equation (6), $w_i(x; \Theta_b)$ represents the weight vector obtained by averaging over the observed values in a given region $R_\ell$ ($\ell = 1, \ldots, L$), and $w_i(x) = \frac{1}{B} \sum_{b=1}^{B} w_i(x; \Theta_b)$. Application of the weight vector to the response variable is simply another way of expressing the conditional averaging of the response variable, as represented in equation (4) above and shown in equation (7):

$$\sum_{i=1}^{N} w_i(x; \Theta_b)\, y_i = \operatorname{ave}(y_i \mid x_i \in R_\ell(x; \Theta_b)). \qquad (7)$$

With this insight, Meinshausen (2006) produces quantile regression forests as a generalization of regression forests in which not only the conditional mean but the entire conditional distribution of the response variable is estimated (equation 8):

$$\hat{F}(y \mid X = x) = \sum_{i=1}^{N} w_i(x)\, \mathbf{1}\{y_i \leq y\}. \qquad (8)$$

Meinshausen (2006) provides a proof of the consistency of this method and demonstrates the gains in predictive performance of quantile regression forests over linear quantile regression. These gains are due to the fact that quantile regression forests retain all the bias-minimizing and variance-reducing components of regression forests in that they bootstrap aggregate across a great number of decorrelated trees; quantile regression forests additionally offer the ability to make predictions across the conditional distribution. A quantile approach is particularly useful for the purposes of PMT tool development because the very poor are often concentrated at one end of the conditional income distribution, far from the conditional mean. The quantile regression forest algorithm is detailed in the Appendix.

The advantages that stochastic ensemble methods, such as the regression forest and quantile regression forest algorithms, offer over traditional PMT development tools include the selection of the variables that offer the greatest predictive accuracy without the need to resort to stepwise regression and/or running multiple model specifications—rather, the algorithms build the model—and built-in cross-validation via the out-of-bag error estimates. Using regression forest and quantile regression forest algorithms, we therefore expect to realize improvements in the out-of-sample targeting accuracy of the PAT. We note, however, that this method requires the critical assumption that the data-generating process remains unchanged between tool development and tool application. That is, the algorithm can perform well out of sample but not out of population. This limitation plagues any sample-based estimation routine.
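Concretely, predictions at a low conditional quantile can be obtained with Meinshausen's quantregForest package for R. The following is a sketch with simulated data; we note that the name of the quantile argument to predict has varied across package versions.

```r
library(quantregForest)  # Meinshausen (2016)

# Same hypothetical data as in the regression forest sketch above.
set.seed(1)
X <- data.frame(matrix(rnorm(500 * 15), ncol = 15))
y <- rowSums(X[, 1:5]) + rnorm(500)

qrf <- quantregForest(x = X, y = y, ntree = 500)

# Predict a low conditional quantile rather than the conditional mean; a
# household is then classified as very poor if this prediction falls below
# the poverty line. ('what' is the quantile argument in package version 1.3-5;
# earlier versions called it 'quantiles'.)
q40 <- predict(qrf, newdata = X, what = 0.40)
```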
III. EMPIRICAL METHOD AND DATA

We produce a set of country-specific examples from the survey data that was used by the IRIS Center to construct their PATs. We replicate the PAT development process by extracting the same variables that IRIS extracted from the same data sets and then generating identical estimation models. We are limited in our replication to LSMS data sets that are publicly available. We have additionally constrained ourselves to the LSMS data sets for which income or expenditure aggregates are also publicly available, given the challenges of precisely replicating an income or expenditure aggregate that IRIS may have generated. From the publicly available data sets meeting these criteria, we selected three more or less arbitrarily: the 2005 Bolivia Encuesta de Hogares (EH), the 2001 Timor Leste Living Standards Survey (TLSS), and the 2004–2005 Malawi Second Integrated Household Survey (IHS2).

These data sets present a reasonable representation of the settings in which PATs have been developed. Each data set differs in number of observations, poverty level, and IRIS-selected household characteristics. The data are summarized in table 3, where we can see that the number of household-level observations ranges from 1,800 in East Timor to 11,280 in Malawi. Likewise, the USAID-defined poverty rates range considerably, from 24.2 percent in Bolivia to 64.8 percent in Malawi. The fourth column of table 3 displays the household-level characteristics selected by IRIS for PAT tool development; many characteristics, such as household size, age of household head, household construction materials, and material possessions, are common across data sets.

We provide the IRIS-reported in-sample accuracy estimates for each country-level data set in row 1 of each country panel of Appendix table A1. These are the estimates on which the IRIS model selection was made. We provide the IRIS-reported out-of-sample accuracy assessment results for each country in rows 2–4 of table A1. We replicate the IRIS in-sample models and report the replication estimates in row 5 of each panel of Appendix table A1. Within-country comparisons of our replication estimates (table A1, row 5) with the estimates reported by IRIS (table A1, row 1) serve as a check on how well we have replicated the PAT tool development process. In the case of Bolivia, our replication estimates do not perform as well as those of IRIS; however, it should be noted that IRIS built the Bolivia PAT tool on a randomly selected subset of the data. We cannot replicate precisely the same random draw and so report the full-sample estimates. The full-sample replication does not perform as well as the half-sample performance reported by IRIS, but that half sample is unusual in its high performance and not representative of the thousand half-sample splits we explored or that IRIS reported for their calculation of out-of-sample performance (see rows 2 through 4 of Appendix table A1 for Bolivia). For this reason, we are not concerned about spuriously overestimating the performance of our methods relative to those of IRIS and therefore retain this data set in our analysis. In the case of East Timor and Malawi, our replication estimates are very close to those reported by IRIS, and we are likewise not concerned about unfair comparisons of our methods with those of IRIS.

Our empirical approach is to randomly draw, with replacement, two samples of size N/2 from each country-level data set, producing a training sample and a testing sample.
Over this split of the data, we first reproduce IRIS's methods, training their preferred model in the training data and then testing it on 1,000 bootstrap samples of the testing data.[9] However, instead of basing tool selection on in-sample performance as IRIS does, we perform k-fold cross-validation in the training sample and select as our preferred model the one that produces the best BPAC in cross-validation. In particular, we produce 500 iterations of three-fold cross-validation, which entails training the model on two-thirds of the training data set and assessing performance in the remaining third on which the model was not trained. We take this approach because it most closely approximates the out-of-bag error produced using the stochastic ensemble methods. Following the method for out-of-sample testing used by the IRIS Center, we test the classification accuracy of the cross-validation-selected tool using 1,000 bootstrapped samples of the testing sample. The out-of-sample performance of this tool in the testing sample is presented for each country in figures 1–3, as well as in Appendix table A1, rows 6 through 8. We refer to this approach of using cross-validation to select the best-performing model in the training sample as the "cross-validation" approach throughout the remaining sections to distinguish it from both IRIS's approach and the stochastic ensemble approach (note that stochastic ensemble methods also use cross-validation; however, it is referred to as out-of-bag error in that setting).

[9] This method was first used in Schreiner (2006).

We next turn to the stochastic ensemble methods. Over the same split of the data as used for the cross-validation approach, the random forest and quantile regression forest models are built in the training sample, where, for any given tree, an average of two-thirds of the training data are used to build bagged regression trees and the remaining third is reserved for out-of-bag, and therefore unbiased, running estimates of the prediction error over a forest of 500 trees.[10] We run the regression forest and quantile regression forest algorithms in R using packages developed by Liaw and Wiener (2002) and Meinshausen (2016), respectively. We select our preferred model as that with the lowest BPAC error in the OOB sample. This model is then taken to the testing sample to assess classification accuracy. The performance of this tool in the testing sample is presented for each country in figures 1–3, as well as in Appendix table A1, rows 9 through 11.

[10] Five hundred trees is the default setting in the randomForest package in R. From casual observation, the OOB error has largely stabilized by the time the forest has reached 200–300 trees; this observation is consistent with the literature (Hastie, Tibshirani, and Friedman 2009).

We statistically compare the mean of the IRIS-reported bootstrapped accuracy estimates with those produced using both of our approaches to tool development—the cross-validation approach and the stochastic ensemble approach—using Tukey-Kramer tests, selected to account for the family-wise error rate. The results are reported in table 4.
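A compact, self-contained R sketch of this protocol follows, using synthetic data: a 50/50 split drawn with replacement, three-fold cross-validation over a grid of quantile regression tools, selection by cross-validated BPAC, and bootstrap assessment in the testing sample. The data, the quantile grid, and the single cross-validation iteration are illustrative simplifications of the procedure described above.

```r
library(quantreg)  # Koenker's quantile regression package

# Synthetic stand-in data: log expenditure y, two proxies, and a poverty line.
set.seed(1)
df   <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))
df$y <- 1 + df$x1 - df$x2 + rnorm(1000)
line <- quantile(df$y, 0.3)  # hypothetical poverty line

bpac <- function(p, ph) {  # BPAC as defined in table 2; p, ph are logicals
  TP <- sum(p & ph); FN <- sum(p & !ph); FP <- sum(!p & ph)
  TP / (TP + FN) - abs(FN / (TP + FN) - FP / (TP + FN))
}

n     <- nrow(df)
train <- df[sample(n, n / 2, replace = TRUE), ]
test  <- df[sample(n, n / 2, replace = TRUE), ]

# One iteration of three-fold cross-validation over a grid of quantiles
# (the paper repeats this 500 times); the best average CV BPAC picks the tool.
taus    <- seq(0.30, 0.60, by = 0.02)
cv_bpac <- sapply(taus, function(tau) {
  fold <- sample(rep(1:3, length.out = nrow(train)))
  mean(sapply(1:3, function(k) {
    fit  <- rq(y ~ x1 + x2, tau = tau, data = train[fold != k, ])
    hold <- train[fold == k, ]
    bpac(hold$y < line, predict(fit, newdata = hold) < line)
  }))
})
best <- rq(y ~ x1 + x2, tau = taus[which.max(cv_bpac)], data = train)

# Out-of-sample assessment: BPAC over 1,000 bootstrap samples of the test set.
boot <- replicate(1000, {
  bs <- test[sample(nrow(test), replace = TRUE), ]
  bpac(bs$y < line, predict(best, newdata = bs) < line)
})
quantile(boot, c(0.025, 0.975))  # nonparametric 95% bootstrap interval
```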
Finally, so as to assess the robustness of our results to the poverty thresholds in each country, we report in Appendix table A2 the performance of our methods as compared with those of IRIS under two new poverty lines: one that is half the original poverty line and a second that is twice the original poverty line. We cannot observe actual IRIS tool performance metrics under these new poverty lines, but we estimate the best possible results IRIS could have obtained using their methods and preferred tools by adapting those tools to obtain the greatest BPAC under the new poverty lines. In practice, this means selecting the quantile that offers the best in-sample BPAC under the new poverty lines in Bolivia and Malawi. In the case of East Timor, we include a quantile regression approach along with IRIS's preferred approach under the original poverty line, the probit model, because the probit performs poorly at the lower poverty line. This means we are comparing our cross-validation and ensemble method approaches to the best possible outcomes of the approach employed by IRIS.

IV. RESULTS

Results of the cross-validation (CV) and stochastic ensemble (SE) approaches to PMT tool development are displayed graphically in figures 1, 2, and 3 and numerically in Appendix table A1. In both formats, we compare the out-of-sample bootstrap accuracy estimates of the IRIS-produced tools (rows 2–4 in table A1) with those produced by each of our approaches. The confidence bars in each figure display the nonparametric bootstrap confidence intervals, where the lower bound is the 2.5th percentile and the upper bound is the 97.5th percentile bootstrap estimate. Standard errors are reported in table A1. In addition, Tukey-Kramer tests of the differences in the out-of-sample bootstrap means are reported in table 4.

While cross-validation improves on the total accuracy of the IRIS-generated tool only in the case of Bolivia, and the stochastic ensemble methods do not improve on the total accuracy at all (figure 1, first graph), gains in poverty accuracy are observed using cross-validation across all countries and using stochastic ensemble methods in both East Timor and Malawi (figure 1, second graph). Recall from the discussion above that total accuracy has serious limitations as a metric for assessing the performance of a poverty-targeting tool. From figure 2 (first graph), we can see that these gains in poverty accuracy are not without trade-offs: the leakage rates for the cross-validation and stochastic ensemble approaches are significantly greater than those reported for the IRIS-generated tools in both Bolivia and East Timor, meaning that these tools err on the side of classifying nonpoor households as poor. Given that leakage rates are heavily penalized by the IRIS accuracy metrics, these increases are not very surprising. Meanwhile, the cross-validation approach performs much better than IRIS's in terms of undercoverage rates; the undercoverage rate is decreased across all countries (figure 2, second graph). The stochastic ensemble approach likewise outperforms IRIS's in both East Timor and Malawi.

The critical question, then, is how these trade-offs net out in terms of USAID's key accuracy metric, the BPAC. Figure 3 demonstrates that the cross-validation approach outperforms the IRIS-generated tool in each country. Improvements range from 2.7 percent in Malawi to 17.5 percent in Bolivia.
The performance of the stochastic ensemble approach closely follows that of the cross-validation approach in both East Timor and Malawi; although the cross-validation results are statistically significantly different from the stochastic ensemble results, the magnitude of those differences is trivial in the case of Malawi and quite small in the case of East Timor (table 4). In addition to gains in average BPAC, we also see large gains in the lower-bound (2.5th percentile) performance using cross-validation and stochastic ensemble methods. The cross-validation (stochastic ensemble) approach improves the lower-bound BPAC accuracy in Bolivia by 38 (7) percent, in East Timor by 11 (8) percent, and in Malawi by 3 (2) percent.

Although the gains in poverty accuracy and BPAC in Malawi using the cross-validation approach are not as impressive as those in Bolivia and East Timor, note that the tool is able to outperform the already relatively accurate IRIS tool for Malawi in terms of these metrics while also reducing both the leakage and undercoverage rates.

The relatively strong performance of the cross-validation approach compared with the stochastic ensemble approach is due to the fact that the cross-validation approach benefits from IRIS's time and effort in selecting, from a large set of possible variables, a subset that explains much of the variation in the dependent variable. Because we have limited our analysis to the same subset of variables as selected by IRIS for their preferred models, the relative strengths of the stochastic ensemble methods in terms of variable selection are not well displayed through this analysis. Therefore, it remains an open question (which we plan to address in a later paper) whether our stochastic ensemble approach would outperform the combination of IRIS's parametric model with cross-validation had we begun with the full set of 70–125 variables instead of the selected subset. Our analysis does suggest, however, that the proxy means test tool developer who prefers to skip the time-consuming and computationally intensive process of stepwise regression followed by the comparison of multiple model specifications would do at least nearly as well in terms of out-of-sample performance as the tool developer who does take the time to perform these analyses and then combine them with cross-validation.

Finally, the robustness results for the assessment of tool performance under new poverty lines are reported in Appendix table A2. From a comparison of rows 2, 6, and 9 for each country, we can see that the cross-validation and stochastic ensemble approaches perform about the same as the IRIS approach under the new poverty lines. Overall, however, across all results, including the robustness results, we find that the cross-validation and stochastic ensemble approaches do no worse than, and in many cases substantially outperform, the traditional approach to PMT tool development.

V. CONCLUSION

We have proposed methods for the improvement of a particular type of poverty-targeting tool: proxy means test targeting. In the country-level case studies analyzed here, prioritizing the out-of-sample performance of these targeting tools during tool development, whether by selecting a model based on its cross-validation performance or by using a method such as stochastic ensembles that both selects variables and performs cross-validation along the way, can significantly improve the out-of-sample performance of these tools.
In particular, we find that application of cross-validation and stochastic ensemble methods to the problem of developing a poverty-targeting tool produces a gain in poverty accuracy, a reduction in undercoverage rates, and an overall improvement in BPAC in comparison with traditional methods.

Our analysis takes as given the IRIS-selected PAT variables so as to demonstrate the power of machine learning methods in this setting; however, beginning with a larger set of variables over which the stochastic ensemble methods may build a targeting model may produce even greater gains in targeting accuracy for this approach than observed here.[11] The gains in accuracy we have reported are therefore likely conservative. Moreover, applying a stochastic ensemble approach over a larger set of variables would obviate the time-consuming tasks of both stepwise regression for variable selection and the process of running and comparing the performance of multiple statistical models, as was done by the IRIS Center. Overall, our findings suggest that further exploration of machine learning methods for PMT tool development is merited.

[11] Note, however, that an algorithm cannot be given completely free rein in variable selection: for the selected variables to contribute meaningfully to a PMT, they must be easily observable household characteristics that can be quickly verified with a visit to the household.

VI. APPENDIX

Random forest algorithm (Hastie, Tibshirani, and Friedman 2009; Breiman 2001):

1. Grow $B$ trees, $T_b$, $b = 1, \ldots, B$, by recursively repeating steps (a)–(c):
   a. Select $m$ variables at random from the total $J$ variables ($j = 1, \ldots, J$).
   b. Select the splitting variable and split point to solve the minimization problem shown in equations (2)–(4).
   c. Split the data into the resulting regions.
2. Output the ensemble of trees $\{T_b\}_{1}^{B}$.
3. To make a prediction at a new point, $x$, drop the observation down all trees and calculate $\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)$.

Quantile regression forest algorithm (Meinshausen 2006):

1) Grow $B$ trees, $T_b$, $b = 1, \ldots, B$, as in the random forest algorithm. However, retain the values of all observations in a given region, not just their average.

2) For a given $x$, drop the observation down all trees and compute the weight, $w_i(x; \Theta_b)$, of observation $i$ for every tree $b$ as

$$w_i(x; \Theta_b) = \frac{\mathbf{1}\{x_i \in R_{\ell}(x; \Theta_b)\}}{\#\{i' : x_{i'} \in R_{\ell}(x; \Theta_b)\}}.$$

Then compute the weight for every observation as an average over all trees, $w_i(x) = \frac{1}{B} \sum_{b=1}^{B} w_i(x; \Theta_b)$.

3) Compute the estimate of the distribution function as $\hat{F}(y \mid X = x) = \sum_{i=1}^{N} w_i(x)\, \mathbf{1}\{y_i \leq y\}$ for all $y$.
REFERENCES

Barrett, C. B., and E. Lentz. 2013. "Hunger and Food Insecurity." In D. Brady and L. M. Burton, eds., The Oxford Handbook of Poverty and Society. Oxford: Oxford University Press.

Breiman, L. 2001. "Random Forests." Machine Learning 45: 5–32.

———. 2004. "Consistency for a Simple Model of Random Forests." Technical report, University of California, Berkeley.

Coady, D., M. Grosh, and J. Hoddinott. 2004. Targeting of Transfers in Developing Countries: Review of Lessons and Experience. Washington, DC: International Bank for Reconstruction and Development.

Filmer, D., and L. H. Pritchett. 2001. "Estimating Wealth Effects without Expenditure Data or Tears: An Application to Educational Enrollments in States of India." Demography 38 (1): 115–32.

Grosh, M., and J. Baker. 1995. "Proxy Means Tests for Targeting Social Programs." LSMS Working Paper No. 118. World Bank, Washington, DC.

Hastie, T., R. J. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer.

IRIS Center. 2005. "Note on Assessment and Improvement of Tool Accuracy. Poverty Assessment Tools." USAID. Accessed January 2014. http://www.povertytools.org/training_documents/Introduction%20to%20PA/Accuracy_Note.pdf.

———. 2007. "Poverty Assessment Tool Accuracy Submission. USAID/IRIS Tool for Timor-Leste. Poverty Assessment Tools." USAID. Accessed January 2014. http://www.povertytools.org/tools.html.

———. 2009. "Poverty Assessment Tool Accuracy Submission. USAID/IRIS Tool for Bolivia. Poverty Assessment Tools." USAID. Accessed January 2014. http://www.povertytools.org/tools.html.

———. 2012. "Poverty Assessment Tool Accuracy Submission. USAID/IRIS Tool for Malawi. Poverty Assessment Tools." USAID. Accessed January 2014. http://www.povertytools.org/tools.html.

Koenker, R. 2005. Quantile Regression. Cambridge: Cambridge University Press.

Lee, D. 2014. "Measuring Poverty Using Asset Ownership: Developing a Theory-Driven Asset Index Incorporating Utility and Prices." Unpublished job market paper, University of California, Berkeley. Accessed January 2014. http://areweb.berkeley.edu/candidate/Diana_Lee.

Liaw, A., and M. Wiener. 2002. "Classification and Regression by randomForest." R News 2: 18–22.

Lin, Y., and Y. Jeon. 2006. "Random Forests and Adaptive Nearest Neighbors." Journal of the American Statistical Association 101 (474): 578–90.

Meinshausen, N. 2006. "Quantile Regression Forests." Journal of Machine Learning Research 7: 983–99.

———. 2016. quantregForest: Quantile Regression Forests. R package version 1.3-5. http://CRAN.R-project.org/package=quantregForest.

PAT (Poverty Assessment Tool). 2014. "Quantifying the Very Poor." Poverty Assessment Tools website. Accessed February 2014. http://www.povertytools.org.

R Development Core Team. 2005. "R: A Language and Environment for Statistical Computing." R Foundation for Statistical Computing, Vienna, Austria.

SAS Institute Inc. 2009. "SAS/STAT 9.2 User's Guide, Second Edition." SAS Institute Inc., Cary, NC. Accessed May 13, 2012. https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#titlepage.htm.

Schreiner, M. 2006. "A Simple Poverty Scorecard for Bangladesh, Report to Grameen Foundation USA." Working paper. Saint Louis, MO: Microfinance Risk Management, L.L.C. Accessed February 15, 2016. www.microfinance.com/English/Papers/Scoring_Poverty_Bangladesh.pdf.

USAID MRR. "USAID Microenterprise Results Reporting Portal." Accessed December 17, 2014. eads.usaid.gov/mrr/.

Varian, H. 2014. "Big Data: New Tricks for Econometrics." Journal of Economic Perspectives 28 (2): 3–28.

WBG (World Bank Group). 2011. Targeting: Safety Nets and Transfers: Proxy Means Testing. Washington, DC: World Bank. Accessed May 2014. http://web.worldbank.org/WBSITE/EXTERNAL/TOPICS/EXTSOCIALPROTECTION/EXTSAFETYNETSANDTRANSFERS/0,,contentMDK:22188486~pagePK:210058~piPK:210062~theSitePK:282761,00.html.

Figure 1. Total and Poverty Accuracy by Country and Estimation Procedure
[Figure: two bar panels, (a) total accuracy and (b) poverty accuracy, each on a 0–100 percent scale, by country and estimation procedure: Bolivia (IRIS Q(42), CV Q(38), SE QRF(41)); East Timor (IRIS Probit, CV Q(46), SE QRF(47)); Malawi (IRIS Q(57), CV Q(55), SE QRF(57)). Not reproduced here.]
Notes: "IRIS Q(#)" indicates quantile regression (Q) estimated by IRIS at the #th quantile. "CV Q(#)" indicates quantile regression estimated by the authors using cross-validation (CV) at the #th quantile. "SE QRF(#)" indicates quantile regression forest (QRF) estimated by the authors using stochastic ensemble methods (SE) at the #th quantile. "IRIS Probit" indicates probit regression estimated by IRIS. Error bars reflect the nonparametric confidence intervals.
Source: Authors' and IRIS Center's estimates using data and procedures detailed in the text.

Figure 2. Leakage and Undercoverage Rates by Country and Estimation Procedure
[Figure: two bar panels, (a) undercoverage and (b) leakage, each on a 0–50 percent scale, for the same countries and estimation procedures as figure 1. Not reproduced here.]
Notes and source: as for figure 1.

Figure 3. Balanced Poverty Accuracy Criteria by Country and Estimation Procedure
[Figure: single bar panel, balanced poverty accuracy on a 0–100 percent scale, for the same countries and estimation procedures as figure 1. Not reproduced here.]
Notes and source: as for figure 1.

Table 1. Poverty Prediction Outcomes
(rows: tool prediction; columns: true status)
$\hat{p} = 1$: true positive (TP) if $p = 1$; false positive (FP) if $p = 0$.
$\hat{p} = 0$: false negative (FN) if $p = 1$; true negative (TN) if $p = 0$.
Source: Standard confusion matrix.

Table 2. Targeting Accuracy Metrics
Total accuracy: TA = (TP + TN) / (TP + TN + FP + FN)
Poverty accuracy: PA = TP / (TP + FN)
Leakage: LE = FP / (TP + FN)
Undercoverage: UC = FN / (TP + FN)
Balanced poverty accuracy criterion: BPAC = TP/(TP + FN) − |FN/(TP + FN) − FP/(TP + FN)|
Source: Authors' summary based on IRIS Center 2005.
Table 3. LSMS Surveys and Variables Used in PAT Development and Replicated by Authors

| Country | Data | Obs. | Poverty rate (%) | IRIS-selected variables |
| Bolivia | 2005 Encuesta de Hogares (EH) | 4,086 | 24.03 | hhsize, hhsize2, age head, age head2, regions, rural, sublease, brick wall, wood wall, dirt floor, cement floor, fridge, radio, tv, dvd, fan, car, number beds, number kitchens, number computers, sheep |
| Malawi | 2004–2005 Second Integrated Household Survey (IHS2) | 11,280 | 64.78 | hhsize, hhsize2, age head, age head2, regions, rural, never married, share of adults without education, share of adults who can read, number of rooms, cement floor, electricity, flush toilet, soap, bed, bike, music player, coffee table, iron, garden, goats |
| East Timor | 2001 Timor Leste Living Standards Survey (TLSS) | 1,800 | 44.73 | hhsize, hhsize2, age head, age head2, regions, rattan/tin wall, leaf roof, concrete or tile roof, number rooms, private water, shared water, toilet is a bowl or bucket, electricity light, private light, fan, number of adults who read, farmland, number of axes, number of baskets, number of chickens |

Source: Authors' summary based on the data indicated as well as reports from IRIS Center 2007, 2009, and 2012.

Table 4. Tukey-Kramer Tests of Equality of Bootstrap Poverty Accuracy and BPAC Means across Estimates

| Country | Comparison | Poverty accuracy difference | TK test statistic | BPAC difference | TK test statistic |
| Bolivia | CV vs IRIS | 5.79* | 37.55 | 8.61* | 28.20 |
| Bolivia | SE vs IRIS | -2.25* | -14.07 | 0.85 | 2.38 |
| Bolivia | CV vs SE | 8.04* | 54.14 | 7.76* | 29.04 |
| East Timor | CV vs IRIS | 3.69* | 23.89 | 2.78* | 11.87 |
| East Timor | SE vs IRIS | 2.43* | 15.43 | 1.29* | 5.39 |
| East Timor | CV vs SE | 1.26* | 8.40 | 1.49* | 7.68 |
| Malawi | CV vs IRIS | 2.25* | 59.06 | 2.19* | 50.03 |
| Malawi | SE vs IRIS | 2.06* | 49.11 | 1.43* | 30.85 |
| Malawi | CV vs SE | 0.19 | 4.90 | 0.76* | 17.51 |

Note: CV = cross-validation estimates; IRIS = IRIS-reported estimates; SE = stochastic ensemble estimates. * indicates the difference is significant at the 1% significance level.
Source: Authors' estimates using data and procedures detailed in the text.

Table A1. A Comparison of IRIS, Cross-Validation, and Stochastic Ensemble Accuracy Results

Bolivia (2005 EH)
| Source | Estimation | TA | PA | UC | LE | BPAC |
| IRIS | 1) QR(0.42), in sample (half) | 83.65 | 67.18 | 32.82 | 33.29 | 66.71 |
| IRIS | 2) QR(0.42)^a | 81.88 | 57.58 | 42.42 | 34.30 | 49.33 |
| IRIS | 3) Std. err. | 1.02 | 2.61 | 2.61 | 3.60 | 6.11 |
| IRIS | 4) QR(0.42)^b | [79.78, 83.68] | [52.51, 62.65] | [37.35, 47.49] | [27.60, 41.66] | [36.73, 60.48] |
| Replication | 5) QR(0.42) rep., in sample (full) | 82.45 | 60.69 | 39.30 | 33.71 | 55.10 |
| Cross-validation | 6) QR(0.38)^a | 81.76 | 63.37 | 36.63 | 41.61 | 57.94 |
| Cross-validation | 7) Std. err. | 0.86 | 2.25 | 2.25 | 3.39 | 3.04 |
| Cross-validation | 8) QR(0.38)^b | [80.10, 83.32] | [58.84, 67.88] | [32.12, 41.15] | [35.15, 48.24] | [50.61, 63.44] |
| Stochastic ensemble | 9) QRF(0.41)^a | 80.17 | 55.33 | 44.67 | 40.44 | 50.18 |
| Stochastic ensemble | 10) Std. err. | 0.95 | 2.44 | 2.44 | 3.78 | 5.14 |
| Stochastic ensemble | 11) QRF(0.41)^b | [78.25, 82.03] | [50.51, 60.12] | [39.88, 49.49] | [33.59, 48.11] | [39.26, 58.57] |

East Timor (2001 TLSS)
| Source | Estimation | TA | PA | UC | LE | BPAC |
| IRIS | 1) Probit, in sample (full) | 77.14 | 75.08 | 24.92 | 26.20 | 73.79 |
| IRIS | 2) Probit^a,c | 75.56 | 69.32 | 30.68 | 28.71 | 65.56 |
| IRIS | 3) Std. err. | 1.52 | 2.56 | 2.56 | 3.38 | 4.33 |
| IRIS | 4) Probit^b,c | [72.63, 78.50] | [64.38, 74.33] | [25.67, 35.62] | [22.40, 35.58] | [55.57, 72.08] |
| Replication | 5) Probit rep., in sample (full) | 77.16 | 71.41 | 28.59 | 27.63 | 70.45 |
| Cross-validation | 6) QR(0.46)^a | 76.19 | 73.01 | 26.99 | 30.97 | 68.34 |
| Cross-validation | 7) Std. err. | 1.42 | 2.32 | 2.32 | 3.33 | 2.96 |
| Cross-validation | 8) QR(0.46)^b | [73.43, 77.51] | [73.43, 77.51] | [22.49, 31.46] | [24.35, 37.99] | [61.84, 73.38] |
| Stochastic ensemble | 9) QRF(0.47)^a | 75.05 | 71.75 | 28.25 | 32.51 | 66.85 |
| Stochastic ensemble | 10) Std. err. | 1.50 | 2.42 | 2.42 | 3.55 | 3.17 |
| Stochastic ensemble | 11) QRF(0.47)^b | [72.19, 78.03] | [67.12, 76.72] | [23.28, 32.88] | [25.94, 39.90] | [59.77, 72.22] |

Malawi (2004/5 IHS2)
| Source | Estimation | TA | PA | UC | LE | BPAC |
| IRIS | 1) QR(0.57), in sample (half) | 80.15 | 84.12 | 15.88 | 16.43 | 83.57 |
| IRIS | 2) QR(0.57)^a | 79.69 | 83.47 | 16.53 | 17.06 | 82.56 |
| IRIS | 3) Std. err. | 0.55 | 0.65 | 0.65 | 0.76 | 0.74 |
| IRIS | 4) QR(0.57)^b | [78.60, 80.84] | [82.20, 84.77] | [15.23, 17.79] | [15.53, 18.56] | [80.95, 83.82] |
| Replication | 5) QR(0.57) rep., in sample (full) | 80.82 | 84.88 | 15.11 | 14.39 | 84.17 |
| Cross-validation | 6) QR(0.55)^a | 80.79 | 85.72 | 14.28 | 15.07 | 84.75 |
| Cross-validation | 7) Std. err. | 0.52 | 0.55 | 0.55 | 0.69 | 0.64 |
| Cross-validation | 8) QR(0.55)^b | [79.79, 81.84] | [84.68, 86.86] | [13.14, 15.32] | [13.73, 16.38] | [83.42, 85.86] |
| Stochastic ensemble | 9) QRF(0.57)^a | 80.10 | 85.53 | 14.47 | 15.93 | 83.99 |
| Stochastic ensemble | 10) Std. err. | 0.58 | 0.67 | 0.67 | 0.75 | 0.73 |
| Stochastic ensemble | 11) QRF(0.57)^b | [78.93, 81.19] | [84.22, 86.80] | [13.20, 15.78] | [14.51, 17.47] | [82.46, 85.25] |

Note: QR(#) = quantile regression estimated at the #th quantile; QRF(#) = quantile regression forest estimated at the #th quantile.
a Bootstrapped 1,000 times, with replacement; mean reported.
b Bootstrapped 1,000 times, with replacement; 95% bootstrap confidence interval reported, where the lower bound is the 2.5th percentile and the upper bound is the 97.5th percentile.
c Because these bootstrapped estimates were not available in materials made public by IRIS, the estimates reported here were calculated by the authors based on the replication sample and model.
Source: Authors' and IRIS Center's estimates using data and procedures detailed in the text.

Table A2. A Comparison of IRIS, Cross-Validation, and Stochastic Ensemble Accuracy Results under Halved and Doubled Poverty Lines

Bolivia (2005 EH), half poverty line (poverty rate 4.92%)
| Source | Estimation | TA | PA | UC | LE | BPAC |
| IRIS | 2) QR(0.22)^a | 94.55 | 41.65 | 58.35 | 65.89 | 30.07 |
| IRIS | 3) Std. err. | 0.54 | 5.45 | 5.45 | 11.72 | 9.53 |
| IRIS | 4) QR(0.22)^b | [93.44, 95.56] | [30.72, 52.63] | [47.37, 69.28] | [45.54, 92.19] | [6.73, 44.38] |
| Cross-validation | 6) QR(0.24)^a | 94.53 | 41.20 | 58.80 | 66.31 | 29.65 |
| Cross-validation | 7) Std. err. | 0.56 | 5.44 | 5.44 | 11.85 | 9.50 |
| Cross-validation | 8) QR(0.24)^b | [93.39, 95.61] | [31.14, 52.47] | [47.53, 68.86] | [44.93, 91.90] | [7.90, 44.58] |
| Stochastic ensemble | 9) QRF(0.26)^a | 94.39 | 43.65 | 56.35 | 71.03 | 26.94 |
| Stochastic ensemble | 10) Std. err. | 0.56 | 5.71 | 5.71 | 13.25 | 11.70 |
| Stochastic ensemble | 11) QRF(0.26)^b | [93.24, 95.43] | [32.00, 55.00] | [45.00, 68.00] | [47.66, 100.00] | [0.00, 45.27] |

Bolivia (2005 EH), double poverty line (poverty rate 62.26%)
| Source | Estimation | TA | PA | UC | LE | BPAC |
| IRIS | 2) QR(0.54)^a | 78.90 | 82.64 | 17.36 | 16.65 | 81.10 |
| IRIS | 3) Std. err. | 0.94 | 1.12 | 1.12 | 1.35 | 1.86 |
| IRIS | 4) QR(0.54)^b | [77.11, 84.79] | [80.40, 84.79] | [15.21, 19.60] | [14.14, 19.17] | [76.61, 83.88] |
| Cross-validation | 6) QR(0.52)^a | 79.01 | 83.60 | 16.40 | 17.45 | 81.92 |
| Cross-validation | 7) Std. err. | 0.94 | 1.11 | 1.11 | 1.39 | 1.31 |
| Cross-validation | 8) QR(0.52)^b | [77.17, 80.83] | [81.38, 85.67] | [14.33, 18.62] | [14.84, 20.09] | [79.05, 84.14] |
| Stochastic ensemble | 9) QRF(0.54)^a | 77.89 | 83.66 | 16.34 | 19.32 | 80.62 |
| Stochastic ensemble | 10) Std. err. | 1.02 | 1.14 | 1.14 | 1.38 | 1.33 |
| Stochastic ensemble | 11) QRF(0.54)^b | [75.99, 79.79] | [81.41, 85.77] | [14.22, 18.59] | [16.68, 21.99] | [78.01, 83.10] |

East Timor (2001 TLSS), half poverty line (poverty rate 10.65%)
| Source | Estimation | TA | PA | UC | LE | BPAC |
| IRIS | 2) Probit^a | 90.82 | 28.51 | 71.50 | 23.37 | -19.62 |
| IRIS | 3) Std. err. | 1.03 | 5.30 | 5.30 | 6.03 | 12.06 |
| IRIS | 4) Probit^b | [88.69, 92.75] | [18.79, 39.13] | [60.87, 81.21] | [13.37, 36.72] | [-42.95, 3.82] |
| IRIS | 2) QR(0.27)^a | 89.02 | 49.26 | 50.74 | 62.58 | 35.45 |
| IRIS | 3) Std. err. | 1.11 | 5.91 | 5.91 | 10.95 | 9.64 |
| IRIS | 4) QR(0.27)^b | [86.81, 91.25] | [38.03, 61.41] | [38.59, 61.97] | [43.63, 85.61] | [14.04, 51.27] |
| Cross-validation | 6) QR(0.28)^a | 88.76 | 46.09 | 53.91 | 61.67 | 35.04 |
| Cross-validation | 7) Std. err. | 1.05 | 5.44 | 5.43 | 10.50 | 8.71 |
| Cross-validation | 8) QR(0.28)^b | [86.71, 90.81] | [35.29, 56.88] | [43.12, 64.71] | [42.44, 83.23] | [16.26, 48.94] |
| Stochastic ensemble | 9) QRF(0.28)^a | 89.34 | 39.20 | 60.80 | 48.97 | 23.91 |
| Stochastic ensemble | 10) Std. err. | 1.20 | 5.80 | 5.80 | 11.73 | 13.89 |
| Stochastic ensemble | 11) QRF(0.28)^b | [86.99, 91.70] | [27.77, 50.55] | [49.45, 72.23] | [29.46, 74.37] | [-5.68, 45.75] |

East Timor (2001 TLSS), double poverty line (poverty rate 80.20%)
| Source | Estimation | TA | PA | UC | LE | BPAC |
| IRIS | 2) Probit^a | 84.15 | 93.04 | 6.96 | 13.72 | 86.28 |
| IRIS | 3) Std. err. | 1.08 | 0.67 | 0.67 | 1.37 | 1.37 |
| IRIS | 4) Probit^b | [82.75, 85.75] | [92.20, 94.07] | [5.93, 7.80] | [12.12, 15.51] | [84.49, 87.88] |
| IRIS | 2) QR(0.60)^a | 83.34 | 89.27 | 10.73 | 11.16 | 87.72 |
| IRIS | 3) Std. err. | 1.33 | 1.27 | 1.27 | 1.43 | 1.61 |
| IRIS | 4) QR(0.60)^b | [80.75, 85.75] | [86.70, 91.68] | [8.32, 13.30] | [8.33, 14.04] | [83.82, 90.33] |
| Cross-validation | 6) QR(0.57)^a | 83.86 | 91.18 | 8.82 | 12.40 | 87.58 |
| Cross-validation | 7) Std. err. | 1.21 | 1.06 | 1.06 | 1.44 | 1.41 |
| Cross-validation | 8) QR(0.57)^b | [81.61, 86.10] | [89.16, 93.30] | [6.70, 10.84] | [9.65, 15.33] | [84.67, 90.17] |
| Stochastic ensemble | 9) QRF(0.58)^a | 82.63 | 89.96 | 11.04 | 11.78 | 87.28 |
| Stochastic ensemble | 10) Std. err. | 1.28 | 1.26 | 1.26 | 1.44 | 1.46 |
| Stochastic ensemble | 11) QRF(0.58)^b | [80.04, 85.09] | [86.52, 91.29] | [8.71, 13.48] | [8.99, 14.59] | [84.00, 89.58] |

Malawi (2004/5 IHS2), half poverty line (poverty rate 23.43%)
| Source | Estimation | TA | PA | UC | LE | BPAC |
| IRIS | 2) QR(0.41)^a | 79.67 | 58.19 | 41.81 | 45.48 | 54.25 |
| IRIS | 3) Std. err. | 0.56 | 1.15 | 1.48 | 2.42 | 2.17 |
| IRIS | 4) QR(0.41)^b | [78.58, 80.64] | [55.38, 61.15] | [38.85, 44.62] | [40.77, 50.31] | [49.63, 58.19] |
| Cross-validation | 6) QR(0.40)^a | 79.59 | 56.97 | 40.02 | 47.48 | 52.52 |
| Cross-validation | 7) Std. err. | 0.56 | 1.36 | 1.36 | 2.40 | 2.40 |
| Cross-validation | 8) QR(0.40)^b | [78.54, 80.67] | [57.21, 62.88] | [37.11, 42.79] | [43.06, 52.16] | [47.84, 56.91] |
| Stochastic ensemble | 9) QRF(0.42)^a | 79.24 | 56.04 | 43.96 | 45.15 | 53.43 |
| Stochastic ensemble | 10) Std. err. | 0.58 | 1.56 | 1.56 | 2.41 | 2.13 |
| Stochastic ensemble | 11) QRF(0.42)^b | [78.09, 80.32] | [53.04, 59.10] | [40.91, 46.96] | [40.63, 49.76] | [48.74, 57.10] |

Malawi (2004/5 IHS2), double poverty line (poverty rate 90.65%)
| Source | Estimation | TA | PA | UC | LE | BPAC |
| IRIS | 2) QR(0.66)^a | 92.12 | 95.95 | 4.05 | 4.65 | 95.32 |
| IRIS | 3) Std. err. | 0.38 | 0.29 | 0.29 | 0.33 | 0.31 |
| IRIS | 4) QR(0.66)^b | [91.37, 92.86] | [95.36, 96.51] | [3.50, 4.64] | [4.05, 5.32] | [94.67, 95.88] |
| Cross-validation | 6) QR(0.64)^a | 92.34 | 96.26 | 3.74 | 4.72 | 95.28 |
| Cross-validation | 7) Std. err. | 0.35 | 0.27 | 0.27 | 0.31 | 0.30 |
| Cross-validation | 8) QR(0.64)^b | [91.63, 93.01] | [95.75, 96.76] | [3.24, 4.24] | [4.14, 5.36] | [94.64, 95.84] |
| Stochastic ensemble | 9) QRF(0.66)^a | 92.11 | 95.76 | 4.23 | 4.48 | 95.36 |
| Stochastic ensemble | 10) Std. err. | 0.37 | 0.30 | 0.30 | 0.31 | 0.33 |
| Stochastic ensemble | 11) QRF(0.66)^b | [91.33, 92.81] | [95.17, 96.33] | [3.82, 5.00] | [3.89, 5.11] | [94.62, 95.91] |

Note: QR(#) = quantile regression estimated at the #th quantile; QRF(#) = quantile regression forest estimated at the #th quantile.
a Bootstrapped 1,000 times, with replacement; mean reported.
b Bootstrapped 1,000 times, with replacement; 95% bootstrap confidence interval reported, where the lower bound is the 2.5th percentile and the upper bound is the 97.5th percentile.
Source: Authors' and IRIS Center's estimates using data and procedures detailed in the text.