WPS7841

Policy Research Working Paper 7841

Is Predicted Data a Viable Alternative to Real Data?

Tomoki Fujii
Roy van der Weide

Development Research Group
Poverty and Inequality Team
September 2016

Abstract

It is costly to collect the household- and individual-level data that underlies official estimates of poverty and health. For this reason, developing countries often do not have the budget to update their estimates of poverty and health regularly, even though these estimates are most needed there. One way to reduce the financial burden is to substitute some of the real data with predicted data. An approach referred to as double sampling collects the expensive outcome variable for a sub-sample only while collecting the covariates used for prediction for the full sample. The objective of this study is to determine if this would indeed allow for realizing meaningful reductions in financial costs while preserving statistical precision. The study does this using analytical calculations that allow for considering a wide range of parameter values that are plausible to real applications. The benefits of using double sampling are found to be modest. There are circumstances for which the gains can be more substantial, but the study conjectures that these denote the exceptions rather than the rule. The recommendation is to rely on real data whenever there is a need for new data, and use the prediction estimator to leverage existing data.

This paper is a product of the Poverty and Inequality Team, Development Research Group. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://econ.worldbank.org. The authors may be contacted at rvanderweide@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Is predicted data a viable alternative to real data?∗

Tomoki Fujii† Roy van der Weide‡

JEL classification codes: C20, C53, I32.
Keywords: Prediction; Double sampling; Survey costs; Poverty.

∗ This study has been funded by the World Bank’s Knowledge for Change Program for which the authors are grateful.
† Singapore Management University. Email: tfujii@smu.edu.sg
‡ World Bank. Email: rvanderweide@worldbank.org

1 Introduction

In economics, health sciences, and other disciplines, data on the outcome variable of interest is often costly to collect. The measurement of poverty for instance relies on household consumption expenditure data. Collecting this data involves long questionnaires administered over an extended period of time which substantially adds to the data collection costs. Individual-level data collection that involves physical examinations also tends to be costly. This has implications for the sample size and the frequency with which the data is collected.
In many developing countries for example, particularly in Sub-Saharan Africa, estimates of poverty, malnutrition and health are obtained highly irregularly and as a result are often outdated.[1]

[1] For example, across the 26 low-income countries in Sub-Saharan Africa over the period between 1993 and 2012, the national poverty rate and prevalence of stunting for children under five are on average reported only once every five years and once every ten years, respectively, in the World Development Indicators.

The demand for bigger and more frequent data has motivated researchers to explore ways to substitute predicted data for real data. Consider the application to poverty measurement by Douidich et al. (2015) where household consumption poverty is predicted into annual labor force surveys in order to increase the frequency of poverty estimates from once every 7 years to every year.[2] Similar applications can be found in Stifel and Christiaensen (2007) and Christiaensen et al. (2012). Prediction methods have also been used to dramatically expand the sample size of poverty and health data. A prominent example of this is the small area estimation of welfare where the outcome variable of interest (i.e. consumption poverty) is predicted into a population census which covers (almost) all members of the population unlike typical household surveys. This allows researchers to obtain estimates of poverty at a highly disaggregated level such as districts, communities, and towns. These small-area estimates are often plotted in the form of a map known as a poverty map. The small-area estimation, which provides the methodological foundations for poverty mapping, was pioneered by Elbers et al. (2003) and extended by e.g. Tarozzi and Deaton (2009), Tarozzi (2011) and Elbers and van der Weide (2014). Fujii (2010) has modified the approach to obtain small-area estimates of the prevalence of stunting and underweight of children using predicted data.

[2] This approach could of course be expanded to other outcome variables of interest (such as health outcomes), while surveys other than labor force surveys could be considered to further extend the frequency of estimates. The same approach could also be adopted to construct a measure of consumption poverty that alleviates concerns of comparability over time when the original consumption data is deemed incomparable due to changes in the questionnaire. See for example the debate on the comparability of poverty estimates in India documented in e.g. Deaton (2003), Deaton and Drèze (2002), Kijima and Lanjouw (2005), Tarozzi (2007), and Deaton (2005).

When new data is needed, it may be tempting to purposefully only collect the covariates x, which are used to predict the outcome variable of interest y, and to scale down the collection of y itself, particularly if this yields a significant financial cost reduction. It is not uncommon that predictors of household welfare such as demographic and dwelling characteristics, education, employment, and asset ownership, can indeed be collected relatively inexpensively. In health sciences, simple oral questions and anthropometric data taken by a simple non-invasive device too may serve as a predictor of the outcome that is expensive to measure. Collecting y for a sub-sample only, and the covariates x for all, is referred to as “double-sampling”,[3] see e.g. Hidiroglou (2001).[4] The advantage of this approach is that the prediction model can be estimated with data for the relevant population and time period (the same population for which the covariate data is available).
The alternative where the model is estimated on data from an entirely different dataset that may describe a different population at a different point in time is referred to as “non-nested double sampling”, in which case one will have to make assumptions about how the model has evolved between the two different datasets (or assume that the model is invariant), see e.g. Kim and Rao (2012) as well as Douidich et al. (2015) for an empirical application to poverty measurement. We will refer to any estimator that works with predicted data (i.e. imputations of y) as a prediction estimator.

[3] It is also known as “two-phase sampling”.

[4] The literature on double sampling dates back to Neyman (1938) and Bose (1943). Many of the existing studies, including Hidiroglou (2001), Kim et al. (2006), Palmgren (1987), Rao and Sitter (1995), and Sitter (1997), provide analytical or simulation results on the properties of the estimators based on double sampling. While there are some explicit empirical applications of double sampling such as Tamhane (1978), Hansen and Tepping (1990), and Armstrong et al. (1993), the use of double sampling appears to be comparatively limited. Even fewer studies take into account the costs of data collection (Cochran (1977), Davidov and Haitovsky (2000), Särndal et al. (2003), and Fujii and van der Weide (2013)), despite the fact that one major attraction of double sampling is the reduction in data collection cost.

There is a considerable practical interest in adopting a double-sampling approach and relying on predicted data in the hope of reducing financial costs while preserving a reasonable degree of statistical precision. Consider for example the recent initiative by the World Bank under the name of SWIFT (Survey of Well-being via Instant and Frequent Tracking) which “does not collect direct income or consumption data which can be both time-consuming and vulnerable to error without the right know-how and resources; instead, it collects poverty correlates, such as household size, ownership of assets or education levels, and then converts them to poverty statistics using estimation models.” (Yoshida et al. (2015)). It employs Computer Assisted Personal Interviewing technology to collect the data and carefully builds and estimates the model that is used for prediction. Both nested and non-nested double sampling approaches are considered.[5] While SWIFT is still relatively young, at the time of writing, it has already been applied 34 times in 27 countries.

Assessments of the cost-precision trade-offs however are still limited. Ahmed et al. (2014) is an exception: it provides a SWIFT-like application of double sampling to poverty estimation in Bangladesh, where a variety of different data collection scenarios are considered. Their simulation results indicate that substantial cost savings could be achieved with a moderate loss in precision. It is unclear however how much of this may be attributed to the use of double sampling as, in addition to relying on predicted data, they also reduce the number of primary sampling units.[6]

Pape and Mistiaen (2015) apply a double sampling approach to poverty measurement in Mogadishu. They extend the set of predictors by including a sub-set of consumption items. Ligon and Sohnesen (2016) explore a similar strategy and include an empirical illustration using data from Uganda, Tanzania and Rwanda. Both these studies build on the idea that was put forward by Lanjouw and Lanjouw (2001).
This approach strengthens the correlation with total household consumption but also adds to the costs, although Pape and Mistiaen (2015) in their application to Mogadishu are able to keep the face-to-face interview time below 60 minutes.[7]

[5] In its version of the non-nested double sampling approach, infrequent conventional surveys, which collect both y and x for all households, are alternated with more frequent low cost surveys that only collect x. The infrequent full surveys are used to estimate (and update) the prediction model which is then used to predict poverty into the subsequent low cost surveys that are conducted for the years in between the full surveys.

[6] Furthermore, they do not provide a full break-down of the costs involved; only selected cost components are reported, which excludes transport costs for example.

[7] Security concerns are an important motivation for reducing the time of data collection by means of face-to-face interviews.

Other popular examples of prediction-based poverty estimation include the “Simple Poverty Scorecard” (SPS) project and the “Poverty Assessment Tool” (PAT) developed by the IRIS Center for USAID, see Schreiner (2014a) for a comparison.[8] These approaches predict a household’s poverty status on the basis of a small number of questions, mostly relying on non-nested double sampling, and then aggregate these predictions to obtain estimates of poverty at the national level (and possibly other administrative levels). SPS and PAT have been applied to around 60 and 40 countries, respectively. A recent evaluation can be found in Diamond et al. (2015). To the best of our knowledge, no cost-precision assessments are available for any of these initiatives.

[8] The poverty scorecard takes a more pragmatic approach to building its prediction model as it emphasizes simplicity and ease of implementation. See Schreiner (2014b) for further details on the Simple Poverty Scorecard.
The objective of this study is to determine if and when double sampling may reasonably be expected to yield meaningful reductions in the cost of data collection while preserving statistical precision. We do this using analytical calculations that rely on an approximation to the financial cost function and on the asymptotic variance as the measure of statistical precision. This allows us to consider an inclusive set of parameter values. We subsequently make an attempt to calibrate the parameters involved to a variety of real data. Specifically, we solve a cost minimization problem subject to a statistical precision constraint and its dual problem of variance minimization subject to a budget constraint. This helps us identify the conditions under which the gains from double sampling are relatively large (and small).

To the best of our knowledge, this is the first study to attempt an analytical assessment of how much precision prediction estimators trade for financial cost savings. We treat the sample direct estimator and the prediction estimator under single- and double-sampling in a unified framework and derive the analytic results for cost or variance reduction. The assumed model allows for clustering, which plays an important role in empirical applications but which is often ignored in theoretical work.

We find that the financial gains from double sampling over optimal single sampling tend to be modest for many of the parameter values considered, which are calibrated to real data. The potential discounts are mostly below 20 percent and can be as low as zero percent. There are circumstances in which the gains can be more substantial, but we conjecture that these are the exceptions rather than the rule.
Double sampling is most advantageous when: (a) the marginal cost of collecting y is particularly large, (b) x is highly correlated with y, (c) travel costs between clusters are modest, (d) the sampling error is large relative to the model error, and (e) the spatial correlation in the data is modest. Unfortunately, these conditions are rarely jointly satisfied. When the covariates x are particularly low-cost (as is the case with SPS and PAT), then the correlation between x and y tends to be weak. When the covariates offer exceptionally good predictors (as may be the case in Pape and Mistiaen (2015) and Ligon and Sohnesen (2016)), then the marginal cost of collecting y tends to be low. SWIFT arguably lies somewhere in between.

We have assumed away any error that may stem from model misspecification, i.e. all results hold under the standard assumption that the prediction model is correctly specified. This means that the real gains realized by the prediction estimators could be more modest yet. Model misspecification would introduce an entirely new source of error which is not accounted for in estimates of statistical precision.[9] Consequently, applied users should always bear in mind that prediction estimators are arguably less precise than is suggested by conventional standard errors.

The financial savings are larger for non-nested double sampling estimators. However, these are also based on stronger assumptions. The added assumption is that the model parameters did not change between the dataset used for estimation and the dataset used for prediction, which can be multiple years apart.[10] This adds to the risk of model misspecification error.

Our recommendation is to rely on real data on the outcome variable of interest y whenever there is a need for new data. There is an argument for scaling back the collection of y, and collecting covariates of y instead, if there is insufficient budget to accommodate an adequate sample of observations with real data on y (and x).
Note that this does not in any way rule out the use of prediction estimators, such as the approaches employed in Douidich et al. (2015) and Elbers et al. (2003). This variety of non-nested double sampling estimators provides researchers with a means of leveraging existing data, which adds to the value of these data.

The remainder of this paper is organized as follows. Section 2 presents the prediction estimators under the single- and double-sampling setup. Then, they are applied to cost-effective sampling in Section 3. We study the conditions under which double sampling is most useful and explore the magnitude of potential gains from double sampling under practical conditions in Section 4. Finally, Section 5 provides some discussion.

[9] As any given model can be misspecified in infinitely different ways, there are currently no methods available that would account for model misspecification error.

[10] One does not necessarily need to assume that the model parameters are time-invariant. Alternatively, one could make assumptions about how the model has evolved exactly between the two datasets, but this assumption is just as strong.

2 Prediction estimator

2.1 Preliminaries

Consider the following data generating process:

$$y_{ch} = x_{ch}^T \beta + u_{ch} = x_{ch}^T \beta + \eta_c + e_{ch},$$

where $c$ and $h$ are the indexes of clusters and households, respectively. The continuous state variable $y_{ch}$, which may or may not be observable, is related to the observable outcome variable of interest $Y_{ch} = m(y_{ch})$ by some function $m$. The $L$-vectors of (observable) covariates and coefficients are denoted by $x_{ch}$ and $\beta$, respectively. The idiosyncratic error term $u_{ch}$ $(= \eta_c + e_{ch})$, which consists of the cluster-specific error $\eta_c$ and the household-specific error $e_{ch}$, is unobservable. This error structure allows for some degree of spatial correlation in the errors. We denote the variances of error terms by $\sigma_e^2 \equiv \operatorname{var}[e_{ch}]$, $\sigma_\eta^2 \equiv \operatorname{var}[\eta_c]$, and $\sigma_u^2 \equiv \operatorname{var}[u_{ch}]$ $(= \sigma_e^2 + \sigma_\eta^2)$.
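To make the spatial correlation concrete, the short derivation below spells out the covariance that this error structure implies. It is a sketch that assumes $\eta_c$ and $e_{ch}$ are independent of each other and that $e_{ch}$ is independent across households, which the assumptions stated next impose. The symbol $\rho$ for the intra-cluster correlation is introduced here for illustration only and is not part of the paper's notation.

```latex
% For two distinct households h \neq h' in the same cluster c:
\operatorname{cov}(u_{ch}, u_{ch'})
  = \operatorname{cov}(\eta_c + e_{ch},\; \eta_c + e_{ch'})
  = \operatorname{var}[\eta_c]
  = \sigma_\eta^2 ,
\qquad
\rho \equiv \frac{\operatorname{cov}(u_{ch}, u_{ch'})}{\sigma_u^2}
  = \frac{\sigma_\eta^2}{\sigma_\eta^2 + \sigma_e^2} .
```

Under the same assumptions, errors of households in different clusters are uncorrelated, so all spatial correlation in this model operates within clusters.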
Each cluster is assumed to consist of $K$ sampled households. We make the following assumptions about $x_{ch}$, $\eta_c$, and $e_{ch}$:

Assumption 1 The triple $(\{x_{ch}\}_{h=1}^K, \{e_{ch}\}_{h=1}^K, \eta_c)$ is iid across $c$.

Assumption 2 $x_{ch}$, $e_{ch}$, and $\eta_c$ are independent of each other for all $c$. Furthermore, $e_{ch}$ is independent across $h$ for all $c$.

It is convenient to denote the stacked error terms for households in cluster $c$ by $U_c \equiv (u_{c1}, u_{c2}, \ldots, u_{cK})^T$ and all households by $U \equiv (U_1^T, U_2^T, \ldots, U_J^T)^T$, where $J$ is the number of clusters in the sample. Similarly, we denote all $x$'s stacked together in cluster $c$ by $X_c$, and all $X_c$ stacked together by $X$. We define $\Omega \equiv E[U U^T]$ and $\Omega_c \equiv E[U_c U_c^T]$.

Let us define the expected value of $Y_{ch}$ conditional on $x_{ch}$ by $g(x_{ch}, \theta) \equiv E_u[Y_{ch} \mid x_{ch}]$, where $\theta$ is a $\kappa$-vector of identifiable model parameters and $g$ is assumed differentiable with respect to $\theta$. We also define $\varepsilon_{ch} \equiv Y_{ch} - g(x_{ch}, \theta)$.

The parameter of interest in this study is $\mu \equiv E[Y_{ch}] = E_x[g(x_{ch}, \theta)]$ and not $\theta$. The standard estimator for $\mu$ is the sample mean $\bar{Y} = n^{-1} \sum_c \sum_h Y_{ch}$, where $n = JK$ denotes the sample size. This estimator will be referred to as the sample direct estimator.

2.2 Binary outcome variable

For ease of exposition, we specialize to the case where the outcome variable is binary. This is an important special case, because binary outcomes are routinely encountered in empirical applications. Examples include the poverty rate (proportion of the individuals under the poverty line), prevalence of undernourished children, and the share of underemployed people among the employed individuals. We will subsequently use the poverty rate for the purpose of illustration, but it is straightforward to apply our theory to other contexts. In the context of binary outcome, we maintain the following assumption:

Assumption 3 The variables $Y_{ch}$ and $y_{ch}$ are related by $Y_{ch} = \operatorname{Ind}(y_{ch} < z)$, where $z$ is a constant. Furthermore, $\eta_c$ and $e_{ch}$ are normally distributed.
The normality of $\eta_c$ and $e_{ch}$ is not essential but it helps us to simplify our presentation as $u_{ch}$ is also a normal random variable in this case. When $\mu$ represents the poverty rate, the constant $z$ corresponds to the poverty line. We denote the probability density function and cumulative distribution function for a standard normal random variable by $\phi$ and $\Phi$, respectively. In this case it follows that:

$$g(x_{ch}, \theta) = \Phi\left( (z - x_{ch}^T \beta)/\sigma_u \right).$$

2.3 Single-sample prediction estimator

The prediction estimator for $\mu$ relies on the assumption that the functional form of $g$ is known.[11] This is true whether we use the single-sample or the double-sample estimator. Consider first the single-sample prediction estimator:

$$\hat{\mu}(\hat{\theta}) \equiv n^{-1} \sum_c \sum_h g(x_{ch}, \hat{\theta}), \tag{1}$$

where $\hat{\theta}$ denotes the estimator for $\theta$. All the model uncertainty is captured by the estimator $\hat{\theta}$ of the unknown model parameter $\theta$. Rather than taking the sample average over the actual realizations of $Y_{ch}$, it estimates the average of the conditional mean $g$ given $x_{ch}$. For the binary outcome variable the prediction estimator takes the form:[12]

$$\hat{\mu} = \frac{1}{J} \sum_c \frac{1}{K} \sum_h \Phi(\hat{B}_{ch}) = \frac{1}{J} \sum_c \frac{1}{K} \sum_h \Phi\left( \frac{z - x_{ch}^T \hat{\beta}}{\hat{\sigma}_u} \right), \tag{2}$$

where $B_{ch} \equiv (z - x_{ch}^T \beta)/\sigma_u$.

[11] We avoid using the term “regression estimator” in this study because it typically refers to the prediction estimator under the assumption of linearity (and often a single covariate).

[12] Notice that the predicted values do not incorporate estimates of the cluster-specific effects $\eta_c$ conditional on the available data in eq. (2). In other words, it is not an Empirical Bayes estimator (see Elbers and van der Weide (2014) for an application of the Empirical Bayes estimator in a similar context). While the prediction estimator in eq. (2) is generally inefficient when $y_{ch}$ is observed, we chose to use this form for two reasons. First, the exposition is much simpler when eq. (2) is used. Second, the subsequent discussion is applicable without much modification even when $y_{ch}$ and $z$ are unobservable.

The sample direct and prediction estimators are subject to different sources of error. The former is purely a function of the sampling error. The latter trades some of the sampling error for the model error. More precisely, it averages out the error terms $\eta_c$ and $e_{ch}$ and reduces the sampling error component. However, it introduces the model error instead because the prediction estimator uses an estimate of $\theta$ rather than the true parameter value. Put differently, the contributions of $\eta$ and $e$ to the error of the estimate of $\mu$ are “re-packaged” from sampling error to model error in the prediction estimator.

To study the properties of prediction estimators, we need to make some assumptions about $\hat{\theta}$. Let us begin conservatively by merely assuming that $\hat{\theta}$ is a consistent and asymptotically normal estimator for $\theta$. Note that this accommodates practically all commonly used estimators.

Assumption 4 The estimator $\hat{\theta}$ of the model parameters $\theta$ satisfies the following properties:

$$\hat{\theta} \xrightarrow{p} \theta \quad \text{and} \quad \sqrt{J}(\hat{\theta} - \theta) \xrightarrow{d} N(0, V_\theta/K) \quad \text{as } J \to \infty,$$

where $V_\theta$ is a symmetric positive-definite $\kappa \times \kappa$ asymptotic covariance matrix of $\hat{\theta}$.

Remark 5 Because $K$ is fixed, there is a bijective correspondence between $n$ and $J$. Therefore, it is also possible to write Assumption 4 as follows:

$$\hat{\theta} \xrightarrow{p} \theta \quad \text{and} \quad \sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} N(0, V_\theta) \quad \text{as } n \to \infty.$$

Hereafter, we assume that suitable regularity conditions always hold. In particular, we assume the almost-sure existence and non-singularity of relevant moments, which typically pose no problem in empirical applications. With these assumptions, we have the following theorem (all proofs are in the Appendix):

Theorem 6 Let $M_g \equiv E[\partial g(x_{ch}, \theta)/\partial \theta]$ $(\neq 0_\kappa)$, where $0_\kappa$ is a column $\kappa$-vector of zeros. Then, under Assumptions 1, 2, and 4, we have:

$$\hat{\mu}(\hat{\theta}) \xrightarrow{p} \mu \quad \text{and} \quad \sqrt{J}(\hat{\mu}(\hat{\theta}) - \mu) \xrightarrow{d} N\left(0, V_g + M_g^T V_\theta M_g / K\right) \quad \text{as } J \to \infty, \tag{3}$$

where $V_g \equiv \operatorname{var}_{x_{ch}}\left[K^{-1} \sum_{h=1}^K g(x_{ch}, \theta)\right]$.
The factor $K^{-1}$ in the definition of $V_g$ is included to normalize the variance at the cluster level. Notice that $V_g = K^{-1} \operatorname{var}[g(x_{ch}, \theta)]$ holds when $x_{ch}$ is independent across all $c$ and $h$. It should also be noted that the asymptotic variance of $\hat{\mu}(\hat{\theta})$ can be consistently estimated by replacing $V_g$, $M_g$ and $V_\theta$ with their consistent estimators. For ease of presentation, we hereafter simply write $\hat{\mu}$, dropping the argument $\hat{\theta}$.

As discussed earlier, the prediction estimator $\hat{\mu}$ may be more precise than the sample direct estimator $\bar{Y}$. The improvement in precision achieved by the prediction estimator is not necessarily the main attraction of the prediction estimator; its advantage is that it allows for a double sampling strategy, which potentially leads to the reduction in financial costs for collecting survey data without compromising the statistical precision. We pursue this idea in Section 3.

2.4 Double-sample prediction estimator

The prediction estimator given in eq. (2) uses a single sample. However, it is clear that we only need the observations of $x_{ch}$ once the estimates of $\beta$, $\sigma_e$, and $\sigma_\eta$ are obtained. This, in turn, means that it is not necessary to observe $Y$ (or $y$) for all sample households; a sub-sample will do. We may still use the full sample to evaluate the mean prediction estimator if $x$ is collected for all. That is, even when data on $Y$ (or $y$) is collected only for a sub-sample, provided that $x$ is collected for the full sample of households, predicted values of $Y$ may still be evaluated for the full sample. This approach is referred to as “double sampling”.

If observing $Y$ is much more expensive than observing $x$, then double sampling may be preferred to the standard single sampling approach, in which both $x$ and $Y$ are observed for all households in the sample. Double sampling has the potential to realize a reduction in the financial costs associated with collecting the necessary survey data while maintaining the desired statistical precision.
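The following minimal simulation sketches how the sample direct estimator, the single-sample prediction estimator of eq. (2), and a double-sample variant can be computed for a binary outcome. It rests on several assumptions that are not taken from the paper: a single covariate plus an intercept, ordinary least squares in place of maximum likelihood, a sub-sample consisting of the first k households in every cluster, and arbitrary parameter values. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF


def simulate(J=400, K=10, b0=1.0, b1=0.8, sigma_eta=0.3, sigma_e=0.9, seed=0):
    """Draw (x, y) pairs for J clusters of K households from
    y = b0 + b1 * x + eta_c + e_ch (illustrative parameter values)."""
    rng = random.Random(seed)
    data = []
    for _ in range(J):
        eta = rng.gauss(0.0, sigma_eta)  # cluster-specific error
        cluster = []
        for _ in range(K):
            x = rng.gauss(0.0, 1.0)
            y = b0 + b1 * x + eta + rng.gauss(0.0, sigma_e)
            cluster.append((x, y))
        data.append(cluster)
    return data


def direct_estimator(data, z):
    """Sample direct estimator: share of households with y below z."""
    ys = [y for cluster in data for _, y in cluster]
    return sum(y < z for y in ys) / len(ys)


def prediction_estimator(data, z, k):
    """Prediction estimator: fit the model on the first k households per
    cluster (OLS here, an assumption), then average Phi((z - x'b) / s_u)
    over ALL households. k equal to the cluster size gives the
    single-sample case; smaller k gives double sampling."""
    sub = [obs for cluster in data for obs in cluster[:k]]
    m = len(sub)
    xbar = sum(x for x, _ in sub) / m
    ybar = sum(y for _, y in sub) / m
    sxx = sum((x - xbar) ** 2 for x, _ in sub)
    sxy = sum((x - xbar) * (y - ybar) for x, y in sub)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    rss = sum((y - b0 - b1 * x) ** 2 for x, y in sub)
    sigma_u = math.sqrt(rss / (m - 2))  # residual standard deviation
    preds = [PHI((z - b0 - b1 * x) / sigma_u)
             for cluster in data for x, _ in cluster]
    return sum(preds) / len(preds)


if __name__ == "__main__":
    data = simulate()
    z = 0.5  # hypothetical poverty line
    print("direct:", direct_estimator(data, z))
    print("single-sample prediction:", prediction_estimator(data, z, k=10))
    print("double-sample prediction:", prediction_estimator(data, z, k=3))
```

In this setup all three estimators target the same quantity; the double-sample version simply uses fewer observations of y to fit the model, which is the source of both the potential cost saving and the additional model error.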
To formally introduce double sampling, we make a few assumptions.

Assumption 7 The covariates $x$ are observed for all sample households, while $y$ is observed for the first $k \leq K$ households in all $J$ clusters.[13]

[13] Since data on $x$ is already being collected for all households in all clusters of the sample, the marginal cost of collecting data on $y$ from all clusters is identical to the cost of collecting $y$ for the same number of households from half the number of clusters, say. Given that any two households from different clusters carry more information than a pair of households from the same cluster, due to spatial correlation, it is optimal to collect $y$ for a sub-sample of households from all clusters. The added advantage of collecting household expenditure data from all clusters is that it allows the user to construct unit value prices for each cluster which are needed to convert nominal household expenditure data into real terms. Alternatively, the survey could collect its price data by visiting local markets, i.e. by including a community price module. In this study we will abstract away from spatial (and temporal) price adjustments, and assume that this is taken care of. Note however that this is not a trivial matter, see e.g. van Veelen and van der Weide (2008).

Let $s_{ch}$ denote the indicator variable that equals one if the household is in the full sample and zero otherwise. We further denote by $s^{I}_{ch}$ [$s^{II}_{ch}$ $(= s_{ch} - s^{I}_{ch})$] the indicator variable that the household is in the sub-sample containing data on both $y$ and $x$ [$x$ only]. Using this notation, the number of households included in the former and latter sub-samples equals, respectively, $rn$ and $(1 - r)n$, where $n$ is the sample size of the full sample and $r$ $(= k/K)$ is the ratio of sample households with an observation of $y$.

Assumption 8 The distribution of $(x_{ch}, \eta_c, e_{ch})$ is independent of $(s^{I}_{ch}, s^{II}_{ch})$.

This requires that the selection into either sample carries no information about $x_{ch}$, $\eta_c$, or $e_{ch}$. This is a reasonable assumption in our setup because the researcher chooses whether to observe $y_{ch}$.

We use the households with the observations of $y$ and $x$ to compute the estimator $\hat{\theta}^{I}$ of $\theta$, where $\hat{\theta}^{I}$ satisfies the following assumption, which is a double-sample analogue of Assumption 4:

Assumption 9 The estimator $\hat{\theta}^{I}$ of the model parameters $\theta$ satisfies the following properties:

$$\hat{\theta}^{I} \xrightarrow{p} \theta \quad \text{and} \quad \sqrt{J}(\hat{\theta}^{I} - \theta) \xrightarrow{d} N(0, V^{I}_\theta / k) \quad \text{as } J \to \infty,$$

where $V^{I}_\theta$ is a symmetric positive-definite $\kappa \times \kappa$ asymptotic covariance matrix of $\hat{\theta}^{I}$.

Using $\hat{\theta}^{I}$, we can predict $\mu$ by $g(x_{ch}, \hat{\theta}^{I})$ for all observations in the full sample $S$. Therefore, the prediction estimator for $\mu$ under Assumption 7 is given by:

$$\hat{\mu}^{DS} \equiv n^{-1} \sum_{ch \in S} g(x_{ch}, \hat{\theta}^{I}).$$

The following is a direct extension of Theorem 6 to the double-sample estimator:

Theorem 10 Suppose that Assumptions 1, 2, 7, 8, and 9 hold. Then, $\hat{\mu}^{DS}$ satisfies the following properties as $J \to \infty$:

$$\hat{\mu}^{DS} \xrightarrow{p} \mu, \tag{4}$$

$$\sqrt{J}(\hat{\mu}^{DS} - \mu) \xrightarrow{d} N\left(0, V_g + r^{-1} M_g^T V^{I}_\theta M_g / K\right). \tag{5}$$

Note that Theorem 6 is a special case of Theorem 10 when $r = 1$ (i.e., $k = K$). In the binary context, we have the following results:

Theorem 11 Suppose that Assumptions 1, 2, 3, 7, and 8 hold. Further, $\hat{\theta}^{I} = (\hat{\beta}^T, \hat{\sigma}_\eta^2, \hat{\sigma}_e^2)^T$ is a maximum-likelihood estimator. Then, $\hat{\mu}^{DS}$ satisfies the following as $J \to \infty$:

$$\hat{\mu}^{DS} \xrightarrow{p} \mu, \qquad \sqrt{J}(\hat{\mu}^{DS} - \mu) \xrightarrow{d} N\left(0,\; V_g + \frac{\Sigma_{\phi x}^T E^{-1}[X_c^T \Omega_c^{-1} X_c] \Sigma_{\phi x}}{\sigma_u^2} + \frac{\Sigma_{\phi B}^2 (k \sigma_\eta^4 + 2 \sigma_e^2 \sigma_\eta^2 + \sigma_e^4)}{2 k \sigma_u^4}\right), \tag{6}$$

where $\Sigma_{\phi x}$ and $\Sigma_{\phi B}$ are defined as follows:

$$\Sigma_{\phi x} \equiv E[\phi(B_{ch}) x_{ch}], \quad \text{and} \quad \Sigma_{\phi B} \equiv E[\phi(B_{ch}) B_{ch}].$$

Further, $\Omega_c^{-1}$ can be written as:

$$\Omega_c^{-1} \equiv \frac{1}{\sigma_e^2} \left[ I_k - \frac{\sigma_\eta^2}{\sigma_e^2 + k \sigma_\eta^2} 1_k 1_k^T \right],$$

where $I_k$ and $1_k$ are the $k \times k$ identity matrix and the $k$-vector of ones, respectively.

Let us further develop the expression for the asymptotic variance by making some modest simplifying assumptions that will ease the exposition.
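Assumption 12 The vector of covariates $x_{ch}$ can be written as $x_{ch} = x^0_c + x^1_{ch}$, where $x^0_c$ is iid across $c$, $x^1_{ch}$ is iid across $c$ and $h$, $E[x^1_{ch}] = 0$, $E[x^0_c (x^0_c)^T] \equiv \Sigma^0_{xx}$, and $E[x^1_{ch} (x^1_{ch})^T] \equiv \Sigma^1_{xx}$.

This assumption essentially states that the covariates can be decomposed into cluster- and household-specific components. The following lemma follows directly from the Law of Total Variance and the definition of $V_g$.

Lemma 13 Under Assumptions 1 and 12, the sampling variance component $V_g$ in eq. (6) can be decomposed in the following manner:

$$V_g = K^{-1} V_{g0} + V_{g1},$$

where $V_{g0}$ and $V_{g1}$ are normalized variances due to the household-level and cluster-level variations in the sample, respectively, and have the following definitions:

$$V_{g0} \equiv K E_{x^0_c}\left[\operatorname{var}_{x^1_{ch}}\left[K^{-1} \sum_h g(x_{ch}, \theta) \,\Big|\, x^0_c\right]\right] = E_{x^0_c}\left[\operatorname{var}_{x^1_{ch}}\left[g(x_{ch}, \theta) \mid x^0_c\right]\right]$$

$$V_{g1} \equiv \operatorname{var}_{x^0_c}\left[E_{x^1_{ch}}\left[K^{-1} \sum_h g(x_{ch}, \theta) \,\Big|\, x^0_c\right]\right] = \operatorname{var}_{x^0_c}\left[E_{x^1_{ch}}\left[g(x_{ch}, \theta) \mid x^0_c\right]\right]$$

In particular, when Assumption 3 holds, $V_{g0}$ and $V_{g1}$ can be written as follows:

$$V_{g0} = E_{x^0_c}\left[\operatorname{var}_{x^1_{ch}}\left[\Phi\left(\frac{z - (x^{0T}_c + x^{1T}_{ch})\beta}{\sigma_u}\right)\right]\right] \approx E_{x^0_c}\left[\phi^2\left(\frac{z - x^{0T}_c \beta}{\sigma_u}\right)\right] \cdot \frac{\beta^T \Sigma^1_{xx} \beta}{\sigma_u^2} \tag{7}$$

$$V_{g1} = \operatorname{var}_{x^0_c}\left[E_{x^1_{ch}}\left[\Phi\left(\frac{z - (x^{0T}_c + x^{1T}_{ch})\beta}{\sigma_u}\right)\right]\right] \approx \operatorname{var}_{x^0_c}\left[\Phi\left(\frac{z - x^{0T}_c \beta}{\sigma_u}\right)\right], \tag{8}$$

where the approximation is taken around $x^1_{ch} = 0_L$ and used to obtain the sample analogues of $V_{g0}$ and $V_{g1}$.

Hereafter, we also assume that the following relationship holds:

$$\alpha \equiv \frac{\sigma_\eta^2}{\sigma_e^2} \ll 1. \tag{9}$$

This assumption is valid empirically for the datasets we used to calibrate the parameters for the model error (see Table 4 in the Appendix). This assumption is also found to be valid widely in the small-area estimation literature.

When eq. (9) holds, $E^{-1}[X_c^T \Omega_c^{-1} X_c]$ can be approximated as follows:

$$\begin{aligned}
E^{-1}[X_c^T \Omega_c^{-1} X_c] &= \frac{\sigma_e^2}{k} \left[ (\Sigma^0_{xx} + \Sigma^1_{xx}) - \frac{\alpha}{1 + k\alpha} (k \Sigma^0_{xx} + \Sigma^1_{xx}) \right]^{-1} \\
&\approx \frac{\sigma_e^2}{k} \left[ (\Sigma^0_{xx} + \Sigma^1_{xx})^{-1} + \alpha (\Sigma^0_{xx} + \Sigma^1_{xx})^{-1} (k \Sigma^0_{xx} + \Sigma^1_{xx}) (\Sigma^0_{xx} + \Sigma^1_{xx})^{-1} \right] \\
&= \frac{\sigma_e^2}{k} (\Sigma^0_{xx} + \Sigma^1_{xx})^{-1} \left[ I_L + \alpha (k \Sigma^0_{xx} + \Sigma^1_{xx}) (\Sigma^0_{xx} + \Sigma^1_{xx})^{-1} \right],
\end{aligned} \tag{10}$$

where we have taken a first-order approximation with respect to $\alpha$ in the second line using the formula for the derivative of a matrix inverse (e.g., p.151 of Magnus and Neudecker (2007)). Plugging eq. (10) in eq. (6) and using Lemma 13, the variance of $\hat{\mu}^{DS}$ can be approximated as follows:

$$\begin{aligned}
\operatorname{var}[\hat{\mu}^{DS}] &\approx \frac{1}{J} \left[ \frac{V_{g0}}{K} + V_{g1} + \frac{\sigma_e^2}{k \sigma_u^2} \Sigma_{\phi x}^T (\Sigma^0_{xx} + \Sigma^1_{xx})^{-1} \left[ I_L + \alpha (k \Sigma^0_{xx} + \Sigma^1_{xx}) (\Sigma^0_{xx} + \Sigma^1_{xx})^{-1} \right] \Sigma_{\phi x} + \frac{\Sigma_{\phi B}^2 (k \sigma_\eta^4 + 2 \sigma_e^2 \sigma_\eta^2 + \sigma_e^4)}{2 k \sigma_u^4} \right] \\
&= \frac{V_i}{n} + \frac{V_h}{nr} + \frac{V_c}{J} \quad (\equiv V),
\end{aligned} \tag{11}$$

where $V_i$, $V_h$, and $V_c$ are the variance components due to household-specific sampling errors, model errors, and cluster-specific errors, respectively, and have the following definitions:

$$V_i \equiv V_{g0}$$

$$V_h \equiv \frac{\Sigma_{\phi x}^T [\Sigma^0_{xx} + \Sigma^1_{xx}]^{-1} \left[ I_L + \alpha \Sigma^1_{xx} [\Sigma^0_{xx} + \Sigma^1_{xx}]^{-1} \right] \Sigma_{\phi x}}{1 + \alpha} + \frac{\Sigma_{\phi B}^2 (2\alpha + 1)}{2 (1 + \alpha)^2}$$

$$V_c \equiv V_{g1} + \frac{\alpha \Sigma_{\phi x}^T [\Sigma^0_{xx} + \Sigma^1_{xx}]^{-1} \Sigma^0_{xx} [\Sigma^0_{xx} + \Sigma^1_{xx}]^{-1} \Sigma_{\phi x}}{1 + \alpha} + \frac{\Sigma_{\phi B}^2 \alpha^2}{2 (1 + \alpha)^2}$$

The formula for $V_c$ shows that the variance component due to cluster-specific errors has two important parts: the first term ($V_{g1}$) represents the sampling errors at the cluster level (i.e., errors due to the variations of $x^0_c$) whereas the second and third terms represent the idiosyncratic errors at the cluster level (i.e., errors due to the variations of $\eta_c$).

One important observation to make here is that the total variance $V$ consists of three components, each inversely proportional to the full sample size (i.e., $n$), the size of the sub-sample with the outcome variable (i.e., $nr$), and the number of clusters (i.e., $J$), respectively. Note also that eq. (11) includes the single-sampling prediction estimator as a special case with $r = 1$.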
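The three-part structure of eq. (11) can be illustrated with a small helper that evaluates V for a given survey design. This is only a sketch: the function name `total_variance` is hypothetical, and the numerical values of the variance components below are arbitrary placeholders rather than calibrated values from the paper.

```python
def total_variance(V_i, V_h, V_c, J, K, k):
    """Approximate variance V = V_i/n + V_h/(n*r) + V_c/J from eq. (11),
    with n = J*K households in total and r = k/K the share with y observed."""
    if not (1 <= k <= K):
        raise ValueError("k must satisfy 1 <= k <= K")
    n = J * K
    r = k / K
    return V_i / n + V_h / (n * r) + V_c / J


if __name__ == "__main__":
    # Hypothetical variance components, chosen only to illustrate the formula.
    V_i, V_h, V_c = 0.2, 0.1, 0.01
    J, K = 100, 10
    for k in (10, 5, 2):
        # k = K reproduces the single-sampling case (r = 1); lowering k
        # inflates only the model-error term V_h / (n * r).
        print(k, total_variance(V_i, V_h, V_c, J, K, k))
```

Because only the middle term depends on r, reducing the number of households with an observed outcome leaves the household-level and cluster-level sampling components unchanged, which is why the precision loss from double sampling depends on the size of V_h relative to the other two components.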
To compare the variances of various estimators, it is useful to show that the variance of the sample direct estimator can be written in a form very similar to eq. (11), as the following theorem shows:

Theorem 14 Under Assumptions 1, 2, and 12, the following holds:

$$E[\bar{Y}] = \mu \quad \text{and} \quad \mathrm{var}[\bar{Y}] = \frac{\tilde{V}_i}{n} + \frac{\tilde{V}_c}{J}, \tag{12}$$

where $\tilde{V}_i \equiv E_{X_c}[E_{\eta_c}[\mathrm{var}_{e_{ch}}[Y_{ch} \mid \eta_c, X_c]]] + V_g^0$ and $\tilde{V}_c \equiv E_{X_c}[\mathrm{var}_{\eta_c}[E_{e_{ch}}[Y_{ch} \mid \eta_c, X_c]]] + V_g^1$. Further, under Assumption 3, $\tilde{V}_i$ and $\tilde{V}_c$ can be written as follows:

$$\begin{aligned}
\tilde{V}_i &= E_{X_c}\left[E_{\eta_c}\left[\Phi\left(\frac{z_{ch} - x_{ch}^T\beta - \eta_c}{\sigma_e}\right)\left(1 - \Phi\left(\frac{z_{ch} - x_{ch}^T\beta - \eta_c}{\sigma_e}\right)\right)\right]\right] + V_g^0 \\
&\approx E_{X_c}\left[\Phi\left(\frac{z_{ch} - x_{ch}^T\beta}{\sigma_e}\right)\left(1 - \Phi\left(\frac{z_{ch} - x_{ch}^T\beta}{\sigma_e}\right)\right)\right] + V_g^0
\end{aligned} \tag{13}$$

$$\begin{aligned}
\tilde{V}_c &= E_{X_c}\left[\mathrm{var}_{\eta_c}\left[\Phi\left(\frac{z_{ch} - x_{ch}^T\beta - \eta_c}{\sigma_e}\right)\right]\right] + V_g^1 \\
&\approx E_{X_c}\left[\phi^2\left(\frac{z_{ch} - x_{ch}^T\beta}{\sigma_e}\right)\right]\alpha + V_g^1
\end{aligned} \tag{14}$$

We hereafter use a tilde ($\tilde{\ }$) to emphasize that a quantity is derived for the sample direct estimator. This theorem is useful because the optimal single sampling discussed in the next section is directly applicable to the sample direct estimator using eq. (12), even though our primary focus is on the prediction estimator. Furthermore, this result helps us to compare the sample direct estimator with the prediction estimator as we elaborate in Section 4.

For completeness, let us also present the non-nested double sampling estimator (see e.g. Kim and Rao (2012)):

$$\hat\mu^{NN,DS} \equiv n^{-1} \sum_{ch \in S} g(x_{ch}, \check\theta), \tag{15}$$

where $\check\theta$ denotes an estimator for $\theta$ that is derived from a secondary non-nested sample. This secondary sample is considered given and its data collection cost is already sunk. The secondary sample often refers to a previous survey of the same type or a contemporaneous survey of a different type in practice.[14]

[14] Note that the imputed household expenditures or welfare indicators could also be used in a regression analysis, in addition to evaluating mean values of the predicted values, see Elbers et al. (2005).

For example, in a poverty measurement application to Morocco, Douidich et al.
(2015) estimate the relationship between consumption poverty and household characteristics using a household consumption survey and then use this model to predict consumption poverty into a series of annual labor force surveys. This approach enables users to leverage existing data sources. If one decides to collect new data for the estimation of poverty, one has the option of only collecting the covariates $x$, since an estimator for $\theta$ can be obtained from the secondary data source. Put differently, the covariate-only sample is typically part of data collection planning when $\hat\mu^{DS}$ is used but this is not necessarily the case when $\hat\mu^{NN,DS}$ is used.

We will not formally derive the precision of $\hat\mu^{NN,DS}$ here as this would require us to make assumptions about the dynamics of the model parameters. When a previous survey is used as in Douidich et al. (2015) for example, one would have to make an assumption about how the model has evolved over time (or assume that it is time-invariant), which is not required for $\hat\mu^{DS}$. In Section 4.1 we will however provide a brief discussion on the financial cost savings that may be expected when using the non-nested double sampling estimator (under some simplifying assumptions).

3 Cost efficient sampling

To see whether a meaningful reduction in costs is achievable, we examine the trade-offs between financial costs and statistical precision analytically. A stylized yet informative financial cost function is used. For a measure of statistical precision we appeal to the analytic expression for the asymptotic variance. The advantage of studying this trade-off analytically is that it allows us to work out the conditions under which double sampling may be expected to be most beneficial and, similarly, the conditions under which the benefits will be marginal.
Specifically, we consider the problem of minimizing financial costs given a statistical precision constraint and its dual problem of maximizing statistical precision under a given budget constraint. Formally, we make the following assumption:

Assumption 15 The cost of collecting only $x$ for any additional household in a given cluster equals $\tau \in (0, 1)$. The travel cost to visit an additional cluster is equal to $c$.

Here, we normalize the cost of collecting $(x_{ch}, y_{ch})$ to be equal to one. As a result, it is reasonable to require $\tau \in (0, 1)$, because it costs something to observe the covariates but not as much as it would if both the covariates and the outcome variable were to be observed. Under Assumption 15, the total variable cost of data collection is given by:

$$C = nr + n(1 - r)\tau + cJ. \tag{16}$$

We ignore the fixed cost of data collection as it does not affect the optimal sampling design. If the fixed cost differs between single and double sampling, the difference has to be taken into account in the choice of optimal design.

We now consider the optimal sampling under single and double sampling. In the former case, we fix $r = 1$ and choose $n$ and $J$ to minimize the financial cost for a given variance $\bar{V}$ or minimize the variance for a given total cost $\bar{C}$ (i.e. budget for data collection). In the case of the latter, we also allow $r$ to vary.

3.1 Optimal single sampling

To assess how much one stands to gain by adopting double sampling over single sampling, we derive the level of statistical precision and the cost for the optimal single sample case. To provide a competitive benchmark we will consider the optimal single-sampling estimator, which may be a sample direct estimator or a prediction estimator.

Suppose that one wants to minimize the cost of data collection subject to a required accuracy. This formulation is relevant, for example, when the researchers or policy-makers know how accurate the estimate $\hat\mu$ should be. Therefore, when the variance has the form of eq.
(11) and the cost function is given in eq. (16), the cost minimization problem can be formulated as follows:

$$C_1^* \equiv \min_{n,J}\; n + cJ \quad \text{s.t.} \quad \frac{V_i + V_h}{n} + \frac{V_c}{J} = \bar{V} \tag{17}$$

Ignoring the integer constraints for $n$ and $J$ for simplicity of presentation, we can obtain the following minimizing arguments $(n_1^*, J_1^*)$ and minimized cost $C_1^*$:

$$n_1^* = \frac{H_1}{\bar{V}}\sqrt{V_i + V_h}, \quad J_1^* = \frac{H_1}{\bar{V}}\sqrt{\frac{V_c}{c}}, \quad C_1^* = \frac{H_1^2}{\bar{V}},$$

where $H_1 \equiv \sqrt{V_i + V_h} + \sqrt{V_c c}$. When the budget for data collection is exogenously given, the following dual problem would be more relevant:

$$V_1^+ = \min_{n,J}\; \frac{V_i + V_h}{n} + \frac{V_c}{J} \quad \text{s.t.} \quad n + cJ = \bar{C}$$

Solving this yields:

$$n_1^+ = \frac{\bar{C}}{H_1}\sqrt{V_i + V_h}, \quad J_1^+ = \frac{\bar{C}}{H_1}\sqrt{\frac{V_c}{c}}, \quad V_1^+ = \frac{H_1^2}{\bar{C}}.$$

The solution above shows that the optimal total sample size $n$ increases with $V_i$ (as a larger $n$ will be needed to curb the sampling error). Similarly, the optimal number of clusters $J$ increases with $V_c$ (as a larger $J$ in this case is needed to curb the cluster level error component) and decreases with $c$ (i.e. the price tag associated with adding to the number of clusters).

3.2 Optimal double sampling

We now turn to the optimization problem for double sampling, in which $r\,(\leq 1)$ is also a choice variable. In this case, the cost minimization corresponding to eq. (17) is as follows:

$$C_2^* = \min_{n,r,J}\; nr + n(1 - r)\tau + cJ \quad \text{s.t.} \quad \frac{V_i}{n} + \frac{V_h}{nr} + \frac{V_c}{J} = \bar{V} \tag{18}$$

When we have an interior solution, solving the first order conditions yields:

$$r^* = \sqrt{\frac{\tau w_h}{1 - \tau}}, \quad n_2^* = \frac{H_2}{\bar{V}}\sqrt{\frac{V_i}{\tau}}, \quad J_2^* = \frac{H_2}{\bar{V}}\sqrt{\frac{V_c}{c}}, \quad \text{and} \quad C_2^* = \frac{H_2^2}{\bar{V}}, \tag{19}$$

where $H_2 \equiv \sqrt{\tau V_i} + \sqrt{(1 - \tau)V_h} + \sqrt{V_c c}$ and $w_h \equiv V_h / V_i$. As with the case of single sampling, we can also consider the dual problem of eq. (18). In this case, the total variance is minimized under a fixed budget $\bar{C}$ for data collection:

$$V_2^+ = \min_{n,r,J}\; \frac{V_i}{n} + \frac{V_h}{nr} + \frac{V_c}{J} \quad \text{s.t.} \quad nr + n(1 - r)\tau + cJ = \bar{C} \tag{20}$$

Solving this, we obtain:

$$r^+ = \sqrt{\frac{\tau w_h}{1 - \tau}}, \quad n_2^+ = \frac{\bar{C}}{H_2}\sqrt{\frac{V_i}{\tau}}, \quad J_2^+ = \frac{\bar{C}}{H_2}\sqrt{\frac{V_c}{c}}, \quad \text{and} \quad V_2^+ = \frac{H_2^2}{\bar{C}}.$$

The interpretation of the solutions for $n$ and $J$ is intuitive and similar to the case of single sampling. The optimal solution for $r$ (i.e., the share of observations for which data on both $y$ and $x$ will be collected) is found to be an increasing function of $w_h$ and of the cost parameter $\tau$. The positive relationship with $w_h$ conveys the fact that it takes data on $y$ to reduce the model error component: if the model error is important relative to the sampling error, then it is optimal to collect more data on $y$ (i.e. increase $rn$). If the sampling error is relatively more important, then it is optimal to expand the total sample size (i.e. $n$) at the expense of limiting the number of households for which data on $y$ is collected. The positive relationship between $r^*$ (or $r^+$) and $\tau$ conveys the fact that collecting data on $y$ is relatively lighter on the budget when $\tau$ is larger.

Note that the solution above does not necessarily satisfy $r \leq 1$. For this to hold, the following condition for an interior solution needs to be satisfied:

$$\frac{1 - \tau}{\tau} \geq w_h. \tag{21}$$

4 Evaluating the potential gains from double sampling

A general comparison between sample direct estimators and prediction estimators is complicated by the fact that the latter relies on a prediction model which can be estimated using different methods. Prediction estimators may or may not outperform the sample direct estimator (see Matloff (1981) and Fujii and van der Weide (2013)). In Section 4.1, we first consider the comparison between the optimal single- and double-sample estimators. This choice has at least two advantages. First, because both use prediction estimators, all the error components (i.e., $V_i$, $V_h$, and $V_c$) are the same. Their differences come only from the differences in sampling (i.e., the choices of $n$, $r$, and $J$). This in turn means that the difference can be taken as the pure effect of choosing double sampling.
If we compare the optimal double-sampling estimator with the sample direct estimator, then part of the difference must be attributed to the fact that the optimal double-sampling estimator is a prediction estimator (while the direct estimator is not). Second, because the variance components are the same between optimal single- and double-sampling estimators, the comparison provides a clear prediction about the circumstances under which double sampling is most useful.

In Section 4.2, we compute the potential gains from the optimal double sampling estimator not only in comparison with the optimal single sampling estimator but also with the sample direct estimator using empirically-relevant parameter values. This analysis provides plausible ballpark estimates of the gains from optimal double sampling for the estimation of poverty rates.

4.1 Optimal single sampling vs optimal double sampling

One intuitive measure of comparative performance between the single- and double-sampling estimator is the ratio of their respective variances given the same budget. Another candidate measure is the ratio of financial cost between the optimal double- and single-sampling estimators given the same statistical precision. Under our assumptions, it conveniently follows that these two measures coincide and can be expressed as follows:

$$\rho(c, w_c, \tau, w_h) = \frac{C_2^*}{C_1^*} = \frac{V_2^+}{V_1^+} = \frac{H_2^2}{H_1^2} = \left(\frac{\sqrt{\tau} + \sqrt{(1 - \tau)w_h} + \sqrt{w_c c}}{\sqrt{1 + w_h} + \sqrt{w_c c}}\right)^2, \tag{22}$$

where $w_c \equiv V_c / V_i$. Note that the relative performance measure satisfies $\rho \in (0, 1]$ as long as eq. (21) is satisfied, where lower values of $\rho$ indicate larger gains from double sampling. It can be verified that $\rho = 1$ holds if and only if eq. (21) is satisfied with equality. This is the threshold case where optimal double sampling reduces to optimal single sampling. When eq. (21) is violated, the double-sampling optimization problems in eqs.
(18) and (20) have a corner solution, which is the solution for the comparable single-sampling optimization problems. In this case, double sampling offers no advantage over single sampling, a situation that occurs when $w_h$ and $\tau$ are sufficiently high.

The following lemma, which follows directly from eq. (22), shows under what variance and cost parameters one stands to gain the most from adopting the optimal double sampling estimator:

Lemma 16 Suppose that eq. (21) holds. Then, $\rho$ satisfies the following conditions:

$$\frac{\partial\rho}{\partial c} \geq 0, \quad \frac{\partial\rho}{\partial w_c} \geq 0, \quad \frac{\partial\rho}{\partial\tau} \geq 0, \quad \text{and} \quad \frac{\partial\rho}{\partial w_h} \geq 0. \tag{23}$$

These results are intuitive. First, consider the impact of $c$ on $\rho$. When the travel costs make up a larger share of total costs, $\rho$ goes up. This makes the optimal double sampling estimator less attractive relative to the optimal single sampling estimator.

Second, the impact of $w_c$ on $\rho$ is also positive. When cluster-specific variations make up a larger share of the total variance of the prediction estimator, the gains from double sampling will be smaller as double sampling helps to reduce neither the travel cost nor cluster-specific variations. These two points can also be readily seen from the facts that eq. (22) is an increasing function of $w_c c$ and that $\rho$ tends to 1 in the limit where $w_c c$ tends to infinity. In this case, the number of clusters to be visited is the only thing that matters asymptotically and thus there are no gains from double sampling.

Third, $\rho$ tends to go up when $w_h$ is higher. This essentially means that double sampling is beneficial when the household-specific sampling error is important relative to the model error. This result is also intuitive as the use of double sampling does not reduce the model error but it helps to reduce the household-specific sampling error.

Finally, $\rho$ also tends to go up when $\tau$ is higher. Hence, the double sampling strategy is most useful when we have covariates that can be collected cheaply. It can be verified that $\tau$ functions as a lower bound for $\rho$.
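The closed-form designs of Sections 3.1 and 3.2 and the ratio in eq. (22) can be sketched in a few lines. This is a minimal illustration rather than the authors' code. The variance components use $V_i = 1$, $V_h = 0.4$, $V_c = 0.6$, $c = 4$, and $\tau = 0.06$ (the low parameter values of the calibration in Section 4.2), while the target variance `Vbar = 0.01` and the function names are hypothetical. Integer constraints on $n$ and $J$ are ignored, as in the text.

```python
import math


def single_design(Vi, Vh, Vc, c, Vbar):
    """Cost-minimizing single-sampling design for eq. (17): returns (n, J, cost)."""
    H1 = math.sqrt(Vi + Vh) + math.sqrt(Vc * c)
    n = H1 / Vbar * math.sqrt(Vi + Vh)
    J = H1 / Vbar * math.sqrt(Vc / c)
    return n, J, H1 ** 2 / Vbar


def double_design(Vi, Vh, Vc, c, tau, Vbar):
    """Cost-minimizing double-sampling design for eq. (18): returns (n, r, J, cost)."""
    if tau * Vh >= (1.0 - tau) * Vi:
        # eq. (21) binds or fails: corner solution r = 1, i.e. single sampling
        n, J, cost = single_design(Vi, Vh, Vc, c, Vbar)
        return n, 1.0, J, cost
    H2 = math.sqrt(tau * Vi) + math.sqrt((1.0 - tau) * Vh) + math.sqrt(Vc * c)
    r = math.sqrt(tau * Vh / ((1.0 - tau) * Vi))
    n = H2 / Vbar * math.sqrt(Vi / tau)
    J = H2 / Vbar * math.sqrt(Vc / c)
    return n, r, J, H2 ** 2 / Vbar


def rho(c, wc, tau, wh):
    """Relative cost (or variance) of double vs single sampling, eq. (22)."""
    if tau * wh >= 1.0 - tau:
        return 1.0  # corner solution: no gain from double sampling
    num = math.sqrt(tau) + math.sqrt((1.0 - tau) * wh) + math.sqrt(wc * c)
    den = math.sqrt(1.0 + wh) + math.sqrt(wc * c)
    return (num / den) ** 2


Vi, Vh, Vc, c, tau, Vbar = 1.0, 0.4, 0.6, 4.0, 0.06, 0.01
n1, J1, C1 = single_design(Vi, Vh, Vc, c, Vbar)
n2, r2, J2, C2 = double_design(Vi, Vh, Vc, c, tau, Vbar)

achieved = Vi / n2 + Vh / (n2 * r2) + Vc / J2        # should equal Vbar
spent = n2 * r2 + n2 * (1.0 - r2) * tau + c * J2     # should equal C2
ratio = C2 / C1                                      # should equal rho(...)

# Spot check of the lower bound rho >= tau on a small grid of parameter values.
bound_ok = all(rho(cc, wc, t, wh) >= t
               for cc in (4, 16, 64) for wc in (0.6, 1.2, 1.8)
               for t in (0.06, 0.36, 0.66) for wh in (0.4, 1.2, 3.6))
print(round(ratio, 3), bound_ok)
```

In this example the double-sampling design observes the outcome for only a fraction `r2` of households, and the cost ratio `C2 / C1` coincides with `rho(c, Vc / Vi, tau, Vh / Vi)`.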
Lemma 17 Suppose that eq. (21) holds. Then, $\rho \geq \tau$.

To further understand the difference between the optimal single- and double-sampling schemes, it is useful to consider the following ratios of the number of clusters for the main and dual problems:

$$r_J^* \equiv \frac{J_2^*}{J_1^*} = \frac{H_2}{H_1} = \sqrt{\rho}, \quad \text{and} \quad r_J^+ \equiv \frac{J_2^+}{J_1^+} = \frac{H_1}{H_2} = 1/\sqrt{\rho}.$$

It is clear that $r_J^* < 1$ whereas $r_J^+ > 1$ when eq. (21) holds with a strict inequality. Therefore, in the main [dual] problem where the cost [variance] is to be minimized, the number of clusters in the optimal double sampling is smaller [larger] than its counterpart in the optimal single sampling to save the travel cost [cancel out the cluster-specific errors]. Further, because $r_J^*$ [$r_J^+$] is an increasing [a decreasing] monotonic transformation of $\rho$, the signs of its partial derivatives with respect to $c$, $w_c$, $\tau$, and $w_h$ are the same as [the opposite of] those in Lemma 16.

It is also interesting to note that the number of households to be sampled in each cluster is the same between the main and dual problems as the following equation shows:

$$K_1 \equiv \frac{n_1^*}{J_1^*} = \frac{n_1^+}{J_1^+} = \sqrt{\frac{(1 + w_h)c}{w_c}}, \quad K_2 \equiv \frac{n_2^*}{J_2^*} = \frac{n_2^+}{J_2^+} = \sqrt{\frac{c}{\tau w_c}}. \tag{24}$$

Both equations show that the cluster size tends to get smaller when the errors due to the cluster-level variations become more pronounced. On the other hand, the cluster size tends to get larger (and the number of clusters to be visited gets smaller) when the travel cost is larger. The ratio $K_2 / K_1 = 1/\sqrt{(1 + w_h)\tau}$, which is no less than one when eq. (21) is satisfied, represents the change in the cluster size when one switches from optimal single sampling to optimal double sampling. When the parameters are unfavorable to double sampling (i.e., when $w_h$ and $\tau$ are high), the ratio of cluster sizes tends to be smaller.

4.2 Realistic estimates of gains from double sampling

Let us now evaluate the potential benefits of double sampling using a realistic set of parameter values $(c, \tau, w_c, w_h)$ taken from existing surveys.
While the parameter values are clearly context-dependent and cannot be readily extrapolated to other contexts, this exercise is still useful as it gives practitioners a sense of how much they could reasonably expect to gain from double sampling. It also facilitates comparisons between the sample direct estimator and the optimal single sampling estimator.

Ideally, all the parameter values should come from a single survey. However, we are unable to do so due to the lack of data. In particular, the information on survey costs is typically unavailable to the public. Therefore, we collect empirical values of $c$, $\tau$, $w_c$, and $w_h$ from various sources. For each of these parameters, we specify low, mid, and high values. The low and high values are close to the minimum and maximum observed in our data sources. The mid value is either the arithmetic or geometric mean of minimum and maximum. The parameter values are presented in Table 1. The details of the data sources and assumptions are provided in Appendix B.

Table 1: Set of parameter values used in this study.

| Value | Low | Mid | High |
|---|---|---|---|
| $c$ | $c^L = 4$ | $c^M = 16$ | $c^H = 64$ |
| $\tau$ | $\tau^L = 0.06$ | $\tau^M = 0.36$ | $\tau^H = 0.66$ |
| $w_c$ | $w_c^L = 0.6$ | $w_c^M = 1.2$ | $w_c^H = 1.8$ |
| $w_h$ | $w_h^L = 0.4$ | $w_h^M = 1.2$ | $w_h^H = 3.6$ |

While the parameter values reported in Table 1 are based on real data and existing studies, they do not necessarily represent the full range of values one might encounter in practice. Furthermore, because the parameter values are taken from different sources, we do not know the correlational structure of these parameters. To make the most of the available data, we choose to calculate $\rho$ for all the 81 ($= 3^4$) combinations.

Table 2 provides the values of $\rho$ under different combinations of parameter values. A few points are worth mentioning here. First, Table 2 shows that eq. (21) is not satisfied for some combinations of parameter values. This occurs when $\tau$ and $w_h$ are relatively high.
In this case, it does not save much to omit $y$ from observations and the model error is relatively high. Thus, optimal double sampling has no advantage over optimal single sampling. In fact, from a practical perspective, both $\tau$ and $w_h$ have to be low for optimal double sampling to have a meaningful advantage over optimal single sampling.

Second, in comparison with $\tau$ and $w_h$, the values of $c$ and $w_c$ have limited impact on $\rho$ for the range of values we consider.

Third, the gains from optimal double sampling relative to optimal single sampling appear to be reasonably modest, even when eq. (21) is satisfied. The lowest number reported in Table 2 is 0.776. This means that the cost saving from optimal double sampling is at best 22.4 percent of the cost for optimal single sampling within the set of parameters we considered.

Table 2: Values of $\rho$ for different combinations of parameters $(\tau, c, w_c, w_h)$.

| | $(w_c^L, w_h^L)$ | $(w_c^L, w_h^M)$ | $(w_c^L, w_h^H)$ | $(w_c^M, w_h^L)$ | $(w_c^M, w_h^M)$ | $(w_c^M, w_h^H)$ | $(w_c^H, w_h^L)$ | $(w_c^H, w_h^M)$ | $(w_c^H, w_h^H)$ |
|---|---|---|---|---|---|---|---|---|---|
| $(\tau^L, c^L)$ | 0.776 | 0.887 | 0.968 | 0.817 | 0.906 | 0.972 | 0.839 | 0.917 | 0.975 |
| $(\tau^L, c^M)$ | 0.854 | 0.925 | 0.977 | 0.887 | 0.941 | 0.982 | 0.903 | 0.949 | 0.984 |
| $(\tau^L, c^H)$ | 0.914 | 0.955 | 0.986 | 0.936 | 0.966 | 0.989 | 0.946 | 0.971 | 0.991 |
| $(\tau^M, c^L)$ | 0.944 | 0.995 | n/a | 0.955 | 0.996 | n/a | 0.960 | 0.997 | n/a |
| $(\tau^M, c^M)$ | 0.964 | 0.997 | n/a | 0.972 | 0.998 | n/a | 0.977 | 0.998 | n/a |
| $(\tau^M, c^H)$ | 0.979 | 0.998 | n/a | 0.985 | 0.999 | n/a | 0.987 | 0.999 | n/a |
| $(\tau^H, c^L)$ | 0.999 | n/a | n/a | 0.999 | n/a | n/a | 0.999 | n/a | n/a |
| $(\tau^H, c^M)$ | 0.999 | n/a | n/a | 0.999 | n/a | n/a | 0.999 | n/a | n/a |
| $(\tau^H, c^H)$ | 0.999 | n/a | n/a | 1.000 | n/a | n/a | 1.000 | n/a | n/a |

n/a indicates that eq. (21) is not satisfied.

Figure 1 gives an idea of what values $\rho$ might attain by plotting $\rho$ as a function of $\tau$ for four different choices of $w_c c$ and $w_h$. The top (bottom) row corresponds to a relatively low (high) value for $w_h$, while the left (right) column refers to relatively low (high) values for $w_c c$ (see Table 1). The diagonal line denotes the lower bound for $\rho$ from Lemma 17. Note that $\rho$ is only evaluated for $\tau < \tau_{max} = 1/(1 + w_h)$ (see eq. (21)). Judging by these figures, $\rho$ mostly ranges between 0.8 and 1, and it takes incredibly low values of $\tau$ to realize reductions in costs (or variance) of 10 percent or more (i.e. values of $\rho$ below 0.9). Assuming that the parameter values for $w_c c$ and $w_h$ are indeed reasonable, the gains from double sampling are expected to be modest unless conditions are particularly favorable.

[Figure 1: four panels, each plotting $\rho$ (vertical axis, 0 to 1) against $\tau$ (horizontal axis, 0 to 1).]

Figure 1: Plots of $\rho$ versus $\tau$ for different choices of $w_c c$ and $w_h$: (a) $w_c c = 1$ and $w_h = 0.1$ (top-left), (b) $w_c c = 50$ and $w_h = 0.1$ (top-right), (c) $w_c c = 1$ and $w_h = 2$ (bottom-left), and (d) $w_c c = 50$ and $w_h = 2$ (bottom-right)

Recall that $\rho$ is the ratio of data collection costs between optimal single and double sampling strategies. One may also be interested in the gains relative to the sample direct estimator. Let $\omega$ denote the ratio of data collection costs between the sample direct estimator and the double sample estimator given the required accuracy (variance) constraint.
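The entries of Table 2 follow mechanically from eq. (22) and the low, mid, and high values in Table 1. The sketch below, which is an illustration and not the authors' code, recomputes the full grid of 81 combinations and leaves a cell empty (`None`) whenever eq. (21) is violated. The dictionary and variable names are hypothetical.

```python
import math

# Low / mid / high parameter values as reported in Table 1.
TAUS = {"L": 0.06, "M": 0.36, "H": 0.66}
CS = {"L": 4.0, "M": 16.0, "H": 64.0}
WCS = {"L": 0.6, "M": 1.2, "H": 1.8}
WHS = {"L": 0.4, "M": 1.2, "H": 3.6}


def rho(c, wc, tau, wh):
    """eq. (22); returns None when the interior condition, eq. (21), fails."""
    if (1.0 - tau) / tau < wh:
        return None
    num = math.sqrt(tau) + math.sqrt((1.0 - tau) * wh) + math.sqrt(wc * c)
    den = math.sqrt(1.0 + wh) + math.sqrt(wc * c)
    return (num / den) ** 2


# Keys are (tau label, c label, w_c label, w_h label), e.g. ("L", "L", "L", "L").
table = {}
for t_lab, tau in TAUS.items():
    for c_lab, c in CS.items():
        cells = []
        for wc_lab, wc in WCS.items():
            for wh_lab, wh in WHS.items():
                value = rho(c, wc, tau, wh)
                table[(t_lab, c_lab, wc_lab, wh_lab)] = value
                cells.append("  n/a" if value is None else f"{value:.3f}")
        print(f"(tau^{t_lab}, c^{c_lab})", " ".join(cells))

valid = [v for v in table.values() if v is not None]
print("smallest rho on the grid:", round(min(valid), 3))
```

Each printed row lists the nine $(w_c, w_h)$ columns in the same order as Table 2, so the output can be compared cell by cell with the published values.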
To this end, we first derive the ratio $\psi$ of data collection costs between the sample direct estimator and the single-sampling prediction estimator (given the required accuracy constraint), which is given by:

$$\psi \equiv \frac{C_1^*}{\tilde{C}_1^*} = \frac{V_1^+}{\tilde{V}_1^+} = \left(\frac{\sqrt{V_i + V_h} + \sqrt{V_c c}}{\sqrt{\tilde{V}_i} + \sqrt{\tilde{V}_c c}}\right)^2. \tag{25}$$

This quantity can be computed when all the relevant inputs (i.e., $V_i$, $V_h$, $V_c$, $\tilde{V}_i$, $\tilde{V}_c$, and $c$) can be obtained from a single data source. This is indeed the case for Malawi ($\psi = 0.945$) and Niger ($\psi = 0.977$). The gain of using the optimal single-sample estimator instead of the sample direct estimator is thus found to be very small, at least for these two countries.

If we assume that the choice of $K$ in a given survey is optimized for the sample direct estimator, we are able to derive $\psi$ without data on $c$. To see this point, notice first that the derivation of eq. (24) does not depend on the nature of the variance components $V_i$, $V_h$, and $V_c$. Therefore, we can apply the variance components of the sample direct estimator defined in Theorem 14 to eq. (24), where we have $V_h = 0$ because there is no model error for the sample direct estimator. Solving for $c$, we obtain:

$$\tilde{c} = \frac{K^2 \tilde{V}_c}{\tilde{V}_i}, \tag{26}$$

where $K$ is the cluster size in the sample. The resulting $\tilde{c}$ can be interpreted as the implied travel cost that justifies the design of the sample, because the observed choice of $K$ is consistent with the optimal design for the sample direct estimator. We can then substitute $\tilde{c}$ in place of $c$ in eq. (25). We denote the left hand side of eq. (25) derived in this way by $\tilde\psi$ to distinguish it from $\psi$.

The values of $\tilde{c}$ tend to be higher than the values of $c$ observed in Malawi and Niger (see Tables 3 and 4). However, the resulting values of $\tilde\psi$ are not very different from $\psi$ ($\tilde\psi = 0.948$ in Malawi and $\tilde\psi = 0.983$ in Niger). Therefore, the alternative derivation of $\psi$ appears to provide a reasonable indication of the gains associated with the optimal single-sample estimator relative to the sample direct estimator.
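The two steps just described, backing out the implied travel cost from the observed cluster size and then evaluating the cost ratio, can be sketched as follows. This is an illustration under stated assumptions, not the authors' calculation: the cost ratio is taken to be the squared ratio implied by $C_1^* = H_1^2/\bar{V}$, and every numeric input below is hypothetical (the calibrated country values are not reproduced here). The function names are also hypothetical.

```python
import math


def implied_travel_cost(K, Vi_tilde, Vc_tilde):
    """Implied travel cost from eq. (26): the c that makes cluster size K optimal."""
    return K ** 2 * Vc_tilde / Vi_tilde


def psi(Vi, Vh, Vc, Vi_tilde, Vc_tilde, c):
    """Cost ratio of the single-sampling prediction estimator to the direct estimator.

    ASSUMPTION: minimized cost is H^2 / Vbar for both estimators, so the ratio
    is the squared ratio of the two H terms.
    """
    h_pred = math.sqrt(Vi + Vh) + math.sqrt(Vc * c)
    h_direct = math.sqrt(Vi_tilde) + math.sqrt(Vc_tilde * c)
    return (h_pred / h_direct) ** 2


# Hypothetical variance components and cluster size, for illustration only.
K = 12
Vi, Vh, Vc = 0.20, 0.02, 0.05
Vi_tilde, Vc_tilde = 0.24, 0.05

c_tilde = implied_travel_cost(K, Vi_tilde, Vc_tilde)
psi_tilde = psi(Vi, Vh, Vc, Vi_tilde, Vc_tilde, c_tilde)

# Consistency check: with V_h = 0, eq. (24) gives K_1 = sqrt(c * V_i / V_c),
# which should return the observed K when c is set to the implied cost.
K_check = math.sqrt(c_tilde * Vi_tilde / Vc_tilde)
print(c_tilde, psi_tilde, K_check)
```

The consistency check makes the logic of eq. (26) explicit: the implied cost is defined so that the observed cluster size is exactly the optimal one for the sample direct estimator.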
We also derived $\tilde\psi$ for Tanzania ($\tilde\psi = 0.961$) as well as the following three strata of Cambodia: Phnom Penh ($\tilde\psi = 0.945$), Other Urban ($\tilde\psi = 0.977$), and Rural ($\tilde\psi = 0.965$). All these estimates are close to unity. Hence, even if we take the lowest value of $\rho$ and $\psi$ among our estimates, the resulting value of the ratio $\omega \equiv \psi\rho$ of minimized variances [costs] under the optimal double sampling and optimal sample direct estimators for an exogenously given cost [variance] is 0.733 ($\approx 0.776 \times 0.945$). This suggests that the gain from optimal double sampling relative to the sample direct estimator is only about 27 percent even in the most optimistic case.

Let us also briefly comment on the financial cost savings that may be expected when the poverty rate is estimated using the non-nested double sampling estimator from eq. (15). We expect the reduction in costs to be more substantial in this case. For ease of exposition, suppose that the primary sample (the newly collected data) and the secondary sample (a previous survey, say) would offer similar estimates of the model parameters $\theta$, in terms of precision, if the primary sample indeed includes the observations of $y$ for all households. This ensures that the non-nested double sampling estimator (which predicts poverty into the primary sample using a model that is estimated from the secondary sample) will match the precision of the single sampling estimator (which uses the primary sample only), allowing for a fair comparison of costs. The non-nested double sampling estimator is able to achieve this level of precision without using any of the data on $y$ from the primary sample. By not collecting any $y$ but only the covariates $x$, the data for the primary sample come at a cost of $\tau$ rather than unit cost. Note that this coincides with the lower bound value for $\rho$ as derived in Lemma 17.
The difference between $\rho$ and $\tau$ tends to be substantial, as can be seen in Figure 1, which suggests that the gains from non-nested double sampling are indeed expected to be substantially larger compared to those obtained from nested double sampling. It should be noted however that it is by no means guaranteed that a secondary sample (which may denote a noticeably older survey) can compete with an up-to-date sample as far as the estimation of $\theta$ is concerned. A more detailed study of the cost savings that can be realized with non-nested double sampling, under different assumptions of model stability, is recommended but is beyond the scope of this paper.

4.3 When is double sampling promising?

A relatively high value of $\omega$ under the most optimistic case indicates that gains from double sampling are rather limited. However, this finding should not be overextrapolated because the parameter values depend on the application. Different survey designs and data collection techniques may lead to different values of $\omega$. Let us therefore also consider the possibility where $\omega$ may be substantially lower than 0.733.

In some contexts, observing outcomes may be very expensive. For example, if the data collection involves physically invasive techniques (e.g., blood testing), observing $y$ may be costly and thus the value of $\tau$ may be substantially lower than 0.06. If we use $\tau^L = 0.001$ instead of $\tau^L = 0.06$, the lowest estimate for $\rho$ in Table 2 would be $\rho = 0.656$ instead of $\rho = 0.776$. The use of new data collection technology may also allow for significantly lower data collection costs. The marginal cost of collecting data online for example is typically much lower than that through traditional face-to-face interviews.

It is conceivable that the values we use for $w_h$ and $w_c$ too may be lower in other contexts, particularly when an obvious and strong proxy for the outcome of interest is available.
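The sensitivity of $\rho$ to more favorable parameter values can be worked out directly from eq. (22). The sketch below is an illustration only: it evaluates the baseline low-value cell of Table 2, the same cell with $\tau = 0.001$, and the same cell with illustrative lower variance ratios $(w_c, w_h) = (0.15, 0.1)$, which are assumed values chosen to represent a strong proxy. The scenario labels are hypothetical.

```python
import math


def rho(c, wc, tau, wh):
    """Relative cost of optimal double vs optimal single sampling, eq. (22)."""
    if tau * wh >= 1.0 - tau:
        return 1.0  # eq. (21) binds or fails: double sampling offers no gain
    num = math.sqrt(tau) + math.sqrt((1.0 - tau) * wh) + math.sqrt(wc * c)
    den = math.sqrt(1.0 + wh) + math.sqrt(wc * c)
    return (num / den) ** 2


scenarios = {
    "baseline (all low values)": dict(c=4.0, wc=0.6, tau=0.06, wh=0.4),
    "very cheap covariates": dict(c=4.0, wc=0.6, tau=0.001, wh=0.4),
    "strong proxy (illustrative)": dict(c=4.0, wc=0.15, tau=0.06, wh=0.1),
}

results = {name: rho(**params) for name, params in scenarios.items()}
for name, value in results.items():
    saving = 100.0 * (1.0 - value)
    print(f"{name:30s} rho = {value:.3f}  (cost saving of about {saving:.0f} percent)")
```

Lowering $\tau$ alone moves $\rho$ less than one might expect, because the model error term $\sqrt{(1 - \tau) w_h}$ and the cluster term $\sqrt{w_c c}$ are unaffected by cheaper covariates.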
Consider an application where consumption data derived from a short-form questionnaire serves as a proxy for the complete consumption aggregate that is based on a long-form questionnaire. The predictive power of the model is likely to be strong in that case such that $w_c$ and $w_h$ may be much lower. Just for the purpose of illustration, if we use $(w_c^L, w_h^L) = (0.15, 0.1)$ instead of $(0.6, 0.4)$ while keeping $(\tau^L, c^L) = (0.06, 4)$, then $\rho$ can go as low as 0.529. Therefore, one can think of circumstances where the gains from double sampling can be substantial, but these may be the exceptions rather than the rule.

To further elucidate this point, consider a study by Ahmed et al. (2014), which implies considerable gains from using a prediction estimator in an application to poverty measurement in Bangladesh. Unfortunately, their approach does not fit perfectly into our analytical framework such that we are unable to compute the values for $(w_c, w_h)$ that would apply to their data. Specifically, their prediction estimator works with different data (compared to the sample direct estimator), the number of clusters is not necessarily chosen optimally, and neither the level of statistical precision nor the financial costs are fixed. This hampers a fair comparison between the two estimators.

However, the standard errors for the national poverty rate derived from a small sub-sample of 640 households are 2.9 and 2.4 percentage points (see Table 5 of Ahmed et al. (2014)), respectively, for the sample direct and prediction estimators, which corresponds to a 32 percent ($\approx 1 - (2.4/2.9)^2$) reduction in variance. Without further information, however, it is hard to determine how much of this cost-precision trade-off can be attributed to the fact that their double sampling estimator substituted predicted data for real data.[15] It is unlikely that the gains are due to a favorably low level of $\tau$ as $\tau$ is estimated to be around 0.6 in their case, which is in the mid range.
If the gains can indeed be attributed to the use of predicted data, then it is more likely that this is due to favorable values of $w_h$ and $w_c$.

5 Discussion

The primary motivation for the use of prediction in economics, health sciences, and other disciplines has been to deal with various forms of missing data problems. One could also make a case for adopting prediction estimators to obtain more cost-efficient estimates of the population mean when it is expensive to observe the outcome of interest in comparison with its covariates. For example, consider the estimation of poverty and malnutrition rates. The conventional sample direct estimators in this case require household- and individual-level data on expenditures and health outcomes. Collecting this data is generally costly. It is not uncommon that in developing countries, where poverty and poor health outcomes are most pressing, statistical agencies do not have the budget that is needed to collect these data frequently. As a result, official estimates of poverty and malnutrition are often outdated. This then makes it difficult to monitor progress (or lack thereof) in times when circumstances might be subject to considerable change, such as shocks to international staple prices, domestic climate shocks, etc. Using predicted data as a substitute for real data may then offer a valuable alternative.

In recent years, a number of studies have explored the option of predicting household expenditure data into existing secondary surveys in an effort to supplement existing poverty estimates and increase their frequency (Stifel and Christiaensen, 2007; Douidich et al., 2015). Douidich et al. (2015) for example consider the Labour Force Survey as their secondary survey, which is often available at a higher frequency than household expenditure surveys.

There is also a large literature that predicts household expenditure data into the population census, see for example Elbers et al.
(2003) and the references therein. The objective here is to obtain estimates of poverty at a high level of disaggregation, or at the level of a small area such as a district. It would be impractical to use the sample direct estimator because the data must contain an extraordinarily large number of households to obtain a reliable estimate for each small area, which would be financially infeasible.

[15] Also, the study does not report the total cost of data collection, but selected cost components instead.

It is then a small step to purposefully collect data on covariates that are ideally suited for the prediction of household- or individual-level outcomes of interest. If real data on the variable of interest is collected for a sub-sample of households, then this sub-sample can be used to estimate the model parameters that are used for prediction. The advantage of this double sampling approach is that the prediction model will apply to the population of interest by construction. There is certainly a considerable interest in adopting such an approach in practice in the hope that this will enable a meaningful reduction in financial costs while preserving a reasonable level of statistical precision.

The objective of our study is to investigate the potential gains that might be derived from a double sampling approach under a set of fairly general conditions. We achieve this by analytically deriving the asymptotic variances of the single- and double-sample prediction estimators and by considering approximations to a financial cost function. This allows us to maximize statistical precision [minimize financial costs] under a budget constraint [statistical precision constraint] for a wide set of parameter values. Even though we are working with analytic approximations, we expect that the broad findings coming out of this analysis may carry over to real applications.
When we calibrate the parameters from the variance structure and the financial cost function to real data in the context of the estimation of poverty, we find that the reductions in costs rarely exceed 25 percent and are often below 10 percent. Furthermore, we find that the magnitude of the gains derived from double sampling is primarily determined by the following factors: (a) the relative size of the travel costs, (b) the degree of spatial correlation between residuals, (c) the financial discount obtained by not collecting the outcome variable of interest, and (d) the share of total error that may be attributed to model error (versus sampling error). Double sampling is most effective when a reasonably large geographic coverage can be obtained without having to spend a disproportionate share of the budget on travel, and when the spatial correlation between the residuals is smaller rather than larger. The financial discount obtained by not collecting the expensive outcome variable of interest is most notable when a larger share of the total error is due to the sampling error (rather than the model error).

It is conceivable that larger gains can be obtained under certain conditions, for example, when the cost of collecting the outcome variable of interest is extraordinarily high, when the cheaply available predictors exhibit an exceptionally high correlation with the outcome variable of interest, and when the data exhibit very little spatial correlation. We conjecture, however, that these circumstances represent the exception rather than the rule. Moreover, we have so far abstracted away from model misspecification error. Ignoring this component of error obviously favors the prediction estimator. Accounting for misspecification error is not straightforward; it is hard to quantify since the true model is inherently unknown and any given estimate of the model can be misspecified in infinitely many ways.
Given these observations, when new data is to be collected, we recommend that the outcome variable of interest be included so that one does not have to rely on predicted data. This does not mean that there is no role for prediction estimators. Under the right circumstances we believe they could be of great value. For one, prediction estimators provide the means of leveraging already existing data (think of non-nested double sampling estimators, see for example Kim and Rao (2012) and Douidich et al. (2015)). Furthermore, if no previous data exists and the budget is particularly constrained such that one may be left with the choice between predicted data or no data, then the former may be preferred over the latter. In such a data-poor environment, which is not unheard of in developing countries, double sampling estimators may continue to provide a valuable option.

References

Ahmed, F., C. Dorji, S. Takamatsu, and N. Yoshida (2014) ‘Hybrid survey to improve the reliability of poverty statistics in a cost-effective manner.’ World Bank Policy Research Working Paper 6909, The World Bank
Aliaga, A., and R. Ren (2006) ‘Optimal sample sizes for two-stage cluster sampling in demographic and health surveys.’ DHS Working Papers 2006 No.30, ORC Macro
Armstrong, J., C. Block, and K.P. Srinath (1993) ‘Two-phase sampling of tax records for business surveys.’ Journal of Business & Economic Statistics 11(4), 407–419
Beegle, K., J. De Weerdt, J. Friedman, and J. Gibson (2012) ‘Methods of household consumption measurement through surveys: Experimental results from Tanzania.’ Journal of Development Economics 98(1), 3–18
Bose, C. (1943) ‘Note on the sampling error in the method of double sampling.’ Sankhyā 6, 329–330
Christiaensen, L., P. Lanjouw, J. Luoto, and D. Stifel (2012) ‘Small area estimation-based prediction methods to track poverty: validation and applications.’ Journal of Economic Inequality 10(2), 267–297
Cochran, W.G.
(1977) Sampling Techniques, 3rd ed. (John Wiley & Sons)
Davidov, O., and Y. Haitovsky (2000) ‘Optimal design for double sampling with continuous outcomes.’ Journal of Statistical Planning and Inference 86, 253–263
Deaton, A. (2003) ‘Adjusted Indian poverty estimates for 1999-2000.’ Economic and Political Weekly 38(4), 322–326
Deaton, A. (2005) ‘Data and dogma: The great Indian poverty debate.’ World Bank Research Observer 20(2), 177–199
Deaton, A., and J.P. Drèze (2002) ‘Poverty and inequality in India: A reexamination.’ Economic and Political Weekly 37(36), 3729–3748
Diamond, A., M. Gill, M. Dellepiane, E. Skoufias, K. Vinha, and Y. Xu (2015) ‘Estimating poverty rates in target populations: An assessment of the simple poverty scorecard and alternative approaches.’ mimeo
Douidich, M., A. Ezzrari, R. van der Weide, and P. Verme (2015) ‘Estimating quarterly poverty rates using labor force surveys: A primer.’ World Bank Economic Review. Advance Access published 2015
Elbers, C., and R. van der Weide (2014) ‘Estimation of normal mixtures in a nested error model with an application to small area estimation of poverty and inequality.’ World Bank Policy Research Working Paper 6962, The World Bank
Elbers, C., J. Lanjouw, and P. Lanjouw (2003) ‘Micro-level estimation of poverty and inequality.’ Econometrica 71, 355–364
Elbers, C., J. Lanjouw, and P. Lanjouw (2005) ‘Imputed welfare estimates in regression analysis.’ Journal of Economic Geography 5, 101–118
Fujii, T. (2006) ‘Community-level estimation of poverty measures and its application in Cambodia.’ In ‘Spatial Disparities in Human Development’ (United Nations University Press) pp. 289–314
Fujii, T. (2010) ‘Micro-level estimation of child undernutrition indicators in Cambodia.’ World Bank Economic Review 24(3), 520–553
Fujii, T., and R.
van der Weide (2013) ‘Cost-effective estimation of the population mean using prediction estimators.’ World Bank Policy Research Working Paper 6509, The World Bank
Grosh, M.E., and Muñoz (1996) ‘A manual for planning and implementing the living standards measurement study survey.’ Living Standards Measurement Study Working Paper 126, The World Bank
Hansen, M.H., and B.J. Tepping (1990) ‘Regression estimates in federal welfare quality control programs.’ Journal of the American Statistical Association 85(411), 856–864
Hidiroglou, M. (2001) ‘Double sampling.’ Survey Methodology 27(2), 143–154
Humphreys, C.P. (1979) ‘The cost of sample survey designs.’ In ‘Proceedings of the Survey Research Methods Section’ American Statistical Association pp. 395–400
Kijima, Y., and P. Lanjouw (2005) ‘Economic diversification and poverty in rural India.’ Indian Journal of Labour Economics 48(2), 349–374
Kim, J., A. Navarro, and W. Fuller (2006) ‘Replication variance estimation for two-phase stratified sampling.’ Journal of the American Statistical Association 101(473), 312–320
Kim, J., and J. Rao (2012) ‘Combining data from two independent surveys: A model-assisted approach.’ Biometrika 99, 85–100
Lanjouw, J., and P. Lanjouw (2001) ‘How to compare apples and oranges? Poverty measurement based on different definitions of consumption.’ Review of Income and Wealth 47(1), 25–42
Ligon, E., and T. Sohnesen (2016) ‘Using reduced consumption aggregates to track and analyze poverty.’ mimeo
Magnus, J., and H. Neudecker (2007) Matrix Differential Calculus with Applications in Statistics and Econometrics: Revised Edition (John Wiley & Sons)
Matloff, N.
(1981) ‘Use of regression functions for improved estimation of means.’ Biometrika 68, 685–689
Ministry of Planning, and United Nations World Food Programme (2002) ‘Estimation of poverty rates at commune-level in Cambodia: Using the small area estimation technique to obtain reliable estimates.’ Ministry of Planning, Royal Government of Cambodia and United Nations World Food Programme, Phnom Penh, Cambodia
Neyman, J. (1938) ‘Contribution to the theory of sampling human populations.’ Journal of the American Statistical Association 33, 101–116
Palmgren, J. (1987) ‘Precision of double sampling estimators for comparing two probabilities.’ Biometrika 74(4), 687–694
Pape, U., and J. Mistiaen (2015) ‘Measuring household consumption and poverty in 60 minutes: The Mogadishu high frequency survey.’ mimeo, The World Bank
Pettersson, H., and B. Sisouphanthong (2005) ‘Cost model for an income and expenditure survey.’ In ‘Household Sample Surveys in Developing and Transition Countries,’ vol. ST/ESA/STAT/SER.F/96 of Series F (Department of Economic and Social Affairs, United Nations Statistics Division) studies in methods 13, pp. 267–277
Rao, J., and R. Sitter (1995) ‘Variance estimation under two-phase sampling with application to imputation for missing data.’ Biometrika 82(2), 453–460
Särndal, C.-E., B. Swensson, and J. Wretman (2003) Model Assisted Survey Sampling (Springer)
Schreiner, M. (2014a) ‘How do the poverty scorecard and the PAT differ?’ mimeo
Schreiner, M. (2014b) ‘The process of poverty-scoring analysis.’ mimeo
Sitter, R. (1997) ‘Variance estimation for the regression estimator in two-phase sampling.’ Journal of the American Statistical Association 92, 780–787
Stifel, D., and L. Christiaensen (2007) ‘Tracking poverty over time in the absence of comparable consumption data.’ World Bank Economic Review 21(2), 317–341
Tamhane, A.C. (1978) ‘Inference based on regression estimator in double sampling.’ Biometrika 65(2), 419–427
Tarozzi, A.
(2007) ‘Calculating comparable statistics from incomparable surveys, with an application to poverty in India.’ Journal of Business & Economic Statistics 25(3), 314–336
Tarozzi, A. (2011) ‘Can census data alone signal heterogeneity in the estimation of poverty maps?’ Journal of Development Economics 95(2), 170–185
Tarozzi, A., and A. Deaton (2009) ‘Using census and survey data to estimate poverty and inequality for small areas.’ Review of Economics and Statistics 91(4), 773–792
van Veelen, M., and R. van der Weide (2008) ‘A note on different approaches to index number theory.’ American Economic Review 98(4), 1722–1730
Yoshida, N., R. Munoz, A. Skinner, C. Kyung-eun Lee, M. Brataj, W. Durbin, and D. Sharma (2015) ‘Swift data collection guidelines version 2.’ mimeo, The World Bank

A Proofs

Proof of Theorem 6. Letting $k = K$ (or $r = 1$) and $\hat{\theta}_I = \hat{\theta}$ in the proof of Theorem 10, we obtain the proof of Theorem 6.

Proof of Theorem 10. By an exact first-order Taylor expansion in $\hat{\theta}_I$ around $\theta$, the Law of Large Numbers, and Assumption 4, we have the following as $J \to \infty$:

$$
\hat{\mu}_{DS} = \frac{1}{JK} \sum_c \sum_h g(x_{ch}, \theta) + \frac{1}{JK} \sum_c \sum_h \frac{\partial g(x_{ch}, \tilde{\theta}_I)}{\partial \theta^T} (\hat{\theta}_I - \theta) \xrightarrow{p} \mu, \tag{27}
$$

where $\tilde{\theta}_I$ is between $\theta$ and $\hat{\theta}_I$. By the Central Limit Theorem and Assumption 4, we obtain:

$$
\sqrt{J}(\hat{\mu}_{DS} - \mu) = \sqrt{J} \left[ \frac{1}{J} \sum_c \frac{1}{K} \sum_h g(x_{ch}, \theta) - \mu \right] + \left[ \frac{1}{J} \sum_c \frac{1}{K} \sum_h \frac{\partial g(x_{ch}, \tilde{\theta}_I)}{\partial \theta^T} \right] \sqrt{J} (\hat{\theta}_I - \theta) \xrightarrow{d} N\left(0, V_g + r^{-1} M_g V_\theta M_g^T / K\right),
$$

as $J \to \infty$. This completes the proof.

Proof of Theorem 11. Let $\phi(\cdot)$ be the probability density function for the standard normal distribution. Then, the log-likelihood function $l_c(\theta)$ for cluster $c$ has the following form:

$$
\begin{aligned}
l_c(\theta) &\equiv \ln \int_{-\infty}^{\infty} \frac{1}{\sigma_\eta} \phi\left(\frac{\eta_c}{\sigma_\eta}\right) \prod_h \frac{1}{\sigma_e} \phi\left(\frac{y_{ch} - x_{ch}^T \beta - \eta_c}{\sigma_e}\right) d\eta_c \\
&= -\frac{1}{2} \left\{ \frac{1}{\sigma_e^2} \sum_h \left[y_{ch} - x_{ch}^T \beta\right]^2 - \frac{\sigma_\eta^2}{\sigma_e^2 (\sigma_e^2 + k \sigma_\eta^2)} \left[ \sum_h \left(y_{ch} - x_{ch}^T \beta\right) \right]^2 + \ln\left( k \frac{\sigma_\eta^2}{\sigma_e^2} + 1 \right) + k \ln\left[2 \pi \sigma_e^2\right] \right\},
\end{aligned}
$$

where the set of parameters to be estimated is $\theta = (\beta^T, \sigma_\eta^2, \sigma_e^2)^T$. Therefore, the log-likelihood satisfies $l(\theta) \equiv \sum_c l_c(\theta)$ and the maximum likelihood estimator is given by $\hat{\theta}_I = \arg\max_\theta l(\theta)$. It is straightforward to show that $\hat{\theta}_I$ satisfies Assumption 9 with $V_\theta^I$ given by the following:

$$
V_\theta^I = -E^{-1}\left[\frac{\partial^2 l(\theta)}{\partial \theta \, \partial \theta^T}\right] = -J^{-1} E^{-1}\left[\frac{\partial^2 l_c(\theta)}{\partial \theta \, \partial \theta^T}\right] =
\begin{pmatrix}
E^{-1}\left[X_c^T \Omega_c^{-1} X_c\right] & 0_L^T \\
0_L & \dfrac{2 \sigma_e^4}{k(k-1)}
\begin{pmatrix}
\dfrac{k(k-1)\sigma_\eta^4 + 2(k-1)\sigma_\eta^2 \sigma_e^2 + \sigma_e^4}{\sigma_e^4} & -1 \\
-1 & k
\end{pmatrix}
\end{pmatrix}, \tag{28}
$$

where $0_L$ is an $L$-vector of zeros. By the definition of $M_g$, we have:

$$
M_g = -E\left[ \left( \frac{\Sigma_{\phi x}^T}{\sigma_u}, \frac{\Sigma_{\phi B}}{2 \sigma_u^2}, \frac{\Sigma_{\phi B}}{2 \sigma_u^2} \right) \right]. \tag{29}
$$

Applying eqs. (28) and (29) to Theorem 10, we obtain eq. (6).

Proof of Theorem 14. By the definition of $\bar{Y}$, the following holds:

$$
E[\bar{Y}] = n^{-1} \sum_c \sum_h E[Y_{ch}] = \mu.
$$

By Assumptions 1 and 2, Lemma 13, and the Law of Iterated Variance, we have:

$$
\begin{aligned}
\operatorname{var}[\bar{Y}] &= \frac{1}{J} \operatorname{var}[\bar{Y}_c] \\
&= \frac{1}{J} \left( E_{X_c}\left[\operatorname{var}[\bar{Y}_c | X_c]\right] + \operatorname{var}_{x_c}\left[E[\bar{Y}_c | X_c]\right] \right) \\
&= \frac{1}{J} \left( E_{X_c}\left\{ E_{\eta_c}\left[\operatorname{var}_{e_{ch}}[\bar{Y}_c | \eta_c, X_c]\right] + \operatorname{var}_{\eta_c}\left[E_e[\bar{Y}_c | \eta_c, X_c]\right] \right\} + \operatorname{var}_{x_c}\left[ \frac{1}{K} \sum_h g(x_{ch}, \theta) \right] \right) \\
&= \frac{\tilde{V}_c}{n} + \frac{\tilde{V}_i}{J},
\end{aligned}
$$

where $\bar{Y}_c \equiv K^{-1} \sum_{h=1}^{K} Y_{ch}$.

Proof of Lemma 17. The result follows directly from the fact that $\rho = \tau$ when $w_c c = 0$ and $w_h = 0$ and the fact that $\rho$ is an increasing function of $w_c c$ and $w_h$ for all $\tau$. The latter is established in Lemma 16.

B Details of the variance and cost parameter estimates

To compute $\rho$ in eq. (22), we need a realistic set of values for the following parameters: $\tau$, $c$, $w_c$, and $w_h$. All these values are context-dependent. Further, $w_c$ and $w_h$ also depend on the choice of covariates and poverty line. However, to see the potential benefits of double sampling, it is useful to have a reasonable range of values that the designers of surveys may encounter. Therefore, we compiled the estimates of these parameter values from various sources to obtain a plausible range for each of these parameters.

Some empirical values of τ

Beegle et al. (2012) consider the cost implications for various types of questionnaire in Tanzania. For example, Beegle et al. (2012, Table 10) report that four households can be interviewed by an interviewer per day.
Assuming that each interviewer interviews for eight hours a day, it takes 8 · 60/4 = 120 minutes to complete a survey with a recall consumption module. The same study also reports that it takes on average 41 minutes to complete a short consumption module with the 17 most important items (Beegle et al., 2012, Figure 1). Therefore, the proportion of time that is spent collecting consumption data is 41/120 ≈ 0.34. If we assume that the data entry cost is roughly proportionate to the time needed for data collection and that there are no other cost items, we have τ ≈ 1 − 0.34 = 0.66 in this case.

This estimate may be close to the upper bound of τ. The average time to complete a recall consumption module is longer when a long questionnaire (with 58 items) is used and when a longer recollection period is adopted, even though the difference is modest. When the personal diary format is used, the cost of collecting consumption data can be much higher because it involves frequent visits to the household. As reported in Beegle et al. (2012, Table 10), the number of interviews that can be performed per day may be as low as 0.35. Assuming that the time required to complete the non-consumption component remains the same, the value of τ when the personal diary format is used is τ ≈ 0.66 · 0.35/4 = 0.06.

It should be noted that Beegle et al. (2012) only consider a relatively short questionnaire with a complete consumption module and not a large multi-topic survey. Therefore, when we consider a typical Living Standard Measurement Survey, the value of τ may be higher as the weights of non-consumption modules become more important. On the other hand, if we are only concerned with getting an accurate estimate of poverty, τ could be lower because we typically need only a small fraction of the non-consumption modules to predict consumption.

In Bangladesh, Ahmed et al.
(2014) adopt the assumption that five enumerators have to stay for two weeks to complete the survey in one primary sampling unit (PSU) to estimate the cost of conducting the Household Income and Expenditure Survey. They further assume that two persons engage in collecting consumption data and three persons collect non-consumption data for two weeks in one PSU. Under these assumptions, we have τ = 3/5 = 0.6. Given these examples, we use τ ∈ {0.06, 0.36, 0.66} as a plausible set of values to consider.

Some empirical values of c

Compared with τ, there are more studies that provide us with some empirical values of c. For Demographic and Health Surveys, Aliaga and Ren (2006, Table 3.1) present estimates of c based on eight past surveys. The value varies from 10 in Cambodia and Uganda to 52 in Togo. For consumption surveys, we are not aware of a study that provides cost comparisons from multiple surveys. Therefore, we derive the estimates of c from existing studies.

Let us start with Pettersson and Sisouphanthong (2005, p.275), who provide the cost ratio, or the ratio of the cost of adding a PSU to the cost of adding a household, for the third Lao Expenditure and Consumption Survey (LECS-3) conducted in 2002-2003. This ratio is nothing but c in our notation. They report c = 3.9 in urban areas and c = 6.1 in rural areas. The lower value of c in urban areas reflects the lower cost of travel between primary sampling units in urban areas.

A few cautions are in order here. First, travel costs and field allowances are included in the calculation of c but the permanent staff salaries are excluded. As Pettersson and Sisouphanthong (2005) claim, the cost ratio may be affected only slightly by the omission of salaries as the omission will have rather similar effects on both the denominator and numerator of the ratio. Second, as noted by Pettersson and Sisouphanthong (2005), the cost ratios in their study are rather low.
This reflects the fact that the survey required considerable time for interview and follow-up per household over the month when the interviewer-supported diary method was used. Therefore, when a simpler consumption module is used, c is likely to be higher.

Next, we also calculated c based on the survey costs reported in Humphreys (1979, p.396) for the Ada Baseline Survey (ABS) conducted in rural Ethiopia. The first-stage sample consists of 87 administrative localities and cost $5,547, whereas the 632 observations in the second stage cost $5,899.[16] Therefore, we have an estimate of c as follows: c = (5547/87)/(5899/632) ≈ 7.

[16] In addition, there were fixed costs of $3,589. Humphreys (1979) uses a slightly different cost function for the analysis, but that cost model reduces to ours when we ignore the terms involving square roots.

The numbers obtained above are calculated from highly aggregated budgetary figures. Therefore, we also consider the generic, all-inclusive budget for a one-year, 3,200-household living standards survey presented in Grosh and Muñoz (1996, Table 8.2). While the numbers are hypothetical, they are meant to be used to create a prototype budget and may well serve our purpose of creating a ballpark figure. We assume that each cluster has 16 households, as with a majority of the Living Standards Measurement Study (LSMS) surveys reviewed in Table 4.1 of Grosh and Muñoz (1996), which implies that there are 200 clusters in this hypothetical survey.

There remains a challenge in deriving an estimate of c because we need to assign each cost component to (i) variable costs proportionate to the number of clusters, (ii) variable costs proportionate to the number of households, and (iii) the fixed costs. We decided to assign all costs relating to “travel allowance” and vehicles, fuel, and car maintenance to (i) variable costs proportionate to the number of clusters, which amounts to $345,880 for 200 clusters.
For (ii) variable costs proportionate to the number of households, we include all the costs relating to “base salaries” and printing of questionnaires as well as “materials” other than vehicles, fuel, and car maintenance, which in total amount to $475,150 for 3,200 households. The remaining costs, including “consultancy and travel” and “other”, are taken as (iii) fixed costs. Based on these classifications, an estimate of c can be obtained as follows: c = (345880/200)/(475150/3200) ≈ 12.

We have also collected similar budgetary information for several household surveys through personal communications. We then computed c in a similar manner. Table 3 summarizes all the estimates of c that we have obtained, including those computed from the budgetary information obtained through personal communications. Based on this table, we choose to use the following values of the cost ratio: c ∈ {4, 16, 64}.

Note that the cost calculations are based on average cost. Because the actual cost function for a survey depends on the logistical arrangement and is in general not exactly equal to eq. (16), the marginal and average costs are likely to differ. For example, we would not need to incur additional training costs just to add one more household to the sample. However, if we scale up the survey significantly, it is likely that we need to increase most of the cost components, such as the costs of personnel, travel, and materials. Therefore, the calculations of c provided in Table 3 should be taken as approximations.
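The back-of-the-envelope calculations of τ and c reported in this appendix can be retraced with a few lines of code. The sketch below is only illustrative: the function names `tau_from_interview_time` and `cost_ratio` are hypothetical labels introduced here, and the inputs are the figures quoted above from Beegle et al. (2012), Humphreys (1979), and Grosh and Muñoz (1996).

```python
def tau_from_interview_time(interviews_per_day, consumption_minutes, hours_per_day=8):
    """Share of interview time not spent on the consumption module (tau).

    Hypothetical helper: assumes costs are proportional to interview time.
    """
    minutes_per_interview = hours_per_day * 60 / interviews_per_day
    return 1 - consumption_minutes / minutes_per_interview


def cost_ratio(cluster_cost, n_clusters, household_cost, n_households):
    """Average cost per cluster divided by average cost per household (c)."""
    return (cluster_cost / n_clusters) / (household_cost / n_households)


# Recall module in Tanzania: 4 interviews per day, 41 minutes on consumption.
tau_recall = tau_from_interview_time(interviews_per_day=4, consumption_minutes=41)

# Personal diary: 0.35 interviews per day, non-consumption time unchanged.
tau_diary = tau_recall * 0.35 / 4

# Ada Baseline Survey: $5,547 for 87 localities, $5,899 for 632 observations.
c_ada = cost_ratio(5547, 87, 5899, 632)

# Generic LSMS budget: $345,880 for 200 clusters, $475,150 for 3,200 households.
c_lsms = cost_ratio(345880, 200, 475150, 3200)

print(f"tau (recall) = {tau_recall:.2f}, tau (diary) = {tau_diary:.2f}")
print(f"c (Ada) = {c_ada:.1f}, c (generic LSMS) = {c_lsms:.1f}")
```

Rounded to the nearest integer, the two cost ratios are 7 and 12, which matches the values used in the text. Because these are ratios of average costs, they inherit the caveat above about the difference between marginal and average costs.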
Table 3: Summary of the estimates of c

| Country/Area | c | Survey | Source |
|---|---|---|---|
| Cambodia | 10 | Demographic and Health Survey | Aliaga and Ren (2006) |
| Uganda | 10 | Demographic and Health Survey | Aliaga and Ren (2006) |
| Jordan | 12 | Demographic and Health Survey | Aliaga and Ren (2006) |
| Ethiopia | 12 | Demographic and Health Survey | Aliaga and Ren (2006) |
| Haiti | 15 | Demographic and Health Survey | Aliaga and Ren (2006) |
| Turkey | 27 | Demographic and Health Survey | Aliaga and Ren (2006) |
| Burkina Faso | 48 | Demographic and Health Survey | Aliaga and Ren (2006) |
| Togo | 52 | Demographic and Health Survey | Aliaga and Ren (2006) |
| Laos-Urban | 3.9 | Third Lao Expenditure and Consumption Survey | Pettersson and Sisouphanthong (2005) |
| Laos-Rural | 6.1 | Third Lao Expenditure and Consumption Survey | Pettersson and Sisouphanthong (2005) |
| Ethiopia | 7 | Ada Baseline Survey | Humphreys (1979) |
| −−− | 12 | Generic LSMS | Grosh and Muñoz (1996) |
| Malawi | 39 | Malawi Third Integrated Household Survey 2010/11 | Personal communication |
| Nigeria | 24 | Nigeria General Household Survey 2012/13 | Personal communication |
| Niger | 31 | 2011 National Survey on Household Living Conditions and Agriculture | Personal communication |
| Estonia | 10 | Household Budget Survey 2012 | Personal communication |

Some empirical values of $w_c$ and $w_h$

To consider the values of $w_c$ and $w_h$ in a realistic setup, we take several prediction models and poverty lines used in small-area estimation or survey-to-survey imputation in several countries. For Cambodia, we adopt the consumption model and poverty line used in a small-area estimation project detailed in Ministry of Planning and United Nations World Food Programme (2002) and Fujii (2006). This project uses the Cambodia Socioeconomic Survey 1997. In each of the three strata (Phnom Penh, Other Urban and Rural), a separate consumption model and a separate poverty line are used.
Our point estimates of $\beta$ (unreported) are slightly different from those used in this project because the heteroskedasticity of $e_{ch}$ is allowed for in the former and the estimation method is slightly different as a result. However, the difference is not important for our purpose because we are only interested in the plausible range of $w_c$ and $w_h$.

For Malawi, Niger, and Tanzania, we used a prediction model and national poverty line from an independent survey-to-survey imputation project.[17] The datasets used are the Malawi Third Integrated Household Survey 2010-2011 for Malawi, l’Enquête Nationale sur les Conditions de Vie des Ménages et l’Agriculture 2011-12 for Niger, and the Tanzania National Panel Survey 2010-11 for Tanzania. All these surveys are part of the Living Standard Measurement Surveys by the World Bank.

[17] Specific details, including the model specifications used, are available upon request.

Table 4 provides a summary of the estimates of $w_c$, $w_h$, $\tilde{c}$, $\tilde{\psi}$, and $\alpha$ for various surveys. The first two columns report the basic information about the survey, namely the number of observations ($n$) and the number of clusters ($J$). The next two columns show that the values of $w_c$ and $w_h$ vary substantially across countries. Based on this, we use $w_c \in \{0.6, 1.2, 1.8\}$ and $w_h \in \{0.4, 1.2, 3.6\}$. The next two columns report the implied travel cost $\tilde{c}$ under the assumption of optimal sampling for the sample direct estimator and the ratio $\tilde{\psi}$ of data collection costs between the sample direct and single-sample prediction estimators with the implied travel cost. As shown in the last column, the ratio $\alpha$ of variances between the cluster- and household-specific error terms satisfies eq. (9), which is the condition for the approximation we use.

Table 4: Summary of the estimates of $w_c$, $w_h$, $\tilde{c}$, $\tilde{\psi}$, and $\alpha$.

| Country | $n$ | $J$ | $w_c$ | $w_h$ | $\tilde{c}$† | $\tilde{\psi}$ | $\alpha$ |
|---|---|---|---|---|---|---|---|
| Cambodia (Phnom Penh) | 1200 | 120 | 1.141 | 3.600 | 15 | 0.945 | 0.073 |
| Cambodia (Other Urban) | 1000 | 100 | 0.961 | 1.357 | 26 | 0.977 | 0.013 |
| Cambodia (Rural) | 3810 | 254 | 1.809 | 1.999 | 81 | 0.965 | 0.199 |
| Malawi | 12239 | 768 | 0.612 | 0.667 | 67 | 0.948 | 0.268 |
| Niger | 3961 | 270 | 1.415 | 0.761 | 117 | 0.983 | 0.201 |
| Tanzania | 3272 | 349 | 0.708 | 0.990 | 22 | 0.961 | 0.123 |

† In Malawi, Niger, and Tanzania, there are some variations in the cluster size. Therefore, we use the average cluster size ($\bar{K} \equiv n/J$) in eq. (26) to derive $\tilde{c}$. In Cambodia, all the clusters have exactly the same number of households.
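The average cluster size mentioned in the note to Table 4 follows directly from the first two columns of the table. As a minimal sketch, the snippet below recomputes n/J for each survey; the dictionary name `table4` and its layout are assumptions made for illustration, while the numbers are copied from the table.

```python
# (number of observations n, number of clusters) as reported in Table 4.
# The dictionary itself is a hypothetical container used for illustration.
table4 = {
    "Cambodia (Phnom Penh)": (1200, 120),
    "Cambodia (Other Urban)": (1000, 100),
    "Cambodia (Rural)": (3810, 254),
    "Malawi": (12239, 768),
    "Niger": (3961, 270),
    "Tanzania": (3272, 349),
}

# Average cluster size: observations divided by clusters.
avg_cluster_size = {
    survey: n / clusters for survey, (n, clusters) in table4.items()
}

for survey, size in avg_cluster_size.items():
    print(f"{survey}: {size:.2f} households per cluster")
```

For the three Cambodian strata the ratio is an integer (10, 10, and 15), consistent with the note that all clusters there have the same number of households, whereas the ratios for Malawi, Niger, and Tanzania are not integers.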