WPS6587
Policy Research Working Paper 6587

Evaluation of Development Programs: Randomized Controlled Trials or Regressions?

Chris Elbers
Jan Willem Gunning

The World Bank
Development Economics Vice Presidency
Partnerships, Capacity Building Unit
September 2013

Abstract

Can project evaluation methods be used to evaluate programs: complex interventions involving multiple activities? A program evaluation cannot be based simply on separate evaluations of its components if interactions between the activities are important. In this paper a measure is proposed, the total program effect (TPE), which is an extension of the average treatment effect on the treated (ATET). It explicitly takes into account that in the real world (with heterogeneous treatment effects) individual treatment effects and program assignment are often correlated. The TPE can also deal with the common situation in which such a correlation is the result of decisions on (intended) program participation not being taken centrally. In this context RCTs are less suitable even for the simplest interventions. The TPE can be estimated by applying regression techniques to observational data from a representative sample from the targeted population. The approach is illustrated with an evaluation of a health insurance program in Vietnam.

This paper is a product of the Partnerships, Capacity Building Unit, Development Economics Vice Presidency. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://econ.worldbank.org. The authors may be contacted at c.t.m.elbers@vu.nl and j.w.gunning@vu.nl. The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues.
An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Evaluation of Development Programs: Randomized Controlled Trials or Regressions?

Chris Elbers and Jan Willem Gunning*

JEL Classification Codes: C21, C33, O22
Keywords: program evaluation; randomized controlled trials; policy evaluation; treatment heterogeneity; budget support; sector-wide programs; aid effectiveness
Sector Board: Economic Policy (EPOL)

* The authors are professors at the VU University Amsterdam and fellows of the Tinbergen Institute. Their addresses are c.t.m.elbers@vu.nl and j.w.gunning@vu.nl (corresponding author). They are grateful to Remco Oostendorp, Menno Pradhan, Martin Ravallion, Elisabeth Sadoulet, Finn Tarp, the editors and two anonymous referees of the World Bank Economic Review and to seminar participants in Amsterdam, Namur, Oxford and Paris for very valuable comments on previous versions.

Experimental methods for impact evaluation presuppose that the intervention is well-defined: the “project” is limited in space and scope (e.g. Duflo et al., 2008). However, governments, NGOs and donor agencies are often interested in evaluating the effect of a program consisting of various interventions, e.g. sector-wide health or education programs (De Kemp et al., 2011). Program evaluation faces two complications. First, a sharp distinction between treatment and control groups is usually impossible.
For example, a program in the education sector may involve activities such as school building, teacher training and supply of textbooks. Typically all communities are affected in some way by the program, but they may differ dramatically in which interventions they are exposed to and the extent of that exposure. Secondly, in a program the interventions are typically implemented at various administrative levels, so that the policy maker has only imperfect control over actual treatment.

The impact of such a program cannot simply be calculated on the basis of the results of randomized controlled trials (RCTs). This would run into well-known problems of external validity (Bracht and Glass, 1968, Rodrik, 2008, Ravallion, 2009, Banerjee and Duflo, 2009, Deaton, 2010, Imbens, 2010) even if the program involved only a single intervention. In addition, if the program involves multiple interventions and interactions are important, then it is not clear how RCT evaluations of individual components of the program should be combined into an overall assessment of the program.

However, regression techniques can be used for program evaluation. This involves drawing a representative sample of beneficiaries (e.g. households, schools, communities) and collecting data on the combination of interventions experienced by each beneficiary, together with other possible determinants of the outcome variables of interest. Regression techniques can then be used to estimate the impact of the various interventions. 1 In this paper this approach is generalized by allowing for treatment heterogeneity, and a way of estimating aggregate program impact is proposed. 2

Obviously, the intervention variables are likely to be endogenous in a regression analysis. For example, an unobserved variable such as the political preferences of the community may affect both the impact variable of interest and the intervention.
Also, the impact of the intervention will differ between beneficiaries, and the allocation of interventions across beneficiaries may be based on such treatment heterogeneity, either through self-selection or through the allocation decisions of program officers. Heckman (1997) and Heckman et al. (2008) call this “selection on the gain”. The first complication is usually dealt with by using panel data or by randomized assignment of treatment. The second complication is much more serious. It may be particularly hard for RCTs when program assignment in practice cannot be mimicked by assignment to the treatment arm in an RCT since this would not capture the way program officers take their decisions. However, it will be shown that regression techniques can be adapted so as to produce an appropriate estimate of the program effect.

The paper is organized as follows. In the first section the total program effect (TPE) is introduced. This measure extends the average treatment effect on the treated (ATET). The TPE is suitable for complex interventions and can deal with selection on the gain (treatment heterogeneity). Then two complications are considered: correlation between program variables and the controls in section 2 and spillover effects in section 3. Section 4 investigates whether estimating the TPE using RCTs is an alternative. The approach is illustrated in section 5 by estimating the TPE for a health insurance intervention in Vietnam. Section 6 concludes.

I. The Total Program Effect (TPE)

Consider the following model:

$$y_{it} = \alpha X_{it} + \beta_i P_{it} + \gamma_t + \eta_i + \varepsilon_{it} \quad (1)$$

where $y$ measures an outcome of interest, in this paper taken to be a scalar; $t = 0, 1$ is the time of measurement; and $i = 1, \ldots, n$ denotes the unit of observation, e.g. households or locations. $P$ denotes a vector of the interventions to be evaluated and $X$ a vector of observed controls. 2 The $P$-variables can either be binary variables or multi-valued (discrete or continuous) variables.
$\alpha$ and $\beta_i$ are vectors of parameters, $\gamma_t$ denotes a time effect, $\eta_i$ represents time-invariant unobserved characteristics and $\varepsilon_{it}$ is the error term, assumed to be independent over time. It is also assumed that the interventions and control variables are uncorrelated with the error process: $X_{i1}, X_{i0}, P_{i1}, P_{i0} \perp \varepsilon_{i1}, \varepsilon_{i0}$. At this stage $P$ and $X$ are assumed to be independent: $X_{i1}, X_{i0} \perp P_{i1}, P_{i0}$. This will be relaxed in section 2. Note that equation (1) excludes spillover effects of the type where $y_{it}$ depends on $P_{jt}$ $(i \neq j)$ and $j$ is not necessarily included in the sample. This point will be discussed in section 3. In many applications (1) will represent a reduced form or “black box” regression, but it can also represent a structural model.

The evaluator is interested in the expectation (in the population) of the effect of interventions on the outcome variable, the total program effect (TPE): 3

$$TPE = E\,\beta_i (P_{i1} - P_{i0}).$$

Note that the impact parameters $\beta_i$ need not be the same for all $i$: heterogeneity of program impact is allowed. As an example consider a very simple special case:

$$y_{it} = \beta_i P_{it} + \gamma_t + \eta_i + \varepsilon_{it} \qquad t = 0, 1 \quad (2)$$

where $P_{it}$ now is a binary variable rather than a vector, $P_{i0} = 0$ for all $i$ and $P_i = P_{i1} - P_{i0}$. Taking first differences gives:

$$\Delta y_i = \beta_i P_i + \gamma + \Delta\varepsilon_i$$

where $\gamma = \gamma_1 - \gamma_0$. This is analogous to the equation for a standard project evaluation, but written in differences. 4 The TPE for this case equals $E\,\beta_i P_i$, which is related to the familiar average treatment effect on the treated (ATET):

$$ATET = \frac{TPE}{E P_i}.$$

In another special case of equation (1) the TPE can be identified as follows. Assume that data are available from a random sample and that for a subsample (the “control group”) there is no change in the interventions: $P_{i1} = P_{i0}$. (At this stage it is not assumed that the assignment to intended “treatment” and “control” groups is random.) Taking first differences in (1) for this group gives:

$$\Delta y_i = \alpha\Delta X_i + \gamma + \Delta\varepsilon_i \quad \text{if } \Delta P_i = 0.$$
This allows estimation of $\alpha$ and hence $\hat\alpha\,\overline{\Delta X_i}$, so that the TPE can be estimated as $\widehat{TPE} = \overline{\Delta y_i} - \hat\alpha\,\overline{\Delta X_i}$. However, in a program consisting of multiple interventions, the context of this paper, there will usually not be a sufficiently large control group to make this identification strategy realistic. Indeed, typically the control group will be empty: all $i$ will have experienced a change in at least some components of the vector $\Delta P_i$. For this more general case

$$\Delta y_i = \alpha\Delta X_i + \beta_i \Delta P_i + \gamma + \Delta\varepsilon_i \quad (3)$$

Allowing for “selection on the gain”, correlation between impact parameters $\beta_i$ and the program variables $P_i$, and also for correlation between $\beta_i$ and $X_i$, equation (3) can be rewritten as

$$\Delta y_i = \alpha\Delta X_i + E(\beta_i \mid \Delta X_i, \Delta P_i)\Delta P_i + \gamma + \omega_i, \quad (4)$$

where $\omega_i = \Delta\varepsilon_i + (\beta_i - E(\beta_i \mid \Delta X_i, \Delta P_i))\Delta P_i$ and this is uncorrelated with $\Delta X_i$ and $\Delta P_i$. The term $E(\beta_i \mid \Delta X_i, \Delta P_i)$ can be approximated linearly: 5

$$E(\beta_i \mid \Delta X_i, \Delta P_i) \approx \delta_0 + \delta_1 \Delta X_i + \delta_2 \Delta P_i.$$

Substitution in (4) and collecting terms gives

$$\Delta y_i = \gamma + \theta_1\Delta X_i + \theta_2\Delta P_i + \theta_3\,\Delta X_i \otimes \Delta P_i + \theta_4\,\Delta P_i \otimes \Delta P_i + \omega_i \quad (5)$$

where $\theta_2\Delta P_i + \theta_3\,\Delta X_i \otimes \Delta P_i + \theta_4\,\Delta P_i \otimes \Delta P_i$ is the approximation of $T_i = E(\beta_i \Delta P_i \mid \Delta X_i, \Delta P_i)$. Equation (5) can be estimated using the sample data. The estimated coefficients can then be used to estimate $T_i$ as

$$\hat T_i = \hat\theta_2 \Delta P_i + \hat\theta_3\,\Delta X_i \otimes \Delta P_i + \hat\theta_4\,\Delta P_i \otimes \Delta P_i.$$

The TPE can now be estimated as the average of $\hat T_i$ in the sample:

$$\widehat{TPE} = \frac{1}{n}\sum_i \hat T_i = \hat\theta_2\,\overline{\Delta P_i} + \hat\theta_3\,\overline{\Delta X_i \otimes \Delta P_i} + \hat\theta_4\,\overline{\Delta P_i \otimes \Delta P_i} \quad (6)$$

where bars denote sample averages. 6

In practice this means that one regresses $\Delta y_i$ on $\Delta X_i$, $\Delta P_i$ and their interactions with $\Delta P_i$ and collects all terms involving $\Delta P_i$ to calculate the total program effect. Since the estimated TPE is linear in the $\hat\theta$ parameters its standard error can be obtained from the covariance matrix of the OLS coefficients.
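The recipe just described, regressing $\Delta y$ on $\Delta X$, $\Delta P$ and the interactions and then collecting all $\Delta P$ terms at their sample means, can be sketched in a few lines. The sketch below uses simulated data with hypothetical names and scalar $X$ and $P$ (it is not part of the original analysis); the impact parameter is deliberately made to depend on both $\Delta P$ and $\Delta X$, i.e. selection on the gain:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated first-differenced data (hypothetical names, scalar X and P).
dX = rng.normal(size=n)
dP = rng.uniform(0.0, 1.0, size=n)
# Heterogeneous impact correlated with both dP and dX ("selection on the gain").
beta = 1.0 + 0.5 * dP + 0.3 * dX
dy = 0.2 + 0.8 * dX + beta * dP + rng.normal(scale=0.1, size=n)

# Regression (5): constant, dX, dP, the interaction dX*dP and dP squared.
Z = np.column_stack([np.ones(n), dX, dP, dX * dP, dP ** 2])
theta, *_ = np.linalg.lstsq(Z, dy, rcond=None)

# Equation (6): collect all terms involving dP and average over the sample.
T_hat = theta[2] * dP + theta[3] * dX * dP + theta[4] * dP ** 2
TPE_hat = T_hat.mean()
print(TPE_hat, (beta * dP).mean())  # estimated TPE vs. the simulated E[beta_i dP_i]
```

In this simulation the regression is correctly specified, so the collected-terms average reproduces the simulated total program effect up to sampling noise; with vector-valued $\Delta P$ the design matrix would simply gain the additional interaction columns of equation (5).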
It is instructive to consider the special case of equation (5) where $D_i = \Delta P_i$ is a binary variable taking the value 1 for the treatment group and 0 for the control group, i.e. the case of a difference-in-difference analysis. Equation (5) now reduces to

$$\Delta y_i = \gamma + \theta_1\Delta X_i + \theta_2 D_i + \theta_3 D_i \Delta X_i + \omega_i$$

since in this case $D_i^2 = D_i$. Compared to a standard diff-in-diff regression this equation contains the interaction term $D_i \Delta X_i$. The total program effect will in this case be estimated as

$$\widehat{TPE} = \hat\theta_2\,\overline{D_i} + \hat\theta_3\,\overline{D_i \Delta X_i}. \quad (7)$$

This shows that when the sample is representative, sample means can be used to construct the total program effect. The interaction term in (7) avoids the bias resulting from correlations between treatment effects and either program participation or controls. Many diff-in-diff studies do not include the interaction terms (e.g., Khandker et al., 2009 or Almeida and Galasso, 2010). Studies that do often report estimates of impact for different values of the controls $X$, which makes it difficult to assess the aggregate impact of a program.

Equation (1) allows for two types of selection effects: $P_{it}$ may be correlated with $\beta_i$ or with the unobserved characteristics $\eta_i$. A correlation of $P_{it}$ and $\eta_i$ is dealt with by differencing, as in (3). 7 However, the TPE measures the effect of the program inclusive of selectivity in the assignment of program interventions resulting in a correlation of $\beta_i$ and $\Delta P_i$. This is appropriate since the way the program was assigned (in an ex post evaluation) or will be assigned (in an ex ante evaluation) is one of its characteristics. If the program was successful in part because program officers made sure the program interventions were assigned to households or locations where they expected a high impact, then obviously the evaluation should reflect this.
In fact the evaluation would be misleading if it tried to “correct” for such selection effects by presenting (if this were feasible) an estimate ($E\,\beta_i$) of the program's impact if it had been assigned randomly.

Recall that in the special, binary case of a ‘project’ evaluation $TPE = E\,\beta_i \Delta P_i = ATET \times E\,\Delta P_i$. If administrative data can be used to estimate $E\,\Delta P_i$ the question arises whether the ATET is identified in an RCT. Obviously this is the case if $\beta_i = \beta$ for all $i$. More generally, if $\Delta P_i$ and $\beta_i$ are independent the TPE can be estimated on the basis of an RCT: the trial would give an estimate of $E\,\beta_i$, which in this case is also the ATET. A special case of independence is that of universal treatment ($P_i = 1$ for all $i$). 8 In the most general case, when $\Delta P_i$ and $\beta_i$ are not independent, the ATET as established by an RCT may differ from the ATET in the population and estimating the TPE on the basis of RCTs can become problematic. This issue will be considered in section 4.

II. Correlation between P and X

In the previous section $P$ and $X$ were assumed to be independent. $(P, X)$ correlations are often important in evaluations. For example, changes in teacher training may induce changes in parental input. 9, 10 Not all such inputs will be observed (e.g. additional parental help with homework will probably not be recorded); $P_{it}$ will then be correlated with $\beta_i$ and this was already considered in the previous section. Conversely, if the parental input is observed then $P_{it}$ will be correlated with $X_{it}$. In that case the TPE identifies the direct effect of $P$, but not its total effect (including the indirect effect through induced changes in $X$). If the induced effect is to be included then the affected components of $\Delta X_i$ should be omitted from the regression (5). If causality is in the reverse direction, from $\Delta X_i$ to $\Delta P_i$, then there is no need to amend the section 1 estimate of the TPE since there is no induced change in $\Delta X_i$.
(The asymmetry arises because in either case the interest is in the impact of changes in $\Delta P_i$, rather than in the impact of changes in $\Delta X_i$.) In the general case where the direction of causality is not known it will usually not be possible to estimate the indirect effect of the program. Occasionally, however, appropriate instruments can be found so that the impact of $\Delta P_i$ on $\Delta X_i$ can be identified.

III. Spillover Effects

Recall that in section 1 spillover effects were excluded: in equation (1) $y_i$ of case $i$ does not depend on $P_j$ of case $j$. In evaluations there are two important situations where this assumption is untenable. First, Chen et al. (2009) and Deaton (2010) discuss the possibility that policy in control villages is partly determined by policies in treatment villages so that the SUTVA (stable unit treatment value assumption) is violated. Indeed, if policies thus affected are not represented in the policy vector $P_i$ this creates a classical case of omitted variable bias. In Chen et al. the problem arises because the data record participation in a particular program as a binary $P_i$ variable, while other programs which may affect the outcome are initially ignored. In the approach advocated in the present paper all potentially relevant programs would in principle be included in $P_i$ so that the problem of SUTVA violation is avoided. 11

Secondly, policies in village $j$ may affect outcomes in village $i$. For example, a program aimed at an infectious disease in village $j$ may affect health outcomes in the “untreated” village $i$. 12 If the external effects of policy are general equilibrium effects such as regional wage increases, it will be hard to identify the full impact of a policy. But often more structure can be imposed, e.g. by including a proxy for relevant policies in neighboring villages in the outcome regression, so that equation (3) is extended to

$$\Delta y_i = \beta_i \Delta P_i + \alpha\Delta X_i + \gamma + \delta\Delta K_i + \Delta\varepsilon_i,$$

where $\Delta K_i$ is the proxy for policy changes in the neighborhood.
If there is sufficient variation in $K_i$ then $\delta$ is identified in this regression. The TPE would then be $E\,\beta_i \Delta P_i + \delta\,E\,\Delta K_i$.

IV. Regression Methods and RCTs Compared

In section 1 it was shown how the TPE can be estimated using regression methods. A natural question is whether the TPE can also be estimated using RCTs. Using RCTs may be difficult, e.g. because in programs the distinction between treatment and control groups may break down. However, there may be problems even in the case of binary treatments, namely under treatment heterogeneity when the probability of treatment is correlated with the individual impact parameters $\beta_i$ and unknown to the evaluator. If this correlation arises through self-selection then the usual response is to consider the average treatment effect on the treated rather than the average treatment effect in the population. If, however, the correlation arises at a higher level, e.g. because the policy maker targets on observables, then an RCT would have to mimic this assignment, possibly by stratifying the sample on the basis of the targeting variables. But in many government and NGO programs the “policy maker” does not directly control the $P$ variables: assignment is decided by lower level staff (“program officers”) on the basis of private information, variables that cannot be observed by the policy maker or the evaluator. In this case an RCT can still identify the TPE, but at the cost of having to randomize at a higher level than the treatment under consideration: randomization would apply to program officers rather than beneficiaries. This implies that the power of the statistical analysis may be reduced. It also involves losing the direct link with the intervention.

This may be illustrated with an example. Consider the following model

$$y_i = \beta_i P_i + \gamma + \varepsilon_i$$

where $\beta_i$ and $\varepsilon_i$ are independent, $P_i$ is binary and $E\,\varepsilon_i = 0$.
For simplicity $\beta_i$ will be considered as the intention-to-treat impact, so that a subject $i$'s refusal to undergo offered treatment $P_i$ is reflected in $\beta_i$, rather than in $P_i$. Program implementation involves program officers who have imperfect knowledge of $\beta_i$: they perceive $\omega_i = \beta_i + \eta_i$ and will assign treatment if and only if $\omega_i > 0$. Assume that $\eta_i$ has mean zero and is independent of $\beta_i$ and $\varepsilon_i$. Crucially, this knowledge of program officers is unknown to the evaluator. Denote the CDF of $\eta_i$ by $F$. With this assignment rule $P_i$ is exogenous (i.e. independent of $\varepsilon_i$).

An RCT evaluation might involve drawing a random sample from the population and assigning treatment randomly within this sample. The researcher would then estimate the program's intention to treat effect (ITE) as $E\,\beta_i$. The TPE would be estimated as $E\,\beta_i\,E P_i$. This would be incorrect since, under the assumptions made above,

$$TPE = E\,\beta_i P_i = E(\beta_i \mid \beta_i + \eta_i > 0)\,P(\beta_i + \eta_i > 0) = E[(1 - F(-\beta_i))\beta_i] \neq E P_i\,E\,\beta_i.$$

(Note that $E(1 - F(-\beta_i)) = E P_i$. As before, $ATET = TPE / E P_i$.) The problem arises because in this case the RCT design does not mimic the actual assignment process. To obtain an unbiased estimate of the TPE randomization would have to take place at a higher level, that of the program officers. 13 The control group would then consist of program officers who never “treat” and the treatment group of program officers who sometimes (but not always) treat.

The proposed regression method gives an unbiased estimator of the TPE using observational data for $(y_i, P_i)$ from a random sample of the population. The difference is that while the RCT approach compares average outcomes at the level of program officers, the regression approach does so at the level of beneficiaries. The RCT approach therefore has lower statistical power. 14

Moving beyond the example there is a more fundamental objection to the RCT approach if outcomes depend not only on $P$ but also on $X$, as in (1).
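Before turning to that objection, the gap in the example just given can be illustrated with a small Monte Carlo sketch. The normal distributions and parameter values below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

beta = rng.normal(0.5, 1.0, size=n)   # individual treatment effects beta_i
eta = rng.normal(0.0, 1.0, size=n)    # noise in the officers' perception of beta_i
P = (beta + eta > 0).astype(float)    # treat iff perceived effect omega_i > 0

tpe = (beta * P).mean()          # E[beta_i P_i]: what the program actually delivers
naive = beta.mean() * P.mean()   # E[beta_i] * E[P_i]: the RCT-based calculation

print(tpe, naive)  # tpe is clearly larger: officers target high-beta subjects
```

Because officers assign treatment where they perceive high gains, $E\,\beta_i P_i$ exceeds $E\,\beta_i\,E P_i$ by a wide margin here, which is exactly the inequality derived above.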
If the RCT involved randomization over actual program officers then it is unlikely that randomization can also be achieved in terms of all the confounding $X$ variables, since program officers will not have been posted randomly across space. This introduces a correlation between $X$ and characteristics of the program officers and hence a correlation between $P$ and $X$. The two groups of program officers (“treatment” and “control”) will therefore differ systematically, so that internal validity is lost. 15 The proposed approach, by contrast, collects data at the level of beneficiaries and can therefore control for differences in $X$.

In summary, estimating the TPE on the basis of group averages from RCTs becomes problematic when $\beta$ and $P$ are correlated as a result of targeting on the basis of unobservables. If one randomizes at the level of beneficiaries the TPE estimator will be biased because the correlation is not taken into account. If one randomizes at the level of program officers the estimator is inefficient and, if confounders are important, may become inconsistent.

V. An Empirical Example: Estimating the Total Program Effect for a Health Insurance Program in Vietnam

To illustrate how the total program effect can deviate from a naïve approach to calculating the effect of a program, a study of the impact of a health insurance program in Vietnam (Wagstaff and Pradhan, 2005) is reconsidered. Health insurance was introduced between the 1992-93 and the 1997-98 rounds of the Vietnam Household Living Standards Survey (General Statistics Office of Vietnam, 1993 and 1998). To account for possible treatment heterogeneity Wagstaff and Pradhan match households on propensity scores and then compare changes in health outcomes (as well as some non-health outcomes) between insured households or individuals and (matching) uninsured households or individuals.
They find modest favorable effects on children's nutritional status, a mild effect on health expenditure and a sizeable effect on non-health spending.

A propensity score based approach is not suitable for calculation of a total program effect since the common support requirement in a PSM approach will exclude part of the population in a systematic way. Therefore the Vietnam data are used to estimate the effect of the program using a standard diff-in-diff approach, i.e. without allowing for heterogeneity (labeled ‘naïve’). The results are compared with an estimate of the TPE. In this case the ‘program’ is a simple intervention. 16 This makes a comparison with a standard approach clearer.

The data are summarized in Table 1. A difficulty is that some of the outcome variables are individual anthropometric measurements while only households can be matched between survey rounds. Therefore the individual measurements have been averaged per household - a crude procedure only suitable for the current purpose of illustrating the TPE. Lacking information on 1992-93, the sampling weights from 1997-98 are used; clustering is also based on the 1997-98 survey round. The outcome variables considered are changes in arm circumference, height, body weight, health expenditure and total expenditure. The explanatory variables are insurance status, the other variables shown in Table 1 as controls (school attended, currently attending school, gender, age, a farm dummy, household size), and the interactions of these controls with the intervention variable. When total expenditure is not a dependent variable it is also used as a control variable.

Table 2 summarizes the results. First a naïve regression is run (without interaction terms) and the implied program effect is calculated as the regression coefficient of insurance times mean insurance. This naïve program effect is then compared with a TPE calculated as in equation (6).
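The mechanics of such a naive-versus-TPE comparison can be mimicked on simulated data (hypothetical names and parameters, not the Vietnam survey data). When take-up is correlated with a control that also drives the treatment effect, the naive coefficient-times-mean calculation and the TPE of equation (7) diverge:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

dX = rng.normal(size=n)
u = rng.normal(size=n)
D = (dX + u > 0).astype(float)    # take-up more likely where the control dX is high
beta = 1.0 + dX                   # treatment effect also correlated with dX
dy = 0.5 * dX + beta * D + rng.normal(scale=0.5, size=n)

# 'Naive' diff-in-diff: no interaction; effect = insurance coefficient * mean take-up.
Zn = np.column_stack([np.ones(n), dX, D])
bn, *_ = np.linalg.lstsq(Zn, dy, rcond=None)
naive = bn[2] * D.mean()

# TPE regression with the D*dX interaction (D is binary, so D**2 = D is absorbed).
Zt = np.column_stack([np.ones(n), dX, D, dX * D])
bt, *_ = np.linalg.lstsq(Zt, dy, rcond=None)
tpe = (bt[2] * D + bt[3] * dX * D).mean()

print(naive, tpe)  # the TPE exceeds the naive estimate in this simulation
```

The interaction-based estimate recovers the simulated $E\,\beta_i D_i$, while the naive calculation misses the extra gains accruing to high-$\Delta X$ participants; this is the kind of divergence the Table 2 comparison is designed to expose.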
The results show striking differences between the two methods. In the case of arm circumference the standard method would have led to the conclusion that insurance had no (significant) effect. Once treatment heterogeneity is allowed for, the effect is in fact highly significant, albeit very small. For height neither method finds a significant effect. For body weight both methods show a significant increase, but the effect is more than twice as large when heterogeneity is allowed for. Insurance appears to have no significant effect on health expenditure irrespective of the method used. Both methods do find a substantial (and significant) effect of insurance on total consumption. Again, the effect is stronger once one takes heterogeneity into account.

Obviously, there is no reason why these results should generalize. However, they do suggest that treatment heterogeneity can have a substantial effect on the estimates of a program's impact. A simple way to investigate this possibility is to test for the joint significance of the coefficients on the variables which would not normally be included in the regression: the interactions of treatment variables with themselves and with the controls. When this test indicates that heterogeneity may be an issue it is advisable to calculate the TPE.

VI. Conclusion

Policy makers in developing countries, NGOs and donor agencies are under increasing pressure to demonstrate the effectiveness of their program activities. At the same time there is a growing interest in using randomized controlled trials (RCTs) for impact evaluation of projects. This raises the question of the extent to which RCTs can be used to evaluate programs, for instance by aggregating the impact of the components of the program. This question is particularly relevant for the evaluation of budget support or of NGOs, which typically involve a wide variety of activities. The strength of RCTs is in establishing proof of principle.
Going further and using RCTs to estimate the impact of programs is possible in special cases but becomes problematic if the probability of assignment is correlated with the effectiveness of the intervention. For example, teachers may give more attention to children who they think can benefit more from it. An RCT which randomizes at the level of beneficiaries (children) would produce a biased estimate of the program effect by ignoring this correlation between assignment and treatment effects. Alternatively, randomization at the appropriate level (teachers) would require a larger sample for the same precision. If confounders are important and correlated with characteristics of the program officers, the RCT-based estimate of the program's impact would even be inconsistent.

The approach proposed in this paper requires observational panel data for a representative sample of beneficiaries rather than experimental data for randomly selected treatment and control groups. If treatment is exogenous this will correctly reflect the assignment process even under treatment heterogeneity. Instead of estimating average impact coefficients for each of the various interventions of the program, the expected value (across beneficiaries) of the total impact of the combined interventions is estimated. This gives the total program effect (TPE). The paper has shown how and under what conditions regression techniques can be used to estimate the TPE in the presence of selection effects. As an example, TPE estimates were presented for a simple intervention: a health insurance program in Vietnam. The example shows that allowing for heterogeneity can lead to very different estimates of a program's effect. The proposed method offers a simple way of dealing with such heterogeneity.

The approach has three advantages. First, by using observational data for a random sample from the population of intended beneficiaries external validity is ensured.
While the disadvantages of observational data are well known, this is an important advantage. Secondly, by focusing on the combined effect of program components they are automatically correctly weighted. Finally, it avoids the problems which RCTs encounter when assignment is imperfectly controlled and correlated with unobservables, as is plausible in development programs.

Notes

1 This approach is discussed in White (2006) and Elbers et al. (2009).
2 Here $P$ reflects “actual” treatment. In principle it could reflect “intended” treatment if intended treatment can be observed, e.g. because intended beneficiaries were offered vouchers.
3 Strictly speaking this is the total effect of changes in the program. The symbol $E$ is used for population averages and a bar over a variable for sample averages. Note that the total program effect does not include general equilibrium effects of the program.
4 This assumes that the autonomous trend $\gamma = \gamma_1 - \gamma_0$ is the same for all subjects (or, alternatively, that the difference $\Delta\gamma_{it}$ is exogenous and can be treated as part of the residual). In the terminology of double differencing this is the assumption of parallel trends. If this assumption is questionable then data for more periods are needed to estimate how trends depend on $P$. This paper abstracts from this complication and limits the analysis to two periods. The extension to more periods is non-trivial but conceptually straightforward.
5 Higher order approximations would not change the argument but it should be noted that the number of regressors expands very rapidly. De Janvry et al. (2012) account for treatment heterogeneity in a similar way in the context of a schooling program.
6 Obviously, to identify $\theta_4$ a symmetry restriction on parameters like $\theta_{4,k\ell} = \theta_{4,\ell k}$ is required.
7 Differencing is sufficient because of the assumption of parallel trends (cf. footnote 5).
8 Imbens (2010) describes a reduction in class size in all California schools.
This is an example of universal treatment.
9 Deaton (2010) gives the example where random assignments made by the central government (e.g. the Ministry of Education) are partly offset by induced changes in allocations by local or provincial governments. Ravallion (2012) gives a similar example and Chen et al. (2009) quantify such a spillover effect in China. Similarly, the political economy may be such that the central government is unable to prevent allocations being diverted to favored ethnic or political groups. In either case $P_i$ might be correlated with $\beta_i$.
10 This is similar to the case considered by Das et al. (2004, 2007) where teacher absenteeism as a result of HIV/AIDS induces greater parental input.
11 Recall that the approach does not involve a distinction between treatment and control groups: most if not all subjects receive some treatment.
12 This has implications for sampling: since data on policies in neighboring villages are required one must sample groups (possibly pairs) of adjacent villages.
13 Duflo et al. (2008, pp. 3935-37) make this point in a similar context (partial compliance), concluding that “One must compare all those initially allocated to the treatment group to all those initially randomized to the comparison group”.
14 This is shown in the supplemental appendix.
15 This is shown in the supplemental appendix.
16 It should be noted that the intervention variable is not binary (as it would be in a ‘project’) since insurance enrollment is measured as an average at the household level.

References

Almeida, Rita K., and Emanuela Galasso (2010), ‘Jump-starting Self-employment? Evidence for Welfare Participants in Argentina’, World Development 38(5): 742-55.
Banerjee, Abhijit V. and Esther Duflo (2009), ‘The Experimental Approach to Development Economics’, Annual Review of Economics 1: 151-78.
Bracht, Glenn H. and Glass, Gene V. (1968), ‘The External Validity of Experiments’, American Educational Research Journal 5(4): 437-74.
Chen, Shaohua, Ren Mu, and Martin Ravallion (2009), ‘Are There Lasting Impacts of Aid to Poor Areas?’, Journal of Public Economics 93(3): 512-28.
Das, Jishnu, Stefan Dercon, James Habyarimana, and Pramila Krishnan (2004), ‘When Can School Inputs Improve Test Scores?’, World Bank Policy Research Working Paper 3217, Washington, DC: The World Bank.
Das, Jishnu, Stefan Dercon, James Habyarimana, and Pramila Krishnan (2007), ‘Teacher Shocks and Student Learning: Evidence from Zambia’, Journal of Human Resources 42(4): 820-62.
Deaton, Angus (2010), ‘Instruments, Randomization, and Learning about Development’, Journal of Economic Literature 48(2): 424-55.
De Janvry, Alain, Frederico Finan, and Elisabeth Sadoulet (2012), ‘Local Electoral Incentives and Decentralized Program Performance’, Review of Economics and Statistics 94(3): 672-85.
De Kemp, Anthonie, Jörg Faust and Stefan Leiderer (2011), Between High Expectations and Reality: An Evaluation of Budget Support in Zambia, Bonn/The Hague/Stockholm: BMZ/Ministry of Foreign Affairs/Sida.
Duflo, Esther, Rachel Glennerster and Michael Kremer (2008), ‘Using Randomization in Development Economics Research: A Toolkit’, in T. Paul Schultz and John Strauss (eds.), Handbook of Development Economics, Amsterdam: North-Holland, pp. 3895-3962.
Elbers, Chris and Jan Willem Gunning (2009), ‘Evaluation of Development Policy: Treatment versus Program Effects’, Tinbergen Institute Discussion Paper 2009-073/2.
Elbers, Chris, Jan Willem Gunning and Kobus de Hoop (2009), ‘Assessing Sector-Wide Programs with Statistical Impact Evaluation: A Methodological Proposal’, World Development 37(2): 513-20.
General Statistics Office of Vietnam (1993), Living Standards Survey 1992-93, http://go.worldbank.org/JZFNBLXM80.
General Statistics Office of Vietnam (1998), Living Standards Survey 1997-98, http://go.worldbank.org/4QR0OSXMD0.
Heckman, James J.
(1997), ‘Instrumental Variables: A Study of Implicit Behavioral Assumptions Used in Making Program Evaluations’, Journal of Human Resources 32(3): 441-62.
Heckman, James J., Sergio Urzua and Edward J. Vytlacil (2008), ‘Understanding Instrumental Variables in Models with Essential Heterogeneity’, Review of Economics and Statistics 88(3): 389-432.
Imbens, Guido W. (2010), ‘Better LATE Than Nothing: Some Comments on Deaton (2009) and Heckman and Urzua (2009)’, Journal of Economic Literature 48(2): 399-423.
Imbens, Guido W. and Joshua D. Angrist (1994), ‘Identification and Estimation of Local Average Treatment Effects’, Econometrica 62(2): 467-76.
Khandker, Shahidur R., Zaid Bakht, and Gayatri B. Koolwal (2009), ‘The Poverty Impact of Rural Roads: Evidence from Bangladesh’, Economic Development and Cultural Change 57(4): 685-722.
Ravallion, Martin (2009), ‘Evaluation in the Practice of Development’, World Bank Research Observer 24(1): 29-53.
Ravallion, Martin (2012), ‘Fighting Poverty One Experiment at a Time: A Review of Abhijit Banerjee and Esther Duflo’s Poor Economics: A Radical Rethinking of the Way to Fight Global Poverty’, Journal of Economic Literature 50(1): 103-14.
Rodrik, Dani (2008), ‘The New Development Economics: We Shall Experiment, But How Shall We Learn?’, John F. Kennedy School of Government, Harvard University, HKS Working Paper RWP 08-055.
Wagstaff, Adam, and Menno Pradhan (2005), ‘Health Insurance Impacts on Health and Nonmedical Consumption in a Developing Country’, World Bank Policy Research Working Paper 3563, Washington, DC: The World Bank.
White, Howard (2006), Impact Evaluation: The Experience of the Independent Evaluation Group of the World Bank, Washington, DC: World Bank.

Table 1: Data for the Vietnam Insurance Example

Variable: change in (average)                        Mean     Std. Dev   Min       Max
Arm circumference (cm)                               1.154    2.013      -7.3      9.4
Height (cm)                                          5.175    11.35      -49.57    39.84
Body weight (kg)                                     2.983    6.544      -27.75    26.25
Health expenditure (’000 Dong)                       1,081    5,519      -8,808    233,965
Total consumption expenditure (’000 Dong)            6,513    8,009      -22,988   116,826
Insurance (binary at individual level)16             0.170    0.268      0         1
School attended                                      -0.017   0.683      -3.5      3
Currently attending school (binary at
  individual level)                                  0.082    0.388      -2        2
Gender                                               0.002    0.138      -0.75     1
Age                                                  3.522    8.299      -48.43    48.6
Farm dummy                                           -0.079   0.421      -1        1
Household size                                       -0.267   1.696      -18       11

The number of observations varies between 4,299 and 4,305.
Source: authors’ calculations using the Vietnam Living Standard Surveys 1992-3, 1997-8.

Table 2: Total Program Effects

                        Naïve program   Total program    R-squared of underlying
Dependent variable      effect† (I)     effect†† (II)    regressions              Remarks
                        (s.e.)          (s.e.)           I        II
Arm circumference       0.022           0.090***         0.22     0.23
                        (0.029)         (0.027)
Height                  -0.190          0.095            0.34     0.36
                        (0.154)         (0.139)
Body weight             0.167*          0.384***         0.31     0.33
                        (0.083)         (0.074)
Health expenditure      -28.08          -52.79           0.03     0.04             Total consumption
                        (60.59)         (51.01)                                    included in controls
Health expenditure      55.41           64.32            0.00     0.00             Total consumption
                        (66.42)         (52.87)                                    expenditure not included
Total consumption       626.7***        888.8***         0.10     0.12             Total consumption
expenditure             (110.9)         (105.7)                                    expenditure not included

Robust clustered standard errors in parentheses. In all but the health expenditure regressions the squared intervention and the interactions of controls with the intervention are jointly significant. Significance: * indicates the 5% threshold, *** the 0.1% threshold.
† The naïve program effect is calculated as the regression coefficient on the insurance variable times the estimated population mean of that variable.
†† The total program effect is calculated according to equation (6). The sampling errors on the estimated population means are not taken into account.
Source: authors’ calculations using the Vietnam Living Standard Surveys 1992-3, 1997-8.
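To make the two columns of Table 2 concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the sample, the control x, and the functional form of the heterogeneous effect are invented, and the averaging step only mimics the role of equation (6) (a regression with a squared-intervention term and an intervention-control interaction, as in the table notes). It is not the paper's exact estimator and does not use the Vietnam surveys.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Simulated differenced data: P is insurance coverage in [0, 1],
# x a centered control, dy the change in the outcome; the treatment
# effect beta varies with P and x (heterogeneity).
P = rng.uniform(0, 1, n)
x = rng.normal(0, 1, n)
beta = 0.5 + 1.5 * P + 0.4 * x
dy = 1.0 + beta * P + 0.3 * x + rng.normal(0, 1, n)

# Naive program effect: linear coefficient on P times the mean of P,
# as in note (dagger) to Table 2.
X_naive = np.column_stack([np.ones(n), P, x])
b_naive = np.linalg.lstsq(X_naive, dy, rcond=None)[0]
naive_pe = b_naive[1] * P.mean()

# Total program effect: add the squared intervention and the
# intervention-control interaction, then average the fitted
# treatment-related part over the sample.
X_tpe = np.column_stack([np.ones(n), P, P**2, P * x, x])
b = np.linalg.lstsq(X_tpe, dy, rcond=None)[0]
tpe = np.mean(b[1] * P + b[2] * P**2 + b[3] * P * x)

print(round(naive_pe, 3), round(tpe, 3))
```

In this simulated design the two estimates diverge because the naïve linear coefficient absorbs the curvature of the treatment effect in P, while the second regression averages the fitted heterogeneous effect over the sample.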
Elbers and Gunning: Evaluation of Development Programs

Supplemental Appendix

Precision of TPE estimators when treatment is exogenous but not fully controlled¹

Using RCTs, “Program Officers” (POs) are divided into treatment- and control-POs. All subjects within the catchment area of a treatment-PO are considered as treated (i.e., we want to estimate the intention-to-treat effect). Consider the following model linking outcome $y_{ij}$ to (actual) treatment $P_{ij}$:

$$y_{ij} = \alpha_i + \beta_{ij} P_{ij} + \varepsilon_{ij},$$

where $i$ refers to the program officer responsible for administering treatment to subject $j$, who falls within the catchment area of $i$. The disturbance $\varepsilon_{ij}$ is assumed to be homoscedastic and independent of $\alpha_i$, $\beta_{ij}$ and $P_{ij}$. To model clustering by POs, an officer random effect $\alpha_i$ is included in the model. Random effects are assumed to be i.i.d. and independent of $\beta_{ij}$ and $P_{ij}$. We further assume that the number of subjects per PO is constant, to avoid trivial complications of weighting.

¹ The context is that of section 4 in the main text of the paper.

The evaluator wants to estimate $TPE = E\,\beta_{ij} P_{ij}$. In order to capture any selectivity in the application of treatment by the program officers, a random sample of POs has been drawn and subsequently been randomly divided into a group $T$ of treatment-POs, who are supposed to apply treatment to the ultimate beneficiaries $j$, and a group $C$ of control-POs, who are asked not to give treatment to subjects. Within the catchment area of sampled POs a random sample of subjects is drawn for whom we observe (at least) $y_{ij}$. This allows estimation of the TPE as the difference in average outcomes between group $T$ and group $C$ subjects:

$$\widehat{TPE} = \bar y_T - \bar y_C = \bar\alpha_T - \bar\alpha_C + [\overline{\beta_{ij} P_{ij}}]_T + \bar\varepsilon_T - \bar\varepsilon_C, \tag{A.1}$$

where the bars denote sample averages over the two groups of subjects.
Since this estimator is unbiased, its precision can be determined by the variance:

$$MSE(\widehat{TPE}) = \Big(\frac{1}{n_T} + \frac{1}{n_C}\Big)\sigma_\alpha^2 + \frac{1}{N_T}\,[\mathrm{var}(\beta_{ij} P_{ij})]_T + \Big(\frac{1}{N_T} + \frac{1}{N_C}\Big)\sigma_\varepsilon^2,$$

where $n_T$ and $n_C$ denote the number of sampled treatment-POs and control-POs, $N_T$ the total number of sampled subjects associated with treatment-POs, and $N_C$ the number of sampled subjects falling under control-POs.

Regression using observational data

Now consider sampling directly at the level of subjects. Typically such a sample will also be clustered, albeit not necessarily by PO. To create a ‘level playing field’ we will assume that the sample has $n = n_T + n_C$ clusters with a total of $N = N_T + N_C$ subjects. For each sampled subject $j$ from cluster $i$ we observe $P_{ij}$ (actual treatment) and $y_{ij}$. The estimator for the TPE reduces to

$$\widehat{TPE} = \bar y - \frac{\overline{y_{ij}(1 - P_{ij})}}{1 - \bar P_{ij}} = \overline{\beta_{ij} P_{ij}} + \bar\alpha_i - \frac{\overline{\alpha_i (1 - P_{ij})}}{1 - \bar P_{ij}} + \bar\varepsilon_{ij} - \frac{\overline{\varepsilon_{ij}(1 - P_{ij})}}{1 - \bar P_{ij}}. \tag{A.2}$$

Assuming, as in the RCT setup, that $\alpha_i$ is independent of $P_{ij}$ and $\varepsilon_{ij}$, this estimator is again unbiased² and

$$MSE(\widehat{TPE}) = \frac{1}{N}\,\mathrm{var}(\beta_{ij} P_{ij}) + \mathrm{var}\Big(\bar\alpha_i - \frac{\overline{\alpha_i(1 - P_{ij})}}{1 - \bar P_{ij}}\Big) + \mathrm{var}\Big(\bar\varepsilon_{ij} - \frac{\overline{\varepsilon_{ij}(1 - P_{ij})}}{1 - \bar P_{ij}}\Big).$$

² Correlation of $\alpha_i$ with $\varepsilon_{ij}$ or $P_{ij}$ would reflect level effects which, as explained in section 2, should be neutralized by using differenced data.

Using the delta method and the equality $E(P_{ij} - EP)^2 = EP(1 - EP)$ it can be verified that³

$$\mathrm{var}\Big(\bar\varepsilon_{ij} - \frac{\overline{\varepsilon_{ij}(1 - P_{ij})}}{1 - \bar P_{ij}}\Big) = \mathrm{var}\Big(\frac{\overline{\varepsilon_{ij}(P_{ij} - \bar P_{ij})}}{1 - \bar P_{ij}}\Big) \approx \frac{EP}{1 - EP}\,\frac{N-1}{N^2}\,\sigma_\varepsilon^2,$$

and likewise that

$$\mathrm{var}\Big(\bar\alpha_i - \frac{\overline{\alpha_i(1 - P_{ij})}}{1 - \bar P_{ij}}\Big) = \mathrm{var}\Big(\frac{\overline{\alpha_i(P_{ij} - \bar P_{ij})}}{1 - \bar P_{ij}}\Big) \approx \frac{EP}{1 - EP}\,\frac{N-1}{N^2}\,\sigma_\alpha^2.$$

It follows that in the regression setup precision is of order $N$, while in the RCT setup precision is at best of order $N/2$ and, if clustering of the data is an issue, of order $n/2$.
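The precision comparison between the two setups can be checked with a small Monte Carlo sketch. The design below is invented for illustration (100 POs with 10 subjects each, binary treatment applied with probability 0.5, normally distributed α, β and ε, so that TPE = 0.5); it follows estimators (A.1) and (A.2), not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw(n_po=100, m=10):
    """One draw of PO effects, heterogeneous effects, and disturbances."""
    alpha = rng.normal(0, 1, n_po).repeat(m)   # PO random effects (clustering)
    beta = rng.normal(1.0, 0.5, n_po * m)      # heterogeneous treatment effects
    eps = rng.normal(0, 1, n_po * m)
    return alpha, beta, eps

rct, reg = [], []
for _ in range(2000):
    # RCT setup: the first 50 POs are treatment-POs who treat each of their
    # subjects with probability 0.5 (imperfect control); control-POs treat
    # nobody.  Estimator (A.1): difference in group means.
    alpha, beta, eps = draw()
    T = np.repeat(np.arange(100) < 50, 10)
    P = T & (rng.random(1000) < 0.5)
    y = alpha + beta * P + eps
    rct.append(y[T].mean() - y[~T].mean())

    # Observational setup: treatment exogenous but uncontrolled; every
    # subject is treated with probability 0.5.  Estimator (A.2).
    alpha, beta, eps = draw()
    P = rng.random(1000) < 0.5
    y = alpha + beta * P + eps
    reg.append(y.mean() - (y * (1 - P)).mean() / (1 - P.mean()))

rct, reg = np.array(rct), np.array(reg)
print(rct.mean(), reg.mean())   # both center on TPE = E[beta * P] = 0.5
print(rct.var(), reg.var())
```

In this clustered design both estimators are unbiased, but the Monte Carlo variance of the regression estimator is far smaller, consistent with the order-of-magnitude comparison: the RCT variance is dominated by the PO-level term of order 1/n.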
(Note that if the two groups are of equal size, so that $N_T = N/2$, the regression setup is twice as precise as the RCT setup.)

Covariates

Both methods fail if the disturbances $\alpha_i$ and $\varepsilon_{ij}$ are correlated with $\beta_{ij} P_{ij}$. What if there are observables $X_{ij}$ determining both $P$ and $y$? This could be the result of program targeting. In that case formulas (A.1) and (A.2) can no longer be used. To account for the confounding effect of covariates a regression approach is required, also in an RCT setup. For RCTs using intention to treat by PO for estimating the TPE, efficient estimation would amount to a regression equation like

$$y_{ij} = \alpha_i + TPE \cdot I_{\{i \in T\}} + \gamma X_{ij} + \varepsilon_{ij}.$$

The reason formula (A.1) can no longer be used is that randomization over POs does not guarantee randomization over the observables $X_{ij}$. Applying formula (A.1) we would find

$$\widehat{TPE} = \bar y_T - \bar y_C = \bar\alpha_T - \bar\alpha_C + \gamma(\bar X_T - \bar X_C) + [\overline{\beta_{ij} P_{ij}}]_T + \bar\varepsilon_T - \bar\varepsilon_C.$$

The bias $\gamma(\bar X_T - \bar X_C)$ would vanish if $\bar X_T = \bar X_C$, i.e., when $X_{ij}$ and $I_{\{i \in T\}}$ are uncorrelated.

³ In this case $E$ denotes an average over all possible samples.
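The confounding bias from targeting, and its removal by conditioning on the covariate, can be illustrated with a short simulation. Everything here is assumed for illustration: a single covariate, logistic targeting, a unit treatment effect, and no PO effects.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

# A covariate x drives both treatment assignment (targeting) and the
# outcome, so the raw difference in means is biased by gamma*(xbar_T - xbar_C).
x = rng.normal(0, 1, n)
P = rng.random(n) < 1.0 / (1.0 + np.exp(-x))   # assignment depends on x
y = 2.0 * x + 1.0 * P + rng.normal(0, 1, n)    # true treatment effect = 1

naive = y[P].mean() - y[~P].mean()             # confounded difference in means

# Including x as a control in the regression removes the confounding.
X = np.column_stack([np.ones(n), P.astype(float), x])
coef = np.linalg.lstsq(X, y, rcond=None)[0]    # coef[1] estimates the effect

print(round(naive, 2), round(coef[1], 2))
```

The raw difference is inflated well above the true effect of 1 because treated subjects have systematically higher x, while the regression coefficient on P recovers the effect once x is held fixed.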