Policy Research Working Paper 8139

Distributional Impact Analysis: Toolkit and Illustrations of Impacts Beyond the Average Treatment Effect

Guadalupe Bedoya, Luca Bittarello, Jonathan Davis, Nikolas Mittag

Development Research Group, Impact Evaluation Team
July 2017

Abstract

Program evaluations often focus on average treatment effects. However, average treatment effects miss important aspects of policy evaluation, such as the impact on inequality and whether treatment harms some individuals. A growing literature develops methods to evaluate such issues by examining the distributional impacts of programs and policies. This toolkit reviews methods to do so, focusing on their application to randomized control trials. The paper emphasizes two strands of the literature: estimation of impacts on outcome distributions and estimation of the distribution of treatment impacts. The article then discusses extensions to conditional treatment effect heterogeneity, that is, to analyses of how treatment impacts vary with observed characteristics. The paper offers advice on inference, testing, and power calculations, which are important when implementing distributional analyses in practice. Finally, the paper illustrates select methods using data from two randomized evaluations.

This paper is a product of the Impact Evaluation Team, Development Research Group. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://econ.worldbank.org. The authors may be contacted at gbedoya@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished.
The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Distributional Impact Analysis: Toolkit and Illustrations of Impacts Beyond the Average Treatment Effect

Guadalupe Bedoya (World Bank)
Luca Bittarello (Northwestern University)
Jonathan Davis (University of Chicago)
Nikolas Mittag (CERGE-EI) *

Advisors: Stéphane Bonhomme (University of Chicago), Sergio Firpo (Insper)

Keywords: Policy Evaluation, Distributional Impact Analysis, Heterogeneous Treatment Effects, Impacts on Outcome Distributions, Distribution of Treatment Effects, Randomized Control Trials.
JEL codes: C18, C21, C54, C93, D39

* We benefited from the generous advice of Joel L. Horowitz and Andreas Menzel. We thank Arianna Legovini, David Evans, and Caio Piza for making this toolkit possible. We are grateful to Moussa P. Blimpo and participants at the DIME seminar and the Course on Distributional Impact Analysis held at the World Bank. We also benefited from the work and data from "The Impact of High School Financial Education: Experimental Evidence from Brazil" and "Parental Human Capital and Effective School Management: Evidence from The Gambia" and from conversations with members of both teams. We thank Kristoffer Bjärkefur for assistance with programming. This work was generously supported by the Africa Poverty and Social Impact Analysis (PSIA) and the Impact Evaluation to Development Impact (i2i) Trust Funds.

Contents

1. INTRODUCTION
2.
QUESTIONS OF INTEREST AND DEFINITIONS
   1. Definitions and Notation
   2. Two Unconditional Approaches
   3. From Unconditional to Conditional Analysis
3. IMPACT ON OUTCOME DISTRIBUTIONS
   1. Introduction
   2. Randomized Control Trials Without Endogenous Selection
   3. Applications with Selection on Observables: Inverse Probability Weighting
   4. Applications with Selection on Unobservables: Instrumental Variables
   5. Interpreting Quantile Effects
4. DISTRIBUTION OF TREATMENT IMPACTS
   1. Introduction
   2. Bounding the Distribution of Treatment Effects
   3. Point Identification of Features of the Distribution of Treatment Effects
   4. Estimation Methods
5. DISTRIBUTIONAL IMPACTS AND CONDITIONAL ANALYSES
   1. Introduction
   2.
Subgroup Analysis
   3. Conditional Average Treatment Effects
6. STATISTICAL INFERENCE AND POWER CALCULATIONS
   1. Introduction
   2. Statistical Inference
   3. Tests of Heterogeneous Treatment Effects
   4. Power Calculations
7. APPLICATIONS
   1. Financial Education RCT in Brazil
   2. School Management RCT in The Gambia
REFERENCES
APPENDIX 1. SIMULATION DETAILS
APPENDIX 2. ESTIMATING CONDITIONAL PROBABILITIES
APPENDIX 3. RECURSIVELY SOLVING FOR HIGHER ORDER MOMENTS
APPENDIX 4. ADDITIONAL RESULTS FROM APPLICATIONS

1. Introduction

Traditional methods to evaluate the impacts of social programs and the vast majority of applied econometric policy evaluations focus on the analysis of means (Carneiro, Hansen and Heckman, 2002; Angrist and Pischke, 2009).
However, there is also a large and growing literature on methods to evaluate the effects of programs and policies beyond their mean impact. While less frequently applied, these methods can provide information that is valuable or even necessary in the assessment of the consequences of policies and their desirability. The purpose of this toolkit is to provide an overview of the questions such methods can address and the core approaches that have been developed to answer them, including discussions of the assumptions they require and practical issues in their implementation.

Mean impacts are a natural first summary statistic to describe the effect of a policy. The mean impact of a policy or intervention tells us by how much the outcome would increase or decrease on average when every member of a particular population is exposed to the policy or intervention. Thereby, it provides the central piece in any cost-benefit analysis. However, a decision maker usually requires information on the effects of a policy beyond its mean impact. For example, mean impacts allow us to calculate the total gain from a program or policy, but do not allow us to say anything about the distribution of the gain or how the outcome distribution is affected by the program beyond changes in its mean. A positive average program effect tells us that a program can generate social surplus, but it may not be sufficient to allow us to judge whether the program is desirable if any weight is placed on distributional concerns, such as whether inequality is affected by the program, whether some people are harmed by the policy or whether a particular demographic group benefits.

Even a purely welfare-maximizing social planner with no normative concerns for particular demographic groups, inequality or not harming anyone will often need information on program impacts beyond their average.
For example, judging a program by its mean impact assumes that the welfare consequences of the distributional aspects of programs are either unimportant or are offset by transfers. As Heckman, Smith and Clements (1997) argue, this assumption is strong. Many outcomes, such as educational attainment and health status, cannot feasibly be redistributed themselves. In order to redistribute the welfare gains derived from such outcomes, one needs to know the relation between the outcomes and individual utility, which can usually at best be approximated. In practice, transfers may be costly, and implementing the optimal transfer scheme requires some knowledge of the distribution of gains and losses, i.e., an evaluation that goes beyond the mean impact. Finally, some interventions may work well for particular subgroups of the target population, such as those living in urban areas, while there may be better options for rural populations. Knowing which groups benefit more or less can help improve the targeting of policies and programs and thereby help allocate limited resources more effectively.

The common theme of these issues is that they cannot be addressed by mean impacts alone. Finding answers requires thinking about the impact of a program or policy as a collection of distributional parameters rather than a single scalar parameter such as the mean. Hence, we refer to these types of analyses as "distributional impact analysis" (DIA). DIA concerns the study of the distributional consequences of interventions due to participants' heterogeneous responses or participation decisions. In particular, DIA investigates features beyond the gross total gain of a program by studying where the gains or losses of a program, if any, were produced, and who wins or loses as a result of program participation. The goal of this toolkit is to provide guidance for researchers who want to use RCTs to answer questions for which the mean is insufficient as an answer.
We focus on RCTs because they allow us to simplify the exposition of the methods through the use of randomization as a statistical solution to the selection problems that are central to impact evaluation. Subject to addressing these selection problems, the methods we discuss are applicable to non-experimental analyses as well. 1

In this toolkit, we study two related approaches to distributional impact analysis. The first approach examines a policy's impact on the outcome distributions. It concerns differences between (statistics of) the distribution of outcomes with the policy and the distribution of outcomes without it, such as the impact of a policy on the variance or a specific quantile of the outcome distribution. The second approach is to examine the distribution of treatment impacts. This approach answers questions such as what fraction of the population is harmed by the policy or what the bottom quartile of the impact of a policy is. It requires (statistics of) the distribution of the policy gains or losses, i.e., the distribution of differences between the outcomes of a given individual with the policy and without it. What distinguishes these two approaches is that the first focuses on how the intervention affects the distribution of the outcome in the population (e.g., how would the bottom quintile of financial literacy test scores change if we were to provide training on financial literacy for the entire population), but disregards how any particular individual is affected by the program. The goal of the second approach is to analyze the distribution of these individual treatment effects (e.g., how individual gains from financial literacy training vary in the population). Due to this additional ambition, the second approach requires stronger assumptions for identification and estimation.

To simplify the exposition, we focus on questions concerning unconditional distributional impact parameters first.
That is, we start by discussing questions regarding treatment effect heterogeneity in the entire population and only introduce additional covariates to make the underlying assumptions hold. In practice, one may also be interested in how treatment impacts differ between observable subpopulations, such as males and females, or how they vary with continuous covariates, such as age or income. Therefore, we then review ways to extend the methods of both approaches to also allow for heterogeneity within observed subpopulations or along continuous covariates. This allows us to answer questions such as whether men or women are more likely to benefit from a program or whether program impacts are increasing in the baseline outcome. Such conditional analyses usually require additional assumptions and (often substantially) larger samples. If this makes them infeasible, conditional means are usually simple to estimate and can still be informative about heterogeneity.

The benefits of DIA outlined above beg the question why we do not see more of these methods applied in practice. Recent studies that estimate distributional impacts have examined earnings and employment (e.g., Abadie, Angrist and Imbens, 2002; Lechner, 1999), safety net programs (e.g., Black et al., 2003; Bitler, Gelbach and Hoynes, 2006), social experiments (e.g., Djebbari and Smith, 2008; Firpo, 2007; Heckman, Smith and Clements, 1997) and education (e.g., Carneiro, Hansen and Heckman, 2003; Cunha, Heckman and Schennach, 2010). However, the literature is still nascent compared to the importance of the topic and the tools available. There are often good reasons to focus primarily on mean impacts, and the motives to not look beyond them depend on the application.

Footnote 1: See, among others, the overviews in Chernozhukov and Hansen (2013), Heckman and Vytlacil (2007) and Abbring and Heckman (2007).

For example, DIA usually requires
larger sample sizes, and many of the methods below are justified asymptotically and are not necessarily unbiased in finite samples, which can be problematic in applications with small samples, such as most RCTs. Some methods rely on additional assumptions that are reasonable in some applications, but not in others. However, part of the hesitation to apply these methods also seems to stem from inertia and the fact that they are relatively new or have only recently become computationally feasible. Inertia may be an obstacle if researchers or their audiences are more used to and hence comfortable interpreting mean impacts and the conditions under which they are valid. To mitigate these issues, we review the practical considerations of statistical inference and power calculations for DIA. Finally, we illustrate select DIA methods and what can be learned from them by re-analyzing the impacts of a financial literacy program in Brazil and a school management program in The Gambia.

We focus on the types of questions introduced in the previous paragraphs, but there are many other parameters of interest and questions that go beyond mean impacts. The methods and questions we discuss are not meant to be exhaustive, but are a selection to illustrate how they can complement analyses of means in assessing the value or desirability of a policy or program. We intend to provide a set of baseline methods, to discuss common problems and how to detect them, and to shed light on practical issues such as required sample sizes and estimation of standard errors in (potentially dependent) samples. Programs to implement the methods are available online. 2

This paper is organized as follows. Part 2 introduces notation and distinguishes the types of DIA questions we examine: We first introduce unconditional analyses of impacts on outcome distributions and the distribution of treatment effects. We then point out how these two approaches extend to questions of conditional heterogeneity.
Next, we discuss key methods for each approach. Part 3 considers questions on the impact on outcome distributions. Part 4 considers questions on the distribution of treatment effects. Part 5 considers conditional analyses to answer questions of how treatment impacts differ with observables. Part 6 considers practical considerations for RCTs related to statistical inference and power calculations. Part 7 presents our applications.

2. Questions of Interest and Definitions

In this part, we introduce notation and discuss the similarities and differences between the two main approaches, impact on the outcome distributions and distribution of treatment impacts. We then discuss questions that require each approach to be extended to conditional analyses. Unless stated otherwise, we consider an RCT with full compliance with treatment assignment to illustrate each method without concerns about selection issues. 3

1. Definitions and Notation

Suppose we have a sample of observations, indexed by i. The sample is randomized into a treatment group (which received the policy of interest) and a control group (which did not). The indicator variable T_i denotes treatment assignment: T_i = 0 if observation i belongs to the control group and T_i = 1 otherwise. We focus on binary policies to simplify the exposition, although most methods in this toolkit extend to more complex interventions. The indicator variable D_i denotes treatment participation: D_i = 0 if observation i did not receive the treatment and D_i = 1 otherwise. Full compliance with treatment assignment implies that everyone in the treatment group participates and no one in the control group participates, so D_i = T_i. We assume that outcomes are continuous. 4

Footnote 2: The codes can be found at https://github.com/worldbank/DIA-toolkit
Footnote 3: Technically, this requires treatment status to be independent of counterfactual outcomes; full compliance is neither necessary nor sufficient for independence.
We write Y_0 for potential outcomes without treatment, with cumulative distribution function F_0 and τ-th quantile Q_0(τ). Potential outcomes under treatment are Y_1, with CDF F_1 and τ-th quantile Q_1(τ). 5 Each of these potential outcomes is defined for all individuals, regardless of their treatment status, which allows us to precisely define counterfactuals such as F_0(· | T = 1), the distribution of outcomes for the treated if they had not received treatment. We observe the outcome Y, which is Y_0 if D = 0 and Y_1 otherwise. Formally, Y = (1 − D) × Y_0 + D × Y_1.

Following the practice of most RCTs, we assume the population of interest is the subpopulation of individuals who apply for the program under evaluation. For this reason, estimates from RCTs are often considered to be impacts "on the treated" and we follow this practice. This terminology only depends on what population one considers the available sample to represent, and not on which individuals from this sample participate or the methods used. In practice, there is usually a second selection step, which is compliance with treatment assignment. 6 We consider methods to adjust for this second selection step in Sections 3.3 and 3.4. These methods can be extended to the methods in Sections 4 and 5 or to correct for selective application to participate in the RCT. For mean impacts, researchers often settle for intent-to-treat parameters instead, to avoid further assumptions. Methodologically, this approach extends to many DIA parameters. However, when analyzing heterogeneity, intent-to-treat parameters confound heterogeneity in take-up 7 with heterogeneity in treatment effects. It will usually be more informative to analyze these two sources of heterogeneity separately. Compliance with treatment assignment is observed, so heterogeneity in compliance can be analyzed using the standard methods used to study program take-up. In this toolkit, we focus on heterogeneity in treatment effects, i.e., parameters of treatment on the treated.

2.
Two Unconditional Approaches

In terms of impact on the outcome distributions, consider questions such as:

• Does microfinance boost average incomes? To answer this question, we would estimate the average treatment effect, E(Y_1) − E(Y_0), or the average treatment effect on the treated, E(Y_1 | T = 1) − E(Y_0 | T = 1).

• Does hospital regulation raise minimum levels of patient safety? To answer this question, we would estimate the treatment effect on the minimum, min Y_1 − min Y_0, or on a quantile in the left tail, such as Q_1(0.1) − Q_0(0.1).

• Does education reform decrease the dispersion of students' test scores? To answer this question, we could estimate the treatment effect on a measure of inequality, such as the variance, var(Y_1) − var(Y_0), or the Gini index.

Footnote 4: While these methods can be applied using continuous, discrete or binary outcomes, they add little in the binary case since the distribution has only two points of support and is completely characterized by its mean.
Footnote 5: Formally, Q(τ | ·) ≡ inf{y : P(Y ≤ y | ·) ≥ τ} for τ ∈ (0,1).
Footnote 6: Consequently, "on the treated" remains slightly ambiguous, as it could also be defined based on actual treatment receipt within the given sample or population, i.e., in terms of D.
Footnote 7: Note that this is take-up given participation in the RCT. For predictions about policy impacts in the population, we need the unconditional take-up model unless the RCT sample is representative of the entire population.

The answer to the first question is based on the mean, a particular feature of the outcome distribution that is the focus of most conventional approaches to evaluate policies. However, randomization identifies the full distributions of treated and untreated outcomes, F_0 and F_1: T is independent of Y_0 and Y_1, so that F_0(y_0 | T = 0) = F_0(y_0) and F_1(y_1 | T = 1) = F_1(y_1). Consequently, under full compliance, we can estimate the distribution of treated outcomes, F_1, from the treatment group and the distribution of untreated outcomes, F_0, from the control group.
Thus, we can compare different features of the treatment and control distributions to answer the remaining questions. We can measure the impact of a policy on the median, minimum or low quantiles of the distribution of realized outcomes by taking the difference in the median, minimum or bottom decile of the outcome between the treatment and control group. By the same token, we can measure the impact of an intervention on a particular inequality measure, such as the standard deviation or the Gini coefficient, as in the third question.

Importantly, these methods are not informative about how a program's impact varies across individuals, because they ignore who within the population belongs in different segments of the outcome distribution. We may want to study the distribution of treatment effects if we are concerned about how policy impacts vary across individuals. Analyses of the distribution of treatment effects, discussed in Part 4, can answer questions like:

• What proportion of students benefit from an educational reform? For this question, we would compute P(Y_1 > Y_0) = P(Δ > 0).

• Are the improvements in average patient outcomes from health facility inspections driven by a few people who benefit considerably? Formally: is there significant skewness in effects? For these questions, we would compute E{[(Y_1 − Y_0 − μ)⁄σ]^3}, where μ = E(Y_1 − Y_0) and σ^2 = E[(Y_1 − Y_0 − μ)^2].

• What is the median impact of a microfinance program? More generally, what are the quantiles of the impact distribution, like the minimum or maximum program impact? For this question, we would compute the relevant quantile of treatment effects.

Unlike evaluating policy impacts on outcome distributions, studying the distribution of impacts requires additional, sometimes strong, assumptions about how individuals would fare in a counterfactual treatment state. Even a perfect RCT cannot yield this information, since the same person cannot be in both the treatment and control group at the same time.
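By contrast, the first approach's statistics can be computed directly from the two experimental arms. The following is a minimal sketch of that idea in Python with hypothetical outcome data; the `gini` helper is our own illustration and is not taken from the toolkit's companion code:

```python
# Sketch: impacts on distributional statistics under perfect randomization,
# estimated as the difference of each statistic between treatment and control.
# The outcome data below are hypothetical.
from statistics import pvariance


def gini(sample):
    """Gini coefficient of a sample of non-negative outcomes."""
    ys = sorted(sample)
    n = len(ys)
    # Standard formula: weight the i-th order statistic by (2i - n - 1),
    # then normalize by n times the total.
    return sum((2 * (i + 1) - n - 1) * y for i, y in enumerate(ys)) / (n * sum(ys))


y_control = [2, 3, 5, 8, 12, 20]  # observed outcomes, control group
y_treated = [4, 5, 7, 9, 12, 17]  # observed outcomes, treatment group

# Treatment effects on the variance and on the Gini index:
variance_effect = pvariance(y_treated) - pvariance(y_control)
gini_effect = gini(y_treated) - gini(y_control)
print(variance_effect, gini_effect)  # both negative here: treatment compresses the distribution
```

The same pattern applies to any statistic that is a functional of the marginal outcome distribution, such as the interquartile range or a poverty headcount.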
To further illustrate the difference between the two approaches, consider the following fictional example. 8 Researchers select a class of five students to receive finance lessons. Attendance is mandatory and compliance is perfect. At the end of the program, researchers administer tests to measure financial literacy. Table 1 presents the students' test scores. It also shows what their grades would have been if they had not received the lessons. These counterfactual outcomes are of course unobservable in real life, but we nonetheless show them to clarify a few concepts.

Footnote 8: Part 7 presents a real RCT with a similar setup.

Table 1: Mock Dataset

Student   Outcome absent treatment   Outcome under treatment   Difference in potential outcomes
A         1                          2                         1
B         2                          4                         2
C         3                          4                         1
D         4                          4                         0
E         1                          5                         4

In standard impact analysis, we would compute the average treatment effect (ATE): the difference in means between the two potential outcomes. In our example, it is 1.6. Now consider the treatment effect on the median: the difference in the medians between the distributions of potential outcomes. In our example, it is 2. Why is it larger than the ATE? Because the treatment increased lower scores more than those at the top. If the lessons had instead increased all scores by the same amount, the two effects would have agreed. Note that the difference in medians is also higher than the median of individual effects, 1, due to individual mobility across the distribution. For example, student E would have earned the lowest grade without lessons, whereas E achieved the highest score under treatment.

A little formalism is useful for clarifying this point. Suppose that the potential outcomes (y_0i, y_1i) of observation i correspond to quantiles (τ_0i, τ_1i) of F_0 and F_1, respectively. Then the impact of the policy on individual i is given by:

Δ_i ≡ y_1i − y_0i = Q_1(τ_1i) − Q_0(τ_0i).

This individual impact can never be estimated, since even an ideal RCT will only provide an estimate of either Q_1(τ_1i) or Q_0(τ_0i). We can rewrite the impact on individual i as:

Δ_i = [Q_1(τ_1i) − Q_0(τ_1i)] + [Q_0(τ_1i) − Q_0(τ_0i)],

where the first term in brackets is the quantile treatment effect at τ_1i and the second is the mobility effect. Note that Q_0(τ_1i) is a counterfactual outcome for individual i corresponding to the τ_1i-quantile of F_0. The first difference is a quantile treatment effect from the first approach, which is observable. The second difference is a mobility effect, which captures the change in outcomes due to the movement of individuals to different quantiles within the same distribution.

This equation clarifies why quantile effects are only equal to individual treatment effects if everyone keeps the same rank in F_0 and F_1. That is, the mobility effect is zero if τ_0i = τ_1i for all individuals, which is called rank invariance. For example, the two black dots in Figure 1 show the potential outcomes for a particular person in the treated (Treatment) and untreated (Control) states. This person is hurt by the treatment despite the fact that most of the population benefits, because her relative rank in the distribution is much lower in the treatment group than in the control group.

Figure 1: Potential Outcomes and an Individual-Specific Treatment Effect

While RCTs identify the quantile treatment effect, the mobility effect can unfortunately never be recovered directly from the data without additional assumptions. This is again because we can never observe the same person in both the treatment and control group at the same time. Even if we observe the same person in the treatment and in the control group over time, we must still make an assumption about how untreated outcomes vary over time and how they depend on prior treatment status.
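The numbers in this example can be verified directly from Table 1. A short Python snippet, using only the mock data above, reproduces the three statistics discussed in the text:

```python
# Reproduce the summary statistics from the mock dataset in Table 1.
# Both potential outcomes are known here only because the data are fictional.
from statistics import mean, median

y0 = [1, 2, 3, 4, 1]  # outcomes absent treatment, students A-E
y1 = [2, 4, 4, 4, 5]  # outcomes under treatment, students A-E
effects = [b - a for a, b in zip(y0, y1)]  # individual treatment effects

ate = mean(effects)                  # average treatment effect
median_te = median(y1) - median(y0)  # treatment effect on the median
median_effect = median(effects)      # median of individual effects

print(ate, median_te, median_effect)  # 1.6 2 1
```

The gap between the treatment effect on the median (2) and the median individual effect (1) is exactly the mobility effect discussed above: student E moves from the bottom of the untreated distribution to the top of the treated one.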
The key difference between studying impact distributions in Part 4 and impacts on distributions in Part 3 is modeling how an individual's relative performance changes when they are in the treatment group instead of the control group, i.e., modeling the mobility effect in the above decomposition. Once we have a credible model of the relationship between Y_1 and Y_0 and the right data, we can identify the entire distribution of impacts!

3. From Unconditional to Conditional Analysis

The discussions so far concern treatment effect heterogeneity in the entire population. We are often also interested in how treatment effects vary with observable characteristics, such as gender, income or baseline outcomes. As a few concrete examples, consider questions like:

• Are men or women more likely to benefit from a program?
• Are the returns to a financial literacy training program higher for males or for females?
• Are schools with higher baseline test scores more likely to benefit from school management training?

Part 5 provides tools to analyze such questions by discussing methods to study heterogeneity across and, to a lesser extent, within subpopulations defined by observable characteristics. It is important to note that these questions are about how treatment effects correlate with observable characteristics, not necessarily the causal impact of these observable characteristics on the treatment effects. To be sure, the average causal impact of treatment is still identified among subgroups, so long as treatment is randomly assigned within the subgroup. However, the subgroup's characteristics, such as being poor or female, are not randomly assigned. Therefore, if we find that a program's impacts are greater among the poor, we cannot conclude that the treatment effects are greater because the people are poor. For example, an omitted variable, like neighborhood of residence, may drive the observed correlation. Nonetheless, answering such questions can provide useful information to policymakers.
For example, understanding how treatment effects vary with observable characteristics suggests how best to target a program to maximize its impact (e.g., Manski, 2004) and contributes to the understanding of how particular subgroups of interest, like women or the poorest families, respond to the program. It may also shed light on underlying mechanisms (e.g., Pitt, Rosenzweig and Hassan, 2012) and better predict a program's effects in a population with different characteristics from the original experimental population (e.g., O'Muircheartaigh and Hedges, 2014).

3. Impact on Outcome Distributions

1. Introduction

Policymakers often worry about aggregate treatment effects. For example, they may wish to raise average income, decrease inequality or fight poverty. For such purposes, changes in the level and the shape of the marginal outcome distribution matter more than individual responses to the intervention. This part provides tools to quantify these effects.

There are different approaches to this task. We may analyze simple statistics, such as the mean, an inequality index or a poverty line. This strategy has the advantages of parsimony and familiarity. For a more detailed picture, we may consider shifts in the CDF or its quantiles. We discuss quantiles for the sake of concreteness, but the methods in this part extend to other quantities under minimal conditions and with minimal adjustments.

Quantile treatment effects are the differences between the quantiles of potential outcomes. In graphical terms, they measure the horizontal distance between outcome distributions (Firpo, 2007). We formally define quantile treatment effects on the treated (QTT) as:

QTT(τ) ≡ Q_1(τ | T = 1) − Q_0(τ | T = 1),

where Q_d(τ | T = 1) is the τ-th quantile of potential outcome Y_d for the treated. 9 We only observe Q_1(τ | T = 1) in the data. The remainder of this part surveys three approaches to recover the counterfactual quantile Q_0(τ | T = 1) from untreated observations, which in turn will allow us to estimate QTT(τ).
We focus on unconditional treatment effects. These effects are particularly relevant for policy evaluation, because they capture changes in the dispersion of outcomes both between and within subgroups of the population (Firpo, Fortin and Lemieux, 2009). Since the pioneering work of Koenker and Bassett (1978), a related literature has explored conditional quantile regression (CQR). CQR estimates quantile effects within subgroups under restrictive assumptions. Part 5 discusses conditional effects. Note that, unlike average effects, unconditional quantile effects are not weighted averages of subgroup effects.

This part continues as follows. Section 3.2 considers RCTs without endogenous selection. This perfectly randomized setup ensures that the distributions of potential outcomes do not depend on treatment status, so that randomization identifies QTTs without further assumptions. However, many applications deviate from this benchmark. For example, participants might not comply with treatment assignment. The literature offers a wealth of strategies to account for sample selection, as it does for average treatment effects. Section 3.3 surveys inverse probability weighting, which exploits the assumption of selection on observables. Section 3.4 presents an instrumental-variable approach. Section 3.5 concludes with remarks on the interpretation of quantile effects.

[9] In analogy to average treatment effects, we can also define the quantile treatment effect (QTE): q_{Y_1}(τ) − q_{Y_0}(τ). See Firpo and Pinto (2016) for additional discussion.
[10] In practice, it is infeasible to analyze all infinite points of a continuous distribution. Therefore, we limit our endeavor to particular quantiles of interest. If outcomes are discrete, in principle we could measure treatment effects at each value of the support.

2. Randomized Control Trials Without Endogenous Selection

This section considers randomized trials without selection problems, i.e., we assume:
i. Potential outcomes, (Y_0, Y_1), are jointly independent of treatment receipt, D.

This is a strong assumption, which is only likely to hold under idealized conditions. The probability of assignment to treatment must be equal for all participants.[11] Moreover, the probability of treatment take-up must be constant. This condition requires full compliance with treatment assignment (D = T) or random noncompliance. We must also observe outcomes for all participants with the same probability. Should there be nonparticipation or nonresponse, they must be independent of treatment status. For simplicity, we refer to RCTs which satisfy assumption (i) as RCTs with perfect compliance or RCTs without endogenous selection.

Assumption (i) ensures that the counterfactual distribution of potential outcomes Y_0 of the treated group, F_{Y_0}(y | D = 1), is the same as that of the untreated group, F_{Y_0}(y | D = 0). Therefore, a consistent estimator of the QTT is simply the difference in quantiles between observed outcomes of treated and untreated units:

QTT̂(τ) = q̂_{Y_1}(τ | D = 1) − q̂_{Y_0}(τ | D = 0),

where q̂_Y(τ | D = d) is any consistent estimator of the τ-th quantile of outcomes for group d, such as the empirical quantile. This estimator is straightforward to extend to other objects, such as the variance or the Gini coefficient: it suffices to take the difference between the statistics for treated and untreated observations. Note that we may also estimate the distribution of Y_1 for the treated from the treatment group and the counterfactual distribution of Y_0 for the treated from the control group under assumption (i).[12]

It is also possible to recover the QTT from a quantile regression, in the same way that linear regression yields the average treatment effect. To do so, run a quantile regression of observed outcomes, Y, on a constant and a treatment indicator, D. The slope gives the QTT, whereas the intercept gives the quantile q̂_{Y_0}(τ | D = 0).

Assumption (i) is strong and fails in many applications. For example, the probability of assignment to treatment may depend on participants' attributes, such as region of residence.
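The difference-in-quantiles estimator is straightforward to compute. As a minimal sketch (our own illustration, not the paper's code; function and variable names are ours), assuming a perfectly randomized RCT with simulated outcomes and a constant treatment effect of 1:

```python
import numpy as np

def qtt_rct(y_treated, y_control, taus):
    """QTT in an RCT without endogenous selection: the difference
    between empirical quantiles of treated and control outcomes."""
    return np.quantile(y_treated, taus) - np.quantile(y_control, taus)

# Simulated example: the effect is 1 for everyone, so QTT(tau) = 1 at all tau.
rng = np.random.default_rng(0)
y0 = rng.normal(0.0, 1.0, size=5000)          # control outcomes
y1 = rng.normal(0.0, 1.0, size=5000) + 1.0    # treated outcomes
effects = qtt_rct(y1, y0, [0.25, 0.50, 0.75])  # each estimate near 1
```

As noted above, replacing `np.quantile` with any other consistent statistic (variance, Gini coefficient) and differencing yields the corresponding distributional effect.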
The equivalence between the distributions of potential outcomes then breaks down, leading to composition bias. For the same reason, endogenous selection into treatment is a concern when participants do not comply with treatment assignment. Similar to average effects, we may either settle for intent-to-treat effects or account for selection and estimate effects on the treated under alternative frameworks, as the next sections show.

[11] If the probability of assignment to treatment differs across individuals by design, reweighting the sample yields consistent estimators. See Section 3.3.
[12] With random noncompliance, one should include those with T = 1 & D = 0 in the control group and those with T = 0 & D = 1 in the treatment group to estimate these distributions.

3. Applications with Selection on Observables: Inverse Probability Weighting

In the ideal RCT of Section 3.2, treated and untreated observations have the same distribution of potential outcomes. Many applications deviate from this benchmark, though. This section considers the weaker assumption of selection on observables.[13] Selection on observables is an assumption of conditional independence between potential outcomes and treatment status,[14] which relaxes our previous assumption of unconditional independence. As the name suggests, we postulate that treatment take-up only depends on observed variables. Therefore, treated and untreated participants differ only in observed characteristics (other than treatment status).

Correcting imbalances in covariates should restore the equivalence between outcome distributions and allow us to estimate treatment effects. This intuition motivates two classes of estimators: inverse probability weighting (IPW) and matching. IPW consists of reweighting the sample to balance covariates across treated and untreated observations, which should then have the same distribution of potential outcomes.[15]
Hirano, Imbens and Ridder (2003) propose IPW for average treatment effects, which Firpo (2007) extends to quantile effects. Donald and Hsu (2014) consider CDFs, whereas Firpo and Pinto (2016) discuss inequality indexes.[16] Matching consists of pairing observations according to covariates and differencing outcomes to estimate treatment effects. See Heckman and Vytlacil (2007) and Imbens and Wooldridge (2009) for surveys. Matching does not readily extend from average to distributional effects, because it relies on the law of iterated expectations (Frölich, 2007), which fails for quantiles and other nonlinear statistics.

The consistency of IPW estimation of distributional effects relies on the same assumptions and formulas as average effects. It requires:[17]

i. Selection on observables: potential outcomes, (Y_0, Y_1), are jointly independent of treatment status, D, given observed covariates, X.
ii. Common support: 0 < P(D = d | X = x) < 1 for all d and x.

Assumption (i) is the key identification assumption. Assumption (ii) is crucial, albeit standard. Matching requires that covariates take the same value range in the treated and untreated groups. Imbens (2015) presents methods to assess and enforce common support in the data.

We implement the IPW estimator in two steps. First, we compute the weight function:

ω_IPW(D, X) = D / P(D = 1) + (1 − D) / P(D = 1) × P(D = 1 | X) / (1 − P(D = 1 | X)).

Here, P(D = 1) is the unconditional probability of treatment take-up, which we can estimate with the share of the treated in the sample. The conditional probability P(D = 1 | X) is the propensity score, a building block of many estimators of average effects. See Appendix 2 for a discussion of estimation.

[13] We abstract henceforth from nonparticipation and nonresponse.
[14] This assumption is also known as unconfoundedness, ignorability and conditional independence.
[15] DiNardo, Fortin and Lemieux (1996) use reweighting to decompose changes in the density of wages in the U.S. in an early and influential paper.
However, they do not investigate the properties of their estimator.
[16] Cattaneo (2010) considers multivalued treatments.
[17] Additional regularity conditions are required. See Firpo (2007, p. 263).

To build intuition for ω_IPW(D, X), remember that we wish to estimate the quantile treatment effect on the treated. We can still estimate q_{Y_1}(τ | D = 1) from the treated. Thus, the function ω_IPW(D, X) gives them equal weight, 1/P(D = 1). To recover q_{Y_0}(τ | D = 1) from the untreated, however, their distribution of potential outcomes must be comparable to that of the treated. Hence, we reweight this subsample. By the assumption of selection on observables, we only need to balance the distribution of X. For that reason, ω_IPW(D, X) is increasing in the propensity score: we give more weight to untreated observations which resemble the treated and less weight to observations with characteristics that are uncommon among the treated.

After constructing ω̂_IPW(D, X),[18] we proceed as above. The QTT estimator is the difference between the quantiles q̂_{Y_1}(τ | D = 1) and q̂_{Y_0}(τ | D = 0) of the reweighted sample, which we compute separately for treated and untreated observations. As before, this approach extends to other statistics. It also accommodates alternative treatment effects, such as effects on the entire population: it suffices to adjust the weight function (Firpo and Pinto, 2015).

It is also possible to estimate quantile effects with quantile regressions. One should run a regression of outcomes on an intercept and a treatment indicator on the reweighted sample. Covariates should only enter the model through the weights ω̂_IPW(D, X). If they were included as control variables, we would estimate a conditional quantile effect (see Part 5).

The assumption of selection on observables is strong and has received extensive discussion in the program evaluation literature. The covariate set must be rich enough that the unexplained component of treatment take-up is independent of potential outcomes.
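The two-step procedure can be made concrete with a short sketch (our own illustration under the selection-on-observables assumption; the helper names `ipw_weights`, `weighted_quantile` and `qtt_ipw` are hypothetical, and we use the true propensity score of a simulated data set):

```python
import numpy as np

def ipw_weights(d, pscore):
    """QTT weights under selection on observables: treated units get
    1/P(D=1); untreated units are scaled by the propensity-score odds."""
    p1 = d.mean()  # estimate of P(D = 1): share of treated in the sample
    return np.where(d == 1, 1.0 / p1, (pscore / (1.0 - pscore)) / p1)

def weighted_quantile(y, w, tau):
    """tau-th quantile of y under weights w (inverse of the weighted ECDF)."""
    order = np.argsort(y)
    y, w = y[order], w[order]
    cdf = np.cumsum(w) / np.sum(w)
    return y[np.argmax(cdf >= tau)]

def qtt_ipw(y, d, pscore, tau):
    w = ipw_weights(d, pscore)
    return (weighted_quantile(y[d == 1], w[d == 1], tau)
            - weighted_quantile(y[d == 0], w[d == 0], tau))

# Simulation: take-up depends on x, the effect is 1 for everyone,
# so QTT(tau) = 1 at every quantile despite the selection.
rng = np.random.default_rng(1)
n = 20000
x = rng.normal(size=n)
pscore = 1.0 / (1.0 + np.exp(-x))              # true P(D = 1 | X)
d = (rng.uniform(size=n) < pscore).astype(int)
y = x + rng.normal(size=n) + d * 1.0
```

In this design the naive difference in unweighted quantiles is biased upward, because treated units have systematically higher x; the reweighted contrast removes that imbalance.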
This approach is popular in observational studies, and it may also be useful in many RCTs, since designers often collect extensive data on participants' backgrounds. Imbens (2015) suggests placebo tests to assess plausibility. The assumption becomes testable if a valid instrument is available: see Donald, Hsu and Lieli (2014) for details.

4. Applications with Selection on Unobservables: Instrumental Variables

The previous section leveraged the assumption of selection on observables to address selection bias. This section explores an alternative strategy: instrumental variables (IV).

IV estimation of average effects has a long tradition: see Heckman and Vytlacil (2007) and Imbens and Wooldridge (2009) for surveys. Imbens and Rubin (1997) and Abadie (2002) extend it to effects on distributions. Important early contributions for quantile effects are Abadie, Angrist and Imbens (2002) and Chernozhukov and Hansen (2004, 2005). Frölich and Melly (2013a,b) propose estimators of unconditional distributional effects.

As Imbens and Angrist (1994) argue, the IV framework only identifies treatment effects on compliers – the subpopulation whose treatment status depends on the value of the instrument. Further assumptions are necessary for identification of effects on the treated. Frölich and Melly (2013a) exploit one-sided noncompliance: for some value of the instrument, participants never take the intervention. For intuition, consider a clinical trial of a vaccine. Our candidate instrument is randomization. Participants in the treatment group might refuse the vaccination. On the other hand, the control group has no access to it. Noncompliance is one-sided in that it only affects the treatment group.[19]

[18] If the data include sampling weights, multiply ω̂_IPW(D, X) by the weights (Ridgeway et al., 2015).
[19] See the analysis of Head Start by Kline and Walters (2015) for a counter-example.
Their paper highlights the importance of verifying the assumption of one-sided noncompliance in practice.

This property implies that all the treated are compliers, which allows us to identify effects on the treated from effects on compliers.

This section presents the framework of Frölich and Melly (2013a). We assume the existence of a binary instrument Z, such that Z = 0 implies D = 0. One example is randomization.[20] Formally, we assume:[21]

i. Independent instrument: Y_0 is independent of Z given covariates, X, for all x such that P(D = 1 | X = x) > 0.
ii. One-sided noncompliance: P(D = 0 | Z = 0) = 1.
iii. Support condition: P(Z = 0 | X = x) > 0 for all x such that P(D = 1 | X = x) > 0.

If the assumption of one-sided noncompliance fails, this procedure yields consistent estimates of effects on treated compliers. It is also possible to obtain local treatment effects. See Frölich and Melly (2013b) for details.

Condition (i) ensures that the instrument is valid. We need full independence, which is stronger than the assumption of uncorrelatedness of the linear IV model. In an RCT, it means that Y_0 does not depend on treatment assignment, which is reasonable. Condition (iii) is analogous to the assumption of common support in the previous subsection.

We implement the IV estimator in two steps. First, we compute the weight function:

ω_IV(D, Z, X) = D / P(D = 1) + (1 − D) / P(D = 1) × (1 − P(Z = 0 | X) − Z) / P(Z = 0 | X).

Note the similarity to ω_IPW(D, X). Appendix 2 discusses the estimation of the conditional probability P(Z = 0 | X).

To build intuition for ω_IV(D, Z, X), suppose that the instrument is randomization, i.e., Z = T. The distribution of potential outcomes is initially balanced across the treatment and control groups, due to random assignment. Noncompliance distorts the distribution for the treated, which now consists of compliers. The untreated include noncompliers from the treatment group, as well as counterfactual compliers and noncompliers from the control group.
To recover q_{Y_0}(τ | D = 1), we would ideally restrict the untreated to counterfactual compliers, but we do not know who they are. Note, however, that the distribution of Y_0 for noncompliers is the same in the treatment and control groups, because of randomization. Therefore, giving negative weights to the outcomes of noncompliers from the treatment group makes them "cancel out" counterfactual noncompliers in the control group, leaving us with the distribution of Y_0 for counterfactual compliers!

Accounting for covariates accommodates applications in which the instrument is only conditionally exogenous, such as RCTs with stratified randomization and observational data. We might also want to purge indirect effects of an intervention to focus on particular mechanisms. Although controlling for covariates undoes the equivalence between the treated and compliers, the identification result of Frölich and Melly (2013a) holds nonetheless.

After constructing ω̂_IV(D, Z, X),[22] we proceed as above. The QTT estimator is the difference between the quantiles q̂_{Y_1}(τ | D = 1) and q̂_{Y_0}(τ | D = 0) of the reweighted sample, which we compute separately for treated and untreated observations.[23] This approach extends to other statistics, such as the variance.

[20] The estimator extends to multivalued and continuous IVs: see Section 3.1 in Frölich and Melly (2013a).
[21] Additional regularity conditions are required. See Frölich and Melly (2013a, p. 391). In particular, we assume that there is no endogenous attrition. The authors propose adjustments for attrition in their Section 4.
[22] If the data include sampling weights, multiply ω̂_IV(D, Z, X) by them (Ridgeway et al., 2015).

5. Interpreting Quantile Effects

Quantile effects are often misinterpreted, which can result in unwarranted conclusions for policy. This section discusses and illustrates two common pitfalls: implicit assumptions of rank invariance and extrapolating treatment effects to different populations.
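The negative-weighting logic can be illustrated in the same style as before. The sketch below is our own (hypothetical names, and it assumes the instrument probability P(Z = 0 | X) is the known constant 0.5): never-takers have higher baseline outcomes, so comparing the treated with all untreated is biased, while the IV weights net the never-takers out.

```python
import numpy as np

def iv_weights(d, z, pz0):
    """QTT weights with a binary instrument and one-sided noncompliance:
    untreated units get (1 - pz0 - z)/pz0, i.e. a positive weight in the
    control group and weight -1 for treatment-group noncompliers."""
    p1 = d.mean()  # estimate of P(D = 1)
    return np.where(d == 1, 1.0 / p1, (1.0 - pz0 - z) / pz0 / p1)

def weighted_quantile(y, w, tau):
    """tau-th quantile of y under (possibly negative) weights w."""
    order = np.argsort(y)
    y, w = y[order], w[order]
    cdf = np.cumsum(w) / np.sum(w)
    return y[np.argmax(cdf >= tau)]  # first crossing of the weighted ECDF

# Simulation: 60% compliers; never-takers have higher Y0; effect is 2.
rng = np.random.default_rng(2)
n = 40000
complier = rng.uniform(size=n) < 0.6
z = rng.integers(0, 2, size=n)                 # randomized instrument
d = (z == 1) & complier                        # one-sided noncompliance
y0 = rng.normal(size=n) + 1.0 * (~complier)    # never-takers do better
y = y0 + 2.0 * d
pz0 = 0.5                                      # P(Z = 0), known by design

w = iv_weights(d.astype(int), z, pz0)
qtt_iv = (weighted_quantile(y[d], w[d], 0.5)
          - weighted_quantile(y[~d], w[~d], 0.5))   # near the true effect of 2
```

Because the treatment-group noncompliers enter with weight −1, their contribution to the weighted ECDF cancels the control-group never-takers in expectation, which is exactly the "cancel out" intuition above.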
It is easy to conflate quantile effects (i.e., changes in the distribution of outcomes) and individual effects (i.e., the distribution of changes in outcomes). One might naively reason: "The median outcome in the control group is 50. The difference in medians is 10. John is in the control group and his outcome is 50. Therefore, his outcome would be 60 under treatment." Recall that the quantile effect is the difference in quantiles between treated and untreated participants. Thus, we implicitly assumed that John's outcome would be equal to the median of the treated if he underwent treatment himself – in other words, rank invariance (cf. Part 2). Rank invariance is a strong assumption, which is implausible when treatment effects differ across observably identical participants.

Assumptions of rank invariance are often subtle. For example, researchers might argue that all participants benefited from the intervention if all quantile effects are strictly positive. They would have invoked rank invariance across treatment status. One might also equate quantile effects and changes with respect to baseline outcomes instead of Y_0. This interpretation requires rank invariance both across treatment states and over time.

For illustration, consider Figure 2. It shows estimates of treatment effects, based on data from Simulation 1 (see Appendix 1 for details). We set T = 2, so that the first period is the baseline period. Individual effects are independent of baseline outcomes; as a consequence, the average treatment effect at each quantile of baseline outcomes is close to the overall average effect, as can be seen from the line labeled "ATE at Quantile of Baseline Y". Yet, quantile effects are increasing. The QTT (dashed line) clearly differs from both average effects at baseline values (thin line) and quantiles of the distribution of individual effects Δ (thick line).
This discrepancy underlines the notion that implicitly assuming rank invariance and misinterpreting quantile effects may lead to incorrect conclusions. Note, moreover, that all quantile effects are positive, even though some participants are worse off after treatment.

A different concern is the comparison of treatment effects across (sub-)populations. This part has surveyed the estimation of changes in the distribution of potential outcomes. As Section 4.2.2 argued, RCTs identify these marginal distributions, which allows us to compute treatment effects under minimal assumptions. However, these effects depend on the unidentified joint distribution of (Y_0, Y_1). As a consequence, it is difficult to extrapolate or compare results between different (sub-)populations, unless we can account for differences in both the distributions of Y_0 and the relation between Y_0 and Y_1. For example, suppose that we observe differences in QTTs across genders. By itself, this finding does not tell us whether these discrepancies arise from differences in responses to the intervention between men and women or from differences in outcomes in the absence of treatment. In other words, they can stem from gender differences in the unknown distribution of treatment effects, in the distribution of Y_0 or both. We can estimate the marginal distribution of outcomes, which allows us to compare treatment effects on quantiles that correspond to the same value of Y_0. See the discussion of Figures 6 and 7 in Bitler, Hoynes and Domina (2014) for an example and strategies to compare quantile effects across groups.

[23] As before, it is possible to estimate the QTT from a weighted quantile regression. The slope gives the QTT.

Figure 2: QTTs and Common Misinterpretations

4. Distribution of Treatment Impacts

1. Introduction

1. Definitions and Outline

Part 3 discusses methods to estimate the impact of a policy or program on (functions of) the marginal distribution of outcomes.
In this part, we are interested in the distribution of individual-specific treatment effects, Δ. The distribution of treatment effects is required to answer questions such as:

• What is the variance of treatment effects?
• What was the median impact of the program?
• What proportion of the population was hurt by the program?

As described in Part 2, these questions cannot be answered by the methods in Part 3, because, rather than effects on distributions, they concern effects on individuals.

The central difficulty of studying the distribution of individual treatment effects is that it requires a counterfactual outcome for every individual. Suppose person i is in the treatment group, so we observe Y_1i. In order to estimate the impact of treatment on person i, we need to predict the counterfactual control group outcome Y_0i, say by Ŷ_0i. Then, the individual treatment effect is the difference between the observed outcome in the treatment state and the counterfactual control group outcome for person i: Δ̂_i = Y_1i − Ŷ_0i. Similarly, for person j in the control group, the researcher needs to predict Y_1j, the outcome of person j if j were treated, in order to construct Δ̂_j = Ŷ_1j − Y_0j.

Importantly, the individual estimates Δ̂_i are only of interest to identify the distribution of treatment effects in order to answer questions such as those raised above. The estimated effect on individual i, Δ̂_i, is of less interest both because it is noisily estimated and because this individual has already been treated, so this treatment effect provides little information about future implementations of the policy. In Part 5, we discuss methods to analyze heterogeneity, which is informative about the types of people who benefit from a program.

In contrast to studying average treatment effects or differences in marginal distributions as in Part 3, individual counterfactuals cannot be identified by randomization alone.
Randomly selecting treatment and control groups identifies the marginal distributions of Y_1 and Y_0 but not how the outcomes of a single individual vary across the treatment or control states. In general, the marginal distributions and quantile treatment effects from Part 3 inform us of changes in the frequency of outcomes and inequality, but they only provide limited information about idiosyncratic responses to treatment (Bitler, Gelbach and Hoynes, 2014).[24] The distribution of impacts only equals the difference in marginal potential outcome distributions (or quantile effects) when individuals maintain exactly the same rank in both the treatment and control outcome distributions. This rank preservation or invariance condition implies that observations with the same rank in the treatment and control outcome distributions are appropriate counterfactuals for each other. When the rank invariance condition is satisfied, estimating the distribution of treatment effects only requires the methods discussed in Part 3. Rank invariance is a strong assumption. If it does not hold, the parameters from the previous part are still identified, but their interpretation can be difficult. The interpretation of the distribution of treatment effects is always clear, but the distribution is no longer identified by randomization if rank invariance fails. As a result, we must either make additional, sometimes strong, assumptions that imply individual counterfactuals to point identify the distribution of treatment effects, or we must settle for partial identification, where only a range of parameters is identified.

The empirical strategy depends on the validity of assumptions that we need to assess for the case at hand, so we first provide some background and then discuss more general estimation principles before we discuss a specific model that applies to many evaluations. In particular, we first illustrate the identification problem using the variance of treatment effects as an example.
In Section 4.2, we discuss partial identification and how additional assumptions can narrow the bounds from this approach. Section 4.3 discusses point identification. We first illustrate the required assumptions and provide an overview of common approaches to justify them. We then introduce methods to estimate features (moments) of the distribution or, under more stringent assumptions, the entire distribution. In practice, we need to adapt the methods to justify the assumptions and the estimation method to our application and data availability, so we close by giving an overview of a specific panel data model. The model is general enough to cover many common settings and can serve as a blueprint which is adaptable to other situations.

2. An Example: Variance of Treatment Effects

The vast majority of impact evaluations focus on average treatment effects: E(Δ) = E(Y_1 − Y_0). This is the first moment of the treatment effect distribution, which provides a measure of location of the distribution, i.e., how much individuals benefit on average. If treatment effects are constant, the average completely describes the effects of the program. However, when there is heterogeneity, a natural extension is to study higher-order moments of the distribution. For example, the variance of treatment effects is the second (centered) moment of this distribution and provides a measure of the dispersion of the treatment effects,[25] i.e., how much they vary across individuals. In the fictional RCT in Part 2, the variance of treatment effects summarizes how the impact of financial literacy training varies across students.

[24] As Bitler, Gelbach and Hoynes (2014) point out, quantile effects contain additional information about the distribution of individual effects. If one or more quantile effects are positive, at least one participant benefited from the intervention. The converse is also true. Note that Makarov bounds allow us to quantify the shares of winners and losers under minimal assumptions: see Section 4.3.
The variance of treatment effects is a useful measure of the importance of heterogeneity. For instance, if the square root of the variance is close to the average treatment effect, some individuals are likely to be harmed by the program. The variance of individual treatment effects is:

var(Δ) = var(Y_1) + var(Y_0) − 2 cov(Y_0, Y_1).

The variances of Y_1 and Y_0 are features of their marginal distributions and can be estimated using the methods discussed in Part 3. However, cov(Y_0, Y_1) requires the researcher to know how Y_1 relates to Y_0. Unfortunately, we can never observe the same person in both the treatment and control states simultaneously. As a result, the data do not identify cov(Y_0, Y_1) or, by extension, var(Δ), without additional assumptions.

Consider the mock data in Table 1, but, more realistically, suppose we do not know which outcomes under treatment are paired with outcomes in the absence of treatment. As usual, we can calculate the average treatment effect by taking the difference between the average outcome in the treatment and control groups,[26] which yields 1.6. Similarly, for the variance of treatment effects, we get:

var̂(Y_1 − Y_0) = var̂(Y_1) + var̂(Y_0) − 2 cov(Y_0, Y_1) = 1.36 + 0.94 − 2 cov(Y_0, Y_1).

Thus, the variance of treatment effects is not identified from the data alone. We face an analogous identification problem when estimating the entire distribution of individual treatment effects or other features of this distribution: while features of the marginal distributions of Y_0 and Y_1 can be calculated from the data, parameters that describe the relationship between Y_0 and Y_1 are necessary for point identification of the distribution of treatment effects.

We discuss two ways to proceed below. First, we can settle for bounds instead of point estimates, as discussed in Section 4.2. Bounds are often simple to obtain, but inference can be difficult. They also tend to be too wide to be informative, and narrowing them requires additional assumptions.
Second, we can estimate the (parameters of the) relationship between counterfactual outcomes, as we discuss in Section 4.3. This requires further assumptions and modeling, such as amending the RCT with a model of participation choice.

2. Bounding the Distribution of Treatment Effects

While the data cannot identify higher-order moments of the treatment effect distribution without additional assumptions, they can identify a range of values that must include the true value. Bounds do not tell us anything regarding where in the range the parameter is likely to lie, but they rule out parameter values outside this range, because the data are inconsistent with them. For example, if we can find the largest and smallest possible values of cov(Y_0, Y_1), we can plug these values into the formula above to obtain bounds on the variance of individual treatment effects, i.e., the largest and smallest possible values it can take. We first continue the example of the variance and then extend this idea to the distribution of treatment effects and discuss its advantages and problems.

Recall that cov(Y_0, Y_1) = ρ_01 σ_1 σ_0, where ρ_01 is the correlation between Y_1 and Y_0, σ_1 is the standard deviation of Y_1 and σ_0 is the standard deviation of Y_0. We can estimate σ_1 and σ_0 from the treatment and control group. The only remaining unknown is the correlation coefficient, ρ_01, which must lie between −1 and 1 by definition. Therefore, without additional assumptions,

−σ_1 σ_0 ≤ cov(Y_0, Y_1) ≤ σ_1 σ_0.

Substituting these bounds for cov(Y_0, Y_1) in the formula for var(Δ) yields bounds for var(Δ).

[25] Here, the variance of the treatment effect means the variance of individual treatment effects in the population. This is distinct from the variance of the estimate of the average treatment effect due to sampling that is used for inference on average effects.
[26] To keep the example simple, we abstract from sampling variation throughout.
Using the data in Table 1, the bounds are:

1.36 + 0.94 − 2√(1.36 × 0.94) ≤ var̂(Y_1 − Y_0) ≤ 1.36 + 0.94 + 2√(1.36 × 0.94),
0.04 ≤ var̂(Y_1 − Y_0) ≤ 4.56.

What can we learn from these bounds? At the upper bound, the average treatment effect is only 1.3 standard deviations from zero, so the data do not seem to rule out negative treatment effects. Since the lower bound is greater than zero, there has to be at least some treatment effect heterogeneity. In this way, bounds can also be used to test relevant hypotheses. For example, the classical standard approach to impact evaluation assumes that treatment effects are constant, so var(Δ) = 0. If the bounds do not include zero, as above, the data are not consistent with constant treatment effects (Heckman, Smith and Clements, 1997).[27] In Simulation 1, where treatment effects are normally distributed with unit mean and variance, the bounds on the standard deviation of treatment effects tell us that it must be between 0.78 and 3.13. Therefore, the bounds rule out a constant treatment effect, but include estimates over three times higher than the true standard deviation.

The bounds can be tightened using additional assumptions. The results in Heckman, Smith and Clements (1997) suggest that assuming positively correlated potential outcomes (ρ_01 ≥ 0) may be reasonable in some cases. If we are willing to assume that people who do well in the absence of treatment also do well with treatment, the bounds become:

0 ≤ cov(Y_0, Y_1) ≤ σ_1 σ_0.

Substituting these bounds into the formula for var̂(Y_1 − Y_0) yields the narrower bounds:

1.36 + 0.94 − 2√(1.36 × 0.94) ≤ var̂(Y_1 − Y_0) ≤ 1.36 + 0.94,
0.04 ≤ var̂(Y_1 − Y_0) ≤ 2.3.

The data in Table 1 actually imply ρ_01 = 0.21. Therefore, the true variance of the treatment effect is 1.84, which lies within these bounds.

[27] To conduct a formal hypothesis test, the researcher needs to calculate standard errors for the bounds. We return to this problem at the end of this section.
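The arithmetic of these bounds is easy to reproduce. A minimal sketch (our own function name; the two variances are the mock-data values quoted above):

```python
import math

def variance_bounds(v1, v0, rho_min=-1.0, rho_max=1.0):
    """Bounds on var(Delta) = var(Y1) + var(Y0) - 2*rho*sd(Y1)*sd(Y0)
    for an assumed range of the unidentified correlation rho."""
    s = math.sqrt(v1 * v0)  # sd(Y1) * sd(Y0)
    return v1 + v0 - 2.0 * rho_max * s, v1 + v0 - 2.0 * rho_min * s

lo, hi = variance_bounds(1.36, 0.94)                        # rho in [-1, 1]
lo_pos, hi_pos = variance_bounds(1.36, 0.94, rho_min=0.0)   # rho >= 0
```

With the mock-data variances this reproduces the figures in the text: roughly [0.04, 4.56] without restrictions and [0.04, 2.3] under ρ_01 ≥ 0.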
How does the example extend to learning about the distribution of treatment effects without making assumptions beyond what was required in Part 3? Just as in the case of the variance above, one can still use the marginal outcome distributions to calculate bounds on the entire distribution of treatment effects. Assume we have estimates of the marginal potential outcome distributions F_1(⋅) and F_0(⋅) from the treatment and control groups. Then the distribution of treatment effects, F_Δ(δ) = P(Δ ≤ δ), can be bounded at any point δ using the following Makarov bounds (Makarov, 1982; Firpo and Ridder, 2008):[28]

sup_y max{F_1(y) − F_0(y − δ), 0} ≤ F_Δ(δ) ≤ inf_y min{F_1(y) − F_0(y − δ) + 1, 1}.

Bounds may also be informative about specific questions of interest. For example, marginal potential outcome distributions are sometimes sufficient to find whether anyone was hurt by the program or the share of treatment effects that are negative, i.e., F_Δ(0). The Makarov bounds indicate the range of values for the joint distribution that are consistent with the observed marginal distributions. We break down the pieces of the lower bound as an illustration. The maximum function in the lower bound just imposes the logical restriction that a CDF cannot be negative. If we ignore the maximum for now, the bound simplifies to:

sup_y F_1(y) − F_0(y − δ),

which is just the largest vertical difference between the treatment group CDF and the (shifted) control group CDF. As with the lower bound, the minimum in the upper bound imposes the restriction that the CDF can be no greater than 1. Ignoring the minimum, the upper bound becomes:

inf_y F_1(y) − F_0(y − δ) + 1,

which is determined by the point where the treatment group looks best in comparison to the control group.

The first panel of Figure 3 illustrates Makarov bounds for the two data generating processes described in Simulation 2. In both cases, the average treatment effect is 1, but the standard deviation of treatment effects is 0 in the first case and 5 in the second case.
Despite the fact that treatment effects are constant in the first case, the bounds do not rule out sizeable heterogeneity. On the other hand, the bounds in the second case clearly rule out constant treatment effects.

The second panel of Figure 3 plots the densities of outcomes generated from the second data generating process. The treated outcome density has a slightly higher mean than the untreated density, but is much more dispersed. Consequently, a non-zero share of the treated density's mass falls below the minimum value of the untreated potential outcome density. This indicates that someone was hurt by the program in this hypothetical example. We cannot say who was hurt by the program, but we know that the proportion of people hurt by the program is at least F1[min(Y0)]. In this simulation, this bound indicates that at least 26% (and at most 59%) of individuals were hurt by the treatment. In fact, based on the simulated data, 40% of individuals were hurt by the treatment.

^28 Refer to Firpo and Ridder (2008) for a formula that yields narrower bounds by averaging across bounds from conditional distributions. Note that we have simplified the notation by assuming continuous CDFs.

Figure 3: Simulation 3, DGP and Makarov Bounds with σΔ = 5 and ρ01 = 0.5

In these examples, we have ignored sampling variation, i.e., that the marginal distributions are estimates rather than the true population distributions. Sampling variation exacerbates the problem that bounds are often wide (Heckman, Smith and Clements, 1997), since estimation error makes the possible range of parameters even wider than the point estimates of the upper and lower bounds. Moreover, obtaining the relevant standard errors is often difficult (see Subsection 6.2.1). Nonetheless, bounds are usually easy to calculate and clearly demonstrate what the data imply about the distribution. Thereby, they can provide a useful informal assessment of what can, and (more often) what cannot, be learned from the data alone.
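For concreteness, the Makarov bounds at a given δ can be computed from the two empirical CDFs. The sketch below is our own illustration (the function name and simulated data are ours, not the paper's supplemental code); evaluated at δ = 0, it bounds the share of individuals hurt by treatment:

```python
import numpy as np

def makarov_bounds(y1, y0, delta, grid_size=1000):
    """Makarov bounds on F_Delta(delta) = Pr(Y1 - Y0 <= delta)
    computed from samples of the two marginal outcome distributions."""
    y = np.linspace(min(y1.min(), y0.min() + delta),
                    max(y1.max(), y0.max() + delta), grid_size)
    # Empirical CDFs evaluated on the grid.
    F1 = np.searchsorted(np.sort(y1), y, side="right") / len(y1)
    F0 = np.searchsorted(np.sort(y0), y - delta, side="right") / len(y0)
    lower = np.max(np.maximum(F1 - F0, 0.0))
    upper = np.min(np.minimum(F1 - F0 + 1.0, 1.0))
    return lower, upper

# A constant unit effect, as in the first DGP of Simulation 2:
rng = np.random.default_rng(0)
y0 = rng.normal(0.0, 1.0, 10_000)
y1 = y0 + 1.0                      # every treatment effect equals 1
lo, hi = makarov_bounds(y1, y0, delta=0.0)
print(lo, hi)                      # bounds on the share hurt
```

Even though no one is hurt in this DGP, the upper bound on FΔ(0) is far from zero, echoing the point above that the marginals alone are often uninformative.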
For example, if the point estimates of the variance bounds include zero, the data are not informative about heterogeneity of treatment effects without further assumptions.

3. Point Identification of Features of the Distribution of Treatment Effects

When we are willing to make additional assumptions, we can make more progress in point identifying the distribution of treatment effects. We first continue the example of the variance of treatment effects and illustrate what kind of assumptions are required for point identification and what previous studies have done to justify them. The conditions under which these assumptions are plausible crucially depend on the substantive problem and the available data. We use cross-sectional notation to keep the exposition simple. In practice, one can easily adapt the methods to richer models, such as panel data models, by including individual fixed effects or lagged values in X. To allow researchers to choose and adapt the methods to their cases, we discuss two general estimation frameworks in the next subsection. We conclude with an example of a specific panel data model that can easily be adapted to many common empirical settings.

Continuing the example of the variance of treatment effects, we can point identify var(Δ) if we are willing to assume a particular value of the correlation between the treated and the untreated outcome. For example, suppose potential outcomes are uncorrelated, so that cov(Y0, Y1) = 0. In our mock example, this implies:

var(Y1 − Y0) = 1.36 + 0.94 = 2.3.

As we have seen above, the true covariance in our example is positive, so assuming that it is zero leads to an overestimate of the variance of treatment effects. More generally, we need to be able to identify the dependence between the treated and untreated outcomes to identify the distribution of treatment effects.
So far, we have primarily dealt with extreme cases (such as no or perfect dependence), in which the researcher explicitly chooses values for the parameters that determine the dependence of potential outcomes. While additional assumptions are always required, a much better case for them can be made in practice for two reasons. First, the researcher may have a model that includes the relevant dependence parameters and may be able to estimate them from this model under weaker conditions. For example, cov(Y0, Y1) is a crucial parameter of models of individual choice such as the (generalized) Roy model, which can be estimated under conditions outlined in Heckman and Honoré (1990) and extended in Abbring and Heckman (2007). Thus, rather than drawing a convenient value from thin air as we do here for ease of exposition, we may be able to estimate the required parameters under plausible assumptions by amending the RCT with a model of individual choice. Second, the assumptions are usually only required to hold conditional on observables, i.e., after controlling for X, as Subsection 4.3.2 describes.

1. Identification and Key Assumptions

Identification of the distribution of treatment effects comes from restrictions on the dependence between the part of the untreated outcome that is not explained by the covariates and the size of the treatment effect. In this subsection, we discuss why these assumptions are necessary, how they solve the identification problem, and why covariates are important even in an RCT. Throughout this part, we assume perfect compliance, so that one can use treatment assignment and treatment receipt interchangeably. We define parameters and estimators in terms of treatment receipt here, as the distribution of individual "intent-to-treat" parameters is unlikely to be interesting and is not clearly defined. If perfect compliance fails, one needs to adapt the re-weighting or IV methods from Part 3.
For simplicity, we assume that the effect of covariates is linear and additively separable:

Y0 = Xβ + ε,
Y1 = Y0 + Δ = Xβ + Δ + ε,

where Δ is an individual-specific impact of the treatment. This can conveniently be written in one equation as:

Y = Xβ + DΔ + ε.

This model is simple to extend: e.g., including non-linear functions of X is straightforward. It is also more general than the resemblance to a standard regression equation suggests. In impact evaluation, we are interested in Δ and not in the effect of X on Y0. Thus, one can think of β as the linear projection coefficient, so that X and ε are uncorrelated by construction (but not necessarily independent). If Δ depends on X, the control group still identifies β, so that the treatment group identifies the (not necessarily causal) relation of the treatment effect to observables. Thus, as long as randomization works, we can purge the observable part of the model by partial regression, i.e., by regressing Y on X (and on interactions of D with X if treatment effects depend on observables) and working with residuals from this regression as if X does not matter.^29

In practice, including X in the model and estimating it in one step may be more convenient than partial regression. However, partial regression provides a useful thought device, as it leaves only the unobservable part of the model, which consists of Δ + ε for the treated and only ε for the control group. This is a benefit of randomization that helps to identify the distribution of treatment effects. To see that it is not sufficient, consider the analogy to identifying mean effects, where one would compute the mean of ε from the control group, the mean of Δ + ε from the treatment group, and obtain the mean of Δ as their difference. Extending this to distributions, the treatment group identifies the distribution of the sum of Δ and ε and the control group identifies the distribution of ε. However, one cannot back out the distribution of Δ from these two distributions: contrary to means, the difference between two distributions is not the distribution of the differences.
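This non-identification is easy to see in a small simulation of our own (not from the paper): two joint distributions with identical marginals, Y0 ~ N(0,1) and Y1 ~ N(1,1), generate radically different treatment-effect distributions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
z = rng.normal(size=n)        # draws defining Y0
z_indep = rng.normal(size=n)  # independent draws with the same distribution

# Coupling 1 (rank-preserving): Y1 = Y0 + 1, so every effect equals 1.
delta_comonotone = (z + 1.0) - z
# Coupling 2 (independent): same marginals, but sd(Y1 - Y0) = sqrt(2).
delta_independent = (z_indep + 1.0) - z

print(delta_comonotone.std())   # ~0: constant effects
print(delta_independent.std())  # ~1.41: large heterogeneity
```

Both couplings are fully consistent with what an RCT reveals, which is exactly why restrictions on the dependence structure are needed.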
To make progress, additional restrictions on the dependence between ε, the part of the untreated outcome that is not related to the covariates X, and the treatment effect Δ are required. We saw above that, if we are willing to assume that individuals' potential outcomes are uncorrelated across treatment states, then the variance of treatment effects can be point identified. Technically, we have achieved identification by imposing a moment restriction. All three terms in the formula for the variance of treatment effects above are second moments of the data, and the covariance is the only moment that depends on the joint distribution. Restricting it to zero leaves only terms that we can estimate, since the other two terms, the variances of Y0 and Y1, only depend on the marginal distributions. Similarly, in the conditional case, we can identify the variance of individual treatment effects if cov(Δ, ε) = 0, since:

var(Δ) = var(Δ + ε) − var(ε) − 2cov(Δ, ε) = var(Δ + ε) − var(ε).

We can estimate the first term using the treatment group and the second from the control group. This idea generalizes. If we are willing to assume that all third moments that are not determined by the marginal distributions are zero, for example, the third moment of the distribution of treatment effects is identified. We provide more detail on moment estimation in Subsection 4.4.1. The limiting case of this idea is to assume that all moments of the joint distribution only depend on the moments of the marginal distributions. This implies that treatment effects and the unexplained part of the untreated outcome are independent conditional on X. Then the entire distribution of treatment effects can often be estimated using the method of deconvolution, as discussed in Subsection 4.4.2.

^29 Note that if randomization has been compromised, e.g., by noncompliance, the same has to be done for D.

2.
Justifying Identification Assumptions

The discussion above shows that identification of (features of) the distribution of treatment effects requires independence assumptions. How can we justify such assumptions? They are satisfied if treatment effect heterogeneity is unrelated to any unobservable aspects of the individual, which underscores the importance of covariates. Unlike methods relying on randomization, conditioning on covariates is typically required for identification here. The covariates control for variation in Y0 that is potentially correlated with impact heterogeneity. Consequently, the assumptions required to estimate (moments of) the distribution of treatment effects may be plausible when the data include a rich set of individual characteristics. This is because the conditional independence assumption allows Δ to depend on observable characteristics of the person but not on any unobservable characteristics or variables that have been excluded from the model. Such a restriction seems more plausible the better the model of untreated outcomes is. As a simple example, when the model contains no covariates, ε = Y0, so assuming Δ and ε are uncorrelated amounts to the (usually unrealistic) assumption that levels (Y0 = ε) and gains (Δ) are not related. However, RCTs in economics often collect detailed information, including baseline values. Including baseline outcomes changes this restriction to assuming that gains from the program are unrelated to deviations of the outcome from its expected path. While the independence assumption usually seems unrealistic in cross-sectional applications, it often seems more plausible in panel data settings.

Consequently, a key component of making the conditional independence assumption credible is to attempt to control for all variables that are related to both the size of the treatment effect and the untreated outcome.
Subject to the usual caveats (see, e.g., Angrist and Pischke, 2009), this suggests controlling for many characteristics in a flexible way and assessing the robustness of results to changes in the conditioning variables. However, just as in a standard regression, regardless of data availability and modeling, whether the assumption can be justified or not depends on the application at hand.

If we find the assumption that Δ is conditionally independent of ε too strong, we may turn to several tools from the literature to enhance credibility. Aakvik, Heckman and Vytlacil (2005) show that this assumption can be weakened by taking a random effects factor approach. Suppose the data are generated by:

Y = Xβ + DΔ + θ + ε,

where θ is a vector of all unobserved variables which are potentially correlated with Δ but are independent of X and ε. If we can control for θ, the distribution of Δ can be recovered using deconvolution. Aakvik, Heckman and Vytlacil (2005) assume that θ is an individual-specific random effect with a particular distribution and estimate the model using maximum likelihood. If panel data are available, we can model θ using a less restrictive fixed effects approach, as in the example in Subsection 4.4.4 below.

If we have access to multiple related measures of important omitted confounders (or proxies), we can estimate the distribution of θ using a measurement system (Carneiro, Hansen and Heckman, 2003). To be concrete, suppose θ represents ability. If we observe three test scores, we may be willing to assume:

T1 = θ + ν1,
T2 = λ2 θ + ν2,
T3 = λ3 θ + ν3,

where θ, ν1, ν2 and ν3 are all mutually independent and identically distributed across individuals and E(ν1) = E(ν2) = E(ν3) = 0. Then, by Kotlarski (1967), fθ(⋅) is nonparametrically identified, which implies fΔ(⋅) can be recovered using deconvolution. The supplemental appendix of Arcidiacono et al. (2011) provides more detailed instructions on how to use Kotlarski's Theorem to identify fθ(⋅).
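Even before invoking Kotlarski's full nonparametric result, the second moments of such a measurement system already identify var(θ), because the loadings cancel out of a ratio of pairwise covariances. A small simulation sketch of our own (the loadings and variances are illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
theta = rng.normal(0.0, 1.5, n)        # latent factor, var(theta) = 2.25
t1 = theta + rng.normal(size=n)        # loading on T1 normalized to 1
t2 = 0.8 * theta + rng.normal(size=n)  # T2 = lambda2*theta + nu2
t3 = 1.2 * theta + rng.normal(size=n)  # T3 = lambda3*theta + nu3

# cov(T1,T2) = l2*v, cov(T1,T3) = l3*v, cov(T2,T3) = l2*l3*v, so the
# ratio below equals v = var(theta) whatever the loadings are.
c12 = np.cov(t1, t2)[0, 1]
c13 = np.cov(t1, t3)[0, 1]
c23 = np.cov(t2, t3)[0, 1]
var_theta = c12 * c13 / c23
print(var_theta)  # close to 2.25
```

Recovering the full distribution fθ(⋅), rather than just its variance, is what requires the deconvolution argument in the references above.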
The recent literature on dynamic factor models (e.g., Cunha, Heckman and Schennach, 2010) combines this approach with the advantages of panel data.

4. Estimation Methods

This section presents two approaches to identifying and estimating features of the distribution of treatment effects. In Subsection 4.4.1, we extend our discussion of estimating the variance of treatment effects to higher order moments of the distribution. This first approach relaxes some of the assumptions required to estimate the full distribution of treatment effects. It is therefore preferable for questions that can be answered by moments of the distribution of treatment effects. For example, what is the variance of treatment effects? In Subsection 4.4.2, we show how to use deconvolution to estimate the entire distribution of treatment effects, fΔ(⋅), when our model sufficiently captures the variation in the untreated outcome that is related to treatment effect heterogeneity, so that Δ and ε are independent. All of the policy questions raised at the beginning of this section can be answered by computing features of this distribution. Moreover, it is simple to calculate the variance of the estimated distribution to assess the variability of treatment effects or the fraction of individuals hurt by the training, F̂Δ(0). We present simulation results to illustrate these two methods in Subsection 4.4.3.

We continue to use our basic model from the previous section to keep the discussion simple. However, caution is warranted when applying the methods in practice, as the required conditional independence assumption is strong. This assumption likely requires a richer underlying model and may thereby limit how useful these methods are in practice. Subsection 4.4.4 presents additional details of these methods in a specific panel data context where the conditional independence assumption may be plausible.

1.
Identifying and Estimating Moments

As the example of the variance shows, moments of the distribution of treatment effects are identified under weaker conditions and often provide sufficient information to answer policy questions.^30 The first two moments of a distribution imply its variance and the first three moments imply its skewness. Skewness is informative about how lopsided the distribution is. For example, a large negative skewness indicates that an important number of individuals have treatment effects not far above the mean impact while a few individuals have treatment effects far below the mean impact. With many vaccines and medications, one may be worried about very lopsided distributions where most people benefit modestly, but a few people are severely harmed. Examining whether the skewness is large and negative provides evidence on this issue.

While moments are usually easy to estimate by their sample analog, Δ is never observed, so we cannot estimate its moments directly. However, we can calculate the residuals from a partial regression of Y on X for the treatment and control group. Recall from above that if covariates are linearly related to Y0 and Y1, the residual from a partial regression of Y on X corresponds to Δ + ε for treatment group observations and ε for control group observations. We can calculate the variance of treatment effects from the first two moments of the treatment effects distribution, which we can estimate as moments of our partial residuals:

var(Δ) = E(Δ²) − E(Δ)² = E[(Δ + ε)²] − E(ε²) − E(Δ)².

The first two terms are the second moments of the treated and untreated partial residuals, and E(Δ) is just the average treatment effect. Thus, each of these expectations can be estimated using partial residuals. This is possible because we have assumed that Δ and ε are uncorrelated, so E(Δε) = 0.

^30 As a reminder, the k-th moment of the distribution of treatment effects is defined as E(Δ^k).
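As a sketch of this calculation, the simulation below (our own illustration; variable names are ours) draws effects satisfying E(Δε) = 0, residualizes the outcome on the covariate, and applies the moment formula:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(size=n)           # covariate
d = rng.integers(0, 2, n)        # random assignment
eps = rng.normal(size=n)         # unexplained untreated outcome
delta = rng.normal(1.0, 0.5, n)  # effects drawn independently of eps
y = 2.0 * x + d * delta + eps

# Partial regression: residualize y on x, then apply the moment formula
# var(delta) = E[(delta+eps)^2] - E[eps^2] - E[delta]^2.
slope = np.polyfit(x, y, 1)[0]
r = y - slope * x
rt, rc = r[d == 1], r[d == 0]
ate = rt.mean() - rc.mean()
var_delta = np.mean(rt**2) - np.mean(rc**2) - ate**2
print(ate, var_delta)  # close to 1 and 0.25
```

The same two groups of partial residuals are reused below for the higher order moments.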
This is just a conditional version of the assumption that cov(Y0, Y1) = 0 that we used to identify the variance of treatment effects in our example above. The skewness of the distribution of treatment effects is given by:

skewness(Δ) = [E(Δ³) − 3E(Δ)var(Δ) − E(Δ)³] / var(Δ)^(3/2).

Once we know the variance of treatment effects, the only unknown term in this equation is the third moment, E(Δ³). To estimate it from our partial regression residuals, we need to extend our assumption to the cross moments E(Δ²ε) and E(Δε²): any moment that is not determined solely by the marginal distributions of potential outcomes must reduce to the product of marginal moments (for example, E(Δε²) = E(Δ)E(ε²)). This is a restriction on how the variance of the error term is related to the treatment effect and vice versa. Under this assumption, the third moment of the distribution of treatment impacts is given by:

E(Δ³) = E[(Δ + ε)³] − 3E(Δ)E(ε²) − E(ε³).

Thus, we can estimate the skewness of the distribution of treatment effects from moments of our partial residuals. Appendix 3 shows how to use the binomial formula to solve for higher order moments.

In theory, we can use this method to estimate all moments (or k = ∞). This requires all cross moments that are not determined by the marginal distributions to reduce to products of marginal moments. Of course, estimating infinitely many moments is impractical. Thus, one may prefer to estimate the first k moments and select one of the (usually many) distributions they are consistent with. For example, in Section 7.2, we estimate the entire distribution of treatment effects by assuming they are normally distributed and by plugging in mean and variance estimates. Wu and Perloff (2006) estimate the first four moments and recommend selecting the distribution according to the Principle of Maximum Entropy. This distribution is unique and is "maximally noncommittal with regard to missing information" (Jaynes, 1957; Wu and Perloff, 2006). See their paper for details.

2.
Deconvolution

Deconvolution methods estimate the distribution of treatment effects by removing the variation due to the error term from treatment group observations, so that the only remaining variation is due to the treatment effect. Disentangling the variation of the treatment effect from that of the error term requires a strong conditional independence assumption. More precisely, conditional independence requires that the random component of Y conditional on (D, X), namely Δ + ε, is the sum of two independent random variables for the treatment group. The distribution of the sum of two independent random variables is called the convolution of these random variables. Deconvolution methods undo this convolution by removing the distribution of one of the random variables from the sum. In our simple model, deconvolution requires Δ to be statistically independent of ε given X.^31

A testable implication of these assumptions is that:

var(Y1|X) = var(Δ) + var(ε) ≥ var(ε) = var(Y0|X).

This can be tested using the marginal outcome distributions. If var(Y1) is not greater than var(Y0), we should be wary of using this approach. Note that var(Y1) is greater than var(Y0) in all of the simulations in Subsection 4.4.3 below, even though the underlying assumptions are violated in some cases.

Deconvolution is often implemented via estimated characteristic functions (e.g., Bonhomme and Robin, 2010). Here, we present a simpler algorithm by Mallows (2007), which Arellano and Bonhomme (2012) find works well in practice. It estimates fΔ(⋅) by randomly matching treatment group residuals with control group residuals for the full sample many times. To build intuition for this algorithm, consider again the partial residuals from above. For members of the control group, they contain just the error term, ε. For someone in the treatment group, the partial residual is the sum of the treatment effect and the error term, Δ + ε.
Since Δ and ε are independent by assumption, we can create pseudo-draws of Δ to approximate its distribution using:

Δ̂ = (Δ + ε)̂ − ε̂,

where ε̂ is a control-group partial residual and (Δ + ε)̂ is a treatment-group partial residual. Mallows' Algorithm "shrinks" estimates of Δ by matching treatment- and control-group partial residuals under the assumption that large values of treatment group partial residuals are associated with larger than usual error terms and small treatment group partial residuals are associated with smaller than usual error terms. The algorithm then draws a new pseudo-sample of treatment residuals by randomly matching its shrinkage estimates of Δ with residuals and estimating a new shrinkage distribution. Mallows' Algorithm is described in more detail for the conventional linear model below:

Y = Xβ + DΔ + ε.

To better distinguish treatment and control partial residuals, we will follow Arellano and Bonhomme's (2012) notation: let ya denote the treatment-group partial residuals (ya = Δ + ε), yb the control-group partial residuals (yb = ε), and b a vector of simulated draws of Δ, so that ya = b + yb. To be sure, ya and yb are just vectors of partial residual estimates, whereas b is a vector of simulated draws from the (unknown) distribution of treatment effects.

To prepare the data for Mallows' Algorithm, use estimates from the conventional linear model above to form ya and yb using treatment and control group values of Y − Xβ̂, respectively.^32 Then, run the algorithm as follows.

1. Set b(0) = sort(ya − yb).
2. Let b̃ be a random permutation of the current b(s).
3. Let ỹa denote ya sorted according to the order of b̃ + yb.
4. Set b(s+1) = sort(ỹa − yb).

Repeat steps 2 through 4 many times.^33 Each b(s) is a pseudo-sample drawn from fΔ(⋅).

^31 The model also requires additive separability and restrictions on the dependence of ε and X or Δ, as discussed above. However, these assumptions are specific to the simple model that we use for clarity here. They can potentially be relaxed; e.g., when many periods of pre- and post-treatment observations are available, one can use flexible panel data models, such as in Jacobson, LaLonde and Sullivan (1993).
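The steps above can be sketched in a few lines. This is our own illustrative implementation (function and variable names, and the simulated sanity check, are ours; see the supplemental material for the toolkit's own program):

```python
import numpy as np

def mallows_deconvolution(res_treat, res_ctrl, n_iter=500, burn=100, seed=0):
    """Pseudo-draws from f_Delta via Mallows' algorithm.

    res_treat: treatment-group partial residuals (delta + eps).
    res_ctrl:  control-group partial residuals (eps); equal length assumed.
    """
    rng = np.random.default_rng(seed)
    ya_raw = np.asarray(res_treat)
    yb = np.asarray(res_ctrl)
    ya = np.sort(ya_raw)
    b = np.sort(ya_raw - yb)                 # step 1: initial draws
    draws = []
    for s in range(n_iter):
        b_perm = rng.permutation(b)          # step 2: random permutation
        synth = b_perm + yb                  # synthetic treated residuals
        ya_matched = np.empty_like(ya)       # step 3: rank-match actual
        ya_matched[np.argsort(synth)] = ya   #         to synthetic residuals
        b = np.sort(ya_matched - yb)         # step 4: new shrinkage estimate
        if s >= burn:
            draws.append(b.copy())
    return np.concatenate(draws)

# Sanity check where the truth is known: delta ~ N(1, 1), eps ~ N(0, 1).
rng = np.random.default_rng(1)
n = 2_000
res_c = rng.normal(size=n)
res_t = rng.normal(1.0, 1.0, n) + rng.normal(size=n)
d = mallows_deconvolution(res_t, res_c)
print(d.mean(), d.std())  # mean ~1 by construction; std shrinks toward ~1
```

Note that naive random matching would overstate the spread (a standard deviation near √2 here); the rank-matching step is what shrinks the draws toward the true distribution.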
The empirical distribution of the full set of b(s)'s is therefore an estimate of fΔ(⋅). An example program is in the supplemental material available online.

3. Simulation Results

Table 2 shows the results of Simulation 3. We present results using six different combinations of parameters, which determine the correlation between treatment effects and omitted variables and the independence of treatment effects from potential outcomes. The moment and deconvolution estimation methods always yield the same average effect, but their estimates of the standard deviation of treatment effects differ markedly. As we would expect, both the deconvolution and moment methods perform relatively better when their assumptions are satisfied, which is the case for deconvolution in column one only and for the moment-based approach in column two only. In our simulation setup, the moment approach yields standard deviation estimates that are closer to the truth when an omitted variable is negatively correlated with treatment effects, regardless of whether the covariance satisfies the moment or deconvolution assumptions. In contrast, the deconvolution standard deviation estimates are a better approximation of the truth when the omitted variable is positively correlated with treatment effects.

^32 The full vector yb, or complete replications of it, should be used in the algorithm, since it is mean zero by construction. ya and yb must be the same size. If ya is larger than yb, replace yb with bootstrap draws with replacement, where the number of draws equals the length of ya. If yb is larger than ya, either replace yb with that many bootstrap draws with replacement, or with twice as many draws and replace ya with [ya′ ya′]′.

^33 Arellano and Bonhomme (2012) repeat this procedure 2,000 times and discard the first 500 iterations. In practice, the largest number of iterations that computing and time constraints allow is desirable.
Table 2: Results from Simulation 3

                                Correlation with Omitted Variable
                              None            Negative          Positive
Δ independent of Y0         Yes     No      Yes      No       Yes     No

A. Summary Statistics of Actual Effects
Mean                        1.05    1.06    –0.38    –0.36    1.52    1.54
Standard deviation          0.70    1.22     1.63     1.92    0.86    1.32
Skewness                    0.06    0.06     0.02    –0.01    0.12    0.12

B. Summary Statistics from Moments
Mean                        1.14    1.16    –0.34    –0.32    1.64    1.65
Standard deviation          1.40    1.38     1.63     1.62    1.63    1.62
Skewness                    0.05    0.05    –0.01    –0.02    0.04    0.03

C. Summary Statistics from Deconvolution
Mean                        1.14    1.16    –0.34    –0.32    1.64    1.65
Standard deviation          0.69    0.72     1.09     1.10    1.09    1.10
Skewness                    0.03    0.03    –0.03    –0.04    0.03    0.03

Notes: Estimates based on Simulation 3, with ρ01 equal to 0.5 or 0 and the omitted-variable correlation equal to –1, 0, or 1. When ρ01 is 0.5, treatment effects are independent of potential outcomes, so the deconvolution assumptions are satisfied. When ρ01 is 0, the moments assumptions are satisfied. When the omitted-variable correlation is 0, there is no omitted variable. When it is 1 or –1, there is an omitted variable which is positively or negatively correlated with both outcomes and treatment effects, respectively.

4. A Specific Application with Panel Data

In practice, we likely need to extend the simple model from above and adapt the methods to a specific setting. To provide some guidance for this, we next discuss the panel data setting from Arellano and Bonhomme (2012), which can easily be adapted if the assumptions above do not seem plausible or the available data set has a different structure. The conditional independence assumption required for deconvolution is more likely to be satisfied when the researcher can control for a rich set of individual characteristics. Panel data are a special case that also allows the researcher to flexibly control for unobserved individual heterogeneity. As discussed previously, the central difficulty in studying the distribution of treatment impacts is that a counterfactual outcome is needed for every individual in the sample.
In a panel, the same individual can often be observed in the treated and control states at different times. This allows one to model individual heterogeneity quite flexibly, but instead requires restrictions on how the treatment effect and the residual are allowed to vary over time, as we show below. Arellano and Bonhomme (2012) show how to estimate higher order moments of the treatment effect distribution and apply deconvolution methods to estimate the full distribution of treatment effects when panel data are available.

Consider the panel version of the linear model we work with throughout this section:

y_it = α_i + δ_t + Δ_i D_it + x_it′β + ε_it,  t = 1, …, T.

Call this model (1). Here, D_it is an indicator for whether person i has ever been treated by period t. Outcomes are shifted by the treatment effect in all periods after an individual is treated, so this model implies that the treatment effect is constant over the time horizon of the panel.^34 This may be a strong assumption, since treatment effects may not begin immediately and/or may fade over time. For example, when analyzing the impact of a school management intervention on students' test scores, this amounts to assuming the intervention had constant impacts over the entire period of data collection (i.e., three years in the application in Section 7.2). With very long panels, this can be relaxed by modeling decay parametrically. A more general model is:

y_it = α_i + δ_t + Δ_i D_it + γ_i D_it (t − t*_i) + x_it′β + ε_it,  t = 1, …, T,

where t*_i is the period in which individual i received treatment. Here, the treatment effect is assumed to evolve linearly over time according to the (non-stochastic) linear equation Δ_i,t−t* = Δ_i + γ_i (t − t*_i). Arellano and Bonhomme (2012) show that the variance and distribution of Δ_i in model (1) are identified under the following two assumptions:

i. Conditional independence: Δ_i is statistically independent of ε_it.
ii. Error dynamics: The model requires restrictions on the dependence of errors over time, so that we can estimate the time series process of ε_it.
See Arellano and Bonhomme (2012) for further detail. In their application, they assume that the errors follow an autoregressive or moving-average process with independent and identically distributed innovations in each period.

Note that the conditional independence assumption is now conditional on an individual fixed effect and the time-varying controls, x_it. The only remaining source of potentially problematic residual variation is time-varying unobservables. While this is still a strong independence assumption, it is likely more plausible than in the cross-sectional setting. The second assumption requires that the ε_it are not too correlated across periods. This is required to ensure that the treatment effect can be disentangled from the persistent component of the error terms. Importantly, the relationship between Δ_i and α_i is left unrestricted.^35

With panel data, the identification of the distribution of individual treatment effects relies primarily on how the outcomes of treated individuals vary between treatment and control states.^36 While this allows us to estimate particular Δ_i, we are primarily interested in the distribution of treatment effects, as these individual estimates will be quite noisy in most applications and most policy relevant questions relate to future realizations of treatment effects. Therefore, estimating the distribution of treatment effects using these models requires the data to have several features. Here, we will focus on model (1) under the assumptions of Arellano and Bonhomme (2012). In particular, we assume someone is "treated" if they have ever received the treatment. In order to identify this model, individuals must be observed at least three times, including at least one period before and another after the individual receives the treatment. Suppose individual i is observed three times and receives the treatment in the third period. Then, her observed outcomes are given by:

y_i,t = α_i + δ_t + x_i,t′β + ε_i,t,

^34 Alternatively, we could define D_it as an indicator for being treated in period t.
In this case, treatment is assumed to affect the individual only in treated periods.

^35 In non-experimental settings, the relationship between α_i, Δ_i, and D_it is unrestricted.

^36 Control group variation aids estimation of common parameters that vary over time.

y_i,t+1 = α_i + δ_{t+1} + x_i,t+1′β + ε_i,t+1,
y_i,t+2 = α_i + Δ_i + δ_{t+2} + x_i,t+2′β + ε_i,t+2.

Consider the variation in the data needed to identify each of the above parameters. First, observe that α_i and Δ_i are individual-specific, so that only individual i's observations are informative about their values. If not for δ_t and β, the parameters that are common across individuals, α_i and Δ_i could be estimated separately for each individual. For this reason, Δ_i is only identified for the subpopulation of individuals who are observed in both the treatment and control states. In contrast, untreated observations are informative about the common parameters, δ_t and β.

If, instead, individual i were treated in the second and third periods, her observed outcomes would be:

y_i,t = α_i + δ_t + x_i,t′β + ε_i,t,
y_i,t+1 = α_i + Δ_i + δ_{t+1} + x_i,t+1′β + ε_i,t+1,
y_i,t+2 = α_i + Δ_i + δ_{t+2} + x_i,t+2′β + ε_i,t+2.

In principle, the distribution of Δ_i is identified if each individual is only treated in one period or, even in a cross-section, if the errors are independent and identically distributed. This assumption implies that the residual from any individual j at any time period s, ε_js, could be used as a counterfactual for the residual of individual i at time t. If one only uses a cross-section, this reduces to the assumption that residuals from another individual at the same time, ε_jt, are valid counterfactuals for ε_it. Exploiting panel data to use only within-individual variation, as we do here, relaxes this assumption to require only that residuals from the same person, ε_it′, be valid counterfactuals for ε_it for t′ ≠ t. We use only individual i's observations to estimate Δ_i, so consistency is in the number of treated and untreated periods, not the number of individuals. This is illustrated in Figure 4 below, which is generated from Simulation 1 with T = 4 (left panel) and T = 16 (right panel).
In particular, notice that the distribution of regression estimates of the treatment effects is quite different from the true distribution and the deconvolution estimates when $T = 4$. The estimates are much more similar when $T = 16$. [37]

Figure 4: Estimates of Distribution of Treatment Effects (Left: $T = 4$; Right: $T = 16$)

[37] The required number of time periods depends on the standard errors of $\hat\beta_i$, so there is no general rule. However, whether the variation due to estimation error is minimal can easily be examined in any given application.

In general, estimates of $\beta_i$ will be noisy when individuals are only treated in one, or even a few, periods. Moreover, estimates of treatment impacts on a particular individual are not of particular policy interest, since the individual has already received the treatment in question. The distribution of treatment effects is potentially more informative about how individuals are likely to be affected by future applications of a program. Estimates of individual treatment effects may be of interest for analyzing how $\beta_i$ varies by individual characteristics, which we discuss in more detail in Part 5. For this purpose, the individual estimates can be regressed on covariates, but standard errors should be adjusted to account for the fact that the dependent variables are themselves estimates.

A less ambitious goal than estimating $\beta_i$ for each treated individual is to estimate the distribution of treatment effects, $F_\beta(\cdot)$. When $T$, the number of periods in the panel, is large, the distribution of estimated individual treatment impacts, $\hat F_{\hat\beta}(\cdot)$, is likely to be an accurate approximation of $F_\beta(\cdot)$. In such cases, one may only need to calculate or plot the distribution of the estimated treatment effects. Unfortunately, $T$ is typically relatively small in RCTs. When $T$ is small, each $\hat\beta_i$ is estimated using few observations, so $\hat F_{\hat\beta}(\cdot)$ is an inflated version of $F_\beta(\cdot)$. Therefore, one needs to correct for this estimation error using deconvolution, as specified in step 3 below.
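Why a small $T$ inflates the spread of the estimated effects can be illustrated with a minimal simulation. This is a hypothetical data-generating process and a simple per-individual difference-in-means estimator, not the paper's Simulation 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_individual_effects(n_ind, t_control, t_treated, sd_beta=1.0, sd_eps=1.0):
    """Difference-in-means estimate of beta_i for each individual:
    mean of treated-period outcomes minus mean of control-period outcomes."""
    alpha = rng.normal(0.0, 1.0, n_ind)         # individual fixed effects
    beta = rng.normal(2.0, sd_beta, n_ind)      # heterogeneous true effects
    y0 = alpha[:, None] + rng.normal(0.0, sd_eps, (n_ind, t_control))
    y1 = (alpha + beta)[:, None] + rng.normal(0.0, sd_eps, (n_ind, t_treated))
    beta_hat = y1.mean(axis=1) - y0.mean(axis=1)
    return beta, beta_hat

# Few periods: the spread of beta_hat overstates the spread of the true
# effects, so the naive distribution of estimates is "inflated".
beta_s, beta_hat_s = estimate_individual_effects(500, t_control=2, t_treated=2)
# Many periods: the inflation largely disappears.
beta_l, beta_hat_l = estimate_individual_effects(500, t_control=50, t_treated=50)
print(beta_s.std(), beta_hat_s.std())   # roughly 1.0 vs. 1.4
print(beta_l.std(), beta_hat_l.std())   # both roughly 1.0
```

The excess spread in the small-$T$ case is exactly the estimation noise that the deconvolution step is designed to remove.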
When the data requirements and assumptions discussed above are satisfied, this model can be estimated using the following procedure:

1. Setup: Set up the data so that there is one observation for each unique individual-time period combination.
2. Estimate regression: Regress the outcome on individual fixed effects, an indicator for having been randomly assigned to treatment by round interacted with individual fixed effects, and covariates with common effects across individuals.
3. Estimate feature of interest: If interested in the variance of treatment effects, use Arellano and Bonhomme (2012), eq. (47). If interested in the full distribution of treatment effects, estimate $F_\beta(\cdot)$ by applying Mallows' algorithm to the treatment- and control-period residuals, $y_{it} - \hat\delta_t - x_{it}'\hat\gamma$.

Applications of these methods are shown using data from an actual RCT conducted in The Gambia in Section 7.2 and using simulated data in Section 7.3. Example code appears in the supplementary materials available online.

5. Distributional Impacts and Conditional Analyses

5.1 Introduction

Thus far, we have considered unconditional heterogeneity, i.e., how treatment impacts vary in the population overall. We may also be interested in questions such as whether and how the impact of a program on inequality differs by gender or whether program impacts are increasing or decreasing with baseline income. These questions concern (features of) the joint distribution of treatment impacts and other variables. Thus, as discussed in Part 2, they are not answered directly by the methods above and call for conditional analysis. In this part, we discuss how to study conditional treatment effect heterogeneity.

We first describe subgroup analysis, i.e., how the methods above can be applied to subpopulations, such as by gender.
Conditional versions of both questions concerning impacts on outcome distributions, such as whether the effect on inequality varies by gender, and questions concerning the distribution of treatment effects, such as whether the fraction of individuals who benefit from the program varies by gender, may be of interest. Section 5.2 shows that for discrete covariates, such as gender, the problem is tractable, but often requires larger sample sizes than those typically available. For continuous variables, such extensions would require estimating the joint distribution of two continuous variables, such as treatment effects and baseline income, which requires even stronger assumptions than the ones discussed above and exceptionally large data sets. However, estimating features of this joint distribution, such as conditional means, may still answer important questions. For example, how would the average treatment effect change if the 1985 age distribution were replaced with the current distribution, holding all else constant (Rothe, 2012)? Therefore, researchers often estimate conditional average treatment effects. We discuss their estimation and interpretation in Section 5.3.

5.2 Subgroup Analysis

We start with subgroup analysis, i.e., applying the methods discussed above separately to two or more groups of interest. While straightforward in theory, this approach requires common support and large sample sizes. In particular, we can implement exactly the same methods used to study the full sample on each of the different groups of interest. This conditional approach can be applied to a wide range of methods, including average treatment effects, quantile treatment effects, and distributions of treatment effects. For example, the QTTs for men and women are given by:

$\mathrm{QTT}_\tau(\text{male}) = F_1^{-1}(\tau \mid T = 1, \text{male}) - F_0^{-1}(\tau \mid T = 1, \text{male})$,
$\mathrm{QTT}_\tau(\text{female}) = F_1^{-1}(\tau \mid T = 1, \text{female}) - F_0^{-1}(\tau \mid T = 1, \text{female})$.
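Under perfect compliance, subgroup QTTs reduce to differences of empirical quantiles within each group. A minimal sketch on simulated data (the group labels, effect sizes, and simple empirical-quantile estimator are illustrative assumptions; with imperfect compliance the adjustments of Part 3 would be needed):

```python
import numpy as np

rng = np.random.default_rng(1)

def qtt_by_group(y, treated, group, quantiles=(0.25, 0.5, 0.75)):
    """Subgroup QTTs: difference between treated and control outcome
    quantiles, computed separately within each group."""
    out = {}
    for g in np.unique(group):
        y1 = y[(group == g) & treated]
        y0 = y[(group == g) & ~treated]
        out[g] = {q: np.quantile(y1, q) - np.quantile(y0, q) for q in quantiles}
    return out

# Simulated example: a constant effect of 1 for group 0; for group 1 the
# treatment also doubles the spread of outcomes, so its QTTs fan out.
n = 4000
group = rng.integers(0, 2, n)              # e.g. 0 = male, 1 = female
treated = rng.random(n) < 0.5
latent = rng.normal(0.0, 1.0, n)
y = np.where(treated & (group == 1), 2.0 * latent + 1.0,
             np.where(treated, latent + 1.0, latent))
effects = qtt_by_group(y, treated, group)
```

In this simulation, group 0's QTTs are flat at roughly 1 across quantiles, while group 1's rise from the first to the third quartile, the within-group heterogeneity that a subgroup mean would mask.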
If compliance is not perfect, so that independence of $T$ and $(y_0, y_1)$ fails, it is important to take this into account, as discussed in Part 3. Just as QTTs are interpretable as the difference between the treatment and control marginal distributions at a particular quantile, the subgroup QTT for, say, males is the difference between the marginal outcome distributions among males at a particular quantile. [38] These subgroup distributional impacts are of interest because, as in the full sample, means may mask substantial within-group treatment heterogeneity. Note that if the distribution of the outcome variable is very different across groups, quantile treatment effects for each group may look different even if the effects are identical at the same values of the outcome variable. This is because the quantiles of the distribution of untreated outcomes, $F_0^{-1}(\tau)$, differ between the groups: see the discussions in Section 3.5 and around Figure 6 of Bitler, Hoynes and Domina (2014).

[38] Note that subgroup QTTs differ from the estimates of conditional quantile regression (CQR). CQR models each quantile as a parametric function of covariates. The CQR coefficient on the interaction of the subpopulation indicator and $T$ would indicate how the $\tau$-th quantile of $y$ differs between the treated in the subpopulation and the overall population (rather than the untreated subpopulation). It assumes that treatment effects are constant across subgroups. Note that conditional quantile effects do not average to the unconditional effect, so CQR provides little information about treatment effects for the population of participants.

In principle, the above approach of estimating parameters separately by subgroup works whenever the subgroup has both treatment and control observations, i.e., common support. [39] However, subgroup analysis requires even larger sample sizes, because separate analyses are being conducted for each subgroup of interest.
Thus, it may often be infeasible, particularly when subgroups are defined by several characteristics or continuous measures. Taking baseline income as an example, it would be quite surprising if every unique income level in the study was reported by at least two families, let alone one treatment and one control family. Even at income levels with multiple observations, there generally will not be enough observations to yield a sufficiently precise estimate of the mean, not to mention the entire conditional distribution. Adjusting for multiple hypothesis tests, as we discuss in Part 6, further increases the required sample size. Of course, we can partially overcome this issue by collapsing continuous variables into a small number of categories, for example by using income deciles instead of income directly. This approach gains feasibility at the cost of potentially missing some heterogeneity, for example, within observations in the same income decile. Such assumptions may allow learning about heterogeneity, but they can also mask it if they are not chosen well, as we discuss further in Section 5.3.

5.3 Conditional Average Treatment Effects

In this section, we focus on conditional average treatment effects. In practice, most researchers settle for showing that average effects change with covariates. This is generally not because learning about how the full distribution changes with covariates is not of interest, but because the subgroup sample sizes are too small to estimate separate distributions within each subgroup. However, we may still be able to estimate the average effect of interest using only the relevant subsample. For example, if we are interested in how average impacts vary across men and women, we can simply estimate Conditional Average Treatment Effects (CATEs) [40] separately using the subsamples of men and women:

$\mathrm{CATE}(\text{male}) = E(y \mid T = 1, \text{male}) - E(y \mid T = 0, \text{male})$,
$\mathrm{CATE}(\text{female}) = E(y \mid T = 1, \text{female}) - E(y \mid T = 0, \text{female})$.
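With perfect compliance, each CATE is just a within-subgroup difference in means. A minimal sketch on simulated data (the group coding and true effect sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def cate_by_group(y, treated, group):
    """CATE for each subgroup: treated-minus-control difference in means
    computed within the subgroup."""
    return {g: y[(group == g) & treated].mean() - y[(group == g) & ~treated].mean()
            for g in np.unique(group)}

# Simulated example with true CATEs of 1 (group 0) and 3 (group 1).
n = 4000
group = rng.integers(0, 2, n)
treated = rng.random(n) < 0.5
y = rng.normal(0.0, 1.0, n) + treated * (1.0 + 2.0 * group)
cates = cate_by_group(y, treated, group)
```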
As above, this approach requires every subgroup of interest to have a sufficient number of observations in both treatment and control groups. If perfect compliance fails, $T$ may be endogenous and bias CATEs, just as it biases ATEs. Consequently, reweighting or IV methods are required to recover effects on the treated, as in Part 3. One can also still use randomization status instead of treatment status to obtain intent-to-treat parameters. However, for their interpretation it is important to take into account that both take-up and treatment effects may vary between groups. Subpopulation means share most of the advantages and shortcomings of average treatment effects discussed in Part 2, with the added benefit that comparing means across subpopulations can shed some light on heterogeneity. Djebbari and Smith (2008) call heterogeneity explained by differences in subgroup means "systematic heterogeneity" and heterogeneity remaining after controlling for these differences "idiosyncratic heterogeneity". Subgroup means are therefore informative about systematic heterogeneity, but not idiosyncratic heterogeneity. In some cases, this can lead to incorrect conclusions about the nature and extent of treatment effect heterogeneity. For example, Bitler, Gelbach and Hoynes (2014) re-analyze a welfare experiment to investigate the extent to which allowing for heterogeneity in CATEs can explain the heterogeneity in quantile treatment effects in Bitler, Gelbach and Hoynes (2006).

[39] Selecting the relevant subgroups is a critical step in conditional analysis. It is recommended to define the groups of interest at the design stage and include them as part of the data collection. See Section 6.4. Despite extensive planning, however, researchers may not anticipate the most relevant subgroups.
[40] We follow common terminology in calling this parameter CATE rather than "CATE on the treated", as the discussion in Part 2 suggests.
Using a simulation exercise, they conclude that heterogeneity across subgroups is unable to explain the observed treatment effect heterogeneity.

As pointed out above, estimating how average treatment effects vary with covariates non-parametrically is often infeasible when there are many subgroups or when treatment effects vary with a continuous variable. These problems can be mitigated by making parametric assumptions on the treatment effect heterogeneity. We often have prior beliefs that suggest a certain functional form for treatment effect heterogeneity. For example, we may be willing to assume that treatment effects vary linearly or quadratically with income. At the risk of misspecification, these assumptions increase power and allow for identification of effects without full common support. Parametric models of treatment effects can be implemented by interacting covariates with treatment status:

$y_i = \alpha + \beta T_i + x_i'\gamma + T_i\,\tilde x_i'\delta + \varepsilon_i$.

As above, $T$ is unlikely to be exogenous when perfect compliance fails, so reweighting or IV methods are necessary to obtain consistent estimates. The methods from Sections 3.3 and 3.4 are directly applicable, since conditional means are features of the marginal outcome distributions. One may be tempted to continue to use randomization status instead of treatment status in order to estimate an intent-to-treat parameter. However, as pointed out above, ITT parameters conflate heterogeneity in take-up with heterogeneity in treatment effects, which makes them hard to interpret. It will usually be preferable to estimate take-up given participation in the RCT and treatment effects for the treated separately. We use $\tilde x$ instead of $x$ here to indicate that not all covariates need to be interacted with treatment. For example, $\tilde x$ might include only gender or income if the researcher is primarily concerned with that dimension of treatment heterogeneity. The coefficients on the interactions, $\delta$, describe how the average treatment effect varies with $\tilde x$.
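The interacted regression can be sketched on simulated data; the linear dependence of the treatment effect on income is an assumed data-generating process chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data in which the treatment effect rises with income:
# beta(income) = 1 + 0.5 * income.
n = 5000
income = rng.normal(0.0, 1.0, n)
treated = (rng.random(n) < 0.5).astype(float)
y = 0.2 + 2.0 * income + treated * (1.0 + 0.5 * income) + rng.normal(0.0, 1.0, n)

# Design matrix: constant, T, covariate, and the interaction covariate x T.
X = np.column_stack([np.ones(n), treated, income, income * treated])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef[1] estimates the effect at income = 0; coef[3] estimates how the
# effect changes with income (the delta coefficient above).
```

Because income is interacted with treatment here, the coefficient on the treatment dummy is the effect at the covariate's zero point, so demeaning the covariate first is a common way to make it the effect at the mean.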
To be sure, this model imposes several parametric assumptions. Most importantly, the potential outcomes, $y_0$ and $y_1$, are assumed to be linearly related to the covariates, since $y_0 = \alpha + x'\gamma + \varepsilon$ and $y_1 = \alpha + \beta + x'\gamma + \tilde x'\delta + \varepsilon$. When the true data-generating process is nonlinear, the estimates can be interpreted as a linear approximation of the conditional expectation function and provide a useful first impression of the existence and direction of heterogeneity. Also, covariates included in $x$ but not $\tilde x$ are assumed to have the same impact on treatment and control outcomes.

While linear models can be made quite flexible by including powers of covariates and interaction terms, they may still struggle to detect complex relationships. A common exercise in regression analysis is to add covariates to the model to see how a coefficient of interest changes. To explore whether observed subgroup impacts are driven by an omitted variable, we can add additional covariates to $\tilde x$. Recent work proposes using machine learning algorithms as a data-driven approach to "build" the model (Imai and Ratkovic, 2013; Athey and Imbens, 2015; Wager and Athey, 2015) and include only the most important dimensions of treatment heterogeneity.

Conditional average treatment effects provide a simple measure of how treatment effects vary with observable characteristics. As with the average treatment effect, however, there is a distribution around the conditional mean. For example, even if all conditional means are positive, some individuals may still be hurt by the program. This idiosyncratic heterogeneity may pose a threat to the external validity of the results. For example, using subgroup effects to inform future targeting of a program may backfire if the impact on the marginal participant is very different from the average treatment effect among the current set of participants, even within each subgroup. Smith (2015) provides the following example.
Suppose there is a program in which half of men have an impact of 10 and half have an impact of 4. Similarly, half of women have an impact of 12 and the other half have an impact of 1. If the cost of participating in the program is 5, only the "big responders" will choose to participate and the program will appear more effective for women than men. However, the impact on the marginal male participant (4) is larger than the impact on the marginal female participant (1).

We can attempt to overcome the above issue by estimating a model of participation along with the outcome model. To address the concerns raised in the previous paragraph, we need to understand how treatment effects vary with both the observables and the unobservables, i.e., the error term of this participation model. The methods in this section can be used to study how treatment effects vary with observed covariates. Heckman and Vytlacil (1999, 2005, 2007) analogously define marginal treatment effects (MTE) as the average treatment effect at a specific value of the unobservable component of the participation equation:

$\mathrm{MTE}(x, u) = E(y_1 - y_0 \mid X = x, U = u)$.

This is a conditional average treatment effect just like those described above, but one conditioning variable, $U$, is not observed. Thus, estimation is more complex. Under the assumptions of Heckman and Vytlacil, it can be estimated as the derivative of the outcome, $y$, with respect to the exogenous variation in the propensity to participate. Brinch, Mogstad and Wiswall (forthcoming) and Kowalski (2016) are examples of recent studies which use MTEs to examine treatment heterogeneity. Cornelissen et al. (2016) provide a review and discuss the relation to the LATE.

6. Statistical Inference and Power Calculations

6.1 Introduction

Parts 3 and 4 discuss identification and estimation of treatment effects. However, meaningful policy evaluation must also account for the precision of point estimates.
This part surveys some tools for formal hypothesis tests and power calculations in DIA. Section 6.2 discusses statistical inference for DIA, focusing on repeated testing and functional hypotheses. Section 6.3 surveys tests of a particular distributional hypothesis: whether individual responses to treatment were heterogeneous. Section 6.4 considers the choice of sample size for RCTs; it aims to help researchers design RCTs to address distributional hypotheses.

These three sections draw on approximations to the finite-sample behavior of estimators of treatment effects. Inference and power calculations would ideally use the exact distribution of the estimator in question for each sample size. However, finite-sample results are only available for special cases. [41] Hence, we must rely on asymptotic approximations, for which there are two main strategies. Traditional asymptotic theory exploits the limiting distribution of estimators as the sample size goes to infinity, which is normal in most cases. This approach has two drawbacks. First, the resulting approximation might be poor. Second, the asymptotic distribution might be impractical. For example, the asymptotic variance of the IPW estimator of QTTs (Section 3.3) depends on a conditional expectation (Firpo, 2007), which is difficult to estimate due to the curse of dimensionality. Therefore, we recommend simulation-based inference instead. The procedure is straightforward: estimate the effect of interest on many different samples; pool the estimates across replications to construct a sample of estimators; then use the distribution of the estimator in this sample for inference.

6.2 Statistical Inference

When researchers conduct DIA, they often wish to address multiple hypotheses or hypotheses about multiple parameters. However, repeated testing increases the probability of false positives, distorting significance and power levels. Therefore, critical values need careful adjustment according to the question at hand. [42]
This section surveys three categories of hypotheses which often occur in DIA:

• Single hypotheses about a finite number of points, which lend themselves to standard pointwise inference.
• Multiple hypotheses about a finite number of points, which also pertain to pointwise inference, although critical values need adjustment to correct significance levels.
• Single hypotheses about a function, which require uniform inference.

The next subsections discuss each category in turn, including examples of relevant hypotheses and algorithms for valid inference.

6.2.1 Pointwise Inference

This subsection focuses on pointwise inference: testing a single hypothesis about a finite number of points. For instance, consider the following questions:

• Is the average treatment effect significantly different from zero?
• Is the distribution of treatment effects symmetric?
• Is the change in the first quartile of outcomes larger than the change in the third quartile?

The first hypothesis concerns a single statistic of the distribution of treatment effects. The second also concerns a single statistic of this distribution if we reformulate it in terms of skewness. [43] The third concerns the treatment effect on two points of the outcome distribution. Thus, they all require pointwise inference.

[41] For instance, finite-sample theory offers exact formulas for the distribution of quantile estimators at any sample size if the data are independent and identically distributed. See Koenker (2005) and Chernozhukov, Hansen and Jansson (2009).
[42] The design of the RCT should also account for multiple testing. For example, optimal sample sizes are larger. See Section 6.4.
[43] Note that zero skewness is a necessary condition for symmetry, but it is not sufficient.

Following the discussion in the introduction of this part, we recommend basing inference on the bootstrap. Instead of hypothesis tests, we consider the equivalent problem of constructing confidence intervals.
We accept the null hypothesis if the hypothesized value falls within the confidence interval and reject it otherwise. Suppose that the sample consists of $n$ observations. To construct a confidence interval at the 95% level, [44] Horowitz (2001) recommends the percentile method:

1. Sample $n$ observations at random with replacement from the data. You now have a bootstrap sample.
2. Estimate the treatment effect of interest on the bootstrap sample and store it.
3. Repeat steps 1 and 2 $B$ times. You now have a sample of estimates of size $B$.
4. Compute the 0.025 and 0.975 quantiles of the sample of estimates, $\hat q_{0.025}$ and $\hat q_{0.975}$. You now have bootstrap critical values. Your confidence interval is $[\hat q_{0.025}, \hat q_{0.975}]$.

For different significance levels, adjust step 4. Alternative resampling procedures in step 1 are possible. For example, we can sample clusters instead of individual observations to account for within-cluster dependence. [45] We can also use the bootstrap to obtain different statistics. For example, compute the standard deviation of the sample of estimates in step 4 to estimate standard errors.

How many bootstrap repetitions are necessary? Computation is often trivial, so a conservatively large $B$ comes at almost no cost. Based on Andrews and Buchinsky (2000), Cameron and Trivedi (2005) suggest at least 348 repetitions for confidence intervals at level 0.05 and 685 for level 0.01. In applied work, however, researchers tend to run at least a thousand repetitions and often as many as ten thousand.

To achieve correct inference with the bootstrap, step 2 should account for the interaction between different sources of estimation noise. Consider, for example, the IPW estimator of treatment effects (see Section 3.3). We first compute weights for each observation before taking the difference between weighted quantiles (or averages, variances, etc.). One should not overlook estimation error from the first step, which depends on the weight estimator.
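The percentile method in steps 1-4 can be sketched as follows; the simulated data and the choice of a median treatment effect statistic are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def percentile_ci(y, treated, stat, n_boot=999, level=0.95):
    """Percentile-method bootstrap confidence interval for a statistic
    of the outcome/treatment data."""
    n = len(y)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)             # step 1: resample with replacement
        draws[b] = stat(y[idx], treated[idx])   # step 2: re-estimate and store
    a = (1.0 - level) / 2.0                     # steps 3-4: quantiles of the draws
    return np.quantile(draws, a), np.quantile(draws, 1.0 - a)

# Example: CI for the difference in medians, with a true shift of 1
# in the simulated data.
n = 1000
treated = rng.random(n) < 0.5
y = rng.normal(0.0, 1.0, n) + 1.0 * treated
median_qte = lambda yy, tt: np.median(yy[tt]) - np.median(yy[~tt])
lo, hi = percentile_ci(y, treated, median_qte)
```

Passing the whole estimator as `stat` makes the sketch reusable for other effects; cluster resampling would replace the `idx` draw with draws of whole clusters.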
In the case of parametric models, it is sufficient to re-estimate the weights at each iteration in step 2 to adjust percentile intervals. [46] Nonparametric or semiparametric estimators are more challenging, because their rates of convergence depend on tuning parameters (e.g., the kernel bandwidth). Most data-dependent procedures pick optimal tuning parameters, in the sense that they minimize mean squared error. However, optimal convergence rates often generate asymptotic bias, which invalidates inference (whether it draws on asymptotic theory or the bootstrap). It is then necessary either to correct for the bias or to adjust the tuning parameter. Moreover, these procedures themselves introduce additional estimation noise. It is unclear, however, whether one should correct for it. In a study of kernel density estimators, for example, Hall and Kang (2001) argue against re-estimating the bandwidth at each bootstrap round.

[44] Analogous algorithms yield one-sided or asymmetric tests. See Horowitz (2001).
[45] The usual caveats about cluster-robust estimation apply. We refer the reader to the excellent survey of Cameron and Miller (2015) for details, in particular Subsection VI.C.3 ("Bootstrap with caution").
[46] For most estimators, analytical corrections for asymptotic confidence intervals are possible. See Newey and McFadden (2001) for a general discussion of inference for two-step estimators.

Note also that the bootstrap requires continuity conditions. [47] The estimators of changes in the outcome distribution in Part 3 generally satisfy these assumptions. The exception is extremal effects: very low and very high quantiles, for which the scarcity of data affects convergence rates. Zhang (2016) considers inference for such extremal quantile effects. Moment estimators of the distribution of treatment effects (Subsection 4.4.1) also satisfy the bootstrap regularity conditions, as long as this distribution is not degenerate.
On the other hand, the validity of resampling for the deconvolution estimator of Subsections 4.4.2 and 4.4.4 is an open question, [48] because the asymptotic properties of Mallows' algorithm are unknown. As the literature stands, the bootstrap provides some insight into the precision of this estimator, but it does not allow formal hypothesis tests.

The theory of pointwise inference focuses on point estimates. However, it is also possible to test hypotheses about partially identified parameters. For example, Section 4.2 discussed the estimation of bounds on features of the distribution of treatment effects. As Imbens and Manski (2004) and Stoye (2009) note, inference raises additional questions in this case. They propose a strategy to construct confidence intervals that cover the true, unknown parameter within the bounds with the correct probability. The limits of these intervals depend on the asymptotic distribution of the bound estimators. This task is often difficult because many bounds involve extrema, such as the sample maxima and minima in the Makarov bounds in Section 4.2. These discontinuities violate the smoothness assumptions of standard methods of inference. Fan and Park (2010) develop a subsampling strategy to perform inference on Makarov bounds, which is valid under weak conditions.

6.2.2 Multiple Hypothesis Testing (MHT)

Policy evaluation often entails multiple hypothesis tests. For example, consider the questions:

• Do average treatment effects vary across subgroups?
• What is the average treatment effect for various outcomes?
• Which quantile treatment effects are positive?

Repeated testing distorts significance levels, necessitating adjustments to critical values for correct inference. To illustrate this point, suppose that researchers worry about heterogeneity across ethnicities. They estimate average treatment effects for five ethnic groups and test each estimate against the zero null hypothesis at the 5-percent level.

[47] It is often possible to adjust the simulation algorithm to obtain valid inference under weaker assumptions. For instance, Otsu and Rai (forthcoming) develop a valid weighted bootstrap approach for matching estimators. Different resampling methods, such as subsampling, are often valid under weaker conditions than the bootstrap, although their theoretical properties are not as advantageous. See Horowitz (2001).
[48] The nonparametric bootstrap is valid for the kernel deconvolution estimator of the distribution of treatment effects (Bissantz et al., 2007).
Thus, the probability of a false positive is five percent for each test. Across all tests, however, it is larger: if the tests are independent, it rises to 1 − 0.95⁵ ≈ 0.23!

Adjusting critical values requires us to extend the concept of test size (i.e., the probability of making a Type I error) to multiple hypotheses. Several extensions exist, and researchers should base their choice of rejection rule on substantive considerations. [49] One approach considers the Familywise Error Rate (FWER). The FWER is the probability of falsely rejecting at least one true null hypothesis. This criterion is stringent: it preserves significance levels at a heavy cost in test power. Some authors consider the k-FWER instead, which generalizes the FWER to k false rejections. Also popular is the False Discovery Rate (FDR): the expected proportion of false positives across all hypothesis tests. As Romano and Wolf (2010) point out, however, control of the FDR does not allow us to make precise statements about the realized proportion of false discoveries, which might remain quite high. Romano and Wolf (2010) discuss these different concepts of error rates in more detail. They propose a general method to construct simultaneous confidence regions, using resampling and an iterative step-down algorithm to preserve power. [50]

[49] Researchers should ideally define their preferred error rate at the design stage.
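The five-test calculation, together with a Holm step-down adjustment, can be made concrete. Holm's procedure is a simpler, non-resampling relative of the Romano-Wolf algorithm (it controls the FWER but ignores the dependence between tests), and the p-values below are hypothetical:

```python
import numpy as np

# With five independent tests at the 5% level, the familywise chance of
# at least one false positive is about 23%:
fwer_unadjusted = 1.0 - 0.95 ** 5

def holm_adjust(pvals):
    """Holm step-down adjusted p-values, which control the FWER."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    running_max = 0.0
    for rank, i in enumerate(order):
        # Smallest p-value is multiplied by m, the next by m - 1, etc.;
        # the running maximum enforces monotonicity of the adjustment.
        running_max = max(running_max, (m - rank) * p[i])
        adj[i] = min(1.0, running_max)
    return adj

# Reject hypotheses whose adjusted p-value is below the nominal level.
adjusted = holm_adjust([0.001, 0.010, 0.030, 0.040, 0.200])
```

Unlike Holm, the resampling step-down procedure described above exploits the joint bootstrap distribution of the estimates, which preserves more power when tests are correlated.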
The resulting intervals achieve balance, in the sense that each marginal interval covers the true parameter with the same probability. Here we present a special case of their algorithm, which controls the FWER.

We wish to test hypotheses about parameters $\{\theta_j\}_{j=1}^k$ at level $\alpha$. The sample size is $n$. We assume that there is an estimator $\hat\theta_j$ for each parameter $\theta_j$, such that $r_n(\hat\theta_j - \theta_j)$ converges to a nondegenerate distribution, where $r_n$ is a nonnegative sequence. For example, $\theta_j$ might be the average treatment effect or a quantile effect, in which case $r_n = \sqrt{n}$. We will proceed in a series of single-step tests. At each iteration, we remove rejected hypotheses from consideration. The single-step algorithm is:

1. Estimate $\hat\theta_j$ for each $j$.
2. Sample $n$ observations at random with replacement from the data. You now have a bootstrap sample.
3. Estimate $\hat\theta_j^*$ for each $j$ on the bootstrap sample, using the same estimator as in step 1. Compute and store $r_n(\hat\theta_j^* - \hat\theta_j)$.
4. Repeat steps 2 and 3 $B$ times.
5. Compute the empirical CDF $\hat G_j$ of $r_n(\hat\theta_j^* - \hat\theta_j)$ across bootstrap replications. For each parameter $j$ and bootstrap replication $b$, you will have $\hat G_j(r_n(\hat\theta_{j,b}^* - \hat\theta_j))$.
6. For each bootstrap replication $b$, compute $M_b = \max_j \hat G_j(r_n(\hat\theta_{j,b}^* - \hat\theta_j))$.
7. Compute the $(1-\alpha)$-th quantile of $M_b$, $m_{1-\alpha}$. Note that it lies in the unit interval.
8. For each parameter $j$, compute the $m_{1-\alpha}$-th quantile of $r_n(\hat\theta_j^* - \hat\theta_j)$ across bootstrap replicates. This quantile is the critical value for the $j$-th hypothesis test.

To start the step-down algorithm, run the single-step algorithm. If you do not reject any hypothesis, stop. If you do reject, go back to step 6. When you compute the maximum across parameters, ignore previously rejected hypotheses. Step 7 will yield a different percentile for use as a critical value in step 8. Continue in this fashion until you stop rejecting.

6.2.3 Uniform Inference

This subsection discusses uniform inference: testing a single hypothesis about a continuous object, such as a function.
Uniform inference allows us to address such questions as:

• Are all quantile treatment effects positive?
• Is the distribution of treatment effects normal?

[50] Lee and Shaikh (2014) and List, Shaikh and Xu (2016) illustrate the algorithm with experimental data.

The distinction between uniform inference and corrections for multiple testing can be subtle. Uniform inference is concerned with single hypotheses about single continuous objects, such as the quantile process or the distribution function. Testing whether there was any effect on the outcome distribution pertains to uniform inference, because we would test the equality of two distribution functions. On the other hand, testing which quantile effects are significant is a problem of multiple testing, because we would perform a sequence of pointwise significance tests, one for each estimate.

Statisticians and econometricians have developed various tests of popular distributional hypotheses, which are often based on scalar statistics. For example, consider the hypothesis of positive quantile effects. It is equivalent to first-order stochastic dominance, for which we have the Kolmogorov-Smirnov statistic (the largest absolute difference between the outcome distributions of treated and untreated participants). Chernozhukov, Fernández-Val and Melly (2013) exploit this approach to construct bootstrap uniform confidence bands, whose width depends on the critical values of the Kolmogorov-Smirnov statistic. [51] Uniform confidence bands cover the entire function of interest with the correct probability, in the same way that confidence intervals cover point estimates. They have three advantages over scalar test statistics: they are easy to display on a graph; they allow researchers to test hypotheses for which no known statistics are available; and they allow readers to test hypotheses which the authors of a paper may not have considered. [52]

6.3 Tests of Heterogeneous Treatment Effects

DIA builds on the premise that treatment responses vary across individuals.
How can we test this hypothesis?

As far as heterogeneity relates to discrete covariates, significant differences in CATEs indicate heterogeneity in individual responses. Such heterogeneity across subgroups is often interesting in itself. Tests are standard, but researchers should adjust critical values for MHT, which often takes a heavy toll on power. Crump et al. (2008) generalize this idea to accommodate discrete and continuous covariates in both parametric and non-parametric estimators of conditional means. They propose a scalar statistic, thereby avoiding MHT.

These tests consider heterogeneity as it relates to observables and will thereby fail to detect heterogeneity that is unrelated to observables. To test for any heterogeneity, we can use the fact that the distributions of Y_0 and Y_1 only differ in means under the null. To test whether treatment effects are constant, therefore, we may test whether the variance of outcomes is equal for the treated and untreated, i.e., test whether var(Y_0) = var(Y_1). Significant differences in the variances allow us to reject the null hypothesis. We can implement such a pointwise test with the bootstrap algorithm of Subsection 6.2.1. However, this test may have low or no power if the covariance of Y_0 and the treatment effect Δ is negative: var(Y_1) = var(Y_0) + var(Δ) + 2 cov(Δ, Y_0), so the negative covariance would offset the variance from treatment effect heterogeneity. The analytical example in Section 3.5 illustrates this point. The test may also falsely reject in the presence of heteroskedasticity, i.e., if the variances of the error terms differ between treatment and control. Heteroskedasticity may be a concern if randomization was compromised.

[51] See algorithms 2 (p. 2221) and 3 (p. 2222) in Chernozhukov, Fernández-Val and Melly (2013).
[52] For some estimators, the literature provides asymptotic theory to construct confidence bands as well. For quantile treatment effects, see Chernozhukov, Fernández-Val and Melly (2013) and Koenker and Xiao (2002).
Conducting the test on residuals may address this concern and increase power, but it may also lead to false rejections if the model is misspecified.

The idea in the previous paragraph generalizes to higher moments. Joint tests of multiple moments may achieve greater power. One can also use a test of equality of distributions, such as the Kolmogorov-Smirnov test, to assess whether the distributions of deviations from the mean differ between the treated and untreated.[53] As with the variance, one may worry that the distributions differ for other reasons if randomization was compromised.

Further evidence against the null hypothesis of constant treatment effects may come from quantile effects or bounds. If treatment effects were constant, all quantile effects would be equal to the average treatment effect. Implementing this approach requires either corrections for multiple testing (Subsection 6.2.2) or uniform inference (Subsection 6.2.3). One may also be able to reject the null if the variance bounds do not include zero, as Section 4.2 discusses.

4. Power Calculations

This section considers the choice of sample size for DIA in RCTs. We first adapt the analytic framework of List, Sadoff and Wagner (2011) to DIA. The complexities of DIA often render an analytic approach undesirable or unfeasible, so we close with some guidance on power calculations via simulation.

1. Analytic Approach

An important choice in experimental design is the sample size. Due to cost considerations, we seek the smallest sample for a desired level of precision.

As a first step, we must define precision. Our criterion will be the significance level and the power of a particular pointwise test against the zero null hypothesis (ZNH).[54] We wish to determine the optimal sample size, N*, and the optimal probability of assignment to treatment, p*, such that a test of the ZNH will reject a given minimum effect size at significance level α with power 1 − β.
These parameters reflect the experimenter's tolerance of statistical error. Here, N* includes both treated and untreated participants.

To derive the power of a test, we need the asymptotic distribution of the estimator of interest. For concreteness, we focus on quantile treatment effects under the independence assumption of Section 3.2.[55] Similar to ATTs, QTTs are asymptotically normal. Therefore, we only need to specify their asymptotic variance. Power calculations for QTTs and ATTs only differ in the exact form of this variance.

We proceed in four steps. For a quantile τ of interest:

[53] Note that this test involves estimated parameters, which the bootstrap should account for. See Subsection 6.2.1.
[54] The ZNH states that the treatment had no effect. It is the main hypothesis of interest in most RCTs. Other null hypotheses are possible with the necessary modifications. As a reminder, the significance level α is the probability of a type I error in a two-sided test: rejecting the null hypothesis when it is true (a false positive). The power is the probability of rejecting the null hypothesis when the alternative is true.
[55] We assume that the data are IID. For a discussion of clustered data and stratified designs, see Section 4 of List, Sadoff and Wagner (2011), and especially McConnell and Vera-Hernández (2015).

1. Specify the quantile Q_0(τ) and the density f_0[Q_0(τ)] of potential outcomes Y_0. Note that this quantile is the same under the null and the alternative hypotheses.
2. Specify the quantile Q_1(τ) and the density f_1[Q_1(τ)] of potential outcomes Y_1 under the alternative hypothesis. Note that Q_1(τ) = Q_0(τ) and f_0[Q_0(τ)] = f_1[Q_1(τ)] under the null hypothesis of no treatment effect.
3. Define a minimum detectable effect size Q_1(τ) − Q_0(τ), a significance level α and a test power 1 − β. The RCT will detect treatment effects equal to or greater than Q_1(τ) − Q_0(τ) with error rates α and β.
4. Under the alternative hypothesis, the asymptotic variance of the estimator Q̂_d(τ) is:

   σ_d²(τ) = τ(1 − τ) / {f_d[Q_d(τ)]}²,  d = 0, 1.

The optimal randomization rate, p*(τ), and the sample size, N*(τ), are:

   p*(τ) = σ_1(τ) / [σ_0(τ) + σ_1(τ)],

   N*(τ) = [(z_{α/2} + z_β) / (Q_1(τ) − Q_0(τ))]² × [σ_0²(τ) / (1 − p*(τ)) + σ_1²(τ) / p*(τ)],

where z_{α/2} and z_β are the quantiles of the standard normal distribution. Note that sample sizes are larger for quantiles in low-density regions, such as the tails of the outcome distribution, because the scarcity of observations reduces accuracy.

These same steps yield optimal sample sizes for other quantities. The formulas for p* and N* in step 4 go through as long as the estimator of interest is asymptotically normal.[56] Adjust steps 1 and 2 according to the object of interest and its asymptotic variance σ_d². For the average treatment effect, for example, one should specify the mean and the variance of potential outcomes instead of the quantile and the density, since σ_d² = var(Y_d).

The procedure above considers a particular quantile of interest. In most applications, however, researchers compute quantile effects on a grid. Three difficulties arise. The first consists of formulating distinct hypothesis tests for each quantile. It is often simpler to think in terms of the entire distribution. For example, a location model assumes that the distribution shifts by a constant. Then all quantile effects are equal to that constant and σ_0²(τ) = σ_1²(τ), which reduces the number of free parameters by two thirds. Secondly, the sample size N*(τ) and the randomization rate p*(τ) differ for each quantile.[57] One should choose the largest sample size across τ to ensure correct test size and power at all points. Note, though, that the formulas above are invalid for extremal effects. It is advisable to avoid estimation of extremal effects, beyond the 5th and 95th percentiles, say. Lastly, grid estimation implies multiple hypothesis testing, invalidating the formulas above. There are two solutions to this problem. It is possible to adjust critical values for repeated testing (cf. Section 6.2).
However, this approach involves complex algorithms and many unknown quantities, which limits its usefulness. Simulation offers a more feasible route, which we survey in the next subsection. The closed-form formulas provide a first guess of the sample size to start the simulation algorithm.

[56] It is also possible to use finite-sample distributions instead of the asymptotic normal approximation when they are available. See Koenker (2005) and Chernozhukov, Hansen and Jansson (2009) for quantile treatment effects.
[57] For a location model, p*(τ) = 0.5 for all τ.

If researchers intend to conduct subgroup analysis, the choice of sample size and randomization rate should also take MHT into account. Otherwise, significance levels will be higher and/or test power will be lower than planned. Note that the number of observations might increase dramatically with the number of tests. Splitting samples might help keep the RCT financially viable. This approach splits the sample into two subsamples. In one subsample, an automated procedure selects a small set of relevant covariates. In the second subsample, we estimate the resulting model and test the significance of each variable, limiting data requirements. For further details, see Wasserman and Roeder (2009), Fithian, Sun and Taylor (2017) and Fafchamps and Labonne (2016) and the references therein.

2. Simulation Approach

Simulation is an alternative strategy to determine the required sample size. Simulation for DIA presents no particular challenge, so we only sketch a brief overview here. See McConnell and Vera-Hernández (2015) for details. They also provide an algorithm and sample code.

As in the previous subsection, the selection criteria are the significance and power of a hypothesis test against the ZNH. Instead of computing these error probabilities with asymptotic methods, however, we use simulation. To be precise, we simulate pseudo-samples from the model of interest under the null and the alternative hypotheses.
For each pseudo-sample, we perform the relevant hypothesis test. Then we adjust the number of observations until empirical rejection rates satisfy the desired level of accuracy.

Simulation offers a number of advantages over asymptotic theory. Firstly, it accommodates estimators with unknown or excessively complex asymptotic distributions (e.g., the IPW estimator of QTTs), as well as difficult inference problems (e.g., uniform inference or multiple hypotheses). Secondly, it reflects the finite-sample behavior of estimators which might converge slowly to their limiting distribution. Thirdly, it is straightforward to incorporate complex designs, such as panel data.

To conduct a simulation exercise, it is necessary to choose the number of pseudo-samples. Unless the computational burden is high, it should be large: five thousand or ten thousand are good figures. One also needs to specify the data generating process. Choosing distributions of covariates and the error term is often challenging: they should reproduce the conditions of the RCT, including dependence patterns in the error term (such as clusters or serial correlation in panel data). Previous studies might provide some guidance. If an existing dataset contains the covariates of interest (or some of them), it is possible to draw pseudo-samples from it to minimize distributional assumptions. Sensitivity analyses are always advisable.

7. Applications

In this part, we present applications of the methods discussed above using data from two RCTs. Section 7.1 revisits the study of financial education by Bruhn et al. (2016). In Section 7.2, we study the impact of a school-development program on students' test scores, building on Blimpo, Evans and Lahire (2016).
The goal of this part is threefold: (1) to demonstrate what we can learn from DIA beyond standard mean analyses; (2) to provide examples of how to choose appropriate methods and how to assess their assumptions and address common concerns; and (3) to illustrate briefly the implementation of the methods and their output.[58] The programs corresponding to these analyses can be found in the supplemental materials available online.

1. Financial Education RCT in Brazil

This section revisits the financial education program evaluated by Bruhn et al. (2016). Comprehensive lessons in basic finance and responsible intertemporal choices were integrated into the regular classroom curricula of randomly selected high schools. The program increased average financial proficiency by a quarter of a standard deviation, which is large in comparison with similar programs (Bruhn et al., 2016).

However, policymakers may wish for more information than average effects. For instance, they may take particular interest in low values of financial proficiency, which might put consumers at risk of pyramid schemes and other predatory tactics. They may want to fine-tune the program if they observe no change in the frequency of bottom scores. In a similar vein, governments may worry about dispersion in financial proficiency, which could affect the rewards from financial inclusion. They may also ask whether effects differed across subgroups, which could help them target the intervention and understand why it works (or not). If poverty is associated with low financial proficiency, for example, we might advocate additional financial education for children from poor households. To explore these questions, we estimate several DIA parameters, mostly related to changes in the outcome distribution (Part 3) and conditional analysis (Part 5).

1. Data

Our sample includes 892 schools and 18,276 students in six states in Brazil.
We focus on the short-term impact between the baseline survey (August 2010) and the first follow-up (December 2010).[60] We observe three-quarters of the original baseline sample of around 25,000 students at the first follow-up. The authors report that the samples are overall balanced across treatment and control groups at each round of data collection.[61]

We focus on financial proficiency as the outcome of interest. Each survey included a multiple-choice test, from which Bruhn et al. (2016) construct an index of financial knowledge on a hundred-point scale to track students' progress. We use this score for our analyses. For future reference, the average score in the control group at the first follow-up is 56.05 and the standard deviation is 14.81.

2. Changes in the Distribution of Financial Proficiency

Did the intervention change the frequency of very low test scores? Did it change dispersion in financial proficiency? Did it reshape the proficiency distribution?

[58] We focus on DIA and DIA-specific issues. The original papers discuss solutions to other common problems that affect both DIA and the estimation of average treatment effects.
[59] From an original list of 910 schools, nineteen did not participate for unknown reasons.
[60] The dataset is available from the American Economic Journal: Applied Economics.
[61] As an exception, the authors find a small difference in the gender ratio, which is significant at the 10% level (these tests are unadjusted for multiple hypothesis testing).

Figure 5: Outcome Distributions for Financial Education Program

To answer these questions, we examine treatment effects on a broad set of parameters: the mean, the standard deviation and the ratio of the 75th to the 25th percentile; every percentile between the 5th and the 95th; and the CDF at every integer score between 33 and 81.[62] We consider a total of 143 parameters.

We assume independence between outcomes and treatment status (cf. Section 3.2).
The intervention began halfway into the school year, so students could not change schools or classes before the first follow-up, limiting noncompliance with treatment assignment. On the other hand, nonresponse may be a concern: only three-quarters of the baseline sample took the follow-up survey. However, Bruhn et al. (2016) find no significant differences in the means of background variables at each round, which suggests that nonresponse is as good as random.

Figure 5 plots the outcome distribution for each group. Table 3 and Table 4 show estimates of treatment effects.[63] All confidence intervals control the FWER at the 95-percent level using the step-down bootstrap algorithm of Romano and Wolf (2010); see Subsection 6.2.2. Thus, the probability of at least one false rejection across all significance tests is asymptotically smaller than five percent.

We find a large increase in average financial proficiency: 4.3 points (0.29 SD or 7.6%), in line with Bruhn et al. (2016). There was also a reduction in the proportion of very low scores. For example, the share of scores below 40 points is lower by 0.08, a 45.8% reduction from 0.17 in the control group. Outcomes are also less dispersed in the treatment group: the standard deviation is smaller by 0.46 points, a 3.12% reduction from the 14.81 points in the control group, and the ratio of the 75th to the 25th percentile is reduced by 7.5% from 1.53 in the control group. All of these effects are statistically significant. In summary, the intervention not only increased average scores, but also decreased inequality in financial proficiency.

Quantile effects provide further insight into these results. As Figure 6 shows, the intervention shifted the entire CDF to the right. Recall that quantile effects are the horizontal distance between these curves; hence, they are all positive (and statistically significant). Moreover, the change in lower quantiles is relatively larger. The first decile increased by 4.2 points or 11.5%.
The first quartile is greater by 6.2 points (14.1%), whereas the difference in the ninth decile is 3.2 points (4.27%). A Kolmogorov-Smirnov test rejects equality of quantile effects at the 5% level. This pattern reveals that the reduction in outcome inequality occurred despite increases in the upper percentiles (which are desirable in an educational program). It was a consequence of proportionally larger gains in the left tail.

[62] The 5th percentile of control-group outcomes is 32.95. The 95th is 80.46.
[63] Figure 12 in Appendix 4 shows changes in the distribution function (the vertical distance between the CDFs in Figure 5). These effects mirror quantile effects.

Table 3: Treatment Effects Estimates for Financial Education Program

Statistic            Control       Effect estimate        Standard   Simultaneous conf.
                     group value   Value      Percent     error      region (95%)
Mean                  56.050        4.266      7.611%      0.571     [ 2.671,  5.861]
Standard deviation    14.808       -0.462     -3.119%      0.201     [-0.853, -0.071]
75/25 perc. ratio      1.530       -0.119     -7.748%      0.020     [-0.172, -0.066]
10th percentile       36.302        4.177     11.505%      0.582     [ 2.552,  5.802]
25th percentile       43.867        6.189     14.109%      0.843     [ 3.933,  8.446]
50th percentile       56.157        4.642      8.266%      0.700     [ 2.717,  6.568]
75th percentile       67.138        3.537      5.268%      0.575     [ 1.889,  5.185]
90th percentile       75.812        3.236      4.269%      0.572     [ 1.536,  4.937]
CDF at 40 points       0.172       -0.079    -45.742%      0.011     [-0.109, -0.049]
CDF at 50 points       0.365       -0.116    -31.806%      0.016     [-0.160, -0.073]
CDF at 60 points       0.593       -0.114    -19.236%      0.016     [-0.159, -0.070]
CDF at 70 points       0.806       -0.071     -8.864%      0.012     [-0.106, -0.038]
CDF at 80 points       0.945       -0.032     -3.373%      0.006     [-0.049, -0.015]

Notes: Standard errors and confidence regions based on the bootstrap (five thousand replications), clustered at the school level. The confidence region controls the FWER (probability of at least one false rejection across tests), following Romano and Wolf (2010).
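Mechanically, each quantile effect above is a difference between the treated and control sample quantiles. The sketch below is ours, with simulated scores standing in for the data and an IID bootstrap; the estimates in Table 3 additionally cluster the bootstrap at the school level.

```python
import numpy as np

def quantile_effects(y_treat, y_control, taus):
    """Quantile treatment effects under independence: the horizontal
    distance between the treated and control outcome CDFs at each tau."""
    return np.quantile(y_treat, taus) - np.quantile(y_control, taus)

def bootstrap_se(y_treat, y_control, taus, B=500, seed=0):
    """IID nonparametric bootstrap standard errors (a clustered version
    would resample whole schools instead of individual students)."""
    rng = np.random.default_rng(seed)
    draws = np.array([
        quantile_effects(rng.choice(y_treat, y_treat.size),
                         rng.choice(y_control, y_control.size), taus)
        for _ in range(B)])
    return draws.std(axis=0)

# Simulated scores that mimic the pattern in the text: a higher mean and
# slightly lower spread in the treatment group (all numbers illustrative).
rng = np.random.default_rng(1)
y_control = rng.normal(56.0, 14.8, 5000)
y_treat = rng.normal(60.3, 14.3, 5000)
taus = [0.10, 0.25, 0.50, 0.75, 0.90]
qte = quantile_effects(y_treat, y_control, taus)
se = bootstrap_se(y_treat, y_control, taus, B=200)
```

With this data-generating process the simulated effects are positive at every quantile and larger at the bottom, the shape discussed above.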
How would we estimate these treatment effects under the assumption of selection on observables? For illustration purposes, suppose that a small imbalance in the gender ratio and baseline scores generated suspicion of selection bias. Following Section 3.3, we can use inverse probability weighting to rebalance the groups. We compute the weights with a logistic regression of treatment status on a quadratic polynomial of baseline scores, an indicator for a missing baseline score and an indicator for gender. Table 9 in Appendix 4 shows our results. Although the rebalanced estimates are smaller than their unweighted counterparts, the differences are not statistically significant, which suggests that endogenous selection did not affect our main results. Bruhn et al. (2016) arrived at a similar conclusion, although their methodology differs.[64]

Figure 6: Quantile Effects Estimates for Financial Education Program

3. Distribution of Individual Treatment Effects

Quantile analysis helps us address a broad range of policy questions. Nonetheless, the possibility of mobility effects constrains our interpretation of quantile effects, in that the change in quantile τ may differ from the individual treatment effect at quantile τ. Consider Figure 6. It is tempting to believe that low-scoring students benefited more than average, since the change in the bottom percentiles is relatively larger. As Part 2 and Section 3.5 argue, however, this reasoning hinges on an implicit assumption of rank invariance: the same students are at the bottom and the top of the distribution in both treated and untreated states.

Despite this limitation, our estimates in Table 3 and Figure 6 have implications for individual effects. Dispersion decreased, so responses to treatment must have been heterogeneous. Moreover, positive average and quantile effects imply that at least some participants benefited from the intervention. However, the methods of Part 3 do not allow us to quantify these qualitative results.
Estimating features of the distribution of individual effects requires strong assumptions (cf. Section 4.3), which are hard to justify in this application. As an alternative, Section 4.2 shows that we can construct bounds under minimal conditions. As a first exercise, we estimate the Makarov bounds on the share of positive effects, P(Y_1 > Y_0). Recall that the bounds comprise the range of values which are consistent with the observed marginal distributions. We find that the probability of positive individual treatment effects lies between 12% and 100%.[65] These bounds confirm that at least some participants benefited from the intervention; moreover, the data are compatible with no one being hurt, because the bounds include 100% positive effects. Next, we bound the standard deviation of individual effects. Assuming that the correlation between potential outcomes is positive, we obtain that it lies between 0.46 and 20.62 points. The lower limit is positive,[66] which rules out constant treatment effects. At 0.46 points, the lower bound represents nearly 11% of the ATT, suggesting that the dispersion in treatment effects is non-negligible. To summarize, these bounds suggest the existence of heterogeneous individual treatment effects under minimal assumptions on the data.

4. Conditional Analysis

The previous subsections showed that students' responses to the financial education program were heterogeneous. This section investigates the relation between treatment impacts and pupils' background characteristics. For that purpose, we compute conditional average effects (CATEs) and conditional QTTs for different subgroups of participants, following Section 5.2.[67]

[64] Bruhn et al. (2016) compute average treatment effects by linear regression. They include a quadratic polynomial of baseline scores, an indicator for a missing baseline score and an indicator for gender as controls. The resulting estimate is not statistically different from their baseline regression, which does not include controls.
[65] Note that we do not perform inference to assess the precision of these bounds.
[66] The 95% confidence interval is [0.07, 20.89] (Imbens and Manski, 2004). Under no assumptions on the correlation between potential outcomes, the upper bound is 29.15 and the confidence region is [0.07, 29.54].
[67] We compute treatment effects as the difference in the relevant statistic between the treatment and control subsamples of each subgroup (cf. Section 5.2), as in Subsection 7.1.2.

Table 4: Conditional Average Treatment Effects for Financial Education Program

Subgroup                      Sample    Effect estimate       Standard   Simultaneous conf.
                              size      Value     Percent     error      region (95%)
Baseline score     No         17960     4.120     8.417%      0.476      [2.891, 5.349]
above median       Yes        17960     4.069     6.345%      0.588      [2.508, 5.631]
Student has        No         10949     4.472     7.634%      0.607      [2.866, 6.079]
repeated grade     Yes        14437     4.653     9.112%      0.665      [2.950, 6.355]
Student is female  No         16941     4.977     8.588%      0.673      [3.227, 6.728]
                   Yes        18720     3.806     6.855%      0.609      [2.209, 5.403]
Student works      No         10898     4.188     7.311%      0.613      [2.589, 5.788]
                   Yes        15612     4.722     8.671%      0.702      [2.919, 6.525]
Student earns      No         15713     3.585     6.163%      0.624      [1.943, 5.228]
income             Yes        10812     4.682     8.877%      0.628      [3.061, 6.304]
Family is on       No         10216     5.020     8.301%      0.658      [3.294, 6.746]
welfare            Yes        15334     3.386     5.833%      0.678      [1.671, 5.101]
(Bolsa Família)

Notes: The first four specifications include 848 schools. The last two include 851 schools. Standard errors and confidence regions based on the bootstrap (five thousand replications), clustered at the school level. The confidence region controls the FWER (probability of at least one false rejection across tests), following Romano and Wolf (2010).

Based on the available data, we consider six background variables: (1) baseline proficiency score (above or below the median); (2) grade repetition; (3) gender; (4) family welfare status (Bolsa Família); (5) working status; and (6) income-earning status (i.e., whether the student earns any income, including pocket money). Each variable defines two subgroups.
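Each cell in Table 4 is computed as the treatment-control difference in mean outcomes within the subgroup (cf. footnote 67). Schematically, with made-up variable names and simulated data of our own:

```python
import numpy as np

def cate(y, treated, in_subgroup):
    """Conditional average treatment effect for one subgroup: the
    difference in mean outcomes between its treated and control members."""
    y, treated, in_subgroup = map(np.asarray, (y, treated, in_subgroup))
    m = in_subgroup.astype(bool)
    return y[m & (treated == 1)].mean() - y[m & (treated == 0)].mean()

# Simulated data with heterogeneous true effects across a binary subgroup
# (e.g., an income-earning indicator); all numbers are illustrative only.
rng = np.random.default_rng(0)
n = 4000
earns_income = rng.integers(0, 2, n)
treated = rng.integers(0, 2, n)
true_effect = np.where(earns_income == 1, 4.7, 3.6)   # heterogeneous effects
y = 56 + true_effect * treated + rng.normal(0, 14.8, n)

cate_yes = cate(y, treated, earns_income == 1)
cate_no = cate(y, treated, earns_income == 0)
```

Inference then proceeds as for the unconditional estimates, with critical values adjusted for the number of subgroup comparisons, as the following paragraphs discuss.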
Table 1 in Bruhn et al. (2016) reports summary statistics and balance tests by variable.[68]

Table 4 displays our estimates of CATEs. Table 5 reports t-statistics for the differences in CATEs. We perform six equality tests in total; hence, we must adjust for multiple testing (MHT). In Subsection 7.1.2, we controlled the FWER (the probability of at least one false rejection). Although this error rate is the most stringent, all effects were significant at the five-percent level. Evidence of conditional heterogeneity is less robust. Therefore, Table 5 also shows critical values for the 2-FWER (the probability of at least two false rejections). We use the step-down bootstrap algorithm of Romano and Wolf (2010). (See Subsection 6.2.2.)

Responses to treatment seem to correlate with socioeconomic background. The families of a third of our sample participate in Bolsa Família, a welfare program for low-income households. On average, these students gained significantly less from the intervention (3.39 points, against 5.02 for the remainder). This difference is significant at the ten-percent level if we control the FWER. We only find weak evidence of heterogeneity otherwise. Boys had higher average gains than girls (4.98, against 3.81). The average effect was also larger for students with earned income (4.68, against 3.59). However, these differences are only significant if we control the 2-FWER, i.e., if we are willing to tolerate a five-percent chance of two false rejections in six tests. Other criteria do not yield significant effects, whether we control the FWER or the 2-FWER.

[68] A number of participants declined to answer the relevant questions in the surveys, depressing response rates to 64%. For rigorous policy evaluation, it would be necessary to assess robustness to missing observations.

Table 5: Critical Values for Test of Equality of Average Effects for Financial Education Program

                                         5% confidence level          10% confidence level
Subgroup                     t-stat.   Pointwise  1-FWER  2-FWER    Pointwise  1-FWER  2-FWER
Baseline score above median   -0.095     1.913    2.615   1.890       1.619    2.325   1.696
Student has repeated grade    -0.284     1.914    2.498   1.884       1.609    2.284   1.688
Student is female             -2.219     1.979    2.622   1.955       1.660    2.374   1.759
Student works                 -0.850     1.957    2.695   1.930       1.677    2.427   1.752
Student earns income          -2.139     1.970    2.597   1.929       1.648    2.344   1.711
Family is on welfare          -2.407     1.998    2.645   1.957       1.636    2.357   1.723
(Bolsa Família)

Notes: Bold values indicate rejection of the null hypothesis. The first four specifications include 848 schools. The last two include 851 schools. Critical values based on bootstrapping t-statistics (five thousand replications), clustered at the school level. Columns 4, 5, 7 and 8 control the k-FWER via the step-down algorithm of Romano and Wolf (2010).

Figure 7 plots estimates of five conditional quantile effects for each subgroup.[69] Patterns are similar to Figure 6, with effects peaking around the first quartile. The only clear exception is students whose baseline score was above the median, for whom quantile effects are monotonically decreasing. Are differences between subgroups significant? There are two approaches to this question. On the one hand, we could test whether the difference between each quantile effect in each pair of subgroups is significant. However, this approach would require thirty hypothesis tests; adjusting for MHT would take a heavy toll on power. On the other hand, we could test whether all quantile effects are equal in each pair of subgroups. To do so, we could test the maximum absolute difference in quantile effects across quantiles,[70] which only requires six tests. This second procedure reveals that quantile effects only differ between subgroups defined by baseline scores and gender, controlling the FWER at the ten-percent level. These results point to heterogeneity both between and within subgroups, which average effects did not uncover.
Note that these estimates are not informative about causal effects of the characteristics that define the subgroups, because students do not sort into subgroups at random. Note also that we did not take the overlaps between subgroups into account. For example, students might be more likely to work if their families are on welfare. Subgroup analysis might provide additional insights as we break down cells into smaller groups. However, the number of observations per cell would decrease just as the number of hypothesis tests increased, to the detriment of statistical power.

[69] The quantiles are: the 10th, the 25th, the 50th, the 75th and the 90th. Table 10 in Appendix 3 reports point estimates and standard errors.
[70] This procedure tests for significant differences in at least one quantile effect for each pair of subgroups. It cannot detect all quantile effects for which there are significant differences.

Figure 7: Conditional Quantile Treatment Effects for Financial Education Program

On a methodological note, Table 5 highlights the importance of correcting for multiple tests (even at large sample sizes). Controlling the FWER inflates critical values by 30 to 50 percent. As a consequence, only one difference in average effects remains significant, compared to three if we do not adjust. The choice of error rate is also consequential. Whereas control of the FWER always widens confidence regions, other criteria may not. For example, control of the 2-FWER leads to smaller critical values at the 95-percent level and larger cutoffs at the 90-percent level. To understand this puzzling feature, recall that the probability of two or more false rejections across six pointwise tests at the five-percent level is 3.28%, which is lower than the nominal five-percent level. Therefore, critical values adjust downwards. At the 10-percent level, it is 11.43%. Hence, critical values adjust upwards.

5. Power Calculations

This dataset includes 18,276 observations, which allowed us to obtain precise estimates of a rich parameter set. However, DIA is feasible with much smaller samples. This subsection looks into the choice of sample size for this RCT.

Table 6: Optimal Sample Sizes for Financial Education Program by Quantile Treatment Effect

                          Minimum detectable effect
              +1%     +5%    +10%   +15%   +1 point  +5 pts.  +10 pts.  +15 pts.
10th perc.   67920    2828    743    347      8862      985       354       181
20th perc.   72109    3002    789    368     12194     1355       488       249
30th perc.   90478    3767    990    462     19492     2166       780       398
40th perc.   61993    2581    678    317     16371     1819       655       334
50th perc.   46217    1924    506    236     14430     1603       577       294
60th perc.   38287    1594    419    196     13792     1532       552       281
70th perc.   33043    1376    361    169     13753     1528       550       281
80th perc.   28562    1189    312    146     13731     1526       549       280
90th perc.   27957    1164    306    143     15908     1768       636       325

Notes: The table shows the smallest sample size such that a pointwise two-sided test of the zero null hypothesis detects a given treatment effect with size 5% and power 80%. See Subsection 6.3.1.

We follow the approach of Subsection 6.4.1. We compute the optimal sample size to detect a given minimum effect on each decile. We fix the significance level at 5% and the power at 80%. We must also parametrize the asymptotic variance of our estimator of interest. For quantile effects, it depends on the density of outcomes, which we compute from the control group.[71] In an application, researchers might obtain the density function based on a baseline sample, previous studies and parametric assumptions.

Table 6 reports our results for a typical random control trial without any adjustments. Small effects require large samples: detecting an increase of 1% in the 10th percentile requires a total of 67,920 observations. However, moderate samples suffice for modest effects. For instance, detecting increases of around 10% at each decile would have required at most around 990 observations (for the 30th percentile).
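The entries in Table 6 follow from the closed-form expressions of Subsection 6.4.1. The transcription below is ours; for simplicity it evaluates the densities from a normal approximation rather than the kernel estimates used for the table, so it will not reproduce the table's figures exactly.

```python
from math import sqrt, pi
from statistics import NormalDist

def optimal_design(tau, q0, q1, f0, f1, alpha=0.05, power=0.80):
    """Optimal treatment share p*(tau) and total sample size N*(tau) for a
    pointwise test of the zero null on the tau-th quantile effect.

    q0, q1: quantiles of the potential outcomes; f0, f1: densities of the
    potential outcomes evaluated at those quantiles."""
    z = NormalDist().inv_cdf
    s0 = sqrt(tau * (1 - tau)) / f0          # asymptotic std. dev., control
    s1 = sqrt(tau * (1 - tau)) / f1          # asymptotic std. dev., treated
    p_star = s1 / (s0 + s1)                  # optimal randomization rate
    mde = q1 - q0                            # minimum detectable effect
    n_star = ((z(1 - alpha / 2) + z(power)) / mde) ** 2 \
        * (s0 ** 2 / (1 - p_star) + s1 ** 2 / p_star)
    return p_star, n_star

# Location-model example at the median: normal outcomes with the control
# group's standard deviation (14.81) and a 10% shift of the median (56.16).
sd = 14.81
f = 1 / (sd * sqrt(2 * pi))                  # normal density at the median
p_star, n_star = optimal_design(0.5, 56.16, 56.16 * 1.10, f, f)
```

As the text notes, in a location model the randomization rate is 0.5 at every quantile, and the required sample size grows as the detectable effect shrinks.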
Table 6 does not adjust sample sizes for multiple hypothesis tests or clustering, which would raise data requirements. The formula for the optimal sample size in Subsection 6.4.1 depends on the square of the critical values. In Subsection 7.1.2, we controlled the FWER across 143 tests, inflating cutoffs by up to 50%. Such a correction would increase the recommended sample size by a factor of up to 2.25 (that is, 1.5 squared). Estimating fewer parameters would reduce this adjustment. For instance, suppose that we focus on the mean and the deciles (a total of ten parameters). Then, control of the FWER increases critical values by 30%, which implies multiplying the sample size by a factor of about 1.7. As for clustering, the optimal sample size is proportional to the variance of the estimator of interest. In our application, accounting for clustering increases variances by a factor between 4 and 7. Combining these rules of thumb, our estimates of each quantile effect (see Table 3 and Figure 6) and the formula for the optimal sample size (see Subsection 6.4.1), we find that the optimal sample sizes for this RCT would be 12,000 observations if we control the FWER across all 143 parameters and 9,000 observations if we only control the FWER across the mean and the deciles. Note that the adjustments for multiple testing and clustering are unknown at the design stage. Simulations may help researchers incorporate these features into their choice of sample size. See Subsection 6.4.2.

71 We use the Epanechnikov kernel and the Sheather-Jones plug-in bandwidth.

6. Takeaways

These empirical examples highlight several points about the importance of going beyond average treatment effects. First, it is difficult to summarize the impact of an intervention with any single statistic. In particular, average effects often hide significant heterogeneity. Second, we gain important insights from the changes in the shape of the distribution.
Quantile analysis and other DIA methods allow us to address a broader range of policy concerns than means alone under minimal assumptions. Third, conditional analysis is a complement to, not a substitute for, unconditional approaches. Each method reveals a different dimension of heterogeneity. Finally, our analysis illustrates methodological issues in DIA, such as correcting for multiple testing.

2. School Management RCT in The Gambia

In this section, we use the methods in this toolkit, focusing on those in Part 4, to re-analyze the impact of the Whole School Development Program (WSDP). The WSDP was administered as part of an RCT in The Gambia from 2007 to 2011 and evaluated by Blimpo, Evans and Lahire (2016). The program aims to improve school quality by training school leaders and community members in school management techniques. Each of 273 Gambian primary schools was randomly assigned to one of three groups. Ninety schools in the first treatment arm were offered the WSDP, which provided principals, certain teachers and community members with a comprehensive training program in school management. These schools were also given a $500 grant to help cover costs associated with implementing new initiatives based on the training. To disentangle the impact of the WSDP training from the impact of the grant, 94 schools in the second treatment arm ("grant only") received the $500 grant without any additional training. In the third arm, 89 schools served as the control group and received no treatment.

Blimpo, Evans and Lahire (2016) find that the average impact of the WSDP treatment on test scores is approximately zero and statistically insignificant when looking at impacts separately by grade and year. Zero average effects are consistent with a program having no impact or with a program having heterogeneous impacts, benefiting some schools and hurting others.
In order to distinguish between these two cases, we use the methods discussed in Part 4 to estimate the variance of treatment effects and, relying on strong assumptions, the entire distribution of treatment effects. The results of these exercises are important for policy: if there is treatment effect heterogeneity, the program could become effective if better targeted at those with positive treatment effects. This is obviously not possible if the treatment effect is (very close to) zero for everyone.

The WSDP is a good application to demonstrate the methods discussed in Part 4 because high-quality panel data were collected for the evaluation. However, we do not find evidence of heterogeneity in treatment effects: our results are consistent with the WSDP having no impact on any school. To illustrate the methods further, and to demonstrate that our finding of no impact is not due to a general problem with these methods, we simulate an analogous dataset, based on the features of the data collected for this RCT, using the Simulation 1 data generating process described in Appendix 1.

1. Data

We use the data from Blimpo, Evans and Lahire (2016). Baseline data were collected in 2008 at the start of the program. Follow-up data were collected in 2009, 2010 and 2011. Blimpo, Evans and Lahire (2016) demonstrate that schools' baseline characteristics are balanced across the three arms. Here, we focus on the impact of the WSDP on student test scores. Math and literacy scores were collected for 3rd and 5th graders in 2008 and 2010 and for 4th and 6th graders in 2009 and 2011. We construct a balanced panel of test scores by taking the average test score by school-grade-year and restricting attention to WSDP and control group observations where two grades were observed in each year from 2008 to 2011. This yields a sample of 960 total observations, with 8 observations for each of 120 schools, including 61 schools in the WSDP treatment arm and 59 schools in the control group.
This differs from Blimpo, Evans and Lahire (2016), who use student-level data clustered by school-year for their cross-sectional analysis. We instead focus on a balanced panel of school-level observations. As Table 7 shows, this approach yields similar results for the mean impacts, but has the advantage of simplifying the implementation of Mallows' deconvolution algorithm, which we use in Subsection 7.2.2. 72

2. Going Beyond the Mean of the Distribution of Treatment Effects

Blimpo, Evans and Lahire (2016) find that the average impact of the WSDP treatment on test scores is approximately zero when looking at impacts separately by grade and year. We find a similarly small and statistically insignificant effect using pooled OLS on the stacked cross-sections. But does this null average effect mask treatment effect heterogeneity? For a null average effect to mask treatment heterogeneity, some schools must benefit while others are hurt by the program. While policymakers usually implement programs because they think they will benefit participants, programs may have unintended negative consequences if, for example, they disrupt effective systems. If this is the case here, what proportion of schools benefited from the WSDP?

In this subsection, we answer these questions by studying features of the distribution of the WSDP's effects beyond the mean. Unlike average effects and quantile treatment effects, features of the distribution of impacts beyond the mean are not identified by the random variation induced by an RCT alone; additional assumptions are always required. The WSDP RCT is a good candidate for satisfying these assumptions because it includes four waves of data, and treated schools are observed both before and after the implementation of the WSDP. This allows us to analyze the data using panel data methods, like those discussed in Subsection 4.4.4.
Conditional on school fixed effects, the variance of treatment effects is identified so long as the treatment effect is uncorrelated with all time-varying components of the error term, and these time-varying components are not themselves too correlated over time. The full distribution of treatment effects is identified under the stronger assumption that treatment effects are independent of all time-varying components of the error term. While these assumptions are not directly testable, they may be plausible. They would be violated, for example, if the school management training is particularly effective in areas with the highest (unobserved) economic growth during the study period.

72 Specifically, by focusing on a balanced panel, all schools' residual vectors are of the same length and receive equal weight.

In order to move beyond the mean, we use the following variant of the panel data specification from Subsection 4.4.4 to account for heterogeneity in the WSDP's impact:

y_igt = β_i WSDP_it + α_i + γ_g + δ_t + ε_igt.

Note two features of this specification: First, we control for school fixed effects, α_i. Second, the treatment effects are no longer assumed to be constant across schools. Instead, β_i is the impact of the WSDP on school i. Importantly, β_i is only identified for the subpopulation of schools in the WSDP treatment arm, since identification relies on variation in WSDP_it within school i. Control group schools contribute to estimating the coefficients γ_g and δ_t, which are common across all schools.

The first row of Table 7 shows that the estimate of the average impact across schools, 0.01, is quite similar to the estimate assuming treatment effects are constant across schools. This estimate is just the average of the school-specific estimates, β̂ = N₁⁻¹ Σ_i β̂_i, where N₁ is the number of treated schools. Standard errors are given by conventional formulas for method of moments estimators. The second row shows the estimate of the variance of treatment effects using the formula given in Arellano and Bonhomme's (2012) equation (50).
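A sketch of this two-step logic in code, under our reading of the procedure: estimate a separate effect per treated school by OLS, average the school-specific estimates, and noise-correct the cross-school variance of the estimates by subtracting the average sampling variance. This is a homoskedastic simplification of Arellano and Bonhomme's (2012) correction, not their exact estimator; the function and variable names are ours.

```python
import numpy as np

def mean_group_effects(y_by_school, d_by_school, x_by_school):
    """For each treated school, regress outcomes on the treatment indicator,
    controls and an intercept (the school fixed effect); then average the
    school-specific effects and noise-correct their variance."""
    betas, svars = [], []
    for y, d, x in zip(y_by_school, d_by_school, x_by_school):
        X = np.column_stack([d, x, np.ones_like(d)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        s2 = resid @ resid / (len(y) - X.shape[1])        # error variance
        svars.append(s2 * np.linalg.inv(X.T @ X)[0, 0])   # sampling var of beta_hat
        betas.append(coef[0])
    mean_effect = float(np.mean(betas))
    # cross-school variance of the estimates minus average sampling variance
    var_effect = float(np.var(betas, ddof=1) - np.mean(svars))
    return mean_effect, var_effect
```

The subtraction matters: with few observations per school, the raw variance of the β̂_i mostly reflects estimation noise rather than genuine effect heterogeneity.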
The estimate, 0.005, is about half the size of the mean impact and is also statistically insignificant. The validity of this estimate relies on two assumptions. First, treatment effects are conditionally mean independent of all past, current and future error terms (i.e., strict exogeneity):

E[β_i | WSDP_i, α_i, X_i, ε_i1, …, ε_iT] = E[β_i | WSDP_i, α_i, X_i],

where X_i collects school i's grade and year covariates. Second, the covariance matrix of ε_i = (ε_i1, …, ε_iT)′ is given by Σ_i = σ_i² I_T. The notation σ_i² indicates that error variances may be a function of individual covariates. The key component of this latter assumption is that a school's errors are not correlated across periods. This can be relaxed so long as errors are not "too correlated".

As discussed in Part 4, we can identify the entire distribution of treatment effects if we are willing to make even stronger assumptions. In many applications, these assumptions are unjustified. We include this analysis here primarily for the purpose of illustration. First, if we are willing to assume treatment effects are normally distributed, then the mean and variance estimates are sufficient statistics for the distribution. This distribution is shown with the dotted line in Figure 8. The distribution is tightly concentrated around zero. To be sure, we cannot rule out that the distribution is degenerate at zero, since neither the mean nor the variance is statistically significant. Alternatively, we can recover the entire distribution using deconvolution if we are willing to assume that treatment effects are conditionally independent of the error terms ε_it.

Table 7: Beyond the Mean Impact of WSDP on Numeracy and Literacy Test Scores

                                  Estimate     Standard error
E(β_i)                               0.010               0.03
var(β_i)                             0.005               0.01
Makarov Bounds on P(β_i ≥ 0)    [0.05, 0.95]

Notes: N = 960 with 8 observations on each of 120 schools. Mean and variance of impact estimates and standard errors calculated using Arellano and Bonhomme's (2012) mean group and robust variance estimators. See Section 4.4 for details.
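The Makarov bounds in Table 7 use only the two marginal outcome distributions. A minimal empirical-CDF version of the calculation (our own sketch; it evaluates the bounds on the CDF of the treatment-control difference at zero and ignores boundary and continuity subtleties):

```python
import numpy as np

def makarov_bounds_pos(y1, y0, grid_size=500):
    """Bounds on P(Y1 - Y0 >= 0) given only the marginals of Y1 and Y0,
    following Makarov (1982): for the CDF of the difference at 0,
    F_L = max(sup_y [F1(y) - F0(y)], 0) and
    F_U = 1 + min(inf_y [F1(y) - F0(y)], 0)."""
    y1, y0 = np.sort(y1), np.sort(y0)
    grid = np.linspace(min(y1[0], y0[0]), max(y1[-1], y0[-1]), grid_size)
    F1 = np.searchsorted(y1, grid, side='right') / len(y1)  # empirical CDFs
    F0 = np.searchsorted(y0, grid, side='right') / len(y0)
    diff = F1 - F0
    f_lower = max(diff.max(), 0.0)        # lower Makarov bound on F_delta(0)
    f_upper = 1.0 + min(diff.min(), 0.0)  # upper Makarov bound on F_delta(0)
    return 1.0 - f_upper, 1.0 - f_lower   # bounds on P(Y1 - Y0 >= 0)
```

When the two marginals are nearly identical, the bounds approach [0, 1], which is why Table 7's interval is so wide even though the mean impact is precisely estimated.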
While this assumption is not directly testable, a necessary condition for it to be satisfied is that the variance of β_i WSDP_it + ε_it is greater for treatment observations than for control observations. This condition is satisfied in our sample; the variance of these partial residuals is 0.133 for treatment observations and 0.114 for control observations.

Figure 8: Estimates of Distribution of WSDP Impacts

The estimates β̂_i are very noisy because they are estimated using only 3 treatment observations and one control observation per treated school. Deconvolution attempts to disentangle the variation due to the treatment effects from the variation attributable to noise. The dashed line in Figure 8 shows the distribution of the unadjusted β̂_i estimates. In contrast, the solid line shows the distribution of treatment effects recovered by applying deconvolution via Mallows' algorithm. As expected, the deconvolution estimate is much less dispersed than the unadjusted distribution. However, it is more dispersed and has heavier tails than the normal approximation to the distribution of treatment effects.

Figure 9: Deciles of Treatment Effects and QTTs

With knowledge of the full distribution of treatment effects, we can calculate any feature of that distribution: for example, any quantile of the treatment effect distribution, or any moment, such as the mean, variance, skewness or kurtosis. In Figure 9, we present deciles of the deconvolution estimates of the distribution of treatment effects in gray. For comparison, we show quantile treatment effects at each decile in black. We calculate these quantile treatment effects using the empirical deciles of the partial residuals of a regression of average test scores in 2011 on grade level.
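Mallows' (2007) "deconvolution by simulation" is simple enough to sketch in a few lines. This is our paraphrase of the idea, not the exact implementation used here: start from the noisy estimates, then repeatedly add freshly simulated noise, rank-match the result to the observed distribution, and keep the implied noise-free draws. `draw_noise` encodes the assumed error distribution (in an application, estimated from control-group residuals).

```python
import numpy as np

def mallows_deconvolution(w, draw_noise, n_iter=500, rng=None):
    """Observed w = x + noise; recover draws of x by iterating: add fresh
    simulated noise to the current x draws, then nudge the draws so that
    x + noise matches the observed distribution rank by rank."""
    rng = rng or np.random.default_rng()
    w = np.asarray(w, dtype=float)
    w_sorted = np.sort(w)
    x = w.copy()  # initial guess: no noise removed
    for _ in range(n_iter):
        z = x + draw_noise(len(w), rng)
        order = np.argsort(z)
        # rank-match: force the i-th smallest x + noise to equal the
        # i-th smallest observed w, and absorb the gap into x
        x[order] += w_sorted - z[order]
    return x
```

At the iteration's fixed point, adding a fresh noise draw to the recovered x reproduces the observed distribution, which is exactly the deconvolution requirement.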
While the distribution of treatment effects is calculated using the full panel, our panel functional form implies that treatment effects are the same in every year, so it should be comparable to the quantile treatment effects calculated using only 2011 data. Pointwise confidence intervals for the QTTs are calculated using one thousand iterations of the pointwise bootstrap described in Subsection 6.2.1.

In this case, deciles of the distribution of treatment impacts look quite similar to the quantile treatment effects at each decile. The deciles of the distribution of treatment impacts are increasing by construction, but the QTTs trend upwards at almost exactly the same rate. As discussed in Parts 2 and 4, this is not the case in general. The similarity of these estimates suggests the WSDP did not cause much mobility across the potential outcome distributions. In other words, schools that ranked high in the control group distribution also ranked high in the treatment group distribution. While none of the decile QTTs are significantly different from 0 based on the pointwise 95% confidence intervals, the effect at the top decile, 0.23, is significant at the 10-percent level.

3. Simulation Results

The analyses in this section so far amend Blimpo, Evans and Lahire's (2016) finding of a zero average effect by providing evidence that it is due to zero or small individual effects rather than offsetting gains and losses. This demonstrates that DIA can provide further guidance for policy even in cases with no average impact. However, the lack of significance impairs the illustration of what else one can learn from DIA and such panel data, and of how failures of the assumptions can affect the results. To improve the illustration, we replicate our distributional analysis of the WSDP using simulated data mirroring the sample used for the WSDP. Specifically, we draw a simulated panel of 8 observations for each of 120 "schools" from the following data generating process:

y_it = β_i WSDP_it + α_i + 0.5√t γ_i + ε_it.
We impose the following distributional assumptions:

α_i ~ N(0, 1),  (β_i, γ_i)′ ~ N((1.5, 0)′, [1, ρ√0.5; ρ√0.5, 0.5])  and  ε_it ~ N(0, 0.01),

where ρ is the correlation between β_i and γ_i. We include the √t γ_i term to explore the impact of violating the deconvolution assumptions on the estimates. In particular, β_i will be correlated with this term unless ρ = 0. As in the WSDP sample, 61 schools are randomly assigned to treatment beginning in year 2.

Table 8 presents estimates of the average and variance of treatment effects and Makarov Bounds on the proportion of schools that benefit from the program when ρ is 0, 0.1 and 0.8. 73 Panel A shows results when ρ is 0, so that the deconvolution assumptions are satisfied.

73 Note that each simulated dataset was drawn with the same random seed, so that the only difference in the estimates is due to changing ρ. Mirroring the above analysis, our controls include an indicator for whether the observation is from an "older" grade (which are randomly selected, since this coefficient does not enter the DGP) and year indicators.

Table 8: Illustrating Analyses Beyond the Mean Impact Using Simulated Data

                                  True value     Estimate    Std. error
                                  from DGP
A. ρ = 0
E(β_i)                               1.50          1.57          0.15
var(β_i)                             1.00          1.37          0.23
Makarov Bounds on P(β_i ≥ 0)         0.93      [0.49, 1.00]
B. ρ = 0.1
E(β_i)                               1.50          1.57          0.16
var(β_i)                             1.00          1.41          0.24
Makarov Bounds on P(β_i ≥ 0)         0.93      [0.48, 1.00]
C. ρ = 0.8
E(β_i)                               1.50          1.55          0.18
var(β_i)                             1.00          1.64          0.30
Makarov Bounds on P(β_i ≥ 0)         0.93      [0.46, 1.00]

Notes: N = 960 with 8 observations on each of 120 schools. Based on simulated data with identical structure to the WSDP sample. Average and variance of impact estimates and standard errors calculated using Arellano and Bonhomme's (2012) mean group and robust variance estimators. See Section 4.4.4 for additional details.

In this case, the estimated mean and variance of treatment effects are both statistically significant, but not significantly different from the true values. The Makarov Bounds on the proportion of schools that benefited range from 0.49 to 1, which is consistent with anything from roughly half of the schools being hurt by the program to everyone benefiting, but they include the true value of 0.93.
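For readers who want to replicate this exercise, a generator for a panel of this shape might look as follows. The constants mirror our reading of the (partly garbled) DGP above — effects β_i with mean 1.5 and variance 1, an omitted component γ_i with mean 0, variance 0.5 and correlation ρ with β_i, and noise with variance 0.01 — so treat it as an illustrative sketch rather than the authors' exact code.

```python
import numpy as np

def simulate_wsdp_like(n_schools=120, n_treated=61, n_periods=4,
                       n_grades=2, rho=0.0, rng=None):
    """Draw a balanced school-grade-year panel mirroring the simulated
    WSDP exercise. Returns (school, grade, year, treated, outcome) rows."""
    rng = rng or np.random.default_rng(0)
    alpha = rng.normal(0, 1, n_schools)                 # school fixed effects
    cov = rho * np.sqrt(1.0 * 0.5)                      # cov(beta, gamma)
    beta, gamma = rng.multivariate_normal(
        [1.5, 0.0], [[1.0, cov], [cov, 0.5]], size=n_schools).T
    treated = np.zeros(n_schools, dtype=bool)
    treated[rng.choice(n_schools, n_treated, replace=False)] = True
    rows = []
    for s in range(n_schools):
        for g in range(n_grades):
            for t in range(1, n_periods + 1):
                d = bool(treated[s]) and t >= 2         # treatment starts in year 2
                y = (beta[s] * d + alpha[s]
                     + 0.5 * np.sqrt(t) * gamma[s]      # omitted growth component
                     + rng.normal(0, 0.1))              # noise with variance 0.01
                rows.append((s, g, t, int(d), y))
    return rows
```

Re-running the estimators on draws with different `rho` values reproduces the qualitative pattern of Table 8: the variance estimate drifts upward as the correlation with the omitted component grows.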
Panel B shows results when ρ is 0.1, so that the deconvolution assumptions are violated, but the correlation between the treatment effect and the omitted variable is relatively weak. Since the omitted variable is mean zero, the estimated average effect remains unbiased. However, the estimated variance of treatment effects increases slightly and, as a result, is significantly different from the true variance at the ten-percent level. Panel C shows the case when ρ is 0.8, so that treatment effects are strongly correlated with the omitted variable. In this case, the estimated variance increases further and is significantly different from the true variance at the five-percent level. In both cases, the Makarov Bounds become somewhat wider, but are qualitatively quite similar.

Figure 10: Estimates of Distribution of Simulated Impacts

Figure 10 plots the true distribution of treatment effects and the deconvolution estimates for each of the above cases. As Table 8 suggests, when ρ is 0 or 0.1, the estimated distribution is slightly more dispersed but generally similar to the true distribution of treatment effects. When ρ is 0.8, the estimated distribution is even more over-dispersed.

Figure 11: Deciles of Simulated Treatment Effects and QTTs

The three panels of Figure 11 plot the deciles of the true distribution of treatment effects, the deconvolution estimate of the distribution of treatment effects, and the estimated QTTs at each decile when ρ equals 0, 0.1, or 0.8, respectively. Similarly to Figure 10, the estimates are quite similar when ρ is 0 or 0.1. In both cases, the estimated deciles of the distribution of treatment effects are quite similar to the deciles of the true distribution. The correlation between the estimated and true deciles is about 0.92 in both cases.
The deciles of QTTs are much flatter and generally smaller than the true deciles of treatment effects above the 4th decile. When ρ is 0.8, so that the deconvolution assumptions are violated, the deciles of the estimated distribution of treatment effects are still quite similar to the true deciles. But in this case, the QTTs at the deciles are about as correlated with the true deciles as the deconvolution estimates, whereas the QTTs were less correlated with the true deciles of the treatment effects in the other cases.

4. Takeaways

By going beyond the mean and estimating the distribution of treatment effects, we were able to test whether the null average effect found in Blimpo, Evans, and Lahire (2015) was masking policy-relevant treatment heterogeneity. The zero average effect does not appear to be masking heterogeneity in schools' responses to the Whole School Development Program. In fact, the estimated variance (0.005) implies a standard deviation of impacts of only about 0.07, and it is also statistically insignificant. This finding of little treatment heterogeneity is corroborated by both the deconvolution estimate of the distribution and the estimates of QTTs.

While relevant for policy, such a null result falls short of illustrating what we can learn from the methods we apply. The simulated results demonstrate that the null results are a feature of the particular program being studied rather than a deficiency of the distributional methods. Our estimates of both the variance and the distribution of treatment effects are statistically indistinguishable from the truth even when the strong assumptions are violated. However, it is not clear how sensitive the methods are to violations of their assumptions in general, and we do not mean to suggest that the robustness of the estimates in our application generalizes.

References

AAKVIK, A., J.J. HECKMAN AND E.J.
VYTLACIL (2005): "Estimating treatment effects for discrete outcomes when responses to treatment vary: An application to Norwegian vocational rehabilitation programs", Journal of Econometrics 125(1–2), 15–21.
ABADIE, A. (2002): "Bootstrap tests for distributional treatment effects in instrumental variable models", Journal of the American Statistical Association 97(457), 284–292.
ABADIE, A., J. ANGRIST AND G.W. IMBENS (2002): "Instrumental variables estimation of quantile treatment effects", Econometrica 70(1), 91–117.
ABBRING, J., AND J. HECKMAN (2007): "Econometric evaluation of social programs, part III: Distributional treatment effects, dynamic treatment effects, dynamic discrete choice, and general equilibrium policy evaluation", Handbook of Econometrics 6, 5145–5303.
ANDREWS, D.W.K., AND M. BUCHINSKY (2000): "A three-step method for choosing the number of bootstrap repetitions", Econometrica 68(1), 23–51.
ANGRIST, J.D., AND G.W. IMBENS (1994): "Identification and estimation of local average treatment effects", Econometrica 62(2), 467–475.
ANGRIST, J., AND J.S. PISCHKE (2009): Mostly harmless econometrics: An empiricist's companion. Princeton (NJ): Princeton University Press.
ARCIDIACONO, P., E. AUCEJO, H. FANG AND K. SPENNER (2011): "Does affirmative action lead to mismatch? A new test and evidence", Quantitative Economics 2(3), 303–333.
ARELLANO, M., AND S. BONHOMME (2012): "Identifying distributional characteristics in random coefficients panel data models", Review of Economic Studies 79(3), 987–1020.
ATHEY, S., AND G.W. IMBENS (2015): "Recursive partitioning for heterogeneous causal effects", arXiv:1504.01132.
BELLONI, A., V. CHERNOZHUKOV AND C. HANSEN (2014): "Inference on treatment effects after selection among high-dimensional controls", Review of Economic Studies 81(2), 608–650.
BITLER, M., J. GELBACH AND H.W. HOYNES (2006): "What mean impacts miss: Distributional effects of welfare reform experiments", American Economic Review 96(4), 988–1012.
───── (2014): "Can variation in subgroups' average treatment effects explain treatment effect heterogeneity? Evidence from a social experiment", NBER Working Paper 20142.
BITLER, M., H.W. HOYNES AND T. DOMINA (2014): "Experimental evidence on distributional effects of Head Start", NBER Working Paper 20434.
BISSANTZ, N., L. DÜMBGEN, H. HOLZMANN AND A. MUNK (2007): "Non-parametric confidence bands in deconvolution density estimation", Journal of the Royal Statistical Society B 69(3), 483–506.
BLACK, D.A., J.A. SMITH, M.C. BERGER AND B.J. NOEL (2003): "Is the threat of reemployment services more effective than the services themselves? Experimental evidence from the UI system", American Economic Review 93(3), 1313–1327.
BLIMPO, M., D.K. EVANS AND N. LAHIRE (2015): "Parental human capital and effective school management: Evidence from The Gambia", World Bank Policy Research Working Paper 7238.
BONHOMME, S., AND J. ROBIN (2010): "Generalized nonparametric deconvolution with an application to earnings dynamics", Review of Economic Studies 77(2), 491–533.
BRINCH, C., M. MOGSTAD AND M. WISWALL (forthcoming): "Beyond LATE with a discrete instrument", Journal of Political Economy.
BRUHN, M., L. DE SOUZA LEÃO, A. LEGOVINI, R. MARCHETTI AND B. ZIA (2016): "The impact of high school financial education: Evidence from a large-scale evaluation in Brazil", American Economic Journal: Applied Economics 8(4), 256–295.
CAMERON, C., AND D.L. MILLER (2015): "A practitioner's guide to cluster-robust inference", Journal of Human Resources 50(2), 317–372.
CAMERON, C., AND P. TRIVEDI (2005): Microeconometrics: Methods and applications. New York (NY): Cambridge University Press.
CARNEIRO, P., K. HANSEN AND J. HECKMAN (2002): "Removing the veil of ignorance in assessing the distributional impacts of social policies", NBER Working Paper 8840.
───── (2003): "Estimating distributions of treatment effects with an application to the returns to schooling and measurement of the effects of uncertainty on college choice", International Economic Review 44(2), 361–422.
CATTANEO, M. (2010): "Efficient semiparametric estimation of multi-valued treatment effects under ignorability", Journal of Econometrics 155(2), 138–154.
CHERNOZHUKOV, V., I. FERNÁNDEZ-VAL AND A. GALICHON (2010): "Quantile and probability curves without crossing", Econometrica 78(3), 1093–1125.
CHERNOZHUKOV, V., I. FERNÁNDEZ-VAL AND B. MELLY (2013): "Inference on counterfactual distributions", Econometrica 81(6), 2205–2268.
CHERNOZHUKOV, V., AND C. HANSEN (2004): "The impact of 401(k) participation on the wealth distribution: An instrumental quantile regression analysis", Review of Economics and Statistics 86(3), 735–751.
───── (2005): "An IV model of quantile treatment effects", Econometrica 73(1), 245–261.
───── (2013): "Quantile models with endogeneity", Annual Review of Economics 5, 57–81.
CHERNOZHUKOV, V., C. HANSEN AND M. JANSSON (2009): "Finite sample inference in econometric models via quantile restrictions", Journal of Econometrics 152(2), 93–103.
CORNELISSEN, T., C. DUSTMANN, A. RAUTE AND U. SCHÖNBERG (2016): "From LATE to MTE: Alternative methods for the evaluation of policy interventions", Labour Economics 41, 47–60.
CRUMP, R.K., V.J. HOTZ, G.W. IMBENS AND O.A. MITNIK (2008): "Nonparametric tests for treatment effect heterogeneity", Review of Economics and Statistics 90(3), 389–405.
───── (2009): "Dealing with limited overlap in estimation of average treatment effects", Biometrika 96(1), 187–199.
CUNHA, F., J. HECKMAN AND S. SCHENNACH (2010): "Estimating the technology of cognitive and non-cognitive skill formation", Econometrica 78(3), 883–931.
DINARDO, J., N.M. FORTIN AND T. LEMIEUX (1996): "Labor market institutions and the distribution of wages, 1973–1992: A semiparametric approach", Econometrica 64(5), 1001–1044.
DJEBBARI, H., AND J.A.
SMITH (2008): "Heterogeneous impacts in PROGRESA", Journal of Econometrics 145(1), 64–80.
DONALD, S.G., AND Y.C. HSU (2014): "Estimation and inference for distribution functions and quantile functions in treatment effect models", Journal of Econometrics 178(3), 383–397.
DONALD, S.G., Y.C. HSU AND R.P. LIELI (2014): "Testing the unconfoundedness assumption via inverse probability weighted estimators of (L)ATT", Journal of Business & Economic Statistics 32(3), 395–415.
FAFCHAMPS, M., AND J. LABONNE (2016): "Using split samples to improve inference about causal effects", NBER Working Paper 21842.
FAN, Y., AND S. PARK (2010): "Sharp bounds on the distribution of treatment effects and their statistical inference", Econometric Theory 26(3), 931–951.
FITHIAN, W., D. SUN AND J. TAYLOR (2017): "Optimal inference after model selection", arXiv:1410.2597v4.
FIRPO, S. (2007): "Efficient semiparametric estimation of quantile treatment effects", Econometrica 75(1), 259–276.
FIRPO, S., N.M. FORTIN AND T. LEMIEUX (2009): "Unconditional quantile regressions", Econometrica 77(3), 953–973.
FIRPO, S., AND C. PINTO (2015): "Identification and estimation of distributional impacts of interventions using changes in inequality measures", Journal of Applied Econometrics 31(3), 457–486.
FIRPO, S., AND G. RIDDER (2008): "Bounds on functionals of the distribution of treatment effects", Textos para Discussão 201. São Paulo, SP: Escola de Economia de São Paulo.
FRÖLICH, M. (2006): "Non-parametric regression for binary dependent variables", Econometrics Journal 9(3), 511–540.
FRÖLICH, M. (2007): "Propensity score matching without conditional independence assumption—with an application to the gender gap in the United Kingdom", Econometrics Journal 10(2), 359–407.
FRÖLICH, M., AND B. MELLY (2013a): "Identification of treatment effects on the treated with one-sided non-compliance", Econometric Reviews 32(3), 384–414.
───── (2013b): "Unconditional quantile treatment effects under endogeneity", Journal of Business & Economic Statistics 31(3), 346–357.
HECKMAN, J.J., AND B. HONORÉ (1990): "The empirical content of the Roy model", Econometrica 58(5), 1121–1149.
HECKMAN, J.J., J. SMITH AND N. CLEMENTS (1997): "Making the most out of programme evaluations and social experiments: Accounting for heterogeneity in programme impacts", Review of Economic Studies 64(4), 487–535.
HECKMAN, J.J., AND E.J. VYTLACIL (1999): "Local instrumental variables and latent variable models for identifying and bounding treatment effects", Proceedings of the National Academy of Sciences 96, 4730–4734.
───── (2005): "Structural equations, treatment effects, and econometric policy evaluation", Econometrica 73(3), 669–738.
───── (2007): "Econometric evaluation of social programs, part II: Using the marginal treatment effect to organize alternative econometric estimators to evaluate social programs, and to forecast their effects in new environments", Handbook of Econometrics 6b, 4875–5143.
HIRANO, K., G.W. IMBENS AND G. RIDDER (2003): "Efficient estimation of average treatment effects using the estimated propensity score", Econometrica 71(4), 1161–1189.
HOROWITZ, J.L. (2001): "The bootstrap", Handbook of Econometrics 5, 3159–3228.
HOROWITZ, J.L., AND N.E. SAVIN (2001): "Binary response models: Logits, probits and semiparametrics", Journal of Economic Perspectives 15(4), 43–56.
IMAI, K., AND M. RATKOVIC (2013): "Estimating treatment effect heterogeneity in randomized program evaluation", Annals of Applied Statistics 7(1), 443–470.
IMBENS, G.W. (2015): "Matching methods in practice: Three examples", Journal of Human Resources 50(2), 373–419.
IMBENS, G.W., AND C.F. MANSKI (2004): "Confidence intervals for partially identified parameters", Econometrica 72(6), 1845–1857.
IMBENS, G.W., AND D.B.
RUBIN (1997): "Estimating outcome distributions for compliers in instrumental variables models", Review of Economic Studies 64(4), 555–574.
IMBENS, G.W., AND J.M. WOOLDRIDGE (2009): "Recent developments in the econometrics of program evaluation", Journal of Economic Literature 47(1), 5–86.
JACOBSON, L.S., R.J. LALONDE AND D.G. SULLIVAN (1993): "Earnings losses of displaced workers", American Economic Review 83(4), 685–709.
JAYNES, E. (1957): "Information theory and statistical mechanics", Physical Review 106(4), 620–630.
KLINE, P., AND C. WALTERS (2015): "Evaluating public programs with close substitutes: The case of Head Start", NBER Working Paper 21658.
KOENKER, R. (2005): Quantile regression. Cambridge, UK: Cambridge University Press.
KOENKER, R., AND G. BASSETT (1978): "Regression quantiles", Econometrica 46(1), 33–50.
KOENKER, R., AND Z. XIAO (2002): "Inference on the quantile regression process", Econometrica 70(4), 1583–1612.
KOTLARSKI, I. (1967): "On characterizing the gamma and the normal distribution", Pacific Journal of Mathematics 20(1), 69–76.
KOWALSKI, A.E. (2016): "Doing more when you're running LATE: Applying marginal treatment effect methods to examine treatment effect heterogeneity in experiments", NBER Working Paper 22363.
LECHNER, M. (1999): "Earnings and employment effects of continuous off-the-job training in East Germany after unification", Journal of Business and Economic Statistics 17(1), 74–90.
LEE, S., AND A.M. SHAIKH (2014): "Multiple testing and heterogeneous treatment effects: Re-evaluating the effect of PROGRESA on school enrollment", Journal of Applied Econometrics 29(4), 612–626.
LIST, J.A., A.M. SHAIKH AND Y. XU (2016): "Multiple hypothesis testing in experimental economics", NBER Working Paper 21875.
MAKAROV, G. (1982): "Estimates for the distribution function of a sum of two random variables when the marginal distributions are fixed", Theory of Probability and its Applications 26(4), 803–806.
MALLOWS, C.
(2007): “Deconvolu�on by simula�on”, in R. Liu, W. Strawderman and C.H. Zhang (eds.), Complex Datasets and Inverse Problems: Tomography, Networks and Beyond, IMS Lecture Notes – Monograph Se- ries, vol. 54, 1–11. New Brunswick, NJ: Rutgers University. MANSKI, C.F. (2004): “Sta�s�cal treatment rules for heterogeneous popula�ons”, Econometrica 72(4), 1221– 1246. MCCONNELL, B., AND VERA-HERNÁNDEZ, M. (2015): “Going beyond simple sample size calcula�ons: A prac��oner’s guide”, IFS Working Paper W15/17. NEWEY, WHITNEY K., AND DANIEL L. MCFADDEN. (1994): “Large Sample Es�ma�on and Hypothesis Tes�ng." In Handbook of Econometrics. Vol. 4, ed. Robert F. Engle and Daniel L. McFadden, Chapter 36, 2111-2245. Amsterdam:Elsevier. O’MUIRCHEARTAIGH, C., AND L.V. HEDGES (2014): “Generalizing from unrepresenta�ve experiments: a stra�fied propensity score approach”, Applied Statistics, 63(2), 195–210. PITT, M.M., M.R. ROSENZWEIG AND M.N. HASSAN (2012): “Human capital investment and the gender division of labor in a brawn-based economy”, American Economic Review 102(7), 3531–3560. OTSU, T., AND Y. RAI (forthcoming): “Bootstrap inference of matching es�mators for average treatment effects”, Journal of the American Statistical Association. DOI: 10.1080/01621459.2016.1231613. RIDGEWAY, G., S.A. KOVALCHIK, B.A. GRIFFIN AND M.U. KABETO (2015): “Propensity score analysis with survey weighted data”, Journal of Causal Inference 3(2), 237–249. ROMANO, J.P., AND M. WOLF (2010): “Balanced control of generalized error rates”, Annals of Statistics 38(1), 598– 633. ROTHE, C. (2012): “Par�al distribu�onal policy effects”, Econometrica 80(5), 2269–2301. SCHNENNACH, S. (2013): “Convolu�on without independence”, CEMMAP working paper CWP46/13. SMITH, J. (2015): “The important role of heterogeneity in social and biological models”, presenta�on at the RCGD/IHPI Seminar. STOYE, J. (2009): “More on confidence intervals for par�ally iden�fied parameters”, Econometrica 77(4), 1299- 1315. WAGER, S., AND S. 
ATHEY (2015): “Es�ma�on and inference of heterogeneous treatment effects using random forests”, arXiv:1510.04342. WASSERMAN, L., AND K. ROEDER (2009): “High-dimensional variable selec�on”, Annals of Statistics 37(5A), 2178– 2201. WU, X., and J. PERLOFF (2006): “Informa�on-theore�c deconvolu�on approxima�on of treatment effect distri- bu�on”, unpublished manuscript. College Sta�on, TX: Texas A&M University. ZHANG, Y. (2016): Three Essays on Extremal Quantiles, Ph.D. disserta�on. Durham, NC: Duke University. Availa- ble at: hdl.handle.net/10161/12160. Appendix 1. Simula�on Details Throughout the toolkit, we use results from simula�on exercises for illustra�on. In this ap- pendix, we describe the data genera�ng processes used for each of these simula�ons. Simulation 1 Simula�on 1 is our workhorse simula�on, because it can be used to generate cross-sec�onal data with baseline outcome measures or panel data with an arbitrary number of periods. In this simula�on, individual ’s outcome in period is: = + + . Moreover, we assume 62 0 0.5 0.5 � � ~ �� � , � ��, 1.5 0.5 1.0 and ~(0, 1). We require exactly half of observa�ons to be randomly selected for treat- ment beginning in period /2. 74 Simulation 2 Simula�on 2 draws data from two data genera�ng processes with iden�cal average treat- ment effects, but very different levels of heterogeneity in treatment effects. Specifically, we assume the following data-genera�ng process: 1 = 1 + , 2 2 = + . For each individual, we draw a single and . We assume ~(0, 1). Furthermore, we require exactly half of observa�ons be selected for treatment by drawing a uniform random variable for each individual and selec�ng the treatment cutoff using the median value across all draws. As for treatment effects, we assume 1 2 2 2 ~ (1, 0.5 ) and ~ (1, 5 ). Simulation 3 In Simula�on 3, poten�al outcomes are: 1 = 1 + 1 + 1 , 0 = 0.51 + 0 , where 1 0 1.0 12 � � ~ �� � , � ��. 
0 0 12 0.5 Note that 1 and 0 are independent when 12 = 0, whereas and 0 are independent when 12 = 0.5. In addi�on, we assume 1 ~(1, 1). Treatment effects are then: = 1 − 0 = 1 + ( − 0.5)1 + 1 − 0 . We do not condi�on on 1 in our simula�ons. Consequently, there is an omited variable which is posi�vely correlated with treatment effects when > 0.5 and nega�vely correlated with treatment effects when < 0.5. There is no omited variable when = 0.5. Appendix 2. Es�ma�ng Condi�onal Probabili�es 74 If is odd, we use ⌈/2⌉. 63 To construct the IPW and IV weights, we need condi�onal probabili�es P(⋅ |). We might know them a priori. If our instrument is treatment assignment and we only want to adjust for stra�fica�on, for example, the randomiza�on rules provide the stratum specific treat- ment probabili�es. Otherwise, we need to es�mate and predict P(⋅ |). There are many strategies to predict P(⋅ |). If the covariates are discrete, Abadie, An- grist and Imbens (2002) recommend sor�ng the data into cells and using the propor�on of treated observa�ons within each cell. This method is nonparametric and efficient. For con- �nuous covariates, Hirano, Imbens and Ridder (2003) suggest a flexible logis�c specifica�on, including polynomials of covariates and interac�on terms. The authors give condi�ons under which the resul�ng es�mator is nonparametric. The appendix of Imbens (2015) presents an algorithm to select the higher-order terms. One can also use a LASSO procedure to select controls: see Athey and Imbens (2015) and Belloni, Chernozhukov and Hansen (2014) for references. Alterna�ve semiparametric or nonparametric strategies are available. Because the variance of binary outcomes is intrinsically bounded, these methods perform well. See Horowiz and Savin (2001) and Frölich (2006) for references. Both the IV and the IPW es�mators are sensi�ve to observa�ons in the tails of P(⋅ |). 
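The flexible logistic strategy above can be sketched in a few lines. This is a minimal illustration on simulated data, not the toolkit's implementation: the function name, the quadratic-with-interactions design, and the Newton-Raphson fitting routine are our own assumptions.

```python
import numpy as np

def flexible_logit_ipw(D, X, n_iter=25):
    """Estimate P(D=1|X) by logistic regression on a flexible design
    (intercept, linear, squared and interaction terms, in the spirit of
    Hirano, Imbens and Ridder, 2003), then form IPW weights."""
    n, k = X.shape
    cols = [np.ones(n)] + [X[:, j] for j in range(k)]
    for j in range(k):
        for l in range(j, k):
            cols.append(X[:, j] * X[:, l])  # squares and interactions
    Z = np.column_stack(cols)
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        W = p * (1.0 - p)  # logit Hessian weights
        # Newton step: (Z'WZ)^{-1} Z'(D - p), with a tiny ridge for stability
        beta += np.linalg.solve(
            Z.T @ (Z * W[:, None]) + 1e-8 * np.eye(Z.shape[1]),
            Z.T @ (D - p))
    p = 1.0 / (1.0 + np.exp(-Z @ beta))
    # IPW weights: 1/p for treated observations, 1/(1-p) for controls
    w = np.where(D == 1, 1.0 / p, 1.0 / (1.0 - p))
    return p, w

# usage on simulated data
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
true_p = 1.0 / (1.0 + np.exp(-(0.5 * X[:, 0] - 0.5 * X[:, 1])))
D = (rng.uniform(size=2000) < true_p).astype(float)
p_hat, w = flexible_logit_ipw(D, X)
```

Because predicted probabilities near zero or one produce extreme weights, a sketch like this is usually combined with a trimming rule for the tails of the estimated propensity score.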
Unfortunately, it is difficult to predict this conditional probability with much accuracy near its boundaries. Crump et al. (2009) suggest trimming the sample and ignoring observations for which P(⋅|X) is close to zero or one when estimating quantiles. If the rate of trimming decreases with sample size, the resulting estimator remains consistent. The authors suggest using only observations with predicted probability between 0.1 and 0.9 as a rule of thumb.

Appendix 3. Recursively Solving for Higher-Order Moments

The Binomial Theorem implies:

    E[(τ + ε0)^k] = Σ_{j=0}^{k} C(k, j) E[τ^(k−j)] E[ε0^j] = E(τ^k) + Σ_{j=1}^{k} C(k, j) E[τ^(k−j)] E[ε0^j].

For any k, E[(τ + ε0)^k] can be estimated from the treatment group residuals and each E[ε0^j] can be estimated from the control group residuals. Therefore, this is a system of K independent equations and K unknowns that we can solve for E(τ^k). This gives us K equations for the first K moments of the treatment effect distribution:

    E[τ^k] = E[(τ + ε0)^k] − Σ_{j=1}^{k} C(k, j) E[τ^(k−j)] E[ε0^j].

These moments can be estimated recursively using the sample counterparts to each term in the above equation. Specifically,

    Ê[(τ + ε0)^k] = (1/N1) Σ_{i: Di=1} (yi − xi'β̂)^k,
    Ê[ε0^k] = (1/N0) Σ_{i: Di=0} (yi − xi'β̂)^k,

where β̂ is the OLS coefficient vector from a regression of y on x, and N1 and N0 are the number of treatment and control group observations, respectively.

Appendix 4. Additional Results from Applications

Figure 12: Effects on the CDF for Financial Education Program

Table 9: Reweighted Treatment Effects for Financial Education Program

                         Control-      Effect estimate      Standard   Simultaneous 95%
Statistic                group value   Value     Percent    error      conf. region
Mean                     56.260         3.859     6.886%    0.427      [ 2.636,  5.083]
Standard deviation       14.768        -0.382    -2.583%    0.187      [-0.753, -0.012]
75/25 percentile ratio    1.525        -0.106    -6.904%    0.019      [-0.157, -0.054]
10th percentile          33.047         3.887    10.706%    0.543      [ 2.311,  5.462]
25th percentile          44.134         5.613    12.795%    0.733      [ 3.564,  7.662]
50th percentile          56.438         4.043     7.200%    0.512      [ 2.563,  5.524]
75th percentile          67.326         3.228     4.808%    0.404      [ 2.023,  4.434]
90th percentile          75.894         3.099     4.088%    0.404      [ 1.720,  4.480]
CDF at 40 points          0.164        -0.070   -40.897%    0.009      [-0.098, -0.043]
CDF at 50 points          0.352        -0.103   -28.144%    0.013      [-0.140, -0.066]
CDF at 60 points          0.581        -0.102   -17.232%    0.012      [-0.137, -0.068]
CDF at 70 points          0.801        -0.066    -8.188%    0.009      [-0.093, -0.040]
CDF at 80 points          0.945        -0.031    -3.315%    0.005      [-0.045, -0.017]

Notes: Effects based on inverse probability weighting (Firpo and Pinto, 2016). Weights based on a logistic regression of treatment on a gender indicator and a quadratic polynomial of baseline scores. Standard errors and confidence region based on the bootstrap (five thousand replications) and clustered at the school level. The confidence region controls the FWER (the probability of at least one false rejection across tests), following Romano and Wolf (2010).
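A simultaneous confidence region like the one in Table 9 must widen the pointwise intervals so that all statistics are covered jointly. A simpler cousin of the Romano and Wolf (2010) procedure, the bootstrap sup-t band, conveys the core idea: resample clusters, recompute every statistic on each draw, and scale the standard errors by the 95th percentile of the maximal absolute t-statistic. The sketch below illustrates this on simulated data; the cluster structure, the statistics chosen, and the sup-t construction stand in for the paper's actual balanced procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# stylized data: outcome y, treatment d assigned at the cluster (school) level
n_clusters, m = 100, 20
cl = np.repeat(np.arange(n_clusters), m)
d = np.repeat((np.arange(n_clusters) % 2 == 0).astype(float), m)
y = 50.0 + 4.0 * d + rng.normal(0.0, 10.0, n_clusters * m)

qs = [0.10, 0.25, 0.50, 0.75, 0.90]

def effects(y, d):
    """Mean and quantile treatment effects (treated minus control)."""
    t, c = y[d == 1], y[d == 0]
    return np.array([t.mean() - c.mean()] +
                    [np.quantile(t, q) - np.quantile(c, q) for q in qs])

est = effects(y, d)

# cluster bootstrap: resample schools with replacement
B = 500
draws = np.empty((B, len(est)))
for b in range(B):
    ids = rng.integers(0, n_clusters, n_clusters)
    idx = np.concatenate([np.flatnonzero(cl == i) for i in ids])
    draws[b] = effects(y[idx], d[idx])

se = draws.std(axis=0)
# sup-t critical value: 95th percentile of the maximal absolute
# t-statistic across statistics, so the band covers all of them jointly
crit = np.quantile(np.abs((draws - est) / se).max(axis=1), 0.95)
lower, upper = est - crit * se, est + crit * se
```

Because the critical value is driven by the worst-covered statistic, the resulting band is wider than the pointwise 1.96-standard-error intervals, which is the price of controlling the FWER.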
Table 10: Conditional Quantile Effects for Financial Education Program

                                         Sample              Percentile
Subgroup                        Value    size      10th      25th      50th      75th      90th
Baseline score above median     No       17960     3.263     4.645     5.060     3.793     3.497
                                                  (0.460)   (0.664)   (0.635)   (0.554)   (0.615)
                                Yes      17960     8.875     4.953     2.992     2.853     3.157
                                                  (1.275)   (0.864)   (0.572)   (0.482)   (0.543)
Student has repeated grade      No       10949     5.197     6.522     4.516     3.608     3.306
                                                  (0.716)   (0.867)   (0.670)   (0.642)   (0.691)
                                Yes      14437     3.791     5.592     5.568     4.372     3.261
                                                  (0.661)   (0.773)   (0.794)   (0.889)   (0.914)
Student is female               No       16941     3.959     6.657     5.943     4.601     3.803
                                                  (0.662)   (0.863)   (0.938)   (0.861)   (0.875)
                                Yes      18720     4.843     5.350     3.626     2.659     3.080
                                                  (0.870)   (0.842)   (0.634)   (0.684)   (0.631)
Student works                   No       10898     4.872     5.906     4.088     3.342     3.232
                                                  (0.636)   (0.874)   (0.713)   (0.682)   (0.629)
                                Yes      15612     3.313     6.265     5.550     4.865     3.983
                                                  (0.740)   (0.868)   (0.922)   (0.912)   (0.871)
Student earns income            No       15713     4.117     5.467     4.025     2.337     2.331
                                                  (0.724)   (0.990)   (0.800)   (0.733)   (0.747)
                                Yes      10812     4.386     6.864     4.773     4.136     3.515
                                                  (0.648)   (0.936)   (0.709)   (0.692)   (0.583)
Family is on welfare            No       10216     5.145     7.522     5.356     4.081     3.548
(Bolsa Família)                                   (0.704)   (0.985)   (0.757)   (0.751)   (0.679)
                                Yes      15334     3.618     5.152     3.559     2.628     2.245
                                                  (0.787)   (0.894)   (0.900)   (0.869)   (0.884)

Notes: The first four specifications include 848 schools. The last two include 851 schools. Bootstrap standard errors in parentheses, based on five thousand replications and clustered at the school level.
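In the simplest case, conditional quantile effects like those in Table 10 can be computed by comparing treated and control quantiles within each subgroup cell. The sketch below does this on simulated data; the subgroup indicator, sample sizes, and effect sizes are invented for illustration, and it omits the covariate adjustment and clustered bootstrap used in the actual table.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000
female = rng.integers(0, 2, n)   # hypothetical subgroup indicator
d = rng.integers(0, 2, n)        # randomized treatment indicator
# treatment shifts outcomes by 3 points in subgroup 0 and 5 in subgroup 1
y = 50.0 + (3.0 + 2.0 * female) * d + rng.normal(0.0, 10.0, n)

quantiles = [0.10, 0.25, 0.50, 0.75, 0.90]

def subgroup_qte(y, d, g):
    """Quantile treatment effects within each subgroup cell:
    treated quantile minus control quantile at each percentile."""
    effects = {}
    for v in (0, 1):
        treated = y[(g == v) & (d == 1)]
        control = y[(g == v) & (d == 0)]
        effects[v] = [np.quantile(treated, q) - np.quantile(control, q)
                      for q in quantiles]
    return effects

qte = subgroup_qte(y, d, female)
```

With random assignment, each within-cell quantile difference estimates the unconditional quantile treatment effect for that subgroup; standard errors would come from a bootstrap clustered at the appropriate level, as in the table's notes.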