Impact Evaluations and Development
NONIE Guidance on Impact Evaluation

» Activities » Outputs » Outcomes » Impacts

What Is NONIE?

NONIE is a Network of Networks for Impact Evaluation comprising the Organisation for Economic Co-operation and Development's Development Assistance Committee (OECD/DAC) Evaluation Network, the United Nations Evaluation Group (UNEG), the Evaluation Cooperation Group (ECG), and the International Organization for Cooperation in Evaluation (IOCE)--a network drawn from the regional evaluation associations.

NONIE was formed to promote quality impact evaluation. It fosters a program of impact evaluation activities based on a common understanding of the meaning of impact evaluation and of approaches to conducting impact evaluation. NONIE focuses on impact evaluation and does not attempt to address wider monitoring and evaluation issues. To this end, NONIE aims to--

· Build an international collaborative research effort for high-quality and useful impact evaluations as a means of improving development effectiveness.
· Provide its members with opportunities for learning, collaboration, guidance, and support, leading to commissioning and carrying out impact evaluations.
· Develop a platform of resources to support impact evaluation by member organizations.

www.worldbank.org/ieg/nonie

Impact Evaluations and Development
NONIE Guidance on Impact Evaluation

Frans Leeuw, Maastricht University
Jos Vaessen, Maastricht University and University of Antwerp

©2009 NONIE--The Network of Networks on Impact Evaluation, Frans Leeuw, and Jos Vaessen
c/o Independent Evaluation Group
1818 H Street, NW
Washington, DC 20433
Internet: www.worldbank.org/ieg/nonie/

All rights reserved

This volume is a product of the volume's authors, Frans Leeuw and Jos Vaessen, who were commissioned by NONIE.
The findings, interpretations, and conclusions expressed in this volume are those of the authors and do not necessarily reflect the views of NONIE, its members, or other participating agencies. NONIE does not guarantee the accuracy of the data included in this work and accepts no responsibility for any consequence of their use.

Rights and Permissions

The material in this publication is copyrighted. Copying and/or transmitting portions or all of this work without permission may be a violation of applicable law. NONIE encourages dissemination of its work and will normally grant permission to reproduce portions of the work promptly. All queries on rights and licenses, including subsidiary rights, should be addressed to NONIE, c/o IEG, 1818 H St., NW, Washington, DC 20433; ieg@worldbank.org.

Cover: Pakistani girl reading. Photo by Curt Carnemark, courtesy of World Bank Photo Library.

ISBN-10: 1-60244-120-0
ISBN-13: 978-1-60244-120-0

Printed on recycled paper

Contents

Acknowledgments
Executive Summary
Introduction

Part I--Methodological and Conceptual Issues in Impact Evaluation

1 Identify the (type and scope of the) intervention
  1.1. The impact evaluation landscape and the scope of impact evaluation
  1.2. Impact of what?
  1.3. Impact on what?
  Key message
2 Agree on what is valued
  2.1. Stakeholder values in impact evaluation
  2.2. Intended versus unintended effects
  2.3. Short-term versus long-term effects
  2.4. The sustainability of effects
  Key message
3 Carefully articulate the theories linking interventions to outcomes
  3.1. Seeing interventions as theories: The black box and the contribution problem
  3.2. Articulating intervention theories on impact
  3.3. Testing intervention theories on impact
  Key message
4 Address the attribution problem
  4.1. The attribution problem
  4.2. Quantitative methods addressing the attribution problem
  4.3. Applicability of quantitative methods for addressing the attribution problem
  4.4. Other approaches
  Key message
5 Use a mixed-methods approach: The logic of the comparative advantages of methods
  5.1. Different methodologies have comparative advantages in addressing particular concerns and needs
  5.2. Advantages of combining different methods and sources of evidence
  5.3. Average effect versus distribution of costs and benefits
  Key message
6 Build on existing knowledge relevant to the impact of interventions
  Key message

Part II--Managing Impact Evaluations

7 Determine if an impact evaluation is feasible and worth the cost
  Key message
8 Start collecting data early
  8.1. Timing of data collection
  8.2. Data availability
  8.3. Quality of the data
  8.4. Dealing with data constraints
  Key message
9 Front-end planning is important
  9.1. Planning tools
  9.2. Staffing and resources
  9.3. The balance between independence and collaboration between evaluators and stakeholders
  9.4. Ethical issues
  9.5. Norms and standards
  9.6. Ownership and capacity building
  Key message

Appendices
  1. Examples of diversity in impact evaluation
  2. The General Elimination Methodology as a basis for causal analysis
  3. Overview of quantitative techniques of impact evaluation
  4. Technical aspects of quantitative impact evaluation techniques
  5. Evaluations using quantitative impact evaluation approaches
  6. Decision tree for selecting quantitative evaluation designs to deal with selection bias
  7. Hierarchical modeling and other statistical approaches
  8. Multi-site evaluation approaches
  9. Methodological frameworks for assessing the effects of interventions, mainly based on quantitative methods
  10. Where to find reviews and synthesis studies on mechanisms underlying processes of change
  11. Evaluations based on qualitative and quantitative descriptive methods
  12. Further information on review and synthesis approaches in impact evaluation
  13. Basic education in Ghana
  14. Hierarchy of quasi-experimental designs
  15. International experts who contributed to the subgroup documents

Endnotes
References

Boxes
  1.1. "Unpacking" the aid chain
  3.1. Social funds and government capacity: Competing theories
  3.2. Social and behavioral mechanisms as heuristics for understanding processes of change and impact
  4.1. Using propensity scores to select a matched comparison group--The Vietnam Rural Roads Project
  4.2. Participatory impact monitoring in the context of the poverty reduction strategy process
  5.1. Brief illustration of the logic of comparative advantages
  6.1. Narrative review and synthesis study: Targeting and impact of community-based development initiatives
  A7.1. Impact of the Indonesian financial crisis on the poor: Partial equilibrium modeling and CGE modeling with microsimulation

Figures
  ES.1. Levels of intervention, programs, and policies and types of impact
  ES.2. Simple graphic of net impact of an intervention
  1.1. Levels of intervention, programs, and policies and types of impact
  3.1. Basic intervention theory of a fictitious small business support project
  4.1. Graphic display of the net impact of an intervention
  4.2. Regression discontinuity analysis
  A4.1. Estimation of the effect of class size with and without the inclusion of a variable correlated with class size
  A11.1. Final impact assessment triangulation
  A11.2. Generic representation of a project's theory of change
  A11.3. Components of impact evaluation framework
  A11.4. Project outputs and outcomes
  A11.5. Framework to establish contribution
  A11.6. Model linking outcome to impact

Tables
  1.1. Aspects of complication in interventions
  1.2. Aspects of complexity in interventions
  4.1. Double difference and other designs
  8.1. Evaluation scenarios with time, data, and budget constraints
  A11.1. Project outcome
  A11.2. Change in key ecological attributes over time
  A11.3. Current threats to the global environment benefits

Acknowledgments

This Guidance document could not have existed without the numerous contributions of Network of Networks on Impact Evaluation (NONIE) members and others in terms of papers, PowerPoint® presentations, and suggestions.

In particular, this Guidance document builds on two existing draft guidance documents: a document on experimental and quasi-experimental approaches to impact evaluation (NONIE subgroup 1, May 17, 2007) and a document on qualitative approaches to impact evaluation (NONIE subgroup 2, January 9, 2008). A third draft document prepared by NONIE members on the impact evaluation of macroeconomic policies and new aid modalities such as budget support is outside the scope of this Guidance document. The subgroup 1 document was prepared mainly by Howard White and Antonie De Kemp. The subgroup 2 document, which was somewhat broader in content than methodology, was coordinated by Sukai Prom-Jackson. The primary authors were Patricia Rogers, Zenda Ofir, Sukai Prom-Jackson, and Christine Obester. Case studies were prepared by Jocelyn Delarue, Fabrizio Felloni, Divya Nair, Christine Obester, Lee Risby, Patricia Rogers, David Todd, and Rob van den Berg. The development of this document benefited extensively from a reference group of international evaluators.

Whereas the two subgroup documents provided the basis for the current Guidance document, the purpose of the current document was to develop a new structure that could accommodate some of the diversity in perspectives on impact evaluation. In addition, within this new structure, new content was added where necessary to support key points. The process of developing this Guidance was supervised by a steering committee of NONIE members. An external peer reviewer critically assessed the first draft of this document.

The Guidance document represents the views of the authors, who were commissioned by NONIE. Given that perspectives on the definition, scope, and appropriate methods of impact evaluation differ widely among practitioners and other stakeholders, the document should not be taken to represent the agreed positions of all of the individual NONIE members. The network membership and the authors recognize that there is scope to develop the arguments further in several key areas.

We would like to thank all of the above people for their contributions to the process of writing the Guidance document. First, we thank the authors of the subgroup documents for providing building blocks for this document. In addition, we would like to thank the steering committee of this project, Andrew Warner, David Todd, Zenda Ofir, and Henri Jorritsma, for their pertinent suggestions. We also would like to thank Antonie De Kemp for exchanging ideas on design questions. We are grateful to Patricia Rogers, the external peer reviewer, for providing valuable input to this document. Our thanks also go to Victoria Gunnarsson and Andrew Warner from the NONIE secretariat for accompanying us throughout the whole process and providing excellent feedback. Nick York, Howard White, David Todd, Indran Nadoo, and John Mayne provided helpful insights in the final phase of this project. We thank Arup Banerji for drafting the executive summary. Comments from NONIE members were received at the Lisbon European Evaluation Society Conference (October 2008) and the Cairo Conference on Impact Evaluation (March 2009). Networks within NONIE, such as the International Organization for Cooperation in Evaluation and the European Evaluation Society, contributed by submitting written comments. Moreover, many individual NONIE members also sent in their feedback through email. We would like to thank all NONIE members for the stimulating discussions and inputs on impact evaluation.

Finally, within the restricted time available for writing this document, we have tried to combine different complementary perspectives on impact evaluation into an overall framework, in line with our own views on these topics and feedback from the steering committee and others. Though we have not included all perspectives on impact evaluation, an important and quite diverse selection of the thinking and practice on the subject has been incorporated. The result, we hope, represents a balance between coherence, a comprehensive structure of key issues, and diversity. Any remaining errors are our own.

Frans Leeuw, frans.leeuw@maastrichtuniversity.nl
Jos Vaessen, jos.vaessen@maastrichtuniversity.nl

Executive Summary

In international development, impact evaluation is principally concerned with the final results of interventions (programs, projects, policy measures, reforms) on the welfare of communities, households, and individuals, including taxpayers and voters. Impact evaluation is one tool within the larger toolkit of monitoring and evaluation (including broad program evaluations, process evaluations, ex ante studies, etc.).
The Network of Networks for Impact Evaluation (NONIE) was established in 2006 to foster more and better impact evaluations by its membership--the evaluation networks of bilateral and multilateral organizations focusing on development issues, as well as networks of developing country evaluators. NONIE's member networks conduct a broad set of evaluations, examining issues such as project and strategy performance, institutional development, and aid effectiveness. But the focus of NONIE is narrower. By sharing methodological approaches and promoting learning by doing on impact evaluations, NONIE aims to promote the use of this more specific approach by its members within their larger portfolio of evaluations. This document, by Frans Leeuw and Jos Vaessen, has been developed to support this focus.1

The Guidance document was written by and represents the views of the authors. Given that perspectives on the definition, scope, and appropriate methods of impact evaluation differ widely among practitioners and other stakeholders, the document should not be taken to represent the agreed positions of all of the individual NONIE members.

Why promote impact evaluations? For development practitioners, impact evaluations play a key role in the drive for better evidence on results and development effectiveness. They are particularly well suited to answer important questions about whether development interventions do or do not work, whether they make a difference, and how cost-effective they are. Consequently, they can help ensure that scarce resources are allocated where they can have the most developmental impact.

Although there is debate within the profession about the precise definition of impact evaluation, NONIE's use of the term proceeds from its adoption of the definition of impact of the Development Assistance Committee of the Organisation for Economic Co-operation and Development (DAC): "the positive and negative, primary and secondary long-term effects produced by a development intervention, directly or indirectly, intended or unintended."2

Adopting the DAC definition of impact leads to a focus on two underlying premises for impact evaluations:

· Attribution: The words "effects produced by" in the DAC definition imply an approach to impact evaluation that is about attributing impacts to interventions, rather than just assessing what happened.
· Counterfactual: It follows that in most contexts, knowledge about the impacts produced by an intervention requires an attempt to gauge what would have occurred in the absence of the intervention and a comparison with what has occurred with the intervention implemented.

These two premises do not, however, lead to a determination of a set of analytical methods that is above all others in all situations. In fact, this Guidance note underlines that--

· No single method is best for addressing the variety of questions and aspects that might be part of impact evaluations.
· However, depending on the specific questions or objectives of a given impact evaluation, some methods have a comparative advantage over others in analyzing a particular question or objective.
· Particular methods or perspectives complement each other in providing a more complete "picture" of impact.

The document is structured around nine key issues that provide guidance on conceptualizing, designing, and implementing an impact evaluation:

Methodological guidance:
1. Identify the type and scope of the intervention.
2. Agree on what is valued.
3. Carefully articulate the theories linking interventions to outcomes.
4. Address the attribution problem.
5. Use a mixed-methods approach--the logic of the comparative advantages of methods.
6. Build on existing knowledge relevant to the impact of interventions.

Guidance on managing impact evaluations:
7. Determine if an impact evaluation is feasible and worth the cost.
8. Start collecting data early.
9. Front-end planning is important.

1. Identify the (type and scope of the) intervention

Interventions range along a continuum from single-"strand" initiatives with explicit objectives to complex institutional policies, and the particular type of impact evaluation would be affected by the type and scope of the intervention. Yet across this continuum, the scope of an impact evaluation can be identified through the lens of two questions: the impact of what and the impact on what?

When asking the "of what" question, it is useful to differentiate among intervention characteristics. Take single-strand initiatives with explicit objectives--for example, the change in crop yield after introduction of a new technology, or the reduction in malaria prevalence after the introduction of bed nets. Such interventions can be isolated, manipulated, and measured, and experimental and quasi-experimental designs may be appropriate for assessing causal relationships between these single-strand initiatives and their effects.

At the other end of the continuum are programs with an extensive range and scope that have activities cutting across sectors, themes, and geographic areas. These can be complicated--multiple agencies, multiple simultaneous causes for the outcomes, and causal mechanisms differing across contexts--and complex (recursive, with feedback loops, and with emergent outcomes) (Rogers, 2008). In such cases, impact evaluations have to proceed systematically: first, by locating and prioritizing key program components through a comprehensive mapping of the potential influences shaping the program, including possible feedback loops and emerging outcomes; second, by evaluating program components by subsets of this prioritized program mapping.

When asking the "on what" question, impact evaluations have to unpack interventions that affect multiple institutions, groups, individuals, and sites. For tractability, this guidance distinguishes between two principal levels of impact: impact at the institutional level and impact at the beneficiary level (figure ES1). Examples of the former are policy dialogues, training programs, and strategic support to institutional actors such as governmental and civil society institutions or private corporations and public-private partnerships.

Figure ES1: Levels of intervention, programs, and policies and types of impact
[Figure: international conferences, treaties, declarations, protocols, and policy networks feed into institutional-level impact (donor capacities/policies; government capacities/policies; other actors such as INGOs, NGOs, banks, and cooperatives; macro-earmarking, e.g., debt relief and GBS; meso-earmarking, e.g., SBS; and micro-earmarking). These may constitute multiple programs (e.g., health reform), projects (e.g., agricultural extension), and policy measures (e.g., tax increases), leading to beneficiary-level impact on communities, households, and individuals (taxpayers, voters, citizens, etc.), with replication and scaling up and wider systemic effects.]

Most policy makers and stakeholders are, however, primarily interested in beneficiary-level interventions that directly affect communities, households, and individuals--whether trade liberalization measures, technical assistance programs, antiretroviral treatments, cash transfer programs, construction of schools, etc. This Guidance document accordingly focuses on this level. But it should be recognized that policy interventions primarily geared at inducing sustainable changes at institutional levels can also have indirect effects at the beneficiary level.

2. Agree on what is valued

When conducting impact evaluations, evaluators also need to ask a third question--not only the impact of what and on what, but impact for whom. The fundamental principles to follow here are to agree on the most important, and most valued, objectives of the intervention, and then as much as possible to translate these objectives into measurable indicators while keeping track of important aspects that are difficult to measure.

The "for whom" question is inherently a question about stakeholder values--which impacts and processes are judged as significant or valuable, and whose values are used to judge the distribution of costs and benefits? The first and most important reference source to answer this question is the objectives of an intervention, as stated in the official documents. However, interventions evolve, and objectives might be implicit or may change. To bring stakeholder values to the surface, evaluators may need to have informal or structured (e.g., "values inquiry") consultations with representatives from different stakeholder groups or use a participatory evaluation approach to include stakeholder values directly in the evaluation.

Three other issues are critical to creating measurable indicators to capture the effects of an intervention. First, the evaluation has to consider the possibility of unintended effects that go beyond those envisaged in the program theory of the intervention--for example, governments reducing spending on a village targeted by an aid intervention. Second, there may be long-term effects of an intervention (such as environmental changes, or changes in social impacts on subsequent generations) or time lags not captured in an impact evaluation that occurs relatively soon after the intervention period. Third, and related, is evidence on the sustainability of effects, which few impact evaluations will be able to directly capture. Impact evaluations therefore need to identify shorter-term impacts and, where possible, indicate whether longer-term impacts are likely to occur.

3. Carefully articulate the theories linking interventions to outcomes

Development policies and interventions are typically aimed at changing the behavior or knowledge of households, individuals, and organizations. Underlying the design of the intervention is a "theory"--explicit or implicit--with social, behavioral, and institutional assumptions indicating why a particular policy intervention will work to address a given development challenge.

For evaluating the nature and direction of an impact, understanding this theory is critical. But often, these theories are partly "hidden" and require reconstruction and articulation. This articulation can use one or more pieces of evidence--ranging from the intervention's existing logical framework, to insights and expectations of policy makers and other stakeholders on the expected way target groups are affected, to theoretical and empirical research on processes of change or past experiences of similar interventions. However, it is important to critically look for, and articulate, plausible explanations for the changes.

After articulating the assumptions on the effect of an intervention on outcomes and impacts, these assumptions will need to be tested. This can be done in two ways--by carefully constructing the causal "story" about the way the intervention has produced results (as by using "causal contribution analysis") or by formally testing the causal assumptions using appropriate methods.

4. Address the attribution problem

The steps above are important to identify the "factual"--the observed outcome that is a result of the intervention. But given that multiple factors can affect the outcomes pertaining to individuals and institutions, the unique point of an impact evaluation is to go beyond the factual--to know the added value of the policy intervention under consideration, separate from these other factors.
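The attribution logic described here, separating an intervention's added value from other factors, can be sketched numerically. The following minimal simulation uses fabricated data (every variable name and number is invented for illustration, not taken from the Guidance document): a naive before-after comparison bundles the intervention's effect with a background trend, while a with-without comparison isolates the net impact.

```python
import random
import statistics

random.seed(1)

# Fabricated illustration: a program aims to raise household income by 10,
# while a region-wide trend ("other factors") raises everyone's income by 8
# over the same period.
n = 1000
trend, program_effect = 8.0, 10.0

income_before = [random.gauss(100, 15) for _ in range(n)]

# The "with" situation: observed outcomes for program households.
with_program = [y + trend + program_effect + random.gauss(0, 2)
                for y in income_before]

# The "without" situation (the counterfactual): the same households had the
# program not existed. In real evaluations this is unobservable and must be
# estimated, e.g., from a comparison group.
without_program = [y + trend + random.gauss(0, 2) for y in income_before]

mean = statistics.mean

# Naive before-after difference: bundles the program effect with the trend.
before_after = mean(with_program) - mean(income_before)

# With-without difference: nets out the other factors, leaving the impact.
net_impact = mean(with_program) - mean(without_program)

print(f"before-after difference: {before_after:.1f}")  # close to 18 (trend + effect)
print(f"with-without difference: {net_impact:.1f}")    # close to 10 (effect only)
```

Because the "without" series cannot be observed directly in practice, the quantitative techniques discussed in this summary (randomization, pipeline comparisons, matching, double difference) are, in essence, different strategies for estimating it.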
Any observed changes will be, in general, only For evaluating the nature and direction of an partly caused by the intervention of interest. Other impact, understanding this theory is critical. interventions inside or outside the core area will often interact and strengthen/reduce the effects of the intervention of interest for the evaluation. Figure ES2: Simple graphic of net impact of an Therefore, addressing this "attribution problem" intervention implies both isolating and accurately measuring the particular contribution of an intervention and ensuring that causality runs from the interven- a tion to the outcome. target variable Value c Analysis of the attribution problem compares b the situation "with" an intervention to what would have happened in the absence of an Before After intervention, the "without" situation (the Time counterfactual, figure ES2). The impact is not xii ExEcutIvE summary measured by either the value of a target variable control group ends up being exposed to the (point a) or even the difference between the intervention (either because of geographic before and after situation (a­b, measured on proximity or because of the presence of simi- the vertical axis). The net impact is the differ- lar parallel interventions affecting the control ence between the target variable's value after group). the intervention and the value the variable would have had in case the intervention had Quasi-experimental techniques can simulate not taken place (a­c). comparable intervention and comparison groups. In doing impact evaluations, there is no "gold standard" (in the sense of a single method that is · A pipeline approach takes advantage of proj- best in all cases). 
However, depending on factors ects that are rolled out gradually and compares such as the scope, objectives, and design of the outcomes for households or communities that intervention, as well as data availability, some have already experienced the intervention methods can be better than others in specific (the treatment group) with households or cases. communities that are selected but that have not yet participated (the control group). But Quantitative techniques can be broadly cat- for pipeline approaches to be valid, it is critical egorized into experimental, quasi-experimental, that both the treatment and control groups and regression-based techniques. These, if well have similar characteristics. Self-selection (due done, have a comparative advantage in address- to earlier participation by those eager to re- ing the issue of attribution. In each case, the ceive the intervention) or geographical biases counterfactual is simulated by examining the (such as moving from rural to urban areas) do situation of a participant group (receiving introduce selection biases. benefits from or affected by an intervention, · In propensity score matching, a control the "treatment" group) with the situation of an group is created ex post by selecting its mem- equivalent comparison or "control" group that bers on the basis of observed and relevant is not affected by the intervention. A key issue characteristics that are similar to those of these techniques aim to tackle is selection bias-- members of the treatment group. The pairs when those in the treatment group are different are formed not by matching every character- in some way from those in the control group. istic exactly, but by selecting groups that have similar probabilities of being included in the Experimental techniques avoid selection effects sample as the treatment group on the basis by randomly selecting treatment and control of observable characteristics. 
But the tech- groups from the same eligible population, before nique does not solve the potential bias that the intervention starts. results from the omission of unobserved dif- ferences between the groups and may require · In a randomized controlled trial (RCT), both a large sample for the selection of the com- groups are expected to have similar average parison group. This is usually accounted for characteristics, with the single exception that through the added use of double difference the treatment group received the interven- or difference-in-difference, which measures tion. Thus, a simple comparison of average differences between the two groups, before outcomes in the two groups solves the attri- and after the intervention, thus netting out bution problem and yields accurate estimates the unobservables (as long as they remain of the impact of the intervention. But, despite constant over time). the clean design, RCTs have to be managed · Judgmental matching is a less precise carefully to ensure that the two groups do not method using descriptive information to have different rates of attrition and that there construct comparison groups--first consult- is a minimum of "contamination," when the ing with clients and other knowledgeable xiii I m pa c t E va l u at I o n s a n d d E v E l o p m E n t ­ n o n I E G u I d a n c E o n I m pa c t E va l u at I o n persons to identify relevant matching comes before the intervention with the regres- characteristics, and then combining geo- sion line after. But this method assesses the graphic information, secondary data (such marginal impact of the program only around as household surveys), interviews, and key the cut-off point for eligibility and not across informants to select comparison areas or in- the whole spectrum of the people affected by dividuals/households with the best match of the intervention. Moreover, care must be taken characteristics. 
But the element of subjectiv- that individuals were not able to manipulate ity may induce biases, and further qualitative the selection process or threshold. work is essential to tease out unobserved differences. Quantitative techniques are not foolproof and can have limitations that go beyond the Regression-based techniques are more flexible technical constraints identified above. Narrow tools for ex post impact evaluation, which can counterfactual estimation is not applicable in flexibly deal with a range of issues--heterogene- full-coverage interventions such as price policies ity of treatment, multiple interventions, hetero- or regulation on land use, which affect everybody geneity of participant characteristics, interactions (although to different degrees)--so regression- between interventions, and interactions between based techniques that focus on the variability in interventions and specific characteristics. With exposure/participation are called for. There are a regression approach, it may be possible to also some pragmatic constraints--such as ethical estimate the contribution of a separate interven- objections to randomization or lack of data tion to the total effect or to estimate the effect of representing the baseline situation of interven- the interaction between two interventions. tion target groups. And simple quantitative approaches may not be appropriate in "complex" · Dealing with unobservables and endogene- contexts--though the methodological difficul- ity: "Difference-in-difference" approaches in a ties of evaluating complicated interventions can regression model, by examining the changes to some extent be "neutralized" by deconstruct- within groups over time, can have unobserved ing them into their "active ingredients." (time invariant) variables drop from the equa- tion. The approach is similar to a fixed-effects Nonquantitative techniques are often less ef- regression model. 
"Instrumental variables" can help with endogeneity, as a good instrument correlates with the original endogenous variable in the equation, but not with the error term. But the difference-in-difference method is more vulnerable than others to the presence of measurement error in the data, and good instruments are not always possible to find, given the available data.

· Regression discontinuity takes advantage of programs that have a cut-off point regarding who receives the treatment (for example, geographic boundaries or income thresholds). It compares the treatment group just within the cut-off point with a control group of those just beyond. At that point, it is unlikely that there are unobserved differences between the two groups. Estimating the impact can now be done by comparing the mean difference between the regression line of treatment outcomes before the intervention with the regression line after. But this method assesses the marginal impact of the program only around the cut-off point for eligibility and not across the whole spectrum of the people affected by the intervention. Moreover, care must be taken that individuals were not able to manipulate the selection process or threshold.

Quantitative techniques are not foolproof and can have limitations that go beyond the technical constraints identified above. Narrow counterfactual estimation is not applicable in full-coverage interventions such as price policies or regulation on land use, which affect everybody (although to different degrees)--so regression-based techniques that focus on the variability in exposure/participation are called for. There are also some pragmatic constraints--such as ethical objections to randomization or lack of data representing the baseline situation of intervention target groups. And simple quantitative approaches may not be appropriate in "complex" contexts--though the methodological difficulties of evaluating complicated interventions can to some extent be "neutralized" by deconstructing them into their "active ingredients."
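The regression discontinuity comparison described above can also be sketched numerically. This is an illustrative simplification, not the Guidance's own procedure: it compares raw means in a narrow band around the threshold rather than fitting regression lines, and the cut-off, bandwidth, and data are invented:

```python
# Regression discontinuity (simplified): units just below an eligibility
# cut-off (treated) are compared with units just above it (untreated), where
# unobserved differences between the two groups are likely to be negligible.
# As the text notes, the estimate is local to the cut-off point.

def rd_estimate(units, cutoff, bandwidth):
    """Mean outcome just inside the cut-off minus mean outcome just outside.

    `units` is a list of (eligibility_score, outcome) pairs; treatment
    goes to units with eligibility_score < cutoff.
    """
    inside = [y for s, y in units if cutoff - bandwidth <= s < cutoff]
    outside = [y for s, y in units if cutoff <= s < cutoff + bandwidth]
    return sum(inside) / len(inside) - sum(outside) / len(outside)

# Hypothetical data: (income score, outcome), with the threshold at 50.
# Units far from the cut-off (scores 30 and 70) are excluded by the bandwidth.
data = [(48, 75), (49, 77), (50, 70), (51, 68), (30, 90), (70, 50)]
print(rd_estimate(data, cutoff=50, bandwidth=2))  # 7.0
```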
Participa- between the regression line of treatment out- tory approaches can be valuable in identifying a xiv ExEcutIvE summary more comprehensive and/or more appropriate For example, RCTs are arguably better than most set of valued impacts, greater ownership and a other methods in terms of internal validity, better level of understanding among stakehold- because if well designed, the counterfactual can ers, and a better understanding of processes be cleanly identified--the randomized project of change and the ways in which interventions benefits (within a relatively homogenous popula- affect people. But the higher the degree of partic- tion) would ensure that there are no systematic ipation, the more costly and difficult it is to set differences between those that receive benefits up an impact evaluation--and thus these may and those that do not. But RCTs control for differ- be inappropriate for large-scale comprehensive ences between groups within the particular interventions such as sector programs. Also, setting that is covered by the evaluation; other there are serious limitations to the validity of settings have different characteristics that are not information based only on stakeholder percep- controlled, so the external validity of such RCTs tions. Finally, strategic responses, manipulation, may be limited--unless there has been a system- or advocacy by stakeholders can also influence atic and large set of RCTs undertaken that test the validity of the data collection and analysis. the intervention across the range of settings and policy options found in reality. Overall, for impact evaluations, well-designed quantitative methods are usually preferable for Again, in-depth qualitative methods that attempt addressing attribution and should be pursued to capture complexity and diversity of institu- when possible. 
5. Use a mixed-methods approach

Each methodology mentioned above has comparative advantages in addressing particular concerns and needs in impact evaluation. A useful lens for examining these comparative advantages is the four different types of validity:

· Internal validity: Establishing the causal relationship between intervention outputs and processes of change leading to outcomes and impacts
· Construct validity: Ensuring that the variables measured adequately represent the underlying realities of development interventions linked to processes of change
· External validity: Establishing the generalizability of findings to other settings
· Statistical conclusion validity: For quantitative techniques, ensuring the degree of confidence about the existence of a relationship between intervention and impact variable and the magnitude of change.

For example, RCTs are arguably better than most other methods in terms of internal validity, because if well designed, the counterfactual can be cleanly identified--the randomized allocation of project benefits (within a relatively homogenous population) would ensure that there are no systematic differences between those that receive benefits and those that do not. But RCTs control for differences between groups only within the particular setting that is covered by the evaluation; other settings have different characteristics that are not controlled, so the external validity of such RCTs may be limited--unless a systematic and large set of RCTs has been undertaken that tests the intervention across the range of settings and policy options found in reality.

Again, in-depth qualitative methods that attempt to capture the complexity and diversity of institutional and social change can have a comparative advantage in construct validity when assessing the contribution of complex and multidimensional interventions or impacts. Take the example of impacts on poverty or governance--these may be difficult to capture fully in terms of the distinct, quantifiable indicators usually employed by RCTs and some quasi-experimental methods and may be better addressed through qualitative techniques. Yet these methods may in turn be lacking in terms of external validity. In such cases, the methods with a comparative advantage are large-sample quantitative approaches that cover substantial diversity in context and people.

A mix of methods--"triangulating" information from different approaches--can be used to assess different facets of complex outcomes or impacts, yielding greater validity than one method alone. For example, if looking at the impact of incentives on farmers' labor utilization and livelihoods, a randomized experiment can test the effectiveness of different individual incentives on labor and income effects (testing internal validity); survey data and case studies can deepen the analysis by looking at the distribution of these effects among different types of farm households (triangulating with the RCT evidence on internal validity and increasing external validity); and semistructured interviews and focus group conversations can broaden the information about the nature of effects in terms of production, consumption, poverty, and so on (establishing construct validity).

Finally, it is important to note that an analysis of the distribution of costs and benefits as a result of an intervention--distinguishing between coverage, effects on those that are directly affected, and indirect effects--cannot be addressed with one particular method.
If one is interested in all these questions, then inevitably one needs a framework of multiple methods and sources of evidence.

6. Build on existing knowledge relevant to the impact of interventions

Review and synthesis methods can play a pivotal role in marshalling existing evidence to deepen the power and validity of an impact evaluation, to contribute to future knowledge building, and to meet the information needs of stakeholders. Specifically, these methods can serve two major purposes:

· They strengthen external validity by evaluating comparable interventions across different countries and regions--thus assessing the relative effectiveness of alternative interventions in different contexts.
· Because many interventions rely on similar mechanisms of change, they help refine the hypotheses or expected results chain, allowing potentially greater selectivity for the impact evaluation.

There are several methods that fall into this category:

· Systematic reviews are syntheses of primary studies that, from an initial explicit statement of objectives, follow a transparent, systematic, and replicable methodology of literature search, inclusion and exclusion of studies according to clear criteria, and extraction and synthesis of information from the resulting body of knowledge.
· Meta-analyses, a common type of systematic review, quantitatively synthesize "scores" for the impact of a similar set of interventions from a number of individual studies across different environments. They follow a strict procedure to search for and select appropriate evidence, typically using a hierarchy of methods, with more quantitatively rigorous (experimental) studies being ranked higher as sources of evidence.
· Narrative reviews are descriptive accounts of intervention processes and/or results covering a series of interventions, relying on a common analytical framework and template to extract data from the individual studies and summarizing the main findings in a narrative account and/or tables and matrices representing key aspects of the interventions.
· Realist syntheses are theory based and do not use a hierarchy of methods. They collect earlier research findings by placing the policy instrument or intervention that is evaluated in the context of other similar instruments and describe the intervention in terms of its context, social and behavioral mechanisms (what makes the intervention work), and outcomes (the deliverables).
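The quantitative synthesis of "scores" that a meta-analysis performs can be sketched as an inverse-variance weighted average of study-level effect estimates. This is a generic fixed-effect illustration, not a procedure prescribed by the Guidance, and the study effects and standard errors below are invented:

```python
# Fixed-effect meta-analysis: pool effect estimates from several studies,
# weighting each study by the inverse of its variance so that more precise
# studies count for more in the pooled "score."

def pooled_effect(studies):
    """`studies` is a list of (effect_estimate, standard_error) pairs."""
    weights = [1.0 / se**2 for _, se in studies]
    weighted = [w * eff for w, (eff, _) in zip(weights, studies)]
    return sum(weighted) / sum(weights)

# Hypothetical impact estimates from three evaluations of similar interventions.
studies = [(0.30, 0.10), (0.10, 0.20), (0.25, 0.10)]
print(round(pooled_effect(studies), 3))  # 0.256
```

A fixed-effect pooling like this assumes the studies estimate a common underlying effect; where contexts differ substantially, as the text emphasizes, a random-effects model or a non-quantitative synthesis may be more appropriate.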
7. Determine if an impact evaluation is feasible and worth the cost

Impact evaluations can be costly exercises in terms of their need for human, financial, and often political resources. They complement rather than replace other types of monitoring and evaluation activities and should therefore be seen as one of several in a cycle of potentially useful evaluations in the lifetime of an intervention. Thus, at each juncture of deciding whether to set up an impact evaluation, it is useful to examine its objectives, benefits, and feasibility and to weigh these against the cost.

Impact evaluations are feasible when they have a clearly defined purpose and design, adequate resources, support from influential stakeholders, and data availability, and when they are appropriate given the nature and context of the intervention. They provide the greatest value when there is an articulated need to obtain the information from them--either to know whether a specific intervention worked, to learn from the intervention, to increase transparency of the intervention, or to know its "value for money." If they are feasible, their value can then be weighed against the expected costs--including the costs of establishing a credible counterfactual, or what would have happened without the intervention.
8. Start collecting data early

As good baseline data are essential to understanding and estimating impact, starting early is critical to the success of the eventual evaluation. When working with secondary data, a lack of information on the quality of data collection can restrict data analysis options and the validity of findings. Those managing an impact evaluation have to take notice of and deal effectively with the constraints--of time, data, and resources--under which an impact evaluation has to be carried out.

Depending on the type of intervention, the collection of baseline data and the setup of other aspects of the impact evaluation require an efficient relationship between the impact evaluators and the implementers of the intervention--thus policy makers and commissioners need to involve experts in impact evaluation as early as possible in the intervention to design high-quality impact evaluations.

9. Front-end planning is important

For every impact evaluation, front-end planning is important to help manage the study, its reception, and its use.

When managing the evaluation, it is critical to manage costs and staffing and to make essential and transparent decisions on ethical issues and on the level of independence of the evaluating team vis-à-vis the stakeholders with whom they are collaborating.

To ensure that the evaluation is used, it is also important, at the beginning, to pay attention to country and regional ownership of the impact evaluation and to build capacity to understand and use it. Providing a space for consultation and agreement on impact evaluation priorities among the different stakeholders of an intervention will help enhance utilization and ownership.

Introduction

Over the last 15-20 years, governments and other (public sector) organizations have been paying much more attention to evaluation. It has become a growth industry in which systems of evaluation exist, with their own methodologies, organizational infrastructures, textbooks, and professional societies (Leeuw and Furubo, 2008).

In the development world, the growth of monitoring and evaluation (M&E) in particular has been acknowledged as crucial. Kusek and Rist (2004) have articulated its underlying philosophy. M&E stimulates capacity development within countries and organizations to do their "own" evaluations and to produce their "own" performance data. M&E is not focused on one type of evaluation but concerns all of them, including, for example, ex ante studies, rapid appraisals, process evaluations, cost-benefit analyses, and impact evaluations.

Part of the philosophy of evaluation, and therefore also of M&E, is to put questions first. Different questions raise a need for different approaches. If the question an evaluator is confronted with is directed toward understanding what a program or policy is about, what the underlying theory of change or logic is, and what the risk factors are when implementing the program, an evaluability assessment or an ex ante evaluation will be an appropriate route to follow. If the question is focused on the implementation of the program or policy, or on the role agencies play, then an implementation analysis or a review of the performance of agencies can be appropriate. This can include an audit or inspection. However, if the question is about whether and to what extent the policy intervention made a significant difference (compared with the status quo, compared with other factors and interventions, and with or without side effects), then an impact evaluation is the appropriate answer. This Guidance document looks at the latter type of question and its corresponding evaluative inquiry, impact evaluation. It discusses what impact evaluation is about, when it is appropriate, and how to do it.

The Network of Networks for Impact Evaluation (NONIE) was established in 2006 to foster more and better impact evaluations by its membership. NONIE uses the definition of the Organisation for Economic Co-operation and Development's Development Assistance Committee (DAC), defining impacts as "[p]ositive and negative, primary and secondary long-term effects produced by a development intervention, directly or indirectly, intended or unintended" (OECD-DAC, 2002: 24).

The impact evaluations that NONIE pursues are expected to reinforce and complement the broader evaluation work by NONIE members. The DAC definition refers to the "effects produced by," stressing the attribution aspect. This implies an approach to impact evaluation that is about attributing impacts rather than assessing what happened. In most contexts, adequate empirical knowledge about the effects produced by an intervention requires at least an accurate estimate of what would have occurred in the absence of the intervention and a comparison with what has occurred with the intervention implemented.

Following this line of argument, this document subscribes to a somewhat more comprehensive view of impact than the DAC definition does. Much of the work on impact evaluation that stresses the attribution problem is in fact about attributing short- and medium-term outcomes to interventions. In practice, this type of attribution analysis is also referred to as impact evaluation, although (in a strict sense) it does not fall within the scope of the DAC definition. This document includes a discussion of the latter type of analysis as well as of the more long-term effects emphasized in the DAC definition (for further discussion of these issues, see White, 2009).

The purpose of NONIE is to promote more and better impact evaluations among its members. Issues relating to evaluations in general are more effectively dealt with within the parent networks and are thus not the primary focus of NONIE. NONIE will focus on sharing methods and learning by doing to promote the practice of impact evaluation. This Guidance document was developed to support those purposes.

The Guidance document was written by, and represents the views of, the authors, Frans Leeuw and Jos Vaessen, who were commissioned by NONIE. In writing the document, the authors drew on previous work by NONIE members and took account of their comments in finalizing the document. Given that perspectives on the definition, scope, and appropriate methods of impact evaluation differ widely among practitioners and other stakeholders, the document should not be taken to represent the agreed positions of all of the individual NONIE members.
The current Guidance document, highlighting key conceptual and methodological issues in impact evaluation, provides ample coverage of such topics as delimitation, intervention theory, attribution, and combining methods in impact evaluation. It also presents an introduction to such topics as participatory approaches to impact evaluation and assessing impact for complex interventions. These and other topics, such as the evaluation of new aid modalities and country perspectives on impact evaluation, should be developed further in the future.

Impact evaluation in development assistance has received considerable attention over the last few years. The major reason is that many outside of development agencies believe that achievement of results has been poor, or at best not convincingly established. Many development interventions appear to leave no trace of sustained positive change after they have been terminated, and it is hard to determine the extent to which interventions are making a difference. However, the development world is not "alone" in attaching increasing importance to impact evaluations. In fields such as crime and justice, education, and social welfare, impact evaluations have over the last decade become more and more important.1 Evidence-based (sometimes "evidence-informed") policies are high on the (political) agenda, and some even refer to the "Evidence Movement" (Rieper et al., 2009). This includes the development of knowledge repositories, where the results of impact evaluations are summarized. In some fields, such as criminology, and in some professional associations, such as the Campbell Collaboration, methodological standards and scales are used to grade impact evaluations,2 although not without discussion (Leeuw, 2005; Worral, 2002, 2007).

Important reasons for doing impact evaluations are the following:

· Impact evaluations provide evidence on "what works and what doesn't" (under what circumstances) and how large the impact is. As the Independent Evaluation Group (IEG) of the World Bank (IEG, 2005) puts it: measuring the outcomes and impacts of an activity and distinguishing these from the influence of other, external factors is one of the rationales behind impact evaluation.
· Measuring impacts and relating the changes in dependent variables to development policies and programs is not something that can be done "from an armchair." Impact evaluation is the instrument for these tasks.
· Impact evaluation can gather evidence on the sustainability of the effects of interventions.
· Impact evaluations produce information that is relevant from an accountability perspective; they disclose knowledge about the (societal) effects of programs that can be linked to the (financial) resources used to reach these effects.
· Individual and organizational learning can be stimulated by doing impact evaluations. This is true for organizations in developing countries but also for donor organizations. Informing decision makers on whether to expand, modify, or eliminate projects, programs, and policies is linked to this point, as is IEG's (2005) argument that impact evaluations enable sponsors, partners, and recipients to compare the effectiveness of alternative interventions.

The authors of this Guidance document believe that the ultimate reason for promoting impact evaluations is to learn about "what works and what doesn't and why" and thus to contribute to the effectiveness of (future) development interventions. In addition to this fundamental motive, impact evaluations have a key role to play in the international drive for better evidence on results and development effectiveness. They are particularly well suited to answering important questions about whether development interventions made a difference (and how cost-effective they were). Well-designed impact evaluations also shed light on why an intervention did or did not work, which can vary across time and space.

Decision makers need better evidence on impact and its causes to ensure that resources are allocated where they can have the most impact and to maintain future public funding for international development. The pressures for this are already strong and will increase as resources are scaled up for international development. Without such evidence there is a risk of the case for aid and future funding sources being undermined.

Using the words "effects" and "effectiveness" implies that the changes in the "dependent variable[s]" measured within the context of an impact evaluation are caused by the intervention under study. The concept of "goal achievement" is used when causality is not necessarily present. Goals can also be achieved independently of the intervention: changes in financial or economic situations, in the world of health and agriculture, or in other social conditions can help realize goal achievement, even in a situation where the "believed-to-be-effective" intervention under review is not working.

The question of whether impact evaluation should always attempt to measure all possible impacts is not easy to answer. Impact evaluation involves finding the appropriate balance between the desire to understand and measure the full range of effects in the most rigorous manner possible and the practical need to delimit and prioritize on the basis of the interests of stakeholders as well as resource constraints.

Key issues addressed in this document

The guidance is structured around nine key issues in impact evaluation:

1. Identify the (type and scope of the) intervention.
2. Agree on what is valued.
3. Carefully articulate the theories linking interventions to outcomes.
4. Address the attribution problem.
5. Use a mixed-methods approach: the logic of the comparative advantages of methods.
6. Build on existing knowledge relevant to the impact of interventions.
7. Determine if an impact evaluation is feasible and worth the cost.
8. Start collecting the data early.
9. Front-end planning is important.

The discussion of these nine issues constitutes the structure of this Guidance document. The first part, comprising the first six issues, deals with methodological and conceptual issues in impact evaluation and constitutes the core of the document. A shorter second part focuses on managing impact evaluation and addresses aspects of evaluability, the benefits and costs of impact evaluation, and planning.

There is no universally accepted definition of "rigorous" impact evaluation, and some equate rigorous impact evaluation with particular methods and designs. Given the diversity in thinking and practice on the topic and the variety of interventions and contexts in which impact evaluation is being applied, the writing of this document has been guided by three basic premises:

· No single method is best for addressing the variety of questions and aspects that might be part of impact evaluations.
· However, depending on the specific questions or objectives of a given impact evaluation, some methods have a comparative advantage over others in analyzing a particular question or objective.
· Particular methods or perspectives complement each other in providing a more complete "picture" of impact.

Moreover, in our view, rigorous impact evaluation is more than methodological design. It requires addressing the issues described above in an appropriate manner, especially the core methodological and conceptual issues described in Part I.

Part I - Methodological and Conceptual Issues in Impact Evaluation

Chapter 1: Identify the (type and scope of the) intervention

In international development, impact evaluation is principally concerned with the final results of interventions (programs, projects, policy measures, reforms) on the welfare of communities, households, and individuals.

1.1. The impact evaluation landscape and the scope of impact evaluation

Impact is often associated with progress at the level of the Millennium Development Goals, which primarily comprise indicators of the welfare of households and individuals. The renewed attention on results- and evidence-based thinking, and the ensuing interest in impact evaluation, provides new momentum for applying rigorous methods and techniques in assessing the impact of interventions.
There is today more than ever a "continuum" of interventions. At one end of the continuum are relatively simple projects characterized by single-"strand" initiatives with explicit objectives, carried out within a relatively short timeframe, where interventions can be isolated, manipulated, and measured. An impact evaluation in the agricultural sector, for example, will seek to attribute changes in crop yield to an intervention such as a new technology or agricultural practice. In a similar guise, in the health sector, a reduction in malaria will be analyzed in relation to the introduction of bed nets. For these types of interventions, experimental and quasi-experimental designs may be appropriate for assessing causal relationships, along with attention to the other tasks of impact evaluation. At the other end of the continuum are comprehensive programs with an extensive range and scope (increasingly at the country, regional, or global level), with a variety of activities that cut across sectors, themes, and geographic areas, and with emergent specific activities. Many of these interventions address aspects that are assumed to be critical for effective development yet are difficult to define and measure, such as human security, good governance, political will and capacity, sustainability, and effective institutional systems.

Some evidence of this continuum is provided in appendix 1, in which two examples of impact evaluations are presented, implemented at different (institutional) levels and based on divergent methodologies with different timeframes (see also figure 1.1).

The endorsement in 2000 of the Millennium Development Goals by all heads of state, together with other defining events and occurrences, has propelled new action that challenges development evaluation to enter new arenas. There is a shift away from fragmented, top-down, and asymmetrical approaches. Increasingly, ideals such as "harmonization," "partnership," "participation," "ownership," and "empowerment" are being emphasized by stakeholders.

However, this trend in policy is not yet reflected in evaluative practices, including impact evaluation. Institutional policies such as anticorruption policies--but also regional and global policy networks and public-private partnerships, with their different forms and structures1--appear to be less often part of the goal of impact evaluations than (top-down) small programs for specific groups of beneficiaries. Ravallion (2008: 6) is of the opinion that there is "a 'myopia bias' in our knowledge, favoring development projects that yield quick results."2 In the promotion of more rigorous impact evaluation, development agencies, national governments, civil society organizations, and other stakeholders in development should be aware of this bias in focus, keeping in mind the full range of policy interventions that (eventually) affect the welfare of developing societies.

Besides a continued interest in the impact of individual projects, donors, governments, and nongovernmental institutions are increasingly interested in the impact of comprehensive programs and sector or country strategies, often comprising multiple instruments, stakeholders, sites of intervention, and target groups. There is a growing demand for assessing the impact of new instruments and modalities, such as--

· International treaties governing the actions of multiple stakeholders (e.g., the Paris Declaration, the Kyoto Protocol)
· New aid modalities such as sector budget support or general budget support
· Instruments such as institutional capacity building, institutional reform, partnership development, and stakeholder dialogues at national or regional levels.

In most countries donor organizations are (still) the main promoters of impact evaluation. The shift of the unit of analysis to the macro and (government) institutional level requires that impact evaluators pay more attention to complicated and more complex interventions at the national, sector, or program level. Multi-site, multi-governance, and multiple (simultaneous) causal strands are important elements of this (see Rogers, 2008).

At the same time, the need for more rigorous impact evaluation at the "project level" remains urgent. The majority of aid money is (still) micro-earmarked money for particular projects managed by donors in collaboration with national institutions. Furthermore, the ongoing efforts in capacity building for national M&E systems (see Kusek and Rist, 2004) and the promotion of country-led evaluation efforts stress the need for further guidance on impact evaluation at the "single" intervention level.

Evaluating the impact of policies--with their own settings and levels--requires appropriate methodological responses. These can be usefully discussed under the banner of two key issues: the impact of what and the impact on what. These two issues point to a key challenge in impact evaluation: the scope of the impact evaluation.

1.2. Impact of what?

What is the independent variable (intervention) we are looking at? In recent years, we have seen a broadening in the range of policy interventions that should or could be subject to impact evaluation.

One of the trends in development is that donors are moving up the aid chain. In the past, donors were very much involved in "micro-managing" their own projects and (sometimes) bypassing government systems. In contrast, nowadays a sizeable chunk of aid is allocated to national support for recipient governments. Conditionality to some extent has shifted from micro-earmarking (e.g., donor money destined for an irrigation project in district x) to meso-earmarking (e.g., support for the agricultural sector) or macro-earmarking (e.g., support for the government budget being allocated according to country priorities).

Earlier we referred to a continuum of interventions. At one end of the continuum are relatively simple projects characterized by single-"strand" initiatives with explicit objectives, carried out within a relatively short timeframe, where interventions can be relatively easily isolated, manipulated, and measured. Examples of these kinds of interventions include building new roads, repairing roads, reducing the price of fertilizer for farmers, providing clean drinking water at lower cost, etc. It is important to be precise about what the interventions are and what they focus on. In the case of new roads or the rehabilitation of existing ones, the goal often is a reduction in journey time and therefore a reduction of societal transaction costs.

At the other end of the continuum are comprehensive programs with an extensive range and scope (increasingly at the country, regional, or global level), with a variety of activities that cut across sectors, themes, and geographic areas and with emergent specific activities. Rogers (2008) has outlined several aspects of what constitutes complicated interventions (multiple agencies, alternative and multiple causal strands) and complex interventions3 (recursive causality and emergent outcomes; see tables 1.1 and 1.2).

Rogers (2008: 40) recently argued that "the greatest challenge [for the evaluator] comes when interventions have both complicated aspects (multi-level and multi-site) and complex aspects (emergent outcomes)." These aspects often converge in interventions in the context of public-private partnerships or new aid modalities, which have become more important in the development world. Demands for accountability and learning about results at the country, agency, sector, or program and strategy levels are also increasing, which has made the need for appropriate methodological frameworks to assess their impact more pressing.

Table 1.1: Aspects of complication in interventions

Aspect of complication      | Simple intervention   | Complicated intervention
Governance and location     | Single organization   | Multiple agencies, often interdisciplinary and cross-jurisdictional
Simultaneous causal strands | Single causal strand  | Multiple simultaneous causal strands
Alternative causal strands  | Universal mechanism   | Different causal mechanisms operating in different contexts

Source: Rogers (2008).

Table 1.2: Aspects of complexity in interventions

Aspect of complexity                            | Simple intervention                         | Complex intervention
Recursive causality and disproportionate effect | Linear, constant dose-response relationship | Recursive, with feedback loops, including reinforcing loops; disproportionate effects at critical limits
Emergent outcomes                               | Pre-identified outcomes                     | Emergent outcomes

Source: Rogers (2008).

Pawson (2005) has distinguished five principles for complicated programs that can be helpful when designing impact evaluations of aid:

1. Locate key program components. Evaluation should begin with a comprehensive scoping study, mapping out the potential conjectures and influences that appear to shape the program under investigation. One can envisage stage-one mapping as the hypothesis generator. It should alert the evaluator to the array of decisions that constitute a program, as well as provide some initial deliberation on their intended and wayward outcomes.

2. Prioritize among program components. The general rule here is to concentrate on (i) those components of the program (intervention) theory that seem likely to have the most significant bearing on overall outcomes, and (ii) those segments of program theory about which the least is known.

3. Evaluate program components by subsets. This principle is about when and where to locate evaluation efforts in relation to a program. The evaluation should take on subsets of program theory and should occur in ongoing portfolios rather than one-off projects. Suites of evaluations and reviews should track program theories as and wherever they unfold.

4. Identify bottlenecks in the program network. "Theories of Change" analysis perceives programs as implementation chains and asks, "What are the flows and blockages as we put a program into action?" The basic strategy is to investigate how the implementation details sustain or hinder program outputs. The main analytic effort is directed at configurations made up of selected segments of the implementation chains across a limited range of program locations.

5. Provide feedback on the conceptual framework. What the theory-based approach initiates is a process of thinking through the pathways along which a successful program has to travel. What would be described are the main series of decision points through which an initiative has proceeded, and the findings would be used to alert stakeholders to the caveats and considerations that should inform those decisions.

To a large extent, interventions can be identified and categorized on the basis of the main theme addressed. Examples of thematic areas of interventions are roads and railroads, protected area management, alternative livelihoods, and research on innovative practices.

A second way to identify interventions is to find out which generic policy instruments, and which combinations of them, constitute the intervention: economic incentives (e.g., tax reductions, subsidies), regulations (e.g., laws or restrictions), or information (e.g., education or technical assistance). As argued by authors such as Pawson (2006), Salamon (1981), and Vedung (1998), using this relatively simple classification helps identify the interventions: "Rather than focusing on individual programs, as is now done, or even collections of programs grouped according to major 'purpose,' as is frequently proposed, the suggestion here is that we should concentrate on the generic tools of government that come to be used, in varying combinations in particular public programs" (Salamon, 1981).
1981: 256). Acknowledging the central role of The most durable and practical recommen- policy instruments enables evaluators to take dations that evaluators can offer come from into account lessons from the application of research that begins with a theory and ends particular (combinations of) policy interven- with a refined theory. tions elsewhere (see Bemelmans-Videc and Rist, 1998). If interventions are complicated, in that they have multiple active components, it is helpful to Third, the separate analysis of intervention state these separately and treat the intervention components implies interventions being as a package of components. Depending on the unpacked in such a way that the most important context, the impact of intervention components social and behavioral mechanisms believed to can be analyzed separately and/or as part of a make the "package" work are spelled out (see package.4 chapter 3). 6 IdEntIfy thE (typE and scopE of thE) IntErvEntIon Box 1.1: "Unpacking" the aid chain The importance of distinguishing among different levels of im- ances mechanisms, etc.) and is likely to be affected by donor pact is also discussed by Bourguignon and Sundberg (2007), who policies and aid. (institutional level impact) "unpack" the aid effectiveness box by differentiating among · External donors and international financial institutions to policy three essential links between aid and final policy outcomes: makers: How do external institutions influence the policy-mak- ing process through financial resources, dialogue, technical · Policies to outcomes: How do policies, programs and projects assistance, conditionalities, etc.? (institutional-level impact) affect investment, production, growth, social welfare, and poverty levels? (beneficiary level impact) The above links can be perceived as channels through which · Policy makers to policies: How does the policy-making process aid eventually affects beneficiary-level impact. 
At the same time, at national and local levels lead to "good policies"? This is the processes triggered by aid generate lasting impacts at insti- about governance (institutional capacities, checks and bal- tutional levels. Source: Bourguignon and Sundberg (2007). Although complicated interventions are intermediate changes and being contingent on becoming more important and therefore should more external variables (e.g., from stakeholder be subject to impact evaluation, this evolution dialogue, to changes in policy priorities, to should not imply a reduction of interest in changes in policy implementation, to changes in evaluating the impact of relatively simple, single- human welfare). strand interventions. The sheer number of these interventions makes doing robust impact evalua- Given this diversity, we think it is useful for tions of great importance. purposes of "scoping" to distinguish between two principal levels of impact: at the institu- 1.3. Impact on what? tional level and at the beneficiary level.6 It This topic concerns the "dependent variable broadens impact evaluation beyond either problem." Interventions often affect multiple simply measuring whether objectives have been institutions, groups, and individuals. What level achieved or assessing direct effects on intended of impact should we be interested in? beneficiaries. It includes the full range of impacts at all levels of the results chain, including ripple The causality chain linking policy interventions effects on families, households, and communi- to ultimate policy goals (e.g., poverty alleviation) ties; on institutional, technical, or social systems; can be relatively direct and straightforward (e.g., and on the environment. In terms of a simple the impact of vaccination programs on mortal- logic model, there can be multiple intermediate ity levels) but also complex and diffuse. 
Impact (short- and medium-term) outcomes over time evaluations of, for example, sector strategies or that eventually lead to impact--some or all of general budget support potentially encompass which may be included in an evaluation of impact multiple causal pathways, resulting in long-term at a specific moment in time. direct and indirect impacts. Some of the causal pathways linking interventions to impacts might Interventions that can be labeled as institu- be "fairly" straightforward5 (e.g., from training tional primarily aim at changing second-order programs in alternative income generating conditions (i.e., the capacities, willingness, and activities to employment and to income levels), organizational structures enabling institutions to whereas other pathways are more complex design, manage, and implement better policies and diffuse in terms of going through more for communities, households, and individuals). 7 I m pa c t E va l u at I o n s a n d d E v E l o p m E n t ­ n o n I E G u I d a n c E o n I m pa c t E va l u at I o n Examples are policy dialogues, policy networks, the discussion on choice of scope and method in training programs, institutional reforms, and impact evaluation. strategic support to institutional actors (i.e., governmental and civil society institutions, Having illustrated this differentiation, it is private corporations, and hybrids) and public- important to note that for many in the develop- private partnerships. ment community, impact assessment is essentially about impact at the beneficiary level. The main Other types of interventions directly aim at/ concern is how (sets of) policy interventions affect communities, households, and individu- directly or indirectly affect the welfare of benefi- als, including voters and taxpayers. Examples ciaries and to what extent changes in welfare are fiscal reforms, trade liberalization measures, can be attributed to these interventions. 
In line technical assistance programs, cash transfer with this interpretation of impact evaluation, 8 programs, construction of schools, etc. throughout this document we will focus on impact assessment at the beneficiary level (see Figure 1.1. graphically presents different levels of the dotted oval in figure 1.1.), addressing key intervention and levels of impact. The differentia- methodological concerns and methodological tion between impact at the institutional level and approaches as well as the choice of methodologi- impact at the beneficiary level7 can be useful in cal approach in a particular evaluation context. Figure 1.1: Levels of intervention, programs, and policies and types of impact International conferences, treaties, declarations, protocols, policy networks Institutional-level impact Donor capacities/policies Government capacities/policies Other actors (INGOs, NGOs, Macro-earmarking (e.g., debt relief, banks, cooperatives, etc.) Micro-earmarking, GBS) meso-earmarking (e.g., SBS) May constitute Programs multiple Projects Policy measures (e.g., health reform) (e.g., agricultural (e.g., tax increases) extension) Beneficiary-level impact Communities Households Individual (taxpayers, voters, citizens, etc.) Replication and scaling up Wider systemic effects 8 IdEntIfy thE (typE and scopE of thE) IntErvEntIon Where necessary, other levels and settings of as interventions financed through these modali- impact will be addressed (see the dashed oval in ties (aim to) affect the lives of households and figure 1.1.). The implication is that with respect individuals.9 We do not address the question of to the impact evaluation of, for example, new aid how to do impact evaluations of new aid modali- modalities (e.g., general budget support or sector ties as such (see Lister and Carter, 2006; Elbers budget support), this will only be discussed as far et al., 2008). Key message Identify the scope and type of the intervention. 
In- and therefore should be subject to impact evaluation, terventions range from single-strand initiatives with this should not imply a reduction of interest in evaluating explicit objectives to complicated institutional policies. the impact of relatively simple, single-strand interven- Across this continuum, the scope of an impact evalu- tions. The sheer number of these interventions makes ation can be identified by answering two questions: doing robust impact evaluations of great importance. the impact of what and on what? Look closely at the In addition, one should be clear about the level of im- nature of the intervention, for example, on the basis pact to be evaluated. Although most policy makers and of the main theme addressed or by the generic policy stakeholders are primarily interested in beneficiary- instruments used. If interventions are complicated in level impact (e.g., impact on poverty), specific policy that they have multiple active components, state these interventions are primarily geared at inducing sustain- separately and treat the intervention as a package of able changes at the institutional (government) level components that should be unpacked. ("second-order" effects), with only indirect effects at Although complicated interventions, sometimes of the beneficiary level. an institutional nature, are becoming more important 9 Chapter 2 Agree on what is valued I mpact evaluation requires finding a balance between taking into account the values of stakeholders and paying appropriate attention to the empiri- cal complexity of processes of change induced by an intervention. Some of this complexity has been unpacked in the discussion on the topic of scope of the impact evaluation, where we distinguished between levels of impact that neatly capture the often complex and diffuse causal pathways from interven- tion to different outcomes and impact: institutional or beneficiary level and replicatory impact. 
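The two scoping questions of chapter 1 ("impact of what?" and "impact on what?") can be made concrete as a small data structure. The sketch below is purely illustrative and not part of the NONIE guidance itself: all class, field, and example names are invented for this example. It shows the practice recommended above of stating a complicated intervention's components separately and tagging each expected effect with the level (institutional or beneficiary) at which it is supposed to occur.

```python
# Illustrative sketch only: an intervention as a package of components,
# with expected effects tagged by level of impact. All names are invented
# for this example and are not taken from the guidance.

from dataclasses import dataclass, field

@dataclass
class ExpectedEffect:
    description: str
    level: str  # "institutional" or "beneficiary"

@dataclass
class Intervention:
    name: str                                    # answers "impact of what?"
    components: list = field(default_factory=list)
    effects: list = field(default_factory=list)  # answers "impact on what?"

    def effects_at(self, level):
        """List the expected effects at one level of impact."""
        return [e.description for e in self.effects if e.level == level]

# A complicated intervention stated as a package of components:
rural_roads = Intervention(
    name="Rural road rehabilitation program",
    components=["road repair works", "maintenance training for district staff"],
    effects=[
        ExpectedEffect("district capacity to plan road maintenance", "institutional"),
        ExpectedEffect("reduced journey time for households", "beneficiary"),
        ExpectedEffect("lower transaction costs for farmers", "beneficiary"),
    ],
)

print(rural_roads.effects_at("beneficiary"))
```

Separating the package into tagged components in this way is only a bookkeeping device, but it forces the evaluator to answer both scoping questions explicitly before choosing methods.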
It is best to--as much as possible--translate objectives into measurable indicators, but at the same time not lose track of important aspects that are difficult to measure.

After addressing the issue of stakeholder values, we briefly discuss three dimensions that are particularly important and at the same time challenging to capture in terms of measurable indicators: intended versus unintended effects, short-term versus long-term effects, and the sustainability of effects.

2.1. Stakeholder values in impact evaluation
Impact evaluation needs to assess the value of the results derived from an intervention. This is not only an empirical question but inherently a question about values--which impacts are judged as significant (whether positive or negative), what types of processes are valued in themselves (either positive or negative), and what and whose values are used to judge the distribution of the costs and benefits of interventions.

First, stakeholder values are reflected in the objectives of an intervention, as stated in the official documents produced by an intervention. However, interventions evolve and objectives might change. In addition, stakeholder groups, besides funding and implementing agencies, might harbor expectations not adequately covered by official documents. Impact evaluations need to answer questions related to "for whom" the impacts have been intended and how context influences impacts of interest. Some of the main tasks of an impact evaluation are, therefore, to be clear about who decides what the right aims are and to ensure that the legitimate different perspectives of different stakeholders are given adequate weight. Where there are multiple aims, there must be agreement about the standards of performance required in the weighting of these--for example, can an intervention be considered a success overall if it fails to meet some of the targets but does well in terms of the main intended outcome?

Depending on the evaluation context, there are different ways for evaluators to address stakeholder values:

· Informal consultation with representatives from different stakeholder groups
· Using values inquiry1 (Henry, 2002) as a basis for more systematic stakeholder consultation
· Using a participatory evaluation approach to include stakeholder values in the evaluation (see, e.g., Cousins and Whitmore, 1998).

2.2. Intended versus unintended effects
In development programs and projects, intended effects are often translated into measurable indicators as early as the design phase. Impact evaluation should go beyond assessing the expected effects, given an intervention's logical framework and objectives. Interventions often change over time, with consequences for how they affect institutional and people's realities. Moreover, effects are sometimes context specific, where different contexts trigger particular processes of change. Finally, in most cases, the full scope of an intervention's effects is not known in advance. A well-articulated intervention theory can help anticipate some of the unintended effects of an intervention (see chapter 3).

Classic impact evaluations assume that there are no impacts for nonparticipants, but this is unlikely to be true for most development interventions. Spillover effects or replicatory effects (see chapter 1) can stem from market responses (given that participants and nonparticipants trade in the same markets), the (nonmarket) behavior of participants/nonparticipants, or the behavior of intervening agents (governmental/nongovernmental organizations). For example, aid projects often target local areas, assuming that the local government will not respond; yet if one village gets the project, the local government may well cut its spending on that village and move to the control village (Ravallion, 2008).

2.3. Short-term versus long-term effects
In some types of interventions, impacts emerge quickly. In others, impact may take much longer and change over time. The timing of the evaluation is therefore important. Development interventions are usually assumed to contribute to long-term development (with the exception of humanitarian disaster and emergency situations). However, focusing on short-term or intermediate outcomes often provides more useful and immediate information for policy and decision making. Intermediate outcomes may be misleading, often differing markedly from those achieved in the longer term. Many of the impacts of interest from development interventions will only be evident in the longer term, such as environmental changes or changes in social impacts on subsequent generations. Searching for evidence of such impacts too early might mistakenly lead to the conclusion that they have failed.

In this context, the exposure time of an intervention in making an impact is an important point. A typical agricultural innovation project that tries to change farmers' behavior with incentives (training, technical assistance, credit) is faced with time lags in both the adoption effect (farmers typically are risk averse and face resource constraints, and start adopting innovations on an experimental scale) and the diffusion effect (other farmers want to see evidence of results before they copy any new behavior). In such gradual, nonlinear processes of change with cascading effects, the timing of the ex post measurement (of land use) is crucial. Ex post measurements that occur just after project closure could either underestimate impact (full adoption/diffusion of interesting practices has not taken place yet) or overestimate impact (as farmers will stop investing in those land use practices that are not attractive enough to be maintained without project incentives).

2.4. The sustainability of effects
Focusing on short- or intermediate-term outcomes may underestimate the importance of designs that are able to measure effects (positive or negative) in the long term. One example is an effective strategy to reduce child malnutrition in a certain population that may quite quickly produce impressive results, yet fail soon after in the absence of systems, resources, and capacities to maintain the work--or follow-up work--after termination of the intervention.

Few impact evaluations will probably provide direct evidence of long-term impacts, and in any case results are needed before these impacts become evident to inform decisions on continuation, next phases, and scaling-up. Impact evaluations therefore need to identify short-term impacts and, where possible, indicate whether longer-term impacts are likely to occur.

To detect negative impacts in the long term, it is important to include early warning indicators. A well-articulated intervention theory (see chapter 3) that also addresses the time horizons over which different types of outcomes and impacts could reasonably be expected to occur can help to identify impacts that can and should be explored in an evaluation. The sustainability of positive impacts is also likely to be evident only in the longer term. Impact evaluations therefore can focus on other impacts that will be observable in the short term, such as the institutionalization of practices and the development of organizational capacity, that are likely to contribute to the sustainability of impacts for participants and communities in the longer term.2

Key message
Agree on what is valued. Select objectives that are important to the stakeholders' values. Do not be afraid of selecting one objective; focus and clarity are virtues, not vices. As much as possible, try to translate objectives into measurable indicators, but at the same time do not lose track of important aspects that are difficult to measure. In addition, keep in mind the dimensions of exposure time and the sustainability of changes.

Chapter 3
Carefully articulate the theories linking interventions to outcomes

When evaluators talk about the black box "problem," they are usually referring to the practice of viewing interventions primarily in terms of effects, with little attention paid to how and why those effects are produced. The common thread underlying the various versions of theory-based evaluation is the argument that "interventions are theories incarnate" and evaluation constitutes a test of intervention theory or theories.

3.1. Seeing interventions as theories: The black box and the contribution problem
Interventions are embodiments of theories in at least two ways. First, they comprise an expectation that the introduction of a program or policy intervention will help ameliorate a recurring social problem. Second, they involve an assumption or set of assumptions about how and why program activities and resources will bring about changes for the better. The underlying theory of a program often remains hidden, typically in the minds of policy architects and staff. Policies--be they relatively small-scale direct interventions like information campaigns, training programs, or subsidization; meso-level interventions such as public-private partnerships and social funds; or macro-level interventions such as "general budget support"--rest on social, behavioral, and institutional assumptions indicating why "this" policy intervention will work, which at first view are difficult to uncover.

By seeing interventions as theories and by using insights from theory-based evaluations, it is possible to open up the black box. Development policies and interventions, in one way or another, have to do with changing behavior/intentions/knowledge of households, individuals, and organizations (grass roots, private, and public sector). Crucial for understanding what can change behavior is information on behavioral and social mechanisms. An important insight from theory-based evaluations is that policy interventions are (often) believed to address and trigger certain social and behavioral responses among people and organizations; in reality this may not be the case.

3.2. Articulating intervention theories on impact
Program theory (or intervention theory) can be identified (articulated) and expressed in many ways--a graphic display of boxes and arrows, a table, a narrative description, and so on. The methodology for constructing intervention theory, as well as the level of detail and complexity, also varies significantly (e.g., Connell et al., 1995; Leeuw, 2003; Lipsey, 1993; McClintock, 1990; Rogers et al., 2000; Trochim, 1989; Wholey, 1987). Too often the role of methodology is neglected, and it is assumed that "intervention theories" are like manna falling out of the sky. That is not the case. Often the underlying theory has to be dug up. Moreover, much of what passes as theory-based evaluation today is simply a form of "analytic evaluation [which] involves no theory in anything like a proper use of that term" (Scriven, 1998: 59).

The intervention theory provides an overall framework for making sense of potential processes of change induced by an intervention. Several pieces of evidence can be used for articulating the intervention theory:

· An intervention's existing logical framework as a starting point for mapping causal assumptions linked to objectives, and other written documents produced within the framework of an intervention
· Insights provided by and expectations harbored by policy makers and staff (and other stakeholders) on how they think the intervention will affect/is affecting/has affected target groups
· (Written) evidence on past experiences of similar interventions (including those implemented by other organizations)
· Literature on mechanisms and processes of change in certain institutional contexts, for particular social problems, in specific sectors, etc.

Sometimes stakeholders have contrasting assumptions and expectations about an intervention's impact, which has implications for reconstructing the intervention theory. Basically, there are two ways to address this issue. The first is to try to combine the perspectives of different people (for example, program managers and target group members) into an overarching intervention theory that consists of (parts of) arguments from these different sources. The overall theory might be created through an iterative process of dialogue and refinement and as such might contribute to a shared vision among stakeholders (see, e.g., Pawson and Tilley, 1997). Second, when differences are substantial, several competing intervention theories have to be reconstructed. Carvalho and White (2004) give an example of a "theory" and an "anti-theory" dealing with the assumed impact of social funds (see box 3.1).

Box 3.1: Social funds and government capacity: Competing theories

Proponents of social funds argue they will develop government capacity in several ways. Principal among these are that the social fund will develop superior means of resource allocation and monitoring, which will be transferred to the government either directly through collaborative work or indirectly by copying the procedures shown to be successful by the social fund. But critics argue that social funds bypass normal government channels and so undermine government capacity, an effect reinforced by drawing away the government's best people by paying a project premium. Hence, these are rather different theories of how social funds affect government capacity. Carvalho and White (2004) refer to both sets of assumptions in terms of "theory" and "anti-theory." Their study found that well-functioning, decentralized social funds, such as the Zambia Social Investment Fund, worked through--rather than parallel to--existing structures and that the social fund procedures were indeed adopted more generally by district staff. But at the national level there was generally little evidence of either positive or negative effects on capacity--with some exceptions, such as the promotion of poverty mapping in some countries.

Source: Carvalho and White (2004).

For an example of what an impact theory might look like, consider the case of a small business development project that provides training to young managers who have started a business. The direct goal is to help make small businesses financially sustainable and the indirect goal is to generate more employment in the region. Closer scrutiny reveals that the project might have a positive influence on the viability of small businesses in two ways: First, by training young
people in basic management and accounting skills, the project intends to have a positive effect on financial viability and ultimately on the growth and sustainability of the business; second, by supporting the writing of a business plan, the project aims to increase the number of successful applications for credit with the local bank, which previously excluded the project's target group because of the small loan sizes (high transaction costs) and high risks involved. Following this second causal strand, efficient and effective spending of the loan is also expected to contribute to the strength of the business. Outputs are measured in terms of the number of people trained by the project and the number of loans the bank extends (see figure 3.1).

Figure 3.1: Basic intervention theory of a fictitious small business support project. [The figure shows two causal strands: small business owners (SBOs) receive training -> SBOs' capacity to manage the business increases -> growth and sustainability of the business -> employment generation in the region; and SBO writes business plan -> SBO receives loan from bank -> growth and sustainability of the business.]

Any further empirical analysis of the impact of the project requires insight into the different factors--besides the project itself--that affect small business development and employment generation. Even in this rather simple example, the number of external variables that affect the impact variables either directly or by moderating the causal relations specified in figure 3.1 is manifold. Some examples are the following:

· Short-term demands on the labor efforts of business owners in other activities may lead to suboptimal strategic choices, jeopardizing the sustainability of the business.
· Inefficient or ineffective use of loans because of short-term demands for cash for other expenditures might jeopardize repayment and the financial viability of the business.
· Deteriorating market conditions (in input or output markets) may jeopardize the future of the business.
· The availability and quality of infrastructure or skilled labor at any point may become constraining factors on business development prospects.
· The efforts of other institutions promoting small business development or any particular aspect of it might positively (or negatively) affect businesses.

Methods for reconstructing the underlying assumptions of project/program/policy theories are the following (see Leeuw, 2003):

· A policy-scientific method, which focuses on interviews, documents, and argumentation analysis
· A strategic assessment method, which focuses on group dynamics and dialogue
· An elicitation method, which focuses on cognitive and organizational psychology.

Central in all three approaches is the search for mechanisms that are believed to be "at work" when a policy is implemented. Box 3.2 discusses social and behavioral mechanisms for understanding impact.

3.3. Testing intervention theories on impact
After articulating the assumptions on how an intervention is expected to affect outcomes and impacts, the question arises as to what extent these assumptions are valid. In practice,

Box 3.2: Social and behavioral mechanisms as heuristics for understanding processes of change and impact

Hedström (2005: 25) has defined the concept of social mechanisms as "a constellation of entities and activities that are organized such that they regularly bring about a particular type of outcome." Mechanisms form the "nuts and bolts" (Elster, 1989) or the "engines" (Leeuw, 2003) of interventions (policies and programs), making them work, given certain contexts (Pawson and Tilley, 1997). Hedström and Swedberg (1998: 296-98), building on the work of Coleman (1990), discuss three types of mechanisms: situational mechanisms, action-formation mechanisms, and transformational mechanisms.

Examples of situational mechanisms are self-fulfilling and self-denying prophecies and crowding out (e.g., by striving to force people who are already largely compliant with laws and regulations into full compliance, the opposite is realized, because due to the extra focus on laws and regulation, the internal motivation of people to comply is reduced).

Action-formation mechanisms are the heuristics that people develop to deal with their bounded rationality, such as--

· Framing and the endowment effect--"the fact that people often demand much more to give up an object than they would be willing to pay to acquire it," but also the tendency for people to have a stronger preference for more immediate payoffs than for later payoffs, the closer to the present both payoffs are
· Types of learning (social learning, vicarious learning)
· "Game-theoretical" mechanisms, such as the "grim strategy" (to repeatedly refuse to cooperate with another party as a punishment for the other party's failure to cooperate previously) and the shadow-of-the-future/shadow-of-the-past mechanisms
· Mechanisms such as the "fight-or-flight" response to stress and the "tend-and-befriend" mechanism.

Transformational mechanisms illuminate how processes and results of interacting individuals and groups are "transformed" into collective outcomes. Examples are the following:

· [...] is related to this, as are group think, the common knowledge effect, and herd behavior.
· "Tipping points," "where a small additional effort can have a disproportionately large effect, can be created through virtuous circles, or be a result of achieving certain critical levels" (Rogers, 2008: 35).

Relevance of mechanisms for impact evaluations
Development policies and interventions, in one way or another, have to do with changing behavior/intentions/knowledge of households, individuals, and organizations (grass roots, private, and public sector). Crucial for understanding what can change behavior is information about these mechanisms. The mechanisms underlying processes of change might not necessarily be those that are assumed to be at work by policy makers, program designers, and staff. Creating awareness on the basis of (public) information campaigns does not always lead to behavioral change. Subsidies and other financial incentives run the risk of causing unintended side effects, such as benefit snatching, but also create the "Mitnahme effect" (people already tended to behave in the way the incentive wanted them to behave before the incentive existed). Mentoring dropouts in education might cause "learned helplessness" and therefore increase dropout rates. Many other examples are available in the literature. The relevance of knowing which social and behavioral mechanisms are believed to do the work increases as the complication and complexity of interventions increase.

A focus on mechanisms helps evaluators and managers open up and test the theory underlying an intervention. Spending time and money on programs based on "pet theories" of policy makers or implementation agents that are not corroborated by relevant research should probably not be high on the agenda. If a policy intervention is based on mechanisms that are known not to work (in a given context or in general), that is a signal that the intervention probably will not be very effective.
This can be found out on the basis of desk research as a first test of the relevance and validity · Cascading is a process by which people influence one another, of an intervention theory, that is, by confronting the theory with so much so that participants ignore their private knowledge existing knowledge about mechanisms. That knowledge stems and rely instead on the publicly stated judgments of others. from synthesis and review studies (see chapter 6). Further em- The bandwagon phenomenon (the tendency to do [or believe] pirical impact evaluation can generate more contextualized and things because many other people do [or believe] the same) precise tests of the intervention theory. 18 c a r E f u l ly a r t I c u l at E t h E t h E o r I E s l I n k I n G I n t E r v E n t I o n s t o o u t c o m E s evaluators have at their disposal a wide range of · The theory of change--or key elements methods and techniques to test the intervention thereof--is verified by evidence: the chain of theory. We can distinguish between two broad expected results occurred. approaches. The first is that the theory consti- · Other influencing factors have been assessed tutes the basis for constructing a "causal story" and either shown not to have made a sig- about how and to what extent the intervention nificant contribution or their relative role in has produced results. Usually different methods contributing to the desired result has been and sources of evidence are used to further refine recognized. the theory in an iterative manner until a credible and reliable causal story has been generated. The analysis is best done iteratively, building up The second approach is to use the theory as a more robust assessment of causal contribu- an explicit benchmark for testing (some of) tion. The overall aim is to reduce the uncertainty the assumptions in a formal manner. 
Besides about the contribution the intervention is making providing a benchmark, the theory provides the to the observed results through an increased template for method choice, variable selection, understanding of why the observed results have and other data collection and analysis issues. This occurred (or not) and the roles played by the approach is typically applied in statistical analysis intervention and other factors. At the impact level but is not in any way restricted to this type of this is the most challenging, and a "contribution method. In short, theory-based methodological story" has to be developed for each major strategy designs can be situated anywhere in between that is part of an intervention, at different levels "telling the causal story" and "formally testing of analysis. They would be linked, as each would causal assumptions." treat the other strategies as influencing factors. The systematic development and corrobo- One of the key challenges in the foregoing ration of the causal story can be achieved analysis is to pinpoint the exact causal effect from through causal contribution analysis (Mayne, intervention to its impact. Despite the potential 2001), which aims to demonstrate whether the strength of the causal argumentation on the evaluated intervention is one of the causes of links between the intervention and impact, and observed change. Contribution analysis relies despite the possible availability of data on indica- on chains of logical arguments that are verified tors, as well as data on contributing factors, etc., through careful analysis. Rigor in causal contri- there remains uncertainty about the magnitude bution analysis involves systematically identify- of the impact as well as the extent to which the ing and investigating alternative explanations for changes in impact variables are really due to the observed impacts. This includes being able to intervention or to other influential variables. 
This rule out implementation failure as an explana- is called the attribution problem and is discussed tion for lack of results and developing testable in chapter 4. hypotheses and predictions to identify the conditions under which interventions contribute to specific impacts. Key message Carefully articulate the assumptions behind the The causal story is inferred from the following theories linking interventions to outcomes. What evidence: are the causal pathways linking intervention out- puts to processes of change and impact? Be criti- · There is a reasoned theory of change for the cal if an "intervention theory" appears to assert or intervention: it makes sense, is plausible, and assume changes without much explanation. The is agreed to by key players. focus should be on dissecting the causal (social, · The activities of the intervention were imple- behavioral, and institutional) mechanisms that make mented. interventions "work." 19 Chapter 4 Address the attribution problem M ultiple factors can affect the livelihoods of individuals or the capaci- ties of institutions. For policy makers as well as stakeholders it is important to know what the added value of the policy intervention is, apart from these other factors. 4.1. the attribution problem time spent fetching water. If nothing else of The attribution problem is often referred to as importance happened during the period under the central problem in impact evaluation. The study, attribution is so clear that there is no need central question is to what extent changes in to resort to anything other than before versus outcomes of interest can be attributed to a after to determine this impact. particular intervention. Attribution refers to both isolating and estimating accurately the particu- In general, the observed changes are only partly lar contribution of an intervention and ensuring caused by the intervention of interest. 
Other that causality runs from the intervention to the interventions inside or outside the core area will outcome. often interact and strengthen/reduce the effects of the intervention of interest for the evaluation. The changes in welfare for a particular group of In addition, other unplanned events or general people can be observed by undertaking before change processes will often influence develop- and after studies, but these rarely accurately ment, such as natural catastrophes, urbaniza- measure impact. Baseline data (before the tion, growing economies, business cycles, war, intervention) and end-line data (after the or long-term climate change. For example, intervention) give facts about the development in evaluating the impact of microfinance on over time and describe "the factual" for the poverty, we have to control for the influences treatment group (not the counterfactual). But of changing market conditions, infrastruc- changes observed by comparing before-after (or ture developments, or climate shocks such as pre-post) data are rarely caused by the interven- droughts, and so on. tion alone, as other interventions and processes influence developments, both in time and space. A discussion that often comes up in impact There are some exceptions in which before evaluation is the issue of attribution of what. versus after will suffice to determine impact. For This issue is complementary to the indepen- example, supplying village water pumps reduces dent variable question discussed in chapter 1. 
How the impact of the intervention is measured may be stated in several ways:

· What is the impact of an additional dollar of funding to program X?¹
· What is the impact of country Y's contribution to a particular intervention?
· What is the impact of intervention Z?

In this guidance we will focus on the third level of attribution: What is the impact of a particular policy intervention (from very simple to complex), independent of the specific monetary and nonmonetary contributions of the (institutional) actors involved?

The issue of attributing impact to a particular intervention can be quite complicated in itself (especially for complicated interventions such as sector strategies or programs). Additional levels of attribution, such as tracing impact back from interventions to the specific (financial) contributions of different donors, are either meaningless or too complicated to achieve in a pragmatic and cost-effective manner.

Analyzing attribution requires comparing the situation "with" an intervention to what would have happened in the absence of the intervention, the "without" situation (the counterfactual). Such comparison of the situation with and without the intervention is challenging because it is not possible to observe how the situation would have been without the intervention; the counterfactual has to be constructed by the evaluator. The counterfactual is illustrated in figure 4.1.

Figure 4.1: Graphic display of the net impact of an intervention
[Diagram: the value of a target variable over time, before and after the intervention; point b is the value before the intervention, point a the observed value after it, and point c the value the variable would have had without the intervention.]

The value of a target variable (point a) after an intervention should not be regarded as the intervention's impact, nor is the impact simply the difference between the before and after situation (a–b, measured on the vertical axis). The net impact (at a given point in time) is the difference between the target variable's value after the intervention and the value the variable would have had if the intervention had not taken place (a–c).

The starting point for an evaluation is a good account of the factual--what happened in terms of the outputs/outcomes targeted by the intervention? A good account of the factual requires articulating the intervention theory (or theories) and connecting the different causal assumptions from intervention outputs to outcomes and impacts, as discussed in chapter 3. This guidance will discuss several options for measuring the counterfactual.

Evaluations can be either experimental, as when the evaluator purposely collects data and designs the evaluation in advance, or quasi-experimental, as when data are collected to mimic an experimental situation. Multiple regression analysis is an all-purpose technique that can be used in virtually all settings (provided that data are available); when the experiment is organized in such a way that no controls are needed, a simple comparison of means can be used instead of a regression, because both will give the same answer. (Experimental and quasi-experimental approaches will be discussed in § 4.2.) We briefly introduce the general principles and the most common approaches. The idea of (quasi-)experimental counterfactual analysis is that the situation of a participant group (receiving benefits from/affected by an intervention) is compared over time with the situation of an equivalent comparison group that is not affected by the intervention.

Several designs exist that combine ex ante and ex post measurements of participant and control groups (see § 4.2). Randomization of intervention participation is considered the best way to create equivalent groups.
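The distinction between the naive before-after difference (a–b) and the net impact (a–c) of figure 4.1 is plain arithmetic once the three points are known. The sketch below uses hypothetical numbers, chosen only to illustrate the calculation; in practice point c can never be observed and has to be estimated, for example from a comparison group.

```python
# Hypothetical values for the target variable in figure 4.1.
# "b" and "a" are observed; "c" (the counterfactual) cannot be observed
# directly and has to be constructed by the evaluator.

b = 50.0  # value of the target variable before the intervention
a = 70.0  # observed value after the intervention
c = 62.0  # estimated value had the intervention not taken place

before_after = a - b  # naive before-after change: 20.0
net_impact = a - c    # net impact of the intervention: 8.0

# The naive comparison attributes the whole change (20.0) to the
# intervention; the counterfactual comparison shows that 12.0 of it
# would have happened anyway.
print(before_after, net_impact)
```

The gap between the two numbers is exactly the change that other interventions and general development processes would have produced without the intervention.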
Random assignment to the participant and control groups leads to groups with similar average characteristics² for both observables and non-observables, except for the intervention. As a second-best alternative, several matching techniques (e.g., propensity score matching) can be used to create control groups that are as similar to participant groups as possible (see below).

4.2. Quantitative methods addressing the attribution problem³
In this section we discuss experimental (e.g., randomized controlled trials), quasi-experimental (e.g., propensity score matching), and regression-based techniques.⁴ ⁵

Three related problems that quantitative impact evaluation methods attempt to address are the following:

· The establishment of a counterfactual: What would have happened in the absence of the intervention(s)?
· The elimination of selection effects, leading to differences between the intervention group (or treatment group) and the control group
· A solution for the problem of unobservables: the omission of one or more unobserved variables, leading to biased estimates.

Selection effects occur, for example, when those in the intervention group are more or less motivated than those in the control group. This is particularly a problem when the variable in question, in this case motivation, is not easily observable. As long as selection is based on observable characteristics and these are measured in the evaluation, they may be included--and thus controlled for--in the regression analysis. However, not all relevant characteristics are observed or measured. This problem of selection on unobservables is one of the main problems in impact evaluation.

In the following sections we will discuss different techniques of quantitative impact evaluation, mainly focusing our discussion on the selection bias issue. In trying to deal systematically with selection effects, (quasi-)experimental design-based approaches such as the randomized controlled trial (RCT) or the pipeline approach can be compromised by two sets of problems: contamination and unintended behavioral responses.

Contamination: Contamination (or contagion, treatment diffusion) refers to the problem of groups of people who are not supposed to be exposed to certain project benefits but in fact are benefiting from them. Contamination comes from two possible sources. The first is the intervention itself, as a result of spill-over effects. Interventions are most often planned and implemented within a delimited space (a village, district, nation, region, or institution). The influence zone of an intervention may, however, be larger than the core area where the intervention takes place or is intended to generate results (geographical spill-over effects). To avoid contamination, control and comparison groups must be located outside the influence zone. Second, the selected comparison group may be subject to similar interventions implemented by different agencies, or even to somewhat dissimilar interventions that affect the same outcomes. The counterfactual is thus a different type of intervention rather than no intervention. This problem is often overlooked. A good intervention theory, used as the basis for designing a measurement instrument that records the different potential problems of contamination, is a good way to address this problem.

Unintended behavioral responses: In any experiment people may behave differently when they know that they are part of the intervention or treatment; consequently, this will affect the data. The resulting bias is even more pronounced when the researcher has to rely on recall data or self-reported effects. Several unintended behavioral responses not caused by the intervention or by "normal" conditions might therefore disrupt the validity of comparisons between groups and hence the ability to attribute changes to project incentives. Important possible effects are the following (see Shadish et al., 2002; Rossi et al., 2004):

· Expected behavior or compliance behavior: Participants react in accordance with intervention staff expectations, for reasons such as compliance with the established contract or certain expectations about future benefits from the organization (not necessarily the project).
· Compensatory equalization: Discontent among staff or recipients with inequality between incentives might result in compensation of groups that receive less than other groups.
· Compensatory rivalry: Differentiation of incentives to groups of people might result in social competition between those receiving (many) intervention benefits and those receiving fewer or no benefits.
· Hawthorne effect: The fact of being part of an experiment, rather than the intervention as such, causes people to change their behavior.
· Placebo effect: The behavioral effect is not the result of the incentives provided by the intervention but of people's perception of the incentives and the subsequent anticipatory behavior.

These problems are relevant in most experimental and quasi-experimental design approaches that are based on ex ante participant and control/comparison group designs.⁶ They are less relevant in regression-based approaches that use statistical matching procedures or that do not rely on the participant-control group comparison for counterfactual analysis.⁷

4.2.1. Randomized controlled trial
The safest way to avoid selection effects is a randomized selection of the intervention and control groups before the experiment starts. When the experimental group and the control group are selected randomly from the same population, both groups will have similar average characteristics (except that one group has been subjected to the intervention and the other has not). Consequently, in a well-designed and correctly implemented RCT, a simple comparison of average outcomes in the two groups can adequately resolve the attribution problem and yield accurate estimates of the impact of the intervention on a variable of interest; by design, the only difference between the two groups was the intervention.

To determine whether the intervention had a statistically significant impact, one simply performs a test of equality between the mean outcomes in the experiment and control groups. Statistical analysis will tell you whether the impact is statistically significant and how large it is. Of course, with larger samples the statistical inferences will be increasingly precise; but if the impact of an intervention really is large, it can be detected and measured even with a relatively small sample.

A proper RCT addresses many attribution issues but has to be planned and managed carefully to avoid contamination and other risks. Risks of an RCT are (i) different rates of attrition in the two groups, possibly caused by a high dropout in one of the two groups; (ii) spillover effects (contamination), resulting in the control group receiving some of the treatment; and (iii) unintended behavioral responses.

4.2.2. Pipeline approach
One of the problems for the evaluation of development projects or programs is that evaluators rarely get involved early enough to design a good evaluation (although this is changing). Often, households or individuals are selected for a specific project, but not everybody participates (directly) in the project. A reason may be a gradual implementation of the project. Large projects (such as in housing or construction of schools) normally have a phased implementation.

In such a case, it may be possible to exploit this phasing of the project by comparing the outcomes of households or communities that actually participate (the experiment group) with households or communities that are selected but do not participate (the comparison group). A specific project (school building) may start, for instance, in a number of villages and be implemented later in other villages. This creates the possibility of evaluating the effect of school building on enrollment. One has to be certain, of course, that the second selection--the actual inclusion in the project--does not introduce a selection bias. If, for instance, at the start of the project a choice is made to start construction in a number of specific villages, the (relevant) characteristics of these villages must be similar to those of other villages that are eligible for new schools. Self-selection (of villages that are eager to participate) or other selection criteria (starting in remote areas or in urban areas) may introduce a selection bias.

4.2.3. Propensity score matching
When no comparison group has been created at the start of the project or program, a comparison group may be created ex post through a matching procedure: for every member of the treatment group, one or more members of a control group are selected on the basis of similar observed (and relevant) characteristics.

Suppose there are two groups, one a relatively small intervention group of 100 pupils who will receive a specific reading program. If we want to analyze the effects of this program, we must compare the results of the pupils in the program with other pupils who were not included in the program. We cannot select just any control group, because the intervention group may have been self-selected on the basis of specific characteristics (pupils with relatively good results or relatively bad results, pupils from rural areas, from private schools or public schools, boys, girls, orphans, etc.). Therefore, we need to select a group with similar characteristics. One way of doing this would be to find, for every boy age 10 from a small rural school with a high pupil:teacher ratio in a poor district, another boy with the same observed characteristics. This would be a time-consuming procedure, especially for 100 pupils.

An alternative way to create a control group for this case is the method of propensity score matching. This technique involves forming pairs, not by matching every characteristic exactly, but by selecting groups that have similar probabilities of being included in the sample as the treatment group. The technique uses all available information to construct a control group (see box 4.1).⁸ Rosenbaum and Rubin (1983) showed that this method makes it possible to create a control group ex post with characteristics similar to those the intervention group would have had if its members had been selected randomly before the beginning of the project.

It should be noted that the technique only deals with selection bias on observables and does not solve potential endogeneity bias (see appendix 4), which results from the omission of unobserved variables. Nevertheless, propensity score matching may be combined with the technique of double differencing to correct for the influence of time-invariant unobservables (see below). Moreover, the technique may require a large sample for the selection of the comparison group, which might pose a problem if secondary data are not available (see chapter 8).

Box 4.1: Using propensity scores to select a matched comparison group--The Vietnam Rural Roads Project
The survey sample included 100 project communes and 100 non-project communes in the same districts. Using the same districts simplified survey logistics and reduced costs, but communes were still far enough apart to avoid "contamination" (control areas being affected by the project). A logit model of the probability of participating in the project was used to calculate the propensity score for each project and non-project commune. Comparison communes were then selected with propensity scores similar to the project communes. The evaluation was also able to draw on commune-level data collected for administrative purposes that cover infrastructure, employment, education, health care, agriculture, and community organization. These data will be used for contextual analysis, to construct commune-level indicators of welfare, and to test program impacts over time. The administrative data will also be used to model the process of project selection and to assess whether there are any selection biases.
Sources: Van De Walle and Cratty (2005); Bamberger (2006).

4.2.4. Judgmental matching⁹
A less precise method for selecting control groups uses descriptive information from, for example, survey data to construct comparison groups.
Matching areas on observables. In consultation with clients and other knowledgeable persons, the researcher identifies characteristics that should be matched (e.g., access to services, type or quality of house construction, economic level, location, or types of agricultural production). Information from maps (sometimes including geographic information system data and/or aerial photographs), observation, secondary data (e.g., censuses, household surveys, school records), and key informants is then combined to select comparison areas with the best match of characteristics. Operating under real-world constraints means that it will often be necessary to rely on easily observable or identifiable characteristics (e.g., types of housing and infrastructure). Although this may expedite matters, there may also be unobservable differences; the researcher must address these as much as possible through qualitative research and attach the appropriate caveats to any results.

Matching individuals or households on observables. Procedures similar to those noted above can be used to match individuals and households. Sample selection can sometimes draw on existing survey data or ongoing household surveys; in many cases, however, researchers must find their own ways to select the sample. Sometimes the selection is based on physical characteristics that can be observed (type of housing, distance from water and other services, type of crops or area cultivated), whereas in other cases selection is based on characteristics that require screening interviews (e.g., economic status, labor market activity, school attendance). In these latter cases, the interviewer must conduct quota sampling.

4.2.5. Double difference (difference-in-difference)
Differences between the intervention group and the control group may be unobserved and therefore problematic. Nevertheless, even though such differences cannot be measured, the technique of double difference (or difference-in-difference) deals with them as long as they are time invariant. The technique measures the differences between the two groups, before and after the intervention (hence the name double difference).

Suppose there are two groups, an intervention group I and a control group C. One measures, for instance, enrollment rates before (0) and after (1) the intervention. According to this method, the effect is

(I1 – I0) – (C1 – C0), or equivalently, (I1 – C1) – (I0 – C0).

For example, if enrollment rates at t = 0 were 80% for the intervention group and 70% for the control group, and at t = 1 these rates were, respectively, 90% and 75%, then the effect of the intervention would be (90% – 80%) – (75% – 70%) = 5 percentage points.

Table 4.1: Double difference and other designs

                          Intervention group   Control group   Difference across groups
Baseline                  I0                   C0              I0 – C0
Follow-up                 I1                   C1              I1 – C1
Difference across time    I1 – I0              C1 – C0         Double difference: (I1 – C1) – (I0 – C0) = (I1 – I0) – (C1 – C0)

Source: Adapted from Maluccio and Flores (2005).

The techniques of propensity score matching (see above) and double difference may be combined. Propensity score matching increases the likelihood that the treatment and control groups have similar characteristics, but it cannot guarantee that all relevant characteristics are included in the selection procedure. The double difference technique can eliminate the effects of an unobserved selection bias, but it may work better when differences between the intervention group and the control group are eliminated as much as possible.
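The double-difference arithmetic of § 4.2.5, including the enrollment-rate example from the text, is a one-line calculation:

```python
# Double difference (difference-in-difference) as defined in § 4.2.5.

def double_difference(i0, i1, c0, c1):
    """(I1 - I0) - (C1 - C0); algebraically equal to (I1 - C1) - (I0 - C0)."""
    return (i1 - i0) - (c1 - c0)

# Enrollment-rate example from the text (rates in percent).
effect = double_difference(i0=80, i1=90, c0=70, c1=75)
print(effect)  # 5 -> the intervention raised enrollment by 5 percentage points

# Both orderings of the subtraction give the same answer:
assert double_difference(80, 90, 70, 75) == (90 - 75) - (80 - 70)
```

Note that the estimate is unbiased only under the assumption carried through the rest of the section: any unobserved differences between the two groups must be constant over time.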
The approach to deal with unobserved selection effects is the eliminates initial differences between the two application of the "difference-in-difference" groups (e.g., differences in enrollment rates) approach in a regression model (see appendix and therefore gives an unbiased estimate of the 4). In such a model we do not analyze the (cross- effects of the intervention, as long as these differ- section) effects between groups, but the changes ences are time invariant. When an unobserved (within groups) over time. Instead of taking the variable is time variant (changes over time), the specific values of a variable in a specific year, we measured effect will still be biased. analyze the changes in these variables over time. In such an analysis, unobserved time-invariant 4.2.6. Regression analysis and double variables drop from the equation.10 difference In some programs the interventions are all or Again, the quality of this method as a solution nothing (a household or individual is subjected depends on the validity of the assumption that to the intervention or not); in others they vary unobservables are time invariant. Moreover, the continuously over a range, as when programs vary quality of the method also depends on the quality the type of benefit offered to target groups. One of the underlying data. The method of double example is a cash transfer program or a microfi- differencing is more vulnerable than some other nance facility where the amount transferred or methods to the presence of measurement error loaned may depend on the income of the partici- in the data. pant; improved drinking water facilities are another example. These facilities differ in capacity 4.2.7. Instrumental variables and are implemented in different circumstances An important problem when analyzing the impact with beneficiaries living at different distances to of an intervention is the problem of endogeneity. these facilities. 
4.2.7. Instrumental variables

An important problem when analyzing the impact of an intervention is the problem of endogeneity. The most common example of endogeneity is when a third variable causes two other variables to correlate without there being any causality. For example, doctors are observed to be frequently in the presence of people with fevers, but doctors do not cause the fevers; it is the third variable (the illness) that causes the two other variables to correlate (people with fevers and the presence of doctors). In econometric language, when there is endogeneity an explanatory variable will be correlated with the error term in a mathematical model (see appendix 4). When an explanatory variable is endogenous, it is not possible to give an unbiased estimate of the causal effect of this variable.

Selection effects also give rise to bias. Consider the following example. Various studies in the field of education find that repeaters produce lower test results than non-repeaters. A preliminary and false conclusion would be that repetition does not have a positive effect on student performance and that it is simply a waste of resources. But such a conclusion neglects the endogeneity of repetition: intelligent children with well-educated parents are more likely to perform well and therefore not repeat. Less intelligent children, on the other hand, will probably not achieve good results and are therefore more likely to repeat. So the two groups of pupils (i.e., repeaters and non-repeaters) have different characteristics, which at first view makes it impossible to draw conclusions based on a comparison between them.

The technique of instrumental variables is used to address the endogeneity problem. An instrumental variable (or instrument) is a third variable that is used to get an unbiased estimate of the effect of the original endogenous variable (see appendix 4). A good instrument correlates with the original endogenous variable in the equation, but not with the error term. Suppose a researcher is interested in the effect of a training program. Actual participation in the program may be endogenous because, for instance, the most motivated employees may subscribe to the training. Therefore, one cannot compare employees who had the training with employees who did not without incurring bias. The effect of the training may be determined if a subset were assigned to the training by accident or through some process unrelated to personal motivation. In this case, the instrumental variables procedure essentially uses only data from that subset to estimate the impact of training.

4.2.8. Regression discontinuity analysis

The basic idea of regression discontinuity analysis is simple. Suppose program participation depends on income. On the left side of the cut-off point, people (or households) have an income that is just low enough to be eligible for participation; on the right side of the cut-off point, people are no longer allowed to participate, even though their income is only slightly higher. There may be more criteria that define the threshold, and these criteria may be explicit or implicit. Regression discontinuity analysis compares the treatment group with the control group at the cut-off point. At that point, it is unlikely that there are unobserved differences between the two groups.

Suppose we want to analyze the effect of a specific program to improve learning achievements. This program focuses on the poorest households: the program includes only households with an income below a certain level. We know that learning achievements are correlated with income,11 and therefore we cannot compare households participating in the program with households that do not participate. Other factors may also induce an endogeneity bias (such as differences in the educational background of parents or the distance to the school). Nevertheless, at the cut-off point there is no reason to assume that there are systematic differences between the two groups of households (apart from small differences in income). Estimating the impact can now be done, for example, by comparing the mean difference between the regression line of learning achievements as a function of income before the intervention with the regression line after (see figure 4.2).

[Figure 4.2: Regression discontinuity analysis. Learning achievements (standardized score) plotted against income (local currency).]

A major disadvantage of a regression discontinuity design is that the method assesses the marginal impact of the program only around the cut-off point for eligibility. Moreover, it must be possible to construct a specific threshold, and individuals should not be able to manipulate the selection process (ADB, 2006: 14). Many researchers prefer regression discontinuity analysis over propensity score matching, because the technique generates a higher likelihood that estimates will not be biased by unobserved variables.12
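The comparison of regression lines at the cut-off can be sketched in plain Python. All numbers here are invented for illustration; fitting a separate line on each side of the threshold and comparing the two fitted lines at the cut-off is one simple way to implement the idea described above:

```python
# Toy regression discontinuity sketch (invented numbers): households
# with income below a cut-off join the program, and learning
# achievement also rises smoothly with income. Fitting a regression
# line on each side of the cut-off and comparing the two lines AT the
# cut-off isolates the jump caused by the program.
CUTOFF = 50.0        # eligibility threshold (income, local currency)
TRUE_EFFECT = 2.0    # jump in achievement caused by the program

def achievement(income):
    baseline = 1.0 + 0.05 * income      # smooth relation with income
    return baseline + (TRUE_EFFECT if income < CUTOFF else 0.0)

def fit_line(xs, ys):
    # ordinary least squares for y = a + b * x, in pure Python
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b               # intercept, slope

xs_left = [40.0, 42.0, 44.0, 46.0, 48.0]    # just eligible
xs_right = [52.0, 54.0, 56.0, 58.0, 60.0]   # just ineligible
a_l, b_l = fit_line(xs_left, [achievement(x) for x in xs_left])
a_r, b_r = fit_line(xs_right, [achievement(x) for x in xs_right])

# Gap between the two regression lines at the cut-off = program effect.
rd_estimate = (a_l + b_l * CUTOFF) - (a_r + b_r * CUTOFF)
print(round(rd_estimate, 2))  # close to the true effect of 2.0
```

Because the income gradient is estimated and netted out on both sides, the remaining gap at the threshold is attributable to the program, which is precisely why the estimate is only valid near the cut-off.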
4.3. Applicability of quantitative methods for addressing the attribution problem

There are some limitations to the applicability of the techniques discussed in the previous section. We briefly highlight some of the more important ones (for a more comprehensive discussion see, e.g., Bamberger and White, 2007). First, in general, counterfactual estimation is not applicable in full-coverage interventions such as price policies or regulation on land use, which affect everybody (although to different degrees). In this case there are still possibilities to use statistical "counterfactual-like" analyses, such as those that focus on the variability in exposure/participation in relation to changes in an outcome variable (see, e.g., Rossi et al., 2004). Second, there are several pragmatic constraints to applying this type of analysis, especially with respect to randomization and other design-based techniques. For example, there might be ethical objections to randomization or a lack of data representing the baseline situation of intervention target groups (see chapter 8). Third, the applicability of quantitative approaches (experimental and non-experimental) also largely depends on the number of observations (n) available for analysis. Quantitative analysis is only meaningful if n is reasonably large: statistically based approaches are not applicable if there is a small n. The small n problem can arise either because the intervention was applied to a single unit (e.g., capacity building in a single ministry or a national policy change) or a small number of units, or because there is heterogeneity in the intervention so that only a small number of units received support of a specific type. Where there is a small n, a variety of other approaches can be used (see § 4.4.).

An important critique of the applicability of these methods refers to the nature of the intervention and the complexity of the context in which the intervention is embedded. The methodological difficulties of evaluating complicated interventions can to some extent be "neutralized" by deconstructing them into their "active ingredients" (see, e.g., Vaessen and Todd, 2008).13 Consider the example of school reform in Kenya as described by Duflo and Kremer (2005). School reform constitutes a set of different simultaneous interventions at different levels, ranging from revisions in and decentralization of the budget allocation process, to addressing links between teacher pay and performance, to vouchers and school choice. Although the total package of interventions constituting school reform represents an impressive landscape of causal pathways of change at different levels, directly and indirectly affecting individual school, teacher, and student welfare in different ways, it can be unpacked into different (workable) components, such as teacher incentives and their effects on student performance indicators, or school vouchers and their effects on student performance.

True experimental designs have been relatively rare in development settings (though not rare in developing countries, as medical tests routinely use a randomized approach). Alternatively, quasi-experiments using non-random assignment to participant and control groups are more widely applicable. Preferably, double difference (participant-control group comparisons over time) designs should be used. However, it is more usual for impact assessments to be based on less rigorous--and less reliable--designs, where:

· Baseline data are reconstructed or collected late during the implementation phase.
· Baseline data are collected only for the treatment group.
· There are no baseline data for the treatment or control group.

If no baseline data exist, then the impact of the intervention is measured by comparing the situation afterward between the treatment and control groups. This comparison of end-line data is measured by a single difference (see also appendix 14).

Some impact evaluations are based on pure "before and after" comparisons of change only for the treatment group, with no comparison group at all. The measure in such cases is also a single difference, but the lack of a proxy for the counterfactual makes conclusions based on this design less robust. This design gives a valid measure of impacts only in the rare situations when no other factors can explain the observed change, or when the intervention of interest is the only factor influencing the conditions. In other words, all other factors are stable, or there are no cause-effect relationships other than between the intervention and the observed change. Systematic control of the influence of other factors can significantly increase the reliability of findings (see also chapter 8).

Some final remarks on attribution are in order. Given the centrality of the attribution issue in impact evaluation, we concur with many of our colleagues that there is scope for more quantitative impact evaluation, as these techniques offer a comparative advantage in formally addressing the counterfactual. Therefore, with a relatively large n, a quantitative approach is usually preferred. At the same time, it must be admitted that, given the limitations discussed above, the application of experimental and quasi-experimental design-based approaches will necessarily be limited to only a part of the total set of interventions in development.14

The combination of theory-based evaluation and quantitative impact evaluation provides a powerful methodological basis for rigorous impact evaluation for several reasons:

· The intervention theory will help indicate which of the intervention components are amenable to quantitative counterfactual analysis through, for example, quasi-experimental evaluation, and how this part of the analysis relates to other elements of the theory.15
· The intervention theory approach will help identify key determinants of impact variables to be taken into account in a quantitative impact evaluation.
· The intervention theory approach can provide a basis for analyzing how an intervention affects particular individuals or groups in different ways; although quantitative impact evaluation methods typically result in quantitative measures of average net effects of an intervention, an intervention theory can help support the analysis of the distribution of costs and benefits (see chapter 5).
· The intervention theory can help strengthen the interpretation of findings generated by quantitative impact evaluation techniques.

This symbiosis between theory-based evaluation and quantitative impact evaluation has been acknowledged by a growing number of authors, both in the general impact evaluation literature (e.g., Cook, 2000; Shadish et al., 2002; Rossi et al., 2004; Morgan and Winship, 2007) and in the literature on development impact evaluation (e.g., Bamberger et al., 2004; Bourguignon and Sundberg, 2007; Ravallion, 2008). When this combination is not feasible, alternative methods embedded in a theory-based evaluation framework should be applied.
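The design hierarchy discussed in this section, single difference versus double difference, can be made concrete with a toy calculation. The numbers are invented: a secular trend improves everyone's outcome between baseline and follow-up, so a before-and-after comparison of the treatment group alone wrongly credits the trend to the program:

```python
# Invented numbers: everyone's outcome improves by 3 between baseline
# and follow-up (secular trend); the program adds a true effect of 4.
trend, true_effect = 3.0, 4.0

treat_before, control_before = 10.0, 9.0
treat_after = treat_before + trend + true_effect    # 17.0
control_after = control_before + trend              # 12.0

# "Before and after" on the treatment group only (single difference):
# the secular trend is wrongly attributed to the program.
single_difference = treat_after - treat_before
print(single_difference)    # 7.0 -- overstates the effect

# Comparing only end-line data across groups is also a single
# difference; it is biased by the groups' initial difference.
endline_difference = treat_after - control_after
print(endline_difference)   # 5.0 -- still biased

# Double difference nets out both the common trend and the initial gap.
double_difference = ((treat_after - treat_before)
                     - (control_after - control_before))
print(double_difference)    # 4.0 -- the true effect
```

The toy example also shows why baseline data are critical: without the two "before" measurements, only the biased end-line comparison is available.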
4.4. Other approaches

In this section we introduce a range of methodological approaches that can be used to address the attribution problem or particular aspects of the impact evaluation.16

4.4.1. Alternative approaches for addressing the attribution problem

The methods discussed in the previous sections have the advantage of allowing for an estimation of the magnitude of change attributable to a particular intervention using counterfactual analysis. There are also other (qualitative) methods that can be useful in addressing the issue of attribution. However, these methods as such do not quantify effects attributable to an intervention.17

A first example of an alternative approach is the so-called General Elimination Methodology (GEM). This approach is epistemologically related to Popper's falsification principle. Michael Scriven added it to the methodology of (impact) evaluations. Although in some papers he suggested that the GEM approach was particularly relevant for dissecting causality chains within case studies, both in his earlier work and in a more recent paper (Scriven, 1998) he makes clear that the GEM approach is relevant for every type of expert practice, including RCTs and case studies (see appendix 2 for a more detailed discussion).

What is the relevance of this approach for impact evaluation? Given the complexity of solving the attribution problem, GEM can help "test" different counterfactuals that have been put forward in a theoretical way. When doing (quasi-)experiments, using GEM can be an extra check on the validity of the conclusions and can help one understand why the results are as they are. Pawson and Tilley (1997) criticized experimentalists by highlighting what they perceive as a lack of attention to explanatory questions in (quasi-)experiments. Consequently, GEM can be helpful by involving the evaluator in setting up a "competition" between the conclusions from the evaluation and possible other hypotheses.

A second example is causal contribution analysis (see Mayne, 2001; described in chapter 3). Contribution analysis relies on chains of logical arguments that are verified through careful analysis. Rigor in this type of causal analysis involves systematically identifying and investigating alternative explanations for observed impacts. This includes being able to rule out implementation failure as an explanation for lack of results, and developing testable hypotheses and predictions to identify the conditions under which interventions contribute to specific impacts.

Some of these hypotheses can be tested using the quantitative methods discussed previously. In this sense, contribution analysis and other variants of theory-based analysis provide a framework in which quantitative methods of impact evaluation can be used to test particular causal assumptions. If the latter is not possible, the verification and refinement of the causal story should rely exclusively on other (multiple) methods of inquiry (see chapter 5).

4.4.2. Participatory approaches18

Nowadays, participatory methods have become mainstream tools in development in almost every area of policy intervention. The roots of participation in development lie in the rural sector, where Chambers (1995) and others developed the now widely used principles of participatory rural appraisal. Participatory evaluation approaches (see, e.g., Cousins and Whitmore, 1998) are built on the principle that stakeholders should be involved in some or all stages of the evaluation. As Greene (2006: 127ff) illustrates, "[P]articipatory approaches to evaluation directly engage the micropolitics of power by involving stakeholders in important decision-making roles within the evaluation process itself. Multiple, diverse stakeholders collaborate as co-evaluators, often as members of an evaluation team." Participatory evaluation can be perceived as a developmental process in itself, largely because it is "the process that counts" (Whitmore, 1991). In the case of impact evaluation, participation includes aspects such as the determination of objectives and the indicators to be taken into account, as well as stakeholder participation in data collection and analysis. In practice it can be useful to differentiate between stakeholder participation as a process and stakeholder perceptions and views as sources of evidence (Cousins and Whitmore, 1998).

Participatory approaches to impact evaluation can be important for several reasons. First, one could ask the legitimate question of impact "according to whom." Participatory approaches can be helpful in engaging stakeholders on the issue of what is to be valued in a particular impact evaluation. By engaging a range of stakeholders, a more comprehensive and/or appropriate set of valued impacts is likely to be identified (see the second key issue of this Guidance document). When identifying the (type and scope of the) intervention to be evaluated (see first chapter), participatory methods might be of particular use; aspects that might be "hidden" behind official language and political jargon (in documents) can be revealed by narrative analyses and by consulting stakeholders. More generally, the process of participation can in some cases enhance stakeholder ownership, the level of understanding of a problem among stakeholders, and the utilization of impact evaluation results.

In the light of the attribution issue, stakeholder perspectives can help improve an evaluator's understanding of the complex reality surrounding causal relationships among interventions, outcomes, and impacts. In addition, insight into the multiple and (potentially) contrasting assumptions about causal relationships between an intervention and processes of change can help enrich an evaluator's perspective on the attribution issue. As discussed in chapter 3, stakeholder perspectives can be an important source for reconstructing an intervention theory or multiple theories,19 which subsequently can be refined or put to the test during further analysis.

Some of the latter benefits can also be realized by using qualitative methods that are nonparticipatory (see Mikkelsen, 2005; see also appendix 9). This brings us to an important point. There is a common misperception that there is a finite and clearly defined set of so-called "participatory" evaluation methods. Although certain methods are often (justifiably) classified under the banner of participatory methods because stakeholder participation is a defining feature, many methods not commonly associated with stakeholder participation (including, for example, (quasi-)experimental methods) can also be used in more or less participatory ways, with or without stakeholder involvement. The participatory aspect of a methodology is largely determined by the issues of who is involved and who does or decides on what and how. For example, the methodology for testing water quality to ascertain the impact of treatment facilities can become participatory if community-level water users are involved in deciding, for example, what aspects of water quality to measure and how to collect the data and report the results.

Methodologies commonly found under the umbrella of participatory (impact) evaluation include appreciative inquiry; beneficiary assessment; participatory impact pathway analysis; participatory impact monitoring (see box 4.2); poverty and social impact analysis; social return on investment; systematic client consultation; self-esteem, associative strength, resourcefulness, action planning and responsibility; citizen report cards; community score cards; and the Participatory Learning and Action toolbox20 (see, for example, IFAD, 2002; Mikkelsen, 2005; Pretty et al., 1995; Salmen and Kane, 2006).

These methods rely on different degrees of participation, ranging from consultation to collaboration to joint decision making. In general, the higher the degree of participation, the more costly and difficult it is to set up the impact evaluation. In addition, a high degree of participation might be difficult to realize in large-scale comprehensive interventions such as sector programs.21

Apart from the previously discussed potential benefits of an impact evaluation involving some element of stakeholder participation, disadvantages of participatory approaches include the following:

· Limitations to the validity of information based on stakeholder perceptions (only); this problem is related to the general issue of shortcomings in individual and group perceptional data.
· The risk that strategic responses, manipulation, or advocacy by stakeholders will influence the validity of the data collection and analysis.22
· Limitations to the applicability of impact evaluation with a high degree of participation, especially in large-scale, comprehensive, multi-site interventions (aspects of time and cost).

4.4.3. Useful methods for data collection and analysis that are often part of impact evaluation designs23

In this section we distinguish a set of methods that are useful:

· For testing/refining particular parts (i.e., assumptions) of the impact theory, but not specifically focused on impact assessment as such
· For strengthening particular lines of argumentation with additional/detailed knowledge, useful for triangulation with other sources of evidence
· For deepening the understanding of the nature of particular relationships between intervention and processes of change.

The literature on (impact) evaluation methodology, as in any other field of methodology, is riddled with labels representing different (and sometimes not so different) methodological approaches. In essence, however, methodologies are built upon specific methods. Survey data collection and (descriptive) analysis, semi-structured interviews, and focus-group interviews are but a few of the specific methods that are found throughout the landscape of methodological approaches to impact evaluation.

Evaluators, commissioners, and other stakeholders in impact evaluation should have a basic knowledge of the more common research techniques:24

Box 4.2: Participatory impact monitoring in the context of the poverty reduction strategy process

Participatory impact monitoring builds on the voiced perceptions and assessments of the poor and aims to strengthen these as relevant factors in decision making at national and subnational levels. In the context of poverty reduction strategy monitoring, it will provide systematic and fast feedback on implementation progress, early indications of outcomes and impact, and the unintended effects of policies and programs. The purposes are as follows:

· Increase the voice and the agency of poor people through participatory monitoring and evaluation
· Enhance the effectiveness of poverty-oriented policies and programs in countries with poverty reduction strategies
· Contribute to methodology development, strengthen the knowledge base, and facilitate cross-country learning on the effective use of participatory monitoring at the policy level, and in the context of poverty reduction strategy processes in particular.

Conceptually, the proposed project impact monitoring approach combines (1) the analysis of relevant policies and programs at the national level, leading to an inventory of "impact hypotheses," with (2) extensive consultations at the district/local government level, and (3) joint analysis and consultations with poor communities on their perceptions of change, their attributions to causal factors, and their contextualized assessments of how policies and programs affect their situation.

Source: Booth and Lucas (2002).

Descriptive statistical techniques (e.g., of survey or registry data): The statistician Tukey (e.g., Tukey, 1977) argued for more attention to exploratory data analysis techniques as powerful and relatively simple ways to understand patterns in data. Examples include univariate and bivariate statistical analysis of primary or secondary data using graphical analysis and simple statistical summaries (e.g., for univariate analysis: mean, standard deviation, median, interquartile range; for bivariate analysis: series of boxplots, scatterplots, odds ratios).

Inferential statistical techniques (e.g., of survey or registry data): Univariate analysis (e.g., confidence intervals around the mean; t-test of the mean), bivariate analysis (e.g., t-test for difference in means), and multivariate analysis (e.g., cluster analysis, multiple regression) can be rather useful in estimating impact effects or in testing particular causal assumptions of the intervention theory. These techniques (including those in the previous paragraph) are also used in the (quasi-)experimental and regression-based approaches described in § 4.2. For more information, see Agresti and Finlay (1997) or Hair et al. (2005) or, more specifically for development contexts, Casley and Lury (1987) or Mukherjee et al. (1998).

Qualitative methods include widely used methods such as semi-structured interviews, open interviews, focus group interviews, participant observation, and discourse analysis, but also less conventional approaches such as mystery guests and unobtrusive measures (e.g., through observation; see Webb et al., 2000). For more information, see Patton (2002) or, more specifically for development contexts, Mikkelsen (2005) or Roche (1999).25
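A minimal sketch of the univariate and bivariate techniques listed above, using only Python's standard library. The data and variable names are invented for illustration and are not from the Guidance:

```python
# Descriptive and inferential summaries on two small samples
# (e.g., test scores for a treatment and a control group).
import statistics

treatment = [12.1, 14.3, 13.8, 15.2, 14.9, 13.4]
control = [11.0, 12.2, 11.8, 12.9, 12.4, 11.6]

# Descriptive (exploratory) summaries, in the spirit of Tukey:
for name, xs in (("treatment", treatment), ("control", control)):
    print(name,
          round(statistics.mean(xs), 2),     # mean
          round(statistics.median(xs), 2),   # median
          round(statistics.stdev(xs), 2))    # sample standard deviation

# Inferential: a two-sample t statistic for the difference in means
# (equal-variance form; compare against t tables with n1 + n2 - 2 df).
n1, n2 = len(treatment), len(control)
m1, m2 = statistics.mean(treatment), statistics.mean(control)
s1, s2 = statistics.variance(treatment), statistics.variance(control)
pooled = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
t_stat = (m1 - m2) / ((pooled * (1 / n1 + 1 / n2)) ** 0.5)
print(round(t_stat, 2))
```

A t statistic well above the critical value would suggest a real difference in means, but, as the chapter stresses, such a bivariate comparison only supports attribution when it is embedded in a design (experimental, quasi-experimental, or theory-based) that rules out competing explanations.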
Key message

Address the attribution problem. Although there is no single method that is best in all cases (a gold standard), some methods are indeed best in specific cases. When empirically addressing the attribution problem, experimental and quasi-experimental designs embedded in a theory-based evaluation framework have clear advantages over other designs. If addressing the attribution problem can only be achieved by doing a contribution analysis, be clear about that and specify the limits and opportunities of this approach. Overall, for impact evaluations, well-designed quantitative methods may better address the attribution problem. Baseline data are critical when using quantitative methods. Qualitative techniques cannot quantify the changes attributable to interventions, but they should be used to evaluate important issues for which quantification is not feasible or practical, and to develop complementary and in-depth perspectives on processes of change induced by interventions (see next section). Evaluators need a good basic knowledge of all techniques before determining what method to use to address this problem.

Chapter 5
Use a mixed-methods approach: The logic of the comparative advantages of methods

The work by Campbell and others on validity and threats to validity within experiments and other types of evaluations has left deep marks on the way researchers and evaluators have addressed methodological challenges in impact evaluation (see Campbell, 1957; Campbell and Stanley, 1963; Cook and Campbell, 1979; Shadish et al., 2002).

5.1. Different methodologies have comparative advantages in addressing particular concerns and needs

Validity can be broadly defined as the "truth of, or correctness of, or degree of support for an inference" (Shadish et al., 2002: 513). Campbell distinguished among four types of validity, which can be explained concisely by looking at the questions underlying each type:

· Internal validity: How do we establish that there is a causal relationship between intervention outputs and processes of change leading to outcomes and impacts?
· Construct validity: How do we make sure that the variables we are measuring adequately represent the underlying realities of development interventions linked to processes of change?
· External validity: How do we (and to what extent can we) generalize findings to other settings (interventions, regions, target groups, etc.)?
· Statistical conclusion validity: How do we make sure that our conclusion about the existence of a relationship between intervention and impact variable is in fact true? How can we be sure about the magnitude of change?1

Applying the logic of comparative advantages makes it possible for evaluators to compare methods on the basis of their relative merits in addressing particular aspects of validity. This provides a useful basis for methodological design choice; given the evaluation's priorities, methods that better address particular aspects of validity are selected in favor of others. In addition, the logic of comparative advantages can support decisions on combining methods to be able to address multiple aspects of validity simultaneously.

We will illustrate this logic using the example of RCTs. Internal validity usually receives (and justifiably so) a lot of attention in impact evaluation, as it lies at the heart of the attribution problem; is there a causal link between intervention outputs and outcomes and impacts? Arguably, RCTs (see § 4.2.) are viewed by many as the best method for addressing the attribution problem from the point of view of internal validity. Random allocation of project benefits reduces the likelihood that there are systematic (observable and unobservable) differences between those that receive benefits and those that do not. However, this does not necessarily make it the best method overall. For example, RCTs control for differences between groups within the particular setting that is covered by the study. Other settings have other characteristics that are not controlled; hence there are limitations of external validity here.

To resolve this issue, Duflo and Kremer (2005) propose to undertake series of RCTs on the same type of instrument in different settings. However, as argued by Ravallion, "The feasibility of doing a sufficient number of trials--sufficient to span the relevant domain of variation found in reality for a given program, as well as across the range of policy options--is far from clear. The scale of randomized trials needed to test even one quite large national program could well be prohibitive" (Ravallion, 2008: 19).

Another limitation of RCTs (also valid for other approaches discussed in § 4.2.) lies in the realm of construct validity. Does a limited set of indicators adequately represent the impact of a policy on a complex phenomenon such as poverty? In-depth qualitative methods can more adequately capture the complexity and diversity of aspects that define (and determine) poverty than the singular or limited set of impact indicators taken into account in RCTs. Consequently, qualitative methods have a comparative advantage in addressing construct validity concerns. However, a downside of most qualitative approaches is that the focus is local and findings are very context specific, with limited external validity. External validity can be adequately addressed by, for example, quantitative quasi- and non-experimental approaches that are based on large samples covering substantial diversity in context and people.

Theory-based evaluation provides the basis for combining different methodological approaches that have comparative advantages in addressing validity concerns. In addition, the intervention theory, as a structure for making causal assumptions explicit, generalizing findings, and making in-depth analyses of specific assumptions, can help strengthen internal, external, and construct validity claims.

To conclude:

· There is no single best method in impact evaluation that can always address the different aspects of validity better than others.
· Methods have particular advantages in dealing with particular validity concerns; this provides a strong rationale for combining methods.

5.2. Advantages of combining different methods and sources of evidence

In principle, each impact evaluation is in some way supported by different methods and sources of evidence. For example, even the quite technical quantitative approaches described in § 4.2 include other modes of inquiry, such as the research review to identify key variables that should be controlled for in, for example, a quasi-experimental setting. Nevertheless, there is a growing literature on the explicit use of multiple methods to strengthen the quality of the analysis.2 At the same time, the discordance between the practice and "theory" of mixed-methods research (Bryman, 2006) suggests that mixed-methods research is often more an art than a science.

Triangulation is a key concept that embodies much of the rationale behind doing mixed-methods research and represents a set of principles to fortify the design, analysis, and interpretation of findings in impact evaluation.3 Triangulation is about looking at things from multiple points of view, a method "to overcome the problems that stem from studies relying upon a single theory, a single method, a single set of data ... and from a single investigator" (Mikkelsen, 2005: 96). As can be deduced from this definition, there are different types of triangulation. Broadly, these are the following (Mikkelsen, 2005):

· Data triangulation--To study a problem using different types of data, different points in time, or different units of analysis
· Investigator triangulation--Multiple researchers looking at the same problem
· Discipline triangulation--Researchers trained in different disciplines looking at the same problem
· Theory triangulation--Using multiple competing theories to explain and analyze a problem
· Methodological triangulation--Using different methods, or the same method over time, to study a problem.

As can be observed from this list, particular methodologies already embody aspects of triangulation. Quantitative double-difference impact evaluation (see § 4.2.), for example, embodies aspects of methodological and data triangulation.

Institutions can be seen as "rules of the game" (see North, 1990), and interventions such as policies can be considered attempts to establish specific rules with the expectation (through a "theory of change") of generating certain impacts (Picciotto and Wiesner, 1997). In addition, the literature on behavioral and social mechanisms (see appendix 10; see also chapter 6) provides a wealth of explanatory insights that help evaluators better understand and frame processes of change triggered by interventions.

A good methodological practice in impact evaluation is to encourage applying these principles of triangulation as much as possible. Advantages of mixed-methods approaches to impact evaluation are the following:
Participatory impact evaluation approaches · A mix of methods can be used to assess impor- are often used to seek out and reconstruct tant outcomes or impacts of the intervention multiple (sometimes contrasting) perspectives being studied. If the results from different on processes of change and impact using diverse methods converge, then inferences about methods, often relying on teams of researchers the nature and magnitude of these impacts with different disciplinary backgrounds (that will be stronger. For example, triangulation of may include members of target groups). Theory- standardized indicators of children's educa- based evaluation often involves theory triangula- tional attainments with results from an analy- tion (see chapter 3; see also Carvalho and White sis of samples of children's academic work [2004], who refer to competing theories in their yields stronger confidence in the educational study on social funds). Moreover, it also allows for impacts observed than either method alone methodological and data triangulation by relying (especially if the methods employed have off- on different methods and sources of evidence to setting biases). test particular causal assumptions. · A mix of methods can be used to assess differ- ent facets of complex outcomes or impacts, Discipline triangulation and theory triangula- yielding a broader, richer portrait than one tion both point to the need for more diversity method alone can. For example, standardized in perspectives for understanding processes of indicators of health status could be mixed with change in impact evaluation. 
Strong pleas have onsite observations of practices related to recently been made for development evalua- nutrition, water quality, environmental risks, tors to recognize and make full use of the wide or other contributors to health, jointly yield- spectrum of frameworks and methodologies ing a richer understanding of the interven- that have emerged from different disciplines tion's impacts on targeted health behaviors. and that provide evaluation with a rich arsenal In a more general sense, quantitative impact of possibilities (Kanbur, 2003; White, 2002; evaluation techniques work well for a limited Bamberger and White, 2007). For example, set of pre-established variables (preferably when doing impact evaluations, evaluators can determined and measured ex ante) but less benefit from approaches developed in different well for capturing unintended, less expected disciplines and subdisciplines. Neo-institution- (indirect) effects of interventions. Qualita- alist economists have shown ways to study the tive methods or descriptive (secondary) data 37 I m pa c t E va l u at I o n s a n d d E v E l o p m E n t ­ n o n I E G u I d a n c E o n I m pa c t E va l u at I o n analysis can be helpful in better understand- · Case 4: A theory-based approach with qualita- ing the latter. tive methods (GEF, 2007). · One set of methods could be used to assess outcomes or impacts and another set to assess 5.3. Average effect versus distribution of the quality and character of program imple- costs and benefits mentation, including program integrity and Sometimes policy makers and stakeholders are the experiences during the implementation concerned with the question of whether an phase. intervention (for a specific context and group of · Multiple methods can help ensure that the people) has been effective overall. 
This is typically sampling frame and the sample selection a question that can be addressed by using (quasi) strategies cover the whole of the target inter- experimental evaluation techniques. However, vention and comparison populations. Many another important question, one that might sampling frames leave out important sectors not be easily answered with these techniques, is of the population (usually the most vulnerable whether and how people are differently affected groups or people who have recently moved by an intervention. 4 This question can be into the community), while respondent se- answered by using regression analysis. A regres- lection procedures often under-represent sion model can incorporate different moderator women, youth, the elderly, or ethnic minori- variables (e.g., through modeling interaction ties. This is critical because important positive effects) to analyze to what extent important or negative impacts on vulnerable groups (or characteristics co-determine outcome variables. other important sectors) are completely ig- In addition, many qualitative methods such as nored if they do not even get included in the those used for case studies can help evaluators sample. This is particularly important (and study in detail how interventions work differ- frequently ignored) where the evaluation uses ently in different situations. From a methodolog- secondary data sets, as the evaluator often ical design perspective, a mixed-methods study does not have access to information on how combining quasi-experimental survey data with the sample was selected. a limited number of in-depth, semistructured · Multiple methods are needed to address the interviews among different types of people complementary questions of average effect from the target population is an example of a and distribution of costs and benefits of an potentially good framework to provide credible intervention (see § 5.3.) answers to both questions (see box 5.1.). 
Appendix 11 presents four interesting examples When talking about the issue of distribution of of impact evaluations that are based on a mixed costs and benefits of an intervention, it is useful method perspective: to distinguish between different levels or foci. First, one should consider the issue of outreach · Case 1: Combining qualitative and quantitative or coverage. Who are the people (individuals, descriptive methods--Ex post impact study households, and communities) directly affected of the Noakhali Rural Development Project by an intervention? Sometimes this question in Bangladesh can be answered in a relatively straightforward · Case 2: Combining qualitative and quantitative manner, such as when the intervention is clearly descriptive methods--Mixed-methods impact delineated and targeted to a specific group of evaluation of International Fund for Agricul- people (e.g., a training program). In other cases tural Development projects in Gambia, Ghana, (e.g., a tax cut or construction of a road), coverage and Morocco or outreach, or indeed the delineation of the · Case 3: Combining qualitative and quanti- group of people affected by the intervention, is tative descriptive methods--Impact evalu- not that easy to determine. In the last case, the ation: agricultural development projects in issue of delineation is closely linked to the second Guinea level, how an intervention has different effects on 38 u s E a m I x E d - m E t h o d s a p p r o a c h : t h E l o G I c o f t h E c o m pa r at I v E a d va n ta G E s o f m E t h o d s Box 5.1: Brief illustration of the logic of comparative advantages Consider the example of an intervention that provides monetary · Survey data and case studies could tell how incentives have incentives and training to farmers to promote land use changes different effects on particular types of farm households (po- leading to improved livelihoods conditions. 
tentially strengthens internal validity and increases external We could use the following methods in the impact evaluation: validity of findings) · Semistructured interviews and focus group conversations · A randomized experiment could be used to assess the ef- could tell us more about the nature of effects in terms of fectiveness of different incentives on land use change and/or production, consumption, poverty, etc. (potentially enhances socio-economic effects of these changes (potentially strength- construct validity of findings). ens internal validity of findings) groups of people (e.g., how the construction of Important to note is that an analysis of the distribu- a road affects different types of businesses and tion of costs and benefits as a result of an interven- households near or farther from the new road). tion--distinguishing among coverage, effects In the case of a simple training program, the first on those who are directly affected, and indirect level (who participates, who is directly affected) effects--cannot be addressed with one particular can be neatly separated from the second (how method. If one is interested in all these questions, an intervention affects participants in different then inevitably one needs a framework of multiple ways). A third level concerns the indirect effects methods and sources of evidence. For example, of an intervention. For example, an objective of descriptive analysis of survey data can help to a training program may be that participants in map coverage, quasi-experiments can help to turn become teachers for the population at large. assess attribution of change among those directly While this is an intended indirect effect, multiple affected, and case studies and survey data analysis indirect effects on participants, their families, and can help to map indirect effects over time. non-participants may occur, some of which may be quite substantial. 
Time and scale are important Key message dimensions here (see also chapter 2). Use a mixed-methods design. Bear in mind the logic Often, impact evaluation is about level of the comparative advantages of designs and meth- two--determining the effects on those that are ods. A mix of methods can be used to assess differ- directly targeted by/participating in the interven- ent facets of complex outcomes or impacts, yielding tion. In those cases, it is often assumed that level more breadth, depth, and width in the portrait than one (targeting, outreach) is fully known and one method alone can. One set of methods could be mapped. In other cases, level one--outreach used to assess outcomes or impacts and another set and coverage or indeed the determination of to assess the quality and nature of intervention im- the scope of direct effects of an intervention on plementation, thus enhancing impact evaluation with the population at risk--is the great "unknown" information about program integrity and program and should be a first priority in an impact evalua- experiences. It is important to note that an analysis tion exercise. Level three--indirect processes of the distribution of costs and benefits of an inter- of change induced by an intervention, with vention--distinguishing among coverage, effects on potentially important implications for the those directly affected, and indirect effects--cannot distribution of costs and benefits among target be addressed with one particular method. Answer- populations and beyond--is often outside the ing these questions requires a framework of multiple scope of impact evaluations (see Ravallion, methods and sources of evidence. 2008). 39 Chapter 6 Build on existing knowledge relevant to the impact of interventions R eview and synthesis approaches are commonly associated with sys- tematic reviews and meta-analyses. 
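The core computation behind a meta-analysis, the pooled effect score, can be sketched concretely. One common variant (among several) is fixed-effect, inverse-variance pooling, in which each study's effect estimate is weighted by the inverse of its squared standard error. The figures below are invented purely for illustration:

```python
import math

# Hypothetical effect estimates (e.g., standardized mean differences) and
# their standard errors from five separate impact studies of one intervention.
studies = [(0.30, 0.10), (0.15, 0.08), (0.45, 0.20), (0.25, 0.12), (0.10, 0.15)]

weights = [1 / se ** 2 for _, se in studies]  # inverse-variance weights
pooled = sum(w * e for (e, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

low, high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled effect = {pooled:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

In a real synthesis this arithmetic is the last and simplest step; it is only meaningful after the systematic search, selection, and quality-assessment procedures discussed in this chapter, and a random-effects model may be preferred when study contexts differ substantially.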
Using these methods, comparable interventions evaluated across countries and regions can provide the empirical basis to identify "robust" performance goals and to help assess the relative effectiveness of alternative interventions under different country contexts and settings. These methods can lead to increased emphasis on the rigor of impact evaluations so they can contribute to future knowledge building as well as meet the information needs of stakeholders. These methods can also lead to a more selective approach to extensive impact evaluation, where existing knowledge is more systematically reviewed before undertaking a local impact evaluation.

"Systematic review" is a term that is used to indicate a number of methodologies that deal with synthesizing lessons from existing evidence. In general, one can define a systematic review as a synthesis of primary studies that contains an explicit statement of objectives and is conducted according to a transparent, methodical, and replicable methodology (Greenhalgh et al., 2004). Typical features of a protocol underlying a systematic review are the following (Oliver et al., 2005):

· Defining the review question(s)
· Developing the protocol
· Searching for relevant bibliographic sources
· Defining and applying criteria for including and excluding documents
· Defining and applying criteria for assessing the methodological quality of the documents
· Extracting information1
· Synthesizing the information into findings.

A meta-analysis is a quantitative aggregation of effect scores established in individual studies. The synthesis is often limited to a calculation of an overall effect score that expresses the impact attributable to a specific intervention or group of interventions. To arrive at such a calculation, meta-analysis involves a strict procedure to search for and select appropriate evidence of the impact of single interventions. The selection of evidence is based on an assessment of the methodology of the single-intervention impact study. In this type of assessment, usually a hierarchy of methods is applied in which RCTs rank highest and provide the most rigorous sources of evidence for meta-analysis. Meta-analysis differs from multicenter clinical trials in that in the former, the evaluator has no control over the single-intervention evaluations as such. As a result, despite the fact that homogeneity of implementation of similar interventions is a precondition for successful meta-analysis, meta-analysis is inevitably confronted with higher levels of variability in individual project implementation, context, and evaluation methodology than in multicenter clinical trials.

Meta-analysis is most frequently applied in professional fields such as medicine, education, and (to a lesser extent) criminal justice and social work (Clarke, 2006). Knowledge repositories such as the Campbell Collaboration and the Cochrane Collaboration rely heavily on meta-analysis as a rigorous tool for knowledge management on what works. Criticism has emerged both from within these professional fields and from other fields. In part, this criticism reflects a resistance to the idea of a "gold standard" underlying the practice of meta-analysis. The discussion has been useful in that it has helped define the boundaries of applicability of meta-analysis and the idea that, given the huge variability in parameters characterizing evaluations, there is no such thing as a gold standard (see Clarke, 2006).
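The variability concern can at least be measured. Before pooling, meta-analysts commonly compute a heterogeneity statistic such as Cochran's Q, and the I² statistic derived from it, to judge whether study-level estimates differ by more than sampling error alone would predict. A minimal sketch, again with invented figures:

```python
# Hypothetical study-level effects and standard errors (illustrative only).
effects = [0.30, 0.15, 0.45, 0.25, 0.10]
ses = [0.10, 0.08, 0.20, 0.12, 0.15]

weights = [1 / s ** 2 for s in ses]
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)

# Cochran's Q: weighted squared deviations of each study from the pooled effect.
q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
df = len(effects) - 1
# I^2: rough share of total variability beyond what chance alone would predict.
i2 = max(0.0, (q - df) / q) * 100

print(f"Q = {q:.2f} on {df} df, I^2 = {i2:.0f}%")
```

When I² is large, reporting a single pooled effect can be misleading, and a random-effects model (or no pooling at all) may be more appropriate; this is one formal expression of the "no gold standard" point above.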
Box 6.1: Narrative review and synthesis study: Targeting and impact of community-based development initiatives

The study was performed by Mansuri and Rao (2004), who reviewed the evidence on community-based development (CBD) projects funded by the World Bank. At the time, an estimated US$7 billion of World Bank projects were about CBD.

Review questions
1. Does community participation improve the targeting of private benefits such as welfare or relief?
2. Are the public goods created by community participation projects better targeted to the poor?
3. Are such goods of higher quality, or better managed, than similar public goods provided by the government?
4. Does participation lead to the empowerment of marginalized groups--does it lessen exclusion, increase the capacity for collective action, or reduce the possibility that locally powerful elites will capture project benefits?
5. Do the characteristics of external agents--donors, governments, nongovernmental organizations (NGOs), and project facilitators--affect the quality of participation or project success or failure?
6. Can community participation projects be sustainably scaled up?

To obtain relevant and reliable evidence on CBD projects, the reviewers decided to restrict the review process to peer-reviewed publications or studies conducted by independent researchers. This provided an exogenous rule that improved the quality and reduced the level of potential bias while casting a wide enough net to let in research from a variety of disciplinary perspectives on different types of CBD projects. The following sources of evidence were included: impact evaluations, which use statistical or econometric techniques to assess the causal impact of specific project outcomes; and ethnographic or case studies, which use anthropological methods such as participant observation, in-depth interviews, and focus group discussions.

Some conclusions
· Projects that rely on community participation have not been particularly effective at targeting the poor; there is some evidence that CBD/community-driven development projects create effective community infrastructure, but not a single study establishes a causal relationship between any outcome and participatory elements of a CBD project.
· A naïve application of complex contextual concepts like "participation," "social capital," and "empowerment" is endemic among project implementers and contributes to poor design and implementation.

Source: Mansuri and Rao (2004).

Partly as a response to the limitations in applicability of meta-analysis as a synthesis tool, more comprehensive methodologies of systematic review have been developed. One example is a systematic review of health behavior among young people in the United Kingdom that involves both quantitative and qualitative synthesis (see Oliver et al., 2005). The case shows that meta-analytic work on evidence stemming from what the authors call "intervention studies" (evaluation studies on similar interventions) can be combined with qualitative systematic review of "non-intervention studies," mainly research on relevant topics related to the problems addressed by the intervention. Regarding the latter, similar to the quantitative part, a systematic procedure for evidence search, assessment, and selection is applied. The difference lies mostly in the synthesis part, which in the latter case is a qualitative analysis of major findings. The two types of review can subsequently be used for triangulation purposes, reinforcing the overall synthesis findings.

Other examples of review and synthesis approaches are the narrative review and the realist synthesis. A narrative review is a descriptive account of intervention processes and/or results covering a series of interventions (see box 6.1.). Often, the evaluator relies on a common analytical framework, which serves as a basis for a template that is used for data extraction from the individual studies. In the end, the main findings are summarized in a narrative account and/or tables and matrices representing key aspects of the interventions.

A realist synthesis is a theory-based approach that helps synthesize findings across interventions. It focuses on the question of which mechanisms are assumed to be at work in a given intervention, taking into account the context the intervention operates in (see appendix 10). Although interventions often appear different, they often rely on strikingly similar mechanisms. Recognition of this can broaden the range of applicable evidence from other studies.

Combinations of meta-approaches are also possible. In a recent study on the impact of public policy programs designed to reduce and/or prevent violence in the public arena, Van der Knaap et al. (2008) have shown the relevance of combining synthesis approaches (see appendix 12).

Key message
Build on existing knowledge relevant to the impact of interventions. Review and synthesis methods can play a pivotal role in impact evaluation in synthesizing results and contributing to knowledge. Although interventions often appear different, they may rely on strikingly similar mechanisms. Recognition of this can broaden the range of applicable evidence. As there are several approaches available, it is worthwhile to try to combine (some of) them. Review and synthesis work can provide a useful basis for empirical impact analysis of a specific intervention and in some cases may even take away the need for further in-depth impact evaluation.
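To make the idea of a common analytical framework for data extraction concrete: the template can be as simple as a fixed list of fields applied to every study, with missing entries flagged explicitly so that gaps in the evidence are visible in the final tables. The field names below are invented for illustration; a real review would derive them from its review questions:

```python
import csv
import io

# Hypothetical extraction template for a narrative review: every study is
# summarized against the same fields so findings can later be tabulated.
TEMPLATE = ["study_id", "intervention_type", "context", "design",
            "outcome_measures", "main_findings", "quality_rating"]

def extract(record):
    """Force one study summary into the common template, flagging gaps."""
    return {field: record.get(field, "NOT REPORTED") for field in TEMPLATE}

# One invented study record; in practice these come from reading the studies.
studies = [
    {"study_id": "S1", "intervention_type": "farmer training",
     "design": "quasi-experiment", "main_findings": "modest income gains",
     "quality_rating": "medium"},
]

rows = [extract(s) for s in studies]
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=TEMPLATE)
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

The resulting matrix is exactly the kind of "tables and matrices representing key aspects of the interventions" that a narrative review reports, and the "NOT REPORTED" flags feed directly into the quality-assessment step.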
Part II
Managing Impact Evaluations

Chapter 7
Determine if an impact evaluation is feasible and worth the cost

Managers and policy makers sometimes assume that impact evaluation is synonymous with any other kind of evaluation. They might request an "impact evaluation" when the real need is for a quite different kind of evaluation (e.g., to provide feedback on an implementation process or to assess the accessibility of program services to vulnerable groups). Ensuring clarity about the information needed and for what purposes is a prerequisite to defining the type of evaluation to conduct.

Moreover, impact evaluation is not "the" alternative but draws on and complements rather than replaces other types of M&E activities. It should therefore be seen as one element in a cycle of potentially useful evaluations over the lifetime of an intervention. The rather traditional difference between ex ante and ex post impact evaluations remains important, where the ex ante impact assessment is, by nature, largely an activity in which "predictions" are made of any effects and side effects a particular intervention might have. Ex post impact evaluation, or simply "impact evaluation," as defined by the development community (and elsewhere), can test whether and to what extent these ex ante predictions have been correct. In fact, one of the potential uses of impact evaluations, not yet frequently applied in the field of development intervention, could be to strengthen the process of ex ante impact assessments.

When should an impact evaluation ideally be conducted?

· When there is an articulated need to obtain the information from an impact evaluation to know whether the intervention worked, to learn from it, to increase transparency of the intervention, and to know its "value for money."
· When a "readiness assessment" shows that political, technical, resource, and other practical considerations are adequate and it is feasible to do an impact evaluation. More specifically, this would include the following conditions:
  · The evaluation has a clearly defined purpose and an agreed-upon intended use, appropriate to its timing and with support of influential stakeholders.
  · There is clarity about the evaluation design. The evaluation design has to be clearly described and well justified after due consideration of alternatives and constraints.
  · The evaluation design has a chance to be credibly executed, given the nature and context of the intervention, the data and information needs, and the availability of adequate resources and expertise to conduct the evaluation.
· When an intervention has been functioning long enough to have visible effects.
· When there is sufficient scale (e.g., in terms of funding, number of people affected) to justify a thorough assessment.
· When the evaluation is likely to produce "new" knowledge, adding value to the public knowledge on the effectiveness of particular (innovative) types of interventions and the mechanisms that "do the work."

Impact evaluations may not be appropriate at particular times:

· When other valuable forms of evaluation will yield more useful information to support decisions to be made or other purposes
· When they move too many resources and too much attention away from the need to develop and use a rich spectrum of evaluation approaches and capacities
· When political, technical, practical, or resource considerations are likely to prevent a credible, rigorous, and useful evaluation
· When there are signs that the evaluation will not be used (or may be misused, for example, for political reasons).

Not all interventions should be submitted to elaborate and costly impact evaluation exercises. Rather, those sectors, regions, and intervention approaches about which less is known (including new, innovative ideas) should receive funding and support for impact evaluation. Ideally, organizations should pool their resources and expertise to select interventions of interest for rigorous and elaborate impact evaluation and consequently contribute jointly to the public good of knowledge on the impact of (under-evaluated) interventions.

Key message
Determine if an impact evaluation is feasible and worth the cost. Costs can be significant; what are the benefits? In what ways does the impact evaluation contribute to accountability, learning, and information about the "value for money" of what works? What is the likely added value of an impact evaluation in relation to what is already known about a particular intervention? What are the costs? What are the costs of estimating or measuring what would have happened without the intervention? Is the likelihood of getting accurate information on impact high enough to justify the cost of the evaluation?

Chapter 8
Start collecting data early

Although issues of data and data collection such as availability and quality often sound like "mere" operational issues that only need to be discussed on a technical level, it should not be forgotten that these aspects are of crucial importance for any impact evaluation (and any evaluation in general). Data are needed to test whether there have been changes in the dependent variables or to represent the counterfactual estimate of what the population's situation would have been if the project had not taken place. The data issue is strongly linked to the method of evaluation.

8.1. Timing of data collection
Ideally, impact evaluations should be based on data from both before and after an intervention has been implemented.1 An important question is whether the baseline period or end-line period is representative or normal. If the baseline or end-line year (or season) is not normal, then this affects the observed change over time. If, for example, the baseline year is influenced by unusually high or low agricultural production or a natural disaster, then the observed change up to the end-line year can be strongly biased. In most cases it is the timing of the intervention, or of the impact evaluation, that determines the timing of the baseline and end-line studies. This timing is not random, and evaluators need to investigate whether the baseline/end-line data are representative of "normal" periods before they draw conclusions. If not, even rigorous evaluations may produce unreliable conclusions about impacts.

An additional issue concerns short-term versus long-term effects. Depending on the intervention and its context, at the time of ex post data collection some effects might not have occurred or might not be visible yet, whereas others might wither over time. Evaluators should be aware of how this affects conclusions about impact.

8.2. Data availability
In practice, impact evaluation starts with an appraisal of existing data, the data that have been produced during the course of an intervention on inputs, processes, and outputs (and outcomes). This inventory is useful for several reasons:

· Available data are useful for reconstructing the intervention theory that further guides primary and secondary data collection efforts.
· Available data might affect the choice of methodological design or options for further data processing and analysis; for example, ex ante and ex post data sets of target groups might be complemented with other data sets to construct useful control groups; the amount and type of data available might influence the choice of whether to organize additional primary data collection efforts.
· Available data from different sources allow for triangulation of findings.

In addition, evaluators can rely on a variety of data from other sources that can be used in the evaluation process:

· National census data
· General household surveys such as Living Standards Measurement Surveys
· Specialized surveys such as demographic and health surveys
· Administrative data collected by line ministries and other public agencies (e.g., on school enrolment, use of health facilities, market prices for agricultural produce)
· Studies conducted by donor agencies, NGOs, and universities
· Administrative data from agencies, ministries, or other organizations
· Mass media (newspapers, television documentaries, etc.); these can be useful, among other things, for understanding the local economic and political context of an intervention.

Appendix 13 describes an example of an impact evaluation implemented by IEG. In 1986 the government of Ghana embarked on an ambitious program of educational reform, shortening the length of pre-university education from 17 to 12 years, reducing subsidies at the secondary and tertiary levels, lengthening the school day, and taking steps to eliminate unqualified teachers from schools. There was no clearly defined "project" for this study, but the focus was World Bank support to the subsector through four large operations. These operations had supported a range of activities, from rehabilitating school buildings to assisting in the formation of community-based school management committees. The impact evaluation relied heavily on existing data sets such as the Ghana Living Standards Survey for impact analyses.

A useful stepwise approach for assessing data availability is the following:

1. Make an inventory of the availability of data and assess its quality. Sometimes secondary data can be used to carry out the whole impact study. This is especially true for national or sector-wide interventions. More usually, secondary data can be used to buttress other data.
2. Analyze, from the perspective of the intervention theory, the necessity of additional data. The process of data gathering must be based on the evaluation design, which is, in turn, partly based on the intervention theory. Data must be collected across the results chain, not just on outcomes.
3. Assess the best way(s) to obtain additional data.
4. A comparison group sample must be of adequate size and subject to the same, or virtually the same, questionnaire or other data collection instruments. While some intervention-specific questions may not be appropriate, similar questions of a more general nature can help test for contagion.
5. It is necessary to check if other interventions, unexpected events, or other processes have influenced developments in the comparison group or the treatment group (i.e., check whether the comparison group is influenced by other processes than the treatment group).
6. Multiple instruments (e.g., at the household and facility level) are usually desirable and must be coded in such a way that they can be linked.
7. Baseline data must cover the relevant welfare indicators but preferably also the main determinants of the relevant welfare elements, so it will be easier to investigate later whether processes other than the intervention have influenced welfare developments over time. End-line data must be collected across the results chain, not just on intended outcomes.

When there is no baseline, the option of a field survey using recall on the variables of interest may be considered. Many commentators are critical of relying on recall. But all survey questions in the end are recall, so it is a question of degree. The evaluator needs to use his or her judgment (and knowledge about cognitive processes) as to what are credible data, given a respondent's capacity to recall.

8.3. Quality of the data
The quality of data can make or break any impact evaluation. Mixed methods and triangulation are strategies to reduce the problem of data quality. Yet in terms of the quality control that is needed to ensure that evaluation findings are not (heavily) biased because of data quality problems, they are insufficient.

The evaluator should ask several questions:

· What principles should we follow to improve the quality of data (collection)? Some exam-

examples are recall problems or the sensitivity of certain topics) is equally relevant for semistructured interviews and similar techniques in qualitative research. With respect to group processes in qualitative research, Cooke (2001) discusses three of the most widely cited problems: risky shift, groupthink, and coercive persuasion. A detailed discussion of these issues is beyond the scope of this guidance. However, they lead us to some important points:

· Data based on the perceptions, views, and opinions of people on the causes and effects of an intervention (e.g., target groups) do not necessarily adequately reflect the real causes of an intervention; data collected through observation, measurement, or counting (e.g., assets, farm size, infrastructure, profits) in general are less prone to measurement error (but are not always easy to collect or sufficient to cover all information needs).
ples of subquestions: · The quality of data is more often than not a · How to address missing data (missing ob- constraining factor in the overall quality of servations in a data set, missing variables). the impact evaluation; it cannot be solved by · How to address measurement error--Does sophisticated methods but might be solved the value of a variable or the answer to a in part through triangulation among data question represent the true value? sources. · How to address specification error--Does the question asked or variable measured 8.4. dealing with data constraints represent the concept that it was intended According to Bamberger et al. (2004: 8), to cover? "Frequently, funds for the evaluation were not · Does the quality of the data allow for (ad- included in the original project budget and the vanced) statistical analysis? New advances in evaluation must be conducted with a much and the more widespread use of quasi-experi- smaller budget than would normally be allocated mental evaluation and multivariate data analy- for this kind of study. As a result, it may not be sis are promising in light of impact evaluation. possible to apply the desirable data collection Yet often data quality is a constraining factor instruments (tracer studies or sample surveys, in terms of the quality of the findings (see for example), or to apply the methods for Deaton, 2005). reconstructing baseline data or creating control · In the case of secondary data, what do we groups." Data problems are often correlated with know about the data collection process that or compounded by time and budget constraints. might strengthen or weaken the validity of The scenarios laid out in table 8.1 can occur. our findings?2 Bamberger et al. (2004) describe scenarios for De Leeuw et al. (2008) discuss data quality issues working within these constraints. For example, in survey data analysis. 
Much of their discussion the implications for quasi-experimental designs on measurement error (errors resulting from are that evaluators have to rely on less robust respondent, interviewer, method, and question- designs such as ex post comparisons only (see related sources or a combination of these; appendix 14). 51 I m pa c t E va l u at I o n s a n d d E v E l o p m E n t ­ n o n I E G u I d a n c E o n I m pa c t E va l u at I o n Table 8.1: Evaluation scenarios with time, data, and budget constraints the constraints under which the evaluation must be conducted time Budget data typical Scenarios X The evaluator is called in late in the project and is told that the evaluation must be completed by a certain date so that it can be used in a decision making process or contribute to a report. The budget may be adequate but it may be difficult to collect or analyze survey data within the time frame. X The evaluation is only allocated a small budget, but there is not necessarily excessive time pressure. However, it will be difficult to collect sample survey data because of the limited budget. X The evaluator is not called in until the project is well advanced. Consequently no baseline survey has been conducted either on the project population or on a control group. The evaluation does have an adequate scope, either to analyze existing household survey data or to collect additional data. In some cases the intended project impacts may also concern changes in sensitive areas such as domestic violence, community conflict, women`s empowerment, community leadership styles, or corruption, on which it is difficult to collect reliable data--even when time and budget are not constraints. X X The evaluator has to operate under time pressure and with a limited budget. Secondary survey data may be available but there is little time and few resources to analyze it. X X The evaluator has little time and no access to baseline data or a control group. 
Funds are available to collect additional data, but the survey design is constrained by the tight deadlines.
· Budget and data: The evaluator is called in late and has no access to baseline data or control groups. The budget is limited, but time is not a constraint.
· Time, budget, and data: The evaluator is called in late, is given a limited budget, has no access to baseline survey data, and no control group has been identified.

Source: Bamberger et al. (2004).

Key message
Start collecting data early. Good baseline data are essential to understanding and estimating impact. Depending on the type of intervention, the collection of baseline data, as well as the setup of other aspects of the impact evaluation, requires an efficient relationship between the impact evaluators and the implementers of the intervention. Policy makers and commissioners need to involve experts in impact evaluation as early as possible in the intervention design to be able to design high-quality evaluations. Ensuring high-quality data collection should be part and parcel of every impact evaluation. When working with secondary data, a lack of information on the quality of data collection can restrict data analysis options and the validity of findings. Take notice of, and deal effectively with, the restrictions under which an impact evaluation has to be carried out (time, data, and money).

Chapter 9: Front-end planning is important

Front-end planning refers to the initial planning and design phase of an impact evaluation. Ad hoc commissioned impact evaluations usually do not have a long planning period, thereby risking a suboptimally planned and executed evaluation process. As good impact evaluation relies on good data, preferably including baseline data, attention to proper front-end planning should be a priority issue. Ideally, front-end planning of impact evaluations should be closely articulated to the initial design and planning phase of the policy intervention. Indeed, this articulation is most clearly visible in an RCT, in which intervention and impact evaluation are inextricably linked.

9.1. Planning tools
Clear definition of scope (chapters 1 and 2) and sound methodological design (chapters 3-6) cannot be captured in standardized frameworks. Decision trees on assessing data availability (see § 8.2) and method choice (see appendix 6) are useful, though they provide only partial answers to methodological design choice issues. Pragmatic considerations of time, budget, and data (see § 8.4) play a role, but so do culture and politics. Two tools that are particularly helpful in the planning phase of an impact evaluation are the approach paper and the evaluation matrix.

The approach paper outlines what the evaluation is about and how it will be implemented. This document can be widely circulated and gives stakeholders and others a chance to comment on and improve the intended evaluation design from an early stage. It also helps to generate broad "buy-in" or, at worst, to define the main grounds of potential disagreement between evaluators and practitioners. In addition, it is wise to use an evaluation matrix when planning and executing the work. This tool ensures that key questions are identified, together with the ways to address them, sources of data, the role of theory, and so on. It can also play an important role in stakeholder consultation, ensuring that important elements are not omitted.

9.2. Staffing and resources
Resources are important, and spending should be safeguarded up front. The longer the time horizon of a study, the more difficult this is. Resources are also important to realize the much-needed independence of the evaluator and the evaluation team. A template for assessing the independence of evaluation organizations can be downloaded from http://www.ecgnet.org/docs/ecg.doc. This document specifies a number of criteria and questions that can be asked.

Evaluation is not only a financial business but even more a people's business. So is the planning of an evaluation. As evaluation projects are usually no longer "lonely hunter" activities, staffing is crucial. So when starting the preparation of the study, a crucial point concerns addressing a number of questions:

· Who are the people who will do the evaluation?
· Under which (contractual) conditions are they doing the job?
· What is their expertise?
· Which roles will they be carrying out?

Topics that deserve attention are the following:

· The mix of disciplines and traditions that are brought together in the team.
· The competencies the team has "in stock." Competencies range from methodological expertise to negotiating with institutional actors and stakeholders, getting involved in "hearing both sides" (those evaluated and the principal), and the clearance of the report.
· The structure of the evaluation team. For the evaluation to be planned and carried out effectively, the roles of the project director, staff, and other evaluators must be made clear to all parties.
· The responsibilities of the team members.
· The more an evaluation is linked to a political "hot spot," the more it is necessary that at least one member of the team have a "political nose"--not primarily to deal with administrators and (local) politicians, but to understand when an evaluation project becomes too much of what is known as a partnerial evaluation (Pollitt, 1999).
· Also, staff should be active in realizing an adequate documentation and evaluation trail.

A range of skills is needed in evaluation work. The quality and eventual utility of the impact evaluation can be greatly enhanced with coordination between team members and policymakers from the outset. It is therefore important to identify team members as early as possible, agree on roles and responsibilities, and establish mechanisms and resources for communication during key points of the evaluation.

9.3. The balance between independence and collaboration between evaluators and stakeholders
One of the questions within the world of impact evaluations is what degree of institutional separation to put between the evaluation providers and the evaluation users. There is much to be gained from the objectivity provided by having the evaluation carried out independently of the institution responsible for the project being evaluated. Pollitt (1999) warned against "partnerial" evaluations, where the positions of stakeholders, commissioners, and evaluators blur too much.1 However, evaluations often have multiple goals, including building evaluation capacity within government agencies and sensitizing program operators to the realities of their projects once they are carried out in the field. At a minimum, the evaluation users, who can range from government agencies in client countries to bilateral and multilateral donors, international NGOs, and grassroots/civil society organizations, must remain sufficiently involved in the evaluation to ensure that the evaluation process is recognized as legitimate and that the results produced are relevant to their information needs. Otherwise, the evaluation results are less likely to be used to inform policy. The evaluation manager and his or her clients must achieve the right balance between involving the users of evaluations and maintaining the objectivity and legitimacy of the results (Baker, 2000).

9.4. Ethical issues
It is important to take ethical objections and political sensitivities seriously. There can be ethical concerns with deliberately denying a program to those who need it and providing the program to those who do not; this applies to both experimental and non-experimental methods. For example, with too few resources, randomization may be seen as a fair solution, possibly after conditioning on observables. However, the information available to the evaluator (for conditioning) is typically a partial subset of the information available "on the ground" (including to voters/taxpayers). The idea of "intention-to-treat" helps alleviate these concerns; one has a randomized assignment, but anyone is free not to participate. Even then, the "randomized out" group may include people in great need. All these issues must be discussed openly and weighed against the (potentially large) longer-term welfare gains from better information for public decision making (Ravallion, 2008).2

9.5. Norms and standards
As noted before, impact evaluations are often designed, implemented, analyzed, disseminated, and used under budget, time, and data constraints while facing diverse and often competing political interests. Given these constraints, the management of a real-world evaluation is much more complicated than textbook descriptions suggest.

Evaluations sometimes fail because the stakeholders were not involved, or the findings were not used because they did not address the stakeholders' priorities. Others fail because of administrative or political difficulties in getting access to the required data, in meeting with all the individuals and groups that should be interviewed, or in being able to ask all the questions that the evaluator feels are necessary. Many other evaluations fail because the sampling frame, often based on existing administrative data, omits important sectors of the target population--often without anyone being aware of this. In other cases the budget was insufficient, or was too unpredictable to permit an adequate evaluation to be conducted. Needless to say, evaluations also fail because of emphasizing stakeholders' participation too much, leading to partnerial evaluations (Pollitt, 1999), and because of insufficient methodological and theoretical expertise.

Although many of these constraints are presented in the final evaluation report as being completely beyond the control of the evaluator, in fact their effects could very probably have been reduced by more effective management of the evaluation. For example, a more thorough scoping analysis could have revealed many of these problems, and the client(s) could then have been made aware of the likely limitations on the methodological rigor of the findings. The client(s) and evaluator could then strategize to either seek ways to increase the budget or extend the time, or agree to limit the scope of the evaluation and what it promises to deliver. If clients understand that the current design will not hold up under the scrutiny of critics, they can find ways to help address some of the constraints: "We have found that impact evaluations generally provide rudimentary documentation of the data being used. There is evidently a trade-off between decision makers' and bureaucrats' appeal for short and crisp reports and principles for scientific documentation, but we want to emphasise that displaying descriptive statistics improves the transparency of the methodological approach" (Jerve and Villanger, 2008: 34).

International evaluation standards (such as the OECD-DAC or the United Nations Evaluation Group Norms and Standards and/or the standards and guidelines developed by national or regional evaluation associations) should be applied where appropriate (Picciotto, 2004). Greater emphasis on impact evaluation for evidence-based policy making can create greater risk of manipulation aimed at producing desirable results (House, 2008). Impact evaluations require an honest search for the truth and thus place high demands on the integrity of those commissioning and conducting them. For the sake of honest commitment to development, evaluators and evaluation units need to ensure that impact evaluations are designed and executed in a manner that limits manipulation of processes or results that lean toward any ideological or political agenda. They should also ensure that there are realistic expectations of what can be achieved by a single evaluation within existing time and resource constraints, and that findings from the evaluation are presented in ways that are accessible to the intended users. This includes finding a balance between simple, clear messages and properly acknowledging the complexities and limitations of the findings.

9.6. Ownership and capacity building
Capacity building at the level of the governmental or non-governmental agencies involved should be an explicit purpose in impact evaluation. The interaction among the international development evaluation community, the countries/regions themselves, and the academic evaluation communities should also be stimulated, as it is likely to affect the pace and quality of capacity building in impact evaluation. Capacity building will also strengthen (country and regional) ownership of impact evaluation. Providing a space for consultation and agreement on impact evaluation priorities among the different stakeholders of an intervention will also help enhance utilization and ownership.

In cases where sector-wide investment programs are financed by multidonor co-financing schemes, participating donors would make natural partners for a joint evaluation (OECD-DAC, 2000). Other factors in selecting donors as partners in joint evaluation work may also be relevant. Selecting donors with similar development philosophies, cultures, evaluation procedures and techniques, and regional affiliations, and that are geographically close, may make working together easier. Another issue may be keeping the total number of donors "manageable." Where more donors are involved, a key group of development partners (including national actors) could assume management responsibilities, and the role of the others can be more limited.

Once appropriate donors that have a likely stake in an evaluation topic are identified, the next step is to contact them and discern whether they are interested in participating. In some cases, an appropriate consortium or group may already exist, where the issue of a joint evaluation can be raised and expressions of interest easily solicited. The DAC Working Party on Aid Evaluation, the United Nations Evaluation Group, and the Evaluation Cooperation Group have a tradition of cooperation, a shared vision on evaluation, and longstanding relationships, and they have fostered numerous joint evaluations.

Key message
Front-end planning is important. It can help manage the study, its reception, and its use. When managing the evaluation, keep a clear eye on items such as costs, staffing, ethical issues, and the level of independence of the evaluator and the team versus the level of collaboration with stakeholders. Pay attention to country and regional ownership of impact evaluation and capacity building, and promote them. Providing a space for consultation and agreement on impact evaluation priorities among the different stakeholders of an intervention will help enhance utilization and ownership.

Appendixes

APPENDIX 1: EXAMPLES OF DIVERSITY IN IMPACT EVALUATION

Example 1. Evaluating the impact of a European Union-funded training project on Low External Input Agriculture in Guatemala
Within the framework of a European Union-funded integrated rural development project, financial support was provided to a training project aimed at the promotion of Low External Input Agriculture (LEIA) as a viable agricultural livelihood approach for small farmers in the highlands of western Guatemala.

The impact evaluation design of this project was based on a quasi-experimental design complemented by qualitative methods of data collection (Vaessen and De Groot, 2004). An intervention theory was reconstructed on the basis of field observations and relevant literature to make explicit the different causal assumptions of the project, facilitating further data collection and analysis. The quasi-experimental design included data collection on the ex ante and ex post situation of participants, complemented with ex post data collection involving a control group (based on judgmental matching using descriptive statistical techniques). Without complex matching procedures and with limited statistical power, the strength of the quasi-experiment relied heavily on additional qualitative information. This shift in emphasis should not give the impression of a lack of rigor. Problems such as the influence of selection bias were explicitly addressed, even if not in a formal statistical way.

Farmers' adoption behavior after the termination of the project can be characterized as selective and partial. Given the particular circumstances of small farmers (e.g., risk aversion, high opportunity costs of labor), it is not realistic to assume that a training project will bring about a complete transformation from a conventional farming system to a LEIA farming system (as assumed in the objectives). In line with the literature, the most popular practices (in this case, for example, organic fertilizers and medicinal plants) were those that offer a clear short-term return while not requiring significant investments in terms of labor or capital. Finally, an ideological faith in the absolute supremacy of LEIA practices is not in the best interest of the farmers. Projects promoting LEIA should focus on the complementary effects of LEIA practices and conventional farming techniques, encouraging each farmer to choose the best balance fitted to his/her needs.

Example 2. Assessing the impact of Swedish program aid
White and Dijkstra (2003) analyzed the impact of Swedish program aid. Their analysis accepted from the start that it is impossible to separate the impact of Swedish money from that of other donors' money. Therefore, the analysis focuses on all program aid, with nine (country) case studies that trace how program aid has affected macro-economic aggregates (like imports and government spending) and, through these indicators, economic growth. The authors discern two channels for influencing policy: money and policy dialogue. The main evaluation questions are--

1. How has the policy dialogue affected the pattern and pace of reform (and what has been the contribution of program aid to this process)?
2. What has the impact of the program aid funds (on imports, government expenditure, investment, etc.) been?
3. What has the impact of the reform programs been?

Their analytical model treats donor funds and the policy dialogue as inputs; specific economic, social, and political indicators as outputs; the main program objectives (like economic growth, democracy, human rights, and gender equality) as outcomes; and poverty reduction as the overall goal.

The analysis focuses on marginal impact and uses a combination of quantitative and qualitative approaches (interviews, questionnaires, and e-mail enquiries). The analysis of the impact of aid is largely quantitative, while the analysis of the impact of the policy dialogue is mainly qualitative.

An accounting approach is used to identify aid impact on expenditure levels and patterns, using a number of ad hoc techniques, such as analyzing behavior during surges and before versus after breaks in key series, and searching the data for other explanations of the patterns observed.

Moreover, the authors analyze the impact of aid on stabilization through--

a. The effect on imports
b. Its impact on the markets for domestic currency and foreign exchange
c. The reduction of inflationary financing of the government deficit.

In terms of the impact of program aid on reform, domestic political considerations are a key factor in determining reform: most countries have initiated reform without help from donors and have carried out some measure of reform not required by them, while ignoring others that have been required.

APPENDIX 2: THE GENERAL ELIMINATION METHODOLOGY AS A BASIS FOR CAUSAL ANALYSIS

What are the core elements of the General Elimination Methodology (also known as the modus operandi approach)? We follow Scriven (2008).1

i. The general premise is the deterministic principle: all macro events (or conditions, etc.) have a cause. This is only false at the micro level, where the uncertainty principle applies, but the latter principle has essentially no detectable effect on the truth of macro determinism (though it is easy enough to deliberately create bizarre experiments where it does).

ii. The first "premise from practice" is the list of possible causes (LOPC) of events of the type in which we are interested, e.g., learning gains, reduction of poverty, and extension of life for AIDS patients. We have used LOPCs for more than a million years, in tracking and cooking and healing and repairing, and today every detective knows the list for murder, just as every competent mechanic knows the list for a big-end rattle or a brake failure, though the knowledge is as often tacit as explicit, outside the classroom and the maintenance videos. An LOPC usually refers to causes at a certain temporal or spatial remove from the effect, and at a certain level of conceptualization, and will vary depending on these parameters; of course, the context of the investigation determines the appropriate distance parameters. The distant LOPC for murder is the list of possible motives; a more proximate one, developed in a particular case by applying the general one, is the list of suspects. When dealing with new effects, we may not be certain the list is complete, but we work with the list we have and extend it when necessary.

iii. The second practical premise is the list of the modus operandi for each of the possible causes (the MOL). Each cause has a set of footprints, a short one if it's a proximate cause, a long one if it's a remote cause, but in general the modus operandus is a sequence of intermediate or concurrent events, or a set of conditions, or a chain of events, that has to be present when the cause is effective. There's often a rubric for this; for example, in criminal (and most other) investigations into human agency, we use the rubric of means/motives/opportunity to get from the motives to the list of "suspects." The list of modus operandi is the magnifying lens that fleshes out the candidate causes from the LOPC so that we can start fitting them to the case or rejecting them, for which we use the next premise.

iv. The fourth premise comprises the "facts of the case," and these are now assembled selectively, by looking for the presence or absence of factors listed in the modus operandi of each of the LOPCs. Only those causes are (eventually) left standing whose modus operandi are completely present. Ideally, there will be just one of these, but sometimes more than one, which are then co-causes. (Note that there is no reference to counterfactuals.)

APPENDIX 3: OVERVIEW OF QUANTITATIVE TECHNIQUES OF IMPACT EVALUATION

[This appendix is a diagram in the original. It arrays quantitative techniques by whether selection effects are observed or unobserved, and by type of analysis (analysis of intervention(s) (with/without), with an explicit counterfactual, versus analysis of multiple interventions and influences). The techniques shown are: propensity score matching, regression analysis, randomized controlled trial, difference-in-difference regression, the pipeline approach, fixed effects regression, double difference (difference-in-difference), instrumental variables, and regression discontinuity.]

APPENDIX 4: TECHNICAL ASPECTS OF QUANTITATIVE IMPACT EVALUATION TECHNIQUES

Endogeneity
Selection on unobservables is an important cause of endogeneity: a correlation of one of the explanatory variables with the error term in a mathematical model. This correlation occurs when an omitted variable has an effect at the same time on the dependent variable and an explanatory variable.1 We then try to estimate the equation

Yi = a + bPi + ei ,

while in effect we have

Yi = a + bPi + (ei + ex),

where ei is a random error term and ex is the effect of the unobserved variable. P and ex are correlated, and therefore P is endogenous. Ignoring this correlation results in a biased estimate of b.
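The omitted-variable problem described here can be illustrated with a short simulation. The sketch below is not from the guidance; the coefficient values, sample size, and variable names are illustrative assumptions chosen only to make the bias visible.

```python
import numpy as np

# Illustrative sketch of omitted-variable bias (not from the guidance).
# True model: Y = a + b*P + c*X + e, where participation P is correlated
# with an unobserved variable X.
rng = np.random.default_rng(0)
n = 100_000
a, b, c = 1.0, 2.0, 3.0

X = rng.normal(size=n)             # unobserved confounder
P = 0.8 * X + rng.normal(size=n)   # intervention, correlated with X
e = rng.normal(size=n)
Y = a + b * P + c * X + e

def ols(y, *regressors):
    """OLS via least squares; returns coefficients [constant, slopes...]."""
    Z = np.column_stack([np.ones_like(y), *regressors])
    return np.linalg.lstsq(Z, y, rcond=None)[0]

b_naive = ols(Y, P)[1]    # X omitted: P "picks up" part of c*X
b_full = ols(Y, P, X)[1]  # X included: approximately unbiased

print(f"true b = {b}, X omitted: {b_naive:.2f}, X included: {b_full:.2f}")
```

With X omitted, the estimate is pulled away from b toward b + c·cov(P, X)/var(P) (about 3.46 with these illustrative parameters); including X recovers b, which is the regression-based fix described in the text for the case where the source of the selection bias is observable.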
When Intervention the source of the selection bias (X) is known, inclusion of this variable (or these variables) leads Exogenous variable to an unbiased estimate of the effect Yi = a + bPi + cXi + ei . Result An example is the effect of class size on learning achievements. The school choice of motivated When a third variable is not included in the model, (and probably well-educated) parents is probably the effect of the variable becomes part of the correlated with class size, as these parents tend error term and contributes to the "unexplained to send their children to schools with low variance." As long as this variable does not have pupil:teacher ratios. The neglect of the endoge- an effect at the same time on one of the explana- neity of class size may lead to biased estimates tory variables in the model, this does not lead (with an overestimation of the real effect of class to biased estimates. However, when this third size). When the selection effects are observable, variable has an effect on one of the explanatory a regression-based approach may be used to get variables, this explanatory variable will "pick up" an unbiased estimate of the effects. part of the error and therefore will be correlated with the error. In that case, omission of the third Figure A4.1 gives the relation between class variable leads to a biased estimate. size and learning achievements for two groups of schools: the left side of the figure shows Suppose we have the relation private schools in urban areas with pupils with relatively rich and well educated parents; the Yi = a + bPi + cXi + ei , right side shows public schools with pupils from poor remote rural areas. A neglect of the differ- where Yi is the effect, Pi is the program or ences between the two schools leads to a biased intervention, Xi is an unobserved variable, and ei estimate, as shown by the black line. Including is the error term. 
Ignoring X we try to estimate these effects in the equation leads to the smaller the equation effect of the dotted lines. 65 I m pa c t E va l u at I o n s a n d d E v E l o p m E n t ­ n o n I E G u I d a n c E o n I m pa c t E va l u at I o n Figure A4.1: Estimation of the effect of class size with and without the inclusion of a variable correlated with class size 10 9 8 7 Learning achievements 6 5 4 3 2 1 0 0 20 40 60 80 100 120 Class size double difference and regression program and with the anticipated effect of the analysis program, but we have no data on the year of The technique of "double differencing" can also birth, we may get an unbiased estimate by taking be applied in a regression analysis. Suppose the first differences of the original variables. This that the anticipated effect (Y) is a function of technique helps to get rid of the problem of participation in the project (P) and of a vector "unobservables."2 of background characteristics. In a regression equation we may estimate the effect as Instrumental variables The use of instrumental variables is another Yi = a + bPi + cXi + ei , technique to get rid of the endogeneity problem. A good instrument correlates with the where e is the error term and a, b, and c the (endogenous) intervention, but not with the parameters to be estimated. error term. This instrument is used to get an unbiased estimate of the effect of the endoge- When we analyze changes over time, we get nous variable. (taking the first differences of the variables in the model): In practice, researchers often use the method of two-stage least squares: in the first stage an (Yi,1 ­ Yi,0) = a + b(Pi,1 ­ Pi,0) + c (Xi,1 ­ Xi,0) + ei. exogenous variable (Z) is used to give an estimate of the endogenous intervention-variable (P): When the (unobserved) variables X are time invariant, (Xi,1 ­ Xi,0) = 0, and these variables Pi' = a + dZi + ei . drop from the equation. Suppose, for instance that a variable X denotes the "year of birth." 
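The first-differencing argument in the double-difference section can also be checked with a minimal simulation (hypothetical data; a cohort-style variable plays the role of the time-invariant unobservable that drives selection):

```python
import numpy as np

# First-differencing sketch: a time-invariant unobservable X biases the
# levels regression, but X drops out of the differenced equation.
# All data and parameter values are invented for illustration.
rng = np.random.default_rng(1)
n = 5_000
X = rng.normal(size=n)                       # time-invariant unobservable
P0 = np.zeros(n)                             # nobody treated at baseline
P1 = (X + rng.normal(size=n) > 0).astype(float)  # selection depends on X
b = 2.0                                      # true program effect
Y0 = 1.0 + b * P0 + 1.5 * X + rng.normal(size=n)
Y1 = 1.5 + b * P1 + 1.5 * X + rng.normal(size=n)  # intercept shift = time trend

def slope(y, p):
    A = np.column_stack([np.ones_like(y), p])
    return np.linalg.lstsq(A, y, rcond=None)[0][1]

b_level = slope(Y1, P1)            # levels: biased, treated units have higher X
b_diff = slope(Y1 - Y0, P1 - P0)   # first differences: X cancels out
```

`b_diff` recovers the true effect of 2.0 up to sampling noise, while `b_level` is badly biased upward, illustrating why differencing removes time-invariant "unobservables."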
For In the second stage this new variable is used every individual the year of birth in year 1 = year to get an unbiased estimate of the effect of the of birth in year and therefore (Xi,1 ­ Xi,0) = 0. So, intervention: if we expect that the year of birth is correlated with the probability of being included in the Yi = a + bP'i + cXi + ei . 66 a p p E n d I x 4 : t E c h n I c a l a s p E c t s o f q u a n t I tat I v E I m pa c t E va l u at I o n t E c h n I q u E s the computation of propensity scores where pi is the probability of being included in The method of propensity score matching the intervention group and X, Y, and Z denote involves forming pairs by matching on the specific observed characteristics. In this model, probability that subjects have been part of the the probability is a function of the observed treatment group. The method uses all available characteristics. Rosenbaum and Rubin (1983) information to construct a control group. A proved that when subjects in the control group standard way to do this is using a probit or logit have the same probability of being included in regression model. In a logit specification, we get the treatment group as subjects who actually belong to the treatment group, the treatment and ln (pi / (1­pi)) = a + bXi + cYi + dZi +ei , control groups will have similar characteristics. 67 APPENDIX 5: EVALUATIONS USING QUANTITATIVE IMPACT EVALUATION APPROACHES1 Agriculture and rural development Case study: Philippines The project: The Second Rural Credit Projects Case study: Pakistan (SRCP) operated between 1969 and 1974 with a The projects: Irrigation in Pakistan suffers from US$12.5 million loan from the World Bank. SRCP the "twin menaces" of salinity and waterlogging. was the continuation of a pilot credit project These problems have been tackled through started in 1965 and completed in 1969. 
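The two-stage least squares procedure can be sketched under the stated assumptions (Z shifts participation P but is unrelated to the error term). All values below are invented for illustration:

```python
import numpy as np

# Two-stage least squares sketch (hypothetical data): the instrument Z is
# exogenous; the unobserved confounder U makes P endogenous.
rng = np.random.default_rng(2)
n = 20_000
Z = rng.normal(size=n)                      # instrument: exogenous
U = rng.normal(size=n)                      # unobserved confounder
P = 0.8 * Z + U + rng.normal(size=n)        # participation: depends on U
Y = 1.0 + 2.0 * P + U + rng.normal(size=n)  # true effect of P is 2.0

def ols(y, x):
    A = np.column_stack([np.ones_like(y), x])
    return np.linalg.lstsq(A, y, rcond=None)[0]

b_naive = ols(Y, P)[1]       # biased: P correlates with the error via U
a1, d = ols(P, Z)            # first stage: P' = a + d*Z
P_hat = a1 + d * Z
b_2sls = ols(Y, P_hat)[1]    # second stage: approximately the true 2.0
```

The second-stage coefficient on the fitted participation variable is close to the true 2.0, while the naive regression of Y on P is biased upward, matching the text's account of why the instrument is needed.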
As its Salinity Control and Reclamation Projects successful predecessor, SRCP aimed to provide (SCARPs), financed in part by the Bank. Although credit to small and medium rice and sugar farmers technically successful, SCARP tubewells imposed for the purchase of farm machinery, power tillers, an unsustainable burden on the government's and irrigation equipment. Credits were to be budget. The project was to address this problem channeled through 250 rural banks scattered in areas with plentiful groundwater by closing around the country. An average financial contribu- public tubewells and subsidizing farmers to tion to the project of 10% was required from both construct their own wells. rural banks and farmers. The SRCP was followed by a third loan of US$22.0 million from 1975 to Methodology: The Independent Evaluation Group 1977 and by a fourth loan of US$36.5 million that (IEG) commissioned a survey in 1994 to create was still in operation at the time of the evaluation a panel from two earlier surveys undertaken in (1983). 1989 and 1990. The survey covered 391 farmers in project areas and 100 from comparison areas. Methodology: The study uses data of a survey of Single and double differences of group means are 738 borrowers (nearly 20% of total project benefi- reported. ciaries) from seven provinces of the country. Data were collected through household question- Findings: The success of the project was that naires on land, production, employment, and the public tubewells were closed without measures of standard of living. In addition, 47 the public protests that had been expected. banks were surveyed to measure the impact on Coverage of private tubewells grew rapidly. their profitability, liquidity, and solvency. The However, private tubewells grew even more study uses before-and-after comparisons of rapidly in the control area. This growth may means and ratios to assess the project impact be a case of contagion, though a demonstra- on farmers. 
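The logit-based propensity score computation described in Appendix 4 can be sketched as follows. This is one simple way to implement it, with invented data and a single observed characteristic: the logit is fit by Newton-Raphson, and each treated subject is paired with the untreated subject whose estimated score is closest.

```python
import numpy as np

# Propensity score sketch (hypothetical data): fit ln(p/(1-p)) = a + b*X,
# then pair each treated subject with the nearest-scoring control.
rng = np.random.default_rng(3)
n = 2_000
X = rng.normal(size=n)                              # observed characteristic
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * X)))     # true participation probability
T = (rng.random(size=n) < p_true).astype(float)     # treatment indicator

A = np.column_stack([np.ones(n), X])
beta = np.zeros(2)
for _ in range(25):                                 # Newton-Raphson for the logit MLE
    p = 1.0 / (1.0 + np.exp(-A @ beta))
    W = p * (1.0 - p)
    beta += np.linalg.solve((A.T * W) @ A, A.T @ (T - p))

scores = 1.0 / (1.0 + np.exp(-A @ beta))            # estimated propensity scores
treated, control = np.where(T == 1)[0], np.where(T == 0)[0]
pairs = {i: control[np.argmin(np.abs(scores[control] - scores[i]))]
         for i in treated}                          # nearest-neighbor matching
```

The matched controls then resemble the treated group on the observed characteristics, which is the Rosenbaum-Rubin balancing property the text cites.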
National level data are often used to tion effect. But it seems more likely that other validate the effects observed. Regarding the rural factors (e.g., availability of cheaper tubewell banks, the study compares measures of financial technology) were behind the rapid diffusion performance before and after the project, taking of private water exploitation. Hence the advantage of the fact that the banks surveyed project did not have any impact on agricultural joined the project at different stages. productivity or incomes. It did, however, have a positive rate of return by virtue of the savings Findings: The mechanization of farming did not in government revenue. produce an expansion of holding sizes (though 69 I m pa c t E va l u at I o n s a n d d E v E l o p m E n t ­ n o n I E G u I d a n c E o n I m pa c t E va l u at I o n the effect of a contemporaneous land reform in nutritional status of participating children over should be taken into account). Mechanization did time, differential participation, and differential not change cropping patterns, and most farmers project impact across social groups. Data on the were concentrating on a single crop at the time of change in nutritional status in project areas are the interviews. No change in cropping intensity compared to secondary data on the nutritional was observed, but production and productivity status of children outside the project areas. With were found to be higher at the end of the project. some assumptions, the use of secondary data The project increased the demand for both family makes the findings plausible. and hired labor. Farmers reported an increase in incomes and savings, and in several other welfare Findings: The study concludes that the implemen- indicators, as a result of the project. 
Regarding the tation of GMP programs on a large scale is project impact on rural banks, the study observes feasible and that this had a positive impact on the an increase in the net income of the sample banks nutritional status of children of Tamil Nadu. More from 1969 to 1975 and a decline thereafter. Banks' specifically, these are the findings of the study: liquidity and solvency position was negatively affected by poor collection and loan arrears. · Program participation: Among children par- , ticipating in GMP all service delivery indica- health, nutrition, and population tors (age at enrolment, regular attendance of sessions, administration of vitamin A, and de- Case study: India worming) show a substantial increase between The project: The Tamil Nadu Integrated Nutrition 1982 and 1986, though subsequently they de- Project (TINP) operated between 1980 and 1989, clined to around their initial levels. Levels of with a credit of US$32 million from the Interna- service delivery, however, are generally high. tional Development Association (IDA). The · Nutritional status: Mean weight and malnutrition overall objective of the project was to improve rates of children aged between 6 and 36 months the nutritional and health status of pre-school and participating in GMP have improved over children, pregnant women, and nursing mothers. time. Data on non-project areas in Tamil Nadu The intervention consisted of a package of and all-India data show a smaller improvement services including nutrition education, primary over the same time period. Regression analy- health care, supplementary feeding, administra- sis of nutritional status on a set of explanatory tion of vitamin A, and periodic de-worming. The variables, including the participation in a cotem- project was the first to employ Growth Monitor- poraneous nutrition project (the National Meal ing and Promotion (GMP) on a large scale. 
The Program) shows that the latter had no addi- evaluation is concerned with the impact of the tional benefit on nutritional outcomes. Positive project on the nutritional status of children. associations are also found between nutritional status and intensive participation in the pro- Methodology: The study uses three cross- gram and complete immunization. sectional rounds of data collected by the TINP · Targeting: Using tabulations and regression Monitoring Office. Child and household charac- analysis, it is shown that initially girls have ben- teristics of children participating in the program efited more from the program, but that at the were collected in 1982, 1986, and 1990, each end of the program boys have benefited more. round consisting of between 1,000 and 1,500 Children from the scheduled caste are shown observations. The study uses before-and-after to have benefited more than other groups. Nu- comparisons of means, regression analysis, and tritional status was observed to be improving at charts to provide evidence of the following: all income levels, the highest income category frequency of project participation, improvement benefiting slightly more than the lowest. 70 APPENDIX 6: DECISION TREE FOR SELECTING QUANTITATIVE EVALUATION DESIGNS TO DEAL WITH SELECTION BIAS decision tree for impact evaluation possible? If the treatment group is chosen at design using quantitative impact random, then a random sample drawn from the evaluation techniques sample population is a valid control group and 1. If the evaluation is being designed before will remain so provided they are outside the the intervention (ex ante), is randomization influence zone and contamination is avoided. Implement an ex Yes ante randomized experimental design Is a randomized Implement a suitable Yes design Yes quasi-experimental feasible? design Is evaluation Is selection Use panel being No based on Yes data-based designed observables? design ex ante? 
Are the unobservables No No time invariant? Is selection based on No observables? Use Is there a group of Can a means well-triangulated No as-yet-untreated No be found to plausible beneficiaries? observe them? association Use the pipeline Yes Yes approach Implement a suitable Yes quasi-experimental design Source: SG1 (2008). 71 I m pa c t E va l u at I o n s a n d d E v E l o p m E n t ­ n o n I E G u I d a n c E o n I m pa c t E va l u at I o n This approach does not mean that targeting (a panel of persons, households, etc.) and specific analytical units is not possible. The selection is determined by unobservables, random allocation may be to a subgroup of then some means of observing the supposed the total population, e.g., from the poorest unobservables should be sought. If that is districts. not possible, then a pipeline approach can 2. If randomization is not possible, are all be used if there are as-yet untreated benefi- selection determinants observed? If they ciaries. For example, the Asian Development are, then there are a number of regression- Bank's impact study of microfinance in the based approaches that can remove the Philippines matched treatment areas with selection bias. areas that were in the program but that had 3. If the selection determinants are unobserved not yet received the intervention. and if they are thought to be time invari- 5. If none of the above mentioned procedures is ant, then using panel data will remove their possible, then the problem of selection bias influence, so a baseline is essential (or some cannot be addressed. The impact evaluation means of substituting for a baseline). will have to rely heavily on the intervention 4. If the study is done ex post so it is not possible theory and triangulation to build an argument to get information for exactly the same units by plausible association. 
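The five-step decision logic of the Appendix 6 tree can be condensed into a function. The argument names and returned labels are illustrative, not taken verbatim from the source:

```python
# Condensed sketch of the design-selection decision tree (labels are invented).

def choose_design(ex_ante: bool, randomization_possible: bool,
                  selection_observed: bool, unobservables_time_invariant: bool,
                  pipeline_group_exists: bool) -> str:
    if ex_ante and randomization_possible:
        return "randomized experimental design"
    if selection_observed:
        return "regression-based quasi-experimental design"
    if unobservables_time_invariant:
        return "panel data-based design (baseline essential)"
    if not ex_ante and pipeline_group_exists:
        return "pipeline approach (as-yet-untreated beneficiaries as comparison)"
    return "intervention theory, triangulation, and plausible association"

print(choose_design(ex_ante=False, randomization_possible=False,
                    selection_observed=False, unobservables_time_invariant=False,
                    pipeline_group_exists=True))
# → pipeline approach (as-yet-untreated beneficiaries as comparison)
```

The final branch corresponds to step 5: when no bias-removing design is available, the evaluation falls back on intervention theory and plausible association.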
APPENDIX 7: HIERARCHICAL MODELING AND OTHER STATISTICAL APPROACHES

This group of approaches covers a quite diverse set of advanced modeling and statistical approaches. Detailed discussion of these technical features is beyond the scope of this document. The common element that binds these approaches is the purpose of modeling and estimating direct and indirect effects of interventions at various levels of aggregation (from micro to macro). At the risk of substantial oversimplification we briefly mention a few of the approaches. In hierarchical modeling, evaluators and researchers look at the interrelationships between different levels of a program. The goal is "to measure the true and often intertwined effects of the program. In a typical hierarchical linear model analysis, for example, the emphasis is on how to model the effect of variables at one level on the relations occurring at another level. Such analyses often attempt to decompose the total effect of the program into the effect across various program levels and that between program sites within a level (Dehejia, 1999)" (Yang et al., 2004: 494).

Also part of this branch of approaches is a range of statistical approaches such as nested models, models with latent variables, multi-level regression approaches, and others (see, for example, Snijders and Bosker 1999). Other examples are typical economist tools such as partial equilibrium analyses; computable general equilibrium models (CGEs) are often used to assess the impact of, for example, macroeconomic policies on markets and, subsequently, on household welfare (see box A7.1).

Box A7.1: Impact of the Indonesian financial crisis on the poor: Partial equilibrium modeling and CGE modeling with microsimulation

General equilibrium models permit the analyst to examine explicitly the indirect and second-round consequences of policy changes. These indirect consequences are often larger than the direct, immediate impact, and may have different distributional implications. General equilibrium models and partial equilibrium models may thus lead to significantly different conclusions. A comparison of conclusions reached by two sets of researchers, examining the same event using different methods, reveals the differences between the models. Levinsohn et al. (1999) and Robillard et al. (2001) both look at the impact of the Indonesian financial crisis on the poor--the former using partial equilibrium methods, the latter using a CGE model with micro-simulation.

The Levinsohn study used consumption data for nearly 60,000 households from the 1993 SUSENAS survey, together with detailed information on price changes over the 1997-98 crisis period, to compute household-specific cost-of-living changes. It finds that the poorest urban households were hit hardest by the shock, experiencing a 10%-30% increase in the cost of living (depending on the method used to calculate the change). Rural households and wealthy urban households actually saw the cost of living fall.

These results suggest that the poor are just as integrated into the economy as other classes but have fewer opportunities to smooth consumption during a crisis. However, the methods used have at least three serious drawbacks. First, the consumption parameters are fixed; that is, no substitution is permitted between more expensive and less expensive consumption items. Second, the results are exclusively nominal, in that the welfare changes are due entirely to changes in the price of consumption and do not account for any concomitant change in income. Third, this analysis cannot control for other exogenous events, such as the El Niño drought and resulting widespread forest fires.

Robillard et al. (2001) use a CGE model, connected to a microsimulation model. The results are obtained in two steps. First, the CGE is run to derive a set of parameters for prices, wages, and labor demand. These results are fed into a micro-simulation model to estimate the effects on each of 10,000 households in the 1996 SUSENAS survey. In the microsimulation model, workers are divided into groups according to sex, residence, and skill. Individuals earn factor income from wage labor and enterprise profits, and households accrue profits and income to factors in proportion to their endowments. Labor supply is endogenous. The micro-simulation model is constrained to conform to the aggregate levels provided by the CGE model. The Robillard team finds that poverty did increase during the crisis, although not as severely as the previous results suggest. Also, the increase in poverty was due in equal parts to the crisis and to the drought. Comparing their microsimulation results to those produced by the CGE alone, the authors find that the representative household model is likely to underestimate the impact of shocks on poverty. In contrast, ignoring both substitution and income effects, as Levinsohn et al. (1999) do, is likely to lead to overestimating the increase in poverty, since it does not permit the household to reallocate resources in response to the shock.

Source: World Bank (2003).

APPENDIX 8: MULTI-SITE EVALUATION APPROACHES

Multi-site evaluation approaches involve primary data collection processes and analyses at multiple sites or interventions. They usually focus on programs encompassing multiple interventions implemented in different sites (Turpin and Sinacore, 1991; Straw and Herrell, 2002). Although these approaches are often referred to as a family of methodologies, in what follows, and in line with the literature, we will use a somewhat narrower definition of multi-site evaluations alongside several specific methodologies to address the issue of aggregation and cross-site evaluation of multiple interventions.

Straw and Herrell (2002) use the term "multi-site evaluation" both as an overarching concept, i.e., including cluster evaluation and multi-center clinical trials, as well as a particular type of multi-level evaluation distinguishable from cluster evaluation and multi-center clinical trials. Here we use the latter definition to refer to a particular (though rather flexible) methodological framework applicable to the evaluation of comprehensive multilevel programs addressing health, economic, environmental, or social issues.

The multi-center clinical trial is a methodology in which empirical data collection in a selection of homogenous intervention sites is systematically organized and coordinated. Basically it consists of a series of randomized controlled trials. The latter are experimental evaluations in which treatment is randomly assigned to a target group while a similar group not receiving the treatment is used as a control group. Consequently, changes in impact variables between the two groups can be traced back to the treatment, as all other variables are assumed to be similar at group level. In the multi-center clinical trial, sample size is increased and multiple sites are included in the experiment in order to strengthen the external validity of the findings. Control over all aspects of the evaluation is very tight to keep as many variables as possible constant over the different sites. Applications are mostly found in the health sector (see Kraemer, 2000).

Multi-site evaluation distinguishes itself from cluster evaluation in the sense that its primary purpose is summative. In addition, multi-site evaluations are less participatory in nature vis-à-vis intervention staff. In contrast to settings in which multi-center clinical trials are applied, multi-site evaluations address large-scale programs that, because of their (complex) underlying strategies, implementation issues, or other reasons, are not amenable to controlled experimental impact evaluation designs. Possible variations in implementation among intervention sites, and variations in terms of available data, require a different, more flexible approach to data collection and analysis than in the case of the multi-center clinical trials. A common framework of questions and indicators is established to counter this variability, enabling data analysis across interventions in function of establishing generalizable findings (Straw and Herrell, 2002).

Cluster evaluation is a methodology that is especially useful for evaluating large-scale interventions that address complex societal themes such as education, social service delivery, and health promotion. Within a cluster of projects under evaluation, implementation among interventions may vary widely, but single interventions are still linked in terms of common strategies, target populations, or problems that are addressed (Worthen and Schmitz, 1997).

The approach was developed by the Kellogg Foundation in the 1990s and since then has been taken up by other institutions. Four elements characterize cluster evaluation (Kellogg Foundation, 1991):
· It focuses on a group of projects in order to identify common issues and patterns.
· It focuses on what happened as well as why.
· It is based on a collaborative process involving all relevant actors, including evaluators and individual project staff.
· Project-specific information is confidential and not reported to the higher level; evaluators only report aggregate findings; this type of confidentiality between evaluators and project staff induces a more open and collaborative environment.

Cluster evaluation is typically applied during program implementation (or during the planning stage) in close collaboration with stakeholders from all levels. Its purpose is, on the one hand, formative, as evaluators in close collaboration with stakeholders at project level try to explore common issues as well as variations between sites. At the program level the evaluation's purpose can be both formative, in terms of supporting planning processes, as well as summative, i.e., judging what went wrong and why. A common question at the program level would be, for example, to explore the factors that in the different sites are associated with positive impacts. In general, the objective of cluster evaluations is not so much to prove as to improve, based on a shared understanding of why things are happening the way they are (Worthen and Schmitz, 1997). It should be noted that not only cluster evaluations but also multi-site evaluations are applicable to homogenous programs with little variation in terms of implementation and context among single interventions.

APPENDIX 9: METHODOLOGICAL FRAMEWORKS FOR ASSESSING THE EFFECTS OF INTERVENTIONS, MAINLY BASED ON QUALITATIVE METHODS1

outcome mapping
Outcome mapping (IDRC, 2001) is a methodology that focuses on outcomes as behavioral change. The outcomes can be logically linked to an intervention's activities, although they may not necessarily be directly caused by them. These changes are aimed at contributing to specific aspects of human and ecological well-being by providing partners with new tools, techniques, and resources to contribute to the development process. "Boundary partners" are individuals, groups, and organizations with whom the intervention interacts directly and with whom the intervention anticipates opportunities for influence; most activities will involve multiple outcomes because they have multiple boundary partners.

Most significant change
The most significant change technique (Davies and Dart, 2005) is a form of participatory monitoring and evaluation. It is participatory because many intervention stakeholders are involved both in deciding the types of change to be recorded and in analyzing the data. It is a form of monitoring because it occurs throughout the intervention cycle and provides information to help people manage the intervention. It contributes to impact evaluation in part because it provides data on impact and outcomes that can be used to help assess the performance of the intervention as a whole--but largely through providing a tool for identifying and rating the impacts that are valued by different stakeholders.

Success case method
The success case method (Brinkerhoff, 2003) is a widely adopted example of a mixed-method framework, drawing from several established traditions, including theory-based evaluation, organizational development, appreciative inquiry, narrative analysis, and quantitative statistical analysis of impact. It has been expanded in scope by those who combine it with realist methodologies (e.g., Dart) and soft systems methodologies (e.g., Williams). It also shares much in common with the positive deviance approach that has been applied to health interventions in many developing countries. The success case method identifies individual cases that have been particularly successful (and unsuccessful) and uses case study analytical methods to develop credible arguments about the contribution of the intervention to these.

MAPP
The Method for Impact Assessment of Projects and Programs (Späth, 2004) is a methodological framework for combining a qualitative approach with participatory assessment instruments, including a quantification step. It orients itself toward principles and procedures of Participatory Rural Appraisal methodology, including triangulation, "optimal ignorance," and communal learning. A major element of this methodology is conducting workshops with representatives of relevant stakeholders. Perceived key processes are jointly reflected on in structured group discussions in which at least six interlinked and logically connected steps are accomplished: (i) lifeline; (ii) trend analysis; (iii) activity list; (iv) influence matrix; (v) transect--or data cross-checking; and (vi) development and impact profile.
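The core idea behind hierarchical modeling, that program effects live at more than one level, can be illustrated with a minimal two-level simulation. The data and parameters below are invented; the decomposition into between-site and within-site variance (the intraclass correlation) is the simplest version of what multi-level models elaborate on:

```python
import numpy as np

# Two-level sketch (hypothetical data): outcomes vary both across program
# sites (level 2) and across individuals within a site (level 1).
rng = np.random.default_rng(4)
n_sites, n_per_site = 50, 40
site_effect = rng.normal(scale=1.0, size=n_sites)        # level-2 variation
y = site_effect[:, None] + rng.normal(scale=2.0, size=(n_sites, n_per_site))

site_means = y.mean(axis=1)
between = site_means.var()            # variance of site means (level 2)
within = y.var(axis=1).mean()         # average within-site variance (level 1)
icc = between / (between + within)    # rough intraclass correlation
```

With these parameters the intraclass correlation comes out near 0.2, i.e., about a fifth of the outcome variance sits between sites; a hierarchical linear model would go on to explain that site-level share with site-level variables.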
APPENDIX 10: WHERE TO FIND REVIEWS AND SYNTHESIS STUDIES ON MECHANISMS UNDERLYING PROCESSES OF CHANGE

Books on social mechanisms
Authors like Elster (1989; 2007), Farnsworth (2007), Hedström and Swedberg (1998), Swedberg (2005), Bunge (2004), and Mayntz (2004) have summarized and synthesized the research literature on different (types of) social mechanisms. Elster's explanation of social behavior (2007) summarizes insights from the neurosciences to economics and political science and discusses 20-plus mechanisms. They range from motivations, emotions, and self-interest to rational choice, games and behavior, and collective decision making.

Farnsworth (2007) takes legal arrangements like laws and contracts as a starting point and dissects which (types of) mechanisms play a role when one wants to understand why laws sometimes do or do not work. He combines insights from psychology, economics, and sociology and discusses mechanisms such as the "slippery slope," the endowment effect, framing effects, and public goods production.

review journals
Since the 1970s review journals have been developed to address important developments within a discipline. An example is Annual Reviews, which publishes analytic reviews in 37 disciplines within the biomedical, life, physical, and social sciences.

Knowledge repositories
Hansen and Rieper (2009) have inventoried a number of second-order evidence-producing organizations within the social (and behavioral) sciences. In recent years the production of systematic reviews has been institutionalized in these institutions. There are two main international organizations: the Cochrane Collaboration, working within the health field; and the Campbell Collaboration, working within the fields of social welfare, education, and criminology. Both organizations subscribe to the idea of producing globally valid knowledge about the effects of interventions, if possible through synthesizing the results of primary studies designed as RCTs and using meta-analysis as the form of synthesis. In many (Western) countries second-order knowledge-producing organizations have been established at the national level, not all of which are based on findings from RCTs. Hansen and Rieper (2009) present information about some 15 of them, including web addresses.

Knowledge repositories and development intervention impact
The Coalition for Evidence-Based Policy offers "Social Programs That Work," a Web site providing policy makers and practitioners with clear, actionable information on what works in social policy, as demonstrated in scientifically valid studies (www.evidencebasedprograms.org/).

The International Organization for Cooperation in Evaluation, a loose alliance of regional and national evaluation organizations from around the world, builds evaluation leadership and capacity in developing countries, fosters the cross-fertilization of evaluation theory and practice around the world, addresses international challenges in evaluation, and assists evaluation professionals to take a more global approach to identifying and solving problems. It offers links to other evaluation organizations; forums that network evaluators internationally; news of events and important initiatives; and opportunities to exchange ideas, practices, and insights with evaluation associations, societies, and networks (http://ioce.net).

The Abdul Latif Jameel Poverty Action Lab (J-PAL) fights poverty by ensuring that policy decisions are based on scientific evidence. Located in the Economics Department at the Massachusetts Institute of Technology, J-PAL brings together a network of researchers at several universities who work on randomized evaluations. It works with governments, aid agencies, bilateral donors, and nongovernmental organizations to evaluate the effectiveness of antipoverty programs using randomized evaluations, disseminate findings and policy implications, and promote the use of randomized evaluations, including by training practitioners to carry them out (www.povertyactionlab.com/).

The Development Impact Evaluation Initiative (DIME) is a World Bank-led effort involving thematic networks and regional units under the guidance of the Bank's Chief Economist. Its objectives are--
· To increase the number of Bank projects with impact evaluation components
· To increase staff capacity to design and carry out such evaluations
· To build a process of systematic learning based on effective development interventions with lessons learned from completed evaluations.

APPENDIX 11: EVALUATIONS BASED ON QUALITATIVE AND QUANTITATIVE DESCRIPTIVE METHODS

Case 1: Combining qualitative and quantitative descriptive methods--Ex post impact study of the Noakhali Rural Development Project in Bangladesh1

1. Summary
The evaluation examined the intended and unintended socio-economic impacts of the project, with particular attention to the impact on women and to the sustainability and sustainment of these impacts. The evaluation drew on a wide range of existing evidence and also used mixed methods to generate additional evidence; because the evaluation was conducted nine years after the project had ended, it was possible to directly investigate the extent to which impacts had been sustained. Careful attention was paid to differential impacts in different contexts to interpret the significance of before/after and with/without comparisons; the intervention was only successful in contexts that provided the other necessary ingredients for success. The evaluation had significant resources and was preceded by considerable planning and review of existing evidence.

2. Summary and main characteristics
The Noakhali Rural Development Project (NRDP) was an integrated rural development project (IRDP) in Bangladesh, funded for DKK 389 million by Danida. It was implemented in two phases over a period of 14 years, 1978-92, in the greater Noakhali district, one of the poorest regions of Bangladesh, which had a population of approximately 4 million. More than 60 long-term expatriate advisers--most of them Danish--worked 2-3 years each on the project together with a Bangladeshi staff of up to 1,000 (at the peak).

During NRDP-I the project comprised activities in 14 different areas grouped under four headings:
· Infrastructure (roads, canals, market places, public facilities)
· Agriculture (credit, cooperatives, irrigation, extension, marketing)
· Other productive activities (livestock, fish ponds, cottage industries)
· Social sector (health & family planning, education).

The overarching objective of NRDP-I was to promote economic growth and social progress, in particular aiming at the poorer sections of the population. The poorer sections were to be reached through the creation of temporary employment in construction activities (infrastructure) and engaging them in income-generating activities (other productive activities). There was also an aim to create more employment in agriculture for landless laborers through intensification. Almost all the major activities started under NRDP-I continued under NRDP-II, albeit with some modifications and additions. The overarching objective was kept, with one notable addition: to promote economic growth and social progress, in particular aiming at the poorer segments of the population, including women. A special focus on women was thus included, based on the experience that most of the benefits of the project had accrued to men.

3. Purpose, intended use, and key evaluation questions
This ex post impact study was carried out nine years after the project was terminated. At the time of implementation NRDP was one of the
At the time of implementation NRDP was one of the largest projects funded by Danida, and it was considered an excellent example of integrated rural development, which was a common type of support during the 1970s and '80s. In spite of the potential lessons to be learned from the project, it was not evaluated upon completion in 1992. This fact and an interest in the sustainability factor in Danish development assistance led to the commissioning of the study. What type of impact could still be traced in Noakhali nine years after Danida terminated its support to the project?

Although the study dealt with aspects of the project implementation, its main focus was on the project's socioeconomic impact in the Noakhali region. The study aimed to identify the intended as well as unintended impact of the project, in particular whether it had stimulated economic growth and social development and improved the livelihoods of the poor, including women, which the project had set out to do.

The evaluation focused on the following questions:

· What has been the short- and long-term--intended as well as unintended--impact of the project?
· Has the project stimulated economic growth and social development in the area?
· Has the project contributed to improving the livelihoods of the poorest section of the population, including women?
· Have the institutional and capacity-building activities engendered or reinforced by the project produced sustainable results?

4. Concise description of the evaluation

Identifying impacts of interest
This study focuses on the impact of NRDP, in particular the long-term impact (i.e., nine years after). But impact cannot be understood in isolation from implementation, so the study analyzes various elements and problems in the way the project was designed and executed. Nor can impact be understood in isolation from the context, both the natural/physical context and in particular the societal (social, cultural, economic, political) context. In comparison with ordinary evaluations, this study puts far more emphasis on understanding the national and, in particular, the local context.

Gathering evidence of impacts
One of the distinguishing features of this impact study, compared to normal evaluations, is the order and kind of fieldwork. The fieldwork lasted four months and involved a team of eight researchers (three European and five Bangladeshi) and 15 assistants. The researchers spent 1.5-3.5 months in the field, the assistants 2-4 months.

The following is a list of the methods used:

· Documentary study (project documents, research reports, etc.)
· Archival work (in the Danish embassy, Dhaka)
· Questionnaire with former advisers and Danida staff members
· Stakeholder interviews (Danida staff, former advisers, Bangladeshi staff, etc.)
· Quantitative analysis of project monitoring data
· Key informant interviews
· Compilation and analysis of material about context (statistics, articles, reports, etc.)
· Institutional mapping (particularly NGOs in the area)
· Representative surveys of project components
· Assessment of buildings, roads, and irrigation canals (function, maintenance, etc.)
· Questionnaire-based interviews with beneficiaries and non-beneficiaries
· Extensive and intensive village studies (surveys, interviews, etc.)
· Observation
· Focus group interviews
· In-depth interviews (issue-based and life stories).

In the history of Danish development cooperation no other project has been subject to so many studies and reports, not to speak of the vast number of newspaper articles. Most important for the impact study have been the appraisal reports and the evaluations, plus the final project completion report. But in addition to these, there exists an enormous number of reports on all aspects of the project. A catalogue from 1993 lists more than 1,500 reports produced by and for the NRDP. Both the project and the local context were, moreover, intensively studied in a research project carried out in cooperation between the Centre for Development Research and the Bangladesh Institute of Development Studies.

A special effort was made to solicit the views of a number of key actors (or stakeholders) in the project and other key informants. These included numerous former NRDP and BRDB officers, expatriate former advisers, as well as former key Danida staff, based both in the Danish Embassy in Dhaka and in the Ministry of Foreign Affairs in Copenhagen. They were asked about their views on the strengths and weaknesses of the project and the components they knew best, about their own involvement, and about their judgment regarding likely impact. A questionnaire survey was carried out among the around 60 former expatriate long-term advisers and 25 former key staff members in the Danish embassy, Danida, and other key informants. In both cases about half returned the filled-in questionnaires. This was followed up by a number of individual interviews.

The main method in four of the five component studies was surveys with interviews, based on standardized questionnaires, with a random--or at least reasonably representative--sample of beneficiaries (of course combined with documentary evidence, key informant interviews, etc.). A great deal of effort went into ensuring that the survey samples were reasonably representative.

The infrastructure component was studied by partly different methods, because in this case the beneficiaries were less well defined. It was decided to make a survey of all the buildings that were constructed during the first phase of the project to assess their current use, maintenance standard, and benefits. In this phase the emphasis was on construction; in the second phase it shifted to maintenance. Moreover, a number of roads were selected for study, covering both their current maintenance standard, their use, etc., and the employment the road construction and maintenance generated, particularly for groups of destitute women. The study also attempted to assess the socio-economic impact of the roads on different groups (poor/better-off, men/women, etc.).

Assessing causal contribution
The impact of a development intervention is a result of the interplay of the intervention and the context. It is the matching of what the project has to offer with people's needs and capabilities that produces the outcome and impact. Moreover, the development processes engendered unfold in a setting that is often characterized by inequalities, structural constraints, and power relations. This has certainly been the case in Noakhali. As a consequence there will be differential impacts, varying between individuals and according to gender, socio-economic group, and political leverage.

In addition to the documentary studies, interviews, and questionnaire survey, the actual fieldwork employed a range of both quantitative and qualitative methods. The approach can be characterized as a contextualized, tailor-made ex post impact study. There is considerable emphasis on uncovering elements of the societal context in which the project was implemented, covering both the national context and the local context. The approach is tailor-made in the sense that it was made to fit the study design outlined above and to apply an appropriate mix of methods.

An element in the method is the incorporation in the study of both before/after and with/without perspectives. These, however, are not seen as the ultimate test of impact (success or failure) but are interpreted cautiously, bearing in mind that the area's development has also been influenced by a range of other factors (market forces, changing government policies, other development interventions, etc.), both during the 14 years the project was implemented and during the 9 years after its termination.
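The cautious before/after and with/without reading used in the study can be sketched in code. The following is a minimal illustration with invented numbers and context labels: a double difference (the with/without contrast of before/after changes) is computed within context strata so that differential impacts stay visible rather than being averaged away. It is a reading aid, not the study's actual procedure.

```python
# Hypothetical sketch of a before/after x with/without comparison computed
# per context stratum. All villages, scores, and strata are invented.

def change(before, after):
    """Before/after change in an outcome score for one (group of) village(s)."""
    return after - before

def double_difference(project, comparison):
    """With/without contrast of before/after changes. Interpreted cautiously:
    other factors (markets, policies, other interventions) also drove change."""
    return change(*project) - change(*comparison)

# (before, after) outcome scores, grouped by local context
villages = {
    "favourable context":   {"+NRDP": (40, 58), "-NRDP": (41, 47)},
    "unfavourable context": {"+NRDP": (38, 43), "-NRDP": (37, 42)},
}

for context, pair in villages.items():
    dd = double_difference(pair["+NRDP"], pair["-NRDP"])
    print(f"{context}: double difference = {dd:+d}")
```

With these invented numbers the favourable-context stratum shows a positive double difference while the unfavourable one shows none, mirroring the study's finding that the intervention only worked where the context supplied the other ingredients of success.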
Considerable weight was accorded to studying what has happened in the villages that had previously been studied and for which some comparable data exist. Four villages were studied intensively in 1979 and briefly restudied in 1988 and 1994. These studies--together with a thorough restudy in the year 2001--provide a unique opportunity to compare the situation before, during, and after the project. Moreover, 10 villages were monitored under the project's village-wise impact monitoring system in the years 1988-90, some of these being with (+NRDP) and some (largely) without (-NRDP) the project. Analysis of the monitoring data combined with a restudy of a sample of these villages illuminates the impact of the project in relation to other factors. It was decided to study a total of 15 villages, 3 intensively (all +NRDP, about 3 weeks each) and 12 extensively (9 +NRDP, 3 -NRDP, 3-5 days each). As a matter of principle, this part of the study looks at impact in terms of the project as a whole. It brings into focus the project benefits as perceived by different groups and individuals and tries to study how the project has impinged on economic and social processes of development and change. At the same time it provides a picture of the considerable variety found in the local context.

In the evaluation of the mass education program, the problem of attribution was dealt with as carefully as possible. First, a parallel comparison was made between the beneficiaries on the one hand and non-beneficiaries on the other, to identify any changes directly or indirectly related to the program. Such a comparison was vital due to the absence of any reliable and comparable baseline data. Second, specific queries were made in relation to the impact of the program as perceived by the beneficiaries and other stakeholders of the program, assuming that they would be able to perceive the impact of the intervention on their own lives in a way that would not be possible for others. And finally, the views of non-beneficiaries and non-stakeholders were sought, to obtain opinions from people who did not have any valid reason for either understating or overstating the impact of the program. It was through such a cautious approach that the question of attribution was addressed. Arguably, elements of subjectivity may still have remained in the conclusions and assumptions, but that is unavoidable in a study that seeks to uncover the impact of an education project.

Managing the impact evaluation
The impact study was commissioned by Danida and carried out by the Centre for Development Research, which also co-funded the study as a component of its Aid Impact Research Program. The research team comprised independent researchers from Bangladesh, Denmark, and the UK. A reference group of nine persons (former advisers, Danida officers, and researchers) followed the study from beginning to end. It discussed the approach paper in an initial meeting and the draft reports in a final meeting. In between, it received three progress reports from the team leader and took up discussions by e-mail correspondence. The study was prepared during the year 2000 and fieldwork was carried out in the period January-May 2001. The study consists of a main report and seven topical reports.

The first step in establishing a study design was the elaboration of an approach paper (study outline) by the team leader. This was followed by a two-week visit to Dhaka and the greater Noakhali area. During this visit, Bangladeshi researchers and assistants were recruited to the team, and more detailed plans for the subsequent fieldwork were drafted. Moreover, a background paper by Hasnat Abdul Hye, former Director General of BRDB and Secretary, Ministry of Local Government, was commissioned.

The fieldwork was preceded by a two-day methodology-cum-planning workshop in Dhaka. The actual fieldwork lasted four months, from mid-January to mid-May 2001. The study team comprised 23 people: 5 Bangladeshi researchers, 3 European researchers, 6 research assistants, and 9 field assistants (all from Bangladesh). The researchers spent 1.5-3.5 months in the field, the assistants 2-4 months. Most of the time the team worked 60-70 hours a week. It thus takes a good deal of resources to accomplish such a big and complex impact study.

Case 2: Combining qualitative and quantitative descriptive methods--Mixed-method impact evaluation of IFAD projects in Gambia, Ghana, and Morocco2

1. Summary
The evaluation included intended and unintended impacts and examined the magnitude, coverage, and targeting of changes. It used mixed methods to gather evidence of impacts and of the quality of processes, with cross-checking among sources. With regard to assessing causal contribution, it must be noted that no baseline data were available. Instead a comparison group was constructed, and an analysis of other contributing factors was made to ensure appropriate comparisons. The evaluation was undertaken within significant resource constraints and was carried out by an interdisciplinary team.
2. Introduction and background
Evaluations of rural development projects and country programs are routinely conducted by the Office of Evaluation of IFAD. The ultimate objective of these evaluations is to set a basis for accountability by assessing development results and to contribute to learning and to the improvement of design and implementation by providing lessons learned and practical recommendations. These evaluations follow a standardized methodology and a set of evaluation questions, including the following: (i) project performance (relevance, effectiveness, and efficiency), (ii) project impact, (iii) overarching factors (sustainability, innovation, and replication), and (iv) the performance of the partners. As can be seen, impact is but one of the key evaluation questions, and the resources allocated to the evaluation (budget, specialists, and time) have to be shared across the entirety of the evaluation.

Thus, these evaluations are to be conducted under resource constraints. In addition, very limited data are available on socio-economic changes taking place in the project area that can be ascribed to the intervention. IFAD adopts an impact definition which is similar to the DAC definition. The key feature of IFAD evaluations is that they are conducted just before or immediately after project conclusion: the effects can be observed after 4-7 years of operations, and the future evolution can be estimated through an educated guess on sustainability perspectives. Several impact domains are considered, including household income and assets, human capital, social capital, food security, environment, and institutions.

3. Sequencing of the process and choice of methods
This short case study is based on evaluations conducted in Gambia, Ghana, and Morocco between 2004 and 2006. As explained above, the evaluations had multiple questions to answer, and impact assessment was but one of them. Moreover, the impact domains were quite diverse. This meant that some questions and domains required quantitative evidence (e.g., in the case of household income and assets), whereas a more qualitative assessment would be in order for other domains (e.g., social capital). In many instances, however, more than one method would have to be used to answer the same questions, to cross-check the validity of findings, identify discrepancies, and formulate hypotheses on the explanation of apparent inconsistencies.

As the final objective of the evaluation was not only to assess results but also to provide future intervention designers with adequate knowledge and insights, the evaluation design could not be confined to addressing a dichotomy between "significant impact has been observed" and "no significant impact has been observed." Findings would need to be rich enough and grounded in field experience to provide a plausible explanation that would lead, when suitable, to a solution to identified problems and to recommendations to improve the design and the execution of the operations.

The countries and projects considered in this case study were diverse. In all cases, however, the first step in the evaluation consisted of a desk review of the project documentation. This allowed the evaluation team to understand or reconstruct the intervention theory (often implicit) and the logical framework. In turn, this helped to identify a set of hypotheses on changes that might be observed in the field, as well as on the intermediary steps that would lead to those changes.

In particular, the preliminary desk analysis highlighted that the results assessment would have to be supplemented with some analysis of implementation performance. The latter would include some insight into the business processes (e.g., the management and resource allocation made by the project implementation unit) and the quality of service rendered (e.g., the topics and the communication quality of an extension service, or the construction quality of a feeder road or of a drinking water scheme).

The second step was to conduct a preparatory mission. This mission was instrumental in fine-tuning the hypotheses on project results and designing the methods and instruments. Given the special emphasis of the IFAD interventions on the rural poor, the impact evaluation would need to shed light, to the extent possible, on the following dimensions of impact: (i) magnitude of changes, (ii) coverage (i.e., the number of persons or households served by the projects), and (iii) targeting (i.e., gauging the distribution of project benefits according to social, ethnic, or gender grouping).

As pointed out before, a major concern was the absence of a baseline survey which could be used as a reference for impact assessment. This required reconstructing the "before project" situation. By the same token, it was clear that the observed results could not simply be attributed to the evaluated interventions. In addition to exogenous factors such as weather changes, other important factors were at play, for example, changes in government strategies and policies (such as the increased support to grassroots associations by Moroccan public agencies) or operations supported by other development organizations in the same or in adjacent zones. This meant that the evaluated interventions would interplay with existing dynamics and interact with other interventions. Understanding synergies or conflicts between parallel dynamics could not be done simply through inferential statistical instruments but required interaction with a wider range of stakeholders.

The third step in the process was the fielding of a data collection survey (after pre-testing the instruments) that would help the evaluation cope with the dearth of impact data. The selected techniques for data collection included a quantitative survey covering 200-300 households (including both project and control groups) and a more limited set of focus group discussions with groups of project users and "control groups," stratified based on the economic activities in which they had engaged and the area where they were living.

In the quantitative survey, standardized questionnaires were administered to final project users (mostly farmers or herders) as well as to non-project groups (control observations) on the situation before (recall methods) and after the project. Recall methods were adopted to make up for the absence of a baseline.
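A recall-based comparison of this kind can be sketched as follows. The household records, field names, and numbers below are invented; the point is only to show how, absent a baseline, the "before" values come from respondents' recall and an impact estimate is read from the difference in average before/after changes between project users and control observations.

```python
# Illustrative sketch (invented data): "before" values are recalled by
# respondents, "after" values are observed; impact is estimated as the
# difference in average change between project and control households.

def mean(xs):
    return sum(xs) / len(xs)

# each record: (recalled income before the project, reported income after)
project_households = [(100, 160), (120, 150), (90, 140)]
control_households = [(110, 125), (95, 110), (105, 115)]

def avg_change(records):
    return mean([after - before for before, after in records])

impact_estimate = avg_change(project_households) - avg_change(control_households)
print(round(impact_estimate, 1))
```

Note that, as the text warns, recall is least reliable for exactly the monetary indicator used here; in practice such an estimate would be cross-checked against easier-to-remember facts and focus group evidence.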
In the course of the focus group interviews, open-ended discussion guidelines were adopted; results were mostly of a qualitative nature. Some of the focus group facilitators had also been involved in the quantitative survey and could refer the discussion to observations previously made. After the completion of data collection and analysis, a first cross-checking could be made between the results of the quantitative and qualitative analysis.

As a fourth step, an interdisciplinary evaluation team would be fielded. Results from the preliminary data collection exercise were made available to the evaluation team. The data collection coordinator was a member of the evaluation team or in a position to advise its members. The evaluation team would conduct field visits, conduct a further validation survey, and collect focus group data through participant observations and interviews with key informants (and further focus group discussions if necessary). The team would also spend adequate time with project management units to gain a better insight into implementation and business processes.

The final impact assessment would be made by means of triangulation of evidence captured from the (scarce) existing documentation, the preliminary data collection exercise, and the main interdisciplinary mission (figure A11.1).

Figure A11.1: Final impact assessment triangulation. Desk review and secondary data, the quantitative survey, and focus groups (key informants, participant observations) feed through the interdisciplinary mission into the final assessment.

4. Constraints in data gathering and analysis
Threats to the validity of recall methods. According to the available literature sources3 and our own experience, the reliability of recall methods may be questionable for monetary indicators (e.g., income) but higher for easier-to-remember facts (e.g., household appliances, approximate herd size). Focus group discussions helped identify possible sources of bias in the quantitative survey and ways to address them.

Finding "equivalent" samples for with- and without-project observations. One of the challenges was to extract a control sample that would be "similar" in the salient characteristics to the project sample. In other words, problems of sampling bias and endogeneity should have been controlled for (e.g., more entrepreneurial people are more likely to participate in a rural finance intervention). In sampling control observations, serious attempts were made to match project and non-project households based on similarity of main economic activities, agro-ecological environment, household size, and resource endowment. In some instances, households that had just started to be served by the projects ("new entries") were considered control groups, on the grounds that they would broadly satisfy the same eligibility criteria at entry as "older" project clients. However, no statistical technique (e.g., instrumental variables, Heckman's procedure, or propensity score matching) was adopted to test for sampling bias, due to limited time and resources.

Coping with linguistic gaps. Given the broad scope of the evaluations, a team of international sector specialists was required. However, international experts were not necessarily the best suited for data collection and analysis, which call for fluency in the local vernacular, knowledge of local practices, and the skills to obtain as much information as possible within a limited time frame. Staggering the process in several phases was a viable solution. The preliminary data collection exercise was conducted by a team of local specialists, with university students, local teachers, or literate nurses serving as enumerators.
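The household matching described under "Finding equivalent samples" can be illustrated with a crude nearest-neighbour rule over the matching criteria named in the text (main economic activity, agro-ecological zone, household size, resource endowment). The data, weights, and distance function below are all invented for illustration; this is emphatically not the propensity-score or Heckman-type correction that the evaluators note was beyond their time and resources.

```python
# Hedged sketch: pair each project household with the most "similar"
# non-project household across the matching criteria named in the text.
# All households, weights, and the distance rule itself are invented.

def distance(a, b):
    """Ad hoc dissimilarity score between two households."""
    score = 0
    score += 0 if a["activity"] == b["activity"] else 1   # main economic activity
    score += 0 if a["zone"] == b["zone"] else 1           # agro-ecological zone
    score += abs(a["hh_size"] - b["hh_size"]) / 10        # household size
    score += abs(a["land_ha"] - b["land_ha"]) / 5         # resource endowment
    return score

def match_controls(project, pool):
    """Pair each project household with its nearest non-project household."""
    return {p["id"]: min(pool, key=lambda c: distance(p, c))["id"]
            for p in project}

project = [
    {"id": "P1", "activity": "herding", "zone": "north", "hh_size": 6, "land_ha": 2.0},
    {"id": "P2", "activity": "farming", "zone": "south", "hh_size": 4, "land_ha": 1.0},
]
pool = [
    {"id": "C1", "activity": "herding", "zone": "north", "hh_size": 5, "land_ha": 2.5},
    {"id": "C2", "activity": "farming", "zone": "south", "hh_size": 7, "land_ha": 0.5},
    {"id": "C3", "activity": "farming", "zone": "north", "hh_size": 4, "land_ha": 1.0},
]
print(match_controls(project, pool))
```

Without a formal test for sampling bias, such a rule only makes the matching judgment explicit and repeatable; it does not remove endogeneity.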
5. Main value added of mixed methods and opportunities for improvement
The choice of methods was made taking into account the objectives of the evaluations and the resource constraints (time, budget, and expertise) in conducting the exercise. The combination of multiple methods allowed us to cross-check the evidence and understand, for example, when survey questions were likely to be misinterpreted or to generate over- or under-reporting. In turn, quantitative evidence allowed us to shed light on the prevalence of certain phenomena highlighted during the focus group discussions. Finally, the interactions with key informants and with project managers and staff helped us better understand the reasons for under- or over-achievements and come up with more practical recommendations.

The findings, together with the main conclusions and recommendations in the report, were used to design new projects or a new country strategy. There was also interest from the concerned project implementation agencies in adopting the format of the survey to conduct future impact assessments on their own. Due to time constraints, only inferential analysis was conducted on the quantitative survey data; a full-fledged econometric analysis would have been desirable. By the same token, further analysis of the focus group discussion outcomes would in principle be desirable.

6. A few highlights on the management
The overall process design, as well as the choice of methods and the design of the data collection instruments, was made by the lead evaluator in the Office of Evaluation of IFAD, in consultation with international sectoral specialists and the local survey coordinator. The pre-mission data collection exercise was coordinated by a local rural sociologist, with the help of a statistician for the design of the sampling framework and the data analysis.

The time required for conducting the survey and focus groups was as follows:

· Develop draft questionnaire and sampling frame, identify enumerators: 3 weeks.
· Conduct a quick trip on the ground, contact project authorities, and pre-test questionnaires: 3 days.
· Train the enumerators' and coders' team: 3 days.
· Survey administration: depending on the length of the questionnaire, on average an enumerator will be able to fill in no more than three to five questionnaires per day. In addition, time needs to be allowed for travel and rest. With a team of 6 enumerators, up to 200 questionnaires can be filled in within 9-10 working days, in the absence of major transportation problems.
· Data coding: this may vary depending on the length and complexity of the questionnaire. It is safe to assume 5-7 days.
· Conducting focus group discussions: 7 days, based on the hypothesis that around 10 FGDs would be conducted by 2 teams.
· Data analysis: depending on the analysis requirements, it will take one to two weeks just to generate the tables and the summaries of the focus group discussions.
· Drafting the survey report: 2 weeks.

Note: As some of the above tasks can be conducted simultaneously, the total time for conducting a preliminary data collection exercise may be lower than the sum of its parts.

Case 3: Combining qualitative and quantitative descriptive methods--Impact evaluation: agricultural development projects in Guinea4

1. Summary
The evaluation focused on impact in terms of poverty alleviation; the distribution of benefits was of particular interest, not just the mean effect. All data gathering was conducted after the intervention had been completed; mixed methods were used, including attention to describing the different implementation contexts. Assessing causal contribution is the major focus of the case study. A counterfactual was created by constructing a comparison group, taking into account the endogenous and exogenous factors affecting impacts. Modeling was used to develop an estimate of the impact. With regard to the management of the impact evaluation, it should be noted that the study was undertaken as part of doctoral dissertation work; stakeholder engagement and subsequent use of the evaluation were limited.

This impact evaluation concerned two types of agricultural projects based in the Kpèlè region, in Guinea. The first one5 was the Guinean Oil Palms and Rubber Company (SOGUIPAH). It was founded in 1987 by the Guinean government to take charge of developing palm oil and rubber production at the national level.
Case 3: Combining qualitative and quantitative descriptive methods--Impact evaluation: Agricultural development projects in Guinea4

1. Summary

The evaluation focused on impact in terms of poverty alleviation; the distribution of benefits was of particular interest, not just the mean effect. All data gathering was conducted after the intervention had been completed; mixed methods were used, including attention to describing the different implementation contexts. Assessing causal contribution is the major focus of the case study. A counterfactual was established by constructing a comparison group, taking into account the endogenous and exogenous factors affecting impacts. Modeling was used to develop an estimate of the impact. With regard to the management of the impact evaluation, it should be noted that the study was undertaken as part of doctoral dissertation work; stakeholder engagement and subsequent use of the evaluation were limited.

This impact evaluation concerned two types of agricultural projects based in the Kpèlè region, in Guinea. The first one5 was the Guinean Oil Palms and Rubber Company (SOGUIPAH), founded in 1987 by the Guinean government to take charge of developing palm oil and rubber production at the national level. With the support of several donors, SOGUIPAH quickly set up a program of industrial plantations6 by negotiating the ownership of 22,830 ha with villagers. In addition, several successive programs were implemented between 1989 and 1998 with SOGUIPAH to establish contractual plantations7 on farmers' own land and at the request of the farmers (1,552 ha of palm trees and 1,396 ha of rubber trees) and to improve 1,093 ha of lowland areas for irrigated rice production.

The impact evaluation took place in a context of policy debates among different rural stakeholders at a regional level: two seminars had been held in 2002 and 2003 between the farmers' syndicates, the state administration, the private sector, and development partners (donors, NGOs) to discuss a regional strategy for agricultural development. These two seminars revealed that there was little evidence of what should be done to alleviate rural poverty, despite a long history of development projects. The impact of these projects on farmers' income therefore seemed particularly relevant to assess, notably to compare the projects' efficiency.

This question was investigated through doctoral thesis work that was entirely managed by AGROPARISTECH.8 It was financed by AFD, one of the main donors in the rural sector in Guinea. The thesis proposed a new method, systemic impact evaluation, aiming at quantifying impact using a qualitative approach. It enabled an understanding of the process through which impact materializes and a rigorous quantification of the impact of agricultural development projects on farmers' income, using a counterfactual. The analysis is notably based on an understanding of the agrarian dynamics and the farmers' strategies, and permits not only the quantification of ex post impact but also the construction of a model of ex ante evolution for the following years.

2. Gathering evidence of impact

The data collection was carried out entirely ex post. Several types of surveys and interviews were used to collect evidence of impact.

First, a contextual analysis, carried out throughout the research work with key informants, was necessary to describe the project implementation scheme, the contemporaneous events, and the existing agrarian dynamics. It was also used to assess qualitatively whether those dynamics were attributable to the project. A series of surveys and historical interviews (focused on the pre-project situation) were conducted to establish the most reliable baseline possible. An area considered a "witness" to the agrarian dynamic that would have existed in the project's absence was identified.

Second, a preliminary structured survey (of about 240 households) was implemented, using recall to collect data on the farmers' situation in the pre-intervention period and during the project. It served as the basis of a judgment sample for in-depth interviews (see below), which aimed at describing the farming systems and rigorously quantifying the farmers' income.

3. Assessing causal attribution

By conducting an early contextual analysis, the evaluator was able to identify a typology of farming systems that existed before the project. To set up a sound counterfactual, a judgment sample was drawn from the 240 households surveyed, by choosing 100 production units that had belonged to the same initial types of farming system and that had evolved with the project (in the project area) or without it (in the witness area).

In-depth understanding of the endogenous and exogenous factors influencing the evolution and possible trajectories of farming systems enabled the evaluator to rigorously identify the individuals whose evolution with or without the project was comparable. This judgment-sampling phase was followed by in-depth interviews with the hundred farmers. The evaluator's direct involvement in data collection was then essential, hence the importance of a small sample. It would not have been possible to gather reliable data on yields, modifications to production structures over time, and producers' strategies from a large survey sample in a rural context.
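The judgment-sampling step described above, keeping only production units whose pre-project farming-system type is represented both with and without the project, can be sketched as follows. The type names and records are invented for illustration; only the pairing logic comes from the case.

```python
# Hedged sketch of judgment sampling for a counterfactual: group households
# by pre-project farming-system type and keep only types present in BOTH the
# project area and the "witness" area, so each unit has a comparable match.
# Types and households below are hypothetical, not the study's data.

def judgment_sample(households):
    """households: list of dicts with 'id', 'type', and 'area'
    ('project' or 'witness'). Returns, per type, the ids in each area,
    restricted to types observed in both areas."""
    by_type = {}
    for h in households:
        group = by_type.setdefault(h["type"], {"project": [], "witness": []})
        group[h["area"]].append(h["id"])
    return {t: g for t, g in by_type.items() if g["project"] and g["witness"]}

households = [
    {"id": 1, "type": "lowland_rice",  "area": "project"},
    {"id": 2, "type": "lowland_rice",  "area": "witness"},
    {"id": 3, "type": "palm_contract", "area": "project"},   # no witness match
    {"id": 4, "type": "upland_mixed",  "area": "witness"},   # no project match
]
pairs = judgment_sample(households)
```

Only the type present in both areas survives, which mirrors the case's requirement that compared units share the same initial farming system.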
Several types of surveys and interviews over time, and producers' strategies from a large were used to collect evidence of impact. survey sample in a rural context. 89 I m pa c t E va l u at I o n s a n d d E v E l o p m E n t ­ n o n I E G u I d a n c E o n I m pa c t E va l u at I o n Then, based on the understanding of the way original farming system and the various trajecto- the project proceeded and of the trajectories ries with and without the project, which could of these farmers, with or without the project, it not be ignored. Whereas former civil servants or was possible to build a quantitative model, based traditional landlords beneficiated large contrac- on Gittinger's method of economic analysis tual plantations, other villagers were deprived of of development projects (Gittinger, 1982). As their land for the needs of the project or received the initial diversity of production units was surfaces of plantations too limited to improve well identified before sampling, this model was their economic situation. constructed for each type of farming system existing before the project. Understanding the Therefore, it seems important that the impact possible evolution of each farming system with evaluation of a complex development project and without the project allowed for the estima- include an analysis of the diversity of cases created tion of the differential created by the project on by the intervention, directly or indirectly. farmers' income, i.e., its impact. The primary interest of this new method was to 4. Ensuring rigor and quality give the opportunity to build a credible impact Although the objective differences between each assessment entirely ex post. 
Second, it gave an production unit studied appear to leave room estimate of the impact on different types of farming for the researcher's subjectivity when construct- systems, making explicit the existing inequalities ing the typology and sample, the rationale in the distribution of the projects' benefits. Third, behind the farming system concept made it it permitted a subtle understanding of the reasons possible to transcend this possible arbitrariness. why the desired impacts materialized or not. What underlies this methodological jump from a small number of interviews to a model is the 6. Influence demonstration that a finite number of types of The results from this impact assessment were farming systems exists in reality. available after four years of field work and data treatment. They were presented to the Guinean Moreover, the use of a comparison group, the authorities and to the local representatives of the triangulation of most data collected by in-depth main donors in the rural sector. In the field, the interviews through direct observation and results were delivered to the local communities contextual analysis, and the constant implication interviewed and to the farmers' syndicates. The of the principal researcher were key factors to Minister of Agriculture declared that he would try ensure rigor and quality. to foster more impact evaluations on agricultural development projects. Unfortunately, there is 5. Key findings little hope that the conclusions of this research The large survey of 240 households identified will change the national policy about these types 11 trajectories related to the implementation of of projects, in the absence of an institutional- the project. Once each trajectory and impact was ized forum for discussing it among the different characterized and quantified through in-depth stakeholders. 
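The mean impact "on the basis of the weight of each type in the population" is a weighted average of per-type impact estimates, which also makes the distributional point visible: a positive mean can coexist with losses for some types. The per-type impacts and population shares below are invented for illustration; only the weighting idea comes from the case.

```python
# Hedged sketch: village-level mean impact as a population-weighted average
# of per-type impact estimates. Figures are hypothetical.

def mean_impact(per_type):
    """per_type: list of (impact_per_household, population_share) tuples."""
    total_share = sum(w for _, w in per_type)
    return sum(i * w for i, w in per_type) / total_share

village = [
    (10.0, 0.5),   # type A: half the households, modest gain
    (40.0, 0.3),   # type B: large contractual plantations
    (-5.0, 0.2),   # type C: some types may even lose from the project
]
avg = mean_impact(village)  # 10*0.5 + 40*0.3 - 5*0.2 = 16.0 per household/year
```

The per-type breakdown, not the single mean, is what reveals the inequalities in the distribution of benefits that the case emphasizes.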
6. Influence

The results from this impact assessment were available after four years of field work and data treatment. They were presented to the Guinean authorities and to the local representatives of the main donors in the rural sector. In the field, the results were delivered to the local communities interviewed and to the farmers' syndicates. The Minister of Agriculture declared that he would try to foster more impact evaluations of agricultural development projects. Unfortunately, there is little hope that the conclusions of this research will change national policy on these types of projects, in the absence of an institutionalized forum for discussing them among the different stakeholders.

Case 4: A theory-based approach with qualitative methods--Global Environment Facility impact evaluation 20079, 10

Evaluation of three GEF-protected area projects in East Africa

1. Description of evaluation

The objectives of this evaluation included--

· To test evaluation methodologies that can assess the impact of GEF interventions. The key activity of the GEF is "providing new and additional grant and concessional funding to meet the agreed incremental costs of measures to achieve agreed global environmental benefits."11 The emphasis of this evaluation was therefore on verifying the achievement of agreed global environmental benefits.
· Specifically, to test a theory of change approach to evaluation in GEF's biodiversity focal area, and to assess its potential for broader application within GEF evaluations.
· To assess the sustainability and replication of the benefits of GEF support and extract lessons. The evaluation examined whether and how project benefits have continued, and will continue, after project closure.

Primary users

The primary users of the evaluation are GEF entities. They include the GEF Council, which requested the evaluation; the GEF Secretariat, which will approve future protected area projects; implementing agencies (such as the World Bank, UN agencies, and regional development banks); and national stakeholders who will implement future protected area projects.

2. Evaluation design

Factors driving selection of evaluation design

The Approach Paper to the impact evaluation12 considered the overall GEF portfolio to develop an entry point that could provide a good opportunity to develop and refine effective and implementable impact evaluation methodologies. Themes and projects that are relatively straightforward to evaluate were emphasized. The Evaluation Office adopted the DAC definition of impact, which determined that closed projects would be evaluated to assess the sustainability of GEF interventions.

Biodiversity and protected areas

The biodiversity focal area has the largest number of currently active and completed projects within the GEF portfolio. In addition, biodiversity has developed more environmental indicators and global data sets than other focal areas, both within the GEF and in the broader international arena. The Evaluation Office chose protected areas as the central theme for this phase of the impact evaluation because protected areas are one of the primary approaches supported by the GEF biodiversity focal area and its implementing agencies, and the GEF is the largest supporter of protected areas globally; previous evaluations have noted that an evaluation of GEF support for protected areas has not been carried out and recommended that such a study be undertaken; protected areas are based on a set of explicit change theories, not just in the GEF but in the broader conservation community; in many protected area projects, substantial field research has been undertaken, and some have usable baseline data on key factors to be changed by the intervention; a protected areas strategy can be addressed at both a thematic and a regional cluster level (as in East Africa, the region chosen for the study); and the biodiversity focal area team has made considerable progress in identifying appropriate indicators for protected areas through its "managing for results" system.
The choice of projects

Lessons from a set of related interventions (or projects) are more compelling than those from an isolated study of an individual project. To test the potential for aggregation of project results, enable comparisons across projects, and ease logistics, it was decided to adopt a sub-regional focus and select a set of projects that are geographically close to each other. East Africa is the sub-region with the largest number of complete and active projects in the GEF portfolio with a protected area component, utilizing large GEF and cofinancing expenditure.

The following three projects were selected for evaluation:

· Bwindi Impenetrable National Park and Mgahinga Gorilla National Park Conservation Project, Uganda (World Bank)
· Lewa Wildlife Conservancy, Kenya (World Bank)
· Reducing Biodiversity Loss at Cross-Border Sites in East Africa, Regional: Kenya, Tanzania, Uganda (UNDP).

These projects were implemented on behalf of the GEF by the World Bank and UNDP. They have a variety of biodiversity targets, some of which are relatively easy to monitor (gorillas, zebras, rhinos). Also, these projects were evaluated positively by terminal and other evaluations, and the continuance of long-term results was predicted. The Bwindi Impenetrable National Park and Mgahinga Gorilla National Park Conservation Project is a $6.7 million full-size project and the first GEF-sponsored trust fund in Africa. The Lewa Wildlife Conservancy is a medium-sized project, within a private wildlife conservation company. The Reducing Biodiversity Loss at Cross-Border Sites in East Africa project is a $12 million project, implemented at field level by government agencies, that aims to foster an enabling environment for the sustainable use of biodiversity.

The advantages of a theory of change approach

An intervention generally consists of several complementary activities that together produce intermediate outcomes, which are then expected to lead to impact (see figure A11.2). The process of these interventions, in a given context, is determined by the contribution of a variety of actions at multiple levels, some of which are outside the purview of the intervention (e.g., actions of external actors at the local, national, or global levels, or changes in political situations, regional conflicts, and natural disasters). Consequently, an intervention may have different levels of achievement in its component parts, giving mixed results towards its objectives.

Figure A11.2: Generic representation of a project's theory of change. The results continuum runs from inputs (the human, organizational, financial, and material resources contributed to a project), through outputs (the immediate product of project actions) and outcomes (an intermediate result brought about by producing outputs), to impacts (the ultimate result of a combination of outcomes contributed by the project). At each stage, the intervention process is carried forward by activities (tasks carried out by the project) resting on assumptions (the theory behind the activity).

The use of a hybrid evaluation model

During field testing it was decided that, given the intensive data requirements of a theory of change approach and the intention to examine project impacts, the evaluation would mainly focus on the later elements of each project's theory of change, when outcomes are expected to lead to impact. Based on this approach, the evaluation developed a methodology composed of three components (see figure A11.3):
· Assessing implementation success and failure: To understand the contributions of the project at earlier stages of the results continuum, leading to project outputs and outcomes, a logframe analysis is used. Though the normally complex and iterative process of project implementation is not captured by this method, the logframe provides a means of tracing the realization of declared objectives. GEF interventions aim to "assist in the protection of the global environment and promote thereby environmentally sound and sustainable economic development."13
· Assessing the level of contribution (i.e., impact): To provide a direct measure of project impacts, a targets-threats analysis (threats-based analysis) is used to determine whether global environmental benefits have actually been produced and safeguarded.14 The robustness of the global environment benefits identified for each project (targets) is evaluated by collecting information on attributes relating to the targets' biological composition, environmental requirements, and ecological interactions. This analysis of targets is complemented by an assessment of the level of "threat" (e.g., predation, stakeholder attitude and behavior) faced by the global environment benefits. For targets and significant threats, trends over time (at project start, at project close, and currently) and across project and non-project areas are sought, so that a comparison is available to assess levels of change.
· Explanations for observed impact: To unpack the processes by which the project addresses and contributes to impact, an outcomes-impacts theory of change analysis is used. This theory of change approach constructs and validates the project logic connecting outcomes and ultimate project impact. It involves a comprehensive assessment of the activities undertaken after project closure, along with their explicit and implicit assumptions. This component enables an assessment of the sustainability and/or catalytic nature of project interventions and provides a composite qualitative ranking for the achievements of the projects. Elements of the varied aspects of sustainability include behavior change and the effectiveness of capacity-building activities, financial mechanisms, legislative change, and institutional development.

Figure A11.3: Components of the impact evaluation framework. The framework combines a project logframe analysis (tracing outputs to outcomes), an outcomes-impacts theory of change analysis (tracing an outcome, through states/conditions and their assumptions, to impact), and a threats-based analysis (assessing reduced threats to, and enhanced status of, the global environment benefits).

The model incorporates three different elements that may be involved in the transformation of project outcomes into impacts. These are as follows, and each was scored for the level of achievement of the project in converting outcomes into impacts:

· Intermediary states. These are conditions that are expected to be produced on the way to delivering the intended impacts.
· Impact drivers. These are significant factors or conditions that are expected to contribute to the ultimate realization of project impacts. The existence of an impact driver in relation to the project being assessed suggests a good likelihood that the intended project impact will have been achieved; its absence suggests that the intended impact may not have occurred or may be diminished.
· External assumptions. These are potential events or changes in the project environment that would negatively or positively affect the ability of a project outcome to lead to the intended impact, but that are largely beyond the power of the project to influence or address.
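The report says each of these elements was scored for the level of achievement in converting outcomes into impacts. A hedged sketch of one possible scoring scheme follows: the numeric scale and the weakest-link aggregation rule are assumptions made here for illustration, not the evaluation's actual procedure, and the element descriptions are paraphrased examples.

```python
# Hypothetical scoring of the three outcome-to-impact elements (intermediary
# states, impact drivers, external assumptions). Scale and aggregation rule
# are assumptions; the idea of scoring each element comes from the report.

LEVELS = {"not achieved": 1, "partially achieved": 2,
          "well achieved": 3, "fully achieved": 4}

def outcome_to_impact_score(elements):
    """elements: dict mapping each element to a qualitative level.
    Uses a weakest-link rule: a missing impact driver or a failed external
    assumption limits the whole outcome-to-impact conversion."""
    return min(LEVELS[level] for level in elements.values())

outcome3 = {
    "intermediary state: community support for conservation": "well achieved",
    "impact driver: capacity building scaled up to meet demand": "partially achieved",
    "external assumption: livelihood gains do not raise population": "well achieved",
}
score = outcome_to_impact_score(outcome3)  # limited by its weakest element: 2
```

A weakest-link rule is one design choice among several; an average would instead let strong drivers compensate for failed assumptions, which may or may not reflect how an evaluator would judge the causal chain.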
3. Data collection and constraints

Logical framework and theory of change model

The approach built on existing project logical frameworks, implying that a significant part of the framework could be relatively easily tested through an examination of existing project documentation, terminal evaluation reports, and, where available, monitoring data. Where necessary, targeted consultations and additional studies were carried out.

Assessing conservation status and threats to global environment benefits

A data collection framework for assessing the status of the targets and associated threats was developed, identifying indicators for each, along with the potential sources of information. For the Bwindi and Lewa projects, the task of collecting and assessing this information was undertaken by scientists from the Institute of Tropical Forest Conservation, headquartered in Bwindi Impenetrable National Park, and the Lewa Research Department, respectively. For the Cross-Borders project, this exercise was done by the Conservation Development Center, based on the existing project documentation, a field visit to the project site, and consultations with key informants. The objective of this exercise was to provide quantitative measures for each indicator from before the project (baseline), at project close, and at the present day. Where quantitative data were not available, detailed qualitative data were collected.

Improving rigor

Internal validity: The evaluation used a participatory approach with substantial involvement of former project staff in drawing out theories of change and subsequently providing data for verification. These data were verified by local independent consultants, via a process of triangulating information from project documentation and external sources. Given that all three projects are now closed, the participation of former project staff enabled a candid and detailed exchange of information (during workshops in Uganda and Kenya). The participants in return found the process to be empowering, as it clarified and supported the rationale for their actions (by drawing out the logical connections between activities, goals, and assumptions) and enabled them to plan for future interventions.

External validity: Given the small number of projects, their variety, and their age (approved in varied past GEF replenishment phases), the evaluation did not expect to produce findings that could be directly aggregated. Nevertheless, given the very detailed analysis of the interventions a few years after project closure, it did provide a wealth of insights into the functioning of protected area projects, particularly elements of their sustainability after project closure. This allowed limited generalization on key factors associated with the achievement of impact, on the basis of different levels of results related to a set of common linkages in the theoretical models. On this basis, the Evaluation Office recommended that the GEF Secretariat ensure specific monitoring of progress toward institutional continuity of protected areas throughout the life of a project.

Weaknesses

Impact evaluations are generally acknowledged to be highly challenging. The objective of this particular study, examining GEF's impact at a "global" level in biodiversity, makes the study particularly complex. A few concerns:

· The nature of changes in biodiversity is still under debate. Such changes are often non-linear, with uncertain time scales even in the short run, interactions within and across species, and exogenous factors (e.g., climate change). Evidence regarding the achievement of global environment benefits and their sustainability must therefore be presented with numerous caveats.
· Numerous explanations and assumptions may be identified for each activity that is carried out.
· The approach may not always uncover unexpected outcomes or synergies, unless they are anticipated in the theories or assumptions of the evaluation team. However, fieldwork should be able to discern such outcomes, as was the case in the Bwindi case study, which produced evidence of a number of unexpected negative impacts on local indigenous people.
· The association between activities and outcomes in the theory of change approach depends on measuring the level of activities carried out, and then consciously (logically) linking them with impact through a chain of intermediate linkages and outcomes. Information on these intermediate outcomes may be difficult to obtain, unless former project implementers participate fully in the evaluation.

4. Concluding thoughts on the evaluation approach

For biodiversity, GEF's first strategic priority is catalyzing sustainability of protected area systems, which aims for an expected impact whereby "biodiversity [is] conserved and sustainably used in protected area systems."

The advantage of the hybrid evaluation model used was that, by focusing toward the end of the results chain, it examined the combination of mechanisms in place that led to a project's impacts and ensured the sustainability of results. It is during this later stage, after project closure, that the ecological, financial, political, socio-economic, and institutional sustainability of the project are tested, along with its catalytic effects. During project conceptualization, given the discounting of time, links from outcome to impact are often weak. Once a project closes, the role of actors, activities, and resources is often unclear; this evaluation highlighted these links and assumptions.

Adopting a theory of change approach also had the potential to provide a mechanism that helped GEF understand what has worked and what has not worked, and it allows for predictions regarding the probability of success for similar projects. The Evaluation Office team concluded that the most effective combination for its next round of impact evaluation (phase-out of ozone-depleting substances in Eastern Europe) should seek to combine theory of change approaches with opportunistic use of existing data sets, which might provide some level of quantifiable counterfactual information.

Application: Impact of Lewa Wildlife Conservancy (Kenya)15

Context

The Lewa GEF medium-sized project provided support for the further development of Lewa Wildlife Conservancy ("Lewa"), a not-for-profit private wildlife conservation company that operates on 62,000 acres of land in Meru District, Kenya. The GEF awarded Lewa a grant of $0.75 million for the period 2000 to the end of 2003, with co-financing amounting to $3.193 million.

Since the GEF grant, Lewa has been instrumental in initiating the formation of the Northern Rangelands Trust (NRT) in 2004. NRT is an umbrella local organization with a goal of collectively developing strong community-led institutions as a foundation for investment in community development and wildlife conservation in the northern rangelands of Kenya. The NRT membership comprises community conservation conservancies and trusts, local county councils, the Kenya Wildlife Service, the private sector, and NGOs established and working within the broader ecosystem. The establishment and functioning of the NRT has therefore been a very important aspect in understanding and assessing the ultimate achievement of impacts from the original GEF investment.

The Lewa case study implemented the three elements of the impact evaluation framework, which are summarized below.
Assess implementation success and failure

Given that no project logical framework or outcomes were defined as such in the original GEF project brief, the GEF Evaluation Office team for the Study of Local Benefits in Lewa, with the participation of senior Lewa staff, identified project outcomes and associated outputs that reflected the various intervention strategies employed by the project, and identified missed opportunities in achieving the project goals. The assessment provided an understanding of the project logic used (figure A11.2) and a review of the fidelity with which the project activities were implemented (figure A11.3).

Figure A11.4: Project outputs and outcomes. Outputs 1.1 (management capacity of Lewa strengthened), 1.2 (Lewa revenue streams and funding base enhanced), and 1.3 (strategic plans and partnerships developed to improve effectiveness) feed Outcome 1: long-term institutional and financial capacity of Lewa to provide global and local benefits from wildlife conservation strengthened. Outputs 2.1 (security of endangered species increased) and 2.2 (research and monitoring of wildlife and habitats improved) feed Outcome 2: protection and management of endangered wildlife species in the wider ecosystem strengthened. Outputs 3.1 (capacity of local communities to undertake conservation-compatible income-generating activities strengthened), 3.2 (community natural resource management institutions and structures enhanced), and 3.3 (community skills and roles developed to optimise wildlife benefits) feed Outcome 3: community-based conservation and natural resource management initiatives strengthened.

Table A11.1: Project outcomes

Outcome | Assessment
Outcome 1: Long-term institutional and financial capacity of Lewa to provide global and local benefits from wildlife conservation strengthened | Fully achieved (5)
Outcome 2: Protection and management of endangered wildlife species in the wider ecosystem strengthened | Well achieved (4)
Outcome 3: Community-based conservation and natural resource management initiatives strengthened | Well achieved (4)
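Table A11.1 pairs each qualitative assessment with a numeric score ("Fully achieved (5)", "Well achieved (4)"), which makes outcome ratings aggregable across a project. A hedged sketch follows: only the two labels shown in the table are documented; the lower rungs of the scale and the averaging rule are assumptions for illustration.

```python
# Mapping qualitative achievement labels to the numeric scores used in
# Table A11.1. Labels below "well achieved" are assumed, as is the use of a
# simple average as a composite project rating.

RATING = {
    "fully achieved": 5,
    "well achieved": 4,
    "partially achieved": 3,   # assumed lower rungs of the scale
    "poorly achieved": 2,
    "not achieved": 1,
}

def project_rating(outcome_assessments):
    """Average numeric score across outcomes, one possible way to produce a
    composite ranking of a project's achievements."""
    scores = [RATING[a.lower()] for a in outcome_assessments]
    return sum(scores) / len(scores)

lewa = ["Fully achieved", "Well achieved", "Well achieved"]  # outcomes 1-3
composite = project_rating(lewa)  # (5 + 4 + 4) / 3
```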
Assess the level of contribution (i.e., impact)

A targets-threats analysis of those ecological features identified as global environment benefits (black rhinos and Grevy's zebra) was undertaken with input from scientists from the Lewa and NRT research departments. Tables A11.2 and A11.3 provide an overview of the variables considered, to increase the robustness of the understanding of the ecological changes that have taken place since before the project started.

Table A11.2: Change in key ecological attributes over time

Black rhino
Key ecological attribute | Indicator | Unit | Baseline | Project end | Now
Population size | Total population size of black rhino on Lewa | Number | 29 | 40 | 54
Productivity | Annual growth rates at Lewa | Percent | 12 | 13 | 15
Suitable secure habitat | Size of Lewa rhino sanctuary | Acres | 55,000 | 55,000 | 62,000
Genetic diversity | Degree of genetic variation | -- | No data available

Grevy's zebra
Key ecological attribute | Indicator | Unit | Baseline | Project end | Now
Population size | Total population size of Grevy's zebra on Lewa | Number | 497 | 435 | 430
Productivity | Annual foaling rates on Lewa | Percent | 11 | 11 | 12
Population distribution and connectivity | Number of known sub-populations | -- | No data available
Suitable habitat (grassland and water) | Community conservancies set aside and secure for conservation under NRT | Number | 3 | 4 | 15
Genetic diversity | Degree of genetic variation | -- | No data available

Table A11.3: Current threats to the global environment benefits

Threat | Severity score (1-4)a | Scope score (1-4)b | Overall ranking
Black rhino
Poaching and snaring | 3 | 3 | 3
Insufficient secure areas | 2 | 3 | 2
Habitat loss (due to elephant density) | 1 | 1 | 1
Grevy's zebra
Poaching | 2 | 2 | 2
Disease | 4 | 2 | 3
Predation | 3 | 1 | 2
Habitat loss/degradation | 3 | 3 | 3
Insufficient secure areas | 2 | 2 | 2
Hybridization with Burchell's zebra | 1 | 1 | 1

a. Severity (level of damage): destroy or eliminate the GEBs / seriously degrade the GEBs / moderately degrade the GEBs / slightly impair the GEBs.
b. Scope (geographic extent): very widespread or pervasive / widespread / localized / very localized.
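Table A11.3 combines a severity score and a scope score (each 1-4) into an overall ranking but does not state the combination rule. One reading consistent with every row shown is the mean of the two scores rounded down; the sketch below encodes that assumed rule and checks it against the printed rows.

```python
# Hedged sketch: an assumed rule for Table A11.3's overall threat ranking
# (floor of the mean of severity and scope). The rule is an inference from
# the printed rows, not a documented formula.

def overall_ranking(severity, scope):
    """Combine severity and scope (1-4 each) into an overall rank 1-4.
    E.g., disease for Grevy's zebra: severity 4, scope 2 -> (4+2)//2 = 3."""
    return (severity + scope) // 2

table_rows = [  # (severity, scope, overall) as printed in Table A11.3
    (3, 3, 3), (2, 3, 2), (1, 1, 1),   # black rhino threats
    (2, 2, 2), (4, 2, 3), (3, 1, 2),   # Grevy's zebra threats
    (3, 3, 3), (2, 2, 2), (1, 1, 1),
]
assert all(overall_ranking(s, c) == o for s, c, o in table_rows)
```

Other rules (e.g., taking the maximum) would fail on at least one row, so the floor-of-mean reading is the simplest fit, though the evaluators may also have applied judgment case by case.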
Provide explanations for observed impact

Theory of change models were developed for each project outcome to establish contribution; the framework reflected in figure A11.5 was used. This analysis enabled an examination of the links between observed project interventions and observed impact. As per GEF principles, factors that were examined as potentially influencing results included the appropriateness of the intervention, the sustainability of the intervention, and its catalytic effect--these are referred to as impact drivers. The next step involved the identification of intermediary states, examining whether the successful achievement of a specific project outcome would directly lead to the intended impacts and, if not, identifying additional conditions that would need to be met to deliver the impact. Taking cognizance of factors that are beyond project control, the final step identified those factors that are necessary for the realization and sustainability of the intermediary state(s) and ultimate impacts, but outside the project's influence.

An example is provided by a consideration of Outcome 3, community-based conservation and natural resource management initiatives strengthened, which was expected to achieve enhanced conservation of black rhinos and Grevy's zebras. The theory of change model linking Outcome 3 to the intended impacts is illustrated in figure A11.6. The overall logframe assessment of the project's implementation for community-based conservation and natural resource management was well achieved. All intermediate factors/impact drivers/external assumptions that were identified received a score of partially to well achieved, indicating that, together with all its activities, this component was well conceived and implemented.

In sum for Lewa

The analysis provided indication that the black rhino and Grevy's zebra populations on the Lewa Conservancy are very well managed and protected. Perhaps the most notable achievement has been the visionary, catalytic, and supporting role that Lewa has provided for the conservation of these endangered species in the broader ecosystem, beyond Lewa. Lewa has played a significant role in the protection and management of about 40% of Kenya's black rhino population and is providing leadership in finding innovative ways to increase the coverage of secure sanctuaries for black rhinos. Regarding the conservation of Grevy's zebra, Lewa's role in the establishment of community conservancies, which have added almost 1 million acres of land set aside for conservation, has been unprecedented in East Africa and is enabling the recovery of Grevy's zebra populations within their natural range. However, the costs and resources required to manage and protect this increasing conservation estate are substantial, and unless continued and increasing financing streams are maintained, it is possible that the substantial gains in the conservation of this ecosystem and its global environmental benefits could eventually be reversed.

In conclusion

The assessment of project conceptualization and implementation of project activities in Lewa has been favorable, but this is coupled with indications that threats from poaching, disease, and habitat loss in and around Lewa continue to be severe. Moreover, evaluation of the other case studies (the Bwindi Impenetrable National Park and Mgahinga Gorilla National Park Conservation Project, Uganda, and Reducing Biodiversity Loss at Cross-Border Sites in East Africa, Regional: Kenya, Tanzania, Uganda) confirmed that the absence of a specific plan for institutionalized continuation would, in particular, reduce long-term results in the generation of global environment benefits--this was the major conclusion of the GEF's pilot impact evaluation.
All intermediate factors/ long-term results in the generation of global impact drivers/external assumptions that were environment benefits the absence of a specific identified received a score of partially to well plan for institutionalized continuation would, achieved, indicating that together with all its in particular, reduce results over time--this was activities, this component was well-conceived the major conclusion of the GEF's pilot impact and implemented. evaluation. 98 a p p E n d I x 1 1 : E va l u at I o n s b a s E d o n q u a l I tat I v E a n d q u a n t I tat I v E d E s c r I p t I v E m E t h o d s Figure A11.5: Framework to establish contribution External assumption Intermediate Impact Impact (enhanced Project outcome state (reduced threats) conservation status) Impact driver Figure A11.6: Model linking outcome to impact LWC capacity building in local community institutions is scaled up to meet demand [S2/ C2] Increased Reduced threats community from poaching support and and the lack of outcome 3 land set aside secure areas Community- for conservation Impact based Enhanced conservation conservation status and natural Community of GEBs Reduced pressure resource natural resource on local natural management needs better resource base/ initiatives met in long wildlife habitat strengthened term Other community Conservation- land uses Livelihood based land uses complement and improvements don't make a significant do not undermine lead to increased contribution to conservation-based population livelihoods [A2] land uses [A1] 99 APPENDIX 12: FURTHER INFORMATION ON REVIEW AND SYNTHESIS APPROACHES IN IMPACT EVALUATION realist synthesis and the realist synthesis. Both perspectives This approach is different from the systematic have something to offer. Opening the black box research reviews. 
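The contribution logic just described (project outcome, intermediary state, intended impact, each conditioned on impact drivers and external assumptions) can be sketched as a small data structure and check. This is an illustrative sketch only; the entries and ratings below are hypothetical examples, not data from the Lewa evaluation or GEF tooling.

```python
# Illustrative sketch of the outcome -> intermediary state -> impact chain
# used in a GEF-style theory-of-change assessment. All entries are invented.

CHAIN = {
    "outcome": "Community-based conservation initiatives strengthened",
    "intermediary_states": ["Reduced threats from poaching and lack of secure areas"],
    "impact": "Enhanced conservation status of GEBs",
    # Conditions within project influence (impact drivers) and outside it
    # (external assumptions); each is rated as holding (True) or not (False).
    "impact_drivers": {"Capacity building scaled up to meet demand": True},
    "external_assumptions": {
        "Other land uses do not undermine conservation": True,
        "Financing streams are maintained": False,
    },
}

def contribution_check(chain):
    """Report which conditions along the theory of change are unmet."""
    unmet = [n for n, held in chain["impact_drivers"].items() if not held]
    unmet += [n for n, held in chain["external_assumptions"].items() if not held]
    if not unmet:
        return "All drivers and assumptions hold; the outcome can plausibly deliver the impact."
    return "Impact delivery at risk; unmet conditions: " + "; ".join(unmet)

print(contribution_check(CHAIN))
```

The point of the sketch is only that the assessment is a walk along the chain, flagging any driver or assumption that fails, rather than a single before/after comparison.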
APPENDIX 12: FURTHER INFORMATION ON REVIEW AND SYNTHESIS APPROACHES IN IMPACT EVALUATION

Realist synthesis
This approach is different from systematic research reviews. It conceptualizes interventions, programs, and policies as theories, and collects earlier research findings by interpreting the specific policy instrument that is evaluated as an example or specimen of more generic instruments and tools (of governments). Next it describes the intervention in terms of its context, mechanisms (what makes the program work), and outcomes (the deliverables).

Instead of synthesizing results from evaluations and other studies per intervention or per program, realist evaluators first open the black box of an intervention and synthesize knowledge about social and behavioral mechanisms. Examples are Pawson's study of incentives (Pawson, 2002), his work on naming and shaming and on Megan's law (Pawson, 2006), and Kruisbergen's work (2005) on fear-arousal communication campaigns trying to reduce the smuggling of cocaine.

Contrary to producers of systematic research reviews, realist evaluators do not use a hierarchy of research designs. For realists, an impact study using an RCT design is not necessarily better than a comparative case study design or a process evaluation. The problem (of an evaluation) that needs to be addressed is crucial in selecting the design or method, not vice versa.

Combining different meta approaches
In a study of the question of which public policy programs designed to reduce and/or prevent violence in the public arena work best, Van der Knaap et al. (2008) have shown the relevance of combining the systematic research review and the realist synthesis. Both perspectives have something to offer. Opening the black box of an intervention under review will be helpful for experimental evaluators if they want to understand why interventions have (no) effects and/or side effects. Realists are confronted with the problem of the selection of studies to be taken into account, ranging from opinion surveys, oral history, and newspaper content analysis to results based on more sophisticated methodologies. As the methodological quality of evaluations can be, and sometimes is, a problem, particularly with regard to the measurement of the impact of a program, realists can benefit from a stricter methodology and protocol, like the one used by the Campbell Collaboration, when doing a synthesis. For example, knowledge that is to be generalized should be credible and valid.

To combine Campbell standards and the realist evaluation approach, Van der Knaap et al. (2008) first conducted a systematic review according to the Campbell standards. The research questions were formulated, and next the inclusion and exclusion criteria were determined. This included a number of questions. What types of interventions are included? At which participants should interventions be aimed? What kinds of outcome data should be reported? At this stage, criteria were also formulated for the inclusion and exclusion of study designs and for methodological quality. As a third step, the search for potential studies was explicitly described. Once potentially relevant studies had been identified, they were screened for eligibility according to the inclusion and exclusion criteria.

After selecting the relevant studies, the quality of these studies had to be determined. Van der Knaap et al. (2008) used the Maryland Scientific Methods Scale (MSMS) (Sherman et al., 1998; Welsh and Farrington, 2006). This is a five-point scale that enables researchers to draw conclusions on the methodological quality of outcome evaluations in terms of internal validity. The MSMS is applied to rate the strength of scientific evidence on a scale of 1-5, with 1 being the weakest and 5 the strongest scientific evidence for inferring cause and effect.

Based on the MSMS scores, the authors then classified each of the 36 interventions that were inventoried (by analyzing some 450 English, German, French, and Dutch articles and papers) into the following categories: effective, potentially effective, potentially ineffective, and ineffective. However, not all studies could be grouped into one of the four categories: in 16 cases the quality of the study design was not good enough to decide on the effectiveness of a measure. Of the remaining 20 interventions, nine were labeled effective and six potentially effective; four interventions were labeled potentially ineffective and one was labeled ineffective in preventing violence in the public and semi-public domain.

The realist approach was applied after finishing the Campbell-style systematic review. This means that only then did the underlying mechanisms and contexts described in the studies included in the review come onto the agenda of the evaluator. This was done for all four types of interventions, whether they were measured as being effective, potentially effective, potentially ineffective, or ineffective. As a first step, information was collected concerning the social and behavioral mechanisms that were assumed to be at work when the program or intervention was implemented. Pawson (2006: 24) refers to this process as looking "beneath the surface [of a program] in order to inspect how they work." One way of doing this is to search the articles under review for statements that address the why question: why will this intervention work, or why has it not worked? Two researchers independently articulated these underlying mechanisms. The focus was on the behavioral and social "cogs and wheels" of the intervention (Elster, 1989, 2007).

In a second step the studies under review were searched for information on contexts (schools, streets, banks, etc., but also types of offenders and victims and types of crime) and on outcomes. This completed the C[ontext], M[echanism], and O[utcome] approach that characterizes realist evaluations. However, not every original evaluation study described which mechanisms were assumed to be at work when the program was implemented. The same goes for contexts and outcomes. This meant that in most cases missing links in or between different statements in the evaluation study had to be identified through argumentational analysis.

Based on the evaluations analyzed, Van der Knaap et al. (2008) traced three mechanisms at work in programs that had demonstrated their impact, or were very likely to do so:
· The first is of a cognitive nature, focusing on learning, teaching, and training.
· The second (overarching) mechanism concerns the way the (social) environment rewards or punishes behavior (through bonding, community development, and the targeting of police activities).
· The third mechanism is risk reduction, for instance, promoting protective factors.

Concluding remarks on review and synthesis approaches
Given the "fleets" (Weiss, 1998) and streams of studies (Rist and Stame, 2006) in the world of evaluation, it is not recommended to start an impact evaluation of a specific program, intervention, or tool of government without making use of the accumulated evidence to be found in systematic reviews and other types of meta-studies. One reason concerns the efficiency of the investments: what has been sorted out does not (always) need to be sorted out again. If it has been found over and over again that awareness-raising leads to behavior changes only under specific conditions, then it is wise to have that knowledge ready before designing a similar program or evaluation. A second reason is that by using results from synthesis studies the test of an intervention theory can be done with more rigor. The larger the discrepancy between what is known about the mechanisms a policy or program believes to be at work and what the policy in fact tries to set into motion, the smaller the chances of an effective intervention.

Different approaches in the world of (impact) evaluation are a wise thing to have, but (continuous) paradigm wars ("randomistas versus relativistas," realists versus experimentalists) run the risk of developing into intellectual ostracism. Wars also run the risk of vesting the image of evaluations as a "helter-skelter mishmash [and] a stew of hit-or-miss procedures" (Perloff, 2003), which is not the best perspective to live with. Combining perspectives and paradigms should therefore be stimulated.
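The two-stage procedure described above, first screening studies on methodological quality (MSMS level) and then classifying the interventions they evaluate, can be sketched as follows. The cut-off level and the example records are assumptions for illustration only; they are not Van der Knaap et al.'s actual inclusion rules or data.

```python
# Sketch of a Campbell-style screening and classification step.
# Assumption for illustration: studies below a minimum MSMS level (here 3 on
# the 1-5 scale) are set aside as "quality insufficient"; the rest are
# classified by their reported effect. Example records are invented.

MIN_MSMS_LEVEL = 3  # hypothetical inclusion threshold

studies = [
    {"intervention": "mentoring program",    "msms": 4, "effect": "positive"},
    {"intervention": "awareness campaign",   "msms": 2, "effect": "positive"},
    {"intervention": "cctv in parking lots", "msms": 5, "effect": "none"},
]

def classify(study):
    if study["msms"] < MIN_MSMS_LEVEL:
        return "quality insufficient to decide"
    if study["effect"] == "positive":
        # A single high-quality positive study: "potentially effective";
        # replicated positive findings would be needed for "effective".
        return "potentially effective"
    if study["effect"] == "none":
        return "potentially ineffective"
    return "ineffective"

labels = {s["intervention"]: classify(s) for s in studies}
for name, label in labels.items():
    print(f"{name}: {label}")
```

Note that quality screening and effect classification are deliberately separate steps, mirroring the review's finding that 16 of 36 interventions could not be judged at all for design-quality reasons.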
APPENDIX 13: BASIC EDUCATION IN GHANA

Introduction
In 1986 the government of Ghana embarked on an ambitious program of educational reform: shortening the length of pre-university education from 17 to 12 years, reducing subsidies at the secondary and tertiary levels, increasing the length of the school day, and taking steps to eliminate unqualified teachers from schools. These reforms were supported by four World Bank credits: the Education Sector Adjustment Credits I and II, the Primary School Development Project, and the Basic Education Sector Improvement Project. An impact study by IEG looked at what had happened to basic education (grades 1-9, in primary and junior secondary school) over this period.

Data and methodology
In 1988-89 the Ghana Statistical Service (GSS) undertook the second round of the Ghana Living Standards Survey (GLSS 2). Half of the 170 areas surveyed around the country were chosen at random to have an additional education module, which administered math and English tests to all those aged 9-55 years with at least three years of schooling, and which surveyed schools in the enumeration areas. Working with both GSS and the Ministry of Education, Youth and Sport (MOEYS), IEG resurveyed these same 85 communities and their schools in 2003, applying the same survey instruments. In the interests of comparability the same questions were kept, although new ones were added pertaining to school management, as were two whole new questionnaires: a teacher questionnaire for five teachers at each school and a local language test in addition to the math and English tests. The study thus had a possibly unique data set: not only could children's test scores be linked to both household and school characteristics, but this could be done in a panel of communities over a 15-year period. The test scores are directly comparable because exactly the same tests were used in 2003 as had been applied 15 years earlier.

There was no clearly defined project for this study; rather, it examined support to the sub-sector through four large operations. The four projects had supported a range of activities, from rehabilitating school buildings to assisting in the formation of community-based school management committees. To identify the impact of these various activities, a regression-based approach was adopted that analyzed the determinants of school attainment (years of schooling) and achievement (learning outcomes, i.e., test scores). For some of these determinants, notably books and buildings, the contribution of the World Bank to better learning outcomes could then be quantified.

The methodology adopted a theory-based approach to identify the channels through which a diverse range of interventions were having their impact. As discussed below, the qualitative context of the political economy of education reform in Ghana at the time proved to be a vital piece of the story.

Findings
The first major finding from the study was a factual one. Contrary to official statistics, enrollments in basic education had been rising steadily over the period. This discrepancy was readily explained: in the official statistics, both the numerator and the denominator were wrong. The numerator was wrong because it relied on administrative data from the school census, which had incomplete coverage of the public sector and did not cover the rapidly growing private sector. A constant mark-up was made to allow for private sector enrollments, but the IEG analysis showed that these had gone up fourfold (from 5% to 20% of total enrollments) over the 15 years. The denominator was based on the 1984 census, with an assumed rate of growth that turned out to be too high once the 2000 census became available, thus underestimating enrolment growth.

More strikingly still, learning outcomes have improved markedly: 15 years ago nearly two-thirds (63%) of those who had completed grades 3-6 were, using the English test as a guide, illiterate. By 2003 this figure had fallen to 19%. The finding of improved learning outcomes flies in the face of qualitative data from many, though not all, key informant interviews. But such key informants display a middle-class bias that persists against reforms that were essentially populist in nature.

Also striking are the improvements in school quality revealed by the school-level data:
· In 1988, fewer than half of schools could use all their classrooms when it was raining, but in 2003 over two-thirds could do so.
· Fifteen years ago over two-thirds of primary schools reported occasional shortages of chalk. Only one in 20 does so today, with 86% saying there is always enough.
· The percentage of primary schools having at least one English textbook per pupil has risen from 21% in 1988 to 72% today, and for math books in junior secondary school (JSS) these figures are 13% and 71%, respectively.

School quality has improved across the country, in poor and non-poor communities alike. But there is a growing disparity within the public school sector. Increased reliance on community and district financing has meant that schools in relatively prosperous areas continue to enjoy better facilities than do those in less-well-off communities.

The IEG study argues that Ghana has been a case of quality-led quantity expansion in basic education. The education system was in crisis in the seventies; school quality was declining and absolute enrolments falling. But by 2000, more than 90% of Ghanaians 15 and older had attended school, compared to 75% 20 years earlier. In addition, drop-out rates have fallen, so completion rates have risen: by 2003, 92% of those entering grade 1 completed JSS (grade 9). Gender disparities have been virtually eliminated in basic enrolments. Primary enrolments have risen both in disadvantaged areas and amongst the lowest income groups. The differentials between the poorest areas and other parts of the country, and between enrollments of the poor and non-poor, have narrowed but still exist.

Statistical analysis of the survey results showed the importance of building school infrastructure based on enrollments. Building a school, and so reducing children's travel time, has a major impact on enrollments. Although the majority of children live within 20 minutes of school, some 20% do not, and school building has increased enrollments among these groups. In one area surveyed, average travel time to the nearest school was cut from nearly an hour to less than 15 minutes, with enrollments increasing from 10% to 80%. In two other areas, average travel time was reduced by nearly 30 minutes and enrollments increased by more than 20%. Rehabilitating classrooms so that they can be used when it is raining also positively affects enrollments; complete rehabilitation can increase enrollments by as much as one-third. Across the country as a whole, the changes in infrastructure quantity and quality accounted for a 4% increase in enrolments between 1988 and 2003, about one-third of the increase over that period. The World Bank has been the main source of finance for these improvements. Before the first World Bank program, communities were responsible for building their own schools, and these structures collapsed after a few years. The Bank has financed 8,000 school pavilions around the country, providing more permanent structures that can better withstand the weather.

Learning outcomes depend significantly on school quality, including textbook supply. Bank-financed textbook provision accounts for around one-quarter of the observed improvement in test scores. But other major school-level determinants of achievement, such as teaching methods and the supervision of teachers by the head teacher and circuit supervisor, have not been affected by the Bank's interventions. The Bank has not been heavily involved in teacher training, and plans to extend in-service training have not been realized. Support to "hardware" has been shown to have made a substantial positive contribution to both attainment and achievement. But when satisfactory levels of inputs are reached (which is still far from the case for the many relatively deprived schools), future improvements could come from focusing on what happens in the classroom. However, the Bank's one main effort to change incentives (providing head teacher housing under the Primary School Development Project in return for the head teacher signing a contract on school management practices) was not a great success. Others, notably DFID and USAID, have made better progress in this direction, but with limited coverage.

The policy context, meaning government commitment, was an important factor in making the Bank's contributions work. The government was committed to improving the quality of life in rural areas, through the provision of roads, electricity, and schools, as a way of building a political base. Hence there was a desire to make the reforms work. Party loyalists were placed in key positions to keep the reform on track, the army distributed textbooks in support of the new curriculum in the early 1990s to make sure they reached schools on time, and efforts were made to post teachers to new schools and make sure that they received their pay on time. Teachers also benefited from the large civil service salary increase in the run-up to the 1992 election.

Better education leads to better welfare outcomes. Existing studies on Ghana show how education reduces fertility and mortality. Analysis of IEG's survey data shows that education improves nutritional outcomes, with this effect being particularly strong for children of women living in poorer households. Regression analysis shows there is no economic return to primary and JSS education as such (i.e., average earnings are not higher for children who have attended primary school and JSS than for children who have not), but there is a return to cognitive achievement. Children who attain higher test scores as a result of attending school can expect to enjoy higher incomes; children who learn little in school will not reap any economic benefit.

Some policy implications
The major policy finding from the study relates to the appropriate balance between hardware and software in support for education. The latter is now stressed, but the study highlights the importance of hardware: books and buildings. It was also of course important that teachers were in their classrooms; the government's own commitment (borne out of a desire to build political support in rural areas) helped ensure this happened.

In the many countries and regions in which educational facilities are inadequate, hardware provision is a necessary step in increasing enrollments and improving learning outcomes. The USAID project in Ghana encourages teachers to arrange children's desks in groups rather than rows, but many of the poorer schools don't have desks. In the words of one teacher, "I'd like to hang posters on my walls but I don't have posters. In fact, as you can see, I don't have any walls."

These same concerns underlie a second policy implication. Central government finances teachers' salaries and little else in basic education. Other resources come from donors, districts, or the communities themselves. There is thus a real danger of poorer communities falling behind, as they lack both the resources and the connections needed to access external resources. The reality of this finding was reinforced both by qualitative data (field trips to the best and worst performing schools in a single district in the same day) and by the quantitative data, which show the poorer performance of children in these disadvantaged schools. Children of poorer communities are left behind and account for the remaining illiterate primary graduates, which should be a pressing policy concern.
The study highlighted other areas of concern: first, low teacher morale, manifested in increased absenteeism; and second, the growing importance of the private sector, which now accounts for 20% of primary enrolments compared to 5% 15 years earlier. This is a sector that has had limited government involvement and none from the Bank.

APPENDIX 14: HIERARCHY OF QUASI-EXPERIMENTAL DESIGNS

Notation: T1 = start of project (pre-test); T2 = mid-term evaluation; T3 = end of project (post-test); P = project participants; C = control group; P1, P2, C1, C2 = first and second observations; X = project intervention (a process rather than a discrete event). Each entry also gives the stage of the project cycle at which the evaluation design can be used.

Relatively robust quasi-experimental designs

1. Pre-test/post-test non-equivalent control group design with statistical
   matching of the two groups. Participants are either self-selected or are
   selected by the project implementing agency. Statistical techniques (such as
   propensity score matching), drawing on high-quality secondary data, are used
   to match the two groups on a number of relevant variables.
   Observations: P1 X P2 / C1 C2. Can be used from: start of project.

2. Pre-test/post-test non-equivalent control group design with judgmental
   matching of the two groups. Participants are either self-selected or are
   selected by the project implementing agency. Control areas are usually
   selected judgmentally, and subjects are randomly selected from within these
   areas.
   Observations: P1 X P2 / C1 C2. Can be used from: start of project.

Less robust quasi-experimental designs

3. Pre-test/post-test comparison where the baseline study is not conducted
   until the project has been under way for some time (most commonly around the
   mid-term review).
   Observations: X P1 P2 / C1 C2. Can be used: during project implementation
   (often at mid-term).

4. Pipeline control group design. When a project is implemented in phases,
   subjects in Phase 2 (i.e., who will not receive benefits until some later
   point in time) can be used as the control group for Phase 1 subjects.
   Observations: P1 X P2 / C1 C2. Can be used from: start of project.

5. Pre-test/post-test comparison of project group combined with post-test
   comparison of project and control group.
   Observations: P1 X P2 / C2. Can be used from: start of project.

6. Post-test comparison of project and control groups.
   Observations: X P1 / C1. Can be used at: end of project.

Non-experimental designs (the least robust)

7. Pre-test/post-test comparison of project group.
   Observations: P1 X P2. Can be used from: start of project.

8. Post-test analysis of project group.
   Observations: X P1. Can be used at: end of project.

Source: Bamberger et al. (2006).
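Using the table's notation, the impact estimates behind these designs differ in which comparisons they retain. Designs 1-4 support a double difference (project change net of control change), while designs 6 and 7 each drop one comparison and therefore rest on stronger assumptions. A schematic sketch with invented numbers:

```python
# Schematic impact estimates in the table's notation.
# P1, P2 = project group before/after; C1, C2 = control group before/after.
# The values are invented for illustration (e.g., enrollment rates in percent).

P1, P2 = 40.0, 65.0   # project communities
C1, C2 = 42.0, 50.0   # matched control communities

# Designs 1-4: double difference -- project change net of control change.
double_difference = (P2 - P1) - (C2 - C1)

# Design 6: post-test-only comparison -- valid only if the groups started equal.
post_test_only = P2 - C2

# Design 7: before/after on the project group -- attributes all change,
# including secular trends, to the project.
before_after = P2 - P1

print(double_difference, post_test_only, before_after)  # 17.0 15.0 25.0
```

With these numbers the before/after estimate (25.0) overstates impact because the control group also improved by 8 points; the double difference (17.0) nets that trend out, which is why the designs that permit it sit higher in the hierarchy.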
Technology Sulley Gariba: Consultant, Executive Director, Thomas Schwandt: University Distinguished Institute of Policy Alternatives Teacher/Scholar and Professor of Education, Jennifer Greene: Professor, Educational University of Illinois at Urbana-Champaign Psychology, University of Illinois at Urbana- Nicoletta Stame: Professor, University of Rome Champaign "La Sapienza" Ernie House: Emeritus Professor, School of Bob Williams: Consultant, Author, member of the Education, University of Colorado Editorial Boards of the American Journal of Mel Mark: Professor of Psychology, Penn State Evaluation and New Directions in Evaluation University John Mayne: Consultant, Author, Adviser on public sector performance 111 ENDNOTES Executive Summary tions. Both replicatory and systemic effects can result 1. Available at www.worldbank.org/ieg/nonie. from processes of change at institutional or benefi- 2. OECD-DAC (2002): "Glossary of Key Terms ciary levels. With respect to the first, evaluations that in Evaluation and Results Based Management," cover replicatory effects are quite scarce. This is in dire OECD-DAC, Paris. contrast with the manifest presence of replication (and the related concept of scaling up) as explicit objectives Introduction in many policy interventions. For further discussion 1. The history of impact evaluations in some on replication, see, for example, GEF (2007). These countries goes back many decades (Oakley, 2000). dimensions can be addressed in a theory-based impact 2. The Maryland Scientific Methods Scale (MSMS) evaluation framework (see chapter 3). is, for example, used in parts of criminology and in 8. This is the interpretation that has received several countries (see Leeuw, 2005). RCTs are believed the most attention in methodological guidelines of to be the top design (level 5). international organizations working on impact evalua- tion, such as the World Bank or the Asian Develop- chapter 1 ment Bank. 1. 
An interesting overview of public-private partnerships and their evaluation is given by Utce Ltd. and Japan Pfi Association (2003).
2. "We probably also under-invest in evaluative research on types of interventions that tend to have diffused, wide-spread benefits" (Ravallion, 2008: 6). See also Jones et al. (2008), who have identified geographical and sectoral biases in impact evaluation.
3. Complexity in terms of the nature of change processes induced by an intervention.
4. For example, Elbers et al. (2008) directly assess the impact of a set of policy variables (i.e., the equivalent of a multi-stranded program) on outcome variables by means of a regression-based evaluation approach (see chapter 4).
5. Though not necessarily easy to measure.
6. Please note that the two levels should not be regarded as a dichotomy. In fact, a particular intervention might induce a "cascade" of processes of change at different institutional levels (e.g., national, provincial government, cooperatives) before finally affecting the welfare of individuals.
7. A third and fourth level of impact, more difficult to pinpoint, refer to the replicatory impact and the wider systemic effects of interventions, respectively.
9. In this context one can distinguish between the effect of aid modalities on "the way business is being done" (additionality of funding, direction of funding, public sector performance, coherence of policy changes, quality of intervention design, etc.; see, e.g., Lawson et al., 2005), i.e., what we call institutional-level impact, and subsequently the impact of interventions funded (in part) by general budget support, sector budget support, or debt relief funds at the beneficiary level. In the latter case, we are talking about impact evaluation as it is understood in most of the literature.

Chapter 2
1. "Values inquiry refers to a variety of methods that can be applied to the systematic assessment of the value positions surrounding the existence, activities, and outcomes of a social policy and program" (Mark et al., 1999: 183).
2. For a discussion of different dimensions of sustainability in development interventions, see Mog (2004).

Chapter 4
1. Economists employ several useful techniques for estimating the marginal impact of an extra dollar invested in a particular policy intervention. See, for example, appendix 1, second example. We consider these methods to be complementary to impact evaluation and beyond the scope of this guidance.
2. The larger the sample size, the more likely it is that groups are equivalent, on average.
3. We would like to thank Antonie de Kemp of IOB for insightful suggestions. See also SG1 (2008).
4. Alternative, more nuanced classifications distinguish between experimental, quasi-experimental, and passive observational (correlational) research designs. Features that distinguish one type of design from another are (i) control over exposure to the treatment; (ii) control over the nature of the treatment; and (iii) control over the timing and nature of measurement. In experiments one has control over (i), (ii), and (iii); in quasi-experiments one usually controls (ii) and (iii) only; and in passive observational studies one does not have full control over any of these features (see, e.g., Posavac and Carey, 2002; personal communication, J. Scott Bayley).
5. We discuss only a selection of available methods. See Shadish et al. (2002) or Mohr (1995) for additional (quasi-experimental and regression-based) methods.
6. It is difficult to identify general guidelines for avoiding these problems. Evaluators have to be aware of the possibility of these effects affecting the validity of the design. For other problems, as well as solutions, see Shadish et al. (2002).
7. For further discussion of the approaches discussed below, see appendices 3–6.
8. For an explanation, see Wooldridge (2002), chapter 18.
9. This subsection draws largely on Bamberger (2006).
10. The approach is similar to a fixed-effects regression model that uses deviations from individual means to deal with (unobserved) selection effects.
11. Although in reality one will not find as clear a linear correlation as in figure 4.2.
12. With instrumental variables one may try to get rid of an expected bias, but the technique cannot guarantee that endogeneity problems will be solved completely (the instrumental variable may also be endogenous). Moreover, with weak instruments the precision of the estimate may be low.
13. Alternatively, impact evaluation in the case of complex interventions or complex processes of change can rely on several statistical modeling approaches to capture the complexity of a phenomenon. For example, an extension of the reduced-form regression-based approaches to impact evaluation referred to earlier are structural equation models, which can be used to model some of the more complex causal relationships that underlie interventions, using, for example, an intervention theory as a basis.
14. In general, regression-based techniques (and quasi-experimental techniques that rely on existing data) are primarily constrained by the availability of existing data (see chapter 8). In contrast, experimental and quasi-experimental techniques that rely on design-based group comparisons face more pressing constraints in terms of the need for ex ante involvement of evaluators in a policy intervention (see appendix 14). Consequently, there is probably more scope for extending the use of the former group of techniques.
15. This might need to be analyzed using other methods (see §4.4 and chapter 5).
16. See appendices 7 and 8 for brief discussions of additional approaches applicable to impact evaluation problems in multi-level settings.
17. However, as explained below, in some cases these methods can be articulated with quantitative methods of impact evaluation (see also chapter 5).
18. See also SG2 (2008).
19. One of the methods that relies on the reconstruction of stakeholder perspectives is called the strategic assessment approach, also known as assumptional analysis. It can be found in a series of studies (Jackson, 1989) but has as its core knowledge basis Mason and Mitroff's (1981) book Challenging Strategic Planning Assumptions (see also Leeuw, 2003; see also chapter 3).
20. Participatory Learning and Action, as a generic approach with an associated set of methods, has its origins in rapid rural appraisal and participatory rural appraisal. Participatory poverty assessment processes have built strongly on this tradition.
21. Although particular case studies of localized intervention activities within the sector program might be conducted in a participatory manner.
22. When addressing the attribution problem, the role of participatory approaches is also restricted, because perceptions and experiences of participants collected through participatory methods run the risk of making an evaluation "partnerial." In such a situation, the distinction between evaluator and evaluated is blurred. As policies and programs often, implicitly or explicitly, deal with interests, incentives, and disincentives, this complicates the process and the reliability of the evaluation outcomes. (See also §8.3 for a wider discussion of data quality issues.)
23. Throughout this document we have used the rather generic terms "quantitative" and "qualitative" methods of research/evaluation. Although we are aware of the limitations of these concepts, we have opted to use them because of their widespread and accepted use. In practice, often but not always, a distinction can be made between methods of data collection and methods of data analysis. In addition, one should distinguish between the type of method and the scale of measurement (type of data). For example, quantitative data (that is, data measured on interval or ratio scales) can be collected using what are often called qualitative methods. Rather than spending a lot of effort on coherently separating these issues, we decided to keep things simple for the sake of argument (and space).
24. Please note that different methods rely on different types of sampling or selection of units of analysis. For example, quantitative descriptive analysis (preferably) relies on data based on random (simple, stratified, clustered) samples or on census data. In contrast, many qualitative methods rely on nonrandom sampling techniques such as purposive or snowball sampling, or do not rely on sampling at all, as they might focus on a relatively small number of observations.
25. Appendix 9 presents a list of qualitative methodological frameworks that combine several qualitative (and occasionally quantitative) methods for the purposes of evaluating the effects of an intervention (see also chapter 5 on combining methods).

Chapter 5
1. This dimension is only addressed by quantitative impact evaluation techniques.
2. The most commonly used term is mixed methods (see, for example, Tashakkori and Teddlie, 2003). In the case of development research and evaluation, see Bamberger (2000) and Kanbur (2003).
3. This is true for the broad interpretation of the concept of triangulation as used by Mikkelsen (2005). Other authors use the concept in a more restrictive way (e.g., Bamberger [2000] uses triangulation in the narrower sense of validating findings by looking at different data sources).
4. This is an issue that is closely related to the idea of external validity. If one knows how an intervention affects groups of people in different ways, then one can more easily generalize findings to other, similar settings.

Chapter 6
1. This step may rely on statistical methods (meta-analysis) for analyzing and summarizing the results of included studies, if quantitative evidence at the level of single-intervention studies is available and if interventions are considered similar enough.

Chapter 8
1. In some cases, talking about the "end" of an intervention is not applicable or is less applicable, for example, in institutional reforms, new legislation, fiscal policy, etc.
2. For example, with secondary data sets, what do we know about the quality of the data collection (e.g., sampling errors, training and supervision of interviewers) or data processing (e.g., dealing with missing values, weighting issues)? We cannot simply take for granted that a data set is free from error and bias. Lack of information on the process of generating the database inevitably constrains any subsequent data analysis efforts.
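The meta-analytic pooling step mentioned in the chapter 6 endnote above can be sketched as a minimal fixed-effect (inverse-variance) calculation. This is an illustrative sketch only: the function name and the two effect sizes are hypothetical, not taken from the guidance.

```python
import math

def inverse_variance_pool(effects, variances):
    """Fixed-effect meta-analysis: pool study-level effect sizes,
    weighting each study by the inverse of its variance."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    standard_error = math.sqrt(1.0 / sum(weights))
    return pooled, standard_error

# Two hypothetical impact studies; the more precise study
# (variance 0.01) dominates the pooled estimate.
pooled, se = inverse_variance_pool([0.2, 0.4], [0.01, 0.04])
```

The pull of the pooled estimate toward the more precise study is exactly the "summarizing" the endnote refers to; random-effects variants add a between-study variance component when interventions are less homogeneous.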
Chapter 9
1. An example from Europe stresses this point. In some situations, educational evaluators of the Danish Evaluation Institute discussed their reports with up to 20-plus stakeholders before a report was cleared and published (Leeuw, 2003).
2. For a broader discussion of ethics in evaluation, see Simons (2006).

Appendix 2
1. The text is a literal citation of Scriven (2008: 21–22).

Appendix 4
1. In traditional usage, a variable is endogenous if it is determined within the context of a model. In econometrics, the term is used to describe any situation in which an explanatory variable is correlated with the disturbance term. Endogeneity arises as a result of omitted variables, measurement error, or situations in which one of the explanatory variables is determined along with the dependent variable (Wooldridge, 2002: 50).
2. The approach is similar to a fixed-effects regression model, using deviations from individual means.
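The "deviations from individual means" device invoked in the Appendix 4 endnotes (the within transformation behind fixed-effects regression) can be illustrated with a small sketch. The function and the panel data below are hypothetical, constructed so that an unobserved individual effect is correlated with the regressor, i.e., the omitted-variable form of endogeneity described in note 1.

```python
def within_slope(panel):
    """Estimate a slope by OLS on deviations from individual means
    (the within transformation used in fixed-effects regression)."""
    x_dev, y_dev = [], []
    for observations in panel.values():
        xs = [x for x, _ in observations]
        ys = [y for _, y in observations]
        x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
        x_dev += [x - x_bar for x in xs]
        y_dev += [y - y_bar for y in ys]
    return sum(a * b for a, b in zip(x_dev, y_dev)) / sum(a * a for a in x_dev)

# Hypothetical noiseless panel: y = alpha_i + 2*x, where the unobserved
# effect alpha_i (0, 10, 20) rises with each individual's x level, so
# pooled OLS would overstate the slope; demeaning removes alpha_i.
panel = {
    i: [(x, alpha + 2 * x) for x in (base + 1, base + 2, base + 3)]
    for i, (alpha, base) in enumerate([(0, 0), (10, 10), (20, 20)])
}
slope = within_slope(panel)  # recovers the true slope, 2.0
```

Because the individual effect is constant within each unit, subtracting the unit means eliminates it entirely; this is why the transformation "deals with" time-invariant unobserved selection, while any time-varying confounder would survive it.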
Appendix 5
1. For further examples see White (2006).

Appendix 9
1. Source: SG2 (2008).

Appendix 11
1. This case study is drawn from the 2002 report published by the Ministry of Foreign Affairs, Denmark (SG2, 2008).
2. Source: SG2 (2008).
3. Typical problems with recall methods are incorrect recall and telescoping, i.e., projecting an event backward or forward in time: for example, the purchase of a durable good that took place seven years ago (before the project started) could be projected to four years ago, during project implementation (see, e.g., Bamberger et al., 2004).
4. Source: SG2 (2008).
5. The second project was inland valley development for irrigated rice cultivation and is not presented here.
6. Industrial plantations are the property of SOGUIPAH and are worked by salaried employees.
7. A contract between SOGUIPAH and the farmer binds the farmer to reimburse the cost of the plantation and to deliver his production to SOGUIPAH.
8. AGROPARISTECH is a member of the Paris Institute of Technology, a consortium of 10 of the foremost French graduate institutes in science and engineering. AGROPARISTECH is a leading institute in life sciences and engineering.
9. Source: SG2 (2008).
10. The GEF Evaluation Office section of the GEF website contains the 11 papers produced by the impact evaluation in 2007, under the heading "ongoing evaluations."
11. Instrument for the Establishment of the Restructured Global Environment Facility.
12. GEF Evaluation Office, "Approach Paper to Impact Evaluation," February 2006.
13. See the Preamble, "Instrument for the Establishment of the Restructured Global Environment Facility."
14. This is based on the Nature Conservancy's conservation action planning methodology.
15. Full case study at http://www.thegef.org/uploadedFiles/Evaluation_Office/Ongoing_Evaluations/Ongoing_Evals-Impact-8Case_Study_Lewa.pdf.

Appendix 13
1. White (2006).

References

ADB (2006) Impact Evaluation – Methodological and Operational Issues, Economics and Research Department, Asian Development Bank, Manila.
Agresti, A., and B. Finlay (1997) Statistical Methods for the Social Sciences, Prentice Hall, New Jersey.
Baker, J.L. (2000) Evaluating the Impact of Development Projects on Poverty, The World Bank, Washington, D.C.
Bamberger, M. (2000) "Opportunities and challenges for integrating quantitative and qualitative research", in: M. Bamberger (ed.) Integrating Quantitative and Qualitative Research in Development Projects, World Bank, Washington, D.C.
Bamberger, M. (2006) Conducting Quality Impact Evaluations under Budget, Time and Data Constraints, World Bank, Washington, D.C.
Bamberger, M., J. Rugh, M. Church and L. Fort (2004) "Shoestring Evaluation: Designing Impact Evaluations under Budget, Time and Data Constraints", American Journal of Evaluation 25(1), 5–37.
Bamberger, M., J. Rugh and L. Mabry (2006) Real-World Evaluation: Working Under Budget, Time, Data, and Political Constraints, Sage Publications, Thousand Oaks, CA.
Bamberger, M., and H. White (2007) "Using strong evaluation designs in developing countries: Experience and challenges", Journal of Multidisciplinary Evaluation 4(8), 58–73.
Bemelmans-Videc, M.L., and R.C. Rist (eds.) (1998) Carrots, Sticks and Sermons: Policy Instruments and their Evaluation, Transaction Publishers, New Brunswick.
Booth, D., and H. Lucas (2002) "Good Practice in the Development of PRSP Indicators", Working Paper 172, Overseas Development Institute, London.
Bourguignon, F., and M. Sundberg (2007) "Aid effectiveness: opening the black box", American Economic Review 97(2), 316–321.
Brinkerhoff, R. (2003) The Success Case Method, Berrett-Koehler, San Francisco.
Bryman, A. (2006) "Integrating quantitative and qualitative research: How is it done?", Qualitative Research 6(1), 97–113.
Bunge, M. (2004) "How Does It Work? The Search for Explanatory Mechanisms", Philosophy of the Social Sciences 34(2), 182–210.
Campbell, D.T. (1957) "Factors relevant to the validity of experiments in social settings", Psychological Bulletin 54, 297–312.
Campbell, D.T., and J.C. Stanley (1963) "Experimental and quasi-experimental designs for research on teaching", in: N.L. Gage (ed.) Handbook of Research on Teaching, Rand McNally, Chicago.
Carvalho, S., and H. White (2004) "Theory-based evaluation: The case of social funds", American Journal of Evaluation 25(2), 141–160.
Casley, D.J., and D.A. Lury (1987) Data Collection in Developing Countries, Oxford University Press, New York.
CGD (2006) When Will We Ever Learn? Improving Lives through Impact Evaluation, Report of the Evaluation Gap Working Group, Center for Global Development, Washington, D.C.
Chambers, R. (1995) "Paradigm Shifts and the Practice of Participatory Research and Development", in: S. Wright and N. Nelson (eds.) Power and Participatory Development: Theory and Practice, Intermediate Technology Publications, London.
Clarke, A. (2006) "Evidence-Based Evaluation in Different Professional Domains: Similarities, Differences and Challenges", in: I.F. Shaw, J.C. Greene and M.M. Mark (eds.) The SAGE Handbook of Evaluation, Sage Publications, London.
Coleman, J.S. (1990) Foundations of Social Theory, Belknap Press, Cambridge.
Connell, J.P., A.C. Kubisch, L.B. Schorr and C.H. Weiss (eds.) (1995) New Approaches to Evaluating Community Initiatives, The Aspen Institute, Washington, D.C.
Cook, T.D. (2000) "The false choice between theory-based evaluation and experimentation", in: P.J. Rogers, T.A. Hacsi, A. Petrosino and T.A. Huebner (eds.) Program Theory in Evaluation: Challenges and Opportunities, New Directions for Evaluation 87, Jossey-Bass, San Francisco.
Cook, T.D., and D.T. Campbell (1979) Quasi-Experimentation: Design and Analysis for Field Settings, Rand McNally, Chicago.
Cooke, B. (2001) "The Social Psychological Limits of Participation?", in: B. Cooke and U. Kothari (eds.) Participation: The New Tyranny?, Zed Books, London.
Cousins, J.B., and E. Whitmore (1998) "Framing Participatory Evaluation", in: E. Whitmore (ed.) Understanding and Practicing Participatory Evaluation, New Directions for Evaluation 80, Jossey-Bass, San Francisco.
Davies, R., and J. Dart (2005) The 'Most Significant Change' Technique, http://www.mande.co.uk/docs/MSCGuide.pdf (last consulted May 12, 2009).
Deaton, A. (2005) "Some remarks on randomization, econometrics and data", in: G.K. Pitman, O.N. Feinstein and G.K. Ingram (eds.) Evaluating Development Effectiveness, Transaction Publishers, New Brunswick, NJ.
Dehejia, R. (1999) "Evaluation in multi-site programs", Working paper, Columbia University and NBER, http://emlab.berkeley.edu/symposia/nsf99/papers/dehejia.pdf (last consulted January 12, 2009).
De Leeuw, E.D., J.J. Hox and D.A. Dillman (eds.) (2008) International Handbook of Survey Methodology, Lawrence Erlbaum Associates, London.
Duflo, E., and M. Kremer (2005) "Use of randomization in the evaluation of development effectiveness", in: G.K. Pitman, O.N. Feinstein and G.K. Ingram (eds.) Evaluating Development Effectiveness, Transaction Publishers, New Brunswick.
Elbers, C., J.W. Gunning and K. De Hoop (2008) "Assessing sector-wide programs with statistical impact evaluation: a methodological proposal", World Development 37(2), 513–520.
Elster, J. (1989) Nuts and Bolts for the Social Sciences, Cambridge University Press, Cambridge.
Elster, J. (2007) Explaining Social Behavior – More Nuts and Bolts for the Social Sciences, Cambridge University Press, Cambridge.
Farnsworth, W. (2007) The Legal Analyst – A Toolkit for Thinking about the Law, University of Chicago Press, Chicago.
GEF (2007) "Evaluation of the Catalytic Role of the GEF", Approach Paper, GEF Evaluation Office, Washington, D.C.
Gittinger, J.P. (1982) Economic Analysis of Agricultural Projects, Johns Hopkins University Press, Baltimore.
Greene, J.C. (2006) "Evaluation, democracy and social change", in: I.F. Shaw, J.C. Greene and M.M. Mark (eds.) The SAGE Handbook of Evaluation, Sage Publications, London.
Greenhalgh, T., G. Robert, F. Macfarlane, P. Bate and O. Kyriakidou (2004) "Diffusion of Innovations in Service Organizations: Systematic Review and Recommendations", The Milbank Quarterly 82(1), 581–629.
Hair, J.F., B. Black, B. Babin, R.E. Anderson and R.L. Tatham (2005) Multivariate Data Analysis, Prentice Hall, New Jersey.
Hansen, H.F., and O. Rieper (2009) "Institutionalization of second-order evidence producing organizations", in: O. Rieper, F.L. Leeuw and T. Ling (eds.) The Evidence Book: Concepts, Generation and Use of Evidence, Transaction Publishers, New Brunswick.
Hedström, P. (2005) Dissecting the Social: On the Principles of Analytical Sociology, Cambridge University Press, Cambridge.
Hedström, P., and R. Swedberg (1998) Social Mechanisms: An Analytical Approach to Social Theory, Cambridge University Press, Cambridge.
Henry, G.T. (2002) "Choosing Criteria to Judge Program Success – A Values Inquiry", Evaluation 8(2), 182–204.
House, E. (2008) "Blowback: Consequences of Evaluation for Evaluation", American Journal of Evaluation 29(4), 416–426.
IDRC (2001) Outcome Mapping: Building Learning and Reflection into Development Programs, International Development Research Centre (IDRC), Ottawa.
IEG (2005) "OED and Impact Evaluation: A Discussion Note", Operations Evaluation Department, World Bank, Washington, D.C.
IFAD (2002) Managing for Impact in Rural Development: A Practical Guide for M&E, IFAD, Rome.
Jackson, M.C. (1989) "Assumptional analysis", Systems Practice 14, 11–28.
Jerve, A.M., and E. Villanger (2008) The Challenge of Assessing Aid Impact: A Review of Norwegian Practice, Study commissioned by NORAD, Chr. Michelsen Institute, Bergen.
Jones, N., C. Walsh, H. Jones and C. Tincati (2008) Improving Impact Evaluation Coordination and Uptake – A Scoping Study Commissioned by the DFID Evaluation Department on Behalf of NONIE, Overseas Development Institute, London.
Kanbur, R. (ed.) (2003) Q-Squared: Combining Qualitative and Quantitative Methods in Poverty Appraisal, Permanent Black, Delhi.
Kellogg Foundation (1991) Information on Cluster Evaluation, Kellogg Foundation, Battle Creek.
Kraemer, H.C. (2000) "Pitfalls of Multisite Randomized Clinical Trials of Efficacy and Effectiveness", Schizophrenia Bulletin 26, 533–541.
Kruisbergen, E.W. (2005) "Voorlichting: doen of laten? Theorie van afschrikwekkende voorlichtingscampagnes toegepast op de casus van bolletjesslikkers", Beleidswetenschap 19(3), 3–1.
Kusek, J., and R.C. Rist (2004) Ten Steps to a Results-Based Monitoring and Evaluation System: A Handbook for Development Practitioners, World Bank, Washington, D.C.
Lawson, A., D. Booth, M. Msuya, S. Wangwe and T. Williamson (2005) Does General Budget Support Work? Evidence from Tanzania, Overseas Development Institute, London.
Leeuw, F.L. (2003) "Reconstructing Program Theories: Methods Available and Problems to be Solved", American Journal of Evaluation 24(1), 5–20.
Leeuw, F.L. (2005) "Trends and Developments in Program Evaluation in General and Criminal Justice Programs in Particular", European Journal on Criminal Policy and Research 11, 18–35.
Leeuw, F.L., and J.E. Furubo (2008) "Evaluation Systems – What Are They and Why Study Them?", Evaluation 14(2), 157–169.
Levinsohn, J., S. Berry and J. Friedman (1999) "Impacts of the Indonesian Economic Crisis: Price Changes and the Poor", Working Paper 7194, National Bureau of Economic Research, Cambridge.
Lipsey, M.W. (1993) "Theory as Method: Small Theories of Treatments", in: L.B. Sechrest and A.G. Scott (eds.) Understanding Causes and Generalizing about Them, New Directions for Program Evaluation 57, Jossey-Bass, San Francisco.
Lister, S., and R. Carter (2006) Evaluation of General Budget Support: Synthesis Report, Joint Evaluation of General Budget Support 1994–2004, Department for International Development, University of Birmingham.
Maluccio, J.A., and R. Flores (2005) "Impact evaluation of a conditional cash transfer program: The Nicaraguan Red de Protección Social", International Food Policy Research Institute, Washington, D.C.
Mansuri, G., and V. Rao (2004) "Community-Based and -Driven Development: A Critical Review", The World Bank Research Observer 19(1), 1–39.
Mark, M.M., G.T. Henry and G. Julnes (1999) "Toward an Integrative Framework for Evaluation Practice", American Journal of Evaluation 20, 177–198.
Mason, I., and I. Mitroff (1981) Challenging Strategic Planning Assumptions, Wiley, New York.
Mayne, J. (2001) "Addressing Attribution through Contribution Analysis: Using Performance Measures Sensibly", Canadian Journal of Program Evaluation 16(1), 1–24.
Mayntz, R. (2004) "Mechanisms in the Analysis of Social Macro-phenomena", Philosophy of the Social Sciences 34(2), 237–259.
McClintock, C. (1990) "Administrators as applied theorists", in: L. Bickman (ed.) Advances in Program Theory, New Directions for Evaluation, Jossey-Bass, San Francisco.
Mikkelsen, B. (2005) Methods for Development Work and Research, Sage Publications, Thousand Oaks, CA.
Mog, J.M. (2004) "Struggling with sustainability: A comparative framework for evaluating sustainable development programs", World Development 32(12), 2139–2160.
Mohr, L.B. (1995) Impact Analysis for Program Evaluation, Sage Publications, Newbury Park, CA.
Morgan, S.L., and C. Winship (2007) Counterfactuals and Causal Inference – Methods and Principles for Social Research, Cambridge University Press, Cambridge.
Mukherjee, C., H. White and M. Wuyts (1998) Econometrics and Data Analysis for Developing Countries, Routledge, London.
North, D.C. (1990) Institutions, Institutional Change and Economic Performance, Cambridge University Press, New York.
Oakley, A. (2000) Experiments in Knowing: Gender and Method in the Social Sciences, Polity Press, Cambridge.
OECD-DAC (2000) Effective Practices in Conducting a Multi-donor Evaluation, OECD-DAC, Paris.
OECD-DAC (2002) Glossary of Key Terms in Evaluation and Results Based Management, OECD-DAC, Paris.
Oliver, S., A. Harden, R. Rees, J. Shepherd, G. Brunton, J. Garcia and A. Oakley (2005) "An Emerging Framework for Including Different Types of Evidence in Systematic Reviews for Public Policy", Evaluation 11(4), 428–446.
Patton, M.Q. (2002) Qualitative Research and Evaluation Methods, Sage Publications, Thousand Oaks, CA.
Pawson, R. (2002) "Evidence-based Policy: The Promise of 'Realist Synthesis'", Evaluation 8(3), 340–358.
Pawson, R. (2005) "Simple Principles for the Evaluation of Complex Programmes", in: A. Killoran, M. Kelly, C. Swann, L. Taylor, L. Milward and S. Ellis (eds.) Evidence-Based Public Health, Oxford University Press, Oxford.
Pawson, R. (2006) Evidence-Based Policy: A Realist Perspective, Sage Publications, London.
Pawson, R., and N. Tilley (1997) Realistic Evaluation, Sage Publications, Thousand Oaks, CA.
Perloff, R. (2003) "A potpourri of cursory thoughts on evaluation", Industrial-Organizational Psychologist 40(3), 52–54.
Picciotto, R. (2004) "The value of evaluation standards: A comparative assessment", Paper presented at the European Evaluation Society's 6th Biennial Conference on Democracy and Evaluation, Berlin.
Picciotto, R., and E. Wiesner (eds.) (1997) Evaluation and Development: The Institutional Dimension, World Bank Series on Evaluation and Development, Transaction Publishers, New Brunswick.
Pollitt, C. (1999) "Stunted by stakeholders? Limits to collaborative evaluation", Public Policy and Administration 14(2), 77–90.
Posavac, E.J., and R.G. Carey (2002) Program Evaluation: Methods and Case Studies, Prentice Hall, Englewood Cliffs, NJ.
Pretty, J.N., I. Guijt, J. Thompson and I. Scoones (1995) A Trainers' Guide to Participatory Learning and Action, IIED Participatory Methodology Series, IIED, London.
Ravallion, M. (2008) "Evaluation in the practice of development", Policy Research Working Paper 4547, World Bank, Washington, D.C.
Rieper, O., F.L. Leeuw and T. Ling (eds.) (2009) The Evidence Book: Concepts, Generation and Use of Evidence, Transaction Publishers, New Brunswick, NJ.
Rist, R., and N. Stame (eds.) (2006) From Studies to Streams – Managing Evaluative Systems, Transaction Publishers, New Brunswick.
Robilliard, A.S., F. Bourguignon and S. Robinson (2001) "Crisis and Income Distribution: A Micro-Macro Model for Indonesia", International Food Policy Research Institute, Washington, D.C.
Roche, C. (1999) Impact Assessment for Development Agencies: Learning to Value Change, Oxfam, Oxford.
Rogers, P.J. (2008) "Using programme theory for complex and complicated programs", Evaluation 14(1), 29–48.
Rogers, P.J., T.A. Hacsi, A. Petrosino and T.A. Huebner (eds.) (2000) Program Theory in Evaluation: Challenges and Opportunities, New Directions for Evaluation 87, Jossey-Bass, San Francisco.
Rosenbaum, P.R., and D.B. Rubin (1983) "The central role of the propensity score in observational studies for causal effects", Biometrika 70, 41–55.
Rossi, P.H., M.W. Lipsey and H.E. Freeman (2004) Evaluation: A Systematic Approach, Sage Publications, Thousand Oaks, CA.
Salamon, L. (1981) "Rethinking public management: Third party government and the changing forms of government action", Public Policy 29(3), 255–275.
Salmen, L., and E. Kane (2006) Bridging Diversity: Participatory Learning for Responsive Development, World Bank, Washington, D.C.
Scriven, M. (1976) "Maximizing the Power of Causal Investigations: The Modus Operandi Method", in: G.V. Glass (ed.) Evaluation Studies Review Annual, Vol. 1, Sage Publications, Beverly Hills, CA.
Scriven, M. (1998) "Minimalist theory: The least theory that practice requires", American Journal of Evaluation 19(1), 57–70.
Scriven, M. (2008) "Summative Evaluation of RCT Methodology: An Alternative Approach to Causal Research", Journal of Multidisciplinary Evaluation 5(9), 11–24.
SG1 (2008) NONIE: Impact Evaluation Guidance – Sections 1 and 2, Subgroup 1, Network of Networks on Impact Evaluation.
SG2 (2008) NONIE Impact Evaluation Guidance, Subgroup 2, Network of Networks on Impact Evaluation.
Shadish, W.R., T.D. Cook and D.T. Campbell (2002) Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Houghton Mifflin Company, Boston.
Sherman, L.W., D.C. Gottfredson, D.L. MacKenzie, J. Eck, P. Reuter and S.D. Bushway (1998) "Preventing crime: What works, what doesn't, what's promising", National Institute of Justice Research Brief, July 1998, Washington, D.C.
Simons, H. (2006) "Ethics in evaluation", in: I.F. Shaw, J.C. Greene and M.M. Mark (eds.) The SAGE Handbook of Evaluation, Sage Publications, London.
Snijders, T., and R. Bosker (1999) Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling, Sage Publications, London.
Späth, B. (2004) Current State of the Art in Impact Assessment: With a Special View on Small Enterprise Development, Report for SDC.
Straw, R.B., and J.M. Herrell (2002) "A Framework for Understanding and Improving Multisite Evaluations", in: J.M. Herrell and R.B. Straw (eds.) Conducting Multiple Site Evaluations in Real-World Settings, New Directions for Evaluation 94, Jossey-Bass, San Francisco.
Swedberg, R. (2005) Principles of Economic Sociology, Princeton University Press, Princeton, NJ.
Tashakkori, A., and C. Teddlie (eds.) (2003) Handbook of Mixed Methods in Social and Behavioral Research, Sage Publications, Thousand Oaks, CA.
Trochim, W.M.K. (1989) "An introduction to concept mapping for planning and evaluation", Evaluation and Program Planning 12, 1–16.
Tukey, J.W. (1977) Exploratory Data Analysis, Addison-Wesley, Reading, MA.
Turpin, R.S., and J.M. Sinacore (eds.) (1991) Multisite Evaluations, New Directions for Evaluation 50, Jossey-Bass, San Francisco.
Utce Ltd. and Japan Pfi Association (2003) Impact Evaluation Study on Public-Private Partnerships: The Case of Angat Water Supply Optimization Project and the Metropolitan Waterworks and Sewerage System, Republic of the Philippines.
Vaessen, J., and J. De Groot (2004) "Evaluating Training Projects on Low External Input Agriculture: Lessons from Guatemala", Agricultural Research & Extension Network Papers 139, Overseas Development Institute, London.
Vaessen, J., and D. Todd (2008) "Methodological challenges of evaluating the impact of the Global Environment Facility's biodiversity program", Evaluation and Program Planning 31(3), 231–240.
Van der Knaap, L.M., F.L. Leeuw, S. Bogaerts and L.T.J. Nijssen (2008) "Combining Campbell standards and the realist evaluation approach: the best of two worlds?", American Journal of Evaluation 29(1), 48–57.
Van De Walle, D., and D. Cratty (2005) "Do Donors Get What They Paid For? Micro Evidence on the Fungibility of Development Project Aid", Policy Research Working Paper 3542, World Bank, Washington, D.C.
Vedung, E. (1998) "Policy instruments: Typologies and theories", in: M.L. Bemelmans-Videc and R.C. Rist (eds.) Carrots, Sticks and Sermons: Policy Instruments and their Evaluation, Transaction Publishers, New Brunswick.
Webb, E.J., D.T. Campbell, R.D. Schwartz and L. Sechrest (2000) Unobtrusive Measures, Sage Publications, Thousand Oaks, CA.
Weiss, C.H. (1998) Evaluation – Methods for Studying Programs and Policies, Prentice Hall, New Jersey.
Welsh, B., and D.P. Farrington (eds.) (2006) Preventing Crime: What Works for Children, Offenders, Victims and Places, Springer, Berlin.
White, H. (2002) "Combining quantitative and qualitative approaches in poverty analysis", World Development 30(3), 511–522.
White, H. (2006) Impact Evaluation Experience of the Independent Evaluation Group of the World Bank, World Bank, Washington, D.C.
White, H. (2009) "Some Reflection on Current Debates in Impact Evaluation", Working Paper 1, International Initiative for Impact Evaluation, New Delhi.
White, H., and G. Dijkstra (2003) Programme Aid and Development: Beyond Conditionality, Routledge, London.
Whitmore, E. (1991) "Evaluation and empowerment: It's the process that counts", Empowerment and Family Support Networking Bulletin 2(2), 1–7.
Wholey, J.S. (1987) "Evaluability Assessment: Developing Program Theory", in: L. Bickman (ed.) Using Program Theory in Evaluation, New Directions for Program Evaluation, Jossey-Bass, San Francisco.
Wooldridge, J.M. (2002) Econometric Analysis of Cross Section and Panel Data, The MIT Press, Cambridge.
World Bank (2003) A User's Guide to Poverty and Social Impact Analysis, Poverty Reduction Group and Social Development Department, World Bank, Washington, D.C.
Worrall, J. (2002) "What evidence in evidence-based medicine?", Philosophy of Science 69, 316–330.
Worrall, J. (2007) "Why there's no cause to randomize", The British Journal for the Philosophy of Science 58(3), 451–488.
Worthen, B.R., and C.C. Schmitz (1997) "Conceptual Challenges Confronting Cluster Evaluation", Evaluation 3(3), 300–319.
Yang, H., J. Shen, H. Cao and C. Warfield (2004) "Multilevel Evaluation Alignment: An Explication of a Four-Step Model", American Journal of Evaluation 25(4), 493–507.