WPS5877
Policy Research Working Paper 5877

Impact Evaluation of Trade Interventions: Paving the Way

Olivier Cadot
Ana M. Fernandes
Julien Gourdon
Aaditya Mattoo

The World Bank
Development Research Group
Trade and Integration Team
November 2011

Abstract

The focus of trade policy has shifted in recent years from economy-wide reductions in tariffs and trade restrictions toward targeted interventions to facilitate trade and promote exports. Most of these latter interventions are based on the new mantra of "aid-for-trade" rather than on hard evidence on what works and what does not. On the one hand, rigorous impact evaluation is needed to justify these interventions and to improve their design. On the other hand, rigorous evaluation is feasible because, unlike traditional trade policy, these interventions tend to be targeted and so it is possible to construct treatment and control groups. When interventions are not targeted, such as in the case of customs reforms, some techniques, such as randomized control trials, may not be feasible, but meaningful evaluation may still be possible. This paper discusses examples of impact evaluations using a range of methods (experimental and non-experimental), highlighting the particular issues and caveats arising in a trade context, and the valuable lessons that are already being learned. The authors argue that systematically building impact evaluation into trade projects could lead to better policy design and a more credible case for "aid-for-trade."

This paper is a product of the Trade and Integration Team, Development Research Group. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://econ.worldbank.org. The authors may be contacted at Olivier.Cadot@unil.ch, afernandes@worldbank.org, julien.gourdon@cepii.fr, amattoo@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Impact Evaluation of Trade Interventions: Paving the Way 1

Olivier Cadot*, Ana M. Fernandes+, Julien Gourdon§, Aaditya Mattoo++

JEL classification codes: F13, F14, L15, L25, O17, O24, C23
Keywords: impact evaluation, trade competitiveness, trade facilitation, aid for trade, export promotion, randomized control trials, propensity-score matching.

1 We thank Vivian Agbegha for excellent research assistance, Christina Neagu for help with tariff data, and Mohini Datt for help with data on World Bank aid for trade. We thank Daniel Lederman, Martin Ravallion, and participants at the December 2010 workshop on "Impact Evaluation of Trade Interventions: Paving the Way" in Washington, DC for comments.
Support from the governments of Norway, Sweden and the United Kingdom through the Multi-Donor Trust Fund for Trade and Development is gratefully acknowledged. This paper is the result of collaboration between the World Bank and Switzerland's NCCR on the evaluation of trade-related interventions.

* University of Lausanne, CEPREMAP and CEPR. + Trade and International Integration Unit, Development Economics Research Group, World Bank. § CEPII. ++ Trade and International Integration Unit, Development Economics Research Group, World Bank.

1. Introduction

Trade policy has changed fundamentally since the days of structural adjustment and economy-wide trade reforms. Partly in reaction to the uneven results of trade policy reforms, the focus has shifted to more targeted interventions aimed at reducing trade costs and addressing market failures that inhibit exports. Significant national resources and international assistance are now devoted to trade facilitation and export promotion, and the international development community has galvanized around a new "aid-for-trade" (AfT) mantra as a means of helping low-income countries integrate into the global economy.

The environment in which trade-related assistance is provided has also changed. In times of fiscal austerity, taxpayers increasingly question the justification for large aid flows and, at the very least, demand results and accountability.2 The development community has struggled to respond to these demands because there is surprisingly little evidence about what works and what doesn't in the area of trade and industrial policies. An authoritative survey of trade and industrial policy recently acknowledged that there is hardly any microeconomic evidence to guide specific trade interventions (Harrison and Rodríguez-Clare, 2010).

There are several reasons for the disappointing pace at which such evidence has been gathered. Trade policy research has been slow to respond to changing needs. Tariffs continue to occupy center stage in policy research, in spite of their declining importance as trade barriers, simply because they are easy to measure. The aid-for-trade community has in turn been slow to build a culture of rigorous evaluation. For instance, a review of 85 recent World Bank trade-related projects conducted by the authors revealed that only five of them included rigorous evaluation components. Worse, those few evaluations relied on crude before-after comparisons, which are known to be vulnerable to confounding influences. The "knowledge-market failures" identified by Ravallion (2009) have also inhibited rigorous evaluations in the trade context: demanders of knowledge about the effectiveness of trade interventions have inadequate information about the quality of any potential evaluation, especially because there are so few good examples; project managers tend to have "monopolistic" control over which projects get evaluated, at what cost and how; and the benefits from the rigorous evaluation of a particular trade project accrue in large part to other future projects which do not share in the cost of evaluating the project.

Still, trade evaluation itself can benefit from the positive externalities generated by research in other areas. In fact, the tools for a serious evaluation of trade-related interventions are already there.

2 A recent poll featured by the Financial Times (Financial Times, July 12, 2010) showed that a majority of respondents in OECD countries considered defense and development aid as priority areas for spending cuts.
Originally developed in the agro-biological and then the medical sciences, impact evaluation (IE) methods have spread to the social sciences and are routinely employed in the areas of health and education. In essence, an impact evaluation compares the outcomes of entities — individuals or firms — that received support from a program or were directly impacted by a policy with the counterfactual outcome of those same entities had the program or policy not been in place. Because such counterfactual outcomes are not observable, they are approximated by the outcomes of a control group. IE methods have provided powerful tools in other fields to help guide policy choices and minimize the cost of interventions. For instance, Banerjee and Duflo (2008) showed how a comparison of IE results established that, in order to raise school attendance rates among Kenyan children, a program to treat intestinal worms was twenty times more cost-effective than hiring teachers, suggesting a clear prioritization of actions.3

The recent creation by the World Bank of a separate impact evaluation unit as part of the Development Impact Evaluation Initiative (DIME) has helped spread IE methods to new areas of development research and practice.4 For instance, World Bank researchers have led the way in analyzing the impact of business registration reform or bankruptcy reform (Klapper and Love, 2010; Bruhn, 2011; Gine and Love, 2011). Researchers have also begun to use these methods to evaluate programs and policies in the area of private sector development, where the treated "entities" are firms (see McKenzie, 2010 for a survey). Similar evaluations could be used to guide trade interventions.

The usual excuse for not using IE methods in assessing the effectiveness of trade assistance is that the "clinical" nature of the treatment needed for a proper definition of treatment and control groups is absent from trade policy. This was perhaps true of old-style trade policies like structural adjustment or tariff reforms; but it is not true of the new trade interventions like export promotion. This paper intends to show that trade exceptionalism — the notion that trade-related interventions are inherently not amenable to IE — is, if anything, limited to traditional trade policies. More recent, focused trade-related interventions can be evaluated formally, provided that one is not wedded to a particular methodology such as randomized-control trials (RCTs). Although, as we will see, the range of application of RCTs is broader than one might think, other quasi-experimental methods are available and can shed light on what works and what does not.

3 This ratio was established by comparing the evaluation of a de-worming program by Miguel and Kremer (2004) with a separate evaluation of a program to reduce teacher-student ratios by Banerjee, Jacob, Kremer, Lanjouw, and Lanjouw (2005). Comparing impact estimates from separate impact evaluations is tricky since each has been established in a particular context with limited external validity (we will return to the issue later on in this paper). However, when the difference in cost effectiveness is as large as this one, the risk of getting the prioritization order wrong is reduced.

4 Information on DIME can be obtained at: http://web.worldbank.org/WBSITE/EXTERNAL/EXTDEC/EXTDEVIMPEVAINI/0,,menuPK:3998281~pagePK:64168427~piPK:64168435~theSitePK:3998212,00.html.

RCTs are only one of the possible approaches for rigorous impact evaluation.
For instance, some countries implement regulatory reforms in staggered fashion, starting in a small set of locations before extending them to all locations. The impact of such reforms can be rigorously evaluated by using locations where the reforms are introduced later as a control group for the locations where reforms are introduced earlier and using a difference-in-differences estimation methodology (see Bruhn, 2011). Similarly, ex-post evaluation of programs and policies is a possible approach, provided that information is available both on which firms received support from a program or were directly impacted by a policy as well as on the entire (or a large portion of the) universe of firms. In these circumstances, it is possible to use propensity score matching combined with difference-in-differences estimation (see e.g. Tan, 2009; Lopez-Acevedo and Tinajero, 2010). These methods have already been applied in a number of recent studies and have produced interesting and unexpected results. Consider the following three examples:

First, in an ex-post evaluation of export promotion programs in six Latin American countries using rich firm-level datasets, Volpe (2011) shows that these programs were effective in facilitating export expansion primarily along the extensive margin (i.e., through an increase in the number of products exported or in the number of export markets served) rather than along the intensive margin (an increase in exports of existing products to existing markets). He also shows that programs benefitted small and relatively inexperienced firms more than larger and already established exporters, and that bundled services providing support to firms throughout the export development process were more effective than isolated actions.

Second, Gourdon, Marchat, Sharma, and Vishwanath (2011) use similar ex-post evaluation methods to assess the impact of a World Bank-financed export promotion program in Tunisia — FAMEX — which provided a mixture of counseling and matching grants to new exporters. Their findings suggest that export promotion has a large and significant effect on overall export growth: a 39% increase in the average annual growth rate of program beneficiaries relative to the control group over a four-year period. The effect of the program on the extensive margin of exports – in terms of products and destinations – is more subdued: about 5% higher growth for beneficiaries, which is significant only for destinations. They also find a significant increase in employment growth, i.e., 10% more for program beneficiaries than for control firms. The effect on export growth is stronger for firms that were initially only marginal exporters (exports represented less than 20% of turnover). Interestingly, their sample also includes services firms, for which the effect of export promotion is significantly larger than for manufacturing firms.

Third, Datt and Yang (2011) analyze a natural experiment in which the Philippines government suddenly reduced the minimum value threshold under which shipments were exempt from pre-shipment inspections (PSI), closing a loophole that had encouraged importers to slice shipments in order to escape inspection. They show that the reform failed to curb under-invoicing and thus to raise duty collection as importers switched to an alternative loophole, namely, the use of an export-processing zone (EPZ).
As this alternative loophole involved high fixed costs (setting up a presence in the EPZ), in the end the Philippine government was no better off while importers were worse off. The authors also discuss the effects of a related policy reform in Colombia where the government sought to remedy undervaluation of certain imports by mandating PSI on a subset of products. This, however, left open the loophole of misclassification of those products as similar products that did not require a PSI. Both cases illustrate the importance of careful, incentive-compatible reform design.

This paper considers a detailed menu of trade-related interventions and discusses the challenges posed by their evaluation. In doing so, we discuss examples of impact evaluations using a range of methods (experimental and non-experimental), highlighting the particular issues and caveats arising in a trade context, and the valuable lessons that are already being learnt. We argue that systematically building impact evaluation into trade projects could lead to better policy design and to a more credible case for "aid-for-trade."

The rest of the paper is organized as follows: Section 2 discusses the changing nature of trade policy while Section 3 reviews the available evidence on the impact of trade assistance. Section 4 considers trade-related interventions and their evaluation. Section 5 addresses the data issues crucial to impact evaluation. Section 6 discusses the future challenges in IE of trade assistance. Section 7 concludes.

2. The changing nature of trade policy and trade assistance

Most developing countries have moved beyond the first generation of trade reforms, which involved across-the-board cuts in tariffs and the elimination of import quotas. Tariffs have fallen substantially over the last 20 years. The simple average applied tariff of World Trade Organization (WTO) members on all goods was 5.8 percent in 2008 (WTO, 2009), and the developing country average is down to around 10 percent compared to 30 percent in 1990. Recourse to quantitative restrictions has also substantially declined. One reason is the narrower interpretation of the balance-of-payments exception in the WTO and the stricter enforcement of the conditions under which it can be invoked. Countries like India have been forced to phase out numerous quotas that had been maintained for a long time, ostensibly to address balance of payments difficulties. Another reason is the tighter interpretation following the Uruguay Round Agreement of the national treatment provision in the WTO, which precludes local-content requirements that many developing countries had favored and other members had tolerated.

With this decline in traditional barriers to market access, supply-side constraints are seen as the main obstacle that developing countries face in taking advantage of new opportunities in international markets. Therefore, trade interventions are becoming more targeted, focusing either on (a) the trade facilitation agenda, involving, for example, customs reforms and infrastructure — e.g. port — improvements and/or (b) the trade competitiveness agenda, consisting of pro-active industrial policies, involving productive capacity building, EPZs, or export promotion. In designing such trade interventions, developing countries need policy advice, particularly more evidence-based advice. They need to know which interventions work and which do not, in which sectors, in which sequence, and which ones are most cost-effective.
The World Bank too has shifted emphasis in its trade assistance from broad trade liberalization reforms in the 1980s and 1990s to more targeted interventions to reduce the costs of trade and to equip producers to export since the early 2000s. The declaration of World Trade Organization (WTO) ministers in Hong Kong SAR, China in 2005 and the first Global Aid for Trade Review in Geneva in 2007 gave an impetus to the expansion of aid for trade to help developing countries build their supply-side capacity and trade-related infrastructure. The World Bank responded by expanding its commitments on trade competitiveness, trade facilitation, and infrastructure, and is now a leading contributor to aid for trade. As shown by Figure 1, recent commitments by the World Bank are substantial and growing: concessional trade-related lending (as per the OECD/WTO definition) to low-income countries grew from US$3.18 billion annually in 2002–2005 to an average of US$4.84 billion in 2007–2008, while non-concessional trade-related lending to middle-income countries increased from US$4.16 billion in 2002–2005 to US$9.8 billion in 2007–2008 (World Bank, 2011).5 Since 2001, the World Bank has approved 437 trade-related lending projects in 90 countries and 53 trade-related lending operations in 10 regional groups, with Africa and Eastern Europe and Central Asia accounting for most of the operations (World Bank, 2011).

5 The numbers presented in the figure are based on the OECD/WTO definition of aid-for-trade. The sectors that fall under this definition are (1) for IBRD/IDA - agriculture, fishing and forestry; information and communication; energy and mining; transportation; and industry and trade; (2) for IFC - agriculture and forestry; information; oil, gas and mining; chemical; utilities; transportation and warehousing; construction and real estate; food and beverages; nonmetallic mineral product manufacturing; primary metals; pulp and paper; textiles, apparel and leather; plastics and rubber; industrial and consumer products; wholesale and retail trade; professional, scientific and technical services; and accommodation and tourism services.

Figure 1 World Bank aid-for-trade commitments 2002–2010
[Bar chart of annual commitments in US$ millions (0–30,000) over FY02–FY10, broken down by IBRD, IDA, and IFC.]
Source: Authors' calculations based on data from the World Bank Business Warehouse website.

Trade facilitation-related infrastructure is the largest single component of World Bank trade-related investments in developing countries, while the rest consists mostly of interventions to improve competitiveness. Figure 2 shows the distribution of World Bank commitments on aid for trade as of fiscal year 2008 while Table 1 details the types of interventions falling under the "trade competitiveness" and the "trade facilitation" agendas.

Figure 2 World Bank Group trade portfolio 2008
Source: Authors' calculations based on data from the World Bank Business Warehouse website.

Given the increase in aid for trade, donors and recipients would like to see evidence that this new type of assistance will be more effective than past aid efforts. These concerns are especially strong in the aftermath of the recent financial crisis, when pressures to reduce fiscal deficits and debt are weakening political support for foreign assistance.
In fact, a recent opinion poll in OECD countries revealed that a large majority of the public favored cuts in defense and aid spending rather than in other categories of expenditure.6

6 See Financial Times of July 12, 2010.

Table 1 Focused trade interventions

Trade competitiveness (including trade finance):
- Export promotion/diversification
- Support to producer/exporter organizations
- Quality testing and export certification
- Technology upgrading and support services
- Strengthening policy/regulatory framework
- Export credit insurance
- Export credit guarantee
- Line of credit
- Support for financial institutions

Trade facilitation and logistics:
- Customs reform
- Ports/airports rehabilitation
- Railway privatization/rehabilitation
- Roads construction/rehabilitation

3. Existing methods of evaluating trade interventions

In this section, we review three kinds of existing evaluation efforts. The first involves broad and—to this day—largely inconclusive assessments of aid for trade and its impact. The second examines the effect of national trade interventions, such as export promotion activities, but still at a highly aggregate level, considering mostly aggregate exports as outcomes; this literature provides some support for certain types of focused interventions. The third set of efforts involves assessments by the World Bank of its own trade-related projects. While these assessments are in principle as focused as the interventions themselves, they have for the most part not been based on the collection or analysis of any hard evidence on impact.

Before we discuss some of the results emerging from each strand of evaluation efforts, it is worth noting that cross-country regressions, on which the first two strands rely heavily, have strengths of their own, but also limitations that are sufficiently serious to have prompted a growing number of development scholars to turn to different methodologies if not an altogether different paradigm. On the positive side, cross-country regressions—based on either cross-sections stricto sensu or on multi-period panels—provide general average estimates of the effects of a policy or program that are not reflections of a specific context. They also pick up the entire effect of the policy or program, including externalities and general-equilibrium feedbacks. Both of these strengths are particularly relevant in comparison with micro-level impact evaluations, as we will see later on. On the negative side, like the earlier literature on the effect of trade reforms, cross-country regressions evaluating the effect of aid or specific trade interventions tend to suffer from problems of weak identification and attribution.7 Neither policies nor aid flows can be taken as exogenous to the performance outcomes they are supposed to affect, and no instrumental-variable strategy, however clever, has dispelled doubts about reverse causation or omitted-variable bias, both likely to be present at the level of aggregation at which these studies are cast. Impact evaluations, for all their own limitations, are less vulnerable to these identification issues, because they rely, for identification, on outcome differences between treatment and control groups in the same context, instead of variations in policy choices or aid flows across countries.

7 For a thorough discussion of the trade-off between internal and external validity, see for instance Rodrik (2008) and references therein.
3.1 Evaluating aid for trade

The literature on the impact of AfT is fairly limited, in part because AfT projects are not always distinguishable from other aid projects. As in the rest of the aid effectiveness literature, the results are ambiguous (Rajan and Subramanian, 2008). Regarding the cross-country allocation of AfT, Gamberoni and Newfarmer (2009) find that, after controlling for absorption capacity (related, for example, to governance), more AfT is directed towards countries with a higher demand for AfT as measured by indicators of "underperformance" in trade.8

On impact, one strand of the literature explores whether AfT positively affects exports from the donor country to the recipient country given that, up to the early 1990s, over half of all bilateral aid was at least partially tied to donor exports. Using a gravity equation, Wagner (2003) shows that this form of trade was indeed boosted; but Osei, Morrissey, and Lloyd (2004), using a gravity equation in first differences for a panel of four European donors and 26 African recipients, found an unstable and insignificant impact of aid on exports from donor to recipient. Recently, Nelson and Juhasz Silva (2008) use a more conventional gravity equation including bilateral aid flows as a regressor (instrumented by their one-year lagged value), and find a significant although small impact on trade flows from donor to recipient.

From a development perspective, only a few of the recent studies focus on the more relevant question of whether aid raises the export capacity of recipient countries. Cali and te Velde (2011) regress trading costs and the value of exports on lagged AfT disbursements and control variables, using data from the OECD's Creditor Reporting System that separately identifies aid to trade facilitation and infrastructure from aid to productive capacity.9 Using a large panel of developing countries, the authors address the possibility of endogeneity and measurement errors in AfT flows by instrumenting them with Freedom House's index of civil liberties. The message that emerges across their various specifications is that aid to trade facilitation and infrastructure seems to have a significant effect in reducing trade costs and in increasing export values, while aid to productive capacity is insignificant. When considering sectorally targeted aid, the authors again find that aid to infrastructure has a significant impact on export values, but aid to productive capacity does not, controlling for country-sector fixed effects that account for comparative advantage differences.

8 Underperformance in trade is captured by multiple indicators. Countries that underperform in trade can be those in the lower two quintiles of performance measured along five dimensions: (a) those experiencing relatively slow growth of exports of goods and services, (b) those losing global market share, (c) those suffering deterioration in competitiveness in existing markets, (d) those exporting slow-growing products or to slow-growing markets, and/or (e) those over-reliant on only a few exports. Also, countries that underperform in trade are those that under-trade with bilateral partners, controlling for market size and distance, those with low scores on the World Bank logistics performance index for transport or for customs, and on an indicator of peak tariffs.

9 Trading costs are measured by the trading across borders indicators of the Doing Business database.
Brenton and von Uexkull (2009) examine the response of product-level exports from developing countries to product-level export-development aid, combining mirrored product-level (HS4) export data with export-development aid data from the German cooperation agency GTZ and from the OECD/WTO Trade Capacity Building Database for 48 developing countries. Using a matching difference-in-differences (DID) approach (discussed in section 4), they show insignificant effects of contemporaneous and lagged aid on product-level exports after controlling for lagged exports and country and year-product fixed effects, and eliminating outliers.10 However, the authors do show strong positive effects in a simple comparison of product-level exports before and after receiving export development aid. This finding suggests an important attribution problem — namely, export growth may not be due to the aid received but instead may reflect the fact that aid targets sectors with promising prospects. The authors go on to argue that, in evaluating the impact of technical assistance for exports, it is essential to identify what would have happened in the absence of the policy intervention. This is a primary concern in this paper. As the literature stands, it is fair to say that the effect of AfT on the export performance of beneficiary countries has not been established on the basis of aggregate numbers.

10 Their matching approach pairs each treatment country that receives export development aid for a given product i to the country that is most similar to it in terms of its likelihood to export product i, where this likelihood is estimated based on observable country characteristics such as the level of development, factor endowments, and climate conditions.

Ferro, Portugal-Perez, and Wilson (2011) advance the analysis of the effectiveness of AfT by revisiting the data from the OECD's Creditor Reporting System. The authors exploit the differential intensities of service use across manufacturing sectors (based on input-output tables from the U.S. and Argentina) to evaluate the impact of aid for trade flows directed at five services sectors — transport, communications, energy, banking/financial services, and business services — on the exports of downstream manufacturing sectors in 106 aid-recipient countries over the period 1990–2008. Their identification strategy aims at circumventing reverse causality problems common in the AfT literature; and their results show that aid flows directed at the energy and banking sectors have a significant positive impact on downstream manufacturing exports.

3.2 Evaluating national trade interventions

A few recent cross-country studies suggest a positive impact of certain types of trade interventions, regardless of whether they are financed by donors or domestic government budgets. On export promotion, Lederman, Olarreaga, and Payton (2010) examine the effectiveness of export promotion agencies (EPAs) based on a rich survey of EPAs across 88 developed and developing countries.
EPAs aim to help exporters understand and find markets for their products and services, and their activities can be divided into four categories: (a) country image building (advertising, promotional events, but also advocacy); (b) export support services (exporter training, technical assistance, capacity building, including regulatory compliance, information on trade finance, logistics, customs, packaging, pricing); (c) marketing (trade fairs, exporter and importer missions, follow-up services offered by representatives abroad); and (d) market research and publications (general, sector, and firm-level information, such as market surveys, on-line information on export markets, publications encouraging firms to export, importer and exporter contact databases) (Lederman et al. [2010], pp. 257–258). For 21 of the 73 developing countries surveyed, the authors find that EPAs receive budgetary support from multilateral donors such as the World Bank. The authors estimate the effect of EPAs' expenditures per capita on overall exports per capita at the country level, accounting for selection bias in survey responses and for potential reverse causality. Their main conclusion is that, on average, EPAs have a significant positive effect on exports. Their estimates also point to the importance of EPAs' services for overcoming foreign trade barriers and solving asymmetric information problems associated with exports of differentiated goods. In addition, they find evidence of strong diminishing returns, suggesting that small is beautiful as far as EPAs are concerned. However, the authors acknowledge that cross-country regressions cannot fully capture the heterogeneity of policy environments and institutional structures in which EPAs operate; hence, more detailed studies or project-type analyses are needed to provide specific policy advice.

On trade facilitation, Helble, Mann, and Wilson (2009) examine the responsiveness of trade flows to various types of aid for trade — linked to reform of trade policy and regulation, trade development (productive capacity building), and economic infrastructure — using a gravity equation framework covering 167 importers (reporters) and 172 exporters (partners) during the 1990–2005 period. Their results indicate that relatively small amounts of aid targeted at trade policy and regulatory reform have a greater impact with respect to increased trade flows than aid for broad trade development assistance or infrastructure. Several recent papers point to the importance of internal barriers related to infrastructure and institutions — including logistics performance — as obstacles to developing countries' ability to trade and the volume of trade (e.g., Djankov, Freund, and Pham, 2010; Francois and Manchin, 2007; Freund and Rocha, 2011; Hoekman and Nicita, 2008; Portugal-Perez and Wilson, 2010). More specific studies highlight the importance of reducing marketing, transport, and other intermediary costs in agricultural supply chains (Balat, Brambilla, and Porto, 2009; Diop, Brenton, and Asarkaya, 2005). Although these studies point out the relevance of increased donor assistance to trade facilitation, they do not help delineate the policies and programs that would be most effective in cutting trade costs.11

In their recent authoritative survey of the state-of-the-art literature on industrial policy, Harrison and Rodríguez-Clare (2010) conclude that empirical evidence on the effectiveness of various forms of industrial policy is scarce.
The authors look at the case of East Asian countries, where industrial policies relied on production subsidies, subsidized credit, fiscal incentives, and trade protection to foster particular sectors. From this, they claim that the available evidence does not answer the most important question: what was the effect of these industrial policies relative to the counterfactual situation in which such intervention was absent?12 In sum, there are no studies that can credibly credit industrial policies with bringing about East Asia's successful industrialization experience. But the authors do make a tentative argument that industrial policies played a role in some countries' growth experiences based on two complementary ideas. First, the composition of a country's export basket — a tilt towards manufacturing or skill-intensive goods rather than primary products or raw materials — seems to matter for its long-run growth. Second, China's export basket in 1992 was much more sophisticated than what would be expected given the country's per capita GDP, and that could only be the outcome of its industrial policies (Rodrik, 2006).13

Harrison and Rodríguez-Clare's literature survey concludes with an advocacy statement on the type of national trade-related assistance likely to be most successful: that which increases exposure to trade (such as export promotion) in contrast to that which limits trade (such as tariffs or domestic content requirements).14 The authors also make a statement on the specifics of policy design, where they envision an increasing role for "soft" industrial policies that deal directly with coordination problems, such as those that keep productivity low in existing or emerging sectors. These policies include programs "to help particular clusters by increasing supply of skilled workers, encouraging technology adoption, and improving regulation and infrastructure" (Harrison and Rodríguez-Clare [2010] p. 4112).15 The problem with this statement is that it has a "rabbit-out-of-the-hat" aspect because the survey includes little supporting evidence. In fact, the absence of evidence for the policy recommendations the survey offers is a reason for our effort to initiate new research on these issues.

11 For example, Portugal-Perez and Wilson (2010) estimate the impact of aggregate indicators of "soft" and "hard" infrastructure on the export performance of 101 developing countries over the 2004–2007 period. Their estimates show that trade facilitation reforms - particularly investment in physical infrastructure and regulatory reform to improve the business environment - significantly improve export performance. Moreover, they show that the marginal effect of infrastructure improvements on exports appears to be decreasing with per capita income.

12 One empirical approach that has been followed in some studies is to examine whether the sectors that received most support from industrial policies are those that have grown most rapidly; but that approach does not address the counterfactual issue.

13 This finding was based on the measure of sophistication of a country's exports basket developed by Hausmann, Hwang, and Rodrik (2007), constructed using the level of GDP per capita associated with exports of different goods worldwide.

14 The authors make this statement based on extensive cross-country and cross-sector evidence on trade and growth.
15 The authors argue that an advantage of such "soft" industrial policies is that they are generally compatible with the multilateral and bilateral trade agreements that developing countries have entered into in the last decades.

3.3 Evaluating World Bank trade programs

In principle, World Bank trade-related projects should be a key source of evidence on the effects of specific trade interventions, which could become the basis for further evidence-based policy advice. In practice, though, this is rarely the case. Few interventions have undergone rigorous impact evaluation. An evaluation of World Bank-financed trade-related assistance during the 1987–2004 period, conducted by the Independent Evaluation Group (IEG), concluded that it helped countries liberalize their trade regimes — average tariffs fell and coverage of nontariff barriers diminished — with positive effects on economic growth (IEG, 2006). However, the evaluation also argued that assistance fell short of generating a strong export supply response. Many client countries, especially in Africa, could not diversify their exports and remained vulnerable to commodity price shocks.

IEG (2006) also discusses the performance ratings of World Bank aid-for-trade projects, which give a sense of their effectiveness in achieving their stated goals. The report shows that trade-related adjustment loans until 2004 performed better than other adjustment loans, whereas trade-related investment loans performed worse than other investment loans of the World Bank.16 Moreover, according to the same evaluation, assistance on trade logistics — ports, customs, and trade finance — and export incentives had a mixed record, though one that improved over time. A review of the IEG ratings of recent investment projects and programs on trade promotion, completed as of 2007 (World Bank, 2009), indicates that more than 85 percent were rated as having moderately satisfactory, satisfactory, or highly satisfactory outcomes, which was higher than for projects in other areas.17 Aid-for-trade projects also had higher estimated economic rates of return (around 32%) than non-trade-related projects (around 23.7%).18 While providing valuable insights, the IEG evaluations of trade assistance offer limited evidence to support focused trade interventions. Moreover, the evaluation does not cover much of the recent increase in AfT assistance for export promotion and trade facilitation.

16 Projects that focused primarily on trade liberalization achieved the best performance ratings, whereas those related to private financing (such as export finance guarantees and export reinsurance) were the least successful. The superior performance of projects focusing on trade liberalization is not surprising as it reflects the relative legislative ease of putting in place the associated actions (e.g., reform of the tariff regime). In contrast, projects that focused on thematic areas related to key supply-side constraints that impose greater demands on institutional and administrative capacity, such as trade financing, are more difficult to implement.

17 IEG assesses the performance of roughly one World Bank project out of four (about 70 projects a year), measuring outcomes against the original objectives, sustainability of results, and institutional development impact.

18 An economic rate of return is the discount rate that would keep an agent indifferent between undertaking and not undertaking the project.
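In general project-appraisal terms (a standard textbook definition rather than the IEG's specific methodology), the economic rate of return is the discount rate r that sets the present value of a project's net benefit stream to zero:

\[ \sum_{t=0}^{T} \frac{B_t - C_t}{(1+r)^{t}} = 0, \]

where B_t and C_t denote the project's (economic) benefits and costs in year t and T is the project horizon.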
In search of evidence on the impact of such trade interventions, we conducted a thorough review of the evaluation methods for 85 World Bank trade-related investment lending projects undertaken during the 1995–2005 period. The source of data was the World Bank's Operations portal website and in particular the Project Appraisal Documents (PADs) and the Implementation Completion Reports (ICRs).19 The evaluation methods used can be classified into five distinct categories: (a) only economic or financial internal rates of return, net present value or effectiveness calculations; (b) beneficiary surveys and stakeholder workshops; (c) both a and b; (d) both a and b, with a comparison of beneficiaries to a control group; and (e) no formal evaluation methods used.20 One key aspect to note is that the implementation of a beneficiary survey does not guarantee that a rigorous impact evaluation can be conducted since in most cases the survey covers only outcomes pertaining to beneficiaries of the project, and no control group is covered (more details on these methods will be provided in section 4).

Figure 3 shows that evaluation using only economic or financial rates of return was the most commonly used method for the trade-related projects, while 10 percent of the projects involved no formal evaluation method.21 Included in the latter category is a trade competitiveness project that described the impact of the project in purely subjective terms: "While the impact on the firms assisted had not yet been determined, a visit to two beneficiaries by a supervision mission confirmed that there had been an impressive impact on the firms' quality of products and skills." Another example of the latter is a trade competitiveness project where the achievement of the overall goal was measured in terms of the higher average annual growth rate of exports during the project duration and increases in exports' share of GDP compared with the initial year of the project.

Figure 3 Evaluation of World Bank trade-related projects 1995–2005

19 We thank Vivian Agbegha for compiling the data for this review. The selection of trade-related projects followed the criteria used by Steven Gunawan in an unpublished study of "Monitoring and Evaluation Lessons of Trade Projects" that served as background work for the 2011 World Bank Trade Strategy. The projects were filtered from the World Bank's Operations portal website according to the theme "Trade and Integration" and meeting the following criteria: i) approved after 1995, to exclude obsolete projects; ii) IBRD/IDA-funded; and iii) closed. A total of 321 projects were filtered, out of which 144 were development policy loans and 177 were investment loans, and 30 investment lending projects had to be dropped since they lacked ICRs. A final set of 85 investment lending projects was obtained after excluding projects that did not have any trade components. The main documents used to extract information on the projects were PADs and ICRs. For each project we collected information on the types of intervention, the types of outputs and outcomes achieved, the evaluation methods employed, and the evidence or proof of causation of the impact of the project.

20 A beneficiary survey consists of a formal survey of the entities that received assistance from the project, whereas a stakeholder workshop is a more informal way to collect information on the various entities affected by the project.
21 Our analysis of project ICRs did reveal, however, that the use of these methods is often handicapped by difficulty in quantifying some of the costs and benefits of the project. Some project ICRs explicitly say that certain benefits are not incorporated in net present value calculations due to their complexity.

[Horizontal bar chart showing the number of projects (0–35) using each evaluation method: only rates of return; both rates of return and beneficiaries' surveys/stakeholder workshops; only beneficiaries' surveys/stakeholder workshops; both rates of return and beneficiaries' surveys with a comparison of beneficiaries to a control group; no formal evaluation method.]
Source: Authors' calculations based on data from the World Bank Operations portal website.

To be fair, task managers of trade-related projects are often candid about their project's achievements, writing in the ICR that observable results (particularly those relating to aggregate outcomes such as total exports) are not entirely the result of the program alone but rather the result of the work and resources of different institutions and sectors. The most striking fact in Figure 3 is that only 6 percent of projects (5 out of 85 projects) included a rigorous impact evaluation, involving a proper comparison of the outcomes of project beneficiaries with those of a control group. But even in such cases, the impact evaluation method raised certain issues, which we will discuss in the next section.

A clarification should be made at this point concerning the link between evaluation methods of projects and the monitoring and evaluation (M&E) framework.22 M&E is an important part of the design and implementation of World Bank lending projects and is the reason why, as mentioned at the beginning of this section, we would expect to obtain evidence on the effects of certain types of trade interventions from project-level analysis. M&E is based on performance indicators capturing outputs, outcomes, and impact of a project (discussed in ICRs). These categories of performance indicators are thought to be related according to the scheme shown in Figure 4. The scheme makes clear the distinction between considering outputs in general, and going one step further and also considering outcomes and the impact attributable to the project per se. One common concern with the M&E framework for World Bank projects is that it often focuses on the monitoring part and not enough on the evaluation part. For example, most projects include exports as impact indicators but do not include a proper impact evaluation strategy that allows for attribution to the project of an increase in exports.

Figure 4 From inputs to impact

22 This discussion draws heavily on the aforementioned unpublished study by Steven Gunawan.

In addition to investment lending projects, the World Bank also produces a large amount of analytical work — economic and sector work (ESW) — where one could expect to find evidence that supports certain trade interventions. The key trade-related analytical pieces — diagnostic trade integration studies (DTIS) — do highlight the high costs of producing goods and services for export, and for delivering them to foreign markets, as being the major barriers to trade integration in less developed countries, and point to infrastructure as the most pressing constraint. But they do not inform the development community about which interventions work and which do not, and which interventions are most cost-effective.
4. Impact evaluation of trade interventions

The key problem that IE addresses is attribution — making sure that observed changes in outcome variables are caused by the program or policy under evaluation and not by outside influences. Many outside influences can confound the identification of a program or policy's impact. For instance, an export promotion scheme put in place in 2007 would see its positive impact confounded by the negative impact of the global crisis of 2008–2009; a simple before-after comparison of outcomes would likely suggest a negative impact of the program. In order to filter out these influences, one would want to know how beneficiary firms would have performed in the absence of the program (presumably worse). But the data needed for this counterfactual does not exist, because firms cannot be both beneficiaries and non-beneficiaries at the same time. This missing data problem is solved by using as a counterfactual the performance of other firms that did not benefit from the program. By analogy with the agro-biological and, later, the medical sciences, where IE methods originate, beneficiaries are called the treatment group and non-beneficiaries the control group.23

23 A pedagogical reference to IE techniques can be found in Khandker, Koolwal and Samad (2010), which contains analytical guidance as well as case studies and Stata do-files. A formal treatment can be found in Ravallion (2008), Blundell and Costa Dias (2009), and Imbens and Wooldridge (2009).

The central idea of IE is best illustrated by a widely-used technique called double-differences or difference-in-differences. Under that technique, the effect of a program is assessed by comparing the performance of beneficiary firms before and after the treatment (first-difference), and then benchmarking that difference by comparing it to the difference in performance over the same period of non-beneficiary firms (difference-in-differences).24 In our earlier example of an export promotion scheme put in place just before the onset of the crisis, its confounding effect would be captured (and thus filtered out) by the decrease in the performance of non-beneficiary firms during the program period. The program's impact would then be measured by how much less badly beneficiary firms performed than non-beneficiary ones.

As noted earlier, IE design relies on less-than-universal coverage, which provides a first categorization of programs into targeted and non-targeted ones. Another useful distinction is whether an evaluation is built into program design. In what follows, we consider each of the cases defined in Table 2 in the trade context, and discuss to what extent IE methods can be applied to them. Anticipating our conclusions, our basic argument is that the scope for IE in trade assistance projects is broader than might appear at first, provided that one is not wedded to a particular methodology (randomized control trials for instance).
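Before turning to the cases in Table 2, it is useful to restate the double-difference logic above in symbols (standard notation from this literature, not tied to any particular program):

\[ \widehat{\mathrm{DID}} = \left( \bar{Y}^{T}_{\text{after}} - \bar{Y}^{T}_{\text{before}} \right) - \left( \bar{Y}^{C}_{\text{after}} - \bar{Y}^{C}_{\text{before}} \right), \]

where \bar{Y}^{T} and \bar{Y}^{C} denote average outcomes (for example, log exports) of treatment (beneficiary) and control (non-beneficiary) firms. The second term proxies for what would have happened to beneficiaries in the absence of the program and, in the export promotion example above, absorbs the common effect of the 2008–2009 crisis.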
Table 2 Boundaries of impact evaluation

Targeted interventions (typically trade competitiveness-related, e.g., matching grants for producers for technology upgrading or export business plans; export credit guarantees for producers):
- Evaluation built into program design: RCT is feasible; quasi-experimental methods are a possible alternative.
- Evaluation not built into program design: RCT is infeasible; quasi-experimental methods are feasible.

Non-targeted interventions (typically trade facilitation-related, e.g., customs reform, port improvements; but also some trade competitiveness-related interventions: support for producer organizations or other institutional reforms):
- Evaluation built into program design: RCT is typically infeasible; quasi-experimental methods are more appropriate; some methods of targeting can be introduced (phase-in, staggered implementation).
- Evaluation not built into program design: all IE methods are difficult; before-after comparisons may be the only alternative.

Notes: RCT: randomized control trial. Quasi-experimental methods are matching, difference-in-differences, instrumental variables, or regression discontinuity design.

24 This difference in performance is not a ceteris paribus effect: it picks up both direct program effects and induced behavioral changes, which may work to either reinforce or weaken the program's direct effect. For instance, a program combining matching grants with technical assistance targeted at particular operations within the firm can trigger broader management improvements (a reinforcing influence) or partial waste of program money through management slack (a mitigating influence). See Duflo, Glennerster and Kremer (2008) for a discussion.

4.1 Targeted interventions

Targeted trade interventions include "clinical" trade competitiveness programs such as export promotion schemes through matching grants for supporting export business plans, through export-credit guarantees, or through firm-level technical assistance for technology upgrading, for acquisition of international quality certifications or to meet other product standards. The key feature of these interventions is that the programs are assigned exclusively to certain units, often firms. Because these interventions operate at the level of the firm, non-assisted firms can in principle serve as the control group.

4.1.1 Randomized-control trials

In targeted interventions, when evaluation is built into program design, a randomized-control trial (RCT), sometimes called the "gold standard" of IE, tends to be viewed as the best option, though this can be questioned as discussed below. It consists of drawing beneficiaries at random from a large pool of firms. By the law of large numbers, the average characteristics of beneficiaries will be the same as those of non-beneficiaries. Were this condition not met, there would be a selection bias; that is, the program's impact would be confounded not by outside factors, as before, but by differences in individual characteristics.25 Random assignment to the program ensures that the "unconfoundedness assumption" is verified, which is key to identifying the average treatment effect (Imbens and Wooldridge, 2009).

Despite its analytical appeal, randomization must confront other difficulties, in general and in the context of trade-related assistance in particular. In terms of practical feasibility, randomization can be a hard sell with client governments for ethical or political reasons. Governments may be reluctant to extend assistance only to a subset of agents when all need it, and any de facto discrimination may be politically costly. Randomization does allow for flexibility, which may help make it acceptable. First, it does not need to cover all individuals. For instance, a program can use standard selection methods to determine eligibility, and introduce randomization among either all eligible firms or only "marginal" ones. That is, very strong candidates can be taken in, very weak ones left out, and only those in the middle subject to randomization.26 Lotteries are somewhat more appealing than blind randomization because they avoid the impression that something is hidden. At a more basic level, in the presence of rationed resources to fund a policy intervention, the advantage of randomization is that it constitutes the fairest solution to rationing (Ravallion, 2008).

25 Put differently, the probability of getting treatment, conditional on the individual's characteristics, needs to be independent of the outcome.

26 We are grateful to David McKenzie for pointing this out to us.
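The assignment logic just described, including the option of randomizing only among "marginal" candidates, can be sketched in a few lines of code. The following Python fragment is purely illustrative, with hypothetical firm data and variable names rather than the design of any actual program; with random assignment, the treatment effect can then be estimated by a simple difference in means.

# Illustrative sketch of randomized assignment with a "marginal" band
# (hypothetical data and variable names; not the design of any actual program).
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Pool of eligible applicant firms with a pre-program selection score.
firms = pd.DataFrame({"firm_id": range(500),
                      "score": rng.uniform(0, 100, 500)})

# Very strong candidates are taken in, very weak ones are left out,
# and only those in the middle band are randomized.
firms["treated"] = 0
strong = firms["score"] >= 80
marginal = (firms["score"] >= 40) & (firms["score"] < 80)
firms.loc[strong, "treated"] = 1
firms.loc[marginal, "treated"] = rng.integers(0, 2, size=marginal.sum())

# The experimental comparison uses only the randomized band, where assignment
# is independent of firm characteristics by construction.
exp = firms[marginal].copy()

# Hypothetical post-program outcome (e.g., log exports) from a follow-up survey;
# the 0.15 "true effect" is made up for the simulation.
exp["log_exports"] = 2.0 + 0.15 * exp["treated"] + rng.normal(0, 0.5, len(exp))

# Under random assignment, the average treatment effect is a difference in means.
ate = (exp.loc[exp["treated"] == 1, "log_exports"].mean()
       - exp.loc[exp["treated"] == 0, "log_exports"].mean())
print(f"Estimated average treatment effect: {ate:.3f}")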
Randomization does allow for flexibility, which may help make it acceptable. First, it does not need to cover all individuals. For instance, a program can use standard selection methods to determine eligibility, and introduce randomization among either all eligible firms or only ―marginal‖ ones. That is, very strong candidates can be taken in, very weak ones left out, and only those in the middle subject to randomization.26 Lotteries are somehow more appealing than blind randomization because they avoid the impression that something is hidden. At a more basic level, in the presence of rationed resources to fund a policy intervention, the advantage of randomization is that it constitutes the fairest solution to rationing (Ravallion, 2008). 25 Put differently, the probability of getting treatment, conditional on the individual’s characteristics, needs to be independent of the outcome. 26 We are grateful to David McKenzie for pointing this out to us. 18 Duflo, Glennersten, and Kremer (2008) note that the spread of RCTs in health, education, and poverty programs owes much to the collaboration with NGOs, as collaboration with local authorities is still relatively rare. NGOs are much less involved in trade-related programs than in other programs, so the scope for RCTs may be inherently less, at least as long as the evaluation culture remains rare in public policy. Atkin and Khandelwal (2011) discuss how carrying out an RCT within the context of international trade depends crucially on finding a suitable local project partner who can provide the export promoting services to producers, and on convincing that project partner of the feasibility and value of the randomization procedure. However it should also be noted that working with NGOs generally limits the size and scope of the intervention and the impact of the program could differ if it is scaled up and implemented without NGO collaboration (Ravallion, 2008). Atkin and Khandelwal (2011) describe an ongoing project for an RCT to assist microenterprises in the handloom weaving sector in Akhmeem, Upper Egypt to enter into export markets. The project’s objective is to link those microenterprises to foreign buyers in the U.S. through the provision of three kinds of services. The first consists of putting Egyptian producers in contact with design consultants to develop patterns that can appeal to the tastes of U.S. consumers; the second is marketing assistance with U.S. buyers; and the third is general business training. The project’s impact-evaluation design is simple: after drawing up a list of potentially viable producers/exporters in the sector and region, a random group of them will be given the opportunity to export to the U.S. market with the help of the three services listed above. The data on both outcomes (export performance) and covariates (producer characteristics) will be generated through surveys conducted as part of the IE. A baseline survey will collect information on all viable exporters — both those that will benefit from this intervention as well as those not approached before the services are provided. Another survey will be conducted long enough after the intervention in order for the effects to be tangible. The World Bank is considering implementing RCTs in some of its own projects as well, although plans at this stage are preliminary. 
The World Bank is considering implementing RCTs in some of its own projects as well, although plans at this stage are preliminary. Candidate projects include a customs border post modernization project at the border between the Democratic Republic of Congo and Rwanda, where petty traders on foot, mostly women, are regularly exposed to corruption and harassment. The project would work with some of the women's associations (with group randomization) to designate customs brokers acting as shields between women and predatory customs officers. Another project involves the facilitation of payments for small cross-border transactions through branchless banking near the Cameroon-Chad border. Currently, all payments for such transactions are made in cash, which hampers trade. At the very least, the project would involve a natural experiment if branchless banking is allowed for traders on one side of the border but not on the other; in addition, the design may involve, in a pilot phase, selected access to non-cash payments for a randomly chosen treatment group.

One of the reasons why RCT is a preferred design for such experiments is that randomization does away with the need for complex econometric techniques to control for selection in non-experimental settings. However, RCT is no silver bullet in small-sample environments, as it relies on the law of large numbers to ensure that expected untreated outcomes are equal in treatment and control groups. In low-income countries, interventions sometimes target very small numbers of firms (McKenzie, 2011b).27 For instance, the Pesticides Initiative Program (PIP), an E.U. technical-assistance program designed to help fruit and vegetable producers cope with E.U. standards, covers no more than a few dozen firms in some African countries (Jaud and Cadot, 2011). Randomization is not an option in such environments. Quasi-experimental methods may not do very well either; but if a cross-country sample is available with enough observations, econometrics may offer some scope to control for cross-country heterogeneity.28 We will return to the small-sample issue in the context of non-targeted interventions in section 4.2 when referring to the Cameroon customs project described in Cantens, Raballand, Bilangna, and Djeuwo (2011).

An intrinsic limitation of RCTs in trade and other economic areas is that the study subjects are active economic agents who consciously choose their responses, as opposed to the medical sciences, where passive entities (e.g., cancer cells) respond endogenously following the laws of nature (Barrett and Carter, 2010). Unobservable perceptions about the benefits of a new trade-related intervention will vary among potential beneficiaries in ways that are likely to be correlated with other attributes and with the actual effects of the treatment. In section 6, we will discuss how randomization may fail to produce unbiased treatment effects in the presence of "essential" heterogeneity or in the presence of spillovers.

4.1.2 Quasi-experimental methods

When evaluation is not built into program design, RCT is not an option and quasi-experimental (QEM) methods must be used, all relying on econometric techniques to overcome selection bias.29 The first is the difference-in-differences (DID) method briefly described above. By comparing differences in outcomes instead of comparing levels, DID controls for unequal performance levels of treatment and control groups not related to the program. However, DID relies on the assumption of parallel trends and does not control for selection on observables (firm-level covariates).

27 McKenzie (2011b) discusses the issue of small samples in World Bank private sector support programs in Africa. None of those programs has been subject to rigorous impact evaluations so far, but if such evaluations were to be conducted, researchers would be faced with a serious problem of power given the small number of enterprises assisted by the projects and their large degree of heterogeneity.
28 Randomization across countries would be more difficult to implement than within a country and would not necessarily increase the test's power.
29 How well quasi-experimental methods perform compared to randomization has been a subject of intense scrutiny since the seminal paper of Lalonde (1986), with largely inconclusive results. Glazerman, Levy and Myers (2003) found that quasi-experimental methods produced substantially biased results compared to experimental ones in twelve replication studies of welfare and employment programs in the U.S. Cook, Shadish, and Wong (2006) found less clear-cut results for education programs. See Ravallion (2008) on the evaluation of poverty programs in non-experimental settings.
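A minimal sketch of the two-period difference-in-differences comparison just described, on hypothetical firm-level data (all names and magnitudes are illustrative): the coefficient on the interaction of the treatment and post-program indicators is the DID estimate.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Hypothetical firm-level panel: pre- and post-program log exports
n = 400
firms = pd.DataFrame({"firm": range(n), "treated": rng.binomial(1, 0.4, n)})
pre = firms.assign(post=0)
post = firms.assign(post=1)
panel = pd.concat([pre, post], ignore_index=True)

# Treated firms start from a different level (which DID nets out) and
# receive a hypothetical 0.15 log-point boost after the program
panel["log_exports"] = (1.0 + 0.5 * panel["treated"] + 0.1 * panel["post"]
                        + 0.15 * panel["treated"] * panel["post"]
                        + rng.normal(0, 0.3, len(panel)))

# The coefficient on treated:post is the difference-in-differences estimate
did = smf.ols("log_exports ~ treated + post + treated:post", data=panel).fit()
print(did.params["treated:post"])
```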
The DID method can be improved by matching, which controls for observed firm characteristics correlated with both program participation and performance. The key assumption for the impact estimated by this method to be unbiased is that selection into the program is based only on observable firm characteristics.30 The matching procedure proceeds in two steps. First, firm-level covariates are used to predict the probability of getting (or enrolling in) the program using a probit or logit regression. This predicted probability is called a propensity score. Second, the control group is formed by picking, for each treated firm, the untreated firms with the closest propensity score. For each treated firm, depending on the method, there can be either one matched control firm or several, using a weighting scheme.31 Average outcomes in first differences are then compared between the treatment group and the matched control group. The propensity score matching DID estimator allows for time-invariant unobserved firm heterogeneity to affect selection and outcomes. But it does not address the problem that selection (as well as outcomes) may depend on unobserved time-varying firm heterogeneity, as will be discussed below as well as in section 6.3.

30 This assumption is designated as "ignorable treatment assignment" by Rosenbaum and Rubin (1983), which is the seminal study on propensity score matching estimation. The assumption means that program participation and outcomes are independent, conditional on a set of observed attributes.
31 The single-match method is called "nearest-neighbor" matching. Alternatively, one can use n nearest neighbors, or the entire sample of untreated firms with weights that decrease with distance from the treated firm's propensity score. This latter method is called "kernel matching." Many other refinements are possible. See Caliendo and Kopeinig (2005) for details on propensity score matching estimators.

The studies surveyed in Volpe (2011) are good illustrations of the use of quasi-experimental methods in the evaluation of trade assistance. These studies, recently carried out at the Integration and Trade Sector of the Inter-American Development Bank, use DID and matching-DID methods to assess the effectiveness of the export promotion activities of PROMPEX/PROMPERU (Peru), PROCOMER (Costa Rica), URUGUAY XXI (Uruguay), PROCHILE (Chile), EXPORTAR (Argentina), and PROEXPORT (Colombia). They use rich and unique datasets for the six Latin American countries that combine firm-level customs data with covariates drawn from other national firm-level data sources, and constitute the first rigorous micro-based evidence of the effects of export promotion.32 The picture emerging from Christian Volpe's survey is that export promotion was effective in facilitating export expansion for firms in the LAC region, but primarily along the extensive margin. Firms exporting differentiated goods benefit more than those selling more homogeneous goods. Small and relatively inexperienced companies benefit more than larger and already established exporters. Finally, bundled services that provide support to firms throughout the export development process appear to be more effective than isolated actions.

32 An alternative, more traditional route to the evaluation of export promotion's effectiveness is the aforementioned cross-country study of Lederman et al. (2010).
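For concreteness, the two matching steps described above can be sketched as follows on hypothetical data (purely illustrative names and magnitudes): a logit of participation on covariates yields the propensity score, and each treated firm is then compared with the untreated firm whose score is closest, using first-differenced outcomes.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)

# Hypothetical data: covariates, program participation, and outcome growth
n = 1000
df = pd.DataFrame({
    "log_size": rng.normal(3, 1, n),
    "exporter_age": rng.integers(1, 30, n),
})
# Participation depends on observables (the selection-on-observables case)
p_true = 1 / (1 + np.exp(-(-3 + 0.6 * df["log_size"] + 0.03 * df["exporter_age"])))
df["treated"] = rng.binomial(1, p_true)
# Outcome in first differences (e.g., export growth), with a 0.1 true effect
df["d_log_exports"] = (0.02 * df["log_size"] + 0.1 * df["treated"]
                       + rng.normal(0, 0.2, n))

# Step 1: estimate the propensity score with a logit of participation on covariates
X = sm.add_constant(df[["log_size", "exporter_age"]])
df["pscore"] = sm.Logit(df["treated"], X).fit(disp=0).predict(X)

# Step 2: nearest-neighbor matching on the propensity score, then compare
# average first-differenced outcomes between treated firms and their matches
treated = df[df.treated == 1]
control = df[df.treated == 0].reset_index(drop=True)
idx = np.abs(control["pscore"].values[None, :]
             - treated["pscore"].values[:, None]).argmin(axis=1)
att = (treated["d_log_exports"].values
       - control.loc[idx, "d_log_exports"].values).mean()
print(f"Matching-DID estimate of the effect on the treated: {att:.3f}")
```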
Gourdon, Marchat, Sharma, and Vishwanath (2011) apply the same type of quasi-experimental methods to the evaluation of FAMEX, a World Bank-supported export promotion program in Tunisia, which provided a mixture of counseling and matching grants to new exporters. The study exploits a customized firm-level survey to estimate the effects of FAMEX on the export performance of beneficiary firms at the intensive and extensive margins. Propensity-score matching DID estimates suggest a very large and statistically significant growth effect at the intensive margin: a 39% differential in terms of annual export growth compared to control firms over the 2004-2008 period. The treatment effect at the extensive margin (in terms of products and destinations) is both smaller quantitatively (a 5% growth differential in the count of products and destinations for program beneficiaries compared to control firms) and of marginal or no significance (significant only at the 10% level for destinations and insignificant for products). In addition to the observed acceleration in export growth, Gourdon et al. find a significant boost to employment growth: a 10% annual differential for program beneficiaries, significant at the 5% level. An original feature of their dataset is that it covers service firms in addition to manufacturing firms, and they find considerably stronger effects for the former. One potential issue with their data is that the survey was conducted ex post (no baseline survey was conducted, as IE was not part of the program design), so the data may suffer from recall bias. Preliminary results in Cadot, Fernandes, Gourdon, and Mattoo (2011) based on an alternative source of data (customs data) suggest a smaller and non-persistent treatment effect.

Jaud and Cadot (2011) also apply quasi-experimental methods to assess the impact of the E.U.-funded Pesticides Initiative Program (PIP) on the export performance of firms in Senegal's horticulture sector. Their results suggest that, while the program had no significant effect on exports of fresh fruit and vegetables pooled over all products and destinations, it had a positive effect when considering exports to the EU.

Other quasi-experimental methods can address selection bias in the evaluation of the impact of a program. One approach relies on instrumental variable (IV) estimation. This can be used when program take-up is less than complete and thought to be correlated with unobserved individual characteristics influencing performance. In this case, eligibility can be used as an instrument for participation, provided that eligibility is truly exogenous (e.g., if there is randomization of eligibility but program take-up is incomplete or some participants drop out). This method is used in the context of non-targeted interventions by Sequeira (2011), as described in section 4.2.
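The logic of using randomized eligibility as an instrument when take-up is incomplete can be illustrated with a small simulation (hypothetical data; not drawn from any of the programs discussed here). Naive OLS on participation is contaminated by self-selection, whereas the reduced-form effect of eligibility divided by its effect on take-up recovers the program effect for participants induced by eligibility.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)

# Hypothetical setting: eligibility is randomized, but take-up is incomplete
# and correlated with an unobserved firm attribute ("drive")
n = 2000
drive = rng.normal(0, 1, n)
eligible = rng.binomial(1, 0.5, n)
takeup_prob = 1 / (1 + np.exp(-(-0.5 + 1.5 * drive)))
participated = eligible * rng.binomial(1, takeup_prob)
# True program effect of 0.2; "drive" also raises outcomes directly
y = 0.2 * participated + 0.4 * drive + rng.normal(0, 0.5, n)
df = pd.DataFrame({"y": y, "eligible": eligible, "participated": participated})

# Naive OLS on participation is biased upward by self-selection
print(smf.ols("y ~ participated", df).fit().params["participated"])

# Wald/IV estimate: reduced-form effect of eligibility divided by its
# effect on take-up (eligibility is a valid instrument because it is random)
reduced_form = smf.ols("y ~ eligible", df).fit().params["eligible"]
first_stage = smf.ols("participated ~ eligible", df).fit().params["eligible"]
print(reduced_form / first_stage)
```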
Another approach is regression discontinuity design (RDD), which makes use of breaks in eligibility to identify a program's impact.33 For instance, suppose that an export promotion program targets small and medium-sized enterprises (SMEs) as defined by a cutoff level of sales. If the sample is large enough, one can compare outcomes for SMEs immediately below the cutoff (eligible) and for SMEs immediately above (ineligible), on the assumption that they are close enough in the characteristic upon which eligibility is defined to be good matches for each other, and, most importantly, that the cutoff rule is indeed enforced.34

33 See Campbell (1969) for details; a survey can be found in Todd (2008).
34 The issue of rule enforcement has been controversial, for example in the context of microcredit evaluation (see Morduch 1998), but may be a lesser concern for firm-level trade interventions such as support to SMEs.

4.2 Non-targeted interventions

Non-targeted trade interventions cover mostly programs that help reduce trade costs. These include trade facilitation programs such as upgrading of bottleneck infrastructure in ports, roads, or railroads, reforms of customs agencies and procedures, and some types of trade competitiveness programs related to general improvements in the business environment or support to producer organizations. Because these interventions generally do not target micro entities and their direct beneficiaries are multiple and diffuse, the identification of a control group is difficult, and so they are less amenable to experimental or quasi-experimental design.

Considering "hard" and "soft" infrastructure-related trade facilitation programs, the two key constraints to estimating their effects are (a) the endogeneity of program placement and (b) the absence of well-defined treatment and control groups. Thus, the pre-treatment unobservable characteristics that determine infrastructure placement and affect outcomes will likely differ between treatment and comparison groups (where groups are, in this case, most likely to be locations). Randomization in the context of large and sensitive hard transport infrastructure programs is generally not feasible. This is also the case for soft trade facilitation programs relating to rules, regulations, and government agencies dealing with the movement of cargo across borders, which are often amenable neither to random assignment at the micro level nor to the creation of comparison groups for the purposes of an IE.

For interventions such as customs reform, the only way to generate a control group is to introduce elements of targeting through progressive phase-in during a pilot phase, staggered for example across different border posts, or through selective implementation covering only some customs offices or officials, or by giving privileged access only to some firms or to some types of traded goods. For instance, a "green channel" in customs, which is a speedy clearance for trusted operators, can be restricted and randomly allocated in an early phase, using non-eligible operators as controls.35 In this case, methods such as DID can in principle be applied using the locations initially not covered, customs offices or officials, or firms as the control group for the targeted entities. However, in many cases, during the pilot phase the control group will not be strictly comparable to the treatment group.

35 This approach is similar to so-called "pipeline" methods where applicants are used as controls for beneficiaries.
For example, when a border modernization program is initially deployed in one border post, other border posts of different scale and product mixes serving other areas could serve as controls. It may then be necessary to use regression analysis to control explicitly for the heterogeneity in covariates when estimating differences in outcomes between treated and control border posts.

In some cases, policy design or implementation inadvertently creates the conditions necessary to perform evaluation through quasi-experimental methods — what economists call a "natural experiment." Datt and Yang (2011) exploit one such natural experiment. The government of the Philippines used pre-shipment inspection (PSI) services to combat corruption in customs and increase import duty collections. The natural experiment arose from two conditions: (1) imports from only some origin countries were covered by PSI, which created a natural control group (imports from other countries); and (2) in 1990 the government decided to close a loophole whereby import transactions below a threshold of $5,000 were exempted from PSI. The loophole had enabled traders to slice shipments into small batches and under-invoice them without being detected. The customs reform consisted of lowering the threshold to $500, so the period after 1990 can be considered a "treatment period." A DID equation can then be used to compare the evolution of outcomes before versus after the reform for the treatment and control groups of countries. The DID estimates show that, when inspections were expanded to lower-valued shipments, import shipments were no longer mis-valued, but those from treatment countries shifted differentially to an alternative duty-avoidance method: shipping via duty-exempt export processing zones (EPZs). Thus, increased enforcement reduced the targeted method of duty avoidance but led to substantial displacement to an alternative duty-avoidance method. Duty collection failed to rise, while importers incurred higher fixed costs as they relocated to EPZs. This evidence shows that, to be successful, anti-corruption reforms need to encompass a wide range of possible alternative methods of committing illegal activity.

Sequeira (2011) discusses a transport infrastructure project consisting of investments in a railroad connecting the economic heartland of South Africa to the port of Maputo in Mozambique. Given the poor state of Mozambique's infrastructure after two decades of war, and in the face of budget constraints, the government had to be selective in its choice of infrastructure investments. It decided to rehabilitate the old pre-colonial railway in the Maputo transport corridor (which would promote regional integration) rather than building an entirely new North-South connection, as was demanded by the Mozambican business class. As the layout of the old pre-colonial railway had been designed to serve 19th-century mining companies, there is plausibly exogenous variation in the placement of the rehabilitated railway relative to the geography of manufacturing and retail firms at the time of rehabilitation.
The IE of this transport infrastructure project estimates the impact of railway rehabilitation on firm performance — namely, how it affects transport costs for different firms and sectors, how firms respond to these changes, and what the spillover and network effects are across rail and road transport. To identify a causal relationship, the study will use a quasi-experimental method, IV, where the treatment, defined as changes in transportation costs, will be instrumented by the distance between a firm's location and a working station of the railroad. In addition, the study exploits the fact that other transport corridors in Mozambique developed at different speeds and identifies two sets of control firms to match to the treated firms in the Maputo transport corridor: firms in the Beira corridor (which have access to a new port but no railroad) and firms in the Nacala corridor (which have no access to a new port or railroad). To isolate the impact of the Maputo railway rehabilitation, the study will use a matching DID estimation that assumes that the only factor making the trajectories of these three sets of firms different during the sample period is that they were exposed to different transport choice sets. The impact of the Maputo railway rehabilitation is not yet known since only the baseline survey information is available; a follow-up survey will be conducted in 2011.

Sequeira (2011) also discusses a "soft" transport infrastructure project focusing on corruption in Southern African ports.36 By collecting original data on bribe payments made to customs officials and to port operators in the two competing ports of Durban and Maputo, the study is able to trace differences in bribe schedules to the organizational structure of each port. By observing how firms adapt their shipping and sourcing decisions to the type of corruption faced at each port — which enters the calculation of the overall cost of using each port — the study estimates the impact of corruption at ports on the behavior of South African firms. The estimates show that corruption imposes a distortion in terms of "diversion": firms travel on average an additional 322 kilometers, more than doubling their transport costs, just to avoid "coercive" corruption at a port. This effect is only observed for firms facing a higher probability of being coerced into a bribe because of the kind of product they ship. Firms are willing to incur higher costs to avoid corruption because of an aversion to the uncertainty surrounding bribe payments at the most corrupt port (Maputo). The uncertainty in Maputo seems linked to the short tenures of customs officials caused by high job turnover. Firms also respond to different types of corruption by adjusting their sourcing decisions for inputs — domestically or internationally — since corruption at ports increases the cost of using the port and thus directly affects the relative cost of imports.

36 The project is described more extensively in Djankov and Sequeira (2010).

While this project is not an impact evaluation of an intervention to reduce corruption in ports, it provides two sets of valuable insights on such interventions because it considers the entire chain between competing port bureaucracies setting bribes and user firms making shipping and sourcing decisions. First, the study shows that, depending on the type of corruption that bureaucrats engage in, bribes can affect the deadweight loss, tariff revenue, and the demand for the public service.
In particular, corruption seems to significantly reduce demand for the Maputo port, stifling the returns to the massive investments in hard infrastructure of the corridor that have taken place in recent years. Second, policy changes to the organization of ports and to the nature of the interaction between shippers and port officials could reduce corruption. Such changes include reducing the discretion of port officials in the clearance process and eliminating face-to-face interactions between clearing agents and port officials.

Cantens, Raballand, Bilangna, and Djeuwo (2011) describe a recent pilot for customs reform in Cameroon that involved the introduction of contracts with performance indicators for frontline customs inspectors in two of the country's customs bureaus (henceforth referred to as treated bureaus). The performance indicators covered both trade facilitation and the fight against fraud and bad practices. Frontline customs inspectors with good performance would be rewarded with non-financial incentives such as congratulatory letters entered into their personnel files, easier access to the director general of customs, training courses, and transfers to more attractive bureaus. Poorly performing inspectors would be sanctioned by eviction from bureaus with strong "fiscal potential" — that is, where the possibilities of earning money legally through disputed claims were high.

This project is an interesting example of a trade intervention that in principle is non-targeted, but where targeting could have been introduced by focusing on a subset of frontline customs inspectors. This could then have been an ideal setting to implement an RCT, whereby a subset of randomly chosen frontline inspectors would have been under performance contracts while others would not. However, it was not possible to implement an RCT for several reasons. First, the seven customs bureaus in Cameroon are specialized (oil imports, special customs regimes related to public trends, transit, exports, bulk cargo, and the two treated bureaus) and differ so much in customs practices that it would be difficult to make comparisons across bureaus. Hence, if anything, one would need to take a bureau and split it into a treated group and a control group of frontline inspectors. But this was not feasible given a small-sample problem: fewer than 10 staff work in each bureau. Second, as is generally the case in projects funded by governments or international donors, the time for the pilot project was limited. Thus, it was not possible to overcome the small-sample issue by allowing for turnover within each bureau to artificially increase the number of treated and control officers. Moreover, since contract incentives were not financial, time was required to reward good performers (e.g., it would not have been feasible to appoint high-performing inspectors to better positions every six months).

Therefore, the IE of the customs performance contracts project was conducted as a comparison of inspectors' behavior before and after the project was implemented, without a defined control group, although the impact on clearance times was assessed using the bulk-cargo import bureau as a counterfactual. The estimated effects of the pilot performance contracts were positive surprisingly soon after the pilot was launched in mid-2009. Duties and taxes assessed increased despite a fall in the number of imported containers (likely linked to the financial crisis), and the tax yield of the declarations also rose.
The performance contracts also affected clearance times, as the share of declarations treated within 24 hours increased more in the treated bureaus than in the counterfactual bureau, and the variance of clearance times decreased dramatically. The impact on disputed claims was equally interesting, with inspectors abandoning low-level disputed claims to focus on major ones; the ratio of taxes adjusted to taxes assessed increased. Finally, the contracts also had a major impact in reducing costly practices. For instance, the number of litigious re-routings from the yellow channel (documents control) to the red channel (physical inspection) declined tremendously.

5. Data issues

In this section, we discuss, first, how the objectives of the evaluation influence the type of performance measures that need to be considered, and then how the necessary data may be obtained.

5.1 What should we measure?

The choice of performance measures is important not only to ensure that IE focuses on the appropriate indicators, but also because using IE can affect the incentives of agents and program managers in unintended ways. Performance indicators that strongly relate to targeted interventions in a causal sense are often too technical to be of interest from a broad policy perspective, whereas the highly aggregate indicators that interest policy-makers are rarely faithful reflections of the effect of targeted interventions and projects. Thus, selecting performance indicators involves a trade-off between breadth and identification.

Much of the talk in aid-for-trade evaluation focuses on aggregate indicators such as national export performance or other macro variables. Although policy-makers may find these broad indicators relevant, the causal link between them and the actual performance of trade interventions is tenuous, implying weak identification. By contrast, M&E frameworks, developed to ensure project management and quality control, have used intermediate outcomes more directly linked to the projects themselves, like customs clearance times. In a causal sense, these measures are closer to project management but are likely to be narrow in scope. Deciding which approach is better depends on what the indicators are used for. If evaluation results are expected to feed into incentive structures for program managers, identification is critical and breadth is secondary. In contrast, in order to catch the attention of policy-makers, breadth matters more, possibly at the cost of weaker identification.

Impact evaluation does not escape this general trade-off between breadth and identification, but typically sits at the "narrow" end of the spectrum since it identifies changes in performance measures that are directly attributable to the project. For instance, when evaluating a customs modernization program, the performance measure is likely to be something like container dwell time, even though less quantifiable dimensions of customs performance, like security at the borders, may also matter. But identifying and documenting the chain of causality from program to ultimate outcomes can be challenging for some trade interventions. In trade facilitation programs, it is not always clear what the micro-level mechanisms are by which transport cost reductions influence firms and households and, more generally, economic activity. In addition, the use of IE can affect incentives in the long run.
The focus on narrow, immediate performance outcomes may well lead to measurement biases or, even worse, create perverse incentives when used for monitoring and evaluation. For one thing, it can focus attention on readily measured outcomes at the expense of less easily measured ones. Consider a customs modernization program. Using IE results to design reward schemes for customs officials might lead to over-emphasis on easy-to-measure reductions in clearance times, at the expense of the monitoring of suspect shipments. If, say, there is a low rate of smuggling of illicit products, it may take time before the consequences of reduced monitoring get noticed — too long to show up in an IE.

5.2 How do we obtain the data?

The feasibility of rigorous impact evaluation hinges critically on data availability. Whether the IE is based on an experimental (RCT) or quasi-experimental design, it needs to include a baseline survey and at least one follow-up survey. If quasi-experimental methods are used, the baseline survey must include a rich set of covariates to estimate a (first-stage) selection regression. One of the advantages of RCTs, especially in developing countries, is that they are less demanding in terms of data; however, even with randomization, firm-level covariates can be useful in verifying that the treatment and control groups are comparable in their observable characteristics. This is especially important for small samples. The availability of a rich set of covariates allows for the analysis of heterogeneity in the effects of the program. Moreover, a deep knowledge of the objectives of the program as well as its administrative and institutional details can be important for the design of surveys that collect the right type of information to control for the selection process (Ravallion, 2008).37

37 Qualitative information, collected from surveys or focus groups, can complement quantitative survey data though it cannot be the basis for credible impact evaluation by itself.

Table 3 provides examples of intermediate and ultimate outcomes in the context of new-style trade interventions linked to trade competitiveness and trade facilitation.

Table 3 Intermediate and ultimate performance outcomes

Trade competitiveness (example of program: matching grant to support firms' access to export markets)
- Intermediate outcomes, to understand the chain of causality from program to outcomes: exports, output, and input choices at the firm level.
- Ultimate outcomes: productivity, wages, and employment at the firm level.
- Covariates to use as controls or to understand the heterogeneity of the effects of the program: firm-related: industry, location, age, size, ownership, workforce details.

Trade facilitation (example of program: customs reform)
- Intermediate outcomes, to understand the chain of causality from program to outcomes: customs or port clearance time and costs, incidence of illegal activity.
- Ultimate outcomes: trade volumes, customs revenue collected.
- Covariates to use as controls or to understand the heterogeneity of the effects of the program: firm-related or customs office- or official-related: location, education, age, contract.

The evaluations of the impact of trade facilitation programs — especially those related to infrastructure — currently suffer from a serious lack of micro-data on transport costs and prices before and after interventions take place. For these types of interventions, it is desirable to conduct baseline and follow-up surveys of program beneficiaries and control groups. In addition, baseline and follow-up surveys may not be enough to assess a program's impact.
Consider, for example, the case of a one-year export-promotion program where firms can enlist in any year between 2005 and 2009, and suppose that a baseline survey is conducted in 2004 and a follow-up survey is conducted in 2010. For firms that enrolled in 2005, the follow-up survey will pick up outcomes four years after the treatment. By then, if the effects are transient, they may have vanished, and the follow-up survey will pick up heterogeneous effects (one year after treatment for firms enrolled in 2009, two years for those enrolled in 2008, and so on). Thus, although costly, it may be necessary to run repeated follow-up surveys year after year.38

38 McKenzie (2011a) argues that two advantages of having multiple data points for treatment and control groups are (1) the possibility of studying the trajectory of program impacts and uncovering causal chains and (2) the collection of multiple measurements on possibly noisy and weakly auto-correlated outcomes. By averaging outcomes across multiple data points, noise is eliminated and the power to detect genuine effects of a program increases.

While projects typically have budgets for baseline data collection, these may not always be enough to gather the data needed for a proper IE after the project is completed. An alternative, cost-efficient method is to use official pre-existing sources of data, provided that they are collected often enough and provided that they can be closely reconciled with program data. For example, in the IE of an export promotion program, customs records at the transaction/firm level can be used to measure outcomes such as growth in export value (the intensive margin), number of products, or number of destinations (the extensive margin).39 Naturally, it is important to integrate such data with program data, such as data from the project monitoring database.

39 See Freund and Pierola (2011) and Lederman, Rodriguez-Clare, and Xu (2011) for uses of such data in a non-IE context.

The trade and integration unit of the World Bank Development Research Group is involved in a major data collection exercise that may help the IE of trade-related interventions in the next few years. As described in Freund and Pierola (2011), the exercise consists of the collection and compilation of the first-ever database on exporter-level customs transaction data across countries and over time. Data has been obtained for 20 countries in Africa, Asia, Eastern Europe, and Latin America, and negotiations are in progress to obtain data for 25 more countries. The database will include statistics on exporters' characteristics and behavior by country, industry, and destination market. The purpose of the database is to provide policymakers, development agencies, researchers, and the public with a novel source of information to conduct analysis of export growth at the micro level and allow for the evaluation of programs and policies affecting that growth.

Data on firm characteristics (covariates), used to control for selection bias, is typically hard to obtain. If an industrial survey is available in the country where the trade intervention is taking place, it can provide the required variables (e.g., location, age of the firm, education of its head, number of employees, foreign ownership). However, this requires that customs and industrial census data be merged, which raises confidentiality concerns and may require active collaboration by busy officials in local institutions. When data is not available, an alternative is to conduct a "retrospective survey" — although this method may be biased.
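As an illustration of the customs-records route described above, the following sketch computes intensive- and extensive-margin outcomes from hypothetical transaction-level records; all data, identifiers, and names are made up for the example.

```python
import pandas as pd

# Hypothetical transaction-level customs records (all values illustrative)
trx = pd.DataFrame({
    "firm_id":     [1, 1, 1, 2, 2, 2, 2],
    "year":        [2007, 2008, 2008, 2007, 2007, 2008, 2008],
    "product_hs6": ["070820", "070820", "080450", "620342", "620342", "620342", "611020"],
    "destination": ["FRA", "FRA", "DEU", "USA", "CAN", "USA", "USA"],
    "value_usd":   [120_000, 150_000, 30_000, 80_000, 20_000, 95_000, 40_000],
})

# Intensive margin: growth in total export value per firm
value = trx.groupby(["firm_id", "year"])["value_usd"].sum().unstack("year")
value["export_growth"] = value[2008] / value[2007] - 1

# Extensive margin: counts of distinct products and destinations per firm-year
extensive = trx.groupby(["firm_id", "year"]).agg(
    n_products=("product_hs6", "nunique"),
    n_destinations=("destination", "nunique"),
)

print(value)
print(extensive)

# These outcomes would then be merged with the program's beneficiary list
# (a hypothetical "program_data" table) on the firm identifier.
```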
In its evaluation of Tunisia's export-promotion agency, Gourdon et al. (2011) use a combination of data from a survey and from national sources (the customs agency and national statistics institute). Yet another alternative is to include questions on program participation and details in ongoing surveys. This also requires close collaboration between the evaluator and the local institution implementing the survey.

6. Looking ahead: Challenges facing IE of trade assistance

In this section, we consider three key challenges that credible IE of trade interventions must address.

6.1 External validity and cost

One concern with impact evaluations is that their external validity is an act of faith. When a program is found to be effective (or ineffective), how do we know that the result would carry over to similar programs run in different environments?

As Rodrik (2008) and Ravallion (2008) have argued, there is a trade-off in policy evaluation between external and internal validity. As traditional identification of causal effects through instrumental-variable strategies never completely eliminates confounding influences, these strategies always suffer from an internal-validity problem. However, when based on cross-country evidence, they pick up average effects that can be relatively stable across settings, provided they are consistent with some sort of theory; induction alone, even on cross-country samples, may fail to produce generalizable results. By contrast, IE purges confounding influences, but generates results that are empirical and case-dependent. Such results may fail to carry over to different settings.

Limited external validity of any study would not be a problem if we could replicate it easily. With enough replications, the sheer mass of evidence would provide the desired generality (although the method would still be inductive and would thus suffer from the general critique of inductive methods in science). But some kinds of IE can be costly. For instance, the World Bank reckons that household surveys cost on average $300 per household. At that rate, a baseline and final survey of 500 households would cost $300,000. This is a lot for studies with only internal validity.40 However, costs can often be contained by working with local institutions, which has the added advantage of building capacity in a key area. Some trade-related programs target limited numbers of firms, so their evaluation is less costly than that of poverty-reduction programs. For instance, in a middle-income country, the cost of surveying 500 firms can be substantially lower than $100,000. Moreover, the data may exist prior to and independently of the IE in the form of census or industrial surveys and customs records. In that case, the cost of the IE goes down dramatically. The problem then is no longer one of cost but more one of securing buy-in from the agencies possessing the data so that they share it.

40 In their discussion of quasi-experimental versus experimental methods, Duflo et al. (2008) make a noteworthy point about the commitment value of costly experimental design. It has often been argued, with some statistical support (see, e.g., Ashenfelter, Harmon, and Oosterbeek 1999), that statistically significant results (positive impact in our setting) are more likely to get published, a so-called "publication bias." As experimental methods are costly and usually planned with donors, self-censorship in the face of insignificant results is less likely to be feasible than when relatively low-cost quasi-experimental methods are used with publicly available data. In that sense, IEs may be less affected by publication bias.

However, it should also be kept in mind that tests of the effect of interventions based on 500 firms are likely to have low power (see McKenzie 2011a and 2011b for a discussion), and thus to generate type-II errors (failing to reject the null hypothesis of no treatment effect when, in fact, the effect is present). If cost-cutting leads, through low-power experiments, to unjustified pessimism about the effect of interventions, IE may lose a lot of its power to guide policy choices.
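The power problem just noted can be made concrete with a standard two-sample power calculation. The sketch below is illustrative only; the effect size and sample sizes are hypothetical and not taken from any study cited here.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Suppose a program is expected to raise the outcome by 0.10 standard
# deviations (a hypothetical effect size)
effect_size = 0.10

# Power of detecting that effect at the 5% level with 250 treated and
# 250 control firms
power = analysis.power(effect_size=effect_size, nobs1=250, ratio=1.0, alpha=0.05)
print(f"Power with 500 firms: {power:.2f}")

# Sample size per group needed to reach 80% power for the same effect
n_needed = analysis.solve_power(effect_size=effect_size, power=0.80, alpha=0.05, ratio=1.0)
print(f"Firms per group needed for 80% power: {n_needed:.0f}")
```

Under these hypothetical numbers, a 500-firm sample detects a 0.10 standard-deviation effect only about one time in five, which is the sense in which small, cheap samples can produce misleadingly pessimistic conclusions.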
6.2 Spillovers and general equilibrium effects

Externalities can bias treatment effects by blurring or magnifying the difference in outcomes between treatment and control groups. In the context of policy evaluation, this raises a deep issue, as externalities are often the basic justification for government intervention. One key assumption of both experimental and quasi-experimental methods is that the impact of the program can be located only among its direct participants; that is, the control group is not "polluted" by the treatment group, lest the comparison of outcomes be biased. A classic case in economics occurs when general equilibrium effects transmit the benefits conferred on beneficiaries to non-beneficiaries or, alternatively, penalize them, say, through rising input prices. For example, a program to upgrade one border post may induce traffic shifting from other, untreated border posts. The volume of trade going through the treated border post will then be increased by the substitution from traffic that normally goes through other posts, and as "control" border posts see their traffic go down, using them as controls will result in an upward bias in the estimated treatment effect of the program. Similarly, beneficiaries of an export promotion program may be able to lure away the ablest workers from other, non-beneficiary firms. The adverse effects on the latter would lead to an overestimate of the benefits of the program for the former. Thus, ignoring general equilibrium effects can produce misleading evaluations of policies and programs (Abbring and Heckman, 2007).41

In the evaluation of trade-related programs of limited scale, such as export promotion or trade facilitation, general equilibrium effects through market mechanisms may not be critical. However, spillovers may be present through other channels, such as social interactions, which are direct externalities in which the actions of one agent directly affect the actions (preferences, constraints, technology) of other agents (Abbring and Heckman, 2007). For example, an export promotion program may have "demonstration effects" yielding valuable information on the viability of products or destination markets that can be easily imitated by non-participants. In such circumstances, the estimated treatment effect will be biased downward because the difference in outcomes between the treatment and control groups will measure only the purely private effect. It is hardly surprising that treatment effects may be contaminated by information externalities.
After all, even the most rigorous RCTs used to test the effectiveness of drugs can be affected by informational biases, as is the case when individuals in the control group observe that they do not suffer a drug’s side-effects while individuals in the treatment group do and as a result infer that they received the placebo instead of the treatment. Interestingly, the problems originating in the confounding spillover effects from a program to the control group are as relevant for targeted interventions as they are for non-targeted interventions. In fact, as pointed out by Ravallion (2008), they may be a more severe problem for randomized evaluations. 41 But Abbring and Heckman (2007) acknowledge that it is costly to obtain information on all the behavioral parameters required to conduct general equilibrium evaluation. 32 In the trade context, as in economics more generally, the presence of externalities takes on special importance because it plays a key role in justifying government intervention. If the benefits of a program whose costs are borne by taxpayers were internalized by the beneficiaries, surely those beneficiaries ought to pay for it and there would be no justification for public intervention. In contrast, if a program generated spillovers so powerful that no treatment effect was detectable – that is, the control group indirectly benefited as much from the program as the beneficiaries - then there would be a strong argument for a public intervention, as beneficiaries would be willing to pay nothing for the treatment.42 These arguments suggest that impact evaluation results cannot be properly interpreted without a careful discussion of what market failure(s) the policies or programs are trying to remedy. Understanding clearly the policy objectives, the relevant constraints (including those related to resources, information, incentives, and political economy) and the causal links through which the specific policies and programs yield expected outcomes are key for any good evaluation (Ravallion, 2009). If the market failure lies in imperfect capital markets, then a program that provides cheaper-than- market trade financing to specific firms can be expected to have a positive treatment effect and evidence of such an effect can lead one to conclude that the program works. In this case, if there are spillovers, then the positive treatment effect is simply a lower bound of the total effect of the program. However, if the market failure is due to informational spillovers, so that private firms wait for other private firms to invest in uncovering the information needed to export a particular product or to a particular market, then the absence of a treatment effect from an export promotion program is not evidence that the program is not working. In fact, finding a positive treatment effect could reflect the fact that the benefits are largely private, in which case the rationale for the program is put into question. Thus, seeking to justify government-financed programs solely on the basis of treatment effects may not only be affected by bias, it may be altogether wrongheaded. In the export promotion program example, what IE would be measuring is only the private-good dimension of the intervention; the public-good dimension would be left unevaluated. It is thus important to disentangle whether a no-effect finding is due to externalities or to program ineffectiveness. This may call for an independent effort, aside from the IE itself, to detect the presence of externalities. 
For instance, one might estimate a regression of outcomes of untreated individuals on some continuous measure of exposure/closeness to treated individuals, to see if more or closer treated neighbors raise the outcomes of untreated ones. Alternatively, one can include this same measure of exposure to (other) treated individuals in the DID equation and interact it with the treatment to see if the treatment is more powerful for individuals "surrounded" (in some economic sense) by other treated individuals. These methods are inspired by measures of contagion used in epidemiological studies. In contrast with the medical sciences, however, in the social sciences the mechanisms by which contagion takes place are largely unknown. Baseline surveys as well as qualitative information gathered from focus group discussions may help in understanding and identifying channels through which future program benefits might spread from one firm to another — for example, professional association memberships, personal contacts, and so on.

42 This point was made to the authors by Daniel Lederman.

6.3 Heterogeneity

Differences among beneficiaries, especially if they are unobserved, can pose particular challenges for evaluation. Policy interventions can have diverse impacts across economic agents. By focusing on average treatment effects, an evaluation ignores valuable information on the heterogeneity of the effects.43 As Ravallion (2009, p. 37) puts it, practitioners should "never be happy with an evaluation that assumes common (homogeneous) impact". He also argues that knowing more about the heterogeneity of the effects and the role of contextual factors is key to better understanding the impact of the intervention and making evaluation more relevant for good policy-making. The challenges linked to the heterogeneity of the effects are relevant across types of interventions, whether they are of a targeted or a non-targeted nature.

First, the treatment effects of a program can be related to the observable characteristics of the beneficiaries. For example, an export promotion program can have differential effects for participant firms depending on their prior export experience or on their workforce skill levels. If the export promotion program consists of a matching grant scheme that co-finances firms' export business plans, the opportunity costs for participant firms may differ in terms of the alternative uses they could give to their funds.44 A simple approach to address the heterogeneity of the effects when differences are observable is to add interaction effects with the treatment dummy variable in a regression framework that estimates the average treatment effect on the treated, as illustrated in the sketch below. Treatment effects can also vary with the distribution of outcomes themselves. Volpe Martincus and Carballo (2010) examine the impacts of export promotion activities across quantiles of the distribution of Chilean firms' growth rates of exports using quantile treatment effects estimation. They find stronger effects at the lower end of the distribution, which are combined with data on firms' export histories to show that smaller and relatively inexperienced firms, as measured by their total exports, benefit more from export promotion.

43 See Abbring and Heckman (2007) and Ravallion (2011) on the heterogeneity of impacts in evaluation studies.
44 While these opportunity costs may not be observable, they are likely to vary with observable characteristics such as the conditions in the local market or in the sector of activity.
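A minimal sketch of this interaction approach, on hypothetical data (all names and magnitudes are made up): the coefficient on the interaction term measures how the program effect differs for firms with the observable characteristic.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)

# Hypothetical beneficiary and control firms with an observable characteristic
# (prior export experience) that shifts the size of the treatment effect
n = 800
df = pd.DataFrame({
    "treated": rng.binomial(1, 0.5, n),
    "experienced": rng.binomial(1, 0.4, n),
})
df["export_growth"] = (0.05 + 0.15 * df["treated"]
                       - 0.10 * df["treated"] * df["experienced"]  # smaller effect for experienced firms
                       + 0.02 * df["experienced"]
                       + rng.normal(0, 0.2, n))

# Interacting the treatment dummy with the observable characteristic
# recovers how the effect varies across types of beneficiaries
fit = smf.ols("export_growth ~ treated * experienced", data=df).fit(cov_type="HC1")
print(fit.params[["treated", "treated:experienced"]])
```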
Second, a particular challenge arises when unobserved differences among beneficiaries influence their participation in a program. Even in RCTs where eligibility is randomized, economic agents inclined to take up the program based on their unobserved expected net benefits may differ systematically from the agents that were part of the sample randomly assigned to the treatment group. Heckman, Urzua and Vytlacil (2006) introduce the notion of "essential" heterogeneity as pertaining to the case where the impact of a policy intervention is heterogeneous and agents take up treatment based on this heterogeneity (i.e., with knowledge of their idiosyncratic response). The presence of "essential" heterogeneity implies that the estimated average effects of an intervention, even under an RCT, can be biased. For instance, unobservable characteristics of firms that determine their choice of applying to an export promotion matching grant scheme (as well as their choice of the type of business plan and the amount of co-finance to provide) could influence the firms' export success and thus the true effect of the program. The econometric approaches to address "essential" heterogeneity problems are at the frontier of impact evaluation research (see Manski, 1997; Abbring and Heckman, 2007; Djebbari and Smith, 2008; Fan and Park, 2010).

There are at least two approaches to mitigate the problem of selective take-up on the basis of unobservable attributes. Consider again the export promotion scheme example. First, all firms could be invited to participate in the program, some firms would officially apply, but then only a randomly selected subset of the applicants would actually receive the assistance. The potential downside of this approach is limited external validity, in the sense that the estimated treatment effect will apply only to the self-selected group of applicants, unless, of course, that is the impact of interest to the evaluators and policy-makers (McKenzie, 2010). Alternatively, the treatment itself could be defined in a way that is de facto compulsory so that the question of selective take-up does not arise. Thus, a randomly selected set of firms would receive some form of "encouragement," for example through phone calls or visits aimed at providing detailed information on the application process, raising the probability that those firms apply to the program.45 The unbiased effect of this random encouragement, the "intention-to-treat" effect, would be estimated by comparing take-up among firms that received encouragement with take-up among firms that did not. To obtain the effect of the program on ultimate outcomes (e.g., export performance), one would instrument for treatment using the randomly provided encouragement. A limitation of this approach is that, in the presence of "essential" heterogeneity, out of the set of firms receiving the encouragement, those that take up the program are likely to have higher unobserved expected benefits from the program than those that do not.

45 Duflo, Kremer, and Robinson (2006) offer an example of an encouragement design. They tested whether seeing a neighbor use fertilizers would encourage other farmers to do the same. For each using farmer, they invited randomly chosen neighbors to attend a demonstration of fertilizer use. Although other farmers were also welcome to attend, the attendance rate was much higher in the sub-sample of invited ones, which was randomized. To our knowledge, no trade intervention has been evaluated with an encouragement design.
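The mechanics of an encouragement design, and the way selective take-up on idiosyncratic gains shapes what is estimated, can be sketched in a small simulation (entirely hypothetical; no real program data). The intention-to-treat effect on the outcome, divided by the effect of encouragement on take-up, is the instrumental-variables estimate; with "essential" heterogeneity it reflects the firms induced to take up rather than the average firm.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20000

# "Essential" heterogeneity: each firm has its own program effect, and firms
# with larger effects are more likely to take up when encouraged
firm_effect = rng.normal(0.10, 0.10, n)          # hypothetical idiosyncratic effects
encouraged = rng.binomial(1, 0.5, n)              # randomized encouragement
takeup_prob = 1 / (1 + np.exp(-(-1.0 + 8.0 * firm_effect)))
takeup = encouraged * rng.binomial(1, takeup_prob)
y = takeup * firm_effect + rng.normal(0, 0.2, n)  # outcome, e.g. export growth

# Intention-to-treat effect of encouragement on the outcome
itt = y[encouraged == 1].mean() - y[encouraged == 0].mean()
# First stage: effect of encouragement on take-up
first_stage = takeup[encouraged == 1].mean() - takeup[encouraged == 0].mean()
# IV estimate = ITT / first stage: the effect for firms induced to take up
iv = itt / first_stage

print(f"Average effect in the population: {firm_effect.mean():.3f}")
print(f"IV estimate (effect for encouraged takers): {iv:.3f}")
# The IV estimate exceeds the population average because firms with larger
# idiosyncratic effects are the ones that respond to the encouragement
```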
Hence an encouragement design could be associated with biased treatment effects, which are potentially over-estimated relative to the average effects for the sample that would have taken up treatment in the absence of encouragement (Barrett and Carter, 2010; McKenzie, 2010).

7. Conclusion

In spite of the challenges, rising demands for results and accountability from donors and clients alike require that aid-for-trade evaluation strategies become more ambitious and rigorous. Implementing agencies should no longer be content with traditional methods based on output monitoring and before-after comparisons. Output monitoring is largely introspective, relying on measures defined by the task managers and therefore liable to biases, while before-after comparisons are vulnerable to confounding influences.

The basic problem faced in the evaluation of a policy, program or project impact is attribution. Are the observed changes in the performance of treated entities really attributable to the intervention under consideration, or do they reflect a fortuitous combination of effects? Impact-evaluation (IE) methods — developed outside of the social sciences but widely adopted in the evaluation of poverty, health and education programs — provide a generally accepted answer to the problem of attribution. Trade interventions have so far escaped the rising tide of evaluation methods. But there is no justification for this trade exceptionalism, as IE techniques are many and sufficiently flexible for use even in the case of interventions that are not targeted at a defined group of treated individuals.

As the authors have experienced in their campaign for greater recourse to IE techniques in trade, the key barriers to progress are not conceptual. Rather, they concern incentive issues, as IEs are costly, burdensome, lengthy, and not necessarily aligned with project managers' incentives. For example, World Bank projects to assist private sector firms in Africa last on average five years, which would imply that, if their IE involved an RCT, many years would need to elapse for the projects to show results (McKenzie, 2011a). These many years would go well beyond a project manager's horizon. In principle, researchers need not wait until completion of the project to evaluate its effects; rather, results one or two years into the project could be assessed and used to guide its implementation in subsequent years. While early feedback from an IE is useful, it should however be treated with caution. First, it may simply be premature, in that the effects of the program may not be adequately manifested. Second, from a methodological perspective, fine-tuning a program at an intermediate stage could jeopardize the possibility of evaluating its effects credibly.

The weakness of current evaluation practice can be illustrated no better than by this critical assessment, found in the Implementation Completion Report of a recent World Bank project in the area of export promotion:

Although the design of the M&E system was appropriate, both Bank and Government project teams had difficulty measuring the achievements of the project using the broad indicators cited in the PAD.46 [B]y current standards, they were insufficient and incomplete. […] M&E, particularly important as a learning objective, was weak. It was slow to start and did not deliver.
The M&E staff […] lacked the capacity and experience to carry out the monitoring activities, and the Unit was unable to carry out baseline and impact surveys of randomly selected farmers in both project and non-project areas, i.e., survey to gauge key interest groups' response to the outputs generated by the pilot activities. The M&E Unit's ability to collaborate with other implementing agencies to collect information and data was also ineffective. Implementing partners did not regard the M&E exercise as a learning process but instead, conducted their promotion activities without consulting or collaborating with the M&E unit.

46 M&E stands for Monitoring and Evaluation while PAD stands for Project Appraisal Document.

As the reviewers noted, the learning function of evaluation tends to be overshadowed by the "monitoring" function for implementing agencies. In order to overcome these hurdles, several avenues must be considered. First, the burden imposed on project managers should be relieved by making impact evaluation a separate exercise carried out by specialists, albeit in collaboration with project managers. Project managers should be involved at the right time — that is, during project design — and, from then on, as much as possible, left in peace. The World Bank has moved in this direction through the creation of the DIME unit, which provides expertise and help with IE financing. At the same time, governments in the countries receiving trade assistance must buy into the process. This means sharing knowledge and building capacities for a proper interpretation of IE results and, over the long run, for governments to build their own IE capabilities as part of public-services delivery improvements. Also, every effort should be made to reduce the cost of IEs. For small-scale activities, the cost of an IE can be as great as that of the activity itself. This is excessive. Local resources — in particular universities and graduate students — should be involved, producing a double benefit: costs are reduced and local capacities are strengthened. Finally, the exploitation of IE results should prioritize learning over monitoring. That is, donors and implementing agencies should tread cautiously in using IE results to frame incentive systems. Care is needed in the interpretation of IE results because premature conclusions could easily provoke a backlash and because a considerable accumulation of evidence is needed to yield truly valuable new knowledge.

References

Abbring, J. and J. Heckman (2007). ―Econometric Evaluation of Social Programs Part III: Distributional Treatment Effects, Dynamic Treatment Effects, Dynamic Discrete Choice, and General Equilibrium Policy Evaluation,‖ in Heckman, J. and E. Leamer (eds.) Handbook of Econometrics, vol. 6B, pp. 5146-5303.

Ashenfelter, O., Harmon, C., and H. Oosterbeek (1999). ―A Review of Estimates of the Schooling/Earnings Relationship,‖ Labour Economics 6, 453-470.

Atkin, D. and A. Khandelwal (2011). ―The Use of Experimental Designs in the Evaluation of Trade Facilitation Programs,‖ in Cadot, O., Fernandes, A., Gourdon, J., and A. Mattoo (eds.) Where to Spend the Next Million? Applying Impact Evaluation to Trade Assistance, pp. 107-122. The World Bank and CEPR.

Balat, J., Brambilla, I., and G. Porto (2009). ―Realizing the Gains from Trade: Export Crops, Marketing Costs, and Poverty,‖ Journal of International Economics 78, 21-31.

Banerjee, A., S. Jacob, M. Kremer, J. Lanjouw, and P. Lanjouw (2005).
Banerjee, A., S. Jacob, M. Kremer, J. Lanjouw, and P. Lanjouw (2005). "Moving to Universal Education: Costs and Trade-offs," MIT mimeo.

Banerjee, A., Amsden, A., Bates, R., Bhagwati, J., and N. Stern (2007). Making Aid Work. MIT Press.

Banerjee, A. and E. Duflo (2008). "The Experimental Approach to Development Economics," NBER Working Paper 14467.

Barrett, C. and M. Carter (2010). "The Power and Pitfalls of Experiments in Development Economics: Some Non-Random Reflections," Applied Economic Perspectives and Policy 32, 515-548.

Blundell, R. and M. Costa Dias (2009). "Alternative Approaches to Evaluation in Empirical Microeconomics," Journal of Human Resources 44, 565-640.

Brenton, P. and E. von Uexkuhl (2009). "Product-Specific Technical Assistance for Exports – Has it Been Effective?," Journal of International Trade and Economic Development 18, 235-254.

Bruhn, M. (2011). "License to Sell: The Effect of Business Registration Reform on Entrepreneurial Activity in Mexico," Review of Economics and Statistics 93, 382-386.

Cali, M. and D. te Velde (2011). "Does Aid for Trade Really Improve Trade Performance?," World Development 39, 725-740.

Caliendo, M. and S. Kopeinig (2005). "Some Practical Guidance for the Implementation of Propensity Score Matching," IZA Discussion Paper 1588.

Campbell, D. (1969). "Reforms as Experiments," American Psychologist 24, 407-429.

Cantens, T., Raballand, G., Bilangna, S., and M. Djeuwo (2011). "Reforming Customs by Measuring Performance: A Cameroon Case Study," in Cadot, O., Fernandes, A., Gourdon, J., and A. Mattoo (eds.) Where to Spend the Next Million? Applying Impact Evaluation to Trade Assistance, pp. 183-206. The World Bank and CEPR.

Cook, T., Shadish, W., and V. Wong (2006). "Within Study Comparisons of Experiments and Non-Experiments: Can They Help Decide on Evaluation Policy?," Northwestern University mimeo.

Datt, M. and D. Yang (2011). "Half-Baked Interventions: Staggered Pre-Shipment Inspections in the Philippines and Colombia," in Cadot, O., Fernandes, A., Gourdon, J., and A. Mattoo (eds.) Where to Spend the Next Million? Applying Impact Evaluation to Trade Assistance, pp. 163-182. The World Bank and CEPR.

Djankov, S., Freund, C., and C. Pham (2010). "Trading on Time," Review of Economics and Statistics 92, 166-173.

Djebbari, H. and J. Smith (2008). "Heterogeneous Program Impacts of PROGRESA," Journal of Econometrics 145, 64-80.

Duflo, E., Kremer, M., and J. Robinson (2006). "Understanding Technology Adoption: Fertilizer in Western Kenya, Preliminary Results from Field Experiments," MIT mimeo.

Duflo, E., Glennerster, R., and M. Kremer (2008). "Using Randomization in Development Economics Research: A Toolkit," in Schultz, T.P. and J. Strauss (eds.) Handbook of Development Economics, vol. 4, pp. 3895-3962.

Fan, Y. and S. Park (2010). "Sharp Bounds on the Distribution of Treatment Effects and Their Statistical Inference," Econometric Theory 26, 931-951.

Ferro, E., Portugal-Perez, A., and J. Wilson (2011). "Aid-for-Trade and Export Performance: The Case of Aid in Services," in Cadot, O., Fernandes, A., Gourdon, J., and A. Mattoo (eds.) Where to Spend the Next Million? Applying Impact Evaluation to Trade Assistance, pp. 207-219. The World Bank and CEPR.

Francois, J. and M. Manchin (2007). "Institutions, Infrastructure, and Trade," World Bank Policy Research Working Paper 4152.

Freund, C. and M. Pierola (2010). "Export Entrepreneurs: Evidence from Peru," World Bank Policy Research Working Paper 5407.

Freund, C. and M. Pierola (2011). "Export Superstars," World Bank mimeo.
Freund, C. and N. Rocha (2011). "What Constrains Africa's Exports," World Bank Economic Review 25, 361-386.

Gamberoni, E. and R. Newfarmer (2009). "Aid for Trade: Matching Potential Demand and Supply," World Bank Policy Research Working Paper 4991.

Gine, X. and I. Love (2011). "Do Reorganization Costs Matter for Efficiency? Evidence from a Bankruptcy Reform in Colombia," Journal of Law and Economics, forthcoming.

Glazerman, S., Levy, D., and D. Myers (2003). Nonexperimental Replications of Social Experiments: A Systematic Review. Princeton, NJ: Mathematica Policy Research, Inc.

Gourdon, J., Marchat, J., Sharma, S., and T. Vishwanath (2011). "Can Matching Grants Promote Exports? Evidence from Tunisia's FAMEX Program," in Cadot, O., Fernandes, A., Gourdon, J., and A. Mattoo (eds.) Where to Spend the Next Million? Applying Impact Evaluation to Trade Assistance, pp. 81-106. The World Bank and CEPR.

Harrison, A. and A. Rodríguez-Clare (2010). "Trade, Foreign Investment, and Industrial Policy," in Rodrik, D. and M. Rosenzweig (eds.) Handbook of Development Economics, vol. 5, pp. 4039-4214.

Heckman, J., Urzua, S., and E. Vytlacil (2006). "Understanding Instrumental Variables in Models with Essential Heterogeneity," Review of Economics and Statistics 88, 389-432.

Helble, M., Mann, C., and J. Wilson (2009). "Aid for Trade Facilitation," World Bank Policy Research Working Paper 5064.

Hoekman, B. and A. Nicita (2008). "Trade Policy, Trade Costs, and Developing Country Trade," World Bank Policy Research Working Paper 4797.

Hausmann, R., Hwang, J., and D. Rodrik (2007). "What You Export Matters," Journal of Economic Growth 12, 1-25.

IEG (2006). Assessing World Bank Support for Trade, 1987-2004: An IEG Evaluation. The World Bank.

Imbens, G. and J. Wooldridge (2009). "Recent Developments in the Econometrics of Program Evaluation," Journal of Economic Literature 47, 5-86.

Jaud, M. and O. Cadot (2011). "A Second Look at the Pesticides Initiative Program: Evidence from Senegal," World Bank Policy Research Working Paper 5635.

Khandker, S., Koolwal, G., and H. Samad (2010). Handbook on Impact Evaluation. Washington, DC: The World Bank.

Klapper, L. and I. Love (2010). "The Impact of Business Environment Reforms on New Firm Registration," World Bank Policy Research Working Paper 5493.

Lalonde, R. (1986). "Evaluating the Econometric Evaluations of Training Programs Using Experimental Data," American Economic Review 76, 602-620.

Lederman, D., M. Olarreaga, and L. Payton (2010). "Export Promotion Agencies Revisited," Journal of Development Economics 91, 257-265.

Lederman, D., Rodríguez-Clare, A., and D. Xu (2011). "Entrepreneurship and the Extensive Margin in Export Growth: A Microeconomic Accounting of Costa Rica's Export Growth during 1997-2007," World Bank Economic Review 25, 543-561.

Lopez-Acevedo, G. and M. Tinajero (2010). "Mexico: Impact Evaluation of SME Programs Using Panel Firm Data," World Bank Policy Research Working Paper 5186.

Manski, C. (1997). "The Mixing Problem in Programme Evaluation," Review of Economic Studies 64, 537-553.

McKenzie, D. (2010). "Impact Assessments in Finance and Private-Sector Development: What Have We Learned and What Should We Learn?," World Bank Research Observer 25, 209-233.

McKenzie, D. (2011a). "How Can We Learn Whether Firm Policies Are Working in Africa? Challenges (and Solutions?) for Experiments and Structural Models," World Bank Policy Research Working Paper 5632.

McKenzie, D. (2011b). "Beyond Baseline and Follow-up: The Case for More T in Experiments," World Bank Policy Research Working Paper 5639.
Miguel, E. and M. Kremer (2004). "Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities," Econometrica 72, 159-217.

Morduch, J. (1998). "Does Microfinance Really Help the Poor? New Evidence from Flagship Programs in Bangladesh," Princeton University, Woodrow Wilson School of Public and International Affairs, Research Program in Development Studies Working Paper 198.

Nelson, D. and S. Silva (2008). "Does Aid Cause Trade? Evidence from an Asymmetric Gravity Model," University of Nottingham Research Paper 2008/21.

Osei, R., O. Morrissey, and T. Lloyd (2004). "The Nature of Aid and Trade Relationships," European Journal of Development Research 16, 354-374.

Portugal-Perez, A. and J. Wilson (2010). "Export Performance and Trade Facilitation Reform: Hard and Soft Infrastructure," World Bank Policy Research Working Paper 5261.

Rajan, R. and A. Subramanian (2008). "Aid and Growth: What Does the Cross-Country Evidence Really Show?," Review of Economics and Statistics 90, 643-665.

Ravallion, M. (2008). "Evaluating Anti-Poverty Programs," in Schultz, T.P. and J. Strauss (eds.) Handbook of Development Economics, vol. 4, pp. 3787-3846.

Ravallion, M. (2009). "Evaluation in the Practice of Development," World Bank Research Observer 24, 29-53.

Ravallion, M. (2011). "On the Implications of Essential Heterogeneity for Estimating Causal Impacts using Social Experiments," World Bank Policy Research Working Paper 5804.

Rodrik, D. (2006). "What's So Special about China's Exports?," China and World Economy 14, 1-19.

Rodrik, D. (2008). "The New Development Economics: We Shall Experiment, but Shall We Learn?," Mimeo, John F. Kennedy School of Government, Harvard University.

Rosenbaum, P. and D. Rubin (1983). "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika 70, 41-55.

Savedoff, W. (2006). The Evaluation Gap: An International Initiative to Build Knowledge. Washington, DC: Center for Global Development.

Sequeira, S. (2011). "Transport Costs and Firm Behavior," in Cadot, O., Fernandes, A., Gourdon, J., and A. Mattoo (eds.) Where to Spend the Next Million? Applying Impact Evaluation to Trade Assistance, pp. 123-162. The World Bank and CEPR.

Tan, H. (2009). "Evaluating SME Support Programs in Chile using Panel Firm Data," World Bank Policy Research Working Paper 5082.

Todd, P. (2008). "Evaluating Social Programs with Endogenous Program Placement and Self-Selection of the Treated," in Schultz, T.P. and J. Strauss (eds.) Handbook of Development Economics, vol. 4, pp. 3848-3894.

Volpe, C. and J. Carballo (2010). "Beyond the Average Effects: The Distributional Impacts of Export Promotion Programs in Developing Countries," Journal of Development Economics 92, 201-214.

Volpe, C. (2011). "Assessing the Impacts of Trade Promotion Interventions: Where Do We Stand?," in Cadot, O., Fernandes, A., Gourdon, J., and A. Mattoo (eds.) Where to Spend the Next Million? Applying Impact Evaluation to Trade Assistance, pp. 39-80. The World Bank and CEPR.

Wagner, D. (2003). "Aid and Trade – An Empirical Study," Journal of the Japanese and International Economies 17, 153-173.

WTO (2009). World Trade Report 2009: Trade Policy Commitments and Contingency Measures. Geneva: World Trade Organization.

World Bank (2009). Unlocking Global Opportunities: The Aid for Trade Program of the World Bank Group. Washington, DC: The World Bank.
World Bank (2011). "Leveraging Trade for Development and Growth: The World Bank Group Trade Strategy, 2011-2021," Washington, DC: The World Bank.