WPS7800 Policy Research Working Paper 7800 Predicting Project Outcomes A Simple Methodology for Predictions Based on Project Ratings Marc Blanc Talib Esmail Caroline Mascarell Rukshan Rodriguez East Asia and the Pacific Region Development Effectiveness Unit August 2016 Policy Research Working Paper 7800 Abstract The downgrading of projects at the closing from moder- 2012 and 2013, this paper derives a prediction model ately satisfactory to moderately unsatisfactory has been based on ratings for implementation progress and achieve- a persistent problem in the World Bank and a particular ment of development objectives during project supervision. problem in the World Bank’s East Asia and Pacific region The model, used in combination with other indicators of since 2012. Through analysis of the projects that exited project progress toward outcomes, appears to improve the East Asia and Pacific region’s portfolio in fiscal years on existing methods for assessing the downgrade risk. This paper is a product of the Development Effectiveness Unit, East Asia and the Pacific Region. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://econ.worldbank.org. The authors may be contacted at tesmail@worldbank.org, cmascarell@worldbank.org, and mblanc@worldbank.org. The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent. Produced by the Research Support Team Predicting Project Outcomes A Simple Methodology for Predictions Based on Project Ratings Marc Blanc Talib Esmail Caroline Mascarell Rukshan Rodriguez Acknowledgments This paper was prepared with the assistance of Brigitte Duces, Carlos Elbirt, Vilija Kostelnickiene, Edgar Molina, and Katia Nemes. William B. Hurlbut (consultant) edited the paper. I. Introduction In its Results and Performance of the World Bank Group 2015, the Independent Evaluation Group (IEG) reported a decline in the outcome ratings of World Bank investment projects that was particularly strong in the portfolio of the East Asia and Pacific (EAP) region. Understandably, this triggered concern within the EAP vice-presidential unit (VPU) about the efficacy of existing systems to provide warnings about the potential that a project will fail to deliver its intended results. In an attempt to discover what went wrong, the VPU undertook an analysis of the projects that contributed to the decline in ratings. That analysis led to the discovery of a new method for predicting when investment projects that, under current monitoring regimes, appear to be headed toward a moderately satisfactory rating for outcome, but may be at risk of a downgrade on completion. This paper provides some background on the decline in ratings and current monitoring systems, details the portfolio analysis, describes the prediction model that evolved from that analysis, and reports on validation of the model through subsequent testing. Reasons for Concern about Outcome Ratings in EAP Outcome Ratings Have Declined Outcome ratings for IBRD and IDA investment projects, based on the three-year average reported by IEG in its annual Results and Performance reports, have been declining across the Bank since 2006.1 The ratings in EAP have followed the broader trend, but on a somewhat steeper trajectory and in the 2012 fiscal year the ratings for the region dropped below the Bank-wide average. A significant uptick in the FY14 outcome ratings for both the Bank and the region2 explains the upward shift in the three-year average for the last year in the series (Figure 1). Figure 1. Outcome Ratings, Bank versus EAP Three-year average for investment project financing for IDA and IBRD 100 90 % Satisfactory 80 70 Bank 60 EAP 50 40 FY05 FY06 FY07 FY08 FY09 FY10 FY11 FY12 FY13 FY14 FY (3-year average) 1 These data reflect IEG results as of December 31, 2015. The three-year average for the percent of satisfactory outcome ratings (i.e. ratings of moderately satisfactory or better) is based on 115 FY12-14 EAP projects rated covering 87 percent of the FY12-14 exits. Out of 42 FY14 exits 35 projects had been rated. 2 When ratings are weighted by net commitment amounts, the percentages of satisfactory ratings are higher, indicating that larger projects appear to have been more successful than smaller ones over the period considered. However, the main concern is with the declining number of successful projects. 1 Regression analyses of project performance at exit (outcome ratings) consistently have shown a high degree of correlation with quality at entry and quality of supervision. While this is to be expected since these ratings are all given at exit by the same evaluator, it is still useful to observe that quality at entry and the quality of supervision ratings, for both the Bank and EAP, have tracked the declining outcome ratings. Net Disconnect and the Candor Gap Have Increased In addition to the decline in satisfactory outcome ratings, the two measures of realism in self-evaluation ratings have increased. The net disconnect is the difference between the percentage of projects IEG rates unsatisfactory for development outcome and the percentage the region’s final Implementation Status and Results (ISR) reports rates unsatisfactory for achieving their development objectives. Although net disconnects have declined from their 18 percent level in FY13 they have remained larger than the Bank average, and have been associated with the region’s declining outcome in years prior to FY14 (Figure 2). While there are substantial variations from year to year and among country management units, there is a strong relationship between the low and decreasing ratings for monitoring and evaluation quality and the substantial disconnect. Projects with unclear and weak results frameworks are more difficult to evaluate, giving rise to the disconnect. Figure 2. FY13 versus FY14 Net Disconnect by Region (IBRD and IDA) 50 40 40 30 20 10 18 18 14 14 12 7 12 11 13 6 4 9 0 AFR EAP ECA LCR MNA SAR Bank -10 -21 -20 -30 FY13 FY14 Note: AFR = Sub-Saharan Africa, EAP = East Asia and Pacific, ECA = Eastern Europe and Central Asia, LCR = Latin America and the Caribbean, MNA = Middle East and Northern Africa, SAR = South Asia The candor gap is the difference between the percentage of projects with satisfactory development outcome ratings in the active portfolio for a fiscal year (including both investment project financing and development policy financing) and IEG ratings of satisfactory outcome based on projects evaluated over the past 18 months as of the date of the data download. The candor gap for EAP was higher than the average for the Bank and all other regions (except the Middle East and Northern Africa region) up to January of fiscal 2015. However, since the gap closely tracks recent IEG project outcome ratings, it has 2 improved in line with the improvements in FY14 outcome ratings for the region, and reached the Bank average of 14 percent as of January 2016 (Figure 3).3 Figure 3. Candor Gap by Region (IBRD and IDA) 50% 39% 40% 30% 27% 26% 22% 21% 20% 19% 20% 14% 14% 9% 10% 5% 6% 5% 0% AFR EAP ECA LCR MNA SAR Bank -10% -7% January FY15 January FY16 Note: AFR = Sub-Saharan Africa, EAP = East Asia and Pacific, ECA = Eastern Europe and Central Asia, LCR = Latin America and the Caribbean, MNA = Middle East and Northern Africa, SAR = South Asia Shortcomings in Existing Performance Monitoring While the candor gap is useful for identifying the scale of the performance problem and assessing the level of over-optimism among task teams, it does not help to identify which projects may need attention during implementation to reduce the likelihood of a less than satisfactory rating upon closing. For that purpose, the Bank uses two other tools: disbursement tracking and ratings for implementation progress (IP) and development outcome (DO) in ISR reports that are prepared about every six months during the life of the project. Both methods have limitations. Disbursement tracking underestimates projects likely to be rated moderately unsatisfactory The Bank’s disbursement monitoring systems track “disbursement ratios” for countries and regions as well as tracking slow-disbursing projects. Currently, a monthly senior management meeting identifies and reviews all IBRD and IDA projects more than 5 years old and with undisbursed balances greater than 60 percent. As of the end of 2015, EAP had only three slow-disbursing projects and only 5 percent of EAP projects were delayed by 24 months or more. This substantially underestimates the percentage projects likely to be rated moderately unsatisfactory or lower at exit, which is currently on the order of 30 percent. Moreover, regression studies carried out by others (see bibliography) have failed to demonstrate a strong negative correlation between the disbursement delays and project outcome. 3 If there was no lack of candor or over optimism in reporting by task team leaders (TTLs), one could argue that the underestimation and overestimation of the project risks would cancel each other out, leading to a non-biased estimate. However, this is not the case as easily demonstrated by the very low amount of negative disconnects when looking at the difference between the IEG ratings and the final ISR rating, i.e., when IEG rates a project moderately satisfactory or higher that had been rated moderately unsatisfactory or lower in the last ISR. 3 ISR ratings for IP and DO also underestimate projects likely to be rated moderately unsatisfactory During routine project supervision projects are rated on their IP and DO. A moderately unsatisfactory rating is given to projects that fall short on performance standards and are therefore likely to fail at meeting their development outcome at exit. These projects are then referred to as “Problem Projects.” Table 1 shows the percentages of projects rated moderately unsatisfactory for IP, DO, or both as of the end of 2015. These percentages are believed to be substantially below the failure rates of these projects at exit. As previous studies have shown, this is a very consistent and stable occurrence across both time and institutional levels, since it is rooted in the over-optimism or lack of candor of task team leaders when they self-report on their project in the ISR report. Table 1. Percentage of Moderately Unsatisfactory or Lower Ratings for DO and IP, all Investment Projects as of December 31, 2015 DO IP DO or IP Bank 15% 20% 21% EAP 15% 21% 22% II. Existing Approaches to Predicting Performance As the preceding discussion shows, the current tracking mechanisms are not very effective at predicting whether a project rated moderately satisfactory for most of its life will be downgraded on completion. But those are not the only ways that the Bank tries to manage its risks. Risk at entry – difficult to ensure update during implementation One method for managing risk identifies and defines the level of risk at entry, that is, during the preparation and approval phase of the project, using relatively well defined criteria and methodologies. This approach has been useful at the preparation stage to identify comprehensively the risk profile of the project, and thus the type of processing and level of resources required. It has proved much less useful after Board approval, probably because it is difficult to get the task team leader to regularly and precisely update this risk rating during project supervision. Risk under implementation – a system that has resulted in a perverse incentive to manage flags Another approach has identified a risk profile of the project under implementation based on flags, or warning indicators. These have included self-reported indicators, such as for DO and IP, as well as system-based indicators, such as disbursement delay calculated by comparing the actual and the initial or revised disbursement schedules. The full set of indicators, currently numbering about ten, are being input in the ISR report and are tracked in the project monitoring system; several of them are also monitored at the corporate level (for example in the Corporate Scorecard, memoranda of understanding, and senior management monthly monitoring meetings). Until about 2007 the flag system (then consisting of 12 flags) was used to identify and track “potential problem projects” that were not rated moderately unsatisfactory or lower for IP or DO during implementation. This led some teams to avoid the appearance of failure by finding ways to avoid rating 3 flags as less than satisfactory. This undercut the ability of the system to detect problem projects and the system was dropped, though 10 of the flags remain in use. 4 Recent studies of risk prediction identified significant predictors of project outcomes – but they still only explain a small percentage of the downgrades Two recent studies attempted to identify factors contributing to the success and failure of investment projects, and to build prediction models for the outcome ratings of those projects. In the first, Denizer, Kaufmann, and Kraay (2011) analyzed correlations between project outcomes and project characteristics, such as size and duration, as well as country variables and project variables corresponding to the 12 original warning indicators used in the implementation monitoring systems. In the second study, Geli, Kraay, and Nobakht (2014) built on the findings of the previous study and designed an outcome prediction model using a combination of country and project variables, and comparing it with the prediction value of the IP and DO ratings as the project matures. Denizer, Kaufmann, and Kraay used data on more than 6,000 World Bank-financed projects undertaken in 130 countries since the 1970s to develop their model. Their major findings were:  Eighty percent of the variation in project outcomes occurs within countries rather than between countries, and is therefore accounted for by project variables rather than country variables.  Country Policy and Institutional Assessment (CPIA) scores account for 40 percent of the between- country variations and therefore for 8 percent of the variation in outcome. It is among the factors most strongly correlated with project outcome.  Project variables used in the regression analysis, comprising both characteristics (such as size, duration, and preparation costs) and early flags, overall explain only 6 percent of the 80 percent variation, or 5 percent in the project outcome.  The quality of the task team leader is strongly correlated to the project outcome, with a contribution about equal to that of the CPIA.  Among the most significant results of individual partial correlation of outcome with project variables were project size, duration, and preparation and supervision costs, which were negatively correlated with outcome; early IP/DO flags, which were negatively and very significantly correlated with the ultimate outcome; and the early monitoring and evaluation (M&E) flag and sometimes the project management flag, which were also significantly and negatively correlated with outcome.  No evidence was found that disbursement delays are significantly correlated with outcomes. Geli, Kraay, and Nobakht, for their prediction model, used all investment operations between 1995 and 2005,4 and included correlates for size, preparation time, elapsed time between approval and effectiveness, initially planned project length, task team leader track record, and CPIA score. Their major findings were:  The main correlates with outcome were task team leader track record and CPIA score; preparation time and effectiveness delay were negatively correlated but not significantly so; project size not significantly so.  Application of the full model with all the variables correctly predicted about 40 percent of the outcomes, but the same result was obtained with only the CPIA and the task team leader track record variables. 4 After defining the model using the sample of 1,561 projects, with closing dates between 1995 and 2005, the model was applied to a second “out of sample” set of 1,168 projects closing from 2005 to 2012. Results between the two sets were consistent. 5  Only in the last quarter of the project’s life (measured by time or by disbursement amount) does the DO rating reach the prediction power of the model using only the CPIA and task team leader quality variables. Until then, it is lower than or equal to 20 percent.  For EAP, the predictive performance of the ISR DO rating was even lower.  Combining the first model with the ISR DO rating during each quarter of the life of the project leads to a predictive power of slightly above 60 percent in the last quarter.  Using the IP rating instead of the DO rating improves the prediction model, and using the two together further enhanced the results.  Applying the model to the EAP portfolio as of July 1, 2014, yielded a predicted rate of unsatisfactory outcome of 17 percent. The two studies clarify the relative importance of country and project variables, including flags and other indicators for predicting project outcome, but they have several limitations. First, since the model significantly underestimates current IEG rates of unsatisfactory outcomes for EAP, now close to 30 percent, it will need to be re-run and calibrated using more recent data. Second, the quality of the task team leader is a difficult, sensitive, and controversial variable to estimate fairly and regularly. Third, the model still only predicts about 40 percent of the unsatisfactory outcomes. III. Foundation for a New Outcome Prediction Tool for Moderately Satisfactory (or Higher) Rated Projects Given the limitations of existing monitoring tools and predictive models, it would be advantageous to find a simpler approach that could yield similar or better results, by using real-time data, such as the IP and DO ratings, even if it meant moving away from a pure prediction model, i.e., a model that would mostly include variables already known at the project start. Hence, EAPDE conducted a thorough analysis of the projects rated less than moderately satisfactory and particularly disconnects in search of a tool that would predict whether a project rated moderately satisfactory would be downgraded on completion. Methodology for analysis Since the review started out of concern about the sharp decline in IEG rates of satisfactory outcomes in the latest set of projects reviewed, it was logical to use the set of projects that exited in FY12 and FY13 for analysis. Since the focus was on identifying factors and patterns contributing to the increase in unsatisfactory outcomes, the review concentrated on projects with unsatisfactory ratings, particularly those with disconnects. The convenience sample included all 62 investment projects that exited in FY12 and FY13 with ICR reviews posted on the IEG site as of November 15, 2014. In accordance with IEG review policy, the sample included all projects financed by IBRD or IDA as well as Global Environment Facility, State and Peace-Building Fund, and recipient-executed trust funding with commitment amounts larger than $5 million. The IEG ICR reviews were first examined to identify patterns in the country and network or sector breakdown. Outcome results broken down by country were compared with corresponding CPIA figures and with candor gaps and recent series of ratings (Annex 1, Table 1). Since IEG uses four dimensions to 6 assess the outcome rating of a project,5 the relative impact of each dimension on the outcome rating was analyzed for the projects rated moderately unsatisfactory or lower. The frequency and impact of restructuring were also reviewed. Next, for each of the projects rated moderately unsatisfactory or lower the incidence of warning indicators or flags collected during their lifetime was noted and recorded. Special attention was given to the IP and DO ratings, but the incidence was also calculated for project management, procurement, M&E, financial management, safeguards, counterpart funding, slow disbursements,6 legal covenants, effectiveness delay,7 problem project, and long-term risk.8 Incidence results were then analyzed and a simple set of rules, derived by inspection, that allowed to most closely match in the sample the percentages of projects that IEG had rated moderately unsatisfactory or lower for their outcomes. The set of rules was then applied to the EAP FY14 active portfolio. This identified 106 projects that were at risk of not meeting their objectives. A first subset of 42 projects under implementation was then reviewed using the IEG ICR review methodology to identify potential weaknesses. Results of analysis The country breakdown of outcome ratings for the 62 projects in the sample is shown in Table 2, along with the corresponding percentage of moderately unsatisfactory or lower in the candor gap set as of the date of the review. Table 2. Country Distribution of Rated Projects Philippines Cambodia Indonesia Mongolia Solomon Thailand Vietnam Lao PDR Guinea Islands Timor- region Papua China Leste Total New #of projects reviewed 12 5 11 6 10 6 4 1 1 4 1 62 % MS+ 67 60 36 17 70 67 50 0 100 0 100 52% % MU- 33 40 64 83 30 33 50 100 0 100 0 48% % MU- in candor gap set 24 25 37 75 25 50 20 100 N/A 100 N/A 38% Low CPIA countries % MU- 33 50 0 100 0 50% Low CPIA countries: CPIA rating as of 11/2014 3.6 2.8 3.4 2.3 3.3 5 The four dimensions are: relevance of objectives, relevance of design, efficacy, and efficiency. 6 A slow disbursement flag is given to projects when the delay is 24 months or higher. The delay is calculated on based on the initial or “revised” disbursement schedule for the project. The definition of revised has varied: it had been meant to be “officially revised,” i.e., sanctioned by a restructuring for many years, but the definition has been relaxed recently to include informal revision as well. 7 The effectiveness flag is given when the elapsed time between Board approval and effectiveness is more than nine months for investment and three months for emergency operations. It is turned off three years after Board approval. 8 This flag is no longer officially tracked but is still available in the system. It is given if the project has been rated moderately unsatisfactory or lower for IP or DO for any 24 months cumulative during the life of the project. The flag is removed when the project has been rated moderately satisfactory or higher for IP and DO for the previous 24 months. This flag is currently used to identify the “non-proactive” projects, which are then subjected to special reviews. 7 The following observations can be made:  The variations by country of the MU- and MS+ percentages are very large because of the small sample reviewed, which for several countries include only one project.  The sample shows a higher percentage of MU- than the candor gap reference set, but the correspondence with the percentages of the candor set is significant. This confirms the relative stability of the “country record” outcomes over a period of a few years.  The percentage of MU- for the five countries that had a CPIA lower than 3 at the time the project was implemented is 50 percent, i.e., about equal to the percentage of the full sample. This is moot for confirming the relationship between the CPIA rating and the outcome rating, but can be expected given the small sample size, and the fact that a couple of countries had already improved their CPIA at the time the projects exited (Cambodia and Papua New Guinea). The network/sector breakdown of the outcome and disconnect ratings for the sample reviewed as compared with a three-year FY11-13 moderately satisfactory or higher average is shown in Table 3. Table 3. Network/Sector Distribution of Rated Projects SDN HDN PREM FPD Total # of projects reviewed 42 16 2 2 62 within 62 FY12-13 % MS+ FY11-13 68 73 69 100 % MS+ 48 69 0 50 52 %MU- 52 31 100 50 48 Of which disconnect % 31 19 0 0 26 Notes: SDN = Social Development Network; HDN = Human Development Network; PREM = Poverty Reduction and Economic Management; FPD = Financial and Private Sector Development  The percentages of MS+ for SDN and HDN are roughly consistent, but they are not for PREM and FPD because of the small sample. The results of the incidence analysis for DO and IP ratings are summarized in Table 4 and Figures 4 and 5. Table 4. Incidence of Outcome Ratings and Disconnects MU- DO ratings MU- IP ratings Last DO DO Ever Last IP IP Ever rated MU- rated MU-? rated rated MU- MU-? 16 Disconnects 26% 0 (0%) 6 (38%) 0 (0%) 10 (63%) 14 MU- outcome & No 13 (93%) 13 (93%) 12 (86%) 14 disconnect 23% (100%) Sub-Total 30 MU- rated 13 (43%) 19 (63%) 12 (40%) 24 (80%) 48% Sub-total 32 MS+ rated 0 (0%) 9 (28%) 0 (0%) 12 (38%) 52% Total 62 sample 100% 13 28 12 36 (21%) (45%) 19% (58%) 8 Figure 4. Percentage of Projects That Ever Had an Unsatisfactory Rating for IP or DO During the Project Life  The last ISR DO/IP rating at closure matches the IEG rating for all projects in the sample except for disconnects. This is clearly consistent with the statistical observation that the predictive performance of the IP/DO rating improves as the project matures.  Twenty-eight percent of the MS+ IEG-rated projects in the sample have ever been rated MU- for DO, while 63 percent of the IEG-rated MU- have been rated MU- at some stage.  For IP MU- percentage ratings the corresponding incidence figures are 38 percent for the MS+ IEG- rated, compared with 80 percent for the MU- IEG-rated, factoring in 63 percent for disconnects.  Among the 16 disconnects, 6 were ever rated MU- for DO (38 percent), compared with only 9 of the 32 MS+ outcome (28 percent).  Out of the 28 projects ever rated MU- for DO, only 9 ended up having an MS+ outcome. In other words, those projects that have ever had one MU- rating for DO have about two of three chances of exiting with an MU- rating.  Similarly, out of the 36 projects ever rated MU- for IP, only 12 ended up having an MS+ outcome.  A significant percentage of unsatisfactory projects with disconnect were rated MU- for their IP or DO at some stage in their project life.  These percentages are significantly higher than those observed for MS+ projects. 9 Figure 5. Percentage of Unsatisfactory Projects That Ever Had an Unsatisfactory Rating: Disconnect versus No Disconnect 120% 100% 14/14 13/14 100% 80% 93% 60% 10/16 63% 40% 6/16 20% 38% 0% DO IP Disconnect No Disconnect Note: One of the 14 projects with unsatisfactory IEG rating had never received an MU- DO rating, but was rated MU- at exit for IP. It was considered a non-disconnect in this analysis. The results of the incidence analysis for six indicators—project management, procurement, M&E, safeguards, counterpart funding, and financial management—are summarized in Figures 6 and 7. Figure 6. Percentage of Projects That Have Ever Had an Unsatisfactory Rating During the Life of the Project 80% 70% 70% 70% 60% 67% 63% 50% 40% 47% 30% 37% 34% 31% 20% 28% 23% 10% 13% 13% 0% Project Procurement Monitoring Safeguards Counterpart Financial management and funding management evaluation Satisfactory IEG outcome rating Unsatisfactory IEG outcome rating The incidence of three of these indicators is significantly different for projects with disconnect than it is for those without, as Figure 7 shows. 10 Figure 7. Percentage of Unsatisfactory Projects That Have Ever Had an Unsatisfactory Rating During the Project Life 100% 93% 90% 86% 79% 80% 70% 60% 56% 50% 50% 50% 40% 30% 20% 10% 0% Project Management Procurement M&E Disconnect No Disconnect  A high percentage of MU-rated projects with no disconnect had an unsatisfactory rating for project management, procurement, and M&E, at some point in the project life.  A significant number of MU- projects with disconnect received an unsatisfactory rating for project management, procurement, and M&E, at some point in the project life.  However, these last percentages are significantly higher than those observed for MS+ projects as shown in Figure 6. The above results are fully consistent with previous regression analyses (Denizer, Kaufmann, and Kraay 2011 and Geli, Kraay, and Nobakht 2014), which have shown a strong correlation between outcomes and the M&E and project management indicators, and to a lesser extent the procurement indicator. Poor M&E, in particular, has been the most consistently and strongly correlated to an unsatisfactory outcome. Even though safeguards, counterpart funding, and financial management do not show as strong an incidence and correlation as the other indicators, they still capture a high percentage of unsatisfactory projects and show a significantly different incidence between MS+ and MU- rated projects. In summary, in addition to the IP and DO ratings, which have been shown to be significantly correlated with the final IEG outcome rating of any project, the six analyzed indicators, which represent critical dimensions of an investment project’s implementation are critical markers and predictors of a project’s success. Under that premise, a trial and error, heuristic search was conducted to identify the combination of these indicators that would provide the best predictor of future successful outcome rating. The Prediction Model and Verification Testing The major finding of the analysis is that projects with a relatively high risk of being rated unsatisfactory at exit include two groups. The first is projects that are, at the time of the review, rated moderately unsatisfactory or lower for IP or DO in their current ISR. The other group includes those projects that are not currently rated moderately unsatisfactory or lower for IP or DO but that have had at least three of six indicators rated moderately unsatisfactory or lower at any time during their life, and not necessarily 11 concurrently. The two groups constitute separate modules in the proposed prediction model. The first can be called the problem project module, the second can be called the flag-based module. Validation of the flag-based module was undertaken first at the level of the sample of 62 reviewed EAP projects, and then at the level of a corresponding Bank-wide sample of 531 projects including all FY12- 13 investment project exits rated by IEG as of February 1, 2016. Applying the flag-based module to the 62 projects in the study sample, and to the 531 in the Bank-wide sample, had an overall ex post prediction rate of 76 percent for the set of EAP projects and 68 percent for the set of Bank-wide projects (Table 5). Table 5. Ex Post Prediction Rates Using the Flag-Based Module At least 3 of the 6 flags at any time EAP Bank EAP (#) Bank (#) Predicting an unsatisfactory rating 70% 59% 21/30 103/174 Predicting a satisfactory rating 81% 72% 26/32 258/357 Overall prediction rate 76% 68% 47/62 361/531 *EAP sample size is 62 **Bank sample size is 531 The likely reason the ex post prediction rates are lower for the Bank-wide sample is the particularly low performance of the EAP sample in the reviewed period, as well as the low candor level of ratings. With an almost 50 percent unsatisfactory outcome and 26 percent disconnect, the flag-based module is able to identify a significant number of those disconnects. These ex post prediction rates are obviously higher than those that could be estimated on an actual “running” portfolio for at least two reasons. First, at project closure the last DO rating normally has its highest prediction rate, which is highest when the “disconnect” is the lowest: i.e., lower rate for the EAP sample due to high disconnect of 27 percent and higher rate for the Bank with an overall disconnect of 20 percent. Second, at closure, any project would have received its highest allocation of flags; therefore, the flag-based module would yield a lower prediction when applied to a running portfolio. How much lower cannot be known, as it would depend on the average age of the portfolio and the rating candor in the region. When applied to the actual EAP portfolio the corresponding FY15 and FY16 percentages identified by the flag-based module were 22 percent (66/297) and percent 18 percent (51/284). Creating a “watch list” for FY15 projects potentially at risk of a moderately unsatisfactory (or lower) outcome rating The model, consisting of the problem project module and the flag-based module, was applied to the EAP FY15 investment project portfolio of 297 projects active as of July 1, 2014, to identify a watch list of projects at risk of not achieving their outcomes. The portfolio consists of projects in three age groups: 63 projects had reached their final year of implementation, and were due to close in FY15; 59 new projects approved during FY14 were less than one year old; and 175 projects were older than one year and not due to close during FY15. The application of the flag-based module and problem project module identified 30 projects in the first group, 2 projects in the second group, and 65 projects in the third group. 12 While the combination of projects with IP or DO rated moderately unsatisfactory or lower and those with three out of six flags may be a reasonable predictor for well-established projects, application of these two rules alone would not select recently approved projects, which are seldom rated moderately unsatisfactory or lower for any indicator. Hence, to complete the prediction model, and to potentially increase the prediction rate, it seems appropriate to include a third prediction module for those young projects that will turn out unsuccessful at exit, most likely as a result of poor quality at entry. Lacking a quality at entry indicator, the most reasonable proxy for such an indicator are lags from approval to signing of greater than three months and from approval to effectiveness of greater than six months. However, the correlation of the outcome rating with those delays is not been well established and appears rather weak, so this third rule or module would not be permanent, meaning that after the projects have completed the first year, the rule should no longer apply to them. The country breakdown of the three watch list sets is in Table 1 of Annex 2. In view of the well-established correlation between the outcome and the CPIA and recent country record, the percentages of project failure implied by the latest candor gap figures (see Table 1 of Annex 2) were used to calculate a predicted MU- rate for each country. This was then compared to the watch list generated by the model. The two lists have a few significant differences (China and Vietnam in particular), but overall there is a good consistency with 106 of projects identified at the regional level compared with 112 corresponding to 38 percent of expected MU- outcome based on the candor gap figure, as of July 1, 2014. The composition of the list, broken down by module components, is summarized in Table 6. Table 6. FY15 Watch List Projects for EAP Age group Three out of 6 Three flags and Only MU- Approval to Total flags but not last IP/DO MU- IP/DO, but not signing/effect. last IP/DO MU- three flags lags FY15 closings 13 13 4 N/A 30 < 1year old 0 0 2 9 11 Others 23 17 25 N/A 65 Total 36 30 31 9 106 % 34% 28% 29% 8% 100% The breakdown of the watch list by module is as follows: 62 percent of the list is generated by the flag- based module, including 28 percent that are also rated MU- for IP and DO in their last ISR. Slightly less, 57 percent, would have been picked up by the problem project module alone. By age group, the projects in their closing years seem overrepresented compared with first-year projects (48 percent versus 19 percent), or even with the remaining group (37 percent), but several of these projects will not close and will be extended into the following year. It also seems sensible to be more cautious for the exit year than for the first-year projects. The problem project, flag-based module and also a module of projects with delays to signing and effectiveness were used to create a watch list of EAP projects based on July 1, 2014 data. A review methodology was designed based on IEG ICR review criteria, i.e., relevance of objectives and design, efficacy, and efficiency. The reviews were prepared in batches and forwarded by the regional vice- president to the task team leaders and their practice managers. Once the projects on the watch list 13 begin to exit, it will be possible to assess the accuracy of the prediction model when applied to an ongoing portfolio. IV. Conclusion The downgrading of projects at the closing from moderately satisfactory to moderately unsatisfactory has been a persistent problem in the World Bank and a particular problem in the EAP region since 2012. Through analysis of the projects that exited the EAP portfolio in FY12 and FY13, this paper has derived a prediction model that appears to improve on existing methods for assessing the downgrade risk. The Bank has recently revised its project monitoring system and has introduced Standard Reports that provide management with information on project performance based on the flags self-reported by task teams in the ISRs. The problem project module is already used in the Standard Reports and, as this research has shown, continues to have value in identifying specific projects for attention in order to try to avert an unsatisfactory outcome. The Standard Reports also recognize that a sizeable number of projects that ultimately end up with outcome ratings of unsatisfactory will largely coast through their implementation phase rated moderately satisfactory. There is a category of reports referred to as “Soft MS Rating” in the Standard Reports that identifies projects whose IP or DO rating is rated moderately satisfactory for more than 12 months. Research does not show any correlation between time rated moderately satisfactory and ultimate outcome. We therefore propose that the definition of the “Soft MS Rating” in the Standard Reports be replaced with the “flag-based module” developed here, which is more robust in predicting unsatisfactory outcomes. This will help to fill the gap in current portfolio monitoring and provide managers with a list of outlier projects that are at risk so that they may get timely attention to avert a moderately unsatisfactory or lower outcome rating. 14 Acronyms and Abbreviations CPIA Country Policy and Institutional Assessment EAP East Asia and Pacific region FPD Financial and private sector development HDN Human Development Network ICR Implementation Completion and Results report IEG Independent Evaluation Group IP Implementation progress ISR Implementation Status Report M&E Monitoring and evaluation MNA Middle East and North Africa region MS Moderately satisfactory MS+ Moderately satisfactory or higher MU Moderately unsatisfactory MU- Moderately unsatisfactory or lower PREM Poverty Reduction and Economic Management (vice-presidency) SDN Social Development Network TTL Task team leader 15 Annex 1 Table 1 Papua Lao Solomon Timor- Total # Rated Cambodia China EAP Indonesia Kiribati* Mongolia New Philippines Samoa* Thailand Tonga Vietnam PDR Islands Leste Region Guinea FY08-FY10 (# IEG Rated) 2 39 1 17 3 4 1 7 1 1 5 1 12 96 FY11-FY13 (# IEG Rated) 8 24 1 22 1 6 6 1 8 1 1 1 4 1 15 101 Grand Total (#) 10 63 2 39 1 9 10 2 15 2 1 2 9 2 27 197 FY08-FY10 (MU- %) 100% 8% 0% 41% 0% 0% 0% 57% 0% 0% 40% 0% 0% 21% FY11-FY13 (MU- %) 25% 17% 0% 32% 100% 33% 33% 0% 75% 0% 0% 100% 100% 100% 27% 34% Grand Total (MU-%) 40% 11% 0% 36% 100% 22% 20% 0% 67% 0% 0% 50% 67% 50% 15% 27% Candor Gap Set #** 4 17 #N/A 19 1 5 4 #N/A 8 #N/A #N/A 1 3 1 12 77 Candor Gap Set: % MU- 50.0 23.5 #N/A 36.8 100.0 20.0 25.0 #N/A 75.0 #N/A #N/A 100.0 100.0 0.0 25.0 38 FY12-13 Sample # 6 12 1 11 4 5 1 6 1 1 4 10 62 FY12-13 Sample % MU- 33 33 0 64 50 40 0 83 0 100 100 30 48 Predicted MU- % 50% 24% 100% 37% 100% 20% 25% 100% 75% 100% 100% 100% 100% 0% 25% 38% # IPF projects in portfolio 9 108 1 34 4 20 13 10 14 8 5 1 4 4 52 298 Sample Size for Watch list† 5 25 1 13 4 4 3 10 11 8 5 1 4 0 13 112 Notes: FY08-FY13 data are from IEG RAP report. * Kiribati and Samoa were not in the FY12-FY13 set. **The calculation of candor gap for this report used the number of projects reviewed in the 18-month period 07/01/2103-12/31/2014. †Watch list sample size was calculated using the candor gap set. 16 Annex 2 Table 1 Breakdown of EAP FY15 Watch List by Country Management Unit Projects Project Remaining closing age < 1 Forecast Country Projects Total in FY15 year Size (Set-3) (Set-1) (Set-2) Cambodia 2 1 1 4 5 China 12 2 20 34 25 EAP 0 1 Indonesia 4 2 10 16 13 Kiribati 2 2 4 Lao 5 5 4 Mongolia 1 1 3 Pacific Islands 1 1 Papua New Guinea 1 3 4 10 Philippines 2 5 7 11 Samoa 2 1 3 8 Solomon Islands 1 1 2 5 Thailand 0 1 Timor-Leste 1 2 3 4 Tonga 1 1 Vietnam 6 4 13 23 13 Watch list Total 30 11 65 106 112* Total Count 63 59 175 297 % 48% 19% 37% 36% * Sum of the forecast is 106 but the target for the region is 112 because we took the percentage for the entire portfolio. 17 Annex 2 Table 2 Breakdown of projects in the Watch list # with MU- for IP/DO or at least 3 key flags rated MU- out of 6** 97 # with lag to effectiveness > 6 months or lag to signing > 3 months* 9 Total 106 MU- IP/DO or ever has a Set Definition MU- for at least 3 or the 6 Add ons Total Flags** 1 Projects closing in FY15 30 30 2 Projects less than one year (as of 07/01/2014) 2 9* 11 3 Remaining projects 65 65 Total 97 9 106 *Lag to approval >6 or lag to approval to signing > 3 months **6 flags: project management, procurement, M&E, counterpart funding, safeguards, financial management 3 Flags and MU- Set Only 3 Flags Only MU- IP/DO Approval Lag Total IP/DO 1 13 13 4 N/A 30 2 N/A N/A 2 9 11 3 23 17 25 N/A 65 Total 36 30 31 9 106 18 Bibliography Cevdet Denizer, Daniel Kaufmann, Aart Kraay, 2011, “Good Countries for Good Projects: Macro and Micro Correlates of World Bank Performance,” Policy Research Working Paper #5646, World Bank, Washington, DC. Independent Evaluation Group (IEG), 2015, Results and Performance of the World Bank Group 2014, World Bank, Washington, DC Jurgen Rene Blum, 2014, “What Factors Predict How Public Sector Projects Perform?: A Review of the World Bank’s Public Sector Management Portfolio,” Policy Research Working Paper #6798, World Bank, Washington, DC. Patricia Geli, Aart Kraay, Hoveida Nobakht, 2014, “Predicting World Bank Project Outcome Ratings,” Policy Research Working Paper #7001, World Bank, Washington, DC. Peter Moll, Patricia Geli, Pablo Saavedra, 2015, “Correlates of Success in World Bank Development Policy Lending.” Policy Research Working Paper #7181, World Bank, Washington, DC. Quality Assurance Group (QAG), 2006, “Review of the Flag Risk System,” First Phase Draft Report, unpublished. 19