Policy Research Working Paper 7393

The Impact of an Accountability Intervention with Diagnostic Feedback: Evidence from Mexico

Rafael de Hoyos
Vicente A. Garcia-Moreno
Harry Anthony Patrinos

Education Global Practice Group
August 2015

Abstract

In 2009, the Mexican state of Colima implemented a low-stakes accountability intervention with diagnostic feedback among 108 public primary schools with the lowest test scores in the national student assessment. A difference-in-difference and a regression discontinuity design are used to identify the effects of the intervention on learning outcomes. The two alternative strategies consistently show that the intervention increased test scores by 0.12 standard deviations only a few months after the program was launched. When students, teachers, and parents in a school know that their scores are low, and this triggers a process of self-evaluation and analysis, the process itself may lead to an improvement in learning outcomes. Information on quality, without punitive measures but within a supportive and collaborative environment, appears to be sufficient to improve learning outcomes.

This paper is a product of the Education Global Practice Group. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://econ.worldbank.org. The authors may be contacted at rdehoyos@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

The Impact of an Accountability Intervention with Diagnostic Feedback: Evidence from Mexico∗

Rafael de Hoyos†, Vicente A. Garcia-Moreno‡, Harry Anthony Patrinos§

1 Introduction

The increasing availability of standardized tests in developing countries opens the door to a wide range of policy interventions to improve student learning outcomes. Under certain conditions, the implementation of standardized tests can in itself be conducive to an increase in learning outcomes. When test scores are made public, they can eliminate heterogeneous perceptions among agents in the system (authorities, school directors, teachers, parents and students) regarding the quality of education services. Given sufficient levels of initial asymmetries in perceptions regarding the quality of education, the additional information brought about by test scores can be enough to generate a new equilibrium yielding a more transparent and accountable system providing higher quality education services. The second quality-enhancing change that can be triggered by the availability of test scores is the development of school-specific improvement plans.
If the standardized tests are universal, then every school in the system could have access to a detailed diagnosis of the main challenges in the subject areas and grades assessed by the tests. If these detailed diagnoses and school improvement plans are followed by what can be loosely defined as pedagogical interventions, or actions to address the problems identified, then the final outcome can be a system delivering higher quality services.

[∗ The authors gratefully acknowledge funding from the World Bank's Research Support Budget and the comments of participants at the World Bank's economics of education seminar. † The World Bank. E-mail: rdehoyos@worldbank.org. ‡ Teachers College, Columbia University. E-mail: vag2120@tc.columbia.edu. § The World Bank. E-mail: hpatrinos@worldbank.org.]

There is evidence, mostly from high-income countries, showing that learning outcomes can improve as a result of an intervention increasing transparency and accountability through the use of test scores. For instance, in the Netherlands, both average grades and the number of diplomas awarded increased after schools received a negative report card. For schools that received the lowest ranking, the one-year effects on final exam grades amounted to 10 to 30 percent of a standard deviation (Koning and van der Wiel, 2013). Evidence shows that schools in the United States respond to accountability pressures with improved test scores (Carnoy and Loeb, 2003; Hanushek and Raymond, 2005). Students in high-accountability states averaged significantly greater gains on 8th grade math tests than did students in states with little or no state measures to formally track student performance. One of the most prominent examples of accountability interventions is the state-level ranking system of No Child Left Behind (NCLB) in the United States. Rockoff and Turner (2010) found that the introduction of student tests and other measures to assign each school a grade was enough to increase student achievement in New York City. Rouse et al. (2013) show that schools facing accountability pressures in Florida changed instructional practices and this, in turn, partly accounted for increases in test scores. Using administrative data for the entire state of North Carolina, Ahn and Vigdor (2014) find evidence of a short-term positive impact on school performance, and of an impact among low-performing students in the medium term.

In most of the documented cases of education accountability interventions in high-income countries, school actors respond to this type of intervention when it is followed by rewards and/or sanctions. However, in low- and middle-income developing countries with a combination of relatively weak institutions, high poverty and inequality levels, and the presence of powerful teacher unions, such high-stakes accountability interventions might be neither feasible nor desirable. Instead, a shared responsibility approach characterized by a supportive and collaborative environment rather than a punitive one might prove to be a more effective accountability intervention. Although low-stakes accountability interventions like the one described here could work through implicit mechanisms, such as stigmatization or reputational damage, they could also have an impact through coordinated collaboration among school actors and pedagogical tools to sustain a shared responsibility intervention.
Identifying the schools as low performing, meeting with them, developing a detailed diagnosis to identify their main challenges, and offering them advice to design a school-specific improvement plan could be enough to improve service delivery.

The evidence on the effects of low-stakes accountability interventions within a supportive and collaborative environment in developing countries is mixed. The experimental evidence for India presented in Muralidharan and Sundararaman (2010) shows that a program that provided low-stakes diagnostic tests and feedback to teachers had no effect on student learning outcomes. In a randomized study in Punjab, Pakistan, Andrabi et al. (2014) show that providing test scores to households and schools leads to increases in subsequent test scores of 0.11 standard deviations after one year of the intervention. Test score gains in public schools were in response to a low-stakes threat since there were no formal consequences attached to results. Public schools in Punjab face little competitive or regulatory pressure to perform, yet the information has a significant impact because there are non-monetary mechanisms, such as social and community pressures on public school teachers, that induce performance improvements if a school is revealed to have low test scores. In Latin America, the results presented in Mizala and Urquiola (2013) show that distributing information regarding schools' value added in Chile had no effects on enrollment, tuition levels or the socioeconomic composition of students, suggesting a limited effect of a low-stakes accountability intervention.

This paper evaluates the impact on learning outcomes of a short-lived program suitable for measuring the effect of a low-stakes accountability intervention within a supportive and collaborative environment and, albeit in a more limited way, the effect of the pedagogical interventions. The program, PAE, short for Programa de Atención Específica para la Mejora del Logro Educativo, was implemented in the Mexican state of Colima between January 2010 and late 2011, with the objective of increasing learning outcomes among the worst performing public primary schools in the state. Although originally designed as a comprehensive schooling intervention with various components, PAE was cut short administratively and in the end only a subset of the components was implemented. Schools were informed of their test scores and that they were to be part of the program; once the program was launched and the list of PAE schools was publicly disseminated, participating schools were assigned a technical adviser who would visit the school and help diagnose the test score results and design a school improvement plan.

The present study follows two alternative strategies to identify the effects of PAE on learning outcomes. The first follows a difference-in-difference approach comparing the evolution in test scores between PAE and non-PAE schools over time. The second strategy exploits PAE's rigid eligibility rule (an exogenously determined cut-off point of the national standardized test) dividing schools into treatment and control groups and compares them through a regression discontinuity design estimating the accountability effect of PAE. Both strategies consistently show that the intervention increased test scores by 0.12 standard deviations only a few months after program launch.
Although this effect remains one and, to a lesser extent, two years after program implementation, our results show no additional impact attributable to the implementation of the pedagogical interventions. The intuition behind these results is that public recognition of low performing schools, together with a detailed diagnosis of the school's main challenges and an invitation to network with other school directors, teachers and advisers to develop improvement strategies, is enough to improve the quality of education services. In fact, no additional inputs to the schools were actually funded or provided within the first few months of the program. In this way, the current study examines how standardized tests were used to identify poor performing schools and how merely diagnosing school problems and developing a school improvement plan can lead to higher student performance. One way that the accountability effect results in quality improvement is that it creates a fact-based dialogue among educational authorities, school principals and teachers. In this way, identification of the schools' main problems within a supportive and collaborative environment is not only the first step toward improvement; it is arguably, in itself, sufficient for improvement.

The paper is organized as follows: Section 2 provides background information on the Mexican state of Colima, describes the PAE, and charts some trends in learning outcomes. Section 3 details the methodology and identification strategies while Section 4 discusses the main results. Finally, a concluding Section 5 enumerates some policy recommendations.

2 Background, PAE Description and Recent Trends

Colima is a small state in the center-west region of Mexico, with 650,000 inhabitants, 34 percent of whom live in households with incomes below the official 2012 poverty line. [Footnote 1: The poverty headcount ratio in Mexico was 45.4 percent in 2012.] Since the decentralization of the education system that began in 1992, Colima has built an efficient school system, adjusting the national educational programs to the state's specific characteristics and needs. Throughout the 1990s, Colima undertook innovative education policies such as the implementation and dissemination of one of the country's first standardized tests. By 2003, Colima outperformed all Mexican states in the OECD's PISA test and actually approached the OECD average. For example, Colima's math scores were on par with those for Greece and Serbia, and higher than those from Thailand, Brazil, Uruguay and Turkey.

At the end of school year 2005-2006, for the first time, the Federal Ministry of Education (SEP) applied a universal standardized test (ENLACE). ENLACE, a low-stakes test, gathered information on student performance in math, Spanish and a rotating subject for third, fourth, fifth and sixth graders in private and public primary schools in Mexico. By design, ENLACE has a national mean score of 500 and a standard deviation of 100 for every subject area and grade. In early October 2009, the results from ENLACE 2008-2009 were published, and Colima performed below the national average. All the schools in Colima had access to this information, but there is no evidence of rankings or public dissemination of the results at this stage. A few weeks after the release of the disappointing ENLACE results, Colima's Ministry of Education (CMoE) began the design of PAE.
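Because ENLACE is scaled to a national mean of 500 and a standard deviation of 100, effects reported in ENLACE points translate directly into standard-deviation units, the convention used throughout this paper. A minimal illustration of this conversion follows; it is a stylized helper for exposition only, not the operational scaling procedure of the test.

    def points_to_sigma(points: float) -> float:
        """Convert an effect expressed in ENLACE points into standard deviations,
        using the test's design scale (national mean 500, standard deviation 100)."""
        return points / 100.0

    # Example: a 12-point gain corresponds to 0.12 sigma.
    print(points_to_sigma(12.0))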
2.1 PAE Description

PAE was a program designed to improve learning outcomes among the lowest performing public primary schools in Colima, which also provide education services to students from the most marginalized backgrounds (Mizala et al., 2007). The program's operational rules excluded multi-grade schools with one or two teachers and community schools managed by the Consejo Nacional de Fomento Educativo (CONAFE). [Footnote 2: For details on CONAFE see www.conafe.gob.mx.] Of the 477 primary schools in Colima in academic year 2008-2009, 40 were private, 39 were managed by CONAFE, 78 were one- and two-teacher schools, and 10, for a variety of reasons, did not have an ENLACE score. The group of PAE-eligible schools consisted of 310 public primary schools (see Figure 1) with a total student population of 62,366 (95.2% of the total number of students in public primary schools in Colima during the 2008-2009 school year).

Between October and November 2009, the CMoE used the 2008-2009 ENLACE score data to construct its ranking of schools. School scores were a simple average of the three subject areas tested, math, Spanish and science, across grades 3, 4, 5 and 6. Schools in the 35th percentile or less of the distribution of the school average test scores were automatically designated as PAE schools (see Figure AI in the Annex). As shown in Figure 1, PAE included 108 of the 310 schools that belonged to the potentially eligible population. [Footnote 3: Two schools were dropped from the sample due to a mistake in their original classification as non-multi-grade schools, which later on was changed to multi-grade.] PAE schools were distributed across all ten municipalities of Colima and encompassed 1,091 teachers and 10,550 students in 2009.

Figure 2 illustrates the timeline followed by the design and implementation of Colima's PAE. Between November and December 2009, the CMoE assigned the schools that were going to participate in PAE following the criteria described above. In January 2010, the selected schools were officially notified. In February 2010, at a teachers' congress in Colima, the Governor launched the program and publicly disseminated the list of PAE schools. Although the assigned schools were presented as those with the lowest learning outcomes in the state, the Governor emphasized the co-responsibility of schools and state education authorities for low performance and the importance of working together to make improvements. Between the public announcement of the program and the first follow-up ENLACE test in May 2010, which the CMoE called Phase 1 or the "awareness" period, PAE schools were assigned a technical adviser, part of the CMoE, who would visit the school three times a month to work with school directors and teachers on the diagnosis of the ENLACE test and the design of improvement strategies. In addition, the PAE technical adviser coached teachers on analyzing the ENLACE information to have a clearer understanding of how schools were assigned to PAE and the causes of poor performance within their schools.

Between January and March 2010, authorities at Colima's Ministry of Education, together with school directors and selected teachers, developed a simple methodology to construct a detailed diagnosis identifying the academic weaknesses of their students based on ENLACE results. [Footnote 4: The methodology relied on public information generated by the Federal Ministry of Education (SEP).] The diagnoses were tailored to each school: the ENLACE test questions that the largest share of students answered incorrectly were collected by subject area, grade and classroom.
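The mechanics of such a diagnosis are straightforward to sketch. The following is a minimal illustration, not the CMoE's actual procedure, assuming a hypothetical student-by-item data frame with columns subject, grade, classroom, item_id and correct:

    import pandas as pd

    def item_diagnosis(responses: pd.DataFrame) -> pd.DataFrame:
        """Share of students answering each ENLACE item incorrectly, by subject,
        grade and classroom; items with the highest error shares flag the
        curriculum areas to prioritize in a school improvement plan."""
        out = (
            responses.groupby(["subject", "grade", "classroom", "item_id"])["correct"]
            .mean()
            .rename("share_correct")
            .reset_index()
        )
        out["share_incorrect"] = 1.0 - out["share_correct"]
        return out.sort_values("share_incorrect", ascending=False)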
Using personal identification numbers and a password, all teachers in Mexico had online access to a rich data set organizing the proportion of students in their class who answered an ENLACE question incorrectly. The website also indicated the area of knowledge and the relevant curriculum area for each ENLACE question, thereby providing teachers with concrete pedagogical direction to guide their efforts (see Figure 3, which shows an example of the type of information provided by ENLACE).

Between February and May 2010, PAE's technical advisers, school directors and teachers worked on a school improvement plan to address the problems identified during the diagnosis. The school improvement plan had to include clear medium- and long-term goals in terms of learning outcomes and a plausible strategy, involving teachers and parents, to reach them. The CMoE's technical advisers would visit PAE schools three times a month to follow up on the implementation of the school improvement plans. However, the "awareness" period of the program was too short to change any of the fundamental inputs of the learning production function; the estimated effect over this period therefore captures the accountability and diagnosis effect of the program (see Figure 2).

The second phase, or the implementation stage, of PAE started in September 2010. It consisted of pedagogical interventions and the monitoring of progress. With diagnoses and school improvement plans in hand, state authorities, school directors and teachers collaboratively implemented the school-specific improvement strategies which, broadly speaking, included one or more of the following four interventions:

1. Strengthening school-based management. This intervention draws on the experiences gained from the ongoing national school-based management programs PEC (Programa Escuelas de Calidad) and AGE (Apoyo a la Gestión Escolar). It should be noted that all schools in Colima at one time or another participated in PEC or a PEC-like program.

2. Redefining the role of school supervisors and providing them with training. In 2010, school supervisors in Mexico were not appointed through a competitive process and did not have to undertake any training before taking on their duty. Hence, there was a high degree of variation in the quality of school supervision.

3. Specifying the role of school directors and providing them with training. Similar to school supervisors, many of the school directors lacked the necessary skills to manage a school. Directors had difficulty identifying the strengths and weaknesses of their school, and in turn were limited in their capacity to design and monitor school improvement plans. For example, school directors seldom set measurable and reachable goals in crucial indicators to monitor progress.

4. Reinforcing teachers' knowledge in the identified academic areas posing challenges. The program provided teachers with training and special courses to strengthen their knowledge in subject areas identified as challenging during the diagnosis step of the program.

Due to reasons unrelated to the program's performance, PAE was canceled in December 2011, one year after the pedagogical interventions were put in place. A difference-in-difference and a regression discontinuity approach are used to answer the following research questions:

1. Does PAE increase learning outcomes among participating schools?
2. If there is a positive effect of PAE, is it explained by the accountability-plus-diagnosis intervention, by the pedagogical interventions, or by both?

3. Did PAE have a differentiated effect across boys and girls or among students with relatively disadvantaged initial conditions?

2.2 Dataset and Recent Trends

This study uses and merges student learning outcomes as measured by ENLACE with administrative school census data collected by federal and state education authorities (known as the Formato 911). Since 1998, this school census has been collected at the beginning and end of each school year, and lists, among other entries, the number of teachers, students, classrooms and computers, the average years of education of teachers, and the geographic location of each school. With a unique school identifier (Clave de Centro de Trabajo, CCT), it is possible to merge this school census data with the results from ENLACE into a single database. In addition to learning outcomes, ENLACE includes socioeconomic information for each school based on its geographic location. [Footnote 5: The National Population Council (Consejo Nacional de Población, CONAPO) ranks all localities (an administrative and/or geographic entity often more disaggregated than a municipality) in Mexico according to a marginality index, a weighted average of literacy, access to basic public utilities, household infrastructure and average wages. Rankings range across very high marginalization, high marginalization, medium marginalization, low marginalization, and very low marginalization. For methodological details regarding Mexico's marginality index, see www.conapo.gob.mx.]

Figure 4 shows mean math scores in PAE and non-PAE schools from 2006 to 2013. In general, schools in Colima improved by 42 points of ENLACE, or 0.42 standard deviations, throughout this period. As expected, the 108 PAE schools had lower learning outcomes relative to the non-PAE schools. In 2009, the baseline year, the difference between PAE and non-PAE schools was 56 points; the gap was reduced to 42 points the year after, perhaps partly explained by mean-reversion effects, but also explained by the effects of PAE (see Chay et al., 2005).

3 Methodology

We follow two alternative methodologies to identify the effects of PAE on student-level learning outcomes: a difference-in-differences (DD) approach exploiting differentiated performance on learning outcomes between PAE and non-PAE schools, and a regression discontinuity (RD) approach exploiting the exogenously defined threshold or cutoff point dividing PAE and non-PAE schools. The DD identifies the effects of the program assuming that learning outcomes in PAE and non-PAE schools would have followed the same trend in the absence of the program, which could be a restrictive assumption. RD overcomes this limitation by exploiting differences in performance between PAE and non-PAE schools in the neighborhood of the exogenously determined cutoff point defining PAE schools. However, the limitation of RD is that the number of clusters (in our case schools) around the cutoff needed to identify a given effect is relatively large, hence reducing the precision of the estimates. Therefore, obtaining similar effects under the two alternative strategies provides more robust evidence of the true impact of PAE on learning outcomes.

Formally, let us define Y_{i,s,t} as the test score of the i-th student in school s in year t and PAE_s as a dummy variable taking the value of one if the school is part of the program, and zero otherwise.
The DD is estimated via the following regression using OLS:

Y_{i,s,t} = α_{0,t} + α_1 PAE_s + A_t + Σ_{t=t*}^{T} λ_t (PAE_s × A_t) + Σ_{k=1}^{K} δ_{k,t} X^k_{s,t} + ε_{i,s,t}    (1)

where A_t are year fixed effects, X^k_{s,t} is a set of school-level controls and ε_{i,s,t} is a random component. The parameters of interest capturing the impact of PAE are the ones measuring the learning outcome effects of the interaction between PAE and the year fixed effects, λ_t. If these parameters are statistically significant, they would indicate that PAE schools performed differently from non-PAE schools, controlling for everything else, suggesting that the program had an effect on learning outcomes.

There could be several reasons why an identification strategy such as the one described by equation (1) would yield biased estimators. First, the two groups, PAE and non-PAE schools, are not equal ex ante. Therefore, factors unrelated to the program can impact both groups differently, causing a different post-PAE performance among treated and untreated schools and hence biasing the DD results. Second, the selection of schools on the basis of a one-year school performance ranking may misclassify schools due to a one-time performance aberration (one-time shocks or mean-reverting noise). As discussed by Chay et al. (2005), this would produce biased estimators of the program's impact since PAE (low-performing) schools would tend to automatically revert to the overall mean.

According to Chay et al. (2005), a regression discontinuity design can address these two identification problems. The logic behind the RD approach is simple: the objective is to identify a group of schools that are part of PAE and similar enough to a group of schools that are not part of the program. A good place to identify such comparison groups is around the cutoff point distinguishing PAE from non-PAE schools, as the threshold mimics a randomized selection to receive or not to receive treatment (Imbens and Lemieux, 2007; Imbens and Wooldridge, 2008). Formally, let us define Y_{i,s,t} as a function of PAE_s, the average result of school s at the baseline, Y_{s,2009}, the interaction between the former and the latter, a series of school-level controls X^k_{s,t}, and a random component ε_{i,s,t}:

Y_{i,s,t} = β_{0,t} + β_{1,t} PAE_s + β_{2,t} Y_{s,2009} + β_{3,t} (PAE_s × Y_{s,2009}) + Σ_{k=4}^{K} β_{k,t} X^k_{s,t} + ε_{i,s,t}    (2)

Notice that the dummy variable identifying schools belonging to PAE and their eligibility variable, the ENLACE average results for 2009, are constant over time. By assumption, ε_{i,s,t} should be independently and identically distributed (iid) with a mean of zero and known variance. Equation (2) can be modified to include higher order terms of the forcing variable, Y_{s,2009}, to control for non-linearities in the relationship between the eligibility criteria and subsequent learning outcomes. [Footnote 6: Higher order polynomials are typically used when estimating RD using all the available information, as opposed to restricting it to those observations within the optimal bandwidth (see Imbens and Lemieux, 2007).] For a group of schools sufficiently close to the PAE-eligibility cutoff, such that samples are balanced both in observables and unobservables, the effects of PAE will be captured by β̂_{1,t} in Equation (2).
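As an illustration of how equations (1) and (2) can be taken to the data, the sketch below uses Python's statsmodels with standard errors clustered at the school level. The data frame and column names (score, pae, year, avg_2009, school_id, and a few controls) are hypothetical placeholders, not the authors' code.

    import statsmodels.formula.api as smf

    def fit_dd(df, controls=("stu_teacher", "incentive", "teacher_ba")):
        """Equation (1): difference-in-differences with year effects, PAE-by-year
        interactions and standard errors clustered by school."""
        rhs = "pae * C(year) + " + " + ".join(controls)
        return smf.ols("score ~ " + rhs, data=df).fit(
            cov_type="cluster", cov_kwds={"groups": df["school_id"]}
        )

    def fit_rd(df, year, cutoff, bandwidth, controls=("stu_teacher",)):
        """Equation (2): regression discontinuity for one follow-up year, restricted
        to schools within the chosen bandwidth of the eligibility cutoff."""
        sub = df[(df["year"] == year)
                 & ((df["avg_2009"] - cutoff).abs() <= bandwidth)].copy()
        sub["fv"] = sub["avg_2009"] - cutoff  # centered forcing variable
        rhs = "pae * fv + " + " + ".join(controls)
        return smf.ols("score ~ " + rhs, data=sub).fit(
            cov_type="cluster", cov_kwds={"groups": sub["school_id"]}
        )

In this notation, the coefficient on the pae-by-year interaction for 2010 plays the role of λ̂_{2010} in equation (1), and the coefficient on pae plays the role of β̂_{1,t} in equation (2).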
An important limitation of the regression discontinuity design, however, is that the results are valid only for observations around the cutoff point; the estimated impact is limited to a local average treatment effect which cannot be generalized to cover the entire population, thereby undermining the external validity of the estimation. A second limitation, more relevant for the current study, is that RD relies on having a large number of observations around the cutoff. As shown by Dragoset and Deke (2012) and Schochet (2005), for a constant statistical power and minimum detectable effect, the number of clusters (schools) needed under RD is substantially larger than that needed under a randomized controlled trial, hence limiting the statistical power in the evaluation of PAE.

Under both approaches, DD and RD, the unit of intervention is the school but the unit of analysis is the student (that is, schools, not students, are assigned to PAE). Therefore, the unobservables are composed of two terms: ε_{i,s,t} = η_s + ν_{i,s,t}. In other words, the unobservables are composed of a school-specific component (η_s) and an individual-, school- and time-specific term (ν_{i,s,t}). This structure of the error term implies clustering of students within schools, allowing for intra-school correlation across students. Since PAE started in January 2010 and the first follow-up ENLACE test was a few months later, in May 2010, the DD estimator λ̂_{2010} and the RD estimator β̂_{1,2010} capture the accountability effect of the program, while those for the following years capture the effects of the pedagogical interventions.

3.1 Determining the bandwidth of comparable schools

The optimal number of schools around the cutoff by which to evaluate the impact of the PAE program is determined by a trade-off between precision and internal validity. That is, a narrow bandwidth would select schools very close to the cutoff, hence more similar in observables and unobservables, but the statistical power might be compromised given the small number of observations. On the other hand, a wider bandwidth would increase the number of observations in the treatment and control groups but might not yield balanced samples in observables (and unobservables). We follow the method developed by Imbens and Kalyanaraman (2012) to determine the optimal bandwidth, which yields 20.2 ENLACE points below and above the cutoff, or 0.202 standard deviations around the threshold dividing PAE from non-PAE schools. This optimal bandwidth will be complemented with two alternative but rather arbitrary bandwidths: half of the optimal bandwidth (±10.1 points of ENLACE) and double the optimal bandwidth (±40.4 points of ENLACE). [Footnote 7: The optimal bandwidth was computed using the regression discontinuity Stata program rd developed in Nichols (2014).] Table 1 shows the number of schools above (non-PAE) and below (PAE) the cutoff, as well as the number of students, using each of the three different bandwidths.

Figure 5 shows the intra-cluster correlation for those schools around the cutoff: in other words, the proportion of total variation in student-level math learning outcomes that is accounted for by differences between schools.
This coefficient is practically zero until 10 ENLACE points around the threshold, and less than 0.02 at 20 ENLACE points (the optimal bandwidth) above and below the cutoff. The relatively low value of the intra-cluster correlation among schools within the optimal bandwidth presented in Figure 5 suggests that a valid comparison between schools within 0.2 standard deviations of the cutoff can be conducted. In a ranking where first place indicates the lowest scoring school, schools ranging from 41 to 108 are defined as the PAE treatment group while the non-PAE control group would consist of schools 109 to 171. A second requirement for the RD approach to be valid is that the density of the forcing variable must be continuous around the cutoff, and this is what is shown by Figure AI in the Annex.

Granted sufficient observations around the cutoff, the RD approach may mimic a randomized experiment if the treatment and control groups are equal in expectation on all observed and unobserved dimensions. Table 2 shows school inputs in the school year 2008-2009 (the baseline) for schools within the optimal bandwidth. Most school inputs are statistically equal across treatment and control. Differences in school size, number of teachers, the proportion of teachers that are part of a monetary incentives program (Carrera Magisterial), the proportion of teachers with a university degree or higher, as well as the dropout and failure rates between PAE and non-PAE schools within the optimal bandwidth are statistically insignificant. One of the six variables compared at the baseline is statistically different between PAE and non-PAE schools, indicating that PAE schools tend to be poorer than non-PAE schools, even when comparing only those located within ±20.2 points of ENLACE around the cutoff.

4 Results

This section shows the results of the DD and RD approaches described above, focusing on the effects of PAE on math test scores. The effects on Spanish are listed in the Annex as a comparison. The results include estimations of DD and RD with and without school-level controls, using all available data and restricting the estimation to those schools within the optimal bandwidth. The DD is also estimated using only schools around the cutoff to make the estimates more comparable to the RD estimates and to make the results robust to differences in initial conditions. In all estimations, standard errors are clustered at the school level.

4.1 DD Approach

Table 3 presents the DD effects of the PAE program on math learning outcomes using different specifications. The first of these specifications (column 1 in Table 3) estimates the DD effects without controls and using all available data for the 311 PAE-eligible public primary schools in Colima. The results show an effect of almost 0.13 standard deviations (σ hereafter) in 2010, a few months after PAE was launched. According to this simple specification, the DD estimator increased over time, although the estimated effects in 2011 and 2012 are not statistically different from that in 2010, as suggested by the learning outcome trends shown in Figure 4. Notice that specification (1), with standard errors clustered at the school level, yields very similar effects to a specification with school fixed effects; that is, a dummy variable for each of the 311 eligible public primary schools (clusters) in Colima.
Specification (2) includes the following school-level controls: the student-teacher ratio, the proportion of teachers enrolled in the monetary incentives program Carrera Magisterial, the proportion of teachers with a university degree or a post-graduate diploma, and the level of marginalization of the locality where the school is located. Including these controls, which are highly significant, reduces the estimated effect of PAE to close to 0.10 σ in 2010, still significant at the 99 percent confidence level.

Specifications (3), (4) and (5) are the same as specification (2) but restrict the observations to schools within the optimal bandwidth (OB), half of the OB, and twice the OB. Not surprisingly, under the OB-restricted sample, the controls lose significance, particularly the proxies for socio-economic status (marginalization levels). However, the effects of PAE remain statistically significant with a point estimate close to 0.11 σ. Notice that under specification (3), the DD estimators across years 2010, 2011 and 2012 are very similar, suggesting that, between 2009 and 2010, PAE schools managed to improve 0.11 σ faster vis-a-vis the changes in comparable non-PAE schools as a result of the intervention, capturing the accountability effects of the program and its subsequent diagnosis activities during the "awareness" period. Restricting the schools included in the analysis to half of the OB does not change the DD estimator; however, with only 67 clusters, the estimates lose precision. DD estimations using schools within double the size of the OB (222 clusters) yield very similar results to those with the OB. Albeit marginally smaller, the effects of PAE on math scores are corroborated by the program's effects on Spanish (see Table AI in the Annex). The effects on Spanish scores in 2010 range from close to 0.12 σ when estimating the DD without controls and using all available information, to 0.09 σ when the controls are included and 0.08 σ when restricting the estimation to schools within the OB, always significant at the 95 percent confidence level.

Bertrand et al. (2004) show that, by ignoring the fact that DD focuses on serially correlated outcomes, its estimation can severely underestimate standard errors. They show that a simple correction consisting of collapsing the time series information into a "pre"- and "post"-intervention period would be enough to account for this time series inconsistency in standard errors. Tables AII and AIII in the Annex show that when the DD is performed using a sample with data collapsed into two periods, before and after PAE, the results are practically unchanged. The effects on math (Spanish) test scores range from 0.18 σ (0.14 σ) with the full sample and no controls to 0.14 σ (0.09 σ) when the controls are added and 0.13 σ (0.08 σ) when the sample is restricted to schools within the OB, in all cases significant at the 95 percent confidence level.

As mentioned in Section 3, one of the most important limitations of the DD approach as a strategy to evaluate the impact of PAE is that program effects can be confounded with mean-reversion noise caused by a treatment selection based on the ranking of schools at the baseline. Although the RD is a robust strategy to address this concern, we also perform a simulation, sketched below, showing that mean-reversion is not driving the results presented so far.
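A minimal sketch of this placebo exercise, assuming a school-level data frame with hypothetical columns avg_2008 (the school's average 2008 ENLACE score), math_2008 and math_2009, and reusing the 35th-percentile rule of Section 2.1; the logic and results are spelled out in the next paragraph.

    def placebo_mean_reversion(schools, pct=0.35):
        """Re-run the PAE assignment rule on the 2008 ranking and compare 2008-2009
        score changes across the placebo groups; under pure mean reversion the
        placebo 'PAE' schools would rebound faster even without any program."""
        df = schools.copy()
        cutoff_2008 = df["avg_2008"].quantile(pct)
        df["placebo_pae"] = df["avg_2008"] <= cutoff_2008
        change = df["math_2009"] - df["math_2008"]
        return change.groupby(df["placebo_pae"]).mean()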
If mean reversion is at work, then reproducing PAE's eligibility criteria but using the ENLACE results of 2008 (as opposed to 2009) to rank and select schools should result in the placebo PAE schools advancing faster than non-PAE schools between 2008 and 2009. Figure AII in the Annex shows the average evolution of learning outcomes between PAE and non-PAE schools using the alternative selection criteria based on the school ranking of 2008 (top figure) and compares it with the evolution under the actual selection of PAE and non-PAE schools. The simulated and actual trends in average test scores in Figure AII are based on schools within the OB. Using the ranking of 2008 to assign schools into PAE and non-PAE produces no differentiated evolution in test scores after the assignment, suggesting that the differences in trends between PAE and non-PAE schools observed after 2009 can be attributed to the program, with little or no mean-reversion effects.

4.2 RD Approach

The first strategy followed within the RD approach is a graphical representation of the discontinuity using local linear or kernel regressions on both sides of the cutoff. Figure 6 illustrates the relationship between average math performance at the school level in 2010 (vertical axis) and the forcing variable, the simple average ENLACE result at the school level in 2009 relative to the cutoff (horizontal axis). The PAE schools are to the left of the cutoff, which determined their eligibility for the program, and the non-PAE schools are to the right. A mild discontinuity appears at the cutoff (469 points of ENLACE): there is a general pattern showing that schools slightly below the cutoff (the PAE schools) display higher test scores in 2010 than schools slightly above the cutoff (the non-PAE schools), although their scores in 2009 were very similar. Zooming into the discontinuity reveals that the difference in the regression at the cutoff is around 10 ENLACE points, very close to the DD estimates. This graphic illustration suggests a positive effect on learning outcomes in 2010 brought about by the program and is consistent with what was shown by the DD approach.

Analyzing the impact of the program after one year, when the schools may have implemented some of the pedagogical interventions, provides additional information on the impact of the PAE. The graphical representation of the RD in 2011 shows no discontinuity at the cutoff. Schools on the left of the threshold have an average achievement in 2011 similar to those schools on the right of the cutoff (see Figure 7). Similar results showing no discontinuity are found in 2012, two years after the implementation of the program.

The parametric estimation of RD, equation (2), includes the same specifications as the DD approach. In other words, the results presented in Table 4 include a specification using all available information (more than 38,000 students in 311 schools at baseline) and including school-level controls, while specifications 2, 3 and 4 restrict the observations to those schools within different bandwidths. All standard errors are clustered at the school level. According to the RD results of a specification using all available schools, PAE had an effect of 0.16 σ on math learning outcomes in 2010 (with a p-value of 0.075), a few months after the implementation of PAE. These results are similar in magnitude to the ones obtained by the DD approach. However, the effect tends to disappear as we narrow the bandwidth of schools included in the regression.
At double the OB (± 0.4 σ around the cutoff) with 220 schools included, the effect of the program is 0.12 σ, significant at the 10 percent level; at the OB (± 0.2 σ around the cutoff) with 127 schools included, the effect is no longer significant; and at half of the OB (± 0.1 σ around the cutoff) with only 67 schools included, the effect basically disappears. The parametric estimates of RD for 2011 and 2012 are shown in Tables AIV and AV in the Annex. Consistent with the graphic representation of the discontinuity, the parametric results for 2011 and 2012 show no effects of PAE on learning outcomes. These results, combined with the significant effects for 2010, suggest that the impact of PAE on learning outcomes is driven by the accountability and diagnosis effect, with no apparent impact from the pedagogical interventions. It seems that the diagnosis and design of a school improvement plan based on the results of ENLACE was enough to improve test scores among PAE schools; however, no further improvements were possible, either because the pedagogical interventions were not able to address the constraints faced by schools, because the interventions were not well implemented, or because they simply needed more time to bear fruit.

The point estimate and standard error (SE) of PAE's effect differ appreciably between DD and RD, especially when we restrict the sample to the OB. While the DD and RD point estimates for 2010 are not substantially different when all the schools in the sample or those within double the OB are considered, the SEs are significantly higher under RD. For instance, taking the full sample and including controls, the RD shows an impact of 0.16 σ with a standard error of 8.9, the latter being two and a half times larger than the SE under DD, which shows an effect of 0.10 σ and an SE of 3.5. As we further restrict the bandwidth under RD, the SE increases and the point estimate falls. The increase in SE, or loss of precision, of the RD estimator can be explained by poor statistical power attributable to the relatively few schools around the cutoff. Notice that although there is a large number of students in our sample, the number of clusters or schools is relatively small, especially when we restrict the sample to schools located within the OB (127 schools). Schochet (2005) estimated that, for the same level of statistical precision, the number of clusters needed under RD is three to four times larger than the sample size required under a randomized controlled trial (RCT). According to Schochet (2005), "[t]he reduction in precision in the RD design arises due to the substantial correlation, by construction, between the treatment status and score variables that are included in the regression models; this correlation is not present under the random allocation design." Additional evidence in Dragoset and Deke (2012) shows that, under certain conditions, sample requirements are 9 to 17 times larger under RD than for RCTs. In our particular case, the number of clusters required under an RCT with a significance level of 0.05, power of 0.8, a minimum detectable effect of 0.13 σ, an intra-class correlation of 0.05, an R² of 0.05 and an average of 250 students per school is equal to 200 clusters, or 100 treatment and 100 control schools. Taking the results of Schochet (2005), to have the same statistical precision under an RD we would need at least three times more clusters, or 600 schools within the OB, a figure considerably higher than the 129 schools actually within the OB.
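For reference, cluster requirements of this kind are typically derived from a minimum detectable effect (MDE) formula for cluster-randomized designs. One standard formulation (the exact formula and inputs behind the 200-cluster figure quoted above may differ slightly) is

MDE ≈ M_J · sqrt[ ρ(1 - R²)/(P(1 - P)J) + (1 - ρ)(1 - R²)/(P(1 - P)Jn) ],

where J is the total number of schools, n the number of students per school, P the share of schools assigned to treatment, ρ the intra-class correlation, R² the share of variance explained by covariates, and M_J ≈ z_{1-α/2} + z_{1-β} (about 2.8 for a 5 percent two-sided test and 80 percent power).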
Therefore, the lack of precision under RD when we restrict the sample to schools within the OB is explained by a lack of statistical power due to the low number of schools around the cutoff.

PAE's information dissemination strategy could have created incentives within schools that help explain a positive impact. First, the pressure put on PAE teachers by being declared low-performing schools may have created incentives for school directors or teachers to engage in some type of "strategic behavior," such as cheating or teaching to the test (see Figlio and Getzler, 2002). ENLACE uses two algorithms to detect cheating, and results are invalidated when it happens. [Footnote 8: Algorithms have also been used in the US to detect cheating; see Jacob and Levitt (2003) for an example using data from Chicago public schools.] There is no evidence of test score invalidation by the Ministry of Education of Colima for any PAE school during 2010, 2011 and 2012. In addition, the percentage of students who did not take ENLACE in 2010 and 2011 was equal to that in non-PAE schools, which suggests that PAE schools did not try to manipulate test scores by choosing the students who took the test. Second, student mobility across schools might have affected the test scores after the PAE schools were identified. But such mobility in Mexico is very difficult without a strong reason such as the geographic relocation of the student's family. In addition, it would have been more likely that the best students would have moved out of PAE schools, which, if anything, would suggest the impact of the program could have been stronger than suggested by this study. Third, school directors could have changed their managerial practices in PAE schools. There is some evidence, from self-reported surveys, that directors in PAE schools improved the monitoring of teachers' attendance and punctuality, visited classrooms more often, and held meetings to discuss learning outcomes between 2009 and 2010. Finally, principals and teachers may have focused on teaching to the test. The curriculum and ENLACE are linked by design, and ENLACE was used to show the areas of weakness in the classrooms of the low-performing schools. With the existing information, we cannot rule out the possibility that teachers used this information to teach the subjects that were more closely connected to the test (math and Spanish).

4.3 Heterogeneous effects

The average positive and significant effect of PAE could hide important heterogeneous impacts within schools. The program could have a differentiated effect among boys and girls or among students with adverse initial conditions versus relatively well-off students. Between 2009 and 2010, the student-level dispersion of ENLACE results in PAE schools experienced a larger increase vis-a-vis what was observed for the state as a whole. For instance, between 2009 and 2010, the standard deviation in math test scores among all schools in Colima increased by 4 ENLACE points while the increase was equal to 10 points among PAE schools. To explore possible heterogeneous effects of the program, separate specifications are estimated for boys and girls and for students identified as having an age-grade distortion and those who do not. Figure 8 shows PAE's learning outcome effects in math and Spanish, differentiated by gender, using the two alternative methodologies, DID and RD.
The results show that, although the program had some heterogeneous impacts across boys and girls, with boys experiencing larger improvements in math and girls in Spanish, we cannot reject the null hypothesis of equality of coefficients between the gender-specific equations.

Age-grade distortions in Mexico are largely the outcome of late enrollment in the education system. As shown by Manacorda (2012), age-grade distortions have significant and long-lasting negative effects on learning outcomes. In 2009, more than 17 percent of the students in Colima's public primary schools had an age-grade distortion, and close to 40 percent of them fell in the "insufficient" level of ENLACE, compared to 20 percent among students without an age-grade distortion. Figure 9 shows PAE's learning outcome effects in math and Spanish for students with an age-grade distortion and those who are in the right grade for their age, using the two alternative methodologies, DID and RD. The results show that PAE had a positive, though not statistically significant, effect on math and Spanish learning outcomes among students with an age-grade distortion, while the effect was positive and significant among students without a distortion. This has two important implications: (1) the improvements in average learning outcomes brought about by PAE were not the outcome of an increase in results among the better-off students at the expense of learning outcomes among the relatively worse-off, and (2) students with relatively better starting conditions seem to benefit more from the changes introduced by a low-stakes accountability and diagnosis intervention such as PAE. Arguably, improving learning outcomes among students with more challenging starting conditions requires more comprehensive interventions, combining PAE's diagnosis with specialized pedagogical interventions.

5 Conclusions

In 2009, the state of Colima identified the 108 public primary schools that had obtained the lowest learning outcomes as measured by the national standardized student assessment, ENLACE. In early February 2010, the state governor announced the "performance status" of the selected schools: schools which performed below an arbitrary cut-off were automatically enrolled in a mandatory school improvement program known as PAE. The program, however, was discontinued during the 2011-2012 school year. Following two alternative strategies to identify the effects of PAE on learning outcomes, a difference-in-difference and a regression discontinuity design, the paper shows that PAE increased test scores by 0.12 standard deviations only a few months after program launch. The size of the effect aligns with other studies evaluating the impact of low-stakes accountability interventions on test scores. Although this effect remains one and, to a lesser extent, two years after program implementation, our results show no additional impact attributable to the implementation of pedagogical interventions or interventions intended to change major inputs in the learning production function, as opposed to a marginal change in effort. The effects are homogeneous across boys and girls; however, learning outcomes among students with disadvantaged initial conditions, proxied by age-grade distortion, improved only marginally (statistically not different from zero) as a result of the intervention.
The fact that the PAE program was halted after only 18 months of implementation suggests that the main intervention of the program was circumscribed to the public announcement made by state authorities, followed by detailed information provided to the schools about the test scores of their students, the activities connected to the design of a school improvement plan, and close support provided by the program's technical advisers. Activities during the period of preparation of the school improvement plan included the notification to schools that they were low performing, a diagnosis based on test score results, identification of weaknesses within the subject areas evaluated, a discussion between the school director and teachers on how to address the challenges, and the setting of clear goals regarding learning outcomes. In other words, the key intervention was the publicly announced information apprising directors of PAE schools of their relatively poor performance. While this information was already public, the announcement by state authorities triggered an accountability effect. The diagnostic feedback that came about through the design of the school improvement plan gave the schools the tools and knowledge they needed to take action and set goals themselves. Therefore, it is plausible that the public announcement itself allowed schools to make small but significant learning gains using the results of the diagnostic feedback they received.

The results of this analysis indicate that full and wide dissemination of information detailing school quality is critically important. When students, teachers and parents in a school know that their scores are low, and this triggers a process of self-evaluation and analysis, the process itself may lead to an improvement in learning outcomes. Although there was no "shaming" of PAE schools in Colima, there may be an intrinsic motivational impact connected to the ranking of a school relative to others, compounded by the compensatory nature of the program and the co-responsibility of state authorities in the challenge of improving learning outcomes. According to this analysis, it is not the inputs made available by PAE that led to improvements. Rather, it was the signaling value of the program, as well as the associated diagnosis and networking opportunities with other school officials and advisers, that resulted in rising test scores. Moreover, unlike the high-stakes accountability interventions sometimes leading to school closures in the United States, the sacking of school directors in England, or the "vote with your feet" school choice in the Netherlands, the policy (and the de facto events) in Colima bore no punitive actions against schools or school directors.

While the PAE program in Colima was surprisingly and frustratingly short-lived, its premature termination serves to highlight a largely unrecognized phenomenon in education: acknowledgment is, in some ways, virtually tantamount to improvement. After all, if you really understand the problem, effective solutions come much easier. If you do not understand the problem, no amount of "problem-solving" can be expected to work. One may still legitimately wonder why schools did not improve before the PAE program given that the same information was already disclosed publicly.
Perhaps the information was not well understood or disseminated, or beleaguered school leaders in poorly performing schools could not, without the right logistical support and networking, begin to proactively use the results from the standardized test to trigger a discussion and design a school improvement plan. These are all areas of future research. It remains refreshing, however, that the use of information from standardized tests, without punitive measures but within a supportive and collaborative environment, appears to be sufficient for improving learning.

References

Ahn, T. and J. Vigdor (2014, September). The impact of No Child Left Behind's accountability sanctions on school performance: Regression discontinuity evidence from North Carolina. Working Paper 20511, National Bureau of Economic Research.

Andrabi, T., J. Das, and A. I. Khwaja (2014, October). Report cards: The impact of providing school and child test scores on educational markets. Working Paper Series rwp14-052, Harvard University, John F. Kennedy School of Government.

Bertrand, M., E. Duflo, and S. Mullainathan (2004). How much should we trust differences-in-differences estimates? The Quarterly Journal of Economics 119(1), 249–275.

Carnoy, M. and S. Loeb (2003). Does external accountability affect student outcomes? A cross-state analysis. Education Evaluation and Policy Analysis 24(4), 305–331.

Chay, K. Y., P. J. McEwan, and M. Urquiola (2005). The central role of noise in evaluating interventions that use test scores to rank schools. American Economic Review 94(4), 1237–1258.

Dragoset, J. and L. Deke (2012). Statistical power for regression discontinuity designs in education: Empirical estimates of design effects relative to randomized controlled trials.

Figlio, D. N. and L. S. Getzler (2002, November). Accountability, ability and disability: Gaming the system. NBER Working Papers 9307, National Bureau of Economic Research, Inc.

Hanushek, E. A. and M. E. Raymond (2005). Does school accountability lead to improved student performance? Journal of Policy Analysis and Management 24(2), 297–327.

Imbens, G. and K. Kalyanaraman (2012). Optimal bandwidth choice for the regression discontinuity estimator. Review of Economic Studies 79(3), 933–959.

Imbens, G. and T. Lemieux (2007). Regression discontinuity designs: A guide to practice. Working Paper 13039, National Bureau of Economic Research.

Imbens, G. M. and J. M. Wooldridge (2008, August). Recent developments in the econometrics of program evaluation. Working Paper 14251, National Bureau of Economic Research.

Jacob, B. A. and S. D. Levitt (2003, August). Rotten apples: An investigation of the prevalence and predictors of teacher cheating. The Quarterly Journal of Economics 118(3), 843–877.

Koning, P. and K. van der Wiel (2013). Ranking the schools: How school-quality information affects school choice in the Netherlands. Journal of the European Economic Association 11(2), 466–493.

Manacorda, M. (2012, May). The cost of grade retention. The Review of Economics and Statistics 94(2), 596–606.

Mizala, A., P. Romaguera, and M. Urquiola (2007, September). Socioeconomic status or noise? Tradeoffs in the generation of school quality information. Journal of Development Economics 84(1), 61–75.

Mizala, A. and M. Urquiola (2013). School markets: The impact of information approximating schools' effectiveness. Journal of Development Economics 103(C), 313–335.

Muralidharan, K. and V. Sundararaman (2010, August). The impact of diagnostic feedback to teachers on student learning: Experimental evidence from India. Economic Journal 120(546), F187–F203.
Nichols, A. (2014). rd: Stata module for regression discontinuity estimation.

Rockoff, J. and L. J. Turner (2010). Short-run impacts of accountability on school quality. American Economic Journal: Economic Policy 2(4), 119-147.

Rouse, C. E., J. Hannaway, D. Goldhaber, and D. Figlio (2013). Feeling the Florida heat? How low-performing schools respond to voucher and accountability pressure. American Economic Journal: Economic Policy 5(2), 251-281.

Schochet, P. Z. (2005, June). Statistical power for random assignment evaluations of education programs.

Figures

Figure 1: PAE and non-PAE population and sample, Colima
Figure 2: The timeline of PAE
Figure 3: Example of a report card using ENLACE, math 3rd grade
Figure 4: Evolution of average score in PAE and non-PAE schools
Figure 5: Intraclass correlation coefficient around the cutoff, math ENLACE 2009
Figure 6: Regression discontinuity, PAE vs. non-PAE, 2010
Figure 7: Regression discontinuity, PAE vs. non-PAE, 2011 and 2012
Figure 8: PAE effects (in σ) by gender and subject, DID and RD, 2010. Note: coefficients that are not significant at the 10% level are shown in red.
Figure 9: PAE effects (in σ) by age-grade distortion and subject, DID and RD, 2010. Note: coefficients that are not significant at the 10% level are shown in red.

Tables

Table 1: Bandwidths around the cutoff

Bandwidth (ENLACE points)   Schools   Students   PAE   Non-PAE
10.1                        67        7,460      35    32
20.2                        129       14,126     67    62
40.4                        223       25,071     98    125

Source: Authors' elaboration using the rd command.

Table 2: School inputs 2009, schools within the optimal bandwidth

                                        Non-PAE (control)   PAE (treatment)   Difference   S.E.
Number of students                      201.9               181.7             20.17        (16.82)
Number of teachers                      11.20               10.82             0.38         (0.73)
% of teachers with Incentive Program    0.51                0.45              0.06         (0.055)
% of teachers with B.A. or more         0.72                0.75              -0.02        (0.04)
Student/teacher ratio                   26.18               24.7              1.47         (0.89)
Disadvantage Index                      0.77                1.13              -0.36        (0.17)**

Source: Authors' own computations with data from the school census, 2009, Secretaría de Educación Pública.
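The bandwidth counts in Table 1 and the baseline comparison in Table 2 are straightforward to compute from school-level data. As an illustration only, the Python sketch below shows how the schools and students within a given window of the eligibility cutoff could be counted, and how a simple difference in means with its standard error could be obtained; the file name, column names, and the way the cutoff is recovered are assumptions made for this sketch, not the actual PAE data or the authors' code.

import numpy as np
import pandas as pd

# Hypothetical school-level file: one row per school with its 2009 ENLACE score,
# enrollment, PAE indicator, and baseline characteristics.
schools = pd.read_csv("colima_schools_2009.csv")

# Assumption: PAE schools are those at or below the eligibility cutoff, so the
# cutoff can be approximated by the highest 2009 score among treated schools.
cutoff = schools.loc[schools["pae"] == 1, "enlace_2009"].max()

def bandwidth_counts(df, h):
    """Schools and students within +/- h ENLACE points of the cutoff (as in Table 1)."""
    window = df[(df["enlace_2009"] - cutoff).abs() <= h]
    return {"bandwidth": h,
            "schools": len(window),
            "students": int(window["enrollment"].sum()),
            "pae": int(window["pae"].sum()),
            "non_pae": int((window["pae"] == 0).sum())}

print(pd.DataFrame([bandwidth_counts(schools, h) for h in (10.1, 20.2, 40.4)]))

# Baseline balance for one covariate, as in Table 2: difference in means and its SE.
window = schools[(schools["enlace_2009"] - cutoff).abs() <= 20.2]
treated = window.loc[window["pae"] == 1, "student_teacher_ratio"]
control = window.loc[window["pae"] == 0, "student_teacher_ratio"]
diff = control.mean() - treated.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))
print(f"Student/teacher ratio, non-PAE minus PAE: {diff:.2f} (SE {se:.2f})")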
Table 3: Difference-in-difference PAE estimation, Math

                        (1) All             (2) All-controls    (3) Optimal BD      (4) 50% OB          (5) Double OB
PAE                     -64.71*** (3.15)    -47.79*** (3.31)    -22.42*** (2.66)    -11.64*** (3.33)    -36.92*** (2.60)
2010                    13.29*** (1.37)     13.88*** (1.38)     13.75*** (3.07)     16.18*** (4.54)     13.73*** (1.97)
2011                    25.21*** (1.82)     26.34*** (1.82)     29.07*** (3.31)     30.40*** (5.36)     29.63*** (2.16)
2012                    44.19*** (1.84)     45.52*** (1.85)     47.09*** (3.63)     51.34*** (5.05)     48.95*** (2.36)
PAE x 2010              12.89*** (3.42)     9.63*** (3.51)      10.82** (4.77)      10.93 (7.49)        10.31*** (3.61)
PAE x 2011              18.47*** (3.30)     13.78*** (3.50)     12.85*** (4.81)     10.84 (7.28)        12.58*** (3.60)
PAE x 2012              22.77*** (4.46)     17.62*** (4.55)     13.51** (5.43)      7.88 (7.19)         15.92*** (4.77)
Student/Teacher                             1.97*** (0.39)      0.64* (0.32)        0.49 (0.51)         0.52* (0.30)
Incentive program                           23.62*** (6.05)     8.70* (4.92)        11.94* (6.09)       5.41 (4.54)
Teachers BA                                 -5.47 (7.89)        -9.42 (7.03)        -9.07 (9.84)        -7.37 (6.09)
Low Marginality                             -12.42*** (3.87)    5.47* (3.02)        5.16 (4.05)         4.13 (3.06)
Medium Marginality                          -16.02** (6.33)     4.96 (7.68)         4.28 (8.76)         -3.34 (6.31)
High Marginality                            -15.52** (7.12)     -1.88 (5.25)        -4.80 (5.00)        -4.40 (6.45)
Constant                523.28*** (2.69)    461.38*** (11.29)   473.47*** (9.60)    470.56*** (13.03)   487.68*** (8.84)
R2                      0.063               0.075               0.038               0.037               0.050
Obs                     161085              160757              59223               31548               105475
Clusters                311                 309                 129                 67                  222

Standard errors, clustered by school, in parentheses. * p<0.10, ** p<0.05, *** p<0.01.
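To make the specification behind Table 3 concrete, the following Python sketch estimates a difference-in-difference regression of student math scores on a PAE indicator, year dummies (2009 as the base year), their interactions, and the school controls listed above, with standard errors clustered at the school level. It is a minimal sketch under assumed file and variable names (for example math_score, pae, school_id), not the authors' original code; the same structure applies to the Spanish results in Table AI and, with a single post-2009 indicator in place of the year dummies, to the pooled before/after estimates in Tables AII and AIII.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical student-level panel: ENLACE math scores for 2009-2012 matched to
# school characteristics and a school-level PAE indicator.
students = pd.read_csv("colima_students_2009_2012.csv")

# The PAE x year interactions are the difference-in-difference estimates: with
# 2009 as the omitted base year, each coefficient compares the change in PAE
# schools with the change in non-PAE schools relative to 2009.
model = smf.ols(
    "math_score ~ pae * C(year, Treatment(reference=2009))"
    " + student_teacher_ratio + incentive_program + teachers_ba + C(marginality)",
    data=students,
)
result = model.fit(cov_type="cluster", cov_kwds={"groups": students["school_id"]})
print(result.summary())

# Effects in standard deviations: divide a PAE x year coefficient by the score's
# standard deviation. The ENLACE scale is commonly normalized to a standard
# deviation of roughly 100, so a coefficient of about 12 points would be on the
# order of 0.12 sigma.
did_2010 = [name for name in result.params.index if "pae" in name and "2010" in name][0]
print(result.params[did_2010] / students["math_score"].std())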
Table 4: Regression discontinuity estimation results, Math

                          (1) All             (2) Optimal BD      (3) 50% OB          (4) Double OB
PAE                       15.93* (8.91)       6.85 (8.59)         1.95 (13.61)        12.02* (7.24)
Forcing Variable          0.98*** (0.17)      0.44 (0.46)         0.23 (1.43)         0.99*** (0.19)
FV squared                -0.00 (0.00)
PAE x (FV - cutoff)       1.23 (0.93)         1.10 (0.76)         -0.06 (2.56)        0.38 (0.38)
(PAE x (FV - cutoff))^2   0.03 (0.02)
Student/Teacher           1.25*** (0.35)      1.26** (0.51)       2.32* (1.26)        1.31*** (0.42)
Incentive program         6.22 (5.24)         3.44 (6.81)         -6.47 (10.09)       -0.09 (5.79)
Teachers BA               -7.10 (7.18)        -24.84* (14.10)     -46.98** (20.03)    -13.01 (8.30)
Low Marginality           2.43 (3.74)         2.96 (4.49)         12.01* (6.68)       4.76 (3.85)
Medium Marginality        -4.06 (7.61)        7.40 (12.41)        22.12 (14.08)       0.74 (8.53)
High Marginality          -1.54 (6.91)        0.86 (4.61)         8.55 (6.65)         -1.42 (6.35)
Constant                  461.86*** (11.77)   481.07*** (18.81)   468.94*** (29.26)   466.49*** (14.18)
R2                        0.095               0.012               0.014               0.030
Obs                       38928               14201               7518                25433
Clusters                  307                 127                 67                  220

Standard errors, clustered by school, in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Annex

Figure AI: Density of the assignment variable
Figure AII: Simulation of selection into PAE using ranking of 2008 versus 2009

Table AI: Difference-in-difference PAE estimation, Spanish

                        (1) All             (2) All-controls    (3) Optimal BD      (4) 50% OB          (5) Double OB
PAE                     -62.38*** (2.98)    -45.01*** (3.07)    -19.89*** (2.09)    -9.47*** (2.60)     -34.98*** (2.22)
2010                    16.28*** (1.27)     16.74*** (1.31)     17.62*** (2.61)     20.25*** (3.80)     17.64*** (1.73)
2011                    28.99*** (1.70)     29.80*** (1.72)     33.08*** (2.84)     33.84*** (4.43)     32.62*** (2.14)
2012                    30.07*** (1.79)     31.05*** (1.83)     31.30*** (3.40)     32.92*** (4.17)     31.63*** (2.30)
PAE x 2010              11.75*** (2.92)     8.65*** (3.03)      8.14** (3.90)       7.81 (6.22)         8.26*** (3.03)
PAE x 2011              14.56*** (3.16)     10.17*** (3.46)     8.07* (4.13)        4.72 (5.66)         9.18*** (3.50)
PAE x 2012              14.26*** (3.90)     9.40** (4.08)       6.63 (5.08)         3.25 (6.15)         10.28** (4.23)
Student/Teacher                             2.05*** (0.37)      0.52* (0.30)        0.11 (0.46)         0.63** (0.25)
Incentive program                           20.45*** (5.45)     3.36 (4.65)         5.97 (5.69)         1.52 (3.94)
Teachers BA                                 -2.32 (7.37)        0.49 (6.07)         0.15 (8.31)         -1.40 (5.25)
Low Marginality                             -13.56*** (3.62)    3.44 (2.90)         2.02 (4.04)         2.85 (2.82)
Medium Marginality                          -19.96*** (5.27)    -1.32 (6.21)        -0.45 (6.72)        -7.64 (4.97)
High Marginality                            -15.50*** (5.71)    -4.07 (4.86)        -9.75** (4.87)      -4.70 (5.38)
Constant                519.25*** (2.54)    455.07*** (10.82)   469.19*** (9.28)    474.83*** (12.07)   479.53*** (7.73)
R2                      0.060               0.074               0.027               0.022               0.041
Obs                     161085              160757              59223               31548               105475
Clusters                311                 309                 129                 67                  222

Standard errors, clustered by school, in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Table AII: Difference-in-difference, before vs. after estimation, Math

                        (1) All             (2) All-controls    (3) Optimal BD
PAE                     -64.71*** (3.15)    -47.86*** (3.31)    -22.55*** (2.68)
After                   28.18*** (1.43)     29.02*** (1.41)     30.62*** (2.85)
PAE x After             18.27*** (2.94)     14.04*** (3.06)     12.61*** (4.04)
Student/Teacher                             2.00*** (0.39)      0.68** (0.33)
Incentive program                           22.95*** (6.11)     7.25 (5.11)
Teachers BA                                 -2.28 (7.98)        -9.09 (7.06)
Low Marginality                             -12.33*** (3.90)    5.26* (3.06)
Medium Marginality                          -16.01** (6.35)     5.12 (7.66)
High Marginality                            -15.62** (7.09)     -1.91 (5.34)
Constant                523.28*** (2.69)    458.65*** (11.21)   473.02*** (9.68)
R2                      0.052               0.064               0.026
Obs                     161085              160757              59223
Clusters                311                 309                 129

Standard errors, clustered by school, in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Table AIII: Difference-in-difference, before vs. after estimation, Spanish

                        (1) All             (2) All-controls    (3) Optimal BD
PAE                     -62.38*** (2.98)    -45.07*** (3.07)    -19.96*** (2.10)
After                   25.33*** (1.38)     25.98*** (1.38)     27.52*** (2.51)
PAE x After             13.55*** (2.65)     9.49*** (2.84)      7.63** (3.52)
Student/Teacher                             2.06*** (0.37)      0.54* (0.31)
Incentive program                           20.05*** (5.47)     2.55 (4.66)
Teachers BA                                 -0.50 (7.36)        0.60 (6.06)
Low Marginality                             -13.46*** (3.64)    3.42 (2.91)
Medium Marginality                          -19.90*** (5.28)    -1.24 (6.22)
High Marginality                            -15.58*** (5.69)    -4.07 (4.90)
Constant                519.25*** (2.54)    453.49*** (10.75)   468.98*** (9.38)
R2                      0.057               0.071               0.024
Obs                     161085              160757              59223
Clusters                311                 309                 129

Standard errors, clustered by school, in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Table AIV: Regression discontinuity estimation results, Math 2011

                          (1) All             (2) Optimal BD      (3) 50% OB          (4) Double OB
PAE                       4.71 (6.43)         1.97 (8.06)         -2.72 (11.44)       8.90 (6.47)
Forcing Variable          1.04*** (0.20)      0.66 (0.47)         1.31 (1.68)         1.04*** (0.22)
FV squared                -0.00 (0.00)
PAE x (FV - cutoff)       -0.77 (0.60)        -0.12 (0.66)        -3.35 (2.14)        -0.08 (0.37)
(PAE x (FV - cutoff))^2   -0.01 (0.01)
Student/Teacher           0.43 (0.33)         0.14 (0.59)         -0.82 (0.77)        0.25 (0.38)
Incentive program         12.21** (5.24)      14.68* (8.62)       17.63 (11.34)       7.60 (6.16)
Teachers BA               -7.43 (9.03)        -8.78 (12.05)       0.50 (15.02)        -7.35 (9.84)
Low Marginality           2.94 (3.68)         2.92 (4.70)         -3.66 (7.19)        4.84 (3.69)
Medium Marginality        1.65 (6.60)         3.86 (11.01)        0.60 (13.38)        4.53 (7.43)
High Marginality          -7.36 (7.12)        -7.95 (7.71)        -15.07** (6.98)     -5.50 (7.90)
Constant                  496.83*** (10.80)   507.54*** (16.08)   523.92*** (23.54)   502.17*** (12.25)
R2                        0.074               0.007               0.007               0.023
Obs                       39866               14635               7836                26075
Clusters                  307                 127                 66                  220

Standard errors, clustered by school, in parentheses. * p<0.10, ** p<0.05, *** p<0.01.
Table AV: Regression discontinuity estimation results, Math 2012

                          (1) All             (2) Optimal BD      (3) 50% OB          (4) Double OB
PAE                       4.36 (7.82)         0.64 (8.39)         1.72 (10.71)        8.01 (7.13)
Forcing Variable          1.01*** (0.21)      -0.19 (0.53)        1.02 (1.46)         1.07*** (0.19)
FV squared                -0.00 (0.00)
PAE x (FV - cutoff)       -0.91 (0.93)        1.22 (0.81)         -1.76 (1.97)        -0.48 (0.43)
(PAE x (FV - cutoff))^2   -0.01 (0.02)
Student/Teacher           0.54 (0.40)         0.46 (0.61)         -0.07 (0.76)        0.72 (0.47)
Incentive program         14.04** (7.11)      24.45** (10.24)     39.68*** (13.01)    6.70 (8.56)
Teachers BA               -16.55 (10.76)      -10.29 (12.47)      0.09 (14.35)        -20.37 (12.35)
Low Marginality           8.30* (4.35)        13.57** (5.50)      8.07 (7.40)         12.35*** (4.59)
Medium Marginality        -1.29 (8.66)        -7.42 (9.26)        -8.90 (10.75)       0.97 (9.19)
High Marginality          -7.57 (8.39)        -4.28 (8.90)        -13.32 (9.12)       -6.89 (7.69)
Constant                  517.67*** (12.47)   517.73*** (17.05)   512.17*** (22.79)   516.60*** (14.69)
R2                        0.063               0.012               0.014               0.020
Obs                       43806               16261               8734                28943
Clusters                  307                 127                 66                  220

Standard errors, clustered by school, in parentheses. * p<0.10, ** p<0.05, *** p<0.01.
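Finally, to illustrate the regression discontinuity specifications reported in Tables 4, AIV, and AV, the sketch below implements a standard local linear estimator: keep schools within a bandwidth of the eligibility cutoff, center the forcing variable (the school's 2009 ENLACE score), allow its slope to differ on the treated side, and cluster standard errors by school. As before, the file and variable names and the placeholder cutoff value are assumptions for illustration, not the authors' implementation.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical student-level file for a single post-treatment year (e.g., 2010),
# with each student's school-level 2009 ENLACE score as the forcing variable.
students = pd.read_csv("colima_students_2010.csv")

cutoff = 500.0     # placeholder; in practice, the score of the marginal PAE school
bandwidth = 20.2   # for example, the optimal bandwidth reported in Table 1

df = students[(students["enlace_2009_school"] - cutoff).abs() <= bandwidth].copy()
df["fv"] = df["enlace_2009_school"] - cutoff  # centered forcing variable

# The coefficient on pae is the estimated jump in scores at the cutoff; pae:fv
# lets the slope of the forcing variable differ on the treated side, mirroring
# the PAE x (FV - cutoff) terms in Tables 4, AIV, and AV.
model = smf.ols(
    "math_score ~ pae + fv + pae:fv"
    " + student_teacher_ratio + incentive_program + teachers_ba + C(marginality)",
    data=df,
)
result = model.fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})
print(result.params["pae"], result.bse["pae"])

A data-driven bandwidth in the spirit of Imbens and Kalyanaraman (2012) could be substituted for the fixed value used here.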