Policy Research Working Paper 5751

A Hybrid Approach to Efficiency Measurement with Empirical Illustrations from Education and Health

Adam Wagstaff and L. Choon Wang
Development Research Group, The World Bank, Washington DC, USA

August 2011

Abstract: Inefficiency is commonplace, yet efforts to date to measure inefficiency and use it in benchmarking exercises have not been altogether satisfactory. This paper proposes a new approach that blends the themes of Data Envelopment Analysis and the Stochastic Frontier Approach to measure overall efficiency. The hybrid approach nonparametrically estimates inefficiency by comparing actual performance with comparable real-life "best practice" on the frontier, and could be useful in exercises aimed at improving provider performance. Four applications in the education and health sectors are used to illustrate the features and strengths of this hybrid approach.

Corresponding author and contact details: Adam Wagstaff, World Bank, 1818 H Street NW, Washington, D.C. 20433, USA. Tel. (202) 473-0566. Fax (202) 522-1153. Email: awagstaff@worldbank.org.

Keywords: Frontier, Efficiency, Cost, Education, Health.

Acknowledgements: Without wishing to implicate him in any way, we thank Martin Ravallion for helpful comments on an earlier version of the paper. The findings, interpretations and conclusions expressed in this paper are entirely those of the authors, and do not necessarily represent the views of the World Bank, its Executive Directors, or the governments of the countries they represent.

I. INTRODUCTION

In this paper we propose a new approach to measuring efficiency, and illustrate it with four applications in the education and health sectors. The paper is motivated by three interrelated ideas. The first is that inefficiency is commonplace, especially in the education and health sectors, but is not ubiquitous—some providers, some local governments, and some countries are more efficient than others.
The second idea is that having data comparing actual performance to "best practice" could be useful in exercises aimed at improving provider performance. These might involve users holding providers accountable directly through, for example, voucher schemes. Or they might involve citizens holding politicians accountable for service delivery inefficiencies through the political process, and policymakers then holding service providers accountable through, for example, payment mechanisms that reward good performance or "naming-and-shaming" exercises where poor performance is publicized. The third idea behind the paper is the observation that efforts to date to measure inefficiency and use it in benchmarking exercises have not been altogether satisfactory. Some efforts focus simply on outputs or outcomes without factoring in the expenditures involved (there has been much discussion, for example, about the cross-country variation in education test scores; but it may be the case that the high achievers simply spend more), while studies that have sought to measure efficiency have not apparently had much impact among policymakers (cf. e.g. Burgess 2006; Hollingsworth and Street 2006); we feel this lack of impact derives from skepticism over the methods.

The method we propose and illustrate in this paper blends themes from the two efficiency measurement methodologies used to date, namely Data Envelopment Analysis (DEA) and Stochastic Frontier Analysis (SFA).1 DEA usually focuses on estimating technical inefficiency and constructs an isoquant frontier made up of a set of piecewise linear segments joining a few data points in the input space. The frontier envelops the data and permits the comparison of a real-life unit to a hypothetical comparator on the piecewise linear segment of the isoquant with the same input proportions. SFA, on the other hand, estimates the production or cost frontier using a regression model with specific functional form and distributional assumptions. The estimated frontier permits the comparison of actual output or spending to the corresponding point on the frontier to measure inefficiency.

Both of these methods have pros and cons, and both share some weaknesses. Our approach tries to take the attractive features of each while trying to avoid their shared shortcomings. We borrow two ideas from DEA and two from SFA. We focus on overall efficiency—this is straightforward in SFA but not typically done in DEA, where the focus is on technical inefficiency.2 We think policymakers will typically want to look at overall inefficiency, not just technical inefficiency. On the other hand, our approach is more similar to DEA in that it envelops the data through the use of nonparametric methods. This gets around the criticism of SFA that its results are dependent on the functional form assumed, and allows for multiple efficient units; in contrast, SFA typically produces at most a few efficient units and quite possibly just one—this will almost certainly be the case in the classic panel-data formulation where inefficiency is modeled as a time-invariant fixed effect (Schmidt and Sickles 1984).

1 For recent surveys of DEA and SFA in education and health, see Worthington (2001) and Hollingsworth (2008). Early published applications of SFA in education and health include Deller and Rudnicki (1993) and Wagstaff (1989) respectively. Early published applications of DEA in education and health include Bessent and Bessent (1980), Ray (1991), Sherman (1984), and Huang and McLaughlin (1989).
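To see why, consider the following minimal sketch (in Python, with simulated data; the variable names and parameter values are ours, purely for illustration, not from the paper) of the Schmidt and Sickles (1984) within-estimator, in which each unit's time-invariant fixed effect is measured relative to the smallest estimated effect:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated cost panel: N units observed for T periods (illustrative only).
N, T = 50, 5
unit = np.repeat(np.arange(N), T)
y = rng.uniform(1.0, 10.0, N * T)                    # output
u = rng.exponential(0.3, N)                          # true time-invariant inefficiency
log_cost = 1.0 + 0.8 * np.log(y) + u[unit] + rng.normal(0.0, 0.1, N * T)

# Within (fixed-effects) regression of log cost on log output.
def demean(x):
    bar = np.array([x[unit == i].mean() for i in range(N)])
    return x - bar[unit], bar

ly_d, ly_bar = demean(np.log(y))
lc_d, lc_bar = demean(log_cost)
beta = (ly_d @ lc_d) / (ly_d @ ly_d)                 # slope on log output

# Each unit's fixed effect; the unit with the smallest effect defines the
# frontier, and inefficiency is measured relative to that single unit.
alpha = lc_bar - beta * ly_bar
inefficiency = alpha - alpha.min()

print("frontier units:", np.sum(np.isclose(inefficiency, 0.0)))  # typically 1
```

By construction only the unit with the smallest estimated fixed effect lies on the frontier, which is the single-efficient-unit feature of the panel SFA formulation noted above.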
Our approach imposes fewer assumptions than DEA, however: DEA imposes assumptions about the shape of isoquants, but we impose no assumptions about the cost curve. Finally, our approach is closer to SFA in that we make some allowance for statistical noise and measurement error. In its traditional and commonest form, DEA does not allow for measurement error or statistical noise.3 SFA, by contrast, does allow for noise and measurement error, but this comes at the expense of an arbitrary and untestable assumption that enables inefficiency to be distinguished from random shocks and measurement error.

In addition to borrowing the attractive features of DEA and SFA, we also try to avoid their shared weaknesses. Both assess the efficiency of a unit by comparing the unit's output or spending to that of a hypothetical unit rather than that of a real-life one. This use of hypothetical comparators makes the exercise opaque to a policymaker and introduces a large element of "make believe". A policymaker in an inefficient country, or the head of an inefficient school board, can quite reasonably dismiss claims of inefficiency when the comparison is with a fictitious unit. And even if the policymaker or head of school district is keen to learn how to improve their performance, there is no better-performing country or better-performing school district to visit. In our approach, by contrast, inefficient units are compared with real-life efficient units. We can tell a policymaker of an inefficient government or a manager of an inefficient delivery unit which real-life country or unit achieves similar outcomes at a lower cost.

The second common shortcoming of DEA and SFA we try to address is their vulnerability to special pleading—policymakers or managers of poorly performing service delivery units claiming that there are legitimate factors explaining their poor performance that are ignored by the analysis. The usual response to date in the DEA and SFA literatures has been to address this issue through a two-stage approach: (1) construct or estimate a frontier to calculate efficiency scores; and (2) regress efficiency scores on factors thought to influence them. This practice has been criticized, not least because the process by which the efficiency scores are generated is ignored in the second-stage regression exercise (Burgess 2006). Instead, we build exogenous constraints into our analysis and allow different groups of units to have different frontiers. This multiple-frontier approach could, of course, be used in DEA and SFA studies too.

We illustrate our approach throughout using examples from the education and health sectors. Our examples are chosen with a view to data quality and variety in terms of level of decision-making. In both education and health, there is growing realization of the need to look beyond the number of people passing through a facility to the difference that the facility makes to the lives of the people passing through it. In the two education examples, therefore, we look at test scores to get at the quality of the education process rather than at student numbers, which capture just quantity.

2 Allocative inefficiency can be estimated in a DEA, but this has to be done explicitly, and the analyst needs to have input prices for all the inputs (Coelli 1996).
3 Recent advances in DEA research have introduced bootstrapping in DEA to get around this (see Simar and Wilson 2000).
One example comes from the level of the local government: our data come from California, where school districts control the day-to-day running of public schools. The other education example relates to national education systems and makes use of data from the OECD's Program for International Student Assessment (PISA) study; while undertaken by the OECD, the geographic coverage of the study now extends well beyond the OECD countries. Our first health example also uses national data and tries to get at the efficiency of health systems. Looking at patients treated misses the quality of care, and in any case one could argue that if a health system is successful at preventing illness and injuries it ought to be reducing the number of patients requiring treatment. Measures of population health tend to be too broad-brush to be compelling measures of the outcome of a health system: many are affected by factors beyond the health system, and many causes of death are not amenable to medical care at all, or only marginally so. We therefore focus on a limited set of causes of death that are amenable to medical care, and on deaths among people under the age of 70, where for the selected conditions medical care can make a large difference. The data we use are currently available only for OECD countries. Our final example is at the facility level and concerns hospitals. A large fraction of health spending goes on hospitals, and many efficiency-enhancing efforts are directed at the hospital sector. Our data come from Vietnam, which, though a low-income country, has unusually good data on its hospital sector by developing-country standards. The data are not without their shortcomings, however, and this analysis especially is intended to be illustrative; with richer data on, for example, disease codes, one could do a more sophisticated efficiency analysis.

The rest of the paper is organized as follows. Section II introduces our empirical examples. Section III introduces and illustrates our hybrid approach to efficiency measurement in the simple case of a single output. Briefly, we identify efficient units through a grid-search process, identifying the least-cost unit over each output range, and then estimate a frontier nonparametrically using the output-expenditure combinations of the efficient units. We then measure the inefficiency of the inefficient units by comparing the inefficient unit's spending with the spending of the closest efficient unit on the frontier. We compare our results with those emerging from the panel-data stochastic frontier model. We obtain smaller estimates of efficiency using our hybrid approach, reflecting the fact that our frontier consists of real-life, not hypothetical, units. Section IV extends the method to allow for multiple outputs or multiple dimensions of quality. We now identify efficient units over ranges across multiple dimensions—in the two-output case, for example, our grid search is over a square rather than a line segment. We compare empirically for each of our four examples the single- and multiple-output results. In three of the four examples, allowing for multiple outputs makes a large difference. Section V extends the analysis further to allow for exogenous factors that constrain a unit from reaching the frontier. We illustrate our approach of different groups of units having different frontiers on the California schools dataset, using poverty as the stratifying variable.
By constructing separate frontiers for school districts with small and large fractions of poor children, we allow for the possibility that the latter are constrained to making do with a lower level of "home inputs" in the production of schooling outcomes. It turns out that allowing for separate frontiers makes less of a difference in this example than we had expected—less than allowing for multiple dimensions of quality, for example.

II. AN INTRODUCTION TO OUR EMPIRICAL EXAMPLES

As we illustrate the methods through examples, it makes sense to introduce our examples ahead of the methods. As previously indicated, we have four examples—two from the education sector, and two from the health sector. Table 1 provides an overview of the four empirical applications, along with basic descriptive statistics.

Schools—California

It is often noted that California's school system spends relatively little per pupil by US standards, and that its students fare worse than the US national average on test scores.4 What is less frequently mentioned is the large variation across California school districts in spending and test scores; this variation likely reflects the fact that the day-to-day running of California's public schools is the responsibility of school districts. A recent newspaper article by Freedberg and Doig (2011), investigative reporters with the Center for Investigative Reporting, drew attention to the large variations around the 2010 mean of $8,452 (the Pacific Unified School District spent nearly $60,000 per pupil), and noted that there is no perceptible relationship between spending and test scores. It is this intra-state variation we focus on, trying to determine which school districts are—by California's standards—efficient, and how inefficient each of the inefficient ones is.

The literature to date on the costs of California's schools has taken a somewhat different tack from ours. Imazeki (2006) estimates cost functions, allowing the structure to vary according to the concentration of schools in the school district and the fraction of pupils eligible for free or subsidized meals. No explicit allowance is made for inefficiency; for example, frontier techniques are not employed. Costrell et al. (2008) have expressed misgivings about the analysis, arguing inter alia that the results make no adequate allowance for, or fail to uncover, unobserved factors such as efficiency differences between school districts. Chambers et al. (2006) use a panel of professional educators to assess the minimum spending necessary to deliver instructional programs for schools of varying size and demographic composition, and conclude that California's schools underspend. They find that only 15 to 28 of the 984 school districts examined were spending at the level adequate to reach California's content and performance standards in all major subjects. The authors estimated that an additional $24.14 to $32.01 billion would have been necessary in the 2004/2005 school year to ensure the opportunity of all students to reach the state's content and performance standards. This approach leaves open the question of why some school districts appear to achieve much better results with similar levels of spending per pupil, and why some spend considerably more than others and yet achieve no better results.

4 See, for example, The Economist, April 20 2011, "A lesson in mediocrity: California's schools show how direct democracy can destroy accountability".
We sourced school districts' educational outcomes and expenditure data from the California Department of Education's website.5 Each year, the California Department of Education collects and reports information about student test scores in the California Standards Tests (CST), current expense of education, enrollment, number of dropouts, number of graduates, and enrollment of English learners for roughly 980 local educational agencies (LEAs) or school districts in California. For the purpose of our analysis, we focus on 313 of the school districts that control both elementary and high school levels within their boundaries (i.e., unified school districts) and have data on CST scaled scores for math and reading in grade 5 and grade 7, current expenditures, enrollment, and poverty estimates consistently available between the 2003/2004 and 2008/2009 academic years.6 Californian students attending grade 2 to grade 11 take the CST in mathematics, reading, science, history, social science, and so on. Some students with disabilities take the California Alternative Performance Assessment (CAPA) and their test scores are excluded from our sample. Students in grade 8 and beyond tend to take different CSTs depending on their course selection in school. We chose to focus on math and reading achievement in grade 5 and grade 7, so that we have a manageable number of test outcomes that correspond to the elementary and secondary school levels.

5 The website is http://www.cde.ca.gov. The current expenditure data are adjusted to June 2000 dollar values based on the Consumer Price Index (CPI) available at the Bureau of Labor Statistics website ftp://ftp.bls.gov/pub/special.requests/cpi/cpiai.txt.
6 Fourteen of these school districts have some test score or enrollment data missing in some years because of their relatively small size.

International education systems—PISA

Much has been written about the international variation in knowledge and skills emerging from exercises such as the OECD's Program for International Student Assessment (PISA), which assesses 15-year-old students in grade seven or higher every three years.7 Four of the five top places in the 2010 report in both reading and mathematics went to Asian countries: China (Shanghai and Hong Kong SAR, China), South Korea, and Singapore. By contrast, the US ranked 11th on reading and 26th on mathematics. Discussions of these results, such as OECD (2010), typically focus simply on scores, without factoring in the amounts that countries spend on education. It is perfectly possible in principle that the countries achieving the best outcomes are those that spend the most, and that those who do less well spend relatively little. The more interesting—but rarely asked—question from a policy perspective is how countries vary in their success at translating resources into learning outcomes.8 We explore this issue using data on knowledge and skills in mathematics, reading and science from the PISA exercise, and secondary education spending data from the World Bank's World Development Indicators (WDI). We have an unbalanced panel comprising 118 sets of test scores and secondary education expenditure per pupil for 49 countries for some or all of the years 2000, 2003 and 2006.9 Because information about expenditure per student is missing in the WDI for some countries in some years, 25 of the 118 observations are interpolated or extrapolated.

7 See http://www.pisa.oecd.org/ for further details of the PISA program. For studies examining the cross-country differences in PISA, see Dobert, Klieme, and Sroka (2004), Ammermueller (2007), and Fuchs and Woessmann (2007).
8 Afonso and St Aubyn (2005) is an exception. They use DEA to analyze efficiency in education spending in OECD countries using the PISA data.
9 We have not included data for 2009 because most countries do not have expenditure data available in the WDI database. The secondary education expenditure per student is generated by multiplying purchasing power parity GDP per capita (2005 constant value) by expenditure per student in secondary education as a percentage of GDP per capita.
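As a rough illustration of the construction described in footnote 9, the sketch below builds expenditure per student from the two WDI series and fills missing years. All numbers are invented, and the linear fill rule is our assumption; the paper does not say how its 25 observations were interpolated or extrapolated.

```python
import numpy as np

# Invented WDI-style inputs for one country in the three PISA years.
years = np.array([2000, 2003, 2006])
gdp_pc_ppp = np.array([25000.0, 27000.0, np.nan])   # PPP GDP per capita (2005 constant)
spend_pct = np.array([22.0, np.nan, 24.0])          # spending per student, % of GDP pc

def fill_linear(v):
    """Fill missing years by a line through the observed points (our assumption)."""
    ok = ~np.isnan(v)
    coef = np.polyfit(years[ok], v[ok], 1)
    out = v.copy()
    out[~ok] = np.polyval(coef, years[~ok])
    return out

# Expenditure per student = PPP GDP per capita x (spending share / 100).
exp_per_student = fill_linear(gdp_pc_ppp) * fill_linear(spend_pct) / 100.0
print(dict(zip(years.tolist(), np.round(exp_per_student, 0).tolist())))
```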
Although in each PISA assessment one of the three areas (science, reading and mathematics) is selected as the major domain and is given greater emphasis, while the two minor domains are assessed less thoroughly, differences in test scores across countries and over time are comparable because the tests are linked assessments and the scores are scaled.10

10 See the PISA 2003 Technical Report (Adams 2005) for details.

International health systems—OECD

It is well known that countries vary considerably in their per capita spending on health care, with the US being among the largest spenders per capita. Inevitably, there has been and continues to be a lot of debate on the question of whether the high spenders (especially the US) achieve sufficiently better results to warrant the extra spending (cf. e.g. Anderson and Frogner 2008), and more generally on how to measure health sector results in international comparisons of health system performance (see e.g. Hakkinen and Joumard 2007). Throughput measures—such as inpatient admissions and ambulatory care visits—are just that, and need not necessarily capture the ultimate outcome of interest, namely better health. But health indicators such as mortality (or life expectancy) and disability reflect many factors beyond the control of the health sector: deaths from road accidents likely reflect road safety improvements more than health spending, and many deaths are from causes that are still not amenable to medical intervention. Arguably a more compelling approach when assessing the efficiency of a country's health spending is to link health spending to deaths from conditions that are amenable to medical care (Nolte and McKee 2008). This is what we do in this study.

Our data come from the OECD's Health Data. We have data for 29 countries over the period 1960-2005. Not all countries have data for every year. Health spending is defined as total health spending (i.e. public plus private) measured in 2000 prices in international dollars. Mortality is measured through potential years of life lost (PYLL) among people below the age of 70 who die from nine causes of death that are "amenable" to medical care, i.e. causes where timely and effective medical care can result in a death being avoided. We select the conditions from Nolte and McKee (2008), who identify a longer list of conditions, but only nine are included in the OECD PYLL database. Deaths among older age groups are excluded by the OECD on the grounds that they are less easily amenable to medical care. We aggregate some of the conditions so that we are left with a more manageable six PYLL "outputs".11 Our measure of health sector output—while preferable to a throughput measure and better than all-cause mortality among all age groups—is not without its limitations, of course.
It focuses on length of life rather than quality of life, and does not capture success in reducing mortality among the over-70s from conditions that are amenable to medical care. Countries that disproportionately target spending at the over-70s, or at patients whose length of life cannot be extended but whose quality of life can be improved, will appear in our analysis as inefficient.

11 We aggregated PYLLs from three types of cancer (colon, breast and cervical) into an aggregate cancer PYLL, and aggregated PYLLs from pregnancies/deliveries and perinatal causes into an aggregate maternal and child health (MCH) PYLL. The remaining four amenable causes were diabetes, ischemic heart disease, cerebrovascular disease, and influenza/pneumonia.

Hospitals—Vietnam

Hospitals absorb the bulk of health spending in most countries, and there has been much discussion of the scope for lowering health spending by reducing their inefficiency. Unsurprisingly, some of the first applications of DEA and SFA in health were on the hospital sector (e.g., Wagstaff 1989; Ray 1991), and there has been a good deal of work undertaken since then: EconLit contains 56 publications with "hospital" and "frontier" in the abstract. According to Hollingsworth and Street (2006), however, this work has had a relatively modest impact on policymakers.

Our data are from Vietnam's official public hospital inventory, the same dataset used by Weaver and Deolalikar (2004) in their study of economies of scale and scope in Vietnamese hospitals. By the standards of low- and middle-income countries, this is an unusually good dataset. However, it does lack detailed information on patients treated, distinguishing only between inpatients, surgery cases and outpatients, and not between different departments, let alone different diseases and treatments. In what follows we have included only district hospitals that have between 50 and 500 beds. We have excluded central hospitals run directly by the health ministry, and level-1 and level-2 hospitals (more complex hospitals). Our sample consists of 795 hospitals. Our data are for three years: 1998, 1999 and 2000. The expenditure data cover recurrent costs.

III. THE SINGLE OUTPUT CASE

We start with the simplest case, where we have just one output, or one dimension of quality. (We allow for multiple outputs in the next section.)

Methods

We assume labor and nonlabor inputs are combined to produce an output y at a cost C. Costs can exceed their feasible minimum because the input bundle used does not yield the maximum possible output (technical inefficiency), or because inputs are used in the wrong proportions given their prices and marginal products (allocative inefficiency), or both. We do not try to disentangle the two, instead presenting an estimate of overall inefficiency.

Suppose we have data from multiple service-delivery units. We can then generate a scatterplot of C (or average cost) against y—the space of the standard total (or average) cost curve chart in a microeconomics textbook. In services like education and health, it is important to allow for quality and not just focus on "outputs" such as enrollments or cases treated. We can allow for quality by graphing average cost per person, C/y, against quality, q.
For example, y might be students enrolled, C/y cost per student, and q the average test score.12

In the first stage of our analysis, we identify a group of efficient service delivery units (or, in the case of panel data, efficient service delivery units at a point in time), defined as those that have the smallest (total or average) expenditure for each level of output, or the smallest expenditure per student or patient treated for each level of quality. Because there will be relatively few units that have exactly the same output (or quality), we work with output (or quality) ranges. We define a caliper of size c, and move the caliper along the y (or q) axis in steps of size s ≤ c. In this case, where y (or q) is a scalar, the caliper is a line of length c which gets moved up the y (or q) axis, up to the maximum value of the outcome, in steps of s. In each step, the unit with the smallest expenditure within the caliper is identified and labeled an "efficient unit". Next we create a (stochastic) frontier by running a nonparametric Lowess smoother through the datapoints of these efficient units, and defining the frontier as the predicted cost for each efficient unit. The grid-search process thus identifies efficient units, and the smoothing process produces the frontier, with all efficient units being moved to the frontier.

In the second stage, we compute the inefficiency of inefficient delivery units by matching each unit off the frontier with the closest unit on the frontier in terms of the outcome y; the unit's inefficiency is the difference between its expenditure and the expenditure of the closest match on the frontier.13 (The sketch below, after the footnotes, illustrates this two-stage procedure.) In contrast to both DEA and SFA, where units are compared with hypothetical units, the matched unit for each inefficient unit in our approach is a real-life service delivery unit, not a hypothetical point on the frontier. We see this as a strength of our approach; all are units that have actually managed to produce (close to) output y at a cost C, not ones that ought to have been capable of doing so.

Two points are worth clarifying at this stage. First, the first stage may leave some inefficient units below the smoothed frontier. These are units that emerge with somewhat higher

12 For such a graph to be justified, the underlying two-product cost function would have to have the form C(y,q) = y·c(q), giving C(y,q)/y = c(q). Crampes and Hollander (1995) use such a cost function, but do not explore its properties. For the most part, the properties are fairly innocuous. The extent of ray economies (the effect on cost of doubling both y and q) depends on the shape of c(q): since C(2y,2q) = 2y·c(2q), c'(q)>0 implies ray diseconomies and c'(q)<0 implies ray economies. There are economies of scale with respect to quality if c'(q)
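To fix ideas, the following minimal sketch (Python, with simulated data) walks through the single-output procedure just described: the caliper grid search, the Lowess smoothing of the efficient units' datapoints, and the matching of each remaining unit to its closest real-life efficient unit. The caliper width, step size, and Lowess bandwidth are illustrative choices of ours, not values from the paper, and statsmodels' lowess stands in for whichever smoother implementation the authors used.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)

# Simulated units: quality q and cost per student (illustrative data only).
n = 200
q = rng.uniform(0.0, 100.0, n)
cost = 50.0 + 0.5 * q + rng.exponential(15.0, n)     # one-sided term raises cost

# Stage 1a: caliper grid search. Slide a window of width c up the q axis in
# steps s <= c; the cheapest unit inside each window is labeled efficient.
c, s = 10.0, 5.0                                     # our illustrative choices
efficient = set()
lo = q.min()
while lo <= q.max():
    window = np.where((q >= lo) & (q < lo + c))[0]
    if window.size:
        efficient.add(window[np.argmin(cost[window])])
    lo += s
eff = np.array(sorted(efficient))

# Stage 1b: run a Lowess smoother through the efficient units; each efficient
# unit's frontier cost is its predicted value on the smooth.
frontier = lowess(cost[eff], q[eff], frac=0.5, return_sorted=False)

# Stage 2: match every other unit to the efficient unit closest in q, and
# measure inefficiency as own spending minus the match's frontier spending.
rest = np.setdiff1d(np.arange(n), eff)
pos = np.abs(q[rest][:, None] - q[eff][None, :]).argmin(axis=1)
inefficiency = cost[rest] - frontier[pos]            # can be < 0 for units left
                                                     # below the smoothed frontier
print(f"{eff.size} efficient units; "
      f"mean excess spending of the rest: {inefficiency.mean():.1f}")
```

Note that each inefficient unit's benchmark is an index into the actual data, i.e. a named real-life unit, which is the transparency property emphasized above; the negative differences that can arise for units left below the smoothed frontier correspond to the first of the two clarifying points.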