Policy Research Working Paper 10051

Measuring What Matters: Principles for a Balanced Data Suite That Prioritizes Problem-Solving and Learning

Kate Bridges
Michael Woolcock

Development Economics, Development Research Group
May 2022

Abstract

Responding effectively and with professional integrity to the many challenges of public administration requires recognizing that access to more and better quantitative data is necessary but insufficient. Overreliance on quantitative data comes with its own risks, of which public sector managers should be keenly aware. This paper focuses on four such risks. The first is that attaining easy-to-measure targets becomes a false standard of broader success. The second is that measurement becomes conflated with what management is and does. The third is that measurement inhibits a deeper understanding of the key policy problems and their constituent parts. The fourth is that political pressure to manipulate key indicators can lead, if undetected, to falsification and unwarranted claims or, if exposed, to jeopardizing the perceived integrity of many related (and otherwise worthy) measurement efforts. Left unattended, the cumulative concern is that these risks will inhibit rather than promote the core problem-solving and implementation capabilities of public sector organizations, an issue of high importance everywhere but especially in developing countries. The paper offers four cross-cutting principles for building an approach to the use of quantitative data—a "balanced data suite"—that strengthens problem-solving and learning in public administration: (1) identify and manage the organizational capacity and power relations that shape data management; (2) focus quantitative measures of success on those aspects which are close to the problem; (3) embrace a role for qualitative data, especially for those aspects that require in-depth, context-specific knowledge; and (4) protect space for judgment, discretion, and deliberation in those (many) decision-making domains that inherently cannot be quantified.

This paper is a product of the Development Research Group, Development Economics. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at k8bridges@gmail.com and mwoolcock@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.
Produced by the Research Support Team Measuring What Matters: Principles for a Balanced Data Suite That Prioritizes Problem- Solving and Learning Kate Bridges (Independent Consultant) and Michael Woolcock (World Bank)1 JEL codes: C80, H83, O20 Keywords: Mixed Methods, Public Administration, Data Curation, Organizational Learning, Problem Solving 1 The views expressed in this paper are those of the authors alone, and should not be attributed to the World Bank, its executive directors, or the countries they represent. Our thanks to Galileu Kim, Daniel Rogger, Christian Schuster and participants at an authors’ workshop for helpful comments and constructive suggestions. More than twenty years of collaboration with Vijayendra Rao have also deeply shaped the views expressed herein. Remaining errors of fact or interpretation are solely ours. The final version of this paper will be a chapter in Daniel Rogger and Christian Schuster (eds.) Government Analytics: An Empirical Guide to Measurement in Public Administration (Washington, DC: World Bank, forthcoming). I. Introduction “What gets measured gets managed; and what gets measured gets done” is one of those ubiquitous (even clichéd) management phrases that hardly requires explanation; it seems immediately obvious that the data generated by regular measurement and monitoring is what makes possible the improvement of results. Less well known than the phrase itself is the fact that although it is commonly attributed to the acclaimed management theorist Peter Drucker, Drucker himself never actually said it. 2 In fact, Drucker’s views on the subject were reportedly far more nuanced, along the lines of V. F. Ridgway, who argued over 65 years ago that not everything that matters can be measured and not everything that can be measured matters (Ridgway 1956). Simon Caulkin, a contemporary business management columnist, neatly summarized Ridgway’s argument, in the process expanding the truncated to-measure-is-to-manage phrase to “What gets measured gets managed — even when it’s pointless to measure and manage it, and even if it harms the purpose of the organisation to do so.” 3 Ridgway and Caulkin’s warnings – repeated in various guises by many since 4 – remind us that an indiscriminate usage of quantitative measures and an undue confidence in what they can tell us may turn out to be highly problematic in certain situations, sometimes derailing the very performance improvements that the data was intended to support (Merry et al 2015). We hasten to add, of course, that seeking more and better quantitative data is clearly a worthy aim in public administration (and elsewhere). Many important gains in human welfare (e.g., recognizing and responding to learning disabilities) can be directly attributed to interventions conceived and prioritized on the basis of the empirical documentation of the reality, scale, and consequences of the underlying problem. The wonders of modern insurance are possible because actuaries can quantify all manner of risks over time, space and groups. What we will argue in the following sections, however, is that access to quantitative data alone is not a sufficient condition for achieving many of the objectives that are central to public administration and economic development. This paper has five sections. Following this Introduction (Section I), we lay out in Section II the ways in which the collection, curation, analysis, and interpretation of data is embedded in contexts: no aspect takes place on a blank slate. 
2 According to the Drucker Institute; see https://www.drucker.institute/thedx/measurement-myopia/ (accessed 6 December 2021).

3 See https://www.theguardian.com/business/2008/feb/10/businesscomment1 (accessed 6 December 2021).

4 See, for example, former USAID Administrator Andrew Natsios (2011), citing Lord Wellington in 1812, on the insidious manner in which measures of 'accountability' can compromise rather than enable central policy objectives (in Wellington's case, winning a war). For his part, Stiglitz has argued that "What you measure affects what you do. If you don't measure the right thing, you don't do the right thing." (As quoted in the New York Times, October 4, 2009.) Pritchett (2014), exemplifying this point, notes (at least at the time of his writing) that the Indian state of Tamil Nadu had 817 indicators for measuring the delivery of public education, but none that actually assessed whether students were learning – in this instance, an abundance of "measurement" and "data" was entirely disconnected from (what should have been) the policy's central objective. In many cases, however, it is not always obvious, especially ex ante, what constitutes the "right thing" to measure – hence the need for alternative methodological entry points to elicit what these might be.

On one hand, the institutional embeddedness of the data collection and usage cycle – in rich and poor countries alike – leaves it susceptible to a host of possible ways in which subsequent delivery efforts may be compromised, stemming from an organization's lack of capability to manage and deploy data in a consistently professional manner. At the same time, the task's inherent political and social embeddedness ensures that it will be susceptible to influence by existing power dynamics and the normative expectations of those leading and conducting the work, especially when the political and financial stakes are high. In contexts where much of everyday life transpires in the informal sector – thereby rendering it "illegible" to, or enabling it to actively avoid engagement with, most standard measurement tools deployed by public administrators – sole reliance on formal quantitative measures will inherently only capture a slice of the full picture.

In Section III, we highlight four specific ways in which an indiscriminate increase in the collection of what is thought to be "good data" can lead to unintended and unwanted (potentially even harmful) consequences. The risks are that: (1) the easy-to-measure becomes a misleading or false measure of broader reality; (2) measurement becomes conflated with what management is and does; (3) an emphasis on what is readily quantified inhibits a fuller and more accurate understanding of the underlying policy problem(s) and their constituent elements; and (4) political pressure to manipulate selected indicators leads, if undetected, to falsification and unwarranted expectations – or, if exposed, to the perceived compromised integrity of otherwise worthy measurement endeavors. Thankfully, there are ways to anticipate and mitigate these risks and associated unintended consequences. Having flagged how unwanted outcomes can emerge, we proceed to highlight, in Section IV, some practical ways in which public administrators might thoughtfully anticipate, identify, and guard against them.
We discuss what a balanced suite of data tools might look like in public administration and suggest four principles that can help us apply these tools to the greatest effect, thereby enabling the important larger purposes of data to be served. We stress from the outset that our concerns are not with methodological issues per se, or the quality or comprehensiveness of quantitative data; these are addressed elsewhere, in every econometrics textbook, and should always be considered as part of doing ‘normal social science’. The concerns we articulate are salient even in a best-case scenario, in which analysts have access to great data acquired from a robust methodology, though obviously they are compounded when the available data is of poor quality – as is often the case, especially in low-income countries – and when too much is asked of it. II. How data is impacted by the institutional and socio-political environment in which it is collected For all administrative tasks, but especially those entailing high-stakes decision-making, the collection and use of data is a human process inherently subject to human foibles (Porter 1995). Much of this is widely accepted and understood: for example, key conceptual constructs in development (such as ‘exclusion’, ‘household’, ‘fairness’) can mean different things to different people and translate awkwardly into different languages. With this in mind professional data collectors will always give serious attention to ‘construct validity’ concerns in an effort to ensure 3 there is close alignment between the questions they ask and the questions their informants hear. 5 For present purposes we draw attention to issues given less attention, but which are critical nonetheless, namely institutional and political factors that together comprise the context shaping which data is (and is not) collected, how it is collected, from whom, how well it is curated over time, and how carefully conclusions and their policy implications are drawn from analyses of it. We briefly address each item in turn. i. Institutional embeddedness of data Beyond the purposes to which it is put, the careful collection, curation, analysis, and interpretation of public data is itself a complex technical and administrative task, requiring broad, deep, and sustained levels of organizational capability. In this section, we briefly explore three institutional considerations shaping these factors: dynamics shaping the (limited) ‘supply’ and refinement of technical skills; the forging of a professional culture that is a credible mediator of complex (potentially heated) policy issues yet sufficiently robust to political pressure; and the related capacity to infer what even the best data analysis ‘means’ for policy, practice, and problem-solving. These issues apply in every country, but are especially salient in low-income countries, where the prevailing level of implementation capability in the public sector is likely to be low, and where the corresponding expectations of those seeking to improve it by expanding the collection and use of quantitative data may be high. At the individual level, staff with the requisite quantitative analytical skills are likely to be in short supply, because acquiring such skills requires considerable training, while those who do have them are likely to be offered much higher pay in the private sector. 
(One could in principle 'outsource' some data collection and analysis tasks to external consultants, but doing so would be enormously expensive and potentially compromise the integrity and privacy of unique public data.) So understood, it would be unreasonable to expect the 'performance' of data-centric public agencies to be superior to other service-delivery agencies in the same context (e.g., public health). Numerous studies suggest the prevailing levels of implementation capability in many (if not most) low-income countries are far from stellar (Andrews et al 2017). 6 For example, Jerven's (2013) important work in Africa on the numerous challenges associated with maintaining the System of National Accounts – the longest-standing economic data collection task asked of all countries, from which their respective GDP is determined – portends the difficulties facing less high-profile metrics (see also Sandefur and Glassman 2015). 7 Put differently: if many developing countries struggle to curate the single, longest-standing, universally-endorsed, most important measure asked of them, on what basis do we expect these countries to manage lesser, lower-stakes measures? To be sure, building quantitative analytical skills in public agencies is highly desirable; for present purposes, our initial point is a slight variation on the old adage 8 that the quality of outcomes derived from quantitative data is only as good as the quality of the 'raw material' and the competence with which it is analyzed and interpreted. Fulfilling an otherwise noble ambition to build a professional public sector whose decisions are informed by 'evidence' requires a prior and companion effort to build the requisite skills and sensibilities.

5 Social science methodology courses classically distinguish between four key issues that are at the heart of efforts to make empirical claims in applied research: (1) 'construct validity' (the extent to which any concept, such as 'corruption' or 'poverty', matches particular indicators), (2) 'internal validity' (the extent to which causal claims have controlled for potential confounding factors, such as sample selection bias), (3) 'external validity' (the likelihood that claims are generalizable at larger scales, to more diverse populations, or novel contexts), and (4) 'reliability' (the extent to which similar findings would be reported if repeated or replicated by others). See, among many others, Johnson et al (2019). Of these four issues, qualitative methods are especially helpful in ensuring construct validity, since certain terms may mean different things to different people in different places, complicating matters if one seeks to draw comparisons across different linguistic/cultural/national contexts. In survey research, for example, it is increasingly common to include what is called an 'anchoring vignette' – a short real-world example of the phenomena in question, such as an instance of 'corruption' by a government official at a port – before asking the formal survey question so that cross-context variations in interpretation can be calibrated accordingly (see, among others, King and Wand 2007). Qualitative methods can also contribute to considerations pertaining to internal validity (Cartwright 2017) and external validity – helping to identify the conditions under which findings 'there' might apply 'here' (Woolcock 2018; see also Cartwright and Hardie 2012).
Put differently, precisely because effective data management is itself such a complex and difficult task, in contexts where agencies struggle to implement even basic policy measures at a satisfactory level (e.g., delivering mail, ensuring attendance at work) it is unlikely that, ceteris paribus, invoking such agencies to also take a more ‘data driven’ approach will elicit substantive improvement. More and better ‘data’ will not fix a problem if the absence of such data is not itself the key problem or the ‘binding constraint’; as such, the priority issue is discerning what is in fact the key policy problem and its constituent elements. From this starting point, more and better data can be part of, but not a substitute for, strategies for enhancing the effectiveness of public sector agencies. Even if both data management and broad institutional capability are functioning at high and complementary levels, there remains the structural necessity of interpreting what the data means. Policy inference from even the best data and most rigorous methodology is never self-evident; it must always be undertaken in the light of theory. This might sound like an abstract academic concern, but it is especially important when seeking to draw lessons from, and/or make big decisions regarding the fate of, complex interventions. This is so because a defining characteristic of a complex problem is that it generates highly variable outcomes across time, space, and groups. 6 If such agencies/departments do in fact happen to perform especially strongly – in the spirit of the ‘positive deviance’ cases of government performance in Ghana provided in McDonnell (2020) – then it would be useful to understand how and why this has been attained. For present purposes, our point is that, perhaps paradoxically, we should not expect, ex ante, that agencies/departments in the business of collecting and curating data for guiding policy and performance to themselves be exemplary exponents of the deployment of that data to guide their own performance – because doing so is a separate ontological task, requiring distinctive professional capabilities. Like the proverbial doctors, if data analysts cannot “heal thyselves” we should not expect other public agencies to be able to do so merely by infusing them with more and better data. 7 A special issue of The Journal of Development Studies (Volume 51, Issue 2) was dedicated to this issue. For example, on the enduring challenges associated with agricultural data – another sector with a long history of data collection experience – see Carletto, Jolliffe, and Banerjee (2015). 8 Popularly known as GIGO: garbage in, garbage out. 5 Promoting gender equality, for example, is a task that rarely generates rapid change – it can take a generation (or several, or centuries) for rules invoking/requiring equal participation in community meetings, or equal pay for equal work, to become the ‘new normal’. 9 So, assessed over a five-year timeframe, a ‘rigorous’ methodology and detailed data may likely yield an empirical finding showing that a given Gender Empowerment Project (GEP) has had “no impact”; taken at face value, this is precisely what ‘the data’ would show and the type of policy conclusion (“GEP doesn’t work”) that would be drawn. 
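To make the timing problem concrete, here is a minimal stylized sketch; the trajectory shape and all numbers are our own illustrative assumptions, not estimates from any actual GEP-type evaluation:

```python
# Stylized illustration (assumed numbers, not from any actual evaluation):
# an intervention whose true impact follows a long "stasis then take-off"
# trajectory registers essentially no effect when judged at year 5.
import math

def true_impact(year, ceiling=0.40, midpoint=15, steepness=0.5):
    """Hypothetical logistic impact trajectory: share of communities adopting
    the new norm, approaching `ceiling` only after a long period of stasis."""
    return ceiling / (1 + math.exp(-steepness * (year - midpoint)))

for year in (5, 10, 15, 20, 25):
    print(f"year {year:>2}: measured impact = {true_impact(year):.3f}")

# year  5: measured impact = 0.003   (a 5-year evaluation reads "no impact")
# year 15: measured impact = 0.200   (take-off well under way)
# year 25: measured impact = 0.397   (near the long-run ceiling)
```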
However, interpreted in the light of a general theory of change incorporating the likely impact trajectory that GEP-type interventions follow – i.e., a long period of stasis eventually leading to a gradual but sustained take-off – a "doesn't work" conclusion would be unwarranted; five years is simply too soon to draw such a firm conclusion (Woolcock 2018). 10 High quality data and a sound methodology alone cannot solve this problem: GEP may well be fabulous, indifferent, useless, or a mixture of all three, but discerning which of these it is – and why, where, and for whom it functions in the way it does – will require the incorporation of different kinds of data into a close dialogue with a practical theory of change fitted for this sector, this context, and the development problem being addressed.

9 See the evolution of the early and subsequent work on gender inclusion in rural India (Ban and Rao 2007, Duflo 2012, Sanyal and Rao 2018).

10 This does not mean, of course, that nothing can be said about GEP after five years – managers and funders would surely want to know by this point whether the apparent "no net impact" claim is a result of (a) poor technical design; (b) weak implementation; (c) contextual incompatibility; (d) countervailing political pressures; or (e) insufficient time having elapsed. Moreover, they would likely be interested in learning whether GEP's zero "average treatment effect" is nonetheless a process of offsetting outcomes manifest in a high standard deviation (meaning GEP works wonderfully for some groups in some places but disastrously for others), and/or is yielding unanticipated or unmeasured outcomes (whether positive or negative). For present purposes, our point is that reliance on a single form and methodological source of data is unlikely to be able to answer these crucial administrative questions; with a diverse suite of methods and data, however, such questions become both ask-able and answerable. (See Rao et al 2017 for an instructive example, discussed below.)

ii. Socio-political embeddedness of data

Beyond these institutional concerns, a second important form of embeddedness shaping data collection, curation, and interpretation is the manner in which all three are shaped by socio-political processes and imperatives. All data is compiled for a purpose; in public administration, assembling data of the required scale and sophistication is costly and complex (therefore requiring significant financial outlay and thus competing with rival claimants). Data is frequently called upon to adjudicate both the merits of policy proposals ex ante (e.g., the Congressional Budget Office in the US) and the effectiveness of programmatic achievements ex post (the World Bank's Independent Evaluation Group), which often entails entering into high-stakes political gambits: e.g., achieving signature campaign proposals in the early days of an administration and proclaiming their subsequent widespread success (or failure) as election time beckons again. (See more on this below.) Beyond the intense political pressure 'data' is asked to bear in such situations, a broader institutional consideration is the role large-scale numerical information plays in "rendering legible" (Scott 1998) complex and inherently heterogeneous realities, such that they can be managed, mapped, and manipulated for explicit policy purposes.
We hasten to add that such “thin simplifications” (Scott’s term) of reality can be both benign and widely beneficial: comprehensive health insurance programs and pension systems have largely tamed the otherwise debilitating historical risks of, respectively, disease and old age by generating premiums based on general demographic characteristics and the likelihood of experiencing different kinds of risks (e.g., injuries, cancer) over the course of one’s life. A less happy aspect of apprehending deep contextual variation via simplified (often categorical) data, however, is the corresponding shift it can generate in the political status and salience of social groups. The deployment of the census in colonial India, for example, is one graphic demonstration of how the very act of ‘counting’ certain social characteristics – such as the incidence of caste, ethnicity and religion – can end up changing these characteristics themselves, rendering what had heretofore been relatively ‘fluid’ and ‘continuous’ categories as ‘fixed’ and ‘discrete’. In the case of India, this massive exercise in data collection on identity led to “caste” being created, targeted and mobilizable as a politically salient characteristic that had (and continues to have) deep repercussions (e.g., at independence, when Pakistan split from India, and more recently the rise of Hindu nationalism) (see Dirks 2011). 11 More recently, influential scholars have argued that the infamous Hutu/Tutsi massacre in Rwanda was possible at the scale at which it was enacted because of ethnic categories being formalized and fixed via public documents whose origins lie in colonial rule (e.g., Mamdani 2002). For Scott (1998), public administration can only function to the extent its measurement tools successfully turn wide-spread anthropological variation, such as languages spoken, into singular modern categories and policy responses (e.g., to ensure that education is conducted in one national language, in a school, on the basis of a single curriculum 12); the net welfare gains to society might be unambiguous, but poorer, isolated, marginalized, and less numerous groups are likely to bear disproportionately the costs of this trade-off. If official ‘data’ itself constitutes an alien or distrusted medium by which certain citizens are asked to discern the performance of public agencies, merely providing (or requiring) “more of it” is unlikely to bring about positive change. In such circumstances, much antecedent work may need to be undertaken to earn the necessary trust from citizens, and to help them more confidently engage with their administrative systems. 13 By way of reciprocity, perhaps it will also require such systems to interact with 11 One could say that this is a social scientific version of the Heisenberg Uncertainty Principle, in which the very act of measuring something changes it. See also Breckenridge (2014) on the politics and legacy of identity measurement in pre- and post-colonial South Africa, and Hostetler (2021) on the broader manner in which imposing singular (but often alien) measures of time, space and knowledge enabled colonial administration. More generally, Sheila Jasanoff’s voluminous scholarship shows how science is a powerful representation of reality, which when harnessed to technology can reduce “individuals to standard classifications that demarcate the normal from the deviant and authorize varieties of social control” (Jasanoff 2004: 13). 
12 Among the classic historical texts on this issue are Peasants into Frenchman (Weber 1976) and Imagined Communities (Anderson 1983). For more recent discussions, see Lewis (2015) on “the politics and consequences of performance measurement” and Beraldo and Milan (2019) on the politics of Big Data. 13 This is the finding, for example, from a major empirical assessment of cross-country differences regarding Covid- 19 (Bollyky et al 2022), wherein – controlling for a host of potential confounding variables – those countries with both high infections and high fatalities are characterized by low levels of trust between citizens and their government, and between each other. See further discussion on this study and its implications below. 7 citizens themselves in ways that more readily comport with citizens’ own everyday (but probably rather different) vernacular for apprehending the world, interpreting events, and responding to them. Either way, it is critical that officials be wary of the potentially negative or unintended effects of data collection, even when it may begin with a benign intention to facilitate social inclusion and more equitable policy ‘targeting’. 14 III. The unintended consequences of an indiscriminate pursuit of “more data” There is a sense in which it is axiomatic that more and better data is always a good thing. But the institutional and socio-political embeddedness of data generation and its use in public administration (as discussed in the preceding section) means we need to qualify this otherwise laudable assertion by focusing on where and how challenges can arise. With that in mind, we turn our attention in this section to those instances wherein the increased collection of what is thought to be “good data” has led to perverse outcomes. Here we highlight four such outcomes that may materialize as the result of an undue focus on issues, concepts, inputs or outcomes which happen to be most amenable to being quantified. 1. The easy to measure may become a false standard of success What may start as a well-intentioned managerial effort to be better at quantifying meaningful success can end up generating instead a blinkered emphasis on that which is simply most easy to quantify. The result can be a skewed or false sense of what a project has (or has not) achieved, and how, where, and for whom outcomes were achieved. In a recent study, we demonstrate how a variety of institutional incentives align across the Government of Malawi and the World Bank in such a way that both Government of Malawi and World Bank officials consistently favor easy-to-measure indicators (inputs and outputs, or what we refer to as “changes in form rather than function”) as the yardstick of project success (Bridges and Woolcock 2017). 
This was a quintessential example of what strategy writer Igor Ansoff describes as a situation in which "managers start off trying to manage what they want, and finish up wanting what they can measure." 15 As a result of evaluating Public Financial Management (PFM) projects that were implemented over the course of twenty years in Malawi, we show that almost 70% of what projects measure or aim for is "change in terms of whether institutions look like their functioning counterparts (i.e., have the requisite structures, policies, systems and laws in place)" whereas only 30% of what is measured can be said to be "functional" – that is, focused on "purposeful changes to budget institutions aimed at improving their quality and outcomes" (Andrews 2013: 7). What's more, we find that World Bank PFM projects have considerably more success in achieving the "form" results than the "functional" ones. Unsurprisingly, demonstrable improvements in actual performance are far harder to achieve than changes that are primarily regulative, procedural or systems oriented. Unfortunately, an emphasis on what is easy-to-measure obfuscates this reality and allows reform "success" to be claimed. In practice, Malawi's history of PFM reform is littered with projects that claim "success" based on hardware procured, software installed, legislation developed, and people trained, whereas even a basic analysis reveals stagnation or even regression in terms of more affordable spending decisions, spending that reflects budgeted promises, greater ability to track the flow of funds, or a reduction in corruption. As long as the World Bank and the Malawian government focus on the "form" measures, they are able to maintain the illusion of success. That is, until something like Malawi's 2013 "Cashgate" crisis – in which about US$ 32 million in government funds were spectacularly revealed to have been misappropriated between April and September 2013 – lifts the lid on the deep-rooted financial management problems that have remained largely unaffected by millions of dollars of reform efforts. In this sense, Malawi is a microcosm of many institutional reform efforts globally. Although similar financial reforms have been implemented globally in a manner that suggests some level of consensus about "what works", the outcomes of those reforms are varied at best and often considerably lower than anticipated (Andrews 2013).

14 The British movie 'I, Daniel Blake' provides a compelling example of how even the literate in rich countries can be excluded by administrative systems and procedures that are completely alien to them – e.g., filling out forms for unemployment benefits on the Internet that require them to first "log on" and then "upload" a "CV". The limits of formal measurement to bring about positive policy change have long been recognized; when the Victorian-era writer George Eliot was asked why she wrote novels about the lives of the downtrodden rather than contribute to official government reports more formally documenting their plight, she astutely explained that "appeals founded on generalizations and statistics require a sympathy ready-made, a moral sentiment already in activity…" (cited in Gill 1970:10). Forging such a Smithian 'sympathy' and 'moral sentiment' is part of the important antecedent work that renders 'generalizations and statistics' legible and credible to those who might otherwise have no reason for engaging with, or experience interpreting, such encapsulations of reality.

15 Quoted in Cahill (2017: 152).
In the same way that an emphasis on the easy-to-measure can lead to an over-estimation of success, it can also contribute to an under-estimation. Reforms can sometimes yield meaningful change via what McDonnell (2017) calls "the animating spirit of daily practice" but end up being missed because managers do not have good means of measuring, attributing, and enhancing these kinds of shifts. For example, when researching the impact of technical assistance to a large government health program in Nigeria, we found that there were strong indications of important innovations and shifts taking place at the local level, including in aspects as difficult to shift as cultural practices regarding contraceptives (Bridges and Woolcock 2019). These shifts in practice and their impact on contraceptive uptake could not be apprehended by aggregated state-wide indicators, however, and since no measurement was being done below this level, the progress and valuable lessons of such interventions were being missed.

Another example of the importance of having access to a broader suite of data comes from an assessment of a program in rural India seeking to promote participatory democracy in poor communities, where the curation of such a data suite enabled more nuanced and constructive lessons to be drawn (see Rao, Ananthpur, and Malik 2017). The results of the initial randomized controlled trial (RCT) deemed the program to have had no mean impact – and if that was the only data available, that would have been the sole conclusion reached. 16 Upon closer inspection, however, it was learned that there was in fact considerable variation in the program's impact. The average of this variation may have been close to zero, but for certain groups the program had worked quite well, for others it had had no impact, while for still others it had been detrimental. Who were these different groups, and what was it about them that led to such variable outcomes? A companion qualitative process evaluation 17 was able to discern that the key differences were the quality of implementation received by different groups, the level of support provided to them by managers and political leaders, and variations in the nature and extent of local-level inequalities (which in turn shaped which groups were able to participate, and on what terms). The administrative rules and implementation guidelines provided to all groups were identical, but in this case a qualitative process evaluation was able to document the ways and places in which variable fidelity to them yielded widely different outcomes (albeit with no net impact). Moreover, the qualitative data was able to discern subtle positive effects from the program that reliance on the quantitative survey instrument alone would have missed.

16 We fully recognize that, in principle, econometricians have methods available to both identify outcome heterogeneity and the factors driving it. Even so, if local average treatment effects are reported as zero the 'no impact' conclusion is highly likely to be the (only) key take-away message. The primary benefit of incorporating both qualitative and econometric methods is the capacity of the former to identify factors that were not anticipated in the original design (see Rao 2022). In either case, Ravallion's (2001) injunction to "look beyond averages" when engaging with complex phenomena deserves to be heeded by all researchers (and those that interpret the researchers' findings), no matter their disciplinary or methodological orientation.

17 On the use of mixed methods in process evaluations, see Rogers and Woolcock (forthcoming).

2. Measurement becomes conflated with management

An extension of the above point is that undue emphasis on quantitative data can lead to measurement becoming a substitute for rather than a complement to management. This is evident when only that which is quantifiable receives any significant form of managerial attention, an outcome made possible when the easily quantifiable becomes the measure of success, in turn becoming the object of management's focus, typically to the exclusion of all else. As Wilson (1989: 161) famously intoned in his classic study of bureaucratic life, "[w]ork that produces measurable outcomes tends to drive out work that produces immeasurable outcomes". In one sense this is hardly surprising; the need for managers to make decisions on the basis of partial information is difficult and feels risky, so anything that claims to fill that gap and bypass the perceived uncertainty of subjective judgement will be readily welcomed. "The result", Simon Caulkin argues, "both practically and theoretically, is to turn today's management into a technology of control that attempts to minimise rather than capitalise on the pesky human element." 18 And in public administration, a managing-it-by-measuring-it bias can mean that, over time, the bulk of organizational resources is directed away from the "pesky human element" of change processes, despite the fact that it is this element which is often central to attaining any transformational outcomes managers are seeking.

18 https://www.treasurers.org/hub/treasurer-magazine/decision-making-how-make-most-data

This dynamic characterizes key aspects of the Saving One Million Lives (SOML) initiative, an ambitious health sector reform program launched by the Government of Nigeria. The original goal of SOML was to save the lives of one million mothers and children by 2015; to this end, SOML gave priority to a package of health interventions known as 'the six pillars'. 19 The World Bank actively supported SOML, using its "Program for Results" instrument 20 to financially reward Nigerian states based on improvements from their previous best performance on six key indicators. 21 Improvements were to be measured through yearly household surveys providing robust estimates at the state level. In practice, of course, these six pillars (or intervention areas) were wildly different in their drivers and complexity – improvement within them was therefore destined to move at different trajectories and different speeds for different groups in different places. State actors, keen to raise their aggregate measure of success and get paid for it, soon realized that there was some gaming to be done. Our field research documented how the emphasis on singular measures of success introduced a perverse incentive for states to focus on the easier metrics at the expense of the hard (Bridges and Woolcock 2019).
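The incentive arithmetic can be made concrete with a stylized sketch. The US$325,000-per-percentage-point rule follows the Program Appraisal Document wording quoted in note 21 below; the per-indicator "gains per unit of effort" and the historical baseline are invented for illustration only:

```python
# Stylized arithmetic of a pay-for-results formula that rewards the SUM of
# gains across six very different indicators. The $325,000-per-point rule
# follows the Program Appraisal Document wording quoted in note 21 below;
# all other numbers are invented for illustration.
GRANT_PER_POINT = 325_000   # US$ per percentage point above the historical gain
HISTORICAL_AVG_GAIN = 2.0   # assumed average annual gain, in summed points

# Assumed annual percentage-point gain a state can buy with one unit of effort:
# "easy" indicators move quickly, "hard" ones barely move.
gain_per_unit_effort = {
    "Vitamin A supplementation":   3.0,   # easy: campaigns move this fast
    "Pentavalent 3 immunization":  1.5,
    "ITN use by children under 5": 1.2,
    "Skilled birth attendance":    0.4,   # hard: behavioural and system change
    "Contraceptive prevalence":    0.3,
    "PMTCT of HIV":                0.5,
}

def payout(effort_allocation):
    """Grant earned given units of effort allocated to each indicator."""
    total_gain = sum(gain_per_unit_effort[k] * e for k, e in effort_allocation.items())
    return max(0.0, total_gain - HISTORICAL_AVG_GAIN) * GRANT_PER_POINT

balanced = {k: 1.0 for k in gain_per_unit_effort}   # six units of effort spread evenly
gamed = {k: (6.0 if k == "Vitamin A supplementation" else 0.0)
         for k in gain_per_unit_effort}             # same six units, all on the easy metric

print(f"balanced effort : ${payout(balanced):,.0f}")   # $1,592,500
print(f"gamed effort    : ${payout(gamed):,.0f}")      # $5,200,000
```

Under these assumed numbers, the same total effort earns more than three times the grant when concentrated on the indicator that moves fastest, which is precisely the kind of gaming incentive the field research described above documented.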
Interviews with state officials revealed that front-line staff were increasingly focusing their time and energies on those constituent variables that they discerned were easiest to accomplish (e.g., dispensing vitamin supplements) over those that were harder or slower – typically those that involved a plethora of “pesky human elements” – such as lowering maternal mortality or increasing contraceptive use. Thus, in selecting certain outcomes for measurement and managing these alone, others inevitably end up being sidelined. Likewise, a recent report on Results-based Financing (RBF) in the Education Sector (Dom et al. 2020) finds evidence of a “diversion risk” associated with the signposting effect of certain reward indicators, with important areas deprioritized because of the RBF incentive. For example, in Mozambique they find that an emphasis on simple process indicators and focus on targets appears to have led to officials diverting resources and attention away from “more fundamental and complex issues”, such as power dynamics in the school council, the political appointment of school directors, or the teachers’ use of training. Dom et al. also report evidence of “cherry- picking risks”, in which less costly or politically favored subgroups or regions see greater resources, in part because they are more likely to reach a target. For example, in Tanzania, they found evidence that the implementation of school rankings based on exam results was correlated with weaker students not sitting, presumably in an effort by the schools to raise average exam pass rates. This tendency becomes a particular issue when the sidelined outcomes end up being the ones we care most about. Andrew Natsios (2011), the former Director of the United States Agency for 19 The six pillars were (i) Maternal, newborn and child health; (ii) Childhood essential medicines and increasing treatment of important childhood diseases; (iii) Improving child nutrition; (iv) Immunization; (v) Malaria control; and (vi) the Elimination of Mother to Child Transmission (EMTCT) of HIV. 20 A PfR is one of the World Bank’s three financing instruments. Its unique features are that it uses a country’s own institutions and processes, and links disbursement of funds directly to the achievement of specific program results. Where ‘traditional’ development interventions proceed on the basis of ex ante commitments (e.g., to designated ‘policy reforms’, to the adoption of procedures compliant with international standards), PfR-type interventions instead reward attainment of predetermined targets, typically set on the basis of extrapolating from what recent historical trajectories have attained. 21 According to the Program Appraisal Document “each state would be eligible for a grant worth $325,000 per the percentage point gain they made above average annual gain in the sum of six indicators of health service coverage.” The six indicators are: Vitamin A, Pentavalent3 immunization, Use of ITNs by children under 5, Skilled birth attendance, Contraceptive prevalence rate, and Prevention of mother-to-child transmission of HIV. 
International Development (USAID, an organization charged with "demonstrating the impact of every aid cent that Congress approves"), argues compellingly that the tendency in aid and development towards what he called "Obsessive Measurement Disorder" (OMD) is a manifestation of a core dictum among field-based development practitioners – namely "that those development programs that are most precisely and easily measured are the least transformational, and those programs that are most transformational are the least measurable." The change we often desire most is in very difficult-to-measure aspects, such as people's habits, their cultural norms, leadership characteristics, or mindsets. This reality is also aptly illustrated in many anti-corruption efforts, whereby imported solutions have managed to change the easy-to-measure – new legislation approved, more cases brought, new financial systems installed, more training sessions held – but failed to shift cultural norms regarding the non-acceptability of whistle blowing or the social pressures for nepotism (Andrews 2013). A failure to measure and therefore manage these informal drivers of the problem ensures that any apparent reductions in fund abuses tend to be short-lived or illusory. This phenomenon is hardly limited to poor countries. A more brutal example of how what cannot be measured does not get managed, with disastrous results, can be found in the UK's National Health Service. While investigating the effects of competition in the NHS, Propper et al. (2008) discovered that the introduction of inter-hospital competition improved waiting times while also substantially increasing the death rate following emergency heart attacks. The reason for this was that waiting times were being measured (and therefore managed), while emergency heart-attack deaths were not tracked, and were thus neglected by management. The result was shorter waiting times but more deaths, owing to the choice of measure. The authors note that the issue here was not intent, but the extent to which one target consumed managerial attention, to the detriment of all else; as they put it, it "seems unlikely that hospitals deliberately set out to decrease survival rates. What is more likely is that in response to competitive pressures on costs, hospitals cut services that affected [heart-attack] mortality rates, which were unobserved, in order to increase other activities which buyers could better observe" (Propper et al. 2008). More recently, in October 2019 the Global Health Security Index sought to assess which countries were "most prepared" for a pandemic, using a model that gave the highest ranking to the United States and the United Kingdom, largely on the basis of these countries' venerable medical expertise and technical infrastructure, factors which are readily measurable. 22 Alas, the model did not fare so well when an actual pandemic arrived soon thereafter: a subsequent analysis, published in the Lancet on the basis of pandemic data from 177 countries between January 2020 and September 2021, found that "[p]andemic-preparedness indices … were not meaningfully associated with standardised infection rates or IFRs [infection/fatality ratios]. Measures of trust in the government and interpersonal trust, as well as less government corruption, had larger, statistically significant associations with lower standardised infection rates" (Bollyky et al 2022, p. 1).
Needless to say, variables such as ‘trust’ and ‘government corruption’ are (a) hard to measure, (b) hard to incorporate into a single theory anticipating or informing a response to a pandemic, and (c) map awkwardly onto any corresponding policy instrument. For present purposes, the 22 See https://www.statista.com/chart/19790/index-scores-by-level-of-preparation-to-respond-to-an-epidemic/ 12 inference we draw from these findings is not that global indices have no place; rather, it suggests the need, from the outset, for curating a broad suite of data when anticipating and responding to complex policy challenges, the better to promote real-time learning. Doubling down on what can be readily measured limits the space for eliciting those ‘unobserved’ (and perhaps unobservable) factors that may turn out to be deeply consequential. 3. An emphasis on the easy to quantify inhibits understanding of the foundational problem An indiscriminate emphasis on aggregated, quantitative data can erode important nuances about the root causes of the problems we want to fix, thereby hampering our ability to craft appropriate solutions and undermining the longer-term problem-solving capabilities of an organization. All too often the designation of indicators and targets has the effect of causing people to become highly simplistic about the problems they are trying to address. In such circumstances, what should be organizational meetings held to promote learning and reflection on what is working and what is not become instead efforts in accounting and compliance. Reporting, rather than learning, is incentivized and management increasingly focuses on meeting the target numbers rather than solving the problem. Our concern here is that, over time, this tendency progressively erodes an organization’s problem-solving capabilities. The education sector is perhaps the best illustration of this; time and again practitioners have sought to codify “learning” and time and again this has resulted in an obfuscation of the actual causes underlying the problem. In a well-intentioned effort to raise academic performance, “the standards movement” in education promoted efforts hinged on quantitative measurement, as reported in the league tables of the Program for International Student Assessment (PISA). 23 PISA runs tests in mathematics, reading, and science every three years with groups of fifteen- year-olds in countries around the world. Testing on such scale requires a level of simplicity and “standardization”, thus the emphasis is on written examinations and extensive use of multiple- choice tests so that students’ answers can be easily codified and processed (Robinson and Aronica 2015). Demonstrating competence on fundamental learning tasks certainly has its place, but critics have increasingly argued that such tests are based on an incorrect assumption that what drives successful career and life outcomes is the kind of learning that is capable of being codified via a standardized test (Khan 2021, Claxton and Lucas 2015). In reality, the gap between the skills that children learn and are tested for, and the skills that they need to excel in the 21st century, is becoming more obvious. The World Economic Forum noted in 2016 that the traditional learning captured by standardized tests falls short of equipping students with the knowledge they need to thrive. 
24 Yong Zhao, the presidential chair and director of the Institute for Global and Online Education in the College of Education at the University of Oregon, points out that there is in fact an inverse relationship between those countries that excel on PISA tests and those that excel in aspects like entrepreneurism, for example (see Figure 1). 25

Figure 1: The inverse relationship between those countries who excel on PISA tests and those that excel in entrepreneurism

While a focus on assessing learning is laudable – and a vast improvement over past practices (e.g., in the Millennium Development Goals) of merely measuring attendance (World Bank 2018) – for present purposes the issue is that the drivers of learning outcomes are far more complex than a quantifiable content deficit in a set of subjects. This is increasingly the case in the 21st century, which has brought with it a need for new skills and mindsets that go well beyond the foundational numeracy and literacy skills required during the Industrial Revolution (Robinson and Aronica 2015). A survey of chief human resources and strategy officers by the World Economic Forum finds a significant shift between 2015 and 2020 in the top skills future workers will need, with "habits of mind" like critical thinking, creativity, emotional intelligence and problem-solving ranking well ahead of any specific content acquisition. 26 None of this is to say that data does not have a role to play in measuring the success of an educational endeavor. Rather, the data task in this case needs to be informed by the complexity of the problem and the extent to which holistic learning resists easy quantification. 27

23 These tables are based on student performance in standardized tests in mathematics, reading, and science, which are administered by the Paris-based Organisation for Economic Co-operation and Development (OECD).

24 https://www.weforum.org/agenda/2016/03/21st-century-skills-future-jobs-students/; see also https://www.weforum.org/agenda/2016/01/the-10-skills-you-need-to-thrive-in-the-fourth-industrial-revolution: "Whereas negotiation and flexibility are high on the list of skills for 2015, in 2020 they will begin to drop from the top 10 as machines, using masses of data, begin to make our decisions for us. A survey done by the World Economic Forum's Global Agenda Council on the Future of Software and Society shows people expect artificial intelligence machines to be part of a company's board of directors by 2026. Similarly, active listening, considered a core skill today, will disappear completely from the top 10. Emotional intelligence, which doesn't feature in the top 10 today, will become one of the top skills needed by all."

25 http://zhaolearning.com/2012/06/06/test-scores-vs-entrepreneurship-pisa-timss-and-confidence/

26 World Economic Forum 2016 New Vision for Education: Fostering Social and Emotional Learning Through Technology. https://www3.weforum.org/docs/WEF_New_Vision_for_Education.pdf accessed February 2022.

27 Many companies and tertiary institutions are ahead of the curve in this regard.
Recently, over 150 of the top private high schools in the U.S., including Phillips Exeter and Dalton – storied institutions which have long relied on the status conveyed by student ranking – have pledged to shift to new transcripts that provide more comprehensive, qualitative feedback on students while ruling out any mention of credit hours, GPAs, or A–F grades. And colleges – the final arbiters of high school performance – are signaling a surprising willingness to depart from traditional assessments that have been in place since the early 19th century. From Harvard and Dartmouth to small community colleges, more than 70 U.S. institutions of higher learning have weighed in, signing formal statements asserting that competency-based transcripts will not hurt students in the admissions process.

Finally, relying exclusively on high level aggregate data can result in presuming uniformity in underlying problems, and thus lead to the promotion of simplistic and correspondingly generic solutions. McDonnell (2020) notes, for example, that because many developing countries have relatively high corruption scores, an unwelcome outcome has been that all the institutions in the country tend to be regarded by would-be-reformers as similarly corrupt and uniformly ineffectual. In her impressive research on "clusters of effectiveness", however, she offers evidence of the variation in public sector performance within states, noting how the aggregated data on 'corruption' masks the fact that the difference in corruption scores between Ghana's best- and worst-rated state agencies approximates the difference between Belgium (WGI = 1.50) and Mozambique (WGI = –.396), in effect "spanning the chasm of so-called developed and developing worlds." The tendency of reform actors to be guided by simplistic aggregate indicators – such as those that are used to determine a poor country's 'fragility' status and eligibility for IDA funding – has prevented a more detailed and context-specific understanding of lessons that could be drawn from positive outlier cases, 28 or what McDonnell refers to as "the thousand small revolutions quietly blooming in rugged and unruly meadows."

28 See Milante and Woolcock (2017) for a complementary set of dynamic quantitative and qualitative measures by which a given country might be declared a "fragile" state.

4. Pressure to improve select indicators leads to falsification and unwarranted impact claims (thereby jeopardizing the perceived integrity of broader measurement efforts)

As an extension of our previous point regarding how the easy-to-measure can become the yardstick for success, it is important to acknowledge that public officials are often going to be under extreme pressure to demonstrate success in these selected indicators. Once data itself, rather than the more complex underlying reality, becomes the primary objective by which entire governments publicly assess (and manage) their 'progress', it is essentially inevitable that vast political pressure will be placed on these numbers to bring them into alignment with expectations, imperatives, and interests. Similar logic can be expected at lower units of analysis (e.g., field offices), where it tends to be even more straightforward to manipulate data entry and analysis. This in turn contributes to a perverse incentive to falsify or skew data, to aggregate numbers across wildly different variables into single indices, and to draw unwarranted inferences from them. An example of when this risk is particularly acute is when annual "global rankings" are publicly released (assessing, for example, a country's 'investment climate', 'governance', and gender equity), thereby shaping major investment decisions, credit ratings, eligibility for funding from international agencies, and the fate of senior officials charged with "improving" their country's place in these global league tables. Readers will surely be aware of the case at the World Bank in September 2021, when an external review revealed that the 'Doing Business' indicators had been subjected to such pressure, with alterations being made to certain indicators from certain countries. 29 Such rankings are now omnipresent, and if they are not done by one organization then they will inevitably be done by another. Even so, as The Economist magazine concluded, some might regard the 'Doing Business' episode as "proof of 'Goodhart's law', which states that when a measure becomes a target, it ceases to be a good measure." At the same time, it pointed out that there is a delicate dance to be done here, since "the Doing Business rankings were always intended to motivate as well as measure, to change the world, not merely describe it" and "[i]f these rankings had never captured the imagination of world leaders, if they had remained an obscure technical exercise, they might have been better as measures of red tape. But they would have been worse at cutting it." 30 Such are the wrenching trade-offs at stake in such exercises, and astute public administrators need to engage in them with their eyes wide open. Even (or especially) at lower units of analysis, where there are perhaps fewer prying eyes or quality-control checks, the potential is rife for undue influence to be exerted on data used for political and budgetary allocation purposes. Fully protecting the integrity of data collection, collation and curation (in all its forms) should be a first-order priority, but so too is the need for deploying what should be standard 'risk diversification' strategies on the part of managers, namely, not relying on single numbers or methods to assess inherently complex realities.

29 https://thedocs.worldbank.org/en/doc/84a922cc9273b7b120d49ad3b9e9d3f9-0090012021/original/DB-Investigation-Findings-and-Report-to-the-Board-of-Executive-Directors-September-15-2021.pdf

30 https://www.economist.com/finance-and-economics/2021/09/17/how-world-bank-leaders-put-pressure-on-staff-to-alter-a-global-index

IV. Principles for an expansive, qualified data suite that fosters problem-solving and organizational learning

In response to the four risks identified above, we offer a corresponding set of cross-cutting principles for addressing them. Figure 2 summarizes the four risks in the left-hand column and presents the principles as vertical text on the right, illustrating the extent to which the principles, when applied in combination, can serve to produce a more balanced data suite that prioritizes problem-solving and learning.

Figure 2: Four risks with corresponding principles for mitigating them, to ensure a balanced data suite. (Note that the principles are 'cross-cutting', in the sense that they apply in some measure to all the risks, not one-to-one.)

1. Identify and manage the capacity and power dynamics that are going to shape your task

The data collection and curation process takes place not in isolation but in a densely populated political and institutional ecosystem.
It is difficult, expensive, and fraught work; building a professional team capable of reliably and consistently doing this work – from field-level collection, to curation at headquarters, to technical analysis and policy interpretation – will be as challenging as it is in every other public sector organization. Breakdowns can happen at any point, potentially compromising the integrity of the entire endeavor. As such, it is important for managers not just to hire those with the requisite skills but to cultivate, recognize and reward a professional ethos wherein staff are able to do their work in good faith, shielded from political pressure. Such practices, in turn, need to be protected by having clear, open and safe procedures staff can use for reporting undue pressure being placed upon them, complemented by accountability to oversight or advisory boards comprising several external members selected for their technical expertise and professional integrity. In the absence of such mechanisms, noble aspirations for pursuing an 'evidence-based policy' agenda risk becoming perceived as a means of providing merely 'policy-based evidence.'

The contexts within or from which data is collected are also likely to be infused with their own socio-political characteristics. Collecting data on the incidence of crime and violence, for example, requires police to faithfully record such matters and their response to them, but to do so in an environment where there may be strong pressures to under-report such matters, whether out of personal safety concerns, lack of adequate administrative resources, or pressure to show that a given unit's performance is improving (where this is measured by showing a 'lower incidence' of crime). In this respect, good diagnostic work will reveal the contours of the institutional and political ecosystem wherein the data work will be conducted, and the necessary authorization, financing, and protection sought; it will also help managers learn how to understand and successfully navigate that space. 31 32 The inherent challenges of engaging with such issues might be eased somewhat if those closest to them see data deployment not as an end in itself or an instrument of compliance but rather as a means to higher ends, namely learning, practical problem solving, and enhancing the quality of policy options, choices and implementation capability. 33

31 For development-oriented organizations, a set of tools and guidelines for this initial assessment – crafted by USAID and ODI (London) and adopted by certain parts of the World Bank – is 'Thinking and Working Politically Through Applied Political Economy Analysis: A Guide for Practitioners'. https://usaidlearninglab.org/sites/default/files/resource/files/pea_guide_final.pdf
32 Hudson et al (2016) offer a guide for "everyday Political Analysis", which introduces a stripped-back political analysis framework designed to help frontline practitioners make quick but politically informed decisions. It aims to complement more in-depth political analysis by helping programming staff to develop the 'craft' of political thinking in a way that fits their everyday working practices. https://www.dlprog.org/publications/research-papers/everyday-political-analysis
33 On the application of such efforts to the case of policing in particular, see Sparrow (2018).

A related point is that corresponding efforts need to be made to clearly and accurately communicate to the general public those findings that are derived from data managed by public administrators, especially when these findings are contentious or speak to inherently complex issues. This issue has been readily apparent during the Covid-19 pandemic, with high-stakes policy decisions (e.g., requiring vulnerable populations to forgo income) needing to be made on the basis of limited but evolving evidence. Countries such as Vietnam have been praised for the clear and consistent manner in which they issued Covid-19 response guidelines to citizens (Ravallion 2020), but the broader point is that even when the most supported decisions are based on the best evidence generated by the most committed work environments, it remains important for administrators to appreciate that the very acts of large-scale measurement and empirical interpretation, especially when enacted by large public organizations, can be threatening to or misunderstood by the very populations they are seeking to assist.
2. Focus the collection of quantitative data on those aspects closest to the problem

If we wish to guard against the tendency to falsely ascribe success based on the achievement of poorly selected indicators, then we should ensure that any indicators used to claim or deny reform success are as readily operational and close to the service delivery problem as possible. Output and process indicators are useful in their own way, but we should not make the mistake of conflating their achievement with "problem fixed". The tendency to claim reform success based on whether a new mechanism or oversight structure has been created, a new law passed, or a certain percentage of participation achieved often comes with strong institutional incentives, of course, but if meaningful change is sought, these need to be countered. All of these measures are changes in form that, while useful as indicators of outputs being met, can be achieved (and have been in the past) without any attendant functional shifts in the underlying quality of service delivery.

Officials can guard against this tendency by taking some time to ensure that an intervention is focused on specific problems – including those that matter at a local level – and that the metrics by which the intervention's success is judged are accurate measures of those problems being fixed. Tools like the PDIA Toolkit (see below) can help guide practitioners in this process. 34

34 The PDIA Toolkit: A DIY Approach to Solving Complex Problems, https://bsc.cid.harvard.edu/PDIAtoolkit (prepared by Salimah Samji, Matt Andrews, Lant Pritchett and Michael Woolcock), is designed by members of Harvard's Building State Capability program to guide government teams through the process of identifying, deconstructing and solving complex problems. See in particular the section on "Constructing your problem", which guides practitioners through the process of defining a problem that matters and building a credible, measurable vision of what success would look like.

Figure 3 illustrates the step-by-step "problem-driven iterative adaptation" (PDIA) approach, which is designed to help practitioners break down their problems into root causes, identify entry points, search for possible solutions, take action, reflect upon what they have learned, adapt and then act again. By embedding any intervention in such a framework, practitioners can ensure that success metrics are well-linked to functional, locally felt problems. Whatever tool is applied, the goal should be to arrive at metrics of success that represent a compelling picture of the root performance problem being addressed (and hopefully solved).

Figure 3: The Problem-Driven Iterative Adaptation (PDIA) process

So, to return to our education example, metrics such as the number of teachers hired, the percentage of budget dedicated to education, and the number of schools built are all output measures that say nothing about actual learning. Of course, the assumption is that these outputs lead to children learning, but as many recent studies now show, such assumptions are routinely mistaken; these indicators can be achieved even as actual learning regresses (Pritchett 2013, World Bank 2018). By contrast, when a robust measure of learning – in this case literacy acquisition – was applied in India, it allowed implementers to gain valuable insights about which interventions actually made a difference, revealing that teaching to a child's actual level of learning, not their age or grade, led to marked and sustained improvements. Crucially, such outcomes are the result of carefully integrated qualitative and quantitative approaches to measurement (Banerjee et al 2016).
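As a purely illustrative sketch – not part of the PDIA Toolkit itself, and with hypothetical field names and values – the snippet below shows how a team might record a problem deconstruction so that success is judged against the problem (children actually learning) rather than against output indicators:

```python
# A hypothetical sketch (not from the PDIA Toolkit) of how a reform team might
# record a problem deconstruction so that success is judged against the problem
# itself rather than against output or process indicators.
from dataclasses import dataclass, field

@dataclass
class ProblemRecord:
    problem: str                  # the locally prioritized performance problem
    root_causes: list[str]        # candidate causes to probe via entry points
    output_indicators: list[str]  # useful for tracking activity, not success
    problem_metrics: list[str]    # measures that would show the problem shrinking
    review_notes: list[str] = field(default_factory=list)  # reflections per iteration

    def reflect(self, note: str) -> None:
        """Log what was learned in the latest iteration before adapting the plan."""
        self.review_notes.append(note)

record = ProblemRecord(
    problem="Most grade 3 pupils in district X cannot read a simple paragraph",
    root_causes=["teaching pitched at curriculum level, not the child's level",
                 "no routine assessment of reading ability"],
    output_indicators=["teachers hired", "schools built", "share of budget on education"],
    problem_metrics=["share of grade 3 pupils reading at grade level (sample-based)"],
)
record.reflect("Piloted teaching-at-the-right-level groups; reassess literacy next term.")
print(record.problem_metrics)
```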
Going further, various cross-national assessments around the world are trying to tackle the complex challenge of finding indicators that measure learning not just in the acquisition of numeracy, science and literacy skills but in competencies that are proving to be increasingly valuable in the 21st century: grit, curiosity, communication, leadership and compassion. PISA, for example, has included an "innovative domain" in each of its recent rounds, including creative problem solving in 2012, collaborative problem solving in 2015, and global competence in 2018. In Latin America, the Latin American Laboratory for Assessment of the Quality of Education (LLECE) included a module on socio-emotional skills for the first time in its 2019 assessment of sixth grade students, focusing on the concepts of conscience, valuing of others, self-regulation and self-management. 35 Much tinkering remains to be done, but the increase in assessments that include skills and competencies such as citizenship (local and global), social-emotional skills, ICT literacy and problem solving is a clear indication of a willingness to have functional measures of success, capturing outcomes that matter.

35 https://www.globalpartnership.org/sites/default/files/document/file/2020-01-GPE-21-century-skills-report.pdf

In summary, then, those public administrators who wish to guard against unwarranted impact claims and ensure that metrics of success are credible can begin by making sure that the intervention itself is focused on a specific performance problem that is locally prioritized, and thereafter ensure that any judgement on that intervention's success or failure is based not on output or process metrics but on measures of the problem being fixed. And having ensured that measures of success are functional, it is important that practitioners allow flexibility of implementation where possible, so that strategies can shift if it becomes clear from the collected data that they are not making progress on fixing the problem, possibly due to mistaken assumptions in their theory of change.

3. Embrace an active role for qualitative data
The issues we have raised thus far, we argue, imply that public administrators should adopt a far more expansive concept of what constitutes "good data", namely one that includes insights from theory and qualitative research. Apprehending complex problems requires different forms and sources of data; correctly interpreting empirical findings requires active dialogue with reasoned expectations about what outcomes should be attained by when. Doing so helps avoid creating distortions that can generate (potentially wildly) misleading claims regarding "what's going on, and why" and "what should be done". Specifically, we advocate for the adoption of a complementary suite of data forms and sources that favors flexibility, is focused on problem-solving (as opposed to being an end in itself), and values insights derived from seasoned experience.

In the examples we have explored above, it was reliance on a single form of data (sometimes even a single number) that rendered it vulnerable to political manipulation, to unwarranted conclusions, and to being unable to bear the decision-making burdens thrust upon it. More constructively, it was the incorporation of alternative methods and data in dialogue with a reasoned theory of change that enabled decision-makers to anticipate and address many of these same concerns.

To this end, we have sought to get beyond the familiar presumption that the primary role of qualitative data and methods in public administration research (and elsewhere) is to provide distinctive insights into the idiosyncrasies of an organization's "context" and "culture" (and thus infuse some "color" and "anecdotes" for accompanying boxes). 36 Qualitative approaches can potentially yield unique and useful material that contributes to claims about whether policy goals are being met and delivery processes duly upheld (Cartwright 2017); these can be especially helpful when the realization of policy goals requires integrating both adaptive and technical approaches to implementation – e.g., in responding to Covid-19. But perhaps the more salient contributions of qualitative approaches, we suggest, are to (a) help explore how, for whom, and from whom data of all kinds are being deployed as part of broader imperatives to meet political requirements and administrative logics in a professional manner; and (b) elicit either novel or heretofore 'unobserved' variables shaping policy outcomes.

36 As anthropologist Mike McGovern (2011: 353) powerfully argues, taking context seriously "is neither a luxury nor the result of a kind of methodological altruism to be extended by the soft-hearted. It is, in purely positivist terms, the epistemological due diligence work required before one can talk meaningfully about other people's intentions, motivations, or desires. The risk in foregoing it is not simply that one might miss some of the local color of individual 'cases'. It is one of misrecognition. Analysis based on such misrecognition may mistake symptoms for causes, or two formally similar situations as being comparable despite their different etiologies. To extend the medical metaphor one step further, misdiagnosis is unfortunate, but a flawed prescription based on such a misrecognition can be deadly." More generally, see Hoag and Hull (2017) for a summary of the anthropological literature on the civil service. Bailey (2017) provides a compelling example of how insights from qualitative fieldwork help explain the strong preference among civil servants in Tanzania for providing new water infrastructure projects over maintaining existing ones. Though a basic benefit/cost analysis favored prioritizing maintenance, collective action problems among civil servants themselves, the prosaic challenges of mediating local water management disputes overseen by customary institutions, and the performance targets set by the government all conspired to create suboptimal outcomes.
4. Leave room for judgment

Our caution is against using data reductively: as a substitute for managing. Management must be about more than measuring. A good manager needs to be able to accommodate the immeasurable, since so much that is important to human thriving falls into this category; dashboards and the like certainly have their place, but if these were all that was needed then 'managing' could be conducted by machines. We all know from personal experience that the best managers and leaders are those who take a holistic interest in their staff, making the time and effort to understand the subtle, often intangible processes that connect their respective talents. As organizational management theorist Henry Mintzberg (2015) wisely puts it:

Measuring as a complement to managing is a fine idea: measure what you can; take seriously what you can't; and manage both thoughtfully. In other words: If you can't measure it, you'll have to manage it. If you can measure it, you'll especially have to manage it. Have we not had enough of leadership by remote control: sitting in executive offices and running the numbers—all that deeming and downsizing? 37

37 https://mintzberg.org/blog/measure-it-manage-it. Says Mintzberg: "Someone I know once asked a most senior British civil servant why his department had to do so much measuring. His reply: 'What else can we do when we don't know what's going on?' Did he ever try getting on the ground to find out what's going on? And then using judgment to assess that?"

Contrary to the "what can't be measured can't be managed" idea, we can manage the less measurable if we embrace a wider set of tools and leave space for judgment. The key for practitioners is to begin with a recognition that measurability is not an indicator of significance and that professional management involves far more than simply "running the numbers", as Mintzberg puts it. Perhaps the most compelling empirical case for the importance of 'navigating by judgment' in public administration has been made by Honig (2018), who shows – using a mix of quantitative data and case study analysis – that the more complex the policy intervention, the more necessary it becomes to grant discretionary space to front-line managers, and the more necessary such discretion is to achieving project success. Having ready access to relevant, high-quality quantitative data can aid in this 'navigation', but true navigation requires access to a broader suite of empirical inputs.

In a similar vein, Ladner (2015: 3) points out that "standard performance monitoring tools are not suitable for highly flexible, entrepreneurial programs as they assume that how a program will be implemented follows its original design". In order to avoid 'locking in' a theory of change that prevents exploration or responsive adaptation, some practitioners have provided helpful suggestions for how to use various planning frameworks in ways that support program learning. 38 39

38 Teskey (2017) and Wild et al (2017) give examples of an adaptive logframe, drawn from DFID experiences, that sets out clear objectives at the outcome level and focuses monitoring of outputs on the quality of the agreed rapid-cycle learning process.
39 Strategy Testing (ST) is a monitoring system that The Asia Foundation developed specifically to track programs that are addressing complex development problems through a highly iterative, adaptive approach.
The Building State Capability team highlights lighter-touch methods, such as their PDIA "check-ins", which include a series of probing questions to assist teams in capturing learning and maximizing adaptation. Teskey and Tyrrel (2017) recommend participating in regularized formal and informal Review and Reflection (R&R) points, during which a contractor can demonstrate how politics, interests, incentives and institutions were systematically considered in problem selection and design, and in turn justify why certain choices were made to stop, drop, halt or expand any activity or budget during implementation. The common connection across all these tools is that they seek to carve out meaningful space for qualitative data and the hard-won insights born of practical experience.

In summary, then, public administrators can embed the recognition that management must be about more than measuring by, first, recognizing that whatever they choose to measure will inevitably create incentives to neglect processes and outcomes that cannot be measured (or are hard to measure) but are nonetheless crucial for discerning whether, how, where, and for whom policies are working; having recognized this, they need to be very careful about what they choose to measure. Second, they can actively identify what matters but cannot (readily) be measured, and take it seriously, developing strategies to manage that as well. A key part of those strategies will be creating space for judgment: allowing meaningful room for qualitative data inputs and the practical experience of embedded individuals (focus group discussions, case studies, semi-structured interviews, review and reflection points, etc.) and treating these inputs as equally valid alongside more quantitative ones.

In terms of longer-term strategies to manage the immeasurable, administrations can work towards developing organizational systems that foster navigation. Such systems might include, for example, (i) a management structure that delegates high levels of discretion so as to allow those on the ground the ability to navigate complex situations; (ii) recruiting strategies that foster high numbers of staff with extensive context-specific knowledge; and (iii) systems of monitoring and learning that encourage the routine evaluation of theory against practice.
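The sketch below is likewise purely illustrative (the record structure and field names are hypothetical, not a standard instrument): it simply shows how a review point might hold quantitative readings, qualitative inputs, and an explicitly discretionary judgment side by side, so that no single number settles the verdict:

```python
# A hypothetical sketch of the "balanced data suite" idea: a review-and-reflection
# record that deliberately carries quantitative readings, qualitative inputs and a
# reasoned judgment together, so that no single number determines whether a
# program is "working". Field names are illustrative only.
from dataclasses import dataclass

@dataclass
class ReviewPoint:
    indicator_name: str
    indicator_value: float         # the measurable part
    qualitative_inputs: list[str]  # e.g., field interviews, case notes, R&R discussions
    managers_judgment: str         # the explicitly discretionary call
    action: str                    # stop, adapt, or continue

review = ReviewPoint(
    indicator_name="reported crime incidents (quarterly change)",
    indicator_value=-0.12,         # a 12% drop: genuine improvement or under-reporting?
    qualitative_inputs=[
        "station-level interviews suggest pressure to show falling incidence",
        "community focus group reports unchanged levels of petty theft",
    ],
    managers_judgment="treat the drop as unverified; commission an independent spot check",
    action="adapt",
)
print(review.managers_judgment)
```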
V. Conclusion

Quantitative measurement in public administration is undoubtedly a critical arrow in the quiver of any attempt to improve the delivery of public services. And yet, since not everything that matters can be measured and not everything that can be measured matters, a managerial emphasis on measurement alone can quickly and inadvertently generate unwanted outcomes and/or unwarranted conclusions. In the everyday practices of public administration, effective and professional action requires forging greater complementarity between different epistemological approaches to collecting, curating, analyzing, and interpreting data. We fully recognize that this is easier said than done.

The risks of reductive approaches to measurement are not unknown, and yet simplified appeals to "what gets measured gets managed" persist because they offer managers a form of escape from those "pesky human elements" that are difficult to understand, and even more so to shift. Most public administrators might agree in principle on the need for a more balanced data suite with which to navigate their professional terrain, yet such aspirations are too often honored in the breach: under sufficient pressure to 'deliver results', staff from the top to the bottom of an organization can be readily tempted to reverse engineer their behavior in accordance with what 'the data' says (or can be made to say). Management as measurement is tempting for any individual or organization that fears the vulnerability of their domain to unfavorable comparisons with other (more readily measurable and 'legible') domains, the complexity of problem-solving, and the necessity of subjective navigation that it often entails. But given how heavily institutional and socio-political factors shape which data is collected, how well it is collected and curated, and how it can be manipulated for unwarranted purposes, a simplistic approach to data as an easy fix is virtually guaranteed to obscure learning and hamper change efforts.

If administrations genuinely wish to build their problem-solving capabilities, then access to more and better quantitative data will be necessary, but it will not be sufficient. Beginning with an appreciation that much of what matters cannot be (formally) measured, public administrations must routinely remind themselves that promoting and accessing data is not an end in itself: data's primary purpose is not just monitoring processes, compliance, and outcomes, but contributing to problem-solving and organizational learning. More and better data will not fix a problem if the absence of such data is not itself the key problem or the 'binding constraint'. Administrations that are committed to problem-solving will therefore need to embed their measurement task in a broader problem-driven framework, seeking to integrate complementary qualitative data and to value embedded experience so that they might apprehend and interpret complex realities more accurately. Their first priority in undertaking good diagnostic work should be to identify and deconstruct key problems, using varied sources of data, and then to track and learn from potential solutions authorized and enacted in response to the diagnosis. Accurate inferences for policy and practice are not derived from data alone; close interaction is required between data (in various forms), theory and experience. In doing all this, public administrators will help mitigate the distortionary (and ultimately self-defeating) effects of managing only that which is measured.
Box 1: Summary of lessons for promoting problem-solving and learning in public administration

Principle 1: Identify and manage the capacity and power relations that shape your measurement task.
Actions for practitioners:
• Professional principles and standards for collecting, curating, analyzing and interpreting data must be made clear to all staff – from external consultants to senior managers – in order to affirm and enforce commitments to ensuring the integrity of the data itself and the conclusions drawn from it.
• Make measurement accountable to advisory boards with relevant external members.
• Communicate measurement results to the public in a clear and compelling way, especially on contentious, complex issues.

Principle 2: Focus quantification on those aspects which are close to the problem.
Actions for practitioners:
• Make sure that the measurement approach itself is anchored to a specific performance problem.
• Measurement investments should be targeted at those performance problems that are prioritized by the administration.
• Thereafter ensure that any judgements on an intervention's success or failure are based on credible measures of the problem being fixed and not simply on output or process metrics.
• Where measures of success relate to whether the intervention is functioning, practitioners should allow flexibility in the implementation of the intervention (where possible) and in the related measurement of its functioning. In this way, implementation strategies can shift if it becomes clear from the collected data that they are not making progress on fixing the problem.

Principle 3: Embrace a role for qualitative data and a theory of change.
Actions for practitioners:
• Include qualitative data collection as a complement to quantitative data.
• This may be as a prelude to future large-scale quantitative instruments, or perhaps as the only available data option for some aspects of public administration in some settings (such as those experiencing sustained violence or natural disasters).
• Draw on qualitative methods as a basis for eliciting novel or 'unobserved' factors driving variation in outcomes.
• Tie measurement (both qualitative and quantitative) back to a theory of change. If the implementation of an intervention is not having its intended impacts on the problem, assess whether there are mistaken assumptions regarding the theory of change.

Principle 4: Leave room for judgment, discretion and deliberation, because not everything that matters can be measured.
Actions for practitioners:
• Consider carefully what you choose to measure, recognizing that whatever you choose will inevitably create incentives to neglect processes and outcomes that cannot be measured.
• Actively identify what you cannot (readily) measure that matters, and take it seriously, developing strategies to manage that as well.
• This will include identifying those aspects of implementation in the public sector that require inherently discretionary decisions.
• Employ strategies that value reasoned judgement, allowing meaningful space for qualitative data inputs and the practical experience of embedded individuals, treating such inputs as having value alongside more quantitative ones.
• In the longer term, develop organizational systems that foster "navigation by judgment". For example: (i) a management structure that delegates high levels of discretion so as to allow those on the ground the space to navigate complex situations; (ii) recruitment strategies that foster high numbers of staff with extensive context-specific knowledge; and (iii) systems of monitoring and learning that encourage the routine evaluation of theory against practice.

References

Anderson, Benedict (1983) Imagined Communities: Reflections on the Origins and Spread of Nationalism London: Verso
Andrews, Matt (2013) The Limits of Institutional Reform in Development: Changing Rules for Realistic Solutions New York: Cambridge University Press
Andrews, Matt, Lant Pritchett and Michael Woolcock (2017) Building State Capability: Evidence, Analysis, Action New York: Oxford University Press
Bailey, Julia (2017) 'Bureaucratic blockages: Water, civil servants and community in Tanzania' Policy Research Working Paper No. 8101 Washington, DC: World Bank
Ban, Radu and Vijayendra Rao (2008) 'Tokenism or agency? The impact of women's reservations on village democracies in South India' Economic Development and Cultural Change 56(3): 501-530
Banerjee, Abhijit, Rukmini Banerji, James Berry, Esther Duflo, Harini Kannan, Shobhini Mukherji, Marc Shotland, and Michael Walton (2016) 'Mainstreaming an effective intervention: Evidence from randomized evaluations of "Teaching at the Right Level" in India' Working Paper No. w22746 Cambridge, MA: National Bureau of Economic Research
Beraldo, Davide and Stefania Milan (2019) 'From data politics to the contentious politics of data' Big Data & Society 6(2): 2053951719885967
Bollyky, T.J., Hulland, E.N., Barber, R.M., Collins, J.K., Kiernan, S., Moses, M., Pigott, D.M., Reiner Jr, R.C., Sorensen, R.J., Abbafati, C. and Adolph, C. (2022) 'Pandemic preparedness and Covid-19: An exploratory analysis of infection and fatality rates, and contextual factors associated with preparedness in 177 countries, from Jan 1, 2020, to Sept 30, 2021' The Lancet, February 1
Breckenridge, Keith (2014) Biometric State: The Global Politics of Identification and Surveillance in South Africa, 1850 to the Present Cambridge, UK: Cambridge University Press
Bridges, Kate and Michael Woolcock (2017) 'How (not) to fix problems that matter: Assessing and responding to Malawi's history of institutional reform' Policy Research Working Paper No. 8289 Washington, DC: World Bank
Bridges, Kate and Michael Woolcock (2019) 'Implementing adaptive approaches in real world scenarios: A Nigeria case study, with lessons for theory and practice' Policy Research Working Paper No. 8904 Washington, DC: World Bank
Cahill, Jonathan (2017) Making a Difference in Marketing: The Foundation of Competitive Advantage London: Routledge
Carletto, Calogero, Dean Jolliffe, and Raka Banerjee (2015) 'From tragedy to renaissance: Improving agricultural data for better policies' The Journal of Development Studies 51(2): 133-148
Cartwright, Nancy (2017) 'Single case causes: What is evidence and why', in Hsiang-Ke Chao and Julian Reiss (eds.) Philosophy of Science in Practice New York: Springer, pp. 11-24
Cartwright, Nancy and Jeremy Hardie (2012) Evidence-Based Policy: A Practical Guide to Doing it Better New York: Oxford University Press
Claxton, Guy and Bill Lucas (2015) Educating Ruby: What our Children Really Need to Learn New York: Crown House Publishing
Dirks, Nicholas (2011) Castes of Mind Princeton, NJ: Princeton University Press
Dom, Catherine, Alasdair Fraser, John Patch and Joseph Holden (2020) 'Results-Based Financing in the Education Sector: Country-Level Analysis. Final Synthesis Report' Submitted to the REACH Program at the World Bank by Mokoro Ltd.
Duflo, Esther (2012) 'Women empowerment and economic development' Journal of Economic Literature 50(4): 1051-79
Gill, Stephen (1970) 'Introduction', in Elizabeth Gaskell, Mary Barton: A Tale of Manchester Life London: Penguin
Hoag, Colin and Matthew Hull (2017) 'A review of the anthropological literature on the civil service' Policy Research Working Paper No. 8081 Washington, DC: World Bank
Honig, Dan (2018) Navigation by Judgment: Why and When Top-Down Management of Foreign Aid Doesn't Work New York: Oxford University Press
Hostetler, Laura (2021) 'Mapping, registering, and ordering: Time, space and knowledge', in Peter Fibiger Bang, C. A. Bayly and Walter Scheidel (eds.) The Oxford World History of Empire: Volume One: The Imperial Experience New York: Oxford University Press, pp. 288-317
Jasanoff, Sheila (2004) 'Ordering knowledge, ordering society', in Sheila Jasanoff (ed.) States of Knowledge: The Co-Production of Science and the Social Order London: Routledge, pp. 13-45
Jerven, Morten (2013) Poor Numbers: How We Are Misled by African Development Statistics and What to Do About It Ithaca, NY: Cornell University Press
Johnson, Janet Buttolph, Henry T. Reynolds, and Jason D. Mycoff (2019) Political Science Research Methods (9th edition) Thousand Oaks, CA: Sage
Khan, Salman (2012) The One World Schoolhouse: Education Reimagined London: Hodder & Stoughton
King, Gary and Jonathan Wand (2007) 'Comparing incomparable survey responses: Evaluating and selecting anchoring vignettes' Political Analysis 15(1): 46-66
Ladner, Debra (2015) 'Strategy testing: An innovative approach to monitoring highly flexible aid programs' Working Politically in Practice Case Study 3 San Francisco: The Asia Foundation
Lewis, Jenny M. (2015) 'The politics and consequences of performance measurement' Policy and Society 34(1): 1-12
McDonnell, Erin (2020) Patchwork Leviathan: Pockets of Bureaucratic Effectiveness in Developing States Princeton, NJ: Princeton University Press
McGovern, Mike (2011) 'Popular development economics: An anthropologist among the mandarins' Perspectives on Politics 9(2): 345-355
Merry, Sally Engle, Kevin E. Davis, and Benedict Kingsbury (eds.) (2015) The Quiet Power of Indicators: Measuring Governance, Corruption, and Rule of Law New York: Cambridge University Press
Milante, Gary and Michael Woolcock (2017) 'New approaches to identifying state fragility' Journal of Globalization and Development 8(1)
Mintzberg, Henry (2015) https://mintzberg.org/blog/measure-it-manage-it
Natsios, Andrew (2011) 'The clash of the counter-bureaucracy and development' Essay, Center for Global Development. Available at: https://www.cgdev.org/sites/default/files/1424271_file_Natsios_Counterbureaucracy.pdf
Porter, Theodore M. (1995) Trust in Numbers: The Pursuit of Objectivity in Science and Public Life Princeton, NJ: Princeton University Press
Preskill, Hallie, Srikanth Gopal, Katelyn Mack and Joelle Cook (2014) 'Evaluating complexity: Propositions for improving practice' Boston: FSG. http://www.fsg.org/publications/evaluating-complexity
Pritchett, Lant (2013) The Rebirth of Education: Schooling Ain't Learning Washington, DC: Center for Global Development
Pritchett, Lant (2014) 'The risks to education systems from design mismatch and global isomorphism: Concepts, with examples from India' WIDER Working Paper No. 2014/039 Helsinki: UNU-WIDER
Propper, Carol, Simon Burgess and Denise Gossage (2008) 'Competition and quality: Evidence from the NHS internal market 1991-9' The Economic Journal 118(525): 138-170
Rao, Vijayendra (2022) 'Can economics become more reflexive? Exploring the potential of mixed-methods' Policy Research Working Paper No. 9918 Washington, DC: World Bank
Rao, Vijayendra, Kripa Ananthpur and Kabir Malik (2017) 'The anatomy of failure: An ethnography of a randomized trial to deepen democracy in rural India' World Development 99(11): 481-497
Ravallion, Martin (2001) 'Growth, inequality and poverty: Looking beyond averages' World Development 29(11): 1803-1815
Ravallion, Martin (2020) 'Pandemic policies in poor places' CGD Note (April 24) Washington, DC: Center for Global Development
Ridgway, V. F. (1956) 'Dysfunctional consequences of performance measurements' Administrative Science Quarterly 1(2): 240-47
Robinson, Ken and Lou Aronica (2015) Creative Schools: Revolutionizing Education from the Ground Up London: Penguin UK
Rogers, Patricia and Michael Woolcock (forthcoming) 'Process and implementation evaluation methods', in Anu Rangarajan and Diane Paulsell (eds.) Oxford Handbook of Social Program Design and Implementation Evaluation New York: Oxford University Press
Sandefur, Justin and Amanda Glassman (2015) 'The political economy of bad data: Evidence from African survey and administrative statistics' The Journal of Development Studies 51(2): 116-132
Sanyal, Paromita and Vijayendra Rao (2018) Oral Democracy: Deliberation in Indian Village Assemblies New York: Cambridge University Press
Scott, James C. (1998) Seeing Like a State: How Certain Schemes to Improve the Human Condition Have Failed New Haven, CT: Yale University Press
Sparrow, Malcolm (2018) 'Problem-oriented policing: Matching the science to the art' Crime Science 7(1): 1-10
Teskey, Graham (2017) 'Thinking and working politically: Are we seeing the emergence of a second orthodoxy?' ABT Associates, Governance Working Paper Series
Teskey, Graham and Lavinia Tyrrel (2017) 'Thinking and working politically in large, multi-sector facilities: Lessons to date' Canberra: ABT Associates, Governance Working Paper Series, Issue 2
Weber, Eugen (1976) Peasants into Frenchmen: The Modernization of Rural France, 1870-1914 Palo Alto, CA: Stanford University Press
Wild, Leni, David Booth and Craig Valters (2017) 'Putting theory into practice: How DFID is doing development differently' London: ODI
Wilson, James Q. (1989) Bureaucracy: What Government Agencies Do and Why They Do It New York: Basic Books
World Bank (2018) World Development Report 2018: Learning to Realize Education's Promise Washington, DC: World Bank
World Bank (2021) World Development Report 2021: Data for Better Lives (Chapter 6) Washington, DC: World Bank
Woolcock, Michael (2018) 'Reasons for using mixed methods in the evaluation of complex projects', in Michiru Nagatsu and Attilia Ruzzene (eds.) Contemporary Philosophy and Social Science: An Interdisciplinary Dialogue London: Bloomsbury Academic, pp. 149-171