Policy Research Working Paper 10098

Adaptive Experiments for Policy Choice: Phone Calls for Home Reading in Kenya

Bruno Esposito, Anja Sautmann

Development Economics, Development Research Group, June 2022

Abstract

Adaptive sampling in experiments with multiple waves can improve learning for "policy choice problems" where the goal is to select the optimal intervention or treatment among several options. This paper uses a real-world policy choice problem to demonstrate the advantages of adaptive sampling and propose solutions to common issues in applying the method. The application is a test of six formats for automated calls to parents in Kenya that encourage reading with children at home. The adaptive "exploration sampling" algorithm is used to efficiently identify the call with the highest rate of engagement. Simulations show that adaptive sampling increased the posterior probability of the chosen arm being optimal from 86 to 93 percent and more than halved the posterior expected regret. The paper discusses a range of implementation aspects, including how to decide about research design parameters such as the number of experimental waves.

This paper is a product of the Development Research Group, Development Economics. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at asautmann@worldbank.org.

Adaptive Experiments for Policy Choice: Phone Calls for Home Reading in Kenya

Bruno Esposito∗ Anja Sautmann†‡

Latest version here.

Keywords: adaptive experiments; multi-armed bandits; education technology; early literacy; Kenya

JEL codes: C11, C93, I25, O15

∗ Development Economics Research Group, World Bank, email: bespositoacosta@worldbank.org

† Corresponding author. Development Economics Research Group, World Bank, email: asautmann@worldbank.org.

‡ The authors thank Tim Sullivan and Clotilde de Maricourt at New Globe and Peter Bergman for connecting us with them, as well as Grant Bridgman, Christine Vorster, and Faith Kibuswa at Uliza. Isaiah Andrews, Jiafeng Kevin Chen, Daniel Rodriguez-Segura, Adam McCloskey, and Kelly W. Zhang were extremely generous in sharing their expertise. We thank Kathleen Beegle, Daniel Björkegren, Esther Gehrke, Andrew Foster, Maximilian Kasy, David McKenzie, Robert Pless, and seminar participants at the World Bank and Rochester University for helpful comments and feedback. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors.
They do not necessarily represent the views of the World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent. All errors are ours.

1 Introduction

The use of experiments in research on economic development and policy represents one of the biggest methodological innovations in economics in the last few decades. For research that tests policy interventions and programs, the hope has been that rigorous experiments can lead to better policy decisions. In this context, the learning goal of a policy maker such as a government or NGO may be characterized as follows: they would like to improve a certain outcome, say in education or health, and are looking to identify the best (least expensive, most effective) policy to affect this outcome. We call this a "policy choice problem" for short.

Randomized controlled trials (RCTs) as they are found in economics and related fields are typically aimed at identifying causal treatment effects and estimating these effects as precisely as possible, and design choices like equal-sized treatment groups and re-randomization or stratification support this goal. But this approach to experimental design is not ideally suited to inform a policy choice problem. To see why, note that sample sizes that deliver the power to statistically distinguish the effect sizes in multiple treatment arms from zero (and from each other) quickly grow large. Yet for the objective of choosing and implementing only one of the tested policies, precise treatment effect estimates for low-performing options are not actually needed; ex post, some of the sample assigned to these arms could have been put to better use to distinguish the treatment effects in the highest-performing arms. This is particularly detrimental if the sample is small or there are budget and time constraints that prevent prolonged experimentation.

When the experiment can be carried out in two or more waves, the research design for policy choice can, in many cases, be improved by using adaptive sampling. The objective in the policy choice problem is to maximize the average outcome at the end of the experiment, or equivalently, minimize expected policy regret, that is, the expected loss from selecting a suboptimal arm. Choosing the arm with the highest outcome after repeated experimentation is a special case of the multi-armed bandit problem with a pure "exploration" motive, but no "exploitation" motive (Bubeck et al., 2009; Audibert et al., 2010). Efficient learning means adapting the assignment of experimental units to treatment arms based on what was learned in earlier waves.1 The best adaptive learning strategy for policy choice tends to assign a larger share of the sample to higher-performing arms. This helps to distinguish these arms from each other while spending less effort on low-performing arms. In practice, researchers use sampling algorithms that approximate the optimal strategy to reduce computational burden.

This paper puts adaptive sampling for policy choice to the test by applying it to a real-world policy choice problem in education technology.

1 This is true for many different learning objectives, including but not limited to policy choice. It holds also for learning goals that usually motivate standard RCTs, such as efficient hypothesis testing: it is typically not optimal to randomly assign equal sample shares to all treatment arms in later waves, see e.g. Tabord-Meehan (2018).
The goal is threefold: addressing conceptual and practical challenges in implementing adaptive experiments in policy settings, studying the performance of an adaptive research design in a short time horizon where asymptotic performance guarantees may not apply, and, last but not least, informing the actual policy choice problem at hand.

The example we use is an experiment on using phone calls with interactive voice response (IVR) technology to deliver regular short reading exercises directly to parents in Kenya. The calls are intended to encourage parents to read with their children at home, a practice known to improve language acquisition and fluency (Mayer et al., 2019; York et al., 2019; Knauer et al., 2020). The implementer was NewGlobe, an organization that both supports public schools and operates its own community schools in several countries, including the Bridge Kenya primary schools in our sample. Faced with many options for IVR call and exercise designs, NewGlobe was looking to decide which call format (if any) they should roll out to all parents. We use the IVR experiment to discuss in detail how to approach designing and conducting an adaptive experiment for policy choice – from estimation approaches, to the algorithm used, to research design decisions, for example about sampling and sample size.

In our example, we tested six different IVR call options during the third term of the 2020 school year. The experiment was designed to identify the call format with the highest level of engagement, measured as the number of IVR calls in which the respondent started the reading exercises, and to test whether IVR calls can increase reading fluency. The calls cross-combine two delivery formats for the exercises – parent-led vs. IVR-led reading – with three different ways of matching exercise contents to the child's reading level, motivated by evidence that targeted instruction can improve outcomes especially in the tails of the distribution (Banerjee et al., 2007; Muralidharan et al., 2019; Doss et al., 2019).

In order to efficiently identify the arm with the highest call engagement, the experiment uses a version of the exploration sampling algorithm proposed by Kasy and Sautmann (2021a) to assign experimental units to treatment arms. Exploration sampling is a Bayesian bandit algorithm that was shown to perform well in both real and simulated experiments for policy choice and shares the attractive asymptotic efficiency properties for best-arm identification of a set of similar algorithms (Russo, 2020; Qin et al., 2017; Shang et al., 2020; Kasy and Sautmann, 2021b). To our knowledge, there are to date only three policy choice experiments that have used it: an application in the original paper, a test of an SMS-based information campaign in India to reduce the spread of Covid-19 (Bahety et al., 2021), and a trial on contraceptive uptake in Cameroon that is ongoing at the time of writing (Athey et al., 2021). Instead of a Bernoulli outcome with a Beta prior as in the original paper (and many multi-armed bandit settings), we use a hierarchical binomial model to obtain assignment shares and parameter estimates.

The IVR experiment has only two experimental waves and the outcome distribution is more complex than assumed in theoretical treatments of Bayesian best-arm algorithms. We are therefore particularly interested in how adaptive sampling influences treatment assignment, performance, and estimation results.
We find that after just one wave, there is sufficient learning so that adaptive sampling leads to a substantial shift in the assignment shares: based on the different call success rates in each arm after wave 1, we obtained sample allocation shares for wave 2 that varied between 0% and 39%. After wave 2, we estimate a 93% probability that the best call design for engagement uses parent-led reading exercises and delivers the same intermediate-level exercises to all students. In this arm, parents engage – meaning, they start the reading exercises – with 8.40% probability per call, compared to 3.93% in the least successful arm. The arm with the second highest engagement (7.43%) has only a 5% probability of being optimal. Expressing the expected policy regret in terms of average engagement probability, the selected treatment arm has an estimated 0.02% expected loss from potentially making a suboptimal choice, compared to 1.27%-4.49% for the other arms.

Even though the experiment targeted call engagement, we also estimate the treatment effects on oral reading fluency (ORF), using exam scores collected by the implementer. Although the data is noisy, we find that the arm with the highest level of engagement leads to estimated increases in ORF of 1.68 correct words per minute, equivalent to 0.065 standard deviations of the baseline data, with a credible interval between 0.13 and 3.21. The precision of this estimate is partly due to the large sample assigned to the best arm.

The results of the IVR experiment speak to an important policy question: whether there are low-cost, automated methods of increasing the probability that parents read with their young children, and what their best design might be. Especially during the Covid-19 pandemic, it became clear that there is an unfilled need for sustained learning at home and for reaching children in families with limited educational and technological resources. Personal calls have been shown to be highly effective (Angrist et al., 2020b), but may require significant resources. The experiment shows that mass-deployed IVR calls can increase parental involvement in the child's schooling, but that the call design matters significantly for uptake.

Beyond these findings, the main contributions of this paper are an evaluation of the merits of using adaptive sampling for policy choice, and a detailed guide to implementation.2 In particular, we use simulation approaches to examine the performance of the experiment and understand the impact of different design choices. In a first exercise, we compare 'ex post' the exploration sampling design with an alternative design with equal-sized (stratified) treatment arms, akin to a "standard" RCT, using simulated samples drawn from the experimental observations. This shows that adaptive assignment in only one wave achieved meaningful reductions in uncertainty – from on average 86% probability that the chosen arm is optimal in the RCT to 93% probability with exploration sampling – and reduced posterior expected policy regret by more than half, from 0.05 percent to 0.02 percent engagement probability.

The next two exercises carry out 'ex ante' simulations based on the outcome model in order to determine the gains from (a) conducting two experimental waves instead of one (non-adaptive) wave with the full sample, and (b) adding a second wave after having observed the outcomes of the first.

2 This complements the excellent practitioner's guide on adaptive experiments by Hadad et al. (2021).
These are examples of simulations a researcher might conduct to determine the research design, akin to power calculations. In case (a), the predicted reductions in expected regret seem plausible for a specific parameter vector, but the flat prior distributions of the treatment effect parameters do not provide a good basis for simulating the gains from adaptivity; researchers may instead choose to focus on specific parameter values, not least to reduce computational burden (akin to power calculations where a minimum detectable effect size is imposed). In case (b), where the wave-1 posteriors can be used to simulate parameter draws, "agnostic" simulations do better, but they appear to still somewhat under-predict the gains.

As we go along, we discuss many details of implementation and experimental design, such as formulating and validating the Bayesian models for treatment effect estimation, calculating the expected posterior policy regret of each arm, and writing a pre-analysis plan. We address questions such as when an adaptive experiment is possible and when it may be most valuable, and approaches to correcting estimated treatment effects and confidence intervals for sampling bias and the "winner's curse" that affects the treatment effect estimate of the best arm (e.g. Melfi and Page, 2000; Andrews et al., 2021). We also spend some time discussing the trade-offs that were involved in choosing the targeted outcome.

The constraints on this experiment are representative of the decision contexts in which policy makers work day-to-day. In NewGlobe's situation, with a limited budget and only one school term available to test IVR, many organizations might decide against an experiment entirely. But our trial shows that adaptive sampling methods can enable rigorous learning even when the parameters of experimental design are severely constrained. The solutions we propose can help inform future adaptive experiments for policy choice, in EdTech as well as many other contexts.

The next section introduces the concepts behind adaptive sampling for policy choice, showing how the sampling algorithm used is determined by the objective of the experiment, describing the exploration sampling algorithm, and discussing the use of Bayesian estimation. It also lays out some considerations for choosing parameters of the research design such as the number of waves. Section 3 discusses the policy background, interventions, and experimental design of the IVR experiment, including the choice of targeted outcome, highlighting lessons for adaptive experiments in general. Section 4 discusses the data and details the models used for estimation, including how to derive the probability of being optimal and the expected policy regret, quantities used in the exploration sampling algorithm. Section 5 presents treatment effect estimates for parental engagement, shows the assignment shares based on these estimates, and discusses the impact on reading fluency. Finally, section 6 picks up the question of research design again. In the concrete context of the IVR experiment, we first show how the adaptive and a non-adaptive design compare 'ex post' in simulated samples from the experimental data, and then demonstrate how 'ex ante' simulations can be used to decide, for example, on the number of experimental waves. Section 7 concludes with a short discussion.
2 Using Adaptive Sampling in Experiments for Policy Choice

This section gives an overview of the use of adaptive sampling for policy choice and the exploration sampling algorithm proposed by Kasy and Sautmann (2021a). We start with the "basic ingredients" for an adaptive experiment: the objective, an algorithm that builds on the data from each wave to adaptively allocate units to treatment arms, the estimation approach, and constraints that determine whether an adaptive experiment is feasible. The corresponding features of the IVR experiment are described in detail in sections 3 and 4. This section also discusses the gains from adaptive vs. non-adaptive sampling or adding adaptive waves, and how these gains can be calculated in simulations to choose the number of experimental waves. We return to this in section 6, building on the data collected in Kenya and the estimation results in section 5. To begin with, however, we assume that there are $t = 1, \dots, T$ exogenously given consecutive sample draws (waves) of size $N_t$ available for testing.

Objective. In the canonical policy choice problem, there are $K > 2$ policy options – or treatment arms – labeled $k = 1, 2, \dots, K$. Each arm has an unobserved (stationary) average outcome $\theta_k$, and the policy maker wants to implement the arm with the highest average outcome. Formally, let $k^{(1)} = \arg\max_k \theta_k$ be the true best arm, and $k^*$ the arm that is chosen. We call the loss (per unit) from implementing a suboptimal arm $k$ the policy regret, $\Delta_k = \theta_{k^{(1)}} - \theta_k$. Ex post, the policy maker will select the arm $k^*$ that has the highest average outcome, or lowest policy regret, based on the observed data. It is assumed that the outcomes of the experimental units are observed at the end of each period $t$. This means we can learn from wave $t$ and adjust the allocation of units to treatment arms in wave $t+1$, i.e. use adaptive sampling. In the policy choice problem, the policymaker wants to implement an adaptive sampling strategy that maximizes welfare, that is, minimizes the expected policy regret from the final choice given the true (unobserved) vector of average outcomes: $E[\Delta_{k^*} \mid \theta]$. Adaptivity increases the efficiency of learning for a given objective by over-sampling some arms based on what was learned, at the expense of other arms (and other objectives).
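As a concrete numerical illustration of these objects (the values are made up):

$$\theta = (0.04,\ 0.06,\ 0.08) \;\Rightarrow\; k^{(1)} = 3, \qquad \Delta_2 = \theta_{k^{(1)}} - \theta_2 = 0.08 - 0.06 = 0.02,$$

so if the experiment ends with $k^* = 2$, the policy maker forgoes two percentage points of the average outcome per member of the target population; $E[\Delta_{k^*} \mid \theta]$ averages this loss over the randomness in the data and hence in the chosen arm $k^*$.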
Remark: Other Objectives. Large literatures consider sampling for specific learning goals. The classical multi-armed bandit problem (MAB) considers the objective to maximize average outcomes during the ongoing experiment, or equivalently to minimize in-sample regret, which introduces the well-known exploration-exploitation trade-off (e.g. Lai and Robbins, 1985; Bubeck and Cesa-Bianchi, 2012). The policy choice problem of choosing the arm with the highest average outcome can be seen as a special case of the MAB problem, where the experimenter has no "exploitation" motive (Bubeck et al., 2009; Audibert et al., 2010). Closely related to the "pure exploration" problem of policy choice is the problem of "best arm identification" (BAI), to the point that they are often treated as interchangeable. Here, the experimental design aims to either minimize the probability of choosing a sub-optimal arm after a given number of waves (the "fixed budget" setting), or minimize the expected number of waves to achieve a given level of certainty about which arm is optimal (the "fixed confidence" setting (Garivier and Kaufmann, 2016); see e.g. Lattimore and Szepesvári (2020) for an excellent and in-depth overview).

Even in non-adaptive experiments, common sampling techniques such as stratification and re-randomization aim to maximize power to detect a difference between treatment and control group (Athey and Imbens, 2017).3 Adaptive strategies can further increase power for particular tests (Robbins, 1952). For example, Tabord-Meehan (2018) proposes an adaptive stratification procedure for a two-stage experiment with the objective of minimizing the variance of the estimator for the average treatment effect.

A Bayesian Bandit Algorithm for Policy Choice. Although the allocation of experimental units to arms for a given experiment of length $T$ is a finite decision problem, determining the optimal allocation exactly is computationally prohibitively costly.4 In the IVR experiment this is the case even with just one adaptive wave (see also simulations in section 6). This has led to the development of various heuristics for treatment assignment. The exploration sampling algorithm used here is a Bayesian bandit algorithm: it starts from a prior over the model parameters – with identical priors for the $k$ treatment effects – and updates the parameter distributions as the outcomes of each wave $t$ are observed. The posterior distribution for the arm-specific $\theta_k$ is used to calculate the posterior probability that $k$ is the best arm, $p_t^k = \Pr_t(k = k^{(1)})$, and the (posterior) expected policy regret $E_t(\Delta_k)$. In wave $t+1$, the algorithm assigns experimental units to arm $k$ with sampling shares

$$q_t^k = \frac{p_t^k (1 - p_t^k)}{\sum_{k'=1}^{K} p_t^{k'} (1 - p_t^{k'})}. \qquad (1)$$

Exploration sampling is a modification of Thompson sampling, which directly uses the probabilities $p_t^k$ as the assignment shares in the next wave. Thompson sampling is a MAB heuristic for minimizing in-sample regret (Thompson, 1933). Compared with Thompson sampling and other algorithms that target in-sample regret, exploration sampling shifts measurement effort away from the best arm, increasing exploration and decreasing exploitation. This is because we need to learn not only about the best arm but also its close competitors for efficient policy choice. At the same time, it shifts measurement effort towards the higher-performing arms compared to an experiment with uniform assignment (i.e., equal sampling shares $1/K$), because information about the low-performing arms is unlikely to be relevant.

3 The specific objective also matters for stratification. Kasy (2016) considers stratification with continuous covariates and shows in a statistical decision theory framework that a deterministic design delivers maximal power for a given prior or a minimax decision criterion. However, Banerjee et al. (2020) argue that (some) randomization improves the ability to convince diverse and potentially adversarial audiences with a range of priors. The argument is relevant for adaptive designs as well: reducing the sample size of some arms in favor of other arms is likely to be the wrong decision under at least some priors about the true θ, and therefore an adaptive experiment is less convincing to an adversarial audience than a non-adaptive experiment.

4 The supplement to Kasy and Sautmann (2021a) shows some simple examples of the optimal treatment assignment.
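To make this concrete, the following is a minimal sketch of one exploration sampling update under the Beta-Bernoulli model of Kasy and Sautmann (2021a), with $p_t^k$ approximated by Monte Carlo draws from the arm-level posteriors. The function name and the outcome counts are our illustrative assumptions, not the study's actual code.

```python
import numpy as np

def exploration_sampling_shares(successes, trials, n_draws=100_000, seed=0):
    """Exploration sampling shares q_t^k from equation (1).

    successes, trials: length-K arrays of outcomes observed so far in each
    arm, assuming Bernoulli outcomes with independent Beta(1, 1) priors.
    """
    rng = np.random.default_rng(seed)
    successes, trials = np.asarray(successes), np.asarray(trials)
    # Posterior of each arm's mean outcome theta_k is Beta(1 + s, 1 + n - s).
    draws = rng.beta(1 + successes, 1 + trials - successes,
                     size=(n_draws, len(trials)))
    # p_t^k: posterior probability that arm k is best, estimated as the
    # share of joint posterior draws in which arm k has the largest theta.
    p = np.bincount(draws.argmax(axis=1), minlength=len(trials)) / n_draws
    q = p * (1 - p)                 # numerator of equation (1)
    return q / q.sum()              # normalize to sampling shares

# Example with six arms after one wave (made-up counts):
print(exploration_sampling_shares([30, 55, 48, 20, 35, 60], [250] * 6).round(3))
```

Note how arms with $p_t^k$ near zero receive shares near zero, while the best arm's share is dampened relative to Thompson sampling, because $p(1-p)$ peaks at $p = 0.5$.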
For the case of Bernoulli distributed binary outcomes with a Beta prior, Kasy and Sautmann (2021a,b) show that exploration sampling balances the sampling allocation in the limit as $T \to \infty$ between the suboptimal arms, yielding constrained optimal posterior convergence (subject to the sampling share of the best arm converging to a pre-selected proportion). In the Bernoulli case, posterior expected regret converges at the same rate because regret is bounded by 1. Several Bayesian best-arm algorithms – applied to specific outcome distributions – have been shown to have this property (Qin et al., 2017; Russo, 2020; Shang et al., 2020).5 Each heuristic has its own merits, but exploration sampling is appealing for its simple form that does not require a tuning parameter, its convenience for batch settings (waves larger than 1 unit), and its motivation based on sampling the best arm from the posterior for θ with the restriction of never assigning the same arm twice for increased exploration.6 The existing theoretical performance guarantees apply only asymptotically and for specific outcome distributions. However, Kasy and Sautmann (2021a) demonstrate the good performance of exploration sampling for expected policy regret in the Beta-Bernoulli case in simulations based on pre-existing data, and for posterior convergence in an experiment testing different enrollment methods for an agricultural extension service. We also use simulations in section 6 to assess the gains from exploration sampling over uniform assignment.

In the IVR experiment, the primary objective was to identify the best arm measured by the parents' engagement with the IVR calls. We therefore used exploration sampling on 6/7th of the sample. A secondary goal was to understand whether the IVR calls have an effect on reading ability. The design therefore included a (fixed) control group of 1/7th of the sample for identifying the time trend in reading fluency and estimating treatment effects (see sections 3 and 4). Designs that combine adaptive treatment arms with a control group are also used by Bahety et al. (2021) and Athey et al. (2021).

5 All are "top-two" algorithms based on expending greater measurement effort on the current best two arms, with a tuning parameter β determining the allocation between them as well as the limit sample share of the best arm. Russo first proposed three algorithms and established constrained optimal posterior convergence for a family of outcome distributions: Top-Two Probability Sampling (TTPS), Top-Two Value Sampling (TTVS), and Top-Two Thompson Sampling (TTTS). Top-Two Expected Improvement (TTEI) by Qin et al. (2017) modifies the expected improvement algorithm for Gaussian outcomes. The authors also show that the algorithm is asymptotically optimal in the fixed-confidence setting, which requires that the limit allocation is attained in finite time. Shang et al. (2020) propose a version called Top-Two Transportation Cost (T3C) that is less computationally demanding than TTTS and applies to a larger set of outcome distributions than TTEI, and prove optimality of both TTTS and TTEI in the fixed confidence setting for Gaussian outcomes. Finally, they establish posterior convergence for TTTS for Normal and Bernoulli distributed outcomes.

6 Thompson sampling is equivalent to taking simple draws from the posterior without prohibiting repeat assignments. TTPS and TTVS determine two "top" candidate arms in each wave and randomly select the first with probability β and the second otherwise, making them poorly suited for batch allocation. TTEI is specific to normally distributed outcomes. Exploration sampling is closest to TTTS and T3C and with β = 0.5 all three converge to the same limit allocation.
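For instance, the wave-2 assignment shares in a design like ours can be obtained by scaling the adaptive shares to the 6/7 of the sample assigned to treatment arms. A self-contained sketch (the treatment shares here are made-up values, not the experiment's):

```python
import numpy as np

# Suppose exploration sampling over the six treatment arms yields these
# shares (made-up values summing to 1):
q_treat = np.array([0.02, 0.30, 0.22, 0.01, 0.09, 0.36])

# Fixed 1/7 control share; exploration sampling allocates the remaining 6/7.
wave2_shares = np.concatenate(([1 / 7], 6 / 7 * q_treat))
print(wave2_shares.round(3), wave2_shares.sum())   # shares sum to 1
```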
Estimation. In the IVR application, we focus on Bayesian estimation to obtain final parameter estimates. In Kasy and Sautmann (2021a), the outcome of each arm $k$ is Bernoulli distributed with a Beta prior, so that the posteriors after $t$ have closed forms. In the IVR experiment, we generalize the approach and estimate Bayesian hierarchical models with school-specific effects and a Binomial outcome distribution (Normal for reading fluency), described in detail in section 4. The Bayesian approach with updating between waves is internally consistent7 and naturally produces the $p_t^k$ that we need for exploration sampling.

Bayesian inference is valid with adaptively collected data. However, users may be interested in frequentist inference about the parameter estimates. Frequentist estimates that do not account for the data being generated by an experiment for policy choice are subject to potential biases (Melfi and Page, 2000; Xu et al., 2013). First, observations from an adaptive experiment cease to be iid draws – intuitively, adaptivity introduces sampling bias because random fluctuations in early observations in a given treatment arm $k$ affect the weight of these observations in the overall sample assigned to $k$ (by changing the assignment shares of this arm in future waves). Second, inference on the best arm out of a set, where the ranking is based on the treatment effect estimates, creates an upward bias and invalidates standard confidence intervals even with non-adaptive sampling (Andrews et al., 2021).

Inference from adaptively sampled data is an active field of research, with particular focus on algorithms targeting in-sample regret, which exacerbate selection bias by quickly focusing on high-performing arms. Adaptively weighted estimators can correct sampling bias and produce asymptotically normal estimators (Hadad et al., 2021; Zhang et al., 2021). Andrews et al. (2021) propose corrections for the "winner's curse" when estimating the average outcome of the highest-performing arm that apply to asymptotically normal estimators. To our knowledge, there are to date no approaches that can provide confidence intervals with correct coverage for the optimal arm in an adaptive experiment in a model with random effects as we used in the IVR experiment. However, in section 5 we estimate a frequentist Binomial model for engagement and illustrate how the estimates are affected when applying the weights proposed in Zhang et al. (2021) to restore asymptotic normality and then applying the winner's curse correction by Andrews et al. (2021).

Remark: Hybrid Algorithms. Given the problems with inference in adaptive procedures where low-performing arms are under-sampled, recent applications have used modified algorithms for a hybrid goal of (frequentist) estimation as well as regret minimization. For example, the "tempered Thompson" algorithm in Caria et al. (2020) uses a convex combination of Thompson shares and equal-sized shares of $1/K$. Another common modification is to impose a lower bound on the sampling share in each arm ("clipping", applied e.g. in Athey et al. (2021) with exploration sampling). Such modifications can be combined with setting aside a sample share for one experimental arm and in particular a control group, see for example the "control-augmented" Thompson sampling algorithm in Offer-Westort et al. (2021).

7 In principle, updating the posterior from any earlier wave with the data collected afterwards should lead to the same posterior outcome distribution at $t$; this includes re-estimating the model with all the data collected and the initial prior, which is in practice the method we use.
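As a rough illustration of the estimation approach just described, a hierarchical Binomial model for engagement with school random effects might be specified as in the following sketch. The priors, dimensions, and variable names are our illustrative assumptions; the paper's actual specification is laid out in section 4.

```python
import numpy as np
import pymc as pm

# Illustrative data: engaged_i of 9 exercise sets per parent, with
# treatment arm k_i and school s_i (all made up).
engaged = np.array([1, 0, 3, 2, 5, 0])
arm = np.array([0, 1, 2, 0, 1, 2])
school = np.array([0, 0, 1, 1, 2, 2])
K, S = 3, 3

with pm.Model():
    alpha = pm.Normal("alpha", 0.0, 2.0, shape=K)    # arm effects (log-odds)
    sigma = pm.HalfNormal("sigma", 1.0)
    u = pm.Normal("u", 0.0, sigma, shape=S)          # school random effects
    p = pm.math.invlogit(alpha[arm] + u[school])     # per-set engagement prob.
    pm.Binomial("y", n=9, p=p, observed=engaged)     # 9 exercise sets per wave
    idata = pm.sample(1000, tune=1000, chains=2)

# Posterior probability that each arm is best, as needed for equation (1):
a = idata.posterior["alpha"].stack(sample=("chain", "draw")).values.T
print(np.bincount(a.argmax(axis=1), minlength=K) / len(a))
```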
An important decision for research designs in practice is the size of the experimental sample $N$. The MAB literature often assumes that experimental units arrive through an exogenous process and can be used costlessly for experimentation, often indefinitely. In practice, researchers using adaptive experiments need to decide how to split the sample into waves, or how many waves of fixed size to conduct. We approach these questions in two steps, by first discussing constraints that delineate the space of possible experimental designs, and then outlining how to assess alternatives within these constraints.

Constraints on Adaptive Experimental Designs. The use of multiple waves imposes some constraints on the set of possible adaptive experimental designs. We outline these here briefly, partly to illustrate when adaptive designs are feasible in practice.

Total time $D^{\max}$ available. Due to external constraints, such as funding timelines or deadlines for operational deliverables, the maximal duration of an experiment is typically limited.8

Comparable waves. Most bandit algorithms assume some form of stationarity, e.g. that the observations in all waves represent iid draws of the potential outcomes in the population. For efficient learning across waves, the treatment effects must be stationary and any time trends must be common to all arms. Annual cohorts of students or batches of survey participants recruited at random may fulfill these conditions, but e.g. job seekers in a seasonal industry at different times of the year likely do not.

Length of a wave $d$. To complete a wave, the intervention must be administered in full, outcome changes in response to the treatments must have manifested, and post-intervention outcome measures must be collected before the start of the next wave. This determines the wave duration $d$.

Together, these constraints typically impose a limit on the number of waves, $T^{\max}$. If the policy environment changes rapidly, data is collected in a time-consuming survey, or the available time does not include two comparable periods, only one "wave" may be possible, $T^{\max} = 1$. On the other hand, if a wave takes only hours or days and data are automatically recorded, many waves may be possible, e.g. $T = 10$ in Bahety et al. (2021) or $T = 17$ in Kasy and Sautmann (2021a). Other constraints may limit e.g. the maximum sample size per wave or the total sample $N^{\max}$.

In the IVR experiment, due to time and comparability constraints, the choice was effectively only between conducting one or two experimental waves. The available sample was the full population of first graders in the Bridge Kenya schools in term 3 of 2020, see section 3.

Choosing the Research Design. Even if constraints narrow down the design space, the experimenter may still need to decide whether to use adaptive sampling and choose the sample size, wave size, and number of waves to run.

8 Such a limit is a reason to use policy choice algorithms that minimize expected regret after the experiment, rather than an algorithm that simply continues indefinitely and targets in-sample regret.
An added consideration is that even for a given sample size $N$, there are some costs to conducting the testing in waves.

Per-wave Implementation Costs. Maintaining the infrastructure for data collection and interventions for all treatment arms, including the human capital costs of managing the experiment, adds fixed costs $c_t$ per wave, on top of any per-unit costs $c_i^k$ (which may vary by treatment arm).

Cost of Delay. Each new wave adds delay until the gains from the experiment – the average estimated treatment effect of the best arm – are realized for all potential beneficiaries.

Balancing these costs are the efficiency gains from adaptivity. It is computationally involved to estimate these gains, and so the researcher can typically only consider a small number of designs. Here, we briefly discuss two situations that will frequently arise in practice. First, experimenters often have a fixed $N$ available and have to decide whether and how to divide the sample into waves. Second, the experimenter may need to decide at time $t$ whether to run an additional wave in $t+1$.

This could be set up as a simple optimization problem. For example, consider choosing the number of waves $T \in \{1, \dots, T^{\max}\}$ for given sample size $N$, so that the wave size is $N_t = N/T$ (assuming equal-sized waves for simplicity). We would expect more efficient learning with more waves and more chances to adapt, and indeed the simulations in Kasy and Sautmann (2021a) with data from three existing experiments show how splitting the sample into 2, 4, and 10 waves monotonically shrinks the expected policy regret. In practice, however, the marginal gains are likely decreasing in $T$.9 Moreover, the gains must be weighed against the cost. The experimenter might solve

$$\max_{T \in \{1, \dots, T^{\max}\}} \; \delta^{T+1} E\big(M \theta^{k^*} \mid T\big) - \sum_{t=1}^{T} c_t.$$

The second term penalizes the cost of increasing $T$. The first term is the term of interest: the expectation of the number of beneficiaries $M$ times the per-person outcome in the chosen arm $\theta^{k^*}$, discounted by $\delta^{T+1}$ due to the implementation delay.

In the second situation we defined, waves have fixed size $N_t$ and the experimenter needs to decide when to end the experiment. In addition to the per-wave and delay costs given by $c_t$ and $\delta^{T+1}$, increasing $T$ incurs $q_t^k N_t$ times the per-unit cost $c_i^k$ for each experimental arm.10 In exchange, the experimenter observes additional $N_t$ units in each wave $t$.

In each case above, the researcher needs to estimate $E(\theta^{k^*})$ as a function of the research design. Since closed forms are not typically available, these projected gains from adaptivity have to be obtained from simulations. This requires simulating not only experimental outcomes under different random sample draws, but also the different sampling paths that arise from adaptivity. The experimenter is typically restricted to comparing only a few hypothetical θ and a small number of possible research designs.

9 This is at least in part due to indivisibility issues. As $T$ grows and the wave size shrinks, it becomes harder to implement the adaptive algorithm faithfully, and the actual assignment shares may differ substantially from the exploration sampling shares $q_t^k N_t$, especially if the sample is also stratified (see also section 5). With many treatment arms, in small waves some arms may not be assigned at all, updating about these arms will proceed slowly in terms of $t$, and the assignment shares may remain far from optimal for a long period of time.

10 It may also reduce the number of beneficiaries by the additional experimental subjects.
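To fix ideas, the following stylized sketch shows the shape such a simulation might take for a Bernoulli outcome, comparing the expected policy regret of one non-adaptive wave ($T=1$) against two adaptive waves ($T=2$) under one hypothetical θ. It ignores costs, stratification, and indivisibilities, so it is only a starting point under our own simplifying assumptions, not the paper's simulation code.

```python
import numpy as np

rng = np.random.default_rng(1)

def prob_best(s, n, draws=10_000):
    """P(arm k is best) under independent Beta(1,1)-Bernoulli posteriors."""
    theta = rng.beta(1 + s, 1 + n - s, size=(draws, len(n)))
    return np.bincount(theta.argmax(axis=1), minlength=len(n)) / draws

def expected_regret(theta_true, N, T, reps=200):
    """Average policy regret from choosing the posterior-best arm after
    splitting a sample of size N into T equal, adaptively assigned waves."""
    K, regret = len(theta_true), 0.0
    for _ in range(reps):
        s, n = np.zeros(K), np.zeros(K)
        shares = np.full(K, 1.0 / K)              # wave 1: uniform assignment
        for _ in range(T):
            n_t = np.floor(shares * (N // T)).astype(int)
            s, n = s + rng.binomial(n_t, theta_true), n + n_t
            p = prob_best(s, n)
            q = p * (1 - p) + 1e-9                # epsilon guards p in {0, 1}
            shares = q / q.sum()                  # exploration sampling shares
        k_star = int(np.argmax(prob_best(s, n)))  # chosen arm
        regret += theta_true.max() - theta_true[k_star]
    return regret / reps

theta = np.array([0.04, 0.05, 0.05, 0.06, 0.07, 0.08])  # one hypothetical theta
for T in (1, 2):
    print(T, round(expected_regret(theta, N=2700, T=T), 4))
```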
We illustrate such simulations in the context of the IVR experiment in section 6.

3 IVR Calls for Reading in Kenya: Background and Experimental Design Choices

3.1 Background and Setting

Our application for adaptive sampling for policy choice is an EdTech intervention that uses interactive voice calling aimed at encouraging parents to read with their children. The implementing organization ("the implementer") is NewGlobe, the parent of Bridge International Academies. At the time of the study, NewGlobe operated 112 private primary schools all over Kenya.11 The Kenyan school year usually has three terms that start just after New Year's and end late October. Due to Covid-19, the 2020 terms 2 and 3 took place 1/3 - 3/19 and 5/10 - 7/16 of 2021 (with the 2021 terms compressed into 7/26/21 - 4/2/22). All Kenyan schools at the implementing organization had introduced oral reading fluency (ORF) assessments for the first time in the midterm and endterm exams of term 2 of 2020.

The implementer wanted to make a decision about whether and how to use interactive voice response (IVR) calls to encourage parents to do reading exercises with their children. Reading with a child at home has benefits for language acquisition and fluency, even in contexts where parents themselves may have limited reading skills (Mayer et al., 2019; York et al., 2019; Knauer et al., 2020). Kenyan schools were closed for part of 2020 due to COVID-19, highlighting the benefits of developing effective home interventions targeting reading and numeracy.12 More broadly, parental engagement is an important determinant of children's long-term success in school. Recent research has shown that relatively light-touch interventions such as personalized text messages increase parental engagement, which in turn improves early literacy outcomes (York et al., 2019; Doss et al., 2019). For older children, parental engagement also increases parents' information about attendance and performance at school and improves outcomes through this channel (Berlinski et al., 2021; Bergman and Chan, 2021; Bergman, 2021; Bettinger et al., 2021).

While many parent communication interventions rely on text messages, parental literacy barriers and length restrictions limit text messaging as a tool to deliver reading exercises (ICTworks, 2016). The implementer already routinely uses text messaging to contact parents with information about their child's schooling, and collects phone numbers and consent for this purpose. However, to what degree these messages are received and read by parents, and whether they lead to behavior change, is only incompletely known. In an earlier trial with the same implementer in Nigeria, which used text messages to encourage parents to use a WhatsApp-based quiz platform, almost none of the message recipients engaged with the quizzes (Sautmann, 2021b).

11 The schools follow Bridge's specific teaching model and charge fees; these fees are lower than typical private school fees and similar to the administrative costs of public schools.

12 Prior research has shown that parental engagement interventions can counteract the detrimental effect of extended periods out of school (e.g. Kraft and Monti-Nussbaum (2017)). A combined text message and phone call intervention was able to reduce learning loss during COVID-19-related school closures in Botswana (Angrist et al., 2020a).
Phone calls provide an alternative that may sustain higher rates of engagement and allow for longer interactions and better instructions for home exercises. Personal calls have been shown to be effective for increasing parental engagement (Kraft and Monti-Nussbaum, 2017), but are costly and time consuming for teachers. IVR calls are pre-recorded and automated, designed by recording a set of modular text snippets and jingles that are sequenced in response to listener input through the keyboard or through spoken word. There is to date limited evidence on the effectiveness of IVR for improving early literacy. A small pilot with 38 families in rural Côte d'Ivoire reports encouraging qualitative results on the use of IVR to foster phonological awareness in low-literacy environments (Madaio et al., 2019).

3.2 IVR Intervention Design

During piloting and discussions prior to the experiment, it was decided to test six IVR call variants. All treatment arms consist of twice weekly calls to the parents' phone. The IVR delivers a sequence of reading exercises, either based on letter combinations or words that the parent notes down during the call, or based on passages from the children's term 3 homework book. An experimental wave contains 9 sets of calls (see below), and each call contains 4 different exercises. The exercises change from call to call. Before each wave, we conducted a phone-based opt-out procedure that explained the calls and also allowed parents to change the enrolled number. The full intervention design, call logic trees, and sample recordings of two of the interactive calls can be found in an online supplement (Sautmann, 2021a). All IVR recordings were created by a female Kenyan voice artist and edited by the voice call provider, Uliza.

The IVR system makes multiple call attempts and also allows the parent to "flash" Uliza's number, meaning that they can call the number at a convenient time, and the system hangs up and immediately calls back. This is a common method in Kenya that avoids calling charges to the parents.

Figure 1: Term 2 midterm and endterm oral reading fluency scores, in units of correct words per minute, as used for exercise level assignment. The left panel shows that individual student scores are only noisily correlated. The right panel shows that there is some movement from higher leveling categories to lower ones, as well as small but significant numbers of students "skipping" from basic to advanced level. For 22.5% of students in our sample, score information was missing.

We used three different ways of choosing a difficulty level for the exercises, and two different delivery formats, described in detail below. We cross-combined the 3x2 interventions to create the 6 treatment arms. In selecting the tested interventions, the aim was to create treatment variations that were genuine "contenders" for having the greatest impact on how often parents read with their children at home.

Varying exercise leveling. Baseline information on oral reading fluency (ORF) from term 2 showed high variation in reading scores, in line with other comparable data in developing-country contexts (for instance, see Muralidharan et al., 2019).
In the presence of such variation, prior evidence has suggested that there can be benefits to leveling remedial programs (see, e.g., Banerjee, Cole, Duflo, and Linden, 2007; Banerjee, Banerji, Berry, Duflo, Kannan, Mukerji, Shotland, and Walton, 2017), and that customized EdTech interventions could benefit the lowest achieving students the most (de Barros and Ganimian, 2021; Doss et al., 2019). However, our analysis of ORF scores showed that the available test data are very noisy (as seen in figure 1) and 22.5% of the sample were missing at the start of wave 1. There is reason to believe that there is selection bias in non-missing scores (see also below). This could make leveling based on observed or imputed past scores ineffective or even counterproductive. An alternative is to leverage parents' knowledge of their child's reading skills and let them choose the difficulty level during the call. But parents may be unable to accurately assess their child or may choose a poorly suited exercise, perhaps because they themselves are not secure readers or because their view of their child is too optimistic. A call that allows choice also takes longer, and parents may stop using the system if they find it fatiguing or challenging.

Based on these considerations, three intervention variants were chosen: (A) leveling on actual or imputed baseline scores, (B) providing the same sequence of intermediate-level exercises to all kids, and (C) giving parents a choice of exercises from a menu. Arm A uses observed fluency scores from the end of term 2 and assigns students with fluency scores of 0-29 into the "basic" group, 30-64 into the "intermediate" group, and 65+ into the "advanced" group. These cutoffs were used previously in a similar context (see Piper et al., 2018). Students with missing scores are assigned their class median. Whole classes with missing scores are assigned to the intermediate group (which also happens to be the full sample median). The exact exercise sequences in the basic, intermediate, and advanced groups are described in detail in Appendix A. Arm B assigns all students to the intermediate group, while Arm C allows parents to pick one exercise type (from basic letters, to letter combinations, to advanced text passages) out of a set of three. A sketch of the Arm A leveling rule follows at the end of this subsection.

Varying delivery format. We also test two formats that use the IVR functionality in different ways. In the first, the voice call explains to the parent how to do the reading exercises and asks them to carry them out with their child after the call (T1). In the second, the IVR asks parents to put the call on speaker phone, and then goes through the exercises with the parent and child on the call (T2). A priori, either approach might work better for different reasons. In both call types, the parent is asked to take notes on the exercises during the call. The parent is instructed to point to the written letters or words while the IVR (or the parent) reads, and then again while the child reads. However, in T1, the parent may not pronounce letter combinations correctly from memory. She may also listen to the exercises during the call but then not carry them out with the child later. On the other hand, T2 may cause difficulty if the phone's speaker is poor or the IVR moves too fast for the child or is not responsive enough. All parties may be more motivated when the child and parent practice together, rather than following instructions from an unknown and disembodied voice.
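For concreteness, the Arm A leveling rule described above (score cutoffs, class-median imputation, intermediate default for whole classes with no scores) could be implemented as in this sketch. The column names are hypothetical, not the study's actual data schema.

```python
import numpy as np
import pandas as pd

def assign_levels(df):
    """Arm A leveling: 0-29 basic, 30-64 intermediate, 65+ advanced.
    Missing scores are imputed with the class median; whole classes with
    no observed scores default to intermediate (the full-sample median
    level). 'class_id' and 'orf_score' are illustrative column names.
    """
    df = df.copy()
    # Impute missing scores with the class median where available.
    df["score_imputed"] = df.groupby("class_id")["orf_score"].transform(
        lambda s: s.fillna(s.median()))
    levels = pd.cut(df["score_imputed"],
                    bins=[-np.inf, 29, 64, np.inf],
                    labels=["basic", "intermediate", "advanced"])
    # Whole classes with no scores remain NaN after imputation.
    df["level"] = levels.astype(object).fillna("intermediate")
    return df

example = pd.DataFrame({
    "class_id": [1, 1, 1, 2, 2],
    "orf_score": [12, 45, None, None, None]})
print(assign_levels(example)[["class_id", "orf_score", "level"]])
```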
3.3 The Research Design

A "standard" RCT of IVR for home reading would likely consist of extensive piloting, carrying out power calculations to determine sample size and number of tested intervention arms, randomizing at the cluster (school) level, and then administering an IVR program for at least a full school year, possibly accompanied by a home survey and independent tests of reading fluency. Based on the budget for delivering and deploying messages, the size of the sample, and the available staff time, such a comprehensive study was not feasible for NewGlobe. At the same time, at the outset, it was not even known whether parents would listen to the messages at all, and there is to our knowledge no existing guidance on how best to design such calls. In such a situation, NGOs and policy makers might resort to simply not using experimental methods. They might conduct an informal pilot, implement the program at scale, and then "tinker" with it after roll-out, or conversely, simply abandon the idea. Adaptive sampling could offer a solution that enables a rigorous experiment and makes the most of the limited sample and time available. From an ethical perspective, the implementer saw it as an attractive feature that even during the experiment, a larger share of participants benefits from the higher-performing treatment arms.

Objective. In conversations about the experiment, on the one hand, the implementer wanted to identify the "best" IVR call variant, and on the other, they wanted to verify that IVR calls with reading exercises actually have positive effects on reading fluency. This hybrid goal was a reason to keep a control group that received no intervention. At the same time, it suggested using adaptive sampling to choose between the six call formats. We discuss below how the notion of the "best" IVR call translated into the choice of targeted outcome.

Constraints on the experimental design. The implementer was able to set aside only one first grade cohort and one term of the school year for testing the IVR calls, both due to other ongoing studies and due to the implementer's internal cost-benefit assessment. The Kenyan school term is 10 weeks long, split equally into 5 weeks from start to midterm exams and 5 weeks from midterm to endterm exams. Reading tests are conducted as part of these exams, providing an administrative source of data. Moreover, the rhythm of the school term from start to midterm and from midterm to endterm is similar. For example, parents' attention to their child's schoolwork may increase closer to the exams.

Relative to the cost per call, the cost (in terms of both money and time) of developing sequences of reading exercises and recording them is high.13 There was also concern that too many contact attempts from the school create fatigue in parents, especially with pilot programs that may not yet be optimally designed. For both reasons, an exercise sequence covering one half of the term was preferred to running the interventions for a full term. Jointly, these constraints reduced the space of possible research designs to conducting one or two experimental waves in the first and second half of the term, with the total of first graders enrolled that year across all schools as the available sample.

Outcome measurement. The available outcome variables were take-up of the IVR calls, or call engagement for short, and oral reading fluency (ORF) scores collected by the school.
Measures for both outcomes were provided by the implementer, with random ID numbers replacing parents' phone numbers and the child's name and school.

We use IVR provider records to measure engagement with the calls, that is, whether the call recipient actually starts the exercises. Uliza's records show every contact with the parent's registered phone number, along with the length of each call in seconds. We define a call as successful if the parent started the first exercise, which requires tapping a phone key to confirm. We define a parent as having engaged in one of the twice weekly exercise sets if the IVR made at least one successful call in that set. Since there are 9 exercise sets per wave, engagement can take values between 0 and 9. Call records are available immediately, and they are complete and accurate.

The implementer measures children's ORF scores during the midterm and endterm examination periods. In 2-3 hour periods set aside for the fluency test, a teacher examines each child by counting the number of words on a list that the child can read correctly in one minute (see Rodriguez-Segura et al. (2021) for the use of this measure to assess reading and literacy). The teacher then submits the scores to the school's grade record system. ORF scores range between 0 and 85 correct words per minute (cwpm) based on the length of the provided word list.14

There are a number of issues with ORF measurement, which were partly revealed only after wave 1 of the experiment had already started. Figure 1 shows the high variation in ORF scores between midterm and endterm. Among the non-missing scores, an unusually high proportion are multiples of five, and in some classrooms, there are implausibly many very high scores. In addition, a high percentage of scores are missing or submitted late to the recording system: ORF scores were available for only 73.5%-88.9% of children depending on the exam.15 Teacher reports on why a given score is missing are often ambiguous. Overall, the data quality for ORF scores is fairly low.

13 The exercises were developed by the implementer together with the research team. Dozens of sound snippets were recorded by a voice artist hired by Uliza. A first set of exercises was piloted with a small sample of parents in an older age group before completing all the exercises and recordings.

14 The implementer chooses a standardized, grade-appropriate word list, trains teachers and provides equipment. The measure can in principle range from zero to over 200, but for first graders it is typically not above 120.

15 The total share of scores that are multiples of 5 is 36%, and the observed score distributions show unusual heaping even when accounting for censoring at 0 and 85. Teachers sometimes delay submission or entirely fail to submit exam scores for their class. We describe the patterns of missingness and suspected rounding in more detail in Appendix B in Tables A.1 and A.2. Part of the reason that the problems of missing and rounded scores persist is that at elementary school level, these scores do not affect the student's progression into the next grade, nor do they affect the teacher's evaluation.
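Before turning to the targeted outcome, the engagement measure defined above can be illustrated with a short sketch of its computation from raw call records. The record schema here is hypothetical; Uliza's actual records differ.

```python
import pandas as pd

def engagement_score(calls):
    """Engagement per parent: number of the 9 twice-weekly exercise sets
    with at least one successful call, i.e. a call in which the parent
    started the first exercise. Columns 'parent_id', 'set_number', and
    'started_exercise' are illustrative names only.
    """
    ok = calls[calls["started_exercise"]]
    sets_reached = ok.groupby("parent_id")["set_number"].nunique()
    # Parents with no successful call score 0; the maximum is 9.
    return sets_reached.reindex(calls["parent_id"].unique(), fill_value=0)

records = pd.DataFrame({
    "parent_id": [1, 1, 1, 2, 2],
    "set_number": [1, 1, 4, 2, 3],
    "started_exercise": [True, False, True, False, False]})
print(engagement_score(records))  # parent 1 -> 2 sets, parent 2 -> 0
```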
Targeted Outcome. In order to use adaptive sampling, it is necessary to define an outcome measure that decides which is the "best" arm, which in turn determines which treatment arms will be sampled more. In many settings, this is not straightforward, given that multiple indicators related to the desired outcome(s) are typically available. Here, the implementer wants to increase parents' engagement in their children's education in general, because parental engagement is known to have positive effects on children's performance in school; at the same time, the calls explicitly encourage a set of reading exercises with the aim to improve reading. Call engagement measures whether parents listen to the reading exercises, but we do not observe the interactions they have with their children. As discussed above, ORF scores are an imperfect measure of the child's reading ability. In principle, both call engagement and ORF scores could be used to create a combined outcome measure. Moreover, if there is a (known) relationship between the two measures, e.g. higher call engagement implies greater reading improvements and vice versa, then an adaptive experiment could equivalently target either outcome.16

A priori, we conjectured that call engagement is positively correlated with reading gains. First, someone actually listening to the exercises is a necessary condition for the child's exposure to these exercises. Beyond the first couple of calls, a simple model of marginal returns also suggests that parents are more likely to engage with the calls if they feel that the child learns something and they plan on actually doing the exercises. However, there could be reasons that call engagement and ORF are not aligned: any increase in reading ability is a combination of (i) the child's exposure to the exercises, and (ii) conditional on exposure, how effectively the delivery and content of the exercises in this arm improve reading (efficacy of the arm for short). Treatment variants (T1) and (T2) could potentially have different exposure, conditional on observed call engagement, and the treatment arm design choices regarding leveling (A, B, and C) may exhibit differences in efficacy.

Without any constraints, the implementer might have chosen the best arm based on a weighted average of ORF and engagement. However, we were unable to determine assignment shares in wave 2 based on the midterm ORF scores.17 The grading day was moved during wave 1 and took place after the start of the second half of the term. Due to the submission delays described above, ORF scores "trickle in" for several weeks, and even after the end of the term, more than a quarter of the midterm data was missing (see Table A.1). The choice in practice was therefore to either exclusively target call engagement in an adaptive experiment, start the second wave late and with incomplete data for some form of adaptive assignment based on ORF scores, or conduct an experiment with uniform assignment (or abandon the test).

In this decision, it played a role that even in the best case of timely and accurate ORF measurements, any effects of IVR calls on reading ability were likely to be only incompletely realized by the end of the trial intervention. Comparable early-reading interventions measure effects after an intervention period of several months or a whole school year (Doss et al., 2019; York et al., 2019). Moreover, cumulative effects – e.g. due to habit formation – are likely to accrue for a significant period of time after intervention end, so it is unlikely that the impacts of the treatment were already fully realized by the end of the term.

16 This relationship would need to be established, e.g. from pilot data. Caria et al. (2020) make reference to the literature on statistical surrogates – measurable or short-term outcomes that can "stand in" for harder to measure or longer-term outcomes – to argue that adaptive experiments could target short-term outcomes to achieve higher welfare in the long term; see also Athey et al. (2019) for a proposal to create "surrogate indices" from multiple variables.

17 Initially, we planned to use adaptive sampling to target ORF scores. The change is documented in the pre-analysis plan, see Sautmann (2022).
Based on these considerations, it was decided to exclusively target call engagement. Ultimately, the implementer valued parental engagement sufficiently to focus on maximizing call response rates, rather than attempting to choose a treatment arm based on very noisy effect estimates of fluency gains and risking inconclusive results. Another way to view this decision is that it maximizes learning about which arm has the highest call engagement rates, at the expense of learning more precisely which arm has the greatest ORF gains. While this solution may not be optimal, it reflects another reality of policy choice: policymakers sometimes have to make do with imperfect data.

Sample and Randomization. We determined the sample using the phone number on record for the parent.18 We dropped 2 schools that had fewer than 5 students and 2 schools with very inflated ORF scores, leaving us with 108 schools with 3,163 unique student-phone number combinations.

We first randomly assigned half of the sample to waves 1 and 2 (1,581 and 1,582 phone numbers, respectively). We did not formally assess the best sample split between the first and second wave, but small-sample simulations support equal-sized waves (see supplement of Kasy and Sautmann (2021a)). Before the start of each wave, parents received an introductory call, followed by a text message confirming enrollment and explaining procedures for opting out and for switching phone numbers. Some parents opted out explicitly and some phone numbers were invalid, leaving a sample of 1,494 in wave 1 and 1,384 in wave 2.

The randomization was stratified at the school level.19 In wave 1, the assignment shares for the 6 treatment arms were each 1/7; in wave 2, we used the assignment shares given by exploration sampling, keeping 1/7 of the sample as a control group in each wave. Due to indivisibilities, the realized shares are close but not equal to the targeted shares, as shown in Table 2 in section 5 (see also the sketch below).

Estimating ORF effects. In many applications, outcomes other than the targeted outcome are of interest to the experimenter. Here, we estimate the effects of the treatments on ORF with reading fluency exam scores obtained after the experiment was completed, to learn whether the treatment arm with the highest engagement sees increases in children's reading performance. We also briefly discuss the possibility that there are differences in how engagement with the calls translates into reading gains, which might imply that the call format with the highest call engagement may not be the format with the highest reading gains.

18 The implementer has parental consent to use this phone number for school-related communications. Based on enrollment data from the start of term 3, we randomly selected one student ID for measurement in the few cases where several student IDs were associated with the same parental phone number (likely siblings). Phone numbers and schools are de-identified by the implementer before sharing with the researchers.
19 We also stratified assignment on whether the opt-in call or confirmation text message was answered. For example, in wave 1, a large proportion of the sample (796 student IDs) neither opted in nor explicitly opted out. However, the extensive-margin results (Appendix C.4) showed that most numbers answered the phone at least once during the experiment, and so we ignore this in the estimation.
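To make the role of indivisibilities concrete, the following is a minimal sketch in Python of a school-stratified assignment under target shares. The largest-remainder rounding rule and the function name are our illustration, not necessarily the exact procedure used in the experiment.

```python
import numpy as np

def assign_within_school(n_students, target_shares, rng):
    """Assign the students of one school to arms, approximating target_shares.

    Largest-remainder rounding: realized shares differ from the target
    shares only because arm counts must be integers within each stratum.
    """
    quotas = np.asarray(target_shares) * n_students
    counts = np.floor(quotas).astype(int)
    # hand the leftover slots to the arms with the largest fractional quotas
    for arm in np.argsort(-(quotas - counts))[: n_students - counts.sum()]:
        counts[arm] += 1
    arms = np.repeat(np.arange(len(target_shares)), counts)
    rng.shuffle(arms)  # randomize which student gets which slot
    return arms

rng = np.random.default_rng(0)
shares_wave1 = np.full(7, 1 / 7)  # six treatment arms plus control
print(assign_within_school(23, shares_wave1, rng))
```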
Figure 2: Timeline for the study.

Implementation and Pre-Specification. Figure 2 shows that the IVR experiment was carried out on a very short timeline. Development and implementation, including designing the reading exercises and recording and programming the calls for all treatment arms, were completed in the three months up to April 30. The research team developed the statistical model for parental engagement and carried out the treatment assignment during wave 1 (term start May 11 to midterm exam on June 12), and the model for estimating reading fluency during wave 2 (midterm to endterm exam July 13) and after. This has downsides: for example, not enough pilot data was available to improve our priors, e.g., about school random effects, and new information was still being learned while the first wave was ongoing, such as the delay in obtaining ORF scores. The timeline also shows that the experiment was pre-registered prior to the first wave, but by the time the pre-analysis plan was filed on June 11 (before the start of wave 2), the plans for the experiment had changed significantly. In general, short time windows and incomplete information in the experimental design phase may make adaptive sampling more attractive, but will also make pre-specification more challenging.

Remark: Pre-Analysis Plans and Trial Registration. A question for the research community will be whether adaptive policy choice experiments should be subject to the same norms of registration and pre-specification as "standard" experiments for causal effect estimation.20 A full analysis of the incentives at play requires a larger body of evidence on the method, but a priori, the need for pre-specification seems less pressing: depending on context, there is often no specific incentive to demonstrate the effectiveness of one treatment arm over another; the metric of expected policy regret has no established cut-offs akin to p-values for conventional significance levels; and, most importantly, after the first adaptive wave it is not possible to change the targeted outcome or the estimation approach, creating commitment before the data is fully known. The opposite is true for trial registration: policy choice experiments are likely to be used to learn about the effectiveness of many different policy options for the same outcome. Adaptive trials may inform preliminary work where less successful interventions are never implemented or tested at scale. The file drawer problem seems particularly salient in this context. In fact, a natural extension of adaptive sampling across waves is to incorporate existing evidence into the priors that inform the research design of new experiments (see, e.g., Pouzo and Finan, 2022).

20 Results showing significant effects often have higher value to both researchers and policy organizations, which contributes to issues such as data mining, the file drawer problem, publication bias (e.g., Andrews and Kasy, 2019) and so on, familiar from the literature on research transparency (Christensen and Miguel, 2018).
This form of iterative learning requires a complete record of all prior evidence gathered on the treatments under consideration.

4 Models and Estimation

This section describes how we estimate treatment effects on parental engagement and ORF measures, and how the engagement estimates are used for adaptive treatment assignment and final arm choice. We also comment on the modeling choices and their implications for policy choice experiments more generally.

4.1 The Models for Call Engagement and Oral Reading Fluency

Call Engagement. Let $Z_i^{sk}$ be the number of successful calls to a parent of child $i$ in school $s$ allocated to treatment arm $k \in \{1, \ldots, 6\}$. We assume that potential engagement is stationary across the two terms and for simplicity suppress the index for wave $t$. No calls were made to the control group, so we restrict the sample to enrolled phone numbers in the 6 treatment arms. We assume that $Z_i^{sk}$ is a draw from a Binomial distribution with at most 9 successes and average probability of engagement $\theta^{sk} \in [0, 1]$. This is motivated by the distribution of the observed numbers of successful engagements in each treatment arm, shown in Appendix C.1. We model the average engagement probability with a hierarchical logistic regression model with school random effects. Thus, we have

$$Z_i^{sk} \mid \theta^{sk} \sim \mathrm{Binomial}(9, \theta^{sk}), \qquad \theta^{sk} = \mathrm{logit}^{-1}\big(\beta^E x^k + \kappa^E \eta_s^E\big). \tag{2}$$

The vector $x^k$ is a unit vector indicating the treatment arm $k$, $\beta^E$ is a $1 \times 6$ vector of average treatment effects, and $\kappa^E \eta_s^E$ is the school-level realization of the random effect. We do not include baseline ORF information in this model – the only individual-level information we have – because of the problems with missing and noisy data outlined earlier.

We do not have much prior information on expected engagement, so we use a non-informative improper prior on $\{\beta_k^E\}_{k=1}^6$ and a Half-Normal prior distribution for $\kappa^E$ (the standard Normal on $[0, +\infty)$), and assume a Standard Normal distribution for the school random effects:21

$$p(\beta_k^E) \propto 1 \quad \forall k = 1, \ldots, 6, \qquad \kappa^E \sim \mathrm{Half\text{-}Normal}(0, 1), \qquad \eta_s^E \sim N(0, 1).$$

The hyperparameters $\{\beta_k^E\}_{k=1}^6$ and $\kappa^E$ describe the average engagement probability in each treatment arm and the arm-independent variance of the engagement probability across schools. Each $\theta^{sk}$ is a realization of the average success probability specific to the school and treatment arm.

While our main estimates focus on the average number of calls per phone number, in Appendix C.4 we also report estimates of the extensive margin of take-up. These use binary logit models, with the only change to the model above being that the outcome has a Bernoulli distribution with probability of success $\theta^{sk}$.

Remark: Modelling Treatment Effects. The model is agnostic about potential interaction effects and uses dummies for all treatment arms. A common approach to estimating the effects of cross-randomized interventions is to impose additional structure, e.g. by assuming additive effects of the intervention variants T1/T2 and A/B/C. However, note that this imposes constraints across treatment arms that may interfere with efficient learning if the underlying assumptions are incorrect. Conversely, if it is known that the treatment effects have a specific structure, the optimal assignment shares change, as observations from one treatment arm provide information about other arms, and the efficiency properties of algorithms such as exploration sampling are not known in this setting.

21 We use this random effects parameterization to avoid what is known as "Neal's funnel" when sampling from the joint distribution of the treatment effects and random effect variance (Neal, 2003).
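The model in Eq. (2) is straightforward to express in a probabilistic programming language. As an illustration only, here is a minimal sketch in Python with PyMC on synthetic placeholder data; the non-centered random effects parameterization $\kappa^E \eta_s^E$ with $\eta_s^E \sim N(0, 1)$ is the one motivated in footnote 21.

```python
import numpy as np
import pymc as pm

# Synthetic placeholder data: arm index, school index, and engagement
# counts Z (successful calls out of 9 exercise sets) per phone number.
rng = np.random.default_rng(1)
n, S = 300, 20
arm = rng.integers(0, 6, size=n)
school = rng.integers(0, S, size=n)
Z = rng.binomial(9, 0.07, size=n)

with pm.Model() as engagement_model:
    beta = pm.Flat("beta", shape=6)            # improper flat prior on arm effects
    kappa = pm.HalfNormal("kappa", sigma=1.0)  # school-effect dispersion
    eta = pm.Normal("eta", mu=0.0, sigma=1.0, shape=S)
    # non-centered random effects avoid "Neal's funnel" (footnote 21)
    theta = pm.math.invlogit(beta[arm] + kappa * eta[school])
    pm.Binomial("Z", n=9, p=theta, observed=Z)
    # 4 chains with 2,000 kept draws each after 2,000 warm-up, as in Table 1
    idata = pm.sample(draws=2000, tune=2000, chains=4)
```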
Oral Reading Fluency. Our estimation of oral reading fluency uses ORF scores from three periods: the endterm exam of term 2 (E2), and the midterm and endterm exams of term 3 (M3, E3). This means we capture all students pre-treatment, wave-1 students in two periods post-treatment, and wave-2 students in one period post-treatment (provided their ORF score is not missing). We use a Bayesian approach for consistency and because Bayesian inference is valid even with adaptive sampling.

Let $Y_{it}^{sk}$ denote the ORF score of a student $i$ at time $t$ in school $s$, assigned to treatment arm $k$. Define $\gamma_{it}^{sk}$ as the average ORF score of student $i$ in school $s$ for period $t \in \{E2, M3, E3\}$ and arm $k \in \{0, \ldots, 6\}$, where $k = 0$ now includes the control group. We assume that $Y_{it}^{sk}$ has a normal distribution, and model the average ORF score with a hierarchical linear regression:

$$Y_{it}^{sk} \mid \gamma_{it}^{sk} \sim N(\gamma_{it}^{sk}, \sigma^2), \qquad \gamma_{it}^{sk} = \beta_0 + \beta^F x_t^k + \kappa^F \eta_s^F + \phi \alpha_i + \rho \iota_t. \tag{3}$$

As before, $\beta^F$ is a $1 \times 6$ vector of average treatment effects. The vector $x_t^k$, $k \in \{1, \ldots, 6\}$, is a unit vector that indicates whether the student experienced treatment $k$ in period $t$ or earlier, as in a simple difference-in-differences specification with time-invariant treatment effects. The product $\kappa^F \eta_s^F$ is the realization of a school-level random effect, $\phi \alpha_i$ is the realization of a student-level random effect, and $\rho \iota_t$ is the realization of a period-level random effect. We use a non-informative improper prior on $\{\beta_k^F\}_{k=0}^6$ and a Half-Normal prior distribution for each one of the random effect variance terms $\{\sigma, \kappa^F, \phi, \rho\}$, and assume a Standard Normal distribution for each of the random effects $\{\eta_s^F, \alpha_i, \iota_t\}$. We have:

$$p(\beta_0) \propto 1, \qquad p(\beta_k^F) \propto 1 \quad \forall k = 1, \ldots, 6, \qquad \{\sigma, \kappa^F, \phi, \rho\} \sim \mathrm{Half\text{-}Normal}(0, 1), \qquad \{\eta_s^F, \alpha_i, \iota_t\} \sim N(0, 1).$$

Remark: Note that, unlike for call engagement, we expect that ORF scores increase over time independently of the intervention, as students' reading ability improves over the course of the school term. The control group helps distinguish the pure time trend, captured by $\rho \iota_t$, from any common effects of the IVR calls on ORF. In pure policy choice experiments with stationary outcomes, a control group is not needed. But sampling a control group and including a period random effect in the model can be useful if the outcome targeted for adaptive sampling is expected to vary over time, even if the treatment effects have the same distribution across waves.
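Eq. (3) extends the same template with student- and period-level random effects. A sketch under the same caveats (synthetic long-format placeholder data on a standardized scale; `post` is the 0/1 indicator behind $x_t^k$, and control observations simply carry `post = 0`):

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(2)
n_obs, S, N, T = 900, 20, 300, 3       # observations, schools, students, periods
student = rng.integers(0, N, size=n_obs)
school  = rng.integers(0, S, size=n_obs)
period  = rng.integers(0, T, size=n_obs)   # E2, M3, E3
arm     = rng.integers(0, 6, size=n_obs)   # ignored wherever post == 0
post    = rng.integers(0, 2, size=n_obs)   # treated in period t or earlier
Y       = rng.normal(0.0, 1.0, size=n_obs) # placeholder ORF scores

with pm.Model() as orf_model:
    beta0  = pm.Flat("beta0")
    betaF  = pm.Flat("betaF", shape=6)
    sigma  = pm.HalfNormal("sigma", 1.0)
    kappaF = pm.HalfNormal("kappaF", 1.0)
    phi    = pm.HalfNormal("phi", 1.0)
    rho    = pm.HalfNormal("rho", 1.0)
    eta    = pm.Normal("eta", 0.0, 1.0, shape=S)    # school effects
    alpha  = pm.Normal("alpha", 0.0, 1.0, shape=N)  # student effects
    iota   = pm.Normal("iota", 0.0, 1.0, shape=T)   # period effects
    gamma = (beta0 + post * betaF[arm] + kappaF * eta[school]
             + phi * alpha[student] + rho * iota[period])
    pm.Normal("Y", mu=gamma, sigma=sigma, observed=Y)
```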
Model Fitness. We conduct standard checks on the distribution of predicted outcomes for the call engagement and the oral reading fluency models to validate whether they correctly replicate the characteristics of the observed outcome variable. We also check the sensitivity of our results to different prior distribution specifications. For the call engagement model, we select four different prior distributions for $\beta_k^E$ ($\beta_k^F$ for ORF): (i) a normal distribution centered on 0 with variance equal to 100, (ii) a Student-t distribution with 1 degree of freedom, mean 0, and variance equal to 100, (iii) a normal distribution centered on 0 with variance equal to 1, and (iv) a Student-t distribution with 1 degree of freedom, mean 0, and variance equal to 1. Next, we follow the same approach with $\kappa^E$ ($\kappa^F$ for ORF) and test the following prior distributions: (i) a half normal distribution with mean 0 and variance equal to 100, (ii) an inverse $\chi^2$ distribution with 1 degree of freedom, and (iii) a half Student-t distribution with 1 degree of freedom, mean 0, and variance equal to 1. In all these cases, the results are not affected by the choice of prior distribution: given the large sample size, the likelihood dominates the prior.

4.2 Treatment Assignment and Exploration Sampling

In wave 2, we want to use the exploration sampling algorithm proposed in Kasy and Sautmann (2021a) to assign experimental units to treatment arms. Doing so requires calculating the probability $p_t^k$ that each arm is optimal after each wave. In the policy choice model in Kasy and Sautmann, the outcome is binary, there are no covariates, and the parameter of interest is simply the arm mean $\theta^k$ with a Beta prior. The posteriors used to derive $p_t^k$ therefore have a closed form. Here, we estimate a generalized linear model that allows for a school-specific average call success rate, appropriate if we expect outcomes to vary significantly between clusters (such as schools). However, this implies that the expected outcome in arm $k$, $\bar{\theta}^k = E_T[\theta^{sk} \mid k]$, depends on the random effects (note that $\theta^{sk}$ is the re-scaled expectation of call engagement $Z_i^{sk}$). Moreover, we sample the posterior distribution of all parameters using MCMC, which requires many numerical draws.

In order to simplify the calculation of $p_t^k$, we use that $\bar{\theta}_t^k > \bar{\theta}_t^{k'}$ if and only if $\beta_k > \beta_{k'}$. In our model, this is the case since $\theta^{sk}$ is strictly increasing in $\beta^E$ for any realization of the school effect $\eta_s^E$ or the dispersion parameter $\kappa^E$. This implies that

$$\Pr_t(k = \arg\max_{k'} \bar{\theta}^{k'}) = \Pr_t(k = \arg\max_{k'} \beta_{k'}), \tag{4}$$

and therefore we can simulate the probability that arm $k$ is optimal using just the posterior of the parameters $\{\beta_k^E\}_{k=1}^6$, rather than the (joint) distribution of all the parameters entering $\theta_i^k$.22 This shortcut can simplify deriving the exploration sampling assignment shares for many models with covariates or random effects.

Posterior Probability of Successful Engagement and Posterior Expected Regret. At the end of the experiment, we want to implement the arm with the highest average outcome, or equivalently, the lowest policy regret. Here, we translate this to choosing the treatment arm with the lowest posterior expected regret in terms of the engagement probability, $E_T[\Delta^k] = E_T[\theta^{s(1)} - \theta^{sk} \mid k]$ (where the expectations are formed over the posteriors for $\beta$ and $\kappa$ and the normally distributed school random effects, and $\theta^{s(1)}$ denotes the school-specific success probability under the optimal treatment arm).

The expected probability of a successful engagement $\bar{\theta}^k = E_T[\theta^{sk} \mid k]$ and the expected regret $E_T[\Delta^k]$ cannot be derived from the distribution of the $\{\beta_k^E\}$ alone, because of the non-linear inverse logit transformation $\mathrm{logit}^{-1}(x) = e^x/(1 + e^x)$.23 We therefore draw from the posterior distributions of $\kappa^E$ and $\beta^E$ and the standard normal distribution of $\eta_s^E$ to calculate the success probability in each arm and school. Then we average over these $\theta^{sk}$ draws to obtain $\bar{\theta}^k$ as well as $E_T[\Delta^k]$.

22 Note that the same approach would also be valid if we had targeted ORF and were to simulate the probability optimal based on the $\{\beta_k^F\}_{k=1}^6$.
23 Note, for example, that the estimate of the average success probability, $\bar{\theta}^k = E_T[\mathrm{logit}^{-1}(\beta^E x^k + \kappa^E \eta_s^E) \mid k]$, is different from both $\mathrm{logit}^{-1}(\hat{\beta}_k^E)$, the inverse logit of the point estimate of the treatment effect, and from $E_T[\mathrm{logit}^{-1}(\beta_k^E)]$, the expected success rate at the median school with $\eta_s^E = 0$.
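In code, the shortcut in Eq. (4) and the simulation of $\bar{\theta}^k$ and $E_T[\Delta^k]$ reduce to a few array operations on the posterior draws. A minimal sketch, with synthetic draws standing in for the MCMC output; the share rule $q^k \propto p^k(1 - p^k)$ is exploration sampling as defined in Kasy and Sautmann (2021a), rescaled here to hold out 1/7 of the sample for the control group:

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-ins for the MCMC output: (n_draws, 6) draws of beta^E, n_draws of kappa^E.
beta_draws = rng.normal([-2.63, -2.49, -2.78, -2.89, -2.67, -3.32], 0.1, (8000, 6))
kappa_draws = np.abs(rng.normal(0.5, 0.05, size=8000))

# Probability optimal: by Eq. (4), comparing the betas is enough.
p_opt = np.bincount(beta_draws.argmax(axis=1), minlength=6) / len(beta_draws)

# Exploration sampling shares, with 1/7 held out as control.
q = p_opt * (1 - p_opt)
shares = np.append(q / q.sum() * (6 / 7), 1 / 7)

# theta_bar and expected regret require integrating out the school effect:
# draw one school effect per posterior draw and push it through the inverse logit.
eta = rng.standard_normal((8000, 1))
theta = 1 / (1 + np.exp(-(beta_draws + kappa_draws[:, None] * eta)))
theta_bar = theta.mean(axis=0)                                    # E_T[theta^sk | k]
regret = (theta.max(axis=1, keepdims=True) - theta).mean(axis=0)  # E_T[Delta^k]
```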
Remark: Predicting probability of success and policy regret with school-level effects. By drawing the school random effects from the normal distribution, we implicitly take an "out-of-sample" approach that ignores the distribution of realized random effects in the student sample. This is informed by the fact that we did not find important differences by school size, such as a correlation between average ORF scores and size. We therefore treat new generations of students as random draws from the distribution of school random effects. An alternative would be to treat the school random effects as persistent and combine the posteriors of the school random effects with assumptions about (future) class sizes to obtain the expected (future) engagement probability and regret. Using expected regret based on predicted treatment outcomes as the decision criterion requires making explicit what assumptions underlie the predictions.

Remark: Heterogeneity. Relatedly, our approach to calculating $p_t^k$ rests on the monotonicity of the $\theta^{sk}$ in $\beta_k$. The approach does not apply when "preference reversals" occur. As a simple example, suppose arm $k$ has a strong effect in some schools and none in others, whereas arm $k'$ has a moderate effect in all schools. In this case, which arm has the higher average treatment effect depends on the distribution of treatment effects – here, for example, on the sizes of the different schools. If such heterogeneity is expected, the researcher needs to estimate the distribution of $\theta_i^k$ more flexibly, for instance by allowing interactions between covariates and treatment, in which case deriving both the probability optimal $p_t^k$ and the expected regret $E_T[\Delta^k]$ requires assumptions about the covariate distribution in the population. Note also that preference reversals imply that treatment $k$ is optimal for some schools, whereas for others it is $k'$; in other words, the unconstrained optimal policy is specific to each school. Targeted policy choice is discussed briefly in Kasy and Sautmann (2021a), and Caria et al. (2020) describe a targeted adaptive experiment using their proposed tempered Thompson algorithm. Targeting has the advantage that we do not need to "trade off" strata for which different policies are optimal, but it is not always easy to implement in real-world contexts.

4.3 Frequentist Inference

As discussed, treatment effect estimates from adaptively collected data are subject to sampling bias, and focusing on the effect in the best arm leads to a "winner's curse". Corrections for these sources of bias are a rapidly evolving field of research. To our knowledge, there is no method yet available to correct for adaptive sampling bias in models with random effects, but there exist weighting approaches for a range of settings that make estimators asymptotically normal (Hadad et al., 2021; Zhang et al., 2021, 2020). In particular, the square-root inverse propensity weighting proposed by Zhang et al. (2021) – which in our setting corresponds to weights $\sqrt{1/q_t^k}$ for observations in arm $k$ – applies to M-estimators including the Binomial GLM. Using these adaptive weights results in an estimator that is asymptotically normal.
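As an illustration of how such weights enter the estimation, a hypothetical sketch using statsmodels on synthetic data; only the weighted point estimates are meaningful in this form, since valid standard errors additionally require the variance construction in Zhang et al. (2021):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 1384
arm = rng.integers(0, 6, size=n)
Z = rng.binomial(9, 0.07, size=n)

# Hypothetical wave-2 assignment shares q_t^k; wave-1 rows would use 1/7.
q = np.array([0.074, 0.393, 0.284, 0.009, 0.097, 0.001])
w = np.sqrt(1.0 / q[arm])            # square-root inverse propensity weights

X = np.eye(6)[arm]                   # arm dummies, no constant
endog = np.column_stack([Z, 9 - Z])  # successes and failures out of 9 calls
fit = sm.GLM(endog, X, family=sm.families.Binomial(), freq_weights=w).fit()
print(fit.params)                    # weighted analog of Table 5, column (3)
```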
In section 5.3, we examine how these weights affect point estimates and confidence intervals compared to an unweighted Binomial GLM estimate.

In addition, Andrews et al. (2021) have developed corrections for the "winner's curse" that arises when estimating the treatment effect in the best arm. We construct confidence intervals with "unconditional coverage," which allow valid inference on the effect of IVR calls on engagement when the best call format is implemented (but regardless of which of the six formats that is).24 These corrections require normally distributed estimates. Following a suggestion by Hadad et al. (2021), we use the adaptively weighted Binomial GLM estimates as inputs into these corrections and show how this changes the point estimates and confidence intervals (section 5.3). These approaches are not directly comparable to the Bayesian estimates with random effects, but they allow us to gain some intuition about how the treatment effect estimates change. Two recent software packages make it easy to apply the "winner's curse" corrections (Shreekumar, 2020; Bowen, 2022).

5 Results of the IVR Experiment

5.1 Call Engagement

Table 1 presents estimates of the treatment effects from Bayesian Binomial GLM models as specified in Equation (2). We show both the estimate with only wave-1 data and with data from both waves. The table reports the means and, in brackets, the 95% highest-probability density (HPD) intervals of the posterior distributions.25 A higher coefficient is associated with a greater average probability of successful engagement.26

24 One may debate whether conditional or unconditional coverage is appropriate. In an experiment that compares different types of interventions – say, conditional cash transfers and IVR calls – we may be interested in the effect of IVR calls only if they yield better outcomes than the cash transfer. We see this as a case of conditional inference, because the identity of the best arm matters.
25 The 95%-HPD region $H$ is defined by the highest $k$ such that $\int_l^u f(\theta)\,d\theta = 95\%$ and $f(\theta) \geq k$ for all $\theta \in H$, where $f$ denotes the posterior pdf of $\theta$. For unimodal distributions, $H$ is an interval.
26 Recall that, for a point estimate of the treatment effect $\beta_k^E$ and the median school with random effect 0, the probability of success in arm $k$ equals $\theta^k = \exp(\beta_k^E) / (1 + \exp(\beta_k^E))$.

Table 1: Call engagement estimates after wave 1 and 2.

                         Bayesian Binomial GLM
                 Wave 1              Full sample
                 (1)                 (2)
T1A              −2.84*              −2.63*
                 [−3.09; −2.60]      [−2.81; −2.46]
T1B              −2.64*              −2.49*
                 [−2.87; −2.42]      [−2.63; −2.36]
T1C              −2.75*              −2.78*
                 [−3.00; −2.52]      [−2.93; −2.63]
T2A              −2.94*              −2.89*
                 [−3.19; −2.70]      [−3.11; −2.68]
T2B              −2.83*              −2.67*
                 [−3.08; −2.60]      [−2.85; −2.50]
T2C              −3.46*              −3.32*
                 [−3.74; −3.20]      [−3.57; −3.07]
Num. students    1283                2462
Period           1                   1 and 2

Notes: * Value of zero lies outside of the 95% credible interval. We simulate 4 independent Markov chains of 4,000 posterior draws each and discard the first 2,000 as warm-up. The remaining 8,000 draws are used to generate the posterior distributions of the coefficients. The split-$\hat{R}$ of every posterior distribution is below 1.01 and there are no divergent transitions.
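For a unimodal posterior, the HPD interval reported in Table 1 (footnote 25) is simply the shortest interval containing 95% of the posterior draws, which is easy to compute directly from the MCMC output. A minimal sketch:

```python
import numpy as np

def hpd_interval(draws, mass=0.95):
    """Shortest interval containing `mass` of the posterior draws.

    Approximates the HPD region of footnote 25 for a unimodal posterior;
    `draws` is a 1-D array of MCMC draws.
    """
    x = np.sort(draws)
    m = int(np.ceil(mass * len(x)))
    widths = x[m - 1:] - x[: len(x) - m + 1]   # width of every m-draw window
    lo = widths.argmin()
    return x[lo], x[lo + m - 1]

rng = np.random.default_rng(5)
print(hpd_interval(rng.normal(-2.49, 0.07, size=8000)))  # about (-2.63, -2.35)
```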
Table 2: Treatment allocation in waves 1 and 2.

                        Wave 1                            Wave 2
Treatment    Target %   Actual %   Num.       Target %   Actual %   Num.
                                   students                         students
T1A          14.28%     14.12%     211        7.44%      8.45%      117
T1B          14.28%     14.73%     220        39.26%     40.46%     560
T1C          14.28%     13.86%     207        28.45%     26.81%     371
T2A          14.28%     14.32%     214        0.89%      1.01%      14
T2B          14.28%     13.72%     205        9.68%      8.53%      118
T2C          14.28%     15.13%     226        0.00%      0.00%      0
Control      14.28%     14.12%     211        14.29%     14.74%     204

Notes: treatment arm sample allocation in waves 1 and 2. Target % shows the theoretical target share of each treatment arm. Actual % shows the realized treatment allocation after randomization with stratification. Num. students is the number of students in each treatment arm.

The estimates from wave 1 in Table 1 were used to determine the exploration sampling shares for wave 2. Table 2 shows the theoretical sample shares in each treatment group, as well as the assigned sample shares after stratifying by school, for both wave 1 and wave 2. Exploration sampling reduced the sampling share assigned to treatments T2A and T2C to zero or almost zero. Moreover, T1A and T2B received only slightly over 8% of the sample. The bulk of the allocation went to T1B and T1C (aside from the control). These are both calls in which the IVR instructs the parent to lead reading exercises, but in B the same intermediate exercise sequencing is used for all children, whereas in C the parent can choose the exercises.

Column (1) in Table 1 shows that some differences in treatment effects already emerged in wave 1, which led to the differences in treatment assignment in wave 2. The full-sample estimate in column (2) shows both slightly different point estimates and significantly tighter HPD intervals, especially for the higher-performing treatments. Figure 3 displays the treatment effect posterior distributions after wave 2, corresponding to the estimate in column (2) of Table 1. The shape of the distributions shows that the higher treatment effects are estimated with significantly greater precision. This allows a finer distinction between T1A, T1B, and T2B. After wave 2, T1B is the treatment arm with the highest level of engagement, with a point estimate of $\hat{\beta}_{T1B}^E = -2.49$, whereas T2C has the lowest engagement with $\hat{\beta}_{T2C}^E = -3.32$.

Figure 3: Posterior distributions of parent engagement coefficients. Notes: the figure shows the posterior distribution of parent engagement coefficients after wave 2. Greater values are associated with a higher probability of a successful engagement. The vertical bar marks the median of each posterior distribution. The shaded areas indicate the 95% credible intervals. A total of 8,000 posterior draws sampled from 4 independent Markov chains were used.

Table 3 provides additional information. Columns (1) and (2) show the raw numbers of attempted engagements and the share of successful engagements (dividing the number of successful calls by the number of call attempts). Columns (3) to (5) are based on the posterior of the treatment effect vector $\beta^E$. The mean and standard deviation in each arm replicate the estimation results in column (2) of Table 1 and show once more that higher means are associated with lower dispersion of the estimate. Column (5) shows the probability optimal $p_2^k$ for each arm $k$. The posterior probability that T1B is the optimal choice is over 93%; three arms (T1C, T2A, and T2C) have essentially zero posterior probability of delivering the highest level of engagement. Arm T1A, parent-led reading with leveled exercises, has the second-highest engagement rate of 7.43%, but only a 5.24% probability optimal.
The last two columns transform the posterior estimates into an average probability of successful engagement for each arm, $\bar{\theta}^k$, and report the expected policy regret based on the probability of engagement, the objective of interest (see section 4). This statistic shows that implementing T1B would lead to an expected loss in terms of the probability of a successful call of only 0.02 percentage points. For the other treatment arms, the loss ranges between 0.99pp and 4.49pp. These expected losses correspond to less than 1% (for T1B), and to between roughly 12% and 53% (for the other arms), of the highest estimated success probability of 8.40% in arm T1B.

Table 3: Call engagement: treatment effect estimates after wave 2.

         Raw numbers              Posteriors of β^E                 Average engagement θ̄^k
Arm      Call       Share         Mean    SD     Prob.             Success     Post. exp. policy
         attempts   successful                   optimal p_t^k     prob. θ̄^k  regret E_T(∆^k)
         (1)        (2)           (3)     (4)    (5)               (6)         (7)
T1A      2,952      7.28%         −2.63   0.09   5.24%             7.43%       0.99%
T1B      7,020      8.40%         −2.49   0.07   93.19%            8.40%       0.02%
T1C      5,193      6.47%         −2.78   0.08   0.00%             6.49%       1.93%
T2A      2,052      5.95%         −2.89   0.11   0.00%             5.86%       2.56%
T2B      2,907      7.05%         −2.67   0.09   1.57%             7.15%       1.27%
T2C      2,034      3.98%         −3.32   0.13   0.00%             3.93%       4.49%

Notes: (1) A call attempt is a scheduled call to a parent, 9 per wave (not counting repeated attempts and call-backs). (2) The share successful is the percentage of call attempts in which the exercises were started. (3-4) The posterior mean and standard deviation of $\beta^E$ were calculated from a total of 8,000 posterior draws sampled from 4 independent Markov chains. (5) The probability optimal is calculated as in Eq. (4). (6) The average probability of success is calculated by averaging over posterior draws as described in section 4.2. (7) The posterior expected policy regret is the expected loss from choosing this arm, expressed in terms of the probability of a successful call, after observing both waves of the experiment.

To look more closely at parents' decision to answer the twice-weekly IVR calls, we also analyze the extensive margin of engagement. Appendix C.4 shows estimates for the probability of any successful engagement (i.e., whether the recipient started the reading exercises in any of the calls received) and the probability of answering the phone at least once. Tables A.6 and A.7 report the coefficient estimates and the corresponding treatment arm averages. The arms had nearly identical initial response rates: in five arms, at least one call was answered with 84.1%-86.6% probability, and the response rate was only slightly lower in T2A (81.7%). The share of phone numbers with at least one successful engagement varies somewhat more across arms, and is particularly low in T2C, where the rate is only about half of what it is in the other arms. However, T1A, T1B, and T2B have nearly identical engagement probabilities.

It is instructive to also compare the "probability highest" for each arm based on the extensive-margin estimates, reported in column (2) of Table A.7. These probabilities are the analog of the probability optimal in column (5) of Table 3 (they can be calculated for any outcome in any experiment, regardless of whether adaptive sampling was used). These probabilities never exceed 35.3%. The point estimates and probability highest indicate that the six call formats are much less clearly differentiated based on the probability of "any engagement" than based on the overall call engagement rate. This suggests that the intensive margin matters, and that differences in response rates emerge more clearly as parents learn about the calls and decide about continued engagement.
One interpretation of the results, comparing the A and B arms, is that leveling exercise content in this setting is not valuable – perhaps because of the noisy and often missing ORF scores used for leveling – or at least not valued by parents, who may perceive the exercises as too easy or too difficult. Both C arms have relatively low call engagement rates. It is worth noting that the option to choose between exercises increases the length of the call, which may discourage the listener. The call success rate in T2C is particularly low, and we conjecture that this is because the listener is not only asked to choose which exercises to play, but the IVR here also addresses the child directly. This "gamification" aspect may lead the parent to worry about overly long calls in which the child skips around between exercises. Between T1 and T2, the posterior means suggest that T1 arms have slightly higher engagement rates, perhaps because the "listen now, practice later" format allows the parent more flexibility.

The sampling shares in Table 2 and the numbers of attempted and successful engagements in Table 3 also demonstrate a property of adaptive sampling that is attractive in the context of policy choice: the reassignment of treatment arm shares in later waves means that a larger percentage of participants benefit from the treatment arms with better outcomes. Here, this means more students get IVR calls with high engagement levels. At the end of this experiment, 27.10% of students had participated in T1B, compared to only 7.85% in arm T2C.

5.2 Oral Reading Fluency

Even though the adaptive sampling algorithm was geared towards learning about call engagement, we would also like to estimate treatment effects on reading fluency. ORF may increase directly if parents regularly carry out the delivered exercises with their children, improving their reading. The calls may also increase parents' awareness of their child's reading ability more generally, leading them to express interest and encourage reading practice in day-to-day interactions.

Table 4 presents estimates from two different samples. Column (1) in both panels shows the estimated treatment effects on ORF scores using only the sample of students with complete score information in all three exams, whereas column (2) uses all students for whom we have at least one treated and one untreated exam score. Figure 4 shows the posterior distributions of the ORF coefficients corresponding to Panel A of Table 4, panel (a) for the balanced panel data and panel (b) for the unbalanced panel data.

In both samples, the ORF treatment effects shown in Panel A are small and estimated noisily, ranging from 0.90 to 1.90 correct words per minute. By comparison, in the control group, ORF increased on average by 1.62 cwpm and 2.92 cwpm in the first and second half of the term, respectively. Overall, going from column (1) to column (2), the treatment effects tend to be estimated as larger, although with similar credible intervals; despite the much larger sample in the unbalanced panel, the precision of the estimates does not increase much, perhaps due to the student-level random effects. In the balanced panel, the credible intervals for all six coefficients include 0. However, the unbalanced panel estimate for the arm with the highest call engagement, T1B, indicates an increase in fluency of 1.68 cwpm, and the credible interval does not include zero.
Note that T1B has a relatively large share of the sample because of the use of adaptive sampling, and therefore the effect on fluency is more precisely estimated in this arm than in others, even though the mean estimated ORF effects are slightly larger in some other arms. The larger sample size in the treatment arms chosen for implementation is an advantage of adaptive sampling for the estimation of non-targeted outcomes. In order to test whether simply receiving any calls has an effect on fluency, we pool the six treatment groups in Panel B. In both samples, the HPD intervals do not include 0, and the effect is 1.31 cwpm in the balanced panel and 1.53 cwpm in the unbalanced panel.

It is worth emphasizing once more that the fluency estimates are only indicative, because of the low data quality and because the effects of any treatment were likely captured only incompletely, due to the short exposure and the one-off measurement of ORF immediately after treatment.

Figure 4: Posterior distributions of treatment effects for ORF scores. (a) Balanced panel. (b) Unbalanced panel. Notes: the figures present the posterior distributions of treatment effects after wave 2. The vertical bar marks the median of each posterior distribution. The shaded areas indicate the 95% credible intervals. A total of 8,000 posterior draws sampled from 4 independent Markov chains were used.

That said, the estimates suggest that an IVR intervention to engage parents in their children's reading can have a positive impact on children's reading skills. This is an encouraging finding given the relatively "light touch" of this intervention. Based on the estimates from the unbalanced panel, it is more than 95% likely that implementing the arm with the highest engagement, T1B, which asks parents to carry out a few simple reading exercises sequenced the same for all children, will lead to positive reading fluency gains. While the effects of the 4.5-week intervention tested here were moderate, it stands to reason that exposure for the full term or even the full school year would generate larger effects. The program may also lead to continued joint reading between parents and children after the calls end.

A remaining question is how the treatment effects on fluency compare between the different arms, and whether call exposure and efficacy vary strongly enough that one of the arms with lower call engagement could be more effective for reading outcomes. Unfortunately, the answer is hampered by the quality of the data and the relatively small effect sizes. From Figure 4, there is significant overlap in the credible intervals of all arms, even for the treatment arms with a large share of observations. To get a sense of the uncertainty, Table A.4 in Appendix C shows the "probability highest" and the expected regret for each arm based on the posterior distributions of the ORF model. The probability that T1B leads to the highest possible reading gains among the six arms lies between 12% and 20% according to these estimates. T1B generates a posterior regret of 0.94 cwpm in the balanced panel and 1.14 cwpm in the unbalanced panel. In the balanced panel, T1B is the arm with the lowest posterior regret. In the unbalanced panel, arm T2B has the lowest posterior regret, with 0.92 cwpm.
While the probability optimal is higher than for T1B for three arms (T2A, T2B, and T2C), it is below 26.3% for all of them, and the difference in expected regret is less than 0.24 cwpm. The low probability optimal for the arms with the lowest regret reflects the noise in these estimates. Note also that T2A has a higher probability optimal than T2B in both the balanced and the unbalanced panel, highlighting that the arm with the highest probability optimal may not always have the lowest policy regret. This can occur if some "unlikely" states of the world have very high regret realizations, and it occurs more often when the best arm is fairly uncertain.

Table 4: ORF score estimates.

Panel A: Treatment effects

                 Balanced Panel     Unbalanced Panel
                 (1)                (2)
(Constant)       46.90*             46.54*
                 [43.98; 49.92]     [43.76; 49.31]
T1A              0.90               1.29
                 [−1.26; 3.09]      [−0.70; 3.30]
T1B              1.60               1.68*
                 [−0.04; 3.26]      [0.13; 3.21]
T1C              1.08               0.91
                 [−0.72; 2.92]      [−0.79; 2.59]
T2A              1.40               1.85
                 [−1.01; 3.81]      [−0.42; 4.04]
T2B              1.41               1.90
                 [−0.75; 3.55]      [−0.12; 3.92]
T2C              1.32               1.79
                 [−1.06; 3.71]      [−0.49; 4.05]

Panel B: Pooled treatment effects

                 Balanced Panel     Unbalanced Panel
                 (1)                (2)
(Constant)       46.91*             46.63*
                 [43.87; 50.02]     [43.77; 49.49]
Pooled treatment 1.31*              1.53*
                 [0.08; 2.52]       [0.34; 2.69]
Num. obs.        5469               6701
Num. students    1823               2439

Notes: Reporting means and 95% HPD intervals (in square brackets) of the posterior distributions of treatment effects. *: zero lies outside the 95% credible interval. We simulate 4 independent Markov chains of 4,000 posterior draws each and discard the first 2,000 as warm-up. The remaining 8,000 draws are used to generate the posterior distributions of the coefficients. The split-$\hat{R}$ of every coefficient is below 1.01 and there are no divergent transitions.

Overall, based on these results there is significant uncertainty about which arm has the highest ORF gains. There is no strong evidence that choosing a policy based on maximal call engagement is systematically in tension with also increasing oral reading fluency, but we also cannot conclude that the two outcomes are definitely aligned. If the implementer would like to revise the decision to target engagement only and learn which call format maximizes ORF gains, additional testing would likely be needed.

5.3 Correcting for Sampling Bias and Winner's Curse

While most of our analysis is Bayesian, researchers may also be interested in conducting frequentist inference with the data obtained from a policy choice experiment to draw broader conclusions about the interventions tested, and this requires correcting sampling and winner's curse biases.

Table 5: Call engagement estimates applying the adaptively weighted M-estimator by Zhang et al. (2021) and the "winner's curse" correction by Andrews et al. (2021).

Panel A: Binomial model estimates, unweighted and with adaptive weighting.

                 Unweighted          Unweighted           Adaptively weighted
                 With school RE      Without school RE    Without school RE
                 (1)                 (2)                  (3)
T1A              −2.63*              −2.54*               −2.52*
                 [−2.80; −2.45]      [−2.79; −2.30]       [−2.78; −2.27]
T1B              −2.49*              −2.39*               −2.39*
                 [−2.62; −2.36]      [−2.54; −2.24]       [−2.55; −2.24]
T1C              −2.77*              −2.67*               −2.66*
                 [−2.92; −2.62]      [−2.86; −2.49]       [−2.85; −2.46]
T2A              −2.88*              −2.76*               −2.79*
                 [−3.09; −2.67]      [−3.09; −2.43]       [−3.20; −2.39]
T2B              −2.67*              −2.58*               −2.57*
                 [−2.84; −2.49]      [−2.82; −2.34]       [−2.81; −2.33]
T2C              −3.31*              −3.18*               −3.18*
                 [−3.55; −3.06]      [−3.59; −2.77]       [−3.59; −2.77]
Num. students    2462                2462                 2462
School RE        Yes                 No                   No
Panel B: "Inference on winners" correction on T1B.

                 With school RE      Without school RE    Re-weighted
                 (1)                 (2)                  (3)
T1B              −2.49*              −2.39*               −2.39*
                 [−2.66; −2.32]      [−2.59; −2.19]       [−2.60; −2.18]

Notes: * Value of zero lies outside of the 95% confidence interval. (1) Frequentist estimate, unweighted and with school random effects as in the original model specification (Table 1, Column 2). (2) Frequentist estimate without school random effects. (3) Frequentist estimate without random effects, applying adaptive weights as in Zhang et al. (2021). Panel A: full estimates for all treatment groups. Panel B: median estimate and adjusted confidence intervals for T1B, applying corrections for inference on the best arm as in Andrews et al. (2021). Note that this correction is only theoretically valid in column (3), where the underlying estimator is asymptotically normal.

As discussed in section 4.3, a method to correct for the biases that arise from adaptive sampling when there are random effects does not, to our knowledge, yet exist. We therefore present results without random effects for illustrative purposes. In Table 5, we show a set of frequentist estimates that iteratively apply adaptive weighting and the winner's curse correction. In column (1), we show unweighted estimates from a Binomial model with random effects. These are the frequentist equivalent of the Bayesian estimates in column (2) of Table 1 (and they are very similar). Column (2) shows unweighted estimates again, but this time without random effects. As is common, this shifts the estimated coefficients somewhat towards 0. In column (3), we apply the adaptive weights proposed by Zhang et al. (2021) to obtain asymptotically normal estimates.

It is instructive to compare columns (2) and (3) in Panel A: for the best arm, the estimates are almost identical, whereas for T2A, for example, the point estimate is shifted and the confidence interval significantly wider. This reflects that arms that initially perform poorly receive only a small share of the sample, and the weighted estimator therefore gives those few observations significantly greater weight, with the potential to change the overall treatment effect estimate. As Hadad et al. (2021) observe, this is an indirect consequence of the fact that sampling bias primarily affects the sub-optimal arms (which are "dropped" from the sample) rather than the optimal arm, where initial biases have a chance to self-correct.

In Panel B, we apply the winner's curse correction by Andrews et al. (2021) to the treatment effect estimate for the empirically best arm, T1B. Note that the method requires normally distributed estimators, so it is strictly speaking only applicable to the weighted estimates in column (3). However, for illustration purposes we carry out the same correction in all columns. The corrected confidence intervals we obtain are somewhat wider than the "naïve" estimates in Panel A. However, the point estimates for the treatment effects remain virtually the same. This reflects that, at least in the IVR experiment, the best arm is fairly unambiguously identified, and the distribution of the estimator is therefore not significantly truncated. This also means that a winner's curse is less likely. As Andrews et al. (2021) also point out, uncorrected frequentist estimates are asymptotically valid. We may deduce that we need not be too worried about taking the Bayesian treatment effect estimates for the IVR experiment at face value. However, in experiments with smaller samples, both sampling biases and the winner's curse problem may be more pronounced.
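A small Monte Carlo exercise illustrates this last point. Taking the point estimates from Table 1 as the truth and assuming a hypothetical, much larger per-arm standard error than in our data, the estimate of the winning arm is biased upward by selection:

```python
import numpy as np

rng = np.random.default_rng(6)
true = np.array([-2.63, -2.49, -2.78, -2.89, -2.67, -3.32])  # Table 1, col. (2)
se = 0.25                           # hypothetical small-sample standard error
est = rng.normal(true, se, size=(100_000, 6))
winner = est.argmax(axis=1)
bias = est.max(axis=1) - true[winner]   # error of the winning arm's estimate
print(bias.mean())                      # > 0: the winner's curse
print((winner == 1).mean())             # how often the true best arm (T1B) wins
```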
6 Alternative Research Designs

In this section, we turn back to the question of how to choose the research design. Potential users of exploration sampling, and of adaptive experiments more generally, will be interested in the learning gains from adaptivity, as well as in the best design for their adaptive experiment.

A first question is whether adaptive sampling improved learning in the IVR experiment. The motivation to use adaptive methods is to increase efficiency and make the most of a limited sample and time. However, asymptotic convergence results for exploration sampling and other best-arm algorithms (Kasy and Sautmann, 2021a; Russo, 2020; Qin et al., 2017) only apply to specific outcome distributions and when the number of waves grows large. In this experiment, we learn only from one prior wave and adapt the assignment shares for half of the sample in a second wave. Possible learning gains from adaptivity are further limited by the fact that the exploration sampling algorithm can only approximate the optimal assignment. In a first exercise below, we therefore use simulations to evaluate the gains from adaptive sampling in wave 2, relative to non-adaptive sampling where an equal share of the sample is allocated to each treatment arm (a "standard RCT"). We use the data actually gathered in this experiment. The goal is to quantify the performance of exploration sampling ex post and for the specific context of the IVR experiment. This contributes to an evidence base about the gains from adaptive sampling in policy choice.

A second question is how researchers should ex ante compare and make decisions about research designs based on prior information, and whether such comparisons are reliable. As discussed above, operational and logistical constraints in the IVR experiment restricted the space of possible research designs essentially to either conducting one experimental wave (possibly in only one half of the term) or two waves. With reference to the two scenarios laid out in section 2, ex ante, we might have asked whether we should simply conduct a one-wave, non-adaptive RCT with the full sample, or whether there are significant gains from holding back half of the sample and adjusting the treatment assignment using exploration sampling in wave 2. Alternatively, after carrying out wave 1 and observing the results, we might have asked whether the learning gains from the second wave make the effort worthwhile. In the second and third exercises below, we therefore carry out simulations that answer these questions, in the same way an experimenter might have done to make decisions about the IVR experiment. These simulations are by necessity not based on the actual data collected, but on the Bayesian model and parameter distributions we specified. The purpose is both to compare the predicted gains from adaptive sampling obtained ex ante from the model with those obtained ex post from the data, and to illustrate how one might go about conducting such simulations.

6.1 Ex Post Counterfactual: Non-Adaptive Experiment

In a first exercise, we ask what expected regret and probability optimal in the experiment might have been if we had carried out a "standard RCT", that is, an experiment with uniform assignment shares. Since the assignment shares in wave 1 were equal, we simulate learning outcomes from a large number of bootstrapped samples for wave 2, drawn from the real experimental observations in waves 1 and 2.
All our bootstrap samples for wave 2 have N = 1,384 observations, the draws are stratified by school, and we append the bootstrapped sample to the observed wave-1 data to estimate a hierarchical Bayesian Binomial GLM as described in Eq. (2). We carry out 1,000 draws that simulate an RCT and 1,000 draws that simulate an exploration sampling experiment. For the simulated RCTs, we bootstrap a wave 2 of equal-sized treatment arms. For the simulated exploration sampling experiments, we use the treatment assignment shares derived from the original wave-1 posterior distributions.27 For each sample draw, we calculate the posterior mean and standard deviation of $\bar{\theta}^k$, the probability optimal, and the posterior expected policy regret for each arm. The averages for each arm across draws are shown in Table 6. In addition, we show the average of the posterior regret and probability optimal of the selected (lowest-regret) arm $k^*$ in each simulated experiment.

Table 6: Ex post counterfactual: performance of exploration sampling and standard RCT.

                    Exploration Sampling                            Standard RCT
Treat-    Success   Success    Prob.     Posterior      Success   Success    Prob.     Posterior
ment      prob.     prob. SD   treat     exp. policy    prob.     prob. SD   treat     exp. policy
          mean                 optimal   regret         mean                 optimal   regret
          (1)       (2)        (3)       (4)            (5)       (6)        (7)       (8)
T1A       6.96%     4.45%      4.33%     1.56%          7.28%     5.29%      6.14%     1.38%
T1B       8.50%     5.24%      91.93%    0.03%          8.59%     6.06%      85.13%    0.06%
T1C       6.89%     4.39%      1.59%     1.64%          7.27%     5.28%      6.40%     1.39%
T2A       6.12%     3.99%      0.11%     2.41%          5.92%     4.44%      0.05%     2.74%
T2B       6.77%     4.34%      2.03%     1.76%          6.92%     5.07%      2.28%     1.73%
T2C       3.89%     2.67%      0.00%     4.64%          4.03%     3.17%      0.00%     4.63%
Selected  8.50%     5.25%      92.58%    0.02%          8.60%     6.07%      86.39%    0.05%

Notes: The table shows averages of estimates for each treatment arm obtained from 1,000 simulated samples drawn from the observed experimental data. Columns (1) and (5): mean posterior probability of a successful call. Columns (2) and (6): standard deviation of the posterior success probability. Columns (3) and (7): probability that the treatment arm is optimal. Columns (4) and (8): posterior policy regret in terms of the engagement success probability.

The average posterior mean of the probability of a successful call is similar between exploration sampling and the standard RCT, as seen in columns (1) and (5). As expected, the standard deviation of the posterior distribution of the mean success probability $\bar{\theta}^k$ is lower under exploration sampling for the high-performing treatments, but higher for the low-performing arms. In both research designs, the treatment arm that is most often associated with the highest probability of engagement is T1B. However, in the exploration sampling experiment, T1B is chosen 97.4% of the time, whereas this is the case 94.9% of the time in the simulated RCTs. This reflects the greater uncertainty, and consequently the higher variance in the final decision, that results from a non-adaptive experiment.

27 This exercise is not perfect, because we re-sample from the six arms at different proportions for the two designs. Since we use data from both waves, the bootstrapped wave-2 sample is always smaller than the original sample we draw from. However, the probability of repeat draws is affected by both the size of the original arm and the target size, and this ratio varies across the two designs. An alternative approach is to use a randomly drawn sub-sample of the original data that is proportional to the targeted wave size. This equalizes the chance of repeat sampling across arms, but it implies that the two bootstraps draw from different underlying populations. Ultimately, this second drawback seemed more problematic than the first.
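A deliberately simplified sketch of this bootstrap comparison: to keep the loop cheap, it replaces the hierarchical GLM re-fit with independent Beta-Binomial posteriors per arm and shows a single bootstrap draw per design (the paper's exercise re-estimates Eq. (2) on each of the 1,000 draws; all data here are synthetic).

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic wave-1 engagement counts (out of 9) per student, by arm.
wave1 = [rng.binomial(9, p, size=210) for p in
         (0.073, 0.084, 0.065, 0.060, 0.071, 0.040)]
n2 = 1384

def p_optimal(counts_by_arm, n_draws=4000):
    """Probability each arm is best under independent Beta(1, 1)-Binomial
    posteriors -- a crude stand-in for re-fitting the GLM of Eq. (2)."""
    draws = np.column_stack([
        rng.beta(1 + c.sum(), 1 + 9 * len(c) - c.sum(), size=n_draws)
        for c in counts_by_arm])
    return np.bincount(draws.argmax(axis=1), minlength=6) / n_draws

def bootstrap_wave2(shares):
    """Append one bootstrapped wave 2 with the given arm shares to wave 1."""
    sizes = np.round(np.asarray(shares) * n2).astype(int)
    return [np.concatenate([c, rng.choice(c, size=m, replace=True)])
            for c, m in zip(wave1, sizes)]

p1 = p_optimal(wave1)
q = np.maximum(p1 * (1 - p1), 1e-9)         # exploration sampling shares
print(p_optimal(bootstrap_wave2(np.full(6, 1 / 6))).max())  # "standard RCT"
print(p_optimal(bootstrap_wave2(q / q.sum())).max())        # exploration sampling
```

Averaging such quantities over many draws, with the full hierarchical model re-fit each time, yields Table 6.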
Exploration sampling increases the probability optimal of the best arm on average from 86.39% to 92.58% and reduces the average posterior regret from 0.05% to 0.02%. The reduction is small in absolute terms for two reasons: first, the student sample is large enough that even an RCT would lead the researcher to relatively firm conclusions here, and second, in this particular problem instance it turns out that the arm averages are clustered closely together, meaning that even a suboptimal choice is likely to be benign. However, in relative terms the improvement is large, and in a policy choice problem where the best arm is actually implemented, even small per-unit gains in payoffs may accumulate into large welfare differences. Overall, the ex-post simulations suggest that we can achieve a meaningful decrease in uncertainty and improved decisions from just one adaptive wave involving half of the experimental sample.

Remark: Decision Metrics. These simulations highlight an advantage of the proposed Bayesian approach: the metrics of expected policy regret and probability optimal provide the decision maker with easy-to-understand, intuitive measures of the uncertainty attached to the policy choices they are making. This facilitates the comparison of treatment arms as well as of experimental research designs.

6.2 Ex Ante Comparison: Model-Based Simulation of Exploration Sampling vs. RCT

In the second exercise, we imagine the experimenter asking before the IVR experiment, "should I carry out one (non-adaptive) wave with the whole sample, or two (adaptive) waves with half the sample each?" For these simulations, take a given parameter vector $(\beta^E, \kappa^E)$. Based on this vector and Eq. (2), we can simulate outcomes $Z_i^{sk}$ for the students in each wave (drawing the school effects $\eta_s^E$ from the Standard Normal distribution). The first simulated sample uses equal assignment shares; the second is generated under an adaptive design, where the assignment shares for wave 2 are obtained by estimating our model on the simulated wave-1 data. We can then compare the estimation results under these two sampling strategies to calculate the predicted gains from the adaptive vs. the non-adaptive design for the given parameter vector. This is reminiscent of conducting power calculations for an assumed effect size.

Panel A of Table 7 shows the result of such an exercise, using as the parameter vector the mean of the posterior distributions of $\beta^E$ and $\kappa^E$ after wave 2, as reported in Table 3. Using the wave-2 estimates from the experiment serves to show how well the ex ante simulation does in predicting these estimates, and how ex ante simulation results compare with the ex post simulation above. The predicted gains from using adaptive sampling in terms of posterior regret are very similar to our previous exercise based on the actual IVR data. The average posterior expected regret from arm T1B is 0.02% with adaptive sampling but 0.08% with the "standard RCT" on average. The average posterior probability optimal for both sampling strategies is also similar to what we obtained in Table 6.
Table 7: Ex ante comparison: performance of exploration sampling and standard RCT in simulated samples based on parameter vector $(\hat{\beta}^E, \hat{\kappa}^E)$.

Panel A: Averages of Posterior Estimates.

              Exploration Sampling                      Standard RCT
Treatment     Avg. posterior     Avg. posterior         Avg. posterior     Avg. posterior
              expected policy    probability            expected policy    probability
              regret             optimal                regret             optimal
T1A           1.18%              4.39%                  1.08%              12.61%
T1B           0.02%              92.2%                  0.08%              81.29%
T1C           2.10%              0.45%                  2.01%              0.48%
T2A           2.67%              0.07%                  2.66%              0.03%
T2B           1.52%              2.89%                  1.38%              5.60%
T2C           4.51%              0.00%                  4.61%              0.00%

Panel B: Average Realized Values.

              Exploration Sampling                      Standard RCT
              Average policy     Percentage best        Average policy     Percentage best
              regret             arm identified         regret             arm identified
              0.01%              99.00%                 0.07%              93.00%

Notes: The table shows averages from 100 simulated samples drawn using the parameter vector given by the means of the estimated posteriors from wave 2 of the IVR experiment, $\hat{\beta}^E = (-2.63, -2.49, -2.78, -2.89, -2.67, -3.32)$ and $\hat{\kappa}^E = 0.5$. For each sample draw, the same first wave was used; the second wave was drawn either using the exploration sampling shares based on the estimates from the first wave, or using equal assignment shares.

Panel B of Table 7 uses the fact that we know the parameter vector that generated the simulated samples, and therefore know the policy regret from choosing a different arm than T1B. This means we can calculate the average policy regret and the share of optimal decisions from making the final choice after each simulated experiment (which is based on posterior policy regret). According to Panel B, 99% of the time (93% in the RCT) the experimenter correctly chooses T1B based on this decision metric. The average posterior regret from T1B is only slightly higher than the realized average policy regret;28 both show a 0.06% reduction in regret from adaptive over non-adaptive sampling. Panel B shows the decision metric that should be used to choose between the adaptive and the non-adaptive design (Panel A shows the expected value of the posterior estimates after the experiment). The posterior estimates show some remaining uncertainty. This is partly due to the school random effects: some of the measurement effort is spent on estimating the school averages, which adds uncertainty to the final estimates.

Prior to an experiment, the researcher of course does not know what the true parameters are, and they may want to carry out the calculation in Panel B of Table 7 for multiple parameter vectors in order to get a sense of the distribution of gains from adaptivity. The most consistent approach would be to draw many values from the prior distribution of the model parameters, but this can give a misleading picture of the gains from adaptivity when uninformative priors are used (not to mention that the computational cost is high). As an example, in the IVR experiment, the flat priors combined with the logit transformation in the model mean that treatment arm averages based on random draws from the distributions of the $\beta^E$ are almost always close to 0 or 1. In Appendix C.3, we therefore show results from a modified exercise in which we independently and randomly draw the $\theta^k$ from the uniform distribution on [0, 1]. As it turns out, this exercise is not meaningful either: in many cases, the drawn parameters are so far apart that, given our large sample of students, even equal assignment shares lead to a very high probability of picking the correct arm. A more meaningful approach might be to assume correlated prior distributions for the $\beta_k^E$, or to use the same prior distribution for each $\theta^k$ but with a mean obtained from pilot data.

28 Note that regret in Panel B only occurs when the experimenter does not choose T1B.
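For a single assumed parameter vector, the exercise behind Panel B might be sketched as follows (again with the crude Beta-posterior stand-in for re-fitting Eq. (2), a simplified highest-probability-optimal decision rule, and hypothetical wave sizes):

```python
import numpy as np

rng = np.random.default_rng(8)
beta = np.array([-2.63, -2.49, -2.78, -2.89, -2.67, -3.32])  # assumed truth
kappa, S, n_half = 0.5, 108, 1200                            # hypothetical sizes

def simulate_wave(shares, n):
    """Draw engagement counts for one wave from the model in Eq. (2)."""
    arm = rng.choice(6, size=n, p=shares)
    eta = rng.standard_normal(S)[rng.integers(0, S, size=n)]  # school effects
    return arm, rng.binomial(9, 1 / (1 + np.exp(-(beta[arm] + kappa * eta))))

def p_optimal(arm, z, n_draws=2000):
    """Independent Beta posteriors as a cheap stand-in for the GLM re-fit."""
    draws = np.column_stack([
        rng.beta(1 + z[arm == k].sum(),
                 1 + 9 * (arm == k).sum() - z[arm == k].sum(), size=n_draws)
        for k in range(6)])
    return np.bincount(draws.argmax(axis=1), minlength=6) / n_draws

# True average success probabilities, integrating out the school effect.
eta = rng.standard_normal(100_000)
theta_bar = (1 / (1 + np.exp(-(beta[:, None] + kappa * eta)))).mean(axis=1)
best = theta_bar.argmax()

picks = {"RCT": [], "ES": []}
for _ in range(100):
    a, z = simulate_wave(np.full(6, 1 / 6), 2 * n_half)   # one non-adaptive wave
    picks["RCT"].append(p_optimal(a, z).argmax())
    a1, z1 = simulate_wave(np.full(6, 1 / 6), n_half)     # two adaptive waves
    p1 = p_optimal(a1, z1)
    q = np.maximum(p1 * (1 - p1), 1e-9)
    a2, z2 = simulate_wave(q / q.sum(), n_half)
    picks["ES"].append(
        p_optimal(np.concatenate([a1, a2]), np.concatenate([z1, z2])).argmax())

for name, chosen in picks.items():
    chosen = np.asarray(chosen)
    print(name, (chosen == best).mean(),                  # share best arm found
          (theta_bar[best] - theta_bar[chosen]).mean())   # realized policy regret
```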
An alternative to drawing from a prior distribution is to examine learning gains for a few well-chosen parameter vectors. Again, this is similar to the approach taken in typical power calculations for experiments (see, e.g., Duflo et al., 2007).

Remark: When is Adaptive Sampling Most Valuable? As the simulations show, the efficiency gains from adaptive sampling vary significantly across problem instances. For best-arm identification, closely clustered treatment effect averages make the problem “hard,” because such arms are difficult to distinguish. From a welfare-maximization (regret-minimization) perspective, however, two or more treatment arms with very similar success rates may often lead to a suboptimal choice, but the loss from that choice will be small. Intuition suggests that adaptive sampling is particularly valuable when there are two or more “near-optimal” arms but also several “far from optimal” arms that can be quickly ruled out. An example could be an experiment that compares two or more different types of interventions but also tests several variants within each type. It will be fruitful to explore these questions in more detail.

6.3 Comparison after Wave 1: Model-Based Simulation of a Second Adaptive Wave

In our last exercise, we imagine the experimenter, after having carried out wave 1, asking, “should I conduct a second adaptive wave?” This is somewhat less computationally costly than the previous exercise because, after wave 1, the exploration sampling shares for wave 2 are known. As before, for a given parameter vector (β^E, κ^E), we simulate a second wave of the experiment by generating a random sample of size N = 1384 following the model in Eq. (2) and using the assignment shares in Table 2. We draw 200 parameter vectors from the wave-1 posteriors and calculate average policy regret and the percentage of times the best arm is identified for each.
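A minimal sketch of this “value of a second wave” calculation follows, again using a simplified Beta-Binomial stand-in for the actual model: the Beta counts (a1, b1) are stylized placeholders for the wave-1 posterior, and the wave-2 shares are illustrative placeholders for the exploration sampling shares in Table 2; only the sample size (1,384) and the number of parameter draws (200) follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N2, NDRAWS = 6, 1384, 200

# Stylized wave-1 posterior for each arm's success rate (Beta success/failure
# counts standing in for the model's MCMC posterior), plus illustrative
# wave-2 assignment shares standing in for those reported in Table 2.
a1 = np.array([28.0, 36.0, 25.0, 22.0, 29.0, 14.0])
b1 = np.array([380.0, 390.0, 370.0, 360.0, 385.0, 350.0])
shares2 = np.array([0.14, 0.38, 0.10, 0.07, 0.24, 0.07])

regret, best_found = [], []
for _ in range(NDRAWS):
    theta = rng.beta(1 + a1, 1 + b1)                 # one "true" parameter vector
    n2 = rng.multinomial(N2, shares2)                # simulate the second wave
    s2 = rng.binomial(n2, theta)
    # Posterior after both waves, then choose the arm with lowest expected regret:
    post = rng.beta(1 + a1 + s2, 1 + b1 + n2 - s2, size=(4000, K))
    k = (post.max(axis=1, keepdims=True) - post).mean(axis=0).argmin()
    regret.append(theta.max() - theta[k])            # realized policy regret
    best_found.append(k == theta.argmax())

print(f"avg realized regret: {np.mean(regret):.4f}, best arm chosen: {np.mean(best_found):.0%}")
```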
Panel A of Table 8 shows the posterior expected regret and probability optimal for arm T1B after wave 1. The posterior expected regret at t = 1 would be the basis for decision making if no other wave were conducted, and T1B was the arm with the lowest expected regret at that point. Panels B and C show the results of the simulations of wave 2. Panel B shows the average and median posterior policy regret and probability optimal of the chosen arm. On average, the simulation predicts an improvement in expected regret from continued experimentation of 0.04%, and an increase in the probability optimal for the chosen arm from 74.14% to 77.86%. Using the median of the distribution, the improvement in expected regret would be 0.07%, with the probability optimal rising to 83.01%. Note that the average posterior expected regret has a heavily skewed distribution. The actual value of 0.02% observed after the second wave in the IVR experiment is at the 38th percentile of that distribution, as seen in Figure 5.

[Figure 5: Distribution of the posterior expected regret from wave 1 data and 200 simulated samples for wave 2, based on β^E, κ^E, and η^E drawn from their posterior distributions after wave 1. The histogram plots counts against the posterior regret associated with the selected treatment arm (range 0.000 to 0.003).]

Panel C shows the average realized policy regret and the percentage of times the best arm is identified, both after wave 1 (where the experimenter would have chosen arm T1B) and after wave 2. Comparing the numbers for wave 1 with Panel A shows that the distribution of the simulated draws replicates the theoretical posteriors, as expected. The numbers for wave 2 show realized gains that more than halve the predicted policy regret of wave 1 and increase the share of optimal decisions by 10 percentage points. Had we conducted these calculations between the waves of the actual IVR experiment, we would likely have concluded that the gains in certainty about the optimal choice more than justified the low monetary cost of sending the IVR calls to the second half of the sample. The actual IVR experiment performed even better than these simulations predict. A better prior for our parameters, for example based on pilot data, is likely to generate more reliable answers to research design questions.

Table 8: Comparison after wave 1: ending the experiment vs. conducting a second wave.

Panel A: Posterior Estimates after Wave 1.

             Exploration Sampling
             Posterior          Posterior
Treatment    expected policy    probability
             regret             optimal
T1B          0.12%              74.14%

Panel B: Posterior Estimates after Wave 2.

             Exploration Sampling
             Avg. posterior     Avg. posterior
Wave         expected policy    probability
             regret [median]    optimal [median]
1 and 2      0.08% [0.05%]      77.86% [83.01%]

Panel C: Average Realized Values.

             Exploration Sampling
             Average            Percentage
Wave         policy             best arm
             regret             identified
1            0.13%              71.00%
1 and 2      0.06%              81.00%

Notes: The table summarizes the results of 200 simulated samples based on β^E, κ^E, and η^E drawn from their posterior distributions after wave 1.

7 Conclusion

This paper presents a concrete application of the exploration sampling algorithm to demonstrate the successful use of adaptive sampling in a real-world policy choice problem. The experiment we conducted provides an opportunity to answer many implementation questions surrounding this new method. For instance, as part of the IVR experiment, we give two examples of Bayesian modeling for the outcomes of interest – here, call engagement and oral reading fluency – and show how to use such models to compute the assignment shares in each wave and the posterior expected regret that is used to choose one arm for implementation. We discuss some of the constraints on the research design that are unique to adaptive experiments, as well as approaches to choosing between alternative designs based on simulations.

Our sample application tests six different designs for a new parent outreach method, interactive voice response calls, to encourage home reading with children in Kenya, which is known to improve early literacy. Even though the time and budget for the experiment were limited, the adaptive design is able to identify the call format with the highest level of engagement with 93% probability, leading to minimal expected losses from mistakenly selecting the wrong call format. Despite the short exposure period of just 5 weeks (9 calls in total) and despite the moderate uptake, the call format with the highest engagement level, which asks parents to carry out exercises after the call with the child and uses the same “intermediate” exercise sequence for all children, leads to a moderate but detectable improvement in ORF test scores of 1.68 correct words per minute ([0.13–3.21], or 0.065 standard deviations of the baseline reading fluency level). These findings make IVR calls a promising method of educational outreach. Identifying such methods has become an urgent policy priority, given the delays to schooling experienced by millions of children in the wake of the Covid-19 pandemic.
This EdTech application provides a compelling example of using adaptive sampling in policy choice experiments, showing that there are expected gains in the targeted outcome even with moderate adaptivity and a relatively large sample. We would expect even larger gains when more waves are possible, and in problem instances where (for example) a few inferior arms can be ruled out quickly, focusing sampling effort on a smaller subset of promising candidates. As long as the added (per wave) cost is low, adaptive sampling has the potential to improve learning in many areas of policy, in particular when outcome data is regularly received as part of ongoing administrative data collection. The range of contexts in which this is the case continues to expand as public administrations shift towards digital record keeping and online interactions with beneficiaries and citizens. In other situations, the cost of adaptivity may be high, for example due to added data collection effort, but the gains from increased efficiency are potentially also high, for example when the available sample is small or the welfare gains from implementing an effective policy faster are potentially large. From an ethics perspective, adaptive methods for policy choice can reduce the burden of experimentation with human subjects in two ways: first, because the share of experimental subjects who receive the highest-performing policies increases as learning progresses, and second, because the same sample size generates greater learning gains with an adaptive than with a non-adaptive design, increasing the potential for better policy outcomes afterwards.

As part of describing the design of this experiment, the paper tackles many implementation questions that we anticipate others will encounter as well. As more economists and policy makers begin to use adaptive methods, we hope they benefit from this example and the solutions we propose. The paper also reveals some potential challenges and highlights that an important – and in practice often difficult – step in the research design is choosing the right outcome measure. In future applications, this may involve more formal methods of eliciting preferences from the policy maker in order to correctly construct the posterior outcome distributions and select the optimal arm. Many of the issues raised point to fruitful areas for future research and will hopefully spur ongoing innovation to improve the method further.

References

Andrews, I. and M. Kasy (2019). Identification of and correction for publication bias. American Economic Review 109(8), 2766–94.

Andrews, I., T. Kitagawa, and A. McCloskey (2021). Inference on winners. Working paper.

Angrist, N., P. Bergman, and M. Matsheng (2020a). School's out: Experimental evidence on limiting learning loss using “low-tech” in a pandemic. NBER Working Paper 28205.

Angrist, N., P. Bergman, and M. Matsheng (2020b). School's out: Experimental evidence on limiting learning loss using “low-tech” in a pandemic. Technical report, National Bureau of Economic Research.

Athey, S., S. Baird, J. Jamison, C. McIntosh, and B. Özler (2021). A sequential and adaptive experiment to increase the uptake of long-acting reversible contraceptives in Cameroon. AEA RCT Registry May 14. https://doi.org/10.1257/rct.3514.

Athey, S., R. Chetty, G. W. Imbens, and H. Kang (2019). The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely. Technical report, National Bureau of Economic Research.
Athey, S. and G. W. Imbens (2017). The econometrics of randomized experiments. In Handbook of Economic Field Experiments, Volume 1, pp. 73–140. Elsevier.

Audibert, J.-Y., S. Bubeck, and R. Munos (2010). Best arm identification in multi-armed bandits. In COLT, pp. 41–53. Citeseer.

Bahety, G., S. Bauhoff, D. Patel, and J. Potter (2021). Texts don't nudge: An adaptive trial to prevent the spread of COVID-19 in India. Journal of Development Economics 153, 102747.

Banerjee, A., R. Banerji, J. Berry, E. Duflo, H. Kannan, S. Mukerji, M. Shotland, and M. Walton (2017, November). From proof of concept to scalable policies: Challenges and solutions, with an application. Journal of Economic Perspectives 31(4), 73–102.

Banerjee, A. V., S. Chassang, S. Montero, and E. Snowberg (2020). A theory of experimenters: Robustness, randomization, and balance. American Economic Review 110(4), 1206–30.

Banerjee, A. V., S. Cole, E. Duflo, and L. Linden (2007). Remedying education: Evidence from two randomized experiments in India. The Quarterly Journal of Economics 122(3), 1235–1264.

Bergman, P. (2021). Parent-child information frictions and human capital investment: Evidence from a field experiment. Journal of Political Economy 129(1), 286–322.

Bergman, P. and E. W. Chan (2021). Leveraging parents through low-cost technology: The impact of high-frequency information on student achievement. Journal of Human Resources 56(1), 125–158.

Berlinski, S., M. Busso, T. Dinkelman, and C. Martínez (2021). Reducing parent-school information gaps and improving education outcomes: Evidence from high-frequency text messages. Technical report, National Bureau of Economic Research.

Bettinger, E., N. Cunha, G. Lichand, and R. Madeira (2021, May). Are the effects of informational interventions driven by salience? Working paper.

Bowen, D. (2022). Multiple inference. https://dsbowen-conditional-inference.readthedocs.io/en/latest/?badge=latest.

Bubeck, S. and N. Cesa-Bianchi (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning 5(1), 1–122.

Bubeck, S., R. Munos, and G. Stoltz (2009). Pure exploration in multi-armed bandits problems. In International Conference on Algorithmic Learning Theory, pp. 23–37. Springer.

Caria, S., M. Kasy, S. Quinn, S. Shami, and A. Teytelboym (2020). An adaptive targeted field experiment: Job search assistance for refugees in Jordan.

Christensen, G. and E. Miguel (2018). Transparency, reproducibility, and the credibility of economics research. Journal of Economic Literature 56(3), 920–80.

de Barros, A. and A. J. Ganimian (2021). Which students benefit from personalized learning? Experimental evidence from a math software in public schools in India. Working paper.

Doss, C., E. M. Fahle, S. Loeb, and B. N. York (2019). More than just a nudge: Supporting kindergarten parents with differentiated and personalized text messages. Journal of Human Resources 54(3), 567–603.

Duflo, E., R. Glennerster, and M. Kremer (2007). Using randomization in development economics research: A toolkit. Handbook of Development Economics 4, 3895–3962.

Garivier, A. and E. Kaufmann (2016). Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pp. 998–1027. PMLR.

Hadad, V., D. A. Hirshberg, R. Zhan, S. Wager, and S. Athey (2021). Confidence intervals for policy evaluation in adaptive experiments.

Hadad, V., L. R. Rosenzweig, S. Athey, and D. Karlan (2021). Practitioner's guide: Designing adaptive experiments.
ICTworks (2016, August). The blind spot of SMS projects: Constituent illiteracy.

Kasy, M. (2016). Why experimenters might not always want to randomize, and what they could do instead. Political Analysis 24(3), 324–338.

Kasy, M. and A. Sautmann (2021a). Adaptive treatment assignment in experiments for policy choice. Econometrica 89(1), 113–132.

Kasy, M. and A. Sautmann (2021b). Correction regarding “Adaptive treatment assignment in experiments for policy choice”. Working paper.

Knauer, H. A., P. Jakiela, O. Ozier, F. Aboud, and L. C. Fernald (2020). Enhancing young children's language acquisition through parent-child book-sharing: A randomized trial in rural Kenya. Early Childhood Research Quarterly 50, 179–190.

Kraft, M. A. and M. Monti-Nussbaum (2017, November). Can schools enable parents to prevent summer learning loss? A text-messaging field experiment to promote literacy skills. The ANNALS of the American Academy of Political and Social Science 674(1), 85–112.

Lai, T. L. and H. Robbins (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1), 4–22.

Lattimore, T. and C. Szepesvári (2020). Bandit algorithms. Cambridge University Press.

Madaio, M. A., V. Kamath, E. Yarzebinski, S. Zasacky, F. Tanoh, J. Hannon-Cropp, J. Cassell, K. Jasinska, and A. Ogan (2019). “You give a little of yourself”: Family support for children's use of an IVR literacy system. In Proceedings of the 2nd ACM SIGCAS Conference on Computing and Sustainable Societies, COMPASS '19, New York, NY, USA, pp. 86–98. Association for Computing Machinery.

Mayer, S. E., A. Kalil, P. Oreopoulos, and S. Gallegos (2019, October). Using behavioral insights to increase parental engagement: The Parents and Children Together intervention. Journal of Human Resources 54(4), 900–925.

Melfi, V. F. and C. Page (2000). Estimation after adaptive allocation. Journal of Statistical Planning and Inference 87(2), 353–363.

Muralidharan, K., A. Singh, and A. J. Ganimian (2019, April). Disrupting education? Experimental evidence on technology-aided instruction in India. American Economic Review 109(4), 1426–1460.

Neal, R. M. (2003). Slice sampling. Annals of Statistics, 705–741.

Offer-Westort, M., A. Coppock, and D. P. Green (2021). Adaptive experimental design: Prospects and applications in political science. American Journal of Political Science 65(4), 826–844.

Piper, B., J. Destefano, E. M. Kinyanjui, and S. Ong'ele (2018). Scaling up successfully: Lessons from Kenya's TUSOME national literacy program. Journal of Educational Change 19(3), 293–321.

Pouzo, D. and F. Finan (2022). Reinforcing RCTs with multiple priors while learning about external validity. NBER Working Paper 29756.

Qin, C., D. Klabjan, and D. Russo (2017). Improving the expected improvement algorithm. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 5387–5397.

Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 58(5), 527–535.

Rodriguez-Segura, D., C. Campton, L. Crouch, and T. S. Slade (2021). Looking beyond changes in averages in evaluating foundational learning: Some inequality measures. International Journal of Educational Development 84, 102411.

Russo, D. (2020). Simple Bayesian algorithms for best-arm identification. Operations Research 68(6), 1625–1647.

Sautmann, A. (2021a). Online supplement: Bridge Kenya IVR literacy intervention materials. https://bit.ly/3LosOgM.
Sautmann, A. (2021b). Text messaging for parental engagement in student learning. AEA RCT Registry May 6. https://doi.org/10.1257/rct.6701.

Sautmann, A. (2022). Interactive phone calls to improve reading fluency. AEA RCT Registry April 9. https://doi.org/10.1257/rct.7663.

Shang, X., R. Heide, P. Menard, E. Kaufmann, and M. Valko (2020). Fixed-confidence guarantees for Bayesian best-arm identification. In International Conference on Artificial Intelligence and Statistics, pp. 1823–1832. PMLR.

Shreekumar, A. (2020). winference. https://github.com/adviksh/winference.

Tabord-Meehan, M. (2018). Stratification trees for adaptive randomization in randomized controlled trials. arXiv preprint arXiv:1806.05127.

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4), 285–294.

Xu, M., T. Qin, and T.-Y. Liu (2013). Estimation bias in multi-armed bandit algorithms for search advertising. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger (Eds.), Advances in Neural Information Processing Systems, Volume 26. Curran Associates, Inc.

York, B. N., S. Loeb, and C. Doss (2019, July). One step at a time: The effects of an early literacy text-messaging program for parents of preschoolers. Journal of Human Resources 54(3), 537–566.

Zhang, K., L. Janson, and S. Murphy (2020). Inference for batched bandits. Advances in Neural Information Processing Systems 33, 9818–9829.

Zhang, K., L. Janson, and S. Murphy (2021). Statistical inference with M-estimators on adaptively collected data. Advances in Neural Information Processing Systems 34.

A Intervention Design

The three content intervention variants A, B, and C are as shown in Figure A.1:

A. Leveling by baseline: assign students to a “basic”, “intermediate”, or “advanced” arm;
B. Preset: assign all students to an “intermediate” exercise sequence;
C. Options: allow parents to select the exercise from a menu.

Leveling by baseline uses observed fluency scores from the end of term 2 and assigns students with fluency scores of 0–29 to the “basic” arm, 30–64 to the “intermediate” arm, and 65+ to the “advanced” arm. These cutoffs were used previously in a similar context (the external TUSOME evaluation in Kenya, see Piper et al. (2018)). Students with missing baseline scores are assigned their class median; classes with entirely missing scores are assigned the intermediate level (which in this sample also happens to be the sample median).

[Figure A.1: Exercise leveling variations.]
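A minimal sketch of this leveling rule follows; the function name and signature are ours for illustration and not code from the intervention platform.

```python
from statistics import median
from typing import Optional, Sequence

def assign_level(score: Optional[float], class_scores: Sequence[float]) -> str:
    """Assign a student to an exercise level using the TUSOME-based cutoffs.

    Fluency scores of 0-29 cwpm map to "basic", 30-64 to "intermediate",
    and 65+ to "advanced". A student with a missing score is assigned the
    class median; if the whole class is missing, the intermediate level.
    """
    if score is None:
        if not class_scores:
            return "intermediate"      # no class data: default level
        score = median(class_scores)   # impute the class median
    if score < 30:
        return "basic"
    if score < 65:
        return "intermediate"
    return "advanced"

# Example: a student with no baseline score in a class with median 41 cwpm.
print(assign_level(None, [12, 41, 77]))  # -> "intermediate"
```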
B Oral Reading Fluency Data Quality

In this section we provide more details on the data quality issues with ORF scores.

Table A.1: Non-missing ORF scores in each exam, by treatment arm and by wave.

                     Wave 1 and 2                                                  Wave 1    Wave 2
Period       C       T1A     T1B     T1C     T2A     T2B     T2C     Total        Total     Total
T2 ET        89.6%   88.2%   88.2%   87.8%   91.8%   90.0%   88.9%   88.9%        88.9%     88.8%
T3 MT        74.0%   74.8%   73.8%   71.4%   74.0%   73.3%   74.8%   73.5%        74.3%     72.6%
T3 ET        79.3%   81.5%   81.9%   78.0%   83.5%   79.9%   77.4%   80.3%        80.5%     80.1%
Total        81.0%   81.5%   81.3%   79.1%   83.1%   81.1%   80.4%   80.9%        81.2%     80.5%
N. students  415     330     781     581     231     329     226     2893         1509      1384

Notes: The table presents the percentage of valid ORF measurements for students allocated to each treatment arm in waves 1 and 2.

Table A.1 displays the percentage of non-missing ORF measures across treatment arms and periods. There is no evidence of a systematic relationship between ORF attrition and the treatment arms. However, as of the last data delivery received by the researchers in fall 2021, the endterm exam of term 2 has the highest average percentage of non-missing ORF measures (88.9%), compared to the midterm of term 3 (73.5%) and the endterm of term 3 (80.3%). There are many possible reasons for these patterns. One reason for the endterm difference could be that teachers, even at the last data delivery, had not yet submitted all their scores for term 3. The number of scores collected in the midterm may be lower because the examination period for ORF was shorter (2 hours) than in the endterms (3 hours). Children may also be more likely to miss the midterm than the endterm.

Table A.2: Average ORF scores from separate data deliveries for the endterms of term 2 and term 3.

             Treatment arm                                              Number of
             C      T1A    T1B    T1C    T2A    T2B    T2C              students
E2           39.7   38.8   41.7   40.1   39.7   41.2   42.1             2285
E2 updated   62.3   53.0   60.6   58.6   53.6   55.6   58.2             286
E3           48.7   49.9   52.0   49.1   51.8   51.2   50.2             1897
E3 updated   42.2   46.1   47.3   44.6   35.9   43.8   48.4             425

Notes: The first set of scores was obtained for each endterm exam shortly after grading day. The original E2 scores were used for leveling in term 3. The updated scores are ORF scores for children whose grades were uploaded to the system later, obtained in a second data delivery for all exams weeks after the end of the intervention. The midterm of term 3 is not shown because fewer than 20 students had an updated score in the second data delivery.

Table A.2 shows average scores from the separate data deliveries we received for the endterm exams of term 2 and term 3. Each data delivery included all scores that had been submitted up to that point. The first delivery was received shortly after each exam took place. Crucially, for term 2, this was also the time when exercise leveling based on reading ability for the next term was determined, in order to start IVR calls in time for the next term. The second data delivery (for all exams) was received in fall 2021. The data show large differences between the scores submitted soon after the exam and those submitted later (during the next term). This is especially true for the data from term 2. This gap in scores could be one explanation for why leveling the reading exercises is not as successful: children with missing scores tended to have better reading skills, and they might have received exercises that were too easy on average. Interestingly, while the average in the second data delivery is higher for endterm 2, for endterm 3 it is lower. These differences could be due to systematic patterns in the timing of score submission, such as remote locations having both poorer internet connectivity and lower reading levels; note that the second delivery for term 2 was much smaller than for term 3. But the difference could also stem simply from variation that arises because scores for a whole school or classroom are sent at once and there is a lot of inter-school variance. In any case, the two tables show that even after many weeks, a substantial share of ORF scores for each exam was still missing.

When examining the scores, we additionally found an unusually large percentage of scores that are multiples of 5 (“rounded” scores). One reason for this could be measurement error, stemming for example from teachers having only imprecise means to measure time.
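One quick way to screen for such rounding is to compare the share of scores divisible by 5 with the roughly 20% one would expect if final digits were unremarkable. The sketch below is our own illustration with made-up example scores, not the paper's data pipeline.

```python
import numpy as np

def share_multiples_of_five(scores):
    """Share of non-missing integer ORF scores that are multiples of 5.

    If last digits were roughly uniform, about 20% of scores would be
    multiples of 5; a much larger share suggests rounding by graders.
    """
    s = np.asarray([x for x in scores if x is not None], dtype=int)
    return (s % 5 == 0).mean()

# Illustrative data with heavy heaping on multiples of 5:
example = [40, 45, 37, 50, 55, 60, 23, 45, 65, 70]
print(f"{share_multiples_of_five(example):.0%} of scores are multiples of 5")
```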
C Additional Results

C.1 Observed call engagement

Table A.3 shows the average number of calls (out of nine) with successful engagement, by treatment arm and wave, and Figure A.2 shows histograms of observed call engagement. The first bar of each histogram shows the number of phone numbers with zero engagement. This share is nearly the same in every call format except treatment arm T2C, suggesting that the same share of parents listens to the exercises at least once. Differences in sustained engagement arise from the second call onward.

Table A.3: Mean number of successful engagements by treatment arm and wave.

        Wave 1   Wave 2*   Wave 1 and 2
T1A     0.602    0.752     0.655
T1B     0.745    0.761     0.756
T1C     0.633    0.554     0.582
T2A     0.542    0.429     0.535
T2B     0.595    0.703     0.635
T2C     0.358    -         0.358

Notes: * No observations were allocated to treatment T2C in Wave 2.

[Figure A.2: Observed call engagement, by treatment arm. Panels (a)-(f) show histograms for treatment arms T1A, T1B, T1C, T2A, T2B, and T2C.]

C.2 Probability optimal for ORF

Table A.4: Posterior regret and probability of highest ORF score gains.

             Balanced                             Unbalanced
Treatment    Posterior regret   Prob. highest     Posterior regret   Prob. highest
T1A          1.636              8.55%             1.528              9.11%
T1B          0.935              20.30%            1.143              11.54%
T1C          1.458              8.33%             1.911              2.35%
T2A          1.134              22.23%            0.973              26.34%
T2B          1.123              20.60%            0.918              25.80%
T2C          1.219              20.00%            1.029              24.86%

Notes: Posterior regret is expressed in terms of correct words per minute. The table contains information from 8,000 posterior draws sampled from 4 independent Markov chains.

C.3 Ex Ante Comparison of Exploration Sampling and RCT

Table A.5: Ex ante comparison: performance of exploration sampling and standard RCT in simulated samples based on many parameter draws from the prior.

Panel A: Averages of Posterior Estimates.

Exploration Sampling                              Standard RCT
Avg. posterior     Avg. posterior                 Avg. posterior     Avg. posterior
expected policy    probability                    expected policy    probability
regret             optimal                        regret             optimal
0%                 98.97%                         0.01%              98.91%

Panel B: Average Realized Values.

Exploration Sampling                              Standard RCT
Average      Percentage                           Average      Percentage
policy       best arm                             policy       best arm
regret       identified                           regret       identified
0%           98.99%                               0%           98.99%

Notes: The table shows averages from 100 simulated samples drawn using the parameter vector {β^E, κ^E, η^E}, drawn from the prior distributions. For each sample draw, the same first wave was used; the second wave was drawn either using the exploration sampling shares based on the estimates from the first wave, or using equal assignment shares.

Table A.5 shows simulation results when drawing hypothetical treatment arm averages θ_k from a uniform distribution, simulating two experimental samples (one with exploration sampling, one with non-adaptive sampling) for each draw, and estimating the model parameters from these samples. Note that both equal and adaptive sampling shares essentially lead to zero regret on average. This is because independent random draws for the average success rate in the different treatment arms often produce one arm that is clearly a “winner.” In reality, the success rates in the different arms are likely to be highly correlated and clustered more closely than typical random draws from the unit interval.

C.4 Extensive margin for call engagement

Table A.6: Extensive margin for call engagement: model coefficients.

               Any successful      At least one
Treatment      engagement          second in call
               (1)                 (2)
T1A            −0.85*              1.71*
               [−1.09; −0.61]      [1.42; 2.03]
T1B            −0.85*              1.84*
               [−1.01; −0.70]      [1.63; 2.06]
T1C            −1.00*              1.91*
               [−1.19; −0.81]      [1.67; 2.18]
T2A            −1.05*              1.54*
               [−1.34; −0.76]      [1.20; 1.90]
T2B            −0.84*              1.86*
               [−1.08; −0.60]      [1.54; 2.19]
T2C            −1.72*              1.82*
               [−2.10; −1.37]      [1.46; 2.22]
Num. students  2462                2462

Notes: * Null hypothesis value outside the 95% credible interval. We simulate 4 independent Markov chains of 4,000 posterior draws each and discard the first 2,000 as warmup. The remaining 8,000 draws are used to generate the posterior distributions of the coefficients. The split-R̂ of every coefficient is below 1.01 and there are no divergent transitions.
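For readers unfamiliar with this diagnostic, split-R̂ can be computed from the retained draws roughly as in the generic sketch below (in the style of the usual Gelman-Rubin statistic); samplers such as Stan report it automatically, and the example draws here are synthetic, not our posterior output.

```python
import numpy as np

def split_rhat(chains: np.ndarray) -> float:
    """Split-R-hat for one scalar parameter.

    `chains` has shape (n_chains, n_iter); each chain is split in half,
    and R-hat compares between-half to within-half variance. Values close
    to 1 (e.g. below 1.01) indicate that the chains have mixed well.
    """
    m, n = chains.shape
    halves = chains.reshape(2 * m, n // 2)            # split each chain in two
    w = halves.var(axis=1, ddof=1).mean()             # within-chain variance
    b = (n // 2) * halves.mean(axis=1).var(ddof=1)    # between-chain variance
    var_plus = (n // 2 - 1) / (n // 2) * w + b / (n // 2)
    return float(np.sqrt(var_plus / w))

# Example: 4 chains of 2,000 retained draws for one coefficient.
rng = np.random.default_rng(0)
draws = rng.normal(-0.85, 0.12, size=(4, 2000))
print(f"split R-hat: {split_rhat(draws):.3f}")  # ~1.00 for well-mixed chains
```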
Table A.7: Extensive margin for call engagement: probability of engagement.

             Any successful engagement       At least one second in call
Treatment    Mean        Prob. highest       Mean        Prob. highest
             (1)         (2)                 (3)         (4)
T1A          30.12%      32.15%              84.12%      5.75%
T1B          30.04%      26.55%              85.70%      12.09%
T1C          27.15%      2.26%               86.59%      35.15%
T2A          26.20%      3.78%               81.70%      0.91%
T2B          30.28%      35.26%              85.92%      25.11%
T2C          15.41%      0.00%               85.40%      20.99%

Notes: (1) and (3): The average probability is calculated in analogy with Eq. ??. (2) and (4): The probability optimal is calculated as in Eq. 4.