Policy Research Working Paper 9847

Preparation, Practice, and Beliefs
A Machine Learning Approach to Understanding Teacher Effectiveness

Deon Filmer
Vatsal Nahata
Shwetlena Sabarwal

Development Economics
Development Research Group & Education Global Practice
November 2021

Abstract

This paper uses machine learning methods to identify key predictors of teacher effectiveness, proxied by student learning gains linked to a teacher over an academic year. Conditional inference forests and the least absolute shrinkage and selection operator are applied to matched student-teacher data for math and Kiswahili from grades 2 and 3 in 392 schools across Tanzania. These two machine learning methods produce consistent results and outperform standard ordinary least squares in out-of-sample prediction by 14–24 percent. As in previous research, commonly used teacher covariates like teacher gender, education, experience, and so forth are not good predictors of teacher effectiveness. Instead, teacher practice (what teachers do, measured through classroom observations and student surveys) and teacher beliefs (measured through teacher surveys) emerge as much more important. Overall, teacher covariates are stronger predictors of teacher effectiveness in math than in Kiswahili. Teacher beliefs that they can help disadvantaged and struggling students learn (for math) and they have good relationships within schools (for Kiswahili), teacher practice of providing written feedback and reviewing key concepts at the end of class (for math), and spending extra time with struggling students (for Kiswahili) are highly predictive of teacher effectiveness. As is teacher preparation on how to teach foundational topics (for both Math and Kiswahili). These results demonstrate the need to pay more systematic attention to teacher preparation, practice, and beliefs in teacher research and policy.
This paper is a product of the Development Research Group, Development Economics and the Education Global Practice. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at ssabarwal@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Preparation, Practice, and Beliefs: A Machine Learning Approach to Understanding Teacher Effectiveness

Deon Filmer (The World Bank), Vatsal Nahata (The World Bank), Shwetlena Sabarwal (The World Bank) [1]

Keywords: Education; Teacher performance; Teacher value-added; Teacher mindsets; Student achievement

JEL Codes: I20; I21; I25; I28; J45

Acknowledgements: We would like to thank the Research on Improving Systems of Education (RISE) program for funding and the RISE Tanzania team for background work and inputs. Comments and guidance from Samer Al-Samarrai, Noam Angrist, Marina Bassi, Paolo Brunori, Jacobus Cilliers, Xiaoyan Liang, Daniel Mahler, Chiara Masci, Halsey Rogers, Dario Sansone, Fritz Schiltz, Jan Spiess, Falco Stoffi, Inaam Ul Haq, and the RISE Quality Assurance Team are gratefully acknowledged.
Diwakar Kishore provided excellent research assistance.

[1] Authors listed alphabetically. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the view of the World Bank, its Executive Directors, or the countries they represent.

1. Introduction

There is strong agreement that teachers matter a lot for student learning [2], but little agreement on which specific teacher factors matter most. Studies document wide variation in teacher effectiveness that is not well explained by observable teacher characteristics (e.g., McCaffrey et al. 2004; Jacob and Lefgren, 2005; Rivkin, Hanushek, and Kain, 2005; Gordon, Kane, and Staiger, 2006; Kane, Rockoff, and Staiger, 2008). Specifically, observable and widely available teacher characteristics such as teacher qualifications, test scores, training, and experience appear to be weak predictors of teacher contributions to student learning in high-income countries (Rockoff 2004; Rivkin, Hanushek and Kain 2005; Aaronson, Barrow and Sander 2007; Staiger and Rockoff 2010).

[2] See for example: Hanushek and Rivkin 2010; Nye et al. 2004; Chetty et al. 2014a; 2014b; Buhl-Wiggers et al. 2017; Bau and Das, 2020.

This finding is mirrored in recent studies from low- and middle-income countries. Research from Pakistan and India does not find a strong relationship between teacher qualifications and teacher value-added in either government or private schools (Bau and Das 2020; Azam and Kingdon 2015); in Ecuador, teacher entry-exam performance explains a small fraction of the variation in student learning (Cruz-Aguayo et al. 2017). Bau and Das (2020) find in Pakistan's context that observed teacher characteristics account for less than 5 percent of the variation in teacher value-added.

Using machine learning methods on a rich student-teacher data set from Tanzania, this paper identifies key predictors of teacher effectiveness, proxied through student learning gains linked to a teacher over an academic year, a measure closely linked to the concept of teacher value-added (TVA). [3] Specifically, it explores which aspects of a teacher (who teachers are, what teachers know, what teachers do, or what teachers believe) are most predictive of student learning gains.

[3] Several studies show that teacher value-added measures, which control for a student's prior-year test scores, provide unbiased forecasts of teachers' causal impacts on student achievement (Bacher-Hicks et al. 2017, Glazerman and Protik 2015, Chetty et al. 2014, Bacher-Hicks, Kane, and Staiger 2014, Rothstein 2014).

Machine learning (ML) is well suited to identifying key predictors of teacher effectiveness from a large set of teacher, student, and school covariates. It is also a novel approach in this setting, since most studies in the TVA literature use linear modeling techniques (Koedel et al. 2015). ML applications, which are increasingly common in econometrics (Athey and Imbens 2016, Mullainathan and Spiess 2017), often involve predictions about some variables given others. They can uncover generalizable patterns and discover complex structures that were not specified in advance (Mullainathan and Spiess 2017). ML algorithms attempt to select flexible models that fit the data well, but not so well that out-of-sample prediction is compromised (Athey and Imbens 2016). They can be particularly successful on high-dimensional data where we observe many pieces of information on each unit (Athey and Imbens 2016). In our case, where data are relatively high dimensional (with 52 explanatory variables), using ML algorithms helps us avoid ad-hoc model selection. Instead, the ML algorithm helps show which teacher covariates matter more for predicting student learning gains, by allowing for highly flexible models that are evaluated on the basis of out-of-sample replicability. ML algorithms also help avoid the problem of multicollinearity (Dormann et al. 2013).
We use two ML algorithms: conditional inference forests (CIF) and the least absolute shrinkage and selection operator (LASSO). We find that ML methods outperform standard OLS in out-of-sample prediction by 14–24 percent. Also, the identified variables of importance are largely consistent across our two very different ML methods. ML results are in line with previous research in that commonly used teacher characteristics such as teacher gender, education qualifications, and experience do not seem to hold much predictive power for student learning gains. Instead, what teachers do in terms of specific classroom practices (measured through classroom observations and student surveys) and what teachers believe in terms of how they perceive the abilities of their students and the environment around them (measured through teacher surveys) are consistently revealed to be important.

Overall, teacher covariates matter more, and differently, for Math than Kiswahili. For Math, the teacher belief that they can help disadvantaged and struggling students learn; the teacher practice of providing clear and helpful written feedback (on homework and tests); and the teacher preparation in teaching foundational concepts are the three most predictive factors for student learning gains. For Kiswahili, where teacher (and other observable) covariates are on the whole less predictive, teacher preparation, practice, and beliefs still emerge as being important. Specifically, the teacher belief that action is taken against poor teacher performance; the teacher practice of providing extra help to struggling students [4]; and (as in Math) teacher preparation in teaching foundational concepts are the three most predictive teacher covariates for student learning gains.

Our paper contributes in two ways to the literature on teacher effectiveness.
It demonstrates how machine learning techniques can help address long-standing prediction problems in education economics, including that of predicting student learning gains linked to a teacher. Applications of ML in education economics are still not very common. We were able to find three types: one for predicting student dropouts (Aulck et al. 2016; Adelman et al. 2018; Sansone 2019), a second for predicting student performance in international tests like TIMSS and PISA and in national tests (Agasisti et al. 2018), and a third for modeling teacher productivity (Chalfin et al. 2016). By applying ML techniques to a rich set of control variables, we are able to explore the question of teacher effectiveness with much more granularity, while letting the data speak. This last part is particularly important because it allows for the use of flexible and non-parametric approaches in estimation while restricting arbitrary judgements on the part of the researcher, the scope for which only increases with a richer set of controls.

Another set of important but less overtly actionable insights relates to the importance of teacher beliefs in determining teacher effectiveness. The paper shows that teacher beliefs about whether students can learn and whether they have good relationships matter for their effectiveness. This finding corroborates a sizeable but scattered body of evidence on the importance of teacher beliefs for student outcomes (Sabarwal et al. 2021). Given that teacher beliefs have not been given much systematic attention in the design and implementation of teacher policies, these findings suggest that these beliefs might be an important but missing ingredient of programs and policies for teacher effectiveness. The question is whether these beliefs can be changed through interventions and policies. Recent research from different disciplines shows that they can. A body of research in education (e.g. Dweck 2006, Yeager et al. 2012, Paunesku et al. 2015) and organizational psychology (e.g. Heslin, Latham and VandeWalle 2005) has revealed how fixed mindsets can be shifted and how this can help improve motivation and performance. This paper also highlights the importance of further exploring this line of work.

[4] During breaks, lunch, or after school.

Our paper is organized as follows. Section 2 provides information on data, estimation strategy (including a conceptual introduction to the CIF and LASSO algorithms), and limitations; Section 3 provides a descriptive analysis of teacher-level and other covariates; Section 4 presents the main results; and Section 5 concludes.

2. Data, Methodology, and Limitations

2.1 Data

This study is a part of the Research on Improving Systems of Education (RISE) program for Tanzania, wherein several researchers are using the same data for different studies looking at various aspects of the Tanzanian education system and reform. [5] The data for this paper come from 392 schools randomly sampled from 392 wards across 22 districts in 6 representative regions across Tanzania. Our final sample includes 436 teachers and 3,019 students. The baseline survey was conducted between February and May 2019 and the follow-up survey between January and April 2020, targeting 748 teachers and 6,586 students from Grades 2 and 3.

Three instruments were used to collect data on teacher covariates: a detailed teacher survey, which also has a dedicated module on teacher mindsets; a teacher subject content knowledge assessment; and classroom observation of teachers using the Teach Classroom Observation tool. [6] For the detailed teacher survey, 10 teachers were randomly selected from the complete teacher roster for the school provided by the head-teacher. To the extent possible, the survey was targeted at teachers teaching Math and Kiswahili in Grades 2 and 3. After the survey, only the Grade 2 and 3 teachers were invited to take the teacher assessment.
Teacher assessments were subject-based and linked to the curriculum. For Kiswahili, teachers were expected to read a short text and answer 8 comprehension questions, while for Math they answered 10 questions about basic algebra operations and geometry. Finally, in each school, one Grade 2 and one Grade 3 teacher were randomly selected for classroom observation using the Teach Classroom Observation tool (Molina et al. 2018). Teach allows enumerators to rate teaching practices through two 15-minute observations during a lesson. The practices are organized into nine dimensions: Supportive Learning Environment, Positive Behavioral Expectations, Lesson Facilitation, Checks for Understanding, Feedback, Critical Thinking, Autonomy, Perseverance, and Socio-Emotional Skills. These dimensions are measured on a five-point scale and then averaged across the two 15-minute observations.

The study also included a student survey and student assessment. For each school, around 20 students (10 each from Grades 2 and 3) were randomly selected from a list of all Grade 2 and 3 students (provided by the head-teacher). Students were tested on foundational concepts in Math and Kiswahili. These tests were developed by Tanzania education professionals and are similar to the Uwezo annual learning assessment, a nationwide assessment used to measure learning in Tanzania (see Mbiti et al. 2021 for exact test creation). The Math test focused on counting, basic addition, subtraction, multiplication and division, while the Kiswahili test focused on correctly reading words, writing sentences and comprehension. [6] For calculating student learning gains, the same set of test questions was used at baseline and follow-up. The test was of a slightly higher level for Grade 3 compared to Grade 2 students.

[5] For more details see: https://riseprogramme.org/countries/tanzania
[6] https://www.worldbank.org/en/topic/education/brief/teach-related-blogs
The tests were low stakes and designed to test a range of abilities such that scores could be equated across years using a set of linked questions in baseline and follow-up. These features allow us to test children on the same knowledge scale. A student survey was also administered to collect data on student characteristics and student perceptions about teacher practices (e.g. practices that teachers did or did not engage in with students).

Our data set contains 52 explanatory variables that can be divided into the following categories: (i) student-level variables such as age, household asset ownership, baseline score, and whether they attended private tuition for the particular subject; (ii) school-level variables such as the pupil-teacher ratio at the school, whether the school is in an urban or rural location, and certain institutional/governance variables; and (iii) teacher-level variables. We divide our teacher-level variables into 4 categories: (i) who teachers are (teacher characteristics); (ii) what teachers know (teacher knowledge measured through the teacher assessment); (iii) what teachers do (teacher practice measured through teacher classroom observation); and (iv) what teachers believe (teacher mindsets). These are discussed further in Section 3 and Annex 1.

The unit of observation is the teacher, and the outcome of interest is the average student learning gain between baseline and follow-up (approximately 8 months) for the teacher. We construct estimates of student learning gains using the matched teacher-student database and student assessment data from baseline and follow-up. Student learning gains linked to a particular teacher are calculated as the percentage correct score in the follow-up student assessment minus the percentage correct score in the baseline student assessment, averaged across the teacher's students. We then model student learning gains using our ML algorithms on a host of student, school, and teacher-level covariates.
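The outcome construction described above can be illustrated with a minimal sketch. It assumes hypothetical matched rows of the form (teacher id, baseline percent correct, follow-up percent correct); the function name, identifiers, and scores are invented for illustration and are not taken from the study data.

```python
from collections import defaultdict
from statistics import mean

def teacher_learning_gains(records):
    """Average (follow-up minus baseline) percent-correct gain per teacher.

    `records` holds hypothetical matched rows:
    (teacher_id, baseline_pct_correct, followup_pct_correct).
    """
    gains_by_teacher = defaultdict(list)
    for teacher_id, baseline_pct, followup_pct in records:
        # Student-level gain in percentage points.
        gains_by_teacher[teacher_id].append(followup_pct - baseline_pct)
    # Teacher-level outcome: mean gain across that teacher's students.
    return {t: mean(g) for t, g in gains_by_teacher.items()}

# Illustrative rows only, not study data.
records = [
    ("T1", 40.0, 55.0),
    ("T1", 30.0, 35.0),
    ("T2", 50.0, 52.0),
    ("T2", 20.0, 40.0),
]
gains = teacher_learning_gains(records)
print(gains)
```

Because the unit of observation is the teacher, each teacher contributes a single outcome value regardless of how many students are matched to them.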
We conduct the analysis separately for Math and Kiswahili.

The estimation of student learning gains for a teacher can be seen as analogous to the estimation of Teacher Value Added (TVA); however, there are some differences between our estimation of student learning gains and the way in which TVA is often estimated in the standard education economics literature. In this literature, TVA is estimated as the teacher fixed effect from a regression of student follow-up test scores on student-level covariates including lagged test scores (see Koedel et al. 2015 for a comprehensive review); this is often referred to as step 1. This teacher fixed effect is then regressed on a host of teacher and school-level covariates to find the teacher characteristics that best predict TVA (Rockoff 2004, Chetty et al. 2014a, Koedel et al. 2015); this is referred to as step 2. We choose the student learning gain approach over the standard TVA approach to avoid imposing a linear functional form in either step 1 or step 2. This allows the machine learning algorithms maximum room to use highly flexible and interactive functional forms in a manner that is completely driven by the underlying data. [7]

[6] For Grade 2, the Math portion of the test had 12 questions while the Kiswahili portion had 16 questions. For Grade 3, the Math portion had 17 questions while the Kiswahili portion had 15 questions (to ensure uniformity in comparison, we only chose those questions that were repeated in baseline and follow-up).

There are two sources of attrition in our data. First, some students could not be contacted at follow-up, and second, several teachers could not be matched to students. [8] Ultimately, we were able to map 3,019 students to 436 teachers. Since we conduct our analysis based on the subject taught by a given teacher, our final sample for analysis includes 346 Math teachers matched with 2,359 students and 336 Kiswahili teachers matched with 2,297 students.
In Table 1, we compare teacher characteristics at baseline in the full sample and the final sample. This comparison suggests that attrition of teachers is mostly random on observables.

2.2 Methodology

Our main goal is to identify which covariates matter most for predicting student learning gains. To do this, we rely on machine learning (ML) approaches. In this section we provide a brief overview of our overall approach and of our chosen ML algorithms.

We use two machine learning algorithms to predict student learning gains: Conditional Inference Forests (CIF) and the Least Absolute Shrinkage and Selection Operator (LASSO). CIF and LASSO are both supervised algorithms, that is, algorithms for settings where we have data on the dependent variable. The goal of supervised learning is to learn a function that, given a sample of dependent and explanatory variables, best approximates the relationship between them. Given the availability of data on the dependent variable, the supervised algorithm can compare its estimates to the actual values of the dependent variable. [9] Typically, data are split into two sets: a training set and a test set. The algorithm learns about the relationship between the dependent and explanatory variables using the training set. The test set is not used by the algorithm during the model building process and is therefore used to empirically evaluate its out-of-sample performance (details in Annex 2).

Within the family of supervised ML algorithms, we chose CIF and LASSO because (i) they approach the variable selection problem in different ways, with CIF belonging to the non-parametric and LASSO to the parametric class of ML models; (ii) they are suitable for variable selection in high-dimensional data like ours; and (iii) they are increasingly popular in economics and the broader social science literature (Varian 2014; Mullainathan & Spiess 2017).
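The training/test logic described above can be sketched as follows. This is an illustrative toy, not the procedure in Annex 2: the 25 percent test share, the synthetic data, and the two stand-in models (a training-mean predictor as the benchmark and a one-variable least-squares line as the candidate) are all assumptions made for the example. It shows only how out-of-sample mean squared error (MSE) is computed on held-out data and turned into a percentage improvement over a benchmark.

```python
import random
from statistics import mean

def train_test_split(rows, test_share=0.25, seed=0):
    """Shuffle rows and hold out a share as the test set (share is assumed)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_share))
    return shuffled[n_test:], shuffled[:n_test]

def fit_mean(train):
    """Benchmark stand-in: always predict the training-set mean of y."""
    y_bar = mean(y for _, y in train)
    return lambda x: y_bar

def fit_line(train):
    """Candidate stand-in: one-variable least-squares line fit on training data."""
    x_bar = mean(x for x, _ in train)
    y_bar = mean(y for _, y in train)
    sxx = sum((x - x_bar) ** 2 for x, _ in train)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in train) / sxx
    return lambda x: y_bar + slope * (x - x_bar)

def mse(predict, rows):
    """Mean squared prediction error on the given rows."""
    return mean((y - predict(x)) ** 2 for x, y in rows)

# Synthetic (x, y) pairs, invented for illustration.
noise = random.Random(1)
rows = [(i / 40, 2.0 + 3.0 * (i / 40) + noise.gauss(0, 0.1)) for i in range(40)]

train, test = train_test_split(rows)
mse_benchmark = mse(fit_mean(train), test)   # models never see the test rows
mse_candidate = mse(fit_line(train), test)
improvement_pct = 100 * (mse_benchmark - mse_candidate) / mse_benchmark
print(f"out-of-sample improvement over benchmark: {improvement_pct:.1f}%")
```

The key design point is that both models are fit on the training rows only, so the test-set MSE reflects how well each model generalizes rather than how closely it fits the data it has already seen.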
We benchmark the predictive performance of our ML models against the standard linear regression model used in the extant literature, that is, Ordinary Least Squares (OLS). Next, we show the key variables of importance for predicting student learning gains identified by the ML methods. Finally, we use the ML-identified variables to run a parsimonious OLS regression for student learning gains. We do this to further analyze the relative importance of our ML-identified variables. In sections 2.2.1 and 2.2.2 we provide a conceptual introduction to the CIF and LASSO models. In section 2.2.3 we present the OLS model which we use to benchmark the performance of our ML models.

[7] For the sake of robustness, we also estimate TVA using the traditional approach by first calculating teacher fixed effects and subsequently modeling these teacher fixed effects using our ML algorithms on teacher-level covariates. As seen in Annex 3, our main results remain robust in this more traditional TVA specification.
[8] The official mapping of students and teachers at the school level happens via "streams" while the actual mapping of students and teachers is done via informal "groups" which comprise students from multiple grades and subjects in a single class. This system of "groups" is not documented at the school level.
[9] Unsupervised learning algorithms, on the other hand, do not have data on the dependent variable, so their goal is to infer the natural structure present within a set of explanatory variables. Examples are Principal Components Analysis and clustering data based on a given set of covariates.

2.2.1 Conditional Inference Trees & Forests (CIF)

Trees, or decision trees, divide the covariate space (X1, X2, …, Xk) into M mutually exclusive regions/groups (G1, G2, …, GM) using a well-defined splitting criterion. This implies that every observation finds itself as part of exactly one group, with each group being homogeneous in the expression of some variables in the covariate space.
For any observation yi that finds itself in a given group Gm, the decision tree simply predicts ŷi to be the mean y value of all observations that find themselves in the same group. Due to their inherent non-parametric structure, trees are able to accommodate flexible and highly interactive relationships between the explanatory and dependent variables.

The precise manner in which splits are made depends on the variant of the tree used. In this paper we use conditional inference trees (CIT) proposed by Hothorn et al. (2006) instead of the standard regression tree (see Loh 2011 for an introduction) because the latter is biased towards selecting continuous variables with more split points as compared to categorical variables (Hothorn et al. 2006). CITs are constructed as follows: the algorithm tests the relationship between the dependent variable and each explanatory variable and selects the variable with the strongest association. If the association is strong enough (as judged by the significance level α*), it selects the variable and searches for a value of that variable at which to split the sample into two, such that the relationship with the dependent variable is maximized. This procedure of selecting a variable and a split value is repeated in each of the two subsamples until no explanatory variable in any subsample is sufficiently related to the dependent variable. We describe an example of a tree in the figure below.

[Figure A: Example Tree]

This tree maps out student learning gains for Math using 3 variables: student baseline math score, teacher assessment math score, and the percentage of students who say that the teacher reviews concepts at the end of math class. It tells us that if a student's baseline score is less than or equal to 51.67 percent, teachers reviewing concepts is important (split point at 20%) and gains are higher for those teachers who review concepts more often.
On the other hand, for students whose baseline score is greater than 51.67 percent, teacher subject content knowledge (measured by teacher assessment score in math) starts to matter, with the split point being 59%. Teachers who score higher in subject content knowledge have higher student learning gains (9.8% vs. 0.69%).

There is a bias versus variance trade-off in decision trees, captured by the depth of the tree. Shallower trees will have high bias but low variance in their estimates (due to smoothing), while deeper trees will have low bias (as the tree partitions the sample space into more granular groups) but high variance, as they are sensitive to small changes in the data. The final depth of a tree is closely linked to the significance level α*. [10] Irrespective of the value of α* specified, the tree algorithm still selects the most relevant variable and the most relevant splitting point within that variable, yielding good properties for variable selection. The detailed procedure on how a tree is constructed under CIT is laid out in Annex 2.

Trees generally suffer from two major drawbacks. First, their predictions suffer from high variance as they are sensitive to small changes in the data. Second, any given tree will not select more than a handful of variables during its construction. As a result, many variables (especially in high-dimensional settings) do not get a chance to contribute to the tree construction process. To address both these issues, it is best to rely on 'forests', not just 'trees'. Accordingly, we use conditional inference forests (CIF) (Breiman, 2001, Biau and Scornet, 2016). A forest is simply a collection of many trees (conventionally 100 or 500) and rests on the "wisdom of crowds" logic. When a forest makes a prediction for any given observation, it, loosely speaking, averages the predictions made by each tree. Two tweaks are made when constructing a forest.
First, only a random sample of predictors is selected when constructing any given tree. This ensures that several trees will not be constructed using similar variables and will therefore not yield correlated predictions. This also gives each explanatory variable an adequate chance to prove itself, yielding good properties for variable selection. [11] Second, a random sample of the training data set is used in the construction of each tree. Due to certain statistical properties that are suitable for stable variable selection, we sample the training set without replacement (as discussed in Strobl et al. 2007; 2009, Hothorn et al. 2015). These two features, along with the fact that predictions are averaged across many trees, ensure that the estimates of the dependent variable have low variance, and deeper trees can be grown to achieve low bias.

As with trees, certain tuning parameters have to be optimized when constructing a CIF. [12] This optimization ensures that the CIF performs well out of sample. Annex 2 lays out the detailed procedures and choices for the tuning parameters.

[10] A more complete description of the parameters that finally decide the structure of the tree is provided in Annex 2.
[11] In our analysis, we use the square root of the number of explanatory variables based on convention (see https://www.stat.berkeley.edu/~breiman/Using_random_forests_v3.00.pdf).
[12] The tuning parameters we optimize for are: (i) the minimum number of observations required to create a split, (ii) the significance level alpha, and (iii) the number of trees that make up the forest.

One drawback with forests is that they cannot be visualized in the same manner as trees. Variable importance measures can, however, be calculated, yielding the variables that are most predictive of student learning gains. We measure variable importance using the permutation method described in Strobl et al. (2007). [13] Each explanatory variable is permuted such that its association with the dependent variable is lost. [14] The loss in predictive power caused by permuting a particular variable gives us a measure of its importance when making accurate predictions for student learning gains. We report this measure of variable importance for predicting Math and Kiswahili student learning gains in Tables 4b & 5b, respectively.

To make variable importance measures more interpretable, we standardize them such that the most important variable takes a value of one and rank them accordingly. The standardized variable importance numbers shown have been averaged by the number of times a variable emerges as important during 20 model runs (discussed further below). We only choose variables that occur in 14 or more runs to account for any multicollinearity and remove the element of random chance in variable selection (analogous to what Mullainathan and Spiess 2017 do for LASSO).

2.2.2 Least Absolute Shrinkage & Selection Operator (LASSO)

The second supervised machine learning algorithm we use is LASSO (Tibshirani, 1996), perhaps the most well-known to economists. LASSO is a penalized form of regression where the L1 norm of the coefficient vector β is included in the OLS minimization problem [15]:

$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j| \;=\; \mathrm{RSS} + \lambda\sum_{j=1}^{p}|\beta_j|$$

The absolute-value penalty term effectively ensures that we are left with only a limited number of non-zero coefficients, which means, in effect, that LASSO conducts variable selection. LASSO is computationally feasible on high-dimensional data sets and also yields good predictions, especially when the true underlying relationship is linear and sparse (Zhao and Yu, 2006; Varian, 2014). The tuning parameter λ decides the number of variables selected by LASSO. A larger λ implies that more coefficients are set to zero (λ = 0 gives us OLS). In our empirical strategy, we choose λ based on k-fold cross validation (details in Annex 2).
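How the penalty drives coefficients to exactly zero can be seen in a small sketch that minimizes the objective above by cyclic coordinate descent with soft-thresholding. This is a swapped-in textbook technique chosen for illustration, not necessarily the solver used in the paper. The toy data, the function names, and the fixed λ values are assumptions; the sketch only centers the variables (the paper standardizes them) and does not perform the k-fold cross validation used to choose λ.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; return exactly 0.0 when |rho| <= t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def lasso(X, y, lam, n_sweeps=100):
    """Minimize sum_i (y_i - b0 - sum_j b_j x_ij)^2 + lam * sum_j |b_j|."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Center columns and outcome so the intercept can be recovered at the end.
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    resid = [v - y_mean for v in y]  # residuals when every b_j is zero
    z = [sum(Xc[i][j] ** 2 for i in range(n)) for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            if z[j] == 0.0:
                continue  # a constant column cannot be updated
            # Correlation of column j with the residual that excludes b_j.
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) + beta[j] * z[j]
            new_b = soft_threshold(rho, lam / 2.0) / z[j]
            shift = new_b - beta[j]
            for i in range(n):
                resid[i] -= shift * Xc[i][j]
            beta[j] = new_b
    intercept = y_mean - sum(b * m for b, m in zip(beta, col_means))
    return intercept, beta

# Toy data (illustrative): y depends on the first column only.
X = [list(row) for row in zip(
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
)]
y = [1.0 + 2.0 * row[0] for row in X]

b0_ols, beta_ols = lasso(X, y, lam=0.0)     # no penalty: OLS solution
b0_mid, beta_mid = lasso(X, y, lam=8.0)     # moderate penalty: shrinkage
b0_big, beta_big = lasso(X, y, lam=1000.0)  # heavy penalty: all zero
print(beta_ols, beta_mid, beta_big)
```

With no penalty the fit recovers the OLS coefficients; with a moderate penalty the relevant coefficient is shrunk toward zero while the irrelevant ones stay at exactly zero; with a very large penalty every coefficient is zero and the prediction collapses to the mean of y.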
The benefit of LASSO is that it provides easily interpretable and sparse models. [16] It also allows us to determine whether the associations between the outcome and the predictors are positive or negative globally (something that tree-based methods cannot do). An issue seen in the practical implementation of LASSO is that it may not select variables in a stable manner when the data set is split differently into the training and test sets (Mullainathan and Spiess 2017). [17] However, in our case, LASSO conducts variable selection in a stable manner, with the most important variables being selected in the vast majority of model runs.

[13] As mentioned in Strobl et al. (2007), the permutation method (when used with sampling without replacement) yields unbiased variable importance measures and is also unaffected by different types of predictor variables (categorical or continuous) or their scale of measurement.
[14] After the permutation, the out-of-bag error rate (MSEOOB) is recalculated for the entire forest. The increase in MSEOOB relative to the baseline out-of-bag error tells us how important a particular variable was.
[15] We standardize all variables in our data set to have mean = 0 and standard deviation = 1; this is a prerequisite for running LASSO since LASSO may otherwise penalize coefficients of variables measured on a larger scale.
[16] LASSO coefficients should, however, not be interpreted in the way OLS coefficients are since they are biased towards zero.

2.2.3 OLS specification

As mentioned above, we benchmark the performance of our two ML algorithms against OLS.
The OLS model we use for comparison is as follows:

$$Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \varepsilon_i$$

where,
• Yi is the average student learning gain for teacher i
• X1 is a vector of student controls such as student age, baseline score, whether a student attends private tuitions in that subject 18 and percentage of students belonging to the lowest asset quartile
• X2 is a vector of school-level covariates such as pupil-teacher ratio, school location (urban/rural), annual entitlement grant per school etc.
• X3 is a vector of school-level variables that affect the management and governance of teachers
• X4 is a vector of controls representing teacher characteristics (who teachers are) such as gender, experience and academic qualifications
• X5 is a vector representing our variables of interest divided into 3 categories: (i) what teachers do (measured through the TEACH Classroom Observation Tool and student surveys), (ii) what teachers believe (measured through modules in teacher surveys representing mental models), and (iii) what teachers know 19 (teacher assessment scores etc.)
• εi represents an idiosyncratic error term

2.3 Limitations

Our study suffers from three main limitations. First, we cannot fully address the issue of non-random matching of students, teachers, and schools. A rich set of teacher, student, and school-level controls helps, as does the reliance on two cohorts (Grades 2 and 3). Nonetheless, despite our use of a very detailed set of teacher variables, the problem of unobserved variable bias remains. That said, prior work suggests that non-random matching is sufficiently mitigated in models that control for a rich set of covariates (Koedel et al. 2015), and that TVA estimates that control for lagged test scores exhibit little to no bias (Chetty et al. 2014a). Second, the total number of teacher observations in our study (436) is not very high.
This is partly because we could not match a significant share of the 748 teachers in our overall sample to students, owing to challenges in data collection and entry. The key challenges in matching teachers and students were teacher and student turnover and significant churn in the students assigned to teachers. Often this churn took place in a way that was not formally documented and on which it was hard to establish credible data. However, our analysis in Table 1 shows that this attrition of teachers is mostly random on observables. Finally, we rely on ML methods which are appropriate for predictions and associations but do not establish causality (Athey and Imbens 2016). Accordingly, our interpretation and discussion are mostly around predicting student learning gains from a rich set of teacher, student, and school covariates.

17 This instability can arise due to multicollinearity or if the underlying true relationship is non-linear.
18 These are averaged for all the students taught by a given teacher since our unit of observation is the teacher.
19 A comprehensive list of variables along with their description is available in Annex 1.

3. Descriptive Analysis of Teachers, Students, and Schools

3.1 School Characteristics

Our school sample is predominantly public (97 percent) and rural (79 percent). The average pupil-to-teacher ratio (PTR) is nearly 63 students per teacher. Almost all the public schools (98 percent) received an average capitation grant of TZS 7,154 per pupil (USD 3) in 2019 (also shown in Table 2a).

3.2 Student Characteristics

The average age of students in our sample is around 9 years. About 30 percent of students for any given teacher belong to the lowest quartile of the constructed asset index. 20 On average, students in the sample answered 37 percent of the questions correctly in their Math test and 45 percent in their Kiswahili test.
Around 43 percent of students were not able to correctly add the numbers 11 and 4. In Kiswahili, about 29 percent of students were not able to read the word paka (cat) (further details in Table 2b). Average improvement from baseline test score to test score at follow-up was 12 percent for Kiswahili and 19 percent for Math, as shown in Figure 1.

3.3 Teacher Characteristics

The teachers in our sample were selected from the HT-provided teacher roster, based on their subject and grade assignment. Teachers teaching the focal subjects Kiswahili, Math, and English were eligible for sampling, and those teaching in Standards 2 and 3 were prioritized.

Who teachers are and what they know

Overall, 56 percent of teachers are female. Around 54 percent of teachers have worked less than 10 years in the teaching profession. Mean experience is 12 years. Around 75 percent of teachers report having been trained at the diploma level or lower. On average, teachers in the sample answered 71 percent of the questions correctly in the Kiswahili assessment and 72 percent in the Math assessment, as shown in Figure 2.

20 For each student, we take the first principal component score of 8 variables indicating household ownership of the following assets: (i) television, (ii) radio, (iii) electricity, (iv) refrigerator, (v) bed/mattress, (vi) motorbike, (vii) fan, (viii) telephone. We then rank them by quartiles and create a dummy variable which equals 1 if a student belongs to the lowest quartile.

How teachers are managed

The mean reported gross monthly compensation for teachers is TZS 676,184 (USD 293). Education, relatively speaking, is one of the better-paid sectors (RISE Baseline Report 2019). However, since 2000, most teachers have faced stagnating purchasing power at best. Also, the salary differences between teacher certification types have increased over this period (RISE Baseline Report 2019).
Around 36 percent of teachers believe that their school regularly recognizes and rewards teacher performance. However, only 25 percent said that student learning outcomes are the key metric used by the head teacher to judge their performance. Nearly 84 percent of teachers believe actions are taken in case of poor performance. The action most often taken, according to teachers, was a warning from the head teacher. According to teachers, the risk of dismissal or transfer because of poor performance is almost zero. Around 49 percent agree with the following statement about the school leadership: “They will recommend me to be transferred or dismissed in case I receive too many bad performance evaluations.” In terms of support received (personal and professional) from the school administration and Government, only 17 percent and 23 percent of teachers, respectively, feel satisfied.

What teachers do

For what teachers do, we report data from Teach. Broadly, teachers in our sample score high (average rating of 3-3.5 out of 5) in providing a supportive learning environment and in setting positive behavioral expectations in the classroom. However, they score low (average rating of 1.5-2 out of 5) on providing students with feedback, in perseverance, and in social and collaborative skills. We also capture what teachers do through selected questions in the student survey: while 89% of students reported that their teacher explains in another way if they do not understand something, only 61% said that teachers write in their notebooks while correcting their work. About 70% of students said they were afraid of their teacher.

What teachers believe

Finally, for what teachers believe, some aspects of teacher beliefs are reflected in the section above, in terms of their perceptions of how they are managed. We also use a dedicated module incorporated within the teacher survey that builds on past cross-country research in this area (Sabarwal and Abu-Jawdeh 2018).
Using questions from this module, we create six mindset indices using principal component analysis. 21 The notable insights from the mindset modules are summarized here. Around 93 percent of teachers claimed that they could successfully teach all relevant subject content to even the most difficult students. But despite this high self-efficacy, about 40 percent of teachers believed there is little they can do to help a student’s learning if they come unprepared from previous grades. Teachers have nuanced views about test-based accountability; 95 percent of teachers believed that they should receive additional bonuses if their students perform well on exams, but only 50 percent believed that their promotion should depend on their students’ performance on exams (for comparisons to other countries see Sabarwal and Abu-Jawdeh 2018).

21 These are: self-efficacy, locus of control, quality of relationships, positive attitude, reinforcement bias, and support for test-based accountability.

4. Results

In this section we first lay out the relative outperformance of the two supervised ML algorithms used, CIF and LASSO. Next, we present CIF and LASSO results on which teacher covariates are most predictive of student learning gains in Math and Kiswahili. To analyze the relative importance of ML-identified variables further, we present results of Post-CIF and Post-LASSO OLS regressions on the selected set of variables. In other words, we show the results from the OLS regression on the ML-identified parsimonious models.

4.1 Performance of machine learning algorithms

As mentioned in section 2.2.1, we split our data into a training set and a test set (details in Annex 2). We train the models on the training set and then use the Mean Squared Error (MSE) from applying the model in the test set as the evaluation metric to judge their performance.
The MSE is particularly useful for prediction problems like ours because it optimally trades off bias and variance (Kleinberg et al. 2015; brief discussion in Annex 2). A lower MSE implies a better prediction out of sample. For each subject, we divide the MSE of OLS by the MSE of each of our ML methods in order to show the relative MSE. Hence, the relative MSE becomes:

$$\text{Relative MSE}_{CIF} = \frac{MSE_{OLS}}{MSE_{CIF}}, \qquad \text{Relative MSE}_{LASSO} = \frac{MSE_{OLS}}{MSE_{LASSO}}$$

A relative MSE greater than one implies that OLS performs poorly out of sample relative to the ML algorithm. This could be either because OLS overfits the data or because it makes poor use of the explanatory variables due to issues of high dimensionality. We also derive 95% confidence intervals for each model 22 to ensure that our results are not sensitive to the training-test split (see for example Brunori et al. 2018 for another application). Results are presented in Tables 3a and 3b and confidence interval plots in Figure 3.

We find that both CIF and LASSO outperform OLS in predicting student learning gains out-of-sample in the vast majority of cases. Therefore, they better model the relationship between teacher characteristics and student learning gains vis-à-vis OLS. CIF outperforms OLS by 21 percent for Math and 14 percent for Kiswahili 23 (average outperformance of 18 percent); LASSO outperforms OLS by 23 percent for Math and 21 percent for Kiswahili (average outperformance of 22 percent).

22 For the purposes of deriving 95% C.I., we run LASSO 200 times and Conditional Inference Forest 100 times.
23 In very few cases, the lower bound of the confidence interval is lower than 1, suggesting that at times CIF and LASSO may also be overfitting the data relative to OLS.

We also find that teacher covariates are more effective at predicting student learning gains for Math than for Kiswahili. This can be seen from the absolute value of the Mean Squared Error of the test set, which is higher for Kiswahili than Math.
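As a rough illustration of the relative MSE and of checking its sensitivity to the training-test split, the sketch below repeats a random split many times, computes the benchmark MSE divided by the challenger MSE on each test set, and reports an empirical 95 percent interval. Everything in it is an assumption made for illustration: the two stand-in models (`fit_mean` as the benchmark and `fit_line` as the challenger) are not OLS, CIF, or LASSO, the data are synthetic, the 30 percent test share is arbitrary, and the percentile rule is only one plausible way of forming the interval.

```python
import math
import random
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared prediction error."""
    return mean((a - b) ** 2 for a, b in zip(y_true, y_pred))


def fit_mean(train):
    """Hypothetical benchmark: always predict the training-set mean of y."""
    m = mean(y for _, y in train)
    return lambda x: m


def fit_line(train):
    """Hypothetical challenger: one-predictor least squares."""
    mx = mean(x for x, _ in train)
    my = mean(y for _, y in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    sxx = sum((x - mx) ** 2 for x, _ in train)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def relative_mse_runs(data, fit_benchmark, fit_challenger,
                      runs=200, test_share=0.3, seed=0):
    """Benchmark MSE / challenger MSE on the test set, over repeated splits."""
    rng = random.Random(seed)
    n_test = int(len(data) * test_share)
    ratios = []
    for _ in range(runs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        y_test = [y for _, y in test]
        bench, chall = fit_benchmark(train), fit_challenger(train)
        mse_bench = mse(y_test, [bench(x) for x, _ in test])
        mse_chall = mse(y_test, [chall(x) for x, _ in test])
        ratios.append(mse_bench / mse_chall)  # > 1: challenger predicts better
    return ratios


def percentile_interval(values, level=0.95):
    """Empirical interval covering roughly `level` of the values."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    lo = ordered[math.floor(tail * (len(ordered) - 1))]
    hi = ordered[math.ceil((1.0 - tail) * (len(ordered) - 1))]
    return lo, hi


# Synthetic data: a linear signal plus a small alternating disturbance.
data = [(x, 2.0 * x + (0.5 if x % 2 else -0.5)) for x in range(40)]
ratios = relative_mse_runs(data, fit_mean, fit_line)
low, high = percentile_interval(ratios)
print("relative MSE interval:", low, high)
```

An interval whose lower bound stays above one indicates that the challenger beats the benchmark across essentially all splits; a lower bound below one would signal that the ranking depends on how the data happened to be divided.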
This is also evident in the overall CIF and LASSO measure of variable importance presented in Tables 4b and 5b. As discussed in Section 2.2, for CIF, we measure variable importance using the permutation method described in Strobl et al. (2007), wherein each explanatory variable is permuted such that its association with the dependent variable is lost. For Math, this standardized score is 0.53 for the variable of highest importance after baseline score (teacher practice of providing written feedback on students’ work). For Kiswahili, this score is 0.19 for the variable of highest importance after baseline score (teacher belief that action is taken against poor teacher performance).

4.2 Key predictors of Student Learning Gains

We report the key predictors for student learning gains using CIF and LASSO in Tables 4a and 5a for Math and Kiswahili, respectively. We also show the detailed CIF and LASSO results in Tables 4b and 5b for Math and Kiswahili, respectively. In these detailed results we show the standardized variable importance scores for CIF 24 and the number of times the variable occurred (in the 20 runs for CIF and 200 runs for LASSO). For LASSO we also show the sign of the coefficient to show the relationship between student learning gains and a given variable in a global sense. Finally, we also run a ‘Post-CIF’ & ‘Post-LASSO’ (Belloni and Chernozhukov 2013) OLS on only those variables that are selected by our ML models, with standard errors clustered at the school level. 25 These results are presented in Tables 4c and 5c for Math and Kiswahili, respectively.

There are three broad patterns of note in our results. First, we find that variables of importance for predicting student learning gains are very similar across CIF and LASSO (see Tables 4a and 5a), with the variables of importance identified through LASSO being a subset of those identified by CIF.
This is perhaps unsurprising since LASSO, in general, yields sparser models relative to CIF and the latter also tends to select those variables which may be important in interaction with other variables. This alignment between CIF and LASSO, which occurs for both Math and Kiswahili, is noteworthy because the methods approach the prediction and model-building process in a very different manner (non-parametric for CIF and parametric for LASSO). Further, LASSO permits a check on whether the variables of importance identified through ML show the expected direction. 26

24 For LASSO, given the presence of the L1 norm in LASSO’s minimization problem, its coefficients are downward biased and therefore we do not report absolute coefficients since they cannot be interpreted in the same manner as OLS.
25 Belloni & Chernozhukov (2013) provide a more technical argument on the statistical properties of the Post-LASSO estimator.
26 This is not possible in CIF: because forests are non-parametric in nature and cannot be visualized in the same manner as trees, they do not tell us whether there is a positive or negative relationship between student learning gains and teacher covariates in a global sense. However, the sign of the LASSO coefficients informs us on this.

Our reported ML results (variables of importance for predicting student learning gains) are also stable across multiple runs. Given that the training set and test set are randomly sampled, it is plausible that variables that are important in one iteration may not be so if the training and test sets are randomly sampled again. Hence, to ensure stability, we run our CIF model 20 times, and only report variables that show up as important in 14 or more runs. 27 Result-stability around variables of importance is more of an issue in LASSO (Mullainathan and Spiess 2017) than in forests with a large number of trees (Strobl et al. 2009).
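The run-based stability filter can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's code: the function name `stable_importance`, the variable names, and the importance scores are hypothetical, and the order of operations (average each variable's importance over the runs in which it appears, drop variables below the run threshold, then rescale so the top variable equals one) is an assumption about how the reported numbers could be assembled.

```python
from collections import defaultdict


def stable_importance(runs, min_runs):
    """Keep variables that emerge as important in at least `min_runs` runs.

    `runs` is a list of dicts, one per model run, mapping a variable name to
    its importance score in that run (only variables selected in that run
    appear). Returns {name: (times_selected, scaled_mean_importance)}, where
    the mean is taken over the runs in which the variable appeared and then
    divided by the largest surviving mean.
    """
    scores = defaultdict(list)
    for run in runs:
        for name, value in run.items():
            scores[name].append(value)
    kept = {n: sum(v) / len(v) for n, v in scores.items() if len(v) >= min_runs}
    if not kept:
        return {}
    top = max(kept.values())
    return {n: (len(scores[n]), m / top) for n, m in kept.items()}


# Hypothetical output of 20 forest runs.
runs = []
for i in range(20):
    run = {"baseline_score": 4.0}          # selected in every run
    if i < 16:
        run["written_feedback"] = 2.0      # selected in 16 of 20 runs
    if i % 4 == 0:
        run["teacher_gender"] = 1.0        # selected in only 5 of 20 runs
    runs.append(run)

print(stable_importance(runs, min_runs=14))
```

With a threshold of 14 out of 20 runs, the rarely selected variable is dropped and the remaining scores are expressed relative to the top variable. The same function applies to a larger number of runs by passing a different `min_runs`.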
Therefore, we ran LASSO 200 times, and present only those variables that show up as important in more than 140 models to remove the role of random chance. This also allows us to account for multicollinearity.

Second, as mentioned in Section 4.1, teacher covariates matter more for predicting student learning gains in Math, compared to Kiswahili. Also, the important predictors are different across the two subjects. It is not surprising that teacher covariates are more important, and differently important, for Math as compared to Kiswahili. Studies have generally found greater variance in teacher effects on achievement in Math than in English (or reading). The difference is attributed to the large share of language learning that happens at home versus the mostly classroom-based learning that happens for Math (Jackson, Rockoff, and Staiger 2014, Bau and Das 2020).

Third, students’ baseline score is by far the most important variable 28 for predicting student learning gains, potentially signaling mean reversion.

4.2.1 Math

What are the most important predictors of student learning gains in Math? After controlling for a student’s baseline score, the two most important teacher covariates for predicting student learning gains are: (i) teacher practice of providing written feedback to students on their homework / tests and (ii) teacher belief they can help disadvantaged / struggling students learn. 29 These variables have the strongest importance in CIF and are selected as being important by both CIF and LASSO.

Aside from these top variables, other variables of importance occurring in both CIF and LASSO are as follows. Teachers with training in teaching foundational concepts (Reading, Writing, Counting) have higher student learning gains. In addition to providing written feedback, two other teacher practices are important. First, teachers who ask more open-ended questions (the critical thinking construct on Teach) have lower student learning gains.
This is one of the very few counter-intuitive results we see but is consistent with findings from other Teach studies, signaling a potential measurement and/or interpretation issue in Teach (Filmer et al. 2020). 30 Second, teachers who review concepts taught at the end of class have higher student learning gains.

27 We run CIF 20 times and LASSO 200 times because CIF, which is a collection of trees, is computationally expensive: 20 model runs for CIF take about 55 minutes to execute while 200 runs of LASSO take less than 1 minute to execute.
28 We explicitly include baseline test score as a control variable even though it is used to calculate student learning gains. We do this to account for threshold effects in the underlying distribution. We are therefore allowing for the fact that learning gains might be qualitatively different for a student who increases her test score from 20% to 30% as compared to a student who goes from 70% to 80%.
29 This can be interpreted as a proxy of whether teachers consider student learning of disadvantaged or struggling students to be within their locus of control. It may also be interpreted as a variable signaling how much ownership teachers take of the learning of struggling or disadvantaged students. It can also be interpreted as a growth mindset indicator for teachers, given that teachers who believe they can improve the learning of struggling / disadvantaged students may be seen to be having a growth mindset.
30 For instance, it could be signaling that the practice of asking open-ended questions may produce perverse results if key concepts are not explained well.

In terms of what teachers believe, apart from the belief that they can help disadvantaged / struggling students learn, two other factors are important. First, teachers who believe that their career progression and salary should be linked to student test performance have higher student learning gains (teacher belief in test-based accountability).
Second, teachers who believe they are the most important stakeholder in assessing progress towards professional targets, as compared to other stakeholders, yield higher student learning gains (teacher belief in their autonomy). In addition to these teacher covariates, school location (rural or urban) is also a variable of importance for student learning gains.

The OLS regressions using the variables selected by ML show some interesting insights. In the Post-CIF and Post-LASSO regressions (Table 4c), we find that the teacher belief they can help struggling / disadvantaged students learn is significant at the 1% level. Going from the teacher at the 25th percentile to the 75th percentile on the Locus of Control Index is associated with an increase in student learning gains of 0.21 SD. Moving from the 25th to the 75th percentile in the teacher practice of reviewing concepts at the end of the class is associated with a 0.18 SD increase in student learning gains. Teacher support for test-based accountability (teacher beliefs) is also significant at the 5% level. Teacher training in teaching foundational concepts (teacher preparation) and teacher practice of asking open-ended questions (critical thinking) are significant at the 5% level. The latter is negatively related to student learning gains, in line with the LASSO result.

4.2.2 Kiswahili

As mentioned above, the degree of influence exerted by teacher covariates on student learning gains for Kiswahili is lower than it is for Math (see Tables 3a and 3b and compare Tables 4b & 5b). After controlling for baseline score and student age, the two most important variables for predicting student learning gains are: whether action is taken against poor teacher performance and whether a teacher provides extra help, during breaks or after school hours, to those students who face difficulties. These are selected by both ML models.
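The interquartile comparisons used when reading the Post-CIF and Post-LASSO regressions (moving a teacher from the 25th to the 75th percentile of a predictor) amount to multiplying the regression coefficient by the predictor's interquartile range. The short sketch below shows that arithmetic. The function name, the coefficient of 0.1, and the index values are hypothetical and are not taken from the paper's tables, and the quartile rule used (`method="inclusive"`) is an assumption.

```python
from statistics import quantiles


def interquartile_effect(coefficient, values):
    """Predicted change in the outcome when a predictor moves from its
    25th to its 75th percentile, holding the other regressors fixed."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return coefficient * (q3 - q1)


# Hypothetical example: an index taking values 0..4 across five teachers and
# an OLS coefficient of 0.1 SD of learning gains per unit of the index.
index_values = [0, 1, 2, 3, 4]
effect = interquartile_effect(0.1, index_values)
print(effect)
```

Because the calculation scales the coefficient by the observed spread of the predictor, two variables with the same coefficient can have quite different interquartile effects if one of them varies much more across teachers.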
Another somewhat counter-intuitive result is the negative relationship between teacher beliefs that action is taken against poor performance and student learning gains. Our interpretation, based on contextual information, is that ‘action against poor performance’ typically refers to action against disciplinary infractions (and not necessarily student learning) and in a vast majority of cases involves a warning from the head teacher. So one way to interpret this finding is that teachers who do not believe that ‘warnings from head teachers about disciplinary infractions’ are effective action against poor teacher performance are linked with higher student learning gains.

On what teachers know, once again training in teaching foundational concepts (Reading, Writing, Counting) is important. This makes sense because a key focus of 3R Training (the government program aimed at these concepts) is improvement in Kiswahili learning outcomes. On what teachers do, apart from offering extra help, teachers who provide lesson facilitation have higher student learning gains. 31 On what teachers believe, the relationships index, which measures teacher beliefs about their relationships with students, colleagues and the head teacher, is important. 32

31 When teachers clearly articulate lesson objectives, explain content clearly and relate classroom lessons with real life situations.

Analysis of the Post-CIF and Post-LASSO regressions shows that in a more parsimonious model, offering extra help to lagging students (teacher practice) and the Relationships Index (teacher beliefs) are statistically significant at the 1% level. Moving from the 25th to the 75th percentile in terms of teachers offering extra help to lagging students is linked to student learning gains of 0.19 SD. Similarly, going from the teacher at the 25th percentile to the 75th percentile on the Relationships Index is associated with an increase in student learning gains of 0.14 SD.

5. Conclusion

Research on teacher effectiveness has struggled to identify observable teacher characteristics that can help explain variation in student performance. In this study, we apply machine learning methods to this problem. Using matched student-teacher data for Grade 2 and 3 students from across 392 schools in Tanzania, we use two ML approaches, Conditional Inference Forest (CIF) and LASSO, to predict student learning gains. We find that ML approaches outperform the standard OLS model by 14-24 percent in out-of-sample predictions. Further, even though CIF and LASSO take different model-building approaches, they produce largely consistent results. As expected, student baseline scores are the most predictive of student learning gains, signaling mean reversion in the data.

Our key finding is that specific elements of what teachers know (teacher preparation), what teachers do (teacher practice), and what teachers believe (teacher beliefs) are more strongly predictive of student learning gains than other teacher, student, and school factors, especially in Math. For Math, CIF results show that the teacher practice of providing written feedback on homework/tests and reviewing key concepts at the end of class (measured through student surveys); the teacher belief that they can help disadvantaged and struggling students learn; and teacher preparation around teaching foundational concepts are the most important predictors of student learning gains. However, one counter-intuitive result that merits further investigation is that teachers who score high on fostering critical thinking in classroom observations (for instance by asking more open-ended questions) have lower student learning gains.
For Kiswahili, even though teacher (and other observable) covariates matter less for predicting student learning gains, teacher preparation on teaching foundational skills, teacher practice of providing additional support to struggling students, and teacher belief that they have good relationships within school still emerge as important. Consistent with existing literature, commonly used teacher characteristics such as education, experience, and assessment scores do not emerge as important predictors of student learning gains. Outside of teacher variables, rural schools (for both Math and Kiswahili) and older students (for Kiswahili) have stronger student learning gains; but these factors are still less important than the top teacher factors.

32 Examples of questions include (= 1 if 'Agree'; = 0 if 'Disagree'): (i) I have a good relationship with my students, (ii) I have a good relationship with my colleagues, (iii) I have a good relationship with my Head Teacher.

Our findings show how machine learning can be a powerful tool for addressing some of the hitherto unanswered questions around teacher effectiveness. They may also contribute to the growing interest in understanding and systematically measuring teacher beliefs and behaviors. No single study can provide definitive guidance, but our results suggest that teacher training programs need to focus more directly on preparing teachers to teach foundational skills, and on fostering in them the practice of providing written feedback to students, reviewing key concepts at the end of class, and spending extra time with struggling students. These elements should also be emphasized in teacher supervision and management.

Our findings also demonstrate the importance of systematically measuring and targeting specific aspects of teacher beliefs. Research from education and economics has long shown that teacher beliefs can impact student outcomes directly (Jussim and Harber 2005, Bertrand and Duflo 2017, Sabarwal et.
al 2021). However, despite their importance and measurability, there is very little systematic data or discussion on teacher beliefs in the rich literature on education impact evaluations (Sabarwal et al. 2021). This paper demonstrates that teacher beliefs need to be a part of the discussion on improving teacher effectiveness. Specifically, for effective teaching, it is crucial for teachers to believe that students can in fact learn. Emerging insights from behavioral economics and social psychology demonstrate that these beliefs can be systematically fostered in teachers (and also students themselves). Incorporating these ideas in the design and implementation of teacher programs may help improve teacher effectiveness. However, more research is needed, on both measurement and application, before clear pathways to doing this can be established. Specifically, it is important to understand: what do teachers believe about whether or not disadvantaged students can learn and how best to help them? How malleable are these beliefs? Can they be realistically reshaped to make a big difference for learning outcomes? At which point (pre-service, in-service) would it be best to intervene, if it does make sense to do so?

References

Aaronson, D., Barrow, L., & Sander, W. (2007). Teachers and student achievement in the Chicago public high schools. Journal of Labor Economics, 25(1), 95-135.
Adelman, M., Haimovich, F., Ham, A., & Vazquez, E. (2018). Predicting school dropout with administrative data: new evidence from Guatemala and Honduras. Education Economics, 26(4), 356-372.
Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27), 7353-7360.
Athey, S., & Imbens, G. W. (2019). Machine learning methods that economists should know about. Annual Review of Economics, 11, 685-725.
Aulck, L., Velagapudi, N., Blumenstock, J., & West, J. (2016). Predicting student dropout in higher education. arXiv preprint arXiv:1606.06364.
Azam, M., & Kingdon, G. G. (2015). Assessing teacher quality in India. Journal of Development Economics, 117, 74-83.
Bacher-Hicks, A., Kane, T. J., & Staiger, D. O. (2014). Validating teacher effect estimates using changes in teacher assignments in Los Angeles (No. w20657). National Bureau of Economic Research.
Bacher-Hicks, A., Chin, M. J., Kane, T. J., & Staiger, D. O. (2017). An evaluation of bias in three measures of teacher quality: Value-added, classroom observations, and student surveys (No. w23478). National Bureau of Economic Research.
Bau, N., & Das, J. (2020). Teacher value added in a low-income country. American Economic Journal: Economic Policy, 12(1), 62-96.
Belloni, A., & Chernozhukov, V. (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli, 19(2), 521-547.
Bertrand, M., & Duflo, E. (2017). Field Experiments on Discrimination. In Handbook of Field Experiments, edited by Banerjee, A. and Duflo, E. Amsterdam, Netherlands: Vol. 1. Elsevier, 309–93.
Biau, G., & Scornet, E. (2016). A random forest guided tour. Test, 25(2), 197-227.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Brunori, P., Hufe, P., & Mahler, D. G. (2018). The roots of inequality: Estimating inequality of opportunity from regression trees. World Bank Policy Research Working Paper, (8349).
Buhl-Wiggers, J., Kerwin, J., Smith, J., & Thornton, R. (2017, April). The impact of teacher effectiveness on student learning in Africa. In Centre for the Study of African Economies Conference.
Chalfin, A., Danieli, O., Hillis, A., Jelveh, Z., Luca, M., Ludwig, J., & Mullainathan, S. (2016). Productivity and selection of human capital with machine learning. American Economic Review, 106(5), 124-27.
Chetty, R., Friedman, J. N., & Rockoff, J. E. (2014a). Measuring the impacts of teachers I: Evaluating bias in teacher value-added estimates. American Economic Review, 104(9), 2593-2632.
Chetty, R., Friedman, J. N., & Rockoff, J. E. (2014b). Measuring the impacts of teachers II: Teacher value-added and student outcomes in adulthood. American Economic Review, 104(9), 2633-79.
Cruz-Aguayo, Y., Ibarrarán, P., & Schady, N. (2017). Do tests applied to teachers predict their effectiveness? Economics Letters, 159, 108-111.
Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., ... & Lautenbach, S. (2013). Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1), 27-46.
Dweck, C. S. (2006). Mindset. New York: Random House.
Filmer, D., Molina, E., & Wane, W. (2020). Identifying Effective Teachers: Lessons from Four Classroom Observation Tools. The World Bank.
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1, No. 10). New York: Springer series in statistics.
Glazerman, S., & Protik, A. (2015). Validating value-added measures of teacher performance. Association for Public Policy Analysis & Management, November, Miami.
Gordon, R. J., Kane, T. J., & Staiger, D. (2006). Identifying effective teachers using performance on the job (pp. 2006-01). Washington, DC: Brookings Institution.
Hanushek, E. A., & Rivkin, S. G. (2010). Generalizations about using value-added measures of teacher quality. American Economic Review, 100(2), 267-71.
Heslin, P. A., Latham, G. P., & VandeWalle, D. (2005). The effect of implicit person theory on performance appraisals. Journal of Applied Psychology, 90(5), 842.
Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651-674.
Hothorn, T., & Zeileis, A. (2015). partykit: A modular toolkit for recursive partytioning in R. The Journal of Machine Learning Research, 16(1), 3905-3909.
Jacob, B. A., & Lefgren, L. (2005). Principals as agents: Subjective performance measurement in education (No. w11463). National Bureau of Economic Research.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112, p. 18). New York: Springer.
Jussim, L., & Harber, K. D. (2005). Teacher Expectations and Self-Fulfilling Prophecies: Knowns and Unknowns, Resolved and Unresolved Controversies. Personality and Social Psychology Review, 9(2), 131–55.
Kane, T. J., Rockoff, J. E., & Staiger, D. O. (2008). What does certification tell us about teacher effectiveness? Evidence from New York City. Economics of Education Review, 27(6), 615-631.
Kleinberg, J., Ludwig, J., Mullainathan, S., & Obermeyer, Z. (2015). Prediction policy problems. American Economic Review, 105(5), 491-95.
Koedel, C., Mihaly, K., & Rockoff, J. E. (2015). Value-added modeling: A review. Economics of Education Review, 47, 180-195.
Loh, W. Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 14-23.
Mbiti, I., Romero, M., & Schipper, Y. (2019). Designing effective teacher performance pay programs: Experimental evidence from Tanzania (No. w25903). National Bureau of Economic Research.
McCaffrey, D. F., Lockwood, J. R., Koretz, D., Louis, T. A., & Hamilton, L. (2004). Models for value-added modeling of teacher effects. Journal of Educational and Behavioral Statistics, 29(1), 67-101.
Molina, E., Fatima, S. F., Ho, A., Hurtado, C. M., Wilichowksi, T., & Pushparatnam, A. (2018). Measuring Teaching Practices at Scale: Results from the Development and Validation of the Teach Classroom Observation Tool.
Mullainathan, S., & Spiess, J. (2017). Machine learning: an applied econometric approach. Journal of Economic Perspectives, 31(2), 87-106.
Nye, B., Konstantopoulos, S., & Hedges, L. V. (2004). How large are teacher effects? Educational Evaluation and Policy Analysis, 26(3), 237-257.
Paunesku, D., Walton, G. M., Romero, C., Smith, E. N., Yeager, D. S., & Dweck, C. S. (2015). Mind-set interventions are a scalable treatment for academic underachievement. Psychological Science, 26(6), 784-793.
Rivkin, S. G., Hanushek, E. A., & Kain, J. F. (2005). Teachers, schools, and academic achievement. Econometrica, 73(2), 417-458.
Rockoff, J. E. (2004). The impact of individual teachers on student achievement: Evidence from panel data. American Economic Review, 94(2), 247-252.
Rothstein, J. (2014). Revisiting the impacts of teachers. UC-Berkeley Working Paper.
Sabarwal, S., & Abu-Jawdeh, M. (2018). What teachers believe: mental models about accountability, absenteeism, and student learning. World Bank Policy Research Working Paper, (8454).
Sansone, D. (2019). Beyond early warning indicators: high school dropout and machine learning. Oxford Bulletin of Economics and Statistics, 81(2), 456-485.
Schiltz, F., Masci, C., Agasisti, T., & Horn, D. (2018). Using regression tree ensembles to model interaction effects: a graphical approach. Applied Economics, 50(58), 6341-6354.
Staiger, D. O., & Rockoff, J. E. (2010). Searching for effective teachers with imperfect information. Journal of Economic Perspectives, 24(3), 97-118.
Strobl, C., Hothorn, T., & Zeileis, A. (2009). Party on!.
Strobl, C., Boulesteix, A. L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1), 1-21.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.
Varian, H. R. (2014). Big data: New tricks for econometrics. Journal of Economic Perspectives, 28(2), 3-28.
Yeager, D. S., & Dweck, C. S. (2012). Mindsets that promote resilience: When students believe that personal characteristics can be developed. Educational Psychologist, 47(4), 302-314.
Zhao, P., & Yu, B. (2006). On model selection consistency of Lasso. The Journal of Machine Learning Research, 7, 2541-2563.
Figure 1: Distribution of Student Learning Gains by Subject

Figure 2: Distribution of Teacher Assessment Scores

Table 1: Sample characteristics with and without attrition

| Variable | Full Sample (N=748) | Sample After Attrition (N=436) | p-value |
|---|---|---|---|
| Teacher gender (= 1 if Male) | 0.465 | 0.440 | 0.407 |
| What is the highest level of education that you have completed? (= 1 if Diploma or Higher) | 0.282 | 0.239 | 0.097* |
| Did you specialize in Kiswahili in your teacher training (= 1 if Yes) | 0.545 | 0.550 | 0.868 |
| Did you specialize in Math in your teacher training (= 1 if Yes) | 0.443 | 0.463 | 0.489 |
| Have you ever received training in 3Rs (= 1 if Yes) | 0.652 | 0.665 | 0.656 |
| What is your current gross total compensation per month? (Tanzanian Shillings) | 664469.086 | 644741.355 | 0.282 |
| Students deserve more of my attention if they: They are lagging behind in classwork (= 1 if Yes) | 0.243 | 0.227 | 0.524 |
| Teacher's Age | 36.916 | 36.606 | 0.609 |
| Teacher's Experience | 12.496 | 12.163 | 0.598 |
| Teacher's Experience in Current School | 6.921 | 6.950 | 0.941 |
| Teacher Score in Math Assessment (%) | 74.552 | 73.242 | 0.244 |
| Teacher Score in Kiswahili Assessment (%) | 70.306 | 70.618 | 0.709 |

Note: This table presents selected teacher attributes to illustrate balance after attrition. The p-values are from t-tests of the null hypothesis that there is no difference in means between the original and current samples. Standard errors are clustered at the school level for this test.

Table 2a: School Level Characteristics (N=306)

| Variable | Mean | Standard Deviation |
|---|---|---|
| School's Location (= 1 if Urban/Semi-Urban; = 0 if Rural) | 0.209 | 0.407 |
| Pupil Teacher Ratio | 62.586 | 23.427 |
| Amount of Money that is entitled to school as Annual Entitlement Grant? (in Tanzanian Shillings) | 7120.916 | 2989.730 |
| Does the School have a School Management Team? (= 1 if Yes, = 0 if No) | 0.778 | 0.416 |
| Does the School have a Whole School Development Plan? | 0.869 | 0.338 |

Table 2b: Student Level Characteristics

| Variable | Math (N=346): Mean | Math (N=346): Standard Deviation | Kiswahili (N=336): Mean | Kiswahili (N=336): Standard Deviation |
|---|---|---|---|---|
| Student Age | 8.820 | 0.907 | 8.836 | 0.893 |
| Percentage of students who attend private tuitions in respective subject | 4.741 | 11.260 | 3.712 | 10.375 |
| Percentage of students who belong to lowest asset quartile? | 30.838 | 24.437 | 29.368 | 24.267 |
| Baseline Test Score (%) in respective subject | 36.926 | 12.111 | 45.088 | 25.981 |

Table 2c: Teacher Level Characteristics

Variable categories, in order of appearance: Who Teachers Are? (Teacher Characteristics); What Teachers Know? (Teacher Knowledge); What Teachers Do? (Teach Classroom Observation Tool); What Teachers Do? (Percentage of students taught by a given teacher who said Yes when asked the following); What Teachers Believe?; Teacher Management/School Level Governance.

| Variable | Math (N=346): Mean | Math (N=346): Standard Deviation | Kiswahili (N=336): Mean | Kiswahili (N=336): Standard Deviation |
|---|---|---|---|---|
| Teacher gender (= 1 if Male; = 0 if Female) | 0.436 | 0.497 | 0.408 | 0.492 |
| Teacher’s position in the school (= 1 if Head Teacher or Deputy Head Teacher; = 0 if Academic Teacher) | 0.156 | 0.361 | 0.155 | 0.360 |
| Highest Level of Education attained (= 1 if Diploma or Higher, = 0 otherwise) | 0.220 | 0.415 | 0.244 | 0.430 |
| Teacher's Experience (Number of years) | 12.777 | 10.796 | 12.652 | 10.870 |
| Specialization subject during teacher training? (= 1 if Math/Kiswahili; = 0 for any other subject) | 0.480 | 0.500 | 0.562 | 0.497 |
| Has the teacher received training in 3R? (Reading, Writing, and Counting) (= 1 if Yes; = 0 if No) | 0.705 | 0.457 | 0.702 | 0.458 |
| Is the Teacher provided with information on students’ ability at the beginning of the year? (= 1 if Yes; = 0 if No) | 0.704 | 0.447 | 0.683 | 0.457 |
| Does the Teacher have a record of the pupils' continuous assessments? (= 1 if Yes; = 0 if No) | 0.327 | 0.470 | 0.336 | 0.473 |
| Has the Teacher assessed student's curriculum skills using written assessments in the last 5 school days? (= 1 if Yes; = 0 if No) | 0.760 | 0.428 | 0.747 | 0.435 |
| Did the Teacher attend any form of training in the last one year? (= 1 if Yes; = 0 if No) | 0.483 | 0.500 | 0.509 | 0.501 |
| Teacher Assessment Score in Math/Kiswahili (%) | 71.465 | 20.369 | 69.848 | 14.059 |
| Share of Time teacher spent teaching with medium/high student engagement? (%) | 70.809 | 25.453 | 72.123 | 24.771 |
| Supportive Learning Environment (Range of Values: 1-5) | 3.425 | 0.591 | 3.414 | 0.557 |
| Positive Behavioral Expectations (Range of Values: 1-5) | 3.134 | 0.694 | 3.128 | 0.677 |
| Lesson Facilitation (Range of Values: 1-5) | 2.935 | 0.697 | 2.961 | 0.686 |
| Checks for Understanding (Range of Values: 1-5) | 3.116 | 0.844 | 3.027 | 0.822 |
| Feedback (Range of Values: 1-5) | 2.194 | 0.890 | 2.149 | 0.850 |
| Critical Thinking (Range of Values: 1-5) | 2.246 | 0.718 | 2.275 | 0.734 |
| Autonomy (Range of Values: 1-5) | 2.438 | 0.644 | 2.443 | 0.657 |
| Perseverance (Range of Values: 1-5) | 2.048 | 0.458 | 2.049 | 0.467 |
| Social & Collaborative Skills (Range of Values: 1-5) | 1.526 | 0.738 | 1.519 | 0.713 |
| If you don’t understand something, your teacher explains it another way | 88.996 | 17.340 | 89.384 | 15.772 |
| You are afraid of your teacher | 69.578 | 26.428 | 69.243 | 27.216 |
| At the end of each class, your teacher takes the time to review/discuss | 78.734 | 23.687 | 77.954 | 24.131 |
| When the teacher corrects my work, she writes on my papers to help me | 61.977 | 30.478 | 62.565 | 30.933 |
| Your teacher offers extra help to students who find the subject difficult | 79.851 | 22.886 | 79.887 | 23.135 |
| Students deserve more of my attention if they: They are lagging behind in classwork/homework (= 1 if Yes; = 0 if No) | 0.223 | 0.417 | 0.214 | 0.411 |
| If students aren’t disciplined at home, they aren’t likely to accept any discipline at school (= 1 if Yes; = 0 if No) | 0.486 | 0.501 | 0.491 | 0.506 |
| It is okay to be absent as long as I: complete the curriculum OR leave students with work OR doing something useful for the community (= 1 if Yes; = 0 if No) | 0.682 | 0.466 | 0.679 | 0.468 |
| Are any actions taken in case of poor teacher performance? (= 1 if Yes; = 0 if No) | 0.821 | 0.384 | 0.857 | 0.350 |
| 1st PC score of 4 Positive Attitude oriented variables | 0.000 | 1.176 | 0.000 | 1.187 |
| 1st PC score of 5 Incentive oriented variables | 0.000 | 1.248 | -0.000 | 1.244 |
| 1st PC score of 6 Self-Efficacy oriented variables | 0.000 | 1.366 | 0.000 | 1.370 |
| 1st PC score of 3 Relationship oriented variables | -0.000 | 1.055 | -0.000 | 1.038 |
| 1st PC score of 7 Reinforcement tendency variables | -0.000 | 1.528 | 0.000 | 1.545 |
| 1st PC score of 4 Locus of Control variables | -0.000 | 1.066 | -0.000 | 1.081 |
| Who is the most important person to assess progress in your professional targets? (= 1 if Teacher says myself; = 0 if Teacher names another person such as Head Teacher) | 0.286 | 0.453 | 0.277 | 0.448 |
| How often does someone from the school leadership observe your classroom? (= 1 if 'Once per term'; = 0 if lesser) | 0.488 | 0.501 | 0.485 | 0.501 |
| Does your school regularly recognize or reward teacher performance? (= 1 if Yes; = 0 if No) | 0.361 | 0.478 | 0.357 | 0.477 |
| What is the one key result HT would assess when rating your job performance? (= 1 if Exam Results/Learning Progress; = 0 if Other criteria) | 0.251 | 0.434 | 0.256 | 0.437 |
| They will recommend me to be transferred or dismissed in case I receive too many bad performance evaluations? (= 1 if 'Agree'; = 0 if 'Disagree') | 0.500 | 0.501 | 0.509 | 0.501 |
| Are you satisfied by the support you get from the school administration? (= 1 if Teacher says ‘Satisfied’; = 0 if Teacher says ‘Not Satisfied’) | 0.171 | 0.377 | 0.176 | 0.381 |
| Are you satisfied by the support you get from the Government? (= 1 if Teacher says ‘Satisfied’; = 0 if Teacher says ‘Not Satisfied’) | 0.231 | 0.422 | 0.223 | 0.417 |

Table 3a: Relative Performance of CIF and LASSO vis-à-vis OLS

| Relative Mean Squared Error (Test Set) | Math | Kiswahili |
|---|---|---|
| MSE_OLS/MSE_CIF | 1.21 [0.99, 1.53] | 1.14 [0.85, 1.61] |
| MSE_OLS/MSE_LASSO | 1.23 [0.96, 1.63] | 1.22 [0.97, 1.46] |

Note: (i) Figures in brackets show 95% C.I. (ii) Relative MSE is defined as MSE_OLS/MSE_ML Model.

Table 3b: Absolute Performance of CIF and LASSO vis-à-vis OLS

| Machine Learning Algorithm | Absolute Mean Squared Error (Test Set): Math | Absolute Mean Squared Error (Test Set): Kiswahili |
|---|---|---|
| Conditional Inference Forest | 85.30 [54.74, 113.14] | 175.82 [103.78, 292.90] |
| LASSO | 83.38 [55.25, 115.43] | 161.07 [105.28, 224.07] |
| OLS | 102.38 [70.14, 140.05] | 194.69 [121.98, 283.49] |

Note: Figures in brackets show 95% C.I.

Figure 3: Relative MSE for CIF and LASSO vis-à-vis OLS (95% C.I.)

Table 4a: Variable Importance for Math (N=346)

| Variable | Conditional Inference Forest | LASSO |
|---|---|---|
| 1) Baseline Math (%) Score | ✓ | ✓ |
| 2) When the math teacher corrects my work, he/she writes on my papers to help me understand (Percentage of students for a given teacher who said Yes when asked this question) | ✓ | ✓ |
| 3) Locus of Control Index (Teacher Mental Models) (Teachers who believe they can help disadvantaged / struggling students learn) | ✓ | ✓ |
| 4) Critical Thinking (Classroom Observation) (Teacher rated higher if she asks more open ended questions or provides thinking tasks to students) | ✓ | ✓ |
| 5) Have you ever received training in 3Rs? (Teachers who received training in the 3R (Reading, Writing, Counting) Program) | ✓ | ✓ |
| 6) Teacher Incentive Index (Teacher Mental Models) (Teachers who strongly believe that their career progression and salary is linked to their students’ test-score performance) | ✓ | ✗ |
| 7) School in Urban Area | ✓ | ✓ |
| 8) At the end of each class, your Math teacher takes the time to review and discuss concepts (Percentage of students for a given teacher who said Yes when asked this question) | ✓ | ✓ |
| 9) Who is the most important person to assess your progress towards your professional targets? (= 1 if Teacher says myself; = 0 if Teacher names another person such as Head Teacher) | ✓ | ✗ |

Note: (i) Variables in bold are selected as important by both CIF & LASSO (ii) Variables that are important for CIF are those that occur 14 or more times in 20 runs of the model, while important variables for LASSO are those that occur more than 140 times in 200 runs of the model.

Table 4b: Detailed Variable Importance for Math – CIF and LASSO (N=346)

| Variable | Conditional Inference Forest | LASSO |
|---|---|---|
| 1) Baseline Math (%) Score | 0.99 (20) | 197 (-) |
| 2) When the math teacher corrects my work, he/she writes on my papers to help me understand (Percentage of students for a given teacher who said Yes when asked this question) | 0.53 (19) | 162 (+) |
| 3) Locus of Control Index (Teacher Mental Models) (Teachers who believe they can help disadvantaged / struggling students learn) | 0.45 (20) | 191 (+) |
| 4) Critical Thinking (Classroom Observation) (Teacher rated higher if she asks more open ended questions or provides thinking tasks to students) | 0.37 (20) | 196 (-) |
| 5) Have you ever received training in 3Rs (Teachers who received training in the 3R (Reading, Writing, Counting) Program) | 0.25 (18) | 184 (+) |
| 6) Teacher Incentive Index (Teacher Mental Models) (Teachers who strongly believe that their career progression and salary be linked to their students’ test-performance score higher) | 0.23 (19) | |
| 7) School in Urban Area | 0.20 (18) | 147 (-) |
| 8) At the end of each class, your Math teacher takes the time to review and discuss concepts (Percentage of students for a given teacher who said Yes when asked this question) | 0.20 (16) | 161 (+) |
| 9) Who is the most important person to assess your progress towards your professional targets? (= 1 if Teacher says myself; = 0 if Teacher names another person such as Head Teacher) | 0.14 (17) | |

Note: (i) Variables in bold are selected as important by both CIF & LASSO (ii) Numbers for CIF show relative variable importance (variables ranked in terms of loss in predictive power if a given variable is permuted) and figures in parentheses show the number of times the variable showed up in 20 runs of the model (iii) Figures for LASSO show the number of times the variable appeared in 200 runs of the model along with the coefficient sign in parentheses

Table 4c: OLS Regressions of Student Learning Gains for Math on variables selected by CIF & LASSO (Post-CIF & Post-LASSO) (N=346)

| Variable | Variables selected by CIF (A) | Variables selected by CIF and LASSO (B) | Variables selected by either CIF or LASSO (A∪B) |
|---|---|---|---|
| 1) Baseline Math Score | -0.194*** (0.0564) | -0.186*** (0.0564) | -0.194*** (0.0564) |
| 2) When the math teacher corrects my work, he/she writes on my papers to help me understand | 0.0663 (0.0542) | 0.0704 (0.0549) | 0.0663 (0.0542) |
| 3) Locus of Control Index (Teacher Mental Models) | 0.151*** (0.0527) | 0.146*** (0.0536) | 0.151*** (0.0527) |
| 4) Critical Thinking (Classroom Observation) | -0.125** (0.0523) | -0.127** (0.0527) | -0.125** (0.0523) |
| 5) Have you ever received training in 3Rs | 0.128*** (0.0494) | 0.120** (0.0497) | 0.128*** (0.0494) |
| 6) Teacher Incentive Index (Teacher Mental Models) | 0.100** (0.0499) | | 0.100** (0.0499) |
| 7) School in Urban Area | -0.0879* (0.0530) | -0.0872 (0.0540) | -0.0879* (0.0530) |
| 8) At the end of each class, your Math teacher takes the time to review and discuss concepts | 0.126** (0.0562) | 0.123** (0.0567) | 0.126** (0.0562) |
| 9) Who is the most important person to assess your progress towards your professional targets? | 0.0895* (0.0494) | | 0.0895* (0.0494) |
| R-squared | 0.144 | 0.127 | 0.144 |

Note: We report robust standard errors clustered at the School level in parentheses | Variables in Bold have been selected as important by both CIF & LASSO | Variables have been standardized with Mean = 0 and Std. Dev. = 1 | ***, **, and * indicate significance at the 1, 5, and 10 percent levels, respectively

Table 5a: Variable Importance for Kiswahili (N=336)

| Variable | Conditional Inference Forest | LASSO |
|---|---|---|
| 1) Baseline Kiswahili Score | ✓ | ✓ |
| 2) Are any actions taken in case of poor teacher performance? (= 1 if Teacher says Yes; = 0 if Teacher says No) | ✓ | ✓ |
| 3) Student Age | ✓ | ✗ |
| 4) Have you received training in 3Rs? (Teachers who received training in the 3R (Reading, Writing, Counting) Program) | ✓ | ✗ |
| 5) Your Kiswahili teacher offers extra help to students who find the subject difficult (Percentage of students for a given teacher who said Yes when asked this question) | ✓ | ✓ |
| 6) Are you satisfied by the support you get from the school administration? (= 1 if Teacher says ‘Satisfied’; = 0 if Teacher says ‘Not Satisfied’) | ✓ | ✗ |
| 7) It is okay to be absent as long as I: complete the curriculum OR leave students with work OR doing something useful for the community (= 1 if Teacher answers ‘Yes’; = 0 if Teacher answers ‘No’) | ✓ | ✗ |
| 8) Lesson Facilitation (Classroom Observation) (Teacher rated higher if lesson objectives are clearly articulated, explanation of content is clear, teacher connects lessons to real life) | ✓ | ✗ |
| 9) Pupil Teacher Ratio | ✓ | ✗ |
| 10) Relationships Index (Teachers who strongly believe that they have a good relationship with their students, colleagues & head teacher score higher) | ✓ | ✓ |
| 11) Are you satisfied by the support you get from the Government? (= 1 if Teacher says ‘Satisfied’; = 0 if Teacher says ‘Not Satisfied’) | ✓ | ✗ |

Note: (i) Variables in bold are selected as important by both CIF & LASSO (ii) Variables that are important for CIF are those that occur 14 or more times in 20 runs of the model, while important variables for LASSO are those that occur more than 140 times in 200 runs of the model

Table 5b: Detailed Variable Importance for Kiswahili – CIF and LASSO

| Variable | Conditional Inference Forest | LASSO |
|---|---|---|
| 1) Baseline Kiswahili Score | 1 (20) | 200 (-) |
| 2) Are any actions taken in case of poor teacher performance? (= 1 if Teacher says Yes; = 0 if Teacher says No) | 0.19 (20) | 178 (-) |
| 3) Student Age | 0.15 (20) | |
| 4) Have you received training in 3Rs? (Teachers who received training in the 3R (Reading, Writing, Counting) Program) | 0.06 (18) | |
| 5) Your Kiswahili teacher offers extra help to students who find the subject difficult (Percentage of students for a given teacher who said Yes when asked this question) | 0.05 (20) | 188 (+) |
| 6) Are you satisfied by the support you get from the school administration? (= 1 if Teacher says ‘Satisfied’; = 0 if Teacher says ‘Not Satisfied’) | 0.05 (18) | |
| 7) It is okay to be absent as long as I: complete the curriculum OR leave students with work OR doing something useful for the community (= 1 if Teacher answers ‘Yes’; = 0 if Teacher answers ‘No’) | 0.04 (16) | |
| 8) Lesson Facilitation (Classroom Observation) (Teacher rated higher if lesson objectives are clearly articulated, explanation of content is clear, teacher connects lessons to real life) | 0.04 (14) | |
| 9) Pupil Teacher Ratio | 0.02 (19) | |
| 10) Relationships Index (Teachers who strongly believe that they have a good relationship with their students, colleagues & head teacher score higher) | 0.02 (14) | 145 (+) |
| 11) Are you satisfied by the support you get from the Government? (= 1 if Teacher says ‘Satisfied’; = 0 if Teacher says ‘Not Satisfied’) | | 144 (+) |

Note: (i) Variables in bold are selected as important by both CIF & LASSO (ii) Numbers for CIF show relative variable importance (variables ranked in terms of loss in predictive power if that variable is permuted) and figures in parentheses show the number of times the variable showed up in 20 runs of the model (iii) Figures for LASSO show the number of times the variable appeared in 200 runs of the model along with the coefficient sign in parentheses

Table 5c: OLS Regressions of Student Learning Gains for Kiswahili on variables selected by CIF & LASSO (Post-CIF & Post-LASSO; N=336)

| Variable | Variables selected by CIF (A) | Variables selected by CIF and LASSO (B) | Variables selected by either CIF or LASSO (A∪B) |
|---|---|---|---|
| 1) Baseline Kiswahili Score | -0.491*** (0.0624) | -0.491*** (0.0541) | -0.489*** (0.0621) |
| 2) Student Age | -0.0303 (0.0450) | | -0.0283 (0.0447) |
| 3) Are any actions taken in case of poor teacher performance? | -0.115** (0.0539) | -0.120** (0.0518) | -0.107* (0.0545) |
| 4) Have you received training in 3Rs? | 0.0154 (0.0471) | | 0.0205 (0.0465) |
| 5) Your Kiswahili teacher offers extra help to students who find the subject difficult | 0.139*** (0.0471) | 0.138*** (0.0458) | 0.130*** (0.0475) |
| 6) Are you satisfied by the support you get from the school administration? (= 1 if Teacher says ‘Satisfied’; = 0 if Teacher says ‘Not Satisfied’) | 0.1875 (0.134) | | 0.151 (0.133) |
| 7) It is okay to be absent as long as I: complete the curriculum OR leave students with work OR doing something useful for the community | 0.0657 (0.0448) | | 0.0649 (0.0448) |
| 8) Lesson Facilitation (Classroom Observation) | 0.0727 (0.0454) | | 0.0746* (0.0446) |
| 9) Pupil Teacher Ratio | -0.0492 (0.0499) | | -0.0478 (0.0492) |
| 10) Relationships Index | 0.114*** (0.0381) | 0.104*** (0.0374) | 0.115*** (0.0382) |
| 11) Are you satisfied by the support you get from the Government? | 0.0805 (0.0514) | | 0.0813 (0.0511) |
| R-squared | 0.315 | 0.307 | 0.321 |

Note: We report robust standard errors clustered at the School level in parentheses | Variables in Bold have been selected as important by both CIF & LASSO | Variables have been standardized with Mean = 0 and Std. Dev. = 1 | ***, **, and * indicate significance at the 1, 5, and 10 percent levels, respectively

Annex 1: Variable Description

Variable Category (in order of appearance): Student Characteristics; School Characteristics; Who Teachers Are? (Teacher Characteristics); What Teachers Know? (Teacher Knowledge); What Teachers Do? (Teach Classroom Observation Tool); What Teachers Do? (Percentage of students taught by a given teacher who said Yes when asked the following); What Teachers Believe?; Teacher Management/School Level Governance.

Variable Description:

Average Student Age for a given Teacher
Percentage of students taught by a given teacher who attend private tuitions in Math/Kiswahili
Percentage of students taught by a given teacher who belong to lowest asset quartile? (based on quartile ranking of first Principal Component Score of dummy variables indicating ownership of several assets such as TV, Land, Electricity etc.)
Baseline Math/Kiswahili Test Score (%)
Pupil Teacher Ratio
Does the School have a School Management Team? (= 1 if Yes, = 0 if No)
Does the School have a Whole School Development Plan?
Amount of Money that is entitled to school as Annual Entitlement Grant? (in Tanzanian Shillings)
School's Location (= 1 if Urban/Semi-Urban; = 0 if Rural)
Teacher gender (= 1 if Male; = 0 if Female)
Teacher’s position in the school (= 1 if Head Teacher or Deputy Head Teacher; = 0 if Academic Teacher)
Highest Level of Education attained (= 1 if Diploma or Higher, = 0 otherwise)
Teacher's Experience (Number of years)
Specialization subject during teacher training? (= 1 if Math/Kiswahili; = 0 for any other subject)
Has the teacher received training in 3R? (Reading, Writing, and Counting) (= 1 if Yes; = 0 if No)
Teacher Assessment Score in Math/Kiswahili (%)
Did the Teacher attend any form of training in the last one year? (= 1 if Yes; = 0 if No)
Is the Teacher provided with information on students’ ability at the beginning of the year? (= 1 if Yes; = 0 if No)
Does the Teacher have a record of the pupils' continuous assessments? (= 1 if Yes; = 0 if No)
Has the Teacher assessed student's curriculum skills using written assessments? (= 1 if Yes; = 0 if No)
Supportive Learning Environment (Range of Values: 1-5)
Positive Behavioral Expectations (Range of Values: 1-5)
Lesson Facilitation (Range of Values: 1-5)
Checks for Understanding (Range of Values: 1-5)
Feedback (Range of Values: 1-5)
Critical Thinking (Range of Values: 1-5)
Autonomy (Range of Values: 1-5)
Perseverance (Range of Values: 1-5)
Social & Collaborative Skills (Range of Values: 1-5)
Share of Time teacher spent teaching with medium/high student engagement? (%)
If you don’t understand something, your Math/Kiswahili teacher explains it another way
You are afraid of your Math/Kiswahili teacher
At the end of each class, your Math/Kiswahili teacher takes the time to review/discuss
When the Math/Kiswahili teacher corrects my work, she writes on my papers to help me
Your Math/Kiswahili teacher offers extra help to students who find the subject difficult

Positive Attitude Index: 1st Principal Component score of four positive attitude variables (= 1 if Agree; = 0 if Disagree):
1) I am fully satisfied with my current job.
2) My students’ learning/achievement motivates me to carry on teaching.
3) My workload is manageable
4) If I could start over I would choose teaching as a career?

Teacher Incentive Index: 1st Principal Component score of 5 incentive oriented variables (= 1 if Agree; = 0 if Disagree):
1) If my students perform well on official external exams, I should receive an additional bonus
2) My promotion should partly be dependent on my student’s performance on tests.
3) The main factor used to assess my performance as a teacher should be my students
4) Would you prefer teacher bonuses or school level capitation grants? (= 1 if bonuses; = 0 for capitation grants)
5) Would you prefer a flat increase in salaries of all teachers or a bonus component for performance? (= 1 for bonus increase; = 0 for flat increase)

Self Efficacy Index: 1st Principal Component score of 6 self-efficacy oriented variables (= 1 if Agree; = 0 if Disagree):
1) I can successfully teach all relevant subject content to even the most difficult students
2) I can find creative ways to cope with difficulties such as budget cuts
3) I try new ways of teaching in class.
4) Through my teaching I can help students overcome their constraints/difficulties.
5) I can maintain a positive relationship with parents even when tensions arise.
6) I am convinced that I can help address my students’ needs.

Relationships Index: 1st Principal Component score of 3 Relationship oriented variables (= 1 if 'Agree'; = 0 if 'Disagree'):
1) I have a good relationship with my students.
2) I have a good relationship with my colleagues.
3) I have a good relationship with my Head Teacher.

Reinforcement Bias Index: 1st Principal Component score of 7 Reinforcement biased tendency variables. Students deserve more of my attention if they (= 1 if Yes; = 0 if No):
1) Are motivated to learn
2) Attend school regularly
3) Come to school with the necessary material
4) Have the necessary concepts and foundations from previous classes
5) Their parents are involved in the education of their child
6) Their parents are willing to invest the necessary financial resources in their child’s education
7) They are performing well in my class

Locus of Control Index: 1st Principal Component score of 4 Locus of Control oriented variables. There is little I can do to help a student’s learning if (= 1 if No; = 0 if Yes):
1) Students come unprepared from previous grades
2) Parents do not seek feedback from the teacher on student performance
3) Parents do not have the necessary education to help their child be more successful at school
4) If parents would do more for their children, I could do more (1 = 'Disagree'; 0 = 'Agree')

It is okay to be absent as long as I: complete the curriculum OR leave students with work OR doing something useful for the community (= 1 if Yes; = 0 if No)
Students deserve more of my attention if they: They are lagging behind in classwork/homework (= 1 if Yes; = 0 if No)
Are any actions taken in case of poor teacher performance? (= 1 if Yes; = 0 if No)
If students aren’t disciplined at home, they aren’t likely to accept any discipline at school (= 1 if Yes; = 0 if No)
Who is the most important person to assess progress in your professional targets? (= 1 if Teacher says myself; = 0 if Teacher names another person such as Head Teacher)
How often does someone from the school leadership observe your classroom? (= 1 if 'Once per term'; = 0 if lesser)
Does your school regularly recognize or reward teacher performance? (= 1 if Yes; = 0 if No)
What is the one key result HT would assess when rating your job performance? (= 1 if Exam Results/Learning Progress; = 0 if Other criteria)
They will recommend me to be transferred or dismissed in case I receive too many bad performance evaluations? (= 1 if 'Agree'; = 0 if 'Disagree')
Are you satisfied by the support you get from the school administration? (= 1 if Teacher says ‘Satisfied’; = 0 if Teacher says ‘Not Satisfied’)
Are you satisfied by the support you get from the Government? (= 1 if Teacher says ‘Satisfied’; = 0 if Teacher says ‘Not Satisfied’)

Annex 2: Details and Notes on Methodology

A.2.1 Assessing Model Performance

To assess model performance, we follow the common machine learning practice of splitting the data set into a training set and a test set. We then calculate the Mean Squared Error on the test set. The detailed procedure is as follows:

1. Use the sample to create two non-overlapping, randomly sampled data sets: a training set with 80% of the observations, where $i_{-T} \in \{1, \ldots, N_{-T}\}$ and $N_{-T} = \tfrac{4}{5}N$, and a test set formed from the remaining 20%, with $i_T \in \{1, \ldots, N_T\}$ and $N_T = \tfrac{1}{5}N$.
2. Run the three models (CIF, LASSO and OLS) on the training set. This yields a prediction function $\hat{f}(X_{-T})$ that characterizes how the explanatory variables are associated with the outcome variable.
3. Pass the test set values of the explanatory variables into the prediction function created in step 2 to yield predictions of the outcome variable, $\hat{y}_T = \hat{f}(X_T)$.
4. Calculate the Mean Squared Error of the test sample:

$$MSE_T = \frac{1}{N_T} \sum_{i \in T} \left[ y_i - \hat{y}_i \right]^2$$

A.2.2 Conditional Inference Trees & Forests

Decision Trees partition the sample into M mutually exclusive groups through recursive binary splitting.
Based on a splitting criterion, they continue to partition the sample in two until a pre-defined criterion is no longer fulfilled (such as an information gain or a minimum improvement in RSS) or until a pre-specified threshold is reached (such as a minimum number of observations required to create further splits). Once every observation is part of a group that is common in the expression of the covariate space $X = (X_1, X_2, \dots, X_k)$, for a given vector of the dependent variable $y = (y_1, y_2, \dots, y_n)$, we get the vector of predicted values $\hat{y} = (\hat{y}_1, \hat{y}_2, \dots, \hat{y}_n)$, where, for an observation $i$ belonging to group $G_m$ containing $N_m$ observations,

$$\hat{y}_i = \frac{1}{N_m} \sum_{j \in G_m} y_j$$

That is, each observation's prediction is the mean of the dependent variable within its group.

Conditional Inference Algorithm

1. Choose a significance level α*.
2. Test the null hypothesis of independence, $H_0^{X_k}: D(Y \mid X_k) = D(Y)$, for all $X_k \in X$ and obtain a p-value $p^{X_k}$ associated with each test.
   a. Adjust the p-values for multiple hypothesis testing using the Bonferroni correction.
3. Select the variable $X^*$ with the lowest p-value $p^*$.
   a. If $p^*$ > α*: stop the tree-making process.
   b. If $p^*$ < α*: continue by selecting $X^*$ as the splitting variable.
4. Test the null hypothesis of independence between the sub-samples for each possible binary splitting point $s$ among the values taken by $X^*$ and obtain a p-value $p^{X^*_s}$ corresponding to each splitting point.
   a. Split the sample based on $X^*$ by selecting $s^*$, the splitting point having the lowest p-value.
5. Repeat steps 2-4 for each of the resulting sub-samples until
   a. The dependent variable is independent of each explanatory variable in every sub-sample, OR
   b. The number of observations left to create further splits is lower than a pre-specified threshold, usually 5 or 10 (the minimum number of observations required to make a split). This ensures smoothness in predictions made by the tree, OR
   c. The tree reaches a pre-specified maximum depth (the longest path between the root and a leaf).

The final structure and depth of a tree depend on three hyper-parameters: the minimum number of observations required to create a split, the maximum depth of the tree, and the significance level α. The first two are related to each other: trees constrained to shallower depths will automatically have a larger number of observations in each split and, conversely, a large minimum number of observations required to create a split will automatically result in shallower trees.

The variant of the decision tree used in this study and highlighted above is the Conditional Inference Tree. Unlike standard decision trees, conditional inference trees select those variables and splits that are most related to the dependent variable in a statistically significant sense. Moreover, as pointed out by Hothorn et al. (2006), standard decision trees and random forests are biased towards selecting variables that offer more splitting points: in a data set consisting of continuous and dummy variables, they would tend to select more continuous variables during tree construction. Therefore, for the purposes of our study, we use Conditional Inference Trees and Forests, as they are likely to yield the variables that best predict student learning gains.

Just as with trees, we need to specify tuning parameters when constructing a conditional forest: (i) the minimum number of observations required to create a split (analogous to the maximum depth of a tree), (ii) the significance level α, and (iii) the number of trees that make up the forest. Optimal selection of tuning parameters ensures that the model does not overfit the data and performs well out-of-sample relative to in-sample performance.
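Returning to the variable-selection step of the Conditional Inference Algorithm (steps 1 to 3 above), the following is a minimal Python sketch, not the implementation used in the paper. It assumes a Monte Carlo permutation test with the absolute Pearson correlation as the test statistic, which is a simplification of the permutation-test framework behind conditional inference trees. The function names, the toy covariates `feedback` and `experience`, and the simulated `gains` are all hypothetical.

```python
import random
from statistics import mean


def abs_corr(x, y):
    """Absolute Pearson correlation, used here as the test statistic (an assumption)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else abs(sxy) / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_perm, rng):
    """Share of outcome permutations whose statistic is at least the observed one."""
    observed = abs_corr(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        hits += abs_corr(x, shuffled) >= observed
    return (hits + 1) / (n_perm + 1)


def select_split_variable(covariates, y, alpha, n_perm=500, seed=0):
    """Steps 2-3: test each covariate, Bonferroni-adjust, pick the lowest p-value.

    Returns (name, adjusted_p_values); name is None when the stopping rule applies.
    """
    rng = random.Random(seed)
    k = len(covariates)
    adjusted = {
        name: min(1.0, k * permutation_p_value(col, y, n_perm, rng))
        for name, col in covariates.items()
    }
    best = min(adjusted, key=adjusted.get)
    return (best if adjusted[best] < alpha else None), adjusted


# Simulated data: gains depend on `feedback` only (purely illustrative).
data_rng = random.Random(42)
feedback = [data_rng.random() for _ in range(200)]
experience = [data_rng.random() for _ in range(200)]
gains = [2.0 * f + data_rng.gauss(0, 0.5) for f in feedback]

chosen, p_values = select_split_variable(
    {"feedback": feedback, "experience": experience}, gains, alpha=0.05
)
print(chosen, p_values)
```

In a full tree, the same test would then be applied to candidate splitting points of the chosen variable (step 4) and the procedure repeated within each sub-sample (step 5).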
Since our goal is to build the best predictive model for student learning gains that performs well out of sample, we select the values of the tuning parameters that minimize the Mean Squared Error of the Out-of-Bag (OOB) sample (MSEOOB).

Observations in the training set that are not used in the construction of a given tree (due to sampling without replacement) are known as Out-of-Bag (OOB) samples. Predictions are then made for these OOB samples using only those trees that did not contain them. The mean squared errors of such predictions are then averaged at the forest level across every OOB sample to yield MSEOOB. We use MSEOOB for parameter tuning (as described above) and for variable importance through the permutation method (Strobl et al. 2007) (described in footnote 16).

A.2.3 Optimal tuning parameters and a discussion on Mean Squared Error

Selecting optimal tuning parameters for Conditional Inference Forests

In this section, we highlight our selection of the optimal tuning parameters for Conditional Inference Forests (α*, B*, s*) for Math & Kiswahili, where α represents the significance level, B is the number of trees used to build the Forest, and s is the minimum sample size required in order to create splits (minimum split criterion). In order to select (α*, B*, s*) in a data-driven manner, we follow these steps:

1. Create a grid of values for each tuning parameter:
   a. α = [0, 0.20, 0.40, 0.60, 0.80, 0.90, 0.95, 0.99]
   b. B = [500, 750, 1000, 1250, 1500]
   c. s = [5, 10, 20]
2. Run the Conditional Inference Forest on each combination of values of the 3 tuning parameters and calculate the Out-of-Bag Mean Squared Error (MSEOOB) for each such Forest.
3. Select the set of tuning parameters whose Conditional Inference Forest has the lowest MSEOOB.

From our analysis, we find that for Math the optimal tuning parameters are {α* = 0, B* = 1000, s* = 20} and for Kiswahili the optimal tuning parameters are {α* = 0, B* = 1250, s* = 10}.
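The grid search over tuning parameters using MSEOOB can be sketched as follows. This is an illustration under strong simplifying assumptions, not the paper's code: a small RSS-based regression tree on a single covariate stands in for a conditional inference tree, so the significance level α is omitted and only B (`n_trees`) and s (`min_split`) are tuned. The grid values, the in-bag fraction of 0.632, and the simulated data are hypothetical.

```python
import itertools
import random
from statistics import mean


def fit_tree(xs, ys, min_split):
    """Recursive one-covariate regression tree (stand-in for a conditional inference tree)."""
    node_mean = mean(ys)
    thresholds = sorted(set(xs))[:-1]
    if len(xs) < min_split or not thresholds:
        return lambda x: node_mean

    def rss(t):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        ml, mr = mean(left), mean(right)
        return sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)

    t = min(thresholds, key=rss)
    left = fit_tree([x for x in xs if x <= t],
                    [y for x, y in zip(xs, ys) if x <= t], min_split)
    right = fit_tree([x for x in xs if x > t],
                     [y for x, y in zip(xs, ys) if x > t], min_split)
    return lambda x: left(x) if x <= t else right(x)


def oob_mse(xs, ys, n_trees, min_split, frac=0.632, seed=0):
    """Out-of-bag MSE: each observation is predicted only by trees that did not use it."""
    rng = random.Random(seed)
    n = len(xs)
    sums, counts = [0.0] * n, [0] * n
    for _ in range(n_trees):
        bag = rng.sample(range(n), int(frac * n))  # subsampling without replacement
        in_bag = set(bag)
        tree = fit_tree([xs[i] for i in bag], [ys[i] for i in bag], min_split)
        for i in range(n):
            if i not in in_bag:
                sums[i] += tree(xs[i])
                counts[i] += 1
    return mean((ys[i] - sums[i] / counts[i]) ** 2 for i in range(n) if counts[i] > 0)


# Simulated data and a deliberately small grid (both illustrative).
rng = random.Random(7)
xs = [rng.random() for _ in range(80)]
ys = [3 * x + rng.gauss(0, 0.3) for x in xs]

grid = {"n_trees": [10, 30], "min_split": [5, 20]}
results = {
    (B, s): oob_mse(xs, ys, B, s)
    for B, s in itertools.product(grid["n_trees"], grid["min_split"])
}
best_B, best_s = min(results, key=results.get)
print(best_B, best_s, results[(best_B, best_s)])
```

The selection rule is the same as in step 3 above: the combination with the lowest out-of-bag error is retained.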
We therefore show relative outperformance (vis-à-vis OLS) and variable importance measures for Math & Kiswahili using the Conditional Inference Forests built with these tuning parameters.

Selecting optimal tuning parameters for LASSO

Given that the LASSO minimization problem includes a penalty term in the form of the L1 norm of the coefficient vector β, the tuning parameter λ plays a key role in variable selection. Hence, in order to select the optimal λ, we conduct k-fold cross validation (k = 10) using the entire data set (rather than splitting the data between a training and test set). Under some minimal assumptions, k-fold cross validation provides unbiased estimates of the out-of-sample MSE (Friedman et al. 2009). The procedure is laid out as follows:

1. We randomly split the set of observations into k groups, or folds, of approximately equal size.
2. The first fold is treated as a test set, and LASSO is implemented on the remaining k − 1 folds.
3. The mean squared error of the test set fold, MSE1, is then calculated.
4. This procedure is repeated k times such that each of the k folds created becomes a test set. This process results in k estimates of the MSE: MSE1, MSE2, . . . , MSEk. The k-fold cross validation estimate is computed by averaging MSE1, MSE2, . . . , MSEk.
5. We repeat steps 1-4 for each value of λ and select the λ which has the minimum Mean Squared Error estimate. We follow the approach of specifying potential λ values as given in Hastie et al. (2013) and consider 100 values in the range of [0.0001,105].

From our analysis, we find that for Math the optimal λ = 0.53 and for Kiswahili the optimal λ = 1.23. In figures A.1 and A.2 we plot the k-fold cross validation Mean Squared Error against the log(λ) values. The numbers at the top of each figure give the number of variables chosen by the LASSO model as being important.
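The cross-validation procedure for λ described above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: LASSO is fitted by a simple coordinate-descent routine with objective (1/2n)·RSS + λ·‖β‖₁, the same folds are reused for every λ, and the λ grid and simulated data are hypothetical (they are not the 100-value grid used in the paper).

```python
import random
from statistics import mean


def soft_threshold(z, lam):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def fit_lasso(X, y, lam, n_iter=100, tol=1e-8):
    """Coordinate descent for (1/2n) * RSS + lam * ||beta||_1, with an unpenalized intercept."""
    n, p = len(X), len(X[0])
    x_means = [mean(row[j] for row in X) for j in range(p)]
    y_mean = mean(y)
    cols = [[row[j] - x_means[j] for row in X] for j in range(p)]
    norms = [sum(c * c for c in col) / n for col in cols]
    beta = [0.0] * p
    resid = [v - y_mean for v in y]  # residuals of the centered model
    for _ in range(n_iter):
        max_change = 0.0
        for j, col in enumerate(cols):
            if norms[j] == 0:
                continue
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            new = soft_threshold(rho, lam) / norms[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(col, resid)]
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_means))
    return intercept, beta


def cv_mse(X, y, lam, k=10, seed=0):
    """k-fold cross-validation estimate of the out-of-sample MSE for one value of lam."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    fold_mses = []
    for f in range(k):
        fold = idx[f::k]
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        b0, beta = fit_lasso([X[i] for i in train], [y[i] for i in train], lam)
        fold_mses.append(mean(
            (y[i] - b0 - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in fold
        ))
    return mean(fold_mses)


# Simulated data: only the first two of four covariates matter (illustrative).
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(60)]
y = [2 * row[0] - row[1] + rng.gauss(0, 0.5) for row in X]

lambdas = [10 ** e for e in (-3, -2, -1, 0, 1)]
cv_results = {lam: cv_mse(X, y, lam) for lam in lambdas}
best_lambda = min(cv_results, key=cv_results.get)
n_selected = sum(b != 0.0 for b in fit_lasso(X, y, best_lambda)[1])
print(best_lambda, n_selected)
```

Here `n_selected` plays the role of the counts shown along the top of a cross-validation plot: the number of covariates with non-zero coefficients at a given λ.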
Figure A.1: Optimal Lambda Plot for Math

Figure A.2: Optimal Lambda Plot for Kiswahili

Discussion on Mean Squared Error

For any given prediction problem, we are interested in how well our model predicts data that it has never seen before. In order to evaluate how accurately a given Machine Learning model predicts the dependent variable out of sample, we need to account not just for the bias in its predictions (how close the predicted value is to the true value) but also for the variance in its predictions (the sensitivity of our predictions to changes in the underlying data), since any out-of-sample data is part of the true population. Writing the outcome as $y = f(x) + \varepsilon$, the Mean Squared Error can be written as:

$$E\left[ \left( \hat{f}(x) - y \right)^2 \right]$$

Consider the Mean Squared Error at a new data point $(x_0, y_0)$, which can be further decomposed into 3 terms:

$$E\left[ \left( \hat{f}(x_0) - y_0 \right)^2 \right] = E\left[ \left( \hat{f}(x_0) - E[\hat{f}(x_0)] \right)^2 \right] + \left( E[\hat{f}(x_0)] - f(x_0) \right)^2 + Var(\varepsilon)$$

The first term represents the variance, which tells us the amount by which the model's prediction would change if we estimated it on a different data set; the second term represents the squared bias; and the third term represents the irreducible error. Given that the first two terms are non-negative, the MSE can never lie below Var(ε).

In order to yield accurate predictions, a machine learning model should minimize the expected mean squared error of the test set. Therefore, it should have low variance and low bias in its predictions. In general, more flexible models will have lower bias but higher variance, whereas less flexible models will have higher bias but lower variance. This trade-off between bias and variance is recurrent in Machine Learning. OLS, being the best linear unbiased estimator, does not allow for any such trade-off, as it sets the second term to zero.

Annex 3: Modeling TVA (traditional approach)

Step 1: Regress follow-up scores on baseline scores along with student-level controls and obtain teacher Fixed Effects (TVA) (student-level regression). Robust standard errors are clustered at the school level.
Step 2: Run the teacher fixed effects (TVA) through the Machine Learning algorithms (CIF & LASSO) with teacher-level and school-level covariates.

Math (N=346)

| Variable | Selected by CIF or LASSO (current model) | Selected by CIF or LASSO (traditional TVA model) |
|---|---|---|
| 1) Baseline Math (%) Score | ✓ | Not applicable (used in calculating TVA) |
| 2) When the math teacher corrects my work, he/she writes on my papers to help me understand (Percentage of students for a given teacher who said Yes when asked this question) | ✓ | ✓ |
| 3) Locus of Control Index (Teacher Mental Models) (Teachers who believe they can help disadvantaged / struggling students learn) | ✓ | ✓ |
| 4) Critical Thinking (Classroom Observation) (Teacher rated higher if she asks more open ended questions or provides thinking tasks to students) | ✓ | ✓ |
| 5) Have you ever received training in 3Rs? (Teachers who received training in the 3R (Reading, Writing, Counting) Program) | ✓ | ✗ |
| 6) Teacher Incentive Index (Teacher Mental Models) (Teachers who strongly believe that their career progression and salary is linked to their students’ test-score performance) | ✓ | ✓ |
| 7) School in Urban Area | ✓ | Not applicable (school level variables interpreted as controls) |
| 8) At the end of each class, your Math teacher takes the time to review and discuss concepts (Percentage of students for a given teacher who said Yes when asked this question) | ✓ | ✓ |
| 9) Who is the most important person to assess your progress towards your professional targets? (= 1 if Teacher says myself; = 0 if Teacher names another person such as Head Teacher) | ✓ | ✗ |

Kiswahili (N=336)

| Variable | Selected by CIF or LASSO (current model) | Selected by CIF or LASSO (traditional TVA model) |
|---|---|---|
| 1) Baseline Kiswahili Score | ✓ | Not applicable (used in calculating TVA) |
| 2) Are any actions taken in case of poor teacher performance? (= 1 if Teacher says Yes; = 0 if Teacher says No) | ✓ | ✓ |
| 3) Student Age | ✓ | Not applicable (used in calculating TVA) |
| 4) Have you received training in 3Rs? (Teachers who received training in the 3R (Reading, Writing, Counting) Program) | ✓ | ✓ |
| 5) Your Kiswahili teacher offers extra help to students who find the subject difficult (Percentage of students for a given teacher who said Yes when asked this question) | ✓ | ✓ |
| 6) Are you satisfied by the support you get from the school administration? (= 1 if Teacher says ‘Satisfied’; = 0 if Teacher says ‘Not Satisfied’) | ✓ | ✓ |
| 7) It is okay to be absent as long as I: complete the curriculum OR leave students with work OR doing something useful for the community (= 1 if Teacher answers ‘Yes’; = 0 if Teacher answers ‘No’) | ✓ | ✓ |
| 8) Lesson Facilitation (Classroom Observation) (Teacher rated higher if lesson objectives are clearly articulated, explanation of content is clear, teacher connects lessons to real life) | ✓ | ✓ |
| 9) Pupil Teacher Ratio | ✓ | ✓ |
| 10) Relationships Index (Teachers who strongly believe that they have a good relationship with their students, colleagues & head teacher score higher) | ✓ | ✗ |
| 11) Are you satisfied by the support you get from the Government? (= 1 if Teacher says ‘Satisfied’; = 0 if Teacher says ‘Not Satisfied’) | ✓ | ✓ |
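Step 1 of the traditional approach described at the start of this annex can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the paper's specification: the only student-level control is the baseline score, teacher fixed effects are recovered with the within (demeaning) estimator, and clustered standard errors are not computed. The function name `teacher_value_added` and the toy records are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def teacher_value_added(records):
    """Estimate follow_up = beta * baseline + teacher fixed effect + error.

    records: iterable of (teacher_id, baseline_score, follow_up_score).
    Returns (beta, {teacher_id: fixed effect}).
    """
    by_teacher = defaultdict(list)
    for teacher, baseline, follow_up in records:
        by_teacher[teacher].append((baseline, follow_up))

    # Teacher-level means of baseline and follow-up scores.
    means = {
        t: (mean(b for b, _ in rows), mean(f for _, f in rows))
        for t, rows in by_teacher.items()
    }

    # Within estimator: slope from scores demeaned by teacher.
    num = sum((b - means[t][0]) * (f - means[t][1]) for t, b, f in records)
    den = sum((b - means[t][0]) ** 2 for t, b, f in records)
    beta = num / den

    # Fixed effect = mean follow-up score net of the predicted part from baseline.
    tva = {t: mf - beta * mb for t, (mb, mf) in means.items()}
    return beta, tva


# Toy data: both teachers share a slope of 0.5; teacher "A" adds 10 more points.
records = [
    ("A", 10, 25), ("A", 20, 30),
    ("B", 10, 15), ("B", 30, 25),
]
beta, tva = teacher_value_added(records)
print(beta, tva)  # 0.5 {'A': 20.0, 'B': 10.0}
```

In Step 2, the estimated fixed effects in `tva` would become the dependent variable passed to CIF and LASSO together with the teacher-level and school-level covariates.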