Policy Research Working Paper 8854

Improving Management with Individual and Group-Based Consulting
Results from a Randomized Experiment in Colombia

Leonardo Iacovone
William Maloney
David McKenzie

Development Economics
Development Research Group & Finance, Competitiveness and Innovation Global Practice
May 2019

Abstract

Differences in management quality are an important contributor to productivity differences across countries. A key question is how to best improve poor management in developing countries. This paper tests two different approaches to improving management in Colombian auto parts firms. The first uses intensive and expensive one-on-one consulting, while the second draws on agricultural extension approaches to provide consulting to small groups of firms at approximately one-third of the cost of the individual approach. Both approaches lead to improvements in management practices of a similar magnitude (8–10 percentage points), so that the new group-based approach dominates on a cost-benefit basis. Moreover, the paper finds some evidence that the group-based intervention led to increases in firm size over the next three years, while the impacts on firm outcomes are smaller and statistically insignificant for the individual consulting. The results point to the potential of group-based approaches as a pathway to scaling up management improvements.

This paper is a product of the Development Research Group, Development Economics and the Finance, Competitiveness and Innovation Global Practice. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at liacovone@worldbank.org, wmaloney@worldbank.org and dmckenzie@worldbank.org.
The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Improving Management with Individual and Group-Based Consulting: Results from a Randomized Experiment in Colombia*

Leonardo Iacovone, World Bank
William Maloney, World Bank
David McKenzie, World Bank

Keywords: Management, Employment, Scaling-Up Interventions, Colombia.
JEL codes: O14, O32, L2, M2

* The authors gratefully acknowledge the collaboration of Paula Toro Santana, the staff of CNP and DNP in Colombia, and project management and research assistance provided by Cosma Gabaglio, Camilo Andrés Gutiérrez Silva, Pablo Villar, María Aránzazu Rodríguez Uribe and Innovations for Poverty Action Colombia. Funding is gratefully acknowledged from the DIME i2i Trust Fund, the Knowledge for Change Program (KCP), the World Bank and the IPA SME Initiative, as well as intervention funding from SENA. This study is registered in the AEA RCT registry AEARCTR-0000528. Since no identifying information was collected on human subjects, the study was exempted from the Innovations for Poverty Action IRB. Comments from Miriam Bruhn, Jasmin Chakeri, Siddharth Sharma, and participants in seminars at George Mason, the IDB, the Management Practices in the Private and Social Sectors conference, MIT Sloan, PEDL, and Williams College are greatly appreciated.

1. 
Introduction

There are large differences in the management practices used by firms within and across countries (Bloom and Van Reenen, 2007). These differences are strongly correlated with productivity, with Bloom et al. (2016) estimating that differences in management can account for 30 percent of cross-country productivity differences. An experiment with 17 textile firms in India provides a proof-of-concept that intensive individualized consulting can deliver lasting improvements in the practices of badly managed firms, resulting in productivity improvements of 17 percent (Bloom et al., 2013; Bloom et al., 2018). However, the intervention was implemented by an international consulting company under close supervision from researchers, and cost $75,000 per treated firm.[1] This cost is likely to be prohibitive both for many small and medium-size enterprises (SMEs) seeking to finance such consulting themselves, and for governments seeking to scale it up to assist large numbers of firms.

This paper seeks to test two approaches that governments can use to scale up management improvements. The first is to use a very similar intervention of intensive individualized consulting, but to use local teams of consultants to deliver the intervention at a lower cost of approximately $30,000 per firm. The second, more novel, intervention is a group-based approach that aims to deliver improvements at lower cost (around $10,000 per firm), and to leverage group-learning dynamics, inspired by the approach used in the delivery of agricultural extension services. We partner with the Colombian government to conduct an experiment to measure the impact of these two competing interventions on SMEs in the Colombian auto parts manufacturing sector. Our sample of 159 firms with an average size of 58 employees, randomized into three groups of 53 firms, is an order of magnitude larger than that used in Bloom et al. 
(2013) and enables us to measure the impact of such a program when implemented at a multi-million-dollar scale by a government.

[1] Moreover, $75,000 was the academically discounted rate, with the consulting firm estimating a market price of up to $250,000 for those services.

We show that the Colombian auto parts sector starts with levels of management practices similar to those of the average Colombian manufacturing firm, which are low by global standards and similar to those in countries with lower per-capita incomes, such as India and Kenya. Both the individual and group-based interventions lead to improvements in management of similar magnitudes of 8 to 10 percentage points (relative to a control mean of 56 percent of structured managerial practices being implemented). This improvement is broad-based, with improvements in just over half of a detailed set of 141 practices measured. We then track firms for 1.5 to 2.5 years post-implementation with survey data, and use administrative data on employment to measure impacts up to 3 to 4 years post-implementation. We find evidence that the group-based intervention has grown the treated firms, with a statistically significant 6 to 7 worker (10 to 12 percent) increase in employment relative to the control group; an 8 to 9 percent growth in sales which is not statistically significant when compared to the control (p=0.12), but is statistically greater than the impact of the individual treatment; and higher energy input usage. In contrast, we find smaller and statistically insignificant impacts of the individual-based treatment. Neither treatment produces a significant increase in productivity, with a 95% confidence interval of [-13%, +8%]. 
The group-based intervention clearly dominates the individual intervention on a cost-benefit basis, and, although there is considerable uncertainty associated with this estimate, we estimate that the group-based intervention is likely to pay for itself in terms of higher firm profits within the first year.

This work contributes to at least three literatures. The first is a general literature on improving business practices and management in firms. Most of this literature has focused on short training courses and microenterprises (see McKenzie and Woodruff, 2014 for a review). However, several studies show the potential of more intensive individualized consulting to improve management in small and medium enterprises. In addition to Bloom et al.’s (2013) work in India, this includes Bruhn et al. (2018) with firms averaging 14 workers in Mexico, and Higuchi et al. (2017) with firms averaging 20 workers in Vietnam.[2] Secondly, while we are not aware of other studies that directly test group-based versus individual consulting, a recent literature has highlighted the ability of firms to improve their business practices when formed into groups or paired with other firms that can serve as role models (e.g. Cai and Szeidl 2018, Chatterji et al. 2018, Dalton et al. 2018, Lafortune et al. 2018). Finally, this paper contributes to a broader literature on how to scale up policies from promising researcher pilot studies (e.g. Banerjee et al., 2017; Bold et al., 2018). Our results show the promise of group-based consulting as a pathway to greater scale.

[2] A related quasi-experiment provided evidence of the long-term impact of participating in the Productivity Program, which allowed Italian firms to participate in study trips in U.S. plants followed by consulting sessions of U.S. experts at Italian firms (Giorcelli, 2019).

2. 
Context and Sample

2.1 Choosing the Industry and Sample

Labor productivity in Colombia is low: it takes around four Colombian workers to produce what one worker does in the United States (Londoño, 2017). As a result, improving productivity is a priority for government policy. The Government of Colombia was interested in testing whether the productivity improvements from better management demonstrated in India by Bloom et al. (2013) could be achieved at a larger scale in Colombia, as well as in generating more employment in these larger firms. In order to test different approaches, they wanted to choose a sector that was thought to have sufficient numbers of firms, to have production in a number of locations throughout the country, to have some potential for growth, and to be similar enough to other industrial sectors in the country that the results from this pilot could be applicable to other industries.

These criteria led to the selection of the auto-parts sector. This sector consists largely of second-tier suppliers to large car manufacturers, producing parts like fenders, tires, suspension parts, plastic parts, paints, etc. that are sold to the assemblers that directly supply national and international car manufacturers, as well as to retailers of spare parts. Appendix 1 provides some examples of the products. The auto parts sector in Colombia employs approximately 25,000 workers. It sells to car and bus manufacturers within Colombia and also exports approximately US$500 million each year, with Ecuador, the República Bolivariana de Venezuela and Brazil the main export markets (Proexport Colombia, 2012).

Public announcement of the program was made in April 2012 (Appendix 2 contains the full timeline), and firms were also informed of the program through the car manufacturers such as Sofasa (which assembles Renault cars in Colombia), General Motors, and Busscar (which manufactures buses). 
To be eligible, firms had to be legally registered, in business for at least two years, be a first or second-tier supplier to the automobile industry, and be located in one of four areas: the departments of Antioquia, Cundinamarca, Valle del Cauca, and the Eje Cafetero (Coffee Axis). The firms were told the program would offer assistance in improving production practices in order to improve profitability, productivity and competitiveness, and that the program would not require any payment by the firms, but that they would need to commit time and effort of their workforce to supply information required and to implement suggestions made.

Public provision of the program to firms was justified both with reference to the overall policy objective of improving productivity, as well as due to the presence of several market failures that prevent firms from improving management on their own. A first issue is that of information: many badly managed firms do not know they are badly managed, with data from the World Management Survey showing that Colombian managers perceive their firms to be slightly better managed than U.S. firms, when the reality is substantially worse management.[3] Secondly, even if firms know they need to improve, they may be unable to identify which providers can offer good services, may lack the financial resources to pay for consulting, and a lack of insurance may prevent them from investing in an activity with uncertain payoffs.

A total of 218 firms applied for the program. Of these, 180 were accepted in the preliminary step, with the remainder rejected for being too small, or for only being distributors rather than manufacturers of parts. Eleven firms then dropped out, so 169 firms formed the group to take part in the first, diagnostic, phase of the project. Following the diagnostic, we dropped firms with fewer than 10 workers, to leave a sample of 159 firms for the experiment. 
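The sample funnel described above can be retraced with simple arithmetic. The following is only an illustrative check of the counts reported in the text; the variable names are ours, and the number of firms dropped for having fewer than 10 workers is inferred from the difference between the two reported totals rather than stated directly.

```python
# Counts taken from the text of Section 2.1; variable names are illustrative.
applied = 218
accepted = 180               # accepted in the preliminary step
dropped_out = 11             # withdrew before the diagnostic phase
experimental_sample = 159    # final sample after the diagnostic

rejected_at_screening = applied - accepted
diagnostic_sample = accepted - dropped_out
# Inferred, not stated in the text: firms excluded for having < 10 workers.
too_small = diagnostic_sample - experimental_sample
firms_per_arm, leftover = divmod(experimental_sample, 3)

print(f"Rejected at screening:     {rejected_at_screening}")
print(f"Entered diagnostic phase:  {diagnostic_sample}")
print(f"Dropped (under 10 workers): {too_small}")
print(f"Firms per arm:             {firms_per_arm} (remainder {leftover})")
```

The final sample divides evenly into three arms, which is what allows the matched-triplet design used for the random assignment.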
2.2 Random assignment and firm characteristics

Firms were randomly assigned to three groups of 53 firms each. Since the number of firms in each group would be small, we aimed to improve balance on observables by forming matched triplets of firms, choosing this grouping in a way to minimize the Mahalanobis distance between firms in a triplet in terms of their geographic location, size, labor productivity, and management practices.[4] This took place in November 2013, after the diagnostic phase (described below). Then within each triplet, firms were randomly allocated to a control group and two treatment groups: an individual-consulting treatment group and a group-consulting treatment group.

[3] Colombian firms had an average WMS score of 2.50 in 2014 (described below), but an average perceived score of 3.76. In contrast, U.S. firms had an average WMS score of 3.32, and perceived score of 3.57.

[4] Location consisted of Cundinamarca and Valle regional dummies; firm size consisted of dummies for small (10 to 50 workers) and medium size (51 to 310 workers), as well as for the number of employees; management practices consisted of indices for practices in human resources, production, logistics, marketing and finance; as well as for seven individual management practices identified as priority areas in many diagnostic plans.

Table 1 provides some summary characteristics of the firms, along with their means by treatment group status. The mean (median) firm has been in business for 24 (23.5) years, with only 20 percent having been in business for fewer than 10 years. A key feature of the data is that firms are heterogeneous in terms of size and product produced. Firms had a mean of 59 and median of 40 
employees at the time of application, with 59 percent of the firms classified as small (10-50 workers), and the remainder as medium (51 or more workers), with the maximum being 310, and the 10-90 range being from 13 to 119 workers. Mean sales were approximately US$2.7 million in 2013, with a 10th percentile of US$280,000 and 90th percentile of US$6.3 million, showing the large variation in firm size. These are almost all single plant firms, with the main subsectors being metal products (60%) and plastic products (18%). The sample also includes firms making rubber products (5%), chemical products such as injection molding (4%), electronic components (4%), as well as firms working with leather, wood, and glass. Among the firms, 94 percent are tier 2 firms in the value chain, with 6 percent tier 1.[5] Forty-five percent of firms had exported in at least one month of 2013. Half the firms are located in the Cundinamarca region, which includes Bogota, with the region of Valle del Cauca, which includes Cali, the next biggest.

Management practices were measured in terms of 141 individual practices, developed by the Colombian National Productivity Center, classified into five areas: financial practices (made up of 29 individual practices), human resource practices (20), logistics practices (31), marketing practices (22), and production practices (39). Each practice was scored on a five-point scale, where 1 indicates that the practice is not used, and 5 that it is implemented and under control. Scores were then aggregated and calculated as a percentage of the maximum possible score for that index. Appendix 3 provides more details of the specific measures. At baseline average scores for these practices range from 43 (human resources) to 51 (financing practices), relative to a potential maximum score of 100, indicating that firms have significant room to improve on these practices. 
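As a rough illustration of how such an index could be computed, the sketch below aggregates five-point practice scores into a percentage of the maximum possible score. It assumes the index is the sum of scores divided by 5 times the number of practices; the text does not spell out the exact formula (Appendix 3 has the details), so this is one plausible reading, and the function name and example scores are hypothetical.

```python
# Number of practices per area, as reported in the text (sums to 141).
PRACTICES_PER_AREA = {
    "finance": 29,
    "human_resources": 20,
    "logistics": 31,
    "marketing": 22,
    "production": 39,
}
MIN_SCORE, MAX_SCORE = 1, 5  # 1 = not used, 5 = implemented and under control


def area_index(scores):
    """Return the scores as a percentage of the maximum possible total.

    Assumption: the index is sum(scores) / (5 * number of practices) * 100.
    """
    if not scores:
        raise ValueError("at least one practice score is required")
    if any(not MIN_SCORE <= s <= MAX_SCORE for s in scores):
        raise ValueError("each practice must be scored from 1 to 5")
    return 100 * sum(scores) / (MAX_SCORE * len(scores))


# Hypothetical firm: half of its 20 HR practices unused, half partly in place.
hr_scores = [1] * 10 + [3] * 10
print(f"HR index: {area_index(hr_scores):.1f}")
print(f"Total practices: {sum(PRACTICES_PER_AREA.values())}")
```

Under this reading a firm using none of the practices would still score 20 rather than 0, because the lowest point on the scale is 1. That is worth bearing in mind when interpreting baseline scores in the 40s and 50s.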
We refer to this as the Anexo K (Annex K) management practices measure, with this terminology referring to the form used to collect this data.

Table 1 shows that while the random assignment was able to achieve balance on most baseline variables, there are a couple of imbalances. These reflect the difficulty of balancing many variables in a relatively small sample of heterogeneous firms. For example, the control group is more likely to be in metal products than either treatment group and starts with lower labor productivity. In our analysis we use firm fixed effects or controls for the baseline value of interest to make the firms more comparable and reduce the effect of this heterogeneity.

[5] Tier 1 means that the firm directly supplies the original equipment manufacturer (e.g. Ford, Suzuki, etc.), while tier 2 means the firm supplies a tier 1 supplier without supplying the vehicle manufacturer directly.

2.3 External validity and comparison to Bloom Van Reenen Management Practices

In 2013, prior to the interventions, we commissioned the LSE survey team responsible for the Bloom and Van Reenen (2007) World Management Surveys (WMS) to apply their methodology to a random sample of 180 firms representative of the Colombian manufacturing sector, as well as to a sub-sample of 72 companies from our sample with 40 or more employees.[6] Appendix 7 summarizes this survey process, and provides three key results. First, the mean and distribution of WMS management practices scores for our auto parts firms are similar to those of the overall manufacturing sector in Colombia (2.38 versus 2.54). Second, Colombia’s average management practices score shows firms are, on average, poorly managed in global terms, but similar to many other developing countries. The average score is just below that of firms in India and just above that in Kenya in the WMS. 
The auto parts sector in Colombia is thus a fairly typical sector for both the country, and for developing countries as a whole, in terms of management practices. A final use of this baseline WMS data is to compare the Anexo K management measure, our main measure of management used in this paper, to the WMS. Appendix 5 shows that the two are significantly correlated in the cross-section at baseline, with a correlation of 0.26 between the two overall indices. The Anexo K has a stronger correlation (0.44) with the monitoring subcomponent of the WMS, reflecting a greater emphasis on measurement and monitoring than on other management practices.

[6] This size restriction was made since the WMS was designed for firms with 50 or more employees.

2.4 Macroeconomic context

Sales in the Colombian auto parts sector grew at an average of 5.4 percent per year over the 2002 to 2012 period leading up to our experiment (Reina et al., 2014).[7] At the start of our study, imports averaged 68 percent of total sales in the sector, and were the main source of competition for most firms in our study.

[7] The report notes a nominal growth rate of 11.2 percent, which we deflate using the Colombian inflation rate taken from the World Development Indicators.

However, the country was hit by a combination of external and internal shocks starting in late 2013, which resulted in a large depreciation of the peso, from an average of 1930 COP to the USD in 2013 to approximately 3000 COP to the USD in each of 2015, 2016, and 2017. Domestic new vehicle sales fell from 326,000 units in 2014 to 238,000 units in 2017, a 27% drop (BBVA Research). Export sales of auto parts fell 51 percent in dollar terms over the 2013-2016 period, driven by weak economies in the main export destinations of the República Bolivariana de Venezuela, Ecuador and Brazil. 
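To put the size of these shocks on a common footing, the short sketch below recomputes the percentage changes implied by the figures quoted in the text. It is only an arithmetic illustration; the helper name `pct_change` is ours. The exchange rate is shown both ways, because a rise in pesos per dollar and a fall in the dollar value of a peso are different percentages.

```python
def pct_change(old, new):
    """Percentage change from old to new."""
    return 100 * (new - old) / old


# New vehicle sales, 2014 to 2017 (units, from the text).
vehicle_sales_change = pct_change(326_000, 238_000)

# Exchange rate, 2013 average versus approximate 2015-2017 level (COP per USD).
cop_per_usd_change = pct_change(1930, 3000)
# Same movement expressed as the change in the dollar value of one peso.
peso_value_change = pct_change(1 / 1930, 1 / 3000)

print(f"New vehicle sales:       {vehicle_sales_change:+.1f}%")
print(f"Pesos per US dollar:     {cop_per_usd_change:+.1f}%")
print(f"Dollar value of a peso:  {peso_value_change:+.1f}%")
```

The first figure reproduces the 27% drop in vehicle sales cited in the text. The other two are not stated in the paper and simply restate the reported exchange rates in percentage terms.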
The aggregate context is thus one of weakening overall demand for the sector, but where the weakened currency increased competitiveness against imports. Real sales of domestic production were then fairly flat over our study period, falling 0.12 percent between 2013 and 2016.[8]

3. The Intervention

The program was implemented by the National Productivity Center (Centro Nacional de Productividad, CNP), which is a Colombian non-profit institution with a mandate to help increase the productivity, innovation and competitiveness of Colombian businesses. CNP originally was funded and supported by Japanese technical cooperation and has been the recipient of training and in-house technical assistance to develop capabilities in implementing managerial consulting services such as Lean, Six-Sigma, etc. During its 15 years of experience CNP has developed a model of operation that has allowed it to support more than 4,000 Colombian companies in different areas of management, innovation, productivity and competitiveness.

CNP used two types of consultants for the intervention. The first were lead consultants, who were long-term employees of CNP with more than 10 years of experience, including experience managing teams. They led area consultants, who were required to have at least 5 years of experience, and who specialized in a particular focus area such as logistics or finance. The direct cost of implementation of this program was approximately US$2.4 million.

3.1 Diagnostic phase

All firms, including the control, received a diagnostic as the first phase. This was implemented on a rolling basis between June and October 2013. The diagnostic was carried out by a team of 6 consultants, consisting of a lead consultant and five specialists, one for each area (Logistics, Human Resources, Finance, Marketing and Sales, and Production). 
The diagnostic began with an opening meeting with top and middle management, and then each area specialist would have five days of meetings with the responsible manager in the firm for their area to evaluate the 141 individual management practices that form Anexo K. This forms the baseline management practices measure. The consultants would also examine the firms’ key performance indicators for the last three years (to the extent records existed), and work with the leader to finish with a report (improvement plan) that analyzed managerial practices for each area, the key performance indicators for each area, and recommended practices to prioritize. This diagnostic phase lasted 2 full-time weeks and cost 8,426,550 COP (US$3,553) per firm.[9]

[8] Export data and sales data are from DANE and are for the CIIU sector 2930 “Manufacturing of parts, pieces, and accessories for automobiles and their motors”.

The diagnostic identified priority practices to be improved by management with the accompaniment of the consultants. These practices were intended to be ones which required minimal capital investment, and which could be implemented reasonably quickly and were expected to lead to relatively rapid improvements in the firm. While these priority practices were individualized by firm, some of the priority areas for improvement in each of the five areas were common to many firms. These include implementing master budgets across areas, improving systems for tracking costs, defining explicitly the strategic objectives of each position in the plant, implementing plans to improve the skills of people in management roles, lining up sales and marketing plans with business strategy, and analyzing machine downtime and quality problems daily across different supervisors.

3.2 Individual Consulting Treatment

Assignment to treatment took place after the diagnostic phase, in November 2013. 
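The assignment step referred to here is the matched-triplet randomization of Section 2.2. The sketch below illustrates the general idea under loudly stated assumptions: it uses synthetic covariates, a simple greedy nearest-neighbour rule to form triplets, and hypothetical function names. The paper states only that triplets were chosen to minimize the Mahalanobis distance between firms, so the greedy rule is a stand-in, not the authors' algorithm.

```python
import numpy as np


def mahalanobis_matrix(X):
    """Pairwise Mahalanobis distances between the rows of X."""
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    diff = X[:, None, :] - X[None, :, :]
    squared = np.einsum("ijk,kl,ijl->ij", diff, cov_inv, diff)
    return np.sqrt(np.maximum(squared, 0.0))  # guard against rounding error


def form_triplets(X):
    """Greedy grouping (an assumption): each seed firm takes its 2 nearest."""
    n = X.shape[0]
    if n % 3 != 0:
        raise ValueError("number of firms must be divisible by 3")
    dist = mahalanobis_matrix(X)
    remaining = set(range(n))
    triplets = []
    while remaining:
        seed = min(remaining)
        remaining.remove(seed)
        nearest = sorted(remaining, key=lambda j: dist[seed, j])[:2]
        remaining.difference_update(nearest)
        triplets.append((seed, *nearest))
    return triplets


def assign_arms(triplets, rng):
    """Randomly allocate the three firms in each triplet to the three arms."""
    arms = ["control", "individual", "group"]
    assignment = {}
    for triplet in triplets:
        for firm, k in zip(triplet, rng.permutation(3)):
            assignment[firm] = arms[k]
    return assignment


rng = np.random.default_rng(2013)
X = rng.normal(size=(159, 4))  # synthetic stand-in for baseline covariates
triplets = form_triplets(X)
assignment = assign_arms(triplets, rng)
counts = {arm: list(assignment.values()).count(arm)
          for arm in ("control", "individual", "group")}
print(counts)
```

Because every triplet contributes exactly one firm to each arm, the three arms are the same size by construction and are balanced on whatever covariates drive the matching. A greedy rule like this one does not guarantee the smallest possible total within-triplet distance; an optimal matching algorithm would be needed for that.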
Firms assigned to the individual consulting treatment group then received individual support for a period of 6 months, in the time window between March and November 2014. They were assigned a team of five consultants, one for each of the five processes (logistics, human resources, finance, marketing and sales, and production), along with a leader. The intervention began with an opening meeting that brought together the leaders within the firm responsible for each of these five processes, along with the six consultants to define the different roles and responsibilities and set out a work plan. Then each of the five area consultants would visit the firms and provide 20 hours of training to the person in the firm in charge of their respective area. This would involve a theoretical part with the goal of familiarizing the firm’s management with modern management concepts and methods, complemented with practical exercises to apply these concepts to their firm. This was then followed by individual consulting to help the firms implement the improvement plan developed during the diagnostic phase.

[9] We use the average exchange rate over 2014-15 of 2,372 COP = 1 USD for all currency conversions in this paper. Cost numbers are implementation costs, and exclude initial costs of intervention design, and additional costs of data collection for the impact evaluation. To the extent this data collection process also helps firms improve management, it could be considered another part of the intervention; it averaged a further US$20,000 per firm (including the control group). Note that our costs are the costs to the government, and so do not include the opportunity cost of time to the firms participating, nor any minor travel costs incurred by them in traveling to meetings.
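The exchange rate the paper uses for all currency conversions (2,372 COP = 1 USD, the 2014-15 average) can be applied with a pair of small helpers. The function names below are ours, and the example simply reproduces the per-firm diagnostic cost quoted in Section 3.1.

```python
COP_PER_USD = 2372  # average 2014-15 rate used throughout the paper


def cop_to_usd(amount_cop):
    """Convert Colombian pesos to US dollars at the paper's fixed rate."""
    return amount_cop / COP_PER_USD


def usd_to_cop(amount_usd):
    """Convert US dollars to Colombian pesos at the paper's fixed rate."""
    return amount_usd * COP_PER_USD


diagnostic_cost_cop = 8_426_550  # per-firm diagnostic cost from Section 3.1
diagnostic_cost_usd = cop_to_usd(diagnostic_cost_cop)
print(f"Diagnostic cost per firm: US${diagnostic_cost_usd:,.0f}")
```

Holding the rate fixed keeps the cost figures comparable across the two treatment arms even though they were delivered more than a year apart, during a period when the peso was depreciating sharply.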
Every area would be covered by different consultants and with different schedules but would typically involve weekly meetings for four hours per visit, spread over three to five months. Once per month, the team would meet with the whole firm’s management to discuss improvements and re-define priorities and next actions. The total consultant time was 500 hours, consisting of 100 hours of providing training, and then approximately 100 four-hour sessions per firm of individual consulting. The cost of this individual intervention was US$28,950 per firm receiving treatment.

Based on our discussions with firms and our own observations of the process, the implementation appears to have involved an emphasis on teaching firms how to measure and monitor key performance indicators, and on providing firms with the set of tools needed to better understand how their business is performing. It appears that there was less direct implementation from the consultants. For example, the consultants might go through the financial and performance data from the firm and suggest the need for the firm to consider new product lines or develop new markets abroad, but seldom make more direct recommendations (e.g. you should try exporting product X to Ecuador, or you should start using this production technology).

3.3 Group Consulting Treatment

The idea behind the group consulting treatment was to test whether the same gains in management improvements could be achieved more efficiently through working with small groups at a time, motivated in part by the way agricultural extension services are often implemented. The group treatment arm aimed to lower costs in two key ways. First, by working with multiple firms at once, and potentially having them also learn from one another, each consultant’s time was spread over more firms. 
Secondly, rather than consultants having to travel to the firms, most of the meetings took place in central meeting places such as conference rooms, cutting down on consultant travel time.

Groups were formed of 3 to 8 firms located in the same region, such that members are not direct competitors to one another, but are instead producing complementary products with similar management problems.[10] These groups were formed after the randomization, in November 2013. Unfortunately, the government budgetary entity designated to pay for this treatment arm was different from the one paying for the individual treatment. This entity significantly delayed the payment, meaning that the group intervention was unable to start until over a year after the individual intervention, running six months from September 2015 to May 2016 (with different groups starting and stopping at different times, and a break over the Christmas period).

[10] The composed groups are 1 group of 8 firms, 4 groups of 7 firms, 2 groups of 6 firms, 1 group of 4 firms, and 1 group of 3 firms.

Leaders from the firms in a group signed an agreement to work together and help each other improve. Like the individual treatment, the group treatment began with training classes that covered theoretical aspects of management. The difference is that these classes were delivered to the group in a classroom setting, instead of one-on-one in the firm. Each firm would send the staff in charge of a particular area or production process along to that training session. For example, when financial training was performed, firms would send the people responsible for the firm’s financial components to the training. These sessions lasted for a total of 40 hours per group, including a session on the topic of cooperation among firms. This was then followed by group consulting sessions, designed to help firms implement the management improvements. 
In any given week, a group would discuss two areas, having one or two meetings focusing on a single area (for a maximum of four meetings a week per group). Only management with responsibilities over the area being discussed would participate in the meetings. The same two areas would be covered at the same time over about eight weeks. After a break over Christmas, the remaining three areas would be covered the same way. The order in which areas were discussed was not the same for each group.

The group meetings would focus on the implementation of the actions agreed in the improvement plans of each company. Within each group, each firm had to work on the improvement of the topic that had been prioritized for a number of firms in the group, unless the firm excelled already in that topic. Therefore, each firm would still be focused on the issues that had been prioritized in the Improvement Plan, but its Action Plan would be updated to include relevant issues taken from the other firms’ Improvement Plans. If a firm already excelled in topics that were central in other firms’ Improvement Plans, it would be used as an example and its experience would be discussed in detail.

In the individual intervention, consultants were at the firm for all visits, so could directly see implementation attempts and problems and adjust their recommendations accordingly. In contrast, during the group intervention, it was more difficult to directly verify changes being made in logistics and production. This was addressed by requiring firms to bring photos to the group meetings as evidence of what they had implemented. In addition, firms in the group treatment still had a monthly one-on-one visit, which took place at the plant, when a consultant would meet with senior management, and one hour at the end of each meeting was used to visit the plant and review improvements. 
This process enabled the group intervention to be significantly cheaper than the individual intervention, with an average cost of US$10,500 per firm receiving treatment. Firms received 408 hours of consultant time each, consisting of 40 hours of group training and 92 four-hour group sessions.

4. Take-up, Data sources, and Attrition

4.1 Take-up

The take-up rate for the individual intervention was 86.8%: 46 of the 53 firms started this intervention, and all 46 completed it. The longer delay until beginning the group intervention reduced the take-up rate for this intervention, with 40 of the 53 firms in this group (75.4%) starting the intervention, and 36 firms (67.9%) completing it. Table A4.1 shows that the baseline characteristics of those who completed the intervention are not statistically different from those who dropped out, with the one exception being that dropout from the individual treatment was more common in the Antioquia region than elsewhere. The main reasons given for drop-out from both groups were lack of owner time to participate, and lack of continuity in the program (especially for the group treatment).

4.2 Data Sources, Measurement of Key Outcomes, and Attrition

Baseline data were collected from the application form and diagnostic phase and cover firm characteristics in 2013. We then use three types of follow-up data, discussed in detail in Appendix 3. The first is data on the management practices in the firm. Our main measure is the Anexo K management score, which is a score measuring the average adoption rate of the 141 different practices detailed in Appendix 3. This was collected by CNP during in-person visits to the firms. It was measured during the diagnostic for 156 of the 159 firms (3 of the firms had components missing), monthly for the treatment groups during the time of their interventions, as well as annually in 2014 and 2015 for the individual and control groups, and in 2015 and 2016 for the group treatment.
The second type of data consists of key performance indicators (KPIs) from the firms, which were collected during in-person visits. We use this to measure impacts on firm sales and employment, as well as on defect rates, inventory levels, and energy usage. The final source of data comes from linking firms to administrative data sources on employment and exports.

Obtaining data from the firms was difficult and complicated by several factors. First, a consequence of poor management is that firms did not routinely and consistently keep records of some KPIs. Firms would change the units of measurement at times from pesos to physical units, and the type of physical unit they used (e.g. from number of items to kilograms).11 Second, data collection in the firms was conducted during on-site visits by CNP. We hired Innovations for Poverty Action to provide an independent check on this data, and to help in extracting data from the firms; this included oversight of both the management practice data and the KPI data. But CNP had breaks in its contracts, which meant data collection halted for months at a time, and they had a long list of KPIs they wanted from firms, which increased the burden on firms of reporting. The result was that some firms dropped out of providing follow-up information, even after repeated follow-up visits seeking just a few key variables. Third, ten of the firms closed during the course of the study (4 control, 3 individual treatment, and 3 group treatment, p-value of equality of death rates 0.911). These three factors mean that we only have both employment and sales data through to December 2017 for 105 firms (69% of the sample), comprising 33 control firms, 37 individual treatment firms, and 35 group treatment firms (p-value of equality of attrition rates is 0.744). Table A4.2 compares the baseline characteristics of these firms to those that attrit, and shows that we cannot reject equality of means.
Moreover, balance on baseline observables for those firms which do report is similar to our balance on the overall sample. Nevertheless, we use firm fixed effects in our estimation of impacts on firm outcomes to control further for any time-invariant differences among firms.

11 These changes in units also occurred because firms would produce different products at different times, depending on what orders they received.

For employment outcomes, we can also use the PILA (Unified Register of Contributions), which is the national information system used by firms to file the mandatory contributions to health, pensions, and disability insurance paid for workers. This data has the advantage of covering more of the firms, since we could match 156 of the 159 firms to these records. Moreover, it is more comprehensive in length, enabling us to track firms from pre-intervention (January 2013), right through until the end of December 2018, which corresponds to three years after the group interventions and more than four years after the individual intervention ended. The potential drawback is that it only covers formal employment. Appendix 8 discusses this data in more detail, and compares it to the firm survey data, finding a correlation of 0.93 (Figure A8.1), and that few firms appear to have large numbers of informal workers.

5. Impact on Management Practices

The interventions aimed to improve specific management practices covered under the 141 practices that comprise Anexo K. These practices were measured for all firms during the diagnostic phase in 2013, and then measured monthly during the implementation periods of the individual and group interventions, and again one-year post-intervention. The control group had these measured towards the end of the individual treatment intervention, and again at the time of the one-year follow-up.
Figure 1 shows the trajectory of impacts on management practices for the overall Anexo K management score, and for the scores under the five separate areas of finances, human resources, logistics, marketing and sales, and production practices. We see that the individual treatment group sharply improves practices overall, and in all five areas, during the implementation phase, while the control group improves by much less. The group treatment likewise sharply improves practices during its implementation phase, and this group ends up with practices at or above where the individual treatment group ended. This improvement in management then persists for the following year for both groups. Figure 2 compares the distributions of management practices at baseline, and at the last follow-up, for the three groups. Kolmogorov-Smirnov tests show we cannot reject equality of distributions at baseline, but at the endline, both the individual and group treatments are significantly different from the control group (p-values 0.004 and 0.003 respectively), although they are not significantly different from each other (p-value 0.643).

For our regression analysis, we therefore classify our data into three periods: baseline, during the intervention (measured at the end of implementation for the individual and group treatments, and the first follow-up for the control group), and post-intervention (measured at the one-year follow-up post-intervention for the individual and group treatments, and the second follow-up for the control group). This time-shifts the data for the group treatment to account for the delay in implementation, which meant that its follow-ups took place a year later than the other two groups.
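The three-period classification can be illustrated with a minimal sketch. It assumes, as a simplification, that each measurement is identified by its calendar year, using the annual follow-up years reported in Section 4.2 (2014 and 2015 for the individual and control groups, 2015 and 2016 for the group treatment). The function and dictionary names are hypothetical and are not taken from the study's own code.

```python
# Hypothetical helper: map a measurement year to the analysis period t
# (1 = baseline, 2 = during the intervention, 3 = post-intervention).
# The follow-up years per arm are the annual measurement years reported
# in Section 4.2; treating them as calendar years is a simplification.
BASELINE_YEAR = 2013

FOLLOW_UP_YEARS = {
    "control": (2014, 2015),
    "individual": (2014, 2015),
    "group": (2015, 2016),  # one year later because of the delayed start
}


def analysis_period(arm: str, year: int) -> int:
    """Return the analysis period for a measurement of a firm in `arm`."""
    if year == BASELINE_YEAR:
        return 1
    first_follow_up, second_follow_up = FOLLOW_UP_YEARS[arm]
    if year == first_follow_up:
        return 2
    if year == second_follow_up:
        return 3
    raise ValueError(f"no analysis period for arm={arm!r}, year={year}")


# The same calendar year falls in different analysis periods across arms.
print(analysis_period("control", 2015), analysis_period("group", 2015))
```

Under this mapping a 2015 measurement is the post-intervention observation for a control firm but the during-intervention observation for a group treatment firm, which is the time shift described above.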
We then estimate the following ANCOVA regression (McKenzie, 2012) for t=2 (during) and t=3 (post-intervention) that controls for the randomization triplets and the baseline level of management practices, and allows the impacts during the intervention to differ from those post-intervention:

$$M_{i,t} = \beta_1\,\mathrm{Individual}_i \ast \mathrm{During}_{i,t} + \beta_2\,\mathrm{Group}_i \ast \mathrm{During}_{i,t} + \beta_3\,\mathrm{Individual}_i \ast \mathrm{Post}_{i,t} + \beta_4\,\mathrm{Group}_i \ast \mathrm{Post}_{i,t} + \sum_{g} \delta_g\,1(i \in g) + \lambda\,1(t=3) + \gamma\,M_{i,0} + \varepsilon_{i,t} \quad (1)$$

where $M_{i,t}$ is the management practices score of firm i in period t, $M_{i,0}$ is its baseline level, $1(i \in g)$ is a dummy for firm i being in randomization triplet g, $1(t=3)$ is a time period fixed effect, and the standard errors are clustered at the firm level.

Table 2 presents the estimated treatment effects on these management practices. Panel A uses the unbalanced panel, which includes firms whose practices were measured in only one of the two follow-up periods, and Panel B the balanced panel of firms measured in both follow-ups. Four key results are evident. First, the immediate treatment impacts seen in Figure 1 are statistically significant at the 1 percent level for both treatments. Second, the estimated effect size is between 8 and 10 percentage points, relative to the control group implementing 56 percent of the practices by 2015. Third, the impact persists for at least one year post-intervention. Fourth, the individual and group treatments yield impacts that are similar to one another in magnitude, and we cannot reject equality of treatment effects for the overall index, or for any of the five areas, in the post-intervention period.

How large an effect is this improvement of 8 to 10 percentage points in management practices? It is only approximately one-third the size of the improvement of 26 percentage points found by Bloom et al. (2013) from their management intervention in India, but approximately twice the size of the typical improvement found in standard business training courses given to smaller firms (McKenzie and Woodruff, 2016).

5.1 Which Practices Improved?
The improvement in management practices is broad: Figure 1 and Table 2 show gains of reasonably similar magnitudes across all five areas. Table A4.1 looks at the sub-index and individual practice level. The individual treatment has a positive and statistically significant impact (at the 5% level) on 23 out of the 35 sub-indices (66%), and 67 out of the 141 individual practices (48%), while the group treatment has a positive and statistically significant impact (at the 5% level) on 20 of the 35 sub-indices (57%), and 73 out of the 141 individual practices (52%). Table A4.2 examines which practices have had the largest impacts. These are mainly practices concerning defining strategic goals and objectives, setting up master budgets, and monitoring key performance indicators. The smallest number of improvements are seen in human resource practices and logistics practices.

Figure 3 plots the estimated treatment effects practice by practice for the individual and group treatments. The correlation is 0.71, showing that the two different approaches to improving management not only resulted in a similar aggregate improvement in management, but also in a similar mix of practices improved. The main area of difference occurs with several production practices related to preventative maintenance, which improved more with the group treatment than the individual treatment.

Why didn’t firms change more of their management practices? Qualitative interviews suggest several explanations. A first one is delays in implementation, which caused some firms to lose interest. The consultants pointed to problems getting family-run businesses to focus on improvements, and that a lack of a data culture prevents firms from recognizing their flaws.
For this reason, much of their initial focus was on getting firms to collect KPIs and to have meetings to identify problems, which, in our opinion, may have come at the expense of “quick wins” in which changes in specific practices could be seen by firms to lead quickly to noticeable improvements in business outcomes.12

12 For example, in India, the international consulting company we used started by identifying a couple of practices that could be changed quickly and where the firm could see immediate results, and then hand-held firms through changing these practices as a way to garner enthusiasm and momentum for broader changes.

We also asked the consultants to go through a flowchart to explain why key practices identified in the diagnostic were not then implemented (before the intervention). This was done in early 2014 for approximately two practices per firm in 87 firms in the individual and control groups, for a total of 151 practices. Firms had heard of the practices, but were rated low in their knowledge about the practices, with 72% of firms being scored as a 1 or 2 out of 5 on knowledge of how to implement the practice. The consultants believed that external factors (<1% of cases) and firm human and financial resources (only 6%) were rarely constraints to implementation. In contrast, they thought that the firm owner mistakenly did not consider the practices to be profitable in 58% of cases. This is consistent with the findings of Bloom et al. (2013) that the main reasons for practices not being implemented were lack of knowledge about the practices, and firm owners not thinking the practices were worth implementing.

5.2 Robustness Checks of the Management Improvement

We consider the robustness of the improvement in management practices to different weighting schemes, to sample attrition, and to alternative measurement tools.
Robustness to weights: Our measures of management practices are averages of the different practices. The Anexo K overall index is an average of the 35 sub-indices, and ranges from 20 (indicating scores of 1 for every individual practice) to 100 (indicating scores of 5 for every individual practice). With any aggregate index, there is always a question as to the appropriate choice of weights, and of how sensitive the results are to alternative weighting schemes. Table 3 examines robustness to different choices of how to aggregate the 141 practices. Column 1 shows our aggregate index from Table 2. Columns 2 through 5 then consider four alternative weighting schemes. Column 2 uses the first principal component of the 141 practices; Columns 3 and 4 use lasso regression to identify the sub-set of practices which best predicts baseline log employment and labor productivity respectively, and then post-lasso regression to form the weights. This chooses 19 practices to weight according to their predictive power for employment, and 14 to weight for their predictive power for labor productivity. Finally, column 5 uses the subset of firms for which we also have baseline data from the World Management Survey, and uses lasso to choose weights that best predict the baseline WMS score, which selects only 6 practices.13 The coefficients cannot be directly compared across columns in terms of magnitudes, but can be considered relative to the control group standard deviation. The estimated treatment effects are 0.8 to 0.9 standard deviations (s.d.) when using our aggregate index, 0.9 to 1.0 s.d. when using principal components, 0.6 s.d. when weighting to predict employment, 0.8 s.d. when weighting to predict labor productivity, and 0.7 to 1.1 s.d. when weighting to predict the WMS score. Thus, regardless of the choice of weights, we find the treatment impacts are positive, similar in magnitude, and statistically significant.

13 The smaller number of practices chosen is likely because of the much smaller sample for which the WMS is available.

Robustness to attrition: Appendix 6 examines robustness of our results to attrition of the management practice data. It shows that the firms for which we have endline management practice data have similar baseline management practices to those firms which attrit, and that this also holds separately by treatment status. It provides Lee bounds for the impact on management practices. These bounds are relatively narrow and positive, and statistically significant, even at the lower bound when measuring the impact during the intervention. However, since control group attrition is higher by the endline, the bounds are wider for the post-intervention period, and the lower bound for the treatment effects is positive, but not statistically significant for either treatment. However, for this lower bound to hold, it would need to be the case that the best managed control firms were the ones that attrited. We show that this is not the case in terms of either baseline management practices or management practices as measured in the first follow-up. Coupled with our use of a balanced panel and randomization triplet fixed effects as controls (which identifies treatment by comparing firms with similar baseline characteristics), we believe survey attrition is extremely unlikely to be driving the positive impacts found on management.

Robustness to alternative measurement of management: Appendix 7 discusses our efforts to also measure changes in management using the World Management Survey (WMS) and Management and Organizational Practices Survey (MOPS). These measures are at a more general level than the Anexo K measures, and were designed for medium-sized firms of 50 or more employees, whereas our sample includes firms with as few as 10 workers.
A combination of budget constraints and attrition means that we only have this data for 70 of the 159 firms (WMS), and 95 firms (MOPS). We show that our Anexo K measures are correlated with the WMS and MOPS in the cross-section, but not in the panel, and that our WMS and MOPS measures appear to be noisily measured, with less predictive power for business outcomes than Anexo K. Our measured treatment impacts on these two measures are smaller in magnitude and not statistically significant. The improvement in management we obtain is thus not able to be detected using these alternative management instruments.

5.3 Correlated Practice Changes within the Group Treatment

The motivation for the group intervention suggested two possible ways in which working with firms in groups could foster improvements in management practices. A first possibility is one of coordinated experimentation and learning, whereby group members try to improve the same practice together, and so they are able to motivate and learn from one another. A second possibility is one of existing knowledge transfer, whereby group members are able to learn how to implement a practice from other group members who were already implementing it well to begin with. We explore the extent to which these two mechanisms are occurring in our sample by running the following regression for the change in management practice j in firm i assigned to group g:

$$\Delta \mathrm{Practice}_{i,j,g} = \alpha + \beta\,\overline{\Delta \mathrm{Practice}}_{-i,j,g} + \gamma\,\max_{k \neq i} \mathrm{Practice}_{k,j,g,0} + \varepsilon_{i,j,g} \quad (2)$$

where $\overline{\Delta \mathrm{Practice}}_{-i,j,g}$ denotes the mean change in practice j for other members in i’s group, and $\max_{k \neq i} \mathrm{Practice}_{k,j,g,0}$ denotes the maximum level of practice j at baseline among other members in i’s group. We stack the 141 individual practices, and then cluster the standard errors at the firm level.

Table 4 reports the results of estimating equation (2). Column 1 shows that there is a significant positive association between the change in a practice for a firm and the mean change made by other firms in their group.
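The two regressors in equation (2) are both "leave-one-out" quantities: they are computed over the other members of firm i's group, excluding firm i itself. The sketch below shows one way they could be constructed for a single practice within a single group. The function name and the example scores are hypothetical and purely illustrative; they are not the study's data or code.

```python
# Illustrative construction of the two regressors in equation (2) for one
# practice j within one group g. All names and numbers are hypothetical.
from statistics import mean


def peer_regressors(changes, baselines, firm):
    """Return (mean change among other firms, max baseline among other firms).

    changes:   dict firm -> change in the practice score (endline - baseline)
    baselines: dict firm -> baseline practice score on the 1-5 scale
    firm:      the firm i whose own values are excluded
    """
    others = [f for f in changes if f != firm]
    if not others:
        raise ValueError("a group needs at least two firms")
    mean_change_others = mean(changes[f] for f in others)
    max_baseline_others = max(baselines[f] for f in others)
    return mean_change_others, max_baseline_others


# Made-up scores for a three-firm group.
example_changes = {"A": 1.0, "B": 0.0, "C": 2.0}
example_baselines = {"A": 2, "B": 4, "C": 3}
print(peer_regressors(example_changes, example_baselines, "A"))
```

Excluding the firm's own values matters: including them would mechanically correlate the regressors with the outcome on the left-hand side.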
Column 2 shows that, in contrast, there is no significant relationship with the highest baseline level of practices observed amongst other firms in the group. Column 3 controls for both factors together and confirms the significant and positive association with the average change made by others in the group. A one-unit change (on a 5-point scale) in the practice by others in the group is associated with a 0.1 unit change by the firm. This suggests some coordinated experimentation and learning is taking place within groups, but that group members are not taking existing best practices from other group members across into their own firms.

6. Impacts on Firm Outcomes

6.1 Impact on Employment

Employment is a key outcome for several reasons. First, from the policy side, governments around the world are interested in increasing employment in larger, more formal firms. This is the case in Colombia, where the average unemployment rate was 8 percent during our intervention period, and where 47 percent of those who were employed were in informal jobs. As shown in Appendix 8, almost all employment in our firms is formal and eligible for social security and health benefits, and the mean (median) monthly wages of firms in our sample of $492 ($331) in 2018 are well above the minimum monthly salary ($248) and the median monthly wage ($283).14 Second, from a measurement perspective, (paid formal) employment is the best measure we have of growth in firm size. This is both a result of data coverage (formal employment data are available for more firms and over a longer time period than any of the other outcomes we consider), and of the inherent volatility in firm sales (Lewis and Rao, 2015) and potential problems of firms strategically misreporting sales because of taxation concerns (Carillo et al., 2017). For these reasons, employment is also the main measure of firm growth that Bruhn et al. (2018) highlight in their individualized consulting experiment.
Finally, from a theory perspective, employment growth is a key marker of firm size and productivity as firms age (e.g. Hsieh and Klenow, 2014).

14 2018 numbers use an exchange rate of 3,155 COP to 1 USD. The minimum monthly salary in Colombia for 2018 was 781,242 COP, and median monthly wage was 882,500.

Given the heterogeneity amongst firms in initial employment size, and the differences in coverage of the different data sources, we use firm fixed effects in estimating the treatment impacts. We estimate the following equation for firm i at time t:

$$Y_{i,t} = \beta_1\,\mathrm{Individual}_i \ast \mathrm{During}_{i,t} + \beta_2\,\mathrm{Group}_i \ast \mathrm{During}_{i,t} + \beta_3\,\mathrm{Individual}_i \ast \mathrm{Post}_{i,t} + \beta_4\,\mathrm{Group}_i \ast \mathrm{Post}_{i,t} + \alpha_i + \sum_{s} \delta_s\,1(s=t) + \varepsilon_{i,t} \quad (3)$$

where the $\alpha_i$ are firm fixed effects, During and Post indicate the periods during the individual or group interventions, and after these interventions respectively, $1(s=t)$ are time fixed effects, Individual and Group denote assignment to the individual and group treatment status respectively, and the standard errors $\varepsilon_{i,t}$ are clustered at the firm level. The randomization triplets are subsumed by the firm fixed effects here. We consider both levels and the inverse hyperbolic sine of employment as outcomes.

Table 5 presents the treatment impacts on employment. The first two columns use the employment data obtained from firms. While these data are available for some 145 firms for some months, only 108 of the firms have data for much of 2017. The group treatment results in a statistically significant increase in employment of 6 workers post-treatment, or 12 percent. In contrast, the individual treatment results in negative point estimates on the level of employment, and an effect which is significantly different from the group treatment at the 10 percent significance level. Columns 3 onwards of Table 5 use formal employment data from the PILA. Column 3 first documents that treatment had no significant impact on firm survival.
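As a brief illustration of the inverse hyperbolic sine transformation used for the employment outcomes, the sketch below shows two standard properties: it is defined at zero (unlike the logarithm), and for large values it is approximately log(2x), so coefficients can be read in roughly the same way as in a log specification. The helper name `ihs` and the example employment levels are illustrative only.

```python
# Minimal illustration of the inverse hyperbolic sine (IHS) transformation,
# asinh(x) = ln(x + sqrt(x^2 + 1)). Example values are illustrative only.
import math


def ihs(x: float) -> float:
    """Inverse hyperbolic sine of x."""
    return math.asinh(x)


# Defined at zero, unlike log(x).
print(ihs(0.0))

# For a firm with 50 workers, asinh(x) is very close to log(2x).
print(ihs(50.0), math.log(2 * 50.0))

# A 12 percent increase from 50 workers shifts the IHS by about log(1.12).
print(ihs(56.0) - ihs(50.0), math.log(1.12))
```

Being defined at zero is what allows the transformation to be applied when employment is coded as zero after a firm closes.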
Columns 4 through 7 use the same time period as we have firm data for (January 2013 to December 2017) to enable comparison to our firm data results and because this is the period over which we can examine other outcomes. Columns 4 and 5 show that when we consider the employment levels of surviving firms, the group treatment firms are significantly larger in size, with similar magnitudes as found with the firm data. In contrast, the individual treatment has smaller impacts on employment, which are not significantly different from zero post-treatment, and which are significantly different from the group treatment when using the inverse hyperbolic sine. Columns 6 and 7 consider unconditional employment as the outcome, coding employment as zero once firms die. The point estimates still suggest a 4 worker (9 percent) increase in employment after the group treatment, but the standard errors are larger, and these impacts are no longer statistically significantly different from zero, or from the individual treatment. Finally, columns 8 and 9 add in the 2018 data, and examine the extent to which treatment effects vary with time since treatment. The group treatment has positive treatment effects in all three years post-intervention, and we cannot reject equality of treatment effects over time.

Figure 4 shows the distribution of changes in employment between 2013 and 2017 by treatment status. The control and individual interventions have similar distributions, with changes centered on zero. In contrast, the group treatment has a much smaller peak and more positive mass to the right of zero, indicating firms which expanded. However, the figure also shows long tails, coming from the small number of large firms which closed down; these long tails are what cause the standard errors to increase in the unconditional estimates.
Figure A8.2 shows positive quantile treatment effects of the group treatment on the change in employment, which are significant around the 60th and 70th quantiles, but very wide confidence intervals at the bottom reflecting this couple of firms with large drops in employment.

An interesting question is where the increased employment in the group treated firms comes from and whether it changes the composition of the labor force in these firms. We were able to use anonymized worker-level data to examine these questions in more detail (see also Appendix 8). A first point to note is that there is considerable worker churn: there are 23,156 distinct workers who work at least one month in one of our firms in the 2013 to 2017 period, but only 7,500 to 8,000 workers in any given month. On average firms have 3 percent of their workforce leave each month and 3 percent join. Most of this churn comes from outside of the study firms: only 272 workers (1.2%) worked for two or more firms in our sample during this five-year period, and only 32 workers worked for firms in more than one treatment group. The growth in the group treatment firms therefore did not come from them hiring away workers already working in the other treatment groups. Table A8 then examines the impact of treatment on the composition of workers, finding no changes in the gender or age of workers with treatment, and no significant changes in worker retention or worker compensation, but point estimates suggesting that the group treatment firms retained more of their workers and slightly increased salaries.

6.2 Impact on Sales

Monthly sales data were collected directly from firm record books, and are converted into millions of real (December 2017) Colombian pesos using the Producer Price Index. We have some months of post-baseline sales data for 145 firms, and data on 99 firms for the balanced panel of all 60 months between January 2013 and December 2017.
Figure 5 uses the balanced panel and plots the trajectory of mean real sales by treatment group, demeaned by the 2013 treatment group means (left panel). We see the means of the three groups track each other closely until the group intervention starts. Firms in the group treatment then see mean sales increase relative to the other two groups, with this gap widest in the first six months after treatment, and then closing. The right panel shows the distribution of changes in annual sales between the year 2013 and year 2017. We see the control and individual treatment groups have similar distributions of change (Kolmogorov-Smirnov test of equality p-value 0.855), while we can reject equality of the individual and group distributions (p-value 0.032). The group intervention has more variation in the change of sales, with a few firms experiencing a drop in sales, and more firms also experiencing growth in sales than occurs in the other two groups.

Table 6 estimates equation (3) for the level and the inverse hyperbolic sine of sales, using firm fixed effects to account for potential baseline differences across treatments that can arise from sample attrition, firm heterogeneity, and the sample size. Columns 1 and 4 use the unbalanced panel, and columns 2, 3, and 5 the balanced panel. The group treatment has positive treatment effects on sales, of 63-71 million COP per month (USD $26,500-$29,900) in levels, or 9 to 10 percent in log terms. This treatment effect is not statistically significant compared to the control group (lowest p-value is 0.12 in column 1), but is statistically different from the individual treatment level effect post-intervention. The individual treatment effects have negative point estimates in level terms, and a point estimate close to zero post-intervention for the balanced panel for log sales.
6.3 Channels of Production Impact

The results on employment and sales suggest that the group intervention increased the size of the firm, causing it to employ more people and sell more. In Table 7 we examine different channels through which this increase may have occurred.

Column 1 considers the defect rate. Bloom et al. (2013) found quality improvements to be one of the first signs of improvement from better management in their Indian study. We only have defect data in 2017 for 78 of the firms in the study, due to many firms not keeping consistent records on defects. A first point to note is that the defect rates are low (which is one reason some firms do not record them): the control group has a mean defect rate of 0.025 and median rate of 0.007 in 2017, which compares to much higher defect rates in India (5 percent of output was scrapped, after mending of defects was done). The result is that many of the auto parts firms do not have much scope to reduce defects, and we see treatment effects that are all very close to zero and statistically insignificant.

Columns 2 and 3 consider monthly inventories. In India, Bloom et al. (2013) found firms had excess inventory levels, which they reduced when management improved. Large stockpiles of inventories are less common in the auto parts sector, with some firms doing job work and producing upon request. Data are only available for half the sample of firms, due to some firms not keeping records, or changing the units in which they record inventories over time. The control mean level of inventories is equal in value to 1.4 months of mean sales. We see no significant change in inventories, with the sign of the coefficients changing between level and log specifications. However, the confidence intervals are wide, and include the 21 percent reduction in inventories found in Bloom et al. (2013), as well as increases in inventories of more than this magnitude.
Columns 4 and 5 consider energy costs, which are another input into producing more. The data here are consistent with the group treatment firms getting larger by using more inputs to produce and sell more. They use more energy both during and post-intervention, with this increase statistically significant when measured in levels during the intervention. The log results suggest firms are using 17 percent more energy, although this is not statistically significant. In contrast, the pattern is more mixed for the individual treatment group, which has a statistically insignificant increase in energy costs when measured in levels, but a statistically insignificant decrease when measured in logs.

Column 6 examines whether the improvement in management has resulted in higher labor productivity (measured as real sales per worker). The percent increase in employment for group treated firms is slightly higher than the percent increase in sales, and the result is a small, and statistically insignificant, drop in labor productivity (3 percent). The individual treatment also has a small and negative point estimate on labor productivity. These results contrast with the 17 percent improvement in productivity found in India by Bloom et al. (2013). However, since the improvement in management in our experiment is only one-third that found in India, a proportional improvement in productivity would be only 5.7 percent.15 The confidence intervals of approximately [-13%, +8%] for the productivity effect found here include both the possibility of a productivity improvement of this magnitude, commensurate with what the existing literature would predict from a management improvement of this size, but also include the possibility that labor productivity fell.

15 See McKenzie and Woodruff (2017) for evidence from multiple developing countries that improvements in business practices appear to have a linear relationship with firm outcomes over large ranges, which would suggest that this assumption of proportional scaling is a reasonable approximation.

Finally, columns 7 and 8 examine the extent to which any increase in sales came through exports. We use administrative data on exports, which have the advantage of being available for all firms and all months. Sixty percent of firms exported in at least one month between January 2013 and December, but on average, only 21 percent of firms export in a given month. As a result, most sales are domestic: exports are 0 percent of monthly sales for the median firm, and 3.8 percent for the mean; and even conditional on exporting in a month, exports for the median exporter are only 14 percent of that month’s sales. Column 7 shows that there is a small, negative, and insignificant effect on the extensive margin of whether firms export at all in a given month. Column 8 shows negative and statistically insignificant impacts on the amount exported, conditional on exporting. Thus, any gains in sales have come through increased domestic sales, not through more exporting.

6.4 Comparison to Policy Maker Expectations

In June 2014, we elicited expectations about the program’s impact on employment and productivity from 15 policy makers drawn from the Ministry of Planning (DNP), Ministry of Commerce and Tourism, SENA, and Program of Productive Transformation (PTP). The expected mean (median) treatment effect for the individual treatment was 5.7% (3%) for employment and 16.3% (10%) for productivity; while for the group treatment the expected mean (median) treatment effect was 3.3% (5%) for employment, and 7.3% (5%) for productivity. Our estimated treatment effects for the group treatment are similar in magnitude to these estimates, while the individual treatment has under-performed relative to expectations, especially on productivity.
Moreover, the policy makers thought the individual treatment would have a larger impact, which is the opposite of what we find. We also asked what size impacts they would require to consider the program a success that could be scaled at the national level: the mean response was 6% for employment for both programs, and, for productivity, 24% for the individual program and 13% for the group program. The estimated impact of the group intervention on employment is thus large enough to be considered a success, whereas neither program has enough of an impact on productivity to be considered a success.

6.5 Cost-Benefit

Both the individual and group treatments achieved improvements of similar magnitude in the set of management practices measured by the Anexo K. The impacts on firm outcomes are less precisely measured, but show an increase in firm size for the group treatment that, in some specifications, is statistically different from that of the individual treatment. The group treatment cost US$10,500 per firm for the intervention stage, compared to US$28,950 per firm for the individual treatment. The group treatment therefore clearly dominates the individual treatment on a cost-benefit basis.

It is more difficult to measure whether the group treatment pays for itself, given the uncertainty associated with the sales impact, and that we lack firm profitability data over time. Baseline data suggest that profit margins are 11 percent of sales for the median firm. If we take the estimated group treatment effect on sales of US$26,500-$29,900 per month, and multiply this by the profit rate, this gives a suggested point estimate of US$3,000 per month in profits, in which case the group treatment would pay for itself within 4 months. If the sales effect is one standard error below the point estimate, then the estimated profit effect would be approximately $750 per month, and it would pay for itself within 14 months.
Since 84 percent of the distribution of treatment effects lies at or above this level, this suggests the group treatment would pay for itself in just over a year, and within the period over which we measure post-intervention outcomes.

These cost-benefit calculations would look less promising from a government policy perspective if the gains to treated firms came from them capturing sales from control firms or from other firms outside of the experimental sample. At least within our experimental sample, firms specialize in different products (which is what allowed groups to be formed easily without including firms that are competitors), suggesting that the internal validity of our estimates should not be compromised by such spillovers. Moreover, as noted in our discussion of the setting, the sector is one where the main competitors to most firms are imports, which became more expensive with the depreciation of the peso. It therefore seems likely that any sales gains achieved by the group treatment would have mostly come from taking business away from imports.

6.6 Why did the group treatment do better than the individual?

The group and individual treatments led to similar improvements in management practices, yet we only find evidence of improvements in firm outcomes for the group treatment. What explains this difference? A first possibility is that the two treatments did have similar effects, and it is just small sample sizes coupled with firm heterogeneity that prevent us from detecting this effect in the individual treatment group. Although the point estimates show larger impacts from the group treatment, we can only weakly reject equality of the treatment effects of the two interventions in some specifications of employment or sales, while we cannot reject equality for others.
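As an aside on the payback arithmetic of Section 6.5, the calculation can be sketched in a few lines of Python. The cost, profit margin, sales-effect range, and the two monthly profit figures are the ones quoted in the text; the helper names are hypothetical, and the 84 percent figure is reproduced here under the assumption that the treatment-effect estimate is normally distributed, which the text does not state explicitly.

```python
import math

# Figures quoted in Section 6.5 of the text.
GROUP_COST = 10_500               # US$ per firm, intervention stage
PROFIT_MARGIN = 0.11              # median baseline profit margin
SALES_EFFECT = (26_500, 29_900)   # US$ per month, estimated group effect on sales


def payback_months(cost, monthly_profit):
    """Whole months until cumulative extra profit covers the cost (hypothetical helper)."""
    return math.ceil(cost / monthly_profit)


def share_at_least_one_se_below():
    """P(effect >= point estimate - 1 s.e.), ASSUMING a normal distribution."""
    return 0.5 * (1.0 + math.erf(1.0 / math.sqrt(2.0)))


# Monthly profit implied by applying the margin to each end of the sales range.
profit_range = tuple(PROFIT_MARGIN * s for s in SALES_EFFECT)

# Payback at the rounded point estimate and at one standard error below it.
months_at_point_estimate = payback_months(GROUP_COST, 3_000)
months_one_se_below = payback_months(GROUP_COST, 750)

print(profit_range)
print(months_at_point_estimate, months_one_se_below)
print(round(share_at_least_one_se_below(), 2))
```

Under these inputs the sketch gives payback periods of 4 and 14 months and a share of about 0.84, matching the figures in the text.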
A second possibility is that the group treatment may have a larger impact because it either provides a way for the improvements in management to persist longer, or because it delivers additional benefits to firms beyond the improvements they obtain in management practices. To investigate this possibility, group firms were asked approximately one year after the intervention whether they still met with other group members, and what the main benefit of meeting in a group had been. None of the firms continued formally meeting together as a group, but 54 percent said they still communicate occasionally with other group members. The main benefit they saw of meeting in a group was to exchange experiences, noting the value of seeing other firms facing similar problems, and how others had solved these problems. Only four firms said they saw a possibility of using the group to find a supplier or customer, with only one giving an example of this actually happening, saying it was short-lived. This suggests that if the group treatment is having an additional effect, it is more through providing advice and specific solutions to problems firms face (as in Brooks et al., 2018) or experiencing directly how others implement better managerial practices, and reducing uncertainty about their usefulness, rather than through direct business relationships.

7. Conclusions

The experiment of Bloom et al. (2013) provided a proof-of-concept that poor management could be improved. But moving from a pilot demonstration to a scalable program of management improvement requires lowering the cost of delivery and testing whether such a program can be locally implemented when subject to the constraints imposed by government bureaucracy. As is common with other social programs (Rossi 1987, Vivalt 2017), impacts on management are smaller when delivered by programs run by a government at scale than under a small researcher pilot.
Yet, both the individual and group treatments were able to improve management practices by 8 to 10 percentage points, and this resulted in an increase in firm size under the group treatment at least. As a result, the group treatment model pioneered here clearly dominates the individual consulting model on a cost-benefit basis, and offers a promising approach to scaling up management improvements.

As with firms, good management also matters for the public sector (Rasul and Rogger, 2018), and there were several challenges to implementation. These included delays in contracts, which caused challenges for data collection, and delays in implementation, which likely reduced the effectiveness of the programs implemented. It is also possible that contracting only a single organization to implement the intervention may have led to hold-up problems and removed the performance incentives that competition among consulting firms could have provided. A government contemplating scaling up management support programs in the least costly way therefore should consider the group extension approach, but pay careful attention to the quality of its own management in doing so.

References

Banerjee, Abhijit, Rukmini Banerji, James Berry, Esther Duflo, Harini Kannan, Shobhini Mukerji, Marc Shotland and Michael Walton (2017) “From Proof of Concept to Scaleable Policies: Challenges and Solutions, with an Application”, Journal of Economic Perspectives 31(4): 73-102.

BBVA Research (2018) “Situación Automotriz 2018 Colombia”, BBVA Research, March.

Bloom, Nicholas, and John Van Reenen (2007) “Measuring and Explaining Management Practices across Firms and Countries”, Quarterly Journal of Economics, 122(4): 1341-1408.

Bloom, Nicholas, Benn Eifert, Aprajit Mahajan, David McKenzie, and John Roberts (2013) “Does Management Matter? Evidence from India”, Quarterly Journal of Economics, 128(1): 1-51.

Bloom, Nicholas, Aprajit Mahajan, David McKenzie, and John Roberts (2018) “Do Management Interventions Last? Evidence from India”, American Economic Journal: Applied Economics, forthcoming.

Bloom, Nicholas, Raffaella Sadun, and John Van Reenen (2016) “Management as a Technology”, NBER Working Papers 22327, National Bureau of Economic Research, Inc.

Bloom, Nicholas, Erik Brynjolfsson, Lucia Foster, Ron Jarmin, Megha Patnaik, Itay Saporta-Eksten, and John Van Reenen (2018) “What Drives Differences in Management Practices?”, American Economic Review, forthcoming.

Bold, Tessa, Mwangi Kimenyi, Germano Mwabu, Alice Ng'ang'a and Justin Sandefur (2018) “Experimental Evidence on Scaling Up Education Reforms in Kenya”, Journal of Public Economics, 168: 1-20.

Brooks, Wyatt, Kevin Donovan and Terence Johnson (2018) “Mentors or Teachers? Microenterprise Training in Kenya”, American Economic Journal: Applied Economics, 10(4): 196-221.

Bruhn, Miriam, Dean Karlan, and Antoinette Schoar (2018) “The Impact of Consulting Services on Small and Medium Enterprises: Evidence from a Randomized Trial in Mexico”, Journal of Political Economy, 126(2): 635-87.

Cai, Jing and Adam Szeidl (2018) “Interfirm Relationships and Firm Performance”, Quarterly Journal of Economics 133(3): 1229-1282.

Carrillo, Paul, Dina Pomeranz, and Monica Singhal (2017) “Dodging the Taxman: Firm Misreporting and Limits to Tax Enforcement”, American Economic Journal: Applied Economics, 9(2): 144-64.

Chatterji, Aaron, Solene Delecourt, Sharique Hasan and Rembrand Koning (2018) “When does Advice Impact Startup Performance?”, Strategic Management Journal, forthcoming.

Dalton, Patricio, Julius Rüschenpöhler, Burak Uras and Bilal Zia (2018) “Learning Business Practices from Peers: Experimental Evidence from Small-Scale Retailers in an Emerging Market”.

Giorcelli, Michela (2019) “The long-term effects of management and technology transfers”, American Economic Review 109(1): 121-52.

Higuchi, Yuki, Vu Hoang Nam and Tetsushi Sonobe (2017) “Management skill, entrepreneurial motivation, and enterprise survival: Evidence from randomized experiments and repeated surveys in Vietnam”, Mimeo.

Hsieh, Chang-Tai and Peter Klenow (2014) “The Life Cycle of Plants in India and Mexico”, Quarterly Journal of Economics 129(3): 1035-1084.

Lafortune, Jeanne, Julio Riutort and José Tessada (2018) “Role models or individual consulting: The impact of personalizing micro-entrepreneurship training”, American Economic Journal: Applied Economics, 10(4): 222-45.

Lewis, Randall and Justin Rao (2015) “The unfavorable economics of measuring the returns to advertising”, Quarterly Journal of Economics 130(4): 1941-73.

Londoño, Andrés (2017) “Low Productivity: the Elephant in the Room in Colombia’s Minimum Wage Debate”, Panam Post, November 28. https://panampost.com/andres-londono/2017/11/28/low-productivity-minimum-wage-debate/

McKenzie, David (2012) “Beyond Baseline and Follow-up: The case for more T in experiments”, Journal of Development Economics, 99(2): 210-21.

McKenzie, David and Christopher Woodruff (2017) “Business Practices in Small Firms in Developing Countries”, Management Science, 63(9): 2967-81.

McKenzie, David and Christopher Woodruff (2014) “What are we learning from business training evaluations around the developing world?”, World Bank Research Observer, 29(1): 48-82.

Proexport Colombia (2012) “Automotive Industry in Colombia”, http://www.investincolombia.com.co/attachments/Automotive%20Industry%20in%20Colombia%20-%20April%202012.pdf [accessed February 16, 2015]

Rasul, Imran and Daniel Rogger (2018) “Management of Bureaucrats and Public Service Delivery: Evidence from the Nigerian Civil Service”, Economic Journal 128(608): 413-446.

Reina, Mauricio, Sandra Oviedo and Jonathan Moreno (2014) “Importancia Económica del Sector Automotor en Colombia”, Fedesarrollo, Bogota.

Rossi, Peter (1987) “The Iron Law Of Evaluation And Other Metallic Rules”, pp. 3-20 in Joan Miller and Michael Lewis (ed.) Research in Social Problems and Public Policy volume 4. Jai Press Inc.

Vivalt, Eva (2017) “How much can we generalize from impact evaluations?”, Mimeo. ANU.

Figure 1: Trajectory of Impacts on Management Practices

Notes: Means shown by treatment status. Anexo K was measured at baseline (2013) for all firms. It was then measured monthly during implementation of the individual and group treatments, along with a one-year follow-up, and was measured for the control group at the same time as the end of the individual intervention, and at the time of the individual one-year follow-up. Vertical lines indicate approximate periods of implementation of the individual intervention (first two lines) and group intervention (second two lines). Data are for the unbalanced panel, although the figure looks similar for the balanced panel.

Figure 2: Impact on Distribution of Management Practices

Notes: Kernel densities shown of Anexo K management practices at baseline, and at last follow-up, for the balanced panel of firms for which these practices were measured at all points in time. Kolmogorov-Smirnov tests of equality of distributions at baseline have p-values 0.210 (control vs individual), 0.998 (control vs group), and 0.422 (individual vs group); and at endline have p-values 0.004 (control vs individual), 0.003 (control vs group), and 0.643 (individual vs group).

Figure 3: The Individual and Group Treatments Improved Specific Practices to a Similar Extent

Notes: Empty circles denote that the difference between the two treatments is not statistically significant at the 5% level; solid circles indicate that the difference between the two treatments is statistically significant at the 5% level; solid diamonds indicate that the difference is statistically significant at the 1% level. Correlation between group treatment effect and individual treatment effect is 0.71. 45 degree line shown.
Figure 4: Distribution of Changes in Employment 2013 to 2017

Notes: Employment data are formal employment data taken from the PILA, and are shown for the 149 firms that have data for both 2013 and 2017 (including zeros for firms that close). Kernel densities show the distribution of the difference in mean employment for each firm in 2017 compared to in 2013.

Figure 5: Trajectory of Sales and Distribution of Changes in Sales

Notes: Sales are reported in millions of real (December 2017) Colombian pesos, and are shown for the 99 firms that have data for every month between Jan 2013 and Dec 2017. Left panel demeans sales by the treatment group mean in 2013. Vertical lines in left panel show the period of the individual intervention (first two lines) and group intervention (second two lines). Right panel shows the kernel density of the change in the inverse hyperbolic sine of sales for the year 2017 compared to the year 2013 by treatment status.

Table 1: Baseline Balance

| | Overall Sample: Mean | Overall Sample: S.D. | Mean: Control Group | Mean: Individual Consulting | Mean: Group Consulting | p-value: Control v Individual | p-value: Control v Group | p-value: All 3 Equal |
|---|---|---|---|---|---|---|---|---|
| Variables used for matched triplets | | | | | | | | |
| Number of Employees | 59 | 53 | 64 | 61 | 53 | 0.841 | 0.285 | 0.464 |
| Small Firm (<=50 employees) | 0.59 | 0.49 | 0.60 | 0.58 | 0.58 | 0.845 | 0.845 | 0.975 |
| Medium Firm (>50 employees) | 0.41 | 0.49 | 0.40 | 0.42 | 0.42 | 0.845 | 0.845 | 0.975 |
| Cundinamarca | 0.48 | 0.50 | 0.55 | 0.49 | 0.40 | 0.564 | 0.122 | 0.291 |
| Valle | 0.16 | 0.37 | 0.17 | 0.09 | 0.23 | 0.255 | 0.469 | 0.157 |
| Labor Productivity | 31 | 18 | 26 | 32 | 34 | 0.059 | 0.027 | 0.030 |
| Financing Practices | 51 | 14 | 51 | 48 | 53 | 0.225 | 0.508 | 0.164 |
| Human Resources Practices | 43 | 12 | 42 | 42 | 43 | 0.897 | 0.686 | 0.843 |
| Logistics Practices | 46 | 13 | 49 | 43 | 47 | 0.017 | 0.457 | 0.050 |
| Marketing Practices | 46 | 15 | 47 | 43 | 46 | 0.190 | 0.687 | 0.409 |
| Production Practices | 47 | 13 | 47 | 47 | 46 | 0.963 | 0.881 | 0.989 |
| Variables not explicitly balanced on | | | | | | | | |
| Level 2 Supplier | 0.94 | 0.24 | 0.94 | 0.94 | 0.92 | 1.000 | 0.699 | 0.909 |
| Metal Products | 0.60 | 0.49 | 0.75 | 0.51 | 0.53 | 0.009 | 0.015 | 0.011 |
| Plastic Products | 0.18 | 0.38 | 0.15 | 0.17 | 0.21 | 0.794 | 0.452 | 0.749 |
| Firm Age (Years) | 24 | 14 | 27 | 23 | 22 | 0.177 | 0.058 | 0.147 |
| Anexo K score | 46 | 10 | 47 | 45 | 47 | 0.200 | 0.955 | 0.353 |
| USD Sales in 2013 | 2715957 | 3387147 | 2134280 | 3345606 | 2703821 | 0.098 | 0.303 | 0.196 |
| Export at all in 2013 | 0.45 | 0.50 | 0.47 | 0.42 | 0.45 | 0.562 | 0.847 | 0.839 |
| Sample Size | 159 | | 53 | 53 | 53 | | | |

Table 2: Impact on Management Practices

Panel A: Unbalanced Panel

| | Overall Score | Finance Practices | HR Practices | Logistics Practices | Marketing Practices | Production Practices |
|---|---|---|---|---|---|---|
| Individual Treatment*During Intervention | 9.703*** | 9.644*** | 10.793*** | 8.708*** | 10.637*** | 5.696*** |
| | (1.370) | (1.852) | (1.822) | (1.603) | (2.280) | (1.806) |
| Individual Treatment*Post Intervention | 9.620*** | 9.712*** | 8.974*** | 8.585*** | 9.451*** | 8.488*** |
| | (1.830) | (2.413) | (2.508) | (2.457) | (2.466) | (1.993) |
| Group Treatment*During Intervention | 11.971*** | 13.841*** | 12.249*** | 9.327*** | 11.899*** | 11.798*** |
| | (1.660) | (2.057) | (2.078) | (2.047) | (2.599) | (1.993) |
| Group Treatment*Post Intervention | 8.544*** | 9.820*** | 7.156*** | 5.860** | 9.046*** | 10.694*** |
| | (1.894) | (2.306) | (2.655) | (2.539) | (2.637) | (2.048) |
| Sample Size | 225 | 226 | 226 | 225 | 226 | 225 |
| P-value: Individual=Group During | 0.145 | 0.027 | 0.451 | 0.753 | 0.568 | 0.002 |
| P-value: Individual=Group Post | 0.533 | 0.958 | 0.365 | 0.235 | 0.864 | 0.315 |
| Control Mean | 55.98 | 59.18 | 52.39 | 57.75 | 54.80 | 55.79 |
| Control SD | 10.79 | 13.79 | 11.25 | 14.33 | 12.58 | 11.19 |

Panel B: Balanced Panel

| | Overall Score | Finance Practices | HR Practices | Logistics Practices | Marketing Practices | Production Practices |
|---|---|---|---|---|---|---|
| Individual Treatment*During Intervention | 9.861*** | 10.608*** | 11.111*** | 8.639*** | 9.072*** | 6.803*** |
| | (1.756) | (2.277) | (2.328) | (1.962) | (2.985) | (2.010) |
| Individual Treatment*Post Intervention | 9.757*** | 10.118*** | 9.463*** | 8.629*** | 8.568*** | 8.935*** |
| | (2.014) | (2.650) | (2.780) | (2.646) | (2.723) | (2.078) |
| Group Treatment*During Intervention | 12.118*** | 15.094*** | 12.227*** | 8.942*** | 11.309*** | 12.688*** |
| | (2.029) | (2.373) | (2.583) | (2.413) | (3.349) | (2.279) |
| Group Treatment*Post Intervention | 8.889*** | 9.912*** | 7.502** | 6.022** | 9.166*** | 11.513*** |
| | (2.067) | (2.490) | (2.912) | (2.729) | (2.920) | (2.157) |
| Sample Size | 202 | 202 | 202 | 202 | 202 | 202 |
| P-value: Individual=Group During | 0.152 | 0.027 | 0.555 | 0.881 | 0.341 | 0.006 |
| P-value: Individual=Group Post | 0.627 | 0.925 | 0.343 | 0.274 | 0.813 | 0.248 |
| Control Mean | 55.98 | 59.18 | 52.39 | 57.75 | 54.80 | 55.79 |
| Control SD | 10.79 | 13.79 | 11.25 | 14.33 | 12.58 | 11.19 |

Notes: Panel A is for the 124 firms for which Anexo K management practices are measured post-baseline, panel B for the 101 firms for which practices are measured both during and after intervention. Robust standard errors in parentheses, clustered at the firm level. *, **, *** denote significance at the 10, 5, and 1 percent levels respectively. Anexo K management practices are 141 management practices divided into five sub-areas. Ancova estimation controls for baseline (December 2013) mean, and time fixed effects included, along with randomization triplet dummies. Note: Group treatment moved back one period, since no control group data collected during 2016.

Table 3: Robustness of Impact on Management Practices to different weighting schemes

Panel A: Unbalanced Panel

| | Overall Anexo K | Principal component | Lasso: Log Employ. | Lasso: Productivity | Lasso: WMS |
|---|---|---|---|---|---|
| Individual Treatment*During Intervention | 9.703*** | 6.014*** | 0.227*** | 7.065*** | 0.079** |
| | (1.370) | (0.946) | (0.085) | (1.238) | (0.036) |
| Individual Treatment*Post Intervention | 9.620*** | 6.012*** | 0.286** | 8.297*** | 0.140*** |
| | (1.830) | (1.217) | (0.115) | (1.811) | (0.041) |
| Group Treatment*During Intervention | 11.971*** | 7.266*** | 0.403*** | 9.269*** | 0.240*** |
| | (1.660) | (1.177) | (0.090) | (1.463) | (0.040) |
| Group Treatment*Post Intervention | 8.544*** | 5.512*** | 0.301*** | 7.596*** | 0.225*** |
| | (1.894) | (1.220) | (0.106) | (1.706) | (0.040) |
| Sample Size | 225 | 200 | 213 | 217 | 221 |
| P-value: Individual=Group During | 0.145 | 0.208 | 0.020 | 0.111 | 0.000 |
| P-value: Individual=Group Post | 0.533 | 0.658 | 0.862 | 0.670 | 0.043 |
| Control Mean | 55.98 | 5.59 | 2.46 | 43.01 | 0.93 |
| Control SD | 10.79 | 6.03 | 0.47 | 9.66 | 0.20 |

Panel B: Balanced Panel

| | Overall Anexo K | Principal component | Lasso: Log Employ. | Lasso: Productivity | Lasso: WMS |
|---|---|---|---|---|---|
| Individual Treatment*During Intervention | 9.861*** | 6.048*** | 0.273** | 7.302*** | 0.100** |
| | (1.756) | (1.327) | (0.119) | (1.602) | (0.049) |
| Individual Treatment*Post Intervention | 9.757*** | 5.972*** | 0.309** | 8.451*** | 0.148*** |
| | (2.014) | (1.402) | (0.122) | (2.003) | (0.044) |
| Group Treatment*During Intervention | 12.118*** | 7.494*** | 0.445*** | 9.624*** | 0.263*** |
| | (2.029) | (1.525) | (0.118) | (1.781) | (0.051) |
| Group Treatment*Post Intervention | 8.889*** | 5.736*** | 0.361*** | 8.009*** | 0.242*** |
| | (2.067) | (1.416) | (0.111) | (1.914) | (0.043) |
| Sample Size | 202 | 178 | 190 | 194 | 198 |
| P-value: Individual=Group During | 0.152 | 0.174 | 0.032 | 0.114 | 0.000 |
| P-value: Individual=Group Post | 0.627 | 0.844 | 0.539 | 0.797 | 0.032 |
| Control Mean | 55.98 | 5.59 | 2.46 | 43.01 | 0.93 |
| Control SD | 10.79 | 6.03 | 0.47 | 9.66 | 0.20 |

Notes: Panel A is for the 124 firms for which Anexo K management practices are measured post-baseline, panel B for the 101 firms for which practices are measured both during and after intervention. Robust standard errors in parentheses, clustered at the firm level. *, **, *** denote significance at the 10, 5, and 1 percent levels respectively. Anexo K management practices are 141 management practices divided into five sub-areas. Ancova estimation controls for baseline (December 2013) mean, time and triplet fixed effects. Principal Component takes the first principal component of the 141 practices. Remaining columns use Lasso to choose the subset of practices that best predict log baseline employment, log labor productivity, and the WMS baseline management score respectively, with post-Lasso coefficients then providing the weightings on the different practices used.

Table 4: Correlation of Practice Changes Within Groups

Dependent Variable: Change in Practice between Baseline and Endline

| | (1) | (2) | (3) |
|---|---|---|---|
| Mean Change in Practice for other Group Members | 0.100* | | 0.104** |
| | (0.050) | | (0.049) |
| Maximum Baseline Level of Practice for Other Group Members | | 0.001 | 0.014 |
| | | (0.021) | (0.019) |
| Sample Size (Firms*Practices) | 5069 | 5210 | 5069 |
| Mean Change in Practices | 0.168 | 0.171 | 0.168 |

Notes: Regression uses the stacked panel of 141 practices for firms in the group treatment. Robust standard errors in parentheses, clustered at the firm level. *, **, and *** denote significance at the 10, 5, and 1 percent levels respectively.

Table 5: Impact on Employment

| | Firm Data, Jan 2013-Dec 2017: Level | Firm Data, Jan 2013-Dec 2017: I.H.S. | Firm Survival | PILA, Jan 2013-Dec 2017, Conditional: Level | PILA, Jan 2013-Dec 2017, Conditional: I.H.S. | PILA, Jan 2013-Dec 2017, Unconditional: Level | PILA, Jan 2013-Dec 2017, Unconditional: I.H.S. | PILA, Jan 2013-Dec 2018, Conditional: Level | PILA, Jan 2013-Dec 2018, Conditional: I.H.S. |
|---|---|---|---|---|---|---|---|---|---|
| Individual Treatment*During Intervention | -3.012 | -0.018 | | -1.987 | -0.058* | -1.048 | 0.001 | -2.056 | -0.063* |
| | (2.912) | (0.040) | | (2.339) | (0.035) | (2.279) | (0.045) | (2.129) | (0.033) |
| Individual Treatment*Post Intervention | -2.150 | 0.040 | 0.019 | 1.222 | 0.027 | 2.563 | 0.113 | | |
| | (3.741) | (0.052) | (0.049) | (4.253) | (0.066) | (4.103) | (0.094) | | |
| Group Treatment*During Intervention | 3.837* | 0.101** | | 4.685 | 0.097* | 3.081 | 0.081 | 4.815 | 0.104** |
| | (2.268) | (0.039) | | (3.053) | (0.050) | (3.386) | (0.094) | (3.155) | (0.052) |
| Group Treatment*Post Intervention | 5.874** | 0.121** | 0.019 | 6.806* | 0.164** | 4.044 | 0.087 | | |
| | (2.848) | (0.049) | (0.049) | (3.746) | (0.068) | (4.199) | (0.133) | | |
| Individual Treatment*Year 1 Post | | | | | | | | 0.104 | -0.008 |
| | | | | | | | | (3.071) | (0.043) |
| Individual Treatment*Year 2 Post | | | | | | | | 1.662 | 0.049 |
| | | | | | | | | (4.534) | (0.071) |
| Individual Treatment*Year 3 Post | | | | | | | | 1.698 | 0.014 |
| | | | | | | | | (5.224) | (0.098) |
| Group Treatment*Year 1 Post | | | | | | | | 7.516* | 0.173** |
| | | | | | | | | (4.016) | (0.076) |
| Group Treatment*Year 2 Post | | | | | | | | 6.545 | 0.169* |
| | | | | | | | | (4.832) | (0.093) |
| Group Treatment*Year 3 Post | | | | | | | | 5.549 | 0.158 |
| | | | | | | | | (5.324) | (0.096) |
| Sample Size (N*T) | 7299 | 7299 | 159 | 8553 | 8553 | 9076 | 9076 | 10225 | 10225 |
| Number of Firms | 145 | 145 | 159 | 146 | 146 | 156 | 156 | 146 | 146 |
| P-value: Individual=Group During | 0.058 | 0.033 | | 0.036 | 0.013 | 0.258 | 0.412 | 0.041 | 0.011 |
| P-value: Individual=Group Post | 0.072 | 0.190 | 1.000 | 0.112 | 0.044 | 0.734 | 0.844 | 0.322 | 0.128 |
| P-value: Individual Year 1 = Year 2 = Year 3 | | | | | | | | 0.801 | 0.269 |
| P-value: Group Year 1 = Year 2 = Year 3 | | | | | | | | 0.744 | 0.943 |
| Control Mean in 2013 | 56.077 | 4.360 | 0.937 | 59.291 | 59.291 | 56.219 | 56.219 | 59.291 | 4.420 |
| Control S.D. in 2013 | 51.328 | 0.864 | | 51.950 | 51.950 | 51.190 | 51.190 | 51.950 | 0.890 |

Notes: Fixed effects regressions with firm and time fixed effects. Standard errors clustered at the firm level are in parentheses. *, **, *** denote significance at the 10, 5, and 1 percent levels respectively. Level denotes monthly level of employment; I.H.S. is inverse hyperbolic sine transformation. Firm data are taken from firm records, PILA data are formal employment data from administrative records. Conditional is for the group of surviving firms, unconditional codes employment as zero once firm dies.

Table 6: Impact on Sales

| | Monthly Sales | Monthly Sales | Monthly Sales | I.H.S. Monthly Sales | I.H.S. Monthly Sales |
|---|---|---|---|---|---|
| Individual Treatment*During Intervention | -18 | -38 | -22 | 0.054 | -0.026 |
| | (29) | (35) | (30) | (0.044) | (0.044) |
| Individual Treatment*Post Intervention | -54 | -75 | -38 | 0.049 | 0.029 |
| | (59) | (65) | (37) | (0.068) | (0.075) |
| Group Treatment*During Intervention | 52 | 51 | 44 | 0.080 | 0.086 |
| | (52) | (59) | (53) | (0.061) | (0.069) |
| Group Treatment*Post Intervention | 71 | 68 | 63 | 0.103 | 0.091 |
| | (46) | (50) | (48) | (0.084) | (0.093) |
| Balanced Panel | No | Yes | Yes | No | Yes |
| Winsorized at the 99th percentile | No | No | Yes | No | No |
| Sample Size (N*T) | 7343 | 5940 | 5940 | 7343 | 5940 |
| Number of Firms | 145 | 99 | 99 | 145 | 99 |
| P-value: Individual=Group During | 0.263 | 0.222 | 0.305 | 0.743 | 0.211 |
| P-value: Individual=Group Post | 0.109 | 0.095 | 0.099 | 0.519 | 0.486 |
| Control Mean in 2017 | 388 | 407 | 407 | 5.994 | 6.033 |

Notes: Coefficients are from fixed effects regressions with time and firm fixed effects, with standard errors clustered at the firm level. *, **, *** denote significance at the 10, 5, and 1 percent levels.
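Tables 5 and 6 report results under the inverse hyperbolic sine (I.H.S.) transformation. As a minimal Python sketch of why this transformation is convenient in this setting: it is defined at zero, so employment coded as zero once a firm closes can be retained, and it behaves like log(2x) for large x. The function name `ihs` and the example value of 56 workers (roughly the control mean employment level) are illustrative choices, not the authors' code.

```python
import math


def ihs(x):
    """Inverse hyperbolic sine: log(x + sqrt(x**2 + 1))."""
    return math.asinh(x)


# Defined at zero, unlike log(0), so closed firms can stay in the sample.
at_zero = ihs(0.0)

# For values far from zero, the transformation is close to log(2x).
employment = 56.0
exact = ihs(employment)
approximation = math.log(2.0 * employment)

print(at_zero, exact, approximation)
```

Because the transformation is approximately logarithmic away from zero, coefficients in the I.H.S. columns can be read roughly like coefficients in a log specification for firms that are not close to zero.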
Table 7: Channels of Production Impact

| | Defect Rate | Inventories: Levels | Inventories: Logs | Energy Costs: Levels | Energy Costs: Logs | Labor Productivity: Log Sales/Worker | Export at all | Log exports |
|---|---|---|---|---|---|---|---|---|
| Individual Treatment*During Intervention | -0.008 | -63 | -0.185 | 544 | -0.079 | 0.016 | 0.018 | -0.116 |
| | (0.008) | (75) | (0.224) | (926) | (0.063) | (0.046) | (0.019) | (0.211) |
| Individual Treatment*Post Intervention | -0.008 | -78 | 0.118 | 1430 | -0.038 | -0.024 | -0.009 | -0.108 |
| | (0.005) | (180) | (0.268) | (1079) | (0.159) | (0.054) | (0.026) | (0.195) |
| Group Treatment*During Intervention | 0.000 | 79 | 0.049 | 1718** | 0.155 | -0.003 | -0.017 | -0.271 |
| | (0.004) | (103) | (0.189) | (831) | (0.094) | (0.049) | (0.026) | (0.170) |
| Group Treatment*Post Intervention | -0.005 | 28 | -0.169 | 1072 | 0.156 | -0.033 | -0.039 | -0.114 |
| | (0.005) | (121) | (0.259) | (821) | (0.145) | (0.059) | (0.028) | (0.137) |
| Sample Size (N*T) | 3879 | 3875 | 3849 | 5121 | 5121 | 5591 | 8904 | 1983 |
| Number of Firms | 78 | 76 | 76 | 97 | 97 | 100 | 159 | 96 |
| P-value: Individual=Group During | 0.400 | 0.199 | 0.332 | 0.379 | 0.063 | 0.762 | 0.251 | 0.586 |
| P-value: Individual=Group Post | 0.600 | 0.652 | 0.350 | 0.761 | 0.422 | 0.897 | 0.311 | 0.978 |
| Control Mean in 2017 | 0.025 | 554 | 5.150 | 8564 | 8.063 | 1.771 | 0.212 | 9.602 |

Notes: Regressions control for firm and time fixed effects, and are restricted to samples with data available in December 2017. Defect rate is the proportion of production that is faulty; inventories are in millions of real (December 2017) pesos; energy costs are in thousands of real (December 2017) pesos. Labor productivity is defined as log real sales (in millions of pesos) per worker. Export at all is a dummy variable that takes value one if the firm exported directly abroad in the past month, and zero otherwise; Log exports is the log of the USD value of the amount exported in the month, and is conditional on exporting taking place. Standard errors clustered at the firm level. *, **, and *** denote significance at the 10, 5, and 1 percent levels respectively.
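The percentages quoted in Section 6.3 can be related to the log-specification coefficients in Table 7 with a short Python sketch. It converts a log-point coefficient b into a percent change via exp(b) - 1 and reproduces the proportional-scaling benchmark for labor productivity. The coefficients (0.156 for energy costs and -0.033 for labor productivity, group treatment, post intervention) and the 17 percent and one-third figures come from the table and the text; the helper name `log_points_to_percent` is hypothetical, and whether the authors converted and rounded in exactly this way is an assumption.

```python
import math


def log_points_to_percent(b):
    """Percent change implied by a coefficient b from a log specification."""
    return 100.0 * (math.exp(b) - 1.0)


# Group treatment, post intervention, from Table 7.
energy_pct = log_points_to_percent(0.156)         # energy costs, logs
productivity_pct = log_points_to_percent(-0.033)  # log sales per worker

# Proportional-scaling benchmark from Section 6.3: the management improvement
# here is about one-third of that in India, where productivity rose 17 percent.
india_productivity_gain = 17.0
management_ratio = 1.0 / 3.0
benchmark_pct = india_productivity_gain * management_ratio

print(round(energy_pct), round(productivity_pct), round(benchmark_pct, 1))
```

Rounded, these give about 17 percent more energy use, a 3 percent drop in labor productivity, and a 5.7 percent productivity benchmark, in line with the figures discussed in the text.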
ONLINE APPENDIX

Appendix 1: Examples of Products Manufactured
Appendix 2: Timeline
Appendix 3: Data Appendix
Appendix 4: Drop-out and Attrition
Appendix 5: Impacts on Individual Management Practices
Appendix 6: Robustness of Management Improvement to Sample Attrition
Appendix 7: Impacts on World Management Survey and MOPS management measures
Appendix 8: Comparison of PILA and Firm Employment Data and Changes in Composition of Firm Employment

Appendix 1: Examples of Products Manufactured

Air filters, glass panels, rubber parts, metal parts, plastic parts, tires, injection molding/cushioning, GPS tracking services.

Appendix 2: Timeline

April 12, 2012: Pilot program officially launched and firms invited to apply
June 25, 2012: Deadline for firms to apply to the program
June 11, 2013: Diagnostic phase starts
October 30, 2013: Diagnostic phase ends
November, 2013: Random assignment to treatment status
2013: World Management Survey administered to subsample of 72 firms with 40+ workers, as well as to random sample of 180 firms representative of Colombian manufacturing sector
March-November 2014: Individual Consulting Intervention
September 2015-April 2016: Group Consulting Intervention
November to December 2015: Round 1 of firm data collection (individual, group and control treatment)
January to February 2016: Round 2 of firm data collection (individual and control treatment)
March to April 2016: Round 3 of firm data collection (control treatment)
June 2016: Round 4 of firm data collection (group treatment)
November 2016: Second round of World Management Survey administered
November 2017-July 2018: Last round of firm data collection from firms

Note: firm data collection would collect all months of data available from firm records during in-person firm visits. Timing of when this was extracted from firms varied according to CNP’s contractual agreements, in which they were paid for batches of data collection at a time.
Administrative data on employment are available from the PILA from January 2013 through December 2018.

Appendix 3: Data Appendix

A3.A. Management practices indicators

The 141 management practices defined by CNP can be divided into five main areas: Finance, Production, Logistics, HR, Marketing. Each of these areas can itself be divided into five to eight sub-areas. The score of each of the five main areas is the average of the scores of its sub-areas. Below we discuss each of these sub-areas and explain which practices were considered to calculate their score.

At the most basic level, each single practice is graded on the following scale: 1 = “Not existing”, 2 = “In construction”, 3 = “Formalized”, 4 = “Implemented”, 5 = “Operating under control”. For some indicators, the 1 to 5 scale does not exactly refer to the implementation stage of a practice; instead, it indicates how developed or optimized a specific aspect is, for instance whether strategic goals and individual responsibilities are clear to each worker. Such information was collected in three stages: during the diagnostic phase, during the intervention, and once a year after the intervention.

Human Resources

i. Strategic objectives leverage on people’s talent

The first aspect of Human Resources relates to the alignment of employees’ objectives with corporate strategy, and to the clarity of such objectives for each employee. Here we consider four components. The first one evaluates how strategic objectives leverage on people’s and teams’ talent. The second component assesses whether there are human talent development plans, and whether these leverage on corporate strategy. The third component assesses whether a strategic plan is defined that includes clear objectives and goals concerning human talent. The last component assesses whether the skill development plans are also defined for the operational level.

ii. Competency-based management model for human talent development

The focus of this measure is on whether the company manages employee competences, based on the business strategy, in order to develop human talent. It comprises two measures. The first one assesses whether human resources are monitored based on their impact on the strategic objectives of the organization. The second component addresses the development of work profiles, which must be defined and aligned with business competencies.

iii. Organizational structure prepared to contribute to the achievement of strategic goals

The third sub-area evaluates whether the formal and informal structure of the organization allows the realization of corporate strategy. Is there a formally defined structure? Are all roles well defined at every level of the organization? Three measures are taken into consideration. The first one evaluates if the management’s focus is on processes which are aligned with the strategy of the firm. The second one assesses whether a communication system between the different processes of the organization has been developed. The last measure assesses whether a communication system between the different levels of the organization has been developed.

iv. Program of human talent development (according to business competences)

This measure evaluates how the organization works on building and retaining human talent to achieve a competitive advantage over the competition. Two components are considered: management of development plans (career plans) for employees at managerial level, and the level of application of the sector’s technical norms for the development of technical operational competences.

v. Organizational climate

The focus of this sub-area is the management of the work climate. The work climate must be appropriate for the development of human capital and directed towards the achievement of corporate strategy. We consider three components.
Is there a culture of monitoring work climate as a strategic lever? Are there programs to improve work climate? At which level are health and safety risks controlled?

vi. Social responsibility within the enterprise
Here we evaluate how the company manages its internal social responsibilities. This measure is comprised of three components. The first one assesses whether there are programs to improve the family environment of employees, in order to incentivize their productivity. The second one verifies whether a formal contracting system is in place, which generates well-being and productivity in workers. The last one evaluates the implementation of a system of recognition and reward for new ideas and improvement suggestions at the operational level.

vii. Promotion of an open-communication/high-performance organizational culture, and of a culture of high personal involvement
Three measures are considered for this indicator. Did the company develop a culture of control and periodic monitoring of result achievement? How developed is the performance-based reward system for management? How developed is the performance-based reward system for employees at the operational level?

Production

i. Alignment of functions at the operational, managerial and directive level
The first sub-area of Production focuses on whether all people working in the plant know the corporate strategy and work to realize it. To achieve this, it is necessary that all workers and processes have improvement goals aligned with corporate strategy. This measure is comprised of five components. The first two evaluate the implementation and monthly monitoring of strategic goals between the Plant Manager and his/her supervisor. The third and fourth components assess whether strategic goals and individual responsibilities are clear to each worker, and whether each worker has improvement goals.
The last component assesses whether the performance of teams at the operational level is evaluated based on the strategic goals.

ii. Definitions and management of the most important operational processes
Here we evaluate how operational processes are defined and managed, from the order to the delivery of the final product. Do they allow the strategy to be accomplished (Standards, Policies, Roles, 5S, Layout, Established Processes)? This sub-area includes six components. The first one evaluates whether processes are well identified and have a proper description (VSN, SIPOC). The second one assesses whether the plant layout allows optimal material flow. The third one concerns the implementation of a 5S program in the plant. The fourth one evaluates how bottlenecks are identified and managed. The last two components evaluate standards, specifications and work instructions used by workers, and how these are verified by supervisors.

iii. Formal method to measure and manage the plant’s efficiency (Waste, Hours paid/Service capacity, machinery’s efficiency)
The third sub-area evaluates how the company measures and manages the main KPIs of the plant, such as team efficiency, efficiency in the use of material, response time, etc. The first component of this sub-area concerns the monthly measurement of the plant’s KPIs (OEE, Waste, Defects, Lead Time, Others). The second component concerns weekly or bi-weekly management of KPI goals (OEE, Waste, Defects, Lead Time, Others). The third one assesses whether improvement programs for KPIs (times and quality) are developed applying instruments of plant management. The last one assesses whether a culture of daily collection of facts and data is in place, in order to demonstrate improvement in processes.

iv. Collection of information regarding results, continual improvement, and performance of processes
Here we assess how the company is managing data and information regarding processes, results and continuous improvement.
The four components of this sub-area are the following: Is there a culture of visual management with daily-updated graphs of machinery performance? Are the duration and quality of each process recorded daily by the responsible worker? Does the Administrative Management make sure that monitoring instruments are in good condition and precise? Is there a monitoring and sampling plan to capture the information necessary for the improvement of processes?

v. Process to detect and solve anomalies in the execution of tasks
The focus of this sub-area is to evaluate how anomalies in processes are managed within the plant. It is comprised of five components. The first one assesses whether there is a mechanism for workers to report anomalies of time and quality to their supervisors. The second one assesses whether criteria are defined for carrying out the analysis of anomalies. The third one concerns the daily analysis of time and quality anomalies by supervisors and workers. The fourth one assesses whether supervisors and workers manage improvement plans to eliminate time and quality anomalies. The last component concerns job descriptions, and whether they include responsibilities for solving anomalies.

vi. Technical planning of production based on the analysis of demand
The focus of the sixth sub-area is the planning of production. Is such planning based on a statistical analysis of clients’ orders? Does such planning guarantee the flexibility necessary to achieve a high level of service? Four components constitute this sub-area. The first one assesses whether meetings to review production programming take place between the production and sales areas. The second component evaluates the use of statistical methods to collect information and analyze production programming, according to demand variation. The third one evaluates production planning to ensure the availability of material for the monthly, weekly and daily program.
The last component evaluates monitoring and management of service to clients (deliveries in quality, time and quantity).

vii. Management of safety during the process, contingencies, emergencies / impact on the environment
Here we assess how the company monitors its impact on people and the environment, which actions are undertaken to mitigate any negative impact, and how it complies with safety and environmental norms and regulations. This sub-area is comprised of five measures. The first one concerns compliance with safety requirements, laws and norms. The second measure assesses whether the necessary norms and standards of safety within the plant are well defined. The third one evaluates the management of the indicators of industrial safety within the plant (number of accidents, level of noise, temperature). The fourth one concerns monitoring and management of the plant’s environmental impact. The last measure assesses compliance with the norms regarding evacuation routes and cleared zones for fire-fighting equipment.

viii. Maintenance guarantees the optimal condition of infrastructure
The last sub-area of Production evaluates the maintenance plan, how maintenance is monitored and managed, and how maintenance is related to the creation of value by the enterprise. All this is paramount to guarantee the optimal condition of machinery, furniture, equipment and tools. This measure reflects the following four points. Is there a preventive maintenance plan for the equipment? Are technicians able to rapidly repair damage to the machines? Are replacement parts available, so that damage to the machines can be repaired rapidly? Does Maintenance Management work with indicators such as MTTR, MTBF, Availability?

Logistics

i. Process of alignment of functions at the operational, managerial and directive level
The first sub-area of Logistics looks at the alignment of functions, and at the deployment of the organizational strategy. It is comprised of three components.
The first one concerns the implementation of strategic goals between the Logistics Head and his/her supervisor, and whether there are specific projects to achieve such goals. The second component assesses whether there is a monthly control of strategic goals by the Plant Manager and the supervisor. The last component concerns the alignment of employees’ objectives in the logistics area with the firm’s strategic goals.

ii. Structure and management of the supply chain (planning, purchases and provisions, storage of raw material, plant supply, storage of finished product, distribution, client service)
Here we evaluate whether employees in the logistics area understand their roles and activities. In this sub-area there are four measures. The first one evaluates procedures and work instructions for logistics processes. The second measure is concerned with the layout of the areas of logistic operations in the supply chain. The third component assesses whether a 5S plan for the supply chain is in place. The last component evaluates monitoring and management of KPIs in the logistic process (inventory, lead time, service level).

iii. Planning and management of demand / alignment of productive and logistic processes
This sub-area evaluates the procedure through which demand is planned and the reaction to changes in the established plan. Here we have four distinct components. The first one assesses whether a statistical system is in place to study and analyze demand. The second component concerns the definition of the demand plan, and whether it is updated at annual, quarterly and monthly frequency. The third component evaluates whether communication between logistics and the areas of marketing and sales goes through a system that includes rules to change the production plan. The last component evaluates the way a firm monitors and manages compliance with the budgets of production planning.

iv. Planning, management and control of inventories of raw material, supplies, product in process and finished product (Inventory Policies)
This sub-area evaluates the design of the inventory system, and the maintenance of inventory levels. The five components upon which this measure is based are the following. The first one assesses whether the levels of inventory (raw material, semi-finished product WIP, finished product) are kept at an optimal level relative to the variation in demand. The second component assesses whether inventory movement is recorded daily and controlled weekly. The third component states whether an ABC inventory classification methodology is in place, in order to establish policies of inventory, supply, storage and control accordingly. The fourth component verifies the use of MRP systems, where product structures are defined, in ways that allow the firm to plan the material needed to comply with production orders. The last component evaluates whether processes are in place to guarantee the rotation of inventory according to “First in, first out” schemes.

v. Supply system
This sub-area concerns the relation with suppliers, the way in which suppliers are evaluated, and the control the firm has over realized purchases. It is comprised of five measures. The first one concerns the management of policies and processes for the selection and evaluation of suppliers. The second measure concerns the management of suppliers’ development. The third measure focuses on the management of raw material prices and supplies. The fourth measure assesses whether the Lead Time of suppliers is managed and taken into account in the planning of material supply. The last measure assesses whether purchased items are verified in terms of quantity, quality and timeliness of delivery.

vi. Storage system
Five components are taken into account while evaluating the storage system.
The first one is the management of the inventory of obsolete and non-compliant products. The second one is the implementation of a system to manage storage locations (layout and 5S). The third one evaluates the implementation of industrial safety norms in the warehouse’s operations. The fourth one concerns the use of standards and procedures in the storage operations (picking and packing). The last component evaluates the monitoring and improvement of the storage operation time (picking and packing).

vii. Distribution system
This last sub-area of Logistics concerns the delivery of the created value to the client. It is comprised of five components. The first one evaluates efficiency in the processes of loading and unloading. The second one evaluates monitoring and management of the efficiency in the delivery process (perfect deliveries). The third component concerns the management of transport routes to reduce costs. The fourth component evaluates the management of reverse logistics for those products, materials or supplies that have to return to the company’s premises. The last component evaluates whether the management of distribution takes into account the current legislation regarding freight transit.

Marketing

i. Elaboration, management and control of the marketing plan
This measure evaluates the design of the guiding document of commercial activities and its alignment with the organization’s strategy. This indicator is comprised of seven components. The first two assess the implementation of an analysis of trends (economic, commercial, technological, political and social) and of risks (e.g. free trade, supply, variations in exchange rate, infrastructure, etc.). The third indicator evaluates the segmentation of products, technology, clients, consumers, etc. The fourth component assesses whether commercial strategies are based on contribution margins.
The fifth component evaluates the alignment of the marketing and sales plan with the Business Strategy. The sixth indicator assesses whether price, promotion and growth policies are defined using the contribution margins. The last indicator addresses monitoring of sales behavior and trends, and of changes in the marketing plan.

ii. Processes of market research
This measure indicates how the company conducts market research, and is composed of three components. The first one addresses whether and how the company conducts inquiries with clients and potential clients. The second one assesses whether the company conducts periodic monitoring of competitors’ offers. The last component evaluates whether and how the company conducts research on marketers and/or distributors.

iii. Client and after sales service
This measure evaluates the company’s approach to client satisfaction and is comprised of four measures. The first one evaluates the management of clients’ complaints and requests. The second measure concerns the analysis of products’ performance in the market. The third measure assesses whether the company has a culture of continuous improvement of products and services. The last component verifies whether the company holds periodic meetings to discuss clients’ feedback.

iv. Sales management
This sub-area focuses on the elaboration, management and control of the sales plan. We consider five indicators. The first three assess whether the company is holding three different types of meetings: with the distribution channels (to capitalize on opportunities in the market), planning meetings between sales and production, and meetings of the sales group to analyze sales behavior and trends. The fourth component assesses whether periodic training of the sales team takes place. The last indicator states whether sales agents are evaluated based on performance.

v. Relationship management
This measure is built on three components evaluating whether the company conducts three types of evaluation studies: of its cooperation with suppliers, of its cooperation with clients, and of its cooperation with competitors.

Finance

i. Alignment of the financial process with corporate strategy
Four components indicate whether strategic objectives and goals are clear at all levels of the financial process, and whether everyone is committed to such goals. The first component refers to the alignment of the Financial Head and Deputy Head with corporate strategic goals. The second component indicates whether a system of monitoring and control of financial goals and objectives is in place. The third indicator refers to the frequency with which financial objectives and goals are achieved. The last component evaluates the financial support to the management processes of the organization.

ii. Structure of the administrative and operational information system
The administrative information system is evaluated based on its monitoring and control of processes, and on its effectiveness for analysis and decision making. This is reflected in five measures. The first measure evaluates the structure of the corporate information system. The second one assesses whether the setup of administrative and operational business information is appropriate. The third one states whether Product Structures are associated with cost and profitability margins (standard, estimated, actual). A fourth indicator refers to the protection of the corporate information system, whereas the last one evaluates the organization of the corporate information system.

iii. Formulation and management of budgets
This sub-area evaluates how the firm formulates and manages budgets. The measure is comprised of four components. The first two focus on the existence of a Master Budget (operational, financial and of investment) and on its control and monitoring (agendas, finances, investment).
The third component assesses Tax Planning, and the last one evaluates how deviations from the Master Budget are analyzed (regarding costs, expenses, sales, working capital, investment).

iv. Financial management of results
The fourth sub-area of Finance reflects how well the company monitors and manages indicators of financial management, and how it analyzes them to undertake corrective action. Three components build this measure: the first evaluates the structure of control and monitoring indicators (KPIs), the second one the agenda of financial management meetings, and the third one how working capital is managed.

v. Programs of financial improvement (costs and expenses, working capital, investment)
This sub-area evaluates how projections and saving goals are realized. It is comprised of three components answering the following three questions: Is there a program of efficient administration of costs and expenses? Is there an action plan for compliance with financial improvement programs? Is the available financial information appropriate?

vi. Analysis and management of investment projects
This sub-area evaluates the process which the firm uses to plan, realize and follow up the purchase of fixed assets. This measure is made up of three components. The first component assesses whether a program for the calculation of investment projects exists and whether it is aligned with strategy. The second one verifies whether there is a policy regarding capital investment (CAPEX) and other smaller investments. The last one concerns the implementation of cost-benefit analysis for the different projects and the firm’s investments.

vii. Information systems
The second-to-last sub-area of Finance evaluates whether the information systems are interrelated and whether strategies are in place to safely conserve information.
Three aspects are considered here: the collection and storage structure of the administrative information system, the collection and storage structure of the operational information system, and the validation of information.

viii. Structure of the costing system
The last sub-area of Finance evaluates whether the costing system supplies real and updated information, so as to identify cost anomalies in any process. The first of four components reflects the implementation of a costing system. The second component assesses whether results (estimated and actual values) are being validated. The last two components evaluate absorption capacity of installed structure and workforce efficiency.

A3.B Key Performance indicators

Every variable is recorded monthly.

Defect rate: this is defined as the ratio of faulty production to total production. Faulty production is defined as not in condition to be sold, and is determined by the firm. There are several key measurement issues with this measure. First, firms vary in whether they record production in physical units (e.g., number of items, kilograms) or in pesos. Second, some firms would calculate this measure only for a specific production line or product, and not for the whole plant. Third, in a few cases, firms changed the way they measured these units over time. IPA and CNP worked together to identify these cases, and the series we use is for the set of firms with a consistent measure.

Energy cost: Cost of the energy in thousands of pesos. Firms are instructed to record the cost of the energy used in each month, not the bill they paid that month (which refers to the energy used the previous month). Some firms incorrectly recorded the energy bill of that month, which refers to the energy cost of the previous month. However, it was generally possible to correct this during the data collection meetings. In a couple of cases, firms did not record this variable in pesos, but in KW.
It has not been possible to correct this discrepancy during data collection, and data are not available for those firms.

Net sales: Total sales (gross sales) minus devolutions (returns, discounts, etc.). This is taken directly from the Profit & Loss Statement (P&L) or records of the firms.

Average monthly inventory: Stock of final product that is in condition to be sold (in pesos). Most firms do not keep inventory, for instance because they work on a project schedule. CNP instructed firms to record a missing value if they do not keep inventory. Other firms record physical inventory every three or six months rather than monthly, in which case they record a missing value during the other months. Some firms include semi-finished products in their inventory figures, not only finished products. In a limited number of cases, firms did not record inventory in pesos, and it was not possible to correct the values.

Total employees: All employees of the firm who are considered "stable or long term", independently of the contract type. There are no standard criteria to define what a "long term" employee is; this is defined by each firm. Firms calculate it for the firm as a whole.

A3.C Gathering of performance data

During the diagnostic phase CNP gave each firm a specifically designed spreadsheet to track the monthly evolution of KPIs in each of the five main areas (Finance, Production, Logistics, HR, Marketing). CNP also trained each firm to use these spreadsheets. Every firm received such training, which was done before randomly assigning firms to the two treatment groups and the comparison group. Periodically, CNP would visit firms to verify the monitoring of KPIs and resolve any doubts. This information was then collected during 4 rounds, the first of which took place in July 2015 as described in Appendix 2.
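The KPI definitions in A3.B can be illustrated with a minimal sketch. It is not the procedure used by CNP or IPA; the function names, field names and numbers are hypothetical and chosen only for illustration. It assumes that faulty and total production are recorded in the same unit, and that a missing input yields a missing KPI rather than a zero, in line with the instruction to record missing values.

```python
from typing import Optional


def defect_rate(faulty: Optional[float], total: Optional[float]) -> Optional[float]:
    """Ratio of faulty production to total production (same unit for both)."""
    # Assumption: a missing input or non-positive output gives a missing rate.
    if faulty is None or total is None or total <= 0:
        return None
    return faulty / total


def net_sales(gross: Optional[float], devolutions: Optional[float]) -> Optional[float]:
    """Gross sales minus devolutions, both in pesos."""
    # Assumption: if either figure is missing, net sales is treated as missing.
    if gross is None or devolutions is None:
        return None
    return gross - devolutions


# Hypothetical monthly records for one firm (illustrative values only).
records = [
    {"month": "2014-01", "faulty": 5.0, "total": 200.0, "gross": 1000.0, "devolutions": 50.0},
    {"month": "2014-02", "faulty": None, "total": 180.0, "gross": 900.0, "devolutions": 0.0},
]

# One KPI row per month; missing inputs propagate as None.
kpis = [
    {
        "month": r["month"],
        "defect_rate": defect_rate(r["faulty"], r["total"]),
        "net_sales": net_sales(r["gross"], r["devolutions"]),
    }
    for r in records
]
```

Keeping missing values distinct from zeros matters here because a firm that did not record faulty production in a given month is different from a firm that had no defects.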
The data collection followed this procedure: staff from CNP and IPA would attend a firm’s board meeting, at the end of which the spreadsheets would be reviewed and KPIs discussed. CNP’s representative would guide the discussion, going through every single indicator, whereas IPA’s analyst would contribute to the data review and record any relevant information. Special effort was put into ensuring that the data were recorded homogeneously across firms and time, also given that some of the information dated back to 2013. During every meeting, inconsistencies were corrected in the use of missing values, zeros, units, and definitions. Moreover, any anomaly in the evolution of KPIs was also discussed in depth.

One challenge stemmed from the fact that not all firms found the provided spreadsheets equally useful. Some firms were therefore filling in the spreadsheets only sporadically, and at the same time were using other ways of tracking KPIs as their main instrument, or were not tracking them properly. Other firms were not filling in the spreadsheets at all unless CNP visited them and helped them to do so, which meant that in some cases data were not recorded for months. This resulted in a loss of information, which was sometimes impossible to correct.

Another major challenge was that, especially as far as production variables are concerned, CNP did not give strict prescriptions to firms as to how to interpret and record variables. This caused differences in the interpretation of variables between firms. Two types of inconsistency are the most frequent: regarding units, and regarding whether the variable refers to a production line or to the whole plant. For instance, some firms have recorded the same production variable as “value in pesos” while others recorded it as “number of pieces”. Others have filled in “total production” with data regarding their main production line, not the whole plant as was planned.
The freedom in interpreting variables also caused variability in the units used within a given firm, which might have recorded different variables in different ways. Finally, in a limited number of cases there were changes in the way a firm would interpret the same variable over time, and also changes in the way a variable was measured. Given that the freedom to use the spreadsheets in a flexible way was considered by CNP to be part of the intervention, during data collection the only available measure to mitigate these discrepancies was to carefully record any information and explanation.

Appendix 4: Drop-Out and Attrition

Table A4.1 shows that the firms that completed the interventions are similar on baseline characteristics to those which dropped out.

Table A4.1: Comparison of Baseline Characteristics of Firms that Completed Interventions to Drop-Outs

                              Individual Treatment                      Group Treatment
                              Completed     Dropped Out   p-value       Completed     Dropped Out   p-value
Number of Employees           62.2          54.4          0.746         52.9          53.1          0.981
Small Firm (<=50 employees)   0.59          0.57          0.940         0.58          0.59          0.974
Medium Firm (>50 employees)   0.41          0.43          0.940         0.42          0.41          0.974
Cundinamarca                  0.54          0.14          0.049         0.42          0.35          0.665
Valle                         0.09          0.14          0.645         0.25          0.18          0.559
Labor Productivity            32            30            0.780         32            39            0.278
Financing Practices           48            50            0.730         53            52            0.855
Human Resources Practices     42            40            0.738         44            43            0.784
Logistics Practices           43            43            0.989         49            43            0.175
Marketing Practices           43            44            0.934         46            46            0.948
Production Practices          46            54            0.229         47            44            0.371
Level 2 Supplier              0.93          1.00          0.496         0.92          0.94          0.758
Metal Products                0.50          0.57          0.731         0.47          0.65          0.242
Plastic Products              0.15          0.29          0.390         0.19          0.24          0.738
Firm Age (Years)              23.3          21.8          0.829         20.9          24.6          0.375
Anexo K score                 44.4          46.5          0.679         47.8          45.7          0.487
USD Sales in 2013             3158858       7547448       0.189         2767765       2469362       0.799
Export at all in 2013         0.43          0.29          0.465         0.47          0.41          0.687
Sample Size                   46            7                           36            17

Table A4.2 compares the characteristics of those firms for which we have December 2017 sales and employment data to the attritors, and then shows the sample of non-attritors is reasonably well balanced on baseline characteristics.

Table A4.2: Comparison of Baseline Characteristics of Non-Attritors to Attritors, and Balance on Non-Attriting Sample

                              Full Sample                               Sample of Non-Attritors
                              Non-Attritors Attritors     p-value       Control       Individual    Group         p-value
Number of Employees           58.9          59.8          0.921         54.9          68.2          52.9          0.441
Small Firm (<=50 employees)   0.58          0.61          0.716         0.67          0.51          0.57          0.426
Medium Firm (>50 employees)   0.42          0.39          0.716         0.33          0.49          0.43          0.426
Cundinamarca                  0.50          0.43          0.349         0.58          0.51          0.43          0.480
Valle                         0.16          0.17          0.939         0.18          0.08          0.23          0.174
Labor Productivity            30            32            0.460         26            32            32            0.054
Financing Practices           51            51            0.964         51            48            53            0.154
Human Resources Practices     44            40            0.069         45            43            44            0.906
Logistics Practices           47            44            0.147         50            44            48            0.106
Marketing Practices           46            44            0.281         47            45            47            0.841
Production Practices          47            45            0.480         47            48            46            0.867
Level 2 Supplier              0.94          0.93          0.679         0.94          0.95          0.94          0.993
Metal Products                0.57          0.65          0.353         0.79          0.46          0.49          0.004
Plastic Products              0.15          0.22          0.276         0.09          0.16          0.20          0.404
Firm Age (Years)              24.1          24.1          0.997         27.6          24.6          20.2          0.085
Anexo K score                 47.0          44.9          0.218         48.1          45.5          47.6          0.538
USD Sales in 2013             2877978       2252395       0.342         2043854       3515012       3013064       0.133
Export at all in 2013         0.47          0.41          0.480         0.48          0.46          0.46          0.969
Sample Size                   105           54                          33            37            35

Notes: Attrition defined as not having firm sales and employment data reported from firm records in December 2017. This can arise from firms refusing to provide this information, as well as from firm death. P-value in column 3 is for a t-test of equality of means by attrition status.
Columns 4 through 6 provide baseline means by treatment status for the sample of non-attritors. The p-value in column 7 is for an F-test of equality of means.

Appendix 5: Impacts on Individual Management Practices

Table A5.1 shows the breakdown of significant improvements in management practices within the Anexo K index:

Table A5.1: Summary of Impacts at the Sub-Index and Individual Practice Level

             Sub-Indices                      Individual Practices
             #    # sig. Ind.  # sig. Group   #     # sig. Ind.  # sig. Group
Finances     8    6            5              29    17           15
HR           7    3            2              20    11           6
Logistics    7    5            2              31    8            9
Marketing    5    3            3              22    9            13
Production   8    6            8              39    22           30
TOTAL        35   23           20             141   67           73

Note: lists number of practices that are statistically significant at the 5% level post-intervention.

Table A5.2 details the individual management practices that have treatment effects of 0.8 or more (on a 5-point scale).

Table A5.2: Practices that increase by 0.8 or more from at least one treatment

                                                                              Individual  Group
Finance Practices
  System of monitoring and control of financial goals in place                0.827***    0.666***
                                                                              (0.175)     (0.189)
  Frequency at which financial objectives and goals achieved                  0.802***    0.648***
                                                                              (0.205)     (0.212)
  Existence of a Master Budget                                                0.718***    1.163***
                                                                              (0.263)     (0.259)
  Control and Monitoring of Master Budget                                     0.765***    1.016***
                                                                              (0.226)     (0.241)
  How deviations from master budget analyzed                                  0.909***    1.070***
                                                                              (0.244)     (0.265)
  Structure of Control and Monitoring Indicators (KPIs)                       0.935***    0.956***
                                                                              (0.247)     (0.237)
  Agenda of Financial Management Meetings                                     1.055***    1.055***
                                                                              (0.230)     (0.222)
HR Practices
  Strategic objectives leverage people's and team's talent                    0.833***    0.631***
                                                                              (0.206)     (0.214)
  Human talent development plans linked to corporate strategy                 0.809***    0.902***
                                                                              (0.200)     (0.215)
  Strategic plan defined, that includes clear goals for human talent          0.951***    0.910***
                                                                              (0.207)     (0.194)
Marketing Practices
  Implementation of analysis of marketing trends                              0.485**     0.867***
                                                                              (0.227)     (0.196)
  Implementation of analysis of marketing risks                               0.630***    0.898***
                                                                              (0.230)     (0.226)
  Alignment of marketing and sales plan with business strategy                0.663***    0.825***
                                                                              (0.216)     (0.227)
  Monitoring of sale behavior and trends                                      0.719***    0.901***
                                                                              (0.209)     (0.224)
Production Practices
  Implementation of strategic goals between plant manager and supervisor      0.616***    0.966***
                                                                              (0.176)     (0.173)
  Monthly monitoring of strategic goals between plant manager and supervisor  0.686***    0.895***
                                                                              (0.215)     (0.207)
  Strategic goals and roles clear to each worker                              0.670***    0.896***
                                                                              (0.166)     (0.162)
  Each worker has improvement goals                                           0.562***    0.892***
                                                                              (0.188)     (0.170)
  Bottlenecks are identified and managed                                      0.514***    0.842***
                                                                              (0.179)     (0.194)
  Monthly measurement of plant KPIs                                           0.822***    0.857***
                                                                              (0.193)     (0.200)
  Weekly or bi-weekly management of KPIs                                      0.851***    0.650***
                                                                              (0.227)     (0.212)
  Improvement programs for KPIs developed                                     0.927***    0.989***
                                                                              (0.223)     (0.220)
  Culture of visual management with graphs of machine performance             0.810***    0.515**
                                                                              (0.210)     (0.212)
  Supervisors and workers manage improvement plans for quality anomalies      0.802***    0.944***
                                                                              (0.187)     (0.215)

Notes: robust standard errors in parentheses, clustered at the firm level. *** denotes significance at the 1 percent level. Coefficients are treatment effects post-intervention, and control for time effects, randomization strata, and

Appendix 6: Robustness of Management Improvements to Sample Attrition

Table A6.1 shows the availability of our management score data by time period and measure. The greatest data availability is for the Anexo K measure, but this still suffers from attrition, while the WMS and MOPS data are available for subsets of the sample only.

Table A6.1: Management Data availability by measure and time period

                                   # Firms with Data by Treatment
Measure                   Period   Control  Individual  Group   Data source
Anexo K management score  2013     52       51          53      Anexo K collected by CNP
                          2014     42       46          0       Anexo K collected by CNP
                          2015     26       40          35      Anexo K collected by CNP
                          2016     0        0           36      Anexo K collected by CNP
WMS management score      2013     26       24          27      WMS collected by LSE
                          2016     20       19          31      WMS collected by IPA
MOPS management data      2012     28       33          34      Collected retrospectively by IPA
                          2017     28       33          34      Collected by IPA

Figure A6.1 compares the distribution of baseline management practices for firms that attrit, and so lack endline Anexo K data (2015 for the control and individual treatment, 2016 for the group treatment), with that of firms that do not. We see that the distributions of those with and without follow-up management data are similar, both for the full sample, and when we split by treatment status. We cannot reject equality of distributions between attritors and non-attritors using a Kolmogorov-Smirnov test of equality of distributions. This indicates that attrition is not selective on initial management practices.
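The distributional comparison described above can be sketched in a few lines. The following is a minimal illustration, not the authors' code: it computes the two-sample Kolmogorov-Smirnov statistic and an asymptotic p-value using only the Python standard library. The score lists are hypothetical placeholders rather than study data, and the p-value uses a standard large-sample approximation with a small-sample correction, which is an assumption about method rather than something stated in the paper.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Step both empirical CDFs past every observation equal to x.
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue(d, n, m):
    """Asymptotic p-value from the Kolmogorov distribution (approximation)."""
    n_eff = n * m / (n + m)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    total = 0.0
    for k in range(1, 101):
        term = 2.0 * (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-10:
            return min(max(total, 0.0), 1.0)
    # The series does not converge when lam is near zero: distributions coincide.
    return 1.0


# Hypothetical baseline Anexo K scores (NOT the study data).
with_endline = [41.0, 44.5, 46.0, 47.2, 49.8, 52.1, 55.0]
missing_endline = [40.2, 45.1, 46.7, 48.0, 50.3, 53.9]

d = ks_statistic(with_endline, missing_endline)
p = ks_pvalue(d, len(with_endline), len(missing_endline))
print(f"KS statistic = {d:.3f}, approximate p-value = {p:.3f}")
```

A large p-value, as in the tests reported for Figure A6.1, means equality of the two baseline distributions cannot be rejected.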
Figure A6.1: Distribution of Baseline Anexo K Management Practices by Whether or Not Endline Management Data are Missing

Notes: Kolmogorov-Smirnov tests of equality of distributions of baseline management practices between firms with missing endline management data and firms with endline management data have p-values 0.979 (all firms), 0.995 (control firms), 0.754 (individual treatment), and 0.425 (group treatment).

Note that our main estimates of the treatment effect are for a balanced panel, and include randomization triplet fixed effects. Coupled with the above analysis, which shows no selection on baseline management practices into having follow-up data, and Figure 2, which shows clearly the change in the distribution of practices for this balanced panel, this suggests our main results are not being driven by selective attrition. Nevertheless, as a further sensitivity check, Table A6.2 provides Lee bounds for the treatment impacts. Table A6.1 shows we have substantially more control firms reporting management practices in 2014 than in 2015, so less trimming is required when estimating the impact during the year of intervention than for the post-intervention impact. We see that both treatments have significant impacts even at the lower bound for the during-intervention period. In contrast, the bounds become wider for the post-intervention period. If all the additional firms that attrited from the control group were the best-managed firms, then we could not conclude that the intervention had a positive effect. We can examine this assumption using the control firms that attrited between 2014 and 2015. The 16 control firms that attrited had a mean first follow-up (2014) Anexo K score of 51.4, while the 26 control firms that did not attrit had a mean 2014 Anexo K score of 52.8 (p-value 0.72).
Thus, not only is there no evidence of selective attrition on baseline management practices, neither is there evidence of endline selective attrition based on first follow-up management practices. This strongly suggests that the assumption that only the best-managed control firms differentially attrited is very unlikely to hold, so that the Lee lower bound is unlikely to be applicable.

Table A6.2: Lee Bounds of Impact on Anexo K Score

                              Individual Treatment Effect  Group Treatment Effect
Impact during intervention
  Lee lower bound             6.303**                      9.368***
                              (2.723)                      (3.290)
  Lee upper bound             9.746***                     16.610***
                              (3.065)                      (2.851)
Impact post-intervention
  Lee lower bound             1.076                        4.784
                              (3.628)                      (3.218)
  Lee upper bound             13.993***                    13.913***
                              (3.011)                      (3.158)
Sample Size                   106                          106
Proportion trimmed
  for during intervention     8.7%                         16.7%
  for post-intervention       35.0%                        27.8%

Notes: robust standard errors in parentheses. *, **, and *** denote significance at the 10, 5, and 1 percent levels respectively.

Appendix 7: Impacts on World Management Survey and MOPS Management Measures

WMS 2013 Data Collection

We commissioned the London School of Economics (LSE) team responsible for the Bloom and Van Reenen (2007) World Management Surveys (WMS) to apply their methodology to a random sample of 180 firms representative of the Colombian manufacturing sector, as well as to a sub-sample of 77 firms in our sample, focusing on firms with 40 or more employees (Table A6.1). Interviews were done by phone with a manager with thorough knowledge of the production process, typically the plant manager or production manager. The WMS interview is structured as a guided discussion.
Each discussion lasts between one hour and an hour and a half, and covers 18 questions related to operations, monitoring, targeting, and people management. The interviewer guides the interviewee by means of open questions, letting him/her speak freely but making sure to obtain the objective information needed to score each of the 18 topics using the provided scoring grid. Each of the 18 topics receives a score between 1 (no modern practice is implemented) and 5 (best practice).

A first use of this survey was to compare the management practices of the auto parts firms in our sample to those of Colombian manufacturing as a whole. Figure A7.1 shows that the distribution of management practices in our firms is similar to that of all SME manufacturing firms in Colombia. A second purpose was to enable comparison of Colombia to the rest of the world. Figure A7.2 shows that Colombia's average management practices score of 2.54 is low by global standards, but typical for many developing countries, just below that of India and just above that of Kenya. The mean management practices score of 2.38 for the auto parts firms is similar.

Figure A7.1: Comparison of WMS Management Practices Distribution of our Auto Parts firms to a Representative Sample of the Colombian Manufacturing Sector

Source: WMS surveys of 180 Colombian manufacturing firms and 77 auto parts firms conducted by the LSE WMS team in 2013.
Figure A7.2: Comparison of Colombian World Management Survey Management Score to Other Countries

[Bar chart of average management scores in manufacturing, by country, on a horizontal axis running from 1.5 to 3.5. Legend of regions: Africa, Asia, Oceania, Europe, Latin America, North America. Plotted values: United States 3.308; Japan 3.230; Germany 3.210; Sweden 3.188; Canada 3.142; Great Britain 3.033; France 3.015; Australia 2.997; Italy 2.978; Mexico 2.899; Poland 2.887; Singapore 2.861; New Zealand 2.851; Northern Ireland 2.839; Portugal 2.826; Republic of Ireland 2.762; Chile 2.752; Spain 2.748; Greece 2.720; China 2.712; Turkey 2.706; Argentina 2.699; Brazil 2.684; India 2.611; Vietnam 2.608; Colombia 2.578; Kenya 2.549; Nigeria 2.516; Nicaragua 2.397; Myanmar 2.372; Zambia 2.316; Tanzania 2.254; Ghana 2.225; Ethiopia 2.221; Mozambique 2.027.]

Source: World Management Surveys, Nick Bloom.

WMS 2016 Data Collection

In September 2016, we asked Innovations for Poverty Action (IPA) to conduct a second round of the World Management Survey (WMS). The LSE provided support in training the four analysts who conducted the interviews, the two supervisors and the research associate responsible for the survey. All material was provided by the LSE and the training took place in October 2016. Since the WMS is designed for larger firms, we chose as a sample frame the 109 firms in our sample that had at least 25 employees at baseline. This consisted of 37 control, 41 group treatment, and 31 individual treatment firms. Of these 109 firms, we were able to collect data on 70 firms (20 control, 31 group, 19 individual), of which 50 firms had also been interviewed in 2013 (14 control, 22 group, 14 individual). This response rate of 64% is double the standard WMS response rate, reflecting the pre-existing contacts with these firms through the project. Of those companies not interviewed, 3 had closed down, and the remainder either refused, or repeatedly rescheduled and could not be interviewed.
Management and Organizational Practices Survey (MOPS)

Our final measure of management practices comes from a 16-question survey given to firm owners in 2017, derived from the Management and Organizational Practices Survey (MOPS). This survey was created by the U.S. Census Bureau, and was designed to enable basic management practices to be measured in a self-administered survey format. The survey asks questions related to monitoring, targeting, and incentives, and is intended to measure similar concepts to the WMS (Bloom et al., 2018). It was carried out by Innovations for Poverty Action during in-person visits to the firms, and firms were also asked to recall what these practices were five years earlier (in 2012). Table A6.1 shows that these data were collected for 95 firms.

Associations between different measures of management and over time

The WMS and MOPS are collected in a much less in-depth way than the Anexo K, and measure different aspects of management. Table A7.1 looks at the baseline correlations between different measures. At baseline, the Anexo K management score has a correlation of 0.26 with the WMS management score, and 0.23 with the MOPS score. By way of comparison, the 38 management practices in Bloom et al. (2013) had a 0.40 correlation with the WMS score. The Anexo K is most highly correlated with the monitoring component of the WMS (correlation of 0.44). When we examine the five areas of the Anexo K, the finance, logistics and production scores are more highly correlated with the WMS than the HR and marketing scores. Recall the WMS does not measure marketing practices, and there is a difference in emphasis in how the two focus on human resource practices. The WMS is more focused on how good and bad performers are hired and rewarded, whereas the Anexo K has more of an emphasis on organizational culture and links to overall business strategy.
Notably, while the MOPS and WMS are intended to measure similar concepts, the correlation between the 2012 (recalled) MOPS management score and the WMS is only 0.08, suggesting substantial noise in this measurement.

Table A7.1: Correlations between baseline Management Measures

                       WMS      WMS         WMS         WMS      WMS     MOPS
                       Overall  Operations  Monitoring  Targets  People  Overall
Anexo K Overall Score  0.26     0.16        0.44        0.04     0.11    0.23
Finance Score          0.28     0.22        0.46        0.07     0.07    0.15
HR Score               0.14     0.09        0.33        -0.08    0.03    0.17
Logistics Score        0.23     0.12        0.32        0.07     0.13    0.31
Marketing Score        0.09     0.03        0.12        0.02     0.06    0.10
Production Score       0.26     0.14        0.40        0.07     0.13    0.17
MOPS Overall           0.08     0.00        0.04        0.07     0.10    1.00

Figure A7.3 plots the cross-sectional and panel associations between measures. We see that the endline Anexo K has a cross-sectional correlation of 0.34 with both the WMS and MOPS, and that the WMS and MOPS at endline still only have a correlation of 0.27. More starkly, there is no relationship between the WMS and Anexo K in the panel: firms that improve the most according to the Anexo K are unrelated to those that improve the most according to the WMS. This is also true of the association between changes in the MOPS and changes in the WMS. Recall that the WMS is done double-blind by phone, with enumerators scoring firms on a five-point scale. While there is signal in the responses, this also entails a lot of noise. Bloom et al. (2016) report that the test-retest correlation when two different people from within a plant answered the same questions within a few weeks of one another is only 0.51. In our case, there is an added factor of the baseline being done by the LSE team, while the endline was collected by Innovations for Poverty Action (after training from the LSE team).
As such, we should expect much of the change over time in the WMS to reflect measurement error, which can make it difficult to detect treatment effects.

Figure A7.3: Cross-sectional and panel correlations between management measures

Notes: first column shows cross-sectional correlations pre-treatment, second column shows cross-sectional correlations post-intervention for last measurement obtained by each method, and third column shows correlation of change in management (pre-post) according to each measure.

To investigate which of the three management measures is most strongly correlated with business outcomes of interest, we regress baseline log employment and labor productivity on each management measure separately, and then on all three together. The results are shown in Table A7.2. The Anexo K score is strongly associated with both log employment and labor productivity at baseline (both significant at the 1% level), while the WMS and MOPS have weaker associations. When all three measures are included together, the Anexo K measure remains statistically significant, while neither other measure is significant. This suggests the Anexo K measure has a stronger signal for business outcomes than these two alternatives.

Table A7.2: Baseline Association of Business Outcomes with Management Measures

                        Log Employment                              Labor Productivity
Anexo K Score           0.035***                         0.017***   0.672***                         0.877***
                        (0.006)                          (0.006)    (0.140)                          (0.186)
WMS Management Score               0.250*                0.086                 4.914                 -0.652
                                   (0.134)               (0.153)               (4.070)               (5.310)
MOPS Management Score                         0.869*     -0.554                           8.994      -2.894
                                              (0.465)    (0.459)                          (8.650)    (12.164)
Sample Size             156        77         95         46         156        77         95         46
R-squared               0.19       0.05       0.03       0.14       0.14       0.01       0.01       0.25

Notes: Anexo K management practices are 141 management practices divided into five sub-areas.
WMS is World Management Survey, taken for subsample of firms in 2013. MOPS is Management and Organizational Practices Survey, and was conducted in 2017, with recall of practices 5 years earlier used to obtain baseline measure. Robust standard errors in parentheses, *, **, *** denote significance at the 10, 5, and 1 percent levels respectively.

Treatment Effects on WMS and MOPS measures of management

Table A7.3 reports the estimated treatment impacts on the WMS and MOPS measures. Since these data are only available for a subset of our firms, we report several different specifications. In Panel A, we use all 70 firms for which follow-up WMS data are available (or the 95 firms with MOPS data for the last column). We do not control for randomization triplet fixed effects given that this would result in relatively few triplets being included. Instead, Panel A includes no other controls, while Panel B controls linearly for key baseline variables used in the randomization (region, size, employment, labor productivity, and baseline Anexo K). Panels C through E then use the set of 50 firms for which both baseline and endline WMS data are available. In Panels A and B, we find very small and statistically insignificant impacts of either treatment on any of the WMS or MOPS management measures. Restricting to the sample for which we also have baseline data in Panels C, D and E results in larger point estimates for the WMS, but the impacts are still far from statistically significant.

Our results show that both treatments resulted in significant increases in the Anexo K measure of management practices, and in each of its five subcomponents. This raises the question of why we do not see such a change in the WMS and MOPS. A first potential explanation is that the WMS and MOPS are only available for subsamples of the data, so that the difference in results could stem from sample composition and sample size.
To investigate this hypothesis, Table A7.4 re-estimates the management treatment effect regressions for common sub-samples. The first column repeats our estimated impact on the Anexo K measure for the balanced panel. Columns 2 and 3 then consider the 52 firms for which we have both the 2016 WMS and Anexo K measured during and after the intervention. Using this sub-sample, we continue to see a statistically significant impact of the individual treatment on the Anexo K measure both during and post-intervention, and a significant impact of the group treatment during the intervention. The magnitude of the estimated effect falls substantively only for the group treatment post-intervention, and even then the confidence interval is wide. In contrast, there is no significant impact on the WMS using this same sample. The foot of the table converts the estimated treatment effects into confidence intervals expressed in terms of standard deviation changes in the respective management practice. We see that not only are the WMS treatment effects statistically insignificant while those for the Anexo K outcome are statistically significant, but the 95 percent confidence intervals for the individual treatment effect on the two outcomes do not even overlap. This suggests that the lack of impact on the WMS is not simply a matter of the sample composition or statistical power. Likewise, when we restrict to the same sample as the MOPS in columns 4 and 5, we find significant treatment impacts on the Anexo K, and no significant impact on the MOPS, although in this case the confidence intervals do overlap.
Table A7.3: Impact on Other Measures of Management Practices

                                 WMS      WMS         WMS         WMS      WMS      MOPS
                                 Overall  Operations  Monitoring  Targets  People   Score
All firms interviewed in 2016
Panel A: No controls
  Individual Treatment           0.040    0.100       0.152       -0.045   -0.003   -0.008
                                 (0.169)  (0.345)     (0.225)     (0.238)  (0.156)  (0.034)
  Group Treatment                0.075    0.035       0.152       0.041    0.053    0.013
                                 (0.170)  (0.298)     (0.209)     (0.230)  (0.153)  (0.031)
Panel B: Baseline Controls
  Individual Treatment           -0.000   -0.030      0.095       -0.076   -0.007   -0.005
                                 (0.166)  (0.307)     (0.235)     (0.243)  (0.152)  (0.032)
  Group Treatment                0.061    0.009       0.094       0.094    0.025    0.018
                                 (0.166)  (0.276)     (0.210)     (0.231)  (0.162)  (0.030)
Sample Size                      70       70          70          70       70       95
Control Mean in 2016 of outcome  2.92     2.90        3.28        2.94     2.61     0.52
Control S.D. in 2016 of outcome  0.55     1.07        0.68        0.79     0.54     0.13

50 firms interviewed in WMS in 2013 & 2016
Panel C: No Controls
  Individual Treatment           0.143    0.321       0.314       -0.086   0.131    0.010
                                 (0.218)  (0.423)     (0.256)     (0.311)  (0.199)  (0.051)
  Group Treatment                0.283    0.357       0.312       0.225    0.284    0.064
                                 (0.216)  (0.363)     (0.254)     (0.293)  (0.183)  (0.045)
Panel D: Baseline Controls
  Individual Treatment           0.029    0.123       0.153       -0.188   0.074    -0.011
                                 (0.204)  (0.388)     (0.257)     (0.304)  (0.197)  (0.055)
  Group Treatment                0.242    0.238       0.210       0.276    0.241    0.066
                                 (0.203)  (0.350)     (0.266)     (0.286)  (0.175)  (0.049)
Panel E: Baseline Controls + Ancova
  Individual Treatment           0.072    0.233       0.168       -0.160   0.133    -0.009
                                 (0.199)  (0.394)     (0.252)     (0.299)  (0.199)  (0.055)
  Group Treatment                0.267    0.335       0.232       0.296    0.214    0.068
                                 (0.214)  (0.372)     (0.276)     (0.302)  (0.163)  (0.048)
Sample Size                      50       50          50          50       50       46
Control Mean in 2016 of outcome  2.88     2.89        3.24        2.96     2.51     0.53
Control S.D. in 2016 of outcome  0.65     1.13        0.76        0.90     0.56     0.14

Notes: Each panel represents treatment impacts from a separate regression. 70 of the 159 firms were given the WMS survey in 2016, of which 50 had also received this survey in 2013. Panels A and C regress outcomes on treatment dummies only.
Panels B and D add controls for dummies for the Cundinamarca and Valle regions, a dummy for having 10 to 50 workers at baseline, the number of employees in 2013, labor productivity in 2013, and the 2013 Anexo K management practice score. Panel E also controls for the baseline value of the outcome measure. Robust standard errors in parentheses. *, **, and *** indicate significance at the 10, 5, and 1 percent levels.

Table A7.4: Impact on Anexo K on Same Samples as WMS and MOPS

                                          Balanced Panel  WMS Sample                  MOPS Sample
                                          Anexo K         Anexo K       WMS           Anexo K       MOPS
Individual Treatment*During Intervention  9.413***        8.350***                    9.669***
                                          (1.760)         (2.229)                     (1.879)
Individual Treatment*Post Intervention    9.309***        8.325***      -0.210        9.657***      0.017
                                          (1.821)         (2.368)       (0.176)       (1.856)       (0.036)
Group Treatment*During Intervention       11.384***       7.602**                     11.143***
                                          (2.202)         (3.164)                     (2.438)
Group Treatment*Post Intervention         8.155***        3.911         -0.132        7.549***      0.040
                                          (2.124)         (3.091)       (0.174)       (2.318)       (0.034)
Sample Size                               202             104           52            172           86
Control Mean                              55.98           60.1          2.93          57.44         0.49
Control SD                                10.79           6.98          0.41          10.23         0.12
Implied 95% confidence intervals in S.D.
Individual Treatment*Post Intervention    [0.53,1.19]     [0.53,1.86]   [-1.35,0.33]  [0.59,1.30]   [-0.45,0.73]
Group Treatment*Post Intervention         [0.37,1.14]     [-0.31,1.42]  [-1.15,0.51]  [0.29,1.18]   [-0.22,0.89]

Notes: Column 1 is for the 101 firms for which Anexo K management practices are measured both during and post intervention. Columns 2 and 3 restrict to the subset of 52 firms that also had the WMS measured in 2016. Columns 4 and 5 restrict to the subset of 86 firms that also had the MOPS measured in 2017. Regressions control for baseline (December 2013) Anexo K mean, time fixed effects, and controls for region, baseline labor productivity, baseline number of employees, and for being a small firm at baseline. Robust standard errors in parentheses, clustered at the firm level. *, **, *** denote significance at the 10, 5, 1 percent levels respectively.
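The conversion at the foot of Table A7.4, from a coefficient and standard error to a 95 percent confidence interval in control-group standard deviation units, can be illustrated as follows. This is a minimal sketch rather than the authors' code: it assumes a normal critical value of 1.96, and the function name `ci_in_sd_units` is hypothetical. The inputs are the post-intervention coefficients, standard errors and control SD from column 1 of the table.

```python
def ci_in_sd_units(coef, se, control_sd, z=1.96):
    """Return the (lower, upper) confidence bounds divided by the control SD.

    Assumes a normal approximation with critical value z (1.96 for 95%).
    """
    half_width = z * se
    return ((coef - half_width) / control_sd, (coef + half_width) / control_sd)


# Column 1 of Table A7.4 (balanced panel, Anexo K, post-intervention).
control_sd = 10.79
individual = ci_in_sd_units(coef=9.309, se=1.821, control_sd=control_sd)
group = ci_in_sd_units(coef=8.155, se=2.124, control_sd=control_sd)

print(f"Individual: [{individual[0]:.2f}, {individual[1]:.2f}]")
print(f"Group:      [{group[0]:.2f}, {group[1]:.2f}]")
```

Under these assumptions the results can be compared directly with the intervals [0.53, 1.19] and [0.37, 1.14] reported in the table. Dividing by the control SD puts the Anexo K, WMS and MOPS effects on a common scale, which is what allows the overlap comparison across outcomes.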
A more compelling explanation for the lack of impact on the WMS is that this measure is less able to pick up the types of changes in management practices that come from this intervention. A first reason for this is just the general noise in the measure, as discussed above. This noise means that much of the change in the WMS over time may reflect measurement error, making it difficult to detect treatment effects. But a second reason is that the WMS measures practices at a more general level than the level of specificity at which interventions are focused. Evidence in support of the idea that the WMS is not able to pick up the specific changes in practices that these consulting-type interventions bring about comes from the India experiment that initially motivated this work. Bloom et al. (2013) report that their treatment plants increased their use of the 38 specific management practices they measure by 37.8 percentage points, significantly larger than the change for the control firms. They asked Accenture to also apply the WMS survey instrument to these firms during this post-intervention measurement phase. However, Accenture did not receive the LSE training on applying this survey instrument, and appears to have graded firms more harshly, with a mean WMS score of 1.45, compared to a baseline mean of 2.69 when conducted by the LSE team. Despite the large change in management practices observed in the 38 management practices used in Bloom et al. (2013), there is no significant difference in the follow-up WMS scores in this case (mean of 1.43 for the treated firms, 1.49 for the control firms, p-value = 0.693). So, as with our Colombian case, if one were to rely on the WMS to measure whether changes in management had occurred, the conclusion would have been that the Indian interventions had no significant effect on management.
Appendix 8: Comparison of PILA and Firm Employment Data and Changes in Composition of Firm Employment

The PILA is the platform through which firms pay social security for their employees. We requested that government ministries with access to these data attempt to match our firms. This was done three times. First, the department of statistics (DANE) matched to the firm data between January 2014 and June 2016. Second, the Ministry of Health matched our firms to their database, covering the period January 2011 through February 2017. Third, the Ministry of Health later re-matched our firms for the period January 2012 through December 2018. Matching firms was not trivial: firms' names were not always given, the identification number of a company can change if its economic activity or some other feature changes, and at times the same firm was listed under the name of the owner rather than the firm. The last attempt was the most successful and comprehensive, and our PILA series uses the second Ministry of Health extract as a base, correcting a small number of matching errors with data from the previous attempts.

Figure A8.1 shows a scatterplot of the employment reported in the PILA and the employment taken from the firm's records for the set of 7,010 year-month-firm observations between January 2013 and December 2017 for which we have data from both sources. The correlation is 0.93 over the full period, and the mass of points lies close to the 45-degree line. However, we do see a few points which have lower levels of employment reported in the PILA than in firm records. These likely reflect informal employment.

Figure A8.1: Employment Reported in PILA vs Employment Reported by Firms

We use the PILA data to construct for each firm the long difference between their mean employment in 2013, and their mean employment in 2017. Figure A8.2 then shows the quantile treatment effects for this change in employment at different quantiles. The left panel shows the full range of quantiles.
The confidence intervals are extremely wide at the bottom quantile, reflecting the effect of the long left tail seen in Figure 4. The right panel zooms in to the subset of quantiles from 20 to 90. We see the group treatment has positive quantile treatment effects at all quantiles, which are statistically significant at the 10 percent level between the 60th and 70th percentiles.

Figure A8.2: Quantile Treatment Effects on Long Difference in Formal Employment 2013-17

Note: Formal employment data taken from the PILA. 90 percent pointwise confidence intervals shown from cross-sectional estimation of the quantile treatment effect on the long difference in employment.

In addition to data at the firm level, anonymized person-level data enable us to track inflows and outflows of workers from these firms, and to examine the gender and age composition of the workforce, as well as the monthly salaries paid to workers. Column 1 of Table A8 looks at the proportion of workers who were working in firms in January 2013 who remained in the firm five years later, at the end of December 2017. In the control group, only 47 percent of workers remained for this length of time. The point estimate suggests a 5 percentage point increase in this retention rate in the group treatment firms, but this is not statistically significant. Columns 2 and 3 show that 74 percent of workers are male and the average worker is age 43, with neither treatment having large or statistically significant impacts on these worker characteristics. Finally, Column 4 examines the treatment impact on mean worker monthly wages. The group treatment results in a 36,526 COP (3%) point estimate increase, but this is not statistically significant.
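The five-year retention outcome in Column 1 can be illustrated with a short sketch. This is not the authors' code: it assumes that the anonymized person-level records carry a worker identifier that is stable across months, and the function name and the toy identifiers below are hypothetical.

```python
def retention_rate(workers_start, workers_end):
    """Share of workers present at the start date who are still present at the end date.

    Both arguments are iterables of anonymized worker identifiers for one firm.
    Returns None when the firm has no workers at the start date.
    """
    start = set(workers_start)
    if not start:
        return None
    return len(start & set(workers_end)) / len(start)


# Hypothetical firm: worker identifiers observed in Jan 2013 and Dec 2017.
jan_2013 = ["w01", "w02", "w03", "w04"]
dec_2017 = ["w01", "w03", "w17", "w22", "w30"]

rate = retention_rate(jan_2013, dec_2017)
print(f"Five-year retention: {rate:.2f}")
```

New hires after the start date (here `w17`, `w22`, `w30`) do not enter the calculation, since the measure is defined relative to the January 2013 workforce. One firm-level proportion per firm would then be the dependent variable in a cross-sectional regression on the treatment dummies.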
Table A8: Impact on Employment Composition

                                          Five-Year Retention:    Worker Characteristics
                                          Proportion of Jan 2013
                                          workers remaining in    Proportion  Mean     Mean Monthly
                                          firm in Dec 2017        Male        Age      Salary (COP)
Individual Treatment*During Intervention                          0.008       -0.052   -38509
                                                                  (0.009)     (0.362)  (29109)
Individual Treatment*Post Intervention    -0.031                  0.009       0.130    -37387
                                          (0.052)                 (0.014)     (0.485)  (36636)
Group Treatment*During Intervention                               -0.001      -0.041   -36
                                                                  (0.008)     (0.404)  (25805)
Group Treatment*Post Intervention         0.051                   -0.002      -0.272   36526
                                          (0.056)                 (0.010)     (0.467)  (33202)
Sample Size (N*T)                         135                     8502        8502     8502
Sample Size (N)                           135                     146         146      146
P-value: Individual=Group During                                  0.472       0.985    0.339
P-value: Individual=Group Post            0.170                   0.500       0.482    0.117
Control Mean                              0.47                    0.74        43.0     1087335
Control S.D.                              0.20                    0.14        4.97     418941

Notes: Regressions use PILA data on formal employment, and are for sample of surviving firms. Column 1 is a cross-sectional regression for firms with employment data in both Jan 2013 and Dec 2017. Columns 2, 3 and 4 include firm and time fixed effects, and cluster standard errors at the firm level. *, **, *** denote significance at the 10, 5, and 1 percent levels respectively.