WPS5346 Policy Research Working Paper 5346 Empirical Econometric Evaluation of Alternative Methods of Dealing with Missing Values in Investment Climate Surveys Alvaro Escribano Jorge Pena J. Luis Guasch The World Bank Latin America and the Caribbean Region Finance & Private Sector June 2010 Policy Research Working Paper 5346 Abstract Investment climate Surveys are valuable instruments a simple replacement mechanism--for application in that improve our understanding of the economic, models with a large number of explanatory variables-- social, political, and institutional factors determining which in turn is a proxy of two methods: multiple economic growth, particularly in emerging and transition imputations and an export-import algorithm. The economies. However, at the same time, they have to performance of this method in the context of total factor overcome some difficult issues related to the quality of productivity estimation in extended production functions the information provided; measurement errors, outlier is evaluated using investment climate surveys from four observations, and missing data that are frequently found countries: India, South Africa, Tanzania, and Turkey. It in these datasets. This paper discusses the applicability is shown that the method is very robust and performs of recent procedures to deal with missing observations reasonably well even under different assumptions on the in investment climate surveys. In particular, it presents nature of the mechanism generating missing data. This paper--a product of the Finance & Private Sector, Poverty Reduction and Economic Management, Latin America and the Caribbean Region--is part of a larger effort in the department to asses the determinants of productivity. Policy Research Working Papers are also posted on the Web at http://econ.worldbank.org. The author may be contacted at jguasch@worldbank.org. The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent. Produced by the Research Support Team Empirical Econometric Evaluation of Alternative Methods of Dealing with Missing Values in Investment Climate Surveys* by Alvaro Escribano, Jorge Pena and J. Luis Guasch* Keywords: Investment Climate surveys, missing observations, incomplete data, random sampling, sample selection, EM-algorithm and bootstrap. JEL classification: C15, C24, C63, C81, C83. * We have benefited from suggestions from Daniel Peņa, Ariel Pakes, Rodolfo Stucchi and Eric Verhoogen. Telefonica-UC3M Chair of Economics of Telecommunications, Department of Economics, Universidad Carlos III de Madrid; alvaroe@eco.uc3m.es. Department of Economics, Universidad Carlos III de Madrid; jpizquie@eco.uc3m.es. *Senior Adviser, World Bank, Head of the World Bank Global Experts Group on Private-Public-Partnerships, and Professor of Economics, Univeristy of California, San Diego; jguasch@worldbank.org 1. Introduction The Investment Climate (IC) surveys (or Enterprise Surveys) have been created as part of a new strategy by the World Bank to put more emphasis on the intangible assets of developing countries such as knowledge, institutions and culture.4 This new set of information that is becoming available to both scholars and policy makers is intended to be a valuable instrument to help improve our understanding of the economic, social, political and institutional factors determining economic growth, particularly in emerging and transition economies. However, at another level, IC surveys are also a source of trouble for researchers. In general, economic data are far from being perfect and when one is carrying out econometric or statistical analysis with a typical dataset too often we have to deal with the problem of missing values.5 IC datasets are not an exception to this. Their imperfections make our job difficult and often even impossible (Griliches, 1986). Incomplete data is an ubiquitous problem and standard econometric and statistical methods have nothing whatsoever to say about how to solve it. The simplest solution to this problem is to exclude from the analysis any cross-sectional observation with any missing value in it. This strategy is commonly known as casewise deletion, listwise deletion or complete case analysis. The advantage of this method lies obviously in its simplicity. The disadvantage is also rather evident to anyone who has used it: in many applications, casewise deletion excludes from the analysis a large fraction of the original sample. In the context of IC surveys, this is quite a high cost in terms of information lost, as well as the monetary cost arising from losing a large proportion of very expensive interviews. The debate we wish to introduce is whether the researcher should apply some treatment on missing values when using investment climate surveys (ICSs) or rather whether it is preferable to operate with the complete case only. One of the main characteristics of the ICSs is the wide set of information they provide. Concretely, the surveys have been designed to perform a variety of economic and statistical analyses, among which especially interesting are those linking investment climate variables and several measures of firms' economic performance, such as productivity, labor demand, sales, exporting activity, FDI propensity, etc. This means having matrices of data with a remarkably large number of rows and therefore the possibility of using econometric models with a wide set of right hand side variables. Unfortunately, in many cases the problem of missing data is so serious that it prevents us from using those kinds of models. In some of these cases the missingness problem reduces the cross sectional observations available in the complete case to even 0% of the original sampling frame.6 Should the researcher therefore 4 Key determinants of the investment climate, which are included and properly measured in the Investment Climate (IC) series of surveys, include physical and institutional infrastructure, economic and political stability, rule of law, infrastructure, approaches to regulations and taxes, functioning of labor and finance markets, and broader features of governance, such as corruption. The World Bank group has long been a supporter of investment climate reform, recognizing the importance of shaping a business environment conducive to the successful start-up and operation of firms of all sizes in all sectors. 5 Information is missing for various reasons. A sizeable fraction of the respondents refuse, forget or fail to answer some questions. In other cases, even well-trained interviewers may neglect to ask some questions. Sometimes respondents just say they do not have the information available to them or they do not know the answer to the question. Some questions are simply not applicable to some respondents (see Allison, 2001). All of these cases may be applicable to IC data. 6 The number of observations available in the complete case decreases as we consider more and more investment climate variables. If we consider all the variables included in the survey, the complete case due to missing cells is 2 limit himself to using models with a reduced number of independent variables with the risk of introducing a more serious omitted variables problem? Or is it preferable to impute missing data in order to be able to use structural models with a wide set of explanatory variables? If we assume the latter as a reasonable solution, the question that arises then is: should we input missing cells in both LHS (independent) and RHS (dependent) variables, or, on the contrary, ,should we satisfy ourselves by replacing missing data in only those explanatory variables of the model? During recent years statisticians have proposed many alternative methods to handle incomplete datasets that offer substantial improvements over casewise deletion. These approaches may be grouped into two families of methods: maximum likelihood and multiple imputation, see Allison (2001), Meng (2000) and Little and Rubin (1987) for a review. However, these methods depend on easily violated assumptions that, to make things worse, are difficult or even impossible to test. In this paper we discuss the applicability of these methods to four IC surveys with very different patterns of missing data among them: India, Turkey, South Africa and Tanzania. In particular, we propose a simple imputation mechanism (which we call the ICA method) that in part departs from the EM-algorithm, and that has been widely applied in various empirical works (Escribano et al, 2008a, b; Escribano, Guasch and Pena 2009 and Escribano et al. 2009). We compare the performance of this method with several alternative approaches to deal with incomplete data and we discuss the different assumptions we need to hold for the different imputation mechanisms to work well. We evaluate the validity of the different methods in the context of the extended production function of Escribano and Guasch (2005 and 2008).7 The extended production function framework used here fits very well with the objective of the paper as the RHS of the equation is compounded by a broad set of explanatory variables.8 On the other hand, although we concentrate on PF variables, the results of the analysis can be easily extended to any variable with missing information included in the ICSs. We demonstrate that, besides the imputation method used, a detailed knowledge of the missingness mechanism in the context of ICSs is a requisite. The missing data problem is at the core of statistical and econometric analysis done with ICSs and therefore a proper treatment of the missing data mechanism is inevitable. We also show that the so-called ICA method proposed performs reasonably well, even under very different patterns of missing data. The differences of the ICA method with respect to other more sophisticated imputation mechanisms, such as EM algorithms, multiple imputation, bootstrap methods or Heckman models, are not remarkably significant, so we propose it as a benchmark, a homogeneous, simple and easy to implement method for models with large numbers of covariates in ICSs, and more importantly, for very complex and unbalanced patterns of missing data. The structure of the paper is as follows. In section 2 we review the patterns of missing data observed in the four IC surveys considered. We compare the original sampling frame with the complete case and we see that in most cases the representativity of the original sample is 0% in most cases. However, if we construct models using only those investment climate variables with a response rate higher than 80%, the complete case increases from 20% to 30% of the original sampling frame. 7 Although it is straightforward to apply this method to any kind of model, especially those involving a large number of RHS variables or structural system of equations. 8 The underlying philosophy of the Escribano and Guasch (2005 and 2008) extended production function is to incorporate in a Cobb-Douglas (or Translog) function a large set of investment climate variables to correct for observable fixed effects. 3 modified and the total number of observations available for regression analysis is considerably reduced. We compare these numbers with the observations available after the replacement mechanism we propose. Section 3 presents the ICA method and other imputation mechanism used as comparators. We also comment on the different assumptions underlying the different methods proposed. We discuss to what extent the missing data mechanism (MDM) presented in the four surveys analyzed may be considered as missing completely at random (MCAR), missing at random (MAR), or non-ignorable. Section 4 shows the regression results for the extended production function under the different replacement methods. Finally, section 5 concludes. All tables and figures are included in an extensive appendix at the end of the paper. 2. Missing data and investment climate surveys We introduce the problem at hand with Table 1.1 (see appendix tables and figures) which shows the total number of observations, the observations available in the complete case and the final number of observations we have after the replacement process we propose--which we discuss later on--in 43 different ICSs. All the surveys share similar characteristics in the sampling procedure applied and, more importantly, in the information provided. The number of observations lost varies among all the surveys considered. The replacement process considerably increases the sample size in all cases (the method is described in section 3).9 The problem of incomplete data is common to all the IC surveys considered, although it is more persistent in countries like Thailand, Niger, Paraguay, Tanzania and Turkey, in which the percentage of observations available in the complete case is below 30%.10 In Table 1.1 we only consider missing values in production function variables. When we consider all variables likely to be used in regression analysis (all investment climate variables), the complete case even reduces to 0% in some cases.11 9 The sample with replacement fills missing values of all variables of the survey (both production function and IC variables). 10 By means of simplification we understand by complete case the sample with replacement only in IC variables. 11 As said, the problem of missing data is, to a lesser or greater extent, common to almost all the variables presented in the IC surveys. We here consider the missingness and its treatment in production function variables (sales, 4 Table 1.1: Observations available for regression analysis after and before imputing missing values and outliers in 43 ICSs Year of the Obs. In the Complete case After imputing missing cells survey sampling frame #Obs. % with respect to #Obs. % with respect to sampling frame sampling frame Latin America Argentina 2006 746 372 49.9 664 89.0 Bolivia 2006 409 209 51.1 336 82.2 Colombia 2006 649 525 80.9 618 95.2 Mexico 2006 1,161 778 67.0 1,093 94.1 Panama 2006 243 97 39.9 223 91.8 Peru 2006 361 230 63.7 337 93.4 Paraguay 2006 440 111 25.2 315 71.6 Uruguay 2006 396 155 39.1 304 76.8 Chile 2006 697 382 54.8 629 90.2 Costa Rica 2005 1029 643 62.5 970 94.3 Ecuador 2006 394 235 59.6 346 87.8 El Salvador 2006 467 296 63.4 439 94.0 Honduras 2006 263 189 71.9 243 92.4 Guatemala 2006 328 262 79.9 316 96.3 Nicaragua 2006 365 230 63.0 341 93.4 Africa Algeria 2002 1,904 1,114 58.5 1,412 74.2 Benin 2004 591 364 61.6 475 80.4 Botswana 2006 114 109 95.6 113 99.1 Cameroon 2006 119 117 98.3 118 99.2 Egypt 2004 2,931 1,317 44.9 2,629 89.7 Eritrea 2002 237 61 25.7 179 75.5 Ethiopia 2002 1,281 1,048 81.8 1,142 89.1 Kenya 2003 852 360 42.3 585 68.7 Madagascar 2005 870 383 44.0 623 71.6 Malawi 2005 320 208 65.0 288 90.0 Mali 2003 462 242 52.4 309 66.9 Mauritius 2005 636 271 42.6 417 65.6 Morocco 2003 2,550 2,352 92.2 2,422 95.0 Namibia 2006 106 100 94.3 104 98.1 Senegal 2003 783 253 32.3 535 68.3 South Africa* 2003 1,737 1,229 70.8 1,492 85.9 Tanzania* 2003 828 325 39.3 561 67.8 Uganda 2003 900 368 40.9 695 77.2 Zambia 2002 564 391 69.3 417 73.9 Asia Indonesia 2003 1,214 486 40.0 1,041 85.7 Malaysia 2001 1,732 605 34.9 1,317 76.0 Philippines 2003 1,432 1,092 76.3 1,272 88.8 Thailand 2004 2,766 646 23.4 1,502 54.3 Pakistan 2007 2358 990 42.0 2,144 90.9 Bangladesh 2006 4804 2,533 52.7 3,946 82.1 India* 2005 6849 4448 64.9 5750 84.0 Europe Croatia 2007 419 219 52.3 372 88.8 Turkey* 2005 2646 771 29.1 1,619 61.2 Complete case includes those observations without missing values and or outliers in sales, materials, capital, labor cost and labor Source: Authors' calculations with IC data. 5 We focus the analysis on the investment climate surveys of India, Turkey, South Africa and Tanzania because they represent almost all the situations regarding the structure of missing data we may find.12 For India, in the complete case we lose 35% of the original sampling frame, while after replacing we only lose 16%. Turkey and Tanzania lose a similar percentage of observations, 70.9% and 60.7% respectively. South Africa only loses 29.2%. Table 1.2 looks in depth at the description of the missingness problem of the four countries selected. In this case, for the computation of the observations available in the complete case, we use all those IC variables included in the survey likely to be used in a regression analysis framework. This means using more than 115 variables in India, 90 in Turkey, 168 in South Africa and 162 in Tanzania. For each country we consider two benchmark cases: the first one includes both PF and IC variables in the computation of the complete case, while the second only considers the IC variables. In the extreme case, when we consider all those IC variables, the complete case reduces to 0% of the complete case in all the countries; it doesn't matter whether we include PF or not. Note that the observations available in the complete case increase as we exclude from the computation of the complete case those IC variables with the largest proportion of empty cells reported. In order to have a large enough number of observations we would need to exclude from the analysis those IC variables with a response rate lower than 95%. Even in this case, and also considering the PF variables, we should be forced to exclude 41.1% of the interviews in India, 76.9% in Turkey, 60.2% in South Africa and 66.2% in Tanzania. The evidence concerning the size of the problem of missing information we have to deal with is overwhelming. materials, capital and employment), although all we say about imputing missing information in production function variables can be easily extended to any other IC variable. 12 These datasets have in turn been analyzed in the following works: Escribano, Guasch and de Orte (2009) for India, Escribano, Guasch, de Orte and Pena (2008b and c) for the case of Turkey and Escribano, Guasch and Pena (2009) for South Africa and Tanzania. 6 Table 1.2: Missing values in IC variables and their incidence on complete case A. India [1] [2] IC variables included # variables # obs. Available % over total # obs. Available % over total All IC variables (a) 115 0 0.0 0 0.0 (b) those IC vars. with response rate >70% 80 500 7.3 588 8.6 (c) those IC vars. with response rate >80% 71 942 13.8 1188 17.3 (d) those IC vars. with response rate >90% 63 1663 24.3 2202 32.2 (e) those IC vars. with response rate >95% 40 2109 30.8 2817 41.1 B. Turkey [1] [2] IC variables included # variables # obs. Available % over total # obs. Available % over total All IC variables (a) 90 1 0.0 4 0.2 (b) those IC vars. with response rate >70% 78 426 16.1 740 28.0 (c) those IC vars. with response rate >80% 77 472 17.8 1226 46.3 (d) those IC vars. with response rate >90% 75 523 19.8 1394 52.7 (e) those IC vars. with response rate >95% 65 697 26.3 2034 76.9 C. South Africa [1] [2] IC variables included # variables # obs. Available % over total # obs. Available % over total All IC variables (a) 168 0 0.0 0 0.0 (b) those IC vars. with response rate >70% 112 93 5.1 114 6.3 (c) those IC vars. with response rate >80% 108 391 21.6 451 24.9 (d) those IC vars. with response rate >90% 92 620 34.3 769 42.5 (e) those IC vars. with response rate >95% 81 828 45.8 1089 60.2 D. Tanzania [1] [2] IC variables included # variables # obs. Available % over total # obs. Available % over total All IC variables (a) 162 0 0.0 0 0.0 (b) those IC vars. with response rate >70% 98 6 0.7 9 1.1 (c) those IC vars. with response rate >80% 89 32 3.9 69 8.3 (d) those IC vars. with response rate >90% 71 118 14.3 251 30.3 (e) those IC vars. with response rate >95% 40 227 27.4 548 66.2 [1] PF variables are also included In the computation of the final number of observations available in the complete case. [2] PF variables are not included In the computation of the final number of observations available in the complete case. (a) All IC variables are included in the computation of the number of observations available in the complete case. (b) Only those IC variables with a response rate higher than 70% are included in the computation of the number of observations available in the complete case. (c) Only those IC variables with a response rate higher than 80% are included in the computation of the number of observations available in the complete case. (d) Only those IC variables with a response rate higher than 90% are included in the computation of the number of observations available in the complete case. (e) Only those IC variables with a response rate higher than 80% are included in the computation of the number of observations available in the complete case. Source: Authors' estimation with ICSs. In the remaining part of this section we first present the pattern of missing values observed in the four surveys considered. We also evaluate the representativity of the sample with replacement and the complete case with respect to the sampling frame. 7 2.1 Sampling and characteristics of the ICSs The sampling of the ICSs is based on a World Bank template used in a large number of countries and customized in collaboration with regional statistical agencies to reflect country-specific issues and policy areas of interest. In order to ensure proper representation of the sectors of interest,13 respondents are carefully selected. The sampling process is normally based on national industry databases and census of firms or establishments,14 which provide the necessary information on the particular population of establishments. To ensure proper representation of firms, stratification is usually done based on three standards: size, sector and location.15 The information contained in the ICSs is composed of a wide set of around 400 variables. Eventually, the number of variables likely to be used in regression analysis is reduced to around 120-200.16 The Investment Climate Surveys provide information regarding firms' experience in a range of areas related to economic performance: financing, governance, corruption, crime, regulation, tax policy, labor relations, conflict resolution, infrastructures, supplies and marketing, quality, technology, and training among others. The ICSs also provide information on the productivity (or production function) variables, sales output (sales are used as measure of output), employment, intermediate materials, capital stock and labor cost. The resulting panel information is short in the time dimension, since it includes only 2 or 3 years of productivity data (in our case 2 years for Turkey and 3 for India, South Africa and Tanzania), and has 1 year of information for the investment climate variables. Finally, it is important to note that all information is based on recall data and not on book values or accounting. 2.2 The missing information problem at first glance Figures 1.1 to 1.4 show the complex and unbalanced patterns of missing values observed in the PF variables in the four countries considered. The most common case is finding observations with information for all the PF but one. In India, the percentage of establishments reporting information for all the PF variables except capital is 16.3%. In the rest of the countries, this percentage is slightly lower but significantly high too. It is less common to observe data on all the PF variables except sales, materials or employment, although in Tanzania the percentage of firms reporting all the figures except sales is relatively important, 9.8%. The cases for which data is collected for only two PF variables represent, in all the countries, less than 1% of total data. Finally, it is very common to have data collected only for labor; this percentage represents 13.3% in India, 27.9% in Turkey, 5.5% in South Africa and 15.7% in Tanzania. 13 Here we focus only on the manufacturing sector. By classifying the establishments by their ISIC code we generally end up with establishments from the following eight sectors: a) Food and beverages; b) Textiles and apparel; c) Chemicals; d) Non-metallic mineral products; e) Metallic products; f) Machinery and equipment; g) Electrical machinery; h) Transport equipment. 14 The unit of reference in the ICSs is the establishment, although in this paper we refer indistinctively to both establishments and firms. 15 Concretely, the establishments are selected according to a random sampling by industry and region. Taking into account this issue we use standard errors allowing for clustering by industry and region (apart from the conventional correction for heteroskedasticity a la White). In some surveys there is also oversampling of large firms. 16 We understand by "likely to be used in regression analysis" all those variables describing the investment climate in which firms operate and likely to be related to firms' economic performance. 8 Figure 1.1: INDIA, Patterns of missing values in Figure 1.2: TURKEY, Patterns of missing PF variables values in PF variables Sales Materials Capital Labor # of # of % of Sales Materials Capital Labor # of # of % of m.v obs. obs. m.v obs. obs. 0 818 30.9 0 4631 67.6 3 737 27.9 1 1113 16.3 1 345 13.0 3 913 13.3 2 189 7.1 2 89 1.3 1 185 7.0 2 47 0.7 2 133 5.0 4 96 3.6 2 28 0.4 1 87 3.3 1 18 0.3 2 35 1.3 1 10 0.1 3 6 0.2 2 5 0.2 3 5 0.2 1 3 0.1 2 2 0.1 Figure 1.3: SOUTH AFRICA, Patterns of missing Figure 1.4: TANZANIA, Patterns of missing values in PF variables values in PF variables Sales Materials Capital Labor # of # of % of Sales Materials Capital Labor # of # of % of m.v obs. obs. m.v obs. obs. 0 1265 69.9 0 313 37.8 1 220 12.2 3 130 15.7 1 81 9.8 4 123 6.8 1 74 8.9 3 99 5.5 1 51 6.2 1 47 2.6 1 38 4.6 2 24 1.3 4 37 4.5 1 17 0.9 2 30 3.6 2 7 0.4 2 26 3.1 2 4 0.2 2 25 3.0 1 1 0.1 3 9 1.1 2 5 0.6 2 1 0.1 2 3 0.4 3 1 0.1 3 3 0.4 2 2 0.2 3 1 0.1 Notes: Yellow means information available on the corresponding variable. White means information is missing. Source: Authors' calculations with ICS data. Tables from 2.1 to 2.4 of the appendix show the distribution of the number of observations available in the original sampling frame, in the complete case and in the sample with replacement, along with the percentage of observations lost with respect to the original sampling frame. From Table 2.1 the percentage of observations lost in India in the complete case varies when we move industry by industry and size by size. Flagrant cases of loss of observations are small firms operating in the non-metallic products sector (61.9%) or the medium-sized firms of the food sector (55.37%). The replacement process allows retrieving for the analysis a considerable percentage of observations. After the replacement we only lost 28.6% and 22.6% in the two cells mentioned previously. In Turkey the percentages of observations lost by size and industry (see Table 2.2) range from 40% (medium-sized firms in the transport equipment sector) to 87.3% (small firms in textiles and apparels industry). South Africa lost 50% of small firms in textiles and apparel and chemical, rubber and plastics sectors (see Table 2.3). Lastly, Tanzania lost more than 70% of small firms in paper, edition and publishing and machinery and equipment and 73% of large firms in textiles and apparels (Table 2.4). 9 Table 2.1: INDIA, Percentage of observations lost due to missing values by industry and size Size Small Medium Large Total (d) Industry #Obs %Lost #Obs %Lost #Obs %Lost #Obs %Lost Food Sampling frame(a) 333 177 87 597 Complete case(b) 177 46.9 79 55.4 51 41.4 307 48.6 With replacement(c) 248 25.5 137 22.6 69 20.7 454 24 Textiles & Leather Sampling frame 426 255 207 888 Complete case 251 41.1 210 17.7 139 32.9 600 32.4 With replacement 325 23.7 235 7.8 178 14 738 16.9 Apparel Sampling frame 360 315 150 825 Complete case 247 31.4 267 15.2 120 20 634 23.2 With replacement 287 20.3 290 7.9 138 8 715 13.3 Chemicals & Sampling frame 426 333 171 930 Chemical prds Complete case 262 38.5 218 34.5 130 24 610 34.4 With replacement 337 20.9 282 15.3 150 12.3 769 17.3 Plastics & Rubbers Sampling frame 279 189 12 480 Complete case 193 30.8 112 40.7 11 8.3 316 34.2 With replacement 243 12.9 157 16.9 11 8.3 411 14.4 Non-metallic Sampling frame 105 63 48 216 products Complete case 40 61.9 38 39.7 32 33.3 110 49.1 With replacement 75 28.6 50 20.6 39 18.8 164 24.1 Structural metal & Sampling frame 618 252 39 909 metal prds Complete case 328 46.9 131 48 21 46.2 480 47.2 With replacement 526 14.9 214 15.1 31 20.5 771 15.2 Machinery & Sampling frame 1074 687 243 2004 Equipment Complete case 749 30.3 482 29.8 160 34.2 1,391 30.6 With replacement 912 15.1 603 12.2 213 12.4 1728 13.8 Total Sampling frame 3621 2271 957 6849 Complete case 2,247 38 1,537 32.3 664 30.6 4,448 35.1 With replacement 2953 18.5 1968 13.3 829 13.4 5750 16.1 Notes: (a) "Sampling frame" refers to the total number of observations (firms surveyed multiplied by the number of years of information). (b) "Complete case" refers to the complete case of production function variables (sales, materials, capital and labor), missing values in other IC variables--other than production function-- are not considered. (c) "With replacement" refers to the sample after imputing IC variables according to the ICA Method; missing values in other IC variables--other than production function--are not considered. Notice that only observations with information available in at least one of sales, labor, labor cost, materials or capital, are imputed (d) "Perc. lost" refer to the percentage of observations lost with respect to the sampling frame. Source: Authors calculations with IC data. 10 Table 2.2: TURKEY, Percentage of observations lost due to missing values by industry and size Size Small Medium Large Total Industry #Obs %Lost(d) #Obs %Lost #Obs %Lost #Obs %Lost Food and Sampling frame(a) 192 170 202 564 Beverages Complete case(b) 56 70.8 57 66.5 82 59.4 195 65.4 With replacement(c) 134 30.2 116 31.8 150 25.7 400 29.1 Textiles and Sampling frame 110 230 398 738 Apparel Complete case 14 87.3 47 79.6 115 71.1 176 76.2 With replacement 48 56.4 130 43.5 257 35.4 435 41.1 Chemicals Sampling frame 118 98 136 352 Complete case 24 79.7 29 70.4 51 62.5 104 70.5 With replacement 60 49.2 67 31.6 87 36.0 214 39.2 Non-metallic Sampling frame 54 66 46 166 mineral products Complete case 15 72.2 20 69.7 19 58.7 54 67.5 With replacement 46 14.8 51 22.7 30 34.8 127 23.5 Metal products Sampling frame 94 98 92 284 (ex. M&E) Complete case 30 68.1 43 56.1 34 63.0 107 62.3 With replacement 68 27.7 82 16.3 59 35.9 209 26.4 Machinery and Sampling frame 98 78 80 256 Equipment Complete case 37 62.2 31 60.3 38 52.5 106 58.6 With replacement 79 19.4 52 33.3 63 21.3 194 24.2 Electrical Sampling frame 58 40 36 134 machinery Complete case 19 67.2 19 52.5 15 58.3 53 60.4 With replacement 42 27.6 34 15.0 24 33.3 100 25.4 Transport Sampling frame 64 30 58 152 equipment Complete case 31 51.6 18 40.0 15 74.1 64 57.9 With replacement 54 15.6 25 16.7 46 20.7 125 17.8 Total Sampling frame 788 810 1048 2646 Complete case 226 71.3 264 67.4 369 64.8 859 67.5 With replacement 531 32.6 557 31.2 716 31.7 1804 31.8 Notes: (a) "Sampling frame" refers to the total number of observations (firms surveyed multiplied by the number of years of information). (b) "Complete case" refers to the complete case of production function variables (sales, materials, capital and labor), missing values in other IC variables--other than production function-- are not considered. (c) "With replacement" refers to the sample after imputing IC variables according to the ICA Method; missing values in other IC variables--other than production function--are not considered. Notice that only observations with information available on at least one of sales, labor, labor cost, materials or capital, are imputed (d) "Perc. lost" refers to the percentage of observations lost with respect to the sampling frame. Source: Authors calculations with IC data. 11 Table 2.3: SOUTH AFRICA, Percentage of observations lost due to missing values by industry and size Size Small Medium Large Total (d) Industry #Obs %Lost #Obs %Lost #Obs %Lost #Obs %Lost Food & beverages Sampling frame(a) 22 80 87 189 Complete case(b) 13 40.9 49 38.8 69 20.7 131 30.7 With replacement(c) 14 36.4 66 17.5 82 5.7 162 14.3 Textiles & apparel Sampling frame 12 43 120 175 Complete case 6 50 32 25.6 69 42.5 107 38.9 With replacement 10 16.7 33 23.3 101 15.8 144 17.7 Chemicals, rubber & Sampling frame 42 119 118 279 plastics Complete case 21 50 79 33.6 87 26.3 187 33 With replacement 29 31 111 6.7 101 14.4 241 13.6 Paper, edition & Sampling frame 13 89 54 156 publishing Complete case 10 23.1 65 27 45 16.7 120 23.1 With replacement 10 23.1 78 12.4 49 9.3 137 12.2 Machinery & equipment Sampling frame 47 252 256 555 Complete case 25 46.8 198 21.4 212 17.2 435 21.6 With replacement 35 25.5 222 11.9 241 5.9 498 10.3 Wood & furniture Sampling frame 13 74 58 145 Complete case 7 46.2 55 25.7 39 32.8 101 30.3 With replacement 11 15.4 69 6.8 50 13.8 130 10.3 Non-metallic products Sampling frame 13 23 30 66 Complete case 3 76.9 18 21.7 22 26.7 43 34.8 With replacement 6 53.8 18 21.7 26 13.3 50 24.2 Other Sampling frame 27 63 57 147 Complete case 19 29.6 38 39.7 47 17.5 104 29.3 With replacement 25 7.4 50 20.6 51 10.5 126 14.3 Total Sampling frame 189 743 780 1712 Complete case 104 45 534 28.1 590 24.4 1228 28.3 With replacement 140 25.9 647 12.9 701 10.1 1488 13.1 Notes: (a) "Sampling frame" refers to the total number of observations (firms surveyed multiplied by the number of years of information). (b) "Complete case" refers to the complete case of production function variables (sales, materials, capital and labor), missing values in other IC variables--other than production function-- are not considered. (c) "With replacement" refers to the sample after imputing IC variables according to the ICA Method; missing values in other IC variables--other than production function--are not considered. Notice that only observations with information available in at least one of sales, labor, labor cost, materials or capital, are imputed (d) "Perc. lost" refers to the percentage of observations lost with respect to the sampling frame. Source: Authors calculations with IC data. 12 Table 2.4: TANZANIA, Percentage of observations lost due to missing values by industry and size Size Small Medium Large Total (d) Industry #Obs %Lost #Obs %Lost #Obs %Lost #Obs %Lost Food & beverages Sampling frame(a) 105 87 51 243 Complete case(b) 47 55.2 44 49.4 17 66.7 108 55.6 With replacement(c) 82 21.9 57 34.5 31 39.2 170 30 Textiles & apparel Sampling frame 33 41 19 93 Complete case 10 69.7 14 65.9 5 73.7 29 68.8 With replacement 26 21.2 24 41.5 8 57.9 58 37.6 Chemicals, rubber & Sampling frame 23 55 24 102 plastics Complete case 10 56.5 18 67.3 14 41.7 42 58.8 With replacement 13 43.5 40 27.3 16 33.3 69 32.4 Paper, edition & Sampling frame 27 39 9 75 publishing Complete case 8 70.4 19 51.3 6 33.3 33 56 With replacement 16 40.7 30 23.1 9 0 55 26.7 Machinery & Sampling frame 49 29 9 87 equipment Complete case 14 71.4 6 79.3 6 33.3 26 70.1 With replacement 36 26.5 21 27.6 8 11.1 65 25.3 Wood & furniture Sampling frame 133 53 9 195 Complete case 52 60.9 13 75.5 3 66.7 68 65.1 With replacement 89 33.1 23 56.6 5 44.4 117 40 Non-metallic Sampling frame 11 16 6 33 products Complete case 3 72.7 11 31.3 5 16.7 19 42.4 With replacement 9 18.2 12 25 6 0 27 18.2 Total Sampling frame 381 320 127 828 Complete case 144 62.2 125 60.9 56 55.9 325 60.7 With replacement 271 28.9 207 35.3 83 34.6 561 32.2 Notes: (a) "Sampling frame" refers to the total number of observations (firms surveyed multiplied by the number of years of information). (b) "Complete case" refers to the complete case of production function variables (sales, materials, capital and labor), missing values in other IC variables--other than production function-- are not considered. (c) "With replacement" refers to the sample after imputing IC variables according to the ICA Method; missing values in other IC variables--other than production function--are not considered. Notice that only observations with information available in at least one of sales, labor, labor cost, materials or capital, are imputed (d) "Perc. lost" refers to the percentage of observations lost with respect to the sampling frame. Source: Authors calculations with IC data. 13 Tables 3.1, 3.2, 3.3 and 3.4 attempt to illustrate how the representativity of the sampling frame changes with respect to the complete case and the sample with replacement.17 In all cases, the percentages vary slightly in the complete case with respect to the sampling frame. The percentages of the sample with replacement are more similar to the sampling frame. For instance, in India from Table 3.1, panel a), the percentage of `food' firms falls from 8.7% to 6.9%, while after the replacement it is 7.9%. Symmetrically, the percentage of `apparel' firms jumps from 12% to 14.3% in the complete case and to 12.4% in the sample with replacement. Similar patterns can be observed in the remaining countries. Finally, from these tables response rates do differ across countries, but within countries they are remarkably uniform across regions and industries. Table 3.1: INDIA, Representativity of sampling frame, complete case and sample with replacement Sampling frame(a) Complete case(b) With replacement(c) # Obs Perc over total # Obs Perc over total # Obs Perc over total a) by Industry Food 597 8.7 307 6.9 454 7.9 Textiles & Leather 888 13 600 13.5 738 12.8 Apparel 825 12 634 14.3 715 12.4 Chemicals & Chemical prds 930 13.6 610 13.7 769 13.4 Plastics & Rubbers 480 7 316 7.1 411 7.1 Non-metallic products 216 3.2 110 2.5 164 2.9 Structural metal & metal prds 909 13.3 480 10.8 771 13.4 Machinery & Equipment 2,004 29.3 1,391 31.3 1,728 30.1 Total 6,849 100 4,448 100 5,750 100 b) by size Small 3,621 52.9 2,247 50.5 2,953 51.4 Medium 2,271 33.2 1,537 34.6 1,968 34.2 Large 957 14 664 14.9 829 14.4 Total 6,849 100 4,448 100 5,750 100 Notes: (a) "Sampling frame" refers to the total number of observations (firms surveyed multiplied by the number of years of information). (b) "Complete case" refers to the complete case of production function variables (sales, materials, capital and labor), missing values in other IC variables--other than production function-- are not considered. (c) "With replacement" refers to the sample after imputing IC variables according to the ICA Method; missing values in other IC variables--other than production function--are not considered. Notice that only observations with information available in at least one of sales, labor, labor cost, materials or capital, are imputed Source: Authors calculations with ICSs data. 17 In order to evaluate how representativity changes from the sampling frame to the complete case, we would need to have information on the weight of each category over the reference population. Unfortunately, this information is not available. As second best, we can still demonstrate how representativity changes from the data we have. Let us suppose population is split into two strata, and that the original sample selects a given number of observations for strata 1 and 2, and as a result X and Y are the percentages that represent the weight of each strata in the population. In the complete case, we introduce the missing data problem so instead of X and Y we have Xī, Yī. If we suppose that the sampling frame is representative of the population then the complete case is said to be representative if, and only if, the weights in the complete case are proportional to the weights in the sampling frame; that is XXī and YYī. 14 Table 3.2: TURKEY, Representativity of sampling frame, complete case and sample with replacement Sampling frame(a) Complete case(b) With replacement(c) # Obs Perc over total # Obs Perc over total # Obs Perc over total a) by Industry Food and Bev. 564 21.3 195 22.7 400 22.2 Textiles and Apparel 738 27.9 176 20.5 435 24.1 Chemicals 352 13.3 104 12.1 214 11.9 Non-metallic mineral products 166 6.3 54 6.3 127 7.0 Metal products (ex. M&E) 284 10.7 107 12.5 209 11.6 Machinery and Equipment 256 9.7 106 12.3 194 10.8 Electrical machinery 134 5.1 53 6.2 100 5.5 Transport equipment 152 5.7 64 7.5 125 6.9 Total 2,646 100 859 100.0 1,804 100.0 b) by size Small 788 29.8 226 26.3 531 29.4 Medium 810 30.6 264 30.7 557 30.9 Large 1048 39.6 369 43.0 716 39.7 Total 2,646 100.0 859 100.0 1,804 100.0 Notes: Same as Table 3.1. Table 3.3: SOUTH AFRICA, Representativity of sampling frame, complete case and sample with replacement Sampling frame(a) Complete case(b) With replacement(c) # Obs Perc over total # Obs Perc over total # Obs Perc over total a) by Industry Food & beverages 189 10.9 131 10.7 159 10.7 Texts & apparel 180 10.4 107 8.7 143 9.6 Chemicals rubber & plastics 285 16.4 187 15.2 241 16.2 Paper, edition & publishing 159 9.2 120 9.8 137 9.2 Machinery & equipment 561 32.3 435 35.4 497 33.4 Wood & furniture 147 8.5 102 8.3 131 8.8 Non-metallic products 66 3.8 43 3.5 49 3.3 Other 150 8.6 104 8.5 129 8.7 Total 1,737 100 1,229 100 1,486 100 b) by size Small 189 11 104 8.5 139 9.4 Medium 743 43.4 534 43.5 647 43.7 Large 780 45.6 590 48 696 47 Total 1,712 100 1,228 100 1,482 100 Notes: Same as Table 3.1. 15 Table 3.4: TANZANIA, Representativity of sampling frame, complete case and sample with replacement Sampling frame(a) Complete case(b) With replacement(c) # Obs Perc over total # Obs Perc over total # Obs Perc over total a) by Industry Food & beverages 243 29.3 108 33.2 170 30.3 Textiles & apparel 93 11.2 29 8.9 58 10.3 Chemicals, rubber & plastics 102 12.3 42 12.9 69 12.3 Paper, edition & publishing 75 9.1 33 10.2 55 9.8 Machinery & equipment/Metallic products 87 10.5 26 8 65 11.6 Wood & furniture 195 23.6 68 20.9 117 20.9 Non-metallic products 33 4 19 5.8 27 4.8 Total 828 100 325 100 561 100 b) by size Small 381 46 144 44.3 271 48.3 Medium 320 38.6 125 38.5 207 36.9 Large 127 15.3 56 17.2 83 14.8 Total 828 100 325 100 561 100 Notes: Same as Table 3.1. 3. Imputation of missing values: The ICA method Rubin (1976) rigorously defined the assumptions that might plausibly be made about missing data mechanisms (MDM).18 When the MDM is ignorable, the objective of the replacement methods is not to augment the sample size, but to preserve the sample representativity, to gain efficiency in the estimation and to retrieve for the analysis a large number of very expensive interviews. The alternative to these methods is the listwise deletion, which is not a panacea even when the MDM is ignorable. Operating with the complete case is only acceptable if incomplete cases attributable to missing data comprise a small percentage, say 5% or less, of the number of total cases (Schafer, 1997), and when the complete case preserves the representativeness of the original sampling frame. In addition, in models with a large number of regressors, missing data problems may encourage analysts to leave out of the regression some explanatory variables with a high proportion of missing values. As Cameron and Trivedi (2005) point out, this practice may be misleading as it leads to an omitted variables problem, which is more serious than the missing data problem per se. To see how the various mechanisms applied to deal with missing data perform, it is useful to depart from a population model of interest. A repeated task that applied researchers carry out in the context of IC data is the estimation of production functions to perform a variety of productivity analyses. Concretely, let us suppose the extended production function as in Escribano and Guasch (2005 and 2008). The population model is given by log Yit 0 L log Lit M log M it K log K it IC ICi D Dit uit , (1) 18 Data on Y variable is said to be missing completely at random (MCAR) if P(Y missing | Y, X)= P(Y missing), where X is a matrix of other variables on data. Data is missing at random (MAR) if P(Y missing | Y, X)= P(Y missing | X). Missing data is nonignorable if P(Y missing | Y, X)= P(Y missing | X, Y). 16 where logY, logL, logM and logK represents output, labor, materials and capital all in logs, IC is the time-invariant vector of investment climate and other control variables and D is a vector of industry/region/size/time dummies. Since the usual time, industry, region and size fixed effects are included in the vector D, and the usual fixed effects are assumed to be observable and included in IC vector, u is assumed to be a usual i.i.d error.19 Equation (1) is of special interest for the purpose of this paper as it implies using a large proportion of the variables included in the ICSs. Furthermore, it is especially useful to illustrate the trade-off between plausible biases inherent in measurement errors that could arise after replacing missing data and the omitted variables bias associated with the complete case. Concretely, in the four cases considered, the final vector of significant IC variables is intended to include 27 variables in India, 18 in Turkey, 31 in South Africa and 25 in Tanzania.20 The definition of the variables used, classified into five broad groups (infrastructures, red tape, finance, quality, and other), is in the appendix on definition of variables. For identification in (1) if we observe all data and under regularity conditions, it is clear that, following Wooldridge (2007), we need E (uit | log Lit , log M it , log K it , ICi , Dit ) 0 . Now let the pattern of missing values for each observation i at moment t be given by sit, where sit=0 if missing value and 1 otherwise. So what we observe is sit log Yit sit ( 0 L log Lit M log M it K log K it IC ICi D Dit ) sit uit . (2) If the pattern of missing values is M.A.R or M.C.A.R then the necessary conditions for equation (4) to be identified are E ( sit uit ) 0 , E[( sit J )( sit uit )] E[( sit Juit )] 0 with J log Lit , log M it , log K it , ICi , Dit . In the additional case of exogenous sample selection, when the pattern of missing values is determined only by the explanatory variables of (1),--for instance the missing values have some patterns on time, size, industries, regions or even between exporters/non-Exporters firms, domestic/foreign, etc--we also need that E(sit uit | sit log Lit , sit log Mit , sit log Kit , sit ICi , sit Dit ) sit E(ui | sit log Lit , sit log Mit , sit log Kit , sit ICi , sit Dit ) 0 . That is, for the identification condition in this case to hold, we need to control for any exogenous variable affecting the pattern of missing values, and this is the way we proceed in the estimation of the productivity equations. Note that once we have controlled for all these variables, we can estimate (2) in the complete case consistently, although at the cost of losing efficiency and in some cases the representativity of the original sampling frame. 19 Concretely, equation (1) is based on the methodology proposed in Escribano and Guasch (2005 and 2008) with further developments in Escribano et al (2008a and b). The selection of variables is detailed in these papers, and it is based on a general to particular procedure. Although for the purpose of this paper, we are not interested in the properties of the model, but wish to test the sensitivity of the results to the imputation method used, it is interesting to clarify that the underlying philosophy of this methodology is to use the time-invariant vectors of IC variables to correct for observable fixed effects. 20 Although the initial set of IC vectors comprises more than 150 variables, a reduction process from the general to the specific was applied in order to find the final sets of significant variables. The final set of variables is required to be robust to 12 different TFP measures. More details are in Escribano and Guasch (2005 and 2008). 17 When the pattern of missing values s is correlated with the dependent variable of (1) we are in the presence of a self-selection case.21 In this case the missing values are not ignorable and we cannot get rid of incomplete observations. In this case, equation (2) must be estimated by other sample selection corrections, such as the Heckman selection model. In what follows we discuss the first imputation mechanism proposed to deal with the problem of incomplete data; the ICA method. 3.1 Imputation of missing values: The ICA method Our method of imputing missing data, which we call the ICA method, shares the expectation step of the Expectation-Maximization (EM) algorithm proposed in the seminal paper of Dempster, Laird and Rubin (1977), a method that, within the maximum likelihood approaches, has been widely applied in several scientific fields (see McLachlan and Krishnan (1997) for a review). In particular, the replacement strategy used departs from the expectation of the production function variables conditional on the industry, region and size the corresponding observation belongs to (`expectation step'). Or equivalently, we replace the missing value by the expectation of the distribution of the variable conditional on the information on sector, region and size according to next equation E ( J it | DR ,it , DI ,it , DS ,it ) 0 R , J DR ,it I, J DI ,it S , J DS ,it J Y , L, M , K (3) where Y, L, M and K represents output, labor, materials and capital and DR, DI and DS are vectors of region, industry and size dummies respectively. Notice that we choose (3) such that it represents the special features of the IC datasets--in IC surveys industry, region and size are the variables used to stratify the sample. After excluding from the replacement process those observations with all the production function variables missing, 22 estimated values to replace incomplete data are given by J it 0 R , J DR ,it T , J DI ,it T , J DS ,it ^ ^ ^ ^ J Y , L, M , K (4) Unlike the EM algorithm,23 the ICA method has the advantage of separating the imputation of missing data from the estimation of the parameters of the population model. More precisely, separating the imputation mechanism of a population model is the main characteristic of the multiple imputation approaches, which allows using them with virtually any kind of data and any kind of model. The ICA method is, in fact, a general multiple imputation mechanism in which we assume that each imputed variable can be represented as a linear function of the 21 Notice that as equation (1) is equivalent to: log Yit L log Lit M log M it K log Kit 0 IC ICi D Dit uit , where on the right hand side we have the productivity index. We are clearly concerned with the possible correlation of the MDM with productivity or TFP as it may induce biases in the estimators of the vector . 22 The ICA method is conservative in the sense that we do not replace missing cells for those observations with all but one PF variables unobserved. We force the industry-region-size cells to have at least 18 values to estimate consistently the sample average. Moreover, in order to avoid biases caused by outlier observations, we use the within-group median instead of the within-group mean. 23 The EM algorithm imputes missing data conditional on a given population model, and therefore chooses the candidates' values to replace the missing cells that maximize the likelihood function conditional on a vector of parameters of that model. 18 variables used to stratify the sample (dummies of industry, region and size), and therefore the fitted values can be used to replace missing data. Hence, the first assumption we need is that the imputed variable can be represented as a multiple linear function of other variables. The second condition that needs to be met for multiple imputation to work well is that all the variables, including those replaced and those used to replace, have normal distributions (see Allison, 2001).24 According to equation (3) and (4), equation (2) represents the `maximization step', which is now given by sit log Yit sit ( 0 L log Lit M log M it K log Kit IC ICi Dit ) sit uit * * * (5) where y, l, m and k with a tilde on top represent the imputed variables and s* is the new pattern of missing values after the replacement process.25 With identification conditions in the MAR case given by E ( s* u ) 0 , E[( s* J )( s* u )] E[( s* Ju )] 0 with J l , m , k , IC , D , while in the it it it it it it it it it it i it case of exogenous sample selection we need that * * * * * * E(situit | sit log Lit , sit log Mit , sit log Kit , sit ICi , sit Dit ) sit E(uit | sit log Lit , sit log Mit , sit log Kit , sit ICi , sit Dit ) 0 . * * * * * * That is, we need to control for any explanatory variable correlated with s* to get consistency either in the inputs or IC variables. When the two assumptions mentioned above (normality and linearity of imputed variables on dummies of industry, region and size) do not hold, the replacement strategy is no longer consistent. Very little can be said about the asymptotic distributions of the estimators obtained under these circumstances because they have not yet been derived. In a general fashion, in these cases we can understand our replaced variables as the classic problem of variables measured with error. In order to illustrate this, let our model be given by yi xi ui , where yi represents sales and xi is a vector of inputs. Suppose that in the population we have that E (ui | xi ) 0 , and that xi is missing when i S . When we predict xi i S such that xi xi vi ^ where xi is our predicted value, then the model becomes yi xi vi ui . Where when i S ^ xi xi and vi=0, while if i S xi xi and vi xi xi . Therefore, consistency of estimates of ^ ^ depends on whether E (vi | xi ) 0 . Consistency follows if the linear regression of the inputs on industry, region and size variables gives us a noisy measure of the true level of the variables. Otherwise we will have a vi and the parameters obtained from regression analysis would be consequently downward biased, and the magnitude of the bias will depend on the standard deviation of the error term relative to the standard deviation of the variable and the proportion of replaced values.26 24 Although these are strong assumptions, the imputation method seems to works well even when the variables have distributions that are manifestly not normal, see Schafer (1997). 25 Variables included in the IC and C vectors are imputed by using the same procedure. However, by means of illustration and simplification here we only discuss the identification condition as if only PF variables were imputed. 26 We thank Ariel Pakes for useful suggestions at this point. 19 3.2 Performance of the ICA method The performance of the ICA method is illustrated by plotting the Kernel densities of the PF variables in the complete case and after imputing missing data. Those are in figures 2.1 to 2.4 in the appendix at the end of the paper. Overall, from these figures the distributions of the ICA method and the complete case tend to be similar when the proportion of missing values is not too high. Divergences appear as the proportion of unobserved sample becomes larger. 20 Figure 2.1: INDIA, evaluation of performance of the ICA method 1 I. Kernel estimates of output and input densities in the complete case and in the sample after imputing missing values by the ICA method A. Sales (log) B. Materials (log) .25 .25 .2 .2 .15 .15 Density Density .1 .1 .05 .05 0 0 0 5 10 15 20 25 5 10 15 20 25 log-Sales log-Materials Complete case ICA method Complete case ICA method C. Capital stock (log) D. Employment (log) .4 .4 .3 .3 Density Density .2 .2 .1 .1 0 0 0 5 10 15 20 6 8 10 12 14 16 log-Capital log-Employment Complete case ICA method Complete case ICA method II. Table of descriptive statistics and tests of equality of distributions of output and inputs in the complete case and in the sample with imputation by the ICA method # Obs. Mean Std. Dev. Min. Max. One-sample K-S (# imputed) Test (p-value) Sales (log) Complete case 5841 12.08 2.30 1.30 22.79 0.000 ICA meth. 5935 (94) 12.07 2.29 1.30 22.79 0.000 Materials (log) Complete case 5597 11.44 2.30 2.94 22.20 0.000 ICA meth. 5933 (336) 11.40 2.28 2.94 22.20 0.000 Capital (log) Complete case 4555 10.31 2.11 1.85 20.73 0.000 ICA meth. 5918 (1363) 10.28 2.10 1.85 20.73 0.000 Empl (log) Complete case 6164 10.82 1.33 6.54 16.16 0.000 ICA meth. 6321 (157) 10.82 1.34 6.54 16.16 0.000 Notes: 1 Epanechnikov kernel. Each point estimated within a range of 300 values. The null hypothesis of the one-sample Kolmogorov-Smirnov Test is that the cumulative distribution differs from the hypothesized theoretical normal distribution. Source: Authors' estimations with ICSs data. 21 Figure 2.2: TURKEY, evaluation of performance of the ICA method I. Kernel1 estimates of output and input densities in the complete case and in the sample after imputing missing values by the ICA method A. Sales (log) B. Materials (log) .2 .2 .15 .15 Density Density .1 .1 .05 .05 0 0 5 10 15 20 5 10 15 20 log-Sales log-Materials Complete case ICA method Complete case ICA method C. Capital stock (log) D. Employment (log) .25 .25 .2 .2 .15 .15 Density Density .1 .1 .05 .05 0 0 0 5 10 15 20 8 10 12 14 16 log-Capital log-Employment Complete case ICA method Complete case ICA method II. Table of descriptive statistics and tests of equality of distributions of output and inputs in the complete case and in the sample with imputation by the ICA method # Obs. Mean Std. Dev. Min. Max. One-sample K-S (# imputed) Test (p-value) Sales (log) Complete case 1497 14.24 2.10 7.78 19.40 0.004 ICA meth. 1821 (324) 14.30 1.99 7.78 19.40 0.000 Materials (log) Complete case 1293 13.19 2.31 4.33 18.65 0.020 ICA meth. 1822 (529) 13.37 2.13 4.34 18.65 0.000 Capital (log) Complete case 1289 11.39 2.26 0.63 19.65 0.015 ICA meth. 1816 (527) 11.32 2.05 1.05 19.65 0.004 Empl (log) Complete case 2529 11.63 1.45 7.64 15.42 0.001 ICA meth. 2548 (19) 11.63 1.45 7.64 15.42 0.001 Notes: 1 Epanechnikov kernel. Each point estimated within a range of 300 values. The null hypothesis of the one-sample Kolmogorov-Smirnov Test is that the cumulative distribution differs from the hypothesized theoretical normal distribution. Source: Authors' estimations with ICSs data. 22 Figure 2.3: SOUTH AFRICA, evaluation of performance of the ICA method I. Kernel1 estimates of output and input densities in the complete case and in the sample after imputing missing values by the ICA method .25 A. Sales (log) B. Materials (log) .2 .2 .15 .15 Density Density .1 .1 .05 .05 0 0 5 10 15 20 25 5 10 15 20 25 log-Sales log-Materials Complete case ICA method Complete case ICA method C. Capital stock (log) D. Employment (log) .4 .4 .3 .3 Density Density .2 .2 .1 .1 0 0 5 10 15 20 25 5 10 15 20 log-Capital log-Employment Complete case ICA method Complete case ICA method II. Table of descriptive statistics and tests of equality of distributions of output and inputs in the complete case and in the sample with imputation by the ICA method # Obs. Mean Std. Dev. Min. Max. One-sample K-S (# imputed) Test (p-value) Sales (log) Complete case 1497 14.24 2.10 7.78 19.40 0.000 ICA meth. 1821 (324) 14.30 1.99 7.78 19.40 0.000 Materials (log) Complete case 1293 13.19 2.31 4.33 18.65 0.000 ICA meth. 1822 (529) 13.37 2.13 4.34 18.65 0.000 Capital (log) Complete case 1289 11.39 2.26 0.63 19.65 0.000 ICA meth. 1816 (527) 11.32 2.05 1.05 19.65 0.000 Empl (log) Complete case 2529 11.63 1.45 7.64 15.42 0.000 ICA meth. 2548 (19) 11.63 1.45 7.64 15.42 0.000 Notes: 1 Epanechnikov kernel. Each point estimated within a range of 300 values. The null hypothesis of the one-sample Kolmogorov-Smirnov Test is that the cumulative distribution differs from the hypothesized theoretical normal distribution. Source: Authors' estimations with ICSs data. 23 Figure 2.4: TANZANIA, evaluation of performance of the ICA method 1 I. Kernel estimates of output and input densities in the complete case and in the sample after imputing missing values by the ICA method A. Sales (log) B. Materials (log) .15 .2 .15 .1 Density Density .1 .05 .05 0 0 5 10 15 20 5 10 15 20 log-Sales log-Materials Complete case ICA method Complete case ICA method C. Capital stock (log) D. Employment (log) .15 .4 .3 .1 Density Density .2 .05 .1 0 0 5 10 15 20 8 10 12 14 16 log-Capital log-Employment Complete case ICA method Complete case ICA method II. Table of descriptive statistics and tests of equality of distributions of output and inputs in the complete case and in the sample with imputation by the ICA method # Obs. Mean Std. Dev. Min. Max. One-sample K-S (# imputed) Test (p-value) Sales (log) Complete case 1497 14.24 2.10 7.78 19.40 0.012 ICA meth. 1821 (324) 14.30 1.99 7.78 19.40 0.001 Materials (log) Complete case 1293 13.19 2.31 4.33 18.65 0.169 ICA meth. 1822 (529) 13.37 2.13 4.34 18.65 0.093 Capital (log) Complete case 1289 11.39 2.26 0.63 19.65 0.053 ICA meth. 1816 (527) 11.32 2.05 1.05 19.65 0.027 Empl (log) Complete case 2529 11.63 1.45 7.64 15.42 0.006 ICA meth. 2548 (19) 11.63 1.45 7.64 15.42 0.002 Notes: 1 Epanechnikov kernel. Each point estimated within a range of 300 values. The null hypothesis of the one-sample Kolmogorov-Smirnov Test is that the cumulative distribution differs from the hypothesized theoretical normal distribution. Source: Authors' estimations with ICSs data. From a more detailed analysis of Figure 2.1, which illustrates the case of India, it is clear that there are not significant differences in the distributions of any of the PF variables in the complete case and in the sample with replacement by the ICA method, which is supported by the Kolmogorov- Smirnov tests. Furthermore, both the sample mean and the standard deviation do not change significantly before and after the imputation process (especially important is the fact that the standard 24 deviation does not decline after the imputation). These observations hold for all the PF variables, even for the case of the capital stock, for which the proportion of imputed values is much higher than in the remaining variables. The case of South Africa case represented in Figure 2.3 reaches the same conclusions as the India sample. On the other hand, the performance of the ICA method in the cases of Turkey and Tanzania shows significantly different behavior from the previous cases. Thus, in Turkey where the response rate of PF variables is below 40%, the kernel estimates suggest slight differences in the shape of the distributions, and, although the sample means are rather similar, the standard deviation estimated after imputing missing values decreases as the proportion of missing values increases. The same holds for the case of Tanzania, although in this case the problem becomes more acute as the sample distributions are far from normal, rejecting the null hypothesis of the Kolmogorov-Smirnov tests. The extent to which the ICA method gives us a good approximation of the population distribution of the variables and therefore leads to a consistent estimation of equation (1) depends on the determinants of the MDM. Studying and analyzing the characteristics of the MDM is precisely the aim of sections 4 and 5, where we investigate the links between the patterns of missing values and productivity, sales and other key characteristics at the firm level such as accountability, informality, corruption, crime, innovative activity, etc. This analysis will be significantly important in the remaining sections, when we compare the ICA method with extensions and other different imputation mechanisms, which rely on different assumptions about the nature of the missingness mechanism. 4. The nature of the missing data mechanism The following section aims to present a careful descriptive analysis of the characteristics of those firms having missing values, in order to judge whether the missing data mechanism may be treated as missing at random or not. 4.1 Why do some establishments refuse to provide or avoid providing certain information? At this point, one question of great concern is the nature of the generating data process: missing completely at random, missing at random or non-ignorable missing data. Different assumptions can be made about the nature of the mechanism generating missing values. In general, missing values may be considered a consequence of some of the following causes: a) firms refuse to answer some questions (they do not have the information at hand, they simply do not know the information, they do not want to report it, they forget to answer some questions, etc); b) the interviewer neglects to ask some questions; and c) the question does not apply to some firms. Since missing data arising from an oversight of the interviewer or because the question simply does not apply represents a small share of the total number of missing values and may be assumed as random, we are clearly concerned with the cases in which firms avoid, refuse or simply do not answer some questions. Here one can make some assumptions as to why firms do not report certain figures to the interviewer. Maybe firms do not report data on production function variables because of lack of accountability. It could also be a matter of informality. Those firms that do not report all sales to IRS authorities may have an incentive to avoid reporting these figures to the data collector as well, even though data is confidential. In this vein, one may also consider that missing 25 values could be correlated with the level of corruption within the environment in which firms operate. Productivity or level of sales could also explain missing values: the higher the level of sales (or productivity) the lower the number of missing values. The explanation could simply be that weaker/less profitable firms do not keep proper accountability, or maybe the managers of weaker firms are less likely to know the PF figures (it is important to point out here that PF variables come from recall data). At this point, the question is whether the pattern of missing values is directly correlated with sales or TFP or if itis correlated indirectly through other variables such as share of exports, imports, access to infrastructures, capacity, innovation, R&D, quality, use of IC technologies, informality, corruption, accountability, etc, which are known to be strongly associated with sales and TFP.27 If the pattern of missing values is directly correlated with the dependent variable of our model--sales or TFP in our case--then the MAR or MCAR assumptions no longer hold. In this case, the missing value mechanism is said to be non-ignorable and the missing data mechanism needs to be modeled together with the structural model we are trying to estimate. On the other hand, when the missing data mechanism is related with sales or TFP indirectly through other--independent or exogenous variables in the dataset, the missing data mechanism is considered to be missing at random, which under regularity conditions is equivalent to saying that missing data is ignorable.28 In this case we can get rid of missing data and operate only with the complete case once we have controlled for the variables correlated with the missingness mechanism. However, some caveats need to be made regarding casewise deletion as we will see in later sections. The descriptive analysis we propose in this section allows us to obtain deeper and more thorough knowledge of the MDM. This is especially useful when the MDM is non-ignorable (not MAR and therefore not MCAR). As Meng (2000) signals, ignorability is untestable from the observed data, so caution is required when drawing conclusions from models with imputed data. Furthermore, sensitivity analysis and subjective knowledge of the nature of the MDM play a critical role here, as Molerberghs et al. (1999) illustrate. In fact, modeling the MDM is a very active line of research with a number of unresolved problems (see e.g. Heitjan, 1994 and 1999; Ibrahim, et al., 1999). From now on, the aim is, therefore, to describe the characteristics of those firms reporting missing values. The types of questions we are aiming to address are: has the missingness mechanism some relevant information for the parameters we are attempting to estimate? Or, in other words, are the parameters of the MDM related to the parameters of our model? And, as a consequence, is the MDM ignorable? 4.2 Is a missing value more likely to be found within small firms? 27 Notice that we are concerned with the correlation of the MDM with either sales or TFP. We use the extended production function of equation (1) where a wide set of IC and C variables is plugged into a general PF in order to control for observable fixed effects. The correlation of MDM with sales may introduce bias in the input-output elasticities estimates, whereas the correlation with TFP could imply biased IC parameters estimates. 28 A separate question is whether MAR is equivalent to ignorable missing data. Even when the missing data mechanism is assumed to be MAR, an additional assumption is needed to ensure that empty cells can be ignored: the parameters of the missing data process need to be unrelated with the parameters of the model we are willing to estimate. However, MAR and ignorability are almost always considered as equivalent assumptions in the literature, since the assumption that the parameters defining the missingness model are unrelated to the structural model is easily satisfied (see Allison, 2001 and Heitjan and Basu, 1996 for illustrations). 26 Firstly, we are concerned with the possibility of systematic bias in the response rates to questions on sales and inputs. Table 4 shows the number of missing values in sales and inputs according to size, which are known to correlate strongly with productivity (and also with sales).29 The pattern in response rates is that small firms (those with fewer than twenty employees) tend to respond less often in India and South Africa. The pattern is somewhat different in Turkey and Tanzania where missing values in the inputs are uniformly distributed across categories of firms' sizes, with the exception of capital stock which has a higher proportion of missing values within small firms. At this point, these results could suggest the presence of some degree of systematic bias of the response rates in India and South Africa. Nonetheless, further investigation is needed to give additional insight into this question. The fact that small firms report less information also suggests that response rates to detailed sales and costs questions could have more to do with accounting and capacity--less affordable for small firms. 29 Categories of size are: small, fewer than 20 employees; medium, between 20 and 100 employees; large, more than 100 employees. 27 Table 4: Number of missing values in production function variables by size Small Medium Large a) INDIA Totals by size 3,621 2,271 957 (a) Sales Number of missing 646 257 95 (b) Perc over totals by size 17.8 11.3 9.9 Labor Number of missing 0 0 0 Perc over totals by size 0 0 0 Materials Number of missing 688 278 101 Perc over totals by size 19 12.2 10.6 Capital Number of missing 1258 640 245 Perc over totals by size 34.7 28.2 25.6 b) TURKEY Totals by size 788 810 1048 Sales Number of missing 335 365 449 Perc over totals by size 42.5 45.1 42.8 Labor Number of missing 34 37 46 Perc over totals by size 4.3 4.6 4.4 Materials Number of missing 346 396 521 Perc over totals by size 43.9 48.9 49.7 Capital Number of missing 462 388 507 Perc over totals by size 58.6 47.9 48.4 c) SOUTH AFRICA Totals by size 197 783 804 Sales Number of missing 40 95 76 Perc over totals by size 20.3 12.1 9.5 Labor Number of missing 23 54 43 Perc over totals by size 11.7 6.9 5.3 Materials Number of missing 53 111 97 Perc over totals by size 26.9 14.2 12.1 Capital Number of missing 69 204 154 Perc over totals by size 35 26.1 19.2 d) TANZANIA Totals by size 361 302 127 Sales Number of missing 129 121 40 Perc over totals by size 35.7 40.1 31.5 Labor Number of missing 28 21 11 Perc over totals by size 7.8 7 8.7 Materials Number of missing 114 87 38 Perc over totals by size 31.6 28.8 29.9 Capital Number of missing 53 111 97 Perc over totals by size 14.7 36.8 76.4 Small: less than 20 employees; medium: between 20 and 100 employees; large: more than 100 employees. (a) Number of missing includes both missing values and outliers in the corresponding variables. (b) Percentage over the total number of observations in each category of firms' size. Source: Authors calculations with IC data. 28 4.3 Are missing values distributed uniformly across different categories of firms? Tables 5.1 to 5.4 offer further empirical underpinning on whether the MDM is related to a firm's' weakness, or rather are other firms' attributes what determine the probability of observing a missing value. Table 5.1 focuses on the case of India. It compares the share of firms reporting at least one missing value on PF variables in the whole sample, with the share of firms reporting missing values by categories of key IC variables. In the case of India, 32.8% of firms report at least one missing value in PF variables. This percentage varies when we take into account categories of IC variables. Thus, those firms that do not use e-mail or experience power outages tend to respond less often to PF questions, respectively 39.0% and 37.8% of firms with missing information within these two categories. It is indicative of the nature of the MDM that those firms hiding some share of sales and/or workforce from IRS tax authorities have more missing values in PF variables on average (see the rows corresponding to Informality (I) and Informality (II)). With regard to corruption, those firms that operate in a more corrupt environment report fewer missing values. Similar conclusions can be obtained from crime; those firms having suffered criminal attempts also tend to avoid reporting PF figures. Symptomatic of the nature of the MDM in India is the fact that firms with access to a credit line and with the annual statements reviewed by a external auditor, report a lower proportion of missing values (PF information is lost for 40.4% of firms without access to credit and 50.2% of firms with the annual statements not audited externally, report at least one missing value). This indicates that a plausible explanation for the missing values is the lack of proper accountability or even informality. Continuing with Table 5.1, other indicative variables of the pattern of missing values are the exporting activity (only 18.2% of those firms exporting directly report any missing value) and the education of the manager (28.5% of firms with a manager with a university education report missing values, while 35.1% of the remaining firms report missing values). These two variables indicate that the level of competitiveness of the firm is another important factor explaining the pattern of missing values. However, other variables that are known to correlate strongly with competitiveness and productivity, such as FDI or the introduction of new technologies and products,, do not provide any further information on the MDM. 29 Table 5.1: INDIA, Proportion of observations with missing values in production function (PF) variables by key IC determinants Proportion of Establishments with: complete information at least one missing Key IC variables on PF variables value in PF variables Whole sample 67.2 32.8 1. Generator Establishments not using own generator 68.6 31.4 Establishments using own generator 66.3 33.7 2. Power outages Establishments that do not experience power outages 61 39 Establishments experiencing power outages 69.4 30.6 3. Water outages Establishments that do not experience water outages 66.9 33.1 Establishments experiencing water outages 71.5 28.5 4. E-mail Establishments that do not use e-mail 62.2 37.8 Establishments using e-mail 70.6 29.4 5. Web page Establishments that do not use web page 66.8 33.2 Establishments using web page 68.3 31.7 6. Informality (I) Establishments reporting all sales to IRS authorities 76.4 23.6 Establishments that hide some share of sales from the IRS 63.5 36.5 7. Informality (II) Establishments reporting all workforce to IRS authorities 78.1 21.9 Establishments that hide some share of workforce from the IRS 62 38 8. Corruption (I) Establishments that do not pay bribes to deal with bureaucracy 63.4 36.6 Establishments paying bribes to deal with bureaucracy 71.6 28.4 9. Corruption (II) Establishments that do not pay bribes to obtain contracts with the gov. 64.6 35.4 Establishments paying bribes to obtain contracts with the government 74.3 25.7 10. Crime Establishments that do not suffer losses due to crime 67.7 32.3 Establishments suffering losses due to crime 58.4 41.6 11. Security Establishments without security expenses 67.1 32.9 Establishments with security expenses 68.2 31.8 12. Loan Establishments without access to a loan 67.5 32.5 Establishments with access to a loan 67.2 32.8 13. Credit line Establishments without access to a credit line 59.6 40.4 Establishments with access to a credit line 73.8 26.2 14. Auditory Establishments with annual statements reviewed by external auditory 49.8 50.2 Establishments without annual statements reviewed by external auditory 70.4 29.6 15. Innovation (I) Establishments without ISO certification 67 33 Establishments with ISO certification 67.8 32.2 16. Innovation (II) Establishments that do not introduce new products 66.4 33.6 Establishments introducing new products 68.7 31.3 17. Innovation Establishments that do not introduce new technologies (III) Establishments introducing new technologies 18. Training Establishments that do not provide training 71.4 28.6 Establishments providing training 65.1 34.9 19. Manager skills Managers with less than a university education 64.9 35.1 Managers with more than a university education 71.5 28.5 20. Exporting Establishments that do not export 68.9 31.1 activity Establishments exporting 81.8 18.2 21. FDI inflows Establishments that do not receive FDI inflows 67.2 32.8 Establishments receiving FDI inflows 60.7 39.3 22. Incorporated Establishments not in an incorporated company 66.8 33.2 company Establishments in an incorporated company 67.9 32.1 23. Holding Establishments not in a holding Establishments in a holding 24. Capacity Establishments that do not use all their capacity 67.2 32.8 utilization Establishments using all their capacity 68.6 31.4 Within production function variables we include labor (labor cost), capital, sales and materials. Source: Authors calculations with IC data. 30 Table 5.2: TURKEY, Proportion of observations with missing values in production function (PF) variables by key IC determinants Proportion of Establishments with: complete information at least one missing Key IC variables on PF variables value in PF variables Whole sample 52.4 47.6 1. Generator Establishments not using own generator Establishments using own generator 2. Power outages Establishments that do not experience power outages 41.1 58.9 Establishments experiencing power outages 55.7 44.3 3. Water outages Establishments that do not experience water outages 53.7 46.3 Establishments experiencing water outages 44.7 55.3 4. E-mail Establishments that do not use e-mail 56.0 44.0 Establishments using e-mail 51.5 48.5 5. Web page Establishments that do not use web page 51.8 48.2 Establishments using web page 52.6 47.4 6. Informality (I) Establishments reporting all sales to IRS authorities 47.1 52.9 Establishments that hide some share of sales from IRS 55.2 44.8 7. Informality (II) Establishments reporting all workforce to IRS authorities 47.6 52.4 Establishments that hide some share of workforce from IRS 57.0 43.0 8. Corruption (I) Establishments that do not pay bribes to deal with bureaucracy 48.0 52.0 Establishments paying bribes to deal with bureaucracy 76.2 23.8 9. Corruption (II) Establishments that do not pay bribes to obtain contracts with the gov 47.7 52.3 Establishments paying bribes to obtain contracts with the government 63.9 36.1 10. Crime Establishments that do not suffer losses due to crime 52.2 47.8 Establishments suffering losses due to crime 54.4 45.6 11. Security Establishments without security expenses 32.0 68.0 Establishments with security expenses 93.0 7.0 12. Loan Establishments without access to a loan 47.6 52.4 Establishments with access to a loan 56.4 43.6 13. Credit line Establishments without access to a credit line 45.5 54.5 Establishments with access to a credit line 60.4 39.6 14. Auditory Establishments with annual statements reviewed by external auditory 56.2 43.8 Establishments without annual statements reviewed by external auditory 47.1 52.9 15. Innovation Establishments without ISO certification 51.0 49.0 (I) Establishments with ISO certification 54.4 45.6 16. Innovation Establishments that do not introduce new products 50.6 49.4 (II) Establishments introducing new products 55.5 44.5 17. Innovation Establishments that do not introduce new technologies 44.0 56.0 (III) Establishments introducing new technologies 64.0 36.0 18. Training Establishments that do not provide training 47.5 52.5 Establishments providing training 56.6 43.4 19. Manager Managers with less than a university education 52.0 48.0 skills Managers with more than a university education 53.9 46.1 20. Exporting Establishments that do not export 54.3 45.7 activity Establishments exporting 50.2 49.8 21. FDI inflows Establishments that do not receive FDI inflows 52.8 47.2 Establishments receiving FDI inflows 43.1 56.9 22. Incorporated Establishments not in an incorporated company 51.9 48.1 company Establishments in an incorporated company 62.1 37.9 23. Holding Establishments not in a holding 53.1 46.9 Establishments in a holding 42.5 57.5 24. Capacity Establishments that do not use all their capacity 55.5 44.5 utilization Establishments using all their capacity 38.0 62.0 Within production function variables we include labor (labor cost), capital, sales and materials. Source: Authors calculations with IC data. 31 Table 5.3: SOUTH AFRICA, Proportion of observations with missing values in production function (PF) variables by key IC determinants Proportion of Establishments with: complete information at least one missing Key IC variables on PF variables value in PF variables Whole sample 72 28 1. Generator Establishments not using own generator 71.8 28.2 Establishments using own generator 73.3 26.7 2. Power outages Establishments that do not experience power outages 63.4 36.6 Establishments experiencing power outages 76.6 23.4 3. Water outages Establishments that do not experience water outages 64.9 35.1 Establishments experiencing water outages 89.4 10.6 4. E-mail Establishments that do not use e-mail 33.3 66.7 Establishments using e-mail 72.5 27.5 5. Web page Establishments that do not use web page 71.9 28.1 Establishments using web page 72.0 28.0 6. Informality (I) Establishments reporting all sales to IRS authorities 59.3 40.7 Establishments that hide some share of sales from IRS 74.3 25.7 7. Informality (II) Establishments reporting all workforce to IRS authorities Establishments that hide some share of workforce from IRS 8. Corruption (I) Establishments that do not pay bribes to deal with bureaucracy 73.4 26.6 Establishments paying bribes to deal with bureaucracy 33.3 66.7 9. Corruption (II) Establishments that do not pay bribes to obtain contracts with the gov 73.7 26.3 Establishments paying bribes to obtain contracts with the government 40.0 60.0 10. Crime Establishments that do not suffer losses due to crime 70.9 29.1 Establishments suffering losses due to crime 72.9 27.1 11. Security Establishments without security expenses 62.1 37.9 Establishments with security expenses 74.5 25.5 12. Loan Establishments without access to a loan 73.9 26.1 Establishments with access to a loan 68.8 31.2 13. Credit line Establishments without access to a credit line 72.6 27.4 Establishments with access to a credit line 71.6 28.4 14. Auditory Establishments with annual statements reviewed by external auditory 38.9 61.1 Establishments without annual statements reviewed by external auditory 73.0 27.0 15. Innovation Establishments without ISO certification 70.9 29.1 (I) Establishments with ISO certification 73.6 26.4 16. Innovation Establishments that do not introduce new products 62.4 37.6 (II) Establishments introducing new products 76.4 23.6 17. Innovation Establishments that do not introduce new technologies 67.6 32.4 (III) Establishments introducing new technologies 74.9 25.1 18. Training Establishments that do not provide training 73.5 26.5 Establishments providing training 71.1 28.9 19. Manager Managers with less than a university education 63.7 36.3 skills Managers with more than a university education 75.3 24.7 20. Exporting Establishments that do not export 69.8 30.2 activity Establishments exporting 75.4 24.6 21. FDI inflows Establishments that do not receive FDI inflows 71.9 28.1 Establishments receiving FDI inflows 72.4 27.6 22. Incorporated Establishments not in an incorporated company 72.9 27.1 company Establishments in a incorporated company 51.4 48.6 23. Holding Establishments not in a holding 72.4 27.6 Establishments in a holding 69.0 31.0 24. Capacity Establishments that do not use all their capacity 72.8 27.2 utilization Establishments using all their capacity 66.7 33.3 Within production function variables we include labor (labor cost), capital, sales and materials. Source: Authors calculations with IC data. 32 Table 5.4: TANZANIA, Proportion of observations with missing values in production function (PF) variables by key IC determinants Proportion of Establishments with: complete information at least one missing Key IC variables on PF variables value in PF variables Whole sample 44.8 55.2 1. Generator Establishments not using own generator 44.9 55.1 Establishments using own generator 45.1 54.9 2. Power outages Establishments that do not experience power outages 42.1 57.9 Establishments experiencing power outages 45.7 54.3 3. Water outages Establishments that do not experience water outages 42.5 57.5 Establishments experiencing water outages 50.4 49.6 4. E-mail Establishments that do not use e-mail 43.2 56.8 Establishments using e-mail 46.4 53.6 5. Web page Establishments that do not use web page 43.4 56.6 Establishments using web page 50.0 50.0 6. Informality (I) Establishments reporting all sales to IRS authorities 45.6 54.4 Establishments that hide some share of sales from IRS 44.3 55.7 7. Informality (II) Establishments reporting all workforce to IRS authorities Establishments that hide some share of workforce to IRS 8. Corruption (I) Establishments that do not pay bribes to deal with bureaucracy 41.3 58.7 Establishments paying bribes to deal with bureaucracy 50.0 50.0 9. Corruption (II) Establishments that do not pay bribes to obtain contracts with the gov 42.8 57.2 Establishments paying bribes to obtain contracts with the government 54.2 45.8 10. Crime Establishments that do not suffer losses due to crime 58.9 41.1 Establishments suffering losses due to crime 0.0 0.0 11. Security Establishments without security expenses 45.0 55.0 Establishments with security expenses 47.6 52.4 12. Loan Establishments without access to a loan 51.8 48.2 Establishments with access to a loan 61.0 39.0 13. Credit line Establishments without access to a credit line 42.1 57.9 Establishments with access to a credit line 50.2 49.8 14. Auditory Establishments with annual statements reviewed by external auditory 32.7 67.3 Establishments without annual statements reviewed by external auditory 48.9 51.1 15. Innovation Establishments without ISO certification 43.4 56.6 (I) Establishments with ISO certification 57.6 42.4 16. Innovation Establishments that do not introduce new products 44.9 55.1 (II) Establishments introducing new products 47.0 53.0 17. Innovation Establishments that do not introduce new technologies 48.3 51.7 (III) Establishments introducing new technologies 39.9 60.1 18. Training Establishments that do not provide training 44.5 55.5 Establishments providing training 47.9 52.1 19. Manager Managers with less than a university education skills Managers with more than a university education 20. Exporting Establishments that do not export 44.6 55.4 activity Establishments exporting 51.6 48.4 21. FDI inflows Establishments that do not receive FDI inflows 43.9 56.1 Establishments receiving FDI inflows 47.5 52.5 22. Incorporated Establishments not in an incorporated company 45.1 54.9 company Establishments in an incorporated company 38.1 61.9 23. Holding Establishments not in a holding 46.4 53.6 Establishments in a holding 33.3 66.7 24. Capacity Establishments that do not use all their capacity 45.5 54.5 utilization Establishments using all their capacity 36.1 63.9 Within production function variables we include labor (labor cost), capital, sales and materials. Source: Authors calculations with IC data. The case of Turkey is represented in Tables 5.2. The patterns are similar to those observed in India. TPower outages experienced, e-mail usage, informalities and corruption are good indicators of the pattern of missing values. Again the proportion of missing values within firms having access to credit and to an external auditory is larger relative to those that do not, which all corroborates the explanation of accountability as a determinant of the MDM. Other variables with important 33 implications for the MDM are exports, the FDI, the introduction of new technologies, the legal status of the firm (am incorporated company or not) and the percentage of capacity utilization. Similar conclusions can be obtained for South Africa in Table 5.3. Missingness in this country appears to be associated with water outages, use of e-mail, informality and corruption, accountability, and the legal status, and, to a lesser extent, with power outages, security expenses and the introduction of new products and technologies. These patterns are even more pronounced in Tanzania. Table 5.4 illustrates that, for instance, in those firms with access to a loan, 39% report missing values, while in those firms without loans the percentage rises to 48.2%. The same holds for informality, corruption, quality, technology, exporting activity, legal status, holdings or capacity utilization. 4.4 More on the relationship between the MDM and the investment climate variables Continuing with the analysis presented so far and in order to go into more depth regarding the relationship between the probability of observing a missing value in TFP and the IC variables, we propose the following model for the probability of observing data on TFP in terms of IC and D variables Pr( sia 1| Di , ICi ) ( 0 2 Di 3a ICi ia ) , a a a where sit is a dichotomous variable of value 1 if we observe all sales, labor, materials and capital and zero otherwise. Symmetrically, in the case of sales, we have the following equation Pr( sib 1| Di , ICi ) ( 0 2 Di 3 ICi ib ) , b b b b where in this case sit takes value 1 if we observe data for sales. Tables 6.1 to 6.4 present the estimated results by applying a LPM to model the probability of having a missing value conditional on the investment climate faced by firms. Concretely, we propose four models for each country. First we consider missing values in TFP conditioning in two different vectors of IC variables. The first specification includes the same set of IC variables as that included in equation (5); that is, the set of covariates statistically significant in the extended production function, before imputing missing values by the ICA method. The second specification chooses the set of significant correlates starting from the whole set of IC variables and applying a general-to- specific procedure of selection of variables. The case of sales is symmetrical in the sense that model [3] uses the same set of IC variables as in equation (5), while the specification shown in column [4] selects the set of variables as we did in the case of column [2]. 34 Table 6.1: INDIA, Linear probability models for the probability of observing TFP and sales Dependent variables: Missing on TFP (a) Missing on sales (b) [1] [2] [3] [4] Explanatory variables: Coeff. Std. Err. Coeff. Std. Err. Coeff. Std. Err. Coeff. Std. Err. Infrastructures: Longest # of days to clear customs for export (a) -0.0279 [0.0112]** -0.0108 [0.0082] Dummy for own generator -0.0066 [0.0165] 0.0072 [0.0134] Water supply from public sources (b) 0.0001 [0.0002] 0.0000 [0.0002] Shipment losses in the domestic market (b) -0.0044 [0.0015]*** -0.0028 [0.0014]** Dummy for own transport -0.0083 [0.0208] 0.0122 [0.0199] Dummy for web page 0.0153 [0.0177] 0.0191 [0.0207] Losses due to power outages (b) -0.0023 [0.0010]** -0.0025 [0.0007]*** Dummy for e-mail (b) 0.0282 [0.0166]* 0.031 [0.0183]* Shipment losses, domestic (b) -0.0043 [0.0014] -0.0028 [0.0011]** Losses due to transport outages (b) -0.0033 [0.0018]*** -0.0035 [0.0015]** Red tape, corruption and crime: Dummy for security 0.0146 [0.0188] 0.0033 [0.0157] Sales reported to taxes (b) 0.0006 [0.0006] 0.0005 [0.0005] Workforce reported f taxes (b) -0.0004 [0.0004] -0.0001 [0.0004] Dummy for payments to speed up bureaucracy 0.0347 [0.0137]** 0.0359 [0.0122]*** Dummy for interventionist labor regulation -0.0327 [0.0180]* -0.0379 [0.0187]** -0.0383 [0.0185]** -0.0409 [0.0189]** Absenteeism (b) -0.0165 [0.0074]** -0.0122 [0.0057]** Dummy for payments to deal with bur. issues (b) 0.0222 [0.0140] 0.0261 [0.0136]* Finance: Dummy for external audit 0.0121 [0.0174] 0.0538 [0.0252]** 0.0086 [0.0140] 0.0423 [0.0161]*** Dummy for trade association -0.0002 [0.0002] 0.0003 [0.0002] Working capital financed by domestic private banks (b) 0.0234 [0.0146] 0.0231 [0.0134]* Dummy for loan (b) 0.0337 [0.0209] 0.0319 [0.0159]** Largest shareholder (b) -0.0003 [0.0002] -0.0004 [0.0002]** Dummy for loan with collateral (b) -0.0802 [0.0318]** -0.0573 [0.0252]** Loans denominated in foreign currency (b) -0.0011 [0.0003]*** -0.0008 [0.0003]*** Quality, innovation and labor skills: Dummy for R&D (a) 0.0016 [0.1084] 0.0153 [0.0147] -0.04 [0.0666] 0.0296 [0.0130]** Dummy for product innovation -0.0073 [0.0157] -0.0099 [0.0133] Dummy for foreign license (b) 0.0481 [0.0314] 0.0572 [0.0297]* Dummy for internal training (b) 0.0025 [0.0197] 0.0001 [0.0186] Unskilled workforce (a) 0.0021 [0.0012]* 0.0017 [0.0011] Workforce with computer 0.0006 [0.0004] 0.0001 [0.0003] Dummy for ISO quality certification (b) 0.0148 [0.0173] 0.0325 [0.0156]*** Dummy for outsourcing (b) 0.0457 [0.0174] 0.0213 [0.0135] Dummy for external training (b) -0.0334 [0.0235] -0.0256 [0.0164] Other control variables: Dummy for incorporated company 0.0185 [0.0146] 0.0308 [0.0139]** Age 0.0077 [0.0103] 0.0097 [0.0095] Share of exports (b) 0.0002 [0.0002] 0.0002 [0.0002] Trade union (b) 0.0007 [0.0004]* 0.0008 [0.0003] 0.0006 [0.0003]* 0.0008 [0.0003]*** Strikes (b) -0.0165 [0.0133] -0.0037 [0.0158] Constant Yes Yes Yes Yes Industry/region/size dummies Yes Yes Yes Yes Observations 2048 2277 2048 2277 R-squared 0.23 0.23 0.18 0.18 (a) Missing in TFP takes value 1 if we observe all sales, materials, labor and capital, and 0 otherwise. (b) Missing in TFP takes value 1 if we observe sales, and 0 otherwise. [1] Model of the probability of observing a missing value in TFP conditional the IC and C variables significant in equation (1). [2] Model of the probability of observing a missing value in TFP and the matrices IC* and C*, selected from the whole set of IC and C variables. [1] Model of the probability of observing a missing value in sales conditional on in the IC and C variables significant in equation (1). [2] Model of the probability of observing a missing value in sales and the matrices IC* and C*, selected from the whole set of IC and C variables. Significance given by robust standard errors allowing for clustering by industry and region *** 1%, **5%, * 10%. Source: Authors' estimations with ICSs data. 35 Table 6.2: TURKEY, Linear probability models for the probability of observing TFP and sales Dependent variables: Missing on TFP Missing on sales [1] [2] [3] [4] Explanatory variables: Coeff. Std. Err. Coeff. Std. Err. Coeff. Std. Err. Coeff. Std. Err. Infrastructures: Days to clear customs for imports (a) 0.019 [0.0592] 0.0189 [0.0669] Losses due to power outages (b) -0.0029 [0.0016]* Losses due to water outages (b) 0.0035 [0.0010]*** Shipment losses (b) -0.0038 [0.0017]** Dummy for e-mail (b) 0.021 [0.0341] 0.0811 [0.0378]** 0.1088 [0.0377]*** Electricity from generator (b) 0.0009 [0.0004]** Red tape, corruption and crime: Crime losses (b) 0.0024 [0.0005]*** 0.0035 [0.0004]*** Security expenses (b) 0.1273 [0.0350]*** 0.1322 [0.0403]*** Manager's time spent on bur. issues (b) -0.003 [0.0009]*** -0.0025 [0.0012]** Dummy for consultant to help deal with bur. issues -0.0693 [0.0175]*** -0.0713 [0.0270]** Number of inspections (B) -0.0036 [0.0022] -0.0221 [0.0129]* 0.0003 [0.0002] Payments to deal with bureaucratic issues (a) 0.00001 [0.0002] 0.0013 [0.0004]*** 0.0092 [0.0037]** 0.0019 [0.0004]*** Sales declared for taxes (a) 0.0087 [0.0035]** -0.0011 [0.0004]** -0.003 [0.0022] -0.0013 [0.0004]*** Payments to obtain a contract with the government (b) -0.0309 [0.0132]** -0.0156 [0.0022]*** -0.0276 [0.0170] -0.0136 [0.0028]*** Production lost due to absenteeism (b) -0.0149 [0.0024]*** -0.0136 [0.0027]*** Dummy for informal competition (b) -0.0332 [0.0177]* -0.0368 [0.0176]** Delay in obtaining a water supply (a) -0.0282 [0.0214] -0.033 [0.0238] Dummy for lawsuit (b) -0.0494 [0.0218]** -0.0728 [0.0293]** Finance: Dummy for credit line -0.0763 [0.0243]*** -0.0908 [0.0247]*** -0.0778 [0.0232]*** Dummy for external auditory (a) 0.0443 [0.0194]** -0.0548 [0.0234]** 0.0327 [0.0230] Loans in foreign currency (b) -0.0005 [0.0003]* -0.0006 [0.0005] Dummy for new land purchased -0.0528 [0.0313]* Dummy for loan denominated in Turkish Lira (b) -0.1216 [0.0238]*** -0.1645 [0.0238]*** Dummy for loan denominated in foreign currency (b) -0.1001 [0.0317]*** -0.1472 [0.0379]*** Dummy for long-term loan (b) 0.1261 [0.0356]*** Quality, innovation and labor skills: Dummy for ISO quality certification (b) 0.0869 [0.0192]*** 0.0696 [0.0206]*** Dummy for new technology (b) -0.1027 [0.0223]*** -0.0987 [0.0260]*** Dummy for foreign licensed technology (b) 0.0607 [0.0244]** Staff with university education (b) 0.0001 [0.0010] 0.0016 [0.0007]** 0.001 [0.0010] Staff-part time workers 0.0018 [0.0007]** 0.0014 [0.0009] 0.0012 [0.0008] Other control variables: Dummy for incorporated company -0.092 [0.0557] -0.0851 [0.0394]** Age -0.0457 [0.0220]** Market share 0.0008 [0.0007] Production lost due to strikes (b) -0.0408 [0.0180]** -0.0056 [0.0246] Dummy for recently privatized firm 0.0222 [0.0949] -0.0344 [0.0877] Dummy for competition against imported products -0.0472 [0.0441] -0.0261 [0.0393] Constant Yes Yes Yes Yes Industry/region/size dummies Yes Yes Yes Yes Observations 1323 1323 1323 1323 R-squared 0.2 0.31 0.24 0.3 See footnotes in Table 6.1. Source: Authors' estimations with ICSs data. 36 Table 6.3: SOUTH AFRICA, Linear probability models for the probability of observing TFP and sales Dependent variables: Missing on TFP Missing on sales [1] [2] [3] [4] Explanatory variables: Coeff. Std. Err. Coeff. Std. Err. Coeff. Std. Err. Coeff. Std. Err. Infrastructures: Days to clear customs for imports (a) -0.018 [0.0587] -0.0782 [0.0509] Sales lost due to power outages (b) -0.0061 [0.0044] -0.0068 [0.0036]* -0.0059 [0.0026]** -0.0051 [0.0022]** Water outages (b) 0.0166 [0.0231] 0.016 [0.0032]*** 0.0021 [0.0196] Average duration of transport failures (a) -0.0206 [0.0467] 0.0064 [0.0445] Wait for electric supply (a) 0.0193 [0.0313] 0.0202 [0.0342] Dummy for email (b) 0.1795 [0.0686]** Dummy for internet 0.0356 [0.0138]** Sales lost due to delivery delays (b) 0.0103 [0.0040]** 0.0115 [0.0034]*** 0.003 [0.0028] 0.0039 [0.0027] Red tape, corruption and crime: Manager's time spent on bur. issues (b) 0.0022 [0.0010]** 0.001 [0.0007] Payments to deal with bureaucratic issues (b) -0.0011 [0.0007] -0.0015 [0.0005]*** -0.0011 [0.0008] Sales declared for taxes (a) 0.0006 [0.0028] -0.0022 [0.0029] -0.0027 [0.0016]* Payments to obtain a contract with the gov. (b) 0.0119 [0.0078] 0.0199 [0.0093]** 0.0199 [0.0079]** Security expenses (a) 0.0033 [0.0102] 0.0078 [0.0024]*** 0.0084 [0.0082] Crime losses (a) 0.0241 [0.0201] Illegal payments in protection (b) -0.0324 [0.0595] -0.0003 [0.0424] Crime losses (a) 0.023 [0.0404] 0.0472 [0.0368] Finance: Percentage of credit unused (b) 0.0002 [0.0003] 0.0004 [0.0003] 0.0004 [0.0002]* Dummy for loan -0.0025 [0.0329] 0.0017 [0.0213] Dummy for credit line (b) -0.0193 [0.0143] Value of the collateral (b) 0.00001 [0.0002] -0.0001 [0.0001] Loans in foreign currency (b) 0.0002 [0.0008] -0.0005 [0.0004] -0.0006 [0.0003]* Charge to clear a check (a) -0.0094 [0.0279] -0.037 [0.0252] -0.0307 [0.0162]* Largest shareholder 0.0002 [0.0004] 0.0003 [0.0004] Working capital fin. by foreign commercial banks (b) 0.003 [0.0026] 0.0046 [0.0026]* 0.0045 [0.0026]* Working capital financed by informal sources (b) 0.0011 [0.0008] 0.0002 [0.0003] Dummy for external auditory (b) -0.1669 [0.0911]* -0.1817 [0.0812]** Quality, innovation and labor skills: Dummy for ISO quality certification (b) 0.0375 [0.0258] 0.0304 [0.0175]* 0.036 [0.0180]* Dummy for new product (b) -0.0234 [0.0310] 0.007 [0.0205] Dummy for discontinued product line (b) -0.0316 [0.0264] -0.0185 [0.0143] Dummy for outsourcing (b) -0.0421 [0.0192]** -0.0267 [0.0138]* Staff - management 0.0009 [0.0012] 0.0013 [0.0010] Staff - non-production workers -0.0009 [0.0007] -0.0008 [0.0006] Dummy for training (b) -0.0231 [0.0146] Training for unskilled workers (a) 0.0015 [0.0023] 0.00001 [0.0020] University staff (b) -0.0007 [0.0007] -0.0012 [0.0005]** -0.0013 [0.0005]** Manager's experience (b) 0.002 [0.0102] -0.0063 [0.0073] Dummy for closed plant -0.0463 [0.0210]** Other control variables: Age (b) -0.0004 [0.0005] -0.0002 [0.0003] Share of the local market (b) 0.0002 [0.0004] 0.0002 [0.0003] Capacity utilization (b) -0.0018 [0.0009]** Constant Yes Yes Yes Yes Industry/region/size dummies Yes Yes Yes Yes Observations 586 594 586 594 R-squared 0.22 0.25 0.24 0.24 See footnotes in Table 6.1. Source: Authors' estimations with ICSs data. 37 Table 6.4: TANZANIA, Linear probability models for the probability of observing TFP and sales Dependent variables: Missing on TFP Missing on sales [1] [2] [3] [4] Explanatory variables: Coeff. Std. Err. Coeff. Std. Err. Coeff. Std. Err. Coeff. Std. Err. Infrastructures: Electricity from own generator (b) -0.0007 [0.0014] -0.0007 [0.0015] Losses due to power outages (b) 0.0035 [0.0050] 0.0049 [0.0023]** 0.0021 [0.0031] Losses due to water outages (b) Water from own well or water infrastructure (a) 0.00001 [0.0030] 0.001 [0.0023] Losses due to phone outages (a) -0.0308 [0.0158]* -0.0219 [0.0157] Transport outages (a) -0.0125 [0.0349] -0.0406 [0.0264] Losses due to transport delay (b) -0.0067 [0.0020]*** Dummy for own roads (b) -0.1213 [0.0768] -0.0904 [0.0977] Dummy for webpage (b) 0.061 [0.0795] 0.0322 [0.0775] Wait for a water supply (a) 0.0192 [0.0249] -0.0178 [0.0271] Low quality supplies (a) -0.0035 [0.0109] -0.0053 [0.0087] -0.0025 [0.0013]* Days of inventory of main supply 0.0358 [0.0175]** Red tape, corruption and crime: Gift to obtain an operating license (b) -0.0519 [0.0754] -0.0152 [0.1104] Payments to deal with bureaucratic issues (b) -0.0592 [0.0227]** -0.0803 [0.0147]*** -0.045 [0.0267] -0.0648 [0.0151]*** Days in inspections (b) -0.0509 [0.0378] -0.0788 [0.0403]* -0.0241 [0.0387] Payments to obtain a contract with the gov. (b) -0.0092 [0.0039]** -0.0117 [0.0034]*** -0.0063 [0.0046] -0.01 [0.0040]** Security expenses (b) -0.0023 [0.0026] -0.0035 [0.0028] Illegal payments for protection (b) -0.0075 [0.0224] -0.0385 [0.0072]*** -0.0405 [0.0095]*** Finance: Dummy for credit line (b) -0.1182 [0.0657]* Interest rate of the loan (a) 0.0033 [0.0076] -0.0017 [0.0061] Loans denominated in foreign currency (b) -0.0014 [0.0009] Dummy for current or saving account (b) 0.1616 [0.0856]* 0.2347 [0.0706]*** Working capital financed by commercial banks (b) -0.0009 [0.0007] -0.0011 [0.0010] Working capital financed by leasing (b) -0.0059 [0.0023]** -0.0059 [0.0013]*** Inputs bought on credit (b) -0.0016 [0.0008]* Sales bought on credit (b) 0.0007 [0.0012] 0.0007 [0.0011] Delay in clearing a domestic currency wire (a) 0.2385 [0.1403]* 0.196 [0.1479] Quality, innovation and labor skills: Dummy for new product (b) 0.0087 [0.0501] 0.002 [0.0462] Dummy for foreign license (b) -0.2748 [0.0649]*** Dummy for upgraded product (b) -0.1705 [0.0752]** Dummy for new technology (b) 0.1973 [0.0631]*** 0.3095 [0.0721]*** Dummy for joint venture (b) -0.2179 [0.0796]** Dummy for outsourcing (b) -0.2066 [0.0960]** Dummy for brought in house (b) -0.2265 [0.0707]*** Staff - skilled workers (b) 0.0007 [0.0004]* 0.0009 [0.0003]*** Staff - professional workers (b) -0.0055 [0.0033] -0.0075 [0.0040]* Workforce with computer (b) 0.003 [0.0017]* 0.0055 [0.0017]*** -0.0007 [0.0014] 0.0026 [0.0015]* Dummy for training (b) -0.0954 [0.0596] Other control variables: Dummy for incorporated company (b) 0.012 [0.1990] -0.075 [0.1534] Dummy for FDI (b) 0.1112 [0.0636]* 0.1255 [0.0549]** 0.1049 [0.0618]* 0.1717 [0.0586]*** Dummy for industrial zone (b) 0.121 [0.0737] 0.1274 [0.0668]* Constant Yes Yes Yes Yes Industry/region/size dummies Yes Yes Yes Yes Observations 262 262 262 262 R-squared 0.18 0.22 0.16 0.3 See footnotes in Table 6.1. Source: Authors' estimations with ICSs data. 38 Besides gathering evidence to show which are the variables empirically associated with the MDM, the main motivation for these models, is to know to what extent we need to control for IC variables in the estimation of equation (5). Bear in mind that even when the MDM is assumed to be MAR, we still need the following moment condition: E(sit uit | sit log Lit , sit log Mit , sit log Kit , sit ICi , sit Dit ) sit E(ui | sit log Lit , sit log Mit , sit log Kit , sit ICi , sit Dit ) 0 , and therefore independence between the set of IC variables we are interested in (those of equations (1) and (5)) and the MDM is achieved only before controlling for any variable correlated with the MDM. At this point, in setting up our model, the question is whether it is enough to use the matrix of IC variables of equations (1) and (5) or, on the contrary, we have to find a better model for the MDM. The results illustrate the clear relation between the MDM and the IC. Whether we use missingness in TFP (model [2]) or in sales (model [4]), those IC variables are able to explain a large proportion of the variance of the MDM. Furthermore, the results come to confirm the analysis of section 4.3, auditing, innovative activity, financing, capacity, corruption or informality among others are significant covariates of the pattern of missing data in all the countries, even after controlling for size, industry and region effects. Moreover, the IC variables used as covariates of equation (1) present high correlation with the MDM, especially in Turkey (see specifications [1] and [3]), supporting the assumption of exogenous sampling selection, with the IC variables influencing the data generating process. Thereby, controlling for those IC variables becomes a requisite. The question that arises at this point is whether it is enough to control for the IC variables of equation (1)--those of specifications [1] and [3]--, or rather should we select the set correlates of the MDM from the whole set of IC variables, as in specifications [2] and [4]?. In this respect, we argue that models [1] and [3] incorporate most of the information we require on the IC. In order to test it, we perform likelihood-ratio tests between model [1] on the one hand and [1] plus [2] on the other. Symmetrically, in the case of sales, we compare model [3] with [3] plus [4]. In addition, we also compare the R2, AIC and BIC criterions of model [1] with that of model [1] plus [2] ([3] with [3] plus [4] for sales). Given these results, in the remaining part of the paper we only control for the IC variables included in equation (1).30 4.5 Some exhibits on the plausible correlation of PF variables and MDM The descriptive analysis of the MDM is completed in figures 3.1 to 3.4. These figures compare the probability of picking an establishment with complete information for all production function variables with the probability of selecting an establishment with information for sales (panel A) and at least one missing value in the remaining PF variables. Panels B, C and D, simply change sales for materials, capital and employment respectively. The aim of these figures is to determine to what extent the pattern of missing values is correlated with PF variables. If the probability mass of picking a firm with a missing value is accumulated around low values of sales, materials, capital and employment, it could indicate that having a missing value is negatively related to the level of sales, materials, labor and/or capital. In other words, the probability of randomly drawing a firm with 30 We also believe that there exists a clear trade-off between parsimony and simplicity in the specification and adding further controls for the MDM 39 information for sales and with, at least, one PF variable missing is higher in firms with low sales. The same holds for materials and employment. The probability is lower for the case of capital. The same pattern is observed in India, Turkey, South Africa and Tanzania. 40 Figure 3.1: INDIA, Kernel density estimates of PF variables (without M.V in PF variables and with M.V in any PF variable) A. Sales B. Materials .3 .25 .2 .2 .15 Density Density .1 .1 .05 0 0 0 5 10 15 20 25 5 10 15 20 25 Log-sales Log-materials Reported sales and the rest of PF vars Reported materials and the rest of PF vars Reported sales with m.v in at least one PF var Reported materials with m.v in at least one PF var C. Labor D. Capital .4 .4 .3 .3 Density Density .2 .2 .1 .1 0 0 6 8 10 12 14 16 0 5 10 15 20 Log-employment (hours per year) Log-capital stock Reported employment and the rest of PF vars Reported capital and the rest of PF vars Reported empl. with m.v in at least one PF var Reported capital with m.v in at least one PF var Notes: Reported X and the rest of PF variables is the distribution of those establishments reporting all PF variables Reported X with m.v in at least one of the rest of P.F is the distribution of those establishments reporting the corresponding PF variable and also reporting at least one missing value in the remaining PF variables Epanechnikov kernel. Each point estimated within a range of 300 values. Source: Authors' estimations with ICSs data. 41 Figure 3.2: TURKEY, Kernel density estimates of PF variables (without M.V in PF variables and with M.V in any PF variable) A. Sales B. Materials .2 .2 .15 .15 Density Density .1 .1 .05 .05 0 0 5 10 15 20 5 10 15 20 Log-sales Log-materials Reported sales and the rest of PF vars Reported materials and the rest of PF vars Reported sales with m.v in at least one PF var Reported materials with m.v in at least one PF var C. Labor D. Capital .3 .2 .15 .2 Density Density .1 .1 .05 0 0 8 10 12 14 16 0 5 10 15 20 Employment (hours per year) Log-capital stock Reported employment and the rest of PF vars Reported empl. with m.v in at least one PF va Reported capital and the rest of PF vars Reported capital with m.v in at least one PF var Notes: Reported X and the rest of PF variables is the distribution of those establishments reporting all PF variables Reported X with m.v in at least one of the rest of P.F is the distribution of those establishments reporting the corresponding PF variable and also reporting at least one missing value in the remaining PF variables Epanechnikov kernel. Each point estimated within a range of 300 values. Source: Authors' estimations with ICSs data. 42 Figure 3.3: SOUTH AFRICA, Kernel density estimates of PF variables (without M.V in PF variables and with M.V in any PF variable) .25 A. Sales B. Materials .2 .2 .15 .15 Density Density .1 .1 .05 .05 0 0 5 10 15 20 25 5 10 15 20 25 Log-sales Log-materials Reported sales and the rest of PF vars Reported materials and the rest of PF vars Reported sales with m.v in at least one PF var Reported materials with m.v in at least one PF var C. Labor D. Capital .4 .4 .3 .3 Density Density .2 .2 .1 .1 0 0 5 10 15 20 5 10 15 20 25 Log-employment (hours per year) Log-capital stock Reported employment and the rest of PF vars Reported capital and the rest of PF vars Reported empl. with m.v in at least one PF var Reported capital with m.v in at least one PF var Notes: Reported X and the rest of PF variable sis the distribution of those establishments reporting all PF variables Reported X with m.v in at least one of the rest of P.F is the distribution of those establishments reporting the corresponding PF variable and also reporting at least one missing value in the remaining PF variables Epanechnikov kernel. Each point estimated within a range of 300 values. Source: Authors' estimations with ICSs data. 43 Figure 3.4: TANZANIA, Kernel density estimates of PF variables in Tanzania (without M.V in PF variables and with M.V in any PF variable) A. Sales B. Materials .2 .15 .15 .1 Density Density .1 .05 .05 0 0 5 10 15 20 5 10 15 20 Log-sales Log-materials Reported sales and the rest of PF vars Reported materials and the rest of PF vars Reported sales with m.v in at least one PF var Reported materials with m.v in at least one PF var C. Labor D. Capital .15 .4 .3 .1 Density Density .2 .05 .1 0 0 8 10 12 14 16 5 10 15 20 Log-employment (hours per year) Log-capital stock Reported employment and the rest of PF vars Reported capital and the rest of PF vars Reported empl. with m.v in at least one PF var Reported capital with m.v in at least one PF var Notes: Reported X and the rest of PF variables is the distribution of those establishments reporting all PF variables Reported X with m.v in at least one of the rest of P.F is the distribution of those establishments reporting the corresponding PF variable and also reporting at least one missing value in the remaining PF variables Epanechnikov kernel. Each point estimated within a range of 300 values. Source: Authors' estimations with ICSs data. Figures 3.1 to 3.4 support the story of weaker firms reporting more missing values. However, the story is not yet conclusive. Firms with low sales (and materials, capital and employment) do not usually need proper accountability also tend to operate in more corrupt environments and are less innovative and dynamic. In addition, as most of the firms are accumulated around low values, it is easy to infer that the probability of picking a firm with any missing value in the PF variables will be higher within this range of values as well. From these figures we cannot conclude that low sales do not imply weakness or low productivity, and therefore higher probability of having missing values. 44 4.6 Can we relate the MDM and our endogenous variables by means of the ICSs? So far we know that the MDMs in the countries analyzed are, in some way, related with a number of firms' attributes, such as accountability, corruption, openness, informality or size. However, we are not still able to conclude whether the MDM is determined independently of sales and TFP. The debate would probably end if we were able to construct a model of the probability of having a missing value and productivity (or sales) as RHS variable. Unfortunately, this is not possible because, obviously, we do not observe either productivity or sales when we observe a missing value. However, we can still take advantage of the particular structure of the pattern of missing values to relate it with productivity or sales. Since the number of missing values reported increases when we move backwards in time, we can construct a model relating the probability of having a missing value in any PF variable in period t and productivity (tfp) in period t+1 plus other controls. That is, assuming that information in t+1 is better than in period t--bearing in mind that establishments report recall data--we propose the model below for the probability of having a missing value Pr( sit 1 | tfp it 1 , Dit , IC i ) ( 0a 1a tfp it 1 2a Dit 3a IC i ita ) , a where sa takes value 1 if we observe all sales, labor, materials and capital and 0 otherwise.31 Or alternatively we can also use the following model for sales Pr( sit 1 / y it 1 , Dit , IC i ) ( 0b 1b y it 1 2b Dit 3b IC it it ) , b b where sb takes value 0 if we do not observe sales and y is the logarithm of firms' sales. The question we are trying to answer with these kinds of models is whether the probability of observing a missing value in period t-1 is correlated with the level of sales (productivity or TFP) in period t. Or, in other words, are more productive/profitable firms more likely to keep track of their input/output accountability? Obviously, these models do not imply contemporaneous correlations but we think they might still be a good indicator of the actual relation between the level of sales/TFP and the MDM. On the other hand, an additional consideration should be noted; there is a selection bias in the models as we are only able to use those observations with observable sales or TFP in t+1, so the resulting sub-sample is likely to be biased toward those responding firms. In order to reduce the degree of the bias, we use those imputed values of sales or TFP in period t+1.32 31 In addition, if we assume a first order Markov process for productivity, Pr(tfpt+1/ tfpt, tfpt-1,...)= Pr(tfpt+1/ tfpt) and a therefore tfp in t+1 is a good proxy of tfp in period t the model is reduced to Pr( sit 1/ tfpit , Dit ) (0 1tfpit 2 Dit it ) . 32 Although by applying this strategy we reduce the degree of sample bias, the problem remains to some extent. Nonetheless, we still believe that the models can be very informative about the relation of the plausible endogeneity of the MDM. 45 Table 7: Linear probability models for the effect of TFP and sales on the probability of observing a missing value in t+1 1 A. Missing in TFP Dependent variables: for each country a dummy taking value 1 if we observe all labor, materials, capital and sales Explanatory variables India Turkey South Africa Tanzania log TFP (t+1) 0.0168* 0.0183** 0.0212 0.0281 [0.0091] [0.0084] [0.0180] [0.0250] IC variables 3 Yes Yes Yes Yes Constant Yes Yes Yes Yes Industry/region/size dummies Yes Yes Yes Yes Observations 1476 426 454 87 R-squared 0.27 0.07 0.19 0.32 B. Missing in sales 2 Dependent variables: for each country a dummy taking value 1 if we observe sales Explanatory variables India Turkey South Africa Tanzania log sales (t+1) 0.0063* 0.0069 0.0079 0.0033 [0.0033] [0.0043] [0.0083] [0.0144] IC variables 3 Yes Yes Yes Yes Constant Yes Yes Yes Yes Industry/region/size dummies Yes Yes Yes Yes Observations 1894 677 564 155 R-squared 0.17 0.05 0.16 0.14 1 Missing in TFP takes value 1 if we observe all sales, materials, labor and capital, and 0 otherwise. 2 Missing in sales takes value 1 if we observe sales and 0 otherwise. 3 The set of IC variables of equation (1) is also included. Both TFP and sales are used before imputing missing values. Significance given by robust standard errors allowing for clustering by industry and region *** 1%, **5%, * 10%. Source: Authors' estimations with ICSs data. The results of both equations for missingness in TFP and sales are in Table 7. Under endogenous sampling when the pattern of missing values is correlated with sales or TFP and if we were able to observe everything, we should expect a positive relation between contemporaneous TFP/sales and the missingness problem before controlling for other determinants such as IC and D variables. As a consequence, the relation between missingness `yesterday' and TFP/sales `today' should also be positive. Table 7 supports this view for TFP (see Table 7 panel A) and for the cases of India and Turkey, where the ^ a is positive and therefore more productive firms in year t+1 are 1 associated with a higher probability of being able to keep track of proper accountability on output and inputs in past years. Note that we find this relation even before controlling for IC and D effects. However, the ^ a for South Africa and Tanzania do not indicate any significant association between 1 TFP and missingness in these countries. On the other hand, in the case of sales (panel B) we only observe a positive and significant effect of ^ a in India, although the effect in Turkey is no longer 1 significant. In South Africa and Tanzania the effect remains non-significant. Therefore, Table 7 points to a plausible endogenous selection problem between missingness and TFP in India and Turkey, with the endogenous sampling selection problem corroborated in the case of sales in India but not in Turkey. On the opposite side, the analysis does not support this view in South Africa and Tanzania, neither in the case of sales nor TFP. Nonetheless, Table 7 does not allow us to conclude that there is a self-selection problem in India and Turkey, nor that the MDM is MAR in South Africa and Tanzania. At this point caution is a requisite. All we are able to say is that we have four different patterns of data generating mechanisms. For some of them we find evidence of a more likely self-selection problem and under which we can test the performance of the various imputation methods, including the Heckman models. 46 4.7 Conclusions on the nature of the MDM The question at the core of the analysis of this section is whether the MDM in these countries is governed only by the level of sales or TFP (weakness) or if the MDM can be explained by a number of firms attributes, such as the level of competitiveness, dynamism, corruption, informality, accountability and other indicators relating to the firms' capacity: MAR versus non ignorable missing data assumptions. According to the descriptive analysis presented, the MDM mechanism has to do with informality and corruption and also with the capacity of the firms. More dynamic firms engaged in R&D, quality, innovation of new products, technologies and operating in more exigent and competitive export markets tend to report fewer missing values. Accountability can by itself explain a large share of missing data too. Much of these variables indicate that weaker firms tend to avoid reporting PF figures, and size is in some cases a good indicator of weakness as section 4.1 indicated. All these patterns are, to a greater or lesser extent, common to all the countries analyzed. Notwithstanding this clear relation between IC and MDM, we cannot reject the hypotheses of non-ignorability in any of the cases. As already pointed out, this assumption is untestable from the available data. The preliminary descriptive analysis of section 4.5 points to a relation between the level of usage of inputs and output and missingness. Furthermore, previous econometric analyses of section 4.6 report a plausible relation between TFP and sales and missingness in t-1, especially in the cases of India and Turkey. In either MAR or non-ignorable MDM, we believe that according to the analysis presented, controlling for those IC and D variables related with the missingness mechanism is a requisite, as can be shown from the LPM models presented for the probability of observing the required data to construct sales or TFP measures. This is the way we proceed in the rest of the paper. The aim of the following sections is to explore the dichotomy "MAR versus non-ignorability" of the MDM and their effects on the imputation mechanism proposed by comparing the sensitivity of the results of estimating the extended production function (1) under two assumptions: first, MDM is ignorable and therefore it may be explained by a number of exogenous firms' characteristics; and second, the MDM is endogenous and intimately linked to the level of sales and TFP of the firms. We also take advantage of the heterogeneity of the aprioristic relations observed between the MDM and their determinant in the four countries considered. This will allow us to illustrate how sensitive the results are under very different assumptions. In addition, besides testing the non-ignorable MDM, the analysis we present in what follows also allows us to study how the sensitivity of the imputations from the ICA method responds to: first, additional assumptions, such as randomness, or the amount of information embodied in the ICA method, all of them requiring the MAR assumption; and second, to different patterns of missing data: Turkey and Tanzania with a response rate for sales and TFP lower than 40% and India and South Africa with more than 70% of observations reported. 5. Robustness analysis As indicated, the aim of the paper is to compare the results of estimating equation (1) under the ICA method and several alternative imputation procedures. The methods presented to test the robustness of the results have their origins in two distinct bodies of statistical literature. The first one is related with likelihood-based inference with incomplete data, in particular, the EM algorithm. The second concerns the techniques of Markov Chain Monte Carlo (MCMC), generally referred to as multiple 47 imputation. We also consider extensions of the ICA method, allowing for additional randomness in the imputation procedure and the selection of the explanatory variables in equation (3). Lastly, we consider the estimation of (1) by sample selection estimation, such as different Heckman models.33 The literature on missing data points to the advantages of modern imputation mechanisms-- EM-type algorithms and MCMC simulations--over other simpler methods based on basic standard regression techniques (such as the ICA method presented), see Allison (2001) and Little and Rubin (1987) for a review. Nonetheless, while most of these techniques have been widely evaluated under univariate missing data patterns (missingness for only one variable), or simple patterns of missingness in some of the variables of the dataset, the patterns of missing data observed in ICSs are very complex and unbalanced, even if we only consider PF variables and not the remaining IC variables. As an additional objective, it raises the possibility of evaluating the performance of modern imputation mechanisms under the complex and very different patterns of missing data observed in ICSs. 5.1 The ICA Method as an EM type algorithm The EM algorithm has been widely applied in a broad range of applications, from missing data to latent variables models. Here we present several EM algorithms that will serve as a benchmark to be compared with the ICA method proposed. In particular, the aim is to test the sensitivity of the results obtained from the ICA method compared with other more sophisticated imputation mechanisms allowing for an additional randomness and amount of information embodied in the imputation mechanism. EM-type algorithms are based on an underlying likelihood function of the process generating data, and as a consequence imputed missing data is based on draws from the posterior predictive distributions of the postulated missing data mechanism (or data generating process). A key issue under these mechanisms is whether the MDM may be considered as MAR or not. 5.1.1 EMAlgorithm on size, industry and region Let J denote the vector dependent variable of interest, determined by the underlying unobserved vector variable JMis. Let f *( J Mis | X, ) 0 be the joint density of the latent variables conditional on the matrix of observed regressors X, and let f ( J | X, ) 0 be the joint density of the observed variables. In essence, the maximum likelihood estimator (MLE) in this case maximizes 1 1 1 QN ( ) LN ( ) ln f *( J Mis | X, ) ln f ( J Mis | J , X, ) .34 (6) N N N 33 Note that although in this section we only analyze the behavior of PF variables as if they were the only set of imputed variables, IC variables are in all the cases imputed by the ICA method. 34 Note that J* uniquely determines J but the inverse is not true, that is, J does not uniquely determine J*; from the Bayes Rule it follows that f (J | X) f *(J*| X)/ f *(J*| J, X) (see Cameron and Trivedi, 2005). , , , 48 The first term is not observed and therefore it is ignored. The second term is replaced by its expected value which does not involve JMis. The process is iterative; at the r-th round the expectation ^ of the second term is evaluated at . The Expectation step of the algorithm therefore calculates r 1 ^ QN ( | ^r ) E ln f ( J Mis | J , X, ) | J , X, r . (7) N ^ ^ The Maximization step simply maximizes QN ( | r ) to compute r 1 . Note that the iterative process continues until convergence is achieved. In this paper, we follow Cameron and Trivedi (2005) and propose the next EM type algorithm with our model rewritten as J1 X1 u1 J X u . (8) Mis 2 2 Where N1 are the available observations and N2 the missing observations and X denotes the ^ explanatory variables, the EM algorithm consists of (1) estimating using the N1 available ^ ^ observations; (2) generating J Mis X 2 ; (3) in order to mimic the distribution of J1 generating ^a ^ ^ adjusted values of J Mis ( V 1/ 2 J Mis ) u m , where u m is a Monte Carlo draw from the N(0, s2) distribution, being s2 the variance of u1 and a estimate of V can be obtained as ^ ^ ^ ^ V ( J Mis ) V ( J | X 2 ) s 2 ( I N 2 X 2 [ X1 ' X1 ]1 X 2 ') , and denotes element by element multiplication; ^ (4) using the augmented sample obtain a revised estimate of ; (5) repeating steps (1) to (4) until convergence is achieved, in the sense that the change in the sum of the square residuals becomes arbitrarily small. Note that steps (3) and (4) are simply random draws from the conditional distributions of J given in the case of step (3), and of given s2 in the case of step (4). In this first case, by means of direct comparisons with the ICA method, we include in the matrix X only the industry, region and size dummies. We also exclude from the imputation those observations with all production function variables missing. Note the advantages of the EM algorithms over the ICA method. Since the EM algorithm ^ works on the posterior predictive density, after each replication the new estimation of improves the previous one--because in each iteration we are approaching the postulated distribution of the mechanism generating data. In addition, theoretically the estimates of s2 improve the ones obtained in the ICA method, as those are likely to be downward biased as they do not make allowance for the uncertainty inherent in JMis. Obviously, these advantages greatly depend on the specification (model) chosen for the EM algorithm. 5.1.2 Extended EMAlgorithm on PF variables The first alternative model for the EM algorithm is to extend matrix X to contain industry, region size, dummies and production function variables. The imputation now has two iterative processes. The first iteration process is the iterative EM algorithm per se, while the second one consists of replacing missing cells conditional on the information available for the remaining production function variables and the patterns of missing values observed (see Figures 1 to 4). We start by 49 replacing the production function variable with the larger amount of missing values where X contains the remaining PF variables. We continue by applying the EM algorithm to the remaining PF variables. 5.1.3 Extended EMAlgorithm on PF and IC variables In order to check the sensitivity of the results to the matrix X used, and therefore to the amount of information embodied in the EM algorithm, we include in this case industry/region/size dummies, PF variables and a large set of IC variables. Concretely, the set of IC variables comes from the significant IC variables of equation (1). The idea is to check how the EM algorithm responds to the amount of information incorporated in the imputation mechanism. Different results with respect to EM algorithms in sections 5.1.1 and 5.1.2 would pose some doubts about the validity of the ICA method, as it does not incorporate enough information in the imputation mechanism. 5.2 Further extensions of the ICA method We now extend the ICA method to meet additional assumptions on the MDM. In particular we develop the ICA method to incorporate some degree of randomness in the imputation. We also propose an ICA method in which the dependent variable of the model (sales or logY) is excluded from the imputation procedure. 5.2.1 Random industryregionsize replacement: random ICA Method Under the two assumptions mentioned in section 3 (normality of replaced variables and linearity, apart from the MAR assumption) the ICA method leads to consistent estimation of the parameters of equation (1). However, it could be argued that a more efficient method might be used. Notice that by imputing missing values we are modifying the population distribution of replaced variables. In particular, if the two conditions mentioned in section 3 hold the sample average of the modified distribution of the variable it converges with the population expectation. Unfortunately, this is not true in the case of the standard deviation. With the replacement strategy we are reducing the variability of the distribution of those variables with missing values and therefore any statistical inference will be based on downward biased standard errors. Moreover, the bias in the standard errors will be higher as the proportion of missing values increases and the sample size decreases. This problem will arise whenever we use imputed data as if it were real data. It has to do with the lack of uncertainty in the estimation of the parameters of estimating regressors equations and reflects the fact that conventional formulas to compute standard errors do not correct for imputed data. The ICA method, although deterministic, introduces variability in the imputation of missing data by replacing missing cells for industries, regions and sizes with the variability given by I*R*S being I, R and S the numbers of industries, regions and sizes respectively. A good question is therefore whether this variation is enough or if the ICA method leads to downward biased standard errors. To answer this, we propose an alternative variation of the ICA method which consists of adding a random part to each imputed value. The new replacement strategy is again based on the expectation of equation (3), but in this case a random term is added in order to embody uncertainty to the imputation mechanism 50 J it 0 R , J DR ,it I , J DI ,it S , J DS ,it J , J ,it ^ ^ ^ ^ ^ J Y , L, M , K (9) where J , is the standard error of the residual J ,it from ^ J it 0 R , J DR ,it I , J DI ,it S , J DS ,it J ,it J Y , L, M , K and J ,it is a random draw from J ,it . In particular, we take 100 random draws from J ,it constructing 100 candidate values to replace each missing cell in the data matrix. To make the definite replacement we compute the average across the 100 candidate values. 5.2.2 Random industryregionsize replacement: bootstrap ICA Method Another problem arising from the lack of uncertainty inherent in deterministic imputation methods is that, generally, when certain instruments and/or regressors are estimated in a first stage (in our case for production function variables) the asymptotic variance needs to be adjusted because of the generated instruments, see Pagan (1984), Newey (1984), Murphy and Topel (1985) and Newey and McFadden (1994).35 A plausible solution for this problem is to compute the bootstrap estimate of the standard errors of the estimated coefficients of equation (5). The idea is to create `r' replications of the original sample using as strata industry and region. In the next step and for each replication, we apply equation (4) to replace the missing data and to estimate equation (5). The result will be a bootstrap distribution of the estimators of equation (4) under different replacements of missing data that can be used to compute the bootstrap estimates of the standard errors. 5.2.3 ICA method on the inputs One can also look at the imputation of missing data in the dependent variable of equation (1), sales. In this respect, it can be argued that the MDM may be correlated with the dependent variable of (1), so imputing missing values in sales and estimate (2) by OLS or standard econometric techniques is not a valid solution. In this case, when s depends on logY, it is clear that s and u are no longer uncorrelated, even though we control for IC and D variables. In particular when s is correlated with logY in equation (2) there is a self-selection problem that should be handled with other sample selection corrections, such as the Heckman model, as we shall see later on. Here we propose the same replacement mechanism as in section 3, but in this case excluding the sales of the replacement process. The extended production function to be estimated is therefore ** ** ** sit log Yit sit ( 0 L log Lit M log M it K log K it IC ICi D Dit ) sit uit , (12) with identification conditions symmetrical to those of equation (5). Note that when there is no sample selection, incomplete data is MAR, the incompleteness of logY is not so large that it makes the complete case unrepresentative of the real population and we are not concerned with efficiency, estimating (12) by standard techniques is equivalent to estimating 35 More precisely, the problem appears when testing the null hypotheses H 0 : 0 , where , , , are the coefficients of generated regressors (see equation 1). Before including the generated regressors in (1), the usual test statistic on has a limiting standard normal distribution under H0. However, when 0 ,, standard t statistics will not be asymptotically valid and an adjustment is needed for the asymptotic variances of all estimators of generated regressors. 51 (5) or (2). On the contrary, when there is a sample selection problem, the point of reference to compare with (12) would be the Heckman selection model. 5.3 Multiple imputation via switching regression The aim now is to propose different imputation mechanisms to compare their performance with the ICA method and its variations. The following imputation mechanism was first proposed by van Buuren, Boshuizen and Knook (1999) and it has been chosen because it fits very well with datasets with a large amount of missing values in many variables, such as IC datasets. See also Schafer (1999) for a tutorial on multiple imputation, and Schafer (1997) and Gelman, King and Liu (1998) for applications. The basic idea is to create a small number of data copies, each of which has the missing values suitably imputed. Each imputed dataset is then analyzed independently. Estimates of the parameters of interest are properly averaged across the data copies, while standard errors are computed according to `Rubin rules', see Rubin (1987). In particular, this multiple imputation mechanism is accomplished in the following steps: 1. Specify the posterior predictive density of incomplete data as p(JMIS|X,s) given that the non- response mechanism is p( s | J, IC, C, D) and the complete data model is p(J, IC, C, D), where X is the set of covariates used in the imputation mechanism and s is the pattern of missing values. The posterior predictive density is generally given by p(JMIS | X , s) p( JMIS | X , s, ) p( | X , s)d (13) where the standard procedure to impute missing data consists of first, drawing a value of * from p ( | X , s ) and second, drawing a value JMis* from p ( J MIS | X , s, * ) . 2. The next step is to draw imputations from this density to produce m complete datasets. Here we follow van Buuren et al. (1999) and we produce m=5 datasets. 3. Estimate equation (1) m times. 4. Pool the m results. This imputation mechanism involves choosing the form of the linear model and the predictor variables. In particular, we use a linear regression of each JMIS= Y, L, M and K on a set X of predictor variables, where the set of predictor variables is given by X=Y, K, L, M, and D. Note that each J is used as a predictor variable and as an imputed variable in (10), while D are used only as predictor variables. 5.4 Sample selection correction (I): Heckman on complete case If the pattern of missing values is endogenously determined (it is correlated with output (logY) in equation (4)), thereby giving rise to a self-selection problem, the ICA method may lead to inconsistent estimates of parameters of (1). In these cases one has to implement the Heckman (1976) or Heckit method to correct for self-selection, since OLS applied either to the complete case or to the sample with replacement is inconsistent. In particular, the Heckman model over the complete case is given by 52 E (log Yit | log Lit , log M it , log Kit , ICiH , Dit , sit 1) 0 L log Lit M log M it K log Kit , (14) ICi Dit ( L log Lit M log M it K log Kit IC ICiH Dit ) where as usual (.) is simply the inverse of Mills ratio or Heckman's lambda given by the following Probit Pr( s 1| lit , mit , kit , ICiH , Dit ) ( L log Kit M log M it K log Kit IC ICiH Dit ) , (15) with the following moment condition E (u | log Lit , log M it ,log Kit , ICi , ICiH , Dit ) 0 . The Heckman method is highly sensitive to model choice, requiring a good knowledge of the nature of the missing data mechanism. For this reason, the selection of the Probit model in (12) goes from the general to the specific, to select the variables with a significant effect on the probability of having a missing value. Concretely, the selection of variables starts with a wide set of more than 120 IC and D variables in each country. Eventually, the final set of significant variables is reduced to a number around 15 and 25. 5.5 Sample selection correction (II): Heckman imputing inputs with the ICA method In 3.4 the selection of Heckman model is based on the complete case. In this section, we propose performing the same model on the sample after replacing missing values in employment, materials and capital according to equations (10) and (11). The Heckman model in this case is given by E (log Yit | log Lit , log M it , log Kit , ICiH , Dit , sit 1) 0 L log Lit M log M it K log Kit , (16) IC D ( log L log M log K IC H D ) i it L it M it K it IC i it with Heckman's Lambda and moment condition obtained symmetrical to the previous sub-section. Note that equation (17) is directly comparable with equation (12). In addition, in sections 5.2.1 and 5.2.2, we introduced the problem of lack of uncertainty in the estimation of the standard errors of estimating regressors equations. A solution proposed was to obtain the bootstrap standard errors under replacement of missing values in each resampling. The solution here is similar: we obtain the bootstrap standard errors to make statistical inference and to correct the aforementioned problem. More precisely, we will compare the standard errors from the estimating sample with the bootstrap estimator of the standard errors, which will give us a benchmark on how serious this issue is in our case. 6. Empirical results The objective of this section is to evaluate to what extent the results obtained from the ICA method are influenced by different assumptions on the MDM. In particular, as we pointed out in section 5, under the ICA method we have to consider two different key assumptions on the patterns of missing data. First, if we can assume MDM as MAR, in which case then we test the goodness-of-fit of the ICA method against other more sophisticated mechanisms that are supposed to work better, as they consider the randomness issue and are able to include more information in the imputation mechanisms. And second, the MDM is non-ignorable and therefore we are forced to apply sample selection corrections such as Heckman models. 53 The evaluation of the ICA method is based on the kernel estimates of inputs and output and the underlying TFP densities under all the imputation mechanism proposed. We also present the empirical results from estimating the extended production function (1) under different imputation methods. In all the cases, we use the ICA method as a benchmark for comparison purposes. In all the regressions, outliers, defined as those observations with ratios of labor cost to sales and/or materials to sales greater than one, are excluded. 6.1 Evaluation of imputation mechanism: Comparison of estimated inputs and output densities The kernel densities of log Yit , log Lit , log M it , log Kit for each country and for the complete case, the ICA method, the random ICA method and the three EM-type algorithms considered are in figures 4.1 to 4.4. In turn, the descriptive statistics of the variables under each imputation mechanism are in tables 8.1 to 8.4. 54 Figure 4.1: INDIA, comparison of the ICA method and other imputation mechanisms for PF variables I. Kernel1 estimates of output and input densities A. Sales (log) B. Materials (log) .25 .25 .2 .2 D ensity of Mate rials D ensity of Sale s .15 .15 .1 .1 .05 .05 0 0 0 5 10 15 20 25 5 10 15 20 25 log-Sales log-Materials Complete case ICA method Complete case ICA method Random ICA EM alg. [1] Random ICA EM alg. [1] EM alg. [2] EM alg [3] EM alg. [2] EM alg [3] C. Capital stock (log) D. Employment (log) .5 .4 .4 D e n sity o f E m p lo y m e n t .3 D e n sity o f C ap ital .3 .2 .2 .1 .1 0 0 0 5 10 15 20 6 8 10 12 14 16 log-Capital log-Employment Complete case ICA method Complete case ICA method Random ICA EM alg. [1] Random ICA EM alg. [1] EM alg. [2] EM alg [3] EM alg. [2] EM alg [3] Notes: 1 Epanechnikov kernel. Each point estimated within a range of 300 values. Source: Authors' estimations with ICSs data. Figure 4.2: TURKEY, comparison of the ICA method and other imputation mechanisms for PF variables I. Kernel1 estimates of output and inputs densities A. Sales (log) B. Materials (log) 55 .2 5 .2 5 .2 .2 D e n sit y o f M a t e ria ls D e n sit y o f S ale s .1 5 .1 5 .1 .1 .0 5 .0 5 0 0 5 10 15 20 5 10 15 20 log-Sales log-Materials Complete case ICA method Complete case ICA method Random ICA EM alg. [1] Random ICA EM alg. [1] EM alg. [2] EM alg [3] EM alg. [2] EM alg [3] C. Capital stock (log) D. Employment (log) .4 .2 5 D e n sit y o f E m p lo y m e n t .2 .3 D e n sity o f C ap it al .1 5 .2 .1 .1 .0 5 0 0 0 5 10 15 20 8 10 12 14 16 log-Capital log-Employment Complete case ICA method Complete case ICA method Random ICA EM alg. [1] Random ICA EM alg. [1] EM alg. [2] EM alg [3] EM alg. [2] EM alg [3] Notes: 1 Epanechnikov kernel. Each point estimated within a range of 300 values. Source: Authors' estimations with ICSs data. 56 Figure 4.3: SOUTH AFRICA, comparison of the ICA method and other imputation mechanisms for PF variables I. Kernel1 estimates of output and inputs densities A. Sales (log) B. Materials (log) .25 .2 5 .2 .2 D e n s ity o f M a te ria ls D ensity of S ales .15 .1 5 .1 .1 .05 .0 5 0 0 5 10 15 20 25 5 10 15 20 25 log-Sales log-Materials Complete case ICA method Complete case ICA method Random ICA EM alg. [1] Random ICA EM alg. [1] EM alg. [2] EM alg [3] EM alg. [2] EM alg [3] C. Capital stock (log) D. Employment (log) .4 .4 D e n s ity o f E m p lo y m e n t .3 .3 D ensity of C apital .2 .2 .1 .1 0 0 5 10 15 20 25 log-Capital 5 10 15 20 log-Employment Complete case ICA method Random ICA EM alg. [1] Complete case ICA method EM alg. [2] EM alg [3] Random ICA EM alg. [1] EM alg. [2] EM alg [3] Notes: 1 Epanechnikov kernel. Each point estimated within a range of 300 values. Source: Authors' estimations with ICSs data. 57 Figure 4.4: TANZANIA, comparison of the ICA method and other imputation mechanisms for PF variables I. Kernel1 estimates of output and input densities A. Sales (log) B. Materials (log) .2 .2 .1 5 .1 5 D e n s ity o f M a te ria ls D e n s ity o f S a le s .1 .1 .0 5 .0 5 0 0 5 10 15 20 5 10 15 20 log-Sales log-Materials Complete case ICA method Complete case ICA method Random ICA EM alg. [1] Random ICA EM alg. [1] EM alg. [2] EM alg [3] EM alg. [2] EM alg [3] C. Capital stock (log) D. Employment (log) .4 .2 D e n s ity o f E m p lo y m e n t .15 .3 D ensity of C apital .2 .05 .1 .1 0 0 5 10 15 20 8 10 12 14 16 log-Capital log-Employment Complete case ICA method Complete case ICA method Random ICA EM alg. [1] Random ICA EM alg. [1] EM alg. [2] EM alg [3] EM alg. [2] EM alg [3] Notes: 1 Epanechnikov kernel. Each point estimated within a range of 300 values. Source: Authors' estimations with ICSs data. 58 Table 8.1 INDIA, Descriptive statistics of production function variables under different imputation mechanism Variable #Obs. (#imputed) Mean Std. Dev. Min Max Sales Complete case 5841.00 12.08 2.30 1.30 22.79 ICA method 5935 (94) 12.07 2.29 1.30 22.79 Random ICA meth. 5935 (94) 12.13 2.32 1.30 22.79 EM alg. [1] 6848 (1007) 12.02 2.19 1.30 22.79 EM alg. [2] 5882 (41) 12.08 2.30 1.30 22.79 EM alg. [3] 5882 (41) 12.08 2.30 1.30 22.79 Materials Complete case 5597.00 11.44 2.30 2.94 22.20 ICA method 5933 (336) 11.40 2.28 2.94 22.20 Random ICA meth. 5933 (336) 11.57 2.35 2.94 22.20 EM alg. [1] 6848 (1251) 11.35 2.17 2.94 22.20 EM alg. [2] 5906 (309) 11.42 2.32 2.94 22.20 EM alg. [3] 5906 (336) 11.42 2.32 2.94 22.20 Capital Complete case 4555.00 10.31 2.11 1.85 20.73 ICA method 5918 (1363) 10.28 2.10 1.85 20.73 Random ICA meth. 5918 (1363) 11.20 2.47 1.85 20.73 EM alg. [1] 6848 (2293) 10.26 1.89 1.85 20.73 EM alg. [2] 5807 (1252) 10.25 2.04 1.85 20.73 EM alg. [3] 5807 (1252) 10.23 2.02 1.85 20.73 Employment Complete case 6164.00 10.82 1.33 6.54 16.16 ICA method 6321 (157) 10.82 1.34 6.54 16.16 Random ICA meth. 6321 (157) 10.84 1.34 6.54 16.16 EM alg. [1] 6849 (687) 10.78 1.31 6.54 16.16 EM alg. [2] 6164 (0) 10.82 1.33 6.54 16.16 EM alg. [3] 6164 (0) 10.82 1.33 6.54 16.16 Source: Authors' estimations with ICSs data. 59 Table 8.2 TURKEY, Descriptive statistics of production function variables under different imputation mechanism Variable #Obs. Mean Std. Dev. Min Max (#imputed) Sales Complete case 1497 14.24 2.10 7.78 19.40 ICA method 1821 (324) 14.30 1.99 7.78 19.40 Random ICA meth. 1821 (324) 14.44 1.97 7.78 19.40 EM alg. [1] 2646 (1149) 14.27 1.78 7.78 19.40 EM alg. [2] 1808 (311) 14.22 2.02 7.55 19.40 EM alg. [3] 1808 (311) 14.22 2.01 7.78 19.40 Materials Complete case 1293 13.19 2.31 4.33 18.65 ICA method 1822 (529) 13.37 2.13 4.34 18.65 Random ICA meth. 1822 (529) 13.59 2.12 4.34 18.65 EM alg. [1] 2646 (1353) 13.31 1.86 4.33 18.65 EM alg. [2] 1802 (509) 13.18 2.18 4.33 18.65 EM alg. [3] 1802 (509) 13.15 2.18 4.33 18.65 Capital Complete case 1289 11.39 2.26 0.63 19.65 ICA method 1816 (527) 11.32 2.05 1.05 19.65 Random ICA meth. 1816 (527) 11.86 2.05 1.05 19.65 EM alg. [1] 2646 (1357) 11.22 1.79 0.63 19.65 EM alg. [2] 1807 (518) 11.28 2.05 0.63 19.65 EM alg. [3] 1807 (518) 11.30 2.04 0.63 19.65 Employment Complete case 2529 11.63 1.45 7.64 15.42 ICA method 2548 (19) 11.63 1.45 7.64 15.42 Random ICA meth. 2548 (19) 11.63 1.44 7.64 15.42 EM alg. [1] 2646 (117) 11.63 1.44 7.64 15.42 EM alg. [2] 2539 (10) 11.63 1.45 7.64 15.42 EM alg. [3] 2539 (10) 11.63 1.45 7.64 15.42 Source: Authors' estimations with ICSs data. 60 Table 8.3 SOUTH AFRICA, Descriptive statistics of production function variables under different imputation mechanism Variable #Obs. (#imputed) Mean Std. Dev. Min Max Sales Complete case 1578 17.43 1.86 8.28 24.29 ICA method 1587 (9) 17.44 1.87 8.28 24.29 Random ICA meth. 1587 (9) 17.44 1.87 8.28 24.29 EM alg. [1] 1789 (211) 17.42 1.81 8.28 24.29 EM alg. [2] 1587 (9) 17.44 1.87 8.28 24.29 EM alg. [3] 1587 (9) 17.44 1.87 8.28 24.29 Materials Complete case 1508 16.59 2.03 3.56 24.21 ICA method 1587 (79) 16.60 2.00 3.56 24.21 Random ICA meth. 1587 (79) 16.66 2.01 3.56 24.21 EM alg. [1] 1789 (281) 16.58 1.93 3.56 24.21 EM alg. [2] 1586 (78) 16.59 2.08 3.56 24.21 EM alg. [3] 1586 (78) 16.59 2.08 3.56 24.21 Capital Complete case 1337 15.29 1.89 7.90 23.48 ICA method 1586 (249) 15.25 1.86 7.90 23.48 Random ICA meth. 1586 (249) 15.60 1.90 7.90 23.48 EM alg. [1] 1786 (449) 15.24 1.75 7.90 23.48 EM alg. [2] 1583 (246) 15.20 1.84 7.90 23.48 EM alg. [3] 1580 (243) 15.22 1.87 7.90 23.48 Employment Complete case 1664 12.12 1.40 5.19 17.47 ICA method 1685 (21) 12.12 1.40 5.19 17.47 Random ICA meth. 1685 (21) 12.13 1.40 5.19 17.47 EM alg. [1] 1784 (120) 12.10 1.40 5.19 17.47 EM alg. [2] 1680 (16) 12.13 1.40 5.19 17.47 EM alg. [3] 1680 (16) 12.13 1.40 5.19 17.47 The null hypothesis of the one-sample Kolmogorov-Smirnov Test is that the cumulative distribution differs from the hypothesized theoretical normal distribution. Source: Authors' estimations with ICSs data. 61 Table 8.4 TANZANIA, Descriptive statistics of production function variables under different imputation mechanism Variable #Obs. (#imputed) Mean Std. Dev. Min Max Sales Complete case 511 14.52 2.43 7.54 20.73 ICA method 667 (156) 14.60 2.30 7.54 20.73 Random ICA meth. 667 (156) 14.85 2.25 7.54 20.73 EM alg. [1] 801 (290) 14.51 2.18 7.54 20.73 EM alg. [2] 647 (136) 14.48 2.42 7.54 20.73 EM alg. [3] 647 (136) 14.48 2.41 7.54 20.73 Materials Complete case 539 13.76 2.58 4.78 20.07 ICA method 667 (128) 13.82 2.52 4.78 20.07 Random ICA meth. 667 (128) 14.08 2.49 4.78 20.07 EM alg. [1] 803 (264) 13.74 2.32 4.78 20.07 EM alg. [2] 646 (107) 13.67 2.58 4.78 20.07 EM alg. [3] 646 (107) 13.67 2.57 4.78 20.07 Capital Complete case 529 13.59 2.69 6.86 19.54 ICA method 664 (135) 13.54 2.57 6.86 19.54 Random ICA meth. 664 (135) 13.91 2.51 6.86 19.54 EM alg. [1] 806 (277) 13.46 2.40 6.86 19.54 EM alg. [2] 654 (125) 13.26 2.74 6.86 19.54 EM alg. [3] 654 (125) 13.26 2.81 5.47 19.54 Employment Complete case 730 10.92 1.37 7.50 15.23 ICA method 788 (58) 10.91 1.34 7.50 15.23 Random ICA meth. 788 (58) 10.94 1.34 7.50 15.23 EM alg. [1] 790 (60) 10.92 1.36 7.50 15.23 EM alg. [2] 758 (28) 10.92 1.36 7.50 15.23 EM alg. [3] 768 (38) 10.92 1.36 7.50 15.23 Source: Authors' estimations with ICSs data. We find that the proportion of missing values is an important factor in the observed underlying distributions after imputing missing values. Therefore, by means of explanation it is useful to discuss the results by groups of countries. The first group, with India and South Africa, comprises those countries with the largest response rate of PF variables, 65% in India and 70% in South Africa. The second group includes Tanzania and Turkey, whose response rates are only 40 and 30% respectively. As shown in the kernel densities, the response rate dramatically determines the shape of the densities after imputing missing values. In India (see Figure 4.1), where the response rate is reasonably high in all the variables except capital, all the methods lead to estimated densities similar to those of the complete case. However, in the case of capital where the response rate is considerably lower, we observe a dramatic change in the distribution of the imputed values by the Random ICA method. Concretely, the distribution appears to have two modes, moving a considerable proportion of density from the center of the distribution to the right. This misleading behavior is already indicated in the case of materials, although to a lesser extent. Regarding the estimated distributions of the remaining imputation mechanism, all of them lead to results similar to those of the complete case, including the ICA method and EM 62 algorithms. Nonetheless, in terms of descriptive statistics, it is noticeable that, in spite of the uncertainty inherent in the EM algorithm [1], it slightly reduces the estimated standard deviation of all PF variables, even with respect to the ICA method case. This is probably due to the higher number of imputed cells than under other mechanisms. Nonetheless, it must also be pointed out that the reduction of the standard deviation is only of the order of one decimal point. In this sense, the Random ICA method, and the remaining EM algorithms increase, to some extent, the estimated standard errors with respect to the ICA method. The case of South Africa is virtually symmetrical to that of India. Again the Random ICA method performs badly in the case of capital. Likewise, due to the larger proportion of missing values imputed, the EM algorithm [1] leads to estimated standard errors that slightly reduce those of the complete case. As the response rate of PF variables decreases, the estimated densities obtained from the EM algorithms and Random ICA method tend to be different from those of the complete case and the standard ICA method, especially in the case of the Random ICA method. This is illustrated in the cases of Turkey and Tanzania in figures 4.2 and 4.4. Nonetheless, the estimated descriptive statistics are quite homogeneous among imputation methods, as shown in tables 8.2 and 8.4. The estimated means are virtually equal in all the cases, and the standard errors show great consistency across specifications, except in the EM algorithm [1] where, again due to the larger proportion of values imputed, the standard errors are slightly lower. It is useful to recapitulate the main conclusions of this subsection before introducing the results of estimating equation (1). Overall, there are small differences in the imputation of PF variables. Nonetheless, these differences become more marked as the number of missing values increases and when the variables are far from being normally distributed. 6.2 Evaluation of imputation mechanism: Comparison of estimating results of equation (1) 6.2.1 Comparison of the ICA method and other EM algorithms Tables 9.1, 9.2, 9.3 and 9.4 show the results of estimating equation (5) after imputing missing values by the ICA method and by the three EM algorithms proposed in section 5.1. A key conclusion is that when the proportion of missing values is not large enough there are no remarkable differences between applying the ICA method or the EM algorithm [1], neither in the point estimates of the input-output (I-O) elasticities, nor in the standard errors (recall that uncertainty is a key issue under EM algorithms). Another interesting observation is that we do not gain much by extending the EM algorithm to include the IC variables among the information set. 63 Table 9.1: INDIA, Extended production function and comparison of ICA method with EM algorithms Dependent variable: Log of total sales EM Algorithms2 ICA Method1 [1] [2] [3] Category Variable Coeff. std. err. Boot. s.e Coeff. std. err. Coeff. std. err. Coeff. std. err. PF variables Log-employment 0.1027 [0.0341]*** (0.0306)*** 0.0976 [0.0331]*** 0.0516 [0.0250]** 0.0527 [0.0250]** Log-materials 0.7989 [0.0185]*** (0.0462)*** 0.8362 [0.0186]*** 0.8607 [0.0176]*** 0.8628 [0.0177]*** Log-capital 0.0676 [0.0239]*** (0.0153)*** 0.0629 [0.0225]*** 0.0537 [0.0146]*** 0.0502 [0.0147]*** Infrastructure Longest # of days to clear customs for exports (a) -0.0125 [0.0263] (0.0376) -0.0039 [0.0275] -0.0158 [0.0209] -0.0156 [0.0208] Dummy for own generator 0.0538 [0.0422] (0.0424) 0.0378 [0.0396] 0.015 [0.0247] 0.0131 [0.0249] Water supply from public sources (b) 0.0014 [0.0005]*** (0.0008)* 0.0013 [0.0004]*** 0.0009 [0.0003]*** 0.0008 [0.0003]** Shipment losses in the domestic market (b) -0.0047 [0.0039] (0.0128) -0.0023 [0.0035] -0.0017 [0.0030] -0.0016 [0.0030] Dummy for own transport 0.0238 [0.0475] (0.0861) -0.0084 [0.0464] -0.003 [0.0340] -0.0023 [0.0341] Dummy for web page 0.0402 [0.0394] (0.0264) 0.0047 [0.0378] 0.0013 [0.0310] 0.0008 [0.0313] Dummy for security 0.0467 [0.0423] (0.1407) 0.0426 [0.0403] 0.0497 [0.0285]* 0.0505 [0.0285]* Red tape, Sales reported for taxes (b) 0.0006 [0.0014] (0.0052) 0.0009 [0.0013] 0.0008 [0.0010] 0.0009 [0.0010] corruption and Workforce reported for taxes (b) -0.0015 [0.0012] (0.0042) -0.0015 [0.0010] -0.0009 [0.0008] -0.0009 [0.0008] crime Dummy for payments to speed up bureaucracy -0.0464 [0.0336] (0.0526) -0.0443 [0.0292] 0.0041 [0.0255] 0.0083 [0.0259] Dummy for interventionist labor regulation -0.036 [0.0361] (0.0211)* -0.0317 [0.0340] -0.0259 [0.0330] -0.028 [0.0331] Absenteeism (b) -0.0299 [0.0222] (0.0571) -0.0204 [0.0195] -0.0069 [0.0156] -0.0071 [0.0160] Finance and Dummy for trade association 0.0785 [0.0455]* (0.0456)* 0.0756 [0.0408]* 0.024 [0.0297] 0.0194 [0.0300] corporate Working capital financed by domestic private banks (b) 0.0002 [0.0007] (0.0005) -0.0002 [0.0007] 0.0003 [0.0006] 0.0003 [0.0006] governance Dummy for external audit 0.0691 [0.0395]* (0.0452) 0.0662 [0.0362]* 0.0633 [0.0283]** 0.0655 [0.0282]** Dummy for loan (b) 0.1102 [0.0473]** (0.0637)* 0.0892 [0.0464]* 0.0121 [0.0331] 0.006 [0.0327] Quality, Dummy for R&D (a) 0.1787 [0.2382] (0.2347) 0.2041 [0.2534] 0.0702 [0.1322] 0.0638 [0.1320] innovation and Dummy for product innovation -0.0073 [0.0360] (0.0710) -0.0153 [0.0332] -0.025 [0.0244] -0.0265 [0.0246] labor skills Dummy for foreign license (b) 0.204 [0.1053]* (0.1302) 0.1425 [0.1033] 0.086 [0.0847] 0.0801 [0.0852] Dummy for internal training (b) 0.0579 [0.0533] (0.0516) 0.0578 [0.0511] 0.0702 [0.0443] 0.0703 [0.0442] Unskilled workforce (a) 0.0013 [0.0036] (0.0016) 0.0013 [0.0036] -0.0034 [0.0030] -0.0039 [0.0031] Workforce with computer 0.0017 [0.0011] (0.0015) 0.0016 [0.0010] 0.0012 [0.0009] 0.0011 [0.0008] Other control Dummy for incorporated company 0.0265 [0.0396] (0.0901) 0.0162 [0.0368] 0.0272 [0.0301] 0.0261 [0.0300] variables Age 0.0534 [0.0267]** (0.0214)** 0.0438 [0.0251]* 0.0456 [0.0174]** 0.0487 [0.0174]*** Share of exports (b) 0.001 [0.0009] (0.0005)** 0.0006 [0.0009] 0.00004 [0.0006] -0.0001 [0.0006] Trade union (b) 0.0008 [0.0012] (0.0010) 0.0008 [0.0012] 0.0009 [0.0009] 0.0007 [0.0009] Strikes (b) -0.0683 [0.0449] (0.0821) -0.0475 [0.0380] -0.0112 [0.0307] -0.0107 [0.0314] Constant 0.7377 [0.3449]** 0.4456 [0.3504] 1.0108 [0.2499]*** 1.0335 [0.2492]*** Industry/region/size/time dummies Yes Yes Yes Yes Observations 5211 5216 5175 5176 R-squared 0.88 0.9 0.94 0.94 Estimating results of equation (1) under different imputation mechanisms for missing data. Those observations with missing values in all sales, labor (labor cost), materials and capital are excluded in all the regressions. 1 ICA method is in section 3 of main text. Significance is given by clustered and White-robust standard errors in brackets; *** 1%, **5%, * 10%. In parentheses are bootstrap standard errors after 1000 replications (see section 5.2.2 on the motivation of using bootstrap standard errors). Correlation by clusters is also considered. 2 EM algorithms are explained in section 5.1. EM alg [1] includes as covariates of the imputation mechanism industry/region/size/time (I/R/S/T) dummies (see section 5.1.1); EM alg [2] includes I/R/S/T dummies and production function variables (see section 5.1.2); EM alg [3] also includes a set of IC variables (see section 5.1.3). Significance is given by clustered White-robust standard errors. (a) IC variables instrumented with industry/region average variables. (b) missing values in IC variables replaced by means of ICA method. Source: Authors' calculations with ICSs 64 Table 9.2: TURKEY, Extended production function and comparison of ICA method with EM algorithms EM Algorithms2 ICA Method1 Dependent variable: Log of total sales [1] [2] [3] Category Variable Coeff. std. err. Boot. st. er. Coeff. std. err. Coeff. std. err. Coeff. std. err. PF variables Log-employment 0.416 [0.0492]*** (0.1088)*** 0.3743 [0.0434]*** 0.3421 [0.0459]*** 0.3323 [0.0467]*** Log-materials 0.4184 [0.0404]*** (0.0249)*** 0.4829 [0.0429]*** 0.6075 [0.0369]*** 0.6052 [0.0370]*** Log-capital 0.0371 [0.0165]** (0.0428) 0.0548 [0.0199]*** 0.0801 [0.0190]*** 0.0783 [0.0184]*** Infrastructures Days to clear customs for imports (a) -0.0707 [0.0686] (0.0688) -0.1497 [0.0578]** -0.1206 [0.0516]** -0.1399 [0.0462]*** Dummy for e-mail 0.2866 [0.0920]*** (0.1365)** 0.1659 [0.0789]** 0.1648 [0.0726]** 0.188 [0.0720]** Red tape, Security expenses (b) -0.0246 [0.0828] (0.0011)*** -0.0117 [0.0520] -0.0504 [0.0456] -0.0647 [0.0416] corruption and Payments to deal with bureaucratic issues (a) -0.011 [0.0020]*** (0.0077) -0.0092 [0.0013]*** -0.0065 [0.0011]*** -0.0072 [0.0012]*** crime Sales declared for taxes (a) -0.0226 [0.0057]*** (0.0045)*** -0.0234 [0.0046]*** -0.0148 [0.0042]*** -0.0177 [0.0040]*** Number of inspections (b) 0.0046 [0.0044] (0.0597) -0.0002 [0.0026] 0.0001 [0.0026] 0.0007 [0.0023] Payments to obtain a contract with the government (b) -0.0373 [0.0315] (0.0058)*** -0.0524 [0.0217]** -0.0274 [0.0175] -0.0514 [0.0159]*** Production lost due to absenteeism (b) -0.0054 [0.0043] (0.0367) -0.0122 [0.0037]*** -0.0082 [0.0028]*** -0.0094 [0.0029]*** Dummy for informal competition (b) 0.0044 [0.0295] (0.1203) -0.0055 [0.0236] -0.0013 [0.0189] -0.0059 [0.0196] Delay in obtaining a water supply (a) -0.1325 [0.0634]** (0.0993) -0.1388 [0.0565]** -0.0746 [0.0559] -0.0935 [0.0600] Finance Dummy for credit line 0.068 [0.0868] (0.1383) 0.1157 [0.0702] 0.0744 [0.0660] 0.0778 [0.0674] Dummy for external auditory (a) 0.0863 [0.0753] (0.1117) 0.0655 [0.0461] 0.0627 [0.0397] 0.0935 [0.0406]** Loans in foreign currency (b) 0.0018 [0.0009]** (0.0010)* 0.0013 [0.0005]** 0.0008 [0.0006] 0.0007 [0.0006] Quality, innov. Staff with university education (b) 0.0095 [0.0026]*** (0.0018)*** 0.0087 [0.0029]*** 0.0064 [0.0029]** 0.0081 [0.0029]*** and labor skills Staff-part time workers -0.008 [0.0030]** (0.0222) -0.0046 [0.0023]* -0.0059 [0.0016]*** -0.0058 [0.0018]*** Other control Production lost due to strikes (b) -0.1689 [0.0634]** (0.0351)*** -0.1596 [0.0435]*** -0.124 [0.0322]*** -0.1072 [0.0323]*** variables Dummy for recently privatized firm 1.0606 [0.2812]*** (0.2511)*** 0.8692 [0.2579]*** 0.6825 [0.2478]*** 0.6644 [0.2508]** Dummy for competition against imported products 0.2069 [0.0962]** (0.2737) 0.1595 [0.0736]** 0.0951 [0.0603] 0.0755 [0.0607] Constant 3.5299 [0.7190]*** 3.6661 [0.5851]*** 1.6872 [0.3782]*** 1.9648 [0.3791]*** Industry/region/size/time dummies Yes Yes Yes Yes Observations 1684 1679 1733 1733 R-squared 0.73 0.81 0.86 0.86 Estimating results of equation (1) under different imputation mechanisms for missing data. Those observations with missing values in all sales, labor (labor cost), materials and capital are excluded in all the regressions. 1, 2 See footnotes in Table 9.1. (a) IC variables instrumented with industry/region average variables. (b) missing values in IC variables replaced by means of ICA method. Source: Authors' calculations with ICSs. 65 Table 9.3: SOUTH AFRICA, Extended production function and comparison of ICA method with EM algorithms EM Algorithms2 ICA Method1 Dependent variable: Log of total sales [1] [2] [3] Category Variable Coeff. std. err. Boot. st. er. Coeff. std. err. Coeff. std. err. Coeff. std. err. PF variables Log-employment 0.3226 [0.0711]*** (0.0365)*** 0.3144 [0.0676]*** 0.2285 [0.0667]*** 0.2261 [0.0666]*** Log-materials 0.5195 [0.1017]*** (0.0214)*** 0.5355 [0.0942]*** 0.5781 [0.0947]*** 0.574 [0.0943]*** Log-capital 0.1247 [0.0300]*** (0.0118)*** 0.1287 [0.0370]*** 0.123 [0.0373]*** 0.1282 [0.0386]*** Infrastructure Days to clear customs for imports (a) -0.1188 [0.1125] (0.1233) 0.1291 [0.1320] 0.0193 [0.0935] 0.0322 [0.0975] Sales lost due to power outages (b) -0.0171 [0.0114] (0.0047)*** -0.0128 [0.0101] -0.0112 [0.0077] -0.0096 [0.0073] Water outages (b) -0.1477 [0.0527]*** (0.0942) -0.1287 [0.0438]*** -0.1482 [0.0533]*** -0.1611 [0.0562]*** Average duration of transport failures (a) -0.0439 [0.0806] (0.0379) 0.06 [0.0893] 0.0021 [0.0628] -0.0156 [0.0611] Wait for electric supply (a) -0.0867 [0.0553] (0.0173)*** -0.1368 [0.0337]*** -0.0921 [0.0272]*** -0.0863 [0.0258]*** Sales lost due to delivery delays (b) -0.0099 [0.0083] (0.0073) -0.0148 [0.0084]* -0.0097 [0.0072] -0.0077 [0.0065] Red tape, Manager's time spent on bur. issues (b) 0.007 [0.0051] (0.0016)*** 0.0077 [0.0050] 0.0077 [0.0057] 0.0084 [0.0058] corruption and Payments to deal with bureaucratic issues (b) -0.0045 [0.0024]* (0.3604) -0.005 [0.0026]* -0.0042 [0.0023]* -0.0121 [0.0038]*** crime Sales declared for taxes (a) 0.0056 [0.0046] (0.0022)** 0.0056 [0.0042] 0.0059 [0.0025]** 0.0058 [0.0027]** Payments to obtain a contract with the government (b) -0.0144 [0.0185] (0.1975) -0.0119 [0.0175] -0.0161 [0.0146] -0.015 [0.0144] Security expenses (a) 0.1407 [0.0511]** (0.0069)*** -0.0023 [0.0148] -0.0075 [0.0109] -0.0056 [0.0113] Illegal payments in protection (b) 0.3969 [0.2428] (0.1128)*** 0.3754 [0.2492] 0.3882 [0.2202]* 0.4761 [0.2187]** Crime losses (a) -0.0502 [0.0788] (0.1374) -0.0541 [0.0948] 0.0099 [0.0621] 0.0193 [0.0662] Finance and Percentage of credit unused (b) 0.0014 [0.0010] (0.0013) 0.0016 [0.0008]* 0.0019 [0.0010]* 0.002 [0.0010]* corporate Dummy for loan 0.0715 [0.0492] (0.0327)** 0.0841 [0.0479]* 0.0762 [0.0406]* 0.0761 [0.0407]* governance Value of the collateral (b) -0.0008 [0.0002]*** (0.0009) -0.0007 [0.0002]*** -0.0006 [0.0002]*** -0.0007 [0.0002]*** Loans in foreign currency (b) 0.0018 [0.0022] (0.0024) 0.0007 [0.0018] -0.0002 [0.0012] 0.0001 [0.0012] Charge to clear a check (a) -0.1164 [0.0503]** (0.0253)*** -0.0861 [0.0520] -0.0995 [0.0387]** -0.1068 [0.0384]*** Largest shareholder 0.0006 [0.0010] (0.0008) 0.0011 [0.0010] 0.001 [0.0007] 0.0011 [0.0007] Working capital financed by foreign commercial banks (b) 0.0106 [0.0083] (0.0084) 0.0072 [0.0072] 0.0057 [0.0060] 0.0044 [0.0063] Working capital financed by informal sources (b) -0.0022 [0.0023] (0.0001)*** -0.0018 [0.0021] -0.0027 [0.0018] -0.0026 [0.0018] Quality, Dummy for ISO quality certification (b) 0.1603 [0.0766]** (0.0365)*** 0.1521 [0.0732]** 0.0838 [0.0404]** 0.0782 [0.0390]* innovation and Dummy for new product (b) 0.091 [0.0494]* (0.0113)*** 0.1083 [0.0530]** 0.1053 [0.0461]** 0.1001 [0.0460]** labor skills Dummy for discontinued product line (b) -0.1007 [0.0610] (0.0384)** -0.1029 [0.0560]* -0.0874 [0.0541] -0.0805 [0.0534] Staff - management 0.004 [0.0028] (0.0009)*** 0.0036 [0.0028] 0.0034 [0.0030] 0.0032 [0.0030] Staff - non-production workers -0.0034 [0.0022] (0.0025) -0.0032 [0.0022] -0.0023 [0.0022] -0.0024 [0.0022] Training for unskilled workers (a) 0.001 [0.0026] (0.0030) -0.0001 [0.0038] 0.0018 [0.0019] 0.0008 [0.0021] University staff (b) 0.0049 [0.0015]*** (0.0007)*** 0.0055 [0.0016]*** 0.0039 [0.0014]*** 0.0038 [0.0014]** Manager's experience (b) 0.0391 [0.0249] (0.0217)* 0.0369 [0.0222] 0.028 [0.0187] 0.0271 [0.0184] Other control Age (b) 0.0018 [0.0015] (0.0016) 0.0014 [0.0013] 0.0023 [0.0013]* 0.0023 [0.0013]* variables Share of the local market (b) 0.0032 [0.0008]*** (0.0004)*** 0.0035 [0.0009]*** 0.0027 [0.0007]*** 0.0028 [0.0007]*** Constant 2.7174 [0.8932]*** (0.0365)*** 2.0109 [0.8200]** 2.5368 [0.7330]*** 2.5977 [0.7464]*** Industry/region/size/time dummies Yes Yes Yes Yes Observations 1483 1528 1552 1550 R-squared 0.89 0.89 0.9 0.9 Estimating results of equation (1) under different imputation mechanisms for missing data. Those observations with missing values in all sales, labor (labor cost), materials and capital are excluded in all the regressions. 1, 2, See footnotes in Table 9.1. Source: Authors' calculations with ICSs. 66 Table 9.4: TANZANIA, Extended production function and comparison of ICA method with EM algorithms EM Algorithms2 ICA Method1 Dependent variable: Log of total sales [1] [2] [3] Category Variable Coeff. std. err. Boot. st. er. Coeff. std. err. Coeff. std. err. Coeff. std. err. PF variables Log-employment 0.1655 [0.0853]* (0.0512]*** 0.1142 [0.0919] 0.0584 [0.0459] 0.0207 [0.0501] Log-materials 0.4252 [0.0581]*** (0.0340]*** 0.4867 [0.0677]*** 0.7201 [0.0435]*** 0.724 [0.0401]*** Log-capital 0.1589 [0.0323]*** (0.0208]*** 0.1628 [0.0317]*** 0.1326 [0.0286]*** 0.1171 [0.0288]*** Infrastructure Electricity from own generator (b) 0.0021 [0.0016] (0.0053] 0.0035 [0.0016]** 0.0036 [0.0011]*** 0.0027 [0.0011]** Losses due to water outages (b) -0.0112 [0.0058]* (0.0162] -0.0172 [0.0049]*** -0.0087 [0.0033]** -0.0082 [0.0034]** Water from own well or water infrastructure (a) 0.0001 [0.0051] (0.0011] -0.0044 [0.0044] 0.0013 [0.0031] 0.0029 [0.0035] Losses due to phone outages (a) -0.0322 [0.0198] (0.0071]*** -0.0066 [0.0268] 0.0115 [0.0232] 0.0159 [0.0260] Transport outages (a) -0.0047 [0.0703] (0.1168] 0.0366 [0.0623] 0.0069 [0.0295] -0.0287 [0.0280] Dummy for own roads (b) 0.289 [0.1488]* (0.0581]*** 0.3789 [0.1279]*** 0.2864 [0.1176]** 0.4766 [0.1189]*** Dummy for webpage (b) 0.1578 [0.1212] (0.1994] 0.0972 [0.1533] 0.1054 [0.1346] 0.2051 [0.1243] Wait for a water supply (a) -0.1814 [0.0427]*** (0.0702]** -0.1354 [0.0533]** -0.093 [0.0262]*** -0.1649 [0.0235]*** Low quality supplies (a) -0.0163 [0.0128] (0.0041]*** -0.0351 [0.0141]** -0.0165 [0.0105] -0.0202 [0.0112]* Red tape, Gift to obtain an operating license (b) -0.4983 [0.1935]** (0.1066]*** -0.3964 [0.1550]** 0.0537 [0.1051] -0.0553 [0.0983] corruption and Payments to deal with bureaucratic issues (b) 0.0939 [0.0299]*** (0.0164]*** 0.0808 [0.0272]*** 0.0512 [0.0503] 0.085 [0.0396]** crime Days in inspections (b) -0.1045 [0.0735] (0.0494]** -0.0735 [0.0703] 0.0027 [0.0379] 0.0005 [0.0362] Payments to obtain a contract with the government (b) -0.0114 [0.0066]* (0.0091] -0.0026 [0.0059] -0.0082 [0.0040]* -0.0079 [0.0044]* Security expenses (b) -0.0119 [0.0042]*** (0.0092] -0.0081 [0.0040]* -0.0023 [0.0031] -0.0005 [0.0035] Illegal payments in protection (b) -0.0827 [0.0170]*** (0.1019] -0.0518 [0.0140]*** -0.031 [0.0144]** -0.0489 [0.0206]** Finance and Interest rate of the loan (a) -0.0109 [0.0145] (0.0099] -0.0139 [0.0127] 0.0033 [0.0073] 0.0117 [0.0078] corporate Working capital financed by commercial banks (b) -0.0009 [0.0018] (0.0012] -0.0015 [0.0016] -0.0016 [0.0011] -0.001 [0.0011] governance Working capital financed by leasing (b) -0.0794 [0.0282]*** (0.0054]*** -0.118 [0.0279]*** -0.015 [0.0038]*** -0.0893 [0.0428]** Sales bought on credit (b) -0.0014 [0.0012] (0.0011] 0 [0.0011] 0.0006 [0.0010] 0.0006 [0.0010] Delay in clearing a domestic currency wire (a) -0.3418 [0.3273] (0.0935]*** -0.0439 [0.2600] 0.1691 [0.1544] 0.0498 [0.1606] Quality, Dummy for new product (b) 0.0429 [0.1063] (0.2036] -0.0053 [0.1090] -0.0481 [0.0782] -0.0897 [0.0632] innovation and Staff - skilled workers (b) 0.0026 [0.0023] (0.0050] 0.0025 [0.0021] 0.0036 [0.0014]** 0.0038 [0.0014]** labor skills Workforce with computer (b) 0.0066 [0.0030]** (0.0056] 0.0071 [0.0034]** 0.0001 [0.0049] 0.003 [0.0041] Other control Dummy for incorporated company (b) 0.2914 [0.2023] (0.5683] -0.0777 [0.4506] 0.1645 [0.1868] 0.0871 [0.2324] variables Dummy for FDI (b) 0.1397 [0.1445] (0.2844] 0.0825 [0.1397] -0.0662 [0.0792] -0.0859 [0.0768] Constant 7.2978 [1.0168]*** 6.3827 [0.8512]*** 2.4414 [0.5932]*** 3.296 [0.6161]*** Industry/region/size/time dummies Yes Yes Yes Yes Observations 559 560 603 597 R-squared 0.88 0.88 0.94 0.94 Estimating results of equation (1) under different imputation mechanisms for missing data. Those observations with missing values in all sales, labor (labor cost), materials and capital are excluded in all the regressions. 1, 2, See footnotes in Table 9.1. Source: Authors' calculations with ICSs. 67 Table 9.1 focuses on the case of India, in which the ICA method and the EM algorithm on industry, region and size variables (EM algorithm [1]) lead to similar results in terms of input-output elasticities. However, there are divergences in the input-output elasticities estimated for the remaining two EM-algorithms. Concretely, the employment coefficient decreases from 0.1 in the ICA method and EM algorithm [1], to 0.05 in the EM algorithms [2] and [3]. Similarly, it is worth mentioning that the estimates of the standard errors of the coefficients of the input-output elasticities do not improve in the EM algorithm [1] with respect to the ICA method, and are even lower in the EM algorithms [2] and [3]. It is important to note that most of the differences between the ICA method and the EM algorithm [1] on the one hand and the EM algorithms [2] and [3] on the other can be explained by the greater amount of information embodied in the imputation process: production function variables in the EM [2] and production function, IC, and D variables in EM [3]; and not by the iterative process based on posterior predictive densities as in the EM algorithms. When the pattern of missing data is very unbalanced and we are able to observe only one or two PF variables for each cross-sectional observation, those EM algorithms including additional variables, beyond the region/industry/size dummies, are more likely to lead to heterogeneous results as they include a different amount of information for each cross-section. This becomes more patent in the case of the EM algorithm [3], in which we also include IC variables in the imputation. Apart from this observation, the elasticities and semi-elasticities of IC variables show a reasonable robustness to the imputation mechanism used. In general terms, the ICA method is more consistent with the results from the EM algorithm [1], whereas EM algorithms [2] and [3] show more differences. For example, out of 6 IC variables significant in the ICA method case, 5 are also significant in the EM algorithm [1], while only 3 in the EM algorithms [2] and [3] (see Table 12). Nonetheless, the changes observed are only in the magnitude of the coefficients estimated, and never in the direction of the effects. All the estimated IC coefficients move within a reasonable range of values in the four cases. 68 Table 12: Summary of results from estimating equation (1) under different imputation methods with respect to the ICA method case Complete ICA method & variations EM algorithms Multiple Heckman models case ICA ICA met. Random ICA met. EM alg. EM alg. EM alg. imputation Heckman on Heckman Heckman met. (boot. s. e.) ICA met. on inputs [1] [2] [3] complete replacing (boot. case inputs s.e) India: Input-output Significant change in estimated No - - No No No Yes (L, Yes (L, Yes (L) No No - Tables elasticities elasticity?3 M) M) 9.1, 10.1 Change in significance?3 No - No No No No No No No No No No & 11.1 IC variables Significant variables1 4, (3) 6 6, (2) 6, (3) 4, (1) 5, (0) 4, (1) 4, (1) 5, (2) 11, (7) 11, (6) 15, (10) [27 vars.] Non-significant variables2 23, (5) 21 21, (2) 21, (4) 23, (3) 22, (1) 23, (3) 23, (3) 22, (2) 16, (2) 16, (1) 12, (1) Change in the direction of the effect?3 No - - No No No No No No No No No Number of observations 3943 5211 - 5063 5134 5216 5175 5176 5262 4233 5407 - Significant Heckman's Lambda? - - - - - - - - - No No - Turkey: Input-output Significant change in estimated Yes (M) - - No Yes (L, Yes (M, Yes (L, Yes (L, Yes (L, M) Yes (L. K) Yes (L. - Table elasticities elasticity?3 M, K) L) M, K) M, K) K) 9.2, 10.2 Change in significance?3 No - Yes (L) Yes (L) No No No No No No No No & 11.2 IC variables Significant variables1 9, (3) 10 8, (2) 9, (0) 9, (0) 13, (3) 9, (2) 11, (4) 10, (2) 9, (3) 11, (1) 16, (6) [18 vars.] Non-significant variables2 9, (4) 8 10, (4) 9, (1) 9, (1) 5, (0) 9, (2) 7, (2) 8, (2) 9, (2) 7, (0) 2, (0) Change in the direction of the effect?3 No - - No No No No No No No No No Number of observations 792 1684 - 1684 1360 1679 1733 1733 1646 1941 2509 - Significant Heckman's Lambda? - - - - - - - - - No No - South Input-output Significant change in estimated No - - Yes (K) No No Yes (L) Yes (L) Yes (L) No No - 3 Africa: elasticities elasticity? Table Change in significance?3 No - No No No No No No No No No No 9.3, 10.3 IC variables Significant variables1 10, (3) 9 16, (10) 12, (3) 9, (0) 12, (5) 14, (6) 14, (5) 15, (7) 15, (8) 19, (11) 18, (10) & 11.3 [31 vars.] Non-significant variables2 21, (2) 22 15, (3) 19, (0) 22, (0) 19, (2) 17, (1) 17, (1) 16, (1) 16, (2) 12, (1) 13, (1) Change in the direction of the effect?3 No - - No No No No No No No No - Number of observations 1483 1528 1552 1550 1443 1657 - Significant Heckman's Lambda? - - - - - - - - - No No - Tanzania: Input-output Significant change in estimated Yes (M) - - Yes (L, K) Yes (L, Yes (L) Yes (M, Yes (M, Yes (L, M, Yes (M) Yes (M) - Table elasticities elasticity?3 M) L) L) K) 9.4, 10.4 Change in significance?3 No - No No No Yes (L) Yes (L) Yes (L) No No No No & 11.4 IC variables Significant variables1 10, (4) 10 9, (4) 11, (4) 10, (2) 11, (2) 8, (2) 10, (3) 8, (2) 14, (9) 9, (5) 7, (5) [25 vars.] Non-significant variables2 15, (3) 15 16, (5) 14, (3) 15, (2) 14, (1) 17, (4) 15, (3) 17, (4) 11, (5) 16, (6) 18, (6) Change in the direction of the effect?3 No - - No No No No No - Number of observations 291 559 - 557 442 560 603 597 570 581 771 - Significant Heckman's Lambda? - - - - - - - - - No No - 1 In parenthesis: variables non-significant in the ICA method that became significant under other imputation mechanisms. 2 In parenthesis: variables significant in the ICA method and no longer significant under other imputation mechanisms. 3 With respect to the ICA method. A more detailed description of the results is in Tables 8.1 to 8.4. Source: Authors' calculations with ICSs data. 69 The case of South Africa in Table 9.3, with a pattern of missing values similar to that of India, leads to analogous conclusions. Again the I-O elasticities estimated under the ICA method are rather similar to those we get under the EM algorithm [1], whereas the EM algorithms [2] and [3] diverge in the sense that the estimated I-O elasticity for employment is almost one percent point lower than in the ICA method and EM algorithm [1]. The patterns observed for the standard errors estimated are the same as those of India: almost equal standard errors between the ICA method and the rest of EM algorithms, so no improvements of efficiency can be observed from using the EM algorithms in this case. Concretely, from Table 12 there are 10 significant IC variables under the ICA method, and the same variables are significant again under the EM algorithm [1] (plus another three new significant IC variables). In the EM algorithms [2] and [3] only 7 IC variables out of 10 repeat significance. The patterns observed in India and South Africa are not supported by the Turkish case in Table 9.2. Recall that the proportion of missing values among PF variables reaches 70%, and therefore the effects of the imputation mechanism used will be quite different from those applied to patterns of missing data with only a 20% or 30% response rate. In this case, it is remarkable that I-O elasticities in the EM algorithms [1], [2] and [3] are closer to constant returns to scale (CRS) than the ICA method is. In this sense, and in terms of I-O elasticities, the results from the ICA method are different from the EM algorithms, with materials and capital elasticities significantly lower than in the remaining cases. However, the estimated standard errors do not change much and the significance of the PF variables is not modified in any of the cases. In spite of these changes in the I-O elasticities, it is important to note that again the IC parameters appear to be robust to the imputation method used. Ten IC variables turned out to be significant in the ICA method case, 12 in the EM algorithm [1] and 14 in the EM algorithms [2] and [3]. Apart from minor changes in the magnitude of the coefficients, and in some cases in the significance of some variables, we do not observe changes in the estimated directions of the effects of the IC variables. Finally, the case of Tanzania is presented in Table 9.4. The proportion of missing values in PF variables in this country is more than 70% of the original sampling frame, similar to that of Turkey. However, unlike the Turkish case, EM algorithms [2] and [3] do not improve the results obtained from the ICA method. Again, the ICA method and EM algorithm show symmetrical behavior with similar I-O elasticities, whereas in EM algorithms [2] and [3] the estimated elasticity for employment is three times lower than in the ICA method, increasing in turn the elasticity of materials. On the other hand, almost all of those IC variables significant in the ICA method repeat significance in the EM algorithms, and what is more important, the coefficients are robust to all the imputation mechanisms, apart from marginal differences in some variables (see Table 12). 6.2.2 Comparison of the ICA method with complete case, extensions of the ICA method and multiple imputation In this section, we compare the results obtained from the ICA method with those from the complete case, other extensions of the ICA method (see section 5.2) and multiple imputation 70 (see section 5.3) in tables 10.1 to 10.4. Table 10.1 focuses on the case of India. The fourth column comprises the results of the complete case, for which the number of observations is considerably reduced with respect to the ICA method case, from 5211 to 3943. In spite of the reduced number of observations used, there are not significant changes either in the estimated I-O elasticities, or in their level of significance. Referring to the IC parameters, it is worth mentioning that, although there are no changes in the directions of the estimated effects, and the coefficients are rather robust in both specifications, some of the variables lost their significance in the complete case, with respect to the ICA method. Thus, out of the 6 significant IC variables in the ICA method, only 1 is also significant in the complete case. 71 Table 10.1: INDIA, Extended production function and comparison of ICA method with extensions Dependent variable: log of total sales ICA method and extensions Complete case4 Multiple imputation Original ICA meth.1 Random ICA m.2 ICA m. on inputs3 (Switching regr.)5 Category Variable Coeff. std. err. Boot. s.e Coeff. std. err. Coeff. std. err. Coeff. std. err. Coeff. std. err. PF variables Log-employment 0.1027 [0.0341]*** (0.0306)*** 0.1051 [0.0346]*** 0.0922 [0.0343]*** 0.1168 [0.0317]*** 0.0659 [0.0245]*** Log-materials 0.7989 [0.0185]*** (0.0462)*** 0.8135 [0.0186]*** 0.8054 [0.0192]*** 0.7994 [0.0236]*** 0.8560 [0.0169]*** Log-capital 0.0676 [0.0239]*** (0.0153)*** 0.0438 [0.0143]*** 0.0722 [0.0248]*** 0.0504 [0.0170]*** 0.0452 [0.0128]*** Infrastructure Longest # of days to clear customs for export (a) -0.0125 [0.0263] (0.0376) -0.01 [0.0317] -0.0167 [0.0266] -0.0432 [0.0268] -0.0155 [0.0213] Dummy for own generator 0.0538 [0.0422] (0.0424) -0.0083 [0.0453] 0.0516 [0.0431] 0.0424 [0.0293] 0.0198 [0.0254] Water supply from public sources (b) 0.0014 [0.0005]*** (0.0008)* 0.0009 [0.0006] 0.0014 [0.0005]*** 0.0013 [0.0004]*** 0.0008 [0.0003]** Shipment losses in the domestic market (b) -0.0047 [0.0039] (0.0128) -0.0075 [0.0034]** -0.0037 [0.0038] -0.0023 [0.0054] -0.0020 [0.0029] Dummy for own transport 0.0238 [0.0475] (0.0861) 0.0013 [0.0459] 0.0334 [0.0482] 0.0465 [0.0369] -0.0038 [0.0347] Dummy for web page 0.0402 [0.0394] (0.0264) 0.0516 [0.0427] 0.0329 [0.0382] 0.0098 [0.0327] 0.0067 [0.0316] Dummy for security 0.0467 [0.0423] (0.1407) 0.045 [0.0392] 0.0573 [0.0429] 0.0564 [0.0293]* 0.0582 [0.0293]** Red tape, Sales reported for taxes (b) 0.0006 [0.0014] (0.0052) 0.002 [0.0012]* 0.0009 [0.0014] 0.0002 [0.0010] 0.0010 [0.0009] corruption Workforce reported for taxes (b) -0.0015 [0.0012] (0.0042) -0.0021 [0.0009]** -0.0014 [0.0012] 0.0005 [0.0008] -0.0010 [0.0007] and crime Dummy for payments to speed up bureaucracy -0.0464 [0.0336] (0.0526) -0.0148 [0.0265] -0.0416 [0.0335] 0.0072 [0.0247] 0.0004 [0.0254] Dummy for interventionist labor regulation -0.036 [0.0361] (0.0211)* -0.0372 [0.0369] -0.0275 [0.0368] -0.031 [0.0330] -0.0303 [0.0322] Absenteeism (b) -0.0299 [0.0222] (0.0571) -0.0233 [0.0256] -0.0263 [0.0216] -0.0011 [0.0193] -0.0108 [0.0158] Finance and Dummy for trade association 0.0785 [0.0455]* (0.0456)* 0.094 [0.0480]* 0.0734 [0.0454] 0.022 [0.0388] 0.0263 [0.0302] corporate Working capital financed by domestic private banks (b) 0.0002 [0.0007] (0.0005) 0.0005 [0.0006] 0.0002 [0.0008] 0.0003 [0.0008] 0.0002 [0.0005] governance Dummy for external audit 0.0691 [0.0395]* (0.0452) 0.0541 [0.0440] 0.0627 [0.0386] 0.0392 [0.0300] 0.0689 [0.0294]** Dummy for loan (b) 0.1102 [0.0473]** (0.0637)* 0.0851 [0.0538] 0.1107 [0.0492]** -0.0397 [0.0409] 0.0188 [0.0337] Quality, Dummy for R&D (a) 0.1787 [0.2382] (0.2347) 0.0959 [0.1637] 0.1885 [0.2400] 0.0862 [0.1313] 0.1143 [0.1353] innovation Dummy for product innovation -0.0073 [0.0360] (0.0710) -0.0331 [0.0392] -0.0079 [0.0366] -0.0528 [0.0262]** -0.0285 [0.0276] and labor Dummy for foreign license (b) 0.204 [0.1053]* (0.1302) 0.2384 [0.1181]** 0.1555 [0.1013] 0.1401 [0.0939] 0.1032 [0.0835] skills Dummy for internal training (b) 0.0579 [0.0533] (0.0516) 0.0744 [0.0649] 0.0631 [0.0537] 0.0884 [0.0458] 0.0717 [0.0440]* Unskilled workforce (a) 0.0013 [0.0036] (0.0016) 0.0038 [0.0042] 0.0003 [0.0037] -0.001 [0.0033] -0.0030 [0.0029] Workforce with computer 0.0017 [0.0011] (0.0015) 0.0014 [0.0009] 0.0019 [0.0011]* 0.0007 [0.0007] 0.0012 [0.0008] Other control Dummy for incorporated company 0.0265 [0.0396] (0.0901) 0.056 [0.0358] 0.0127 [0.0423] 0.0494 [0.0282]* 0.0280 [0.0311] variables Age 0.0534 [0.0267]** (0.0214)** 0.0352 [0.0287] 0.0525 [0.0271]* 0.0322 [0.0208] 0.0392 [0.0182]** Share of exports (b) 0.001 [0.0009] (0.0005)** 0.001 [0.0010] 0.001 [0.0009] 0.0002 [0.0005] -0.0001 [0.0005] Trade union (b) 0.0008 [0.0012] (0.0010) 0.0015 [0.0013] 0.001 [0.0013] 0.0001 [0.0008] 0.0007 [0.0008] Strikes (b) -0.0683 [0.0449] (0.0821) -0.0557 [0.0470] -0.0707 [0.0457] 0.0248 [0.0439] -0.0112 [0.0321] Constant 0.7377 [0.3449]** 0.7174 [0.3636]* 0.7182 [0.3455]** 1.0943 [0.2692]*** 0.9976 [0.2528]*** Industry/region/size/time dummies Yes Yes Yes Yes Yes Observations 5211 5063 5134 3943 5262 R-squared 0.88 0.88 0.88 0.94 - Estimating results of equation (1) under different imputation mechanisms for missing data. Those observations with missing values in all sales, labor (labor cost), materials and capital are excluded in all the regressions. 1 See footnote 1 in Table 9.1. 2 Random ICA method is described in section 5.2.1. 3 ICA method on inputs is in section 5.2.3. 4 Complete case considers missingness in PF variables only, not in IC variables. 5 Multiple imputation via switching regression can be found in section 5.3. In all the cases significance is given by clustered and White- robust standard errors in brackets; *** 1%, **5%, * 10%. In the case of the ICA method, in parentheses are bootstrap standard errors after 1000 replications (see section 5.2.2 on the motivation for using bootstrap standard errors). Correlation by cluster is also considered. Source: Authors' calculations with ICSs. 72 Table 10.2: TURKEY, Extended production function and comparison of ICA method with extensions Dependent variable: log of total sales ICA method and extensions Complete case4 Multiple imputation Original ICA meth. 1 Random ICA m. 2 ICA m. on inputs 3 (Switching regr.)5 Category Variable Coeff. std. err. Boot. st. er. Coeff. std. err. Coeff. std. err. Coeff. std. err. Coeff. std. err. PF variables Log-employment 0.416 [0.0492]*** (0.1088)*** 0.3819 [0.0501]*** 0.5106 [0.0558]*** 0.4002 [0.0885]*** 0.3446 [0.0524]*** Log-materials 0.4184 [0.0404]*** (0.0249)*** 0.4137 [0.0392]*** 0.4615 [0.0484]*** 0.5332 [0.0494]*** 0.5779 [0.0316]*** Log-capital 0.0371 [0.0165]** (0.0428) 0.0193 [0.0198] 0.0686 [0.0232]*** 0.0639 [0.0271]** 0.0603 [0.0246]** Infrastructures Days to clear customs for imports (a) -0.0707 [0.0686] (0.0688) -0.1133 [0.0776] -0.0711 [0.0705] -0.1594 [0.0856]* -0.1318 [0.0660]** Dummy for e-mail 0.2866 [0.0920]*** (0.1365)** 0.3833 [0.1048]*** 0.3072 [0.1054]*** 0.0317 [0.1295] 0.1729 [0.0754]** Red tape, Security expenses (b) -0.0246 [0.0828] (0.0011)*** 0.0137 [0.0836] -0.0861 [0.0919] -0.0468 [0.0786] -0.0215 [0.0587] corruption and Payments to deal with bureaucratic issues (a) -0.011 [0.0020]*** (0.0077) -0.0108 [0.0021]*** -0.0102 [0.0021]*** -0.0084 [0.0014]*** -0.0073 [0.0011]*** crime Sales declared for taxes (a) -0.0226 [0.0057]*** (0.0045)*** -0.0197 [0.0061]*** -0.0151 [0.0065]** -0.0184 [0.0082]** -0.0159 [0.0051]*** Number of inspections (b) 0.0046 [0.0044] (0.0597) 0.001 [0.0049] 0.005 [0.0044] -0.0019 [0.0038] 0.0000 [0.0036] Payments to obtain a contract with the government (b) -0.0373 [0.0315] (0.0058)*** -0.0345 [0.0357] -0.0217 [0.0368] -0.0257 [0.0360] -0.0354 [0.0236] Production lost due to absenteeism (b) -0.0054 [0.0043] (0.0367) -0.0079 [0.0051] -0.005 [0.0039] -0.0107 [0.0054]* -0.0110 [0.0036]*** Dummy for informal competition (b) 0.0044 [0.0295] (0.1203) -0.0083 [0.0323] 0.0207 [0.0279] -0.0015 [0.0315] -0.0062 [0.0232] Delay in obtaining a water supply (a) -0.1325 [0.0634]** (0.0993) -0.1346 [0.0688]* -0.1419 [0.0863] -0.0825 [0.0785] -0.0965 [0.0571]* Finance Dummy for credit line 0.068 [0.0868] (0.1383) 0.0967 [0.0905] 0.0888 [0.1061] 0.0657 [0.0685] 0.0699 [0.0719] Dummy for external auditory (a) 0.0863 [0.0753] (0.1117) 0.0992 [0.0739] 0.1012 [0.0791] 0.1385 [0.0709]* 0.0781 [0.0521] Loans in foreign currency (b) 0.0018 [0.0009]** (0.0010)* 0.0015 [0.0008]* 0.0018 [0.0010]* 0.0005 [0.0009] 0.0009 [0.0008] Quality, innov. Staff with university education (b) 0.0095 [0.0026]*** (0.0018)*** 0.0107 [0.0028]*** 0.01 [0.0040]** 0.008 [0.0035]** 0.0060 [0.0032]* and labor skills Staff-part time workers -0.008 [0.0030]** (0.0222) -0.0077 [0.0032]** -0.0102 [0.0029]*** -0.0069 [0.0027]** -0.0067 [0.0019]*** Other control Production lost due to strikes (b) -0.1689 [0.0634]** (0.0351)*** -0.1063 [0.0650] -0.1538 [0.0671]** -0.1765 [0.0521]*** -0.1092 [0.0564]* variables Dummy for recently privatized firm 1.0606 [0.2812]*** (0.2511)*** 1.0239 [0.2791]*** 1.0215 [0.3100]*** 1.2627 [0.3162]*** 0.8012 [0.2884]*** Dummy for competition against imported products 0.2069 [0.0962]** (0.2737) 0.2013 [0.0962]** 0.2096 [0.1041]* 0.0156 [0.0823] 0.1021 [0.0665] Constant 3.5299 [0.7190]*** 4.6379 [0.7023]*** 1.4306 [0.5738]** 2.6911 [0.7730]*** 2.6126 [0.4577]*** Industry/region/size/time dummies Yes Yes Yes Yes Yes Observations 1684 1684 1360 792 1646 R-squared 0.73 0.68 0.75 0.85 - Notes of Table 10.1 Source: Authors' calculations with ICSs. 73 Table 10.3: SOUTH AFRICA, Extended production function and comparison of ICA method with extensions Dependent variable: log of total sales ICA method and extensions Complete case4 Multiple imputation Original ICA meth.1 Random ICA meth.2 ICA met. on inputs3 (Switching regr.)5 Category Variable Coeff. std. err. Boot. st. er. Coeff. std. err. Coeff. std. err. Coeff. std. err. Coeff. std. err. PF variables Log-employment 0.3226 [0.0711]*** (0.0365)*** 0.3822 [0.0776]*** 0.3295 [0.0717]*** 0.3428 [0.0541]*** 0.2453 [0.0681]*** Log-materials 0.5195 [0.1017]*** (0.0214)*** 0.4914 [0.0877]*** 0.5182 [0.1015]*** 0.4877 [0.0961]*** 0.5674 [0.0905]*** Log-capital 0.1247 [0.0300]*** (0.0118)*** 0.0791 [0.0264]*** 0.124 [0.0302]*** 0.1118 [0.0322]*** 0.1180 [0.0345]*** Infrastructure Days to clear customs for imports (a) -0.1188 [0.1125] (0.1233) -0.14 [0.1247] -0.1407 [0.1176] 0.018 [0.1976] 0.0423 [0.1008] Sales lost due to power outages (b) -0.0171 [0.0114] (0.0047)*** -0.0194 [0.0127] -0.0142 [0.0104] -0.003 [0.0085] -0.0107 [0.0080] Water outages (b) -0.1477 [0.0527]*** (0.0942) -0.1441 [0.0591]** -0.1405 [0.0513]** -0.1427 [0.0659]** -0.1393 [0.0504]*** Average duration of transport failures (a) -0.0439 [0.0806] (0.0379) -0.0065 [0.0867] -0.074 [0.0832] 0.1229 [0.1507] -0.0022 [0.0762] Wait for electric supply (a) -0.0867 [0.0553] (0.0173)*** -0.1075 [0.0589]* -0.0767 [0.0573] -0.0629 [0.0558] -0.1014 [0.0309]*** Sales lost due to delivery delays (b) -0.0099 [0.0083] (0.0073) -0.0111 [0.0092] -0.0119 [0.0080] -0.0074 [0.0081] -0.0089 [0.0072] Red tape, Manager's time spent on bur. issues (b) 0.007 [0.0051] (0.0016)*** 0.0072 [0.0051] 0.0073 [0.0052] 0.0058 [0.0043] 0.0079 [0.0056] corruption and Payments to deal with bureaucratic issues (b) -0.0045 [0.0024]* (0.3604) -0.0063 [0.0031]* -0.0045 [0.0023]* -0.0008 [0.0125] -0.0044 [0.0024]* crime Sales declared for taxes (a) 0.0056 [0.0046] (0.0022)** 0.0015 [0.0049] 0.0064 [0.0044] 0.0091 [0.0039]** 0.0058 [0.0031]* Payments to obtain a contract with the government (b) -0.0144 [0.0185] (0.1975) -0.0218 [0.0201] -0.017 [0.0208] -0.0129 [0.0112] -0.0180 [0.0162] Security expenses (a) 0.1407 [0.0511]** (0.0069)*** 0.1245 [0.0586]** 0.1159 [0.0477]** 0.0227 [0.0146] -0.0075 [0.0123] Illegal payments for protection (b) 0.3969 [0.2428] (0.1128)*** 0.4048 [0.2751] 0.3997 [0.2428] 0.3265 [0.3225] 0.3606 [0.2254]* Crime losses (a) -0.0502 [0.0788] (0.1374) 0.0153 [0.0855] -0.0679 [0.0786] 0.1115 [0.0871] -0.0121 [0.0708] Finance and Percentage of credit unused (b) 0.0014 [0.0010] (0.0013) 0.0014 [0.0010] 0.0015 [0.0010] 0.0007 [0.0006] 0.0018 [0.0010]* corporate Dummy for loan 0.0715 [0.0492] (0.0327)** 0.0678 [0.0547] 0.072 [0.0493] 0.0602 [0.0421] 0.0814 [0.0443]* governance Value of the collateral (b) -0.0008 [0.0002]*** (0.0009) -0.0008 [0.0002]*** -0.0008 [0.0002]*** -0.0009 [0.0002]*** -0.0007 [0.0002]*** Loans in foreign currency (b) 0.0018 [0.0022] (0.0024) 0.0024 [0.0023] 0.0016 [0.0021] 0.0012 [0.0011] -0.0001 [0.0012] Charge to clear a check (a) -0.1164 [0.0503]** (0.0253)*** -0.1404 [0.0570]** -0.1108 [0.0501]** -0.1722 [0.0582]*** -0.0905 [0.0402]** Largest shareholder 0.0006 [0.0010] (0.0008) -0.0003 [0.0010] 0.0008 [0.0009] 0.0001 [0.0009] 0.0010 [0.0008] Working capital fin. by foreign commercial banks (b) 0.0106 [0.0083] (0.0084) 0.0073 [0.0090] 0.0107 [0.0082] 0.0203 [0.0195] 0.0050 [0.0062] Working capital financed by informal sources (b) -0.0022 [0.0023] (0.0001)*** -0.0032 [0.0023] -0.0021 [0.0023] -0.0046 [0.0011]*** -0.0025 [0.0019] Quality, Dummy for ISO quality certification (b) 0.1603 [0.0766]** (0.0365)*** 0.1956 [0.0646]*** 0.1578 [0.0764]** 0.121 [0.0670]* 0.1029 [0.0454]** innovation and Dummy for new product (b) 0.091 [0.0494]* (0.0113)*** 0.1233 [0.0587]** 0.0926 [0.0496]* 0.0461 [0.0393] 0.0948 [0.0475]** labor skills Dummy for discontinued product line (b) -0.1007 [0.0610] (0.0384)** -0.1334 [0.0648]** -0.099 [0.0597] -0.0616 [0.0353]* -0.0864 [0.0527]* Staff - management 0.004 [0.0028] (0.0009)*** 0.0049 [0.0027]* 0.0038 [0.0027] 0.0041 [0.0030] 0.0034 [0.0030] Staff - non-production workers -0.0034 [0.0022] (0.0025) -0.0033 [0.0021] -0.0033 [0.0022] -0.0026 [0.0021] -0.0024 [0.0021] Training for unskilled workers (a) 0.001 [0.0026] (0.0030) 0.0023 [0.0028] 0 [0.0025] -0.0047 [0.0045] 0.0011 [0.0027] University staff (b) 0.0049 [0.0015]*** (0.0007)*** 0.0051 [0.0015]*** 0.0049 [0.0014]*** 0.0044 [0.0011]*** 0.0043 [0.0014]*** Manager's experience (b) 0.0391 [0.0249] (0.0217)* 0.0412 [0.0271] 0.0387 [0.0249] 0.0325 [0.0254] 0.0292 [0.0196] Other control Age (b) 0.0018 [0.0015] (0.0016) 0.0019 [0.0014] 0.0017 [0.0014] 0.0023 [0.0013]* 0.0021 [0.0013]* variables Share of the local market (b) 0.0032 [0.0008]*** (0.0004)*** 0.0023 [0.0009]** 0.0032 [0.0008]*** 0.0027 [0.0009]*** 0.0029 [0.0007]*** Constant 2.7174 [0.8932]*** (0.0365)*** 3.5878 [0.8355]*** 2.6721 [0.8751]*** 2.6313 [0.9880]** 2.6249 [0.7400]*** Industry/region/size/time dummies Yes Yes Yes Yes Yes Observations 1483 1483 1474 1236 1483 R-squared 0.89 0.87 0.89 0.91 Notes for Table 10.1 Source: Authors' calculations with ICSs. 74 Table 10.4: TANZANIA, Extended production function and comparison of ICA method with extensions Dependent variable: log of total sales ICA method and extensions Complete case4 Multiple imputation Original ICA meth. 1 Random ICA met. 2 ICA met. on inputs 3 (Swithching regression)5 Category Variable Coeff. std. err. Boot. st. er. Coeff. std. err. Coeff. std. err. Coeff. std. err. Coeff. std. err. PF variables Log-employment 0.1655 [0.0853]* (0.0512]*** 0.2643 [0.1039]** 0.2339 [0.0603]*** 0.1651 [0.0681]** 0.1217 (0.0625]** Log-materials 0.4252 [0.0581]*** (0.0340]*** 0.4008 [0.0527]*** 0.6087 [0.0406]*** 0.6242 [0.0468]*** 0.7170 (0.0390]*** Log-capital 0.1589 [0.0323]*** (0.0208]*** 0.0975 [0.0418]** 0.1302 [0.0280]*** 0.1311 [0.0312]*** 0.0977 (0.0294]*** Infrastructure Electricity from own generator (b) 0.0021 [0.0016] (0.0053] 0.0013 [0.0017] 0.0019 [0.0016] -0.0002 [0.0022] 0.0039 (0.0016]** Losses due to water outages (b) -0.0112 [0.0058]* (0.0162] -0.0132 [0.0081] -0.0058 [0.0051] -0.0107 [0.0062]* -0.0094 (0.0046]** Water from own well or water infrastructure (a) 0.0001 [0.0051] (0.0011] -0.0094 [0.0060] -0.0017 [0.0046] 0.0004 [0.0056] -0.0003 (0.0038] Losses due to phone outages (a) -0.0322 [0.0198] (0.0071]*** -0.0453 [0.0237]* 0.0003 [0.0208] 0.0089 [0.0209] 0.0078 (0.0238] Transport outages (a) -0.0047 [0.0703] (0.1168] 0.0785 [0.0940] 0.0243 [0.0573] -0.0859 [0.0567] 0.0054 (0.0322] Dummy for own roads (b) 0.289 [0.1488]* (0.0581]*** 0.1502 [0.1582] 0.4010 [0.1164]*** 0.4073 [0.1249]*** 0.3117 (0.1422]** Dummy for webpage (b) 0.1578 [0.1212] (0.1994] 0.1453 [0.1280] 0.2560 [0.1038]** 0.3106 [0.1170]** 0.0977 (0.1635] Wait for a water supply (a) -0.1814 [0.0427]*** (0.0702]** -0.1769 [0.0531]*** -0.1388 [0.0411]*** -0.1252 [0.0326]*** -0.1036 (0.0356]*** Low quality supplies (a) -0.0163 [0.0128] (0.0041]*** -0.0389 [0.0164]** -0.0210 [0.0127] -0.0285 [0.0142]* -0.0183 (0.0120] Red tape, Gift to obtain an operating license (b) -0.4983 [0.1935]** (0.1066]*** -0.4607 [0.2385]* -0.3262 [0.1439]** -0.1671 [0.1562] 0.0694 (0.1218] corruption and Payments to deal with bureaucratic issues (b) 0.0939 [0.0299]*** (0.0164]*** 0.0376 [0.0578] 0.1182 [0.0295]*** 0.0767 [0.0192]*** 0.0546 (0.0472] crime Days in inspections (b) -0.1045 [0.0735] (0.0494]** -0.1172 [0.0984] -0.0514 [0.0425] -0.0524 [0.0643] -0.0009 (0.0461] Payments to obtain a contract with the government (b) -0.0114 [0.0066]* (0.0091] -0.0177 [0.0086]** -0.0189 [0.0066]*** -0.0254 [0.0078]*** -0.0140 (0.0051]*** Security expenses (b) -0.0119 [0.0042]*** (0.0092] -0.0151 [0.0055]** -0.0072 [0.0034]** 0.008 [0.0193] -0.0042 (0.0032] Illegal payments for protection (b) -0.0827 [0.0170]*** (0.1019] -0.081 [0.0329]** -0.0774 [0.0179]*** -0.0603 [0.0251]** -0.0392 (0.0131]*** Finance and Interest rate of the loan (a) -0.0109 [0.0145] (0.0099] -0.0028 [0.0182] -0.0038 [0.0094] 0.0111 [0.0113] -0.0021 (0.0090] corporate Working capital financed by commercial banks (b) -0.0009 [0.0018] (0.0012] -0.0008 [0.0021] -0.0013 [0.0014] 0.0007 [0.0013] -0.0014 (0.0012] governance Working capital financed by leasing (b) -0.0794 [0.0282]*** (0.0054]*** -0.1362 [0.0450]*** -0.0489 [0.0305] -0.0304 [0.0329] -0.0129 (0.0069]* Sales bought on credit (b) -0.0014 [0.0012] (0.0011] -0.0036 [0.0017]** -0.0003 [0.0011] -0.0021 [0.0014] -0.0002 (0.0014] Delay in clearing a domestic currency wire (a) -0.3418 [0.3273] (0.0935]*** -0.0024 [0.3738] 0.1242 [0.2583] 0.3236 [0.2952] 0.2044 (0.1717] Quality, Dummy for new product (b) 0.0429 [0.1063] (0.2036] 0.1217 [0.1118] -0.0526 [0.0945] -0.1533 [0.1066] -0.1045 (0.0981] innovation and Staff - skilled workers (b) 0.0026 [0.0023] (0.0050] 0.0053 [0.0028]* 0.0038 [0.0022]* 0.0054 [0.0021]** 0.0039 (0.0020]* labor skills Workforce with computer (b) 0.0066 [0.0030]** (0.0056] 0.0079 [0.0038]** 0.0094 [0.0039]** 0.0154 [0.0055]*** 0.0037 (0.0045] Other control Dummy for incorporated company (b) 0.2914 [0.2023] (0.5683] 0.238 [0.2648] 0.2327 [0.1841] -0.2476 [0.1896] 0.2544 (0.2270] variables Dummy for FDI (b) 0.1397 [0.1445] (0.2844] 0.3044 [0.1888] 0.1788 [0.1225] 0.1061 [0.1123] -0.0255 (0.1128] Constant 7.2978 [1.0168]*** 7.2545 [1.3295]*** 2.7433 [0.8631]*** 3.1164 [0.8674]*** 2.4194 [0.7159] Industry/region/size/time dummies Yes Yes Yes Yes Yes Observations 559 557 442 291 570 R-squared 0.88 0.81 0.9300 0.95 Notes forTable 10.1 Source: Authors' calculations with ICSs. 75 Especially interesting is the comparison of the ICA method with the Random ICA method--introduced in section 5.2.1--in which we introduce a random component to the imputation procedure in order to test the role played by the uncertainty inherent in the imputation mechanism. In a similar vein, another interesting point is to check the sensitivity of the significance level of the variables using bootstrap standard errors to correct for the problem of generated regressors (see section 5.2.2). Only 2 IC variables lose their significance in the ICA method with bootstrap standard error with respect to the regular case, and 2 new variables became significant. A similar pattern is observed in the Random ICA method with 6 significant IC variables, of which 3 were also significant in the ICA method (Table 12 includes the summary of significant IC variables in each case). Finally, the ICA method on inputs and the multiple imputation cases lead to similar results in the I-O elasticities, with the exception of a slight decline in the capital elasticity. In both cases, the significance of some IC variables is lost, although the direction of the estimated effects never changes. Similar conclusions can be drawn in the case of South Africa, the results of which are presented in Table 10.3. In this case, the number of observations used in the complete case only differs by 250 with respect to the ICA method. As expected from the larger response rate of PF variables in this country, there is no significant efficiency lost in the complete case and most IC variables remain significant. As in the case of India,, the Random ICA method and the bootstrap standard errors change the significance of some variables, and while some variables lose their significance, a small group of other IC variables become significant. Finally, both the ICA method on inputs and multiple imputation show robust results with respect to the ICA method. We only observe changes in the second or third decimals. The cases of Turkey and Tanzania (tables 10.2 and 10.4 respectively) are rather different from the two previous ones. In both cases, using the complete case implies using less than 50% of the sample under the complete case. This implies a clear efficiency loss, which is translated into four less significant IC variables in the complete case in Turkey and three in Tanzania. By means of significance of IC variables, the results from the Random ICA, Bootstrap ICA method and ICA on inputs cases are more consistent with those from the standard ICA method. In this respect, introducing more uncertainty into the imputation procedure used in Turkey does not change the significance of 6 and 9 IC variables, depending on whether we focus on the Bootstrap ICA or on the Random ICA respectively. In Tanzania the patterns are similar: 4 IC variables lose their significance in both the Bootstrap ICA and the Random ICA. Lastly, in both cases, Turkey and Tanzania, the ICA method on inputs and the multiple imputation do not modify the results of the ICA method. On the other hand, regarding I-O elasticities and in the case of Turkey, it is important to note that, although we only observe changes in the I-O estimate for materials, the I-O elasticity of employment is non-significant under the ICA method with bootstrap standard errors and the Random ICA method. 76 6.2.3 Comparison of the ICA method and the Heckman selection model We now focus on the comparison of the ICA method and the Heckman models proposed in section 5.4 and 5.5. The estimating results are in tables 11.1 to 11.4. The main conclusions are summarized in Table 12. 77 Table 11.1: INDIA, Extended production function and comparison of ICA method with Heckman models Dependent variable: log of total sales Heckman models2 ICA Method1 Heckman on comp case Heckman replacing inputs Category Variable Coeff. std. err. Boot. s.e Coeff. std. err. Coeff. std. err. Boot. s.e PF variables Log-employment 0.1027 [0.0341]*** (0.0306)*** 0.1127 [0.0160]*** 0.0806 [0.0184]*** (0.0452)*** Log-materials 0.7989 [0.0185]*** (0.0462)*** 0.7998 [0.0069]*** 0.8121 [0.0070]*** (0.0567)*** Log-capital 0.0676 [0.0239]*** (0.0153)*** 0.0477 [0.0062]*** 0.0578 [0.0070]*** (0.0168)*** Infrastructure Longest # of days to clear customs for exports (a) -0.0125 [0.0263] (0.0376) -0.0451 [0.0155]*** -0.0077 [0.0150] (0.1542) Dummy for own generator 0.0538 [0.0422] (0.0424) 0.0466 [0.0229]** 0.0769 [0.0265]*** (0.0064)*** Water supply from public sources (b) 0.0014 [0.0005]*** (0.0008)* 0.0014 [0.0003]*** 0.0012 [0.0003]*** (0.0460)*** Shipment losses in the domestic market (b) -0.0047 [0.0039] (0.0128) -0.0029 [0.0033] -0.0022 [0.0029] (0.1197) Dummy for own transport 0.0238 [0.0475] (0.0861) 0.0438 [0.0283] -0.0063 [0.0336] (0.0742) Dummy for web page 0.0402 [0.0394] (0.0264) 0.0061 [0.0221] 0.0212 [0.0263] (0.0051)** Dummy for security 0.0467 [0.0423] (0.1407) 0.0487 [0.0200]** 0.018 [0.0240] (0.0035)** Red tape, Sales reported to taxes (b) 0.0006 [0.0014] (0.0052) -0.0001 [0.0007] 0.0011 [0.0008] (0.0073) corruption and Workforce reported for taxes (b) -0.0015 [0.0012] (0.0042) 0.0005 [0.0007] -0.001 [0.0007] (0.0049) crime Dummy for payments to speed up bureaucracy -0.0464 [0.0336] (0.0526) 0.0079 [0.0186] -0.0259 [0.0226] (0.0463) Dummy for interventionist labor regulation -0.036 [0.0361] (0.0211)* -0.0407 [0.0226]* -0.0334 [0.0272] (0.0658)** Absenteeism (b) -0.0299 [0.0222] (0.0571) 0.0003 [0.0112] -0.0147 [0.0129] (0.1783)** Finance and Dummy for trade association 0.0785 [0.0455]* (0.0456)* 0.0339 [0.0241] 0.0143 [0.0274] (0.0762) corporate Working capital financed by domestic private banks (b) 0.0002 [0.0007] (0.0005) 0.0004 [0.0004] 0.001 [0.0004]** (0.0006)** governance Dummy for external audit 0.0691 [0.0395]* (0.0452) 0.0419 [0.0204]** 0.0827 [0.0245]*** (0.0408)* Dummy for loan (b) 0.1102 [0.0473]** (0.0637)* -0.0395 [0.0301] 0.1181 [0.0340]*** (0.0002)*** Quality, Dummy for R&D (a) 0.1787 [0.2382] (0.2347) 0.0813 [0.0933] 0.2063 [0.1112]* (0.0010) innovation Dummy for product innovation -0.0073 [0.0360] (0.0710) -0.0508 [0.0200]** -0.0081 [0.0233] (0.0352)*** and labor Dummy for foreign license (b) 0.204 [0.1053]* (0.1302) 0.141 [0.0434]*** 0.1478 [0.0499]*** (0.0006) skills Dummy for internal training (b) 0.0579 [0.0533] (0.0516) 0.0794 [0.0290]*** 0.0813 [0.0338]** (0.0093) Unskilled workforce (a) 0.0013 [0.0036] (0.0016) -0.0016 [0.0017] -0.004 [0.0019]** (0.1225) Workforce with computer 0.0017 [0.0011] (0.0015) 0.0006 [0.0005] 0.0015 [0.0006]*** (0.0498)*** Other control Dummy for incorporated company 0.0265 [0.0396] (0.0901) 0.0566 [0.0225]** 0.016 [0.0273] (0.0398)** variables Age 0.0534 [0.0267]** (0.0214)** 0.0363 [0.0146]** 0.0856 [0.0181]*** (0.0431)** Share of exports (b) 0.001 [0.0009] (0.0005)** 0.0001 [0.0004] 0.0003 [0.0004] (0.0020)** Trade union (b) 0.0008 [0.0012] (0.0010) -0.00004 [0.0005] 0.0002 [0.0005] (0.0014)** Strikes (b) -0.0683 [0.0449] (0.0821) 0.0482 [0.0301] -0.0213 [0.0317] (0.0043) Constant 0.7377 [0.3449]** 1.1579 [0.1899]*** 0.8508 [0.2174]*** Industry/region/size/time dummies Yes Yes Yes Observations 5211 4323 (Cens: 5515/ Unc: 5407 (Censored: 515/ Uncens: 4982) R-squared 0.88 3808 Heckman's Lambda (Inverse of Mills ration) 0.0130 [0.0634] 0.1221 [0.0926] . Estimating results of equation (1) under different imputation mechanisms for missing data. Those observations with missing values in all sales, labor (labor cost), materials and capital are excluded in all the regressions.1 See footnote in Table 8.1. 2 Heckman models are explained in section 5.4. Heckman model on complete case considers missingness only in PF variables, not in IC variables, see section 5.4.1. Heckman replacing inputs compute the model on the sample with replacement of missing values in inputs (labor, materials and capital), see section 5.4.2. In all the cases significance is given by clustered by industry and region White- robust standard errors in brackets; *** 1%, **5%, * 10%. In the case of the ICA method and Heckmand replacing inputs, in parentheses are bootstrap standard errors after 1000 replications (see sections and 5.2.2 5.4.2). Correlation by cluster is also considered. Source: Authors' calculations with ICSs. 78 Table 11.2: TURKEY, Extended production function and comparison of ICA method with Heckman models Dependent variable: log of total sales Heckman models2 ICA Method1 Heckman on complete case Heckman replacing inputs Category Variable Coeff. std. err. Boot. st. er. Coeff. Std.Err Coeff. std. err. Boot. st. er. PF variables Log-employment 0.416 [0.0492]*** (0.1088)*** 0.4017 [0.0423]*** 0.5104 [0.0427]*** (0.0376)*** Log-materials 0.4184 [0.0404]*** (0.0249)*** 0.5306 [0.0189]*** 0.4585 [0.0187]*** (0.0310)*** Log-capital 0.0371 [0.0165]** (0.0428) 0.063 [0.0164]*** 0.067 [0.0182]*** (0.0168)*** Infrastructures Days to clear customs for imports (a) -0.0707 [0.0686] (0.0688) -0.155 [0.0835]* -0.0648 [0.0859] (0.0556)*** Dummy for e-mail 0.2866 [0.0920]*** (0.1365)** 0.0193 [0.0822] 0.3121 [0.0786]*** (0.0659)** Red tape, Security expenses (b) -0.0246 [0.0828] (0.0011)*** -0.0379 [0.0824] -0.0658 [0.0831] (0.0575)** corruption and Payments to deal with bureaucratic issues (a) -0.011 [0.0020]*** (0.0077) -0.0084 [0.0009]*** -0.0101 [0.0010]*** (0.0012)*** crime Sales declared to taxes (a) -0.0226 [0.0057]*** (0.0045)*** -0.0175 [0.0075]** -0.0131 [0.0077]* (0.0055)*** Number of inspections (b) 0.0046 [0.0044] (0.0597) -0.0017 [0.0043] 0.0049 [0.0045] (0.0028) Payments to obtain a contract with the government (b) -0.0373 [0.0315] (0.0058)*** -0.0371 [0.0323] -0.0363 [0.0315] (0.0256)** Production lost due to absenteeism (b) -0.0054 [0.0043] (0.0367) -0.0138 [0.0073]* -0.0102 [0.0074] (0.0042)** Dummy for informal competition (b) 0.0044 [0.0295] (0.1203) -0.011 [0.0283] 0.0046 [0.0306] (0.0194) Delay in obtaining a water supply (a) -0.1325 [0.0634]** (0.0993) -0.0926 [0.0588] -0.165 [0.0603]*** (0.0467)*** Finance Dummy for credit line 0.068 [0.0868] (0.1383) 0.0473 [0.0621] 0.0493 [0.0644] (0.0482)** Dummy for external auditory (a) 0.0863 [0.0753] (0.1117) 0.1407 [0.0617]** 0.1075 [0.0641]* (0.0448)*** Loans in foreign currency (b) 0.0018 [0.0009]** (0.0010)* 0.0003 [0.0009] 0.0016 [0.0009]* (0.0008)* Quality, innov. Staff with university education (b) 0.0095 [0.0026]*** (0.0018)*** 0.0083 [0.0023]*** 0.0104 [0.0024]*** (0.0018)*** and labor skills Staff-part time workers -0.008 [0.0030]** (0.0222) -0.0065 [0.0027]** -0.0093 [0.0028]*** (0.0019)*** Other control Production lost due to strikes (b) -0.1689 [0.0634]** (0.0351)*** -0.1805 [0.0593]*** -0.153 [0.0723]** (0.0453)*** variables Dummy for recently privatized firm 1.0606 [0.2812]*** (0.2511)*** 1.3287 [0.3695]*** 1.0391 [0.2582]*** (0.2653)*** Dummy for competition against imported products 0.2069 [0.0962]** (0.2737) 0.021 [0.0724] 0.2084 [0.0730]*** (0.0634)*** Constant 3.5299 [0.7190]*** 3.0323 [0.6775]*** 1.7704 [0.7084]** (0.0376)*** Industry/region/size/time dummies Yes Yes Yes Observations 1684 1941 (Censored: 1149/ 2509 (Censored: 1149/ Uncensored: R-squared 0.73 Uncensored: 792) 1360) Heckman's Lambda -0.1531 [0.1188] 0.0639 (0.1332] Notes for Table 11.1. Source: Authors' calculations with ICSs. 79 Table 11.3: SOUTH AFRICA, Extended production function and comparison of ICA method with Heckman models Dependent variable: log of total sales Heckman models2 ICA Method1 Heckman on complete case Heckman replacing inputs Category Variable Coeff. std. err. Boot. st. er. Coeff. Std.Err Coeff. std. err. Boot. st. er. PF variables Log-employment 0.3226 [0.0711]*** (0.0365]*** 0.3427 [0.0261]*** 0.3275 [0.0250]*** (0.0452)*** Log-materials 0.5195 [0.1017]*** (0.0214]*** 0.4871 [0.0121]*** 0.5184 [0.0120]*** (0.0567)*** Log-capital 0.1247 [0.0300]*** (0.0118]*** 0.1117 [0.0123]*** 0.1241 [0.0129]*** (0.0168)*** Infrastructure Days to clear customs for import s(a) -0.1188 [0.1125] (0.1233] 0.032 [0.1133] -0.1728 [0.1286] (0.1542) Sales lost due to power outages (b) -0.0171 [0.0114] (0.0047]*** -0.0059 [0.0062] -0.0166 [0.0069]** (0.0064)*** Water outages (b) -0.1477 [0.0527]*** (0.0942] -0.1215 [0.0501]** -0.1383 [0.0516]*** (0.0460)*** Average duration of transport failures (a) -0.0439 [0.0806] (0.0379] 0.1092 [0.0936] -0.0821 [0.0985] (0.1197) Wait for electric supply (a) -0.0867 [0.0553] (0.0173]*** -0.0311 [0.0544] -0.057 [0.0717] (0.0742) Sales lost due to delivery delays (b) -0.0099 [0.0083] (0.0073] -0.0069 [0.0054] -0.0109 [0.0054]** (0.0051)** Red tape, Manager's time spent on bur. issues (b) 0.007 [0.0051] (0.0016]*** 0.0065 [0.0016]*** 0.0079 [0.0017]*** (0.0035)** corruption and Payments to deal with bureaucratic issues (b) -0.0045 [0.0024]* (0.3604] -0.0028 [0.0101] -0.0056 [0.0039] (0.0073) crime Sales declared to taxes (a) 0.0056 [0.0046] (0.0022]** 0.0079 [0.0041]* 0.0062 [0.0056] (0.0049) Payments to obtain a contract with the government (b) -0.0144 [0.0185] (0.1975] -0.0099 [0.0198] -0.0134 [0.0228] (0.0463) Security expenses (a) 0.1407 [0.0511]** (0.0069]*** 0.0308 [0.0152]** 0.1324 [0.0578]** (0.0658)** Illegal payments in protection (b) 0.3969 [0.2428] (0.1128]*** 0.2767 [0.1745] 0.3686 [0.0888]*** (0.1783)** Crime losses (a) -0.0502 [0.0788] (0.1374] 0.1006 [0.0792] -0.0561 [0.0817] (0.0762) Finance and Percentage of credit unused (b) 0.0014 [0.0010] (0.0013] 0.0006 [0.0005] 0.0013 [0.0006]** (0.0006)** corporate Dummy for loan 0.0715 [0.0492] (0.0327]** 0.0634 [0.0400] 0.0705 [0.0413]* (0.0408)* governance Value of the collateral (b) -0.0008 [0.0002]*** (0.0009] -0.0009 [0.0002]*** -0.0008 [0.0002]*** (0.0002)*** Loans in foreign currency (b) 0.0018 [0.0022] (0.0024] 0.0013 [0.0012] 0.0015 [0.0012] (0.0010) Charge to clear a check (a) -0.1164 [0.0503]** (0.0253]*** -0.1773 [0.0324]*** -0.1239 [0.0340]*** (0.0352)*** Largest shareholder 0.0006 [0.0010] (0.0008] 0.0000 [0.0006] 0.0008 [0.0007] (0.0006) Working capital financed by foreign commercial banks (b) 0.0106 [0.0083] (0.0084] 0.0241 [0.0070]*** 0.0134 [0.0045]*** (0.0093) Working capital financed by informal sources (b) -0.0022 [0.0023] (0.0001]*** -0.0044 [0.0031] -0.002 [0.0036] (0.1225) Quality, innovation Dummy for ISO quality certification (b) 0.1603 [0.0766]** (0.0365]*** 0.1208 [0.0359]*** 0.1599 [0.0389]*** (0.0498)*** and labor skills Dummy for new product (b) 0.091 [0.0494]* (0.0113]*** 0.0322 [0.0377] 0.0807 [0.0398]** (0.0398)** Dummy for discontinued product line (b) -0.1007 [0.0610] (0.0384]** -0.0565 [0.0333]* -0.0865 [0.0375]** (0.0431)** Staff - management 0.004 [0.0028] (0.0009]*** 0.0047 [0.0016]*** 0.0041 [0.0015]*** (0.0020)** Staff - non-production workers -0.0034 [0.0022] (0.0025] -0.0027 [0.0011]** -0.0033 [0.0012]*** (0.0014)** Training for unskilled workers (a) 0.001 [0.0026] (0.0030] -0.0048 [0.0032] 0.0012 [0.0041] (0.0043) University staff (b) 0.0049 [0.0015]*** (0.0007]*** 0.0036 [0.0015]** 0.0044 [0.0014]*** (0.0012)*** Manager's experience (b) 0.0391 [0.0249] (0.0217]* 0.0336 [0.0142]** 0.0369 [0.0150]** (0.0173)** Other control Age (b) 0.0018 [0.0015] (0.0016] 0.0016 [0.0009]* 0.0012 [0.0010] (0.0011) variables Share of the local market (b) 0.0032 [0.0008]*** (0.0004]*** 0.0028 [0.0006]*** 0.0031 [0.0006]*** (0.0007)*** Constant 2.7174 [0.8932]*** 2.7155 [0.5500]*** 2.7170 [0.6986]*** Industry/region/size/time dummies Yes Yes Yes Observations 1483 1443 (Censored: 2007/ 1657 (Censored: 183/ Uncens.: 1484) R-squared 0.89 Uncens: 1236) Heckman's Lambda -0.2747 [0.1993] -0.2471 [0.2303] Notes for Table 11.1. Source: Authors' calculations with ICSs. 80 Table 11.4: TANZANIA, Extended production function and comparison of ICA method with Heckman models Dependent variable: log of total sales Heckman models2 ICA Method1 Heckman on complete case Heckman replacing inputs Category Variable Coeff. std. err. Boot. st. er. Coeff. std. err. Coeff. std. err. Boot. st. er. PF variables Log-employment 0.1655 [0.0853]* (0.0512)*** 0.1422 [0.0557]** 0.1742 [0.0669]*** (0.0677)** Log-materials 0.4252 [0.0581]*** (0.0340)*** 0.6176 [0.0274]*** 0.6099 [0.0317]*** (0.0439)*** Log-capital 0.1589 [0.0323]*** (0.0208)*** 0.1427 [0.0209]*** 0.1417 [0.0265]*** (0.0235)*** Infrastructure Electricity from own generator (b) 0.0021 [0.0016] (0.0053) -0.001 [0.0018] 0.0041 [0.0020]** (0.0017)** Losses due to water outages (b) -0.0112 [0.0058]* (0.0162) -0.0081 [0.0060] -0.0029 [0.0063] (0.0054) Water from own well or water infrastructure (a) 0.0001 [0.0051] (0.0011) 0.001 [0.0031] 0.0044 [0.0036] (0.0042) Losses due to phone outages (a) -0.0322 [0.0198] (0.0071)*** -0.0315 [0.0284] -0.0226 [0.0321] (0.0291) Transport outages (a) -0.0047 [0.0703] (0.1168) -0.1172 [0.0503]** -0.0214 [0.0583] (0.0499) Dummy for own roads (b) 0.289 [0.1488]* (0.0581)*** 0.3742 [0.1143]*** 0.3416 [0.1444]** (0.1321)*** Dummy for webpage (b) 0.1578 [0.1212] (0.1994) 0.3178 [0.0972]*** 0.1595 [0.1208] (0.1468) Wait for a water supply (a) -0.1814 [0.0427]*** (0.0702)** -0.1214 [0.0415]*** -0.1888 [0.0551]*** (0.0466)*** Low quality supplies (a) -0.0163 [0.0128] (0.0041)*** -0.0252 [0.0116]** -0.0323 [0.0118]*** (0.0130)** Red tape, Gift to obtain an operating license (b) -0.4983 [0.1935]** (0.1066)*** -0.1757 [0.1281] 0.0688 [0.1482] (0.1589) corruption and Payments to deal with bureaucratic issues (b) 0.0939 [0.0299]*** (0.0164)*** 0.0365 [0.0420] 0.0245 [0.0446] (0.0495) crime Days in inspections (b) -0.1045 [0.0735] (0.0494)** -0.1106 [0.0525]** -0.0246 [0.0585] (0.0580) Payments to obtain a contract with the government (b) -0.0114 [0.0066]* (0.0091) -0.0332 [0.0088]*** -0.0101 [0.0074] (0.0066) Security expenses (b) -0.0119 [0.0042]*** (0.0092) 0.0068 [0.0108] -0.0051 [0.0058] (0.0052) Illegal payments in protection (b) -0.0827 [0.0170]*** (0.1019) -0.1209 [0.0478]** -0.026 [0.0467] (0.0493) Finance and Interest rate of the loan (a) -0.0109 [0.0145] (0.0099) 0.0036 [0.0098] -0.0074 [0.0115] (0.0127) corporate Working capital financed by commercial banks (b) -0.0009 [0.0018] (0.0012) 0.0000 [0.0014] -0.003 [0.0016]* (0.0015)** governance Working capital financed by leasing (b) -0.0794 [0.0282]*** (0.0054)*** -0.0234 [0.0408] -0.0473 [0.0096]*** (0.0806) Sales bought on credit (b) -0.0014 [0.0012] (0.0011) -0.0029 [0.0012]** 0.0038 [0.0014]*** (0.0014)*** Delay in clearing a domestic currency wire (a) -0.3418 [0.3273] (0.0935)*** 0.4842 [0.1853]*** 0.1533 [0.1876] (0.1996) Quality, Dummy for new product (b) 0.0429 [0.1063] (0.2036) -0.1942 [0.0850]** -0.003 [0.1014] (0.0951) innovation and Staff - skilled workers (b) 0.0026 [0.0023] (0.0050) 0.0074 [0.0020]*** 0.0092 [0.0026]*** (0.0024)*** labor skills Workforce with computer (b) 0.0066 [0.0030]** (0.0056) 0.0183 [0.0037]*** -0.0084 [0.0032]*** (0.0070) Other control Dummy for incorporated company (b) 0.2914 [0.2023] (0.5683) -0.2149 [0.3207] 0.1701 [0.3050] (0.1810) variables Dummy for FDI (b) 0.1397 [0.1445] (0.2844) 0.1752 [0.1051]* -0.0289 [0.1426] (0.1326) Constant 7.2978 [1.0168]*** 3.8725 [0.7997]*** 3.1102 [0.9936]*** Industry/region/size/time dummies Yes Yes Yes Observations 559 581 (Censored: 290/ Uncens: 771 (Censored: 317/ Uncens: 454) R-squared 0.88 291) Heckman's Lambda -0.2747[0.1993] -0.2471[0.2303] Notes for Table 11.1 Source: Authors' calculations with ICSs. 81 First of all, we consider it important to note that Heckman's Lambda is significant in none of the four cases. Thereby, the plausible selection bias is not supported by the Heckman model in any country. Besides the significance of Heckman's Lambda, the results are quite similar when we correct for the endogenous selection and when we do not. In India and South Africa there are no significant changes in the I-O elasticities. Nonetheless, the larger proportion of missing observations in Turkey and South Africa introduces some degree of heterogeneity between the results of the ICA method and the Heckman models. Even under very different estimated I-O elasticities, the IC parameters move within a reasonable range of values and there are no changes in the estimated direction of the effects. Overall, there are more IC variables significant in the Heckman model, even when we consider bootstrap standard errors. 6.3 Evaluation of the imputation mechanism: Comparison of estimated TFP densities We end this section with the evaluation of the estimated densities of the TFPs for each country. The estimated kernel densities of the different TFP measures obtained after applying the different imputation mechanism are obtained from equation (1) according to the following expression log TFPit sit [log Yit ( L log Lit M log M it K log Kit )] , where log TFPit is the * ^ ^ ^ measured productivity after the imputation process, log Y , log L , log M , log K are the it it it it imputed inputs and output, the alphas with a hat on top denote the different estimated input- output elasticities after imputing missing values and s* is the pattern of missing values in PF variables after the imputation process. The results are in figures 5.1 to 5.2, along with the descriptive statistics of each TFP measure and the correlation matrix among productivities. Again we should differentiate between two groups of countries. In the first one, say that consisting of India and South Africa, the estimated TFP measures show a similar shape of kernel densities, although with different estimated means, especially in the case of EM algorithm [1] in India. In South Africa, this pattern is more marked, with more ostensible differences in the first moment of the distribution of the different TFP measures, although all the kernel densities have a similar shape, indicating that the standard deviations do not differ much among them, which is corroborated in panels B and C, where the descriptive statistics and the matrix of correlations are shown. 82 Figure 5.1: INDIA, evaluation of TFP measures under different imputation methods I. Kernel1 estimates of TFP densities 1.5 Density of TFP .5 0 1 -2 -1 0 1 2 3 log-TFP Complete case ICA method Random ICA ICA on inputs EM alg. [1] EM alg. [2] EM alg [3] II. Table of descriptive statistics of TFP measures # Obs Mean Std. Dev. Min Max Complete case 4327 1.15 0.68 -12.51 12.08 ICA meth. 5915 1.17 0.98 -12.53 12.19 Random ICA 5915 1.10 1.19 -12.51 12.15 ICA on inputs 5821 1.16 0.95 -12.55 12.25 Em alg [1] 6848 0.83 0.90 -12.96 12.09 Em alg [2] 5731 1.13 0.71 -12.66 12.44 Em alg [3] 5731 1.13 0.71 -12.67 12.43 III. Correlation matrix between TFP measures Complete ICA meth. Random ICA ICA on Em alg [1] Em alg [2] Em alg [3] case inputs Complete case 1.000 ICA meth. 0.999 1.000 Random ICA 1.000 0.999 1.000 ICA on inputs 0.998 1.000 0.998 1.000 Em alg [1] 0.993 0.995 0.994 0.996 1.000 Em alg [2] 0.990 0.991 0.993 0.993 0.997 1.000 Em alg [3] 0.990 0.991 0.993 0.993 0.997 1.000 1.000 Notes: 1 Epanechnikov kernel. Each point estimated within a range of 300 values. Complete case: TFP measure from the sample without replacement of missing values; likewise, input-output elasticities are obtained from estimating equation (1) in the complete case (see I-O elasticities in Table 9.1). ICA method: TFP measure with inputs and output replaced by ICA method and input-output elasticities from Table 8.1. Random ICA: TFP measure with inputs and output replaced by random ICA method and input output elasticities from Table 9.1. ICA on inputs: TFP measure when only inputs are imputed by the ICA method (not sales), the I-O elasticities and semi- elasticities used are in Table 9.1. Em alg. [1]: TFP measure obtained under imputation of inputs and output by the EM algorithm described in section 5.1.1. Likewise, the I-O elasticities are in Table 9.1. Em alg. [2]: In this case the EM algorithm used is that described in section 5.1.2. The I-O elasticities are in Table 9.1. Em alg. [3]: The description of the EM algorithm is in section 5.1.3. The I-O elasticities are in Table 9.1. Source: Authors' estimations with ICSs. 83 Figure 5.2: TURKEY, evaluation of TFP measures under different imputation methods I. Kernel1 estimates of TFP densities 1 .8 ensity of TFP .4 .6 D .2 0 -2 0 2 4 6 8 log-TFP Complete case ICA method Random ICA ICA on inputs EM alg. [1] EM alg. [2] EM alg [3] II. Table of descriptive statistics of TFP measures # Obs Mean Std. Dev. Min Max Complete case 818 1.84 1.01 -5.25 6.41 ICA meth. 1805 3.45 1.20 -3.36 7.85 Random ICA 1805 4.16 1.28 -2.67 8.91 ICA on inputs 1481 1.37 1.23 -5.64 5.87 Em alg [1] 2646 2.87 0.97 -4.05 7.44 Em alg [2] 1802 1.33 0.88 -5.84 6.13 Em alg [3] 1802 1.51 0.88 -5.65 6.31 III. Correlation matrix between TFP measures Complete ICA meth. Random ICA ICA on Em alg [1] Em alg [2] Em alg [3] case inputs Complete case 1.000 ICA meth. 0.969 1.000 Random ICA 0.954 0.998 1.000 ICA on inputs 0.992 0.974 0.956 1.000 Em alg [1] 0.990 0.993 0.986 0.985 1.000 Em alg [2] 0.990 0.927 0.908 0.969 0.964 1.000 Em alg [3] 0.991 0.932 0.914 0.969 0.968 1.000 1.000 Notes: 1 Epanechnikov kernel. Each point estimated within a range of 300 values. Complete case: TFP measure from the sample without replacement of missing values; likewise, input-output elasticities are obtained from estimating equation (1) in the complete case (see I-O elasticities in Table 9.2). ICA method: TFP measure with inputs and output replaced by ICA method and input-output elasticities from Table 8.2. Random ICA: TFP measure with inputs and output replaced by random ICA method and input output elasticities from Table 9.2. ICA on inputs: TFP measure when only inputs are imputed by the ICA method (not sales).The I-O elasticities and semi- elasticities used are in Table 9.2. Em alg. [1]: TFP measure obtained under imputation of inputs and output by the EM algorithm described in section 5.1.1. Likewise, the I-O elasticities are in Table 9.2. Em alg. [2]: In this case the EM algorithm used is that described in section 5.1.2. The I-O elasticities are in Table 9.2. Em alg. [3]: The description of the EM algorithm is in section 5.1.3. The I-O elasticities are in Table 9.2. Source: Authors' estimations with ICSs. 84 Figure 5.3: SOUTH AFRICA, evaluation of TFP measures under different imputation methods I. Kernel1 estimates of TFP densities 1.5 P 1 ensity of TF .5 D 0 0 2 4 6 log-TFP Complete case ICA method Random ICA ICA on inputs EM alg. [1] EM alg. [2] EM alg [3] II. Table of descriptive statistics of TFP measures # Obs Mean Std. Dev. Min Max Complete case 1265 3.50 0.70 -3.74 10.34 ICA meth. 1585 2.99 0.84 -4.34 10.28 Random ICA 1585 3.38 0.90 -4.97 10.31 ICA on inputs 1576 2.94 0.84 -4.39 10.21 Em alg [1] 1784 2.78 0.80 -4.47 10.26 Em alg [2] 1581 3.21 0.72 -4.01 11.21 Em alg [3] 1578 3.22 0.72 -4.00 11.18 III. Correlation matrix between TFP measures Complete ICA meth. Random ICA ICA on Em alg [1] Em alg [2] Em alg [3] case inputs Complete case 1.000 ICA meth. 0.996 1.000 Random ICA 0.998 0.993 1.000 ICA on inputs 0.996 1.000 0.993 1.000 Em alg [1] 0.992 0.999 0.988 0.999 1.000 Em alg [2] 0.982 0.991 0.975 0.990 0.992 1.000 Em alg [3] 0.982 0.991 0.975 0.990 0.993 1.000 1.000 Notes: 1 Epanechnikov kernel. Each point estimated within a range of 300 values. Complete case: TFP measure from the sample without replacement of missing values; likewise, input-output elasticities are obtained from estimating equation (1) in the complete case (see I-O elasticities in Table 9.3). ICA method: TFP measure with inputs and output replaced by ICA method and input-output elasticities from Table 8.3. Random ICA: TFP measure with inputs and output replaced by random ICA method and input output elasticities from Table 9.3. ICA on inputs: TFP measure when only inputs are imputed by the ICA method (not sales),.The I-O elasticities and semi- elasticities used are in Table 9.3. Em alg. [1]: TFP measure obtained under imputation of inputs and output by the EM algorithm described in section 5.1.1. Likewise, the I-O elasticities are in Table 9.3. Em alg. [2]: In this case the EM algorithm used is that described in section 5.1.2. The I-O elasticities are in Table 9.3. Em alg. [3]: The description of the EM algorithm is in section 5.1.3. The I-O elasticities are in Table 9.3. Source: Authors' estimations with ICSs. 85 Figure 5.4: TANZANIA, evaluation of TFP measures under different imputation methods I. Kernel1 estimates of TFP densities 1 .8 ensity of TFP .4 .6 D .2 0 -2 0 2 4 6 8 log-TFP Complete case ICA method Random ICA ICA on inputs EM alg. [1] EM alg. [2] EM alg [3] II. Table of descriptive statistics of TFP measures # Obs Mean Std. Dev. Min Max Complete case 313 2.53 0.87 -3.21 5.47 ICA meth. 661 4.79 1.30 -0.75 9.72 Random ICA 661 4.98 1.50 -1.47 10.03 ICA on inputs 505 1.81 1.14 -3.68 6.85 Em alg [1] 790 4.39 1.18 -1.30 8.92 Em alg [2] 628 2.25 0.80 -3.82 7.64 Em alg [3] 638 2.81 0.86 -3.21 8.35 III. Correlation matrix between TFP measures Complete ICA meth. Random ICA ICA on Em alg [1] Em alg [2] Em alg [3] case inputs Complete case 1.000 ICA meth. 0.913 1.000 Random ICA 0.877 0.991 1.000 ICA on inputs 0.997 0.904 0.869 1.000 Em alg [1] 0.948 0.994 0.975 0.937 1.000 Em alg [2] 0.981 0.829 0.779 0.971 0.884 1.000 Em alg [3] 0.979 0.849 0.804 0.963 0.901 0.996 1.000 Notes: 1 Epanechnikov kernel. Each point estimated within a range of 300 values. Complete case: TFP measure from the sample without replacement of missing values; likewise, input-output elasticities are obtained from estimating equation (1) in the complete case (see I-O elasticities in Table 9.4). ICA method: TFP measure with inputs and output replaced by ICA method and input-output elasticities from Table 8.4. Random ICA: TFP measure with inputs and output replaced by random ICA method and input output elasticities from Table 9.4. ICA on inputs: TFP measure when only inputs are imputed by the ICA method (not sales). The I-O elasticities and semi- elasticities used are in Table 9.4. Em alg. [1]: TFP measure obtained under imputation of inputs and output by the EM algorithm described in section 5.1.1. Likewise, the I-O elasticities are in Table 9.4. Em alg. [2]: In this case the EM algorithm used is that described in section 5.1.2. The I-O elasticities are in Table 9.4. Em alg. [3]: The description of the EM algorithm is in section 5.1.3 and the I-O elasticities in Table 9.4. Source: Authors' estimations with ICSs. 86 In Turkey and Tanzania the results are somewhat different. The larger proportion of missing values in these two countries results in two different blocks of TFP measures. The first block comprises the TFP measures from the complete case, the ICA method on inputs, and the EM algorithms [2] and [3]. The second block includes the remaining measures, that is, those from the ICA method, the EM algorithm [1] and the Random ICA method. TFP measures are similar within each group, however between blocks there are evident differences in all the shapes of the distribution, the skewness, the kurtosis, as well as in the estimated means and standard errors, as panel B shows. In spite of all these differences, panel C shows that the correlations of the TFP measure from the ICA method with the remaining cases are between .8 and .99. Likewise, the correlation among the remaining measures is considerably high. 6.4 Summary and main conclusions The ICA method performs reasonably well. Even under very different patterns of missing data and assumptions we are able to get robust results from different methods of handling missing data after controlling for IC variables in the estimation. When we assume that the MDM is MAR then there are two main issues we should consider: uncertainty and amount of information used in the imputation. On the other hand, if a non-ignorable pattern of missing data is assumed, then we are forced to test the robustness of the results of the ICA method with the Heckman models. We find that, overall, the ICA method is a good alternative even when the proportion of missing values is relatively high and the underlying variables are manifestly non-normal., leading to rather more homogenous results than other more sophisticated methods. We also observe that uncertainty, amount of information and non-ignorability of the MDM are not big issues in the context of ICSs; or at least they are not so serious as to invalidate the results of the ICA method. Lastly, we find that in order to get robust results under different imputation mechanisms, it is essential to control for the same set of IC variables, as they contain a good deal of information on the MDM.1 The main conclusions of this section can be summarized as follows: Overall, there are small differences in the estimated distribution of the imputed PF variables. Nonetheless, these differences become more marked as the number of missing values imputed increases and when the variables are not normally distributed. In particular, the Random ICA method, is the mechanism with the worst performance under a large proportion of missing values, followed by the EM algorithms. The ICA method preserves with reasonable precision the main moments of the distribution of the variables in the complete case.2 1 Obviously, this assertion is conditioned by the objectives one may have. 2 This would imply that the ICA method performs well when the MDM is MCAR or MAR, since in that case, under regularity conditions, the distribution in the complete case shares the same characteristics as the population distribution. Nonetheless, at this point if the MDM is non-ignorable we cannot say anything about the goodness-of- fit of the ICA method, since it could be replicating any distribution different from the population distribution. 87 These differences in the estimated distributions become even clearer if we focus on the TFP. However, the conclusions are the same whether we focus on inputs and output or TFP. We found reasonably robust elasticities in equation (1) under all the imputation methods proposed. However, there are important differences in the I-O elasticities and in the significance of the IC variables. The ICA method, EM algorithm [1], Random ICA method and Bootstrap ICA method lead to homogeneous results among them. That is, introducing uncertainty into the ICA method, regardless of whether, in order to get it, we use the EM algorithm [1], Random ICA method or Bootstrap ICA method, does not change significantly either the estimated effects or the level of significance of IC variables This suggests that uncertainty is not a big issue. Obviously, there are slight differences in the standard errors, but we argue that they are not so serious as to invalidate the results of the ICA method. In all cases, EM algorithms [2] and [3] lead to differences in the I-O estimates, although the IC parameters are again quite robust and do not vary much, the level of significance is affected in a higher proportion of cases than in the EM algorithm [1]. More importantly, EM algorithms [2] and [3] are not homogeneous among themselves, suggesting that the amount of information embodied in the imputation algorithm does not consequently improve the results. Another interesting observation is that the performance of the EM algorithms [2] and [3] greatly depends on the structure of the MDM. When the pattern of missing data is very unbalanced, meaning that it is common to observe only one or two PF variables in each cross-sectional observation, these two EM algorithms lead to rather different results from the ICA method and EM algorithm [1]. Intuitively, this is probably due to the unbalanced amount of information included in each cross-sectional observation. Only in Tanzania and Turkey, when the proportion of missing values is larger than in the other two countries, do we observe significant changes in the estimated I-O elasticities under the Heckman models with respect to the ICA method. As a general rule, there are more significant IC variables under the Heckman models than under the ICA method. Heckman's Lambda is never significant, which does not support the story of non- ignorable MDM and confirms that correcting for endogenous selection does not change considerably the results. It is also important to note that it does not matter whether we replace only the independent variables, the dependent variable or both of them. In all the cases, the results are similar. More importantly, the Heckman model with the inputs replaced by the ICA method and the case of the ICA method on the inputs are similar in both cases. Finally, we find it essential to control for IC variables in the estimation in all the cases. We believe that this is what allows us to get such robust results under very 88 different assumptions and patterns of missing data. This is supported by section 4.4, where we saw that IC variables are able to explain a rather large proportion of the variability of the MDM in all the countries. 7. Conclusions When the missing data mechanism (MDM) is ignorable, the objective of the imputation methods is not to augment the sample size, but to preserve the sample representativity, to gain efficiency in the estimation and to retrieve for the analysis a large number of very expensive interviews. The alternative to these methods is the complete case or listwise deletion, which is not a panacea even when the MDM is ignorable. Operating with the complete case is only acceptable if incomplete cases attributable to missing data comprise a small percentage, say 5% or less, of the number of total cases (Schafer, 1997), and when the complete case preserves the representativeness of the original sampling frame. In addition, in models with a large number of regressors, the problem of missing data may encourage analysts to leave out of the regression some explanatory variables with a high proportion of missing values. As Cameron and Trivedi (2005) point out, this practice may be misleading as it leads to an omitted variables problem, which could be more serious than the missing data problem per se. The first question we raise in this paper is, hence, whether the researcher should do something about the missing values when dealing with investment climate surveys (ICSs). In the context of ICSs, a large proportion of the sample size is lost in the complete case and the representativeness of the original sample frame is, to some extent, modified. Given these results, the MDM can in no way be considered as missing completely at random (MCAR), and consequently a complete case could lead to inconsistent and inefficient results. In order to overcome this problem, we propose a imputation mechanism that fits well with the characteristics of ICSs--with unbalanced patterns of missing data and a low proportion of available observations in the complete case--likely to be used to construct structural models composed of single, or even systems of, equations with a large number of explanatory variables, all of them containing missing data. The imputation method proposed, which we call the ICA method, departs from the class of EM type algorithms and relies on the expectation of the imputed variables conditional to the sector, region and size they belong to. The performance of the ICA method depends on several characteristics of the MDM, such as the number of variables replaced or the proportion of missing values in the complete case; but especially, it depends on the nature of the MDM: missing at random (MAR) or non-ignorable. Taking this into account, we analyze the MDM of four countries with very different patterns of missing data (India, Turkey, South Africa and Tanzania) to find out to what extent the MDM can be treated as MAR or not. Although not conclusive on the nature of the MDM, the descriptive analysis shows that this has to do with a variety of IC determinants, such as informality and corruption and also with the capacity of the firms. More dynamic firms engaged in R&D, quality, innovation of new products, technologies and operating in more exigent and competitive 89 export markets tend to report fewer missing values. Accountability and size can by themselves explain a large share of missing data too. On the other hand, the analysis does not allow us to reject the non-ignorability assumption on the MDM in any case. In addition, given the results of the descriptive analysis and apart from the discussion concerning MAR and non-ignorable MDM, an interesting result is the need to control for those variables related with the MDM. Inconsistency would follow if we did not control for the large set of IC variables in the estimation. In the next step of the analysis presented in the paper, we estimated an extended production function under imputation of missing values by the ICA method and we test the estimating results against other imputation mechanisms. We first considered imputation mechanisms requiring the MAR assumption like the ICA method, including the complete case, EM algorithms, extensions of the ICA method and multiple imputation. We then included in the analysis methods considering the non-ignorable assumption on the MDM; essentially we considered the Heckman model under different specifications. Although caution is always a requisite when drawing conclusions from a model with imputed data, the ICA method leads the results to be more robust than even more sophisticated imputation methods also requiring the MAR assumption. We observe that more complex imputation mechanisms are rather sensitive to both the proportion of missing values and how these missing values are distributed among variables. When the MDM is very unbalanced, in the sense that we can observe only one or two PF variables for each cross- sectional observation, those EM algorithms including additional explanatory variables, such as inputs or IC variables, lead to changes in the results compared with the more linear, parsimonious and simpler ICA method and EM algorithm [1], both including only industry/region/size variables always available. This suggests that more complex imputation methods based on simulations, especially EM algorithms and multiple imputation based on Markov Chain Monte Carlo, require a deeper and more thorough knowledge of MDM that would allow us to handle proper assumptions on the unknown densities of data generating processes. The issue of the sensitivity of the results to the selection of a proper model for the MDM constitutes an interesting question to be handled in further research regarding ICSs.3 In this sense, we believe that incorporating systematically more information concerning the imputation mechanism does not constitute, per se, an improvement in the estimates. Rather, given the sensitivity of the results to the model choice for the MDM, extending the matrix of covariates used to impute missing values requires detailed, thorough knowledge of the determinants of the MDM, and this is likely to vary from country to country. Regarding the lack of uncertainty inherent in the ICA method as a deterministic imputation method, we find that using other mechanisms allowing for additional uncertainty in the imputation mechanisms, such as the so-called Random ICA method, Bootstrap ICA method or EM algorithms, does not change the results significantly. Despite changes in the 3 ICSs in particular and data collected from developing countries in general present the missingness issue as an additional challenge for applied researchers. We consider that a proper, systematic methodology to deal with this problem is required, especially if more sophisticated imputation mechanisms are applied. 90 level of significance of some coefficients, most of the variables remain significant when incorporating additional randomness. Nonetheless, we also observe that the randomness issue becomes more important as the proportion of missing values increases (in the cases of Turkey and Tanzania). On the other hand, provided we control for the same set of IC variables in all the specifications, the results under the complete case and the ICA method are reasonably consistent between the two. Even in those cases in which the complete case represents less than half of the original sampling frame, the estimated parameters of production function (PF) and IC variables is within a reasonable range of values. This illustrates the importance of using the large set of IC variables, in order to control for the data generating process in the estimation.4 Likewise, the ICA method shows reasonable robustness to the endogenous sampling case. Heckman's lambda is non-significant in all cases, which does not support the endogenous sampling selection hypotheses. The results of the ICA method are similar to those of the Heckman regressions, indicating that even if there were an endogenous sampling selection problem, this would not be serious enough to bias the final results. In this sense, replacing only those RHS variables and not the dependent variable (sales in our case) does not change the results, provided the endogenous sample selection is not supported by the models and the robustness in the results. As the use of Investment Climate Surveys becomes more and more important among policy makers, scholars and applied researchers, thorough research into the causes of the missingness problem in order to improve the quality of the data is becoming a requisite. The parsimonious methodology we propose here is intended to be a first step in helping prepare the way forward and delve further into this line of research. 4 In order to pursue this issue more deeply, further research is needed. Nonetheless, once the relation between IC variables and the MDM is proved, using them to gain independency between our model and the MDM is a requisite. We believe that this procedure is what balances the results, in the sense that it is what allows us to get robust results in specifications. 91 References Allison, P.D (2001): "Missing Data," Quantitative Applications in the Social Sciences. Sage University Paper. Buuren v S., H.C. Boshuizen and D.L. Knook (1999): "Multiple imputation of missing blood pressure covariates in survival analysis," Statistics in Medicine; 18(6):681-94. Cameron, A.C. and P.K Trivedi (2005): "Microeconometrics: Theory and Applications," Cambridge University Press. Dempster A.P., N.M. Laird and D.B. Rubin (1977): "Maximum Likelihood Estimation for Incomplete Data Via the EM Algorithm," Journal of the Royal Statistical Society, Series B, 39, 1-38. Escribano, A., and J. L. Guasch (2005): "Assessing the Impact of Investment Climate on Productivity Using Firm Level Data: Methodology and the Cases of Guatemala, Honduras and Nicaragua," World Bank Policy Research Working Paper #3621, Washington, DC. Escribano, A., and J. L. Guasch (2008): "Robust Methodology for Investment Climate Assessment on Productivity: Application to Investment Climate Surveys from Central America," Working Paper # 08-19 (11), Universidad Carlos III de Madrid. Escribano, A., J. L. Guasch, and M. de Orte (2009): "INDIA: Investment Climate Assessment on Productivity, Allocative Efficiency and Other Economic Performance Measures of the Manufacturing Sector,"mimeo, Universidad Carlos III de Madrid. Escribano, A., J.L. Guasch, M. de Orte and J. Pena (2008a): "Investment Climate and Firm's Performance: Econometric and Applications to Turkey's Investment Climate Survey," Working Paper # 08-20 (12), Universidad Carlos III de Madrid. Escribano, A., J.L. Guasch, M. de Orte and J. Pena (2008b): "Investment Climate Assessment Based on Demean Olley and Pakes Decompositions: Methodology and Applications to Turkey's Investment Climate Survey," Working Paper # 08-20 (12), Universidad Carlos III de Madrid. Escribano, A., J. L. Guasch and J. Pena (2009): "Assessing the Impact of Infrastructure Quality on Firm Productivity in Africa," World Bank Policy Research Working Paper #(forthcoming), Washington DC. Escribano, A., M. de Orte, J. Pena and J. L. Guasch. (2009): "Investment Climate Assessment on Economic Performance Using Firm Level Data: Pooling Manufacturing Firms from Indonesia, Malaysia, Philippines and Thailand from 2001 to 2002," Singapore Economic Review, Special Issue on the Econometric Analysis of Panel Data, Vol. 54, #3, August 20092009. Gelman, A., G. King and C. Liu (1998): "Not Asked and Not Answered: Multiple Imputation for Multiple Surveys," Journal of the American Statistical Association, 93, 846-874. Griliches, Z. (1986): "Economic Data Issues," Handbook of Econometrics, Vol, III. Ed. R.F. Engle and D. McFadden. Amsterdam: North Holland, 1464-1514. Heckman, J.J. (1976): "The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models," Annals of Economic and Social Measurement 5, 475-492. Heitjan, D.F. (1994): "Ignorability in General Incomplete-Data Models," Biometrika, 81, 701-708. 92 ___________(1999): "Causal Inference in Clinical Trials, A Comparative Example," Controlled Clinical Trials, 20, 309-318. Heitjan D.F. and S. Basu (1999): "Distinguishing `Missing at Random' and `Missing Completely at Random'," The American Statistician, 50, 207-213. Ibrahim, J.G., S.R. Lipsitz and M-H Chen (1999): "Missing Covariates in Generalized Linear Models When the Missing Data Mechanism is Ignorable," Journal of the Royal Statistical Society, Ser. B, 61, 173-190. Little R.J.A. and D.B. Rubin, (1987): "Statistical Analysis with Missing Data," Wiley Series in Probability and Mathematical Statistics. John Wiley and Sons Eds. McLachlan G.J and T. Krishnan (1997): "The EM Algorithm and Extensions," New York: Wiley. Meng, X.L. (2000): "Missing Data: Dial M for?" Journal of the American Statistical Association, Vol. 95, No. 452, (Dec., 2000), pp. 1325 -1330. Molerberghs, G., E.J. Goetghebeur, S.R. Lipsitz, M.G. Kenward (1999): "Nonrandom Missingness in Categorical Data: Strengths and Limitations," The American Statistician, 53, 110-118. Murphy, K. M. and R. H. Topel (1985): "Estimation and inference in two-step econometric models," Journal of Business and Economic Statistics, 3(4): 370­379. Newey, W.K (1984): "A Method of Moments Interpretation of Sequential Estimators," Economic Letters, 14, 201-206. Newey, W.K and D. McFadden (1994); "Large Sample Estimation and Hypotheses Testing," Handbook of Econometrics, Vol, 4. Ed. R.F. Engle and D. McFadden. Amsterdam: North Holland, 2111-2245. Pagan, A.R (1984): "Econometric Issues in the Analysis of Regressions with Generated Regressors," International Economic Review, 25, 221-247. Rubin, D.B. (1976): "Inference and Missing Data," Biometrika, 63, 581-592. Rubin, D.B. (1987): "Multiple Imputation for Nonresponse in Surveys," New York: Wiley. Schafer, J.L (1997): "Analysis of Incomplete Multivariate Data," London: Chapman and Hall. Schafer, J.L (1999): "Multiple Imputation: A Primer," Statistical Methods in Medical Research, 8, 3-15. Wooldridge, J.M (2007): "Econometric Analysis of Cross Section and Panel Data," The MIT Press. Cambridge, Massachusetts. 93 Appendix I: definition of IC variables I Infrastructures IC variables Country Measureme Definition nt units Longest # of days to clear customs for (IND) Log Longest number of days that it took to clear customs exports when exporting Days to clear customs for imports (TUR, SA) Log Average number of days that it takes to clear customs when importing Dummy for own generator (IND) 0 or 1 Dummy variable taking value 1 if the firm has own generator Electricity from own generator (TUR, TZA) Percentage Percentage of total electricity used that came from own generators Losses due to power outages (IND, TUR, Perc Percentage of total annual sales lost as a result of SA, TZA) power outages Wait for electric supply (SA) Log Average number of days that it takes to obtain a power supply Water supply from public sources (IND) Perc. Percentage of the water used by the establishment that came from public sources Water from own well or water (SA) Perc. Percentage of the water used by the establishment that infrastructure came from own well or water infrastructures Losses due to water outages (TUR, TZA) Perc. Percentage of total annual sales lost as a result of water outages Water outages (SA) Log Total number of water outages experienced per year Wait for a water supply (TUR, TZA) Log Average number of days that it takes to obtain a water supply Shipment losses in the domestic market (IND, TUR) Perc. Percentage of products shipped that were lost as a consequence of theft, breakage, or spoilage Dummy for own transport (IND) 0 or 1 Dummy variable taking value 1 if uses own transport services Average duration of transport failures (SA) Log Average duration in hours of transport failures Transport outages (TZA) Log Total number of transport failures per year Losses due to transport delay (IND, TZA) Perc. Percentage of total annual sales lost as a consequence of transport delays Losses due to phone outages (TZA) Perc. Percentage of total annual sales lost as a consequence of phone interruptions Dummy for web page (IND, SA, 0 or 1 Dummy variable taking value 1 if the firm uses web TZA) page to communicate with clients or suppliers Dummy for e-mail (IND, TUR, 0 or 1 Dummy variable taking value 1 if the firm uses e-mail SA) to communicate with clients or suppliers Sales lost due to delivery delays (SA) Perc. Percentage of total annual sales lost as a consequence of delivery delays Dummy for own roads (TZA) 0 or 1 Dummy variable taking value 1 if the firm has own roads. Low quality supplies (TZA) Perc. Percentage of total supplies that were of lower than agreed upon quality per year Days of inventory of main supply (TZA) Log Days of inventory that the establishment kept its main supply in storage on average during the last year 94 II Red tape, corruption and crime IC variables Country Measureme Definition nt units Crime losses (TUR, SA) Perc. Percentage of total annual sales lost as a consequence of crime, vandalism or arson Dummy for security (IND) 0 or 1 Dummy variable taking value 1 if the firm has security expenses Security expenses (TUR, SA, Perc. Security expenses as a percentage of total annual sales TZA) Illegal payments for protection (SA, TUR) Perc. Illegal payments for protection (e.g. to organized crime) to prevent violence as a percentage of total annual sales per year Manager's time spent on bur. Issues (TUR, SA) Perc. Percentage of manager's time spent in dealing with bureaucratic issues Payments to deal with bureaucratic issues (TUR, SA, Perc. Payments to deal with bureaucratic issues as a TZA) percentage of total annual sales Payments to obtain a contract with the (TUR, SA, Perc. Payments to obtain a contract with the government as government TZA) a percentage of total annual sales Dummy for payments to speed up (IND) 0 or 1 Dummy variable taking value 1 if the establishment bureaucracy declared making payments to 'speed up' bureaucratic issues Dummy for payments to deal with bur. (IND) 0 or 1 Dummy taking value 1 if the firm declared making issues 'irregular' payments to deal with bureaucratic issues Dummy for interventionist labor regulation (IND) 0 or 1 Dummy taking value 1 if the firm considers that regulation affected its decisions to hire or fire employees Gift to obtain a operating license (TZA) Perc. Gifts as a percentage of total annual sales paid to get an operating license Number of inspections (TUR) Log Total number of inspections received by the firm per year Days in inspections (TZA) Log Total number of days that the firm received inspections from public officials during the last year Sales reported for taxes (IND, TUR, Perc. Percentage of total annual sales reported to IRS tax SA) authorities Workforce reported for taxes (IND) Perc. Percentage of total workforce reported to IRS tax authorities Production lost due to absenteeism (IND, TUR) Log Days production lost as a consequence of employees' absenteeism Dummy for informal competition (TUR) 0 or 1 Dummy variable taking value 1 if the firm declared competing against informal competition Dummy for lawsuit (TUR) 0 or 1 Dummy variable taking value 1 if the firm had any lawsuit during the last year 95 III Finance IC variables Country Measureme Definition nt units Dummy for external audit (IND, TUR, 0 or 1 Dummy taking value 1 if the firm has its annual SA) statements reviewed by an external auditor Dummy for trade association (IND) 0 or 1 Dummy variable taking value 1 if the firm belongs to a trade association Dummy for loan (IND, SA) 0 or 1 Dummy taking value 1 if the firm has access to a loan from any financial institution Largest shareholder (IND, SA) Perc. Percentage of firm's equity that belongs to the largest shareholder Dummy for credit line (TUR, SA, 0 or 1 Dummy taking value 1 if the firm has access to a credit TZA) line from any financial institution Percentage of credit unused (SA) Perc. Percentage of the credit line that is currently unused Dummy for loan with collateral (IND) 0 or 1 Dummy taking value 1 if the firm has a loan with associated collateral Value of the collateral (SA) Perc. Value of the collateral as a percentage of the total value of the loan Loans denominated in foreign currency (IND, TUR, Perc. Percentage of total firm's loans that were denominated SA, TZA) in foreign currency Dummy for loan denominated in Turkish (TUR) 0 or 1 Dummy taking value 1 if the firm has access to a loan Lira denominated in Turkish Lira Dummy for loan denominated in foreign (TUR) 0 or 1 Dummy taking value 1 if the firm has access to a loan currency denominated in foreign currency Dummy for long-term loan (TUR) 0 or 1 Dummy taking value 1 if the firm has access to a loan for more than 1 year Interest rate of the loan (TZA) Perc. Interest rate of the last loan obtained by the firm Dummy for new land purchased (TUR) 0 or 1 Dummy taking value 1 if the firm obtained new land in the last year Charge to clear a check (SA) Perc. Charges to clear a check as a percentage of the value of the check Delay in clearing a domestic currency wire (TZA) Log Average number of days that it takes to clear a domestic currency wire Working capital financed by domestic (IND) Perc. Percentage of working capital financed by funds from private banks domestic private banks Working capital financed by commercial (TZA) Perc. Percentage of working capital financed by funds from banks commercial banks Working capital financed by foreign (SA) Perc. Percentage of working capital financed by funds from commercial banks foreign commercial banks Working capital financed by informal (SA) Perc. Percentage of working capital financed by funds from sources informal sources Working capital financed by leasing (TZA) Perc. Percentage of working capital financed by funds from leasing arrangement Dummy for current or saving account (TZA) 0 or 1 Dummy taking value 1 if the firm has access to a current or saving account Inputs bought on credit (TZA) Perc. Percentage of inputs bought on credit per year Sales bought on credit (TZA) Perc. Percentage of sales bought on credit per year 96 IV Quality innovation and labor skills IC variables Country Measureme Definition nt units Dummy for R&D (IND) 0 or 1 Dummy taking value 1 if the firm invests in R&D Dummy for new technology (TUR, TZA) 0 or 1 Dummy taking value 1 if the firm introduced new technology inherent to the production process during the last year Dummy for new product (SA, TZA) 0 or 1 Dummy taking value 1 if the firm introduced a new product of product line during the last year Dummy for product innovation (IND, TZA) 0 or 1 Dummy taking value 1 if the firm introduced a product innovation during the last year Dummy for discontinued product line (SA) 0 or 1 Dummy taking value 1 if the firm discontinued the production of any product during the last year Dummy for foreign license (IND, TUR, 0 or 1 Dummy taking value 1 if the firm has a technology TZA) licensed from a foreign company Dummy for internal training (IND, SA, 0 or 1 Dummy taking value 1 if the firm provides training to TZA) its employees Training for unskilled workers (SA) Perc. Percentage of unskilled workers that received training during the last year Workforce with computer (IND, TZA) Perc. Percentage of workers on the staff that regularly uses computer at job Dummy for ISO quality certification (IND, TUR, 0 or 1 Dummy taking value 1 if the firm has an ISO quality SA) certification Dummy for outsourcing (IND, SA, 0 or 1 Dummy taking value 1 if the firm outsourced any part TZA) of production in the last year Dummy for brought in house (TZA) 0 or 1 Dummy taking value 1 if the firm brought in house any part of the production process previously outsourced Dummy for external training (IND) 0 or 1 Dummy taking value 1 if the firm provided external training for its employees Staff - skilled workers (TZA) Perc. Percentage of skilled workers on staff Staff - professional workers (TZA) Perc. Percentage of professional workers on staff Unskilled workforce (IND) Perc. Percentage of unskilled workforce on staff Staff with university education (TUR, SA) Perc. Percentage of staff with at least one year of university education Staff-part time workers (TUR) Perc. Percentage of part time workers on staff Staff - management (SA) Perc. Percentage of management on the staff Staff - non-production workers (SA) Perc. Percentage of non-production workers in staff Manager's experience (SA) Log Manager's experience in years Dummy for closed plant (SA) 0 or 1 Dummy taking value 1 if the firm closed a plant during the year previous to the survey Dummy for joint venture (TZA) 0 or 1 Dummy taking value 1 if the firm agreed to do a joint venture during the last year 97 V Other control variables IC variables Country Measureme Definition nt units Dummy for incorporated company (IND, TUR, 0 or 1 Dummy taking value 1 if the firm is constituted as an TZA) incorporated company Age (IND, TUR, Log Age of the firm in years SA) Share of exports (IND) Perc. Percentage of total annual sales exported Trade union (IND) Perc. Percentage of workers that belong to a trade union Strikes (IND, TUR) Log Days of production lost due to strikes Market share (TUR, SA) Perc. Share of market share Dummy for recently privatized firm (TUR) 0 or 1 Dummy taking value 1 if the firm was privatized within the last five years Dummy for competition against imported (TUR) 0 or 1 Dummy taking value 1 if the firm competes against products imported products Capacity utilization (SA) Perc. Percentage of total capacity used by the firm the last year Dummy for FDI (TZA) 0 or 1 Dummy taking value 1 if the firm received FDI inflows Dummy for industrial zone (TZA) 0 or 1 Dummy taking value 1 if the firm is located in an industrial zone 98