The World Bank Economic Review, 31(1), 2017, 129–157. doi: 10.1093/wber/lhv054

Article

Genuine Fakes: The Prevalence and Implications of Data Fabrication in a Large South African Survey

Arden Finn and Vimal Ranchhod

Abstract

How prevalent is data fabrication in household surveys? Would such fabrication substantially affect the validity of empirical analyses? We document how we identified such fabrication in South Africa's longitudinal National Income Dynamics Study, which affected about 7% of the sample. The fabrication was detected while fieldwork was still ongoing, and the relevant interviews were reconducted. We thus have an observed counterfactual that allows us to measure how problematic such fabrication would have been, had it remained undetected. We compare estimates from the dataset that includes the fabricated interviews with corresponding estimates that include the corrected data instead. We find that the fabrication would not have affected our univariate and cross-sectional estimates meaningfully, but would have led us to reach substantially different conclusions when implementing panel estimators. We estimate that the data quality investigation in this survey had a benefit-cost ratio of at least 24, and was thus easily justifiable.

JEL classification: C18, C81, C83, J23

Key words: Interviewer fraud, Data quality, Survey methodology

Introduction

For anyone involved in the running of a survey, issues of data quality are of critical importance. Surveys can cost millions of dollars, require years of planning by large teams of people and need considerable levels of sustained effort. All of these resources are allocated for the sole purpose of producing high-quality data. All empirical findings, in turn, are premised on the assumption that the data being used are of a reasonable quality. This caveat applies to vast literatures in economics, sociology, demography, and political science, amongst others. Indeed, it is so ubiquitous that it hardly ever gets stated explicitly.

Arden Finn is a doctoral student and researcher at the Southern Africa Labour and Development Unit (SALDRU), University of Cape Town. He acknowledges support from the National Research Foundation's Human and Social Dynamics in Development Grand Challenge. His email address is fnnard001@myuct.ac.za. Vimal Ranchhod (corresponding author) is an associate professor in SALDRU at the University of Cape Town. He acknowledges support from the Research Chairs Initiative of the Department of Science and Technology and National Research Foundation. His email address is vimal.ranchhod@uct.ac.za. The authors would like to thank the editor, three anonymous referees, Louise de Villiers, Andre Hofmeyr, Murray Leibbrandt, Brendan Maughan-Brown, Martin Wittenberg, Ingrid Woolard, and the entire NIDS team for their assistance and advice. This paper benefited greatly from comments by seminar participants at the University of Cape Town and the University of Michigan and conference participants at the University of Oxford. All errors and omissions remain the sole responsibility of the authors.

© The Author 2015. Published by Oxford University Press on behalf of the International Bank for Reconstruction and Development / THE WORLD BANK. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

In this paper, we investigate one aspect of the data production process that might lead us to question the quality of survey data.
Most survey organizations, either directly or indirectly, employ interviewers to conduct their surveys. The interviewers, though, might not have the same objectives as the survey or- ganization. These principal-agent problems might result in interviewers “cheating.”1 Interviewers may engage in cheating behavior for a variety of reasons. First, interviewers may be re- luctant to ask sensitive questions about topics related to income, wealth, or sexual behavior. Second, some sections are very long and interviewers might want to leave them out in order to save time. Third, the characteristics of the primary sampling unit (PSU) may play a role. If the PSU is in an area that is con- sidered dangerous (which is not uncommon in the South African context) or is very far away, then inter- viewers may end up cheating rather than visiting the PSU. Fourth, interviewers might be remunerated according to the number of successful interviews that they have completed. In the case of refusals or the case where it is easier to fabricate an interview, this would incentivize cheating. Finally, the penalties for cheating may be small. If the survey company is unable or unwilling to monitor the behavior of the inter- viewers, then the expected payoff to cheating might exceed the expected costs for some workers. There are also different ways in which cheating behavior could manifest itself. First, and most prob- lematic, interviewers could fabricate entire interviews. In later waves of longitudinal studies, there is usu- ally some prepopulation of the questionnaires based on data from previous waves. This often includes a list of household members from the roster and their demographic characteristics. Interviewers can view this information and use it to form the basis of their fabrication. Interviewers could also cheat by leaving out sections of interviews. For example, in wave 1 of the National Income Dynamics Study (NIDS) ques- tionnaire, which is the longitudinal dataset that we use in this study, the labor market section is substan- tial and has a total of eighty-nine questions.2 However, a respondent who is “not economically active” will only answer seven simple “yes no” questions. Interviewers could save time by setting respondents’ labor market statuses to “not economically active” when they are, in fact, working or looking for work. A different way to save time would be to leave out certain people in the household. This would be easy to implement in a cross-sectional study. In a longitudinal study, the interviewer might ignore new mem- bers in the household, such as babies or in-migrants, or exaggerate the number of people from the previ- ous wave who have died. Our research findings presented in this paper are primarily concerned with the most problematic type of cheating listed, namely the fabrication of entire interviews. We find that analy- sis of cross-sectional numerical patterns and longitudinal anthropometric measures are the most effective means of detecting data fabrication in our context. Our methods suggest that approximately 7% of the sample was affected in this way. If the fabrication had not been detected, it would not have substantially affected our cross-sectional estimates but would have led us to reach different findings as more complex, longitudinal, estimators were used. A brief cost-benefit analysis of the data quality investigation suggests that the benefit was more than twenty-four times the aggregate cost. The remainder of this paper is structured as follows. 
In section I, we argue that the incidence of inter- viewer cheating is a common problem in the implementation of large household surveys in several coun- tries, including South Africa. In section II we turn our focus to the first two waves of the NIDS dataset and evaluate a number of methods that we considered to detect interviewer cheating.3 The two most suc- cessful methods, Benford’s law and anthropometric diagnostics are dealt with in greater detail than the 1 We use the word “cheating” although in some cases a better word might be “negligence.” The former implies intent whereas the latter could arise out of ignorance, incompetence or misunderstandings, and we cannot always separate be- tween the two. In either case, interviewers do something that they ought not to have done, which results in a deterioration of the aggregate quality of the data produced. 2 This includes ten sub-questions. 3 This work was done by the authors while wave 2 of NIDS was still in the field. At the time, both authors were employed in the NIDS office. The World Bank Economic Review 131 others. Section III analyzes what the consequences for future research would have been had the cheating not been detected and corrected for and compares the benefits of the data quality investigation to the ag- gregate costs of detecting fabricated data. Section IV offers recommendations for future fieldwork opera- tions and provides some concluding remarks. I. The Prevalence of Interviewer Cheating in Survey Data The phenomenon of interviewers making up data is a global and persistent problem with a sizeable liter- ature dedicated to documenting and detecting it. A number of studies use data from major surveys in the United States to detect whether data fabrication had occurred. Schreiner et al. (1988) use Census Bureau Studies data from 1982 to 1987 to highlight the importance of reinterviewing respondents as a means of fraud detection. In their study, 83% of suspected falsifications turned out to indeed be a result of cheat- ing. Most of the cheating was detected through reinterviews, although some were picked up because of anomalies in the data. In addition, most of the cheating involved total rather than partial fabrication of individual-level data. The authors find that falsification rates range from 0.4% to 6.5%, depending on which one of the Census Bureau surveys is used. Finally, they note that interviewers who had served for longer periods of time are significantly less likely to be data fabricators. Li et al. (2011) make the point that the Census Bureau’s reinterview strategy for detecting falsification can be improved upon. The con- ventional reinterview methods detect falsification in less than 0.1% of the data. The authors use data from the Current Population Survey to try to design an alternative sampling method that should underlie the reinterview process. Using a combination of real data and simulations, they conclude that alternative sampling methods could find up to 20% more fabricated interviews than the current system. Murphy et al. (2004) use data anomalies from the National Survey on Drug Use and Health to identify suspicious interviewer behavior. In particular, they flag relatively short or long interview durations as possible signs of falsification and show how taking these durations into account adds to the power of the fraud detec- tion process. Outside of the United States, Sch€ afer et al. 
(2004) use data from the German Socio-economic Panel (SOEP) to test the reliability of two methods of fraud detection. Data fabrication was low in all waves and all samples of the SOEP, never exceeding 2.4% of all cases, with the overall share of faked data at about 0.5% (Schr€ apler and Wagner 2005). The authors use the fabricated data that was removed from the publicly released version of the SOEP and find that using Benford’s law as the basis for detecting sus- picious data correctly identifies all cases of fabrication. In addition, they exploit the fact that cheating in- terviewers tend to have less variability in their responses over all questions and all interviews than noncheaters. Interviewer-level tests for surprisingly low variance also correctly identified all of the cases of cheating. All confirmed cheating interviewers were middle-aged and male, and the effect of education on the probability of cheating was not statistically significant. There are a number of other studies that use characteristics in the data themselves as a means of iden- tifying fabrication. These include, amongst others, Bredl et al. (2008) in an unspecified non-OECD coun- try and Porras and English (2004), Cho et al. (2003), and Swanson et al. (2003) in the United States. A broad review on much of the literature related to the detection of fabricated data can be found in Birnbaum (2012), who charts the methods used in twelve different datasets in the developed and devel- oping world. Although the fabrication of survey data is an issue of concern to researchers throughout the world, the remainder of this study narrows the focus somewhat by drawing attention to illustrative cases of in- terviewer cheating in the South African context. These examples highlight the existence, but not the prevalence, of data fabrication in South African household surveys. They motivate our study by showing that there is enough prior evidence to take this phenomenon seriously, even though they are agnostic as to how problematic or widespread this is in household survey datasets. 132 Finn and Ranchhod KwaZulu-Natal Income Dynamics Study (KIDS) KIDS is a household level panel dataset that was conducted in 1993, 1998, and 2004. It revisited a subset of the households located in the KwaZulu-Natal province of South Africa that were included in the origi- nal SALDRU/PSLSD 1993 survey. Follow-up fieldwork in May of 2001 suggested that there may have been cheating by interviewers in some clusters. Subsequent investigations revealed that the fabrication was limited to two clusters, and these were permanently removed from the sample.4 Judge and Schechter (2009) compare data from the deleted clusters to data from the retained clusters and find large differ- ences between the two in the module on crop production and animal ownership. Survey on Time and Risk Preferences Between 2010 and 2011, researchers from the University of Cape Town conducted a survey on time and risk preferences in the three major metropolitan regions of South Africa,5 with a budget of about 300,000 US dollars.6 They had a sample size of about 300 respondents and visited each of them six times at three monthly intervals. The survey included a background questionnaire as well as two ex- perimental modules. In the experimental modules, respondents were asked to choose between various alternatives in order to ascertain their appetite for risk and their discount factors. 
In order to obtain truthful responses, all choices were incentivized to have some probability of entailing an actual cash payout. In the time preferences component, respondents answered forty questions. They then rolled a ten- sided die, and if it landed on zero, they would get paid for one of their forty responses. The relevant question was selected by rolling a ten-sided die and a four-sided die simultaneously. In the risk prefer- ences component, respondents were also asked forty questions, one of which would yield a cash pay- out with certainty. The relevant question was also selected by means of rolling a ten-sided die and a four-sided die simultaneously. The payouts varied by question and by the choices made by the respon- dents in that question. The interviewer would then pay the amount of the winnings in cash to the respondents. After the data were collected, researchers found a suspiciously high rate of interviewees getting paid out for the time preferences component. The ex ante probability of this occurring was 10%, but the re- spondents were “winning” 25% of the time overall. Moreover, respondents were observed to have a disproportionately high probability of having “randomly selected” questions with relatively higher cash payouts in both the risk preferences and time preferences component of the study. Further investi- gation indicated that these anomalies were driven by data from a subset of interviewers who almost al- ways paid out the maximum amounts permissible. People involved in the study believe that some interviewers colluded with respondents so as maximize the actual disbursements, which they could then share. The problem was identified only after the fourth wave of data had been obtained, with the fifth wave already in the field, and both the time preferences and risk preferences components of the study had to be abandoned. Cape Area Panel Study: Wave 5 The Cape Area Panel Study (CAPS) is a longitudinal study of young adults in the Cape Town metropoli- tan area. Wave 1 was conducted in 2002 with a sample of about 4,800 young adults aged fourteen to twenty-two. In the fifth wave of CAPS, conducted in 2009, part of the interview included a finger-prick 4 See May et al. (2007) with an earlier version available for download at http://www.datafirst.uct.ac.za/catalogue3/index. php/catalog/286. 5 These are Johannesburg/Pretoria, Cape Town, and Durban. 6 Information on the details of this study was obtained through interviews with Andre Hofmeyr. At the time, he was a re- searcher actively involved in the survey. The World Bank Economic Review 133 test for HIV status which was administered by the interviewer.7 The ex ante expectation was that about 30% of the women interviewed would be HIV positive. For most interviewers, the proportion of HIV- positive female respondents was indeed reasonably close to 30%, but after a certain date, one inter- viewer returned HIV-positive results for every respondent. It took a considerable amount of the time for the lab results on the blood to be returned to the opera- tional headquarters. Thus, by the time that this was discovered, the interviewer in question had already been paid and had left the survey. Investigations discovered that the interviewer in question had not, in fact, taken blood samples from respondents, but had obtained blood samples from some other source. The result was that all data collected by this interviewer after a certain date was deleted and did not form part of the fifth wave. 
A common method of monitoring interviewer behavior is to phone respondents in the weeks or months after the interview in order to verify that they were in fact interviewed. One of the interviewers obtained the list of verification questions and set up a system in which her sister-in-law pretended to be a respondent each time she was called by the survey company. This suggests that interviewers who do cheat can use quite sophisticated methods to avoid detection. This interviewer’s cheating was only dis- covered after the conclusion of fieldwork, and all relevant data was deleted from the study. Overall, a total of eight interviewers had engaged in some form of cheating, out of an average of about forty interviewers over the course of the fieldwork. A total of 289 fraudulent interviews were de- leted from the public release version, which represented about 9% of the expected sample at the start of wave 5. Time Use Study: 2000 In 2000, StatsSA, the official statistical agency of South Africa, conducted a national time use study over three different months in order to investigate how South Africans spend their day. The total sample size was approximately 14,000. Household members were eligible to participate if they were aged ten or older. Interviewers were asked to fill in a household roster in descending order of age, and if more than two household members were eligible, to select two household members sequentially using a sampling grid.8 If the interviewer reached the end of the grid, that is at the eleventh such household, then she was instructed to “loop” back to the start - that is, to treat the eleventh household with four eligible people as if it were the first household with four eligible people that she had encountered (Statistics South Africa 2000). The sampling grid yields an asymptotic distribution of the frequency with which household members of a particular age-rank ought to have been selected on aggregate, conditional on the number of eligible persons. For example, in households with three eligible persons, we would expect to find that person 1 was selected 50% of the time, person 2 was selected 70% of the time and person 3 was selected 80% of the time.9 In the dataset, however, we find that persons 1, 2, and 3 in households with three eligible per- sons were in fact selected 81%, 81%, and 38% of the time, respectively. A Chi-squared test rejects the null hypothesis that the realized distribution corresponds to the expected distribution at any reasonable level of significance.10 We repeated the analysis for households with four, five, and six eligible members, 7 The information on CAPS was obtained through interviews with Brendan Maughan-Brown, coordinator of the fifth wave of CAPS. More information can be obtained in Lam et al. (2012), which can be downloaded at http://www. datafirst.uct.ac.za/catalogue3/index.php/catalog/266. 8 A copy of the sampling grid is included in appendix A as table A1. To illustrate how it works, suppose that an interviewer came to her first house with four eligible members. She should then select persons 2 and 4, i.e., the second and fourth old- est members of the household. In the second such household, she should select persons 1 and 3, etc. 9 The total adds up to 200% since two household members were selected. 10 The results of this analysis are presented in table A2 in appendix 5. 134 Finn and Ranchhod respectively, and convincingly rejected the null hypothesis of equivalence of distributions in each case. 
We interpret this as evidence that interviewers did not, in fact, follow instructions about whom to select in households where they had some degree of choice. An alternative explanation to interviewer cheating is that the asymptotic distribution is not the cor- rect distribution to use, since we would only expect it to be realised if each interviewer had “many” households with more than two eligible persons. Nonetheless, at least in the case with three eligible per- sons, there are more than 1,000 such households in total. If each interviewer encountered ten or more such households then the asymptotic distribution would provide a reasonable approximation of the cor- rect expected distribution. Moreover, the magnitude of the divergence is so great that, unless the ex- pected distribution that we use is grossly incorrect, we would continue to reject the null hypothesis that the realised distribution corresponds to the “true” expected distribution. Note that in this case we are not claiming evidence of data fabrication. The “cheating” that we ob- serve here is of a very different nature than those previously documented, and it is important to draw the distinction between intentional and unintentional sources of data error. What explains such cheating? We conjecture that interviewers violated the sampling instructions due to some combination of the avail- ability of respondents as well as variations in the time that it would take for different respondents to complete the questionnaire. Two empirical observations support this conjecture. First, in households where the interviewer had some choice, 53.7% of those eligible were female, but 56% of those selected were female. This difference is small in terms of percentage points but since it applies to just over 9,000 observations it is statistically significant. Moreover, there is nothing in the sampling grid that suggests a clear gender bias in terms of who ought to be selected. A more likely explanation is that females are more likely to be available for an interview, since they are much less likely to be employed in South Africa.11 Second, we observe that in households with three or more eligible persons, 51% of those eligible are younger than twenty-one years of age. Of these potential respondents, only 35% were selected to fill in a questionnaire. Teenagers are probably less likely to be available due to being in school. Additionally, they might be less willing to par- ticipate in interviews in the first place, and it might take longer for them to complete a questionnaire. If our conjecture is correct, then the violation of the sampling framework has potentially serious im- plications for analyses. Obtaining a disproportionate number of unemployed or “not economically ac- tive” people in the sample will bias measures of aggregate time use, and conventional sampling weights, even if adjusted for nonresponse, will not correct this bias. Labour Force Survey 2001 Devey et al. (2006) contains an interesting figure, reproduced below as figure 1, that charts the number of people in South Africa classified as “informally employed.” The authors use data from October Household Surveys (OHS) and Labour Force Surveys (LFS) from 1997 to 2004. The most striking fea- ture of the data is the spike in February 2001, with a jump of almost 750,000 informal workers, fol- lowed by a fall of approximately the same number in September 2001. It is implausible that such a spike should be present in nationally representative data with a consistent survey instrument. 
The authors point out that in the February 2001 LFS, interviewers were offered additional incentives to interview in- formal workers in the households that they had visited. Although there is no established claim for cheating having taken place whereby interviewers fabri- cated data on the informal economy, it is nevertheless very suspicious to see the spike when finding infor- mal sector workers was incentivized, and an immediate and equivalent reversal when the incentive was 11 In StatsSA’s September 2000 Labour Force Survey (LFS), amongst prime working aged adults aged twenty-one to fifty- nine, the employment rates for men and women were 61.6% and 45.5%, respectively. This was a nationally representa- tive survey with a sample size of about 50,000 working-age adults. The World Bank Economic Review 135 Figure 1. Informal Employment (in Millions) from OHSs and LFSs 1997–2004 Source: Reproduced from Devey et al. (2006). removed. In addition, we cannot be certain as to whether the February 2001 data is incorrect or whether all of the other data is reflecting a systematic under-counting of informal employment or both. The ex- ample nonetheless highlights that data quality is potentially affected in a substantial way by fieldworker incentives. In summation, we have provided evidence of interviewer cheating in four substantial South African surveys and highlighted potential cheating in a fifth. These surveys have been both cross-sectional and longitudinal and span a time period from 1993 to 2011. The implementing agencies include local and in- ternational fieldwork companies, as well as organisations that employed and managed interviewers di- rectly. The cheating manifests in various forms, including outright fabrication of entire interviews, falsification of responses to particular questions and not following the sampling instructions. In some cases the cheating did not affect the overall legitimacy of the study, whereas in one case an entire compo- nent of the study needed to be abandoned. The research areas that have potentially been affected include time use, health, risk preferences, labor market status, poverty, inequality, and intergenerational mobil- ity. Thus the incidence of interviewer cheating is widespread and its effects are potentially far reaching. In the next section, we discuss how we attempted to identify possible cheating in wave 2 of NIDS. II. Interviewer Cheating in Wave 2 of the National Income Dynamics Study The second wave of NIDS is the main focus of this paper. NIDS is a nationally representative longitudinal study that collects data from respondents on many socio-economic topics including education, labor mar- ket participation, fertility, mortality, migration, income, expenditure, and anthropometric measures. The survey starts with a household roster that documents all people who are resident in the household at the time of the interview. From the information captured in the roster, respondents are classified as either chil- dren (aged 0–14) or adults (aged fifteen and above). Interviewers are instructed to then administer either a “child questionnaire” or an “adult questionnaire” for each household resident. In cases where the respon- dent refused or was not available, interviewers were asked to try to get a knowledgeable person in the household to fill in a “proxy questionnaire” on behalf of the respondent who could not be interviewed. 
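The questionnaire routing just described is simple enough to state as a rule. The sketch below is a minimal illustration in Python; the function and argument names are hypothetical, and NIDS itself does not publish such code.

```python
def questionnaire_type(age: int, available: bool) -> str:
    """Assign a NIDS instrument for a household resident, following the routing
    described in the text: children are aged 0-14, adults are 15 and above, and
    a proxy questionnaire is completed by a knowledgeable household member when
    the listed respondent refuses or is unavailable."""
    if not available:
        return "proxy"
    return "child" if age <= 14 else "adult"


# Example: a 12-year-old who is present gets the child questionnaire,
# while an absent 40-year-old is covered by a proxy questionnaire.
assert questionnaire_type(12, available=True) == "child"
assert questionnaire_type(40, available=False) == "proxy"
```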
136 Finn and Ranchhod The first wave, which took place in 2008, had a sample size of 7,301 households, and about 17,000 people completed the adult questionnaire. In that wave, interviewers used paper-based questionnaires and entered responses by hand. The completed questionnaires were then sent from the fieldwork com- pany to a data capturing company and, by the time the full dataset was received by the survey operations team, fieldwork had already been completed. The primary data quality control procedures thus occurred after the fieldwork had been completed in wave 1. The second wave of NIDS was conducted over 2010 and 2011. Interviewers used a Computer Assisted Personal Interview (CAPI) system, whereby interviewers filled in responses on a hand-held com- puter. Data from completed questionnaires were then uploaded to a server on a daily basis. One of the advantages of having data come in “live” was that a verification process was undertaken while the inter- viewers were still in the field. This allowed for corrections to be made as part of the ongoing fieldwork operations so that suspicious data could be verified or replaced rather than deleted. Our objective for the verification process was to create a measure that could rank interviewers by de- creasing levels of suspiciousness. Once interviewers were ranked, the respondents that they interviewed were called back to ascertain whether they had been interviewed or not and, if they had indeed been in- terviewed, whether the entire questionnaire had been completed. In creating the suspicion-based ranking of interviewers we considered using nine different methods. The central idea in using each of these possibilities was that interviewers who do cheat will do so either to save time or to earn more money or both. Interviewers could earn more money as they received a performance-based incentive for each completed individual and household questionnaire. This would re- sult in systematic differences in some dimensions of the data that were generated by cheating inter- viewers, when compared either to the data obtained from noncheating interviewers or to externally motivated benchmarks. The most successful of these nine methods were the use of Benford’s law and anthropometric compar- isons, which we discuss in detail below. Although the other seven methods were not particularly useful for diagnostic purposes, we document what did not work as it might still be useful to people running sur- veys in other contexts. Unsuccessful Methods of Detection Method 1: Number of Deaths between Waves One way to speed up the process of completing an entire household would be to falsely classify a house- hold member from wave 1 as deceased. This would allow a fieldworker to “complete” interviewing the household much faster. Alternatively, fieldworkers could falsely classify someone who had died between waves as being still alive and then fabricate the data. We compared the mortality rates of respondents by fieldworker, and did not observe any anomalies in the data. Method 2: Number of Refusals/not Available The method and thought behind using this metric is identical to that for using deaths above. Fieldworkers could save time by not interviewing everyone (for example, by fabricating refusals and nonresponses from respondents) or fabricate data for those who had, in fact, refused to be interviewed. We compared the response rates by fieldworker and did not observe any anomalies in the data. 
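Methods 1 and 2 both amount to comparing a per-fieldworker rate (deaths between waves, refusals or non-availability) against the overall sample rate. A minimal sketch of that comparison is given below; the file name, column names (fieldworker_id, reported_deceased, refused_or_unavailable), and the z-score cutoff are hypothetical choices for illustration, not the authors' actual screen.

```python
import pandas as pd

# Hypothetical input: one row per wave 2 individual outcome, with the
# fieldworker identifier and indicators for the outcomes of interest.
outcomes = pd.read_csv("wave2_individual_outcomes.csv")


def flag_outlier_rates(df, indicator, by="fieldworker_id", z_cutoff=3.0):
    """Flag fieldworkers whose rate on `indicator` is far from the overall rate,
    under the null that every fieldworker faces the same underlying rate."""
    overall = df[indicator].mean()
    grouped = df.groupby(by)[indicator].agg(rate="mean", n="size")
    # Standard error of a proportion for each fieldworker's caseload size.
    grouped["se"] = (overall * (1 - overall) / grouped["n"]) ** 0.5
    grouped["z"] = (grouped["rate"] - overall) / grouped["se"]
    return grouped[grouped["z"].abs() > z_cutoff].sort_values("z", ascending=False)


suspicious_deaths = flag_outlier_rates(outcomes, "reported_deceased")
suspicious_refusals = flag_outlier_rates(outcomes, "refused_or_unavailable")
print(suspicious_deaths)
print(suspicious_refusals)
```

In the NIDS case, as the text notes, neither screen surfaced any anomalies.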
Method 3: Fieldworkers who are Disproportionately likely to Activate Substantial Skip Patterns in the Survey Our thoughts here were that one could save considerable time in some sections by capturing certain re- sponses. As already discussed, this incentive is strongest in the labor market section. We abandoned this method as the levels of unemployment in South Africa are high, levels of labor force participation are low, and these are concentrated in certain neighbourhoods and regions (Leibbrandt et al. 2010). Since The World Bank Economic Review 137 fieldwork was coordinated geographically, fieldworkers could plausibly have genuinely encountered a pool of respondents with low levels of employment and labor force attachment in their allotted region. In addition, the unemployment rates and percentage that were not economically active, by fieldworker, yielded several fieldworkers with high values, so this was not a particularly useful tool for discriminating between suspicious and nonsuspicious fieldworkers. Method 4: Using Length of Interviews to Identify Fabrication If fieldworkers were fabricating data, we expected them to complete the surveys relatively quickly. The software we used had time stamps for both when the interview began and was completed, which theoret- ically allowed us to calculate the time per interview. We also expected that each adult interview would take between forty-five minutes and one hour to complete. Unfortunately, the time stamp for completion was activated manually, and several fieldworkers only did so at night prior to uploading the data to the server. This rendered this component of the investigation useless. Method 5: Using GPS Coordinates to Verify where the Interview Took Place Part of the survey captures the GPS coordinates of the household. This was required for all households in both wave 1 and wave 2. The coordinates were obtained by means of a handheld GPS device that was accurate to a radius of 100m. If interviews were being fabricated, we would expect to find differences be- tween the wave 1 and wave 2 coordinates. We encountered two problems with this method. First, there was considerable measurement error in wave 1, so not all differences could be attributed to wave 2 cheating. Second, fieldworkers were given the GPS coordinates from wave 1 to assist them in finding the households. Instead of entering the GPS readings from the GPS device in wave 2, a cheating fieldworker could simply re-enter the coordinates that they had received. Method 6: Comparing Wave 1 and Wave 2 Signatures Each completed questionnaire in each wave needed to be accompanied by a signed paper-based consent form. We considered comparing wave 1 and wave 2 signatures to identify discrepancies. We abandoned this approach very quickly, as signatures have some variability over time, and the method was far too la- bor intensive. Method 7: Low Rates of In-migration or Births between Waves If fieldworkers were fabricating entire households, then they would not be able to know about any new household members that entered between waves. They would then either systematically under-estimate the number of new members, or else have to fabricate these new members as well. We calculated the number of new members per household by fieldworker, but there were no clear patterns or anomalies. 
If cheating fieldworkers did indeed fabricate new members as well, or if cheating fieldworkers cheated only on some fraction of their households, then it would be much harder for this diagnostic to yield usable information.

Using Benford's Law

In contrast to the methods used above, the use of Benford's law as a ranking mechanism for suspicious interviewers proved to be very useful. Following a paper by Schäfer et al. (2004), we used Benford's law as the basis of a test of the distribution of the numerical data reported by each interviewer. Benford's law is an empirical law that was first described in Benford (1938). It describes the probability distribution of leading digits in tables of numerical data and asserts that the distribution is not uniform, as might be expected a priori, but rather follows a certain logarithmic probability distribution given by:

\[ \Pr(\text{leading digit} = d) = \log_{10}\left(1 + \frac{1}{d}\right), \qquad d = 1, 2, \ldots, 9 \]

This implies that the probability of a leading digit being 1 is about 30%, the probability of it being 2 is about 17.6%, with the corresponding probabilities of the subsequent leading digits decreasing monotonically until we find that the probability of the leading digit being 9 is approximately 4.6%. The probability distribution of leading digits is shown in table 1, below.

Table 1. Benford's Distribution of Leading Digits

Leading digit   1      2      3      4     5     6     7     8     9
Probability     30.1%  17.6%  12.5%  9.7%  7.9%  6.7%  5.8%  5.1%  4.6%

For a long time this phenomenon was viewed as not much more than a numerical curiosity. However, some practical implications began to emerge (Scott and Fasli 2001; Durtschi et al. 2004), and Benford's law has since been used to detect fraud in financial statements of companies (Carslaw 1988; Thomas 1989). More recently, it has been used in a wide variety of settings in the United States (Durtschi et al. 2004). Judge and Schechter (2009) used Benford's law to compare the data from the deleted and retained clusters in the KIDS example discussed earlier. The law has also been found to hold with a large number of other kinds of data, including the population of towns, the length of rivers, and the half-life of radioactive atoms.

The basic premise is this: If you have a relatively large dataset and you accept that Benford's law holds, then you can identify possible cheating by comparing the realized distribution of leading digits for each interviewer to the distribution of leading digits that would be expected if Benford's law holds. Hill (1995) provided the first theoretical basis for the law and showed that the law applies most accurately to stock market data, some accounting data and census statistics. The intuition underlying the proof is the following: Consider a variable which grows at some constant rate. Regardless of the initial value or the growth rate, the asymptotic distribution of leading digits of this variable (over time) will conform to Benford's law.12 Thus, a random sample of such variables at a moment in time will also conform to Benford's law.

More recently, Schäfer et al. (2004) and Schräpler (2011) argue that certain survey data also conform to Benford's law and use this for the express purpose of identifying cheating interviewers in the German Socio-economic Panel (SOEP). Schräpler (2011) summarizes three requirements that need to be satisfied in order for Benford's law to be a useful diagnostic for detecting fraud in survey data. First, the data should not have a built-in maximum value.
Second, there should be no externally assigned values in data. For example, the South African old age pension is a rand amount that is assigned to an individual, and this is an example of data that cannot be used in the diagnostic test. Finally, the distribution of the data should be positively skewed with a median that is lower than the mean. Of all the variables in wave 2 of NIDS, a number of those in the income and expenditure modules satisfied all of these criteria. Some of these are reported at the household level (for example, household nonfood expenditure), some are re- ported at the individual level (for example, monthly wages), and others are aggregated across 12 Informally, consider a variable that grows at a constant rate with an initial value of 1 (note that the initial value does not have to be 1, it is used here for illustrative purposes). As this number grows, more time will be spent between 1 and 2, than between 2 and 3, and more time will be spent between 2 and 3 than between 3 and 4. This amount of time with each successive leading digit decreases until the value reaches 10, in which case more time is spent between 10 and 20 than between 20 and 30, and so on. As the time period increases, the distribution of leading digits converges to Benford’s distribution. The World Bank Economic Review 139 respondents in the same household (for example, household wages). Variables that included any system- atic component, such as the value of government grants, were not valid candidates for inclusion. Figure 2, below, plots the distribution of leading digits of the variable reflecting the total amount of labor market income received by respondents in the 4,705 households with positive wages in the wave 1 data. This distribution, shown by the bars, is plotted together with the logarithmic distribution that we expect to observe, assuming that Benford’s law holds. By comparing the two distributions, it seems that the leading digits of this variable fit the logarithmic distribution very well. The observed proportions of each leading digit are very close to the proportions that we expect to observe ex ante, and fall with each successive digit, except for eight which is slightly higher than expected.13 Figure 2. Observed and Expected Leading Digits—Wages in Wave 1 Source: Own calculations using NIDS Wave 1 2008. Given that the data seem to follow this logarithmic distribution for some of the monetary variables, we next sorted the wave 2 observations by interviewer and considered the conditional leading digits of total household income as recorded by the interviewer. We ranked how far each interviewer’s distribu- tion of leading digits was from the logarithmic distribution by computing Chi-squared statistics. Ranking interviewers by this method yielded positive results for the detection of cheating. Of the inter- viewers with the five highest Chi-squared values, four were subsequently found to have fabricated entire questionnaires. The top ten Chi-squared rankings are shown in table 2, on the following page. The cheating interviewers are highlighted in bold.14 The fact that six of the top ten most suspicious interviewers were identified using Benford’s law sug- gests that using this method is appropriate for survey data of this nature. Nonetheless, some cheating in- terviewers may have left the monetary variables blank, or set them to missing. In this case we would have the unique problem of missing fake data, which also happens to be fake missing data. 
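A minimal sketch of the ranking procedure described above is given below, assuming a household-level file with an interviewer identifier and a positive monetary variable; the file and column names are hypothetical. In practice one would also restrict attention to interviewers with a minimum number of interviews (the published ranking in table A3 uses at least forty households) before interpreting the chi-squared values.

```python
import numpy as np
import pandas as pd

# Benford's expected probabilities for leading digits 1-9.
digits = np.arange(1, 10)
benford_p = np.log10(1 + 1 / digits)


def leading_digit(values):
    """First significant digit of each strictly positive value."""
    v = np.asarray(values, dtype=float)
    v = v[v > 0]
    exponent = np.floor(np.log10(v))
    return (v / 10 ** exponent).astype(int)


def benford_chi2(values):
    """Chi-squared distance between observed leading digits and Benford's law."""
    d = leading_digit(values)
    if d.size == 0:
        return np.nan, 0
    observed = np.array([(d == k).sum() for k in digits])
    expected = benford_p * observed.sum()
    return ((observed - expected) ** 2 / expected).sum(), int(observed.sum())


# Hypothetical input: household income with an interviewer identifier.
hh = pd.read_csv("wave2_household_income.csv")  # columns: interviewer_id, hh_income
ranking = (
    hh.groupby("interviewer_id")["hh_income"]
      .apply(lambda s: pd.Series(benford_chi2(s), index=["chi2", "n"]))
      .unstack()
      .sort_values("chi2", ascending=False)
)
print(ranking.head(10))
```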
One of the ways of overcoming this problem is to use repeated observations of the same respondents over time to pick up possible fabrication. We thus exploited the longitudinal dimension of the data and evaluated variables that are difficult to fabricate convincingly in a panel study—namely, height and weight.

13 See figure A1 in appendix 6 for a comparison of observed versus expected distributions of leading digits for some other wave 1 and wave 2 monetary variables.
14 The full table for all interviewers who collected data from at least forty households can be found in table A3.

Table 2. Most Suspicious Interviewers by Chi-squared Ranking

Ranking   Interviewer   Chi-squared (no. of interviews)
1         A             39.7 (80)
2         B             31.2 (49)
3         OK            28.0 (66)
4         C             27.3 (74)
5         D             27.1 (100)
6         E             24.7 (42)
7         OK            21.7 (64)
8         F             21.4 (73)
9         OK            21.2 (44)
10        OK            19.8 (67)

Interviewers who were found to have fabricated data are identified by letter and are in bold typeface. Interviewers who were suspicious but did not fabricate data are denoted "OK" and are not in bold typeface. Number of interviews refers to the total number of household interviews submitted by the interviewer.
Source: Own calculations using pre-public release NIDS Wave 2 data, 2010.

Anthropometric Measures

One of the advantages of a longitudinal dataset is that prior or future waves may be used to calibrate data quality in other waves. Slow-moving variables such as anthropometric measures are good candidates for this exercise. The first two waves of NIDS included modules in which respondents were weighed and measured for height. Weights were obtained using digital scales that were accurate to 0.1kg, while heights were obtained using a portable stadiometer. These data were not prepopulated into the CAPI system, making it almost impossible for interviewers to systematically fabricate values that were consistent with wave 1 measures for respondents that they had not seen.

Various diagnostic measures were used to rank interviewers according to their likelihood of having cheated. These were:15

• Mean adult body mass index (BMI), by interviewer. Our thinking was that interviewers who were fabricating data might not be aware of the extent to which the height and weight of respondents are correlated. This would result in "abnormal" BMI measures. We considered interviewers who generated exceptionally high or exceptionally low average BMI values as potentially suspicious.
• The mean growth in the height of adult respondents between waves, by interviewer. We expected that the heights of adults would be stable over a two year span, on average. If the mean growth in height for a particular interviewer differed substantially from zero in either direction then we interpreted this as an indication of possible cheating.
• The mean BMI growth from wave 1 to wave 2, by interviewer. If this differed substantially from zero then this was a sign of possible fabrication.
• Spikeplots of the weight distribution, by interviewer. Given that the scales were digital and that no interviewer had interviewed hundreds of respondents, we expected to obtain a uniform distribution of weights. Visual inspection of the spikeplots allowed us to relatively quickly identify suspicious patterns such as "heaping" at natural reference numbers.

The diagnostics above were restricted to adults only, where adults were classified as respondents twenty years old and above.
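A minimal sketch of these per-interviewer screens is given below, assuming a file with wave 1 and wave 2 height and weight for adults matched to the wave 2 interviewer. The file name, column names, and flagging thresholds are illustrative assumptions, not the values used in the actual investigation.

```python
import pandas as pd

# Hypothetical input: one row per adult measured in both waves, with the
# wave 2 interviewer identifier; weights in kg and heights in metres.
adults = pd.read_csv("anthropometrics_w1_w2.csv")
# columns: interviewer_id, weight_w1, height_w1, weight_w2, height_w2

adults["bmi_w1"] = adults["weight_w1"] / adults["height_w1"] ** 2
adults["bmi_w2"] = adults["weight_w2"] / adults["height_w2"] ** 2
adults["bmi_growth"] = 100 * (adults["bmi_w2"] / adults["bmi_w1"] - 1)
adults["height_growth"] = 100 * (adults["height_w2"] / adults["height_w1"] - 1)

by_int = adults.groupby("interviewer_id").agg(
    n=("bmi_w2", "size"),
    mean_bmi=("bmi_w2", "mean"),
    mean_bmi_growth=("bmi_growth", "mean"),
    mean_height_growth=("height_growth", "mean"),
)

# Screens in the spirit of the text: extreme mean BMI, and mean growth in BMI
# or height that is far from zero. The cutoffs below are illustrative only.
flags = pd.DataFrame({
    "bmi_flag": (by_int["mean_bmi"] > 40) | (by_int["mean_bmi"] < 22),
    "bmi_growth_flag": by_int["mean_bmi_growth"].abs() > 20,
    "height_growth_flag": by_int["mean_height_growth"].abs() > 3,
})
suspicion_score = flags.sum(axis=1).sort_values(ascending=False)
print(suspicion_score.head(12))


def weight_heaping(df, interviewer):
    """Counts of repeated recorded weights for one interviewer, a tabular
    stand-in for the visual spikeplot inspection described in the text."""
    weights = df.loc[df["interviewer_id"] == interviewer, "weight_w2"].round(1)
    return weights.value_counts().head(10)
```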
Running the diagnostics on children would have presented a problem as the height and weight variables for children are more volatile, even in a two- to three-year period.

15 BMI is calculated by dividing a person's mass (in kilograms) by the square of their height (in meters). A BMI above twenty-five is considered to be overweight by the medical profession.

Table 3, below, shows the list of suspicious interviewers generated using the mean adult BMI in the wave 2 cross-section. As before, interviewers who were found to have cheated are highlighted in bold. Interviewer E, who interviewed ninety-seven adults, had the highest mean BMI of 55.3. The highest mean BMI measure that we verified ex post was 43.3 but was based on only twenty respondents. At the other end of the distribution, interviewer H returned a very low mean BMI of 21.7, much lower than the overall average of 28.6 from 9,821 adults.

Table 3. Suspicious Interviewers and Mean Adult BMI

Interviewer   N       Mean BMI
E             97      55.3
F             24      49.6
G             49      48.9
B             104     44.7
OK            20      43.4
OK            33      38.8
H             156     21.7
OK            62      21.5
Total         9,821   28.6

Interviewers who were found to have fabricated data are identified by letter, and are in bold typeface. Interviewers who were suspicious but did not fabricate data are denoted "OK" and are not in bold typeface. N is the number of completed questionnaires submitted containing adult BMI data.
Source: Own calculations using pre-public release NIDS Wave 2 data, 2010.

Many of the same interviewers also appeared to be suspicious when BMI growth, rather than the mean of BMI, was used for identifying cheating. As shown in table 4, six of the twelve most suspicious interviewers were subsequently found to have fabricated either part of an interview or the whole interview, for at least one of their interviews. The mean percentage change in BMI for the entire adult sample was 9%. Interviewer G, who only interviewed thirty-two adults, returned a BMI growth rate of 173%, followed by interviewers E and B with 109% and 99%, respectively. At the other end of the distribution, H's 117 respondents showed a decrease in BMI of 19%, on average.

Table 4. Suspicious Interviewers and Mean Change in Adult BMI

Interviewer   N       Mean % change
G             32      173
E             67      109
B             75      99
OK            38      33
I             83      31
OK            89      25
OK            35      23
OK            14      20
J             80      20
OK            44      −7
OK            40      −15
H             117     −19
Total         5,560   9

Interviewers who were found to have fabricated data are identified by letter, and are in bold typeface. Interviewers who were suspicious but did not fabricate data are denoted "OK" and are not in bold typeface. N is the number of completed questionnaires submitted containing adult BMI data.
Source: Own calculations using pre-public release NIDS Wave 2 data, 2010.

We present the suspicious list obtained by using mean adult height growth between waves in table 5, below. Interviewers who were subsequently found to have fabricated data are shown in bold once again. Of the 5,710 adults for whom valid data were recorded in both waves, the mean change in height was a rise of 0.11%. Interviewers H and A recorded the largest mean growth rates of around 5%. The four most suspicious interviewers at the other end of the distribution, interviewers E, I, B, and G, recorded very large negative growth in height ranging from −4.95% to −14.70%.
Table 5. Suspicious Interviewers and Mean Change in Adult Height

Interviewer   N       Mean % change
H             118     5.21
A             120     4.86
OK            35      4.81
OK            41      4.72
OK            45      4.62
E             68      −4.95
I             83      −5.38
B             75      −7.14
G             32      −14.70
Total         5,710   0.11

Interviewers who were found to have fabricated data are identified by letter, and are in bold typeface. Interviewers who were suspicious but did not fabricate data are denoted "OK" and are not in bold typeface. N is the number of completed questionnaires submitted containing adult height data.
Source: Own calculations using pre-public release NIDS Wave 2 data, 2010.

The final anthropometric method used to detect suspicious interviewers was a visual inspection of spike-plots of the weight distributions. This allowed us to quickly observe heaping at focal points. Moreover, one weakness of the three methods used above was that they would only diagnose cheating if the proportion of interviews that were faked was "substantial" enough to affect the mean.16 This method was not dependent on the mean of the weight distribution obtained by the interviewer.17 It could thus be informative even in cases where an interviewer had cheated on only a small fraction of surveys.

We provide two spikeplots as illustrative examples of recorded weights for adults, with values restricted between 48 and 100. The upper panel of figure 3 shows the spike-plot of the weight distribution of a nonsuspicious interviewer. This interviewer interviewed thirty-nine adults (with weights between 48kg and 100kg) and recorded two weight values for each of them, hence the uniformity at a frequency of two on the y-axis. Only two adults out of the thirty-nine had the same weight: 52kg and 81.1kg. Contrast this to the spike-plot of a suspicious interviewer, shown in the lower panel of figure 3. This interviewer interviewed 125 adults and there is significant spiking at 65kg, 70kg, 75kg, 80kg, and 85kg, with relatively few of the other observations having different values. Note that the y-axis goes up to 44, compared with the upper panel where it only goes up to 4. This distribution immediately raised suspicions that the interviewer had made up the anthropometric data at best, and had fabricated the entire interview at worst.

16 We also considered interviewers who captured anthropometric data that consisted of a "high" number of outliers. This did not prove to be effective. Almost all interviewers had some outliers, which could arise because the respondent really did have an exceptional height or weight, or due to measurement error in either wave 1 or wave 2.
17 An additional approach that we used was to sort interviewers by the 10th, 25th, 50th, 75th, and 90th percentiles of their obtained distributions of height, weight, BMI, and the growth of these variables. These did not yield substantially new insights beyond those obtained from the methods already described and employed.

Figure 3. Spike-Plot of Weight Distributions
Source: Own calculations using pre-public release NIDS Wave 2 data, 2010.

Bringing the three anthropometric measures together allowed us to create a crude index of suspicion. Interviewers were assigned scores of zero to three, depending on how many times they were flagged as suspicious in each of the diagnostic methods. Table 6 shows the top twelve most suspicious interviewers according to the three anthropometric diagnostics. Every interviewer who scored three out of three was later found to have fabricated data.
Table 6. Combined Anthropometric Suspicion Index

Interviewer   BMI   BMI Growth   Height Growth   Row Total
H             1     1            1               3
E             1     1            1               3
B             1     1            1               3
G             1     1            1               3
I                   1            1               2
OK            (flagged on two of the three diagnostics)   2
F             1                                  1
A                                1               1
J                   1                            1
OK            (flagged on one diagnostic)        1
OK            (flagged on one diagnostic)        1
OK            (flagged on one diagnostic)        1
Col. Total    7     8            7               21

Interviewers who were found to have fabricated data are identified by letter, and are in bold typeface. Interviewers who were suspicious but did not fabricate data are denoted "OK" and are not in bold typeface.
Source: Own calculations using pre-public release NIDS Wave 2 data, 2010.

Of the interviewers who were flagged as suspicious using the anthropometric measures, interviewers A and B were also flagged as the two most suspicious interviewers using the Benford's law method. Constructing a joint index of the Benford scores and the anthropometric scores did not prove fruitful, as some of the highly suspicious interviewers according to the Benford method did not provide much anthropometric data—they tended to record refusals in this section—while, in general, the suspicious interviewers according to anthropometrics failed to provide many data points for income and wage variables.

An important issue to consider is that data in wave 1 could have been fabricated as well. If interviews (or parts of interviews) were faked in the first wave of data, this could feed through and make large changes in anthropometric data look suspicious. This would be problematic as the error is entering in the first period, rather than in the second period, which was the focus of our investigation. On average, however, even if the first wave contains fabricated data, this should be diluted as different interviewers interviewed different respondents in both waves. The probability that the majority of a wave 2 interviewer's valid interviews are combined with mostly fake wave 1 data is small but nontrivial, given the spatial logistics under which fieldwork was conducted in both waves. Our assumption, therefore, is not that the wave 1 data are perfect but rather that there is no perfect overlap between respondents who were interviewed by cheating interviewers in waves 1 and 2.

One implication of the possibility that we are identifying wave 1 cheating instead of wave 2 cheating is that we needed to be more cautious about our conclusions. The diagnostics that we employed are inherently probabilistic, not deterministic, and this lack of determinism is exacerbated by any wave 1 cheating that occurred. This provided an important motivation for the second stage of our auditing process, which we discuss below.

The operational response from NIDS made use of a meta-list of suspicious interviewers that was drawn up using a combination of the Benford's law rankings and the anthropometrics rankings. The NIDS operations centre initiated an intensive set of telephonic callbacks in order to verify whether or not the interviews of suspicious interviewers had actually been conducted. Priority was given to calling back respondents who were interviewed by the most suspicious interviewers, and the NIDS team worked down the list systematically, calling every household for which data had been collected by that interviewer, until there was a high level of confidence about the veracity of the data. Where fraud was evident, all data from those interviews were rejected at the expense of the company conducting the fieldwork. This was true regardless of whether there was partial or total fabrication of the relevant interview.
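One way to operationalize such a meta-list is sketched below. It prioritizes an interviewer by the better (smaller) of the two ranks rather than summing scores, in line with the observation above that a summed joint index was not fruitful; the file and column names are hypothetical, and the actual NIDS prioritization rule may have differed.

```python
import pandas as pd

# Hypothetical inputs: per-interviewer Benford chi-squared values, per-interviewer
# anthropometric suspicion scores, and a household file linking households,
# interviewers, and contact numbers for the callback team.
benford = pd.read_csv("benford_ranking.csv")      # columns: interviewer_id, chi2
anthro = pd.read_csv("anthro_scores.csv")         # columns: interviewer_id, anthro_score
households = pd.read_csv("wave2_households.csv")  # columns: hh_id, interviewer_id, phone

meta = benford.merge(anthro, on="interviewer_id", how="outer").fillna(0)
# Rank each diagnostic separately; a strong flag from either method is enough
# to move that interviewer's households to the front of the callback queue.
meta["benford_rank"] = meta["chi2"].rank(ascending=False)
meta["anthro_rank"] = meta["anthro_score"].rank(ascending=False)
meta["priority"] = meta[["benford_rank", "anthro_rank"]].min(axis=1)

callback_queue = (
    households.merge(meta[["interviewer_id", "priority"]], on="interviewer_id")
              .sort_values("priority")
)
print(callback_queue[["hh_id", "interviewer_id", "phone", "priority"]].head(20))
```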
Figure A2 in the appendix shows the list of verification questions asked of respondents whose data were submitted by suspicious interviewers. Totally fabricated interviews were identified by question 1 (interviewer submitted data but respondent was never interviewed), while partial fabrication was identi- fied by questions 2 to 6 in this verification questionnaire. In the former case, the entire interview was reconducted and submitted, while in the latter case only the relevant missing modules of the question- naire were administered. In summation, of the 781 households on the suspicion list that were successfully contacted, over 70% had substantial data quality concerns that were driven entirely by interviewer cheating. This represents The World Bank Economic Review 145 7.3% of the wave 2 households at the time that the verification process started.18 Our information about the interviewers themselves is limited to their names and the interview teams in which they operated. While it would have been useful to compare the characteristics of cheating interviewers to noncheating interviewers for policy purposes, their details were not captured by the company that implemented the fieldwork. III. Implications for Analysis By how much would the presence of the fabricated data have affected our estimates had the cheating in- terviewers not been discovered? Assuming that some fabrication is probably present in most surveys and, simultaneously, that most interviewers are probably honest, should we be wary of most empirical results? Alternatively, does the measurement error caused by interviewer cheating have relatively small effects on our estimates, such that, for practical purposes, we may ignore its implications with respect to research findings? In addition to the resources invested in the production of data, considerable time, en- ergy and resources are invested by users of these data, and research findings subsequently feed into im- portant policy making discussions and debates. Measuring the effects of interviewer cheating on the validity of empirical findings is the objective of this section of the paper. From an econometric perspective, data fabrication leads to measurement error for potentially all of the variables in some subset of the data. A priori, we cannot make a general prediction about the effects of data fabrication on subsequent estimates, as the effects, if any, will depend on multiple factors. These factors include the fraction of the overall dataset that is fabricated, the difference between the fabricated data and the true data that it represents, the type of estimator being implemented, whether the fabrica- tion results in classical19 or nonclassical20 measurement error in the variable or variables that are being used, and the magnitudes of such measurement error. Moreover, if one is using a multivariate estimator, the empirical effects will depend on the relationship between the variables being used in the fabricated dataset, relative to the true relationship between those variables. Any theoretical predictions thus need to be restricted by quite a specific set of criteria. Nonetheless, there are some well known and fairly general effects that measurement error in a regres- sor will have in a regression analysis. First, measurement error in an independent variable will result in a violation of the orthogonality condition. This will induce biased estimates of the b vector obtained from an OLS regression (Wooldridge 2002). 
In the case of classical measurement error, this will result in an attenuation bias, that is, a bias of the estimated coefficient towards zero. Second, some common estimators, such as fixed effects estimators and first difference estimators, are more sensitive to a particular endogeneity problem than a standard OLS estimator (Griliches and Hausman 1986). In addition, Schnell (1991) and Schräpler and Wagner (2005) find that univariate statistics such as means, medians and variances are generally robust to the presence of fake data where the prevalence of fake data is less than 5%. However, the negative effects of fake data begin to compound as analysis moves to a multivariate setting, particularly when some of the commonly used panel data estimators are used (Schnell 1991; Schräpler and Wagner 2005). We should mention again that our focus in this study is data that were totally fabricated by interviewers, and not questionnaires that were partially correct and partially fake. As such, our findings should be interpreted as lower bounds of the problem of cheating in this dataset.

In order to provide an illustrative example of the effects of cheating in our dataset, we chose to investigate the broad theme of understanding the effects of finding employment on health, as measured by BMI. We chose this area for investigation for two reasons. First, we have spent some time documenting the fabrication that took place in the labor market module as well as the height and weight measurement module. Looking at the effect of finding employment on a measure of well-being such as BMI complements our section on detecting fabrication. Second, the determinants of BMI as well as the effect of employment on BMI are topics that have received a great deal of attention in the recent South African literature (Wittenberg 2013, 2009; Ardington and Gasealahwe 2012; Ardington and Case 2009) and our study provides an important addition to these.21 We then needed to choose a set of variables and a set of estimators for our analysis. For the variables, we include BMI, age, education and labor market status. Following the theoretical discussion in the preceding paragraphs, we calculate the mean BMI and the labor market transition rates, and finally we fit OLS and First Difference regression models of BMI on age, education and employment.22

18 Finn and Ranchhod (2013) contains a more detailed breakdown of the verification process and outcomes.
19 Assume that the true model is y* = X*β + ε, but that we measure X = X* + u and y = y* + v. Under the conditions of classical measurement error, u and v are i.i.d. and uncorrelated with X*, y* and ε, and the estimated β coefficients are biased in the direction of zero. See Bound et al. (2001) for a comprehensive overview of the literature on bias due to measurement error.
20 Of course, the misclassification of a categorical variable such as labor market status or a dummy variable such as employed/unemployed cannot be thought of in the same way as classical measurement error, as the error itself cannot be mean zero. In fact, for dummy variables, the measurement error must be negatively correlated with the true value of the variable. The case for measurement error in categorical variables is not as straightforward, but a thorough treatment is beyond the scope of this paper. See Krueger and Summers (1988) for a discussion of the results of measurement error in categorical regressions.
21 We also modeled the effect of receiving the state old age pension on labor force participation using the Dirty and Clean datasets. The results of these analyses are available from the authors on request.
22 We collapse the four labor market states into a binary employed variable for the regressions. We did this as it made more sense theoretically and it made the discussion of the regression results simpler. We also performed the estimations with the labor market states disaggregated, and the overall findings do not change substantially (not reported).
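The greater sensitivity of differenced estimators can be made concrete with a small simulation. The sketch below is purely illustrative and is not based on the NIDS data: it uses classical measurement error rather than fabrication, with a persistent regressor observed with error in two waves. Differencing removes most of the true variation in the regressor but none of the noise, so the first-difference slope is attenuated far more than the cross-sectional OLS slope, in line with Griliches and Hausman (1986).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
beta, rho, sd_u = 0.5, 0.9, 0.5     # true slope, persistence of x, s.d. of measurement error

# Persistent true regressor over two waves, plus an individual fixed effect
x1 = rng.normal(0, 1, n)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(0, 1, n)
a = rng.normal(0, 1, n)
y1 = beta * x1 + a + rng.normal(0, 1, n)
y2 = beta * x2 + a + rng.normal(0, 1, n)

# The observed regressor contains classical measurement error in both waves
x1_obs = x1 + rng.normal(0, sd_u, n)
x2_obs = x2 + rng.normal(0, sd_u, n)

def slope(x, y):
    """Bivariate OLS slope of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print("cross-sectional OLS:", round(slope(x2_obs, y2), 3))                 # roughly beta/(1 + sd_u**2) = 0.40
print("first differences:  ", round(slope(x2_obs - x1_obs, y2 - y1), 3))   # roughly 0.14: much stronger attenuation
```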
Data

To implement the analysis, we constructed two datasets. The first, which we refer to as the "Dirty" dataset, is a combination of the "Always Correct" data and the "Fake" data as they stood at the time that our verification process was completed. Essentially, it represents what the NIDS wave 2 dataset would have been if the cheating had gone undetected and the survey had been completed at the date that our verification process drew to a close. The second dataset, which we refer to as the "Clean" dataset, is composed of the same "Always Correct" data, combined with the subsequently corrected data where such correction was possible.

The variables that we use are all at the individual level. They are:
• BMI - This is calculated as a person's mass in kilograms divided by height in metres squared. Since each respondent had either two or three measures of height and weight, we used the average of all recorded measures. There was no prepopulation of this variable in wave 2.
• Age - This was measured in integer years. The variable triggered a data confirmation question for interviewers if the respondent had aged by less than one year or by more than two years between wave 1 and wave 2. Interviewers had access to the wave 1 roster, hence even fabricated surveys would likely have appropriate data in wave 2.
• Years of education - This variable is bounded between 0 and 15. The wave 1 information on education was also given to interviewers. Moreover, if the education levels had increased by more than two years, or had decreased between wave 1 and wave 2, the software would ask for confirmation from the interviewers. Thus, we expect to have only a small amount of measurement error on this variable in the fabricated data.
• Male - This is an indicator variable that captures the sex of the respondent. It was prepopulated based on the wave 1 dataset.

The labor market status variables are comprised of four mutually exclusive indicator variables,23 which represent the labor market state of respondents. These are all derived from the labor market section of the survey. These variables were not prepopulated. They are:
• Employed - This is an indicator variable that takes on a value of one if the respondent had any form of employment at the time of the interview.
• Unemployed (searching) - This is an indicator variable that takes on a value of one if the respondent was not employed but was actively looking for work in the month prior to the interview.
• Unemployed (discouraged) - This is an indicator variable that takes on a value of one if the respondent was not employed and was not actively looking for work in the month prior to the interview, but stated that he/she would like to have a job. The difference between the searching and nonsearching unemployed conforms to the standard ILO definitions for these categories (International Labour Office 2011).
• Not economically active - This is an indicator variable that takes on a value of one if the respondent was not employed, was not actively looking for work in the month prior to the interview and stated that they would not accept a job offer.

There are a few additional data issues that require elaboration. First, for our entire analysis, we restrict our estimation sample to include only the adult African sub-population aged eighteen to sixty-five.24 Second, we restrict the sample to include BMI values of less than 50, as we were concerned that some of the extremely high BMI values were due to the scales being inadvertently set to pounds instead of kilograms. In addition, we exclude from our samples any observation with any covariate missing, as such observations would not survive into our regression analyses. Third, we do not make use of either the sampling weights or attrition-corrected weights in any of the subsequent analysis. Our objective is not to replicate population level analyses, but merely to compare the differences between estimates obtained from the "Dirty" and the "Clean" datasets. Moreover, we would have had to recalculate all of the post-stratification weights, as the datasets that we use do not represent the full sample due to the time at which we completed our audit. Fourth, in our regressions we re-weight the subsequently corrected data by the inverse of the ratio of the number of corrected fakes to the number of fakes. We do this because we want the weighted fraction of data from the "Always Correct" data to be the same in the Dirty and Clean datasets. The implicit assumption here is that the group of corrected fakes is representative of the group of fakes that we were unable to subsequently re-interview.25

The sample sizes, and how they are affected by our restrictions, are displayed in table 7 below. We observe that the BMI cutoff of 50 is not too onerous. We lose 106 and 84 observations from the Dirty and Clean datasets respectively. This represents less than two percent of either sample, and a substantial fraction of these are observations from the "Always Correct" subset of the data. Our final sample sizes for the OLS and First Differences analyses are 6,768 and 5,388 observations for the Dirty dataset, and 6,576 and 5,263 for the Clean dataset, respectively. The sample sizes for the First Differences regressions are substantially smaller for two reasons. First, new household members would not have been interviewed in wave 1. Second, any missing data in any covariate in wave 1 would have resulted in that observation being dropped from the First Differences sample as well.

23 There are some cases where the questions used to derive a person's labor market status were not answered. In these cases, we cannot define their status. Otherwise, the four variables would be mutually exclusive and exhaustive.
24 This is because we have small sample sizes for the other race groups, especially when using the balanced panel members from wave 1 and wave 2. Wittenberg (2013) applies a similar restriction to the NIDS data.
25 We test the assumption that the corrected fakes are representative of all fakes by testing for the equality of means between the corrected fakes and the uncorrected fakes, from the corresponding wave 1 data, for the variables used in the regression section of the paper. We find that we are not able to reject the hypothesis of equal means for any of these variables at the 10% level of significance.
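As a rough sketch of how these variables and restrictions fit together, the snippet below assumes a hypothetical individual-level dataframe with columns such as weight_kg_1, weight_kg_2, height_m_1, height_m_2, race, age, education, male, empl_status and a corrected_fake flag. The column names and the helper are illustrative only; they do not reproduce the actual NIDS processing code.

```python
import pandas as pd

def prepare_sample(df, n_fakes, n_corrected_fakes):
    """Construct the analysis variables and apply the sample restrictions described above."""
    out = df.copy()

    # BMI: average of the repeated weight and height measurements
    weight = out[["weight_kg_1", "weight_kg_2"]].mean(axis=1)
    height = out[["height_m_1", "height_m_2"]].mean(axis=1)
    out["bmi"] = weight / height**2

    # Binary employment indicator collapsed from the four labor market states
    out["employed"] = (out["empl_status"] == "employed").astype(int)

    # Restrictions: African adults aged 18 to 65, BMI below 50, complete cases only
    out = out[(out["race"] == "african") & out["age"].between(18, 65)]
    out = out[out["bmi"] < 50]
    out = out.dropna(subset=["bmi", "age", "education", "male", "employed"])

    # Re-weight the corrected fakes by the inverse of the ratio of corrected fakes to fakes,
    # so that the weighted "Always Correct" share is the same in the Dirty and Clean datasets
    out["w"] = 1.0
    out.loc[out["corrected_fake"] == 1, "w"] = n_fakes / n_corrected_fakes
    return out

# Example call (hypothetical input frame), using the counts reported in table 8:
# clean = prepare_sample(clean_raw, n_fakes=449, n_corrected_fakes=257)
```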
Table 7. Sample Sizes (number of observations)

Sample                 Dirty    Clean
All                    6,874    6,660
BMI < 50               6,768    6,576
Cross-sectional OLS    6,768    6,576
First differenced      5,388    5,263

Sample restricted to African adults aged 18 to 65 in wave 2.
Source: Own calculations using pre-public release NIDS Wave 2 data, 2010.

Summary Statistics

The means of the variables in each of the sub-datasets, as well as the Clean and Dirty datasets, are provided in table 8 below. Note that the mean of a variable in the Dirty dataset will be a weighted average of the corresponding means in the Fake dataset and the Always Correct dataset, with the weight being determined by the proportion of the data in the Dirty dataset that originates from the Fake and Always Correct datasets, respectively. Similarly, the mean in the Clean dataset will be a function of the means in the Corrected Fake and Always Correct datasets. Any differences in the means between the Clean and Dirty datasets must therefore reflect differences in the means between the Fake and Corrected Fake datasets, combined with the differences in their respective sample sizes.

Table 8. Means of Variables Used in Analysis

                           Fakes    Corrected Fakes   Always Correct   Dirty Dataset   Clean Dataset
BMI                        26.73    25.76             26.94            26.92           26.89
Age                        35.45    37.08             36.16            36.11           36.20
Education (years)          6.74     7.47              8.12             8.03            8.09
Employed                   19.82%   28.40%            33.99%           33.05%          33.77%
Unemployed (searching)     5.12%    15.56%            10.98%           10.59%          11.16%
Unemployed (discouraged)   0.69%    3.50%             5.93%            5.59%           5.84%
Not economically active    73.27%   52.53%            48.60%           50.24%          48.75%
Male                       40.98%   39.69%            40.26%           40.31%          40.24%
Number                     449      257               6,319            6,768           6,576

Samples are restricted to Africans aged 18 to 65 in wave 2, with BMI values less than 50. The number of fakes does not equal the number of corrected fakes because not all faked respondents were successfully re-interviewed. The means in the Dirty and Clean datasets do not precisely correspond to the weighted means obtained from the first three columns due to rounding effects.
Source: Own calculations using pre-public release and public release NIDS Wave 2 data, 2010.

For the BMI, age, male, and years of education variables that we use, the difference in means between the Fake and Corrected Fake datasets is relatively small. Thus, in the aggregated Dirty and Clean datasets, the difference in means for these variables is very small, at less than 0.15 units in each case. For some of the other variables, such as Unemployed (discouraged), the difference in means between the Fakes and Corrected Fakes is somewhat larger, at 2.81 percentage points, but the aggregate difference in means for these variables remains relatively small. This is because the relative weightings contributed by Fakes and Corrected Fakes to the means in the Dirty and Clean datasets are also small.

In contrast, for the Employed, Unemployed (searching), and Not economically active variables, the difference in means between the Fakes and Corrected Fakes is substantial. Even these differences, however, get substantially moderated when we calculate the means of the Dirty and the Clean datasets. For example, let us consider the percentage that are employed in the Fakes and Corrected Fakes datasets. The difference in the means is large, at 8.58 percentage points. Nonetheless, the weights that these contribute to the Dirty and Clean datasets are relatively small, at 6.6 and 3.9 percent. Thus the aggregate difference in mean percentage employed between the Dirty and Clean datasets is only 0.72 percentage points. This is substantially smaller than the corresponding difference in means between the Fakes and Corrected Fakes datasets and, depending on one's interest, may or may not be considered to be "substantial." Overall then, we confirm the finding by Schnell (1991) that a univariate statistic such as the mean is generally robust to the presence of a small amount of fake data.26

26 An earlier version of this paper (see Finn and Ranchhod 2013) presented labor market transition matrices for the Dirty and Clean datasets as an additional exposition of the effects of the presence of data fabrication.
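The moderation of these differences is simply the arithmetic of pooling. As a quick check, the following lines recompute the Dirty and Clean employment rates from the sub-dataset means and sample sizes reported in table 8 (small discrepancies reflect rounding in the table).

```python
# Employment rates and sample sizes taken from table 8
n_fake, n_corr, n_always = 449, 257, 6319
emp_fake, emp_corr, emp_always = 0.1982, 0.2840, 0.3399

emp_dirty = (n_fake * emp_fake + n_always * emp_always) / (n_fake + n_always)
emp_clean = (n_corr * emp_corr + n_always * emp_always) / (n_corr + n_always)
print(round(emp_dirty, 4), round(emp_clean, 4))   # approximately 0.3305 and 0.3377
```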
Regression Results

Our final set of analyses involves estimating the regression coefficient of employment on BMI. We first present the cross-sectional regression results using wave 2 data and compare the coefficients from the Dirty and Clean datasets. One weakness of this approach, if we think that BMI is a proxy for health, is that we are likely to have a selection problem, since healthier people will probably be more likely to find employment. A natural extension would be to estimate a fixed effects model of employment on BMI. We thus also present the First Differences (FD) form of the regression.

We do not have strong priors regarding the differences between the Dirty and Clean datasets based on econometric theory. The measurement error in the employment dummy cannot be classical measurement error as the variable is a binary variable (Aigner 1973). Moreover, in the FD model, the error has a particular distribution that is not symmetric. Nonetheless, we do expect that we have a measurement error problem and that this will cause an endogeneity problem. We also know that the FD estimator is more sensitive to measurement error than the OLS estimator, so we might expect that the presence of fabricated data will have a stronger effect on the FD coefficients than the OLS coefficients (Hausman 2001).
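For readers who wish to replicate this comparison on the public-release data, the two specifications can be estimated along the following lines. The snippet is a sketch with hypothetical variable names (bmi_w1, bmi_w2, employed_w1, and so on) rather than the actual NIDS variable codes, and the tiny synthetic dataframe exists only so that the example runs end to end.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic wide-format panel purely so the snippet runs; in practice this would be
# the individual-level analysis file with one row per balanced-panel respondent.
rng = np.random.default_rng(0)
n = 500
panel = pd.DataFrame({
    "male": rng.integers(0, 2, n),
    "age_w1": rng.integers(18, 64, n),
    "educ_w1": rng.integers(0, 16, n),
    "employed_w1": rng.integers(0, 2, n),
    "bmi_w1": rng.normal(27, 4, n),
})
panel["age_w2"] = panel["age_w1"] + rng.integers(1, 3, n)
panel["educ_w2"] = panel["educ_w1"] + rng.integers(0, 2, n)
panel["employed_w2"] = rng.integers(0, 2, n)
panel["bmi_w2"] = panel["bmi_w1"] + rng.normal(0, 2, n)

# Cross-sectional OLS on wave 2 levels
ols = smf.ols("bmi_w2 ~ age_w2 + male + educ_w2 + employed_w2", data=panel).fit()

# First-differenced regression: the time-invariant male dummy drops out
for v in ["bmi", "age", "educ", "employed"]:
    panel[f"d_{v}"] = panel[f"{v}_w2"] - panel[f"{v}_w1"]
fd = smf.ols("d_bmi ~ d_age + d_educ + d_employed", data=panel).fit()

print(ols.params["employed_w2"], fd.params["d_employed"])
```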
In table 9, we present the regression outputs from estimating the OLS model on the Dirty and Clean data. Our dependent variable is BMI and our regressors are age, gender, education, and employment status.

Table 9. Cross-sectional and First-Differenced Regressions

                    Cross sectional                    First differenced
Variables           W2 BMI (Dirty)   W2 BMI (Clean)    Δ BMI (Dirty)   Δ BMI (Clean)
Age                 0.14***          0.14***           0.00            0.00*
                    (0.01)           (0.01)            (0.00)          (0.00)
Male                -4.52***         -4.60***
                    (0.15)           (0.15)
Education           0.13***          0.16***           0.08**          0.02
                    (0.02)           (0.02)            (0.04)          (0.04)
Employed            0.76***          0.70***           0.31**          0.18
                    (0.16)           (0.16)            (0.15)          (0.14)
Constant            22.50***         22.00***          1.03***         1.00***
                    (0.37)           (0.38)            (0.08)          (0.07)
Observations        6,768            6,576             5,388           5,263
R-squared           0.19             0.20              0.00            0.00

Standard errors in parentheses. ***p<0.01, **p<0.05, *p<0.1. Samples are restricted to Africans aged 18 to 65 in wave 2, with BMI values less than 50. Columns 2 and 3 present results from cross-sectional OLS estimation. Columns 4 and 5 present results from first-differenced regressions, and the regressors should be read as differences, rather than levels.
Source: Own calculations using pre-public release and public release NIDS Wave 2 data, 2010.

The overall finding is that the regression results look very similar. The R-squared values differ by 0.01 units, and none of the coefficients are statistically significantly different at any reasonable significance level. The education variable, where we do have something that resembles classical measurement error within a limited range, is indeed slightly smaller in the regression using the Dirty dataset, but the difference is only 0.03 BMI units, which is quite negligible. The coefficients on the employed dummy are both positive and significant. The coefficient is slightly larger when using the Dirty dataset (0.76 versus 0.70), but again they only differ by 0.06 BMI units. The similarities are not surprising given what was observed in table 8. In the cross-sectional datasets the differences in the means of the relevant variables were all very small, and most of the data in both the Dirty and Clean datasets are obtained from the Always Correct dataset.

Our final set of results is obtained from the FD regressions and is presented in the last two columns of table 9. Note that the male dummy gets dropped as it is a time invariant variable. Our findings from this component of our analyses are a bit more nuanced than those from our earlier analyses.

When we compare the differences in the FD results between the Dirty and Clean datasets, we notice that the Dirty coefficients on education and employment are larger than those obtained from the Clean dataset, and they are statistically significant whereas those obtained from the Clean dataset are not statistically significant. On the other hand, the differences in magnitude are 0.06 and 0.13 BMI units for the education and employed variables, which are not particularly large. Moreover, the differences in the coefficients are not statistically significant. From this perspective, the fabricated data does affect our estimates, but not in a meaningful way.

Alternatively, when we compare the OLS and FD results within each dataset, we observe that the coefficients from both the Dirty and Clean datasets are reduced quite substantially. For example, in the Dirty regressions, the coefficient on education is 0.13 in the OLS regression but decreases to 0.08 in the FD regression. The decrease obtained in the Clean dataset is from 0.16 to 0.02 and also results in a change in the statistical significance of the coefficient.

A similar comparison between the OLS and FD coefficients focusing on the employed dummy yields larger decreases in the coefficients (in absolute value) for both the Dirty and the Clean datasets. If our only dataset had been the Dirty dataset, we would have concluded that using a longitudinal estimator results in a decrease of our estimated coefficient from 0.76 to 0.31, that is, a decrease of 0.45 BMI units, although both coefficients are statistically significant at the 5% level. In contrast, if we had performed the identical exercise using only the Clean dataset, we would have concluded that using a longitudinal estimator results in a decrease of our estimated coefficient from 0.70 to 0.18, that is, a decrease of 0.52 BMI units. Moreover, we would observe that the FD coefficient, unlike the OLS coefficient, is not statistically significantly different from zero.

One question that comes to mind is why the coefficients in the regression output are larger for the Dirty data than for the Clean data. If there is classical measurement error in the X variables, then one expects the point estimates to be biased downwards towards zero. However, in our study, the opposite is true. There are a number of reasons why this may be the case.
First, the results of regressions on the Dirty dataset presented in table 9 are not subject to the usual assumptions about classical measurement error. For example, the education variable approximates a continuous variable, but given the computer check when recording this variable, it is not clear what the realized measurement error will look like. Second, the Dirty dataset includes measurement error in the dependent variable as well. With classical measurement error affecting a dependent variable, we would expect larger standard errors corresponding to the coefficient estimates, but our point estimates should be unaffected in expectation. However, the possible nonclassical measurement error on the Y variable in our study means that this expectation is no longer valid. In addition, the setting of our study is different to the majority of the literature on measurement error. Instead of presenting a model in which we consider the error of an X variable or a Y variable in isolation, the Dirty dataset potentially has nonclassical measurement error in both the dependent and independent variables simultaneously. The possible correlation between the errors associated with these variables means that the standard framework is not applicable to this study. Extending from the cross section to the first differenced regressions, it is also unclear what one should expect if one begins with nonclassical measurement error in the X variables and then estimates a longitudinal model using these variables.

Discussion

We are aware that we are probably not getting the true "causal" estimate of labor market status on BMI, but our focus is primarily on measuring the difference between the estimates obtained by using the fabricated data instead of the subsequently corrected data. This is the main contribution of this section of the paper. To our knowledge, all previous research on this topic has amounted to comparing an ex ante dataset containing fabricated data to an ex post dataset where the fraudulent data has been deleted but not replaced. In our case, where households with fabricated data were re-integrated into the NIDS dataset, we are in a unique position in that we observe both the fabricated data as well as the subsequently corrected data.

The overarching question that we set out to answer was whether the fabricated data would have affected our estimates in a meaningful way. Our findings suggest that the answer to this question depends on the estimator being considered and the purposes for which the analysis is being conducted. The general picture that emerges, which is consistent with econometric theory, is that the cross-sectional estimates of means and OLS regressions are not substantially affected by the presence of a relatively small amount of fabricated data. At the same time, the identical amount of fabrication does affect the longitudinal estimators. For the FD regressions, the difference in the estimates would have led us to reach quite different overall conclusions.

It is useful to try to perform a cost-benefit analysis of our data quality investigation. Challenges arise in terms of attributing financial costs to the various activities that were undertaken, as well as in placing a financial value on the better quality data.
Nonetheless, even the most conservative calculations suggest that the investigation was worthwhile in that the value of the benefits was at least twenty-four times the aggregate costs associated with the related activities. For example, the overall budget for the entire Wave 2 operation was approximately R34 million.27 If the deliverable of the entire project is a dataset, and we use the 7.3% fabrication rate that we identified, and we assume that fabricated data has no value, then the cost of the fabrication would have been about R2.48 million. In contrast, the costs associated with the investigation were primarily driven by the time invested by various members of project staff. In total, we estimate the salary costs to have been not greater than R100,000. This results in a "benefits-to-cost" ratio of greater than twenty-four, and we believe that this is a conservative estimate.28 As such, undertaking active steps to quickly identify cheating interviewers is easily justifiable.

27 At the prevailing 2010/2011 exchange rate, this was approximately 4.25 million US dollars.
28 The estimate is conservative for several reasons. First, we have been generous in accounting for the time spent by the staff on these specific activities, as most staff were working on several other tasks at the time. Second, the proportion of fabricated data would potentially have increased with time in the absence of our investigation, as the cheating interviewers were "completing" interviews at a relatively fast rate. Third, assuming that fabricated data has no value is itself conservative. The point of the survey is to obtain data that will assist in evidence based policy making, and thus fabricated data, insofar as it leads to false research findings, has the potential to have a strictly negative and potentially large value.
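The underlying arithmetic, using the figures reported above, can be written out explicitly.

```python
budget = 34_000_000        # approximate wave 2 budget in rand
fabrication_rate = 0.073   # share of wave 2 households with fabricated data
audit_cost = 100_000       # upper bound on the salary costs of the investigation

benefit = budget * fabrication_rate   # about R2.48 million of data that would otherwise have been worthless
print(benefit, benefit / audit_cost)  # benefit-cost ratio of roughly 24.8
```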
Better quality data is always desirable. Thus, other things being equal, we should always get the best quality data that we can. Unfortunately, running a survey is a costly and complex task, and the costs of auditing and monitoring interviewers compete with several other tasks for resources in a finite budget. Thus, the "other things being equal" assumption is not very realistic. Nonetheless, given the overall costs and effort invested in running a large survey, the marginal cost of performing some generic data quality checks for fabricated data seems to be highly warranted, especially in an environment where we now have evidence of fabrication in several studies. In our study, the estimators that were most affected were the longitudinal estimators. At the same time, the marginal cost of detecting fabricated data can be substantially lowered when one has longitudinal data, as one can then look for intertemporal data anomalies in addition to cross-sectional data anomalies. This makes an even stronger case for the argument that adequate resources be allocated for identifying data fabrication in longitudinal studies.

IV. Conclusion

In this paper, we argued that the incidence of interviewer cheating is widespread. We documented cheating and potential cheating in five substantial South African surveys. Of the various methods that we considered to detect fraudulent data, two were more useful in our context than the others. These were Benford's law and the identification of anomalies in the anthropometric data of respondents.

Looking forward, there may be ways to improve on our process for identifying fabrication. First, survey companies that are using computers with built-in GPS devices to fill in questionnaires can use the software to capture the time and place that a survey was conducted. This can be done without the knowledge of the interviewer, which will aid in detection. Using GPS software will allow much better monitoring of interviewers' whereabouts while they are in the field. In addition, interviewers who fabricate data are likely to complete entering the data much faster than an actual survey would take to complete. These two pieces of information alone would greatly improve the data quality auditing process. Second, wireless networks and cellular technologies are now widespread even in developing countries. It is thus not unrealistic to expect to receive data with just a day's lag. Previously, while using paper questionnaires, it would take months to obtain data in an electronic form. With the real-time uploading of the data, one can now check on each interviewer much earlier in the process, and constantly monitor each interviewer's performance. This enables survey organisations to fire cheating interviewers, as well as compel the fieldwork company to redo the interview.

The incentive structure facing interviewers is also an important part of reducing the probability of data fabrication, even before going to field. It is easy to see how a system of paying interviewers immediately for completed questionnaires could lead to higher rates of fabrication. These perverse outcomes could be mitigated by delaying payment to interviewers until a certain proportion of each batch of completed questionnaires has been verified.

One important part of obtaining high quality data relates to interviewer selection and training. In NIDS, fieldwork was outsourced to a survey company under the condition that all interviewers had at least completed secondary school. In addition, all interviewers were required to attend an intensive week-long training course, during which they received repeated assessments and feedback. Fieldworkers who did not meet the required standards were not allowed to proceed with actual interviews. Despite this, the probability of fabrication cannot realistically be assumed to be zero. Thus, ex post data quality checks remain an integral process with which we can verify the data quality. This should be thought of as complementary to the other efforts at maintaining data quality that occur prior to fieldwork.

Other possibilities might be to use built-in cameras to take photographs of survey respondents, real-time callbacks to ensure that the interview did in fact take place, and to strategically not prepopulate certain variables in longitudinal studies so that sizable deviations from time-invariant or slow-moving variables are flagged immediately. In summation, it seems that there are several relatively low cost ways in which survey organizations can use modern technology to minimize both the likelihood of interviewer cheating as well as the impact of such cheating on the overall quality of the data, and explicitly performing such quality control activities is easily justifiable.
Appendices

A Time Use

Table A1. Time Use Survey Selection Grid

Persons 10 years+   HH1   HH2   HH3   HH4   HH5   HH6   HH7   HH8   HH9   HH10
1                   1     1     1     1     1     1     1     1     1     1
2                   12    12    12    12    12    12    12    12    12    12
3                   12    13    23    23    13    23    12    23    13    23
4                   24    13    13    24    13    24    13    24    13    24
5                   35    14    13    24    15    24    24    45    12    24
6                   56    46    12    12    15    46    15    35    46    13
7                   26    46    25    57    24    47    57    14    26    14
8                   15    13    68    25    14    56    23    57    68    28
9                   49    13    49    15    27    29    23    45    78    26
10                  39    16    23    49    13    810   56    37    25    89

Source: Time Use Survey Fieldworker Manual.

Table A2. Chi-squared Tests for Difference in Distributions

Number     Number                    Person number
eligible   of HHs                    1      2      3      4      5      6     Total
3          1,045    Expected %      50     70     80
                    Expected #     523    732    836                          2,090    χ²(2)
                    Actual #       845    844    401                          2,090    CV: 5.99
                    Difference     323    113   -435                              0    Test stat: 442.7
4          1,104    Expected %      50     50     50     50
                    Expected #     552    552    552    552                   2,208    χ²(3)
                    Actual #       738    731    530    209                   2,208    CV: 7.91
                    Difference     186    179    -22   -343                       0    Test stat: 334.7
5          901      Expected %      40     50     20     60     30
                    Expected #     360    451    180    541    270            1,802    χ²(4)
                    Actual #       529    434    476    265     98            1,802    CV: 9.45
                    Difference     169    -17    296   -276   -172                0    Test stat: 815.4
6          590      Expected %      50     20     20     30     40     40
                    Expected #     295    118    118    177    236    236     1,180    χ²(5)
                    Actual #       293    214    263    193    140     77     1,180    CV: 11.1
                    Difference      -2     96    145     16    -96   -159         0    Test stat: 403.9

Expected numbers rounded to closest integer.

B Leading Digits of Other Monetary Variables

Figure A1. Leading Digit Distributions for Other Monetary Variables
Source: Own calculations using NIDS Wave 1 2008 and Wave 2 2010/2011.

C Full Fieldworker Chi-squared Table

Table A3. Interviewers by Chi-squared Ranking

Ranking   Interviewer   Chi-squared (no. of interviews)
1         A             39.7 (80)
2         B             31.2 (49)
3         OK            28.0 (66)
4         C             27.3 (74)
5         D             27.1 (100)
6         E             24.7 (42)
7         OK            21.7 (64)
8         F             21.4 (73)
9         OK            21.2 (44)
10        OK            19.8 (67)
11        OK            19.7 (43)
12        G             18.7 (47)
13        OK            18.0 (58)
14        OK            18.0 (73)
15        OK            17.0 (49)
16        OK            16.0 (52)
17        OK            15.9 (53)
18        OK            15.8 (53)
19        OK            14.9 (81)
20        OK            13.6 (41)
21        H             13.2 (51)
22        I             12.3 (53)
23        OK            11.8 (48)
24        OK            11.5 (56)
25        OK            10.9 (64)
26        OK            10.7 (48)
27        OK            10.5 (43)
28        OK            10.3 (44)
29        OK            8.4 (51)
30        OK            8.1 (41)
31        OK            7.2 (62)
32        OK            6.7 (57)
33        OK            6.4 (82)
34        OK            6.3 (41)
35        OK            6.1 (49)
36        OK            5.8 (66)
37        OK            5.7 (47)
38        OK            4.5 (71)
39        OK            4.5 (58)

Interviewers who were found to have fabricated some or all of their data are identified by letter.
Source: Own calculations using pre-public release NIDS Wave 2 data, 2010.

D NIDS Verification Questionnaire

Figure A2. NIDS Verification Questionnaire

References

Aigner, D. J. 1973. "Regression with a Binary Independent Variable Subject to Errors of Observation." Journal of Econometrics 1 (1): 49–59.
Ardington, C., and A. Case. 2009. "Health: Analysis of the NIDS Wave 1 Dataset." NIDS Discussion Paper 2, National Income Dynamics Study, University of Cape Town.
Ardington, C., and B. Gasealahwe. 2012. "Health: Analysis of the NIDS Wave 1 and 2 Datasets." SALDRU Working Papers 80, Southern Africa Labour and Development Research Unit, University of Cape Town.
Benford, F. 1938. "The Law of Anomalous Numbers." Proceedings of the American Philosophical Society 78 (4): 551–72.
Birnbaum, B. 2012. Algorithmic Approaches to Detecting Interviewer Fabrication in Surveys. PhD thesis, University of Washington.
Bound, J., C. Brown, and N. Mathiowetz. 2001. "Measurement Error in Survey Data." In J. Heckman and E. Leamer, eds., Handbook of Econometrics, vol. 5.
Amsterdam: Elsevier, chapter 59, 3705–843.
Bredl, S., P. Winker, and K. Kötschau. 2008. "A Statistical Approach to Detect Cheating Interviewers." Technical Report 39, Diskussionsbeiträge: Zentrum für internationale Entwicklungs- und Umweltforschung.
Carslaw, C. A. 1988. "Anomalies in Income Numbers: Evidence of Goal Oriented Behavior." Accounting Review 63 (2): 321–27.
Cho, M., J. Eltinge, and D. Swanson. 2003. "Inferential Methods to Identify Possible Interviewer Fraud Using Leading Digit Preference Patterns and Design Effect Matrices." Proceedings of the American Statistical Association (Survey Research Methods Section): 936–41.
Devey, R., C. Skinner, and I. Valodia. 2006. "Definitions, Data and the Informal Economy in South Africa: A Critical Analysis." In V. Padayachee, ed., The Development Decade? Economic and Social Change in South Africa, 1994–2004, HSRC Press, chapter 15: 302–323.
Durtschi, C., W. Hillison, and C. Pacini. 2004. "The Effective Use of Benford's Law to Assist in Detecting Fraud in Accounting Data." Journal of Forensic Accounting 5 (1): 17–34.
Finn, A., and V. Ranchhod. 2013. "Genuine Fakes: The Prevalence and Implications of Fieldworker Fraud in a Large South African Survey." SALDRU Working Papers 115, Southern Africa Labour and Development Research Unit, University of Cape Town.
Griliches, Z., and J. A. Hausman. 1986. "Errors in Variables in Panel Data." Journal of Econometrics 31 (1): 93–118.
Hausman, J. 2001. "Mismeasured Variables in Econometric Analysis: Problems from the Right and Problems from the Left." Journal of Economic Perspectives 15 (4): 57–67.
Hill, T. P. 1995. "A Statistical Derivation of the Significant-Digit Law." Statistical Science 10 (4): 354–63.
International Labour Office. 2011. "Unemployment, Underemployment and Inactivity Indicators." In Key Indicators of the Labour Market, ILO, chapter 4.
Judge, G., and L. Schechter. 2009. "Detecting Problems in Survey Data Using Benford's Law." Journal of Human Resources 44 (1): 1–24.
Krueger, A. B., and L. H. Summers. 1988. "Efficiency Wages and the Inter-industry Wage Structure." Econometrica 56 (2): 259–93.
Lam, D., C. Ardington, N. Branson, B. Maughan-Brown, A. Menendez, J. Seekings, and M. Sparks. 2012. "The Cape Area Panel Study: Overview and Technical Documentation Waves 1-2-3-4-5 (2002–2009)." Technical report, University of Cape Town.
Leibbrandt, M., I. Woolard, A. Finn, and J. Argent. 2010. "Trends in South African Income Distribution and Poverty since the Fall of Apartheid." OECD Social, Employment and Migration Working Papers 101, OECD Publishing.
Li, J., J. Brick, B. Tran, and P. Singer. 2011. "Using Statistical Models for Sample Design of a Reinterview Program." Journal of Official Statistics 27 (3): 433–50.
May, J., J. Agüero, M. Carter, and I. Timaeus. 2007. "The KwaZulu-Natal Income Dynamics Study (KIDS) 3rd Wave: Methods, First Findings and an Agenda for Future Research." Development Southern Africa 24 (5): 629–48.
Murphy, J., R. Baxter, J. Eyerman, D. Cunningham, and J. Kennet. 2004. "A System for Detecting Interviewer Falsification." American Association for Public Opinion Research, 59th Annual Conference, 4968–75.
Porras, J., and N. English. 2004. "Data-Driven Approaches to Identifying Interviewer Data Falsification: The Case of Health Surveys." Proceedings of the American Statistical Association (Survey Research Methods Section): 4223–28.
Schäfer, C., J.-P. Schräpler, K.-R. Müller, and G. G. Wagner. 2004.
"Automatic Identification of Faked and Fraudulent Interviews in Surveys by Two Different Methods." DIW Discussion Paper 441, DIW Berlin, German Institute for Economic Research.
Schnell, R. 1991. "Der Einfluß gefälschter Interviews auf Survey-Ergebnisse." Zeitschrift für Soziologie 20: 25–35.
Schräpler, J.-P. 2011. "Benford's Law as an Instrument for Fraud Detection in Surveys Using the Data of the Socio-Economic Panel (SOEP)." Journal of Economics and Statistics 231 (5–6): 685–718.
Schräpler, J.-P., and G. Wagner. 2005. "Characteristics and Impact of Faked Interviews in Surveys - An Analysis of Genuine Fakes in the Raw Data of SOEP." Allgemeines Statistisches Archiv 89 (1): 7–20.
Schreiner, I., K. Pennie, and J. Newbrough. 1988. "Interviewer Falsification in Census Bureau Surveys." Proceedings of the American Statistical Association (Survey Research Methods Section): 491–96.
Scott, P. D., and M. Fasli. 2001. "Benford's Law: An Empirical Investigation and a Novel Explanation." Unpublished manuscript.
Statistics South Africa. 2000. Time Use Survey: Fieldworker's Manual. Unpublished training manual.
Swanson, D., M. Cho, and J. Eltinge. 2003. "Detecting Possibly Fraudulent or Error-Prone Survey Data Using Benford's Law." Proceedings of the American Statistical Association (Survey Research Methods Section): 4172–77.
Thomas, J. K. 1989. "Unusual Patterns in Reported Earnings." The Accounting Review 64 (4): 773–87.
Wittenberg, M. 2009. "Weighing the Value of Asset Proxies: The Case of the Body Mass Index in South Africa." SALDRU Working Papers 39, Southern Africa Labour and Development Research Unit, University of Cape Town.
———. 2013. "The Weight of Success: The Body Mass Index and Economic Well-Being in Southern Africa." Review of Income and Wealth 59: S62–S83.
Wooldridge, J. M. 2002. Econometric Analysis of Cross Section and Panel Data. MIT Press.