Imputed Welfare Estimates in Regression Analysis1
Chris Elbers2 Jean O. Lanjouw3 Peter Lanjouw4
April 26, 2004
1We thank Ravi Kanbur, Tony Venables and other participants at the WIDER project meeting on
Spatial Inequality in Development, May 2003, for comments on an earlier draft of this paper. Also we
wish to thank Jishnu Das, Elisabeth Sadoulet and Alain de Janvry for valuable input, and Fran¸cois
Bourguignon and Martin Ravallion for stimulating our interest in the questions pursued here. Finally,
the paper has benefited from the comments of two anonymous referees.
2Amsterdam Institute for International Development; and Vrije Universiteit Amsterdam, cel-
bers@feweb.vu.nl.
3ARE Department, U.C. Berkeley, Brookings Institution and the Center for Global Development,
Washington, DC, jlanjouw@are.berkeley.edu.
4World Bank, Washington, DC, planjouw@worldbank.org. Financial support was gratefully received
from the Bank Netherlands Partnership Program. None of the views expressed here should be taken to
represent those of the World Bank or affiliated organizations.
Abstract
We discuss the use of imputed data in regression analysis, in particular the use of highly
disaggregated welfare indicators (from so-called "poverty maps"). We show that such indicators
can be used both as explanatory variables on the right-hand side and as the phenomenon to
explain on the left-hand side. We try out practical ways of adjusting standard errors of the
regression coefficients to reflect the error introduced by using imputed, rather than actual,
welfare indicators. These are illustrated by regression experiments based on data from Ecuador.
For regressions with imputed variables on the left-hand side, we argue that essentially the same
aggregate relationships would be found with either actual or imputed variables. We address the
methodological question of how to interpret aggregate relationships found in such regressions.
Introduction
The growing access of researchers to household data makes possible the estimation of inequality
and poverty measures at very disaggregated levels. In Elbers, Lanjouw and Lanjouw (2003)
we describe a procedure that combines the broad coverage of a census or large survey and the
detail of household survey data to arrive at estimators that are quite precise - comparing very
favorably to estimates based on either source alone. Using this strategy, estimates of local
welfare (so-called poverty maps) have been constructed for many countries (see Demombynes,
et al., 2003, for examples). These maps provide useful information about the geographic spread
of relative poverty and inequality that can be directly useful to policy makers pursuing poverty
alleviation or development goals.
In addition to their direct informational use, the imputed welfare estimates also provide a
wealth of distributional information that could be used in economic analysis. Theories abound
regarding what causes localities to be poor or unequal and how these characteristics might affect
other social or economic outcomes. In the absence of appropriate data, it has been difficult to
explore these ideas empirically. Imputed welfare estimates could enable more extensive applied
distributional analysis. In such studies, however, it will be important to take account of the
fact that the estimates are exactly that, estimates, and not data. Thus, in this paper we discuss
the econometric issues raised when using imputed welfare estimates in regression analysis - as
either a dependent variable or an explanatory variable.
Most of the issues we discuss are quite general and arise in all situations using predicted
variables. An extensive discussion, for example, can be found in Murphy and Topel (1985).
We focus here on the use of imputed welfare variables. We make suggestions regarding some
particular problems that might arise when using these variables, and explore the importance of
various issues using Ecuador as an illustration.
Specifically, we are interested analyzing the relationships between a true welfare measure,
W, and other variables in what we will call "downstream" regressions. W is unknown but
we have consistent estimates of the expected value of W denoted by µ. The estimate µ is an
error-ridden variable. However, by its construction it can be understood as an instrumented
version of W and standard results, including consistency, for IV estimators obtain. Although
the welfare estimates are more complex than standard instrumental variables, we show how one
can use information about the distribution of µ to calculate consistent standard errors for the
downstream coefficient estimates.
Using an imputed value to serve as an explanatory variable may create an endogeneity
problem if the variables used in its construction are correlated with the disturbance in the
downstream regression. We examine the likely importance of this concern when the correlated
variables and the regressions are at various levels of aggregation and suggest ways to avoid
1
introducing an endogeneity bias. The use of imputed values may also resolve an endogeneity
problem. In some situations the true value of W may be correlated with the disturbance term.
In this case, one would like to instrument W using variables uncorrelated with the disturbance
in the downstream regression. With attention paid to how they are constructed, predicted
values µ can be interpreted as useful instruments for W when W is endogeneous. That is,
predicted values can be superior to the unknown true values W.
The construction of the imputed welfare estimates is briefly described in Section 1. In
Section 2 we discuss the use of these estimates as an explanatory variable and in Section
3 describe practical approaches to calculating consistent standard errors for the downstream
regression coefficients. Endogeneity issues are considered in Section 4. In Section 5 we discuss
the use of an imputed value as the dependent variable in the downstream regression. The last
section concludes.
1 Calculation of Imputed Welfare Estimates
Denote by W a measure of poverty or inequality based on the distribution of a household-level
variable of interest, yh, for instance per-capita expenditure. Data on yh as well as a number of
covariates zh are available from a household survey, where h refers to a household included in
the survey and bold variables indicate vectors and matrices. Measuring household per-capita
expenditure reliably is very costly, therefore this kind of survey is typically only representative
at high levels of aggregation, say the province level. Consequently, welfare indicators W, based
on direct observations of y, are also at best available at the province level. By bringing in
information from other data sources we can overcome this limit and compile welfare estimates
at levels of aggregation far below the province level.
The idea is that from the household survey we can estimate the joint distribution of yh and
one or more of the covariates zhi. Assume that a larger-scale sample or a census of households
is available besides the survey and containing observations of some of the components of zh.1
By estimating the joint distribution of yh and the subset of covariates also in the census,2 this
estimated distribution can be used to generate the distribution of yh for any sub-population in
the larger sample conditional on the sub-population's observed characteristics. This, in turn,
allows us to generate the conditional distribution of W for sub-populations. We do this by
means of simulation.
In what follows we let z denote the vector of covariates which can be linked to both survey
1Ideally survey and census would refer to the same year. If not, it is necessary either to assume that the
relationship between consumption and observables did not change over the period between the data souces (as
we do below), or the model estimated must be extended to capture any change.
2Actually, we can do better than that. We can bring in any variable that can be linked both to survey and
census households. In practice this appears to be a crucial improvement. See Elbers et al. (2003) for details.
2
and census households. The first step, which we call the "first stage", is to develop an accurate
empirical model of ych, the per-capita expenditure of household h in sample cluster c. Typical
applications have used a log-linear approximation to the conditional distribution of ych,
lnych = E[lnych|zch] + uch zch + c + ch. (1)
By including cluster random effects c in the equation we allow for a within cluster correlation
of disturbances uch. The error components and are assumed to be independent of each
other. They are uncorrelated with observables, zch, by construction.
Suppose that there are M households in a target population and household h has mh family
members. In general one will want to account for household size in welfare measures, so we
write W(m,Z,,u), where m, Z and u are conformable arrays of household size, observable
characteristics and disturbances, respectively. The expected value of W given the observable
characteristics and the model of expenditure is denoted µ = E[W|m,Z,], where is the
vector of model parameters, including and any parameters describing the distribution of the
disturbances and .
In the second stage construction of our estimator of µ, we replace with consistent estima-
tors, , from the first stage expenditure regression. Simulation is used to obtain µ =E{E[W |
m,Z,]}, where the outer expectation is over the sampling distribution of ^, given = ^.
The difference between µ, the estimator of the expected value of W for a population, and
the actual level may be written
= W - µ = (W - µ) + (µ - µ). (2)
Thus the prediction error has two components:3
Idiosyncratic Error - (W - µ). The actual value of the welfare indicator for a population
deviates from its expected value, µ, as a result of the realizations of the unobserved
component of expenditure. This component increases as one focuses on smaller target
populations.
Model Error - (µ-µ). This component of the prediction error is determined by the properties
of the first-stage estimators so it does not increase or fall systematically as the size of the
target population changes.
Elbers, Lanjouw and Lanjouw (2003) show that these error components are asymptotically
normal, converging in the population size M and the household survey size s. In typical
applications the overlap between the target population and the survey is virtually nil, so the
3There is also simulation error, but we will assume that it has been made small enough to ignore.
3
variance of the total prediction error is the sum of individual error variance components:
V = VI + VM. (3)
2 Predicted Welfare as an Explanatory Variable
Consider first using imputed welfare measures on the "right-hand side" of a regression. Start
from the general regression equation
D = x + W + . (4)
D is explained by regressor vector x and welfare indicator W. We are interested in estimating
, the effect of W on D. In our case W is not observed, so we use estimates of expected welfare
µ. As discussed above, our predicted welfare is related to W as
W = µ + . (5)
Substituting this equation in the regression equation one gets
D = x + µ + ( + ). (6)
It follows that can be consistently estimated if x and µ are uncorrelated with and , or if
µ is uncorrelated with x, , and .
In our case, µ is a consistent estimator of the conditional expectation of W, is a prediction
error and so µ and are uncorrelated. Thus, if the other standard properties are met, using µ
in a regression rather than W still yields consistent estimates.
Note the importance of the fact that µ is a prediction. If instead µ were some other proxy
for W, then the error would represent a form of measurement error which is correlated with
µ. It is possible that an analyst might inadvertently introduce measurement error into the
regression when W is a discrete variable (say, poverty status at the household level). Because
its expectation, µ = E(W|z), is a continuous variable (expected poverty status given household
characteristics z) it is tempting in this circumstance to use a discrete version of µ, say W, in
^
equation (6) ( i.e. setting W = 1 for a household if µ 0.5 and 0 otherwise). This would not
^
be advisable because measurement error typically leads to attenuation bias in the estimation
of .
4
2.1 Standard errors on downstream regression coefficients
When using imputed welfare as an explanatory variable in a regression equation, the estimated
standard errors on the regression coefficients must take account of additional noise in the
estimates. To see this, insert equation (2) into regression equation (4) to obtain
D = x + µ + (µ - µ) + (W - µ) + . (7)
The error term in this regression consists of the following components
= (µ - µ) + (W - µ) + . (8)
Thus there are three sources of error in the estimates of the downstream regression coefficients
of equation (7). One is the standard sampling error (represented by ), a second derives
from the difference between W and the true expectation µ (the idiosyncratic error), and the
third is from the difference between µ and its estimate µ (the model error). Except for the
idiosyncratic error part (W - µ) this error decomposition is very similar to formula (8) in
Murphy and Topel (1985). As in their case, estimating equation (7) directly from data on D,
x, and µ, would typically underestimate the true variance of the estimator for , because the
model error term (µ - µ) creates correlation across the observations.
The source of correlation is clear. Recall that model error arises because computation of
µ requires knowledge of , the parameter vector that describes consumption. As explained
in section 1, estimates of these parameters are used to impute (the conditional distribution
of) consumption expenditure for all households in a target population. Because the same
expenditure model is applied to a group of households the same (erroneous) parameter estimates
are applied to all of them, thus creating correlation across errors in the prediction of those
households' consumption expenditure. This is unlike correlation resulting from location effects
in that correlation due to model error is very likely to carry over to higher levels of aggregation
(sub-district, district, and so on).4
It is nonetheless straightforward to estimate the full variance of the estimated parameters
and of the regression equation (7) once the correlation of the model error across downstream
observations is known. To see this, rewrite the regression equation as
D = X + (µ - µ) + e,
where X is the matrix of observations (x,µ), = (,) is the vector of regression parameters,
and e = (W - µ) + is the residual part not related to model error. M is the covariance
4Typically it is possible to estimate separate consumption models at the level of strata. Estimates for
sub-populations belonging to different survey strata then do not have correlated model error.
5
matrix of model error in µ. If the components of e are i.i.d. and there are no endogeneity issues
plaguing the regression, then OLS is consistent and the OLS estimator for has (asymptotic)
variance
Var() = e(X X)-1 + 2(X X)-1X MX(X X)-1.
2 (9)
Alternatively, feasible GLS could be used instead of OLS. For clarity of exposition we do
not discuss this here. GLS would often be the preferred method if downstream regression
equation (4) is estimated at the household level. This is because households within the same
cluster typically share a common location effect which affects their consumption level in a
similar way (the disturbance component c in equation 1). Thus the idiosyncratic part of the
prediction error (W - µ) is correlated when observations are at the level of households. This
complication does not occur if the regression is at higher levels of aggregation.5
Below we will refer to the first term in (9) as the "sampling" part of the variance and the
second part as the "model" part of the variance. In the next section we try out alternative
ways to compute Var().
3 Estimation of Standard Errors
Suppose, first, that one knew the true expected value of W, that is µ = µ and M = 0. In this
case, the second term of Var() in equation (9) would disappear. The downstream regression
model (7) and standard errors could be estimated in the usual way.
If OLS is used in a household-level regression, the sampling part of the variance in can
still be estimated consistently using standard methods. One approach is to estimate the model
with µ and then use downstream regression residuals and a robust variance formula (see, for
example, Greene (2000), equation 11-14). With large numbers of downstream observations,
however, this is cumbersome. Alternatively one could bootstrap the variance by resampling out
of the downstream data (including µ), re-estimating the model many times, and calculating the
variance of the resulting estimates of .6 By bootstrapping, any correlation in the idiosyncratic
error across observations due to location effects is incorporated in the variance estimation
directly. Note again that, in general, if the downstream regression is estimated at a level of
aggregation higher than the level of any location effects then the prediction errors (W-µ) are
no longer correlated and these steps are unnecessary.
We now turn to estimation of the second term in (9), the variance due to model error in the
imputed welfare estimates. From our earlier discussion, recall that denotes the true parameter
5The prediction errors might also be heteroscedastic, although in our experience the variance of µ does not
appear systematically related to its size, as one might perhaps expect.
6The bootstrapping should be nested like the error structure in equation (1) by first drawing groups of
households at the level of aggregation where c applies, and then drawing households randomly within the
group.
6
vector underlying expenditure equation (1), and write µ = µ() to stress the dependency of µ
on . Following Murphy and Topel we could relate the model error µ - µ to the error in by
linear approximation:
µ
µ-µ ()(-)
and use the estimated variance of to infer the (asymptotic) error distribution of µ.7 However,
for the purpose of calculating the variance in downstream regression coefficients the simplest
approach is to simulate the distribution of µ-µ directly (see below, section 3.1). Bypassing the
calculation of derivatives has the additional advantage that (small-sample) bias arising from
the linear approximation in Murphy and Topel's approach is avoided.
The simulations are described in the following subsection. They are done under the as-
sumption that there is no correlation between the model error part (µ-µ) and the other error
components of equation (8), (W - µ) + .8 The main justification for this assumption is that
the model error µ - µ is ultimately caused by sampling variation in the survey from which
is estimated. This survey typically covers only a tiny fraction of the population for which the
welfare indicators µ have been compiled9 and may come from a different time period if census
and survey--regrettably--are from different years.
To perform the calculations described in subsection 3.1 below one needs to employ the data
that were used in the computation of the estimators µ. Since most researchers will not have
access to the unit record level data, particularly not for a census, we propose in subsection 3.2
several alternative ways to approximate M when pieces of information are unknown. Ulti-
mately our goal is to find a parsimonious and satisfactory representation of M which could
be reported together with the welfare estimates in a poverty mapping project so that analysts
can readily adjust standard errors from regressions involving imputed welfare estimates. In the
final subsection we give empirical illustrations for Ecuador.
3.1 Estimation with unit record data
The model error part in the variance of , i.e. the second term in (9), is due to error in the
consumption model used to estimate µ. The combined effect of sampling and model error can
be simulated by drawing from the distribution of µ and e and re-estimating the downstream
regression model. As discussed in Section 1, the estimates µ are determined by household
survey data and a vector of estimated consumption model parameters . In the simulations we
take the following steps:
7See Murphy and Topel (1985), page 374. This approach is taken in Elbers, Lanjouw and Lanjouw (2003).
8In the terminology of Murphy and Topel this is the case of independent random components.
9Dependence would be completely eliminated if surveyed households could be excluded from the census data.
However, identifying survey households in the census is practically impossible.
7
r
1. Draw vectors , r = 1, ..., R, from the appropriate sampling distribution (see Elbers, et
al., 2003, for this distribution).
2. Draw a simulated vector of downstream regression disturbances er from an estimated
distribution of e. Construct a new vector of simulated dependent variables
Dr = x + µ + er.
3. Simulate the expected welfare measures implied by each, µr = µ( ). (The covariance
r
matrix of the µr is M, which however is not needed in this more direct procedure.)
4. Estimate the downstream regression coeffient using the simulated Dr and Xr, where
Xr is the matrix of observations (x,µr). Note that µ, not µr is used in step 3 above.
r
The variance of these R simulated values gives an estimate of the total error variance of .10
3.2 Estimation without unit record data
The estimation strategy described in the preceding subsection requires access to the unit record
data. Note, however, that it is quite straightforward to report the model variance for each µ.
If the µ were independent across observations this simulation could be done at the level of the µ
without need for access to the data used in the construction of µ. However, as discussed earlier,
the estimates of µ will often be correlated. For example, typically one consumption model is
estimated for rural areas and another for urban areas. Then welfare estimates imputed for
rural populations (households, villages, sub-districts, etc.) in the downstream data would share
model error, and likewise for the urban populations. It is not easy, however, to characterize this
correlation because it is dependent on the values for the z variables associated with any pair of
downstream observations (households, villages, sub-districts, etc.). It becomes yet harder to
characterize if the unit of observation in the downstream regression mixes households having
different consumption models. Thus, the straightforward approach - when it is possible - is to
begin the simulation from the estimated consumption parameters, , as above.
If the unit record data are unavailable, we must start from the µ and use some approximation
to their correlation. Take the typical case where a different consumption model is estimated for
each stratum in the household survey data. There is then no correlation between households in
different strata. Let -1 Ks 1 be the correlation coefficient between units within stratum
s. For example, suppose µ is a vector of estimates for four households, two in stratum F and
10The estimator is consistent but biased when µ is unknown. The average of the simulated coefficient
r
estimates derived from this procedure would give an unbiased estimator under the (estimated) sampling
distribution of µ.
8
two in stratum Q. VM represents the model part of the variance of µh for household h = 1,...,4
h
(see equation 3). Then model covariance matrix for µ is:
VM1 KF VM1VM2 0 0
VM2 0 0
M = . (10)
VM3 KQ VM3VM4
VM4
We explore a number of different approximations for the matrix M. The purpose is to give
guidance to downstream researchers whose information about the true matrix may be limited.
It is also to suggest the type of information that should be provided by those producing welfare
estimates to improve their usefulness. Each approximation yields an estimated matrix M.
It is likely that the downstream researcher has little or no information about the differing
degrees of correlation in model error across units. Thus we try to approximate these values
with correlation coefficients that are constant within a given stratum (the Ks). If the welfare
estimates are coming from secondary sources, the researcher also may know only the total
variance in µ, V, and not the portion due to model error. In this case a second approximation is
needed, with VM = GV. Reasonable values for K and G will depend on the level of aggregation
of the µ.
In the following subsection we present examples from Ecuador. These show the importance
of including the model part of Var(), and indicate how sensitive estimates of the variance are to
assumptions about the degree of correlation in the imputed welfare estimates, µ. As will become
clear from that discussion, the approximations outlined above do not perform particularly well.
One might try to obtain a reasonable upperbound for the variance, replacing M in equation
(9) by a diagonal matrix I for sufficiently high . The model error variance part then simply
becomes
2(X X)-1.
The question is, of course, what an appropriate value for would be. We have used the
maximum total variance, V, found among welfare estimates at the level used in the downstream
regression.
Finally, one way to assess the possibilities for a parsimonious representation of M is to see
how many terms in a singular value decomposition of it are needed. Results are summarized
in Tables 1 and 2 below.
3.3 Experiments for Ecuador
Our empirical examples use data from Ecuador. Expected welfare is based on household per-
capita expenditure. Consumption models are estimated using the 1994 Ecuadorian Encuesta
9
Sobre Las Condiciones de Vida, a household survey following the general format of a World
Bank Living Standards Measurement Survey. It is stratified by eight regions and separate
models are estimated for each stratum. We were able to capture most of the effect of location
on consumption with available explanatory variables. This means that there is little correlation
across households in their idiosyncratic error. The models are used to impute welfare measures
for target populations in the 1990 Ecuadorian census. (See Elbers, Lanjouw, and Lanjouw,
2002, for a full discussion of the estimation procedure and diagnostics.)
We study canton-level regressions where the dependent variable is "garbage", the percent-
age of households in the canton whose garbage is collected by the municipal trucks.11 The
explanatory variables are a normalized measure of cantonal population size and a point esti-
mate of welfare, either the local headcount or the local inequality index GE(0.5). Moreover,
province dummies have been added to avoid obvious omitted variables bias. The estimations
use pooled data for the Rural Costa and Sierra regions for a total of 164 cantons with an average
population of 26,650.
The regression results are reported at the top of Tables 1 and 2, respectively, where the
reported standard errors reported in the top panel of the table include only the sampling part
of the error. Local poverty is associated with a lower incidence of garbage collection, while
greater community inequality is associated with a higher level of service. Both regressions have
reasonable R2s, given that these are cross-section regressions. The coefficients on the province
dummies are not reported. Each of these dummies is highly significant in both regressions. On
the other hand, without the dummies the parameter estimates and significance levels of the
welfare indicators are very similar to the values reported in Tables 1 and 2.
The bottom part of each table shows the additional error in the welfare coefficient due
to the fact that welfare levels - either poverty or inequality - have been imputed. The first
row gives the results obtained when full information about M can be determined from the
unit record data. We use the empirical covariance matrix derived from 100 simulated sets
of welfare indicators. Consider first Table 1. Column (1) gives the additional variance - the
(`µ',`µ') component of the matrix 2(X X)-1X MX(X X)-1. The second column gives the full
adjusted variance (8.959 plus column 1) corresponding to a standard error of 3.128. Columns
(3) and (4) indicate the share of the model variance in the total variance, and the increase
as a percentage of the non-adjusted variance. At over 9% the addition to the variance in the
downstream regression coefficient on the headcount, due to the fact that it is estimated, is not
trivial. However, the coefficient is still clearly significant.
The next lines in the table explore different ideas for approximating the model covariance
matrix M. The results are negative; these simple approximations to the covariance matrix
11The available levels of aggregation are (in increasing order of aggregation) household, parroquia, canton,
province, and region.
10
simply do not work, and our quest for a parsimonious approximation to M will have to
continue.
In each case we approximate G, discussed in section 3.2, by taking the share of model error in
the variance of the total prediction error in µ, VM/V, and averaging it over cantons. This gives
0.92 for Rural Costa and 0.66 for Rural Sierra. These numbers are high because idiosyncratic
error diminishes in importance at the canton level due to aggregation. Each row makes a
different assumption about the degree of correlation, K, between estimates of the expected
headcount across cantons within each of the two strata. Clearly in this model using a single
value to summarize the correlation leads to underestimation of the model error component -
for any value of K between 0 and 1. Note that the underestimation gets worse if one allows
for more (average) correlation. These results are not general: in a regression without province
dummies the approximated model error component increases with correlation and the model
error effect is well reproduced for average correlation of K = 0.15.
The `max V' line shows that a crude error approximation (see section 3.2) with equal to
the maximum among all prediction error variances, V, gives a safe but rather high upper bound
to the model part of the variance in . The final lines in the table explore how many terms
in a singular value decomposition of M would be needed to accurately replicate the model
error-induced error on the headcount coefficient. Twenty terms suffice, which is some, but not
a big gain compared to needing the full M matrix.
In Table 2 we see that the fact that the welfare variable is imputed makes considerably
more difference when it is an indicator of inequality. There are two reasons for this. First, the
unadjusted regression results in a much lower significance level for the coefficient on the welfare
indicator and second, the prediction error on the inequality measure is much bigger than that of
the headcount. On average the prediction (standard) error is 11.7% for the inequality measure
and 4.2% for the headcount. Thus, inclusion of the model error in increases its variance by
more than 100%. We see that the coefficient on inequality, which appeared to be significant
when model error was ignored, is in fact borderline significant at a 10% level (the t-statistic is
1.70). Looking further down the table, we find again that using a single value to summarize
the correlation across welfare estimates leads to underestimation of the model error component
for any value of K between 0 and 1.12 However, in this regression the approximation improves
if one allows for more correlation. The crude error estimation (Max V) gives a very high upper
bound in this case and would lead one to (incorrectly) soundly reject a relationship between
inequality and garbage collection services.
12The table reports results for the (extreme) assumption that all prediction error is model error, or G = 1.
11
4 Endogeneity
In this section we discuss two types of endogeneity issues.
4.1 Endogeneity of W
True welfare W may be correlated with the regression disturbance . In this case, one would
like to instrument for W, and µ may be a better explanatory variable to use in the downstream
regression even if W were known.
Example one - Health: Suppose that a health indicator of interest is independent of inequality
but both are correlated with an omitted variable "ethnic diversity". Estimating the
health regression with true inequality could give a significant, but spurious, coefficient.
Example two - Credit: Suppose that credit availability is independent of poverty but both
are correlated with an omitted variable "remoteness". In this situation we would find a
negative coefficient on W in a credit regression, but again it would be spurious.
Using µ instead of W resolves this type of endogeneity problem.
4.2 Endogeneity of µ
As in any problem involving instrumental variables, using predicted values may create, rather
than resolve, an endogenity problem. However, it is important to realize that when µ is
correlated with the downstream disturbance , the (unknown) true value of welfare, W, would
likely also be correlated with the disturbance. There would indeed be an endogeneity problem,
but not one special to having used a predicted value for welfare. The usual remedy would apply:
instrument µ.13 The only cause for additional concern then, would be if by construction µ was
correlated with when W itself was not.
One plausible way to have a regression in which expected welfare is correlated with the
disturbance is if one of the variables used in the construction of µ should have been included
in the downstream regression but is omitted. That said, note that while the suspect variable
would have entered the regression, say, linearly, it enters µ "mixed" in a non-linear fashion and
possibly at a different level of aggregation. So it is not obvious whether the effect of the omitted
variable would be picked up on µ in the downstream regression.
An investigation of the correlation between selected household-level variables used in the
consumption model and the resulting estimates of expected welfare is presented in Table 3.
13In principle rather than instrumenting after the fact one could use exogenous variables in the construction
of µ. However, in practice this is unlikely to be feasible because the welfare estimates are typically constructed
for targeting purposes. Moreover, appropriate exogenous variables will depend on the particular downstream
application.
12
The first column gives the measure, either the poverty headcount or the GE (0.5) measure
of inequality. The second column shows the level of the explanatory variables, i.e. "Parroquia"
indicates that the variables are means at that level of aggregation. The third column gives
the level of aggregation for the welfare estimate, µ. The rest of the columns give correlation
coefficients between µ and the variable indicated in the column heading.
Several points emerge. First, there is far less correlation between the µ and other variables
when µ is an inequality measure. This is not surprising as inequality is particularly non-linear.
With a household-level regression, in fact, the "mixing" seems to remove almost all correlation.
In household regressions, then, it seems extremely unlikely that including a constructed estimate
of inequality will create any endogeneity issues. Second, in many cases we do see considerable
correlation. In these situations the best advice would be to instrument µ. Again, we emphasize
that this is not special to using a predicted variable and is likely to be an important precaution
even if one were to have true W.
Finally, it is interesting to observe that for both poverty and inequality, the correlations
get stronger at higher levels of aggregation. Take eduation of the head, for example. Al-
though estimated poverty at the parroquia level is constructed from household measures of
education, it is more strongly correlated (-0.32 vs -0.62) with the average level of education for
the parroquia than it is with the household measures used in its construction. There seem to
be macro relationships between the variables and the welfare levels that extend beyond their
micro relationship with household consumption. These call for further investigation.
5 Predicted Welfare on the Left-Hand Side
We have seen how imputed welfare estimates can be used in a straightforward way as explana-
tory variables. Many questions of interest in development, however, concern the determinants
of distributional outcomes. Exploring these questions requires using imputed variables on the
LHS of a regression and on the face of it this looks suspect. The expenditure equation (1) gives a
full statistical description of household level consumption. Given the distribution of household
observables z in the target population, and the distribution of the error components and
the (expected) distribution of consumption expenditure is fully determined: there seems to be
no room for further determination of this distribution. For instance, suppose the expenditure
equation involves a household-level education variable. Then it would seem to be very suspect
to regress canton-level poverty, imputed from the expenditure equation, on average education
in the canton. Since the regression coefficient on average education is completely determined
by the expenditure model and the distribution of education in the population; interpreting it
as evidence of a direct relationship, at the aggregate level, seems misleading.
13
5.1 Analysis
For simplicity, let household per-capita expenditure ykh of household h in canton k be de-
termined by the single variable household-level education zkh and an i.i.d. error term ukh,
uncorrelated with zkh:
lnykh = zkh + ukh. (11)
The imputed head count at the canton level is14
1 Nk
µk = mkh Pr(ukh a - zkh),
Nk
hHk
where Hk denotes the set of households in canton k, Nk the total population, and mkh the
household size. Obviously, regressing the imputed headcount µk on zk, the average level of
education in location k, will result in a significant regression parameter which seems at best to
have only descriptive value. However, we would find essentially the same aggregate relationship
if we would have regressed true average poverty, Wk, on average education:
E(Wk|zk) = E(E(Wk|{zkh,ukh})|zk)
1 Nk
= E( mkh Pr(ukh a - zkh)|zk)
Nk
hHk
= E(µk|zk).
The issue is not so much to use an imputed or true variable on the LHS, but to interpret an
aggregate relationship as causal or direct: if such a relationship exists, we will find it using
either true or imputed variables; if it does not exist, the aggregate fit is a statistical artifact in
both cases. Here is our main proposition:
If handled carefully, regressions involving imputed indicators of welfare on the LHS
and/or the RHS, will give regression coefficients not systematically different from
similar regressions, involving the true indicators.
Note that the above analysis does not hinge on specifying the expenditure model correctly.
If the true expenditure generating process differs from the specified expenditure model, the
latter's success will simply depend on the degree of correlation of observed variables used in
the expenditure regression with the true expenditure-determining variables. But this remains
true at the aggregate level, which is equally misspecified or well-specified with true or imputed
variables.
14For ease of discussion we abstract from model error in this section. Complications from model error can be
handled as in the previous sections.
14
Formally, suppose we want to regress a welfare indicator on explanatory variables zk, then
we have for the imputed and true welfare indicator:
E(µk|zk) = E(E(Wk|{zkh})|zk)
and
E(Wk|zk) = E(E(Wk|{zkh,zk})|zk).
Hence, if
E(E(Wk|{zkh})|zk) = E(E(Wk|{zkh,zk})|zk),
then
E(µk|zk) = E(Wk|zk).
In other words, if the information in {zkh|h = 1,...,Nk} includes the information in zk, then
putting µk or Wk on the LHS essentially makes no difference. This condition will be satisfied if
zk is part of the household characteristics {zkh} or is otherwise a function of these. More gen-
erally, if zk does not significantly add explanatory power to household per-capita consumption
expenditure, beyond the variables zkh, then a regression of Wk on zk would give essentially the
same coefficients as a regression of imputed welfare µk on zk.
Another way to make this point is to consider the regression of Wk on zk:
Wk = zk + k. (12)
Let Wk = µk + k, with k the (idiosyncratic) prediction error. It follows that
µk = zk + k - k. (13)
If k is uncorrelated with zk the latter regression is no more problematic than the former.
Correlation between k and zk will be negligible if including zk in the consumption regression
(1) does not lead to significant improvement of the fit. This will be the case if the z variables
are constructed from census data, or more generally from the same data sources used in the
construction of the welfare indicators. These variables, if not included already, will have been
considered for inclusion in the consumption regression so that correlation between k and zk is
unlikely to be a problem.
On the other hand if one has (location) data zk from other sources and there is no practical
way to test how well it would have performed as an additional explanatory variable in the
consumption regression, then correlation between zk and k in equation (13) might compromise
15
the estimation of . A solution for this would be to instrument zk with census data.15
Finally, a household-level statistical relationship such as the expenditure equation (1) does
not preclude the existence of aggregate causal relationships. The expenditure model and the
information on the distribution of explanatory variables in the population (from the census) do
allow one to predict statistical relationships at aggregated levels. But as emphasized in Elbers,
et al., (2003, p. 356) the parameters of the expenditure model measure correlation not causality.
The predicted aggregate relationships are based on these correlations and therefore say nothing
about the existence or non-existence of causal aggregate relationships. The correlation patterns
found at the household level in the survey and census data could very well have sprung from
an aggregate causal relationship. As always in regressions: caveat emptor. It takes meticulous
diagnostics before a regression coefficient can be interpreted as marginal impact. The use
of imputed rather than true variables does not in any way simplify or compound that basic
difficulty.
5.2 Example
Consider a Kuznets-type regression of vk, the variance of log per-capita household consumption
in location k on average consumption y¯k, both estimated using the model in equation (11). We
take the distribution of both the education variable zkh and the error term ukh to be normal.
Hence we find
vk = var(zkh) + var(ukh)
y¯k = ezk+12 . vk
Assume that both var(zkh) and var(ukh) are heteroskedastic; for the sake of argument, let both
depend on the average level of education:
vk = var(zkh) + var(ukh) = (zk).
Differentiating, we find
dvk = (zk)dzk
1
dy¯k = y¯k(1 + (zk))dzk.
2
15Such instrumenting requires access to census data. However, the target regression will typically not be at the
household level but at higher levels of aggregation for which it may be easier to obtain the necessary census-based
data.
16
Hence,
dvk (zk)
= .
dy¯k y¯k(1 + (zk))
1
2
The slope of the Kuznets curve is ultimately determined by the heteroskedasticity function
(zk). Here we have calculated the slope using imputed variables. The main point to note is
that explanation of the Kuznets curve depends on explanation of the function (zk), which
itself has nothing to do with using imputed or true variables. If the use of imputed variables
has helped to obtain more information for the analysis of (zk), that is only an improvement.
6 Conclusions
Some of the oldest research activities in Development Economics involve the analysis of distribu-
tional indicators in relation to other indicators. The Kuznets curve, relating income inequality
to average income level, is a famous example. Another example is the never-ending debate on
the relationship between inequality and growth, with disagreement both on the sign of the rela-
tionship and the direction of causality. One of the main motives behind our poverty mapping
project was to compile more disaggregate and closely comparable estimates of distributional
measures to begin building a better empirical foundation for these discussions.
Because the estimated inequality and poverty measures are predicted values rather than
data, their use in regression analysis requires attention to econometric issues. We have discussed
how imputed distributional indicators can be used as explanatory variables in regressions. Our
conclusion is that imputed variables on the right-hand side can be regarded as a special kind
of instrumented variables and, if handled correctly, can be safely used in estimation. This is
demonstrated in regressions using data from Ecuador. In a canton-level regression of garbage
collection on imputed headcount poverty, the fact that explanatory variables were imputed had
a small but non-negligible effect on the estimated standard errors of the regression coefficients.
On the other hand, in a similar regression on local inequality the increase in error due to
imputation was far greater.
To calculate correct standard errors requires knowledge of the model error in the welfare
estimates used as explanatory variables. Our (limited) experience suggests that there may be
no simple parsimonious subsitute for the full covariance matrix of model errors. This need not
imply, however, that only those with access to census record data will be able to proceed. Those
calculating the welfare estimates can store the requisite information for use by downstream
researchers, along side the point estimates and their prediction errors. The most efficient way
to store the information, whether as matrix M, as vectors of simulated draws µr (step 3 in
section 3.1), or some other form, would depend on the context.
Using imputed variables on the left-hand side is trickier, but essentially such regressions
17
yield results no different from what would follow from similar regressions involving the true
welfare indicators. However, such regressions might suffer from problems of omitted variable
bias inherent in using imputed variables. We have discussed ways to avoid such problems.
We conclude that the scope for analysis of distributional issues at various levels of aggrega-
tion is vastly expanded by the availability of poverty maps.
References
[1] Demombynes, Gabriel, Chris Elbers, Jenny Lanjouw, Peter Lanjouw, Johan Mistiaen and
Berk Ozler (2003) "Producing an Improved Geographic Profile of Poverty: Methodology
and Evidence from Three Developing Countries," WIDER Discussion Paper no. 2002/39.
Forthcoming in Rolph van der Hoeven and Anthony Shorrocks (eds.) Growth, Inequality
and Poverty. (Oxford: Oxford University Press).
[2] Elbers, Chris, J.O. Lanjouw and Peter Lanjouw. (2003) "Micro-Level Estimation of Poverty
and Inequality," Econometrica. Vol. 71, no. 1, pp. 355-64.
[3] (2002)."Micro-Level Estimation of Welfare," Policy Research Working Paper no. WPS
2911. The World Bank.
[4] Greene, William H. (2000) Econometric Analysis. Fourth Edition. (New Jersey: Prentice-
Hall, Inc.)
[5] Murphy, Kevin M. and Robert H. Topel (1985) "Estimation and Inference in Two-Step
Econometric Models," Journal of Business & Economic Statistics, Vol. 3, no. 4, pp. 370-79.
18
Model Variance in Downstream Regression Coefficients and Approximations
Headcount and Canton-level Data
Standard Regression Output
Coefficient on population 0.332
Coefficient on the headcount, ^ -19.132
Estimated (robust) standard error of ^ 2.993
Estimated variance of ^ 8.959
Adjusted R2 0.66
Analysis of Estimated Model Variance in ^
Model Total variance Model share Percentage
variance in ^ in ^ (1)/(2) increase in
variance
(1) (2) (3) (4)
Using
`True' M 0.826 9.786 0.084 9.22
K-values
0.00 0.416 9.375 0.044 4.64
0.33 0.366 9.325 0.039 4.08
0.66 0.315 9.274 0.034 3.52
1.00 0.263 9.222 0.029 2.93
Max V 1.971 10.930 0.180 22.00
Single Value
Decomposition
5 terms 0.200 9.159 0.022 2.23
10 terms 0.520 9.479 0.055 5.80
15 terms 0.759 9.718 0.078 8.47
20 terms 0.803 9.762 0.082 8.96
Table 1. The effect of prediction error in explanatory variables in a regression of an
index of garbage collection on imputed headcount poverty. Source: authors' calculations.
19
Model Variance in Downstream Regression Coefficients and Approximations
GE(0.5) Inequality Measure and Canton-level Data
Standard Regression Output
Coefficient on population 0.413
Coefficient on the headcount, ^ 11.951
Estimated (robust) standard error of ^ 4.876
Estimated variance of ^ 23.771
Adjusted R2 0.52
Analysis of Estimated Model Variance in ^
Model Total variance Model share Percentage
variance in ^ in ^ (1)/(2) increase in
variance
(1) (2) (3) (4)
Using
`True' M 25.933 49.704 0.522 109.10
K-values
0.00 8.664 32.435 0.267 36.45
0.33 13.290 37.060 0.359 55.91
0.66 17.915 41.685 0.430 75.37
1.00 22.680 46.451 0.488 95.41
Max V 36.230 60.000 0.604 152.41
Single Value
Decomposition
5 terms 25.719 49.490 0.520 108.20
10 terms 25.827 49.598 0.521 108.65
15 terms 25.830 49.601 0.521 108.66
20 terms 25.849 49.620 0.521 108.75
Table 2. The effect of prediction error in explanatory variables in a regression of an
index of garbage collection on imputed GE(0.5) inequality. Source: authors' calculations.
20
Indigenous
Education Age of Household language Sole use Shared use
Measure Regression µ~ household household spoken in sewage sewage
head head head has no household connection connection
spouse
Headcount Household Household -0.36 <0.01 -0.33 0.14 -0.43 -0.06 0.07
Parroquia -0.32 0.05 0.01 0.26 -0.34 -0.17 0.07
Canton 0.20 0.05 <0.01 0.21 -0.22 -0.11 0.02
Parroquia Parroquia -0.62 0.23 0.02 0.32 -0.71 -0.51 -0.03
Canton Canton -0.69 0.40 -0.08 0.38 -0.78 -0.57 0.05
GE (0.5) Household Parroquia 0.11 <0.01 0.02 0.05 0.06 0.07 -0.11
Canton 0.07 0.01 0.04 0.10 -0.02 0.05 -0.12
Parroquia Parroquia 0.15 0.01 0.17 0.11 0.06 0.13 -0.27
Canton Canton 0.08 0.07 0.24 0.15 -0.08 0.19 -0.37
Table 3. Correlations between welfare indicators and household characteristics, used in their construction. Source: authors' calculations using unit records of
Ecuador population census, 1990.
21