DISCUSSION PAPER

Report No. WUDD 0

ANALYZING AN URBAN HOUSING SURVEY: ECONOMIC MODELS AND STATISTICAL TECHNIQUES

by Stephen Malpezzi

May, 1984

Water Supply and Urban Development Department
Operations Policy Staff
The World Bank

The views presented here are those of the author and they should not be interpreted as reflecting those of the World Bank. The author is indebted to David Hoaglin and Paul Velleman, and their publisher, Duxbury Press, for permission to use the computer code in Appendix C. This paper is a draft and will be revised. Comments are welcomed by the author. Michael Bamberger and David Hoaglin have provided helpful comments on this version of the paper; many of their comments will be incorporated in the next version. The author is on the staff of the Water Supply and Urban Development Department of the World Bank.

ABSTRACT

The purpose of this paper is to explain some common statistical procedures and their application to housing market analysis. Most of the emphasis is on the use of medians and other "order statistics," and on regression analysis. Examples are used to illustrate some of the techniques, using actual data from an Egyptian housing survey as well as "manufactured" data.

This document was prepared to assist the Central Bureau of Statistics and the Ministry of Works and Housing of Kenya in the analysis of an urban housing survey. A companion paper is Planning an Urban Housing Survey: Key Issues for Researchers and Program Managers in Developing Countries (Water Supply and Urban Development Department Discussion Paper No. U'DD-44, November 1982). This paper is written for housing market analysts with some previous exposure to basic statistics.

TABLE OF CONTENTS

PREFACE
List of Tables
List of Figures

PART I: STATISTICAL TECHNIQUES
  Section I.1 Introduction: Two Purposes of Statistics
  I.2 Medians and Related Order Statistics
  I.3 Housing Market Analysis Using Regression Techniques
    I.3.1 The Logic of Regression Analysis
          Simple Bivariate Regression Versus Multiple Regression
    I.3.2 An Example Using Developing Country Data
    I.3.3 Residuals
    I.3.4 Hypothesis Testing
          R-Squared
          Standard Error of the Equation
          Standard Errors of Regression Coefficients
          Confidence Intervals: Coefficients
          Confidence Intervals: Predicted Values
    I.3.5 The Role of Functional Form (Transformations) in Regression Analysis
          Power Terms
          Logarithms
          Dummy Variables

PART II: ECONOMIC MODELS FOR HOUSING MARKET ANALYSIS
  Section II.1 Introduction
  II.2 Composite Demand Models
    II.2.1 Measurement
          Measuring Housing Consumption
          Measuring Housing Prices
          Measuring Incomes
          Demographic Variables
    II.2.2 Integrating the Effects of Tenure Choice and Mobility on Housing Demand
    II.2.3 Tying It All Together: Examples of Demand Equations Using Egyptian Data
  II.3 Introduction to Hedonic Price Indexes
    II.3.1 Theoretical Basis
    II.3.2 An Example

PART III: COMPUTATIONAL TECHNIQUES
  Section III.1 Preparing the Data for Analysis
  Section III.2 Computational Notes

Appendix A - Kenyan Housing Survey
Appendix B - Introduction to Logarithms and Elasticity
Appendix C - Fortran Subroutines for Order Statistics
Appendix D - Suggestions for Further Reading
Appendix E - Data Appendix for Simple Examples
Appendix F - Outline of Suggested Tables for an Urban Housing Survey Report
References

PREFACE

This paper has been prepared for Kenya's Central Bureau of Statistics (CBS) and Ministry of Works and Housing (MWH) to assist them in preparing a report on the housing situation in Kenya's urban areas, using the 1983 Urban Housing Survey (the questionnaire is appended). This paper is a companion to an earlier paper, Planning an Urban Housing Survey: Key Issues for Researchers and Program Managers.1/ That paper discussed general goals of housing market analysis and some common problems encountered, and suggested questions for the prospective survey. Now the survey questionnaire has been completed, and this paper addresses those topics again, but this time with reference to the actual survey, and suggests concrete solutions based on the information contained therein.

This paper is divided into three parts. Part I focuses on statistical techniques, with particular reference to using medians and regression analysis. Part II concentrates on developing several simple but useful economic models of the housing market which can be estimated with data from the Kenyan survey. Part III gives some handy computational hints which can be used in actually estimating the models described in the first two parts.

The three parts of the paper are related, by design. In fact, dividing the paper up into statistical and economic parts is convenient but somewhat artificial. There will necessarily be some overlap, especially between the first two parts, so read both parts together. For example, the topic of functional form is treated in several places: the general discussion in Part I and the specific examples of Part II.
Although much of the material covered in this report is well known to statisticians (especially the contents of Part I), basic material is included throughout alongside more advanced material in order to make the paper as self-contained as possible. Additional references on many of these topics can be found in the papers listed in Appendix D.

1/ Malpezzi, Bamberger and Mayo (1981).

A few comments are in order regarding the data used in the examples. Some examples use real survey data from Cairo, Egypt, so that one can see how these techniques work with actual data. Other examples use hypothetical data which have been constructed to exaggerate certain relationships which are featured in the text. Results from Kenyan data will differ considerably from both the Egyptian data and the manufactured data; we want to emphasize that the examples illustrate techniques, not expected results. The data used for many of the examples are presented in the appendix, and can be used to replicate the examples.

List of Tables

1. Summary Statistics from Cairo Sample
2. Summary Statistics from Cairo Sample, by Income Quintile
3. Simple Regression of Log Rent on Log Income, Cairo
4. Regression Examples Using Manufactured Data with Large and Small Variance
5. F Table Showing .01 and .05 Probability Levels
6. The t Distribution and the Normal Distribution
7. Regression Example Using Manufactured Data: Rent and Household Size
8. Dummy Variable Coding Scheme
9. Regression Example Illustrating Use of Dummy Variables
10. Measures of Housing Consumption
11. Cairo Renter Demand Equations Using Log of Gross Rent
12. Cairo Owner Demand Equations Using Log of House Value
13. Simple Demand Equations for Renters and Owners
14. Cairo Renter Hedonic Equation

List of Figures

1. Linear Plot of Rent by Income (Cairo)
2. Logarithmic Plot of Rent by Income (Cairo)
3. Histogram of Residuals from Simple Renter Demand Equation (Cairo)
4. Plots of Manufactured Data With Large and Small Variance
5. Plot of Manufactured Data Illustrating Hypothetical Non-linear Relationships Between Rent and Household Size

PART I: STATISTICAL TECHNIQUES

Section I.1 Introduction: Two Purposes of Statistics

Information about housing conditions is costly to collect, and difficult to use intelligently unless we have some way to reduce the information into manageable form. For example, someone charged with designing a housing program for a particular town will want to know the relationship between people's incomes and how much they are willing to pay for housing. Since collecting information for every household in the town is expensive, we rely upon a sample of households to collect this kind of information. But even after a well-chosen sample is surveyed we have more raw information than we can comfortably digest -- what can you make of the rents and incomes of a thousand or even a hundred people? But you can easily compute the average income and average rent of your sample, and these two numbers give you more usable information than the hundreds of numbers they were derived from.

A statistic -- an average, median, regression coefficient, or whatever -- summarizes the information in a sample, and can be used for two purposes. First, statistics are descriptive -- a way to reduce a lot of information in a sample to one or perhaps several pieces of information, which can be more easily absorbed by the analyst.
Second, statistics can be used to test hypotheses, that is, establish the probable truth or falsity of certain propositions, given the information in the sample.

The two purposes are, of course, related. Suppose we divide up our sample into low and high income groups, and compute the average rent of each sub-sample. These two new pieces of information (1) summarize the rents paid by each group (description) and (2) permit a test of the admittedly simple hypothesis that higher income people spend more on housing (inference).

This paper will explain some common statistical procedures and their application to housing market analysis in some detail. Most of the emphasis will be on the use of medians and other "order statistics", and on regression analysis. Much of the material will be familiar to many readers, especially to statisticians, but the paper will go over the basics in order to make the discussion somewhat self-contained. References are given where appropriate for those who want to pursue these topics in more detail. Examples will be used to illustrate some of the techniques, using actual data from an Egyptian housing survey.1/

1/ See Mayo et al., 1982, for a description of the data.

Section I.2 Medians and Related Order Statistics

Order statistics are statistics which are based on ranks or order by some criterion variable.2/ The most common order statistic is the median. Like the arithmetic mean, or average, it is a measure of central tendency or location, but it has several desirable properties which will be briefly discussed.

2/ They are also called non-parametric statistics. See Blalock (1960), Chapter 5.

Suppose we have a small sample of five households, with rents of 100, 120, 120, 150 and 250 shillings, respectively. The average rent of this sample is, of course, 148 shillings. The median rent is the rent paid by the "middle household", or 120 shillings. In general, the median of any variable is the value of that variable for which half the sample values are above the median and half are below.3/ It is computed by (1) sorting the sample on the variable of interest, (2) computing one-half of the sample size (call it N/2), and (3) reporting the value of that variable for the "N/2th" observation. Other order statistics can be computed in a similar fashion, e.g. quartiles are computed using N/4, quintiles using N/5, deciles using N/10, percentiles using N/100 and so on. The median is also the second quartile, the fiftieth percentile, and the fifth decile.

When should we use the arithmetic mean, and when should we use medians? The short answer is, the mean is superior when the data are normally distributed;4/ the median is better when data are best approximated by some other distribution. Rents, house values, and incomes are examples of variables that are not, in general, normally distributed, but are truncated at zero and have more very large values than do normally distributed variables. Because of this, the means of these variables can be unduly affected by a few extreme observations; the median is much less sensitive to the presence of large values: we say the median is more robust than the mean. It is a better representation of the typical value in the sample.

Other order statistics can be computed in addition to the median. Two common statistics are the first quartile and the third quartile. To compute them, rank the data and compute the sample size, N. Divide the data into fourths, then proceed as follows: the value of the "N/4th" observation is the value of the first quartile; one-fourth of the data have lower values, three-fourths have higher values. The "2N/4th" observation is of course the median, discussed above. The "3N/4th" observation is the third quartile, for which three-fourths of the data have lower values and one-fourth have higher values.
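The five-household example above can be checked with a few lines of code. The following is a minimal sketch using Python's standard `statistics` module; the "miscoded" sample, in which the largest rent is recorded as 2500 instead of 250, is a hypothetical addition meant only to illustrate why the median is called robust.

```python
from statistics import mean, median

# Rents (in shillings) of the five sample households from the text.
rents = [100, 120, 120, 150, 250]

mean_rent = mean(rents)      # arithmetic average: 148 shillings
median_rent = median(rents)  # rent of the "middle household": 120 shillings

# Hypothetical miscoding (not from the paper): the largest rent is
# entered as 2500 instead of 250.
miscoded = [100, 120, 120, 150, 2500]
mean_miscoded = mean(miscoded)      # pulled far upward by one observation
median_miscoded = median(miscoded)  # unchanged: still the middle household

print("original sample: mean", mean_rent, "median", median_rent)
print("miscoded sample: mean", mean_miscoded, "median", median_miscoded)
```

A single extreme value quadruples the mean here while leaving the median untouched, which is the sense in which the median better represents the typical household.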
3/ If the number of sample observations is odd the median is more precisely the (N/2 + .5)th observation; if N is even a common procedure is to average the two middle observations, the (N/2)th and the (N/2 + 1)th.

4/ For a review of the normal distribution see any statistics text.

The first and third quartiles give a good idea of the spread of the distribution; half of the data lie between these two values. Their difference is often computed and referred to as the "interquartile range," and can be thought of as the order statistic analogous to the more familiar standard deviation.

Medians and other order statistics can be very useful in cross-classifications. An example from the Egyptian housing survey will illustrate the idea (Table 1). The median rent of our sample of Cairo renters is 8 Egyptian pounds. The arithmetic average is 13 pounds but this overstates the rent of the "typical" unit because the average is heavily influenced by a few extreme observations (up to 224 pounds). Notice that the mean is approximately equal to the third quartile, not the median. In other words, the mean is not the best estimate of the rent paid by the typical (i.e. middle) consumer.

Also, we can illustrate that medians are more robust than averages. Suppose we drop the top five observations (57, 73, 109, 156 and 224 pounds) and recompute. The mean is 11 pounds (a difference of 15 percent) but the median remains stable at 8 pounds. Small changes in the sample do not affect the median as much as the mean. We can see similar patterns with other distributions, such as rent-to-income ratios, also included in Table 1.

Then, suppose we want to know how rents are related to total income. One effective method to use is to: (1) divide the sample into groups based on income ranks (e.g. quartiles, quintiles, deciles or whatever) then (2) compute the median within each group.
We'd like to have at least thirty observations in each cell to ensure reliable results, so we divide the data into quintiles (using deciles resulted in sample sizes of 20 in several cells). Then we compute the median, and the first and third quartiles, within each quintile. Table 2 presents these results.

Table 1: Summary Statistics from Cairo Sample

                    Income    Gross Rent    Rent-to-Income
Mean                  115         13             .18
Median                 87          8             .10
First Quartile         59          6             .06
Third Quartile        129         14             .16

Note: Definitions: Mean is the arithmetic average. Median is the mid-point of the distribution of rents. First Quartile is the rent paid by the household which is at the twenty-fifth percentile (i.e., one-fourth of all households pay less, three-fourths pay more). Third Quartile is the seventy-fifth percentile (three-fourths pay less, one-fourth pay more). Also, notice that the average (or median or quartile) of each household's rent-to-income ratio is not the same as the ratio of the sample average (median, etc.) rent to the sample average (median, etc.) income.

Now add two refinements. Obviously, rents increase with income, but we'd like better information on how fast they go up relative to total consumption. One way to do this is to look at rent-to-income ratios rather than rents. If the ratio goes up with income, then rent goes up faster than income; if the ratio is constant, rent goes up at the same rate as income; if the ratio decreases as income goes up, then rents go up more slowly than income (although they still go up). In economic jargon, these three cases correspond to elastic demand, demand of unit elasticity, and inelastic demand, respectively.

The second refinement is this: compute the first and third quartiles of the rent-to-income ratio as well as the median (second quartile) within each income quintile. This gives us a good idea of the distribution of the ratio, i.e. how much it varies in each group. Table 2 presents these results for our example data. Several interesting patterns emerge.
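The within-quintile calculation described above can be sketched in code. This is an illustrative sketch only: the household data are manufactured (rent is set to 2 times the square root of income, so demand is inelastic by construction), the function name `summarize_by_income_group` is hypothetical, and `statistics.quantiles` uses an interpolation rule that may differ slightly from the "N/4th observation" rule given in the text.

```python
from statistics import median, quantiles

def summarize_by_income_group(households, n_groups=5):
    """Rank (income, rent) pairs by income, split them into n_groups
    groups of nearly equal size, and summarize each group."""
    ranked = sorted(households, key=lambda h: h[0])
    size = len(ranked)
    results = []
    for g in range(n_groups):
        group = ranked[g * size // n_groups:(g + 1) * size // n_groups]
        ratios = [rent / income for income, rent in group]
        q1, _, q3 = quantiles(ratios, n=4)  # quartile cut points of R/I
        results.append({
            "income_range": (group[0][0], group[-1][0]),
            "n": len(group),
            "median_rent": median(rent for _, rent in group),
            "ratio_q1": q1,
            "ratio_median": median(ratios),
            "ratio_q3": q3,
        })
    return results

# Manufactured data (an assumption, not survey data): 100 households
# with incomes 10, 12, ..., 208 and rent = 2 * sqrt(income).
households = [(income, 2.0 * income ** 0.5) for income in range(10, 210, 2)]

table = summarize_by_income_group(households)
for row in table:
    print(row["income_range"], row["n"],
          round(row["median_rent"], 1),
          round(row["ratio_q1"], 3),
          round(row["ratio_median"], 3),
          round(row["ratio_q3"], 3))
```

With these manufactured data the median rent rises from one income group to the next while the median rent-to-income ratio falls, which is the inelastic-demand pattern described in the text.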
Table 2: Summary Statistics from Cairo Sample, By Income Quintile

                                        Gross Rent                 Rent-to-Income Ratio (R/I)
Income Quintile      Number of     First              Third       First              Third
(income range)      Observations   Quartile  Median   Quartile    Quartile  Median   Quartile
Fifth  (150-797)         49           8        20        27         .04      .06       .11
Fourth (100-149)         53           7         8        13         .06      .08       .11
Third  (75-99)           46           6         9        13         .07      .10       .15
Second (54-74)           49           6         8        11         .09      .13       .17
First  (0-53)            49           4         6         9         .10      .14       .28
Total Sample            246           6         8        14         .06      .10       .16

First, as everyone expects, rents increase with income. In particular, the typical rent in the highest quintile is about twice that in the other quintiles. Second, notice that the median rent in the fourth quintile is actually a little lower than that in the third quintile, but that the difference (1 pound) is small relative to the spread of the distributions, as measured by the differences between first and third quartiles of rent within the third and fourth income quintiles (these numbers are 13-6=7 pounds, and 13-7=6 pounds respectively). This illustrates an important point: with real world data, careful analysis requires looking at the spread of distributions in addition to point estimates.

The key finding is that the rent distribution is relatively flat in the second, third and fourth quintiles. Most of the differences in median rents are at the very top and very bottom of the income distribution. Differences between income class medians are small relative to the spread within classes. In other words, the bivariate relationship between rent and income is positive, but income alone does not explain much of the observed variation in rents paid.

The last three columns of Table 2 illustrate that even though rents go up with income, the ratio of rent to income declines. In particular, the poor in Cairo often pay large proportions of their income on rent; a fourth of the poorest income class pay 28 percent or more.
These kinds of results show that some common rules of thumb about affordability are contradicted in the Cairo market.4/

4/ Policy implications of particular results are not discussed in this paper. Forthcoming papers from a research project on "Housing Demand and Finance in Developing Countries" conducted by the World Bank's Water Supply and Urban Development Department will address these issues.

Section I.3 Housing Market Analysis Using Regression Techniques

One limitation of the methods described above is that order statistics are difficult to apply to multivariate problems. Suppose we hypothesize that willingness to pay depends on other variables as well as income or total consumption. For example, we might expect that larger families consume more housing; that higher income families consume more housing; but also that larger families have higher incomes because they have more wage earners. Regression analysis permits us to estimate the separate effects of household size and income from a sample in which all three variables are correlated.5/ It also permits us to test hypotheses about the relative importance of these separate effects.6/

5/ If the independent variables income and household size are highly correlated, it is hard to separate effects even with this technique, but the standard errors of the regression coefficients will warn us of the problem. This will be discussed below.

6/ Hypothesis testing with regression analysis assumes normality, which we stated above is not always realistic; but it turns out that these tests are still approximately correct with the kind of truncated distributions we encounter in economic analysis. For a more detailed discussion, see Theil (1971) pp. 615 ff.

The next few pages will develop some of the ideas behind regression analysis by starting with the simplest problem, one dependent variable and one independent variable, and then extending the technique to several independent variables. After the basics of the statistical technique are covered, we will discuss the actual specification of regressions for the coming report, in Part II.

I.3.1 The Logic of Regression Analysis

Consider once again the relationship between rent and income. Figure 1 shows a plot of our example data from Cairo, Egypt. If there were no relationship between rents and income, the plotted points would, of course, be scattered across the page in random fashion. If there were a very strong positive relationship, the plotted points would mostly fall near a line with positive slope drawn on the page (that is, near a line which represents increasing values of the rent variable as income increases). It is not surprising that with real world data, we often get a pattern somewhat in between these two extremes: the plot will show some tendency for large rents and incomes to be associated, but the pattern will usually not be very pronounced.

[Figure 1: Linear Plot of Rent by Income (Cairo)]

Looking at Figure 1 we see that there are some points plotted in the upper right-hand corner (high rent, high income) but no points in the upper left-hand corner (high rent, low income). Most of the points are bunched in the lower left, and any pattern is hard to discern. This is related to the problem we discussed above: rents and incomes typically have a skewed distribution, so that a few outlying observations (especially high income observations) can obscure what's going on in the rest of the data. We don't want to just drop the outlying observations (unless we think they are so unrealistic that they are mistaken or miscoded responses) because these observations contain valuable information.
A common solution to this problem is "reexpression" or "transformation" of the original rent and income variables.7/ What we'd like to do is find a way to compute new variables which (1) contain essentially the same information as the original variables, that is, how fast rents increase with income, but (2) more closely approximate a normal distribution and are therefore better candidates for statistical analysis.

7/ See Tukey (1977), Chapter 4 for a more detailed discussion of transformations.

A common transformation used in economic analysis is the natural logarithm.8/ Logarithms have the desirable property that they contain information about the original variable -- without exception, the larger the original variable, the larger its logarithm -- while in most cases the log of rent or income more closely approximates the normal distribution than the original untransformed variable.9/ Figure 2 presents a plot of logarithms of rents and incomes for the Cairo sample. Notice that the pattern of positive association is more pronounced in Figure 2 than in Figure 1.

8/ "Natural" logarithms are logarithms using the base e (approximately 2.718). "Common" logarithms use the base 10. See Appendix B for details; we always work with natural logarithms.

How can we summarize the information contained in these plots? To state that rent increases with income does not significantly extend the frontiers of human knowledge. What we want to know is, by how much does it increase? If we drew a line through the points in Figure 2, the slope of that line would be a number that would tell us how much the log of rent went up as the log of income increased by one, or in terms of the original untransformed rent and income, the percentage increase in rent associated with a one percent increase in income (see Appendix B). Regression analysis is nothing more than a technique for fitting the best line through a collection of points like those in Figure 2.10/
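To make the interpretation of the log-log slope concrete, the sketch below uses manufactured data in which rent is an exact power function of income. The constants `SCALE` and `ELASTICITY` are hypothetical values chosen for illustration, not estimates from the Cairo sample. The sketch checks that taking logs preserves the ordering of rents and that the slope between any two points in log terms equals the exponent, which is the elasticity.

```python
import math

# Hypothetical constants (assumptions for illustration only).
SCALE = 1.5
ELASTICITY = 0.4

incomes = [50, 100, 200, 400]
rents = [SCALE * income ** ELASTICITY for income in incomes]

log_incomes = [math.log(income) for income in incomes]  # natural logs
log_rents = [math.log(rent) for rent in rents]

# Larger rents always have larger logarithms, so the ordering is preserved.
order_preserved = log_rents == sorted(log_rents)

# Slope of the line joining successive points in log-log terms.
slopes = [
    (log_rents[k + 1] - log_rents[k]) / (log_incomes[k + 1] - log_incomes[k])
    for k in range(len(incomes) - 1)
]

# Percentage change in rent when income rises by one percent.
pct_change = (1.01 ** ELASTICITY - 1) * 100

print(order_preserved, [round(s, 6) for s in slopes], round(pct_change, 3))
```

Every slope equals the exponent, and a one percent rise in income raises rent by just under 0.4 percent, so a log-log slope below one corresponds to inelastic demand.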
9/ There are other advantages which are discussed in Appendix B.

10/ Technically, "best" means minimum variance among all unbiased linear estimators. Estimates always have some error associated with them; they are estimates of an unknown "true" parameter. Unbiased means that although our results for any given sample have some error, if we look at many samples (e.g. many towns) these errors tend to cancel out. Minimum variance means that there is no technique that we could use to fit an unbiased line which would usually come closer to the true parameter. For more details and proofs, see any statistics textbook.

[Figure 2: Logarithmic Plot of Rent by Income (Cairo). Vertical axis: log monthly rent; horizontal axis: log monthly income.]

Simple Bivariate Regression Versus Multiple Regression

The example we have used has been deliberately limited to one dependent variable and one independent (right hand side) variable, because we can illustrate the principles involved with graphs and simple algebra. However, it is straightforward to extend these techniques algebraically to more than one independent or explanatory variable. In fact, one of the chief advantages of regression analysis is that it is a multivariate technique. With this technique it is possible to sort out the separate effects of several explanatory variables, even when the explanatory variables themselves are interrelated.

For example, suppose we started out with a sample of 150 renters in a particular town, and wanted to estimate the effects of (1) income, (2) household size, and (3) age of household head upon housing consumption. Using cross tabulations, we could divide the sample into, say, 5 income groups, 4 household size groups, and 3 age-of-household-head groups. Then we could compute the mean or the median rent in each cell, and examine the results to get estimates of the effects of these variables on consumption. But there are 5 * 4 * 3 = 60 cells!
Many cells will be empty, and most will have only a few observations. Our means or medians will be extremely unreliable. Regression techniques get around this problem. With income entered as a continuous variable, 3 household size dummy variables, 2 age of head dummy variables, and a constant term, we can run a regression which estimates the separate effects of each variable but which has 150 - (1 + 3 + 2 + 1) = 143 degrees of freedom. Our estimates will be more reliable, using the same data.

For the rest of the paper we will skip back and forth between simple bivariate examples and multiple regression examples. Although the multiple regression examples can't be graphed as easily, there are no essential differences between simple and multiple regression. Tests or procedures which we illustrate with simple examples can be applied straightforwardly to multiple regression models.

Section I.3.2 An Example Using Developing Country Data

We will not discuss the details of how to compute a regression coefficient here; they can be found in any statistics text and in manuals for computer packages like SPSS. Table 3 is a photocopy of the output from a regression computed using the same Cairo data, by the computer package SAS. Most likely, CBS will compute these regression results using SPSS, a similar type of package. Even though this is a simple one-variable regression, computer packages print out a lot of numbers -- we will focus on the most important ones: parameter estimates, standard errors, t-statistics for the hypothesis that the coefficient is zero, and R-squared. To fit that line through the plot we need only the parameter estimates for the log of income variable, and the intercept.
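Although the paper leaves the computation to statistics texts and packaged programs, the calculation for a single explanatory variable is short enough to sketch. The code below uses the standard least-squares formulas for one regressor (slope equals the sum of cross-deviations divided by the sum of squared deviations of x; the intercept makes the line pass through the means). The function name `fit_line` and the six income and rent values are hypothetical, and this is not a reproduction of what SAS or SPSS does internally.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for one explanatory variable.
    Returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # line passes through the means
    return intercept, slope

# Hypothetical monthly incomes and rents (not the Cairo data).
incomes = [40, 60, 85, 120, 200, 400]
rents = [5, 7, 8, 10, 14, 20]

# Regress log rent on log income, as in the text.
log_i = [math.log(v) for v in incomes]
log_r = [math.log(v) for v in rents]
a_hat, b_hat = fit_line(log_i, log_r)

# Residuals: actual log rent minus the log rent predicted by the line.
residuals = [yr - (a_hat + b_hat * xi) for xi, yr in zip(log_i, log_r)]

print(round(a_hat, 3), round(b_hat, 3))
```

When an intercept is included, least-squares residuals sum to zero (up to rounding error), which is a quick check that the fit was computed correctly.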
Put differently, when we estimate this regression we are assuming the following model of housing demand:

(1)   log(R) = a + b * log(I) + u

where log(R) is the log of rent, log(I) is the log of income, a and b are regression coefficients to be estimated, and u is the "residual", or the difference between the value of log(R) predicted by our estimated a and b for a given observation (i.e. for a given value of log(I)), and the actual value of log(R) for that observation. For now, assume that this is the correct or true model. Later we will develop a better model of housing demand.

Table 3: Simple Regression of Log Rent on Log Income, Cairo

DEP VARIABLE: LMGRENT   LOG MONTHLY RENT

                       SUM OF        MEAN
SOURCE      DF        SQUARES       SQUARE     F VALUE     PROB>F
MODEL        1      24.289792    24.289792      49.052     0.0001
ERROR      244        120.824     0.495182
C TOTAL    245        145.114

ROOT MSE    0.703692     R-SQUARE    0.1674
DEP MEAN    2.223925     ADJ R-SQ    0.1640
C.V.        31.64187

                    PARAMETER     STANDARD     T FOR H0:                    VARIABLE
VARIABLE    DF       ESTIMATE        ERROR     PARAMETER=0    PROB > |T|    LABEL
INTERCEP     1       0.371449     0.268277         1.385        0.1674      INTERCEPT
LMINCOME     1       0.413728     0.059072         7.004        0.0001      LOG MONTHLY INCOME

Every observation in the sample has its own values of log(R) and log(I), and of u. The two numbers a and b are fixed. Note that we have estimated the residuals as well as the coefficients; the residuals will be used to measure "how good" our regression coefficients are since they are used to compute standard errors and R-squared. But first we will discuss the coefficients.

One of the advantages of regression analysis is that coefficient estimates permit a prediction of the dependent variable [log(R)] given the level of the independent variable or variables [here log(I)].
Continuing our Cairo example, if a household has an income of 85 Egyptian pounds per month, we can predict the log of rent as:

predicted log(R) = .371449 + .413728 * log(85) = 2.20950

and to get predicted rent we take the antilog of this result ("exponentiate"):

predicted rent = exp(2.20950) = 9.11 Egyptian pounds.

Recall our example of median rents computed by income quintiles, from above. A family with an income of 85 pounds is in the third quintile, whose median rent is 10.5 pounds. 10.5 pounds is thus the predicted rent for an 85 pound income from the median procedure, and 9.1 pounds is the predicted rent for the same household from the regression procedure. Which is better? The median procedure is more resistant to mistakes from miscoded data, and is easier for non-statisticians to understand. The regression procedure permits more "fine tuning" of the estimates; the impact of, say, a 10 pound increase in income on housing consumption is difficult to compute using the median procedure; with the regression procedure it's easy.11/

11/ For example, the impact of a 10 pound increase from 85 to 95 pounds is: exp[.371 + .414 * log(95)] - exp[.371 + .414 * log(85)] = 9.55 - 9.11 = .44 pounds.

Section I.3.3 Residuals

Turning now to the estimated residuals (the "u" in equation 1), note that these are estimates of the error in the regression equation. They are just, for each sample household, the difference between actual house rent and the rent predicted by our estimated equation. Suppose we have two sample households, each with 85 Egyptian pounds income; household 1 lives in a house that rents for 10 pounds per month, and household 2's house rents for 8 pounds. Our estimated rent is the same for both households, namely 9.11 pounds, so we've made estimated errors of .89 and -1.11 pounds respectively.12/ There are as many residuals as there are sample households, 246.
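The prediction and residual arithmetic above can be reproduced directly from the two coefficients reported in Table 3. The sketch below is illustrative; the function name `predicted_rent` is hypothetical. Note that with the unrounded coefficients the effect of moving from 85 to 95 pounds works out to roughly .43 pounds, slightly different from the .44 in footnote 11, which appears to be computed from rounded figures.

```python
import math

# Coefficient estimates as reported in Table 3.
A_HAT = 0.371449  # intercept
B_HAT = 0.413728  # coefficient of log monthly income

def predicted_rent(income):
    """Predict log rent from log income, then exponentiate."""
    return math.exp(A_HAT + B_HAT * math.log(income))

rent_85 = predicted_rent(85)  # about 9.11 pounds
rent_95 = predicted_rent(95)

# Effect of a 10 pound increase in income, from 85 to 95 pounds.
increase = rent_95 - rent_85

# Residuals for the two households in the text (actual minus predicted).
residual_1 = 10 - rent_85  # household 1 pays 10 pounds
residual_2 = 8 - rent_85   # household 2 pays 8 pounds

print(round(rent_85, 2), round(increase, 2),
      round(residual_1, 2), round(residual_2, 2))
```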
These 246 residuals are 246 pieces of information that tell us how well our equation fits the data. From these residuals are calculated many familiar statistics that summarize that information.13/ We will discuss briefly (1) three measures of how well the equation does overall: R-squared, the standard error of the equation, and the F-test for the equation; and (2) how well we estimate individual coefficients: the standard error of a coefficient, and the tests and confidence intervals built from it.

First, as background, we discuss some properties of errors and their estimates (residuals). The most common assumption in statistics is that errors are normally distributed with a zero mean. If this assumption is correct, many tests exist which can be used for testing various hypotheses about the equation, e.g., whether a particular coefficient is statistically different from zero.14/

In many examples we know this assumption is not strictly correct. Consider our housing expenditure equation. Since rent can never be less than zero, there is a bound on the size of negative errors (actual rent less than estimated rent), but there is no such bound on positive errors (actual greater than estimated), at least in principle. Normal errors would be unbounded on both sides. Fortunately, many studies have demonstrated that if the errors are only approximately normally distributed, the usual tests and statistics are approximately correct,15/ and we can justify using them.

12/ A subtle point is that the residuals are estimated errors because they are calculated from estimated coefficients. Even given the correct model, we never know the true values of the regression coefficients but calculate estimates. If we knew the true a and the true b, we'd know the true errors.

13/ Details of calculation can be found in any statistics textbook. Many books also contain more advanced ideas on how to use these residuals, e.g., how to check the assumption of constant variance, exact tests for normality, etc.
By approximately normal, we mean that the errors (and hence their estimates) are bell-shaped when plotted as in Figure 3, i.e., most residuals are small in absolute magnitude and cluster around the central value of zero. This can be checked by doing plots similar to Figure 3. Note that even if the residuals are erratically distributed, regression coefficients still have desirable properties, if we have the correct model. That is, we still obtain the "best" estimates; we just can't apply the usual tests of significance.

14/ We discuss what this means below.

15/ See Theil, pp. 615 ff.

FIGURE 3
HISTOGRAM OF RESIDUALS FROM SIMPLE RENTER DEMAND EQUATION (CAIRO)
(Frequency bar chart; vertical axis: frequency; horizontal axis: residual midpoint.)

Section 1.3.4 Hypothesis Testing

Plots of residuals like those in Figure 3 yield useful information about the regression. But remember that there were originally two purposes of statistical analysis: to somehow reduce a large and unwieldy number of pieces of information from our sample into a few manageable statistics that convey the essential information contained in the sample, and to make formal tests of specific hypotheses.

The first step in hypothesis testing is, of course, to carefully formulate a testable hypothesis: for example, "the coefficient of income is not discernibly different from one" or "the regression equation explains more of the variation in the dependent variable than would be expected by mere chance." We will have more to say about the exact specification of hypotheses as we go through examples. The residuals contain the information that can be used (with the coefficients) to test hypotheses.
But whereas the coefficients are only a few numbers, there are as many residuals as there are sample observations. Fortunately, we can form summary statistics of these estimated residuals themselves, and use these for our tests. The next few pages discuss some commonly used statistics computed from the residuals: R-squared, the standard error of the overall equation, and the standard errors of the estimated coefficients. Again, formulas and computational details can be found in any statistics text; the purpose of this description is to provide an intuitive understanding.

R-Squared. Consider the two plots of two different samples in Figure 4, with their estimated regression lines drawn in. (These plots are constructed from manufactured data in order to exaggerate the effects we want to talk about.) The two lines have almost the same slope and almost the same intercept, but in 4-A most of the data points are clustered around the line, while in 4-B the data do not "fit" the line as well. This is true despite the fact that when you estimate the regressions for each sample you get almost the same regression coefficients.16/ Table 4 displays the regression equations used to fit the two lines in the plots. Notice that the estimated slopes and the estimated intercepts are roughly in agreement, but the R-squared statistics are very different (.81 versus .28). R-squared, or the square of the multiple correlation coefficient, is a statistic that gives a measure of the goodness-of-fit of the regression equation. R-squared varies between zero and one, and can be interpreted as the proportion of the variation in the dependent variable (here log of rent) that can be "explained" by that regression equation. That is, if all the data points lie on the line -- if there is no error in our regression -- then the difference between our predicted values and the actual values is zero.
R-squared is calculated as the difference between 1 and the ratio of the sum of squared errors (often abbreviated SSE) to the sum of squared deviations of the dependent variable from its mean (often abbreviated SST, for sum of squares total); when the errors are all zero, i.e., all points fall on the line, a "perfect fit," R-squared is one. If the errors are large, the fit is poor, the ratio SSE/SST approaches 1, and R-squared approaches zero.

16/ They are the same because we made the data up that way for the example.

The Standard Error of the Overall Equation. With real world data, R-squared will never be exactly zero or exactly one. Even if we pick nonsense data with no relationship -- even if we regress completely random numbers on each other -- we will always get some numerical coefficient estimates and some positive (though small) R-squared. If R-squared is small, is that because our data are unrelated, or because the relationship we are measuring is a true one but is rather weak? In other words, what is a low R-squared? Fortunately we can construct a formal test of the hypothesis "this regression equation result is due merely to chance: there is no real statistically discernible relationship present in these data."

TABLE 4-A
REGRESSION EXAMPLE USING MANUFACTURED DATA
DEPENDENT VARIABLE WITH SMALL VARIANCE

DEP VARIABLE: Y1

                   SUM OF        MEAN
SOURCE     DF     SQUARES      SQUARE    F VALUE    PROB>F
MODEL       1     491.942     491.942    414.040    0.0001
ERROR      98     116.439    1.188151
C TOTAL    99     608.381

ROOT MSE   1.090024    R-SQUARE   0.8086
DEP MEAN   6.043012    ADJ R-SQ   0.8067
C.V.       18.03775

                 PARAMETER    STANDARD     T FOR H0:
VARIABLE   DF     ESTIMATE       ERROR   PARAMETER=0    PROB > |T|
INTERCEP    1     1.715451    0.238984         7.178        0.0001
X           1     0.785114    0.038584        20.348        0.0001

TABLE 4-B
REGRESSION EXAMPLE USING MANUFACTURED DATA
DEPENDENT VARIABLE WITH LARGE VARIANCE

DEP VARIABLE: Y2

                   SUM OF        MEAN
SOURCE     DF     SQUARES      SQUARE    F VALUE    PROB>F
MODEL       1     468.433     468.433     38.501    0.0001
ERROR      98    1192.331   12.166642
C TOTAL    99    1660.764

ROOT MSE   3.488071    R-SQUARE   0.2821
DEP MEAN   6.123751    ADJ R-SQ   0.2747
C.V.       56.95972

                 PARAMETER    STANDARD     T FOR H0:
VARIABLE   DF     ESTIMATE       ERROR   PARAMETER=0    PROB > |T|
INTERCEP    1     1.900860    0.764748         2.486        0.0146
X           1     0.766124    0.123470         6.205        0.0001

The way the test is computed makes use again of the relationship between the sum of squared errors and the total sum of squares of the dependent variable. By definition, the total sum of squares of the dependent variable equals the sum of squared errors plus that part of the SST which is not error, that is, which is explained by the regression:

(2) SST = SSE + SSR

where SSR is an abbreviation for "sum of squares due to regression." In other words, the total sum of squares can be partitioned into explained (SSR) and unexplained (SSE) variation.

If the number of observations is small, or if the number of independent variables is large, we can get a large R-squared just because there aren't enough data, or pieces of information, to tell us whether our estimates are due to chance or not. For example, in an extreme case where we fit a line to two or three points, the R-squared will always be close to one, because once we mechanically draw the line through a few points there aren't any other points left over to tell us whether there is any error or not. So to test whether R-squared is due to chance or not we rely on corrections for "degrees of freedom," i.e., how many observations we have in excess of the number of regression coefficients to be estimated. It turns out that a good way to make this correction is to compute the number:17/

(3) F = [(R-squared)/k] / [(1 - R-squared)/(N - k - 1)]

where N is the number of sample observations and k is the number of independent variables, i.e., the number of regression coefficients estimated not counting the intercept.

17/ See any statistics text or the SPSS manual, page 335, for details.
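As a check on the definitions above, the following sketch recomputes R-squared and F from the ERROR and C TOTAL sums of squares printed in Tables 3, 4-A and 4-B. It reads k in equation (3) as the number of independent variables (one in each of these regressions); that reading is an assumption made here because it reproduces the printed F values. The function and variable names are illustrative.

```python
def r_squared(sse, sst):
    """R-squared = 1 - SSE/SST."""
    return 1.0 - sse / sst


def f_statistic(r2, n_obs, k):
    """Equation (3): F = [R2/k] / [(1 - R2)/(N - k - 1)].

    k is taken to be the number of independent variables (intercept not counted);
    this is an assumption that matches the F values in the SAS listings.
    """
    return (r2 / k) / ((1.0 - r2) / (n_obs - k - 1))


# (SSE, SST, number of observations) read from the ERROR and C TOTAL rows.
# The number of observations is the C TOTAL degrees of freedom plus one.
TABLES = {
    "Table 3": (120.824, 145.114, 246),
    "Table 4-A": (116.439, 608.381, 100),
    "Table 4-B": (1192.331, 1660.764, 100),
}


def summarize(k=1):
    """Return {table name: (R-squared, F)} for the three one-variable regressions."""
    results = {}
    for name, (sse, sst, n_obs) in TABLES.items():
        r2 = r_squared(sse, sst)
        results[name] = (r2, f_statistic(r2, n_obs, k))
    return results


if __name__ == "__main__":
    for name, (r2, f_value) in summarize().items():
        print(f"{name}: R-squared = {r2:.4f}, F = {f_value:.3f}")
```

Small differences from the printed listings in the last digit are to be expected, since the sums of squares in the tables are themselves rounded.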
We see that as R-squared increases the number F increases, but that as the number of observations and the number of coefficients change, F also changes. The number F can be compared to a table to see if the value of F is statistically significant, that is, whether the independent variables really have an effect on the dependent variable or whether the measured effect might be due merely to chance. One page of an F table is reprinted as Table 5. Tables such as these, with so-called critical values of the F distribution for a few probabilities (usually .1, .05, and .01), can be found in any statistics book. Some computer packages (like SAS, which we have used) provide these probabilities directly. SPSS does not, so we have to use a table. The level of significance is the probability of making a mistake that we are willing to accept; the critical value is the table entry that F must exceed at that level. By convention, .05 is the most commonly used level, i.e., there is one chance in 20 that we will reject the hypothesis that our result is random when it is in fact random.

Table 5: F Table Showing 0.01 and 0.05 Probability Levels

Table for 0.05 probability level
(columns: v1 = degrees of freedom for numerator; rows: v2 = degrees of freedom for denominator)

  v2     1     2     3     4     5     6     7     8     9    10    12    15    20    24    30    40    60   120   inf
   1   161   200   216   225   230   234   237   239   241   242   244   246   248   249   250   251   252   253   254
   2  18.5  19.0  19.2  19.2  19.3  19.3  19.4  19.4  19.4  19.4  19.4  19.4  19.4  19.5  19.5  19.5  19.5  19.5  19.5
   3  10.1  9.55  9.28  9.12  9.01  8.94  8.89  8.85  8.81  8.79  8.74  8.70  8.66  8.64  8.62  8.59  8.57  8.55  8.53
   4  7.71  6.94  6.59  6.39  6.26  6.16  6.09  6.04  6.00  5.96  5.91  5.86  5.80  5.77  5.75  5.72  5.69  5.66  5.63
   5  6.61  5.79  5.41  5.19  5.05  4.95  4.88  4.82  4.77  4.74  4.68  4.62  4.56  4.53  4.50  4.46  4.43  4.40  4.37
   6  5.99  5.14  4.76  4.53  4.39  4.28  4.21  4.15  4.10  4.06  4.00  3.94  3.87  3.84  3.81  3.77  3.74  3.70  3.67
   7  5.59  4.74  4.35  4.12  3.97  3.87  3.79  3.73  3.68  3.64  3.57  3.51  3.44  3.41  3.38  3.34  3.30  3.27  3.23
   8  5.32  4.46  4.07  3.84  3.69  3.58  3.50  3.44  3.39  3.35  3.28  3.22  3.15  3.12  3.08  3.04  3.01  2.97  2.93
   9  5.12  4.26  3.86  3.63  3.48  3.37  3.29  3.23  3.18  3.14  3.07  3.01  2.94  2.90  2.86  2.83  2.79  2.75  2.71
  10  4.96  4.10  3.71  3.48  3.33  3.22  3.14  3.07  3.02  2.98  2.91  2.85  2.77  2.74  2.70  2.66  2.62  2.58  2.54
  11  4.84  3.98  3.59  3.36  3.20  3.09  3.01  2.95  2.90  2.85  2.79  2.72  2.65  2.61  2.57  2.53  2.49  2.45  2.40
  12  4.75  3.89  3.49  3.26  3.11  3.00  2.91  2.85  2.80  2.75  2.69  2.62  2.54  2.51  2.47  2.43  2.38  2.34  2.30
  13  4.67  3.81  3.41  3.18  3.03  2.92  2.83  2.77  2.71  2.67  2.60  2.53  2.46  2.42  2.38  2.34  2.30  2.25  2.21
  14  4.60  3.74  3.34  3.11  2.96  2.85  2.76  2.70  2.65  2.60  2.53  2.46  2.39  2.35  2.31  2.27  2.22  2.18  2.13
  15  4.54  3.68  3.29  3.06  2.90  2.79  2.71  2.64  2.59  2.54  2.48  2.40  2.33  2.29  2.25  2.20  2.16  2.11  2.07
  16  4.49  3.63  3.24  3.01  2.85  2.74  2.66  2.59  2.54  2.49  2.42  2.35  2.28  2.24  2.19  2.15  2.11  2.06  2.01
  17  4.45  3.59  3.20  2.96  2.81  2.70  2.61  2.55  2.49  2.45  2.38  2.31  2.23  2.19  2.15  2.10  2.06  2.01  1.96
  18  4.41  3.55  3.16  2.93  2.77  2.66  2.58  2.51  2.46  2.41  2.34  2.27  2.19  2.15  2.11  2.06  2.02  1.97  1.92
  19  4.38  3.52  3.13  2.90  2.74  2.63  2.54  2.48  2.42  2.38  2.31  2.23  2.16  2.11  2.07  2.03  1.98  1.93  1.88
  20  4.35  3.49  3.10  2.87  2.71  2.60  2.51  2.45  2.39  2.35  2.28  2.20  2.12  2.08  2.04  1.99  1.95  1.90  1.84
  21  4.32  3.47  3.07  2.84  2.68  2.57  2.49  2.42  2.37  2.32  2.25  2.18  2.10  2.05  2.01  1.96  1.92  1.87  1.81
  22  4.30  3.44  3.05  2.82  2.66  2.55  2.46  2.40  2.34  2.30  2.23  2.15  2.07  2.03  1.98  1.94  1.89  1.84  1.78
  23  4.28  3.42  3.03  2.80  2.64  2.53  2.44  2.37  2.32  2.27  2.20  2.13  2.05  2.01  1.96  1.91  1.86  1.81  1.76
  24  4.26  3.40  3.01  2.78  2.62  2.51  2.42  2.36  2.30  2.25  2.18  2.11  2.03  1.98  1.94  1.89  1.84  1.79  1.73
  25  4.24  3.39  2.99  2.76  2.60  2.49  2.40  2.34  2.28  2.24  2.16  2.09  2.01  1.96  1.92  1.87  1.82  1.77  1.71
  30  4.17  3.32  2.92  2.69  2.53  2.42  2.33  2.27  2.21  2.16  2.09  2.01  1.93  1.89  1.84  1.79  1.74  1.68  1.62
  40  4.08  3.23  2.84  2.61  2.45  2.34  2.25  2.18  2.12  2.08  2.00  1.92  1.84  1.79  1.74  1.69  1.64  1.58  1.51
  60  4.00  3.15  2.76  2.53  2.37  2.25  2.17  2.10  2.04  1.99  1.92  1.84  1.75  1.70  1.65  1.59  1.53  1.47  1.39
 120  3.92  3.07  2.68  2.45  2.29  2.18  2.09  2.02  1.96  1.91  1.83  1.75  1.66  1.61  1.55  1.50  1.43  1.35  1.25
 inf  3.84  3.00  2.60  2.37  2.21  2.10  2.01  1.94  1.88  1.83  1.75  1.67  1.57  1.52  1.46  1.39  1.32  1.22  1.00

Table for 0.01 probability level
(columns: v1 = degrees of freedom for numerator; rows: v2 = degrees of freedom for denominator)

  v2     1     2     3     4     5     6     7     8     9    10    12    15    20    24    30    40    60   120   inf
   1  4052  5000  5403  5625  5764  5859  5928  5982  6023  6056  6106  6157  6209  6235  6261  6287  6313  6339  6366
   2  98.5  99.0  99.2  99.2  99.3  99.3  99.4  99.4  99.4  99.4  99.4  99.4  99.4  99.5  99.5  99.5  99.5  99.5  99.5
   3  34.1  30.8  29.5  28.7  28.2  27.9  27.7  27.5  27.3  27.2  27.1  26.9  26.7  26.6  26.5  26.4  26.3  26.2  26.1
   4  21.2  18.0  16.7  16.0  15.5  15.2  15.0  14.8  14.7  14.5  14.4  14.2  14.0  13.9  13.8  13.7  13.7  13.6  13.5
   5  16.3  13.3  12.1  11.4  11.0  10.7  10.5  10.3  10.2  10.1  9.89  9.72  9.55  9.47  9.38  9.29  9.20  9.11  9.02
   6  13.7  10.9  9.78  9.15  8.75  8.47  8.26  8.10  7.98  7.87  7.72  7.56  7.40  7.31  7.23  7.14  7.06  6.97  6.88
   7  12.2  9.55  8.45  7.85  7.46  7.19  6.99  6.84  6.72  6.62  6.47  6.31  6.16  6.07  5.99  5.91  5.82  5.74  5.65
   8  11.3  8.65  7.59  7.01  6.63  6.37  6.18  6.03  5.91  5.81  5.67  5.52  5.36  5.28  5.20  5.12  5.03  4.95  4.86
   9  10.6  8.02  6.99  6.42  6.06  5.80  5.61  5.47  5.35  5.26  5.11  4.96  4.81  4.73  4.65  4.57  4.48  4.40  4.31
  10  10.0  7.56  6.55  5.99  5.64  5.39  5.20  5.06  4.94  4.85  4.71  4.56  4.41  4.33  4.25  4.17  4.08  4.00  3.91
  11  9.65  7.21  6.22  5.67  5.32  5.07  4.89  4.74  4.63  4.54  4.40  4.25  4.10  4.02  3.94  3.86  3.78  3.69  3.60
  12  9.33  6.93  5.95  5.41  5.06  4.82  4.64  4.50  4.39  4.30  4.16  4.01  3.86  3.78  3.70  3.62  3.54  3.45  3.36
  13  9.07  6.70  5.74  5.21  4.86  4.62  4.44  4.30  4.19  4.10  3.96  3.82  3.66  3.59  3.51  3.43  3.34  3.25  3.17
  14  8.86  6.51  5.56  5.04  4.70  4.46  4.28  4.14  4.03  3.94  3.80  3.66  3.51  3.43  3.35  3.27  3.18  3.09  3.00
  15  8.68  6.36  5.42  4.89  4.56  4.32  4.14  4.00  3.89  3.80  3.67  3.52  3.37  3.29  3.21  3.13  3.05  2.96  2.87
  16  8.53  6.23  5.29  4.77  4.44  4.20  4.03  3.89  3.78  3.69  3.55  3.41  3.26  3.18  3.10  3.02  2.93  2.84  2.75
  17  8.40  6.11  5.19  4.67  4.34  4.10  3.93  3.79  3.68  3.59  3.46  3.31  3.16  3.08  3.00  2.92  2.83  2.75  2.65
  18  8.29  6.01  5.09  4.58  4.25  4.01  3.84  3.71  3.60  3.51  3.37  3.23  3.08  3.00  2.92  2.84  2.75  2.66  2.57
  19  8.19  5.93  5.01  4.50  4.17  3.94  3.77  3.63  3.52  3.43  3.30  3.15  3.00  2.92  2.84  2.76  2.67  2.58  2.49
  20  8.10  5.85  4.94  4.43  4.10  3.87  3.70  3.56  3.46  3.37  3.23  3.09  2.94  2.86  2.78  2.69  2.61  2.52  2.42
  21  8.02  5.78  4.87  4.37  4.04  3.81  3.64  3.51  3.40  3.31  3.17  3.03  2.88  2.80  2.72  2.64  2.55  2.46  2.36
  22  7.95  5.72  4.82  4.31  3.99  3.76  3.59  3.45  3.35  3.26  3.12  2.98  2.83  2.75  2.67  2.58  2.50  2.40  2.31
  23  7.88  5.66  4.76  4.26  3.94  3.71  3.54  3.41  3.30  3.21  3.07  2.93  2.78  2.70  2.62  2.54  2.45  2.35  2.26
  24  7.82  5.61  4.72  4.22  3.90  3.67  3.50  3.36  3.26  3.17  3.03  2.89  2.74  2.66  2.58  2.49  2.40  2.31  2.21
  25  7.77  5.57  4.68  4.18  3.86  3.63  3.46  3.32  3.22  3.13  2.99  2.85  2.70  2.62  2.53  2.45  2.36  2.27  2.17
  30  7.56  5.39  4.51  4.02  3.70  3.47  3.30  3.17  3.07  2.98  2.84  2.70  2.55  2.47  2.39  2.30  2.21  2.11  2.01
  40  7.31  5.18  4.31  3.83  3.51  3.29  3.12  2.99  2.89  2.80  2.66  2.52  2.37  2.29  2.20  2.11  2.02  1.92  1.80
  60  7.08  4.98  4.13  3.65  3.34  3.12  2.95  2.82  2.72  2.63  2.50  2.35  2.20  2.12  2.03  1.94  1.84  1.73  1.60
 120  6.85  4.79  3.95  3.48  3.17  2.96  2.79  2.66  2.56  2.47  2.34  2.19  2.03  1.95  1.86  1.76  1.66  1.53  1.38
 inf  6.63  4.61  3.78  3.32  3.02  2.80  2.64  2.51  2.41  2.32  2.18  2.04  1.88  1.79  1.70  1.59  1.47  1.32  1.00

The table is read as follows. The F distribution has two numbers associated with it called degrees of freedom. The so-called "degrees of freedom in the numerator," "v1" ("n1" in some tables), is the number of independent (right hand side) variables in the regression, k in equation (3); it appears as the MODEL degrees of freedom in the SAS listing. In our simple Egyptian example, v1 is equal to 1. The corresponding degrees of freedom in the denominator, "v2" (or "n2"), is the number of observations in the regression sample less the number of estimated coefficients including the intercept, N - k - 1; it appears as the ERROR degrees of freedom, 244 in the Egyptian example.

We want to find the critical value F(v1, v2). Tables do not provide numbers for every possible combination of degrees of freedom, but you can see from Table 5 that when v2 is greater than 30 the critical values don't change very much, so pick the closest entry. For our example we choose F(1,120). We see that the critical values for F(1,120) are 3.92 if the acceptable probability of a mistake is .05 and 6.85 if the acceptable probability is .01. By a mistake we mean rejecting the hypothesis that the coefficient is zero when it is in fact zero, or rejecting a true hypothesis.18/

Now compare the F statistic in Table 4-B, a regression which also has one independent variable, to the critical values from Table 5. 38.501 is greater than both 3.92 (level = .05) and 6.85 (level = .01), so we reject the hypothesis that the regression results are random with less than 1 chance in 100 of making the mistake discussed above (since 38.501 is greater than 6.85, the critical value at the .01 level). The same conclusion holds for the Cairo regression in Table 3, whose F statistic is 49.052. Again, by convention the 0.05 level is accepted as an indication that the result is statistically significant.

18/ There is another kind of mistake: that we do not reject the null hypothesis that the coefficient is zero when in fact it is not zero, i.e., failing to reject a false hypothesis. Under certain assumptions the F test will minimize the chance of this kind of error. See Blalock, pp. 91-96 and pp. 188-193.

Standard Errors of Regression Coefficients, and Their Use in Tests. The standard error of the regression and its associated F-test tell us whether, overall, our estimates are due to some real relationship between the dependent variable and the independent variables, or whether our results could be due merely to chance. When we have several independent variables, we might want to know whether some of them are related to the dependent variable and others are not. Alternatively, we might want to know how good our estimate is: how close can we assume our estimate is to the true coefficient?
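Returning briefly to the overall F-test, the table look-up can be sketched in a few lines. The sketch assumes one numerator degree of freedom, which is the MODEL DF printed in the SAS listings for these one-variable regressions, and copies only four rows of the first column of Table 5; the names and the value 2.0 used in the last call are hypothetical.

```python
# Critical values copied from the v1 = 1 column of Table 5:
# tabulated denominator degrees of freedom -> (.05 level, .01 level).
F_CRITICAL_V1_1 = {
    30: (4.17, 7.56),
    40: (4.08, 7.31),
    60: (4.00, 7.08),
    120: (3.92, 6.85),
}


def nearest_tabulated_v2(v2):
    """Pick the tabulated denominator degrees of freedom closest to v2."""
    return min(F_CRITICAL_V1_1, key=lambda tabulated: abs(tabulated - v2))


def f_test(f_value, v2):
    """Compare an F statistic (with v1 = 1) to the .05 and .01 critical values."""
    crit_05, crit_01 = F_CRITICAL_V1_1[nearest_tabulated_v2(v2)]
    return {"reject_at_05": f_value > crit_05, "reject_at_01": f_value > crit_01}


if __name__ == "__main__":
    print("Table 4-B:", f_test(38.501, 98))    # ERROR DF = 98
    print("Table 3:  ", f_test(49.052, 244))   # ERROR DF = 244
    print("Hypothetical weak fit:", f_test(2.0, 98))
```

Choosing the nearest tabulated row is a simplification; a more cautious analyst might use the next smaller tabulated v2, which gives a slightly larger critical value.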
Each coefficient estimate is a random variable. Every time we draw a new sample and estimate our regression, we get numerically different results, even if the underlying population parameter (the true coefficient) remains fixed. Thus we speak of the distribution of coefficient estimates. Under certain conditions (chiefly that the regression estimated is the correct model, i.e., contains all relevant variables and uses the correct mathematical functional form of the relation) our coefficient estimates are unbiased; that is, each time we draw a sample and estimate there is some difference between our estimate and the (unknown) true parameter, but over a large number of samples and estimates these differences cancel out. In other words, "on average" we will get close to the true parameter.19/ But the standard error of the coefficient gives us a measure of the distribution of these errors. Remarkably enough, even though we do not know the true parameter, we can at least find out how close to it we are likely to be.

19/ Having the correct model is a restrictive assumption, at least in the strict sense of having all relevant variables, known without error, and knowing the functional form exactly. In particular, many economic variables are difficult to measure, and will almost always have measurement errors. Fortunately it turns out that if the missing variables are uncorrelated with the included variables, and if errors in independent variables are uncorrelated with the (true) error of the fully specified model, then the coefficient estimates are still unbiased. See Theil (1971), ch. 3.

Again, we will skip the computational details since the computer does the dirty work (see any statistics text or the SPSS manual, page 326). Referring back to Table 3 we see that the standard error of the coefficient of the log of income is 0.059072. Notice that this is much smaller than the coefficient estimate of 0.413728.
Intuitively, that indicates that our estimate is a good one, i.e., one that is not too far from the true parameter. But we can do better. We can formally test the hypothesis that the true coefficient is actually some fixed number. We can also state that we have a certain degree of confidence that the true parameter is within some interval. We will explain by example.

First, suppose that we want to test the hypothesis that the true (unknown) coefficient takes on some fixed value. A common example is to test the hypothesis that the coefficient equals zero. This is common because it is equivalent to testing whether or not the variable "makes a difference" in the regression, but we could just as easily test, for example, the hypothesis that the true coefficient equals one. If the errors are normally distributed, then we can compute a test statistic as follows: (1) compute the difference between the estimated parameter and the "maintained hypothesis," the tentatively assumed value of the true parameter; (2) divide this number by the standard error of the regression coefficient, to get a measure of how large the difference is relative to the errors expected in the estimates. This measure is known as the t-statistic.20/ The t-statistic for our example from Table 3 is 7.004. Under the normality assumption, statisticians have tabulated what a "large" value of this statistic is.21/ Table 6 reproduces a typical table.

20/ Some readers will be familiar with t-tests in another context: testing the equality of two sample means. In fact the test for a "significant coefficient" is conceptually very similar to a test of the equality of means. The interested reader can pursue this in any advanced statistics text.

Most statistics books and packages describe these tests as so-called t-tests, and most regression packages automatically compute the t-test for the hypothesis that the coefficient is zero, since this is the most common single hypothesis.
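The two-step recipe for the t-statistic can be written out as a short sketch. It uses the coefficient of log income and its standard error as printed in Table 3; the function name and the choice to report a signed statistic are illustrative assumptions.

```python
def t_statistic(estimate, std_error, hypothesized=0.0):
    """Step (1): estimate minus the hypothesized value; step (2): divide by the standard error."""
    return (estimate - hypothesized) / std_error


# Coefficient of log monthly income and its standard error, from Table 3.
B_INCOME = 0.413728
SE_INCOME = 0.059072

# Hypothesis that the true coefficient is zero (the test packages print by default).
t_zero = t_statistic(B_INCOME, SE_INCOME)

# Hypothesis that the true coefficient is one; the sign is negative because
# the estimate lies below the hypothesized value.
t_one = t_statistic(B_INCOME, SE_INCOME, hypothesized=1.0)

if __name__ == "__main__":
    print("t for H0: coefficient = 0:", round(t_zero, 3))
    print("t for H0: coefficient = 1:", round(t_one, 3))
```

The first value agrees with the T FOR H0 column of Table 3 to three decimals. For a two-sided comparison with a t table, it is the absolute value of the statistic that matters.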
It is, confusing unfortunately, that SPSS is one of the few major packages which computes an F-test for this hypothesis rather than the t-test, but the confusion is unnecessary because both tests (t-test and F-test) give exactly the same results. The numbers are different when you compute the test statistic, but the tables used are different in exactly the same proportion. In fact, the F- statistic is the square of the t-statistic, so SPSS would have printed out a value of 7.004 * 7.004.22/ The table look-up procedure is similar to that for the F-test described above, except that v1, the degrees of freedom for the numerator is always one because we are testing one coefficient estimate at a time. Looking back to Table 3 we see that the critical value for F with 1 and 120 degrees of freedom is 3.92 at the .05 level and 6.85 at the .01 level 21/ The table look-up procedure is similar to the procedure used above for the F-test so we will not repeat the example. As before, the researcher selects a "significance level," by convention usually .05, which represents the acceptable probability of a Type II error (not rejecting the null hypothesis when in truth it is false). When using the t-test one confronts the issue of whether to use the "one-tailed test" or the "two-tailed test". A full discussion of this issue would take us too far afield, so we simply recommend using the two-tailed test, which is the more conservative procedure. See Blalock, ch. 10, especially pp. 127-128, for a more detailed discussion. 22/ The F-statistic always corresponds to the two-tailed test discussed in the preceding footnote. To convince yourself, pick a significance level, look up a few critical values for different degrees of freedom for each test (using v1=1 for the F-test) and you will see that the numbers in the F table are the square of the numbers in the t-table. 
- 32 - Table 6 The t Distribution and the Normal Distribution' Degrees Pb .25 1 .05 .025 .01 .005 G of Freedom .5 .2 .1 .05 .02 .01 - 1 1.000 3.078 6.314 12.706 31.821 63.657 2 .816 1.886 2.920 4.303 6.965 9.925 3 .765 1.638 2.353 3.182 4.541 5.841 4 .741 1.533 2.132 2.776 3.747 4.604 5 .727 1.476 2.015 2.571 3.365 4.032 6 .718 1.440 1.943 2.447 3.143 3.707 7 .711 1.415 1.895 2.365 2.998 3.499 8 .706 1.397 1.860 2.306 2.896 3.355 9 .703 1.383 1.833 2.262 2.821 3.250 10 .700 1.372 1.812 2.228 2.764 3.169 11 .697 1.363 1.796 2.201 2.713 3.106 12 .695 1.356 1.782 2.179 2.681 3.055 13 .694 1.350 1.771 2.160 2.650 3.012 14 .692 1.345 1.761 2.145 2,624 2.977 15 .691 1.341 1.753 2.131 2.602 2.947 16 .690 1.337 1.746 2.120 2.583 2.921 17 .689 1.333 1.740 2.110 2.567 2.898 18 .688 1.330 1.734 2.101 2.552 2.878 19 .688 1.328 1.729 2.093 2.539 2.861 20 .687 1.325 1.725 2.086 2.528 2.845 21 .686 1.323 1.721 2.080 2.518 2.831 22 .686 1.321 1.717 2.074 .2.508 2.819 23 .685 1.319 1.714 2.069 2.500 2.807 24 .685 1.318 1.711 2.064 2.492 2.797 25 .684 1.316 1.708 2.060 2.485 2.787 26 .684 1.315 1.706 2.056 2.479 2.779 27 .684 1.314 1.703 2.052 2.473 2.771 28 .683 1.313 1.701 2.048 2.467 2.763 29 .683 1.311 1.699 2.045 2.462 2.756 30 .683 1.310 1.697 2.042 2.457 2.750 40 .681 1.303 1.684 2.021 2.423 2.704 60 .679 1.296 1.671 2.000 2.390 2.660 120 .677 1.289 1.658 1.980 2.358 2.617 (Normal) co .674 1.282 1.5 1.960 2.326 2.76 Source. This tabie is abridged from E. S. Pearson and H. 0. Hartley. Bionerrika Tables for Statisticiana, Vol. 1 (1954), p. 138, with kind permission of the Syndics of the Cambridge University Press, publishers for the Biometrika Society. a The smaller probability shown at the head of each column is the area in one tail; the larger probability is the area in both tails. Example: With 20 degrees of freedom, at value larger than 1.725 has a .05 probability and a t value exceeding 1.725 in absolute value has a .1 probability. 
- 33 - (recall that we use degrees of freedom 1, 120 since it is the closest entry in the table to the correct number of degrees of freedom 1, 253). Since 44.12 is greater than either of these numbers, we can reject the null hypothesis that the (unknown) true coefficient is zero. In other words, the test indicates that income has a statistically discernable effect on housing consumption (rent). When producing important reports, always check the computed statistic against the table. In exploratory work, however, a useful rule of thumb is that a t-statistic greater than 2 (or an F statistic greater than 4) is greater than the critical value at an approximate significance level of .05, a common level. The rule of thumb is a good approximation as long as you have at least 30 degrees of freedom, otherwise use the table even for preliminary examination of the results. The rule is only correct for tests involving a single coefficient. If we wanted to test a different hypothesis we would set the test up slightly differently. Suppose we want to test the hypothesis that the coefficient of income is less than one.23/ Then the computer no longer prints out the t or F statistic that we want, and we compute it by hand: (1) the difference between 1 and .4137 is .5863; dividing this result by the standard error of the coefficient (.059072) yields 9.93. This is the t-statistic, and we can compare it to the critical value in a t table as before, or we can square it to get the equivalent F statistic (98.60) and compare this result to the critical value from Table 3 (here 3.92 or 6.85, depending on whether we choose a significance level of .05 or .01). Again the computed statistic 23/ This is another common hypothesis because, as we will show below, this kind of logrithmic regression coefficient is an elasticity, or measure of responsiveness. If it is equal to one, then expenditures on housing rise exactly proportionally with income. 
If less than one, non-housing expenditures rise faster.

exceeds the critical value, so we reject the null hypothesis. The (unknown) true coefficient of income is less than one, or more strictly, the probability that the coefficient is as large as one is less than .01 (the significance level) given the sample.24/

Confidence Intervals: Coefficients. Now let's look at a related use of the standard error of the coefficient: the confidence interval. When we get an estimate, we get a single number that is our "best guess" of the unknown true parameter. This is known as a point estimate. Even if we have a very good estimate, the chance that the true income coefficient is exactly .413728 is very small. Intuitively, the chance that it is between, say, .4 and .5 is much greater. The chance that the true parameter is between .35 and .55 is greater still. In fact, we can compute the probability that the unknown true parameter is within an interval, that is, how likely it is that the true number lies between two numbers which bracket the estimate.

This is an extremely useful computation, because we often care little about small deviations from the estimate but we may care a lot about large deviations. If the true parameter is .39 instead of .43, who cares? It makes little difference for policy makers who want to calculate the affordability of housing projects. But if the true parameter is .13 or 1.13, we do care, because these numbers imply very different answers to policy questions about affordability.

If we can make a precise statement about the probability that a true parameter lies within an interval, we call that a confidence interval, or an interval estimate. Confidence intervals can be constructed using the standard errors of the coefficients, and the method of construction is derived from the same basic theory of random variables that gave us the t and F tests.26/

The upper bound of the confidence interval is computed as:

(1) b + t * (standard error of b)

where b is the regression coefficient, and t is the relevant critical value from the table. For our example, at the .05 level of significance, F is 3.92 so t is 1.98, and we compute:

.413728 + [1.98 * (.059072)] = .530691

or about 0.53. The lower bound is computed using a minus sign instead of a plus sign in equation (1), so the lower bound is .296765, or about 0.30. Since we used a significance level of .05, there is only a 5 out of 100 chance that the true parameter lies outside of the interval between .30 and .53. Another way of looking at this is to say that the probability that the true parameter is within this interval is 1 - .05 = .95, so this is often called a 95 percent confidence interval for the coefficient.

Confidence Intervals for Predicted Values of the Dependent Variable. When we predict the dependent variable we use all regression coefficients simultaneously, so the confidence interval for the prediction or forecast must make use of the joint distribution of all coefficients. This problem is slightly more complicated, and it will probably not arise in producing the Kenyan housing report, so we will not discuss it here. Interested readers can consult Theil, pp. 130-137, or other econometric texts.

24/ In fact, had we chosen an even smaller significance level, say .005 or .001 instead of .01, we might well still have rejected the null hypothesis. Since the choice of significance level is so arbitrary (.05 is chosen merely by convention, not with reference to a well specified loss function), many statisticians now prefer to compute the probability that the null hypothesis is true given the sample, and report that number. SAS computes this number but SPSS does not (see the last column of Figure 3), and it is difficult to compute by hand, so this paper emphasizes the older, traditional method of hypothesis testing.

25/ Statistical significance is not the same as qualitative importance. Throughout the analysis the researcher must bear in mind that he or she needs to understand what magnitudes make a policy difference, as well as whether a result is statistically significant. This is common sense, and an example will make the idea clear. Suppose we had computed an income elasticity of, say, 0.93 with a standard error of 0.03. Testing the hypothesis that the true coefficient is actually 1, we get a t-statistic of 2.33 (or an F of 5.44). For a large number of degrees of freedom, this is significant at the .05 level. But in practical terms, .93 is so close to 1 that the policy implications are practically the same. Unfortunately there are never any hard and fast rules about what is a qualitatively important difference -- as opposed to a statistically significant difference -- but the analyst must never lose sight of the difference between importance and significance.

26/ See Theil (1971) ch. 4.

Section I.3.5 The Role of Functional Form (Transformations) in Regression Analysis

The most natural specification of any relation in the ordinary-least-squares regression framework (OLS) is a linear relation, such as:

(1) RENT = a + b*INCOME + c*HHSIZE + d*DISTANCE + u

where RENT and INCOME are self explanatory, HHSIZE is the number of persons in the household, and DISTANCE is distance to the central business district, our proxy for intrametropolitan housing prices; a, b, c and d are regression coefficients to be estimated, and u is the residual, or estimated error term. Concrete suggestions for actual specifications will be given in Part II.

Several variants of the linear functional form are commonly employed in regression analysis. Each will be discussed briefly in turn, and then we will compare their advantages and disadvantages, and make recommendations for the Kenyan study.
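Looking back at the coefficient confidence interval of Section I.3.4, the arithmetic can be checked with a few lines of Python. This is only a sketch: the function name confidence_interval is ours, not part of any statistical package, and the critical value 1.98 is the rounded table value used in the text (an assumption that holds for a large number of degrees of freedom). The last lines redo the t and F figures from the footnote on significance versus importance.

```python
def confidence_interval(b, se, t_crit):
    """Return (lower, upper) bounds: b -/+ t_crit * se."""
    half_width = t_crit * se
    return b - half_width, b + half_width


# Income coefficient and standard error from the example in the text.
b_income = 0.413728
se_income = 0.059072
t_crit = 1.98  # rounded table value at the .05 level (assumed large d.f.)

lower, upper = confidence_interval(b_income, se_income, t_crit)
print(f"95 percent interval: {lower:.6f} to {upper:.6f}")

# Footnote example: test the hypothesis that an elasticity of 0.93
# (standard error 0.03) is really 1.
t_stat = (1 - 0.93) / 0.03
f_stat = t_stat ** 2  # with one restriction, F is the square of t
print(f"t = {t_stat:.2f}, F = {f_stat:.2f}")
```

The interval is symmetric around the point estimate, so a single half-width (t times the standard error) gives both bounds.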
Only functional forms which can be easily estimated with OLS regression (as computed with a package like SPSS) will be considered. OLS is, of course, a linear technique, but that is not as restrictive as it first seems because it is "linear in the parameters," i.e., the restriction is that the relationship between each coefficient and its variable must be linear; the variables themselves may be nonlinear.

Let us illustrate this with a simplified example about the relationship between household size and housing consumption (rents). Assume for simplicity that household size is the only determinant of rent so we can graph the relationship in two dimensions in Figure 5. These data were artificially manufactured to emphasize "curvature". Suppose that we ran a linear regression on these made-up data. We get the results displayed in Table 7-A. Notice that the fit is poor (R-squared is only .016) and the t-statistic for variable HHSIZE is not significant at commonly used levels. In other words, because our relation was misspecified the regression results show only weak evidence of a relationship even though visual inspection shows a reasonably strong but "curved" (not linear) relationship.

FIGURE 5
PLOT OF MANUFACTURED DATA ILLUSTRATING HYPOTHETICAL NONLINEAR RELATIONSHIP BETWEEN RENT AND HHSIZE
(vertical axis: RENT; horizontal axis: HHSIZE, 1 to 10)

TABLE 7-A
REGRESSION EXAMPLE USING MANUFACTURED DATA
RENT AND HOUSEHOLD SIZE
SIMPLE LINEAR MODEL

DEP VARIABLE: RENT

                     SUM OF        MEAN
SOURCE     DF       SQUARES      SQUARE    F VALUE    PROB>F
MODEL       1        249946      249946      1.611    0.2074
ERROR      98      15207604      155180
C TOTAL    99      15457550

ROOT MSE     393.928    R-SQUARE    0.0162
DEP MEAN     808.574    ADJ R-SQ    0.0061
C.V.        48.71893

                  PARAMETER      STANDARD     T FOR H0:
VARIABLE   DF      ESTIMATE         ERROR   PARAMETER=0   PROB > |T|
INTERCEP    1       702.977     92.057895         7.636       0.0001
HHSIZE      1     17.687810     13.936972         1.269       0.2074

TABLE 7-B
REGRESSION EXAMPLE USING MANUFACTURED DATA
RENT AND HOUSEHOLD SIZE
QUADRATIC MODEL

DEP VARIABLE: RENT

                     SUM OF        MEAN
SOURCE     DF       SQUARES      SQUARE    F VALUE    PROB>F
MODEL       2       3510483     1755241     14.251    0.0001
ERROR      97      11947067      123166
C TOTAL    99      15457550

ROOT MSE     350.950    R-SQUARE    0.2271
DEP MEAN     808.574    ADJ R-SQ    0.2112
C.V.        43.40354

                  PARAMETER      STANDARD     T FOR H0:
VARIABLE   DF      ESTIMATE         ERROR   PARAMETER=0   PROB > |T|
INTERCEP    1     44.817509       151.952         0.295       0.7687
HHSIZE      1       322.909     60.607330         5.328       0.0001
HSIZESQ     1    -26.679109      5.185272        -5.145       0.0001

Power Terms. One way to represent a "curved" relationship algebraically is with a quadratic equation, that is, with an equation which includes the square of the relevant independent variable. If we compute the new variable HSIZESQ = HHSIZE ** 2, and add it to the regression, with our test data we get the results in Table 7-B. Comparing them to 7-A, we see: (1) the fit improves dramatically (R-squared rises from .016 to .227), and (2) both variables, HHSIZE and HSIZESQ, have large t-statistics. Now our results with the correctly specified functional form indicate that household size is a strong determinant of housing consumption, albeit in a nonlinear fashion.

Logarithms. In fact, we have already briefly discussed another common transformation used in regression models, logarithms. When we looked at the plots in Figures 1 and 2, and chose the logarithmic transformation, we were implicitly making a functional form decision. Appendix B discusses the ideas behind logarithms in some detail, and will also delve into the economic interpretation of logarithms, so we will have little to say about them here except to note that they are one of the most important classes of variable transformations.

Dummy Variables. Sometimes we are interested in estimating the effect of a variable which has no natural numerical representation. For example, suppose that we believed that housing consumption depended partly on the sex of the household head. How can we specify a regression that will permit us to test this hypothesis?
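Before taking up that question, the quadratic specification discussed under Power Terms can be made concrete with a short sketch. It plugs the coefficients reported in Table 7-B into the fitted equation and finds the household size at which predicted rent peaks. The function names are ours and purely illustrative; the coefficients come from manufactured data, so the numbers carry no substantive meaning.

```python
# Coefficients as reported in Table 7-B (manufactured data).
INTERCEPT = 44.817509
B_HHSIZE = 322.909
B_HSIZESQ = -26.679109


def predicted_rent(hhsize):
    """Fitted value from RENT = a + b*HHSIZE + c*HSIZESQ."""
    return INTERCEPT + B_HHSIZE * hhsize + B_HSIZESQ * hhsize ** 2


def turning_point():
    """Household size where the fitted curve peaks.

    Setting the slope b + 2*c*HHSIZE to zero gives HHSIZE = -b / (2*c).
    """
    return -B_HHSIZE / (2 * B_HSIZESQ)


for size in range(1, 11):
    print(size, round(predicted_rent(size)))

print("predicted rent peaks near household size", round(turning_point(), 2))
```

Because the coefficient on HSIZESQ is negative, predicted rent first rises and then falls with household size, which is the "curved" pattern the linear model in Table 7-A could not capture.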
Dummy variables are a useful technique for estimating the effects of a categorical variable. If an observation has the characteristic of interest (is in a particular category) the variable takes on the value 1; otherwise the variable takes on the value 0. For example, we could construct a variable named FEMALE which took the value 1 if the household was female-headed, and 0 otherwise.

Even if a variable can be represented by a continuous number in a natural way, it is sometimes useful to code the data in dummy variables, especially if (1) there are a limited number of categories, and (2) there is reason to believe that the effect of the variable varies with the level of the variable. For example, the effect of a one-person change in household size on housing consumption might vary between small and large households. If household size is entered directly in a linear regression, the regression coefficient measures the average effect of a change in household size across the entire sample of different household sizes. Since only one number is estimated, if the true effects vary with household size, the simple linear regression will not pick up the differences. This will be discussed in more detail in Part II, with an example; for now we want to concentrate on the mechanics of how these variables are computed and interpreted.

Suppose we wanted to compute dummy variables for household size categories. If we know which household sizes occur in the sample, we can just compute a series of dummy variables such as HH1 (1 if household size equals 1, 0 otherwise), HH2 (1 if household size equals 2, 0 otherwise), HH3, HH4, HH5, and so on.

There are two useful modifications to this simple procedure. First, and most importantly, we must omit a dummy variable for one of the categories in the regression procedure. The dummy variable measures the effect of being in a particular class relative to not being in that class.
There must be a "base case" or a class to which the effects of the other classes are compared. In other words, if we run a regression with dummy variables HH2, HH3, HH4 and HH5, the coefficients of those variables are, respectively, the estimates of the difference in housing consumption between 2-person households and the base case (the omitted category, here 1-person households), 3-person households and that same base case, 4-person households and the base case, and 5-person households and the base case. If we had included the dummy HH1 in the regression as well, and had tried to compute the results, the computer program would have failed: the full set of dummies would be perfectly collinear with the intercept, leaving no base case against which to compare the effects of other categories.

There is a second useful modification to this procedure. It is often the case with categorical variables that we have only a few observations at the extreme values. For example, we will likely have lots of sample observations with 1, 2, 3, 4, perhaps even 10 individuals in a household, but at some point there will be a fall-off in the number of sample observations we can expect in some categories. In fact, we might not even be sure that some categories will have any observations. Since there probably isn't any fundamental difference between the effects of, say, the 12th additional household member and the 13th, or even the 14th, we can use the following trick: past a certain cutoff point, make the variable continuous instead of a dummy variable, so that we can avoid having a dummy variable for every possible category (some of which may not exist in our sample). Let's suppose we determine the cutoff to be households of size 8. Then we can create a new variable, HHGE9, which takes on the value of household size if household size is greater than or equal to 9, and is zero otherwise.

Table 8 illustrates how this procedure works.
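The coding scheme just described can be sketched in a few lines of Python. This is an illustration only: the function code_household_size and the dictionary layout are our own, and in practice the same variables would be created with the recode facilities of whatever statistical package is in use. One-person households are the omitted base case, and sizes of 9 or more are carried by the continuous variable HHGE9.

```python
DUMMY_SIZES = range(2, 9)  # HH2 ... HH8; HH1 is omitted as the base case
CUTOFF = 9                 # sizes at or above this go into HHGE9


def code_household_size(hhsize):
    """Return the dummy-variable coding for one household."""
    row = {f"HH{k}": int(hhsize == k) for k in DUMMY_SIZES}
    # Past the cutoff the variable is continuous: it holds the size itself.
    row["HHGE9"] = hhsize if hhsize >= CUTOFF else 0
    return row


for n in (1, 2, 8, 9, 12):
    print(n, code_household_size(n))
```

A one-person household gets zeros everywhere, so its predicted rent is the intercept alone; every other coefficient is read as a difference from that base case.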
The variables HH2, HH3, HH4, and so on are constructed as dummy variables, and HHGE9 is a continuous variable. Table 9 estimates a model using these variables in place of HHSIZE and HSIZESQ (from above).

The interpretation of these coefficients is as follows. We estimate that a one-person household pays 502 shillings in rent, because that is the (rounded) estimate of the intercept. The intercept is the estimate for the base case, or omitted category. We estimate that a two-person household spends an average of 523 shillings (502 plus 21); a three-person household 751 (502 plus 249), and so on. We estimate that a nine-person household spends 718 shillings (502 plus 9 * 24), and a ten-person household 742 shillings (502 plus 10 * 24).

Table 8: Dummy Variable Coding Scheme

Number of
Persons     HH2  HH3  HH4  HH5  HH6  HH7  HH8  HHGE9
    1        0    0    0    0    0    0    0      0
    2        1    0    0    0    0    0    0      0
    3        0    1    0    0    0    0    0      0
    4        0    0    1    0    0    0    0      0
    5        0    0    0    1    0    0    0      0
    6        0    0    0    0    1    0    0      0
    7        0    0    0    0    0    1    0      0
    8        0    0    0    0    0    0    1      0
    9        0    0    0    0    0    0    0      9
   10        0    0    0    0    0    0    0     10
   11        0    0    0    0    0    0    0     11
   12        0    0    0    0    0    0    0     12

TABLE 9
REGRESSION EXAMPLE ILLUSTRATING USE OF DUMMY VARIABLES
RENT AND HOUSEHOLD SIZE

DEP VARIABLE: RENT

                     SUM OF        MEAN
SOURCE     DF       SQUARES      SQUARE    F VALUE    PROB>F
MODEL       8       3269505      408688      3.051    0.0044
ERROR      91      12188046      133935
C TOTAL    99      15457550

ROOT MSE     365.971    R-SQUARE    0.2115
DEP MEAN     808.574    ADJ R-SQ    0.1422
C.V.        45.26127

                  PARAMETER      STANDARD     T FOR H0:
VARIABLE   DF      ESTIMATE         ERROR   PARAMETER=0   PROB > |T|
INTERCEP    1       501.970       162.513         3.089       0.0027
HH2         1     20.568941       193.834         0.106       0.9157
HH3         1       248.549       207.731         1.196       0.2346
HH4         1       377.665       207.731         1.818       0.0723
HH5         1       523.256       196.434         2.664       0.0091
HH6         1       606.550       199.509         3.040       0.0031
HH7         1       439.741       207.731         2.117       0.0370
HH8         1       348.476       193.834         1.798       0.0755
HHGE9       1     24.011529     18.805857         1.277       0.2049

PART II: ECONOMIC MODELS FOR HOUSING MARKET ANALYSIS

Section II.1 Introduction

The preceding pages have emphasized statistical techniques, especially regression analysis, without much reference to what model we want to apply the techniques to. A temptingly simple approach is to pick a dependent variable of interest -- say housing consumption, as proxied by rent -- and just put a lot of variables in the equation and pick out the best fit. There are several problems with this approach. One important problem is that given enough totally unrelated variables, and running enough regressions, eventually we'll hit upon a "significant" result purely by chance. Remember that if we pick a .05 significance level, we are permitting 1 chance in 20 that we will observe an apparent relationship merely by chance (a Type I error). If data are totally unrelated and we try 20 variables in regressions, we can expect on average to get one spurious "significant" result.

A second problem has even more practical importance. Many economic variables happen to be related through "intervening" variables which can lead to erroneous interpretation of statistical results. A few simple examples will make this clear.
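The first problem above, spurious significance from trying many unrelated variables, can be quantified with a short sketch. It assumes the tests are independent of one another, which is our simplifying assumption and will rarely hold exactly with real survey variables; the function names are illustrative.

```python
def expected_false_positives(n_tests, alpha):
    """Average number of spurious 'significant' results among n_tests."""
    return n_tests * alpha


def prob_at_least_one(n_tests, alpha):
    """Chance of one or more spurious results, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


n_tests, alpha = 20, 0.05
print("expected spurious results:", expected_false_positives(n_tests, alpha))
print("chance of at least one:", round(prob_at_least_one(n_tests, alpha), 2))
```

With 20 unrelated variables tested at the .05 level, the expected number of spurious results is one, and under independence the chance of seeing at least one is well over one half.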
Suppose that you have the following information collected from a sample of households:

(1) Housing consumption, measured by house value
(2) Household income
(3) Household size
(4) Head's occupation
(5) Whether or not the household participates in a government housing program
(6) For whom the head voted in the last election

Suppose further that you want to measure the effects of program participation on housing consumption. Making a model consists of choosing the variables that cause the household to choose a certain level of housing consumption, and specifying the functional form of the relationship, e.g., whether it is linear. For the rest of this section we assume the variables are related in a linear fashion so that we can concentrate on variable selection.

It cannot be overemphasized that regression analysis is a computational technique which is not by itself sufficient evidence of causality (x causes y); rather, it demonstrates correlation (x and y often occur together). Thinking about which variables "cause" other variables is the essence of model building, and that which distinguishes the correct use of statistics in social sciences from merely fitting the data at hand. A variable is a cause of housing consumption if and only if value can be changed by manipulating only that variable, holding other variables constant.27/

Clearly, people with higher incomes can be expected to consume more housing. Larger households may need more space and, thus, consume more. Participation in the program may affect housing consumption; this is the hypothesis of particular interest. It is less clear that head's occupation will affect housing consumption, except indirectly: occupation is a determinant of income, and income is a determinant of housing consumption.
We might assume that occupation and housing consumption are only related through the intervening variable income, and that once we control for income differences among occupations we will observe no residual difference in value among occupations.

27/ Strictly speaking "variables" aren't causes or outcomes but our measures of real world phenomena, and often imperfect measures at that. Variables don't cause anything, they measure that which causes. For ease of exposition, however, we often speak loosely of variables as if they are the phenomena they measure.

Now consider household size. Larger households presumably require more housing. They also have more wage earners, on average, and hence more income. Therefore household size affects value both directly and through the intervening variable income.

It is important to emphasize that strong relationships can exist which don't belong in our model. Suppose there has been a recent election with two candidates, and the only issue in the election is property taxes. Mr. George favors a large increase in property taxes to finance municipal government, while Mr. Rohatyn argues for cutting taxes and floating a bond issue instead. Suppose further that all those who own large houses therefore vote for Mr. Rohatyn, and all those who own small houses vote for Mr. George. Clearly, there is a strong relationship between house value and voting behavior, but voting behavior does not cause investment in housing. If we regressed value against voting behavior, the procedure would give us a "significant" coefficient for that behavior, but to interpret this as evidence that voting is a determinant of investment is a mistake. Mere statistical techniques cannot substitute for common sense.

This is where economic theory becomes useful. Most economic theory -- most social science theory of any kind -- can be very loosely thought of as an attempt to put common sense on a more rigorous footing.
Economics is a way of looking at the world that helps us to decide which variables should be important in explaining phenomena like consumption and investment. It would be beyond the scope of this paper to discuss theory in detail, but we want to use the next few pages to summarize the current consensus of economists about a few simple models which can be estimated with the Kenyan data and which will yield information that will help policymakers make better decisions about housing programs and policies.

Section II.2 Composite Demand Models

The first paper in this series discussed alternative measures of housing consumption in general terms.28/ That discussion will not be repeated here, but for easy reference Table 10 replicates Exhibit 2 of that paper, which summarizes different measures, their advantages, and their disadvantages. This section will focus on models constructed from expenditure measures, that is, rents and house values. Later sections will deal with individual characteristics and hedonic price models.

There are six essential issues which must be tackled when constructing models of housing demand. Three are essentially measurement issues: how to measure housing consumption, incomes, and prices. Another issue is how to integrate related behavioral outcomes like tenure choice and mobility into the demand relation. Finally there is the question of alternative functional forms and the choice of estimating technique. Each will be discussed in turn.

Section II.2.1 Measurement Issues

Measuring housing consumption. Ordinary demand analysis begins by postulating a relationship between the quantity of a good demanded, its relative price, the income of the household, and other things that may affect demand such as household size.
If for simplicity we forget about "other things" for a minute, and postulate that the demand function is linear, this model suggests that given household survey data we estimate a regression equation of the form:

(1) Q = a + b (I) + c (P) + u

where Q is the quantity of housing services demanded, I is the household's income, P is the relative price of housing, and u is the residual from the regression equation. The estimated parameters are a, b and c. Income and price elasticities can be calculated from b and c, respectively.29/

28/ Malpezzi, Bamberger and Mayo, pp. 5-9.

Table 10: Measures of Housing Consumption

Measure: Expenditures
  Comments:      Product of price and quantity.
  Advantages:    Easily measured. Appropriate if price doesn't vary.
  Disadvantages: Have to assume constant price, or average to permit differences to cancel out.

Measure: A. Rent
  Comments:      Flow measure of expenditures.
  Advantages:    Good measure of cost per period. Easy to measure for renters.
  Disadvantages: Rarely measured for owners.

Measure: B. Value
  Comments:      Stock measure of expenditures.
  Advantages:    Closely related to cost of supplying a similar structure.
  Disadvantages: Rarely measured for renters. Sometimes inaccurate even for owners.

Measure: Direct Quantity and Quality Measures
  Comments:      Examples include number of rooms, type of sanitation, utilities present, lot size, and condition of structure. Many possible measures exist.
  Advantages:    Some -- like number of rooms -- are easily measured. If you have good quantity measures and expenditures, can compute prices as well.
  Disadvantages: Some -- like condition of structure or lot size -- are difficult to measure. Focusing on one or a few may give misleading results.

Measure: Multivariate Statistical Measures
  Comments:      Regression analysis is most commonly used. Often referred to as hedonic index.
  Advantages:    In theory, good way to compute prices and quantities.
  Disadvantages: Difficult to use in practice.

The problem, perhaps not apparent at first, is the following: What is Q? What is I? What is P? The first problem is the measurement of consumption, Q. What is a unit of housing services? Consider two renters.
The rents they pay are not consumption but are expenditures: R = P * Q. If I pay more rent than you, does it mean I consume more housing, or do I pay a higher price? If we know that both pay the same price, then rents are a good measure of consumption.30/ Often studies use data from the same city and use rent as a measure of consumption. But does the price of housing vary within the city? That depends on whether by housing services you mean those services produced solely by the structure, or those services produced jointly by structure and location. If the former, then the price varies within the market, and the resulting estimate is the so-called income consumption path, or Engel curve. That is, the regression is:

(2) R = P * Q = a + b (I) + u

where b has a slightly different interpretation than in (1): it reveals expenditure behavior, not the direct relationship between quantity demanded and incomes and prices. If the latter -- housing defined broadly as structure and location -- then the price of housing including access is constant throughout the market, denoted P̄, and the regression estimated is:

(3) R = P̄ * Q = a + b (I) + u

where the bar over P indicates it is fixed over the sample, and the coefficient b can be shown to reveal the relationship between quantity demanded and income, as well as the income-expenditure relationship.

29/ The elasticity is a unitless measure of responsiveness:

Ey = (ΔQ/Q) / (ΔI/I) and Ep = (ΔQ/Q) / (ΔP/P)

where Ey is the income elasticity and Ep is the price elasticity. Since the regression coefficients of I and P in a linear regression can be interpreted as ΔQ/ΔI and ΔQ/ΔP, then for any level of Q, I and P we can calculate:

Ey = b * (I/Q) or Ep = c * (P/Q)

where b and c are the regression coefficients of income and price. In particular, note that for a linear demand model, the elasticity varies for each observation. It is common to present such variable elasticities evaluated at the mean value of the variables, as in Table 2. However, if we estimate logarithmic models, the elasticity is the coefficient. See Appendix B for details.

30/ House values are related to rents in the following way: a house is worth the (discounted) sum of the future rents it can be expected to command. The sum of rents is discounted since a dollar of rent today is worth more than a dollar of rent a year from now (you can earn interest on today's dollar for a year). Also the expected rent the dwelling commands in the future might be different from today's rent. That's why house values can change faster or slower than rents. Most of what we say about the flow 'rent' can be applied to the stock concept 'value', so most of the analysis concentrates on rents.

Gross versus net rent. Another issue which arises in consumption measurement is the following. Consider nominal rent payments (question H-2). Some renters pay for utilities separately, and some have utility charges included in their monthly rent. Ideally, we would like to choose either gross rent (rent for structure plus utilities payments) or net rent (rent minus the imputed charges for utilities which are included in monthly rent). Gross rent corresponds more closely to a total cost of shelter, while net rent facilitates comparisons with owner regressions based on house value, which are net of utility payments. Unfortunately, good information on utility payments is hard to collect.31/ One way to get around this problem is to use nominal rent (question H-2) and to include a dummy variable in the regression for households which have utilities included in rent. Since these households presumably pay more to cover the utility payments, the coefficient of this dummy variable should be positive, and the other regression results will be for a sort of net rent, since the dummy controls for the extra costs.

31/ See Follain and Malpezzi (1981).

The other procedure is to compute gross rent as the sum of rent and utility payments.
This is the simple procedure, although comparisons between owners and renters are distorted.

Recommendation to CBS and MWH for Housing Consumption Measures in Regression Models of Housing Demand.

1. For renters, use nominal rent (variable H-2) plus monthly utility payments (H-4).

2. For owners, use house value for single family structure (A-1=1). If an owner-occupied multifamily unit, use structure value but estimate separately from single family units.

Measuring housing prices. This topic has been partly addressed in the section on consumption; these two measurement issues are obviously closely related. Common procedures followed in earlier studies have been: (1) ignore prices, (2) assume that prices are constant throughout the sample, which amounts to the same thing, (3) allow prices to vary within submarkets, e.g., compute a place to place index for several locations in the total sample, and (4) compute separate prices for each observation, using either a hedonic price index method or some proxy such as distance to the Central Business District.

Recommendation to CBS and MWH for Price Measures in Regression Models of Housing Demand.

1. Prices vary from city to city. Therefore, estimate separate equations for large cities with big samples (more than 150 degrees of freedom). For smaller towns, pool observations but include dummy variables for different towns.

2. Within a city or town, prices also vary with location. The simplest model postulates that price varies with distance to Central Business District. Therefore, to control for intrametropolitan price differences, include distance to town center (c-15) as an independent variable if there is sufficient variation in the distance variable to yield significant estimates.

Measuring incomes.
Since adjusting the consumption of housing services is so costly and undertaken so infrequently, it is commonly postulated that the demand for housing is related to some expectation of the household economic situation over a time period longer than the immediate market period. Commonly researchers try to distinguish between current and permanent income, where permanent income is adjusted to reflect long run expectations about future income.32/ In other words, consumption does not change as much from year to year as total income. People save in good years and spend their savings or borrow in bad years. Rent changes even less than total consumption, because it is so costly to move. Since consumption is related to long-run or permanent income, this suggests that current income is not the true determinant of housing consumption; permanent income is.

32/ The classic work on the permanent income hypothesis is Friedman (1957). A related hypothesis which yields similar qualitative conclusions for the demand for durable goods is the life-cycle earnings hypothesis (see Ando and Modigliani (1963)).

In practice, there are three common ways in which researchers try to proxy permanent income, which is never directly observable. The first, advocated by Friedman in his seminal paper, is to use a weighted average of past incomes as a proxy for permanent income, where the weights reflect some market discount rate. This approach requires panel data (a cross section of households surveyed repeatedly over time). Most such panels have data for three or four years at most, so the average used could be improved if longer time series were available. Most studies using this approach assume a very high discount rate. Also, note that the empirical implementation of Friedman's theory is somewhat ad hoc, because the theory postulates that consumption depends on future expectations, which may differ from past experience.
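The first approach, a discount-weighted average of past incomes, can be sketched as follows. The particular weighting scheme (geometrically declining weights 1/(1+r)^k for income observed k years ago, normalized to sum to one) is our assumption for illustration; the text says only that the weights reflect some market discount rate. The function name and the sample figures are likewise hypothetical.

```python
def permanent_income_proxy(incomes, discount_rate):
    """Weighted average of past incomes, most recent year first.

    Income from k years ago gets weight 1 / (1 + discount_rate) ** k,
    so a higher discount rate puts more weight on recent income.
    """
    weights = [1.0 / (1.0 + discount_rate) ** k for k in range(len(incomes))]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, incomes)) / total


# Hypothetical three-year panel for one household, most recent year first.
panel = [120.0, 100.0, 80.0]
for rate in (0.0, 0.1, 0.5):
    print(rate, round(permanent_income_proxy(panel, rate), 1))
```

With a zero discount rate the proxy is the simple mean of the panel years; as the rate rises it moves toward current income, which is why a very high assumed discount rate makes this approach differ little from using current income.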
A second method is to use a first stage regression of current income against age, education and other determinants of current income, and to use the prediction from this equation as an instrumental variable proxying permanent income. This method implicitly assumes that the relevant permanent income measure varies over a person's lifetime.

The third empirical approach is straightforward and intuitively appealing. Since households make decisions about consumption largely on the basis of permanent income, and consumption is measurable, why not use consumption as a proxy for permanent income? The assumption of this approach is that changes in transitory income do not affect total consumption or housing consumption.

Of these three approaches, the third is the most appealing. The first approach requires time-series data which are unobtainable. The second can be done with Kenyan data but is somewhat complicated. The third approach is the simplest and can be easily implemented with the Kenyan data. Mayo and Malpezzi (1983) show that this simple approach yields results similar to more complicated techniques.

Recommendation to CBS and MWH on Measuring Income in Regression Models of Housing Demand.

1. Since income is coded into categories (section G) but consumption variables are continuous, and since consumption is an excellent proxy for the theoretically preferred "permanent income", use total consumption as our measure of income. For regression analysis, use the natural logarithm of consumption.

Demographic Variables

Most economic texts focus on the role of prices and incomes in determining patterns of demand. The underlying assumption is that other determinants of demand, such as tastes, family composition and size, are "held fixed". Empirical work requires that we include these kinds of variables in our regression models so that this assumption is tenable. The most important single demographic variable affecting housing consumption is household size.
Other candidates for inclusion in the analysis are: age of household head; number of children (measured separately from number of adults); and the sex of the head of household. Sometimes it is hypothesized that tastes vary by income class or by tenure; this will be discussed below.

In addition to selecting demographic variables for inclusion in the regressions, we have to choose functional forms. Functional form was discussed in Part I, but we should reemphasize here that several of the demographic variables can be assumed to have different effects depending on their level. For example, as household size increases, demand will usually increase; but it may be the case that for extremely large families, food expenditures become so large that they tend to "crowd out" additional housing expenditures, so that we observe housing expenditures first rising then falling with household size. In a similar fashion, housing expenditure might first increase with head's age as the head reaches peak earning power and contemplates a growing family; it might shrink with advancing age as children form their own families. The point to note here is that the functional form of the regression equation should be able to capture these nonlinearities.

Recommendation to CBS and MWH on Demographic Variables in Regression Models of Housing Demand.

1. Household size. At a minimum, the estimated demand relations should include a measure of household size. Since the effects of one additional household member are probably different for large families than for small families, a flexible functional form -- dummy variables or an additional quadratic term -- may be useful. These will be discussed in more detail under functional form.

2. Other candidates for inclusion include age of household head (possibly as dummy categories), a dummy for female headed households, and number of children.
Some experimentation may be necessary to determine which variables make a difference in the Kenyan context.

Section II.2.2 Integrating the Effects of Tenure Choice and Mobility on Housing Demand

The earlier companion paper (Malpezzi, Bamberger and Mayo) emphasized the relationships between moving and tenure, and the consumption of housing. Much of the recent literature on housing demand in the developed countries focuses on the relationship between tenure choice and demand. For some time it has been common to estimate separate demand equations by tenure group, and more recent work has tried to incorporate the simultaneity of the tenure choice and housing demand decisions. Briefly, most studies of developed country data show higher income elasticities for owners than for renters, presumably because owner occupied housing is an investment good as well as a consumption good, but the use of simultaneous methods has so far demonstrated little impact on the size of the elasticities.

Of course, tenure in developing countries is often characterized by much more complicated arrangements than is the case in developed countries, with various forms of squatting and rent-with-deposit schemes (key money) often prevalent. A general survey of different types of tenure arrangements can be found in Doebele (1978). However, for the purposes of the Housing Project it is recommended that the demand equations be estimated separately for a simple tenure grouping: (1) private renters, (2) private single-family unit owners, (3) private multifamily unit owners, and (4) subsidized or public units. Some of these groupings will not have sufficient observations for reliable estimation in any but the largest cities. For small towns two of these groups might have to be collapsed, with a dummy variable for the smaller group.

Mobility is also important as an indicator of current housing preferences.
Households that have moved recently or who are about to move may, in their actual or projected choices, indicate more accurately the priorities that households place on different housing and neighborhood features than is the case for the sitting population. Thus project designers may wish to pay greater attention to the choices made by recent movers or prospective movers (which can be ascertained by surveys) than to the choices made by those who have neither moved recently nor intend to move.

Some recent studies have estimated separate demand regressions for recent movers and long-term residents (see Mayo and Malpezzi, Appendix A, for a discussion). We do not recommend this approach for the Kenyan study because there are too few degrees of freedom to segment by this variable as well as by tenure and city. Instead, include length of tenure, and its square, directly in the demand equation. This will correct for any bias due to possibly different demands for long-term residents.

Summary of Recommendation to CBS and MWH on Tenure and Mobility in Housing Demand Estimation.

1. Estimate separate equations for each tenure group. If there are too few observations for a particular tenure group in a smaller town, include them with another group and use a dummy variable for the smaller group.

2. Include the length of tenure, and its square, in the regression equation.

Section II.2.3 Tying It All Together: Examples of Demand Equations Using Egyptian Data

Now that we have discussed several variables and specification issues in isolation, we will examine actual estimates which illustrate how these ideas work out in practice. However, bear in mind that this is an example from a very atypical city in a different country. When similar models are applied to data from Kenya, we will expect to find some important differences in the results. The discussion is meant only to give the flavor of looking over some results, and not to indicate "correct" results.
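Before turning to the estimates, the specification recommended above can be made concrete. The sketch below is only an illustration, not the survey's processing code. It builds the right-hand-side variables for one household, adding squared terms for household size and length of tenure and, where a small tenure group has been pooled with a larger one, a dummy variable for the smaller group. The function name, the field names and the group labels are all hypothetical.

```python
import math


def demand_regressors(log_consumption, hhsize, years_in_unit,
                      tenure_group, pooled_small_group=None):
    """Return the regressors for one household as a dict.

    All names are illustrative assumptions, not survey variable names.
    pooled_small_group is the label of a small tenure group that has been
    merged into this regression; its members get a dummy equal to 1.
    """
    row = {
        "intercept": 1.0,
        "log_consumption": log_consumption,
        "hhsize": float(hhsize),
        "hhsize_sq": float(hhsize) ** 2,          # lets the size effect bend
        "tenure_len": float(years_in_unit),
        "tenure_len_sq": float(years_in_unit) ** 2,
    }
    if pooled_small_group is not None:
        row["small_group_dummy"] = (
            1.0 if tenure_group == pooled_small_group else 0.0
        )
    return row


# A six-person household, four years in a public unit, in a town where
# public units have been pooled with private renters (hypothetical case).
example = demand_regressors(math.log(1500.0), hhsize=6, years_in_unit=4,
                            tenure_group="public",
                            pooled_small_group="public")
print(example)
```

Rows built this way can be passed to any regression routine; the dummy is omitted entirely when no groups have been pooled.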
Tables 11 and 12 present estimates from demand equations whose specification resembles the model we have discussed above.

TABLE 11
CAIRO RENTER DEMAND EQUATIONS USING LOG OF GROSS RENT

DEP VARIABLE: LMGRENT

                   SUM OF        MEAN
SOURCE     DF     SQUARES      SQUARE    F VALUE   PROB>F
MODEL      11   58.583454    5.325769     14.660   0.0001
ERROR     241   87.552540    0.363289
C TOTAL   252  146.136

ROOT MSE    0.602734    R-SQUARE   0.4009
DEP MEAN    2.228022    ADJ R-SQ   0.3735
C.V.       27.05243

                  PARAMETER       STANDARD    T FOR H0:
VARIABLE  DF       ESTIMATE          ERROR  PARAMETER=0  PROB > |T|  VARIABLE LABEL
INTERCEP   1      -0.987718       0.620380       -1.592      0.1127  INTERCEPT
LCONSUME   1       0.481276       0.053391        9.014      0.0001  LOG CONSUMPTION
HHSIZE     1       0.011340       0.078861        0.144      0.8858  HOUSEHOLD SIZE
HSIZESQ    1   -0.000634716    0.006612508       -0.096      0.9236
AGERESP    1      -0.033588       0.017504       -1.919      0.0562  AGE OF HEAD
AGESQ      1   0.0004114552   0.0001908698        2.156      0.0321  AGE SQUARED
FEMALE     1       0.086849       0.096230        0.903      0.3677  FEMALE HEADED HOUSEHOLD
LINGER     1      -0.054956       0.012128       -4.531      0.0001  LENGTH OF TENURE
LNGRSQ     1   0.0007157815   0.0002916292        2.454      0.0148  LENGTH OF TENURE SQUARED
PUBHSG     1      -0.684564       0.198754       -3.444      0.0007  PUBLIC HOUSING DUMMY
GOVHSG     1       1.273605       0.377436        3.374      0.0009  GOVERNMENT SUBSIDY DUMMY
DIST       1    -0.00370299    0.008788988       -0.421      0.6739  DISTANCE TO CITY CENTER

TABLE 12
CAIRO OWNER DEMAND EQUATIONS USING LOG OF HOUSE VALUE

DEP VARIABLE: LVALUE

                   SUM OF        MEAN
SOURCE     DF     SQUARES      SQUARE    F VALUE   PROB>F
MODEL       9   27.531840    3.059093      3.450   0.0030
ERROR      41   36.352156    0.886638
C TOTAL    50   63.883996

ROOT MSE    0.941615    R-SQUARE   0.4310
DEP MEAN    8.963114    ADJ R-SQ   0.3061
C.V.       10.50544

                  PARAMETER       STANDARD    T FOR H0:
VARIABLE  DF       ESTIMATE          ERROR  PARAMETER=0  PROB > |T|  VARIABLE LABEL
INTERCEP   1      -0.380739       2.122989       -0.179      0.8586  INTERCEPT
LCONSUME   1       0.894612       0.189601        4.718      0.0001  LOG CONSUMPTION
HHSIZE     1       0.223190       0.210164        1.062      0.2945  HOUSEHOLD SIZE
HSIZESQ    1      -0.027146       0.016278       -1.668      0.1030
AGERESP    1    0.002294721       0.058554        0.039      0.9689  AGE OF HEAD
AGESQ      1   0.0002119295   0.0006270886        0.338      0.7371  AGE SQUARED
FEMALE     1       0.212448       0.427333        0.497      0.6217  FEMALE HEADED HOUSEHOLD
LINGER     1       0.020451       0.033929        0.603      0.5500  LENGTH OF TENURE
LNGRSQ     1   -0.000241492   0.0007379156       -0.327      0.7451  LENGTH OF TENURE SQUARED
DIST       1    0.004872892       0.031499        0.155      0.8778  DISTANCE TO CITY CENTER

Notice the following general points:

1. The dependent variables are different for renters (log gross rent) and for owners (log house value). Rent is a flow concept (i.e. an amount per month) while value is a stock concept (i.e. an amount paid once and for all).

2. The models fit the data quite well. R-squared statistics are around 0.4.

3. Housing consumption rises with income. For owners, it rises almost proportionately with income (the elasticity, 0.89, is close to 1) while for renters it rises more slowly (the elasticity is only 0.48).

4. Housing consumption rises with household size (the coefficient of HHSIZE is positive) but at a declining rate (the coefficient of HSIZESQ is negative).

5. Age of the head affects rents (t-statistics are about 2 in absolute value for both age variables) but not owner consumption.

6. Sex of household head has little measured effect on housing consumption (t-statistics are very small for both owners and renters).

7. Length of tenure has a strong effect on rents but little effect on values.

8. For renters, those who live in public housing pay less (the coefficient of PUBHSG is less than zero) but those who receive government subsidies spend more on housing (the coefficient of GOVHSG is greater than zero).

9. The effect of distance to the city center, our proxy for housing price, is negligible.

Let us examine each of these points in a little more detail. The relationship between rents and values was discussed earlier in the paper. Values are simply the present value of the (discounted) expected future rents the building and the land command. As we all know, the value of a unit can be 50 or 100 or more times the monthly rent it could command.
One of the advantages of the logarithmic model is that it facilitates comparisons between the owner and renter results, because the coefficients can be interpreted as percentage changes and the differences in the measurement units (stock versus flow) are captured in the intercept term of the regression equation.

Typical R-squared statistics for cross-section data in housing market analysis range from 0.1 to 0.6. The fits reported here are quite good by this informal rule of thumb.

As discussed in Appendix B, the coefficient of a logarithmic independent variable in a regression with a logarithmic dependent variable can be directly interpreted as an elasticity, or unitless measure of responsiveness. Therefore, we conclude that in Cairo owners spend more of each additional unit of income on housing than do renters. This is not surprising, since for owners housing is an investment good as well as current consumption.

Most people expect housing consumption to increase with household size, but for very large households housing consumption may begin to increase more slowly or even decrease as more income is allocated to food. If this hypothesis is correct we expect a positive coefficient for HHSIZE and a negative coefficient for HSIZESQ.

In general, the demographic variables (characteristics of the household other than income) are less important for owners than for renters. This is not surprising since higher adjustment costs presumably lead owners to make longer term housing decisions which are less strongly related to current demographic characteristics.

The reason the length of tenure variable has such a strong effect on renters in Cairo is that there is a strong rent control law in that market. Rent control, if enforced, leads to severe distortions and inefficiencies in the housing market. See Thibodeau (1982) for a discussion.
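The interpretations above can be checked with a little arithmetic. The sketch below is an illustration added for the reader, not part of the original analysis, and the function names are our own. Using coefficients copied from Tables 11 and 12, it computes (a) the percentage change in housing expenditure implied by a 10 percent rise in consumption, (b) the point at which a quadratic effect b1*x + b2*x^2 changes direction, which is x = -b1/(2*b2), and (c) the ratio implied by a dummy variable in a log model, which is exp(coefficient). The household size coefficients are not statistically significant, so their turning point is indicative only.

```python
import math


def pct_change_from_elasticity(elasticity, pct_increase):
    """Percent change in y when x rises by pct_increase percent,
    in a log-log model ln y = a + elasticity * ln x."""
    return ((1.0 + pct_increase / 100.0) ** elasticity - 1.0) * 100.0


def turning_point(b_linear, b_squared):
    """Value of x at which b_linear*x + b_squared*x**2 changes direction."""
    return -b_linear / (2.0 * b_squared)


def dummy_ratio(coefficient):
    """Ratio of y with the dummy equal to 1 to y with it equal to 0,
    when the dependent variable is ln y."""
    return math.exp(coefficient)


# (a) LCONSUME coefficients: renters (Table 11) and owners (Table 12).
renter_effect = pct_change_from_elasticity(0.481276, 10.0)
owner_effect = pct_change_from_elasticity(0.894612, 10.0)

# (b) Owners: HHSIZE and HSIZESQ (Table 12).
owner_hhsize_peak = turning_point(0.223190, -0.027146)
#     Renters: LINGER and LNGRSQ (Table 11).
renter_tenure_trough = turning_point(-0.054956, 0.0007157815)

# (c) Renters: PUBHSG and GOVHSG dummies (Table 11).
public_ratio = dummy_ratio(-0.684564)
subsidy_ratio = dummy_ratio(1.273605)

print(f"10% more consumption: renters {renter_effect:.1f}%, "
      f"owners {owner_effect:.1f}%")
print(f"Owner demand peaks near household size {owner_hhsize_peak:.1f}")
print(f"Rents stop falling after about {renter_tenure_trough:.0f} years")
print(f"Public housing ratio {public_ratio:.2f}, "
      f"subsidy ratio {subsidy_ratio:.2f}")
```

With these coefficients a 10 percent rise in consumption raises renter expenditure by roughly 4.7 percent and owner values by roughly 8.9 percent; the owner household size effect peaks at about four members; and the tenure discount for renters stops growing only after about 38 years.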
Since there were not enough public housing units or subsidized households to estimate a separate regression for these samples as recommended above, the dummy variables PUBHSG and GOVHSG were included in the pooled rental sample. The results indicate that public housing residents spend much less than otherwise identical households. Those renters who receive subsidies apparently do spend them on housing, since they spend more than twice as much on housing as otherwise identical households.

Distance to the city center was included as a proxy for housing prices since close-in units presumably cost more per unit of housing services than those farther out. But the effect of lower prices can be offset by increased consumption of the quantity of housing services; since rent is price times quantity, these two effects can cancel out. Also, when samples are restricted to a few geographical areas to hold down costs, we may be left with little variation in locational variables; this will also decrease the reliability of the distance coefficient.

Regression models similar to those just presented can be estimated using data from the larger cities and towns. For smaller towns, there may not be enough degrees of freedom left to reliably estimate such models. It may be necessary to estimate a simpler model in the smaller towns in order to conserve degrees of freedom. One simple model which has been used in past studies and found to perform well regresses the same dependent variables on the log of income or consumption, household size, and household size squared (see Mayo and Malpezzi, 1983).

Table 13 presents estimates from this simple model, using the same sample as Tables 11 and 12. Notice the key result: the coefficient of log consumption, our proxy for permanent income, is very stable. The household size variables do change, because the variables omitted from this simple model are more highly correlated with household size than with income or consumption.
In other words, with the simple model we can retain some confidence in the key income coefficients. Since the income or consumption coefficient is the key result for affordability calculations, this simple model can still provide valuable information for those towns which have small sample sizes. Another alternative is to pool several small towns, and include dummy variables for each town but one.

TABLE 13-A
SIMPLE DEMAND EQUATION FOR RENTERS IN CAIRO

DEP VARIABLE: LMGRENT

                   SUM OF        MEAN
SOURCE     DF     SQUARES      SQUARE    F VALUE   PROB>F
MODEL       3   40.783117   13.594372     24.236   0.0001
ERROR     343  192.397       0.560924
C TOTAL   346  233.180

ROOT MSE    0.748949    R-SQUARE   0.1749
DEP MEAN    2.338085    ADJ R-SQ   0.1677
C.V.       32.03256

                  PARAMETER       STANDARD    T FOR H0:
VARIABLE  DF       ESTIMATE          ERROR  PARAMETER=0  PROB > |T|  VARIABLE LABEL
INTERCEP   1      -1.396345       0.521220       -2.679      0.0077  INTERCEPT
LCONSUME   1       0.450601       0.054375        8.287      0.0001  LOG CONSUMPTION
HHSIZE     1      -0.144916       0.073210       -1.979      0.0486  HOUSEHOLD SIZE
HSIZESQ    1       0.012216    0.006185292        1.975      0.0491

TABLE 13-B
SIMPLE DEMAND EQUATION FOR OWNERS IN CAIRO

DEP VARIABLE: LVALUE

                   SUM OF        MEAN
SOURCE     DF     SQUARES      SQUARE    F VALUE   PROB>F
MODEL       3   24.459881    8.153294      8.163   0.0002
ERROR      50   49.940011    0.998800
C TOTAL    53   74.399893

ROOT MSE    0.999400    R-SQUARE   0.3288
DEP MEAN    8.905308    ADJ R-SQ   0.2885
C.V.       11.22252

                  PARAMETER       STANDARD    T FOR H0:
VARIABLE  DF       ESTIMATE          ERROR  PARAMETER=0  PROB > |T|  VARIABLE LABEL
INTERCEP   1       1.557142       1.598742        0.974      0.3348  INTERCEPT
LCONSUME   1       0.837154       0.175473        4.771      0.0001  LOG CONSUMPTION
HHSIZE     1       0.040022       0.197403        0.203      0.8402  HOUSEHOLD SIZE
HSIZESQ    1      -0.014159       0.015588       -0.908      0.3681

Section II.3 Introduction to Hedonic Price Indexes

This section summarizes some recent advances in housing market analysis. In particular, we will focus on the estimation of so-called hedonic prices for housing and how they can be used to construct indexes and to estimate demand and supply relationships.
First, we will present an introductory and intuitive explanation of hedonic price estimation. Then we will present an empirical example.

Let us emphasize at the outset that we do not recommend that hedonic analysis be included in the first basic reports produced by CBS and MWH. This additional analysis will be time consuming and should be the focus of future reports, perhaps with technical assistance from the Bank and perhaps with the collaboration of academic researchers. This brief introduction is included because these models will be useful for future price index work.

II.3.1 Theoretical Basis

To a large extent, housing market analysis consists of comparing different dwellings. For example, measuring inflation requires comparing the price of housing today to that of some base period, but often in the interim the housing stock has changed, through new construction, rehabilitation, conversion, and demolition, so that we are actually comparing two different groups of dwellings. Other examples abound, such as comparing the price of housing in different locations, measuring the effects of racial or caste discrimination in housing, and studying the effects of government subsidies and tax policies on how we are sheltered. All require that we compare different dwellings. Such comparisons are made daily not only by researchers, but also by those interested in more effective government programs and by bankers, developers, and landlords. In fact, each of us makes such comparisons every time we move or consider moving.

The problem faced by anyone trying to analyze a housing market is the well-known difficulty of making these comparisons. How are the rents for two different dwellings in two different locations related? 33/ Housing is not a homogeneous good like wheat or oil, but can be thought of as a bundle of diverse characteristics such as a number of rooms of certain types, in a particular location, of a certain age, and so on.
These specific characteristics are more amenable to comparison, so one may compare dwellings by comparing characteristics. Most people agree that comparing the value of, say, two houses with the same number of rooms in nearby locations is easier than comparing two dwellings with an unknown number of rooms in uncertain locations, even though in practice the distinction between a "good" -- housing -- and its "characteristics" or "attributes" -- like the number of rooms -- is very much ad hoc.

The method of hedonic equations is one way expenditures on housing can be decomposed into measurable prices and quantities so that rents for different dwellings, or for identical dwellings in different places, can be predicted. A hedonic equation is a regression of expenditures (rents or values) on housing characteristics and will be explained in detail below. Briefly, the independent variables represent the individual characteristics of the dwelling, and the regression coefficients are estimates of the implicit prices of these characteristics. The results provide us with estimated prices for housing characteristics, and we can then compare two dwellings by using these prices as weights. For example, the estimated price for a variable measuring number of rooms indicates the change in value or rent associated with the addition or deletion of one room. It tells us in a dollars-and-cents way how much "more house" is provided by a dwelling with an extra room.

33/ Of course for owners we usually measure expenditures by the stock expenditure -- value -- rather than by the flow expenditure -- rent. For now, assume everyone rents. We will return to this distinction later.

Ordinarily we would prefer to estimate such a regression separately in each market, where prices and quantities ideally clear. The definition of markets will be addressed in some detail later.
Once we have estimated the implicit prices of measurable housing characteristics in each market, we can select a standard set of characteristics, or bundle, and price a dwelling meeting these specifications in each market. In this manner we can construct price indexes for housing of constant quality across markets. In a similar fashion we can use the results from a particular market's regression to estimate how prices of identical dwellings vary with location within a single market (e.g., with distance from the city center) or even to decompose the differences in rent or house values into price and quantity differences. Some simplified examples will make these procedures clear.

The hedonic regression assumes that we know the determinants of a unit's rent:

    R = f(S, L, C),

where

    R = contract rent;
    S = structural characteristics;
    L = neighborhood characteristics, including location within the market; and
    C = contract conditions or characteristics, such as utilities included in rent.

II.3.2 An Example

Suppose we estimate this relationship assuming a log-linear functional form:

    ln R = a + bS + cL + dC

where a, b, c and d are regression coefficients. Of course, in practice there can be many variables included on the right hand side.

Table 14 presents a sample hedonic regression using the Cairo data. Since we only want to introduce the concept of the hedonic index, Table 14 will not be discussed in detail. The coefficients of the independent variables are interpreted as the percentage change in rent from an additional unit of the characteristic. For example, Table 14 indicates that each additional room adds 21 percent to the rent commanded by a dwelling; a bath is worth 14 percent; and so on. Detailed interpretation of hedonic indexes can be found in Malpezzi, Ozanne and Thibodeau (1981).

The determinants of rents and values are of interest in their own right to project designers and others.
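The bundle-pricing idea can be sketched in a few lines. The example below is illustrative only. The two-market setup, the market labels, both intercepts and all "Town B" coefficients are hypothetical; the "Town A" slopes are simply the rounded room and bath effects quoted above (0.21 and 0.14). The code predicts the log rent of one standard bundle in each market, exponentiates to return to currency units, and forms a price index with Town A as the base. It also shows that a log-linear coefficient b corresponds to an exact percentage change of 100*(exp(b) - 1), which is close to 100*b only when b is small.

```python
import math


def predicted_rent(coefs, bundle):
    """Rent predicted by a log-linear hedonic equation.

    coefs:  coefficients by characteristic, intercept under "intercept".
    bundle: quantity of each characteristic in the standard dwelling.
    """
    log_rent = coefs["intercept"] + sum(
        coefs[name] * quantity for name, quantity in bundle.items()
    )
    return math.exp(log_rent)  # back from logs to currency units


def exact_pct_effect(b):
    """Exact percent change in rent for a one-unit change in a
    characteristic whose log-linear coefficient is b."""
    return (math.exp(b) - 1.0) * 100.0


markets = {
    # Slopes rounded from the text; intercept is hypothetical.
    "Town A": {"intercept": 1.30, "rooms": 0.21, "bath": 0.14},
    # Entirely hypothetical second market.
    "Town B": {"intercept": 1.10, "rooms": 0.25, "bath": 0.10},
}
bundle = {"rooms": 2, "bath": 1}  # the standard dwelling to be priced

rents = {name: predicted_rent(c, bundle) for name, c in markets.items()}
index = {name: 100.0 * r / rents["Town A"] for name, r in rents.items()}

for name in markets:
    print(f"{name}: rent {rents[name]:.2f}, index {index[name]:.1f}")
print(f"Exact effect of one more room in Town A: "
      f"{exact_pct_effect(0.21):.1f}%")
```

Because the same bundle is priced in every market, differences in the index reflect differences in implicit prices rather than differences in the dwellings themselves.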
In addition, the results can be used to compute place-to-place price indexes for a constant quality dwelling. Once the coefficients have been estimated with a separate regression for each market (city or town), we can predict the rent for the same unit in each market as follows:

1. Pick a set of independent variables which describes the unit to be priced. This is called the bundle.

2. For each regression in turn, multiply each coefficient by the value chosen for this particular bundle, and sum them.

3. Since the original regressions were log-linear (at least in our example), exponentiate this sum to get back to shillings.

Detailed discussion of these procedures can be found in Follain and Ozanne (1981).

TABLE 14
CAIRO RENTER HEDONIC EQUATION

MODEL: TWO
DEP VARIABLE: LGRENTI  LOG GROSS RENT

                   SUM OF        MEAN
SOURCE     DF     SQUARES      SQUARE    F VALUE   PROB>F
MODEL      22   87.353286    3.970604     15.606   0.0001
ERROR     265   67.425505    0.254436
C TOTAL   287  154.779

ROOT MSE    0.504416    R-SQUARE   0.5644
DEP MEAN    2.264077    ADJ R-SQ   0.5282
C.V.       22.27912

                  PARAMETER       STANDARD    T FOR H0:
VARIABLE  DF       ESTIMATE          ERROR  PARAMETER=0  PROB > |T|  VARIABLE LABEL
INTERCEP   1       1.307565       0.439738        2.974      0.0032  INTERCEPT
ROOMS      1       0.209695       0.028091        7.465      0.0001  NUMBER OF ROOMS FOR HH
BATH       1       0.137401       0.091340        1.504      0.1337  JUST WHAT YOU THINK
LE2STORY   1      -0.023454       0.085953       -0.273      0.7852  STRUCTURE LE 2 STORIES
AGE76      1       0.688762       0.148079        4.651      0.0001  AGE DUMMY BUILT POST 76
AGE7176    1       0.459552       0.114360        4.018      0.0001  BUILT 71 TO 76
AGE6070    1       0.262954       0.083423        3.152      0.0018  BUILT 60 TO 70
AGLAND     1       0.079783       0.223413        0.357      0.7213  AGRICULTURAL LAND
PAVED      1      -0.046628       0.101338       -0.460      0.6458  PAVED ROAD
UPRMDL     1       0.139769       0.089565        1.561      0.1198  UPPER OR MIDDLE CLASS DISTRICT
LINGER     1      -0.013274    0.003841981       -3.455      0.0006  LENGTH OF TENURE
FURN       1       0.618900       0.256550        2.412      0.0165  RENT INCLUDES FURNITURE
WPRIV      1       0.127770       0.093400        1.368      0.1725  PRIVATE WATER CONNECTION
SPUB2      1      -0.095058       0.134986       -0.704      0.4819  SEWER CONNECTION
ELEC       1       0.277901       0.264421        1.051      0.2942  ELECTRICITY DUMMY
SLITE      1      -0.087595       0.065469       -1.338      0.1821  STREET LIGHTS
SIDE       1       0.101721       0.104296        0.975      0.3303  SIDEWALKS
POOR       1      -0.027840       0.080172       -0.347      0.7287  BLDG BAD OR COLLAPSE
LSCAPE     1       0.297088       0.097969        3.032      0.0027  LANDSCAPE HI OR MED QUALITY
DFAC       1      -0.405620       0.307701       -1.318      0.1886  DUMMY COMMUNITY FACILITIES
NFAC       1       0.149828       0.060533        2.475      0.0139  NUMBER COMMUNITY FACILITIES
DESLAND    1      -0.037973       0.239745       -0.158      0.8743  DESERT LAND
DIST       1      -0.018855    0.008929309       -2.112      0.0357  DISTANCE TO CBD

PART III: COMPUTATIONAL TECHNIQUES

Section III.1: Preparing the Data For Analysis

This section will be brief, because many of the outstanding issues are covered in Sae-Hau (1982), which is being sent separately. Here we will reiterate some of the points covered in the aide-memoire (Malpezzi to Ondorra, July 1983).

So far, this paper has emphasized techniques and models for analysis. But as CBS and MWH staff would be the first to emphasize, good analysis requires good data. Careful attention must be paid to the preparation of the data for computer analysis. Three issues can be briefly mentioned in this regard.

First, this is a complex survey with two levels of observation: the household and the structure. It will often be necessary to cross-classify responses by both levels of observation. For example, we might want to know how many low-income people (household level information) live in units with piped water (structure level information). For this kind of tabulation it is necessary that household and structure variables be linked, and that the counts be done using the correct weights (in this case, household level weights). In other words, when the data are prepared the file structure must link households and structures.

Second, careful attention must be paid to how missing values are to be treated. For example, suppose a respondent owns his dwelling, so he does not answer the question "Amount of rent paid last month."
If the current practice of filling all non-responses with zeroes is followed, it is easy to make computing errors such as including these zeroes in the computation of averages. It is also more difficult to visually inspect data when the zero-fill procedure is used. Finally, there are some questions for which zero is a legitimate response, and using the zero-fill procedure means we can no longer tell a non-response from a legitimate response of zero. For these reasons it is recommended that non-responses be coded as blanks.

Third, it is often necessary to recode responses for analysis. These recodes should always be performed on copies of the original variables, which are added to the file, and the original data should be left in unrecoded form, so that if we want to change the recoding we can always go back to the original data.

Section III.2: Computational Notes

The outstanding computational issue is the choice of computer software. Choosing the correct software makes life easy for both the analyst and the programmer. Given the enormous demands on computer personnel, the correct choice of software shifts much of the burden to the machine.

We recommend that as a long-range strategy CBS investigate acquiring more sophisticated software such as SAS (Statistical Analysis System), perhaps in conjunction with a specialized table-producing set of software such as TPL. 34/ In the short run, we recommend a greater reliance on SPSS, which is already installed on your system. SPSS will be particularly useful for regression analysis, and the manual (Nie et al., 1975) is a good general reference on statistics. The paper by Sae-Hau also discusses the use of SPSS in some detail.

SPSS has several disadvantages. It does not have much flexibility for managing data sets. It lacks several basic econometric procedures which would be of interest for future work (but these procedures are not necessary for the current reports).
It is difficult to add procedures or to modify existing ones. Finally, it does not have many capabilities for computing order statistics.

34/ These are examples of available software. SAS is highly recommended, but we have no firm recommendation for a specific table-producing language at this time.

In practice, then, FORTRAN or COBOL is often used to merge data sets into a form which SPSS can handle. The lack of programming flexibility is not a serious problem for the currently planned analysis, except that it would be useful to have order statistics available. The FREQUENCIES procedure does compute medians, however. If CBS decides to use other order statistics, Appendix C lists some FORTRAN subroutines which sort data and compute some order statistics. This code would probably require modification for the specific problem at hand, but it may be useful as a guide.

For the currently planned reports, the lack of software probably dictates that we rely on cross-tabulations and regression analysis for most of the analysis, along with medians computed with FREQUENCIES in SPSS. For future work, however, CBS will want to investigate software for the computation of order statistics.

APPENDIX A: KENYAN HOUSING QUESTIONNAIRE

[Scanned questionnaire forms; only the legible headings, questions and response codes are reproduced.]

CENTRAL BUREAU OF STATISTICS
MINISTRY OF ECONOMIC PLANNING AND DEVELOPMENT
URBAN HOUSING SURVEY 1983

HOUSEHOLD QUESTIONNAIRE, Page 1

A. SOCIO-ECONOMIC CHARACTERISTICS OF TENANT
1. Name of household head
2. Occupation of household head
Also recorded: education level (No schooling, Primary, Secondary, University) and sex (Male, Female) of the household head.

B. DESCRIPTION OF RESIDENTIAL UNIT
Items: type of unit; ownership of unit; type of scheme; number of rooms used; status of unit.
:I.Tenant Pu.chase 3 other & University I Swuh;li 4 Pastotal 4. Rental [5. Shanty 5. C om pany S. Si tasan dServic Below Ove, Tol 6. Other 6. Unautho-ised 6. Unautherised IS Years Ls15 Yeas 4. E Z FZ received per Molp Frmalft Mle Femao Maio Female Years MonthE z Z a 8 ' 1920 2324 7 35 36 39 0 13 5 C 1OST COMAON4 CONSTRUCTION MATERIALS S FACILITIES B 3. 1. 2. D C T 4. 6 7EDNESTIC F1.1103 TUTEY WALLS RUIOF WATEP SUPPLY TOILET BATHING KITCHEN LIGHTING TELEPHONE ROOS GARBAGE DISPOSUN ___.-- -- - .- _ _U_ __FO _ -. _ ! ----1-F -I ____AVILABIITYNOT WATER fA Fl IJEFC*ILTI ECLESRVANSINE I Earth 01 8, 1 ks I: Ttalih FOR O R WHEET(ER701 1nie 1 nvl Hue 1 Frvt 1 oragOccuean 2, Wand 02 BIndCs 2 Tn PED WATIEPIPED WATER YeSTE DPS CLTIN STREETS1 -. Terrnzo 103, 5lanes 3, Tiues 1' OItsidit 1* CovtmmuYs iP-aeFls noo :Pi nl V Elecritt 2: Yes V YAS Dtba unal 2 Sam 2C M2osmty a Private i en d L Conrate.-Wao 0010 ,Con c rct a 4' Concrete Withi 2* Comma nai 2: Private Pit Outdoor 2' oml2 Priae ampfin D:N0No 2Cmuinol2 Sm ..a Fla "I Centrae/ G'ommu.Tnan3: Prchtitase 3uth n iGaber 5: Can-crele-- Tiles O5 Wood 5 Carrugalie Iron Communal 2: No 3 CommunS l Flush Indoor C mmint l Concre-. c 06. Tin 6 Asbtstos Sheets 3 Out 4 None 4:Communal Pit Otldcor 4 Other 3 Other Duip Dirt y 7 Concrete-Cement GTCorruguled Iron 7. Other 1OOm)I 5: Other 5:Other 5: None 4. Dlher . OthOr CtrMu6LU-Cn ont 4s 6o 6U Naa r 6s None C!)MufJ-.Wood 10 Cnrd Board Times l61 6Oth65 6Per Monthpe I I i s I 36 37 38 39 60 61 626 45 67A 3 ,5 2 i 3. 1, 0 2,3H.6. 6 1DMSI - 77 - CENTRAL BUREAU OF STATISTICS MINISTRY OF ECONOMIC PLANNING AND DEVELOPMENT URBAN HOUSING SURVEY 1983 INTERVIEWED EDITED HOUSEHOLD QUESTIONNAIRE oAE RUCTUREHH CC 4 ENUMERATOR/ TGWN No DLILGI TIWN TRUCTGE No -I SUPERVISOR E. OPINIONS ABOUT NEIGHBOURHOOD F. TRAVEL. TIME AND COST 1. 2. 3. /., OPINION ABOUT YCUR NEIGHBOURHOOD ONE WAY DISTANCE NORMAL MODE TOTAL TRAVEL TOTAL NORMAL DAILY COST ON THE FOLLOWING TO WORK OF TRAVEL TIME 1! 
Worst Specify In KM 1:Foot In Minules In K.Shitlings 2: Fair 2: Bicycle 3: Inditerent 3:Private car i.: Good d: Matatu 5: Bus 5: Best 6:0Motorcycle SECURITY H tUMAN RECREATIVE OTHER 7: Other HEAD SPOUSE CHILDREN TOTAL ENVIRONMENT ENVIRONME FACILITIES TO SCHOOL HOUSEHOLD CODE RATE H AD SPOUSE HEAD SPOUSE HEAD SPOUSE TO WORK TO WORK I 1 1f I a - I I I I - I i. 13 1 15 16 17 18 19 20 21 22 23 24 25 27 28 30 31 333k 36137 39 0 G. INCOME AND EXPENDITURE GROSS MONTHLY INCOME FROM NUMBR TOTAL HOUSEHOLD EXPENDITURE LAST "ONTH ALL SOURCES IN K.SH OF INCOME IN K.SHILLINGS 0: Under 500 CONTRIBUTORS Food I I I I 1: 501 - 1,000 45 19 2: 1001- Z000 Rent a I a 1 3: 2,001 - 4000 50 54 1.: 4,001- 4000 Household Requirements I i I I 5: F001 - tU)O 55 59 6: 4001 - 1 OOO Transport I I 1 7: 10.001 - 21000 60 05 8: Ovcr 20000 Waterl Light 9: Unknown 65 69 Other 1.3 LA70 71. TOTAL I 75 79 78 - CONFIDENTIAL CENTRAL BUREAU OF STATISTICS Purje 3 MINISTRY OF ECONOMIC PLANNING AND DEVELOPMENT URBAN HOUSING SURUEY 1983 INTERVIEWED EDITED HOUSEHOLD QUESTIONNAIRE TOWN STRUCTURE H/H CHECI FORM DATE - j _ NUMBER DIGI TYPE ENUMERATOR/ 3 CLERM 11 12 SUPERVISCR H. FOR RENTERS ONLY RESERVED FOR COMPUTER USE l. Year moved to the residential unit Year 2. Amount of rent paid last month HSh/Month 3. Does rent include the following charges: 15 19 WATER Yes No ELECTRICITY Yes TELEPHONE Y No - HIRE FOR FURNITURE No DOMESTIC STAFF/ASKARI/GARDENER Yes No [ -- 22 No F 23 DS '-- 24 4. Imputed value for these charges K_ _ MSh/Month 5. Net Rent paid (after deducting these hargee) __F.Sh/Month 25 28 29 33 6. If rent is subsidised, is it? Government subsidy [; Subsidised by employer For services rendered Other Not applicabie 3 LE.J2 3 5 3 7. WILLINGNESS TO PAY FOR WATER a) For units with pipEd - water inside How much rent are you willing to pay for a similar unit but without piped - water inside? 
______ KSh/Month

b) For units with piped water within 25 metres:
   How much rent are you willing to pay for a similar unit but with piped water inside the unit? ______ KSh/Month

c) For units with piped water more than 25 metres from the unit:
   i.  How much rent are you willing to pay for a similar unit but with piped water inside the unit? ______ KSh/Month
   ii. How much rent are you willing to pay for a similar unit but with a stand-pipe close to the unit and shared by less than 10 families? ______ KSh/Month

CENTRAL BUREAU OF STATISTICS, MINISTRY OF ECONOMIC PLANNING AND DEVELOPMENT
URBAN HOUSING SURVEY 1983
HOUSEHOLD QUESTIONNAIRE (Page 4)
Identification boxes: structure; household; date interviewed and edited; enumerator/clerk; supervisor.

I. FOR RECENT MOVERS (WITHIN THE PAST 12 MONTHS) ONLY: INFORMATION ON PREVIOUS RESIDENTIAL UNIT

1.  Reason for moving ..........
2.  Amount of rent paid .......... KSh/Month
3.  Type of unit: 1 House; 2 Maisonette; 3 Flat; 4 Swahili; 5 Shanty; 6 Other
4.  Number of rooms
5.  Floor: 1 Earth; 2 Wood; 3 Terrazzo; 4 Concrete-wood; 5 Concrete-tiles; 6 Concrete-brick; 7 Concrete-cement; 8 Other
6.  Outer walls: 01 Bricks; 02 Blocks; 03 Stones; 04 Concrete; 05 Wood; 06 Tin; 07 Corrugated iron; 08 Mud-cement; 09 Mud-wood; 10 Cardboard; 11 Other
7.  Roof: 1 Thatch; 2 Tin; 3 Tiles; 4 Concrete; 5 Corrugated iron; 6 Asbestos sheets; 7 Other
8.  Water supply:
    Provision for piped water: 1 Inside; 2 Outside (within 100 m); 3 Outside (beyond 100 m); 4 None
    Availability of piped water: 1 Private; 2 Communal; 3 Private/communal; 4 None
    Hot-water system: 1 Yes; 2 No
9.  Toilet facilities: 1 Private flush; 2 Private pit; 3 Communal flush; 4 Communal pit; 5 Other; 6 None
10. Bathing facilities: 1 Private indoor; 2 Private outdoor; 3 Communal indoor; 4 Communal outdoor; 5 Other; 6 None
11. Kitchen facilities: 1 Private; 2 Communal; 3 Private/communal; 4 Other; 5 None
12. Lighting facilities: 1 Electricity; 2 Paraffin lamp; 3 Other; 4 None
13. Telephone: 1 Yes; 2 No
14. Domestic servants' quarters: 1 Yes; 2 No

J. FOR HOUSEHOLDS PLANNING TO MOVE ONLY: INFORMATION ON UNIT TO WHICH PLANNING TO MOVE

1.  Reason for moving ..........
2.  Amount of rent expecting to pay .......... KSh/Month
3. to 14. The same items and codes as items 3 to 14 of Section I (type of unit, number of rooms, floor, outer walls, roof, water supply, toilet, bathing, kitchen and lighting facilities, telephone, domestic servants' quarters).

CENTRAL BUREAU OF STATISTICS, MINISTRY OF ECONOMIC PLANNING AND DEVELOPMENT
URBAN HOUSING SURVEY 1983
STRUCTURE QUESTIONNAIRE
Identification boxes: town name; structure number; check digit; date interviewed and edited; enumerator/clerk; supervisor.

A. DESCRIPTION OF STRUCTURE

1. Type of structure: 1 House; 2 Maisonette; 3 Block of flats; 4 Swahili; 5 Shanty; 6 Other
2. Type of scheme: 1 Private; 2 Mortgage; 3 Tenant purchase; 4 Sites and service; 5 Unauthorised; 6 Other
3. Number of residential units: complete; incomplete; occupied; unoccupied; owner-occupied; rented
4. Rental income last month (in K. shillings): rented units; vacant/owner-occupied; total
5. Age of structure in years: 1 Less than 5; 2 5 to less than 10; 3 10 to less than 20; 4 Over 20
6. Floor: 1 Earth; 2 Wood; 3 Terrazzo; 4 Concrete-wood; 5 Concrete-tiles; 6 Concrete-brick; 7 Concrete-cement; 8 Other
7. Outer walls: 01 Bricks; 02 Blocks; 03 Stones; 04 Concrete; 05 Wood; 06 Tin; 07 Corrugated iron; 08 Mud-cement; 09 Mud-wood; 10 Cardboard; 11 Other
8. Roof: 1 Thatch; 2 Tin; 3 Tiles; 4 Concrete; 5 Corrugated iron; 6 Asbestos sheets; 7 Other

B. FACILITIES

1. Water supply:
   Provision for piped water: 1 Inside; 2 Outside (within 100 m); 3 Outside (beyond 100 m); 4 None
   Availability of piped water: 1 Private; 2 Communal; 3 Private/communal; 4 None
   Hot-water system: 1 Yes; 2 No
2. Toilet facilities: 1 Private flush; 2 Private pit; 3 Communal flush; 4 Communal pit; 5 Other; 6 None
3. Bathing facilities: 1 Private indoor; 2 Private outdoor; 3 Communal indoor; 4 Communal outdoor; 5 Other; 6 None
4. Kitchen facilities: 1 Private; 2 Communal; 3 Private/communal; 4 Other; 5 None
5. Lighting facilities: 1 Electricity; 2 Paraffin lamp; 3 Other; 4 None
6. Garbage disposal:
   Where deposited: 1 Private dustbin; 2 Communal dustbin; 3 Communal dump
   Frequency of collection per month
7. Cleanliness: 1 Clean; 2 Some garbage; 3 Very dirty

C. DISTANCE TO PUBLIC AMENITIES

1. Piped water; 2. Nearest nursery school; 3. Nearest primary school; 4. Nearest secondary school; 5. Church/(illegible); 6. Hospital; 7. Clinic/dispensary; 8. Police post/station; 9. Matatu/bus stop; 10. Tarmac road; 11. Telephone; 12. Street lighting; 13. Market/shopping centre; 14. Community centre; 15. Town centre

Distance codes in kilometres: 0 Within homestead; 1 Less than 1; 2 1 to less than 2; 3 2 to less than 3; 4 3 to less than 4; 5 4 to less than 5; 6 5 to less than 10; 7 Over 10 km
STRUCTURE QUESTIONNAIRE (Page 2) -- CONFIDENTIAL

TO BE OBTAINED FROM A TRACEABLE OWNER/OWNER'S REPRESENTATIVE. TICK WHERE APPROPRIATE OR FILL WITH APPROPRIATE NUMBER. (Boxes at the right of the form are reserved for computer use.)

D. SOCIO-ECONOMIC CHARACTERISTICS OF THE OWNER

1. Name of the owner ..........  Sex: Male / Female
   Postal address ..........
   Residential address ..........
2. Occupation of the owner ..........
3. Level of formal education attained by owner: No schooling / Primary / Secondary / University
4. Gross monthly income of the owner's household (from all sources) in KSh: Under 500 / 501-1,000 / 1,001-2,000 / 2,001-4,000 / 4,001-6,000 / 6,001-8,000 / 8,001-10,000 / 10,001-20,000 / Above 20,000 / Unknown
5. Number of income contributors

E. ACQUISITION AND FINANCING OF LAND AND STRUCTURE

1. LAND
   i)    Land tenure: Own / Lease / Other
   ii)   When acquired, was the plot: Developed / Undeveloped

STRUCTURE QUESTIONNAIRE (Page 3) -- CONFIDENTIAL

   iii)  Size of the plot .......... acres
   iv)   Year of acquisition of land
   v)    Value of land when acquired .......... KSh
   vi)   Estimated current value of land .......... KSh
   vii)  Annual land rate (to Local Authorities) .......... KSh
   viii) Annual land rent (to Central Government) .......... KSh

2. STRUCTURE
   i)    For the structure, do you have: Title deed / Lease / Temporary occupancy licence / Other
   ii)   Was the structure: Purchased / Gift / Inherited / Owner-built / Other
   iii)  Year of acquisition
   iv)   Amount paid .......... KSh
   v)    Year structure completed
   vi)   Total legal fees paid .......... KSh
   vii)  Amount of architectural fees paid .......... KSh
   viii) Construction costs (excludes legal and architectural fees) .......... KSh
   ix)   Estimated current value of structure .......... KSh
   x)    How was the acquisition financed? Cash / Credit / Not applicable

STRUCTURE QUESTIONNAIRE (Page 4) -- CONFIDENTIAL

   xi)   For cash only:
         a) How was it financed? Savings / Gift / Sale of property / Loan / Other
         b) If loan, source: Commercial bank / Mortgage housing finance company / National Housing Corporation / Insurance companies / Co-operatives / Other financial institutions / Employer / Relative / Other
   xii)  For credit only:
         a) Amount of downpayment .......... KSh
         b) Monthly mortgage payments .......... KSh
         c) Period of mortgage payments .......... years
         d) Source of finance: Commercial bank / Mortgage housing finance company / National Housing Corporation / Insurance companies / Co-operatives / Other financial institutions / Employer / Relative / Other
   xiii) Do you own other residential structures? Yes / No
   xiv)  Location of other residential structures: In this town / In other towns / Both in this town and other towns / Not applicable

CENTRAL BUREAU OF STATISTICS, MINISTRY OF ECONOMIC PLANNING AND DEVELOPMENT
URBAN HOUSING SURVEY 1983
LISTING FORM

Town ..........  Stratum ..........  Cluster ..........
Town name ..........  Stratum name ..........

Columns: Serial No.; Structure No.; Household Id. No.; Total in household; Name of household head; Remarks

APPENDIX B: INTRODUCTION TO LOGARITHMS AND ELASTICITY

There is no escape from the following fact: when first encountered, logarithms are confusing. They have such desirable properties, however, that it is well worth the effort to learn to use them.
The next few pages serve as an introduction to the subject, but realistically, only working with logarithms for some time will make you completely comfortable with them. It's worth the effort because they can make complicated regression models easy to estimate.1/

1/ Much of this annex is adapted from Chiang (1974), Chapter 10, and Mirer (1983), Chapters 2 and 6.

Exponents and Logarithms

An exponent is the power to which a variable is to be raised, and logarithms are just special exponents. In familiar power expressions such as $x^2$ or $x^3$, the exponents are constants, but there is no reason why we can't have variable exponents like $x^y$ (or as sometimes written, x ** y, where ** is a common symbol for exponentiation). Suppose we believe that the relationship between rent and income is of the following form:

(A-1)   $R = k \cdot I^{b}$

where R and I are, say, rent and income, k is a constant term, and b is the exponent of income; both k and b are to be estimated. Ordinary least squares can't handle this problem as written, because b enters as an exponent rather than as a coefficient multiplying income. We have to find a way to rewrite the equation so that b multiplies I or, more precisely, some simple and computable function of I.

Definition of logarithm. When we have two numbers such as 4 and 16 which are related exponentially:

(A-2)   $4^{2} = 16$

then we define the exponent 2 to be the logarithm of 16 to the base 4, and can rewrite A-2 as:

(A-3)   $\log_{4} 16 = 2$

or in other words, the logarithm is the power (here 2) to which the base (here 4) must be raised to attain a particular number (16). A logarithm is simply an exponent. "Log" is often written as the short form of logarithm. We can also go back the other way, and exponentiate, or "find the antilog."

We can choose any base we want to. In practice, there are two common bases, base 10 (often called common logarithms) and a special base, the number 2.71828, often denoted by the letter "e" and known as the base of natural logarithms.
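The definition in A-2 and A-3 can be checked numerically. What follows is a minimal sketch in Python, which this paper does not otherwise use; the helper name `log_base` is an illustrative assumption, and only the standard `math` module is involved.

```python
import math

def log_base(number, base):
    """Power to which `base` must be raised to obtain `number`."""
    return math.log(number) / math.log(base)

# Equation A-3: the logarithm of 16 to the base 4 is 2.
exponent = log_base(16.0, 4.0)

# Going back the other way ("finding the antilog") recovers the number.
antilog = 4.0 ** exponent

# Natural logarithms use the base e = 2.71828..., so ln(e) is 1.
natural_log_of_e = math.log(math.e)

print(round(exponent, 4), round(antilog, 4), round(natural_log_of_e, 4))
```

Dividing by `math.log(base)` is the usual change-of-base identity; it is used here only so that the same helper works for any base.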
In economics, we always use natural logarithms, because it turns out this type has several desirable properties which we will discuss later. The special number e is a constant very much like the more familiar number pi (3.14159) which turns up so often in geometry and other mathematics. The derivation of e requires calculus so we will not discuss it here.2/

Table A-1 lists some representative numbers and their natural logarithms. Of course, when analyzing data with computers we don't refer to these tables, because the computer can automatically calculate the logs for us. Notice that every number "n" with an entry in the table is greater than zero. One of the restrictions with logarithms is that they cannot be computed for zero or negative numbers. Fortunately, this is not a serious problem for our kind of work because most important economic variables -- for example, incomes, prices, rents, distances, household sizes -- are always positive.3/

2/ See Chiang (1974) or any calculus text for the derivation of e.

3/ If there are a few legitimate negative numbers for some variable for which you want to use logs, a common fix-up is to (1) add the absolute value of the most negative number, plus one, to each observation, (2) then take the log.

Table A-1: Natural Logarithms

    n      ln n   |     n      ln n   |     n      ln n
  0.0       -     |   4.5    1.5041   |   9.0    2.1972
  0.1   -2.3026   |   4.6    1.5261   |   9.1    2.2083
  0.2   -1.6094   |   4.7    1.5476   |   9.2    2.2192
  0.3   -1.2040   |   4.8    1.5686   |   9.3    2.2300
  0.4   -0.9163   |   4.9    1.5892   |   9.4    2.2407
  0.5   -0.6931   |   5.0    1.6094   |   9.5    2.2513
  0.6   -0.5108   |   5.1    1.6292   |   9.6    2.2618
  0.7   -0.3567   |   5.2    1.6487   |   9.7    2.2721
  0.8   -0.2231   |   5.3    1.6677   |   9.8    2.2824
  0.9   -0.1054   |   5.4    1.6864   |   9.9    2.2925
  1.0    0.0000   |   5.5    1.7047   |    10    2.3026
  1.1    0.0953   |   5.6    1.7228   |    11    2.3979
  1.2    0.1823   |   5.7    1.7405   |    12    2.4849
  1.3    0.2624   |   5.8    1.7579   |    13    2.5649
  1.4    0.3365   |   5.9    1.7750   |    14    2.6391
  1.5    0.4055   |   6.0    1.7918   |    15    2.7081
  1.6    0.4700   |   6.1    1.8083   |    16    2.7726
  1.7    0.5306   |   6.2    1.8245   |    17    2.8332
  1.8    0.5878   |   6.3    1.8405   |    18    2.8904
  1.9    0.6419   |   6.4    1.8563   |    19    2.9444
  2.0    0.6931   |   6.5    1.8718   |    20    2.9957
  2.1    0.7419   |   6.6    1.8871   |    25    3.2189
  2.2    0.7885   |   6.7    1.9021   |    30    3.4012
  2.3    0.8329   |   6.8    1.9169   |    35    3.5553
  2.4    0.8755   |   6.9    1.9315   |    40    3.6889
  2.5    0.9163   |   7.0    1.9459   |    45    3.8067
  2.6    0.9555   |   7.1    1.9601   |    50    3.9120
  2.7    0.9933   |   7.2    1.9741   |    55    4.0073
  2.8    1.0296   |   7.3    1.9879   |    60    4.0943
  2.9    1.0647   |   7.4    2.0015   |    65    4.1744
  3.0    1.0986   |   7.5    2.0149   |    70    4.2485
  3.1    1.1314   |   7.6    2.0281   |    75    4.3175
  3.2    1.1632   |   7.7    2.0412   |    80    4.3820
  3.3    1.1939   |   7.8    2.0541   |    85    4.4427
  3.4    1.2238   |   7.9    2.0669   |    90    4.4998
  3.5    1.2528   |   8.0    2.0794   |    95    4.5539
  3.6    1.2809   |   8.1    2.0919   |   100    4.6052
  3.7    1.3083   |   8.2    2.1041   |   200    5.2983
  3.8    1.3350   |   8.3    2.1163   |   300    5.7038
  3.9    1.3610   |   8.4    2.1282   |   400    5.9915
  4.0    1.3863   |   8.5    2.1401   |   500    6.2146
  4.1    1.4110   |   8.6    2.1518   |   600    6.3969
  4.2    1.4351   |   8.7    2.1633   |   700    6.5511
  4.3    1.4586   |   8.8    2.1748   |   800    6.6846
  4.4    1.4816   |   8.9    2.1861   |   900    6.8024

It turns out that equations like A-1 can be easily re-expressed in linear form if we use natural logarithms. If we perform the same mathematical operation on both sides of A-1 the equality still holds. We can therefore write:

(A-4)   $\log_{e} R = \log_{e}(k \cdot I^{b})$

or, since $\log_{e}$, the natural log, is often written "ln" for brevity,

(A-5)   $\ln R = \ln(k \cdot I^{b})$

Now what? There are two important rules of logs which help simplify these equations. These rules are so helpful that they are one of the major reasons for using logs. They are:

Rule 1: the log of the product of two numbers equals the sum of the logs of the two numbers. For example,

        $\ln(x \cdot y) = \ln(x) + \ln(y)$

Rule 2: the log of an exponential function equals the exponent times the log of the variable, or

        $\ln(x^{a}) = a \cdot \ln(x)$

We will not prove these rules (see Chiang or a calculus text) but we can illustrate that they work by using examples from Table A-1. For example, Rule 1 says that ln(15) = ln(3 * 5) = ln(3) + ln(5), or, from Table A-1: 2.7081 = 1.0986 + 1.6094 (the sum is 2.7080; the difference is rounding in the fourth decimal place), which verifies Rule 1. To test Rule 2, try ln(16) = $\ln(4^{2})$ = 2 * ln(4), so check that 2.7726 = 2 * 1.3863, and Rule 2 is verified.
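The same two checks can be run without the table. The lines below are a small Python sketch offered only as an illustration; the function names `rule1_sides` and `rule2_sides` are assumptions introduced here, and `math.log` is the natural logarithm.

```python
import math

def rule1_sides(x, y):
    """Both sides of Rule 1: ln(x * y) and ln(x) + ln(y)."""
    return math.log(x * y), math.log(x) + math.log(y)

def rule2_sides(x, a):
    """Both sides of Rule 2: ln(x ** a) and a * ln(x)."""
    return math.log(x ** a), a * math.log(x)

lhs1, rhs1 = rule1_sides(3.0, 5.0)   # ln(15) versus ln(3) + ln(5)
lhs2, rhs2 = rule2_sides(4.0, 2.0)   # ln(16) versus 2 * ln(4)

# Rounded to four places, as in Table A-1.
print(round(lhs1, 4), round(rhs1, 4))
print(round(lhs2, 4), round(rhs2, 4))
```

Because the computer carries more digits than the table, the two sides agree to many more decimal places than the hand check allows.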
The reader can make up his own examples with the numbers from Table A-1 to convince himself of the validity of these rules.

There is no reason why we can't use both rules on the same problem. Let's return to equation A-5:

(A-5)   $\ln R = \ln(k \cdot I^{b})$

Now apply Rule 1 and get:

(A-6)   $\ln R = \ln(k) + \ln(I^{b})$

and applying Rule 2 we get

(A-7)   $\ln R = \ln(k) + b \cdot \ln(I)$

and now we have transformed equation A-1 into a linear equation that can be estimated using regression analysis! All we have to do is compute two variables, the natural log of R (rent) and the natural log of I (income), and use these new variables in the regression:

(A-8)   $\ln R = a + b \cdot \ln(I)$

The only difference between A-8 and A-7 (and hence A-1) is that the estimated intercept, a, is the log of the original constant term, k. It can easily be transformed back (k is the antilog of a), but usually we are more interested in the estimate of b, since that number can be interpreted as the percentage change in rent given a one percent change in income.

If we have several right-hand side variables:

(A-9)   $R = k \cdot I^{b} \cdot P^{c}$

where P is price, and c is a new parameter to be estimated, we can use the regression:

(A-10)  $\ln R = a + b \cdot \ln(I) + c \cdot \ln(P)$

Also, we can add other types of variables such as dummy variables, linear variables, and squared variables, e.g.:

(A-11)  $\ln R = a + b \cdot \ln(I) + c \cdot HH + d \cdot HHSQ$

where HH is household size, and HHSQ is household size, squared. (These two enter in levels rather than in logs, since the log of HHSQ is just twice the log of HH.)

So far, so good. Now we know how to estimate log-linear models derived from nonlinear models like A-1 and A-9, which cannot be estimated directly by linear regression. So what? Why would we want to do that? It turns out that these models have several desirable properties. First, models like A-1 are constant elasticity models; the relative effect of a change in I upon R is constant, and equal to b.
This means that the regression coefficient b is a very convenient summary of the responsiveness of R to changes in I. If b happens to equal 1, a 1 percent change in income implies a 1 percent change in rent; in other words, rents rise proportionally with income, or, in another manner of speaking, the typical rent-to-income ratio is constant as income changes. If b is zero, then there is no measured relationship between rent and income. If b lies between 0 and 1, rents go up with income, but not as fast as income goes up. Suppose we estimate b to be .6. Then a 1 percent increase in income implies a .6 percent increase in rent. This means that at higher incomes people pay higher levels of rent, but since the increases are less than proportional, the rent-to-income ratio goes down. This is confusing at first, and requires some thought. Suppose we have three people:

                            Proportional   Proportional
                              Change in      Change in     Rent/
  Person    Rent    Income     Income          Rent        Income
    1        200     1,000       --             --          .20
    2        260     1,500       50%            30%         .17
    3        320     2,000       33%            23%         .16

Note the following: The second person has 50 percent more income than the first, and the third 33 percent more than the second. But rents rise less than proportionally to income (30 percent and 23 percent). In other words, the level of rents goes up while the rent-to-income ratio declines. Demand is income-inelastic, in economics jargon. If we ran a regression like A-8 on a sample which contained mostly people with this kind of consumption pattern, then we could expect our estimate of b to be greater than zero but less than one. Suppose instead that our sample contains mostly people like this:

                            Proportional   Proportional
                              Change in      Change in     Rent/
  Person    Rent    Income     Income          Rent        Income
    1        200     1,000       --             --          .20
    2        300     1,500       50%            50%         .20
    3        400     2,000       33%            33%         .20

Here rents rise in the same proportion as incomes, the rent-to-income ratio is constant, and the coefficient from a regression like A-8 should be close to one.
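To make the link between the two tables and regression A-8 concrete, the following Python sketch fits ln R = a + b * ln(I) by ordinary least squares to each set of three people. It is an illustration only: the function name `fit_log_log` is an assumption introduced here, three observations are far too few for real estimation, and an actual analysis would use a regression package and the full survey sample.

```python
import math

def fit_log_log(incomes, rents):
    """Least-squares fit of ln(rent) = a + b * ln(income); returns (a, b)."""
    x = [math.log(v) for v in incomes]
    y = [math.log(v) for v in rents]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope = covariance of (ln I, ln R) divided by variance of ln I.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b

incomes = [1000.0, 1500.0, 2000.0]
a1, b1 = fit_log_log(incomes, [200.0, 260.0, 320.0])  # first table
a2, b2 = fit_log_log(incomes, [200.0, 300.0, 400.0])  # second table

print(round(b1, 2), round(b2, 2))
```

With the first table's figures the fitted b falls between zero and one, as the text predicts for income-inelastic demand. With the second table's figures b equals one up to floating-point error, and the antilog of a recovers the constant rent-to-income ratio of .20.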
If we found rents going up faster than income, and rent-to-income ratios increasing with income, then b should be greater than one (demand is income elastic).

A second advantage is that the log form is flexible, and thus does not place undue restrictions on the shape of the relationship to be estimated. Figure A-1, adapted from Mirer (1983, p. 103), demonstrates this point (using Y and X instead of R and I as variable names).

A third advantage is that log models in economics have more constant variance than linear models. This means the following: the errors, or differences between predicted rent and actual rent, are often found to vary systematically with income when a simple linear model is used, in the following way: larger errors are typically found for higher income people. This problem, known as heteroskedasticity, means that our hypothesis tests using t and F statistics are incorrect. Log models usually do not suffer from this problem as much as simple linear models.4/

4/ See, for example, Malpezzi, Ozanne and Thibodeau (1981), pp. 24-26.

Figure A-1: Flexibility of Logarithmic Functional Form

        $Y = e^{\beta_0} X^{\beta_1}$          $\ln Y = \beta_0 + \beta_1 \ln X$

[Figure not reproduced. Five pairs of panels plot Y against X (left) and ln Y against ln X (right) for the cases (a) $\beta_1 < 0$, (b) $\beta_1 = 0$, (c) $0 < \beta_1 < 1$, (d) $\beta_1 = 1$, and (e) $\beta_1 > 1$.]

The geometry of the log-linear relation depends on the sign of $\beta_1$. When Y decreases as X increases (case (a), with $\beta_1 < 0$), it is concave upward. When Y increases with X (with $\beta_1 > 0$), the concavity may be upward or downward, depending on the magnitude of $\beta_1$. Although Y is a nonlinear function of X, ln Y is a linear function of ln X; the slope of that line is the same $\beta_1$ as in the original formulation. The parameter $\beta_1$ is the elasticity of Y with respect to X.

APPENDIX C: FORTRAN SUBROUTINES FOR ORDER STATISTICS

The code in this appendix is from Velleman and Hoaglin (1981), which is also recommended as a guide to exploratory data analysis. The first part of the appendix lists the basic code.
The second part provides some explanation of initialization and programming conventions. We are indebted to Velleman and Hoaglin for permission to use this code. However, as they point out, this code was designed for other purposes (see their text), and should be used as a guide to writing code for order statistics rather than copied directly. In particular, the Kenyan housing survey is not self-weighting, so medians should be computed using sample weights. This is a straightforward extension of this code, in which the sum of the weights is substituted for the number of observations.

      BLOCK DATA
C
C  CHARS CONTAINS THE SYMBOLS OF THE STANDARD FORTRAN CHARACTER SET,
C  AND  CHA - CHPT  ARE THE CORRESPONDING INDICES INTO  CHARS.
C  PUTCHR IS THE PRIMARY USER OF THIS TRANSLATION VECTOR.
C
      COMMON /CHARIO/ CHARS, CMAX,
     1     CHA, CHB, CHC, CHD, CHE, CHF, CHG, CHH, CHI, CHJ, CHK,
     2     CHL, CHM, CHN, CHO, CHP, CHQ, CHR, CHS, CHT, CHU, CHV,
     3     CHW, CHX, CHY, CHZ, CH0, CH1, CH2, CH3, CH4, CH5, CH6,
     4     CH7, CH8, CH9, CHBL, CHEQ, CHPLUS, CHMIN, CHSTAR, CHSLSH,
     5     CHLPAR, CHRPAR, CHCOMA, CHPT
C
      INTEGER CHARS(46), CMAX
      INTEGER CHA, CHB, CHC, CHD, CHE, CHF, CHG, CHH, CHI
      INTEGER CHJ, CHK, CHL, CHM, CHN, CHO, CHP, CHQ, CHR
      INTEGER CHS, CHT, CHU, CHV, CHW, CHX, CHY, CHZ
      INTEGER CH0, CH1, CH2, CH3, CH4, CH5, CH6, CH7, CH8, CH9
      INTEGER CHBL, CHEQ, CHPLUS, CHMIN, CHSTAR, CHSLSH
      INTEGER CHLPAR, CHRPAR, CHCOMA, CHPT
      DATA CHARS( 1),CHARS( 2),CHARS( 3),CHARS( 4) /1HA,1HB,1HC,1HD/
      DATA CHARS( 5),CHARS( 6),CHARS( 7),CHARS( 8) /1HE,1HF,1HG,1HH/
      DATA CHARS( 9),CHARS(10),CHARS(11),CHARS(12) /1HI,1HJ,1HK,1HL/
      DATA CHARS(13),CHARS(14),CHARS(15),CHARS(16) /1HM,1HN,1HO,1HP/
      DATA CHARS(17),CHARS(18),CHARS(19),CHARS(20) /1HQ,1HR,1HS,1HT/
      DATA CHARS(21),CHARS(22),CHARS(23),CHARS(24) /1HU,1HV,1HW,1HX/
      DATA CHARS(25),CHARS(26),CHARS(27),CHARS(28) /1HY,1HZ,1H0,1H1/
      DATA CHARS(29),CHARS(30),CHARS(31),CHARS(32) /1H2,1H3,1H4,1H5/
      DATA CHARS(33),CHARS(34),CHARS(35),CHARS(36) /1H6,1H7,1H8,1H9/
      DATA CHARS(37),CHARS(38),CHARS(39),CHARS(40) /1H ,1H=,1H+,1H-/
      DATA CHARS(41),CHARS(42),CHARS(43),CHARS(44) /1H*,1H/,1H(,1H)/
      DATA CHARS(45),CHARS(46) /1H,,1H./
      DATA CMAX /46/
      DATA CHA,CHB,CHC,CHD,CHE,CHF / 1, 2, 3, 4, 5, 6/
      DATA CHG,CHH,CHI,CHJ,CHK,CHL / 7, 8, 9,10,11,12/
      DATA CHM,CHN,CHO,CHP,CHQ,CHR /13,14,15,16,17,18/
      DATA CHS,CHT,CHU,CHV,CHW,CHX /19,20,21,22,23,24/
      DATA CHY,CHZ,CH0,CH1,CH2,CH3 /25,26,27,28,29,30/
      DATA CH4,CH5,CH6,CH7,CH8,CH9 /31,32,33,34,35,36/
      DATA CHBL,CHEQ,CHPLUS,CHMIN /37,38,39,40/
      DATA CHSTAR,CHSLSH,CHLPAR,CHRPAR /41,42,43,44/
      DATA CHCOMA,CHPT /45,46/
C
      END

      SUBROUTINE CINIT(IOUNIT, IPMIN, IPMAX, IEPSI, IMAXIN, ERR)
C
      INTEGER IOUNIT, IPMIN, IPMAX, IMAXIN, ERR
      REAL IEPSI
C
C  INITIALIZATION, TO BE CALLED AT START OF ANY MAIN PROGRAM
C  WHICH CALLS ONE OF THE EDA SUBROUTINES (EITHER DIRECTLY OR
C  INDIRECTLY).
C
C  IOUNIT  IS THE NUMBER OF THE UNIT TO WHICH OUTPUT IS DIRECTED.
C  IPMIN   IS THE LEFT MARGIN.
C  IPMAX   IS THE RIGHT MARGIN.
C  IEPSI   IS THE MACHINE-RELATED EPSILON.
C  IMAXIN  IS THE MAXIMUM PERMITTED INTEGER VALUE.
C
C  ERR  IS THE (USUAL) ERROR FLAG, INDICATING WHETHER
C  THE ROUTINE EXECUTED SUCCESSFULLY.
C
      COMMON /CHRBUF/ P, PMAX, PMIN, OUTPTR, MAXPTR, OUNIT
      COMMON /NUMBRS/ EPSI, MAXINT
C
      INTEGER P(130), PMAX, PMIN, OUTPTR, MAXPTR, OUNIT
      REAL EPSI, MAXINT
C
C  LOCAL VARIABLES
C
      INTEGER BLANK, I
      DATA BLANK /1H /
C
      ERR = 6
      IF(IPMIN .LT. 1) GO TO 999
      IF(IPMAX .GT. 130) GO TO 999
      IF(IPMAX .LE. IPMIN) GO TO 999
      ERR = 7
      IF((1.0 + IEPSI) .LE. 1.0) GO TO 999
      ERR = 0
      OUNIT = IOUNIT
      PMIN = IPMIN
      OUTPTR = IPMIN
      MAXPTR = IPMIN
      PMAX = IPMAX
      EPSI = IEPSI
      MAXINT = FLOAT(IMAXIN)
C
      DO 50 I = 1, 130
        P(I) = BLANK
   50 CONTINUE
C
  999 RETURN
      END

      SUBROUTINE PUTCHR(POSN, CHAR, ERR)
C
      INTEGER POSN, CHAR, ERR
C
C  PLACE THE CHARACTER  CHAR  AT POSITION  POSN  IN
C  THE OUTPUT LINE  P .  IF  POSN = 0 , PLACE  CHAR  IN THE
C  NEXT AVAILABLE POSITION IN  P .  MAXPTR  IS TO BE
C  INITIALIZED TO  PMIN , AND  PRINT  MUST RESET IT.
C
      COMMON /CHARIO/ CHARS, CMAX,
     1     CHA, CHB, CHC, CHD, CHE, CHF, CHG, CHH, CHI, CHJ, CHK,
     2     CHL, CHM, CHN, CHO, CHP, CHQ, CHR, CHS, CHT, CHU, CHV,
     3     CHW, CHX, CHY, CHZ, CH0, CH1, CH2, CH3, CH4, CH5, CH6,
     4     CH7, CH8, CH9, CHBL, CHEQ, CHPLUS, CHMIN, CHSTAR, CHSLSH,
     5     CHLPAR, CHRPAR, CHCOMA, CHPT
C
      COMMON /CHRBUF/ P, PMAX, PMIN, OUTPTR, MAXPTR, OUNIT
C
      INTEGER CHARS(46), CMAX
      INTEGER CHA, CHB, CHC, CHD, CHE, CHF, CHG, CHH, CHI
      INTEGER CHJ, CHK, CHL, CHM, CHN, CHO, CHP, CHQ, CHR
      INTEGER CHS, CHT, CHU, CHV, CHW, CHX, CHY, CHZ
      INTEGER CH0, CH1, CH2, CH3, CH4, CH5, CH6, CH7, CH8, CH9
      INTEGER CHBL, CHEQ, CHPLUS, CHMIN, CHSTAR, CHSLSH
      INTEGER CHLPAR, CHRPAR, CHCOMA, CHPT
      INTEGER P(130), PMAX, PMIN, OUTPTR, MAXPTR, OUNIT
C
      IF(CHAR .GT. 0 .AND. CHAR .LE. CMAX) GO TO 10
      ERR = 4
      RETURN
   10 IF(POSN .NE. 0) OUTPTR = MAX0(PMIN, POSN)
      OUTPTR = MIN0(OUTPTR, PMAX)
      P(OUTPTR) = CHARS(CHAR)
      MAXPTR = MAX0(MAXPTR, OUTPTR)
      OUTPTR = OUTPTR + 1
      RETURN
      END

      INTEGER FUNCTION WDTHOF(I)
      INTEGER I
C
C  FIND THE NUMBER OF CHARACTERS NEEDED TO PRINT  I
C
      INTEGER IA, IQ, ND
C
      IA = IABS(I)
      ND = 1
      IF(I .LT. 0) ND = 2
   10 IQ = IA/10
      IF(IQ .EQ. 0) GO TO 20
      IA = IQ
      ND = ND + 1
      GO TO 10
   20 WDTHOF = ND
      RETURN
      END

      SUBROUTINE PUTNUM(POSN, N, W, ERR)
C
      INTEGER POSN, N, W, ERR
C
C  PLACE THE CHARACTER REPRESENTATION OF THE INTEGER  N
C  RIGHT-JUSTIFIED IN A FIELD  W  SPACES WIDE STARTING
C  AT POSITION  POSN  IN THE OUTPUT LINE  P .
C
C  THE VARIABLES  IP, INUM, AND  IW  ARE INTERNAL VERSIONS
C  OF  POSN, N, AND  W .  WE PROCEED BY EXTRACTING THE
C  DIGITS OF  N, STARTING WITH THE LOW-ORDER DIGIT,
C  AND STACKING THEM IN  DSTK.  ( ND  COUNTS THE DIGITS.)
C  ONCE WE HAVE COLLECTED ALL THE DIGITS (AND KNOW THAT
C  W  SPACES ARE SUFFICIENT), WE SKIP OVER ANY UNNEEDED
C  SPACES, PUT OUT A MINUS SIGN IF NEEDED, AND THEN PUT OUT
C  THE DIGITS, STARTING WITH THE HIGH-ORDER ONE.
C
C  THIS ROUTINE CALLS  PUTCHR  AND DEPENDS ON HAVING DIGITS
C  0 THROUGH 9 IN CONSECUTIVE ELEMENTS OF  CHARS  IN THE
C  COMMON BLOCK  CHARIO, STARTING AT  CH0 = 27.
C  IT ALSO ASSUMES THAT THE MINUS SIGN IS AT  CHMIN = 40  IN  CHARS.
C
      INTEGER CHD, CH0, CHMIN, DSTK(20), INUM, IP, IQ, IW, ND
C
      COMMON /CHRBUF/ P, PMAX, PMIN, OUTPTR, MAXPTR, OUNIT
      INTEGER P(130), PMAX, PMIN, OUTPTR, MAXPTR, OUNIT
C
      DATA CH0, CHMIN /27, 40/
C
      IW = W
      IF(N .LT. 0) IW = IW - 1
      INUM = IABS(N)
C
C  EXTRACT AND STACK THE DIGITS OF  INUM, CHECKING
C  TO SEE THAT  N  FITS IN  W  SPACES.
C
      ND = 1
   10 IQ = INUM/10
      DSTK(ND) = INUM - IQ * 10
      IF(ND .LE. 20 .AND. ND .LE. IW) GO TO 20
      ERR = 2
      GO TO 999
   20 IF(IQ .EQ. 0) GO TO 30
      INUM = IQ
      ND = ND + 1
      GO TO 10
C
C  UNSTACK THE DIGITS FROM  DSTK  AND PUT THEM OUT.
C  NOTE THAT WHEN  N  IS NEGATIVE, A MINUS SIGN MUST BE
C  INSERTED IN THE SPACE BEFORE THE FIRST DIGIT.  DECREASING
C  IW  BY 1 IN THE INITIALIZATION HAS PROVIDED A SPACE
C  FOR THE MINUS SIGN.
C
   30 IP = POSN
      IF(IP .EQ. 0) IP = OUTPTR
      IP = IP + IW - ND
      IF(N .GE. 0) GO TO 40
      CALL PUTCHR(IP, CHMIN, ERR)
      IP = IP + 1
   40 CHD = CH0 + DSTK(ND)
      CALL PUTCHR(IP, CHD, ERR)
      IF(ND .EQ. 1) GO TO 50
      ND = ND - 1
      IP = IP + 1
      GO TO 40
   50 CONTINUE
C
  999 RETURN
      END

      SUBROUTINE PRINT
C
C  PRINT THE OUTPUT LINE  P  ON UNIT  OUNIT  (MAXPTR
C  INDICATES THE RIGHTMOST POSITION WHICH HAS BEEN USED
C  IN THIS LINE).  THEN RESET  P  TO SPACES, AND  MAXPTR  AND
C  OUTPTR  TO  PMIN.
C
      COMMON /CHRBUF/ P, PMAX, PMIN, OUTPTR, MAXPTR, OUNIT
C
      INTEGER P(130), PMAX, PMIN, OUTPTR, MAXPTR, OUNIT
C
C  LOCAL VARIABLES
C
      INTEGER BLANK, I
C
      DATA BLANK /1H /
C
      WRITE(OUNIT, 10) (P(I), I=1, MAXPTR)
   10 FORMAT(1X, 130A1)
C
      DO 20 I = 1, MAXPTR
        P(I) = BLANK
   20 CONTINUE
C
      OUTPTR = PMIN
      MAXPTR = PMIN
C
      RETURN
      END

      SUBROUTINE SORT(Y, N, ERR)
C
      INTEGER N, ERR
      REAL Y(N)
C
C  SHELL SORT  N  VALUES IN  Y()  FROM SMALLEST TO LARGEST.
C
C  NOTE THAT LOCAL SYSTEM SORT UTILITIES ARE LIKELY TO BE
C  MORE EFFICIENT, AND SHOULD BE SUBSTITUTED WHENEVER POSSIBLE.
C
C  LOCAL VARIABLES
C
      INTEGER I, J, J1, GAP, NMG
      REAL TEMP
C
      IF(N .GE. 1) GO TO 10
      ERR = 1
      GO TO 999
   10 IF(N .EQ. 1) GO TO 999
C
C  ONE ELEMENT IS ALWAYS SORTED
C
      GAP = N
   20 GAP = GAP/2
      NMG = N - GAP
      DO 40 J1 = 1, NMG
        I = J1 + GAP
C
C  DO J = J1, 1, -GAP
C
        J = J1
   30   IF (Y(J) .LE. Y(I)) GO TO 40
C
C  SWAP OUT-OF-ORDER PAIR
C
        TEMP = Y(I)
        Y(I) = Y(J)
        Y(J) = TEMP
C
C  KEEP OLD POINTER FOR NEXT TIME THROUGH
C
        I = J
        J = J - GAP
        IF (J .GE. 1) GO TO 30
   40 CONTINUE
      IF (GAP .GT. 1) GO TO 20
  999 RETURN
      END

      SUBROUTINE PSORT(ON, WITH, N, ERR)
C
      INTEGER N, ERR
      REAL ON(N), WITH(N)
C
C  PAIR SHELL SORT  N  VALUES IN  ON()  FROM SMALLEST TO LARGEST
C  CARRYING ALONG THE VALUES IN  WITH().
C
C  NOTE THAT LOCAL SYSTEM SORT UTILITIES ARE LIKELY TO BE
C  MORE EFFICIENT, AND SHOULD BE SUBSTITUTED WHENEVER POSSIBLE.
C
C  LOCAL VARIABLES
C
      INTEGER I, J, J1, GAP, NMG
      REAL TON, TWITH
C
      IF(N .GE. 1) GO TO 10
      ERR = 1
      GO TO 999
   10 IF(N .EQ. 1) GO TO 999
C
C  ONE ELEMENT IS ALWAYS SORTED
C
      GAP = N
   20 GAP = GAP/2
      NMG = N - GAP
      DO 40 J1 = 1, NMG
        I = J1 + GAP
C
C  DO J = J1, 1, -GAP
C
        J = J1
   30   IF (ON(J) .LE. ON(I)) GO TO 40
C
C  SWAP OUT-OF-ORDER PAIR
C
        TON = ON(I)
        ON(I) = ON(J)
        ON(J) = TON
        TWITH = WITH(I)
        WITH(I) = WITH(J)
        WITH(J) = TWITH
C
C  KEEP OLD POINTER FOR NEXT TIME THROUGH
C
        I = J
        J = J - GAP
        IF (J .GE. 1) GO TO 30
   40 CONTINUE
      IF (GAP .GT. 1) GO TO 20
  999 RETURN
      END

      SUBROUTINE YINFO(Y, N, MED, HL, HH, ADJL, ADJH, IADJL, IADJH,
     1     STEP, ERR)
C
C  GET GENERAL INFORMATION ABOUT Y().  USEFUL FOR PLOT SCALING.
C  SORTS Y() AND RETURNS IT SORTED.  ALSO RETURNS
C    MED   = MEDIAN
C    HL    = LOW HINGE               HH    = HI HINGE
C    ADJL  = LOW ADJACENT VALUE      ADJH  = HI ADJ VALUE
C    IADJL = ITS INDEX (LOCATN)      IADJH = ITS INDEX
C
      INTEGER N, IADJL, IADJH, ERR
      REAL Y(N), MED, HL, HH, ADJL, ADJH, STEP
C
C  LOCAL VARIABLES
C
      REAL HFENCE, LFENCE
      INTEGER J, K, TEMP1, TEMP2
C
      CALL SORT(Y, N, ERR)
      IF (ERR .NE. 0) GO TO 999
      K = N
      J = (K/2) + 1
C
      TEMP1 = N + 1 - J
      MED = (Y(J) + Y(TEMP1))/2.0
C
      K = (K+1)/2
      J = (K/2) + 1
      TEMP1 = K + 1 - J
      HL = (Y(J) + Y(TEMP1))/2.0
      TEMP1 = N - K + J
      TEMP2 = N + 1 - J
      HH = (Y(TEMP1) + Y(TEMP2))/2.0
C
      STEP = (HH - HL)*1.5
      HFENCE = HH + STEP
      LFENCE = HL - STEP
C
C  FIND ADJACENT VALUES
C
      IADJL = 0
   20 IADJL = IADJL + 1
      IF (Y(IADJL) .LE. LFENCE) GO TO 20
      ADJL = Y(IADJL)
C
      IADJH = N + 1
   30 IADJH = IADJH - 1
      IF (Y(IADJH) .GE. HFENCE) GO TO 30
      ADJH = Y(IADJH)
  999 RETURN
      END

      SUBROUTINE NPOSW(HI, LO, NICNOS, NN, MAXP, MZERO, PTOTL, FRACT,
     1     UNIT, NPW, ERR)
C
C  FIND A NICE (I.E., SIMPLE) DATA-UNITS VALUE TO ASSIGN TO ONE PLOT
C  POSITION IN ONE DIMENSION OF A PLOT.  A PLOT POSITION IS TYPICALLY
C  ONE CHARACTER POSITION HORIZONTALLY, OR ONE LINE VERTICALLY.
C
C  ON ENTRY:
C    HI, LO  ARE THE HIGH AND LOW EDGES OF THE DATA RANGE TO BE
C            PLOTTED.
C    NICNOS  IS A VECTOR OF LENGTH NN CONTAINING NICE MANTISSAS FOR
C            THE PLOT UNIT.
C    MAXP    IS THE MAXIMUM NUMBER OF PLOT POSITIONS ALLOWED IN THIS
C            DIMENSION OF THE PLOT.
C    MZERO   IS .TRUE. IF A POSITION LABELED -0 IS ALLOWED IN THIS
C            DIMENSION, .FALSE. OTHERWISE.
C
C  ON EXIT:
C    PTOTL   HOLDS THE TOTAL NUMBER OF PLOT POSITIONS TO BE USED IN
C            THIS DIMENSION.  (MUST BE .LE. MAXP.)
C    FRACT   IS THE MANTISSA OF THE NICE POSITION WIDTH.  IT IS
C            SELECTED FROM THE NUMBERS IN NICNOS.
C    UNIT    IS AN INTEGER POWER OF 10 SUCH THAT NPW = FRACT * UNIT.
C    NPW     IS THE NICE POSITION WIDTH.  ONE PLOT POSITION WIDTH
C            WILL REPRESENT A DATA-SPACE DISTANCE OF NPW.
C
      INTEGER NN, MAXP, PTOTL, ERR
      REAL HI, LO, NICNOS(NN), FRACT, UNIT, NPW
      LOGICAL MZERO
C
C  FUNCTIONS
C
      INTEGER FLOOR, INTFN
C
C  LOCAL VARIABLES
C
      INTEGER I
      REAL APRXW
C
      IF (MAXP .GT. 0) GO TO 5
      ERR = 8
      GO TO 999
    5 APRXW = (HI - LO)/FLOAT(MAXP)
      IF(APRXW .GT. 0.0) GO TO 10
C
C  HI .LE. LO IS AN ERROR
C
      ERR = 9
      GO TO 999
   10 UNIT = 10.0**FLOOR(ALOG10(APRXW))
      FRACT = APRXW/UNIT
      DO 20 I = 1, NN
        IF(FRACT .LE. NICNOS(I)) GO TO 30
   20 CONTINUE
   30 FRACT = NICNOS(I)
      NPW = FRACT * UNIT
      PTOTL = INTFN(HI/NPW, ERR) - INTFN(LO/NPW, ERR) + 1
      IF(ERR .NE. 0) GO TO 999
C
C  IF MINUS ZERO POSITION POSSIBLE AND SGN(HI) .NE. SGN(LO), ALLOW IT.
C
      IF(MZERO .AND. (HI*LO .LT. 0.0 .OR. HI .EQ. 0.0)) PTOTL=PTOTL+1
C
C  PTOTL POSITIONS REQUIRED WITH THIS WIDTH -- FEW ENOUGH?
C
      IF(PTOTL .LE. MAXP) GO TO 999
C
C  TOO MANY POSITIONS NEEDED, SO BUMP NPW UP ONE NICE NUMBER
C
      I = I + 1
      IF(I .LE. NN) GO TO 30
      I = 1
      UNIT = UNIT * 10.0
      GO TO 30
  999 RETURN
      END

      INTEGER FUNCTION INTFN(X, ERR)
C
C  FIND THE INTEGER EQUAL TO OR NEXT CLOSER TO ZERO THAN X.
C
C  CHECKS TO SEE THAT  X  IS NOT TOO LARGE TO FIT IN AN
C  INTEGER VARIABLE.
C
      REAL X
      INTEGER ERR
C
      COMMON /NUMBRS/ EPSI, MAXINT
      REAL EPSI, MAXINT
C
      IF(ABS(X) .LE. MAXINT) GO TO 10
C
C  X IS TOO LARGE IN MAGNITUDE TO FIT IN AN INTEGER,
C  RETURN THE LARGEST LEGAL INTEGER AND SET THE ERROR FLAG.
C
      ERR = 3
      INTFN = IFIX(SIGN(MAXINT, X))
      GO TO 999
C
   10 INTFN = INT((1.0 + EPSI) * X)
  999 RETURN
      END

      INTEGER FUNCTION FLOOR(Y)
      REAL Y
C
C  FIND FLOOR(Y), THE LARGEST INTEGER NOT EXCEEDING  Y
C
      FLOOR = INT(Y)
      IF(Y .LT. 0.0 .AND. Y .NE. FLOAT(FLOOR)) FLOOR = FLOOR - 1
      RETURN
      END

      REAL FUNCTION MEDIAN(Y, N)
C
C  FIND THE MEDIAN OF THE SORTED VALUES  Y(1), ..., Y(N).
C
      INTEGER N
      REAL Y(N)
C
C  LOCAL VARIABLES
C
      INTEGER MPTR, MPT2
C
      MPTR = (N/2) + 1
      MPT2 = N - MPTR + 1
      MEDIAN = (Y(MPTR) + Y(MPT2))/2.0
      RETURN
      END

      REAL FUNCTION GAU(Z)
      REAL Z
C
C  THIS FUNCTION CALCULATES THE VALUE OF THE STANDARD
C  GAUSSIAN CUMULATIVE DISTRIBUTION FUNCTION AT  Z.
C  THE ALGORITHM USES APPROXIMATIONS GIVEN BY STEPHEN E. DERENZO
C  IN MATHEMATICS OF COMPUTATION, V. 31 (1977), PP. 214-225.
C
C  LOCAL VARIABLES
C
      REAL P, PI, X
C
      X = ABS(Z)
      IF(X .GT. 5.5) GO TO 10
      P = EXP(-((83.0 * X + 351.0) * X + 562.0) * X /
     1     (703.0 + 165.0 * X))
      GO TO 20
C
   10 PI = 4.0 * ATAN(1.0)
      P = SQRT(2.0/PI) * EXP(-(X * X/2.0 +
     1     0.94/(X * X))) / X
C
C  THE APPROXIMATIONS YIELD VALUES OF THE HALF-NORMAL TAIL AREA.
C  TRANSLATE THAT INTO THE VALUE OF THE GAUSSIAN C.D.F. AND
C  ALLOW FOR THE SIGN OF  Z.
C
   20 GAU = P/2.0
      IF(Z .GT. 0.0) GAU = 1.0 - GAU
C
      RETURN
      END

C.2 FORTRAN

We hardly need to explain our decision to provide programs in FORTRAN -- it is the most nearly universal of all scientific programming languages. We cannot, however, pretend that developing these programs was a labor of love. A reader who examines them carefully will find segments that are awkward or tedious because FORTRAN is ill-suited to the programming needs of modern data analysis. For example, the output capabilities of FORTRAN are far too rigid for the graphic and semi-graphic displays that are common in exploratory data analysis. On the whole, however, the advantages of making these programs as widely available as possible outweighed the difficulties of FORTRAN.

If programs are to be widely used, they must be portable. That is, it must be possible to move them from one computing environment to another with an absolute minimum number of changes. Fortunately for us, others have laid substantial groundwork in developing portable (or, strictly speaking, semi-portable) FORTRAN programs. As a result, a number of practices that facilitate portability are well-established, and computer software to support the most valuable of them is available. In this part of the appendix we briefly describe the practices we have followed and the role they have played in the development of our programs.

Consistency of style is also important for any set of programs that are intended to be used (and read) together. Thus we also describe the particular conventions we have chosen to follow. These range from simple choices that affect only the appearance of the printed programs to overall decisions that affect the structure and interrelations among all the programs in this book. Related to interconnections is the question of just how one might customarily use these programs.
We briefly discuss and illustrate two approaches to this. And finally there are the utility routines, which perform a variety of essential services for the data analysis routines presented in Chapters 1 through 9. Listings for the utility routines appear in Appendix B.

Portability

A fully portable program or subroutine can be moved gracefully from one computing machine to another. Even though the computers are of different manufacture and have different systems software, the program compiles without errors, executes without errors, and produces identically the same results on both. This is the ideal situation. Unfortunately, it can rarely be attained in practice; but with reasonable effort a good approximation to it is possible.

The two primary obstacles to overcome are differences among dialects of the FORTRAN language and differences in characteristics of the arithmetic hardware. (One must also contend with variations in system conventions, but these are generally less serious.)

The solution to the problem of dialects is conceptually quite simple: One uses only a subset of FORTRAN that is handled in the same way by essentially all known systems. In practice it is all too easy to slip back unknowingly into using some facility or construction which is acceptable in one's own environment but unacceptable in certain others. To avoid this, we have restricted our FORTRAN to a particular subset known as PFORT. This is an attractive solution because this subset of FORTRAN is supported by a piece of software, the PFORT Verifier (Ryder 1974), that takes a FORTRAN program as input and reports on all its departures from this subset of the language. Especially valuable is the Verifier's ability to process a main program and all associated subroutines and to identify potential difficulties of communication among them, including misuse of COMMON.
When a particular construction is acceptable in many (but not all) dialects of FORTRAN, it is tempting to use it (especially when it would make the programs easier to understand) and then to announce, "The programs conform to PFORT, except for. . . ." For example, subscript expressions of the form N + 1 - I are common (as in LVALS, MEDPOL, and RGCOMP), but the strict FORTRAN definition of subscript expressions is too restrictive to permit this form. We have decided to avoid such complications and adhere to PFORT. Thus we can state that all the FORTRAN programs in this book have been processed by the PFORT Verifier without any warning messages.

The problem of arithmetic hardware characteristics is somewhat more difficult than the problem of language dialects. Fortunately, EDA techniques generally involve much less numerical computation than one finds in most mathematical software. In fact, our programs need only two machine-related constants: an epsilon, whose role was described earlier, and the REAL value of the largest valid integer. We have isolated these as the variables EPSI and MAXINT in the COMMON block NUMBRS so that they can be set once at initialization. The initialization subroutine, CINIT, takes care of this.

CINIT, which should be called before any of the other FORTRAN routines in this book, also sets several other variables that may vary from installation to installation or from run to run:

    OUNIT   the FORTRAN unit number for output (often unit 6),
    PMIN    the left margin in the output line,
    PMAX    the right margin in the output line.

In CINIT, the corresponding subroutine arguments all begin with the letter I to indicate that they are initialization values. CINIT performs several basic checks on these and then completes the initialization process.

      SUBROUTINE CINIT(IOUNIT, IPMIN, IPMAX, IEPSI, IMAXIN, ERR)
C
      INTEGER IOUNIT, IPMIN, IPMAX, IMAXIN, ERR
      REAL IEPSI
C
C INITIALIZATION, TO BE CALLED AT START OF ANY MAIN PROGRAM
C WHICH CALLS ONE OF THE EDA SUBROUTINES (EITHER DIRECTLY OR
C INDIRECTLY).
C
C IOUNIT IS THE NUMBER OF THE UNIT TO WHICH OUTPUT IS DIRECTED.
C IPMIN IS THE LEFT MARGIN.
C IPMAX IS THE RIGHT MARGIN.
C IEPSI IS THE MACHINE-RELATED EPSILON.
C IMAXIN IS THE MAXIMUM PERMITTED INTEGER VALUE
C
C ERR IS THE (USUAL) ERROR FLAG, TO INDICATE WHETHER
C THE ROUTINE EXECUTED SUCCESSFULLY.
C
      COMMON /CHRBUF/ P, PMAX, PMIN, OUTPTR, MAXPTR, OUNIT
      COMMON /NUMBRS/ EPSI, MAXINT
C
      INTEGER P(130), PMAX, PMIN, OUTPTR, MAXPTR, OUNIT
      REAL EPSI, MAXINT
C
C LOCAL VARIABLES
C
      INTEGER BLANK, I
      DATA BLANK /1H /
C
C
      ERR = 6
      IF(IPMIN .LT. 1) GO TO 999
      IF(IPMAX .GT. 130) GO TO 999
      IF(IPMAX .LE. IPMIN) GO TO 999
      ERR = 7
      IF((1.0 + IEPSI) .LE. 1.0) GO TO 999
      ERR = 0
      OUNIT = IOUNIT
      PMIN = IPMIN
      OUTPTR = IPMIN
      MAXPTR = IPMIN
      PMAX = IPMAX
      EPSI = IEPSI
      MAXINT = FLOAT(IMAXIN)
C
      DO 50 I = 1, 130
        P(I) = BLANK
   50 CONTINUE
C
  999 RETURN
      END

In the course of a sequence of analyses, using several of the programs in this book, a user may reset the initialization variables by again calling CINIT. Of course, this causes the previous values of these variables to be lost, and it causes the output line to be set to all blanks, but it has no other side effects.

Stream Output

FORTRAN requires that the programmer specify the contents and format of a line of output, essentially when the program is written. (While it is possible for a running program to read a format specification or to construct one, it is extremely difficult to program this in a portable way.) Because EDA displays, such as the boxplot, depend heavily on the data, we usually can be no more specific about the output format than to say that a line will contain a number of characters: some digits, some symbols, and some blank spaces.
As the program executes, it must determine the format for a line and the character that occupies each position on the line. For example, stem-and-leaf displays come in three different formats, and each requires different characters in special positions on the line. Thus the program needs to build each output line a few characters at a time. This style of output, in which the program determines the format and contents of the output line as it goes along, is known as stream output. Because such output capabilities are not a part of the FORTRAN language, we have written special subroutines to simulate (in a rudimentary but portable way) the features that we need to produce our EDA displays. Often, we have used standard FORTRAN output.

The important variables for our stream output subroutines reside in the COMMON block CHRBUF. At the heart of our simple stream output is the array P, in which we construct a line of output. Our initialization routine, CINIT, sets P to all blanks. Any routine needing to construct an output line can do so by storing characters (alphabetic, numeric, or special symbols) in P; this is usually done with the subroutines PUTCHR and PUTNUM. When the line is complete, the routine PRINT writes out the contents of P and resets P to blanks.

The routine PUTCHR places a character in P, either at the position specified by the argument POSN or at the next available position (if POSN is zero). PUTCHR keeps track of the last print position used and the rightmost non-blank position in the line.

The routine PUTNUM places into P the characters for an integer, N. The calling program must specify the width, W, of the field (number of characters) where the number should appear, and its starting position on the line. PUTNUM translates the integer into the appropriate sequence of numerals and uses PUTCHR to place them in P. Applications of PUTNUM include placing the depth counts and the stems on each line of a stem-and-leaf display.
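For readers more at home in a modern language, the stream-output idea described above (a blank line buffer, a routine that places one character, a routine that places an integer in a fixed-width field, and a routine that emits the line and blanks the buffer) can be sketched in a few lines of Python. This is only an illustration, not a translation of the FORTRAN routines: the class name `LineBuffer`, the method names, and the choice to right-justify numbers within their field are assumptions made for the example.

```python
# Illustrative sketch only: a Python analogue of the stream-output idea.
# The class name, method names, and right-justification rule are assumptions
# made for this example; they are not taken from the FORTRAN listings.

class LineBuffer:
    """Build one output line a few characters at a time (positions are 1-based)."""

    def __init__(self, width=130):
        self.width = width          # the FORTRAN buffer P holds 130 positions
        self.p = [" "] * width      # the line under construction, all blanks
        self.next_pos = 1           # next available position on the line

    def putchr(self, posn, char):
        """Store one character at posn, or at the next free position if posn is 0."""
        if len(char) != 1:
            raise ValueError("exactly one character expected")
        if posn == 0:
            posn = self.next_pos
        if not 1 <= posn <= self.width:
            raise ValueError("position is off the line")
        self.p[posn - 1] = char
        self.next_pos = posn + 1

    def putnum(self, posn, n, w):
        """Store integer n in a field of w characters starting at posn."""
        text = str(n)
        if len(text) > w:
            raise ValueError("number won't fit in space provided")
        if posn == 0:
            posn = self.next_pos
        # Assumption: the number is right-justified within its field.
        for offset, char in enumerate(text.rjust(w)):
            self.putchr(posn + offset, char)

    def print_line(self):
        """Return the finished line and reset the buffer to blanks."""
        line = "".join(self.p).rstrip()
        self.p = [" "] * self.width
        self.next_pos = 1
        return line


# Build one line resembling a row of a stem-and-leaf display.
buf = LineBuffer()
buf.putnum(1, 3, 4)        # a depth count in positions 1-4
buf.putnum(6, 12, 3)       # a stem in positions 6-8
buf.putchr(10, "*")        # a separator at position 10
for leaf in "0257":        # leaves go in the next free positions
    buf.putchr(0, leaf)
line = buf.print_line()
print(line)
```

The point of the design is that no format for the whole line is fixed in advance: each call decides only where the next few characters go, and the line is emitted once it is complete.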
Finally, the integer function WDTHOF receives an integer, I, and returns the number of characters (including a minus sign if I is negative) required to print it. We use this information in printing the depth counts and stems in a stem-and-leaf display.

Conventions

To promote clarity of these programs and to preserve their portability, we have followed several conventions. None of these has especially sweeping consequences, but we list them here so that they will be clear to the reader and user.

Input/Output. Our subroutines do no input. Reading of data is the responsibility of the user, who is in the best position to deal with features of the input process that may depend on the particular version of FORTRAN or on the devices where data are stored. It is customary to isolate output operations so that they do not appear in computational subroutines. We have done this where appropriate; but, of course, it makes no sense when the EDA technique is primarily a display (as in stem-and-leaf, boxplot, condensed plotting, and coded tables).

Scratch Storage. When a technique uses temporary storage whose size depends on the number of data values, our routines are structured so that the user supplies this storage through the argument list. (PLOT, for example, requires two work arrays of length N because it must sort the data points into order on y while preserving the (x,y) pairs.) In this way we avoid any built-in restriction on the amount of data that can be handled, and we make it straightforward to accommodate the storage limitations that the user's system may impose.

Characters. When we must work with characters, we store them, one character to the word, in INTEGER variables or arrays. This may waste a certain amount of space, but it is strongly preferable to dealing with heavy dependence on the number of characters that can be stored in a word on the user's particular machine.
It further avoids the arithmetic that would be required to pack and unpack characters stored several to the word. The character set that we have used is the bare minimum FORTRAN character set: the 26 letters, the 10 digits, the 9 symbols = + - * / ( ) , . and the blank space. This facilitates portability, but it is not much to work with in building displays. In BASIC we are able to assume the much larger ASCII character set, and the advantages are evident when one compares the BASIC and FORTRAN versions of the displays.

Dimensioning in Subroutines. When a subroutine argument is an array, our declaration for it uses its actual dimensions, as in "REAL Y(N), . . ." in STMNLF. We have not used "dummy" dimensions, as in "REAL A(1)" seen in some programs.

Errors. We attempt to detect a variety of errors that a user might make, and we communicate information on them through the INTEGER variable ERR, which appears as the last argument of many of the subroutines. If no error condition exists, ERR has the value 0. Otherwise, a positive value identifies the error condition. (These error numbers are defined in Exhibit C-1.)

Exhibit C-1 FORTRAN Program Error Codes

Code  Subroutine        Meaning
   1  SORT              N ≤ 0; nothing to sort
   2  PSORT             N ≤ 0; nothing to sort
   3  INTFN             X > MAXINT; argument passed is too large to be "fixed" as an integer variable
   4  PUTCHR            Illegal character code
   5  PUTNUM            Number won't fit in space provided
   6  CINIT             Violated 0 < IPMIN < IPMAX ≤ 130 in setting page margins
   7  CINIT             EPSI too small; 1.0 + EPSI = 1.0
   8  NPOSW             No room allowed for plot
   9  NPOSW             HI < LOW
  11  STMNLF            N ≤ 1
  12  STEMP             Bad internal value; bad nice numbers?
  13  STMNLF            Page too narrow for display
  21  LVALS             Violated 2 ≤ N ≤ 24576
  22  LVPRNT            Violated 3 ≤ NLV ≤ 15; too many letter values
  23  LVPRNT            Page width < 64 positions, not enough room
  31  BOXES             N < 1
  41  PLOT              N < 5
  42  PLOT              Violated 5 ≤ LINES ≤ 40 or 1 ≤ CHRS ≤ 10
  44  PLOT              XMIN > XMAX
  45  PLOT              YMIN > YMAX
                        Errors 44 and 45 are possible if incorrect plot bounds have been specified in the subroutine call.
  51  RLINE             N < 6
  52  RLINE             No iterations specified
  53  RLINE             All x-values equal; no line possible
  54  RLINE             Split is too uneven for resistance
  61  RSM               N < 7
  62  RUNMED            Insufficient workspace room
  63  RUNMED            Internal error; error in sort program? This error can occur if a system sort utility is substituted for the supplied SORT subroutine, but used incorrectly.
  71  CTBL              Zero dimensions for table
  72  CTBL              Too many columns to fit on page
  81  MEDPOL or TWCVS   Zero dimensions for table
  82  MEDPOL            No half-steps specified
  83  MEDPOL            Illegal start parameters
  85  MEDPOL            Table is empty
  88  TWCVS             Zero grand effect; can't compute comparison values
  91  RGCOMP            L ≤ 2; too few bins
  92  RGCOMP            One of the hinges falls in the left-open bin or in the right-open bin
  93  RGPRNT            Page too narrow for rootogram table
  94  RGPRNT            Room for rootogram table but not for graphic display

Exits. Each of our subroutines has a single exit, the RETURN statement immediately preceding the END statement. In most subroutines this RETURN bears the statement number 999.

Output FORMAT statements. We place each FORMAT statement immediately after the first WRITE statement that uses it. For our programs, which do not use the same FORMAT statement in many different and widely separated WRITE statements and often rely on the stream output routines described earlier, this leads to much better readability than if we grouped all FORMAT statements at the end of the subroutine.

Declared Identifiers.
We do not rely on "implicit typing" to determine (according to its first letter) whether an identifier is INTEGER or REAL. Instead, we explicitly declare all the identifiers used in each subprogram, except for the standard FORTRAN functions. We strongly endorse this practice, which a few FORTRAN compilers support by issuing a warning message for any undeclared identifier, because it aids greatly in eliminating misspelled names. (The PFORT Verifier, for example, lists all the identifiers in each program unit, so that such errors stand out.)

Indentation. We find that it is generally easier to follow the logic of a program when statements within a DO loop or following an IF statement are indented slightly, and we have used this device throughout our programs.

References

Isaacs, Gerald L. 1976. "BASIC Revisited, An Update to Interdialect Translatability of the BASIC Programming Language." CONDUIT, The University of Iowa, Iowa City.

Ryder, B.G. 1974. "The PFORT Verifier." Software: Practice and Experience 4:359-377.

APPENDIX D: SUGGESTIONS FOR FURTHER READING

Statistical Techniques

Blalock (1960) is a widely available general statistics text. In particular, the chapter on hypothesis testing is quite good. Other good general statistics texts include Snedecor and Cochran (1967) and the SPSS manual (Nie et al., 1975). Draper and Smith (1966) is a good general text on regression analysis which is widely available. Probably the most widely available econometrics text is Theil (1971). Because of its availability, it has been cited several times in this paper. It is, however, a difficult book. A better introduction to econometrics which may be available is Wonnacott and Wonnacott (1970), or perhaps Maddala (1977). Mirer (1983) is an excellent introduction, but it is a recent book and perhaps more difficult to find.
Housing Market Analysis

An excellent general introduction is Quigley (1979). This paper summarizes recent economic analysis of housing markets, but focuses on developed country data. Chapter 6 of Linn (1979) summarizes housing market analysis with reference to developing countries in general. Mayo et al. (1982) is a good case study of a particular housing market (Cairo). Mayo (1981) reviews the housing demand literature in the U.S., and highlights the permanent income and price elasticity issue. Mayo and Malpezzi (1983), and Keare and Jimenez (1983) review housing demand in developing countries. Mayo and Malpezzi (1983) also review the effects of tenure on housing demand. A good basic reference on mobility is Goodman (1978). Hedonic models are discussed in detail in Malpezzi et al. (1981), and a good example of their application to place-to-place price indexes is Follain and Ozanne (1979). An interesting application to developing country data (Korea) is Follain et al. (1982).

Computational Techniques

Velleman and Hoaglin (1981) provide many useful computer programs and good discussions of exploratory data analysis. The SPSS manual is another obvious source of computational information. Sae-Hau (1982) is highly recommended for advice about data preparation. Information about SAS can be obtained by writing directly to the SAS Institute Inc., Box 8000, Cary, North Carolina, 27511, U.S.A.

APPENDIX E: DATA USED FOR SIMPLE EXAMPLES

The data are of two kinds. Some examples used actual survey data from Cairo, Egypt. These data are documented in Mayo et al. (1982). Other data were manufactured by computer in order to emphasize particular points. All the manufactured data, and the Cairo data used for the simple examples in Part I, are reproduced in this appendix so that the reader can get a feel for what these data look like, and can replicate the examples.

DATA APPENDIX

Variable Definitions

1. Data used for Figure 5 and Tables 7 and 9.
    RENT - household rent paid.
    HHSIZE - household size (continuous).
    HSIZESQ - household size, squared (continuous).
    HH2 - 1 if 2 person household; 0 otherwise.
    HH3 - 1 if 3 person household; 0 otherwise.
    HH8 - 1 if 8 person household; 0 otherwise.
    HHGE9 - household size if greater than or equal to 8; 0 otherwise.

2. Data used for Figure 4 and Table 4.

    E1 - manufactured "error term," normally distributed with mean 0 and variance 1.
    E2 - manufactured "error term," normally distributed with mean 0 and variance 3.
    X - manufactured variable, uniformly distributed between 0 and 10.
    Y1 - computed as follows: Y1 = 2 + 0.75 X + E1.
    Y2 - computed as follows: Y2 = 2 + 0.75 X + E2.

3. Cairo renter data.

    HHSIZE - household size.
    HSIZESQ - household size, squared.
    TOTINC - total family income (not used).
    OWN - 0 if renter (all zeros, not used).
    MINCOME - monthly permanent income (consumption), in Egyptian pounds.
    MGRENT - monthly gross rent (rent plus utilities).
    DIST - distance from central business district (not used).
    LMINCOME - log of monthly income.
    LMGRENT - log of monthly rent.
    RY - rent-to-income ratio.
    RESIDUAL - residual, or estimated error, from simple log demand equation (Table 3, Figure 3).
    PREDICT - predicted log rent from simple demand equation. Notice that RESIDUAL = LMGRENT - PREDICT.

DATA APPENDIX
CAIRO RENTER DATA,
RESIDUALS, AND PREDICTED RENTS FROM FIGURE 3 AND TABLES 1-3 OBS HHSIZE HSIZESQ TOTINC OWN MINCOME MGRENT DIST LMINCOME LMGRENT RY RESIDUAL PREDICT 1 5 25 O O 68.50 2.200 28.50 4.22683 0.78846 0.03212 -1.3058 2.09424 2 6 36 83 0 30.06 1.470 28.50 3,40320 0.38526 0.04890 -1.3248 1.71011 3 4 16 376 0 117.60 1.750 28.50 4.76729 0.55962 0.01488 -1.8130 2.37263 4 7 49 58 0 58.66 4.993 21.97 4.07176 1.60800 0.08511 -0.3590 1.96704 5 4 16 55 0 45.40 4.516 21.97 3.81551 1.50767 0.09948 -0.4646 1.97226 6 6 36 35 0 49.17 2.520 21.97 3.89528 0.92426 0.05125 -0.9928 1.91710 7 5 25 65 0 69.99 6.140 21.97 4.24835 1.81482 0.08773 -0.2885 2.10329 8 7 49 65 0 87.30 3.300 21.97 4.46935 1.19392 0.03780 -0.9404 2.13429 9 7 49 85 0 87.50 4.500 21.97 4.47164 1.50408 0.05143 -0.6312 2.13525 10 5 25 187 0 95.00 13.250 21.97 4.55388 2.58400 0.13947 0.3522 2.23181 11 4 16 106 0 99.80 15.310 21.97 4.60317 2.72851 0.15341 0.4249 2.30359 12 5 25 220 0 150.00 5.400 7.72 5.01064 1.68640 0.03600 -0.7376 2.42395 13 2 4 80 0 90.00 27.227 7.72 4.49981 3.30423 0.30253 0.9019 2.40231 14 3 9 270 0 300.00 21.500 7.72 5.70378 3.06805 0.07167 0.2371 2.83098 15 3 9 250 0 150.00 54.160 7.72 5.01064 3.99194 0.36107 1.4525 2.53940 16 6 36 91 0 165.95 10.006 7.72 5.11169 2.30318 0.06030 -0.1256 2.42879 17 6 36 55 0 37.60 7.100 7.72 3.62700 1.96009 0.18883 0.1558 1.80425 18 3 9 94 0 95.00 11.500 7.72 4.55388 2.44235 0.12105 0.0951 2.34727 19 4 16 410 0 124.50 13.000 7.72 4.82431 2.56495 0.10442 0.1683 2.39661 20 3 9 40 0 39.85 4.850 7.72 3.68512 1.57898 0.12171 -0.4028 1.98182 21 7 49 75 0 78.75 5.750 7.72 4.36628 1.74920 0.07302 -0.3417 2.09093 22 3 9 50 0 45.85 7.200 7.72 3.82538 1.97408 0.15703 -0.0667 2.04082 23 12 144 210 0 150.30 40.800 4.01 5.01263 3.70868 0.27146 1.2668 2.44191 24 12 144 100 0 106.00 8.000 4.01 4.66344 2.07944 0.07547 -0.2156 2.29502 F 25 1 1 10 0 47.00 4.650 4.01 3.85015 1.53687 0.09894 -0.6833 2.22018 C 26 4 16 100 0 70.00 6.000 4.01 4.24850 1.79176 0.08571 -0.3626 2.15139 I 27 2 4 35 
0 35.00 6.150 4.01 3.55535 1.81645 0.17571 -0.1886 2.00502 28 7 49 90 0 69.75 7.750 4.01 4.24492 2.04769 0.11111 0.0078 2.03988 29 3 9 70 0 123.00 8.250 4.01 4.81218 2.11021 0.06707 -0.3457 2.45593 30 6 36 48 0 58.00 7.500 4.01 4.06044 2.01490 0.12931 0.0283 1.98658 31 7 49 66 0 325.00 6.100 4.01 5.78383 .1.80829 0.01877 -0.8789 2.68723 32 5 25 60 0 80.00 13.000 4.01 4.38203 2.56495 0.16250 0.4054 2.15952 33 9 81 80 0 1.20 2.650 3.86 0.18232 0.97456 2.20833 0.6521 0.32245 34 5 25 40 0 40.00 28.630 3.86 3.68888 3.3544G 0.71575 1.4865 1.86795 35 3 9 26 0 35.00 29.070 3.86 3.55535 3.36971 0.83057 1.4425 1.92723 36 8 64 85 0 267.00 9.473 1.48 5.58725 2.24845 0.03548 -0.3452 2.59361 37 5 25 170 0 180.00 21.130 1.48 5.19296 3.05069 0.11739 0.5500 2.50065 38 4 16 1000 0 400.00 37.640 0.60 5.99146 3.62807 0.09410 0.7405 2.88758 39 3 9 232 0 450.00 38.440 0.60 6.10925 3.64910 0.08542 0.6476 3.00154 40 3 9 180 0 200.00 27.000 0.60 5.29832 3.29584 0.13500 0.6354 2.66042 41 4 16 50 0 500.00 24.000 0.60 6.21461 3.17805 0.04800 0.1966 2.98145 42 4 16 620 0 600.00 156.000 0.60 6.39693 5.04986 0.26000 1.9917 3.05814 43 4 16 250 0 150.00 20.750 0.60 5.01064 3.03255 0.13833 0.5576 2.47499 44 10 100 163 0 163.00 7.950 1.78 5.09375 2.07317 0.04877 -0.3311 2.40428 45 3 9 35 0 50.00 14.080 1.78 3.91202 2.64476 0.28160 0.5675 2.07727 46 4 16 25 0 20.00 5.500 1.78 2.99573 1.70475 0.27500 0.0773 1.627,14 47 4 16 25 0 40.00 1.770 1.78 3.68888 0.57098 0.04425 -1.3480 1.91899 48 7 49 100 0 300.00 5.850 1.54 5.70378 1.76644 0.01950 -0.8871 2.65356 49 4 16 60 0 120.00 4.650 1.54 4.78749 1.53687 0.03875 --0.8443 2.38113 50 6 36 65 0 85.15 32.500 1.54 4.44441 3.48124 0.38168 1.3331 2.14810 51 4 16 54 0 55.00 7.200 1.54 4.00733 1.97408 0.13091 -0.0789 2.05295 52 6 36 40 0 40.00 26.200 1.54 3.68888 3.26576 0.65500 1.4355 1.83028 53 10 100 100 0 271.50 7.750 1.54 5.60396 2.04769 0.02855 -0.5712 2.61891 54 8 64 120 0 150.50 109.200 1.54 5.01396 4.69318 0.72558 2.3407 2.35246 DATA APPENDIX CAIRO 
RENTER DATA, RESIDUALS, AND PREDICTED RENTS FROM FIGURE 3 AND TABLES 1-3 OBS HHSIZE HSIZESQ TOTINC OWN MINCOME MGRENT DIST LMINCOME LMGRENT RY RESIDUAL PREDICI 55 3 9 210 0 148.00 72.6500 1.34 4.99721 4.28565 0.490878 1.7519 2.53376 56 5 25 90 0 99.50 7.8745 1.34 4.60016 2.06363 0.079141 -0.1877 2.25128 57 4 16 85 0 80.00 6.7500 1.34 4.38203 1.90954 0.084375 -0.3010 2.21057 58 5 25 100 0 100.00 10.2500 1.34 4.60517 2.32728 0.102500 0.0739 2.25339 59 9 81 60 0 102.25 7.4000 2.38 4.62742 2.00148 0.072372 -0.1908 2.19230 60 4 16 GO 0 150.00 3.7666 2.38 5.01064 1.32617 0.025111 -1.1488 2.47499 61 2 4 70 0 60.00 5.7500 2.38 4.09434 1.74920 0.095833 -0.4825 2.23175 62 11 121 110 0 188.00 6.1500 2.38 5.23644 1.81645 0.032713 -0.6770 2.49350 63 2 4 29 0 30.00 3.1000 2.38 3.40120 1.13140 0.103333 -0.8088 1.94017 64 2 4 50 0 55.50 4.5000 2.38 4.01638 1.50408 0.081081 -0.6949 2.19895 65 9 81 165 0 152.20 8.3166 3.12 5.02520 2.11825 0.054643 -0.2414 2.35963 66 3 9 195 0 100.00 8.3000 3.12 4.60517 2.11626 0.083000 -0.2526 2.36884 67 3 9 50 0 40.50 13,2500 3.12 3.70130 2.58400 0.327160 0.5954 1.98863 68 2 4 100 0 90.00 4.0500 3.12 4.49981 1.39872 0.045000 -1.0036 2.40231 69 5 25 58 0 47.36 5.2100 3.12 3.85778 1.65058 0.110008 -0.2884 1.93900 70 4 16 10 0 23.75 1.8500 2.38 3.16758 0.61519 0.077895 -1.0845 1.69970 71 4 16 90 0 94.85 5.3000 2.38 4.55230 1.66771 0.055878 -0.6145 2.28219 72 5 25 118 0 112.50 7.8000 2.38 4.72295 2.05412 0.OG9333 -0.2488 2.30294 73 8 64 30 0 24.00 7.0000 2.38 3.17805 1.94591 0.291667 0.3657 1.58017 74 7 49 140 0 134.10 10.6000 2.38 4.89859 2.36085 0.079045 0.0460 2.31485 75 6 36 108 0 278.35 7.6500 2.38 5.62888 2.03471 0.027483 -0.6116 2.64635 76 6 36 120 0 120.00 7.1500 3.27 4.78749 1.96711 0.059583 -0.3253 2,29242 77 3 9 32 0 40.00 2.6000 5.34 3.68888 0.95551 0.065000 -1.0279 1.98340 78 3 9 54 0 40.00 3.1000 5.34 3.68888 1.13140 0.077500 -0.8520 1.98340 79 5 25 90 0 33.35 8.1000 5.34 3.50706 2.09186 0.242879 0.3004 1.79146 4 80 5 25 195 0 138.00 
7.1500 5.34 4.92725 1.96711 0.051812 -0.4218 2.38888 81 8 64 200 0 195.86 13.1100 5.34 5.27740 2.57338 0.066936 0.1101 2.46327 82 7 49 90 0 90.00 8.7500 5.34 4.49981 2.16905 0.097222 0.0220 2.14710 83 5 25 100 0 100.00 8.9000 5.34 4.60517 2.18605 0.089000 -0.0673 2.25339 84 4 16 85 0 65.00 4.1000 5.34 4.17439 1.41099 0.063077 -0.7122 2.12322 85 2 4 56 0 50.00 2.3000 2.97 3.91202 0.83291 0.046000 -1.3221 2.15505 86 3 9 50 0 44.50 2.7500 2.97 3.79549 1.01160 0.061798 -1.0166 2.02825 87 4 16 90 0 64.00 12.5000 2.97 4.15888 2.52573 0.195313 0.4090 2.11670 88 6 36 125 0 125.00 6.2500 2.97 4.82831 1.83258 0.050000 -0.4770 2.30959 89 2 4 40 0 40.00 4.2500 2.97 3.68888 1.44692 0.106250 -0.6143 2.06119 90 5 25 35 0 60.00 8.6500 2.97 4.09434 2.15756 0.144167 0.1190 2.03851 91 5 25 200 0 200.00 10.2500 2.97 5.29832 2.32728 0.051250 -0.2177 2.54497 92 7 49 150 0 150.00 6.4000 2.97 5.01064 1.85630 0.042667 -0.5057 2.36198 93 7 49 140 0 102.00 7.1500 2.97 4.62497 1.96711 0.070098 -0.232G 2.19975 94 4 16 130 0 119.00 7.6500 2.97 4.77912 2.03471 0.064286 -0.3429 2.37761 95 7 49 75 0 75.00 5.7500 2.97 4.31749 1.74920 0.076667 -0.3212 2.07041 96 10 100 110 0 136.00 7.7500 2.97 4.91265 2.04769 0.056985 -0.2804 2.32810 97 9 81 55 0 55.00 9.0000 2.97 4.00733 2.19722 0.163636 0.2658 1.9311G 98 7 49 80 0 80.00 44.4000 2.97 4.38203 3.79324 0.555000 1.6957 2.09756 99 10 100 48 0 49.00 3.1000 5.20 3.89182 1.13140 0.OG3265 -0.7673 1.898cl 100 7 49 100 0 133.50 7.4500 5.20 4.89410 2.00821 0.055805 -0.3047 2.31296 101 3 9 150 0 125.00 12.6500 5.20 4.82831 2.53766 0.101200 0.0749 2.46271 102 5 25 250 0 154.00 7.1500 5.20 5.03695 1.96711 0.046429 -0.4679 2.43502 103 5 25 130 0 90.00 11.2500 5.20 4.49981 2.42037 0.126000 0.2113 2.20907 104 8 64 80 0 81.50 11.7500 5.20 4.40060 2.46385 0.144172 0.3694 2.09445 105 6 36 50 0 50.00 5.3100 5.20 3.91202 1.66959 0.106200 -0.2546 1.92415 106 5 25 200 0 200.00 15.5000 5.20 5.29832 2.74084 0.077500 0.1959 2.54497 107 4 16 150 0 130.00 13.0000 5.20 4.86753 
2.56495 0.100000 0.1502 2.41480 108 3 9 60 0 53.00 5.5000 5.20 3.97029 1.70475 0.103774 -0.3970 2.10178 DATA APPENDIX CAIRO RENTER DATA, RESIDUALS, AND PREDICTED RENTS FROM FIGURE 3 AND TABLES 1-3 OBS HHSIZE HSIZESQ TOTINC OWN MINCOME MGRENT DIST LMINCOME LMGRENT RY RESIDUAL PREDICT 109 5 25 43 0 68.50 8.400 5.20 4.22683 2.12823 0.122628 0.0340 2.09424 110 4 16 85 0 67.00 6.000 5.20 4.20469 1.79176 0.089552 -0.3442 2.13597 111 7 49 100 0 96.00 8.200 5.20 4.56435 2.10413 0.085417 -0.0701 2.17425 112 8 64 70 0 70.00 11.000 5.20 4.24850 2.39790 0.157143 0.3674 2.03046 113 6 36 75 0 9.25 6.250 5.20 2.22462 1.83258 0.675676 0.6182 1.21433 114 6 36 85 0 60.00 9.150 5.20 4.09434 2.21375 0.152500 0.2129 2.00084 115 7 49 40 0 56.50 8.249 3.27 4.03424 2.11009 0.146000 0.1588 1.9512G 116 5 25 43 0 52.50 10.500 3.27 3.96081 2.35138 0.200000 0.3690 1.98234 117 5 25 70 0 104.40 5.000 3.27 4.64823 1.60944 0.047893 -0.6621 2.27150 118 8 64 50 0 61.15 8.500 3.27 4.11333 2.14007 0.139002 0.1665 1.97360 119 3 9 70 0 116.50 9.500 3.27 4.75789 2.25129 0.081545 -0.1818 2.43309 120 4 16 70 0 91.00 10.000 3.27 4.51086 2.30259 0.109890 0.0378 2.26476 121 5 25 70 0 70.00 7.500 3.27 4.24850 2.01490 0.107143 -0.0885 2.10335 122 5 25 65 0 GO.00 5.000 3.27 4.09434 1.60944 0.083333 -0.4291 2.03851 123 4 16 107 0 630.00 3.000 3.27 6.44572 1.09861 0.004762 -1.9801 3.07867 124 7 49 95 0 80.75 4.750 3.27 4.39136 1.55814 0.058824 -0.5433 2.10148 125 6 36 90 0 120.00 4.000 3.27 4.78749 1.38629 0.033333 -0-9061 2.29242 126 8 64 150 0 150.00 4.750 3.27 5.01064 1.55814 0.031667 -0.7929 2.35106 127 7 49 70 0 40.00 4.750 3.27 3.68888 1.55814 0.118750 -0.2478 1.80598 128 5 25 7 0 70.00 4.500 3.27 4.24850 1.50408 0.064286 -0.5993 2.10335 129 5 25 50 0 70.00 3.850 3.27 4.24850 1.34807 0.055000 -0.7553 2.10335 130 8 64 35 0 32.50 6.900 3.27 3.48124 1.93152 0.212308 0.2238 1.70771 131 7 49 130 0 128.00 7.150 3.27 4.85203 1.96711 0.055859 -0.3282 2.29527 132 6 36 85 0 87.35 6.100 3.27 4.46992 1.80829 0.069834 
-0.3505 2.15883 133 8 64 240 0 120.00 7.550 3.27 4.78749 2.02155 0.062917 -0.2356 2.25719 r 134 6 36 90 0 90.00 2.450 3.27 4.49981 0.89609 0.027222 -1.2753 2.17140 135 4 16 85 0 GO.00 4.900 5.34 4.09434 1.58921 0.081667 -0.5003 2.08955 136 7 49 130 0 100.00 4.270 5.34 4.60517 1.45161 0.042700 -0.7398 2.19142 137 5 25 45 0 44.00 9.300 5.34 3.78419 2.23001 0.211364 0.3220 1.90804 138 3 9 80 0 75.00 8.550 5.34 4.31749 2.14593 0.114000 -0.1019 2.24783 139 6 36 90 0 125.00 7.750 5.34 4.82831 2.04769 0.062000 -0.2619 2.30959 140 3 9 75 0 80.00 10.750 5.34 4.38203 2.37491 0.134375 0.0999 2.27498 141 7 49 65 0 65.00 23.648 5.05 4.17439 3.16330 0.363822 1.1531 2.01021 142 5 25 60 0 60.00 3.600 5.05 4.09434 1.28093 0.060000 -0.7576 2.03851 143 4 16 115 0 100.00 16.520 5.05 4.60517 2.80457 0.165200 0.5001 2.30443 144 4 16 85 0 100.00 15.800 5.05 4.60517 2.76001 0.158000 0.4556 2.30443 145 5 25 200 0 400.00 20.300 5.05 5.99146 3.01062 0.050750 0.1741 2.83654 146 4 16 302 0 254.56 20.976 (.23 5.53954 3.04338 0.082401 0.3459 2.69748 147 5 25 40 0 40.20 20.550 6.23 3.69387 3.02286 0.511194 1.1528 1.87005 148 7 49 150 0 150.00 21.750 6.23 5.01064 3.07961 0.145000 0.7176 2.36198 149 5 25 360 0 147.00 18.881 6.23 4.99043 2.93816 0.128442 0.5227 2.41545 150 4 16 380 0 250.00 19.010 6.23 5.52146 2.94497 0.076040 0.2551 2.68987 151 5 25 200 0 500.00 224.100 11.58 6 21461 5.41209 0.448200 2.4817 2.93041 152 7 49 100 0 148.50 20.240 11.58 5.00058 3.00766 0.136296 0.6499 2.35776 153 3 9 95 0 140.50 18.000 11.58 4.94521 2.89037 0.128114 0.3785 2.51188 154 1 1 680 0 162.00 28.250 11.58 5.08760 3.34109 0.174383 0.6004 2.74072 155 4 16 500 0 293.20 44.800 11.58 5.68085 3.80221 0.152797 1.0453 2.75692 156 6 36 240 0 263.86 20.660 11.58 5.57542 3.02820 0.078299 0.4043 2.62386 157 5 25 998 0 352.75 21.000 11.58 5.86576 3.04452 0.059532 0.2609 2.78366 158 3 9 185 0 165.80 27.650 11.58 5.11078 3.31963 0.166767 0.7381 2.58153 159 4 16 160 0 124.00 7.500 8.91 4.82028 2.01490 0.060484 -0.3800 2.39492 
160 7 49 300 0 133.50 11.250 8.91 4.89410 2.42037 0.084270 0.1074 2.31296
161 8 64 50 0 40.00 7.750 8.91 3.68888 2.04769 0.193750 0.2526 1.79506
162 9 81 38 0 40.00 11.128 8.91 3.68888 2.40946 0.278200 0.6120 1.79750

DATA APPENDIX
CAIRO RENTER DATA, RESIDUALS, AND PREDICTED RENTS FROM FIGURE 3 AND TABLES 1-3

OBS HHSIZE HSIZESQ TOTINC OWN MINCOME MGRENT DIST LMINCOME LMGRENT RY RESIDUAL PREDICT
163 7 49 157 0 120.00 7.2998 8.91 4.78749 1.98785 0.06083 -0.28027 2.26812
164 4 16 80 0 118.50 8.0000 8.91 4.77491 2.07944 0.06751 -0.29639 2.37583
165 5 25 150 0 150.00 9.5000 8.91 5.01064 2.25129 0.06333 -0.17266 2.42395
166 7 49 40 0 80.50 3.7000 8.31 4.38826 1.30833 0.04596 -0.79184 2.10018
167 6 36 200 0 209.50 20.6235 8.31 5.34472 3.02643 0.09844 0.49961 2.52682
168 4 16 160 0 100.00 8.2500 8.31 4.60517 2.11021 0.08250 -0.19422 2.30443
169 4 16 60 0 74.50 9.1500 8.31 4.31080 2.21375 0.12282 0.03315 2.18060
170 7 49 120 0 100.13 11.3320 8.31 4.60647 2.42763 0.11317 0.23566 2.19197
171 8 64 130 0 116.50 21.3000 8.31 4.75789 3.05871 0.18283 0.81397 2.24474
172 8 64 120 0 98.14 7.3900 8.31 4.58640 2.00013 0.07530 -0.17247 2.17260
173 5 25 125 0 86.50 6.7500 8.31 4.46014 1.90954 0.07803 -0.28284 2.19239
174 5 25 100 0 100.00 5.3000 8.31 4.60517 1.66771 0.05300 -0.58568 2.25339
175 4 16 200 0 79.00 11.2500 8.31 4.36945 2.42037 0.14241 0.21509 2.20527
176 7 49 60 0 100.00 26.9500 11.28 4.60517 3.29398 0.26950 1.10256 2.19142
177 2 4 440 0 60.00 30.5900 11.28 4.09434 3.42067 0.50983 1.18892 2.23175
178 2 4 60 0 60.00 14.1600 11.28 4.09434 2.65042 0.23600 0.41867 2.23175
179 2 4 90 0 72.95 6.4500 11.28 4.28977 1.86408 0.08842 -0.44988 2.31396
180 3 9 69 0 83.00 8.7500 11.28 4.41884 2.16905 0.10542 -0.12141 2.29046
181 6 36 60 0 59.50 4.7500 11.28 4.08598 1.55814 0.07983 -0.43918 1.99732
182 2 4 105 0 85.15 13.9900 11.28 4.44441 2.63834 0.16430 0.25934 2.37901
183 4 16 80 0 95.00 10.2500 11.28 4.55388 2.32728 0.10789 0.04442 2.28285
184 6 36 80 0 84.50 13.5000 11.28 4.43675 2.60269 0.15976 0.45781 2.14487
185 5 25 40 0 70.00 12.1500 11.28 4.24850 2.49733 0.17357 0.39398 2.10335
186 9 81 120 0 240.60 9.8500 11.28 5.48314 2.28747 0.04094 -0.26479 2.55226
187 4 16 195 0 200.00 9.2500 11.28 5.29832 2.22462 0.04625 -0.37138 2.59601
188 3 9 90 0 130.00 8.6000 11.28 4.86753 2.15176 0.06615 -0.32745 2.47921
189 5 25 105 0 1.52 11.9000 11.28 0.41871 2.47654 7.82895 1.98420 0.49234
190 4 16 150 0 122.00 27.0400 11.28 4.80402 3.29732 0.22164 0.90924 2.38808
191 2 4 60 0 60.00 18.2000 11.28 4.09434 2.90142 0.30333 0.66967 2.23175
192 2 4 70 0 70.00 18.1500 11.28 4.24850 2.89867 0.25929 0.60208 2.29659
193 6 36 45 0 45.00 5.5000 11.28 3.80666 1.70475 0.12222 -0.17508 1.87983
194 3 9 50 0 25.00 12.7900 11.28 3.21888 2.54866 0.51160 0.76297 1.78569
195 3 9 70 0 59.84 20.0700 11.28 4.09167 2.99923 0.33539 0.84639 2.15284
196 5 25 100 0 123.00 44.8000 11.28 4.81218 3.80221 0.36423 1.46174 2.34047
197 2 4 310 0 118.65 19.8000 11.28 4.77618 2.98568 0.16688 0.46712 2.51856
198 4 16 35 0 35.00 6.1000 9.50 3.55535 1.80829 0.17429 -0.05453 1.86282
199 4 16 50 0 50.30 7.0000 9.50 3.91801 1.94591 0.13917 -0.06946 2.01537
200 6 36 310 0 70.00 9.0000 9.50 4.24850 2.19722 0.12857 0.13154 2.06568
201 3 9 60 0 58.20 9.2000 9.50 4.06389 2.21920 0.15808 0.07805 2.14115
202 6 36 60 0 79.50 6.0000 9.50 4.37576 1.79176 0.07547 -0.32746 2.11922
203 7 49 120 0 120.00 7.0000 9.50 4.78749 1.94591 0.05833 -0.32221 2.26812
204 5 25 65 0 72.15 14.0000 9.50 4.27875 2.63906 0.19404 0.52298 2.11608
205 5 25 120 0 96.35 15.9782 9.50 4.56799 2.77123 0.16583 0.53348 2.23775
206 2 4 81 0 80.10 16.0000 9.50 4.38328 2.77259 0.19975 0.41930 2.35329
207 6 36 80 0 87.00 9.0000 9.50 4.46591 2.19722 0.10345 0.04008 2.15714
208 8 64 95 0 74.00 11.6640 9.50 4.30407 2.45651 0.15762 0.40267 2.05384
209 5 25 54 0 44.80 8.7500 9.50 3.80221 2.16905 0.19531 0.25343 1.91562
210 5 25 18 0 58.75 4.2500 15.44 4.07329 1.44692 0.07234 -0.58273 2.02965
211 5 25 60 0 56.20 9.0000 15.44 4.02892 2.19722 0.16014 0.18624 2.01099
212 3 9 75 0 38.50 4.0000 9.20 3.65066 1.38629 0.10390 -0.58103 1.96733
213 6 36 150 0 200.06 9.0000 6.83 5.29862 2.19722 0.04499 -0.31020 2.50742
214 4 16 40 0 75.60 7.2500 6.83 4.32546 1.98100 0.09590 -0.20577 2.18677
215 5 25 50 0 82.20 15.2000 6.83 4.40916 2.72130 0.18491 0.55036 2.17094
216 5 25 50 0 132.70 11.7826 6.83 4.88809 2.46662 0.08879 0.09422 2.37240
217 7 49 200 0 94.00 19.0000 6.83 4.54329 2.94444 0.20213 0.7790 2.16539
218 4 16 60 0 797.00 29.9500 6.83 6.68085 3.39953 0.03758 0.2220 3.17758
219 6 36 45 0 56.25 56.5000 6.83 4.02981 4.03424 1.00444 2.0605 1.97369
220 4 16 80 0 84.00 7.0000 6.83 4.43082 1.94591 0.08333 -0.2852 2.23109
221 5 25 60 0 108.20 4.7000 6.83 4.68398 1.54756 0.04344 -0.7390 2.28654
222 3 9 100 0 58.50 7.7000 6.83 4.06903 2.04122 0.13162 -0.1021 2.14331
223 3 9 60 0 54.00 6.7500 6.83 3.98898 1.90954 0.12500 -0.2001 2.10964
224 3 9 54 0 49.00 4.0000 6.83 3.89182 1.38629 0.08163 -0.6825 2.06877
225 4 16 160 0 51.00 8.8300 6.83 3.93183 2.17816 0.17314 0.1570 2.02119
226 4 16 65 0 101.25 8.2500 6.83 4.61759 2.11021 0.08148 -0.1994 2.30966
227 8 64 58 0 79.00 6.2000 6.83 4.36945 1.82455 0.07848 -0.2568 2.08134
228 3 9 77 0 70.00 6.6500 5.34 4.24850 1.89462 0.09500 -0.3242 2.21881
229 4 16 60 0 53.60 3.6000 5.34 3.98155 1.28093 0.06716 -0.7612 2.04210
230 6 36 70 0 121.50 8.7500 5.34 4.79991 2.16905 0.07202 -0.1286 2.29764
231 3 9 850 0 400.35 34.2500 3.41 5.99234 3.53369 0.08555 0.5813 2.95236
232 2 4 50 0 250.00 25.0000 3.41 5.52146 3.21888 0.10000 0.3868 2.83207
233 3 9 560 0 300.00 21.8150 3.41 5.70378 3.08260 0.07272 0.2516 2.83098
234 1 1 100 0 75.25 12.5000 2.67 4.32082 2.52573 0.16611 0.1076 2.41817
235 9 81 65 0 123.63 11.3866 2.67 4.81729 2.43244 0.09210 0.1603 2.27217
236 2 4 70 0 71.35 6.0000 2.67 4.26760 1.79176 0.08409 -0.5129 2.30463
237 6 36 30 0 71.00 6.5000 2.67 4.26268 1.87180 0.09155 -0.1998 2.07165
238 1 1 33 0 30.00 4.1000 4.60 3.40120 1.41099 0.13667 -0.6203 2.03133
239 5 25 80 0 74.70 7.7500 4.60 4.31348 2.04769 0.10375 -0.0830 2.13069
240 6 36 56 0 63.00 11.4500 4.60 4.14313 2.43799 0.18175 0.4166 2.02136
241 7 49 100 0 64.50 11.3100 4.60 4.16667 2.42569 0.17535 0.4187 2.00696
242 5 25 10 0 30.00 4.0000 4.60 3.40120 1.38629 0.13333 -0.3606 1.74693
243 7 49 59 0 90.00 5.6200 4.60 4.49981 1.72633 0.06244 -0.4208 2.14710
244 5 25 100 0 107.00 16.0000 20.78 4.67283 2.77259 0.14953 0.4907 2.28185
245 7 49 63 0 232.75 1.7500 20.78 5.44996 0.55962 0.00752 -1.9872 2.5469
246 8 64 70 0 89.00 5.0000 20.78 4.48864 1.60944 0.05618 -0.5220 2.13148

DATA APPENDIX
MANUFACTURED DATA USED FOR FIGURE 4 AND TABLE 4

OBS Y1 Y2 X E1 E2
1 6.2572 4.7017 7.30486 -1.2214 -2.7769
2 3.7961 1.2889 2.76033 -0.2742 -2.7814
3 3.9352 7.1287 2.92823 -0.2609 2.9325
4 6.7654 2.7448 4.78812 1.1743 -2.8463
5 5.7980 6.8462 3.95260 0.8336 1.8818
6 2.9036 -1.8579 1.41929 -0.1608 -4.9224
7 4.8417 3.3707 3.94397 -0.1163 -1.5873
8 5.2678 3.8073 6.38352 -1.5199 -2.9803
9 7.0222 8.0525 7.74733 -0.7883 0.2420
10 11.0499 9.4522 9.42398 1.9819 0.3842
11 5.9499 6.1467 8.91342 -2.7352 -2.5384
12 7.2163 4.8124 7.81591 -0.6456 -3.0495
13 4.1508 5.0737 2.04735 0.6153 1.5382
14 9.1109 9.2515 9.83824 -0.2677 -0.1272
15 0.9573 -2.6896 1.32915 -2.0395 -5.6864
16 9.0866 10.3918 8.94844 0.3753 1.6805
17 5.9646 6.8467 6.45608 -0.8775 0.0046
18 8.2261 -0.1643 7.41044 0.6683 -7.7222
19 6.4915 5.6458 7.25698 -0.9512 -1.7969
20 7.2152 0.0384 8.12568 -0.8791 -8.0559
21 7.9828 7.0470 8.22832 -0.1885 -1.1243
22 2.9310 4.8966 3.41728 -1.6320 0.3336
23 5.7897 5.5461 4.24225 0.6080 0.3644
24 8.3074 8.6137 9.51738 -0.8306 -0.5244
25 8.2678 3.8932 8.67647 -0.2396 -4.6141
26 3.5404 4.9051 5.50754 -2.5903 -1.2255
27 6.6281 6.9439 5.21601 0.7161 1.0319
28 7.1956 8.1062 5.46228 1.0988 2.0095
29 5.4800 1.2020 4.57682 0.0473 -4.2306
30 2.2808 5.2443 2.54581 -1.6285 1.3349
31 8.0023 7.9911 7.40131 0.4513 0.4401
32 5.3462 6.1389 3.81178 0.4874 1.2800
33 5.1822 11.6926 4.57574 -0.2496 6.2607
34 6.1736 -2.9086 4.45625 0.8314 -8.2508
35 8.5563 5.4878 6.21945 1.8918 -1.1768
36 4.1185 4.2269 0.27373 1.9132 2.0216
37 1.3947 5.5413 0.59601 -1.0523 3.0942
38 5.6674 3.0515 7.06635 -1.6324 -4.2183
39 6.2627 7.5628 4.19621 1.1155 2.4157
40 5.2715 3.2858 5.75619 -1.0457 -3.0313
41 6.8996 9.6865 4.24299 1.7174 4.5043
42 3.0212 6.5399 1.95067 -0.4418 3.0769
43 3.8518 8.1030 4.86813 -1.7993 2.4519
44 8.7223 11.3889 8.71114 0.1890 2.8556
45 8.2958 18.8834 8.21094 0.1376 10.7252
46 1.9373 -2.5850 1.27457 -1.0186 -5.5409
47 3.4920 2.4327 1.68873 0.2254 -0.8338
48 4.0539 -1.7246 2.47829 0.1952 -5.5833
49 3.3034 -0.9916 2.68922 -0.7135 -5.0085
50 7.9437 9.3611 7.63800 0.2152 1.6326
51 2.7190 -0.7560 1.93477 -0.7321 -4.2070
52 8.2728 11.1079 7.62877 0.5512 3.3863
53 8.3185 6.6138 6.76854 1.2421 -0.4626
54 8.5273 5.0214 8.91924 -0.1621 -3.6681
55 6.6156 4.7760 5.61250 0.4062 -1.4334
56 9.8167 10.4806 9.30891 0.8350 1.4989
57 5.7180 9.2328 4.84727 0.0826 3.5973
58 7.5164 9.3803 8.11860 -0.5726 1.2914
59 8.7414 4.7444 9.36393 -0.2816 -4.2786
60 9.8128 2.6768 9.58399 0.6248 -6.5112
61 8.9143 9.4339 8.03762 0.8861 1.4057
62 8.7137 11.3970 8.20226 0.5620 3.2453
63 7.3125 11.5217 5.37305 1.2827 5.4919
64 5.0852 4.6769 4.78197 -0.5013 -0.9096
65 3.1596 7.6258 0.57190 0.7307 5.1968
66 3.2792 7.3876 1.94593 -0.1802 3.9282
67 5.1313 7.0664 5.27208 -0.8227 1.1123
68 7.9128 9.9160 7.83728 0.0349 2.0381
69 3.6012 6.1118 1.20883 0.6946 3.2051
70 7.8628 6.1583 6.77867 0.7788 -0.9257
71 8.2659 15.6029 9.04776 -0.5199 6.8171
72 7.0373 4.2921 5.63454 0.8114 -1.9338
73 10.0857 10.8948 9.63971 0.8559 1.6650
74 6.4872 4.8879 4.62658 1.0173 -0.5820
75 6.9800 16.5513 8.94754 -1.7306 7.8406
76 4.4206 0.9233 1.35229 1.4064 -2.0909
77 7.2138 8.7318 7.89473 -0.7073 0.8107
78 9.1808 10.7612 6.75661 2.1133 3.6937
79 9.9127 6.7066 8.38590 1.6232 -1.5828
80 2.7530 3.6563 1.80679 -0.6020 0.3012
81 6.1451 7.9387 6.74919 -0.9168 0.8768
82 4.3602 6.3511 3.62027 -0.3550 1.6359
83 7.0077 6.1692 5.85053 0.6198 -0.2187
84 8.7111 10.3237 9.79575 -0.6357 0.9769
85 6.6527 12.4448 7.12975 -0.6945 5.0975
86 10.4635 7.2518 9.69738 1.1905 -2.0213
87 3.1937 8.0237 3.80862 -1.6628 3.1672
88 0.9650 4.7521 1.44205 -2.1166 1.6706
89 7.9280 5.5978 6.52070 1.0375 -1.2927
90 3.8328 -1.0384 3.33984 -0.6721 -5.5433
91 2.8134 6.7692 2.61311 -1.1465 2.8094
92 7.9368 9.5197 8.53754 -0.4664 1.1165
93 2.6352 2.2180 0.37018 0.3576 -0.0596
94 2.8469 5.0158 1.61536 -0.3647 1.8042
95 8.3037 7.9988 9.40893 -0.7530 -1.0579
96 7.3079 10.4642 5.88674 0.8929 4.0492
97 7.7162 8.4037 8.49647 -0.6561 0.0314
98 -1.1956 -0.1309 0.15125 -3.3090 -2.2444
99 3.1250 4.3293 2.03925 -0.4044 0.7999
100 6.2774 1.9409 3.65065 1.5394 -2.7971

DATA APPENDIX
MANUFACTURED DATA USED FOR FIGURE 5 AND TABLES 7, 9

OBS RENT HHSIZE HSIZESQ HH2 HH3 HH4 HH5 HH6 HH7 HH8 HHGE9
1 622.31 8 64 0 0 0 0 0 0 1 0
2 521.86 3 9 0 1 0 0 0 0 0 0
3 1093.25 3 9 0 1 0 0 0 0 0 0
4 715.37 5 25 0 0 0 1 0 0 0 0
5 1088.18 4 16 0 0 1 0 0 0 0 0
6 107.76 2 4 1 0 0 0 0 0 0 0
7 741.27 4 16 0 0 1 0 0 0 0 0
8 671.97 7 49 0 0 0 0 0 1 0 0
9 924.20 8 64 0 0 0 0 0 0 1 0
10 638.42 10 100 0 0 0 0 0 0 0 10
11 546.16 9 81 0 0 0 0 0 0 0 9
12 595.05 8 64 0 0 0 0 0 0 1 0
13 953.82 3 9 0 1 0 0 0 0 0 0
14 587.28 10 100 0 0 0 0 0 0 0 10
15 31.36 2 4 1 0 0 0 0 0 0 0
16 968.05 9 81 0 0 0 0 0 0 0 9
17 970.46 7 49 0 0 0 0 0 1 0 0
18 127.78 8 64 0 0 0 0 0 0 1 0
19 720.31 8 64 0 0 0 0 0 0 1 0
20 -5.59 9 81 0 0 0 0 0 0 0 9
21 687.57 9 81 0 0 0 0 0 0 0 9
22 933.36 4 16 0 0 1 0 0 0 0 0
23 1036.44 5 25 0 0 0 1 0 0 0 0
24 547.56 10 100 0 0 0 0 0 0 0 10
25 338.59 9 81 0 0 0 0 0 0 0 9
26 927.45 6 36 0 0 0 0 1 0 0 0
27 1153.19 6 36 0 0 0 0 1 0 0 0
28 1250.95 6 36 0 0 0 0 1 0 0 0
29 576.94 5 25 0 0 0 1 0 0 0 0
30 933.49 3 9 0 1 0 0 0 0 0 0
31 944.01 8 64 0 0 0 0 0 0 1 0
32 1028.00 4 16 0 0 1 0 0 0 0 0
33 1626.07 5 25 0 0 0 1 0 0 0 0
34 174.92 5 25 0 0 0 1 0 0 0 0
35 852.32 7 49 0 0 0 0 0 1 0 0
36 502.16 1 1 0 0 0 0 0 0 0 0
37 609.42 1 1 0 0 0 0 0 0 0 0
38 475.17 8 64 0 0 0 0 0 0 1 0
39 1241.57 5 25 0 0 0 1 0 0 0 0
40 746.87 6 36 0 0 0 0 1 0 0 0
41 1450.43 5 25 0 0 0 1 0 0 0 0
42 907.69 2 4 1 0 0 0 0 0 0 0
43 1245.19 5 25 0 0 0 1 0 0 0 0
44 1085.56 9 81 0 0 0 0 0 0 0 9
45 1372.52 9 81 0 0 0 0 0 0 0 9
46 45.91 2 4 1 0 0 0 0 0 0 0
47 516.62 2 4 1 0 0 0 0 0 0 0
48 241.67 3 9 0 1 0 0 0 0 0 0
49 299.15 3 9 0 1 0 0 0 0 0 0
50 1063.26 8 64 0 0 0 0 0 0 1 0
51 179.30 2 4 1 0 0 0 0 0 0 0
52 1238.63 8 64 0 0 0 0 0 0 1 0
53 923.74 7 49 0 0 0 0 0 1 0 0
54 433.19 9 81 0 0 0 0 0 0 0 9
55 906.66 6 36 0 0 0 0 1 0 0 0
56 749.89 10 100 0 0 0 0 0 0 0 10
57 1359.73 5 25 0 0 0 1 0 0 0 0
58 929.14 9 81 0 0 0 0 0 0 0 9
59 172.14 10 100 0 0 0 0 0 0 0 10
60 -51.12 10 100 0 0 0 0 0 0 0 10
61 940.57 9 81 0 0 0 0 0 0 0 9
62 1124.53 9 81 0 0 0 0 0 0 0 9
63 1599.19 6 36 0 0 0 0 1 0 0 0
64 909.04 5 25 0 0 0 1 0 0 0 0
65 819.68 1 1 0 0 0 0 0 0 0 0
66 992.82 2 4 1 0 0 0 0 0 0 0
67 1161.23 6 36 0 0 0 0 1 0 0 0
68 1103.81 8 64 0 0 0 0 0 0 1 0
69 920.51 2 4 1 0 0 0 0 0 0 0
70 877.43 7 49 0 0 0 0 0 1 0 0
71 1281.71 10 100 0 0 0 0 0 0 0 10
72 856.62 6 36 0 0 0 0 1 0 0 0
73 766.50 10 100 0 0 0 0 0 0 0 10
74 941.80 5 25 0 0 0 1 0 0 0 0
75 1584.06 9 81 0 0 0 0 0 0 0 9
76 390.91 2 4 1 0 0 0 0 0 0 0
77 981.07 8 64 0 0 0 0 0 0 1 0
78 1339.37 7 49 0 0 0 0 0 1 0 0
79 641.72 9 81 0 0 0 0 0 0 0 9
80 630.12 2 4 1 0 0 0 0 0 0 0
81 1057.68 7 49 0 0 0 0 0 1 0 0
82 1063.59 4 16 0 0 1 0 0 0 0 0
83 1028.13 6 36 0 0 0 0 1 0 0 0
84 697.69 10 100 0 0 0 0 0 0 0 10
85 1409.75 8 64 0 0 0 0 0 0 1 0
86 397.87 10 100 0 0 0 0 0 0 0 10
87 1216.72 4 16 0 0 1 0 0 0 0 0
88 767.06 2 4 1 0 0 0 0 0 0 0
89 840.73 7 49 0 0 0 0 0 1 0 0
90 345.67 4 16 0 0 1 0 0 0 0 0
91 1080.94 3 9 0 1 0 0 0 0 0 0
92 911.65 9 81 0 0 0 0 0 0 0 9
93 294.04 1 1 0 0 0 0 0 0 0 0
94 780.42 2 4 1 0 0 0 0 0 0 0
95 494.21 10 100 0 0 0 0 0 0 0 10
96 1454.92 6 36 0 0 0 0 1 0 0 0
97 803.14 9 81 0 0 0 0 0 0 0 9
98 75.56 1 1 0 0 0 0 0 0 0 0
99 879.99 3 9 0 1 0 0 0 0 0 0
100 620.29 4 16 0 0 1 0 0 0 0 0

APPENDIX F

OUTLINE OF SUGGESTED TABLES FOR URBAN HOUSING SURVEY REPORT

Introduction

This outline is intended to be a basis for discussion for final selection of tables.

General Comments

The following is a partial list of tables which can be produced for each city and for the larger towns in the 1983 Urban Housing Survey. In addition to weighted counts, it is important that (1) weighted proportions, and (2) the unweighted number of observations in each cell be included, so that the reliability of the estimates is self-documented.

The most important criterion variables for the tables include structure type, age of the structure, number of residential units, tenure, and income class.

The design of the tables will have to be modified for larger and small samples. For example, there are eight income classes by two tenure groups by four water outcomes, or 64 cells. This may be no problem in Nairobi, but in the smaller towns we will have to use a smaller number of income classes and collapse the water outcomes because of the smaller sample size. For example, we could use four income classes and compute a recoded variable (1 = private and/or communal piped water, 2 = no piped water).

The classifications will need to be collapsed in many other cases as well. For example, there are six structure types and many possible values for "Number of Residential Units." These criterion variables will have to be collapsed into a manageable number of categories.
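The collapsing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the sampling weights, the pairing of adjacent income classes, and the numeric water codes (1 = private piped, 2 = communal piped, 3 and 4 = no piped water) are all hypothetical and are not taken from the survey. Each cell reports the weighted count, the weighted proportion, and the unweighted number of observations.

```python
from collections import defaultdict

# Hypothetical records: (income_class 1-8, tenure, water_outcome 1-4, sampling_weight).
# Assumed water codes: 1 = private piped, 2 = communal piped, 3 and 4 = no piped water.
RECORDS = [
    (1, "renter", 4, 120.0),
    (2, "renter", 3, 110.0),
    (3, "owner", 1, 95.0),
    (5, "renter", 2, 80.0),
    (8, "owner", 1, 60.0),
    (7, "owner", 1, 75.0),
]


def collapse_income(income_class):
    """Collapse eight income classes into four by pairing adjacent classes."""
    return (income_class + 1) // 2


def collapse_water(water_outcome):
    """Recode water supply: 1 = private and/or communal piped, 2 = no piped water."""
    return 1 if water_outcome in (1, 2) else 2


def tabulate(rows):
    """Return weighted count, weighted proportion, and unweighted n for each cell."""
    cells = defaultdict(lambda: {"weighted_count": 0.0, "n": 0})
    for income, tenure, water, weight in rows:
        key = (collapse_income(income), tenure, collapse_water(water))
        cells[key]["weighted_count"] += weight
        cells[key]["n"] += 1
    total = sum(cell["weighted_count"] for cell in cells.values())
    for cell in cells.values():
        cell["weighted_proportion"] = cell["weighted_count"] / total
    return dict(cells)


FULL_CELLS = 8 * 2 * 4       # income classes x tenure groups x water outcomes
COLLAPSED_CELLS = 4 * 2 * 2  # after collapsing income and water categories

if __name__ == "__main__":
    for key, cell in sorted(tabulate(RECORDS).items()):
        print(key, cell)
```

Collapsing reduces the design from 64 possible cells to 16, which is what makes the smaller town samples workable; the same pattern applies to structure type and number of residential units.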
The attached computer printout shows one possible format for a table which looks at the percentage of households with piped water by income class and tenure for Cairo. It is only one suggested format, and other formats can be designed.

Many tables specify two criterion variables, e.g., income class and tenure. It is extremely useful to break the tables out by (1) both criteria (income and tenure) together and (2) each separately. See the attached example.

Another point to remember is that the sample size needed to reliably estimate proportions is less than that needed to reliably estimate means or medians (see, for example, Kish, Survey Sampling, ch. 2). It may be necessary to collapse categories further for means and medians. A common rule of thumb is not to report medians or averages for cells containing fewer than 25 sample observations. Weighted counts should still be reported for small cells so that tables sum correctly.

Comments on Table Specifications and Outline

Since the number of tables can become quite large, we recommend producing several reports of varying detail for different audiences. One possible work plan would include: (1) the production of a full set of tables for each city, arranged by city, which can be used for reference and by specialists and planners working in a particular town; (2) a one-volume report following the outline of the table specifications (outline modified as required) which presents conclusions and sample tables from representative cities; and (3) an executive summary of no more than 25 pages which summarizes the key findings of the survey for policymakers. Regression models similar to those described in Section 1.3 can be the topic of separate reports since the estimation of these models is likely to be time consuming.

The following outline lists suggested tables by chapter. The chapters refer to the one-volume report; chapter 1 can be modified to stand on its own as the executive summary.
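The small-cell rule of thumb from the general comments can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the function name, the threshold constant, and the sample values are hypothetical, and the median shown is a plain unweighted median used only to keep the sketch short.

```python
import statistics

# Rule-of-thumb threshold: suppress medians for cells with fewer observations.
MIN_CELL_N = 25


def summarize_cell(values, weights):
    """Summarize one table cell.

    The weighted count is always reported so that tables still sum correctly;
    the median is reported only when the unweighted n reaches MIN_CELL_N.
    The median here is unweighted, which is a simplifying assumption.
    """
    n = len(values)
    weighted_count = sum(weights)
    median = statistics.median(values) if n >= MIN_CELL_N else None
    return {"n": n, "weighted_count": weighted_count, "median": median}


if __name__ == "__main__":
    large_cell = summarize_cell(list(range(1, 31)), [2.0] * 30)
    small_cell = summarize_cell([100, 200, 300], [5.0, 5.0, 5.0])
    print(large_cell)
    print(small_cell)
```

A suppressed median is returned as `None` rather than dropped, so the cell keeps its place in the table along with its weighted count and unweighted n.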
Chapter 1: Introduction and Summary

  - describe the survey
  - summarize main results by chapter

Chapter 2: Characteristics of the Current Housing Stock

A. Characteristics of Structures

  1. Distribution of structures by type of structure (i.e. house, maisonette, etc.).
  2. Number of residential units by type of structure.
  3. Number of residents by type of structure, and by age of structure.
  4. Type of construction materials used in outer walls, by type of structure.
  5. Type of construction materials used for roofs, by type of structure.
  6. Type of construction materials used for floors, by type of structure.
  7. Age of structure, by type of structure.
  8. Median estimated value of structure by type and by age.
  9. Ownership of unit, by structure type.
  10. Type of scheme of structure, by structure type.

B. Characteristics of Households

  1. Tenure, by income class.
  2. Type of water supply, by income class and tenure.
  3. Type of sanitation, by income class and tenure.
  4. Type of lighting, by income class and tenure.
  5. Type of scheme, by income class and tenure.
  6. Type of ownership, by income class and tenure.
  7. Number of households reporting income from rent, by income class and tenure.
  8. Type of kitchen, by income class and tenure.
  9. Type of bathing facilities, by income class and tenure.
  10. Number of rooms, median rent per room, and median persons per room, by income class and tenure.
  11. Garbage disposal: frequency of collection and type of disposal, by income class and tenure.
  12. Proportion of units which contain servants' quarters, by income class and tenure.
  13. Median expenditures on food, rent, household requirements, transport, water/light, and total expenditure, by income class.
  14. Tables on distances, cost of travel, and mode of transport to public amenities may be useful for future reports.
  15. Tables on opinions about neighborhood for future reports.

Chapter 3: Estimating Housing Demand

A. Revealed Preferences of Recent Movers

  1. Number of movers by previous and current tenure.
  2. Number of moves by income class and tenure.
  3. Comparing the previous residential unit with the current one by:
     (a) type of unit (i.e. movement from flat to house, etc.)
     (b) number of rooms
     (c) rent
     (d) water supply

B. Revealed Preferences of Households Planning to Move

  1. Number of planned moves by current and expected tenure.
  2. Number of planned moves by income class.
  3. Comparing the current residential unit with the unit the household plans to move to by:
     (a) type of unit
     (b) number of rooms
     (c) rent
     (d) water supply

C. Renters' Affordability and Willingness to Pay

  1. Median monthly contract rent and gross rent (rent plus utilities) paid by income class.
  2. Table based on Table 2 of Analyzing an Urban Housing Survey, using income class in place of "Income Quintiles," and using rent-to-consumption ratio in place of "Rent-to-Income Ratio." Consumption must be used in place of income because we do not have continuous income measures but we do have continuous measures of consumption.

D. The Housing Consumption of Owner Occupants

  1. Median current value of owner-occupied structures, by income class, structure type, and number of residential units.
  2. Table based on C-2 above, substituting current value of structure for rent, and classified by structure type and number of residential units.

Chapter 4: Housing Supply

A. Characteristics of Housing Supply*

  1. Type of scheme (included in Chapter 1)
  2. Type of loan (owners only) by income class
  3. Source of finance (owners only) by income class
  4. Land tenure (owners only) by income class

  * MWH and planning department should give their suggestions for additional tables.

B. Characteristics of the Rental Market

  1. Proportion of total housing units for rental
  2. Median monthly rent (contract and gross) by number of rooms
  3. Proportion of residential units where part of the unit is sublet.