

Policy Research Working Paper                         8349




                   The Roots of Inequality
              Estimating Inequality of Opportunity
                     from Regression Trees

                                 Paolo Brunori
                                    Paul Hufe
                              Daniel Gerszon Mahler




Development Research Group
Poverty and Inequality Team
February 2018


  Abstract
 This paper proposes a set of new methods to estimate inequality of opportunity based on
 conditional inference regression trees. It illustrates how these methods represent a substantial
 improvement over existing empirical approaches to measure inequality of opportunity. First, the
 new methods minimize the risk of arbitrary and ad hoc model selection. Second, they provide a
 standardized way to trade off upward and downward biases in inequality of opportunity
 estimations. Finally, regression trees can be graphically represented; their structure is
 immediate to read and easy to understand. This will make the measurement of inequality of
 opportunity more easily comprehensible to a large audience. These advantages are illustrated by
 an empirical application based on the 2011 wave of the European Union Statistics on Income and
 Living Conditions.




  This paper is a product of the Poverty and Inequality Team, Development Research Group. It is part of a larger effort by
  the World Bank to provide open access to its research and make a contribution to development policy discussions around
  the world. Policy Research Working Papers are also posted on the Web at http://econ.worldbank.org. The authors may be
  contacted at dmahler@worldbank.org.




         The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development
         issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the
         names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those
         of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and
         its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.


                                                       Produced by the Research Support Team
                     The Roots of Inequality:
    Estimating Inequality of Opportunity from Regression Trees*
             Paolo Brunori†, Paul Hufe‡, Daniel Gerszon Mahler§




JEL-Codes: D31; D63; C38
Keywords: Equality of Opportunity; Machine Learning; Random Forests




   * We are grateful for comments received from participants during presentations held at the Institute for Social
and Economic Research at the University of Essex, the Poverty and Applied Microeconomics Seminar at the
World Bank, the Copenhagen Centre for Social Data Science at the University of Copenhagen, and the 13th
Winter School on Social Choice Theory and Welfare at Canazei. Any errors remain our own.
   † Corresponding Author: University of Florence, Dipartimento di Scienze per l’Economia e l’Impresa, Via delle
Pandette 32 - 50127 Firenze, Italy, paolo.brunori@unifi.it.
   ‡ ifo Munich and LMU Munich, hufe@ifo.de.
   § University of Copenhagen and World Bank, dmahler@worldbank.org.
1    Introduction
John Roemer’s (1998) seminal contribution, Equality of Opportunity, has incited a ﬂourishing
empirical literature on the measurement of unequal opportunities. At the heart of Roemer’s
formulation is the idea that factors that determine individual outcomes can be divided into two
categories: factors over which individuals have control, which he calls eﬀort, and factors for which
individuals cannot be held responsible, which he calls circumstances. Individuals characterized
by exactly the same exogenous circumstances are assigned to a circumstance type. Members of
a type have the same background conditions to transform resources into outcomes. Therefore,
while within-type inequality, as caused by the diﬀerential exertion of eﬀort, is morally irrele-
vant, between-type diﬀerences in achievements are inequitable and call for compensation. Thus,
opportunity-equalizing policies have the objective of neutralizing the impact of circumstances
on the distribution of the desirable outcome.
    Following Roemer’s approach, a battery of methods to measure inequality of opportunity
have been proposed (see Roemer and Trannoy, 2015; Van de gaer and Ramos, 2016, for recent
overviews).1 Today, well established empirical methods include summary indexes that quantify
the extent of unequal opportunities (Almås et al., 2011; Bourguignon et al., 2007; Checchi and
Peragine, 2010) as well as statistical tests that detect the mere existence thereof (Kanbur and
Snell, 2017; Lefranc et al., 2009). In either case, empirical results are sensitive to critical choices
of model selection, which are at the complete discretion of the researcher.
    First, researchers have to make a decision on which circumstance variables to consider for es-
timation.2 Observable circumstances beyond individual control are typically a subset of the real
number of exogenous variables aﬀecting individual outcomes. This issue has been largely dis-
cussed by the literature, and the prevailing view is that partial observability implies downward-
biased inequality of opportunity estimates (Ferreira and Gignoux, 2011). To counteract this
downward bias, one strategy is to resort to high-quality datasets that provide very detailed in-
formation with respect to individual circumstances (Hufe et al., 2017). Naturally, the scope of
improvement of this approach is limited by sample sizes. Consider for example the increasing
availability of genetic datasets with billions of polymorphisms per person (Altshuler et al., 2015).
While the genetic make-up of individuals clearly is beyond individual control and must be con-
sidered a circumstance, the number of circumstances exceeds the available degrees of freedom
which forces the researcher to choose selectively from the available set of circumstances.
    Second, the inﬂuence of circumstances may be dependent on the expression of other circum-
stance characteristics. For example, it is a well-established ﬁnding that the inﬂuence of similar
child-care arrangements on various life outcomes varies strongly by biological sex (García et al.,
2017). In contrast to such evidence, however, many empirical applications presume that the
eﬀect of circumstances on individual outcome is ﬁxed and additive (Bourguignon et al., 2007;
Ferreira and Gignoux, 2011). On the one hand, analogous to partial observability, this functional
form assumption forces a downward bias on inequality of opportunity estimates. On the other
hand, limitations in the available degrees of freedom may prove the estimation of fully saturated
models impractical. Again the researcher is left to her own devices in selecting the best model
for estimating inequality of opportunity.
    While the downward bias of inequality of opportunity estimates is prominently discussed
in the extant literature, the reliability of estimates has been largely disregarded.
    1
      Note that a number of contributions from the social choice literature on fair allocation had previously
proposed similar methods (Fleurbaey, 1995, 2008; Van de gaer, 1993).
    2
      Roemer does not provide a fixed list of variables that are to be considered as circumstances. Rather he
suggests that the set of circumstances should evolve from a political process (Roemer and Trannoy, 2015). In
empirical implementations typical circumstances are biological sex, socioeconomic background, race, or the area
of birth.

Holding the sample size constant, increasing the type partition by including additional circumstances or
relaxing the linearity assumption directly translates into reduced variation for estimating the
relevant parameters. In fact, Brunori et al. (2016) show that overﬁtting the model may instill
an upward bias on inequality of opportunity estimates.
    This discussion highlights the non-trivial challenge in selecting the appropriate model for
estimating inequality of opportunity. Scholars must balance between diﬀerent sources of bias
while trying to avoid ad-hoc solutions.
    In this paper we propose the use of classiﬁcation and regression tree methods to address
the outlined shortcomings of current approaches. Introduced by Morgan and Sonquist (1963)
and popularized by Breiman et al. (1984), classiﬁcation and regression trees belong to a family
of statistical methods that are commonly summarized under the labels of “machine learning”
or “statistical learning” (Friedman et al., 2009). Originating from the ﬁelds of computer sci-
ence and statistics, these methods are being increasingly adopted by economists (Athey, 2017;
Mullainathan and Spiess, 2017; Varian, 2014). Classification and regression tree methods were
developed to make out-of-sample predictions of a dependent variable based on a number of
observable predictors. They let algorithms automatically segment the predictor space into non-
overlapping regions to ﬁnd the best model for predicting the outcome of interest. In the context
of estimating equality of opportunity, this means that we let an automated algorithm decide
how to partition the population into mutually exclusive types for the purpose of calculating
measures of inequality of opportunity in the spirit of Roemer’s theory. To be precise, within the
class of classiﬁcation and regression tree methods we focus on conditional inference trees and
conditional inference forests, both of which bear a number of substantial advantages (Hothorn
et al., 2006).
    First, by drawing on a clear-cut algorithm one minimizes the degree of arbitrariness in
model selection. In both trees and forests types are obtained in the attempt to explain outcome
variability without assuming anything about which circumstances play a statistically signiﬁcant
role in shaping individual opportunities and how they interact. Thus, the partition of the
population into Roemerian types is no longer a judgment call of the researcher but a non-
arbitrary outcome of data analysis. Second, the conditional inference algorithm branches trees
(and constructs forests) by using a sequence of hypothesis tests that prevent model overﬁtting.
Therefore, by using the conditional inference algorithm we can both derive a test for the null
hypothesis of equal opportunity and avoid the potential upward bias of inequality of opportunity
estimates. As a consequence of avoiding upwards and downwards biases, and in contrast to the
current approaches, our estimates are better suited for comparisons across time and between
countries when sample sizes diﬀer.
    Aside from those shared merits, trees and forests bear some distinct advantages which the
researcher needs to trade oﬀ when selecting her preferred approach within the class of condi-
tional inference methods. Trees, on the one hand, have intuitive appeal and their graphical
illustrations are instructive tools for longitudinal or cross-sectional comparisons of opportunity
structures. Forests, on the other hand, perform better in trading oﬀ the diﬀerent sources of
bias outlined above. In fact, we will show that conditional inference forests outperform other
prevalent estimation techniques in terms of out-of-sample prediction accuracy.
    To be sure, just as in the literature on intergenerational mobility (Black and Devereux, 2011),
scholars of equality of opportunity are reluctant to give their estimates a causal interpretation.
The ambition of the literature is to understand how much variation in outcomes can be attributed
to root causes that are commonly perceived as unfair. It is precisely the prediction character of
these empirical exercises that makes this branch of the literature a useful ﬁeld to leverage the
advantages of machine learning algorithms.
    The remainder of this paper is organized as follows: Section 2 gives a brief introduction to
current empirical approaches in the literature. Section 3 introduces regression trees


and illustrates how to use them in the context of inequality of opportunity estimations. An
empirical illustration based on the EU Survey of Income and Living Conditions is contained in
section 4, in which we will also highlight the particular advantages of using tree-based estimation
methods. Lastly, section 5 concludes.


2       Empirical Approaches to Equality of Opportunity
Consider a population of size $N$ indexed by $i \in \{1, \ldots, N\}$ and an associated vector of incomes
$Y = \{y_1, \ldots, y_i, \ldots, y_N\}$. Individual $i$'s outcome of interest $y_i$ is the result of two sets of factors.
First, a set of circumstances beyond her control of size $P$: $\Omega_i = \{C_i^1, \ldots, C_i^p, \ldots, C_i^P\}$. Second, a
set of efforts of size $Q$: $\Theta_i = \{E_i^1, \ldots, E_i^q, \ldots, E_i^Q\}$. In general, the outcome generating function
$g : \Omega \times \Theta \to \mathbb{R}_+$ can therefore be written as

$$y_i = g(\Omega_i, \Theta_i). \tag{1}$$

    Each circumstance $C^p \in \Omega$ is characterized by a total of $X^p$ possible realizations, where
each realization is denoted as $x^p$. Based on the realizations $x^p$ we can partition the population
into a set of non-overlapping types $T = \{t_1, \ldots, t_m, \ldots, t_M\}$. A type is a subgroup of the original
population uniform in terms of circumstances, i.e. individuals $i$ and $j$ belong to the same type
$t_m \in T$ if $x_i^p = x_j^p \; \forall C^p \in \Omega$. They belong to different types in $T$ if $\exists C^p \in \Omega : x_i^p \neq x_j^p$.
The number of types in the population is given by $M = \prod_{p=1}^{P} X^p$. Following Roemer (1998) we
assume that the joint realizations of the effort variables $E^q \in \Theta$ can be summarized by a scalar
$\pi \in [0, 1]$. Individuals sharing the same expression of effort are called a tranche. Hence, types
and tranches define two particular ways of partitioning the population into subgroups, where
group membership either indicates uniformity in circumstances (types) or effort (tranches).
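
The type count can be made concrete with a small sketch. The circumstance names and their realizations below are hypothetical and chosen only for illustration; the point is that the number of types equals the product of the numbers of realizations, so it grows multiplicatively with every circumstance added.

```python
from itertools import product
from math import prod

# Hypothetical circumstances and their possible realizations (illustrative only).
circumstances = {
    "sex": ["female", "male"],
    "parental_education": ["low", "medium", "high"],
    "birth_area": ["urban", "rural"],
}

# Each type is one joint realization of all circumstances.
types = list(product(*circumstances.values()))

# M is the product of the numbers of realizations X^p.
M = prod(len(values) for values in circumstances.values())

print(M, len(types))
```

With only three coarse circumstances the partition already contains twelve types, which hints at why sample size quickly becomes binding in richer datasets.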
    In the literature we can distinguish two broad classes of equality of opportunity deﬁnitions.3
First, the ex-ante view focuses on between-type diﬀerences in the value of opportunity sets
without paying attention to the speciﬁc eﬀort realizations of individual type members. According
to this perspective, equality of opportunity is satisﬁed if the aggregate value of opportunity sets
is equalized across types. One case in point is the ex-ante utilitarian perspective according to
which the value of opportunity sets is indicated by the average outcome within the speciﬁc type.
Thus, equality of opportunity would be realized if the mean outcome of each type was equal
to the population mean. Second, the ex-post view focuses on individual outcomes conditional
on eﬀort exertion. According to this perspective, equality of opportunity would be satisﬁed if
individual outcomes were equalized within each tranche, i.e. individuals with equal levels of
eﬀort exertion realize the same outcomes. A comprehensive discussion of the ex-ante and ex-
post principles of equality of opportunity can be found in Fleurbaey and Peragine (2013). In
the context of this paper we will restrict ourselves to the ex-ante utilitarian approach only.

Tests and Measures The extant literature has witnessed the development of empirical tests
and measures for ex-ante utilitarian inequality of opportunity. A prominent example for the
former category is provided by Lefranc et al. (2009), who show that rejecting the null hypoth-
esis of no ﬁrst-order stochastic dominance in type-speciﬁc outcome distributions is suﬃcient to
reject the existence of equal opportunities in the population from an ex-ante utilitarian perspec-
tive. Furthermore, in a recent contribution Kanbur and Snell (2017) develop likelihood ratio
tests that can serve to test for ex-ante utilitarian equality of opportunity. A widely adopted
example of the latter category is the measure developed by Van de gaer (1993) and Checchi and Peragine (2010).
    3
      Measures different from the ones illustrated here have been proposed in the literature. The interested reader
is referred to Van de gaer and Ramos (2016) for a comprehensive overview.

They propose to measure inequality in a counterfactual distribution
$Y^{EA} = \{y_1^{EA}, \ldots, y_i^{EA}, \ldots, y_N^{EA}\}$ obtained by removing inequality within types from the original
distribution. To be precise, individual outcomes are re-scaled to match their respective type
mean:

$$y_i^{EA} = \frac{1}{N_m} \sum_{j \in t_m} y_j = \mu_m, \quad \forall i \in t_m, \; \forall t_m \in T, \tag{2}$$

where $N_m$ is the size and $\mu_m$ the average outcome of type $t_m$. Therefore, any remaining inequality
in $Y^{EA}$ reflects inequality between types and inequality of opportunity can now be summarized
by applying any standard scalar measure of inequality $I(\cdot)$, like the Gini index or a member of
the generalized entropy class (Cowell, 2016), to the counterfactual distribution $Y^{EA}$. Any such
measure obtains its minimal value in the case of equality of all type means, i.e. if $\mu_m = \mu_l = \mu \; \forall t_m, t_l \in T$.
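
A minimal sketch of this measure, assuming the type partition is already given as one label per individual: each outcome is replaced by its type mean as in equation (2), and the Gini index is then applied to the smoothed distribution. The function names and the toy data are assumptions made for illustration, not part of the original text.

```python
import numpy as np

def gini(x):
    """Gini index of a non-negative 1-D array (rank-based formula)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    return float(2.0 * np.sum(ranks * x) / (n * np.sum(x)) - (n + 1) / n)

def ex_ante_counterfactual(y, type_ids):
    """Replace each outcome by the mean outcome of its type (equation 2)."""
    y = np.asarray(y, dtype=float)
    type_ids = np.asarray(type_ids)
    y_ea = np.empty_like(y)
    for t in np.unique(type_ids):
        members = type_ids == t
        y_ea[members] = y[members].mean()
    return y_ea

# Toy data (made up): six individuals in two types.
y = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
type_ids = np.array(["a", "a", "a", "b", "b", "b"])

y_ea = ex_ante_counterfactual(y, type_ids)
iop = gini(y_ea)  # inequality between types only
print(y_ea, round(iop, 4), round(gini(y), 4))
```

Any other scale-invariant index could be substituted for the Gini index without changing the structure of the calculation.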

Estimation In practice we do not observe the full set of circumstances $\Omega$. Rather we observe
the subset $\check{\Omega} \subseteq \Omega$ of size $\check{P}$. For example, in most datasets we do not have full information on the
genetic make-up of individuals, nor do we have a gapless documentation of the socioeconomic
conditions in which individuals grew up. Analogously, for most $C^p \in \check{\Omega}$ we only observe the
subset $\check{X}^p$ of the true number of realizations $X^p$. For example, in many datasets information
on parental education and occupation is coded in categorical variables of varying detail, which
may mask more nuanced socioeconomic diﬀerences among households.
    Depending on the strength of their distributional assumptions, estimations of inequality of
opportunity are typically classified as either non-parametric or parametric. A case in point for
the former approach is the abovementioned measure put forward by Van de gaer (1993) and
Checchi and Peragine (2010). The researcher partitions the sample into mutually exclusive cells
based on the realizations of all circumstance variables under consideration. Hence, the researcher
makes no assumption on the interaction of circumstance variables in the determination of indi-
vidual outcomes. This comes at a high cost, however. To avert overﬁtting, the partition must be
constructed such that a suﬃcient number of observations belongs to each cell. Conditional on
the dataset being rich enough in information on circumstances, this in turn forces the researcher
to make a discretionary choice on the relevant partition. Consider for instance a continuous cir-
cumstance variable like parental income. Employing the non-parametric estimation approach,
the researcher must split the parental income distribution into quantiles for constructing the
type partition. The potential granularity of this split obviously depends on the sample sizes of
the ensuing cells. Additionally, the researcher must balance the informational content of a ﬁner
partition of parental income against the opportunity cost of being forced to exclude another circumstance
variable from the investigation. To put it in formal terms: the researcher must select
a subset $\hat{\Omega} \subseteq \check{\Omega} \subseteq \Omega$ from the set of observed circumstances. Furthermore, within the confines
of limited degrees of freedom the researcher must also decide for each $C^p \in \hat{\Omega}$ how to restrict
the number of realizations $\hat{X}^p \subseteq \check{X}^p \subseteq X^p$ in order to construct a statistically meaningful type
partition.
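
The overfitting risk of an overly fine partition can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a result from the paper: outcomes are drawn independently of the (randomly assigned) types, so true inequality of opportunity is zero, yet the non-parametric estimate rises as the same sample is spread over more cells. The sample size, the lognormal outcome distribution and the numbers of types are arbitrary choices.

```python
import numpy as np

def gini(x):
    """Gini index of a non-negative 1-D array (rank-based formula)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    return float(2.0 * np.sum(ranks * x) / (n * np.sum(x)) - (n + 1) / n)

def type_means(y, type_ids):
    """Smoothed distribution: each outcome replaced by its type mean."""
    y_ea = np.empty_like(y, dtype=float)
    for t in np.unique(type_ids):
        members = type_ids == t
        y_ea[members] = y[members].mean()
    return y_ea

rng = np.random.default_rng(0)  # fixed seed for reproducibility
n = 2000
y = rng.lognormal(mean=0.0, sigma=0.5, size=n)  # outcomes unrelated to types

estimates = {}
for n_types in (2, 20, 200):
    # Types are assigned at random, so any between-type inequality is noise.
    type_ids = rng.integers(0, n_types, size=n)
    estimates[n_types] = gini(type_means(y, type_ids))

for n_types, value in estimates.items():
    print(f"{n_types:>4} types: estimated inequality of opportunity = {value:.3f}")
```

Because cell means are estimated from fewer observations as the partition becomes finer, sampling noise in those means is mistaken for between-type inequality.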
    To address this problem, the literature commonly resorts to parametric estimation ap-
proaches. Here, the researcher obtains the counterfactual distribution by estimating a Mincerian
regression with circumstances as the sole right-hand side variables (Bourguignon et al., 2007;
Ferreira and Gignoux, 2011):

$$\ln(y_i) = \beta_0 + \sum_{p=1}^{\check{P}} \beta_p C_i^p + \epsilon_i. \tag{3}$$




The counterfactual distribution $Y^{EA}$ can then be constructed from the predicted values

$$y_i^{EA} = \exp\left[\sum_{p=1}^{\check{P}} \hat{\beta}_p C_i^p\right]. \tag{4}$$

    Although the parametric approach solves some of the shortcomings of the non-parametric
approach, it is not a panacea. The standard version of the parametric approach assumes a
linear impact of all circumstances and therefore neglects the existence of interdependencies and
non-linearities in the impact of circumstances. To pick up the example from the introduction,
the researcher cannot allow for a diﬀerential impact of the same child-rearing arrangement on
male and female children. Of course, to alleviate this shortcoming the researcher may integrate
interaction terms and higher order polynomials into equation (3). At the extreme the researcher
may even estimate a fully saturated model, in which case parametric and non-parametric es-
timation coincide. This congruence, however, reiterates the fundamental problem of current
approaches towards the estimation of inequality of opportunity. In view of restrictions on the
available degrees of freedom, the researcher is forced to make a discretionary choice on the
model she estimates, which in itself is a strong determinant of the ensuing results when testing
and measuring equality of opportunity. Furthermore, just as the non-parametric approach, the
parametric estimation is at risk of overﬁtting the data when the set of circumstances is large.
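
The parametric approach can be sketched in a few lines, assuming circumstances are already coded numerically. The function name and the toy data are hypothetical. Unlike equation (4), the sketch keeps the estimated intercept in the fitted values; this only rescales all predictions by a common constant and therefore leaves scale-invariant inequality indices unchanged.

```python
import numpy as np

def parametric_counterfactual(y, C):
    """OLS of ln(y) on circumstances; returns exp(fitted values) and coefficients."""
    y = np.asarray(y, dtype=float)
    C = np.asarray(C, dtype=float)
    if C.ndim == 1:
        C = C.reshape(-1, 1)
    X = np.column_stack([np.ones(len(y)), C])  # intercept plus circumstances
    beta, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
    return np.exp(X @ beta), beta

# Toy data (made up): one binary circumstance, log outcomes scattered
# symmetrically around 1.0 (c = 0) and 1.5 (c = 1).
c = np.array([0.0, 0.0, 1.0, 1.0])
y = np.exp(np.array([0.8, 1.2, 1.3, 1.7]))

y_ea, beta = parametric_counterfactual(y, c)
print(beta)  # intercept and slope of the log-linear model
print(y_ea)  # smoothed outcomes, identical within each circumstance cell
```

Interaction terms or polynomials would enter as additional columns of `C`, which is where the discretionary model choice discussed above arises.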
    In analogy to this paper, Li Donni et al. (2015) have discussed the issue of ad-hoc model
selections in the empirical literature on equality of opportunity. To resolve this issue, they
propose a data-driven type partition by estimating a latent class model. In this approach,
observable circumstances are considered indicators of membership in an unobservable latent
type, $t_m$. For each possible number of latent types, $M$, the model obtains the partition into
types by minimizing the within-type correlation of observable circumstances, $C^p \in \check{\Omega}$. The
optimal number of groups $M^*$ is selected by minimizing an appropriate model selection criterion
such as Schwarz’s Bayesian Information Criterion (BIC). The latent class approach therefore
partly solves the issue of arbitrary model selection. However, it cannot solve the problem of
model selection once the potential number of type characteristics exceeds the available degrees
of freedom. In these cases the latent class approach replicates the limitations of other prevalent
approaches towards estimating inequality of opportunity: the researcher must pre-select the
relevant set of circumstances, their subpartition as well as the respective interactions. In
contrast, our approach embodies a method to select circumstances from the set of all observed
variables in a non-arbitrary fashion. Furthermore, latent types are constructed in the attempt
to explain circumstances’ correlation. The partition is therefore insensitive to the degree of
association between circumstances and outcome. However, one may consider explaining outcome
variability as a function of circumstances to be precisely the purpose of inequality of opportunity
measurement.4 Lastly, we prefer the conditional inference approach as it provides the particular
advantages of being econometrically more tractable while providing a stronger economic meaning
of the identiﬁed types.


3       Estimating Inequality of Opportunity from Regression Trees
Originally, tree-based methods were developed to make out-of-sample predictions of a dependent
variable based on a number of observable predictors. As we will outline in the following, they
can be straightforwardly applied to equality of opportunity estimations and solve many of the
issues associated with the prevalent estimation approaches outlined in section 2.
    4
    This issue is common to any two-stage analysis in which latent classes serve as controls for a distal outcome.
The effect of latent class membership on the distal outcome is attenuated and the explained variability is reduced
(Lanza et al., 2013).

While we put
a particular emphasis on regression trees, our main arguments also hold for classiﬁcation trees.
Thus, the proposed estimation methods are not restricted to continuous variables like income,
but can also be fruitfully employed with respect to non-continuous outcomes, such as health
(Trannoy et al., 2010) or education (Oppedisano and Turati, 2015).
    In what follows we will present two tree-based estimation procedures both of which solve
the model selection problem outlined in section 2. First, we will introduce conditional inference
regression trees. Their simple graphical illustration is particularly instructive for longitudinal
or cross-sectional comparisons of opportunity structures. In spite of their intuitive appeal,
however, they perform relatively poorly in out-of-sample predictions. Second, to address the
concern of estimate reliability we will also introduce conditional inference forests, which are
– loosely speaking – a collection of many conditional inference trees. Forests do not have the
intuitive appeal of regression trees. However, they perform signiﬁcantly better in terms of out-of-
sample predictions. In fact, we will show in section 4.5 that they outperform all other considered
estimation techniques along this dimension.

3.1   Conditional Inference Trees
Tree-based methods obtain predictions for outcome $y$ as a function of the input variables
$I = \{I^1, \ldots, I^p, \ldots, I^P\}$. Specifically, they use the set $I$ to partition the population into a set
of non-overlapping groups, $G = \{g_1, \ldots, g_m, \ldots, g_M\}$, where each group $g_m$ is homogeneous in
the expression of each input variable. These groups are also called terminal nodes or leaves in a
regression tree context. The predicted value for outcome $y$ of observation $i$ is calculated from
the mean outcome $\mu_m$ of the group $g_m$ to which the individual is assigned. Hence, in addition
to the observed income vector $Y = \{y_1, \ldots, y_i, \ldots, y_N\}$ one obtains a vector of predicted values
$\hat{Y} = \{\hat{y}_1, \ldots, \hat{y}_i, \ldots, \hat{y}_N\}$, where

$$\hat{y}_i = \mu_m = \frac{1}{N_m} \sum_{j \in g_m} y_j, \quad \forall i \in g_m, \; \forall g_m \in G. \tag{5}$$

The mapping from regression trees to equality of opportunity estimation is straightforward.
Conditional on the input variables being circumstances only, i.e. $I \subseteq \check{\Omega} \subseteq \Omega$, it is evident that
each resulting group $g_m \in G$ can be interpreted as a circumstance type $t_m \in T$. Furthermore,
$\hat{Y}$ is analogous to the smoothed distribution $Y^{EA}$, the construction of which we have outlined
in section 2 to illustrate ex-ante utilitarian measures of inequality of opportunity. In view of
the fact that our predictor space is confined to circumstance variables only, we use the terms
“input variables” and “circumstances” as well as “groups” and “types” interchangeably in the
following. Input variables will be denoted by $C^p$ and groups by $t_m$. In line with equation (5),
we will refer to individual predictions $\hat{y}_i$ as $\mu_m$.

Algorithm Considering all possible ways in which the population can be split into groups is
a daunting task when the set of input variables is large. In conventional estimation approaches
the researcher is left to her own devices in (i) selecting $\hat{\Omega}$ from $\check{\Omega}$, (ii) restricting the number of
realizations of each $C^p \in \hat{\Omega}$, and (iii) determining the relevant interactions among all $C^p \in \hat{\Omega}$.
The magnitude of this choice set oftentimes leads to arbitrary model selection. By contrast,
with regression trees the researcher does not need to make these choices herself. The researcher
only submits the full and unrestricted set of observed variables that qualify as circumstances,
$\check{\Omega}$, while the algorithm chooses the relevant circumstances, their subpartition and the respective
interactions. To be precise, the observations are divided into M groups (or types) by what is
known as recursive binary splitting. Recursive binary splitting starts by dividing the full sample


into two distinct groups according to the value they take in one input variable $C^p$. If $C^p$ is
a continuous or ordered variable, then $i \in t_m$ if $C_i^p < x^p$ and $i \in t_l$ if $C_i^p \geq x^p$. If $C^p$ is a
categorical variable then the categories can be split into any two arbitrary groups. The process
is continued such that one of the two groups is divided into further subgroups (potentially based
on another $C^p \in \check{\Omega}$), and so on. Graphically, this division into groups can be presented like an
upside-down tree (Figure 1).

                                Figure 1: Exemplary Tree Representation




                      Note : Artiﬁcial example of a regression tree. The grey boxes indicate
                      splitting points, while the white boxes indicate terminal nodes. The
                      values inside the white boxes show predicted values associated with
                      each terminal node (µm ).
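
Recursive binary splitting and the resulting predictions can be sketched as follows. The variable names, the data and the splitting points are made up and supplied by hand purely for illustration; in a conditional inference tree the splitting variables and points are chosen by the algorithm, which this sketch does not attempt to reproduce.

```python
import numpy as np

def binary_split(c, x):
    """True where the circumstance value lies below the splitting point x."""
    return np.asarray(c) < x

def leaf_means(y, leaf_ids):
    """Predicted value per individual: mean outcome of its terminal node."""
    y = np.asarray(y, dtype=float)
    leaf_ids = np.asarray(leaf_ids)
    y_hat = np.empty_like(y)
    for leaf in np.unique(leaf_ids):
        members = leaf_ids == leaf
        y_hat[members] = y[members].mean()
    return y_hat

# Toy data (made up): outcome, an ordered circumstance and a binary one.
y = np.array([10.0, 14.0, 30.0, 34.0, 50.0, 54.0])
parental_income = np.array([20.0, 25.0, 40.0, 45.0, 50.0, 60.0])
female = np.array([1, 0, 1, 1, 0, 0])

# First split: parental income below 30 versus the rest.
low_income = binary_split(parental_income, 30.0)

# Second split, applied only to the higher-income group: female versus male.
# Terminal nodes: 0 = low income, 1 = higher income and female, 2 = higher income and male.
leaf_ids = np.where(low_income, 0, np.where(female == 1, 1, 2))

y_hat = leaf_means(y, leaf_ids)
print(leaf_ids)
print(y_hat)
```

The three terminal nodes play the role of types, and the vector of leaf means is the smoothed distribution to which an inequality index would then be applied.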

    The exact manner in which the split is conducted depends on the type of regression tree that
is used. In this paper we follow the methodology proposed by Hothorn et al. (2006), leading to
what they call conditional inference trees.5
    Conditional inference trees are grown by a series of permutation tests according to the
following 4-step procedure:
   1. Test the null hypothesis of independence, $H_0^{C^p}: D(Y \mid C^p) = D(Y)$, for each input variable
      $C^p \in \check{\Omega}$, and obtain a p-value associated with each test, $p^{C^p}$.
        ⇒ Adjust the p-values for multiple hypothesis testing, such that
          $p^{C^p}_{adj.} = 1 - (1 - p^{C^p})^P$ (Bonferroni Correction).
   2. Select the variable, $C^*$, with the lowest adjusted p-value, i.e. $C^* = \{C^p : \arg\min p^{C^p}_{adj.}\}$.
        ⇒ If $p^{C^*}_{adj.} > \alpha$: Exit the algorithm.
        ⇒ If $p^{C^*}_{adj.} \leq \alpha$: Continue, and select $C^*$ as the splitting variable.
    5
      An alternative would be Classification and Regression Trees (CART) as introduced by Breiman et al. (1984).
CART chooses splits so as to minimize the mean squared error, $MSE = \frac{1}{N} \sum_m \sum_{i \in t_m} (y_i - \mu_m)^2$. We prefer
conditional inference trees since CART are biased towards splitting variables made of many categories (Hothorn
et al., 2006). Furthermore, we avoid the intricacies of tree pruning (Friedman et al., 2009) by establishing a test
criterion that considers the bias-variance trade-off before making an additional split.

   3. Test the discrepancy between the subsamples for each possible binary partition, s, based
      on $C^*$, i.e. $Y_s = \{Y_i : C_i^* < x^p\}$ and $Y_{-s} = \{Y_i : C_i^* \geq x^p\}$, and obtain a p-value associated
      with each test, $p^{C^*_s}$.

        ⇒ Split the sample based on $C^*_{s^*}$, by choosing the split point s that yields the lowest
          p-value, i.e. $C^*_{s^*} = \{C^*_s : \arg\min p^{C^*_s}\}$.

   4. Repeat the algorithm for each of the resulting subsamples.
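The four steps can be sketched in a few lines of Python. This is an illustrative simplification under stated assumptions, not the implementation of Hothorn et al. (2006): it handles only numeric or ordered circumstances, uses the absolute covariance between circumstance and outcome as a stand-in test statistic, approximates p-values with 199 random permutations, and breaks ties between equal p-values by the size of the statistic. The names `abs_cov`, `perm_pvalue`, `grow_tree`, `predict` and the `min_size` safeguard are hypothetical.

```python
import random
from statistics import mean


def abs_cov(x, y):
    """Absolute (unnormalised) covariance, a simple stand-in dependence statistic."""
    mx, my = mean(x), mean(y)
    return abs(sum((a - mx) * (b - my) for a, b in zip(x, y)))


def perm_pvalue(x, y, rng, n_perm=199):
    """Permutation p-value for H0: D(Y | x) = D(Y); also returns the statistic."""
    observed = abs_cov(x, y)
    shuffled, hits = list(y), 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        hits += abs_cov(x, shuffled) >= observed - 1e-12
    return (hits + 1) / (n_perm + 1), observed


def grow_tree(X, y, alpha=0.05, min_size=10, rng=None):
    """X maps circumstance names to lists of ordered values; y is the outcome list."""
    rng = rng or random.Random(0)
    leaf = {"mean": mean(y), "n": len(y)}
    if len(y) < 2 * min_size:
        return leaf
    # Step 1: independence test per circumstance, adjusted as 1 - (1 - p)^P.
    adj = {c: 1 - (1 - perm_pvalue(col, y, rng)[0]) ** len(X) for c, col in X.items()}
    # Step 2: pick the lowest adjusted p-value; stop if it exceeds alpha.
    best = min(adj, key=adj.get)
    if adj[best] > alpha:
        return leaf
    # Step 3: test every binary partition of the selected circumstance.
    col, candidates = X[best], []
    for cut in sorted(set(col))[1:]:
        right = [1.0 if v >= cut else 0.0 for v in col]
        n_right = int(sum(right))
        if min(n_right, len(y) - n_right) >= min_size:
            p, stat = perm_pvalue(right, y, rng)
            candidates.append((p, -stat, cut))  # ties in p broken by the statistic
    if not candidates:
        return leaf
    cut = min(candidates)[2]
    # Step 4: repeat the procedure on both resulting subsamples.
    children = {}
    for side, keep in (("left", lambda v: v < cut), ("right", lambda v: v >= cut)):
        idx = [i for i, v in enumerate(col) if keep(v)]
        sub_X = {c: [vals[i] for i in idx] for c, vals in X.items()}
        children[side] = grow_tree(sub_X, [y[i] for i in idx], alpha, min_size, rng)
    return {"var": best, "cut": cut, **children}


def predict(tree, row):
    """Return the terminal-node mean for one observation (dict of circumstances)."""
    while "var" in tree:
        tree = tree["left"] if row[tree["var"]] < tree["cut"] else tree["right"]
    return tree["mean"]
```

With 199 permutations the smallest attainable p-value is 1/200, so very strict values of $\alpha$ would require more permutations in this sketch; this is a limitation of the illustration rather than of the method.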

    Conditional inference trees oﬀer a particularly relevant structure in the context of inequality
of opportunity. Each hypothesis test is essentially a test for whether equal opportunities exist
within a particular (sub)sample. If the algorithm results in no splits at all, then we cannot
reject the null hypothesis of equality of opportunity. The deeper the tree is grown, the more
types are necessary to fully account for the inherent inequality of opportunities in the society
under consideration. Each split tells us that the resulting types have signiﬁcantly diﬀerent
opportunities under an ex-ante utilitarian interpretation. In all of the resulting types (i.e. the
terminal nodes of the tree), we cannot reject the null of equal opportunities.

Tuning Note that the structure and depth of the resulting opportunity tree hinges crucially
on the level of α. The less stringent the α-requirement, the more we allow for false positives,
i.e. the more splits will be detected as signiﬁcant and the deeper the tree will be grown. So how
should α be chosen? On the one hand, α can be chosen a priori in line with the disciplinary
convention to require signiﬁcance levels of at least 5% or even 1%. On the other hand, we can
let the data speak on the optimal speciﬁcation of α, i.e. we can tune the α-parameter to ﬁnd a
model that performs optimally according to a pre-speciﬁed testing criterion.
     If opting for the latter option, $\alpha$ is typically chosen by K-fold cross-validation (CV). To
perform cross-validation, one starts by splitting the sample into K subsamples, also called
folds. Then, one implements the conditional inference algorithm on the union of K − 1 folds for
varying levels of $\alpha$, while leaving out the kth subsample. This makes it possible to compare the
predictions emanating from the K − 1 folds with the real data points observed in the kth fold.
The mean squared prediction error serves as an evaluation criterion:

    $$MSE^{CV}_k(\alpha) = \sum_m \frac{N^k_m}{N^k} \frac{1}{N^k_m} \sum_{i \in t_m} \left(y^k_i - \mu_m(\alpha)\right)^2. \qquad (6)$$

This exercise is repeated for all K folds, so that $MSE^{CV}(\alpha) = \frac{1}{K} \sum_k MSE^{CV}_k(\alpha)$. One then
chooses the $\alpha^*$ that delivers the lowest $MSE^{CV}(\alpha)$:

    $$\alpha^* = \{\alpha \in A : \arg\min MSE^{CV}(\alpha)\}.^{6} \qquad (7)$$

    In our empirical application we ﬁx α∗ = 0.01, which is in line with the disciplinary convention
for hypothesis tests. However, we provide a sensitivity check using cross-validated α in Figure
A.1 of Appendix A.3.
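The cross-validation routine in equations (6) and (7) can be sketched as follows. Since the type weights in equation (6) cancel, the fold-specific criterion reduces to the plain mean squared error over the held-out observations, which is what `cv_mse` computes. To keep the example self-contained, the tree is replaced by a toy learner, `stump_fit`, which splits on a single binary circumstance only if a two-group permutation test rejects at level $\alpha$. All function names are hypothetical, and any learner with the signature `fit(rows, y, alpha)` returning a prediction function could be plugged in instead.

```python
import random
from statistics import mean


def k_fold_indices(n, K, seed=0):
    """Shuffle observation indices and deal them into K folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::K] for k in range(K)]


def cv_mse(fit, rows, y, alpha, K=5, seed=0):
    """Average over folds of the held-out mean squared prediction error."""
    fold_errors = []
    for test in k_fold_indices(len(y), K, seed):
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        predict = fit([rows[i] for i in train], [y[i] for i in train], alpha)
        fold_errors.append(mean((y[i] - predict(rows[i])) ** 2 for i in test))
    return mean(fold_errors)


def tune_alpha(fit, rows, y, grid, K=5, seed=0):
    """Return the alpha in `grid` with the lowest cross-validated MSE, plus all scores."""
    scores = {a: cv_mse(fit, rows, y, a, K, seed) for a in grid}
    return min(scores, key=scores.get), scores


def stump_fit(rows, y, alpha, n_perm=199):
    """Toy learner (assumption): one binary circumstance, split only if significant."""
    g1 = [v for r, v in zip(rows, y) if r == 1]
    g0 = [v for r, v in zip(rows, y) if r == 0]
    overall = mean(y)
    if not g0 or not g1:
        return lambda r: overall
    observed = abs(mean(g1) - mean(g0))
    rng, perm, hits = random.Random(0), list(y), 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        hits += abs(mean(perm[:len(g1)]) - mean(perm[len(g1):])) >= observed
    if (hits + 1) / (n_perm + 1) > alpha:
        return lambda r: overall  # no split: a single type
    m0, m1 = mean(g0), mean(g1)
    return lambda r: m1 if r == 1 else m0  # split: two types with their own means
```

A call such as `tune_alpha(stump_fit, rows, y, grid=[0.01, 0.05, 0.10])` returns the selected value together with the full dictionary of scores, which makes it easy to inspect how flat the criterion is around the optimum.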
   6
     One may argue that a criterion that evaluates models according to their capacity to predict individual
outcomes is misplaced for ex-ante utilitarian inequality of opportunity estimations. After all, we have demonstrated
above that we are mainly concerned with estimating type means rather than individual outcomes. In Appendix
A.1 we show that the MSE-criterion and its handling of the variance-bias trade-off extend straightforwardly to
balancing upward and downward biases in inequality of opportunity estimations.

3.2    Conditional Inference Forests
Regression trees solve the model selection problem outlined in section 2 and provide a simple and
non-arbitrary way of dividing the population into types. Furthermore, trees are easily mapped
and thus lay bare the opportunity structure of a given society for a larger audience. However,
constructing the counterfactual distribution $Y^{EA}$ from conditional inference trees suffers from
two shortcomings. First, trees make only limited use of the information inherent in the set of
observed circumstances since not all $C^p \in \check{\Omega}$ are used for the construction of the tree. Yet,
the omitted circumstances may possess some informational content that can increase predictive
power even though they are not significantly associated with Y at level $\alpha^*$. This is a particular
issue if two important circumstances are highly correlated. Once a split is done using either
of the two, the other is unlikely to yield enough information to cause another split. Second,
the predictions and thus the values of opportunity sets, $\mu_m$, emanating from trees have a high
variance. The structure of trees, and therefore the ensuing distribution $Y^{EA}$, is fairly sensitive
to alterations in the respective data samples. This is a particular issue if there are various
circumstances that are close competitors for defining the first split (Friedman et al., 2009).
In what follows we will introduce conditional inference forests. Conditional inference forests
build methodologically on conditional inference trees and are able to deal with both of these
shortcomings (Breiman, 2001).

Algorithm In short, random forests create many trees and average over all of these when
making predictions. Trees are constructed according to the same 4-step procedure outlined
in the previous subsection. However, two tweaks are made. First, each tree is estimated on a
random subsample b of the original data.7 In total B such trees are estimated. Second, a random
subset of circumstances $\bar{\Omega} \subseteq \check{\Omega}$ of size $\bar{P}$ is allowed to be used at each splitting point. Together
these two tweaks remedy the shortcomings of single conditional inference trees. Drawing only on
subsets $\bar{\Omega} \subseteq \check{\Omega}$ increases the likelihood that all circumstances with informational content at some
point will be identiﬁed as the splitting variable C ∗ and thus addresses the limited information
use of regression trees. Furthermore, averaging over the B predictions cushions the variance of
individual predictions µm and thus addresses the second shortcoming identiﬁed in relation with
single regression trees. Therefore, predictions are formed as follows:
    $$\hat{y}_i(\alpha, \bar{P}, B) = \frac{1}{B} \sum_{b=1}^{B} \mu^b_m(\alpha, \bar{P}). \qquad (8)$$


Tuning From equation (8) it is evident that individual predictions are a function of $\alpha$ (the
significance level governing the implementation of splits), $\bar{P}$ (the number of circumstances to
be considered at each splitting point), and B (the number of subsamples to be drawn from
the data). Again, these parameters can be imposed a priori by the researcher or they can be
determined by tuning the three-dimensional grid $(\alpha, \bar{P}, B)$ to optimize the out-of-sample fit of
the model. In our empirical illustration we proceed as follows. First, to reduce computational
costs we fix B at a level at which the marginal gain of drawing an additional subsample in terms
of out-of-sample prediction accuracy becomes negligible.8
    Second, we determine $\alpha^*$ and $\bar{P}^*$ by minimizing the out-of-bag error. This entails the
following four steps for a grid of values of $\alpha$ and $\bar{P}$:
   7
     Alternatively, one can draw bootstrapped samples, i.e. sample with replacement until a dataset with the
same size as the original data is reached. We use the subsampling technique since it has been shown that using
bootstrapping leads to biased variable selection (Strobl et al., 2007).
   8
     Empirical tests show that this is the case with B ∗ = 200 for most countries in our sample (see Figure A.2 of
Appendix A.3).

    1. Run a random forest with B subsamples, where $\bar{P}$ circumstances are randomly chosen to
       be considered at each splitting point, and $\alpha$ is used as the value for the hypothesis tests.

    2. Calculate the average predicted value of observation i using each of the subsamples $b_{-i}$
       (the so-called bags) in which i does not enter:
       $\hat{y}_i^{OOB}(\alpha, \bar{P}) = \frac{1}{B_{-i}} \sum_{b_{-i}} \mu^b_m(\alpha, \bar{P})$.

    3. Calculate the out-of-bag mean squared error:
       $MSE^{OOB}(\alpha, \bar{P}) = \frac{1}{N} \sum_i [y_i - \hat{y}_i^{OOB}(\alpha, \bar{P})]^2$.

    4. Choose $(\alpha^*, \bar{P}^*) = \{(\{\alpha \in A\}, \{\bar{P} \in \check{P}\}) : \arg\min MSE^{OOB}\}$.

The logic behind this tuning exercise is similar to cross-validation. However, instead of leaving
out the k th fraction of the dataset to make out-of-sample predictions, we leverage the fact that
each tree of a forest is grown on a subsample b−i that excludes all observations i. Hence, for each
tree we can use the out-of-bag data points to evaluate the predictive accuracy of the respective
model.9
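The out-of-bag logic can be illustrated with a deliberately small Python sketch. It rests on several simplifying assumptions that are not part of the paper: circumstances are binary, each "tree" is a single-split stump, so the random subset of $\bar{P}$ circumstances is drawn once per tree, and a normal-approximation two-sample test stands in for the permutation tests. The names `fit_stump`, `oob_mse`, `tune_forest` and the subsample share `frac` are hypothetical.

```python
import math
import random
from statistics import mean, pvariance


def fit_stump(X, y, idx, features, alpha):
    """Single-split tree on rows `idx`, using the most significant of `features`."""
    overall = mean(y[i] for i in idx)
    best = None
    for f in features:
        g1 = [y[i] for i in idx if X[i][f] == 1]
        g0 = [y[i] for i in idx if X[i][f] == 0]
        if len(g1) < 2 or len(g0) < 2:
            continue
        se = math.sqrt(pvariance(g1) / len(g1) + pvariance(g0) / len(g0))
        if se == 0:
            continue
        z = abs(mean(g1) - mean(g0)) / se
        p = math.erfc(z / math.sqrt(2))          # two-sided normal p-value (stand-in)
        p_adj = 1 - (1 - p) ** len(features)     # same adjustment as for single trees
        if p_adj <= alpha and (best is None or p_adj < best[0]):
            best = (p_adj, f, mean(g0), mean(g1))
    if best is None:
        return lambda row: overall
    _, f, m0, m1 = best
    return lambda row: m1 if row[f] == 1 else m0


def oob_mse(X, y, alpha, p_bar, B=50, frac=0.632, seed=0):
    """Out-of-bag MSE of a forest of B stumps grown on subsamples without replacement."""
    rng = random.Random(seed)
    n, n_feat = len(y), len(X[0])
    sums, counts = [0.0] * n, [0] * n
    for _ in range(B):
        bag = set(rng.sample(range(n), int(frac * n)))   # subsample b
        feats = rng.sample(range(n_feat), p_bar)         # random subset of circumstances
        tree = fit_stump(X, y, sorted(bag), feats, alpha)
        for i in range(n):
            if i not in bag:                             # i is out-of-bag for this tree
                sums[i] += tree(X[i])
                counts[i] += 1
    return mean((y[i] - sums[i] / counts[i]) ** 2 for i in range(n) if counts[i] > 0)


def tune_forest(X, y, alphas, p_bars, **kwargs):
    """Grid search over (alpha, p_bar) minimising the out-of-bag MSE."""
    scores = {(a, p): oob_mse(X, y, a, p, **kwargs) for a in alphas for p in p_bars}
    return min(scores, key=scores.get), scores
```

Because every observation is left out of a share of the subsamples, a single pass over the B trees yields out-of-sample predictions for all observations, which is the source of the computational saving relative to K-fold cross-validation.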
    The improved predictive quality of random forests comes at a cost. It is no longer possible
to identify a ﬁxed set of types T into which we can partition the population. For example,
depending on the subset $\bar{\Omega} \subseteq \check{\Omega}$ used for a particular tree as well as the particular subsample
b drawn from the data, it may be that $i, j \in t^b_m$ but $i \in t^{b+1}_m$ while $j \notin t^{b+1}_m$. As a consequence,
the individual prediction and hence the valuation of the individual opportunity set is an average
over the value of opportunity sets $\mu^b_m$ associated with each tree of the forest. Therefore, the
valuation of opportunity sets is less straightforward and opportunity structures are hard to
illustrate in a graphical manner. It is nevertheless possible to describe opportunity structures
by calculating the relative variable importance of each $C^p \in \check{\Omega}$ in constructing the forest. See
section 4.3 for an illustration.


4       Empirical Application
In this section we provide an illustration of our methodology using harmonized survey data
from 31 European countries. As outlined above, conditional inference trees and random forests
solve the issue of model selection associated with the prevalent approaches to equality of op-
portunity estimations. Conditional inference trees are easily tractable and lend themselves to
cross-sectional (and longitudinal) comparisons of opportunity structures. Conditional inference
forests are less tractable but outperform the former approach in terms of predictive accuracy.
In the following, we will illustrate the merits of both approaches. Furthermore, we will com-
pare the results from both versions of our method with prevalent measurement approaches in
the extant literature; namely parametric, non-parametric and latent class models. Comparisons
will be made along two dimensions: first, the estimates themselves, and second, the respective
out-of-sample accuracy. The latter criterion should be interpreted as an indicator of how
well the respective method balances upward and downward biases in inequality of opportunity
estimations. A formal argument for why this is the case is provided in Appendix A.1.

4.1     Data
The empirical illustration is based on the 2011 wave of the European Union Statistics on Income
and Living Conditions (EU-SILC). EU-SILC provides harmonized survey data with respect to
    9
     In principle, tuning can be conducted analogously to regression trees by means of K-fold cross-validation. This,
however, is computationally expensive. Cross-validation would require repeating the entire estimation exercise for
a total of K folds. This is not necessary when using the out-of-bag error since out-of-sample points are already
delivered by leaving out observations i when using bag $b_{-i}$. Hence, in the case of forests using the out-of-bag
error is K times more computationally efficient than cross-validation.

incomes, poverty, and living conditions on an annual basis and covers a cross-section of 31 European
countries in the 2011 wave.10 We draw on the 2011 wave since it contains an ad-hoc
module about the intergenerational transmission of (dis)advantages, which allows us to construct
finely-grained type partitions. The set of observed circumstances $\check{\Omega}$ and their respective
expressions $x^p$ are listed in Table 1 whereas descriptive statistics concerning circumstances are
reported in Appendix A.2. As an additional advantage, EU-SILC has been extensively studied
by the empirical literature on inequality of opportunity and thus provides appropriate benchmarks
against which we can compare our method (Checchi et al., 2016; Marrero and Rodríguez,
2012; Palomino et al., 2016).
    The unit of observation is the individual, whereas the outcome of interest is equivalent
disposable household income. Aware that inequality statistics tend to be heavily influenced by
outliers (Cowell and Victoria-Feser, 1996), we adopt a standard winsorization method: we set
all non-positive incomes to 1 and scale back all incomes exceeding the 99.5th
percentile of the country-specific income distribution to this threshold. Our analysis is
focused on the working age population. Therefore, we restrict the sample to respondents aged
between 30 and 59. To ensure the representativeness of our country samples, all results are
calculated using appropriate individual cross-sectional weights.
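The winsorization rule can be written down in a few lines. The sketch below is illustrative only and makes assumptions the text does not settle: it ignores survey weights, computes the percentile after the non-positive incomes have been replaced, and uses linear interpolation between order statistics, which is one of several common percentile conventions. The function name `winsorize_incomes` is hypothetical, and the function would be applied separately to each country sample.

```python
def winsorize_incomes(incomes, upper_pct=99.5):
    """Set non-positive incomes to 1 and cap incomes above the upper percentile."""
    # Bottom rule: every non-positive income is replaced by 1.
    floored = [x if x > 0 else 1.0 for x in incomes]
    ranked = sorted(floored)
    # Unweighted percentile by linear interpolation between order statistics.
    pos = (len(ranked) - 1) * upper_pct / 100
    lo = int(pos)
    hi = min(lo + 1, len(ranked) - 1)
    cap = ranked[lo] + (ranked[hi] - ranked[lo]) * (pos - lo)
    # Top rule: incomes above the percentile are scaled back to it.
    return [min(x, cap) for x in floored]


if __name__ == "__main__":
    sample = [-250.0, 0.0, 12_000.0, 18_500.0, 23_000.0, 31_000.0, 950_000.0]
    print(winsorize_incomes(sample))
```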
    Table 2 shows considerable heterogeneity in the income distributions of our country sample.
While households in Norway (NO) and Switzerland (CH) on average obtained incomes above
€40,000 in 2010, the average households in Bulgaria (BG), Romania (RO) and Lithuania (LT)
did not exceed the €5,000 mark. The lowest inequality prevails in the Nordic countries of Norway
(NO), Sweden (SE) and Iceland (IS), all of which are characterized by Gini coefficients of below
0.22. At the other end of the spectrum we find the Eastern European countries of Latvia (LV),
Lithuania (LT) and Romania (RO) with Gini coefficients well above the level of 0.33.

4.2    Benchmark Methods
We compare our estimates from trees and forests against three benchmark methods that have
been proposed in the extant literature.
    First, we draw on the parametric approach as proposed by Bourguignon et al. (2007) and
Ferreira and Gignoux (2011). In line with equation (3), estimates are obtained by a Mince-
rian regression of equivalent household income on the following controls: father occupation (10
categories), father and mother education (ﬁve categories), area of birth (three categories), and
tenancy status of the household (two categories). The model speciﬁcation therefore includes 20
binary variables and resembles the speciﬁcation used in Palomino et al. (2016).11
    Second, we draw on the non-parametric approach as proposed by Checchi and Peragine
(2010). Non-parametric estimates are obtained by partitioning the sample into 40 types. Indi-
viduals in type tm have parents of equivalent education (ﬁve categories), share their migration
status (a binary variable whether the respondent is a ﬁrst or second generation immigrant), and
have fathers working in the same occupation. To minimize the frequency of sparsely populated
   10
      The sample consists of Austria (AT), Belgium (BE), Bulgaria (BG), Switzerland (CH), Cyprus (CY), Czech
Republic (CZ), Germany (DE), Denmark (DK), Estonia (EE), Greece (EL), Spain (ES), Finland (FI), France
(FR), Croatia (HR), Hungary (HU), Ireland (IE), Iceland (IS), Italy (IT), Malta (MT), Lithuania (LT), Luxem-
bourg (LU), Latvia (LV), Netherlands (NL), Norway (NO), Poland (PL), Portugal (PT), Romania (RO), Sweden
(SE), Slovenia (SI), Slovak Republic (SK), and Great Britain (UK).
   11
      We have estimated the predicted outcomes both as the exponential of the predicted log outcome,
$y_i^{EA} = \exp\left(\sum_{p=1}^{\check{P}} \hat{\beta}_p C_i^p\right)$, and by introducing, assuming a normally distributed error term, the correction
$y_i^{EA} = \exp\left(\sum_{p=1}^{\check{P}} \hat{\beta}_p C_i^p + \sigma^2/2\right)$, where $\sigma^2$ is the estimated variance of the error term. We do not find any significant
differences in the level of estimated inequality of opportunity when introducing the correction. This may
explain why the need of such correction has never been explicitly discussed in previous contributions.

                                 Table 1: List of Circumstances

     Respondent’s sex:
         - Male
         - Female
     Respondent’s country of birth:
         - Respondent’s present country of residence
         - European country
         - Non-European country
     Presence of parents at home:
         - Both present
         - Only mother
         - Only father
         - Without parents
         - Lived in a private household without any parent
     Number of adults (aged 18 or more) in respondent’s household
     Number of working adults (aged 18 or more) in respondent’s household
     Number of children (under 18) in respondent’s household
     Father/mother country of birth and citizenship:
         - Born/citizen of the respondent’s present country of residence
         - Born/citizen of another EU-27 country
         - Born/citizen of another European country
         - Born/citizen of a country outside Europe
     Father/mother education (based on International Standard Classification of Education
      1997 (ISCED-97)):
         - Unknown father/mother
         - Illiterate
         - Low (0-2 ISCED-97)
         - Medium (3-4 ISCED-97)
         - High (5-6 ISCED-97)
     Father/mother occupational status:
         - Unknown or dead father/mother
         - Employed
         - Self employed
         - Unemployed
         - Retired
         - House worker
         - Other inactive
     Father/mother main occupation (based on International Standard Classification of
      Occupations, published by the International Labour Office ISCO-08):
         - Managers (I-01)
         - Professionals (I-02)
         - Technicians (I-03)
         - Clerical support workers (I-04)
         - Service and sales workers (including also armed force) (I-05 and 10)
         - Skilled agricultural, forestry and fishery workers (I-06)
         - Craft and related trades workers (I-07)
         - Plant and machine operators, and assemblers (I-08)
         - Elementary occupations (I-09)
         - Father/mother did not work, was unknown or was dead (I-0)
     Managerial position of the father/mother:
         - Supervisory
         - Non-supervisory
     Tenancy status of the house in which the respondent was living:
         - Owned
         - Not owned



types we deviate from the occupational list given in Table 1 by re-coding occupations into three
categories: highly skilled non-manual (I-01–I-03), lower-skilled non-manual (I-04–I-05 and I-10),
skilled manual and elementary occupation (I-06–I-09 and father/mother unknown or dead). This

                                    Table 2: Summary Statistics

                      Country    Sample size   Avg. eq. income    Std. dev.    Gini
                        AT            6,220              25,451      13,971   0.268
                        BE            6,011              23,291      10,948   0.249
                        BG            7,154               3,714       2,491   0.333
                        CH            7,583              42,208      24,486   0.279
                        CY            4,589              21,058      11,454   0.279
                        CZ            8,711               9,006       4,320   0.250
                        DE           12,683              22,221      12,273   0.276
                        DK            5,897              32,027      13,836   0.232
                        EE            5,338               6,922       3,912   0.330
                        EL            6,184              13,184       8,651   0.334
                        ES           15,481              17,088      10,597   0.329
                        FI            9,743              27,517      13,891   0.246
                        FR           11,078              24,299      14,583   0.288
                        HR            6,969               6,627       3,819   0.306
                        HU           13,330               5,327       2,863   0.276
                        IE            4,318              24,867      14,307   0.296
                        IS            3,684              22,190       9,232   0.210
                        IT           21,070              18,786      11,730   0.309
                        LT            5,403               4,774       3,150   0.344
                        LU            6,765              37,911      19,977   0.271
                        LV            6,423               5,334       3,618   0.363
                        MT            4,701              13,006       6,747   0.277
                        NL           11,411              25,210      11,414   0.235
                        NO            5,026              43,260      16,971   0.202
                        PL           15,545               6,103       3,690   0.316
                        PT            5,899              10,781       7,296   0.334
                        RO            7,867               2,562       1,646   0.337
                        SE            6,599              26,346      10,700   0.215
                        SI           13,183              13,772       5,994   0.225
                        SK            6,779               7,304       3,416   0.257
                        UK            7,391              25,936      16,815   0.320
               Note : Summary statistics for the 31 countries in the 2011 wave of EU-SILC.
               Income variables are measured in Euros.



partition is similar but more parsimonious than the one used in Checchi et al. (2016), who base
their analysis on a total of 96 types.
    Lastly, we compare our estimates against the latent class approach as proposed by Li Donni
et al. (2015). The eligible set of circumstances is the full set of observable circumstances, $\check{\Omega}$. We
follow Li Donni et al. (2015) in using Schwarz’s Bayesian Information Criterion (BIC) to select
the most adequate number of latent types.

4.3   Estimates of Inequality of Opportunity
Table 3 shows inequality of opportunity estimates for our country sample according to ﬁve diﬀer-
ent estimation procedures. Columns 2-4 list results using the parametric, the non-parametric,
and the latent class approach, all of which have been proposed in the extant literature (see
section 4.2). Columns 5 and 6 list results from conditional inference trees and conditional inference
forests, respectively. For all methods, inequality of opportunity estimates are obtained by
calculating the Gini index in the counterfactual distribution $Y^{EA}$.
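For methods that deliver a fixed type partition, this final step can be sketched as follows. The example assumes that $Y^{EA}$ assigns every individual the mean outcome of her type, as in the ex-ante utilitarian approach, and it ignores survey weights for brevity. The function names `gini` and `ex_ante_iop` and the toy data are hypothetical.

```python
from collections import defaultdict


def gini(values):
    """Unweighted Gini index computed from the ranked outcomes."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    ranked_sum = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * ranked_sum / (n * total) - (n + 1) / n


def ex_ante_iop(outcomes, types):
    """Gini index of the counterfactual distribution of type means."""
    sums, counts = defaultdict(float), defaultdict(int)
    for y, t in zip(outcomes, types):
        sums[t] += y
        counts[t] += 1
    # Every individual is assigned the mean outcome of her type.
    y_ea = [sums[t] / counts[t] for t in types]
    return gini(y_ea)


if __name__ == "__main__":
    incomes = [10.0, 20.0, 30.0, 60.0]
    types = ["a", "a", "b", "b"]
    # Type means are 15 and 45, so the counterfactual distribution is
    # [15, 15, 45, 45], whose Gini index is 0.25.
    print(ex_ante_iop(incomes, types))
```

For forests, where no fixed partition exists, the same `gini` function would be applied directly to the individual predictions rather than to type means.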
    Of all methods under consideration the parametric approach delivers the highest estimates.
For 29 out of 31 countries the inequality of opportunity estimates are higher than the results
from both conditional inference trees and forests. Analogously, the unweighted average estimate
over all countries equals 0.103 Gini points for the parametric approach as compared to 0.079
and 0.078 Gini points for trees and forests, respectively. Also in terms of country rankings, the
parametric approach delivers markedly diﬀerent results in comparison to our preferred methods.
While the parametric approach identiﬁes Romania (RO), Bulgaria (BG) and Greece (EL) as the
countries in which opportunities are most unequally distributed, these countries rank 6th, 1st
and 5th (6th, 2nd and 7th) in the case of trees (forests).
    Non-parametric measures of inequality of opportunity take a middle-ground between the
parametric approach and our preferred methods. For 16 (19) out of 31 countries the non-
parametric estimate exceeds the estimate coming from trees (forests), while the unweighted
cross-country average estimate amounts to 0.084 Gini points. In terms of country rankings
the non-parametric approach shows much closer resemblance to our preferred methods than
the parametric approach. For example, the three most unequal countries from an opportunity
perspective as identiﬁed by the non-parametric approach are Bulgaria (BG), Portugal (PT)
and Luxembourg (LU), which is congruent with the top three countries identiﬁed by trees and
forests.
    Lastly, the latent class model tends to furnish much lower estimates than all other methods,
including trees and forests. This is not very surprising if one considers how latent types are
constructed. Latent classes are obtained in the attempt to maximize local independence, that is
to minimize the within-type correlation of circumstances. The algorithm constructs types (and
selects their most appropriate number) ignoring the covariance of circumstances and outcome.
Conditional inference trees instead construct types by maximizing the outcome variability that
can be explained by circumstances. For 8 (9) out of 31 countries the latent class estimate exceeds
the estimate coming from trees (forests), while the unweighted cross-country average
estimate amounts to 0.069 Gini points. Also in terms of country rankings the latent class
approach diﬀers markedly from our preferred methods. It identiﬁes Romania (RO), Greece
(EL) and Portugal (PT) as the countries in which opportunities are most unequally distributed,
whereas these countries rank 6th, 5th and 1st (6th, 7th and 3rd) in the case of trees (forests).
    To gain further understanding as regards the relation of existing measurement approaches
to our preferred methods, Figure 2 plots the estimates from each method against the estimates
from conditional inference forests. The black diagonal indicates the 45 degree line, on which all
data points should align if the diﬀerent methods were perfectly congruent. The upper left panel
plots the estimates from the parametric approach against the forest estimates. We can conﬁrm
the previous diagnosis that the parametric approach delivers higher estimates than forests (and
trees). The diﬀerence is particularly pronounced for countries that are characterized by relatively
low levels of inequality of opportunity, like the Nordic countries. The upper right panel shows the
same plot for the non-parametric approach. We again ﬁnd relatively high upward divergences
in comparison to conditional forest estimates for countries in which inequality of opportunity is
low. However, the differences are less pronounced. Interestingly, this pattern is reversed when
looking at the correlation plot for the latent class approach in the lower left panel. Instead
of overestimating the impact of circumstances in societies of low inequality of opportunity, it
underestimates the impact of circumstances in societies that are characterized by high inequality
in opportunities. Finally, as expected, trees and forests tend to produce very similar results.
The correlation between estimates is high (0.98) and in contrast to all other approaches, the
sign of the diﬀerence is uncorrelated with the level of the estimate.
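
Two of the patterns described above can be reproduced from the Table 3 columns alone. The Python sketch below computes the correlation between tree and forest estimates and compares the average gap between parametric and forest estimates for countries with low and high forest estimates. The cutoff of 0.05 used to define "low" inequality of opportunity is an assumption made for this illustration and does not come from the paper; the function names are likewise illustrative.

```python
from math import sqrt

# Columns copied from Table 3: country -> (parametric, cond. inf. tree, cond. inf. forest).
TABLE3 = {
    "AT": (0.0888, 0.0865, 0.0880), "BE": (0.1108, 0.0868, 0.0912),
    "BG": (0.1542, 0.1362, 0.1335), "CH": (0.0917, 0.0796, 0.0901),
    "CY": (0.0942, 0.0799, 0.0800), "CZ": (0.0716, 0.0569, 0.0511),
    "DE": (0.0704, 0.0697, 0.0793), "DK": (0.0772, 0.0212, 0.0204),
    "EE": (0.1108, 0.0967, 0.1005), "EL": (0.1476, 0.1264, 0.1089),
    "ES": (0.1421, 0.1280, 0.1200), "FI": (0.0687, 0.0197, 0.0275),
    "FR": (0.0858, 0.0904, 0.0980), "HR": (0.1312, 0.0822, 0.0763),
    "HU": (0.1098, 0.1134, 0.1079), "IE": (0.1048, 0.0843, 0.0784),
    "IS": (0.0669, 0.0123, 0.0157), "IT": (0.1213, 0.1078, 0.0969),
    "LT": (0.0947, 0.0693, 0.0672), "LU": (0.1340, 0.1326, 0.1356),
    "LV": (0.1335, 0.1102, 0.1110), "MT": (0.0872, 0.0710, 0.0716),
    "NL": (0.0661, 0.0284, 0.0194), "NO": (0.0480, 0.0202, 0.0234),
    "PL": (0.1111, 0.1019, 0.0991), "PT": (0.1376, 0.1362, 0.1267),
    "RO": (0.1698, 0.1204, 0.1107), "SE": (0.1178, 0.0247, 0.0313),
    "SI": (0.0772, 0.0317, 0.0361), "SK": (0.0626, 0.0495, 0.0459),
    "UK": (0.1012, 0.0714, 0.0791),
}

LOW_IOP_CUTOFF = 0.05  # assumed threshold on the forest estimate; not taken from the paper


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean_gap(rows):
    """Average parametric-minus-forest difference over a list of rows."""
    return sum(parametric - forest for parametric, _tree, forest in rows) / len(rows)


rows = list(TABLE3.values())
tree_forest_corr = pearson([r[1] for r in rows], [r[2] for r in rows])

low = [r for r in rows if r[2] < LOW_IOP_CUTOFF]
high = [r for r in rows if r[2] >= LOW_IOP_CUTOFF]
gap_low = mean_gap(low)
gap_high = mean_gap(high)

print(f"correlation tree vs. forest:          {tree_forest_corr:.2f}")
print(f"mean parametric gap, low-IOp group:   {gap_low:.4f} ({len(low)} countries)")
print(f"mean parametric gap, other countries: {gap_high:.4f} ({len(high)} countries)")
```

Under the assumed cutoff, eight countries fall into the low group (DK, FI, IS, NL, NO, SE, SI and SK), and their average parametric-minus-forest gap is larger than that of the remaining countries.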

4.4   Opportunity Structure
Once an estimate of inequality of opportunity is at hand, adequate policy responses must also be
informed by the particular opportunity structure of a society. That is, policy makers want to


                           Table 3: Inequality of Opportunity Estimates
           Country   Parametric   Non-Parametric   Latent Class   Cond. Inf. Tree   Cond. Inf. Forest
           AT            0.0888           0.0751         0.0796            0.0865              0.0880
           BE            0.1108           0.0868         0.0534            0.0868              0.0912
           BG            0.1542           0.1356         0.1148            0.1362              0.1335
           CH            0.0917           0.0827         0.0631            0.0796              0.0901
           CY            0.0942           0.0831         0.0738            0.0799              0.0800
           CZ            0.0716           0.0659         0.0600            0.0569              0.0511
           DE            0.0704           0.0588         0.0467            0.0697              0.0793
           DK            0.0772           0.0409         0.0289            0.0212              0.0204
           EE            0.1108           0.1020         0.0744            0.0967              0.1005
           EL            0.1476           0.1208         0.1165            0.1264              0.1089
           ES            0.1421           0.1201         0.0893            0.1280              0.1200
           FI            0.0687           0.0515         0.0475            0.0197              0.0275
           FR            0.0858           0.0863         0.0717            0.0904              0.0980
           HR            0.1312           0.0884         0.0758            0.0822              0.0763
           HU            0.1098           0.1033         0.0951            0.1134              0.1079
           IE            0.1048           0.0971         0.0484            0.0843              0.0784
           IS            0.0669           0.0321         0.0297            0.0123              0.0157
           IT            0.1213           0.0907         0.0799            0.1078              0.0969
           LT            0.0947           0.0674         0.0587            0.0693              0.0672
           LU            0.1340           0.1209         0.0904            0.1326              0.1356
           LV            0.1335           0.1099         0.0951            0.1102              0.1110
           MT            0.0872           0.0796         0.0566            0.0710              0.0716
           NL            0.0661           0.0529         0.0411            0.0284              0.0194
           NO            0.0480           0.0405         0.0296            0.0202              0.0234
           PL            0.1111           0.0973         0.0953            0.1019              0.0991
           PT            0.1376           0.1236         0.1156            0.1362              0.1267
           RO            0.1698           0.1040         0.1194            0.1204              0.1107
           SE            0.1178           0.0604         0.0251            0.0247              0.0313
           SI            0.0772           0.0730         0.0588            0.0317              0.0361
           SK            0.0626           0.0507         0.0420            0.0495              0.0459
           UK            0.1012           0.0896         0.0622            0.0714              0.0791
          Note: Estimates of inequality of opportunity using five different estimation methods.
          Inequality of opportunity is measured as the Gini coefficient in the counterfactual
          distribution $Y^{EA}$.



learn which particular circumstance types differ from one another and thereby give rise
to inequality of opportunity. In this section we illustrate such analyses for both trees and
forests. To keep the analysis intelligible we restrict ourselves to two interesting cases: Sweden
and Germany.

Trees. As outlined in section 3.1, the analysis of opportunity structures is particularly intuitive
in the case of trees as the relevant information can be directly read oﬀ their graphical illustration.
    Figure 3 illustrates the opportunity structure of Sweden, which can be summarized by a tree
with two terminal nodes. Inequality of opportunity in Sweden is due to marked diﬀerences
between ﬁrst-generation immigrants born outside Europe and the collective group of native
residents and European immigrants. The former type accounts for about 10% of the population
and on average obtains an equivalent household income that is 35% lower than the corresponding
income of the latter group. Recall that each split is based on a statistical test for the existence
of equality of opportunity within the respective internal node. Thus, in Sweden we can reject
the null hypothesis of equal opportunities for ﬁrst-generation immigrants born outside Europe
and the remainder of the population. However, within these sub-groups the null hypothesis of
equality of opportunity cannot be rejected.
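
To see how a tree with two terminal nodes translates into an inequality of opportunity estimate, recall from the note to Table 3 that the measure is the Gini coefficient of a counterfactual distribution, which the sketch below takes to be one where every individual is assigned the predicted outcome of their type. The Python sketch uses the rounded figures given in the text (a population share of about 10% and an income that is 35% lower) and normalizes the income of the larger group to 1. The function name is illustrative, and because the inputs are rounded, the result is only indicative and need not match the tree estimate for Sweden in Table 3 (0.0247).

```python
def gini_of_types(type_means, type_shares):
    """Gini coefficient of a distribution in which every member of a type
    receives that type's mean outcome. Shares are normalized to sum to 1."""
    total = sum(type_shares)
    shares = [s / total for s in type_shares]
    mean = sum(s * m for s, m in zip(shares, type_means))
    # Expected absolute difference between two randomly drawn individuals.
    mean_abs_diff = sum(
        si * sj * abs(mi - mj)
        for si, mi in zip(shares, type_means)
        for sj, mj in zip(shares, type_means)
    )
    return mean_abs_diff / (2 * mean)


# Rounded inputs from the text: about 10% of the population with an income
# 35% below that of the remaining 90% (whose income is normalized to 1).
sweden_gini = gini_of_types(type_means=[0.65, 1.00], type_shares=[0.10, 0.90])
print(f"Gini of the two-type distribution: {sweden_gini:.4f}")
```

With these rounded inputs the sketch yields a Gini of roughly 0.033, the same order of magnitude as the Swedish tree estimate in Table 3. The remaining difference presumably reflects the rounding of the share and income gap reported in the text.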
    A diﬀerent picture arises when considering Germany. Parental occupation, parental educa-

                                                     16
                                                                                              Figure 2: Correlation of Estimates by Method
                                                     (a) Parametric Approach                                                                                                                                    (b) Non-Parametric Approach


                                                                                                                      RO
                                                                                                                      q




                                                                                                                                       BG
                                                                                                                                       q

                0.15                                                                                             EL                                                                      0.15
                                                                                                                  q
                                                                                                                          ES
                                                                                                                               q
                                                                                                                                   q                                                                                                                                                                     BG
                                                                                                                                                                                                                                                                                                     q
                                                                                     q
                                                                                                                      q             PT q
                                                                                 HR                              LV                            LU
                                                                                                        IT                                                                                                                                                                                          PT
                                                                                                        q
                                                                                                                                                                                                                                                                                     EL            q
                                                                                                                                                                                                                                                                                                         q
                                                                                                                                                                                                                                                                                      q        q
                                                     q
                                                                                                    PL                                                                                                                                                                                        ES         LU
                                                SE                                                                                                                                                                                                                                   LV




                                                                                                                                                            NON−PARAMETRIC
                                                                                                            q
                                                                                     IE
                                                                                               q             q    q   HU                                                                                                                                                                  q
                                                                                              BE            EE                                                                                                                                                               EE        RO
 PARAMETRIC




                                                                                         q
                                                                                                                                                                                                                                                                                      qq
                                                                                         q                                                                                                                                                                                       q
                0.10                                                                                                                                                                     0.10                                                         IE                         q PL
                                                                                                                                                                                                                                                                                      HU
                                                                             LT UK                                                                                                                                                                         q
                                                                             q  q    CH                                                                                                                                                                     UK IT q
                                                                               CY q q                                                                                                                                                                  qq
                                                                                 q
                                                                                  AT FRq                                                                                                                                                         HR          BE q q
                                                                            MT                                                                                                                                                                             q   q    FR
                                                              SI                                                                                                                                                                                  q
                                                                                                                                                                                                                                                           CY CH
                                        q                 q                                                                                                                                                                                 MT                      q
                                   DK                FI                q
                                                                                                                                                                                                                               q
                                                                                                                                                                                                                                                               AT
                                                 q
                                                                   CZ
                                                                                          q
                                                                                                                                                                                                                          SI                   LT
                              IS   q q                                            DE                                                                                                                                                    q
                                                                                                                                                                                                                                             q

                                        NL                         q
                                                                                                                                                                                                                          q
                                                                                                                                                                                                                                    CZ
                                                              SK                                                                                                                                                     SE
                                                                                                                                                                                                                                                           q

                                                                                                                                                                                                                q                   SK            DE
                0.05                        NO                                                                                                                                           0.05          NL
                                                                                                                                                                                                                      q             q
                                            q                                                                                                                                                                 FI
                                                                                                                                                                                                       DK q q
                                                                                                                                                                                                        IS NO
                                                                                                                                                                                                            q




                0.00                                                                                                                                                                     0.00

                       0.00                                        0.05              0.10                                                           0.15                                        0.00                                0.05              0.10                                                    0.15
                                                                           CONDITIONAL FOREST                                                                                                                                               CONDITIONAL FOREST


                                                     (c) Latent Class Analysis                                                                                                                                  (d) Conditional Inference Tree




                0.15                                                                                                                                                                     0.15
                                                                                                                                                                                                                                                                                                    BG
                                                                                                                                                                                                                                                                                         PT q        q
                                                                                                                                                                                                                                                                                                         q
                                                                                                                                                                                                                                                                                  EL ES q                LU
                                                                                                                                                            CONDITIONAL INFERENCE TREE




                                                                                                                                                                                                                                                                                    q
                                                                                                                      RO                                                                                                                                                              RO
                                                                                                                                                                                                                                                                                     q
                                                                                                                  q
                                                                                                                   q
                                                                                                                               PT                                                                                                                                                HU
                                                                                                                                   q   q
                                                                                                                 EL                    BG                                                                                                                               IT
                                                                                                                                                                                                                                                                                      q
                                                                                                                                                                                                                                                                                          q
                                                                                                                                                                                                                                                                             q
                                                                                                                                                                                                                                                                                          LV
 LATENT CLASS




                                                                                                                                                                                                                                                                        PL       q
                0.10                                                                                                                                                                     0.10                                                                                    EE
                                                                                                   PL q           q q     LV                                                                                                                                                   q

                                                                                                             HU                q           q                                                                                                     IE                          q FR

                                                                                                                               ES LU                                                                                                                   q q
                                                                                                                                                                                                                                              HR qq AT BE
                                                                                                   IT
                                                                                 HR AT q                q
                                                                                                                                                                                                                                            MT
                                                                                                                                                                                                                                                   q    q
                                                                                     q
                                                                                          q                 q    EE                                                                                                                            UK    CY   CH
                                                                                    CY FRq                                                                                                                                                    q
                                                                                                                                                                                                                                                  q        q
                                                                                                                                                                                                                                                           q

                                                                                                                                                                                                                                            LT                 DE
                                                                       q
                                                                           LT UK q CH  q
                                                          q                  q
                                                     SI            CZ      MT q
                                                                                                                                                                                                                                        q

                                                                               IE q
                                                                                        q                                                                                                                                           CZ
                0.05                             q
                                                                                         BE                                                                                              0.05                                       q

                                            FI                     SK             q
                                                                                                                                                                                                                                   SK
                                       q                           q                DE
                                   NL
                                             NO                                                                                                                                                                              SI
                              IS   q    q q
                                                                                                                                                                                                                NL         q
[Scatter-plot panels not reproduced. Horizontal axes: CONDITIONAL FOREST, 0.00 to 0.15.]



Note : Comparison of inequality of opportunity estimates based on random forests with estimates based on four
other methods. Along the solid line inequality of opportunity is the same for the two methods.


tion, migration status, the number of working adults in the household, and parental tenancy
status interact in creating a complex tree made of 14 splits and 15 terminal nodes. The null
hypothesis of equality of opportunity is most firmly rejected for individuals whose fathers work
in different occupations. If a respondent’s father worked in one of the higher ranked occupations
(I-01–I-05), the individual belongs to a more advantaged circumstance type than otherwise
(Terminal nodes 5-10). These types together account for 37.4% of the population and have an
average outcome of €26,380, far above the population average of €22,221. However, the advantage
of this circumstance characteristic is contingent on the educational status of the father.


                                     Figure 3: Opportunity Tree: Sweden




                        Note: Opportunity tree for Sweden. White rectangular boxes indicate
                        terminal nodes. The first number inside the rectangular boxes
                        indicates the share of the population belonging to this group, while
                        the second number indicates the predicted income.



If fathers of respondents had no or low education, the offspring earned less (€21,390) than the
country average even though their fathers made a career in a high-rank occupation. Conditional
on the father both being highly educated and working in a high-rank occupation, the
intra-household division of labor plays a strong role. On the one hand, those individuals coming
from single-earner households in which the mother stayed at home are the most advantaged
circumstance types in Germany in 2010, especially if their father worked as a manager or
professional (Terminal nodes 5 and 6). On the other hand, offspring of double-earner households tend
to be differentiated by their migration status. Comparing terminal nodes 8 and 9, we learn that
the advantage of coming from a highly educated double-earner household is substantially diminished,
from €25,718 to €22,808, if the respondent’s father was born outside Germany. A similar
distinction based on migration status can be observed on the right-hand side of the tree, in which
individuals were born to fathers with a lower occupational status (I-05–I-0). Individuals in this
group lived in above-average income households if both of their parents were fairly educated and
their father had no migration background (Terminal node 14). This advantage again largely
vanishes if the respondent’s father was born outside Europe (Terminal node 15). Overall,
when analyzing the right-hand side of the tree, it is clear that circumstances interact in a very
different way in determining individuals’ outcomes. In addition to parental education and the
migration status of individuals, the tenancy status during childhood as well as the number of
working adults in the household play an important role.
    There is marked heterogeneity in tree structures across countries. For the remaining countries
in our sample, terminal nodes range from three (Denmark, Iceland and Norway) to 27 (Italy).[12]
It is noteworthy that the rank-rank correlation between the number of terminal nodes and the
inequality of opportunity estimates presented in section 4.3 is positive but not perfect (Appendix
A.5).

Forests. Forests cannot be analyzed in the straightforward graphical manner of trees. However,
we can use variable importance measures to assess the impact of circumstance variables for the
  [12] Figures of the tree structures for the remaining countries are available upon request.


                                     Figure 4: Opportunity Tree: Germany

[Tree diagram not reproduced. The root split is on father's occupation (p < 0.001):
{Service, Clerical, Technician, Professional, Manager} versus {Not working, Elementary,
Plant operator, Craft/Trades, Agriculture}. Further splits use Father edu., Mother edu.,
Mother Superv., Father birth, Birth area, House own. (tenancy), and Working adults.]

     Terminal node        n      Share    Predicted income (y)
           5             725      5.7%         29574
           6            1097      8.6%         26350
           8            1952     15.4%         25718
           9             271      2.1%         22808
          10             708      5.6%         21390
          14             593      4.7%         26744
          15              48      0.4%         20041
          19            2869     22.6%         23265
          21              87      0.7%         22615
          22              92      0.7%         17379
          23             126      1%           18501
          25            2739     21.6%         21590
          26              95      0.7%         17176
          28             511      4%           19784
          29             770      6.1%         17442

     Note: Opportunity tree for Germany. White rectangular boxes indicate terminal nodes. The first number inside the rectangular boxes indicates the
     share of the population belonging to this group, while the second number indicates the predicted income. Occupation refers to ISCO-08 one digit codes.
     All variables describing household characteristics refer to the period in which the respondent was about 14 years old. See Table 1 for details.
construction of opportunity forests. One measure of variable importance, as proposed by Strobl
et al. (2007), is obtained by permuting input variable $C^p \in \check{\Omega}$ such that its dependence with
$y$ is lost. After this, the out-of-bag error rate, $\text{MSE}^{OOB}$, is re-computed. The increase of
$\text{MSE}^{OOB}$ in comparison to the baseline out-of-bag error indicates the importance of $C^p \in \check{\Omega}$ for
prediction accuracy. Repeating this procedure for all $C^p \in \check{\Omega}$ affords a relative comparison of
all circumstances.
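
The permutation logic can be illustrated with a minimal Python sketch. It is not the conditional-forest procedure itself: the out-of-bag samples of a forest are replaced by a single evaluation sample, and the fitted forest by an arbitrary prediction function. The names `permutation_importance`, `predict`, `father_occ`, `tenancy` and `birth_month`, as well as the toy data, are hypothetical and chosen only for illustration.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y, seed=0):
    """Increase in MSE after permuting each circumstance, scaled so the largest equals one."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(r) for r in rows])
    increase = {}
    for name in rows[0]:
        values = [r[name] for r in rows]
        rng.shuffle(values)  # break the dependence between this circumstance and y
        permuted = [{**r, name: v} for r, v in zip(rows, values)]
        increase[name] = mse(y, [predict(r) for r in permuted]) - baseline
    top = max(increase.values())
    if top <= 0:  # no circumstance matters; nothing to standardize
        return increase
    return {name: inc / top for name, inc in increase.items()}


# Hypothetical toy data: the outcome depends strongly on father_occ,
# weakly on tenancy, and not at all on birth_month.
rows = [
    {"father_occ": i % 2, "tenancy": (i // 2) % 2, "birth_month": i % 12}
    for i in range(40)
]


def predict(r):
    """Stand-in for a fitted model (assumption: not a conditional inference forest)."""
    return 20000 + 6000 * r["father_occ"] + 1000 * r["tenancy"]


y = [predict(r) for r in rows]
importance = permutation_importance(predict, rows, y)
```

In this toy setting the standardized importance of `father_occ` equals one, `tenancy` lies strictly between zero and one, and `birth_month` is zero, because permuting a variable the predictor ignores leaves the error unchanged.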
    Figure 5 shows the results from this procedure for our example cases of Germany and Sweden.
Each black dot is the importance of one of the $\check{P}$ variables in the set of observed circumstances
$\check{\Omega}$. We standardize the ensuing results such that the variable importance measure for the
circumstance with the greatest impact in each country equals one. For the case of Sweden, birth area is
the only circumstance that has a meaningful predictive value. In Germany, father’s occupation
and father’s education are most important, followed by the number of working adults in the
household and mother’s education.
                                       Figure 5: Variable Importance for Germany and Sweden

         [Dot plot not reproduced. Vertical axis: Variable Importance (0.00 to 1.00); panels: Germany
         and Sweden. Labeled points: Father's Occupation, Father's Education, # of Working Adults,
         Mother's Education, and Tenancy (Germany); Birth Area (Sweden).]

         Note: Each dot shows the importance of a particular circumstance for the predictions
         from our random forest. The importance of a circumstance is measured by permuting the
         circumstance, calculating a new $\text{MSE}^{OOB}$, and computing the difference in the $\text{MSE}^{OOB}$
         between the original model and the model with the permuted circumstance. The importance
         measure is standardized such that the circumstance with the greatest importance
         in each country equals one. Occupation refers to ISCO-08 one digit codes. All variables
         describing household characteristics refer to the period in which the respondent was about
         14 years old. See Table 1 for details.



    It is reassuring that these findings are in line with the graphical analysis of opportunity trees.
In Figure A.3 of Appendix A.4 we show variable importance plots for all countries in our sample.
Broadly, we can divide our country sample into three groups according to the circumstances that
determine their opportunity structure. First, there is a handful of primarily Nordic countries
where the respondent’s birth area is the most important circumstance. Second, there is a large
group of primarily Western and Southern European countries for which father’s occupation and
father’s education are most important. Third, there is a group of Eastern European countries
for which mother’s education and occupation are most important.




4.5    Out-of-Sample Performance
Recall that current approaches towards estimating inequality of opportunity are subject to
different biases. Models are downward biased to the extent that the full set of circumstances
$\Omega$ is unobserved. Models are upward biased to the extent that they over-utilize the set of
observed circumstances $\check{\Omega}$, leading to overfitted estimates that do not replicate out-of-sample
(see Appendix A.1 for the formal argument).
    In order to assess how well different estimation approaches trade off these biases, we follow the
machine-learning practice of splitting our sample into a training set with $i_{train} \in \{1, ..., N_{train}\}$
and a test set with $i_{test} \in \{1, ..., N_{test}\}$. For each country in our sample, $N_{train} = \frac{2}{3} N$ while
$N_{test} = \frac{1}{3} N$. We fit our models on the training set and compare their performance on the test
set according to the following procedure:

    1. Run the chosen models on the training data (for the specific estimation procedures, see
       section 3.1 for trees, section 3.2 for forests, and section 4.2 for our benchmark methods).

    2. Store the prediction functions $\hat{f}_{train}(\check{\Omega})$.

    3. Predict the outcomes of observations in the test set: $\hat{y}_{i_{test}} = \hat{f}_{train}(\check{\Omega}_{i_{test}})$.

    4. Calculate the out-of-sample error: $\text{MSE}^{test} = \frac{1}{N_{test}} \sum_{i_{test}} [y_{i_{test}} - \hat{y}_{i_{test}}]^2$.
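
The four steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the estimation code used for the results: the model is a simple type-mean predictor standing in for trees, forests and the benchmark methods, and the function names (`split_train_test`, `fit_type_means`, `fit_overall_mean`, `mse_test`) and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def split_train_test(data, seed=0):
    """Randomly assign two thirds of the observations to training, one third to test."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_train = (2 * len(shuffled)) // 3
    return shuffled[:n_train], shuffled[n_train:]


def fit_type_means(train):
    """Steps 1-2: fit on the training data and return the prediction function.

    Stand-in model: the mean outcome of each circumstance type, with the
    overall training mean as fallback for types unseen in training.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for circ, y in train:
        totals[circ] += y
        counts[circ] += 1
    overall = sum(y for _, y in train) / len(train)
    means = {c: totals[c] / counts[c] for c in totals}
    return lambda circ: means.get(circ, overall)


def fit_overall_mean(train):
    """A model that ignores circumstances entirely."""
    overall = sum(y for _, y in train) / len(train)
    return lambda circ: overall


def mse_test(predict, test):
    """Steps 3-4: predict test outcomes and average the squared prediction errors."""
    return sum((y - predict(circ)) ** 2 for circ, y in test) / len(test)


# Hypothetical toy data: (circumstance type, outcome) pairs.
data = [
    ("high" if i % 2 else "low", (26000 if i % 2 else 20000) + 100 * (i % 5 - 2))
    for i in range(90)
]
train, test = split_train_test(data)
mse_types = mse_test(fit_type_means(train), test)
mse_mean = mse_test(fit_overall_mean(train), test)
```

With 90 toy observations the split yields 60 training and 30 test observations, and the model that ignores circumstances has the larger test error, which corresponds to the case of underutilized circumstance information.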

    Figure 6 compares the resulting $\text{MSE}^{test}$ of the different models. For each country, the
$\text{MSE}^{test}$ of random forests is standardized to equal 1, such that an $\text{MSE}^{test}$ larger than 1
represents a worse fit out-of-sample. That implies that the respective method performs worse than
forests in trading off upward and downward biases, either by neglecting the use of circumstances
or overfitting. We derive 95% confidence intervals based on 200 bootstrapped re-samples of the
test data using the normal approximation method.
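
One way to read this standardization and interval construction is sketched below in Python. The exact bootstrap design is not spelled out in the text, so the sketch assumes that test observations are resampled jointly for both models and that the interval is the point estimate plus or minus 1.96 bootstrap standard errors. The function names and the per-observation squared errors are hypothetical.

```python
import random
import statistics


def relative_mse(sq_err_model, sq_err_forest):
    """Test MSE of a model divided by the test MSE of the forest (same test set)."""
    return sum(sq_err_model) / sum(sq_err_forest)


def bootstrap_ci(sq_err_model, sq_err_forest, n_boot=200, seed=0, z=1.96):
    """Normal-approximation interval: point estimate +/- z * bootstrap standard error."""
    rng = random.Random(seed)
    n = len(sq_err_model)
    point = relative_mse(sq_err_model, sq_err_forest)
    draws = []
    for _ in range(n_boot):
        # Resample test observations with replacement, keeping model/forest pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        draws.append(
            relative_mse([sq_err_model[i] for i in idx], [sq_err_forest[i] for i in idx])
        )
    se = statistics.stdev(draws)
    return point, point - z * se, point + z * se


# Hypothetical per-observation squared prediction errors on a test set of 60.
sq_err_forest = [1.0 + 0.5 * (i % 4) for i in range(60)]
sq_err_model = [1.1 * e + 0.1 * (i % 3) for i, e in enumerate(sq_err_forest)]
point, lower, upper = bootstrap_ci(sq_err_model, sq_err_forest)
```

A ratio above 1 whose lower bound also exceeds 1 would indicate a model that fits significantly worse out-of-sample than the forest under these assumptions.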
    As expected, random forests outperform all other methods in nearly all cases. On average,
the parametric approach gives a fit 9.4% worse than forests. With average shortfalls of around
3%, out-of-sample prediction errors are less pronounced for non-parametric models and latent
class analysis. Yet both methods perform worse than conditional inference forests for the vast
majority of countries in our sample. Hence, relative to random forests, our benchmark methods
either underutilize or overutilize the information contained in $\check{\Omega}$ and are therefore biased in their
inequality of opportunity estimates. The estimates presented in section 4.3 suggest that the
parametric and the non-parametric partitions are overfitting the data, while the type partition
delivered by latent class analysis is too coarse.
    On average, conditional inference trees are closest to the test error rate of forests. Yet they
also fall short of the performance of forests due to their poorer utilization of the information
given in $\check{\Omega}$.


5     Conclusion
In this paper we have proposed two novel approaches towards estimating inequality of opportunity
based on regression trees. Both conditional inference trees and forests minimize arbitrary
model selection by the researcher, while trading off downward and upward biases in inequality of
opportunity estimates. On the one hand, conditional inference forests outperform all methods
considered in this paper in terms of their out-of-sample performance. Hence, they deliver the
best estimates of inequality of opportunity. On the other hand, conditional inference trees are
econometrically less complex and provide a handy graphical illustration that can be used for the
straightforward analysis of opportunity structures. The fact that trees are very close to forests
in terms of their inequality of opportunity estimates (section 4.3), the importance they assign to

                                     Figure 6: Comparison of Models’ Test Error
                       (a) Parametric Approach                    (b) Non-Parametric Approach

[Plots not reproduced. Vertical axes: Parametric MSEtest / Random Forest MSEtest (panel a) and
Non-Parametric MSEtest / Random Forest MSEtest (panel b), with ticks from 1.0 to 1.2. Horizontal
axes: countries by two-letter code, followed by the cross-country average (AVG).]
                                                                                                          Country                                                                                                                                                                                 Country


                                                                          (c) Latent Class Analysis                                                                                                                                                           (d) Conditional Inference Tree
 Latent Class MSEtest / Random Forest MSEtest




                                                                                                                                                                                       Ctree MSEtest / Random Forest MSEtest




                                                1.2                                                                                                                                                                                     1.2




                                                1.1                                                                                                                                                                                     1.1
                                                                                                                                                                          q
                                                                                                                                                                                                                                                                                                                                                                  q
                                                                                                                                                                  q   q
                                                                                                                                                      q   q   q

                                                                                                                                              q   q                                                                                                                                                                                                           q
                                                                                                                                      q   q
                                                                                                                                                                                                                                                                                                                                                      q   q
                                                                                                                                  q
                                                                                                                          q   q                                               q                                                                                                                                                                   q
                                                                                                              q   q   q                                                                                                                                                                                                           q   q   q   q
                                                                                                  q   q   q                                                                                                                                                                                                       q   q   q   q                                       q
                                                                                      q   q   q                                                                                                                                                                                                           q   q
                                                                                  q                                                                                                                                                                                                                   q
                                                                                                                                                                                                                                                                              q   q   q   q   q   q
                                                                              q                                                                                                                                                                                       q   q
                                                                          q                                                                                                                                                                           q   q   q   q
                                                                      q                                                                                                                                                                           q
                                                              q   q                                                                                                                                                                           q
                                                1.0   q   q
                                                                                                                                                                                                                                        1.0


                                                      CZ NL LT HR SI RO LV FI DE FR PT HU ES EL CY AVG                                                                                                                                        HR IS BG ES CZ IT CY RO AT SI MT FR EL PT BE AVG
                                                        SK UK PL IS NO BG DK MT CH IT LU EE IE AT BE                                                                                                                                            SK DK PL HU FI LT NL NO LV EE LU DE UK CH IE
                                                                                                          Country                                                                                                                                                                                 Country


Notes: The figure compares the test error of the different models. The test error of random forests is standardized
to 1, such that a test error larger than 1 represents a worse fit than random forests. 95% confidence intervals are
derived based on 200 bootstrapped re-samples of the test data using the normal approximation method. Sweden
is excluded from the figure since it is an outlier. The test errors for Sweden are 1.43 [1.21, 1.66] for the parametric
approach, 1.11 [1.01, 1.21] for the non-parametric approach, 1.06 [1.02, 1.11] for latent class analysis, and 1.06
[1.01, 1.11] for conditional inference trees.
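
The interval construction described in the notes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind the figure: the function names, the synthetic data, and the choice to recompute the full ratio on every re-sample are hypothetical, since the notes only state that 200 bootstrap re-samples of the test data and the normal approximation are used.

```python
import random
import statistics


def mse(y, pred):
    """Mean squared error between outcomes and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)


def relative_test_error_ci(y, pred_model, pred_forest, n_boot=200, seed=0):
    """Relative test error with a normal-approximation bootstrap interval.

    Assumption: the ratio itself is recomputed on every bootstrap re-sample.
    """
    n = len(y)
    point = mse(y, pred_model) / mse(y, pred_forest)
    rng = random.Random(seed)
    boot = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # draw test rows with replacement
        y_b = [y[i] for i in idx]
        model_b = [pred_model[i] for i in idx]
        forest_b = [pred_forest[i] for i in idx]
        boot.append(mse(y_b, model_b) / mse(y_b, forest_b))
    se = statistics.stdev(boot)                 # bootstrap standard error
    z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval
    return point, point - z * se, point + z * se


# Hypothetical test data: noisy stand-ins for forest and tree predictions.
data_rng = random.Random(1)
y_test = [data_rng.gauss(0.0, 1.0) for _ in range(500)]
pred_forest = [v + data_rng.gauss(0.0, 0.5) for v in y_test]
pred_tree = [v + data_rng.gauss(0.0, 0.6) for v in y_test]

ratio, lower, upper = relative_test_error_ci(y_test, pred_tree, pred_forest)
print(f"{ratio:.3f} [{lower:.3f}, {upper:.3f}]")
```

A ratio above 1 whose interval excludes 1 would indicate a test error significantly worse than that of the random forest under this approximation.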



specific circumstances (4.4) and their out-of-sample performance (4.5) makes us confident that
they are a useful tool for communicating issues related to inequality of opportunity to a larger
audience.
   To be sure, the development of machine learning algorithms and their integration into the
analytical toolkit of economists is a highly dynamic process. We are well aware that finding the
best machine learning algorithm for inequality of opportunity estimations is a methodological
horse race with frequent entry of new competitors that eventually will lead to some method
outperforming the ones we proposed in this work. Therefore, the main contribution of this work
should be understood as paving the way for new methods that are able to handle the intricacies
of model selection for inequality of opportunity estimations. While we restricted ourselves to
ex-ante utilitarian measures of inequality of opportunity, the exploration of these algorithms for
other methods in the inequality of opportunity literature, such as ex-post measures à la Pistolesi
(2009) or ex-ante and ex-post tests à la Lefranc et al. (2009), provides an interesting avenue for
future research.




References
Almås, I., Cappelen, A. W., Lind, J. T., Sørensen, E. Ø., and Tungodden, B. (2011). Measuring
  unfair (in)equality. Journal of Public Economics, 95(7–8):488–499.
Altshuler, D., Durbin, R. M., Donnelly, P., Green, E. D., Nickerson, D. A., Boerwinkle, E.,
  and Doddapaneni, H. (2015). A global reference for human genetic variation. Nature,
  526(7571):68–74.
Athey, S. (2017). The Impact of Machine Learning on Economics. mimeo.
Black, S. E. and Devereux, P. J. (2011). Recent Developments in Intergenerational Mobility. In
  Card, D. and Ashenfelter, O., editors, Handbook of Labor Economics, volume 4, chapter 16,
  pages 1487–1541. Elsevier, Amsterdam.
Bourguignon, F., Ferreira, F. H. G., and Menéndez, M. (2007). Inequality of Opportunity in
  Brazil. Review of Income and Wealth, 53(4):585–618.
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
Breiman, L., Friedman, J., Stone, C., and Olshen, R. (1984). Classification and Regression
  Trees. Taylor & Francis, Belmont.
Brunori, P., Peragine, V., and Serlenga, L. (2016). Upward and downward bias when measuring
  inequality of opportunity. ECINEQ Working Paper Series, 2016-406.
Checchi, D. and Peragine, V. (2010). Inequality of opportunity in Italy. The Journal of Economic
  Inequality, 8(4):429–450.
Checchi, D., Peragine, V., and Serlenga, L. (2016). Inequality of Opportunity in Europe: Is
  There a Role for Institutions? In Cappellari, L., Polachek, S., and Tatsiramos, K., editors,
  Inequality: Causes and Consequences, volume 43 of Research in Labor Economics, pages 1–44.
  Emerald, Bingley.
Cowell, F. A. (2016). Inequality and Poverty Measures. In Adler, M. D. and Fleurbaey, M., editors,
  Oxford Handbook of Well-Being and Public Policy, chapter 4, pages 82–125. Oxford University
  Press, Oxford.
Cowell, F. A. and Victoria-Feser, M.-P. (1996). Robustness Properties of Inequality Measures.
  Econometrica, 64(1):77–101.
Ferreira, F. H. G. and Gignoux, J. (2011). The Measurement of Inequality of Opportunity:
  Theory and an Application to Latin America. Review of Income and Wealth, 57(4):622–657.
Fleurbaey, M. (1995). Three solutions for the compensation problem. Journal of Economic
  Theory, 65(2):505–521.
Fleurbaey, M. (2008). Fairness, Responsibility, and Welfare. Oxford University Press, Oxford.
Fleurbaey, M. and Peragine, V. (2013). Ex Ante Versus Ex Post Equality of Opportunity.
  Economica, 80(317):118–130.
Friedman, J., Hastie, T., and Tibshirani, R. (2009). The elements of statistical learning. Springer,
  New York.
García, J. L., Heckman, J. J., and Ziff, A. L. (2017). Gender Differences in the Benefits of an
  Influential Early Childhood Program. European Economic Review, Forthcoming.
Hothorn, T., Hornik, K., and Zeileis, A. (2006). Unbiased Recursive Partitioning: A Conditional
  Inference Framework. Journal of Computational and Graphical Statistics, 15(3):651–674.
Hufe, P., Peichl, A., Roemer, J. E., and Ungerer, M. (2017). Inequality of income acquisition:
  the role of childhood circumstances. Social Choice and Welfare, Forthcoming.
Kanbur, R. and Snell, A. (2017). Inequality Measures as Tests of Fairness. mimeo.
Lanza, S. T., Tan, X., and Bray, B. C. (2013). Latent class analysis with distal outcomes: A
  flexible model-based approach. Structural Equation Modeling, 20(1):1–26.
Lefranc, A., Pistolesi, N., and Trannoy, A. (2009). Equality of opportunity and luck: Definitions
  and testable conditions, with an application to income in France. Journal of Public Economics,
  93(11–12):1189–1207.
Li Donni, P., Rodríguez, J. G., and Rosa Dias, P. (2015). Empirical definition of social types
  in the analysis of inequality of opportunity: A latent classes approach. Social Choice and
  Welfare, 44(3):673–701.
Marrero, G. A. and Rodríguez, J. G. (2012). Inequality of Opportunity in Europe. Review of
  Income and Wealth, 58(4):597–621.
Morgan, J. N. and Sonquist, J. A. (1963). Problems in the Analysis of Survey Data, and a
  Proposal. Journal of the American Statistical Association, 58(302):415–434.
Mullainathan, S. and Spiess, J. (2017). Machine Learning: An Applied Econometric Approach.
  Journal of Economic Perspectives, 31(2):87–106.
Oppedisano, V. and Turati, G. (2015). What are the causes of educational inequality and of its
  evolution over time in Europe? Evidence from PISA. Education Economics, 23(1):3–24.
Palomino, J. C., Marrero, G. A., and Rodríguez, J. G. (2016). Channels of inequality of
  opportunity: The role of education and occupation in Europe. ECINEQ Working Paper Series,
  2016-411.
Pistolesi, N. (2009). Inequality of opportunity in the land of opportunities, 1968–2001. The
  Journal of Economic Inequality, 7(4):411–433.
Roemer, J. E. (1998). Equality of Opportunity. Harvard University Press, Cambridge.
Roemer, J. E. and Trannoy, A. (2015). Equality of Opportunity. In Atkinson, A. B. and
  Bourguignon, F., editors, Handbook of Income Distribution, volume 2, chapter 4, pages 217–
  300. Elsevier, Amsterdam.
Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable
  importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1):25.
Trannoy, A., Tubeuf, S., Jusot, F., and Devaux, M. (2010). Inequality of opportunities in health
  in France: A first pass. Health Economics, 19(8):921–938.
Van de gaer, D. (1993). Equality of Opportunity and Investment in Human Capital. PhD thesis,
  University of Leuven.
Van de gaer, D. and Ramos, X. (2016). Empirical Approaches to Inequality of Opportunity:
  Principles, Measures, and Evidence. Journal of Economic Surveys, 30(5):855–883.
Varian, H. R. (2014). Big Data: New Tricks for Econometrics. Journal of Economic Perspectives,
  28(2):3–27.




A        Appendix
A.1      Model Evaluation by the MSE
We use the MSE as a model evaluation criterion when cross-validating α in the case of trees
(Section 3.1) and when determining the values of α and $\bar{P}$ using the out-of-bag error rate in
the case of forests (Section 3.2). Analogously, when comparing the predictive performance of
different estimation approaches in the test sample $N^{test}$, we prefer the estimation approach that
yields a lower MSE (Section 4.5). The following discussion applies to all of these applications.
For the sake of conciseness, superscript h always indicates observations in the hold-out sample
regardless of the specific application.
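    The general MSE evaluation criterion can be written as follows:

$$\frac{1}{N^h} \sum_{i \in h} \left( y_i^h - \hat{y}_i \right)^2. \tag{9}$$

In the case where observed circumstances $\check{\Omega}$ are the sole input variables, individual predictions
$\hat{y}_i$ are given by the mean outcomes of the type to which individuals are allocated and we can
write:

$$\frac{1}{N^h} \sum_{i \in h} \left( y_i^h - \mu_m \right)^2, \tag{10}$$

where $\mu_m = \frac{1}{N_m} \sum_{i \in t_m} y_i$ is the training-sample mean of type $m$, $N_m$ is the number of
training observations in that type, and $t_m$ denotes a specific type in the model we want to evaluate. It is
instructive to rewrite the MSE as a weighted average over types as follows:

$$\sum_m \frac{N_m^h}{N^h} \sum_{i \in t_m} \frac{1}{N_m^h} \left( y_i^h - \mu_m \right)^2. \tag{11}$$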

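We can expand the previous expression and spell out the binomial formula:

$$\sum_m \frac{N_m^h}{N^h} \sum_{i \in t_m} \frac{1}{N_m^h} \left[ (y_i^h - \mu_m^h) + (\mu_m^h - \mu_m) \right]^2 \tag{12}$$

$$= \sum_m \frac{N_m^h}{N^h} \sum_{i \in t_m} \frac{1}{N_m^h} \left[ (y_i^h - \mu_m^h)^2 + (\mu_m^h - \mu_m)^2 \right] + 2 \sum_m \frac{N_m^h}{N^h} (\mu_m^h - \mu_m) \sum_{i \in t_m} \frac{1}{N_m^h} (y_i^h - \mu_m^h). \tag{13}$$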

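Evidently, $\sum_{i \in t_m} \frac{1}{N_m^h} (y_i^h - \mu_m^h) = 0$, because $\mu_m^h$ is the mean outcome of type $m$ in the
hold-out sample, and the formula simplifies to:

$$\sum_m \frac{N_m^h}{N^h} \left[ \underbrace{\frac{1}{N_m^h} \sum_{i \in t_m} (y_i^h - \mu_m^h)^2}_{(1)} + \underbrace{(\mu_m^h - \mu_m)^2}_{(2)} \right], \tag{14}$$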

where (1) is the intra-type variance of outcomes in the hold-out sample and (2) is the variance of
type-means between the hold-out sample and the training sample. Recall that we prefer models
that minimize formula (14). For the sake of exposition, let’s generalize the previous expression
by introducing the weighting parameter δ ∈ [0, 1]. Note that the standard MSE yields equivalent
rankings to the special case in which δ = 0.5, i.e. the case in which we give equal weight to both
(1) and (2):

$$\sum_m \frac{N_m^h}{N^h} \left[ \underbrace{\delta \frac{1}{N_m^h} \sum_{i \in t_m} (y_i^h - \mu_m^h)^2}_{(1)} + \underbrace{(1 - \delta) (\mu_m^h - \mu_m)^2}_{(2)} \right]. \tag{15}$$
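
The decomposition in formula (14) and the weighted criterion in formula (15) can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the function names and the toy data are hypothetical, and every type observed in the hold-out sample is assumed to appear in the training sample as well.

```python
from collections import defaultdict


def type_means(y, types):
    """Mean outcome per type."""
    sums, counts = defaultdict(float), defaultdict(int)
    for value, t in zip(y, types):
        sums[t] += value
        counts[t] += 1
    return {t: sums[t] / counts[t] for t in sums}


def mse_components(y_train, t_train, y_hold, t_hold):
    """Return (MSE, component (1), component (2)) as in formula (14)."""
    mu = type_means(y_train, t_train)   # training type means, mu_m
    mu_h = type_means(y_hold, t_hold)   # hold-out type means, mu_m^h
    n_h = len(y_hold)
    # Hold-out MSE of the training type means, formula (10).
    total = sum((v - mu[t]) ** 2 for v, t in zip(y_hold, t_hold)) / n_h
    # (1): intra-type variance in the hold-out sample.
    within = sum((v - mu_h[t]) ** 2 for v, t in zip(y_hold, t_hold)) / n_h
    # (2): weighted squared gap between hold-out and training type means.
    counts = defaultdict(int)
    for t in t_hold:
        counts[t] += 1
    between = sum(counts[t] / n_h * (mu_h[t] - mu[t]) ** 2 for t in counts)
    return total, within, between


def weighted_criterion(within, between, delta):
    """Formula (15): delta weights (1), 1 - delta weights (2)."""
    return delta * within + (1 - delta) * between


# Toy data with two types, "a" and "b" (hypothetical values).
y_train, t_train = [1.0, 3.0, 10.0, 14.0], ["a", "a", "b", "b"]
y_hold, t_hold = [2.0, 4.0, 9.0, 11.0], ["a", "a", "b", "b"]

total, within, between = mse_components(y_train, t_train, y_hold, t_hold)
print(total, within, between)                     # 3.5 1.0 2.5
print(weighted_criterion(within, between, 0.5))   # 1.75, i.e. half the MSE
```

With δ = 0.5 the criterion equals one half of the MSE for every model, which is why both produce the same ranking.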



Now assume two extreme cases:

  1. δ = 1: In this case we give full priority to (1), i.e. we would always prefer a model
     that decreased the intra-type variance in the hold-out sample the most. Naturally, one
     reduces intra-type variance by increasing the granularity of the type partition. Hence, we
     would always prefer the model that used more circumstances and interactions. Thus, (1)
     addresses the downward bias of inequality of opportunity estimates as induced by using only
     a subset $\hat{\Omega}$ of the full set of observed circumstances $\check{\Omega}$.

  2. δ = 0: In this case we give full priority to (2), i.e. we would always prefer a model that
     decreased the variance between type means in the hold-out sample and the type means
     in the training sample. Invoking the law of large numbers it is evident that the ideal
     model from this perspective is the model with no partition at all, i.e. the model in which
     individual predictions µm are given by the sample mean µ. Thus, (2) addresses the upward
     bias identified by Brunori et al. (2016) that originates from overfitting the model to the
     training data.

To conclude, the more weight we put on (1), the less the downward bias in our estimation, since
we allow circumstances to have explanatory scope for observed outcomes. Intuitively, if we set
δ = 0, our estimates would be deeply downward biased because we would effectively say that
inequality of opportunity was non-existent. The more weight we put on (2), the more accurate
our estimates of type means, i.e. the less the out-of-sample variance in our estimates of the type
means. Intuitively, with δ = 1 we would say that we did not care about the precision of our
estimates at all, which is the standard practice in today’s inequality of opportunity estimations.
This induces overfitting and an upward bias in inequality of opportunity estimates. Hence, by
giving equal weight to both components, the MSE balances upward and downward biases in
inequality of opportunity estimations and thus is a sensible criterion for model evaluation in this
context.
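
The trade-off between the two components can be illustrated with a small simulation. The Python sketch below is a hypothetical setup, not an analysis from this paper: outcomes depend on one binary circumstance, and three nested partitions (no partition, the 2 true types, 50 arbitrary cells) are evaluated on a hold-out sample of the same size as the training sample. For nested partitions, component (1) can only fall as the partition becomes finer; component (2) would typically rise in such a draw, but that is not guaranteed.

```python
import random
from collections import defaultdict


def components(y_train, t_train, y_hold, t_hold):
    """Components (1) and (2) of formula (14) for one type partition."""
    def means(y, types):
        sums, counts = defaultdict(float), defaultdict(int)
        for value, t in zip(y, types):
            sums[t] += value
            counts[t] += 1
        return {t: sums[t] / counts[t] for t in sums}, counts

    mu, _ = means(y_train, t_train)       # training type means
    mu_h, n_m = means(y_hold, t_hold)     # hold-out type means and sizes
    n_h = len(y_hold)
    within = sum((v - mu_h[t]) ** 2 for v, t in zip(y_hold, t_hold)) / n_h
    between = sum(n_m[t] / n_h * (mu_h[t] - mu[t]) ** 2 for t in n_m)
    return within, between


rng = random.Random(42)
n = 1000


def draw():
    # Hypothetical data: a binary circumstance (i % 2) shifts the outcome by 10.
    return [10.0 * (i % 2) + rng.gauss(0.0, 3.0) for i in range(n)]


y_train, y_hold = draw(), draw()

# Nested partitions: i % 50 refines i % 2 because 50 is even.
partitions = {
    "no partition": [0] * n,
    "2 types": [i % 2 for i in range(n)],
    "50 types": [i % 50 for i in range(n)],
}

results = {}
for name, types in partitions.items():
    within, between = components(y_train, types, y_hold, types)
    results[name] = (within, between)
    print(f"{name:>12}: (1) = {within:.3f}, (2) = {between:.3f}, "
          f"MSE = {within + between:.3f}")
```

A criterion with δ = 1 would always select the 50-cell partition in this setup, while δ = 0 rewards whichever partition happens to have the most stable type means; the unweighted MSE sits between the two.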




A.2     Descriptive Statistics


                    Table A.1: Descriptive Statistics (Individual and Household)
                                Sex         Birth area       Presence of parents           Household members          Tenancy
 Country   Eq. income   Male      Female   Native     EU     Both        One      Adults    Working adults Children    Owner
   AT        25,451     0.499      0.501   0.790     0.070   0.856      0.017      2.73          1.76        2.60      0.585
   BE        23,291     0.502      0.498   0.824     0.076   0.855      0.019      2.38          1.59        2.78      0.750
   BG         3,714     0.500      0.500   0.994     0.001   0.904      0.012      2.44          2.01        2.07      0.910
   CH        42,208     0.495      0.505   0.684     0.197   0.837      0.017      2.55          1.90        2.53      0.546
   CY        21,058     0.475      0.525   0.787     0.096   0.900      0.015      2.64          1.67        2.70      0.784
   CZ         9,006     0.492      0.508   0.964     0.026   0.851      0.013      2.09          1.92        2.24      0.597
   DE        22,221     0.504      0.496   0.868     0.000   0.830      0.020      2.24          1.68        2.32      0.499
   DK        32,027     0.495      0.505   0.923     0.026   0.809      0.027      2.22          2.31        2.24      0.736
   EE         6,922     0.475      0.525   0.847     0.000   0.756      0.011      2.10          1.80        2.09      0.859
   EL        13,184     0.502      0.498   0.890     0.025   0.931      0.019      2.31          1.56        2.33      0.834
   ES        17,088     0.505      0.495   0.834     0.051   0.893      0.012      2.88          2.11        2.43      0.819
   FI        27,517     0.501      0.499   0.954     0.018   0.829      0.016      2.36          1.75        2.30      0.772
   FR        24,299     0.491      0.509   0.885     0.036   0.820      0.022      2.47          1.66        1.75      0.630
   HR         6,627     0.499      0.501   0.875     0.017   0.874      0.020      2.56          1.35        2.31      0.902
   HU         5,327     0.483      0.517   0.988     0.008   0.844      0.041      2.14          1.75        2.27      0.830
   IE        24,867     0.476      0.524   0.783     0.149   0.893      0.078      3.17          3.20        3.20      0.727
   IS        22,190     0.493      0.507   0.920     0.042   0.899      0.012      2.42          1.90        2.63      0.893
   IT        18,786     0.498      0.502   0.880     0.040   0.901      0.011      2.59          1.62        2.41      0.685
   LT         4,774     0.479      0.521   0.939     0.004   0.846      0.016      2.32          2.02        2.46      0.698
   LU        37,911     0.501      0.499   0.480     0.401   0.868      0.020      2.53          1.64        2.71      0.734
   LV         5,334     0.480      0.520   0.865     0.000   0.763      0.012      1.97          1.76        2.28      0.455
   MT        13,006     0.503      0.497   0.944     0.000   0.932      0.020      3.02          1.84        2.68      0.576
   NL        25,210     0.491      0.509   0.903     0.020   0.882      0.016      2.10          1.54        3.25      0.575
   NO        43,260     0.489      0.511   0.907     0.041   0.913      0.014      2.02          1.76        1.87      0.922
   PL         6,103     0.496      0.504   0.999     0.000   0.889      0.015      2.70          1.96        2.44      0.644
   PT        10,781     0.494      0.506   0.906     0.022   0.854      0.017      2.68          2.23        2.68      0.544
   RO         2,562     0.494      0.506   0.999     0.000   0.919      0.009      2.77          1.90        2.27      0.861
   SE        26,346     0.507      0.493   0.846     0.050   0.820      0.035      2.07          1.78        2.35      0.757
   SI        13,772     0.504      0.496   0.876     0.000   0.855      0.019      2.53          1.77        2.20      0.746
   SK         7,304     0.481      0.519   0.987     0.010   0.920      0.010      2.52          2.08        2.34      0.694
   UK        25,936     0.493      0.507   0.848     0.042   0.825      0.024      2.34          2.24        2.41      0.649

Note: Omitted categories are “Outside Europe” for birth area, “None/Collective house” for the presence of
parents, and “Not owned” for the tenancy variable.




                                                                     Table A.2: Descriptive Statistics (Fathers)
                 Birth area        Citizenship        Education                            Activity                                          Main occupation (ISCO-08, 1-digit)            Superv.
      Country   Native     EU     Resid.   EU    Prim. Sec. Tert.    Empl.   Self-empl.   Unempl.     Retired   House work     1       2        3      4       5&0      6      7     8       Yes
        AT      0.743     0.093   0.777 0.068    0.007 0.398 0.421   0.714     0.215       0.003       0.010      0.072      0.085   0.063    0.284 0.145      0.147 0.051 0.064   0.046    0.338
        BE      0.748     0.100   0.762 0.093    0.016 0.491 0.199   0.699     0.179       0.007       0.011      0.130      0.041   0.127    0.209 0.057      0.054 0.084 0.104   0.126    0.278
        BG      0.933     0.004   0.936 0.001    0.029 0.466 0.333   0.899     0.028       0.005       0.004      0.078      0.142   0.207    0.216 0.135      0.058 0.029 0.047   0.065    0.093
        CH      0.588     0.286   0.603 0.280    0.051 0.227 0.487   0.653     0.292       0.001       0.003      0.055      0.054   0.077    0.223 0.111      0.065 0.057 0.140   0.131    0.397
        CY      0.803     0.082   0.808 0.094    0.045 0.667 0.178   0.566     0.381       0.004       0.009      0.053      0.125   0.122    0.245 0.161      0.109 0.029 0.074   0.071    0.229
        CZ      0.878     0.065   0.910 0.036    0.003 0.602 0.216   0.891     0.017       0.001       0.006      0.094      0.053   0.195    0.305 0.039      0.051 0.036 0.125   0.070    0.233
        DE      0.800     0.200   0.855 0.145    0.004 0.125 0.496   0.819     0.123       0.008       0.013      0.062      0.040   0.154    0.266 0.059      0.061 0.051 0.158   0.104    0.299
        DK      0.935     0.025   0.970 0.020    0.000 0.368 0.418   0.708     0.272       0.004       0.014      0.021      0.009   0.072    0.288 0.160      0.103 0.043 0.070   0.122    0.447
        EE      0.603     0.270   0.637 0.233    0.000 0.300 0.338   0.823     0.006       0.003       0.006      0.177      0.053   0.253    0.221 0.034      0.027 0.014 0.053   0.092    0.153
        EL      0.887     0.016   0.911 0.015    0.042 0.587 0.135   0.449     0.517       0.002       0.004      0.034      0.055   0.099    0.210 0.308      0.060 0.087 0.026   0.047    0.182
        ES      0.836     0.047   0.846 0.046    0.052 0.762 0.064   0.702     0.219       0.006       0.016      0.081      0.137   0.113    0.191 0.145      0.101 0.055 0.076   0.045    0.191
        FI      0.827     0.007   0.827 0.007    0.019 0.491 0.182   0.592     0.209       0.016       0.009      0.253      0.044   0.138    0.146 0.135      0.053 0.016 0.085   0.089
        FR      0.789     0.078   0.857 0.057    0.040 0.695 0.073   0.753     0.170       0.003       0.006      0.079      0.223   0.055    0.155 0.103      0.050 0.072 0.111   0.068    0.335
        HR      0.822     0.006   0.834 0.004    0.006 0.464 0.312   0.763     0.103       0.037       0.019      0.137      0.228   0.103    0.214 0.049      0.079 0.036 0.088   0.041    0.129
        HU      0.962     0.017   0.969 0.012    0.017 0.599 0.241   0.892     0.043       0.001       0.011      0.064      0.137   0.193    0.279 0.094      0.067 0.017 0.052   0.060    0.117
        IE      0.792     0.107   0.758 0.094    0.014 0.574 0.258   0.659     0.221       0.049       0.009      0.120      0.158   0.065    0.149 0.155      0.092 0.022 0.042   0.092    0.344
        IS      0.918     0.050   0.923 0.044    0.001 0.334 0.486   0.638     0.332       0.001       0.001      0.031      0.042   0.094    0.220 0.180      0.096 0.024 0.076   0.121    0.570
        IT      0.823     0.022   0.827 0.020    0.030 0.708 0.136   0.614     0.244       0.016       0.016      0.143      0.118   0.105    0.227 0.099      0.082 0.057 0.074   0.040    0.199
        LT      0.899     0.004   0.926 0.004    0.014 0.538 0.228   0.916     0.011       0.000       0.004      0.076      0.214   0.179    0.241 0.080      0.030 0.017 0.038   0.074    0.110
        LU      0.387     0.467   0.400 0.466    0.037 0.484 0.316   0.757     0.174       0.001       0.009      0.070      0.039   0.183    0.228 0.112      0.046 0.048 0.118   0.093    0.251
        LV      0.572     0.248   0.642 0.165    0.002 0.381 0.297   0.767     0.005       0.002       0.008      0.229      0.083   0.218    0.199 0.069      0.036 0.010 0.037   0.083    0.070
        MT      0.952     0.041   0.953 0.040    0.164 0.561 0.180   0.717     0.214       0.013       0.011      0.073      0.106   0.099    0.244 0.050      0.169 0.045 0.106   0.046    0.225
        NL      0.829     0.028   0.888 0.022    0.008 0.376 0.285   0.726     0.173       0.006       0.003      0.103      0.031   0.079    0.200 0.086      0.084 0.051 0.155   0.124    0.310
        NO      0.897     0.046   0.908 0.041    0.005 0.328 0.390   0.712     0.255       0.002       0.014      0.032      0.032   0.100    0.227 0.111      0.075 0.029 0.167   0.110    0.285
        PL      0.955     0.012   0.980 0.003    0.004 0.462 0.448   0.701     0.238       0.002       0.005      0.064      0.078   0.157    0.254 0.237      0.053 0.025 0.053   0.044    0.111
        PT      0.932     0.006   0.945 0.006    0.193 0.700 0.031   0.650     0.248       0.002       0.014      0.102      0.077   0.114    0.264 0.185      0.082 0.038 0.060   0.032    0.190
        RO      0.938     0.001   0.939 0.001    0.017 0.726 0.088   0.642     0.237       0.004       0.014      0.139      0.104   0.121    0.249 0.253      0.040 0.016 0.034   0.040    0.045
        SE      0.945     0.022   0.851 0.061    0.000 0.422 0.350   0.745     0.211       0.002       0.014      0.192      0.019   0.108    0.230 0.086      0.105 0.031 0.067   0.118    0.337
        SI      0.769     0.200   0.000 0.000    0.001 0.684 0.166   0.773     0.099       0.013       0.020      0.128      0.173   0.080    0.257 0.089      0.059 0.037 0.100   0.052    0.242
        SK      0.935     0.020   0.945 0.011    0.001 0.362 0.497   0.921     0.011       0.002       0.005      0.071      0.128   0.209    0.285 0.030      0.052 0.028 0.095   0.060    0.145
        UK      0.800     0.064   0.869 0.039    0.033 0.508 0.228   0.795     0.147       0.025       0.009      0.059      0.083   0.133    0.236 0.036      0.091 0.040 0.085   0.142    0.398

     Note: Omitted categories are: “European outside EU” for birth area, “Not Europe” for citizenship, “Illiterate” for education, “Unknown/Dead” and “Other inactive” for
     activity. ISCO-08 occupation definitions are: 1 “Elementary”, 2 “Plant Operator”, 3 “Craft/Trades”, 4 “Agriculture”, 5 “Service and Army”, 6 “Clerical”, 7 “Technician”,
     8 “Professional”, 9 “Manager”; “Dead/Unknown/Not working” is not shown. Omitted categories for supervisory are: “No” and “Dead/Unknown/Not working”.
                                                                     Table A.3: Descriptive Statistics (Mothers)
                 Birth   area      Citizenship        Education                            Activity                                          Main occupation   ISCO-08 1-digit             Superv.
      Country   Native     EU     Resid.   EU    Prim. Sec. Tert.    Empl.   Self-empl.   Unempl.     Retired   House work     1       2        3      4       5&0      6      7     8       Yes
        AT      0.740     0.096   0.789 0.065    0.026 0.587 0.328   0.369     0.169       0.002       0.005      0.463      0.087   0.010    0.045 0.128      0.155 0.071 0.009   0.024    0.092
        BE      0.755     0.097   0.790 0.092    0.030 0.564 0.201   0.320     0.117       0.006       0.002      0.651      0.069   0.024    0.016 0.002      0.046 0.058 0.045   0.081    0.034
        BG      0.931     0.003   0.981 0.002    0.039 0.464 0.357   0.878     0.026       0.007       0.003      0.101      0.152   0.064    0.099 0.181      0.140 0.092 0.040   0.123    0.030
        CH      0.567     0.307   0.599 0.286    0.078 0.410 0.399   0.382     0.152       0.001       0.001      0.466      0.068   0.025    0.039 0.055      0.125 0.069 0.069   0.056    0.064
        CY      0.804     0.080   0.812 0.091    0.088 0.684 0.162   0.325     0.166       0.001       0.001      0.509      0.220   0.042    0.022 0.036      0.067 0.037 0.020   0.045    0.048
        CZ      0.882     0.061   0.946 0.037    0.005 0.670 0.261   0.898     0.007       0.003       0.002      0.096      0.139   0.080    0.105 0.074      0.160 0.149 0.104   0.080    0.088
        DE      0.811     0.189   0.862 0.138    0.010 0.284 0.475   0.482     0.050       0.009       0.004      0.493      0.033   0.087    0.015 0.025      0.116 0.089 0.079   0.051    0.059
        DK      0.922     0.029   0.935 0.023    0.000 0.531 0.283   0.630     0.069       0.006       0.012      0.321      0.001   0.026    0.052 0.035      0.225 0.123 0.095   0.103    0.122
        EE      0.601     0.272   0.726 0.250    0.001 0.334 0.391   0.906     0.004       0.001       0.004      0.092      0.113   0.124    0.051 0.084      0.110 0.097 0.109   0.169    0.085
        EL      0.888     0.016   0.916 0.016    0.078 0.592 0.133   0.193     0.277       0.001       0.004      0.532      0.049   0.021    0.034 0.223      0.048 0.039 0.004   0.027    0.026
        ES      0.836     0.046   0.849 0.046    0.082 0.802 0.048   0.186     0.069       0.001       0.003      0.748      0.071   0.009    0.021 0.028      0.059 0.021 0.010   0.025    0.029
        FI      0.826     0.007   0.933 0.006    0.019 0.559 0.238   0.658     0.204       0.019       0.006      0.151      0.202   0.057    0.046 0.048      0.145 0.122 0.091   0.126
        FR      0.806     0.067   0.880 0.047    0.063 0.724 0.079   0.454     0.085       0.001       0.001      0.463      0.109   0.005    0.049 0.059      0.108 0.111 0.050   0.036    0.072
        HR      0.823     0.008   0.848 0.003    0.017 0.634 0.189   0.352     0.053       0.027       0.011      0.596      0.122   0.013    0.034 0.022      0.070 0.046 0.036   0.058    0.033
        HU      0.964     0.016   0.980 0.012    0.025 0.655 0.243   0.729     0.022       0.001       0.007      0.252      0.167   0.087    0.075 0.061      0.118 0.113 0.063   0.049    0.044
        IE      0.787     0.114   0.761 0.103    0.011 0.546 0.324   0.253     0.048       0.007       0.000      0.700      0.060   0.007    0.014 0.017      0.059 0.052 0.007   0.061    0.082
        IS      0.905     0.059   0.924 0.046    0.002 0.626 0.275   0.598     0.102       0.001       0.000      0.305      0.130   0.013    0.028 0.064      0.180 0.109 0.045   0.095    0.149
        IT      0.820     0.024   0.862 0.024    0.042 0.779 0.112   0.224     0.080       0.005       0.005      0.698      0.062   0.022    0.031 0.035      0.051 0.029 0.022   0.038    0.041
        LT      0.902     0.002   0.959 0.003    0.014 0.519 0.316   0.867     0.014       0.001       0.001      0.124      0.293   0.034    0.112 0.067      0.110 0.049 0.046   0.129    0.068
        LU      0.374     0.483   0.393 0.485    0.074 0.587 0.245   0.318     0.106       0.000       0.004      0.579      0.108   0.024    0.015 0.054      0.061 0.036 0.046   0.049    0.047
        LV      0.585     0.234   0.793 0.182    0.006 0.414 0.399   0.891     0.003       0.002       0.007      0.106      0.221   0.023    0.093 0.085      0.122 0.098 0.084   0.138    0.074
        MT      0.950     0.043   0.957 0.038    0.150 0.652 0.145   0.073     0.015       0.001       0.002      0.919      0.010   0.009    0.004 0.002      0.018 0.009 0.007   0.019    0.011
        NL      0.829     0.027   0.907 0.023    0.012 0.532 0.288   0.282     0.056       0.003       0.000      0.665      0.060   0.008    0.011 0.016      0.089 0.052 0.038   0.050    0.037
        NO      0.877     0.048   0.891 0.043    0.014 0.368 0.437   0.623     0.106       0.008       0.016      0.270      0.091   0.026    0.017 0.053      0.214 0.114 0.142   0.041    0.065
        PL      0.957     0.010   0.990 0.004    0.004 0.524 0.410   0.518     0.261       0.008       0.002      0.226      0.118   0.018    0.080 0.262      0.097 0.071 0.053   0.057    0.050
        PT      0.928     0.008   0.950 0.007    0.283 0.631 0.029   0.359     0.197       0.003       0.010      0.444      0.145   0.032    0.059 0.158      0.075 0.025 0.017   0.031    0.048
        RO      0.936     0.001   0.939 0.001    0.023 0.728 0.112   0.370     0.219       0.005       0.010      0.440      0.080   0.040    0.076 0.218      0.060 0.026 0.024   0.034    0.010
        SE      0.942     0.024   0.855 0.058    0.000 0.409 0.369   0.731     0.058       0.002       0.007      0.582      0.035   0.021    0.009 0.016      0.152 0.057 0.033   0.087    0.095
        SI      0.791     0.178   0.000 0.000    0.004 0.752 0.148   0.578     0.071       0.005       0.010      0.351      0.193   0.006    0.066 0.061      0.091 0.085 0.093   0.047    0.089
        SK      0.932     0.023   0.980 0.010    0.001 0.451 0.482   0.846     0.006       0.004       0.002      0.153      0.203   0.052    0.096 0.034      0.161 0.107 0.110   0.075    0.048
        UK      0.808     0.064   0.877 0.036    0.042 0.679 0.099   0.577     0.051       0.087       0.003      0.375      0.127   0.044    0.028 0.005      0.152 0.078 0.068   0.097    0.104

     Note: Omitted categories are: “European outside EU” for birth area, “Not Europe” for citizenship, “Illiterate” for education, “Unknown/Dead” and “Other inactive” for
     activity. ISCO-08 occupation definitions are: 1 “Elementary”, 2 “Plant Operator”, 3 “Craft/Trades”, 4 “Agriculture”, 5 “Service and Army”, 6 “Clerical”, 7 “Technician”,
     8 “Professional”, 9 “Manager”; “Dead/Unknown/Not working” is not shown. Omitted categories for supervisory are: “No” and “Dead/Unknown/Not working”.
A.3                  Empirical Robustness Checks

                                                                                                Figure A.1: Tuning Conditional Inference Trees

[Figure: ratio of each method's test MSE to the random forest test MSE (y-axis, approximately 1.00 to 1.15) by country (x-axis: HR, SK, IS, DK, BG, PL, ES, HU, CZ, FI, IT, LT, CY, NL, RO, NO, AT, LV, SI, EE, MT, LU, FR, DE, EL, UK, PT, CH, BE, SE, IE, AVG). Methods shown: Ctree: α=0.01; Ctree: α=0.05; Ctree: Tuned α.]

Note: The figure compares the test error of random forests with that of different conditional inference trees. The test
error of random forests is standardized to equal 1, so that a test error larger than 1 represents a worse fit than
random forests. “Ctree: Tuned α” uses cross-validation to tune α. 95% confidence intervals are derived from
200 bootstrapped re-samples of the test data using the normal approximation method.
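
The standardization and confidence intervals described in this note can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the toy data, and the use of 1.96 as the normal critical value are all assumptions introduced here, and the predictions are simulated rather than produced by actual trees or forests.

```python
import random
import statistics


def mse(y_true, y_pred):
    """Mean squared error of a vector of predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_mse_ci(y, pred_tree, pred_forest, n_boot=200, seed=0):
    """Return MSE(tree) / MSE(forest) on the test data, together with a
    95% normal-approximation interval whose standard error comes from
    bootstrap re-samples of the test observations."""
    rng = random.Random(seed)
    n = len(y)
    point = mse(y, pred_tree) / mse(y, pred_forest)
    ratios = []
    for _ in range(n_boot):
        # Re-sample test observations with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        tb = [pred_tree[i] for i in idx]
        fb = [pred_forest[i] for i in idx]
        ratios.append(mse(yb, tb) / mse(yb, fb))
    se = statistics.stdev(ratios)  # bootstrap standard error of the ratio
    z = 1.96  # assumed critical value for a 95% interval
    return point, point - z * se, point + z * se


# Hypothetical toy data: the "forest" predictions are less noisy than the "tree" ones.
toy = random.Random(1)
y_test = [toy.gauss(0, 1) for _ in range(500)]
forest_pred = [v + toy.gauss(0, 1.0) for v in y_test]
tree_pred = [v + toy.gauss(0, 1.5) for v in y_test]

point, lower, upper = relative_mse_ci(y_test, tree_pred, forest_pred)
```

A ratio above 1 with an interval that excludes 1 would correspond to a tree that fits the test data worse than the forest.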




                                                                              Figure A.2: Optimal Size of Forests

[Figure: out-of-bag MSE (y-axis, approximately 12010 to 12060) against the number of trees (x-axis, 0 to 1000).]
Note: The figure compares the $MSE^{OOB}$ for Germany using varying forest sizes (different levels of $B$). We allow for 6 circumstances to be considered at each splitting point ($\bar{P} = 6$). The blue line indicates the loss of fit. After around 200 trees, improvements in the error tend to be negligible. Similar patterns were found with other countries and other levels of $\bar{P}$. For this reason, we set $B^* = 200$ in our random forests.
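The stabilization of the out-of-bag error described in the note can be illustrated with a minimal sketch. It is not the procedure used in the paper: it swaps in plain bagged regression stumps on synthetic data for the conditional inference forests estimated on EU-SILC. The function names (`fit_stump`, `oob_mse_curve`, `first_stable_size`), the tolerance, and the window are all assumptions made for illustration; the paper picks the forest size by inspecting the plot rather than by a numerical rule.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression stump and return a function x -> prediction."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    best_gain, best_rule = -1.0, None
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between two equal x values
        left_mean = left_sum / i
        right_mean = (total - left_sum) / (n - i)
        # Maximizing this quantity is equivalent to minimizing the split's SSE.
        gain = i * left_mean ** 2 + (n - i) * right_mean ** 2
        if gain > best_gain:
            threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
            best_gain, best_rule = gain, (threshold, left_mean, right_mean)
    if best_rule is None:  # all x identical: fall back to the mean
        mean = total / n
        return lambda x: mean
    threshold, left_mean, right_mean = best_rule
    return lambda x: left_mean if x <= threshold else right_mean


def oob_mse_curve(xs, ys, n_trees, seed=0):
    """Return the out-of-bag MSE after 1, 2, ..., n_trees bagged stumps."""
    rng = random.Random(seed)
    n = len(xs)
    pred_sum = [0.0] * n  # running sum of OOB predictions per observation
    pred_cnt = [0] * n    # number of trees for which the observation was OOB
    curve = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(idx)
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        for i in range(n):
            if i not in in_bag:
                pred_sum[i] += stump(xs[i])
                pred_cnt[i] += 1
        errs = [(ys[i] - pred_sum[i] / pred_cnt[i]) ** 2
                for i in range(n) if pred_cnt[i] > 0]
        curve.append(sum(errs) / len(errs) if errs else float("nan"))
    return curve


def first_stable_size(curve, tol=0.01, window=50):
    """Hypothetical rule: smallest B whose next `window` errors stay within tol (relative)."""
    for b in range(len(curve) - window):
        ref = curve[b]
        if all(abs(v - ref) <= tol * ref for v in curve[b + 1:b + 1 + window]):
            return b + 1
    return None


# Synthetic data (an assumption, not EU-SILC): one regressor, linear signal plus noise.
data_rng = random.Random(42)
xs = [data_rng.random() for _ in range(200)]
ys = [2.0 * x + data_rng.gauss(0.0, 0.3) for x in xs]

curve = oob_mse_curve(xs, ys, n_trees=300)
b_star = first_stable_size(curve)
print("stable forest size:", b_star, "final OOB MSE:", round(curve[-1], 4))
```

Plotting `curve` against the tree index gives the same kind of diagnostic as the figure: once the curve flattens, additional trees add computation without a meaningful reduction in error.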


A.4                  Opportunity Structures




Figure A.3: Variable Importance Plot

[Figure omitted: dot plot. Vertical axis: Variable Importance (0.00 to 1.00). Horizontal axis: Country (DK, IS, LU, NO, SE, SI, BE, BG, EE, EL, FI, LT, SK, AT, CH, CY, DE, ES, FR, IE, IT, MT, NL, PT, UK, HR, LV, RO, CZ, HU, PL, AVG). Legend: Birth Area, Father's Education, Father's Occupation, Mother's Education, Mother's Occupation.]

Note: Each dot shows the importance of a particular circumstance for the predictions from our random forest. The importance of a circumstance is measured by permuting the circumstance, calculating a new $MSE^{OOB}$, and computing the difference in the $MSE^{OOB}$ between the original model and the model with the permuted circumstance. The importance measure is standardized such that the circumstance with the greatest importance in each country equals one.
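The permutation measure described in the note can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it uses bagged regression stumps, each restricted to one randomly drawn regressor, in place of conditional inference forests, and it runs on synthetic data with three hypothetical circumstances. All function names are invented for the example. Importance is computed as the permuted out-of-bag MSE minus the original one, so larger values mean greater importance, and is then divided by the largest value.

```python
import random


def fit_stump(rows, ys, feature):
    """Fit a one-split regression stump on a single feature."""
    pairs = sorted((row[feature], y) for row, y in zip(rows, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    best_gain, rule = -1.0, None
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between two equal values
        lm, rm = left_sum / i, (total - left_sum) / (n - i)
        gain = i * lm ** 2 + (n - i) * rm ** 2  # larger gain = smaller SSE
        if gain > best_gain:
            best_gain = gain
            rule = ((pairs[i][0] + pairs[i - 1][0]) / 2, lm, rm)
    if rule is None:  # constant feature: predict the mean on both sides
        rule = (float("inf"), total / n, total / n)
    return (feature,) + rule  # (feature, threshold, left_mean, right_mean)


def predict(stump, row):
    feature, threshold, left_mean, right_mean = stump
    return left_mean if row[feature] <= threshold else right_mean


def fit_forest(rows, ys, n_trees, rng):
    """Bagged stumps; each stump sees one randomly drawn feature."""
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(idx)
        stump = fit_stump([rows[i] for i in idx], [ys[i] for i in idx],
                          rng.randrange(p))
        oob = [i for i in range(n) if i not in in_bag]
        forest.append((stump, oob))
    return forest


def oob_mse(forest, rows, ys):
    """MSE of averaged predictions, using only trees for which a row was OOB."""
    n = len(rows)
    total, count = [0.0] * n, [0] * n
    for stump, oob in forest:
        for i in oob:
            total[i] += predict(stump, rows[i])
            count[i] += 1
    errs = [(ys[i] - total[i] / count[i]) ** 2 for i in range(n) if count[i]]
    return sum(errs) / len(errs)


def permutation_importance(forest, rows, ys, rng):
    """Increase in OOB MSE after shuffling each column, scaled so the maximum is one."""
    base = oob_mse(forest, rows, ys)
    raw = []
    for j in range(len(rows[0])):
        col = [row[j] for row in rows]
        rng.shuffle(col)
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, col)]
        raw.append(oob_mse(forest, permuted, ys) - base)
    top = max(raw)
    if top <= 0:
        raise ValueError("no feature increases the OOB error when permuted")
    return [r / top for r in raw]


# Synthetic data (an assumption): feature 0 matters most, feature 2 not at all.
rng = random.Random(7)
rows = [[rng.random(), rng.random(), rng.random()] for _ in range(500)]
ys = [3.0 * r[0] + 1.5 * r[1] + rng.gauss(0.0, 0.3) for r in rows]

forest = fit_forest(rows, ys, n_trees=150, rng=rng)
importance = permutation_importance(forest, rows, ys, rng)
print([round(v, 3) for v in importance])
```

Because of the standardization, values are comparable within one fitted forest only; a value of one in two different countries does not imply the same absolute loss of fit.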


A.5                           Tree Complexity

Figure A.4: Correlation of Complexity and Magnitude of Estimate

[Figure omitted: two scatter plots with points labeled by country code. Panel (a) Latent Class Analysis: IOp (Gini) against Number of latent classes. Panel (b) Conditional Inference Tree: IOp (Gini) against Number of terminal nodes.]



Note: Complexity of the opportunity structure is proxied by the number of latent classes and the number of terminal nodes, respectively.



