2. Estimation model 1980-2003: data and methodology
Data selection criteria and coverage
Missing value estimation procedure
Step 1: Logistic transformation
Step 2: Country-level interpolation
Step 3: Calculation of response-probabilistic weights
Step 4: Weighted multivariate estimation
Comparison with national projections
Adjustment of projection parameters
The 5th Edition of the Estimates and Projections of the Economically Active Population (EAPEP) Database is the result of a collaboration between the ILO Bureau of Statistics and the ILO Employment Trends Unit.
In this project, the Employment Trends Unit had primary responsibility for developing the historical estimates portion of the database (1980-2003), whilst the Bureau of Statistics had primary responsibility for developing the projections.
This collaborative project utilised new and enhanced methodologies to improve the EAPEP labour force estimates and projections, while also establishing a system to guarantee more frequent and reliable data updates. The resulting models and methodologies will be the basis for subsequent updates of the EAPEP Database by the Bureau of Statistics and the Employment Trends Unit.
This paper was prepared by Steven Kapsos (ILO Employment Trends Unit) for the estimates, James Brown and Fiifi Amoako Johnson (University of Southampton) for the basic model of the projections and Farhad Mehran, Ferdinand Lepper and Christophe Vittorelli (ILO Bureau of Statistics) for the projections.
The ILO programme on estimates and projections of the economically active population is part of a larger international effort on demographic estimates and projections to which several UN agencies contribute. Estimates and projections of the total population and its components by sex and age group are produced by the UN Population Division, the economically active population by the ILO, the agricultural population by FAO and the school attending population by UNESCO.
The main objective of the ILO programme is to provide member states, international agencies and the public at large with the most comprehensive, detailed and comparable estimates and projections of the economically active population in the world and its main geographical regions. The first edition was published by the ILO Bureau of Statistics in 1971 (covering 168 countries and territories, with reference period 1950-1985)[1]; the second edition in 1977 (with 154 countries and territories and reference period 1975-2000)[2]; the third edition in 1986 (with 156 countries and territories and reference period 1985-2025)[3]; and the fourth edition in 1996 (with 178 countries and territories and reference period 1950-2010)[4].
The present fifth edition covers 191 countries and territories and 29 economic and geographical groupings. The reference period for the estimates is 1980-2003 and for the projections, 2004-2020. The basic data are single-year labour force participation rates by sex and eleven age groups in five-year age intervals, the last age group being 65 years and above. The data are available at the ILO main website on labour statistics: http://laborsta.ilo.org. Due to updated information (new UN population projections and estimates on activity rates up to 2003) and a further developed methodology, the results of the 5th edition may be different from those of the 4th edition for the same year/country for both the total estimates and the projections. A comparison of the results of the two editions is thus not useful.
The purpose of the present note is to describe the main elements of the estimation and projection methodologies adopted for the fifth edition. Both the estimation and projection parts deviate substantially from the procedures adopted in the previous editions: they are based to a much greater extent on consistent models with minimum parameter adjustments. This simplifies the methodological descriptions and makes the numerical results essentially reproducible.
The following chart (fig. 1) depicts the main steps involved:

The underlying national labour force data used for producing harmonised single-year ILO country estimates of labour force participation rates (LFPR) by sex and standard age groups are described in Section 2. Also described in that section are the statistical treatment of missing values and the estimation models for countries for which no or limited data were available. The projection methodology is described in Section 3. It begins by describing the core model based on scenarios concerning the pattern of convergence or divergence of male and female labour force participation rates over the projection horizon. The procedures used for evaluating the results and, where necessary, adjusting the parameters of the core model are then outlined with a few numerical examples. Finally, Section 4 describes the procedures used for transforming the labour force participation rates into counts of the economically active population and those used for summing the country-level data into the main geographical and economic aggregates. Annex 1 lists the countries and territories, and the geographical and economic groupings covered by the project. Annex 2 presents standardized sex and age profiles of estimated and projected labour force participation rates for each country.
This section contains two main parts. The first part provides an overview of the criteria used to select the baseline national labour force participation rate (LFPR) data that serve as the key input into the ILO’s Economically Active Population Estimates and Projections (EAPEP) 5th Edition database. This section includes a discussion of non-comparability issues that exist in the available national LFPR data and concludes with a description of the LFPR data coverage, after taking into account the various selection criteria. The second part describes the econometric model developed for the treatment of missing LFPR values, both in countries that report in some but not all of the years in question, as well as for those countries for which no data are currently available.
The EAPEP database is a collection of country-reported and ILO estimated labour force participation rates. The database is a complete panel, that is, it is a cross-sectional time series database with no missing values. A key objective in the construction of the database was to generate a set of comparable labour force participation rates across both countries and time. With this in mind, the first step in the production of the historical portion of the 5th Edition of the EAPEP database was to carefully scrutinize existing country-reported labour force participation rates and to select only those observations deemed sufficiently comparable. In the second step, a weighted least squares econometric model was developed to produce estimates of labour force participation rates for those countries and years in which no country-reported, cross-country comparable data currently exist. The following describes the sources of data non-comparability, the process through which data were either selected or eliminated and the resulting data coverage and database structure.
In order to generate a set of sufficiently comparable labour force participation rates across both countries and time, it is necessary to identify and address the various sources of non-comparability. The main sources of non-comparability of labour force participation rates are as follows:[5]
Survey type – country-reported labour force participation rates are derived from several types of survey data including labour force surveys, population censuses, establishment surveys, insurance records or official government estimates. Data taken from different survey types are often not comparable.
Age group coverage – non-comparability also arises from differences in the age groupings used in measuring the labour force. While the standard age-groupings used in the EAPEP Database are 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64 and 65+, some countries report non-standard age groupings, which can adversely affect comparability.
Geographic coverage – some country-reported labour force participation rates correspond to a specific geographic region, area or territory. Geographically-limited data are not comparable across countries.
Others – Non-comparability can also arise from the inclusion or non-inclusion of military conscripts; variations in national definitions of the economically active population, particularly with regard to the statistical treatment of “contributing family workers” and “unemployed, not looking for work”; and differences in survey reference periods.
Taking these issues into account, a set of criteria was established upon which nationally-reported labour force participation rates would be selected for or eliminated from the input file for the EAPEP dataset.[6] The selection criteria include the following:
Selection criterion 1. Data must be derived from either a labour force survey or a population census; population census data are included only if no labour force survey data exist for a given country. Labour force surveys are the most comprehensive source for internationally comparable labour force data. National labour force surveys are typically very similar across countries, and the data derived from these surveys are generally much more comparable than data obtained from other sources. Consequently, a strict preference was given to labour force survey data in the selection process. Yet many developing countries without adequate resources to carry out a labour force survey do report LFPR estimates based on population censuses. Due to the need to balance the competing goals of data comparability and data coverage, some population census-based labour force participation rates were included, but only for countries in which no labour force survey-based participation data exist. Data derived from official government estimates were not included in the dataset, as the methodology for producing official estimates can differ significantly across countries and over time, leading to non-comparability.
Selection criterion 2. Only data corresponding to the 11 standardized age-groups (15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64 and 65+) are included. The inclusion of data corresponding to age-groups other than those listed above could result in a less comparable dataset. Therefore only data from these 11 standard age groupings were included.
Selection criterion 3. Only fully national (i.e. not geographically limited) labour force participation rates are included. Labour force participation rates corresponding to only urban or only rural areas were not included. This criterion was necessary due to the large differences that often exist between rural and urban labour markets.
Together, these criteria determined the data content of the final input file, which was utilized in the subsequent econometric estimation process (described below). Table 1 provides response rates and total observations by age-group and year. These rates represent the share of total potential observations for which real, cross-country comparable data exist.
Table 1. Response rates by year, both sexes combined
| Year | Proportion of potential observations | Number of observations |
|---|---|---|
| 1980 | 0.30 | 1274 |
| 1981 | 0.18 | 770 |
| 1982 | 0.18 | 736 |
| 1983 | 0.24 | 1006 |
| 1984 | 0.19 | 818 |
| 1985 | 0.22 | 908 |
| 1986 | 0.23 | 956 |
| 1987 | 0.21 | 886 |
| 1988 | 0.26 | 1096 |
| 1989 | 0.26 | 1112 |
| 1990 | 0.39 | 1624 |
| 1991 | 0.30 | 1262 |
| 1992 | 0.26 | 1108 |
| 1993 | 0.30 | 1250 |
| 1994 | 0.30 | 1242 |
| 1995 | 0.32 | 1326 |
| 1996 | 0.30 | 1276 |
| 1997 | 0.30 | 1256 |
| 1998 | 0.30 | 1256 |
| 1999 | 0.28 | 1194 |
| 2000 | 0.31 | 1290 |
| 2001 | 0.28 | 1186 |
| 2002 | 0.29 | 1224 |
| 2003 | 0.28 | 1186 |
| Total | 0.27 | 27242 |
The input file is also broken down by sex; however, the numbers of male and female observations are the same (13,621 each), so only total figures are provided in the table. In total, comparable data are available for 27,242 out of a possible 100,848 observations, or approximately 27 per cent of the total. The total number of potential observations in the panel is obtained by multiplying 191 countries × 11 age-groups × 2 sexes × 24 years = 100,848. It is important to note that while the percentage of real observations is rather low, 159 out of 191 countries (84 per cent) reported labour force participation rates in at least one year during the 1980 to 2003 reference period. Thus, some information on LFPR is known about the vast majority of the countries in the sample.
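The coverage arithmetic above can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the observation counts are copied from Table 1.

```python
# Panel dimensions quoted in the text.
COUNTRIES, AGE_GROUPS, SEXES, YEARS = 191, 11, 2, 24

potential_total = COUNTRIES * AGE_GROUPS * SEXES * YEARS  # every cell of the complete panel
potential_per_year = COUNTRIES * AGE_GROUPS * SEXES       # cells available in a single year

observed_total = 27_242  # "Total" row of Table 1
observed_1980 = 1_274    # 1980 row of Table 1

coverage_total = observed_total / potential_total
coverage_1980 = observed_1980 / potential_per_year

print(potential_total, potential_per_year)
print(f"{coverage_total:.2f} {coverage_1980:.2f}")
```

The per-year denominator (191 × 11 × 2) is what turns each yearly observation count in Table 1 into the proportion shown next to it.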
There is very little difference among the 11 age-groups with respect to data availability. This is primarily due to the fact that countries that report LFPR in a given year tend to report for all age groups. The main exception to this occurs in cases in which some reported age-groups do not conform to selection criterion 2. On the other hand, there is clear variation in response by year. In particular, coverage has tended to improve over time, as the lowest coverage occurred in the early 1980s. While the overall response rate is approximately 27 per cent, as will be shown in the next section, response rates vary substantially among the different regions of the world.
This section describes the basic missing value estimation model developed to produce the EAPEP historical database. The model was developed by the ILO Employment Trends Unit as part of its ongoing responsibility for the development and analysis of world and regional aggregates of key labour market indicators including labour force, employment, unemployment, employment status, employment by sector and working poverty, among others.[7] The present methodology contains four steps. First, in order to ensure realistic estimates of labour force participation rates, a logistic transformation is applied to the input data file. Second, a simple interpolation technique is utilized to expand the baseline data in countries that report labour force participation rates in some years. Next, the problem of non-response bias (systematic differences between countries that report data in some years and countries that do not report data in any year) is addressed and a solution is developed to correct for this bias. Finally, the weighted least squares estimation model, which produces the actual country-level LFPR estimates, is explained in detail. Each of these steps is described below.
The first step in the estimation process is to transform all labour force participation rates included in the input file. This step is necessary since using simple linear estimation techniques to estimate labour force participation rates can yield implausible results (for instance labour force participation rates of more than 100 per cent). Therefore, in order to avoid out of range predictions, the final input set of labour force participation rates is transformed logistically in the following manner prior to the estimation procedure:
$$Y_{it} = \ln\left(\frac{y_{it}}{1 - y_{it}}\right) \qquad (1)$$
where $y_{it}$ is the observed labour force participation rate in country i and year t, and $Y_{it}$ is its logistic transform. This transformation ensures within-range predictions, and applying the inverse transformation produces the original labour force participation rates. The choice of a logistic function in the present context follows Crespi (2004).
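A minimal sketch of the transformation and its inverse follows. It assumes the transformation in equation 1 is the standard logit and that rates are expressed as proportions (0.85 rather than 85 per cent); the function names are illustrative, not taken from the ILO code.

```python
import math

def logit(rate):
    """Map a participation rate in (0, 1) onto the real line (assumed form of equation 1)."""
    if not 0.0 < rate < 1.0:
        raise ValueError("rate must lie strictly between 0 and 1")
    return math.log(rate / (1.0 - rate))

def inverse_logit(value):
    """Map any real-valued prediction back to a rate between 0 and 1."""
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    e = math.exp(value)  # numerically stable branch for negative inputs
    return e / (1.0 + e)

observed = 0.85                    # hypothetical LFPR of 85 per cent
transformed = logit(observed)
recovered = inverse_logit(transformed)

# Linear predictions made on the transformed scale map back inside the unit interval.
back_transformed = [inverse_logit(v) for v in (-6.0, 0.0, 6.0)]
print(transformed, recovered, back_transformed)
```

Because estimation happens on the transformed scale, a fitted value of any size converts back to a rate below 100 per cent, which is the property the text relies on.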
The second step in the estimation model is to fill in, through linear interpolation, the set of available information from countries that report in some but not all of the years in question. In many reporting countries, some gaps in the data do exist. For instance, a country will report labour force participation rates in 1990 and 1992, but not in 1991. In these cases, a simple linear interpolation routine is applied, in which “smoothed” LFPR estimates are produced using equation 2.
$$\hat{Y}_{it} = Y_{i t_{i0}} + \frac{Y_{i t_{i1}} - Y_{i t_{i0}}}{t_{i1} - t_{i0}}\,(t - t_{i0}) \qquad (2)$$
In this equation, $\hat{Y}_{it}$ is the interpolated logistically transformed labour force participation rate in country i, and t is the year in which $\hat{Y}_{it}$ is linearly interpolated. $Y_{i t_{i1}}$ is the logistically transformed labour force participation rate in year $t_{i1}$, which corresponds to the closest reporting year in country i following year t. $Y_{i t_{i0}}$ is the logistically transformed labour force participation rate in year $t_{i0}$, which is the closest reporting year in country i preceding year t. Accordingly, $t_{i1}$ is bounded at the most recent overall reporting year for country i, while $t_{i0}$ is bounded at the earliest reporting year for country i.
This procedure increases the number of observations upon which the econometric estimation of labour force participation rates in reporting and non-reporting countries is based and also provides a somewhat smoother, more stable series for use in the estimation.
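The interpolation step can be sketched as below. The sketch assumes the logit form for the transformation and fills only the years lying between a country's earliest and latest reporting years, as the text describes; the data and function names are hypothetical.

```python
import math

def logit(rate):
    return math.log(rate / (1.0 - rate))

def interpolate_gaps(reported):
    """Fill non-reporting years between consecutive reporting years.

    `reported` maps year -> logistically transformed LFPR for one country,
    sex and age group. Years before the first or after the last reporting
    year are left untouched (no extrapolation).
    """
    years = sorted(reported)
    filled = dict(reported)
    for t0, t1 in zip(years, years[1:]):
        slope = (reported[t1] - reported[t0]) / (t1 - t0)
        for t in range(t0 + 1, t1):
            filled[t] = reported[t0] + slope * (t - t0)
    return filled

# Hypothetical country reporting in 1990, 1992 and 1995 only.
reported = {1990: logit(0.60), 1992: logit(0.64), 1995: logit(0.70)}
filled = interpolate_gaps(reported)
print(sorted(filled))
```

Note that the interpolation is linear in the transformed values, so the back-transformed rates follow a slightly curved rather than a straight path between reporting years.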
Table 2. Response rates by estimation group
| Estimation group | % of potential obs. | % of potential obs., post-interpolation | Obs. | Obs., post-interpolation |
|---|---|---|---|---|
| Developed Europe | 76.8 | 86.6 | 8924 | 10056 |
| Developed Non-Europe | 86.1 | 89.2 | 4090 | 4238 |
| CEE and CIS | 28.1 | 78.5 | 4008 | 11198 |
| East and South-East Asia | 15.5 | 25.5 | 1796 | 2958 |
| South Asia | 18.1 | 51.9 | 858 | 2444 |
| Central America and the Caribbean | 24.8 | 52.2 | 3280 | 6900 |
| South America | 44.7 | 69.9 | 2360 | 3690 |
| Middle East and North Africa | 9.7 | 30.1 | 922 | 2886 |
| Sub-Saharan Africa | 3.9 | 6.7 | 1004 | 1742 |
| Total | 27.0 | 45.7 | 27242 | 46112 |
The increase in observations resulting from the linear interpolation procedure is provided in table 2. This table also provides a picture of the large variation in data availability among the different geographic/economic estimation groups. In total, the number of observations increased from 27,242 to 46,112 – that is, from 27 per cent to 45.7 per cent of the total potential observations. The lowest data coverage is in sub-Saharan Africa, in which the post-interpolation coverage is just 6.7 per cent. East and South-East Asia and the Middle East and North Africa also have relatively low coverage, at 25.5 per cent and 30.1 per cent, respectively. Post-interpolation coverage in all other regions is over 50 per cent, reaching nearly 90 per cent in the developed regions. This resulting database represents the final set of harmonized real and estimated labour force participation rates upon which the multivariate weighted estimation model was carried out as described below.
Out of 191 countries in the EAPEP database, 32 do not have any reported comparable labour force participation rates over the 1980-2003 period. This raises the potential problem of non-response bias. That is, if labour force participation rates in countries that do not report data tend to differ significantly from participation rates in countries that do report, basic econometric estimation techniques will result in biased estimates of labour force participation rates for the non-reporting countries, as the sample upon which the estimates are based does not sufficiently represent the underlying heterogeneity of the population.[8]
The identification problem at hand is essentially whether data in the EAPEP database are missing completely at random (MCAR), missing at random (MAR) or not missing at random (NMAR).[9] If the data are MCAR, non-response is ignorable and multiple imputation techniques such as those inspired by Heckman (1979) should be sufficient for dealing with missing data. This is the special case in which the probability of reporting depends on neither observed nor unobserved variables. In the present context, this would mean that reporting and non-reporting countries are essentially “similar” in both their observable and unobservable characteristics as they relate to labour force participation rates. If the data are MAR, the probability of sample selection depends only on observable characteristics. That is, it is known that reporting countries are different from non-reporting countries, but the factors that determine whether countries report data are identifiable. In this case, one common solution for correcting for sample selection bias is to use econometric methods incorporating a weighting scheme, in which weights are set as the inverse probability of selection (or inverse propensity score). Finally, if the data are NMAR, there is a selection problem related to unobservable differences in characteristics among reporters and non-reporters, and methodological options are limited. In cases where data are NMAR, it is desirable to render the MAR assumption plausible by identifying covariates that affect response probability (Little and Hyonggin, 2003).
Given the important methodological implications of non-response type, it is instructive to examine characteristics of reporting and non-reporting countries in order to determine the type of non-response present in the EAPEP database. Table 3 confirms significant differences between reporting and non-reporting countries in the sample.
Table 3. Per-capita GDP and population size of reporting and non-reporting countries
| | Reporters | Non-reporters |
|---|---|---|
| Mean per-capita GDP, 2003 | 9153 | 2452 |
| Median per-capita GDP, 2003 | 5829 | 1501 |
| Mean population, 2005 (millions) | 38.3 | 9.8 |
| Median population, 2005 (millions) | 7.7 | 4.6 |
| Total countries | 161 | 30 |
Sources: World Bank, World Development Indicators Database 2005; IMF, World Economic Outlook Database, September 2005; UN, World Population Prospects 2004 Revision Database.
The table shows that reporting countries have considerably higher per-capita GDP and larger populations than non-reporting countries. In the context of the EAPEP database, it is important to note that countries with low per-capita GDP also tend to exhibit higher than average labour force participation rates, particularly among women, youth and older individuals. This arises mainly because the poor often have few assets other than their labour upon which to survive. Thus, basic economic necessity often drives the poor to work in higher proportions than the non-poor. As economies develop, many individuals (particularly women) can afford to work less and youth can attend school for longer periods; consequently, overall participation rates in developing economies moving into the middle stages of development tend to decline.[10]
This is demonstrated in figure 2, which depicts actual country-reported labour force participation rates by 5-year age-group in Germany and Ghana. Germany’s per-capita GDP in 2003 stood at around $25,600, while Ghana’s was approximately $2,100. While there is little difference with regard to male prime working-age labour force participation, female participation is considerably higher in Ghana, including during prime child-rearing years. In addition, the LFPR curves corresponding to women and men in Ghana are considerably flatter than the curves corresponding to their German counterparts. This reflects the considerably higher participation rates of youth and older workers in Ghana.
Figure 2. Labour force participation rates by age-group in Ghana and Germany, most recent year

It appears that factors exist that co-determine the likelihood of countries reporting labour force participation rates in the EAPEP input dataset and the actual labour force participation rates themselves. The missing data do not appear to be MCAR. Because data (such as per-capita GDP and population size) exist for both responding and non-responding countries and are related to response likelihood, it should be possible to render the MAR assumption plausible and thus to correct for the problem of non-response bias.[11] This correction can be made while using the fixed-effects panel estimation methods described below, by applying “balancing weights” to the sample of reporting countries. The remainder of the present discussion describes this weighting routine in greater detail.
The basic methodology utilized to render the data MAR and to correct for sample selection bias contains two steps. The first step is to estimate each country’s probability of reporting labour force participation rates. In the EAPEP input dataset, per-capita GDP, population size, year dummy variables and membership in the Heavily Indebted Poor Countries (HIPC) Initiative represent the set of independent variables used to estimate response probability.[12]
Following Crespi (2004) and Horowitz and Manski (1998), we characterize each country in the EAPEP input dataset by a vector $(y_{it}, x_{it}, w_{it}, r_{it})$, where $y$ is the outcome of interest (the logistically transformed labour force participation rate), $x$ is a set of covariates that determine the value of the outcome and $w$ is a set of covariates that determine the probability of the outcome being reported. Finally, $r$ is a binary variable indicating response or non-response as follows:
$$r_{it} = \begin{cases} 1 & \text{if } y_{it} \text{ is reported} \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$
Equation 4 indicates that there is a linear function whereby the likelihood of reporting labour force participation rates is a function of the set of covariates:
$$r^{*}_{it} = w_{it}'\gamma + \varepsilon_{it} \qquad (4)$$
where a country reports if this index value is positive ($r^{*}_{it} > 0$). $\gamma$ is the set of regression coefficients and $\varepsilon_{it}$ is the error term. Assuming a symmetric cumulative distribution function, the probability of reporting labour force participation rates can be written as in equation 5.
$$P(r_{it} = 1) = P(r^{*}_{it} > 0) = P(\varepsilon_{it} > -w_{it}'\gamma) = F(w_{it}'\gamma) \qquad (5)$$
The functional form of $F$ depends on the assumption made about the error term $\varepsilon_{it}$. As in Crespi (2004), we assume that the cumulative distribution is logistic, as shown in equation 6:
$$F(w_{it}'\gamma) = \frac{\exp(w_{it}'\gamma)}{1 + \exp(w_{it}'\gamma)} \qquad (6)$$
Equation 6 is estimated through logistic regression, with each country placed into one of the 9 estimation groups listed in table 2. The regressions are carried out for each of the 11 standardized age-groups. The results of this procedure provide the predicted response probabilities for each age-group within each country in the EAPEP dataset.
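To illustrate how predicted response probabilities come out of a logistic regression, the sketch below fits a one-covariate model by plain gradient ascent on the log-likelihood. Everything in it is an assumption made for illustration: the toy data, the single standardized covariate (standing in for something like log per-capita GDP), and the fitting routine, which is not the software used for the EAPEP estimates. The actual model also includes population size, year dummies and HIPC membership.

```python
import math

def sigmoid(z):
    """Logistic cumulative distribution function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_response_model(rows, responses, steps=5000, rate=0.5):
    """Estimate coefficients by gradient ascent on the mean log-likelihood.

    rows: covariate vectors w_it, each starting with 1.0 for the intercept.
    responses: 1 if the observation was reported, 0 otherwise.
    """
    n, k = len(rows), len(rows[0])
    gamma = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for w, r in zip(rows, responses):
            p = sigmoid(sum(g * v for g, v in zip(gamma, w)))
            for j in range(k):
                grad[j] += (r - p) * w[j] / n
        gamma = [g + rate * d for g, d in zip(gamma, grad)]
    return gamma

# Hypothetical data: richer countries (higher covariate value) report more often.
covariate = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
reported = [0, 0, 0, 1, 0, 1, 0, 1, 1]
rows = [[1.0, c] for c in covariate]

gamma = fit_response_model(rows, reported)
probabilities = [sigmoid(gamma[0] + gamma[1] * c) for c in covariate]
print(gamma, probabilities)
```

In this toy setting the fitted slope is positive, so the predicted probability of reporting rises with the covariate, mirroring the pattern in Table 3.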
The second step is to calculate country weights based on these regression results and to use the weights to “balance” the sample during the estimation process. The predicted response probabilities calculated in equation 6 are used to compute weights defined as:
$$s_{it}(w) = \frac{P(r_{it} = 1)}{P(r_{it} = 1 \mid w_{it})} \qquad (7)$$
The weights given by equation 7 are calculated as the ratio of the proportion of non-missing observations in the sample (for each age-group and each year) and the reporting probability estimated in equation 6 of each age-group in each country in each year. By calculating the weights in this way, reporting countries that are more similar to the non-reporting countries (based on characteristics including per-capita GDP, population size and HIPC membership) are given greater weight and thus have a greater influence in estimating labour force participation rates in the non-reporting countries, while reporting countries that are less similar to non-reporting countries are given less weight in the estimation process. As a result, the weighted sample looks more similar to the theoretical population framework than does the simple un-weighted sample of reporting countries.
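The weighting rule can be sketched as follows, taking the weight to be the sample share of reported observations divided by the predicted response probability, as the paragraph above describes. The response flags and probabilities are hypothetical, and the function name is illustrative.

```python
def balancing_weights(response_flags, predicted_probs):
    """Weights for reporting observations in one age-group and year.

    response_flags: 1 if the country reported, 0 otherwise.
    predicted_probs: estimated probability of reporting for each country.
    Returns {index: weight} for reporting countries only.
    """
    share_reporting = sum(response_flags) / len(response_flags)
    return {
        i: share_reporting / p
        for i, (r, p) in enumerate(zip(response_flags, predicted_probs))
        if r == 1
    }

# Hypothetical group of five countries, three of which report.
flags = [1, 1, 1, 0, 0]
probs = [0.9, 0.6, 0.3, 0.2, 0.1]
weights = balancing_weights(flags, probs)
print(weights)
```

The third country reports despite a low predicted probability, so it resembles the non-reporters most closely and receives the largest weight; the first country, a near-certain reporter, is down-weighted.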
The final step is the estimation process itself. Countries are again divided into the 9 estimation groups listed above, which were chosen on the combined basis of broad economic similarity and geographic proximity.[13] Having generated response-probabilistic weights to correct for sample selection bias, the key issues at hand include 1) the precise model specification and 2) the choice of independent variables for estimating labour force participation.
In terms of model specification, taking into account the database structure and the existence of unobserved heterogeneity among the various countries in the EAPEP input database, the choice was made to use panel data techniques with country fixed effects, with the sample of reporting countries weighted using the $s_{it}(w)$ to correct for non-response bias.[14] By using fixed effects in this way, the “level” of known labour force participation rates in each reporting country is taken into account when estimating missing values in the reporting country, while in non-reporting countries, the weighted average fixed effect among reporting countries in each estimation group is used to estimate these countries’ labour force participation rates. More formally, the following linear model was constructed (and run on the logistically transformed labour force participation rates):
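$$Y_{it} = \alpha_i + x_{it}'\beta + e_{it} \qquad (8)$$
where $\alpha_i$ is the country-specific fixed effect, $Y_{it}$ is the logistic transform of the observed labour force participation rate $y_{it}$ in country i and year t, $x_{it}$ is a set of explanatory covariates of the labour force participation rate and $e_{it}$ is the error term. The main set of covariates included is listed in table 4.[15]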
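A weighted fixed-effects regression of this kind can be sketched with a single covariate, using the within transformation with weighted country means. This is an illustration under stated assumptions, not the ILO implementation: the data are synthetic, there is one covariate rather than the set in table 4, and the fixed effect for a non-reporting country is taken as the average of the reporters' fixed effects weighted by each country's total weight, which is one possible reading of "weighted average fixed effect".

```python
from collections import defaultdict

def weighted_fixed_effects(records):
    """Weighted least squares with country fixed effects and one covariate.

    records: list of (country, Y, x, weight), where Y is the transformed LFPR.
    Returns (beta, alpha_by_country, alpha_for_non_reporters).
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])  # weight, weight*Y, weight*x
    for country, Y, x, s in records:
        acc = sums[country]
        acc[0] += s
        acc[1] += s * Y
        acc[2] += s * x
    means = {c: (wy / w, wx / w) for c, (w, wy, wx) in sums.items()}

    # Slope from weighted deviations around each country's weighted means.
    num = den = 0.0
    for country, Y, x, s in records:
        y_bar, x_bar = means[country]
        num += s * (x - x_bar) * (Y - y_bar)
        den += s * (x - x_bar) ** 2
    beta = num / den

    alpha = {c: y_bar - beta * x_bar for c, (y_bar, x_bar) in means.items()}
    total_weight = sum(acc[0] for acc in sums.values())
    # Assumption: reporters' fixed effects are averaged by their total weight.
    alpha_missing = sum(sums[c][0] * alpha[c] for c in alpha) / total_weight
    return beta, alpha, alpha_missing

# Synthetic, noise-free data: Y = alpha_i + 2 * x, with alpha_A = 1.0, alpha_B = -0.5.
records = [
    ("A", 1.0, 0.0, 1.0), ("A", 3.0, 1.0, 1.0), ("A", 5.0, 2.0, 1.0),
    ("B", 0.5, 0.5, 2.0), ("B", 2.5, 1.5, 2.0),
]
beta, alpha, alpha_missing = weighted_fixed_effects(records)
print(beta, alpha, alpha_missing)
```

A prediction for a non-reporting country would then be `alpha_missing + beta * x` on the transformed scale, converted back to a participation rate with the inverse of the logistic transformation.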
Table 4. Independent variables in fixed-effects panel regression
|
Variable |