Finance and Economics Discussion Series: 2013-07

A Test for Selection in Matched Administrative Earnings Data*

Jesse Bricker

Federal Reserve Board

Gary V. Engelhardt

Syracuse University


Keywords: Sample selection, administrative data, survey data

Abstract:

We test whether individuals in the Health and Retirement Study who consented to have administrative earnings data matched to survey responses represent a non-random sample. For both men and women, there is a general pattern of negative selection across three measures of pre-entry labor-market behavior: labor-force participation, self-employment, and earnings. However, for some outcomes the estimates are not precise enough to draw firm conclusions. The strongest results are that men who consented were 4.7 percentage points less likely to be self-employed than those who did not, and women who consented earned 13 percent less than those who did not.

JEL Classification: J26, H2

1 Introduction

Administrative records matched to labor-market surveys represent an important innovation in the measurement of earnings. Such data have been compiled for various years of the Current Population Survey, the Survey of Income and Program Participation, and the Health and Retirement Study (HRS), and are often gathered for program evaluations. Individuals typically must give informed consent to have their earnings matched. Relatively little is known about whether empirical studies based on the matched earnings of consenters suffer from sample-selection bias, which would arise if consenters display systematically different labor-market behavior than non-consenters. In this paper, we develop a new test for non-random selection in administrative earnings data in the HRS by exploiting the differential timing of the consent process. We apply it to three labor-market outcomes: labor-force participation, self-employment, and log annual earnings.

2 Methods

We illustrate our methods by focusing on earnings. Let true earnings, y^{*} , be

 \displaystyle y_{i}^{*} =\theta x_{i} +\varepsilon _{i} , (1.1)

where  x is a vector of explanatory variables, and  \varepsilon is the disturbance term. Also, let  s^{*} be the net benefit to individual  i of consenting,

 \displaystyle s_{i}^{*} =\xi z_{i} +\delta c_{i} +v_{i} , (1.2)

modeled as a function of observable factors,  z, an unobservable monotonic index of the respondent's taste for data privacy,  c, and a random component,  v. We assume the net benefit is decreasing in privacy,  \delta <0. Define the consent indicator  s as

 \displaystyle \begin{array}{l} {s=1 \text{ if } s^{*} \ge 0} \\ {s=0 \text{ if } s^{*} <0.} \end{array} (1.3)

Then observed earnings (from administrative data), y, are

 \displaystyle \begin{array}{l} {y=y^{*} \text{ if } s=1} \\ {y \text{ missing if } s=0.} \end{array} (1.4)

There will be no sample-selection bias in estimates of the determinants of earnings based on the observed sample if

 \displaystyle E[y\vert x,s^{*} ]=E[y\vert x]. (1.5)

In principle, this could be tested directly by expanding (1.1),

 \displaystyle y_{i} =\theta x_{i} +\psi s_{i}^{*} +\varepsilon _{i} (1.6)

substituting in (1.2) and letting  z=x to yield

 \displaystyle y_{i} =\alpha x_{i} +\zeta c_{i} +u_{i} (1.7)

(where  \alpha =\theta +\psi \xi ,  \zeta =\psi \delta , and  u=\varepsilon +\psi v). In this case,

 \displaystyle E[y\vert x,s^{*} ]=E[y\vert x,c]=E[y\vert x] (1.8)

implies no selection bias. Hence, a test of  \zeta =0 based on parameter estimates using the sample of observed earnings is a test for sample-selection bias. Unfortunately, in practice this test typically is not feasible, because  c is unobserved.
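For completeness, the substitution that takes (1.6) to (1.7) can be written out step by step. The sketch below uses only the symbols already defined and sets  z=x, as in the text:

```latex
\begin{aligned}
y_{i} &= \theta x_{i} + \psi s_{i}^{*} + \varepsilon_{i} \\
      &= \theta x_{i} + \psi \left( \xi x_{i} + \delta c_{i} + v_{i} \right) + \varepsilon_{i} \\
      &= (\theta + \psi \xi) x_{i} + \psi \delta \, c_{i} + (\varepsilon_{i} + \psi v_{i}) \\
      &= \alpha x_{i} + \zeta c_{i} + u_{i} .
\end{aligned}
```

Because  \delta <0 by assumption,  \zeta =\psi \delta is zero only when  \psi =0, which is why a test of  \zeta =0 is a test of whether the net benefit of consenting enters the earnings equation.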

In our approach, we estimate a variant of (1.7) using a discrete-valued proxy for  c that we obtain from the differential timing of the HRS consent process. Specifically, we analyze the Original Cohort (OC), who entered the HRS in 1992. The cohort comprises individuals born 1931-41 and their spouses (regardless of age). At entry, OC individuals were asked for consent to link their survey responses to pre-entry administrative data on W-2 earnings and Form 1040 Schedule C self-employment income through 1991 (Olson, 1999; Bricker and Engelhardt, 2008). This is the initial consent (IC). Three-quarters of respondents consented (tabulated by sex in columns 1 and 2 of Table 1). This group has the lowest index values of  c. Then, in 2004-6, individuals were asked for consent to match earnings through 2003. This is the subsequent consent (SC). An additional 5.4% of individuals, none of whom had consented at entry, consented at this stage. This group has the next-lowest index values of  c. The remaining 19.6% of individuals never consented (NC). They have the highest values of  c. Therefore, the multiple-consent process establishes an ordering:

 \displaystyle c^{IC} <c^{SC} <c^{NC} . (1.9)

We use this to define an indicator,

 \displaystyle \begin{array}{l} {D=1 \text{ if Initial Consenter } (IC)} \\ {D=0 \text{ if Subsequent Consenter } (SC),} \end{array} (1.10)

and use it as a proxy for  c in (1.7) to yield

 \displaystyle y_{i} =\alpha x_{i} +\beta D_{i} +\upsilon _{i} . (1.11)

We estimate the parameters in (1.11) using the observed sample. Importantly, the differential timing of consent generates variation in  D within the observed sample, which identifies  \beta. We then test the null hypothesis that  \beta =0 (no difference in labor-market behavior between initial and subsequent consenters) against the alternative that  \beta \ne 0.

We test separately for men and women, because of well-established differences by sex in work behavior. The vector  x includes standard earnings determinants: a quadratic in age; dummy variables for race (white, black), educational attainment (high school degree or GED, some college, college graduate), foreign birth, marital status, and veteran status (men only); and a constant.
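The logic of the test can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not in the paper: a single covariate, a privacy taste c drawn uniformly on [0, 1], consent phases assigned by fixed cutoffs at 0.75 and 0.80 (chosen to roughly mimic the consent shares), and hypothetical helper names `ols` and `simulate_beta`. Earnings follow (1.7), never consenters are dropped, and (1.11) is estimated by OLS on the remaining sample.

```python
import random

def ols(rows, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

def simulate_beta(zeta, n=20000, seed=0):
    """Return the OLS coefficient on D from a simulated consenter sample."""
    rng = random.Random(seed)
    rows, y = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        c = rng.random()              # unobserved privacy taste (assumed uniform)
        if c >= 0.80:                 # never consenters: earnings are missing
            continue
        d = 1.0 if c < 0.75 else 0.0  # D = 1 initial, D = 0 subsequent consenter
        rows.append([1.0, x, d])      # constant, covariate, consent indicator
        y.append(1.0 + 0.5 * x + zeta * c + rng.gauss(0.0, 1.0))
    return ols(rows, y)[2]

print("beta-hat with zeta = 0:", round(simulate_beta(0.0), 3))
print("beta-hat with zeta = 1:", round(simulate_beta(1.0), 3))
```

Under these assumed cutoffs, the mean of c is 0.375 among initial consenters and 0.775 among subsequent consenters, so the estimate should be near zero when zeta = 0 and near -0.4 when zeta = 1.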

3 Results and Discussion

Table 2 gives selected descriptive statistics on the three consent groups. Panel A shows means for our three outcome variables from the pre-entry administrative data (1991). The first row of panel B shows the self-reported labor-force participation rate from the entry-wave survey (1992). The second row of that panel shows the percentage of respondents with item non-response for self-reported earnings via a "don't know" or "refusal." For men and women, this percentage is lowest for initial consenters (IC), higher for subsequent consenters (SC), and highest for never consenters (NC). This is consistent with the assumed ordering in (1.9), and (if the non-response is strategic) would suggest that respondents have similar tastes for earnings privacy in both matched and survey data. The third row shows that self-reported earnings among the sub-sample with no item non-response generally fall across the consent groups, whereas earnings including imputations are flatter (fourth row), which is not inconsistent with negative selection. Finally, panel C shows means for the demographic characteristics in  x and reinforces the finding of Haider and Solon (2000) that there are some, but not particularly large, observable differences in earnings determinants between consenters and non-consenters.

Table 3 presents probit estimates of  \beta in (1.11) for pre-entry (1991) labor-force participation, defined as having positive annual earnings or self-employment income. Standard errors are in parentheses; marginal effects are in square brackets. For brevity, the other parameter estimates are not shown. The estimate of  \beta in column 1 for men indicates that, conditional on standard determinants of labor-market behavior, there is small negative selection on participation. Men who consented at entry had an estimated 1.7 percentage point lower participation rate than those who subsequently consented. However, this effect is not statistically different from zero at conventional significance levels. Even if it were, it would be economically small relative to the labor-force participation rate of the subsequently matched of 76.7% (panel 3 of Table 1). For women (column 2), the estimate is likewise small and statistically insignificant, although positive in sign.

In Table 4, we restrict the sample to those in the labor force and present probit estimates of  \beta in (1.11) for pre-entry self-employment, defined as positive Schedule C income. The marginal effects in column 1 for men indicate selection: entry consenters had an estimated 4.7 percentage point lower self-employment rate than subsequent consenters (p = 0.046). This is economically sizable relative to the 19.5% self-employment rate of subsequent consenters (panel 3 of Table 1): a difference of almost 25% of that rate. The estimate for women in column 2 is similar in relative magnitude, but less precise.
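The relative magnitude quoted for men can be reproduced from the counts in Table 1 and the marginal effect in Table 4. The variable names below are illustrative only:

```python
# Subsequently matched men, counts from Table 1.
in_labor_force = 266
self_employed = 52

# Self-employment rate among subsequent consenters in the labor force.
base_rate = self_employed / in_labor_force

# Marginal effect of initial consent for men, from Table 4, column 1.
marginal_effect = -0.047

# Size of the effect relative to the subsequent consenters' rate.
relative_effect = marginal_effect / base_rate

print(f"base rate: {base_rate:.1%}")              # base rate: 19.5%
print(f"relative effect: {relative_effect:.1%}")  # relative effect: -24.0%
```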

Next, we limit the sample to those in the labor force and not self-employed, then examine the extent of selection in pre-entry log annual earnings. Figures 1 and 2 show unconditional non-parametric kernel density estimates of the distributions of log earnings by consent phase for men and women, respectively, based on an Epanechnikov kernel. Although visually there are some differences between groups, non-parametric tests (Kolmogorov-Smirnov and Wilcoxon Rank-Sum) fail to reject the null hypothesis that entry and subsequent consenters came from the same earnings distribution for each sex.
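As an illustration of the density estimator behind Figures 1 and 2, the following is a minimal sketch of a kernel density estimate with an Epanechnikov kernel. The data are synthetic draws, not HRS earnings, and the bandwidth of 0.5 is an arbitrary assumption, since the bandwidth used for the figures is not reported here.

```python
import random

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kde(data, grid, bandwidth):
    """Kernel density estimate of `data` evaluated at each point of `grid`."""
    n = len(data)
    return [sum(epanechnikov((g - x) / bandwidth) for x in data) / (n * bandwidth)
            for g in grid]

rng = random.Random(0)
log_earnings = [rng.gauss(10.0, 1.0) for _ in range(500)]  # synthetic, not HRS data
step = 0.05
grid = [3.0 + step * i for i in range(241)]                # log earnings from 3 to 15
density = kde(log_earnings, grid, bandwidth=0.5)           # assumed bandwidth

# Riemann-sum check that the estimated density integrates to about one.
area = sum(density) * step
print(round(area, 3))
```

Computing one such curve per consent group and overlaying them gives the kind of comparison the figures describe.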

Table 5 presents OLS estimates of  \beta in (1.11) for log annual earnings as the labor-market outcome. For men (column 1), entry consenters had 3.3% lower earnings than subsequent consenters. This effect is economically small and not statistically different from zero at conventional significance levels.

To explore impacts across the earnings distribution, we estimated the parameters in (1.11) for each quantile of the conditional distribution using the least absolute deviations (LAD) estimator. The solid line in Figure 3 shows the associated estimate of  \beta in (1.11) for each (whole-numbered) quantile. The dashed lines demarcate the boundaries of the 95% confidence interval based on 299 bootstrap replications. For men, there is little evidence of selection across the earnings distribution.
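A simplified version of this exercise can be sketched as follows. With no covariates, the quantile-regression coefficient on a binary indicator reduces to the difference between the two groups' sample quantiles, so the sketch omits  x entirely, unlike the estimates in the paper. The data are synthetic, and the resampling within each group and the percentile interval are assumptions made here; only the 299 replications come from the text.

```python
import random

def quantile(values, q):
    """Sample quantile by linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def quantile_gap(initial, subsequent, q):
    """Difference in the q-th quantile: initial minus subsequent consenters."""
    return quantile(initial, q) - quantile(subsequent, q)

def bootstrap_ci(initial, subsequent, q, reps=299, seed=0):
    """Percentile 95% interval, resampling each group with replacement."""
    rng = random.Random(seed)
    draws = [
        quantile_gap(rng.choices(initial, k=len(initial)),
                     rng.choices(subsequent, k=len(subsequent)), q)
        for _ in range(reps)
    ]
    return quantile(draws, 0.025), quantile(draws, 0.975)

rng = random.Random(1)
initial = [rng.gauss(10.0, 1.0) for _ in range(2000)]    # synthetic log earnings
subsequent = [rng.gauss(10.0, 1.0) for _ in range(200)]  # same distribution: no selection
estimate = quantile_gap(initial, subsequent, 0.5)
lower, upper = bootstrap_ci(initial, subsequent, 0.5)
print(round(estimate, 3), round(lower, 3), round(upper, 3))
```

Repeating the calculation over q = 0.01, ..., 0.99 would trace out a solid line and two dashed bounds analogous to those in Figure 3.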

For women, the OLS estimates of  \beta in (1.11) with log annual earnings as the outcome are shown in column 2 of Table 5. Entry consenters had 13% lower earnings than subsequent consenters, an effect that is economically large and statistically different from zero at the 10% significance level. The LAD estimates of  \beta in Figure 4 indicate that this negative selection effect is spread evenly across the earnings distribution.
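The significance statements can be checked approximately from the coefficients and standard errors reported in Tables 3-5. The sketch below uses a standard normal approximation for every estimate, which is an assumption here (for the OLS estimates a t distribution would apply, though the two are close in large samples). The function name `z_and_p` is illustrative.

```python
import math

def z_and_p(coef, se):
    """z-ratio and two-sided p-value under a standard normal approximation."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p

# (coefficient, standard error) pairs as reported in Tables 3, 4, and 5.
estimates = {
    "participation, men": (-0.067, 0.088),
    "participation, women": (0.024, 0.074),
    "self-employment, men": (-0.187, 0.093),
    "self-employment, women": (-0.106, 0.120),
    "log earnings, men": (-0.033, 0.072),
    "log earnings, women": (-0.130, 0.079),
}

for label, (coef, se) in estimates.items():
    z, p = z_and_p(coef, se)
    print(f"{label}: z = {z:.2f}, p = {p:.3f}")
```

Because the tabulated inputs are rounded to three decimals, these p-values only approximate the ones computed from the underlying estimates, such as the p = 0.046 reported for men's self-employment.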

An issue that arises with our method is that the subsequent consenters are drawn from the pool of respondents still active in the study in 2004-6, many years after entry. This group itself is potentially selected through differential mortality and attrition from the study. As a robustness check, we repeated the empirical analysis, limiting the sample to initial and subsequent consenters who were still in the study in 2006, and the results were qualitatively and quantitatively similar.

4 Conclusion

Over the last twenty years, there has been a well-documented decline in household survey response rates and respondent cooperation. This has led to greater efforts to match administrative data to survey responses to mitigate measurement error and bolster data quality. We present a method to test for non-random selection in administrative earnings data that relies on the differential timing of the informed consent process typically required for administrative data linkages. The method is applicable to longitudinal surveys that use multiple attempts to obtain consent.

References

Bricker, Jesse, and Gary V. Engelhardt, "Measurement Error in Earnings Data in the Health and Retirement Study," Journal of Economic and Social Measurement 33:1 (2008): 39-61.

Haider, Steven J., and Gary Solon, "Nonrandom Selection in the HRS Social Security Earnings Sample," Working Paper No. 00-01, RAND Labor and Population Program, 2000.

Olson, Janice A., "Linkages with Data from Social Security Administrative Records in the Health and Retirement Study," Social Security Bulletin 62 (1999): 73-85.


Table 1. Construction of the Sample by Consent Phase and Sex

Sample (1) Men (2) Women
Full Cohort: Number in Cohort 5,812 6,730
Full Cohort: Without Matched Social Security Records as of 2006 1,189 1,277
Full Cohort: % Unmatched 20.5% 19.0%
Full Cohort: With Matched Social Security Records as of 2006 4,623 5,453
Full Cohort: % Matched 79.5% 81.0%
Initially Matched: Number with Matched Social Security Records 4,295 5,109
Initially Matched: % Initially Matched 73.9% 75.9%
Initially Matched: Out of the Labor Force in 1991 903 1,697
Initially Matched: In the Labor Force in 1991 3,392 3,412
Initially Matched: In the Labor Force, With Self-Employment Income 509 281
Initially Matched: In the Labor Force, No Self-Employment Income 2,883 3,131
Subsequently Matched: Number with Matched Social Security Records 328 344
Subsequently Matched: % Subsequently Matched 5.6% 5.1%
Subsequently Matched: Out of the Labor Force in 1991 62 116
Subsequently Matched: In the Labor Force in 1991 266 228
Subsequently Matched: In the Labor Force, With Self-Employment Income 52 22
Subsequently Matched: In the Labor Force, No Self-Employment Income 214 206

Table 2. Means for Selected Characteristics, by Timing of Consent and Sex
Variable (1) Men: Initial Consent (2) Men: Subsequent Consent (3) Men: Never Consent (4) Women: Initial Consent (5) Women: Subsequent Consent (6) Women: Never Consent
Panel A. Pre-Entry Labor-Market Activity (1991) from Administrative Data: In the Labor Force (%) 79.0 81.1 - 66.8 66.3 -
Panel A. Pre-Entry Labor-Market Activity (1991) from Administrative Data: Self-Employed (%) 11.9 15.6 - 5.5 6.4 -
Panel A. Pre-Entry Labor-Market Activity (1991) from Administrative Data: Earnings ($) 22,023 22,071 - 10,840 12,414 -
Panel B. Entry-Wave Labor-Market Activity (1992) from Self-Reported Data: In the Labor Force (%) 71.6 76.8 71.4 62.3 63.1 59.5
Panel B. Entry-Wave Labor-Market Activity (1992) from Self-Reported Data: Earnings Item Non-Response (%) 7.7 19.2 23.8 7.4 14.5 17.3
Panel B. Entry-Wave Labor-Market Activity (1992) from Self-Reported Data: Earnings, Conditional on No Item Non-Response ($) 27,348 24,947 23,963 12,073 12,866 10,734
Panel B. Entry-Wave Labor-Market Activity (1992) from Self-Reported Data: Earnings, Including Imputations for Item Non-Response ($) 27,645 25,652 26,534 12,728 14,231 12,617
Panel C. Demographics: White (%) 75.8 66.2 70.1 73.0 61.6 62.5
Panel C. Demographics: Black (%) 13.9 17.7 17.2 16.4 19.2 22.7
Panel C. Demographics: High School (%) 35.3 32.6 31.6 41.2 33.1 37.7
Panel C. Demographics: Some College (%) 18.7 18.0 18.9 19.0 23.0 20.6
Panel C. Demographics: College Graduate (%) 19.7 18.3 20.9 13.9 17.2 13.2
Panel C. Demographics: Foreign-Born (%) 9.9 12.8 10.6 10.5 15.7 13.9
Panel C. Demographics: Married (%) 87.6 90.2 84.9 76.1 75.9 74.5
Panel C. Demographics: Age (Years) 55.9 55.9 55.7 52.7 52.5 52.7
Panel C. Demographics: Veteran (%) 56.9 50.0 55.4 - - -

Table 3. Probit Estimates of Labor-Force Participation by Sex, Standard Errors in Parentheses, Marginal Effects in Brackets

Explanatory Variable (1) Men (2) Women
Initial Consent -0.067 0.024
Initial Consent (standard error) (0.088) (0.074)
Initial Consent [marginal effects] [-0.017] [0.008]

Note: Standard errors in () and marginal effects in []

Table 4. Probit Estimates of Self-Employment by Sex, Standard Errors in Parentheses, Marginal Effects in Brackets

Explanatory Variable (1) Men (2) Women
Initial Consent -0.187 -0.106
Initial Consent (standard error) (0.093) (0.120)
Initial Consent [marginal effects] [-0.047] [-0.017]

Note: Standard errors in () and marginal effects in [].

Table 5. OLS Parameter Estimates for Log Earnings by Sex and HRS, Standard Errors in Parentheses

Explanatory Variable (1) Men (2) Women
Initial Consent -0.033 -0.130
Initial Consent (standard error) (0.072) (0.079)

Note: Standard errors in ()


Figure 1: Kernel Density Estimates of the 1991 Earnings Distribution for Working Men in the Original Cohort by Match Phase.

Figure 1 is a plot of two kernel densities of working men's 1991 earnings. The density of earnings of the Initial Match group (those who consented in the 1992-1996 period) is plotted as a solid line, and the density of earnings of the Subsequent Match group (those who consented in the 2004-2006 period) is plotted as a dashed line. The y-axis shows the density height and runs from 0 to 0.6. The x-axis shows the natural log of 1991 earnings and runs from 3 to 12. The two lines follow nearly the same path. Starting at the left (when the x-axis is 3), both lines have a height of about zero until the x-axis equals 6. Starting when the x-axis equals 6, both lines slowly increase in height until the x-axis equals 9, at which point both lines rise rapidly until they reach an apex at roughly 10.5 on the x-axis. The dashed line rises a bit earlier and has a slightly lower peak. Both lines fall rapidly from 10.5 to 13 on the x-axis.


Figure 2: Kernel Density Estimates of the 1991 Earnings Distribution for Working Women in the Original Cohort by Match Phase.

Figure 2 is a plot of two kernel densities of working women's 1991 earnings. The density of earnings of the Initial Match group (those who consented in the 1992-1996 period) is plotted as a solid line, and the density of earnings of the Subsequent Match group (those who consented in the 2004-2006 period) is plotted as a dashed line. The y-axis shows the density height and runs from 0 to 0.5. The x-axis shows the natural log of 1991 earnings and runs from 3 to 12. The two lines follow nearly the same path. Starting at the left (when the x-axis is 3), both lines have a height of about zero until the x-axis equals 6. Starting when the x-axis equals 6, both lines slowly increase in height until the x-axis equals 8, at which point both lines rise rapidly until they reach an apex at roughly 9.5 on the x-axis. The solid line rises a bit earlier and has a slightly higher peak. Both lines fall rapidly from 10.5 to 13 on the x-axis, though the solid line falls more rapidly than the dashed line.


Figure 3: Quantile Regression Estimates and 95% Confidence Interval of Initial Match on 1991 Log Earnings for Men in the Original Cohort.

Figure 3 is a line plot of 98 regression coefficients (one for each quantile of the men's 1991 log earnings distribution, starting at the first quantile and ending at the 99th) and the associated 95 percent confidence interval around each estimate. The estimates are from a model that relates initial consent to the log of 1991 earnings, so the y-axis is labeled Impact on 1991 Log Earnings. The estimates for each quantile are plotted across the x-axis, which runs from 5 to 95. The y-axis shows the value of the estimate and the bounds of the 95 percent confidence interval around it. The estimates are plotted as a solid line. The upper and lower confidence interval bounds are plotted as dashed lines. The solid line is fairly flat between the 10th and uppermost quantile and is always close to zero on the y-axis. The solid line lies in negative y-axis territory from the 1st to the 10th quantile. The upper confidence bound always lies above zero on the y-axis, and the lower confidence bound always lies below zero. Overall, then, the plot shows that the coefficient estimates are not different from zero at each of the 98 plotted quantiles.


Figure 4: Quantile Regression Estimates and 95% Confidence Interval of Initial Match on 1991 Log Earnings for Women in the Original Cohort.

Figure 4 is a line plot of 98 regression coefficients (one for each quantile of the women's 1991 log earnings distribution, starting at the first quantile and ending at the 99th) and the associated 95 percent confidence interval around each estimate. The estimates are from a model that relates initial consent to the log of 1991 earnings, so the y-axis is labeled Impact on 1991 Log Earnings. The estimates for each quantile are plotted across the x-axis, which runs from 5 to 95. The y-axis shows the value of the estimate and the bounds of the 95 percent confidence interval around it. The estimates are plotted as a solid line. The upper and lower confidence interval bounds are plotted as dashed lines. The solid line is almost always below zero on the y-axis. The lower confidence bound always lies below zero on the y-axis. The upper confidence bound lies above zero for most quantiles, except above the 85th, where it is sometimes below zero. Overall, then, the plot shows that the coefficient estimates are mostly not different from zero, except at the upper quantiles, where they are sometimes negative.


Footnotes

* The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors.