Subsampling rules for item non response of an estimator based on the combination of regression and ratio
This article was originally published by Elsevier and was migrated to Scientific Scholar after the change of Publisher.
Peer review under responsibility of King Saud University.
Introduction
Survey sampling models assume the existence of a finite population U={u1,…,uN}, whose units are perfectly identifiable, and that a sample s of size n ≤ N is selected from U. Another assumption is that the variable of interest Y is measured in each selected unit. Unfortunately, real-life surveys must deal with missing observations. The existence of non-response suggests that the population U is divided into two strata: U1, which groups the units that respond at the first visit, and U2, which contains the rest of the individuals. This is the so-called ‘response strata’ model, the framework proposed by Hansen and Hurwitz (1946); see textbooks such as Arnab (2017), Singh (2003), and Lohr (2010).
The behavior of estimators based on subsampling depends heavily on the subsampling rule used. Alternatives to the Hansen-Hurwitz rule have been proposed; see, for example, Srinath (1971) and Bouza (1981).
The quality of the inquiries depends on the response rate. One question is how many non-respondents should be subsampled to achieve a good response rate. If sufficient information is available from prior rounds or other sources, the decision can be based on the experience of the sampler, who analyzes the response rate, design effect, costs, etc. The role of non-response in the accuracy of estimation is still generating discussion among statisticians; see the enlightening discussion in Särndal and Lundquist (2014). In a seminal paper, Hansen and Hurwitz (1946) suggested subsampling non-respondents to alleviate the effect of missing data.
Many sampling models seek to increase the survey’s precision by using information on an auxiliary variable. That is the case of ratio, regression and product estimators. Singh and Kumar (2009) developed a general class of estimators for the population mean of the variable of interest Y, using information on two auxiliary variables, when missing observations are present. The class includes some well-known estimators.
In this paper we consider the case in which information on the variable of interest is missing, but information on the auxiliary variables is available for the sample, together with their population means. Singh and Kumar (2009) considered the subsampling rule proposed by Hansen and Hurwitz (1946). This paper extends their results by studying the effect of using the rules of Srinath (1971) and Bouza (1981) with estimators of the class. The behavior of the estimators in the class is analyzed in terms of accuracy and cost for each rule. The approximate errors are given.
Section 2 presents the basic issues of the non-response procedures. Section 3 is devoted to analyzing the statistical properties of the estimators; it compares the effects of the three subsampling rules on the variance and on the cost function, and presents a numerical study, based on real-life studies, in which the rules are evaluated. Finally, Section 4 gives some concluding remarks.
The non-response problem
It is increasingly common to subsample non-respondents to increase response rates at a reduced cost. The usual theory of survey sampling is developed assuming that the finite population is composed of individuals that can be perfectly identified. Assume that a sample s of size n ≤ N is selected using simple random sampling with replacement (SRSWR). The variable of interest Y is to be measured in each selected unit. Real-life surveys must deal with the existence of missing observations. This fact establishes that the population is divided into two strata: U1, the units that respond at the first visit, and U2, the remaining units.
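Then we may distinguish the strata parameters (the sizes N1 and N2, the weights W1 = N1/N and W2 = N2/N, and the mean and variance of Y within each stratum) as well as the population ones (the mean and variance of Y in U).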
There are three ways to cope with this fact: ignoring the non-respondents, imputing the missing values, or subsampling the non-respondents. Ignoring the non-responses is rarely a good solution, as the values of Y in the units belonging to U2 may be very different from those in U1. Imputation of the missing data depends on having an adequate model of the non-response mechanism and reliable information for predicting Y for each non-respondent. Subsampling the non-respondents is a conservative solution. Theoretically, subsampling the non-respondent stratum is a particular case of Double Sampling (DS); see Bouza et al. (2011) for a motivating discussion on the subject. It was first proposed by Hansen and Hurwitz (1946). Its use increases the costs but provides the confidence of estimating with information on U2. Deciding which subsampling procedure is to be used is of practical importance; see Thompson and Washington (2013), Torres van Grinsven et al. (2014), Andridge and Thompson (2015) and Heffetz and Reeves (2016). It therefore makes sense to analyze the behavior of alternatives to the Hansen-Hurwitz rule. The rules of Srinath (1971) and Bouza (1981) are reported in the literature as other rules for determining the subsample size. They fix the size of the subsample to be drawn from the set of non-respondents.
Let us present a general algorithm for implementing one of the subsampling procedures.
Subsampling algorithm
Step 1. Select a sample s from U using simple random sampling with replacement (SRSWR).
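Step 2. Evaluate Y among the subsample of the respondents s1 = s ∩ U1, determine n1 = |s1| and compute the mean of the respondents, ȳ1 = (1/n1) Σ y_i, i ∈ s1. (2.1)
Step 3. Determine the set of non-respondents s2 = s \ s1 and its size n2 = n − n1.
Step 4. Fix the subsample size n2′ = θ n2, 0 < θ ≤ 1.
Step 5. Select using SRSWR a sub-sample s2′ of size n2′ from s2.
Step 6. Evaluate Y among the units in s2′ and compute ȳ2′ = (1/n2′) Σ y_i, i ∈ s2′. (2.2)
Step 7. Compute the estimate of the population mean, ȳ = (n1/n) ȳ1 + (n2/n) ȳ2′. (2.3)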
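The seven steps can be sketched in Python as follows. This is only an illustrative sketch: the function name `subsample_estimate`, the boolean list `responds` marking first-visit respondents, the rounding of θ·n2 to an integer of at least one unit, and the handling of samples with no respondents or no non-respondents are assumptions made for the example, not part of the original description.

```python
import random


def subsample_estimate(population, responds, n, theta, rng):
    """One run of the subsampling procedure; returns the combined estimate.

    population: list of Y values, one per unit of U.
    responds:   parallel list of booleans (True = answers at first visit).
    theta:      subsampling fraction, 0 < theta <= 1 (assumed already fixed).
    """
    # Step 1: SRSWR sample of n unit indices from U.
    s = [rng.randrange(len(population)) for _ in range(n)]
    # Step 2: respondents s1, their number n1 and their mean.
    s1 = [i for i in s if responds[i]]
    n1 = len(s1)
    mean1 = sum(population[i] for i in s1) / n1 if n1 else 0.0
    # Step 3: non-respondents s2 and their number n2 = n - n1.
    s2 = [i for i in s if not responds[i]]
    n2 = len(s2)
    if n2 == 0:
        return mean1  # nobody to revisit
    # Step 4: subsample size, rounded and kept at one unit or more.
    n2_sub = max(1, round(theta * n2))
    # Step 5: SRSWR subsample drawn from s2.
    s2_sub = [rng.choice(s2) for _ in range(n2_sub)]
    # Step 6: mean of Y among the revisited non-respondents.
    mean2 = sum(population[i] for i in s2_sub) / n2_sub
    # Step 7: weighted combination of both means.
    return (n1 / n) * mean1 + (n2 / n) * mean2
```

Averaging the returned value over many independent runs gives an empirical check of the unbiasedness discussed next.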
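As s1 is a subsample of U1, (2.1) is an unbiased estimator of the mean of the response stratum. The subsample selected from the non-respondents provides (2.2), which is a conditionally unbiased estimator of the mean of Y among the non-respondents in s. The usual analysis of the behavior of (2.3) is based on writing it as the mean of Y over the whole sample s plus a term that depends on the difference between (2.2) and the mean of Y among the non-respondents in s. The first term is the sample mean of s, hence it is unbiased for the population mean, and the second term has conditional expectation zero. Therefore, (2.3) is an unbiased estimator of the population mean.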
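The variance of the first term is the population variance of Y divided by n.
The conditional variance of the second term, given s, depends on the subsample size and hence on θ.
Note that n2 is a Binomial random variable with parameters n and W2, hence its expectation is nW2.
Owing to independence, the expectation of the cross product is zero, and the well-known expression for the expected variance of (2.3) is obtained by adding both terms.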
The value of θ determines the value of the second term in the expected error, so the subsampling rules amount to determining θ. The existing rules are: the rule of Hansen and Hurwitz (1946), HH, which takes θ = 1/K with K ≥ 1; the rule of Srinath (1971), S, in which θ depends on a constant H; and the rule of Bouza (1981), B, in which θ is proportional to the proportion of non-responses in the sample.
In the rules of Hansen-Hurwitz and Srinath the constant (K or H) is fixed by the sampler. Bouza’s rule is determined by the observed proportion of non-responses; hence it is a randomized rule, and fixing the subsample size does not depend on the expertise of the sampler.
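A minimal sketch of the three rules as functions returning θ may help to compare them. The Hansen-Hurwitz form θ = 1/K is the one stated in the text; the forms used here for Srinath's rule, θ = n2/(Hn + n2), and for Bouza's rule, θ = n2/n, are assumptions of this sketch (they are the forms usually attributed to those authors and they are consistent with the values tabulated later), and all function names are hypothetical.

```python
def theta_hansen_hurwitz(K):
    """theta = 1/K, with K >= 1 fixed by the sampler."""
    if K < 1:
        raise ValueError("K must be at least 1")
    return 1.0 / K


def theta_srinath(n, n2, H):
    """Assumed form of Srinath's rule: n2' = n2**2 / (H*n + n2)."""
    if H <= 0:
        raise ValueError("H must be positive")
    return n2 / (H * n + n2)


def theta_bouza(n, n2):
    """Assumed form of Bouza's rule: theta equals the observed
    proportion of non-respondents, so n2' = n2**2 / n."""
    return n2 / n
```

For instance, with n = 100 sampled units of which n2 = 20 do not respond, `theta_bouza(100, 20)` gives 0.2, whereas the other two rules require the sampler to choose K or H first.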
When auxiliary information X is available, ratio estimators are commonly used. Under non-response the population mean of X is known, and the following estimators are computed:
The ratio estimator in this case is given by
The regression estimator is
In the next section we will consider the estimation problems.
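As a small illustration of the ratio and regression estimators mentioned above, the following sketch uses their textbook forms. It assumes that `y_mean` is the non-response-adjusted estimate of the mean of Y, that `x_mean_sample` is the sample mean of X, and that the slope `b` is supplied by the user; the function and parameter names are hypothetical.

```python
def ratio_estimate(y_mean, x_mean_sample, x_mean_pop):
    """Textbook ratio estimator: y_mean * (X_bar / x_bar)."""
    if x_mean_sample == 0:
        raise ValueError("sample mean of X must be non-zero")
    return y_mean * x_mean_pop / x_mean_sample


def regression_estimate(y_mean, x_mean_sample, x_mean_pop, b):
    """Textbook regression estimator: y_mean + b * (X_bar - x_bar)."""
    return y_mean + b * (x_mean_pop - x_mean_sample)
```

Both functions move the estimate of the mean of Y upward when the sample mean of X falls below its known population mean and Y and X are positively related.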
The estimation problem
The class of estimators of Singh-Kumar
Singh and Kumar (2009) developed a class of estimators for the population mean of Y when auxiliary information on two variables X and Z is available and non-responses are present. The sampling design analyzed was a double sampling (DS) design. They derived expressions of the mean squared error (MSE) for the estimators of the proposed class. Take
Consider that we deal with missing information in the variable of interest (item non-response). Hence, we have responses in x and z when a unit belongs to s. Following the model, we are going to estimate the mean of Y using (2.4). We may compute
Take α as a fixed scalar. The estimators of this class are characterized by the general formula
It is assumed that the population means of X and Z are known.
Let us use a Taylor Series development for (2.4). Take
For accepting the validity of the development in Taylor Series is necessary that .
From the results of Singh and Kumar (2009) we may write
where
. Using the corresponding development for expanding (3.1) we have that
The approximation of the expectation, variances and covariances of the errors are developed considering that the terms of order larger than 2 are negligible. Then we may write:
where
Let us look for the approximate bias and variance of the estimator. Considering again that the terms of order larger than 2 are negligible, we have the expansion
Its expectation is equal to
Note that only εy is affected by the existence of missing observations. Squaring both sides and taking expectations, the approximate variance is obtained:
The value of α determines a particular member of the class. An optimal estimator may be obtained by minimizing (3.5) with respect to α. The optimum value depends on unknown population parameters; see Singh and Kumar (2009) for a detailed discussion on the members of this class.
A comparison of estimators
It is well known that to the first degree of approximation of the Taylor Series the conditional variances are
For the regression estimator the conditional variance is
Noting that
we have that if is known
Hence is better than the other estimators. Singh and Kumar (2011) pointed out that, even if is unknown, is to be preferred, if the sampler evaluates for which feasible values of it behaves better.
An evaluation of the magnitude of the gain in accuracy due to the use of was developed using real life data.
We developed the numerical analysis using population data obtained in three studies. A brief description of them follows.
Problem 1. 793 factories contaminate a source of water. They were inspected and the following variable was obtained:
X = percent of samples with an index superior to the permitted level.
The historical report of this percent was also known:
Z = historical percent of samples with an index superior to the permitted level.
The managers improved the collection of solid contaminants, took samples and reported
Y = percent of samples reported with an index superior to the permitted level.
N2 = 104 factories did not send the report. They were visited and Y was obtained.
The parameters of interest are
Problem 2. 120 persons with HIV were included in an experiment with a new drug. The hemoglobin level was one of the measurements taken. The variables involved were
X = measurement of hemoglobin before starting with the treatment.
Z = first measurement of hemoglobin after starting with the treatment.
Y = measurement of hemoglobin 6 months after starting with the treatment.
N2 = 51 patients did not visit the hospital. They were visited and Y was obtained.
The parameters of interest are
Problem 3. 1840 farmers increased the area of their farms. The interest was to evaluate the tax to be paid. The variables involved were
X = initial area of the farms in hectares.
Z = current area of the farms in hectares.
Y = harvested area in hectares.
N2 = 176 farmers did not return the tax form. They were visited and Y was obtained.
The parameters of interest are
The resulting gains in accuracy are presented in Table 1.
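Table 1. Gains in accuracy, one column per estimator compared.

| Problem | Gain | Gain | Gain |
|---|---|---|---|
| 1 | 84.06 | 13.16 | 16.47 |
| 2 | 1.34 | 0.20 | 0.91 |
| 3 | 49.92 | 0.04 | 0.41 |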
Table 1 suggests that the improvements in accuracy are very large with respect to one of the competing estimators. With respect to the other estimators, the error decreases only slightly in the studies of HIV patients and farmers.
A comparison of the subsampling rules performance
From the above discussion it is clear that the preference for a certain subsampling rule does not affect the comparison of the estimators. Note, however, that the effect of using a certain rule is important when we calculate the expected variance. We are therefore interested in evaluating the behavior of the expectations under each rule, that is, in comparing the expected variances obtained with the different rules.
The use of Hansen-Hurwitz’s rule, HH, fixes that θ = 1/K, K ≥ 1. Then its use yields that
When we use the rule of Srinath (1971), S, we have that
After some algebra we derive that
. Substituting in the conditional variance, we have
Comparing this term with (3.5), we should prefer HH to S whenever
That is if
A similar analysis of the use of the rule of Bouza (1981) requires considering the new structure. Due to the randomness of θ we have that the conditional variance is
Its expectation is
Note that HH is to be preferred to B when
That is if as and K ≥ 1 we may fix a value of K that satisfies this relationship.
Comparing S and B, the former generates a smaller coefficient if the following inequality is satisfied
This relationship suggests that S may be preferred if values of H smaller than 1 are used.
Considering the costs, we may use the cost function
Its expectation depends on the subsampling rule. The results are
Accepting that
, t > 2, a development in Taylor Series allows deriving that
We have that for B
In terms of the expected costs, we may look for the preference of the rules. We have:
A comparison with S yields the preference rule,
It is easily derived that none of the rules can be simultaneously more accurate and cheaper than either of the other two.
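The trade-off can be made concrete by simulating the expected number of revisited non-respondents under each rule, which is presumably the rule-dependent part of the expected cost. The sketch below assumes that n2 is Binomial(n, W2), as stated earlier, and reuses the assumed forms n2′ = n2/K (HH), n2′ = n2²/(Hn + n2) (S) and n2′ = n2²/n (B); the function name and the default number of runs are hypothetical choices.

```python
import random


def expected_subsample_sizes(n, W2, K, H, runs=20000, seed=0):
    """Monte Carlo averages of the subsample size n2' under each rule."""
    rng = random.Random(seed)
    totals = {"HH": 0.0, "S": 0.0, "B": 0.0}
    for _ in range(runs):
        # Number of non-respondents in an SRSWR sample of size n.
        n2 = sum(rng.random() < W2 for _ in range(n))
        totals["HH"] += n2 / K                   # theta = 1/K
        totals["S"] += n2 * n2 / (H * n + n2)    # assumed Srinath form
        totals["B"] += n2 * n2 / n               # assumed Bouza form
    return {rule: total / runs for rule, total in totals.items()}
```

A rule with a smaller expected subsample size is cheaper, but it uses a smaller θ and therefore inflates the second term of the expected variance, which is the tension described in the text.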
We used the results reported for four populations by Azeem and Hanif (2017) to establish adequate values of the parameters of the subsampling rules. In the next table, N is the total number of units in the population, N1 the number of units responding to the survey questions and N2 the number of units that do not respond; the remaining two entries are the population variance of the variable of interest and its variance in the non-respondent part of the population (Table 2).
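Table 2. Parameters of the four populations.

| Population | N | N1 | N2 | Population variance | Variance among non-respondents |
|---|---|---|---|---|---|
| 1 | 5000 | 4500 | 500 | 102.007 | 99.99174 |
| 1 | 5000 | 4250 | 750 | 102.007 | 100.8224 |
| 1 | 5000 | 4000 | 1000 | 102.007 | 103.2349 |
| 2 | 5000 | 4500 | 500 | 97.1206 | 94.5457 |
| 2 | 5000 | 4250 | 750 | 97.1206 | 98.2761 |
| 2 | 5000 | 4000 | 1000 | 97.1206 | 96.0935 |
| 3 | 5000 | 4500 | 500 | 101.2633 | 5000 |
| 3 | 5000 | 4250 | 750 | 101.2633 | 5000 |
| 3 | 5000 | 4000 | 1000 | 101.2633 | 5000 |
| 4 | 5000 | 4500 | 500 | 25.441 | 5000 |
| 4 | 5000 | 4250 | 750 | 25.441 | 5000 |
| 4 | 5000 | 4000 | 1000 | 25.441 | 5000 |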
Note that for populations 1 and 2 the variance of the non-respondents is similar to the overall variance. Populations 3 and 4 have a considerably larger variance in the non-respondent stratum than in the population. We will use the weights of the non-respondent stratum observed in these inquiries, W2 = 0.10, 0.15 and 0.20.
We fixed a set of values of H in Table 3 for comparing HH with S in terms of their accuracy.
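Table 3. Comparison of HH with S in terms of accuracy, by W2 (rows) and H (columns).

| W2 | H = 0.1 | H = 0.3 | H = 0.5 | H = 0.7 | H = 1 |
|---|---|---|---|---|---|
| 0.1 | 2.00 | 4.00 | 6.00 | 8.00 | 11.00 |
| 0.15 | 1.67 | 3.00 | 4.33 | 5.67 | 7.67 |
| 0.2 | 1.50 | 2.50 | 3.50 | 4.50 | 6.00 |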
Note that for H = 0.1, fixing a value of K for which HH is to be preferred implies using large subsample sizes. An increase in the size of the non-respondent stratum also requires larger subsample sizes for HH’s rule to be preferred.
Table 4 illustrates that for small subsampling sizes HH may have a better accuracy than the rule of Bouza. S will have the same behavior for relatively small values of H.
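Table 4. Comparison of HH with B in terms of accuracy.

| W2 | W1 | 1 + W1/W2 |
|---|---|---|
| 0.1 | 0.90 | 10 |
| 0.15 | 0.85 | 6.67 |
| 0.2 | 0.80 | 5.00 |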
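The entries of Tables 3 and 4 can be reproduced with a few lines of Python. The sketch rests on an inference rather than on a formula given in the text: the tabulated values of Table 3 coincide with 1 + H/W2, which is what results from equating 1/K with the assumed Srinath fraction n2/(Hn + n2) after replacing n2/n by W2, and those of Table 4 are 1 + W1/W2 = 1/W2. The function names are hypothetical.

```python
def k_bound_vs_srinath(H, W2):
    """Value 1 + H/W2, which matches the entries of Table 3."""
    return 1 + H / W2


def k_bound_vs_bouza(W2):
    """Value 1 + W1/W2 = 1/W2, as tabulated in Table 4."""
    W1 = 1 - W2
    return 1 + W1 / W2


for W2 in (0.1, 0.15, 0.2):
    row = [round(k_bound_vs_srinath(H, W2), 2) for H in (0.1, 0.3, 0.5, 0.7, 1)]
    print(W2, row, round(k_bound_vs_bouza(W2), 2))
```

Under this reading, HH is at least as accurate as the competing rule when K does not exceed the tabulated bound, since a smaller K means a larger subsampling fraction.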
The analysis of the costs is presented in the next two tables.
We prefer HH when K exceeds the tabulated bound. Table 5 shows that, in the populations analyzed, the relation holds for K > 1.20, which may be easily satisfied in practice.
The analysis of the costs associated with B needs to take the sample size into account. For illustration we consider the commonly used sampling fractions 0.01, 0.05 and 0.1. The results in Table 6 suggest that the subsampling rule HH will have smaller expected costs than B if K > 9.82. The subsample parameter H should be very large for preferring S to B in terms of costs.
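Table 5. Comparison of HH with S in terms of costs, by W2 (rows) and H (columns).

| W2 | H = 0.1 | H = 0.3 | H = 0.5 | H = 0.7 | H = 1 |
|---|---|---|---|---|---|
| 0.10 | 0.20 | 0.40 | 0.60 | 0.80 | 1.10 |
| 0.15 | 0.35 | 0.45 | 0.65 | 0.85 | 1.15 |
| 0.20 | 0.30 | 0.60 | 0.70 | 0.90 | 1.20 |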
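Table 6. Comparison of HH and S with B in terms of costs, by W2 (rows) and sample size n (columns).

| W2 | HH, n = 50 | HH, n = 250 | HH, n = 500 | S, n = 50 | S, n = 250 | S, n = 500 |
|---|---|---|---|---|---|---|
| 0.10 | 8.47 | 6.53 | 9.82 | 82.37 | 64.36 | 95.48 |
| 0.15 | 6.05 | 6.38 | 6.45 | 39.36 | 41.54 | 41.97 |
| 0.20 | 4.63 | 4.92 | 4.96 | 22.18 | 23.53 | 23.82 |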
Conclusions
Non-responses are present in the practice of survey research. Deciding to subsample the non-respondents raises the question of what the size of the subsample should be. The sampler must select a subsampling rule and either fix a value of K or H or use a randomized rule instead. We developed a study of this problem for the class of estimators proposed by Singh and Kumar (2009).
The preference for one of the rules may be evaluated by analyzing their effect on the corresponding expected variance or cost. The numerical study developed in Section 3 illustrated how a simple procedure allows deciding on the convenience of using one of the rules. The evaluation of the subsampling rules does not necessarily lead to preferring one of them in terms of both accuracy and cost.
This study may be extended to other estimation procedures, such as the product of ratio and regression estimators proposed by Singh and Kumar (2011).
Acknowledgments
We heartily appreciate the suggestions of the referees which allowed improving the presentation of the results. One of the authors thanks CYTED and “A Cuban-Flemish Training and Research Program in Data Science and Big Data Analysis” projects for supporting his research.
References
- Arnab, 2017. Survey Sampling Theory and Applications. Academic Press, Elsevier.
- Azeem and Hanif, 2017. Joint influence of measurement error and non response on estimation of population mean. Commun. Statist. Theory Methods 4, 1679-1693.
- Andridge and Thompson, 2015. Using the fraction of missing information to identify auxiliary variables for imputation procedures via proxy pattern-mixture models. Int. Statist. Rev. 83, 472-492.
- Bouza, 1981. Sobre el problema de la fracción de submuestreo para el caso de las no respuestas. Trabajos de Estadística y de Investigación Operativa 32, 30-36.
- Bouza et al., 2011. Handling with missing observations in simple random sampling and ranked set sampling. In: Lovric, M. (Ed.), International Encyclopedia of Statistical Science, Part 8. Springer-Verlag, Berlin, pp. 621-622.
- Hansen and Hurwitz, 1946. The problem of non-responses in survey sampling. J. Am. Statist. Assoc. 41, 517-523.
- Heffetz, O., Reeves, D.B., 2016. Difficulty to Reach Respondents and Nonresponse Bias: Evidence from Large Government Surveys. NBER Working Paper No. 22333. (Consulted January 10, 2018. http://www.nber.org/papers/w22333).
- Lohr, 2010. Sampling: Design and Analysis, second ed. Brooks/Cole, Boston.
- Särndal and Lundquist, 2014. Accuracy in estimation with nonresponse: a function of degree of imbalance and degree of explanation. J. Survey Statist. Methodol. 2, 361-387.
- Singh, 2003. Advanced Sampling Theory with Applications. Kluwer Academic, Dordrecht, The Netherlands.
- Singh and Kumar, 2009. A general procedure of estimating the population mean in the presence of non-response under double sampling using auxiliary information. Statist. Oper. Res. Trans. 33, 71-84.
- Singh and Kumar, 2011. Combination of regression and ratio estimate in presence of nonresponse. Braz. J. Probab. Stat. 25, 205-217.
- Srinath, 1971. Multiphase sampling in nonresponse problems. J. Am. Statist. Assoc. 66, 583-658.
- Thompson, K.J., Washington, K.T., 2013. Challenges in the treatment of unit nonresponse for selected business surveys: a case study. Survey Methods: Insights from the Field. Last consulted December 30, 2017. Available at: http://surveyinsights.org/?p=2991.
- Torres van Grinsven, V., Bolko, I., Bavdaž, Villund, O., 2014. Comparing Subsample Approaches. Presentation to the 9th Workshop on Labor Force Survey Methodology, Rome.