Subsampling rules for item non response of an estimator based on the combination of regression and ratio

Carlos N. Bouza-Herrera; Mir Subzar

doi:10.1016/j.jksus.2018.10.006


31 (2); 171-176


Carlos N. Bouza-Herrera^{a,⁎}, Mir Subzar^{b}

a Universidad de La Habana, Cuba

b Division of Agricultural Statistics, SKUAST-Kashmir, Srinagar 190025, India

⁎Corresponding author.

Received: 2017-12-21, Accepted: 2018-10-15, Published: 2018-04-01

Disclaimer:
This article was originally published by Elsevier and was migrated to Scientific Scholar after the change of Publisher.

Peer review under responsibility of King Saud University.


1 Introduction

Survey sampling models assume the existence of a finite population U={u₁,…,u_N}, whose units are perfectly identifiable, from which a sample s of size n ≤ N is selected. A further assumption is that the variable of interest Y is measured in each selected unit. Unfortunately, real-life surveys must deal with missing observations. The existence of non-response suggests that the population U is divided into two strata: U₁, which groups the units that respond at the first visit, and U₂, which contains the rest of the individuals. This is the so-called ‘response strata’ model, the framework proposed by Hansen and Hurwitz (1946); see textbooks such as Arnab (2017), Singh (2003), and Lohr (2010).

The behavior of estimators based on subsampling depends heavily on the subsampling rule used. Alternatives to the Hansen-Hurwitz rule have been proposed; see, for example, Srinath (1971) and Bouza (1981).

The quality of a survey depends on the response rate. One question is how many non-respondents should be subsampled to obtain a good response rate. If sufficient information is available from prior rounds or other sources, the decision can be made on the basis of the experience of the sampler, who analyzes the response rate, design effect, costs, etc. The role of non-response in the accuracy of estimation still generates discussion among statisticians; see the enlightening discussion in Särndal and Lundquist (2014). In a seminal paper, Hansen and Hurwitz (1946) suggested subsampling non-respondents to alleviate the effect of missing data.

Many sampling models seek to increase the survey’s precision by using information on an auxiliary variable. That is the case of ratio, regression and product estimators. Singh and Kumar (2009) developed a general class of estimators for the population mean of the variable of interest Y, using information on two auxiliary variables, when missing observations are present. The class includes some well-known estimators.

In this paper we consider the case in which information on the variable of interest is missing but information on the auxiliary variables is available in the sample, as well as their population means. Singh and Kumar (2009) considered the subsampling rule proposed by Hansen and Hurwitz (1946). This paper extends their results by studying the effect of using the rules of Srinath (1971) and Bouza (1981) with estimators of the class. The behavior of the estimators in the class is analyzed in terms of accuracy and cost for each rule. The approximate errors are given.

Section 2 presents the basic issues of the non-response procedures. Section 3 is devoted to analyzing the statistical properties of the estimators. In that section, the effects of the three subsampling rules on the variance and the cost function are compared. We also present a numerical study, using real-life data, in which the rules are evaluated. Finally, Section 4 gives some concluding remarks.


2 The non-response problem

It is increasingly common to subsample non-respondents to increase response rates at a reduced cost. The usual theory of survey sampling assumes that the finite population $U = \{u_{1}, \dots, u_{N}\}$ is composed of individuals that can be perfectly identified. Assume that a sample s of size n ≤ N is selected using simple random sampling with replacement (SRSWR). The variable of interest Y is to be measured in each selected unit. Real-life surveys must deal with missing observations, which means that the population is divided into two strata: $U_{1} = \{u \in U : u \text{ gives a response at the first visit}\},$ $U_{2} = \{u \in U : u \text{ does not give a response at the first visit}\}.$

Then we may distinguish the strata parameters ${\bar{Y}}_{t} = \frac{1}{N_{t}} \sum_{u_{i} \in U_{t}} Y_{i}, \quad σ_{t}^{2} = \frac{1}{N_{t}} \sum_{u_{i} \in U_{t}} {(Y_{i} - {\bar{Y}}_{t})}^{2}, \quad N_{t} = |U_{t}|, \quad t = 1, 2$

as well as the population ones $\bar{Y} = \frac{1}{N} \sum_{u_{i} \in U} Y_{i}, \quad σ_{Y}^{2} = \frac{1}{N} \sum_{u_{i} \in U} {(Y_{i} - \bar{Y})}^{2}, \quad N = N_{1} + N_{2}$

There are three ways to cope with this fact: to ignore the non-respondents, to impute the missing values, or to subsample the non-respondents. Ignoring the non-responses is rarely a good solution, as the values of Y in the units belonging to U₂ may be very different from those in U₁. Imputation of the missing data depends on having an adequate model of the non-response mechanism and reliable information for predicting Y for each non-respondent. Subsampling the non-respondents is a conservative solution. Theoretically, subsampling the non-respondent stratum is a particular case of Double Sampling (DS); see Bouza et al. (2011) for a motivating discussion on the subject. It was first proposed by Hansen and Hurwitz (1946). Its use increases the costs but provides the confidence of estimating with information on U₂. Deciding which subsampling procedure to use is of practical importance; see Thompson and Washington (2013), Torres van Grinsven et al. (2014), Andridge and Thompson (2015) and Heffetz and Reeves (2016). It therefore makes sense to analyze the behavior of alternatives to the Hansen-Hurwitz rule. The rules of Srinath (1971) and Bouza (1981) are reported in the literature as other rules for determining the subsample size. They fix the size of the subsample to be drawn from the set of non-respondents.

Let us present a general algorithm for implementing one of the subsampling procedures.


2.1 Subsampling algorithm

Step 1. Select a sample s from U using simple random sampling with replacement (SRSWR).

Step 2. Evaluate Y among the subsample of the respondents $s_{1} \subset s$, determine $\{y_{i}, i \in s_{1}\}$, $|s_{1}| = n_{1}$, and compute

(2.1)

{\bar{y}}_{1} = \frac{\sum_{i = 1}^{n_{1}} y_{i}}{n_{1}}

Step 3. Determine $s_{2} = s - s_{1}$, $|s_{2}| = n_{2}$

Step 4. Fix $n_{2}^{*} = θ n_{2}, θ \leq 1$

Step 5. Select using SRSWR a sub-sample $s_{2}^{*} \subset s_{2}$ of size $n_{2}^{*}$

Step 6. Evaluate Y among the units in $s_{2}^{*}$ and compute

(2.2)

{\bar{y}}_{2}^{*} = \frac{\sum_{i = 1}^{n_{2}^{*}} y_{i}}{n_{2}^{*}}

Step 7. Compute the estimate of $\bar{Y}$

(2.3)

{\bar{y}}^{*} = w_{1} {\bar{y}}_{1} + w_{2} {\bar{y}}_{2}^{*}, w_{j} = \frac{n_{j}}{n}, j = 1, 2

As s₁ is a subsample of U₁, (2.1) is an unbiased estimator of the mean of the response stratum, that is $E ({\bar{y}}_{1} |s) = {\bar{Y}}_{1}$. The subsample selected from the non-respondents provides (2.2), which is a conditionally unbiased estimator of ${\bar{y}}_{2}$, the mean of Y in $s_{2}$, because $E ({\bar{y}}_{2}^{*} |s) = {\bar{y}}_{2}$. Therefore, as $s_{2} \subset U_{2}$, $E [E ({\bar{y}}_{2}^{*} |s)] = {\bar{Y}}_{2}$. The usual analysis of the behavior of (2.3) is based on studying the expression

(2.4)

{\bar{y}}^{*} = w_{1} {\bar{y}}_{1} + w_{2} {\bar{y}}_{2}^{*} = (w_{1} {\bar{y}}_{1} + w_{2} {\bar{y}}_{2}) + w_{2} ({\bar{y}}_{2}^{*} - {\bar{y}}_{2}),

The first term is the sample mean of s, hence $E [E (w_{1} {\bar{y}}_{1} + w_{2} {\bar{y}}_{2})] = \bar{Y}$. We also have that $E ({\bar{y}}_{2}^{*} - {\bar{y}}_{2} |s) = 0$. Therefore, ${\bar{y}}^{*}$ is an unbiased estimator of the population mean.
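The algorithm and the unbiasedness argument can be illustrated with a small simulation. The following Python sketch is only an illustration under stated assumptions: the population is synthetic (hypothetical), the function name `subsample_estimate` is not from the paper, and rounding $θ n_{2}$ to an integer of at least 1 is an assumption, since the text does not say how non-integer sizes are handled.

```python
import random

def subsample_estimate(y_resp, y_nonresp, theta, rng):
    """Steps 2-7 of the subsampling algorithm for one selected sample.

    y_resp    : Y values of the first-visit respondents (s_1)
    y_nonresp : Y values of the non-respondents (s_2); only a subsample is used
    theta     : subsampling fraction, 0 < theta <= 1
    """
    n1, n2 = len(y_resp), len(y_nonresp)
    n = n1 + n2
    y1_bar = sum(y_resp) / n1 if n1 else 0.0            # (2.1)
    if n2 == 0:
        return y1_bar
    n2_star = max(1, round(theta * n2))                 # Step 4 (rounding is an assumption)
    sub = rng.choices(y_nonresp, k=n2_star)             # Step 5: SRSWR within s_2
    y2_star_bar = sum(sub) / n2_star                    # (2.2)
    return (n1 / n) * y1_bar + (n2 / n) * y2_star_bar   # (2.3)

# Hypothetical population: 800 respondents around 10, 200 non-respondents around 20.
rng = random.Random(12345)
population = [(rng.gauss(10, 1), True) for _ in range(800)]
population += [(rng.gauss(20, 1), False) for _ in range(200)]
Y_bar = sum(y for y, _ in population) / len(population)

estimates = []
for _ in range(2000):
    s = rng.choices(population, k=100)                  # Step 1: SRSWR from U
    y_resp = [y for y, responds in s if responds]
    y_nonresp = [y for y, responds in s if not responds]
    estimates.append(subsample_estimate(y_resp, y_nonresp, theta=0.5, rng=rng))

mc_mean = sum(estimates) / len(estimates)
print(f"population mean {Y_bar:.3f}, Monte Carlo mean of the estimator {mc_mean:.3f}")
```

Over repeated samples the average of the estimates should stay close to the population mean, which is what unbiasedness of ${\bar{y}}^{*}$ predicts.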

We have that the expected variance of the first term is

(2.5)

E [V (w_{1} {\bar{y}}_{1} + w_{2} {\bar{y}}_{2} |s)] = \frac{σ_{Y}^{2}}{n}

The conditional variance of the second term is given by $V (w_{2} ({\bar{y}}_{2}^{*} - {\bar{y}}_{2}) |s) = w_{2}^{2} σ_{2 Y}^{2} (\frac{1}{θ n_{2}} - \frac{1}{n_{2}}) = \frac{n_{2} (1 - θ)}{θ n^{2}} σ_{2 Y}^{2}$

Note that $n_{2}$ is a Binomial random variable, hence $E [V (w_{2} ({\bar{y}}_{2}^{*} - {\bar{y}}_{2}) |s)] = \frac{W_{2} (1 - θ)}{θ n} σ_{2 Y}^{2}$

Owing to independence, the expectation of the cross product is zero, and we obtain the well-known expression $E [V ({\bar{y}}^{*} |s)] = \frac{σ_{Y}^{2}}{n} + \frac{W_{2} (1 - θ)}{θ n} σ_{2 Y}^{2} .$
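To see how the second term reacts to the subsampling fraction, the expression can be evaluated directly. This is a minimal sketch; the function name `expected_variance` and the numerical values are hypothetical and serve only as an illustration.

```python
def expected_variance(sigma2_y, sigma2_2y, W2, theta, n):
    """sigma_Y^2 / n + W_2 (1 - theta) / (theta n) * sigma_2Y^2, for a fixed theta."""
    if not 0 < theta <= 1:
        raise ValueError("theta must lie in (0, 1]")
    return sigma2_y / n + W2 * (1 - theta) / (theta * n) * sigma2_2y

# Hypothetical values: both variances equal to 100, 20% non-respondents, n = 100.
for theta in (1.0, 0.5, 0.25):
    v = expected_variance(100.0, 100.0, W2=0.2, theta=theta, n=100)
    print(f"theta = {theta:.2f}: expected variance = {v:.2f}")
```

With θ = 1 every non-respondent is revisited and only the first term remains; smaller values of θ inflate the second term.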

The value of θ determines the value of the second term in the expected error. The subsampling rules deal with determining θ. The existing particular rules fix the value of θ. They are:

Hansen and Hurwitz (1946): $θ = \frac{1}{K}, K \geq 1$
Srinath (1971): $θ = \frac{n_{2}}{Hn + n_{2}}, H \geq 0$, that is $n_{2}^{*} = \frac{n_{2}^{2}}{Hn + n_{2}}$
Bouza (1981): $θ = w_{2}$, that is $n_{2}^{*} = w_{2} n_{2}$

In the rules of Hansen-Hurwitz and Srinath, θ depends on a constant (K or H) fixed by the sampler. In Bouza’s rule, the subsampling fraction is determined by the observed proportion of non-responses. Hence, it is a randomized rule, and fixing the subsample size does not depend on the expertise of the sampler.
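The three rules can be compared on one sample by computing θ and the implied subsample size $n_{2}^{*} = θ n_{2}$. The sketch below reads Srinath's rule as $θ = n_{2} / (Hn + n_{2})$, the form used in Section 3.3, and Bouza's rule as $θ = w_{2}$, the form implied by the conditional variance given there. The function names and the sample values are hypothetical.

```python
def theta_hansen_hurwitz(K):
    """theta = 1 / K, with K >= 1 chosen by the sampler."""
    return 1.0 / K

def theta_srinath(n, n2, H):
    """theta = n2 / (H n + n2), so that n2* = n2^2 / (H n + n2)."""
    return n2 / (H * n + n2)

def theta_bouza(n, n2):
    """theta = w2 = n2 / n, so that n2* = n2^2 / n."""
    return n2 / n

n, n2 = 100, 20   # hypothetical sample with 20 non-respondents
rules = {
    "Hansen-Hurwitz (K = 2)": theta_hansen_hurwitz(2),
    "Srinath (H = 0.5)": theta_srinath(n, n2, 0.5),
    "Bouza": theta_bouza(n, n2),
}
for name, theta in rules.items():
    print(f"{name}: theta = {theta:.3f}, n2* = {theta * n2:.2f}")
```

Only the first two rules need a constant from the sampler; the third is fully determined by the observed $n_{2}$.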

When auxiliary information X is available, ratio estimators are commonly used. Under non-response the population mean of X, $\bar{X}$, is known, and the following estimators are computed: ${\bar{x}}^{*} = w_{1} {\bar{x}}_{1} + w_{2} {\bar{x}}_{2}^{*}, \bar{x} = w_{1} {\bar{x}}_{1} + w_{2} {\bar{x}}_{2}$ $s_{x}^{* 2} = \frac{1}{n - 1} (\sum_{i = 1}^{n_{1}} x_{i}^{2} + \frac{n_{2}}{n_{2}^{*}} \sum_{j = 1}^{n_{2}^{*}} x_{j}^{2} - n {\bar{x}}^{*} \bar{x})$ $s_{xy}^{*} = \frac{1}{n - 1} (\sum_{i = 1}^{n_{1}} x_{i} y_{i} + \frac{n_{2}}{n_{2}^{*}} \sum_{j = 1}^{n_{2}^{*}} x_{j} y_{j} - n {\bar{y}}^{*} \bar{x})$

The ratio estimator in this case is given by ${\bar{y}}_{ratio}^{*} = \frac{{\bar{y}}^{*}}{{\bar{x}}^{*}} \bar{X} .$

The regression estimator is ${\bar{y}}_{reg} = {\bar{y}}^{*} + b_{yx}^{*} (\bar{X} - {\bar{x}}^{*}), b_{yx}^{*} = \frac{s_{xy}^{*}}{s_{x}^{* 2}}$
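As a rough illustration of how these quantities fit together, the following Python sketch computes both estimators from the respondents, the non-respondents' x values and the subsample. The function name, its argument layout and the toy data are hypothetical; the formulas are the ones written above.

```python
def ratio_and_regression(x1, y1, x2, x2_sub, y2_sub, X_bar):
    """Ratio and regression estimates under subsampling of non-respondents.

    x1, y1         : x and y of the n1 first-visit respondents
    x2             : x of all n2 non-respondents (x is observed for every unit)
    x2_sub, y2_sub : x and y of the n2* subsampled non-respondents
    X_bar          : known population mean of X
    """
    n1, n2, n2_star = len(x1), len(x2), len(x2_sub)
    n = n1 + n2
    w1, w2 = n1 / n, n2 / n
    expand = n2 / n2_star                        # weight n2 / n2* of the subsample sums
    x_bar = (sum(x1) + sum(x2)) / n              # full-sample mean of x
    x_star = w1 * sum(x1) / n1 + w2 * sum(x2_sub) / n2_star
    y_star = w1 * sum(y1) / n1 + w2 * sum(y2_sub) / n2_star
    sx2 = (sum(x * x for x in x1) + expand * sum(x * x for x in x2_sub)
           - n * x_star * x_bar) / (n - 1)
    sxy = (sum(x * y for x, y in zip(x1, y1))
           + expand * sum(x * y for x, y in zip(x2_sub, y2_sub))
           - n * y_star * x_bar) / (n - 1)
    b = sxy / sx2
    y_ratio = y_star / x_star * X_bar
    y_reg = y_star + b * (X_bar - x_star)
    return y_ratio, y_reg

# Toy data with y = 2x exactly, and every non-respondent subsampled.
y_ratio, y_reg = ratio_and_regression(
    x1=[1, 2, 3], y1=[2, 4, 6], x2=[4, 5], x2_sub=[4, 5], y2_sub=[8, 10], X_bar=3.5
)
print(y_ratio, y_reg)
```

Because the toy data satisfy y = 2x exactly, both estimates coincide with 2 times the assumed $\bar{X}$.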

In the next section we will consider the estimation problems.


3 The estimation problem


3.1 The class of estimators of Singh-Kumar

Singh and Kumar (2009) developed a class of estimators for $\bar{Y}$ when auxiliary information on two variables X and Z is available and non-responses are present. The sampling design analyzed was a DS one. They derived expressions of the mean squared error (MSE) for the estimators of the proposed class. Take $\bar{Q} = \sum_{j = 1}^{N} \frac{Q_{j}}{N}, \bar{q} = \sum_{j = 1}^{n} \frac{q_{j}}{n}, {\bar{q}}_{t} = \sum_{j = 1}^{n_{t}} \frac{q_{j}}{n_{t}}, {\bar{q}}_{2}^{*} = \sum_{j = 1}^{n_{2}^{*}} \frac{q_{j}}{n_{2}^{*}}, Q = X, Y, Z, q = x, y, z, t = 1, 2$

Consider that we deal with missing information in the variable of interest (item non-response). Hence, we have responses in x and z when a unit belongs to s. Following the model, we are going to estimate the mean of Y using (2.4). We may compute $\bar{g} = w_{1} {\bar{g}}_{1} + w_{2} {\bar{g}}_{2}, g = x, z .$

Take α as a fixed scalar. The estimators of this class are characterized by the general formula ${\bar{y}}_{α} = ({\bar{y}}^{*} + β^{*} (\bar{X} - \bar{x})) \frac{\bar{Z}}{\bar{Z} + α (\bar{z} - \bar{Z})}$, where $β^{*} = \frac{s_{xy}^{*}}{s_{x}^{2}}, s_{xy}^{*} = \sum_{i = 1}^{n_{1} + n_{2}^{*}} \frac{(x_{i} - \bar{x}) (y_{i} - {\bar{y}}^{*})}{n_{1} + n_{2}^{*} - 1}, s_{x}^{2} = \sum_{i = 1}^{n_{1} + n_{2}^{*}} \frac{{(x_{i} - \bar{x})}^{2}}{n_{1} + n_{2}^{*} - 1}$
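A minimal Python sketch of the general formula follows. It reads the denominator as $\bar{Z} + α (\bar{z} - \bar{Z})$, which is the form consistent with the factor $1 + α ε_{z}$ in (3.2). The function name and the numerical inputs are hypothetical.

```python
def y_alpha(y_star, beta_star, x_bar, z_bar, X_bar, Z_bar, alpha):
    """(y* + beta* (X_bar - x_bar)) * Z_bar / (Z_bar + alpha (z_bar - Z_bar))."""
    regression_part = y_star + beta_star * (X_bar - x_bar)
    ratio_part = Z_bar / (Z_bar + alpha * (z_bar - Z_bar))
    return regression_part * ratio_part

# Hypothetical sample summaries.
args = dict(y_star=6.0, beta_star=2.0, x_bar=3.0, z_bar=11.0, X_bar=3.5, Z_bar=10.0)
for alpha in (0.0, 0.5, 1.0):
    print(f"alpha = {alpha:.1f}: estimate = {y_alpha(alpha=alpha, **args):.4f}")
```

With α = 0 the ratio factor equals 1 and only the regression-type adjustment on x remains; with α = 1 the factor becomes $\bar{Z} / \bar{z}$.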

It is considered that we know $\bar{Q}, Q = X, Z$ .

Let us use a Taylor Series development for (2.4). Take ${\bar{y}}^{*} = \bar{Y} (1 + ε_{y}), \quad \bar{g} = \bar{G} (1 + ε_{g}), \; g = x, z, \; G = X, Z, \quad s_{xy}^{*} = σ_{xy} (1 + ε_{xy}), \quad s_{x}^{2} = σ_{x}^{2} (1 + ε_{2 x})$

For the Taylor series development to be valid, it is necessary that $|α ε_{z}| < 1$ and $|ε_{2 x}| < 1$ hold.

From the results of Singh and Kumar (2009) we may write

(3.1)

{\bar{y}}_{α} = \frac{\bar{Y} (1 + ε_{y}) - \bar{X} ε_{x} \frac{σ_{xy} (1 + ε_{xy})}{σ_{x}^{2} (1 + ε_{2 x})}}{1 + α ε_{z}} = \bar{Y} [\frac{1 + ε_{y} - \frac{β R_{xy} ε_{x} (1 + ε_{xy})}{(1 + ε_{2 x})}}{(1 + α ε_{z})}]

where $β = \frac{σ_{xy}}{σ_{x}^{2}}, R_{xy} = \frac{\bar{X}}{\bar{Y}}$ . Using the corresponding development for expanding (3.1) we have that

(3.2)

{\bar{y}}_{α} = \bar{Y} [\frac{1 + ε_{y} - \frac{β R_{xy} ε_{x} (1 + ε_{xy})}{(1 + ε_{2 x})}}{(1 + α ε_{z})}]

The approximation of the expectation, variances and covariances of the errors are developed considering that the terms of order larger than 2 are negligible. Then we may write:

(3.3)

E (ε_{q}^{t} ε_{u}^{h}) = \begin{cases} V_{1 q} & \text{if } q = u, \; q = x, z; \; t = h = 1 \\ V_{2 q u} & \text{if } q = x, y \text{ and } q \neq u = x, y, z; \; t = h = 1 \\ V_{3} & \text{if } q = y \text{ and } u = xy \\ V_{4} & \text{if } q = s_{x}^{2} \text{ and } u = x \end{cases}

where $V_{1 q} = \frac{C_{q}^{2}}{n}, C_{q}^{2} = \frac{σ_{q}^{2}}{{\bar{Q}}^{2}}, q = x, y, z$, $V_{2 q u} = ρ_{qu} C_{q} C_{u}, ρ_{qu} = \frac{σ_{qu}}{σ_{q} σ_{u}}$, $V_{3} = \frac{N μ_{21}}{(N - 2) n \bar{X} σ_{xy}}$, $μ_{21} = \sum_{i = 1}^{N} \frac{(x_{i} - \bar{X}) {(y_{i} - \bar{Y})}^{2}}{N}$, $V_{4} = \frac{N μ_{30}}{(N - 2) n \bar{X} σ_{x}^{2}}, μ_{30} = \sum_{i = 1}^{N} \frac{{(x_{i} - \bar{X})}^{3}}{N}$.

Let us look for the approximate bias and variance of the estimator. Considering again that the terms of order larger than 2 are negligible, we have the expansion ${\bar{y}}_{α} - \bar{Y} ≅ \bar{Y} (ε_{y} + β R_{xy} ε_{x} (α ε_{z} - 1) + α ε_{z} (α ε_{z} - ε_{y} - 1) + β R_{xy} ε_{x} (ε_{x} - ε_{xy}))$

Its expectation is equal to $Bias ({\bar{y}}_{α}) = E ({\bar{y}}_{α} - \bar{Y}) ≅ \bar{Y} (α β R_{xy} V_{2 x y} + α^{2} V_{1 z} + α V_{2 z y} + β R_{xy} V_{1 x})$

Note that only ε_y is affected by the existence of missing observations. Squaring both sides and taking the expectation, we obtain

(3.5)

E [{({\bar{y}}_{α} - \bar{Y})}^{2} |s] ≅ \frac{1}{n} [σ_{y}^{2} (1 - ρ_{xy}^{2}) + α R_{yz} (α R_{yz} - 2 A) σ_{z}^{2} + w_{2} σ_{2 y}^{2}] = V [{\bar{y}}_{α} |s]

R_{yz} = \frac{\bar{Y}}{\bar{Z}}, A = \frac{σ_{yz}}{σ_{z}^{2}} - \frac{σ_{yx}}{σ_{x}^{2}} \frac{σ_{xz}}{σ_{z}^{2}}

The value of α determines a particular member of the class. An optimal estimator is obtained by choosing the value of α that minimizes (3.5). It is given by $α_{0} = \frac{A}{R_{yz}}$, which depends on unknown population parameters; see Singh and Kumar (2009) for a detailed discussion on the members of this class.
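The minimization can be checked numerically. The sketch below computes A, $R_{yz}$ and $α_{0}$ from population moments and evaluates only the α-dependent part of the bracket in (3.5), $α R_{yz} (α R_{yz} - 2 A) σ_{z}^{2}$. The function names are hypothetical, and the inputs are the figures reported for Problem 2 in Section 3.2.

```python
def coefficient_A(s_yz, s_yx, s_xz, s_x2, s_z2):
    """A = s_yz / s_z^2 - (s_yx / s_x^2) (s_xz / s_z^2)."""
    return s_yz / s_z2 - (s_yx / s_x2) * (s_xz / s_z2)

def alpha_term(alpha, A, R_yz, s_z2):
    """alpha-dependent part of the bracket in (3.5)."""
    return alpha * R_yz * (alpha * R_yz - 2 * A) * s_z2

# Moments reported for Problem 2 (hemoglobin study).
s_x2, s_z2 = 1.43, 2.21
s_yz, s_yx, s_xz = 0.64, 1.01, 0.82
Y_bar, Z_bar = 8.06, 9.90

A = coefficient_A(s_yz, s_yx, s_xz, s_x2, s_z2)
R_yz = Y_bar / Z_bar
alpha_0 = A / R_yz
print(f"A = {A:.4f}, R_yz = {R_yz:.4f}, alpha_0 = {alpha_0:.4f}")

# The term is smallest at alpha_0 and equals -A^2 s_z^2 there.
for alpha in (alpha_0 - 0.1, alpha_0, alpha_0 + 0.1):
    print(f"alpha = {alpha:+.4f}: term = {alpha_term(alpha, A, R_yz, s_z2):+.6f}")
```

The computed A and $R_{yz}$ round to the values 0,03 and 0,81 listed for that problem.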


3.2 A comparison of estimators

It is well known that, to the first degree of approximation of the Taylor series, the conditional variance of the ratio estimator is $V ({\bar{y}}_{ratio}^{*} |s) ≅ \frac{1}{n} (σ_{y}^{2} + σ_{x}^{2} R (R - 2 B_{yx}) + \frac{n_{2}^{*}}{n} σ_{2 y}^{2})$

For the regression estimator the conditional variance is $V ({\bar{y}}_{reg} |s) ≅ \frac{1}{n} (σ_{y}^{2} (1 - ρ_{xy}^{2}) + \frac{n_{2}^{*}}{n} σ_{2 y}^{2})$

Noting that $V ({\bar{y}}^{*} |s) = \frac{σ_{Y}^{2}}{n} + \frac{n_{2}^{*}}{n^{2}} σ_{2 Y}^{2}$

we have that, if $α_{0} = \frac{A}{R_{yz}}$ is known, $G ({\bar{y}}_{ratio}^{*}, {\bar{y}}_{α_{0}}) = V ({\bar{y}}_{ratio}^{*} |s) - V [{\bar{y}}_{α_{0}} |s] = \frac{1}{n} {(A - R_{yz})}^{2} σ_{z}^{2}$, $G ({\bar{y}}_{reg}^{*}, {\bar{y}}_{α_{0}}) = V ({\bar{y}}_{reg}^{*} |s) - V [{\bar{y}}_{α_{0}} |s] = \frac{A^{2} σ_{z}^{2}}{n}$, $G ({\bar{y}}^{*}, {\bar{y}}_{α_{0}}) = V ({\bar{y}}^{*} | s) - V [{\bar{y}}_{α_{0}} |s] = \frac{1}{n} (A^{2} σ_{z}^{2} + ρ_{xy}^{2} σ_{y}^{2})$

Hence ${\bar{y}}_{α_{0}}$ is better than the other estimators. Singh and Kumar (2011) pointed out that, even if $α_{0}$ is unknown, ${\bar{y}}_{α}$ is to be preferred, if the sampler evaluates for which feasible values of $α$ it behaves better.

An evaluation of the magnitude of the gain in accuracy due to the use of ${\bar{y}}_{α_{0}}$ was developed using real life data.

We developed the numerical analysis using population data obtained in three studies. A brief description of each follows.

Problem 1. 793 factories contaminate a source of water. They were inspected and the following was obtained:

X = percent of samples with an index superior to the permitted level.

The historical report of this percent was also known:

Z = historical percent of samples with an index superior to the permitted level.

The managers improved the collection of solid contaminants, took a sample and reported

Y = percent of samples reported with an index superior to the permitted level.

N₂ = 104 factories did not send the report. They were visited and Y was obtained.

The parameters of interest are $\bar{X} = 24,7$, $σ_{x}^{2} = 31,4$; $\bar{Z} = 10,3$, $σ_{z}^{2} = 9,34$; $\bar{Y} = 18,7$, $σ_{y}^{2} = 7,78$; $σ_{yz} = -7,96$; $σ_{yx} = 10,20$; $σ_{xz} = 19,62$; $R_{yz} = 1,82$, $A = -1,19$.

Problem 2. 120 persons with HIV were included in an experiment with a new drug. The hemoglobin level was one of the measurements taken. The variables involved were

X = measurement of hemoglobin before starting with the treatment.
Z = first measurement of hemoglobin after starting with the treatment.
Y = measurement of hemoglobin 6 months after starting with the treatment.

N₂ = 51 patients did not visit the hospital. They were visited and Y was obtained.

The parameters of interest are $\bar{X} = 6,60$, $σ_{x}^{2} = 1,43$; $\bar{Z} = 9,90$, $σ_{z}^{2} = 2,21$; $\bar{Y} = 8,06$, $σ_{y}^{2} = 3,08$; $σ_{yz} = 0,64$; $σ_{yx} = 1,01$; $σ_{xz} = 0,82$; $R_{yz} = 0,81$, $A = 0,03$.

Problem 3. 1840 farmers increased the area of their farms. The interest was to evaluate the tax to be paid. The variables involved were

X = initial area of the farms in hectares.
Z = actual area of the farms in hectares.
Y = harvested area in hectares.

N₂ = 176 farmers did not return the tax form. They were visited and Y was obtained.

The parameters of interest are $\bar{X} = 23,35$, $σ_{x}^{2} = 60,46$; $\bar{Z} = 34,86$, $σ_{z}^{2} = 88,75$; $\bar{Y} = 26,72$, $σ_{y}^{2} = 49,33$; $σ_{yz} = 15,58$; $σ_{yx} = -22,67$; $σ_{xz} = 46,93$; $R_{yz} = 0,77$, $A = 0,02$.

The resulting gains in accuracy are presented in Table 1.

Table 1 Gains in accuracy in 3 real-life problems due to the use of ${\bar{y}}_{α_{0}}$.

Problem	$G ({\bar{y}}_{ratio}^{*}, {\bar{y}}_{α_{0}})$	$G ({\bar{y}}_{reg}^{*}, {\bar{y}}_{α_{0}})$	$G ({\bar{y}}^{*}, {\bar{y}}_{α_{0}})$
1	84,06	13,16	16,47
2	1,34	0,20	0,91
3	49,92	0,04	0,41

Table 1 suggests that the improvements in accuracy due to the use of ${\bar{y}}_{α_{0}}$ are very large when compared with ${\bar{y}}_{ratio}^{*}$. For the other estimators, the error decreases only slightly in the studies of HIV patients and farmers.
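The gain formulas can be evaluated directly from the reported parameters. The following sketch does so for Problem 1, using the listed values of A and $R_{yz}$. It leaves out the common factor 1/n, which appears to be how Table 1 is scaled (an assumption, since the sample size is not stated), and it uses the identity $ρ_{xy}^{2} σ_{y}^{2} = σ_{yx}^{2} / σ_{x}^{2}$. The function name is hypothetical.

```python
def scaled_gains(A, R_yz, s_z2, s_yx, s_x2):
    """n times the gains of y_alpha0 over the ratio, regression and y* estimators."""
    g_ratio = (A - R_yz) ** 2 * s_z2
    g_reg = A ** 2 * s_z2
    g_mean = g_reg + s_yx ** 2 / s_x2     # A^2 s_z^2 + rho_xy^2 s_y^2
    return g_ratio, g_reg, g_mean

# Parameters reported for Problem 1 (factories).
g_ratio, g_reg, g_mean = scaled_gains(A=-1.19, R_yz=1.82, s_z2=9.34, s_yx=10.20, s_x2=31.4)
print(f"ratio: {g_ratio:.2f}, regression: {g_reg:.2f}, y*: {g_mean:.2f}")
```

The results land close to the first row of Table 1; the small differences are compatible with A and $R_{yz}$ being rounded to two decimals in the text.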


3.3 A comparison of the subsampling rules performance

From the above discussion it is clear that the choice of subsampling rule does not affect the comparison of the estimators. The rule does matter, however, when we calculate the expected variance. We are therefore interested in evaluating the behavior of the expectations under each rule, that is, in comparing the values of $E [\frac{w_{2} (1 - θ)}{θ n}]$ under the different rules.

The use of Hansen-Hurwitz’s rule, HH, fixes that θ = 1/K, K ≥ 1. Then its use yields that

(3.6)

E [\frac{w_{2} (1 - θ)}{θ n}] = \frac{W_{2} (K - 1)}{n},

When we use the rule of Srinath (1971), S, we have that $θ = \frac{n_{2}}{Hn + n_{2}}$

Some algebra gives $\frac{1 - θ}{θ} = \frac{Hn}{n_{2}}$. Substituting in the conditional variance, we have

(3.7)

E [V (w_{2} ({\bar{y}}_{2}^{*} - {\bar{y}}_{2}) |s)] = E (\frac{n_{2}}{n^{2}} \cdot \frac{H n}{n_{2}} σ_{2 Y}^{2}) = \frac{H}{n} σ_{2 Y}^{2}

Comparing this term with (3.6), we should prefer HH to S whenever $\frac{W_{2} (K - 1)}{n} \leq \frac{H}{n}$

That is if $K \leq \frac{H + W_{2}}{W_{2}} = \frac{H}{W_{2}} + 1 .$

A similar analysis of the rule of Bouza (1981), B, must take into account that θ is random. The conditional variance is $V (w_{2} ({\bar{y}}_{2}^{*} - {\bar{y}}_{2}) |s) = \frac{(1 - n_{2} / n)}{n} σ_{2 Y}^{2} = \frac{n_{1}}{n^{2}} σ_{2 Y}^{2}$

Its expectation is

(3.8)

E [V (w_{2} ({\bar{y}}_{2}^{*} - {\bar{y}}_{2}) |s)] = \frac{W_{1}}{n} σ_{2 Y}^{2}

Note that HH is to be preferred to B when $\frac{W_{2} (K - 1)}{n} \leq \frac{W_{1}}{n}$

That is, if $K \leq \frac{W_{1}}{W_{2}} + 1 = \frac{1}{W_{2}}$. As $\frac{W_{1}}{W_{2}} \geq 0$ and K ≥ 1, we may fix a value of K that satisfies this relationship.

Comparing S and B, the former generates a smaller coefficient if the inequality $H \leq W_{1}$ is satisfied.

This relationship suggests that S may be preferred if values of H smaller than 1 are used.
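These accuracy comparisons reduce to comparing three coefficients of $σ_{2 Y}^{2} / n$: $W_{2} (K - 1)$ for HH, H for S and $W_{1}$ for B. The sketch below evaluates them and the resulting bounds on K. The function names are hypothetical, and the grid of $W_{2}$ and H values is only illustrative.

```python
def coef_hh(W2, K):
    """Coefficient of sigma_2Y^2 / n under Hansen-Hurwitz: W2 (K - 1)."""
    return W2 * (K - 1)

def coef_s(H):
    """Coefficient under Srinath: H."""
    return H

def coef_b(W2):
    """Coefficient under Bouza: W1 = 1 - W2."""
    return 1 - W2

def max_K_vs_s(H, W2):
    """Largest K for which HH is at least as accurate as S: H / W2 + 1."""
    return H / W2 + 1

def max_K_vs_b(W2):
    """Largest K for which HH is at least as accurate as B: 1 / W2."""
    return 1 / W2

for W2 in (0.10, 0.15, 0.20):
    bounds = ", ".join(f"H={H}: {max_K_vs_s(H, W2):.2f}" for H in (0.1, 0.5, 1))
    print(f"W2 = {W2:.2f} | K bound vs S ({bounds}) | K bound vs B: {max_K_vs_b(W2):.2f}")
```

At the bound itself the two coefficients coincide, for example $W_{2} = 0,2$, H = 1 and K = 6 give a coefficient of 1 for both HH and S.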

Considering the costs, we may use the cost function $C = c_{0} + c_{1} n + c_{2} n_{2}^{*}$

Its expectation depends on the subsampling rule. The results are

(3.9)

E (C_{HH}) = c_{0} + c_{1} n + \frac{c_{2} n W_{2}}{K}

Accepting that $E {(n_{2} - n_{2}^{*})}^{t} ≅ 0$ , t > 2, a development in Taylor Series allows deriving that

(3.10)

E (C_{S}) ≅ c_{0} + c_{1} n + \frac{c_{2} n W_{2}}{H + W_{2}}

We have that for B

(3.11)

E (C_{B}) = c_{0} + c_{1} n + c_{2} (n W_{2}^{2} + W_{1} W_{2})

In terms of the expected costs, we may look for the preference of the rules. We have: $HH ≾ S \text{ if } K > H + W_{2}$ and $HH ≾ B \text{ if } K > \frac{n W_{2}}{n W_{2}^{2} + W_{1} W_{2}}$.

A comparison of S with B yields the preference rule $S ≾ B \text{ if } H > \frac{n (1 - W_{2}^{2}) - W_{1}}{n W_{2} + W_{1} W_{2}}$

It is easily derived that none of the rules may be more accurate and cheaper with respect to any of the other two simultaneously.
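For the two rules whose expected costs are exact, (3.9) and (3.11), the comparison can be sketched in a few lines of Python. The function names and the unit costs $c_{0}$, $c_{1}$, $c_{2}$ used in the example are hypothetical; rule S is left out because (3.10) rests on a Taylor approximation.

```python
def expected_cost_hh(c0, c1, c2, n, W2, K):
    """(3.9): c0 + c1 n + c2 n W2 / K."""
    return c0 + c1 * n + c2 * n * W2 / K

def expected_cost_b(c0, c1, c2, n, W2):
    """(3.11): c0 + c1 n + c2 (n W2^2 + W1 W2), with W1 = 1 - W2."""
    return c0 + c1 * n + c2 * (n * W2 ** 2 + (1 - W2) * W2)

def min_K_hh_cheaper_than_b(n, W2):
    """HH has the smaller expected cost when K exceeds n W2 / (n W2^2 + W1 W2)."""
    return n * W2 / (n * W2 ** 2 + (1 - W2) * W2)

# Hypothetical unit costs.
c0, c1, c2 = 100.0, 1.0, 5.0
n, W2 = 50, 0.20
threshold = min_K_hh_cheaper_than_b(n, W2)
print(f"threshold on K: {threshold:.2f}")
for K in (4, 5):
    hh = expected_cost_hh(c0, c1, c2, n, W2, K)
    b = expected_cost_b(c0, c1, c2, n, W2)
    print(f"K = {K}: E(C_HH) = {hh:.2f}, E(C_B) = {b:.2f}")
```

For $W_{2} = 0,20$ and n = 50 the threshold is about 4,63, so K = 5 makes HH cheaper than B while K = 4 does not.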

We used the results reported for four populations in the paper of Azeem and Hanif (2017) to establish adequate values of the parameters of the subsampling rules. In the next table, $N$ is the total number of units in the population, $N_{1}$ the number of units responding to the survey questions, $N_{2}$ the number of units that do not respond, $σ_{Y}^{2}$ the population variance of $Y$ and $σ_{2 Y}^{2}$ the variance of $Y$ for the non-respondent part of the population (Table 2).

Table 2 Population Data.

Population	N	N1	N2	$σ_{Y}^{2}$	$σ_{2 Y}^{2}$	Population	N	N1	N2	$σ_{Y}^{2}$	$σ_{2 Y}^{2}$
1	5000	4500	500	102.007	99.99174	3	5000	4500	500	101.2633	5000
1	5000	4250	750	102.007	100.8224	3	5000	4250	750	101.2633	5000
1	5000	4000	1000	102.007	103.2349	3	5000	4000	1000	101.2633	5000
2	5000	4500	500	97.1206	94.5457	4	5000	4500	500	25.441	5000
2	5000	4250	750	97.1206	98.2761	4	5000	4250	750	25.441	5000
2	5000	4000	1000	97.1206	96.0935	4	5000	4000	1000	25.441	5000

Note that for populations 1 and 2 the variance of the non-respondents is similar to the overall variance. In populations 3 and 4, the variance of the non-respondent stratum is considerably larger than the population variance. We will use the weights of the non-respondent strata observed in these populations, $W_{2} = \frac{N_{2}}{N}$.

We fixed a set of values of H in Table 3 for comparing HH with S in terms of their accuracy.

Table 3 Selected values of the upper bound $\frac{H}{W_{2}} + 1$ on K for accepting that HH is less variable than S.

			H
W₂	0,1	0,3	0,5	0,7	1
0,10	2,00	4,00	6,00	8,00	11,00
0,15	1,67	3,00	4,33	5,67	7,67
0,20	1,50	2,50	3,50	4,50	6,00

Note that for H = 0,1, fixing a value of K for which HH is to be preferred implies using large subsample sizes. An increase in the weight of the non-respondent stratum also requires larger subsample sizes for HH’s rule to be preferred.

Table 4 illustrates that for small subsampling sizes HH may have a better accuracy than the rule of Bouza. S will have the same behavior for relatively small values of H.

Table 4 Selected values of the upper bounds on K (for HH) and on H (for S) for accepting that HH or S are less variable than B.

W₂	1 + W₁/W₂	W₁
0,1	10	0,90
0,15	6,67	0,85
0,2	5,00	0,80

The analysis of the costs is presented in the next 2 tables.

HH is preferred to S in terms of expected cost when $K > H + W_{2}$. Table 5 shows that, for the values considered, the relation holds for K > 1,20, which is easily satisfied in practice.

The analysis of the costs associated with B needs to take the sample size into account. For illustration, we consider the sample sizes n = 50, 250 and 500, which correspond to the commonly used sampling fractions 0,01, 0,05 and 0,1 for N = 5000. Note that the results in Table 6 suggest that the subsampling rule HH will have smaller expected costs than B if K > 9,82. The subsampling parameter H should be very large for preferring S to B in terms of costs (Table 6).

Table 5 Selected values of the lower bound $H + W_{2}$ on K for accepting that HH is less costly than S.

			H
W₂	0,1	0,3	0,5	0,7	1
0,10	0,20	0,40	0,60	0,80	1,10
0,15	0,25	0,45	0,65	0,85	1,15
0,20	0,30	0,50	0,70	0,90	1,20

Table 6 Selected values of the lower bounds on K and H for accepting that HH and S are less costly than B.

	HH			S
	$\frac{n W_{2}}{n W_{2}^{2} + W_{1} W_{2}}$			$\frac{n (1 - W_{2}^{2}) - W_{1}}{n W_{2} + W_{1} W_{2}}$
W₂/n	50	250	500	50	250	500
0,10	8,47	6,53	9,82	82,37	64,36	95,48
0,15	6,05	6,38	6,45	39,36	41,54	41,97
0,20	4,63	4,92	4,96	22,18	23,53	23,82


4 Conclusions

Non-response is present in the practice of survey research. Deciding to subsample the non-respondents requires deciding the size of the subsample. The sampler must select a subsampling rule and fix a value of K or H, or use a randomized rule instead. We studied this problem when dealing with the class of estimators proposed by Singh and Kumar (2009).

The preference for one of the rules may be evaluated by analyzing their effect on the corresponding expected variance or cost. The numerical study developed in Section 3 illustrated how a simple procedure allows deciding on the convenience of using one of the rules. The evaluation of the subsampling rules does not necessarily lead to preferring one of them in terms of both accuracy and cost.

This study may be extended to other estimation procedures, as the product of a ratio and regression estimators proposed by Singh and Kumar (2011).

Acknowledgments

We heartily appreciate the suggestions of the referees which allowed improving the presentation of the results. One of the authors thanks CYTED and “A Cuban-Flemish Training and Research Program in Data Science and Big Data Analysis” projects for supporting his research.

References

Arnab R. Survey Sampling Theory and Applications. Academic Press, Elsevier; 2017.
Azeem M., Hanif M. Joint influence of measurement error and non response on estimation of population mean. Commun. Statist. Theory Methods. 2017;4:1679-1693.
Andridge R.R., Thompson K.J. Using the fraction of missing information to identify auxiliary variables for imputation procedures via proxy pattern-mixture models. Int. Statist. Rev. 2015;83:472-492.
Bouza C.N. Sobre el problema de la fracción de submuestreo para el caso de las no respuestas. Trabajos de Estadística y de Investigación Operativa. 1981;32:30-36.
Bouza C.N., Covarrubias D., Fernandez Z. Handling with missing observations in simple random sampling and ranked set sampling. In: Lovric Miodrag, ed. International Encyclopedia of Statistical Science. Berlin: Springer-Verlag; 2011. p. 621-622. Part 8.
Hansen M.H., Hurwitz W.N. The problem of non-responses in survey sampling. J. Am. Statist. Assoc. 1946;41:517-523.
Heffetz O., Reeves D.B. Difficulty to Reach Respondents and Nonresponse Bias: Evidence from Large Government Surveys. NBER Working Paper No. 22333; 2016. Consulted January 10, 2018. http://www.nber.org/papers/w22333
Lohr S.L. Sampling: Design and Analysis (second ed.). Boston: Brooks/Cole; 2010.
Särndal C., Lundquist P. Accuracy in estimation with nonresponse: a function of degree of imbalance and degree of explanation. J. Survey Statist. Methodol. 2014;2:361-3087.
Singh S. Advanced Sampling Theory with Applications. Dordrecht, The Netherlands: Kluwer Academic; 2003.
Singh H.P., Kumar S. A general procedure of estimating the population mean in the presence of non-response under double sampling using auxiliary information. Statist. Oper. Res. Trans. 2009;33:71-84.
Singh H.P., Kumar S. Combination of regression and ratio estimate in presence of nonresponse. Braz. J. Probab. Stat. 2011;25:205-217.
Srinath K.P. Multiphase sampling in nonresponse problems. J. Am. Statist. Assoc. 1971;66:583-658.
Thompson K.J., Washington K.T. Challenges in the treatment of unit nonresponse for selected business surveys: a case study. Survey Methods: Insights from the Field; 2013. Last consulted December 30, 2017. Available at: http://surveyinsights.org/?p=2991
Torres van Grinsven V., Bolko I., Bavdaž, Villund O. Comparing Subsample Approaches. Presentation to the 9th Workshop on Labor Force Survey Methodology, Rome; 2014.