## Confidence Intervals for Multinomial Proportions

In a sample of size N, let nj for be the observed frequencies in the j-th cell of a m-cell multinomial distribution with corresponding parameter . The maximum likelihood estimator of is where for . Asymptotically, has m-variate normal distribution with zero mean vector and variance-covariance matrix with elements

Therefore, the corresponding correlation matrix has singular negative product structure with , where , for , and rank m-1. Gold (1963) and Quesenberry and Hurst (1964) provided two alternative asymptotic simultaneous confidence intervals for as follows:

 (4)

and

 (5)

for , respectively, while Bailey (1980) suggested the other two confidence intervals for as

 (6)

and

 (7)

for , where , , and . Based on the Bonferroni inequality, without consideration of the singular correlation structure of , Goodman's (1965) criterion (hereafter called GC) is usually used to determine for the above four techniques in (4)-(7), where is defined as the percentage point of the chi-square distribution with 1 degree of freedom.

After taking into account the singular correlation structure of , Kwong and Iglewicz (1996) proposed another criterion (hereafter called KIC) of setting in (4)-(7),where is defined as the two-sided percentage point of the standardized m-variate normal distribution with the singular negative equi-correlated structure. Kwong and Iglewicz (1996) conducted a simulation study to compare the performances of these two criteria. The study concluded that KIC provides less conservative and shorter confidence intervals than GC does when the sample size is sufficiently large.

However, based on the conjecture that the least favorable parameter vector having all the elements equal to one another, KIC uses only part of singular correlation structure. Intuitively, it is possible to improve KIC if the singular negative product correlation structure, instead of singular equi-correlated structure, is incorporated into the evaluation of critical values.

For a given confidence level , let

By the central limit theorem,

Since the correlation structure of is dependent on the unknown parameter , the sample proportions are used to replace the parameters in order to estimate the correlation structure. Therefore, based on the result in Section 2.2, we numerically solve

for and then use it to construct the simultaneous confidence intervals for multinomial proportions when the sample size is relatively large. The numerical solution of this equation requires the use of a nonlinear equation solving method. We considered the use of variations on the secant method for this problem because the computation of the derivative of Hm requires the evaluation of m m-1-dimensional integrals, so the use of more rapidly convergent method, like Newton's method, is infeasible. When is small, the objective function is very flat near the values of t where , and we found that the ordinary secant method frequently diverged unless the starting values for t were very close to the solution. We therefore tried some modified secant methods, which provide at each iteration a bounding interval for the final solution. One of these methods, called the Pegasus'' method (see Ralston and Rabinowitz, 1978, p. 341), which we found to be robust, reliable and efficient, was used for the results presented in the remainder of this paper.

The Pegasus method requires an initial bounding interval for the solution point t* ( , with ). A crude initial bounding interval can be determined from the simple Bonferroni bound. Using notation similar to that in Section 2.1, we let , and then . If we let , then the simple Bonferroni bound is . A simple lower bound is . Combining these bounds, and using , a crude bounding interval for t* is given by (tCL, tGC), where (equivalent to GC previously discussed) and . When m is large, the repeated evaluation of h(t) requires expensive numerical integrations, so in order to reduce the number of evaluations of h(t), we looked for better bounds that could be used to reduce the size of the initial bounding interval, thereby reducing the number of steps required for the convergence of the Pegasus method. Better bounds can be inexpensively computed from combinations of univariate and bivariate probabilities. The best bounds using these probabilities are the Hunter-Worsley bound for an upper t* limit (see Hsu, 1996, p. 229), and a modified Bonferroni bound for a lower t* limit (see Kwerel, 1975). If we let , the Hunter-Worsley and modified Bonferroni bounds guarantee that

where , and T* is a maximal spanning tree for the complete graph of order m with edge weights given by . If we use tMB to denote the t value where and tHW to denote the t value where , then . A combined procedure for determining t* is to first use the Pegasus method, starting with (tCL, tGC), to determine tHW, then use the Pegasus method, starting with (tCL, tHW), to determine tMB, and finally to use the Pegasus method starting with (tMB, tHW) and combined with numerical integration for computing h(t), to determine t*. For the tests reported in Table 2, we found that the interval (tMB, tHW) is usually very small, particularly for the smaller values, and in some cases (depending on the accuracy required for t*) it was not necessary to complete the final step in this procedure.

Another problem that arises when an iterative numerical method like the Pegasus method is combined with a numerical integration method, is the problem of balancing the errors from the two methods. We let be an approximation to t*, and be the numerical integration estimate of h(t). When carrying out the iterations for the Pegasus method, if we attempt to compute h(t) at a point close to t*, we actually compute

If we want to be able to stop the hybrid numerical procedure when for some small , then we need an estimate of h'(t*) so that we can set the absolute error tolerance for our numerical integration method to be less than . As the Pegasus iterations are carried out we can compute a simple estimate for h'(t*) with a difference quotient. We found as small as .01 for some problems for the smallest values considered, but in these cases we usually have very small starting intervals, and often no numerical integrations are necessary. The numerical integration error tolerance was set at for the tests that we carried out.

Selected test results are summarized in Table 2. We used the same sets of pj's that were used for the tests that were done to produce Table 1. We also added two other pj sets for m=5. The error tolerance was set at 0.001 for all of the tests. In the cases where , no numerical integration was necessary, and we set t* = (tHW+tMB)/2. For most of these problems, the computation time required to compute the required t* values to within the requested accuracy was not significant. In most cases the modified Bonferroni and Hunter-Worseley bonds could be used to quickly provide t* values to within the , and the KIC value was very accurate. The two additional pj sets were added for m=5 to provide examples where the KIC value is not as accurate. This occurs when there is significant variation in the pj values.

 pj's tCL tMB t* tKIC tHW tGC f-Values (.2, .1, .4, .3) .100 1.645 2.189 2.190 2.193 2.208 2.241 30016 .050 1.960 2.465 2.466 2.468 2.476 2.498 38864 .010 2.576 3.011 3.013 3.013 3.014 3.023 23520 .005 2.807 3.219 3.220 3.221 3.221 3.227 0 (.1, .2, .2, .2, .3) .100 1.645 2.288 2.288 2.289 2.308 2.326 44512 .050 1.960 2.554 2.554 2.555 2.565 2.576 95200 .010 2.576 3.084 3.084 3.084 3.086 3.090 46768 .005 2.807 3.286 3.287 3.287 3.288 3.291 0 (.3, .1, .05, .5, .05) .100 1.645 2.288 2.288 2.289 2.308 2.326 44512 .050 1.960 2.554 2.554 2.555 2.565 2.576 95200 .010 2.576 3.084 3.084 3.084 3.086 3.090 46768 .005 2.807 3.286 3.287 3.287 3.288 3.291 0 (.1, .1, .2, .2, .2, .2) .100 1.645 2.362 2.363 2.363 2.383 2.394 38256 .050 1.960 2.621 2.621 2.621 2.632 2.638 50960 .010 2.576 3.139 3.139 3.139 3.142 3.144 70336 .005 2.807 3.339 3.340 3.339 3.340 3.341 0 (.10, .30, .05, .50, .05) .100 1.645 2.280 2.282 2.289 2.296 2.326 15728 .050 1.960 2.547 2.549 2.555 2.555 2.576 13712 .010 2.576 3.079 3.079 3.084 3.080 3.090 0 .005 2.807 3.282 3.283 3.287 3.283 3.291 0 (.05, .05, .05, .05, .80) .100 1.645 2.277 2.280 2.289 2.290 2.326 97088 .050 1.960 2.546 2.547 2.555 2.551 2.576 137872 .010 2.576 3.079 3.079 3.084 3.080 3.090 0 .005 2.807 3.283 3.283 3.287 3.283 3.291 0 (.1, .1, .2, .2, .2, .1, .100 1.645 2.421 2.422 2.422 2.441 2.450 72272 .1) .050 1.960 2.675 2.675 2.676 2.685 2.690 112208 .010 2.576 3.185 3.185 3.185 3.187 3.189 18544 .005 2.807 3.382 3.383 3.382 3.383 3.384 0 (.1, .1, .1, .1, .15, .05, .100 1.645 2.471 2.472 2.473 2.490 2.498 160144 .2, .2) .050 1.960 2.721 2.722 2.721 2.730 2.734 154880 .010 2.576 3.224 3.225 3.224 3.226 3.227 20048 .005 2.807 3.419 3.419 3.419 3.420 3.421 0 (.01, .02, .07, .1, .15, .05, .100 1.645 2.514 2.515 2.516 2.531 2.539 33680 .3, .2, .1) .050 1.960 2.760 2.761 2.761 2.768 2.773 50000 .010 2.576 3.258 3.259 3.258 3.259 3.261 0 .005 2.807 3.451 3.451 3.451 3.452 3.452 0 (.1, .05, .05, .04, .06, .1, .100 1.645 2.553 2.554 2.554 2.570 2.576 55536 .15, .15, .1, .2) .050 1.960 2.796 2.796 2.796 2.804 2.807 484416 .010 2.576 3.288 3.289 3.288 3.290 3.291 0 .005 2.807 3.479 3.480 3.479 3.480 3.481 0 (.02, .08, .04, .06, .1, .1, .100 1.645 2.586 2.588 2.587 2.604 2.609 691008 .16, .14, .15, .1, .05) .050 1.960 2.827 2.827 2.827 2.835 2.838 489952 .010 2.576 3.315 3.316 3.315 3.317 3.317 0 .005 2.807 3.505 3.505 3.505 3.506 3.506 0 (.01, .03, .06, .05, .05, .1, .100 1.645 2.617 2.618 2.618 2.634 2.638 501248 .15, .05, .1, .14, .16, .1) .050 1.960 2.855 2.857 2.855 2.863 2.865 74544 .010 2.576 3.339 3.340 3.339 3.341 3.341 0 .005 2.807 3.528 3.529 3.528 3.529 3.529 0

Alan C Genz
2001-02-09