How do we determine the error of concluding there is no discrepancy between sample and population means when in fact there is a discrepancy (called the false negative, type 2 or β error)?
Well, it’s not straightforward. And so some medics and medical writers simply ignore it! As a result, in the medical literature and in the press we have a whole host of false assumptions – “this drug doesn’t do this”, “this treatment is no better than that”, “there is no difference in some parameter between this disease and that” – when the statistical tests in the studies merely concluded that there was no strong evidence for a difference!
Taking the example that the mean bp of a sample of company employees was 185 mmHg when the desired mean bp was 180 mmHg, this corresponds as we saw earlier to a Z score of 1 and a p value (one-tailed) of 0.16. We cannot reject the null hypothesis that the bp is OK, because the α error is too great.
On the other hand, it would be foolish to conclude that the mean bp is satisfactory. The mean is clearly at the upper end of the scale. How do we quantify this? We cannot simply turn it around and say there is an 84% (100%-16%) chance that the mean bp is OK. That 84% is the confidence level that the mean bp is not OK.
At the same time, we cannot set up the opposite null hypothesis, that the sample bp is different from the desired bp, and calculate a p-value that would let us reject it and conclude that the two are identical. The chance that the sample mean would be exactly identical to the population mean is in fact infinitely small: there is always going to be some random variation of a sample mean about the population mean. (In a one-tailed test we do not consider the lower half of the SE plot.)
Instead, we have to decide on a level of random variation that we do not care about.
For example, if measurements are only taken to the nearest mmHg, we will not care if the population mean is really 181 mmHg. In fact, the chief medical officer might not worry about a variation in bp of less than 5 mmHg; so our threshold for what we would tolerate might then be 185 mmHg. This decision on acceptable unimportant variation has to be a common sense one based on the data set; it is not something that can be statistically calculated.
Now we have two plots: one for the standard error about the desired mean, and one for the standard error about the highest acceptable mean, given the variation we have decided we do not care about. If we fail to reject the first null hypothesis, namely that the sample could have come from a population with the desired mean, we set up a second null hypothesis: the sample comes from a population whose mean is at (or above) the maximum acceptable mean. We will reject this second hypothesis if the observed sample mean is significantly lower than the maximum acceptable mean; in other words, we then have significant evidence that the mean bp is in fact OK.
So how to calculate this? We use the same formula as above, but this time we are interested in the difference between the maximum acceptable mean, μa, and the observed mean, μ. The null hypothesis is now that μa is the true population mean. So:
Z = (μa-μ)√n /SD
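As a rough illustration, the formula above can be sketched in Python. The SD of 20 mmHg and the sample size of 16 are assumptions chosen only so that the standard error is 5 mmHg, which matches the Z score of 1 quoted for 185 versus 180 mmHg. The function names are made up for this sketch, and the second threshold of 195 mmHg is purely hypothetical.

```python
import math


def z_for_max_acceptable(mu_a, sample_mean, sd, n):
    """Z = (mu_a - sample_mean) * sqrt(n) / SD, as in the formula above."""
    return (mu_a - sample_mean) * math.sqrt(n) / sd


def one_tailed_p(z):
    """Area of the standard normal curve beyond z (one tail)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Assumed values (not from the text): SD = 20 mmHg, n = 16, so SE = 5 mmHg.
sd, n = 20.0, 16
sample_mean = 185.0

# Threshold from the text: maximum acceptable mean of 185 mmHg.
z_185 = z_for_max_acceptable(185.0, sample_mean, sd, n)
p_185 = one_tailed_p(z_185)

# Hypothetical looser threshold of 195 mmHg, for comparison only.
z_195 = z_for_max_acceptable(195.0, sample_mean, sd, n)
p_195 = one_tailed_p(z_195)

print(f"mu_a = 185: Z = {z_185:.2f}, p = {p_185:.4f}")
print(f"mu_a = 195: Z = {z_195:.2f}, p = {p_195:.4f}")
```

With these assumed numbers, a sample mean sitting exactly on the 185 mmHg threshold gives Z = 0 and p = 0.5, so there is no evidence at all that the bp is OK. Only a threshold well above the observed mean (here Z = 2 for 195 mmHg) gives a small p-value.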
We said earlier that, in this example of sampling employees’ bp, determining the probability of the α error, pα, uses a one-tailed p-value but that in most cases we tend to use two-tailed tests. However, for determining probability of β error, pβ, we typically use a one-tailed probability for determining non-compliance even if we used two-tailed probability for the α error. This is because the second hypothesis comes after the first. At this point we already know whether the sample mean is higher or lower than the desired mean, but just not by enough to be significantly different. If it is higher than the desired mean, the more challenging test is obviously whether or not it is low enough still to represent the desired mean, not whether it is high enough.
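To see what the choice between one-tailed and two-tailed probabilities means numerically, here is a minimal sketch using the Z score of 1 from the bp example. The helper names are illustrative only, and a normal distribution of sample means is assumed.

```python
import math


def one_tailed_p(z):
    """Area of the standard normal curve beyond z in a single tail."""
    return 0.5 * math.erfc(abs(z) / math.sqrt(2))


def two_tailed_p(z):
    """Area beyond |z| in both tails together."""
    return math.erfc(abs(z) / math.sqrt(2))


z = 1.0  # Z score from the bp example (185 vs 180 mmHg)
p_one = one_tailed_p(z)
p_two = two_tailed_p(z)

print(f"one-tailed p = {p_one:.2f}")  # about 0.16, as quoted in the text
print(f"two-tailed p = {p_two:.2f}")  # twice the one-tailed value
```

For the same Z score the two-tailed probability is simply double the one-tailed one, so the choice of tail matters most when the result is close to the significance threshold.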