Still on our example of taking a sample of employees’ blood pressure to see if the mean bp is low enough to satisfy a certain standard, let us say that we found the sample mean to be such that we fail to reject the initial (null) hypothesis that the mean bp is no greater than desired.
For instance, if we took the same sample size of 20 employees with the standard deviation of 22.5 and found an observed sample mean bp of 182 mmHg compared to the desired level of 180 mmHg, the Z score is 2*√20/22.5=0.4 and this equates to a pα of 0.34; this is nowhere near as low as the critical value of 0.05 for a false positive (α error), so we cannot reject the null hypothesis that the mean bp is OK.
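As a check on the arithmetic above, here is a minimal Python sketch using only the standard library. The function name `upper_tail_p` is made up for this illustration, and the sketch assumes a one-tailed Z test with the SD treated as known.

```python
from math import sqrt
from statistics import NormalDist

def upper_tail_p(observed_mean, desired_mean, sd, n):
    """Return (Z, one-tailed p) for a sample mean lying above the desired mean."""
    z = (observed_mean - desired_mean) * sqrt(n) / sd
    p = 1 - NormalDist().cdf(z)  # area in the upper tail beyond Z
    return z, p

# Values from the text: observed 182 mmHg, desired 180 mmHg, SD 22.5, n = 20
z_alpha, p_alpha = upper_tail_p(182, 180, 22.5, 20)
print(f"Z = {z_alpha:.3f}, p_alpha = {p_alpha:.3f}")
```

Without rounding Z to 0.4 first, the p value comes out at about 0.345, which agrees with the figure quoted above to within rounding.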
At this point we could have three different scenarios.
To explore the β error, the strength of evidence that the observed mean does reflect the desired mean, we consider a normal distribution standard error plot about the maximum acceptable mean bp, that is, the highest mean that we would still not worry about even though it is a little higher than the desired level.
Because the chief medical officer is rather strict, the maximum acceptable mean in this scenario is quite close to the desired level of 180 mmHg; it is 185 mmHg. The observed mean of 182 mmHg was just lower than this; does it mean that we are confident the observed mean is OK? We consider two normal distribution standard error plots: one about the desired mean that relates to the first (null) hypothesis that the bp was the same as desired, and one about the maximum acceptable mean that relates to the second (alternative) hypothesis that the bp was greater than the maximum acceptable bp.
We already failed to reject the null hypothesis that the mean bp is OK, because the sample mean is lower than the cut-off point that corresponds to the critical pα level.
But from the graph, it also comes as no surprise that when we calculate the Z-score for the second hypothesis, that the true mean is greater than the maximum acceptable level, we get 3*√20/22.5=0.6 and this equates to a one-tailed pβ value of 0.27, again greater than the critical value (which most statisticians set at 0.05 or 0.1).
Therefore we also cannot reject the second hypothesis that the true mean bp is greater than the maximum acceptable level; in other words we are not sufficiently confident that the mean bp is lower than the maximum allowed and therefore is OK.
Due to the considerable overlap between the two SE plots, we are in a grey area; we can reject neither hypothesis. There is insufficient evidence to prove that the mean bp is too high, but it still might be too high as there was also insufficient evidence to say that the mean bp was lower than the maximum acceptable level.
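The grey area of this first scenario can be reproduced with a short Python sketch. The function name `lower_tail_p` is hypothetical, and a critical value of 0.05 is assumed for both tests (the text allows 0.05 or 0.1 for pβ).

```python
from math import sqrt
from statistics import NormalDist

def lower_tail_p(observed_mean, max_acceptable, sd, n):
    """One-tailed p for a sample mean lying below the maximum acceptable mean."""
    z = (observed_mean - max_acceptable) * sqrt(n) / sd  # negative when below
    return NormalDist().cdf(z)  # area in the lower tail

sd, n, observed = 22.5, 20, 182
p_alpha = 1 - NormalDist().cdf((observed - 180) * sqrt(n) / sd)  # first hypothesis
p_beta = lower_tail_p(observed, 185, sd, n)                      # second hypothesis

critical = 0.05  # assumed critical value for both tests
reject_first = p_alpha < critical
reject_second = p_beta < critical
print(f"p_alpha = {p_alpha:.3f}, reject first hypothesis: {reject_first}")
print(f"p_beta  = {p_beta:.3f}, reject second hypothesis: {reject_second}")
```

Neither p value falls below the critical value, so neither hypothesis is rejected.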
Let us make just one change to the calculation; say that the chief medical officer is lax, and is only going to be exercised if the bp is 12 mmHg higher than desired, not 5 mmHg. The desired and maximum acceptable SE plots are now further apart.
The pα value is of course the same. But the Z score for the second hypothesis is now 10*√20/22.5=2, and we know that this equates to a one-tailed pβ of 0.02. The observed mean of 182 mmHg, while being lower than the upper pα cut off point, is now also lower than the lower pβ cut off point. So now, when we fail to reject the first hypothesis that there is no discrepancy in bp, we can reject the second hypothesis that there is an important discrepancy in bp. We are confident that the mean bp will not be of concern to the chief medical officer and the error for this (the type 2 or β error) is only 2%.
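The two cut off points mentioned here can be placed on the mmHg scale. The sketch below is illustrative only: the function name `cutoffs` is made up, and critical values of 0.05 are assumed for both pα and pβ.

```python
from math import sqrt
from statistics import NormalDist

def cutoffs(desired, max_acceptable, sd, n, crit_alpha=0.05, crit_beta=0.05):
    """Return (upper p-alpha cut off, lower p-beta cut off) in mmHg."""
    se = sd / sqrt(n)                                # standard error of the mean
    z_alpha = NormalDist().inv_cdf(1 - crit_alpha)   # about 1.645 for 0.05
    z_beta = NormalDist().inv_cdf(1 - crit_beta)
    upper_alpha = desired + z_alpha * se             # above this: reject "bp is OK"
    lower_beta = max_acceptable - z_beta * se        # below this: reject "bp too high"
    return upper_alpha, lower_beta

# Lax chief medical officer: maximum acceptable mean is 180 + 12 = 192 mmHg
upper_alpha, lower_beta = cutoffs(180, 192, 22.5, 20)
print(f"upper p_alpha cut off = {upper_alpha:.1f} mmHg")
print(f"lower p_beta cut off  = {lower_beta:.1f} mmHg")
print("observed 182 is below both:", 182 < lower_beta < upper_alpha)
```

Under these assumptions the cut off points fall at roughly 188.3 and 183.7 mmHg, so the observed mean of 182 mmHg sits below both.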
One can imagine a scenario where the two plots are such that they overlap a little and the upper critical pα level happens to be the same as the lower pβ level. This degree of separation of the two plots is thus just such that if we cannot reject the first hypothesis, we can always automatically reject the second. We will always get a confident answer that either the mean bp was too great or there is good evidence that it is satisfactory.
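For the sample of 20 used so far, the separation at which the two cut off points coincide can be worked out directly. This is a sketch under assumed critical values of 0.05 for both pα and pβ; the name `coincident_separation` is invented for the example.

```python
from math import sqrt
from statistics import NormalDist

def coincident_separation(sd, n, crit_alpha=0.05, crit_beta=0.05):
    """Gap between the two means at which the two cut off points meet."""
    se = sd / sqrt(n)
    z_alpha = NormalDist().inv_cdf(1 - crit_alpha)
    z_beta = NormalDist().inv_cdf(1 - crit_beta)
    # desired + z_alpha*se == max_acceptable - z_beta*se, solved for the gap
    return (z_alpha + z_beta) * se

gap = coincident_separation(22.5, 20)
print(f"cut off points coincide when the means are {gap:.2f} mmHg apart")
```

With SD 22.5 and n = 20 the gap is about 16.6 mmHg, that is, a maximum acceptable mean of roughly 196.6 mmHg.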
A lazy and clever bp sampler might realise that, given a certain SD, the only two variables that determine which of the three above scenarios pertains are the leniency of the acceptable unimportant variation (how far apart are the two SE plots) and the number of employees measured in the sample (how wide is each SE plot). For example, if he knows that the chief medical officer’s acceptable variation is 5 mmHg, he can work out beforehand just how many employees he needs to measure without wasting any effort. He can avoid the potential uncertainty of the grey area of scenario 1 if the sample mean does not turn out to be obviously too high, and also the wasted effort of the widely separated SE plots, where it would be fairly obvious that the two hypotheses are mutually exclusive.
Graphically, one can see that in the unique situation of scenario 3, the desired mean bp according to the null hypothesis, μ0, is exactly far enough away from μa that its Z score is equal to Zβ plus Zα, the Z scores that equate to the critical values for pβ and pα.
Zα + Zβ= ( μa – μ0) √n /SD
Solving for the sample size:
n = (Zα + Zβ)² SD² / (μa – μ0)²
The sampler therefore needs to do a small pilot study to estimate the SD of employees’ bp (where we can use our previous value of 22.5mmHg), decide upon a pα and a pβ (we can say 0.05 for each), and note that the chief medical officer’s acceptable variation is 5 mmHg.
The one-tailed Z score for a desired p of 0.05 is 1.65 according to the table or a calculator, and this serves for both Zα and Zβ. Using the equation, the value of n is 3.3² × 22.5² / 5² = 220.5, so we would round that up to 221 employees that need to be sampled, quite a bit more than 20. He had better get to work!
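The sample size formula is easy to wrap in a small function. The sketch below is one possible way to do it; the name `sample_size` and its parameters are assumptions for illustration, not an established API.

```python
from math import ceil
from statistics import NormalDist

def sample_size(sd, acceptable_variation, z_alpha, z_beta):
    """n = (Z_alpha + Z_beta)^2 * SD^2 / (mu_a - mu_0)^2, before rounding up."""
    return (z_alpha + z_beta) ** 2 * sd ** 2 / acceptable_variation ** 2

# Table value of Z rounded to 1.65, as in the text
n_rounded_z = sample_size(22.5, 5, 1.65, 1.65)

# Unrounded Z for a one-tailed p of 0.05 (about 1.645)
z = NormalDist().inv_cdf(0.95)
n_exact_z = sample_size(22.5, 5, z, z)

print(f"Z = 1.65:   n = {n_rounded_z:.1f}, round up to {ceil(n_rounded_z)}")
print(f"Z = {z:.3f}:  n = {n_exact_z:.1f}, round up to {ceil(n_exact_z)}")
```

Rounding Z to 1.65 gives 221 subjects; the unrounded Z of about 1.645 gives 220. The difference is immaterial in practice, and rounding up errs on the safe side.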
Where scenario 3 first seemed like an abstract mathematical concept, we see now that it is really useful. In designing a clinical trial, subjects are money; the fewer the subjects the cheaper the trial. Drug companies want an answer yes or no but they don’t want to waste resources on getting p-values of 0.001 when they only need p=0.05. Determining the number needed in a trial before the study starts is called a power calculation; it finds the number of subjects that will make the study sufficiently powered, so that if no significant difference is found, it is likely that there is indeed no important difference.
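The same idea can be run in reverse: given a sample size, how well powered is the study? The sketch below assumes a one-tailed Z test with pα = 0.05 and takes power as 1 minus the β error when the true mean sits exactly at the maximum acceptable level; the function name `power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power(n, sd, acceptable_variation, crit_alpha=0.05):
    """Chance of rejecting the null when the true mean is at the maximum acceptable level."""
    z_alpha = NormalDist().inv_cdf(1 - crit_alpha)
    shift = acceptable_variation * sqrt(n) / sd  # gap between the means in SE units
    return NormalDist().cdf(shift - z_alpha)

power_20 = power(20, 22.5, 5)
power_221 = power(221, 22.5, 5)
print(f"n = 20:  power = {power_20:.2f}")
print(f"n = 221: power = {power_221:.2f}")
```

Under these assumptions the original sample of 20 has a power of only about 0.26 for a 5 mmHg difference, while 221 subjects bring it to just over 0.95.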
There is nothing that makes the average clinician’s eyes glaze over and heart sink more than an Ethics Committee application. Full of initial enthusiasm for conducting a planned clinical trial, he or she gets the 40-page form that checks if it is indeed ethical to ask patients if they keep a pet at home, or some other equally dangerous intervention. And then on page 39, he or she sees the section, “Please do a power calculation for your proposed sample size”. And this is the exact point at which many proposed studies become abandoned studies. But now you know the secret of the dreaded power calculation, and may go forth and investigate! Just set your two desired p values to 0.05, get an SD from a pilot study or previous study on a similar population, choose a value of variation that you will not care about and use the formula above.