Comparing Variables of Different Types

So far we have only described two-sample tests that compare a continuous variable, e.g. blood pressure, across a categorical variable with two categories, e.g. one company vs another, or diabetic vs non-diabetic. I can’t write a whole statistics textbook, so I will just present more briefly what is done for other combinations of variable types.

Comparing more than Two Samples or more than Two Categorical Variables

The standard test used for 3+ samples and a normally distributed continuous variable is ANOVA – analysis of variance. It generalises the t-test to three or more samples via the F-test. The F-statistic is the ratio of the variance between the sample means to the variance of the subjects within their samples. The test essentially determines whether the variance between the samples is greater than the variance within them – if it is, by more than chance would allow, then the samples are likely drawn from different populations, and so are significantly different.
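To make the between-versus-within comparison concrete, here is the one-way F-statistic computed by hand on some invented blood-pressure readings from three hypothetical companies (in practice a statistics package does this, and also looks up the p-value, for you):

```python
# A minimal sketch of the one-way ANOVA F-statistic, computed by hand.
# The blood-pressure readings below are invented for illustration.
groups = [
    [120, 125, 130, 118, 122],   # company A
    [135, 140, 138, 132, 141],   # company B
    [121, 126, 124, 128, 119],   # company C
]

k = len(groups)                          # number of groups
n = sum(len(g) for g in groups)          # total number of subjects
grand_mean = sum(sum(g) for g in groups) / n

# Between-group sum of squares: how far each group mean sits from the grand mean
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of subjects around their own group mean
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

# F is the ratio of the two variances (mean squares)
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(round(f_stat, 2))   # a large F suggests the groups differ
```

A large F (here driven by company B's higher readings) indicates the variation between company means dwarfs the variation within companies.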

If we have three samples to test, we can consider that the samples are divided according to a variable. This grouping variable is the independent categorical variable: we divide all the subjects beforehand into three discrete categories according to one parameter, with one parameter value per sample. What we then actually measure is called the dependent variable.

If we are comparing bp in employees of three different companies, then company is the independent categorical variable (often called a factor) and bp is the dependent variable. This situation, comparing groups defined by just the one factor (which in this case has 3 levels or categories), is called one-way ANOVA. If there are two factors, e.g. looking at differences in bp by company and by gender, this is a two-way ANOVA; three factors is called three-way ANOVA, and so on.

A univariate ANOVA analysis, which may be one-way or two-way or more, analyses a single dependent variable (e.g. bp alone), while a multivariate ANOVA analysis, or MANOVA, looks at two or more dependent variables simultaneously (e.g. bp and cholesterol together) to see whether the 3+ categories differ across the set of outcomes as a whole.

Comparing two Continuous Variables

We might easily have a study comparing two continuous variables, e.g. blood pressure and blood sugar, as opposed to a continuous variable and a categorical variable, e.g. blood pressure and diabetic/non-diabetic status.

We can plot these two variables against one another on two axes and look at the resulting “scattergram”. A statistic that measures the strength of association between them is the correlation coefficient. This varies from 1 (perfect positive correlation), through zero (no correlation), to -1 (perfect negative correlation, i.e. as one variable rises the other drops). (The usual Pearson coefficient measures linear association; rank-based versions such as Spearman’s relax that assumption.) There are statistical tests that calculate this coefficient and can provide confidence intervals to see if it is significantly different from zero, i.e. from no correlation.
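As a sketch, the Pearson correlation coefficient can be computed directly from its definition – the covariance of the two variables divided by the product of their spreads. The paired blood-pressure and blood-sugar values below are invented:

```python
# Hand-rolled Pearson correlation coefficient on invented paired readings.
import math

bp    = [118, 122, 125, 130, 135, 140]   # systolic blood pressure, mmHg
sugar = [4.8, 5.1, 5.6, 6.0, 6.4, 7.1]   # blood sugar, mmol/L

n = len(bp)
mean_x = sum(bp) / n
mean_y = sum(sugar) / n

# Sum of cross-products of deviations (covariance, unnormalised)
cov  = sum((x - mean_x) * (y - mean_y) for x, y in zip(bp, sugar))
# Root sums of squared deviations for each variable
sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in bp))
sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in sugar))

r = cov / (sd_x * sd_y)
print(round(r, 3))   # close to +1: the invented values rise together
```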

If we think the relationship between the two might be linear, either positive (both increasing together) or negative (one increasing while the other decreases in proportion), we can test for this by using linear regression. This is simply seeing how well a straight line fits the distribution of the scattergram of points. The best-fit line can be determined by the least-squares method; this is the line where the sum of the squares of the differences between each actual y value and the line’s y value is minimised. Our old friend, the t-test, can be used to see if the slope of the best-fit line differs significantly from a null hypothesis level, often zero.
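The least-squares slope and intercept have simple closed forms, which can be sketched in a few lines. The points below are made up to lie close to a line of slope 2:

```python
# Least-squares fit of a straight line y = a + b*x to an invented scattergram.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b minimises the sum of squared vertical distances to the line
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x   # the best-fit line always passes through the means

print(round(b, 2), round(a, 2))   # slope near 2, intercept near 0
```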

[Figure: random data points and their linear regression line. Image credit: Wikipedia]

Comparing two Categorical Variables

Often we would like to compare two categories rather than two continuous measurements, for example to ask the question, “Are diabetics more likely to have ischaemic heart disease (IHD)?”.

To do this we can draw up a “contingency table” dividing a sample of subjects into four subcategories: diabetic and IHD, diabetic and no IHD, non-diabetic and IHD, non-diabetic and no IHD. We don’t need equal numbers in any categories; we just need a random sample and to record both variable values for each subject.

Once we have a contingency table, we set up a null hypothesis that there is no relationship. If this were the case we would expect diabetes to be evenly distributed across IHD and non-IHD subjects, and vice versa. The proportions need not be 50:50, because the sample might just happen to contain more non-diabetics than diabetics. We add the expected values, alongside the observed counts in the table, according to this even proportional distribution.

          Diabetic     Non-Diabetic   Total
IHD       25 (20.9)    35 (39.1)       60
Non-IHD   15 (19.1)    40 (35.9)       55
Total     40           75             115

(expected values under the null hypothesis in parentheses)

In this example, the total proportion of diabetics is 40/115, so under the null hypothesis this proportion will be the same in the first row, for IHD patients. So in a total of 60 IHD patients, 40/115 will be diabetic and 75/115 will be non-diabetic: 60 * 40 / 115 = 20.9 diabetics, and 60 * 75 / 115 = 39.1 non-diabetics. We can do the rest by simple subtraction.
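The same rule – each expected cell is (row total × column total) / grand total – can be checked for the whole table in a few lines:

```python
# Expected counts under the null hypothesis of no relationship,
# using the observed counts from the worked example above.
observed = [[25, 35],   # IHD:     diabetic, non-diabetic
            [15, 40]]   # non-IHD: diabetic, non-diabetic

row_totals = [sum(row) for row in observed]          # [60, 55]
col_totals = [sum(col) for col in zip(*observed)]    # [40, 75]
grand_total = sum(row_totals)                        # 115

# Each expected cell = row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
for row in expected:
    print([round(e, 1) for e in row])   # matches 20.9, 39.1, 19.1, 35.9
```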

The observed minus the expected value in each cell is a measure of the deviation from the null hypothesis, and the sum of these deviations relates to whether or not the total variation could be random sampling variation. We are interested in the magnitude, not the direction, of this variation, and squaring the differences is a convenient way to remove any negative signs.

Therefore it comes as no surprise that a test statistic expressing this degree of variation from expected in a normalised manner, the Chi-squared statistic, is:

Χ² = Σ((O – E)² / E)

Since the Chi-squared (or Χ², to use the Greek letter) statistic divides each squared difference by the expected value, it is scale-free: there is a single table, no matter how big or small the counts, where one can look up the corresponding p-value for the likelihood of this variation being due to chance. There is one complication, however; the statistic, like the t-test, depends on the degrees of freedom. There is inherently likely to be more variability if there are more categories, and so the p-values must allow for this.

As before, the degrees of freedom reflect how many cell counts are free to vary once the totals are fixed. There is only one degree of freedom in the example above: once one cell is filled in, the row and column totals determine all the others. In general, the degrees of freedom in a multi-category Chi-squared test is (number of categories of the first variable – 1) times (number of categories of the second variable – 1). On the contingency table this is (number of rows – 1) * (number of columns – 1). (We exclude the row and column for our totals, of course!)

In the example above, the Chi-squared statistic is 0.804 + 0.430 + 0.880 + 0.468 = 2.58. We look it up (I found an online calculator via an internet search engine) and get, for 1 df, a p-value of 0.11. So the relationship is a little suggestive, as we see from the figures, but we cannot reject the null hypothesis of no relationship between diabetes and IHD at the p = 0.05 level of significance (in this completely made-up sample, of course!).
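The arithmetic can be checked in a few lines. For 1 degree of freedom no table is even needed: a chi-squared variable with 1 df is the square of a standard normal, so the p-value is erfc(√(Χ²/2)), which the standard library provides:

```python
# Chi-squared statistic for the worked diabetes/IHD example, with its
# p-value for 1 degree of freedom computed via the normal-square identity.
import math

observed = [25, 35, 15, 40]
expected = [20.9, 39.1, 19.1, 35.9]   # from the contingency table above

# X^2 = sum of (O - E)^2 / E over all four cells
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# For 1 df, P(X^2 > x) = erfc(sqrt(x / 2))
p = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 2), round(p, 2))   # 2.58 and 0.11, as in the text
```

For tables bigger than 2×2 (df > 1) one falls back on chi-squared tables or a statistics library.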

