25 Testing for Independence

We now consider the second \(\chi^2\) method, namely the test of independence in a contingency table.

When investigating the relationship between two qualitative (categorical) variables, a common method is the chi-square test of independence. This method allows us to determine whether there is a significant association between the two variables in the population, based on data observed in a sample.

Example 25.1: Income Level and Housing Cost

A sample of 500 single-person households is classified based on their disposable income (low, medium, high) and their housing costs (low, medium, high). The data is presented in a contingency table:

Housing Cost	Low Income	Medium Income	High Income	Total
Low	35	50	15	100
Medium	50	120	30	200
High	15	80	105	200
Total	100	250	150	500

We want to test whether there is a relationship between income level and housing cost.

Our hypotheses are as follows:

Null hypothesis: The variables are independent; there is no association between income and housing cost.
Alternative hypothesis: The variables are not independent; there is an association between income and housing cost.

To explore this further, we can examine the relative percentages across income categories:

Housing Cost	Low Income	Medium Income	High Income	Total
Low	35%	20%	10%	20%
Medium	50%	48%	20%	40%
High	15%	32%	70%	40%
Total	100%	100%	100%	100%

From this, we can see that the distributions vary across income levels, suggesting dependence.

So what would independence look like?

Under complete independence, the same distribution of housing costs would occur across all income groups. Based on the total column proportions, expected percentages would be:

Housing Cost	Expected %	Low Income	Medium Income	High Income
Low	20%	20%	20%	20%
Medium	40%	40%	40%	40%
High	40%	40%	40%	40%

Using expected proportions, the expected counts under \(H_0\) are:

Housing Cost	Low Income	Medium Income	High Income	Total
Low	20	50	30	100
Medium	40	100	60	200
High	40	100	60	200
Total	100	250	150	500

These represent what we would expect to observe if there were no relationship between income and housing costs.

Let’s generalize this test for examining independence between any pair of categorical variables.

The observed data is arranged in a two-way contingency table with \(r\) rows and \(c\) columns, summarizing the joint distribution of two categorical variables:

	1	2	\(\cdots\)	\(c\)	Total
1	\(O_{11}\)	\(O_{12}\)	\(\cdots\)	\(O_{1c}\)	\(R_1\)
2	\(O_{21}\)	\(O_{22}\)	\(\cdots\)	\(O_{2c}\)	\(R_2\)
\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\ddots\)	\(\vdots\)	\(\vdots\)
\(r\)	\(O_{r1}\)	\(O_{r2}\)	\(\cdots\)	\(O_{rc}\)	\(R_r\)
Total	\(C_1\)	\(C_2\)	\(\cdots\)	\(C_c\)	\(n\)

Here, \(O_{ij}\) denotes the observed frequency in row \(i\), column \(j\), while \(R_i\) and \(C_j\) are the row and column totals respectively, and \(n\) is the grand total.

We want to test: \[ \begin{align*} H_0 &: \text{The two classification variables are independent} \\ H_1 &: \text{The two classification variables are not independent} \end{align*} \]

We use the chi-square test statistic defined as: \[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \] where \(E_{ij}\) is the expected frequency under the assumption of independence:

\[ E_{ij} = \frac{R_i \cdot C_j}{n} \]

That is, the expected count in each cell is the product of its row and column totals divided by the grand total. Expected frequencies (\(E_{ij}\)) computed this way ensure that:

The relative distribution within each column is the same as the marginal distribution across rows (on the right).
The relative distribution within each row is the same as the marginal distribution across columns (at the bottom).

This matches what we would expect under independence between the variables.

If \(H_0\) is true and the sample is large enough, then the test statistic \(\chi^2\) approximately follows a chi-square distribution with \((r - 1)(c - 1)\) degrees of freedom.

We reject the null hypothesis if the test statistic \(\chi^2\) is greater than the critical value, determined from the chi-square distribution table (Appendix C), based on:

The chosen significance level \(\alpha\)
The degrees of freedom: \((r - 1)(c - 1)\)

Note: Rule of Thumb for the Chi-Square Test

The chi-square test for independence is a widely used method to assess whether two categorical variables are statistically independent by comparing observed frequencies to expected frequencies under the assumption of independence. It is crucial to ensure expected cell counts are sufficiently large to rely on the chi-square approximation.

To ensure the approximation is valid, a common guideline is:

At least 80% of the expected cell counts \(E_{ij}\) should be 5 or greater.
If this condition is not met, consider combining categories to increase expected counts.

Example 25.1: Income Level and Housing Cost (Cont’d)

We are given a \(3 \times 3\) contingency table of education level and income level among 500 individuals. The goal is to determine whether education level and income level are statistically independent.

Education Level	Low Income	Medium Income	High Income	Row Totals (\(R_i\))
Low	35	50	15	100
Medium	50	120	30	200
High	15	80	105	200
Column Totals (\(C_j\))	100	250	150	\(n = 500\)

Step 1: Hypotheses

\(H_0\): There is no association between education and income level. The two variables are independent.
\(H_1\): There is an association between education and income level. The two variables are not independent.

Step 2: Significance Level
We choose a significance level of \(\alpha = 0.05\).

Step 3: Test Statistic
We will use the chi-square test statistic: \[ \chi^2 = \sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

where:

\(O_{ij}\) = observed count in cell \((i,j)\)
\(E_{ij}\) = expected count in cell \((i,j)\) which under \(H_0\), computed as: \[ E_{ij} = \frac{R_i \cdot C_j}{n} \]

Example: for cell (1,1): \[ E_{11} = \frac{100 \cdot 100}{500} = 20 \]

The expected values are shown in parentheses in the table below:

Education Level	Low Income	Medium Income	High Income	Total
Low	35 (20)	50 (50)	15 (30)	100
Medium	50 (40)	120 (100)	30 (60)	200
High	15 (40)	80 (100)	105 (60)	200
Total	100	250	150	500

Step 4: Decision rule
Degrees of freedom: \[ df = (r - 1)(c - 1) = (3 - 1)(3 - 1) = 4 \]

Using the chi-square table (Appendix C) we find: \[ \chi^2_{0.05, 4} = 9.488 \]

We thus reject the null if our observed value is greater than 9.488.

Step 5: Observation
Using the formula: \[ \chi^2_{\text{obs}} = \frac{(35 - 20)^2}{20} + \frac{(50 - 50)^2}{50} + \dots + \frac{(105 - 60)^2}{60} = 93.63 \]

Step 6: Conclusion
Compare the test statistic with the critical value:

\(\chi^2_{\text{obs}} = 93.63\)
\(\chi^2_{\text{crit}} = 9.488\)

Since \(93.63 > 9.488\), we reject the null hypothesis at the 5% significance level. There is strong evidence to suggest that education level and income level are not independent; there appears to be a significant association between them.

Exercises

A political analyst wants to examine whether voting preference is independent of age group. A random sample of 300 voters was surveyed, and the results were summarized in the following contingency table:

	Candidate A	Candidate B	Candidate C	Total
18–29 years	28	45	27	100
30–49 years	36	33	31	100
50+ years	26	30	44	100
Total	90	108	102	300

Is there evidence at the 5% significance level that voting preference is dependent on age group?

Solution

Step 1: Hypotheses

\(H_0\): Voting preference is independent of age group.
\(H_1\): Voting preference is dependent on age group.

Step 2: Significance level
Set \(\alpha = 0.05\)

Step 3: Test statistic
\[ \chi^2 = \sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

where:

\(O_{ij}\) = observed count in cell \((i,j)\)
\(E_{ij}\) = expected count in cell \((i,j)\) which under \(H_0\), computed as: \[ E_{ij} = \frac{R_i \cdot C_j}{n} \]

Step 4: Decision rule
Degrees of freedom: df = \((3-1)(3-1) = 4\)
Critical value: \(\chi^2_{0.95, 4} = 9.488\)
Reject \(H_0\) if observed value > \(9.488\).

Step 5: Observation
We use the formula for expected cell counts: \[ E_{ij} = \frac{(\text{Row total}) \times (\text{Column total})}{\text{Grand total}} \]

Given Table (Observed Counts):

	Candidate A	Candidate B	Candidate C	Row Total
18–29 years	28	45	27	100
30–49 years	36	33	31	100
50+ years	26	30	44	100
Column Total	90	108	102	300

Expected values:
Row 1: Age 18–29 (Row Total = 100)

Candidate A:
\[E_{11} = \frac{100 \times 90}{300} = 30\]
Candidate B:
\[E_{12} = \frac{100 \times 108}{300} = 36\]
Candidate C:
\[E_{13} = \frac{100 \times 102}{300} = 34\]

Row 2: Age 30–49 (Row Total = 100)

Candidate A:
\[E_{21} = \frac{100 \times 90}{300} = 30\]
Candidate B:
\[E_{22} = \frac{100 \times 108}{300} = 36\]
Candidate C:
\[E_{23} = \frac{100 \times 102}{300} = 34\]

Row 3: Age 50+ (Row Total = 100)

Candidate A:
\[E_{31} = \frac{100 \times 90}{300} = 30\]
Candidate B:
\[E_{32} = \frac{100 \times 108}{300} = 36\]
Candidate C:
\[E_{33} = \frac{100 \times 102}{300} = 34\]

Table over Observed vs (Expected Counts):

Age Group	Candidate A	Candidate B	Candidate C
18–29	O = 28 (E = 30)	O = 45 (E = 36)	O = 27 (E = 34)
30–49	O = 36 (E = 30)	O = 33 (E = 36)	O = 31 (E = 34)
50+	O = 26 (E = 30)	O = 30 (E = 36)	O = 44 (E = 34)

18–29 years:

Candidate A:
\[\frac{(28 - 30)^2}{30} = \frac{4}{30} = 0.133\]
Candidate B:
\[\frac{(45 - 36)^2}{36} = \frac{81}{36} = 2.25\]
Candidate C:
\[\frac{(27 - 34)^2}{34} = \frac{49}{34} ≈ 1.441\]

30–49 years:

Candidate A:
\[\frac{(36 - 30)^2}{30} = \frac{36}{30} = 1.2\]
Candidate B:
\[\frac{(33 - 36)^2}{36} = \frac{9}{36} = 0.25\]
Candidate C:
\[\frac{(31 - 34)^2}{34} = \frac{9}{34} ≈ 0.265\]

50+ years:

Candidate A:
\[\frac{(26 - 30)^2}{30} = \frac{16}{30} ≈ 0.533\]
Candidate B:
\[\frac{(30 - 36)^2}{36} = \frac{36}{36} = 1.0\]
Candidate C:
\[\frac{(44 - 34)^2}{34} = \frac{100}{34} ≈ 2.941\] Add all components: \[ \chi^2 ≈ 0.133 + 2.25 + 1.441 + 1.2 + 0.25 + 0.265 + 0.533 + 1.0 + 2.941 ≈ 10.013 \]

Step 6: Conclusion
Since \(10.013 > 9.488\), we reject the null hypothesis. There is evidence that voting preference depends on age group.