25  Testing for Independence

We now consider the second \(\chi^2\) method, namely the test of independence in a contingency table.

When investigating the relationship between two qualitative (categorical) variables, a common method is the chi-square test of independence. This method allows us to determine whether there is a significant association between the two variables in the population, based on data observed in a sample.

Example 25.1: Income Level and Housing Cost

A sample of 500 single-person households is classified based on their disposable income (low, medium, high) and their housing costs (low, medium, high). The data is presented in a contingency table:

Housing Cost Low Income Medium Income High Income Total
Low 35 50 15 100
Medium 50 120 30 200
High 15 80 105 200
Total 100 250 150 500

We want to test whether there is a relationship between income level and housing cost.

Our hypotheses are as follows:

  • Null hypothesis: The variables are independent; there is no association between income and housing cost.
  • Alternative hypothesis: The variables are not independent; there is an association between income and housing cost.

To explore this further, we can examine the relative percentages across income categories:

Housing Cost Low Income Medium Income High Income Total
Low 35% 20% 10% 20%
Medium 50% 48% 20% 40%
High 15% 32% 70% 40%
Total 100% 100% 100% 100%

From this, we can see that the distributions vary across income levels, suggesting dependence.

So what would independence look like?

Under complete independence, the same distribution of housing costs would occur across all income groups. Based on the total column proportions, expected percentages would be:

Housing Cost Expected % Low Income Medium Income High Income
Low 20% 20% 20% 20%
Medium 40% 40% 40% 40%
High 40% 40% 40% 40%

Using expected proportions, the expected counts under \(H_0\) are:

Housing Cost Low Income Medium Income High Income Total
Low 20 50 30 100
Medium 40 100 60 200
High 40 100 60 200
Total 100 250 150 500

These represent what we would expect to observe if there were no relationship between income and housing costs.

Let’s generalize this test for examining independence between any pair of categorical variables.

The observed data is arranged in a two-way contingency table with \(r\) rows and \(c\) columns, summarizing the joint distribution of two categorical variables:

1 2 \(\cdots\) \(c\) Total
1 \(O_{11}\) \(O_{12}\) \(\cdots\) \(O_{1c}\) \(R_1\)
2 \(O_{21}\) \(O_{22}\) \(\cdots\) \(O_{2c}\) \(R_2\)
\(\vdots\) \(\vdots\) \(\vdots\) \(\ddots\) \(\vdots\) \(\vdots\)
\(r\) \(O_{r1}\) \(O_{r2}\) \(\cdots\) \(O_{rc}\) \(R_r\)
Total \(C_1\) \(C_2\) \(\cdots\) \(C_c\) \(n\)

Here, \(O_{ij}\) denotes the observed frequency in row \(i\), column \(j\), while \(R_i\) and \(C_j\) are the row and column totals respectively, and \(n\) is the grand total.

We want to test: \[ \begin{align*} H_0 &: \text{The two classification variables are independent} \\ H_1 &: \text{The two classification variables are not independent} \end{align*} \]

We use the chi-square test statistic defined as: \[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \] where \(E_{ij}\) is the expected frequency under the assumption of independence:

\[ E_{ij} = \frac{R_i \cdot C_j}{n} \]

That is, the expected count in each cell is the product of its row and column totals divided by the grand total. Expected frequencies (\(E_{ij}\)) computed this way ensure that:

  • The relative distribution within each column is the same as the marginal distribution across rows (on the right).
  • The relative distribution within each row is the same as the marginal distribution across columns (at the bottom).

This matches what we would expect under independence between the variables.

If \(H_0\) is true and the sample is large enough, then the test statistic \(\chi^2\) approximately follows a chi-square distribution with \((r - 1)(c - 1)\) degrees of freedom.

We reject the null hypothesis if the test statistic \(\chi^2\) is greater than the critical value, determined from the chi-square distribution table (Appendix C), based on:

  • The chosen significance level \(\alpha\)
  • The degrees of freedom: \((r - 1)(c - 1)\)
Note: Rule of Thumb for the Chi-Square Test

The chi-square test for independence is a widely used method to assess whether two categorical variables are statistically independent by comparing observed frequencies to expected frequencies under the assumption of independence. It is crucial to ensure expected cell counts are sufficiently large to rely on the chi-square approximation.

To ensure the approximation is valid, a common guideline is:

  • At least 80% of the expected cell counts \(E_{ij}\) should be 5 or greater.
  • If this condition is not met, consider combining categories to increase expected counts.

Example 25.1: Income Level and Housing Cost (Cont’d)

We are given a \(3 \times 3\) contingency table of education level and income level among 500 individuals. The goal is to determine whether education level and income level are statistically independent.

Education Level Low Income Medium Income High Income Row Totals (\(R_i\))
Low 35 50 15 100
Medium 50 120 30 200
High 15 80 105 200
Column Totals (\(C_j\)) 100 250 150 \(n = 500\)

Step 1: Hypotheses

  • \(H_0\): There is no association between education and income level. The two variables are independent.
  • \(H_1\): There is an association between education and income level. The two variables are not independent.

Step 2: Significance Level
We choose a significance level of \(\alpha = 0.05\).

Step 3: Test Statistic
We will use the chi-square test statistic: \[ \chi^2 = \sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

where:

  • \(O_{ij}\) = observed count in cell \((i,j)\)
  • \(E_{ij}\) = expected count in cell \((i,j)\) which under \(H_0\), computed as: \[ E_{ij} = \frac{R_i \cdot C_j}{n} \]

Example: for cell (1,1): \[ E_{11} = \frac{100 \cdot 100}{500} = 20 \]

The expected values are shown in parentheses in the table below:

Education Level Low Income Medium Income High Income Total
Low 35 (20) 50 (50) 15 (30) 100
Medium 50 (40) 120 (100) 30 (60) 200
High 15 (40) 80 (100) 105 (60) 200
Total 100 250 150 500

Step 4: Decision rule
Degrees of freedom: \[ df = (r - 1)(c - 1) = (3 - 1)(3 - 1) = 4 \]

Using the chi-square table (Appendix C) we find: \[ \chi^2_{0.05, 4} = 9.488 \]

We thus reject the null if our observed value is greater than 9.488.

Step 5: Observation
Using the formula: \[ \chi^2_{\text{obs}} = \frac{(35 - 20)^2}{20} + \frac{(50 - 50)^2}{50} + \dots + \frac{(105 - 60)^2}{60} = 93.63 \]

Step 6: Conclusion
Compare the test statistic with the critical value:

  • \(\chi^2_{\text{obs}} = 93.63\)
  • \(\chi^2_{\text{crit}} = 9.488\)

Since \(93.63 > 9.488\), we reject the null hypothesis at the 5% significance level. There is strong evidence to suggest that education level and income level are not independent; there appears to be a significant association between them.

Exercises

  1. A political analyst wants to examine whether voting preference is independent of age group. A random sample of 300 voters was surveyed, and the results were summarized in the following contingency table:
Candidate A Candidate B Candidate C Total
18–29 years 28 45 27 100
30–49 years 36 33 31 100
50+ years 26 30 44 100
Total 90 108 102 300

Is there evidence at the 5% significance level that voting preference is dependent on age group?

Step 1: Hypotheses

  • \(H_0\): Voting preference is independent of age group.
  • \(H_1\): Voting preference is dependent on age group.

Step 2: Significance level
Set \(\alpha = 0.05\)

Step 3: Test statistic
\[ \chi^2 = \sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

where:

  • \(O_{ij}\) = observed count in cell \((i,j)\)
  • \(E_{ij}\) = expected count in cell \((i,j)\) which under \(H_0\), computed as: \[ E_{ij} = \frac{R_i \cdot C_j}{n} \]

Step 4: Decision rule
Degrees of freedom: df = \((3-1)(3-1) = 4\)
Critical value: \(\chi^2_{0.95, 4} = 9.488\)
Reject \(H_0\) if observed value > \(9.488\).

Step 5: Observation
We use the formula for expected cell counts: \[ E_{ij} = \frac{(\text{Row total}) \times (\text{Column total})}{\text{Grand total}} \]

Given Table (Observed Counts):

Candidate A Candidate B Candidate C Row Total
18–29 years 28 45 27 100
30–49 years 36 33 31 100
50+ years 26 30 44 100
Column Total 90 108 102 300

Expected values:
Row 1: Age 18–29 (Row Total = 100)

  • Candidate A:
    \[E_{11} = \frac{100 \times 90}{300} = 30\]
  • Candidate B:
    \[E_{12} = \frac{100 \times 108}{300} = 36\]
  • Candidate C:
    \[E_{13} = \frac{100 \times 102}{300} = 34\]

Row 2: Age 30–49 (Row Total = 100)

  • Candidate A:
    \[E_{21} = \frac{100 \times 90}{300} = 30\]
  • Candidate B:
    \[E_{22} = \frac{100 \times 108}{300} = 36\]
  • Candidate C:
    \[E_{23} = \frac{100 \times 102}{300} = 34\]

Row 3: Age 50+ (Row Total = 100)

  • Candidate A:
    \[E_{31} = \frac{100 \times 90}{300} = 30\]
  • Candidate B:
    \[E_{32} = \frac{100 \times 108}{300} = 36\]
  • Candidate C:
    \[E_{33} = \frac{100 \times 102}{300} = 34\]

Table over Observed vs (Expected Counts):

Age Group Candidate A Candidate B Candidate C
18–29 O = 28 (E = 30) O = 45 (E = 36) O = 27 (E = 34)
30–49 O = 36 (E = 30) O = 33 (E = 36) O = 31 (E = 34)
50+ O = 26 (E = 30) O = 30 (E = 36) O = 44 (E = 34)

18–29 years:

  • Candidate A:
    \[\frac{(28 - 30)^2}{30} = \frac{4}{30} = 0.133\]
  • Candidate B:
    \[\frac{(45 - 36)^2}{36} = \frac{81}{36} = 2.25\]
  • Candidate C:
    \[\frac{(27 - 34)^2}{34} = \frac{49}{34} ≈ 1.441\]

30–49 years:

  • Candidate A:
    \[\frac{(36 - 30)^2}{30} = \frac{36}{30} = 1.2\]
  • Candidate B:
    \[\frac{(33 - 36)^2}{36} = \frac{9}{36} = 0.25\]
  • Candidate C:
    \[\frac{(31 - 34)^2}{34} = \frac{9}{34} ≈ 0.265\]

50+ years:

  • Candidate A:
    \[\frac{(26 - 30)^2}{30} = \frac{16}{30} ≈ 0.533\]
  • Candidate B:
    \[\frac{(30 - 36)^2}{36} = \frac{36}{36} = 1.0\]
  • Candidate C:
    \[\frac{(44 - 34)^2}{34} = \frac{100}{34} ≈ 2.941\] Add all components: \[ \chi^2 ≈ 0.133 + 2.25 + 1.441 + 1.2 + 0.25 + 0.265 + 0.533 + 1.0 + 2.941 ≈ 10.013 \]

Step 6: Conclusion
Since \(10.013 > 9.488\), we reject the null hypothesis. There is evidence that voting preference depends on age group.