Chapter 1 Counts and Tables

You should use this method if the data is:

  • Categorical

In this chapter you will learn how to do data exploration for categorical variables using tables (also called contingency tables) and counts.

1.1 Counts

Count for one variable:

  • Use the data frame wage_df.
  • Apply count() to maritl.
  • Sort the counts in descending order (sort = TRUE).
  • Use gt() to print the table.
wage_df %>% 
  count(maritl,
        sort = TRUE) %>% 
  gt()
maritl               n
2. Married        2074
1. Never Married   648
4. Divorced        204
5. Separated        55
3. Widowed          19

Count two variables:

  • Use the data frame wage_df.
  • Apply count() to maritl and education.
  • Sort the counts in descending order (sort = TRUE).
  • Use gt() to print the table.
wage_df %>% 
  count(maritl, education,
        sort = TRUE) %>% 
  gt()
maritl            education             n
2. Married        2. HS Grad          651
2. Married        4. College Grad     487
2. Married        3. Some College     421
2. Married        5. Advanced Degree  341
1. Never Married  2. HS Grad          219
2. Married        1. < HS Grad        174
1. Never Married  3. Some College     164
1. Never Married  4. College Grad     143
4. Divorced       2. HS Grad           73
1. Never Married  1. < HS Grad         62
1. Never Married  5. Advanced Degree   60
4. Divorced       3. Some College      52
4. Divorced       4. College Grad      41
4. Divorced       5. Advanced Degree   22
5. Separated      2. HS Grad           20
4. Divorced       1. < HS Grad         16
5. Separated      1. < HS Grad         14
5. Separated      3. Some College      11
5. Separated      4. College Grad       9
3. Widowed        2. HS Grad            8
3. Widowed        4. College Grad       5
3. Widowed        1. < HS Grad          2
3. Widowed        3. Some College       2
3. Widowed        5. Advanced Degree    2
5. Separated      5. Advanced Degree    1
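
The long table above and a two-way contingency table built with table() hold the same counts in different layouts. The following is a minimal sketch of that relationship. It uses a small made-up data frame (toy_df, with the hypothetical columns job and area) instead of wage_df, so it runs on its own.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration only
toy_df <- data.frame(
  job  = c("office", "office", "factory", "factory", "factory", "office"),
  area = c("north", "south", "north", "north", "south", "north")
)

# Long format: one row per combination of job and area
long_counts <- toy_df %>%
  count(job, area)

# xtabs() spreads the long counts into a two-way table
wide_counts <- xtabs(n ~ job + area, data = long_counts)

# table() builds the same two-way table directly from the raw columns
direct <- table(toy_df$job, toy_df$area)

long_counts
wide_counts
```

The long format is convenient for printing with gt(), while the wide format is what functions such as prop.table() and chisq.test() expect.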

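Obtain the sum of a quantitative variable (wage) for each level of a categorical variable (maritl) by passing the quantitative variable to the wt argument of count(). Instead of counting rows, count() then adds up the wage values within each level: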

wage_df %>%  
  count(maritl,
        wt = wage,
        name = "Sum") %>% 
  gt()
maritl                   Sum
1. Never Married   60092.052
2. Married        246516.180
3. Widowed          1891.234
4. Divorced        21044.489
5. Separated        5566.868
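
To see what wt does, it may help to compare it with an explicit group-wise sum. The sketch below uses a made-up data frame (toy_df, with the hypothetical columns group and value) so that the result can be checked by hand.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration only
toy_df <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(10, 20, 1, 2, 3)
)

# Weighted count: sums value within each group
weighted <- toy_df %>%
  count(group, wt = value, name = "Sum")

# The same result written out with group_by() and summarise()
summed <- toy_df %>%
  group_by(group) %>%
  summarise(Sum = sum(value))

weighted
```

Group a sums to 30 and group b to 6 in both versions.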

1.2 Total counts

Total counts are a useful way to show how many observations fall into each combination of the levels of two categorical variables. We create a contingency table of the two categorical variables jobclass and race and call the result tab:

tab <- table(wage_df$jobclass, wage_df$race)
tab
##                 
##                  1. White 2. Black 3. Asian 4. Other
##   1. Industrial      1325      111       86       22
##   2. Information     1155      182      104       15
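
If wage_df is not at hand, the same table can be rebuilt from the counts printed above. The sketch below does this and then appends row and column totals with addmargins(). The dimension labels jobclass and race are added here for readability; the table() call above produces unlabelled dimensions.

```r
# Rebuild the 2 x 4 contingency table from the printed counts
tab <- as.table(matrix(
  c(1325, 111,  86, 22,
    1155, 182, 104, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    jobclass = c("1. Industrial", "2. Information"),
    race     = c("1. White", "2. Black", "3. Asian", "4. Other")
  )
))

# Append row totals, column totals and the grand total
addmargins(tab)
```

The grand total in the bottom-right corner should equal 3000, the number of observations.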

1.3 Joint proportions

We can also express each cell as a proportion of the total number of observations (here n = 3000). To do so, simply divide each of the total counts by 3000.

The following code generates tables of joint proportions:

# joint proportions
prop.table(tab) 
##                 
##                     1. White    2. Black    3. Asian    4. Other
##   1. Industrial  0.441666667 0.037000000 0.028666667 0.007333333
##   2. Information 0.385000000 0.060666667 0.034666667 0.005000000

For example, around 44% of all people in the dataset are white industrial workers.
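
The division can also be written out by hand. The following sketch rebuilds the table from the printed counts (so it does not depend on wage_df), divides every cell by the grand total, and rounds the result to percentages.

```r
# Rebuild the contingency table from the printed counts
tab <- as.table(matrix(
  c(1325, 111,  86, 22,
    1155, 182, 104, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    jobclass = c("1. Industrial", "2. Information"),
    race     = c("1. White", "2. Black", "3. Asian", "4. Other")
  )
))

# Joint proportions: every cell divided by the grand total
joint <- tab / sum(tab)

# Same numbers as prop.table(tab), shown as percentages
round(100 * joint, 1)
```

All eight joint proportions add up to 1, because every observation falls into exactly one cell.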

1.4 Conditional proportions: columns

You may also want to know the probability that workers have a certain jobclass, given that they have a particular ethnic background. This is a so-called conditional probability: the chance that one event occurs given that a second event has already occurred.

The following code generates tables of conditional proportions:

# conditional on columns
prop.table(tab, 2)  
##                 
##                   1. White  2. Black  3. Asian  4. Other
##   1. Industrial  0.5342742 0.3788396 0.4526316 0.5945946
##   2. Information 0.4657258 0.6211604 0.5473684 0.4054054

We performed a columnwise evaluation and are now able to answer the following question:

  • Approximately what proportion of all white workers are industrial workers?
  • The answer is: around 53%.
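
Conditioning on columns means dividing each cell by its column total. As a sketch of what prop.table(tab, 2) computes, the code below rebuilds the table from the printed counts and performs the division with sweep().

```r
# Rebuild the contingency table from the printed counts
tab <- as.table(matrix(
  c(1325, 111,  86, 22,
    1155, 182, 104, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    jobclass = c("1. Industrial", "2. Information"),
    race     = c("1. White", "2. Black", "3. Asian", "4. Other")
  )
))

# Divide every column (margin 2) by its column total
col_cond <- sweep(tab, 2, colSums(tab), "/")

# Each column now sums to 1
colSums(col_cond)
```

Because each column is rescaled separately, proportions can only be compared within a column, not across columns.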

1.5 Conditional proportions: rows

Now we want to obtain the probability that workers have a certain race, given their jobclass.

# conditional on rows
prop.table(tab, 1)  
##                 
##                    1. White   2. Black   3. Asian   4. Other
##   1. Industrial  0.85816062 0.07189119 0.05569948 0.01424870
##   2. Information 0.79326923 0.12500000 0.07142857 0.01030220

We performed a rowwise evaluation and are now able to answer the following question:

  • Approximately what proportion of all industrial workers are white?
  • The answer is: around 86%.
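
The row-wise and column-wise answers use the same cell count (1325 white industrial workers) but different denominators. The sketch below, again using a table rebuilt from the printed counts, puts the two conditional proportions side by side.

```r
# Rebuild the contingency table from the printed counts
tab <- as.table(matrix(
  c(1325, 111,  86, 22,
    1155, 182, 104, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    jobclass = c("1. Industrial", "2. Information"),
    race     = c("1. White", "2. Black", "3. Asian", "4. Other")
  )
))

n_white_industrial <- tab["1. Industrial", "1. White"]

# Share of industrial workers who are white (divide by the row total)
p_white_given_industrial <- n_white_industrial / sum(tab["1. Industrial", ])

# Share of white workers who are industrial (divide by the column total)
p_industrial_given_white <- n_white_industrial / sum(tab[, "1. White"])

round(c(white_given_industrial = p_white_given_industrial,
        industrial_given_white = p_industrial_given_white), 2)
```

The two values (about 0.86 and 0.53) answer different questions, so it matters which variable is conditioned on.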

1.6 Chi-squared Test of Independence

Finally, let’s test the null hypothesis that the variable jobclass is independent of the variable race at the .05 significance level.

chisq.test(tab)  
## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 29.331, df = 3, p-value = 1.908e-06

As the p-value (1.908e-06) is smaller than the .05 significance level, we reject the null hypothesis that jobclass is independent of race.
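
To make the test less of a black box, the statistic can be recomputed by hand. The sketch below rebuilds the table from the printed counts, derives the counts expected under independence (row total times column total, divided by n), and sums the scaled squared deviations. It illustrates the standard Pearson statistic and does not replace chisq.test().

```r
# Rebuild the contingency table from the printed counts
tab <- as.table(matrix(
  c(1325, 111,  86, 22,
    1155, 182, 104, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    jobclass = c("1. Industrial", "2. Information"),
    race     = c("1. White", "2. Black", "3. Asian", "4. Other")
  )
))

n <- sum(tab)

# Expected counts if jobclass and race were independent
expected <- outer(rowSums(tab), colSums(tab)) / n

# Pearson statistic: sum of (observed - expected)^2 / expected
x_squared <- sum((tab - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

c(x_squared = x_squared, df = df, p_value = p_value)
```

The values should match the chisq.test() output above: a statistic of about 29.331 on 3 degrees of freedom.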