Chapter 1 Counts and Tables
You should use this method if the data is:
- Categorical
In this chapter you will learn how to do data exploration for categorical variables using tables (also called contingency tables) and counts.
1.1 Counts
Count for one variable:
wage_df %>%
  count(maritl,
        sort = TRUE) %>%
  gt()
maritl | n |
---|---|
2. Married | 2074 |
1. Never Married | 648 |
4. Divorced | 204 |
5. Separated | 55 |
3. Widowed | 19 |
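The same frequency table can also be sketched in base R, without dplyr or gt. The block below rebuilds a maritl vector from the counts printed above. This is an assumption made to keep the example self-contained; it is not the original wage_df. It then sorts the frequencies in decreasing order, which is roughly what count(maritl, sort = TRUE) does.

```r
# Counts copied from the table above; used to rebuild a stand-in maritl vector
# (assumption: this replaces the real wage_df for illustration only).
counts <- c("1. Never Married" = 648,
            "2. Married"       = 2074,
            "3. Widowed"       = 19,
            "4. Divorced"      = 204,
            "5. Separated"     = 55)
maritl <- factor(rep(names(counts), times = counts), levels = names(counts))

# table() counts each level; sort() orders the counts from largest to smallest
freq <- sort(table(maritl), decreasing = TRUE)
freq
```

The most frequent level comes first, matching the order of the gt table above.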
Count two variables:
- Use data wage_df.
- Perform count() on maritl and education.
- Sort the values.
- Use gt() to print the table.
wage_df %>%
  count(maritl, education,
        sort = TRUE) %>%
  gt()
maritl | education | n |
---|---|---|
2. Married | 2. HS Grad | 651 |
2. Married | 4. College Grad | 487 |
2. Married | 3. Some College | 421 |
2. Married | 5. Advanced Degree | 341 |
1. Never Married | 2. HS Grad | 219 |
2. Married | 1. < HS Grad | 174 |
1. Never Married | 3. Some College | 164 |
1. Never Married | 4. College Grad | 143 |
4. Divorced | 2. HS Grad | 73 |
1. Never Married | 1. < HS Grad | 62 |
1. Never Married | 5. Advanced Degree | 60 |
4. Divorced | 3. Some College | 52 |
4. Divorced | 4. College Grad | 41 |
4. Divorced | 5. Advanced Degree | 22 |
5. Separated | 2. HS Grad | 20 |
4. Divorced | 1. < HS Grad | 16 |
5. Separated | 1. < HS Grad | 14 |
5. Separated | 3. Some College | 11 |
5. Separated | 4. College Grad | 9 |
3. Widowed | 2. HS Grad | 8 |
3. Widowed | 4. College Grad | 5 |
3. Widowed | 1. < HS Grad | 2 |
3. Widowed | 3. Some College | 2 |
3. Widowed | 5. Advanced Degree | 2 |
5. Separated | 5. Advanced Degree | 1 |
Obtain the sum of a quantitative variable (wage) for different levels of a categorical variable (maritl) by using wt =:
wage_df %>%
  count(maritl,
        wt = wage,
        name = "Sum") %>%
  gt()
maritl | Sum |
---|---|
1. Never Married | 60092.052 |
2. Married | 246516.180 |
3. Widowed | 1891.234 |
4. Divorced | 21044.489 |
5. Separated | 5566.868 |
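With wt = set, count() adds up the weight variable within each group instead of counting rows. A minimal base R sketch of that idea follows. The data frame toy and its values are hypothetical and invented for illustration; they are not taken from wage_df.

```r
# Hypothetical toy data (NOT the real wage_df): two marital-status levels
toy <- data.frame(
  maritl = c("1. Never Married", "2. Married", "2. Married",
             "1. Never Married", "2. Married"),
  wage   = c(50, 100, 120, 70, 80)
)

# Sum wage within each level of maritl, as count(maritl, wt = wage) would
sums <- aggregate(wage ~ maritl, data = toy, FUN = sum)
names(sums)[2] <- "Sum"   # mirrors name = "Sum" in the dplyr call
sums
```

In this toy data the two "Never Married" wages add up to 120 and the three "Married" wages add up to 300.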
1.2 Total counts
Total counts are a useful way to represent the observations that fall into each combination of the levels of categorical variables. We create a contingency table of the two categorical variables jobclass and race and call the result tab:
tab <- table(wage_df$jobclass, wage_df$race)
tab
##
## 1. White 2. Black 3. Asian 4. Other
## 1. Industrial 1325 111 86 22
## 2. Information 1155 182 104 15
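Row and column totals are often useful next to the cell counts. The sketch below rebuilds the table from the counts printed above, so that it runs without wage_df (an assumption for illustration), and uses addmargins() to append the totals.

```r
# Contingency table rebuilt from the printed counts (stand-in for table(wage_df$jobclass, wage_df$race))
tab <- as.table(matrix(
  c(1325, 111,  86, 22,
    1155, 182, 104, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    jobclass = c("1. Industrial", "2. Information"),
    race     = c("1. White", "2. Black", "3. Asian", "4. Other")
  )
))

# Append a "Sum" row and a "Sum" column holding the marginal totals
tab_with_totals <- addmargins(tab)
tab_with_totals
```

The bottom-right cell holds the grand total of 3000 observations.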
1.3 Joint proportions
We can also express each cell as a proportion of all observations (here n = 3000). To do so, simply divide each cell of the total counts by 3000.
The following code generates tables of joint proportions:
# joint proportions
prop.table(tab)
##
## 1. White 2. Black 3. Asian 4. Other
## 1. Industrial 0.441666667 0.037000000 0.028666667 0.007333333
## 2. Information 0.385000000 0.060666667 0.034666667 0.005000000
For example, around 44% of all people in the dataset are white industrial workers.
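To make the arithmetic explicit, the following sketch divides every cell by the grand total and checks that the result agrees with prop.table(). The table is rebuilt from the counts printed earlier so that the block runs on its own; that reconstruction is an assumption for illustration.

```r
# Contingency table rebuilt from the printed counts (stand-in for the real tab)
tab <- as.table(matrix(
  c(1325, 111,  86, 22,
    1155, 182, 104, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    jobclass = c("1. Industrial", "2. Information"),
    race     = c("1. White", "2. Black", "3. Asian", "4. Other")
  )
))

# Joint proportion = cell count / total number of observations
joint_by_hand <- tab / sum(tab)

# prop.table() without a margin argument does the same division
joint <- prop.table(tab)
all.equal(as.vector(joint_by_hand), as.vector(joint))

# Rounded percentages are often easier to read
round(100 * joint, 1)
```

All joint proportions add up to 1, because every observation falls into exactly one cell.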
1.4 Conditional proportions: columns
You may also want to know the probability that workers have a certain jobclass, given that they have a particular ethnic background. This is a so-called conditional probability: the chance that one event occurs given that a second event has occurred.
The following code generates tables of conditional proportions:
# conditional on columns
prop.table(tab, 2)
##
## 1. White 2. Black 3. Asian 4. Other
## 1. Industrial 0.5342742 0.3788396 0.4526316 0.5945946
## 2. Information 0.4657258 0.6211604 0.5473684 0.4054054
We performed a columnwise evaluation and are now able to answer the following question:
- Approximately what proportion of all white workers are industrial workers?
- The answer is: around 53%.
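Conditioning on columns means dividing each cell by its column total. The sketch below does this by hand with sweep() and compares the result with prop.table(tab, 2). The table is rebuilt from the counts printed earlier (an assumption made so the block is self-contained).

```r
# Contingency table rebuilt from the printed counts (stand-in for the real tab)
tab <- as.table(matrix(
  c(1325, 111,  86, 22,
    1155, 182, 104, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    jobclass = c("1. Industrial", "2. Information"),
    race     = c("1. White", "2. Black", "3. Asian", "4. Other")
  )
))

# Divide every column by its own total: P(jobclass | race)
col_cond_by_hand <- sweep(tab, 2, colSums(tab), "/")
col_cond <- prop.table(tab, 2)
all.equal(as.vector(col_cond_by_hand), as.vector(col_cond))

# Each column now sums to 1
colSums(col_cond)
```

Because each column sums to 1, the values within a column can be read as a distribution of jobclass for that race category.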
1.5 Conditional proportions: rows
Now we want to obtain the probability that workers have a certain race, given their jobclass.
# conditional on rows
prop.table(tab, 1)
##
## 1. White 2. Black 3. Asian 4. Other
## 1. Industrial 0.85816062 0.07189119 0.05569948 0.01424870
## 2. Information 0.79326923 0.12500000 0.07142857 0.01030220
We performed a rowwise evaluation and are now able to answer the following question:
- Approximately what proportion of all industrial workers are white?
- The answer is: around 86%.
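Row and column conditioning answer different questions about the same cell. The sketch below contrasts the two for white industrial workers, using the counts printed earlier (the table is rebuilt here as an assumption so the block runs without wage_df).

```r
# Contingency table rebuilt from the printed counts (stand-in for the real tab)
tab <- as.table(matrix(
  c(1325, 111,  86, 22,
    1155, 182, 104, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    jobclass = c("1. Industrial", "2. Information"),
    race     = c("1. White", "2. Black", "3. Asian", "4. Other")
  )
))

cell <- tab["1. Industrial", "1. White"]            # 1325 white industrial workers

# P(White | Industrial): divide by the row total (all industrial workers)
p_white_given_industrial <- cell / sum(tab["1. Industrial", ])

# P(Industrial | White): divide by the column total (all white workers)
p_industrial_given_white <- cell / sum(tab[, "1. White"])

c(row_wise = p_white_given_industrial,
  col_wise = p_industrial_given_white)

# Each row of the row-wise table sums to 1
rowSums(prop.table(tab, 1))
```

The same count of 1325 gives about 86% when divided by the 1544 industrial workers and about 53% when divided by the 2480 white workers.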
1.6 Chi-squared Test of Independence
Finally, let's test the hypothesis that the variable jobclass is independent of the variable race at the .05 significance level.
chisq.test(tab)
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 29.331, df = 3, p-value = 1.908e-06
Because the p-value (1.908e-06) is far below the .05 significance level, we reject the null hypothesis that jobclass is independent of race.
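To see where the test statistic comes from, the following sketch computes the expected counts under independence and the Pearson statistic by hand, then compares them with chisq.test(). The table is rebuilt from the counts printed earlier (an assumption so the block is self-contained); the variable names expected, x2, df and p_value are introduced here for illustration.

```r
# Contingency table rebuilt from the printed counts (stand-in for the real tab)
tab <- as.table(matrix(
  c(1325, 111,  86, 22,
    1155, 182, 104, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    jobclass = c("1. Industrial", "2. Information"),
    race     = c("1. White", "2. Black", "3. Asian", "4. Other")
  )
))

n <- sum(tab)

# Expected count under independence: (row total * column total) / n
expected <- outer(rowSums(tab), colSums(tab)) / n

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
x2 <- sum((tab - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

res <- chisq.test(tab)
c(by_hand = x2, chisq_test = unname(res$statistic))
```

The expected counts are also available as res$expected, which is a convenient way to check that no cell has a very small expected count before relying on the chi-squared approximation.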