Chapter 1 Counts and Tables | Data Exploration in R

1.1 Counts

Count for one variable:

Use data wage_df.
Perform count() on maritl.
Sort the values.
Use gt() to print the table.

library(dplyr)  # provides %>% and count()
library(gt)     # provides gt() for table output

wage_df %>%
  count(maritl,
        sort = TRUE) %>%
  gt()

maritl	n
2. Married	2074
1. Never Married	648
4. Divorced	204
5. Separated	55
3. Widowed	19
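
As a rough sketch of what count() does here, the same table can be built step by step with group_by(), summarise() and arrange(). The data frame toy_df is a small made-up stand-in for wage_df (an assumption for illustration, not the real data), so only the structure of the result matters:

```r
library(dplyr)

# Hypothetical toy data standing in for wage_df (not the real data set)
toy_df <- tibble(
  maritl = c("Married", "Married", "Never Married",
             "Divorced", "Married", "Never Married")
)

# Shorthand used in the text
counts_short <- toy_df %>%
  count(maritl, sort = TRUE)

# Longer equivalent: group the rows, count rows per group, sort descending
counts_long <- toy_df %>%
  group_by(maritl) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

counts_short
counts_long
```

Both objects should hold the same rows in the same order, which is why count() is usually the more convenient choice.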

Count two variables:

Use data wage_df.
Perform count() on maritl and education.
Sort the values.
Use gt() to print the table.

wage_df %>% 
  count(maritl, education,
        sort = TRUE) %>%
  gt()

maritl	education	n
2. Married	2. HS Grad	651
2. Married	4. College Grad	487
2. Married	3. Some College	421
2. Married	5. Advanced Degree	341
1. Never Married	2. HS Grad	219
2. Married	1. < HS Grad	174
1. Never Married	3. Some College	164
1. Never Married	4. College Grad	143
4. Divorced	2. HS Grad	73
1. Never Married	1. < HS Grad	62
1. Never Married	5. Advanced Degree	60
4. Divorced	3. Some College	52
4. Divorced	4. College Grad	41
4. Divorced	5. Advanced Degree	22
5. Separated	2. HS Grad	20
4. Divorced	1. < HS Grad	16
5. Separated	1. < HS Grad	14
5. Separated	3. Some College	11
5. Separated	4. College Grad	9
3. Widowed	2. HS Grad	8
3. Widowed	4. College Grad	5
3. Widowed	1. < HS Grad	2
3. Widowed	3. Some College	2
3. Widowed	5. Advanced Degree	2
5. Separated	5. Advanced Degree	1

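Obtain the sum of a quantitative variable (wage) for each level of a categorical variable (maritl) by passing the quantitative variable to the wt argument of count(). The name argument sets the name of the result column: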

wage_df %>%  
  count(maritl,
        wt = wage,
        name = "Sum") %>% 
  gt()

maritl	Sum
1. Never Married	60092.052
2. Married	246516.180
3. Widowed	1891.234
4. Divorced	21044.489
5. Separated	5566.868
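
To make the role of wt more concrete, the following sketch compares count() with wt against an explicit group_by() and summarise() using sum(). The data frame toy_df and its values are made up for illustration and are not taken from wage_df:

```r
library(dplyr)

# Hypothetical toy data standing in for wage_df (values are invented)
toy_df <- tibble(
  maritl = c("Married", "Married", "Never Married", "Divorced"),
  wage   = c(100, 120, 80, 90)
)

# Weighted count: adds up wage within each level of maritl
sum_short <- toy_df %>%
  count(maritl, wt = wage, name = "Sum")

# Explicit equivalent with group_by() and summarise()
sum_long <- toy_df %>%
  group_by(maritl) %>%
  summarise(Sum = sum(wage))

sum_short
sum_long
```

With wt supplied, count() no longer counts rows but sums the weights per group, so both results should agree.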

1.2 Total counts

Total counts are a useful way to represent the number of observations that fall into each combination of the levels of categorical variables. We create a contingency table of the two categorical variables jobclass and race and call the result tab:

tab <- table(wage_df$jobclass, wage_df$race)
tab

##                 
##                  1. White 2. Black 3. Asian 4. Other
##   1. Industrial      1325      111       86       22
##   2. Information     1155      182      104       15
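
The row and column totals of such a table are often needed later on. As a minimal sketch, the table is rebuilt here from the counts printed above (so the example runs without wage_df) and the totals are computed with rowSums(), colSums() and sum():

```r
# Rebuild the contingency table from the counts printed above
# (a stand-in for table(wage_df$jobclass, wage_df$race))
tab <- as.table(matrix(
  c(1325, 111,  86, 22,
    1155, 182, 104, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("1. Industrial", "2. Information"),
    c("1. White", "2. Black", "3. Asian", "4. Other")
  )
))

row_totals  <- rowSums(tab)  # observations per jobclass
col_totals  <- colSums(tab)  # observations per race
grand_total <- sum(tab)      # all observations

row_totals
col_totals
grand_total
```

The grand total should match the number of observations in the data set (n = 3000).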

1.3 Joint proportions

We can also view each cell as a proportion of the total number of observations (here n = 3000). To do so, simply divide each of the total counts by 3000.

The following code generates a table of joint proportions:

# joint proportions
prop.table(tab)

##                 
##                     1. White    2. Black    3. Asian    4. Other
##   1. Industrial  0.441666667 0.037000000 0.028666667 0.007333333
##   2. Information 0.385000000 0.060666667 0.034666667 0.005000000

For example, around 44% of all people in the dataset are white industrial workers.
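
The division described above can also be written out by hand. In this sketch the table is rebuilt from the printed counts (an assumption so that the code runs without wage_df) and the manual result is compared with prop.table():

```r
# Rebuild the contingency table from the counts printed above
tab <- as.table(matrix(
  c(1325, 111,  86, 22,
    1155, 182, 104, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("1. Industrial", "2. Information"),
    c("1. White", "2. Black", "3. Asian", "4. Other")
  )
))

# Joint proportions: every cell divided by the grand total
joint_manual <- tab / sum(tab)

# Should agree with the built-in helper
all.equal(as.vector(joint_manual), as.vector(prop.table(tab)))

round(joint_manual, 3)
```

All joint proportions together add up to 1, because every observation falls into exactly one cell.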

1.4 Conditional proportions: columns

You may also want to know the probability that workers have a certain jobclass, given that they have a particular race. This is a so-called conditional probability: the chance that one event occurs given that a second event has already occurred.

The following code generates a table of proportions conditional on the columns:

# conditional on columns
prop.table(tab, 2)

##                 
##                   1. White  2. Black  3. Asian  4. Other
##   1. Industrial  0.5342742 0.3788396 0.4526316 0.5945946
##   2. Information 0.4657258 0.6211604 0.5473684 0.4054054

We performed a columnwise evaluation and are now able to answer the following question:

Approximately what proportion of all white workers are industrial workers?
The answer is: around 53%.
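
To see where the 53% comes from, each cell can be divided by its column total. The sketch below rebuilds the table from the printed counts (so it runs without wage_df) and uses sweep() as one possible way to do the division:

```r
# Rebuild the contingency table from the counts printed above
tab <- as.table(matrix(
  c(1325, 111,  86, 22,
    1155, 182, 104, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("1. Industrial", "2. Information"),
    c("1. White", "2. Black", "3. Asian", "4. Other")
  )
))

# Divide every column (margin 2) by its column total
col_manual <- sweep(tab, 2, colSums(tab), "/")

# Should agree with prop.table(tab, 2)
all.equal(as.vector(col_manual), as.vector(prop.table(tab, 2)))

# White industrial workers as a share of all white workers: 1325 / 2480
col_manual["1. Industrial", "1. White"]
```

Each column of the result sums to 1, since the proportions are conditional on race.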

1.5 Conditional proportions: rows

Now we want to obtain the probability that workers have a certain race, given their jobclass.

# conditional on rows
prop.table(tab, 1)

##                 
##                    1. White   2. Black   3. Asian   4. Other
##   1. Industrial  0.85816062 0.07189119 0.05569948 0.01424870
##   2. Information 0.79326923 0.12500000 0.07142857 0.01030220

We performed a rowwise evaluation and are now able to answer the following question:

Approximately what proportion of all industrial workers are white?
The answer is: around 86%.
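
The 86% can be reproduced by dividing each cell by its row total. As before, the table is rebuilt from the printed counts so that the sketch runs on its own:

```r
# Rebuild the contingency table from the counts printed above
tab <- as.table(matrix(
  c(1325, 111,  86, 22,
    1155, 182, 104, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("1. Industrial", "2. Information"),
    c("1. White", "2. Black", "3. Asian", "4. Other")
  )
))

# Divide every row by its row total; R recycles rowSums(tab) down the
# columns, so each cell is divided by the total of its own row
row_manual <- tab / rowSums(tab)

# Should agree with prop.table(tab, 1)
all.equal(as.vector(row_manual), as.vector(prop.table(tab, 1)))

# White workers as a share of all industrial workers: 1325 / 1544
row_manual["1. Industrial", "1. White"]
```

Here each row sums to 1, since the proportions are conditional on jobclass.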

1.6 Chi-squared Test of Independence

Finally, let’s test the null hypothesis that the variable jobclass is independent of the variable race at the .05 significance level.

chisq.test(tab)

## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 29.331, df = 3, p-value = 1.908e-06

As the p-value (1.908e-06) is smaller than the .05 significance level, we reject the null hypothesis that jobclass is independent of race.
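
To illustrate what the test computes, the following sketch derives the Pearson statistic by hand: expected counts under independence are (row total × column total) / n, and the statistic sums (observed − expected)² / expected over all cells. The table is rebuilt from the printed counts so that the code runs without wage_df:

```r
# Rebuild the contingency table from the counts printed above
tab <- as.table(matrix(
  c(1325, 111,  86, 22,
    1155, 182, 104, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("1. Industrial", "2. Information"),
    c("1. White", "2. Black", "3. Asian", "4. Other")
  )
))

# Expected counts if jobclass and race were independent
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson chi-squared statistic and its degrees of freedom
x_squared <- sum((tab - expected)^2 / expected)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)

# Upper-tail p-value from the chi-squared distribution
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

c(x_squared = x_squared, df = df, p_value = p_value)
```

The result should agree with the chisq.test() output above. The expected counts are also worth a look: the chi-squared approximation is commonly considered reasonable when they are all at least 5, which is the case here.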