Chapter 5 Part 4: mutate(), group_by()/summarize()
5.1 Learning Objectives
- Learn and apply
mutate()to change the data type of a variable - Apply
mutate()to calculate a new variable based on other variables in adata.frame. - Apply
case_whenin amutate()statement to make a continuous variable categorical - Apply
group_by()/summarize()as a pattern to get summary statistics, including counts, means, and standard deviations within a category. - Standardize variable names using
clean_names()
5.2 Getting set up
5.3 mutate() - A confusing name, a powerful dplyr verb
So what is mutate? mutate() is one of the most useful dplyr verbs. You can use it to transform data (variables in your data.frame) and add it as a new variable into the data.frame.
I tend to think of this much like adding a formula in Excel to calculate the value of a new column based on previous columns. You can do lots of things such as:
- subtract one column from another
- convert the units of one column to new units (such as days to years)
- change the capitalization of categories in a variable
- recode a continuous variable to be a categorical one
5.3.1 Using mutate to calculate a new variable based on other variables
One use for mutate is to do Excel type calculations using other columns on the data. For instance, we might want to calcluate the sum of age_at_diagnosesand days_to_death to get the age_at_death.
## Rows: 1,152
## Columns: 21
## $ primary_diagnosis <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C3...
## $ tumor_stage <chr> "stage ia", "stage ib", "stage ib", "st...
## $ age_at_diagnosis <dbl> 24477, 26615, 28171, 27154, 23370, 1902...
## $ vital_status <chr> "dead", "dead", "dead", "alive", "alive...
## $ morphology <chr> "8070/3", "8070/3", "8070/3", "8083/3",...
## $ days_to_death <dbl> 371, 136, 2304, NA, NA, 345, 716, 2803,...
## $ state <chr> "live", "live", "live", "live", "live",...
## $ tissue_or_organ_of_origin <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C3...
## $ days_to_birth <dbl> -24477, -26615, -28171, -27154, -23370,...
## $ site_of_resection_or_biopsy <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C3...
## $ days_to_last_follow_up <dbl> NA, NA, 2099, 3747, 3576, NA, NA, 1810,...
## $ cigarettes_per_day <dbl> 10.9589041, 2.1917808, 1.6438356, 1.095...
## $ years_smoked <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 26,...
## $ gender <chr> "male", "male", "female", "male", "fema...
## $ year_of_birth <dbl> 1936, 1931, 1927, 1930, 1942, 1953, 193...
## $ race <chr> "white", "asian", "white", "white", "no...
## $ ethnicity <chr> "not hispanic or latino", "not hispanic...
## $ year_of_death <dbl> 2004, 2003, NA, NA, NA, 2005, 2006, NA,...
## $ bcr_patient_barcode <chr> "TCGA-18-3406", "TCGA-18-3407", "TCGA-1...
## $ disease <chr> "LUSC", "LUSC", "LUSC", "LUSC", "LUSC",...
## $ age_at_death <dbl> 24848, 26751, 30475, NA, NA, 19370, 276...
5.3.2 Your Turn
Create a variable called cigarettes_total by multiplying cigarettes_per_day by -days_to_birth below:
5.3.3 Using mutate to make our character variables into factors
Remember factors? Well, we can use mutate() to make a character variable a factor.
Let’s convert gender from character into factor. We’ll then pipe the output into glimpse() so you can see the variable types.
smoke_complete %>%
#reassign the gender variable to be a factor
mutate(gender = factor(gender)) %>%
glimpse()## Rows: 1,152
## Columns: 20
## $ primary_diagnosis <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C3...
## $ tumor_stage <chr> "stage ia", "stage ib", "stage ib", "st...
## $ age_at_diagnosis <dbl> 24477, 26615, 28171, 27154, 23370, 1902...
## $ vital_status <chr> "dead", "dead", "dead", "alive", "alive...
## $ morphology <chr> "8070/3", "8070/3", "8070/3", "8083/3",...
## $ days_to_death <dbl> 371, 136, 2304, NA, NA, 345, 716, 2803,...
## $ state <chr> "live", "live", "live", "live", "live",...
## $ tissue_or_organ_of_origin <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C3...
## $ days_to_birth <dbl> -24477, -26615, -28171, -27154, -23370,...
## $ site_of_resection_or_biopsy <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C3...
## $ days_to_last_follow_up <dbl> NA, NA, 2099, 3747, 3576, NA, NA, 1810,...
## $ cigarettes_per_day <dbl> 10.9589041, 2.1917808, 1.6438356, 1.095...
## $ years_smoked <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 26,...
## $ gender <fct> male, male, female, male, female, male,...
## $ year_of_birth <dbl> 1936, 1931, 1927, 1930, 1942, 1953, 193...
## $ race <chr> "white", "asian", "white", "white", "no...
## $ ethnicity <chr> "not hispanic or latino", "not hispanic...
## $ year_of_death <dbl> 2004, 2003, NA, NA, NA, 2005, 2006, NA,...
## $ bcr_patient_barcode <chr> "TCGA-18-3406", "TCGA-18-3407", "TCGA-1...
## $ disease <chr> "LUSC", "LUSC", "LUSC", "LUSC", "LUSC",...
One thing to notice: we are doing something called reassignment here. We’re taking the previous values of our variable (gender), doing something to it (making it a factor), and then reassigning the variable gender to our fixed set of values.
Remember when we wanted to decide the order of the categories in the factor? We can assign that order using the levels argument in factor(). To show the order, we can pipe the output into a command called tabyl(), from the janitor package, which is a tidyverse friendly version of table().
smoke_complete %>%
#reassign the gender variable to be a factor
mutate(gender = factor(gender, levels = c("female", "male"))) %>%
tabyl(gender)## gender n percent
## female 366 0.3177083
## male 786 0.6822917
Notice that the female value goes before the male, which is what we wanted.
5.3.4 Using mutate to make a continuous variable categorical using case_when
Say we want to make the cigarettes_per_day into a categorical variable with the values:
- 0-5 cigarettes/day
- 6+ cigarettes/day
How would we do that?
Well, we need to follow the basic pattern for each of our categories:
condition ~ category name
The left side of the ~ is where we can specify how we define the *category&.
The right side of the ~ is where we can specify the category name (as a character).
In the example below,
cigarettes_per_day <= 5 is our left side,
and 0-5 is our right side (our category).
We need to do this for each level in our category.
smoke_complete %>%
mutate(
cigarettes_category = case_when(
cigarettes_per_day <= 5 ~ '0-5',
cigarettes_per_day > 5 ~ '6+'
)
) %>%
mutate(cigarettes_category =
factor(cigarettes_category,
levels = c("0-5", "6+")
)
) %>%
tabyl(cigarettes_category)## cigarettes_category n percent
## 0-5 1060 0.92013889
## 6+ 92 0.07986111
5.3.5 Your Turn
Modify the code below to recode cigarettes_category to have 3 levels:
0-56-1011+
Hint: you’ll have to chain conditions with a & to get the 6-10 category.
smoke_complete %>%
mutate(
cigarettes_category = case_when(
cigarettes_per_day <= 5 ~ '0-5',
cigarettes_per_day > 5 ~ '6+'
)
) %>%
mutate(cigarettes_category =
factor(cigarettes_category,
levels = c("0-5", "6+")
)
) %>%
tabyl(cigarettes_category)## cigarettes_category n percent
## 0-5 1060 0.92013889
## 6+ 92 0.07986111
5.4 group_by()/summarize()
These verbs usually go together and allow you to produce summaries on your data.
5.4.1 summarize()
The summarize() verb produces summaries across your data.
For example, if I wanted to find the average cigarettes_per_day across all my data, I’d do this:
## # A tibble: 1 x 1
## average_cigarettes
## <dbl>
## 1 2.61
What are the kinds of things that summarize is useful for?
Useful functions (from https://dplyr.tidyverse.org/reference/summarise.html)
- Center:
mean(),median() - Spread:
sd()(standard deviation) - Range:
min(),max() - Position:
first(),last(),nth() - Count:
n(),n_distinct()
By itself, summarize() is ok. But it’s really combined with group_by() that it becomes extremely powerful.
5.4.2 group_by()/summarize()
These two verbs always go together. group_by() doesn’t do anything by itself, output wise. But what it does is split our data frame into a number of smaller data.frames split out by category. Then we use summarize to do some sort of summary calculation (such as counting or calculating the mean) on these smaller data.frames:
For example, if we want to calculate the mean cigarettes_per_day smoked between males and females, we’d do the following:
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## gender mean_cigarettes
## <chr> <dbl>
## 1 female 2.07
## 2 male 2.86
5.4.3 group_by()/summarize to calculate mean and standard deviation values
Below, we’ll calculate the average number of cigarettes smoked per day for males and females, and the standard deviation.
smoke_complete %>%
group_by(gender) %>%
summarize(mean = mean(cigarettes_per_day),
sd = sd(cigarettes_per_day))## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
## gender mean sd
## <chr> <dbl> <dbl>
## 1 female 2.07 1.53
## 2 male 2.86 2.19
5.4.4 group_by() to count() things
What if we want to know the number of subjects in our dataset with each type of disease? We can use the n() function.
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## disease count
## <chr> <int>
## 1 BLCA 224
## 2 CESC 92
## 3 LUSC 836
5.4.5 Your Turn
Use group_by()/summarize() to calculate the maximum cigarettes per day within each disease category.
5.5 Standardizing variable names: clean_names()
The last package we’ll talk about today is the janitor package. This package adds some really useful functions for cleaning data.
The one I use the most is called clean_names(). This will mutate column names into a standard format - all character values will be lowercase and spaces will be converted to underscores.
The smoke_complete dataset has already had this function applied to it, but let’s create a smaller example dataset where this wouldn’t be the case.
## Parsed with column specification:
## cols(
## `primary diagnosis` = col_character(),
## `tumor Stage` = col_character(),
## `Age at Diagnosis` = col_double(),
## vitalSTATUS = col_character()
## )
## # A tibble: 1,152 x 4
## `primary diagnosis` `tumor Stage` `Age at Diagnosis` vitalSTATUS
## <chr> <chr> <dbl> <chr>
## 1 C34.1 stage ia 24477 dead
## 2 C34.1 stage ib 26615 dead
## 3 C34.3 stage ib 28171 dead
## 4 C34.1 stage ia 27154 alive
## 5 C34.1 stage iiia 23370 alive
## 6 C34.3 stage ib 19025 dead
## 7 C34.3 stage iv 26938 dead
## 8 C34.1 stage ib 28430 dead
## 9 C34.1 stage iib 30435 dead
## 10 C34.9 stage iv 24019 dead
## # ... with 1,142 more rows
Ack! These column names have spaces, inconsistent capitalization, which make these difficult to work with.
So let’s use clean_names() to fix this:
## # A tibble: 1,152 x 4
## primary_diagnosis tumor_stage age_at_diagnosis vital_status
## <chr> <chr> <dbl> <chr>
## 1 C34.1 stage ia 24477 dead
## 2 C34.1 stage ib 26615 dead
## 3 C34.3 stage ib 28171 dead
## 4 C34.1 stage ia 27154 alive
## 5 C34.1 stage iiia 23370 alive
## 6 C34.3 stage ib 19025 dead
## 7 C34.3 stage iv 26938 dead
## 8 C34.1 stage ib 28430 dead
## 9 C34.1 stage iib 30435 dead
## 10 C34.9 stage iv 24019 dead
## # ... with 1,142 more rows
By default, clean_names() converts to snake case (spaces between words are replaced by an underscore _). clean_names() will help you standardize your dataset’s column names.
5.6 Assignment 4
Load the
smoke_completedataset from thedatafolder.Determine which tissue/organ of origin occurs most frequently in
smoke_complete. (Hint: use bothgroup_by()/summarize()andarrange().)
- Create a new dataset called
smoke_filteredthat only contains subjects withtissue_or_organ_of_originofC34.3. Useglimpse()onsmoke_filtered.
- For subjects in
smoke_filtered, make a new variable calledage_yearsby convertingage_at_diagnosisfrom days to years.
- Make a new variable called
age_categoryusingcase_when()fromage_yearsinsmoke_filtered.
Make age_category a factor with the following levels:
0-40years old40-55years old55-70years old70+years old
smoke_filtered <-
smoke_filtered %>%
mutate(age_category =
case_when(age_years <= 40 ~ "0-40",
)
) %>%
mutate(age_category = factor(age_category,
levels = c("0-40",
)
)
)- Create a scatterplot using
geom_point()demonstrating the correlation (if any) between the categorical variableage_at_diagnosisversuscigarettes_per_day.