{dplyr}
University of Oregon
Fall 2025
{datapasta}
R
demo
The outline should include a description of the data to be used, a discussion of the preparatory work that will need to be done, and an explanation of how the requirements of the final project will be met. The outline is intended to be low-stakes and is primarily a means for you to obtain feedback on the feasibility of the project and on areas to consider.
Homeworks 2 & 3
{dplyr}
Week 4
{dplyr}
functions
{dplyr}
functions
ifelse()
case_when()
Learning Objectives
select()
filter()
arrange()
group_by()
summarize()
recode() & rename()
{ggplot2}
{dplyr}
{tidyr}
{lubridate}
{glue}
{tidymodels}
{dplyr}
A grammar for data wrangling (verbs!)
Take a couple minutes with your neighbor to guess what the following functions do:
select()
filter()
mutate()
arrange()
summarize()
group_by()
{dplyr}
A grammar for data wrangling (verbs!)
select()
a subset of columns
filter()
a subset of rows
mutate()
add a new column or modify an existing column
arrange()
sort rows in ascending/descending order
summarize()
reduce many rows to summary values computed from other columns, for example mean() or sd()
+ often used with…
group_by()
grouping categories in column(s)
{dplyr}
always takes a data frame (tibble) as the first argument
{ggplot}
subsequent arguments tell {dplyr} what to do with the data frame
each function applied returns the modified data frame
select()
select()
choose columns
Let’s go back to the `penguins` dataset
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
select
species and island and body_mass_g
-
Don’t select variable(s)
# A tibble: 344 × 7
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 1 more variable: sex <fct>
-
Don’t select variable(s)
OR maybe better
# A tibble: 344 × 6
species island flipper_length_mm body_mass_g sex year
<fct> <fct> <int> <int> <fct> <int>
1 Adelie Torgersen 181 3750 male 2007
2 Adelie Torgersen 186 3800 female 2007
3 Adelie Torgersen 195 3250 female 2007
4 Adelie Torgersen NA NA <NA> 2007
5 Adelie Torgersen 193 3450 female 2007
6 Adelie Torgersen 190 3650 male 2007
7 Adelie Torgersen 181 3625 female 2007
8 Adelie Torgersen 195 4675 male 2007
9 Adelie Torgersen 193 3475 <NA> 2007
10 Adelie Torgersen 190 4250 <NA> 2007
# ℹ 334 more rows
Less preferable because variable orders change
# A tibble: 344 × 3
species island bill_length_mm
<fct> <fct> <dbl>
1 Adelie Torgersen 39.1
2 Adelie Torgersen 39.5
3 Adelie Torgersen 40.3
4 Adelie Torgersen NA
5 Adelie Torgersen 36.7
6 Adelie Torgersen 39.3
7 Adelie Torgersen 38.9
8 Adelie Torgersen 39.2
9 Adelie Torgersen 34.1
10 Adelie Torgersen 42
# ℹ 334 more rows
# A tibble: 344 × 3
species bill_length_mm flipper_length_mm
<fct> <dbl> <int>
1 Adelie 39.1 181
2 Adelie 39.5 186
3 Adelie 40.3 195
4 Adelie NA NA
5 Adelie 36.7 193
6 Adelie 39.3 190
7 Adelie 38.9 181
8 Adelie 39.2 195
9 Adelie 34.1 193
10 Adelie 42 190
# ℹ 334 more rows
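The two three-column tables above are consistent with selecting columns by position. This is an assumption, since the original calls are not shown. A sketch of why position-based selection is fragile compared with names:

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the `penguins` tibble

# Selecting by position works, but silently changes meaning if columns are reordered.
first_three <- select(penguins, 1:3)
odd_columns <- select(penguins, c(1, 3, 5))

# Selecting by name keeps returning the same columns after a reorder.
by_name <- select(penguins, species, island, bill_length_mm)
```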
select()
to order columns
select()
helper functions
everything()
starts_with()
ends_with()
contains()
all_of()
any_of()
First: this code may not work for you - why?
# A tibble: 48 × 25
test_year test_type test_site student_id pre_test_score pre_test_percent
<chr> <chr> <chr> <chr> <dbl> <chr>
1 06/01/2016 12… YEAR END VIRDEN Virden 1 43 29%
2 06/01/2016 12… YEAR END VIRDEN Virden 2 46 31%
3 06/01/2016 12… YEAR END VIRDEN Virden 3 39 26%
4 06/01/2016 12… YEAR END VIRDEN Virden 4 35 23%
5 06/01/2016 12… YEAR END VIRDEN Virden 5 46 31%
6 06/01/2016 12… YEAR END VIRDEN Virden 6 35 23%
7 06/01/2016 12… YEAR END VIRDEN Virden 7 40 27%
8 06/01/2016 12… YEAR END VIRDEN Virden 8 39 26%
9 06/01/2016 12… YEAR END VIRDEN Virden 9 40 27%
10 06/01/2016 12… YEAR END VIRDEN Virden 10 45 30%
# ℹ 38 more rows
# ℹ 19 more variables: post_test_score <dbl>, post_test_percent <chr>,
# percentage_change <chr>, unit_1_score <dbl>, unit_1_percent <chr>,
# unit_2_score <dbl>, unit_2_percent <chr>, unit_3_score <dbl>,
# unit_3_percent <chr>, unit_4_score <dbl>, unit_4_percent <chr>,
# unit_5_6_score <dbl>, unit_5_6_percent <chr>, unit_7_score <dbl>,
# unit_7_percent <chr>, unit_8_score <dbl>, unit_8_percent <chr>, …
select
to order columns
# A tibble: 48 × 25
student_id test_site test_type test_year pre_test_score pre_test_percent
<chr> <chr> <chr> <chr> <dbl> <chr>
1 Virden 1 VIRDEN YEAR END 06/01/2016 12… 43 29%
2 Virden 2 VIRDEN YEAR END 06/01/2016 12… 46 31%
3 Virden 3 VIRDEN YEAR END 06/01/2016 12… 39 26%
4 Virden 4 VIRDEN YEAR END 06/01/2016 12… 35 23%
5 Virden 5 VIRDEN YEAR END 06/01/2016 12… 46 31%
6 Virden 6 VIRDEN YEAR END 06/01/2016 12… 35 23%
7 Virden 7 VIRDEN YEAR END 06/01/2016 12… 40 27%
8 Virden 8 VIRDEN YEAR END 06/01/2016 12… 39 26%
9 Virden 9 VIRDEN YEAR END 06/01/2016 12… 40 27%
10 Virden 10 VIRDEN YEAR END 06/01/2016 12… 45 30%
# ℹ 38 more rows
# ℹ 19 more variables: post_test_score <dbl>, post_test_percent <chr>,
# percentage_change <chr>, unit_1_score <dbl>, unit_1_percent <chr>,
# unit_2_score <dbl>, unit_2_percent <chr>, unit_3_score <dbl>,
# unit_3_percent <chr>, unit_4_score <dbl>, unit_4_percent <chr>,
# unit_5_6_score <dbl>, unit_5_6_percent <chr>, unit_7_score <dbl>,
# unit_7_percent <chr>, unit_8_score <dbl>, unit_8_percent <chr>, …
everything()
is a selection helper that matches all columns, so any column not already named follows in its original order
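A small sketch of reordering with `everything()`. The `scores` tibble is a hypothetical miniature of the test-score data, with values copied from the printed rows, because the original file is not available here:

```r
library(dplyr)

# Hypothetical miniature of the test-score data (not the real file).
scores <- tibble(
  test_year = "06/01/2016 12:00:00 AM",
  test_type = "YEAR END",
  test_site = "VIRDEN",
  student_id = c("Virden 1", "Virden 2", "Virden 3"),
  pre_test_score = c(43, 46, 39),
  post_test_score = c(92, 104, 75)
)

# Name the columns that should come first; everything() appends the rest.
reordered <- select(scores, student_id, test_site, test_type, test_year, everything())
```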
starts_with()
# A tibble: 48 × 3
test_year test_type test_site
<chr> <chr> <chr>
1 06/01/2016 12:00:00 AM YEAR END VIRDEN
2 06/01/2016 12:00:00 AM YEAR END VIRDEN
3 06/01/2016 12:00:00 AM YEAR END VIRDEN
4 06/01/2016 12:00:00 AM YEAR END VIRDEN
5 06/01/2016 12:00:00 AM YEAR END VIRDEN
6 06/01/2016 12:00:00 AM YEAR END VIRDEN
7 06/01/2016 12:00:00 AM YEAR END VIRDEN
8 06/01/2016 12:00:00 AM YEAR END VIRDEN
9 06/01/2016 12:00:00 AM YEAR END VIRDEN
10 06/01/2016 12:00:00 AM YEAR END VIRDEN
# ℹ 38 more rows
ends_with()
# A tibble: 48 × 10
pre_test_score post_test_score unit_1_score unit_2_score unit_3_score
<dbl> <dbl> <dbl> <dbl> <dbl>
1 43 92 3 4 6
2 46 104 5 5 6
3 39 75 4 4 6
4 35 115 4 4 6
5 46 85 2 5 6
6 35 91 5 5 7
7 40 96 5 5 6
8 39 74 4 5 5
9 40 90 6 4 5
10 45 86 4 5 5
# ℹ 38 more rows
# ℹ 5 more variables: unit_4_score <dbl>, unit_5_6_score <dbl>,
# unit_7_score <dbl>, unit_8_score <dbl>, total_score <dbl>
contains()
# A tibble: 48 × 7
test_year test_type test_site pre_test_score pre_test_percent post_test_score
<chr> <chr> <chr> <dbl> <chr> <dbl>
1 06/01/20… YEAR END VIRDEN 43 29% 92
2 06/01/20… YEAR END VIRDEN 46 31% 104
3 06/01/20… YEAR END VIRDEN 39 26% 75
4 06/01/20… YEAR END VIRDEN 35 23% 115
5 06/01/20… YEAR END VIRDEN 46 31% 85
6 06/01/20… YEAR END VIRDEN 35 23% 91
7 06/01/20… YEAR END VIRDEN 40 27% 96
8 06/01/20… YEAR END VIRDEN 39 26% 74
9 06/01/20… YEAR END VIRDEN 40 27% 90
10 06/01/20… YEAR END VIRDEN 45 30% 86
# ℹ 38 more rows
# ℹ 1 more variable: post_test_percent <chr>
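The three tables above look like the result of matching on `"test"` and `"score"`. That is inferred from the column names, not stated in the slides. A sketch on a hypothetical miniature of the data:

```r
library(dplyr)

# Hypothetical miniature of the test-score data; values copied from the printed rows.
scores <- tibble(
  test_year = "06/01/2016 12:00:00 AM",
  test_type = "YEAR END",
  test_site = "VIRDEN",
  pre_test_score = c(43, 46, 39),
  pre_test_percent = c("29%", "31%", "26%"),
  post_test_score = c(92, 104, 75),
  post_test_percent = c("61%", "69%", "50%")
)

test_cols  <- select(scores, starts_with("test"))  # names beginning with "test"
score_cols <- select(scores, ends_with("score"))   # names ending in "score"
any_test   <- select(scores, contains("test"))     # "test" anywhere in the name
```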
all_of()
# A tibble: 48 × 4
pre_test_score pre_test_percent post_test_score post_test_percent
<dbl> <chr> <dbl> <chr>
1 43 29% 92 61%
2 46 31% 104 69%
3 39 26% 75 50%
4 35 23% 115 77%
5 46 31% 85 57%
6 35 23% 91 61%
7 40 27% 96 64%
8 39 26% 74 49%
9 40 27% 90 60%
10 45 30% 86 57%
# ℹ 38 more rows
all_of()
Error in `select()`:
ℹ In argument: `all_of(vars)`.
Caused by error in `all_of()`:
! Can't subset elements that don't exist.
✖ Elements `pre_test_score`, `pre_test_percent`, `post_test_score`, and `post_test_percent` don't exist.
any_of()
# A tibble: 48 × 4
pre_test_score pre_test_percent post_test_score post_test_percent
<dbl> <chr> <dbl> <chr>
1 43 29% 92 61%
2 46 31% 104 69%
3 39 26% 75 50%
4 35 23% 115 77%
5 46 31% 85 57%
6 35 23% 91 61%
7 40 27% 96 64%
8 39 26% 74 49%
9 40 27% 90 60%
10 45 30% 86 57%
# ℹ 38 more rows
any_of()
# A tibble: 234 × 0
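The contrast shown above is that `all_of()` is strict and `any_of()` is lenient. A minimal sketch, using a hypothetical `scores` tibble and the built-in `mtcars` data as a stand-in for a data set that lacks the requested columns (the slides appear to use a different 234-row data set):

```r
library(dplyr)

vars <- c("pre_test_score", "post_test_score")

# Hypothetical miniature of the test-score data.
scores <- tibble(
  student_id = c("Virden 1", "Virden 2"),
  pre_test_score = c(43, 46),
  post_test_score = c(92, 104)
)

# all_of() selects exactly the columns named in the character vector.
strict <- select(scores, all_of(vars))

# On a data frame without those columns, all_of() raises an error ...
strict_error <- tryCatch(
  select(mtcars, all_of(vars)),
  error = function(e) conditionMessage(e)
)

# ... while any_of() quietly returns whatever exists, here zero columns.
lenient <- select(mtcars, any_of(vars))
```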
select
methods and helpers
You can mix types and helper functions
# A tibble: 48 × 4
student_id test_year total_score total_percent_correct
<chr> <chr> <dbl> <chr>
1 Virden 1 06/01/2016 12:00:00 AM 207 41%
2 Virden 2 06/01/2016 12:00:00 AM 224 45%
3 Virden 3 06/01/2016 12:00:00 AM 193 39%
4 Virden 4 06/01/2016 12:00:00 AM 223 45%
5 Virden 5 06/01/2016 12:00:00 AM 198 40%
6 Virden 6 06/01/2016 12:00:00 AM 210 42%
7 Virden 7 06/01/2016 12:00:00 AM 211 42%
8 Virden 8 06/01/2016 12:00:00 AM 180 36%
9 Virden 9 06/01/2016 12:00:00 AM 206 41%
10 Virden 10 06/01/2016 12:00:00 AM 206 41%
# ℹ 38 more rows
select
methods and helpers
You can also use select to rearrange your columns
# A tibble: 48 × 25
student_id test_year total_score total_percent_correct test_type test_site
<chr> <chr> <dbl> <chr> <chr> <chr>
1 Virden 1 06/01/2016 … 207 41% YEAR END VIRDEN
2 Virden 2 06/01/2016 … 224 45% YEAR END VIRDEN
3 Virden 3 06/01/2016 … 193 39% YEAR END VIRDEN
4 Virden 4 06/01/2016 … 223 45% YEAR END VIRDEN
5 Virden 5 06/01/2016 … 198 40% YEAR END VIRDEN
6 Virden 6 06/01/2016 … 210 42% YEAR END VIRDEN
7 Virden 7 06/01/2016 … 211 42% YEAR END VIRDEN
8 Virden 8 06/01/2016 … 180 36% YEAR END VIRDEN
9 Virden 9 06/01/2016 … 206 41% YEAR END VIRDEN
10 Virden 10 06/01/2016 … 206 41% YEAR END VIRDEN
# ℹ 38 more rows
# ℹ 19 more variables: pre_test_score <dbl>, pre_test_percent <chr>,
# post_test_score <dbl>, post_test_percent <chr>, percentage_change <chr>,
# unit_1_score <dbl>, unit_1_percent <chr>, unit_2_score <dbl>,
# unit_2_percent <chr>, unit_3_score <dbl>, unit_3_percent <chr>,
# unit_4_score <dbl>, unit_4_percent <chr>, unit_5_6_score <dbl>,
# unit_5_6_percent <chr>, unit_7_score <dbl>, unit_7_percent <chr>, …
relocate()
relocate(.data, ..., .before = NULL, .after = NULL)
.data = data frame/tibble
... = columns to move
.before = column to place them before
.after = column to place them after
relocate()
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
# A tibble: 344 × 8
year species island bill_length_mm bill_depth_mm flipper_length_mm
<int> <fct> <fct> <dbl> <dbl> <int>
1 2007 Adelie Torgersen 39.1 18.7 181
2 2007 Adelie Torgersen 39.5 17.4 186
3 2007 Adelie Torgersen 40.3 18 195
4 2007 Adelie Torgersen NA NA NA
5 2007 Adelie Torgersen 36.7 19.3 193
6 2007 Adelie Torgersen 39.3 20.6 190
7 2007 Adelie Torgersen 38.9 17.8 181
8 2007 Adelie Torgersen 39.2 19.6 195
9 2007 Adelie Torgersen 34.1 18.1 193
10 2007 Adelie Torgersen 42 20.2 190
# ℹ 334 more rows
# ℹ 2 more variables: body_mass_g <int>, sex <fct>
relocate()
# A tibble: 344 × 8
species year island bill_length_mm bill_depth_mm flipper_length_mm
<fct> <int> <fct> <dbl> <dbl> <int>
1 Adelie 2007 Torgersen 39.1 18.7 181
2 Adelie 2007 Torgersen 39.5 17.4 186
3 Adelie 2007 Torgersen 40.3 18 195
4 Adelie 2007 Torgersen NA NA NA
5 Adelie 2007 Torgersen 36.7 19.3 193
6 Adelie 2007 Torgersen 39.3 20.6 190
7 Adelie 2007 Torgersen 38.9 17.8 181
8 Adelie 2007 Torgersen 39.2 19.6 195
9 Adelie 2007 Torgersen 34.1 18.1 193
10 Adelie 2007 Torgersen 42 20.2 190
# ℹ 334 more rows
# ℹ 2 more variables: body_mass_g <int>, sex <fct>
relocate()
# A tibble: 344 × 8
species island year bill_length_mm bill_depth_mm flipper_length_mm
<fct> <fct> <int> <dbl> <dbl> <int>
1 Adelie Torgersen 2007 39.1 18.7 181
2 Adelie Torgersen 2007 39.5 17.4 186
3 Adelie Torgersen 2007 40.3 18 195
4 Adelie Torgersen 2007 NA NA NA
5 Adelie Torgersen 2007 36.7 19.3 193
6 Adelie Torgersen 2007 39.3 20.6 190
7 Adelie Torgersen 2007 38.9 17.8 181
8 Adelie Torgersen 2007 39.2 19.6 195
9 Adelie Torgersen 2007 34.1 18.1 193
10 Adelie Torgersen 2007 42 20.2 190
# ℹ 334 more rows
# ℹ 2 more variables: body_mass_g <int>, sex <fct>
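The calls that moved `year` are not shown. The following sketch would give the three column orders printed above, assuming `penguins` comes from {palmerpenguins}:

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the `penguins` tibble

# With neither .before nor .after, the column moves to the front.
year_first <- relocate(penguins, year)

# Place year immediately after species.
year_after_species <- relocate(penguins, year, .after = species)

# Place year immediately before bill_length_mm.
year_before_bill <- relocate(penguins, year, .before = bill_length_mm)
```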
filter()
filter()
choose rows that satisfy your condition (== TRUE)
filter()
Complete set of boolean operations
filter()
numeric
# A tibble: 81 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
6 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
7 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
8 audi a4 quattro 2 2008 4 auto… 4 19 27 p comp…
9 chevrolet malibu 2.4 1999 4 auto… f 19 27 r mids…
10 chevrolet malibu 2.4 2008 4 auto… f 22 30 r mids…
# ℹ 71 more rows
# A tibble: 143 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
# ℹ 133 more rows
filter
character
use == NOT =
# A tibble: 124 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 46.1 13.2 211 4500
2 Gentoo Biscoe 50 16.3 230 5700
3 Gentoo Biscoe 48.7 14.1 210 4450
4 Gentoo Biscoe 50 15.2 218 5700
5 Gentoo Biscoe 47.6 14.5 215 5400
6 Gentoo Biscoe 46.5 13.5 210 4550
7 Gentoo Biscoe 45.4 14.6 211 4800
8 Gentoo Biscoe 46.7 15.3 219 5200
9 Gentoo Biscoe 43.3 13.4 209 4400
10 Gentoo Biscoe 46.8 15.4 215 5150
# ℹ 114 more rows
# ℹ 2 more variables: sex <fct>, year <int>
, is equivalent to &
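A short sketch of both points, assuming `penguins` comes from {palmerpenguins}. The body-mass cutoff is illustrative only and does not come from the slides:

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the `penguins` tibble

# Compare with ==; a single = would be read as a named argument, not a test.
gentoo <- filter(penguins, species == "Gentoo")

# Conditions separated by commas are combined with AND ...
with_comma <- filter(penguins, species == "Gentoo", body_mass_g > 5000)

# ... so this is the same filter written with &.
with_and <- filter(penguins, species == "Gentoo" & body_mass_g > 5000)
```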
filter()
note
Can’t write:
Error in `filter()`:
ℹ In argument: `island == "Briscoe" | "Dream"`.
Caused by error in `island == "Briscoe" | "Dream"`:
! operations are possible only for numeric, logical or complex types
Must be:
# A tibble: 124 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Dream 39.5 16.7 178 3250
2 Adelie Dream 37.2 18.1 178 3900
3 Adelie Dream 39.5 17.8 188 3300
4 Adelie Dream 40.9 18.9 184 3900
5 Adelie Dream 36.4 17 195 3325
6 Adelie Dream 39.2 21.1 196 4150
7 Adelie Dream 38.8 20 190 3950
8 Adelie Dream 42.2 18.5 180 3550
9 Adelie Dream 37.6 19.3 181 3300
10 Adelie Dream 39.8 19.1 184 4650
# ℹ 114 more rows
# ℹ 2 more variables: sex <fct>, year <int>
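A sketch of the working form, assuming `penguins` comes from {palmerpenguins}. Each side of `|` needs its own complete comparison, and `%in%` is a common shorthand. Note that the island is spelled `Biscoe` in the data; a misspelled value such as `"Briscoe"` raises no error but matches no rows, which would leave only the Dream penguins.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the `penguins` tibble

# Repeat the column name on both sides of the OR.
either <- filter(penguins, island == "Biscoe" | island == "Dream")

# Equivalent shorthand for "island is one of these values".
same <- filter(penguins, island %in% c("Biscoe", "Dream"))
```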
filter()
# A tibble: 48 × 25
test_year test_type test_site student_id pre_test_score pre_test_percent
<chr> <chr> <chr> <chr> <dbl> <chr>
1 06/01/2016 12… YEAR END VIRDEN Virden 1 43 29%
2 06/01/2016 12… YEAR END VIRDEN Virden 2 46 31%
3 06/01/2016 12… YEAR END VIRDEN Virden 3 39 26%
4 06/01/2016 12… YEAR END VIRDEN Virden 4 35 23%
5 06/01/2016 12… YEAR END VIRDEN Virden 5 46 31%
6 06/01/2016 12… YEAR END VIRDEN Virden 6 35 23%
7 06/01/2016 12… YEAR END VIRDEN Virden 7 40 27%
8 06/01/2016 12… YEAR END VIRDEN Virden 8 39 26%
9 06/01/2016 12… YEAR END VIRDEN Virden 9 40 27%
10 06/01/2016 12… YEAR END VIRDEN Virden 10 45 30%
# ℹ 38 more rows
# ℹ 19 more variables: post_test_score <dbl>, post_test_percent <chr>,
# percentage_change <chr>, unit_1_score <dbl>, unit_1_percent <chr>,
# unit_2_score <dbl>, unit_2_percent <chr>, unit_3_score <dbl>,
# unit_3_percent <chr>, unit_4_score <dbl>, unit_4_percent <chr>,
# unit_5_6_score <dbl>, unit_5_6_percent <chr>, unit_7_score <dbl>,
# unit_7_percent <chr>, unit_8_score <dbl>, unit_8_percent <chr>, …
filter()
Anomaly in the data
! means NOT
!= means NOT equal
Note: to drop NA values, use filter(!is.na(variable)), not filter(variable != NA). Any comparison with NA returns NA, so the second form keeps no rows.
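A minimal demonstration of the difference, assuming `penguins` comes from {palmerpenguins}:

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the `penguins` tibble

# Keeps every row where body_mass_g is present.
has_mass <- filter(penguins, !is.na(body_mass_g))

# body_mass_g != NA evaluates to NA for every row, and filter() drops NA rows,
# so this returns an empty tibble instead of removing only the missing values.
wrong <- filter(penguins, body_mass_g != NA)
```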
{dplyr}
functions
What if we wanted to filter then select a data set?
Select “island” and “species” of penguins over 4200 grams
{dplyr}
functions
select around filter ☹️
{dplyr}
functions
We could use the pipe operator to make our code more readable:
|>
Called the “pipe” operator; chaining calls this way is called “piping functions”
# A tibble: 149 × 2
island species
<fct> <fct>
1 Torgersen Adelie
2 Torgersen Adelie
3 Torgersen Adelie
4 Torgersen Adelie
5 Dream Adelie
6 Dream Adelie
7 Dream Adelie
8 Biscoe Adelie
9 Biscoe Adelie
10 Torgersen Adelie
# ℹ 139 more rows
Even better:
# A tibble: 149 × 2
island species
<fct> <fct>
1 Torgersen Adelie
2 Torgersen Adelie
3 Torgersen Adelie
4 Torgersen Adelie
5 Dream Adelie
6 Dream Adelie
7 Dream Adelie
8 Biscoe Adelie
9 Biscoe Adelie
10 Torgersen Adelie
# ℹ 139 more rows
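A sketch of the nested call next to the piped version, assuming `penguins` comes from {palmerpenguins} and R 4.1 or later for the native `|>` pipe:

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the `penguins` tibble

# Nested: has to be read from the inside out.
nested <- select(filter(penguins, body_mass_g > 4200), island, species)

# Piped: reads top to bottom, one step per line.
piped <- penguins |>
  filter(body_mass_g > 4200) |>
  select(island, species)
```

Both forms return the same rows and columns; the pipe simply passes the left-hand result in as the first argument of the next function.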
mutate()
# A tibble: 45 × 4
student_id test_site pre_test_score post_test_score
<chr> <chr> <dbl> <dbl>
1 Virden 1 VIRDEN 43 92
2 Virden 2 VIRDEN 46 104
3 Virden 3 VIRDEN 39 75
4 Virden 4 VIRDEN 35 115
5 Virden 5 VIRDEN 46 85
6 Virden 6 VIRDEN 35 91
7 Virden 7 VIRDEN 40 96
8 Virden 8 VIRDEN 39 74
9 Virden 9 VIRDEN 40 90
10 Virden 10 VIRDEN 45 86
# ℹ 35 more rows
mutate()
Add a variable
# A tibble: 45 × 5
student_id test_site pre_test_score post_test_score gain
<chr> <chr> <dbl> <dbl> <dbl>
1 Virden 1 VIRDEN 43 92 49
2 Virden 2 VIRDEN 46 104 58
3 Virden 3 VIRDEN 39 75 36
4 Virden 4 VIRDEN 35 115 80
5 Virden 5 VIRDEN 46 85 39
6 Virden 6 VIRDEN 35 91 56
7 Virden 7 VIRDEN 40 96 56
8 Virden 8 VIRDEN 39 74 35
9 Virden 9 VIRDEN 40 90 50
10 Virden 10 VIRDEN 45 86 41
# ℹ 35 more rows
Why do we not see the new gain column?
mutate()
Add multiple variables…
Use a variable you just created!
# A tibble: 45 × 7
student_id test_site pre_test_score post_test_score gain sum_tests pct_tests
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Virden 1 VIRDEN 43 92 49 135 75
2 Virden 2 VIRDEN 46 104 58 150 83.3
3 Virden 3 VIRDEN 39 75 36 114 63.3
4 Virden 4 VIRDEN 35 115 80 150 83.3
5 Virden 5 VIRDEN 46 85 39 131 72.8
6 Virden 6 VIRDEN 35 91 56 126 70
7 Virden 7 VIRDEN 40 96 56 136 75.6
8 Virden 8 VIRDEN 39 74 35 113 62.8
9 Virden 9 VIRDEN 40 90 50 130 72.2
10 Virden 10 VIRDEN 45 86 41 131 72.8
# ℹ 35 more rows
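A sketch of adding several columns in one `mutate()` call, on a hypothetical miniature of the data. The denominator of 180 is inferred from the printed values (a sum of 135 gives 75) and is not stated in the slides:

```r
library(dplyr)

# Hypothetical miniature of the test-score data; values copied from the printed rows.
scores <- tibble(
  student_id = c("Virden 1", "Virden 2", "Virden 3"),
  pre_test_score = c(43, 46, 39),
  post_test_score = c(92, 104, 75)
)

# Assign the result back, otherwise the new columns are printed but not stored.
scores <- scores |>
  mutate(
    gain = post_test_score - pre_test_score,
    sum_tests = pre_test_score + post_test_score,
    pct_tests = sum_tests / 180 * 100  # uses the column created one line above
  )
```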
mutate()
Change a variable
# A tibble: 45 × 7
student_id test_site pre_test_score post_test_score gain sum_tests pct_tests
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 VIRDEN 43 92 49 135 75
2 2 VIRDEN 46 104 58 150 83.3
3 3 VIRDEN 39 75 36 114 63.3
4 4 VIRDEN 35 115 80 150 83.3
5 5 VIRDEN 46 85 39 131 72.8
6 6 VIRDEN 35 91 56 126 70
7 7 VIRDEN 40 96 56 136 75.6
8 8 VIRDEN 39 74 35 113 62.8
9 9 VIRDEN 40 90 50 130 72.2
10 10 VIRDEN 45 86 41 131 72.8
# ℹ 35 more rows
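Assigning to an existing column name overwrites that column. The table above shows `student_id` turned from text into a number; the slides do not show how, so the base R approach below is only one possible way:

```r
library(dplyr)

# Hypothetical miniature of the test-score data.
scores <- tibble(
  student_id = c("Virden 1", "Virden 2", "Virden 10"),
  test_site = "VIRDEN"
)

# Strip everything that is not a digit, then convert the remainder to numeric.
scores <- scores |>
  mutate(student_id = as.numeric(gsub("[^0-9]", "", student_id)))
```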
arrange()
arrange()
Order the data
# A tibble: 45 × 7
student_id test_site pre_test_score post_test_score gain sum_tests pct_tests
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12 JONES 27 32 5 59 32.8
2 6 WESTSIDE 57 82 25 139 77.2
3 8 VIRDEN 39 74 35 113 62.8
4 3 VIRDEN 39 75 36 114 63.3
5 5 VIRDEN 46 85 39 131 72.8
6 10 VIRDEN 45 86 41 131 72.8
7 3 JONES 54 95 41 149 82.8
8 1 VIRDEN 43 92 49 135 75
9 13 JONES 49 98 49 147 81.7
10 13 WESTSIDE 45 94 49 139 77.2
# ℹ 35 more rows
arrange()
Order the data
# A tibble: 45 × 7
student_id test_site pre_test_score post_test_score gain sum_tests pct_tests
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 11 JONES 24 108 84 132 73.3
2 4 JONES 25 99 74 124 68.9
3 12 JONES 27 32 5 59 32.8
4 12 VIRDEN 31 102 71 133 73.9
5 11 VIRDEN 32 91 59 123 68.3
6 13 VIRDEN 33 86 53 119 66.1
7 6 VIRDEN 35 91 56 126 70
8 15 VIRDEN 35 101 66 136 75.6
9 9 JONES 35 109 74 144 80
10 4 VIRDEN 35 115 80 150 83.3
# ℹ 35 more rows
arrange()
Order the data
descending, with desc()
# A tibble: 45 × 7
student_id test_site pre_test_score post_test_score gain sum_tests pct_tests
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 11 JONES 24 108 84 132 73.3
2 4 VIRDEN 35 115 80 150 83.3
3 4 JONES 25 99 74 124 68.9
4 5 JONES 36 110 74 146 81.1
5 9 JONES 35 109 74 144 80
6 12 VIRDEN 31 102 71 133 73.9
7 1 JONES 36 103 67 139 77.2
8 15 VIRDEN 35 101 66 136 75.6
9 6 JONES 40 106 66 146 81.1
10 8 WESTSIDE 43 109 66 152 84.4
# ℹ 35 more rows
arrange()
Order the data
# A tibble: 45 × 7
student_id test_site pre_test_score post_test_score gain sum_tests pct_tests
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 6 WESTSIDE 57 82 25 139 77.2
2 3 JONES 54 95 41 149 82.8
3 1 WESTSIDE 53 109 56 162 90
4 2 WESTSIDE 51 105 54 156 86.7
5 3 WESTSIDE 51 105 54 156 86.7
6 8 JONES 50 101 51 151 83.9
7 7 WESTSIDE 49 109 60 158 87.8
8 7 JONES 49 104 55 153 85
9 13 JONES 49 98 49 147 81.7
10 5 WESTSIDE 46 106 60 152 84.4
# ℹ 35 more rows
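The sort calls themselves are missing from the slides. A sketch of the patterns the tables suggest, on a hypothetical miniature of the data with values copied from the printed rows:

```r
library(dplyr)

# Hypothetical miniature of the test-score data.
scores <- tibble(
  test_site = c("VIRDEN", "VIRDEN", "VIRDEN", "JONES"),
  pre_test_score = c(43, 46, 39, 27),
  post_test_score = c(92, 104, 75, 32),
  gain = c(49, 58, 36, 5)
)

# Ascending is the default.
by_gain <- arrange(scores, gain)

# Later columns break ties in earlier ones.
by_pre_then_post <- arrange(scores, pre_test_score, post_test_score)

# Wrap a column in desc() to sort it from largest to smallest.
by_gain_desc <- arrange(scores, desc(gain))
```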
summarize()
summarize() or summarise()
mean and standard deviation of the gain scores
# A tibble: 1 × 2
mean_gain sd_gain
<dbl> <dbl>
1 56.4 14.1
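A sketch of the call shape on a hypothetical four-row sample, so the numbers will differ from the 45-row result above:

```r
library(dplyr)

# Hypothetical sample of gain scores copied from the printed rows.
scores <- tibble(gain = c(49, 58, 36, 80))

# summarize() collapses all rows into one row of named summary values.
gain_summary <- scores |>
  summarize(
    mean_gain = mean(gain),
    sd_gain = sd(gain)
  )
```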
group_by()
group_by()
Conduct an operation for each level of a grouping factor
Here, compute the mean gain for each test site
# A tibble: 3 × 2
test_site mean_gain
<chr> <dbl>
1 JONES 59
2 VIRDEN 53.8
3 WESTSIDE 56.3
group_by()
When you use group_by() and summarize()
You will get a row for each unique group
If you have multiple grouping variables, you will get a row for each unique combination of groups
Use mutate()!
# A tibble: 45 × 8
# Groups: test_site [3]
student_id test_site pre_test_score post_test_score gain sum_tests pct_tests
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 VIRDEN 43 92 49 135 75
2 2 VIRDEN 46 104 58 150 83.3
3 3 VIRDEN 39 75 36 114 63.3
4 4 VIRDEN 35 115 80 150 83.3
5 5 VIRDEN 46 85 39 131 72.8
6 6 VIRDEN 35 91 56 126 70
7 7 VIRDEN 40 96 56 136 75.6
8 8 VIRDEN 39 74 35 113 62.8
9 9 VIRDEN 40 90 50 130 72.2
10 10 VIRDEN 45 86 41 131 72.8
# ℹ 35 more rows
# ℹ 1 more variable: mean_gain <dbl>
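A sketch contrasting the two grouped operations on a hypothetical four-row sample (values copied from the printed rows): `summarize()` returns one row per group, while `mutate()` keeps every row and repeats the group value.

```r
library(dplyr)

# Hypothetical sample of the test-score data.
scores <- tibble(
  test_site = c("VIRDEN", "VIRDEN", "JONES", "JONES"),
  gain = c(49, 58, 5, 84)
)

# One row per test site.
site_means <- scores |>
  group_by(test_site) |>
  summarize(mean_gain = mean(gain))

# Every original row, with its site's mean attached as a new column.
with_means <- scores |>
  group_by(test_site) |>
  mutate(mean_gain = mean(gain)) |>
  ungroup()  # drop the grouping so later steps are not grouped by accident
```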
Use summarize()!
{ggplot}!
recode()
and rename()
ifelse()
Recoding into dichotomous variable with ifelse()
ifelse(test, value if test == TRUE, value if test == FALSE)
or
ifelse(logical statement, do this if TRUE, else do this)
ifelse()
Recoding into dichotomous variable with ifelse()
Remember to pair with mutate()
!
# A tibble: 45 × 26
test_year test_type test_site student_id pre_test_score pre_test_percent
<chr> <chr> <chr> <chr> <dbl> <chr>
1 06/01/2016 12… YEAR END VIRDEN Virden 1 43 29%
2 06/01/2016 12… YEAR END VIRDEN Virden 2 46 31%
3 06/01/2016 12… YEAR END VIRDEN Virden 3 39 26%
4 06/01/2016 12… YEAR END VIRDEN Virden 4 35 23%
5 06/01/2016 12… YEAR END VIRDEN Virden 5 46 31%
6 06/01/2016 12… YEAR END VIRDEN Virden 6 35 23%
7 06/01/2016 12… YEAR END VIRDEN Virden 7 40 27%
8 06/01/2016 12… YEAR END VIRDEN Virden 8 39 26%
9 06/01/2016 12… YEAR END VIRDEN Virden 9 40 27%
10 06/01/2016 12… YEAR END VIRDEN Virden 10 45 30%
# ℹ 35 more rows
# ℹ 20 more variables: post_test_score <dbl>, post_test_percent <chr>,
# percentage_change <chr>, unit_1_score <dbl>, unit_1_percent <chr>,
# unit_2_score <dbl>, unit_2_percent <chr>, unit_3_score <dbl>,
# unit_3_percent <chr>, unit_4_score <dbl>, unit_4_percent <chr>,
# unit_5_6_score <dbl>, unit_5_6_percent <chr>, unit_7_score <dbl>,
# unit_7_percent <chr>, unit_8_score <dbl>, unit_8_percent <chr>, …
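The table above has one more column than before, but the slide does not show which variable was created. A sketch with a hypothetical `big_gain` column and an illustrative cutoff of 50:

```r
library(dplyr)

# Hypothetical sample of the test-score data.
scores <- tibble(
  student_id = c(1, 2, 3),
  gain = c(49, 58, 36)
)

# ifelse(test, value if TRUE, value if FALSE), evaluated row by row.
scores <- scores |>
  mutate(big_gain = ifelse(gain >= 50, "yes", "no"))
```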
recode()
Recoding into multinomial variable
recode()
Recoding into multinomial variable
Remember to pair with mutate()
# A tibble: 45 × 25
test_year test_type test_site student_id pre_test_score pre_test_percent
<chr> <chr> <chr> <chr> <dbl> <chr>
1 06/01/2016 12… YEAR END Virden E… Virden 1 43 29%
2 06/01/2016 12… YEAR END Virden E… Virden 2 46 31%
3 06/01/2016 12… YEAR END Virden E… Virden 3 39 26%
4 06/01/2016 12… YEAR END Virden E… Virden 4 35 23%
5 06/01/2016 12… YEAR END Virden E… Virden 5 46 31%
6 06/01/2016 12… YEAR END Virden E… Virden 6 35 23%
7 06/01/2016 12… YEAR END Virden E… Virden 7 40 27%
8 06/01/2016 12… YEAR END Virden E… Virden 8 39 26%
9 06/01/2016 12… YEAR END Virden E… Virden 9 40 27%
10 06/01/2016 12… YEAR END Virden E… Virden 10 45 30%
# ℹ 35 more rows
# ℹ 19 more variables: post_test_score <dbl>, post_test_percent <chr>,
# percentage_change <chr>, unit_1_score <dbl>, unit_1_percent <chr>,
# unit_2_score <dbl>, unit_2_percent <chr>, unit_3_score <dbl>,
# unit_3_percent <chr>, unit_4_score <dbl>, unit_4_percent <chr>,
# unit_5_6_score <dbl>, unit_5_6_percent <chr>, unit_7_score <dbl>,
# unit_7_percent <chr>, unit_8_score <dbl>, unit_8_percent <chr>, …
recode()
Remember to use backticks `` for special characters
# A tibble: 32 × 12
mpg cyl disp hp drat wt qsec vs am gear carb cyl_str
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 cylinders6
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 cylinders6
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 cylinders4
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 cylinders6
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 cylinders8
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 cylinders6
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 cylinders8
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 cylinders4
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 cylinders4
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 cylinders6
# ℹ 22 more rows
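A sketch that reproduces the `cyl_str` column on the built-in `mtcars` data. Numeric values on the left-hand side must be wrapped in backticks because a bare number is not a valid argument name. Recent {dplyr} releases mark `recode()` as superseded, so treat this as an illustration of the slide rather than a recommendation.

```r
library(dplyr)

cars <- as_tibble(mtcars) |>
  mutate(
    cyl_str = recode(
      cyl,
      `4` = "cylinders4",  # old value in backticks = new value
      `6` = "cylinders6",
      `8` = "cylinders8"
    )
  )
```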
case_when()
Recoding across several variables, or if you find yourself using nested ifelse() statements
Remember to pair with mutate()
.default is equivalent to “all else”
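A sketch of `case_when()` inside `mutate()`. The `gain_band` column and its cutoffs are hypothetical, and the `.default` argument assumes {dplyr} 1.1.0 or later:

```r
library(dplyr)

# Hypothetical sample of gain scores copied from the printed rows.
scores <- tibble(gain = c(49, 58, 36, 5))

# Conditions are checked in order; the first one that is TRUE wins.
scores <- scores |>
  mutate(
    gain_band = case_when(
      gain >= 50 ~ "high",
      gain >= 30 ~ "medium",
      .default = "low"  # used when no condition above matched
    )
  )
```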
rename()
rename(new_name = old_name)
NOT paired with mutate()
# A tibble: 344 × 8
species home bill_length bill_depth_mm flipper_length_mm body_mass_g sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Adelie Torger… 39.1 18.7 181 3750 male
2 Adelie Torger… 39.5 17.4 186 3800 fema…
3 Adelie Torger… 40.3 18 195 3250 fema…
4 Adelie Torger… NA NA NA NA <NA>
5 Adelie Torger… 36.7 19.3 193 3450 fema…
6 Adelie Torger… 39.3 20.6 190 3650 male
7 Adelie Torger… 38.9 17.8 181 3625 fema…
8 Adelie Torger… 39.2 19.6 195 4675 male
9 Adelie Torger… 34.1 18.1 193 3475 <NA>
10 Adelie Torger… 42 20.2 190 4250 <NA>
# ℹ 334 more rows
# ℹ 1 more variable: year <int>
select()
# A tibble: 344 × 4
species home bill_length year
<fct> <fct> <dbl> <int>
1 Adelie Torgersen 39.1 2007
2 Adelie Torgersen 39.5 2007
3 Adelie Torgersen 40.3 2007
4 Adelie Torgersen NA 2007
5 Adelie Torgersen 36.7 2007
6 Adelie Torgersen 39.3 2007
7 Adelie Torgersen 38.9 2007
8 Adelie Torgersen 39.2 2007
9 Adelie Torgersen 34.1 2007
10 Adelie Torgersen 42 2007
# ℹ 334 more rows
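A sketch of both renaming routes, assuming `penguins` comes from {palmerpenguins}. `rename()` keeps every column, while renaming inside `select()` keeps only the columns listed:

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the `penguins` tibble

# rename(new_name = old_name): all 8 columns stay, two get new names.
renamed <- rename(penguins, home = island, bill_length = bill_length_mm)

# select() can rename while choosing columns.
picked <- select(penguins, species, home = island, bill_length = bill_length_mm, year)
```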
{dplyr}
mutate, select, and summarize