Week 4: Data Wrangling

{dplyr}

Joe Nese

University of Oregon
Fall 2025

`{datapasta}`

Copy and paste data to and from R
VERY handy!
Good for reprex
- posting questions on Posit Community or stackoverflow

demo

Housekeeping

Final Project Outline
- Due next week (10/29)

The outline should include a description of the data to be used, a discussion of what preparatory work will need to be done, and how the requirements of the final project will be met. The outline is intended to be low-stakes and is primarily designed to be a means for you to obtain feedback on the feasibility of the project and areas to consider.

Homework Review

Homeworks 2 & 3

Data Wrangling with `{dplyr}`

Week 4

Agenda

Introduce common {dplyr} functions
- piping {dplyr} functions
- helper functions
Introduce complimentary functions
- ifelse()
- case_when()

Learning Objectives

Understand basic data wrangling and transformations
- select()
- filter()
- arrange()
- group_by()
- summarize()
- recode() & rename()
Be able to pipe these functions to efficiently wrangle data

Providing grammar for:

Graphics
- {ggplot2}
Data manipulations
- {dplyr}
- {tidyr}
Expanding area of specialized topics
- {lubridate}
- {glue}
- {tidymodels}
Many more…

Providing grammar for:

Graphics
- {ggplot2}
Data manipulations
- {dplyr}
- {tidyr}
Expanding area of specialized topics
- {lubridate}
- {glue}
- {tidymodels}
Many more…

{dplyr}

`{dplyr}`

A grammar for data wrangling (verbs!)

Take a couple minutes with your neighbor to guess what the following functions do:

select()
filter()
mutate()
arrange()
summarize()
group_by()

02:00

`{dplyr}`

A grammar for data wrangling (verbs!)

select()

a subset of columns

filter()

a subset of rows

mutate()

add a new column or modify an existing column

arrange()

sort rows in ascending/descending order

summarize()

a new column created according to other columns for example: mean() or sd() + often used with…

group_by()

grouping categories in column(s)

Arguments

{dplyr} always takes a data frame (tibble) as the first argument
- just like {ggplot}
subsequent functions tell {dplyr} what to do with the data frame
each function applied returns the modified data frame

`select()`

choose columns

data.frame |> 
  select(a, c)

Examples

Let’s go back to the `penguins’ dataset

library(tidyverse)
#install.packages("palmerpenguins")
library(palmerpenguins)

penguins

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

`select` president species and island and body_mass_g

select(penguins, species, island, body_mass_g)

OR more typically

penguins |>
    select(name, party, body_mass_g)

Error in `select()`:
! Can't select columns that don't exist.
✖ Column `name` doesn't exist.

Negation `-`

Don’t select variable(s)

penguins |>
    select(-year)

# A tibble: 344 × 7
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 1 more variable: sex <fct>

Negation `-`

Don’t select variable(s)

penguins |>
    select(-bill_length_mm, -bill_depth_mm)

OR maybe better

penguins |>
    select(-c(bill_length_mm, bill_depth_mm))

# A tibble: 344 × 6
   species island    flipper_length_mm body_mass_g sex     year
   <fct>   <fct>                 <int>       <int> <fct>  <int>
 1 Adelie  Torgersen               181        3750 male    2007
 2 Adelie  Torgersen               186        3800 female  2007
 3 Adelie  Torgersen               195        3250 female  2007
 4 Adelie  Torgersen                NA          NA <NA>    2007
 5 Adelie  Torgersen               193        3450 female  2007
 6 Adelie  Torgersen               190        3650 male    2007
 7 Adelie  Torgersen               181        3625 female  2007
 8 Adelie  Torgersen               195        4675 male    2007
 9 Adelie  Torgersen               193        3475 <NA>    2007
10 Adelie  Torgersen               190        4250 <NA>    2007
# ℹ 334 more rows

Indexing

Less preferable because variable orders change

penguins |>
    select(1:3)

# A tibble: 344 × 3
   species island    bill_length_mm
   <fct>   <fct>              <dbl>
 1 Adelie  Torgersen           39.1
 2 Adelie  Torgersen           39.5
 3 Adelie  Torgersen           40.3
 4 Adelie  Torgersen           NA  
 5 Adelie  Torgersen           36.7
 6 Adelie  Torgersen           39.3
 7 Adelie  Torgersen           38.9
 8 Adelie  Torgersen           39.2
 9 Adelie  Torgersen           34.1
10 Adelie  Torgersen           42  
# ℹ 334 more rows

Use multiple methods

penguins |>
    select(species, 3, 5)

# A tibble: 344 × 3
   species bill_length_mm flipper_length_mm
   <fct>            <dbl>             <int>
 1 Adelie            39.1               181
 2 Adelie            39.5               186
 3 Adelie            40.3               195
 4 Adelie            NA                  NA
 5 Adelie            36.7               193
 6 Adelie            39.3               190
 7 Adelie            38.9               181
 8 Adelie            39.2               195
 9 Adelie            34.1               193
10 Adelie            42                 190
# ℹ 334 more rows

Use `select()` to order columns

select() helper functions

everything()
starts_with()
ends_with()
contains()
all_of()
any_of()

Helper functions

First: this code may not work for you - why?

reads <- read_csv(here("data", "Project_Reads_Scores.csv")) |> 
  clean_names()
reads

# A tibble: 48 × 25
   test_year      test_type test_site student_id pre_test_score pre_test_percent
   <chr>          <chr>     <chr>     <chr>               <dbl> <chr>           
 1 06/01/2016 12… YEAR END  VIRDEN    Virden 1               43 29%             
 2 06/01/2016 12… YEAR END  VIRDEN    Virden 2               46 31%             
 3 06/01/2016 12… YEAR END  VIRDEN    Virden 3               39 26%             
 4 06/01/2016 12… YEAR END  VIRDEN    Virden 4               35 23%             
 5 06/01/2016 12… YEAR END  VIRDEN    Virden 5               46 31%             
 6 06/01/2016 12… YEAR END  VIRDEN    Virden 6               35 23%             
 7 06/01/2016 12… YEAR END  VIRDEN    Virden 7               40 27%             
 8 06/01/2016 12… YEAR END  VIRDEN    Virden 8               39 26%             
 9 06/01/2016 12… YEAR END  VIRDEN    Virden 9               40 27%             
10 06/01/2016 12… YEAR END  VIRDEN    Virden 10              45 30%             
# ℹ 38 more rows
# ℹ 19 more variables: post_test_score <dbl>, post_test_percent <chr>,
#   percentage_change <chr>, unit_1_score <dbl>, unit_1_percent <chr>,
#   unit_2_score <dbl>, unit_2_percent <chr>, unit_3_score <dbl>,
#   unit_3_percent <chr>, unit_4_score <dbl>, unit_4_percent <chr>,
#   unit_5_6_score <dbl>, unit_5_6_percent <chr>, unit_7_score <dbl>,
#   unit_7_percent <chr>, unit_8_score <dbl>, unit_8_percent <chr>, …

Use `select` to order columns

reads |>
    select(student_id, test_site, test_type, test_year, everything())

# A tibble: 48 × 25
   student_id test_site test_type test_year      pre_test_score pre_test_percent
   <chr>      <chr>     <chr>     <chr>                   <dbl> <chr>           
 1 Virden 1   VIRDEN    YEAR END  06/01/2016 12…             43 29%             
 2 Virden 2   VIRDEN    YEAR END  06/01/2016 12…             46 31%             
 3 Virden 3   VIRDEN    YEAR END  06/01/2016 12…             39 26%             
 4 Virden 4   VIRDEN    YEAR END  06/01/2016 12…             35 23%             
 5 Virden 5   VIRDEN    YEAR END  06/01/2016 12…             46 31%             
 6 Virden 6   VIRDEN    YEAR END  06/01/2016 12…             35 23%             
 7 Virden 7   VIRDEN    YEAR END  06/01/2016 12…             40 27%             
 8 Virden 8   VIRDEN    YEAR END  06/01/2016 12…             39 26%             
 9 Virden 9   VIRDEN    YEAR END  06/01/2016 12…             40 27%             
10 Virden 10  VIRDEN    YEAR END  06/01/2016 12…             45 30%             
# ℹ 38 more rows
# ℹ 19 more variables: post_test_score <dbl>, post_test_percent <chr>,
#   percentage_change <chr>, unit_1_score <dbl>, unit_1_percent <chr>,
#   unit_2_score <dbl>, unit_2_percent <chr>, unit_3_score <dbl>,
#   unit_3_percent <chr>, unit_4_score <dbl>, unit_4_percent <chr>,
#   unit_5_6_score <dbl>, unit_5_6_percent <chr>, unit_7_score <dbl>,
#   unit_7_percent <chr>, unit_8_score <dbl>, unit_8_percent <chr>, …

everything() is a helper verb that includes all column names

`starts_with()`

reads |> 
    select(starts_with("test"))

# A tibble: 48 × 3
   test_year              test_type test_site
   <chr>                  <chr>     <chr>    
 1 06/01/2016 12:00:00 AM YEAR END  VIRDEN   
 2 06/01/2016 12:00:00 AM YEAR END  VIRDEN   
 3 06/01/2016 12:00:00 AM YEAR END  VIRDEN   
 4 06/01/2016 12:00:00 AM YEAR END  VIRDEN   
 5 06/01/2016 12:00:00 AM YEAR END  VIRDEN   
 6 06/01/2016 12:00:00 AM YEAR END  VIRDEN   
 7 06/01/2016 12:00:00 AM YEAR END  VIRDEN   
 8 06/01/2016 12:00:00 AM YEAR END  VIRDEN   
 9 06/01/2016 12:00:00 AM YEAR END  VIRDEN   
10 06/01/2016 12:00:00 AM YEAR END  VIRDEN   
# ℹ 38 more rows

`ends_with()`

reads |> 
    select(ends_with("score"))

# A tibble: 48 × 10
   pre_test_score post_test_score unit_1_score unit_2_score unit_3_score
            <dbl>           <dbl>        <dbl>        <dbl>        <dbl>
 1             43              92            3            4            6
 2             46             104            5            5            6
 3             39              75            4            4            6
 4             35             115            4            4            6
 5             46              85            2            5            6
 6             35              91            5            5            7
 7             40              96            5            5            6
 8             39              74            4            5            5
 9             40              90            6            4            5
10             45              86            4            5            5
# ℹ 38 more rows
# ℹ 5 more variables: unit_4_score <dbl>, unit_5_6_score <dbl>,
#   unit_7_score <dbl>, unit_8_score <dbl>, total_score <dbl>

`contains()`

reads |> 
    select(contains("test"))

# A tibble: 48 × 7
   test_year test_type test_site pre_test_score pre_test_percent post_test_score
   <chr>     <chr>     <chr>              <dbl> <chr>                      <dbl>
 1 06/01/20… YEAR END  VIRDEN                43 29%                           92
 2 06/01/20… YEAR END  VIRDEN                46 31%                          104
 3 06/01/20… YEAR END  VIRDEN                39 26%                           75
 4 06/01/20… YEAR END  VIRDEN                35 23%                          115
 5 06/01/20… YEAR END  VIRDEN                46 31%                           85
 6 06/01/20… YEAR END  VIRDEN                35 23%                           91
 7 06/01/20… YEAR END  VIRDEN                40 27%                           96
 8 06/01/20… YEAR END  VIRDEN                39 26%                           74
 9 06/01/20… YEAR END  VIRDEN                40 27%                           90
10 06/01/20… YEAR END  VIRDEN                45 30%                           86
# ℹ 38 more rows
# ℹ 1 more variable: post_test_percent <chr>

`all_of()`

vars <- c("pre_test_score", "pre_test_percent", "post_test_score", "post_test_percent")

reads |> 
    select(all_of(vars))

# A tibble: 48 × 4
   pre_test_score pre_test_percent post_test_score post_test_percent
            <dbl> <chr>                      <dbl> <chr>            
 1             43 29%                           92 61%              
 2             46 31%                          104 69%              
 3             39 26%                           75 50%              
 4             35 23%                          115 77%              
 5             46 31%                           85 57%              
 6             35 23%                           91 61%              
 7             40 27%                           96 64%              
 8             39 26%                           74 49%              
 9             40 27%                           90 60%              
10             45 30%                           86 57%              
# ℹ 38 more rows

`all_of()`

vars <- c("pre_test_score", "pre_test_percent", "post_test_score", "post_test_percent")

mpg |> 
    select(all_of(vars))

Error in `select()`:
ℹ In argument: `all_of(vars)`.
Caused by error in `all_of()`:
! Can't subset elements that don't exist.
✖ Elements `pre_test_score`, `pre_test_percent`, `post_test_score`, and `post_test_percent` don't exist.

`any_of()`

vars <- c("pre_test_score", "pre_test_percent", "post_test_score", "post_test_percent")

reads |> 
    select(any_of(vars))

# A tibble: 48 × 4
   pre_test_score pre_test_percent post_test_score post_test_percent
            <dbl> <chr>                      <dbl> <chr>            
 1             43 29%                           92 61%              
 2             46 31%                          104 69%              
 3             39 26%                           75 50%              
 4             35 23%                          115 77%              
 5             46 31%                           85 57%              
 6             35 23%                           91 61%              
 7             40 27%                           96 64%              
 8             39 26%                           74 49%              
 9             40 27%                           90 60%              
10             45 30%                           86 57%              
# ℹ 38 more rows

`any_of()`

vars <- c("pre_test_score", "pre_test_percent", "post_test_score", "post_test_percent")

mpg |> 
    select(any_of(vars))

# A tibble: 234 × 0

Mix `select` methods and helpers

You can mix types and helper functions

reads |> 
    select(student_id, 1, starts_with("total"))

# A tibble: 48 × 4
   student_id test_year              total_score total_percent_correct
   <chr>      <chr>                        <dbl> <chr>                
 1 Virden 1   06/01/2016 12:00:00 AM         207 41%                  
 2 Virden 2   06/01/2016 12:00:00 AM         224 45%                  
 3 Virden 3   06/01/2016 12:00:00 AM         193 39%                  
 4 Virden 4   06/01/2016 12:00:00 AM         223 45%                  
 5 Virden 5   06/01/2016 12:00:00 AM         198 40%                  
 6 Virden 6   06/01/2016 12:00:00 AM         210 42%                  
 7 Virden 7   06/01/2016 12:00:00 AM         211 42%                  
 8 Virden 8   06/01/2016 12:00:00 AM         180 36%                  
 9 Virden 9   06/01/2016 12:00:00 AM         206 41%                  
10 Virden 10  06/01/2016 12:00:00 AM         206 41%                  
# ℹ 38 more rows

Mix `select` methods and helpers

You can also use select to rearrange your columns

reads |> 
    select(student_id, 1, starts_with("total"), everything())

# A tibble: 48 × 25
   student_id test_year    total_score total_percent_correct test_type test_site
   <chr>      <chr>              <dbl> <chr>                 <chr>     <chr>    
 1 Virden 1   06/01/2016 …         207 41%                   YEAR END  VIRDEN   
 2 Virden 2   06/01/2016 …         224 45%                   YEAR END  VIRDEN   
 3 Virden 3   06/01/2016 …         193 39%                   YEAR END  VIRDEN   
 4 Virden 4   06/01/2016 …         223 45%                   YEAR END  VIRDEN   
 5 Virden 5   06/01/2016 …         198 40%                   YEAR END  VIRDEN   
 6 Virden 6   06/01/2016 …         210 42%                   YEAR END  VIRDEN   
 7 Virden 7   06/01/2016 …         211 42%                   YEAR END  VIRDEN   
 8 Virden 8   06/01/2016 …         180 36%                   YEAR END  VIRDEN   
 9 Virden 9   06/01/2016 …         206 41%                   YEAR END  VIRDEN   
10 Virden 10  06/01/2016 …         206 41%                   YEAR END  VIRDEN   
# ℹ 38 more rows
# ℹ 19 more variables: pre_test_score <dbl>, pre_test_percent <chr>,
#   post_test_score <dbl>, post_test_percent <chr>, percentage_change <chr>,
#   unit_1_score <dbl>, unit_1_percent <chr>, unit_2_score <dbl>,
#   unit_2_percent <chr>, unit_3_score <dbl>, unit_3_percent <chr>,
#   unit_4_score <dbl>, unit_4_percent <chr>, unit_5_6_score <dbl>,
#   unit_5_6_percent <chr>, unit_7_score <dbl>, unit_7_percent <chr>, …

`relocate()`

relocate(.data, … .before = NULL, .after = NULL)

.data = dataframe/tibble

… = columns to move

.before = variable to move before

.after = variable to move after

supplying neither .before nor .after move columns to the left-hand side
specifying both is an error

`relocate()`

penguins

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

penguins |> 
  relocate(year)

# A tibble: 344 × 8
    year species island    bill_length_mm bill_depth_mm flipper_length_mm
   <int> <fct>   <fct>              <dbl>         <dbl>             <int>
 1  2007 Adelie  Torgersen           39.1          18.7               181
 2  2007 Adelie  Torgersen           39.5          17.4               186
 3  2007 Adelie  Torgersen           40.3          18                 195
 4  2007 Adelie  Torgersen           NA            NA                  NA
 5  2007 Adelie  Torgersen           36.7          19.3               193
 6  2007 Adelie  Torgersen           39.3          20.6               190
 7  2007 Adelie  Torgersen           38.9          17.8               181
 8  2007 Adelie  Torgersen           39.2          19.6               195
 9  2007 Adelie  Torgersen           34.1          18.1               193
10  2007 Adelie  Torgersen           42            20.2               190
# ℹ 334 more rows
# ℹ 2 more variables: body_mass_g <int>, sex <fct>

`relocate()`

penguins |> 
  relocate(year, .after = species)

# A tibble: 344 × 8
   species  year island    bill_length_mm bill_depth_mm flipper_length_mm
   <fct>   <int> <fct>              <dbl>         <dbl>             <int>
 1 Adelie   2007 Torgersen           39.1          18.7               181
 2 Adelie   2007 Torgersen           39.5          17.4               186
 3 Adelie   2007 Torgersen           40.3          18                 195
 4 Adelie   2007 Torgersen           NA            NA                  NA
 5 Adelie   2007 Torgersen           36.7          19.3               193
 6 Adelie   2007 Torgersen           39.3          20.6               190
 7 Adelie   2007 Torgersen           38.9          17.8               181
 8 Adelie   2007 Torgersen           39.2          19.6               195
 9 Adelie   2007 Torgersen           34.1          18.1               193
10 Adelie   2007 Torgersen           42            20.2               190
# ℹ 334 more rows
# ℹ 2 more variables: body_mass_g <int>, sex <fct>

`relocate()`

penguins |> 
  relocate(year, .before = bill_length_mm)

# A tibble: 344 × 8
   species island     year bill_length_mm bill_depth_mm flipper_length_mm
   <fct>   <fct>     <int>          <dbl>         <dbl>             <int>
 1 Adelie  Torgersen  2007           39.1          18.7               181
 2 Adelie  Torgersen  2007           39.5          17.4               186
 3 Adelie  Torgersen  2007           40.3          18                 195
 4 Adelie  Torgersen  2007           NA            NA                  NA
 5 Adelie  Torgersen  2007           36.7          19.3               193
 6 Adelie  Torgersen  2007           39.3          20.6               190
 7 Adelie  Torgersen  2007           38.9          17.8               181
 8 Adelie  Torgersen  2007           39.2          19.6               195
 9 Adelie  Torgersen  2007           34.1          18.1               193
10 Adelie  Torgersen  2007           42            20.2               190
# ℹ 334 more rows
# ℹ 2 more variables: body_mass_g <int>, sex <fct>

`filter()`

choose rows that satisfy your condition (== TRUE)

df |> 
  filter(
    type == "otter",
    site == "bay"
  )

`filter()`

Complete set of boolean operations

x is the left-hand circle
y is the right-hand circle
the shaded region show which parts each operator selects
- xor == x or y except where they overlap

==: equal
!: negates
!=: NOT equal
&: and
|: or

`filter()` numeric

mpg |> 
  filter(cyl == 4)

# A tibble: 81 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 6 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
 7 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
 8 audi         a4 quattro   2    2008     4 auto… 4        19    27 p     comp…
 9 chevrolet    malibu       2.4  1999     4 auto… f        19    27 r     mids…
10 chevrolet    malibu       2.4  2008     4 auto… f        22    30 r     mids…
# ℹ 71 more rows

mpg |> 
  filter(hwy >= 22)

# A tibble: 143 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 133 more rows

`filter` character

use == NOT =

penguins |>
    filter(species == "Gentoo")

# A tibble: 124 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           46.1          13.2               211        4500
 2 Gentoo  Biscoe           50            16.3               230        5700
 3 Gentoo  Biscoe           48.7          14.1               210        4450
 4 Gentoo  Biscoe           50            15.2               218        5700
 5 Gentoo  Biscoe           47.6          14.5               215        5400
 6 Gentoo  Biscoe           46.5          13.5               210        4550
 7 Gentoo  Biscoe           45.4          14.6               211        4800
 8 Gentoo  Biscoe           46.7          15.3               219        5200
 9 Gentoo  Biscoe           43.3          13.4               209        4400
10 Gentoo  Biscoe           46.8          15.4               215        5150
# ℹ 114 more rows
# ℹ 2 more variables: sex <fct>, year <int>

penguins |>
    filter(species = "Gentoo")

Error in `filter()`:
! We detected a named input.
ℹ This usually means that you've used `=` instead of `==`.
ℹ Did you mean `species == "Gentoo"`?

Gentoo starting after 2007

, is equivalent to &

penguins |> 
  filter(
    species == "Gentoo" &
    start > "2007"
  )

Error in `filter()`:
ℹ In argument: `species == "Gentoo" & start > "2007"`.
Caused by error in `start > "2007"`:
! comparison (>) is possible only for atomic and list types

penguins |> 
  filter(
    species == "Gentoo",
    start > "2007"
  )

Error in `filter()`:
ℹ In argument: `start > "2007"`.
Caused by error in `start > "2007"`:
! comparison (>) is possible only for atomic and list types

`filter()` note

Can’t write:

penguins |> 
  filter(island == "Briscoe" | "Dream")

Error in `filter()`:
ℹ In argument: `island == "Briscoe" | "Dream"`.
Caused by error in `island == "Briscoe" | "Dream"`:
! operations are possible only for numeric, logical or complex types

Must be:

penguins |> 
  filter(island == "Briscoe" | island == "Dream")

# A tibble: 124 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Adelie  Dream            39.5          16.7               178        3250
 2 Adelie  Dream            37.2          18.1               178        3900
 3 Adelie  Dream            39.5          17.8               188        3300
 4 Adelie  Dream            40.9          18.9               184        3900
 5 Adelie  Dream            36.4          17                 195        3325
 6 Adelie  Dream            39.2          21.1               196        4150
 7 Adelie  Dream            38.8          20                 190        3950
 8 Adelie  Dream            42.2          18.5               180        3550
 9 Adelie  Dream            37.6          19.3               181        3300
10 Adelie  Dream            39.8          19.1               184        4650
# ℹ 114 more rows
# ℹ 2 more variables: sex <fct>, year <int>

more `filter()`

reads

# A tibble: 48 × 25
   test_year      test_type test_site student_id pre_test_score pre_test_percent
   <chr>          <chr>     <chr>     <chr>               <dbl> <chr>           
 1 06/01/2016 12… YEAR END  VIRDEN    Virden 1               43 29%             
 2 06/01/2016 12… YEAR END  VIRDEN    Virden 2               46 31%             
 3 06/01/2016 12… YEAR END  VIRDEN    Virden 3               39 26%             
 4 06/01/2016 12… YEAR END  VIRDEN    Virden 4               35 23%             
 5 06/01/2016 12… YEAR END  VIRDEN    Virden 5               46 31%             
 6 06/01/2016 12… YEAR END  VIRDEN    Virden 6               35 23%             
 7 06/01/2016 12… YEAR END  VIRDEN    Virden 7               40 27%             
 8 06/01/2016 12… YEAR END  VIRDEN    Virden 8               39 26%             
 9 06/01/2016 12… YEAR END  VIRDEN    Virden 9               40 27%             
10 06/01/2016 12… YEAR END  VIRDEN    Virden 10              45 30%             
# ℹ 38 more rows
# ℹ 19 more variables: post_test_score <dbl>, post_test_percent <chr>,
#   percentage_change <chr>, unit_1_score <dbl>, unit_1_percent <chr>,
#   unit_2_score <dbl>, unit_2_percent <chr>, unit_3_score <dbl>,
#   unit_3_percent <chr>, unit_4_score <dbl>, unit_4_percent <chr>,
#   unit_5_6_score <dbl>, unit_5_6_percent <chr>, unit_7_score <dbl>,
#   unit_7_percent <chr>, unit_8_score <dbl>, unit_8_percent <chr>, …

more `filter()`

Anomaly in the data

reads |>
 count(student_id)

# A tibble: 46 × 2
   student_id                 n
   <chr>                  <int>
 1 All Students (Average)     3
 2 Jones 1                    1
 3 Jones 10                   1
 4 Jones 11                   1
 5 Jones 12                   1
 6 Jones 13                   1
 7 Jones 14                   1
 8 Jones 15                   1
 9 Jones 2                    1
10 Jones 3                    1
# ℹ 36 more rows

Let’s remove these aggregate rows

students_only <- reads |>
  filter(student_id != "All Students (Average)") 

students_only |> 
  count(student_id)

# A tibble: 45 × 2
   student_id     n
   <chr>      <int>
 1 Jones 1        1
 2 Jones 10       1
 3 Jones 11       1
 4 Jones 12       1
 5 Jones 13       1
 6 Jones 14       1
 7 Jones 15       1
 8 Jones 2        1
 9 Jones 3        1
10 Jones 4        1
# ℹ 35 more rows

Not equal

! means NOT

!= means NOT equal

Note:

If you want values that are NOT NA:
- filter(!is.na(variable))
- ~~filter(variable != NA)~~

Chaining `{dplyr}` functions

What if we wanted to filter then select a data set?

Select “island” and “species” of penguins over 4200 grams

Two step method 😐

penguins_big <- filter(penguins, body_mass_g > 4200)

penguins_big <- select(penguins_big, island, species)

Chaining `{dplyr}` functions

Alternatively, we could wrap select around filter ☹️

select(filter(penguins, body_mass_g > 4200), island, species)

# A tibble: 149 × 2
   island    species
   <fct>     <fct>  
 1 Torgersen Adelie 
 2 Torgersen Adelie 
 3 Torgersen Adelie 
 4 Torgersen Adelie 
 5 Dream     Adelie 
 6 Dream     Adelie 
 7 Dream     Adelie 
 8 Biscoe    Adelie 
 9 Biscoe    Adelie 
10 Torgersen Adelie 
# ℹ 139 more rows

Chaining `{dplyr}` functions

Piping functions 🎉

We could use the pipe function to help increase the readability of our code:
|>

Called the “pipe” operator and “piping functions”

filter(penguins, body_mass_g > 4200) |> 
  select(island, species)

# A tibble: 149 × 2
   island    species
   <fct>     <fct>  
 1 Torgersen Adelie 
 2 Torgersen Adelie 
 3 Torgersen Adelie 
 4 Torgersen Adelie 
 5 Dream     Adelie 
 6 Dream     Adelie 
 7 Dream     Adelie 
 8 Biscoe    Adelie 
 9 Biscoe    Adelie 
10 Torgersen Adelie 
# ℹ 139 more rows

Piping functions

Even better:

Generally when using the pipe, the first argument is the data set, which gets piped through the corresponding functions
So the code on the prior slide would be:

penguins |> 
  filter(body_mass_g > 4200) |> 
  select(island, species)

# A tibble: 149 × 2
   island    species
   <fct>     <fct>  
 1 Torgersen Adelie 
 2 Torgersen Adelie 
 3 Torgersen Adelie 
 4 Torgersen Adelie 
 5 Dream     Adelie 
 6 Dream     Adelie 
 7 Dream     Adelie 
 8 Biscoe    Adelie 
 9 Biscoe    Adelie 
10 Torgersen Adelie 
# ℹ 139 more rows

`mutate()`

Some new data

(prepost <- reads |> 
  filter(student_id != "All Students (Average)") |> 
  select(student_id, test_site, pre_test_score, post_test_score))

# A tibble: 45 × 4
   student_id test_site pre_test_score post_test_score
   <chr>      <chr>              <dbl>           <dbl>
 1 Virden 1   VIRDEN                43              92
 2 Virden 2   VIRDEN                46             104
 3 Virden 3   VIRDEN                39              75
 4 Virden 4   VIRDEN                35             115
 5 Virden 5   VIRDEN                46              85
 6 Virden 6   VIRDEN                35              91
 7 Virden 7   VIRDEN                40              96
 8 Virden 8   VIRDEN                39              74
 9 Virden 9   VIRDEN                40              90
10 Virden 10  VIRDEN                45              86
# ℹ 35 more rows

`mutate()`

Add a variable

prepost |> 
  mutate(gain = post_test_score - pre_test_score)

# A tibble: 45 × 5
   student_id test_site pre_test_score post_test_score  gain
   <chr>      <chr>              <dbl>           <dbl> <dbl>
 1 Virden 1   VIRDEN                43              92    49
 2 Virden 2   VIRDEN                46             104    58
 3 Virden 3   VIRDEN                39              75    36
 4 Virden 4   VIRDEN                35             115    80
 5 Virden 5   VIRDEN                46              85    39
 6 Virden 6   VIRDEN                35              91    56
 7 Virden 7   VIRDEN                40              96    56
 8 Virden 8   VIRDEN                39              74    35
 9 Virden 9   VIRDEN                40              90    50
10 Virden 10  VIRDEN                45              86    41
# ℹ 35 more rows

names(prepost)

[1] "student_id"      "test_site"       "pre_test_score"  "post_test_score"

Why do we not see the new gain column?

`mutate()`

Add multiple variables…

Use a variable you just created!

prepost <- prepost |>
  mutate(
    gain = post_test_score - pre_test_score,
      sum_tests = post_test_score + pre_test_score,
    pct_tests = (sum_tests/180) * 100
  )
prepost

# A tibble: 45 × 7
   student_id test_site pre_test_score post_test_score  gain sum_tests pct_tests
   <chr>      <chr>              <dbl>           <dbl> <dbl>     <dbl>     <dbl>
 1 Virden 1   VIRDEN                43              92    49       135      75  
 2 Virden 2   VIRDEN                46             104    58       150      83.3
 3 Virden 3   VIRDEN                39              75    36       114      63.3
 4 Virden 4   VIRDEN                35             115    80       150      83.3
 5 Virden 5   VIRDEN                46              85    39       131      72.8
 6 Virden 6   VIRDEN                35              91    56       126      70  
 7 Virden 7   VIRDEN                40              96    56       136      75.6
 8 Virden 8   VIRDEN                39              74    35       113      62.8
 9 Virden 9   VIRDEN                40              90    50       130      72.2
10 Virden 10  VIRDEN                45              86    41       131      72.8
# ℹ 35 more rows

`mutate()`

Change a variable

prepost <- prepost |> 
  mutate(student_id = parse_number(student_id))
prepost

# A tibble: 45 × 7
   student_id test_site pre_test_score post_test_score  gain sum_tests pct_tests
        <dbl> <chr>              <dbl>           <dbl> <dbl>     <dbl>     <dbl>
 1          1 VIRDEN                43              92    49       135      75  
 2          2 VIRDEN                46             104    58       150      83.3
 3          3 VIRDEN                39              75    36       114      63.3
 4          4 VIRDEN                35             115    80       150      83.3
 5          5 VIRDEN                46              85    39       131      72.8
 6          6 VIRDEN                35              91    56       126      70  
 7          7 VIRDEN                40              96    56       136      75.6
 8          8 VIRDEN                39              74    35       113      62.8
 9          9 VIRDEN                40              90    50       130      72.2
10         10 VIRDEN                45              86    41       131      72.8
# ℹ 35 more rows

`arrange()`

Order the data

Order by “gain”, ascending [default]

prepost |> 
  arrange(gain)

# A tibble: 45 × 7
   student_id test_site pre_test_score post_test_score  gain sum_tests pct_tests
        <dbl> <chr>              <dbl>           <dbl> <dbl>     <dbl>     <dbl>
 1         12 JONES                 27              32     5        59      32.8
 2          6 WESTSIDE              57              82    25       139      77.2
 3          8 VIRDEN                39              74    35       113      62.8
 4          3 VIRDEN                39              75    36       114      63.3
 5          5 VIRDEN                46              85    39       131      72.8
 6         10 VIRDEN                45              86    41       131      72.8
 7          3 JONES                 54              95    41       149      82.8
 8          1 VIRDEN                43              92    49       135      75  
 9         13 JONES                 49              98    49       147      81.7
10         13 WESTSIDE              45              94    49       139      77.2
# ℹ 35 more rows

`arrange()`

Order the data

Order by “pre_test_score” and “gain”, ascending

prepost |> 
  arrange(pre_test_score, gain)

# A tibble: 45 × 7
   student_id test_site pre_test_score post_test_score  gain sum_tests pct_tests
        <dbl> <chr>              <dbl>           <dbl> <dbl>     <dbl>     <dbl>
 1         11 JONES                 24             108    84       132      73.3
 2          4 JONES                 25              99    74       124      68.9
 3         12 JONES                 27              32     5        59      32.8
 4         12 VIRDEN                31             102    71       133      73.9
 5         11 VIRDEN                32              91    59       123      68.3
 6         13 VIRDEN                33              86    53       119      66.1
 7          6 VIRDEN                35              91    56       126      70  
 8         15 VIRDEN                35             101    66       136      75.6
 9          9 JONES                 35             109    74       144      80  
10          4 VIRDEN                35             115    80       150      83.3
# ℹ 35 more rows

`arrange()`

Order the data

Order by “gain”, descending

prepost |> 
  arrange(desc(gain))

# A tibble: 45 × 7
   student_id test_site pre_test_score post_test_score  gain sum_tests pct_tests
        <dbl> <chr>              <dbl>           <dbl> <dbl>     <dbl>     <dbl>
 1         11 JONES                 24             108    84       132      73.3
 2          4 VIRDEN                35             115    80       150      83.3
 3          4 JONES                 25              99    74       124      68.9
 4          5 JONES                 36             110    74       146      81.1
 5          9 JONES                 35             109    74       144      80  
 6         12 VIRDEN                31             102    71       133      73.9
 7          1 JONES                 36             103    67       139      77.2
 8         15 VIRDEN                35             101    66       136      75.6
 9          6 JONES                 40             106    66       146      81.1
10          8 WESTSIDE              43             109    66       152      84.4
# ℹ 35 more rows

`arrange()`

Order the data

Order by “pre_test_score” and “gain”, descending

prepost |> 
  arrange(desc(pre_test_score), desc(gain))

# A tibble: 45 × 7
   student_id test_site pre_test_score post_test_score  gain sum_tests pct_tests
        <dbl> <chr>              <dbl>           <dbl> <dbl>     <dbl>     <dbl>
 1          6 WESTSIDE              57              82    25       139      77.2
 2          3 JONES                 54              95    41       149      82.8
 3          1 WESTSIDE              53             109    56       162      90  
 4          2 WESTSIDE              51             105    54       156      86.7
 5          3 WESTSIDE              51             105    54       156      86.7
 6          8 JONES                 50             101    51       151      83.9
 7          7 WESTSIDE              49             109    60       158      87.8
 8          7 JONES                 49             104    55       153      85  
 9         13 JONES                 49              98    49       147      81.7
10          5 WESTSIDE              46             106    60       152      84.4
# ℹ 35 more rows

`summarize()`

summarize() or summarise()

Here: compute the mean and standard deviation of the gain scores
What are we looking at?

prepost |> 
  summarize(mean_gain = mean(gain, na.rm = TRUE),
            sd_gain = sd(gain, na.rm = TRUE))

# A tibble: 1 × 2
  mean_gain sd_gain
      <dbl>   <dbl>
1      56.4    14.1

`group_by()`

`group_by()`

Conduct an operation for each level of a grouping factor

Here, compute the mean gain for each test site

prepost |> 
  group_by(test_site) |> 
  summarize(mean_gain = mean(gain))

# A tibble: 3 × 2
  test_site mean_gain
  <chr>         <dbl>
1 JONES          59  
2 VIRDEN         53.8
3 WESTSIDE       56.3

`group_by()`

When you use group_by() and summarize()

You will get a row for each unique group

mpg |> 
  group_by(cyl) |> 
  summarize(max_hwy = max(hwy))

# A tibble: 4 × 2
    cyl max_hwy
  <int>   <int>
1     4      44
2     5      29
3     6      29
4     8      26

If you have multiple grouping variables, you will get a row for each unique combination of groups

mpg |> 
  group_by(cyl, drv) |> 
  summarize(max_hwy = max(hwy))

# A tibble: 9 × 3
# Groups:   cyl [4]
    cyl drv   max_hwy
  <int> <chr>   <int>
1     4 4          28
2     4 f          44
3     5 f          29
4     6 4          25
5     6 f          29
6     6 r          26
7     8 4          23
8     8 f          25
9     8 r          26

Add summarized column to full data

Use mutate()!

prepost |> 
  group_by(test_site) |> 
  mutate(mean_gain = mean(gain))

# A tibble: 45 × 8
# Groups:   test_site [3]
   student_id test_site pre_test_score post_test_score  gain sum_tests pct_tests
        <dbl> <chr>              <dbl>           <dbl> <dbl>     <dbl>     <dbl>
 1          1 VIRDEN                43              92    49       135      75  
 2          2 VIRDEN                46             104    58       150      83.3
 3          3 VIRDEN                39              75    36       114      63.3
 4          4 VIRDEN                35             115    80       150      83.3
 5          5 VIRDEN                46              85    39       131      72.8
 6          6 VIRDEN                35              91    56       126      70  
 7          7 VIRDEN                40              96    56       136      75.6
 8          8 VIRDEN                39              74    35       113      62.8
 9          9 VIRDEN                40              90    50       130      72.2
10         10 VIRDEN                45              86    41       131      72.8
# ℹ 35 more rows
# ℹ 1 more variable: mean_gain <dbl>

Retain only summarized columns

Use summarize()!

prepost |> 
  group_by(test_site) |> 
  summarize(mean_gain = mean(gain))

# A tibble: 3 × 2
  test_site mean_gain
  <chr>         <dbl>
1 JONES          59  
2 VIRDEN         53.8
3 WESTSIDE       56.3

Pipe to `{ggplot}`!

prepost |> 
  group_by(test_site) |> 
  summarize(mean_gain = mean(gain)) |> 
  ggplot(aes(test_site, mean_gain)) + 
  geom_col(fill = "cornflowerblue", alpha = .7) +
  theme_minimal()

Summarizing and Plotting

Calculate mean post_test_score by test site

students_only |>
  group_by(test_site) |>
  summarize(mean = mean(post_test_score))

# A tibble: 3 × 2
  test_site  mean
  <chr>     <dbl>
1 JONES      97.8
2 VIRDEN     92.5
3 WESTSIDE  103.

Plot it

theme_set(theme_minimal(base_size = 18))

students_only |>
  group_by(test_site) |>
  summarize(mean = mean(post_test_score)) |>
  ggplot(aes(test_site, mean)) +
  geom_col(alpha = 0.8)

A couple more verbs

recode() and rename()

Recode with `ifelse()`

Recoding into dichotomous variable with ifelse()

ifelse(test, test == TRUE, test == FALSE)

ifelse(logical statement, do this if TRUE, else do this)

Recode with `ifelse()`

Recoding into dichotomous variable with ifelse()

Remember to pair with mutate()!

students_only |> 
  mutate(New_test_site = ifelse(test_site == "JONES", "district 1", "district 2"))

# A tibble: 45 × 26
   test_year      test_type test_site student_id pre_test_score pre_test_percent
   <chr>          <chr>     <chr>     <chr>               <dbl> <chr>           
 1 06/01/2016 12… YEAR END  VIRDEN    Virden 1               43 29%             
 2 06/01/2016 12… YEAR END  VIRDEN    Virden 2               46 31%             
 3 06/01/2016 12… YEAR END  VIRDEN    Virden 3               39 26%             
 4 06/01/2016 12… YEAR END  VIRDEN    Virden 4               35 23%             
 5 06/01/2016 12… YEAR END  VIRDEN    Virden 5               46 31%             
 6 06/01/2016 12… YEAR END  VIRDEN    Virden 6               35 23%             
 7 06/01/2016 12… YEAR END  VIRDEN    Virden 7               40 27%             
 8 06/01/2016 12… YEAR END  VIRDEN    Virden 8               39 26%             
 9 06/01/2016 12… YEAR END  VIRDEN    Virden 9               40 27%             
10 06/01/2016 12… YEAR END  VIRDEN    Virden 10              45 30%             
# ℹ 35 more rows
# ℹ 20 more variables: post_test_score <dbl>, post_test_percent <chr>,
#   percentage_change <chr>, unit_1_score <dbl>, unit_1_percent <chr>,
#   unit_2_score <dbl>, unit_2_percent <chr>, unit_3_score <dbl>,
#   unit_3_percent <chr>, unit_4_score <dbl>, unit_4_percent <chr>,
#   unit_5_6_score <dbl>, unit_5_6_percent <chr>, unit_7_score <dbl>,
#   unit_7_percent <chr>, unit_8_score <dbl>, unit_8_percent <chr>, …

Recode with `recode()`

Recoding into multinomial variable

recode(variable,
       old_category_1 = "new_category_1",
         old_category_2 = "new_category_2",
         old_category_3 = "new_category_3",
       ...
)

Recode with `recode()`

Recoding into multinomial variable

Remember to pair with mutate()

students_only |> 
  mutate(test_site = recode(test_site,
                            JONES = "Jones Middle",
                            VIRDEN = "Virden Elementary",
                            WESTSIDE = "Westside High"
                            )
        )

# A tibble: 45 × 25
   test_year      test_type test_site student_id pre_test_score pre_test_percent
   <chr>          <chr>     <chr>     <chr>               <dbl> <chr>           
 1 06/01/2016 12… YEAR END  Virden E… Virden 1               43 29%             
 2 06/01/2016 12… YEAR END  Virden E… Virden 2               46 31%             
 3 06/01/2016 12… YEAR END  Virden E… Virden 3               39 26%             
 4 06/01/2016 12… YEAR END  Virden E… Virden 4               35 23%             
 5 06/01/2016 12… YEAR END  Virden E… Virden 5               46 31%             
 6 06/01/2016 12… YEAR END  Virden E… Virden 6               35 23%             
 7 06/01/2016 12… YEAR END  Virden E… Virden 7               40 27%             
 8 06/01/2016 12… YEAR END  Virden E… Virden 8               39 26%             
 9 06/01/2016 12… YEAR END  Virden E… Virden 9               40 27%             
10 06/01/2016 12… YEAR END  Virden E… Virden 10              45 30%             
# ℹ 35 more rows
# ℹ 19 more variables: post_test_score <dbl>, post_test_percent <chr>,
#   percentage_change <chr>, unit_1_score <dbl>, unit_1_percent <chr>,
#   unit_2_score <dbl>, unit_2_percent <chr>, unit_3_score <dbl>,
#   unit_3_percent <chr>, unit_4_score <dbl>, unit_4_percent <chr>,
#   unit_5_6_score <dbl>, unit_5_6_percent <chr>, unit_7_score <dbl>,
#   unit_7_percent <chr>, unit_8_score <dbl>, unit_8_percent <chr>, …

Recode with `recode()`

Remember to use backticks `` for special characters

mtcars |> 
  mutate(cyl_str = recode(cyl,
                      `4` = "cylinders4",
                      `6` = "cylinders6",
                      `8` = "cylinders8"
                      )
         ) |> 
  as_tibble()

# A tibble: 32 × 12
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb cyl_str   
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>     
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4 cylinders6
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4 cylinders6
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1 cylinders4
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1 cylinders6
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2 cylinders8
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1 cylinders6
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4 cylinders8
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2 cylinders4
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2 cylinders4
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4 cylinders6
# ℹ 22 more rows

Recode with `case_when()`

Recoding across several variables, or if you find yourself using nested ifelse() statements

Remember to pair with mutate()

library(palmerpenguins)

penguins |> 
  mutate(species_big = 
           case_when(
             body_mass_g > 4500 & species == "Adelie" ~ "adelie_big",
             body_mass_g > 4600 & species == "Chinstrap" ~ "chinstrap_big",
             body_mass_g > 6000 & species == "Gentoo" ~ "gentoo_big",
             .default = "other"
           ))

.default equivalent to all else

`rename()`

rename(new_name = old_name)

NOT paired with mutate()

penguins |>
  rename(
    home = island,
    bill_length = bill_length_mm
  )

# A tibble: 344 × 8
   species home    bill_length bill_depth_mm flipper_length_mm body_mass_g sex  
   <fct>   <fct>         <dbl>         <dbl>             <int>       <int> <fct>
 1 Adelie  Torger…        39.1          18.7               181        3750 male 
 2 Adelie  Torger…        39.5          17.4               186        3800 fema…
 3 Adelie  Torger…        40.3          18                 195        3250 fema…
 4 Adelie  Torger…        NA            NA                  NA          NA <NA> 
 5 Adelie  Torger…        36.7          19.3               193        3450 fema…
 6 Adelie  Torger…        39.3          20.6               190        3650 male 
 7 Adelie  Torger…        38.9          17.8               181        3625 fema…
 8 Adelie  Torger…        39.2          19.6               195        4675 male 
 9 Adelie  Torger…        34.1          18.1               193        3475 <NA> 
10 Adelie  Torger…        42            20.2               190        4250 <NA> 
# ℹ 334 more rows
# ℹ 1 more variable: year <int>

rename with `select()`

penguins |> 
  select(species, home = island, bill_length = bill_length_mm, year)

# A tibble: 344 × 4
   species home      bill_length  year
   <fct>   <fct>           <dbl> <int>
 1 Adelie  Torgersen        39.1  2007
 2 Adelie  Torgersen        39.5  2007
 3 Adelie  Torgersen        40.3  2007
 4 Adelie  Torgersen        NA    2007
 5 Adelie  Torgersen        36.7  2007
 6 Adelie  Torgersen        39.3  2007
 7 Adelie  Torgersen        38.9  2007
 8 Adelie  Torgersen        39.2  2007
 9 Adelie  Torgersen        34.1  2007
10 Adelie  Torgersen        42    2007
# ℹ 334 more rows

Final notes on `{dplyr}`

We could spend weeks on this
We’ll be using it all term long
Verbs can help you gain fluency
There are also conditional and all-inclusive versions of mutate, select, and summarize
- across()

Next time

Before next class

Final Project
- Final Project: Outline
Reading
- Bryan (2017)
Activities
- Download GitKraken
- Watch What is a Git repository?
- Watch What is a remote repository?
Homework
- Homework 4

Week 4: Data Wrangling

Share!

{datapasta}

Housekeeping

Homework Review

Data Wrangling with {dplyr}

Agenda

Providing grammar for:

Providing grammar for:

{dplyr}

{dplyr}

{dplyr}

Arguments

select()

select()

Examples

select president species and island and body_mass_g

Negation -

Negation -

Indexing

Use multiple methods

Use select() to order columns

Helper functions

Use select to order columns

starts_with()

ends_with()

contains()

all_of()

all_of()

any_of()

any_of()

Mix select methods and helpers

Mix select methods and helpers

relocate()

relocate()

relocate()

relocate()

filter()

filter()

filter()

filter() numeric

filter character

Gentoo starting after 2007

filter() note

more filter()

more filter()

Let’s remove these aggregate rows

Not equal

Chaining {dplyr} functions

Chaining {dplyr} functions

Chaining {dplyr} functions

Piping functions

mutate()

Some new data

mutate()

mutate()

mutate()

arrange()

arrange()

arrange()

arrange()

arrange()

summarize()

group_by()

group_by()

group_by()

Add summarized column to full data

Retain only summarized columns

Pipe to {ggplot}!

Summarizing and Plotting

Plot it

A couple more verbs

Recode with ifelse()

Recode with ifelse()

Recode with recode()

Recode with recode()

Recode with recode()

Recode with case_when()

rename()

rename with select()

`{datapasta}`

Data Wrangling with `{dplyr}`

`{dplyr}`

`{dplyr}`

`select()`

`select()`

`select` president species and island and body_mass_g

Negation `-`

Negation `-`

Use `select()` to order columns

Use `select` to order columns

`starts_with()`

`ends_with()`

`contains()`

`all_of()`

`all_of()`

`any_of()`

`any_of()`

Mix `select` methods and helpers

Mix `select` methods and helpers

`relocate()`

`relocate()`

`relocate()`

`relocate()`

`filter()`

`filter()`

`filter()`

`filter()` numeric

`filter` character

`filter()` note

more `filter()`

more `filter()`

Chaining `{dplyr}` functions

Chaining `{dplyr}` functions

Chaining `{dplyr}` functions

`mutate()`

`mutate()`

`mutate()`

`mutate()`

`arrange()`

`arrange()`

`arrange()`

`arrange()`

`arrange()`

`summarize()`

`group_by()`

`group_by()`

`group_by()`

Pipe to `{ggplot}`!

Recode with `ifelse()`

Recode with `ifelse()`

Recode with `recode()`

Recode with `recode()`

Recode with `recode()`

Recode with `case_when()`

`rename()`

rename with `select()`

Final notes on `{dplyr}`