Week 4: Data Wrangling

{dplyr}

Joe Nese

University of Oregon
Fall 2025

Share!

{datapasta}

  • Copy and paste data to and from R
  • VERY handy!
  • Good for reprex

demo

Housekeeping

  • Final Project Outline
    • Due next week (10/29)

The outline should include a description of the data to be used, a discussion of what preparatory work will need to be done, and how the requirements of the final project will be met. The outline is intended to be low-stakes and is primarily designed to be a means for you to obtain feedback on the feasibility of the project and areas to consider.

Homework Review

Homeworks 2 & 3

Data Wrangling with {dplyr}

Week 4

Agenda

  • Introduce common {dplyr} functions
    • piping {dplyr} functions
    • helper functions
  • Introduce complimentary functions
    • ifelse()
    • case_when()

Learning Objectives

  • Understand basic data wrangling and transformations
    • select()
    • filter()
    • arrange()
    • group_by()
    • summarize()
    • recode() & rename()
  • Be able to pipe these functions to efficiently wrangle data

Providing grammar for:

  • Graphics
    • {ggplot2}
  • Data manipulations
    • {dplyr}
    • {tidyr}
  • Expanding area of specialized topics
    • {lubridate}
    • {glue}
    • {tidymodels}
  • Many more…

Providing grammar for:

  • Graphics
    • {ggplot2}
  • Data manipulations
    • {dplyr}
    • {tidyr}
  • Expanding area of specialized topics
    • {lubridate}
    • {glue}
    • {tidymodels}
  • Many more…

{dplyr}

{dplyr}

A grammar for data wrangling (verbs!)

Take a couple minutes with your neighbor to guess what the following functions do:

  • select()
  • filter()
  • mutate()
  • arrange()
  • summarize()
  • group_by()
02:00

{dplyr}

A grammar for data wrangling (verbs!)

select()

a subset of columns

filter()

a subset of rows

mutate()

add a new column or modify an existing column

arrange()

sort rows in ascending/descending order

summarize()

a new column created according to other columns for example: mean() or sd() + often used with…

group_by()

grouping categories in column(s)

Arguments

  • {dplyr} always takes a data frame (tibble) as the first argument

    • just like {ggplot}
  • subsequent functions tell {dplyr} what to do with the data frame

  • each function applied returns the modified data frame

select()

select()

choose columns

data.frame |> 
  select(a, c)

Examples

Let’s go back to the `penguins’ dataset

library(tidyverse)
#install.packages("palmerpenguins")
library(palmerpenguins)

penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

select president species and island and body_mass_g

select(penguins, species, island, body_mass_g)

OR more typically

penguins |>
    select(name, party, body_mass_g)
Error in `select()`:
! Can't select columns that don't exist.
✖ Column `name` doesn't exist.

Negation -

Don’t select variable(s)

penguins |>
    select(-year)
# A tibble: 344 × 7
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 1 more variable: sex <fct>

Negation -

Don’t select variable(s)

penguins |>
    select(-bill_length_mm, -bill_depth_mm)

OR maybe better

penguins |>
    select(-c(bill_length_mm, bill_depth_mm))
# A tibble: 344 × 6
   species island    flipper_length_mm body_mass_g sex     year
   <fct>   <fct>                 <int>       <int> <fct>  <int>
 1 Adelie  Torgersen               181        3750 male    2007
 2 Adelie  Torgersen               186        3800 female  2007
 3 Adelie  Torgersen               195        3250 female  2007
 4 Adelie  Torgersen                NA          NA <NA>    2007
 5 Adelie  Torgersen               193        3450 female  2007
 6 Adelie  Torgersen               190        3650 male    2007
 7 Adelie  Torgersen               181        3625 female  2007
 8 Adelie  Torgersen               195        4675 male    2007
 9 Adelie  Torgersen               193        3475 <NA>    2007
10 Adelie  Torgersen               190        4250 <NA>    2007
# ℹ 334 more rows

Indexing

Less preferable because variable orders change

penguins |>
    select(1:3)
# A tibble: 344 × 3
   species island    bill_length_mm
   <fct>   <fct>              <dbl>
 1 Adelie  Torgersen           39.1
 2 Adelie  Torgersen           39.5
 3 Adelie  Torgersen           40.3
 4 Adelie  Torgersen           NA  
 5 Adelie  Torgersen           36.7
 6 Adelie  Torgersen           39.3
 7 Adelie  Torgersen           38.9
 8 Adelie  Torgersen           39.2
 9 Adelie  Torgersen           34.1
10 Adelie  Torgersen           42  
# ℹ 334 more rows

Use multiple methods

penguins |>
    select(species, 3, 5)
# A tibble: 344 × 3
   species bill_length_mm flipper_length_mm
   <fct>            <dbl>             <int>
 1 Adelie            39.1               181
 2 Adelie            39.5               186
 3 Adelie            40.3               195
 4 Adelie            NA                  NA
 5 Adelie            36.7               193
 6 Adelie            39.3               190
 7 Adelie            38.9               181
 8 Adelie            39.2               195
 9 Adelie            34.1               193
10 Adelie            42                 190
# ℹ 334 more rows

Use select() to order columns

select() helper functions

  • everything()
  • starts_with()
  • ends_with()
  • contains()
  • all_of()
  • any_of()

Helper functions

First: this code may not work for you - why?

reads <- read_csv(here("data", "Project_Reads_Scores.csv")) |> 
  clean_names()
reads
# A tibble: 48 × 25
   test_year      test_type test_site student_id pre_test_score pre_test_percent
   <chr>          <chr>     <chr>     <chr>               <dbl> <chr>           
 1 06/01/2016 12… YEAR END  VIRDEN    Virden 1               43 29%             
 2 06/01/2016 12… YEAR END  VIRDEN    Virden 2               46 31%             
 3 06/01/2016 12… YEAR END  VIRDEN    Virden 3               39 26%             
 4 06/01/2016 12… YEAR END  VIRDEN    Virden 4               35 23%             
 5 06/01/2016 12… YEAR END  VIRDEN    Virden 5               46 31%             
 6 06/01/2016 12… YEAR END  VIRDEN    Virden 6               35 23%             
 7 06/01/2016 12… YEAR END  VIRDEN    Virden 7               40 27%             
 8 06/01/2016 12… YEAR END  VIRDEN    Virden 8               39 26%             
 9 06/01/2016 12… YEAR END  VIRDEN    Virden 9               40 27%             
10 06/01/2016 12… YEAR END  VIRDEN    Virden 10              45 30%             
# ℹ 38 more rows
# ℹ 19 more variables: post_test_score <dbl>, post_test_percent <chr>,
#   percentage_change <chr>, unit_1_score <dbl>, unit_1_percent <chr>,
#   unit_2_score <dbl>, unit_2_percent <chr>, unit_3_score <dbl>,
#   unit_3_percent <chr>, unit_4_score <dbl>, unit_4_percent <chr>,
#   unit_5_6_score <dbl>, unit_5_6_percent <chr>, unit_7_score <dbl>,
#   unit_7_percent <chr>, unit_8_score <dbl>, unit_8_percent <chr>, …

Use select to order columns

reads |>
    select(student_id, test_site, test_type, test_year, everything())
# A tibble: 48 × 25
   student_id test_site test_type test_year      pre_test_score pre_test_percent
   <chr>      <chr>     <chr>     <chr>                   <dbl> <chr>           
 1 Virden 1   VIRDEN    YEAR END  06/01/2016 12…             43 29%             
 2 Virden 2   VIRDEN    YEAR END  06/01/2016 12…             46 31%             
 3 Virden 3   VIRDEN    YEAR END  06/01/2016 12…             39 26%             
 4 Virden 4   VIRDEN    YEAR END  06/01/2016 12…             35 23%             
 5 Virden 5   VIRDEN    YEAR END  06/01/2016 12…             46 31%             
 6 Virden 6   VIRDEN    YEAR END  06/01/2016 12…             35 23%             
 7 Virden 7   VIRDEN    YEAR END  06/01/2016 12…             40 27%             
 8 Virden 8   VIRDEN    YEAR END  06/01/2016 12…             39 26%             
 9 Virden 9   VIRDEN    YEAR END  06/01/2016 12…             40 27%             
10 Virden 10  VIRDEN    YEAR END  06/01/2016 12…             45 30%             
# ℹ 38 more rows
# ℹ 19 more variables: post_test_score <dbl>, post_test_percent <chr>,
#   percentage_change <chr>, unit_1_score <dbl>, unit_1_percent <chr>,
#   unit_2_score <dbl>, unit_2_percent <chr>, unit_3_score <dbl>,
#   unit_3_percent <chr>, unit_4_score <dbl>, unit_4_percent <chr>,
#   unit_5_6_score <dbl>, unit_5_6_percent <chr>, unit_7_score <dbl>,
#   unit_7_percent <chr>, unit_8_score <dbl>, unit_8_percent <chr>, …

everything() is a helper verb that includes all column names

starts_with()

reads |> 
    select(starts_with("test"))
# A tibble: 48 × 3
   test_year              test_type test_site
   <chr>                  <chr>     <chr>    
 1 06/01/2016 12:00:00 AM YEAR END  VIRDEN   
 2 06/01/2016 12:00:00 AM YEAR END  VIRDEN   
 3 06/01/2016 12:00:00 AM YEAR END  VIRDEN   
 4 06/01/2016 12:00:00 AM YEAR END  VIRDEN   
 5 06/01/2016 12:00:00 AM YEAR END  VIRDEN   
 6 06/01/2016 12:00:00 AM YEAR END  VIRDEN   
 7 06/01/2016 12:00:00 AM YEAR END  VIRDEN   
 8 06/01/2016 12:00:00 AM YEAR END  VIRDEN   
 9 06/01/2016 12:00:00 AM YEAR END  VIRDEN   
10 06/01/2016 12:00:00 AM YEAR END  VIRDEN   
# ℹ 38 more rows

ends_with()

reads |> 
    select(ends_with("score"))
# A tibble: 48 × 10
   pre_test_score post_test_score unit_1_score unit_2_score unit_3_score
            <dbl>           <dbl>        <dbl>        <dbl>        <dbl>
 1             43              92            3            4            6
 2             46             104            5            5            6
 3             39              75            4            4            6
 4             35             115            4            4            6
 5             46              85            2            5            6
 6             35              91            5            5            7
 7             40              96            5            5            6
 8             39              74            4            5            5
 9             40              90            6            4            5
10             45              86            4            5            5
# ℹ 38 more rows
# ℹ 5 more variables: unit_4_score <dbl>, unit_5_6_score <dbl>,
#   unit_7_score <dbl>, unit_8_score <dbl>, total_score <dbl>

contains()

reads |> 
    select(contains("test"))
# A tibble: 48 × 7
   test_year test_type test_site pre_test_score pre_test_percent post_test_score
   <chr>     <chr>     <chr>              <dbl> <chr>                      <dbl>
 1 06/01/20… YEAR END  VIRDEN                43 29%                           92
 2 06/01/20… YEAR END  VIRDEN                46 31%                          104
 3 06/01/20… YEAR END  VIRDEN                39 26%                           75
 4 06/01/20… YEAR END  VIRDEN                35 23%                          115
 5 06/01/20… YEAR END  VIRDEN                46 31%                           85
 6 06/01/20… YEAR END  VIRDEN                35 23%                           91
 7 06/01/20… YEAR END  VIRDEN                40 27%                           96
 8 06/01/20… YEAR END  VIRDEN                39 26%                           74
 9 06/01/20… YEAR END  VIRDEN                40 27%                           90
10 06/01/20… YEAR END  VIRDEN                45 30%                           86
# ℹ 38 more rows
# ℹ 1 more variable: post_test_percent <chr>

all_of()

vars <- c("pre_test_score", "pre_test_percent", "post_test_score", "post_test_percent")

reads |> 
    select(all_of(vars))
# A tibble: 48 × 4
   pre_test_score pre_test_percent post_test_score post_test_percent
            <dbl> <chr>                      <dbl> <chr>            
 1             43 29%                           92 61%              
 2             46 31%                          104 69%              
 3             39 26%                           75 50%              
 4             35 23%                          115 77%              
 5             46 31%                           85 57%              
 6             35 23%                           91 61%              
 7             40 27%                           96 64%              
 8             39 26%                           74 49%              
 9             40 27%                           90 60%              
10             45 30%                           86 57%              
# ℹ 38 more rows

all_of()

vars <- c("pre_test_score", "pre_test_percent", "post_test_score", "post_test_percent")

mpg |> 
    select(all_of(vars))
Error in `select()`:
ℹ In argument: `all_of(vars)`.
Caused by error in `all_of()`:
! Can't subset elements that don't exist.
✖ Elements `pre_test_score`, `pre_test_percent`, `post_test_score`, and `post_test_percent` don't exist.

any_of()

vars <- c("pre_test_score", "pre_test_percent", "post_test_score", "post_test_percent")

reads |> 
    select(any_of(vars))
# A tibble: 48 × 4
   pre_test_score pre_test_percent post_test_score post_test_percent
            <dbl> <chr>                      <dbl> <chr>            
 1             43 29%                           92 61%              
 2             46 31%                          104 69%              
 3             39 26%                           75 50%              
 4             35 23%                          115 77%              
 5             46 31%                           85 57%              
 6             35 23%                           91 61%              
 7             40 27%                           96 64%              
 8             39 26%                           74 49%              
 9             40 27%                           90 60%              
10             45 30%                           86 57%              
# ℹ 38 more rows

any_of()

vars <- c("pre_test_score", "pre_test_percent", "post_test_score", "post_test_percent")

mpg |> 
    select(any_of(vars))
# A tibble: 234 × 0

Mix select methods and helpers

You can mix types and helper functions

reads |> 
    select(student_id, 1, starts_with("total"))
# A tibble: 48 × 4
   student_id test_year              total_score total_percent_correct
   <chr>      <chr>                        <dbl> <chr>                
 1 Virden 1   06/01/2016 12:00:00 AM         207 41%                  
 2 Virden 2   06/01/2016 12:00:00 AM         224 45%                  
 3 Virden 3   06/01/2016 12:00:00 AM         193 39%                  
 4 Virden 4   06/01/2016 12:00:00 AM         223 45%                  
 5 Virden 5   06/01/2016 12:00:00 AM         198 40%                  
 6 Virden 6   06/01/2016 12:00:00 AM         210 42%                  
 7 Virden 7   06/01/2016 12:00:00 AM         211 42%                  
 8 Virden 8   06/01/2016 12:00:00 AM         180 36%                  
 9 Virden 9   06/01/2016 12:00:00 AM         206 41%                  
10 Virden 10  06/01/2016 12:00:00 AM         206 41%                  
# ℹ 38 more rows

Mix select methods and helpers

You can also use select to rearrange your columns

reads |> 
    select(student_id, 1, starts_with("total"), everything())
# A tibble: 48 × 25
   student_id test_year    total_score total_percent_correct test_type test_site
   <chr>      <chr>              <dbl> <chr>                 <chr>     <chr>    
 1 Virden 1   06/01/2016 …         207 41%                   YEAR END  VIRDEN   
 2 Virden 2   06/01/2016 …         224 45%                   YEAR END  VIRDEN   
 3 Virden 3   06/01/2016 …         193 39%                   YEAR END  VIRDEN   
 4 Virden 4   06/01/2016 …         223 45%                   YEAR END  VIRDEN   
 5 Virden 5   06/01/2016 …         198 40%                   YEAR END  VIRDEN   
 6 Virden 6   06/01/2016 …         210 42%                   YEAR END  VIRDEN   
 7 Virden 7   06/01/2016 …         211 42%                   YEAR END  VIRDEN   
 8 Virden 8   06/01/2016 …         180 36%                   YEAR END  VIRDEN   
 9 Virden 9   06/01/2016 …         206 41%                   YEAR END  VIRDEN   
10 Virden 10  06/01/2016 …         206 41%                   YEAR END  VIRDEN   
# ℹ 38 more rows
# ℹ 19 more variables: pre_test_score <dbl>, pre_test_percent <chr>,
#   post_test_score <dbl>, post_test_percent <chr>, percentage_change <chr>,
#   unit_1_score <dbl>, unit_1_percent <chr>, unit_2_score <dbl>,
#   unit_2_percent <chr>, unit_3_score <dbl>, unit_3_percent <chr>,
#   unit_4_score <dbl>, unit_4_percent <chr>, unit_5_6_score <dbl>,
#   unit_5_6_percent <chr>, unit_7_score <dbl>, unit_7_percent <chr>, …

relocate()

relocate(.data, … .before = NULL, .after = NULL)

.data = dataframe/tibble

… = columns to move

.before = variable to move before

.after = variable to move after

  • supplying neither .before nor .after move columns to the left-hand side
  • specifying both is an error

relocate()

penguins 
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
penguins |> 
  relocate(year)
# A tibble: 344 × 8
    year species island    bill_length_mm bill_depth_mm flipper_length_mm
   <int> <fct>   <fct>              <dbl>         <dbl>             <int>
 1  2007 Adelie  Torgersen           39.1          18.7               181
 2  2007 Adelie  Torgersen           39.5          17.4               186
 3  2007 Adelie  Torgersen           40.3          18                 195
 4  2007 Adelie  Torgersen           NA            NA                  NA
 5  2007 Adelie  Torgersen           36.7          19.3               193
 6  2007 Adelie  Torgersen           39.3          20.6               190
 7  2007 Adelie  Torgersen           38.9          17.8               181
 8  2007 Adelie  Torgersen           39.2          19.6               195
 9  2007 Adelie  Torgersen           34.1          18.1               193
10  2007 Adelie  Torgersen           42            20.2               190
# ℹ 334 more rows
# ℹ 2 more variables: body_mass_g <int>, sex <fct>

relocate()

penguins |> 
  relocate(year, .after = species)
# A tibble: 344 × 8
   species  year island    bill_length_mm bill_depth_mm flipper_length_mm
   <fct>   <int> <fct>              <dbl>         <dbl>             <int>
 1 Adelie   2007 Torgersen           39.1          18.7               181
 2 Adelie   2007 Torgersen           39.5          17.4               186
 3 Adelie   2007 Torgersen           40.3          18                 195
 4 Adelie   2007 Torgersen           NA            NA                  NA
 5 Adelie   2007 Torgersen           36.7          19.3               193
 6 Adelie   2007 Torgersen           39.3          20.6               190
 7 Adelie   2007 Torgersen           38.9          17.8               181
 8 Adelie   2007 Torgersen           39.2          19.6               195
 9 Adelie   2007 Torgersen           34.1          18.1               193
10 Adelie   2007 Torgersen           42            20.2               190
# ℹ 334 more rows
# ℹ 2 more variables: body_mass_g <int>, sex <fct>

relocate()

penguins |> 
  relocate(year, .before = bill_length_mm)
# A tibble: 344 × 8
   species island     year bill_length_mm bill_depth_mm flipper_length_mm
   <fct>   <fct>     <int>          <dbl>         <dbl>             <int>
 1 Adelie  Torgersen  2007           39.1          18.7               181
 2 Adelie  Torgersen  2007           39.5          17.4               186
 3 Adelie  Torgersen  2007           40.3          18                 195
 4 Adelie  Torgersen  2007           NA            NA                  NA
 5 Adelie  Torgersen  2007           36.7          19.3               193
 6 Adelie  Torgersen  2007           39.3          20.6               190
 7 Adelie  Torgersen  2007           38.9          17.8               181
 8 Adelie  Torgersen  2007           39.2          19.6               195
 9 Adelie  Torgersen  2007           34.1          18.1               193
10 Adelie  Torgersen  2007           42            20.2               190
# ℹ 334 more rows
# ℹ 2 more variables: body_mass_g <int>, sex <fct>

filter()

filter()

choose rows that satisfy your condition (== TRUE)

df |> 
  filter(
    type == "otter",
    site == "bay"
  )

filter()

Complete set of boolean operations

  • x is the left-hand circle
  • y is the right-hand circle
  • the shaded region show which parts each operator selects
    • xor == x or y except where they overlap
==
equal
!
negates
!=
NOT equal
&
and
|
or

filter() numeric

mpg |> 
  filter(cyl == 4)
# A tibble: 81 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 6 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
 7 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
 8 audi         a4 quattro   2    2008     4 auto… 4        19    27 p     comp…
 9 chevrolet    malibu       2.4  1999     4 auto… f        19    27 r     mids…
10 chevrolet    malibu       2.4  2008     4 auto… f        22    30 r     mids…
# ℹ 71 more rows
mpg |> 
  filter(hwy >= 22)
# A tibble: 143 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 133 more rows

filter character

use == NOT =

penguins |>
    filter(species == "Gentoo")
# A tibble: 124 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           46.1          13.2               211        4500
 2 Gentoo  Biscoe           50            16.3               230        5700
 3 Gentoo  Biscoe           48.7          14.1               210        4450
 4 Gentoo  Biscoe           50            15.2               218        5700
 5 Gentoo  Biscoe           47.6          14.5               215        5400
 6 Gentoo  Biscoe           46.5          13.5               210        4550
 7 Gentoo  Biscoe           45.4          14.6               211        4800
 8 Gentoo  Biscoe           46.7          15.3               219        5200
 9 Gentoo  Biscoe           43.3          13.4               209        4400
10 Gentoo  Biscoe           46.8          15.4               215        5150
# ℹ 114 more rows
# ℹ 2 more variables: sex <fct>, year <int>
penguins |>
    filter(species = "Gentoo")
Error in `filter()`:
! We detected a named input.
ℹ This usually means that you've used `=` instead of `==`.
ℹ Did you mean `species == "Gentoo"`?

Gentoo starting after 2007

, is equivalent to &

penguins |> 
  filter(
    species == "Gentoo" &
    start > "2007"
  )
Error in `filter()`:
ℹ In argument: `species == "Gentoo" & start > "2007"`.
Caused by error in `start > "2007"`:
! comparison (>) is possible only for atomic and list types
penguins |> 
  filter(
    species == "Gentoo",
    start > "2007"
  )
Error in `filter()`:
ℹ In argument: `start > "2007"`.
Caused by error in `start > "2007"`:
! comparison (>) is possible only for atomic and list types

filter() note

Can’t write:

penguins |> 
  filter(island == "Briscoe" | "Dream")
Error in `filter()`:
ℹ In argument: `island == "Briscoe" | "Dream"`.
Caused by error in `island == "Briscoe" | "Dream"`:
! operations are possible only for numeric, logical or complex types

Must be:

penguins |> 
  filter(island == "Briscoe" | island == "Dream")
# A tibble: 124 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Adelie  Dream            39.5          16.7               178        3250
 2 Adelie  Dream            37.2          18.1               178        3900
 3 Adelie  Dream            39.5          17.8               188        3300
 4 Adelie  Dream            40.9          18.9               184        3900
 5 Adelie  Dream            36.4          17                 195        3325
 6 Adelie  Dream            39.2          21.1               196        4150
 7 Adelie  Dream            38.8          20                 190        3950
 8 Adelie  Dream            42.2          18.5               180        3550
 9 Adelie  Dream            37.6          19.3               181        3300
10 Adelie  Dream            39.8          19.1               184        4650
# ℹ 114 more rows
# ℹ 2 more variables: sex <fct>, year <int>

more filter()

reads
# A tibble: 48 × 25
   test_year      test_type test_site student_id pre_test_score pre_test_percent
   <chr>          <chr>     <chr>     <chr>               <dbl> <chr>           
 1 06/01/2016 12… YEAR END  VIRDEN    Virden 1               43 29%             
 2 06/01/2016 12… YEAR END  VIRDEN    Virden 2               46 31%             
 3 06/01/2016 12… YEAR END  VIRDEN    Virden 3               39 26%             
 4 06/01/2016 12… YEAR END  VIRDEN    Virden 4               35 23%             
 5 06/01/2016 12… YEAR END  VIRDEN    Virden 5               46 31%             
 6 06/01/2016 12… YEAR END  VIRDEN    Virden 6               35 23%             
 7 06/01/2016 12… YEAR END  VIRDEN    Virden 7               40 27%             
 8 06/01/2016 12… YEAR END  VIRDEN    Virden 8               39 26%             
 9 06/01/2016 12… YEAR END  VIRDEN    Virden 9               40 27%             
10 06/01/2016 12… YEAR END  VIRDEN    Virden 10              45 30%             
# ℹ 38 more rows
# ℹ 19 more variables: post_test_score <dbl>, post_test_percent <chr>,
#   percentage_change <chr>, unit_1_score <dbl>, unit_1_percent <chr>,
#   unit_2_score <dbl>, unit_2_percent <chr>, unit_3_score <dbl>,
#   unit_3_percent <chr>, unit_4_score <dbl>, unit_4_percent <chr>,
#   unit_5_6_score <dbl>, unit_5_6_percent <chr>, unit_7_score <dbl>,
#   unit_7_percent <chr>, unit_8_score <dbl>, unit_8_percent <chr>, …

more filter()

Anomaly in the data

reads |>
 count(student_id)
# A tibble: 46 × 2
   student_id                 n
   <chr>                  <int>
 1 All Students (Average)     3
 2 Jones 1                    1
 3 Jones 10                   1
 4 Jones 11                   1
 5 Jones 12                   1
 6 Jones 13                   1
 7 Jones 14                   1
 8 Jones 15                   1
 9 Jones 2                    1
10 Jones 3                    1
# ℹ 36 more rows

Let’s remove these aggregate rows

students_only <- reads |>
  filter(student_id != "All Students (Average)") 

students_only |> 
  count(student_id)
# A tibble: 45 × 2
   student_id     n
   <chr>      <int>
 1 Jones 1        1
 2 Jones 10       1
 3 Jones 11       1
 4 Jones 12       1
 5 Jones 13       1
 6 Jones 14       1
 7 Jones 15       1
 8 Jones 2        1
 9 Jones 3        1
10 Jones 4        1
# ℹ 35 more rows

Not equal

! means NOT

!= means NOT equal

Note:

  • If you want values that are NOT NA:
    • filter(!is.na(variable))
    • filter(variable != NA)

Chaining {dplyr} functions

What if we wanted to filter then select a data set?

Select “island” and “species” of penguins over 4200 grams

  1. Two step method 😐
penguins_big <- filter(penguins, body_mass_g > 4200)

penguins_big <- select(penguins_big, island, species)

Chaining {dplyr} functions

  1. Alternatively, we could wrap select around filter ☹️
select(filter(penguins, body_mass_g > 4200), island, species)
# A tibble: 149 × 2
   island    species
   <fct>     <fct>  
 1 Torgersen Adelie 
 2 Torgersen Adelie 
 3 Torgersen Adelie 
 4 Torgersen Adelie 
 5 Dream     Adelie 
 6 Dream     Adelie 
 7 Dream     Adelie 
 8 Biscoe    Adelie 
 9 Biscoe    Adelie 
10 Torgersen Adelie 
# ℹ 139 more rows

Chaining {dplyr} functions

  1. Piping functions 🎉

We could use the pipe function to help increase the readability of our code:
|>

Called the “pipe” operator and “piping functions”

filter(penguins, body_mass_g > 4200) |> 
  select(island, species)
# A tibble: 149 × 2
   island    species
   <fct>     <fct>  
 1 Torgersen Adelie 
 2 Torgersen Adelie 
 3 Torgersen Adelie 
 4 Torgersen Adelie 
 5 Dream     Adelie 
 6 Dream     Adelie 
 7 Dream     Adelie 
 8 Biscoe    Adelie 
 9 Biscoe    Adelie 
10 Torgersen Adelie 
# ℹ 139 more rows

Piping functions

Even better:

  • Generally when using the pipe, the first argument is the data set, which gets piped through the corresponding functions
  • So the code on the prior slide would be:
penguins |> 
  filter(body_mass_g > 4200) |> 
  select(island, species)
# A tibble: 149 × 2
   island    species
   <fct>     <fct>  
 1 Torgersen Adelie 
 2 Torgersen Adelie 
 3 Torgersen Adelie 
 4 Torgersen Adelie 
 5 Dream     Adelie 
 6 Dream     Adelie 
 7 Dream     Adelie 
 8 Biscoe    Adelie 
 9 Biscoe    Adelie 
10 Torgersen Adelie 
# ℹ 139 more rows

mutate()

Some new data

(prepost <- reads |> 
  filter(student_id != "All Students (Average)") |> 
  select(student_id, test_site, pre_test_score, post_test_score))
# A tibble: 45 × 4
   student_id test_site pre_test_score post_test_score
   <chr>      <chr>              <dbl>           <dbl>
 1 Virden 1   VIRDEN                43              92
 2 Virden 2   VIRDEN                46             104
 3 Virden 3   VIRDEN                39              75
 4 Virden 4   VIRDEN                35             115
 5 Virden 5   VIRDEN                46              85
 6 Virden 6   VIRDEN                35              91
 7 Virden 7   VIRDEN                40              96
 8 Virden 8   VIRDEN                39              74
 9 Virden 9   VIRDEN                40              90
10 Virden 10  VIRDEN                45              86
# ℹ 35 more rows

mutate()

Add a variable

prepost |> 
  mutate(gain = post_test_score - pre_test_score)
# A tibble: 45 × 5
   student_id test_site pre_test_score post_test_score  gain
   <chr>      <chr>              <dbl>           <dbl> <dbl>
 1 Virden 1   VIRDEN                43              92    49
 2 Virden 2   VIRDEN                46             104    58
 3 Virden 3   VIRDEN                39              75    36
 4 Virden 4   VIRDEN                35             115    80
 5 Virden 5   VIRDEN                46              85    39
 6 Virden 6   VIRDEN                35              91    56
 7 Virden 7   VIRDEN                40              96    56
 8 Virden 8   VIRDEN                39              74    35
 9 Virden 9   VIRDEN                40              90    50
10 Virden 10  VIRDEN                45              86    41
# ℹ 35 more rows

names(prepost)
[1] "student_id"      "test_site"       "pre_test_score"  "post_test_score"

Why do we not see the new gain column?

mutate()

Add multiple variables…

Use a variable you just created!

prepost <- prepost |>
  mutate(
    gain = post_test_score - pre_test_score,
      sum_tests = post_test_score + pre_test_score,
    pct_tests = (sum_tests/180) * 100
  )
prepost
# A tibble: 45 × 7
   student_id test_site pre_test_score post_test_score  gain sum_tests pct_tests
   <chr>      <chr>              <dbl>           <dbl> <dbl>     <dbl>     <dbl>
 1 Virden 1   VIRDEN                43              92    49       135      75  
 2 Virden 2   VIRDEN                46             104    58       150      83.3
 3 Virden 3   VIRDEN                39              75    36       114      63.3
 4 Virden 4   VIRDEN                35             115    80       150      83.3
 5 Virden 5   VIRDEN                46              85    39       131      72.8
 6 Virden 6   VIRDEN                35              91    56       126      70  
 7 Virden 7   VIRDEN                40              96    56       136      75.6
 8 Virden 8   VIRDEN                39              74    35       113      62.8
 9 Virden 9   VIRDEN                40              90    50       130      72.2
10 Virden 10  VIRDEN                45              86    41       131      72.8
# ℹ 35 more rows

mutate()

Change a variable

prepost <- prepost |> 
  mutate(student_id = parse_number(student_id))
prepost
# A tibble: 45 × 7
   student_id test_site pre_test_score post_test_score  gain sum_tests pct_tests
        <dbl> <chr>              <dbl>           <dbl> <dbl>     <dbl>     <dbl>
 1          1 VIRDEN                43              92    49       135      75  
 2          2 VIRDEN                46             104    58       150      83.3
 3          3 VIRDEN                39              75    36       114      63.3
 4          4 VIRDEN                35             115    80       150      83.3
 5          5 VIRDEN                46              85    39       131      72.8
 6          6 VIRDEN                35              91    56       126      70  
 7          7 VIRDEN                40              96    56       136      75.6
 8          8 VIRDEN                39              74    35       113      62.8
 9          9 VIRDEN                40              90    50       130      72.2
10         10 VIRDEN                45              86    41       131      72.8
# ℹ 35 more rows

arrange()

arrange()

Order the data

  • Order by “gain”, ascending [default]
prepost |> 
  arrange(gain)
# A tibble: 45 × 7
   student_id test_site pre_test_score post_test_score  gain sum_tests pct_tests
        <dbl> <chr>              <dbl>           <dbl> <dbl>     <dbl>     <dbl>
 1         12 JONES                 27              32     5        59      32.8
 2          6 WESTSIDE              57              82    25       139      77.2
 3          8 VIRDEN                39              74    35       113      62.8
 4          3 VIRDEN                39              75    36       114      63.3
 5          5 VIRDEN                46              85    39       131      72.8
 6         10 VIRDEN                45              86    41       131      72.8
 7          3 JONES                 54              95    41       149      82.8
 8          1 VIRDEN                43              92    49       135      75  
 9         13 JONES                 49              98    49       147      81.7
10         13 WESTSIDE              45              94    49       139      77.2
# ℹ 35 more rows

arrange()

Order the data

  • Order by “pre_test_score” and “gain”, ascending
prepost |> 
  arrange(pre_test_score, gain)
# A tibble: 45 × 7
   student_id test_site pre_test_score post_test_score  gain sum_tests pct_tests
        <dbl> <chr>              <dbl>           <dbl> <dbl>     <dbl>     <dbl>
 1         11 JONES                 24             108    84       132      73.3
 2          4 JONES                 25              99    74       124      68.9
 3         12 JONES                 27              32     5        59      32.8
 4         12 VIRDEN                31             102    71       133      73.9
 5         11 VIRDEN                32              91    59       123      68.3
 6         13 VIRDEN                33              86    53       119      66.1
 7          6 VIRDEN                35              91    56       126      70  
 8         15 VIRDEN                35             101    66       136      75.6
 9          9 JONES                 35             109    74       144      80  
10          4 VIRDEN                35             115    80       150      83.3
# ℹ 35 more rows

arrange()

Order the data

  • Order by “gain”, descending
prepost |> 
  arrange(desc(gain))
# A tibble: 45 × 7
   student_id test_site pre_test_score post_test_score  gain sum_tests pct_tests
        <dbl> <chr>              <dbl>           <dbl> <dbl>     <dbl>     <dbl>
 1         11 JONES                 24             108    84       132      73.3
 2          4 VIRDEN                35             115    80       150      83.3
 3          4 JONES                 25              99    74       124      68.9
 4          5 JONES                 36             110    74       146      81.1
 5          9 JONES                 35             109    74       144      80  
 6         12 VIRDEN                31             102    71       133      73.9
 7          1 JONES                 36             103    67       139      77.2
 8         15 VIRDEN                35             101    66       136      75.6
 9          6 JONES                 40             106    66       146      81.1
10          8 WESTSIDE              43             109    66       152      84.4
# ℹ 35 more rows

arrange()

Order the data

  • Order by “pre_test_score” and “gain”, descending
prepost |> 
  arrange(desc(pre_test_score), desc(gain))
# A tibble: 45 × 7
   student_id test_site pre_test_score post_test_score  gain sum_tests pct_tests
        <dbl> <chr>              <dbl>           <dbl> <dbl>     <dbl>     <dbl>
 1          6 WESTSIDE              57              82    25       139      77.2
 2          3 JONES                 54              95    41       149      82.8
 3          1 WESTSIDE              53             109    56       162      90  
 4          2 WESTSIDE              51             105    54       156      86.7
 5          3 WESTSIDE              51             105    54       156      86.7
 6          8 JONES                 50             101    51       151      83.9
 7          7 WESTSIDE              49             109    60       158      87.8
 8          7 JONES                 49             104    55       153      85  
 9         13 JONES                 49              98    49       147      81.7
10          5 WESTSIDE              46             106    60       152      84.4
# ℹ 35 more rows

summarize()

summarize() or summarise()

  • Here: compute the mean and standard deviation of the gain scores
  • What are we looking at?
prepost |> 
  summarize(mean_gain = mean(gain, na.rm = TRUE),
            sd_gain = sd(gain, na.rm = TRUE))
# A tibble: 1 × 2
  mean_gain sd_gain
      <dbl>   <dbl>
1      56.4    14.1

group_by()

group_by()

Conduct an operation for each level of a grouping factor

Here, compute the mean gain for each test site

prepost |> 
  group_by(test_site) |> 
  summarize(mean_gain = mean(gain))
# A tibble: 3 × 2
  test_site mean_gain
  <chr>         <dbl>
1 JONES          59  
2 VIRDEN         53.8
3 WESTSIDE       56.3

group_by()

When you use group_by() and summarize()

You will get a row for each unique group

mpg |> 
  group_by(cyl) |> 
  summarize(max_hwy = max(hwy))
# A tibble: 4 × 2
    cyl max_hwy
  <int>   <int>
1     4      44
2     5      29
3     6      29
4     8      26

If you have multiple grouping variables, you will get a row for each unique combination of groups

mpg |> 
  group_by(cyl, drv) |> 
  summarize(max_hwy = max(hwy))
# A tibble: 9 × 3
# Groups:   cyl [4]
    cyl drv   max_hwy
  <int> <chr>   <int>
1     4 4          28
2     4 f          44
3     5 f          29
4     6 4          25
5     6 f          29
6     6 r          26
7     8 4          23
8     8 f          25
9     8 r          26

Add summarized column to full data

Use mutate()!

prepost |> 
  group_by(test_site) |> 
  mutate(mean_gain = mean(gain))
# A tibble: 45 × 8
# Groups:   test_site [3]
   student_id test_site pre_test_score post_test_score  gain sum_tests pct_tests
        <dbl> <chr>              <dbl>           <dbl> <dbl>     <dbl>     <dbl>
 1          1 VIRDEN                43              92    49       135      75  
 2          2 VIRDEN                46             104    58       150      83.3
 3          3 VIRDEN                39              75    36       114      63.3
 4          4 VIRDEN                35             115    80       150      83.3
 5          5 VIRDEN                46              85    39       131      72.8
 6          6 VIRDEN                35              91    56       126      70  
 7          7 VIRDEN                40              96    56       136      75.6
 8          8 VIRDEN                39              74    35       113      62.8
 9          9 VIRDEN                40              90    50       130      72.2
10         10 VIRDEN                45              86    41       131      72.8
# ℹ 35 more rows
# ℹ 1 more variable: mean_gain <dbl>

Retain only summarized columns

Use summarize()!

prepost |> 
  group_by(test_site) |> 
  summarize(mean_gain = mean(gain))
# A tibble: 3 × 2
  test_site mean_gain
  <chr>         <dbl>
1 JONES          59  
2 VIRDEN         53.8
3 WESTSIDE       56.3

Pipe to {ggplot}!

prepost |> 
  group_by(test_site) |> 
  summarize(mean_gain = mean(gain)) |> 
  ggplot(aes(test_site, mean_gain)) + 
  geom_col(fill = "cornflowerblue", alpha = .7) +
  theme_minimal()

Summarizing and Plotting

  • Calculate mean post_test_score by test site
students_only |>
  group_by(test_site) |>
  summarize(mean = mean(post_test_score))
# A tibble: 3 × 2
  test_site  mean
  <chr>     <dbl>
1 JONES      97.8
2 VIRDEN     92.5
3 WESTSIDE  103. 

Plot it

theme_set(theme_minimal(base_size = 18))

students_only |>
  group_by(test_site) |>
  summarize(mean = mean(post_test_score)) |>
  ggplot(aes(test_site, mean)) +
  geom_col(alpha = 0.8)

A couple more verbs

recode() and rename()

Recode with ifelse()

Recoding into dichotomous variable with ifelse()

ifelse(test, test == TRUE, test == FALSE)

or

ifelse(logical statement, do this if TRUE, else do this)

Recode with ifelse()

Recoding into dichotomous variable with ifelse()

Remember to pair with mutate()!

students_only |> 
  mutate(New_test_site = ifelse(test_site == "JONES", "district 1", "district 2"))
# A tibble: 45 × 26
   test_year      test_type test_site student_id pre_test_score pre_test_percent
   <chr>          <chr>     <chr>     <chr>               <dbl> <chr>           
 1 06/01/2016 12… YEAR END  VIRDEN    Virden 1               43 29%             
 2 06/01/2016 12… YEAR END  VIRDEN    Virden 2               46 31%             
 3 06/01/2016 12… YEAR END  VIRDEN    Virden 3               39 26%             
 4 06/01/2016 12… YEAR END  VIRDEN    Virden 4               35 23%             
 5 06/01/2016 12… YEAR END  VIRDEN    Virden 5               46 31%             
 6 06/01/2016 12… YEAR END  VIRDEN    Virden 6               35 23%             
 7 06/01/2016 12… YEAR END  VIRDEN    Virden 7               40 27%             
 8 06/01/2016 12… YEAR END  VIRDEN    Virden 8               39 26%             
 9 06/01/2016 12… YEAR END  VIRDEN    Virden 9               40 27%             
10 06/01/2016 12… YEAR END  VIRDEN    Virden 10              45 30%             
# ℹ 35 more rows
# ℹ 20 more variables: post_test_score <dbl>, post_test_percent <chr>,
#   percentage_change <chr>, unit_1_score <dbl>, unit_1_percent <chr>,
#   unit_2_score <dbl>, unit_2_percent <chr>, unit_3_score <dbl>,
#   unit_3_percent <chr>, unit_4_score <dbl>, unit_4_percent <chr>,
#   unit_5_6_score <dbl>, unit_5_6_percent <chr>, unit_7_score <dbl>,
#   unit_7_percent <chr>, unit_8_score <dbl>, unit_8_percent <chr>, …

Recode with recode()

Recoding into multinomial variable

recode(variable,
       old_category_1 = "new_category_1",
         old_category_2 = "new_category_2",
         old_category_3 = "new_category_3",
       ...
)

Recode with recode()

Recoding into multinomial variable

Remember to pair with mutate()

students_only |> 
  mutate(test_site = recode(test_site,
                            JONES = "Jones Middle",
                            VIRDEN = "Virden Elementary",
                            WESTSIDE = "Westside High"
                            )
        )
# A tibble: 45 × 25
   test_year      test_type test_site student_id pre_test_score pre_test_percent
   <chr>          <chr>     <chr>     <chr>               <dbl> <chr>           
 1 06/01/2016 12… YEAR END  Virden E… Virden 1               43 29%             
 2 06/01/2016 12… YEAR END  Virden E… Virden 2               46 31%             
 3 06/01/2016 12… YEAR END  Virden E… Virden 3               39 26%             
 4 06/01/2016 12… YEAR END  Virden E… Virden 4               35 23%             
 5 06/01/2016 12… YEAR END  Virden E… Virden 5               46 31%             
 6 06/01/2016 12… YEAR END  Virden E… Virden 6               35 23%             
 7 06/01/2016 12… YEAR END  Virden E… Virden 7               40 27%             
 8 06/01/2016 12… YEAR END  Virden E… Virden 8               39 26%             
 9 06/01/2016 12… YEAR END  Virden E… Virden 9               40 27%             
10 06/01/2016 12… YEAR END  Virden E… Virden 10              45 30%             
# ℹ 35 more rows
# ℹ 19 more variables: post_test_score <dbl>, post_test_percent <chr>,
#   percentage_change <chr>, unit_1_score <dbl>, unit_1_percent <chr>,
#   unit_2_score <dbl>, unit_2_percent <chr>, unit_3_score <dbl>,
#   unit_3_percent <chr>, unit_4_score <dbl>, unit_4_percent <chr>,
#   unit_5_6_score <dbl>, unit_5_6_percent <chr>, unit_7_score <dbl>,
#   unit_7_percent <chr>, unit_8_score <dbl>, unit_8_percent <chr>, …

Recode with recode()

Remember to use backticks `` for special characters

mtcars |> 
  mutate(cyl_str = recode(cyl,
                      `4` = "cylinders4",
                      `6` = "cylinders6",
                      `8` = "cylinders8"
                      )
         ) |> 
  as_tibble()
# A tibble: 32 × 12
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb cyl_str   
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>     
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4 cylinders6
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4 cylinders6
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1 cylinders4
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1 cylinders6
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2 cylinders8
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1 cylinders6
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4 cylinders8
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2 cylinders4
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2 cylinders4
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4 cylinders6
# ℹ 22 more rows

Recode with case_when()

Recoding across several variables, or if you find yourself using nested ifelse() statements

Remember to pair with mutate()

library(palmerpenguins)

penguins |> 
  mutate(species_big = 
           case_when(
             body_mass_g > 4500 & species == "Adelie" ~ "adelie_big",
             body_mass_g > 4600 & species == "Chinstrap" ~ "chinstrap_big",
             body_mass_g > 6000 & species == "Gentoo" ~ "gentoo_big",
             .default = "other"
           ))

.default equivalent to all else

rename()

rename(new_name = old_name)

NOT paired with mutate()

penguins |>
  rename(
    home = island,
    bill_length = bill_length_mm
  )
# A tibble: 344 × 8
   species home    bill_length bill_depth_mm flipper_length_mm body_mass_g sex  
   <fct>   <fct>         <dbl>         <dbl>             <int>       <int> <fct>
 1 Adelie  Torger…        39.1          18.7               181        3750 male 
 2 Adelie  Torger…        39.5          17.4               186        3800 fema…
 3 Adelie  Torger…        40.3          18                 195        3250 fema…
 4 Adelie  Torger…        NA            NA                  NA          NA <NA> 
 5 Adelie  Torger…        36.7          19.3               193        3450 fema…
 6 Adelie  Torger…        39.3          20.6               190        3650 male 
 7 Adelie  Torger…        38.9          17.8               181        3625 fema…
 8 Adelie  Torger…        39.2          19.6               195        4675 male 
 9 Adelie  Torger…        34.1          18.1               193        3475 <NA> 
10 Adelie  Torger…        42            20.2               190        4250 <NA> 
# ℹ 334 more rows
# ℹ 1 more variable: year <int>

rename with select()

penguins |> 
  select(species, home = island, bill_length = bill_length_mm, year) 
# A tibble: 344 × 4
   species home      bill_length  year
   <fct>   <fct>           <dbl> <int>
 1 Adelie  Torgersen        39.1  2007
 2 Adelie  Torgersen        39.5  2007
 3 Adelie  Torgersen        40.3  2007
 4 Adelie  Torgersen        NA    2007
 5 Adelie  Torgersen        36.7  2007
 6 Adelie  Torgersen        39.3  2007
 7 Adelie  Torgersen        38.9  2007
 8 Adelie  Torgersen        39.2  2007
 9 Adelie  Torgersen        34.1  2007
10 Adelie  Torgersen        42    2007
# ℹ 334 more rows

Final notes on {dplyr}

  • We could spend weeks on this
  • We’ll be using it all term long
  • Verbs can help you gain fluency
  • There are also conditional and all-inclusive versions of mutate, select, and summarize

Next time

Before next class