Functional Programming for Educational Data Science

EDLD 653

Author
Affiliation

Joe Nese

University of Oregon
Spring 2026

Introduction &
Data Types

Week 1

Agenda

  • Introductions
  • Syllabus
  • Intro to data types

. . .

Learning Objectives

  • Understand the requirements of the course
  • Understand the fundamental difference between lists and atomic vectors
  • Understand how atomic vectors are coerced, implicitly or explicitly
  • Understand various ways to subset vectors, and how subsetting differs for lists
  • Understand attributes and how to set and modify them

About me

  • husband, dad
  • BA: UC Santa Barbara
  • PhD, School Psychology: University of Maryland
  • UO since 2009 at Behavioral Research & Teaching (BRT)
  • Research Professor

Research

  • Applied statistical methods used by researchers
  • Developing and improving systems that support data-based decision making using advanced technologies to influence teachers’ instructional practices and increase student achievement

Teaching

  • EDLD 651 - Introduction to Data Science with R
  • EDLD 653 - this one!
  • EDLD 654 - Applied Machine Learning for Educational Data Scientists
  • EDLD 609 - Data Science Capstone

Introduce yourself!

We mostly know each other, but it’s always good to hear from each other

  • Name and program of study
  • How and how often are you using R these days?

This course

… is new to me

Syllabus

Course Learning Outcomes

  • Understand and be able to describe the differences in R’s data structures (including the four main vector types, data frames, and lists) and when each is most appropriate for a given task
  • Explore purrr::map() and its variants, how they relate to base R functions, and why the {purrr} variants are often preferable
  • Work with lists and list columns using tidyr::nest() and tidyr::unnest()
  • Convert repetitive tasks into functions
  • Understand elements of good functions, and things to avoid
  • Write effective and clear functions to continue with the mantra of “don’t repeat yourself”

Course Website

Required Textbooks (free)

https://adv-r.hadley.nz

Other books (also free)

https://r4ds.hadley.nz/

https://r4ds.had.co.nz/index.html

https://mastering-shiny.org/

Course Sequence

  • Data types
  • Base R iterations
  • {purrr}
  • Batch processes and working with list columns
  • Parallel iterations (and a few extras)
  • Writing functions
  • Shiny

Assignments

Assignments   Points   Percent
Labs (x3)     60       30%
Midterm       70       35%
Final         70       35%
Total         200

Grading Components

Lower %   Lower point range   Grade   Upper point range   Upper %
97        194 or more         A+
93        186                 A       192                 96
90        180                 A-      184                 92
87        174                 B+      178                 89
83        166                 B       172                 86
80        160                 B-      164                 82
77        154                 C+      158                 79
73        146                 C       152                 76
70        140                 C-      144                 72
                              F       138 or less         69

Labs

Please try to be in class on lab days; it helps me help you

Assigned   Date Assigned   Date Due   Points   Percent
Lab 1      Apr-06          Apr-13     20       10%
Lab 2      Apr-13          Apr-20     20       10%
Lab 3      May-11          May-18     20       10%

  • Scored on a “best honest effort” basis
    • Contact me for help rather than submitting incomplete work
    • If the assignment is not complete and you have not contacted me for help, it is likely to result in partial credit or a zero score
  • You can work in groups on these
  • Late: 10 points max
  • >1 week late: 0 points

Midterm

  • Take-home Midterm test
    • Write loops to solve problems
  • Scored on a correct/incorrect basis
  • Worth 70 points (35% of your grade)

Final Exam

  • Take-home Final exam
    • Anything covered in this course is fair game
  • Scored on a correct/incorrect basis
  • Worth 70 points (35% of your grade)

Feedback

  • I will give you feedback on the midterm and the final
  • Labs scored on a completion basis
  • We will go over everything in class

GenAI Use

  • Best use of GenAI might be for coding!
  • Use to help with coursework and assignments
    • code checking
    • code generation
    • code explanation
  • If you include any content generated by GenAI, you must cite it
  • This is the same way you must cite any content you use from other sources, such as books, articles, videos, the internet, etc.
  • See example in syllabus

The code was generated by ChatGPT 4.5. See the link for a copy of the prompt. https://chatgpt.com/share/somerandomestringoftext

Break?

Basic Data Types

Vectors

4 basic types

  • Integer (numeric, whole number)
  • Double (numeric, decimal)
  • Logical
  • Character

Creating vectors

Vectors are created with c()

Below are examples of each of the four main types of vectors

integer <- c(5L, 7L, 3L, 94L) # L explicitly an integer, not double

double <- c(3.27, 8.41, Inf, -Inf)

logical <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

character <- c("red", "orange", "yellow", "green", "blue", "violet", "rainbow")

Coercion

  • All elements of an atomic vector must be of the same type
  • If you try to mix types, implicit coercion will occur
  • Implicit coercion defaults to the most flexible type
    • which is… ?

. . .

c(7L, 3.25)
[1] 7.00 3.25

. . .

c(3.24, TRUE, "April")
[1] "3.24"  "TRUE"  "April"

. . .

c(TRUE, 5)
[1] 1 5

Explicit coercion

You can alternatively define the coercion to occur

as.integer(c(7L, 3.25))
[1] 7 3

. . .

as.logical(c(3.24, TRUE, "April"))
[1]   NA TRUE   NA

. . .

as.character(c(TRUE, 5)) # still maybe a bit unexpected?
[1] "1" "5"

Coercing to logical

as.logical(c(0, 1, 1, 0))
[1] FALSE  TRUE  TRUE FALSE

. . .

Any number that is not zero gets coerced to TRUE

as.logical(c(0, 5L, 7.4, -1.6, 0))
[1] FALSE  TRUE  TRUE  TRUE FALSE

. . .

as.logical(c(3.24, TRUE, "April"))
[1]   NA TRUE   NA

Wait…why the NAs here?
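
One way to see where the NAs come from, as a minimal sketch: c() runs first and implicitly coerces the mixed elements to character, so as.logical() only ever sees strings. It recognizes a few spellings such as "TRUE" or "false", and everything else becomes NA. The object name mixed is only for illustration.

```r
# c() runs first, so implicit coercion turns every element into character
mixed <- c(3.24, TRUE, "April")
typeof(mixed)

# "3.24" and "April" are not strings that as.logical() understands
as.logical(mixed)

# Coercing the number directly (no character step) behaves differently
as.logical(3.24)
```

The number 3.24 would coerce to TRUE on its own; it is the detour through character that produces the NA.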

Review

Discuss in small breakout groups

  • What are the four basic types of atomic vectors?
  • What function creates a vector?
  • What does coercion mean, and when does it come into play?
  • True/False: An R list is not a vector.

Checking types

Use typeof() to verify the type of a vector

typeof(c(7L, 3.25))
[1] "double"
typeof(as.integer(c(7L, 3.25)))
[1] "integer"

Piping

Although the pipe is traditionally used within the {tidyverse}, it can be just as useful with base R functions

The following are equivalent

typeof(as.integer(c(7L, 3.25)))
[1] "integer"
c(7L, 3.25) |>
  as.integer() |>
  typeof()
[1] "integer"

Pop quiz

Without actually running the code, predict which type each of the following will coerce to.

c(1.25, TRUE, 4L)

c(1L, FALSE)

c(7L, 6.23, "eight")

c(TRUE, 1L, 0L, "False")

Answers

typeof(c(1.25, TRUE, 4L))
[1] "double"

. . .

typeof(c(1L, FALSE))
[1] "integer"

. . .

typeof(c(7L, 6.23, "eight"))
[1] "character"

. . .

typeof(c(TRUE, 1L, 0L, "False"))
[1] "character"

Lists

  • Lists are vectors, but not atomic vectors

  • Fundamental difference - each element can be a different type

. . .

list("a", 7L, 3.25, TRUE)
[[1]]
[1] "a"

[[2]]
[1] 7

[[3]]
[1] 3.25

[[4]]
[1] TRUE

Lists

  • Each element of the list is another vector, possibly atomic, possibly not
  • The prior example included all scalar vectors
    • vector that contains only a single value
  • Lists do not require all elements to be the same length
list(
  c("a", "b", "c"),
  rnorm(5),
  c(7L, 2L),
  c(TRUE, TRUE, FALSE, TRUE)
)
[[1]]
[1] "a" "b" "c"

[[2]]
[1] -0.86378205 -0.01866145 -1.78322803  0.89500207  0.35757039

[[3]]
[1] 7 2

[[4]]
[1]  TRUE  TRUE FALSE  TRUE

Summary

  • Atomic vectors must all be the same type
    • implicit coercion occurs if not (and you haven’t specified the coercion explicitly)
  • Lists are also vectors, but not atomic vectors
    • Each element can be of a different type and length
    • Incredibly flexible, but often a little more difficult to get the hang of

Challenge

Work with a partner

One of you share your screen:

  1. Create four atomic vectors, one for each of the fundamental types
    • integer, double, logical, character
  2. Combine two or more of the vectors. Predict the implicit coercion of each
  3. Apply explicit coercions, and predict the output for each

(basically quiz each other)

Attributes

Attributes

Q: What are attributes?
A: metadata
Q: What’s metadata?
A: Data about the data
Attributes: a named list of arbitrary metadata

Other data types

Atomic vectors by themselves make up only a small fraction of the total number of data types in R

. . .

Other data types

  • Data frames
  • Matrices & arrays
  • Factors
  • Dates

. . .

Remember, atomic vectors are the atoms of R. Many other data structures are built from atomic vectors.

We use attributes to create other data types from atomic vectors.

Attributes

Common

  • Names
  • Dimensions

Less common

  • Arbitrary metadata

Examples

Please follow along!

library(palmerpenguins)
penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

attributes()

Please follow along!

attributes(): see all attributes associated with an object

attributes(penguins[1:20, ]) # limiting rows just for slides
$names
[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"             

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

$class
[1] "tbl_df"     "tbl"        "data.frame"

attr()

Access a single attribute by naming it within attr()

attr(penguins, "class")
[1] "tbl_df"     "tbl"        "data.frame"
attr(penguins, "names")
[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"             

. . .

Note - this is not generally how you would pull these attributes. Rather, you would use class() and names()
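
A quick check of that note, sketched with the built-in mtcars data frame as a stand-in so it runs without extra packages: the dedicated accessor functions return the same values as attr().

```r
# attr() and the accessor functions return the same thing
identical(attr(mtcars, "names"), names(mtcars))
identical(attr(mtcars, "class"), class(mtcars))
```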

Be specific

  • Note in the prior slides, I’m asking for attributes on the entire data frame

  • But the individual vectors may have attributes as well

. . .

attributes(penguins$species)
$levels
[1] "Adelie"    "Chinstrap" "Gentoo"   

$class
[1] "factor"
attributes(penguins$bill_length_mm)
NULL

Set attributes

Redefine attributes within attr()

attr(penguins$species, "levels") <- c("Big one", 
                                      "Little one", 
                                      "Funny one")

head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Big one Torgersen           39.1          18.7               181        3750
2 Big one Torgersen           39.5          17.4               186        3800
3 Big one Torgersen           40.3          18                 195        3250
4 Big one Torgersen           NA            NA                  NA          NA
5 Big one Torgersen           36.7          19.3               193        3450
6 Big one Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

. . .

Note - you would generally not define levels this way either, but it is a general method for modifying attributes
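
For comparison, a minimal sketch of the more usual route, levels<-, on a small made-up factor (size is a hypothetical object, not the penguins data):

```r
# A small stand-in factor
size <- factor(c("a", "b", "a", "c"))

# levels<- is the usual interface for renaming factor levels
levels(size) <- c("Big one", "Little one", "Funny one")
size
```

The result is the same as setting the "levels" attribute directly, but the intent is easier to read.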

Dimensions

Let’s create a matrix (please do it with me)

  • Notice how the matrix fills
m <- matrix(1:6, ncol = 2)
m
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

. . .

Check out the attributes

attributes(m)
$dim
[1] 3 2

Modify the attributes

Let’s change it to a 2 x 3 matrix, instead of 3 x 2 (you try first)

. . .

attr(m, "dim") <- c(2, 3)
m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

. . .

Is this the result you expected?
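
If you expected a transpose, this sketch may help. Changing dim only re-reads the same values in the same column-by-column order, while t() actually swaps rows and columns. The names mat, reshaped, and transposed are only for illustration.

```r
mat <- matrix(1:6, ncol = 2)   # 3 x 2, filled column by column

reshaped <- mat
dim(reshaped) <- c(2, 3)       # same values, same order, new shape

transposed <- t(mat)           # rows and columns swapped

reshaped
transposed
identical(reshaped, transposed)
```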

Alternative creation

Create an atomic vector v, assign a dimension attribute

v <- 1:6
v
[1] 1 2 3 4 5 6

. . .

attr(v, "dim") <- c(3, 2)
v
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

Quick aside

What if we wanted it to fill by row?

matrix(6:13, 
       ncol = 2, 
       byrow = TRUE)
     [,1] [,2]
[1,]    6    7
[2,]    8    9
[3,]   10   11
[4,]   12   13
vect <- 6:13
dim(vect) <- c(2, 4)
vect
     [,1] [,2] [,3] [,4]
[1,]    6    8   10   12
[2,]    7    9   11   13
t(vect)
     [,1] [,2]
[1,]    6    7
[2,]    8    9
[3,]   10   11
[4,]   12   13

Names

The following are equivalent

dim_names <- list(
    c("the first", "second", "III"),
    c("index", "value")
  )

attr(v, "dimnames") <- dim_names
v
          index value
the first     1     4
second        2     5
III           3     6
v2 <- 1:6
attr(v2, "dim") <- c(3, 2)
rownames(v2) <- c("the first", "second", "III")
colnames(v2) <- c("index", "value")
v2
          index value
the first     1     4
second        2     5
III           3     6

Remove names

You can remove names from a vector in two ways

  1. x <- unname(x)
  2. names(x) <- NULL
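
A short sketch showing that the two approaches give the same result (x1 and x2 are illustrative names):

```r
x <- c(a = 5, b = 7, c = 12)

# Option 1: unname() returns a copy without names
x1 <- unname(x)

# Option 2: assign NULL to the names of a copy
x2 <- x
names(x2) <- NULL

identical(x1, x2)
```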

Arbitrary metadata

attr(v, "matrix_mean") <- mean(v)
v
          index value
the first     1     4
second        2     5
III           3     6
attr(,"matrix_mean")
[1] 3.5
attr(v, "matrix_mean")
[1] 3.5

. . .

Note that anything can be stored as an attribute (including matrices or data frames, etc.)
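
As a small illustration of that note, here is a sketch that attaches a data frame to a plain numeric vector. The object scores and the attribute name "source" are made up for the example.

```r
scores <- c(88, 92, 79)

# Attach a small data frame describing where the scores came from
attr(scores, "source") <- data.frame(
  school = c("A", "B", "C"),
  year = 2026L
)

# The attribute is a regular data frame and can be used like one
attr(scores, "source")$school
```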

Why would we do this?

A brief example

  • Imagine we’re accessing a database that has many years of data
  • The tables in the database are the same, but the values (of course) differ
  • We might want to return the data, but store the year as an attribute

This will be a complex example, but stick with me

Made up data

I’m using a list to mimic a database

db <- list(
  data.frame(
    color = c("red", "orange", "green"),
    transparency = c(0.80, 0.65, 0.93)
  ),
  data.frame(
    color = c("blue", "pink", "cyan"),
    transparency = c(0.40, 0.35, 0.87)
  )
)
db
[[1]]
   color transparency
1    red         0.80
2 orange         0.65
3  green         0.93

[[2]]
  color transparency
1  blue         0.40
2  pink         0.35
3  cyan         0.87

Write a function

Let’s write a function that grabs one of these tables.

If it’s “1920” we’ll grab the first one, otherwise we’ll grab the second one

pull_color_data <- function(year) {
  if (year == "1920") {
    out <- db[[1]]
  } else {
    out <- db[[2]]
  }
  out
}

Does it work?

pull_color_data(1920)
   color transparency
1    red         0.80
2 orange         0.65
3  green         0.93
pull_color_data(2021)
  color transparency
1  blue         0.40
2  pink         0.35
3  cyan         0.87
pull_color_data(2122)
  color transparency
1  blue         0.40
2  pink         0.35
3  cyan         0.87

. . .

Yes!

Build a second function

Let’s say we want to make a second function that does something with the previous output.

BUT

  • What we do with it is going to depend on the data frame we get back.
  • We need to know the year.
  • So: Modify our original function to store the year as an attribute!

Update function

Notice we redefine the attributes so we’re including all the prior attributes it already had

pull_color_data <- function(year) {
  if (year == "1920") {
    out <- db[[1]]
  } else {
    out <- db[[2]]
  }
  attributes(out) <- c(
    attributes(out),
    db = year 
  ) 
  out
}
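
An alternative sketch, not the version used in these slides: attr<- adds a single attribute and leaves the existing ones untouched, so there is no need to rebuild the full attribute list. The function name pull_color_data2 and the trimmed-down db are stand-ins so the example runs on its own.

```r
# Stand-in for the list-based "database" from the earlier slide
db <- list(
  data.frame(color = c("red", "orange", "green")),
  data.frame(color = c("blue", "pink", "cyan"))
)

pull_color_data2 <- function(year) {
  out <- if (year == "1920") db[[1]] else db[[2]]
  attr(out, "db") <- year   # adds one attribute; names, class, row.names stay as they were
  out
}

res <- pull_color_data2(2021)
attr(res, "db")
class(res)
```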

Try

pull_color_data(1920)
   color transparency
1    red         0.80
2 orange         0.65
3  green         0.93
attr(pull_color_data(2021), "db")
[1] 2021

Build our second function

Now, we can make our second function, and have it do something different depending on the data that is passed to it.

. . .

print_colors <- function(color_data) {
  title <- paste0("Colors for ", attr(color_data, "db")) 
  colorspace::swatchplot(color_data$color)
  mtext(title) # base plotting function
}

pull_color_data(1920) |> 
  print_colors()

pull_color_data(2021) |> 
  print_colors()

Another example

Fit a multilevel model and pull the variance-covariance matrix

m <- lme4::lmer(Reaction ~ 1 + Days + (1 + Days|Subject), 
                data = lme4::sleepstudy)

lme4::VarCorr(m)$Subject
            (Intercept)      Days
(Intercept)  612.100158  9.604409
Days           9.604409 35.071714
attr(,"stddev")
(Intercept)        Days 
  24.740658    5.922138 
attr(,"correlation")
            (Intercept)       Days
(Intercept)  1.00000000 0.06555124
Days         0.06555124 1.00000000

Matrices vs Data frames

Usually we want to work with data frames because they represent our data better

Sometimes a matrix is more efficient because you can operate on the entire matrix at once

. . .

set.seed(3000)
m <- matrix(rnorm(100, 200, 10), ncol = 10)
m
          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
 [1,] 212.5829 191.7712 199.5134 196.5931 175.2498 207.3866 192.2747 194.4411
 [2,] 206.4475 183.8198 209.9150 204.4850 212.8858 198.3413 194.4287 189.1406
 [3,] 204.6335 203.0518 193.4606 209.6925 204.9773 200.5138 216.1828 188.4229
 [4,] 207.7031 214.9317 195.3695 191.5964 214.3530 201.9884 198.9011 197.8626
 [5,] 197.7367 200.2874 189.6927 210.3867 193.3954 201.6348 197.5796 217.5513
 [6,] 196.4845 194.3709 192.6729 199.6904 189.8270 198.7178 208.5352 205.8042
 [7,] 193.5068 204.5237 205.7235 196.5166 197.8754 226.2404 190.5798 223.5147
 [8,] 187.3741 206.0149 194.3706 199.2014 193.6386 219.8513 187.0317 206.1837
 [9,] 183.0287 211.0514 187.4148 177.5480 203.2865 216.6478 194.5805 197.8386
[10,] 181.6598 180.4219 193.6863 207.9981 191.9110 189.8216 200.8376 202.6439
          [,9]    [,10]
 [1,] 202.9875 190.1133
 [2,] 187.3453 202.7654
 [3,] 186.9933 190.7834
 [4,] 202.5351 198.9414
 [5,] 202.4735 197.9089
 [6,] 196.7506 199.0003
 [7,] 206.9977 189.9343
 [8,] 212.2908 196.7418
 [9,] 214.7606 203.4116
[10,] 217.8882 189.9789

sum(m)
[1] 19950.41

. . .

mean(m)
[1] 199.5041

. . .

rowSums(m)
 [1] 1962.914 1989.575 1998.712 2024.182 2008.647 1981.854 2035.413 2002.699
 [9] 1989.569 1956.847

. . .

colSums(m)
 [1] 1971.157 1990.245 1961.819 1993.708 1977.400 2061.144 1980.932 2023.404
 [9] 2031.023 1959.579

. . .

# standardize the matrix
z <- (m - mean(m)) / sd(m)
z
            [,1]        [,2]          [,3]        [,4]       [,5]        [,6]
 [1,]  1.3010524 -0.76925390  0.0009222064 -0.28957800 -2.4127674  0.78413741
 [2,]  0.6907154 -1.56023846  1.0356537382  0.49548666  1.3311824 -0.11567656
 [3,]  0.5102566  0.35291388 -0.6011942053  1.01352341  0.5444628  0.10044486
 [4,]  0.8156136  1.53470366 -0.4112982781 -0.78663987  1.4771374  0.24713157
 [5,] -0.1758229  0.07792372 -0.9760202871  1.08257404 -0.6076838  0.21195789
 [6,] -0.3003824 -0.51064257 -0.6795541673  0.01853211 -0.9626564 -0.07822095
 [7,] -0.5965995  0.49933839  0.6186937589 -0.29718858 -0.1620174  2.65967021
 [8,] -1.2066698  0.64767736 -0.5106731003 -0.03011399 -0.5834837  2.02409910
 [9,] -1.6389436  1.14870403 -1.2026234041 -2.18414217  0.3762675  1.70542220
[10,] -1.7751178 -1.89825813 -0.5787428911  0.84496235 -0.7553444 -0.96319978
             [,7]       [,8]       [,9]       [,10]
 [1,] -0.71916313 -0.5036565  0.3465246 -0.93418102
 [2,] -0.50489283 -1.0309383 -1.2095282  0.32443028
 [3,]  1.65915710 -1.1023352 -1.2445489 -0.86751730
 [4,] -0.05999098 -0.1632940  0.3015206 -0.05597287
 [5,] -0.19144388  1.7952960  0.2953917 -0.15868964
 [6,]  0.89839717  0.6267207 -0.2739126 -0.05011848
 [7,] -0.88776743  2.3885179  0.7454463 -0.95198804
 [8,] -1.24072882  0.6644735  1.2719904 -0.27478419
 [9,] -0.48979202 -0.1656812  1.5176798  0.38870521
[10,]  0.13265409  0.3123405  1.8288102 -0.94754278

Stripping attributes

Many operations will strip attributes (which makes storing important things in them a bit precarious)

v
          index value
the first     1     4
second        2     5
III           3     6
attr(,"matrix_mean")
[1] 3.5
rowSums(v)
the first    second       III 
        5         7         9 

. . .

attributes(rowSums(v))
$names
[1] "the first" "second"    "III"      

. . .

  • Generally names are maintained

  • Sometimes, dim is maintained, sometimes not

  • All else is stripped
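
Subsetting is another common place where this happens. A minimal sketch, with a made-up "note" attribute and illustrative object names:

```r
vals <- 1:6
attr(vals, "note") <- "collected in spring"   # hypothetical custom attribute
names(vals) <- letters[1:6]

sub <- vals[1:3]     # subsetting keeps the names...
attributes(sub)      # ...but the custom "note" attribute is gone
```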

More on names()

The names attribute corresponds to the individual elements within a vector

names(v)
NULL
names(v) <- letters[1:6]
v
          index value
the first     1     4
second        2     5
III           3     6
attr(,"matrix_mean")
[1] 3.5
attr(,"names")
[1] "a" "b" "c" "d" "e" "f"

More on names()

Perhaps more straightforward

v3a <- c(a = 5, b = 7, c = 12)
v3a
 a  b  c 
 5  7 12 
names(v3a)
[1] "a" "b" "c"
attributes(v3a)
$names
[1] "a" "b" "c"

names() alternatives

v3b <- c(5, 7, 12)
names(v3b) <- c("a", "b", "c")
v3b
 a  b  c 
 5  7 12 

. . .

v3c <- setNames(c(5, 7, 12), c("a", "b", "c"))
v3c
 a  b  c 
 5  7 12 

. . .

  • Note that names() is not the same thing as colnames(), but, somewhat confusingly, both work to rename the variables (columns) of a data frame. We’ll talk more about why this is
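
A brief sketch of the difference, using a made-up data frame and matrix: a data frame is a list of columns, so its names are its column names, while a matrix keeps column names in dimnames.

```r
df <- data.frame(x = 1:2, y = 3:4)
mat <- matrix(1:4, ncol = 2, dimnames = list(NULL, c("x", "y")))

# For a data frame the two functions agree
identical(names(df), colnames(df))

# For a matrix, names() refers to the individual elements, which are unnamed here
names(mat)
colnames(mat)
```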

Why names might be helpful

Subsetting

v
          index value
the first     1     4
second        2     5
III           3     6
attr(,"matrix_mean")
[1] 3.5
attr(,"names")
[1] "a" "b" "c" "d" "e" "f"
v["b"]
b 
2 
v["e"]
e 
5 

Implementation of factors

fct <- factor(c("a", "a", "b", "c"))
typeof(fct)
[1] "integer"

. . .

Weird! Factors are built on integer vectors

. . .

attributes(fct)
$levels
[1] "a" "b" "c"

$class
[1] "factor"

. . .

str(fct)
 Factor w/ 3 levels "a","b","c": 1 1 2 3

More manually

# First create integer vector
int <- c(1L, 1L, 2L, 3L, 1L, 3L)

# assign some levels
attr(int, "levels") <- c("red", "green", "blue")

# change the class to a factor
class(int) <- "factor"

int
[1] red   red   green blue  red   blue 
Levels: red green blue

This can make things tricky

age <- factor(sample(c("baby", 1:10), 100, replace = TRUE))
str(age)
 Factor w/ 11 levels "1","10","2","3",..: 7 1 5 8 8 9 4 6 5 9 ...
age
  [1] 6    1    4    7    7    8    3    5    4    8    10   3    4    10   10  
 [16] 4    2    10   6    5    baby 3    7    2    10   8    7    10   7    4   
 [31] 6    9    8    8    6    7    9    1    7    6    9    9    8    7    2   
 [46] baby baby 2    4    7    9    3    9    7    5    3    7    9    9    10  
 [61] 3    9    3    3    10   10   baby 4    3    5    baby 4    9    9    10  
 [76] 7    1    2    4    8    3    10   6    3    8    7    1    9    10   6   
 [91] 9    6    1    3    9    9    9    4    3    9   
Levels: 1 10 2 3 4 5 6 7 8 9 baby

. . .

What if we wanted to convert this to numeric?

library(dplyr) # provides count(), mutate(), and select()

data.frame(age) |> 
  count(age) |> 
  mutate(age_numeric = as.numeric(age)) |> 
  select(starts_with("age"), n)
    age age_numeric  n
1     1           1  5
2    10           2 12
3     2           3  5
4     3           4 13
5     4           5 10
6     5           6  4
7     6           7  8
8     7           8 13
9     8           9  8
10    9          10 17
11 baby          11  5

. . .

These are the integers associated with the factor levels, so as.numeric() will not give us the results we want

Fix: “baby” to NA

First convert to character, then to numeric (you can ignore the warning in this case)


data.frame(age) |> 
  mutate(
    age_chr = as.character(age),
    age_num = as.numeric(age_chr)
  ) |> 
  count(age, age_chr, age_num)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `age_num = as.numeric(age_chr)`.
Caused by warning:
! NAs introduced by coercion
    age age_chr age_num  n
1     1       1       1  5
2    10      10      10 12
3     2       2       2  5
4     3       3       3 13
5     4       4       4 10
6     5       5       5  4
7     6       6       6  8
8     7       7       7 13
9     8       8       8  8
10    9       9       9 17
11 baby    baby      NA  5

Fix: “baby” to 0

data.frame(age) |> 
  mutate(
    age_chr = as.character(age),
    age_num = ifelse(age_chr == "baby", 0, as.numeric(age_chr))
  ) |> 
  count(age, age_chr, age_num)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `age_num = ifelse(age_chr == "baby", 0, as.numeric(age_chr))`.
Caused by warning in `ifelse()`:
! NAs introduced by coercion
    age age_chr age_num  n
1     1       1       1  5
2    10      10      10 12
3     2       2       2  5
4     3       3       3 13
5     4       4       4 10
6     5       5       5  4
7     6       6       6  8
8     7       7       7 13
9     8       8       8  8
10    9       9       9 17
11 baby    baby       0  5

Summary: factor to numeric
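
The pattern from the last few slides in one place, as a minimal sketch on a small made-up factor f:

```r
f <- factor(c("10", "2", "baby", "2"))

# as.numeric() on a factor returns the integer codes, not the labels
as.numeric(f)

# Going through character recovers the labels; "baby" cannot be parsed and becomes NA
suppressWarnings(as.numeric(as.character(f)))
```

suppressWarnings() is used here only because the NA is expected; in real work it is usually worth reading the warning first.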

Implementation of dates

date <- Sys.Date()
typeof(date)
[1] "double"

. . .

Huh?

Dates are built on top of double vectors
(weird, like factors are built on integer vectors)

. . .

attributes(date)
$class
[1] "Date"

. . .

attributes(date) <- NULL
date
[1] 20549
  • This number represents the days passed since January 1, 1970, known as the Unix epoch

. . .

unclass(as.Date("1970-01-02"))
[1] 1
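
The day count works in both directions. A short sketch, assuming the value 20549 from the earlier output and an explicit origin:

```r
d <- as.Date("2026-04-06")
unclass(d)                                  # days since 1970-01-01

# Rebuild a Date from a day count
rebuilt <- as.Date(20549, origin = "1970-01-01")
format(rebuilt)

# Date arithmetic is arithmetic on the underlying double
as.numeric(as.Date("1970-01-10") - as.Date("1970-01-01"))
```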

A bit more on classes

Why do these all print different things?

library(forcats) # provides the gss_cat data used below
summary(mtcars[, 1:2])
      mpg             cyl       
 Min.   :10.40   Min.   :4.000  
 1st Qu.:15.43   1st Qu.:4.000  
 Median :19.20   Median :6.000  
 Mean   :20.09   Mean   :6.188  
 3rd Qu.:22.80   3rd Qu.:8.000  
 Max.   :33.90   Max.   :8.000  
summary(gss_cat$marital)
    No answer Never married     Separated      Divorced       Widowed 
           17          5416           743          3383          1807 
      Married 
        10117 
m <- lm(mpg ~ cyl, mtcars)
summary(m)

Call:
lm(formula = mpg ~ cyl, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9814 -2.1185  0.2217  1.0717  7.5186 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.8846     2.0738   18.27  < 2e-16 ***
cyl          -2.8758     0.3224   -8.92 6.11e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared:  0.7262,    Adjusted R-squared:  0.7171 
F-statistic: 79.56 on 1 and 30 DF,  p-value: 6.113e-10

What are the classes?

class(mtcars[, 1:2])
[1] "data.frame"

. . .

class(gss_cat$marital)
[1] "factor"

. . .

class(m)
[1] "lm"

S3 methods

When you call summary(), it looks for a method (function) for that specific class

  • summary(mtcars[, 1:2]) becomes
    summary.data.frame(mtcars[, 1:2])
  • summary(gss_cat$marital) becomes
    summary.factor(gss_cat$marital)
  • summary(m) becomes
    summary.lm(m)
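
To make the dispatch concrete, here is a toy sketch of the same mechanism. The generic describe() and its methods are hypothetical, written only for this example; UseMethod() is the base R function that picks a method based on class.

```r
# A hypothetical generic; UseMethod() dispatches on the class of x
describe <- function(x, ...) UseMethod("describe")

describe.factor <- function(x, ...) {
  paste("factor with", nlevels(x), "levels")
}

describe.data.frame <- function(x, ...) {
  paste("data frame with", ncol(x), "columns")
}

# Fallback used when no class-specific method exists
describe.default <- function(x, ...) {
  paste("object of type", typeof(x))
}

describe(factor(c("a", "b")))
describe(mtcars)
describe(1:3)
```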

Quick aside

Too briefly, S3 allows functions to behave differently depending on the class of the objects they are given

. . .

Naming functions

Because S3 is so common in R, I recommend against including dots . in function names

. . .

summary.data.frame() is less clear than it would be if it were summary.data_frame()

. . .

Classes and methods are not something I expect you to know deeply, but I want you to be aware of them

Missing values

Missing values beget missing values

NA > 5
[1] NA

. . .

NA * 7
[1] NA

. . .

I like this one

!NA
[1] NA

. . .

What about this one?

NA == NA
[1] NA

. . .

It is correct because there’s no reason to presume that one missing value is or is not equal to another missing value

When missing values don’t propagate

NA | TRUE
[1] TRUE

. . .

x <- c(NA, 3, NA, 5)
any(x > 4)
[1] TRUE

. . .

any(x > 6)
[1] NA
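
A sketch of the reasoning behind these results: the answer is only NA when the missing values could change it.

```r
x <- c(NA, 3, NA, 5)

# One known TRUE is enough for any(), whatever the NAs turn out to be
any(x > 4)

# No known TRUE, and the NAs could go either way, so the answer is unknown
any(x > 6)

# Dropping the NAs first gives a definite answer
any(x > 6, na.rm = TRUE)

# The same logic applies to &: FALSE wins regardless of the missing value
NA & FALSE
```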

How to test missingness?

We’ve already seen the following doesn’t work

x == NA
[1] NA NA NA NA

. . .

Instead, use is.na()

is.na(x)
[1]  TRUE FALSE  TRUE FALSE

Different NAs?

Technically there are four missing values, one for each of the atomic types:

  • NA (logical)
  • NA_integer_ (integer)
  • NA_real_ (double)
  • NA_character_ (character)

This distinction is usually unimportant because NA will be automatically coerced to the correct type when needed
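
A quick way to see both the four types and the automatic coercion:

```r
typeof(NA)
typeof(NA_integer_)
typeof(NA_real_)
typeof(NA_character_)

# Plain NA is logical, the least flexible type, so it adapts to its neighbors
typeof(c(NA, 1L))
typeof(c(NA, "a"))
```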

Lists

Lists

  • Lists are vectors, but not atomic vectors
  • Fundamental difference - each element can be a different type
list("a", 7L, 3.25, TRUE)
[[1]]
[1] "a"

[[2]]
[1] 7

[[3]]
[1] 3.25

[[4]]
[1] TRUE

. . .

Sneak peek at future content

lapply(list("a", 7L, 3.25, TRUE), class)
[[1]]
[1] "character"

[[2]]
[1] "integer"

[[3]]
[1] "numeric"

[[4]]
[1] "logical"

Lists

  • Technically, each element of the list is a vector, possibly atomic
  • The prior example included all scalars, which are vectors of length 1
  • Lists do not require all elements to be the same length

l <- list(
  c("a", "b", "c"),
  rnorm(5),
  c(7L, 2L),
  c(TRUE, TRUE, FALSE, TRUE)
)
l
[[1]]
[1] "a" "b" "c"

[[2]]
[1]  0.06212625  0.31806749 -1.38307597  0.45422692 -1.40553617

[[3]]
[1] 7 2

[[4]]
[1]  TRUE  TRUE FALSE  TRUE

Check the list

typeof(l)
[1] "list"
attributes(l)
NULL
str(l)
List of 4
 $ : chr [1:3] "a" "b" "c"
 $ : num [1:5] 0.0621 0.3181 -1.3831 0.4542 -1.4055
 $ : int [1:2] 7 2
 $ : logi [1:4] TRUE TRUE FALSE TRUE

Data frames as lists

A data frame is just a special case of a list, where all the elements are of the same length.

l_df <- list(
  a = c("red", "blue"),
  b = rnorm(2),
  c = c(7L, 2L),
  d = c(TRUE, FALSE)
)
l_df
$a
[1] "red"  "blue"

$b
[1] 0.6371395 1.2361582

$c
[1] 7 2

$d
[1]  TRUE FALSE
data.frame(l_df)
     a         b c     d
1  red 0.6371395 7  TRUE
2 blue 1.2361582 2 FALSE
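
The relationship can also be checked from the other direction. A small sketch with a made-up data frame df:

```r
df <- data.frame(a = c("red", "blue"), c = c(7L, 2L))

typeof(df)     # the underlying type is list
is.list(df)
length(df)     # length counts columns, like the elements of a list
df[["c"]]      # list-style extraction works on columns
```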

Subsetting Lists

A nested list

Lists are often complicated objects. Let’s create a somewhat complicated one

x <- c(a = 3, b = 5, c = 7)
l <- list(
  x = x,
  x2 = c(x, x),
  x3 = list(
    vect = x,
    squared = x^2,
    cubed = x^3)
)
l
$x
a b c 
3 5 7 

$x2
a b c a b c 
3 5 7 3 5 7 

$x3
$x3$vect
a b c 
3 5 7 

$x3$squared
 a  b  c 
 9 25 49 

$x3$cubed
  a   b   c 
 27 125 343 

Subsetting lists

Multiple methods

  • Most common: $, [, and [[
l[1]
$x
a b c 
3 5 7 
typeof(l[1])
[1] "list"

. . .

l[[1]]
a b c 
3 5 7 
typeof(l[[1]])
[1] "double"

. . .

l[[1]]["c"]
c 
7 

Which bracket to use?

x

x[1]

x[[1]]

x[[1]][[1]]

Another analogy

. . .

Named list

Because the elements of the list are named, we can also use $, just like with a data frame (which is a list)

l$x2
a b c a b c 
3 5 7 3 5 7 
l$x3
$vect
a b c 
3 5 7 

$squared
 a  b  c 
 9 25 49 

$cubed
  a   b   c 
 27 125 343 

Subsetting nested lists

Multiple $ if all are named

l$x3$squared
 a  b  c 
 9 25 49 

. . .

Note this doesn’t work on named elements of an atomic vector, just the named elements of a list

l$x3$squared$b
Error in `l$x3$squared$b`:
! $ operator is invalid for atomic vectors

. . .

but we could do something like…

l$x3$squared["b"]
 b 
25 

Alternatives

  • You can use logical
l[c(TRUE, FALSE, TRUE)]
$x
a b c 
3 5 7 

$x3
$x3$vect
a b c 
3 5 7 

$x3$squared
 a  b  c 
 9 25 49 

$x3$cubed
  a   b   c 
 27 125 343 
  • Indexing works too
l[c(1, 3)]
$x
a b c 
3 5 7 

$x3
$x3$vect
a b c 
3 5 7 

$x3$squared
 a  b  c 
 9 25 49 

$x3$cubed
  a   b   c 
 27 125 343 

Careful with your brackets

l[[c(TRUE, FALSE, FALSE)]]
Error in `l[[c(TRUE, FALSE, FALSE)]]`:
! recursive indexing failed at level 2
  • Why doesn’t the above work?
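
A sketch of what appears to be going on: [[ pulls out exactly one element, so a vector longer than one is read as a path into nested elements rather than as a selection mask. The example rebuilds the nested list l from the earlier slide so it runs on its own.

```r
x <- c(a = 3, b = 5, c = 7)
l <- list(x = x, x2 = c(x, x), x3 = list(vect = x, squared = x^2, cubed = x^3))

# [ accepts a logical mask and returns a (sub)list
length(l[c(TRUE, FALSE, FALSE)])

# [[ with a vector is read as a path: "element 3, then element 2 inside it"
identical(l[[c(3, 2)]], l[[3]][[2]])
identical(l[[c("x3", "squared")]], l$x3$squared)
```

A logical mask is not a sensible path, which is why the bracket type matters here.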

Subsetting in multiple dimensions

  • Generally we deal with 2d data frames

  • If there are two dimensions, we separate the [] subsetting with a comma [row, column]

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mtcars[3, 4]
[1] 93
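
The [row, column] positions also accept names and logical vectors. A short sketch using the built-in mtcars data (four_cyl is just an illustrative name):

```r
# Same cell as mtcars[3, 4], selected by row name and column name
mtcars["Datsun 710", "hp"]

# Logical vector for rows, character vector for columns
four_cyl <- mtcars[mtcars$cyl == 4, c("mpg", "hp")]
dim(four_cyl)   # rows with 4 cylinders, 2 columns
```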

Empty indicators

An empty indicator implies “all”

. . .

  • Select the entire 4th column
mtcars[ ,4]
 [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52
[20]  65  97 150 150 245 175  66  91 113 264 175 335 109
  • Select the entire 4th row
mtcars[4, ]
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1

Data types returned

By default, selecting a single column returns a vector and selecting a single row returns a one-row data frame; either result can itself be subset

The following are equivalent

mtcars[4, c("mpg", "hp")]
                mpg  hp
Hornet 4 Drive 21.4 110
mtcars[4, ][c("mpg", "hp")]
                mpg  hp
Hornet 4 Drive 21.4 110
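
A quick way to see what each form returns is to inspect the results. This is only a sketch; col4, row4, a, and b are illustrative names.

```r
col4 <- mtcars[, 4]   # one column
row4 <- mtcars[4, ]   # one row

is.numeric(col4)      # TRUE: a plain numeric vector
is.data.frame(row4)   # TRUE: still a data frame (which is a list of columns)

# Because row4 is a list of columns, [ with names selects columns from it
a <- mtcars[4, c("mpg", "hp")]
b <- row4[c("mpg", "hp")]
all.equal(a, b)
```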

Return a data frame

Often, you don’t want the vector returned, but rather the modified data frame.

  • Specify drop = FALSE
mtcars[ ,4]
 [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52
[20]  65  97 150 150 245 175  66  91 113 264 175 335 109
mtcars[ ,4, drop = FALSE]
                     hp
Mazda RX4           110
Mazda RX4 Wag       110
Datsun 710           93
Hornet 4 Drive      110
Hornet Sportabout   175
Valiant             105
Duster 360          245
Merc 240D            62
Merc 230             95
Merc 280            123
Merc 280C           123
Merc 450SE          180
Merc 450SL          180
Merc 450SLC         180
Cadillac Fleetwood  205
Lincoln Continental 215
Chrysler Imperial   230
Fiat 128             66
Honda Civic          52
Toyota Corolla       65
Toyota Corona        97
Dodge Challenger    150
AMC Javelin         150
Camaro Z28          245
Pontiac Firebird    175
Fiat X1-9            66
Porsche 914-2        91
Lotus Europa        113
Ford Pantera L      264
Ferrari Dino        175
Maserati Bora       335
Volvo 142E          109
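
Because a data frame is a list, list-style subsetting (no comma) is another way to control what comes back. A minimal sketch with illustrative object names:

```r
hp_df  <- mtcars["hp"]     # single bracket, no comma: a one-column data frame
hp_vec <- mtcars[["hp"]]   # double bracket: the column itself as a vector

is.data.frame(hp_df)   # TRUE
dim(hp_df)             # 32 rows, 1 column
is.numeric(hp_vec)     # TRUE
```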

tibbles

Note that dropping a single column down to a vector is the default for a data.frame but NOT for a tibble

Maintains data frame

mtcars_tbl <- tibble::as_tibble(mtcars)
mtcars_tbl[ ,4]
# A tibble: 32 × 1
      hp
   <dbl>
 1   110
 2   110
 3    93
 4   110
 5   175
 6   105
 7   245
 8    62
 9    95
10   123
# ℹ 22 more rows

You can override this

mtcars_tbl[ ,4, drop = TRUE]
 [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52
[20]  65  97 150 150 245 175  66  91 113 264 175 335 109
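
With tibbles, [[ and $ are the usual ways to pull a column out as a vector. The sketch below assumes the {tibble} package is installed and skips itself otherwise; hp_tbl and hp_vec are illustrative names.

```r
if (requireNamespace("tibble", quietly = TRUE)) {
  mtcars_tbl <- tibble::as_tibble(mtcars)

  hp_tbl <- mtcars_tbl[, "hp"]    # stays a one-column tibble
  hp_vec <- mtcars_tbl[["hp"]]    # plain numeric vector (same as mtcars_tbl$hp)

  tibble::is_tibble(hp_tbl)       # TRUE
  is.numeric(hp_vec)              # TRUE
}
```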

More than two dimensions

Depending on your applications, you may not run into arrays much

array <- 1:12
dim(array) <- c(2, 3, 2)
array
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

Subset array

Select just the second matrix

array[ , ,2]
     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

. . .

Select first column of each matrix

array[ ,1, ]
     [,1] [,2]
[1,]    1    7
[2,]    2    8
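
The same rules extend to each dimension: one index per dimension, and empty means "all". A short sketch, using the name arr (an illustrative choice, so the object is not confused with the base array() function):

```r
arr <- array(1:12, dim = c(2, 3, 2))

arr[2, 3, 1]                      # row 2, column 3, first matrix

dim(arr[, 1, ])                   # 2 2: the column dimension was dropped
dim(arr[, 1, , drop = FALSE])     # 2 1 2: all three dimensions kept
```

As with data frames, drop = FALSE keeps the original number of dimensions.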

Back to lists

Why are lists so useful?

  • Much more flexible than atomic vectors
  • Often returned by functions, for example, lm
m <- lm(mpg ~ hp, mtcars)
str(m)
List of 12
 $ coefficients : Named num [1:2] 30.0989 -0.0682
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "hp"
 $ residuals    : Named num [1:32] -1.594 -1.594 -0.954 -1.194 0.541 ...
  ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
 $ effects      : Named num [1:32] -113.65 -26.046 -0.556 -0.852 0.67 ...
  ..- attr(*, "names")= chr [1:32] "(Intercept)" "hp" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:32] 22.6 22.6 23.8 22.6 18.2 ...
  ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:32, 1:2] -5.657 0.177 0.177 0.177 0.177 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "hp"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.18 1.08
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 30
 $ xlevels      : Named list()
 $ call         : language lm(formula = mpg ~ hp, data = mtcars)
 $ terms        :Classes 'terms', 'formula'  language mpg ~ hp
  .. ..- attr(*, "variables")= language list(mpg, hp)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "mpg" "hp"
  .. .. .. ..$ : chr "hp"
  .. ..- attr(*, "term.labels")= chr "hp"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(mpg, hp)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "mpg" "hp"
 $ model        :'data.frame':  32 obs. of  2 variables:
  ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
  ..$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ hp
  .. .. ..- attr(*, "variables")= language list(mpg, hp)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "mpg" "hp"
  .. .. .. .. ..$ : chr "hp"
  .. .. ..- attr(*, "term.labels")= chr "hp"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(mpg, hp)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "hp"
 - attr(*, "class")= chr "lm"
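
Since the fitted model is a list, everything from the subsetting slides applies to it. A minimal sketch pulling pieces out of m:

```r
m <- lm(mpg ~ hp, data = mtcars)

typeof(m)    # "list"
class(m)     # "lm"
names(m)     # the 12 element names shown by str()

m$df.residual                       # a single integer
m[["coefficients"]][["hp"]]         # slope for hp, as a bare number
m$qr$rank                           # nested list: chain $ as before
```

Extractor functions such as coef(m) return the same information and are usually the safer choice in real code.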

Summary

  • All elements of an atomic vector must be the same type
    • implicit coercion occurs if not (and you haven’t specified the coercion explicitly)
  • Lists are also vectors, but not atomic vectors
    • Each element can be of a different type and length
    • Incredibly flexible, but often a little more difficult to get the hang of, particularly with subsetting
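
The summary points can be checked in a few lines. This is only a sketch; the object names are illustrative.

```r
# Implicit coercion: atomic vectors move to the most flexible type present
int_dbl <- c(1L, 2.5)
all_chr <- c(1, "a", TRUE)
typeof(int_dbl)   # "double"
typeof(all_chr)   # "character"

# Explicit coercion: you choose the target type
as_int <- as.integer(c("1", "2"))
typeof(as_int)    # "integer"

# A list keeps each element's own type
mixed <- list(1, "a", TRUE)
vapply(mixed, typeof, character(1))   # "double" "character" "logical"
```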

Next time

Before next class

Footnotes

  1. Note there are two others (complex and raw), but we almost never care about them