Differences in CBM-R Timing: Automatic Speech Recognition (ASR) vs Humans

The purpose of this post is to determine whether time durations differ between ASR and Traditional (human) scoring of traditional CBM-R and CORE passages.

Introduction

We examined the differences in two scoring methods for the time duration scores of curriculum-based measurement of oral reading fluency (CBM-R). The two scoring methods were: (1) Traditional - the real-time human scores, comparable to traditional CBM-R assessments in schools; and (2) ASR - automatic speech recognition scores. We also explored the effect of passage length using: (1) easyCBM passages as traditional CBM-R passages of about 250 words read for 60 seconds; and CORE passages read in their entirety that were (2) long, about 85 words, (3) medium, about 50 words, and (4) short, about 25 words. These comparisons allowed for the analysis of the potential net gain of ASR compared to current school practices.

These results are part of our larger Content & Convergent Evidence Study. For details about the Content & Convergent Evidence Study procedures, including information on the sample, CBM-R passages, administration, and scoring methods, go here.

Passage-level results of words correct per minute (WCPM) scores for comparisons of scoring methods can be found here, and results comparing passage lengths can be found here.

Summary

We found statistically significant differences between the ASR and Traditional time durations. These differences favor ASR, under the assumption that the ASR timings, which record the duration of each word and of the silences between words in centiseconds, are very nearly exact. The ASR time durations were of course not infallible across the 13,121 audio recordings, but in general they should be much more accurate than Traditional CBM-R times, which are susceptible to many types of human error.

These findings further support the use of ASR in schools to score CBM-R assessments.

Analysis

We applied a mixed-effects model for time duration scores separately for each of Grades 2 through 4, with random effects for students and passages, and fixed effects for scoring method (two levels: ASR and Traditional), passage length (four levels: easyCBM, short, medium, and long), and their interaction (passage length × scoring method). For documentation of the model-building process, go here.


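# `dat` is a placeholder name for one grade's data; the model was fit separately by grade.
lme4::lmer(time ~ 1 + (1 | student_id) + (1 | passage_id) +
             passage_length + scoring_method + passage_length:scoring_method,
           data = dat, REML = FALSE)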

Results

The following table shows the results of the final time duration model, with random effects for students and passages, and fixed effects for passage length, scoring method, and their interaction.

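| Term | Grade 2 Estimate | Grade 2 SE | Grade 2 t-value | Grade 3 Estimate | Grade 3 SE | Grade 3 t-value | Grade 4 Estimate | Grade 4 SE | Grade 4 t-value |
|---|---|---|---|---|---|---|---|---|---|
| Fixed Effects | | | | | | | | | |
| Intercept¹ | 52.49 | 4.11 | 12.79 | 52.63 | 3.53 | 14.93 | 53.55 | 2.28 | 23.48 |
| Long | 6.76 | 4.18 | 1.62 | -2.72 | 3.59 | -0.76 | -11.37 | 2.31 | -4.91 |
| Medium | -16.49 | 4.13 | -3.99 | -22.99 | 3.55 | -6.47 | -29.15 | 2.30 | -12.68 |
| Short | -31.12 | 4.09 | -7.60 | -37.26 | 3.53 | -10.57 | -41.03 | 2.28 | -18.03 |
| Traditional | 3.83 | 1.25 | 3.06 | 4.57 | 0.99 | 4.62 | 3.67 | 0.90 | 4.10 |
| Long:Traditional | -6.42 | 1.35 | -4.74 | -6.25 | 1.04 | -6.03 | -4.96 | 0.93 | -5.35 |
| Medium:Traditional | -5.39 | 1.31 | -4.10 | -5.93 | 1.02 | -5.82 | -4.57 | 0.92 | -4.97 |
| Short:Traditional | -5.26 | 1.28 | -4.11 | -5.42 | 1.00 | -5.39 | -4.41 | 0.91 | -4.86 |
| Random Effects (SD) | | | | | | | | | |
| Students | 10.14 | | | 8.53 | | | 5.97 | | |
| Passages | 3.96 | | | 3.42 | | | 2.16 | | |
| Residual | 8.68 | | | 6.30 | | | 5.03 | | |

¹ The intercept represents easyCBM passages with ASR time duration.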

Based on the model’s results, we calculated pairwise comparisons from the estimated marginal means to examine the effects of scoring method.
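As a rough illustration of what these comparisons capture, the Traditional minus ASR difference implied by the fixed effects can be recovered by hand: for easyCBM passages it is the Traditional coefficient alone, and for each CORE length it is the Traditional coefficient plus the matching interaction term. The Python sketch below uses only the estimates reported in the table; the function and variable names are our own, and this is simple arithmetic rather than the procedure used to produce the reported comparisons.

```python
# Fixed-effect estimates (in seconds) copied from the results table.
# Reference cell of the model: easyCBM passages timed by ASR.
TRADITIONAL = {"Grade 2": 3.83, "Grade 3": 4.57, "Grade 4": 3.67}
INTERACTION = {
    "Long": {"Grade 2": -6.42, "Grade 3": -6.25, "Grade 4": -4.96},
    "Medium": {"Grade 2": -5.39, "Grade 3": -5.93, "Grade 4": -4.57},
    "Short": {"Grade 2": -5.26, "Grade 3": -5.42, "Grade 4": -4.41},
}


def traditional_minus_asr(grade, passage_length):
    """Model-implied Traditional - ASR time difference in seconds."""
    diff = TRADITIONAL[grade]
    if passage_length != "easyCBM":
        # Non-reference passage lengths add their interaction term.
        diff += INTERACTION[passage_length][grade]
    return round(diff, 2)


for length in ("easyCBM", "Long", "Medium", "Short"):
    row = {g: traditional_minus_asr(g, length) for g in TRADITIONAL}
    print(length, row)
```

Positive values mean the Traditional time was longer than the ASR time; negative values mean it was shorter.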

The figure below shows the estimated marginal means of time for each scoring method by grade and passage length. The 95% confidence intervals for all comparisons overlap, suggesting that the estimated time durations across scoring methods are relatively comparable.

To assist the interpretation of the results of the final model, we also report the statistical significance of the differences in marginal means, as well as Cohen’s (1988) d effect size estimates in the table below.

We examined the differences in time duration between the ASR and Traditional scoring methods and found that all pairwise comparisons were statistically significant at the .01 level (p < .01). On average, the Traditional time duration was about 4 seconds longer than the ASR duration for the easyCBM passages across grades, and about 1 to 2 seconds shorter for the shorter CORE passages. The effect sizes were large for the easyCBM passages across grades (d = -0.88 to -1.08) and small for the CORE passages (d = 0.10 to 0.14).

These time estimates directly affect the accuracy and reliability of words correct per minute (WCPM) scores, which in turn has implications for the consequential validity of decisions based on those scores, including eligibility for targeted instruction and progress monitoring.
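To make the link between timing and WCPM concrete, the sketch below applies the usual definition (words read correctly, divided by seconds, times 60) to two time durations that differ by the Grade 2 Traditional estimate for easyCBM passages. The words-correct count of 80 is a hypothetical value chosen only for illustration, and the two durations are taken from the model estimates rather than from any individual student.

```python
def wcpm(words_correct, seconds):
    """Words correct per minute from a count and a duration in seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 60.0 * words_correct / seconds


# Hypothetical student: 80 words read correctly (an assumed value).
words_correct = 80

# Grade 2 model estimates for easyCBM passages, from the results table.
asr_seconds = 52.49                       # intercept: easyCBM with ASR timing
traditional_seconds = asr_seconds + 3.83  # plus the Traditional estimate

asr_score = wcpm(words_correct, asr_seconds)
traditional_score = wcpm(words_correct, traditional_seconds)

print(f"ASR timing:         {asr_score:.1f} WCPM")
print(f"Traditional timing: {traditional_score:.1f} WCPM")
print(f"Difference:         {asr_score - traditional_score:.1f} WCPM")
```

Under these assumptions, the same reading performance scores about 6 WCPM lower with the longer Traditional time, which shows how a timing difference of a few seconds carries through to the score.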

Acknowledgments

The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A140203 to the University of Oregon. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.