Working smarter with {dplyr} 1.2.0

A talk on the new functions introduced in the latest {dplyr} release.
Presented by

Isabella Velásquez

Presented on

March 18, 2026



Details

  • 👥 R-Ladies Rome
  • 📆 18 March 2026 // 01:00 PM ET
  • 💻️ Virtual

Description

The tidyverse keeps evolving, and {dplyr} 1.2.0 brings a set of new tools that make data manipulation even more expressive and flexible. Whether you regularly work with R and the tidyverse, or you’re looking to sharpen your data wrangling skills, this session will help you better understand the design philosophy behind the latest {dplyr} release and how to apply it effectively.

Slides

Recording

Summary

This summary was generated by Claude Sonnet 4.6 and reviewed by me.

Today’s data

We’ll use salmonid mortality data from TidyTuesday, published by the Norwegian Veterinary Institute. The dataset covers monthly mortality data from 2020.

“The data is saved in an object called monthly_losses_data. And if we read in that CSV and take a look at the first few rows, you’ll notice that there are nine columns that cover things like species, date, geo group, region, losses, and a variety of variables that describe how those losses happened. And so just a really nice data set to practice some of the functions that we’re going to be taking a look at.”

monthly_losses_data <-
  readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2026/2026-03-17/monthly_losses_data.csv')
Rows: 2808 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): species, geo_group, region
dbl  (5): losses, dead, discarded, escaped, other
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(monthly_losses_data)
# A tibble: 6 × 9
  species date       geo_group region losses   dead discarded escaped other
  <chr>   <date>     <chr>     <chr>   <dbl>  <dbl>     <dbl>   <dbl> <dbl>
1 salmon  2020-01-01 area      1       31425  28126      3299       0     0
2 salmon  2020-01-01 area      2      324116 277888     46113       0   115
3 salmon  2020-01-01 area      3      844829 776983     63770       0  4076
4 salmon  2020-01-01 area      4      676852 623159     51823       0  1870
5 salmon  2020-01-01 area      5      109269  97627     11424       0   218
6 salmon  2020-01-01 area      6      548921 531193     15710       0  2018

A quick dplyr refresher

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.

“dplyr is a core member of the tidyverse which is a curated collection of R packages specifically for data science. All the tidyverse packages share this opinionated philosophy that data should be in what’s called a tidy format — each variable is a column, each observation is a row, each value is a cell.”

Here are the core functions we’ll use as a foundation:

arrange() — changes the ordering of rows:

monthly_losses_data |> arrange(date)

select() — picks variables based on their names:

monthly_losses_data |>
  select(species, dead, discarded, escaped, other)

summarise() — reduces multiple values down to a single summary:

monthly_losses_data |>
  summarize(mean_losses = mean(losses))

group_by() — performs any operation “by group”:

monthly_losses_data |>
  group_by(region) |>
  summarize(mean = mean(losses))

mutate() — adds new variables that are functions of existing variables:

monthly_losses_data |>
  mutate(total = dead + discarded + escaped + other)

case_when() — checks each condition in order and assigns a value based on the first match:

monthly_losses_data |>
  mutate(loss_rating =
           case_when(losses > 100000 ~ "High",
                     losses < 100000 ~ "Low"))

filter() — picks cases based on their values:

monthly_losses_data |> filter(region == "1")

“dplyr is a grammar. This kind of syntax — taking your data frame, using a pipe, using a verb, and then specifying the columns — that’s the general syntax that we’ll see throughout this presentation.”

dplyr 1.2.0

Everything covered below is explored in great detail in the dplyr 1.2.0 blog post by Davis Vaughan.

filter_out()

The problem

Have you ever stared at filter(region == "1") and wondered: am I keeping Region 1 or dropping it?

“The original filter function was optimized for keeping rows. And often, because it was the quickest tool at our disposal or the only tool that we had, we may have forced it to drop rows using negative logic.”

Here’s what that looks like in practice. Say we have a small dataset with an NA in the losses column:

monthly_losses_NA
# A tibble: 5 × 3
  species region losses
  <chr>   <chr>   <dbl>
1 salmon  1       31425
2 salmon  2      324116
3 salmon  3      844829
4 salmon  3      676852
5 salmon  3          NA

We want to drop rows where region == 3 and losses > 700000. Using the traditional approach:

monthly_losses_NA |>
  filter(!(region == 3 & losses > 700000))
# A tibble: 3 × 3
  species region losses
  <chr>   <chr>   <dbl>
1 salmon  1       31425
2 salmon  2      324116
3 salmon  3      676852

The row with NA losses was silently dropped too — not what we wanted. To handle NAs correctly with filter(), we’d need:

monthly_losses_NA |>
  filter(
    !((region == 3 & !is.na(region)) &
             (losses > 700000 & !is.na(losses)))
    )
# A tibble: 4 × 3
  species region losses
  <chr>   <chr>   <dbl>
1 salmon  1       31425
2 salmon  2      324116
3 salmon  3      676852
5 salmon  3          NA

The solution: filter_out()

“Rather than trying to think through negations and logic and things like that, just remember: use filter() to keep rows, use filter_out() to drop rows.”

monthly_losses_NA |>
  filter_out(region == 3 & losses > 700000)
# A tibble: 4 × 3
  species region losses
  <chr>   <chr>   <dbl>
1 salmon  1       31425
2 salmon  2      324116
3 salmon  3      676852
4 salmon  3          NA

With filter_out(), dplyr assumes that if a condition evaluates to NA for a row (in other words, the answer is unknown), you probably don’t want to drop that row. It only drops rows that definitively match your criteria.

when_any() and when_all()

The problem

Combining multiple OR conditions with filter() leads to deeply nested, error-prone code:

monthly_losses_filters |>
  filter(
    (region %in% c("7", "8") & losses > 400000) |
           (region %in% c("2", "9") & losses > 300000)
    )
# A tibble: 4 × 3
# Groups:   region [4]
  species region losses
  <chr>   <chr>   <dbl>
1 salmon  2      324116
2 salmon  7      475115
3 salmon  8      442659
4 salmon  9      311127

“It starts to indent. You have to remember where you put your parentheses because there are a lot of them. As you can imagine, it can just get very unwieldy to write something like this, especially if you have many OR statements.”

The solution: when_any() and when_all()

Use when_any() for OR conditions and when_all() for AND conditions:

monthly_losses_filters |>
  filter(
    when_any(
      (region %in% c("7", "8") & losses > 400000),
      (region %in% c("2", "9") & losses > 300000)
    )
  )
# A tibble: 4 × 3
# Groups:   region [4]
  species region losses
  <chr>   <chr>   <dbl>
1 salmon  2      324116
2 salmon  7      475115
3 salmon  8      442659
4 salmon  9      311127

monthly_losses_filters |>
  filter(
    when_all(
      region %in% c("7", "8"),
      losses > 400000
    )
  )
# A tibble: 2 × 3
# Groups:   region [2]
  species region losses
  <chr>   <chr>   <dbl>
1 salmon  7      475115
2 salmon  8      442659

“when_any() and when_all() continue this theme of intent-based coding — you lead with exactly what it is that you want. Your requirements are just separated by a comma, everything is indented at the same level, and hopefully it is much easier for a colleague to follow your logic.”

Both helpers also work with the new filter_out().
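As a sketch (assuming the dplyr 1.2.0 API described above), combining the two reads naturally: drop every row that satisfies either compound condition.

```r
# Sketch, assumes dplyr >= 1.2.0 (filter_out() and when_any() are new there).
# Drop any row matching either compound condition; rows where the
# conditions are NA are kept, per filter_out()'s semantics.
monthly_losses_filters |>
  filter_out(
    when_any(
      region %in% c("7", "8") & losses > 400000,
      region %in% c("2", "9") & losses > 300000
    )
  )
```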

New recoding and replacing functions

Some background

Recoding used to involve choosing among if_else(), case_when(), recode(), or recode() combined with rlang’s !!! splicing operator — each with its own quirky syntax.

“There are so many ways of recoding — if/else, case_when, recode, recode with rlang — and each one is very specific in how it works, each one brings their own set of syntax, and it could get really confusing.”
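To make the contrast concrete, here is the same two-level recode written with each of the older tools (this runs with current dplyr; recode() is the superseded one):

```r
library(dplyr)

x <- c("1", "2", "1")

# if_else(): fine for two levels, awkward beyond that
if_else(x == "1", "Jæren", "Ryfylke")

# case_when(): condition-based, one formula per level
case_when(x == "1" ~ "Jæren",
          x == "2" ~ "Ryfylke")

# recode(): value-based, but the old-to-new direction is easy to mix up
recode(x, "1" = "Jæren", "2" = "Ryfylke")
```

All three return the same character vector here; the differences only start to bite as the number of levels and edge cases grows.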

Recoding vs. replacing

Two important distinctions:

  • Recoding — creating an entirely new column using values from an existing column
  • Replacing — partially updating an existing column with new values

The new family

dplyr 1.2.0 introduces three new functions alongside case_when():

  Task      Match by      Function
  Recode    Conditions    case_when()
  Recode    Values        recode_values()
  Replace   Conditions    replace_when()
  Replace   Values        replace_values()

recode_values()

po_mapping <- list(
  "1" = "Jæren", "2" = "Ryfylke", "3" = "Sotra",
  "4" = "Stadt", "5" = "Hustadvika", "6" = "Nordmøre",
  "7" = "Nord-Trøndelag", "8" = "Bodø", "9" = "Vestlfjorden",
  "10" = "Andfjorden", "11" = "Kvaløya",
  "12" = "Vest-Finnmark", "13" = "Øst-Finnmark"
)
monthly_losses_data |>
  select(species, geo_group, region, losses) |>
  filter(geo_group == "area") |>
  mutate(
    production_area = region |>
      recode_values(
        "1" ~ "Jæren",
        "2" ~ "Ryfylke",
        # ...
        "13" ~ "Øst-Finnmark",
        unmatched = "error"
      )
  )

The unmatched = "error" argument is particularly useful:

“Say I was unaware that my dataset actually has 14 regions instead of 13 and I didn’t give it something to recode for region 14. I could put unmatched = "error". And when I run this, it will actually stop the code and let me know immediately — hey, you didn’t match your list to everything. It’s a good way of cleaning up your pipeline and avoiding those kind of silent errors.”
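A minimal sketch of that behavior (assuming the recode_values() API described above; the exact error message dplyr produces is not reproduced here):

```r
# Sketch, assumes dplyr >= 1.2.0.
# "14" has no mapping, so unmatched = "error" stops the pipeline
# instead of silently producing an NA.
c("1", "14") |>
  recode_values(
    "1" ~ "Jæren",
    unmatched = "error"
  )
```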

You can also use a lookup table instead of listing values inline:

po_mapping_tibble <-
  tibble::enframe(po_mapping, name = "from", value = "to") |>
  tidyr::unnest(to)

monthly_losses_data |>
  select(species, geo_group, region, losses) |>
  filter(geo_group == "area") |>
  mutate(production_area =
           recode_values(region,
                         from = po_mapping_tibble$from,
                         to = po_mapping_tibble$to)
         )
# A tibble: 1,512 × 5
   species geo_group region losses production_area
   <chr>   <chr>     <chr>   <dbl> <chr>          
 1 salmon  area      1       31425 Jæren          
 2 salmon  area      2      324116 Ryfylke        
 3 salmon  area      3      844829 Sotra          
 4 salmon  area      4      676852 Stadt          
 5 salmon  area      5      109269 Hustadvika     
 6 salmon  area      6      548921 Nordmøre       
 7 salmon  area      7      231487 Nord-Trøndelag 
 8 salmon  area      8      442659 Bodø           
 9 salmon  area      9      311127 Vestlfjorden   
10 salmon  area      10     288849 Andfjorden     
# ℹ 1,502 more rows

replace_values()

Use replace_values() when you want to update specific values within an existing column — all other values stay the same:

monthly_losses_data_po |>
  mutate(production_area = production_area |>
           replace_values(
             "Nordmøre" ~ "Nordmøre + Sør-Trøndelag",
             "Nord-Trøndelag" ~ "Nord-Trøndelag + Bindal"
             )
         )
# A tibble: 1,512 × 5
   species geo_group region losses production_area         
   <chr>   <chr>     <chr>   <dbl> <chr>                   
 1 salmon  area      1       31425 Jæren                   
 2 salmon  area      2      324116 Ryfylke                 
 3 salmon  area      3      844829 Sotra                   
 4 salmon  area      4      676852 Stadt                   
 5 salmon  area      5      109269 Hustadvika              
 6 salmon  area      6      548921 Nordmøre + Sør-Trøndelag
 7 salmon  area      7      231487 Nord-Trøndelag + Bindal 
 8 salmon  area      8      442659 Bodø                    
 9 salmon  area      9      311127 Vestlfjorden            
10 salmon  area      10     288849 Andfjorden              
# ℹ 1,502 more rows

replace_when()

With case_when(), unmatched rows become NA unless you supply a .default argument. replace_when() knows you only intend to replace some values and leaves the rest untouched, so no .default is needed:

monthly_losses_data |>
  slice(1:4) |>
  select(species, region, losses) |>
  mutate(losses =
           replace_when(losses,
                        losses > 500000 ~ 500000)
         )
# A tibble: 4 × 3
  species region losses
  <chr>   <chr>   <dbl>
1 salmon  1       31425
2 salmon  2      324116
3 salmon  3      500000
4 salmon  4      500000
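For comparison, the same capping written with case_when() has to carry the original column through .default, otherwise every row at or below the threshold would become NA (this runs with dplyr >= 1.1.0):

```r
# case_when() equivalent: .default = losses keeps unmatched rows unchanged
monthly_losses_data |>
  slice(1:4) |>
  select(species, region, losses) |>
  mutate(losses = case_when(losses > 500000 ~ 500000,
                            .default = losses))
```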

Additional notes

  • case_match() — soft-deprecated; it is fully replaced by recode_values() and replace_values().
  • Speed improvements — core functions like if_else() and case_when() have been rewritten in C using the vctrs framework, bringing speed gains across the tidyverse. See the dplyr performance blog post.
  • Updating your LLM — because dplyr 1.2.0 is so new, your AI assistant may not know about filter_out(), when_any(), or the new recode functions. Give it context by pasting the release notes or blog post into the conversation, adding a CLAUDE.md file, or using an MCP server.
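
If you have case_match() calls in older scripts, the migration is mostly mechanical. A sketch, assuming the replace_values() API described above:

```r
x <- c("a", "b", "unknown")

# Before: case_match(), now soft-deprecated;
# .default = x is what keeps unmatched values unchanged
dplyr::case_match(x, "a" ~ "apple", "b" ~ "banana", .default = x)

# After (sketch, dplyr >= 1.2.0): replace_values() keeps unmatched
# values unchanged by design, so no .default is needed
replace_values(x, "a" ~ "apple", "b" ~ "banana")
```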

Community feedback matters

“dplyr 1.2.0 is a direct result of community feedback. The authors are incredibly active on platforms like Bluesky and they don’t just post updates — they also ask for your thoughts. And the reason that dplyr is more efficient, faster, and everything is because the people building it are listening to the people using it. So if you ever find yourself at a point where you have a really messy workaround, I highly encourage you to share it — because you might just inspire the next big feature in 1.3.0.”

Discussions happen in the Tidyups repo on GitHub, where the tidyverse team proposes ideas and invites community comment.

Summary

dplyr 1.2.0 introduces:

  • filter_out() — the missing complement to filter(), for dropping rows cleanly and safely
  • when_any() and when_all() — helpers for expressing OR and AND conditions more clearly inside filter() and filter_out()
  • recode_values(), replace_values(), and replace_when() — a complete, coherent family for recoding and replacing values

To upgrade:

install.packages("pak")
pak::pak("dplyr")
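
Afterwards, you can confirm which version you are running:

```r
# Should report 1.2.0 or later once the upgrade succeeds
packageVersion("dplyr")
```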