The tidyverse keeps evolving, and {dplyr} 1.2.0 brings a set of new tools that make data manipulation even more expressive and flexible. Whether you regularly work with R and the tidyverse, or you’re looking to sharpen your data wrangling skills, this session will help you better understand the design philosophy behind the latest {dplyr} release and how to apply it effectively.
Slides
Recording
Summary
This summary was generated by Claude Sonnet 4.6 and reviewed by me.
Today’s data
We’ll use salmonid mortality data from TidyTuesday, published by the Norwegian Veterinary Institute. The dataset covers monthly mortality data from 2020.
“The data is saved in an object called monthly_losses_data. And if we read in that CSV and take a look at the first few rows, you’ll notice that there are nine columns that cover things like species, date, geo group, region, losses, and a variety of variables that describe how those losses happened. And so just a really nice dataset to practice some of the functions that we’re going to be taking a look at.”
Rows: 2808 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): species, geo_group, region
dbl (5): losses, dead, discarded, escaped, other
date (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(monthly_losses_data)
# A tibble: 6 × 9
species date geo_group region losses dead discarded escaped other
<chr> <date> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 salmon 2020-01-01 area 1 31425 28126 3299 0 0
2 salmon 2020-01-01 area 2 324116 277888 46113 0 115
3 salmon 2020-01-01 area 3 844829 776983 63770 0 4076
4 salmon 2020-01-01 area 4 676852 623159 51823 0 1870
5 salmon 2020-01-01 area 5 109269 97627 11424 0 218
6 salmon 2020-01-01 area 6 548921 531193 15710 0 2018
A quick dplyr refresher
dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.
“dplyr is a core member of the tidyverse which is a curated collection of R packages specifically for data science. All the tidyverse packages share this opinionated philosophy that data should be in what’s called a tidy format — each variable is a column, each observation is a row, each value is a cell.”
Here are the core functions we’ll use as a foundation:
“dplyr is a grammar. This kind of syntax — taking your data frame, using a pipe, using a verb, and then specifying the columns — that’s the general syntax that we’ll see throughout this presentation.”
Have you ever stared at filter(region == "1") and wondered: am I keeping Region 1 or dropping it?
“The original filter function was optimized for keeping rows. And often, because it was the quickest tool at our disposal or the only tool that we had, we may have forced it to drop rows using negative logic.”
Here’s what that looks like in practice. Say we have a small dataset with an NA in the losses column:
monthly_losses_NA
# A tibble: 5 × 3
species region losses
<chr> <chr> <dbl>
1 salmon 1 31425
2 salmon 2 324116
3 salmon 3 844829
4 salmon 3 676852
5 salmon 3 NA
We want to drop rows where region == 3 and losses > 700000. Using the traditional approach:
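One way to write that with classic filter() (a reconstruction, not necessarily the talk's exact code): wrap the condition in a !() negation, and add an is.na() escape hatch so the NA row is not silently dropped along the way.

```r
library(dplyr)

# the small example tibble from above, with an NA in losses
monthly_losses_NA <- tibble(
  species = "salmon",
  region  = c("1", "2", "3", "3", "3"),
  losses  = c(31425, 324116, 844829, 676852, NA)
)

# negative logic: negate the drop-condition, then rescue the NA row,
# because !(TRUE & NA) evaluates to NA and filter() drops NA rows
monthly_losses_NA |>
  filter(!(region == "3" & losses > 700000) | is.na(losses))
```

Without the is.na() clause, the NA row would quietly disappear, which is exactly the kind of foot-gun this release is trying to remove.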
# A tibble: 4 × 3
species region losses
<chr> <chr> <dbl>
1 salmon 1 31425
2 salmon 2 324116
3 salmon 3 676852
4 salmon 3 NA
The solution: filter_out()
“Rather than trying to think through negations and logic and things like that, just remember: use filter() to keep rows, use filter_out() to drop rows.”
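Assuming filter_out() accepts conditions the same way filter() does (a sketch; check the 1.2.0 release notes for the exact signature), the same result reads as a statement of intent:

```r
# requires dplyr >= 1.2.0 (sketch only)
monthly_losses_NA |>
  filter_out(region == "3" & losses > 700000)
# the NA row needs no special handling: a condition that evaluates
# to NA is not a definitive match, so the row is kept
```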
# A tibble: 4 × 3
species region losses
<chr> <chr> <dbl>
1 salmon 1 31425
2 salmon 2 324116
3 salmon 3 676852
4 salmon 3 NA
filter_out() assumes that if a condition comes back NA or unknown for a row, you probably don’t want to drop that row, so it only drops rows that definitively meet your criteria.
when_any() and when_all()
The problem
Combining multiple OR conditions with filter() leads to deeply nested, error-prone code:
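The thresholds below are illustrative (reconstructed from the output, not the talk's exact code), but they show the shape of the problem: every branch needs its own parentheses, and the indentation drifts.

```r
library(dplyr)

# toy subset (values taken from the output shown below,
# not the full TidyTuesday data)
monthly_losses_filters <- tibble(
  species = "salmon",
  region  = c("2", "5", "7", "8", "9"),
  losses  = c(324116, 109269, 475115, 442659, 311127)
)

# nested OR logic: one parenthesized clause per branch
monthly_losses_filters |>
  filter(
    (region == "2" & losses > 300000) |
      (region == "7" & losses > 400000) |
      (region == "8" & losses > 400000) |
      (region == "9" & losses > 300000)
  )
```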
# A tibble: 4 × 3
# Groups: region [4]
species region losses
<chr> <chr> <dbl>
1 salmon 2 324116
2 salmon 7 475115
3 salmon 8 442659
4 salmon 9 311127
“It starts to indent. You have to remember where you put your parentheses because there are a lot of them. As you can imagine, it can just get very unwieldy to write something like this, especially if you have many OR statements.”
The solution: when_any() and when_all()
Use when_any() for OR conditions and when_all() for AND conditions:
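A sketch of the when_any() version, with illustrative thresholds (the helper is new in 1.2.0, so treat the exact call as a guess): each requirement is its own argument, all at the same indent level.

```r
# requires dplyr >= 1.2.0 (sketch only)
monthly_losses_filters |>
  filter(when_any(
    region == "2" & losses > 300000,
    region == "7" & losses > 400000,
    region == "8" & losses > 400000,
    region == "9" & losses > 300000
  ))
```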
# A tibble: 4 × 3
# Groups: region [4]
species region losses
<chr> <chr> <dbl>
1 salmon 2 324116
2 salmon 7 475115
3 salmon 8 442659
4 salmon 9 311127
monthly_losses_filters |>
  filter(when_all(
    region %in% c("7", "8"),
    losses > 400000
  ))
# A tibble: 2 × 3
# Groups: region [2]
species region losses
<chr> <chr> <dbl>
1 salmon 7 475115
2 salmon 8 442659
“when_any() and when_all() continue this theme of intent-based coding — you lead with exactly what it is that you want. Your requirements are just separated by a comma, everything is indented at the same level, and hopefully much easier for a colleague to follow your logic.”
Both helpers also work with the new filter_out().
New recoding and replacing functions
Some background
Recoding used to involve choosing between if/else, case_when(), recode(), or recode() + rlang’s !!! splicing operator — each with its own quirky syntax.
“There are so many ways of recoding — if/else, case_when, recode, recode with rlang — and each one is very specific in how it works, each one brings their own set of syntax, and it could get really confusing.”
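To see the syntax zoo side by side, here are four pre-1.2.0 ways to map the same region codes to names (all released dplyr; the names are illustrative):

```r
library(dplyr)

region <- c("1", "2", "9")

# 1. condition-first, only two outcomes
if_else(region == "1", "Jæren", "other")

# 2. formula pairs with an explicit .default
case_when(
  region == "1" ~ "Jæren",
  region == "2" ~ "Ryfylke",
  .default = "other"
)

# 3. old-value = new-value pairs (now superseded)
recode(region, "1" = "Jæren", "2" = "Ryfylke", .default = "other")

# 4. a named lookup vector spliced in with rlang's !!!
lookup <- c("1" = "Jæren", "2" = "Ryfylke")
recode(region, !!!lookup, .default = "other")
```

Four tools, four notations for one job, which is exactly the inconsistency the new family addresses.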
Recoding vs. replacing
Two important distinctions:
Recoding — creating an entirely new column using values from an existing column
Replacing — partially updating an existing column with new values
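The distinction in released-dplyr terms (a small illustration with made-up values):

```r
library(dplyr)

df <- tibble(region = c("1", "2", "3"),
             losses = c(31425, 324116, 844829))

# Recoding: derive an entirely new column from region's values
df |> mutate(area_name = if_else(region == "1", "Jæren", "other"))

# Replacing: update losses in place, only where the condition holds;
# all other rows keep their original value
df |> mutate(losses = if_else(losses > 700000, 700000, losses))
```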
The new family
dplyr 1.2.0 introduces three new functions alongside case_when(): recode_values(), replace_values(), and replace_when().
The unmatched = "error" argument is particularly useful:
“Say I was unaware that my dataset actually has 14 regions instead of 13 and I didn’t give it something to recode for region 14. I could put unmatched = "error". And when I run this, it will actually stop the code and let me know immediately — hey, you didn’t match your list to everything. It’s a good way of cleaning up your pipeline and avoiding those kind of silent errors.”
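A sketch of how that might look; recode_values() is new in 1.2.0, so the old ~ new pair syntax below is a guess modeled on case_when(), and only the first pairs are written out.

```r
# requires dplyr >= 1.2.0 (sketch only; check the release notes)
monthly_losses_data |>
  mutate(production_area = recode_values(
    region,
    "1" ~ "Jæren",
    "2" ~ "Ryfylke",
    # ...one pair per region...
    unmatched = "error"  # fail fast if any region value has no recoding
  ))
```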
You can also use a lookup table instead of listing values inline:
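The post doesn't show how the lookup was passed to the recoding call; as a released-dplyr stand-in, a left_join() with a lookup tibble gives the same result (the area names here come from the output below, and the lookup is abbreviated to the first five regions):

```r
library(dplyr)

# abbreviated lookup: one row per region code; the full table
# would cover every region in the data
area_lookup <- tibble(
  region          = c("1", "2", "3", "4", "5"),
  production_area = c("Jæren", "Ryfylke", "Sotra", "Stadt", "Hustadvika")
)

monthly_losses_data |>
  select(species, geo_group, region, losses) |>
  left_join(area_lookup, by = "region")
```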
# A tibble: 1,512 × 5
species geo_group region losses production_area
<chr> <chr> <chr> <dbl> <chr>
1 salmon area 1 31425 Jæren
2 salmon area 2 324116 Ryfylke
3 salmon area 3 844829 Sotra
4 salmon area 4 676852 Stadt
5 salmon area 5 109269 Hustadvika
6 salmon area 6 548921 Nordmøre + Sør-Trøndelag
7 salmon area 7 231487 Nord-Trøndelag + Bindal
8 salmon area 8 442659 Bodø
9 salmon area 9 311127 Vestfjorden
10 salmon area 10 288849 Andfjorden
# ℹ 1,502 more rows
replace_when()
With case_when(), unmatched rows become NA unless you supply a .default argument. replace_when() knows you only intend to replace some values, so unmatched rows simply keep their original value and no .default is needed:
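A sketch of a call that would cap large losses, consistent with the output below; replace_when() is new in 1.2.0, so the formula syntax here is a guess modeled on case_when():

```r
# requires dplyr >= 1.2.0 (sketch only)
# monthly_losses_small is a hypothetical four-row subset of the data
monthly_losses_small |>
  mutate(losses = replace_when(losses, losses > 500000 ~ 500000))
# rows matching no condition keep their original losses value
```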
# A tibble: 4 × 3
species region losses
<chr> <chr> <dbl>
1 salmon 1 31425
2 salmon 2 324116
3 salmon 3 500000
4 salmon 4 500000
Additional notes
case_match() has been soft-deprecated; its job is now fully covered by recode_values() and replace_values().
Speed improvements — core functions like if_else() and case_when() have been rewritten in C using the vctrs framework, bringing speed gains across the tidyverse. See the dplyr performance blog post.
Updating your LLM — because dplyr 1.2.0 is so new, your AI assistant may not know about filter_out(), when_any(), or the new recode functions. Give it context by pasting the release notes or blog post into the conversation, adding a CLAUDE.md file, or using an MCP server.
Community feedback matters
“dplyr 1.2.0 is a direct result of community feedback. The authors are incredibly active on platforms like Bluesky and they don’t just post updates — they also ask for your thoughts. And the reason that dplyr is more efficient, faster, and everything is because the people building it are listening to the people using it. So if you ever find yourself at a point where you have a really messy workaround, I highly encourage you to share it — because you might just inspire the next big feature in 1.3.0.”
Discussions happen in the Tidyups repo on GitHub, where the tidyverse team proposes ideas and invites community comment.
Summary
dplyr 1.2.0 introduces:
filter_out() — the missing complement to filter(), for dropping rows cleanly and safely
when_any() and when_all() — helpers for expressing OR and AND conditions more clearly inside filter() and filter_out()
recode_values(), replace_values(), and replace_when() — a complete, coherent family for recoding and replacing values