Data transformation with dplyr
cheatsheetQuick reference guide for data manipulation with dplyr. Covers select, filter, mutate, summarize, and more.
Master data transformation with this comprehensive cheatsheet covering all essential dplyr functions. Learn how to efficiently filter rows, select columns, create new variables, and summarize data using the grammar of data manipulation.
dplyr functions work with pipes and expect tidy data. In tidy data:
- Each variable is in its own column
- Each observation, or case, is in its own row
Pipes: x |> f(y) becomes f(x, y)
| |
Summarize cases
Create summary statistics tables:
summarize()- Compute table of summariescount()- Count rows in each grouptally(),add_count(),add_tally()- Additional counting helpers
Group cases
group_by()- Create grouped copies for separate manipulation.byargument - Alternative syntax within functionsrowwise()- Group data into individual rowsungroup()- Return ungrouped copy
Manipulate cases
Extract cases
filter()- Extract rows meeting logical criteriadistinct()- Remove duplicate rowsslice()- Select rows by positionslice_sample()- Randomly select rowsslice_min()/slice_max()- Select lowest/highest valuesslice_head()/slice_tail()- Select first/last rows
Arrange cases
arrange()- Order rows; usedesc()for reverse order
Add cases
add_row()- Add one or more rows
Manipulate variables
Extract variables
pull()- Extract column as vectorselect()- Extract columns as tablerelocate()- Move columns to new positions
Select helpers
Use with select(): contains(), ends_with(), starts_with(), matches(), everything(), num_range(), : range notation, all_of(), any_of(), negation with !
Multiple variables
across()- Summarize/mutate multiple columns similarlyc_across()- Compute across columns in row-wise data
Make new variables
mutate()- Compute new columnsrename()- Rename columnsrename_with()- Rename using a function
Vectorized functions
For use with mutate():
Offset: lag(), lead()
Cumulative aggregate: cumall(), cumany(), cummax(), cummean(), cummin(), cumprod(), cumsum()
Ranking: cume_dist(), dense_rank(), min_rank(), ntile(), percent_rank(), row_number()
Math: Arithmetic operators, logarithm functions, comparison operators, between(), near()
Miscellaneous: case_when(), coalesce(), if_else(), na_if(), pmax(), pmin()
Summary functions
For use with summarize():
Count: n(), n_distinct(), sum(!is.na())
Position: mean(), median()
Logical: mean() for proportions of TRUE, sum() for counts of TRUE
Order: first(), last(), nth()
Rank: quantile(), min(), max()
Spread: IQR(), mad(), sd(), var()
Combine tables
Combine variables
bind_cols()- Place tables side by side
Combine cases
bind_rows()- Stack tables vertically
Mutating joins
left_join()- Include all rows from left tableright_join()- Include all rows from right tableinner_join()- Include only matching rowsfull_join()- Include all rows from both tables
Filtering joins
semi_join()- Rows of x with matches in yanti_join()- Rows of x without matches in y
Nest join
nest_join()- Inner join with nested results
Use by = join_by() for specifying column matches.
Set operations
intersect(), setdiff(), union(), union_all(), setequal()


