library(tidyverse)
Small pieces
What is this?
These are mostly quick little code snippets to copy-paste and avoid re-writing. Load tidyverse and get going.
Rmarkdown with rproject directory as root
We often want to set the root directory to the project rather than to the file. In Rmarkdown, we use the following in the setup chunk. Quarto typically uses a different method, but see the Quarto notes for some exceptions. Converting from Rmarkdown to Quarto with knitr::convert_chunk_header kills this block, and it’s annoying to always have the header. In both Rmarkdown and Quarto, this has to be in a setup chunk.
```{r setup}
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
```
I thought it’d be easiest to set this in the global options, but that doesn’t seem to persist to render or knit.

Create a directory if it doesn’t exist
newdir <- file.path('output', 'testdir')
if (!dir.exists(newdir)) {dir.create(newdir, recursive = TRUE)}
Windows paths
Windows paths come in with \, which R treats as an escape character. We can use file.path to avoid them, or replace them with / or \\. But sometimes we just want to paste a path in quickly and be done with it. As of R 4.0, we can do that with a raw string, r"(...)". The parentheses go in a funny place: inside the quotes.
pastepath <- r"(C:\Users\username\path\to\somewhere.csv)"
pastepath
[1] "C:\\Users\\username\\path\\to\\somewhere.csv"
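If a pasted path ever contains the closing sequence itself, the raw-string delimiters can be changed. A minimal sketch, assuming R >= 4.0; the path is the same made-up example as above, and the last line shows the "replace them with /" option mentioned earlier:

```r
# Raw strings (R >= 4.0) accept (), [] or {} as delimiters, optionally with dashes.
# The path is a made-up example, not a real file.
p1 <- r"(C:\Users\username\path\to\somewhere.csv)"
p2 <- r"[C:\Users\username\path\to\somewhere.csv]"
p3 <- r"---(C:\Users\username\path\to\somewhere.csv)---"

# All three forms produce the same string
identical(p1, p2) && identical(p2, p3)

# Swap backslashes for forward slashes if a portable path is preferred
# (fixed = TRUE treats the pattern as a literal backslash, not a regex)
forward <- gsub("\\", "/", p1, fixed = TRUE)
forward
```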
And we can feed that straight into functions that need paths, e.g.
readr::read_csv(r"(C:\Users\username\path\to\somewhere.csv)")
Look at all duplicate values
Functions like duplicated flag only the second (and later) values that match, e.g.
x <- c(1,2,1,3,4,2)
duplicated(x)
[1] FALSE FALSE TRUE FALSE FALSE TRUE
But we often want to grab all values that are repeated, i.e. if everything matches in one column, what’s going on in the others? To do that we can use group_by and filter to get the groups with > 1 row.
For example, let’s compare cars with duplicated mpg values.
mtcars %>%
  dplyr::group_by(mpg) %>%
  dplyr::filter(n() > 1) %>%
  dplyr::arrange(mpg) # makes the comparisons easier
Why is that useful? We can see not only that these aren’t fully duplicated rows (which we also could have done with duplicated on the whole table), but also easily look at what actually differs.
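For a plain vector, one way to flag every repeated value, first occurrence included, is to run duplicated from both ends and combine the results. A minimal base R sketch, redefining the same x used earlier:

```r
x <- c(1, 2, 1, 3, 4, 2)

# duplicated() flags later repeats; fromLast = TRUE flags the earlier ones.
# OR-ing the two marks every value that occurs more than once.
all_dups <- duplicated(x) | duplicated(x, fromLast = TRUE)
all_dups

# Pull out all the repeated values, in their original order
x[all_dups]
```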
Full table
If we do want to look at fully duplicated rows, we can use a similar approach; we just have to group_by everything. Duplicating three rows to demonstrate:
mtcars |>
  # bind on some duplicates
  dplyr::bind_rows(mtcars |> dplyr::slice(c(1,9,12))) |>
  dplyr::group_by(dplyr::across(tidyselect::everything())) |>
  dplyr::filter(dplyr::n()>1) |>
  dplyr::ungroup() |>
  dplyr::arrange(dplyr::across(tidyselect::everything()))
In a list
We might have a list with internal duplicates, e.g.
duplist <- list(a = c('a', 'b'), b = c('b', 'c'), d = c('a', 'c'), e = c('f', 'g'), f = c('f', 'h'), g = c('a', 'l'))
duplist
$a
[1] "a" "b"
$b
[1] "b" "c"
$d
[1] "a" "c"
$e
[1] "f" "g"
$f
[1] "f" "h"
$g
[1] "a" "l"
We can see which values in the first position are duplicated, but again, not the first instances.
thedups <- duplist[duplicated(purrr::map_chr(duplist, \(x) x[1]))] |>
  purrr::map_chr(\(x) x[1])
thedups
  d   f   g
"a" "f" "a"
We can get all of them by checking whether each element’s first value is in thedups and then dropping the empties.
all_duplicated <- purrr::map(duplist, \(x) x[x[1] %in% thedups]) |>
  purrr::discard(\(x) length(x) == 0)
all_duplicated
$a
[1] "a" "b"
$d
[1] "a" "c"
$e
[1] "f" "g"
$f
[1] "f" "h"
$g
[1] "a" "l"
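The same result can also be reached without the intermediate map-and-discard step by indexing the list with a logical vector. This is only a sketch of an alternative, written in base R (vapply instead of purrr::map_chr) so it runs without extra packages; the names firsts, keep, and all_duplicated2 are made up here.

```r
duplist <- list(a = c('a', 'b'), b = c('b', 'c'), d = c('a', 'c'),
                e = c('f', 'g'), f = c('f', 'h'), g = c('a', 'l'))

# Pull the first value of each element into a named character vector
firsts <- vapply(duplist, \(x) x[1], character(1))

# Keep every element whose first value occurs more than once,
# including the first instance of each repeated value
keep <- firsts %in% firsts[duplicated(firsts)]
all_duplicated2 <- duplist[keep]
names(all_duplicated2)
```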
Changing the column type on readr
Sometimes with long csvs, readr’s guess of col type based on the first thousand rows is wrong. But only for some cols. If we want to not have to specify all of them, we can use .default and only specify the offending col.
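As a quick in-memory illustration of the .default idea, here is a small sketch that assumes readr >= 2.0 (for literal data via I()). The CSV text is made up, and col_guess() and col_character() are the long-form equivalents of the "?" and "c" shortcuts.

```r
library(readr)

# Tiny in-memory CSV (made-up data); I() marks it as literal text, not a path
csv_text <- I("c1,c2,c3\n1,a,1\n2,b,2\n3,c,x\n")

# Guess every column except c3, which is forced to character
dat <- read_csv(
  csv_text,
  col_types = cols(.default = col_guess(), c3 = col_character())
)

spec(dat) # shows the column specification that was actually used
```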
First, save dummy data
dumtib <- tibble(c1 = 1:3000, c2 = rep(letters, length.out = 3000), c3 = c(c1[1:2000], c2[2001:3000]))
write_csv(dumtib, file = file.path(newdir, 'colspectest.csv'))
If we read in without the cols, we would expect it to assume c3 is numeric and throw errors. But it doesn’t. Why not? This keeps getting me elsewhere, but now I can’t recreate the problem. Figure this out later, I guess.
filein <- read_csv(file.path(newdir, 'colspectest.csv'), guess_max = 100)
Rows: 3000 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): c2, c3
dbl (1): c1
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Tell it the third col is character.
filein <- readr::read_csv(file.path(newdir, 'colspectest.csv'), col_types = cols(.default = "?", c3 = "c"))
Sourcing all files in a directory
Yes, we should be building as a library in this case, but it’s often easier at least initially to not deal with the overhead. If, for example, all functions are in the ‘functions’ directory,
# Load local functions
devtools::load_all()
ℹ Loading galenR
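If the project is not set up as a package, a plain-R way to source everything in a ‘functions’ directory might look like the sketch below. The directory and the two toy functions are created in a temp location purely for illustration; in a real project, fun_dir would just point at the existing folder.

```r
# Hypothetical layout: a 'functions' directory holding one or more .R files
fun_dir <- file.path(tempdir(), "functions")
dir.create(fun_dir, showWarnings = FALSE, recursive = TRUE)
writeLines("add_one <- function(x) x + 1", file.path(fun_dir, "add_one.R"))
writeLines("times_two <- function(x) x * 2", file.path(fun_dir, "times_two.R"))

# Find every file ending in .R (anchored regex) and source each one.
# source() evaluates in the global environment by default.
fun_files <- list.files(fun_dir, pattern = "\\.R$", full.names = TRUE)
invisible(lapply(fun_files, source))

add_one(1)
times_two(3)
```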
Render big dfs in html
Rendering in Quarto defaults to printing dfs as text, so often we can’t see all the columns (or rows), or access them. Setting the df-print option to paged allows them to work. The header should look like this (commented out because this isn’t a header)
# title: "TITLE"
# author: "AUTHOR"
# format:
#   html:
#     df-print: paged
Convert all rmd to qmd
knitr::convert_chunk_header is the main thing, but I want to apply it to a full directory. Let’s get the dir for here.
# anchor the pattern so only files ending in .Rmd match ('.' alone is a regex wildcard)
allrmd <- list.files(rprojroot::find_rstudio_root_file(), pattern = '\\.Rmd$', recursive = TRUE, full.names = TRUE)
allrmd <- allrmd[!stringr::str_detect(allrmd, 'renv')]
allqmd <- stringr::str_replace(allrmd, '\\.Rmd$', '.qmd')
Can I vectorize? No, but a loop works. Git commit first!
for (i in seq_along(allrmd)) { # seq_along is safe if allrmd is empty
  knitr::convert_chunk_header(input = allrmd[i], output = allqmd[i])
}
Now, if you want to really go for it, delete the rmds. That makes git happier because then it can treat this as a rename and keep tracking the files.
Dangerous: make sure you’ve git-committed. I’m commenting this out and setting eval: false.
# file.remove(allrmd)
Debugging R from Quarto
Sometimes (often) if you call a function with a breakpoint from a Quarto notebook, nothing prints to the console. For example, I’ll type a variable name var_a to see what it is, and nothing will print. The issue is that values are going to a different stdout.
Type sink() and it should start working.
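If a single sink() is not enough, more than one diversion may be open. A hedged sketch to run at the console, assuming the problem really is leftover output diversions: sink.number() reports how many are active, and a loop can close them all. The tempfile diversion here only simulates the stuck state.

```r
# Simulate a stuck diversion by sending output to a temporary file
sink(tempfile())
sink.number() # 1 or more while output is diverted

# Close diversions until none remain, so output returns to the console
while (sink.number() > 0) sink()
sink.number()
```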