Small pieces

What is this?

This is mostly quick little code snippets to copy-paste and avoid re-writing. Load tidyverse and get going.

library(tidyverse)

Rmarkdown with the Rproject directory as root

We often want to set the root directory to the project rather than to the file. In Rmarkdown, we use the following in the setup chunk. Quarto typically uses a different method, but see the Quarto notes for some exceptions. Converting from Rmarkdown to Quarto with knitr::convert_chunk_header kills this block, and it’s annoying to always have the header. In both Rmarkdown and Quarto, this has to be in a setup chunk.

```{r setup}
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
```

I thought it’d be easiest to set this in the global options, but that doesn’t seem to persist through render or knit.

Create a directory if it doesn’t exist

newdir <- file.path('output', 'testdir')
if (!dir.exists(newdir)) {dir.create(newdir, recursive = TRUE)}
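
We can also skip the explicit check: dir.create only warns (rather than errors) when the directory already exists, so suppressing the warning gives the same effect.

# one-liner: silently does nothing if the directory already exists
dir.create(newdir, recursive = TRUE, showWarnings = FALSE)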

Windows paths

Windows paths come in with \, which R treats as an escape character. We can use file.path to avoid them entirely, or replace them with / or \\. But sometimes we just want to paste a path in quickly and be done with it. As of R 4.0, we can do that with the raw string syntax r"()". It requires the parentheses to be in a funny place- inside the quotes.

pastepath <- r"(C:\Users\username\path\to\somewhere.csv)"
pastepath
[1] "C:\\Users\\username\\path\\to\\somewhere.csv"

And we can feed that straight into functions that need paths, e.g.

readr::read_csv(r"(C:\Users\username\path\to\somewhere.csv)")
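
For reference, quick sketches of the alternatives mentioned above, which work in any R version:

# forward slashes work fine on Windows
"C:/Users/username/path/to/somewhere.csv"

# or build the path from pieces and let file.path handle separators
file.path('C:', 'Users', 'username', 'path', 'to', 'somewhere.csv')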

Look at all duplicate values

Functions like duplicated flag the second (and later) occurrences of repeated values, e.g.

x <- c(1,2,1,3,4,2)
duplicated(x)
[1] FALSE FALSE  TRUE FALSE FALSE  TRUE
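
For a plain vector, we can flag every occurrence (not just the later ones) by also running duplicated from the other end; a quick base-R sketch:

x[duplicated(x) | duplicated(x, fromLast = TRUE)]
[1] 1 2 1 2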

In a data frame, we often want to grab all rows with repeated values- i.e. if everything matches in one column, what’s going on in the others? To do that we can use group_by and filter to get the groups with > 1 row.

For example, let’s compare cars with duplicated mpg values.

mtcars %>%
  dplyr::group_by(mpg) %>%
  dplyr::filter(n() > 1) %>%
  dplyr::arrange(mpg) # makes the comparisons easier

Why is that useful? We can see not only that these aren’t fully duplicated rows (which we also could have done with duplicated on the whole table), but also easily look at exactly what differs.

In a list

We might have a list with internal duplicates, e.g.

duplist <- list(a = c('a', 'b'), b = c('b', 'c'), d = c('a', 'c'), e = c('f', 'g'), f = c('f', 'h'), g = c('a', 'l'))
duplist
$a
[1] "a" "b"

$b
[1] "b" "c"

$d
[1] "a" "c"

$e
[1] "f" "g"

$f
[1] "f" "h"

$g
[1] "a" "l"

We can see which values in the first position are duplicated, but again, not the first instances.

thedups <- duplist[duplicated(purrr::map_chr(duplist, \(x) x[1]))] |>
  purrr::map_chr(\(x) x[1])
thedups
  d   f   g 
"a" "f" "a" 

We can get all of them by checking whether each element’s first value is in thedups and then dropping the empties.

all_duplicated <- purrr::map(duplist, \(x) x[x[1] %in% thedups]) |> 
  purrr::discard(\(x) length(x) == 0)
all_duplicated
$a
[1] "a" "b"

$d
[1] "a" "c"

$e
[1] "f" "g"

$f
[1] "f" "h"

$g
[1] "a" "l"

Changing the column type in readr

Sometimes with long csvs, readr’s guess of column type based on the first thousand rows is wrong, but only for some columns. If we don’t want to have to specify all of them, we can use .default and only specify the offending column.

First, save dummy data

dumtib <- tibble(c1 = 1:3000, c2 = rep(letters, length.out = 3000), c3 = c(c1[1:2000], c2[2001:3000]))

write_csv(dumtib, file = file.path(newdir, 'colspectest.csv'))

If we read in without specifying the cols and with a small guess_max, it should assume c3 is numeric and we should get errors. But it doesn’t. Why not? It keeps getting me elsewhere, but now I can’t recreate the problem. (One possibility: the vroom-backed readr seems to sample rows from throughout the file when guessing types, not just the first guess_max.) Figure this out later, I guess.

filein <- read_csv(file.path(newdir, 'colspectest.csv'), guess_max = 100)
Rows: 3000 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): c2, c3
dbl (1): c1

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Tell it the third col is character.

filein <- readr::read_csv(file.path(newdir, 'colspectest.csv'), col_types = cols(.default = "?", c3 = "c"))
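
As the message above suggests, spec() retrieves the full specification readr used, which we can copy into col_types and tweak:

# retrieve the column specification readr actually used
readr::spec(filein)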

Sourcing all files in a directory

Yes, we should be building this as a package in this case, but it’s often easier, at least initially, to not deal with the overhead. If, for example, all functions are in the ‘functions’ directory, we can source them all in a loop, as sketched below.
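
A minimal sketch of the plain-sourcing approach, assuming the functions live in a ‘functions’ directory at the project root:

# source every .R file in the 'functions' directory
for (f in list.files('functions', pattern = '\\.R$', full.names = TRUE)) {
  source(f)
}

This project happens to be set up as a package, though, so devtools::load_all() does the job here.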

# Load local functions
devtools::load_all()
ℹ Loading galenR

Render big dfs in html

Rendering in Quarto defaults to printing dfs as text, and so often we can’t see all the columns (or rows), or access them. Setting the df-print option to paged lets them work. The header should look like this (commented out because this isn’t a header):

# title: "TITLE"
# author: "AUTHOR"
# format:
#   html:
#     df-print: paged

Convert all rmd to qmd

knitr::convert_chunk_header is the main thing, but I want to apply it to a full directory. Let’s get all the Rmds in this project.

allrmd <- list.files(rprojroot::find_rstudio_root_file(), pattern = '\\.Rmd$', recursive = TRUE, full.names = TRUE)

allrmd <- allrmd[!stringr::str_detect(allrmd, 'renv')]

allqmd <- stringr::str_replace(allrmd, '\\.Rmd$', '.qmd')

Can I vectorize? No, but a loop works. Git commit first!

for (i in seq_along(allrmd)) {
  knitr::convert_chunk_header(input = allrmd[i], output = allqmd[i])
}
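
If the loop feels clunky, purrr::walk2 over the two vectors should do the same thing:

# equivalent iteration with purrr
purrr::walk2(allrmd, allqmd,
             \(i, o) knitr::convert_chunk_header(input = i, output = o))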

Now, if you want to really go for it, delete the rmds. That makes git happier because then it can treat this as a rename and keep tracking the files.

Dangerous- make sure you’ve git-committed. I’m commenting this out and setting eval: false.

# file.remove(allrmd)