Small pieces
What is this?
This is mostly quick little code snippets to copy-paste and avoid re-writing. Load the tidyverse and get going.
```r
library(tidyverse)
```
Rmarkdown with rproject directory as root
We often want to set the root directory not to the file but to the project. In Rmarkdown, we use the following in the setup chunk. Quarto typically uses a different method, but see the Quarto notes for some exceptions. Converting from Rmarkdown to Quarto with knitr::convert_chunk_header kills this block, and it's annoying to always have the header. In both Rmarkdown and Quarto, this has to be in a setup chunk.
```{r setup}
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
```
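For Quarto documents that live in a Quarto project, the usual equivalent is the project-level execute-dir option rather than a setup chunk; a minimal sketch of that config (not something used in this document):
```yaml
# _quarto.yml: evaluate all chunks relative to the project root
project:
  execute-dir: project
```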
I thought it'd be easiest to set this in the global options, but that doesn't seem to persist through render or knit.
Create a directory if it doesn't exist
```r
newdir <- file.path('output', 'testdir')
if (!dir.exists(newdir)) {dir.create(newdir, recursive = TRUE)}
```
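An equivalent one-liner, if the existence check is only there to avoid the "already exists" warning:
```r
# showWarnings = FALSE keeps dir.create() quiet when the directory already exists
dir.create(newdir, recursive = TRUE, showWarnings = FALSE)
```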
Windows paths
Windows paths come in with \, which R treats as an escape character. We can use file.path to avoid them entirely, or replace them with / or \\. But sometimes we just want to paste a path in quickly and be done with it. As of R 4.0, we can do that with the raw string syntax r"()". It requires the parentheses to be in a funny place: inside the quotes.
<- r"(C:\Users\username\path\to\somewhere.csv)"
pastepath pastepath
[1] "C:\\Users\\username\\path\\to\\somewhere.csv"
And we can feed that straight into functions that need paths, e.g.
```r
readr::read_csv(r"(C:\Users\username\path\to\somewhere.csv)")
```
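And if a backslashed path has already landed in a string, normalizePath can do the slash-flipping mentioned above (mustWork = FALSE just avoids an error when the example path doesn't exist):
```r
# Rewrite the separators in an existing path string as forward slashes
normalizePath(pastepath, winslash = "/", mustWork = FALSE)
```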
Look at all duplicate values
Functions like duplicated flag the second (and later) instances of values that match, e.g.
```r
x <- c(1,2,1,3,4,2)
duplicated(x)
```
[1] FALSE FALSE TRUE FALSE FALSE TRUE
But we often want to grab all values that are repeated, i.e. if everything matches in one column, what's going on in the others? To do that we can use group_by and filter to get the groups with more than one row.
For example, let's compare cars with duplicated mpg values.
```r
mtcars %>%
  dplyr::group_by(mpg) %>%
  dplyr::filter(n() > 1) %>%
  dplyr::arrange(mpg) # makes the comparisons easier
```
Why is that useful? We can see not only that these aren't fully duplicated rows (which we also could have checked with duplicated on the whole table), but we can also easily look at what actually differs.
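For reference, the whole-table check with base duplicated looks like this; fromLast = TRUE catches the first copies too, and it comes back empty here because no mtcars rows are fully duplicated:
```r
# Keep every row that has an exact copy elsewhere in the table (first copies included)
mtcars[duplicated(mtcars) | duplicated(mtcars, fromLast = TRUE), ]
```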
In a list
We might have a list with internal duplicates, e.g.
```r
duplist <- list(a = c('a', 'b'), b = c('b', 'c'), d = c('a', 'c'), e = c('f', 'g'), f = c('f', 'h'), g = c('a', 'l'))
duplist
```
$a
[1] "a" "b"
$b
[1] "b" "c"
$d
[1] "a" "c"
$e
[1] "f" "g"
$f
[1] "f" "h"
$g
[1] "a" "l"
We can see which values in the first position are duplicated, but again, not the first instances.
```r
thedups <- duplist[duplicated(purrr::map_chr(duplist, \(x) x[1]))] |>
  purrr::map_chr(\(x) x[1])
thedups
```
d f g
"a" "f" "a"
We can get all of them by checking whether each element's first value is in thedups, and then dropping the empties.
```r
all_duplicated <- purrr::map(duplist, \(x) x[x[1] %in% thedups]) |>
  purrr::discard(\(x) length(x) == 0)
all_duplicated
```
$a
[1] "a" "b"
$d
[1] "a" "c"
$e
[1] "f" "g"
$f
[1] "f" "h"
$g
[1] "a" "l"
Changing the column type in readr
Sometimes with long csvs, readr's guess of column type, based on the first thousand rows, is wrong, but only for some columns. If we don't want to have to specify all of them, we can use .default and only specify the offending column.
First, save some dummy data.
```r
dumtib <- tibble(c1 = 1:3000, c2 = rep(letters, length.out = 3000), c3 = c(c1[1:2000], c2[2001:3000]))
write_csv(dumtib, file = file.path(newdir, 'colspectest.csv'))
```
If we read this in without specifying the columns, it should assume c3 is numeric and we should get errors. But it doesn't. Why not? This keeps catching me out elsewhere, but now I can't recreate the problem. Figure this out later, I guess.
```r
filein <- read_csv(file.path(newdir, 'colspectest.csv'), guess_max = 100)
```
Rows: 3000 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): c2, c3
dbl (1): c1
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Tell it the third col is character.
```r
filein <- readr::read_csv(file.path(newdir, 'colspectest.csv'), col_types = cols(.default = "?", c3 = "c"))
```
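To confirm whether a bad guess actually caused parsing failures, readr keeps a record that can be pulled out afterwards:
```r
# Rows readr could not parse under the chosen column types (empty if all is well)
readr::problems(filein)
```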
Sourcing all files in a directory
Yes, we should be building this as a library in this case, but it's often easier, at least initially, not to deal with the overhead. If, for example, all functions are in the 'functions' directory, we can source them all in one go (a sketch follows the next chunk). Here, the local functions are in a package, so devtools::load_all() does the loading.
```r
# Load local functions
devtools::load_all()
```
ℹ Loading galenR
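If instead the functions sit as loose scripts in a 'functions' directory (the situation described above), a minimal sketch for sourcing them all, assuming they live directly under that directory:
```r
# Source every .R file found in the 'functions' directory
purrr::walk(list.files('functions', pattern = '\\.R$', full.names = TRUE), source)
```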
Render big dfs in html
Rendering in Quarto defaults to printing dfs as text, and so often we can't see all the columns (or rows), or access them. Setting the df-print option to paged allows them to work. The header should look like this (commented out because this isn't actually the header):
```r
# title: "TITLE"
# author: "AUTHOR"
# format:
#   html:
#     df-print: paged
```
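If only one or two tables need paging, rmarkdown::paged_table can be applied per table instead of setting the document-level option; a small sketch:
```r
# Page a single data frame without changing df-print for the whole document
rmarkdown::paged_table(mtcars)
```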
Convert all rmd to qmd
knitr::convert_chunk_header is the main thing, but I want to apply it to a full directory. Let's list the Rmd files from the project root.
```r
allrmd <- list.files(rprojroot::find_rstudio_root_file(), pattern = '.Rmd', recursive = TRUE, full.names = TRUE)
allrmd <- allrmd[!stringr::str_detect(allrmd, 'renv')]
allqmd <- stringr::str_replace(allrmd, '.Rmd', '.qmd')
```
Can I vectorize? No, but a loop works. Git commit first!
```r
for (i in 1:length(allrmd)) {
  knitr::convert_chunk_header(input = allrmd[i], output = allqmd[i])
}
```
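The same iteration written with purrr, if mapping reads better than an explicit loop (equivalent, not vectorized in any meaningful sense):
```r
# Convert each Rmd to its matching qmd path
purrr::walk2(allrmd, allqmd, \(rmd, qmd) knitr::convert_chunk_header(input = rmd, output = qmd))
```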
Now, if you want to really go for it, delete the rmds. That makes git happier because then it can treat this as a rename and keep tracking the files.
Dangerous: make sure you've git-committed. I'm commenting this out and setting eval: false on it.
```r
# file.remove(allrmd)
```