Intro
The main motivation of writing this post came from this post, the OP asked how to manage a ~300 lines long R script in a hierarchical way in Rstudio, just like we can fold/unfold a section to its headings in a well-structured rmarkdown document.
There’s no quick answer to this question that came out of my mind instantly. When thinking about it again, I start to wondering whether structuring a ~300 lines long R script is necessary. Nah, it isn’t. an all-in-one R script will quickly become a monster three months later. It will become difficult to understand, debug and update, and eventually you may decide to write another all-in-one script from start.
I’ll use one of my project, lhdata, to show how I manage an R project
Note that this project was written before I learned how to create an R package, many things can be greatly improved, I’ll point them out in the footnote. Besides, some of the R code may hurt your eyes if you stare at them too long.
What This Project Does
lhdata is a web crawler that retrieve updates from bilibili and update the blogdown site owaraiclub accordingly. But that’s not importent, let’s move on.
Project Structure
Here’s the directory structure of the project.
.
├── bin
│ ├── scrapebili_aspen.sh
│ └── scrapebili.sh
├── data
│ └── vlist.new.anno_humidavid.rda
├── DESCRIPTION
├── lhdata.Rproj
├── NAMESPACE
├── notebook
│ ├── daily_updates.Rmd
│ ├── update_tags.Rmd
├── R
│ ├── annotate_vlist_tmp.R
│ ├── api_aid2cid.R
│ ├── api_getbilitags.R
│ ├── api_getuploads_fp.R
│ ├── api_getuploads.R
│ ├── api_upload_imgur.R
│ ├── daily_update.R
│ ├── fromJSON_fix.R
│ ├── generate_posts2.R
│ ├── generate_posts.R
│ ├── getbangumi.R
│ ├── getyearsdf.R
│ ├── update_meta.R
│ └── yihui_fetch.R
└── README.md
The main parts of the project are those *.R
files under R/
directory. Each file contains a group of R functions. These functions are used in the rmd files with source()
in a code chunk at the beginning.
There’s a bin/
dir to store executable bash scripts, I put them in a crontab to automatically run tasks every 4 hours on a server.
There’s a data/
dir to store data created by R, stored with save()
and restored with
load()
.
There’s a notebook/
dir to place my rmarkdown files. By putting most of my functions in the *.R
scripts, I can keep these notebooks relatively short. At the head of each notebook, there’s a setup R code chunk that loads all necessary R scripts or data like this. A better way is to create a lhdata R package,then these source can be placed by a single call of library(lhdata)
, neat!
```r
pkg_path="~/GIT/lhdata"
post_path="~/GIT/owaraisite/content/post/"
source(encoding="UTF-8",file=paste0(pkg_path,"/R/api_getuploads.R"))
source(encoding="UTF-8",file=paste0(pkg_path,"/R/api_getuploads_fp.R"))
source(encoding="UTF-8",file=paste0(pkg_path,"/R/getbangumi.R"))
source(encoding="UTF-8",file=paste0(pkg_path,"/R/getyearsdf.R"))
source(encoding="UTF-8",file=paste0(pkg_path,"/R/api_aid2cid.R"))
source(encoding="UTF-8",file=paste0(pkg_path,"/R/api_getbilitags.R"))
source(encoding="UTF-8",file=paste0(pkg_path,"/R/api_upload_imgur.R"))
source(encoding="UTF-8",file=paste0(pkg_path,"/R/generate_posts2.R"))
source(encoding="UTF-8",file=paste0(pkg_path,"/R/annotate_vlist_tmp.R"))
source(encoding="UTF-8",file=paste0(pkg_path,"/R/fromJSON_fix.R"))
```
Summary
You may have started grunting when seeing those absolute paths in my code. I hereby shamelessly admit that this is a bad practice and forgive my younger self. A better way is to use the here
package that specify the project path, regardless of where the code is called under those nested subfolders. for example, source(here::here("R/api_getuploads.R"))
in notebooks/test.rmd
can correctly source the script at R/api_getuploads.R
.
I do not intended to provide guidelines on how to create an R package. That topic has been covered nicely by Hilary Parker. It’s a bonus, but not a must for neatly managing your R-related project.
To sum it up, here’s my recommended workflow for a project:
Manage your project in one “project directory”, or “project repo”, you can create a project in Rstudio by
File-New Project
conveniently. A bonus of doing this is that you can put your project under Git version control, and host it on github for sharing or backup.Use subfolders under your “project repo” to arrange different project files, such as data(
dataset/
), Rscript (R/
), rmarkdown reports (notebooks/
).If possible, split long files into short ones. This applies to scripts and rmarkdown notebooks.
If possible, do not use absolute paths, use
here
package instead. Here’s a post that discussed more