Intro

My main motivation for writing this came from this post, in which the OP asked how to manage a ~300-line R script in a hierarchical way in RStudio, just like we can fold/unfold a section under its heading in a well-structured R Markdown document.

No quick answer to this question came to mind. Thinking about it again, I started to wonder whether structuring a ~300-line R script is even necessary. Nah, it isn’t. An all-in-one R script will quickly become a monster three months later. It will become difficult to understand, debug and update, and eventually you may decide to write another all-in-one script from scratch.

I’ll use one of my projects, lhdata, to show how I manage an R project.

GitHub repo for lhdata

Note that this project was written before I learned how to create an R package, so many things can be greatly improved. I’ll point them out in the footnotes. Besides, some of the R code may hurt your eyes if you stare at it too long.

What This Project Does

lhdata is a web crawler that retrieves updates from bilibili and updates the blogdown site owaraiclub accordingly. But that’s not important, so let’s move on.

Project Structure

Here’s the directory structure of the project.

.
├── bin
│   ├── scrapebili_aspen.sh
│   └── scrapebili.sh
├── data
│   └── vlist.new.anno_humidavid.rda
├── DESCRIPTION
├── lhdata.Rproj
├── NAMESPACE
├── notebook
│   ├── daily_updates.Rmd
│   └── update_tags.Rmd
├── R
│   ├── annotate_vlist_tmp.R
│   ├── api_aid2cid.R
│   ├── api_getbilitags.R
│   ├── api_getuploads_fp.R
│   ├── api_getuploads.R
│   ├── api_upload_imgur.R
│   ├── daily_update.R
│   ├── fromJSON_fix.R
│   ├── generate_posts2.R
│   ├── generate_posts.R
│   ├── getbangumi.R
│   ├── getyearsdf.R
│   ├── update_meta.R
│   └── yihui_fetch.R
└── README.md


The main parts of the project are the *.R files under the R/ directory. Each file contains a group of R functions. These functions are loaded into the Rmd files with source() in a code chunk at the beginning.

There’s a bin/ dir to store executable bash scripts. I put them in a crontab so that the tasks run automatically every 4 hours on a server.

There’s a data/ dir to store data created by R, written with save() and restored with load().
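As a minimal sketch of that save()/load() round trip (the object name vlist_demo and the temporary file are made up for illustration, they are not taken from lhdata):

```r
# Hypothetical data frame standing in for a real video list.
vlist_demo <- data.frame(
  aid   = c(101L, 102L),
  title = c("first video", "second video"),
  stringsAsFactors = FALSE
)

# A temporary file is used here; in a project this would live under data/.
rda_path <- file.path(tempdir(), "vlist_demo.rda")

# save() writes the object together with its name.
save(vlist_demo, file = rda_path)

# Remove the object, then restore it. load() recreates it under its
# original name and invisibly returns the names of the restored objects.
rm(vlist_demo)
restored_names <- load(rda_path)
```

Keep in mind that load() silently overwrites any existing object with the same name. saveRDS()/readRDS() is an alternative if you would rather pick the name when reading the data back.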

There’s a notebook/ dir for my R Markdown files. By putting most of my functions in the *.R scripts, I can keep these notebooks relatively short. At the head of each notebook, there’s a setup R code chunk that loads all the necessary R scripts or data, like the one below. A better way is to create an lhdata R package; then these source() calls can be replaced by a single call to library(lhdata). Neat!



```r
# Absolute paths to the lhdata repo and the blog's post directory
pkg_path <- "~/GIT/lhdata"
post_path <- "~/GIT/owaraisite/content/post/"

# Load the helper functions this notebook needs
source(encoding = "UTF-8", file = paste0(pkg_path, "/R/api_getuploads.R"))
source(encoding = "UTF-8", file = paste0(pkg_path, "/R/api_getuploads_fp.R"))
source(encoding = "UTF-8", file = paste0(pkg_path, "/R/getbangumi.R"))
source(encoding = "UTF-8", file = paste0(pkg_path, "/R/getyearsdf.R"))
source(encoding = "UTF-8", file = paste0(pkg_path, "/R/api_aid2cid.R"))
source(encoding = "UTF-8", file = paste0(pkg_path, "/R/api_getbilitags.R"))
source(encoding = "UTF-8", file = paste0(pkg_path, "/R/api_upload_imgur.R"))
source(encoding = "UTF-8", file = paste0(pkg_path, "/R/generate_posts2.R"))
source(encoding = "UTF-8", file = paste0(pkg_path, "/R/annotate_vlist_tmp.R"))
source(encoding = "UTF-8", file = paste0(pkg_path, "/R/fromJSON_fix.R"))
```
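The ten source() calls above could also be collapsed into a loop. Here’s a rough sketch, assuming every *.R file in a directory should be loaded (the original chunk only loads a subset of R/). source_dir is a hypothetical helper name, and the demo uses a throwaway temp directory with made-up scripts rather than the real repo:

```r
# Hypothetical helper: source every *.R file found in `dir`.
source_dir <- function(dir, encoding = "UTF-8") {
  files <- list.files(dir, pattern = "\\.R$", full.names = TRUE)
  for (f in files) {
    # source() evaluates in the global environment by default
    source(f, encoding = encoding)
  }
  invisible(files)
}

# Demo with a temporary directory and two made-up scripts.
demo_dir <- file.path(tempdir(), "R_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("add_one <- function(x) x + 1", file.path(demo_dir, "add_one.R"))
writeLines("double_it <- function(x) x * 2", file.path(demo_dir, "double_it.R"))

sourced <- source_dir(demo_dir)
result <- double_it(add_one(1))
```

The trade-off is that a loop like this loads everything in the folder, so experimental or half-finished scripts get sourced too. Listing files explicitly, as in my original chunk, is noisier but more selective.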

Summary

You may have started groaning when you saw those absolute paths in my code. I hereby shamelessly admit that this is bad practice, and forgive my younger self. A better way is to use the here package, which resolves paths relative to the project root regardless of which nested subfolder the code is called from. For example, source(here::here("R/api_getuploads.R")) in notebook/test.Rmd can correctly source the script at R/api_getuploads.R.
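As a small illustration of the idea, assuming the here package is installed (it is a CRAN package, not part of base R), a path can be built from the project root like this. The fallback branch is my own addition so that the snippet still runs without here:

```r
# Build a path to R/api_getuploads.R relative to the project root.
if (requireNamespace("here", quietly = TRUE)) {
  # here::here() joins its arguments onto the project root it detected.
  script_path <- here::here("R", "api_getuploads.R")
} else {
  # Fallback: a plain relative path, only correct when the working
  # directory is the project root.
  script_path <- file.path("R", "api_getuploads.R")
}

# Only source the file if it actually exists at that location.
if (file.exists(script_path)) {
  source(script_path, encoding = "UTF-8")
}
```

The same here::here() call resolves to the same file whether it sits in a script at the top of the repo or in a notebook a few folders down, which is exactly what the hard-coded pkg_path was trying to achieve.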

I do not intend to provide guidelines on how to create an R package. That topic has been covered nicely by Hilary Parker. Building a package is a bonus, but not a must, for neatly managing your R-related project.

To sum it up, here’s my recommended workflow for a project:

  • Manage your project in one “project directory”, or “project repo”. You can conveniently create a project in RStudio via File > New Project. A bonus of doing this is that you can put your project under Git version control, and host it on GitHub for sharing or backup.

  • Use subfolders under your “project repo” to arrange different project files, such as data (dataset/), R scripts (R/), and R Markdown reports (notebooks/).

  • If possible, split long files into short ones. This applies to scripts and rmarkdown notebooks.

  • If possible, do not use absolute paths; use the here package instead. Here’s a post that discusses this in more detail.
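The folder layout from the list can be set up in a few lines of base R. This is only a sketch: the project name my_project and the temp-directory location are placeholders, and the subfolder names simply mirror the ones suggested above.

```r
# Placeholder project root; in practice this would be your repo directory.
project_dir <- file.path(tempdir(), "my_project")

# Subfolders suggested in the list above.
subdirs <- c("dataset", "R", "notebooks")

for (d in subdirs) {
  # recursive = TRUE also creates project_dir itself on the first pass.
  dir.create(file.path(project_dir, d), recursive = TRUE, showWarnings = FALSE)
}

# List the immediate subfolders that now exist under the project root.
created <- list.dirs(project_dir, full.names = FALSE, recursive = FALSE)
```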

Here is a cute cat that makes me happy.