This document lists coding best practices in R for biometricians and quantitative biologists within ADF&G Commercial Fisheries, in the hope that a common framework will facilitate collaboration and code review. Much of what is in here originates from Grolemund and Wickham’s excellent book R for Data Science http://r4ds.had.co.nz/. This is all predicated upon the assumption that your data are human and machine readable!
Generally speaking it is best to use Google’s R style guide. If in doubt, default to it: https://google.github.io/styleguide/Rguide.xml
File names
Do use:
lower_case_and_underscore.R
If you are changing files or versions often then adding a date can be beneficial.
data_2017_01_20.csv
or add a version
data_2017_01_20_v2.csv
Do not use:
DontUseMixedCases.R
periods.are.meh.but.ok.csv
dont_use.periods_and.underscore.R
Definitely don’t use a name that only you can understand
X10_20.16_T_AND_Y410.csv
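The naming rules above can be checked mechanically. The following is a minimal sketch, not part of the original guide: f_check_file_name is a hypothetical helper, and the regular expression encodes one reading of the rules (lower case letters and digits, underscores as the only separator, a single extension).

```r
# Hypothetical helper (an assumption, not from the guide): TRUE when a file
# name is lower case, uses underscores as the only separator, and has a
# single extension.
f_check_file_name <- function(file_name) {
  grepl("^[a-z0-9]+(_[a-z0-9]+)*\\.[A-Za-z]+$", file_name)
}

file_names <- c("lower_case_and_underscore.R",
                "data_2017_01_20_v2.csv",
                "DontUseMixedCases.R",
                "periods.are.meh.but.ok.csv",
                "X10_20.16_T_AND_Y410.csv")

# The first two names pass, the last three fail
f_check_file_name(file_names)
```

A stricter or looser pattern may suit your project better; the point is only that a consistent convention is easy to test.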
Be consistent within and across projects. For example, all data file names contain the same information:
datasource_briefdescription_firstyear_lastyear.csv
Examples:
survey_bio_1988_2016.csv
or
fishery_cpue_1998_2016.csv
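One way to keep data file names consistent is to build them from their parts. This is a small sketch under stated assumptions: f_data_file_name and its argument names are hypothetical and simply follow the datasource_briefdescription_firstyear_lastyear.csv pattern shown above.

```r
# Hypothetical helper (an assumption, not from the guide): assemble a data
# file name from its four parts so every file follows the same pattern.
f_data_file_name <- function(data_source, description, first_year, last_year) {
  paste0(paste(data_source, description, first_year, last_year, sep = "_"), ".csv")
}

f_data_file_name("survey", "bio", 1988, 2016)
f_data_file_name("fishery", "cpue", 1998, 2016)
```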
Object names
Rename all columns on import of a dataset. Do this at the beginning of a script to make sure that files can be joined without naming conflicts.
Keep names short, but descriptive.
Use: year or Year, cpue or std.cpue (update: I’ve moved away from using . and now use _, e.g., std_cpue)
No spaces
No_Upper_And_Upper_Case
NO_ALL_CAPS
catch works better than c, and you can search for and change catch (plus c is a function command…)
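As a rough illustration of renaming on import, the sketch below uses a made-up data frame in place of a real file; the original column names and values are invented for the example. Base R is used so it runs without extra packages, though dplyr::rename() would do the same job.

```r
# Stand-in for a freshly imported file with awkward column names
# (the names and values are invented for illustration)
raw <- data.frame(YEAR = c(2015L, 2016L),
                  `Std CPUE` = c(1.2, 0.9),
                  `Catch (kg)` = c(410, 385),
                  check.names = FALSE)

# Rename everything once, at the top of the script
names(raw) <- c("year", "std_cpue", "catch")  # catch is in kg

str(raw)
```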
Know your data types: if an object is a character or factor, use a capitalized name; if it is numeric (double or integer), use a lower case name. Define data types at the beginning of a script.
For example, you have the integer year in your data but you create figures based on the factor Year. If you keep the two separate you can easily plot and run various analyses without getting errors. Also, lm(catch ~ year) is quite a bit different than lm(catch ~ Year).
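To make the difference concrete, here is a small sketch with simulated data (the numbers are invented). With a numeric year the model fits a single linear trend; with a factor Year it estimates a separate mean for every year.

```r
set.seed(1)  # simulated data, for illustration only

catch_data <- data.frame(year = rep(2010:2014, each = 4))
catch_data$catch <- 100 + 5 * (catch_data$year - 2010) +
  rnorm(nrow(catch_data), sd = 10)
catch_data$Year <- factor(catch_data$year)

fit_trend <- lm(catch ~ year, data = catch_data)  # numeric: intercept + slope
fit_means <- lm(catch ~ Year, data = catch_data)  # factor: one mean per year

length(coef(fit_trend))  # 2 coefficients
length(coef(fit_means))  # 5 coefficients (intercept + 4 year contrasts)
```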
Instead of writing catch_kg, make a note at the beginning of the script that catch is in kg unless otherwise stated, then use catch. Making notes in your code files is just good practice in general.
Dates
Believe it or not, there is an international date/time standard (ISO 8601). The format is:
yyyy-mm-dd
Use it! Dates regularly have confounding errors - clean them at the beginning of a script and make all formats consistent (because what people send you will not be).
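A possible way to clean mixed date formats at the top of a script is sketched below. f_iso_date is a hypothetical helper, and the list of candidate formats is an assumption: it tries each format in order and keeps the first one that parses, so ambiguous inputs (day/month versus month/day) still need a human decision about which format the sender used.

```r
# Hypothetical helper (an assumption, not from the guide): try several input
# formats in order and return Date objects, which print as yyyy-mm-dd.
f_iso_date <- function(x, formats = c("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")) {
  out <- as.Date(rep(NA_character_, length(x)))
  for (fmt in formats) {
    todo <- is.na(out)                       # only retry unparsed entries
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

raw_dates <- c("2017-01-20", "01/20/2017", "20170120")
clean_dates <- f_iso_date(raw_dates)
format(clean_dates)  # all three become "2017-01-20"
```

Entries that match none of the formats stay NA, which makes them easy to find and fix by hand.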
Function names
I often start function names with an f to clearly identify them as functions:
f_brief_function_description
Use something like f_ricker or ricker_fun to name a Ricker spawner-recruit function. Functions named r, rick, or sr don’t tell you what they do - be at least slightly descriptive.
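As an example of the naming convention, a Ricker spawner-recruit function might look like the sketch below. The argument names (spawners, alpha, beta) and the example parameter values are assumptions chosen for illustration; the curve itself is the standard Ricker form, recruits = alpha * spawners * exp(-beta * spawners).

```r
# Ricker spawner-recruit curve
# spawners: number of spawners
# alpha:    productivity (recruits per spawner at low abundance)
# beta:     density dependence; recruitment peaks at spawners = 1 / beta
f_ricker <- function(spawners, alpha, beta) {
  alpha * spawners * exp(-beta * spawners)
}

# Illustrative values only
f_ricker(spawners = c(0, 500, 1000), alpha = 2, beta = 0.001)
```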
Assignments
Use <-, not =
Use TRUE and FALSE, not T or F (the latter can be reassigned, the former cannot).
The <- assignment shortcut is Alt+-, the “pipe” %>% operator shortcut is Ctrl+Shift+M, and more shortcuts in RStudio can be found by pressing Alt+Shift+K.
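The reason for spelling out TRUE and FALSE can be shown in a few lines. This is a small demonstration, with object names invented for the example: T is an ordinary variable that can be overwritten, while TRUE is a reserved word and assigning to it is an error.

```r
# T can be silently overwritten (done inside local() so the change
# does not leak into the rest of the session)
reassigned_T <- local({
  T <- FALSE
  T
})
reassigned_T  # FALSE, even though it is spelled "T"

# TRUE cannot be overwritten: the assignment fails
reassigned_TRUE <- tryCatch(
  eval(parse(text = "TRUE <- FALSE")),
  error = function(e) "error: TRUE is a reserved word"
)
reassigned_TRUE
```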
Spacing
Place spaces around all binary operators (=, +, -, <-, etc.). Do not place a space before a comma, but always place one after a comma. Place a space before a left parenthesis, except in a function call. There should be a hard return after each pipe %>%.
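The spacing rules are easiest to judge side by side. The sketch below uses invented numbers and leaves out pipes so that it runs without any packages; both versions compute the same thing, only the readability differs.

```r
# Spaced: operators and commas are easy to pick out
total_catch <- sum(c(410, 385, 402), na.rm = TRUE)
if (total_catch > 1000) {
  mean_catch <- total_catch / 3
}

# Unspaced: identical result, much harder to scan
total_catch_2<-sum(c(410,385,402),na.rm=TRUE)
if(total_catch_2>1000){mean_catch_2<-total_catch_2/3}
```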
Use RStudio projects!
Don’t save your work environment - you should rerun your code each time to make sure you haven’t broken anything. If the analyses are lengthy or complex then save the output, which can then be sourced. In RStudio you can go to “Tools > Global Options > Save workspace to .RData on exit”, change it to “Never”, and your workspace will not be saved.
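One way to save the output of a lengthy analysis without saving the whole workspace is saveRDS() and readRDS(). This is a sketch of that approach, not necessarily the author's method: the "output" folder name is an assumption, a temporary directory stands in for the project folder, and a quick lm() on the built-in cars data stands in for the slow analysis.

```r
# In a real project this would be a relative path such as "output";
# tempdir() is used here only so the sketch runs anywhere.
out_dir <- file.path(tempdir(), "output")
dir.create(out_dir, showWarnings = FALSE)

# Stand-in for a lengthy analysis
fit <- lm(dist ~ speed, data = cars)

# Save the single object, then reload it in a later session or script
saveRDS(fit, file.path(out_dir, "fit.rds"))
fit_reloaded <- readRDS(file.path(out_dir, "fit.rds"))

all.equal(coef(fit), coef(fit_reloaded))
```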
Don’t use absolute paths, setwd(), or attach() in your scripts - if data are confidential then write a note in the script on how to call the data (e.g., OceanAK), where it is stored on an ADF&G server, or whom to contact. Why does this matter? Your setwd() is not the same as mine. Relative paths only - hence the reason for projects. Only load packages that you are actually using.
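A relative path is resolved from the project folder, so the same line works on anyone's machine once the project is open. The sketch below assumes a data folder and a file name borrowed from the naming examples; neither needs to exist for the code to run.

```r
# Relative to the project folder; no drive letters or user names
data_path <- file.path("data", "survey_bio_1988_2016.csv")

if (file.exists(data_path)) {
  survey <- read.csv(data_path)
} else {
  # For confidential data, this is the place for a note on how to obtain it
  message("Expected ", data_path, " relative to the project folder")
}
```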
If everyone uses a similar structure for projects and scripts we will be able to read and understand each other’s work faster and more easily.
A project should have a number of folders:
Sometimes a few other folders may be included, such as:
This structure works well for developing R scripts, writing in markdown and can work well when writing with Sweave.
The general structure of a script should have section comments such as # load ---- or # data ---- to create breaks. They make it easy to navigate within your scripts and make lengthy analyses much easier to follow. At a minimum a script structure should look like:
# notes ----
# author
# contact
# date (or last changed date)
# load ----
library(tidyverse) # only the packages the script actually uses
source('code/functions.R')
# data ----
data <- read_csv('data/data_file.csv')
data %>%
  mutate(Year = factor(year)) -> data
# analysis ----
For example:
# notes ----
# This is a demonstration of how scripts should be setup
# Author: Ben Williams
# contact: ben.williams@alaska.gov
# Last edited: 2017-07-07
# load ----
library(tidyverse)
library(FNGr)
theme_set(theme_sleek())
# data ----
# typically this would be read_csv("data/iris.csv")
# but the iris dataset is built in
# change names
names(iris) <- c('sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'Species')
glimpse(iris)
## Observations: 150
## Variables: 5
## $ sepal_length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
## $ sepal_width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
## $ petal_length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
## $ petal_width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, s...
# eda ----
ggplot(iris, aes(sepal_length, sepal_width, color = Species)) +
  geom_point() +
  ylab('Sepal Width') +
  xlab('Sepal Length')
Note: This section is under development
What is markdown?
What can it be converted to?
pdf, word, html
How does that work? Via a program called Pandoc (a “universal” document converter) - don’t worry, it is included with RStudio.
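The conversion can also be run from the console with rmarkdown::render(), which calls Pandoc behind the scenes. This is a hedged sketch: report.Rmd is a hypothetical file name, the code does nothing if the file or the rmarkdown package is missing, and the pdf output additionally requires a LaTeX installation.

```r
rmd_file <- "report.Rmd"  # hypothetical file name, for illustration

# One output format per target: html, word, pdf
formats <- c("html_document", "word_document", "pdf_document")

if (requireNamespace("rmarkdown", quietly = TRUE) && file.exists(rmd_file)) {
  for (fmt in formats) {
    rmarkdown::render(rmd_file, output_format = fmt)
  }
}
```

In day-to-day use the Knit button in RStudio does the same thing for the format named in the document header.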
When to use it
- Whenever you want to rapidly communicate annotated code (informal report)
- Initial report writing
- To keep notes for yourself
Structure
* YAML - ‘YAML Ain’t Markup Language’, a human (and computer) readable data format. This is the “frontmatter” of your report that tells markdown what type of output you would like to have.
* There are a slew of options http://rmarkdown.rstudio.com/html_document_format.html - I’ve generally found that keeping it rather basic works well, e.g.,
---
title: 'This is a title'
author: 'Me'
date: "2017-06-10"
output: pdf_document
fontsize: 11pt
csl: canjfas.csl
bibliography: bibby.bib
---
This has a title, author and date that will be at the head of the document. I’ve told it to generate a pdf, have included a bibliography (BibTeX .bib format) for references, and am using the Canadian Journal of Fisheries and Aquatic Sciences .csl (citation style language) file that formats the references. Here is a good site for downloading .csl styles: https://github.com/citation-style-language/styles
sessionInfo(c("ggplot2", "FNGr"))
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 15063)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## character(0)
##
## other attached packages:
## [1] ggplot2_3.0.0 FNGr_0.1.10
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.18 cellranger_1.1.0 pillar_1.3.0 compiler_3.5.1
## [5] plyr_1.8.4 bindr_0.1.1 methods_3.5.1 forcats_0.3.0
## [9] utils_3.5.1 tools_3.5.1 grDevices_3.5.1 digest_0.6.16
## [13] lubridate_1.7.4 jsonlite_1.5 evaluate_0.11 tibble_1.4.2
## [17] nlme_3.1-137 gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.1
## [21] rlang_0.2.1 cli_1.0.0 rstudioapi_0.7 yaml_2.2.0
## [25] haven_1.1.2 bindrcpp_0.2.2 withr_2.1.2 xml2_1.2.0
## [29] httr_1.3.1 dplyr_0.7.6 stringr_1.3.1 knitr_1.20
## [33] hms_0.4.2 graphics_3.5.1 datasets_3.5.1 stats_3.5.1
## [37] rprojroot_1.3-2 grid_3.5.1 tidyselect_0.2.4 glue_1.3.0
## [41] base_3.5.1 R6_2.2.2 readxl_1.1.0 rmarkdown_1.10
## [45] readr_1.1.1 modelr_0.1.2 purrr_0.2.5 tidyr_0.8.1
## [49] magrittr_1.5 backports_1.1.2 scales_0.5.0 htmltools_0.3.6
## [53] rvest_0.3.2 assertthat_0.2.0 tidyverse_1.2.1 colorspace_1.3-2
## [57] labeling_0.3 stringi_1.2.4 lazyeval_0.2.1 munsell_0.5.0
## [61] broom_0.5.0 crayon_1.3.4