Getting Started with R - Building Good Habits

Posted in December 2020 to programming

I recently had to start using R for a couple of university courses. Everyone had to write their projects in Rmarkdown, and submit them in both PDF and Rmarkdown format. The past few weeks have been a bit of a blur. I'm also enrolled in an introductory R course, but being more used to just searching for things when I have a problem, I haven't been paying much attention to its lessons. So when I actually had to write the code for these projects, it was Google time! Since then it's been a crazy mess of different libraries, dodging R's quirks, and just trying to get things to work the way I want them to. I finally submitted the last main project the other day, so now I'm free to do a bit of amateur forensics on the code I wrote.

In this post, I'd like to document some of the tricks I learnt, or things that I think went particularly well, so that I can use them again in future.

R Environment

This was a major struggle. First off, my laptop has 64-bit libraries only, and it took me a long time to realise that that was causing some package installs to fail. I'd already given up using R on my laptop when I found this, but I think this is the solution:

install.packages("<package>", INSTALL_opts = "--no-multiarch")

My next stop was Jupyter, but I had trouble running inline R code in my Rmarkdown documents, so I eventually gave up on that too. I was hesitant to install RStudio on my laptop, but then I discovered that a server-based version is available, so I chose that. I had no trouble installing it in an Ubuntu LXC container.

I'm still not entirely clear on the demarcation between workspaces, projects and environments, but creating a project and linking it to git gave me a git repo whose root folder name was the project name, so that part wasn't too complicated. This was handy for SSHing in to resolve merge conflicts or do other non-trivial git stuff. Speaking of which, this entry from my ~/.ssh/config shows my trick for SSHing into individual LXC containers via the host:

Host rstudio
    IdentityFile ~/.ssh/<key file>
    ProxyCommand ssh -T -A <server> lxc exec %h -- /usr/sbin/sshd -i

The next issue arose from working with other people. If I put Linux-specific settings like latex_engine: xelatex in the YAML metadata, then other people on the team couldn't knit my file. To get around this, I added the .Rproj file to the gitignore, and removed OS-specific settings from the YAML metadata. This way, where necessary we could each use our own settings for knitting the files. This was the .gitignore file we ended up with:

_snaps/*
.Rproj.user
.Rhistory
.RData
.Ruserdata
*.log

*.swp
*.Rproj

*.pdf
*.pptx

With that out of the way, it was time to write some actual code!

Coding

Libraries

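I don't know if this is an R thing or a me thing, but I had major issues with namespaces in R. The order in which you load libraries makes a difference: when two attached packages export a function with the same name, the one loaded later masks the one loaded earlier. I think the one I had an issue with was dplyr, which has to be loaded after other packages with identically-named functions so that its versions are the ones that get called.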
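One way to make a call independent of load order is to qualify it with the package name using `::`. The sketch below uses only base R: a global `filter()` stands in for a function from a package attached later (an assumption made for illustration, not the dplyr conflict itself), while `stats::filter()` still reaches the original.

```r
# Stand-in for a later-attached package: this definition masks stats::filter()
filter <- function(x, ...) "masked version"

# The unqualified call resolves to the masking definition
unqualified <- filter(1:5, rep(1, 3))

# The qualified call always reaches the stats version (a moving sum here)
qualified <- stats::filter(1:5, rep(1, 3))
```

The same pattern applies to any package, for example writing `dplyr::select(...)` instead of `select(...)`, and `conflicts(detail = TRUE)` lists the names that are currently masked.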

YAML Metadata

These settings worked for me:

header-includes:
- \usepackage{amsmath}
- \usepackage[utf8]{inputenc}

Also, we had to submit our presentations with font size 11, 1.5-line spacing. In Latex-speak, that line spacing translates to:

header-includes:
- \linespread{1.25}

Other Top-of-File Stuff

In writing a project, you might have to set a seed. Because this is just a number, it seems a bit arbitrary (which I guess it is, but ho hum). However, if you load the gtools library, you can do this instead:

library(gtools)  # provides asc(), which returns the ASCII code of each character
set.seed(sum(asc("Rocking the custom seed text!")))
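If installing gtools just for this is not worth it, base R's `utf8ToInt()` should give the same kind of result. This is a sketch of an alternative, not the approach used in the projects, and the variable names are made up for illustration.

```r
# Turn a memorable phrase into a numeric seed using base R only
seed_text <- "Rocking the custom seed text!"
seed <- sum(utf8ToInt(seed_text))  # sum of the character code points

# Setting the same seed twice reproduces the same random draws
set.seed(seed)
first_draw <- runif(3)
set.seed(seed)
second_draw <- runif(3)
stopifnot(identical(first_draw, second_draw))
```

Because the codes are simply summed, any rearrangement of the same characters produces the same seed, which is fine for reproducibility but worth knowing.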

Another thing to include at the outset is colour scheme data. I actually used different methods for setting colour schemes between the two projects I worked on. One was RColorBrewer, and the other was just plain colours in R.

The RColorBrewer way:

library(RColorBrewer)

...

# First two colours of the 8-colour "Dark2" palette
twocolor <- brewer.pal(8, "Dark2")[1:2]
# brewer.pal() returns at least three colours
threecolor <- brewer.pal(3, "Dark2")

ggplot + 
... + 
  scale_color_manual(values = twocolor)

One issue I had with this was that I wanted to set transparency. I found this, but installing the underlying GISTools stopped my teammate's R from running, so we had to abandon it.

The plain R way:

acolor <- c("A" = "lightblue", "B" = "orange")

ggplot + 
... +
  scale_color_manual(values = acolor)

I didn't attempt to add transparency here. It looks like it wouldn't be easy.
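One possible route to transparency, not tried in these projects, is `adjustcolor()` from the grDevices package that ships with R. It converts colour names or hex codes to hex codes with an alpha channel. The sketch below re-attaches the names explicitly rather than assuming they survive the conversion; `acolor_alpha` is a hypothetical name.

```r
# Named colour vector, as in the plain R approach
acolor <- c("A" = "lightblue", "B" = "orange")

# alpha.f scales opacity: 1 is fully opaque, 0 is fully transparent
acolor_alpha <- setNames(grDevices::adjustcolor(acolor, alpha.f = 0.5),
                         names(acolor))
```

The result can be passed to `scale_color_manual(values = acolor_alpha)` in the same way as before, and the same call should work on the output of `brewer.pal()`.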

Testing

In the projects I contributed to, I divided tests into two types: one using the testthat package, and another that I'll explain later. The testthat package provides a really convenient way of testing code. I used it after making changes to a data object, such as casting a vector as a different type, or doing calculations. It provides a way of making sure you haven't accidentally broken your data. This could involve checking row or column counts:

library(testthat)
test_that("no change in data frame dimensions", {
    # st is the data frame under test
    expect_equal(nrow(st), 200)
    expect_equal(ncol(st), 15)
})

checking that factors have been applied correctly:

test_that("no change after casting as factors", {
    # Count rows per level: expect fewer left-handed than right-handed rows
    expect_true(sum(st$dominant_hand == "left") <
                    sum(st$dominant_hand == "right"))
})

or doing checks against a single data point:

test_that("bmi group matches height and weight", {
    # id == 204
    # weight      height
    #     81         175 --> BMI = 26
    expect_equal(as.character(st[st$id == 204, ]$bmigroup), "overweight")
})
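The arithmetic behind that single-row check can be sketched as below. It assumes weight in kilograms and height in centimetres, which is the reading under which 81 and 175 give a BMI of about 26. The column names, the cut-offs and every group label other than "overweight" are assumptions for illustration.

```r
# Hypothetical one-row data frame; column names are assumptions
person <- data.frame(id = 204, height_cm = 175, weight_kg = 81)

# BMI = weight (kg) / height (m)^2
person$bmi <- person$weight_kg / (person$height_cm / 100)^2

# Assumed cut-offs: [18.5, 25) normal, [25, 30) overweight, 30 and above obese
person$bmigroup <- cut(person$bmi,
                       breaks = c(-Inf, 18.5, 25, 30, Inf),
                       labels = c("underweight", "normal", "overweight", "obese"),
                       right = FALSE,
                       ordered_result = TRUE)

stopifnot(round(person$bmi) == 26,
          as.character(person$bmigroup) == "overweight")
```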

Where the original data was loaded as a data frame, I did the initial data cleaning, did tests to confirm no change, and then deleted the original object.

Inline R Code

One thing that seemed like a good best practice to me was using references to variables within text instead of numbers. I've worked on projects before where a small change in the data has resulted in a large-scale change of hard-coded numbers. I put a chunk of code before the paragraph with the numbers, defined a short variable I could refer to in the text, and then put something like:

`r qpc `

in the text.

The other thing I put in the code chunk was a quick test - the second type of test that I was referring to above. For example, say we expected qpc to be a positive number. We could use the stopifnot command to write a quick test checking the value of qpc:

stopifnot(qpc > 0)

I matched these tests with any statements made in the text. If I wrote that two numbers were nearly equal, I would add a quick stopifnot test to confirm that they really were almost the same. This makes the document more resilient to change, and also provides another chance to spot errors.
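A chunk of this kind might look like the sketch below. The data, the variable names other than qpc and the tolerance are all made up for illustration. `all.equal()` is one way to express "nearly equal", and wrapping it in `isTRUE()` is needed because it returns a description rather than FALSE when the values differ.

```r
# Hypothetical survey answers standing in for real project data
answers <- c("yes", "no", "yes", "yes", "no")

# Short variable to reference inline in the text
qpc <- round(100 * mean(answers == "yes"), 1)
stopifnot(qpc > 0)

# If the text says two means are nearly equal, back the claim with a check
mean_a <- mean(c(10.1, 9.9, 10.0))
mean_b <- mean(c(10.0, 10.2, 9.9))
stopifnot(isTRUE(all.equal(mean_a, mean_b, tolerance = 0.01)))
```

If the data later changes so that a claim no longer holds, knitting stops at the failing check instead of quietly producing a document with a wrong sentence in it.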

Figures

The seams holding Latex, R and Rmarkdown together can sometimes be quite easy to spot, and it feels as if displaying images is one of those areas. I haven't figured out some of the details for displaying figures yet - for example, I haven't found a way to wrap text around an image - and it feels to me like these points are approaching the limit of what can be done by gluing these approaches together.

One area that was quite easy to integrate was captions. Using the captioner package, it was relatively simple to add captions and reference them in the text (although for some reason, it broke whenever I tried to label something as a table). I needed the following commands to get it working:

# in the YAML metadata
header-includes:
- \usepackage{caption}

# outside a code chunk
\captionsetup[table]{labelformat=empty}
\captionsetup[figure]{labelformat=empty}

# in code chunks
library(stringr)
library(captioner)

fig_nums <- captioner::captioner(prefix = "Figure")
f.ref <- function(x) {
  stringr::str_extract(fig_nums(x), "[^:]*")
}
ageblock <- fig_nums(name = "ageblock",
                     caption = "Charts characterising the age variable")

The opening part of the code chunk with the image is labelled as {r fig.cap = ageblock}. With these things in place, it is now possible to refer to the figure in the text with `r f.ref('ageblock')`.
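The f.ref() helper above keeps only the part of the label before the first colon. The sketch below shows the same idea in base R on a hard-coded string. The exact label format is an assumption here, so check it against what captioner produces in your own document.

```r
# Assumed shape of a full caption label: "prefix number: caption text"
full_label <- "Figure 1: Charts characterising the age variable"

# Drop everything from the first colon onwards, leaving the short reference
short_label <- sub(":.*$", "", full_label)
```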

Also on the topic of figure blocks, if you have two figures side by side, you can reduce the height of the figure block by setting fig.height in the chunk header, for example {r fig.height = 2.3}. Of course, you might need to adjust it a bit to get the height right.

Other Latex Tricks

Because the code gets parsed by Latex at some point in its journey to PDF-dom, you can insert Latex commands and have them get picked up by Latex. For example, \newpage will give you a new page, or $\hat\beta$ will give you a dapper estimate.

Other Minor Points

In a lot of the examples I found online, I saw people using the variable df for a data frame object. After a while, I realised that there already was a df function (the density of the F distribution, in the stats package), and that this variable assignment shadows its name. I'm guessing that df isn't a very popular function, otherwise people wouldn't do this, but this was a minor bugbear for me. I also followed the example set by this style guide and kept my variables lowercase.
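In practice the clash is milder than it looks, because R skips non-function objects when it resolves a function call. A small sketch, in which the name `survey` is a hypothetical replacement:

```r
# Assign a data frame to the name df, as many online examples do
df <- data.frame(x = 1:3)

# The call still finds stats::df(), because the data frame is not a function
density_at_1 <- df(1, df1 = 2, df2 = 3)

# A descriptive lowercase name avoids the ambiguity altogether
survey <- data.frame(x = 1:3)
```

So the cost is mostly readability: anyone reading the code has to work out which df is meant at each point.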

Other Useful Libraries

  • hash - this library lets you create objects that behave a bit like a Python dict.
  • ggResidpanel - good for making residual plots in a hurry
  • cowplot - plot_grid is really useful for putting several graphs together

Things I wasn't able to do

When I first found out about being able to add units to numeric variables, I was very excited. You can actually do calculations involving units, and it'll give the correct unit with the answer. Amazing!

Well, yes and no. Unfortunately, I couldn't use it for very long before hitting an issue: I was unable to get a summary of a linear regression using vectors with units. That issue is almost two years old, so it seems unlikely the package will be useable with linear regression any time soon.

Another thing that I couldn't find a solution to was saving my data in plain text. You can use save to store an object as an RData file, but if you are at all obsessive about wanting to store only plain text in your git repo, then this may be a source of problems for you. I tried to export a data frame as a CSV, but of course, vector metadata was not saved, so for example, my ordered factor stopped being ordered. Very troubling indeed.
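One plain-text option that may cover this case, and which was not tried in these projects, is base R's `dput()` and `dget()`. `dput()` writes out the R code that recreates an object, so attributes such as factor ordering are kept. The data frame and file path below are made up for illustration.

```r
# Hypothetical data frame with an ordered factor
st_small <- data.frame(
  id = 1:3,
  bmigroup = factor(c("normal", "overweight", "normal"),
                    levels = c("normal", "overweight"),
                    ordered = TRUE)
)

# dput() writes a plain-text R expression that git can diff
path <- tempfile(fileext = ".R")
dput(st_small, file = path)

# dget() evaluates that text and returns the object, attributes included
restored <- dget(path)
stopifnot(is.ordered(restored$bmigroup))
```

The output is more verbose than a CSV and gets unwieldy for large data, so another option is to keep the CSV and re-apply the factor levels in the loading script.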

A final niggle that I mentioned above was being unable to easily add transparency to the definitions of colour schemes.

The End

Or rather, the beginning ...?

This has been an attempt to make sense of my frantic first few weeks learning R. If you're future me, or even if you're not, I hope you've found something of use to you in here!