Overview
Teaching: 20 min
Exercises: 10 min
Questions
How can I manage my projects in R?
Objectives
To be able to create self-contained projects in RStudio
To be able to use git from within RStudio
The scientific process is naturally incremental, and many projects start life as random notes, some code, then a manuscript, and eventually everything is a bit mixed together.
“Managing your projects in a reproducible fashion doesn't just make your science reproducible, it makes your life easier.” (Vince Buffalo, @vsbuffalo, April 15, 2013)
Most people tend to organize their projects like this:
There are many reasons why we should ALWAYS avoid this:
A good project layout will ultimately make your life easier:
Fortunately, there are tools and packages which can help you manage your work effectively.
One of the most powerful and useful aspects of RStudio is its project management functionality. We’ll be using this today to create a self-contained, reproducible project.
Challenge: Creating a self-contained project
We’re going to create a new project in RStudio:
- Click the “File” menu button, then “New Project”.
- Click “New Directory”.
- Click “Empty Project”.
- Type in the name of the directory to store your project, e.g. “my_project”.
- Make sure that the checkbox for “Create a git repository” is selected.
- Click the “Create Project” button.
Now when we start R in this project directory, or open this project with RStudio, all of our work on this project will be entirely self-contained in this directory.
Although there is no “best” way to lay out a project, there are some general principles to adhere to that will make project management easier:
Treating your data as read-only is probably the most important goal of setting up a project. Data are typically time consuming and/or expensive to collect. Working with them interactively (e.g., in Excel), where they can be modified, means you are never sure where the data came from or how they have been modified since collection. It is therefore a good idea to treat your data as “read-only”.
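One way to back up the read-only convention at the filesystem level is to remove write permission from the raw data files. The sketch below is an illustration rather than part of the lesson: it works in a throwaway directory, and the file name raw.csv is hypothetical.

```shell
# Sketch: make a raw data file read-only (all paths here are hypothetical).
proj=$(mktemp -d)                 # throwaway stand-in for your project directory
mkdir -p "$proj/data"
printf 'country,year\nAfghanistan,1952\n' > "$proj/data/raw.csv"

# Remove write permission for user, group, and others.
chmod a-w "$proj/data/raw.csv"

# Show the resulting permissions; no "w" should appear in the mode string.
ls -l "$proj/data/raw.csv"
```

This only guards against accidental edits: anyone who owns the file can restore write permission with chmod u+w.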
In many cases your data will be “dirty”: it will need significant preprocessing to get into a format R (or any other programming language) will find useful. This task is sometimes called “data munging”. I find it useful to store the scripts that do this cleaning in a separate folder, and create a second “read-only” data folder to hold the “cleaned” data sets.
Anything generated by your scripts should be treated as disposable: it should all be able to be regenerated from your scripts.
There are lots of different ways to manage this output. I find it useful to have an output folder with different sub-directories for each separate analysis. This makes it easier later, as many of my analyses are exploratory and don’t end up being used in the final project, and some of the analyses get shared between projects.
Tip: Good Enough Practices for Scientific Computing
Good Enough Practices for Scientific Computing gives the following recommendations for project organization:
- Put each project in its own directory, which is named after the project.
- Put text documents associated with the project in the doc directory.
- Put raw data and metadata in the data directory, and files generated during cleanup and analysis in a results directory.
- Put source for the project’s scripts and programs in the src directory, and programs brought in from elsewhere or compiled locally in the bin directory.
- Name all files to reflect their content or function.
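These recommendations can be sketched as a single shell command. The directory names doc, data, results, src, and bin follow the Good Enough Practices layout; the project name my_project and the throwaway parent directory are assumptions made for the demonstration.

```shell
# Sketch: create the recommended layout for a project (hypothetical name "my_project").
base=$(mktemp -d)                      # throwaway parent directory for the demo
proj="$base/my_project"

# doc: text documents; data: raw data and metadata; results: generated files;
# src: the project's own scripts; bin: external or locally compiled programs.
mkdir -p "$proj/doc" "$proj/data" "$proj/results" "$proj/src" "$proj/bin"

# List the new top-level directories.
ls "$proj"
```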
Tip: ProjectTemplate - a possible solution
One way to automate the management of projects is to install the third-party package ProjectTemplate. This package sets up a standard directory structure for project management, which helps keep your analysis pipeline/workflow organised and structured. Together with the default RStudio project functionality and Git, it lets you keep track of your work and share it with collaborators.
- Install ProjectTemplate.
- Load the library.
- Initialise the project:
install.packages("ProjectTemplate")  # install the package (only needed once)
library(ProjectTemplate)             # load the package
# create the project skeleton, merging into an existing directory if nothing conflicts
create.project("../my_project", merge.strategy = "allow.non.conflict")
For more information on ProjectTemplate and its functionality, visit the ProjectTemplate home page.
The most effective way I find to work in R is to play around in the interactive session, then copy commands across to a script file when I’m sure they work and do what I want. You can also save all the commands you’ve entered using the history command, but I don’t find it useful because when I’m typing it’s 90% trial and error.
When your project is new and shiny, the script file usually contains many lines of directly executed code. As it matures, reusable chunks get pulled into their own functions. It’s a good idea to separate these into separate folders; one to store useful functions that you’ll reuse across analyses and projects, and one to store the analysis scripts.
Tip: avoiding duplication
You may find yourself using data or analysis scripts across several projects. Typically you want to avoid duplication to save space and avoid having to make updates to code in multiple places.
In this case I find it useful to make “symbolic links”, which are essentially shortcuts to files somewhere else on a filesystem. On Linux and OS X you can use the ln -s command, and on Windows you can either create a shortcut or use the mklink command from the Windows terminal.
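A minimal sketch of the ln -s approach on Linux or OS X follows. The project names project_a and project_b and the file helpers.R are hypothetical, and everything lives in a throwaway directory.

```shell
# Sketch: share one script between two projects with a symbolic link.
base=$(mktemp -d)
mkdir -p "$base/project_a/src" "$base/project_b/src"
echo 'message("shared helper")' > "$base/project_a/src/helpers.R"

# ln -s TARGET LINK_NAME: the link in project_b points at the file in project_a.
ln -s "$base/project_a/src/helpers.R" "$base/project_b/src/helpers.R"

# Reading through the link returns the original file's content.
cat "$base/project_b/src/helpers.R"
```

An absolute path is used for the target so the link resolves no matter where it is read from. Note that mklink on Windows takes its arguments in the opposite order (link name first, then target).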
Now that we have a good directory structure, we will place the data file in the data/ directory.
Download the gapminder data from here.
- Download the file (CTRL + S, right mouse click -> “Save as”, or File -> “Save page as”)
- Make sure it’s saved under the name gapminder-FiveYearData.csv.
- Save the file in the data/ folder within your project.
We will load and inspect these data later.
Challenge 2
It is useful to get a general idea about the dataset directly from the command line before loading it into R. Understanding the dataset better will come in handy when deciding how to load it into R. Use the command-line shell to answer the following questions:
1. What is the size of the file?
2. How many rows of data does it contain?
3. What are the data types of the values stored in this file?
Solution to Challenge 2
By running these commands in the shell:
ls -lh data/gapminder-FiveYearData.csv
-rw-r--r-- 1 phb staff 80K Jan 9 20:07 data/gapminder-FiveYearData.csv
The file size is 80K.
wc -l data/gapminder-FiveYearData.csv
There are 1705 lines and the data looks like:
country,year,pop,continent,lifeExp,gdpPercap
Afghanistan,1952,8425333,Asia,28.801,779.4453145
Afghanistan,1957,9240934,Asia,30.332,820.8530296
Afghanistan,1962,10267083,Asia,31.997,853.10071
Afghanistan,1967,11537966,Asia,34.02,836.1971382
Afghanistan,1972,13079460,Asia,36.088,739.9811058
Afghanistan,1977,14880372,Asia,38.438,786.11336
Afghanistan,1982,12881816,Asia,39.854,978.0114388
Afghanistan,1987,13867957,Asia,40.822,852.3959448
Afghanistan,1992,16317921,Asia,41.674,649.3413952
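To answer the question about data types, it helps to look at the column names next to one data row. The sketch below is one possible way to do that, not the lesson's official solution: it builds a small sample file from the header and first three rows shown above so that it can run without the full download.

```shell
# Sketch: inspect a CSV from the shell before loading it into R.
# The sample holds the header and the first three data rows of the gapminder file.
f=$(mktemp)
cat > "$f" <<'EOF'
country,year,pop,continent,lifeExp,gdpPercap
Afghanistan,1952,8425333,Asia,28.801,779.4453145
Afghanistan,1957,9240934,Asia,30.332,820.8530296
Afghanistan,1962,10267083,Asia,31.997,853.10071
EOF

# Number of lines, including the header line.
n_lines=$(wc -l < "$f" | tr -d ' ')
echo "lines: $n_lines"

# Column names, one per line.
head -n 1 "$f" | tr ',' '\n'

# First data row, to eyeball each column's type.
sed -n '2p' "$f"
```

Reading the row against the header, country and continent hold text, year and pop hold integers, and lifeExp and gdpPercap hold decimal numbers.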
Tip: command line in RStudio
You can quickly open up a shell in RStudio using the Tools -> Shell… menu item.
We also set up our project to integrate with git, putting it under version control. RStudio has a nicer interface to git than the shell, but it is limited in what it can do, so you will occasionally need to use the shell. Let’s go through and make an initial commit of our template files.
The workspace/history pane has a tab for “Git”. We can stage each file by checking its box: you will see a green “A” next to staged files and folders, and yellow question marks next to files or folders git doesn’t know about yet. RStudio also shows you the differences between files from different commits.
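For the occasions when you need the shell instead, the staging and commit steps can be sketched as follows. This is an illustration in a throwaway repository: the file README.md, the commit message, and the inline user name and email are all placeholders.

```shell
# Sketch: stage and commit a file from the shell (throwaway repository).
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
echo "# my_project" > README.md      # hypothetical file to commit

git add README.md                    # stage: same as ticking the box in the Git tab
git status --short                   # newly staged files are listed with an "A"

# Identity is passed inline so the sketch works without global git configuration.
git -c user.name="Example User" -c user.email="user@example.com" \
    commit -q -m "Initial commit"

# One line per commit made so far.
git log --oneline
```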
Tip: versioning disposable output
Generally you do not want to version disposable output (or read-only data). You should modify the .gitignore file to tell git to ignore these files and directories.
Challenge 3
- Create a directory within your project called graphs.
- Modify the .gitignore file to contain graphs/ so that this disposable output isn’t versioned.
Add the newly created folders to version control using the git interface.
Solution to Challenge 3
This can be done with the command line:
$ mkdir graphs
$ echo "graphs/" >> .gitignore
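To check that the ignore rule has taken effect, you can ask git which files it still considers untracked. The sketch below repeats the two commands in a throwaway repository; the file plot.png is a hypothetical stand-in for generated output.

```shell
# Sketch: check that an ignored directory stays out of version control.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

mkdir graphs
touch graphs/plot.png                # hypothetical generated figure
echo "graphs/" >> .gitignore

# Untracked files git would offer to add; graphs/ should be absent from this list.
untracked=$(git status --porcelain)
echo "$untracked"
```

Only .gitignore itself should be reported as untracked, which is why it still needs to be staged and committed like any other file.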
Use RStudio to create and manage projects with consistent layout.
Treat raw data as read-only.
Treat generated output as disposable.
Separate function definition and application.
Use version control.