This vignette describes the ideal data analytical workflow used by the Schola Empirica team. Its goal is to help us do analysis and create great reports reproducibly and efficiently, to make our lives easier, and to make life easier for our future selves.
Project structure should be such that it is obvious what happens where and the whole project can be rerun quickly, perhaps on new data. Ideally, the structure also facilitates efficiency: during analysis, things that only need to be done once are not rerun.
Everything we submit to a client/partner/stakeholder should be based on data and code that together reproducibly create the report: no hacks, no manual edits if at all possible, no copy-pasting images from one place to another.
Version control helps us track when each report was created, using what code and what data, and can help us go back and fix things if needed. It also helps us avoid mostly duplicate versions of everything lying around.
Code and data are real: everything else (outputs, reports, charts) can be regenerated from them.
Reports and charts are based on well-designed styles and templates which we use consistently.
Projects and reports should be self-documenting by their structure, code, workflow, and comments inside, and the git history with good commit messages.
Let’s not repeat code and keep reinventing the wheel.
The workflow described here does not rely on a particular framework that would enforce a project structure and a way of orchestrating all the bits together. It is more lightweight: it suggests a structure and a way of working, and provides suggested integrators (such as build.R, shared.R and the 00_ data load scripts). It relies on a simple order of execution and on the analyst putting the right bits of code in the right places. You are free to change any of this, but you are also responsible for making sure that the system you create actually works.
This lightweight approach also does not provide any optimisation with regard to build time, like a make- or drake-based workflow would (e.g. by only rebuilding outputs when the code has changed). If you would like a more rigid framework or need one that optimises for computing time, look through your options in the Resources section below.
The reschola
package provides an RStudio project
template that (a) takes care of setting up your project on Github (if
you let it) and (b) creates a default project structure, incorporating
key parameters that you give it in setup.
Feel free to adapt this in any way that works and remains understandable to someone who is not you.
By default, the project contains the following components:

- shared.R: for variables and perhaps functions shared by more scripts. By default contains the GDrive URL and project title, if provided during setup.
- 001_retrieve-data.R: helps you download files from your GDrive folder, if set. You can also use it to store code for retrieving other data. This should only hold things which you expect to run only once, or refresh rarely - particularly things that take time or put a load on other servers.
- 002_read-data.R: should hold code that reads the data and does any transformations immediately tied to data reading, e.g. setting data types or basic filtering. Again, this is code that you don’t expect to change as you work on the actual analysis. You may want to save the result into rds files in data-intermediate (or data-input if it is simply an rds mirror of the input data saved for quick access).
- 003_check-and-process-data.R: if you plan to run it often, you may wish to signal this by numbering it 01_* so that it is rerun with the rest of the analysis, and you may also turn it into an Rmd file if that is more convenient. This should process data in data-input and save its outputs in data-processed.
- [NN]_*.Rmd, where NN is 01-98: the actual analysis - may be an exploratory script, a partial analysis, or a report. These are expected to be run in the order of their numbering, but ideally key components should work off data saved in data-input or data-intermediate.
- data-input: should contain only unaltered input data as downloaded.
- data-processed: should contain processed data files.
- data-output: should contain data that you expect to share externally as the output of your project.
- charts-output and reports-output: for the obvious.
- 99_reproducibility.R: by default contains a description of the system and environment used to run the analysis. Use it to store any other information useful for reproducing the analysis (but not passwords etc.).

You should feel free to move code between the analysis and the 00* scripts as you discover data transformations that should be made earlier on in the workflow.
Use build.R
to tie these together - when run, it should
rebuild the whole project from scratch, except perhaps downloading data.
You may want to build different versions of build_*.R
as
helpers for running different parts of the workflow while you work. If
you deal with lots of build scripts, use the wonderful (pun intended) {buildr} package.
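A minimal sketch of what build.R might contain, assuming the default script numbering described above (the report file names are hypothetical):

```r
# build.R - rebuild the whole project from scratch
# (data retrieval via 001_retrieve-data.R is assumed to have been run already)
source("shared.R")
source("002_read-data.R")
source("003_check-and-process-data.R")

# render the analyses/reports in the order of their numbering
rmarkdown::render("01_analysis.Rmd")  # hypothetical file name
rmarkdown::render("02_report.Rmd")    # hypothetical file name

source("99_reproducibility.R")
```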
There are two templates in the package: Schola PDF report (preferred) and Schola Word report. The PDF report uses LaTeX to create a typographically correct, fully vector-based document.
The Word report is simple: it creates a Word document with some nice custom defaults and styles.
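A draft created from one of these templates declares the output format in its YAML header, roughly along these lines (a sketch only; the header the template actually generates may contain more fields, and the PDF template sets its own output format analogously):

```yaml
output:
  reschola::schola_word: default
```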
reschola offers a ggplot2 theme, theme_schola(), which provides some sensible aesthetic defaults, including font choice, to make charts beautiful and consistent.
The desired approach is to use this theme, alter its parameters if
needed, and then if necessary make other changes using another
theme()
call.
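For instance, a chart might be built like this (the data and aesthetics are purely illustrative):

```r
library(ggplot2)
library(reschola)

ggplot(mpg, aes(class, hwy)) +
  geom_boxplot() +
  theme_schola() +                              # sensible defaults, incl. fonts
  theme(panel.grid.major.x = element_blank())   # further tweaks in a separate theme() call
```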
There are also a small number of other plotting utilities.
See the Making charts vignette for details on everything graphics related.
There are also utilities, e.g. for creating report drafts (draft_*()) and for interacting with Google Drive (gdrive_*()).
In RStudio, go to
File > New Project > New directory > Standard Schola Empirica Project
.
Ideally, start from a clean RStudio session with no project open.
Fill in the fields (only directory name is mandatory), switch the Git menu to get a (reschola/your) Github repo if you wish, select other options if needed, check “Open in new session”, and click Create.
Other ways are possible but this gives you a good starting point and takes care of a lot of the setup hassle for you.
The data-input and data-processed folders have .gitignore files in them to stop you from committing sensitive data to git. Commit these .gitignore files to git, but only alter them if you are sure you know what you are doing.
If you haven’t used the googledrive package before, the package may ask for authorisation to access Google Drive. This is legitimate and you should grant access. This happens in the browser and on some machines may cause project initiation to freeze or stop. If that happens, run googledrive::drive_auth(), delete the directory created by the previous project creation attempt, and try creating the project again.
renv: managing package dependencies

One of the most annoying barriers to reproducibility is when packages on which your code depends change over time and as a result your code breaks or behaves differently. The most convenient and sophisticated way to handle this is to use the renv dependency management system. In short, renv makes sure that your project holds a complete record of the exact package versions you are using when creating it.
When to use it:
Always, really, but especially when:
What it does:
To avoid wasting time, disk space and download bandwidth, renv keeps a copy of all the package versions used in your projects in a shared per-computer cache. The project libraries only contain links to that cache. That way you are not committing the package code into your project, nor do the package files sit in your project directory, and they only get downloaded once if multiple projects’ libraries use the same version of a package.
All you need to do is call renv::init()
at the beginning
and then renv::snapshot()
anytime you install new packages
or commit code.
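In practice, the basic workflow is just a few calls:

```r
renv::init()      # once, when setting up the project
# ...install packages and write code as usual...
renv::snapshot()  # record the exact package versions in renv.lock
renv::restore()   # later or on another machine: reinstate the recorded versions
```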
See https://rstudio.github.io/renv/ for an intro to the package and https://environments.rstudio.com/ for a broader intro.
Renv is not part of the standard project setup in reschola so as not to increase the complexity of project initiation, but it is strongly recommended that you use it.
Use 001_retrieve-data.R
; if you have other data
retrieval, ideally the code for it should live here.
Do not edit the data by hand.
See tips for some packages that can help you retrieve data from public sources or other systems.
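If you need to add your own retrieval code, a sketch using the googledrive package might look like this (it assumes shared.R defines gd_url, the project folder URL, and that the downloaded files should land in data-input):

```r
library(googledrive)

source("shared.R")  # assumed to define `gd_url`

# list the files in the project folder and download each into data-input/
files <- drive_ls(drive_get(gd_url))
purrr::walk2(
  files$id, file.path("data-input", files$name),
  ~ drive_download(as_id(.x), path = .y, overwrite = TRUE)
)
```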
Use 002_read-data.R
. Add any other data reading that is
needed.
See readr::locale()
for handling encoding, decimal marks
and separators in CSVs. You might also need
readr::read_csv2()
.
Use readr::guess_encoding()
if the text comes in
garbled.
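A sketch of what the reading code might look like for a semicolon-delimited, Czech-style CSV (the file name and columns are illustrative):

```r
library(readr)
library(dplyr)

responses <- read_csv2(
  "data-input/responses.csv",                  # hypothetical input file
  locale = locale(encoding = "windows-1250")   # adjust to what guess_encoding() suggests
)

# transformations tied directly to reading: types and basic filtering
responses <- responses %>%
  mutate(date_filled = as.Date(date_filled)) %>%
  filter(!is.na(respondent_id))

write_rds(responses, "data-intermediate/responses.rds")
```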
Use 003_check-and-process-data.R
. You may need to move
this into an RMarkdown document.
See tips for packages that can help you set up a structured data checking pipeline.
You may want to commit some of the summarised/processed data output here once you have done some analysis and are reasonably sure it will not change too often. But generally, data should not be committed, esp. if large or at risk of committing personal information.
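A sketch of the kind of checks and processing that might live in this script (the column names are illustrative):

```r
library(readr)
library(dplyr)

responses <- read_rds("data-intermediate/responses.rds")

# basic sanity checks - fail loudly if the data do not look as expected
stopifnot(
  nrow(responses) > 0,
  !any(duplicated(responses$respondent_id)),
  all(responses$score >= 0 & responses$score <= 100, na.rm = TRUE)
)

responses_processed <- responses %>%
  mutate(score_centered = score - mean(score, na.rm = TRUE))

write_rds(responses_processed, "data-processed/responses.rds")
```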
I suggest you keep your data exploration in a separate script from your report; often the EDA will happen in the report as you go, but a better process perhaps is to develop bits of your analysis in one script/Rmd and only move into the report Rmd the bits of code which are essential for building the report.
An RMarkdown Notebook might be an appropriate format for this.
See tips.html#data-exploration-1 for a list of appropriate tools for data exploration.
There is a real trade off here: one way to do it is to work through
the analysis in the report script, perhaps hiding most through chunk
options (include = FALSE
) and only outputting into the
final format stuff that is relevant.
That way you get a sense of the thought process but also a bloated and circuitous script. Another is to do the analysis in one or more files and only move into the report the bits which are needed there.
That way you get a tight report script but at times disconnected from the analytical process.
One way to lighten the load is to hive off some work into partial
Rmarkdown files, typically named _something.Rmd
, and then
“insert” them into the main document via an empty chunk with the child option:

```{r, child = "_something.Rmd"}
```
See RMarkdown Cookbook on child documents.
Use draft_*
to quickly create a draft using the required
template.
The Word output of the two templates is aesthetically equivalent, but
the _word
template (and output format as set in the YAML
header) can do more sophisticated handling of e.g. cross-references.
Note the schola_word
format is based on the
bookdown::word_document2
format. This means it can be
customised like other
bookdown documents and even strung into a whole book.
You can create a cross-reference to any section, e.g. link to section
Methods using [Methods]
or
[the methods section][Methods]
. This will show up as a link
in Word and HTML.
Create a footnote by using
Text^[This is a footnote]
.
You can also refer to tables, figures and equations. This only works
in the schola_word
output format (template).
Do it like this (note that @
is escaped with
\
):
See figure \@ref(fig:graf6). to refer to a figure created in a chunk named graf6

As table \@ref(tab:tab3) shows... for a table in a chunk named tab3
Note that these chunks need to have the fig.cap
set to a
non-empty string, and they need to have a chunk name without
underscores or any special character (camelCase style is
recommended). Yihui Xie says:
Try to avoid spaces, periods (.), and underscores (_) in chunk labels and paths. If you need separators, you are recommended to use hyphens (-) instead. For example, setup-options is a good label, whereas setup.options and chunk 1 are bad; fig.path = ‘figures/mcmc-’ is a good path for figure output, and fig.path = ‘markov chain/monte carlo’ is bad.
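Putting this together, a captioned chunk and a cross-reference to it might look like this (the chunk name, caption and plotting code are illustrative):

```{r graf6, fig.cap = "Distribution of total scores"}
ggplot(responses, aes(total_score)) +
  geom_histogram() +
  theme_schola()
```

The resulting figure can then be referenced in the text as \@ref(fig:graf6).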
Edit _bookdown.yml
to change the words used for “Figure”
and “Table” in captions (doesn’t apply for PDF).
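For example, to use Czech caption labels, _bookdown.yml might contain (a sketch; adjust the wording as needed):

```yaml
language:
  label:
    fig: "Obrázek "
    tab: "Tabulka "
```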
See more on cross-references in the bookdown guide.
Knitr and Rmarkdown incorporate a system for managing citations and bibliographies, which can take reference lists from a number of citation managers. For the basics, see the RMarkdown site, details are in the RMarkdown Cookbook.
In principle, if you want an HTML file you can just switch the format to html_document and it should work fine, though some details might differ slightly. See tips on how to get that online.
If you expect your report to be rerun in some time with different data or a different parameter, like a changed date or name of something, you can make your report parameterised. See this brief guide or a longer explanation.
This is also useful if you are running the same report for a number of units of something, e.g. for different waves of research or different geographical units - see how the Urban Institute does it.
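A minimal sketch of a parameterised report: the YAML header declares the parameters (the names and values here are hypothetical), and the report code reads them via params$...:

```yaml
---
title: "Programme evaluation"
output: reschola::schola_word
params:
  wave: "spring 2020"
  data_file: "data-processed/wave-1.rds"
---
```

A different version can then be built without editing the file, e.g. rmarkdown::render("report.Rmd", params = list(wave = "autumn 2020", data_file = "data-processed/wave-2.rds")).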
Charts should be created using ggplot2 as far as possible. Use the theme_schola() theme.[1]
None of this is a linear process. The only requirement is that from an external point of view (and that includes you in three months or two years), the process of rebuilding the report(s) and the entire project is linear.
But as you work, you will find bits of code that belong somewhere else; you will make data transformations in your report that you will then realize you can move to your data transformation script. You will load new data in a script and then move that loading code to an earlier script. That is fine - it will happen gradually through iteration, but the iteration should also move you towards more organised code.
The logic described by Emily Riederer in her RMarkdown driven development approach may be helpful here.
In the end, the scripts should follow these principles:
Don’t forget to update the README.md and other documentation as you
go, as well as build.R
and any other build*.R
scripts you may have.
Feel free to use git to go back and forth. Version control is your friend here. Something broke? You can go back to when it worked.
See workflow guidance for a primer on git and Github, which should be a core part of your process.
When a draft report goes out e.g. to stakeholders for feedback, it might be useful to create a git tag:
You can also designate certain snapshots as special with a tag, which is a name of your choosing. In a software project, it is typical to tag a release with its version, e.g., “v1.0.3”. For a manuscript or analytical project, you might tag the version submitted to a journal or transmitted to external collaborators. Figure 20.1 shows a tag, “draft-01”, associated with the last commit.
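For example, on the command line (the tag name and message are up to you):

```
git tag -a draft-01 -m "Draft report sent to stakeholders for feedback"
git push origin draft-01
```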
Run reschola::manage_docx_header_logos()
to replace
default Schola logo or add a client/funder logo.
Really just make sure that you have followed the steps. If you have, then:
build.R should contain all scripts needed to run the whole thing in the right order; so should any other build-type script you may have created.

Additionally, you should use renv and snapshot the state of the project library using renv::snapshot().
See R for Data Science by Hadley Wickham and Garret Grolemund.
The rest mostly draws on What They Forgot to teach you about R, which seems to have the remedy to many common pains of working with R.
See the RStudio part in setup for the options to set for this.
Use the here
package instead of setwd()
to
make sure paths just
work.
See Naming things by Jenny Bryan.
Name files consistently (_ between parts, - between words).

Store secrets like passwords and tokens securely (e.g. using keyring, or in environment variables via usethis::edit_r_environ()); never hard code them.

You can add files you do not wish to commit to git’s .gitignore file. That way, git will not even show those as new/changed. This works separately for each repo.
The easiest way to do this is to run
e.g. usethis::use_git_ignore("secret_file.R")
.
You need to commit the .gitignore
file.
See the goodpractice
package for a list of good practice and automated checking for them.
Currently we have no explicitly agreed style.
Still, following some basic code hygiene seems like a good idea:
put spaces around = and +
Sharla Gelfand on reproducible reporting with RMarkdown at rstudio::conf(2020)

Emily Riederer on RMarkdown driven development
Wilson et al. 2016, “Good Enough Practices in Scientific Computing”
rrtools
for research compendia
[1] See the Making charts vignette for more details.