Reproducible, Ethical and Collaborative Data Science: The Turing Way Project

The Turing Way Project is an open-source project that should excite any academic with an interest in data science. It is an almanac of techniques and best practices for making your research more reproducible and ready for an open-science future.
At the time of writing, more than 250 contributors are working on the Turing Way Project. Not only does it provide an extensive online book, it also offers the option to engage with other members of the community via their GitHub repository.
They provide enough material for hundreds of posts, but you should check the Turing Way Project out yourself! My only aim here is to raise awareness of this body of work, as I suspect that many people have not heard about it yet.

Here, we will only comment on some highlights from their Reproducible Research chapter, which makes up just a small share of the book. The subchapters I’d like to highlight are the ones on version control, reproducible environments, and code testing.
Version Control
Version control might be one of the most effective measures to elevate your research project. It improves collaboration with your future self and with your collaborators. Done right, it gets rid of any “you accidentally overwrote my changes” or “which version is the current one?” moments.
For this, I use git and GitHub. Git lets you document the progress of your project with commits and allows you to jump back to any specific commit in the past. While version history is also native to other services, such as Google Drive, git’s ability to handle multiple branches of a project is not.
While there is always a main path of development (ideally your tested, working version), you and your collaborators can branch off in different directions.
In practice, person A works on one new feature of the code (branch A), while person B works on another feature (branch B). Afterwards, both can merge their work back into the main branch, extending the project by both new features. This is handy, as it allows you to divide and conquer.
In my experience, branching and merging work smoothly if you think about it beforehand: try to split up the tasks in a way that does not require you to work in the same script. Editing the same file can easily lead to merge conflicts, which are typically time-consuming to resolve.
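To make this concrete, here is a minimal command-line sketch of the branching workflow; the branch name feature-a and the file analysis.R are made up for illustration:
# person A creates and switches to a new branch
git checkout -b feature-a
# ... edit files, then record the progress as a commit
git add analysis.R
git commit -m "add feature A"
# switch back to the main branch and merge the finished feature
git checkout main
git merge feature-a
Person B does the same on their own branch; once both branches are merged, the main branch contains both new features.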
Reproducible Environments
Sharing your code is already helpful to other researchers and provides explicit documentation of your research project. However, sharing code in a way that lets others immediately reproduce your results requires some attention to detail.
One of these details is to document your computational environment.
We need reproducible environments so that others can establish, on their own computers, the same environment we used to write our code.
Oftentimes, it is enough to simply recreate the Python or R environment which you used to get your script working. Sometimes, however, you need to provide an image of your whole machine, e.g., as a Docker container. Below you will find some notes on the simpler case of providing the same software environment.
In Python, one of the most popular ways to manage your environment is conda. Sharing your environment via conda is easy: you create a requirements.txt file which describes your conda environment and send it to your collaborators, who can automatically install all necessary packages from it:
# you export your environment to requirements.txt
conda list --export > requirements.txt
# send your requirements.txt to your collaborator, or share it via GitHub
# your collaborator sets up their environment based on the requirements.txt
conda create --name name_of_env --file requirements.txt
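If you prefer a file that also records the environment’s name and channels, conda can export a YAML file instead; the following sketch shows the same round trip (environment.yml is the conventional file name):
# export the active environment, including channels, to environment.yml
conda env export > environment.yml
# your collaborator recreates the environment from that file
conda env create -f environment.yml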
In R it is less common to use conda; however, you often run into trouble when it comes to installing the packages which you load in your script.
To automatically install required packages, the library pacman can help. Put the code below at the top of your R script, and you should get rid of 99% of the environment trouble.
# check if pacman is installed; if not, install it
if (!require("pacman")) install.packages("pacman")
# use pacman::p_load, which takes care of all other packages
# add all packages here which you'd usually load with library()
pacman::p_load(tidyverse,
               ggplot2,
               readxl)
Unit Tests
As pointed out in The community’s 5 Years of Data Science, unit tests are an underrated skill for data scientists, and this is of course also true for researchers. While you should always inspect your data yourself (multiple times), after a while you can automate some of these checks with unit tests.
For example, you can write a simple test which checks whether your observation IDs are (still) unique or whether some error caused duplicates:
assert df.shape[0] == df['id'].nunique(), "Trouble with IDs"
When you are ready to become a professional code tester, consider stepping up your game by wrapping multiple tests into one function:
def data_tester(df):
    """Runs a battery of unit tests on the dataframe."""
    # check whether ids are unique
    assert df.shape[0] == df['id'].nunique(), "Trouble with IDs"
    # check whether the shares sum to 100% (compare with a tolerance,
    # since floating-point sums rarely equal 1 exactly)
    assert abs(df['percent_share'].sum() - 1) < 1e-9, "Shares do not add up to one"
    # more checks
    ...
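You can then run the whole battery with a single call, assuming df is the pandas DataFrame you want to check:
data_tester(df)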
Also keep an eye on modules dedicated to unit testing, such as pytest and unittest; a small pytest sketch follows at the end of this section. There is also the R package assertthat, which implements assert-like unit tests in R:
# load the assertthat package
library(assertthat)
# assert that the dataframe has 100 observations
assert_that(nrow(df) == 100)
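As promised, here is a minimal pytest sketch. pytest collects and runs every function whose name starts with test_ in files named test_*.py; the file name test_data.py and the load_data helper below are made up for illustration:
# content of test_data.py
import pandas as pd

def load_data():
    # hypothetical helper; replace with however you load your dataframe
    return pd.read_csv("data.csv")

def test_ids_are_unique():
    df = load_data()
    assert df.shape[0] == df['id'].nunique(), "Trouble with IDs"

def test_shares_sum_to_one():
    df = load_data()
    assert abs(df['percent_share'].sum() - 1) < 1e-9, "Shares do not add up to one"
# running `pytest` from the command line discovers and runs both tests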
I would be surprised if this were my last post on The Turing Way. In the meantime, I urge you again to check out their project, GitHub, and Twitter.