Introduction

Overview

Teaching: 20 min
Exercises: 0 min
Questions
  • How can I make my results easier to reproduce?

Objectives
  • Introduction to the practical aspects of reproducibility.

  • Explain the notion of a scientific workflow or pipeline.

Many scientific exploration pipelines have the same high-level structure: data is obtained, it is processed and analyzed, and the results are summarized and published.

Each of these stages involves carrying out some (or all) of a set of concrete tasks, such as downloading files, installing software, running scripts, and recording results.

While one can carry out these tasks manually, automating them not only significantly reduces the effort required by others (or by ourselves) to re-execute a pipeline but, more importantly, gives others an explicit and easy-to-understand record of the high-level steps that were executed and the order in which they ran. Such an automated series of steps is commonly referred to as a “workflow” or a “pipeline”.
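
At its core, a workflow is nothing more than an explicit, ordered list of steps that a machine can execute on our behalf. The following toy sketch illustrates the idea (the step names are made up and stand in for real tasks):

#!/usr/bin/env python
# A workflow is just an ordered record of named steps executed one after another.


def download_data():
    print('downloading data ...')


def run_analysis():
    print('running analysis ...')


# the explicit, ordered record of steps
steps = [download_data, run_analysis]

for step in steps:
    step()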

An Example

Assume that, as part of a study, we are analyzing historic data in order to obtain a statistical descriptor. Specifically, we will make use of a dataset that contains a record of CO2 emissions per capita, and obtain the arithmetic mean over groups of an arbitrary number of years.
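
For example, if we choose groups of five years, the statistic we are after is simply the arithmetic mean of the per-capita values for each run of five consecutive years. A minimal sketch with made-up numbers:

# made-up per-capita values for five consecutive years (illustration only)
per_capita = [1.01, 1.05, 1.08, 1.11, 1.14]

# the arithmetic mean for this 5-year group
mean = sum(per_capita) / len(per_capita)
print(round(mean, 3))  # 1.078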

We start by creating a folder that we’ll name myproject/, and we’ll first download the dataset: right-click on this link, select Save as... and save it in a myproject/data/ folder (so that the file resides in myproject/data/global.csv) as shown below:

tree .
.
└── data
    └── global.csv

1 directory, 1 file

NOTE: The above uses the tree command, which might not be available on your machine. You can either install it or run ls -lR instead.
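
Alternatively, instead of downloading the file by hand, we can fetch it programmatically. Below is a small sketch that uses only Python’s standard library; it assumes it is run from inside myproject/ and that the dataset URL (the same one used by the Bash script shown later in this episode) is still valid:

#!/usr/bin/env python
import os
import urllib.request

# URL of the dataset (the same one used by the Bash script further below)
URL = 'https://github.com/datasets/co2-fossil-global/raw/master/global.csv'

# create the data/ folder if it does not exist yet, then download the file
os.makedirs('data', exist_ok=True)
urllib.request.urlretrieve(URL, 'data/global.csv')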

Then, assume we have a scripts/get_mean_by_group.py Python script that calculates the mean of the per-capita column over groups of years for this dataset:

#!/usr/bin/env python
import csv
import sys

try:
    # Python 3
    from itertools import zip_longest
except ImportError:
    # Python 2
    from itertools import izip_longest as zip_longest

fname = sys.argv[1]
group_size = int(sys.argv[2])
fout = fname.replace('.csv', '') + '_per_capita_mean.csv'


def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


with open(fname, 'r') as fi, open(fout, 'w') as fo:
    r = csv.reader(fi)
    w = csv.writer(fo)

    # get 0-based index of last column in CSV file
    last = len(next(r)) - 1

    for g in grouper(r, group_size):
        group_sum = 0
        year = 0

        for row in g:
            # the last group may be padded with None by grouper(); skip padding
            if row is None:
                continue
            # check if column is null (assume zero if it is)
            if row[last]:
                group_sum += float(row[last])
            else:
                group_sum += 0.0
            year = row[0]

        w.writerow([year, group_sum / group_size])

We can copy-paste the content of this script into scripts/get_mean_by_group.py. The folder structure now looks like this:

tree .
.
├── data
│   └── global.csv
└── scripts
    └── get_mean_by_group.py

2 directories, 2 files
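
Before invoking get_mean_by_group.py, it may help to see what its grouper() helper does on its own: it batches an iterable into fixed-size chunks, padding the last chunk with the fill value when the items do not divide evenly. A standalone sketch (the year values are arbitrary):

from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


print(list(grouper(range(1752, 1759), 3)))
# [(1752, 1753, 1754), (1755, 1756, 1757), (1758, None, None)]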

The get_mean_by_group.py script is invoked by running the following command on the terminal:

python scripts/get_mean_by_group.py data/global.csv 5

Once we execute the above, a data/global_per_capita_mean.csv file is created with the result. We can eyeball this file and verify that the data has been aggregated into groups of 5 years (the parameter we passed) and that the mean was obtained for each group. Note how years before 1950 do not contain an entry in the Per Capita column, and thus any group before this year has 0.0 as the resulting mean.
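
As an alternative to opening the output file in an editor, a short snippet like the following prints its first few rows (the path matches the file generated above; the actual values depend on the dataset):

import csv

# print the first few rows of the generated file
with open('data/global_per_capita_mean.csv') as f:
    for row in list(csv.reader(f))[:5]:
        print(row)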

Thus, the workflow in this minimal example study is comprised of the following two steps:

  1. Download data.
  2. Execute analysis.

In order to automate these steps, we can write the following Bash script:

#!/bin/bash
set -e

curl \
  --create-dirs \
  -Lo data/global.csv \
  https://github.com/datasets/co2-fossil-global/raw/master/global.csv

python scripts/get_mean_by_group.py data/global.csv 5

What Might Cause This Script to Fail?

The above example, although minimal, might not work properly when executed in other environments. Can you think of what might cause this Bash script to fail when someone else runs it on their machine?

Solution

Some things that might go wrong:

  1. The curl command is not available.
  2. The Python script guards against portability issues across the two main versions of the Python runtime in use today by first trying the Python 3 import of zip_longest and falling back to the Python 2 name (izip_longest). If only the Python 3 import were present, the script would fail when executed with a Python version distinct from the one where the workflow originally ran.

We can make it easier for others to re-execute what we have done by automating all these manual tasks in a portable way. By portability we mean that someone else re-running this workflow can do so without having to deal with issues arising from the fact that their computing environment differs from the one where the workflow originally executed. In this lesson, we will learn how to do this by writing workflows in the GitHub Actions (GHA) workflow language. We will also use Popper to run GHA workflows on our machines and on CI services.

Key Points

  • Scientific explorations can be described in a workflow language so that they can be automatically re-executed.

  • The GitHub Actions (GHA) workflow language can be used to express workflows that are easier to re-execute, since they are assumed to run in containerized environments.

  • Popper is a GHA workflow execution engine that helps users create automated and portable experimentation workflows.