Guillaume Chevrot

“Anyone who stops learning is old, anyone who keeps learning stays young.”

Creating an ActivePaper

What is ActivePapers?

ActivePapers is an ongoing project developed by Konrad Hinsen that aims to make computations reproducible and publishable. You can find all the details of the project here.

Design?

To make a computation reproducible and publishable, the project proposes bundling all your data and code into a single ActivePaper file. Again, see the details here. ActivePapers implementations use HDF5 as the underlying storage format, which means that an ActivePaper is an HDF5 file. One advantage is that you can inspect the datasets in an ActivePaper with generic HDF5 tools such as HDFView.

How do we create an ActivePaper file?

To create our ActivePaper file, we will use the ActivePapers Python edition. You can find an installation guide here; it should be pretty straightforward.

A more comprehensive tutorial already exists. What I propose here is only to create a very simple ActivePaper and show how to extract the data.

With this ActivePaper, I want to create two arrays, add them together and generate a plot. So the first thing to do is to write the Python code that will perform these operations.

Creating the arrays

I write this code in the file 'create_data.py':

In [ ]:
from activepapers.contents import data
import numpy as np

# Create a group for the input data
inputs = data.create_group('inputs')

# Create a NumPy array
arr = np.arange(100)

# Store the array twice, as two datasets in the group
inputs['dataset_1'] = arr
inputs['dataset_2'] = arr

Download create_data.py

Adding the arrays

I write this code in the file 'adding_data.py':

In [ ]:
from activepapers.contents import data
import numpy as np

# Create a group for the output data
output = data.create_group('output')

input_data = data['inputs']

# Add the two input arrays
# (np.int was removed in NumPy 1.24; the builtin int is the equivalent)
arr_1 = input_data['dataset_1'][:].astype(int)
arr_2 = input_data['dataset_2'][...].astype(int)
total = arr_1 + arr_2  # named 'total' to avoid shadowing the builtin sum()

# Write the output
output['sum'] = total

Download adding_data.py
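
As a quick sanity check outside of ActivePapers, the same arithmetic can be reproduced with plain NumPy. This is only a sketch: it uses no ActivePapers API, and the variable names are chosen here for illustration.

```python
import numpy as np

# Same input as in create_data.py: the integers 0..99
arr = np.arange(100)

# Element-wise addition, as in adding_data.py
total = arr + arr

# Adding an array to itself doubles every element
assert np.array_equal(total, 2 * arr)
print(total[:5], total[-1])  # prints: [0 2 4 6 8] 198
```

The result should therefore be the even numbers from 0 to 198, which is what we expect to find later in the 'sum' dataset.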

Plot

I write this code in the file 'plot.py':

In [ ]:
import matplotlib
matplotlib.use('PDF')  # if I don't use it, the pdf produced is corrupted?
import matplotlib.pyplot as plt
import numpy as np
from activepapers.contents import data, open_documentation

def plot(x, y, fontsize=19, output='plot.pdf'):
    fig = plt.figure(figsize=(7,7))
    ax = fig.add_subplot(1,1,1)

    # data
    ax.plot(x, y, '-', linewidth=2.0, label='normal')

    # axis labels
    ax.set_xlabel('x', fontsize=fontsize)
    ax.set_ylabel('y', fontsize=fontsize)

    # font size of the tick labels
    ax.tick_params(labelsize=fontsize)
    # Add and specify different settings for minor grids
    x_max = x.max()
    ax.set_xticks(np.arange(0.0, x_max+1, 10.0), minor = True)
    y_max = y.max()
    ax.set_yticks(np.arange(0.0, y_max+1, 10.0), minor = True)
    ax.grid(which = 'minor', alpha = 0.9)

    return fig

# Plotting and saving in documentation
x = data['inputs/dataset_1'][:]
y = data['output/sum'][:]
fig = plot(x, y)
fig.savefig(open_documentation('plot.pdf', 'w')) #save plot in /documentation/

Download plot.py
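
Note that the last line of plot.py hands savefig a file object rather than a file name. Matplotlib's savefig accepts any writable binary file-like object, but it then has no file extension from which to infer the output format. A likely explanation for the corrupted PDF mentioned in the comment above is that, without the PDF backend, the default format is not PDF. The sketch below shows the alternative of passing format='pdf' explicitly. The io.BytesIO buffer is only a stand-in for the object returned by open_documentation, and the Agg backend is my own choice for illustration; neither is part of the original script.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(100)
y = 2 * x

fig, ax = plt.subplots(figsize=(7, 7))
ax.plot(x, y, "-", linewidth=2.0)

# Stand-in for open_documentation('plot.pdf', 'w'): a writable binary stream
buffer = io.BytesIO()

# No file name is available, so the format is stated explicitly
fig.savefig(buffer, format="pdf")
plt.close(fig)

pdf_bytes = buffer.getvalue()
print(pdf_bytes[:4])  # a valid PDF starts with the bytes b'%PDF'
```

I have not tested this variant inside an ActivePaper, so take it as a hint rather than a recipe.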

Documentation

You can also add documentation. For example, you can add a README.txt like this one:

In [ ]:
1) DATA
=======
Inputs: creating 2 arrays
Output: adding these 2 arrays

2) CODE
=======
create_data.py: create the inputs
adding_data.py: compute the output
plot.py: plot the data

Creating the ActivePaper

Now we can generate the ActivePaper:

In [ ]:
aptool -p test.ap create -d matplotlib

Here we create an ActivePaper named 'test.ap' and declare its external dependencies (here matplotlib), i.e. Python modules that are required but not available as ActivePapers.

Then we add the README.txt and the Python code into the ActivePaper:

In [ ]:
aptool checkin -t text documentation/README.txt
aptool checkin -t calclet code/*.py

Then it becomes "magic": you can run the code inside the ActivePaper, and the results are generated inside the ActivePaper as well:

In [ ]:
aptool run create_data # creating the data
aptool run adding_data # adding the data
aptool run plot        # generating the plot
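
If you rebuild the paper often, the aptool commands used so far can be collected in a small Python driver. This is a sketch under assumptions: aptool is on your PATH, the directory layout is the one implied by the commands above, and the helper names build_commands and run_all are hypothetical. By default it only prints the commands (dry run).

```python
import glob
import subprocess


def build_commands(calclets):
    """Return the aptool invocations from this post, in order, as argv lists."""
    return [
        ["aptool", "-p", "test.ap", "create", "-d", "matplotlib"],
        ["aptool", "checkin", "-t", "text", "documentation/README.txt"],
        ["aptool", "checkin", "-t", "calclet", *calclets],
        ["aptool", "run", "create_data"],
        ["aptool", "run", "adding_data"],
        ["aptool", "run", "plot"],
    ]


def run_all(dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    # The shell would expand code/*.py for us; here we do it explicitly
    calclets = sorted(glob.glob("code/*.py"))
    for cmd in build_commands(calclets):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing step
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Calling run_all(dry_run=False) would execute the steps for real, stopping at the first error.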

So now we have a single file, test.ap, containing the inputs and outputs. You can inspect the file with aptool:

In [ ]:
aptool ls

As expected, it produces:

In [ ]:
code/adding_data
code/create_data
code/plot
data/inputs/dataset_1
data/inputs/dataset_2
data/output/sum
documentation/README
documentation/plot.pdf
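
The listing shows three top-level sections: code, data and documentation. To make that layout explicit, here is a small sketch that groups the paths by their first component. It is plain string processing on the listing copied from above and does not touch the ActivePaper itself.

```python
from collections import defaultdict

# Output of `aptool ls`, copied from above
listing = """\
code/adding_data
code/create_data
code/plot
data/inputs/dataset_1
data/inputs/dataset_2
data/output/sum
documentation/README
documentation/plot.pdf
"""

# Group every path under its top-level section
sections = defaultdict(list)
for path in listing.split():
    top, _, rest = path.partition("/")
    sections[top].append(rest)

for top, items in sections.items():
    print(f"{top}: {items}")
```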

As the ActivePaper file is in fact an HDF5 file, you can read the datasets with many generic HDF5 tools, in particular HDFView. We can also do it from Python with the h5py library. For example, let's print the output dataset with this Python script:

In [1]:
import h5py

with h5py.File('test.ap', 'r') as f:
    dset_output = f['data/output/sum']
    print(dset_output)
    print(dset_output[:])
<HDF5 dataset "sum": shape (100,), type "<i8">
[  0   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34
  36  38  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70
  72  74  76  78  80  82  84  86  88  90  92  94  96  98 100 102 104 106
 108 110 112 114 116 118 120 122 124 126 128 130 132 134 136 138 140 142
 144 146 148 150 152 154 156 158 160 162 164 166 168 170 172 174 176 178
 180 182 184 186 188 190 192 194 196 198]
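
If you do not know the dataset names in advance, h5py can also walk the file for you. The sketch below lists every dataset under the data group with visititems. To stay self-contained it first builds a stand-in HDF5 file with the same data layout as test.ap; that stand-in is not a real ActivePaper, and the helper name list_datasets is hypothetical. With the real file you would simply call list_datasets('test.ap').

```python
import os
import tempfile

import h5py
import numpy as np


def list_datasets(path):
    """Return (name, shape, dtype) for every dataset under /data in an HDF5 file."""
    found = []

    def visit(name, obj):
        # visititems also reports groups; keep only the datasets
        if isinstance(obj, h5py.Dataset):
            found.append((name, obj.shape, str(obj.dtype)))

    with h5py.File(path, "r") as f:
        f["data"].visititems(visit)
    return found


# Stand-in file mimicking the /data layout of test.ap (NOT a real ActivePaper)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
arr = np.arange(100, dtype=np.int64)
with h5py.File(demo_path, "w") as f:
    f.create_dataset("data/inputs/dataset_1", data=arr)
    f.create_dataset("data/inputs/dataset_2", data=arr)
    f.create_dataset("data/output/sum", data=arr + arr)

for name, shape, dtype in list_datasets(demo_path):
    print(name, shape, dtype)
```

The names passed to the callback are relative to the group being visited, so they appear without the leading 'data/'.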

You can also easily extract the code and the documentation via:

In [ ]:
aptool checkout documentation
aptool checkout code