Guillaume Chevrot

“Anyone who stops learning is old, anyone who keeps learning stays young."

Interactive 2D map - parallelization with the multiprocessing module

In the previous notebook, I plotted an interactive 2D map. The time needed to generate this 2D map can be rather long. To improve that, we can "easily" parallelize the calculation of the distances with the multiprocessing module. The next 8 inputs are identical to the ones presented in the previous notebook.

In [1]:
from __future__ import print_function, division

from MMTK import *
from MMTK.Trajectory import Trajectory, TrajectoryOutput, SnapshotGenerator
from MMTK.Proteins import Protein, PeptideChain

import numpy as np

import time

from multiprocessing import Pool, cpu_count

NetCDF file format - MMTK trajectory (The trajectory can be easily translated from outcomes of popular Molecular dynamics software like Gromacs):

In [3]:
filename = 'traj_prot_nojump.nc'

Reading the trajectory:

In [4]:
trajectory = Trajectory(None, filename)
universe = trajectory.universe
proteins = universe.objectList(Protein)
chain = proteins[0][0]

2D map function (written in a way that it can be called with multiprocessing):

In [5]:
def calpha_2dmap(trajectory = trajectory, dt = 10000, t = range(0, len(trajectory))):
    """
    Calculate the distance between carbon alpha in the first protein of trajectory.
    
    Parameters
    ----------
    trajectory: MMTK.Trajectory
    
    dt : int
        dt in ns. The distances will be calculated every dt ns.
    t : int 
        range of steps to take into account
    """
    dist = []
    universe = trajectory.universe
    proteins = universe.objectList(Protein)
    chain = proteins[0][0]
    traj = trajectory[t]
    for n, step in enumerate(traj):
        if int(step['time']) % dt == 0:
            universe.setConfiguration(step['configuration'])
            for i in np.arange(len(chain)-1):
                for j in np.arange(len(chain)-1):
                    dist.append(universe.distance(chain[i].peptide.C_alpha,
                                                  chain[j].peptide.C_alpha))
    return(dist)

Monoprocessor calculation:

In [6]:
c0 = time.time()
dist = calpha_2dmap()
c1 = time.time() - c0
print(c1)
180.277282
In [7]:
dist = np.array(dist)
In [8]:
dist.shape
Out[8]:
(3562551,)

A. Multiprocessing

Let see if we can improve the performance of the calculation with the multiprocessing library (the definition of the function calpha_2dmap has been written in a way to allow multiprocessing). I first write a function that will distribute the work on the number of desired process.

In [9]:
def distribute(nprocs = cpu_count(), start = 0, end = len(trajectory)):
    """
    Distribute the number of steps (of the trajectory) between the number of processes.
    
    Parameters
    ----------
    nprocs : int 
        number of processes, defaults to cpu_count() (=all the processes available)
    start : int
        Starting step.
    end : int
        Last step. This last step is not included (it is mandatory for the function "calpha_2dmap").
        
    Returns
    -------
    list : list of list. Each list corresponds to the number of steps to be distributed on each process. 
    """
    
    nsteps = end - start
    print("You ask to distribute {0} step(s) on {1} processe(s)." . format(nsteps, nprocs))
    
    nsteps_per_proc = (nsteps + nprocs - 1) // nprocs
    print("The maximum number of steps per process will be {0}." . format(nsteps_per_proc))
    
    l = [range(i, min(i + nsteps_per_proc, end)) for i in range(start, end, nsteps_per_proc)]
    #print(l)
    print("The steps will be distributed on {0} processes".format(len(l)))
          
    return(l)
In [10]:
n_procs = 10
In [11]:
distribution = distribute(nprocs = n_procs, start = 0, end = len(trajectory))
You ask to distribute 30001 step(s) on 10 processe(s).
The maximum number of steps per process will be 3001.
The steps will be distributed on 10 processes
In [12]:
pool = Pool(processes=n_procs)
In [13]:
dt = 10000 #[ns]
c0 = time.time()
res = [pool.apply_async(calpha_2dmap, args=(trajectory, dt, t,)) for t in distribution]
#res = [pool.apply_async(calpha_2dmap, args=(trajectory, dt,t,)) for t in [range(0,2001), range(3000,5001), range(6000,8001), range(9000,11001), range(12000,14001), range(15000,17001), range(18000,20001), range(21000,23001), range(24000,26001), range(27000,30001)]]
dist_pool = [distance.get() for distance in res]
c2 = time.time() - c0
print(c2)
24.6278738976

What is the speedup?

In [14]:
print("Using {0} processors, the time to calculate the 2D map has been improved by a factor of {1:.2f}".format(n_procs, c1/c2))
Using 10 processors, the time to calculate the 2D map has been improved by a factor of 7.32

Flatten the dist_pool list:

In [15]:
len(dist_pool)
Out[15]:
10
In [16]:
d = [item for sublist in dist_pool for item in sublist]  # could also do it via itertool
In [17]:
len(d)
Out[17]:
3562551
In [18]:
dist_pool = np.array(d)
In [19]:
dist_pool.shape
Out[19]:
(3562551,)

Let's compare the results obtained with one processor and with multiple processors and verify if the results are the same

In [20]:
(dist_pool == dist).all() # verify that the multi-processes calculation give the same result
Out[20]:
True

The rest of the notebook is identical to the previous notebook

B. Reshaping the distance

I reshape the dist array, so the first dimension corresponds to the time.

In [21]:
dist = dist.reshape(int(len(dist) / ((len(chain)-1) * (len(chain)-1))), len(chain)-1, len(chain)-1)
In [22]:
dist.shape # time, n_calpha, n_calpha
Out[22]:
(31, 339, 339)

So the first dimension corresponds to the time and the 2 other dimensions correspond to the distances between the Carbon-$\alpha$ atoms of the 339 residues of the protein.

In [23]:
dist_mean = dist[-10:].mean(axis=0) # mean over the last 10 steps
In [24]:
dist_mean.shape
Out[24]:
(339, 339)

1. Static 2D map

We can naturally plot a static 2D map ...

In [25]:
fig = plt.figure()
plt.imshow(dist_mean, plt.get_cmap('YlOrRd'))
plt.colorbar()
Out[25]:
<matplotlib.colorbar.Colorbar instance at 0x7fa5f68afbd8>

2. Interactive 2D map

... but why wouldn't we plot an interactive one?

In [26]:
import IPython.html.widgets as widgets
from IPython.html.widgets import interact, interactive, fixed
In [27]:
len(dist[0])
Out[27]:
339
In [28]:
@interact(
          t=(0,len(dist)-1),
          calpha_x_min = (0, len(dist[0])),
          calpha_x_max = (0, len(dist[0])),
          calpha_y_min = (0, len(dist[0])),
          calpha_y_max = (0, len(dist[0]))
          )

def interactive_map(t=0, calpha_x_min = 0, calpha_x_max = len(dist[0]), calpha_y_min = 0, calpha_y_max = len(dist[0])):
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.set_xlim([calpha_x_min, calpha_x_max])
    ax.set_ylim([calpha_y_min, calpha_y_max])
    plt.imshow(dist[t], plt.get_cmap('YlOrRd'))
    cbar = plt.colorbar()
    cbar.ax.tick_params(labelsize=14)
    plt.clim(0.0,7.0)
    plt.title('2D map')

NOTE: the corresponding plot is encapsulated in the first video at the end of this page.

So now we can have our interactive 2D map. You can observe the 2D map in function of time or zoom to a specific region. If you are insterested in some specific distances, you can directly narrow a specific zone like that:

In [ ]:
@interact(
          t=(0,len(dist)-1),
          calpha_x_min = (211, 227),
          calpha_x_max = (211, 227),
          calpha_y_min = (110, 143),
          calpha_y_max = (110, 143))

def interactive_map(t=len(dist)-1, calpha_x_min = 0, calpha_x_max = len(dist[0]), calpha_y_min = 0, calpha_y_max = len(dist[0])):
    fig = plt.figure(figsize=(7,7))
    ax = fig.add_subplot(1, 1, 1)
    ax.set_xlim([calpha_x_min, calpha_x_max])
    ax.set_ylim([calpha_y_min, calpha_y_max])
    ax.tick_params(labelsize=14)
    plt.imshow(dist[t], plt.get_cmap('YlOrRd'), aspect = 'auto')
    cbar = plt.colorbar()
    cbar.ax.tick_params(labelsize=14)
    plt.clim(0.0,6.0)
    plt.title('2D map')
    plt.tight_layout()
    plt.savefig('2d.jpg', dpi=1000)

NOTE: the corresponding plot is encapsulated in the second video.

NOTE - if you cannot watch the videos on this page, you will find them at these URLs: