Comparing large arrays

13 Apr 2018

Aim

In this post we are going to write a utility that compares large arrays efficiently, which can then be used to label clusters. First of all, it should be defined what is meant by large and efficient. Large refers to arrays that are longer than 1,000 elements. An efficient algorithm scales better than $O(N^2)$, where $N$ is the number of arrays.

Motivation

I am currently working on implementing a clustering algorithm, the Markov cluster algorithm, where each cluster has an associated row vector in the transition matrix:

\[\begin{bmatrix} a & a & a & 0 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 0 & b & b & 0 & \cdots & 0 \\ 0 & & & & \cdots & & & 0 \\ 0 & & & & \cdots & & & 0 \\ a & a & a & 0 & 0 & 0 & \cdots & 0 \\ \vdots & & & \vdots & & & & \vdots \\ 0 & & & \cdots & & & c & c \end{bmatrix}\]

Rows 1 and 5 denote the same cluster, which consists of the first three nodes, whereas the second and seventh rows belong to two different clusters. By finding the unique rows one can assign the cluster labels to the nodes.

Initial considerations

Speed

Unfortunately, python is not designed for numerically heavy programming tasks, therefore a native python implementation in which we perform $N^{2}$ array comparisons would definitely be a suboptimal solution. In general, most of the time one is primarily concerned with overcoming the speed limitations of python when implementing numerical algorithms. The usual pathways are

  1. Write the numerically demanding parts in a compiled language (e.g. C, fortran)
  2. Write the code in cython which is then compiled
  3. Use compiled python libraries (e.g. numpy)
  4. Write the code in python and use a just in time compiler (e.g. numba)
  5. Rephrase the problem so that it becomes tractable in native python

Data format – sparse matrices

The matrices are stored in the compressed sparse row (csr) format.

A csr matrix consists of three arrays: one that stores the nonzero (nnz) elements of the matrix, one that stores their column indices, and a third one that stores the cumulative number of nnz entries for each row. These are called sparse_matrix.data, sparse_matrix.indices and sparse_matrix.indptr in the scipy.sparse parlance.

Our hypothetical matrix from the first paragraph would look like

sparse_matrix.data = [a, a, a, b, b, a, a, a,..., c, c]
sparse_matrix.indices = [0, 1, 2, 3, 4, 0, 1, 2,..., N-2, N-1]
sparse_matrix.indptr = [0, 3, 5, 5, 5, 8, 8,..., 8, 10]

Meaningful data

One can readily observe that only the column indices carry information on the clusters. The actual values are redundant for they are equal for a given column index. As a consequence, one only has to compare the column indices of the rows. The first task is then to retrieve the column indices for each row.
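
As a quick illustration, here is a minimal sketch on a small, made-up matrix (not the transition matrix above) showing that the per-row column indices can be read straight off the indices and indptr arrays, e.g. with np.split:

import numpy as np
from scipy.sparse import csr_matrix

# hypothetical example: rows 0 and 2 share the same column indices, row 3 is empty
a = csr_matrix(np.array([[0.5, 0.5, 0.0, 0.0],
                         [0.0, 0.0, 1.0, 0.0],
                         [0.5, 0.5, 0.0, 0.0],
                         [0.0, 0.0, 0.0, 0.0]]))

print(a.data)     # [0.5 0.5 1.  0.5 0.5]
print(a.indices)  # [0 1 2 0 1]
print(a.indptr)   # [0 2 3 5 5]

# split the flat column-index array at the row boundaries given by indptr
rows = np.split(a.indices, a.indptr[1:-1])
print([r.tolist() for r in rows])  # [[0, 1], [2], [0, 1], []]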

Possible solutions

1. solution: breaking down into smaller groups

Two rows are definitely different if the numbers of their nnz entries are not equal. Therefore, if we group the rows according to their length (length from now on refers to the number of nnz elements), the comparisons only have to be carried out within the so-created groups. With $k$ groups of roughly equal size, the cost would still be $k \cdot (N/k)^{2} = N^{2}/k \in O(N^{2})$. It is not great, but it can serve as a starting point.

Algorithm

\[\begin{eqnarray} &1.& storage \leftarrow \{\text{None} : []\} \\ &2.& iptr \leftarrow 0 \\ &3.& nnz\_in\_rows \leftarrow \, \text{calculate number of nnz entries in rows} \\ &4.& \mathbf{for} \, i = 0, \ldots, N - 1 \\ &4.1 & \qquad \mathbf{if} \, nnz\_in\_rows(i) = 0: \\ &4.2 & \qquad \qquad \text{skip} \\ &4.3 & \qquad \mathbf{else}: \\ &4.4 & \qquad \qquad idcs \leftarrow a.indices[iptr : iptr+nnz\_in\_rows(i)-1]\\ &4.5 & \qquad \qquad storage[\text{len}(idcs)].\text{append}(idcs) \\ &4.6 & \qquad iptr \mathrel{+}= nnz\_in\_rows(i) \end{eqnarray}\]

We are going to use defaultdict to store the vectors of equal length, and three iterator tools, islice, map and filter, to speed up the iteration over the column indices.

from collections import defaultdict
from itertools import islice

import numpy as np

def group_rows_by_length(a):
    """
    Groups the rows or columns of a csr or csc sparse matrix according to their length.
    Parameters:
        a ({scipy.sparse.csr_matrix, scipy.sparse.csc_matrix}) : matrix
    Returns:
        storage ({int : [[]]}) : dictionary where the keys are the lengths of the rows, the values the lists of rows.
    """
    storage = defaultdict(list)
    nnz_in_rows = np.diff(a.indptr)
    # a single shared iterator over the column indices, so that
    # consecutive islice calls consume consecutive row chunks
    indices = iter(a.indices)
    [storage[len(x)].append(x)
        for x in map(lambda y: list(islice(indices, int(y))), filter(lambda z: z > 0, nnz_in_rows))]
    
    return storage

All scipy sparse matrices implement the tolil() method which recasts the matrix to a list-of-lists format. In fact, we tacitly did exactly the same above. The tolil() source code can be found here. The only drawback of using tolil() is that it copies the data across to the new lil matrix instance – a costly step.
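
For comparison, a minimal tolil() sketch on the same small, made-up matrix; the rows attribute of the resulting lil matrix holds exactly the per-row lists of column indices we collected by hand:

import numpy as np
from scipy.sparse import csr_matrix

a = csr_matrix(np.array([[0.5, 0.5, 0.0, 0.0],
                         [0.0, 0.0, 1.0, 0.0],
                         [0.5, 0.5, 0.0, 0.0],
                         [0.0, 0.0, 0.0, 0.0]]))

# tolil() copies the data into a list-of-lists representation
lil = a.tolil()
print(list(lil.rows))  # the per-row column indices: [0, 1], [2], [0, 1] and []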

2. solution: hashing

What aspects of the previous solution can be improved?

  1. We will still have to compare the vectors, albeit only within smaller groups.
  2. We are still using a native python loop in the form of a list comprehension.
  3. Updating the dictionary is expensive.

As to the last issue, do we really need to store the indices? Can we use the indices themselves as keys instead, so that the keys of the dictionary alone identify the various vectors? Yes we can! The only modification we have to make is to cast the nonzero indices as tuple-s as opposed to list-s, so that they can be hashed.

Algorithm

\[\begin{eqnarray} &1.& unique\_rows \leftarrow \{\} \\ &2.& iptr \leftarrow 0 \\ &3.& nnz\_in\_rows \leftarrow \, \text{calculate number of nnz entries in rows} \\ &4.& \mathbf{for} \, i = 0, \ldots, N - 1 \\ &4.1 & \qquad \mathbf{if} \, nnz\_in\_rows(i) = 0: \\ &4.2 & \qquad \qquad \text{skip} \\ &4.3 & \qquad \mathbf{else}: \\ &4.4 & \qquad \qquad key \leftarrow \text{tuple}(a.indices[iptr : iptr+nnz\_in\_rows(i)-1])\\ &4.5 & \qquad \qquad unique\_rows[key] \leftarrow i \\ &4.6 & \qquad iptr \mathrel{+}= nnz\_in\_rows(i) \end{eqnarray}\]

In python:

import numpy as np
def select_unique_rows(a):
    """
    Selects the unique rows from a csr/csc matrix.
    Parameters:
        a ({scipy.sparse.csr_matrix, scipy.sparse.csc_matrix}) : a sparse matrix.
    Returns:
        unique_rows ({:}) : the keys of the dictionary are tuples of the column indices of the unique rows.
        The values are the last occurrences of the unique rows (counted over the non-empty rows).
    """
    nnz_in_rows = np.diff(a.indptr)
    # a single shared iterator, so that consecutive islice calls yield consecutive rows
    indices = iter(a.indices)
    unique_rows = {tuple(islice(indices, int(x))) : y for y, x in enumerate(filter(lambda x: x > 0, nnz_in_rows))}
    
    return unique_rows
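
A quick usage sketch on the same small, made-up matrix; the two identical rows collapse into a single key:

import numpy as np
from scipy.sparse import csr_matrix

a = csr_matrix(np.array([[0.5, 0.5, 0.0, 0.0],
                         [0.0, 0.0, 1.0, 0.0],
                         [0.5, 0.5, 0.0, 0.0],
                         [0.0, 0.0, 0.0, 0.0]]))

unique_rows = select_unique_rows(a)
print(unique_rows)  # two keys, (0, 1) and (2,); the values index the non-empty rows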

If we do not care about the row indices (and we do not, really), the enumerate() constructor can be removed:

def select_unique_rows_lite(a):
    """
    Selects the unique rows from a csr/csc matrix.
    Parameters:
        a ({scipy.sparse.csr_matrix, scipy.sparse.csc_matrix}) : a sparse matrix.
    Returns:
        unique_rows ({:}) : the keys of the dictionary are tuples of the column indices of the unique rows.
        The values are set to unity.
    """
    nnz_in_rows = np.diff(a.indptr)
    # a single shared iterator over the column indices
    indices = iter(a.indices)
    unique_rows = {tuple(islice(indices, int(x))) : 1 for x in filter(lambda x: x > 0, nnz_in_rows)}
    
    return unique_rows

Furthermore, we can keep only the keys, for the values carry no useful information. This can be achieved by using a set as opposed to a dict:

def select_unique_rows_lite_wset(a):
    """
    Selects the unique rows from a csr/csc matrix.
    Parameters:
        a ({scipy.sparse.csr_matrix, scipy.sparse.csc_matrix}) : a sparse matrix.
    Returns:
        unique_rows ({}) : set of tuples of column indices of the unique rows.
    """
    nnz_in_rows = np.diff(a.indptr)
    # a single shared iterator over the column indices
    indices = iter(a.indices)
    unique_rows = set(tuple(islice(indices, int(x))) for x in filter(lambda x: x > 0, nnz_in_rows))
    
    return unique_rows

Much better. There are a number of issues, though.

  1. Filtering can be expensive. It is always a tradeoff between how many empty rows we have and how much time the filtering takes. Usually, we will not have any empty rows if we treat the csr matrix properly, in which case the filter can be dropped, as in the function below.
  2. We are still iterating in native python.

def select_unique_rows_lite_nofilter(a):
    """
    Selects the unique rows from a csr/csc matrix.
    Parameters:
        a ({scipy.sparse.csr_matrix, scipy.sparse.csc_matrix}) : a sparse matrix.
    Returns:
        unique_rows ({:}) : the keys of the dictionary are tuples of the column indices of the unique rows.
        The values are set to unity.
    """
    nnz_in_rows = np.diff(a.indptr)
    # a single shared iterator over the column indices
    indices = iter(a.indices)
    unique_rows = {tuple(islice(indices, int(x))) : 1 for x in nnz_in_rows}
    
    return unique_rows

Once the clusters are determined, one can create an array of the cluster labels. assign_cluster_labels does this job for the select_unique_rows_lite, select_unique_rows_lite_wset and select_unique_rows_lite_nofilter functions. Again, the filtering can be removed if we have made sure there are no empty rows in the matrix.

def assign_cluster_labels(clusters):
    """
    Creates an array of cluster labels from a collection of clusters
    (tuples of node indices, e.g. the keys/elements returned by the functions above).
    """
    # total number of nodes across all clusters
    n_nodes = sum(map(lambda x: len(x), clusters))
    
    labels = np.full(n_nodes, -1, dtype = np.int)
    for label, cluster in enumerate(filter(lambda x: len(x) > 0, clusters)):
        labels[np.array(cluster)] = label

    return labels
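
A short usage sketch on a small, made-up block matrix, combining select_unique_rows_lite_wset with assign_cluster_labels; nodes 0-1 and nodes 2-4 form the two clusters:

import numpy as np
from scipy.sparse import csr_matrix

# hypothetical matrix: rows 0/2 and rows 1/3 are duplicates, row 4 is empty
a = csr_matrix(np.array([[0.5, 0.5, 0.0, 0.0, 0.0],
                         [0.0, 0.0, 0.3, 0.3, 0.3],
                         [0.5, 0.5, 0.0, 0.0, 0.0],
                         [0.0, 0.0, 0.3, 0.3, 0.3],
                         [0.0, 0.0, 0.0, 0.0, 0.0]]))

clusters = select_unique_rows_lite_wset(a)   # {(0, 1), (2, 3, 4)}
labels = assign_cluster_labels(clusters)
print(labels)  # e.g. [0 0 1 1 1] -- the label numbering depends on the set's iteration order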

3. Solution: Colouring

Imagine the nodes as an array of beads of uniform colour. We then paint the beads belonging to the same cluster with the same colour. Once all of the beads are painted, they have all been assigned to a cluster. We are going to implement this solution using numba.

import numba
@numba.jit(nopython=True)
def colour_array_numba(idcs, iptr, labels):
    """
    Assigns labels.
    Parameters:
      idcs (np.ndarray of np.int) : csr indices
      iptr (np.ndarray of np.int) : csr row pointers
      labels (np.ndarray of np.int) : initialised label array
    """

    n_nodes = iptr.size - 1  # number of nodes
    i_label = 0              # "colour" 
    n_coloured = 0           # number of coloured nodes

# --- loop over all rows <--> vectors <--> clusters
    for i in range(n_nodes):
# grab a cluster
        i_start, i_end = iptr[i],  iptr[i+1]
        n_nodes_in_cluster = i_end - i_start
# skip if it is empty
        if n_nodes_in_cluster == 0: continue

        inodes = idcs[i_start:i_end]
# if already coloured skip
        if labels[inodes[0]] != -1: 
          continue
        else:
# assign colour
          labels[inodes] = i_label
          i_label += 1
# keep track of coloured nodes
          n_coloured += n_nodes_in_cluster
# shortcut if all nodes have been coloured
          if n_coloured == n_nodes:
            break

def assign_labels(a):
  """
  Assigns a label to the nodes 
  Parameters:
    a (scipy.sparse.csr_matrix) : sparse matrix
  Returns:
    labels (np.ndarray of np.int) : cluster label for each node
  """
# --- create an array of uniform labels
  labels = np.full(a.shape[1], -1, dtype = np.int)
# --- colour nodes according to clusters
  colour_array_numba(a.indices, a.indptr, labels)

  return labels
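
A quick sanity-check sketch of assign_labels on the same small, made-up block matrix (the very first call also pays the numba compilation cost):

import numpy as np
from scipy.sparse import csr_matrix

a = csr_matrix(np.array([[0.5, 0.5, 0.0, 0.0, 0.0],
                         [0.0, 0.0, 0.3, 0.3, 0.3],
                         [0.5, 0.5, 0.0, 0.0, 0.0],
                         [0.0, 0.0, 0.3, 0.3, 0.3],
                         [0.0, 0.0, 0.0, 0.0, 0.0]]))

print(assign_labels(a))  # [0 0 1 1 1]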

Testing

We are going to compare the above methods on the test sets created with our efficient create_block_diagonal_csr_matrix_np utility.

import sys
sys.path.append(r'C:\Users\Balazs\source\repos\MPC\MPC\RandomSparse')
from BlockDiag import create_block_diagonal_csr_matrix_np

In the following paragraph, we create a number of cluster groups in the csr representation. The size and number of clusters vary between 10 and 1000. Only 5% of the row vectors are retained in each block. In the largest example, this means 1000 blocks of 50 vectors, with 1000 nodes each.

from sklearn.model_selection import ParameterGrid

parameter_grid = ParameterGrid({'cluster_size' : [10, 50, 100, 200, 500, 1000],
                                'n_clusters' : [10, 50, 200, 500, 1000]})

keep_density = 0.05
keep_each_block = True

As to timing, I wrote a bare bones timer utility, SlimTimer, which will be invoked to measure the execution times.

sys.path.append(r'c:\users\Balazs\source\repos\slimTimer')
from slimTimer.SlimTimer import SlimTimer 
# set up a timer for each function with five runs to measure each
n_runs = 5

timers = [SlimTimer(func = select_unique_rows_lite, n_runs = n_runs),
          SlimTimer(func = select_unique_rows_lite_wset, n_runs = n_runs),
          SlimTimer(func = select_unique_rows_lite_nofilter, n_runs = n_runs),
          SlimTimer(func = assign_labels, n_runs = n_runs)]

We will now measure how the functions perform for each matrix sequentially.

timings = []
for params in parameter_grid:
    
# create blocks i.e. groups of clusters
    clusters = [params['cluster_size']] * params['n_clusters']
    
# set up matrix    
    mat = create_block_diagonal_csr_matrix_np(clusters, fill_value = 1, 
                                              keep_density = keep_density, 
                                              keep_each_block = keep_each_block)
# measure labeling time
    for timer in timers:
        timer.set_func_args(mat)
        timer.measure()

# get timing results
    res_dict = params.copy()
    for timer in timers:
        res_dict.update(timer.to_dict(with_tag = True))

# collect results
    timings.append(res_dict)

We will plot the comparisons grouped by the size of the clusters.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(timings)
grouped = df.groupby('cluster_size')

In case you wish to use this notebook, the header of the dataframe is shown below. We will need the n_clusters, function_name_tmean and function_name_tstdev columns to plot the timings, which contain the number of clusters, the mean runtime and the standard error of the mean runtime for each labeling function.

print(df.columns)
Index(['assign_labels_runtimes', 'assign_labels_tmean', 'assign_labels_tstdev',
       'cluster_size', 'n_clusters',
       'select_unique_rows_lite_nofilter_runtimes',
       'select_unique_rows_lite_nofilter_tmean',
       'select_unique_rows_lite_nofilter_tstdev',
       'select_unique_rows_lite_runtimes', 'select_unique_rows_lite_tmean',
       'select_unique_rows_lite_tstdev',
       'select_unique_rows_lite_wset_runtimes',
       'select_unique_rows_lite_wset_tmean',
       'select_unique_rows_lite_wset_tstdev'],
      dtype='object')
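
A minimal plotting sketch, assuming the column names above (it does not try to reproduce the exact styling of the published figures):

func_names = ['select_unique_rows_lite', 'select_unique_rows_lite_wset',
              'select_unique_rows_lite_nofilter', 'assign_labels']

fig, axes = plt.subplots(1, len(grouped), figsize = (16, 4), sharey = True)

for ax, (cluster_size, group) in zip(axes, grouped):
    for name in func_names:
        # mean runtime with error bars against the number of clusters
        ax.errorbar(group['n_clusters'], group[name + '_tmean'],
                    yerr = group[name + '_tstdev'], label = name)
    ax.set_xscale('log')
    ax.set_yscale('log')
    ax.set_title('cluster size: {0}'.format(cluster_size))
    ax.set_xlabel('number of clusters')

axes[0].set_ylabel('time (s)')
axes[0].legend()
plt.show()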

The comparison of the timing of the various functions can be seen below.

[Figures: timing comparison of the labeling functions, grouped by cluster size]

It is of no surprise that the compiled function is the fastest for all sizes and numbers of clusters by a long chalk. As to the native python methods, omission of filtering speeds up the process. One can also observe that constructing sets is slower than constructing a dictionary for our purposes.

Conclusions

We wrote and compared native python and compiled utilities to label clusters. We found that the compiled utility is superior to the native python ones. Should one use it? I would argue that if efficiency is the most important desideratum of our program, one should never be shy to use the tools to achieve it.