Efficient Tversky index algorithm

05 Mar 2019

Calculating set similarities is one of the most common operations in data science. The most popular scientific Python library, scipy, provides a collection of highly optimised functions to determine set similarities. Unfortunately, these set similarity measures operate on binary vectors, and the collection does not include the versatile Tversky index. Therefore an efficient algorithm to calculate the Tversky index is presented in this blog post.

Motivation

I have been analysing the UK singles charts. The temporal evolution of the record labels was the topic of the first part of the series. I am now investigating the Christmas singles. In order to determine what fraction of singles released in December recurs from year to year, I need a set similarity measure.

The raw notebook can be found here

Set similarity indices

The number of set similarity indices is comparable to the number of species that ecologists have applied them to. Much like the species, these indices are often known by multiple names. More often than not, your very own index turns out to have been published under three different names by the mid-eighties. Look up a review in ecology concerning species diversity, or in sociology (studies on urban deprivation are a good place to start), and you will find dozens of indices.

Requirements

In the following, we restrict our attention to sets, $A_{k}$, which are composed of unique elements:

\[\begin{eqnarray} \mathcal{I}_{A} = \{1,...,N_{A} \} \, \quad & \text{(index set of sets)} \\ \forall k \in \mathcal{I}_{A}: A_{k} \neq \emptyset \quad & \text{(sets are not empty)}\\ \forall i,j,\, i \neq j, \, a_{i}, a_{j} \in A_{k} : a_{i} \neq a_{j} \quad & \text{(sets of unique elements)} \end{eqnarray}\]

In the broadest terms, we require the following behaviour from the index, $d$:

The index should be zero if and only if the two sets have an empty intersection:

\[d(A_{i}, A_{j}) = 0 \Leftrightarrow A_{i} \cap A_{j} = \emptyset \quad (1)\]

The index should be one if and only if one set is a subset of the other:

\[d(A_{i}, A_{j}) = 1 \Leftrightarrow A_{i} \subseteq A_{j} \vee A_{j} \subseteq A_{i} \quad (2)\]

A handful of standard indices are reviewed below from this perspective.

Jaccard index

The popular Jaccard index is the ratio of the cardinalities of the intersection and the union of two sets.

\[S_{J}(A_{i}, A_{j}) = \frac{|A_{i} \cap A_{j}|}{|A_{i} \cup A_{j}|}\]

Since the number of chart positions has changed over time, the sets (e.g. a weekly chart) might have unequal cardinalities. This index therefore does not fulfill requirement (2): if $A_{i} \subset A_{j}$, then $S_{J}(A_{i}, A_{j}) = |A_{i}| / |A_{j}| < 1$.

Overlap coefficient

The overlap coefficient is defined as the ratio of the cardinality of the intersection and that of the smaller set:

\[S_{ol}(A_{i}, A_{j}) = \frac{|A_{i} \cap A_{j}|} {\min(|A_{i}|,|A_{j}|)}\]

Since none of the $A_{i}$ are empty, this index is defined for all possible pairings. It also meets both criteria.

Tversky index

The symmetric Tversky index is given by the following formula:

\[S_{ts}(A_{i}, A_{j}) = \frac{|A_{i} \cap A_{j}|} { |A_{i} \cap A_{j}| + \beta \left(\alpha \min(|A_{i} \setminus A_{j} |,|A_{j} \setminus A_{i}|) + (1 - \alpha) \max(|A_{i} \setminus A_{j}|,|A_{j} \setminus A_{i}|) \right) }\]

Setting $\alpha = \beta = 1$ returns the overlap coefficient. We choose to implement the Tversky index on the merit of its versatility.
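As a quick sanity check of this reduction, here is a throwaway snippet on two toy sets; the helper tversky below is only for illustration and is not the implementation developed later in the post:

toy_a = {1, 2, 3, 4}
toy_b = {3, 4, 5, 6, 7}

def tversky(a, b, alpha, beta):
    # symmetric Tversky index, as defined above
    n_common = len(a & b)
    d_ab, d_ba = len(a - b), len(b - a)
    return n_common / (n_common + beta * (alpha * min(d_ab, d_ba)
                                  + (1 - alpha) * max(d_ab, d_ba)))

overlap_coefficient = len(toy_a & toy_b) / min(len(toy_a), len(toy_b))
print(tversky(toy_a, toy_b, 1.0, 1.0), overlap_coefficient)   # both are 0.5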

Implementation

Aim

We aim to implement a function that calculates the pairwise Tversky indices on a collection of sets.

High level algorithm

The algorithm can be broken down into two main components:

The $\texttt{Tversky}$ function computes the index of a pair of sets.

\[\begin{eqnarray} & & \textbf{Algorithm:} \, \textit{Calculate Tversky index} \\ &1. &\textbf{Input:} \, \mathcal{A}, \mathcal{B}: \, \text{sets}; \alpha, \beta \in \left[0,1\right] \\ &2. & \textbf{Output:} \, t \in \left[0,1\right]: \, \text{Tversky index} \\ &3. & \texttt{Tversky}(\mathcal{A}, \mathcal{B}, \alpha, \beta) \\ &4. & \quad t = \frac{|\mathcal{A} \cap \mathcal{B}|} { |\mathcal{A} \cap \mathcal{B}| + \beta \left(\alpha \min(|\mathcal{A} \setminus \mathcal{B} |,|\mathcal{B} \setminus \mathcal{A}|) + (1 - \alpha) \max(|\mathcal{A} \setminus \mathcal{B}|,|\mathcal{B} \setminus \mathcal{A}|)\right)} \\ &5. & \quad \textbf{return} \, t \end{eqnarray}\]

The $\texttt{BatchTversky}$ algorithm first converts a collection of collections to a collection of sets by sequentially applying the $\texttt{set}$ function. In the second step, the upper triangle of the index matrix, $\mathbf{T}$, is populated by calling $\texttt{Tversky}$ on the set pairs. The diagonal is then set to one as a last step.

\[\begin{eqnarray} & & \textbf{Algorithm:} \, \textit{Calculate Tversky indices} \\ &1. &\textbf{Input:} \, \mathbf{A}: \, \text{list of sets}; \alpha, \beta \in \left[0,1\right] \\ &2. & \textbf{Output:} \, \mathbf{T}: \, \text{coo sparse matrix}, \, \text{pairwise indices} \\ &3. & \texttt{BatchTversky}(\mathbf{A}, \alpha, \beta) \\ &4. & \quad T \leftarrow \mathbf{0} \\ &5. & \quad \textbf{for} \, \, A_{i} \,\, \textbf{in} \,\, \mathbf{A} \,\, \textbf{do}\\ &6. & \quad \quad \mathcal{A}_{i} \leftarrow \texttt{set}(A_{i}) \\ &7. & \quad \textbf{for} \, \, i = 1, |\mathcal{A}| \\ &8. & \quad \quad \textbf{for} \, \, j = i+1, |\mathcal{A}| \\ &9. & \quad \quad \quad t_{ij} \leftarrow \texttt{Tversky}(\mathcal{A}_{i}, \mathcal{A}_{j})\\ &10. & \quad \textbf{for} \, \, i = 1, |\mathcal{A}| \\ &11. & \quad \quad t_{ii} = 1 \\ &12. & \quad \textbf{return} \, T \end{eqnarray}\]

Two implementations are given below. One is a high level, native (and rather naive) python manifestation. The other is a low level realisation of the index.

Native python

The $\texttt{set}$ function is implemented as the convert_to_sets utility:

def convert_to_sets(collection):
    """
    Converts a collection of objects to a collection of sets
    Parameters:
        collection ([[hashable]]) : a collection of groups of hashable objects
    Returns:
        sets (({})) : tuple of sets
    """
    sets = tuple([set(x) for x in collection])
    
    return sets

The calculate_tversky_index1 function does what it says on the label:

def calculate_tversky_index1(a1, a2, alpha, beta):
    """
    Calculates the Tversky index between two sets:
    Parameters:
        a1 ({}) : first set
        a2 ({}) : second set
        alpha (float) : alpha parameter
        beta (float) : beta parameter
    Returns:
        tversky_index_ (float) : Tversky index
    """
    n_1 = len(a1)
    n_2 = len(a2)
    n_common = len(a1 & a2)
    
    n_diff12 = n_1 - n_common
    n_diff21 = n_2 - n_common
    
    tversky_index_ = n_common / (n_common + beta * (alpha * min(n_diff12, n_diff21) 
                                            + (1 - alpha) * max(n_diff12, n_diff21))) 
    
    return tversky_index_

The calculate_set_pairwise_tversky1 function consumes a list of sets and returns the array of indices in coordinate format.

import numpy as np

def calculate_set_pairwise_tversky1(sets, alpha = 1, beta = 1):
    """
    Calculates the pairwise Tversky index in a collection of sets.
    Parameters:
        sets (({})) : iterable of sets
        alpha (float) : alpha parameter. Default: 1.
        beta (float) : beta parameter. Default: 1.
    Returns:
        tversky_indices_ (np.ndarray[n_pairs, 3]) : coordinate format dense array of the indices 
    """
    
    if alpha == beta == 0:
        raise ValueError("'alpha' and 'beta' cannot be simultaneously zero.")
    
    n_sets = len(sets)
    
    # set up coo array: upper triangle pairs plus the diagonal
    n_pairs = int((n_sets + 1) * n_sets / 2)
    tversky_indices_ = np.zeros((n_pairs, 3), dtype = np.float64)
    
    # loop over all pairs
    i_cnt = 0
    for i1 in range(n_sets):
        a1 = sets[i1]
        
        for i2 in range(i1 + 1, n_sets):
            a2 = sets[i2]
            
            ti_ = calculate_tversky_index1(a1, a2, alpha, beta)
            tversky_indices_[i_cnt] = i1, i2, ti_
           
            i_cnt += 1
            
    # fill in diagonal
    tversky_indices_[-n_sets:,0] = np.arange(n_sets) 
    tversky_indices_[-n_sets:,1] = np.arange(n_sets)
    tversky_indices_[-n_sets:,2] = 1
    
    return tversky_indices_

Finally, all the components are merged together in the function calculate_pairwise_tversky1:

def calculate_pairwise_tversky1(collection, alpha = 1, beta = 1):
    """
    Calculates the pairwise Tversky index in a collection of sets.
    Parameters:
        collection ([[hashable]]) : a collection of groups of hashable objects
        alpha (float) : alpha parameter. Default: 1.
        beta (float) : beta parameter. Default: 1.
    Returns:
        tversky_indices_ (np.ndarray[n_pairs, 3]) : coordinate format dense array of the indices 
    """
    
    if alpha == beta == 0:
        raise ValueError("'alpha' and 'beta' cannot be simultaneously zero.")
    
    sets = convert_to_sets(collection)
        
    tversky_indices_ = calculate_set_pairwise_tversky1(sets, alpha = alpha, beta = beta)
    
    return tversky_indices_

Analysis

Pros:
Cons:
Time complexity

If the average size of a collection is $n$ and there are $N$ collections, the time complexity of the above brute force method is $\mathcal{O}(N^{2} n)$. The hashing requires $\mathcal{O}(Nn)$ operations. Calculating an intersection costs $n \cdot \mathcal{O}(1) = \mathcal{O}(n)$, and this has to be done $\sim N^{2} / 2$ times.
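To put these scalings in perspective (with illustrative numbers rather than the test sizes used later): for $N = 1000$ collections of average size $n = 100$, the pairwise comparisons cost roughly

\[\frac{N^{2}}{2} \, n = \frac{1000^{2}}{2} \cdot 100 = 5 \times 10^{7}\]

hash lookups, which dwarfs the $Nn = 10^{5}$ operations spent on building the hash tables.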

Test

This space only allows for simple testing. The input data will be a sequence of words.

fruits = ['apple', 'banana', 'cantaloupe', 'date', 'elderberry', 'fig', 'grapefruit', 'honeydew melon' ]

# create sets
fruit_bowls = [[fruits[idx] for idx in range(i1, i1 + 4)] for i1 in range(5)]
for idx, bowl in enumerate(fruit_bowls):
    print("{0}. bowl: {1}".format(idx + 1, ", ".join(bowl)))
1. bowl: apple, banana, cantaloupe, date
2. bowl: banana, cantaloupe, date, elderberry
3. bowl: cantaloupe, date, elderberry, fig
4. bowl: date, elderberry, fig, grapefruit
5. bowl: elderberry, fig, grapefruit, honeydew melon

The overlap between the sets is then calculated:

overlap = calculate_pairwise_tversky1(fruit_bowls, alpha = 1, beta = 1)
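The plot below was produced from these triplets. One possible way to assemble them into a dense, symmetric matrix (the actual plotting code lives in the raw notebook):

n_sets = len(fruit_bowls)
overlap_matrix = np.zeros((n_sets, n_sets))

# scatter the coordinate-format triplets into both triangles of the matrix
rows = overlap[:, 0].astype(int)
cols = overlap[:, 1].astype(int)
overlap_matrix[rows, cols] = overlap[:, 2]
overlap_matrix[cols, rows] = overlap[:, 2]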

(figure: heatmap of the pairwise overlaps between the fruit bowls)

It looks as expected. The first and last sets do not have common elements, hence their overlap is zero.

Low level implementation

No matter how versatile the above implementation is, it has a number of serious shortcomings. We will try to address and resolve some of them in this section.

General considerations

Possible improvements

The time complexity of the naive algorithm is $\mathcal{O}(N^{2}n)$: $N^{2}$ is the number of comparisons, and $n$ is the time needed to compare two sets. If either of them is reduced, the algorithm speeds up. It is unlikely that the hash lookup can be improved upon, thus we will concentrate on reducing the $N^{2}$ comparisons.

Totally ordered sets

Let us assume the collections are totally ordered sets; it is therefore possible to find their least and greatest elements. Is this a good assumption? It is more than likely that the collections will consist of integer hashes in a production environment, so the elements of a collection can be ordered. If the sets do not consist of integers, one can always hash the elements, assuming a good hash function, in $\mathcal{O}(Nn)$ time.

Once the least and greatest elements of all sets are determined, the sets can be screened before comparison. If the least element of one set is greater than the greatest element of the other, their intersection is empty.
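As a tiny illustration of this screening rule (the two sorted arrays below are made-up toy values):

screen_a = np.array([7, 8, 9], dtype = np.int32)
screen_b = np.array([1, 2, 3], dtype = np.int32)

# the least element of screen_a exceeds the greatest element of screen_b,
# so the pair can be discarded without comparing any further elements
ranges_overlap = not (screen_a[0] > screen_b[-1] or screen_b[0] > screen_a[-1])
print(ranges_overlap)   # False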

The steps are now, assuming totally ordered sets:

  1. Sort each collection ($\mathcal{O}(Nn\log n)$).
  2. Record the least and greatest element of each set.
  3. Screen out the pairs whose ranges do not overlap.
  4. Calculate the Tversky index only for the remaining pairs.

As we can see, the average time complexity is the same; only an $\mathcal{O}(Nn\log n)$ term has been added. In realistic cases ($n \gg 4$) this is a negligible burden. However, depending on the distribution of the elements (e.g. when the average set size is much smaller than the number of possible hashes and the sets tend to have elements from the same neighbourhood), the screening can result in a considerable speed-up.

Algorithm

The bullet points above serve as a guide to write the algorithm $\texttt{ScreenBatchTversky}$, which screens the sets for non-zero intersections.

\[\begin{eqnarray} & & \textbf{Algorithm:} \, \textit{Calculate Tversky indices with screening} \\ &1. &\textbf{Input:} \, \mathbf{A}: \, \text{list of totally ordered sets}; \alpha, \beta \in \left[0,1\right] \\ &2. & \textbf{Output:} \, \mathbf{T}: \, \text{coo sparse matrix} \, \text{pairwise indices} \\ &3. & \texttt{ScreenBatchTversky}(\mathbf{A}, \alpha, \beta) \\ % &4. & \quad \mathbf{T} \leftarrow \mathbf{0} \\ &5. & \quad \mathbf{L} \leftarrow \mathbf{0} \\ &6. & \quad \mathbf{U} \leftarrow \mathbf{0} \\ &7. & \quad \mathbf{C} \leftarrow \mathbf{0} \\ % &8. & \quad \textbf{for} \, \, \mathcal{A}_{i} \,\, \textbf{in} \,\, \mathbf{A} \,\, \textbf{do}\\ &9. & \quad \quad \mathcal{A}_{i} \leftarrow \texttt{sort}(\mathcal{A}_{i}) \\ % &10. & \quad \textbf{for} \, \, i = 1, |\mathbf{A}| \,\, \textbf{do}\\ &11. & \quad \quad L_{i} \leftarrow \text{inf}(\mathcal{A}_{i}) \\ &12. & \quad \quad U_{i} \leftarrow \sup(\mathcal{A}_{i}) \\ % &13. & \quad \mathbf{L} \leftarrow \texttt{sort}(\mathbf{L}) \\ &14. & \quad \mathbf{U} \leftarrow \texttt{sort}(\mathbf{U}) \\ % &15. & \quad \textbf{for} \, \, i = 1, |\mathbf{A}| \,\, \textbf{do}\\ &16. & \quad \quad C_{i} \leftarrow \texttt{SelectNonzeroOverlaps}(L_{i}, \mathbf{U}) \\ % &17. & \quad \textbf{for} \, \, i = 1, |\mathbf{A}| \,\, \textbf{do}\\ &18. & \quad \quad \textbf{for} \, \, j \in C_{i} \,\, \textbf{do}\\ &19. & \quad \quad \quad T_{ij} \leftarrow \texttt{Tversky}(\mathcal{A}_{i}, \mathcal{A}_{j}) \\ &20. & \quad \textbf{return} \, \mathbf{T} \end{eqnarray}\]

The function $\texttt{SelectNonzeroOverlaps}$ compares the least and greatest elements of the sets to determine the pairs which definitely do not have intersections. (Note that without looking at the elements of two sets it is not possible to determine whether their intersection is empty, even if their ranges overlap. Consider $\mathcal{A} = \{1,3,5\}$ and $\mathcal{B} = \{2,4,6\}$: the ranges overlap, but the intersection is empty. In other words, the screening filters out true negatives, but retains the false positives.)

Issues

There are four additional issues with the algorithm above.

Resolution

These issues can all be solved by introducing some simple modifications (a small numpy sketch of the resulting index bookkeeping follows the list):

  1. $\texttt{argsort}$ $U$ and $L$.
  2. $\texttt{SelectNonzeroOverlaps}$ returns a list of indices, one for each set; each index is the lowest position in the sorted order of $U$ at which an overlap is still possible.
  3. Loop over the sets starting at the one with the largest least element.
  4. For each set, loop over its selection.
  5. Skip the pair if it would otherwise be computed twice (compare the positions of the two sets in the ordering of the least elements).
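As a concrete illustration of the bookkeeping above, here is a small numpy sketch with made-up toy values; it mirrors the idcs_l, idcs_u, idcs_b and idcs_s arrays of the implementation further below:

l_arr = np.array([5, 0, 3])   # least elements of three sets
u_arr = np.array([8, 2, 6])   # greatest elements of the same sets

idcs_l = np.argsort(l_arr)    # sets ordered by least element: [1, 2, 0]
idcs_u = np.argsort(u_arr)    # sets ordered by greatest element: [1, 2, 0]
idcs_b = np.argsort(idcs_l)   # inverse permutation: rank of each set by least element

# for each set: the first position (in greatest-element order) whose maximum is
# not smaller than the set's least element; earlier positions cannot overlap it
idcs_s = np.searchsorted(u_arr[idcs_u], l_arr[idcs_l])[idcs_b]
print(idcs_s)                 # [1 0 1]: set 1 (max 2) can never overlap sets 0 or 2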

The modified algorithm can be found below.

\[\begin{eqnarray} & & \textbf{Algorithm:} \, \textit{Calculate Tversky indices with screening} \\ &1. &\textbf{Input:} \, \mathbf{A}: \, \text{list of totally ordered sets}; \alpha, \beta \in \left[0,1\right] \\ &2. & \textbf{Output:} \, \mathbf{T}: \, \text{coo sparse matrix} \, \text{pairwise indices} \\ &3. & \texttt{ScreenBatchTversky}(\mathbf{A}, \alpha, \beta) \\ % &4. & \quad \mathbf{T} \leftarrow \mathbf{0} \\ &5. & \quad \mathbf{L} \leftarrow \mathbf{0} \qquad \qquad \textit{(least elements)}\\ &6. & \quad \mathbf{U} \leftarrow \mathbf{0} \qquad \qquad \textit{(largest elements)}\\ &7. & \quad \mathbf{I}_{S} \leftarrow \mathbf{0} \qquad \qquad \textit{(array storing non-excluded sets)}\\ &8. & \quad \mathbf{I}_{B} \leftarrow \mathbf{0} \qquad \qquad \textit{(lookup for least elements)}\\ % &9. & \quad \textbf{for} \, \, A_{i} \,\, \textbf{in} \,\, \mathbf{A} \,\, \textbf{do}\\ &10. & \quad \quad \mathcal{A}_{i} \leftarrow \texttt{sort}(\mathcal{A}_{i}) \\ % &11. & \quad \textbf{for} \, \, i = 1, |\mathbf{A}| \,\, \textbf{do}\\ &12. & \quad \quad L_{i} \leftarrow \text{inf}(\mathcal{A}_{i}) \\ &13. & \quad \quad U_{i} \leftarrow \sup(\mathcal{A}_{i}) \\ % &14. & \quad \mathbf{I}_{L} \leftarrow \texttt{argsort}(\mathbf{L}) \\ &15. & \quad \mathbf{I}_{U} \leftarrow \texttt{argsort}(\mathbf{U}) \\ &16. & \quad \mathbf{I}_{B} \leftarrow \texttt{CreateLookUp}(\mathbf{I}_{L}) \\ % &17. & \quad \mathbf{I}_{S}\leftarrow \texttt{LeftExcludeSets}(\mathbf{L}, \mathbf{U}, \mathbf{I}_{L}, \mathbf{I}_{U}, \mathbf{I}_{B}) \\ % &18. & \quad \textbf{for} \, \, k =|\mathbf{A}|, 1, -1 \,\, \textbf{do}\\ &19. & \quad \quad i = I_{L}(k) \qquad \qquad \qquad \textit{(choose set with largest least element)} \\ &20. & \quad \quad \textbf{for} \, \, l = \mathbf{I}_{S}(i), \, |\mathbf{I}_{U}| \, \,\textbf{do} \qquad \textit{(iterate over non-excluded sets)} \\ &21. & \qquad \quad j = \mathbf{I}_{U}(l) \qquad \qquad \qquad \textit{(choose the other set)}\\ &22. & \qquad \quad \textbf{if} \,\, i = j \\ &23. & \qquad \qquad \textbf{continue} \\ &24. & \quad \qquad \textbf{if} \,\, k < I_{B}(j) \qquad \qquad \qquad \textit{(do not calculate twice)} \\ &25. & \qquad \qquad \textbf{continue} \\ &26. & \quad \quad \quad T_{ij} \leftarrow \texttt{Tversky}(\mathcal{A}_{i}, \mathcal{A}_{j}) \\ &27. & \quad \textbf{return} \, \mathbf{T} \end{eqnarray}\]

A graphical summary of the algorithm helps to explain the role of screening:

(figure: graphical summary of the screening and pairing steps)

The $\texttt{ScreenBatchTversky}$ algorithm is implemented in the calculate_pairwise_tversky2 function below. The input collection is expected to be a list of np.ndarray-s.

import numba as nb
import numpy as np

nb_int32 = nb.types.int32
nb_f64  = nb.types.float64

@nb.jit(
        nb_f64[:,:](
                       nb.types.List(nb_int32[::1], reflected = True),
                       nb_f64,
                       nb_f64
                     ),
        nopython = True,
        debug = True
       )
def calculate_pairwise_tversky2(collection, alpha, beta):
    """
    Calculates the pairwise Tversky index in a collection of sets.
    Parameters:
        collection ([np.ndarray]) : a collection of groups of hashable objects
        alpha (float) : alpha parameter. Default: 1.
        beta (float) : beta parameter. Default: 1.
    Returns:
        tversky_indices_ (np.ndarray[n_pairs, 3]) : coordinate format dense array of the indices  
    """
    
    n_sets = len(collection)
    
    # sort sets in place
    for x in collection:
        x.sort()   
      
    # find minimum and maximum elements
    l_arr = np.array([x[0] for x in collection])
    u_arr = np.array([x[-1] for x in collection])
        
    # argsort
    idcs_l = np.argsort(l_arr)
    idcs_u = np.argsort(u_arr)

    # create lookup: idcs_b inverts idcs_l, i.e. idcs_b[i] is the rank of set i
    # when the sets are ordered by their least elements
    idcs_b = np.argsort(idcs_l)
    
    # determine which sets to be included (LeftExcludeSets): for each set, the
    # first position in the U-sorted order whose maximum is not smaller than
    # the set's least element; earlier positions cannot overlap it
    idcs_s = np.searchsorted(u_arr[idcs_u], l_arr[idcs_l])[idcs_b]
    
    # collects i,j, S_{Tversky}(i,j) triplets
    collector = []
   
    # master loop over sets -- it feels like fortran! 
    for k in range(n_sets -1, -1, -1):
        i = idcs_l[k]
                
        for l in range(idcs_s[i], n_sets):
            j = idcs_u[l]
            
            if i == j:
                continue
                
            if k < idcs_b[j]:
                continue
                            
            ti_ = calculate_tversky_index_sorted(collection[i], collection[j], alpha, beta)
            collector.append((i * 1.0, j * 1.0, ti_))
    
    # add diagonal
    collector.extend([(i, i, 1) for i in range(n_sets)])
            
    # convert to numpy array
    tversky_indices_ = np.array(collector, dtype = np.float64)
    
    return tversky_indices_

The only task left to tackle is to provide a recipe to calculate the index itself. In calculate_pairwise_tversky1 we used hash tables. Should we invoke this method again, it would result in copying all of the data $N$ times. Instead, we exploit the fact that the arrays are sorted: all common elements of two sorted arrays of lengths $m, n$ can be found in $\mathcal{O}(m + n)$ time without creating any temporary arrays. The pseudocode is omitted for the sake of brevity this time; the implementation in `calculate_intersection_size` should be clear enough. Please note that the `np.intersect1d` function creates a temporary array, thus we avoid using it.

@nb.jit(
        nb_int32(
                  nb_int32[::1],
                  nb_int32[::1]
                ), 
        nopython = True
       )
def calculate_intersection_size(a1, a2):
    """
    Calculates the size of intersection of two arrays.
    Parameters:
        a1 (np.ndarray 1D) : first array
        a2 (np.ndarray 1D) : second array
    Returns:
        n_common (int) : number of shared elements
    """
    
    # choose the shorter array to drive the merge
    # (these are references, so we do not create new arrays)
    
    if a1.size <= a2.size:
        a, b = a1, a2
    else:
        a, b = a2, a1
    
    n_a, n_b = a.size, b.size
    
    # initialise counter
    i, j, n_common = 0, 0, 0
    c_old = min(a[0], b[0]) - 1 # this is dirty, I know
    
    # merge in shorter array, shortcut on disjoint arrays
    while (i < n_a) and (j < n_b):
        
        if a[i] <= b[j]:
            c_new = a[i]
            i += 1
        else:
            c_new = b[j]
            j += 1
            
        if c_new == c_old:
            n_common += 1
       
        c_old = c_new
        
    # the loop may end before a match on the last element is counted
    if i == n_a:
        if a[-1] == b[j]:
            n_common += 1
    
    if j == n_b:
        if a[i] == b[-1]:
            n_common += 1
        
    return n_common
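A quick sanity check on two made-up arrays of unequal length (this also exercises the post-loop branch):

check_a = np.array([1, 2, 3], dtype = np.int32)
check_b = np.array([3, 4, 5, 6], dtype = np.int32)
print(calculate_intersection_size(check_a, check_b))   # 1: only the element 3 is shared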

This method is then wrapped in the calculate_tversky_index_sorted function which is called from the main algorithm.

@nb.jit(
        nb_f64(
                nb_int32[::1],
                nb_int32[::1],
                nb_f64,
                nb_f64
              ),
        nopython = True
        )
def calculate_tversky_index_sorted(a1, a2, alpha, beta):
    """
    Calculates the Tversky index of two sorted lists/arrays.
    Parameters:
        a1 (np.ndarray of int) : first array
        a2 (np.ndarray of int) : second array
        alpha (float) : alpha parameter. 
        beta (float) : beta parameter. 
    Returns:
        tversky_index_ (float) : Tversky index
    """
    
    # cardinalities
    n_1 = a1.size
    n_2 = a2.size
    
    # intersection size
    n_common = calculate_intersection_size(a1, a2)

    # set differences
    n_diff12 = n_1 - n_common
    n_diff21 = n_2 - n_common
    
    tversky_index_ = n_common / (n_common + beta * (alpha * min(n_diff12, n_diff21) 
                                            + (1 - alpha) * max(n_diff12, n_diff21)))
    
    return tversky_index_

Analysis

Pros:
Cons:
Hints for compiling with numba

Testing

For the sake of brevity, the test set will only consist of a handful of arrays.

nums = [5,4,7,2,0,1,3,6]
arrs = [np.arange(idx, idx + 4, dtype = np.int32) for idx in nums]

# reference  
tversky_ref_res = calculate_pairwise_tversky1(arrs, 1.0, 1.0) 
  
# sparse + numba
tversky_nb_res = calculate_pairwise_tversky2(arrs, 1.0, 1.0)

The lower triangle of the image below shows the overlap calculated with the calculate_pairwise_tversky1 algorithm, whilst the upper triangle contains the results from the new algorithm. The two results agree. The zero overlap pairs have been omitted by the new algorithm, as expected.

(figure: pairwise overlaps; reference algorithm in the lower triangle, screened algorithm in the upper triangle)
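Beyond the visual comparison, the agreement can also be verified programmatically. A minimal sketch (the helper triplets_to_dict is ours, only for this check):

def triplets_to_dict(triplets):
    # map an unordered index pair to its index value, dropping explicit zeros
    return {(min(int(i), int(j)), max(int(i), int(j))): v
            for i, j, v in triplets if v > 0}

ref = triplets_to_dict(tversky_ref_res)
new = triplets_to_dict(tversky_nb_res)

assert ref.keys() == new.keys()
assert all(abs(ref[k] - new[k]) < 1.0e-12 for k in ref)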

Low level implementation 2

We will see below that this low level implementation is surprisingly slow. We went to great lengths to ensure that the for loops are only entered when there is likely to be an overlap, and all known zero overlaps are omitted. The reason is the numerous array lookup operations; see the index arrays (idcs_l, idcs_u, idcs_s, idcs_b) used inside the nested loops of calculate_pairwise_tversky2.

No matter how enchanting the prospect of rewriting our code yet again is, we choose to take a different course of action. All lookup arrays will now be omitted. The screening will happen inside the nested for loop. This will result in many empty loop iterations, but the memory lookups will be as efficient as possible.

\[\begin{eqnarray} & & \textbf{Algorithm:} \, \textit{Calculate Tversky indices with screening} \\ &1. &\textbf{Input:} \, \mathbf{A}: \, \text{list of totally ordered sets}; \alpha, \beta \in \left[0,1\right] \\ &2. & \textbf{Output:} \, \mathbf{T}: \, \text{coo sparse matrix} \, \text{pairwise indices} \\ &3. & \texttt{ScreenBatchTverskySlim}(\mathbf{A}, \alpha, \beta) \\ % &4. & \quad \mathbf{T} \leftarrow \mathbf{0} \\ % &5. & \quad \textbf{for} \, \, A_{i} \,\, \textbf{in} \,\, \mathbf{A} \,\, \textbf{do}\\ &6. & \qquad \mathcal{A}_{i} \leftarrow \texttt{sort}(\mathcal{A}_{i}) \\ % &7. & \qquad \textbf{for} \, \, i = 1, \, |\mathbf{A}| \, \,\textbf{do} \\ &8. & \qquad \quad \textbf{for} \, \, j = i + 1, \, |\mathbf{A}| \, \,\textbf{do} \\ % &9. & \qquad \qquad \textbf{if} \, \, \sup(\mathcal{A}_{i}) < \text{inf}(\mathcal{A}_{j}) \\ &10. & \qquad \qquad \quad \textbf{continue} \\ &11. & \qquad \qquad \textbf{if} \, \, \text{inf}(\mathcal{A}_{i}) > \sup(\mathcal{A}_{j}) \\ &12. & \qquad \qquad \quad \textbf{continue} \\ &13. & \quad \quad \quad T_{ij} \leftarrow \texttt{Tversky}(\mathcal{A}_{i}, \mathcal{A}_{j}) \\ &14. & \quad \textbf{return} \, \mathbf{T} \end{eqnarray}\]

This algorithm is implemented in the calculate_pairwise_tversky3 function below.

@nb.jit(
        nb_f64[:,:](
                       nb.types.List(nb_int32[::1], reflected = True),
                       nb_f64,
                       nb_f64
                     ),
        nopython = True,
        debug = True
       )
def calculate_pairwise_tversky3(collection, alpha, beta):
    """
    Calculates the pairwise Tversky index in a collection of sets.
    Parameters:
        collection ([np.ndarray]) : a collection of groups of hashable objects
        alpha (float) : alpha parameter. Default: 1.
        beta (float) : beta parameter. Default: 1.
    Returns:
        tversky_indices_ (np.ndarray[n_pairs, 3]) : coordinate format dense array of the indices  
    """
    
    n_sets = len(collection)
    
    # sort sets in place
    for x in collection:
        x.sort()   
      
    collector = []
    # master loop over sets -- it feels like fortran! 
    for i in range(n_sets):
        coll1 = collection[i]
                
        for j in range(i + 1, n_sets):
            coll2 = collection[j]
            
            # screen inside the loop: skip pairs whose ranges cannot overlap
            if (coll1[0] > coll2[-1]) or (coll1[-1] < coll2[0]):
                continue
                            
            ti_ = calculate_tversky_index_sorted(coll1, coll2, alpha, beta)
            collector.append((i * 1.0, j * 1.0, ti_))
    
    # add diagonal
    collector.extend([(i, i, 1) for i in range(n_sets)])
            
    # convert to numpy array
    tversky_indices_ = np.array(collector, dtype = np.float64)
    
    return tversky_indices_

Performance

The performance of the three implementations is now compared. Various sets of arrays are created; the following properties of them will be controlled:

The parameters are as follows:

from sklearn.model_selection import ParameterGrid

test_params = {
                'n_arr' : [100, 500, 1000, 2000],
                'n_size' : [10, 100, 1000]
                }

test_grid = ParameterGrid(test_params)

In order for the comparison to be fair, the time needed to convert the arrays to sets is not measured when the calculate_pairwise_tversky1 algorithm is used.

The timing results are collected in the results list of dictionaries. We use the SlimTimer utility, with which three measurements are carried out. The lines which perform the measurements are omitted from this post for the sake of brevity; they can be inspected in the raw notebook.
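For reference, a minimal stand-in for such a measurement loop is sketched below. It is not the SlimTimer utility used in the notebook, and it assumes that n_arr is the number of arrays and n_size the number of elements drawn per array; the pool size of 10 * n_size is an arbitrary choice for the sketch.

import time

rng = np.random.RandomState(42)
results = []

for params in test_grid:
    # generate a list of sorted, unique int32 arrays
    arrs_ = [np.unique(rng.randint(0, 10 * params['n_size'], params['n_size']).astype(np.int32))
             for _ in range(params['n_arr'])]

    timings = []
    for _ in range(3):                                 # three measurements per setting
        start = time.perf_counter()
        calculate_pairwise_tversky3(arrs_, 1.0, 1.0)
        timings.append(time.perf_counter() - start)

    results.append({**params, 'time': min(timings)})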

(figure: timing results of the three implementations across the parameter grid)

Does screening for non-overlapping arrays improve the performance? It is hard to answer this question, because the difference between the first and the other implementations is twofold: i) the algorithm is different, ii) jit-ing was used to speed up the loops. In either case, calculate_pairwise_tversky3 is definitely much faster than the other two methods.

What we do know is that if the arrays are sampled uniformly from an interval, the average overlap should be independent of the array size and the size of the pool, provided their ratio is constant. If the pool size is much larger than the array size, the overlap becomes independent of the array size.

To determine the effect of the loop optimisation, two algorithms are compared. One is calculate_pairwise_tversky3; the other is nearly identical to it, save that the screening for non-overlapping arrays has been removed.

If the array elements are not sampled uniformly, the effect of screening is amplified. A series of tests is carried out on the following two-layer sampling distribution.

(figure: timings with and without screening under the two-layer sampling distribution)

As we can see the loop optimisation results in a speedup by a factor of two.

Conclusions

Future plans

Implement the algorithm using an interval tree for efficient screening of overlapping arrays.