In applications such as Collaborative Filtering (CF), computing similarities between many pairs of vectors can become a challenge, both in implementation effort and in computational performance.

Consider a matrix whose rows correspond to user_id and whose columns correspond to item_id. Each cell holds a boolean or numerical value representing user-item information, such as purchase history or an item rating.

user_id \ item_id |  1  2  3  4  ...
------------------+----------------
                1 |  1  1  0  0
                2 |  0  1  0  0
                3 |  1  0  1  0
                4 |  0  0  0  0
                5 |  0  0  1  0
              ... |

Matrix: Whether a user purchased an item or not

In general this matrix is sparse, so we can use the scipy.sparse library to represent it. In item-based CF, similarities are calculated for all pairs of items (columns).
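
For reference, such a matrix can be built from a raw purchase log with scipy.sparse; the following is a minimal sketch, where the purchase pairs and the id-to-index mapping are only for illustration.

import numpy as np
import scipy.sparse as sp

# hypothetical purchase log: (user_id, item_id) pairs
purchases = [(1, 1), (1, 2), (2, 2), (3, 1), (3, 3), (5, 3)]

# map raw ids to consecutive row/column indices
user_index = {u: n for n, u in enumerate(sorted({u for u, _ in purchases}))}
item_index = {i: n for n, i in enumerate(sorted({i for _, i in purchases}))}

rows = [user_index[u] for u, _ in purchases]
cols = [item_index[i] for _, i in purchases]
data = np.ones(len(purchases))

# CSC format is convenient because similarities are taken between columns (items)
mat = sp.csc_matrix((data, (rows, cols)), shape=(len(user_index), len(item_index)))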

This post shows efficient implementations of similarity computation for two common measures: cosine similarity and Jaccard similarity.

Cosine Similarity

Cosine similarity between two vectors a and b is defined as

    cos(a, b) = a · b / (||a|| ||b||)

The code below calculates the cosine similarity between every pair of column vectors.

Assume that mat is of type scipy.sparse.csc_matrix.

import sklearn.preprocessing as pp

def cosine_similarities(mat):
    # L2-normalize each column, then a single matrix product yields all pairwise cosines
    col_normed_mat = pp.normalize(mat.tocsc(), axis=0)
    return col_normed_mat.T * col_normed_mat

The column vectors are L2-normalized first, and the cosine values are then obtained by a single matrix product.

In [1]: import scipy.sparse as sp

In [2]: mat = sp.rand(5, 4, 0.2, format='csc') # generate random sparse matrix
[[ 0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.        ]
 [ 0.          0.01970055  0.87107041  0.        ]
 [ 0.          0.67868903  0.          0.        ]
 [ 0.          0.81698843  0.          0.        ]]

In [3]: cosine_similarities(mat)
[[ 0.          0.          0.          0.        ]
 [ 0.          1.          0.01854522  0.        ]
 [ 0.          0.01854522  1.          0.        ]
 [ 0.          0.          0.          0.        ]]
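
As a quick sanity check (assuming scikit-learn is available), the sparse result can be compared against a dense pairwise computation; columns become rows by transposing:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

dense_sims = cosine_similarity(mat.T)              # dense pairwise cosines between columns
sparse_sims = cosine_similarities(mat).toarray()
print(np.allclose(dense_sims, sparse_sims))        # expected: True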

Jaccard Similarity

Jaccard similarity between two sets A and B is defined as

    J(A, B) = |A ∩ B| / |A ∪ B|

Here each column is treated as the set of rows in which it contains a 1, so |A ∩ B| is the number of users two items have in common and |A ∪ B| = |A| + |B| - |A ∩ B|. Assume that mat is a binary (0 or 1) matrix of type scipy.sparse.csc_matrix.

import numpy as np

def jaccard_similarities(mat):
    # |A|: number of nonzeros per column (set size of each item)
    cols_sum = mat.getnnz(axis=0)
    # |A ∩ B|: co-occurrence counts for every pair of columns
    ab = mat.T * mat

    # |A| for the row of each stored entry
    aa = np.repeat(cols_sum, ab.getnnz(axis=1))
    # |B| for the column of each stored entry
    bb = cols_sum[ab.indices]

    similarities = ab.copy()
    # |A ∩ B| / (|A| + |B| - |A ∩ B|)
    similarities.data /= (aa + bb - ab.data)

    return similarities

The Jaccard implementation is a little more involved. In order to eliminate Python loops, the code manipulates the sparse matrix's raw representation (the data and indices arrays) directly.
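
For reference, a compressed sparse matrix stores its contents in three flat arrays: data (the nonzero values), indices (the row index of each value in CSC format, the column index in CSR), and indptr (the boundaries of each column or row within data). A tiny illustration:

import numpy as np
import scipy.sparse as sp

m = sp.csc_matrix(np.array([[1, 0, 1],
                            [0, 1, 1]]))
print(m.data)     # [1 1 1 1]   nonzero values, stored column by column
print(m.indices)  # [0 1 0 1]   row index of each stored value
print(m.indptr)   # [0 1 2 4]   where each column starts in data/indices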

In [4]: mat = sp.rand(5, 4, 0.2, format='csc')
In [5]: mat.data[:] = 1 # binarize
[[ 1.  1.  0.  0.]
 [ 0.  1.  0.  0.]
 [ 1.  0.  1.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  1.  0.]]

In [6]: jaccard_similarities(mat)
[[ 1.          0.33333333  0.33333333  0.        ]
 [ 0.33333333  1.          0.          0.        ]
 [ 0.33333333  0.          1.          0.        ]
 [ 0.          0.          0.          0.        ]]
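
As with the cosine case, a naive double loop over column pairs can serve as a sanity check; this sketch is for verification on small matrices only and would be far too slow for the benchmark below.

import numpy as np

def jaccard_naive(mat):
    dense = mat.toarray() > 0
    n_items = dense.shape[1]
    sims = np.zeros((n_items, n_items))
    for i in range(n_items):
        for j in range(n_items):
            union = np.logical_or(dense[:, i], dense[:, j]).sum()
            if union > 0:
                inter = np.logical_and(dense[:, i], dense[:, j]).sum()
                sims[i, j] = inter / union
    return sims

print(np.allclose(jaccard_naive(mat), jaccard_similarities(mat).toarray()))  # expected: True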

Performance

Neither implementation contains Python loops, so the main calculation runs in the native code of NumPy and SciPy.

Even when the matrix has a million nonzero elements, the computation finishes in about a second on my laptop.

In [7]: mat = sp.rand(10**5, 10**4, 0.001, format='csc')
<100000x10000 sparse matrix of type '<class 'numpy.float64'>'
        with 1000000 stored elements in Compressed Sparse Column format>

In [8]: %time cosine_similarities(mat)
CPU times: user 535 ms, sys: 91.1 ms, total: 626 ms
Wall time: 647 ms

In [9]: mat.data[:] = 1 # binarize
In [10]: %time jaccard_similarities(mat)
CPU times: user 857 ms, sys: 203 ms, total: 1.06 s
Wall time: 1.14 s
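
Once the similarity matrix is available, item-based CF typically looks up the most similar items for a given item. A minimal sketch of how that could be read off the sparse result (the function name and parameters are only illustrative):

import numpy as np

def top_similar_items(similarities, item, k=10):
    # densify only the requested column and rank the other items by similarity
    col = similarities.getcol(item).toarray().ravel()
    col[item] = 0  # exclude the item itself
    order = np.argsort(col)[::-1][:k]
    return [(j, col[j]) for j in order if col[j] > 0]

# e.g. top_similar_items(cosine_similarities(mat), item=0, k=10)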