Compute all pairwise vector similarities within a sparse matrix (Python)
In applications such as collaborative filtering (CF), computing vector similarities can become a challenge, both in implementation effort and in computational performance.
Consider a matrix whose rows and columns correspond to user_id and item_id. Each cell holds a boolean or numerical value that represents user-item information, such as a purchase record or an item rating.
          item_id  1  2  3  4  ...
 user_id  ------------------------
    1   |  1  1  0  0
    2   |  0  1  0  0
    3   |  1  0  1  0
    4   |  0  0  0  0
    5   |  0  0  1  0
   ...

Matrix: whether a user purchased an item or not
In practical situations the matrix is sparse, so we can use the scipy.sparse library to represent it.
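For example, the matrix above can be built from (user_id, item_id) pairs; a minimal sketch (the ids are shifted to 0-based indices):

import scipy.sparse as sp

# (user_id, item_id) purchase pairs from the table above, 0-based
users = [0, 0, 1, 2, 2, 4]
items = [0, 1, 1, 0, 2, 2]
data = [1] * len(users)

# 5 users x 4 items; CSC format keeps the item (column) vectors
# cheap to access, which is what item-based CF needs
mat = sp.csc_matrix((data, (users, items)), shape=(5, 4))
print(mat.toarray())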
In item-based CF, similarities have to be calculated for every pair of items, i.e. every pair of columns.
This post shows efficient implementations for two popular measures: cosine similarity and Jaccard similarity.
Cosine Similarity
Cosine similarity between two vectors a and b is defined as

$$\mathrm{cosine}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$$
The code below calculates the cosine similarities between all pairs of column vectors. Assume that mat is a scipy.sparse.csc_matrix.
import sklearn.preprocessing as pp

def cosine_similarities(mat):
    # L2-normalize each column, then the Gram matrix holds the cosines
    col_normed_mat = pp.normalize(mat.tocsc(), axis=0)
    return col_normed_mat.T * col_normed_mat
The column vectors are normalized first; the cosine values then follow from a single matrix product.
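To see why this works, here is a sanity-check sketch against a dense reference computation (the variable names are mine; this is not part of the original implementation):

import numpy as np
import scipy.sparse as sp

mat = sp.rand(5, 4, 0.5, format='csc')

# dense reference: normalize columns explicitly, then take the Gram matrix
dense = mat.toarray()
norms = np.linalg.norm(dense, axis=0)
norms[norms == 0] = 1  # leave empty columns as zero vectors
expected = (dense / norms).T @ (dense / norms)

assert np.allclose(cosine_similarities(mat).toarray(), expected)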
In [1]: import scipy.sparse as sp

In [2]: mat = sp.rand(5, 4, 0.2, format='csc')  # generate a random sparse matrix

In [3]: print(mat.toarray())
[[ 0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.        ]
 [ 0.          0.01970055  0.87107041  0.        ]
 [ 0.          0.67868903  0.          0.        ]
 [ 0.          0.81698843  0.          0.        ]]
In [4]: print(cosine_similarities(mat).toarray())
[[ 0.          0.          0.          0.        ]
 [ 0.          1.          0.01854522  0.        ]
 [ 0.          0.01854522  1.          0.        ]
 [ 0.          0.          0.          0.        ]]
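In item-based CF, the similarity matrix is usually queried for the most similar items. A minimal sketch on top of the result above (top_k_similar is a hypothetical helper, not part of the original code):

import numpy as np

def top_k_similar(similarities, item, k=2):
    # one column of the sparse similarity matrix, as a dense vector
    col = similarities.getcol(item).toarray().ravel()
    col[item] = 0  # exclude the item itself
    top = np.argsort(col)[::-1][:k]
    return [(i, col[i]) for i in top if col[i] > 0]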
Jaccard Similarity
The Jaccard similarity of two sets A and B is defined as

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

The second form is the one the code computes.
Assume that mat is a binary (0 or 1) matrix and its type is scipy.sparse.csc_matrix.
import numpy as np

def jaccard_similarities(mat):
    cols_sum = mat.getnnz(axis=0)  # |A|: nonzero count of each column
    ab = mat.T * mat               # ab[i, j] = |A_i ∩ A_j| for binary input

    # |A_i| and |A_j| for every stored entry (i, j) of ab
    aa = np.repeat(cols_sum, ab.getnnz(axis=0))
    bb = cols_sum[ab.indices]

    similarities = ab.copy()
    # J = |A ∩ B| / (|A| + |B| - |A ∩ B|)
    similarities.data /= (aa + bb - ab.data)
    return similarities
The Jaccard implementation is a bit more involved. To eliminate Python loops, the code manipulates the raw representation of scipy.sparse.csc_matrix (its data and indices arrays) directly.
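For reference, this is what that raw representation looks like on a tiny matrix (the expected values are shown as comments):

import scipy.sparse as sp

m = sp.csc_matrix([[1, 0, 1],
                   [0, 1, 1]])
print(m.data)     # [1 1 1 1]  stored values, column by column
print(m.indices)  # [0 1 0 1]  row index of each stored value
print(m.indptr)   # [0 1 2 4]  data[indptr[j]:indptr[j+1]] belongs to column j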
In [5]: mat = sp.rand(5, 4, 0.2, format='csc')

In [6]: mat.data[:] = 1  # binarize

In [7]: print(mat.toarray())
[[ 1.  1.  0.  0.]
 [ 0.  1.  0.  0.]
 [ 1.  0.  1.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  1.  0.]]

In [8]: print(jaccard_similarities(mat).toarray())
[[ 1.          0.33333333  0.33333333  0.        ]
 [ 0.33333333  1.          0.          0.        ]
 [ 0.33333333  0.          1.          0.        ]
 [ 0.          0.          0.          0.        ]]
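The result can be brute-force checked against a set-based computation; a small sketch, assuming mat and jaccard_similarities from above:

import numpy as np

# column j as the set of row indices where it is nonzero
cols = [set(mat.getcol(j).indices) for j in range(mat.shape[1])]
sims = jaccard_similarities(mat).toarray()

for a, sa in enumerate(cols):
    for b, sb in enumerate(cols):
        if sa & sb:  # pairs with empty intersection stay 0 in sims
            assert np.isclose(sims[a, b], len(sa & sb) / len(sa | sb))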
Performance
Neither implementation contains Python-level loops, so the main computation runs in native NumPy/SciPy code.
Even when the matrix holds a million stored elements, each computation finishes in about a second on my laptop.
In [9]: mat = sp.rand(10**5, 10**4, 0.001, format='csc')

In [10]: mat
Out[10]:
<100000x10000 sparse matrix of type '<class 'numpy.float64'>'
        with 1000000 stored elements in Compressed Sparse Column format>

In [11]: %time cosine_similarities(mat)
CPU times: user 535 ms, sys: 91.1 ms, total: 626 ms
Wall time: 647 ms

In [12]: mat.data[:] = 1  # binarize

In [13]: %time jaccard_similarities(mat)
CPU times: user 857 ms, sys: 203 ms, total: 1.06 s
Wall time: 1.14 s