Scipy sparse... arrays?

So, I'm doing some k-means clustering using numpy arrays that are quite sparse -- lots and lots of zeroes. I figured I'd use scipy's 'sparse' package to reduce the storage overhead, but I'm a little confused about how to create arrays, not matrices.
I've gone through this tutorial on how to create sparse matrices:
http://www.scipy.org/SciPy_Tutorial#head-c60163f2fd2bab79edd94be43682414f18b90df7
To mimic an array, I just create a 1xN matrix, but as you may guess, Asp.dot(Bsp) doesn't quite work because you can't multiply two 1xN matrices. I'd have to transpose each array to Nx1, and that's pretty lame, since I'd be doing it for every dot-product calculation.
Next up, I tried to create an NxN matrix where column 1 == row 1 (such that you can multiply two matrices and just take the top-left corner as the dot product), but that turned out to be really inefficient.
I'd love to use scipy's sparse package as a magic replacement for numpy's array(), but as yet, I'm not really sure what to do.
Any advice?

Use a scipy.sparse format that is row or column based: csc_matrix and csr_matrix.
These use efficient, C implementations under the hood (including multiplication), and transposition is a no-op (esp. if you call transpose(copy=False)), just like with numpy arrays.
EDIT: some timings via ipython:
import numpy, scipy.sparse
n = 100000
x = (numpy.random.rand(n) * 2).astype(int).astype(float) # 50% sparse vector
x_csr = scipy.sparse.csr_matrix(x)
x_dok = scipy.sparse.dok_matrix(x.reshape(x_csr.shape))
Now x_csr and x_dok are 50% sparse:
print repr(x_csr)
<1x100000 sparse matrix of type '<type 'numpy.float64'>'
with 49757 stored elements in Compressed Sparse Row format>
And the timings:
timeit numpy.dot(x, x)
10000 loops, best of 3: 123 us per loop
timeit x_dok * x_dok.T
1 loops, best of 3: 1.73 s per loop
timeit x_csr.multiply(x_csr).sum()
1000 loops, best of 3: 1.64 ms per loop
timeit x_csr * x_csr.T
100 loops, best of 3: 3.62 ms per loop
So it looks like I told a lie. Transposition is very cheap, but there is no efficient C implementation of csr * csc (in the latest scipy 0.9.0). A new csr object is constructed in each call :-(
As a hack (though scipy is relatively stable these days), you can do the dot product directly on the sparse data:
timeit numpy.dot(x_csr.data, x_csr.data)
10000 loops, best of 3: 62.9 us per loop
Note that this last approach does a dense numpy multiplication again, but only over the stored (nonzero) elements. The sparsity is 50%, so it has half as many values to touch, which is why it's faster than dot(x, x) by a factor of 2. Note also that it only gives the true dot product for a vector with itself (or two vectors with identical sparsity patterns), since the data arrays must line up element for element.

You could create a subclass of one of the existing 2d sparse matrix classes:
from scipy.sparse import dok_matrix

class sparse1d(dok_matrix):
    def __init__(self, v):
        dok_matrix.__init__(self, (v,))
    def dot(self, other):
        return dok_matrix.dot(self, other.transpose())[0,0]

a = sparse1d((1,2,3))
b = sparse1d((4,5,6))
print a.dot(b)

I'm not sure that it is really much better or faster, but you could do this to avoid using the transpose:
Asp.multiply(Bsp).sum()
This just takes the element-by-element product of the two matrices and sums the products. You could make a subclass of whatever matrix format you are using that has the above statement as its dot product.
However, it is probably just easier to transpose them:
Asp*Bsp.T
That doesn't seem like so much to do, but you could also make a subclass and modify the __mul__() method.
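As a quick sanity check that the two expressions agree (a sketch of mine with made-up 1x4 vectors, not from the original answer):

import numpy as np
import scipy.sparse as sp

Asp = sp.csr_matrix(np.array([[1., 0., 2., 0.]]))
Bsp = sp.csr_matrix(np.array([[0., 3., 4., 0.]]))

print(Asp.multiply(Bsp).sum())  # 8.0 -- elementwise product, summed
print((Asp * Bsp.T)[0, 0])      # 8.0 -- 1x1 matrix-product result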

Related

Vectorisation of coordinate distances in Numpy

I'm trying to understand Numpy by applying vectorisation, and I want to find the fastest function for computing the pairwise coordinate distances.
def get_distances3(coordinates):
    return np.linalg.norm(
        coordinates[:, None, :] - coordinates[None, :, :],
        axis=-1)
coordinates = np.random.rand(1000, 3)
%timeit get_distances3(coordinates)
The function above took 10 loops, best of 3: 35.4 ms per loop. The numpy library also has an np.vectorize option, which I tried to use to do the same thing.
def get_distances4(coordinates):
    return np.vectorize(coordinates[:, None, :] - coordinates[None, :, :], axis=-1)
%timeit get_distances4(coordinates)
Trying np.vectorize this way, I ended up with the following error:
TypeError: __init__() got an unexpected keyword argument 'axis'
How can I vectorize get_distances4? How should I edit the last code block in order to avoid the error? I have never used np.vectorize, so I might be missing something.
You're not calling np.vectorize() correctly. I suggest referring to the documentation.
np.vectorize takes as its argument a function that is written to operate on scalar values, and converts it into a function that can be vectorized over values in arrays according to the Numpy broadcasting rules. It's basically like a fancy map() for Numpy arrays.
i.e. as you know Numpy already has built-in vectorized versions of many common functions, but if you had some custom function like "my_special_function(x)" and you wanted to be able to call it on Numpy arrays, you could use my_special_function_ufunc = np.vectorize(my_special_function).
In your above example you might "vectorize" your distance function like:
>>> norm = np.linalg.norm
>>> get_distance4 = np.vectorize(lambda a, b: norm(a - b))
>>> get_distance4(coordinates[:, None, :], coordinates[None, :, :])
However, you will find that this is incredibly slow:
>>> %timeit get_distance4(coordinates[:, None, :], coordinates[None, :, :])
1 loop, best of 3: 10.8 s per loop
This is because your first example get_distances3 is already using Numpy's built-in fast implementations of these operations, whereas the np.vectorize version requires calling the Python function I defined some 3 million times (once per broadcast element).
In fact according to the docs:
The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
If you want a potentially faster function for computing distances between vectors you could use scipy.spatial.distance.pdist:
>>> %timeit get_distances3(coordinates)
10 loops, best of 3: 24.2 ms per loop
>>> %timeit distance.pdist(coordinates)
1000 loops, best of 3: 1.77 ms per loop
It's worth noting that this has a different return format. Rather than a 1000x1000 array it uses a condensed format that excludes the i = j entries and the i > j entries. If you wish, you can then use scipy.spatial.distance.squareform to convert back to the square matrix format.
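For example, a minimal sketch of the round trip through the condensed format:

import numpy as np
from scipy.spatial import distance

coordinates = np.random.rand(1000, 3)
condensed = distance.pdist(coordinates)  # shape (1000*999/2,)
square = distance.squareform(condensed)  # back to shape (1000, 1000)

# agrees with the broadcasting version
full = np.linalg.norm(coordinates[:, None, :] - coordinates[None, :, :], axis=-1)
print(np.allclose(square, full))  # True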

Slow random sample generation without replacement in scipy

I am trying to create a sparse matrix representation of a random hash map h: [n] -> [d] which maps each i to exactly s random locations out of d available locations, with the values at those locations drawn from some discrete distribution.
:param d: number of bins
:param n: number of items hashed
:param s: sparsity of each column
:param distribution: distribution object.
Here is my attempt:
start_time = time.time()
distribution = scipy.stats.rv_discrete(values=([-1.0, +1.0], [0.5, 0.5]), name='dist')
data = (1.0/sqrt(self._s))*distribution.rvs(size=self._n*self._s)
col = numpy.empty(self._s*self._n)
for i in range(self._n):
    col[i*self._s:(i+1)*self._s] = i
row = numpy.empty(self._s*self._n)
print time.time()-start_time
for i in range(self._n):
    row[i*self._s:(i+1)*self._s] = numpy.random.choice(self._d, self._s, replace=False)
S = scipy.sparse.csr_matrix((data, (row, col)), shape=(self._d, self._n))
print time.time()-start_time
return S
Now, creating this map with n=500000, s=10, d=1000 takes around 20s on my decent workstation, of which 90% of the time is consumed in generating the row indices. Is there anything I can do to speed this up? Any alternatives? Thanks.
col = numpy.empty(self._s*self._n)
for i in range(self._n):
    col[i*self._s:(i+1)*self._s] = i
looks like something that could be written as one non-looping expression; though it probably isn't a big time consumer
My first guess at a vectorized equivalent (I'd need to play with this to be sure) assigns each column index to its own block of s slots:
col = np.empty((self._n, self._s))
col[:, :] = np.arange(self._n)[:, None]
col = col.ravel()
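For reference, np.repeat gives the same layout as the loop in one line (my addition, not part of the original answer):

col = np.repeat(np.arange(self._n), self._s)  # [0,0,...,0,1,1,...,1,...]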
Something similar for:
for i in range(self._n):
    row[i*self._s:(i+1)*self._s] = numpy.random.choice(self._d, self._s, replace=False)
is, I think, picking _s values from _d, _n times over. Doing the no-replace sampling along _s while allowing replacement across _n could be tricky to vectorize.
Without running the code myself (with smaller n) I'm stumbling around a bit. Which is the slow part, generating col, row, or the final csr? Iteration on n=500000 is going to be slow.
The matrix will be (1000, 500000), but with (10*500000) nonzero items, so a density of .01. Just for comparison it would be interesting to generate a sparse random matrix of similar size and density:
In [5]: %timeit sparse.random(1000, 500000, .01)
1 loop, best of 3: 24.6 s per loop
and the dense random choices:
In [8]: timeit np.random.choice(1000,(10,500000)).shape
10 loops, best of 3: 53 ms per loop
In [9]: np.array([np.random.choice(1000,(10,)) for i in range(500000)]).shape
Out[9]: (500000, 10)
In [10]: timeit np.array([np.random.choice(1000,(10,)) for i in range(500000)]).shape
1 loop, best of 3: 12.7 s per loop
So, yes, the large iteration loop is expensive. But given the replacement policy there might not be a way around that. Or is there?
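One possible way around it (a sketch of my own, not from the original answer; the helper name and chunk size are made up): draw i.i.d. random keys and take the indices of the s smallest per row with argpartition, which yields a uniform sample without replacement, chunking so the key array doesn't blow up memory.

import numpy as np

def batch_choice_no_replace(n, d, s, chunk=10000):
    # For each of n rows, pick s distinct values from range(d):
    # the indices of the s smallest i.i.d. random keys in a row
    # are a uniform without-replacement sample.
    out = np.empty((n, s), dtype=np.int64)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        keys = np.random.rand(stop - start, d)
        out[start:stop] = np.argpartition(keys, s, axis=1)[:, :s]
    return out

row = batch_choice_no_replace(500000, 1000, 10).ravel()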
So as a first guess, creating row takes half the time, and creating the sparse matrix the other half. I'm not surprised. You are using the coo style of input, which requires lexsorting and summing for duplicates when converting to csr. We might be able to gain speed by using the indptr style of input. There won't be duplicates to sum. And since there are consistently 10 nonzero terms per column (oops, I first said row; with shape (1000, 500000) it's the columns, the transpose), generating the indptr values won't be hard. But I can't do that off the top of my head; a rough attempt is sketched after the timing below.
random sparse to csr is just a bit slower:
In [11]: %timeit sparse.random(1000, 500000, .01, 'csr')
1 loop, best of 3: 28.3 s per loop
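Here is a rough sketch of that indptr-style construction (my own guess at the idea above; the row indices are drawn with a plain choice() just to keep the sketch self-contained, so unlike the real no-replacement draws they may repeat within a column):

import numpy as np
import scipy.sparse

n, d, s = 500000, 1000, 10

data = np.random.choice([-1.0, 1.0], size=n*s) / np.sqrt(s)
indices = np.random.choice(d, size=n*s)  # stand-in for the no-replacement draws
indptr = np.arange(0, n*s + 1, s)        # exactly s stored entries per column

# csc input: no lexsort or duplicate-summing pass is needed
S = scipy.sparse.csc_matrix((data, indices, indptr), shape=(d, n))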

How does numpy broadcasting perform faster?

In the following question,
https://stackoverflow.com/a/40056135/5714445
Numpy's broadcasting provides a solution that's almost 6x faster than using np.setdiff1d() paired with np.view(). How does it manage to do this?
And using A[~((A[:,None,:] == B).all(-1)).any(1)] speeds it up even more.
Interesting, but raises yet another question. How does this perform even better?
I would try to answer the second part of the question.
So, with it we are comparing:
A[np.all(np.any((A-B[:, None]), axis=2), axis=0)] (I)
and
A[~((A[:,None,:] == B).all(-1)).any(1)]
To compare with a matching perspective against the first one, we could write down the second approach like this -
A[(((~(A[:,None,:] == B)).any(2))).all(1)] (II)
The major difference, when considering performance, is that with the first one we are getting non-matches with subtraction and then checking for non-zeros with .any(). Thus, any() is made to operate on an array of non-boolean dtype. In the second approach, we instead feed it a boolean array obtained with A[:,None,:] == B.
Let's do a small runtime test to see how .any() performs on int dtype vs boolean array -
In [141]: A = np.random.randint(0,9,(1000,1000)) # An int array
In [142]: %timeit A.any(0)
1000 loops, best of 3: 1.43 ms per loop
In [143]: A = np.random.randint(0,9,(1000,1000))>5 # A boolean array
In [144]: %timeit A.any(0)
10000 loops, best of 3: 164 µs per loop
So, with close to a 9x speedup on this part, we see a huge advantage to using any() with boolean arrays. This, I think, is the biggest reason the second approach is faster.

How to store array of really huge numbers in python?

I am working with huge numbers, such as 150!. Calculating the result is not a problem; for example
f = factorial(150) is
57133839564458545904789328652610540031895535786011264182548375833179829124845398393126574488675311145377107878746854204162666250198684504466355949195922066574942592095735778929325357290444962472405416790722118445437122269675520000000000000000000000000000000000000.
But I also need to store an array with N of those huge numbers, in full precision. A Python list can store them, but it is slow. A numpy array is fast, but cannot handle the full precision, which is required for some operations I perform later; as I have tested, a number in scientific notation (float) does not produce the accurate result.
Edit:
150! is just an example of a huge number; it does not mean I am working only with factorials. Also, the full set of numbers (NOT always the result of a factorial) changes over time, and I need to update them and re-evaluate a function for which those numbers are a parameter; and yes, full precision is required.
numpy arrays are very fast when they can internally work with a simple data type that can be directly manipulated by the processor. Since there is no simple, native data type that can store huge numbers, they are converted to a float. numpy can be told to work with Python objects but then it will be slower.
Here are some times on my computer. First the setup.
a is a Python list containing the first 50 factorials. b is a numpy array with all the values converted to float64. c is a numpy array storing Python objects.
import numpy as np
import math
a=[math.factorial(n) for n in range(50)]
b=np.array(a, dtype=np.float64)
c=np.array(a, dtype=np.object)
a[30]
265252859812191058636308480000000L
b[30]
2.6525285981219107e+32
c[30]
265252859812191058636308480000000L
Now to measure indexing.
%timeit a[30]
10000000 loops, best of 3: 34.9 ns per loop
%timeit b[30]
1000000 loops, best of 3: 111 ns per loop
%timeit c[30]
10000000 loops, best of 3: 51.4 ns per loop
Indexing into a Python list is fastest, followed by extracting a Python object from a numpy array, and slowest is extracting a 64-bit float from an optimized numpy array.
Now let's measure multiplying each element by 2.
%timeit [n*2 for n in a]
100000 loops, best of 3: 4.73 µs per loop
%timeit b*2
100000 loops, best of 3: 2.76 µs per loop
%timeit c*2
100000 loops, best of 3: 7.24 µs per loop
Since b*2 can take advantage of numpy's optimized array, it is the fastest. The Python list takes second place. And a numpy array using Python objects is the slowest.
At least with the tests I ran, indexing into a Python list doesn't seem slow. What is slow for you?
Store it as tuples of prime factors and their powers. A factorization of a factorial (of, let's say, N) will contain ALL primes less than N, so the k'th place in each tuple will be the k'th prime. And you'll want to keep a separate list of all the primes you've found. You can easily store factorials as high as a few hundred thousand in this notation. If you really need the digits, you can easily restore them from this representation (just ignore the power of 5 and subtract the power of 5 from the power of 2 when you multiply the factors back together -- that gives the trailing zeros directly, because 5*2=10).
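A minimal sketch of building that representation for factorials, using Legendre's formula (the exponent of a prime p in N! is the sum of floor(N/p^i) over i >= 1); the helper names are my own:

def primes_upto(n):
    # simple sieve of Eratosthenes
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p*p::p] = [False] * len(sieve[p*p::p])
    return [p for p, is_prime in enumerate(sieve) if is_prime]

def factorial_exponents(n):
    # exponent of each prime p <= n in n!, via Legendre's formula
    exps = []
    for p in primes_upto(n):
        e, q = 0, p
        while q <= n:
            e += n // q
            q *= p
        exps.append(e)
    return exps

print(factorial_exponents(10))  # [8, 4, 2, 1], i.e. 10! = 2**8 * 3**4 * 5**2 * 7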
If you need the exact value of a factorial later on, why don't you save in an array not the result but the number you want to 'factorialize'?
E.G.
You have f = factorial(150)
and you have the result 57133839564458545904789328652610540031895535786011264182548375833179829124845398393126574488675311145377107878746854204162666250198684504466355949195922066574942592095735778929325357290444962472405416790722118445437122269675520000000000000000000000000000000000000
But you can simply:
def values():
    to_factorial_list = []
    ...
    to_factorial_list.append(values_you_want_to_factorialize)
    return to_factorial_list

def setToFactorial(number):
    return factorial(number)

print setToFactorial(values()[302])
EDIT:
Fair enough; then my advice is to combine the logic I suggested with getsizeof(number): work with two arrays, one to store the small 'factorialized' numbers and another to store the big ones, splitting e.g. when getsizeof(number) exceeds some size.
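A rough sketch of that split (the 64-byte cutoff is an arbitrary choice of mine for illustration):

import sys
from math import factorial

SIZE_LIMIT = 64  # bytes; arbitrary cutoff for this sketch

small, big = [], []
for n in (5, 10, 20, 150, 1000):
    value = factorial(n)
    if sys.getsizeof(value) > SIZE_LIMIT:
        big.append(value)
    else:
        small.append(value)

print(len(small), len(big))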

Why is numpy.dot much faster than numpy.einsum?

I have numpy compiled with OpenBLAS and I am wondering why einsum is much slower than dot (I understand it in the 3-indices case, but I don't understand why it is also less performant in the two-indices case). Here is an example:
import numpy as np
A = np.random.random([1000,1000])
B = np.random.random([1000,1000])
%timeit np.dot(A,B)
Out: 10 loops, best of 3: 26.3 ms per loop
%timeit np.einsum("ij,jk",A,B)
Out: 5 loops, best of 3: 477 ms per loop
Is there a way to let einsum use OpenBlas and parallelization like numpy.dot?
Why does np.einsum not just call np.dot if it notices a dot product?
einsum parses the index string, and then constructs an nditer object, and uses that to perform a sum-of-products iteration. It has special cases where the indexes just perform axis swaps, and sums ('ii->i'). It may also have special cases for 2 and 3 variables (as opposed to more). But it does not make any attempt to invoke external libraries.
I worked out a pure Python work-alike, but with more focus on the parsing than on the calculation special cases.
tensordot reshapes and swaps axes, so it can then call dot to do the actual calculations.
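A quick illustration of that point (a sketch; exact timings will vary with machine and BLAS build): since tensordot funnels the contraction into a single dot call, it runs at np.dot speed rather than einsum's nditer-loop speed.

import numpy as np

A = np.random.random((1000, 1000))
B = np.random.random((1000, 1000))

C1 = np.dot(A, B)
C2 = np.tensordot(A, B, axes=([1], [0]))  # same contraction as "ij,jk"
print(np.allclose(C1, C2))  # True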
