I have a 3D numpy array, arr, with shape m*n*k.
For every set of values along the m axis (e.g. arr[:, 0, 0]), I want to generate a single value to represent that set, so that I end up with a 2D matrix of shape n*k.
If a set of values along the m axis is repeated, then we should generate the same value each time.
I.e. it is a hashing problem.
I created a solution to the problem using a dictionary, but it drastically reduces performance. For each set of values, I call this function:
def getCellId(self, valueSet):
    # Turn the set of values (a numpy vector) into a tuple so it can be hashed
    key = tuple(valueSet)
    # Try to simply return an existing ID for this key
    try:
        return self.attributeDict[key]
    except KeyError:
        # If the key is new, generate a new ID by adding one to the max of all
        # current IDs. This will fail the very first time (when there are no
        # IDs yet), so in that case just assign the value 1 to newId.
        try:
            newId = max(self.attributeDict.values()) + 1
        except ValueError:
            newId = 1
        self.attributeDict[key] = newId
        return newId
The array itself is typically of the size 30*256*256, so a single set of values will have 30 values.
I have hundreds of these arrays to process at any one time.
Currently, doing all processing that needs to be done up to calculating the hash
takes 1.3s for a block of 100 arrays.
Including the hashing bumps that up to 75s.
Is there a faster way to generate the single representative value?
This could be one approach using basic numpy functions -
import numpy as np
# Random input for demo
arr = np.random.randint(0,3,[2,5,4])
# Get dimensions for later usage
m,n,k = arr.shape
# Reshape arr to a 2D array that has each slice arr[:, i, j] as a row
arr2d = np.transpose(arr,(1,2,0)).reshape([-1,m])
# Perform lexsort & get corresponding indices and sorted array
sorted_idx = np.lexsort(arr2d.T)
sorted_arr2d = arr2d[sorted_idx,:]
# Differentiation along rows for sorted array
df1 = np.diff(sorted_arr2d,axis=0)
# Look for changes along df1 that represent new labels to be put there
df2 = np.append([False],np.any(df1!=0,1),0)
# Get unique labels
labels = df2.cumsum(0)
# Store those unique labels in a n x k shaped 2D array
pos_labels = np.zeros_like(labels)
pos_labels[sorted_idx] = labels
out = pos_labels.reshape([n,k])
Sample run -
In [216]: arr
Out[216]:
array([[[2, 1, 2, 1],
[1, 0, 2, 1],
[2, 0, 1, 1],
[0, 0, 1, 1],
[1, 0, 0, 2]],
[[2, 1, 2, 2],
[0, 0, 2, 1],
[2, 1, 0, 0],
[1, 0, 1, 0],
[0, 1, 1, 0]]])
In [217]: out
Out[217]:
array([[6, 4, 6, 5],
[1, 0, 6, 4],
[6, 3, 1, 1],
[3, 0, 4, 1],
[1, 3, 3, 2]], dtype=int32)
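If this needs to run over the hundreds of arrays mentioned in the question, the steps above could be wrapped into one reusable function. A minimal sketch of that wrapper, using the same lexsort logic (labelize is just an illustrative name):
import numpy as np

def labelize(arr):
    # Assign one integer label per (row, col) position, equal for equal m-series
    m, n, k = arr.shape
    arr2d = np.transpose(arr, (1, 2, 0)).reshape([-1, m])
    sorted_idx = np.lexsort(arr2d.T)
    sorted_arr2d = arr2d[sorted_idx, :]
    df2 = np.append([False], np.any(np.diff(sorted_arr2d, axis=0) != 0, 1), 0)
    labels = df2.cumsum(0)
    pos_labels = np.zeros_like(labels)
    pos_labels[sorted_idx] = labels
    return pos_labels.reshape([n, k])

out = labelize(np.random.randint(0, 3, [30, 256, 256]))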
Depending on how many new keys vs. old keys need to be generated, it's hard to say what will be optimal. But using your logic, the following should be fairly fast:
import collections
import hashlib
_key = 0
def _get_new_key():
    global _key
    _key += 1
    return _key

attributes = collections.defaultdict(_get_new_key)

def get_cell_id(series):
    global attributes
    return attributes[hashlib.md5(series.tostring()).digest()]
Edit:
I have now updated this to loop over all the data series from your question, using strides:
In [99]: import numpy as np
In [100]: A = np.random.random((30, 256, 256))
In [101]: A_strided = np.lib.stride_tricks.as_strided(A, (A.shape[1] * A.shape[2], A.shape[0]), (A.itemsize, A.itemsize * A.shape[1] * A.shape[2]))
In [102]: %timeit tuple(get_cell_id(S) for S in A_strided)
10 loops, best of 3: 169 ms per loop
The above does 256x256 lookups/assignments of 30-element arrays each.
There is of course no guarantee that the md5 hash won't collide. If that should be an issue, you could of course change to other hashes in the same lib.
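For instance, a minimal sketch of the same cached lookup keyed on SHA-256 instead of MD5 (reusing the attributes defaultdict defined above; any hashlib algorithm with a digest() method slots in the same way):
import hashlib

def get_cell_id_sha(series):
    # Same caching scheme as get_cell_id, only the hash function differs
    return attributes[hashlib.sha256(series.tobytes()).digest()]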
Edit 2:
Given that you seem to do the majority of costly operations on the first axis of your 3D array, I would suggest you reorganize your array:
In [254]: A2 = np.random.random((256, 256, 30))
In [255]: A2_strided = np.lib.stride_tricks.as_strided(A2, (A2.shape[0] * A2.shape[1], A2.shape[2]), (A2.itemsize * A2.shape[2], A2.itemsize))
In [256]: %timeit tuple(get_cell_id(S) for S in A2_strided)
10 loops, best of 3: 126 ms per loop
Not having to jump around long distances in memory makes for about a 25% speed-up.
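If the data arrives in the original (30, 256, 256) layout, one way to get it into this friendlier memory order is a sketch like the following; note that it makes a copy, so there is a one-off cost:
import numpy as np

A = np.random.random((30, 256, 256))
# Move the series axis to the end and force a contiguous copy,
# so each 30-element series sits consecutively in memory
A2 = np.ascontiguousarray(np.moveaxis(A, 0, -1))  # shape (256, 256, 30)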
Edit 3:
If there is no actual need to cache a hash-to-int lookup and you just need the hashes themselves, and if the 3D array is of int8 type, then given the A2 and A2_strided organization, the time can be reduced some more. Of this, about 15 ms is the tuple looping.
In [9]: from hashlib import md5
In [10]: %timeit tuple(md5(series.tostring()).digest() for series in A2_strided)
10 loops, best of 3: 72.2 ms per loop
If it is just about hashing, try this:
import numpy as np
import numpy.random
# create random data
a = numpy.random.randint(10,size=(5,3,3))
# create some identical 0-axis data
a[:,0,0] = np.arange(5)
a[:,0,1] = np.arange(5)
# create matrix with the hash values
h = np.apply_along_axis(lambda x: hash(tuple(x)),0,a)
h[0,0]==h[0,1]
# Output: True
However, use it with caution and test this code against your own first; all I can say is that it works for this simple example.
In addition, two different value sets could in principle end up with the same hash value. That is an issue which can always happen when using a hash function, but such collisions are very unlikely.
Edit: In order to compare with the other solutions
%timeit np.apply_along_axis(lambda x: hash(tuple(x)), 0, a)
# output: 1 loops, best of 3: 677 ms per loop
Say I have an array:
x = np.array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
And a multi-labeled mask:
labels = np.array([[0, 0, 2],
[1, 1, 2],
[1, 1, 2]])
My goal is to sum the entries of x together, grouped by labels. For example:
n_labels = np.max(labels) + 1
out = np.empty(n_labels)
for label in range(n_labels):
    mask = labels == label
    out[label] = np.sum(x[mask])
>>> out
np.array([1, 20, 15])
However, as x and n_labels become large, I see this being inefficient. Each iteration, you are only summing together a small fraction of the number of entries of x, but still recompute the mask over all of labels (in the expression labels == label) and subsequently index over all of x (in the expression x[mask]). Is there a more efficient way to do this as x and n_labels grow large?
You can use bincount with weights:
np.bincount(labels.ravel(), weights=x.ravel())
out:
array([ 1., 20., 15.])
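One hedged detail: if the largest label values can be entirely absent from labels, bincount's result would come out shorter than n_labels; the minlength argument pads it with zeros:
import numpy as np

x = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
labels = np.array([[0, 0, 2], [1, 1, 2], [1, 1, 2]])
n_labels = 4  # hypothetical: label 3 exists in the data set but not in this block
np.bincount(labels.ravel(), weights=x.ravel(), minlength=n_labels)
# array([ 1., 20., 15.,  0.])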
You don't really have a reason to operate on 2D arrays, so ravel them first:
labels = labels.ravel()
x = x.ravel()
If your labels are already indices, you can use np.argsort along with np.diff and np.add.reduceat:
index = labels.argsort()
splits = np.r_[0, np.flatnonzero(np.diff(labels[index])) + 1]
result = np.add.reduceat(x[index], splits)
labels[index] gives the labels in sorted order. Whenever that sequence changes, you enter a new group, and the diff is nonzero. That's what np.flatnonzero(np.diff(labels[index])) finds for you. Since the diff flags the last element of a run while reduceat wants the index where the next run starts, you need to add one. np.r_ allows you to prepend zero easily to a 1D array, which is necessary for reduceat to regard the first run (the last is always automatic).
Before you run reduceat, you need to order x into the runs defined by labels, which is what x[index] does.
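For example, running this on the (raveled) arrays from the question reproduces the expected sums; a quick sanity check, not a benchmark:
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])       # x.ravel() from the question
labels = np.array([0, 0, 2, 1, 1, 2, 1, 1, 2])  # labels.ravel()
index = labels.argsort()
splits = np.r_[0, np.flatnonzero(np.diff(labels[index])) + 1]
np.add.reduceat(x[index], splits)
# array([ 1, 20, 15])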
Another approach, slower and a bit over-engineered but working directly on the 2D arrays, uses np.add.at:
sums = np.zeros(labels.max()+1, x.dtype)
np.add.at(sums, labels, x)
sums
Output
array([ 1, 20, 15])
Given an x-dataset,
x = np.array([1, 2, 3, 4, 5])
what is the most efficient way to create the NumPy array where each x coordinate is paired with a y coordinate of value 0? Specifically, I am wondering if there is a way that doesn't require any hard-coding, so that x could vary in length without causing failure.
As per your problem statement, the following is one way to do it.
# initialize an array of zeros
In [36]: res = np.zeros((2, *x.shape), dtype=x.dtype)
# fill `x` as first row
In [37]: res[0] = x
In [38]: res
Out[38]:
array([[1, 2, 3, 4, 5],
       [0, 0, 0, 0, 0]])
When we initialize the array of zeros, we use 2 for the axis-0 dimension since the requirement is to create a 2D array, and for the column size we simply take the length from the x array. For reasonably large arrays, this approach would be the fastest.
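An alternative sketch that avoids the explicit fill step is to stack x with a matching row of zeros; np.stack infers the column size from x, so nothing is hard-coded:
import numpy as np

x = np.array([1, 2, 3, 4, 5])
res = np.stack([x, np.zeros_like(x)])
# array([[1, 2, 3, 4, 5],
#        [0, 0, 0, 0, 0]])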
I'm looking to implement a hardware-efficient multiplication of a list of large matrices (on the order of 200,000 x 200,000). The matrices are very nearly the identity matrix, but with some elements changed to irrational numbers.
In an effort to reduce the memory footprint and make the computation go faster, I want to store the 0s and 1s of the identity as single bytes like so.
import numpy as np
size = 200000
large_matrix = np.identity(size, dtype=np.uint8)
and just change a few elements to a different data type.
import sympy as sp
# sympy object
irr1 = sp.sqrt(2)
# float
irr2 = np.e
large_matrix[123456, 100456] = irr1
large_matrix[100456, 123456] = irr2
Is it possible to hold only these elements of the matrix with a different data type, while all the other elements are still bytes? I don't want to have to change everything to a float just because I need one element to be a float.
-----Edit-----
If it's not possible in numpy, then how can I find a solution without numpy?
Maybe you can have a look at SciPy's coordinate-based sparse matrix (coo_matrix). In that case SciPy creates a sparse matrix (optimized for such large, mostly empty matrices), and with its coordinate format you can access and modify the data as you intend.
From its documentation:
>>> from scipy.sparse import coo_matrix
>>> # Constructing a matrix using ijv format
>>> row = np.array([0, 3, 1, 0])
>>> col = np.array([0, 3, 1, 2])
>>> data = np.array([4, 5, 7, 9])
>>> m = coo_matrix((data, (row, col)), shape=(4, 4))
>>> m.toarray()
array([[4, 0, 9, 0],
[0, 7, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 5]])
It does not create a matrix but a set of coordinates with values, which takes much less space than just filling a matrix with zeros.
>>> from sys import getsizeof
>>> getsizeof(m)
56
>>> getsizeof(m.toarray())
176
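Applied to the near-identity matrix from the question, a minimal sketch could look like this (LIL format is used because it supports cheap element assignment; note the irrational entries still have to be stored as floats, since a sparse matrix also has a single dtype):
import numpy as np
from scipy import sparse

size = 200000
# Sparse identity: only the diagonal is stored, not size*size cells
large_matrix = sparse.identity(size, dtype=np.float64, format='lil')
# Overwrite the handful of special entries, coerced to float
large_matrix[123456, 100456] = np.sqrt(2)
large_matrix[100456, 123456] = np.e
# Convert to CSR before doing heavy arithmetic
large_matrix = large_matrix.tocsr()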
By definition, NumPy arrays have only one dtype. You can see this in the NumPy documentation:
A numpy array is homogeneous, and contains elements described by a dtype object. A dtype object can be constructed from different combinations of fundamental numeric types.
Further reading: https://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html
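For completeness, a small sketch of the one escape hatch: dtype=object lets an array hold arbitrary Python objects (including SymPy expressions), but every cell then stores a boxed pointer, which defeats exactly the memory savings the question is after, so it is not practical at 200,000 x 200,000:
import numpy as np
import sympy as sp

a = np.identity(4, dtype=object)  # each cell is a reference to a Python object
a[1, 2] = sp.sqrt(2)              # exact SymPy value, no coercion to float
a[2, 1] = np.e
print(a.dtype)                    # object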
I am trying to implement K-means with selective centroid selection. I have two numpy arrays: one called "features", a set of numpy arrays where each array is a data point, and another called "labels", where labels[i] is the class the data point at index i belongs to. I have data points from 4 different classes. What I want to do is use both these arrays to randomly pick one data point from each class. Could you please help me out with this? Also, is there any way to zip two numpy arrays into a dictionary?
For example, I have the features array:
[[1,1,1],[1,2,3],[1,6,7],[1,4,6],[1,6,9],[1,4,2]]
and my labels array is
[1,2,2,3,1,3]
For each unique value in the labels numpy array, I want one randomly chosen corresponding element from the features array. A sample answer would be:
[1,1,1] from class 1
[1,6,7] from class 2
[1,4,2] from class 3
Given this is the setup in your question:
import numpy as np
features = [[1,1,1],[1,2,3],[1,6,7],[1,4,6],[1,6,9],[1,4,2]]
labels = np.array([1,2,2,3,1,3])
This should get you a random variable from each label in dictionary form:
features_index = np.array(range(0, len(features)))
unique_labels = np.unique(labels)
rand = []
for n in unique_labels:
    rand.append(features[np.random.choice(features_index[labels == n])])
dict(zip(unique_labels, rand))
Try:
import numpy as np
features = np.array([[1,1,1],[1,2,3],[1,6,7],[1,4,6],[1,6,9],[1,4,2]])
labels = np.array([1,2,2,3,1,3])
res = {i: features[np.random.choice(np.where(labels == i)[0])] for i in set(labels)}
output
{1: array([1, 1, 1]), 2: array([1, 2, 3]), 3: array([1, 4, 2])}
You can accomplish this with a bit of indexing and numpy.unique
u = np.unique(labels)
idx = np.array([np.random.choice(np.flatnonzero(labels == lab)) for lab in u])
dict(zip(u, features[idx]))
{1: array([1, 6, 9]), 2: array([1, 2, 3]), 3: array([1, 4, 6])}
I wanted to repeat the rows of a scipy csr sparse matrix, but when I tried to call numpy's repeat method, it simply treats the sparse matrix like an object and would only repeat it as an object in an ndarray. I looked through the documentation, but I couldn't find any utility to repeat the rows of a scipy csr sparse matrix.
I wrote the following code that operates on the internal data, which seems to work
def csr_repeat(csr, repeats):
    if isinstance(repeats, int):
        repeats = np.repeat(repeats, csr.shape[0])
    repeats = np.asarray(repeats)
    rnnz = np.diff(csr.indptr)
    ndata = rnnz.dot(repeats)
    if ndata == 0:
        return sparse.csr_matrix((np.sum(repeats), csr.shape[1]),
                                 dtype=csr.dtype)
    indmap = np.ones(ndata, dtype=np.int64)
    indmap[0] = 0
    rnnz_ = np.repeat(rnnz, repeats)
    indptr_ = rnnz_.cumsum()
    mask = indptr_ < ndata
    indmap -= np.int_(np.bincount(indptr_[mask],
                                  weights=rnnz_[mask],
                                  minlength=ndata))
    jumps = (rnnz * repeats).cumsum()
    mask = jumps < ndata
    indmap += np.int_(np.bincount(jumps[mask],
                                  weights=rnnz[mask],
                                  minlength=ndata))
    indmap = indmap.cumsum()
    return sparse.csr_matrix((csr.data[indmap],
                              csr.indices[indmap],
                              np.r_[0, indptr_]),
                             shape=(np.sum(repeats), csr.shape[1]))
and be reasonably efficient, but I'd rather not monkey patch the class. Is there a better way to do this?
Edit
As I revisit this question, I wonder why I posted it in the first place. Almost everything I could think to do with the repeated matrix would be easier to do with the original matrix and then apply the repetition afterwards. My assumption is that applying the repetition after the fact will always be a better approach than any of the potential answers.
from scipy.sparse import csr_matrix
repeated_row_matrix = csr_matrix(np.ones([repeat_number,1])) * sparse_row
It's not surprising that np.repeat does not work. It delegates the action to the hardcoded a.repeat method, and failing that, first turns a into an array (object if needed).
In the linear algebra world where sparse code was developed, most of the assembly work was done on the row, col, data arrays BEFORE creating the sparse matrix. The focus was on efficient math operations, and not so much on adding/deleting/indexing rows and elements.
I haven't worked through your code, but I'm not surprised that a csr format matrix requires that much work.
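In that spirit, a minimal sketch of a row repeat assembled directly from the coo arrays (coo_repeat is just an illustrative name; it handles a uniform integer repeat only):
import numpy as np
from scipy import sparse

def coo_repeat(S, repeat):
    # Repeat every row of S `repeat` times by rebuilding from row/col/data
    C = S.tocoo()
    # Original row r maps to new rows r*repeat ... r*repeat + repeat - 1
    row = np.concatenate([C.row * repeat + i for i in range(repeat)])
    col = np.tile(C.col, repeat)
    data = np.tile(C.data, repeat)
    return sparse.coo_matrix((data, (row, col)),
                             shape=(C.shape[0] * repeat, C.shape[1])).tocsr()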
I worked out a similar function for the lil format (working from lil.copy):
def lil_repeat(S, repeat):
    # row repeat for lil sparse matrix
    # test for lil type and/or convert
    shape = list(S.shape)
    if isinstance(repeat, int):
        shape[0] = shape[0] * repeat
    else:
        shape[0] = sum(repeat)
    shape = tuple(shape)
    new = sparse.lil_matrix(shape, dtype=S.dtype)
    new.data = S.data.repeat(repeat)  # flat repeat
    new.rows = S.rows.repeat(repeat)
    return new
But it is also possible to repeat using indices. Both lil and csr support indexing that is close to that of regular numpy arrays (at least in new enough versions). Thus:
S = sparse.lil_matrix([[0,1,2],[0,0,0],[1,0,0]])
print(S.A.repeat([1,2,3], axis=0))
print(S.A[(0,1,1,2,2,2),:])
print(lil_repeat(S,[1,2,3]).A)
print(S[(0,1,1,2,2,2),:].A)
give the same result
and best of all?
print(S[np.arange(3).repeat([1,2,3]),:].A)
After someone posted a really clever response on how best to do this, I revisited my original question to see if there was an even better way. I came up with one more approach that has some pros and cons. Instead of repeating all of the data (as is done with the accepted answer), we can instead instruct scipy to reuse the data of the repeated rows, creating something akin to a view of the original sparse array (as you might do with broadcast_to). This can be done by simply tiling the indptr field.
repeated = sparse.csr_matrix((orig.data, orig.indices, np.tile(orig.indptr, repeat_num)))
This technique repeats the vector repeat_num times while only modifying the indptr. The downside is that, due to the way csr matrices encode data, instead of creating a matrix that's repeat_num x n in dimension, it creates one that's (2 * repeat_num - 1) x n where every odd row is 0. This shouldn't be too big of a deal, as operations on the zero rows are quick and they should be pretty easy to slice out afterwards (with something like [::2]), but it's not ideal.
I think the marked answer is probably still the "best" way to do this.
One of the most efficient ways to repeat the sparse matrix would be the way OP suggested. I modified indptr so that it doesn't output rows of 0s.
## original sparse matrix
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
x = scipy.sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
x.toarray()
array([[1, 0, 2],
[0, 0, 3],
[4, 5, 6]])
To repeat this, you need to repeat data and indices, and you need to fix-up the indptr. This is not the most elegant way, but it works.
## repeated sparse matrix
repeat = 5
new_indptr = indptr
for r in range(1, repeat):
    new_indptr = np.concatenate((new_indptr, new_indptr[-1] + indptr[1:]))
x = scipy.sparse.csr_matrix((np.tile(data,repeat), np.tile(indices,repeat), new_indptr))
x.toarray()
array([[1, 0, 2],
[0, 0, 3],
[4, 5, 6],
[1, 0, 2],
[0, 0, 3],
[4, 5, 6],
[1, 0, 2],
[0, 0, 3],
[4, 5, 6],
[1, 0, 2],
[0, 0, 3],
[4, 5, 6],
[1, 0, 2],
[0, 0, 3],
[4, 5, 6]])
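As a hedged aside, the indptr loop above can also be built without Python-level iteration by offsetting each copy of indptr[1:] by a multiple of indptr[-1] (the total number of stored values); this sketch should produce the same new_indptr as the loop:
import numpy as np
import scipy.sparse

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
repeat = 5
# Each copy's row pointers are the original ones shifted by copy_index * nnz
new_indptr = np.concatenate(
    ([0], (indptr[1:] + indptr[-1] * np.arange(repeat)[:, None]).ravel()))
x = scipy.sparse.csr_matrix(
    (np.tile(data, repeat), np.tile(indices, repeat), new_indptr))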