I have an array of size MxN and I'd like to compute the entropy value of each row. What would be the fastest way to do so?
scipy.special.entr computes -x*log(x) for each element in an array. After calling that, you can sum the rows.
Here's an example. First, create an array p of positive values whose rows sum to 1:
In [23]: np.random.seed(123)
In [24]: x = np.random.rand(3, 10)
In [25]: p = x/x.sum(axis=1, keepdims=True)
In [26]: p
Out[26]:
array([[ 0.12798052, 0.05257987, 0.04168536, 0.1013075 , 0.13220688,
0.07774843, 0.18022149, 0.1258417 , 0.08837421, 0.07205402],
[ 0.08313743, 0.17661773, 0.1062474 , 0.01445742, 0.09642919,
0.17878489, 0.04420998, 0.0425045 , 0.12877228, 0.1288392 ],
[ 0.11793032, 0.15790292, 0.13467074, 0.11358463, 0.13429674,
0.06003561, 0.06725376, 0.0424324 , 0.05459921, 0.11729367]])
In [27]: p.shape
Out[27]: (3, 10)
In [28]: p.sum(axis=1)
Out[28]: array([ 1., 1., 1.])
Now compute the entropy of each row. entr uses the natural logarithm, so to get the base-2 log, divide the result by log(2).
In [29]: from scipy.special import entr
In [30]: entr(p).sum(axis=1)
Out[30]: array([ 2.22208731, 2.14586635, 2.22486581])
In [31]: entr(p).sum(axis=1)/np.log(2)
Out[31]: array([ 3.20579434, 3.09583074, 3.20980287])
If you don't want the dependency on scipy, you can use the explicit formula:
In [32]: (-p*np.log2(p)).sum(axis=1)
Out[32]: array([ 3.20579434, 3.09583074, 3.20980287])
As @Warren pointed out, it's unclear from your question whether you are starting out from an array of probabilities, or from the raw samples themselves. In my answer I've assumed the latter, in which case the main bottleneck will be computing the bin counts over each row.
Assuming that each vector of samples is relatively long, the fastest way to do this will probably be to use np.bincount:
import numpy as np

def entropy(x):
    """
    x is assumed to be an (nsignals, nsamples) array containing integers between
    0 and n_unique_vals
    """
    x = np.atleast_2d(x)
    nrows, ncols = x.shape
    nbins = x.max() + 1
    # count the number of occurrences of each unique integer between 0 and x.max()
    # in each row of x
    counts = np.vstack([np.bincount(row, minlength=nbins) for row in x])
    # divide by the number of columns to get the probability of each unique value
    p = counts / float(ncols)
    # compute the Shannon entropy in bits; log2 is only evaluated where p > 0,
    # so that empty bins contribute 0 rather than NaN
    logp = np.log2(p, where=p > 0, out=np.zeros_like(p))
    return -np.sum(p * logp, axis=1)
Although Warren's method of computing the entropies from the probability values using entr is slightly faster than using the explicit formula, in practice this is likely to represent a tiny fraction of the total runtime compared to the time taken to compute the bin counts.
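For reference, here's a sketch of the same row-wise entropy with the final step done via entr instead of the explicit formula (the name entropy_entr is my own; it assumes scipy is available):
from scipy.special import entr

def entropy_entr(x):
    # same bin-count step as entropy() above, entr-based final step
    x = np.atleast_2d(x)
    counts = np.vstack([np.bincount(row, minlength=x.max() + 1) for row in x])
    p = counts / float(x.shape[1])
    # entr(0) == 0, so empty bins are handled automatically; divide by log(2)
    # to convert from nats to bits
    return entr(p).sum(axis=1) / np.log(2)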
Test correctness for a single row:
vals = np.arange(3)
prob = np.array([0.1, 0.7, 0.2])
row = np.random.choice(vals, p=prob, size=1000000)
print("theoretical H(x): %.6f, empirical H(x): %.6f" %
(-np.sum(prob * np.log2(prob)), entropy(row)[0]))
# theoretical H(x): 1.156780, empirical H(x): 1.157532
Test speed:
In [1]: %%timeit x = np.random.choice(vals, p=prob, size=(1000, 10000))
....: entropy(x)
....:
10 loops, best of 3: 34.6 ms per loop
If your data don't consist of integer indices between 0 and the number of unique values, you can convert them into this format using np.unique:
y = np.random.choice([2.5, 3.14, 42], p=prob, size=(1000, 10000))
unq, x = np.unique(y, return_inverse=True)
x.shape = y.shape
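x now holds integer codes into unq, so the entropy function defined above can be applied directly, for example:
H = entropy(x)   # row-wise entropies of y, in bits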
How do I create a function that loops through a numpy matrix and z-scales every data point, returning the standardized data, just like sklearn.preprocessing.StandardScaler does? I have got up to here with no success. Can somebody help me with this?
def stand_scaler(data):
    mean = np.mean(data, axis=0)
    std = np.std(data, axis=0)
    for i in range(len(data)):
        data[i] = (data[i] - mean)/std
    return data

stand_scaler(data)
You shouldn't need a for-loop for this; numpy's array operations are intended for exactly this case. For a one dimensional array it's straightforward:
In [1]: import numpy as np
In [2]: x = np.random.normal(size=10)
In [3]: nx = (x - x.mean()) / x.std()
In [4]: x
Out[4]:
array([ 0.52700345, -0.57358563, -0.16925383, 2.14401554, 1.05223331,
0.72659482, 1.06816826, 0.31194848, 0.04004589, 1.09046925])
In [5]: nx
Out[5]:
array([-0.12859083, -1.62209992, -1.0734181 , 2.06570881, 0.58415071,
0.14225641, 0.60577458, -0.42042233, -0.78939654, 0.63603721])
In [6]: nx.mean()
Out[6]: 5.551115123125783e-17
In [7]: nx.std()
Out[7]: 1.0000000000000002
For higher dimensions, you can choose an axis to work over, and scale by using numpy's broadcasting; e.g., in this case, imagine each column is a different variable:
In [8]: y = np.array([10,1]) * np.random.normal(size=(5,2)) - np.array([5,-10])
In [9]: ny = (y - y.mean(axis=0)) / y.std(axis=0)
In [10]: ny
Out[10]:
array([[ 0.78076062, -0.26971997],
[-1.59591909, -1.2409338 ],
[-0.55740483, -0.81901609],
[ 1.22978416, 1.12697814],
[ 0.14277914, 1.20269171]])
In [11]: ny.mean(axis=0), ny.std(axis=0)
Out[11]: (array([-3.33066907e-17, 8.43769499e-16]), array([1., 1.]))
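If you want this wrapped up in the shape of your original function, here is a minimal vectorized sketch (my own wrapper; it assumes data is a float array and, like StandardScaler with default settings, standardizes each column):
import numpy as np

def stand_scaler(data, axis=0):
    """Standardize data to zero mean and unit variance along the given axis."""
    mean = data.mean(axis=axis, keepdims=True)
    std = data.std(axis=axis, keepdims=True)
    return (data - mean) / std

# example: each column of a (5, 2) array is scaled independently
data = np.array([10, 1]) * np.random.normal(size=(5, 2)) - np.array([5, -10])
print(stand_scaler(data).mean(axis=0), stand_scaler(data).std(axis=0))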
I am aware of the scipy.spatial.distance.pdist function and how to compute the mean from the resulting matrix/ndarray.
>>> x = np.random.rand(10000, 2)
>>> y = pdist(x, metric='euclidean')
>>> y.mean()
0.5214255824176626
In the example above y gets quite large (nearly 2,500 times as large as the input array):
>>> y.shape
(49995000,)
>>> from sys import getsizeof
>>> getsizeof(x)
160112
>>> getsizeof(y)
399960096
>>> getsizeof(y) / getsizeof(x)
2498.0019986009793
But since I am only interested in the mean pairwise distance, the distance matrix doesn't have to be kept in memory. Instead, the mean of each row (or column) can be computed separately, and the final mean can then be computed from the row means.
Is there already a function which exploits this property, or is there an easy way to extend/combine existing functions to do so?
If you use the squared version of the distance, its mean is equivalent to twice the sum of the per-dimension variances computed with ddof=1 (the mean squared pairwise distance over the n(n-1)/2 pairs works out to 2/(n-1) * sum_i ||x_i - mean(x)||^2), so no pairwise matrix is needed at all:
from scipy.spatial.distance import pdist
import numpy as np

x = np.random.rand(10000, 2)
print(pdist(x, 'sqeuclidean').mean())
print(np.var(x, 0, ddof=1).sum()*2)
# 0.331474285845873
# 0.33147428584587346
You will have to weight each row by the number of observations that make up the mean. For example the pdist of a 3 x 2 matrix is the flattened upper triangle (offset of 1) of the squareform 3 x 3 distance matrix.
arr = np.arange(6).reshape(3,2)
arr
array([[0, 1],
[2, 3],
[4, 5]])
pdist(arr)
array([2.82842712, 5.65685425, 2.82842712])
from sklearn.metrics import pairwise_distances
square = pairwise_distances(arr)
square
array([[0. , 2.82842712, 5.65685425],
[2.82842712, 0. , 2.82842712],
[5.65685425, 2.82842712, 0. ]])
square[np.triu_indices(square.shape[0], 1)]
array([2.82842712, 5.65685425, 2.82842712])
There is the pairwise_distances_chunked function that can be used to iterate over the distance matrix chunk by chunk (each chunk holding one or more rows), but you will need to keep track of the row index to make sure you only take the mean of the values in the upper (or lower) triangle of the matrix (the distance matrix is symmetric). This isn't complicated, but I imagine it will introduce a significant slowdown.
from sklearn.metrics import pairwise_distances_chunked

n = arr.shape[0]
tot = (n**2 - n) / 2                 # number of pairs in the upper triangle
weighted_mean = 0
r = 0                                # index of the current row in the full distance matrix
for chunk in pairwise_distances_chunked(arr):
    for row in chunk:
        # row[r:] starts at the diagonal (which is 0), so summing it only
        # picks up the upper-triangle entries of this row
        weighted_mean += row[r:].mean() * (n - r) / tot
        r += 1
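As a quick check on the small arr above (my addition), the chunked loop should agree with the naive pdist mean:
from scipy.spatial.distance import pdist
print(weighted_mean, pdist(arr).mean())   # both print ~3.7712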
Earlier I asked a similar question where the answer used np.dot, taking advantage of the fact that a dot product involves a sum of products. (To my understanding.)
Now I have a similar issue where I don't think dot will apply, because in place of a sum I want to take an element-wise diagonal. If it does, I haven't been able to apply it correctly.
Given a matrix x and array err:
x = np.matrix([[ 0.02984406, -0.00257266],
[-0.00257266, 0.00320312]])
err = np.array([ 7.6363226 , 13.16548267])
My current implementation with loop is:
res = np.array([np.sqrt(np.diagonal(x * err[i])) for i in range(err.shape[0])])
print(res)
[[ 0.47738755 0.15639712]
[ 0.62682649 0.20535487]]
which takes the diagonal of x.dot(i) for each i in err. Could this be vectorized? In other words, can the output of multiplying x by err be 3-dimensional, with np.diagonal then yielding a 2d array, one row of diagonal values for each element of err?
Program:
import numpy as np

x = np.matrix([[ 0.02984406, -0.00257266],
               [-0.00257266,  0.00320312]])
err = np.array([ 7.6363226 , 13.16548267])

diag = np.diagonal(x)                    # 1-d array holding the diagonal of x
ans = np.sqrt(diag*err[:,np.newaxis])    # sqrt of the outer product
print(ans)

# use the out keyword to avoid allocating a new array every time
ans = np.empty((err.shape[0], diag.shape[0]), dtype=x.dtype)
for i in range(100):
    np.multiply(diag, err[:, np.newaxis], out=ans)
    np.sqrt(ans, out=ans)
Result:
[[ 0.47738755 0.15639712]
[ 0.62682649 0.20535487]]
Here's an approach that pulls out the diagonal of x as a view using ndarray.flat and then uses broadcasting for the element-wise multiplication, like so -
np.sqrt(x.flat[::x.shape[1]+1].A1 * err[:,None])
Sample run -
In [108]: x = np.matrix([[ 0.02984406, -0.00257266],
...: [-0.00257266, 0.00320312]])
...:
...: err = np.array([ 7.6363226 , 13.16548267])
...:
In [109]: np.sqrt(x.flat[::x.shape[1]+1].A1 * err[:,None])
Out[109]:
array([[ 0.47738755, 0.15639712],
[ 0.62682649, 0.20535487]])
Runtime test to see how the flat-based diagonal view compares against np.diagonal -
In [104]: x = np.matrix(np.random.rand(5000,5000))
In [105]: err = np.random.rand(5000)
In [106]: %timeit np.diagonal(x)*err[:,np.newaxis]
10 loops, best of 3: 66.8 ms per loop
In [107]: %timeit x.flat[::x.shape[1]+1].A1 * err[:,None]
10 loops, best of 3: 37.7 ms per loop
This question is similar to this one.
I have a 2d boolean array "belong" and a 2d float array "angles".
What I want is to sum along the rows the angles for which the corresponding index in belong is True, and do that with numpy (ie. avoid python loops). I don't need to store the resulting rows, which would have different lengths and as explained in the linked question would require a list.
So what I attempted is np.sum(angles[belong], axis=1), but angles[belong] returns a 1d result and I can't reduce it the way I want. I have also tried np.sum(angles*belong, axis=1), and that works. But I wonder if I could improve the timing by accessing only the indices where belong is True. belong is True about 30% of the time, and angles here stands in for a longer formula that involves angles.
UPDATE
I like the solution with einsum, however in my actual computation the speed-up is tiny. I used angles in the question to simplify; in practice it is a formula that uses angles. I suspect that this formula is evaluated for all the angles (regardless of belong) and only then passed to einsum, which performs the reduction.
This is what I've done:
THRES_THETA and max_line_length are floats.
belong, angle and lines_lengths_vstacked have shape (1653, 58)
and np.count_nonzero(belong)/belong.size -> 0.376473287856979
l2 = (lambda angle=angle, belong=belong, THRES_THETA=THRES_THETA, lines_lengths_vstacked=lines_lengths_vstacked, max_line_length=max_line_length:
      np.sum(belong*(0.3 * (1-(angle/THRES_THETA)) + 0.7 * (lines_lengths_vstacked/max_line_length)), axis=1))  # base method
t2 = timeit.Timer(l2)
print(t2.repeat(3, 100))
l1 = (lambda angle=angle, belong=belong, THRES_THETA=THRES_THETA, lines_lengths_vstacked=lines_lengths_vstacked, max_line_length=max_line_length:
      np.einsum('ij,ij->i', belong, 0.3 * (1-(angle/THRES_THETA)) + 0.7 * (lines_lengths_vstacked/max_line_length)))
t1 = timeit.Timer(l1)
print(t1.repeat(3, 100))
l3 = (lambda angle=angle, belong=belong:
      np.sum(angle*belong, axis=1))  # base method
t3 = timeit.Timer(l3)
print(t3.repeat(3, 100))
l4 = (lambda angle=angle, belong=belong:
      np.einsum('ij,ij->i', belong, angle))
t4 = timeit.Timer(l4)
print(t4.repeat(3, 100))
and the results were:
[0.2505458095931187, 0.22666162878242901, 0.23591678551324263]
[0.23295411847036418, 0.21908727226505043, 0.22407296178704272]
[0.03711204915708555, 0.03149960399994978, 0.033403337575027114]
[0.025264803208228992, 0.022590580646423053, 0.024585736455331464]
If we look at the last two rows, the einsum version is about 30% faster than the base method. But if we look at the first two rows, the speed-up for the einsum method is much smaller, only a few percent.
I'm not sure if this timing can be improved.
You can use np.einsum -
np.einsum('ij,ij->i',belong,angles)
You can also use np.bincount, like so -
idx,_ = np.where(belong)
out = np.bincount(idx,angles[belong])
Sample run -
In [32]: belong
Out[32]:
array([[ True, True, True, False, True],
[False, False, False, True, True],
[False, False, True, True, True],
[False, False, True, False, True]], dtype=bool)
In [33]: angles
Out[33]:
array([[ 0.65429151, 0.36235607, 0.98316406, 0.08236384, 0.5576149 ],
[ 0.37890797, 0.60705112, 0.79411002, 0.6450942 , 0.57750073],
[ 0.6731019 , 0.18608778, 0.83387574, 0.80120389, 0.54971573],
[ 0.18971255, 0.86765132, 0.82994543, 0.62344429, 0.05207639]])
In [34]: np.sum(angles*belong ,axis=1) # This worked for you, so using as baseline
Out[34]: array([ 2.55742654, 1.22259493, 2.18479536, 0.88202183])
In [35]: np.einsum('ij,ij->i',belong,angles)
Out[35]: array([ 2.55742654, 1.22259493, 2.18479536, 0.88202183])
In [36]: idx,_ = np.where(belong)
...: out = np.bincount(idx,angles[belong])
...:
In [37]: out
Out[37]: array([ 2.55742654, 1.22259493, 2.18479536, 0.88202183])
Runtime test -
In [52]: def sum_based(belong,angles):
...: return np.sum(angles*belong ,axis=1)
...:
...: def einsum_based(belong,angles):
...: return np.einsum('ij,ij->i',belong,angles)
...:
...: def bincount_based(belong,angles):
...: idx,_ = np.where(belong)
...: return np.bincount(idx,angles[belong])
...:
In [53]: # Inputs
...: belong = np.random.rand(4000,5000)>0.7
...: angles = np.random.rand(4000,5000)
...:
In [54]: %timeit sum_based(belong,angles)
...: %timeit einsum_based(belong,angles)
...: %timeit bincount_based(belong,angles)
...:
1 loops, best of 3: 308 ms per loop
10 loops, best of 3: 134 ms per loop
1 loops, best of 3: 554 ms per loop
I would go with the np.einsum one!
You could use masked arrays for this, but in the tests I ran it is not faster than (angles * belong).sum(1).
A masked array approach would look like this:
sum_ang = np.ma.masked_where(~belong, angles, copy=False).sum(1).data
Here, we create a masked array of angles in which the values where ~belong ("not belong") is True are masked (excluded). We take the logical not because we want to exclude the values where belong is False. Then we take the sum along rows with .sum(1). The sum returns another masked array, so you grab the values with the .data attribute of that masked array.
I added the copy=False kwarg so that this code doesn't get slowed down by array creation, but it's still slower than your (angles * belong).sum(1) approach so you should probably just stick with that.
I have found a way that is about 3 times faster than the einsum solution, and I don't think it can get any faster, so I'm answering my own question with this other method.
What I was hoping to do is calculate the formula involving angles just for the positions where belong is True. That should give roughly a 3x speed-up, since belong is True only about 30% of the time.
My first attempt using angles[belong] would calculate the formula just for the positions where belong is True, but had the problem that the resulting array was 1d and I couldn't do the row reductions with np.sum. The solution is to use np.add.reduceat.
reduceat can apply a ufunc reduction (in this case add) over a list of specific slices. So I just need to build the list of slice start indices, and then I can reduce the 1d array resulting from angles[belong].
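For example (my own minimal illustration of np.add.reduceat):
import numpy as np
a = np.arange(8)                      # [0 1 2 3 4 5 6 7]
print(np.add.reduceat(a, [0, 2, 5]))  # sums a[0:2], a[2:5], a[5:] -> [ 1  9 18]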
I'll show my code and timings, which should speak for themselves.
first I define a function with the reduceat solution:
def vote_op(angle, belong, THRES_THETA, lines_lengths_vstacked, max_line_length):
    intermediate = (0.3 * (1-(angle[belong]/THRES_THETA)) + 0.7 * (lines_lengths_vstacked[belong]/max_line_length))
    b_ind = np.hstack([0, np.cumsum(np.sum(belong, axis=1))])
    votes = np.add.reduceat(intermediate, b_ind[:-1])
    return votes
then I compare with the base method and the einsum method:
l1 = (lambda angle=angle, belong=belong, THRES_THETA=THRES_THETA, lines_lengths_vstacked=lines_lengths_vstacked, max_line_length=max_line_length:
      np.sum(belong*(0.3 * (1-(angle/THRES_THETA)) + 0.7 * (lines_lengths_vstacked/max_line_length)), axis=1))
t1 = timeit.Timer(l1)
print(t1.repeat(3, 100))
l2 = (lambda angle=angle, belong=belong, THRES_THETA=THRES_THETA, lines_lengths_vstacked=lines_lengths_vstacked, max_line_length=max_line_length:
      np.einsum('ij,ij->i', belong, 0.3 * (1-(angle/THRES_THETA)) + 0.7 * (lines_lengths_vstacked/max_line_length)))
t2 = timeit.Timer(l2)
print(t2.repeat(3, 100))
l3 = (lambda angle=angle, belong=belong, THRES_THETA=THRES_THETA, lines_lengths_vstacked=lines_lengths_vstacked, max_line_length=max_line_length:
      vote_op(angle, belong, THRES_THETA, lines_lengths_vstacked, max_line_length))
t3 = timeit.Timer(l3)
print(t3.repeat(3, 100))
and the timings:
[2.866840408487671, 2.6822349628234874, 2.665520338478774]
[2.3444239421490725, 2.352450520946098, 2.4150879511222794]
[0.6846337313820605, 0.660780839464234, 0.6091473217964847]
So the reduceat solution is roughly 3-4 times faster than the other two and gives the same results.
Note that these results are for a slightly larger example than before where:
belong, angle and lines_lengths_vstacked have shape: (3400, 170)
and np.count_nonzero(belong)/belong.size->0.16765051903114186
Update
Due to a corner case in np.add.reduceat (as of numpy version '1.11.0rc1'), where a repeated index returns the element at that position instead of an empty (zero) sum, I had to add a hack to the vote_op() function for the case where whole rows of belong are False. Such rows produce repeated indices in b_ind and hence wrong values in votes. My solution for the moment is to patch the wrong values afterwards; that works, but it is another step. See the new vote_op():
def vote_op(angle, belong, THRES_THETA, lines_lengths_vstacked, max_line_length):
    intermediate = (0.3 * (1-(angle[belong]/THRES_THETA)) + 0.7 * (lines_lengths_vstacked[belong]/max_line_length))
    b_rows = np.sum(belong, axis=1)
    b_ind = np.hstack([0, np.cumsum(b_rows)])[:-1]
    intermediate = np.hstack([intermediate, 0])
    votes = np.add.reduceat(intermediate, b_ind)
    votes[b_rows == 0] = 0
    return votes
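For reference, a minimal illustration (my own) of the reduceat behaviour the hack works around; the repeated index returns the element at that position instead of an empty sum of 0:
a = np.array([1., 2., 3.])
print(np.add.reduceat(a, [0, 2, 2]))  # [3. 3. 3.] -- the empty middle slice yields a[2]=3, not 0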
I have two matrices:
import numpy as np

def create(n):
    M = np.array([[ 0.33840224,  0.25420152,  0.40739624],
                  [ 0.35087337,  0.40939274,  0.23973389],
                  [ 0.40168642,  0.29848413,  0.29982946],
                  [ 0.17442095,  0.50982272,  0.31575633]])
    return np.concatenate([M] * n)

A = create(1)
nof_type = A.shape[1]
I = np.eye(nof_type)
Matrix A has dimensions 4 x 3 and I is 3 x 3.
What I want to do is:
1. calculate a distance score for every row in A against every row in I,
2. for every row in A, report the row id of I and the maximum score.
So at the end of the day we have a 4 x 2 matrix. How can I achieve that?
This is the function that computes the distance score between two numpy arrays.
def jsd(x, y):  # Jensen-Shannon divergence
    import warnings
    warnings.filterwarnings("ignore", category=RuntimeWarning)
    x = np.array(x)
    y = np.array(y)
    d1 = x*np.log2(2*x/(x+y))
    d2 = y*np.log2(2*y/(x+y))
    d1[np.isnan(d1)] = 0
    d2[np.isnan(d2)] = 0
    d = 0.5*np.sum(d1+d2)
    return d
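As a quick sanity check (my addition), the divergence of a distribution with itself should be 0:
p = np.array([0.2, 0.3, 0.5])
print(jsd(p, p))  # 0.0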
In the actual case, A has around 40K rows, so we would really like it to be fast.
Using a loopy approach:
def scoreit(A, I):
    aoa = []
    for i, x in enumerate(A):
        maxscore = -10000
        id = -1
        for j, y in enumerate(I):
            distance = jsd(x, y)
            #print "\t", i, j, distance
            if distance > maxscore:
                maxscore = distance
                id = j
        #print "MAX", maxscore, id
        aoa.append([maxscore, id])
    return aoa
It gives this result:
In [56]: scoreit(A,I)
Out[56]:
[[0.54393736529629078, 1],
[0.56083720679952753, 2],
[0.49502813447483673, 1],
[0.64408263453965031, 0]]
Current timing:
In [57]: %timeit scoreit(create(1000),I)
1 loops, best of 3: 3.31 s per loop
You can extend I's dimensions to a 3D array at various places to bring powerful broadcasting into play. We keep A as it is, because it's a huge array and we don't want to incur the performance loss of moving its elements around. Also, you can avoid the costly business of checking for NaNs and then summing by using a single np.nansum call, which sums over the non-NaN values. Thus, the vectorized solution would look something like this -
def jsd_vectorized(A, I):
    # Perform "(x+y)" in a vectorized manner
    AI = A + I[:,None]
    # Calculate d1 and d2 using AI again in a vectorized manner
    d1 = A*np.log2(2*A/AI)
    d2 = I[:,None,:]*np.log2((2*I[:,None,:])/AI)
    # Use np.nansum to ignore NaNs & sum along rows to get all distances
    dists = np.nansum(d1,2) + np.nansum(d2,2)
    # Pack the argmax IDs and the corresponding scores as the final output
    ID = dists.argmax(0)
    return np.vstack((0.5*dists[ID,np.arange(dists.shape[1])],ID)).T
Sample run
Loopy function running the original jsd code, for verification -
def jsd_loopy(A, I):
    dists = np.empty((A.shape[0], I.shape[0]))
    for i, x in enumerate(A):
        for j, y in enumerate(I):
            dists[i,j] = jsd(x, y)
    ID = dists.argmax(1)
    return np.vstack((dists[np.arange(dists.shape[0]),ID],ID)).T
Run and verify -
In [511]: A = np.array([[ 0.33840224, 0.25420152, 0.40739624],
...: [ 0.35087337, 0.40939274, 0.23973389],
...: [ 0.40168642, 0.29848413, 0.29982946],
...: [ 0.17442095, 0.50982272, 0.31575633]])
...: nof_type = A.shape[1]
...: I = np.eye(nof_type)
...:
In [512]: jsd_loopy(A,I)
Out[512]:
array([[ 0.54393737, 1. ],
[ 0.56083721, 2. ],
[ 0.49502813, 1. ],
[ 0.64408263, 0. ]])
In [513]: jsd_vectorized(A,I)
Out[513]:
array([[ 0.54393737, 1. ],
[ 0.56083721, 2. ],
[ 0.49502813, 1. ],
[ 0.64408263, 0. ]])
Runtime tests
In [514]: A = np.random.rand(1000,3)
In [515]: nof_type = A.shape[1]
...: I = np.eye(nof_type)
...:
In [516]: %timeit jsd_loopy(A,I)
1 loops, best of 3: 782 ms per loop
In [517]: %timeit jsd_vectorized(A,I)
1000 loops, best of 3: 1.17 ms per loop
In [518]: np.allclose(jsd_loopy(A,I),jsd_vectorized(A,I))
Out[518]: True