A 3D numpy array A contains a series of (in this example) 3 copies of a 2D numpy array D of shape 2 x 2. The matrix D is as follows:
D = np.array([[1,2],[3,4]])
A is initialized and assigned as below:
idx = np.arange(3)
A = np.zeros((3,2,2))
A[idx,:,:] = D # This gives A = [[[1,2],[3,4]],[[1,2],[3,4]],\
# [[1,2],[3,4]]]
# In mathematical notation: A = {D, D, D}
Now, essentially what I require after the execution of the code is:
Mathematically, A = {D^0, D^1, D^2} = {D0, D1, D2}
where D0 = [[1,0],[0,1]], D1 = [[1,2],[3,4]], D2 = [[7,10],[15,22]]
Is it possible to raise each matrix in A to its corresponding power without using a for-loop? I will be working with larger matrices and longer series.
I had defined
n = np.array([0,1,2]) # corresponding to powers 0, 1 and 2
and tried
Result = np.power(A,n)
but I do not get the desired output. Is there an efficient way to do it?
Full code:
D = np.array([[1,2],[3,4]])
idx = np.arange(3)
A = np.zeros((3,2,2))
A[idx,:,:] = D # This gives A = [[[1,2],[3,4]],[[1,2],[3,4]],\
# [[1,2],[3,4]]]
# In mathematical notation: A = {D, D, D}
n = np.array([0,1,2])
Result = np.power(A,n) # ------> Not the desired output.
A cumulative product exists in numpy, but not for matrices. Therefore, you need to make your own 'matcumprod' function. You can use np.dot for this, but np.matmul (or the @ operator) is specialized for matrix multiplication.
Since you state your powers always go from 0 to some_power, I suggest the following function:
def matcumprod(D, upto):
    # Res[k] holds the matrix power D**k, starting from the identity
    Res = np.empty((upto, *D.shape), dtype=D.dtype)
    Res[0, :, :] = np.eye(D.shape[0])
    for i in range(1, upto):
        Res[i, :, :] = Res[i-1, :, :] @ D
    return Res
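As a quick sanity check (not part of the original answer), calling it on the D from the question reproduces {D^0, D^1, D^2}:

import numpy as np

D = np.array([[1,2],[3,4]])
A = matcumprod(D, 3)
# A[0] is the 2x2 identity, A[1] equals D, and A[2] equals D @ D = [[7,10],[15,22]]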
By the way, a plain loop oftentimes outperforms a built-in numpy function when the latter uses a lot of memory, so don't fret over it as long as your powers stay within reasonable bounds...
Alright, I spent a lot of time on this problem but could not seem to find a vectorized solution of the kind you'd like. So I would like to first propose a basic solution, and then an optimization in case you need consecutive powers.
The function you're looking for is called numpy.linalg.matrix_power
import numpy as np
D = np.array([[1,2],[3,4]])
idx = np.arange(3)
A = np.zeros((3,2,2))
A[idx,:,:] = D # This gives A = [[[1,2],[3,4]],[[1,2],[3,4]],\
# [[1,2],[3,4]]]
# In mathematical notation: A = {D, D, D}
n = np.array([0,1,2])
result = [np.linalg.matrix_power(D, i) for i in n]
np.array(result)
#Output:
array([[[ 1, 0],
[ 0, 1]],
[[ 1, 2],
[ 3, 4]],
[[ 7, 10],
[15, 22]]])
However, notice that this recomputes every power of the same base matrix from scratch. We could instead reuse the intermediate results as we go, using numpy.linalg.multi_dot
def all_powers_arr_of_matrix(A):
    # A is a stack {D, D, ..., D}; result[i] will hold D**i
    result = np.zeros(A.shape)
    result[0] = np.linalg.matrix_power(A[0], 0)  # identity matrix
    for i in range(1, A.shape[0]):
        # multiply the previous power by the next copy of D
        result[i] = np.linalg.multi_dot([result[i - 1], A[i]])
    return result
result = all_powers_arr_of_matrix(A)
#Output:
array([[[ 1., 0.],
[ 0., 1.]],
[[ 1., 2.],
[ 3., 4.]],
[[ 7., 10.],
[15., 22.]]])
Also, we can avoid creating the matrix A entirely, saving some time.
def all_powers_matrix(D, *rangeargs): # end exclusive
    ''' Expects a 2D matrix.
    Use as all_powers_matrix(D, end) or
    all_powers_matrix(D, start, end)
    '''
    if len(rangeargs) == 1:
        start = 0
        end = rangeargs[0]
    elif len(rangeargs) == 2:
        start = rangeargs[0]
        end = rangeargs[1]
    else:
        print("incorrect args")
        return None

    result = np.zeros((end - start, *D.shape))
    result[0] = np.linalg.matrix_power(D, start)
    for i in range(1, end - start):
        # each entry is the previous entry multiplied by D once more
        result[i] = np.linalg.multi_dot([result[i - 1], D])
    return result
result = all_powers_matrix(D, 3)
#Output:
array([[[ 1., 0.],
[ 0., 1.]],
[[ 1., 2.],
[ 3., 4.]],
[[ 7., 10.],
[15., 22.]]])
Note that you'd need to add error handling if you decide to use these functions as-is.
To calculate powers of the matrix D, one way could be to find its eigenvalues and right eigenvectors with np.linalg.eig, raise the resulting diagonal matrix to the desired powers (which is easy), and then, after some manipulation, use two np.einsum calls to assemble A.
# get eigenvalues and right eigenvectors
eigval, eigvect = np.linalg.eig(D)
# sanity check: reconstruct D from its eigendecomposition
print(np.dot(eigvect*eigval, np.linalg.inv(eigvect)))
#[[1. 2.]
# [3. 4.]]
# so you get D back
# use np.power.outer with n on the eigenvalues to get all the powers you want
arrp = np.power.outer(eigval, n).T
# apply_along_axis builds the diagonal matrices along the last axis
diagp = np.apply_along_axis(np.diag, axis=-1, arr=arrp)
# finally, two np.einsum calls with the right subscripts give what you want
A = np.einsum('lij,jk -> lik',
              np.einsum('ij,kjl -> kil', eigvect, diagp), np.linalg.inv(eigvect)).round()
print (A)
print (A.shape)
#[[[ 1. 0.]
# [-0. 1.]]
#
# [[ 1. 2.]
# [ 3. 4.]]
#
# [[ 7. 10.]
# [15. 22.]]]
#
#(3, 2, 2)
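As a side note (this refactoring is not part of the original answer), the two np.einsum calls can be fused into a single three-operand call, which gives the same result:

# A[m,i,k] = sum over j,c of eigvect[i,j] * diagp[m,j,c] * inv(eigvect)[c,k]
A = np.einsum('ij,mjc,ck->mik', eigvect, diagp, np.linalg.inv(eigvect)).round()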
I don't have a full solution, but there are some things I wanted to mention which are a bit too long for the comments.
You might first look into addition chain exponentiation if you are computing big powers of big matrices. This is basically asking how many matrix multiplications are required to compute A^k for a given k. For instance A^5 = A(A^2)^2, so you need only three matrix multiplications: A^2, then (A^2)^2, then A(A^2)^2. This might be the simplest way to gain some efficiency, but you will probably still have to use explicit loops.
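To make that concrete, here is a minimal sketch of exponentiation by squaring, the simplest such addition chain (the helper name is mine, not from any library):

import numpy as np

def mat_pow_by_squaring(A, k):
    """Compute the matrix power A**k with O(log k) matrix multiplications (k >= 0)."""
    result = np.eye(A.shape[0], dtype=A.dtype)
    base = A
    while k:
        if k & 1:                 # current bit of k is set: fold the base into the result
            result = result @ base
        k >>= 1
        if k:                     # square the base only while bits of k remain
            base = base @ base
    return result

# e.g. mat_pow_by_squaring(D, 5) agrees with np.linalg.matrix_power(D, 5)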
Your question is also related to the problem of computing Ax, A^2x, ... , A^kx for a given A and x. This is an active area of research right now (search "matrix powers kernel"), since computing such a sequence efficiently is useful for parallel/communication avoiding Krylov subspace methods. If you're looking for a very efficient solution to your problem it might be worth looking into some of the results about this.
I want to solve the linear equation Ax = B, where each A is one slice of a 3D array. For example,
in Ax = B,
suppose A.shape is (2,3,3),
i.e. [[[1,2,3],[1,2,3],[1,2,3]], [[1,2,3],[1,2,3],[1,2,3]]]
and B.shape is (3,1),
i.e. [1,2,3]^T.
I want to find each 3-vector x = (x_1, x_2, x_3) that satisfies Ax = B.
What comes to mind is to multiply B by np.ones((2,3)) and take the dot product with the inverse of each slice of A. But that needs a loop, which consumes a lot of time as the matrix size grows.
How can I solve many Ax = B equations without a loop?
I made the elements of A and B the same here (e.g. every row of A is [1,2,3]), but as you probably know, it is just an example.
For invertible matrices, we could use np.linalg.inv on the 3D array A and then use tensor matrix-multiplication with B so that we lose the last and first axes of those two arrays respectively, like so -
np.tensordot( np.linalg.inv(A), B, axes=((-1),(0)))
Sample run -
In [150]: A
Out[150]:
array([[[ 0.70454189, 0.17544101, 0.24642533],
[ 0.66660371, 0.54608536, 0.37250876],
[ 0.18187631, 0.91397945, 0.55685133]],
[[ 0.81022308, 0.07672197, 0.7427768 ],
[ 0.08990586, 0.93887203, 0.01665071],
[ 0.55230314, 0.54835133, 0.30756205]]])
In [151]: B = np.array([[1],[2],[3]])
In [152]: np.linalg.solve(A[0], B)
Out[152]:
array([[ 0.23594665],
[ 2.07332454],
[ 1.90735086]])
In [153]: np.linalg.solve(A[1], B)
Out[153]:
array([[ 8.43831557],
[ 1.46421396],
[-8.00947932]])
In [154]: np.tensordot( np.linalg.inv(A), B, axes=((-1),(0)))
Out[154]:
array([[[ 0.23594665],
[ 2.07332454],
[ 1.90735086]],
[[ 8.43831557],
[ 1.46421396],
[-8.00947932]]])
Alternatively, the tensor matrix-multiplication could be replaced by np.matmul, like so -
np.matmul(np.linalg.inv(A), B)
On Python 3.x, we could use the @ operator for the same functionality -
np.linalg.inv(A) @ B
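As a further side note (not part of the original answer), np.linalg.solve itself broadcasts over stacked matrices, so the inverses never have to be formed explicitly; a minimal sketch, reusing the A and B from the sample run:

# give B an explicit leading axis so it is a (1, 3, 1) stack of right-hand sides;
# it then broadcasts against the leading axis of A, and x comes out with shape
# (2, 3, 1) -- the same values as the tensordot/inv approach above
x = np.linalg.solve(A, B[np.newaxis, :, :])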
I have an array of size MxN and I would like to compute the entropy of each row. What would be the fastest way to do so?
scipy.special.entr computes -x*log(x) for each element in an array. After calling that, you can sum the rows.
Here's an example. First, create an array p of positive values whose rows sum to 1:
In [23]: np.random.seed(123)
In [24]: x = np.random.rand(3, 10)
In [25]: p = x/x.sum(axis=1, keepdims=True)
In [26]: p
Out[26]:
array([[ 0.12798052, 0.05257987, 0.04168536, 0.1013075 , 0.13220688,
0.07774843, 0.18022149, 0.1258417 , 0.08837421, 0.07205402],
[ 0.08313743, 0.17661773, 0.1062474 , 0.01445742, 0.09642919,
0.17878489, 0.04420998, 0.0425045 , 0.12877228, 0.1288392 ],
[ 0.11793032, 0.15790292, 0.13467074, 0.11358463, 0.13429674,
0.06003561, 0.06725376, 0.0424324 , 0.05459921, 0.11729367]])
In [27]: p.shape
Out[27]: (3, 10)
In [28]: p.sum(axis=1)
Out[28]: array([ 1., 1., 1.])
Now compute the entropy of each row. entr uses the natural logarithm, so to get the base-2 log, divide the result by log(2).
In [29]: from scipy.special import entr
In [30]: entr(p).sum(axis=1)
Out[30]: array([ 2.22208731, 2.14586635, 2.22486581])
In [31]: entr(p).sum(axis=1)/np.log(2)
Out[31]: array([ 3.20579434, 3.09583074, 3.20980287])
If you don't want the dependency on scipy, you can use the explicit formula:
In [32]: (-p*np.log2(p)).sum(axis=1)
Out[32]: array([ 3.20579434, 3.09583074, 3.20980287])
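One caveat with the explicit formula (not mentioned above): if p contains exact zeros, p*np.log2(p) produces nan there, whereas entr follows the 0*log(0) = 0 convention. A small guard reproduces that behaviour:

ent = (-p*np.log2(np.where(p > 0, p, 1.0))).sum(axis=1)   # zero entries contribute 0 bits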
As @Warren pointed out, it's unclear from your question whether you are starting out from an array of probabilities, or from the raw samples themselves. In my answer I've assumed the latter, in which case the main bottleneck will be computing the bin counts over each row.
Assuming that each vector of samples is relatively long, the fastest way to do this will probably be to use np.bincount:
import numpy as np
def entropy(x):
    """
    x is assumed to be an (nsignals, nsamples) array containing integers between
    0 and n_unique_vals
    """
    x = np.atleast_2d(x)
    nrows, ncols = x.shape
    nbins = x.max() + 1

    # count the number of occurrences for each unique integer between 0 and x.max()
    # in each row of x
    counts = np.vstack([np.bincount(row, minlength=nbins) for row in x])

    # divide by number of columns to get the probability of each unique value
    p = counts / float(ncols)

    # compute Shannon entropy in bits
    return -np.sum(p * np.log2(p), axis=1)
Although Warren's method of computing the entropies from the probability values using entr is slightly faster than using the explicit formula, in practice this is likely to represent a tiny fraction of the total runtime compared to the time taken to compute the bin counts.
Test correctness for a single row:
vals = np.arange(3)
prob = np.array([0.1, 0.7, 0.2])
row = np.random.choice(vals, p=prob, size=1000000)
print("theoretical H(x): %.6f, empirical H(x): %.6f" %
      (-np.sum(prob * np.log2(prob)), entropy(row)[0]))
# theoretical H(x): 1.156780, empirical H(x): 1.157532
Test speed:
In [1]: %%timeit x = np.random.choice(vals, p=prob, size=(1000, 10000))
....: entropy(x)
....:
10 loops, best of 3: 34.6 ms per loop
If your data don't consist of integer indices between 0 and the number of unique values, you can convert them into this format using np.unique:
y = np.random.choice([2.5, 3.14, 42], p=prob, size=(1000, 10000))
unq, x = np.unique(y, return_inverse=True)
x.shape = y.shape
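The entropy function from above can then be applied directly to the converted array (this last line is mine, for completeness):

row_entropies = entropy(x)   # per-row entropies of y, in bits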
I am constructing a sparse vector using a scipy.sparse.csr_matrix like so:
csr_matrix((values, (np.zeros(len(indices)), indices)), shape = (1, max_index))
This works fine for most of my data, but occasionally I get a ValueError: could not convert integer scalar.
This reproduces the problem:
In [145]: inds
Out[145]:
array([ 827969148, 996833913, 1968345558, 898183169, 1811744124,
2101454109, 133039182, 898183170, 919293479, 133039089])
In [146]: vals
Out[146]:
array([ 1., 1., 1., 1., 1., 2., 1., 1., 1., 1.])
In [147]: max_index
Out[147]:
2337713000
In [143]: csr_matrix((vals, (np.zeros(10), inds)), shape = (1, max_index+1))
...
996 fn = _sparsetools.csr_sum_duplicates
997 M,N = self._swap(self.shape)
--> 998 fn(M, N, self.indptr, self.indices, self.data)
999
1000 self.prune() # nnz may have changed
ValueError: could not convert integer scalar
inds is a np.int64 array and vals is a np.float64 array.
The relevant part of the scipy sum_duplicates code is here.
Note that this works:
In [235]: csr_matrix(([1,1], ([0,0], [1,2])), shape = (1, 2**34))
Out[235]:
<1x17179869184 sparse matrix of type '<type 'numpy.int64'>'
with 2 stored elements in Compressed Sparse Row format>
So the problem is not that one of the dimensions is > 2^31.
Any thoughts why these values should be causing a problem?
Might it be that max_index > 2**31?
Try this, just to make sure:
csr_matrix((vals, (np.zeros(10), inds//2)), shape = (1, max_index//2))
The shape you are giving is larger than it needs to be for the column indices you are supplying. This
sparse.csr_matrix((vals, (np.zeros(10), inds)), shape = (1, np.max(inds)+1))
works fine for me.
Although calling .todense() on it results in a memory error, because of the large size of the matrix.
Commenting out the sum_duplicates call will lead to other errors. But the fix from strange error when creating csr_matrix also solves your problem. You can extend the version check to newer versions of scipy.
import scipy
import scipy.sparse

if scipy.__version__ in ("0.14.0", "0.14.1", "0.15.1"):
    _get_index_dtype = scipy.sparse.sputils.get_index_dtype

    def _my_get_index_dtype(*a, **kw):
        kw.pop('check_contents', None)
        return _get_index_dtype(*a, **kw)

    scipy.sparse.compressed.get_index_dtype = _my_get_index_dtype
    scipy.sparse.csr.get_index_dtype = _my_get_index_dtype
    scipy.sparse.bsr.get_index_dtype = _my_get_index_dtype
I'm relatively new to python but I'm trying to understand something which seems basic.
Create a vector:
x = np.linspace(0,2,3)
Out[38]: array([ 0., 1., 2.])
now why isn't x[:,0] a valid argument?
IndexError: invalid index
It must be x[0]. I have a function I am calling which calculates:
np.sqrt(x[:,0]**2 + x[:,1]**2 + x[:,2]**2)
Why can't what I have just work regardless of the input? In many other languages, it is independent of whether there are other rows in the array. Perhaps I misunderstand something fundamental - sorry if so. I'd like to avoid putting:
if len(x) == 1:
    norm = np.sqrt(x[0]**2 + x[1]**2 + x[2]**2)
else:
    norm = np.sqrt(x[:,0]**2 + x[:,1]**2 + x[:,2]**2)
everywhere. Surely there is a way around this... thanks.
Edit: An example of it working in another language is Matlab:
>> b = [1,2,3]
b =
1 2 3
>> b(:,1)
ans =
1
>> b(1)
ans =
1
Perhaps you are looking for this:
np.sqrt(x[...,0]**2 + x[...,1]**2 + x[...,2]**2)
There can be any number of dimensions in place of the ellipsis ...
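For example (a quick illustration; the x1/x2 names are mine):

import numpy as np

x1 = np.linspace(0, 2, 3)                      # shape (3,)
x2 = np.array([[0., 1., 2.], [3., 4., 5.]])    # shape (2, 3)

x1[..., 0]   # 0.0              -- works on the 1-D array
x2[..., 0]   # array([0., 3.])  -- and on the 2-D array, with no special-casing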
See also What does the Python Ellipsis object do?, and the docs of NumPy basic slicing
It looks like the ellipsis as described by @JanneKarila has answered your question, but I'd like to point out how you might make your code a bit more "numpythonic". It appears you want to handle an n-dimensional array with the shape (d_1, d_2, ..., d_{n-1}, 3), and compute the magnitudes of this collection of three-dimensional vectors, resulting in an (n-1)-dimensional array with shape (d_1, d_2, ..., d_{n-1}). One simple way to do that is to square all the elements, sum along the last axis, and then take the square root. If x is the array, that calculation can be written as np.sqrt(np.sum(x**2, axis=-1)). The following shows a few examples.
x is 1-D, with shape (3,):
In [31]: x = np.array([1.0, 2.0, 3.0])
In [32]: np.sqrt(np.sum(x**2, axis=-1))
Out[32]: 3.7416573867739413
x is 2-D, with shape (2, 3):
In [33]: x = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
In [34]: x
Out[34]:
array([[ 1., 2., 3.],
[ 4., 5., 6.]])
In [35]: np.sqrt(np.sum(x**2, axis=-1))
Out[35]: array([ 3.74165739, 8.77496439])
x is 3-D, with shape (2, 2, 3):
In [36]: x = np.arange(1.0, 13.0).reshape(2,2,3)
In [37]: x
Out[37]:
array([[[ 1., 2., 3.],
[ 4., 5., 6.]],
[[ 7., 8., 9.],
[ 10., 11., 12.]]])
In [38]: np.sqrt(np.sum(x**2, axis=-1))
Out[38]:
array([[ 3.74165739, 8.77496439],
[ 13.92838828, 19.10497317]])
I tend to solve this by writing
x = np.atleast_2d(x)
norm = np.sqrt(x[:,0]**2 + x[:,1]**2 + x[:,2]**2)
Matlab doesn't have 1D arrays, so b = [1 2 3] is still a 2D array and indexing with two dimensions makes sense. 1D arrays may be a novel concept for you, but they're quite useful in fact (you can stop worrying about whether you need to multiply by the transpose, or whether you are inserting a row or a column into another array...)
By the way, you could write a fancier, more general norm like this:
x = np.atleast_2d(x)
norm = np.sqrt((x**2).sum(axis=1))
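As a side note (not in the original answer), np.linalg.norm can do the same reduction directly, and with axis=-1 it handles both the 1-D and the 2-D case:

norm = np.linalg.norm(x, axis=-1)   # same as np.sqrt((x**2).sum(axis=-1))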
The problem is that x[:,0] in Python isn't the same as in Matlab.
If you want to extract the first element in the single row vector you should go with
x[:1]
This is called a "slice". In this example it means that you take everything in the array from the first element to the element with index 1 (not included).
Remember that Python has zero-based numbering.
Another example may be:
x[0:2]
which would return the first and the second element of the array.