apply numpy.histogram to multidimensional array - python

I want to apply numpy.histogram() to a multi-dimensional array along an axis.
Say, for example I have a 2D array and I want to apply histogram() along axis=1.
Code:
import numpy
array = numpy.array([[0.6, 0.7, -0.3, 1.0, -0.8], [0.2, -1.0, -0.5, 0.5, 0.8],
[0.25, 0.3, -0.1, -0.8, 1.0]])
bins = [-1.0, -0.5, 0, 0.5, 1.0, 1.0]
hist, bin_edges = numpy.histogram(array, bins)
print(hist)
Output:
[3 3 3 4 2]
Expected Output:
[[1 1 0 2 1],
[1 1 1 2 0],
[1 1 2 0 1]]
How can I get my expected output?
I tried to use the solution suggested in this post, but it doesn't get me to the expected output.

For n-d cases, you can do this with np.histogram2d by adding a dummy x-axis (i) that holds the row index of each value:
import numpy as np
def vec_hist(a, bins):
    # one row index per value, so each row falls into its own x-bin
    i = np.repeat(np.arange(np.prod(a.shape[:-1])), a.shape[-1])
    return np.histogram2d(i, a.flatten(), (a.shape[0], bins))
Output
vec_hist(array, bins)
Out[453]:
(array([[ 1., 1., 0., 2., 1.],
[ 1., 1., 1., 2., 0.],
[ 1., 1., 2., 0., 1.]]),
array([ 0. , 0.66666667, 1.33333333, 2. ]),
array([-1. , -0.5 , 0. , 0.5 , 0.9999999, 1. ]))
For histograms over an arbitrary axis, you'll probably need to create i using np.meshgrid and np.ravel_multi_index and then use that to reshape the resulting histogram.
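If readability matters more than full vectorization, a plain per-row loop is a simple alternative to the histogram2d trick. The sketch below is not part of the answer above: hist_along_last_axis is a hypothetical helper name, and it assumes the histogram is wanted along the last axis with explicit bin edges shared by every row.

```python
import numpy as np

def hist_along_last_axis(a, bins):
    """Histogram each 1-D slice along the last axis of `a` (hypothetical helper)."""
    a = np.asarray(a)
    # Collapse all leading axes so that every row is one slice to histogram.
    rows = a.reshape(-1, a.shape[-1])
    counts = np.stack([np.histogram(row, bins=bins)[0] for row in rows])
    # Restore the leading axes; the last axis now indexes the bins.
    return counts.reshape(a.shape[:-1] + (counts.shape[-1],))

array = np.array([[0.6, 0.7, -0.3, 1.0, -0.8],
                  [0.2, -1.0, -0.5, 0.5, 0.8],
                  [0.25, 0.3, -0.1, -0.8, 1.0]])
bins = [-1.0, -0.5, 0, 0.5, 1.0, 1.0]
print(hist_along_last_axis(array, bins))
```

With the array and bins from the question, this gives the per-row counts listed under "Expected Output". The loop runs once per row, so it may be slower than histogram2d for arrays with very many rows.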


Discretize only certain arrays in a tensor with TensorFlow

I have the following array:
import numpy as np
import tensorflow as tf
input = np.array([[-1.5, 1.0, 3.4, .5], [0.0, 3.0, 1.3, 0.0]])
layer = tf.keras.layers.Discretization(num_bins=2, epsilon=0.01)
layer.adapt(input)
layer(input)
<tf.Tensor: shape=(2, 4), dtype=int64, numpy=
array([[0, 1, 1, 1],
[0, 1, 1, 0]])>
This discretizes the whole tensor. I would like to know if there is a way through which I can just discretize the second array in the tensor.
We can create a mask based on the index of the array that needs to be discretized:
def get_mask(x, array_index):
    # start from all ones, then zero out the rows listed in array_index
    mask = tf.Variable(tf.ones_like(x, dtype=tf.float32))
    indices = tf.Variable(array_index, dtype=tf.int32)
    updates = tf.Variable(tf.zeros((indices.shape[0], mask.shape[1])), dtype=tf.float32)
    return tf.compat.v1.scatter_nd_update(mask, indices, updates)
Calling
mask = get_mask(input, np.array([[1]])) # second array
returns the mask:
array([[1., 1., 1., 1.],
[0., 0., 0., 0.]])
Then we can apply the mask with tf.cast(layer(input), tf.float32) * (1-mask) + input*mask, which returns:
array([[-1.5, 1. , 3.4, 0.5],
[ 0. , 1. , 1. , 0. ]])
The above should work for any array and any array index to discretize.
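The same "replace only selected rows" idea can be sketched without TensorFlow. This is a NumPy stand-in, not the Keras layer: np.digitize replaces Discretization, the boundary 0.5 is an assumed value chosen for illustration (it is not the boundary the adapted layer learns), and discretize_rows is a hypothetical helper name.

```python
import numpy as np

def discretize_rows(x, rows, boundaries):
    """Bin only the given rows of a 2-D array; other rows pass through unchanged."""
    x = np.asarray(x, dtype=float)
    # np.digitize stands in for the Keras Discretization layer here.
    binned = np.digitize(x, boundaries).astype(float)
    # True for rows to keep as-is, False for rows to replace with bin indices.
    keep = np.ones(x.shape[0], dtype=bool)
    keep[list(rows)] = False
    # Broadcast the per-row flag across the columns.
    return np.where(keep[:, None], x, binned)

data = np.array([[-1.5, 1.0, 3.4, 0.5], [0.0, 3.0, 1.3, 0.0]])
result = discretize_rows(data, rows=[1], boundaries=[0.5])
print(result)
```

Selecting with np.where avoids the multiply-and-add blend, which can matter if the untouched rows contain NaN or infinity.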

Efficient matrix update and matrix multiplication using Scipy sparse matrix

I have a large matrix (236680*236680), and my PC does not have enough memory to read in the complete matrix, so I am considering a SciPy sparse matrix. My goal is to multiply a generated (non-sparse) matrix, np.eye(n) - np.ones(n)/n where n is the number of observations, by a sparse matrix.
In SciPy, I use the following code, but the computation is still huge. My questions are:
To generate the first matrix, is there any other way to speed up the process?
For the matrix multiplication, is there any way to reduce memory usage, given that the first matrix is not sparse?
from scipy.sparse import lil_matrix
fline=5
nn=1/fline
M=lil_matrix((fline,fline))
M.setdiag(values=1-nn,k=0)
for i in range(fline)[1:]:
    M.setdiag(values=0-nn,k=i)
    M.setdiag(values=0-nn,k=-i)
#the first matrix is:
array([[ 0.8, -0.2, -0.2, -0.2, -0.2],
[-0.2, 0.8, -0.2, -0.2, -0.2],
[-0.2, -0.2, 0.8, -0.2, -0.2],
[-0.2, -0.2, -0.2, 0.8, -0.2],
[-0.2, -0.2, -0.2, -0.2, 0.8]])
#the second matrix is:
array([[0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0.],
[1., 0., 1., 0., 0.]])
a2=M.dot(B)
#the final expected results
array([[-0.2, 0. , -0.2, 0.6, 0. ],
[-0.2, 0. , -0.2, -0.4, 0. ],
[-0.2, 0. , -0.2, 0.6, 0. ],
[-0.2, 0. , -0.2, -0.4, 0. ],
[ 0.8, 0. , 0.8, -0.4, 0. ]])
Updated: is there any way to improve the speed of the cross product? Numpy dot and Scipy sparse dot functions are tested.
For the first problem: Mathematically,
arr1 = array([[ 0.8, -0.2, -0.2, -0.2, -0.2],
[-0.2, 0.8, -0.2, -0.2, -0.2],
[-0.2, -0.2, 0.8, -0.2, -0.2],
[-0.2, -0.2, -0.2, 0.8, -0.2],
[-0.2, -0.2, -0.2, -0.2, 0.8]])
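is equivalent to
arr1 = -0.2 * ones((5, 5)) + identity(5)
= -0.2 * outer(ones(5), ones(5)) + identity(5)
where ones((5, 5)) is the 5x5 all-ones matrix (the outer product of the all-ones column vector with the all-ones row vector) and identity(5) is the 5x5 identity matrix.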
Thus, it can be generated using
-0.2 * np.outer([1,1,1,1,1], [1,1,1,1,1]) + scipy.sparse.identity(5)
For the second problem, the product with B,
(-0.2 * np.outer([1, 1, 1, 1, 1], [1, 1, 1, 1, 1]) + scipy.sparse.identity(5)) @ B
can be reduced to
np.outer([1, 1, 1, 1, 1], B.sum(axis=0)) * -0.2 + scipy.sparse.identity(5) @ B
because [1, 1, 1, 1, 1] @ B is the vector of column sums, B.sum(axis=0).
You do not actually need to compute np.outer([1, 1, 1, 1, 1], B.sum(axis=0)), as this would be a dense square matrix that may not fit in memory. (Note that the outer product simply repeats B.sum(axis=0) in every row.)
To recover the results in a memory-efficient way, you only need to store B.sum(axis=0) and scipy.sparse.identity(5) @ B (which is just B).
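The reduction above can be sketched as follows, assuming B is available as a SciPy sparse matrix. The helper names column_means and centered_rows are hypothetical; the idea is to keep B sparse, store only the dense vector of column means, and densify a block of rows of the result only when it is needed.

```python
import numpy as np
from scipy import sparse

def column_means(B):
    """Column means of a sparse matrix as a 1-D dense array."""
    return np.asarray(B.sum(axis=0)).ravel() / B.shape[0]

def centered_rows(B, col_means, start, stop):
    """Rows start:stop of (I - ones((n, n)) / n) @ B, densified one block at a time."""
    return B[start:stop].toarray() - col_means

# The 5x5 second matrix from the question.
B = sparse.csr_matrix(np.array([[0., 0., 0., 1., 0.],
                                [0., 0., 0., 0., 0.],
                                [0., 0., 0., 1., 0.],
                                [0., 0., 0., 0., 0.],
                                [1., 0., 1., 0., 0.]]))
means = column_means(B)
block = centered_rows(B, means, 0, 5)
print(block)
```

For the full-size problem the block bounds would be chosen so that (stop - start) rows of dense output fit in memory; the n-by-n first matrix is never formed.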
A SciPy sparse matrix is used because one of the matrices is sparse, and the sparse dot product was the fastest of the NumPy and SciPy options I tested.
For the first question, @Tai's answer is the foundation, but I use the numpy.full function (a little bit faster).
For the second question, I split the whole matrix into column blocks and save the smaller computed matrices to a file.
from scipy import sparse
from scipy.sparse import csr_matrix, vstack
import h5sparse
import numpy as num
# weights: presumably the sparse second matrix from the question; it is not defined in this snippet
fline=236680
nn=1/fline; dd=1-nn; off=0-nn
div=int(fline/(61*10))
for i in range(61*10):
    # dense block of div columns of the first matrix
    divM= num.full((fline, div), off) + sparse.identity(fline,format='csc')[:,0+div*i:div+div*i]
    vs=[]
    for j in range(divM.shape[1]):
        divMB=csr_matrix(divM.T[j]).dot(weights)
        vs.append(divMB)
    divapp=vstack(vs)
    if i ==0:
        h5f = h5sparse.File("F:/dissertation/dallastest/temp/tt1.h5")
        h5f.create_dataset('sparse/matrix', data=divapp, chunks=(389,),maxshape=(None,))
    else:
        h5f['sparse/matrix'].append(divapp)

Understanding axes in NumPy

I was going through NumPy documentation, and am not able to understand one point. It mentions, for the example below, the array has rank 2 (it is 2-dimensional). The first dimension (axis) has a length of 2, the second dimension has a length of 3.
[[ 1., 0., 0.],
[ 0., 1., 2.]]
How does the first dimension (axis) have a length of 2?
Edit:
The reason for my confusion is the below statement in the documentation.
The coordinates of a point in 3D space [1, 2, 1] is an array of rank
1, because it has one axis. That axis has a length of 3.
In the original 2D ndarray, I assumed that the number of lists identifies the rank/dimension, and I wrongly assumed that the length of each list denotes the length of each dimension (in that order). So, as per my understanding, the first dimension should be having a length of 3, since the length of the first list is 3.
In numpy, axis ordering follows zyx convention, instead of the usual (and maybe more intuitive) xyz.
Visually, it means that for a 2D array where the horizontal axis is x and the vertical axis is y:
x -->
y 0 1 2
| 0 [[1., 0., 0.],
V 1 [0., 1., 2.]]
The shape of this array is (2, 3) because it is ordered (y, x), with the first axis y of length 2.
And verifying this with slicing:
import numpy as np
a = np.array([[1, 0, 0], [0, 1, 2]], dtype=float)
>>> a
Out[]:
array([[ 1., 0., 0.],
[ 0., 1., 2.]])
>>> a[0, :] # Slice index 0 of first axis
Out[]: array([ 1., 0., 0.]) # Get values along second axis `x` of length 3
>>> a[:, 2] # Slice index 2 of second axis
Out[]: array([ 0., 2.]) # Get values along first axis `y` of length 2
You may be confusing that sentence with the example below it. Think of it like this: rank is the number of nesting levels of lists in the array, and the length of an axis is the number of items in a list at that level.
I think they are trying to describe the shape, which in this case is (2, 3).
In that post, I think the key sentence is this one:
In NumPy dimensions are called axes. The number of axes is rank.
If you print the NumPy array
print(np.array([[1., 0., 0.], [0., 1., 2.]]))
you'll get the following output
#col1 col2 col3
[[ 1. 0. 0.] # row 1
[ 0. 1. 2.]] # row 2
Think of it as a 2 by 3 matrix: 2 rows, 3 columns. It is a 2d array because it is a list of lists (the [[ at the start is a hint that it's 2d).
The 2d NumPy array
np.array([[1., 0., 0., 6.], [0., 1., 2., 7.], [3., 4., 5., 8.]])
would print as
#col1 col2 col3 col4
[[ 1. 0. 0. 6.] # row 1
[ 0. 1. 2. 7.] # row 2
[ 3. 4. 5. 8.]] # row 3
This is a 3 by 4 2d array (3 rows, 4 columns).
The length of the first dimension is the length of the array itself:
In [11]: a = np.array([[ 1., 0., 0.], [ 0., 1., 2.]])
In [12]: a
Out[12]:
array([[ 1., 0., 0.],
[ 0., 1., 2.]])
In [13]: len(a) # "length of first dimension"
Out[13]: 2
The second is the length of each "row":
In [14]: [len(aa) for aa in a] # 3 is "length of second dimension"
Out[14]: [3, 3]
Many numpy functions take axis as an argument, for example you can sum over an axis:
In [15]: a.sum(axis=0)
Out[15]: array([ 1., 1., 2.])
In [16]: a.sum(axis=1)
Out[16]: array([ 1., 3.])
The thing to note is that you can have higher dimensional arrays:
In [21]: b = np.array([[[1., 0., 0.], [ 0., 1., 2.]]])
In [22]: b
Out[22]:
array([[[ 1., 0., 0.],
[ 0., 1., 2.]]])
In [23]: b.sum(axis=2)
Out[23]: array([[ 1., 3.]])
Keep the following points in mind when considering Numpy axes:
Each sub-level of a list (or array) represents an axis. For example:
import numpy as np
a = np.array([1,2]) # 1 axis
b = np.array([[1,2],[3,4]]) # 2 axes
c = np.array([[[1,2],[3,4]],[[5,6],[7,8]]]) # 3 axes
Axis labels correspond to the level of the sub-list they represent, starting with axis 0 for the outer most list.
To illustrate this, consider the following arrays of different shapes, each with 24 elements:
# 1D Array
a0 = np.array(
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
)
a0.shape # (24,) - here, the length along the 0-axis is 24
# 2D Array
a01 = np.array(
[
[1.1, 1.2, 1.3, 1.4],
[2.1, 2.2, 2.3, 2.4],
[3.1, 3.2, 3.3, 3.4],
[4.1, 4.2, 4.3, 4.4],
[5.1, 5.2, 5.3, 5.4],
[6.1, 6.2, 6.3, 6.4]
]
)
a01.shape # (6, 4) - now, the length along the 0-axis is 6
# 3D Array (each value encodes its position, e.g. 321 is block 3, row 2, column 1)
a012 = np.array(
[
[
[111, 112],
[121, 122],
[131, 132]
],
[
[211, 212],
[221, 222],
[231, 232]
],
[
[311, 312],
[321, 322],
[331, 332]
],
[
[411, 412],
[421, 422],
[431, 432]
]
]
)
a012.shape # (4, 3, 2) - and finally, the length along the 0-axis is 4
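One way to check your understanding of axis numbering is to reduce over each axis in turn and see which length disappears from the shape. This is a small sketch using a stand-in array built with np.arange (not the arrays above); shape_after_sum is a hypothetical helper name.

```python
import numpy as np

def shape_after_sum(arr, axis):
    """Shape that remains after summing `arr` over one axis."""
    return arr.sum(axis=axis).shape

# Stand-in 3D array with the same shape as a012: 4 blocks, 3 rows, 2 columns.
a = np.arange(24).reshape(4, 3, 2)

for axis in range(a.ndim):
    # Summing over an axis removes that axis's length from the shape.
    print(axis, shape_after_sum(a, axis))
```

Summing over axis 0 collapses the outermost list (length 4), while summing over axis 2 collapses the innermost lists (length 2).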

remove empty numpy array

I have a numpy array:
array([], shape=(0, 4), dtype=float64)
How can I remove this array in a multidimensional array?
I tried
import numpy as np
if array == []:
    np.delete(array)
But, the multidimensional array still has this empty array.
EDIT:
The input is
new_array = [array([], shape=(0, 4), dtype=float64),
array([[-0.97, 0.99, -0.98, -0.93 ],
[-0.97, -0.99, 0.59, -0.93 ],
[-0.97, 0.99, -0.98, -0.93 ],
[ 0.70 , 1, 0.60, 0.65]]), array([[-0.82, 1, 0.61, -0.63],
[ 0.92, -1, 0.77, 0.88],
[ 0.92, -1, 0.77, 0.88],
[ 0.65, -1, 0.73, 0.85]]), array([], shape=(0, 4), dtype=float64)]
The expected output after removing the empty arrays is:
new_array = [array([[-0.97, 0.99, -0.98, -0.93 ],
[-0.97, -0.99, 0.59, -0.93 ],
[-0.97, 0.99, -0.98, -0.93 ],
[ 0.70 , 1, 0.60, 0.65]]),
array([[-0.82, 1, 0.61, -0.63],
[ 0.92, -1, 0.77, 0.88],
[ 0.92, -1, 0.77, 0.88],
[ 0.65, -1, 0.73, 0.85]])]
new_array, as printed, looks like a list of arrays. And even if it were an array, it would be a 1d array of dtype=object.
==[] is not the way to check for an empty array:
In [10]: x=np.zeros((0,4),float)
In [11]: x
Out[11]: array([], shape=(0, 4), dtype=float64)
In [12]: x==[]
Out[12]: False
In [14]: 0 in x.shape # check if there's a 0 in the shape
Out[14]: True
Check the syntax for np.delete. It requires an array, an index and an axis, and returns another array. It does not operate in place.
If new_array is a list, a list comprehension would do a nice job of removing the [] arrays:
In [33]: alist=[x, np.ones((2,3)), np.zeros((1,4)),x]
In [34]: alist
Out[34]:
[array([], shape=(0, 4), dtype=float64), array([[ 1., 1., 1.],
[ 1., 1., 1.]]), array([[ 0., 0., 0., 0.]]), array([], shape=(0, 4), dtype=float64)]
In [35]: [y for y in alist if 0 not in y.shape]
Out[35]:
[array([[ 1., 1., 1.],
[ 1., 1., 1.]]), array([[ 0., 0., 0., 0.]])]
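It would also work if new_array was a 1d object array (recent NumPy versions need dtype=object to build one from arrays of different shapes):
new_array=np.array(alist, dtype=object)
newer_array = np.array([y for y in new_array if 0 not in y.shape], dtype=object)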
To use np.delete with new_array, you have to specify which elements:
In [47]: np.delete(new_array,[0,3])
Out[47]:
array([array([[ 1., 1., 1.],
[ 1., 1., 1.]]),
array([[ 0., 0., 0., 0.]])], dtype=object)
to find [0,3] you could use np.where:
np.delete(new_array,np.where([y.size==0 for y in new_array]))
Better yet, skip the delete and where and go with a boolean mask
new_array[np.array([y.size>0 for y in new_array])]
I don't think there's a way of identifying these 'empty' arrays without a list comprehension, since you have to check the shape or size property, not the element's data. Also, there's a limit to the kinds of math you can do across elements of an object array. It's more like a list than a 2d array.
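For the list-of-arrays case in the question, the filtering step can be wrapped up as a small function. This is a minimal sketch; drop_empty is a hypothetical name, and the sample arrays are placeholders with the same shapes as those in the question rather than the original data.

```python
import numpy as np

def drop_empty(arrays):
    """Return a new list without the arrays that contain no elements."""
    # size is 0 whenever any dimension has length 0, e.g. shape (0, 4).
    return [a for a in arrays if a.size > 0]

arrays = [np.zeros((0, 4)), np.ones((4, 4)), np.full((4, 4), 2.0), np.zeros((0, 4))]
kept = drop_empty(arrays)
print([a.shape for a in kept])
```

The input list is left unchanged; rebind the name (arrays = drop_empty(arrays)) if the empty entries should be gone for good.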
I initially had an array of shape (3,11,11), and after multiprocessing with pool.map my array was transformed into a list like this:
[array([], shape=(0, 11, 11), dtype=float64),
array([[[ 0.35318114, 0.36152024, 0.35572945, 0.34495254, 0.34169853,
0.36553977, 0.34266126, 0.3492261 , 0.3339431 , 0.34759375,
0.33490712],...
If I converted this list to an array, the shape was (3,), so I used:
myarray = np.vstack(mylist)
and this returned my first 3d array with the original shape (3,11,11).
Delete takes the multidimensional array as a parameter. Then you need to specify the subarray to delete and the axis it's on. See http://docs.scipy.org/doc/numpy/reference/generated/numpy.delete.html
np.delete(new_array,<obj indicating subarray to delete (perhaps an array of integers in your case)>, 0)
Also, note that the deletion is not in-place.

Combining two numpy arrays to form an array with the largest value from each array

I want to combine two numpy arrays to produce an array with the largest values from each array.
import numpy as np
a = np.array([[ 0., 0., 0.5],
[ 0.1, 0.5, 0.5],
[ 0.1, 0., 0.]])
b = np.array([[ 0., 0., 0.0],
[ 0.5, 0.1, 0.5],
[ 0.5, 0.1, 0.]])
I would like to produce
array([[ 0., 0., 0.5],
[ 0.5, 0.5, 0.5],
[ 0.5, 0.1, 0.]])
I know you can do
a += b
which results in
array([[ 0. , 0. , 0.5],
[ 0.6, 0.6, 1. ],
[ 0.6, 0.1, 0. ]])
This is clearly not what I'm after. It seems like such an easy problem and I assume it most probably is.
You can use np.maximum to compute the element-wise maximum of the two arrays:
>>> np.maximum(a, b)
array([[ 0. , 0. , 0.5],
[ 0.5, 0.5, 0.5],
[ 0.5, 0.1, 0. ]])
This works with any two arrays, as long as they're the same shape or one can be broadcast to the shape of the other.
To modify the array a in-place, you can redirect the output of np.maximum back to a:
np.maximum(a, b, out=a)
There is also np.minimum for calculating the element-wise minimum of two arrays.
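To make the broadcasting and out= remarks concrete, here is a short sketch. The floor values and the 0.25 threshold are made up for illustration; only a comes from the question.

```python
import numpy as np

a = np.array([[0.0, 0.0, 0.5],
              [0.1, 0.5, 0.5],
              [0.1, 0.0, 0.0]])

# A 1-D array of length 3 is broadcast against every row of `a`.
floor = np.array([0.2, 0.0, 0.3])
result = np.maximum(a, floor)

# A scalar broadcasts too; out= writes the result into an existing array.
buf = a.copy()
np.maximum(buf, 0.25, out=buf)

print(result)
print(buf)
```

Working on a copy (buf) keeps a intact; passing out=a instead would overwrite the original values.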
You are looking for the element-wise maximum.
Example:
>>> np.maximum([2, 3, 4], [1, 5, 2])
array([2, 5, 4])
http://docs.scipy.org/doc/numpy/reference/generated/numpy.maximum.html
inds = b > a
a[inds] = b[inds]
This modifies the original array a, which is what += does in your example; that may or may not be what you want.
