Why replacing values in a numpy array does not always work - python

I am trying to replace/overwrite values in an array using the following commands:
import numpy as np
test = np.array([[4,5,0],[0,0,0],[0,0,6]])
test
Out[20]:
array([[4, 5, 0],
       [0, 0, 0],
       [0, 0, 6]])
test[np.where(test[...,0] != 0)][...,1:3] = np.array([[10,11]])
test
Out[22]:
array([[4, 5, 0],
       [0, 0, 0],
       [0, 0, 6]])
However, as one can see in Out[22], the array test has not been modified. So I conclude that it is not possible to simply overwrite part of an array, or just a few cells.
Nevertheless, in other contexts it is possible to overwrite a few cells of an array. For example, in the code below:
test = np.array([[1,2,0],[0,0,0],[0,0,3]])
test
Out[11]:
array([[1, 2, 0],
       [0, 0, 0],
       [0, 0, 3]])
test[test>0]
Out[12]:
array([1, 2, 3])
test[test>0] = np.array([4,5,6])
test
Out[14]:
array([[4, 5, 0],
       [0, 0, 0],
       [0, 0, 6]])
Therefore, my two questions:
1. Why does the first command
test[np.where(test[...,0] != 0)][...,1:3] = np.array([10,11])
not modify the array test? Why does it not allow accessing the array cells and overwriting them?
2. How could I make it work, considering that for my code I need to select the cells using the command above?
Many thanks!

I'll do you one up. This does work:
test[...,1:3][np.where(test[...,0] != 0)] = np.array([[10,11]])
array([[ 4, 10, 11],
       [ 0,  0,  0],
       [ 0,  0,  6]])
Why? It's the combination of two factors - numpy indexing and .__setitem__ calls.
The Python interpreter effectively reads assignment lines backwards: when it reaches =, it calls .__setitem__ on the object furthest to the left. __setitem__ is (hopefully) a method of that object, and takes two inputs: the index (whatever is between the [...] just before the =) and the value being assigned.
a[b] = c  # is interpreted as
a.__setitem__(b, c)
Now, when we index in numpy we have three basic ways we can do it.
slicing (returns views)
'advanced indexing' (returns copies)
'simple indexing' (also returns copies)
One major difference between "advanced" and "simple" indexing is that a numpy array's __setitem__ can interpret advanced indexes and write through them to the original data. And a view shares the same data addresses as its base, so we don't need __setitem__ at all to reach the original through a view.
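A quick way to check which kind of result an index gives you is np.shares_memory; a minimal sketch (not part of the original answer):
import numpy as np
a = np.arange(9).reshape(3, 3)
view = a[:, 1:3]                  # basic slicing: a view into a's data
copy = a[np.array([0, 2])]        # advanced (integer-array) indexing: a copy
print(np.shares_memory(a, view))  # True
print(np.shares_memory(a, copy))  # False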
So:
test[np.where(test[...,0] != 0)][...,1:3] = np.array([[10,11]])  # is interpreted as
(test[np.where(test[...,0] != 0)]).__setitem__((Ellipsis, slice(1, 3)),
                                               np.array([[10,11]]))
But since np.where(test[...,0] != 0) is an advanced index, test[np.where(test[...,0] != 0)] returns a copy, which is then lost because it is never bound to a name. The assignment does set the selected elements of that copy to [10, 11], but the result is immediately discarded.
If we do:
test[..., 1:3][np.where(test[..., 0] != 0)] = np.array([[10, 11]])  # is interpreted as
(test[..., 1:3]).__setitem__( np.where(test[...,0] != 0), np.array([[10,11]]) )
test[...,1:3] is a view, so it still points to the same memory. Now __setitem__ looks for the locations in test[...,1:3] that correspond to np.where(test[...,0] != 0), and sets them equal to np.array([[10, 11]]). And everything works.
You can also do this:
test[np.where(test[...,0] != 0), 1:3] = np.array([10, 11])
Now, since all the indexing is in one set of brackets, it's calling test.__setitem__ on those indices, which sets the data correctly as well.
Even simpler (and most pythonic) would be:
test[test[...,0] != 0, 1:3] = np.array([10,11])
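For completeness, a minimal end-to-end check of that last form:
import numpy as np
test = np.array([[4, 5, 0], [0, 0, 0], [0, 0, 6]])
test[test[..., 0] != 0, 1:3] = np.array([10, 11])  # one __setitem__ call on test itself
test
# array([[ 4, 10, 11],
#        [ 0,  0,  0],
#        [ 0,  0,  6]])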

PyTorch add to tensor at indices with degenerate indices

This question may be seen as an extension to this one.
I have two 1D tensors, counts and idx. counts has length 20 and stores the occurrences of events that fall into one of 20 bins. idx is very long, and each entry is an integer corresponding to the occurrence of one of the 20 events; each event can occur multiple times. I'd like a vectorized or very fast way to add the number of times event i occurred in idx to the i-th bucket of counts. Furthermore, it would be ideal if the solution were compatible with operating on batches of counts and idx during a training loop.
My first thought was to simply use this strategy of indexing counts with idx:
counts = torch.zeros(5)
idx = torch.tensor([1,1,1,2,3])
counts[idx] += 1
But it did not work, with counts ending at
tensor([0., 1., 1., 1., 0.])
instead of the desired
tensor([0., 3., 1., 1., 0.])
What's the fastest way I can do this? My next best guess is
for i in range(20):
    counts[i] += (idx == i).sum()
Please consider the following proposal, implemented with the bincount function, which counts the frequency of each value in a tensor of non-negative ints (the only constraint).
import torch
EVENT_TYPES = 20
counts = torch.zeros(EVENT_TYPES)
events = torch.tensor([1, 1, 1, 2, 3, 9])
batch_counts = torch.bincount(events, minlength=EVENT_TYPES)
print(counts + batch_counts)
Result:
tensor([0., 3., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
You can evaluate this for every batch while staying entirely within the torch tensor environment. You control the number of event types using the minlength argument of bincount; in this case 20, as described in the problem.
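If you would rather accumulate in place than add batch_counts afterwards, an alternative sketch (not from the original answer) uses index_add_, which, unlike counts[idx] += 1, does accumulate over duplicate indices:
import torch
EVENT_TYPES = 20
counts = torch.zeros(EVENT_TYPES)
events = torch.tensor([1, 1, 1, 2, 3, 9])
# index_add_(dim, index, source) adds the source values at the given indices,
# accumulating when an index appears multiple times.
counts.index_add_(0, events, torch.ones(len(events)))
print(counts)  # 3 at index 1; 1 at indices 2, 3 and 9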

Numpy: could not broadcast input array from shape (3) into shape (1)

I want to create a numpy array b where each component is a 2D matrix whose dimensions are determined by the coordinates of vector a.
What I get doing the following satisfies me:
>>> a = [3,4,1]
>>> b = [np.zeros((a[i], a[i - 1] + 1)) for i in range(1, len(a))]
>>> np.array(b)
array([array([[ 0.,  0.,  0.,  0.],
              [ 0.,  0.,  0.,  0.],
              [ 0.,  0.,  0.,  0.],
              [ 0.,  0.,  0.,  0.]]),
       array([[ 0.,  0.,  0.,  0.,  0.]])], dtype=object)
but I have found this pathological case where it does not work:
>>> a = [2,1,1]
>>> b = [np.zeros((a[i], a[i - 1] + 1)) for i in range(1, len(a))]
>>> b
[array([[ 0., 0., 0.]]), array([[ 0., 0.]])]
>>> np.array(b)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: could not broadcast input array from shape (3) into shape (1)
I will present a solution to the problem, but do take into account what was said in the comments. Having Numpy arrays that are not aligned prevents most of the useful operations from working their magic. Consider using lists instead.
That being said, curious error indeed. I got the thing to work by assigning in a basic for-loop instead of using the np.array call.
a = [2,1,1]
b = np.zeros(len(a)-1, dtype=object)
for i in range(1, len(a)):
    b[i-1] = np.zeros((a[i], a[i - 1] + 1))
And the result:
>>> b
array([array([[0., 0., 0.]]), array([[0., 0.]])], dtype=object)
This is a bit peculiar. Typically, numpy will try to create one array from the input of np.array, with a common data type. A list of arrays is interpreted with the list becoming a new leading dimension. For instance, np.array([np.zeros((3, 1)), np.zeros((3, 1))]) produces a 2 x 3 x 1 array. So this can only happen if the arrays in your list match in shape. Otherwise, you end up with an array of arrays (with dtype=object), which, as commented, is not really an ideal scenario.
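A quick illustration of both behaviours; note that recent numpy versions refuse to guess with ragged shapes and require an explicit object array (a sketch, version-dependent):
import numpy as np
np.array([np.zeros((3, 1)), np.zeros((3, 1))]).shape   # (2, 3, 1): the list becomes a new dimension
b = np.empty(2, dtype=object)    # ragged case: build the object array explicitly
b[0] = np.zeros((1, 3))
b[1] = np.zeros((1, 2))
b
# array([array([[0., 0., 0.]]), array([[0., 0.]])], dtype=object)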
However, your error seems to occur when the first dimension matches. Numpy for some reason tries to broadcast the arrays somehow and fails. I can reproduce your error even if the arrays are of higher dimension, as long as the first dimension between arrays matches.
I know this isn't a solution, but it wouldn't fit in a comment. As noted by @roganjosh, making this kind of array really gives you no benefit. You're better off sticking to a list of arrays, for readability and to avoid the cost of creating these arrays.

Using a numpy view to a non-existing array

After defining an array a of zeros, I can create a view of its leftmost column with the following code:
a = np.zeros((5, 5))
a_left_col = a[:, 0]
a_left_col[:] = 2.
which prints for a:
array([[2., 0., 0., 0., 0.],
[2., 0., 0., 0., 0.],
[2., 0., 0., 0., 0.],
[2., 0., 0., 0., 0.],
[2., 0., 0., 0., 0.]])
If I subsequently reinitialize a with
a = np.zeros((5, 5))
then the view still exists, but it refers to nothing anymore. How does Python handle the situation if I do a_left_col[:] = 2 again? Is this undefined behavior like in C or C++, or does Python handle it properly, and if so, why doesn't it throw an error?
The original object still exists because it is referenced by the view (although it can no longer be accessed through the variable a).
Let's have a detailed look at the object's reference counts:
import sys
import numpy as np
a = np.zeros((5, 5))
print(sys.getrefcount(a)) # 2
a_left_col = a[:, 0]
print(sys.getrefcount(a)) # 3
print(sys.getrefcount(a_left_col.base)) # 3
print(a_left_col.base is a) # True
a = np.ones((5, 5))
print(sys.getrefcount(a_left_col.base)) # 2
Note that a_left_col.base is the reference to the original array. When we reassign a, the reference count on the object decreases, but it still exists because it is reachable through a_left_col.
The behaviour is not undefined. You are merely binding the name a to a new object; the old one is not deallocated, but still exists in memory, since a_left_col still references it. Once you also rebind or delete a_left_col, the original array can be deallocated.
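A minimal sketch of that behaviour:
import numpy as np
a = np.zeros((5, 5))
a_left_col = a[:, 0]
a = np.zeros((5, 5))     # rebinds the name a; the old array stays alive
a_left_col[:] = 2.       # well defined: writes into the old array
print(a)                 # the new array: still all zeros
print(a_left_col.base)   # the old array, kept alive by the view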

List as element of list of lists or multidimensional lists as a grid

I am trying to create a lat/lon grid that contains, for each lat/lon combination, an array of the indices found where two conditions are met. This approach might be too complicated, but using a meshgrid or numpy broadcasting also failed. If there is a better approach, feel free to share your knowledge. :-)
Round lat/lon values to a grid resolution of 1°, but retain the full length of the array:
x = np.around(lon, decimals=0)
y = np.around(lat, decimals=0)
The arrays consist of longitude/latitude values from -180° to 180° and -82° to 82°; duplicate pairs are possible.
Check for each lat/lon combination how many measurements are available at each 1°/1° grid point:
a = np.arange(-180,181)
b = np.arange(-82,83)
totalgrid = [[0 for i in range(len(b))] for j in range(len(a))]
for d1 in range(len(a)):
    for d2 in range(len(b)):
        totalgrid[d1][d2] = np.where((x==a[d1]) & (y==b[d2]))[0]
This method fails and returns only a list of lists with empty arrays. I can't figure out why it's not working properly.
Replacing the last line by:
totalgrid[d1][d2]=np.where((x==a[0])&(y==b[0]))[0]
returns all found indices from lon/lat that are present at -180°/-82°. Unfortunately it takes a while. Am I missing a for loop somewhere?!
The problem in more detail:
@askewchan
Unfortunately this one does not solve my original problem.
As expected, the result represents the ground track quite well.
But besides the total number of points for each grid point, I also need each individual index of the lat/lon combinations in the lat/lon arrays for further computations.
Let's assume I have arrays
lat (100000,), lon (100000,) and a third one, array (100000,),
which corresponds to the measurement at each point. I need every index of all 1°/1° combinations in lat/lon, so I can check those indices in array (100000,) against a condition. Now let's assume that the indices [10000, 10001, 10002, ..., 10025] of lat/lon fall on the same grid point. For those indices I need to check whether array[[10000, 10001, 10002, ..., 10025]] meets a condition, i.e. np.where(array==0). With cts.nonzero() I only get the index in the histogram, but then all information about which points contributed to the value of the histogram is lost. Hopefully this makes my initial problem clear.
Not sure if I understand the goal here, but you want to count how many lat/lon pairs you have in each 1° section? This is what a histogram does:
lon = np.random.random(5000)*2*180 - 180
lat = np.random.random(5000)*2*82 - 82
a = np.arange(-180,181)
b = np.arange(-82,83)
np.histogram2d(lon, lat, (a,b))
#(array([[ 0.,  0.,  1., ...,  0.,  0.,  0.],
#        [ 0.,  2.,  0., ...,  0.,  0.,  1.],
#        [ 0.,  0.,  0., ...,  0.,  1.,  0.],
#        ...,
#        [ 0.,  1.,  0., ...,  0.,  0.,  0.],
#        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
#        [ 0.,  0.,  0., ...,  0.,  0.,  0.]]),
Assigning the histogram output, the indices where you have a nonzero count would be at:
cts, xs, ys = np.histogram2d(lon, lat, (a, b))
cts.nonzero()
#(array([  0,   0,   0, ..., 359, 359, 359]),
# array([  2,  23,  25, ..., 126, 140, 155]))
You can plot it too:
from matplotlib import pyplot
pyplot.imshow(cts, extent=(-82,82,-180,180))
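If, as in the follow-up above, you also need the indices of the points falling into each grid cell (not just the counts), one vectorized sketch (assuming the 1° rounding from the question; the variable names are illustrative) is to give every point a flat cell id and group the indices with a single sort:
ix = np.around(lon).astype(int) + 180           # 0..360 column index
iy = np.around(lat).astype(int) + 82            # 0..164 row index
cell = ix * 165 + iy                            # flat cell id per point
order = np.argsort(cell, kind='stable')         # point indices sorted by cell
bounds = np.flatnonzero(np.diff(cell[order])) + 1
groups = np.split(order, bounds)                # one index array per occupied cell
Each element of groups is exactly the set of indices you can then use to test your measurement array, e.g. np.where(array[groups[0]] == 0).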

Python time optimisation of for loop using newaxis

I need to calculate n points (3D) with equal spacing along a defined line (3D).
I know the starting and end point of the line. First, I used
for k in range(nbin):
    step = k/float(nbin-1)
    bin_point.append(beam_entry+(step*(beamlet_intersection-beam_entry)))
Then I found that using append for large arrays takes more time, so I changed the code like this:
bin_point = [start_point+((k/float(nbin-1))*(end_point-start_point)) for k in range(nbin)]
I got a suggestion that using newaxis will further improve the time.
The modified code looks like this.
step = arange(nbin) / float(nbin-1)
bin_point = start_point + ( step[:,newaxis,newaxis]*((end_point - start_point))[newaxis,:,:] )
But I could not understand the newaxis function, and I also doubt whether the same code will work if the structure or the shape of start_point and end_point is changed. Similarly, how can I use newaxis to modify the following code?
for j in range(32): # for all los
    line_dist[j] = sqrt([sum(l) for l in (end_point[j]-start_point[j])**2])
Sorry for being so clunky; to be more clear, the structure of start_point and end_point is
array([ [[1,1,1],[],[],[]....[]],
[[],[],[],[]....[]],
[[],[],[],[]....[]]......,
[[],[],[],[]....[]] ])
Explanation of the newaxis version in the question: these are not matrix multiplies; ndarray multiplication is element-by-element multiplication with broadcasting. step[:,newaxis,newaxis] is num_steps x 1 x 1 and point[newaxis,:,:] is 1 x num_points x num_dimensions. Broadcasting ndarrays with shapes (num_steps x 1 x 1) and (1 x num_points x num_dimensions) together works because the broadcasting rule is that every dimension must be either 1 or the same; a dimension of 1 just means "repeat the array along that dimension as many times as the corresponding dimension of the other array". This results in an ndarray of shape (num_steps x num_points x num_dimensions) in a very efficient way; the i, j, k element is the k-th coordinate of the i-th step along the j-th line (given by the j-th pair of start and end points).
Walkthrough:
>>> start_points = numpy.array([[1, 0, 0], [0, 1, 0]])
>>> end_points = numpy.array([[10, 0, 0], [0, 10, 0]])
>>> steps = numpy.arange(10)/9.0
>>> start_points.shape
(2, 3)
>>> steps.shape
(10,)
>>> steps[:,numpy.newaxis,numpy.newaxis].shape
(10, 1, 1)
>>> (steps[:,numpy.newaxis,numpy.newaxis] * start_points).shape
(10, 2, 3)
>>> (steps[:,numpy.newaxis,numpy.newaxis] * (end_points - start_points)) + start_points
array([[[  1.,   0.,   0.],
        [  0.,   1.,   0.]],
       [[  2.,   0.,   0.],
        [  0.,   2.,   0.]],
       [[  3.,   0.,   0.],
        [  0.,   3.,   0.]],
       [[  4.,   0.,   0.],
        [  0.,   4.,   0.]],
       [[  5.,   0.,   0.],
        [  0.,   5.,   0.]],
       [[  6.,   0.,   0.],
        [  0.,   6.,   0.]],
       [[  7.,   0.,   0.],
        [  0.,   7.,   0.]],
       [[  8.,   0.,   0.],
        [  0.,   8.,   0.]],
       [[  9.,   0.,   0.],
        [  0.,   9.,   0.]],
       [[ 10.,   0.,   0.],
        [  0.,  10.,   0.]]])
As you can see, this produces the correct answer :) In this case broadcasting (10,1,1) and (2,3) results in (10,2,3). What you had is broadcasting (10,1,1) and (1,2,3) which is exactly the same and also produces (10,2,3).
The code for the distance part of the question does not need newaxis: the inputs are num_points x num_dimensions, the output is num_points, so one dimension has to be removed; that is the axis you sum along. This should work:
line_dist = numpy.sqrt( numpy.sum( (end_point - start_point) ** 2, axis=1 ) )
Here numpy.sum(..., axis=1) means sum along that axis only, rather than over all elements: an ndarray with shape num_points x num_dimensions summed along axis=1 produces a result of shape num_points, which is correct.
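Checking it against the walkthrough's points:
>>> numpy.sqrt(numpy.sum((end_points - start_points) ** 2, axis=1))
array([ 9.,  9.])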
I haven't worked through everything you wrote yet, but I can already tell you some things; maybe they help.
newaxis is a marker rather than a function (in fact, it is plain None). It is used to add an (unused) dimension to a multi-dimensional value. With it you can make a 3D value out of a 2D value (or even more). Each dimension already present in the input value must be represented by a colon : in the index (assuming you want to use all values; otherwise it gets complicated beyond our use case), and the dimensions to be added are denoted by newaxis.
Example:
the input is a one-dimensional vector (1D): 1, 2, 3
the output shall be a matrix (2D)
There are two ways to accomplish this: the vector could fill the rows with one value each, or it could fill just the first and only row of the matrix. The former is created by vector[:,newaxis], the latter by vector[newaxis,:]. Results of this:
>>> array([ 7,8,9 ])[:,newaxis]
array([[7],
[8],
[9]])
>>> array([ 7,8,9 ])[newaxis,:]
array([[7, 8, 9]])
(Dimensions of multi-dimensional values are represented by nesting of arrays of course.)
If you have more dimensions in the input, use the colon more than once (otherwise the deeper nested dimensions are simply ignored, i.e. the arrays are treated as simple values). I won't paste a representation of this here, as it wouldn't clarify things, given the visual complexity of writing 3D and 4D values on a 2D display with nested brackets. I hope it gets clear anyway.
newaxis reshapes the array in such a way that, when you multiply, numpy uses broadcasting. Here is a good tutorial on broadcasting.
step[:, newaxis, newaxis] is the same as step.reshape((step.shape[0], 1, 1)) (if step is 1d). Either method of reshaping should be very fast, because reshaping arrays in numpy is very cheap: it just makes a view of the array, and in any case you should only be doing it once.
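A tiny check that both spellings give the same shape and, being views, copy no data:
>>> import numpy
>>> step = numpy.arange(5) / 4.0
>>> step[:, numpy.newaxis, numpy.newaxis].shape
(5, 1, 1)
>>> step.reshape((step.shape[0], 1, 1)).shape
(5, 1, 1)
>>> numpy.shares_memory(step, step[:, numpy.newaxis, numpy.newaxis])
True
>>> numpy.shares_memory(step, step.reshape((step.shape[0], 1, 1)))
True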
