Numpy array subset - unexpected behaviour - python

I'm trying to copy a subset of a numpy array (to do image background subtraction - but that's by the by). I don't understand what's wrong with the following - I've demonstrated it interactively because you really don't want to wade through all my code...
>>> from numpy import zeros
>>> a = zeros((5,5,3), 'uint8')
>>> print a.shape
(5, 5, 3)
>>> b = a[1:2][1:2][:].copy()
>>> print b.shape
(0, 5, 3)
>>> print a[1:2][1:2][:].shape
(0, 5, 3)
>>> print a.shape
(5, 5, 3)
>>>
What I'd like is for b.shape to return (2,2,3) - and behave that way in the subsequent operations I need to do with it. I'm sure I've done something really obvious wrong, but I can't work out what. Any suggestions gratefully received!

I believe you meant a[1:3, 1:3, :] instead of a[1:2][1:2][:].
Also, a[1:3, 1:3, ...] would work too (... means "as many : as necessary"). NumPy seems to also allow a[1:3, 1:3].
There are two parts to the explanation:
slicing in Python is left-inclusive and right-exclusive
comma indexing is necessary here: a[1:3] gives you an array of shape (2,5,3), and a further [1:3] slices along the first dimension of that result again.
For simple integer indexing, a[1][2][3] is the same as a[1,2,3], because each consecutive index removes one dimension. That doesn't hold for slicing, though - you need to use commas.
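For instance, the comma form gives the shape the question was after:
>>> a[1:3, 1:3, :].shape
(2, 2, 3)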

There are two different problems with what you're doing. The primary one is how you're handling indexing in NumPy. NumPy arrays have their own indexing syntax, which is much clearer than the chained list-of-lists style you're using. Use commas instead of separate bracketed indices:
>>> from numpy import zeros
>>> a = zeros((5,5,3), 'uint8')
>>> print a[1:2,1:2,:].shape
(1, 1, 3)
What you're doing instead fails because a[1:2] returns a new array of shape (1, 5, 3), so the next [1:2] slices along the first axis of that result (which has only one element) rather than along the second axis of a:
>>> a[1:2]
array([[[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]], dtype=uint8)
>>> a[1:2][1:2]
array([], shape=(0, 5, 3), dtype=uint8)
(You wouldn't have this problem if you were using simple indices instead of slices, but you should still use the comma syntax because it's much clearer.)
Second, you're using slices wrong. The first value of the slice is the index of the first element you want, and indices start at 0. The second value is one more than the index of the last element you want, so that a[first_index:second_index] returns second_index - first_index elements. So, you want something like this:
>>> b = a[0:2,0:2,:]
>>> b
array([[[0, 0, 0],
        [0, 0, 0]],
       [[0, 0, 0],
        [0, 0, 0]]], dtype=uint8)
Your index of [1:2] will only return one element... the second one in the list.
Also, as a side note: keep the .copy(). Basic slicing of a NumPy array returns a view onto the same data, not a new object, so if you want an independent copy to work on (as you do for background subtraction), the .copy() is what gives you one.
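A quick demonstration of the difference, continuing with the a from above:
>>> view = a[0:2, 0:2, :]
>>> copied = a[0:2, 0:2, :].copy()
>>> view[0, 0, 0] = 9
>>> a[0, 0, 0]        # the view writes through to a
9
>>> copied[0, 0, 0]   # the copy is independent
0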

Related

Multiply each element of a list by an entire other list

I have two lists which are very large. The basic structure is :
a = [1,0,0,0,1,1,0,0] and b=[1,0,1,0]. There is no restriction on the length of either list and there is also no restriction on the value of the elements in either list.
I want to multiply each element of a by the contents of b.
For example, the following code does the job:
multiplied = []
for a_bit in a:
    for b_bit in b:
        multiplied.append(a_bit*b_bit)
So for the even simpler case of a=[1,0] and b = [1,0,1,0], the output multiplied would be equal to:
>>> print(multiplied)
[1,0,1,0,0,0,0,0]
Is there a way with numpy or map or zip to do this? There are similar questions that are multiplying lists with lists and a bunch of other variations but I haven't seen this one. The problem is that, my nested for loops above are fine and they work but they take forever to process on larger arrays.
You can do this by taking an outer product (a column vector times a row vector, via broadcasting) and then flattening the result.
>>> a = np.array([1,0]).reshape(-1,1)
>>> b = np.array([1,0,1,0])
>>> a*b
array([[1, 0, 1, 0],
       [0, 0, 0, 0]])
>>> (a*b).flatten()
array([1, 0, 1, 0, 0, 0, 0, 0])
>>>
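For what it's worth, np.outer computes the same outer product without the explicit reshape, so this should give the same result:
>>> np.outer(a, b).ravel()
array([1, 0, 1, 0, 0, 0, 0, 0])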

Numpy Conditionally Replace Column Elements

So I already took a look at this question.
I know you can conditionally replace a single column, but what about multiple columns? When I tried it, it doesn't seem to work.
the_data = np.array([[0, 1, 1, 1],
                     [0, 1, 3, 1],
                     [3, 4, 1, 3],
                     [0, 1, 2, 0],
                     [2, 1, 0, 0]])
the_data[:,0][the_data[:,0] == 0] = -1 # this works
columns_to_replace = [0, 1, 3]
the_data[:,columns_to_replace][the_data[:,columns_to_replace] == 0] = -1 # this does not work
I initially thought that the second case doesn't work because I thought the_data[:,columns_to_replace] creates a copy instead of directly referencing the elements. However, if that were the case, then the first case shouldn't work either, when you are only replacing the single column.
You're indeed getting a copy because you're using advanced indexing:
Advanced indexing is triggered when the selection object, obj, is a non-tuple sequence object, an ndarray (of data type integer or bool), or a tuple with at least one sequence object or ndarray (of data type integer or bool). There are two types of advanced indexing: integer and Boolean.
Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).
(Taken from the docs)
The first part works because it uses basic slicing.
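You can verify this directly with np.shares_memory (available since NumPy 1.11):
>>> np.shares_memory(the_data, the_data[:, 0])          # basic slicing: a view
True
>>> np.shares_memory(the_data, the_data[:, [0, 1, 3]])  # advanced indexing: a copy
False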
I think you can do this without copying, but still with some memory overhead:
columns_to_replace = [0, 1, 3]
mask = np.zeros(the_data.shape, bool)  # a boolean mask, to keep the memory overhead small
mask[:, columns_to_replace] = 1
np.place(the_data, (the_data == 0) * mask, [-1]) # this doesn't copy anything
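If a temporary copy is acceptable, a simpler route (a sketch, not part of the original answer) is to modify the copy that the fancy index returns and assign it back; assignment through a fancy index on the left-hand side does write into the_data:
sub = the_data[:, columns_to_replace]    # advanced indexing: sub is a copy
sub[sub == 0] = -1
the_data[:, columns_to_replace] = sub    # this assignment writes back into the_data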

Seemingly inconsistent slicing behavior in numpy arrays

I ran across something that seemed to me like inconsistent behavior in Numpy slices. Specifically, please consider the following example:
import numpy as np
a = np.arange(9).reshape(3,3) # a 2d numpy array
y = np.array([1,2,2]) # vector that will be used to index the array
b = a[np.arange(len(a)),y] # a vector (what I want)
c = a[:,y] # a matrix ??
I wanted to obtain a vector such that the i-th element is a[i,y[i]]. I tried two things (b and c above) and was surprised that b and c are not the same... in fact one is a vector and the other is a matrix! I was under the impression that : was shorthand for "all elements" but apparently the meaning is somewhat more subtle.
After trial and error I somewhat understand the difference now (b == np.diag(c)), but would appreciate clarification on why they are different, what exactly using : implies, and how to understand when to use either case.
Thanks!
It's hard to understand advanced indexing (with lists or arrays) without understanding broadcasting.
In [487]: a=np.arange(9).reshape(3,3)
In [488]: idx = np.array([1,2,2])
Index with a (3,) and (3,) producing shape (3,) result:
In [489]: a[np.arange(3),idx]
Out[489]: array([1, 5, 8])
Index with (3,1) and (3,), result is (3,3)
In [490]: a[np.arange(3)[:,None],idx]
Out[490]:
array([[1, 2, 2],
       [4, 5, 5],
       [7, 8, 8]])
The slice : does basically the same thing. There are subtle differences, but here it's the same.
In [491]: a[:,idx]
Out[491]:
array([[1, 2, 2],
       [4, 5, 5],
       [7, 8, 8]])
ix_ does the same thing, converting the (3,) & (3,) to (3,1) and (1,3):
In [492]: np.ix_(np.arange(3),idx)
Out[492]:
(array([[0],
       [1],
       [2]]), array([[1, 2, 2]]))
A broadcasted sum might help visualize the two cases:
In [495]: np.arange(3)*10+idx
Out[495]: array([ 1, 12, 22])
In [496]: np.sum(np.ix_(np.arange(3)*10,idx),axis=0)
Out[496]:
array([[ 1,  2,  2],
       [11, 12, 12],
       [21, 22, 22]])
When you pass
np.arange(len(a)), y
you can view the result as all the index pairs formed by zipping the two sequences you passed. In this case, indexing by np.arange(len(a)) and y
np.arange(len(a))
# [0, 1, 2]
y
# [1, 2, 2]
effectively takes elements: (0, 1), (1, 2), and (2, 2).
print(a[0, 1], a[1, 2], a[2, 2]) # 0th, 1st, 2nd elements from each indexer
# 1 5 8
In the second case, you take the entire slice along the first dimension. (Nothing before the colon.) So this is all elements along the 0th axis. You then specify with y that you want the 1st, 2nd, and 2nd element along each row. (0-indexed.)
As you pointed out, it may seem a bit unintuitive that the results are different, given that the individual pieces of the index are equivalent: on its own, the slice : selects the same rows as np.arange(len(a)) (so a[:] == a[np.arange(len(a))] element-wise), and y is the same second index in both expressions.
However, NumPy advanced indexing cares what type of data structure you pass when indexing (tuples, integers, etc). Things can become hairy very quickly.
The detail behind that is this: first consider all NumPy indexing to be of the general form x[obj], where obj is the evaluation of whatever you passed. How NumPy "behaves" depends on what type of object obj is:
Advanced indexing is triggered when the selection object, obj, is a
non-tuple sequence object, an ndarray (of data type integer or bool),
or a tuple with at least one sequence object or ndarray (of data type
integer or bool).
...
The definition of advanced indexing means that x[(1,2,3),] is
fundamentally different than x[(1,2,3)]. The latter is equivalent to
x[1,2,3] which will trigger basic selection while the former will
trigger advanced indexing. Be sure to understand why this occurs.
In your first case, obj = np.arange(len(a)),y, a tuple that fits the description quoted above (a tuple with at least one sequence object or ndarray). This triggers advanced indexing and forces the behavior described above.
As for the second case, [:,y]
When there is at least one slice (:), ellipsis (...) or np.newaxis in
the index (or the array has more dimensions than there are advanced
indexes), then the behaviour can be more complicated. It is like
concatenating the indexing result for each advanced index element.
Demonstrated:
# Concatenate the indexing result for each advanced index element.
np.vstack((a[0, y], a[1, y], a[2, y]))
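If the goal is just the per-row picks that b computes, two equivalent spellings are the diagonal of the sliced matrix, or np.take_along_axis (available since NumPy 1.15):
>>> np.diag(a[:, y])
array([1, 5, 8])
>>> np.take_along_axis(a, y[:, None], axis=1).ravel()
array([1, 5, 8])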

Difference between nonzero(a), where(a) and argwhere(a). When to use which?

In Numpy, nonzero(a), where(a) and argwhere(a), with a being a numpy array, all seem to return the non-zero indices of the array. What are the differences between these three calls?
On argwhere the documentation says:
np.argwhere(a) is the same as np.transpose(np.nonzero(a)).
Why have a whole function that just transposes the output of nonzero? When would that be so useful that it deserves a separate function?
What about the difference between where(a) and nonzero(a)? Wouldn't they return the exact same result?
nonzero and argwhere both give you information about where in the array the elements are True. where works the same as nonzero in the form you have posted, but it has a second form:
np.where(mask,a,b)
which can be roughly thought of as a numpy "ufunc" version of the conditional expression:
a[i] if mask[i] else b[i]
(with appropriate broadcasting of a and b).
As far as having both nonzero and argwhere, they're conceptually different. nonzero is structured to return an object which can be used for indexing. This can be lighter-weight than creating an entire boolean mask if the nonzero elements are sparse:
mask = a != 0        # a full boolean array, one bool per element
idx = np.nonzero(a)  # just a tuple of index arrays for the nonzero elements
Now you can use those indices to index other arrays, etc. However, in that form it's not very convenient to read off the coordinates of each nonzero element. That's where argwhere comes in: it gives you one row of coordinates per nonzero element.
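For example:
>>> a = np.array([[0, 4], [4, 0]])
>>> np.nonzero(a)
(array([0, 1]), array([1, 0]))
>>> np.argwhere(a)
array([[0, 1],
       [1, 0]])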
I can't comment on the usefulness of having a separate convenience function that transposes the result of another, but I can comment on where vs nonzero. In its simplest use case, where is indeed the same as nonzero.
>>> np.where(np.array([[0,4],[4,0]]))
(array([0, 1]), array([1, 0]))
>>> np.nonzero(np.array([[0,4],[4,0]]))
(array([0, 1]), array([1, 0]))
or
>>> a = np.array([[1, 2],[3, 4]])
>>> np.where(a == 3)
(array([1, 0]),)
>>> np.nonzero(a == 3)
(array([1, 0]),)
where is different from nonzero in the case when you wish to pick elements from array a where some condition is True and from array b where that condition is False.
>>> a = np.array([[6, 4],[0, -3]])
>>> b = np.array([[100, 200], [300, 400]])
>>> np.where(a > 0, a, b)
array([[  6,   4],
       [300, 400]])
Again, I can't explain why they added the nonzero functionality to where, but this at least explains how the two are different.
EDIT: Fixed the first example... my logic was incorrect previously

itertools product speed up

I use itertools.product to generate all possible variations of 4 elements of length 13. The 4 and 13 can be arbitrary, but as it is, I get 4^13 results, which is a lot. I need the result as a Numpy array and currently do the following:
c = it.product([1,-1,np.complex(0,1), np.complex(0,-1)], repeat=length)
sendbuf = np.array(list(c))
With some simple profiling code shoved in between, it looks like the first line is pretty much instantaneous, whereas the conversion to a list and then Numpy array takes about 3 hours.
Is there a way to make this quicker? It's probably something really obvious that I am overlooking.
Thanks!
The NumPy equivalent of itertools.product() is numpy.indices(), but it will only get you the product of ranges of the form 0,...,k-1:
numpy.rollaxis(numpy.indices((2, 3, 3)), 0, 4)
array([[[[0, 0, 0],
         [0, 0, 1],
         [0, 0, 2]],
        [[0, 1, 0],
         [0, 1, 1],
         [0, 1, 2]],
        [[0, 2, 0],
         [0, 2, 1],
         [0, 2, 2]]],
       [[[1, 0, 0],
         [1, 0, 1],
         [1, 0, 2]],
        [[1, 1, 0],
         [1, 1, 1],
         [1, 1, 2]],
        [[1, 2, 0],
         [1, 2, 1],
         [1, 2, 2]]]])
For your special case, you can use
a = numpy.indices((4,)*13)
b = 1j ** numpy.rollaxis(a, 0, 14)
(This won't run on a 32-bit system, because the array is too large. Extrapolating from the size I can test, it should run in less than a minute though.)
EDIT: Just to mention it: the call to numpy.rollaxis() is more or less cosmetic, to get the same output order as itertools.product(). If you don't care about the order of the indices, you can just omit it (but it is cheap anyway, as long as you don't have any follow-up operations that would transform your array into a contiguous array).
EDIT2: To get the exact analogue of
numpy.array(list(itertools.product(some_list, repeat=some_length)))
you can use
numpy.array(some_list)[numpy.rollaxis(
    numpy.indices((len(some_list),) * some_length), 0, some_length + 1)
    .reshape(-1, some_length)]
This got completely unreadable -- just tell me whether I should explain it any further :)
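A quick sanity check on a small case (some_length = 2) shows that it does match itertools.product:
>>> import itertools
>>> import numpy as np
>>> some_list, some_length = [1, -1, 1j, -1j], 2
>>> idx = np.rollaxis(np.indices((len(some_list),) * some_length), 0, some_length + 1).reshape(-1, some_length)
>>> np.array_equal(np.array(some_list)[idx],
...                np.array(list(itertools.product(some_list, repeat=some_length))))
True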
The first line seems instantaneous because no actual work happens there: a generator object is merely constructed, and the values are only computed when you iterate over it. As you said, you get 4^13 = 67108864 tuples, and all of them are computed and stored during your list() call. np.array accepts a list or a tuple, so you could try building a tuple from the iterator and passing that to np.array instead, to see whether it makes a difference for your use case; there are some reports that tuples are slightly faster, but you can only tell by measuring.
To try with a tuple, instead of list just do
sendbuf = np.array(tuple(c))
You could speed things up by skipping the conversion to a list:
numpy.fromiter(c, count=…) # Using count also speeds things up, but it's optional
With this function, the NumPy array is first allocated and then initialized element by element, without having to go through the additional step of a list construction.
PS: fromiter() does not handle the tuples returned by product(), so this might not be a solution, for now. If fromiter() did handle dtype=object, this should work, though.
PPS: As Joe Kington pointed out, this can be made to work by putting the tuples in a structured array. However, this does not appear to always give a speed up.
Let numpy.meshgrid do all the job:
length = 13
x = [1, -1, 1j, -1j]
mesh = numpy.meshgrid(*([x] * length))
result = numpy.vstack([y.flat for y in mesh]).T
On my notebook it takes ~2 minutes.
You might want to try a completely different approach: first create an empty array of the desired size:
result = np.empty((4**length, length), dtype=complex)
then use NumPy's slicing abilities to fill out the array yourself:
# Set up of the last "digit":
result[::4, length-1] = 1
result[1::4, length-1] = -1
result[2::4, length-1] = 1j
result[3::4, length-1] = -1j
You can do similar things for the other "digits" (i.e. the columns result[:, length-2] down to result[:, 0], with correspondingly longer runs of each value). The whole thing could certainly be put in a loop that iterates over each digit, as sketched below.
Transposing the whole operation (np.empty((length, 4**length)…)) is worth trying, as it might bring a speed gain (through a better use of the memory cache).
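As a rough sketch of what that digit loop could look like (the names digits and block are illustrative, not from the answer), each column can be built with np.repeat and np.tile so that the rows come out in itertools.product order:
import itertools
import numpy as np

digits = np.array([1, -1, 1j, -1j])
length = 3                                  # small for illustration; the question uses 13
n = len(digits)

result = np.empty((n ** length, length), dtype=complex)
for col in range(length):
    block = n ** (length - 1 - col)         # how many consecutive rows share a digit in this column
    result[:, col] = np.tile(np.repeat(digits, block), n ** col)

# same rows, same order as itertools.product
assert np.array_equal(result,
                      np.array(list(itertools.product(digits, repeat=length))))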
Probably not optimized but much less reliant on python type conversions:
import numpy as np

ints = [1, 2, 3, 4]
repeat = 3

def prod(ints, repeat):
    w = repeat
    l = len(ints)
    h = l**repeat
    ints = np.array(ints)
    A = np.empty((h, w), dtype=int)
    rng = np.arange(h)
    for i in range(w):
        x = l**i
        idx = np.mod(rng, l*x) // x  # integer division, so idx can be used as an index
        A[:, i] = ints[idx]
    return A
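A quick check (note that the first column cycles fastest here, so the row order differs from itertools.product):
>>> prod([1, 2], 2)
array([[1, 1],
       [2, 1],
       [1, 2],
       [2, 2]])
>>> prod([1, 2, 3, 4], 3).shape
(64, 3)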
