Could someone explain the difference in the list sizes? One is (x, 1) and the other is (x,). I think I get an IndexError because of that.
Thanks
print(Annotation_Matrix)
[array([[1],
        ...,
        [7],
        [7],
        [7]], dtype=uint8)]
print(idx)
[array([ True, True, True, ..., False, False, False], dtype=bool)]
P.S. The first one is created with
matlabfile.get(...)
and the second one with
in1d(...)
An array A of shape (x, 1) is a matrix of x rows and 1 column (2 dimensions), which differs from A.T of shape (1, x). They have the same elements but in a different 'orientation'.
An array B of shape (x,) is a vector of x coordinates (1 dimension), without any orientation (it's neither a row nor a column). It's just a flat sequence of elements.
In the first case, one can access an element with A[i,:], which holds the same value as A[i,0] (because it has only one column).
In the latter, the call B[i,:] raises an error because the array B has only one dimension. The correct call is B[i].
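A minimal sketch (with made-up data shaped like yours) showing the difference, and one way to get from (x, 1) to (x,):
import numpy as np

A = np.array([[1], [7], [7], [7]], dtype=np.uint8)  # shape (4, 1): 2 dimensions
B = np.array([True, True, False, False])            # shape (4,): 1 dimension

print(A.shape, B.shape)  # (4, 1) (4,)
print(A[1, 0])           # 7 -- two indices for the 2-D array
print(B[1])              # True -- one index for the 1-D array

# To get a flat (x,) view of A, e.g. for use alongside B:
print(A.ravel().shape)   # (4,)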
I hope this helps you to solve the problem.
Related
Say I have a numpy array a of elements (which do not repeat), np.array([1,3,5,2,4]). I would like to retrieve the indices at which a contains the elements [4,2]. Desired output: np.array([3,4]), as these are the indices of the requested elements.
So far, I've tried
np.all(np.array([[1,2,3,4]]).transpose(), axis=1, where=lambda x: x in [1,2])
>>>
array([ True, True, True, True])
But this result does not make sense to me. Elements at indices 2,3 should be False
Perhaps I need to search for one element at a time, but I'd prefer if this operation could be vectorized/fast.
I'd say the function you're looking for is numpy.isin()
arr = np.array([[1,2,3,4]])
print(np.where(np.isin(arr, [1,2])))
That should give the output you're looking for.
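Note that arr in that snippet is 2-D (it keeps the shape from your attempt). Applied to the original 1-D array, a sketch could look like this:
import numpy as np

a = np.array([1, 3, 5, 2, 4])
# np.isin builds a boolean mask; np.where converts it to indices.
idx = np.where(np.isin(a, [4, 2]))[0]
print(idx)  # [3 4]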
I have these two numpy arrays in Python:
a = np.array(sorted(np.random.rand(6)*6)) # It is sorted.
b = np.array(np.random.rand(3)*6)
Say that the arrays are
a = array([0.27148588, 0.42828064, 2.48130785, 4.01811243, 4.79403723, 5.46398145])
b = array([0.06231266, 1.64276013, 5.22786201])
I want to produce an array containing the indices where a is <= each element in b, i.e. I want exactly this:
np.argmin(np.array([a<b_i for b_i in b]),1)-1
which produces array([-1, 1, 4]) meaning that b[0]<a[0], a[1]<b[1]<a[2] and a[4]<b[2]<a[5].
Is there any native numpy fast vectorized way of doing this avoiding the for loop?
To answer your specific question, i.e., a vectorized way to get the equivalent of np.array([a < b_i for b_i in b]), you can take advantage of broadcasting. Here, you could use:
a[None, ...] < b[..., None]
So:
>>> a[None, ...] < b[..., None]
array([[False, False, False, False, False, False],
       [ True,  True, False, False, False, False],
       [ True,  True,  True,  True,  True, False]])
Importantly, for broadcasting:
>>> a[None, ...].shape, b[..., None].shape
((1, 6), (3, 1))
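Putting it together, here's a sketch of a drop-in replacement for your original expression (same sample data as in your question):
import numpy as np

a = np.array([0.27148588, 0.42828064, 2.48130785, 4.01811243, 4.79403723, 5.46398145])
b = np.array([0.06231266, 1.64276013, 5.22786201])

# Broadcasting builds the (3, 6) comparison table in one shot;
# argmin then finds the first False per row, just like your loop version.
result = np.argmin(a[None, ...] < b[..., None], axis=1) - 1
print(result)  # [-1  1  4]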
Here's the link to the official numpy docs to understand broadcasting. Some relevant tidbits:
When operating on two arrays, NumPy compares their shapes
element-wise. It starts with the trailing (i.e. rightmost) dimensions
and works its way left. Two dimensions are compatible when
they are equal, or
one of them is 1
...
When either of the dimensions compared is one, the other is used. In
other words, dimensions with size 1 are stretched or “copied” to match
the other.
Edit
As noted in the comments under your question, an entirely different approach is algorithmically much better than your brute-force solution: take advantage of binary search, using np.searchsorted.
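A minimal sketch of that approach, using the sample data from the question:
import numpy as np

a = np.array([0.27148588, 0.42828064, 2.48130785, 4.01811243, 4.79403723, 5.46398145])
b = np.array([0.06231266, 1.64276013, 5.22786201])

# searchsorted returns the insertion points that keep a sorted;
# subtracting 1 gives the index of the last element of a below each b,
# or -1 when b falls before a[0].
print(np.searchsorted(a, b) - 1)  # [-1  1  4]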
a = [[1.], [-1.], [1.]]
I have the above list. I want to find the value: (# of -1)/length of a. For the above example, the value is 1/3.
If we have
a = [[1.], [-1.], [1.], [-1.]]
then the value is 1/2.
How can I perform the above calculation in Python?
I've tried a.count(-1)/a.shape[0], but that did not seem to work for list objects.
Two things: first, your list contains lists. Doing a.count(-1) searches the list for the integer -1. Instead, you have to search your list for the list containing -1, like so: a.count([-1.]).
Second, if you are just using a regular list then it does not have a shape property. That is for ndarrays. Instead just use len(a).
So instead of using a.count(-1)/a.shape[0] you should use a.count([-1.])/len(a).
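A quick runnable check, using the second example from the question:
a = [[1.], [-1.], [1.], [-1.]]
# count matches whole sublists; len gives the total number of sublists.
print(a.count([-1.]) / len(a))  # 0.5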
Edit: When a is an ndarray
This can be done quite easily in the case where a is in fact an ndarray, but will look different from the case where a is just a python list. In a simple case of a list of lists containing a single real number each, the solution by Andy L. works fine.
However, recall that ndarrays are basically matrices, and checking equality as suggested by Andy L. (a == -1) checks element-wise equality over the entire matrix. The downside arises when you want to count the rows of an ndarray that match a given row of length greater than 1 (note that the solution I suggested above still works if you want to count the lists in a Python list matching a given list of arbitrary length).
An example:
Suppose we have an array
a = np.array([
    [1, 2],
    [2, 2],
    [1, 3],
    [1, 2]
])
And we want to find the proportion of rows equal to [1, 2] (in this case .5). The solution proposed by Andy L. will not quite work in this case because if we try a == [1, 2] we will get the element-wise truth array:
[[ True,  True],
 [False,  True],
 [ True, False],
 [ True,  True]]
Calling .mean() on this array will give us 6/8 = 0.75, not what we want. So we must add an extra step:
temp = (a == [1, 2]).all(axis=1)
proportion = np.sum(temp) / temp.shape[0]
Calling .all(axis=1) will reduce the array to a 1-dimensional array where each value is True if the corresponding row of a == [1, 2] was [True, True], and False otherwise. This gives the desired result for checking row equality for rows of arbitrary length.
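Since the mean of a boolean array is exactly a proportion, the two steps can also be written as a compact one-liner (same array as above):
import numpy as np

a = np.array([[1, 2], [2, 2], [1, 3], [1, 2]])
# .all(axis=1) collapses each row to one boolean; .mean() turns the
# True-count into a proportion.
print((a == [1, 2]).all(axis=1).mean())  # 0.5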
You may try:
a = [[1.], [-1.], [1.], [-1.]]
len([x[0] for x in a if x[0]==-1])/len(a)
Output
0.5
a.count([-1.]) / len(a)
Note: be careful with float equality in general, though.
Since you say a is a numpy.ndarray, simply compare against -1 and use mean to get your desired output:
a = np.array([[1.], [-1.], [1.]])
In [1144]: (a == -1).mean()
Out[1144]: 0.3333333333333333
In [1146]: a = np.array([[1.], [-1.], [1.], [-1.]])
In [1147]: (a == -1).mean()
Out[1147]: 0.5
What is the fastest method to delete elements from a numpy array while retrieving their initial positions? The following code does not return all the elements that it should:
list = []
for pos, i in enumerate(ARRAY):
    if i < some_condition:
        list.append(pos)  # This is where the loop fails
for _ in list:
    ARRAY = np.delete(ARRAY, _)
It really feels like you're going about this inefficiently. You should probably be using more builtin numpy capabilities -- e.g. np.where, or boolean indexing. Using np.delete in a loop like that is going to kill any performance gains you get from using numpy...
For example (with boolean indexing):
keep = np.ones(ARRAY.shape, dtype=bool)
for pos, val in enumerate(ARRAY):
    if val < some_condition:
        keep[pos] = False
ARRAY = ARRAY[keep]
Of course, this could possibly be simplified (and generalized) even further:
ARRAY = ARRAY[ARRAY >= some_condition]
EDIT
You've stated in the comments that you need the same mask to operate on other arrays as well -- That's not a problem. You can keep a handle on the mask and use it for other arrays:
mask = ARRAY >= some_condition
ARRAY = ARRAY[mask]
OTHER_ARRAY = OTHER_ARRAY[mask]
...
Additionally (and perhaps this is the reason your original code isn't working), as soon as you delete the first index from the array in your loop, all of the other items shift one index to the left, so you're not actually deleting the same items that you "tagged" on the initial pass.
As an example, let's say that your original array was [a, b, c, d, e] and on the initial pass you tagged the elements at indexes [0, 2] for deletion (a, c)... On the first pass through your delete loop, you'd remove the item at index 0, which would make your array:
[b, c, d, e]
Now, on the second iteration of your delete loop, you're going to delete the item at index 2 in the new array:
[b, c, e]
But look, instead of removing c like we wanted, we actually removed d! Oh snap!
To fix that, you could probably write your loop over reversed(list), but that still won't result in a fast operation.
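For completeness, a minimal sketch of the masking approach that also keeps the initial positions of the deleted items (the data and threshold here are made up):
import numpy as np

ARRAY = np.array([5, 1, 8, 2, 9])
some_condition = 4  # hypothetical threshold

mask = ARRAY >= some_condition
removed_positions = np.where(~mask)[0]  # initial positions of the deleted items
ARRAY = ARRAY[mask]
print(removed_positions)  # [1 3]
print(ARRAY)              # [5 8 9]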
You don't need to iterate, especially with a simple condition like this. And you don't really need to use delete:
A sample array:
In [693]: x=np.arange(10)
A mask, a boolean array where a condition is true (or false):
In [694]: msk = x%2==0
In [695]: msk
Out[695]: array([ True, False, True, False, True, False, True, False, True, False], dtype=bool)
where (or nonzero) converts it to indices:
In [696]: ind=np.where(msk)
In [697]: ind
Out[697]: (array([0, 2, 4, 6, 8], dtype=int32),)
You can use the whole ind in one call to delete (no need to iterate):
In [698]: np.delete(x,ind)
Out[698]: array([1, 3, 5, 7, 9])
You can use ind to retain those values instead:
In [699]: x[ind]
Out[699]: array([0, 2, 4, 6, 8])
Or you can use the boolean msk directly:
In [700]: x[msk]
Out[700]: array([0, 2, 4, 6, 8])
or use its inverse:
In [701]: x[~msk]
Out[701]: array([1, 3, 5, 7, 9])
delete doesn't do much more than this kind of boolean masking. It's all Python code, so you can easily study it.
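For instance, you can print its source right from an interactive session:
import numpy as np
np.source(np.delete)  # prints the pure-Python implementation of np.delete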
I want to calculate an indexed weight sum across a large (1,000,000 x 3,000) boolean numpy array. The large boolean array changes infrequently, but the weights come at query time, and I need answers very fast, without copying the whole large array, or expanding the small weight array to the size of the large array.
The result should be an array with 1,000,000 entries, each having the sum of the weights array entries corresponding to that row's True values.
I looked into using masked arrays, but they seem to require building a weights array the size of my large boolean array.
The code below gives the correct results, but I can't afford that copy during the multiply step. The multiply isn't even necessary, since the values array is boolean, but at least it handles the broadcasting properly.
I'm new to numpy, and loving it, but I'm about to give up on it for this particular problem. I've learned enough numpy to know to stay away from anything that loops in Python.
My next step will be to write this routine in C (which has the added benefit of letting me save memory by using bits instead of bytes, by the way).
Unless one of you numpy gurus can save me from cython?
import numpy
from numpy import array, multiply, sum

# Construct an example values array, alternating True and False.
# This represents four records of three attributes each:
# array([[False,  True, False],
#        [ True, False,  True],
#        [False,  True, False],
#        [ True, False,  True]], dtype=bool)
values = array([(x % 2) for x in range(12)], dtype=bool).reshape((4, 3))

# Construct example weights, one for each attribute:
# array([1, 2, 3])
weights = array(range(1, 4))

# Create expensive NEW array with the weights for the True attributes.
# Broadcast the weights array into the values array.
# array([[0, 2, 0],
#        [1, 0, 3],
#        [0, 2, 0],
#        [1, 0, 3]])
weighted = multiply(values, weights)

# Add up the weights:
# array([2, 4, 2, 4])
answers = sum(weighted, axis=1)
print(answers)

# Rejected masked_array solution is too expensive (and oddly inverts
# the results):
masked = numpy.ma.array([[1, 2, 3]] * 4, mask=values)
The dot product (or inner product) is what you want. It allows you to take a matrix of size m×n and a vector of length n and multiply them together, yielding a vector of length m, where each entry is the weighted sum of a row of the matrix with the entries of the vector as weights.
Numpy implements this as array1.dot(array2) (or numpy.dot(array1, array2) in older versions). e.g.:
from numpy import array
values = array([(x % 2) for x in range(12)], dtype=bool).reshape((4, 3))
weights = array(range(1, 4))
answers = values.dot(weights)
print(answers)
# output: [2 4 2 4]
(You should benchmark this though, using the timeit module.)
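For instance, a rough benchmarking sketch (the array sizes here are made up for illustration):
import timeit
import numpy as np

values = np.random.rand(100000, 30) > 0.5  # hypothetical boolean array
weights = np.arange(1, 31)

# Time 100 runs of the dot product.
print(timeit.timeit(lambda: values.dot(weights), number=100))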
It seems likely that dbaupp's answer is the correct one. But just for the sake of diversity, here's another solution that saves memory. This will work even for operations that don't have a built-in numpy equivalent.
>>> values = numpy.array([(x % 2) for x in range(12)], dtype=bool).reshape((4,3))
>>> weights = numpy.array(range(1, 4))
>>> weights_stretched = numpy.lib.stride_tricks.as_strided(weights, (4, 3), (0, 8))
numpy.lib.stride_tricks.as_strided is a wonderful little function! It allows you to specify shape and strides values that allow a small array to mimic a much larger array. Observe -- there aren't really four rows here; it just looks that way:
>>> weights_stretched[0][0] = 4
>>> weights_stretched
array([[4, 2, 3],
[4, 2, 3],
[4, 2, 3],
[4, 2, 3]])
So instead of passing a huge array to MaskedArray, you can pass a smaller one. (But as you've already noticed, numpy masking works in the opposite way you might expect; truth masks, rather than revealing, so you'll have to store your values inverted.) As you can see, MaskedArray doesn't copy any data; it just reflects whatever is in weights_stretched:
>>> masked = numpy.ma.MaskedArray(weights_stretched, numpy.logical_not(values))
>>> weights_stretched[0][0] = 1
>>> masked
masked_array(data =
 [[-- 2 --]
  [1 -- 3]
  [-- 2 --]
  [1 -- 3]],
 mask =
 [[ True False  True]
  [False  True False]
  [ True False  True]
  [False  True False]],
 fill_value=999999)
Now we can just pass it to sum:
>>> numpy.sum(masked, axis=1)
masked_array(data = [2 4 2 4],
mask = [False False False False],
fill_value=999999)
I benchmarked numpy.dot and the above against a 1,000,000 x 30 array. This is the result on a relatively modern MacBook Pro (numpy.dot is dot1; mine is dot2):
>>> %timeit dot1(values, weights)
1 loops, best of 3: 194 ms per loop
>>> %timeit dot2(values, weights)
1 loops, best of 3: 459 ms per loop
As you can see, the built-in numpy solution is faster. But stride_tricks is worth knowing about regardless, so I'm leaving this.
Would this work for you?
a = np.array([sum(row * weights) for row in values])
This uses sum() to immediately sum the row * weights values, so you don't need the memory to store all the intermediate values. Then the list comprehension collects all the values.
You said you want to avoid anything that "loops in Python". This at least does the looping with the C guts of Python, rather than an explicit Python loop, but it can't be as fast as a NumPy solution because that uses compiled C or Fortran.
I don't think you need numpy for something like that. And 1000000 by 3000 is a huge array; this will not fit in your RAM, most likely.
I would do it this way:
Let's say that you data is originally in a text file:
False,True,False
True,False,True
False,True,False
True,False,True
My code:
weight = range(1, 4)
dicto = {'True': 1, 'False': 0}
with open('my_data.txt') as fin:
    a = sum(sum(dicto[ele]*w for ele, w in zip(line.strip().split(','), weight)) for line in fin)
Result:
>>> a
12
EDIT:
I think I slightly misread the question the first time around and summed everything together. Here is the solution that gives the exact result the OP is after:
weight = range(1, 4)
dicto = {'True': 1, 'False': 0}
with open('my_data.txt') as fin:
    a = [sum(dicto[ele]*w for ele, w in zip(line.strip().split(','), weight)) for line in fin]
Result:
>>> a
[2, 4, 2, 4]