Can't delete element in ndarray - python

I am trying to delete the last element in an array, if the element does not meet certain conditions. The code I am using is:
# Set the distribution parameter to 2
a = 2
# Set the size to 100
s = 100
# Create Zipf's Law distribution using a and s
x = np.random.zipf(a, s)
# Reorder list by number frequency
xb = np.unique(x, return_counts=True)
print("X", x)
print("XB", xb)
for i in reversed(xb):
    if xb[-1] > xb[-2]*1.5:
        xb = np.delete(xb, -1)
        print("XB mod", xb)
print()
I get the following output from print("XB", xb):
XB (array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  28,  29,
        31,  33,  56, 225]), array([57, 17,  4,  4,  2,  1,  2,  2,  2,  1,
        2,  1,  1,  1,  1,  1,  1], dtype=int64))
However, when I try to run the deletion portion of the code, I get the following error:
Traceback (most recent call last):
  File "test2.py", line 22, in <module>
    if xb[-1] > xb[-2]*1.5:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Any idea how to fix it, so that I can delete the last element in the XB array, if it doesn't meet the condition?

xb is a tuple consisting of a pair of np.ndarray objects.
How do I delete the last element in the XB array, if it doesn't meet the condition
If you want to delete the last pair of zipped values (e.g. 225 and 1 for your data) based on your condition where you compare the last two numbers of the first row of data (e.g. 225 > 56 * 1.5 for your data):
if xb[0][-1] > xb[0][-2] * 1.5:
    xb = tuple(x[:-1] for x in xb)
>>> xb
(array([ 1, 2, ..., 31, 33, 56]),
array([57, 17, ..., 1, 1, 1]))
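Putting this together as a runnable sketch (the array literals are copied from the question's np.unique output, so we skip the random generation step):

```python
import numpy as np

# values/counts copied from the question's np.unique(..., return_counts=True) output
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 28, 29, 31, 33, 56, 225])
counts = np.array([57, 17, 4, 4, 2, 1, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1])
xb = (values, counts)

# Compare the last two *values* (first array of the tuple), then trim both arrays
if xb[0][-1] > xb[0][-2] * 1.5:   # 225 > 56 * 1.5
    xb = tuple(a[:-1] for a in xb)

print(xb[0][-1], xb[1][-1])       # 56 1
```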

Short answer:
Use all:
for i in reversed(xb):
    if all(xb[-1] > xb[-2]*1.5):  # use all here
        xb = np.delete(xb, -1)
Equivalent: if (xb[-1] > xb[-2]*1.5).all():
Long answer:
You have:
xb
(array([ 1, 2, 3, 4, 5, 7, 9, 10, 13, 21, 22, 24, 30]),
array([62, 16, 2, 4, 6, 3, 1, 1, 1, 1, 1, 1, 1]))
that is, a tuple of NumPy arrays.
Next, xb[-1] > xb[-2]*1.5 returns:
array([ True, True, False, False, False, False, False, False, False,
False, False, False, False])
If you do not use all or any, this condition will raise the error shown above.

The problem is in if xb[-1] > xb[-2]*1.5.
Here xb[-1] is not a scalar but a vector (1-D array).
So what does v1 > v2 mean: all items, or at least one item?
Take for example [2, 3] > [1, 4]: all returns False because 3 > 4 is False; any, on the other hand, returns True because at least one comparison is True (2 > 1).
As the error says, it is ambiguous.
So, if for example you want that all the items will pass the condition you have to use:
if np.all(xb[-1] > xb[-2]*1.5): ...
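The [2, 3] > [1, 4] example above, as runnable code:

```python
import numpy as np

v1 = np.array([2, 3])
v2 = np.array([1, 4])
cmp = v1 > v2                 # elementwise: array([ True, False])

try:
    if cmp:                   # ambiguous: which truth value is meant?
        pass
except ValueError as err:
    print(err)                # "The truth value of an array ... is ambiguous ..."

print(np.all(cmp))            # False: 3 > 4 fails
print(np.any(cmp))            # True: 2 > 1 holds
```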

Related

Numpy array limiting operation X[X < {value}] = {value}

I came across the following in a piece of code:
X = numpy.array(...)
X[X < np.finfo(float).eps] = np.finfo(float).eps
I found out the following from the documentation:
class numpy.finfo(dtype):
Machine limits for floating point types.
Parameters:
dtype : float, dtype, or instance
Kind of floating point data-type about which to get information.
I understand that np.finfo(float).eps returns the machine epsilon, the smallest representable positive float such that 1.0 + eps != 1.0, and that X[X < np.finfo(float).eps] = np.finfo(float).eps makes sure that no value less than np.finfo(float).eps remains in the array X, but I'm unable to understand how exactly that happens in a statement of the form X[X < {value}] = {value} and what it means. Any help is much appreciated.
The first time I saw it was as a way to replace NaNs in an array
Basically the conditional X < np.finfo(float).eps creates a boolean mask of X, and then the values of X at the positions where the mask is True are replaced.
So for instance,
x=np.array([-4, -3, -2, -1, 0, 1, 2, 3, 4])
x[x < 0] = 0
Here the mask array would look like,
[True, True, True, True, False, False, False, False, False]
It's a quicker way of doing the following with large arrays:
x = np.array([-4, -3, -2, -1, 0, 1, 2, 3, 4])
for idx, y in enumerate(x):
    if y < 0:
        x[idx] = 0
This is a fancy way of changing the values of an array when a condition is met.
On an easy example:
X = np.random.randint(1, 100, size=5)
print(X) # array([ 1, 17, 92, 9, 11])
X[X < 50] = 50 # Change any value lower than 50 to 50
print(X) # array([50, 50, 92, 50, 50])
Note that this modifies the array X in place: unless you make a copy first, the former values are lost. Using np.where() achieves the same goal without overwriting the original array.
X = np.random.randint(1, 100, size=5)
print(X) # array([ 1, 17, 92, 9, 11])
np.where(X < 50, 50, X) # array([50, 50, 92, 50, 50])
print(X) # array([ 1, 17, 92, 9, 11])
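As an aside (not part of the original answers), the same floor-at-a-threshold operation can be written with np.clip, which also returns a new array by default:

```python
import numpy as np

X = np.array([1, 17, 92, 9, 11])

# Same effect as X[X < 50] = 50, but out of place: X itself is untouched
clipped = np.clip(X, 50, None)   # lower bound 50, no upper bound
print(clipped)                   # [50 50 92 50 50]
print(X)                         # [ 1 17 92  9 11]
```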
Extra info:
See the "Fancy indexing" section of the NumPy documentation (you need to scroll down a bit to that header).
When we index a numpy array X with another array x, the output is a numpy array with values corresponding to the values of X at indices corresponding to the values of x.
And X < {value} returns a numpy array which has a boolean value, True or False, for each item in X depending on whether that item passes the condition {item} < {value}. Hence, X[X < {value}] = {value} means that we assign {value} wherever an array item is less than {value}. The following should make things clearer:
>>> x = [1, 2, 0, 3, 4, 0, 5, 6, 0, 7, 8, 0]
>>> X = numpy.array(x)
>>> X < 1
array([False, False, True, False, False, True, False, False, True,
False, False, True])
>>> X[X < 1] = -1
>>> X
array([ 1, 2, -1, 3, 4, -1, 5, 6, -1, 7, 8, -1])
>>> X[x]
array([ 2, -1, 1, 3, 4, 1, -1, 5, 1, 6, -1, 1])
P.S.: The credit for this answer goes to @ForceBru and his comment above!

Computing a "moving sum of counts" on a NumPy array

I have the following arrays:
# input
In [77]: arr = np.array([23, 45, 23, 0, 12, 45, 45])
# result
In [78]: res = np.zeros_like(arr)
Now, I want to compute a moving sum of unique elements and store it in the res array.
Concretely, res array should be:
In [79]: res
Out[79]: array([1, 1, 2, 1, 1, 2, 3])
[23, 45, 23, 0, 12, 45, 45]
[1, 1, 2, 1, 1, 2, 3]
We count each element, incrementing its count each time it re-appears, until we reach the end of the array. These element-specific counts should be returned as the result.
How should we do achieve this using NumPy built-in functions? I tried using numpy.bincount but it gives undesired results.
Not sure you'll find a builtin, so here is a homebrew using argsort.
def running_count(arr):
    idx = arr.argsort(kind='mergesort')
    sarr = arr[idx]
    neq = np.where(sarr[1:] != sarr[:-1])[0] + 1
    run = np.ones(arr.shape, int)
    run[neq[0]] -= neq[0]
    run[neq[1:]] -= np.diff(neq)
    res = np.empty_like(run)
    res[idx] = run.cumsum()
    return res
For example:
>>> running_count(arr)
array([1, 1, 2, 1, 1, 2, 3])
>>> running_count(np.array(list("xabaaybeeetz")))
array([1, 1, 1, 2, 3, 1, 2, 1, 2, 3, 1, 1])
Explainer:
We first sort using argsort because we need indices to go back to original order in the end. Here it is important to have a stable sort, hence the use of the slow mergesort.
Once the elements are sorted, the running counts form a "saw tooth" pattern. The vectorized way to create this is to observe that the diff of a saw tooth has "jump" values where a new tooth starts and ones everywhere else, so it is straightforward to construct.
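To make the saw-tooth construction concrete, here are the intermediate arrays for the question's input (the same steps as the function, unrolled):

```python
import numpy as np

arr = np.array([23, 45, 23, 0, 12, 45, 45])

idx = arr.argsort(kind='mergesort')            # [3 4 0 2 1 5 6] (stable sort)
sarr = arr[idx]                                # [ 0 12 23 23 45 45 45]
neq = np.where(sarr[1:] != sarr[:-1])[0] + 1   # [1 2 4]: where a new tooth starts

run = np.ones(arr.shape, int)
run[neq[0]] -= neq[0]                          # first jump resets the tooth to 1
run[neq[1:]] -= np.diff(neq)                   # remaining jumps
# run is [ 1  0  0  1 -1  1  1]; run.cumsum() is the saw tooth [1 1 1 2 1 2 3]

res = np.empty_like(run)
res[idx] = run.cumsum()                        # undo the sort
print(res)                                     # [1 1 2 1 1 2 3]
```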

Conditional selection in array

I have the following list of arrays:
[array([10, 1, 7, 3]),
array([ 0, 14, 12, 13]),
array([ 3, 10, 7, 8]),
array([7, 5]),
array([ 5, 12, 3]),
array([14, 8, 10])]
What I want is to mark rows as "1" or "0", conditional on whether the row matches "10" AND "7" OR "10" AND "3".
np.where((output == 10 & output == 7) | (output == 10 & output == 3) | (output == 10 & output == 8), 1, 0)
returns
array(0)
What's the correct syntax to get into the array of the array?
Expected output:
[ 1, 0, 1, 0, 0, 1 ]
Note:
What is output? After training a CountVectorizer/LDA topic classifier in scikit-learn, the following script assigns topic probabilities to new documents. Topics above the threshold of 0.2 are then stored in an array.
def sortthreshold(x, thresh):
    idx = np.arange(x.size)[x > thresh]
    return idx[np.argsort(x[idx])]

output = []
for x in newdoc:
    y = lda.transform(bowvectorizer.transform([x]))
    output.append(sortthreshold(y[0], 0.2))
Thanks!
Your input data is a plain Python list of NumPy arrays of unequal length, so it can't simply be converted to a 2D NumPy array and processed directly by NumPy. But it can be processed using the usual Python list processing tools.
Here's a list comprehension that uses numpy.isin to test if a row contains any of (3, 7, 8). We first use simple == testing to see if the row contains 10, and only call isin if it does so; the Python and operator will not evaluate its second operand if the first operand is false-ish.
We use np.any to see if any row item passes each test. np.any returns a Boolean value of False or True, but we can pass those values to int to convert them to 0 or 1.
import numpy as np
data = [
np.array([10, 1, 7, 3]), np.array([0, 14, 12, 13]),
np.array([3, 10, 7, 8]), np.array([7, 5]),
np.array([5, 12, 3]), np.array([14, 8, 10]),
]
mask = np.array([3, 7, 8])
result = [int(np.any(row==10) and np.any(np.isin(row, mask)))
for row in data]
print(result)
output
[1, 0, 1, 0, 0, 1]
I've just performed some timeit tests. Curiously, Reblochon Masque's code is faster on the data given in the question, presumably because of the short-circuiting behaviour of plain Python any and the and/or operators. Also, it appears that numpy.in1d is faster than numpy.isin, even though the docs recommend using the latter in new code.
Here's a new version that's about 10% slower than Reblochon's.
mask = np.array([3, 7, 8])
result = [int(any(row==10) and any(np.in1d(row, mask)))
for row in data]
Of course, the true speed on large amounts of real data may vary from what my tests indicate. And time may not be an issue: even on my slow old 32 bit single core 2GHz machine I can process the data in the question almost 3000 times in one second.
hpaulj has suggested an even faster way. Here's some timeit test info, comparing the various versions. These tests were performed on my old machine, YMMV.
import numpy as np
from timeit import Timer
the_data = [
np.array([10, 1, 7, 3]), np.array([0, 14, 12, 13]),
np.array([3, 10, 7, 8]), np.array([7, 5]),
np.array([5, 12, 3]), np.array([14, 8, 10]),
]
def rebloch0(data):
    result = []
    for output in data:
        result.append(1 if np.where((any(output == 10) and any(output == 7)) or
                                    (any(output == 10) and any(output == 3)) or
                                    (any(output == 10) and any(output == 8)), 1, 0) == True else 0)
    return result

def rebloch1(data):
    result = []
    for output in data:
        result.append(1 if np.where((any(output == 10) and any(output == 7)) or
                                    (any(output == 10) and any(output == 3)) or
                                    (any(output == 10) and any(output == 8)), 1, 0) else 0)
    return result

def pm2r0(data):
    mask = np.array([3, 7, 8])
    return [int(np.any(row == 10) and np.any(np.isin(row, mask)))
            for row in data]

def pm2r1(data):
    mask = np.array([3, 7, 8])
    return [int(any(row == 10) and any(np.in1d(row, mask)))
            for row in data]

def hpaulj0(data):
    mask = np.array([3, 7, 8])
    return [int(any(row == 10) and any((row[:, None] == mask).flat))
            for row in data]

def hpaulj1(data, mask=np.array([3, 7, 8])):
    return [int(any(row == 10) and any((row[:, None] == mask).flat))
            for row in data]
functions = (
rebloch0,
rebloch1,
pm2r0,
pm2r1,
hpaulj0,
hpaulj1,
)
# Verify that all functions give the same result
for func in functions:
    print('{:8}: {}'.format(func.__name__, func(the_data)))
print()

def time_test(loops, data):
    timings = []
    for func in functions:
        t = Timer(lambda: func(data))
        result = sorted(t.repeat(3, loops))
        timings.append((result, func.__name__))
    timings.sort()
    for result, name in timings:
        print('{:8}: {:.6f}, {:.6f}, {:.6f}'.format(name, *result))
    print()

time_test(1000, the_data)
typical output
rebloch0: [1, 0, 1, 0, 0, 1]
rebloch1: [1, 0, 1, 0, 0, 1]
pm2r0 : [1, 0, 1, 0, 0, 1]
pm2r1 : [1, 0, 1, 0, 0, 1]
hpaulj0 : [1, 0, 1, 0, 0, 1]
hpaulj1 : [1, 0, 1, 0, 0, 1]
hpaulj1 : 0.140421, 0.154910, 0.156105
hpaulj0 : 0.154224, 0.154822, 0.167101
rebloch1: 0.281700, 0.282764, 0.284599
rebloch0: 0.339693, 0.359127, 0.375715
pm2r1 : 0.367677, 0.368826, 0.371599
pm2r0 : 0.626043, 0.628232, 0.670199
Nice work, hpaulj!
You need to use any combined with np.where, and avoid | and &, which are bitwise operators in Python and bind more tightly than comparisons.
import numpy as np
a = [np.array([10, 1, 7, 3]),
np.array([ 0, 14, 12, 13]),
np.array([ 3, 10, 7, 8]),
np.array([7, 5]),
np.array([ 5, 12, 3]),
np.array([14, 8, 10])]
for output in a:
    print(np.where((any(output == 10) and any(output == 7)) or
                   (any(output == 10) and any(output == 3)) or
                   (any(output == 10) and any(output == 8)), 1, 0))
output:
1
0
1
0
0
1
If you want it as a list as the edited question shows:
result = []
for output in a:
    result.append(1 if np.where((any(output == 10) and any(output == 7)) or
                                (any(output == 10) and any(output == 3)) or
                                (any(output == 10) and any(output == 8)), 1, 0) == True else 0)
result
result:
[1, 0, 1, 0, 0, 1]

Iterate through a numpy ndarray, manage first and last elements

I have a numpy array
import numpy as np
arr = np.array([2, 3, 4, 7, 7, 4, 4, 5, 1, 1, 9, 9, 9, 4, 25, 26])
I would like to iterate over this list to produce pairs of "matching" elements. In the above array, 7 matches 7. You only compare the element "ahead" and the element "behind".
My problem: how do I deal with the first and last elements?
This is what I have to begin with:
for i in range(len(arr)):
    if arr[i] == arr[i+1]:
        print("Match at entry %d at array location (%d)" % (arr[i], i))
    else:
        pass
This outputs:
Match at entry 7 at array location (3)
Match at entry 7 at array location (4)
Match at entry 4 at array location (6)
Match at entry 1 at array location (9)
Match at entry 9 at array location (11)
Match at entry 9 at array location (12)
I feel the condition should be
if ((arr[i] == arr[i+1]) and (arr[i] == arr[i-1]))
but this throws an error.
How do I deal with the first and last elements?
You should avoid loops in NumPy.
Using a slightly modified array with pairs at the start and end:
>>> arr = np.array([2, 2, 3, 4, 7, 7, 4, 4, 5, 1, 1, 9, 9, 9, 4, 25, 26, 26])
This finds the first index of each pair.
>>> np.where(arr[:-1] == arr[1:])[0]
array([ 0, 4, 6, 9, 11, 12, 16])
Printing them out:
arr = np.array([2, 2, 3, 4, 7, 7, 4, 4, 5, 1, 1, 9, 9, 9, 4, 25, 26, 26])
matches = np.where(arr[:-1] == arr[1:])[0]
for index in matches:
    for i in [index, index + 1]:
        print("Match at entry %d at array location (%d)" % (arr[i], i))
prints:
Match at entry 2 at array location (0)
Match at entry 2 at array location (1)
Match at entry 7 at array location (4)
Match at entry 7 at array location (5)
Match at entry 4 at array location (6)
Match at entry 4 at array location (7)
Match at entry 1 at array location (9)
Match at entry 1 at array location (10)
Match at entry 9 at array location (11)
Match at entry 9 at array location (12)
Match at entry 9 at array location (12)
Match at entry 9 at array location (13)
Match at entry 26 at array location (16)
Match at entry 26 at array location (17)
The function np.where can be used in several ways. In our case we use the condition arr[:-1] == arr[1:]. This compares each element with the next in the array:
>>> arr[:-1] == arr[1:]
array([ True, False, False, False, True, False, True, False, False,
True, False, True, True, False, False, False, True], dtype=bool)
Now applying np.where to this condition gives a tuple with matching indices.
>>> cond = arr[:-1] == arr[1:]
>>> np.where(cond)
(array([ 0, 4, 6, 9, 11, 12, 16]),)
Since we have a 1D array, we get a tuple with one element. For a 2D array we would have gotten a tuple with two elements, holding the indices along the first and second dimension. We take these indices out of the tuple:
>>> np.where(cond)[0]
array([ 0, 4, 6, 9, 11, 12, 16])
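For comparison, here is a small hypothetical 2D case, where np.where returns a two-element tuple of row and column indices:

```python
import numpy as np

m = np.array([[1, 5, 1],
              [5, 1, 5]])

rows, cols = np.where(m == 5)   # tuple of (row indices, column indices)
print(rows)                     # [0 1 1]
print(cols)                     # [1 0 2]
```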

Elegant list comprehension to extract values in one dimension of an array based on values in another dimension

I'm looking for an elegant solution to this:
data = np.loadtxt(file)
# data[:,0] is a time
# data[:,1] is what I want to extract
mean = 0.0
count = 0
for n in xrange(np.size(data[:,0])):
    if data[n,0] >= tstart and data[n,0] <= tend:
        mean = mean + data[n,1]
        count = count + 1
mean = mean / float(count)
I'm guessing I could alternatively first extract my 2D array and then apply np.mean on it but I feel like there could be some list comprehension goodness to make this more elegant (I come from a FORTRAN background...). I was thinking something like (obviously wrong since i would not be an index):
np.mean([x for x in data[i,1] for i in data[:,0] if i >= tstart and i <= tend])
In numpy, rather than listcomps you can use lists and arrays for indexing purposes. To be specific, say we have a 2D array like the one you're working with:
>>> import numpy as np
>>> data = np.arange(20).reshape(10, 2)
>>> data
array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11],
[12, 13],
[14, 15],
[16, 17],
[18, 19]])
We can get the first column:
>>> ts = data[:,0]
>>> ts
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])
And create a boolean array corresponding to the terms we want:
>>> (ts >= 2) & (ts <= 6)
array([False, True, True, True, False, False, False, False, False, False], dtype=bool)
Then we can use this to select elements of the column we're interested in:
>>> data[:,1][(ts >= 2) & (ts <= 6)]
array([3, 5, 7])
and finally take its mean:
>>> np.mean(data[:,1][(ts >= 2) & (ts <= 6)])
5.0
Or, in one line:
>>> np.mean(data[:,1][(data[:,0] >= 2) & (data[:,0] <= 6)])
5.0
[Edit: data[:,1][(data[:,0] >= 2) & (data[:,0] <= 6)].mean() will work too; I always forget you can use methods.]
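Wrapped as a small helper (the function name is my own), the boolean-mask approach reads:

```python
import numpy as np

def mean_in_window(data, tstart, tend):
    """Mean of column 1 over rows whose column-0 time lies in [tstart, tend]."""
    t = data[:, 0]
    mask = (t >= tstart) & (t <= tend)
    return data[mask, 1].mean()

data = np.arange(20).reshape(10, 2)   # same example data as above
print(mean_in_window(data, 2, 6))     # 5.0
```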
