NumPy: How to avoid this loop?

Is there a way to avoid this loop and so optimize the code?
import numpy as np
cLoss = 0
dist_ = np.array([0,1,0,1,1,0,0,1,1,0]) # just an example, longer in reality
TLabels = np.array([-1,1,1,1,1,-1,-1,1,-1,-1]) # just an example, longer in reality
t = float(dist_.size)
for i in range(len(dist_)):
    labels = TLabels[dist_ == dist_[i]]
    cLoss += 1 - TLabels[i]*(1. * np.sum(labels)/t)
print cLoss
Note: dist_ and TLabels are both numpy arrays with the same shape (t,1)

I am not sure exactly what you want to do, but are you aware of scipy.ndimage.measurements for computing on arrays with labels? It looks like you want something like:
cLoss = len(dist_) - sum(TLabels * scipy.ndimage.measurements.sum(TLabels,dist_,dist_) / len(dist_))
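If it helps, here is a quick check of that expression on the example data. I've used the modern scipy.ndimage.sum spelling (measurements is just the older submodule home of the same function):
import numpy as np
from scipy import ndimage

dist_ = np.array([0, 1, 0, 1, 1, 0, 0, 1, 1, 0])
TLabels = np.array([-1, 1, 1, 1, 1, -1, -1, 1, -1, -1])

# ndimage.sum(input, labels, index) returns, for each entry of index,
# the sum of `input` over the positions where `labels` equals that entry.
group_sums = ndimage.sum(TLabels, dist_, dist_)
cLoss = len(dist_) - np.sum(TLabels * group_sums / len(dist_))
print(cLoss)  # 8.2, matching the original loop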

I first wonder: what is labels at each step of the loop?
With dist_ = array([2,1,2]) and TLabels = array([-1,1,1])
I get
[-1 1]
[1]
[-1 1]
The differing lengths immediately raise a warning flag: it may be difficult to vectorize this.
With the longer arrays in the edited example
[-1 1 -1 -1 -1]
[ 1 1 1 1 -1]
[-1 1 -1 -1 -1]
[ 1 1 1 1 -1]
[ 1 1 1 1 -1]
[-1 1 -1 -1 -1]
[-1 1 -1 -1 -1]
[ 1 1 1 1 -1]
[ 1 1 1 1 -1]
[-1 1 -1 -1 -1]
The labels vectors are all the same length. Is that normal, or just a coincidence of values?
Drop a couple of elements off of dist_, and labels are:
In [375]: for i in range(len(dist_)):
   .....:     labels = TLabels[dist_ == dist_[i]]
   .....:     v = (1.*np.sum(labels)/t); v1 = 1 - TLabels[i]*v
   .....:     print(labels, v, TLabels[i], v1)
   .....:     cLoss += v1
   .....:
(array([-1, 1, -1, -1]), -0.25, -1, 0.75)
(array([1, 1, 1, 1]), 0.5, 1, 0.5)
(array([-1, 1, -1, -1]), -0.25, 1, 1.25)
(array([1, 1, 1, 1]), 0.5, 1, 0.5)
(array([1, 1, 1, 1]), 0.5, 1, 0.5)
(array([-1, 1, -1, -1]), -0.25, -1, 0.75)
(array([-1, 1, -1, -1]), -0.25, -1, 0.75)
(array([1, 1, 1, 1]), 0.5, 1, 0.5)
Again different lengths of labels, but really only a few calculations. There is 1 v value for each different dist_ value.
Without working out all the details, it looks like you are just calculating labels*labels for each distinct dist_ value, and then summing those.
This looks like a groupby problem. You want to divide dist_ into groups with a common value and sum some function of their corresponding TLabels values. Python's itertools has a groupby function (which requires sorted input), and so does pandas.
Try sorting dist_ and see if that adds any clarity to the problem.
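Following that groupby idea, here is a minimal vectorized sketch of my own using np.unique and np.bincount; it reproduces the loop's result without sorting by hand:
import numpy as np

dist_ = np.array([0, 1, 0, 1, 1, 0, 0, 1, 1, 0])
TLabels = np.array([-1, 1, 1, 1, 1, -1, -1, 1, -1, -1])
t = float(dist_.size)

# Group TLabels by distinct dist_ value, sum each group, then map the
# group sums back onto the original positions via the inverse index.
uniq, inv = np.unique(dist_, return_inverse=True)
group_sums = np.bincount(inv, weights=TLabels)  # one sum per distinct dist_ value
v = group_sums[inv] / t                         # the per-element v from the loop
cLoss = np.sum(1 - TLabels * v)
print(cLoss)  # 8.2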

I'm not sure if this is any better, since I didn't exactly understand why you would want to do this. Many quantities in your loop take only two values, and hence can be computed in advance.
Also, the entries of dist_ could be used directly as a boolean switch, but I used an explicit mask anyhow.
dist_ = np.array([0,1,0,1,1,0,0,1,1,0])
TLabels = np.array([-1,1,1,1,1,-1,-1,1,-1,-1])
t = len(dist_)
dist_zeros = dist_ == 0
one_zero_sum = [sum(TLabels[dist_zeros])/t, sum(TLabels[~dist_zeros])/t]
cLoss = sum([1 - x*one_zero_sum[dist_[y]] for y, x in enumerate(TLabels)])
which results in cLoss = 8.2. I am using Python 3, so I didn't check whether this is true division in Python 2.
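(In Python 2 the sums here are integers, so you would add from __future__ import division at the top, or cast one operand to float, to get the same 8.2.)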

Comparing elements at specific positions in numpy.ndarray

I don't know if the title describes my question well. I have the following list of floats, obtained from a sigmoid activation function.
outputs = [[0.015161413699388504, 0.6720218658447266, 0.0024502829182893038,
            0.21356457471847534, 0.002232735510915518, 0.026410426944494247],
           [0.006432057358324528, 0.0059209042228758335, 0.9866275191307068,
            0.004609372932463884, 0.007315939292311668, 0.010821194387972355],
           [0.02358204871416092, 0.5838017225265503, 0.005475651007145643,
            0.012086033821106, 0.540218658447266, 0.010054176673293114]]
To calculate my metrics, I would like to say that if any neuron's output value is greater than 0.5, the comment is assumed to belong to that class (it is a multi-label problem). I could easily do that using
outputs = np.where(np.array(outputs) >= 0.5, 1, 0)
However, I would like to add a condition to keep only the larger value if class #5 and any other class both have values > 0.5 (as class #5 cannot occur together with other classes). How can I write that condition?
In my example the output should be:
[[0 1 0 0 0 0]
 [0 0 1 0 0 0]
 [0 1 0 0 0 0]]
instead of:
[[0 1 0 0 0 0]
 [0 0 1 0 0 0]
 [0 1 0 0 1 0]]
Thanks,
You can write a custom function that you can then apply to each sub-array in outputs using the np.apply_along_axis() function:
def choose_class(a):
    if (len(np.argwhere(a >= 0.5)) > 1) & (a[4] >= 0.5):
        return np.where(a == a.max(), 1, 0)
    return np.where(a >= 0.5, 1, 0)

outputs = np.apply_along_axis(choose_class, 1, outputs)
outputs
# array([[0, 1, 0, 0, 0, 0],
#        [0, 0, 1, 0, 0, 0],
#        [0, 1, 0, 0, 0, 0]])
For the simple mask, you don't need np.where
mask = outputs >= 0.5
If you want an integer instead of a boolean:
mask = (outputs >= 0.5).view(np.uint8)
To check the fifth column, you need to keep a reference to the original data around. You can get the maximum masked value in each relevant row with
rows = np.flatnonzero(mask[:, 4])
keep = (outputs[rows] * mask[rows]).argmax(axis=1)
Then you can blank out the rows and set only the maximum value:
mask[rows] = 0
mask[rows, keep] = 1
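Putting those fragments together on rounded example values (just a sanity check I added, using the same rounding as the answer below):
import numpy as np

outputs = np.array([[0.015, 0.672, 0.002, 0.213, 0.002, 0.026],
                    [0.006, 0.005, 0.986, 0.004, 0.007, 0.010],
                    [0.023, 0.583, 0.005, 0.012, 0.540, 0.010]])

mask = (outputs >= 0.5).view(np.uint8)
rows = np.flatnonzero(mask[:, 4])                   # rows where class #5 fired
keep = (outputs[rows] * mask[rows]).argmax(axis=1)  # largest above-threshold entry per row
mask[rows] = 0
mask[rows, keep] = 1
print(mask)
# [[0 1 0 0 0 0]
#  [0 0 1 0 0 0]
#  [0 1 0 0 0 0]]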
One other solution:
# Your example input array
out = np.array([[0.015, 0.672, 0.002, 0.213, 0.002, 0.026],
                [0.006, 0.005, 0.986, 0.004, 0.007, 0.010],
                [0.023, 0.583, 0.005, 0.012, 0.540, 0.010]])
# We get the desired result
val = (out>=0.5)*out//(out.max(axis=1))[:,None]
This solution does the following operations:
Set to zero all values < 0.5
Set to 1 the maximum value in each row (if that maximum is >= 0.5)

Comparing two numpy arrays for compliance with two conditions

Consider two numpy arrays having the same shape, A and B, composed of 1s and 0s. A small example is shown:
A = [[1 0 0 1]        B = [[0 0 0 0]
     [0 0 1 0]             [0 0 0 0]
     [0 0 0 0]             [1 1 0 0]
     [0 0 0 0]             [0 0 1 0]
     [0 0 1 1]]            [0 1 0 1]]
I now want to assign values to the two Boolean variables test1 and test2 as follows:
test1: Is there at least one instance where a 1 in an A column and a 1 in the SAME B column have row differences of exactly 1 or 2? If so, then test1 = True, otherwise False.
In the example above, column 0 of both arrays have 1s that are 2 rows apart, so test1 = True. (there are other instances in column 2 as well, but that doesn't matter - we only require one instance.)
test2: Do the 1 values in A and B all have different array addresses? If so, then test2 = True, otherwise False.
In the example above, both arrays have [4,3] = 1, so test2 = False.
I'm struggling to find an efficient way to do this and would appreciate some assistance.
Here is a simple way to test if two arrays have an entry one element apart in the same column (only in one direction):
(A[1:, :] * B[:-1, :]).any(axis=None)
So you can do
test1 = (A[1:, :] * B[:-1, :] + A[:-1, :] * B[1:, :]).any(axis=None) or (A[2:, :] * B[:-2, :] + A[:-2, :] * B[2:, :]).any(axis=None)
The second test can be done by converting the locations to indices, stacking them together, and using np.unique to count the number of duplicates. Duplicates can only come from the same index in two arrays since an array will never have duplicate indices. We can further speed up the calculation by using flatnonzero instead of nonzero:
test2 = np.all(np.unique(np.concatenate((np.flatnonzero(A), np.flatnonzero(B))), return_counts=True)[1] == 1)
A more efficient test would use np.intersect1d in a similar manner:
test2 = not np.intersect1d(np.flatnonzero(A), np.flatnonzero(B)).size
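For reference, a minimal end-to-end check of both tests on the example arrays, combining the slicing approach with the intersect1d version (expected output per the question: True and False):
import numpy as np

A = np.array([[1,0,0,1],[0,0,1,0],[0,0,0,0],[0,0,0,0],[0,0,1,1]])
B = np.array([[0,0,0,0],[0,0,0,0],[1,1,0,0],[0,0,1,0],[0,1,0,1]])

test1 = ((A[1:] * B[:-1] + A[:-1] * B[1:]).any()
         or (A[2:] * B[:-2] + A[:-2] * B[2:]).any())
test2 = not np.intersect1d(np.flatnonzero(A), np.flatnonzero(B)).size
print(test1, test2)  # True False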
You can use masked arrays. For the second task you can do:
A_m = np.ma.masked_equal(A, 0)
B_m = np.ma.masked_equal(B, 0)
test2 = not np.any((A_m == B_m).compressed())
(A_m == B_m only compares positions where both arrays hold a 1, so any surviving True marks a shared address; test2 is the negation.) And a naive way of doing the first task is:
test1 = np.any((np.ma.vstack((A_m[:-1], A_m[:-2], A_m[1:], A_m[2:]))
                == np.ma.vstack((B_m[1:], B_m[2:], B_m[:-1], B_m[:-2]))).compressed())
output:
True
False
For test2: you could just check whether any flat index holds a 1 in both arrays (the expression below is True when a shared address exists, so test2 is its negation):
A = np.array([[1, 0, 0, 1],[0, 0, 1, 0],[0, 0, 0, 0],[0, 0, 0, 0],[0, 0, 1, 1]])
B = np.array([[0, 0, 0, 0],[0, 0, 0, 0],[1, 1, 0, 0],[0, 0, 1, 0],[0, 1, 0, 1]])
print(len(np.intersect1d(np.flatnonzero(A==1), np.flatnonzero(B==1))) > 0)

How can I get exactly the same number of elements replaced in a numpy 2D matrix?

I have a symmetric 2D numpy matrix. It contains only ones and zeros, and the diagonal elements are always 0.
I want to change part of the elements from one to zero while keeping the result symmetric. How many elements are selected depends on the parameter replace_rate.
Since the matrix is symmetric, I take the upper half, randomly select elements whose value is 1, and change them from 1 to 0. Then, with a mirror operation, I make sure the whole matrix is still symmetric.
For example
com = np.array([[0, 1, 1, 1, 1],
                [1, 0, 1, 1, 1],
                [1, 1, 0, 1, 1],
                [1, 1, 1, 0, 1],
                [1, 1, 1, 1, 0]])
replace_rate = 0.1
com = np.triu(com)
mask = np.random.choice([0, 1], size=com.shape,
                        p=((1 - replace_rate), replace_rate)).astype(bool)
r1 = np.random.rand(*com.shape)
com[mask] = r1[mask]
com += com.T - np.diag(com.diagonal())
com is a (5, 5) symmetric matrix, and 10% of the elements (counting only those whose value is 1; the diagonal is excluded) should be replaced by 0 at random.
The question is: how can I make sure the number of elements changed is the same each time?
With the same replace_rate = 0.1, sometimes I get a result like:
com = np.array([[0 1 1 1 1]
                [1 0 1 1 1]
                [1 1 0 1 1]
                [1 1 1 0 1]
                [1 1 1 1 0]])
Actually nothing changed this time, and when I repeated it, I got 2 elements changed:
com = np.array([[0 1 1 1 1]
                [1 0 1 1 1]
                [1 1 0 1 0]
                [1 1 1 0 1]
                [1 1 0 1 0]])
I want to know how to fix the number of elements changed for a given replace_rate.
Thanks in advance!
How about something like this:
import random

def make_transform(m, replace_rate):
    changed = []  # keep track of indices we already changed

    def get_random():
        # Get a random pair of indices which are not equal (i.e. not on the diagonal)
        c1, c2 = random.choices(range(len(m)), k=2)
        if c1 == c2 or (c1, c2) in changed or (c2, c1) in changed:
            return get_random()  # recurse until we find an i,j pair, i != j, that hasn't already been changed
        else:
            changed.append((c1, c2))
            return c1, c2

    n_changes = int(m.shape[0]**2 * replace_rate)  # the number of changes to make
    print(n_changes)
    for _ in range(n_changes):
        i, j = get_random()  # get a valid index pair
        m[i][j] = m[j][i] = 0
    return m
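Called on the question's example matrix, this zeroes exactly int(5**2 * 0.1) = 2 symmetric pairs on every run:
import numpy as np

com = np.array([[0, 1, 1, 1, 1],
                [1, 0, 1, 1, 1],
                [1, 1, 0, 1, 1],
                [1, 1, 1, 0, 1],
                [1, 1, 1, 1, 0]])
print(make_transform(com, 0.1))  # prints n_changes = 2, then the modified matrix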
This is the solution I suggest:
def rand_zero(mat, replace_rate):
    triu_mat = np.triu(mat)
    _ind = np.where(triu_mat != 0)  # indices of non-zero elements (the diagonal is already zero)
    ind = [x for x in zip(*_ind)]
    chng = np.random.choice(range(len(ind)),            # select some indices, at rate 'replace_rate'
                            size=int(replace_rate * mat.size),
                            replace=False)              # do not select duplicates
    mod_mat = triu_mat
    for c in chng:
        mod_mat[ind[c]] = 0
    mod_mat = mod_mat + mod_mat.T
    return mod_mat
I use int() to truncate size to an integer, but you can use round() if you prefer.
Hope this gives consistent results!
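If an exact count is the main goal, here is another minimal sketch of my own: it draws without replacement from the strictly upper-triangular 1s, so the number of changes is identical on every call and symmetry holds by construction. (Note the rate here is taken relative to the number of upper-triangular 1s, which is one of several reasonable conventions.)
import numpy as np

def replace_exact(com, replace_rate, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    iu, ju = np.triu_indices_from(com, k=1)  # strictly upper-triangular indices
    ones = np.flatnonzero(com[iu, ju] == 1)  # which of those positions hold a 1
    n = int(replace_rate * ones.size)        # exact count, the same every call
    pick = rng.choice(ones, size=n, replace=False)
    com[iu[pick], ju[pick]] = 0
    com[ju[pick], iu[pick]] = 0              # mirror to keep the matrix symmetric
    return com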

How to apply lower and upper threshold to NumPy array?

I have the following array
array = np.array([-0.5, -2, -1, -0.5, -0.25, 0, 0, -2, -1, 0.25, 0.5, 1, 2])
and would like to apply two thresholds, such that all values below -1.0 are set to 1 and all values above -0.3 are set to 0. For the values in between, the following rule applies: if the last value was below -1.0, it should be a 1, but if the last value was above -0.3, it should be a 0.
For the example array above, the output should be
target = np.array([0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0])
If multiple consecutive values are between -1.0 and -0.3, then it should go back as far as required until there is a value above or below the two thresholds and set the output accordingly.
I tried to achieve this by iterating over the array and using a while inside the for loop to find the next occurrence where the value is above the threshold, but it doesn't work:
array = np.array([-0.5, -2, -1, -0.5, -0.25, 0, 0, -2, -1, 0.25, 0.5, 1, 2])
p = []

def function(array, p):
    for i in np.nditer(array):
        if i < -1:
            while i <= -0.3:
                p.append(1)
                i += 1
        else:
            p.append(0)
            i += 1
    return p

a = function(array, p)
print(a)
How can I apply the two thresholds to my array as described above?
What you are trying to achieve is called "thresholding with hysteresis". For this, I adapted the very nice algorithm from this answer:
Given your test data,
import numpy as np
array = np.array([-0.5, -2, -1, -0.5, -0.25, 0, 0, -2, -1, 0.25, 0.5, 1, 2])
you detect which values are below the first threshold -1.0, and which are above the second threshold -0.3:
low_values = array <= -1.0
high_values = array >= -0.3
These are the values for which you know the result: either 1 or 0. For all other values, it depends on its neighbors. Thus, all values for which either low_values or high_values is True are known.
You can get the indices of all known elements with:
known_values = high_values | low_values
known_idx = np.nonzero(known_values)[0]
To find the result for all unknown values, we use the np.cumsum function on the known_values array. The Booleans are interpreted as 0 or 1, so this gives us the following array:
acc = np.cumsum(known_values)
which will result in the following for your example:
[ 0 1 2 2 3 4 5 6 7 8 9 10 11].
Now, known_idx[acc - 1] will contain the index of the last known value for each point. With low_values[known_idx[acc - 1]] you get a True if the last known value was below -1.0 and a False if it was above -0.3:
result = low_values[known_idx[acc - 1]]
There is one problem left: If the initial value is below -1.0 or above -0.3, then everything works out perfectly fine. But if it is in-between, then it would depend on its left neighbor - which it doesn't have. So in your case, you simply define it to be zero.
We can do that by checking if acc[0] equals 0 or 1. If acc[0] = 1, then everything is fine, but if acc[0] = 0, then this means that the first value is between -1.0 and -0.3, so we have to set it to zero:
if not acc[0]:
    result[0] = False
Finally, as we were doing lots of comparisons, our result array is a boolean array. To convert it to integer 0 and 1, we simply call
result = np.int8(result)
and we get our desired result:
array([0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0], dtype=int8)
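For convenience, the steps above collapse into one function (a direct consolidation; the only change is that the fix-up for a leading in-between value is applied to any leading run of them, not just the first element):
import numpy as np

def hysteresis_threshold(arr, low=-1.0, high=-0.3):
    low_values = arr <= low
    high_values = arr >= high
    known_values = low_values | high_values
    known_idx = np.nonzero(known_values)[0]
    acc = np.cumsum(known_values)
    result = low_values[known_idx[acc - 1]]  # the last known value decides
    result[acc == 0] = False                 # leading in-between values default to 0
    return np.int8(result)

array = np.array([-0.5, -2, -1, -0.5, -0.25, 0, 0, -2, -1, 0.25, 0.5, 1, 2])
print(hysteresis_threshold(array))
# [0 1 1 1 0 0 0 1 1 0 0 0 0]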

Compute the length of consecutive true values in a list

Essentially this problem can be split into two parts. I have a set of binary values that indicate whether a given signal is present or not. Given that each value also corresponds to a unit of time (in this case minutes), I am trying to determine how long the signal exists on average, based on its occurrences within the overall list of values for the period I'm analyzing. For example, if I have the following list:
[0,0,0,1,1,1,0,0,1,0,0,0,1,1,1,1,0]
I can see that the signal occurs 3 separate times, for varying lengths of time (e.g. in the first case for 3 minutes). If I want to calculate the average length of time of each occurrence, however, I need to know how many independent instances of the signal exist (i.e. 3). I have tried various index-based strategies such as:
arb_ops.index(1)
to find the next occurrence of a true value, and correspondingly the next occurrence of 0 to find the length, but I am having trouble turning this into a recursive function for the entire array.
You could use itertools.groupby() to group consecutive equal elements. To calculate a group's length convert the iterator to a list and apply len() to it:
>>> from itertools import groupby
>>> lst = [0 ,0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0 ,1, 1, 1, 1, 0]
>>> for k, g in groupby(lst):
... g = list(g)
... print(k, g, len(g))
...
0 [0, 0, 0] 3
1 [1, 1, 1] 3
0 [0, 0] 2
1 [1] 1
0 [0, 0, 0] 3
1 [1, 1, 1, 1] 4
0 [0] 1
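To get the average duration the question asks about, filter the groups down to the runs of 1s (a small extension of the snippet above):
>>> runs = [len(list(g)) for k, g in groupby(lst) if k == 1]
>>> runs
[3, 1, 4]
>>> sum(runs) / len(runs)
2.6666666666666665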
Another option may be MaskedArray.count, which counts non-masked elements of an array along a given axis:
import numpy.ma as ma
a = ma.arange(6).reshape((2, 3))
a[1, :] = ma.masked
a
masked_array(data =
[[0 1 2]
[-- -- --]],
mask =
[[False False False]
[ True True True]],
fill_value = 999999)
a.count()
3
You can extend Masked Arrays quite far...
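Since the question involves NumPy anyway, here is a loop-free sketch of my own that uses np.diff to locate run boundaries directly:
import numpy as np

sig = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0])

# Pad with zeros so runs touching either end are still delimited, then
# diff: +1 marks a run start, -1 marks the position just past a run end.
edges = np.diff(np.concatenate(([0], sig, [0])))
starts = np.flatnonzero(edges == 1)
ends = np.flatnonzero(edges == -1)
lengths = ends - starts
print(lengths, lengths.mean())  # [3 1 4] 2.6666666666666665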
@eugene-yarmash's solution with groupby is decent. However, if you want a solution that requires no imports, where you do the grouping yourself (for learning purposes), you could try this:
>>> l = [0,0,0,1,1,1,0,0,1,0,0,0,1,1,1,1,0]
>>> def size(xs):
... sz = 0
... for x in xs:
... if x == 0 and sz > 0:
... yield sz
... sz = 0
... if x == 1:
... sz += 1
... if sz > 0:
... yield sz
...
>>> list(size(l))
[3, 1, 4]
I think this problem is actually pretty simple: you know you have a new signal when you see a value of 1 whose previous value is 0.
The code below is kind of long, but super simple, and done without imports.
signal = [0,0,0,1,1,1,0,0,1,0,0,0,1,1,1,1,0]

def find_number_of_signals(signal):
    signal_counter = 0
    signal_duration = 0
    for i in range(len(signal)):
        if signal[i] == 1:
            signal_duration += 1
            if i == 0 or signal[i - 1] == 0:  # a new signal starts here
                signal_counter += 1
    print(signal_counter)
    print(signal_duration)
    print(signal_duration / signal_counter)

find_number_of_signals(signal)
