Fastest way to simulate multiple outcomes given multiple probabilities in Python?

I have a list of probabilities p = [p1, p2, …, pn].
All I want is to simulate a list s = [0, 1, 0, 0, 1, …, 1] whose first element is 0 with probability p1 and 1 with probability 1 - p1, and so on for the remaining elements, each matching the corresponding probability in p.
Currently my solution is to loop over p and append to s the output of np.random.choice() called on each individual probability.
import numpy as np
s = []
for item in p:
    s.append(np.random.choice([0, 1], p=[item, 1 - item]))

You just need to draw one uniform random number per probability and compare the draws element-wise with p.
Just decide whether you want > or >= for the 1 case.
import numpy as np
p = np.array([0.2, 0.5, 0, 1.0, 0.9, 0.3, 0.1, 0.8])
x = np.random.random(size=p.shape)
ans = (x>p).astype('int')
print(p)
print(x)
print(ans)
[0.2 0.5 0. 1. 0.9 0.3 0.1 0.8]
[0.08990063 0.51804083 0.9049705 0.0885368 0.1273564 0.18583925
0.51488052 0.23258143]
[0 1 1 0 0 0 1 0]
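The comparison above can be wrapped into a small helper. This is a minimal sketch rather than part of the original answer: the function name simulate_bernoulli and the use of np.random.default_rng are my own choices. It uses >= so that p = 0 always yields 1 and p = 1 always yields 0, since the generator samples from the half-open interval [0, 1).

```python
import numpy as np

def simulate_bernoulli(p, rng=None):
    """Return an int array that is 0 at position i with probability p[i], else 1.

    Hypothetical helper for illustration; `rng` is an optional np.random.Generator.
    """
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(p, dtype=float)
    # One uniform draw in [0, 1) per probability; P(x >= p) = 1 - p.
    x = rng.random(size=p.shape)
    return (x >= p).astype(int)

p = np.array([0.2, 0.5, 0.0, 1.0, 0.9, 0.3, 0.1, 0.8])
s = simulate_bernoulli(p, rng=np.random.default_rng(0))
```

Passing a seeded generator makes the draws reproducible; leaving rng as None draws fresh randomness on each call.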

Related

Comparing elements at specific positions in numpy.ndarray

I don't know if the title describes my question well. I have the following list of floats obtained from a sigmoid activation function.
outputs =
[[0.015161413699388504,
0.6720218658447266,
0.0024502829182893038,
0.21356457471847534,
0.002232735510915518,
0.026410426944494247],
[0.006432057358324528,
0.0059209042228758335,
0.9866275191307068,
0.004609372932463884,
0.007315939292311668,
0.010821194387972355],
[0.02358204871416092,
0.5838017225265503,
0.005475651007145643,
0.012086033821106,
0.540218658447266,
0.010054176673293114]]
To calculate my metrics, I would like to treat any neuron whose output value is at least 0.5 as indicating that the comment belongs to that class (multi-label problem). I could easily do that using
outputs = np.where(np.array(outputs) >= 0.5, 1, 0)
However, I would like to add a condition that keeps only the largest value if class #5 and any other class both have values > 0.5 (as class #5 cannot occur together with other classes). How can I write that condition?
In my example the output should be:
[[0 1 0 0 0 0]
[0 0 1 0 0 0]
[0 1 0 0 0 0]]
instead of:
[[0 1 0 0 0 0]
[0 0 1 0 0 0]
[0 1 0 0 1 0]]
Thanks,
You can write a custom function that you can then apply to each sub-array in outputs using the np.apply_along_axis() function:
def choose_class(a):
    if (len(np.argwhere(a >= 0.5)) > 1) & (a[4] >= 0.5):
        return np.where(a == a.max(), 1, 0)
    return np.where(a >= 0.5, 1, 0)
outputs = np.apply_along_axis(choose_class, 1, outputs)
outputs
# array([[0, 1, 0, 0, 0, 0],
# [0, 0, 1, 0, 0, 0],
# [0, 1, 0, 0, 0, 0]])
For the simple mask, you don't need np.where (this assumes outputs is already a NumPy array):
mask = outputs >= 0.5
If you want an integer instead of a boolean:
mask = (outputs >= 0.5).view(np.uint8)
To check the fifth column, you need to keep a reference to the original data around. You can get the column of the maximum masked value in each relevant row with
rows = np.flatnonzero(mask[:, 4])
keep = (outputs[rows] * mask[rows]).argmax(axis=1)
Then you can blank out the rows and set only the maximum value:
mask[rows] = 0
mask[rows, keep] = 1
One other solution:
# Your example input array
out = np.array([[0.015, 0.672, 0.002, 0.213, 0.002, 0.026],
[0.006, 0.005, 0.986, 0.004, 0.007, 0.010],
[0.023, 0.583, 0.005, 0.012, 0.540, 0.010]])
# We get the desired result
val = (out>=0.5)*out//(out.max(axis=1))[:,None]
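This solution performs the following operations:
Set all values < 0.5 to zero.
Set the maximum value of each row to 1 (if and only if this value is >= 0.5) and everything else to 0.
Note that this keeps only the largest value in every row, not only in rows where class #5 exceeds the threshold, and that the result is a float array.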
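The rule from the question can also be applied to all rows at once without np.apply_along_axis. The following is a sketch under my own assumptions: the function name exclusive_threshold and its parameters exclusive_col and thr are hypothetical, and "class #5" is taken to be column index 4 as in the first answer.

```python
import numpy as np

def exclusive_threshold(out, exclusive_col=4, thr=0.5):
    """Threshold each row; where the exclusive class fires together with
    another class, keep only the row maximum. Hypothetical helper."""
    out = np.asarray(out, dtype=float)
    mask = out >= thr
    # Rows where the exclusive class and at least one other class pass the threshold.
    conflict = mask[:, exclusive_col] & (mask.sum(axis=1) > 1)
    rows = np.flatnonzero(conflict)
    # Blank the conflicting rows, then re-enable only their largest entry.
    mask[rows] = False
    mask[rows, out[rows].argmax(axis=1)] = True
    return mask.astype(int)

out = np.array([[0.015, 0.672, 0.002, 0.213, 0.002, 0.026],
                [0.006, 0.005, 0.986, 0.004, 0.007, 0.010],
                [0.023, 0.583, 0.005, 0.012, 0.540, 0.010]])
labels = exclusive_threshold(out)
```

Rows without a conflict keep every class that passes the threshold, so ordinary multi-label rows are left unchanged.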

Is there a more efficient way to generate an array from another array with a slightly complex rule?

I am trying to compute the distance between each element of an array and a starting point.
Here is an array.
Assume the element (0, 1) is the starting point; it currently has the highest value.
A neighbor of a specific point is an element that shares one coordinate with it and differs by 1 unit in the other.
In general, a neighbor can be the element above, below, to the left, or to the right of a specific point, as long as it is inside the array.
The task is to label every element with a distance value that indicates how far it is from the starting point (0, 1).
import numpy as np
ds = np.array([[1, 2, 1],
               [1, 1, 0],
               [0, 1, 1]])
dist = np.full_like(ds, -1)
p0 = np.where(ds == 2)
dist[p0] = 0
que = []
que.append(p0)
nghb_x = [0, 0, -1, 1]
nghb_y = [-1, 1, 0, 0]
while len(que):
    x, y = que.pop()
    d = dist[(x,y)]
    for idx0, idx1 in zip(nghb_x, nghb_y):
        tmp_x = x + idx0
        tmp_y = y + idx1
        if np.any(tmp_x >= 0) and np.any(tmp_x < ds.shape[0]) and np.any(tmp_y >= 0) and np.any(tmp_y < ds.shape[1]) and np.any(dist[(tmp_x,tmp_y)] == -1):
            dist[(tmp_x,tmp_y)] = d + 1 # distance = distance(x) + 1
            que.append((tmp_x, tmp_y))
print('dist:')
print(dist)
The output
dist:
[[1 0 1]
[2 1 2]
[3 2 3]]
is as expected, but I would like to know whether there is a more efficient way to do this.
You're calculating the Manhattan distance (the x-distance plus the y-distance) from a target point for each point.
You can use a numpy function to do it in one step, given the target coordinates and the shape of the array:
target = (0, 1)
np.fromfunction(lambda x,y: np.abs(target[0]-x) + np.abs(target[1]-y), ds.shape)
Result:
[[1. 0. 1.]
[2. 1. 2.]
[3. 2. 3.]]
Demo: https://repl.it/repls/TrustyUnhappyFlashdrives
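If an integer result is preferred (np.fromfunction passes float coordinates by default), the same Manhattan distance can be sketched with np.indices. The helper name manhattan_grid is hypothetical, and the shortcut only matches a neighbor-by-neighbor search when every cell of the grid can be walked through, as in the question's code.

```python
import numpy as np

def manhattan_grid(shape, target):
    """Integer Manhattan distance of every cell in a 2-D grid to `target`.

    Hypothetical helper; `target` is a (row, col) tuple.
    """
    rows, cols = np.indices(shape)  # integer coordinate grids
    return np.abs(rows - target[0]) + np.abs(cols - target[1])

ds = np.array([[1, 2, 1],
               [1, 1, 0],
               [0, 1, 1]])
dist = manhattan_grid(ds.shape, (0, 1))
```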

How to apply lower and upper threshold to NumPy array?

I have the following array
array = np.array([-0.5, -2, -1, -0.5, -0.25, 0, 0, -2, -1, 0.25, 0.5, 1, 2])
and would like to apply two thresholds, such that all values below -1.0 are set to 1 and all values above -0.3 are set to 0. For the values in between, the following rule should apply: if the last value was below -1.0 then it should be a 1, but if the last value was above -0.3 then it should be a 0.
For the example array above, the output should be
target = np.array([0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0])
If multiple consecutive values are between -1.0 and -0.3, then it should go back as far as required until there is a value above or below the two thresholds and set the output accordingly.
I tried to achieve this by iterating over the array and using a while loop inside the for loop to find the next occurrence where the value is above the threshold, but it doesn't work:
array = np.array([-0.5, -2, -1, -0.5, -0.25, 0, 0, -2, -1, 0.25, 0.5, 1, 2])
p = []
def function(array, p):
    for i in np.nditer(array):
        if i < -1:
            while i <= -0.3:
                p.append(1)
                i += 1
        else:
            p.append(0)
            i += 1
    return p
a = function(array, p)
print(a)
How can I apply the two thresholds to my array as described above?
What you are trying to achieve is called "thresholding with hysteresis". For this, I adapted the very nice algorithm from this answer:
Given your test data,
import numpy as np
array = np.array([-0.5, -2, -1, -0.5, -0.25, 0, 0, -2, -1, 0.25, 0.5, 1, 2])
you detect which values are below the first threshold -1.0, and which are above the second threshold -0.3:
low_values = array <= -1.0
high_values = array >= -0.3
These are the values for which you know the result: either 1 or 0. For all other values, the result depends on the values before them. Thus, all values for which either low_values or high_values is True are known.
You can get the indices of all known elements with:
known_values = high_values | low_values
known_idx = np.nonzero(known_values)[0]
To find the result for all unknown values, we use the np.cumsum function on the known_values array. The Booleans are interpreted as 0 or 1, so this gives us the following array:
acc = np.cumsum(known_values)
which will result in the following for your example:
[ 0 1 2 2 3 4 5 6 7 8 9 10 11].
Now, known_idx[acc - 1] will contain the index of the last known value for each point. With low_values[known_idx[acc - 1]] you get a True if the last known value was below -1.0 and a False if it was above -0.3:
result = low_values[known_idx[acc - 1]]
There is one problem left: If the initial value is below -1.0 or above -0.3, then everything works out perfectly fine. But if it is in-between, then it would depend on its left neighbor - which it doesn't have. So in your case, you simply define it to be zero.
We can do that by checking where acc equals 0. If acc[0] = 1, then everything is fine, but wherever acc is 0 the value lies between -1.0 and -0.3 with no known value before it (the index acc - 1 = -1 would wrap around to the last known value), so we have to set those entries to zero:
result[acc == 0] = False
Finally, as we were doing lots of comparisons, our result array is a boolean array. To convert it to integer 0 and 1, we simply call
result = np.int8(result)
and we get our desired result:
array([0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0], dtype=int8)
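The steps above can be collected into one function. This is a sketch under my own assumptions: the name hysteresis_threshold and the parameters low, high, and initial are hypothetical, and initial is the state assigned to leading in-between values that have no known value before them.

```python
import numpy as np

def hysteresis_threshold(values, low=-1.0, high=-0.3, initial=False):
    """Return 1 where the most recent out-of-band value was <= low, else 0.

    Hypothetical helper that follows the cumsum-based steps described above.
    """
    values = np.asarray(values)
    low_values = values <= low
    high_values = values >= high
    known = low_values | high_values
    known_idx = np.flatnonzero(known)
    if known_idx.size == 0:
        # Nothing ever leaves the band, so every element keeps the initial state.
        return np.full(values.shape, int(initial), dtype=np.int8)
    acc = np.cumsum(known)
    # Index of the last known value at or before each position.
    result = low_values[known_idx[acc - 1]]
    # Leading in-between values have no predecessor to inherit from.
    result[acc == 0] = initial
    return result.astype(np.int8)

array = np.array([-0.5, -2, -1, -0.5, -0.25, 0, 0, -2, -1, 0.25, 0.5, 1, 2])
result = hysteresis_threshold(array)
```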

NumPy: How to avoid this loop?

Is there a way to avoid this loop in order to optimize the code?
import numpy as np
cLoss = 0
dist_ = np.array([0,1,0,1,1,0,0,1,1,0]) # just an example, longer in reality
TLabels = np.array([-1,1,1,1,1,-1,-1,1,-1,-1]) # just an example, longer in reality
t = float(dist_.size)
for i in range(len(dist_)):
    labels = TLabels[dist_ == dist_[i]]
    cLoss += 1 - TLabels[i]*(1. * np.sum(labels)/t)
print(cLoss)
Note: dist_ and TLabels are both numpy arrays with the same shape (t,1)
I am not sure what exactly you want to do, but are you aware of scipy.ndimage.measurements for computing on arrays with labels? It looks like you want something like:
cLoss = len(dist_) - sum(TLabels * scipy.ndimage.measurements.sum(TLabels,dist_,dist_) / len(dist_))
I first wonder, what is labels at each step in the loop?
With dist_ = array([2,1,2]) and TLabels=array([1,2,3])
I get
[-1 1]
[1]
[-1 1]
The differing lengths immediately raise a warning flag: it may be difficult to vectorize this.
With the longer arrays in the edited example
[-1 1 -1 -1 -1]
[ 1 1 1 1 -1]
[-1 1 -1 -1 -1]
[ 1 1 1 1 -1]
[ 1 1 1 1 -1]
[-1 1 -1 -1 -1]
[-1 1 -1 -1 -1]
[ 1 1 1 1 -1]
[ 1 1 1 1 -1]
[-1 1 -1 -1 -1]
The labels vectors are all the same length. Is that normal, or just a coincidence of values?
Drop a couple of elements off of dist_, and labels are:
In [375]: for i in range(len(dist_)):
    labels = TLabels[dist_ == dist_[i]]
    v = (1.*np.sum(labels)/t); v1 = 1-TLabels[i]*v
    print(labels, v, TLabels[i], v1)
    cLoss += v1
.....:
(array([-1, 1, -1, -1]), -0.25, -1, 0.75)
(array([1, 1, 1, 1]), 0.5, 1, 0.5)
(array([-1, 1, -1, -1]), -0.25, 1, 1.25)
(array([1, 1, 1, 1]), 0.5, 1, 0.5)
(array([1, 1, 1, 1]), 0.5, 1, 0.5)
(array([-1, 1, -1, -1]), -0.25, -1, 0.75)
(array([-1, 1, -1, -1]), -0.25, -1, 0.75)
(array([1, 1, 1, 1]), 0.5, 1, 0.5)
Again different lengths of labels, but really only a few calculations. There is 1 v value for each different dist_ value.
Without working out all the details, it looks like you are just calculating labels*labels for each distinct dist_ value, and then summing those.
This looks like a groupby problem. You want to divide dist_ into groups with a common value, and sum some function of their corresponding TLabels values. Python's itertools has a groupby function, and so does pandas. itertools.groupby only groups consecutive equal keys, so it requires dist_ to be sorted first; pandas' groupby does not.
Try sorting dist_ and see if that adds any clarity to the problem.
I'm not sure if this is any better since I didn't exactly understand why you might want to do this. Many variables in your loop are bivalued hence can be computed in advance.
Also the entries of dist_ can be used as a boolean switch but I used an explicit copy anyhow.
dist_ = np.array([0,1,0,1,1,0,0,1,1,0])
TLabels = np.array([-1,1,1,1,1,-1,-1,1,-1,-1])
t = len(dist_)
dist_zeros = dist_== 0
one_zero_sum = [sum(TLabels[dist_zeros])/t , sum(TLabels[~dist_zeros])/t]
cLoss = sum([1-x*one_zero_sum[dist_[y]] for y,x in enumerate(TLabels)])
which results in cLoss = 8.2. I am using Python3 so didn't check whether this is a true division or not in Python2.
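For dist_ arrays with more than two distinct values, the per-group sums can be computed in one pass. The sketch below is my own, not taken from the answers: the function name closs_vectorized is hypothetical, and it relies on np.unique with return_inverse plus np.bincount with weights to obtain the sum of TLabels for each distinct dist_ value.

```python
import numpy as np

def closs_vectorized(dist_, TLabels):
    """Loop-free version of the cLoss computation from the question.

    Hypothetical helper; both inputs are flattened to 1-D first.
    """
    dist_ = np.asarray(dist_).ravel()
    TLabels = np.asarray(TLabels, dtype=float).ravel()
    n = dist_.size
    # Map each dist_ value to a group id 0..k-1.
    _, group = np.unique(dist_, return_inverse=True)
    # Sum of TLabels within each group.
    group_sums = np.bincount(group, weights=TLabels)
    # sum_i (1 - TLabels[i] * group_sum[i] / n)
    return n - np.sum(TLabels * group_sums[group]) / n

dist_ = np.array([0, 1, 0, 1, 1, 0, 0, 1, 1, 0])
TLabels = np.array([-1, 1, 1, 1, 1, -1, -1, 1, -1, -1])
cLoss = closs_vectorized(dist_, TLabels)
```

The float conversion of TLabels also sidesteps the integer-division concern raised for Python 2.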

Numpy sum running length of non-zero values

Looking for a fast vectorized function that returns the rolling number of consecutive non-zero values. The count should start over at 0 whenever encountering a zero. The result should have the same shape as the input array.
Given an array like this:
x = np.array([2.3, 1.2, 4.1 , 0.0, 0.0, 5.3, 0, 1.2, 3.1])
The function should return this:
array([1, 2, 3, 0, 0, 1, 0, 1, 2])
This post lists a vectorized approach which basically consists of two steps:
Initialize a vector of zeros with the same size as the input vector x, and set ones at the positions corresponding to the non-zeros of x.
Next, in that vector, put the negative of each island's run length right after the ending/stop position of that "island". The intention is to apply cumsum afterwards, which results in sequential numbers within the "islands" and zeros elsewhere.
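To make the two steps concrete, here is a hand-worked sketch on the example array from the question. The island positions and lengths are written out by hand purely for illustration; finding them automatically is what a full implementation has to do.

```python
import numpy as np

x = np.array([2.3, 1.2, 4.1, 0.0, 0.0, 5.3, 0, 1.2, 3.1])

# Step 1: ones where x is non-zero, zeros elsewhere.
vec = (x != 0).astype(int)   # [1 1 1 0 0 1 0 1 1]

# Step 2: right after each island, store the negative of its length.
vec[3] = -3                  # island x[0:3] has length 3
vec[6] = -1                  # island x[5:6] has length 1
# The last island runs to the end of the array, so nothing follows it.

# The cumulative sum counts up inside islands and drops back to 0 after them.
out = vec.cumsum()           # [1 2 3 0 0 1 0 1 2]
```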
Here's the implementation -
import numpy as np
#Append zeros at the start and end of input array, x
xa = np.hstack([[0],x,[0]])
# Get an array of ones and zeros, with ones for nonzeros of x and zeros elsewhere
xa1 =(xa!=0)+0
# Find consecutive differences on xa1
xadf = np.diff(xa1)
# Find start and stop+1 indices and thus the lengths of "islands" of non-zeros
starts = np.where(xadf==1)[0]
stops_p1 = np.where(xadf==-1)[0]
lens = stops_p1 - starts
# Mark indices where "minus ones" are to be put for applying cumsum
put_m1 = stops_p1[stops_p1 < x.size]
# Setup vector with ones for nonzero x's, "minus lens" at stops +1 & zeros elsewhere
vec = xa1[1:-1] # Note: this will change xa1, but it's okay as not needed anymore
vec[put_m1] = -lens[0:put_m1.size]
# Perform cumsum to get the desired output
out = vec.cumsum()
Sample run -
In [116]: x
Out[116]: array([ 0. , 2.3, 1.2, 4.1, 0. , 0. , 5.3, 0. , 1.2, 3.1, 0. ])
In [117]: out
Out[117]: array([0, 1, 2, 3, 0, 0, 1, 0, 1, 2, 0], dtype=int32)
Runtime tests -
Here are some runtime tests comparing the proposed approach against the other, itertools.groupby-based approach -
In [21]: N = 1000000
...: x = np.random.rand(1,N)
...: x[x>0.5] = 0.0
...: x = x.ravel()
...:
In [19]: %timeit sumrunlen_vectorized(x)
10 loops, best of 3: 19.9 ms per loop
In [20]: %timeit sumrunlen_loopy(x)
1 loops, best of 3: 2.86 s per loop
You can use itertools.groupby and np.hstack :
>>> import numpy as np
>>> x = np.array([2.3, 1.2, 4.1 , 0.0, 0.0, 5.3, 0, 1.2, 3.1])
>>> from itertools import groupby
>>> np.hstack([[i if j!=0 else j for i,j in enumerate(g,1)] for _,g in groupby(x,key=lambda x: x!=0)])
array([ 1., 2., 3., 0., 0., 1., 0., 1., 2.])
We can group the array elements based on whether they are non-zero, then use a list comprehension and enumerate to replace the elements of each non-zero group with their 1-based position in the group, and finally flatten the list with np.hstack.
This sub-problem came up in Kick Start 2021 Round A for me. My solution:
def current_run_len(a):
    a_ = np.hstack([0, a != 0, 0])  # first in starts and last in stops defined
    d = np.diff(a_)
    starts = np.where(d == 1)[0]
    stops = np.where(d == -1)[0]
    a_[stops + 1] = -(stops - starts)  # +1 for behind-last
    return a_[1:-1].cumsum()
In fact, the problem also required a version that counts down within each consecutive sequence. So here is another version with an optional keyword argument, which behaves the same as above for rev=False:
def current_run_len(a, rev=False):
    a_ = np.hstack([0, a != 0, 0])  # first in starts and last in stops defined
    d = np.diff(a_)
    starts = np.where(d == 1)[0]
    stops = np.where(d == -1)[0]
    if rev:
        a_[starts] = -(stops - starts)
        cs = -a_.cumsum()[:-2]
    else:
        a_[stops + 1] = -(stops - starts)  # +1 for behind-last
        cs = a_.cumsum()[1:-1]
    return cs
Results:
a = np.array([1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1])
print('a = ', a)
print('current_run_len(a) = ', current_run_len(a))
print('current_run_len(a, rev=True) = ', current_run_len(a, rev=True))
a = [1 1 1 1 0 0 0 1 1 0 1 0 0 0 1]
current_run_len(a) = [1 2 3 4 0 0 0 1 2 0 1 0 0 0 1]
current_run_len(a, rev=True) = [4 3 2 1 0 0 0 2 1 0 1 0 0 0 1]
For an array that consists of 0s and 1s only, you can simplify [0, a != 0, 0] to [0, a, 0]. But the version as-posted also works for arbitrary non-zero numbers.
