I would like to apply the result of numpy.percentile to its argument, i.e., map every number in the input vector to its quantile.
E.g., if v=np.array([1,2,3,4]), and I want just two quantiles (bigger and smaller than the median), I would get np.array([0,0,1,1]) telling me that the first two elements of v are smaller than the median and the last two are bigger than the median.
Note that I am interested in, say, deciles, not just the median!
IOW, @PaulPanzer hit the nail on the head:
np.digitize(v,np.percentile(v,quantiles))
thanks!
(v > np.percentile(v, 50)).astype(int)
Out[93]:
array([0, 0, 1, 1])
Use np.digitize:
perc = np.percentile(data, q)
indices = np.digitize(data, perc)
Example q = [25,50,75], data = np.arange(8):
indices
# array([0, 0, 1, 1, 2, 2, 3, 3])
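For deciles, as the question asks, the same np.percentile plus np.digitize idea can be wrapped in a small function. This is only a sketch: the helper name quantile_bins and its n_bins parameter are made up for illustration, and the random test data is arbitrary.

```python
import numpy as np

def quantile_bins(v, n_bins):
    """Hypothetical helper: label each element of v with its quantile bin (0 .. n_bins - 1)."""
    # Interior percentile cut points, e.g. [10, 20, ..., 90] for deciles.
    q = np.linspace(0, 100, n_bins + 1)[1:-1]
    edges = np.percentile(v, q)
    return np.digitize(v, edges)

v = np.array([1, 2, 3, 4])
halves = quantile_bins(v, 2)      # below / above the median

rng = np.random.default_rng(0)    # seed chosen arbitrarily
w = rng.normal(size=1000)
deciles = quantile_bins(w, 10)    # integer labels from 0 to 9
```

With n_bins=2 this reproduces the median split from the question; with n_bins=10 every element gets a decile label.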
I am trying to simulate a grid of spins in python that can change their orientation (represented by the sign):
>>> import numpy as np
>>> spin_values = np.random.choice([-1, 1], (2, 2))
>>> spin_values
array([[-1, 1],
[ 1, 1]])
I then throw two sets of random indices of that grid for spins that have a certain probability to switch their orientation, let's say:
>>> i = np.array([1, 1])
>>> j = np.array([0, 0])
>>> switches = np.array([-1, -1])
i and j here contain the indices that might change and switches states whether they do switch (-1) or keep their orientation (1). My idea for calculating the new orientations was:
>>> spin_values[i, j] *= switches
When a spin orientation only changes once this works fine. However, when it is supposed to change twice (as with the example values) it only changes once, therefore giving me a wrong result.
>>> spin_values
array([[-1, 1],
[-1, 1]])
How could I get the right results while having a short run time (this has to be done many times on a bigger grid)?
I would use numpy.unique to count how often each pair of indices occurs and multiply the spins by (-1) ** n (this assumes every entry of switches is -1):
idx, cnt = np.unique(np.vstack([i, j]), axis=1, return_counts=True)
spin_values[tuple(idx)] *= (-1) ** cnt
Updated spin_values:
array([[-1, 1],
[ 1, 1]])
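If switches can also contain +1, an alternative that may be worth trying is the unbuffered np.multiply.at, which applies every (i, j, switch) triple even when indices repeat. A minimal sketch using the values from the question (the copy named buffered is only there to show the difference):

```python
import numpy as np

spin_values = np.array([[-1, 1],
                        [ 1, 1]])
i = np.array([1, 1])
j = np.array([0, 0])
switches = np.array([-1, -1])

# Buffered fancy-index assignment: the repeated index (1, 0) is applied only once.
buffered = spin_values.copy()
buffered[i, j] *= switches

# Unbuffered ufunc.at: each occurrence of (1, 0) is applied, so the spin flips twice.
np.multiply.at(spin_values, (i, j), switches)
```

Here buffered[1, 0] ends up as -1, while spin_values[1, 0] is back at 1 after two flips. Whether ufunc.at is fast enough for a large grid would need to be timed.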
I have the following arrays:
time = [1e-6, 2e-6, 3e-6, 4e-6, 5e-6, 6e-6, 7e-6, 8e-6, 9e-6, 10e-6]
signal = [0, 10, 3, 2, 1, 0, 10, 2, 2, 5]
and I want to remove (from both arrays) any datapoints that are above a threshold value, with a given padding width
threshold = 9
padding = 3e-6
so any indexes that are above 9 in the signal array, or that fall within the padding window around such a point in the time array, should be removed from both arrays. Note: this means there could be overlap if there are two data points within the padding window that are above the threshold
example output
time_out = [4e-6, 5e-6, 9e-6, 10e-6]
signal_out = [2, 1, 2, 5]
EDIT: this post is very similar, however it does it only for one index of an array, where I would need to do it at multiple (above e.g. time=2e-6 and time=7e-6) https://stackoverflow.com/a/66695205/12728698
Let's try this one. The idea is to create a boolean mask which returns True if a signal is out of reach of threshold for each padding. I divided the padding by 3, since IIUC, a padding is a window of size 3, so we only need to consider the signals that are greater than the threshold and its 2 adjacent values.
time_arr = np.array(time)
signal_arr = np.array(signal)
llim = time_arr[signal_arr>threshold, None] - padding/3
ulim = time_arr[signal_arr>threshold, None] + padding/3
msk = ((llim > time_arr) | (ulim< time_arr)).all(axis=0)
time_out = time_arr[msk]
signal_out = signal_arr[msk]
Another option is to use numpy.roll to get the adjacent values to create a boolean mask:
comp = signal_arr<=threshold
msk = np.roll(comp, 1) & comp & np.roll(comp, -1)
time_out = time_arr[msk]
signal_out = signal_arr[msk]
Output:
array([4.e-06, 5.e-06, 9.e-06, 1.e-05])
array([2, 1, 2, 5])
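Note that np.roll wraps around, so the first and last samples are treated as neighbours. A variant without wrap-around can be sketched with np.convolve; half_width (the number of samples dropped on each side of a peak) is an assumed parameter, not something from the question.

```python
import numpy as np

time = [1e-6, 2e-6, 3e-6, 4e-6, 5e-6, 6e-6, 7e-6, 8e-6, 9e-6, 10e-6]
signal = [0, 10, 3, 2, 1, 0, 10, 2, 2, 5]
threshold = 9
half_width = 1  # assumption: drop this many neighbours on each side of a peak

time_arr = np.array(time)
signal_arr = np.array(signal)

# Mark samples above the threshold, then spread each mark to its neighbours.
hot = (signal_arr > threshold).astype(int)
kernel = np.ones(2 * half_width + 1, dtype=int)
near_hot = np.convolve(hot, kernel, mode="same") > 0

msk = ~near_hot
time_out = time_arr[msk]
signal_out = signal_arr[msk]
```

For the example data this keeps the same four points as the two approaches above, and overlapping windows are handled because the marks are simply merged.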
Considering the 3 arrays below:
np.random.seed(0)
X = np.random.randint(10, size=(4,5))
W = np.random.randint(10, size=(3,4))
y = np.random.randint(3, size=(5,1))
I want to add each column of the matrix X to the row of W given by y as index. For example, if the first element in y is 2, I add the first column of X to the third row of W (index 2 in Python). I repeat this until all columns of X have been added to their respective rows of W.
i could do it in different ways:
1- using for loop:
for i, j in enumerate(y):
    W[j] += X[:, i]
2- using the add.at function
np.add.at(W,(y.ravel()),X.T)
3- but i can't understand how to do it using einsum.
i was given a solution ,but really can't understand it.
N = y.max()+1
W[:N] += np.einsum('ijk,lk->il',(np.arange(N)[:,None,None] == y.ravel()),X)
Could anyone explain this structure to me?
1- (np.arange(N)[:,None,None] == y.ravel(), X). I imagine this part refers to adding the columns of X to the specific rows of W, according to y. But where is W? And why do we have to transform W into 4 dimensions in this case?
2- 'ijk,lk->il': I didn't understand this either.
i - refers to the rows,
j - columns,
k - each element,
l - what does 'l' refer to?
If anyone can understand this and explain it to me, I would really appreciate it.
Thanks in advance.
Let's simplify the problem by dropping one dimension and using values that are easy to verify manually:
W = np.zeros(3, int)  # np.int was removed in NumPy 1.24; the builtin int works
y = np.array([0, 1, 1, 2, 2])
X = np.array([1, 2, 3, 4, 5])
Values in the vector W get added values from X by looking up with y:
for i, j in enumerate(y):
    W[j] += X[i]
W is calculated as [1, 5, 9], (check quickly by hand).
Now, how could this code be vectorized? We can't do a simple W[y] += X, as y has duplicate values in it and the different sums would overwrite each other at indices 1 and 2.
What could be done is to broadcast the values into a new dimension of len(y) and then sum up over this newly created dimension.
N = W.shape[0]
select = (np.arange(N) == y[:, None]).astype(int)
Taking the index range of W ([0, 1, 2]), and setting the values where they match y to 1 in a new dimension, otherwise 0. select contains this array:
array([[1, 0, 0],
[0, 1, 0],
[0, 1, 0],
[0, 0, 1],
[0, 0, 1]])
It has len(y) == len(X) rows and len(W) columns and shows for every y/row, what index of W it contributes to.
Let's multiply X with this array, mult = select * X[:, None]:
array([[1, 0, 0],
[0, 2, 0],
[0, 3, 0],
[0, 0, 4],
[0, 0, 5]])
We have effectively spread out X into a new dimension, and sorted it in a way we can get it into shape W by summing over the newly created dimension. The sum over the rows is the vector we want to add to W:
sum_Xy = np.sum(mult, axis=0) # [1, 5, 9]
W += sum_Xy
The computation of select and mult can be combined with np.einsum:
# `select` has shape (len(y)==len(X), len(W)), or `yw`
# `X` has shape len(X)==len(y), or `y`
# we want something `len(W)`, or `w`, and to reduce the other dimension
sum_Xy = np.einsum("yw,y->w", select, X)
And that's it for the one-dimensional example. For the two-dimensional problem posed in the question it is exactly the same approach: introduce an additional dimension, broadcast the y indices, and then reduce the additional dimension with einsum.
If you internalize how every step works for the one-dimensional example, I'm sure you can work out how the code is doing it in two dimensions, as it is just a matter of getting the indices right (W rows, X columns).
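As a rough sketch of that two-dimensional step, assuming the same shapes as in the question (the random values are arbitrary, and the subscript letters w, y, x are my own choice), the one-hot select array can be contracted with X over the shared y axis:

```python
import numpy as np

rng = np.random.default_rng(0)        # seed chosen arbitrarily
X = rng.integers(10, size=(4, 5))     # 5 columns to distribute
W = rng.integers(10, size=(3, 4))     # 3 target rows
y = rng.integers(3, size=(5, 1))      # target row for each column of X

# Reference: the explicit loop from the question.
W_loop = W.copy()
for col, row in enumerate(y.ravel()):
    W_loop[row] += X[:, col]

# One-hot selection: select[w, k] is 1 where column k of X goes to row w of W.
select = (np.arange(W.shape[0])[:, None] == y.ravel()).astype(int)

# Sum over the shared axis y; the result has shape (rows of W, rows of X).
W_einsum = W + np.einsum("wy,xy->wx", select, X)

# Cross-check with the unbuffered np.add.at from the question.
W_at = W.copy()
np.add.at(W_at, y.ravel(), X.T)
```

All three results should agree. The einsum in the question does the same contraction, only with an extra length-1 axis on the selection array.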
I'm working on a project involving binary patterns (here np.arrays of 0 and 1).
I'd like to modify a random subset of these and return several altered versions of the pattern where a given fraction of the values have been changed (like map a function to a random subset of an array of fixed size)
ex : take the pattern [0 0 1 0 1] and rate 0.2, return [[0 1 1 0 1] [1 0 1 0 1]]
It seems possible by using auxiliary arrays and iterating with a condition, but is there a "clean" way to do that ?
Thanks in advance !
The built-in map function works on boolean arrays too. You could add the subsample logic to your function, like so:
import numpy as np
rate = 0.2
f = lambda x: np.random.choice((True, x), 1, p=[rate, 1 - rate])[0]
a = np.array([0, 0, 1, 0, 1], dtype='bool')
np.array(list(map(f, a)))  # in Python 3, map returns an iterator, so materialize it
# Each element of a is set to "1" with probability 20%,
# so the number of changed elements varies from run to run.
Or you could rewrite a map function, like so:
import numpy as np
def map_bitarray(f, b, rate):
    '''
    maps function f on a random subset of b
    :param f: the function, should take a binary array of size <= len(b)
    :param b: the binary array
    :param rate: the fraction of elements that will be replaced
    :return: the modified binary array
    '''
    c = np.copy(b)
    num_elem = len(c)
    # the sample size must be an integer
    idx = np.random.choice(range(num_elem), int(num_elem * rate), replace=False)
    c[idx] = f(c[idx])
    return c
f = lambda x: True
b = np.array([0,0,1,0,1], dtype='bool')
map_bitarray(f, b, 0.2)
# This will output array b with exactly 20% of the elements changed to "1"
rate=0.2
repeats=5
seed=[0,0,1,0,1]
realizations=np.tile(seed,[repeats,1]) ^ np.random.binomial(1,rate,[repeats,len(seed)])
Use np.tile() to generate a matrix from the seed row.
np.random.binomial() to generate a binomial mask matrix with your requested rate.
Apply the mask with the xor binary operator ^
EDIT:
Based on @Jared Goguen's comments, if you want to change exactly 20% of the bits, you can build a mask by randomly choosing which elements to change:
seed=[1,0,1,0,1]
rate=0.2
repeats=10
mask_list=[]
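for _ in range(repeats):
    y = np.zeros(len(seed), np.int32)
    # integer sample size, and replace=False so no position is picked twice
    y[np.random.choice(len(seed), int(rate * len(seed)), replace=False)] = 1
    mask_list.append(y)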
mask = np.vstack(mask_list)
realizations=np.tile(seed,[repeats,1]) ^ mask
So, there's already an answer that provides sequences where each element has a random transition probability. However, it seems like you might want an exact fraction of the elements to change instead. For example, [1, 0, 0, 1, 0] can change to [1, 1, 0, 1, 0] or [0, 0, 0, 1, 0], but not [1, 1, 1, 1, 0].
The premise, based off of xvan's answer, uses the bit-wise xor operator ^. When a bit is xor'd with 0, its value will not change. When a bit is xor'd with 1, it will flip. From your question, it seems like you want to change len(seq)*rate number of bits in the sequence. First create mask which contains len(seq)*rate number of 1's. To get an altered sequence, xor the original sequence with a shuffled version of mask.
Here's a simple, inefficient implementation:
import numpy as np
def edit_sequence(seq, rate, count):
    length = len(seq)
    change = int(length * rate)
    mask = [0]*(length - change) + [1]*change
    return [seq ^ np.random.permutation(mask) for _ in range(count)]
rate = 0.2
seq = np.array([0, 0, 1, 0, 1])
print(edit_sequence(seq, rate, 5))
# [0, 0, 1, 0, 0]
# [0, 1, 1, 0, 1]
# [1, 0, 1, 0, 1]
# [0, 1, 1, 0, 1]
# [0, 0, 0, 0, 1]
I don't really know much about NumPy, so maybe someone with more experience can make this efficient, but the approach seems solid.
Edit: Here's a version that times about 30% faster:
def edit_sequence(seq, rate, count):
    mask = np.zeros(len(seq), dtype=int)
    mask[:int(len(seq) * rate)] = 1  # slice bounds must be integers
    output = []
    for _ in range(count):
        np.random.shuffle(mask)
        output.append(seq ^ mask)
    return output
It appears that this updated version scales very well with the size of seq and the value of count. Using dtype=bool in seq and mask yields another 50% improvement in the timing.
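The Python loop could probably be dropped altogether. One possible sketch, under the assumption that drawing an independent random ranking per row is acceptable: argsort of a random matrix gives a permutation per row, and comparing it against the number of bits to flip yields a mask with exactly that many ones. The name edit_sequences and the rng parameter are hypothetical.

```python
import numpy as np

def edit_sequences(seq, rate, count, rng=None):
    """Hypothetical vectorized variant: flip exactly int(round(len(seq) * rate)) bits per row."""
    rng = np.random.default_rng() if rng is None else rng
    seq = np.asarray(seq, dtype=bool)
    change = int(round(len(seq) * rate))
    # Each row of ranks is a random permutation of 0 .. len(seq) - 1.
    ranks = rng.random((count, len(seq))).argsort(axis=1)
    mask = ranks < change      # exactly `change` True entries per row
    return seq ^ mask          # broadcasts seq against every mask row

seq = np.array([0, 0, 1, 0, 1])
out = edit_sequences(seq, 0.2, 5, rng=np.random.default_rng(0))
flips_per_row = (out != seq.astype(bool)).sum(axis=1)
```

Every row of out differs from seq in exactly one position for this input. I have not timed it against the loop version.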
I have an object which is described by two quantities, A and B (in real case they can be more than two). Objects are correlated depending on the value of A and B. In particular I know the correlation matrix for A and for B. Just as example:
a = np.array([[1, 1, 0, 0],
[1, 1, 0, 0],
[0, 0, 1, 1],
[0, 0, 1, 1]])
b = np.array([[1, 1, 0],
[1, 1, 1],
[0, 1, 1]])
na = a.shape[0]
nb = b.shape[0]
correlation for A:
so if an element has A == 0.5 and the other equal to A == 1.5 they are fully correlated (red). Otherwise if an element has A == 0.5 and the second item has A == 3.5 they are uncorrelated (blue).
Similarly for B:
Now I want to multiply the two correlation matrices, but I want the final matrix to have two axes, where the new axes are a folded version of the original axes:
def get_folded_bin(ia, ib):
    return ia * nb + ib
here what I am doing:
result = np.swapaxes(np.tensordot(a, b, axes=0), 1, 2).reshape(na* nb, na * nb)
visually:
and in particular this must hold:
for ia1 in range(na):
    for ia2 in range(na):
        for ib1 in range(nb):
            for ib2 in range(nb):
                assert a[ia1, ia2] * b[ib1, ib2] == result[get_folded_bin(ia1, ib1), get_folded_bin(ia2, ib2)]
actually my problem is to do it with more quantities (A, B, C, ...) in a general way. Maybe there is also a simpler function within numpy to do that.
np.einsum lets you simplify the tensordot expression a bit:
result = np.einsum('ij,kl->ikjl',a,b).reshape(-1, na * nb)
I don't think there's a way of eliminating the reshape.
It may also be easier to generalize to more arrays, though I wouldn't get carried away with too many iteration variables in one einsum expression.
I think finally I have found a solution:
np.kron(a,b)
and then I can compose with
np.kron(np.kron(a,b), c)
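For an arbitrary number of quantities, the nested np.kron calls can be folded with functools.reduce. A small sketch, where the third matrix c and the helper name fold are invented for illustration:

```python
from functools import reduce
import numpy as np

a = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
b = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])
c = np.array([[1, 0],
              [0, 1]])  # hypothetical third correlation matrix

def fold(*mats):
    """Hypothetical helper: Kronecker product of all matrices, left to right."""
    return reduce(np.kron, mats)

na, nb, nc = a.shape[0], b.shape[0], c.shape[0]

result_ab = fold(a, b)
einsum_ab = np.einsum('ij,kl->ikjl', a, b).reshape(na * nb, na * nb)

result_abc = fold(a, b, c)
einsum_abc = np.einsum('ij,kl,mn->ikmjln', a, b, c).reshape(na * nb * nc, -1)
```

Both pairs should match, which suggests that np.kron uses the same folding as get_folded_bin, with the last matrix varying fastest.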