Randomize part of an array - python

I'm working on a project involving binary patterns (here np.arrays of 0 and 1).
I'd like to modify a random subset of these and return several altered versions of the pattern, where a given fraction of the values have been changed (like mapping a function over a random subset of a fixed-size array).
E.g.: take the pattern [0 0 1 0 1] and rate 0.2, return [[0 1 1 0 1] [1 0 1 0 1]]
It seems possible by using auxiliary arrays and iterating with a condition, but is there a "clean" way to do it?
Thanks in advance!

The map function works on boolean arrays too. You could add the subsample logic to your function, like so:
import numpy as np
rate = 0.2
f = lambda x: np.random.choice((True, x), 1, p=[rate, 1 - rate])[0]
a = np.array([0, 0, 1, 0, 1], dtype='bool')
list(map(f, a))  # list() is needed on Python 3, where map returns an iterator
# This outputs a list with, on average, 20% of the elements changed to "1";
# it can be slightly more or less than 20%, by chance.
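If you want to stay fully vectorized, here is a sketch of an equivalent approach (my addition, not part of the original answer) that draws one random mask and ors it in:
import numpy as np
rate = 0.2
a = np.array([0, 0, 1, 0, 1], dtype='bool')
# Each element is independently selected with probability `rate` and forced
# to True, mirroring the map-based version above.
mask = np.random.random(a.shape) < rate
result = a | mask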
Or you could rewrite a map function, like so:
import numpy as np
def map_bitarray(f, b, rate):
    '''
    maps function f on a random subset of b
    :param f: the function, should take a binary array of size <= len(b)
    :param b: the binary array
    :param rate: the fraction of elements that will be replaced
    :return: the modified binary array
    '''
    c = np.copy(b)
    num_elem = len(c)
    # np.random.choice needs an integer count, so truncate num_elem * rate
    idx = np.random.choice(num_elem, int(num_elem * rate), replace=False)
    c[idx] = f(c[idx])
    return c
f = lambda x: True
b = np.array([0,0,1,0,1], dtype='bool')
map_bitarray(f, b, 0.2)
# This will output array b with exactly 20% of the elements changed to "1"
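To get several altered versions, as asked in the question, you can simply call it repeatedly:
versions = [map_bitarray(f, b, 0.2) for _ in range(2)]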

import numpy as np
rate = 0.2
repeats = 5
seed = [0, 0, 1, 0, 1]
realizations = np.tile(seed, [repeats, 1]) ^ np.random.binomial(1, rate, [repeats, len(seed)])
Use np.tile() to generate a matrix from the seed row, np.random.binomial() to generate a binomial mask matrix with your requested rate, and apply the mask with the binary xor operator ^.
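For reference, xor with 1 flips a bit while xor with 0 leaves it unchanged, which is why the binomial draw acts as a flip mask:
print(np.array([0, 0, 1, 1]) ^ np.array([0, 1, 0, 1]))  # [0 1 1 0]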
EDIT:
Based on @Jared Goguen's comments, if you want to change exactly 20% of the bits, you can build the mask by choosing elements to change at random:
seed = [1, 0, 1, 0, 1]
rate = 0.2
repeats = 10
mask_list = []
for _ in range(repeats):
    y = np.zeros(len(seed), np.int32)
    # replace=False guarantees distinct positions, so exactly
    # int(rate * len(seed)) bits flip in each realization
    y[np.random.choice(len(seed), int(rate * len(seed)), replace=False)] = 1
    mask_list.append(y)
mask = np.vstack(mask_list)
realizations = np.tile(seed, [repeats, 1]) ^ mask
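If you want to avoid the Python loop entirely, here is a loop-free sketch (my own variant, not from the original answer) built on argsort over random keys:
k = int(rate * len(seed))
# argsort of i.i.d. random keys gives a uniformly random permutation of each
# row, so comparing it against k marks exactly k random positions per row.
mask = (np.argsort(np.random.rand(repeats, len(seed)), axis=1) < k).astype(np.int32)
realizations = np.tile(seed, [repeats, 1]) ^ mask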

So, there's already an answer that provides sequences where each element has a random transition probability. However, it seems like you might want an exact fraction of the elements to change instead. For example, [1, 0, 0, 1, 0] can change to [1, 1, 0, 1, 0] or [0, 0, 0, 1, 0], but not [1, 1, 1, 1, 0].
The premise, based on xvan's answer, uses the bit-wise xor operator ^. When a bit is xor'd with 0, its value will not change. When a bit is xor'd with 1, it will flip. From your question, it seems like you want to change len(seq)*rate bits in the sequence. First create a mask that contains len(seq)*rate ones. To get an altered sequence, xor the original sequence with a shuffled version of the mask.
Here's a simple, inefficient implementation:
import numpy as np
def edit_sequence(seq, rate, count):
    length = len(seq)
    change = int(length * rate)
    mask = [0] * (length - change) + [1] * change
    return [seq ^ np.random.permutation(mask) for _ in range(count)]
rate = 0.2
seq = np.array([0, 0, 1, 0, 1])
print(edit_sequence(seq, rate, 5))
# [0, 0, 1, 0, 0]
# [0, 1, 1, 0, 1]
# [1, 0, 1, 0, 1]
# [0, 1, 1, 0, 1]
# [0, 0, 0, 0, 1]
I don't really know much about NumPy, so maybe someone with more experience can make this efficient, but the approach seems solid.
Edit: Here's a version that runs about 30% faster:
def edit_sequence(seq, rate, count):
    mask = np.zeros(len(seq), dtype=int)
    mask[:int(len(seq) * rate)] = 1  # slice bounds must be integers
    output = []
    for _ in range(count):
        np.random.shuffle(mask)
        output.append(seq ^ mask)
    return output
It appears that this updated version scales very well with the size of seq and the value of count. Using dtype=bool in seq and mask yields another 50% improvement in the timing.
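As a sketch, the boolean variant mentioned above might look like this (my reconstruction, not the answerer's code):
def edit_sequence_bool(seq, rate, count):
    # Same algorithm, but with boolean arrays for seq and mask.
    seq = np.asarray(seq, dtype=bool)
    mask = np.zeros(len(seq), dtype=bool)
    mask[:int(len(seq) * rate)] = True
    output = []
    for _ in range(count):
        np.random.shuffle(mask)
        output.append(seq ^ mask)
    return output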

Related

Getting the right sign when calculating repeated sign switches in numpy array

I am trying to simulate a grid of spins in python that can change their orientation (represented by the sign):
>>> import numpy as np
>>> spin_values = np.random.choice([-1, 1], (2, 2))
>>> spin_values
array([[-1,  1],
       [ 1,  1]])
I then throw two sets of random indices of that grid for spins that have a certain probability to switch their orientation, let's say:
>>> i = np.array([1, 1])
>>> j = np.array([0, 0])
>>> switches = np.array([-1, -1])
i and j here contain the indices that might change, and switches states whether they do switch (-1) or keep their orientation (1). My idea for calculating the new orientations was:
>>> spin_values[i, j] *= switches
When a spin orientation only changes once this works fine. However, when it is supposed to change twice (as with the example values) it only changes once, therefore giving me a wrong result.
>>> spin_values
array([[-1,  1],
       [-1,  1]])
How could I get the right results while having a short run time (this has to be done many times on a bigger grid)?
I would use numpy.unique to get the count of each unique pair of indices and multiply by (-1) ** count:
idx, cnt = np.unique(np.vstack([i, j]), axis=1, return_counts=True)
spin_values[tuple(idx)] *= (-1) ** cnt
Updated spin_values:
array([[-1,  1],
       [ 1,  1]])
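If switches can also contain 1 (keep), a small extension of the same idea (my sketch, not part of the original answer) counts only the index pairs that actually flip:
flip = switches == -1
idx, cnt = np.unique(np.vstack([i[flip], j[flip]]), axis=1, return_counts=True)
spin_values[tuple(idx)] *= (-1) ** cnt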

How can I improve my custom function vectorization using numpy

I am new to python, and even more new to vectorization. I have attempted to vectorize a custom similarity function that should return a matrix of pairwise similarities between each row in an input array.
IMPORTS:
import numpy as np
from itertools import product
from numpy.lib.stride_tricks import sliding_window_view
INPUT:
np.random.seed(11)
a = np.array([0, 0, 0, 0, 0, 10, 0, 0, 0, 50, 0, 0, 5, 0, 0, 10])
b = np.array([0, 0, 5, 0, 0, 10, 0, 0, 0, 50, 0, 0, 10, 0, 0, 5])
c = np.array([0, 0, 5, 1, 0, 20, 0, 0, 0, 30, 0, 1, 10, 0, 0, 5])
m = np.array((a,b,c))
OUTPUT:
custom_func(m)
array([[   0,  440, 1903],
       [ 440,    0, 1603],
       [1903, 1603,    0]])
FUNCTION:
def custom_func(arr):
    diffs = 0
    max_k = 6
    for n in range(1, max_k):
        arr1 = np.array([np.sum(i, axis=1) for i in sliding_window_view(arr, window_shape=n, axis=1)])
        # np.maximum and np.minimum take the element-wise max and min of two
        # rows; their difference is then summed over the whole row
        diffs += np.sum((np.array([np.maximum(arr1[i[0]], arr1[i[1]]) for i in product(np.arange(len(arr1)), np.arange(len(arr1)))]) - np.array([np.minimum(arr1[i[0]], arr1[i[1]]) for i in product(np.arange(len(arr1)), np.arange(len(arr1)))])), axis=1) * n
    diffs = diffs.reshape(len(arr), -1)
    return diffs
The function is quite simple, it sums up the element-wise differences between max and minimum of rows in N sliding windows. This function is much faster than what I was using before finding out about vectorization today (for loops and pandas dataframes yay).
My first thought is to figure out a way to find both the minimum and maximum of my arrays in a single pass since I currently THINK it has to do two passes, but I was unable to figure out how. Also there is a for loop in my current function because I need to do this for multiple N sliding windows, and I am not sure how to do this without the loop.
Any help is appreciated!
Here are several optimizations you can apply to the code:
use Numba's JIT to speed up the computation, and replace the product call with nested loops
use a more efficient sliding-window algorithm (better complexity)
avoid computing product and arange multiple times in the loop
reduce the number of implicit temporary arrays allocated (and of NumPy calls)
do not compute the lower triangular part of diffs, since the matrix is symmetric (just copy the upper triangular part)
use integer-based indexing rather than slow floating-point indexing
Here is the resulting code:
import numpy as np
import numba as nb
@nb.njit
def custom_func_fast(arr):
    h, w = arr.shape[0], arr.shape[1]
    diffs = np.zeros((h, h), dtype=arr.dtype)
    max_k = 6
    for n in range(1, max_k):
        arr1 = np.empty((h, w - n + 1), dtype=arr.dtype)
        for i in range(h):
            # Efficient sliding window algorithm
            assert w >= n
            s = np.sum(arr[i, 0:n])
            arr1[i, 0] = s
            for j in range(n, w):
                s -= arr[i, j - n]
                s += arr[i, j]
                arr1[i, j - n + 1] = s
        # Efficient distance matrix computation
        for i in range(h):
            for j in range(i + 1, h):
                s = 0
                for k in range(w - n + 1):
                    s += np.abs(arr1[i, k] - arr1[j, k])
                diffs[i, j] += s * n
    # Fill the lower triangular part
    for i in range(h):
        for j in range(i):
            diffs[i, j] = diffs[j, i]
    return diffs
The resulting code is 290 times faster on the example input array on my machine.
You can start by removing the first list comprehension:
arr1 = sliding_window_view(arr, window_shape = n, axis = 1).sum(axis=2)
I'm not going to touch that long diffs line :(
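That said, since max(x, y) - min(x, y) is just |x - y|, the long line can be collapsed with broadcasting (a sketch, assuming diffs is accumulated directly as an (h, h) matrix):
# arr1 has shape (rows, windows); broadcasting forms all pairwise
# |row_i - row_j| distances at once, giving a (rows, rows) matrix.
diffs += np.abs(arr1[:, None, :] - arr1[None, :, :]).sum(axis=2) * n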

Binary matrix aa (containing only 0 and 1): why isn't sum(sum(aa)) equal to sum(sum(aa>0))?

I have a binary mask named crop_mask, which contains only 0 and 1. Why isn't sum(sum(aa1)) equal to sum(sum(aa2))?
aa1 = crop_mask
aa2 = (aa1>0)
print(sum(sum(aa1)), sum(sum(aa2)))
This might be a minor issue, but I am just so confused now. Thanks for any help.
By definition the sums should be the same.
The only thing I can think of is that the dtype of your array (assuming you are using a numpy array) is not int or float.
Did you check that the "True"s in aa2 match the "1"s in aa1?
EDIT:
dtype = np.uint8 limits the maximum value of a column sum to 255 (2^8 - 1), so the column sums wrap around: sum(sum(a)) --> sum([0, 160, 0, ...]) (160 is the remainder of 4000 / 256).
aa1 = aa1.astype(int) will solve your issue:
a = np.zeros((4000, 4000)).astype(np.uint8)
a[:,1] = 1
a[:,4] = 1
b = (a > 0)
sum(sum(b)) # 8000
sum(sum(a)) # 320
a = a.astype(int)
sum(sum(a)) #8000
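The overflow comes from Python's builtin sum doing repeated uint8 additions; NumPy's own reduction promotes small integer dtypes to the platform integer, so on the original uint8 array this also works without casting:
a.sum()  # 8000: np.sum uses a platform-int accumulator for uint8 input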
Assuming your cropmask is indeed a 2-dimensional ndarray with only 1s and 0s, this works:
import numpy as np
cropmask = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 0]], np.uint8)
x = (cropmask > 0)
print(sum(sum(cropmask)), sum(sum(x)))
Result:
6 6
The most likely cause here is that you're wrong and your cropmask doesn't actually contain only 1s and 0s.
Have you tried:
print(sum(sum(np.logical_and((0 != crop_mask), (1 != crop_mask)))))
If that comes up greater than 0, there's something else in there.

Map numbers to their percentiles

I would like to apply the result of numpy.percentile to its argument, i.e., map every number in the input vector to its quantile.
E.g., if v=np.array([1,2,3,4]), and I want just two quantiles (bigger and smaller than the median), I would get np.array([0,0,1,1]) telling me that the first two elements of v are smaller than the median and the last two are bigger than the median.
Note that I am interested in, say, deciles, not just the median!
IOW, @PaulPanzer hit the nail on the head:
np.digitize(v,np.percentile(v,quantiles))
thanks!
(v > np.percentile(v, 50)).astype(int)
Out[93]:
array([0, 0, 1, 1])
Use np.digitize:
perc = np.percentile(data, q)
indices = np.digitize(data, perc)
Example q = [25,50,75], data = np.arange(8):
indices
# array([0, 0, 1, 1, 2, 2, 3, 3])
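Since the question mentions deciles, the same pattern with nine cut points might look like:
v = np.random.rand(100)
deciles = np.percentile(v, np.arange(10, 100, 10))  # nine cut points
indices = np.digitize(v, deciles)                   # bin index 0..9 per value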

build matrix from blocks

I have an object which is described by two quantities, A and B (in the real case there can be more than two). Objects are correlated depending on the values of A and B. In particular I know the correlation matrix for A and for B. Just as an example:
a = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
b = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])
na = a.shape[0]
nb = b.shape[0]
Correlation for A: if one element has A == 0.5 and another has A == 1.5, they are fully correlated. If instead one element has A == 0.5 and the other has A == 3.5, they are uncorrelated.
Similarly for B:
Now I want to multiply the two correlation matrices, but I want the final matrix to have two axes, where each new axis is a folded version of the original axes:
def get_folded_bin(ia, ib):
    return ia * nb + ib
Here is what I am doing:
result = np.swapaxes(np.tensordot(a, b, axes=0), 1, 2).reshape(na* nb, na * nb)
and in particular this must hold:
for ia1 in range(na):
    for ia2 in range(na):
        for ib1 in range(nb):
            for ib2 in range(nb):
                assert a[ia1, ia2] * b[ib1, ib2] == result[get_folded_bin(ia1, ib1), get_folded_bin(ia2, ib2)]
Actually my problem is to do this with more quantities (A, B, C, ...) in a general way. Maybe there is also a simpler function within numpy to do that.
np.einsum lets you simplify the tensordot expression a bit:
result = np.einsum('ij,kl->ikjl',a,b).reshape(-1, na * nb)
I don't think there's a way of eliminating the reshape.
It may also be easier to generalize to more arrays, though I wouldn't get carried away with too many iteration variables in one einsum expression.
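For example, with a third matrix c (where nc = c.shape[0]; my extrapolation of the same pattern):
# Folding (i,k,m) into rows and (j,l,n) into columns reproduces the
# get_folded_bin indexing, row = (ia * nb + ib) * nc + ic.
result = np.einsum('ij,kl,mn->ikmjln', a, b, c).reshape(na * nb * nc, -1)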
I think I have finally found a solution:
np.kron(a,b)
and then I can compose with
np.kron(np.kron(a,b), c)
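For an arbitrary list of correlation matrices, the composition can be folded up with reduce:
from functools import reduce
result = reduce(np.kron, [a, b, c])  # same as np.kron(np.kron(a, b), c)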
