I am generating a random matrix with
np.random.randint(2, size=(5, 3))
that outputs something like
[0,1,0],
[1,0,0],
[1,1,1],
[1,0,1],
[0,0,0]
How do I create the random matrix with the condition that each row cannot contain all 1's? That is, each row can be [1,0,0] or [0,0,0] or [1,1,0] or [1,0,1] or [0,0,1] or [0,1,0] or [0,1,1] but cannot be [1,1,1].
Thanks for your answers
Here's an interesting approach:
rows = np.random.randint(7, size=(6, 1), dtype=np.uint8)
np.unpackbits(rows, axis=1)[:, -3:]
Essentially, you are choosing an integer from 0 to 6 for each row, i.e. 000 to 110 in binary; 7 would be 111 (all 1's). You then extract the binary digits as columns and keep the last 3 (your 3 columns), since unpackbits outputs 8 digits per value.
Output:
array([[1, 0, 1],
[1, 0, 0],
[1, 0, 0],
[1, 0, 0],
[0, 1, 1],
[0, 0, 0]], dtype=uint8)
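The same idea can be wrapped in a small helper for any number of columns up to 8. This is only a sketch: the function name `rows_without_all_ones` and the `seed` argument are my own, and it assumes the newer `np.random.default_rng` interface is available.

```python
import numpy as np

def rows_without_all_ones(n_rows, n_cols, seed=None):
    """Hypothetical helper: random 0/1 rows, none made up solely of 1s."""
    if not 1 <= n_cols <= 8:
        raise ValueError("this sketch only handles 1 to 8 columns")
    rng = np.random.default_rng(seed)
    # Draw integers 0 .. 2**n_cols - 2; the excluded top value is the all-ones pattern.
    values = rng.integers(2**n_cols - 1, size=(n_rows, 1), dtype=np.uint8)
    # unpackbits gives 8 bits per value, most significant first; keep the last n_cols.
    return np.unpackbits(values, axis=1)[:, -n_cols:]

sample = rows_without_all_ones(1000, 3, seed=0)
```

Every row of `sample` is one of the seven allowed patterns, each drawn with equal probability.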
If you always have 3 columns, one approach is to explicitly list the possible rows and then choose randomly among them until you have enough rows:
import numpy as np
# every acceptable row
choices = np.array([
[1,0,0],
[0,0,0],
[1,1,0],
[1,0,1],
[0,0,1],
[0,1,0],
[0,1,1]
])
n_rows = 5
# randomly pick which type of row to use for each row needed
idx = np.random.choice(range(len(choices)), size=n_rows)
# make an array by using the chosen rows
array = choices[idx]
If this needs to generalize to a large number of columns, it won't be practical to explicitly list all choices (even if you create the choices programmatically, the memory is still an issue; the number of possible rows grows exponentially in the number of columns). Instead, you can create an initial matrix and then just resample any unacceptable rows until there are none left. I'm assuming that a row is unacceptable if it consists only of 1s; it would be easy to adapt this to the case where the threshold is any number of 1s, though.
n_rows = 5
n_cols = 4
array = np.random.randint(2, size=(n_rows, n_cols))
all_1s_idx = array.sum(axis=-1) == n_cols
while all_1s_idx.any():
    array[all_1s_idx] = np.random.randint(2, size=(all_1s_idx.sum(), n_cols))
    all_1s_idx = array.sum(axis=-1) == n_cols
Here we just keep resampling all unacceptable rows until there are none left. Because all of the necessary rows are resampled at once, this should be quite efficient. Additionally, as the number of columns grows larger, the probability of a row having all 1s decreases exponentially, so efficiency shouldn't be a problem.
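For the threshold variant mentioned above, a minimal sketch might look like the following. The function name `sample_rows_max_ones` and the `max_ones` parameter are made up for illustration; a row is rejected when it contains more than `max_ones` ones.

```python
import numpy as np

def sample_rows_max_ones(n_rows, n_cols, max_ones, seed=None):
    """Hypothetical helper: random 0/1 rows with at most `max_ones` ones each."""
    if max_ones < 0:
        raise ValueError("max_ones must be non-negative, otherwise no row is acceptable")
    rng = np.random.default_rng(seed)
    array = rng.integers(2, size=(n_rows, n_cols))
    # Boolean mask of rows that currently violate the threshold
    bad = array.sum(axis=1) > max_ones
    while bad.any():
        # Resample all offending rows in one call, then re-check
        array[bad] = rng.integers(2, size=(int(bad.sum()), n_cols))
        bad = array.sum(axis=1) > max_ones
    return array

result = sample_rows_max_ones(200, 4, max_ones=2, seed=1)
```

Keep in mind that a very strict threshold (for example `max_ones=0` with many columns) makes acceptable rows rare, so the loop may need many passes.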
@busybear beat me to it but I'll post it anyway, as it is a bit more general:
import sys
import numpy as np
def not_all(m, k):
    if k>64 or sys.byteorder != 'little':
        raise NotImplementedError
    sample = np.random.randint(0, 2**k-1, (m,), dtype='u8').view('u1').reshape(m, -1)
    sample[:, k//8] <<= -k%8
    return np.unpackbits(sample).reshape(m, -1)[:, :k]
For example:
>>> sample = not_all(1000000, 11)
# sanity checks
>>> unq, cnt = np.unique(sample, axis=0, return_counts=True)
>>> len(unq) == 2**11-1
True
>>> unq.sum(1).max()
10
>>> cnt.min(), cnt.max()
(403, 568)
And while I'm at it, hijacking other people's answers, here is a streamlined version of @Nathan's acceptance-rejection method.
def accrej(m, k):
    sample = np.random.randint(0, 2, (m, k), bool)
    all_ones, = np.where(sample.all(1))
    while all_ones.size:
        resample = np.random.randint(0, 2, (all_ones.size, k), bool)
        sample[all_ones] = resample
        all_ones = all_ones[resample.all(1)]
    return sample.view('u1')
Try this solution using sum():
import numpy as np
array = np.random.randint(2, size=(5, 3))
for i, entry in enumerate(array):
    if entry.sum() == 3:
        while True:
            new = np.random.randint(2, size=(1, 3))
            if new.sum() == 3:
                continue
            break
        array[i] = new
print(array)
Good luck my friend!
Related
I have an n row, m column numpy array, and would like to create a new k x m array by selecting k random elements from each column of the array. I wrote the following python function to do this, but would like to implement something more efficient and faster:
import random
import numpy as np
def sample_array_cols(MyMatrix, nelements):
    vmat = []
    TempMat = MyMatrix.T
    for v in TempMat:
        v = np.ndarray.tolist(v)
        subv = random.sample(v, nelements)
        vmat = vmat + [subv]
    return(np.array(vmat).T)
One question is whether there's a way to loop over each column without transposing the array (and then transposing back). More importantly, is there some way to map the random sample onto each column that would be faster than having a for loop over all columns? I don't have that much experience with numpy objects, but I would guess that there should be something analogous to apply/mapply in R that would work?
One alternative is to randomly generate the indices first, and then use take_along_axis to map them to the original array:
arr = np.random.randn(1000, 5000) # arbitrary
k = 10 # arbitrary
n, m = arr.shape
idx = np.random.randint(0, n, (k, m))
new = np.take_along_axis(arr, idx, axis=0)
Output (shape):
In [215]: new.shape
Out[215]: (10, 5000) # (k x m)
To sample each column without replacement just like your original solution
import numpy as np
matrix = np.arange(4*3).reshape(4,3)
matrix
Output
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
k = 2
np.take_along_axis(matrix, np.random.rand(*matrix.shape).argsort(axis=0)[:k], axis=0)
Output
array([[ 9, 1, 2],
[ 3, 4, 11]])
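As a possible alternative to the argsort trick, newer NumPy versions (1.20 and later, as far as I know) provide `Generator.permuted`, which shuffles every column independently. The sketch below assumes that API is available; the variable name `sampled` is mine.

```python
import numpy as np

rng = np.random.default_rng(0)
matrix = np.arange(4 * 3).reshape(4, 3)
k = 2

# permuted(..., axis=0) shuffles each column on its own and returns a copy;
# the first k rows are then k distinct elements from every column.
sampled = rng.permuted(matrix, axis=0)[:k]
```

Because each column is a permutation of the original column, no element can be picked twice within a column.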
I would:
Pre-allocate the result array and fill in its columns, and
Use NumPy integer-array indexing
import numpy
def sample_array_cols(matrix, n_result):
    (n, m) = matrix.shape
    # Pre-allocate the (n_result, m) output array
    vmat = numpy.empty((n_result, m), dtype=matrix.dtype)
    for c in range(m):
        # Row indices for this column (drawn with replacement)
        random_indices = numpy.random.randint(0, n, n_result)
        vmat[:, c] = matrix[random_indices, c]
    return vmat
Not quite fully vectorized, but better than building up a list, and the code scans just like your description.
I am pretty new to Python and have some problems with randomness.
I am looking for something similar to RandomChoice in Mathematica.
I create a matrix of dimension, let's say, 10x3 with random numbers greater than 0. Let us call the total sum of every row s_i for i=0,...,9.
Later I want to choose, for every row, 2 out of 3 elements (no repetition) with weighted probability s_ij/s_i.
So I need something like this, but with weighted probabilities:
n=10
aa=np.random.uniform(1000, 2500, (n,3))
print(aa)
help=[0,1,2]
dd=np.zeros((n,2))
for i in range(n):
    cc=random.sample(help,2)
    dd[i,0]=aa[i,cc[0]]
    dd[i,1]=aa[i,cc[1]]
print(dd)
Speed is also an important factor here, since I will use this in a Monte Carlo approach (that's the reason I switched from Mathematica to Python), and I guess the above code can be improved heavily.
Thanks in advance for any tips/help
EDIT: I now have the following, which is working but does not look like good code to me
#pre-defined lists
nn=3
aa=np.random.uniform(1000, 2500, (nn,3))
help1=[0,1,2]
help2=aa.sum(axis=1)
#now I create a weighted prob list and fill it
help3=np.zeros((nn,3))
for i in range(nn):
    help3[i,0]=aa[i,0]/help2[i]
    help3[i,1]=aa[i,1]/help2[i]
    help3[i,2]=aa[i,2]/help2[i]
#every timestep when I have to choose 2 out of 3
help5=np.zeros((nn,2))
for i in range(nn):
    #cc=random.sample(help1,2)
    help4=np.random.choice(help1, 2, replace=False, p=[help3[i,0], help3[i,1], help3[i,2]])
    # index with the weighted draw (help4), not the commented-out cc
    help5[i,0]=aa[i,help4[0]]
    help5[i,1]=aa[i,help4[1]]
print(help5)
As pointed out in the comments, np.random.choice accepts a p parameter with the probability of each entry, so you can simply use that in a loop:
import numpy as np
# Make input data
np.random.seed(0)
n = 10
aa = np.random.uniform(1000, 2500, (n, 3))
s = np.random.rand(n, 3)
# Normalize weights
s_norm = s / s.sum(1, keepdims=True)
# Output array
out = np.empty((n, 2), dtype=aa.dtype)
# Sample iteratively
for i in range(n):
    out[i] = aa[i, np.random.choice(3, size=2, replace=False, p=s_norm[i])]
This is not the most efficient way to do things, though, as usually using vectorized operations is much faster than looping. Unfortunately, I don't think there is any way to sample from multiple categorical distributions at the same time (see NumPy issue #15201). However, since you always want to get two elements out of three, you could sample the element that you want to remove (with inverted probabilities) and then keep the other two. This snippet does something like that:
import numpy as np
# Make input data
np.random.seed(0)
n = 10
aa = np.random.uniform(1000, 2500, (n, 3))
s = np.random.rand(n, 3)
print(s)
# [[0.26455561 0.77423369 0.45615033]
# [0.56843395 0.0187898 0.6176355 ]
# [0.61209572 0.616934 0.94374808]
# [0.6818203 0.3595079 0.43703195]
# [0.6976312 0.06022547 0.66676672]
# [0.67063787 0.21038256 0.1289263 ]
# [0.31542835 0.36371077 0.57019677]
# [0.43860151 0.98837384 0.10204481]
# [0.20887676 0.16130952 0.65310833]
# [0.2532916 0.46631077 0.24442559]]
# Invert weights
si = 1 / s
# Normalize
si_norm = si / si.sum(1, keepdims=True)
# Accumulate
si_cum = np.cumsum(si_norm, axis=1)
# Sample according to inverted probabilities
t = np.random.rand(n, 1)
idx = np.argmax(t < si_cum, axis=1)
# Get non-sampled indices
r = np.arange(3)
m = r != idx[:, np.newaxis]
choice = np.broadcast_to(r, m.shape)[m].reshape(n, -1)
print(choice)
# [[1 2]
# [0 2]
# [0 2]
# [1 2]
# [0 2]
# [0 2]
# [0 1]
# [1 2]
# [0 2]
# [1 2]]
# Get corresponding data
out = np.take_along_axis(aa, choice, 1)
One possible drawback of this is that the chosen elements will always be in order (that is, for a given row, you may get the pairs of indices (0, 1), (0, 2) or (1, 2), but not (1, 0), (2, 0) or (2, 1)).
Of course, if you really just need a few samples, then the loop is probably the most convenient and maintainable solution; the second one would only be useful if you need to do this at a larger scale.
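If the fixed ordering is a problem, or if you need more than three columns, one vectorized alternative is the random-key trick usually attributed to Efraimidis and Spirakis. This is a different technique from the snippet above, not a rewrite of it: draw one uniform number per element, turn it into a key log(u)/w, and keep the elements with the largest keys in each row. To my understanding this matches drawing without replacement with probabilities proportional to the weights. The sketch below weights each element by its share of the row sum, as in the question; the names `weights`, `keys` and `choice` are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
aa = rng.uniform(1000, 2500, (n, 3))

# Weight of each element: its share of the row sum
weights = aa / aa.sum(axis=1, keepdims=True)

# Uniform numbers in (0, 1], so the logarithm is always finite
u = 1.0 - rng.random(weights.shape)
# One random key per element; larger weights tend to produce larger keys
keys = np.log(u) / weights

# Indices of the two largest keys per row, largest first
choice = np.argsort(-keys, axis=1)[:, :2]
out = np.take_along_axis(aa, choice, axis=1)
```

Here the pairs are not forced into ascending index order, since the first column of `choice` holds the element that would have been drawn first.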
I have an array with 5 columns, consisting of 4 values and one index. I sort and split the array along the index. This leads me to splits of matrices with different lengths. From here on I want to calculate the mean and variance of the fourth value and the covariance of the first 3 values for every split. My current approach works with a for loop, which I would like to replace with matrix operations, but I am struggling with the different sizes of my matrices.
import numpy as np
A = np.random.rand(10,5)
A[:,-1] = np.random.randint(4, size=10)
sorted_A = A[np.argsort(A[:,4])]
splits = np.split(sorted_A, np.where(np.diff(sorted_A[:,4]))[0]+1)
My current for loop looks like this:
result = np.zeros((len(splits), 5))
for idx, values in enumerate(splits):
    if(len(values))>0:
        result[idx, 0] = np.mean(values[:,3])
        result[idx, 1] = np.var(values[:,3])
        result[idx, 2:5] = np.cov(values[:,0:3].transpose(), ddof=0).diagonal()
    else:
        result[idx, 0] = values[:,3]
I tried to work with masked arrays without success, since I couldn't load the matrices into the masked arrays in a proper form. Maybe someone knows how to do this or has a different suggestion.
You can use np.add.reduceat as follows:
>>> idx = np.concatenate([[0], np.where(np.diff(sorted_A[:,4]))[0]+1, [A.shape[0]]])
>>> result2 = np.empty((idx.size-1, 5))
>>> result2[:, 0] = np.add.reduceat(sorted_A[:, 3], idx[:-1]) / np.diff(idx)
>>> result2[:, 1] = np.add.reduceat(sorted_A[:, 3]**2, idx[:-1]) / np.diff(idx) - result2[:, 0]**2
>>> result2[:, 2:5] = np.add.reduceat(sorted_A[:, :3]**2, idx[:-1], axis=0) / np.diff(idx)[:, None]
>>> result2[:, 2:5] -= (np.add.reduceat(sorted_A[:, :3], idx[:-1], axis=0) / np.diff(idx)[:, None])**2
>>>
>>> np.allclose(result, result2)
True
Note that the diagonal entries of the covariance matrix are just the variances, which simplifies this vectorization quite a bit.
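For comparison, the group means and variances can also be computed without sorting, by mapping each index value to a group number and summing with np.bincount. This is only a sketch of an alternative, not part of the reduceat answer; the names `labels`, `inverse` and `counts` are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((10, 5))
A[:, -1] = rng.integers(4, size=10)

# Map every row to a group number 0..G-1 based on the index column
labels, inverse = np.unique(A[:, 4], return_inverse=True)
counts = np.bincount(inverse)

# Per-group mean and (population) variance of the fourth column
mean = np.bincount(inverse, weights=A[:, 3]) / counts
var = np.bincount(inverse, weights=A[:, 3] ** 2) / counts - mean**2
```

As with the reduceat version, the variance uses E[x^2] - E[x]^2, which can lose precision when the values are large compared with their spread.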
How can I efficiently generate a numpy array such that each column comes from a uniform distribution over a different range? The following code uses two for loops, which is slow. Is there a matrix-style way to generate such an array faster? Thanks.
import numpy as np
num = 5
ranges = [[0,1],[4,5]]
a = np.zeros((num, len(ranges)))
for i in range(num):
    for j in range(len(ranges)):
        a[i, j] = np.random.uniform(ranges[j][0], ranges[j][1])
What you can do is produce all random numbers in the interval [0, 1) first and then scale and shift them accordingly:
import numpy as np
num = 5
ranges = np.asarray([[0,1],[4,5]])
starts = ranges[:, 0]
widths = ranges[:, 1]-ranges[:, 0]
a = starts + widths*np.random.random(size=(num, widths.shape[0]))
So basically, you create an array of the right size via np.random.random(size=(num, widths.shape[0])) with random number between 0 and 1. Then you scale each value by a factor corresponding to the width of the interval that you actually want to sample. Finally, you shift them by starts to account for the different starting values of the intervals.
numpy.random.uniform will broadcast its arguments, so it can generate the desired samples if you pass the following arguments:
low: the sequence of low values.
high: the sequence of high values.
size: a tuple like (num, m), where m is the number of ranges and num the number of groups of m samples to generate.
For example:
In [23]: num = 5
In [24]: ranges = np.array([[0, 1], [4, 5], [10, 15]])
In [25]: np.random.uniform(low=ranges[:, 0], high=ranges[:, 1], size=(num, ranges.shape[0]))
Out[25]:
array([[ 0.98752526, 4.70946614, 10.35525699],
[ 0.86137374, 4.22046152, 12.28458447],
[ 0.92446543, 4.52859103, 11.30326391],
[ 0.0535877 , 4.8597036 , 14.50266784],
[ 0.55854656, 4.86820001, 14.84934564]])
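The same broadcasting should also work with the newer Generator interface. Here is a short sketch that additionally checks that every column stays inside its range; the seed and the names `low`, `high` and `samples` are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
num = 5
ranges = np.array([[0, 1], [4, 5], [10, 15]])
low, high = ranges[:, 0], ranges[:, 1]

# low and high have shape (3,) and broadcast against size=(num, 3),
# so column j is drawn from the interval starting at low[j] and ending at high[j]
samples = rng.uniform(low, high, size=(num, len(ranges)))

# Column-wise bounds check
in_range = (samples >= low) & (samples <= high)
```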
I have a matrix (2d numpy ndarray, to be precise):
A = np.array([[4, 0, 0],
[1, 2, 3],
[0, 0, 5]])
And I want to roll each row of A independently, according to roll values in another array:
r = np.array([2, 0, -1])
That is, I want to do this:
print(np.array([np.roll(row, x) for row, x in zip(A, r)]))
[[0 0 4]
[1 2 3]
[0 5 0]]
Is there a way to do this efficiently? Perhaps using fancy indexing tricks?
Sure, you can do it using advanced indexing. Whether it is the fastest way probably depends on your array size (if your rows are large, it may not be):
rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
# Always use a negative shift, so that column_indices are valid.
# (could also use the modulo operation)
r[r < 0] += A.shape[1]
column_indices = column_indices - r[:, np.newaxis]
result = A[rows, column_indices]
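The modulo variant mentioned in the comment could look roughly like this. It has the side benefit of leaving r untouched, whereas `r[r < 0] += A.shape[1]` modifies it in place. The function name `roll_rows` is made up for this sketch.

```python
import numpy as np

def roll_rows(A, r):
    """Hypothetical helper: roll row i of A by r[i] using modulo column indices."""
    n_rows, n_cols = A.shape
    # Output column j of row i takes input column (j - r[i]) mod n_cols
    cols = (np.arange(n_cols) - np.asarray(r)[:, np.newaxis]) % n_cols
    return A[np.arange(n_rows)[:, np.newaxis], cols]

A = np.array([[4, 0, 0],
              [1, 2, 3],
              [0, 0, 5]])
r = np.array([2, 0, -1])
rolled = roll_rows(A, r)
```

Since the modulo wraps every index into the valid range, shifts larger than the row length are handled as well.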
numpy.lib.stride_tricks.as_strided stricks (abbrev pun intended) again!
Speaking of fancy indexing tricks, there's the infamous - np.lib.stride_tricks.as_strided. The idea/trick would be to get a sliced portion starting from the first column until the second last one and concatenate at the end. This ensures that we can stride in the forward direction as needed to leverage np.lib.stride_tricks.as_strided and thus avoid the need of actually rolling back. That's the whole idea!
Now, in terms of actual implementation, we would use scikit-image's view_as_windows to elegantly use np.lib.stride_tricks.as_strided under the hood. Thus, the final implementation would be -
from skimage.util.shape import view_as_windows as viewW
def strided_indexing_roll(a, r):
    # Concatenate with sliced to cover all rolls
    a_ext = np.concatenate((a,a[:,:-1]),axis=1)
    # Get sliding windows; use advanced-indexing to select appropriate ones
    n = a.shape[1]
    return viewW(a_ext,(1,n))[np.arange(len(r)), (n-r)%n,0]
Here's a sample run -
In [327]: A = np.array([[4, 0, 0],
...: [1, 2, 3],
...: [0, 0, 5]])
In [328]: r = np.array([2, 0, -1])
In [329]: strided_indexing_roll(A, r)
Out[329]:
array([[0, 0, 4],
[1, 2, 3],
[0, 5, 0]])
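If scikit-image is not available, the same windowing can probably be done with NumPy alone via sliding_window_view (added in NumPy 1.20, as far as I know). This is a sketch of that substitution rather than the implementation above; the function name `sliding_window_roll` is mine.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sliding_window_roll(a, r):
    """Hypothetical NumPy-only variant of the strided roll."""
    n = a.shape[1]
    # Append the first n-1 columns so every rolled row is a contiguous window
    a_ext = np.concatenate((a, a[:, :-1]), axis=1)
    # windows[i, s] is a_ext[i, s:s+n]; the overall shape is (rows, n, n)
    windows = sliding_window_view(a_ext, n, axis=1)
    # A roll by r[i] corresponds to the window starting at (n - r[i]) mod n
    return windows[np.arange(a.shape[0]), (n - np.asarray(r)) % n]

A = np.array([[4, 0, 0],
              [1, 2, 3],
              [0, 0, 5]])
r = np.array([2, 0, -1])
rolled = sliding_window_roll(A, r)
```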
Benchmarking
# @seberg's solution
def advindexing_roll(A, r):
    rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
    r[r < 0] += A.shape[1]
    column_indices = column_indices - r[:,np.newaxis]
    return A[rows, column_indices]
Let's do some benchmarking on an array with large number of rows and columns -
In [324]: np.random.seed(0)
...: a = np.random.rand(10000,1000)
...: r = np.random.randint(-1000,1000,(10000))
# @seberg's solution
In [325]: %timeit advindexing_roll(a, r)
10 loops, best of 3: 71.3 ms per loop
# Solution from this post
In [326]: %timeit strided_indexing_roll(a, r)
10 loops, best of 3: 44 ms per loop
In case you want a more general solution (dealing with any shape and any axis), I modified @seberg's solution:
def indep_roll(arr, shifts, axis=1):
    """Apply an independent roll for each dimension of a single axis.
    Parameters
    ----------
    arr : np.ndarray
        Array of any shape.
    shifts : np.ndarray
        How many shifting to use for each dimension. Shape: `(arr.shape[axis],)`.
    axis : int
        Axis along which elements are shifted.
    """
    arr = np.swapaxes(arr,axis,-1)
    all_idcs = np.ogrid[[slice(0,n) for n in arr.shape]]
    # Convert to a positive shift
    shifts[shifts < 0] += arr.shape[-1]
    all_idcs[-1] = all_idcs[-1] - shifts[:, np.newaxis]
    result = arr[tuple(all_idcs)]
    arr = np.swapaxes(result,-1,axis)
    return arr
I implemented a pure numpy.lib.stride_tricks.as_strided solution as follows:
import numpy as np
from numpy.lib.stride_tricks import as_strided
def custom_roll(arr, r_tup):
    m = np.asarray(r_tup)
    arr_roll = arr[:, [*range(arr.shape[1]),*range(arr.shape[1]-1)]].copy() #need `copy`
    strd_0, strd_1 = arr_roll.strides
    n = arr.shape[1]
    result = as_strided(arr_roll, (*arr.shape, n), (strd_0 ,strd_1, strd_1))
    return result[np.arange(arr.shape[0]), (n-m)%n]
A = np.array([[4, 0, 0],
[1, 2, 3],
[0, 0, 5]])
r = np.array([2, 0, -1])
out = custom_roll(A, r)
Out[789]:
array([[0, 0, 4],
[1, 2, 3],
[0, 5, 0]])
By using a fast Fourier transform, we can apply a transformation in the frequency domain and then use the inverse fast Fourier transform to obtain the row shift.
So this is a pure numpy solution that takes only one line:
import numpy as np
from numpy.fft import fft, ifft
# The row shift function using the fast Fourier transform
# rshift(A,r) where A is a 2D array, r the row shift vector
def rshift(A,r):
    return np.real(ifft(fft(A,axis=1)*np.exp(2*1j*np.pi/A.shape[1]*r[:,None]*np.r_[0:A.shape[1]][None,:]),axis=1).round())
This will apply a left shift, but we can simply negate the exponent of the exponential to turn the function into a right shift function:
ifft(fft(...)*np.exp(-2*1j...)
It can be used like that:
# Example:
A = np.array([[1,2,3,4],
[1,2,3,4],
[1,2,3,4]])
r = np.array([1,-1,3])
print(rshift(A,r))
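To check the direction of the shift, the negated-exponent version can be compared against np.roll. The sketch below is written out over several lines for readability; the function name `rshift_right` is my own, and the result comes back as floats because of the round trip through the FFT.

```python
import numpy as np
from numpy.fft import fft, ifft

def rshift_right(A, r):
    """Hypothetical right-shift variant: row i is rolled to the right by r[i]."""
    n = A.shape[1]
    # Phase ramp with a negative exponent: delays each row by r[i] samples
    phase = np.exp(-2j * np.pi / n * r[:, None] * np.arange(n)[None, :])
    # Rounding removes the small floating-point error, so this suits integer data
    return np.real(ifft(fft(A, axis=1) * phase, axis=1)).round()

A = np.array([[1, 2, 3, 4],
              [1, 2, 3, 4],
              [1, 2, 3, 4]])
r = np.array([1, -1, 3])
shifted = rshift_right(A, r)
expected = np.array([np.roll(row, s) for row, s in zip(A, r)])
```

Because of the final rounding, this only reproduces the input values exactly when they are integers; for general floating-point data the rounding step would have to go.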
Building on divakar's excellent answer, you can easily apply this logic to a 3D array (which was the problem that brought me here in the first place). Here's an example: basically flatten your data, roll it and reshape it afterwards:
import numpy as np
from skimage.util.shape import view_as_windows as viewW
def applyroll_30(cube, threshold=25, offset=500):
    flattened_cube = cube.copy().reshape(cube.shape[0]*cube.shape[1], cube.shape[2])
    roll_matrix = calc_roll_matrix_flattened(flattened_cube, threshold, offset)
    rolled_cube = strided_indexing_roll(flattened_cube, roll_matrix, cube_shape=cube.shape)
    rolled_cube = rolled_cube.reshape(cube.shape[0], cube.shape[1], cube.shape[2])
    return rolled_cube
def calc_roll_matrix_flattened(cube_flattened, threshold, offset):
    """ Calculates the number of positions along the time axis we need to shift
    elements in order to trig the data.
    We return a 1D numpy array with X*Y elements (one shift per flattened row)
    """
    # argmax(...) finds the position in the cube (3d) where we are above threshold
    roll_matrix = np.argmax(cube_flattened > threshold, axis=1) + offset
    # ensure we don't have index out of bound
    roll_matrix[roll_matrix>cube_flattened.shape[1]] = cube_flattened.shape[1]
    return roll_matrix
def strided_indexing_roll(cube_flattened, roll_matrix_flattened, cube_shape):
    # Concatenate with sliced to cover all rolls
    # otherwise we shift in the wrong direction for my application
    roll_matrix_flattened = -1 * roll_matrix_flattened
    a_ext = np.concatenate((cube_flattened, cube_flattened[:, :-1]), axis=1)
    # Get sliding windows; use advanced-indexing to select appropriate ones
    n = cube_flattened.shape[1]
    result = viewW(a_ext,(1,n))[np.arange(len(roll_matrix_flattened)), (n - roll_matrix_flattened) % n, 0]
    result = result.reshape(cube_shape)
    return result
Divakar's answer doesn't do justice to how much more efficient this is on a large cube of data. I've timed it on 400x400x2000 data stored as int8. An equivalent for-loop takes ~5.5 seconds, Seberg's answer ~3.0 seconds and strided_indexing_roll ~0.5 seconds.