I have two numpy masked arrays which I want to merge. I'm using the following code:
import numpy as np
import matplotlib.pyplot as plt
a = np.zeros((10000, 10000), dtype=np.int16)
a[:5000, :5000] = 1
am = np.ma.masked_equal(a, 0)
b = np.zeros((10000, 10000), dtype=np.int16)
b[2500:7500, 2500:7500] = 2
bm = np.ma.masked_equal(b, 0)
arr = np.ma.array(np.dstack((am, bm)), mask=np.dstack((am.mask, bm.mask)))
arr = np.prod(arr, axis=2)
plt.imshow(arr)
The problem is that the np.prod() operation is very slow (4 seconds on my computer). Is there a more efficient way to get the merged array?
Instead of your last two lines using dstack() and prod(), try this:
arr = np.ma.array(am.filled(1) * bm.filled(1), mask=(am.mask * bm.mask))
Now you don't need prod() at all, and you avoid allocating the 3D array entirely.
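As a quick sanity check, here is the same idea on tiny arrays (a minimal sketch with arbitrary values):
import numpy as np

a = np.ma.masked_equal([1, 0, 1, 0], 0)   # masked where the data is 0
b = np.ma.masked_equal([0, 2, 2, 0], 0)
merged = np.ma.array(a.filled(1) * b.filled(1), mask=(a.mask * b.mask))
print(merged)  # [1 2 2 --]: masked only where both inputs are masked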
I took another approach that may not be particularly efficient, but is reasonably easy to extend and implement.
(I know I'm answering a question that is over 3 years old with functionality that has been around in numpy a long time, but bear with me)
The np.where function in numpy has two main purposes (it is a bit weird), the first is to give you indices for a boolean array:
>>> import numpy as np
>>> a = np.arange(12).reshape(3, 4)
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> m = (a % 3 == 0)
>>> m
array([[ True, False, False,  True],
       [False, False,  True, False],
       [False,  True, False, False]], dtype=bool)
>>> row_ind, col_ind = np.where(m)
>>> row_ind
array([0, 0, 1, 2])
>>> col_ind
array([0, 3, 2, 1])
The other purpose of the np.where function is to pick from two arrays based on whether the given boolean array is True/False:
>>> np.where(m, a, np.zeros(a.shape))
array([[ 0.,  0.,  0.,  3.],
       [ 0.,  0.,  6.,  0.],
       [ 0.,  9.,  0.,  0.]])
Turns out, there is also a numpy.ma.where which deals with masked arrays...
Given a list of masked arrays of the same shape, my code then looks like:
merged = masked_arrays[0]
for ma in masked_arrays[1:]:
    merged = np.ma.where(ma.mask, merged, ma)
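For example, merging two small masked arrays this way (a minimal sketch with arbitrary values):
import numpy as np

am = np.ma.masked_equal([1, 0, 1, 0], 0)
bm = np.ma.masked_equal([0, 2, 2, 0], 0)
merged = am
for ma in [bm]:
    merged = np.ma.where(ma.mask, merged, ma)
print(merged)  # [1 2 2 --]: a value stays masked only where every input is masked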
As I say, not particularly efficient, but certainly easy enough to implement.
HTH
Inspired by the accepted answer, I've found a simple way of merging masked arrays. It works by performing some logical operations on the masks and simply adding the zero-filled arrays.
import numpy as np
import matplotlib.pyplot as plt
a = np.zeros((1000, 1000), dtype=np.int16)
a[:500, :500] = 2
am = np.ma.masked_equal(a, 0)
b = np.zeros((1000, 1000), dtype=np.int16)
b[250:750, 250:750] = 3
bm = np.ma.masked_equal(b, 0)
c = np.zeros((1000, 1000), dtype=np.int16)
c[500:1000, 500:1000] = 5
cm = np.ma.masked_equal(c, 0)
bm.mask = np.logical_or(np.logical_and(am.mask, bm.mask), np.logical_not(am.mask))
am = np.ma.array(am.filled(0) + bm.filled(0), mask=(am.mask * bm.mask))
cm.mask = np.logical_or(np.logical_and(am.mask, cm.mask), np.logical_not(am.mask))
am = np.ma.array(am.filled(0) + cm.filled(0), mask=(am.mask * cm.mask))
plt.imshow(am)
I hope someone finds this helpful sometime. Masked arrays don't seem to be very efficient though, so if someone finds an alternative way to merge arrays I'd be happy to know.
Update: Based on @morningsun's comment, this implementation is 30% faster and much simpler:
import numpy as np
import matplotlib.pyplot as plt
a = np.zeros((1000, 1000), dtype=np.int16)
a[:500, :500] = 2
am = np.ma.masked_equal(a, 0)
b = np.zeros((1000, 1000), dtype=np.int16)
b[250:750, 250:750] = 3
bm = np.ma.masked_equal(b, 0)
c = np.zeros((1000, 1000), dtype=np.int16)
c[500:1000, 500:1000] = 5
cm = np.ma.masked_equal(c, 0)
am[am.mask] = bm[am.mask]
am[am.mask] = cm[am.mask]
plt.imshow(am)
Related
Construct a 2D, 3x3 matrix with random numbers from 1 to 8 with no duplicates
import numpy as np
# np.random.permutation yields each value exactly once, so no duplicates
random_matrix = np.random.permutation(9).reshape(3, 3)
print(random_matrix)
If you want an answer where we don't have to rely on numpy then you can do this:
import random
# Generates a randomized list between 0-9, where 0 is replaced by "#"
x = ["#" if i == 0 else i for i in random.sample(range(10), k=9)]
print(x)
# Slices the list into a 3x3 format
newx = [x[idx:idx+3] for idx in range(0, len(x), 3)]
print(newx)
Output:
[6, 2, 7, 4, '#', 8, 9, 1, 3]
[[6, 2, 7], [4, '#', 8], [9, 1, 3]]
import numpy
x = numpy.arange(0, 9)
numpy.random.shuffle(x)
x = numpy.reshape(x, (3,3))
print(numpy.where(x==0, '#', x))
Let me know if this works for you; with my solution, the integers end up replaced by strings, since np.where promotes the result to a string dtype here. I don't know if you care. If so, I will look for another solution.
You can achieve your goal using a few steps:
Generate a sequence of values (in some range) you would like to randomly select into the matrix.
Take some number of elements from this sequence randomly into a new sequence.
From this new sequence, make a matrix with the wanted shape.
import numpy as np
from random import sample
#step one
values = range(0,11)
#step two
random_sequence = sample(values, 9)
#step three
random_matrix = np.array(random_sequence).reshape(3,3)
Because you sample from a sequence of unique values, the new sequence, and therefore the matrix, is guaranteed to contain no duplicates.
You can use np.random.choice with replace=False to generate the (3, 3) array:
np.random.choice(np.arange(9), size=(3, 3), replace=False)
Replacing 0 with np.nan:
>>> np.where(x, x, np.nan)
array([[  4.,   1.,   3.],
       [  5.,  nan,   8.],
       [  2.,   6.,   7.]])
However, I think Hampus Larsson's answer is better, as this problem is not appropriate for numpy if you intend to replace 0 with the string "#".
You could use numpy, but random is enough:
import random
numbers = list(range(9))
random.shuffle(numbers)
my_list = [[numbers[i*3 + j] for j in range(0,3)] for i in range(0,3)]
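If, as in the question, the 0 should then show up as "#", one extra pass over the nested list does it (a small sketch):
my_list = [["#" if v == 0 else v for v in row] for row in my_list]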
I have a list of times (called times in my code, produced by the code suggested to me in the thread astropy.io fits efficient element access of a large table) and I want to do some statistical tests for periodicity, using Zn^2 and epoch folding tests. Some steps in the code take quite a while to run, and I am wondering if there is a faster way to do it. I have tried the equivalent map and lambda functions, but that takes even longer. My list of times has several hundred or maybe thousands of elements, depending on the dataset. Here is my code:
phase = [(x-mintime)*testfreq[m] - int((x-mintime)*testfreq[m]) for x in times]
# the above step takes 3 seconds for the dataset I am using for testing
# testfreq[m] is just one of several hundred frequencies I am testing
# times is of type numpy.ndarray
phasebin = [int(ph*numbins) for ph in phase]
# 1 second (numbins is 20)
powerarray = [phasebin.count(n) for n in range(0, numbins-1)]
# 0.3 seconds
poweravg = np.mean(powerarray)
chisq[m] = sum([(pow-poweravg)**2/poweravg for pow in powerarray])
# the above 2 steps are very quick
for n in range(0, maxn):  # maxn is 3
    cosparam = sum([(np.cos(2*np.pi*(n+1)*ph)) for ph in phase])
    sinparam = sum([(np.sin(2*np.pi*(n+1)*ph)) for ph in phase])
    # these steps each take 4 seconds
    z2[m,n] = sum(z2[m,]) + (cosparam**2 + sinparam**2)/count
    # this is quick (count is the number of times)
As this steps through several hundred frequencies on either side of frequencies identified through an FFT search, it takes a very long time to run. The same functionality in a lower level language runs much more quickly, but I need some of the Python modules for plotting, etc. I am hoping that Python can be persuaded to do some of the operations, particularly the phase, phasebin, powerarray, cosparam, and sinparam calculations, significantly faster, but I am not sure how to make this happen. Can anyone tell me how this can be done, or do I have to write and call functions in C or fortran? I know that this could be done in a few minutes e.g. in fortran, but this Python code takes hours as it is.
Thanks very much.
Instead of Python lists, you could use the numpy library; it is much faster for linear-algebra-type operations. For example, to add two arrays in an element-wise fashion:
>>> import numpy as np
>>> a = np.array([1,2,3,4,5])
>>> b = np.array([2,3,4,5,6])
>>> a + b
array([ 3, 5, 7, 9, 11])
Similarly, you can multiply arrays by scalars which multiplies each element as you'd expect
>>> 2 * a
array([ 2, 4, 6, 8, 10])
As far as speed, here is the Python list equivalent of adding two lists
>>> c = [1,2,3,4,5]
>>> d = [2,3,4,5,6]
>>> [i+j for i,j in zip(c,d)]
[3, 5, 7, 9, 11]
Then timing the two
>>> from timeit import timeit
>>> setup = '''
import numpy as np
a = np.array([1,2,3,4,5])
b = np.array([2,3,4,5,6])'''
>>> timeit('a+b', setup)
0.521275608325351
>>> setup = '''
c = [1,2,3,4,5]
d = [2,3,4,5,6]'''
>>> timeit('[i+j for i,j in zip(c,d)]', setup)
1.2781205834379108
In this small example numpy was more than twice as fast.
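Applied to the question's code, the expensive list comprehensions can be collapsed into whole-array operations. A sketch, with dummy stand-ins for the question's variables (times, mintime, testfreq, m, numbins):
import numpy as np

# dummy stand-ins for the question's variables
times = np.sort(np.random.rand(3000) * 1e4)
mintime = times.min()
testfreq = np.array([0.01])
m, numbins = 0, 20

t = (times - mintime) * testfreq[m]
phase = t - np.floor(t)                    # fractional part; same as subtracting int() for non-negative t
phasebin = (phase * numbins).astype(int)   # vectorized int(ph * numbins)
# np.bincount counts every bin in one call (note: unlike range(0, numbins-1)
# in the original, this includes the last bin)
powerarray = np.bincount(phasebin, minlength=numbins)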
For-loop substitute - operating on complete arrays
First, multiply phase by 2*pi*n using broadcasting:
phase = np.arange(10)
maxn = 3
ens = np.arange(1, maxn+1) # array([1, 2, 3])
two_pi_ens = 2*np.pi*ens
b = phase * two_pi_ens[:, np.newaxis]
b.shape is (3, 10): one row for each value of ens.
Take the cosine then sum to get the three cosine parameters
c = np.cos(b)
c_param = c.sum(axis = 1) # c_param.shape is 3
Take the sine then sum to get the three sine parameters
s = np.sin(b)
s_param = s.sum(axis = 1) # s_param.shape is 3
Sum of the squares divided by count
d = (np.square(c_param) + np.square(s_param)) / count
# d.shape is (3,)
Assign to z2
for n in range(maxn):
    z2[m,n] = z2[m,:].sum() + d[n]
That loop is doing a cumulative sum. numpy ndarrays have a cumsum method.
If maxn is small (3 in your case) it may not be noticeably faster.
z2[m,:] += d
z2[m,:].cumsum(out = z2[m,:])
To illustrate:
>>> a = np.ones((3,3))
>>> a
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])
>>> m = 1
>>> d = (1,2,3)
>>> a[m,:] += d
>>> a
array([[ 1.,  1.,  1.],
       [ 2.,  3.,  4.],
       [ 1.,  1.,  1.]])
>>> a[m,:].cumsum(out = a[m,:])
array([ 2., 5., 9.])
>>> a
array([[ 1.,  1.,  1.],
       [ 2.,  5.,  9.],
       [ 1.,  1.,  1.]])
x2_Kaxs is an Nx3 numpy array of lists, and the elements in those lists index into another array. I want to end up with an Nx3 numpy array of lists of those indexed elements.
x2_Kcids = array([ ax2_cid[axs] for axs in x2_Kaxs.flat ], dtype=object)
This outputs a (N*3)x1 array of numpy arrays. Great, that almost works for what I want. All I need to do is reshape it:
x2_Kcids.shape = x2_Kaxs.shape
And this works: x2_Kcids becomes an Nx3 array of numpy arrays. Perfect.
Except when all the lists in x2_Kaxs have only one element in them; then it flattens
into an Nx3 array of integers, and my code expects a list later in the pipeline.
One solution I came up with was to append a dummy element and then pop it off, but that is very ugly. Is there anything nicer?
Your problem is not really about lists of size 1, it is about lists all of the same size. I have created these dummy samples:
import random
import numpy as np

ax2_cid = np.random.rand(10)
shape = (10, 3)

x2_Kaxs = np.empty((10, 3), dtype=object).reshape(-1)
for j in range(x2_Kaxs.size):
    x2_Kaxs[j] = [random.randint(0, 9) for k in range(random.randint(1, 5))]
x2_Kaxs.shape = shape

x2_Kaxs_1 = np.empty((10, 3), dtype=object).reshape(-1)
for j in range(x2_Kaxs_1.size):
    x2_Kaxs_1[j] = [random.randint(0, 9)]
x2_Kaxs_1.shape = shape

x2_Kaxs_2 = np.empty((10, 3), dtype=object).reshape(-1)
for j in range(x2_Kaxs_2.size):
    x2_Kaxs_2[j] = [random.randint(0, 9) for k in range(2)]
x2_Kaxs_2.shape = shape
If we run your code on these three, the return has the following shapes:
>>> np.array([ax2_cid[axs] for axs in x2_Kaxs.flat], dtype=object).shape
(30,)
>>> np.array([ax2_cid[axs] for axs in x2_Kaxs_1.flat], dtype=object).shape
(30, 1)
>>> np.array([ax2_cid[axs] for axs in x2_Kaxs_2.flat], dtype=object).shape
(30, 2)
And the case with all lists of length 2 won't even let you reshape to (n, 3). The problem is that, even with dtype=object, numpy tries to numpify your input as much as possible, which is all the way down to individual elements if all lists are of the same length. I think that your best bet is to preallocate your x2_Kcids array:
x2_Kcids = np.empty_like(x2_Kaxs).reshape(-1)
shape = x2_Kaxs.shape
x2_Kcids[:] = [ax2_cid[axs] for axs in x2_Kaxs.flat]
x2_Kcids.shape = shape
EDIT: Since unubtu's answer is no longer visible, I am going to steal from him. The code above can be written much more nicely and compactly as:
x2_Kcids = np.empty_like(x2_Kaxs)
x2_Kcids.ravel()[:] = [ax2_cid[axs] for axs in x2_Kaxs.flat]
With the above example of single item lists:
>>> x2_Kcids_1 = np.empty_like(x2_Kaxs_1).reshape(-1)
>>> x2_Kcids_1[:] = [ax2_cid[axs] for axs in x2_Kaxs_1.flat]
>>> x2_Kcids_1.shape = shape
>>> x2_Kcids_1
array([[[ 0.37685372], [ 0.95328117], [ 0.63840868]],
       [[ 0.43009678], [ 0.02069558], [ 0.32455781]],
       [[ 0.32455781], [ 0.37685372], [ 0.09777559]],
       [[ 0.09777559], [ 0.37685372], [ 0.32455781]],
       [[ 0.02069558], [ 0.02069558], [ 0.43009678]],
       [[ 0.32455781], [ 0.63840868], [ 0.37685372]],
       [[ 0.63840868], [ 0.43009678], [ 0.25532799]],
       [[ 0.02069558], [ 0.32455781], [ 0.09777559]],
       [[ 0.43009678], [ 0.37685372], [ 0.63840868]],
       [[ 0.02069558], [ 0.17876822], [ 0.17876822]]], dtype=object)
>>> x2_Kcids_1[0, 0]
array([ 0.37685372])
Similar to @Denis:
if x.ndim == 2:
    x.shape += (1,)
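A quick illustration of that trick (a minimal sketch):
import numpy as np

x = np.zeros((10, 3))
if x.ndim == 2:
    x.shape += (1,)
print(x.shape)  # (10, 3, 1): each cell now sits on its own length-1 axis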
I'm trying to move a few Matlab libraries that I've built to the python environment. So far, the biggest issue I faced is the dynamic allocation of arrays based on index specification. For example, using Matlab, typing the following:
x = [1 2];
x(5) = 3;
would result in:
x = [ 1 2 0 0 3]
In other words, I didn't know before hand the size of (x), nor its content. The array must be defined on the fly, based on the indices that I'm providing.
In python, trying the following:
from numpy import *
x = array([1,2])
x[4] = 3
would result in the following error: IndexError: index out of bounds. One workaround is growing the array in a loop and then assigning the desired value:
from numpy import *
x = array([1,2])
idx = 4
for i in range(size(x), idx+1):
    x = append(x, 0)
x[idx] = 3
print(x)
It works, but it's not very convenient and it might become very cumbersome for n-dimensional arrays. I thought about subclassing ndarray to achieve my goal, but I'm not sure if it would work. Does anybody know of a better approach?
Thanks for the quick reply. I didn't know about the __setitem__ method (I'm fairly new to Python). I simply overrode it in an ndarray subclass as follows:
import numpy as np

class marray(np.ndarray):
    def __setitem__(self, key, value):
        # Array properties
        nDim = np.ndim(self)
        dims = list(np.shape(self))
        # Requested index
        if type(key) == int: key = key,
        nDim_rq = len(key)
        dims_rq = list(key)
        for i in range(nDim_rq): dims_rq[i] += 1
        # Provided indices match current array number of dimensions
        if nDim_rq == nDim:
            # Define new dimensions
            newdims = []
            for iDim in range(nDim):
                v = max([dims[iDim], dims_rq[iDim]])
                newdims.append(v)
            # Resize if necessary
            if newdims != dims:
                self.resize(newdims, refcheck=False)
        return super(marray, self).__setitem__(key, value)
And it works like a charm! However, I need to modify the above code so that __setitem__ allows changing the number of dimensions, to handle a request like this:
a = marray([0,0])
a[3,1,0] = 0
Unfortunately, when I try to use numpy functions such as
self = np.expand_dims(self, 2)
the returned type is numpy.ndarray instead of __main__.marray. Any idea how I could enforce that numpy functions output an marray if an marray is provided as input? I think it should be doable using __array_wrap__, but I could never find out exactly how. Any help would be appreciated.
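For what it's worth, a minimal sketch of what an __array_wrap__ override can look like; this is an assumption about your setup, and not every numpy function routes its output through this hook (plain view-casting with .view(marray) is the general fallback):
import numpy as np

class marray(np.ndarray):
    def __array_wrap__(self, out_arr, context=None):
        # ufuncs call this to let the subclass re-wrap their output
        return out_arr.view(marray)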
I took the liberty of updating my old answer from Dynamic list that automatically expands. I think this should do most of what you need/want:
class matlab_list(list):
    def __init__(self):
        def zero():
            while 1:
                yield 0
        self._num_gen = zero()

    def __setitem__(self, index, value):
        if isinstance(index, int):
            self.expandfor(index)
            return super(matlab_list, self).__setitem__(index, value)
        elif isinstance(index, slice):
            if index.stop < index.start:
                return super(matlab_list, self).__setitem__(index, value)
            else:
                self.expandfor(index.stop if abs(index.stop) > abs(index.start) else index.start)
                return super(matlab_list, self).__setitem__(index, value)

    def expandfor(self, index):
        if abs(index) > len(self) - 1:
            if index < 0:
                # grow from the front for negative indices
                for i in range(abs(index) - len(self)):
                    self.insert(0, next(self._num_gen))
            else:
                # grow from the back for positive indices
                for i in range(abs(index) - len(self) + 1):
                    self.append(next(self._num_gen))
# Usage
spec_list = matlab_list()
spec_list[5] = 14
This isn't quite what you want, but...
x = np.array([1, 2])
try:
    x[index] = value
except IndexError:
    # this will be trickier for multidimensional arrays; you'll need to
    # work with x.shape and take advantage of numpy's advanced slicing
    oldsize = len(x)
    x = np.resize(x, index + 1)  # Python uses C-style 0-based indices
    # you could also do x[oldsize:] = 0, but that would assign to the final position twice
    x[oldsize:index] = 0
    x[index] = value
>>> x = np.array([1, 2])
>>> x = np.resize(x, 5)
>>> x[2:5] = 0
>>> x[4] = 3
>>> x
array([1, 2, 0, 0, 3])
Due to how numpy stores the data linearly under the hood (though whether it stores as row-major or column-major can be specified when creating arrays), multidimensional arrays are pretty tricky here.
>>> x = np.array([[1, 2, 3], [4, 5, 6]])
>>> np.resize(x, (6, 4))
array([[1, 2, 3, 4],
       [5, 6, 1, 2],
       [3, 4, 5, 6],
       [1, 2, 3, 4],
       [5, 6, 1, 2],
       [3, 4, 5, 6]])
You'd need to do this or something similar:
>>> y = np.zeros((6, 4))
>>> y[:x.shape[0], :x.shape[1]] = x
>>> y
array([[ 1.,  2.,  3.,  0.],
       [ 4.,  5.,  6.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])
A python dict will work well as a sparse array. The main issue is that the syntax for initializing the sparse array will not be as pretty:
listarray = [100,200,300]
dictarray = {0:100, 1:200, 2:300}
but after that the syntax for inserting or retrieving elements is the same
dictarray[5] = 2345
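When a dense numpy array is eventually needed, converting the dict back is straightforward (a sketch; filling missing indices with 0 is an assumption):
import numpy as np

dictarray = {0: 100, 1: 200, 2: 300, 5: 2345}
dense = np.zeros(max(dictarray) + 1, dtype=int)
for idx, val in dictarray.items():
    dense[idx] = val
print(dense)  # [ 100  200  300    0    0 2345]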
I have two numpy arrays of different shapes, but with the same length (leading dimension). I want to shuffle each of them, such that corresponding elements continue to correspond -- i.e. shuffle them in unison with respect to their leading indices.
This code works, and illustrates my goals:
import numpy

def shuffle_in_unison(a, b):
    assert len(a) == len(b)
    shuffled_a = numpy.empty(a.shape, dtype=a.dtype)
    shuffled_b = numpy.empty(b.shape, dtype=b.dtype)
    permutation = numpy.random.permutation(len(a))
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b
For example:
>>> a = numpy.asarray([[1, 1], [2, 2], [3, 3]])
>>> b = numpy.asarray([1, 2, 3])
>>> shuffle_in_unison(a, b)
(array([[2, 2],
        [1, 1],
        [3, 3]]), array([2, 1, 3]))
However, this feels clunky, inefficient, and slow, and it requires making a copy of the arrays -- I'd rather shuffle them in-place, since they'll be quite large.
Is there a better way to go about this? Faster execution and lower memory usage are my primary goals, but elegant code would be nice, too.
One other thought I had was this:
def shuffle_in_unison_scary(a, b):
    rng_state = numpy.random.get_state()
    numpy.random.shuffle(a)
    numpy.random.set_state(rng_state)
    numpy.random.shuffle(b)
This works... but it's a little scary, as I see little guarantee it'll continue to work -- it doesn't look like the sort of thing that's guaranteed to survive across numpy versions, for example.
You can use NumPy's array indexing:
def unison_shuffled_copies(a, b):
    assert len(a) == len(b)
    p = numpy.random.permutation(len(a))
    return a[p], b[p]
This will create separate, unison-shuffled copies of the arrays.
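For example, with the arrays from the question (a quick check, not part of the original answer):
import numpy

a = numpy.asarray([[1, 1], [2, 2], [3, 3]])
b = numpy.asarray([1, 2, 3])
a2, b2 = unison_shuffled_copies(a, b)
# correspondence survives the shuffle: row a2[i] is still [b2[i], b2[i]]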
X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y, random_state=0)
To learn more, see http://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html
Your "scary" solution does not appear scary to me. Calling shuffle() for two sequences of the same length results in the same number of calls to the random number generator, and these are the only "random" elements in the shuffle algorithm. By resetting the state, you ensure that the calls to the random number generator will give the same results in the second call to shuffle(), so the whole algorithm will generate the same permutation.
If you don't like this, a different solution would be to store your data in one array instead of two right from the beginning, and create two views into this single array simulating the two arrays you have now. You can use the single array for shuffling and the views for all other purposes.
Example: Let's assume the arrays a and b look like this:
a = numpy.array([[[  0.,   1.,   2.],
                  [  3.,   4.,   5.]],
                 [[  6.,   7.,   8.],
                  [  9.,  10.,  11.]],
                 [[ 12.,  13.,  14.],
                  [ 15.,  16.,  17.]]])
b = numpy.array([[ 0.,  1.],
                 [ 2.,  3.],
                 [ 4.,  5.]])
We can now construct a single array containing all the data:
c = numpy.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]
# array([[  0.,   1.,   2.,   3.,   4.,   5.,   0.,   1.],
#        [  6.,   7.,   8.,   9.,  10.,  11.,   2.,   3.],
#        [ 12.,  13.,  14.,  15.,  16.,  17.,   4.,   5.]])
Now we create views simulating the original a and b:
a2 = c[:, :a.size//len(a)].reshape(a.shape)
b2 = c[:, a.size//len(a):].reshape(b.shape)
The data of a2 and b2 is shared with c. To shuffle both arrays simultaneously, use numpy.random.shuffle(c).
In production code, you would of course try to avoid creating the original a and b at all and right away create c, a2 and b2.
This solution could be adapted to the case that a and b have different dtypes.
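Putting the pieces together, here is the whole approach as one runnable sketch (same arrays as above):
import numpy

a = numpy.arange(18, dtype=float).reshape(3, 2, 3)
b = numpy.arange(6, dtype=float).reshape(3, 2)
c = numpy.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]
a2 = c[:, :a.size // len(a)].reshape(a.shape)
b2 = c[:, a.size // len(a):].reshape(b.shape)
numpy.random.shuffle(c)  # rows of c move, and the views a2 and b2 follow in unison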
Very simple solution:
randomize = np.arange(len(x))
np.random.shuffle(randomize)
x = x[randomize]
y = y[randomize]
The two arrays x and y are now both randomly shuffled in the same way.
James wrote an sklearn solution in 2015 which is helpful. But he added a random state variable, which is not needed. In the code below, the random state from numpy is automatically assumed.
X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y)
from numpy.random import permutation
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data #numpy array
y = iris.target #numpy array
# Data is currently unshuffled; we should shuffle
# each X[i] with its corresponding y[i]
perm = permutation(len(X))
X = X[perm]
y = y[perm]
Shuffle any number of arrays together, in-place, using only NumPy.
import numpy as np

def shuffle_arrays(arrays, set_seed=-1):
    """Shuffles arrays in-place, in the same order, along axis=0

    Parameters:
    -----------
    arrays : List of NumPy arrays.
    set_seed : Seed value if int >= 0, else seed is random.
    """
    assert all(len(arr) == len(arrays[0]) for arr in arrays)
    seed = np.random.randint(0, 2**(32 - 1) - 1) if set_seed < 0 else set_seed

    for arr in arrays:
        rstate = np.random.RandomState(seed)
        rstate.shuffle(arr)
And it can be used like this:
a = np.array([1, 2, 3, 4, 5])
b = np.array([10,20,30,40,50])
c = np.array([[1,10,11], [2,20,22], [3,30,33], [4,40,44], [5,50,55]])
shuffle_arrays([a, b, c])
A few things to note:
The assert ensures that all input arrays have the same length along their first dimension.
Arrays are shuffled in-place by their first dimension - nothing is returned.
The random seed is kept within the positive int32 range.
If a repeatable shuffle is needed, the seed value can be set.
After the shuffle, the data can be split using np.split or referenced using slices - depending on the application.
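For instance, a plain 80/20 split by slicing after the shuffle (sizes are illustrative, reusing a and b from above):
split = int(0.8 * len(a))             # 4 of the 5 rows
a_train, a_test = a[:split], a[split:]
b_train, b_test = b[:split], b[split:]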
You can make an index array like:
s = np.arange(0, len(a), 1)
then shuffle it:
np.random.shuffle(s)
Now use this s to index your arrays; the same shuffled indices return vectors shuffled in the same way:
x_data = x_data[s]
x_label = x_label[s]
There is a well-known function that can handle this:
from sklearn.model_selection import train_test_split
X, _, Y, _ = train_test_split(X,Y, test_size=0.0)
Just setting test_size to 0 will avoid splitting and give you shuffled data.
Though it is usually used to split train and test data, it does shuffle them too.
From documentation
Split arrays or matrices into random train and test subsets
Quick utility that wraps input validation and
next(ShuffleSplit().split(X, y)) and application to input data into a
single call for splitting (and optionally subsampling) data in a
oneliner.
This seems like a very simple solution:
import numpy as np
def shuffle_in_unison(a, b):
    assert len(a) == len(b)
    c = np.arange(len(a))
    np.random.shuffle(c)
    return a[c], b[c]
a = np.asarray([[1, 1], [2, 2], [3, 3]])
b = np.asarray([11, 22, 33])
shuffle_in_unison(a,b)
Out[94]:
(array([[3, 3],
        [2, 2],
        [1, 1]]),
 array([33, 22, 11]))
One way to do in-place shuffling for connected lists is to use a seed (it could be random) and use numpy.random.shuffle to do the shuffling.
import numpy as np

# Set seed to a random number if you want the shuffling to be non-deterministic.
def shuffle(a, b, seed):
    np.random.seed(seed)
    np.random.shuffle(a)
    np.random.seed(seed)
    np.random.shuffle(b)
That's it. This will shuffle both a and b in the exact same way. This is also done in-place which is always a plus.
EDIT: don't use np.random.seed(); use np.random.RandomState instead:
def shuffle(a, b, seed):
    rand_state = np.random.RandomState(seed)
    rand_state.shuffle(a)
    rand_state.seed(seed)
    rand_state.shuffle(b)
When calling it just pass in any seed to feed the random state:
a = [1,2,3,4]
b = [11, 22, 33, 44]
shuffle(a, b, 12345)
Output:
>>> a
[1, 4, 2, 3]
>>> b
[11, 44, 22, 33]
Edit: Fixed code to re-seed the random state
Say we have two arrays: a and b.
a = np.array([[1,2,3],[4,5,6],[7,8,9]])
b = np.array([[9,1,1],[6,6,6],[4,2,0]])
We can first obtain row indices by permuting the first dimension:
indices = np.random.permutation(a.shape[0])
[1 2 0]
Then use advanced indexing.
Here we are using the same indices to shuffle both arrays in unison.
a_shuffled = a[indices[:,np.newaxis], np.arange(a.shape[1])]
b_shuffled = b[indices[:,np.newaxis], np.arange(b.shape[1])]
This is equivalent to
np.take(a, indices, axis=0)
[[4 5 6]
 [7 8 9]
 [1 2 3]]
np.take(b, indices, axis=0)
[[6 6 6]
 [4 2 0]
 [9 1 1]]
If you want to avoid copying arrays, then I would suggest that instead of generating a permutation list, you go through every element in the array and randomly swap it with another position in the array:
for old_index in range(len(a)):
    new_index = numpy.random.randint(old_index + 1)
    a[old_index], a[new_index] = a[new_index], a[old_index]
    b[old_index], b[new_index] = b[new_index], b[old_index]
This implements the Knuth-Fisher-Yates shuffle algorithm.
Shortest and easiest way in my opinion, use seed:
import random

seed = 42  # any fixed value works; both shuffles just need the same seed
random.seed(seed)
random.shuffle(x_data)
# reset the same seed to get the identical random sequence and shuffle the y
random.seed(seed)
random.shuffle(y_data)
Most solutions above work; however, if you have column vectors you have to transpose them first. Here is an example:
def shuffle(self) -> None:
    """
    Shuffles X and Y
    """
    x = self.X.T
    y = self.Y.T
    p = np.random.permutation(len(x))
    self.X = x[p].T
    self.Y = y[p].T
For example, this is what I'm doing:
import numpy as np
from random import shuffle

# images and labels are the existing parallel datasets
combo = []
for i in range(60000):
    combo.append((images[i], labels[i]))
shuffle(combo)

im = []
lab = []
for c in combo:
    im.append(c[0])
    lab.append(c[1])
images = np.asarray(im)
labels = np.asarray(lab)
I extended python's random.shuffle() to take a second arg:
import random

def shuffle_together(x, y):
    assert len(x) == len(y)
    for i in reversed(range(1, len(x))):
        # pick an element in x[:i+1] with which to exchange x[i]
        j = int(random.random() * (i + 1))
        x[i], x[j] = x[j], x[i]
        y[i], y[j] = y[j], y[i]
That way I can be sure that the shuffling happens in-place, and the function is not all too long or complicated.
Just use numpy...
First merge the two input arrays (the 1D array is the labels y, the 2D array is the data x) and shuffle them with NumPy's shuffle method. Finally, split them again and return.
import numpy as np

def shuffle_2d(a, b):
    rows = a.shape[0]
    if b.shape != (rows, 1):
        b = b.reshape((rows, 1))
    S = np.hstack((b, a))
    np.random.shuffle(S)
    b, a = S[:, 0], S[:, 1:]
    return a, b
features, samples = 2, 5
x, y = np.random.random((samples, features)), np.arange(samples)
x, y = shuffle_2d(x, y)