Python Multiprocessing data output wrong - python

I am trying Multiprocessing in Python. I have written some code which does vector add, but couldn't get the output out of the function. Which mean, the output Z prints out 0 rather than 2.
from multiprocessing import Process
import numpy as np
numThreads = 16
num = 16
numIter = num/numThreads
X = np.ones((num, 1))
Y = np.ones((num, 1))
Z = np.zeros((num, 1))
def add(X,Y,Z,j):
Z[j] = X[j] + Y[j]
if __name__ == '__main__':
jobs = []
for i in range(numThreads):
p = Process(target=add, args=(X, Y, Z, i,))
jobs.append(p)
for i in range(numThreads):
jobs[i].start()
for i in range(numThreads):
jobs[i].join()
print Z[0]
Edit: Took advice of clocker, and changed my code to this:
import multiprocessing
import numpy as np
numThreads = 16
numRows = 32000
numCols = 2
numOut = 3
stride = numRows / numThreads
X = np.ones((numRows, numCols))
W = np.ones((numCols, numOut))
B = np.ones((numRows, numOut))
Y = np.ones((numRows, numOut))
def conv(idx):
Y[idx*stride:idx*stride+stride] = X[idx*stride:idx*stride+stride].dot(W) + B[idx*stride:idx*stride+stride]
if __name__=='__main__':
pool = multiprocessing.Pool(numThreads)
pool.map(conv, range(numThreads))
print Y
And the output is Y instead of a Saxp.

The reason your last line print Z[0] returns [0] instead of [2] is that each of the processes makes an independent copy of Z (or may be Z[j] - not completely sure about that) before modifying it. Either way, a separate process run will guarantee that your original version will be unchanged.
If you were to use the threading module instead the last line would indeed return [2] as expected, but that is not multi-processing.
So, you probably want to use multiprocessing.Pool instead. Going along your experiment purely for illustration, one could do the following:
In [40]: pool = multiprocessing.Pool()
In [41]: def add_func(j):
....: return X[j] + Y[j]
In [42]: pool = multiprocessing.Pool(numThreads)
In [43]: pool.map(add_func, range(numThreads))
Out[43]:
[array([ 2.]),
array([ 2.]),
array([ 2.]),
array([ 2.]),
array([ 2.]),
array([ 2.]),
array([ 2.]),
array([ 2.]),
array([ 2.]),
array([ 2.]),
array([ 2.]),
array([ 2.]),
array([ 2.]),
array([ 2.]),
array([ 2.]),
array([ 2.])]
Have fun!
For your second part to your question, the problem is that the conv() function does not return any value. While the process pool gets a copy of X, B and W for pulling values from, the Y inside conv() is local to each process that is launched. To get the new computed value of Y, you would use something like this:
def conv(idx):
Ylocal_section = X[idx*stride:idx*stride+stride].dot(W) + B[idx*stride:idx*stride+stride]
return Ylocal_section
results = pool.map(conv, range(numThreads)) # then apply each result to Y
for idx in range(numThreads):
Y[idx*stride:idx*stride+stride] = results[idx]
Parallelism can get complicated really fast, and at this point I would evaluate existing libraries that can perform fast 2D convolution. numpy and scipy libraries can be super efficient and might serve your needs better.

Related

How to covert replace numpy loop with faster broadcast operation

I am trying to use broadcasting to speed up my numpy code. the real code has much larger arrays and loops through multiple times, but I think this snippet illustrates the issue.
import numpy as np
row = np.array([0,0,1,1,4])
dl_ddk = np.array([0,8,29,112,11])
change1 = np.zeros(5)
change2 = np.zeros(5)
for k in range(0, row.shape[0]):
i = row[k]
change1[i] += dl_ddk[k]
change2[row] += dl_ddk
print(change1)
print(change2)
change1 = [8, 141, 0, 0 11]
change2 = [8, 112, 0, 0 11]
I thought these two change arrays would be equals however, it seems that the broadcast operations += is overwriting rather than adding values. Is there a way to vectorize a loop in np with matrix referencing like this that will give the same results as change1?
You can use np.bincount() and use dl_ddk as the weights:
import numpy as np
row = np.array([0,0,1,1,4])
dl_ddk = np.array([0,8,29,112,11])
change1 = np.bincount(row, weights=dl_ddk)
print(change1)
# [ 8. 141. 0. 0. 11.]
The bit in the docs show using it in a way almost exactly like your problem:
If weights is specified the input array is weighted by it, i.e. if a
value n is found at position i, out[n] += weight[i] instead of out[n]
+= 1.
In [1]: row = np.array([0,0,1,1,4])
...: dl_ddk = np.array([0,8,29,112,11])
...: change1 = np.zeros(5)
...: change2 = np.zeros(5)
...: for k in range(0, row.shape[0]):
...: i = row[k]
...: change1[i] += dl_ddk[k]
...: change2[row] += dl_ddk
change2 does not match because of buffering. ufunc has added a at method to address this:
Performs unbuffered in place operation on operand 'a' for elements specified by 'indices'.
In [3]: change3 = np.zeros(5)
In [4]: np.add.at(change3, row, dl_ddk)
In [5]: change1
Out[5]: array([ 8., 141., 0., 0., 11.])
In [6]: change2
Out[6]: array([ 8., 112., 0., 0., 11.])
In [7]: change3
Out[7]: array([ 8., 141., 0., 0., 11.])

Python Optimization: Using vector technique to find power of each matrix in an numpy array

3D numpy array A contains a series (in this example, I am choosing 3) of 2D numpy array D of shape 2 x 2. The D matrix is as follows:
D = np.array([[1,2],[3,4]])
A is initialized and assigned as below:
idx = np.arange(3)
A = np.zeros((3,2,2))
A[idx,:,:] = D # This gives A = [[[1,2],[3,4]],[[1,2],[3,4]],\
# [[1,2],[3,4]]]
# In mathematical notation: A = {D, D, D}
Now, essentially what I require after the execution of the codes is:
Mathematically, A = {D^0, D^1, D^2} = {D0, D1, D2}
where D0 = [[1,0],[0,1]], D1 = [[1,2],[3,4]], D2=[[7,10],[15,22]]
Is it possible to apply power to each matrix element in A without using a for-loop? I would be doing larger matrices with more in the series.
I had defined, n = np.array([0,1,2]) # corresponding to powers 0, 1 and 2 and tried
Result = np.power(A,n) but I do not get the desired output.
Is there are an efficient way to do it?
Full code:
D = np.array([[1,2],[3,4]])
idx = np.arange(3)
A = np.zeros((3,2,2))
A[idx,:,:] = D # This gives A = [[[1,2],[3,4]],[[1,2],[3,4]],\
# [[1,2],[3,4]]]
# In mathematical notation: A = {D, D, D}
n = np.array([0,1,2])
Result = np.power(A,n) # ------> Not the desired output.
A cumulative product exists in numpy, but not for matrices. Therefore, you need to make your own 'matcumprod' function. You can use np.dot for this, but np.matmul (or #) is specialized for matrix multiplication.
Since you state your powers always go from 0 to some_power, I suggest the following function:
def matcumprod(D, upto):
Res = np.empty((upto, *D.shape), dtype=A.dtype)
Res[0, :, :] = np.eye(D.shape[0])
Res[1, :, :] = D.copy()
for i in range(1,upto):
Res[i, :, :] = Res[i-1,:,:] # D
return Res
By the way, a loop often times outperforms a built-in numpy function if the latter uses a lot of memory, so don't fret over it if your powers stay within bounds...
Alright, i spent a lot of time on this problem but could not seem to find a vectorized solution in the way you'd like. So i would like to instead first propose a basic solution, and then perhaps an optimization if you require finding continuous powers.
The function you're looking for is called numpy.linalg.matrix_power
import numpy as np
D = np.matrix([[1,2],[3,4]])
idx = np.arange(3)
A = np.zeros((3,2,2))
A[idx,:,:] = D # This gives A = [[[1,2],[3,4]],[[1,2],[3,4]],\
# [[1,2],[3,4]]]
# In mathematical notation: A = {D, D, D}
np.zeros(A.shape)
n = np.array([0,1,2])
result = [np.linalg.matrix_power(D, i) for i in n]
np.array(result)
#Output:
array([[[ 1, 0],
[ 0, 1]],
[[ 1, 2],
[ 3, 4]],
[[ 7, 10],
[15, 22]]])
However, if you notice, you end up calculating multiple powers for the same base matrix. We could instead utilize the intermediate results and go from there, using numpy.linalg.multi_dot
def all_powers_arr_of_matrix(A):
result = np.zeros(A.shape)
result[0] = np.linalg.matrix_power(A[0], 0)
for i in range(1, A.shape[0]):
result[i] = np.linalg.multi_dot([result[i - 1], A[i]])
return result
result = all_powers_arr_of_matrix(A)
#Output:
array([[[ 1., 0.],
[ 0., 1.]],
[[ 1., 2.],
[ 3., 4.]],
[[ 7., 10.],
[15., 22.]]])
Also, we can avoid creating the matrix A entirely, saving some time.
def all_powers_matrix(D, *rangeargs): #end exclusive
''' Expects 2D matrix.
Use as all_powers_matrix(D, end) or
all_powers_matrix(D, start, end)
'''
if len(rangeargs) == 1:
start = 0
end = rangeargs[0]
elif len(rangeargs) == 2:
start = rangeargs[0]
end = rangeargs[1]
else:
print("incorrect args")
return None
result = np.zeros((end - start, *D.shape))
result[0] = np.linalg.matrix_power(A[0], start)
for i in range(start + 1, end):
result[i] = np.linalg.multi_dot([result[i - 1], D])
return result
return result
result = all_powers_matrix(D, 3)
#Output:
array([[[ 1., 0.],
[ 0., 1.]],
[[ 1., 2.],
[ 3., 4.]],
[[ 7., 10.],
[15., 22.]]])
Note that you'd need to add error handling if you decide to use these functions as-is.
To calculate power of matrix D, one way could be to find the eigenvalues and right eigenvectors of it with np.linalg.eig and then raise the power of the diagonal matrix as it is easier, then after some manipulation, you can use two np.einsum to calculate A
#get eigvalues and eigvectors
eigval, eigvect = np.linalg.eig(D)
# to check how it works, you can do:
print (np.dot(eigvect*eigval,np.linalg.inv(eigvect)))
#[[1. 2.]
# [3. 4.]]
# so you get back on D
#use power as ufunc of outer with n on the eigenvalues to get all the one you want
arrp = np.power.outer( eigval, n).T
#apply_along_axis to create the diagonal matrix along the last axis
diagp = np.apply_along_axis( np.diag, axis=-1, arr=arrp)
#finally use two np.einsum to calculate with the subscript to get what you want
A = np.einsum('lij,jk -> lik',
np.einsum('ij,kjl -> kil',eigvect,diagp), np.linalg.inv(eigvect)).round()
print (A)
print (A.shape)
#[[[ 1. 0.]
# [-0. 1.]]
#
# [[ 1. 2.]
# [ 3. 4.]]
#
# [[ 7. 10.]
# [15. 22.]]]
#
#(3, 2, 2)
I don't have a full solution, but there are some things I wanted to mention which are a bit too long for the comments.
You might first look into addition chain exponentiation if you are computing big powers of big matrices. This is basically asking how many matrix multiplications are required to compute A^k for a given k. For instance A^5 = A(A^2)^2 so you need to only three matrix multiplies: A^2 and (A^2)^2 and A(A^2)^2. This might be the simplest way to gain some efficiency, but you will probably still have to use explicit loops.
Your question is also related to the problem of computing Ax, A^2x, ... , A^kx for a given A and x. This is an active area of research right now (search "matrix powers kernel"), since computing such a sequence efficiently is useful for parallel/communication avoiding Krylov subspace methods. If you're looking for a very efficient solution to your problem it might be worth looking into some of the results about this.

efficient numpy array creation

Given x, I want to produce x, log(x) as a numpy array whereby x has shape s, the result has shape (*s, 2). What's the neatest way to do this? x may just be a float, in which case I want a result with shape (2,).
An ugly way to do this is:
import numpy as np
x = np.asarray(x)
result = np.empty((*x.shape, 2))
result[..., 0] = x
result[..., 1] = np.log(x)
It's important to separate aesthetics from performance. Sometimes ugly code is
fast. In fact, that's the case here. Although creating an empty array and then
assigning values to slices may not look beautiful, it is fast.
import numpy as np
import timeit
import itertools as IT
import pandas as pd
def using_empty(x):
x = np.asarray(x)
result = np.empty(x.shape + (2,))
result[..., 0] = x
result[..., 1] = np.log(x)
return result
def using_concat(x):
x = np.asarray(x)
return np.concatenate([x, np.log(x)], axis=-1).reshape(x.shape+(2,), order='F')
def using_stack(x):
x = np.asarray(x)
return np.stack([x, np.log(x)], axis=x.ndim)
def using_ufunc(x):
return np.array([x, np.log(x)])
using_ufunc = np.vectorize(using_ufunc, otypes=[np.ndarray])
tests = [np.arange(600),
np.arange(600).reshape(20,30),
np.arange(960).reshape(8,15,8)]
# check that all implementations return the same result
for x in tests:
assert np.allclose(using_empty(x), using_concat(x))
assert np.allclose(using_empty(x), using_stack(x))
timing = []
funcs = ['using_empty', 'using_concat', 'using_stack', 'using_ufunc']
for test, func in IT.product(tests, funcs):
timing.append(timeit.timeit(
'{}(test)'.format(func),
setup='from __main__ import test, {}'.format(func), number=1000))
timing = pd.DataFrame(np.array(timing).reshape(-1, len(funcs)), columns=funcs)
print(timing)
yields, the following timeit results on my machine:
using_empty using_concat using_stack using_ufunc
0 0.024754 0.025182 0.030244 2.414580
1 0.025766 0.027692 0.031970 2.408344
2 0.037502 0.039644 0.044032 3.907487
So using_empty is the fastest (of the options tested applied to tests).
Note that np.stack does exactly what you want, so
np.stack([x, np.log(x)], axis=x.ndim)
looks reasonably pretty, but it is also the slowest of the three options tested.
Note that along with being much slower, using_ufunc returns an array of object dtype:
In [236]: x = np.arange(6)
In [237]: using_ufunc(x)
Out[237]:
array([array([ 0., -inf]), array([ 1., 0.]),
array([ 2. , 0.69314718]),
array([ 3. , 1.09861229]),
array([ 4. , 1.38629436]), array([ 5. , 1.60943791])], dtype=object)
which is not the same as the desired result:
In [240]: using_empty(x)
Out[240]:
array([[ 0. , -inf],
[ 1. , 0. ],
[ 2. , 0.69314718],
[ 3. , 1.09861229],
[ 4. , 1.38629436],
[ 5. , 1.60943791]])
In [238]: using_ufunc(x).shape
Out[238]: (6,)
In [239]: using_empty(x).shape
Out[239]: (6, 2)

Is it possible to speed up an element-by-element array operation in Python?

I have a list of times (called times in my code, produced by the code suggested to me in the thread astropy.io fits efficient element access of a large table) and I want to do some statistical tests for periodicity, using Zn^2 and epoch folding tests. Some steps in the code take quite a while to run, and I am wondering if there is a faster way to do it. I have tried the equivalent map and lambda functions, but that takes even longer. My list of times has several hundred or maybe thousands of elements, depending on the dataset. Here is my code:
phase=[(x-mintime)*testfreq[m]-int((x-mintime)*testfreq[m]) for x in times]
# the above step takes 3 seconds for the dataset I am using for testing
# testfreq[m] is just one of several hundred frequencies I am testing
# times is of type numpy.ndarray
phasebin=[int(ph*numbins)for ph in phase]
# 1 second (numbins is 20)
powerarray=[phasebin.count(n) for n in range(0,numbins-1)]
# 0.3 seconds
poweravg=np.mean(powerarray)
chisq[m]=sum([(pow-poweravg)**2/poweravg for pow in powerarray])
# the above 2 steps are very quick
for n in range(0,maxn): # maxn is 3
cosparam=sum([(np.cos(2*np.pi*(n+1)*ph)) for ph in phase])
sinparam=sum([(np.sin(2*np.pi*(n+1)*ph)) for ph in phase])
# these steps each take 4 seconds
z2[m,n]=sum(z2[m,])+(cosparam**2+sinparam**2)/count
# this is quick (count is the number of times)
As this steps through several hundred frequencies on either side of frequencies identified through an FFT search, it takes a very long time to run. The same functionality in a lower level language runs much more quickly, but I need some of the Python modules for plotting, etc. I am hoping that Python can be persuaded to do some of the operations, particularly the phase, phasebin, powerarray, cosparam, and sinparam calculations, significantly faster, but I am not sure how to make this happen. Can anyone tell me how this can be done, or do I have to write and call functions in C or fortran? I know that this could be done in a few minutes e.g. in fortran, but this Python code takes hours as it is.
Thanks very much.
Instead of Python lists, you could use the numpy library, it is much faster for linear algebra type operations. For example to add two arrays in an element-wise fashion
>>> import numpy as np
>>> a = np.array([1,2,3,4,5])
>>> b = np.array([2,3,4,5,6])
>>> a + b
array([ 3, 5, 7, 9, 11])
Similarly, you can multiply arrays by scalars which multiplies each element as you'd expect
>>> 2 * a
array([ 2, 4, 6, 8, 10])
As far as speed, here is the Python list equivalent of adding two lists
>>> c = [1,2,3,4,5]
>>> d = [2,3,4,5,6]
>>> [i+j for i,j in zip(c,d)]
[3, 5, 7, 9, 11]
Then timing the two
>>> from timeit import timeit
>>> setup = '''
import numpy as np
a = np.array([1,2,3,4,5])
b = np.array([2,3,4,5,6])'''
>>> timeit('a+b', setup)
0.521275608325351
>>> setup = '''
c = [1,2,3,4,5]
d = [2,3,4,5,6]'''
>>> timeit('[i+j for i,j in zip(c,d)]', setup)
1.2781205834379108
In this small example numpy was more than twice as fast.
for loop substitute - operating on complete arrays
First multiply phase by 2*pi*n using broadcasting
phase = np.arange(10)
maxn = 3
ens = np.arange(1, maxn+1) # array([1, 2, 3])
two_pi_ens = 2*np.pi*ens
b = phase * two_pi_ens[:, np.newaxis]
b.shape is (3,10) one row for each value of range(1, maxn)
Take the cosine then sum to get the three cosine parameters
c = np.cos(b)
c_param = c.sum(axis = 1) # c_param.shape is 3
Take the sine then sum to get the three sine parameters
s = np.sin(b)
s_param = s.sum(axis = 1) # s_param.shape is 3
Sum of the squares divided by count
d = (np.square(c_param) + np.square(s_param)) / count
# d.shape is (3,)
Assign to z2
for n in range(maxn):
z2[m,n] = z2[m,:].sum() + d[n]
That loop is doing a cumulative sum. numpy ndarrays have a cumsum method.
If maxn is small (3 in your case) it may not be noticeably faster.
z2[m,:] += d
z2[m,:].cumsum(out = z2[m,:])
To illustrate:
>>> a = np.ones((3,3))
>>> a
array([[ 1., 1., 1.],
[ 1., 1., 1.],
[ 1., 1., 1.]])
>>> m = 1
>>> d = (1,2,3)
>>> a[m,:] += d
>>> a
array([[ 1., 1., 1.],
[ 2., 3., 4.],
[ 1., 1., 1.]])
>>> a[m,:].cumsum(out = a[m,:])
array([ 2., 5., 9.])
>>> a
array([[ 1., 1., 1.],
[ 2., 5., 9.],
[ 1., 1., 1.]])
>>>

Better way to shuffle two numpy arrays in unison

I have two numpy arrays of different shapes, but with the same length (leading dimension). I want to shuffle each of them, such that corresponding elements continue to correspond -- i.e. shuffle them in unison with respect to their leading indices.
This code works, and illustrates my goals:
def shuffle_in_unison(a, b):
assert len(a) == len(b)
shuffled_a = numpy.empty(a.shape, dtype=a.dtype)
shuffled_b = numpy.empty(b.shape, dtype=b.dtype)
permutation = numpy.random.permutation(len(a))
for old_index, new_index in enumerate(permutation):
shuffled_a[new_index] = a[old_index]
shuffled_b[new_index] = b[old_index]
return shuffled_a, shuffled_b
For example:
>>> a = numpy.asarray([[1, 1], [2, 2], [3, 3]])
>>> b = numpy.asarray([1, 2, 3])
>>> shuffle_in_unison(a, b)
(array([[2, 2],
[1, 1],
[3, 3]]), array([2, 1, 3]))
However, this feels clunky, inefficient, and slow, and it requires making a copy of the arrays -- I'd rather shuffle them in-place, since they'll be quite large.
Is there a better way to go about this? Faster execution and lower memory usage are my primary goals, but elegant code would be nice, too.
One other thought I had was this:
def shuffle_in_unison_scary(a, b):
rng_state = numpy.random.get_state()
numpy.random.shuffle(a)
numpy.random.set_state(rng_state)
numpy.random.shuffle(b)
This works...but it's a little scary, as I see little guarantee it'll continue to work -- it doesn't look like the sort of thing that's guaranteed to survive across numpy version, for example.
Your can use NumPy's array indexing:
def unison_shuffled_copies(a, b):
assert len(a) == len(b)
p = numpy.random.permutation(len(a))
return a[p], b[p]
This will result in creation of separate unison-shuffled arrays.
X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y, random_state=0)
To learn more, see http://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html
Your "scary" solution does not appear scary to me. Calling shuffle() for two sequences of the same length results in the same number of calls to the random number generator, and these are the only "random" elements in the shuffle algorithm. By resetting the state, you ensure that the calls to the random number generator will give the same results in the second call to shuffle(), so the whole algorithm will generate the same permutation.
If you don't like this, a different solution would be to store your data in one array instead of two right from the beginning, and create two views into this single array simulating the two arrays you have now. You can use the single array for shuffling and the views for all other purposes.
Example: Let's assume the arrays a and b look like this:
a = numpy.array([[[ 0., 1., 2.],
[ 3., 4., 5.]],
[[ 6., 7., 8.],
[ 9., 10., 11.]],
[[ 12., 13., 14.],
[ 15., 16., 17.]]])
b = numpy.array([[ 0., 1.],
[ 2., 3.],
[ 4., 5.]])
We can now construct a single array containing all the data:
c = numpy.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]
# array([[ 0., 1., 2., 3., 4., 5., 0., 1.],
# [ 6., 7., 8., 9., 10., 11., 2., 3.],
# [ 12., 13., 14., 15., 16., 17., 4., 5.]])
Now we create views simulating the original a and b:
a2 = c[:, :a.size//len(a)].reshape(a.shape)
b2 = c[:, a.size//len(a):].reshape(b.shape)
The data of a2 and b2 is shared with c. To shuffle both arrays simultaneously, use numpy.random.shuffle(c).
In production code, you would of course try to avoid creating the original a and b at all and right away create c, a2 and b2.
This solution could be adapted to the case that a and b have different dtypes.
Very simple solution:
randomize = np.arange(len(x))
np.random.shuffle(randomize)
x = x[randomize]
y = y[randomize]
the two arrays x,y are now both randomly shuffled in the same way
James wrote in 2015 an sklearn solution which is helpful. But he added a random state variable, which is not needed. In the below code, the random state from numpy is automatically assumed.
X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y)
from np.random import permutation
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data #numpy array
y = iris.target #numpy array
# Data is currently unshuffled; we should shuffle
# each X[i] with its corresponding y[i]
perm = permutation(len(X))
X = X[perm]
y = y[perm]
Shuffle any number of arrays together, in-place, using only NumPy.
import numpy as np
def shuffle_arrays(arrays, set_seed=-1):
"""Shuffles arrays in-place, in the same order, along axis=0
Parameters:
-----------
arrays : List of NumPy arrays.
set_seed : Seed value if int >= 0, else seed is random.
"""
assert all(len(arr) == len(arrays[0]) for arr in arrays)
seed = np.random.randint(0, 2**(32 - 1) - 1) if set_seed < 0 else set_seed
for arr in arrays:
rstate = np.random.RandomState(seed)
rstate.shuffle(arr)
And can be used like this
a = np.array([1, 2, 3, 4, 5])
b = np.array([10,20,30,40,50])
c = np.array([[1,10,11], [2,20,22], [3,30,33], [4,40,44], [5,50,55]])
shuffle_arrays([a, b, c])
A few things to note:
The assert ensures that all input arrays have the same length along
their first dimension.
Arrays shuffled in-place by their first dimension - nothing returned.
Random seed within positive int32 range.
If a repeatable shuffle is needed, seed value can be set.
After the shuffle, the data can be split using np.split or referenced using slices - depending on the application.
you can make an array like:
s = np.arange(0, len(a), 1)
then shuffle it:
np.random.shuffle(s)
now use this s as argument of your arrays. same shuffled arguments return same shuffled vectors.
x_data = x_data[s]
x_label = x_label[s]
There is a well-known function that can handle this:
from sklearn.model_selection import train_test_split
X, _, Y, _ = train_test_split(X,Y, test_size=0.0)
Just setting test_size to 0 will avoid splitting and give you shuffled data.
Though it is usually used to split train and test data, it does shuffle them too.
From documentation
Split arrays or matrices into random train and test subsets
Quick utility that wraps input validation and
next(ShuffleSplit().split(X, y)) and application to input data into a
single call for splitting (and optionally subsampling) data in a
oneliner.
This seems like a very simple solution:
import numpy as np
def shuffle_in_unison(a,b):
assert len(a)==len(b)
c = np.arange(len(a))
np.random.shuffle(c)
return a[c],b[c]
a = np.asarray([[1, 1], [2, 2], [3, 3]])
b = np.asarray([11, 22, 33])
shuffle_in_unison(a,b)
Out[94]:
(array([[3, 3],
[2, 2],
[1, 1]]),
array([33, 22, 11]))
One way in which in-place shuffling can be done for connected lists is using a seed (it could be random) and using numpy.random.shuffle to do the shuffling.
# Set seed to a random number if you want the shuffling to be non-deterministic.
def shuffle(a, b, seed):
np.random.seed(seed)
np.random.shuffle(a)
np.random.seed(seed)
np.random.shuffle(b)
That's it. This will shuffle both a and b in the exact same way. This is also done in-place which is always a plus.
EDIT, don't use np.random.seed() use np.random.RandomState instead
def shuffle(a, b, seed):
rand_state = np.random.RandomState(seed)
rand_state.shuffle(a)
rand_state.seed(seed)
rand_state.shuffle(b)
When calling it just pass in any seed to feed the random state:
a = [1,2,3,4]
b = [11, 22, 33, 44]
shuffle(a, b, 12345)
Output:
>>> a
[1, 4, 2, 3]
>>> b
[11, 44, 22, 33]
Edit: Fixed code to re-seed the random state
Say we have two arrays: a and b.
a = np.array([[1,2,3],[4,5,6],[7,8,9]])
b = np.array([[9,1,1],[6,6,6],[4,2,0]])
We can first obtain row indices by permutating first dimension
indices = np.random.permutation(a.shape[0])
[1 2 0]
Then use advanced indexing.
Here we are using the same indices to shuffle both arrays in unison.
a_shuffled = a[indices[:,np.newaxis], np.arange(a.shape[1])]
b_shuffled = b[indices[:,np.newaxis], np.arange(b.shape[1])]
This is equivalent to
np.take(a, indices, axis=0)
[[4 5 6]
[7 8 9]
[1 2 3]]
np.take(b, indices, axis=0)
[[6 6 6]
[4 2 0]
[9 1 1]]
If you want to avoid copying arrays, then I would suggest that instead of generating a permutation list, you go through every element in the array, and randomly swap it to another position in the array
for old_index in len(a):
new_index = numpy.random.randint(old_index+1)
a[old_index], a[new_index] = a[new_index], a[old_index]
b[old_index], b[new_index] = b[new_index], b[old_index]
This implements the Knuth-Fisher-Yates shuffle algorithm.
Shortest and easiest way in my opinion, use seed:
random.seed(seed)
random.shuffle(x_data)
# reset the same seed to get the identical random sequence and shuffle the y
random.seed(seed)
random.shuffle(y_data)
most solutions above work, however if you have column vectors you have to transpose them first. here is an example
def shuffle(self) -> None:
"""
Shuffles X and Y
"""
x = self.X.T
y = self.Y.T
p = np.random.permutation(len(x))
self.X = x[p].T
self.Y = y[p].T
With an example, this is what I'm doing:
combo = []
for i in range(60000):
combo.append((images[i], labels[i]))
shuffle(combo)
im = []
lab = []
for c in combo:
im.append(c[0])
lab.append(c[1])
images = np.asarray(im)
labels = np.asarray(lab)
I extended python's random.shuffle() to take a second arg:
def shuffle_together(x, y):
assert len(x) == len(y)
for i in reversed(xrange(1, len(x))):
# pick an element in x[:i+1] with which to exchange x[i]
j = int(random.random() * (i+1))
x[i], x[j] = x[j], x[i]
y[i], y[j] = y[j], y[i]
That way I can be sure that the shuffling happens in-place, and the function is not all too long or complicated.
Just use numpy...
First merge the two input arrays 1D array is labels(y) and 2D array is data(x) and shuffle them with NumPy shuffle method. Finally split them and return.
import numpy as np
def shuffle_2d(a, b):
rows= a.shape[0]
if b.shape != (rows,1):
b = b.reshape((rows,1))
S = np.hstack((b,a))
np.random.shuffle(S)
b, a = S[:,0], S[:,1:]
return a,b
features, samples = 2, 5
x, y = np.random.random((samples, features)), np.arange(samples)
x, y = shuffle_2d(train, test)

Categories