How to use sparse vectors and matrices in Python?

I am trying to do something very simple, but confused by the abundance of information about sparse matrices and vectors in Python.
I want to create two vectors, x and y, one of length 5 and one of length 6, being sparse. Then I want to set one coordinate in each one of them. Then I want to create a matrix A, sparse, which is 5 x 6 and add to it the outer product between x and y. I then want to do SVD on that A.
Here is what I tried, and it goes wrong in many ways.
from scipy import sparse;
import numpy as np;
import scipy.sparse.linalg as ssl;
x = sparse.bsr_matrix(np.zeros(5));
x[1] = 1;
y = sparse.bsr_matrix(np.zeros(6));
y[1] = 2;
A = sparse.coo_matrix(5, 6);
A = A + np.outer(x,y.transpose())
svdresult = ssl.svds(A,1);

First, you should determine the data you want to store in the sparse matrix before constructing it. If you need to assign or change entries after construction, use sparse.csc_matrix or sparse.csr_matrix instead of bsr_matrix, which does not support item assignment. Then you can assign or change data like this:
x[0, 1] = 1
Second, the outer product of the row vectors x and y is equivalent to x.transpose() * y.
Here is working code:
from scipy import sparse
import numpy as np
import scipy.sparse.linalg as ssl
x = np.zeros(5)
x[1] = 1
x_bsr = sparse.bsr_matrix(x)
y = np.zeros(6)
y[1] = 2
y_bsr = sparse.bsr_matrix(y)
A = sparse.coo_matrix((5, 6)) # Sparse matrix 5 x 6
B = x_bsr.transpose().dot(y_bsr) # Outer product of x and y
svdresult = ssl.svds((A + B), 1)
Output:
(array([[ 5.55111512e-17],
[ -1.00000000e+00],
[ 0.00000000e+00],
[ -2.77555756e-17],
[ 1.11022302e-16]]), array([ 2.]), array([[ 0., -1., 0., 0., 0., 0.]]))
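As a sanity check: the outer product here has a single nonzero entry (2), so its only nonzero singular value is 2, matching the svds output above. A minimal dense verification sketch:
import numpy as np

x = np.zeros(5); x[1] = 1
y = np.zeros(6); y[1] = 2
# The rank-1 matrix np.outer(x, y) has one nonzero entry (2), so its
# largest singular value is 2, as svds reported.
s = np.linalg.svd(np.outer(x, y), compute_uv=False)
print(s.max())  # 2.0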

Related

How to use the "dot" (or "matmul") function for iterative multiplication in Python

I need to obtain a "W" matrix of multiple matrix multiplications (all multiplications result in column vectors).
from numpy import matrix
from numpy import transpose
from numpy import matmul
from numpy import dot
# Iterative matrix multiplication
def iterativeMultiplication(X, Y):
    W = []  # Matrix of matricial products
    X = matrix(X)  # same number of rows
    Y = matrix(Y)  # same number of rows
    h = 0
    while (h < X.shape[1]):
        W.append([])
        W[h] = dot(transpose(X), Y)  # using "dot" function
        h += 1
    return W
But, unexpectedly, I obtain a list of objects with their respective data types.
X = [[0., 0., 1.], [1.,0.,0.], [2.,2.,2.], [2.,5.,4.]]
Y = [[-0.2], [1.1], [5.9], [12.3]] # Edit Y column
iterativeMultiplication( X, Y )
Results in:
[array([[37.5],[73.3],[60.8]]),
array([[37.5],[73.3],[60.8]]),
array([[37.5],[73.3],[60.8]])]
I need some method to obtain only the numerical values for the matrix conversion.
W = matrix(W) # Results in error
It is the same when using the "matmul" function. Thanks for your time.
If you want to stack multiple matrices, you can use numpy.vstack:
W = numpy.vstack(W)
Edit: There seems to be a discrepancy between your function, X and Y versus the "result" list in your question. But based on your comments below, what you're actually looking for is numpy.hstack (horizontal stack) which will give you the desired 3x3 matrix based on your "result" list.
W = numpy.hstack(W)
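For instance, with a list of three (3, 1) column vectors like the result list above, a small sketch of the difference between the two:
import numpy as np

W = [np.array([[37.5], [73.3], [60.8]]) for _ in range(3)]
print(np.hstack(W).shape)  # (3, 3) - columns placed side by side
print(np.vstack(W).shape)  # (9, 1) - columns stacked on top of each other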
Of course you are going to get a list. You initialize W as a list, and append the same calculation to it 3 times.
But your 3-element arrays don't make sense with this data: array([[ 3.36877336],[ 3.97112615],[ 3.8092797 ]]).
If I make Xm=np.matrix(X), etc:
In [162]: Xm
Out[162]:
matrix([[ 0., 0., 1.],
[ 1., 0., 0.],
[ 2., 2., 2.],
[ 2., 5., 4.]])
In [163]: Ym
Out[163]:
matrix([[ 0.1, -0.2],
[ 0.9, 1.1],
[ 6.2, 5.9],
[ 11.9, 12.3]])
In [164]: Xm.T.dot(Ym)
Out[164]:
matrix([[ 37.1, 37.5],
[ 71.9, 73.3],
[ 60.1, 60.8]])
In [165]: Xm.T*Ym # matrix interprets * as .dot
Out[165]:
matrix([[ 37.1, 37.5],
[ 71.9, 73.3],
[ 60.1, 60.8]])
You need to edit the question, to have both valid Python code (missing def and :), and results that match the inputs.
===============
In [173]: Y = [[-0.2], [1.1], [5.9], [12.3]]
In [174]: Ym=np.matrix(Y)
In [176]: Xm.T*Ym
Out[176]:
matrix([[ 37.5],
        [ 73.3],
        [ 60.8]])
=====================
This iteration is clumsy:
h = 0
while (h < X.shape[1]):
    W.append([])
    W[h] = dot(transpose(X), Y)  # using "dot" function
    h += 1
A more Pythonic approach:
for h in range(X.shape[1]):
    W.append(np.dot(...))
Or even:
W = [np.dot(....) for h in range(X.shape[1])]
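And since the loop body never depends on h, the whole computation collapses to a single product; a sketch of that shortcut:
import numpy as np

X = [[0., 0., 1.], [1., 0., 0.], [2., 2., 2.], [2., 5., 4.]]
Y = [[-0.2], [1.1], [5.9], [12.3]]
# one (3, 1) product instead of three identical copies of it
W = np.dot(np.transpose(X), Y)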

How to create random orthonormal matrix in python numpy

Is there a method that I can call to create a random orthonormal matrix in python? Possibly using numpy? Or is there a way to create an orthonormal matrix using multiple numpy methods? Thanks.
Version 0.18 of scipy has scipy.stats.ortho_group and scipy.stats.special_ortho_group. The pull request where it was added is https://github.com/scipy/scipy/pull/5622
For example,
In [24]: from scipy.stats import ortho_group # Requires version 0.18 of scipy
In [25]: m = ortho_group.rvs(dim=3)
In [26]: m
Out[26]:
array([[-0.23939017, 0.58743526, -0.77305379],
[ 0.81921268, -0.30515101, -0.48556508],
[-0.52113619, -0.74953498, -0.40818426]])
In [27]: np.set_printoptions(suppress=True)
In [28]: m.dot(m.T)
Out[28]:
array([[ 1., 0., -0.],
[ 0., 1., 0.],
[-0., 0., 1.]])
You can obtain a random n x n orthogonal matrix Q, (uniformly distributed over the manifold of n x n orthogonal matrices) by performing a QR factorization of an n x n matrix with elements i.i.d. Gaussian random variables of mean 0 and variance 1. Here is an example:
import numpy as np
from scipy.linalg import qr
n = 3
H = np.random.randn(n, n)
Q, R = qr(H)
print (Q.dot(Q.T))
[[ 1.00000000e+00 -2.77555756e-17 2.49800181e-16]
[ -2.77555756e-17 1.00000000e+00 -1.38777878e-17]
[ 2.49800181e-16 -1.38777878e-17 1.00000000e+00]]
EDIT: (Revisiting this answer after the comment by @g g.) The claim above on the QR decomposition of a Gaussian matrix providing a uniformly distributed (over the, so called, Stiefel manifold) orthogonal matrix is suggested by Theorems 2.3.18-19 of this reference. Note that the statement of the result suggests a "QR-like" decomposition, however, with the triangular matrix R having positive elements.
Apparently, the qr function of scipy (numpy) does not guarantee positive diagonal elements for R, and the corresponding Q is actually not uniformly distributed. This has been observed in this monograph, Sec. 4.6 (the discussion refers to MATLAB, but I guess both MATLAB and scipy use the same LAPACK routines). It is suggested there that the matrix Q provided by qr be modified by post-multiplying it with a random unitary diagonal matrix.
Below I reproduce the experiment in the above reference, plotting the empirical distribution (histogram) of phases of eigenvalues of the "direct" Q matrix provided by qr, as well as the "modified" version, where it is seen that the modified version does indeed have a uniform eigenvalue phase, as would be expected from a uniformly distributed orthogonal matrix.
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import qr, eigvals
from seaborn import distplot

n = 50
repeats = 10000
angles = []
angles_modified = []
for rp in range(repeats):
    H = np.random.randn(n, n)
    Q, R = qr(H)
    angles.append(np.angle(eigvals(Q)))
    Q_modified = Q @ np.diag(np.exp(1j * np.pi * 2 * np.random.rand(n)))
    angles_modified.append(np.angle(eigvals(Q_modified)))

fig, ax = plt.subplots(1, 2, figsize=(10, 3))
distplot(np.asarray(angles).flatten(), kde=False, hist_kws=dict(edgecolor="k", linewidth=2), ax=ax[0])
ax[0].set(xlabel='phase', title='direct')
distplot(np.asarray(angles_modified).flatten(), kde=False, hist_kws=dict(edgecolor="k", linewidth=2), ax=ax[1])
ax[1].set(xlabel='phase', title='modified');
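For the real orthogonal case, the usual deterministic variant of that fix (a sketch, following the reference's idea of making R's diagonal positive, with signs instead of random phases) is to rescale each column of Q by the sign of the corresponding diagonal entry of R:
import numpy as np
from scipy.linalg import qr

n = 3
H = np.random.randn(n, n)
Q, R = qr(H)
# Scale column j of Q by sign(R[j, j]); the product QR is unchanged,
# R's diagonal becomes positive, and the rescaled Q is Haar-distributed.
Q = Q * np.sign(np.diag(R))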
This is the rvs method pulled from https://github.com/scipy/scipy/pull/5622/files, with minimal change - just enough to run as a stand-alone numpy function.
import numpy as np

def rvs(dim=3):
    random_state = np.random
    H = np.eye(dim)
    D = np.ones((dim,))
    for n in range(1, dim):
        x = random_state.normal(size=(dim-n+1,))
        D[n-1] = np.sign(x[0])
        x[0] -= D[n-1]*np.sqrt((x*x).sum())
        # Householder transformation
        Hx = (np.eye(dim-n+1) - 2.*np.outer(x, x)/(x*x).sum())
        mat = np.eye(dim)
        mat[n-1:, n-1:] = Hx
        H = np.dot(H, mat)
    # Fix the last sign such that the determinant is 1
    D[-1] = (-1)**(1-(dim % 2))*D.prod()
    # Equivalent to np.dot(np.diag(D), H) but faster, apparently
    H = (D*H.T).T
    return H
It matches Warren's test, https://stackoverflow.com/a/38426572/901925
An easy way to create any shape (n x m) orthogonal matrix:
import numpy as np
n, m = 3, 5
H = np.random.rand(n, m)
u, s, vh = np.linalg.svd(H, full_matrices=False)
mat = u @ vh
print(mat @ mat.T)  # -> eye(n)
Note that if n > m, it would obtain mat.T @ mat = eye(m).
from scipy.stats import special_ortho_group
num_dim=3
x = special_ortho_group.rvs(num_dim)
Documentation
If you want a non-square matrix with orthonormal column vectors, you could create a square one with any of the mentioned methods and drop some columns.
Numpy also has qr factorization. https://numpy.org/doc/stable/reference/generated/numpy.linalg.qr.html
import numpy as np
a = np.random.rand(3, 3)
q, r = np.linalg.qr(a)
q @ q.T
# array([[ 1.00000000e+00, 8.83206468e-17, 2.69154044e-16],
# [ 8.83206468e-17, 1.00000000e+00, -1.30466244e-16],
# [ 2.69154044e-16, -1.30466244e-16, 1.00000000e+00]])

Avoid implicit conversion to matrix in numpy operations

Is there a way to globally avoid the matrix from appearing in any of the results of the numpy computations? For example currently if you have x as a numpy.ndarray and y as a scipy.sparse.csc_matrix, and you say x += y, x will become a matrix afterwards. Is there a way to prevent that from happening, i.e., keep x an ndarray, and more generally, keep using ndarray in all places where a matrix is produced?
I added the scipy tag. This is a scipy.sparse problem, not a np.matrix one.
In [250]: y=sparse.csr_matrix([[0,1],[1,0]])
In [251]: x=np.arange(2)
In [252]: y+x
Out[252]:
matrix([[0, 2],
[1, 1]])
the sparse + array => matrix
(As a side note, np.matrix is a subclass of np.ndarray. sparse.csr_matrix is not a subclass; it has many numpy-like operations, but it implements them in its own code.)
In [255]: x += y
In [256]: x
Out[256]:
matrix([[0, 2],
[1, 1]])
Technically this shouldn't happen; in effect it is doing x = x + y, assigning a new value to x, not just modifying x.
If I first turn y into a regular dense matrix, I get an error. Allowing the action would change a 1d array into a 2d one.
In [258]: x += y.todense()
...
ValueError: non-broadcastable output operand with shape (2,) doesn't match the broadcast shape (2,2)
Changing x to 2d allows the addition to proceed - without changing array to matrix:
In [259]: x=np.eye(2)
In [260]: x
Out[260]:
array([[ 1., 0.],
[ 0., 1.]])
In [261]: x += y.todense()
In [262]: x
Out[262]:
array([[ 1., 1.],
[ 1., 1.]])
In general, performing addition/subtraction with sparse matrices is tricky. They were designed for matrix multiplication. Multiplication doesn't change sparsity as much as addition. y+1 for example makes it dense.
Without digging into the details of how sparse addition is coded, I'd say - don't try this x+=... operation without first turning y into a dense version.
In [265]: x += y.A
In [266]: x
Out[266]:
array([[ 1., 2.],
[ 2., 1.]])
I can't think of a good reason not to do this.
(I should check the scipy github for a bug issue on this).
scipy/sparse/compressed.py has the csr addition code. x+y uses x.__add__(y) but sometimes that is flipped to y.__add__(x). x+=y uses x.__iadd__(y). So I may need to examine __iadd__ for ndarray as well.
But the basic addition for a sparse matrix is:
def __add__(self, other):
    # First check if argument is a scalar
    if isscalarlike(other):
        if other == 0:
            return self.copy()
        else:  # Now we would add this scalar to every element.
            raise NotImplementedError('adding a nonzero scalar to a '
                                      'sparse matrix is not supported')
    elif isspmatrix(other):
        if (other.shape != self.shape):
            raise ValueError("inconsistent shapes")
        return self._binopt(other, '_plus_')
    elif isdense(other):
        # Convert this matrix to a dense matrix and add them
        return self.todense() + other
    else:
        return NotImplemented
So the y+x becomes y.todense() + x. And x+y uses the same thing.
Regardless of the += details, it is clear that adding a sparse to a dense (array or np.matrix) involves converting the sparse to dense. There's no code that iterates through the sparse values and adds those selectively to the dense array.
It's only if the arrays are both sparse that it performs a special sparse addition. y+y works, returning a sparse. y+=y fails with a NotImplementedError from sparse.base.__iadd__.
This is the best diagnostic sequence that I've come up with, trying various ways of adding y to a (2,2) array.
In [348]: x=np.eye(2)
In [349]: x+y
Out[349]:
matrix([[ 1., 1.],
[ 1., 1.]])
In [350]: x+y.todense()
Out[350]:
matrix([[ 1., 1.],
[ 1., 1.]])
Addition produces a matrix, but values can be written to x without changing x's class (or shape):
In [351]: x[:] = x+y
In [352]: x
Out[352]:
array([[ 1., 1.],
[ 1., 1.]])
+= with a dense matrix does the same:
In [353]: x += y.todense()
In [354]: x
Out[354]:
array([[ 1., 2.],
[ 2., 1.]])
but something in the += sparse path changes the class of x:
In [355]: x += y
In [356]: x
Out[356]:
matrix([[ 1., 3.],
[ 3., 1.]])
Further testing and looking at id(x) and x.__array_interface__ it is clear that x += y replaces x. This is true even if x starts as np.matrix. So the sparse += is not an inplace operation. x += y.todense() is an inplace operation.
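A minimal check along those lines (a sketch of the id/type test just described):
import numpy as np
from scipy import sparse

y = sparse.csr_matrix([[0, 1], [1, 0]])
x = np.eye(2)
before = id(x)

x += y.todense()                 # true in-place operation
print(id(x) == before, type(x))  # True, <class 'numpy.ndarray'>

x += y                           # rebinds x to a new object
print(id(x) == before, type(x))  # False, <class 'numpy.matrix'>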
Yes, it's a bug; but https://github.com/scipy/scipy/issues/7826 says
I do not really see a way to change this.
An X += c * Y without todense follows. Some combinations of inc(various array / matrix, various sparse) have been tested, but certainly not all.
def inc(X, Y, c=1.):
    """ X += c * Y, X Y sparse or dense """
    if (not hasattr(X, "indices")          # dense += sparse
            and hasattr(Y, "indices")):
        # inc an ndarray view, because ndarray += sparse -> matrix
        X = getattr(X, "A", X).squeeze()
        X[Y.indices] += c * Y.data
    else:
        X += c * Y  # sparse + different sparse: SparseEfficiencyWarning
    return X
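A usage sketch (my example values), with a dense x and a one-row sparse y, the case the dense branch handles:
import numpy as np
from scipy import sparse

x = np.zeros(6)
y = sparse.csr_matrix(np.array([[0., 1., 0., 0., 2., 0.]]))

inc(x, y, c=3.)
print(x, type(x))  # [0. 3. 0. 0. 6. 0.] - x is still an ndarray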

Multiplying Block Matrices in Numpy

Hi everyone, I am a Python newbie.
I have to implement lasso L1 regression for a class assignment. This involves solving a quadratic equation involving block matrices.
minimize x^t * H * x + f^t * x
where x > 0
where H is a 2 x 2 block matrix, each block being a k x k matrix, and x and f are 2 x 1 block vectors, each element being a k-dimensional vector.
I was thinking of using ndarrays, such that:
np.shape(H) = (2, 2, k, k)
np.shape(x) = (2, k)
But I figured out that np.dot(X, H) doesn't work here.
Is there an easy way to solve this problem? Thanks in advance.
First of all, I am convinced that converting to matrices will lead to more efficient computations. That said, if you consider your 2k x 2k matrix as a 2 x 2 block matrix, then you operate in a tensor product of vector spaces, and have to use tensordot instead of dot.
Let give it a try, with k=5 for example:
>>> import numpy as np
>>> k = 5
Define our matrix a and vector x
>>> a = np.arange(1.*2*2*k*k).reshape(2,2,k,k)
>>> x = np.arange(1.*2*k).reshape(2,k)
>>> x
array([[ 0., 1., 2., 3., 4.],
[ 5., 6., 7., 8., 9.]])
Now we can multiply our tensors. Be sure to choose the right axes; I didn't test the following formula explicitly, and there might be an error:
>>> result = np.tensordot(a,x,([1,3],[0,1]))
>>> result
array([[ 985., 1210., 1435., 1660., 1885.],
[ 3235., 3460., 3685., 3910., 4135.]])
>>> np.shape(result)
(2, 5)
np.einsum gives good control over which axes are summed.
np.einsum('ijkl,jk',H,x)
is one possible (generalized) dot product, giving a (2, k) result (the first and last dims of H).
np.einsum('ijkl,jl',H,x)
is another. You need to be explicit - which dimensions of x go with which of H.
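A quick check of both contractions against the tensordot call from the previous answer (a sketch reusing the same k=5 shapes):
import numpy as np

k = 5
H = np.arange(2. * 2 * k * k).reshape(2, 2, k, k)
x = np.arange(2. * k).reshape(2, k)

r1 = np.einsum('ijkl,jk->il', H, x)  # sums over j and k
r2 = np.einsum('ijkl,jl->ik', H, x)  # sums over j and l
# r2 is the same contraction as tensordot(H, x, ([1, 3], [0, 1])) above
print(np.allclose(r2, np.tensordot(H, x, axes=([1, 3], [0, 1]))))  # True
print(r1.shape, r2.shape)  # (2, 5) (2, 5)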

Better way to shuffle two numpy arrays in unison

I have two numpy arrays of different shapes, but with the same length (leading dimension). I want to shuffle each of them, such that corresponding elements continue to correspond -- i.e. shuffle them in unison with respect to their leading indices.
This code works, and illustrates my goals:
def shuffle_in_unison(a, b):
    assert len(a) == len(b)
    shuffled_a = numpy.empty(a.shape, dtype=a.dtype)
    shuffled_b = numpy.empty(b.shape, dtype=b.dtype)
    permutation = numpy.random.permutation(len(a))
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b
For example:
>>> a = numpy.asarray([[1, 1], [2, 2], [3, 3]])
>>> b = numpy.asarray([1, 2, 3])
>>> shuffle_in_unison(a, b)
(array([[2, 2],
[1, 1],
[3, 3]]), array([2, 1, 3]))
However, this feels clunky, inefficient, and slow, and it requires making a copy of the arrays -- I'd rather shuffle them in-place, since they'll be quite large.
Is there a better way to go about this? Faster execution and lower memory usage are my primary goals, but elegant code would be nice, too.
One other thought I had was this:
def shuffle_in_unison_scary(a, b):
    rng_state = numpy.random.get_state()
    numpy.random.shuffle(a)
    numpy.random.set_state(rng_state)
    numpy.random.shuffle(b)
This works...but it's a little scary, as I see little guarantee it'll continue to work -- it doesn't look like the sort of thing that's guaranteed to survive across numpy version, for example.
You can use NumPy's array indexing:
def unison_shuffled_copies(a, b):
    assert len(a) == len(b)
    p = numpy.random.permutation(len(a))
    return a[p], b[p]
This will result in creation of separate unison-shuffled arrays.
X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y, random_state=0)
To learn more, see http://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html
Your "scary" solution does not appear scary to me. Calling shuffle() for two sequences of the same length results in the same number of calls to the random number generator, and these are the only "random" elements in the shuffle algorithm. By resetting the state, you ensure that the calls to the random number generator will give the same results in the second call to shuffle(), so the whole algorithm will generate the same permutation.
If you don't like this, a different solution would be to store your data in one array instead of two right from the beginning, and create two views into this single array simulating the two arrays you have now. You can use the single array for shuffling and the views for all other purposes.
Example: Let's assume the arrays a and b look like this:
a = numpy.array([[[ 0., 1., 2.],
[ 3., 4., 5.]],
[[ 6., 7., 8.],
[ 9., 10., 11.]],
[[ 12., 13., 14.],
[ 15., 16., 17.]]])
b = numpy.array([[ 0., 1.],
[ 2., 3.],
[ 4., 5.]])
We can now construct a single array containing all the data:
c = numpy.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]
# array([[ 0., 1., 2., 3., 4., 5., 0., 1.],
# [ 6., 7., 8., 9., 10., 11., 2., 3.],
# [ 12., 13., 14., 15., 16., 17., 4., 5.]])
Now we create views simulating the original a and b:
a2 = c[:, :a.size//len(a)].reshape(a.shape)
b2 = c[:, a.size//len(a):].reshape(b.shape)
The data of a2 and b2 is shared with c. To shuffle both arrays simultaneously, use numpy.random.shuffle(c).
In production code, you would of course try to avoid creating the original a and b at all and right away create c, a2 and b2.
This solution could be adapted to the case that a and b have different dtypes.
Very simple solution:
randomize = np.arange(len(x))
np.random.shuffle(randomize)
x = x[randomize]
y = y[randomize]
The two arrays x, y are now both randomly shuffled in the same way.
James wrote an sklearn solution in 2015 which is helpful. But he added a random state variable, which is not needed. In the code below, the random state from numpy is used automatically.
X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y)
from numpy.random import permutation
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data #numpy array
y = iris.target #numpy array
# Data is currently unshuffled; we should shuffle
# each X[i] with its corresponding y[i]
perm = permutation(len(X))
X = X[perm]
y = y[perm]
Shuffle any number of arrays together, in-place, using only NumPy.
import numpy as np

def shuffle_arrays(arrays, set_seed=-1):
    """Shuffles arrays in-place, in the same order, along axis=0

    Parameters:
    -----------
    arrays : List of NumPy arrays.
    set_seed : Seed value if int >= 0, else seed is random.
    """
    assert all(len(arr) == len(arrays[0]) for arr in arrays)
    seed = np.random.randint(0, 2**(32 - 1) - 1) if set_seed < 0 else set_seed

    for arr in arrays:
        rstate = np.random.RandomState(seed)
        rstate.shuffle(arr)
And it can be used like this:
a = np.array([1, 2, 3, 4, 5])
b = np.array([10,20,30,40,50])
c = np.array([[1,10,11], [2,20,22], [3,30,33], [4,40,44], [5,50,55]])
shuffle_arrays([a, b, c])
A few things to note:
- The assert ensures that all input arrays have the same length along their first dimension.
- Arrays are shuffled in-place by their first dimension - nothing is returned.
- Random seed is within the positive int32 range.
- If a repeatable shuffle is needed, the seed value can be set.
After the shuffle, the data can be split using np.split or referenced using slices - depending on the application.
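For instance, a sketch of splitting one of the shuffled arrays back into chunks (the split point is arbitrary):
import numpy as np

c = np.array([[1, 10, 11], [2, 20, 22], [3, 30, 33], [4, 40, 44], [5, 50, 55]])
np.random.shuffle(c)
train, test = np.split(c, [4])   # first 4 rows / remaining 1 row
print(train.shape, test.shape)   # (4, 3) (1, 3)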
You can make an array like:
s = np.arange(0, len(x_data), 1)
then shuffle it:
np.random.shuffle(s)
Now use this s as the index for your arrays. The same shuffled index returns correspondingly shuffled vectors:
x_data = x_data[s]
x_label = x_label[s]
There is a well-known function that can handle this:
from sklearn.model_selection import train_test_split
X, _, Y, _ = train_test_split(X,Y, test_size=0.0)
Just setting test_size to 0 will avoid splitting and give you shuffled data.
Though it is usually used to split train and test data, it does shuffle them too.
From documentation
Split arrays or matrices into random train and test subsets
Quick utility that wraps input validation and
next(ShuffleSplit().split(X, y)) and application to input data into a
single call for splitting (and optionally subsampling) data in a
oneliner.
This seems like a very simple solution:
import numpy as np
def shuffle_in_unison(a, b):
    assert len(a) == len(b)
    c = np.arange(len(a))
    np.random.shuffle(c)
    return a[c], b[c]
a = np.asarray([[1, 1], [2, 2], [3, 3]])
b = np.asarray([11, 22, 33])
shuffle_in_unison(a,b)
Out[94]:
(array([[3, 3],
[2, 2],
[1, 1]]),
array([33, 22, 11]))
One way in which in-place shuffling can be done for connected lists is using a seed (it could be random) and using numpy.random.shuffle to do the shuffling.
# Set seed to a random number if you want the shuffling to be non-deterministic.
def shuffle(a, b, seed):
    np.random.seed(seed)
    np.random.shuffle(a)
    np.random.seed(seed)
    np.random.shuffle(b)
That's it. This will shuffle both a and b in the exact same way. This is also done in-place which is always a plus.
EDIT: don't use np.random.seed(); use np.random.RandomState instead:
def shuffle(a, b, seed):
    rand_state = np.random.RandomState(seed)
    rand_state.shuffle(a)
    rand_state.seed(seed)
    rand_state.shuffle(b)
When calling it just pass in any seed to feed the random state:
a = [1,2,3,4]
b = [11, 22, 33, 44]
shuffle(a, b, 12345)
Output:
>>> a
[1, 4, 2, 3]
>>> b
[11, 44, 22, 33]
Edit: Fixed code to re-seed the random state
Say we have two arrays: a and b.
a = np.array([[1,2,3],[4,5,6],[7,8,9]])
b = np.array([[9,1,1],[6,6,6],[4,2,0]])
We can first obtain row indices by permuting the first dimension:
indices = np.random.permutation(a.shape[0])
[1 2 0]
Then use advanced indexing.
Here we are using the same indices to shuffle both arrays in unison.
a_shuffled = a[indices[:,np.newaxis], np.arange(a.shape[1])]
b_shuffled = b[indices[:,np.newaxis], np.arange(b.shape[1])]
This is equivalent to
np.take(a, indices, axis=0)
[[4 5 6]
[7 8 9]
[1 2 3]]
np.take(b, indices, axis=0)
[[6 6 6]
[4 2 0]
[9 1 1]]
If you want to avoid copying arrays, then I would suggest that instead of generating a permutation list, you go through every element in the array, and randomly swap it to another position in the array:
for old_index in range(len(a)):
    new_index = numpy.random.randint(old_index + 1)
    a[old_index], a[new_index] = a[new_index], a[old_index]
    b[old_index], b[new_index] = b[new_index], b[old_index]
This implements the Knuth-Fisher-Yates shuffle algorithm.
Shortest and easiest way in my opinion, use seed:
import random

random.seed(seed)  # seed: any fixed value
random.shuffle(x_data)
# reset the same seed to get the identical random sequence and shuffle the y
random.seed(seed)
random.shuffle(y_data)
Most solutions above work; however, if you have column vectors you have to transpose them first. Here is an example:
def shuffle(self) -> None:
    """
    Shuffles X and Y
    """
    x = self.X.T
    y = self.Y.T
    p = np.random.permutation(len(x))
    self.X = x[p].T
    self.Y = y[p].T
With an example, this is what I'm doing:
from random import shuffle
import numpy as np

combo = []
for i in range(60000):
    combo.append((images[i], labels[i]))
shuffle(combo)
im = []
lab = []
for c in combo:
    im.append(c[0])
    lab.append(c[1])
images = np.asarray(im)
labels = np.asarray(lab)
I extended Python's random.shuffle() to take a second arg:
import random

def shuffle_together(x, y):
    assert len(x) == len(y)
    for i in reversed(range(1, len(x))):
        # pick an element in x[:i+1] with which to exchange x[i]
        j = int(random.random() * (i + 1))
        x[i], x[j] = x[j], x[i]
        y[i], y[j] = y[j], y[i]
That way I can be sure that the shuffling happens in-place, and the function is not all too long or complicated.
Just use numpy...
First merge the two input arrays (the 1D array is the labels (y) and the 2D array is the data (x)), shuffle them with NumPy's shuffle method, then split them again and return:
import numpy as np

def shuffle_2d(a, b):
    rows = a.shape[0]
    if b.shape != (rows, 1):
        b = b.reshape((rows, 1))
    S = np.hstack((b, a))
    np.random.shuffle(S)
    b, a = S[:, 0], S[:, 1:]
    return a, b

features, samples = 2, 5
x, y = np.random.random((samples, features)), np.arange(samples)
x, y = shuffle_2d(x, y)
