I've come across some code where the use of numpy.ravel() is resulting in a 2D array - I've had a look at the documentation, which says that ravel() returns a 1D array (see https://numpy.org/doc/stable/reference/generated/numpy.ravel.html).
Here's a code snippet that shows this:
def jumbo():
    import numpy as np
    my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    matrix = np.zeros((3,3))
    matrix.ravel()[:] = my_list
    return matrix

new_matrix = jumbo()
print(f"new matrix is:\n{new_matrix}")
new_matrix = jumbo()
print(f"new matrix is:\n{new_matrix}")
I suppose part of what I'm asking is what is the function of the range specifier [:] here?
What you did is assign values through the "raveled" view of the matrix, without actually saving the result of the ravel operation.
matrix = np.empty((3,3)) # created empty matrix
matrix.ravel() # created 1d view of matrix
matrix.ravel()[:] # indexing every element of matrix to make assignment possible (matrix is still in (3,3) shape)
matrix.ravel()[:] = my_list # assigned values.
If you want the return value to be 1D, then return the raveled array, like this:
def jumbo():
    import numpy as np
    my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    matrix = np.empty((3,3))
    matrix.ravel()[:] = my_list
    return matrix.ravel()

new_matrix = jumbo()
print(f"new matrix is:\n{new_matrix}")
ravel returns a view of the numpy array, thus when you do:
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]
matrix = np.zeros((3,3))
matrix.ravel()[:] = my_list
You are using the view as a way to index the matrix as 1D, temporarily.
This enables here to set the values from a 1D list, but the underlying array remains 2D.
The matrix.ravel()[:] is used to enable setting the data. You could also use 2 steps:
view = matrix.ravel()
view[:] = my_list
output:
array([[1., 2., 3.],
[4., 5., 6.],
[7., 8., 9.]])
Important note on the views
As @Stef nicely pointed out in the comments, this "trick" will only work for C-contiguous arrays, meaning you couldn't use ravel('F'):
demonstration:
view1 = matrix.ravel()
view2 = matrix.ravel('F') # actually not a view!
id(matrix)
# 140406991308816
id(view1.base)
# 140406991308816
id(view2.base)
# 9497104 # different id, we have a copy!
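A more direct way to check the view-versus-copy distinction than comparing base ids is np.shares_memory. A small sketch, restating the point above:

```python
import numpy as np

matrix = np.zeros((3, 3))

# 'C'-order ravel of a C-contiguous array needs no reordering, so it is a view
view = matrix.ravel()
print(np.shares_memory(matrix, view))   # True

# 'F'-order ravel must reorder the elements, so it returns a copy
copy = matrix.ravel('F')
print(np.shares_memory(matrix, copy))   # False

# Assignment through the view changes the original matrix
view[:] = np.arange(9)
print(matrix[2, 2])  # 8.0
```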
rows is a 343x30 matrix of real numbers. I'm trying to append row vectors from rows to true_rows and false_rows, but it only adds the first row and doesn't do anything afterwards. I've tried vstack and also tried wrapping example as a 2D array ([example]), but it crashed my PyCharm. What can I do?
true_rows = []
true_labels = []
false_rows = []
false_labels = []
i = 0
for example in rows:
    if question.match(example):
        true_rows = np.append(true_rows, example, axis=0)
        true_labels.append(labels[i])
    else:
        #false_rows = np.vstack(false_rows, example_t)
        false_rows = np.append(false_rows, example, axis=0)
        false_labels.append(labels[i])
    i += 1
You can use a simple list to append your rows and then transform this list into a numpy array, such as:
exemple1 = np.array([1,2,3,4,5])
exemple2 = np.array([6,7,8,9,10])
exemple3 = np.array([11,12,13,14,15])
true_rows = []
true_rows.append(exemple1)
true_rows.append(exemple2)
true_rows.append(exemple3)
true_rows = np.array(true_rows)
you will get this result:
true_rows = array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15]])
you can also use np.concatenate if you want to get a one-dimensional array, like this:
true_rows = np.concatenate(true_rows , axis =0)
you will get this result:
true_rows = array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
Your use of [] and np.append suggests you are trying to imitate a common list-append model with arrays. You at least read enough of the np.append docs to know you need to use axis, and that it returns a new array (the docs are quite clear this is a copy).
But did you test this idea with a small example, and actually look at the results (step by step)?
In [326]: rows = []
In [327]: rows = np.append(rows, np.arange(3), axis=0)
In [328]: rows
Out[328]: array([0., 1., 2.])
In [329]: rows.shape
Out[329]: (3,)
the first append doesn't do anything - the result is the same as arange(3).
In [330]: rows = np.append(rows, np.arange(3), axis=0)
In [331]: rows
Out[331]: array([0., 1., 2., 0., 1., 2.])
In [332]: rows.shape
Out[332]: (6,)
Do you understand why? We join 2 1d arrays on axis 0, making a 1d.
Using [] as a starting point is the same starting with this array:
In [333]: np.array([])
Out[333]: array([], dtype=float64)
In [334]: np.array([]).shape
Out[334]: (0,)
And with axis, np.append is just a call to concatenate:
In [335]: np.concatenate(( [], np.arange(3)), axis=0)
Out[335]: array([0., 1., 2.])
np.append sort of looks like list append, but it is not a clone. It's really just a poorly named way to use concatenate. And you can't use it properly without actually understanding dimensions. np.append has an example with an error much like what you got with concatenate.
Repeated use of these array concatenates in a loop is not a good idea. It's hard to get the dimensions right, as you found. And even when it works, it is slow, since each step makes a copy (which grows with the iteration).
That's why the other answer sticks with list append.
vstack is like concatenate with axis 0, but it makes sure all arguments are 2d. But if the number of columns differs, it raises an error:
In [336]: np.vstack(( [],np.arange(3)))
Traceback (most recent call last):
File "<ipython-input-336-22038d6ef0f7>", line 1, in <module>
np.vstack(( [],np.arange(3)))
File "<__array_function__ internals>", line 180, in vstack
File "/usr/local/lib/python3.8/dist-packages/numpy/core/shape_base.py", line 282, in vstack
return _nx.concatenate(arrs, 0)
File "<__array_function__ internals>", line 180, in concatenate
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 0 and the array at index 1 has size 3
In [337]: np.vstack(( [0,0,0],np.arange(3)))
Out[337]:
array([[0, 0, 0],
[0, 1, 2]])
If all you are joining are rows of a (n,30) array, then you do know the column size of the result.
In [338]: res = np.zeros((0,3))
In [339]: np.vstack(( res, np.arange(3)))
Out[339]: array([[0., 1., 2.]])
If you pay attention to the shape details, it is possible to create an array iteratively.
But instead of collecting rows one by one, why not create a mask and do the collection once.
Roughly do
mask = np.array([question.match(example) for example in rows])
true_rows = rows[mask]
false_rows = rows[~mask]
this still requires an iteration, but overall should be faster.
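A runnable sketch of the mask approach, with a hypothetical predicate standing in for question.match (which isn't shown in the question):

```python
import numpy as np

rows = np.arange(12).reshape(4, 3)   # small stand-in for the (343, 30) matrix
labels = np.array([0, 1, 0, 1])

# hypothetical predicate, standing in for question.match(example)
def match(example):
    return example.sum() > 10

# build the boolean mask in one pass, then split both arrays at once
mask = np.array([match(example) for example in rows])

true_rows, false_rows = rows[mask], rows[~mask]
true_labels, false_labels = labels[mask], labels[~mask]

print(true_rows.shape, false_rows.shape)  # (3, 3) (1, 3)
```

Because the boolean mask indexes rows and labels with the same pattern, the row/label pairing is preserved without any per-row appends.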
Construct a 2D, 3x3 matrix with random numbers from 1 to 8 with no duplicates
import numpy as np
random_matrix = np.random.randint(0,10,size=(3,3))
print(random_matrix)
If you want an answer where we don't have to rely on numpy then you can do this:
import random
# Generates a randomized list between 0-9, where 0 is replaced by "#"
x = ["#" if i == 0 else i for i in random.sample(range(10), k=9)]
print(x)
# Slices the list into a 3x3 format
newx = [x[idx:idx+3] for idx in range(0, len(x), 3)]
print(newx)
Output:
[6, 2, 7, 4, '#', 8, 9, 1, 3]
[[6, 2, 7], [4, '#', 8], [9, 1, 3]]
import numpy
x = numpy.arange(0, 9)
numpy.random.shuffle(x)
x = numpy.reshape(x, (3,3))
print(numpy.where(x==0, '#', x))
Let me know, but with my solution, the integers seem to be replaced by strings; I don't know if you care. Otherwise, I will find another solution.
You can achieve your goal using a few steps:
Generate a sequence of values (in some range) you would like to randomly select into the matrix.
Randomly take some number of elements from this sequence into a new sequence.
From this new sequence, make a matrix with the wanted shape.
import numpy as np
from random import sample
#step one
values = range(0,11)
#step two
random_sequence = sample(values, 9)
#step three
random_matrix = np.array(random_sequence).reshape(3,3)
Because you sample some number of elements from a sequence of unique values, that guarantees the uniqueness of the new sequence, and hence of the matrix.
You can use np.random.choice with replace=False to generate the (3, 3) array:
np.random.choice(np.arange(9), size=(3, 3), replace=False)
Replacing 0 with np.nan:
>>> np.where(x, x, np.nan)
array([[ 4., 1., 3.],
[ 5., nan, 8.],
[ 2., 6., 7.]])
However, I think Hampus Larsson's answer is better, as this problem is not appropriate for numpy if you intend to replace 0 with the string "#".
You could use numpy, but random is enough:
import random
numbers = list(range(9))
random.shuffle(numbers)
my_list = [[numbers[i*3 + j] for j in range(0,3)] for i in range(0,3)]
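For a NumPy-only variant, the newer Generator API can build the duplicate-free matrix in one line; a sketch (default_rng is available from NumPy 1.17 onward):

```python
import numpy as np

rng = np.random.default_rng()

# permutation(9) yields each of 0..8 exactly once, so uniqueness is guaranteed
random_matrix = rng.permutation(9).reshape(3, 3)
print(random_matrix)

# replacing 0 with '#' forces a string result, as noted in the answers above
display = np.where(random_matrix == 0, '#', random_matrix.astype(str))
print(display)
```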
As part of a larger function, I'm writing some code to generate a vector/matrix (depending on the input) containing the mean value of each column of the input vector/matrix 'x'. These values are stored in a vector/matrix of the same shape as the input vector.
My preliminary solution for it to work on both a 1-D and matrix arrays is very(!) messy:
# 'x' is of type array and can be a vector or matrix.
import scipy as sp
shp = sp.shape(x)
x_mean = sp.array(sp.zeros(sp.shape(x)))
try:  # if input is a matrix
    shp_range = range(shp[1])
    for d in shp_range:
        x_mean[:,d] = sp.mean(x[:,d])*sp.ones(sp.shape(z))
except IndexError:  # error occurs if the input is a vector
    z = sp.zeros((shp[0],))
    x_mean = sp.mean(x)*sp.ones(sp.shape(z))
Coming from a MATLAB background, this is what it would look like in MATLAB:
[R,C] = size(x);
for d = 1:C,
    xmean(:,d) = zeros(R,1) + mean(x(:,d));
end
This works on both vectors as well as matrices without errors.
My question is, how can I make my python code work on input of both vector and matrix format without the (ugly) try/except block?
Thanks!
You don't need to distinguish between vectors and matrices for the mean calculation itself - if you use the axis parameter Numpy will perform the calculation along the vector (for vectors) or columns (for matrices). And then to construct the output, you can use a good old-fashioned list comprehension, although it might be a bit slow for huge matrices:
import numpy as np
m = np.mean(x,axis=0) # For vector x, calculate the mean. For matrix x, calculate the means of the columns
x_mean = np.array([m for k in x]) # replace elements for vectors or rows for matrices
Creating the output with a list comprehension is slow because it has to allocate memory twice - once for the list and once for the array. Using np.repeat or np.tile would be faster, but acts funny for vector inputs - the output will be a nested matrix with a 1-long vector in each row. If speed matters more than elegance you can replace the last line with this if:
if len(x.shape) == 1:
    x_mean = m*np.ones(len(x))
else:
    x_mean = np.tile(m, (x.shape[0], 1))  # repeat the column means once per row
By the way, your Matlab code behaves differently for row vectors and column vectors (try running it with x and x').
First, a quick note about broadcasting in numpy. Broadcasting was kind of confusing to me when I switched over from matlab to python, but once I took the time to understand it I realized how useful it could be. To learn more about broadcasting take a look at http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html.
Because of broadcasting, an (m,) array in numpy (what you're calling a vector) is essentially equivalent to a (1, m) array or a (1, 1, m) array and so on. It seems like you want to have an (m,) array behave like an (m, 1) array. I believe this happens sometimes, especially in the linalg module, but if you're going to do it you should know that you're breaking the numpy convention.
With that warning there's the code:
import scipy as sp
def my_mean(x):
    if x.ndim == 1:
        x = x[:, sp.newaxis]
    m = sp.empty(x.shape)
    m[:] = x.mean(0)
    return sp.squeeze(m)
and an example:
In [6]: x = sp.arange(30).reshape(5,6)
In [7]: x
Out[7]:
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29]])
In [8]: my_mean(x)
Out[8]:
array([[ 12., 13., 14., 15., 16., 17.],
[ 12., 13., 14., 15., 16., 17.],
[ 12., 13., 14., 15., 16., 17.],
[ 12., 13., 14., 15., 16., 17.],
[ 12., 13., 14., 15., 16., 17.]])
In [9]: my_mean(x[0])
Out[9]: array([ 2.5, 2.5, 2.5, 2.5, 2.5, 2.5])
This is faster than using tile; the timing is below:
In [1]: import scipy as sp
In [2]: x = sp.arange(30).reshape(5,6)
In [3]: m = x.mean(0)
In [5]: timeit m_2d = sp.empty(x.shape); m_2d[:] = m
100000 loops, best of 3: 2.58 us per loop
In [6]: timeit m_2d = sp.tile(m, (len(x), 1))
100000 loops, best of 3: 13.3 us per loop
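On current NumPy the same result can be had with no explicit loop and no element copies at all, using keepdims plus np.broadcast_to (which returns a read-only view). A sketch under those assumptions:

```python
import numpy as np

def mean_like(x):
    # keepdims=True preserves a broadcastable shape for both 1-D and 2-D input
    m = x.mean(axis=0, keepdims=True)
    # broadcast_to expands to x's shape as a zero-copy, read-only view
    return np.broadcast_to(m, x.shape)

x = np.arange(30).reshape(5, 6)
print(mean_like(x)[0])   # [12. 13. 14. 15. 16. 17.]
print(mean_like(x[0]))   # [2.5 2.5 2.5 2.5 2.5 2.5]
```

If a writable result is needed, wrap the return value in np.array(...) to materialize the copy.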
I have two numpy arrays of different shapes, but with the same length (leading dimension). I want to shuffle each of them, such that corresponding elements continue to correspond -- i.e. shuffle them in unison with respect to their leading indices.
This code works, and illustrates my goals:
def shuffle_in_unison(a, b):
    assert len(a) == len(b)
    shuffled_a = numpy.empty(a.shape, dtype=a.dtype)
    shuffled_b = numpy.empty(b.shape, dtype=b.dtype)
    permutation = numpy.random.permutation(len(a))
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b
For example:
>>> a = numpy.asarray([[1, 1], [2, 2], [3, 3]])
>>> b = numpy.asarray([1, 2, 3])
>>> shuffle_in_unison(a, b)
(array([[2, 2],
[1, 1],
[3, 3]]), array([2, 1, 3]))
However, this feels clunky, inefficient, and slow, and it requires making a copy of the arrays -- I'd rather shuffle them in-place, since they'll be quite large.
Is there a better way to go about this? Faster execution and lower memory usage are my primary goals, but elegant code would be nice, too.
One other thought I had was this:
def shuffle_in_unison_scary(a, b):
    rng_state = numpy.random.get_state()
    numpy.random.shuffle(a)
    numpy.random.set_state(rng_state)
    numpy.random.shuffle(b)
This works... but it's a little scary, as I see little guarantee it'll continue to work -- it doesn't look like the sort of thing that's guaranteed to survive across numpy versions, for example.
You can use NumPy's array indexing:
def unison_shuffled_copies(a, b):
    assert len(a) == len(b)
    p = numpy.random.permutation(len(a))
    return a[p], b[p]
This will result in creation of separate unison-shuffled arrays.
X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y, random_state=0)
To learn more, see http://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html
Your "scary" solution does not appear scary to me. Calling shuffle() for two sequences of the same length results in the same number of calls to the random number generator, and these are the only "random" elements in the shuffle algorithm. By resetting the state, you ensure that the calls to the random number generator will give the same results in the second call to shuffle(), so the whole algorithm will generate the same permutation.
If you don't like this, a different solution would be to store your data in one array instead of two right from the beginning, and create two views into this single array simulating the two arrays you have now. You can use the single array for shuffling and the views for all other purposes.
Example: Let's assume the arrays a and b look like this:
a = numpy.array([[[ 0., 1., 2.],
[ 3., 4., 5.]],
[[ 6., 7., 8.],
[ 9., 10., 11.]],
[[ 12., 13., 14.],
[ 15., 16., 17.]]])
b = numpy.array([[ 0., 1.],
[ 2., 3.],
[ 4., 5.]])
We can now construct a single array containing all the data:
c = numpy.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]
# array([[ 0., 1., 2., 3., 4., 5., 0., 1.],
# [ 6., 7., 8., 9., 10., 11., 2., 3.],
# [ 12., 13., 14., 15., 16., 17., 4., 5.]])
Now we create views simulating the original a and b:
a2 = c[:, :a.size//len(a)].reshape(a.shape)
b2 = c[:, a.size//len(a):].reshape(b.shape)
The data of a2 and b2 is shared with c. To shuffle both arrays simultaneously, use numpy.random.shuffle(c).
In production code, you would of course try to avoid creating the original a and b at all and right away create c, a2 and b2.
This solution could be adapted to the case that a and b have different dtypes.
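A compact sketch of the combined-array idea above, checking that the views stay paired after shuffling c (the per-row relation b[i, 0] == a[i, 0, 0] / 3 holds for these particular arange-based arrays and must survive the shuffle):

```python
import numpy as np

a = np.arange(18, dtype=float).reshape(3, 2, 3)
b = np.arange(6, dtype=float).reshape(3, 2)

# one array holding all the data, one flattened row per leading index
c = np.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]
a2 = c[:, :a.size // len(a)].reshape(a.shape)
b2 = c[:, a.size // len(a):].reshape(b.shape)

np.random.shuffle(c)   # shuffles rows of c, hence a2 and b2 in unison

# each row of b2 still pairs with the matching block of a2
for i in range(len(b2)):
    assert b2[i, 0] == a2[i, 0, 0] / 3
```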
Very simple solution:
randomize = np.arange(len(x))
np.random.shuffle(randomize)
x = x[randomize]
y = y[randomize]
the two arrays x,y are now both randomly shuffled in the same way
James wrote an sklearn solution in 2015 which is helpful. But he added a random state variable, which is not needed. In the code below, the random state from numpy is automatically assumed.
X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y)
from numpy.random import permutation
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data #numpy array
y = iris.target #numpy array
# Data is currently unshuffled; we should shuffle
# each X[i] with its corresponding y[i]
perm = permutation(len(X))
X = X[perm]
y = y[perm]
Shuffle any number of arrays together, in-place, using only NumPy.
import numpy as np
def shuffle_arrays(arrays, set_seed=-1):
    """Shuffles arrays in-place, in the same order, along axis=0

    Parameters:
    -----------
    arrays : List of NumPy arrays.
    set_seed : Seed value if int >= 0, else seed is random.
    """
    assert all(len(arr) == len(arrays[0]) for arr in arrays)
    seed = np.random.randint(0, 2**(32 - 1) - 1) if set_seed < 0 else set_seed

    for arr in arrays:
        rstate = np.random.RandomState(seed)
        rstate.shuffle(arr)
And can be used like this
a = np.array([1, 2, 3, 4, 5])
b = np.array([10,20,30,40,50])
c = np.array([[1,10,11], [2,20,22], [3,30,33], [4,40,44], [5,50,55]])
shuffle_arrays([a, b, c])
A few things to note:
The assert ensures that all input arrays have the same length along their first dimension.
Arrays are shuffled in-place along their first dimension - nothing is returned.
The random seed stays within the positive int32 range.
If a repeatable shuffle is needed, the seed value can be set.
After the shuffle, the data can be split using np.split or referenced using slices - depending on the application.
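The same idea translates to the newer Generator API; a sketch, assuming one freshly created, identically seeded Generator per array:

```python
import numpy as np

def shuffle_arrays_rng(arrays, seed=None):
    """Shuffle each array in-place along axis 0 with the same permutation."""
    seed = np.random.SeedSequence().entropy if seed is None else seed
    for arr in arrays:
        # a fresh Generator with the same seed yields the same permutation
        np.random.default_rng(seed).shuffle(arr)

a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])
shuffle_arrays_rng([a, b], seed=42)
print(b == a * 10)   # still paired elementwise
```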
you can make an array like:
s = np.arange(0, len(a), 1)
then shuffle it:
np.random.shuffle(s)
Now use this s to index your arrays. The same shuffled index returns the same shuffled vectors.
x_data = x_data[s]
x_label = x_label[s]
There is a well-known function that can handle this:
from sklearn.model_selection import train_test_split
X, _, Y, _ = train_test_split(X,Y, test_size=0.0)
Just setting test_size to 0 will avoid splitting and give you shuffled data.
Though it is usually used to split train and test data, it does shuffle them too.
From the documentation:
Split arrays or matrices into random train and test subsets
Quick utility that wraps input validation and next(ShuffleSplit().split(X, y)) and application to input data into a single call for splitting (and optionally subsampling) data in a oneliner.
This seems like a very simple solution:
import numpy as np
def shuffle_in_unison(a,b):
    assert len(a)==len(b)
    c = np.arange(len(a))
    np.random.shuffle(c)
    return a[c],b[c]
a = np.asarray([[1, 1], [2, 2], [3, 3]])
b = np.asarray([11, 22, 33])
shuffle_in_unison(a,b)
Out[94]:
(array([[3, 3],
[2, 2],
[1, 1]]),
array([33, 22, 11]))
One way in which in-place shuffling can be done for connected lists is using a seed (it could be random) and using numpy.random.shuffle to do the shuffling.
# Set seed to a random number if you want the shuffling to be non-deterministic.
def shuffle(a, b, seed):
    np.random.seed(seed)
    np.random.shuffle(a)
    np.random.seed(seed)
    np.random.shuffle(b)
That's it. This will shuffle both a and b in the exact same way. This is also done in-place which is always a plus.
EDIT: don't use np.random.seed(), use np.random.RandomState instead:
def shuffle(a, b, seed):
    rand_state = np.random.RandomState(seed)
    rand_state.shuffle(a)
    rand_state.seed(seed)
    rand_state.shuffle(b)
When calling it just pass in any seed to feed the random state:
a = [1,2,3,4]
b = [11, 22, 33, 44]
shuffle(a, b, 12345)
Output:
>>> a
[1, 4, 2, 3]
>>> b
[11, 44, 22, 33]
Edit: Fixed code to re-seed the random state
Say we have two arrays: a and b.
a = np.array([[1,2,3],[4,5,6],[7,8,9]])
b = np.array([[9,1,1],[6,6,6],[4,2,0]])
We can first obtain row indices by permuting the first dimension:
indices = np.random.permutation(a.shape[0])
[1 2 0]
Then use advanced indexing.
Here we are using the same indices to shuffle both arrays in unison.
a_shuffled = a[indices[:,np.newaxis], np.arange(a.shape[1])]
b_shuffled = b[indices[:,np.newaxis], np.arange(b.shape[1])]
This is equivalent to
np.take(a, indices, axis=0)
[[4 5 6]
[7 8 9]
[1 2 3]]
np.take(b, indices, axis=0)
[[6 6 6]
[4 2 0]
[9 1 1]]
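For completeness, plain integer indexing along the first axis gives the same result as the advanced indexing above with less machinery; a short sketch using a fixed permutation:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = np.array([[9, 1, 1], [6, 6, 6], [4, 2, 0]])

indices = np.array([1, 2, 0])   # fixed permutation for the demonstration

# a[indices] reorders whole rows, matching np.take(a, indices, axis=0)
assert (a[indices] == np.take(a, indices, axis=0)).all()
assert (b[indices] == np.take(b, indices, axis=0)).all()
print(a[indices])   # [[4 5 6] [7 8 9] [1 2 3]]
```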
If you want to avoid copying arrays, then I would suggest that instead of generating a permutation list, you go through every element in the array and randomly swap it to another position in the array:
for old_index in range(len(a)):
    new_index = numpy.random.randint(old_index+1)
    a[old_index], a[new_index] = a[new_index], a[old_index]
    b[old_index], b[new_index] = b[new_index], b[old_index]
This implements the Knuth-Fisher-Yates shuffle algorithm.
Shortest and easiest way in my opinion, use seed:
random.seed(seed)
random.shuffle(x_data)
# reset the same seed to get the identical random sequence and shuffle the y
random.seed(seed)
random.shuffle(y_data)
Most solutions above work, however if you have column vectors you have to transpose them first. Here is an example:
def shuffle(self) -> None:
    """
    Shuffles X and Y
    """
    x = self.X.T
    y = self.Y.T
    p = np.random.permutation(len(x))
    self.X = x[p].T
    self.Y = y[p].T
With an example, this is what I'm doing:
combo = []
for i in range(60000):
    combo.append((images[i], labels[i]))

shuffle(combo)

im = []
lab = []
for c in combo:
    im.append(c[0])
    lab.append(c[1])
images = np.asarray(im)
labels = np.asarray(lab)
I extended python's random.shuffle() to take a second arg:
def shuffle_together(x, y):
    assert len(x) == len(y)
    for i in reversed(range(1, len(x))):  # xrange in Python 2
        # pick an element in x[:i+1] with which to exchange x[i]
        j = int(random.random() * (i+1))
        x[i], x[j] = x[j], x[i]
        y[i], y[j] = y[j], y[i]
That way I can be sure that the shuffling happens in-place, and the function is not all too long or complicated.
Just use numpy...
First merge the two input arrays (the 1D array is the labels (y) and the 2D array is the data (x)) and shuffle them with NumPy's shuffle method. Finally, split them and return.
import numpy as np
def shuffle_2d(a, b):
    rows = a.shape[0]
    if b.shape != (rows, 1):
        b = b.reshape((rows, 1))
    S = np.hstack((b, a))
    np.random.shuffle(S)
    b, a = S[:, 0], S[:, 1:]
    return a, b
features, samples = 2, 5
x, y = np.random.random((samples, features)), np.arange(samples)
x, y = shuffle_2d(x, y)