I'm trying to reshape an array from its original shape, to make the elements of each row descend along a diagonal:
import numpy as np
np.random.seed(0)
my_array = np.random.randint(1, 50, size=(5, 3))
array([[45, 48, 1],
[ 4, 4, 40],
[10, 20, 22],
[37, 24, 7],
[25, 25, 13]])
I would like the result to look like this:
my_array_2 = np.array([[45, 0, 0],
[ 4, 48, 0],
[10, 4, 1],
[37, 20, 40],
[25, 24, 22],
[ 0, 25, 7],
[ 0, 0, 13]])
This is the closest solution I've been able to get:
my_diag = []
for i in range(len(my_array)):
    my_diag_ = np.diag(my_array[i], k=0)
    my_diag.append(my_diag_)
my_array1 = np.vstack(my_diag)
array([[45, 0, 0],
[ 0, 48, 0],
[ 0, 0, 1],
[ 4, 0, 0],
[ 0, 4, 0],
[ 0, 0, 40],
[10, 0, 0],
[ 0, 20, 0],
[ 0, 0, 22],
[37, 0, 0],
[ 0, 24, 0],
[ 0, 0, 7],
[25, 0, 0],
[ 0, 25, 0],
[ 0, 0, 13]])
From here I think it might be possible to remove all zero diagonals, but I'm not sure how to do that.
One way using numpy.pad:
n = my_array.shape[1] - 1
np.dstack([np.pad(a, (i, n-i), "constant")
           for i, a in enumerate(my_array.T)])
Output:
array([[[45, 0, 0],
[ 4, 48, 0],
[10, 4, 1],
[37, 20, 40],
[25, 24, 22],
[ 0, 25, 7],
[ 0, 0, 13]]])
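Note that np.dstack promotes each 1-D column to shape (1, N, 1), so the result above carries an extra leading axis of length 1. A minimal sketch of the same padding idea that returns a plain 2-D array, assuming np.column_stack is an acceptable substitute for np.dstack (the input values are hard-coded from the question, and the name shifted is only illustrative):

```python
import numpy as np

my_array = np.array([[45, 48,  1],
                     [ 4,  4, 40],
                     [10, 20, 22],
                     [37, 24,  7],
                     [25, 25, 13]])

n = my_array.shape[1] - 1
# Pad column i with i zeros on top and n - i zeros below, then place the
# padded 1-D columns side by side as the columns of a 2-D array.
shifted = np.column_stack([np.pad(col, (i, n - i), "constant")
                           for i, col in enumerate(my_array.T)])
print(shifted.shape)  # (7, 3)
print(shifted)
```

Indexing the np.dstack result with [0] should give the same 2-D array.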
In [134]: arr = np.array([[45, 48, 1],
...: [ 4, 4, 40],
...: [10, 20, 22],
...: [37, 24, 7],
...: [25, 25, 13]])
In [135]: res= np.zeros((arr.shape[0]+arr.shape[1]-1, arr.shape[1]), arr.dtype)
Taking a hint from how np.diag indexes a diagonal, iterate on the rows of arr:
In [136]: for i in range(arr.shape[0]):
...:     n = i*arr.shape[1]
...:     m = arr.shape[1]
...:     res.flat[n:n+m**2:m+1] = arr[i,:]
...:
In [137]: res
Out[137]:
array([[45, 0, 0],
[ 4, 48, 0],
[10, 4, 1],
[37, 20, 40],
[25, 24, 22],
[ 0, 25, 7],
[ 0, 0, 13]])
There's probably a shift capability in numpy, but I'm not familiar with it, so here's a solution using pandas. Concatenate np.zeros onto the original array, with the number of zero rows equal to ncols - 1. Then iterate over the columns and shift each one down by its column number.
import numpy as np
import pandas as pd
np.random.seed(0)
my_array = np.random.randint(1,50, size=(5,3))
df = pd.DataFrame(np.concatenate((my_array, np.zeros((my_array.shape[1]-1,
                                                      my_array.shape[1])))))
for col in df.columns:
    df[col] = df[col].shift(int(col))
df.fillna(0).values
Output
array([[45., 0., 0.],
[ 4., 48., 0.],
[10., 4., 1.],
[37., 20., 40.],
[25., 24., 22.],
[ 0., 25., 7.],
[ 0., 0., 13.]])
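NumPy does offer a cyclic shift in np.roll. A rough sketch of the same idea without pandas, assuming the array is first padded with ncols - 1 rows of zeros so that only zeros wrap around to the top (the variable names are illustrative; unlike the pandas version, this keeps the integer dtype):

```python
import numpy as np

my_array = np.array([[45, 48,  1],
                     [ 4,  4, 40],
                     [10, 20, 22],
                     [37, 24,  7],
                     [25, 25, 13]])

n_cols = my_array.shape[1]
# Append n_cols - 1 rows of zeros so each column has room to move down.
padded = np.vstack([my_array,
                    np.zeros((n_cols - 1, n_cols), dtype=my_array.dtype)])
for j in range(n_cols):
    # np.roll is cyclic: the j trailing zeros of column j wrap to the top.
    padded[:, j] = np.roll(padded[:, j], j)
print(padded)
```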
You can create a fancy index for the output using simple broadcasting and padding. First pad the end of your data:
a = np.concatenate((a, np.zeros((a.shape[1] - 1, a.shape[1]), a.dtype)), axis=0)
Now make an index that gets the elements using their negative index. This will make it trivial to roll around the end:
cols = np.arange(a.shape[1])
rows = np.arange(a.shape[0]).reshape(-1, 1) - cols
Now just simply index:
result = a[rows, cols]
For large arrays, this may not be as efficient as running a small loop. At the same time, this avoids actual looping, and allows you to write a one-liner (but please don't):
result = np.concatenate((a, np.zeros((a.shape[1] - 1, a.shape[1]), a.dtype)), axis=0)[np.arange(a.shape[0] + a.shape[1] - 1).reshape(-1, 1) - np.arange(a.shape[1]), np.arange(a.shape[1])]
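For reference, the padding and indexing steps can be collected into one small function. This is only a sketch: the name shift_columns_down is made up for illustration, and the sample data is copied from the question.

```python
import numpy as np

def shift_columns_down(a):
    """Shift column j of a 2-D array down by j rows, filling with zeros."""
    n_rows, n_cols = a.shape
    # Zero rows appended at the end are what the negative indices land on.
    padded = np.concatenate(
        (a, np.zeros((n_cols - 1, n_cols), a.dtype)), axis=0)
    cols = np.arange(n_cols)
    # Row r of the output reads row r - c of column c; negative values
    # index from the end of padded, i.e. into the zero rows.
    rows = np.arange(n_rows + n_cols - 1).reshape(-1, 1) - cols
    return padded[rows, cols]

arr = np.array([[45, 48,  1],
                [ 4,  4, 40],
                [10, 20, 22],
                [37, 24,  7],
                [25, 25, 13]])
result = shift_columns_down(arr)
print(result)
```

Keeping the padded array in a separate local variable avoids overwriting the caller's a, which the step-by-step version above does.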
In a 3d numpy array (with p panels, each with r rows and c columns) I'd like to sort only on columns of a specific panel, such that the corresponding elements on the other panels rearrange themselves accordingly.
Unfortunately, I am not familiar with the jargon of different types of sorting. I'll clarify what I need through an example.
Take A as a 2*3*4 array:
A = array([[[ 9, 20, 30, 11],
[ 100, 4, -1, 90],
[ 40, 15, -5, 34]],
[[ 0, 2, 3, 9],
[ -1, 12, 6, -3],
[ 1, -5, 7, 2]]])
After sort on the columns of the second panel:
A = array([[[ 100, 15, 30, 90],
[ 9, 20, -1, 34],
[ 40, 4, -5, 11]],
[[ -1, -5, 3, -3],
[ 0, 2, 6, 2],
[ 1, 12, 7, 9]]])
As you can see, only the columns of the second panel are sorted (in ascending order), and the elements in the first panel are rearranged (but not sorted!) along with their corresponding elements in the second panel.
import numpy as np
A = np.array([[[ 9, 20, 30, 11],
[ 100, 4, -1, 90],
[ 40, 15, -5, 34]],
[[ 0, 2, 3, 9],
[ -1, 12, 6, -3],
[ 1, -5, 7, 2]]])
I, J, K = np.ogrid[tuple(map(slice, A.shape))]
# I, J, K are the identity indices in the sense that (A == A[I, J, K]).all()
newJ = np.argsort(A[1], axis=0) # first axis of A[1] is second axis of A
print(A[I, newJ, K])
yields
[[[100 15 30 90]
[ 9 20 -1 34]
[ 40 4 -5 11]]
[[ -1 -5 3 -3]
[ 0 2 6 2]
[ 1 12 7 9]]]
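On NumPy 1.15 or later, np.take_along_axis may be a shorter way to apply the same argsort result. A minimal sketch, assuming that function is available; the names order and result are illustrative:

```python
import numpy as np

A = np.array([[[  9,  20,  30,  11],
               [100,   4,  -1,  90],
               [ 40,  15,  -5,  34]],
              [[  0,   2,   3,   9],
               [ -1,  12,   6,  -3],
               [  1,  -5,   7,   2]]])

# Row order that sorts each column of the second panel, shape (3, 4).
order = np.argsort(A[1], axis=0)
# A leading axis of length 1 lets the same order broadcast over both panels.
result = np.take_along_axis(A, order[np.newaxis], axis=1)
print(result)
```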
I have a large dataset in a numpy.ndarray similar to this:
array([[ -4, 5, 9, 30, 50, 80],
[ 2, -6, 9, 34, 12, 7],
[ -4, 5, 9, 98, -21, 80],
[ 5, -9, 0, 32, 18, 0]])
I would like to remove rows that are duplicates in the 0th, 1st, 2nd and 5th columns. For the matrix above, the result would be:
-4, 5, 9, 30, 50, 80
2, -6, 9, 34, 12, 7
5, -9, 0, 32, 18, 0
numpy.unique does something very similar, but it only finds duplicates over all columns (axis). I only want to compare specific columns. How would one go about doing this with numpy? I could not find any decent numpy algorithm for it. Is there a better module?
Use np.unique on the sliced array with return_index=True and axis=0. This treats each row as one entity and returns the index of the first occurrence of each unique row. These indices can then be used to row-index the original array for the desired output.
So, with a as the input array, it would be -
a[np.unique(a[:,[0,1,2,5]],return_index=True,axis=0)[1]]
Sample run to break down the steps and hopefully make things clear -
In [29]: a
Out[29]:
array([[ -4, 5, 9, 30, 50, 80],
[ 2, -6, 9, 34, 12, 7],
[ -4, 5, 9, 98, -21, 80],
[ 5, -9, 0, 32, 18, 0]])
In [30]: a_slice = a[:,[0,1,2,5]]
In [31]: _, unq_row_indices = np.unique(a_slice,return_index=True,axis=0)
In [32]: final_output = a[unq_row_indices]
In [33]: final_output
Out[33]:
array([[-4, 5, 9, 30, 50, 80],
[ 2, -6, 9, 34, 12, 7],
[ 5, -9, 0, 32, 18, 0]])
Pandas has functionality for this via pd.DataFrame.drop_duplicates. However, the convenient syntax comes at the cost of performance.
import pandas as pd
import numpy as np
A = np.array([[ -4, 5, 9, 30, 50, 80],
[ 2, -6, 9, 34, 12, 7],
[ -4, 5, 9, 98, -21, 80],
[ 5, -9, 0, 32, 18, 0]])
res = pd.DataFrame(A)\
.drop_duplicates(subset=[0, 1, 2, 5])\
.values
print(res)
array([[-4, 5, 9, 30, 50, 80],
[ 2, -6, 9, 34, 12, 7],
[ 5, -9, 0, 32, 18, 0]])
You can use the np.take method (https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.take.html) to get only the columns of the array that you care about, and then use the unique method with return_index=True.
>>> arr = np.array([[ -4, 5, 9, 30, 50, 80],
... [ 2, -6, 9, 34, 12, 7],
... [ -4, 5, 9, 98, -21, 80],
... [ 5, -9, 0, 32, 18, 0]])
>>> relevant_columns = np.take(arr, [0,1,2,5], axis=1)
>>> np.unique(relevant_columns, axis=0, return_index=True)
(array([[ 2, -6, 9, 7],
[ 5, -9, 0, 0],
[-4, 5, 9, 80]]), array([1, 3, 0]))
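You can then use np.take() again with your original numpy array. Pass the returned index array (array([1, 3, 0]) in the run above) as the indices, together with axis=0 so that whole rows are selected; without axis, np.take indexes into the flattened array.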
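A short sketch of that final step, assuming the surviving rows should stay in their original order (hence the np.sort on the indices, which the answer itself does not do); first_idx and deduped are illustrative names:

```python
import numpy as np

arr = np.array([[ -4,   5,   9,  30,  50,  80],
                [  2,  -6,   9,  34,  12,   7],
                [ -4,   5,   9,  98, -21,  80],
                [  5,  -9,   0,  32,  18,   0]])

relevant_columns = np.take(arr, [0, 1, 2, 5], axis=1)
# return_index gives the position of the first occurrence of each unique row.
_, first_idx = np.unique(relevant_columns, axis=0, return_index=True)
# axis=0 selects whole rows; sorting the indices keeps the original order.
deduped = np.take(arr, np.sort(first_idx), axis=0)
print(deduped)
```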
I'd like to know how to make a simple data cube (matrix) with three 1D arrays or if there's a simpler way. I want to be able to call a specific value at the end from the cube such as cube[0,2,6].
x = arange(10)
y = arange(10,20,1)
z = arange(20,30,1)
cube = meshgrid(x,y,z)
But this doesn't give the desired result: it returns multiple arrays, so I can't easily look up a specific number. I'd like to be able to use this later on for large data sets that would be laborious to build by hand. Thanks
meshgrid, as its name suggests, creates an orthogonal mesh. If you call it with 3 arguments, it will be a 3d mesh. The mesh is a 3d arrangement of points, but each point has 3 coordinates. Therefore meshgrid returns 3 arrays, one for each coordinate.
The standard way of getting one 3d array out of that is to apply a vectorised function with three arguments. Here is a simple example:
>>> x = arange(7)
>>> y = arange(0,30,10)
>>> z = arange(0,200,100)
>>> ym, zm, xm = meshgrid(y, z, x)
>>> xm
array([[[0, 1, 2, 3, 4, 5, 6],
[0, 1, 2, 3, 4, 5, 6],
[0, 1, 2, 3, 4, 5, 6]],
[[0, 1, 2, 3, 4, 5, 6],
[0, 1, 2, 3, 4, 5, 6],
[0, 1, 2, 3, 4, 5, 6]]])
>>> ym
array([[[ 0, 0, 0, 0, 0, 0, 0],
[10, 10, 10, 10, 10, 10, 10],
[20, 20, 20, 20, 20, 20, 20]],
[[ 0, 0, 0, 0, 0, 0, 0],
[10, 10, 10, 10, 10, 10, 10],
[20, 20, 20, 20, 20, 20, 20]]])
>>> zm
array([[[ 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0]],
[[100, 100, 100, 100, 100, 100, 100],
[100, 100, 100, 100, 100, 100, 100],
[100, 100, 100, 100, 100, 100, 100]]])
>>> cube = xm + ym + zm
>>> cube
array([[[ 0, 1, 2, 3, 4, 5, 6],
[ 10, 11, 12, 13, 14, 15, 16],
[ 20, 21, 22, 23, 24, 25, 26]],
[[100, 101, 102, 103, 104, 105, 106],
[110, 111, 112, 113, 114, 115, 116],
[120, 121, 122, 123, 124, 125, 126]]])
>>> cube[0, 2, 6]
26
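When the combining function is simple arithmetic like the sum above, broadcasting can build the same cube without materialising the three full coordinate arrays. A sketch under that assumption, with a check against meshgrid called with indexing="ij" (which orders the output axes like the inputs); the name cube_ij is illustrative:

```python
import numpy as np

x = np.arange(7)
y = np.arange(0, 30, 10)
z = np.arange(0, 200, 100)

# Give each 1-D array its own axis; broadcasting expands them to (2, 3, 7).
cube = z[:, None, None] + y[None, :, None] + x[None, None, :]
print(cube.shape)     # (2, 3, 7)
print(cube[0, 2, 6])  # 26

# Same cube via meshgrid with matrix-style ("ij") axis ordering.
zm, ym, xm = np.meshgrid(z, y, x, indexing="ij")
cube_ij = xm + ym + zm
```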
Please help me find a clean way to create a new array out of an existing one. A class should be over-sampled if it has fewer examples than the largest class. Samples should be taken from the original array (it makes no difference whether randomly or sequentially).
let's say, initial array is this:
[ 2, 29, 30, 1]
[ 5, 50, 46, 0]
[ 1, 7, 89, 1]
[ 0, 10, 92, 9]
[ 4, 11, 8, 1]
[ 3, 92, 1, 0]
the last column contains classes:
classes = [ 0, 1, 9]
the distribution of the classes is the following:
distrib = [2, 3, 1]
What I need is to create a new array with an equal number of samples of all classes, taken randomly from the original array, e.g.
[ 5, 50, 46, 0]
[ 3, 92, 1, 0]
[ 5, 50, 46, 0] # one example added
[ 2, 29, 30, 1]
[ 1, 7, 89, 1]
[ 4, 11, 8, 1]
[ 0, 10, 92, 9]
[ 0, 10, 92, 9] # two examples
[ 0, 10, 92, 9] # added
The following code does what you are after:
import numpy as np

a = np.array([[ 2, 29, 30, 1],
              [ 5, 50, 46, 0],
              [ 1, 7, 89, 1],
              [ 0, 10, 92, 9],
              [ 4, 11, 8, 1],
              [ 3, 92, 1, 0]])
# Map every row to the index of its class in unq, then count rows per class
unq, unq_idx = np.unique(a[:, -1], return_inverse=True)
unq_cnt = np.bincount(unq_idx)
cnt = np.max(unq_cnt)
out = np.empty((cnt*len(unq),) + a.shape[1:], a.dtype)
for j in range(len(unq)):  # use xrange on Python 2
    # Draw cnt rows (with replacement) from the rows belonging to class j
    indices = np.random.choice(np.where(unq_idx==j)[0], cnt)
    out[j*cnt:(j+1)*cnt] = a[indices]
>>> out
array([[ 5, 50, 46, 0],
[ 5, 50, 46, 0],
[ 5, 50, 46, 0],
[ 1, 7, 89, 1],
[ 4, 11, 8, 1],
[ 2, 29, 30, 1],
[ 0, 10, 92, 9],
[ 0, 10, 92, 9],
[ 0, 10, 92, 9]])
With numpy 1.9 or later, the first two lines can be condensed into:
unq, unq_idx, unq_cnt = np.unique(a[:, -1], return_inverse=True,
return_counts=True)
Note that, the way np.random.choice works, there is no guarantee that all rows of the original array will be present in the output one, as the example above shows. If that is needed, you could do something like:
unq, unq_idx = np.unique(a[:, -1], return_inverse=True)
unq_cnt = np.bincount(unq_idx)
cnt = np.max(unq_cnt)
# Only the missing rows are drawn; the original rows are stacked on at the end
out = np.empty((cnt*len(unq) - len(a),) + a.shape[1:], a.dtype)
slices = np.concatenate(([0], np.cumsum(cnt - unq_cnt)))
for j in range(len(unq)):  # use xrange on Python 2
    indices = np.random.choice(np.where(unq_idx==j)[0], cnt - unq_cnt[j])
    out[slices[j]:slices[j+1]] = a[indices]
out = np.vstack((a, out))
>>> out
array([[ 2, 29, 30, 1],
[ 5, 50, 46, 0],
[ 1, 7, 89, 1],
[ 0, 10, 92, 9],
[ 4, 11, 8, 1],
[ 3, 92, 1, 0],
[ 5, 50, 46, 0],
[ 0, 10, 92, 9],
[ 0, 10, 92, 9]])
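The second variant can also be wrapped in a function and driven by a seeded generator so that runs are repeatable. This is a sketch only: the function name oversample is hypothetical, and it assumes NumPy 1.17 or later for np.random.default_rng.

```python
import numpy as np

def oversample(a, rng):
    """Keep every row of a and add random duplicates until classes balance."""
    labels = a[:, -1]
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    parts = [a]
    for cls, count in zip(classes, counts):
        if count < target:
            # Row positions of this class; sample the shortfall with replacement.
            candidates = np.flatnonzero(labels == cls)
            parts.append(a[rng.choice(candidates, size=target - count)])
    return np.vstack(parts)

a = np.array([[ 2, 29, 30, 1],
              [ 5, 50, 46, 0],
              [ 1,  7, 89, 1],
              [ 0, 10, 92, 9],
              [ 4, 11,  8, 1],
              [ 3, 92,  1, 0]])
balanced = oversample(a, np.random.default_rng(0))
# Every class now appears as often as the largest class (3 times here).
print(np.unique(balanced[:, -1], return_counts=True))
```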
This gives a random distribution with equal probability for each class:
distrib = np.bincount(a[:,-1])
prob = 1/distrib[a[:, -1]].astype(float)
prob /= prob.sum()
In [38]: a[np.random.choice(np.arange(len(a)), size=np.count_nonzero(distrib)*distrib.max(), p=prob)]
Out[38]:
array([[ 5, 50, 46, 0],
[ 4, 11, 8, 1],
[ 0, 10, 92, 9],
[ 0, 10, 92, 9],
[ 2, 29, 30, 1],
[ 0, 10, 92, 9],
[ 3, 92, 1, 0],
[ 1, 7, 89, 1],
[ 1, 7, 89, 1]])
Each class has equal probability, not guaranteed equal incidence.
You can use the imbalanced-learn package:
import numpy as np
from imblearn.over_sampling import RandomOverSampler
data = np.array([
[ 2, 29, 30, 1],
[ 5, 50, 46, 0],
[ 1, 7, 89, 1],
[ 0, 10, 92, 9],
[ 4, 11, 8, 1],
[ 3, 92, 1, 0]
])
ros = RandomOverSampler()
# fit_resample expects two arguments: a matrix of sample data and a vector of
# sample labels. In this case, the sample data is in the first three columns of
# our array and the labels are in the last column
X_resampled, y_resampled = ros.fit_resample(data[:, :-1], data[:, -1])
# fit_resample returns a matrix of resampled data and a vector with the
# corresponding labels. Combine them into a single matrix
resampled = np.column_stack((X_resampled, y_resampled))
print(resampled)
Output:
[[ 2 29 30 1]
[ 5 50 46 0]
[ 1 7 89 1]
[ 0 10 92 9]
[ 4 11 8 1]
[ 3 92 1 0]
[ 3 92 1 0]
[ 0 10 92 9]
[ 0 10 92 9]]
The RandomOverSampler offers different sampling strategies, but the default is to resample all classes except the majority class.