How to group values in a matrix with items of unequal length - python

Let's say I have a simple array:
a = np.arange(3)
And an array of indices with the same length:
I = np.array([0, 0, 1])
I now want to group the values based on the indices.
How would I group the elements of the first array to produce the result below?
np.array([[0, 1], [2]], dtype=object)
Here is what I tried:
a = np.arange(3)
I = np.array([0, 0, 1])
out = np.empty(2, dtype=object)
out.fill([])
aslists = np.vectorize(lambda x: [x], otypes=['object'])
out[I] += aslists(a)
However, this approach does not concatenate the lists; with repeated indices, fancy-index assignment does not accumulate (each repeated index is simply overwritten), so only the last value for each index is kept:
array([[1], [2]], dtype=object)
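This buffering behavior of fancy indexing is easy to reproduce with plain numbers (a minimal illustration):
import numpy as np

out = np.zeros(2)
out[np.array([0, 0, 1])] += 1
print(out)  # [1. 1.] -- not [2. 1.]: repeated indices don't accumulate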
Or, for a 2-dimensional case:
a = np.random.rand(100)
I = (np.random.random(100) * 5 // 1).astype(int)
J = (np.random.random(100) * 5 // 1).astype(int)
out = np.empty((5, 5), dtype=object)
out.fill([])
How can I append the items from a to out based on the two index arrays?

1D Case
Assuming I is sorted, to get a list of arrays as output -
idx = np.unique(I, return_index=True)[1]
out = np.split(a,idx)[1:]
Another option, using slicing to get the split indices for a directly -
out = np.split(a, np.flatnonzero(I[1:] != I[:-1])+1)
To get an array of lists as output -
np.array([i.tolist() for i in out])
Sample run -
In [84]: a = np.arange(3)
In [85]: I = np.array([0, 0, 1])
In [86]: out = np.split(a, np.flatnonzero(I[1:] != I[:-1])+1)
In [87]: out
Out[87]: [array([0, 1]), array([2])]
In [88]: np.array([i.tolist() for i in out])
Out[88]: array([[0, 1], [2]], dtype=object)
2D Case
For the 2D case of filling a 2D array with groupings made from indices in two arrays I and J, which give the rows and columns where each group is to be assigned, we could do something like this -
ncols = 5
lidx = I*ncols+J
sidx = lidx.argsort() # Use kind='mergesort' to keep order
lidx_sorted = lidx[sidx]
unq_idx, split_idx = np.unique(lidx_sorted, return_index=True)
out.flat[unq_idx] = np.split(a[sidx], split_idx)[1:]
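A small end-to-end run of the 2D case (a sketch with made-up inputs; note that the groups land as arrays, not lists):
import numpy as np

a = np.array([10., 20., 30., 40.])
I = np.array([0, 1, 0, 1])  # row of each value
J = np.array([2, 3, 2, 0])  # column of each value

out = np.empty((5, 5), dtype=object)
out.fill([])

ncols = 5
lidx = I*ncols + J
sidx = lidx.argsort()  # use kind='mergesort' to keep order
lidx_sorted = lidx[sidx]
unq_idx, split_idx = np.unique(lidx_sorted, return_index=True)
out.flat[unq_idx] = np.split(a[sidx], split_idx)[1:]

print(out[0, 2])  # [10. 30.] -- both values that hit row 0, col 2
print(out[1, 3])  # [20.]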

Related

How do I reduce the use of for loops using numpy?

Basically, I have three arrays that I multiply by scale values from 0 to 2, expanding the number of rows to the number of scale values (the scales are the same for each array). From there, I want to calculate the sum i*A + j*B + k*C for every combination (i, j, k) of scaled rows from the three arrays. So I have three arrays
A = np.array([1, 2, 3])
B = np.array([1, 2, 3])
C = np.array([1, 2, 3])
and I'm trying to reduce the operation given below
search_range = np.linspace(0, 2, 11)
results = np.array([[0, 0, 0]])
for i in search_range:
    for j in search_range:
        for k in search_range:
            sm = i*A + j*B + k*C
            results = np.append(results, [sm], axis=0)
What I tried doing:
A = np.array([[1, 2, 3]])
B = np.array([[1, 2, 3]])
C = np.array([[1, 2, 3]])
n = 11
scale = np.linspace(0, 2, n).reshape(-1, 1)
A = np.repeat(A, n, axis=0) * scale
B = np.repeat(B, n, axis=0) * scale
C = np.repeat(C, n, axis=0) * scale
results = np.array([[0, 0, 0]])
for i in range(n):
    A_i = A[i]
    for j in range(n):
        B_j = B[j]
        C_k = C
        sm = A_i + B_j + C_k
        results = np.append(results, sm, axis=0)
which only removes the last for loop. How do I reduce the other for loops?
You can get the same result like this:
search_range = np.linspace(0, 2, 11)
search_range = np.array(np.meshgrid(search_range, search_range, search_range))
search_range = search_range.T.reshape(-1, 3)
sm = search_range[:, 0, None]*A + search_range[:, 1, None]*B + search_range[:, 2, None]*C
results = np.concatenate(([[0, 0, 0]], sm))
Instead of using three nested loops to get every combination of elements in the search_range array, I used the meshgrid function to convert search_range into a 2D array of every possible combination; then, instead of the loop variables i, j and k, you use the three columns of the new search_range.
And finally, as suggested by @Mercury, you can use indexing into the new search_range array to generate the result. For example, search_range[:, 1, None] is an array of shape (1331, 1), containing a singleton array for each element at index 1 of the rows of search_range. The concatenate is only there because you wanted the results array to start with the default value [[0, 0, 0]], so I appended sm to it; otherwise, the sm array alone contains the answer.
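As a quick sanity check of the shapes involved (a sketch; the row order may differ from the nested loops, but the same 1331 combinations are covered):
import numpy as np

A = np.array([1, 2, 3]); B = np.array([1, 2, 3]); C = np.array([1, 2, 3])
search_range = np.linspace(0, 2, 11)
grid = np.array(np.meshgrid(search_range, search_range, search_range)).T.reshape(-1, 3)
print(grid.shape)  # (1331, 3): one row per (i, j, k) combination
sm = grid[:, 0, None]*A + grid[:, 1, None]*B + grid[:, 2, None]*C
print(sm.shape)    # (1331, 3)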

applying a function to rows of an ndarray && converting an itertools object to a numpy array

I am trying to create permutations of size 4 from a group of real numbers. After that, I'd like to know the position of the first element in a permutation after I sort it. Here is what I have tried so far. What's the best way to do this?
import numpy as np
from itertools import chain, permutations
N_PLAYERS = 4
N_STATES = 60
np.random.seed(0)
state_space = np.linspace(0.0, 1.0, num=N_STATES, retstep=True)[0].tolist()
perms = permutations(state_space, N_PLAYERS)
perms_arr = np.fromiter(chain(*perms),dtype=np.float16)
def loc(row):
    return np.where(np.argsort(row) == 0)[0].tolist()[0]
locs = np.apply_along_axis(loc, 0, perms)
In [153]: N_PLAYERS = 4
...: N_STATES = 60
...: np.random.seed(0)
...: state_space = np.linspace(0.0, 1.0, num=N_STATES, retstep=True)[0].tolist()
...: perms = itertools.permutations(state_space, N_PLAYERS)
In [154]: alist = list(perms)
In [155]: len(alist)
Out[155]: 11703240
Simply making a list from the permutations produces a list of tuples, all of length N_PLAYERS.
Making an array from that with chain flattens it:
In [156]: perms = itertools.permutations(state_space, N_PLAYERS)
In [158]: perms_arr = np.fromiter(itertools.chain(*perms),dtype=np.float16)
In [159]: perms_arr.shape
Out[159]: (46812960,)
The flat array could then be reshaped to (11703240, 4).
Using apply on that 1d array doesn't work (or make sense):
In [170]: perms_arr.shape
Out[170]: (46812960,)
In [171]: locs = np.apply_along_axis(loc, 0, perms_arr)
In [172]: locs.shape
Out[172]: ()
Reshape to 4 columns:
In [173]: locs = np.apply_along_axis(loc, 0, perms_arr.reshape(-1,4))
In [174]: locs.shape
Out[174]: (4,)
In [175]: locs
Out[175]: array([ 0, 195054, 578037, 769366])
This applies loc to each column, returning one value for each. But loc has a row variable. Is that supposed to be significant?
I could switch the axis to apply loc to each row instead; this takes much longer:
In [176]: locs = np.apply_along_axis(loc, 1, perms_arr.reshape(-1,4))
In [177]: locs.shape
Out[177]: (11703240,)
list comprehension
This iteration does the same thing as your apply_along_axis, and I expect it is faster (though I haven't timed it - it's too slow).
In [188]: locs1 = np.array([loc(row) for row in perms_arr.reshape(-1,4)])
In [189]: np.allclose(locs, locs1)
Out[189]: True
whole array sort
But argsort takes an axis, so I can sort all rows at once (instead of iterating):
In [185]: np.nonzero(np.argsort(perms_arr.reshape(-1,4), axis=1)==0)
Out[185]:
(array([ 0, 1, 2, ..., 11703237, 11703238, 11703239]),
array([0, 0, 0, ..., 3, 3, 3]))
In [186]: np.allclose(_[1],locs)
Out[186]: True
Or going the other direction (cf. with Out[175]):
In [187]: np.nonzero(np.argsort(perms_arr.reshape(-1,4), axis=0)==0)
Out[187]: (array([ 0, 195054, 578037, 769366]), array([0, 1, 2, 3]))
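Since each row of the argsort result is a permutation of 0..3, the position of the value 0 in a row is simply that row's argmin, so an equivalent loop-free version of loc is possible (a minimal sketch on a toy array):
import numpy as np

rows = np.array([[0.3, 0.1, 0.2, 0.4],
                 [0.9, 0.5, 0.7, 0.6]])
order = np.argsort(rows, axis=1)  # each row is a permutation of 0..3
locs2 = order.argmin(axis=1)      # position where original element 0 ends up
print(locs2)                      # [2 3]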

How to add element to empty 2d numpy array

I'm trying to insert elements into an empty 2d numpy array, but I am not getting what I want. I tried np.hstack, but it gives me only a flat 1d array; then I tried np.append, which gives me an error.
Error:
ValueError: all the input arrays must have same number of dimensions
randomReleaseAngle1 = np.random.uniform(20.0, 77.0, size=(5, 1))
randomVelocity1 = np.random.uniform(40.0, 60.0, size=(5, 1))
randomArray = np.concatenate((randomReleaseAngle1, randomVelocity1), axis=1)
arr1 = np.empty((2,2), float)
arr = np.array([])
for i in randomArray:
    data = [[170, 68.2, i[0], i[1]]]
    df = pd.DataFrame(data, columns=['height', 'release_angle', 'velocity', 'holding_angle'])
    test_y_predictions = model.predict(df)
    print(test_y_predictions)
    if np.any(test_y_predictions == 1):
        arr = np.hstack((arr, np.array([i[0], i[1]])))
        arr1 = np.append(arr1, np.array([i[0], i[1]]), axis=0)
print(arr)
print(arr1)
I wanted to get something like
[[1.5,2.2],
[3.3,4.3],
[7.1,7.3],
[3.3,4.3],
[3.3,4.3]]
However, I'm getting
[56.60290125 49.79106307 35.45102444 54.89380834 47.09359271 49.19881675
22.96523274 44.52753514 67.19027156 54.10421167]
The recommended list append approach:
In [39]: alist = []
In [40]: for i in range(3):
    ...:     alist.append([i, i+10])
    ...:
In [41]: alist
Out[41]: [[0, 10], [1, 11], [2, 12]]
In [42]: np.array(alist)
Out[42]:
array([[ 0, 10],
[ 1, 11],
[ 2, 12]])
If we start with an empty((2,2)) array:
In [47]: arr = np.empty((2,2),int)
In [48]: arr
Out[48]:
array([[139934912589760, 139934912589784],
[139934871674928, 139934871674952]])
In [49]: np.concatenate((arr, [[1,10]],[[2,11]]), axis=0)
Out[49]:
array([[139934912589760, 139934912589784],
[139934871674928, 139934871674952],
[ 1, 10],
[ 2, 11]])
Note that empty does not mean the same thing as the list []. It's a real 2x2 array, with 'unspecified' values. And those values remain when we add other arrays to it.
I could instead start with an array whose first dimension is 0:
In [51]: arr = np.empty((0,2),int)
In [52]: arr
Out[52]: array([], shape=(0, 2), dtype=int64)
In [53]: np.concatenate((arr, [[1,10]],[[2,11]]), axis=0)
Out[53]:
array([[ 1, 10],
[ 2, 11]])
That looks more like the list append approach. But why start with the (0,2) array in the first place?
np.concatenate takes a list of arrays (or lists that can be made into arrays). I used nested lists that make (1,2) arrays. With this I can join them on axis 0.
Each concatenate makes a new array. So if done iteratively it is more expensive than the list append.
np.append just takes 2 arrays and does a concatenate, so it doesn't add much. hstack tweaks shapes and joins on the 2nd (horizontal) dimension; vstack is another variant. But they all end up using concatenate.
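For instance, these all produce the same (2, 2) result, while hstack instead joins on axis 1 (a quick illustration):
import numpy as np

a = np.array([[1, 10]])
b = np.array([[2, 11]])
np.concatenate((a, b), axis=0)  # array([[ 1, 10], [ 2, 11]])
np.append(a, b, axis=0)         # same
np.vstack((a, b))               # same
np.hstack((a, b))               # array([[ 1, 10,  2, 11]])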
With the hstack method, you can just reshape after you get the final array:
arr = arr.reshape(-1, 2)
print(arr)
The other method can be more easily done in a similar way:
arr1 = np.append(arr1, np.array([i[0], i[1]]))  # in the loop
arr1 = arr1.reshape(-1, 2)
print(arr1)

classify np.arrays as duplicates

My goal is to take a list of np.arrays and create an associated list or array that classifies each as having a duplicate or not. Here's what I thought would work:
www = [np.array([1, 1, 1]), np.array([1, 1, 1]), np.array([2, 1, 1])]
uniques, counts = np.unique(www, axis = 0, return_counts = True)
counts = [1 if x > 1 else 0 for x in counts]
count_dict = dict(zip(uniques, counts))
[count_dict[i] for i in www]
The desired output for this case would be :
[1, 1, 0]
because the first and second elements have another copy within the original list. It seems that the problem is that I cannot use an np.array as a key for a dictionary.
Suggestions?
First convert www to a 2D Numpy array, then do the following:
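For concreteness, here is the setup the snippet below assumes (keeping counts as the array returned by np.unique, before the 0/1 list conversion):
import numpy as np

www = np.array([[1, 1, 1], [1, 1, 1], [2, 1, 1]])  # the question's list, as a (3, 3) array
uniques, counts = np.unique(www, axis=0, return_counts=True)  # counts stays a numpy array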
In [18]: (counts[np.where((www[:,None] == uniques).all(2))[1]] > 1).astype(int)
Out[18]: array([1, 1, 0])
Here we use broadcasting to check the equality of all www rows against the uniques array, and then all() on the last axis to find out which www rows are completely equal to which uniques rows.
Here's the elaborated results:
In [20]: (www[:,None] == uniques).all(2)
Out[20]:
array([[ True, False],
[ True, False],
[False, True]])
# Respective indices in `counts` array
In [21]: np.where((www[:,None] == uniques).all(2))[1]
Out[21]: array([0, 0, 1])
In [22]: counts[np.where((www[:,None] == uniques).all(2))[1]] > 1
Out[22]: array([ True, True, False])
In [23]: (counts[np.where((www[:,None] == uniques).all(2))[1]] > 1).astype(int)
Out[23]: array([1, 1, 0])
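An equivalent shortcut, since np.unique can also hand back the inverse mapping directly (a sketch):
import numpy as np

www = np.array([[1, 1, 1], [1, 1, 1], [2, 1, 1]])
uniques, inv, counts = np.unique(www, axis=0, return_inverse=True, return_counts=True)
print((counts[inv] > 1).astype(int))  # [1 1 0]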
In Python, lists (and numpy arrays) cannot be hashed, so they can't be used as dictionary keys. But tuples can! So one option would be to convert your original list to a tuple, and to convert uniques to a tuple. The following works for me:
www = [np.array([1, 1, 1]), np.array([1, 1, 1]), np.array([2, 1, 1])]
www_tuples = [tuple(l) for l in www] # list of tuples
uniques, counts = np.unique(www, axis = 0, return_counts = True)
counts = [1 if x > 1 else 0 for x in counts]
# convert uniques to tuples
uniques_tuples = [tuple(l) for l in uniques]
count_dict = dict(zip(uniques_tuples, counts))
[count_dict[i] for i in www_tuples]
Just a heads-up: this will double your memory consumption, so it may not be the best solution if www is large.
You can mitigate the extra memory consumption by ingesting your data as tuples instead of numpy arrays if possible.
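If the arrays all share one dtype, their raw bytes are hashable too, which avoids the tuple copies (a sketch using tobytes):
import numpy as np

www = [np.array([1, 1, 1]), np.array([1, 1, 1]), np.array([2, 1, 1])]
uniques, counts = np.unique(www, axis=0, return_counts=True)
count_dict = {u.tobytes(): int(c > 1) for u, c in zip(uniques, counts)}
print([count_dict[w.tobytes()] for w in www])  # [1, 1, 0]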

Summing and removing repeated elements of Numpy Arrays

I have 4 1D Numpy arrays of equal length.
The first three act as an ID, uniquely identifying the 4th array.
The ID arrays contain repeated combinations, for which I need to sum the 4th array, and remove the repeating element from all 4 arrays.
x = np.array([1, 2, 4, 1])
y = np.array([1, 1, 4, 1])
z = np.array([1, 2, 2, 1])
data = np.array([4, 7, 3, 2])
In this case I need:
x = [1, 2, 4]
y = [1, 1, 4]
z = [1, 2, 2]
data = [6, 7, 3]
The arrays are rather long so loops really won't work. I'm sure there is a fairly simple way to do this, but for the life of me I can't figure it out.
To get started, we can stack the ID vectors into a matrix such that each ID is a row of three values:
XYZ = np.vstack((x,y,z)).T
Now, we just need to find the indices of repeated rows. Unfortunately, older versions of np.unique don't operate on rows (newer ones accept an axis argument), so we need to do some tricks:
order = np.lexsort(XYZ.T)
diff = np.diff(XYZ[order], axis=0)
uniq_mask = np.append(True, (diff != 0).any(axis=1))
This part is borrowed from the np.unique source code, and finds the unique indices as well as the "inverse index" mapping:
uniq_inds = order[uniq_mask]
inv_idx = np.zeros_like(order)
inv_idx[order] = np.cumsum(uniq_mask) - 1
Finally, sum over the unique indices:
data = np.bincount(inv_idx, weights=data)
x,y,z = XYZ[uniq_inds].T
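A quick end-to-end check of the steps above on the sample data (a sketch):
import numpy as np

x = np.array([1, 2, 4, 1])
y = np.array([1, 1, 4, 1])
z = np.array([1, 2, 2, 1])
data = np.array([4, 7, 3, 2])

XYZ = np.vstack((x, y, z)).T
order = np.lexsort(XYZ.T)
diff = np.diff(XYZ[order], axis=0)
uniq_mask = np.append(True, (diff != 0).any(axis=1))
uniq_inds = order[uniq_mask]
inv_idx = np.zeros_like(order)
inv_idx[order] = np.cumsum(uniq_mask) - 1
data = np.bincount(inv_idx, weights=data)
x, y, z = XYZ[uniq_inds].T
print(x, y, z, data)  # [1 2 4] [1 1 4] [1 2 2] [6. 7. 3.]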
You can use unique and sum as reptilicus suggested to do the following
import numpy as np

x = np.array([1, 2, 4, 1])
y = np.array([1, 1, 4, 1])
z = np.array([1, 2, 2, 1])
data = np.array([4, 7, 3, 2])
# N = len(x)
# ids = x + y*N + z*(N**2)
ids = np.array([hash((a, b, c)) for a, b, c in zip(x, y, z)])  # creates flat ids
_, idx, idx_rep = np.unique(ids, return_index=True, return_inverse=True)
x_out = x[idx]
y_out = y[idx]
z_out = z[idx]
# data_out = np.array([np.sum(data[idx_rep == i]) for i in idx])
data_out = np.bincount(idx_rep, weights=data)
print(x_out)
print(y_out)
print(z_out)
print(data_out)
