find indices of ndarray compared with ndarray - python

I have two unsorted ndarrays with the following structure:
a1 = np.array([[0,4,2,3],[0,2,5,6],[2,3,7,4],[6,0,9,8],[9,0,6,7]])
a2 = np.array([[3,4,2],[0,6,9]])
I would like to find, for each a2 row, the indices of the a1 rows that contain the same values, together with the positions of those values inside the a1 row:
result = [[0,[3,1,2]],[2,[1,3,0]],[3,[1,0,2]],[4,[1,2,0]]]
In this example a2[0] occurs in a1 at rows 0 and 2; within those rows its values sit at positions 3,1,2 and 1,3,0. a2[1] occurs at rows 3 and 4, with positions 1,0,2 and 1,2,0.
Each a2 row appears exactly twice in a1. a1 has at least 1 million rows and a2 around 10,000, so the algorithm should also be quite fast (if possible).
So far, I was thinking about this approach:
big_res = []
for r in xrange(len(a2)):
    big_indices = np.argwhere(a1 == a2[r])
    small_res = []
    for k in xrange(2):
        small_indices = [i for i in a2[r] if i in a1[big_indices[k]]]
        small_res.append(small_indices)   # np.append returns a copy; a plain list append is intended here
    combined_res = [[big_indices[0], small_res[0]], [big_indices[1], small_res[1]]]
    big_res.append(combined_res)

Using numpy_indexed (disclaimer: I am its author), what I think of as the hard part can be written efficiently as follows:
import numpy_indexed as npi
a1s = np.sort(a1, axis=1)
a2s = np.sort(a2, axis=1)
matches = np.array([npi.indices(a2s, np.delete(a1s, i, axis=1), missing=-1) for i in range(4)])
rows, cols = np.argwhere(matches != -1).T
a1idx = cols
a2idx = matches[rows, cols]
# results.shape = [len(a2), 2]
result = npi.group_by(a2idx).split_array_as_array(a1idx)
This only gives you the matches efficiently; not the relative orders. But once you have the matches, computing the relative orders should be simple to do in linear time.
EDIT: and some code of questionable density to get your relative orderings:
order = npi.indices(
    (np.indices(a1.shape)[0].flatten(), a1.flatten()),
    (np.repeat(result.flatten(), 3), np.repeat(a2, 2, axis=0).flatten())
).reshape(-1, 2, 3) - result[..., None] * 4
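If you prefer something more readable for that last step, here is a small per-row sketch (not vectorized across matches) that recovers the same orderings with argsort/searchsorted; it assumes each matched a1 row really does contain the three values of its a2 row:
def relative_order(a1_row, a2_row):
    # position of every a2 value inside the (unsorted) a1 row
    sorter = np.argsort(a1_row)
    return sorter[np.searchsorted(a1_row, a2_row, sorter=sorter)]
print(relative_order(a1[0], a2[0]))  # [3 1 2]
print(relative_order(a1[2], a2[0]))  # [1 3 0]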

Related

Comparing values in different pairs of columns in Pandas

I would like to count how many times column A has the same value as B and as C. Similarly, I would like to count how many times A2 has the same value as B2 and as C2.
I have this dataframe:
,A,B,C,A2,B2,C2
2018-12-01,7,0,8,17,17,17
2018-12-02,0,0,8,20,18,18
2018-12-03,9,8,8,17,17,18
2018-12-04,8,8,8,17,17,18
2018-12-05,8,8,8,17,17,17
2018-12-06,9,8,8,15,17,17
2018-12-07,8,9,9,17,17,16
2018-12-08,0,0,0,17,17,17
2018-12-09,8,0,0,17,20,18
2018-12-10,8,8,8,17,17,17
2018-12-11,8,8,9,17,17,17
2018-12-12,8,8,8,17,17,17
2018-12-13,8,8,8,17,17,17
2018-12-14,8,8,8,17,17,17
2018-12-15,9,9,9,17,17,17
2018-12-16,12,0,0,17,19,17
2018-12-17,11,9,9,17,17,17
2018-12-18,8,9,9,17,17,17
2018-12-19,8,9,8,17,17,17
2018-12-20,9,8,8,17,17,17
2018-12-21,9,9,9,17,17,17
2018-12-22,10,9,0,17,17,17
2018-12-23,10,11,10,17,17,17
2018-12-24,10,10,8,17,19,17
2018-12-25,7,10,10,17,17,18
2018-12-26,10,0,10,17,19,17
2018-12-27,9,10,8,18,17,17
2018-12-28,9,9,9,17,17,17
2018-12-29,10,10,12,18,17,17
2018-12-30,10,0,10,16,19,17
2018-12-31,11,8,8,19,17,16
I expect the following values:
A with B = 14
A with C = 14
A2 with B2 = 14
A2 with C2 = 14
I have done this:
ia = 0
for i in range(0, len(dfr_h_max1)):
    if dfr_h_max1['A'][i] == dfr_h_max1['B'][i]:
        ia = ia + 1
ib = 0
for i in range(0, len(dfr_h_max1)):
    if dfr_h_max1['A'][i] == dfr_h_max1['C'][i]:
        ib = ib + 1
In order to take advantage of pandas, this is one possible solution:
import numpy as np
dfr_h_max1['que'] = np.where((dfr_h_max1['A'] == dfr_h_max1['B']), 1, 0)
After that I could sum all the elements in the new column 'que'.
Another possibility could be related to some sort of boolean variable. Unfortunately, I still do not have enough knowledge about that.
Any other more efficient or elegant solutions?
The primary calculation you need here is, for example, dfr_h_max1['A'] == dfr_h_max1['B'] - as you've done in your edit. That gives you a Series of True/False values based on the equality of each pair of items in the two series. Since True evaluates to 1 and False evaluates to 0, the .sum() is the count of how many True values there were - hence, how many matches.
Put that in a loop and add the required "text" for the output you want:
mains = ('A', 'A2')                 # the main columns
comps = (['B', 'C'], ['B2', 'C2'])  # columns to compare each main with
for main, pair in zip(mains, comps):
    for col in pair:
        print(f'{main} with {col} = {(dfr_h_max1[main] == dfr_h_max1[col]).sum()}')
        # or without f-strings, do:
        # print(main, 'with', col, '=', (dfr_h_max1[main] == dfr_h_max1[col]).sum())
Output:
A with B = 14
A with C = 14
A2 with B2 = 21
A2 with C2 = 20
Btw, the Series method call (df[main] == df[comp]).sum() can also be written with Python's builtin sum() as sum(df[main] == df[comp]).
In case you have more than two "triplets" of columns (not just A & A2), change the mains and comps to this, so that it works on all triplets:
mains = dfr_h_max1.columns[::3]        # main columns (A's), in steps of 3
comps = zip(dfr_h_max1.columns[1::3],  # offset by 1 column (B's),
            dfr_h_max1.columns[2::3])  # offset by 2 columns (C's),
                                       # in steps of 3
(Or even using the column names / starting letter.)
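A rough sketch of that last idea (grouping columns by the trailing digits of their names; it assumes each group's main column comes first, as in the A/B/C, A2/B2/C2 layout above):
import re
from collections import defaultdict
groups = defaultdict(list)
for col in dfr_h_max1.columns:
    suffix = re.sub(r'^[A-Za-z]+', '', col)   # '' for A/B/C, '2' for A2/B2/C2
    groups[suffix].append(col)
for suffix, cols in groups.items():
    main, *rest = cols   # main column assumed to come first in each group
    for col in rest:
        print(f'{main} with {col} = {(dfr_h_max1[main] == dfr_h_max1[col]).sum()}')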

What's a more efficient way to calculate the max of each row in a matrix excluding its own column?

For a given 2D matrix np.array([[1,3,1],[2,0,5]]), if one needs to calculate the max of each row excluding the element's own column, with the expected example return np.array([[3,1,3],[5,5,2]]), what would be the most efficient way to do so?
Currently I implemented it with a loop to exclude its own col index:
n = x.shape[1]
row_max_mat = np.zeros(x.shape)
rng = np.arange(n)
for i in rng:
    row_max_mat[:, i] = np.amax(x[:, rng != i], axis=1)
Is there a faster way to do so?
Similar idea to yours (exclude columns one by one), but with indexing:
cols = a.shape[1]
mask = ~np.eye(cols, dtype=bool)
a[:, np.where(mask)[1]].reshape((a.shape[0], a.shape[1]-1, -1)).max(1)
Output:
array([[3, 1, 3],
[5, 5, 2]])
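To see why the reshape trick works, here is the gathered column-index pattern for a 3-column input (illustration only):
cols = 3
mask = ~np.eye(cols, dtype=bool)
print(np.where(mask)[1].reshape(cols - 1, cols))
# [[1 2 0]
#  [2 0 1]]
Column j of this pattern holds the indices of every column except j, so the .max(1) above takes, for each j, the row maximum over exactly those columns.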
You could do this using np.maximum.accumulate. Compute the forward and backward accumulated maximums along the horizontal axis and then combine them with an offset of one:
import numpy as np
m = np.array([[1,3,1],[2,0,5]])
fmax = np.maximum.accumulate(m,axis=1)
bmax = np.maximum.accumulate(m[:,::-1],axis=1)[:,::-1]
r = np.full(m.shape,np.min(m))
r[:,:-1] = np.maximum(r[:,:-1],bmax[:,1:])
r[:,1:] = np.maximum(r[:,1:],fmax[:,:-1])
print(r)
# [[3 1 3]
# [5 5 2]]
This will require 3x the size of your matrix to process (although you could take that down to 2x if you want an in-place update). Adding a 3rd and 4th dimension could also work using a mask, but that will require columns^2 times the matrix's size to process and will likely be slower.
If needed, you can apply the same technique column wise or to both dimensions (by combining row wise and column wise results).
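For instance, the column-wise version (max of each column excluding the element's own row) is just the same computation run on the transpose; a quick sketch with the example matrix m from above:
mt = m.T
fmax = np.maximum.accumulate(mt, axis=1)
bmax = np.maximum.accumulate(mt[:, ::-1], axis=1)[:, ::-1]
rc = np.full(mt.shape, np.min(mt))
rc[:, :-1] = np.maximum(rc[:, :-1], bmax[:, 1:])
rc[:, 1:] = np.maximum(rc[:, 1:], fmax[:, :-1])
col_result = rc.T
print(col_result)
# [[2 0 5]
#  [1 3 1]]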
a = np.array([[1,3,1],[2,0,5]])
row_max = a.max(axis=1).reshape(-1,1)
b = (((a // row_max)+1)%2)  # 1 everywhere except at each row's max position (assumes non-negative values and a unique row max)
c = b*row_max               # row max everywhere except at the max position itself
d = (a // row_max)*((a*b).max(axis=1).reshape(-1,1))  # second-largest value, placed at the max position
c+d # result
Since we are looking to get the max excluding the element's own column, the output would basically have each row filled with that row's max, except at the position of the max element itself, which needs to be filled with the second-largest value. As such, argpartition seems to fit right in. So, here's one solution with it -
def max_exclude_own_col(m):
    out = np.full(m.shape, m.max(1, keepdims=True))
    sidx = np.argpartition(-m, 2, axis=1)
    R = np.arange(len(sidx))
    s0, s1 = sidx[:,0], sidx[:,1]
    mask = m[R,s0] > m[R,s1]
    L1c, L2c = np.where(mask,s0,s1), np.where(mask,s1,s0)
    out[R,L1c] = m[R,L2c]
    return out
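For the example matrix from the question this returns the expected result:
m = np.array([[1,3,1],[2,0,5]])
print(max_exclude_own_col(m))
# [[3 1 3]
#  [5 5 2]]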
Benchmarking
Other working solution(s) for large arrays -
# #Alain T.'s soln
def max_accum(m):
    fmax = np.maximum.accumulate(m,axis=1)
    bmax = np.maximum.accumulate(m[:,::-1],axis=1)[:,::-1]
    r = np.full(m.shape,np.min(m))
    r[:,:-1] = np.maximum(r[:,:-1],bmax[:,1:])
    r[:,1:] = np.maximum(r[:,1:],fmax[:,:-1])
    return r
Using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions.
So, we will test out with large arrays of various shapes for timings and speedups -
In [54]: import benchit
In [55]: funcs = [max_exclude_own_col, max_accum]
In [170]: inputs = [np.random.randint(0,100,(100000,n)) for n in [10, 20, 50, 100, 200, 500]]
In [171]: T = benchit.timings(funcs, inputs, indexby='shape')
In [172]: T
Out[172]:
Functions max_exclude_own_col max_accum
Shape
100000x10 0.017721 0.014580
100000x20 0.028078 0.028124
100000x50 0.056355 0.089285
100000x100 0.103563 0.200085
100000x200 0.188760 0.407956
100000x500 0.439726 0.976510
# Speedups with max_exclude_own_col over max_accum
In [173]: T.speedups(ref_func_by_index=1)
Out[173]:
Functions max_exclude_own_col Ref:max_accum
Shape
100000x10 0.822783 1.0
100000x20 1.001660 1.0
100000x50 1.584334 1.0
100000x100 1.932017 1.0
100000x200 2.161241 1.0
100000x500 2.220725 1.0

fastest way to iterate a numpy array and update each element

This might be weird to you people, but I happen to have this weird goal to achieve; the code goes as follows.
# A is a numpy array, dtype=int32,
# and each element is actually an ID(int), the ID range might be wide,
# but the actually existing values are quite fewer than the dense range,
A = array([[379621, 552965, 192509],
[509849, 252786, 710979],
[379621, 718598, 591201],
[509849, 35700, 951719]])
# and I need to map these sparse ID to dense ones,
# my idea is to have a dict, mapping actual_sparse_ID -> dense_ID
M = {}
# so I iterate this numpy array, and check if this sparse ID has a dense one or not
for i in np.nditer(A, op_flags=['readwrite']):
    if i not in M:
        M[i] = len(M) # sparse ID got a dense one
    i[...] = M[i] # replace sparse one with the dense ID
My goal could be achieved with np.unique(A, return_inverse=True), and the return_inverse result is what I want.
However, the numpy array I have is too huge to fully load into memory, so I cannot run np.unique over the whole data, and this is why I came up with this dict-mapping idea...
Is this the right way to go? Any possible improvement?
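For what it's worth, the dict-based loop above only needs a small tweak to run: the 0-d arrays that np.nditer yields are not hashable, so convert each one to a plain int before using it as a dict key. A minimal sketch:
M = {}
with np.nditer(A, op_flags=['readwrite']) as it:
    for i in it:
        key = int(i)         # 0-d arrays can't be dict keys; a Python int can
        if key not in M:
            M[key] = len(M)  # sparse ID gets the next dense ID
        i[...] = M[key]      # replace the sparse ID in place
print(A)
# [[0 1 2]
#  [3 4 5]
#  [0 6 7]
#  [3 8 9]]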
I will make an attempt to provide an alternative way of doing this by using numpy.unique() on sub-arrays. This solution is not fully tested. I also did not do any side-by-side performance evaluation since your solution is not fully working for me.
Let's say we have an array c that we split into two smaller arrays. Let's create some test data, for example:
>>> a = np.array([[1,1,2,3,4],[1,2,6,6,2],[8,0,1,1,4]])
>>> b = np.array([[11,2,-1,12,6],[12,2,6,11,2],[7,0,3,1,3]])
>>> c = np.vstack([a, b])
>>> print(c)
[[ 1 1 2 3 4]
[ 1 2 6 6 2]
[ 8 0 1 1 4]
[11 2 -1 12 6]
[12 2 6 11 2]
[ 7 0 3 1 3]]
Here we assume that c is the large array and a and b are sub-arrays. Of course, one could build c first and then extract sub-arrays.
Next step is to run numpy.unique() on the two sub-arrays:
>>> ua, ia = np.unique(a, return_inverse=True)
>>> ub, ib = np.unique(b, return_inverse=True)
>>> uc, ic = np.unique(c, return_inverse=True) # this is for future reference
Now, here is an algorithm for combining the results from subarrays:
def merge_unique(ua, ia, ub, ib):
    # make copies *if* changing inputs is undesirable:
    ua = ua.copy()
    ia = ia.copy()
    ub = ub.copy()
    ib = ib.copy()
    # find differences between unique values in the two arrays:
    diffab = np.setdiff1d(ua, ub, assume_unique=True)
    diffba = np.setdiff1d(ub, ua, assume_unique=True)
    # find indices in ua, ub where to insert "other" unique values:
    ssa = np.searchsorted(ua, diffba)
    ssb = np.searchsorted(ub, diffab)
    # throw away values that are too large:
    ssa = ssa[np.where(ssa < len(ua))]
    ssb = ssb[np.where(ssb < len(ub))]
    # increment indices past previously computed "insert" positions:
    for v in ssa[::-1]:
        ia[ia >= v] += 1
    for v in ssb[::-1]:
        ib[ib >= v] += 1
    # combine results:
    uc = np.union1d(ua, ub) # or use ssa, ssb, diffba, diffab to update ua, ub
    ic = np.concatenate([ia, ib])
    return uc, ic
Now, let's run this function on the results of numpy.unique() from sub-arrays and then compare merged indices and unique values with the reference results uc and ic:
>>> uc2, ic2 = merge_unique(ua, ia, ub, ib)
>>> np.all(uc2 == uc)
True
>>> np.all(ic2 == ic)
True
Splitting into more than two sub-arrays can be handled with little additional work - simply keep accumulating "unique" values and indices, like this:
uacc, iacc = np.unique(subarr1, return_inverse=True)
ui, ii = np.unique(subarr2, return_inverse=True)
uacc, iacc = merge_unique(uacc, iacc, ui, ii)
ui, ii = np.unique(subarr3, return_inverse=True)
uacc, iacc = merge_unique(uacc, iacc, ui, ii)
ui, ii = np.unique(subarr4, return_inverse=True)
uacc, iacc = merge_unique(uacc, iacc, ui, ii)
................................ (etc.)
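For example, a rough sketch of driving this chunk by chunk (big_array here is a stand-in for however you load the data block by block):
uacc, iacc = None, None
for block in np.array_split(big_array, 10):   # or read blocks from disk one at a time
    ui, ii = np.unique(block, return_inverse=True)
    if uacc is None:
        uacc, iacc = ui, ii
    else:
        uacc, iacc = merge_unique(uacc, iacc, ui, ii)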

How to mask with 3d array and 2d array numpy

How do you select a group of elements from a 3d array using a 2d array?
# These are my 3 data types
# A  = numpy.ndarray[numpy.ndarray[float]]
# B1 = numpy.ndarray[numpy.ndarray[numpy.ndarray[float]]]
# B2 = numpy.ndarray[numpy.ndarray[numpy.ndarray[float]]]
# I want to choose values from B2 at the positions where B1 matches A
This is what I tried but it returned all False:
A2[i]=image_values[updated_image_values==initial_means[i]]
Example:
A=[[1,1,1],[2,2,2]]
B=[[[1,1,1],[2,3,4]],[[2,2,2],[1,1,1]],[[1,1,1],[2,2,2]]]
B2=[[[2,2,2],[9,3,21]],[[22,0,-2],[-1,-1,1]],[[1,-1,-1],[10,0,2]]]
#A2 is calculated as the means of the B2 values that correspond
#to it's value according to B
So, to calculate A2 we check which values in B are equal to values in A. For the first index, A[0], the entries B[0][0], B[1][1] and B[2][0] are equal to A[0]. So for A2[0], we take the values of B2 at those positions and use them to calculate the average for each index:
#A2[0][0]=(B2[0][0][0]+B2[1][1][0]+B2[2][0][0]) /3 = 0.67
#A2[1][2]=(B2[1][0][2]+B2[2][1][2]) /2 = 0
#After doing this for every A2 value, A2 should be:
A2=[[0.67,0,0.67],[16,0,0]]
Here's a vectorized approach with np.add.reduceat -
idx = np.argwhere((B == A[:,None,None]).all(-1))
B2_indexed = B2[idx[:,1],idx[:,2]]
_,start, count = np.unique(idx[:,0],return_index=1,return_counts=1)
out = np.add.reduceat(B2_indexed,start)/count.astype(float)[:,None]
Alternatively, we can save a bit of memory by avoiding the 4D mask and using a 3D mask instead to get idx, like so -
dims = np.maximum(B.max(axis=(0,1)),A.max(0))+1
A_reduced = np.ravel_multi_index(A.T,dims)
B_reduced = np.ravel_multi_index(B.T,dims)
idx = np.argwhere(B_reduced.T == A_reduced[:,None,None])
Here's another approach with one loop -
out = np.empty(A.shape)
for i in range(A.shape[0]):
    r, c = np.where((B == A[i]).all(-1))
    out[i] = B2[r, c].mean(0)
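For reference, a quick check of the loop version against the example arrays from the question (values rounded):
A  = np.array([[1,1,1],[2,2,2]])
B  = np.array([[[1,1,1],[2,3,4]],[[2,2,2],[1,1,1]],[[1,1,1],[2,2,2]]])
B2 = np.array([[[2,2,2],[9,3,21]],[[22,0,-2],[-1,-1,1]],[[1,-1,-1],[10,0,2]]])
out = np.empty(A.shape)
for i in range(A.shape[0]):
    r, c = np.where((B == A[i]).all(-1))
    out[i] = B2[r, c].mean(0)
print(np.round(out, 2))
# [[ 0.67  0.    0.67]
#  [16.    0.    0.  ]]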

Optimizing Python script with multiple for loops

I'm super new at python and super new at trying to optimize a script for speed. I've got this problem that I've been using to teach myself how to write code and here it is:
I have a dataset with a list of products, their value and their costs. There are three different types of products (A,B,C) - there are anywhere from 30-100 products for each product type. Each product has a value and a cost. I have to select 1 product from product type A, 2 from product type B, and 3 from products type C -- once I use a product, I cannot use it again (no replacement).
My goal is to optimize the value of products given my budget constraint.
Given that I'm basically trying to create a list of combinations, I started there and wrote a few "for loops" in order to achieve that. I initially tried to do too much in the loops, and changed the data type to list because from my research it sounded like that would speed it up -- it did speed it up immensely.
The problem is that I am still processing 350k records a second at best, which puts me at about 7 hours to complete if I have 30 items in list_a, 50 in list_b, and 50 in list_c.
I have created 3 lists of lists -- (list_a, list_b, and list_c) that each look like my example below for list_a. Then, I evaluate each permutation inside the for loop to see if this permutation has a higher value than the current highest value permutation and that the cost is below the constraint. If it meets those conditions, I append it to the masterlist of permutations (combo_list).
list_a = [['product1','product1_cost','product1_value'],['product2','product2_cost','product2_value'],...]
num_a = len(list_a)
num_b = len(list_b)
num_c = len(list_c)
combo_list = [[0] * 20]  # this is to create the list of lists that I will populate
max_value = 0
for a in range(num_a):
    for b1 in range(num_b):
        for b2 in range(b1 + 1, num_b):              # second B
            for c1 in range(num_c):
                for c2 in range(c1 + 1, num_c):      # second C
                    for c3 in range(c2 + 1, num_c):  # third C
                        data = [list_a[a][0], list_a[a][1], list_a[a][2],
                                list_b[b1][0], list_b[b1][1], list_b[b1][2],
                                list_b[b2][0], list_b[b2][1], list_b[b2][2],
                                list_c[c1][0], list_c[c1][1], list_c[c1][2],
                                list_c[c2][0], list_c[c2][1], list_c[c2][2],
                                list_c[c3][0], list_c[c3][1], list_c[c3][2]]
                        total_cost = data[1] + data[4] + data[7] + data[10] + data[13] + data[16]
                        total_value = data[2] + data[5] + data[8] + data[11] + data[14] + data[17]
                        data.append(total_cost)
                        data.append(total_value)
                        if total_value >= max_value and total_cost <= constraint:
                            combo_list.append(data)
                            max_value = total_value
Then I turn it into a dataframe or CSV.
Thank you for any help.
So, I was able to figure this out using itertools.combinations:
tup_b = itertools.combinations(list_b, 2)
list_b = list(map(list, tup_b))
df_b = pd.DataFrame(list_b)
# Extending the list
df_b['B'] = df_b[0] + df_b[1]
df_b = df_b[['B']]
# flatten list
b = df_b.values.tolist()
b = list(itertools.chain(*b))
# adding values and costs
r = len(b)
for x in range(r):
    cost = [b[x][1] + b[x][4]]
    value = [b[x][2] + b[x][5]]
    b[x] = b[x] + cost + value
# shortening list
df_b = pd.DataFrame(b)
df_b = df_b[[0, 3, 6, 7]]
df_b.columns = ['B1', 'B2', 'cost', 'value']
Then, I did the same for list_c with similar structure as above, using this:
tup_c = itertools.combinations(list_c, 3)
Using this dropped my processing time from ~5 hours to 8 minutes...
Thanks all for the help.
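To make the pair-building step concrete, here is a simplified rendering of the same idea on a toy list_b (hypothetical products with made-up costs and values; treat it as an illustration, not the exact script above):
import itertools
import pandas as pd
list_b = [['b1', 10, 3], ['b2', 7, 5], ['b3', 4, 2]]
b = [p + q for p, q in itertools.combinations(list_b, 2)]  # [name1, cost1, value1, name2, cost2, value2]
for row in b:
    row += [row[1] + row[4], row[2] + row[5]]              # combined cost, combined value
df_b = pd.DataFrame(b)[[0, 3, 6, 7]]
df_b.columns = ['B1', 'B2', 'cost', 'value']
print(df_b)
#    B1  B2  cost  value
# 0  b1  b2    17      8
# 1  b1  b3    14      5
# 2  b2  b3    11      7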
