FAST: 1D overlaps with rows in 2D? - python

Let's say I have a 2D array, e.g.:
In [136]: ary
Out[136]:
array([[6, 7, 9],
       [0, 2, 5],
       [3, 3, 4],
       [2, 2, 8],
       [3, 4, 9],
       [0, 5, 7],
       [2, 4, 9],
       [3, 5, 7],
       [7, 8, 8],
       [0, 2, 3]])
I want to calculate the overlap with a 1D vector, FAST.
I can almost do it with (8 ms on a big array):
(ary == match) # .sum(axis=1).argsort()[::-1]
The problem is that it only counts a match if both Position and Value match.
With match = [6, 5, 4]:
array([[ True, False, False],
       [False, False, False],
       [False, False,  True],
       [False, False, False],
       [False, False, False],
       [False,  True, False],
       [False, False, False],
       [False,  True, False],
       [False, False, False],
       [False, False, False]])
For example, the 5 in the 2nd position of the 1D vector did not match the 5 in the 3rd column of the 2nd row.
It works with .isin():
np.isin(ary,match,assume_unique=True).sum(axis=1).argsort()[::-1][:5]
but it is slow on big arrays ((200000, 10): ~20 ms).
Help me extend the first case so that a value of the 1D vector can match at any position within the row.
The expected result is row indexes ordered by OVERLAP COUNT. Let's use [2, 4, 5] because it has more matches:
In [147]: np.isin(ary,[2,5,4],assume_unique=True)
Out[147]:
array([[False, False, False],
       [False,  True,  True],
       [False, False,  True],
       [ True,  True, False],
       [False,  True, False],
       [False,  True, False],
       [ True,  True, False],
       [False,  True, False],
       [False, False, False],
       [False,  True, False]])
Overlap :
In [149]: np.isin(ary,[2,5,4],assume_unique=True).sum(axis=1)
Out[149]: array([0, 2, 1, 2, 1, 1, 2, 1, 0, 1])
Order by overlap :
In [148]: np.isin(ary,[2,5,4],assume_unique=True).sum(axis=1).argsort()[::-1]
Out[148]: array([6, 3, 1, 9, 7, 5, 4, 2, 8, 0])
See: rows 6, 3, 1 have overlap 2, which is why they come first.
Variants:
import numpy as np

# could be from (1000, 10, 10) to (2000, 100, 20) .. ++
def data(cells=2000, seg=100, items=10):
    ary = np.random.randint(0, cells, (cells*seg, items))
    rnd = np.random.randint(0, cells*seg)
    return ary, ary[rnd]

def best2(match, ary):  # ~20 ms, (200000, 10)
    return np.isin(ary, match, assume_unique=True).sum(axis=1).argsort()[::-1][:5]

def best3(match, ary):  # Corralien ~20 ms
    return np.logical_or.reduce(np.ravel(ary) == match[:, None], axis=0).reshape(ary.shape).sum(axis=1).argsort()[::-1][:5]
Can this be sped up using numba+cuda or cupy on the GPU?

The main problem with all the approaches so far is that they create a huge temporary array while only 5 items ultimately matter. Numba can be used to compute the scores on the fly (with efficient JIT-compiled loops), avoiding some of the temporary arrays. Moreover, a full sort is not required since only the top 5 items need to be retrieved: a partition can be used instead. It is even possible to use a faster approach, since only the 5 selected items matter and not the others. Here is the resulting code:
import numpy as np
import numba as nb

@nb.njit('int32[::1](int32[::1], int32[:,::1])')
def computeScore(match, ary):
    n, m = ary.shape
    assert m == match.shape[0]
    tmp = np.empty(n, dtype=np.int32)
    for i in range(n):
        s = 0
        # Count the number of matching items (with repetition)
        for j in range(m):
            # Find a match
            item = ary[i, j]
            found = False
            for k in range(m):
                found |= item == match[k]
            s += found
        tmp[i] = s
    return tmp

def best4(match, ary):
    n, m = ary.shape
    score = computeScore(match, ary)
    bestItems = np.argpartition(score, n-5)[n-5:]  # sadly not supported by Numba yet
    order = np.argsort(-score[bestItems])          # bestItems is not sorted and likely needs to be
    return bestItems[order]
Note that best4 can return results different from best2 when the matching score (stored in tmp) is equal between multiple items. This is due to the sorting algorithm, which is not stable by default in NumPy (the kind parameter can be used to change this behaviour). This is also true for the partition algorithm, although NumPy does not seem to provide a stable partition algorithm yet.
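For what it is worth, here is a small self-contained illustration of that stability caveat (the score values below are made up for the example): with kind='stable', tied scores keep their original row order, whereas the default sort gives no such guarantee.

import numpy as np

score = np.array([2, 0, 2, 1, 2])
print(np.argsort(-score))                  # order of the tied scores is not guaranteed
print(np.argsort(-score, kind='stable'))   # [0 2 4 3 1]: ties keep their row order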
This code should be faster than the other implementations, but not by a large margin. One issue is that Numba (and most C/C++ compilers, like the one used to compile NumPy) does not manage to generate fast code because it does not know the value of m at compile time. As a result, the most aggressive optimizations (e.g. loop unrolling and the use of SIMD instructions) can hardly be applied. You can help Numba with assertions or by escaping conditionals.
Moreover, the code can be parallelized using multiple threads to make it much faster on mainstream platforms. Note that the parallelized version may not be faster on small data or on all platforms, since creating threads introduces an overhead that can be bigger than the actual computation.
Here is the resulting implementation:
@nb.njit('int32[::1](int32[::1], int32[:,::1])', parallel=True)
def computeScoreOpt(match, ary):
    n, m = ary.shape
    assert m == match.shape[0]
    assert m == 10
    tmp = np.empty(n, dtype=np.int32)
    for i in nb.prange(n):
        # This enables Numba to assume m=10 in the following code
        # and generate very efficient code for this specific case.
        # The assert should be enough, but the internals of Numba
        # prevent the information from being propagated to this
        # portion of the code when it is parallelized.
        if m != 10: continue
        s = 0
        for j in range(m):
            item = ary[i, j]
            found = False
            for k in range(m):
                found |= item == match[k]
            s += found
        tmp[i] = s
    return tmp

def best5(match, ary):
    n, m = ary.shape
    score = computeScoreOpt(match, ary)
    bestItems = np.argpartition(score, n-5)[n-5:]
    order = np.argsort(-score[bestItems])
    return bestItems[order]
Here are the timings on my machine with the example dataset:
best2: 18.2 ms
best3: 17.8 ms
best4 (sequential -- default): 12.0 ms
best4 (parallel): 3.1 ms
best5 (sequential): 3.2 ms
best5 (parallel -- default): 1.2 ms
The fastest implementation is 15 times faster than the original reference implementation.
Note that if m is greater than about 30, it should be better to use a more advanced set-based algorithm. An alternative solution is to sort match first and then use np.isin in the i-based loop in this case.
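As a rough sketch of that set-based alternative (my interpretation, not the author's code): sort match once, then replace the inner O(m) scan with a binary search per item. np.searchsorted is supported inside Numba, so the rest of the structure can stay the same; whether this actually wins depends on m.

import numba as nb
import numpy as np

@nb.njit('int32[::1](int32[::1], int32[:,::1])', parallel=True)
def computeScoreSorted(sortedMatch, ary):
    n, m = ary.shape
    tmp = np.empty(n, dtype=np.int32)
    for i in nb.prange(n):
        s = 0
        for j in range(m):
            item = ary[i, j]
            k = np.searchsorted(sortedMatch, item)  # binary search instead of the O(m) scan
            if k < sortedMatch.size and sortedMatch[k] == item:
                s += 1
        tmp[i] = s
    return tmp

# Usage (hypothetical): score = computeScoreSorted(np.sort(match), ary)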

Use broadcasting and np.logical_or.reduce:
# match = np.array(match)
>>> np.logical_or.reduce(np.ravel(ary) == match[:, None], axis=0) \
...     .reshape(ary.shape)
array([[ True, False, False],
       [False, False,  True],
       [False, False,  True],
       [False, False, False],
       [False,  True, False],
       [False,  True, False],
       [False,  True, False],
       [False,  True, False],
       [False, False, False],
       [False, False, False]])
Performance
match = np.array([6, 5, 4])
ary = np.random.randint(0, 10, (200000, 10))
%timeit np.logical_or.reduce(np.ravel(ary) == match[:, None], axis=0).reshape(ary.shape)
7.49 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Related

Partition set with constraints (backtracking with Python)

I have a set of N items that I want to split into K subsets of sizes n1, n2, ..., nk (with n1 + n2 + ... + nk = N).
I also have constraints on which item can belong to which subset.
For my problem, at least one solution always exists.
I'm looking to implement an algorithm in Python that generates (at least) one solution.
Example:
Possibilities:

Item\Subset    0      1      2
A              True   True   False
B              True   True   True
C              False  False  True
D              True   True   True
E              True   False  False
F              True   True   True
G              False  False  True
H              True   True   True
I              True   True   False

Sizes constraints: (3, 3, 3)
Possible solution: [0, 0, 2, 1, 0, 1, 2, 2, 1]
Implementation :
So far, I have tried brute force with success, but I now want to find a more optimized algorithm.
I was thinking about backtracking, but I'm not sure it is the right method, nor whether my implementation is correct:
import pandas as pd
import numpy as np
import string

def solve(possibilities, constraints_sizes):
    solution = [None] * len(possibilities)

    def extend_solution(position):
        possible_subsets = [index for index, value in possibilities.iloc[position].iteritems() if value]
        for subset in possible_subsets:
            solution[position] = subset
            unique, counts = np.unique([a for a in solution if a is not None], return_counts=True)
            if all(length <= constraints_sizes[sub] for sub, length in zip(unique, counts)):
                if position >= len(possibilities)-1 or extend_solution(position+1):
                    return solution
        return None

    return extend_solution(0)
if __name__ == '__main__':
    constraints_sizes = [5, 5, 6]
    possibilities = pd.DataFrame([[False, True, False],
                                  [True, True, True],
                                  [True, True, True],
                                  [True, True, True],
                                  [True, False, False],
                                  [True, True, True],
                                  [True, True, True],
                                  [True, True, True],
                                  [True, False, False],
                                  [True, True, True],
                                  [True, True, True],
                                  [True, True, True],
                                  [False, True, True],
                                  [True, True, True],
                                  [True, True, True],
                                  [True, False, False]],
                                 index=list(string.ascii_lowercase[:16]))
    solution = solve(possibilities, constraints_sizes)
One possible expected solution : [1, 0, 0, 1, 0, 1, 1, 1, 0, 2, 2, 2, 2, 2, 2, 0]
Unfortunately, this code fails to find a solution (even though it works with the previous example).
What am I missing ?
Thank you very much.
This problem can be solved by setting up a bipartite flow network with Items on one side, Subsets on the other, a surplus of 1 at each Item, a deficit of (Subset's size) at each Subset, and arcs of capacity 1 from each Item to each Subset to which it can belong. Then you need a maximum flow on this network; OR-Tools can do this, but you have a lot of options.
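For reference, here is a rough sketch of that flow network using OR-Tools' classic max-flow wrapper (the pywrapgraph module layout is an assumption and has moved around between OR-Tools versions); possibilities is a list of boolean rows as in the question.

from ortools.graph import pywrapgraph

def solve_by_flow(possibilities, sizes):
    n_items, n_subsets = len(possibilities), len(sizes)
    source, sink = 0, 1 + n_items + n_subsets
    item_node = lambda i: 1 + i                  # items: nodes 1 .. n_items
    subset_node = lambda j: 1 + n_items + j      # subsets: nodes n_items+1 .. n_items+n_subsets

    mf = pywrapgraph.SimpleMaxFlow()
    for i in range(n_items):
        mf.AddArcWithCapacity(source, item_node(i), 1)        # surplus of 1 at each item
        for j in range(n_subsets):
            if possibilities[i][j]:
                mf.AddArcWithCapacity(item_node(i), subset_node(j), 1)
    for j, size in enumerate(sizes):
        mf.AddArcWithCapacity(subset_node(j), sink, size)     # deficit = the subset's size

    if mf.Solve(source, sink) != mf.OPTIMAL or mf.OptimalFlow() != n_items:
        return None                                           # no feasible assignment found
    solution = [None] * n_items
    for a in range(mf.NumArcs()):
        # saturated item -> subset arcs encode the assignment
        if mf.Flow(a) == 1 and 1 <= mf.Tail(a) <= n_items and mf.Head(a) != sink:
            solution[mf.Tail(a) - 1] = mf.Head(a) - 1 - n_items
    return solution

# e.g. solve_by_flow(possibilities_from_the_question, (3, 3, 3))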
@David Eisenstat mentioned OR-Tools as a package to solve this kind of problem.
Thanks to him, I've found out that this problem matches one of their examples, an Assignment with Task Sizes problem.
It matches my understanding of the problem better than what I understood from the suggested "flow network" concept, but I'd be happy to discuss that.
Here is the solution I implemented, based on their example:
from ortools.sat.python import cp_model

def solve(possibilities, constraint_sizes):
    # Transform possibilities into costs (0 if possible, 1 otherwise)
    costs = [[int(not row[subset]) for row in possibilities] for subset in range(len(possibilities[0]))]
    num_subsets = len(costs)
    num_items = len(costs[0])

    model = cp_model.CpModel()

    # Variables
    x = {}
    for subset in range(num_subsets):
        for item in range(num_items):
            x[subset, item] = model.NewBoolVar(f'x[{subset},{item}]')

    # Constraints:
    # Each subset should contain a given number of items
    for subset, size in zip(range(num_subsets), constraint_sizes):
        model.Add(sum(x[subset, item] for item in range(num_items)) <= size)

    # Each item is assigned to exactly one subset
    for item in range(num_items):
        model.Add(sum(x[subset, item] for subset in range(num_subsets)) == 1)

    # Objective
    objective_terms = []
    for subset in range(num_subsets):
        for item in range(num_items):
            objective_terms.append(costs[subset][item] * x[subset, item])
    model.Minimize(sum(objective_terms))

    # Solve
    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    if status == cp_model.OPTIMAL or status == cp_model.FEASIBLE:
        solution = []
        for item in range(num_items):
            for subset in range(num_subsets):
                if solver.BooleanValue(x[subset, item]):
                    solution.append(subset)
        return solution
    return None
The trick here is to transform the possibilities into costs (0 only if the assignment is possible) and to minimize the total cost.
An acceptable solution should then have a total cost of 0.
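To make that cost transposition concrete, here is a tiny worked example (the 2-item, 3-subset table is made up): costs[subset][item] is 0 exactly where the item may go into the subset.

possibilities = [[False, True, False],   # item 0 may only go into subset 1
                 [True,  True, True ]]   # item 1 may go anywhere
costs = [[int(not row[subset]) for row in possibilities]
         for subset in range(len(possibilities[0]))]
print(costs)  # [[1, 0], [0, 0], [1, 0]]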
It gives a correct solution for the previous problem:
possibilities = [[False, True, False],
                 [True, True, True],
                 [True, True, True],
                 [True, True, True],
                 [True, False, False],
                 [True, True, True],
                 [True, True, True],
                 [True, True, True],
                 [True, False, False],
                 [True, True, True],
                 [True, True, True],
                 [True, True, True],
                 [False, True, True],
                 [True, True, True],
                 [True, True, True],
                 [True, False, False]]
constraint_sizes = [5, 5, 6]
solution = solve(possibilities, constraint_sizes)
print(solution)  # [1, 2, 1, 0, 0, 0, 2, 1, 0, 1, 2, 2, 2, 2, 1, 0]
I now have two more questions:
Can we transform the optimization objective (minimize the cost) into a hard constraint (the cost should equal 0)? I guess it could lower the computation time.
How can I get other solutions and not only one ?
I am also still looking for a plain Python solution without any library...
Thank you

Compare row values in an array and remove those having nearly identical values

I have the following array:
a = [[2,3,50], [5,6,5], [8,10,5], [1,3,51], [8,10,12]]
I would like to compare rows and remove those having nearly identical values.
For instance [2,3,50] and [1,3,51] are almost identical (the difference in each value is less than 1).
At the end, I should get the following array:
a= [[2,3,50], [5,6,5], [8,10,5], [8,10,12]]
where [1,3,51] has been removed.
Is there an efficient way to do this in Python, avoiding multiple loops?
Best
We can perform an outer subtraction on the first axes of two versions of a, take the absolute value, and check whether all values along the common axis are less than or equal to the threshold of 1. This gives us a 2D mask. We select the upper triangular part of that mask (excluding the diagonal, which corresponds to each row matched against itself) to make sure close-by pairs are not counted more than once. Finally, we check whether there is at least one match in each column; those columns are the close-by rows we need to remove. Hence, invert the mask and select rows off a.
The implementation would be -
a[~np.triu((np.abs(a[:,None,:]-a)<=1).all(2),1).any(0)]
Sample run with step-by-step execution should help clarify.
Input array :
In [112]: a
Out[112]:
array([[ 2,  3, 50],
       [ 5,  6,  5],
       [ 8, 10,  5],
       [ 1,  3, 51],
       [ 8, 10, 12]])
Steps :
In [114]: (np.abs(a[:,None,:]-a)<=1).all(2)
Out[114]:
array([[ True, False, False,  True, False],
       [False,  True, False, False, False],
       [False, False,  True, False, False],
       [ True, False, False,  True, False],
       [False, False, False, False,  True]])
In [115]: np.triu((np.abs(a[:,None,:]-a)<=1).all(2),1)
Out[115]:
array([[False, False, False,  True, False],
       [False, False, False, False, False],
       [False, False, False, False, False],
       [False, False, False, False, False],
       [False, False, False, False, False]])

In [116]: np.triu((np.abs(a[:,None,:]-a)<=1).all(2),1).any(0)
Out[116]: array([False, False, False,  True, False])

In [117]: ~np.triu((np.abs(a[:,None,:]-a)<=1).all(2),1).any(0)
Out[117]: array([ True,  True,  True, False,  True])

In [118]: a[~np.triu((np.abs(a[:,None,:]-a)<=1).all(2),1).any(0)]
Out[118]:
array([[ 2,  3, 50],
       [ 5,  6,  5],
       [ 8, 10,  5],
       [ 8, 10, 12]])
Just to do one more round of verification, let's set the last row as another closeby of the second-last row. This should lead to the last row being removed too. Hence -
In [120]: a[-1] = [0,3,52]

In [122]: a
Out[122]:
array([[ 2,  3, 50],
       [ 5,  6,  5],
       [ 8, 10,  5],
       [ 1,  3, 51],
       [ 0,  3, 52]])

In [123]: a[~np.triu((np.abs(a[:,None,:]-a)<=1).all(2),1).any(0)]
Out[123]:
array([[ 2,  3, 50],
       [ 5,  6,  5],
       [ 8, 10,  5]])
With one-loop for memory-efficiency
We can use one loop to save memory, and hence be efficient on that front, while also making use of slicing in the process -
n = len(a)
mask = np.zeros(n, dtype=bool)
for i in range(n-1):
    mask[i+1:] |= (np.abs(a[i+1:]-a[i])<=1).all(1)
out = a[~mask]
A few questions about the definition of the problem.
First: Suppose we have an array [ ... a1 ... a2 ... ], where a1 and a2 are "nearly identical"; which one do we remove? This is pretty easy to resolve: pick the first one.
Second: Suppose we have an array [ b1 ... bN ] where bi and bi+1 are nearly identical but bi and bi+2 are not nearly identical. Which ones do we remove? In this case, I guess you could remove all the odd entries or all the even entries.
Third: what about a midway situation where you have a mix-and-match of nearly identical successive pairs? What's the prescription?
I think the problem is related to the fact that "nearly identical" is not transitive, unlike "strictly identical". This suggests the following approach, which defines a somewhat different criterion for "nearly identical": Define a hash function that maps rows into "OK rows"; for example, round all elements of the rows down to an even number. Then define "nearly identical" to be all rows that map into the same "OK row". You could define a map from "OK row" to list of nearly identical rows in a, and return the first element of each list.
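A minimal sketch of that bucketing idea (rounding down to an even number is just the example rule from above; note that, as stated, this is a somewhat different criterion, since nearly identical rows that straddle a bucket boundary land in different buckets and are both kept):

import numpy as np

a = np.array([[2, 3, 50], [5, 6, 5], [8, 10, 5], [1, 3, 51], [8, 10, 12]])

buckets = {}
for row in a:
    key = tuple((row // 2) * 2)    # map each row to its "OK row"
    buckets.setdefault(key, row)   # keep only the first row seen in each bucket
result = np.array(list(buckets.values()))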
Perhaps it would help to have a little more context for the question. For example, I'm working on a problem involving a large number of time series. I'd like to predict the next value in each series using SARIMA. However, the cost of building one SARIMA model per series is prohibitive, so what I do (in brief) is cluster the series using K-Means with a value of K such that building K SARIMA models is acceptable. In this case, what I'm hoping is that different series in the same cluster are "nearly identical enough" that one prediction serves for both.

Can someone please explain np.less_equal.outer(range(1,18),range(1,13))

I was debugging code written by someone who has left the organization and came across a line which uses the np.less_equal.outer and np.greater_equal.outer functions. I know that np.outer builds the Cartesian product of two 1-dimensional arrays as a 2D array, and that np.less_equal compares the elements of two arrays and returns True or False. Can someone please explain how this combined form works?
Thanks!
less_equal and greater_equal are special types of numpy functions called ufuncs, in that they have extendible functionalities, including accumulate, at, and outer.
In this case ufunc.outer extends the function to work similarly to the outer product - but while the actual outer product would be multiply.outer, this instead does the greater or less than comparison.
So you get a 2d array of booleans corresponding to each element of the first array, and whether they are greater or less than each of the elements in the second array.
np.less_equal.outer(range(1,18),range(1,13))
Out[]:
array([[ True,  True,  True, ...,  True,  True,  True],
       [False,  True,  True, ...,  True,  True,  True],
       [False, False,  True, ...,  True,  True,  True],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]], dtype=bool)
EDIT: a much more pythonic way of doing this would be:
np.triu(np.ones((17, 12), dtype=bool), 0)
That is, the upper triangle (including the diagonal) of a boolean array of shape (17, 12), which is the shape of the comparison above (range(1,18) has 17 elements, range(1,13) has 12).
From the documentation, we have that for one-dimensional arrays A and B, the operation np.less_equal.outer(A, B) is equivalent to:
m = len(A)
n = len(B)
r = np.empty((m, n), dtype=bool)
for i in range(m):
    for j in range(n):
        r[i, j] = (A[i] <= B[j])
Mathematically, the result is the matrix r with entries r[i, j] = (A[i] <= B[j]).
Here is an example:
np.less_equal([4, 2, 1], [2, 2, 2])
array([False,  True,  True])
np.greater_equal([4, 2, 1], [2, 2, 2])
array([ True,  True, False], dtype=bool)
and here is the plain outer function:
np.outer(range(1, 3), range(1, 4))
array([[1, 2, 3],
       [2, 4, 6]])
hope that helps.

Vectorized element assignment involving comparisons between matrices in Numpy

I'm currently trying to replace the for-loops in this code chunk with vectorized operations in Numpy:
def classifysignal(samplemat, binedges, nbinmat, nodatacode):
    ndata, nsignals = np.shape(samplemat)
    classifiedmat = np.zeros(shape=(ndata, nsignals))
    ncounts = 0
    for i in range(ndata):
        for j in range(nsignals):
            classifiedmat[i,j] = nbinmat[j]
            for e in range(nbinmat[j]):
                if samplemat[i,j] == nodatacode:
                    classifiedmat[i,j] == nodatacode
                    break
                elif samplemat[i,j] <= binedges[j, e]:
                    classifiedmat[i,j] = e
                    ncounts += 1
                    break
    ncounts = float(ncounts/nsignals)
    return classifiedmat, ncounts
However, I'm having a little trouble conceptualizing how to replace the third for loop (i.e. the one beginning with for e in range(nbinmat[j]), since it entails comparing individual elements of two separate matrices before assigning a value, with the indices of these elements (i and e) being completely decoupled from each other. Is there a simple way to do this using whole-array operations, or would sticking with for-loops be best?
PS: My first Stackoverflow question, so if anything's unclear/more details are needed, please let me know! Thanks.
Without some concrete examples and explanation it's hard (or at least takes work) to figure out what you are trying to do, especially in the inner loop. So let's tackle a few pieces and try to simplify them.
In [59]: C=np.zeros((3,4),int)
In [60]: N=np.arange(4)
In [61]: C[:]=N
In [62]: C
Out[62]:
array([[0, 1, 2, 3],
       [0, 1, 2, 3],
       [0, 1, 2, 3]])
means that classifiedmat[i,j] = nbinmat[j] can be moved out of the loops
classifiedmat = np.zeros(samplemat.shape)
classifiedmat[:] = nbinmat
and
In [63]: S=np.arange(12).reshape(3,4)
In [64]: C[S>8]=99
In [65]: C
Out[65]:
array([[ 0,  1,  2,  3],
       [ 0,  1,  2,  3],
       [ 0, 99, 99, 99]])
suggests that
if samplemat[i,j] == nodatacode:
    classifiedmat[i,j] == nodatacode
could be replaced with
classifiedmat[samplemat==nodatacode] = nodatacode
I haven't worked out whether loop and break modifies this replacement or not.
a possible model for inner loop is:
In [83]: B=np.array((np.arange(4),np.arange(2,6))).T
In [84]: for e in range(2):
    ...:     C[S<=B[:,e]]=e
    ...:
In [85]: C
Out[85]:
array([[ 1,  1,  1,  1],
       [ 0,  1,  2,  3],
       [ 0, 99, 99, 99]])
You could also compare all values of S and B with:
In [86]: S[:,:,None]<=B[None,:,:]
Out[86]:
array([[[ True,  True],
        [ True,  True],
        [ True,  True],
        [ True,  True]],

       [[False, False],
        [False, False],
        [False, False],
        [False, False]],

       [[False, False],
        [False, False],
        [False, False],
        [False, False]]], dtype=bool)
The fact that you are iterating over:
for e in range(nbinmat[j]):
may invalidate all these equivalences. I'm not going to try to figure out its significance. But maybe I've given you some ideas.
Well, if you want to use vector operations you need to solve the problem using linear algebra. I can't rethink the problem for you, but the general approach I would take is something like:
res = Subtract samplemat from binedges
res = Normalize values in res to 0 and 1 (use clip?). i.e if > 0, then 1 else 0.
ncount = sum ( res )
classifiedMat = res * binedges
And so on.
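A rough translation of that outline into NumPy (a sketch only; it assumes binedges has been reduced to a single threshold per signal so the shapes broadcast, and it ignores the nodatacode branch):

import numpy as np

def classify_sketch(samplemat, edges):
    res = edges - samplemat               # subtract samplemat from the edges
    res = (res >= 0).astype(int)          # normalize to 0/1: 1 where sample <= edge
    ncount = res.sum()                    # how many entries got classified
    classifiedmat = res * edges           # scale by the edges, as suggested
    return classifiedmat, ncount

# e.g. classify_sketch(np.random.rand(4, 3), np.array([0.5, 0.7, 0.9]))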

Numpy.where: very slow with conditions from two different arrays

I have three arrays of type numpy.ndarray with dimensions (n by 1), named amplitude, distance and weight. I would like to use selected entries of the amplitude array, based on their respective distance- and weight-values. For example I would like to find the indices of the entries within a certain distance range, so I write:
index = np.where( (distance<10) & (distance>=5) )
and I would then proceed by using the values from amplitude[index].
This works perfectly well as long as I only use one array for specifying the conditions. When I try for example
index = np.where( (distance<10) & (distance>=5) & (weight>0.8) )
the operation becomes super-slow. Why is that, and is there a better way for this task? In fact, I eventually want to use many conditions from something like 6 different arrays.
This is just a guess, but perhaps numpy is broadcasting your arrays? If the arrays are the exact same shape, then numpy won't broadcast them:
>>> distance = numpy.arange(5) > 2
>>> weight = numpy.arange(5) < 4
>>> distance.shape, weight.shape
((5,), (5,))
>>> distance & weight
array([False, False, False, True, False], dtype=bool)
But if they have different shapes, and the shapes are broadcastable, then it will. (n,), (n, 1), and (1, n) arrays are all arguably "n by 1", but they aren't all the same:
>>> distance[None,:].shape, weight[:,None].shape
((1, 5), (5, 1))
>>> distance[None,:]
array([[False, False, False, True, True]], dtype=bool)
>>> weight[:,None]
array([[ True],
       [ True],
       [ True],
       [ True],
       [False]], dtype=bool)
>>> distance[None,:] & weight[:,None]
array([[False, False, False,  True,  True],
       [False, False, False,  True,  True],
       [False, False, False,  True,  True],
       [False, False, False,  True,  True],
       [False, False, False, False, False]], dtype=bool)
In addition to returning undesired results, this could be causing a big slowdown if the arrays are even moderately large:
>>> distance = numpy.arange(5000) > 500
>>> weight = numpy.arange(5000) < 4500
>>> %timeit distance & weight
100000 loops, best of 3: 8.17 us per loop
>>> %timeit distance[:,None] & weight[None,:]
10 loops, best of 3: 48.6 ms per loop
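If that shape mismatch is indeed the culprit, one hedged fix is simply to flatten everything to the same 1-D shape before combining the conditions (array names as in the question; the data here is made up):

import numpy as np

n = 1000
amplitude = np.random.rand(n, 1)
distance = np.random.rand(n, 1) * 20
weight = np.random.rand(n, 1)

# Flatten so every condition has the same shape and nothing broadcasts to (n, n)
amplitude, distance, weight = amplitude.ravel(), distance.ravel(), weight.ravel()
index = np.where((distance < 10) & (distance >= 5) & (weight > 0.8))
selected = amplitude[index]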
