Say I have multiple different vectors of the same length
Example:
1: [1, 2, 3, 4]
2: [5, 6, 7, 8]
3: [3, 8, 9, 10]
4: [6, 9, 12, 3]
And I want to figure out the optimal integer coefficients for these vectors such that their weighted sum is as close as possible to a specified goal vector.
Goal Vector: [55,101,115,60]
Assuming the combination only involves adding arrays together (no subtraction), how would I go about doing this? Are there any Python libraries (numpy, scikit, etc.) that would help me do this? I suspect that it is a linear algebra solution.
Example Combination Answer: [3, 3, 3, 1, 2, 4, 1, 1, 1, 2, 3, 4]
where each of the values refers to one of those arrays. (This is just a random example.)
You could write your problem as a system of linear equations:
a*arr1[0] + b*arr2[0] + c*arr3[0] + d*arr4[0] = res[0]
a*arr1[1] + b*arr2[1] + c*arr3[1] + d*arr4[1] = res[1]
a*arr1[2] + b*arr2[2] + c*arr3[2] + d*arr4[2] = res[2]
a*arr1[3] + b*arr2[3] + c*arr3[3] + d*arr4[3] = res[3]
# for all non-negative a, b, c, d
Which you could then solve, if there is an exact solution.
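For instance, with numpy (a minimal sketch: np.linalg.solve gives the exact real-valued solution whenever the coefficient matrix is invertible, and the coefficients may well come out negative or non-integer):

import numpy as np

# columns are arr1..arr4, rows are the four equations above
A = np.array([[1, 5, 3, 6],
              [2, 6, 8, 9],
              [3, 7, 9, 12],
              [4, 8, 10, 3]])
res = np.array([55, 101, 115, 60])

coeffs = np.linalg.solve(A, res)  # solves A @ coeffs == res exactly
print(coeffs)

If the matrix were singular, np.linalg.lstsq would be the fallback.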
If there is no exact solution, scipy has a function that computes the non-negative least-squares solution to a linear matrix equation: scipy.optimize.nnls.
from scipy import optimize
import numpy as np
arr1 = [1, 2, 3, 4]
arr2 = [5, 6, 7, 8]
arr3 = [3, 8, 9, 10]
arr4 = [6, 9, 12, 3]
res = [55,101,115,60]
a = np.array([
[arr1[0], arr2[0], arr3[0], arr4[0]],
[arr1[1], arr2[1], arr3[1], arr4[1]],
[arr1[2], arr2[2], arr3[2], arr4[2]],
[arr1[3], arr2[3], arr3[3], arr4[3]]
])
solution, _ = optimize.nnls(a, res)
print('Coefficients before Rounding', solution)
solution = solution.round()
print('Coefficients after Rounding', solution)
print('Results', [arr1[i]*solution[0] + arr2[i]*solution[1] + arr3[i]*solution[2] + arr4[i]*solution[3] for i in range(4)])
This would print
Coefficients before Rounding [0. 0.1915493 3.83943662 6.98826291]
Coefficients after Rounding [0. 0. 4. 7.]
Results [54.0, 95.0, 120.0, 61.0]
Pretty close, isn't it?
It could indeed happen that this is not the optimal integer solution. But as discussed in this thread, "integer problems are not even simple to solve" (@seberg).
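If you really need the best integer combination rather than a rounded least-squares one, and you only have a handful of vectors, a brute-force search is a simple (if crude) option. A sketch, with the coefficient bound 0..11 picked by hand to comfortably cover the rounded NNLS solution above:

import itertools
import numpy as np

A = np.array([[1, 5, 3, 6],
              [2, 6, 8, 9],
              [3, 7, 9, 12],
              [4, 8, 10, 3]])  # columns are arr1..arr4
res = np.array([55, 101, 115, 60])

best, best_err = None, np.inf
for combo in itertools.product(range(12), repeat=4):
    err = np.sum((A @ np.array(combo) - res) ** 2)  # squared error to the goal
    if err < best_err:
        best, best_err = combo, err
print(best, best_err)

For more vectors or larger bounds this blows up combinatorially, at which point an integer-programming formulation would be the next step.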
Related
I have a list with a random amount of integers and/or floats. What I'm trying to achieve is to find the exceptions inside my numbers (hoping I'm using the right words to explain this). For example:
list = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
90 to 99% of my integer values are between 1 and 20
sometimes there are values that are much higher, let's say somewhere around 100 or 1,000 or even more
My problem is that these values can be different all the time. Maybe the regular range is somewhere between 1,000 and 1,200, and the exceptions are in the range of half a million.
Is there a function to filter out these special numbers?
Assuming your list is l:
If you know you want to filter by a certain percentile/quantile, you can
use the snippet below. It removes the bottom 10% and the top 10% (everything
beyond the 90th percentile). Of course, you can change either cut-off (for
example, you can drop the bottom filter and only remove values above the 90th percentile):
import numpy as np
l = np.array(l)
l = l[(l>np.quantile(l,0.1)) & (l<np.quantile(l,0.9))].tolist()
output:
[3, 2, 14, 2, 8, 4, 3, 5]
If you are not sure of the percentile cut-off and are looking to
remove outliers:
You can adjust the cut-off for outliers via the argument m in the
function call. The larger it is, the fewer outliers are removed. This function, based on the median absolute deviation, seems to be more robust to various types of outliers than other outlier-removal techniques.
import numpy as np
l = np.array(l)
def reject_outliers(data, m=6.):
    # distance of each point from the median, scaled by the median of those distances
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / (mdev if mdev else 1.)
    return data[s < m].tolist()
print(reject_outliers(l))
output:
[1, 3, 2, 14, 2, 1, 8, 1, 4, 3, 5]
You can use the built-in filter() function:
lst1 = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
lst2 = list(filter(lambda x: x > 5,lst1))
print(lst2)
Output:
[14, 108, 8, 97]
Here is a method to screen out those deviators using the normal distribution:
import math

_list = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]

def consts(_list):
    # population mean
    mu = 0
    for i in _list:
        mu += i
    mu = mu / len(_list)
    # population standard deviation
    sigma = 0
    for i in _list:
        sigma += math.pow(i - mu, 2)
    sigma = math.sqrt(sigma / len(_list))
    return sigma, mu

def frequence(x, sigma, mu):
    # normal probability density at x
    return (1 / (sigma * math.sqrt(2 * math.pi))) * math.exp(-(1 / 2) * math.pow((x - mu) / sigma, 2))

sigma, mu = consts(_list)

new_list = []
for i in range(len(_list)):
    if frequence(_list[i], sigma, mu) > 0.01:
        new_list.append(_list[i])  # append the value itself, not its index
print(new_list)
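For what it's worth, the hand-rolled mean, standard deviation, and density above can be replaced with numpy and scipy.stats. A sketch doing the same thing (np.std with its default ddof=0 matches consts, and scipy.stats.norm.pdf plays the role of frequence):

import numpy as np
from scipy.stats import norm

data = np.array([1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5])
mu, sigma = data.mean(), data.std()  # population mean and std, as in consts()
kept = data[norm.pdf(data, mu, sigma) > 0.01].tolist()  # same 0.01 cutoff
print(kept)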
import numpy
square = numpy.reshape(range(0, 16), (4, 4))
square
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
In the above array, how do I access the primary diagonal and secondary diagonal of any given element? For example 9.
by primary diagonal, I mean - [4,9,14],
by secondary diagonal, I mean - [3,6,9,12]
I can't use numpy.diag() directly, because it works on the entire array rather than starting from a given element.
Based on your description, you can do this with np.where, np.diagonal, and np.fliplr:
import numpy as np
x, y = np.where(square == 9)
np.diagonal(square, offset=int(y[0] - x[0]))  # offset of the diagonal through (x, y)
Out[382]: array([ 4,  9, 14])
x, y = np.where(np.fliplr(square) == 9)
np.diagonal(np.fliplr(square), offset=int(y[0] - x[0]))
Out[396]: array([ 3,  6,  9, 12])
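If you need this for arbitrary elements, the same idea wraps into a small helper (a sketch; diagonals_through is a made-up name, and it assumes a square array in which the value occurs exactly once):

import numpy as np

def diagonals_through(arr, value):
    (r,), (c,) = np.where(arr == value)        # position of the element
    primary = np.diagonal(arr, offset=c - r)   # main-direction diagonal
    flipped = np.fliplr(arr)                   # flip so the anti-diagonal becomes a diagonal
    (fr,), (fc,) = np.where(flipped == value)
    secondary = np.diagonal(flipped, offset=fc - fr)
    return primary, secondary

square = np.arange(16).reshape(4, 4)
print(diagonals_through(square, 9))
# (array([ 4,  9, 14]), array([ 3,  6,  9, 12]))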
For the first diagonal, use the fact that both the x coordinate and the y coordinate increase by 1 each step:
def first_diagonal(x, y, length_array):
    # (x, y) = (row, column); the diagonal through it has offset x - y
    offset = x - y
    if offset >= 0:
        return zip(range(offset, length_array), range(length_array - offset))
    else:
        return zip(range(length_array - (y - x)), range(y - x, length_array))
For the secondary diagonal, use the fact that the x_coordinate + y_coordinate = constant.
def second_diagonal(x, y, length_array):
    tot = x + y  # constant along the anti-diagonal; assumes tot < length_array
    return zip(range(tot + 1), range(tot, -1, -1))
This gives you two lists you can use to access your matrix.
Of course, if you have a non-square matrix these functions will have to be reshaped a bit.
To illustrate how to get the desired output:
a = np.reshape(range(0, 16), (4, 4))
first = first_diagonal(2, 1, len(a))    # 9 sits at row 2, column 1
second = second_diagonal(2, 1, len(a))
primary_diagonal = [a[i[0]][i[1]] for i in first]
secondary_diagonal = [a[i[0]][i[1]] for i in second]
print(primary_diagonal)
print(secondary_diagonal)
this outputs:
[4, 9, 14]
[3, 6, 9, 12]
Say I have two lists of data as follows:
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 2, 3, 4, 5, 6, 8, 10, 12, 14]
That is, it's pretty clear that merely fitting a line to this data doesn't work; instead, the slope changes at some point in the data. (Obviously one can pinpoint that change pretty easily in this data set, but it's not as clear in the set I'm working with, so let's ignore that.) Something with the derivative, I'm guessing, but the point is that I want to treat the change point as a free parameter, where I say "it's this point, +/- this uncertainty, and here is the linear slope before and after this point."
Note, I can do this with an array if it's easier. Thanks!
You need to find two slopes (i.e., take two derivatives). First, find the slope between every pair of consecutive points (using numpy):
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([1, 2, 3, 4, 5, 6, 8, 10, 12, 14], dtype=float)
m = np.diff(y)/np.diff(x)
print (m)
# [ 1. 1. 1. 1. 1. 2. 2. 2. 2.]
Clearly, the slope changes from 1 to 2 in the sixth interval (between the sixth and seventh points). Then take the derivative of this array, which tells you where the slope changes:
print (np.diff(m))
[ 0. 0. 0. 0. 1. 0. 0. 0.]
To find the index of the non-zero value:
idx = np.nonzero(np.diff(m))[0]
print (idx)
# 4
Since we took one derivative with respect to x, and indices start from zero in Python, idx + 2 tells you that the slope is different before and after the sixth point.
I'm not sure I fully understand what you want, but you can see the evolution this way (first derivative):
>>> y = [1, 2, 3, 4, 5, 6, 8, 10, 12, 14]
>>> dy=[y[i+1]-y[i] for i in range(len(y)-1)]
>>> dy
[1, 1, 1, 1, 1, 2, 2, 2, 2]
and then find the point where it changes (second derivative):
>>> dpy=[dy[i+1]-dy[i] for i in range(len(dy)-1)]
>>> dpy
[0, 0, 0, 0, 1, 0, 0, 0]
If you want the index of this point:
>>> dpy.index(1)
4
which gives you the value of the last point before the slope changes:
>>> change=dpy.index(1)
>>> y[change]
5
In your y = [1, 2, 3, 4, 5, 6, 8, 10, 12, 14] the change happens at index 4 (list indexing starts at 0), and the value of y at this point is 5.
You can calculate the slope as the difference between each pair of points (the first derivative). Then check where the slope changes (the second derivative). If it changes, append the index location to idx, the collection of points where the slope changes.
Note that the first point does not have a unique slope. The second pair of points will give you the slope, but you need the third pair before you can measure the change in slope.
idx = []
prior_slope = float(y[1] - y[0]) / (x[1] - x[0])
for n in range(2, len(x)):  # start from the 3rd pair of points
    slope = float(y[n] - y[n - 1]) / (x[n] - x[n - 1])
    if slope != prior_slope:
        idx.append(n)
    prior_slope = slope
>>> idx
[6]
Of course this could be done more efficiently in Pandas or Numpy, but I am just giving you a simple Python 2 solution.
A simple conditional list comprehension should also be pretty efficient, although it is more difficult to understand.
idx = [n for n in range(2, len(x))
       if float(y[n] - y[n - 1]) / (x[n] - x[n - 1])
       != float(y[n - 1] - y[n - 2]) / (x[n - 1] - x[n - 2])]
Knee-point detection might be a potential solution.
from kneed import KneeLocator
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([1, 2, 3, 4, 5, 6, 8, 10, 12, 14])
kn = KneeLocator(x, y, curve='convex', direction='increasing')
# 'convex' and 'increasing' describe the shape of the curve; choose them to match your data
idx = (np.abs(x - kn.knee)).argmin()
>>> print(x[idx], y[idx])
6 6
I tried implementing the distance measure shown in the image, in Python, as follows:
import numpy as np
A = [1, 2, 3, 4, 5, 6, 7, 8, 1]
B = [1, 2, 3, 2, 4, 6, 7, 8, 2]
A = np.asarray(A).flatten()
B = np.asarray(B).flatten()
x = np.sum(1 - np.divide((1 + np.minimum(A, B)), (1 + np.maximum(A, B))))
print("Distance: {}".format(x))
but after testing, it doesn't seem to be the right approach. The maximum value returned when there is no similarity at all between the given vectors should be 1, with 0 meaning perfect similarity. A and B in the image are both vectors of size m.
Edit: forgot to add that I ignored the branch for min(A, B) < 0, as that won't ever happen for my purposes.
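For reference, since the image isn't reproduced here: judging from the code in the answer below, the per-element term for min(A_i, B_i) >= 0 is

d_i = 1 - (1 + min(A_i, B_i)) / (1 + max(A_i, B_i))

Each d_i lies in [0, 1), so the plain sum over the m elements can grow up to m rather than 1; if the measure is supposed to top out at 1 for completely dissimilar vectors, the sum presumably needs dividing by m.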
This should work. First, we create a matrix AB by stacking the columns and calculate the minimum vector AB_min and maximum vector AB_max out of that. Then, we compute D as you defined it, making use of numpy.where to specify the two conditions. After that, we sum the elements to get the D_proposed as you defined it. It gives a value of 0.9 for this example.
import numpy as np

A = [1, 2, 3, 4, 5, 6, 7, 8, 1]
B = [1, 2, 3, 2, 4, 6, 7, 8, 2]

AB = np.column_stack((A, B))
AB_min = np.min(AB, 1)
AB_max = np.max(AB, 1)

print(AB_min)
print(AB_max)

D = np.where(AB_min >= 0.,
             1. - (1. + AB_min) / (1. + AB_max),
             1. - (1. + AB_min + abs(AB_min)) / (1. + AB_max + abs(AB_min)))

print(D)

D_proposed = np.sum(D)
print(D_proposed)
I have three numpy arrays:
X: a 3073 x 49000 matrix
W: a 10 x 3073 matrix
y: a 49000 x 1 vector
y contains values between 0 and 9; each value represents a row in W.
I would like to add the first column of X to the row in W given by the first element in y. I.e. if the first element in y is 3, add the first column of X to the fourth row of W. Then add the second column of X to the row in W given by the second element in y, and so on, until all columns of X have been added to the rows in W specified by y, which means a total of 49000 additions.
W[y] += X.T does not work for me, because this will not add more than one vector to a row in W.
Please note: I'm only looking for vectorized solutions. I.e. no for-loops.
EDIT: To clarify I'll add an example with small matrix sizes adapted from Salvador Dali's example below.
In [1]: import numpy as np
In [2]: a, b, c = 3, 4, 5
In [3]: np.random.seed(0)
In [4]: X = np.random.randint(10, size=(b,c))
In [5]: W = np.random.randint(10, size=(a,b))
In [6]: y = np.random.randint(a, size=(c,1))
In [7]: X
Out[7]:
array([[5, 0, 3, 3, 7],
       [9, 3, 5, 2, 4],
       [7, 6, 8, 8, 1],
       [6, 7, 7, 8, 1]])

In [8]: W
Out[8]:
array([[5, 9, 8, 9],
       [4, 3, 0, 3],
       [5, 0, 2, 3]])

In [9]: y
Out[9]:
array([[0],
       [1],
       [1],
       [2],
       [0]])

In [10]: W[y.ravel()] + X.T
Out[10]:
array([[10, 18, 15, 15],
       [ 4,  6,  6, 10],
       [ 7,  8,  8, 10],
       [ 8,  2, 10, 11],
       [12, 13,  9, 10]])

In [11]: W[y.ravel()] = W[y.ravel()] + X.T

In [12]: W
Out[12]:
array([[12, 13,  9, 10],
       [ 7,  8,  8, 10],
       [ 8,  2, 10, 11]])
The problem is to get BOTH column 0 and column 4 of X added to row 0 of W, as well as both columns 1 and 2 of X added to row 1 of W.
The desired outcome is thus:
W = [[17, 22, 16, 16],
     [ 7, 11, 14, 17],
     [ 8,  2, 10, 11]]
First, the straightforward loop solution as a reference:
In [65]: for i, j in enumerate(y):
    ...:     W[j] += X[:, i]
    ...:

In [66]: W
Out[66]:
array([[17, 22, 16, 16],
       [ 7, 11, 14, 17],
       [ 8,  2, 10, 11]])
An add.at solution:
In [67]: W = W1.copy()   # W1 holds a copy of the original W

In [68]: np.add.at(W, (y.ravel()), X.T)

In [69]: W
Out[69]:
array([[17, 22, 16, 16],
       [ 7, 11, 14, 17],
       [ 8,  2, 10, 11]])
add.at does an unbuffered calculation, getting around the buffering that prevents W[y.ravel()] += X.T from working. It is still iterative, but the loop has been moved to compiled code. It isn't true vectorization because the order of application matters. The addition for one row of X.T depends on the results from the previous rows.
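To see the buffering issue concretely, here is a minimal sketch with a repeated index:

import numpy as np

w = np.zeros(3)
idx = np.array([0, 0, 1])

w[idx] += 1.0  # buffered: the duplicate index 0 only counts once
print(w)       # [1. 1. 0.]

w = np.zeros(3)
np.add.at(w, idx, 1.0)  # unbuffered: both additions to index 0 land
print(w)                # [2. 1. 0.]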
https://stackoverflow.com/a/20811014/901925 is the answer I gave a couple of years ago to a similar question (for 1d arrays).
But when dealing with your large arrays:
X: a 3073 x 49000 matrix
W: a 10 x 3073 matrix
y: a 49000 x 1 vector
this can run into speed issues. Note that W[y.ravel()] is the same size as X.T (why did you pick these sizes that require transpose?). And it's a copy, not a view. So there's already a time penalty.
bincount has been suggested in previous questions, and I think it is faster; see Making for loop with index arrays faster, which covers both the bincount and add.at solutions.
Iterating over the small 3073 dimension could also have speed advantages. Or better yet on the size 10 dimension as Divakar demonstrates.
For the small test case, a,b,c = 3,4,5, the add.at solution is fastest, with Divakar's bincount and einsum next. For a larger a,b,c = 10,1000,20000, add.at gets very slow, and bincount is the fastest.
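A rough timing harness for that larger case (a sketch; absolute numbers depend on your machine and numpy version, and the bincount variant includes the minlength guard discussed below):

import numpy as np
from timeit import timeit

a, b, c = 10, 1000, 20000
X = np.random.rand(b, c)
W = np.random.rand(a, b)
y = np.random.randint(a, size=(c, 1))

def with_add_at():
    out = W.copy()
    np.add.at(out, y.ravel(), X.T)
    return out

def with_bincount():
    out = W.copy()
    N = y.max() + 1
    ids = y.ravel() + np.arange(X.shape[0])[:, None] * N
    out[:N] += np.bincount(ids.ravel(), weights=X.ravel(),
                           minlength=N * X.shape[0]).reshape(-1, N).T
    return out

print(timeit(with_add_at, number=3))
print(timeit(with_bincount, number=3))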
Related SO answers
https://stackoverflow.com/a/28205888/901925 (notes that bincount requires complete coverage for y).
https://stackoverflow.com/a/30041823/901925 (where Divakar again shows that bincount rules!)
Vectorized approaches
Approach #1
Based on this answer, here's a vectorized solution using np.bincount -
N = y.max()+1
id = y.ravel() + np.arange(X.shape[0])[:,None]*N
# minlength guarantees the full N*X.shape[0] bins even if the largest ids
# never occur, so the reshape always works
W[:N] += np.bincount(id.ravel(), weights=X.ravel(),
                     minlength=N*X.shape[0]).reshape(-1,N).T
Approach #2
You can make good usage of boolean indexing and np.einsum to get the job done in a concise vectorized manner -
N = y.max()+1
W[:N] += np.einsum('ijk,lk->il',(np.arange(N)[:,None,None] == y.ravel()),X)
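As a quick sanity check, Approach #1 reproduces the desired outcome on the small example from the question's edit (W is cast to float here because np.bincount returns float weights, which cannot be added in place into an integer array):

import numpy as np

np.random.seed(0)
a, b, c = 3, 4, 5
X = np.random.randint(10, size=(b, c))
W = np.random.randint(10, size=(a, b)).astype(float)
y = np.random.randint(a, size=(c, 1))

N = y.max() + 1
ids = y.ravel() + np.arange(X.shape[0])[:, None] * N
W[:N] += np.bincount(ids.ravel(), weights=X.ravel(),
                     minlength=N * X.shape[0]).reshape(-1, N).T
print(W)
# [[17. 22. 16. 16.]
#  [ 7. 11. 14. 17.]
#  [ 8.  2. 10. 11.]]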
Loopy approaches
Approach #3
Since you are selecting and adding up a huge number of columns from X per unique y, it might be better in terms of performance to run a loop with complexity equal to the number of such unique y's, which is at most the number of rows in W, and in your case that is just 10. Thus, the loop has just 10 iterations, not bad! Here's the implementation to fulfill those aspirations -
for k in range(W.shape[0]):
    W[k] += X[:, (y == k).ravel()].sum(1)
Approach #4
You can bring in np.einsum to do the columnwise summations and have the final output like so -
for k in range(W.shape[0]):
    W[k] += np.einsum('ij->i', X[:, (y == k).ravel()])
This will achieve what you want: X + W[y.ravel()].T
To see that this really works, here is a reproducible example:
import numpy as np
np.random.seed(0)
a, b, c = 3, 5, 4 # you can use your 3073, 49000, 10 later
X = np.random.rand(a, b)
W = np.random.rand(c, a)
y = np.random.randint(c, size=(b, 1))
Now your matrices are:
W:
[[ 0.0871293   0.0202184   0.83261985]
 [ 0.77815675  0.87001215  0.97861834]
 [ 0.79915856  0.46147936  0.78052918]
 [ 0.11827443  0.63992102  0.14335329]]
y:
[[3]
 [0]
 [3]
 [2]
 [0]]
X:
[[ 0.5488135   0.71518937  0.60276338  0.54488318  0.4236548 ]
 [ 0.64589411  0.43758721  0.891773    0.96366276  0.38344152]
 [ 0.79172504  0.52889492  0.56804456  0.92559664  0.07103606]]
And W[y.ravel()] gives you the rows of W selected by the elements of y. By transposing it, you get a matrix ready to be added to X:
[[ 0.11827443  0.0871293   0.11827443  0.79915856  0.0871293 ]
 [ 0.63992102  0.0202184   0.63992102  0.46147936  0.0202184 ]
 [ 0.14335329  0.83261985  0.14335329  0.78052918  0.83261985]]
While I can't say that this is very pythonic, it is a solution (I think). Note the +=, which is what makes repeated additions to the same row accumulate:
for column in range(X.shape[1]):
    W[y[column]] += X[:, column]