This is a two-part question.
Part 1
Given the following Numpy array:
foo = np.array([[22.5, 20. ,  0. , 20. ],
                [24. , 40. ,  0. ,  8. ],
                [ 0. ,  0. , 50. ,  9.9],
                [ 0. ,  0. ,  0. ,  9. ],
                [ 0. ,  0. ,  0. ,  2.5]])
what is the most efficient way to (i) find the two smallest possible sums of values across columns (taking only cell values greater than zero into account), where for every column exactly one row is used, and (ii) keep track of the array index locations visited on that route?
For the array above, this would be: minimum_bar = 22.5 + 20 + 50 + 2.5 = 95 at indices [0,0], [0,1], [2,2], [4,3], and next_best_bar = 22.5 + 20 + 50 + 8 = 100.5 at indices [0,0], [0,1], [2,2], [1,3].
Part 2
Similar to Part 1, but now with the constraint that the row-wise sums of foo (for any row used in the solution) must be greater than the values in a given array (for example np.array([10, 10, 10, 10, 10])). In other words sum(row[0]) > array[0] = 62.5 > 10 = True, but sum(row[4]) > array[4] = 2.5 > 10 = False.
In which case the result is: minimum_bar = 22.5 + 20 + 50 + 9.9 = 102.4 at indices [0,0], [0,1], [2,2], [2,3] and next_best_bar = 22.5 + 20 + 50 + 20 = 112.5 at indices [0,0], [0,1], [2,2], [0,3].
My initial approach was to find all possible routes (combinations of indices using itertools) but this solution does not scale well for large matrix sizes (e.g., mxn=500x500).
Here's one solution that I came up with (hopefully I didn't misunderstand anything in your question):
import numpy as np

def minimum_routes(foo):
    assert len(foo) >= 2
    # every column must contain at least one positive value
    assert np.all(np.any(foo > 0, axis=0))
    foo = foo.astype(float)
    foo[foo <= 0] = np.inf   # ignore non-positive cells
    foo.sort(0)              # sort each column independently
    minimum_bar = foo[0]     # per-column minima
    next_best_bar = minimum_bar.copy()
    # swap in the second-smallest value in the column where it costs the least
    c = np.argmin(np.abs(foo[0] - foo[1]))
    next_best_bar[c] = foo[1, c]
    return minimum_bar, next_best_bar
Let's test it:
foo = np.array([[22.5, 20. ,  0. , 20. ],
                [24. , 40. ,  0. ,  8. ],
                [ 0. ,  0. , 50. ,  9.9],
                [ 0. ,  0. ,  0. ,  9. ],
                [ 0. ,  0. ,  0. ,  2.5]])
# PART 1
minimum_bar, next_best_bar = minimum_routes(foo)
# (array([22.5, 20. , 50. , 2.5]), array([24. , 20. , 50. , 2.5]))
# PART 2
constraint = np.array([10, 10, 10, 10, 10])
minimum_bar, next_best_bar = minimum_routes(foo[foo.sum(1) > constraint])
# (array([22.5, 20. , 50. , 8. ]), array([24., 20., 50., 8.]))
To find the indices:
np.where(foo == minimum_bar)
np.where(foo == next_best_bar)
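Since the returned bars hold one value per column, zipping the output of np.where gives the (row, column) pairs from the question; for the Part 1 result, for example (this assumes each selected value occurs only once in its column, otherwise there would be extra matches):
minimum_bar, next_best_bar = minimum_routes(foo)
rows, cols = np.where(foo == minimum_bar)
list(zip(rows, cols))
# [(0, 0), (0, 1), (2, 2), (4, 3)]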
Related
I am trying to modify the function below so it produces the expected values. The function is meant to sum all the values of Numbers that fall between two consecutive elements of the limits. None of the values in Numbers are between 0 and 2, so the first result is 0. However, the values between 2 and 5 within Numbers are 3 and 4, so the next result is 3 + 4 = 7. The function was taken from this issue: issue.
def formating(a, b):
    # sort the bin edges
    x = np.sort(b)
    # assign each value to a bin
    l = np.digitize(a, x)
    # sum the values per bin
    result = np.bincount(l, weights=a)
    return result
Numbers = np.array([3, 4, 5, 7, 8, 10,20])
limit1 = np.array([0, 2 , 5, 12, 15])
limit2 = np.array([0, 2 , 5, 12])
limit3 = np.array([0, 2 , 5, 12, 15, 22])
result1= formating(Numbers, limit1)
result2= formating(Numbers, limit2)
result3= formating(Numbers, limit3)
Current output
result1: [ 0. 0. 7. 30. 0. 20.]
result2: [ 0. 0. 7. 30. 20.]
result3: [ 0. 0. 7. 30. 0. 20.]
Wanted Output:
result1: [ 0. 7. 30. 0.]
result2: [ 0. 7. 30. ]
result3: [ 0. 7. 30. 0. 20.]
So just throw away the bin for values below the first edge and the bins for numbers off the end:
result1 = result1[1:len(limit1)]
result2 = result2[1:len(limit2)]
result3 = result3[1:len(limit3)]
Or, for smarter results, end the function with:
    result = np.bincount(l, weights=a)
    return result[1:len(b)]
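With that ending in place, re-running the three calls reproduces the wanted output shown above (a quick check, assuming the function is redefined with the corrected last two lines):
result1 = formating(Numbers, limit1)   # [ 0.  7. 30.  0.]
result2 = formating(Numbers, limit2)   # [ 0.  7. 30.]
result3 = formating(Numbers, limit3)   # [ 0.  7. 30.  0. 20.]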
The function below is meant to sum the second-column values of Numbers wherever the corresponding first-column value Numbers[:,0] falls between two consecutive elements of the limits limit1-3. For the first bin, none of the values of Numbers[:,0] are between 0 and 2 (the first two elements of limit1), so the result is 0. For the second bin, 3 and 4 in Numbers[:,0] fall between 2 and 5 in limit1, so the corresponding second-column values are summed: 1 + 3 = 4. How could I implement this in the function below?
def formating(a, b, c):
    # sort the bin edges
    x = np.sort(c)
    # assign each value of `a` to a bin
    l = np.digitize(a, x)
    # sum the weights `b` per bin
    result = np.bincount(l, weights=b)
    return result[1:len(b)]
Numbers = np.array([[3,1], [4,3], [5,3], [7,11], [8,9], [10,20] , [20, 45]])
limit1 = np.array([0, 2 , 5, 12, 15])
limit2 = np.array([0, 2 , 5, 12])
limit3 = np.array([0, 2 , 5, 12, 15, 22])
result1= formating(Numbers[:,0], Numbers[:,1], limit1)
result2= formating(Numbers[:,0], Numbers[:,1], limit2)
result3= formating(Numbers[:,0], Numbers[:,1], limit3)
Expected Output
result1: [ 0. 4. 43. 0. ]
result2: [ 0. 4. 43. ]
result3: [ 0. 4. 43. 0. 45.]
Current Output
result1: [ 0. 4. 43. 0. 45.]
result2: [ 0. 4. 43. 45.]
result3: [ 0. 4. 43. 0. 45.]
This:
return result[1:len(b)]
should be
return result[1:len(c)]
Your return vector is dependent on the length of your bins, not your input data.
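Putting it together, a minimal sketch of the corrected function (same logic as above, only the final slice changed), which reproduces the expected output:
def formating(a, b, c):
    # sort the bin edges and assign each value of `a` to a bin
    x = np.sort(c)
    l = np.digitize(a, x)
    # sum the weights `b` per bin
    result = np.bincount(l, weights=b)
    # slice by the number of bin edges, not by the length of the weights
    return result[1:len(c)]

result1 = formating(Numbers[:, 0], Numbers[:, 1], limit1)  # [ 0.  4. 43.  0.]
result2 = formating(Numbers[:, 0], Numbers[:, 1], limit2)  # [ 0.  4. 43.]
result3 = formating(Numbers[:, 0], Numbers[:, 1], limit3)  # [ 0.  4. 43.  0. 45.]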
I would like to make a sparse matrix from a dense one, such that in each row or column only the n largest elements are preserved. I do the following:
import numpy as np
import scipy.sparse as spsp

def sparsify(K, min_nnz=5):
    '''
    Eliminate the elements that are not among the `min_nnz` largest
    of their row or of their column.

    Parameters
    ----------
    K : ndarray
        the input matrix
    min_nnz : int
        the minimal number of elements in a row or column to be preserved
    '''
    # keep an element if it is in the top min_nnz of its row or of its column
    cond = np.bitwise_or(
        K >= -np.partition(-K, min_nnz - 1, axis=1)[:, min_nnz - 1][:, None],
        K >= -np.partition(-K, min_nnz - 1, axis=0)[min_nnz - 1, :][None, :])
    return spsp.csr_matrix(np.where(cond, K, 0))
This approach works as intended, but it does not seem to be the most efficient or the most robust one. What would you recommend as a better way to do it?
Example usage:
A = np.random.rand(10, 10)
A_sp = sparsify(A, min_nnz = 3)
Instead of making another dense matrix, you can use coo_matrix to build up using only the values you need:
return spsp.coo_matrix((K[cond], np.where(cond)), shape = K.shape)
As for the rest, you can maybe short-circuit the second dimension, but your time savings will depend entirely on your inputs:
def sparsify(K, min_nnz=5):
    '''
    Eliminate the elements that are not among the `min_nnz` largest
    of their row or of their column.

    Parameters
    ----------
    K : ndarray
        the input matrix
    min_nnz : int
        the minimal number of elements in a row or column to be preserved
    '''
    # keep everything that is in the top min_nnz of its column
    cond = K >= -np.partition(-K, min_nnz - 1, axis=0)[min_nnz - 1, :]
    # only rows that do not yet have min_nnz surviving elements need the row test
    mask = cond.sum(1) < min_nnz
    cond[mask] = np.bitwise_or(
        cond[mask],
        K[mask] >= -np.partition(-K[mask], min_nnz - 1, axis=1)[:, min_nnz - 1][:, None])
    return spsp.coo_matrix((K[cond], np.where(cond)), shape=K.shape)
Testing:
sparsify(A)
Out[]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 58 stored elements in COOrdinate format>
sparsify(A).A
Out[]:
array([[0. , 0. , 0.61362248, 0. , 0.73648987,
0.64561856, 0.40727807, 0.61674005, 0.53533315, 0. ],
[0.8888361 , 0.64548039, 0.94659603, 0.78474203, 0. ,
0. , 0.78809603, 0.88938798, 0. , 0.37631541],
[0.69356682, 0. , 0. , 0. , 0. ,
0.7386594 , 0.71687659, 0.67750768, 0.58002451, 0. ],
[0.67241433, 0.71923718, 0.95888737, 0. , 0. ,
0. , 0.82773085, 0.69788448, 0.63736915, 0.4263064 ],
[0. , 0.65831794, 0. , 0. , 0.59850093,
0. , 0. , 0.61913869, 0.65024867, 0.50860294],
[0.75522891, 0. , 0.93342402, 0.8284258 , 0.64471939,
0.6990814 , 0. , 0. , 0. , 0.32940821],
[0. , 0.88458635, 0.62460096, 0.60412265, 0.66969674,
0. , 0.40318741, 0. , 0. , 0.44116059],
[0. , 0. , 0.500971 , 0.92291245, 0. ,
0.8862903 , 0. , 0.375885 , 0.49473635, 0. ],
[0.86920647, 0.85157893, 0.89883006, 0. , 0.68427193,
0.91195162, 0. , 0. , 0.94762875, 0. ],
[0. , 0.6435456 , 0. , 0.70551006, 0. ,
0.8075527 , 0. , 0.9421039 , 0.91096934, 0. ]])
sparsify(A).A.astype(bool).sum(0)
Out[]: array([5, 6, 7, 5, 5, 6, 5, 7, 7, 5])
sparsify(A).A.astype(bool).sum(1)
Out[]: array([6, 7, 5, 7, 5, 6, 6, 5, 6, 5])
I have this big series of length t (t = 200K rows):
prices = [200, 100, 500, 300 ..]
and I want to calculate a matrix (t×t) where each value is calculated as:
matrix[i][j] = prices[j]/prices[i] - 1
I tried this using a double for-loop, but it's too slow. Any ideas how to do it faster?
matrix = np.empty((len(prices), len(prices)))
for i, p0 in enumerate(prices):
    for j, p1 in enumerate(prices):
        matrix[i][j] = p1 / p0 - 1
A vectorized solution is to use np.meshgrid with prices and 1/prices as arguments (note that prices must be an array), then multiply the two results and subtract 1 in order to compute matrix[i][j] = prices[j]/prices[i] - 1:
a, b = np.meshgrid(p, 1/p)
a * b - 1
As an example:
p = np.array([1,4,2])
Would give:
a, b = np.meshgrid(p, 1/p)
a * b - 1
array([[ 0. , 3. , 1. ],
[-0.75, 0. , -0.5 ],
[-0.5 , 1. , 0. ]])
Quick check of some of the cells:
(i,j) prices[j]/prices[i] - 1
--------------------------------
(1,1) 1/1 - 1 = 0
(1,2) 4/1 - 1 = 3
(1,3) 2/1 - 1 = 1
(2,1) 1/4 - 1 = -0.75
Another solution:
[p] / np.array([p]).T - 1
array([[ 0. , 3. , 1. ],
[-0.75, 0. , -0.5 ],
[-0.5 , 1. , 0. ]])
There are two idiomatic ways of doing an outer product-type operation. Either use the .outer method of universal functions, here np.divide:
In [2]: p = np.array([10, 20, 30, 40])
In [3]: np.divide.outer(p, p)
Out[3]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
Alternatively, use broadcasting:
In [4]: p[:, None] / p[None, :]
Out[4]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
The p[None, :] itself could also be spelled as a reshape, p.reshape((1, len(p))), but the former is more readable.
Both are equivalent to a double for-loop:
In [6]: o = np.empty((len(p), len(p)))
In [7]: for i in range(len(p)):
...: for j in range(len(p)):
...: o[i, j] = p[i] / p[j]
...:
In [8]: o
Out[8]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
I guess it can be done in this way:
import numpy
prices = [200., 300., 100., 500., 600.]
x = numpy.array(prices).reshape(1, len(prices))
matrix = (1/x.T) * x - 1
Let me explain in detail. This matrix is the matrix product of a column vector of element-wise reciprocal price values and a row vector of the original price values; then a matrix of ones of the same shape is subtracted from the result.
First of all, we create a row vector from the prices list:
x = numpy.array(prices).reshape(1, len(prices))
Reshaping is required here; otherwise the vector would have shape (len(prices),) instead of the required (1, len(prices)).
Then we compute a column vector of element-wise reciprocal price values:
(1/x.T)
Finally, we compute the resulting matrix
matrix = (1/x.T) * x - 1
Here the trailing - 1 is broadcast to a matrix of the same shape as (1/x.T) * x.
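As a quick sanity check of matrix[i][j] = prices[j]/prices[i] - 1 with the prices list above:
matrix[0, 1]   # prices[1]/prices[0] - 1 = 300/200 - 1 = 0.5
matrix[2, 3]   # prices[3]/prices[2] - 1 = 500/100 - 1 = 4.0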
I have an ndarray. From this array I need to choose the N numbers with the biggest values. I found heapq.nlargest to find the N largest entries, but I need to extract their indexes.
I want to build a new array where only the N rows with the largest weights in the first column survive. The rest of the rows will be replaced by random values.
import numpy as np
import heapq # For choosing list of max values
a = [[1.1,2.1,3.1], [2.1,3.1,4.1], [5.1,0.1,7.1],[0.1,1.1,1.1],[4.1,3.1,9.1]]
a = np.asarray(a)
maxVal = heapq.nlargest(2,a[:,0])
if __name__ == '__main__':
    print a
    print maxVal
The output I have is:
[[ 1.1 2.1 3.1]
[ 2.1 3.1 4.1]
[ 5.1 0.1 7.1]
[ 0.1 1.1 1.1]
[ 4.1 3.1 9.1]]
[5.0999999999999996, 4.0999999999999996]
but what I need is [2, 4] as the indexes to build a new array. The indexes are the rows, so if in this example I want to replace the rest with 0, I need to finish with:
[[ 0.0  0.0  0.0]
 [ 0.0  0.0  0.0]
 [ 5.1  0.1  7.1]
 [ 0.0  0.0  0.0]
 [ 4.1  3.1  9.1]]
I am stuck at the point where I need the indexes. The original array has 1000 rows and 100 columns. The weights are normalized floating points, and I don't want to do something like if a[:,1] == maxVal[0]: because the weights are sometimes very close and I could end up with more matches for maxVal[0] than my original N.
Is there any simple way to extract the indexes in this setup so I can replace the rest of the array?
If you only have 1000 rows, I would forget about the heap and use np.argsort on the first column:
>>> np.argsort(a[:,0])[::-1][:2]
array([2, 4])
If you want to put it all together, it would look something like:
def trim_rows(a, n):
    # zero out every row whose first-column value is not among the n largest
    idx = np.argsort(a[:, 0])[:-n]
    a[idx] = 0
>>> a = np.random.rand(10, 4)
>>> a
array([[ 0.34416425, 0.89021968, 0.06260404, 0.0218131 ],
[ 0.72344948, 0.79637177, 0.70029863, 0.20096129],
[ 0.27772833, 0.05372373, 0.00372941, 0.18454153],
[ 0.09124461, 0.38676351, 0.98478492, 0.72986697],
[ 0.84789887, 0.69171688, 0.97718206, 0.64019977],
[ 0.27597241, 0.26705301, 0.62124467, 0.43337711],
[ 0.79455424, 0.37024814, 0.93549275, 0.01130491],
[ 0.95113795, 0.32306471, 0.47548887, 0.20429272],
[ 0.3943888 , 0.61586129, 0.02776393, 0.2560126 ],
[ 0.5934556 , 0.23093912, 0.12550062, 0.58542137]])
>>> trim_rows(a, 3)
>>> a
array([[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0.84789887, 0.69171688, 0.97718206, 0.64019977],
[ 0. , 0. , 0. , 0. ],
[ 0.79455424, 0.37024814, 0.93549275, 0.01130491],
[ 0.95113795, 0.32306471, 0.47548887, 0.20429272],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ]])
And for your data size it's probably fast enough:
In [7]: a = np.random.rand(1000, 100)
In [8]: %timeit -n1 -r1 trim_rows(a, 50)
1 loops, best of 1: 7.65 ms per loop
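If the array were much larger, a partial sort would avoid sorting all the rows; here is a variant sketch (not part of the answer above) using np.argpartition, which only guarantees that the n largest first-column values end up in the last n positions:
def trim_rows_argpartition(a, n):
    # unordered indices of the n rows with the largest first-column values
    top = np.argpartition(a[:, 0], -n)[-n:]
    keep = np.zeros(len(a), dtype=bool)
    keep[top] = True
    a[~keep] = 0   # zero out every other row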