I have a big series of length t (t = 200K rows):
prices = [200, 100, 500, 300, ...]
and I want to calculate a (t x t) matrix where each value is computed as:
matrix[i][j] = prices[j]/prices[i] - 1
I tried this using a double for loop, but it's too slow. Any ideas on how to do it faster?
for i, p0 in enumerate(prices):
    for j, p1 in enumerate(prices):
        matrix[i][j] = p1 / p0 - 1
A vectorized solution is to use np.meshgrid with prices and 1/prices as arguments (note that prices must be an array), then multiply the two outputs and subtract 1 to compute matrix[i][j] = prices[j]/prices[i] - 1:
a, b = np.meshgrid(p, 1/p)
a * b - 1
As an example:
p = np.array([1,4,2])
Would give:
a, b = np.meshgrid(p, 1/p)
a * b - 1
array([[ 0. , 3. , 1. ],
[-0.75, 0. , -0.5 ],
[-0.5 , 1. , 0. ]])
Quick check of some of the cells (0-based indices):
(i,j)  prices[j]/prices[i] - 1
--------------------------------
(0,0)  1/1 - 1 = 0
(0,1)  4/1 - 1 = 3
(0,2)  2/1 - 1 = 1
(1,0)  1/4 - 1 = -0.75
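For reference, here is what the two meshgrid outputs look like for this p, so the product above is easy to trace: a repeats p along each row, while b repeats 1/p down each column.
a, b = np.meshgrid(p, 1/p)
a    # array([[1, 4, 2],
     #        [1, 4, 2],
     #        [1, 4, 2]])
b    # array([[1.  , 1.  , 1.  ],
     #        [0.25, 0.25, 0.25],
     #        [0.5 , 0.5 , 0.5 ]])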
Another solution:
[p] / np.array([p]).T - 1
array([[ 0. , 3. , 1. ],
[-0.75, 0. , -0.5 ],
[-0.5 , 1. , 0. ]])
There are two idiomatic ways of doing an outer product-type operation. Either use the .outer method of universal functions, here np.divide:
In [2]: p = np.array([10, 20, 30, 40])
In [3]: np.divide.outer(p, p)
Out[3]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
Alternatively, use broadcasting:
In [4]: p[:, None] / p[None, :]
Out[4]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
This p[None, :] itself can also be spelled as a reshape, p.reshape((1, len(p))), but the None (i.e. np.newaxis) form is more readable.
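As a quick check that the two spellings agree (same p as above):
p[None, :].shape                                     # (1, 4)
p.reshape((1, len(p))).shape                         # (1, 4)
np.array_equal(p[None, :], p.reshape((1, len(p))))   # True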
Both are equivalent to a double for-loop:
In [6]: o = np.empty((len(p), len(p)))
In [7]: for i in range(len(p)):
   ...:     for j in range(len(p)):
   ...:         o[i, j] = p[i] / p[j]
   ...:
In [8]: o
Out[8]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
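Note that both spellings above compute p[i] / p[j], while the question's matrix[i][j] = prices[j]/prices[i] - 1 has the opposite orientation, so transpose (or swap the axes) and subtract 1. A minimal sketch, assuming prices is already a NumPy float array:
p = np.array([200.0, 100.0, 500.0, 300.0])
matrix = p[None, :] / p[:, None] - 1    # matrix[i, j] == p[j] / p[i] - 1
# equivalently: np.divide.outer(p, p).T - 1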
I guess it can be done this way:
import numpy
prices = [200., 300., 100., 500., 600.]
x = numpy.array(prices).reshape(1, len(prices))
matrix = (1/x.T) * x - 1
Let me explain in detail. This matrix is the outer product of a column vector of element-wise reciprocal price values and a row vector of the original price values. Then a matrix of ones of the same size is subtracted from the result.
First of all, we create a row vector from the prices list:
x = numpy.array(prices).reshape(1, len(prices))
Reshaping is required here; otherwise the vector would have shape (len(prices),) instead of the required (1, len(prices)).
Then we compute a column vector of element-wise reciprocal price values:
(1/x.T)
Finally, we compute the resulting matrix
matrix = (1/x.T) * x - 1
Here the trailing - 1 is broadcast to a matrix of the same shape as (1/x.T) * x.
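As a quick sanity check, the vectorized result can be compared against the naive double loop from the question:
expected = numpy.array([[p1/p0 - 1 for p1 in prices] for p0 in prices])
assert numpy.allclose(matrix, expected)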
I have a 2D MxN array A, each row of which is a sequence of indices, padded with -1's at the end, e.g.:
[[ 2 1 -1 -1 -1]
[ 1 4 3 -1 -1]
[ 3 1 0 -1 -1]]
I have another MxN array of float values B:
[[ 0.7 0.4 1.5 2.0 4.4 ]
[ 0.8 4.0 0.3 0.11 0.53]
[ 0.6 7.4 0.22 0.71 0.06]]
and I want to use the indices in A to filter B, i.e. for each row of B, only the positions whose indices appear in the corresponding row of A keep their values, and all other positions are set to 0.0, so the result would look like:
[[ 0.0 0.4 1.5 0.0 0.0 ]
[ 0.0 4.0 0.0 0.11 0.53 ]
[ 0.6 7.4 0.0 0.71 0.0]]
What's a good way to do this in "pure" numpy? (I would like to do this in pure numpy so I can jit it in JAX.)
Numpy supports fancy indexing. Ignoring the "-1" entries for the moment, you can do something like this:
index = (np.arange(B.shape[0]).reshape(-1, 1), A)
result = np.zeros_like(B)
result[index] = B[index]
This works because the indices are broadcast together: the column np.arange(B.shape[0]).reshape(-1, 1) pairs every element of a given row of A with the corresponding row in B and result.
This example does not address the fact that -1 is a valid numpy index (it points at the last column). You need to clear the last-column elements in rows where A contains a -1 but 4 (the last column's index) is not actually present:
mask = (A == -1).any(axis=1) & (A != A.shape[1] - 1).all(axis=1)
result[mask, -1] = 0.0
Here, the mask is [True, False, True]: even though the second row has a -1 in it, it also contains a 4, so its last element is kept.
This approach is fairly efficient. It will create no more than a couple of boolean arrays of the same shape as A for the mask.
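Putting the two steps together on the arrays from the question, a minimal end-to-end sketch:
import numpy as np

A = np.array([[ 2,  1, -1, -1, -1],
              [ 1,  4,  3, -1, -1],
              [ 3,  1,  0, -1, -1]])
B = np.array([[0.7, 0.4, 1.5,  2.0,  4.4 ],
              [0.8, 4.0, 0.3,  0.11, 0.53],
              [0.6, 7.4, 0.22, 0.71, 0.06]])

index = (np.arange(B.shape[0]).reshape(-1, 1), A)
result = np.zeros_like(B)
result[index] = B[index]

# clear last-column entries that were set only because of the -1 padding
mask = (A == -1).any(axis=1) & (A != A.shape[1] - 1).all(axis=1)
result[mask, -1] = 0.0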
You can use broadcasting, but note that it will create a large intermediate array of shape (M, N, N) (in pure numpy at least):
import numpy as np
A = ...
B = ...
M, N = A.shape
out = np.where(np.any(A[..., None] == np.arange(N), axis=1), B, 0.0)
out:
array([[0. , 0.4 , 1.5 , 0. , 0. ],
[0. , 4. , 0. , 0.11, 0.53],
[0.6 , 7.4 , 0. , 0.71, 0. ]])
Another possible solution:
maxr = np.max(A, axis=1)                      # per-row maximum: a valid index already used in that row
A = np.where(A == -1, maxr.reshape(-1,1), A)  # replace the -1 padding with that harmless duplicate index
mask = np.zeros(np.shape(B), dtype=bool)
np.put_along_axis(mask, A, True, axis=1)      # mark the indexed positions row by row
np.where(mask, B, 0)
Output:
array([[0. , 0.4 , 1.5 , 0. , 0. ],
[0. , 4. , 0. , 0.11, 0.53],
[0.6 , 7.4 , 0. , 0.71, 0. ]])
EDIT (when there are rows with only -1)
The following code handles the possibility, raised by @MadPhysicist (whom I thank), of rows containing only -1 -- it only takes adding 2 lines to my previous code.
A = np.array([[ 2, 1, -1, -1, -1],
[ -1, -1, -1, -1, -1],
[ 3, 1, 0, -1, -1]])
B = np.array([[ 0.7, 0.4, 1.5, 2.0, 4.4 ],
[ 0.8, 4.0, 0.3, 0.11, 0.53],
[ 0.6, 7.4, 0.22, 0.71, 0.06]])
rminus1 = np.all(A == -1, axis=1) # new
maxr = np.max(A, axis=1)
A = np.where(A == -1, maxr.reshape(-1,1), A)
mask = np.zeros(np.shape(B), dtype=bool)
np.put_along_axis(mask, A, True, axis=1)
C = np.where(mask, B, 0)
C[rminus1, :] = 0 # new
Output:
array([[0. , 0.4 , 1.5 , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0.6 , 7.4 , 0. , 0.71, 0. ]])
This is a two part question.
Part 1
Given the following Numpy array:
foo = array([[22.5, 20. , 0. , 20. ],
[24. , 40. , 0. , 8. ],
[ 0. , 0. , 50. , 9.9],
[ 0. , 0. , 0. , 9. ],
[ 0. , 0. , 0. , 2.5]])
what is the most efficient way to (i) find the two smallest possible sums of values across columns (taking into account only cell values greater than zero), where for every column exactly one row is used, and (ii) keep track of the array index locations visited on that route?
For example, in the example above this would be: minimum_bar = 22.5 + 20 + 50 + 2.5 = 95 at indices [0,0], [0,1], [2,2], [4,3] and next_best_bar = 22.5 + 20 + 50 + 8 = 100.5 at indices [0,0], [0,1], [2,2], [1,3].
Part 2
Similar to Part 1, but now with the constraint that the row-wise sums of foo (for any row used in the solution) must be greater than the values in an array (for example, np.array([10, 10, 10, 10, 10])). In other words, sum(row[0]) > array[0] is 62.5 > 10 = True, but sum(row[4]) > array[4] is 2.5 > 10 = False.
In which case the result is: minimum_bar = 22.5 + 20 + 50 + 9.9 = 102.4 at indices [0,0], [0,1], [2,2], [2,3] and next_best_bar = 22.5 + 20 + 50 + 20 = 112.5 at indices [0,0], [0,1], [2,2], [0,3].
My initial approach was to find all possible routes (combinations of indices using itertools) but this solution does not scale well for large matrix sizes (e.g., mxn=500x500).
Here's one solution that I came up with (hopefully I didn't misunderstand anything in your question)
def minimum_routes(foo):
    assert len(foo) >= 2
    assert np.all(np.any(foo > 0, axis=0))  # every column needs at least one positive entry
    foo = foo.astype(float)
    foo[foo <= 0] = np.inf   # ignore non-positive cells
    foo.sort(0)              # sort each column ascending; inf sinks to the bottom
    minimum_bar = foo[0]     # column-wise minima
    next_best_bar = minimum_bar.copy()
    # upgrade the one column where stepping to the second-smallest value costs the least
    c = np.argmin(np.abs(foo[0] - foo[1]))
    next_best_bar[c] = foo[1, c]
    return minimum_bar, next_best_bar
Let's test it:
foo = np.array([[22.5, 20. , 0. , 20. ],
[24. , 40. , 0. , 8. ],
[ 0. , 0. , 50. , 9.9],
[ 0. , 0. , 0. , 9. ],
[ 0. , 0. , 0. , 2.5]])
# PART 1
minimum_bar, next_best_bar = minimum_routes(foo)
# (array([22.5, 20. , 50. , 2.5]), array([24. , 20. , 50. , 2.5]))
# PART 2
constraint = np.array([10, 10, 10, 10, 10])
minimum_bar, next_best_bar = minimum_routes(foo[foo.sum(1) > constraint])
# (array([22.5, 20. , 50. , 8. ]), array([24., 20., 50., 8.]))
To find the indices:
np.where(foo == minimum_bar)
np.where(foo == next_best_bar)
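If you want (row, column) pairs rather than two separate index arrays, you can zip the results of np.where; with the Part 1 values from above:
rows, cols = np.where(foo == minimum_bar)
list(zip(rows, cols))    # [(0, 0), (0, 1), (2, 2), (4, 3)]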
I have the following numpy array:
foo = np.array([[0.0, 10.0], [0.13216, 12.11837], [0.25379, 42.05027], [0.30874, 13.11784]])
which yields:
[[ 0. 10. ]
[ 0.13216 12.11837]
[ 0.25379 42.05027]
[ 0.30874 13.11784]]
How can I normalize the Y component of this array so it gives me something like:
[[ 0. 0. ]
[ 0.13216 0.06 ]
[ 0.25379 1 ]
[ 0.30874 0.097]]
Referring to the Cross Validated link How to normalize data to 0-1 range?, it looks like you can perform min-max normalisation on the last column of foo:
v = foo[:, 1] # foo[:, -1] for the last column
foo[:, 1] = (v - v.min()) / (v.max() - v.min())
foo
array([[ 0. , 0. ],
[ 0.13216 , 0.06609523],
[ 0.25379 , 1. ],
[ 0.30874 , 0.09727968]])
Another option for performing normalisation (as suggested by the OP) is using sklearn.preprocessing.normalize, which yields slightly different results:
from sklearn.preprocessing import normalize
foo[:, [-1]] = normalize(foo[:, -1, None], norm='max', axis=0)
foo
array([[ 0. , 0.2378106 ],
[ 0.13216 , 0.28818769],
[ 0.25379 , 1. ],
[ 0.30874 , 0.31195614]])
sklearn.preprocessing.MinMaxScaler can also be used (feature_range=(0, 1) is default):
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
v = foo[:, 1].reshape(-1, 1)  # sklearn scalers expect a 2-D array
foo[:, 1] = min_max_scaler.fit_transform(v).ravel()
print(foo)
Output:
[[ 0. 0. ]
[ 0.13216 0.06609523]
[ 0.25379 1. ]
[ 0.30874 0.09727968]]
The advantage is that scaling to any range can be done.
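For instance, to scale the second column to [-1, 1] instead (any other feature_range works the same way):
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(-1, 1))
foo[:, 1] = min_max_scaler.fit_transform(foo[:, 1].reshape(-1, 1)).ravel()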
I think you want this:
foo[:,1] = (foo[:,1] - foo[:,1].min()) / (foo[:,1].max() - foo[:,1].min())
You are trying to min-max scale only the second column between 0 and 1.
Using sklearn.preprocessing.minmax_scale should easily solve your problem.
e.g.:
from sklearn.preprocessing import minmax_scale
column_1 = foo[:,0] #first column you don't want to scale
column_2 = minmax_scale(foo[:,1], feature_range=(0,1)) #second column you want to scale
foo_norm = np.stack((column_1, column_2), axis=1) #stack both columns to get a 2d array
Should yield
array([[0. , 0. ],
[0.13216 , 0.06609523],
[0.25379 , 1. ],
[0.30874 , 0.09727968]])
Maybe you want to min-max scale between 0 and 1 both columns. In this case, use:
foo_norm = minmax_scale(foo, feature_range=(0,1), axis=0)
Which yields
array([[0. , 0. ],
[0.42806245, 0.06609523],
[0.82201853, 1. ],
[1. , 0.09727968]])
Note: not to be confused with the operation that scales the norm (length) of a vector to a certain value (usually 1), which is also commonly referred to as normalization.
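For contrast, a quick sketch of that other sense of normalization, scaling the column to unit L2 length (assuming numpy is imported as np):
v = foo[:, 1]
v_unit = v / np.linalg.norm(v)   # np.linalg.norm(v_unit) is now 1.0 (up to floating point)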
I was reading and came across this formula for cosine similarity: sim(u, v) = (u · v) / (||u|| ||v||). I thought this looked interesting, and I created a numpy array that has user_id as row and item_id as column. For instance, let M be this matrix:
M = [[2,3,4,1,0],[0,0,0,0,5],[5,4,3,0,0],[1,1,1,1,1]]
Here the entries inside the matrix are the ratings user u has given to item i, based on row u and column i. I want to calculate this cosine similarity for this matrix between items (columns). This should yield a 5 x 5 matrix, I believe. I tried to do
df = pd.DataFrame(M)
item_mean_subtracted = df.sub(df.mean(axis=0), axis=1)
similarity_matrix = item_mean_subtracted.fillna(0).corr(method="pearson").values
However, this does not seem right.
Here's a possible implementation of the adjusted cosine similarity:
import numpy as np
from scipy.spatial.distance import pdist, squareform
M = np.asarray([[2, 3, 4, 1, 0],
[0, 0, 0, 0, 5],
[5, 4, 3, 0, 0],
[1, 1, 1, 1, 1]])
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
Remarks:
I'm taking advantage of NumPy broadcasting to subtract the mean.
If M is a sparse matrix, you could do something like this: M.toarray().
From the docs:
Y = pdist(X, 'cosine')
Computes the cosine distance between vectors u and v:
1 - (u · v) / (||u||_2 ||v||_2)
where ||*||_2 is the 2-norm of its argument *, and u · v is the dot product of u and v.
Array transposition is performed through the .T attribute.
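Also worth noting: pdist returns a condensed 1-D vector of pairwise distances, and squareform expands it into the full symmetric matrix. A tiny sketch with a made-up 3-point input:
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
d = pdist(X, 'cosine')   # shape (3,): one entry per pair (0,1), (0,2), (1,2)
D = squareform(d)        # full symmetric 3 x 3 matrix with a zero diagonal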
Demo:
In [277]: M_u
Out[277]: array([ 2. , 1. , 2.4, 1. ])
In [278]: item_mean_subtracted
Out[278]:
array([[ 0. , 1. , 2. , -1. , -2. ],
[-1. , -1. , -1. , -1. , 4. ],
[ 2.6, 1.6, 0.6, -2.4, -2.4],
[ 0. , 0. , 0. , 0. , 0. ]])
In [279]: np.set_printoptions(precision=2)
In [280]: similarity_matrix
Out[280]:
array([[ 1. , 0.87, 0.4 , -0.68, -0.72],
[ 0.87, 1. , 0.8 , -0.65, -0.91],
[ 0.4 , 0.8 , 1. , -0.38, -0.8 ],
[-0.68, -0.65, -0.38, 1. , 0.27],
[-0.72, -0.91, -0.8 , 0.27, 1. ]])
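As a sanity check, one cell can be recomputed directly from the cosine formula; for items 0 and 1 (columns 0 and 1 of item_mean_subtracted):
u = item_mean_subtracted[:, 0]
v = item_mean_subtracted[:, 1]
u @ v / (np.linalg.norm(u) * np.linalg.norm(v))   # ~0.87, matching similarity_matrix[0, 1]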
I'm filling a numpy array using a power-law equation. The problem is that part of my domain tries to do numpy.power(x, n) when x is negative and n is not an integer. In this part of the domain I want the value to be 0.0. Below is code that has the correct behavior, but is there a more Pythonic way to do this?
import numpy as npy

# note mesh.x is a numpy array of length nx
myValues = npy.zeros(nx)
para = [5.8780046, 0.714285714, 2.819250868]
for j in range(nx):
    if mesh.x[j] > para[1]:
        myValues[j] = para[0]*npy.power(mesh.x[j]-para[1], para[2])
    else:
        myValues[j] = 0.0
Is "numpythonic" a word? It should be a word. The following is really neither pythonic nor unpythonic, but it is much more efficient than using a for loop, and close(r) to the way Travis would probably do it:
import numpy
mesh_x = numpy.array([0.5,1.0,1.5])
myValues = numpy.zeros_like( mesh_x )
para = [5.8780046, 0.714285714, 2.819250868]
mask = mesh_x > para[1]
myValues[mask] = para[0] * numpy.power(mesh_x[mask] - para[1], para[2])
print(myValues)
For very large problems you would probably want to avoid creating temporary arrays:
mask = mesh.x > para[1]
myValues[mask] = mesh.x[mask]
myValues[mask] -= para[1]
myValues[mask] **= para[2]
myValues[mask] *= para[0]
Here's one approach with np.where to choose values between the power calculation and 0:
import numpy as np
np.where(mesh.x>para[1],para[0]*np.power(mesh.x-para[1],para[2]),0)
Explanation:
np.where(mask, A, B) chooses elements from A or B depending on the mask elements. In our case the mask is mesh.x > para[1], a vectorized comparison over all mesh.x elements in one go.
para[0]*np.power(mesh.x - para[1], para[2]) gives the elements to be chosen where a mask element is True; otherwise we choose 0, the third argument to np.where.
More of an explanation of the answers given by @jez and @Divakar, with simple examples, than an answer itself. They both rely on some form of boolean indexing.
>>>
>>> a
array([[-4.5, -3.5, -2.5],
[-1.5, -0.5, 0.5],
[ 1.5, 2.5, 3.5]])
>>> n = 2.2
>>> a ** n
array([[ nan, nan, nan],
[ nan, nan, 0.21763764],
[ 2.44006149, 7.50702771, 15.73800567]])
np.where is made for this: it selects one of two values based on a boolean array.
>>> np.where(np.isnan(a**n), 0, a**n)
array([[ 0. , 0. , 0. ],
[ 0. , 0. , 0.21763764],
[ 2.44006149, 7.50702771, 15.73800567]])
>>>
>>> b = np.where(a < 0, 0, a)
>>> b
array([[ 0. , 0. , 0. ],
[ 0. , 0. , 0.5],
[ 1.5, 2.5, 3.5]])
>>> b **n
array([[ 0. , 0. , 0. ],
[ 0. , 0. , 0.21763764],
[ 2.44006149, 7.50702771, 15.73800567]])
Use of boolean indexing on the left-hand side and the right-hand side. This is similar to np.where:
>>>
>>> a[a >= 0] = a[a >= 0] ** n
>>> a
array([[ -4.5 , -3.5 , -2.5 ],
[ -1.5 , -0.5 , 0.21763764],
[ 2.44006149, 7.50702771, 15.73800567]])
>>> a[a < 0] = 0
>>> a
array([[ 0. , 0. , 0. ],
[ 0. , 0. , 0.21763764],
[ 2.44006149, 7.50702771, 15.73800567]])