I have an array in the following form where the first two columns are supposed to be indices of a 2-dimensional array and the following columns are arbitrary values.
data = np.array([[ 0. , 1. , 48. , 4. ],
[ 1. , 2. , 44. , 4.4],
[ 1. , 1. , 34. , 2.3],
[ 0. , 2. , 55. , 2.2],
[ 0. , 0. , 42. , 2. ],
[ 1. , 0. , 22. , 1. ]])
How do I combine the indices data[:,:2] with their values data[:,2:] such that the resulting array is accessible by the indices in the first two columns?
In my example that would be:
result = np.array([[[42. , 2. ], [48. , 4. ], [55. , 2.2]],
[[22. , 1. ], [34. , 2.3], [44. , 4.4]]])
I know that there is a trivial solution using python loops. But performance is a concern since I'm dealing with a huge amount of data. Specifically it's output of another program that I need to process.
Maybe there is a relatively trivial numpy solution as well. But I'm kind of stuck.
If it helps the following can be safely assumed:
All numbers in the first two columns are whole numbers (although the array consists of floats).
Every possible index (or rather combinations of indices) in the original array is used exactly once. I.e. there is guaranteed to be exactly one entry of the form [i, j, ...].
The indices start at 0 and I know the highest indices beforehand.
Edit:
Hmm. I see now how my example is misleading. The truth is that some of my input arrays are sorted, but that's unreliable. So I shouldn't assume anything about the order. I reordered some rows in my example to make it clearer. In case anyone wants to make sense of the answer and comment below: In my original question the array appeared to be sorted by the first two columns.
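Work out the number of rows, columns, and the depth from your data array, then fill the output as below. Note that the reshape relies on the rows already being sorted by the first two columns: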
import numpy as np
data = np.array([[ 0. , 0. , 42. , 2. ],
[ 0. , 1. , 48. , 4. ],
[ 0. , 2. , 55. , 2.2],
[ 1. , 0. , 22. , 1. ],
[ 1. , 1. , 34. , 2.3],
[ 1. , 2. , 44. , 4.4]])
row = int(max(data[:,0]))+1
col = int(max(data[:,1]))+1
depth = len(data[0, 2:])
# rows are sorted by (row, col), so a plain reshape lines them up
out = data[:, 2:].reshape(row, col, depth)
print(out)
Output:
[[[42. 2. ]
[48. 4. ]
[55. 2.2]]
[[22. 1. ]
[34. 2.3]
[44. 4.4]]]
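Since the edited question no longer guarantees sorted rows, here is a minimal sketch of an order-independent NumPy variant. It assumes, as the question states, that the first two columns hold whole-number indices and that each (i, j) pair occurs exactly once; the names idx and result are illustrative.

```python
import numpy as np

data = np.array([[0., 1., 48., 4.],
                 [1., 2., 44., 4.4],
                 [1., 1., 34., 2.3],
                 [0., 2., 55., 2.2],
                 [0., 0., 42., 2.],
                 [1., 0., 22., 1.]])

# Convert the float index columns to integers so they can be used as indices.
idx = data[:, :2].astype(np.intp)
rows, cols = idx.max(axis=0) + 1

# Scatter each row's values to position (i, j); row order does not matter.
result = np.empty((rows, cols, data.shape[1] - 2))
result[idx[:, 0], idx[:, 1]] = data[:, 2:]
print(result)
```

Because every index pair is written exactly once, np.empty is safe here; if some pairs could be missing, np.zeros or np.full with a sentinel would be the more cautious choice.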
You can use Numba in no-python parallel mode with explicit loops (accelerating Python loops is what Numba is for). As szczesny mentioned in the comments, this is one of the most efficient approaches in terms of performance, and it does not need the input to be sorted. The code below is written for exactly two value columns; it can be modified if that count varies:
import numba as nb
import numpy as np

# without signature --> @nb.njit(parallel=True)
@nb.njit("float64[:, :, ::1](float64[:, ::1])", parallel=True)
def numba_(data):
    # int64 indices: int8 would overflow for indices above 127
    data_ = data[:, :2].astype(np.int64)
    res = np.empty((data_[:, 0].max() + 1, data_[:, 1].max() + 1, 2))
    for i in nb.prange(data_.shape[0]):
        res[data_[i, 0], data_[i, 1], 0] = data[i, 2]
        res[data_[i, 0], data_[i, 1], 1] = data[i, 3]
    return res
[Benchmark plot: without the sorting, and with the proposed NumPy code corrected; horizontal axis: data.shape[0]]
A more general version that handles more than 2 value columns:
@nb.njit("float64[:, :, ::1](float64[:, ::1])", parallel=True)
def numba_(data):
    # int64 indices: int8 would overflow for indices above 127
    data_ = data[:, :2].astype(np.int64)
    assert data_.shape[0] == data.shape[0]
    depth = data[:, 2:].shape[1]
    res = np.empty((data_[:, 0].max() + 1, data_[:, 1].max() + 1, depth))
    for i in nb.prange(data_.shape[0]):
        for j in range(depth):
            res[data_[i, 0], data_[i, 1], j] = data[i, j + 2]
    return res
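For comparison, a NumPy-only sketch that keeps the reshape idea from the first answer but drops the sortedness assumption by ordering the rows with np.lexsort first. It relies on the question's guarantee that every (i, j) pair occurs exactly once; the variable names are illustrative, and no performance claim relative to the Numba version is implied.

```python
import numpy as np

data = np.array([[0., 1., 48., 4.],
                 [1., 2., 44., 4.4],
                 [1., 1., 34., 2.3],
                 [0., 2., 55., 2.2],
                 [0., 0., 42., 2.],
                 [1., 0., 22., 1.]])

# np.lexsort uses the LAST key as the primary one:
# sort by column 0 first, then by column 1 within equal rows.
order = np.lexsort((data[:, 1], data[:, 0]))

rows = int(data[:, 0].max()) + 1
cols = int(data[:, 1].max()) + 1

# After sorting, the value columns are in row-major (i, j) order.
result = data[order, 2:].reshape(rows, cols, -1)
print(result)
```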
I am using the following to calculate the running gradients between data in the same indexes across multiple matrices:
import numpy as np
array_1 = np.array([[1,2,3], [4,5,6]])
array_2 = np.array([[2,3,4], [5,6,7]])
array_3 = np.array([[1,8,9], [9,6,7]])
flat_1 = array_1.flatten()
flat_2 = array_2.flatten()
flat_3 = array_3.flatten()
print('flat_1: {0}'.format(flat_1))
print('flat_2: {0}'.format(flat_2))
print('flat_3: {0}'.format(flat_3))
data = []
gradient_list = []
for item in zip(flat_1,flat_2,flat_3):
    data.append(list(item))
    print('items: {0}'.format(list(item)))
    grads = np.gradient(list(item))
    print('grads: {0}'.format(grads))
    gradient_list.append(grads)
grad_array=np.array(gradient_list)
print('grad_array: {0}'.format(grad_array))
This doesn't look like an optimal way of doing this - is there a vectorized way of calculating gradients between data in 2d arrays?
numpy.gradient takes an axis parameter, so you can stack the arrays and then calculate the gradient along the stacking axis. For instance, stack with np.dstack, which joins the arrays along a new third axis, and call np.gradient with axis=2. If you need a different result shape, use the reshape method:
np.gradient(np.dstack((array_1, array_2, array_3)), axis=2)
#array([[[ 1. , 0. , -1. ],
# [ 1. , 3. , 5. ],
# [ 1. , 3. , 5. ]],
# [[ 1. , 2.5, 4. ],
# [ 1. , 0.5, 0. ],
# [ 1. , 0.5, 0. ]]])
Or, if you flatten the arrays first:
np.gradient(np.column_stack((array_1.ravel(), array_2.ravel(), array_3.ravel())), axis=1)
#array([[ 1. , 0. , -1. ],
# [ 1. , 3. , 5. ],
# [ 1. , 3. , 5. ],
# [ 1. , 2.5, 4. ],
# [ 1. , 0.5, 0. ],
# [ 1. , 0.5, 0. ]])
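As a further variant, here is a small sketch that stacks along a new leading axis with np.stack, so each slice of the result keeps the shape of the input matrices. The final reshape back to the (6, 3) layout of the loop version is only there to show the two approaches agree; the names stacked and grads are illustrative.

```python
import numpy as np

array_1 = np.array([[1, 2, 3], [4, 5, 6]])
array_2 = np.array([[2, 3, 4], [5, 6, 7]])
array_3 = np.array([[1, 8, 9], [9, 6, 7]])

# Shape (3, 2, 3): axis 0 runs over the three matrices.
stacked = np.stack((array_1, array_2, array_3))

# Gradient across the matrices for every (row, col) position at once.
grads = np.gradient(stacked, axis=0)

# One row per flattened element, one column per matrix,
# matching the layout of the loop-based grad_array.
grad_array = grads.reshape(3, -1).T
print(grad_array)
```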
My aim is to interpolate some data. To do that I have to create a meshgrid.
For this step, I have an array "coord" with my 2D coordinates (first column: element number, second: X, third: Y).
I build the meshgrid with np.meshgrid as you can see below.
But my results seem strange, so I would like to know whether I have made
a mistake. Do I have to reorganize my data before the meshgrid step?
import numpy as np
import matplotlib.pyplot as plt
coord = np.array([[ 1. , -1.38888667, -1.94444333],
[ 2. , -1.94444333, -1.38888667],
[ 3. , 0.27777667, -1.94444333],
[ 4. , -0.27777667, -1.38888667],
[ 5. , 1.94444333, -1.94444333],
[ 6. , 1.38888667, -1.38888667],
[ 7. , -1.38888667, -0.27777667],
[ 8. , -1.94444333, 0.27777667],
[ 9. , 0.27777667, -0.27777667],
[ 10. , -0.27777667, 0.27777667],
[ 11. , 1.94444333, -0.27777667],
[ 12. , 1.38888667, 0.27777667],
[ 13. , -1.38888667, 1.38888667],
[ 14. , -1.94444333, 1.94444333],
[ 15. , 0.27777667, 1.38888667],
[ 16. , -0.27777667, 1.94444333],
[ 17. , 1.94444333, 1.38888667],
[ 18. , 1.38888667, 1.94444333]])
[Y,X]=np.meshgrid(coord[:,2],coord[:,1])
If I plot Y, I get this:
plt.imshow(Y);plt.colorbar();plt.show()
---- EDIT LATER -----
I am wondering, for example, whether the coordinates passed to meshgrid have to be strictly increasing. Is there a better way when my coordinates are not ordered?
For the interpolation, I would like to use:
def interpolate(values, tri,uv,d=2):
    simplex = tri.find_simplex(uv)
    vertices = np.take(tri.simplices, simplex, axis=0)
    temp = np.take(tri.transform, simplex, axis=0)
    delta = uv- temp[:, d]
    bary = np.einsum('njk,nk->nj', temp[:, :d, :], delta)
    return np.einsum('nj,nj->n', np.take(values, vertices), np.hstack((bary, 1.0 - bary.sum(axis=1, keepdims=True))))
which was used before on Stack Overflow in "Speedup scipy griddata for multiple interpolations between two irregular grids" and helps limit the calculation time.
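One possible reading of the strange plot: np.meshgrid does not require sorted input, but it simply repeats the two vectors it is given, so passing all 18 scattered point coordinates produces 18x18 arrays in the original point order. A minimal sketch of building a regular target grid instead, assuming the goal is a grid spanned by the distinct X and Y values (the names xy, xs, ys are illustrative):

```python
import numpy as np

# X and Y columns of the 18 points from the question (element number dropped).
xy = np.array([[-1.38888667, -1.94444333], [-1.94444333, -1.38888667],
               [ 0.27777667, -1.94444333], [-0.27777667, -1.38888667],
               [ 1.94444333, -1.94444333], [ 1.38888667, -1.38888667],
               [-1.38888667, -0.27777667], [-1.94444333,  0.27777667],
               [ 0.27777667, -0.27777667], [-0.27777667,  0.27777667],
               [ 1.94444333, -0.27777667], [ 1.38888667,  0.27777667],
               [-1.38888667,  1.38888667], [-1.94444333,  1.94444333],
               [ 0.27777667,  1.38888667], [-0.27777667,  1.94444333],
               [ 1.94444333,  1.38888667], [ 1.38888667,  1.94444333]])

# np.unique returns the distinct values in increasing order.
xs = np.unique(xy[:, 0])
ys = np.unique(xy[:, 1])

# Regular grid: X varies along columns, Y along rows (default 'xy' indexing).
X, Y = np.meshgrid(xs, ys)
print(X.shape, Y.shape)
```

The scattered points themselves do not all lie on this grid, so the values still have to be interpolated onto (X, Y) in a separate step, for example with the triangulation-based interpolate function above.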
I am trying to implement non-negative matrix factorization to find the missing values of a matrix for a recommendation engine project. I am using the nimfa library to implement the factorization, but I can't figure out how to predict the missing values.
The missing values in this matrix are represented by 0.
a=[[ 1. 0.45643546 0. 0.1 0.10327956 0.0225877 ]
[ 0.15214515 1. 0.04811252 0.07607258 0.23570226 0.38271325]
[ 0. 0.14433757 1. 0.07905694 0. 0.42857143]
[ 0.1 0.22821773 0.07905694 1. 0. 0.27105237]
[ 0.06885304 0.47140452 0. 0. 1. 0.13608276]
[ 0.00903508 0.4592559 0.17142857 0.10842095 0.08164966 1. ]]
import numpy
import nimfa
model = nimfa.Lsnmf(a, max_iter=100000, rank=4)
# fit the model
fit = model()
# get U and V matrices from fit
U = fit.basis()
V = fit.coef()
print(numpy.dot(U, V))
But the answer given is nearly the same as a, so I can't predict the zero values.
Please tell me which method to use, or suggest other possible implementations and resources.
I want to use this function to minimize the error in predicting the values:
error = ||a - UV||_F + c*||U||_F + c*||V||_F
where _F denotes the Frobenius norm.
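To make the objective above concrete, here is a small sketch that evaluates it with NumPy. The function name regularized_error and the optional mask argument are not from the question; the mask is an assumption that restricts the first term to observed (non-zero) entries, which is one common way to stop the factorization from simply reproducing the zeros.

```python
import numpy as np

def regularized_error(a, U, V, c, mask=None):
    """||a - UV||_F + c*||U||_F + c*||V||_F, optionally on observed entries only."""
    residual = a - U @ V
    if mask is not None:
        # Hypothetical extension: ignore positions treated as missing.
        residual = residual * mask
    return (np.linalg.norm(residual, "fro")
            + c * np.linalg.norm(U, "fro")
            + c * np.linalg.norm(V, "fro"))

# Tiny illustrative example: the 0 in `a` stands for a missing value.
a = np.array([[1.0, 0.0],
              [0.5, 1.0]])
U = np.eye(2)
V = np.array([[1.0, 0.3],
              [0.5, 1.0]])   # predicts 0.3 at the missing position
observed = (a != 0).astype(float)

full = regularized_error(a, U, V, c=0.1)
masked = regularized_error(a, U, V, c=0.1, mask=observed)
print(full, masked)
```

Without the mask, the prediction at the missing position is penalized for differing from 0; with it, only the known entries contribute to the fit term.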
I have not used nimfa before, so I cannot say exactly how to do it there, but with sklearn you can use a preprocessing step to impute the missing values, like this:
In [28]: import numpy as np
In [29]: from sklearn.impute import SimpleImputer as Imputer  # Imputer was removed in scikit-learn 0.22
# prepare a numpy array
In [30]: a = np.array(a)
In [31]: a
Out[31]:
array([[ 1. , 0.45643546, 0. , 0.1 , 0.10327956,
0.0225877 ],
[ 0.15214515, 1. , 0.04811252, 0.07607258, 0.23570226,
0.38271325],
[ 0. , 0.14433757, 1. , 0.07905694, 0. ,
0.42857143],
[ 0.1 , 0.22821773, 0.07905694, 1. , 0. ,
0.27105237],
[ 0.06885304, 0.47140452, 0. , 0. , 1. ,
0.13608276],
[ 0.00903508, 0.4592559 , 0.17142857, 0.10842095, 0.08164966,
1. ]])
In [32]: pre = Imputer(missing_values=0, strategy='mean')
# transform missing_values as "0" using mean strategy
In [33]: pre.fit_transform(a)
Out[33]:
array([[ 1. , 0.45643546, 0.32464951, 0.1 , 0.10327956,
0.0225877 ],
[ 0.15214515, 1. , 0.04811252, 0.07607258, 0.23570226,
0.38271325],
[ 0.26600665, 0.14433757, 1. , 0.07905694, 0.35515787,
0.42857143],
[ 0.1 , 0.22821773, 0.07905694, 1. , 0.35515787,
0.27105237],
[ 0.06885304, 0.47140452, 0.32464951, 0.27271009, 1. ,
0.13608276],
[ 0.00903508, 0.4592559 , 0.17142857, 0.10842095, 0.08164966,
1. ]])
You can read more here.
I have a 5x17511 2D array (name = 'da') which was created by pandas.read_csv(...).
I also added one column for indexing like this: da.index = pd.date_range(...)
So my 2D array has size 6x17511 now.
I'd like to insert/append one more row to this 2D array. How can I do this?
I already tried np.insert(da,1,np.array((1,2,3,4,5,6)), 0), but it says:
ValueError: Shape of passed values is (6, 17512), indices imply (6,
17511)
Thanks in advance!
I have assumed this is a numpy question rather than a pandas question ...
You could try vstack ...
import numpy as np
da = np.random.rand(17511, 6)
newrow = np.array((1,2,3,4,5,6))
da = np.vstack([da, newrow])
Which yields ...
In [5]: da
Out[5]:
array([[ 0.50203777, 0.55102172, 0.74798053, 0.57291239, 0.38977322,
0.40878739],
[ 0.9960413 , 0.22293403, 0.34136638, 0.12845067, 0.20262593,
0.50798698],
[ 0.05298782, 0.09129754, 0.40833606, 0.67150583, 0.19569471,
0.75176924],
...,
[ 0.97927055, 0.44649323, 0.84851791, 0.05370892, 0.94375771,
0.24508979],
[ 0.85952039, 0.2852414 , 0.85662827, 0.97665465, 0.65528357,
0.71483845],
[ 1. , 2. , 3. , 4. , 5. ,
6. ]])
In [6]: len(da)
Out[6]: 17512
And (albeit with different random numbers), I can access the top and bottom of the numpy array as follows ...
In [9]: da[:5]
Out[9]:
array([[ 0.76697236, 0.96475768, 0.09145486, 0.27159858, 0.05160006,
0.66495098],
[ 0.62635043, 0.1316334 , 0.66257157, 0.99141318, 0.77212699,
0.17016979],
[ 0.86705298, 0.11120927, 0.29585339, 0.44128326, 0.32290492,
0.99298705],
[ 0.74053894, 0.90743885, 0.99838398, 0.40713677, 0.17337202,
0.56982539],
[ 0.99136919, 0.13045787, 0.67881652, 0.03814385, 0.98036307,
0.53594215]])
In [10]: da[-5:]
Out[10]:
array([[ 0.8793664 , 0.0392912 , 0.8106504 , 0.17920025, 0.26767578,
0.98386519],
[ 0.41231276, 0.02633723, 0.7872108 , 0.60894162, 0.5358851 ,
0.65758067],
[ 0.10341791, 0.48079533, 0.1638601 , 0.5470736 , 0.7339205 ,
0.60609949],
[ 0.55320512, 0.12962241, 0.84443947, 0.81012583, 0.22057856,
0.33495709],
[ 1. , 2. , 3. , 4. , 5. ,
6. ]])
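If da is in fact a pandas DataFrame with a date index, as the question's da.index = pd.date_range(...) suggests, a hedged pandas sketch would be to build a one-row frame and concatenate it. The column names, dates, and the five-value row here are made up for illustration; note that the index is not a data column, so the new row needs one value per actual column plus an index label.

```python
import numpy as np
import pandas as pd

# Small stand-in for the real frame: 3 rows, 5 data columns, daily index.
da = pd.DataFrame(np.arange(15.0).reshape(3, 5),
                  columns=list("abcde"),
                  index=pd.date_range("2020-01-01", periods=3, freq="D"))

# The new row gets its own index label: the next day after the last one.
new_row = pd.DataFrame([[1, 2, 3, 4, 5]],
                       columns=da.columns,
                       index=[da.index[-1] + pd.Timedelta(days=1)])

# Append by concatenating along the row axis.
da = pd.concat([da, new_row])
print(da.tail(2))
```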