Building a histogram from a dict without having to iterate over the keys - python

I have a dict containing numpy arrays of varying length:
MyDict = {0: array([[ 15. ,  3.89678216],
                    [ 36. ,  9.49245167],
                    [ 53. ,  3.82997799],
                    [ 83. ,  5.25727272],
                    [ 86. ,  8.76663208]]),
          1: array([[  4. ,  4.1171155 ],
                    [ 16. , 12.68122196],
                    [ 31. ,  8.64805222],
                    [ 37. ,  6.07202959]]),
          2: array([]),
          ...,
          90: array([[  1. ,  1.        ],
                     [ 24. ,  8.14221573],
                     [ 27. ,  7.36309862]])}
I would like to obtain a histogram of all the values in the dict. The solution I have now is to iterate over the keys in the dict and fill a numpy array with a histogram of fixed length:
for KeysElements in MyDict.keys():
    hist, bins = numpy.histogram(numpy.asarray(MyDict[KeysElements])[:, 1], 50)
    numpy_hist[KeysElements, :] = hist
I then sum all the histograms over the first dimension of the numpy array to obtain the histogram over all the keys of the initial dict:
Total_hist = numpy.sum(numpy_hist,axis=0)
The problem with this solution is that I do not know how to handle the bins, which change on each iteration, so my question is: is there any way to achieve this without having to build histograms in a loop?
Thanks for any advice or links.
Greg
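A minimal sketch of one way to keep the bins consistent across iterations, assuming the global value range is computed up front, is to pass the same explicit bin edges to numpy.histogram every time, which makes the per-key histograms directly summable:

import numpy as np

# shared, fixed bin edges (here a hypothetical 50-bin grid over the global range)
all_vals = np.concatenate([a[:, 1] for a in MyDict.values() if a.size])
edges = np.linspace(all_vals.min(), all_vals.max(), 51)

numpy_hist = np.zeros((len(MyDict), 50))
for row, key in enumerate(MyDict):
    if MyDict[key].size:
        numpy_hist[row, :], _ = np.histogram(MyDict[key][:, 1], bins=edges)
Total_hist = numpy_hist.sum(axis=0)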

You don't seem to use the MyDict keys or the 0th column of your np arrays. If that is the case, you could stack all the numpy arrays together and compute the histogram on the result:
import numpy as np
MyDict = {0: np.array([[ 15. ,  3.89678216],
                       [ 36. ,  9.49245167],
                       [ 53. ,  3.82997799],
                       [ 83. ,  5.25727272],
                       [ 86. ,  8.76663208]]),
          1: np.array([[  4. ,  4.1171155 ],
                       [ 16. , 12.68122196],
                       [ 31. ,  8.64805222],
                       [ 37. ,  6.07202959]]),
          2: np.array([]),
          90: np.array([[  1. ,  1.        ],
                        [ 24. ,  8.14221573],
                        [ 27. ,  7.36309862]])}
np_array = np.array([]).reshape(0, 2)
for i in MyDict:
    a = MyDict[i]
    if len(a.shape) == 2 and a.shape[1] == 2:
        np_array = np.append(np_array, a, axis=0)
print(np.histogram(np_array[:, 1], 50))  # histogram the value column only
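If only the value column matters, the loop can be dropped entirely; a minimal sketch (assuming empty entries should simply be skipped):

vals = np.concatenate([a[:, 1] for a in MyDict.values() if a.size])
print(np.histogram(vals, 50))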

Related

Combine array of indices with array of values

I have an array in the following form, where the first two columns are supposed to be indices of a 2-dimensional array and the following columns are arbitrary values.
data = np.array([[ 0. ,  1. , 48. ,  4. ],
                 [ 1. ,  2. , 44. ,  4.4],
                 [ 1. ,  1. , 34. ,  2.3],
                 [ 0. ,  2. , 55. ,  2.2],
                 [ 0. ,  0. , 42. ,  2. ],
                 [ 1. ,  0. , 22. ,  1. ]])
How do I combine the indices data[:, :2] with their values data[:, 2:] such that the resulting array is accessible by the indices in the first two columns?
In my example that would be:
result = np.array([[[42. ,  2. ], [48. ,  4. ], [55. ,  2.2]],
                   [[22. ,  1. ], [34. ,  2.3], [44. ,  4.4]]])
I know that there is a trivial solution using python loops. But performance is a concern since I'm dealing with a huge amount of data. Specifically, it's the output of another program that I need to process.
Maybe there is a relatively trivial numpy solution as well. But I'm kind of stuck.
If it helps the following can be safely assumed:
All numbers in the first two columns are whole numbers (although the array consists of floats).
Every possible index (or rather combination of indices) in the original array is used exactly once, i.e. there is guaranteed to be exactly one entry of the form [i, j, ...].
The indices start at 0 and I know the highest indices beforehand.
Edit:
Hmm. I see now how my example is misleading. The truth is that some of my input arrays are sorted, but that's unreliable. So I shouldn't assume anything about the order. I reordered some rows in my example to make it clearer. In case anyone wants to make sense of the answer and comment below: In my original question the array appeared to be sorted by the first two columns.
Find the row, column, and depth sizes from your data array, then fill as below (note this relies on the rows being sorted by the first two columns, as in this example):
import numpy as np
data = np.array([[ 0. ,  0. , 42. ,  2. ],
                 [ 0. ,  1. , 48. ,  4. ],
                 [ 0. ,  2. , 55. ,  2.2],
                 [ 1. ,  0. , 22. ,  1. ],
                 [ 1. ,  1. , 34. ,  2.3],
                 [ 1. ,  2. , 44. ,  4.4]])
row = int(max(data[:, 0])) + 1
col = int(max(data[:, 1])) + 1
depth = len(data[0, 2:])
# a plain reshape works here because the rows are sorted by the first two columns
out = data[:, 2:].reshape(row, col, depth)
print(out)
Output:
[[[42.   2. ]
  [48.   4. ]
  [55.   2.2]]

 [[22.   1. ]
  [34.   2.3]
  [44.   4.4]]]
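Since the question's edit notes that the row order is unreliable, here is a sketch of an order-independent variant of the same idea, using integer-array indexing instead of reshape:

i = data[:, 0].astype(int)
j = data[:, 1].astype(int)
out = np.empty((i.max() + 1, j.max() + 1, data.shape[1] - 2))
out[i, j] = data[:, 2:]  # each row lands in its own (i, j) cell, whatever the order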
You can use numba in no-python parallel mode with explicit loops (numba is built for accelerating Python loops); as szczesny mentioned in the comments, this will be one of the most efficient methods in terms of performance, and it does not need sorting. This code is written for exactly 2 value columns; it can be modified if that count is variable:
import numpy as np
import numba as nb

# without signature --> @nb.njit(parallel=True)
@nb.njit("float64[:, :, ::1](float64[:, ::1])", parallel=True)
def numba_(data):
    data_ = data[:, :2].astype(np.int8)
    res = np.empty((data_[:, 0].max() + 1, data_[:, 1].max() + 1, 2))
    for i in nb.prange(data_.shape[0]):
        res[data_[i, 0], data_[i, 1], 0] = data[i, 2]
        res[data_[i, 0], data_[i, 1], 1] = data[i, 3]
    return res
[Benchmark plot omitted: performance against the proposed NumPy code, without the sorting; horizontal axis: data.shape[0]]
A more general version that handles more than 2 value columns:
@nb.njit("float64[:, :, ::1](float64[:, ::1])", parallel=True)
def numba_(data):
    data_ = data[:, :2].astype(np.int8)
    assert data_.shape[0] == data.shape[0]
    depth = data[:, 2:].shape[1]
    res = np.empty((data_[:, 0].max() + 1, data_[:, 1].max() + 1, depth))
    for i in nb.prange(data_.shape[0]):
        for j in range(depth):
            res[data_[i, 0], data_[i, 1], j] = data[i, j + 2]
    return res
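A minimal usage sketch with the question's data (any row order works):

data = np.array([[ 0. ,  1. , 48. ,  4. ],
                 [ 1. ,  2. , 44. ,  4.4],
                 [ 1. ,  1. , 34. ,  2.3],
                 [ 0. ,  2. , 55. ,  2.2],
                 [ 0. ,  0. , 42. ,  2. ],
                 [ 1. ,  0. , 22. ,  1. ]])
print(numba_(data))  # same 2x3x2 result as the reshape approach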

Different results on using a function and its content

I was trying to understand the working of the fast_knn function of the impyute library, so I tried to execute it line by line in order to understand how it works. Here it is:
import numpy as np
from scipy.spatial import KDTree
def shepards(distances, power=2):
    return to_percentage(1 / np.power(distances, power))
def to_percentage(vec):
    return vec / np.sum(vec)
data_temp = np.arange(25).reshape((5, 5)).astype(float)  # np.float is deprecated in recent numpy
data_temp[0][2] = np.nan
k = 4
eps = 0
p = 2
distance_upper_bound = np.inf
leafsize = 10
idw_fn = shepards
init_impute_fn = mean  # impyute's column-mean imputer, presumably from impyute.imputation.cs
nan_xy = np.argwhere(np.isnan(data_temp))
data_temp_c = init_impute_fn(data_temp)
kdtree = KDTree(data_temp_c, leafsize=leafsize)
for x_i, y_i in nan_xy:
    distances, indices = kdtree.query(data_temp_c[x_i], k=k+1, eps=eps,
                                      p=p, distance_upper_bound=distance_upper_bound)
    # Will always return itself in the first index. Delete it.
    distances, indices = distances[1:], indices[1:]
    # Add small constant to distances to avoid division by 0
    distances += 1e-3
    weights = idw_fn(distances)
    # Assign missing value the weighted average of `k` nearest neighbours
    data_temp[x_i][y_i] = np.dot(weights, [data_temp_c[ind][y_i] for ind in indices])
data_temp
This outputs:
array([[ 0.        ,  1.        , 10.06569379,  3.        ,  4.        ],
       [ 5.        ,  6.        ,  7.        ,  8.        ,  9.        ],
       [10.        , 11.        , 12.        , 13.        , 14.        ],
       [15.        , 16.        , 17.        , 18.        , 19.        ],
       [20.        , 21.        , 22.        , 23.        , 24.        ]])
whereas the function itself has a different output. The code:
from impyute import fast_knn
import numpy as np
data_temp = np.arange(25).reshape((5, 5)).astype(float)
data_temp[0][2] = np.nan
fast_knn(data_temp, k=4)
and the output
array([[ 0.        ,  1.        , 16.78451885,  3.        ,  4.        ],
       [ 5.        ,  6.        ,  7.        ,  8.        ,  9.        ],
       [10.        , 11.        , 12.        , 13.        , 14.        ],
       [15.        , 16.        , 17.        , 18.        , 19.        ],
       [20.        , 21.        , 22.        , 23.        , 24.        ]])
There seem to be discrepancies between the GitHub repository code and the installed library's source code (the repository has not been updated). The following is the library source code:
def fast_knn(data, k=3, eps=0, p=2, distance_upper_bound=np.inf, leafsize=10, **kwargs):
    null_xy = find_null(data)
    data_c = mean(data)
    kdtree = KDTree(data_c, leafsize=leafsize)
    for x_i, y_i in null_xy:
        distances, indices = kdtree.query(data_c[x_i], k=k+1, eps=eps,
                                          p=p, distance_upper_bound=distance_upper_bound)
        # Will always return itself in the first index. Delete it.
        distances, indices = distances[1:], indices[1:]
        weights = distances/np.sum(distances)
        # Assign missing value the weighted average of `k` nearest neighbours
        data[x_i][y_i] = np.dot(weights, [data_c[ind][y_i] for ind in indices])
    return data
The weights are computed in a different manner (not using the shepards function). Hence, the difference in outputs.
Maybe you read the code on the current master branch of impyute, but the impyute package version you used is probably v0.0.8 (the most recent release), whose code is on the release/0.0.8 branch.
The difference in the definition of fast_knn is below.
On the current master branch:
# Will always return itself in the first index. Delete it.
distances, indices = distances[1:], indices[1:]
# Add small constant to distances to avoid division by 0
distances += 1e-3
weights = idw_fn(distances)
On release/0.0.8 branch:
# Will always return itself in the first index. Delete it.
distances, indices = distances[1:], indices[1:]
weights = distances/np.sum(distances)
If you use the code from the release/0.0.8 branch, you will get the same result as when you use the installed impyute package.
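To confirm which version is actually installed, a quick check with the standard library (assuming the distribution name on PyPI is impyute):

from importlib.metadata import version
print(version("impyute"))  # e.g. 0.0.8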

Vectorize an index-based matrix operation in numpy

How can I vectorize the following loop?
def my_fnc():
m = np.arange(27.).reshape((3,3,3))
ret = np.empty_like(m)
it = np.nditer(m, flags=['multi_index'])
for x in it:
i,j,k = it.multi_index
ret[i,j,k] = x / m[i,j,i]
return ret
Basically I'm dividing each value in m by something similar to a diagonal. Not all values in m will be different, the arange is just an example.
Thanks in advance! ~
P.S.: here's the output of the function above, don't mind the nans :)
array([[[        nan,         inf,         inf],
        [ 1.        ,  1.33333333,  1.66666667],
        [ 1.        ,  1.16666667,  1.33333333]],

       [[ 0.9       ,  1.        ,  1.1       ],
        [ 0.92307692,  1.        ,  1.07692308],
        [ 0.9375    ,  1.        ,  1.0625    ]],

       [[ 0.9       ,  0.95      ,  1.        ],
        [ 0.91304348,  0.95652174,  1.        ],
        [ 0.92307692,  0.96153846,  1.        ]]])
Use advanced indexing to get the m[i,j,i] equivalent in one go, then simply divide the input array by it -
r = np.arange(len(m))
ret = m/m[r,:,r,None] # Add new axis with None to allow for broadcasting
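A quick sanity check of the equivalence against the loop version (equal_nan=True handles the 0/0 entry in the first row):

m = np.arange(27.).reshape((3, 3, 3))
r = np.arange(len(m))
ret = m / m[r, :, r, None]
assert np.allclose(ret, my_fnc(), equal_nan=True)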

Numpy calculate gradients across matrices

I am using the following to calculate the running gradients between data at the same indices across multiple matrices:
import numpy as np
array_1 = np.array([[1,2,3], [4,5,6]])
array_2 = np.array([[2,3,4], [5,6,7]])
array_3 = np.array([[1,8,9], [9,6,7]])
flat_1 = array_1.flatten()
flat_2 = array_2.flatten()
flat_3 = array_3.flatten()
print('flat_1: {0}'.format(flat_1))
print('flat_2: {0}'.format(flat_2))
print('flat_3: {0}'.format(flat_3))
data = []
gradient_list = []
for item in zip(flat_1, flat_2, flat_3):
    data.append(list(item))
    print('items: {0}'.format(list(item)))
    grads = np.gradient(list(item))
    print('grads: {0}'.format(grads))
    gradient_list.append(grads)
grad_array = np.array(gradient_list)
print('grad_array: {0}'.format(grad_array))
This doesn't look like an optimal way of doing this - is there a vectorized way of calculating gradients between data in 2d arrays?
numpy.gradient takes an axis parameter, so you can just stack the arrays and then calculate the gradient along a certain axis. For instance, use np.dstack and take the gradient along axis=2; if you need a different shape as the result, just use the reshape method:
np.gradient(np.dstack((array_1, array_2, array_3)), axis=2)
#array([[[ 1. ,  0. , -1. ],
#        [ 1. ,  3. ,  5. ],
#        [ 1. ,  3. ,  5. ]],
#
#       [[ 1. ,  2.5,  4. ],
#        [ 1. ,  0.5,  0. ],
#        [ 1. ,  0.5,  0. ]]])
Or, if you flatten the arrays first:
np.gradient(np.column_stack((array_1.ravel(), array_2.ravel(), array_3.ravel())), axis=1)
#array([[ 1. ,  0. , -1. ],
#       [ 1. ,  3. ,  5. ],
#       [ 1. ,  3. ,  5. ],
#       [ 1. ,  2.5,  4. ],
#       [ 1. ,  0.5,  0. ],
#       [ 1. ,  0.5,  0. ]])
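To illustrate the reshape suggestion, a sketch that puts the flattened result back into the original 2x3 layout (grads_3d[..., i] is then the gradient at position i along the stack for each original cell):

grads = np.gradient(np.column_stack((array_1.ravel(),
                                     array_2.ravel(),
                                     array_3.ravel())), axis=1)
grads_3d = grads.reshape(array_1.shape + (3,))
print(grads_3d.shape)  # (2, 3, 3)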

Python - construction of meshgrid (irregular grid) with numpy

My aim is to interpolate some data. To do that I have to create a meshgrid.
For this step, I have an array with my 2D coordinates, "coord" (first column: element number, second: X, third: Y).
I build a meshgrid with np.meshgrid as you can see below.
But my results seem strange, so I would like to know if I have made a mistake... Do I have to reorganize my data before the meshgrid step?
import numpy as np
coord = np.array([[  1. , -1.38888667, -1.94444333],
                  [  2. , -1.94444333, -1.38888667],
                  [  3. ,  0.27777667, -1.94444333],
                  [  4. , -0.27777667, -1.38888667],
                  [  5. ,  1.94444333, -1.94444333],
                  [  6. ,  1.38888667, -1.38888667],
                  [  7. , -1.38888667, -0.27777667],
                  [  8. , -1.94444333,  0.27777667],
                  [  9. ,  0.27777667, -0.27777667],
                  [ 10. , -0.27777667,  0.27777667],
                  [ 11. ,  1.94444333, -0.27777667],
                  [ 12. ,  1.38888667,  0.27777667],
                  [ 13. , -1.38888667,  1.38888667],
                  [ 14. , -1.94444333,  1.94444333],
                  [ 15. ,  0.27777667,  1.38888667],
                  [ 16. , -0.27777667,  1.94444333],
                  [ 17. ,  1.94444333,  1.38888667],
                  [ 18. ,  1.38888667,  1.94444333]])
[Y, X] = np.meshgrid(coord[:, 2], coord[:, 1])
If I plot Y, I get this:
import matplotlib.pyplot as plt
plt.imshow(Y); plt.colorbar(); plt.show()
---- EDIT LATER -----
I am wondering (for example): do the coordinates given to meshgrid have to be strictly increasing? Is there a better way when my coordinates are not organized?
For the interpolation, I would like to use:
def interpolate(values, tri, uv, d=2):
    simplex = tri.find_simplex(uv)
    vertices = np.take(tri.simplices, simplex, axis=0)
    temp = np.take(tri.transform, simplex, axis=0)
    delta = uv - temp[:, d]
    bary = np.einsum('njk,nk->nj', temp[:, :d, :], delta)
    return np.einsum('nj,nj->n', np.take(values, vertices),
                     np.hstack((bary, 1.0 - bary.sum(axis=1, keepdims=True))))
which was used on Stack Overflow before, in Speedup scipy griddata for multiple interpolations between two irregular grids, to limit the calculation time.
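For scattered points like these, a Delaunay triangulation replaces the meshgrid entirely; a minimal sketch of driving the interpolate function above (values here is a hypothetical placeholder for the data to interpolate):

from scipy.spatial import Delaunay

points = coord[:, 1:]                          # the scattered (X, Y) coordinates
values = np.arange(len(points), dtype=float)   # hypothetical values at each point
tri = Delaunay(points)                         # triangulate the irregular point set

uv = np.array([[0.0, 0.0],                     # target points to interpolate at
               [1.0, -1.0]])
print(interpolate(values, tri, uv))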
