Different results on using a function and its content - python

I was trying to understand the workings of the fast_knn function of the impyute library, so I tried executing it line by line. Here it is:
import numpy as np
from scipy.spatial import KDTree
from impyute import mean  # mean imputation, used as init_impute_fn below

def shepards(distances, power=2):
    return to_percentage(1/np.power(distances, power))

def to_percentage(vec):
    return vec/np.sum(vec)

data_temp = np.arange(25).reshape((5, 5)).astype(float)
data_temp[0][2] = np.nan
k = 4
eps = 0
p = 2
distance_upper_bound = np.inf
leafsize = 10
idw_fn = shepards
init_impute_fn = mean

nan_xy = np.argwhere(np.isnan(data_temp))
data_temp_c = init_impute_fn(data_temp)
kdtree = KDTree(data_temp_c, leafsize=leafsize)

for x_i, y_i in nan_xy:
    distances, indices = kdtree.query(data_temp_c[x_i], k=k+1, eps=eps,
                                      p=p, distance_upper_bound=distance_upper_bound)
    # Will always return itself in the first index. Delete it.
    distances, indices = distances[1:], indices[1:]
    # Add small constant to distances to avoid division by 0
    distances += 1e-3
    weights = idw_fn(distances)
    # Assign missing value the weighted average of `k` nearest neighbours
    data_temp[x_i][y_i] = np.dot(weights, [data_temp_c[ind][y_i] for ind in indices])

data_temp
This outputs:
array([[ 0.        ,  1.        , 10.06569379,  3.        ,  4.        ],
       [ 5.        ,  6.        ,  7.        ,  8.        ,  9.        ],
       [10.        , 11.        , 12.        , 13.        , 14.        ],
       [15.        , 16.        , 17.        , 18.        , 19.        ],
       [20.        , 21.        , 22.        , 23.        , 24.        ]])
whereas the library function gives a different output. The code:
from impyute import fast_knn
import numpy as np
data_temp = np.arange(25).reshape((5, 5)).astype(float)
data_temp[0][2] = np.nan
fast_knn(data_temp, k=4)
and the output:
array([[ 0.        ,  1.        , 16.78451885,  3.        ,  4.        ],
       [ 5.        ,  6.        ,  7.        ,  8.        ,  9.        ],
       [10.        , 11.        , 12.        , 13.        , 14.        ],
       [15.        , 16.        , 17.        , 18.        , 19.        ],
       [20.        , 21.        , 22.        , 23.        , 24.        ]])

There seems to be a discrepancy between the GitHub repository code and the installed library's source code (the two do not match). The following is the installed library's source code:
def fast_knn(data, k=3, eps=0, p=2, distance_upper_bound=np.inf, leafsize=10, **kwargs):
    null_xy = find_null(data)
    data_c = mean(data)
    kdtree = KDTree(data_c, leafsize=leafsize)
    for x_i, y_i in null_xy:
        distances, indices = kdtree.query(data_c[x_i], k=k+1, eps=eps,
                                          p=p, distance_upper_bound=distance_upper_bound)
        # Will always return itself in the first index. Delete it.
        distances, indices = distances[1:], indices[1:]
        weights = distances/np.sum(distances)
        # Assign missing value the weighted average of `k` nearest neighbours
        data[x_i][y_i] = np.dot(weights, [data_c[ind][y_i] for ind in indices])
    return data
The weights are computed in a different manner (not using the shepards function); hence the difference in outputs.

You probably read the code on the current master branch of impyute. But the impyute package version you installed is likely v0.0.8, the most recent release, whose code is on the release/0.0.8 branch.
The difference in the definition of fast_knn is below.
On the current master branch:
# Will always return itself in the first index. Delete it.
distances, indices = distances[1:], indices[1:]
# Add small constant to distances to avoid division by 0
distances += 1e-3
weights = idw_fn(distances)
On release/0.0.8 branch:
# Will always return itself in the first index. Delete it.
distances, indices = distances[1:], indices[1:]
weights = distances/np.sum(distances)
If you use the code from the release/0.0.8 branch, you will get the same result as the installed impyute package.
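To see how much the two schemes differ, here is a minimal sketch contrasting them on an illustrative distance vector (the values below are made up, not taken from the example above). The shepards weighting on master favours nearer neighbours, while the release/0.0.8 weighting is proportional to raw distance and so favours farther ones, which is why the imputed values diverge:

import numpy as np

distances = np.array([1.0, 2.0, 4.0])  # illustrative distances to 3 neighbours

# master branch: inverse distance weighting (shepards)
idw = (1 / distances**2) / np.sum(1 / distances**2)
print(idw)   # [0.76190476 0.19047619 0.04761905] -- nearest neighbour dominates

# release/0.0.8 branch: weights proportional to raw distance
prop = distances / np.sum(distances)
print(prop)  # [0.14285714 0.28571429 0.57142857] -- farthest neighbour dominates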

Related

Combine array of indices with array of values

I have an array in the following form where the first two columns are supposed to be indices of a 2-dimensional array and the following columns are arbitrary values.
data = np.array([[ 0. ,  1. , 48. ,  4. ],
                 [ 1. ,  2. , 44. ,  4.4],
                 [ 1. ,  1. , 34. ,  2.3],
                 [ 0. ,  2. , 55. ,  2.2],
                 [ 0. ,  0. , 42. ,  2. ],
                 [ 1. ,  0. , 22. ,  1. ]])
How do I combine the indices data[:,:2] with their values data[:,2:] such that the resulting array is accessible by the indices in the first two columns?
In my example that would be:
result = np.array([[[42. ,  2. ], [48. ,  4. ], [55. ,  2.2]],
                   [[22. ,  1. ], [34. ,  2.3], [44. ,  4.4]]])
I know that there is a trivial solution using Python loops, but performance is a concern since I'm dealing with a huge amount of data. Specifically, it's the output of another program that I need to process.
Maybe there is a relatively trivial numpy solution as well. But I'm kind of stuck.
If it helps, the following can be safely assumed:
All numbers in the first two columns are whole numbers (although the array consists of floats).
Every possible index (or rather combination of indices) in the original array is used exactly once, i.e. there is guaranteed to be exactly one entry of the form [i, j, ...].
The indices start at 0 and I know the highest indices beforehand.
Edit:
Hmm, I see now how my example is misleading. The truth is that some of my input arrays are sorted, but that's unreliable, so I shouldn't assume anything about the order. I reordered some rows in my example to make this clearer. In case anyone wants to make sense of the answer and comment below: in my original question the array appeared to be sorted by the first two columns.
Find the row, column, and depth sizes from your data array, then fill it as below:
import numpy as np
data = np.array([[ 0. ,  0. , 42. ,  2. ],
                 [ 0. ,  1. , 48. ,  4. ],
                 [ 0. ,  2. , 55. ,  2.2],
                 [ 1. ,  0. , 22. ,  1. ],
                 [ 1. ,  1. , 34. ,  2.3],
                 [ 1. ,  2. , 44. ,  4.4]])
row = int(max(data[:, 0])) + 1
col = int(max(data[:, 1])) + 1
depth = len(data[0, 2:])
# note: this reshape only works when the rows are sorted by the first two columns
out = data[:, 2:].reshape(row, col, depth)
print(out)
Output:
[[[42.   2. ]
  [48.   4. ]
  [55.   2.2]]

 [[22.   1. ]
  [34.   2.3]
  [44.   4.4]]]
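Since the edit above says the row order is unreliable, note that the reshape approach only works on sorted input. A minimal order-independent sketch using NumPy integer-array indexing (assuming, as stated in the question, that every index pair occurs exactly once):

import numpy as np

idx = data[:, :2].astype(int)  # integer index pairs from the first two columns
out = np.empty((idx[:, 0].max() + 1, idx[:, 1].max() + 1, data.shape[1] - 2))
out[idx[:, 0], idx[:, 1]] = data[:, 2:]  # scatter each row to its (i, j) slot
print(out)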
You can use numba in no-python parallel mode with loops (numba is built precisely to accelerate Python loops), which will be one of the most efficient methods in terms of performance, as szczesny mentioned in the comments, and it does not need sorting. This code is written for exactly 2 value columns; if that can change, the code can be modified to handle it:
import numpy as np
import numba as nb

# without a signature --> @nb.njit(parallel=True)
@nb.njit("float64[:, :, ::1](float64[:, ::1])", parallel=True)
def numba_(data):
    data_ = data[:, :2].astype(np.int8)
    res = np.empty((data_[:, 0].max() + 1, data_[:, 1].max() + 1, 2))
    for i in nb.prange(data_.shape[0]):
        res[data_[i, 0], data_[i, 1], 0] = data[i, 2]
        res[data_[i, 0], data_[i, 1], 1] = data[i, 3]
    return res
[benchmark plot: runtimes without sorting, compared against the proposed NumPy code; horizontal axis --> data.shape[0]]
A more general version that handles more than 2 value columns:
@nb.njit("float64[:, :, ::1](float64[:, ::1])", parallel=True)
def numba_(data):
    data_ = data[:, :2].astype(np.int8)
    assert data_.shape[0] == data.shape[0]
    depth = data[:, 2:].shape[1]
    res = np.empty((data_[:, 0].max() + 1, data_[:, 1].max() + 1, depth))
    for i in nb.prange(data_.shape[0]):
        for j in range(depth):
            res[data_[i, 0], data_[i, 1], j] = data[i, j + 2]
    return res
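A quick usage check (assuming data as defined in the first answer; the expected output shape is (2, 3, 2)):

out = numba_(data)
print(out.shape)   # (2, 3, 2)
print(out[0, 1])   # [48.  4.]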

I want to get the minimum number's index for a row excluding zeros

Consider the following code, which generates the dst matrix below.
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

tmp = pd.DataFrame()
tmp['a'] = np.random.randint(1, 10, 5)
tmp['b'] = np.random.randint(1, 10, 5)
dst = pairwise_distances(tmp, tmp, metric='l2')
dst
which looks like the following
array([[0.        , 5.38516481, 5.        , 4.12310563, 2.        ],
       [5.38516481, 0.        , 1.41421356, 3.16227766, 5.        ],
       [5.        , 1.41421356, 0.        , 4.        , 4.12310563],
       [4.12310563, 3.16227766, 4.        , 0.        , 5.        ],
       [2.        , 5.        , 4.12310563, 5.        , 0.        ]])
Now, I want to somehow get 4 as the output, because for row 0 the minimum distance to another row (apart from itself) lies at column 4.
I'm trying to use the following code to do the job, but np.nonzero() is messing up the game:
np.argmin(dst[0, np.nonzero(dst[0,:])])
I'm getting 3 as output, where I should be getting 4. I understand that np.nonzero() returns the indices [1, 2, 3, 4], of which argmin picks position 3, which is actually the 4th column of the dst matrix. Need help! Thanks in advance!
Instead of argmin, use np.min and compare the result to dst[0,:]. Finally, pass it to np.flatnonzero or np.nonzero:
np.flatnonzero(np.min(dst[0,np.nonzero(dst[0,:])]) == dst[0,:])
Out[150]: array([4], dtype=int64)
Or
np.nonzero(np.min(dst[0,np.nonzero(dst[0,:])]) == dst[0,:])[0]
Out[151]: array([4], dtype=int64)
If you want to return an integer index, you may use np.argmax at the last step
np.argmax(np.min(dst[0,np.nonzero(dst[0,:])]) == dst[0,:])
Out[157]: 4
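Alternatively, a small sketch that keeps the original argmin approach but maps its position back through the nonzero indices:

nz = np.flatnonzero(dst[0])       # columns with nonzero distances, e.g. [1, 2, 3, 4]
col = nz[np.argmin(dst[0, nz])]   # translate argmin's position in the subset back to a column
print(col)                        # 4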

apply polyfit on a rolling basis

I found this useful article on polyfit, which works pretty well:
http://www.emilkhatib.com/analyzing-trends-in-data-with-pandas/
import numpy as np
coefficients, residuals, _, _, _ = np.polyfit(range(len(selected.index)),selected,1,full=True)
mse = residuals[0]/(len(selected.index))
nrmse = np.sqrt(mse)/(selected.max() - selected.min())
print('Slope ' + str(coefficients[0]))
print('NRMSE: ' + str(nrmse))
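(For completeness, a minimal runnable version of that snippet; `selected` is assumed to be a pandas Series, and the data here are made up:

import numpy as np
import pandas as pd

selected = pd.Series([1.0, 2.1, 2.9, 4.2])  # hypothetical data
coefficients, residuals, _, _, _ = np.polyfit(
    range(len(selected.index)), selected, 1, full=True)
mse = residuals[0]/(len(selected.index))
nrmse = np.sqrt(mse)/(selected.max() - selected.min())
print('Slope ' + str(coefficients[0]))
print('NRMSE: ' + str(nrmse)))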
Now I would like to use this on a rolling basis.
def test(input_list, i):
    if sum(~np.isnan(x) for x in input_list) < 2:
        return np.NaN
    print(input_list)
    coefficients, residuals, _, _, _ = np.polyfit(range(len(input_list)), input_list, 1, full=True)
    mse = residuals[0]/(len(input_list))
    nrmse = np.sqrt(mse)/(input_list.max() - input_list.min())
    print('Slope ' + str(coefficients[0]))
    print('NRMSE: ' + str(nrmse))
    a = coefficients[0]*i + coefficients[1]
    return a

df['pred'] = df['abs'].rolling(window=2, min_periods=1, center=False).apply(lambda x: test(x, base1.index))
but I can't get it working :)
I get
IndexError: index 0 is out of bounds for axis 0 with size 0
instead of correct results :)
Anybody got an idea? Thanks! e.
**** EDIT 1 ****
Sorry, I missed posting a concrete example...
I managed to get the function working by transforming the numpy array into a DataFrame,
but somehow residuals is empty:
import quandl
import pandas as pd
import numpy as np

def test(input_list, i):
    if sum(~np.isnan(x) for x in input_list) < 2:
        return np.NaN
    abc = pd.DataFrame(input_list)
    coefficients, residuals, _, _, _ = np.polyfit(range(len(abc)), abc[0], 1, full=True)
    # residuals is empty... why?
    a = coefficients[0]*len(abc) + coefficients[1]
    return a

df = quandl.get("WIKI/GOOGL")
df = df.loc[:, ['High', 'Low', 'Close']]
# reset index for calc
#base1['DateTime'] = base1.index
#base1.index = range(len(base1))
df['close_pred'] = df['Close'].rolling(window=15, min_periods=2, center=False).apply(lambda x: test(x, 0))
print(df.head(30).to_string())
Residuals are empty just for the 1st iteration; see the slightly modified code and output below.
def test(data):
    if sum(~np.isnan(x) for x in data) < 2:
        return np.NaN
    df = pd.DataFrame(data)
    coefficients, residuals, _, _, _ = np.polyfit(range(len(data)), df[0], 1, full=True)
    #if residuals.size == 0:
    #    residuals = [0]
    print(coefficients[-2], residuals, data)
    return coefficients[-2]
and the output:
df_xx['pred'] = df_xx[0].rolling(window=5, min_periods=2, center=False).apply(lambda y: test(y))
0.9999999999999998 [] [0. 1.]
1.0 [4.29279946e-34] [0. 1. 2.]
1.0000000000000002 [3.62112419e-33] [0. 1. 2. 3.]
0.9999999999999999 [8.77574736e-31] [0. 1. 2. 3. 4.]
0.9999999999999999 [1.25461096e-30] [1. 2. 3. 4. 5.]
0.9999999999999999 [2.93468782e-30] [2. 3. 4. 5. 6.]
0.9999999999999997 [1.38665176e-30] [3. 4. 5. 6. 7.]
0.9999999999999997 [2.18347839e-30] [4. 5. 6. 7. 8.]
0.9999999999999999 [6.21693422e-30] [5. 6. 7. 8. 9.]
1.0 [1.07025673e-29] [ 6. 7. 8. 9. 10.]
1.0000000000000002 [1.4374879e-29] [ 7. 8. 9. 10. 11.]
0.9999999999999997 [1.14542951e-29] [ 8. 9. 10. 11. 12.]
1.0000000000000004 [9.73226454e-30] [ 9. 10. 11. 12. 13.]
0.9999999999999997 [1.99069506e-29] [10. 11. 12. 13. 14.]
0.9999999999999997 [1.09437894e-29] [11. 12. 13. 14. 15.]
1.0 [3.60983058e-29] [12. 13. 14. 15. 16.]
1.0000000000000002 [1.90967258e-29] [13. 14. 15. 16. 17.]
1.0000000000000002 [3.13030715e-29] [14. 15. 16. 17. 18.]
1.0 [1.25806434e-29] [15. 16. 17. 18. 19.]
The simple code below fixes it:

if residuals.size == 0:
    residuals = [0]
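For context: np.polyfit (via np.linalg.lstsq) returns an empty residuals array when the fit is exactly determined, i.e. when the number of points does not exceed degree + 1. The first rolling window here has only 2 points for a degree-1 fit, hence the empty array (and the IndexError from residuals[0] in the original code). A minimal demonstration:

import numpy as np

# 2 points, degree 1: exactly determined -> residuals is empty
_, residuals, _, _, _ = np.polyfit([0, 1], [0.0, 1.0], 1, full=True)
print(residuals)  # []

# 3 points, degree 1: overdetermined -> residuals has one entry
_, residuals, _, _, _ = np.polyfit([0, 1, 2], [0.0, 1.1, 1.9], 1, full=True)
print(residuals)  # [0.015]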

Building a histogram from a dict without having to iterate over the keys

I have a dict containing numpy arrays of varying lengths:
MyDict = {0: array([[ 15. ,  3.89678216],
                    [ 36. ,  9.49245167],
                    [ 53. ,  3.82997799],
                    [ 83. ,  5.25727272],
                    [ 86. ,  8.76663208]]),
          1: array([[  4. ,  4.1171155 ],
                    [ 16. , 12.68122196],
                    [ 31. ,  8.64805222],
                    [ 37. ,  6.07202959]]),
          2: array([]),
          ...,
          90: array([[  1. ,  1.        ],
                     [ 24. ,  8.14221573],
                     [ 27. ,  7.36309862]])}
I would like to obtain a histogram of all the values in the dict. The solution I have now is to iterate over the keys of the dict and fill a numpy array with a histogram of fixed length:
for KeysElements in MyDict.keys():
    hist, bins = numpy.histogram(np.asarray(MyDict[KeysElements])[:,1], 50)
    numpy_hist[KeysElements,:] = hist
I then sum up all the histograms over the first dimension of the numpy array to obtain the histogram of all the keys of the initial dict:
Total_hist = numpy.sum(numpy_hist,axis=0)
The problem with this solution is that I do not know how to handle the bins, which change on each iteration. So my question is: is there any way to achieve this without having to build histograms in a loop?
Thanks for any advice or links.
Greg
You don't seem to use the MyDict keys or the 0th values along the 2nd axis of your np arrays. If that is the case, then you can stack all the numpy arrays together and run the histogram on the result:
import numpy as np
MyDict = {0: np.array([[ 15. ,  3.89678216],
                       [ 36. ,  9.49245167],
                       [ 53. ,  3.82997799],
                       [ 83. ,  5.25727272],
                       [ 86. ,  8.76663208]]),
          1: np.array([[  4. ,  4.1171155 ],
                       [ 16. , 12.68122196],
                       [ 31. ,  8.64805222],
                       [ 37. ,  6.07202959]]),
          2: np.array([]),
          90: np.array([[  1. ,  1.        ],
                        [ 24. ,  8.14221573],
                        [ 27. ,  7.36309862]])}
np_array = np.array([]).reshape(0, 2)
for i in MyDict:
    a = MyDict[i]
    if len(a.shape) == 2 and a.shape[1] == 2:
        np_array = np.append(np_array, MyDict[i], axis=0)
print(np.histogram(np_array[:, 1], 50))  # histogram over the value column only, as in the question
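A more compact variant (a sketch, assuming every non-empty array has two columns and, as in the question, only the second column is histogrammed):

import numpy as np

values = np.concatenate([a[:, 1] for a in MyDict.values() if a.size])
hist, bins = np.histogram(values, 50)
print(hist, bins)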

Python - construction of meshgrid (irregular grid) with numpy

My aim is to interpolate some data. To do that I have to create a meshgrid.
For this step, I have an array with my 2D coordinates, "coord" (first column: element number, second: X, third: Y).
I build a meshgrid with np.meshgrid as you can see below.
But my results seem strange, so I would like to know if I have made a mistake... Do I have to reorganize my data before the meshgrid step?
import numpy as np
coord = np.array([[ 1. , -1.38888667, -1.94444333],
                  [ 2. , -1.94444333, -1.38888667],
                  [ 3. ,  0.27777667, -1.94444333],
                  [ 4. , -0.27777667, -1.38888667],
                  [ 5. ,  1.94444333, -1.94444333],
                  [ 6. ,  1.38888667, -1.38888667],
                  [ 7. , -1.38888667, -0.27777667],
                  [ 8. , -1.94444333,  0.27777667],
                  [ 9. ,  0.27777667, -0.27777667],
                  [10. , -0.27777667,  0.27777667],
                  [11. ,  1.94444333, -0.27777667],
                  [12. ,  1.38888667,  0.27777667],
                  [13. , -1.38888667,  1.38888667],
                  [14. , -1.94444333,  1.94444333],
                  [15. ,  0.27777667,  1.38888667],
                  [16. , -0.27777667,  1.94444333],
                  [17. ,  1.94444333,  1.38888667],
                  [18. ,  1.38888667,  1.94444333]])
Y, X = np.meshgrid(coord[:, 2], coord[:, 1])
If I plot Y, I get this:
plt.imshow(Y); plt.colorbar(); plt.show()
---- EDIT LATER -----
I am wondering, for example, whether the coordinates given to meshgrid have to be strictly increasing, and whether there is a better way when my coordinates are not organized.
For the interpolation, I would like to use:
def interpolate(values, tri, uv, d=2):
    simplex = tri.find_simplex(uv)
    vertices = np.take(tri.simplices, simplex, axis=0)
    temp = np.take(tri.transform, simplex, axis=0)
    delta = uv - temp[:, d]
    bary = np.einsum('njk,nk->nj', temp[:, :d, :], delta)
    return np.einsum('nj,nj->n', np.take(values, vertices),
                     np.hstack((bary, 1.0 - bary.sum(axis=1, keepdims=True))))
which was used on Stack Overflow before, in Speedup scipy griddata for multiple interpolations between two irregular grids, and allows limiting the calculation time.
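Note that the interpolate function above expects a scipy.spatial.Delaunay triangulation, which is built directly from the scattered (X, Y) pairs; no meshgrid is needed for irregular points. A minimal sketch (the nodal values here are hypothetical):

import numpy as np
from scipy.spatial import Delaunay

points = coord[:, 1:]                          # (X, Y) pairs from the array above
tri = Delaunay(points)                         # triangulate the scattered nodes
values = np.arange(len(points), dtype=float)   # hypothetical values at the nodes
uv = np.array([[0.0, 0.0], [1.0, 1.0]])        # query points inside the hull
print(interpolate(values, tri, uv))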
