I would like to develop a query system that finds the items most similar to a given one, based on a binary signature extracted from the data. I am looking for the most efficient approach since I have runtime constraints. I tried scipy's distance functions, but they were too slow. Do you know of any other useful library or trick to do this faster?
As an example scenario: I have a query vector of binary values with length 68, and a dataset stored as a 3000K x 68 matrix. I would like to find the item in this matrix most similar to the given query, using Hamming distance.
Thanks for any comment.
Nice problem, I liked the answers of Alex and Piotr. My first naive attempt also resulted in a solution time of around 800 ms (on my system). My second attempt, using numpy's (un)packbits, resulted in a 4x speed increase.
import numpy as np

LENGTH = 68
K = 1024
DATASIZE = 3000 * K
# np.bool was removed in recent NumPy versions; use the builtin bool dtype instead
DATA = np.random.randint(0, 2, (DATASIZE, LENGTH)).astype(bool)

def RandomVect():
    return np.random.randint(0, 2, (LENGTH,)).astype(bool)

def HammingDist(vec1, vec2):
    return np.sum(np.logical_xor(vec1, vec2))

def SmallestHamming(vec):
    XorData = np.logical_xor(DATA, vec[np.newaxis, :])
    Lengths = np.sum(XorData, axis=1)
    return DATA[np.argmin(Lengths)]  # returns the first smallest

def main():
    v1 = RandomVect()
    v2 = SmallestHamming(v1)
    print(HammingDist(v1, v2))
# OK, let's try to make it faster... (using numpy.(un)packbits)
DATA2 = np.packbits(DATA, axis=1)
NBYTES = DATA2.shape[-1]

BYTE2ONES = np.zeros((256,), dtype=np.uint8)
for i in range(0, 256):
    BYTE2ONES[i] = np.sum(np.unpackbits(np.uint8(i)))

def RandomVect2():
    return np.packbits(RandomVect())

def HammingDist2(vec1, vec2):
    v1 = np.unpackbits(vec1)
    v2 = np.unpackbits(vec2)
    return np.sum(np.logical_xor(v1, v2))

def SmallestHamming2(vec):
    XorData = DATA2 ^ vec[np.newaxis, :]
    Lengths = np.sum(BYTE2ONES[XorData], axis=1)
    return DATA2[np.argmin(Lengths)]  # returns the first smallest

def main2():
    v1 = RandomVect2()
    v2 = SmallestHamming2(v1)
    print(HammingDist2(v1, v2))
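If you are on NumPy 2.0 or newer (an assumption about your environment), the byte lookup table can be replaced by the built-in popcount; a minimal sketch of the same idea:

# Sketch assuming NumPy >= 2.0, where np.bitwise_count is available
def SmallestHamming3(vec):
    XorData = DATA2 ^ vec[np.newaxis, :]              # packed XOR, one uint8 per 8 bits
    Lengths = np.bitwise_count(XorData).sum(axis=1)   # popcount per byte, summed per row
    return DATA2[np.argmin(Lengths)]                  # first row with the smallest Hamming distance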
Use cdist from SciPy:
from scipy.spatial.distance import cdist
Y = cdist(XA, XB, 'hamming')
Computes the normalized Hamming distance, or the proportion of those vector elements between two n-vectors u and v which disagree. To save memory, the matrix X can be of type boolean.
Reference: http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
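For the scenario in the question, a minimal sketch could look like the following; the dataset and query here are random placeholders (smaller than the 3000K rows in the question, just for a quick test), and since cdist returns the normalized distance, the raw Hamming distance is that value times the vector length:

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
DATA = rng.integers(0, 2, (300_000, 68), dtype=np.uint8).astype(bool)  # placeholder dataset
query = rng.integers(0, 2, (1, 68), dtype=np.uint8).astype(bool)       # placeholder query vector

dists = cdist(query, DATA, 'hamming')[0]    # normalized Hamming distance to every row
best = np.argmin(dists)                     # index of the most similar row
print(best, dists[best] * DATA.shape[1])    # raw Hamming distance = normalized distance * length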
I would be surprised if there were a significantly faster way than this: put your data into a pandas DataFrame (M), one vector per column, and your target vector into a pandas Series (x),
import numpy as np
import pandas as pd
rows = 68
columns = 3000
M = pd.DataFrame(np.random.rand(rows,columns)>0.5)
x = pd.Series(np.random.rand(rows)>0.5)
then do the following
%timeit M.apply(lambda y: x==y).astype(int).sum().idxmax()
1 loop, best of 3: 746 ms per loop
Edit: Actually, I am surprised that this is a much faster way
%timeit M.eq(x, axis=0).astype(int).sum().idxmax()
100 loops, best of 3: 2.68 ms per loop
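Note that counting matches with eq and taking idxmax is equivalent to minimising the Hamming distance. If you want the distance itself, a small sketch under the same setup (my addition) is:

hamming = M.ne(x, axis=0).astype(int).sum()   # disagreements per column = Hamming distance to x
best_col = hamming.idxmin()                   # column with the smallest Hamming distance
print(best_col, hamming[best_col])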
Related
I'm trying to vectorize my fitness function for a Minimum Vertex Cover genetic algorithm, but I'm at a loss about how to do it.
As it stands now:
vert_cover_fitness = [1 if self.dna[edge[0]] or self.dna[edge[1]] else -num_edges for edge in edges]
dna is a one-dimensional binary array indexed 0..n, where each index corresponds to a vertex and its value indicates whether we have chosen it or not. edges is a two-dimensional positive integer array, where each value corresponds to a vertex (index) in dna. Both are ndarrays.
Simply explained: if one of the vertices connected by an edge is "selected", then we get a score of one. If not, the function is penalized by -num_edges.
I have tried np.vectorize as an attempt to get away cheaply with a lambda function:
fit_func = np.vectorize(lambda edge: 1 if self.dna[edge[0]] or self.dna[edge[1]] else -num_edges)
vert_cover_fitness = fit_func(edges)
This returns IndexError: invalid index to scalar variable., because the function is applied to each scalar value rather than to each row.
To fix this I tried np.apply_along_axis. This works but it's just a wrapper for a loop so I'm not getting any speedups.
If any Numpy wizards can see some obvious way to do this, I would much appreciate your help. I'm guessing a problem lies with the representation of the problem, and that changing either the dna or edges shapes could help. I'm just not skilled enough to see what I should do.
I came up with this bit of numpy code; it runs 30x faster than your for loop on my randomly generated data.
import numpy as np
num_vertices = 1000
num_edges = 500
dna = np.random.choice([0, 1], num_vertices)
edges = np.random.randint(0, num_vertices, num_edges * 2).reshape(-1, 2)
vert_cover_fitness1 = [1 if dna[edge[0]] or dna[edge[1]] else -num_edges for edge in edges]
vert_cover_fitness2 = np.full([num_edges], -num_edges)
mask = (dna[edges[:, 0]] | dna[edges[:, 1]]).astype(bool)
vert_cover_fitness2[mask] = 1.0
print((vert_cover_fitness1 == vert_cover_fitness2).all()) # this shows it's correct
Here is the timeit code used to measure the speedup.
import timeit
setup = """
import numpy as np
num_vertices = 1000
num_edges = 500
dna = np.random.choice([0, 1], num_vertices)
edges = np.random.randint(0, num_vertices, num_edges*2).reshape(-1, 2)
"""
python_loop = "[1 if dna[edge[0]] or dna[edge[1]] else -num_edges for edge in edges]"
print(timeit.timeit(python_loop, setup, number=1000))
vectorised="""
vert_cover_fitness2 = np.full([num_edges], -num_edges)
mask = (dna[edges[:, 0]] | dna[edges[:, 1]]).astype(bool)
vert_cover_fitness2[mask] = 1.0
"""
print(timeit.timeit(vectorised, setup, number=1000))
# prints:
# 0.375906624016352
# 0.012783741112798452
Is there an open source function to compute a moving z-score, like https://turi.com/products/create/docs/generated/graphlab.toolkits.anomaly_detection.moving_zscore.create.html? I have access to pandas rolling_std for computing the std, but I want to see if it can be extended to compute rolling z-scores.
rolling.apply with a custom function is significantly slower than using builtin rolling functions (such as mean and std). Therefore, compute the rolling z-score from the rolling mean and rolling std:
def zscore(x, window):
    r = x.rolling(window=window)
    m = r.mean().shift(1)
    s = r.std(ddof=0).shift(1)
    z = (x - m) / s
    return z
According to the definition given on this page the rolling z-score depends on the rolling mean and std just prior to the current point. The shift(1) is used above to achieve this effect.
Below, even for a small Series (of length 100), zscore is over 5x faster than using rolling.apply. Since rolling.apply(zscore_func) calls zscore_func once for each rolling window in essentially a Python loop, the advantage of using the Cythonized r.mean() and r.std() functions becomes even more apparent as the size of the loop increases.
Thus, as the length of the Series increases, the speed advantage of zscore increases.
In [58]: %timeit zscore(x, N)
1000 loops, best of 3: 903 µs per loop
In [59]: %timeit zscore_using_apply(x, N)
100 loops, best of 3: 4.84 ms per loop
This is the setup used for the benchmark:
import numpy as np
import pandas as pd
np.random.seed(2017)

def zscore(x, window):
    r = x.rolling(window=window)
    m = r.mean().shift(1)
    s = r.std(ddof=0).shift(1)
    z = (x - m) / s
    return z

def zscore_using_apply(x, window):
    def zscore_func(x):
        return (x[-1] - x[:-1].mean()) / x[:-1].std(ddof=0)
    return x.rolling(window=window+1).apply(zscore_func)

N = 5
x = pd.Series((np.random.random(100) - 0.5).cumsum())
result = zscore(x, N)
alt = zscore_using_apply(x, N)
assert not ((result - alt).abs() > 1e-8).any()
You should use native functions of pandas:
# Compute rolling zscore for column ="COL" and window=window
col_mean = df["COL"].rolling(window=window).mean()
col_std = df["COL"].rolling(window=window).std()
df["COL_ZSCORE"] = (df["COL"] - col_mean)/col_std
def zscore(arr, window):
    x = arr.rolling(window=1).mean()       # rolling mean with window 1 is just the value itself
    u = arr.rolling(window=window).mean()
    o = arr.rolling(window=window).std()
    return (x - u) / o

df['zscore'] = zscore(df['value'], window)
Let us say you have a data frame called data with several numeric columns. Then you run the following code:
data_zscore=data.apply(lambda x: (x-x.expanding().mean())/x.expanding().std())
Please note that the first row will always have NaN values, as a single observation has no standard deviation.
This can be solved in a single line of code. Given that s is the input series and wlen is the window length:
zscore = s.sub(s.rolling(wlen).mean()).div(s.rolling(wlen).std())
If you need to shift the mean and std it can still be done:
zscore = s.sub(s.rolling(wlen).mean().shift()).div(s.rolling(wlen).std().shift())
I have a 24000 * 316 numpy matrix; each row represents a time series with 316 time points, and I am computing the Pearson correlation between each pair of these time series. As a result I would have a 24000 * 24000 numpy matrix of Pearson values.
My problem is that this takes a very long time. I have tested my pipeline on smaller matrices (200 * 200) and it works (though still slow). I am wondering if it is expected to be this slow (it takes more than a day!), and what I might be able to do about it.
If it helps, this is my code... nothing special or hard:
def SimMat(mat, name):
    mrange = mat.shape[0]
    print("mrange:", mrange)
    nTRs = mat.shape[1]
    print("nTRs:", nTRs)
    SimM = numpy.zeros((mrange, mrange))
    for i in range(mrange):
        SimM[i][i] = 1
    for i in range(mrange):
        for j in range(i+1, mrange):
            pearV = scipy.stats.pearsonr(mat[i], mat[j])
            if pearV[1] <= 0.05:
                if pearV[0] >= 0.5:
                    print("Pearson value:", pearV[0])
                    SimM[i][j] = pearV[0]
                    SimM[j][i] = 0
                else:
                    SimM[i][j] = SimM[j][i] = 0
    numpy.savetxt(name, SimM)
    return SimM, nTRs
Thanks
The main problem with your implementation is the amount of memory you'll need to store the correlation coefficients (at least 4.5GB). There is no reason to keep the already computed coefficients in memory. For problems like this, I like to use hdf5 to store the intermediate results, since it works nicely with numpy. Here is a complete, minimal working example:
import numpy as np
import h5py
from scipy.stats import pearsonr
# Create the dataset
h5 = h5py.File("data.h5",'w')
h5["test"] = np.random.random(size=(24000,316))
h5.close()
# Compute dot products
h5 = h5py.File("data.h5",'r+')
A = h5["test"][:]
N = A.shape[0]
out = h5.require_dataset("pearson", shape=(N,N), dtype=float)
for i in range(N):
    out[i] = [pearsonr(A[i], A[j])[0] for j in range(N)]
Testing the first 100 rows suggests this will only take 8 hours on a single core. If you parallelized it, it should have linear speedup with the number of cores.
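The note about parallelization can be made concrete with a small multiprocessing sketch. This is my own illustration rather than part of the original answer, and to keep it simple it writes the coefficients to a separate file (pearson.h5):

import h5py
from multiprocessing import Pool
from scipy.stats import pearsonr

def init_worker(data):
    # Give each worker process its own reference to the data matrix
    global A
    A = data

def pearson_row(i):
    # One row of the correlation matrix: row i against every row
    return i, [pearsonr(A[i], A[j])[0] for j in range(A.shape[0])]

if __name__ == "__main__":
    with h5py.File("data.h5", "r") as h5:
        data = h5["test"][:]          # load the 24000 x 316 matrix once in the parent
    n = data.shape[0]
    with h5py.File("pearson.h5", "w") as out_file:
        out = out_file.create_dataset("pearson", shape=(n, n), dtype=float)
        with Pool(initializer=init_worker, initargs=(data,)) as pool:
            for i, row in pool.imap_unordered(pearson_row, range(n), chunksize=4):
                out[i] = row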
Short Question
I have a large 10000x10000 elements image, which I bin into a few hundred different sectors/bins. I then need to perform some iterative calculation on the values contained within each bin.
How do I extract the indices of each bin to efficiently perform my calculation using the bins values?
What I am looking for is a solution which avoids the bottleneck of having to select every time ind == j from my large array. Is there a way to obtain directly, in one go, the indices of the elements belonging to every bin?
Detailed Explanation
1. Straightforward Solution
One way to achieve what I need is to use code like the following (see e.g. THIS related answer), where I digitize my values and then have a j-loop selecting digitized indices equal to j like below
import numpy as np

# This function func() is just a placeholder for a much more complicated function.
# I am aware that my problem could easily be sped up in the specific case of
# the sum() function, but I am looking for a general solution to the problem.
def func(x):
    y = np.sum(x)
    return y

vals = np.random.random(10**8)
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)
result = [func(vals[ind == j]) for j in range(1, nbins)]
This is not what I want as it selects every time ind == j from my large array. This makes this solution very inefficient and slow.
2. Using binned_statistic
The above approach turns out to be the same as the one implemented in scipy.stats.binned_statistic, for the general case of a user-defined function. Using Scipy directly, an identical output can be obtained with the following:
import numpy as np
from scipy.stats import binned_statistic

vals = np.random.random(10**8)
results = binned_statistic(vals, vals, statistic=func, bins=100, range=[0, 1])[0]
3. Using labeled_comprehension
Another Scipy alternative is to use scipy.ndimage.measurements.labeled_comprehension. Using that function, the above example would become
import numpy as np
from scipy.ndimage import labeled_comprehension
vals = np.random.random(10**8)
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)
result = labeled_comprehension(vals, ind, np.arange(1, nbins), func, float, 0)
Unfortunately also this form is inefficient and in particular, it has no speed advantage over my original example.
4. Comparison with IDL language
To further clarify, what I am looking for is a functionality equivalent to the REVERSE_INDICES keyword in the HISTOGRAM function of the IDL language HERE. Can this very useful functionality be efficiently replicated in Python?
Specifically, using the IDL language the above example could be written as
vals = randomu(s, 1e8)
nbins = 100
bins = [0:1:1./nbins]
h = histogram(vals, MIN=bins[0], MAX=bins[-2], NBINS=nbins, REVERSE_INDICES=r)
result = dblarr(nbins)
for j=0, nbins-1 do begin
jbins = r[r[j]:r[j+1]-1] ; Selects indices of bin j
result[j] = func(vals[jbins])
endfor
The above IDL implementation is about 10 times faster than the Numpy one, due to the fact that the indices of the bins do not have to be selected for every bin. And the speed difference in favour of the IDL implementation increases with the number of bins.
I found that a particular sparse matrix constructor can achieve the desired result very efficiently. It's a bit obscure but we can abuse it for this purpose. The function below can be used in nearly the same way as scipy.stats.binned_statistic but can be orders of magnitude faster
import numpy as np
from scipy.sparse import csr_matrix
def binned_statistic(x, values, func, nbins, range):
    '''The usage is nearly the same as scipy.stats.binned_statistic'''
    N = len(values)
    r0, r1 = range
    digitized = (float(nbins)/(r1 - r0)*(x - r0)).astype(int)
    S = csr_matrix((values, [digitized, np.arange(N)]), shape=(nbins, N))
    return [func(group) for group in np.split(S.data, S.indptr[1:-1])]
I avoided np.digitize because it doesn't use the fact that all bins are equal width and hence is slow, but the method I used instead may not handle all edge cases perfectly.
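As a small usage check (my addition, with a smaller array than the question's 1e8 so it runs quickly), the call mirrors the scipy version:

vals = np.random.random(10**6)                                     # smaller than the question's 1e8
result = binned_statistic(vals, vals, np.sum, nbins=100, range=(0, 1))
print(len(result), result[0])                                      # 100 per-bin sums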
I assume that the binning, done in the example with digitize, cannot be changed. This is one way to go, where you do the sorting once and for all.
import numpy as np
import matplotlib.pyplot as plt

vals = np.random.random(10**4)
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)

new_order = np.argsort(ind)
ind = ind[new_order]
ordered_vals = vals[new_order]

# slower way of calculating first_hit (first version of this post)
# _, first_hit = np.unique(ind, return_index=True)
# faster way: one start index per bin label (1..nbins)
first_hit = np.searchsorted(ind, np.arange(1, nbins+1))
first_hit.sort()

# example of using the data:
for j in range(nbins-1):
    # I am using a plotting function for your f, to show that they cluster
    plt.plot(ordered_vals[first_hit[j]:first_hit[j+1]], 'o')
The resulting figure (not reproduced here) shows that the bins are actually clusters, as expected.
You can halve the computation time by sorting the array first, then use np.searchsorted.
vals = np.random.random(10**8)
vals.sort()
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)

results = [func(vals[np.searchsorted(ind, j, side='left'):
                     np.searchsorted(ind, j, side='right')])
           for j in range(1, nbins)]
Using 1e8 as my test case, I go from 34 seconds of computation to about 17.
One efficient solution is using the numpy_indexed package (disclaimer: I am its author):
import numpy_indexed as npi
npi.group_by(ind).split(vals)
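To get per-bin results in the same shape as the question's list comprehension, one way (my own usage sketch) is to apply func to each group returned by split:

import numpy_indexed as npi

groups = npi.group_by(ind).split(vals)   # one array of values per unique bin label in ind
result = [func(g) for g in groups]       # per-bin results, analogous to the question's output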
Pandas has very fast grouping code (I think it's written in C), so if you don't mind loading the library, you could do that:
import pandas as pd
pdata=pd.DataFrame({'vals':vals,'ind':ind})
resultsp = pdata.groupby('ind').sum().values
or more generally :
pdata=pd.DataFrame({'vals':vals,'ind':ind})
resultsp = pdata.groupby('ind').agg(func).values
Although the latter is slower for standard aggregation functions (like sum, mean, etc.).
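A small usage note (my addition): selecting the vals column before aggregating keeps the result a plain array with one value per bin:

import pandas as pd

pdata = pd.DataFrame({'vals': vals, 'ind': ind})
resultsp = pdata.groupby('ind')['vals'].agg(func).values   # one aggregated value per bin label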
I have two 2d numpy arrays: x_array contains positional information in the x-direction, y_array contains positions in the y-direction.
I then have a long list of x,y points.
For each point in the list, I need to find the array index of the location (specified in the arrays) which is closest to that point.
I have naively produced some code which works, based on this question:
Find nearest value in numpy array
i.e.
import time
import numpy

def find_index_of_nearest_xy(y_array, x_array, y_point, x_point):
    distance = (y_array - y_point)**2 + (x_array - x_point)**2
    idy, idx = numpy.where(distance == distance.min())
    return idy[0], idx[0]

def do_all(y_array, x_array, points):
    store = []
    for i in range(points.shape[1]):
        store.append(find_index_of_nearest_xy(y_array, x_array, points[0, i], points[1, i]))
    return store

# Create some dummy data
y_array = numpy.random.random(10000).reshape(100, 100)
x_array = numpy.random.random(10000).reshape(100, 100)
points = numpy.random.random(10000).reshape(2, 5000)

# Time how long it takes to run
start = time.time()
results = do_all(y_array, x_array, points)
end = time.time()
print('Completed in: ', end - start)
I'm doing this over a large dataset and would really like to speed it up a bit.
Can anyone optimize this?
Thanks.
UPDATE: SOLUTION following suggestions by @silvado and @justin (below)
import scipy.spatial

# Shoe-horn existing data for entry into KDTree routines
combined_x_y_arrays = numpy.dstack([y_array.ravel(), x_array.ravel()])[0]
points_list = list(points.transpose())

def do_kdtree(combined_x_y_arrays, points):
    mytree = scipy.spatial.cKDTree(combined_x_y_arrays)
    dist, indexes = mytree.query(points)
    return indexes

start = time.time()
results2 = do_kdtree(combined_x_y_arrays, points_list)
end = time.time()
print('Completed in: ', end - start)
This code above sped up my code (searching for 5000 points in 100x100 matrices) by 100 times. Interestingly, using scipy.spatial.KDTree (instead of scipy.spatial.cKDTree) gave comparable timings to my naive solution, so it is definitely worth using the cKDTree version...
Here is a scipy.spatial.KDTree example
In [1]: from scipy import spatial
In [2]: import numpy as np
In [3]: A = np.random.random((10,2))*100
In [4]: A
Out[4]:
array([[ 68.83402637, 38.07632221],
[ 76.84704074, 24.9395109 ],
[ 16.26715795, 98.52763827],
[ 70.99411985, 67.31740151],
[ 71.72452181, 24.13516764],
[ 17.22707611, 20.65425362],
[ 43.85122458, 21.50624882],
[ 76.71987125, 44.95031274],
[ 63.77341073, 78.87417774],
[ 8.45828909, 30.18426696]])
In [5]: pt = [6, 30] # <-- the point to find
In [6]: A[spatial.KDTree(A).query(pt)[1]] # <-- the nearest point
Out[6]: array([ 8.45828909, 30.18426696])
#how it works!
In [7]: distance,index = spatial.KDTree(A).query(pt)
In [8]: distance # <-- The distances to the nearest neighbors
Out[8]: 2.4651855048258393
In [9]: index # <-- The locations of the neighbors
Out[9]: 9
#then
In [10]: A[index]
Out[10]: array([ 8.45828909, 30.18426696])
scipy.spatial also has a k-d tree implementation: scipy.spatial.KDTree.
The approach is generally to first use the point data to build up a k-d tree. The computational complexity of that is on the order of N log N, where N is the number of data points. Range queries and nearest neighbour searches can then be done with log N complexity. This is much more efficient than simply cycling through all points (complexity N).
Thus, if you have repeated range or nearest neighbor queries, a k-d tree is highly recommended.
If you can massage your data into the right format, a fast way to go is to use the methods in scipy.spatial.distance:
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
In particular pdist and cdist provide fast ways to calculate pairwise distances.
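For the grid-versus-points case in the question above, a minimal cdist sketch (my own illustration, reusing the y_array, x_array and points names from the question; it builds the full distance matrix in memory, so it only suits moderate sizes):

import numpy as np
from scipy.spatial.distance import cdist

grid = np.column_stack([y_array.ravel(), x_array.ravel()])   # (10000, 2) candidate locations
queries = points.T                                           # (5000, 2) query points

d = cdist(queries, grid)            # pairwise Euclidean distances, shape (5000, 10000)
nearest = d.argmin(axis=1)          # flat index of the closest grid location for each query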
Search methods have two phases:
build a search structure, e.g. a KDTree, from the npt data points (your x y);
look up the nq query points.
Different methods have different build times and different query times. Your choice will depend a lot on npt and nq:
scipy cdist has build time 0, but query time ~ npt * nq.
KDTree build times are complicated; lookups are very fast, ~ nq * ln(npt).
On a regular (Manhattan) grid you can do much better: see (ahem) find-nearest-value-in-numpy-array.
A little testbench: building a KDTree of 5000 × 5000 2d points takes about 30 seconds, then queries take microseconds; scipy cdist on 25 million × 20 points (all pairs, 4G) takes about 5 seconds, on my old iMac.
I have been trying to follow along with this. I am new to Jupyter Notebooks, Python, and the various tools being discussed here, but I have managed to get some way down the road I'm travelling.
BURoute = pd.read_csv('C:/Users/andre/BUKP_1m.csv', header=None)
NGEPRoute = pd.read_csv('c:/Users/andre/N1-06.csv', header=None)
I create a combined XY array from my BURoute dataframe
combined_x_y_arrays = BURoute.iloc[:,[0,1]]
And I create the points with the following command
points = NGEPRoute.iloc[:,[0,1]]
I then do the KDTree magic
def do_kdtree(combined_x_y_arrays, points):
    mytree = scipy.spatial.cKDTree(combined_x_y_arrays)
    dist, indexes = mytree.query(points)
    return indexes

results2 = do_kdtree(combined_x_y_arrays, points)
This gives me an array of the indexes. I'm now trying to figure out how to calculate the distance between the points and the indexed points in the results array.
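A small note on that last point (my own addition, not part of the original post): cKDTree.query already returns the distances alongside the indexes, so they can simply be returned as well instead of being recomputed:

def do_kdtree(combined_x_y_arrays, points):
    mytree = scipy.spatial.cKDTree(combined_x_y_arrays)
    dist, indexes = mytree.query(points)
    return dist, indexes   # dist[i] is the distance from the i-th point to its nearest neighbour

distances, results2 = do_kdtree(combined_x_y_arrays, points)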
def find_nearest_vector(self, arrList, value):
    y, x = value
    offset = 10
    x_Array = []
    y_Array = []
    for p in arrList:
        x_Array.append(p[1])
        y_Array.append(p[0])
    x_Array = np.array(x_Array)
    y_Array = np.array(y_Array)
    difference_array_x = np.absolute(x_Array - x)
    difference_array_y = np.absolute(y_Array - y)
    index_x = np.where(difference_array_x < offset)[0]
    index_y = np.where(difference_array_y < offset)[0]
    index = np.intersect1d(index_x, index_y, assume_unique=True)
    nearestCoordinate = (arrList[index][0][0], arrList[index][0][1])
    return nearestCoordinate