How to improve the performance of this tiny distance Python function - python

I'm running into a performance bottleneck when using a custom distance metric function for a clustering algorithm from sklearn.
Profiling with Run Snake Run shows that the time is dominated by the dbscan_metric function. The function looks very simple and I don't quite know what the best approach to speeding it up would be:
def dbscan_metric(a, b):
    if a.shape[0] != NUM_FEATURES:
        return np.linalg.norm(a - b)
    else:
        return np.linalg.norm(np.multiply(FTR_WEIGHTS, (a - b)))
Any thoughts as to what is causing it to be this slow would be much appreciated.

I am not familiar with what the function does, but is there a possibility of repeated calculations? If so, you could memoize the function:
cache = {}

def dbscan_metric(a, b):
    diff = a - b
    if a.shape[0] != NUM_FEATURES:
        to_calc = diff
    else:
        to_calc = np.multiply(FTR_WEIGHTS, diff)
    key = to_calc.tobytes()  # ndarrays are not hashable, so key the cache on their raw bytes
    if key not in cache:
        cache[key] = np.linalg.norm(to_calc)
    return cache[key]
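If the weighted branch is the common case and FTR_WEIGHTS applies to every sample, another option worth trying is to fold the weights into the data once and let DBSCAN use its built-in euclidean metric, so no Python callback runs per pair of points. A minimal sketch, assuming X is the (n_samples, NUM_FEATURES) array being clustered and eps/min_samples are whatever values you already use:
import numpy as np
from sklearn.cluster import DBSCAN

# ||FTR_WEIGHTS * (a - b)|| == ||FTR_WEIGHTS * a - FTR_WEIGHTS * b||, so weight the rows up front
X_weighted = np.asarray(X) * FTR_WEIGHTS

labels = DBSCAN(eps=eps, min_samples=min_samples, metric='euclidean').fit_predict(X_weighted)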

Related

How can I refactor my Python code to decrease the time complexity?

This code takes 9 seconds, which is a very long time. I guess the problem is the two nested loops in my code:
for symptom in symptoms:
    # check if the symptom is mentioned in the user text
    norm_symptom = symptom.replace("_", " ")
    for combin in list_of_combinations:
        print(getSimilarity([combin, norm_symptom]))
        if getSimilarity([combin, norm_symptom]) > 0.25:
            if symptom not in extracted_symptoms:
                extracted_symptoms.append(symptom)
I tried to use zip like this:
for symptom, combin in zip(symptoms, list_of_combinations):
    norm_symptom = symptom.replace("_", " ")
    if (getSimilarity([combin, norm_symptom]) > 0.25 and symptom not in extracted_symptoms):
        extracted_symptoms.append(symptom)
Indeed, your algorithm is slow because of the two nested loops.
It runs in O(N*M) time (see https://www.freecodecamp.org/news/big-o-notation-why-it-matters-and-why-it-doesnt-1674cfa8a23c/ for more on big-O notation), N being the length of symptoms and M being the length of list_of_combinations.
The getSimilarity computation itself can also be expensive; what does that operation do? Note that the original loop calls it twice per pair, once for the print and once for the comparison.
Use a dict to store the results of getSimilarity for each symptom and combination. That way you avoid calling getSimilarity multiple times for the same pair, which makes the code more efficient and thus faster.
import collections

similarity_results = collections.defaultdict(dict)

for symptom in symptoms:
    norm_symptom = symptom.replace("_", " ")
    for combin in list_of_combinations:
        # Check if the similarity has already been computed
        if combin in similarity_results[symptom]:
            similarity = similarity_results[symptom][combin]
        else:
            similarity = getSimilarity([combin, norm_symptom])
            similarity_results[symptom][combin] = similarity
        if similarity > 0.25:
            if symptom not in extracted_symptoms:
                extracted_symptoms.append(symptom)
Update:
Alternatively, you could use an algorithm based on the Levenshtein distance, which measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another. The Python-Levenshtein library implements it.
import Levenshtein

def getSimilarity(s1, s2):
    distance = Levenshtein.distance(s1, s2)
    return 1.0 - (distance / max(len(s1), len(s2)))

extracted_symptoms = []
for symptom, combin in zip(symptoms, list_of_combinations):
    norm_symptom = symptom.replace("_", " ")
    if (getSimilarity(combin, norm_symptom) > 0.25) and (symptom not in extracted_symptoms):
        extracted_symptoms.append(symptom)
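A further tweak, assuming every symptom still has to be compared against every combination (zip only pairs items positionally, so it skips most of those comparisons): stop scanning combinations as soon as one matches, and keep extracted symptoms in a set so membership checks are O(1). A sketch using the two-argument getSimilarity above:
extracted = set()
for symptom in symptoms:
    norm_symptom = symptom.replace("_", " ")
    for combin in list_of_combinations:
        if getSimilarity(combin, norm_symptom) > 0.25:
            extracted.add(symptom)
            break  # no need to test the remaining combinations for this symptom
extracted_symptoms = list(extracted)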

Parallel computing in Python Similar to MATLAB

I have been using parfor in MATLAB to run parallel for loops for quite some time. I need to do something similar in Python but I cannot find any simple solution. This is my code:
t = list(range(1, 3, 1))
G = list(range(0, 3, 2))
results = pandas.DataFrame(columns=['tau', 'p_value', 'G', 't_i'], index=range(0, len(G) * len(t)))
counter = 0
for iteration_G in list(range(0, len(G))):
    for iteration_t in list(range(0, len(t))):
        matrix_1, matrix_2 = bunch of code
        tau, p_value = scipy.stats.kendalltau(matrix_1, matrix_2)
        results['tau'][counter] = tau
        results['p_value'][counter] = p_value
        results['G'][counter] = G[iteration_G]
        results['t_i'][counter] = G[iteration_t]
        counter = counter + 1
I would like to use the parfor equivalent in the first loop.
I'm not familiar with parfor, but you can use the joblib package to run functions in parallel.
In this simple example there is a function that prints its argument, and we use Parallel to execute it multiple times in parallel with a for loop:
import multiprocessing
from joblib import Parallel, delayed

# function that you want to run in parallel
def foo(i):
    print(i)

# define the number of cores (this is how many processes will run)
num_cores = multiprocessing.cpu_count()

# execute the function in parallel - `return_list` is a list of the results of the function
# in this case it will just be a list of None's
return_list = Parallel(n_jobs=num_cores)(delayed(foo)(i) for i in range(20))
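To map this onto the question's double loop, here is a hedged sketch; build_matrices is a hypothetical stand-in for the "bunch of code" step, and G and t are the lists defined in the question:
from itertools import product
import multiprocessing

import pandas
import scipy.stats
from joblib import Parallel, delayed

def run_one(iteration_G, iteration_t):
    # build_matrices is a hypothetical placeholder for the question's "bunch of code"
    matrix_1, matrix_2 = build_matrices(iteration_G, iteration_t)
    tau, p_value = scipy.stats.kendalltau(matrix_1, matrix_2)
    return tau, p_value, G[iteration_G], t[iteration_t]

# one task per (G, t) combination, matching the nested loops in the question
pairs = list(product(range(len(G)), range(len(t))))
rows = Parallel(n_jobs=multiprocessing.cpu_count())(
    delayed(run_one)(g, i) for g, i in pairs)
results = pandas.DataFrame(rows, columns=['tau', 'p_value', 'G', 't_i'])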
If this doesn't work for what you want to do, you can try numba. It might be a bit more difficult to set up, but in theory you can just add @njit(parallel=True) as a decorator to your function and numba will try to parallelise it for you.
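For reference, a minimal numba sketch of that decorator; note that numba compiles numeric numpy code, so it cannot call pandas or scipy.stats.kendalltau directly, and it would only help here if the inner computation were rewritten in plain numpy:
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def row_means(a):
    # each iteration of a prange loop may run on a different thread
    out = np.empty(a.shape[0])
    for i in prange(a.shape[0]):
        out[i] = a[i].mean()
    return out

print(row_means(np.random.rand(1000, 100)))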
I found a solution using parfor. It is still a bit more complicated than MATLAB's parfor but it's pretty close to what I am used to.
from parfor import parfor  # assuming the parfor package from PyPI

t = list(range(1, 16, 1))
G = list(range(0, 62, 2))

for iteration_t in list(range(0, len(t))):
    @parfor(list(range(0, len(G))))
    def fun(iteration_G):
        result = pandas.DataFrame(columns=['tau', 'p_value'], index=range(0, 1))
        matrix_1, matrix_2 = bunch of code
        tau, p_value = scipy.stats.kendalltau(matrix_1, matrix_2)
        result['tau'] = tau
        result['p_value'] = p_value
        return numpy.array([tau, p_value])
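If this is the parfor package from PyPI, the decorator runs fun immediately, in parallel, over list(range(0, len(G))) and rebinds the name fun to the list of return values, so after each pass of the outer loop fun holds one [tau, p_value] array per value of G.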

How to decrease the output time of covariance function

I have written a function for a covariance matrix and the output I am getting is correct, but the problem is that the code takes too much time for high-dimensional datasets.
Could you please help me modify the code below so it runs faster?
def cov_variance(norm_data, mean_of_mat):
    col = len(norm_data)
    row = len(norm_data[0])
    out = []
    i = 0
    sum_of_covar = 0
    freezrow = 0
    flag = 1
    while flag <= len(mean_of_mat):
        for r in range(row):
            for c in range(col):
                sum_of_covar += (((norm_data[c][freezrow]) - mean_of_mat[freezrow]) *
                                 ((norm_data[c][r]) - mean_of_mat[i]))
            freezrow = freezrow
            out.append(sum_of_covar)
            i += 1
            sum_of_covar = 0
        freezrow = freezrow
        flag += 1
        freezrow += 1
        i = 0
    out1 = map(lambda x: x / col - 1, out)
    cov_variance_output = reshape(out1, row)
    return cov_variance_output
Like doctorlove already said, don't implement your own. It will almost certainly be slower and/or less versatile (speaking from my own experience).
I tried commenting with this info but my rep is too low. You can find information on calculating covariance matrices with numpy here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html
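A minimal sketch of that suggestion, assuming norm_data is laid out as in the question (outer index over observations, inner index over features):
import numpy as np

data = np.asarray(norm_data)             # shape: (n_observations, n_features)
cov_matrix = np.cov(data, rowvar=False)  # rowvar=False treats each column as a variable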

Is there a cythonized version of `norm` methods from scipy.stats?

I'm talking about the main public methods for continuous RV in scipy.stats:
specifically,
from scipy.stats import norm
then using
norm.ppf or norm.pdf
Link: http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html
Is there any opportunity to speed up norm.ppf() or norm.pdf() using Cython? Or are they already optimized, or not worth wrapping with Cython?
Have you examined the code in
.../scipy/stats/distributions.py?
It looks like norm_pdf ends up using
_norm_pdf_C = math.sqrt(2*pi)
_norm_pdf_logC = math.log(_norm_pdf_C)

def _norm_pdf(x):
    return exp(-x**2/2.0) / _norm_pdf_C
Since it doesn't involve a loop through numpy arrays it does not look like a prime candidate for cython speedup. Would you write it differently?
Oops, sorry. You are asking about a function like:
def pdf(self, x, *args, **kwds):
    args, loc, scale = self._parse_args(*args, **kwds)
    x, loc, scale = map(asarray, (x, loc, scale))
    args = tuple(map(asarray, args))
    x = asarray((x - loc) * 1.0 / scale)
    cond0 = self._argcheck(*args) & (scale > 0)
    cond1 = (scale > 0) & (x >= self.a) & (x <= self.b)
    cond = cond0 & cond1
    output = zeros(shape(cond), 'd')
    putmask(output, (1 - cond0) + np.isnan(x), self.badvalue)
    if any(cond):
        goodargs = argsreduce(cond, *((x,) + args + (scale,)))
        scale, goodargs = goodargs[-1], goodargs[:-1]
        place(output, cond, self._pdf(*goodargs) / scale)
    if output.ndim == 0:
        return output[()]
    return output
Besides argument checking and massaging with map, I see some iteration hiding in putmask and place. I haven't used place much, but I think it iterates over cond, applies self._pdf, and places the values in output. I suspect the code organization is intended to provide a lot of flexibility, allowing for different models and distributions; I don't see tight code aimed at speed.
For code that would benefit from conversion to Cython you probably need to write something from scratch, something that does not call a lot of other numpy and scipy code and does not build an elaborate class structure. Focus on a very specific calculation, not a family of calculations.
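As a concrete illustration of that "very specific calculation" approach, here is a minimal sketch that skips the generic rv_continuous machinery entirely; fast_norm_pdf and fast_norm_ppf are hypothetical helper names, and scipy.special.ndtri is the C-implemented inverse of the standard normal CDF:
import numpy as np
from scipy.special import ndtri  # inverse of the standard normal CDF, implemented in C

_NORM_C = np.sqrt(2 * np.pi)

def fast_norm_pdf(x, loc=0.0, scale=1.0):
    z = (np.asarray(x, dtype=float) - loc) / scale
    return np.exp(-0.5 * z * z) / (_NORM_C * scale)

def fast_norm_ppf(q, loc=0.0, scale=1.0):
    return loc + scale * ndtri(np.asarray(q, dtype=float))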

Interpreting Hamming Distance speed in python

I've been working on making my Python more pythonic and toying with the runtimes of short snippets of code. My goal is to improve readability and, additionally, to speed up execution.
This example conflicts with the best practices I've been reading about, and I'm interested to find where the flaw in my thought process is.
The problem is to compute the Hamming distance of two equal-length strings. For example, the Hamming distance of 'aaab' and 'aaaa' is 1.
The most straightforward implementation I could think of is as follows:
def hamming_distance_1(s_1, s_2):
    dist = 0
    for x in range(len(s_1)):
        if s_1[x] != s_2[x]: dist += 1
    return dist
Next I wrote two "pythonic" implementations:
def hamming_distance_2(s_1, s_2):
    return sum(i.imap(operator.countOf, s_1, s_2))
and
def hamming_distance_3(s_1, s_2):
    return sum(i.imap(lambda s: int(s[0] != s[1]), i.izip(s_1, s_2)))
In execution:
s_1 = (''.join(random.choice('ABCDEFG') for i in range(10000)))
s_2 = (''.join(random.choice('ABCDEFG') for i in range(10000)))
print 'ham_1 ', timeit.timeit('hamming_distance_1(s_1, s_2)', "from __main__ import s_1,s_2, hamming_distance_1",number=1000)
print 'ham_2 ', timeit.timeit('hamming_distance_2(s_1, s_2)', "from __main__ import s_1,s_2, hamming_distance_2",number=1000)
print 'ham_3 ', timeit.timeit('hamming_distance_3(s_1, s_2)', "from __main__ import s_1,s_2, hamming_distance_3",number=1000)
returning:
ham_1 1.84980392456
ham_2 3.26420593262
ham_3 3.98718094826
I expected that ham_3 would run slower than ham_2, due to the fact that calling a lambda is treated as a function call, which is slower than calling the built-in operator.countOf.
I was surprised I couldn't find a way to get a more pythonic version to run faster than ham_1, however. I have trouble believing that ham_1 is the lower bound for pure Python.
Thoughts, anyone?
The key is making fewer method lookups and function calls:
def hamming_distance_4(s_1, s_2):
    return sum(i != j for i, j in i.izip(s_1, s_2))
It runs at ham_4 1.10134792328 on my system.
ham_2 and ham_3 make lookups inside the loop, so they are slower.
I wonder if this might be a bit more Pythonic, in some broader sense. What if you use http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.hamming.html ... a module that already implements what you're looking for?
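A small sketch of that suggestion; note that scipy.spatial.distance.hamming returns the fraction of mismatching positions, so it has to be scaled by the string length to recover the count the functions above compute:
from scipy.spatial.distance import hamming

def hamming_distance_scipy(s_1, s_2):
    # hamming() takes equal-length sequences and returns a proportion in [0, 1]
    return int(round(hamming(list(s_1), list(s_2)) * len(s_1)))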
