I've been working on making my python more pythonic and toying with the runtimes of short snippets of code. My goal is to improve readability, but additionally, to speed up execution.
This example conflicts with the best practices I've been reading about, and I'm interested to find where the flaw in my thought process is.
The problem is to compute the hamming distance on two equal length strings. For example the hamming distance of strings 'aaab' and 'aaaa' is 1.
The most straightforward implementation I could think of is as follows:
def hamming_distance_1(s_1, s_2):
    dist = 0
    for x in range(len(s_1)):
        if s_1[x] != s_2[x]: dist += 1
    return dist
Next I wrote two "pythonic" implementations:
import itertools as i
import operator

def hamming_distance_2(s_1, s_2):
    return sum(i.imap(operator.ne, s_1, s_2))
and
def hamming_distance_3(s_1, s_2):
    return sum(i.imap(lambda s: int(s[0] != s[1]), i.izip(s_1, s_2)))
In execution:
import random
import timeit

s_1 = ''.join(random.choice('ABCDEFG') for n in range(10000))
s_2 = ''.join(random.choice('ABCDEFG') for n in range(10000))
print 'ham_1 ', timeit.timeit('hamming_distance_1(s_1, s_2)', "from __main__ import s_1, s_2, hamming_distance_1", number=1000)
print 'ham_2 ', timeit.timeit('hamming_distance_2(s_1, s_2)', "from __main__ import s_1, s_2, hamming_distance_2", number=1000)
print 'ham_3 ', timeit.timeit('hamming_distance_3(s_1, s_2)', "from __main__ import s_1, s_2, hamming_distance_3", number=1000)
returning:
ham_1 1.84980392456
ham_2 3.26420593262
ham_3 3.98718094826
I expected ham_3 to run slower than ham_2, because calling a lambda is an extra Python-level function call, which is slower than calling the built-in operator.ne.
I was surprised that I couldn't find a way to get a more pythonic version to run faster than ham_1, however. I have trouble believing that ham_1 is the lower bound for pure python.
Thoughts anyone?
The key is making fewer method lookups and function calls:
def hamming_distance_4(s_1, s_2):
    return sum(i != j for i, j in i.izip(s_1, s_2))
runs at ham_4 1.10134792328 on my system.
ham_2 and ham_3 make lookups inside the loop, so they are slower.
I wonder if this might be a bit more Pythonic, in some broader sense: what about using scipy.spatial.distance.hamming (http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.hamming.html), a module that already implements what you're looking for?
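For reference, a minimal sketch of that approach, assuming SciPy is installed: scipy's hamming() returns the fraction of mismatching positions, so you scale by the length to recover the count.

from scipy.spatial.distance import hamming

def hamming_distance_scipy(s_1, s_2):
    # hamming() gives the proportion of differing positions; multiply by length
    return int(round(hamming(list(s_1), list(s_2)) * len(s_1)))

print(hamming_distance_scipy('aaab', 'aaaa'))  # 1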
I have a list of strings/narratives which I need to compare, getting a distance measure between each pair of strings. The current code I have written works, but for larger lists it takes a long time since I use two for loops. I have used the Levenshtein distance to measure the distance between strings.
The list of strings/narratives is stored in a dataframe.
def edit_distance(s1, s2):
    m = len(s1) + 1
    n = len(s2) + 1
    tbl = {}
    for i in range(m): tbl[i, 0] = i
    for j in range(n): tbl[0, j] = j
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            tbl[i, j] = min(tbl[i, j-1]+1, tbl[i-1, j]+1, tbl[i-1, j-1]+cost)
    return tbl[i, j]
def narrative_feature_extraction(df):
    startTime = time.time()
    leven_matrix = np.zeros((len(df['Narrative']), len(df['Narrative'])))
    for i in range(len(df['Narrative'])):
        for j in range(len(df['Narrative'])):
            leven_matrix[i][j] = edit_distance(df['Narrative'].iloc[i], df['Narrative'].iloc[j])
    endTime = time.time()
    total = (endTime - startTime)
    print "Feature Extraction (Leven) Runtime:" + str(total)
    return leven_matrix

X = narrative_feature_extraction(df)
If the list has n narratives, the resulting X is an n x n matrix, where the rows are the narratives and the columns are the narratives they are compared against. For example, entry (i, j) is the Levenshtein distance between narrative i and narrative j.
Is there a way to optimize this code so that there isn't a need to have so many for loops? Or is there a pythonic way of calculating this?
Hard to give exact code without data/examples, but a few suggestions:
Use a list comprehension; it's much faster than for ... in range ....
Depending on your version of pandas, df[i][j] indexing can be very slow. Use .iloc or .loc instead (if you want to mix and match, use .iloc[df.index.get_loc("itemname"), df.columns.get_loc("itemname")] to convert a .loc lookup to .iloc properly). I think it is only slow if you are getting warning flags for writing to a DataFrame slice, and it depends a lot on which version of python/pandas you have, but I have not tested extensively.
Better yet, run all the calculations and then put the results into the DataFrame in one go, depending on your use case.
If you like the pythonic reading of for loops, at least try to avoid using in range and instead write, for example, for j in X[:,0]. I find this to be faster in most cases, and you can combine it with enumerate to keep index values (example below).
Examples/timings:
import numpy as np
import pandas as pd

def test1(): #list comprehension
    X = np.random.normal(size=(100,2))
    results = [[x*y for x in X[:,0]] for y in X[:,1]]
    df = pd.DataFrame(data=np.array(results))

if __name__ == '__main__':
    import timeit
    print("test1: "+str(timeit.timeit("test1()", setup="from __main__ import test1", number=10)))

def test2(): #enumerate, df at end
    X = np.random.normal(size=(100,2))
    results = np.zeros((100,100))
    for ind,i in enumerate(X[:,0]):
        for col,j in enumerate(X[:,1]):
            results[ind,col] = i*j
    df = pd.DataFrame(data=results)

if __name__ == '__main__':
    import timeit
    print("test2: "+str(timeit.timeit("test2()", setup="from __main__ import test2", number=10)))

def test3(): #in range, but df at end
    X = np.random.normal(size=(100,2))
    results = np.zeros((100,100))
    for i in range(len(X)):
        for j in range(len(X)):
            results[i,j] = X[i,0]*X[j,1]
    df = pd.DataFrame(data=results)

if __name__ == '__main__':
    import timeit
    print("test3: "+str(timeit.timeit("test3()", setup="from __main__ import test3", number=10)))

def test4(): #current method
    X = np.random.normal(size=(100,2))
    df = pd.DataFrame(data=np.zeros((100,100)))
    for i in range(len(X)):
        for j in range(len(X)):
            df[i][j] = X[i,0]*X[j,1]

if __name__ == '__main__':
    import timeit
    print("test4: "+str(timeit.timeit("test4()", setup="from __main__ import test4", number=10)))
output:
test1: 0.0492231889643
test2: 0.0587620022106
test3: 0.123777403419
test4: 12.6396287782
so the list comprehension is ~250 times faster, and enumerate is about twice as fast as for x in range. The real slowdown, though, is the individual indexing of your dataframe (even with .loc or .iloc this will still be your bottleneck, so I suggest working with arrays outside of the df if possible).
Hope this helps and you are able to apply it to your case. I'd also recommend reading up on the map, filter, reduce (and maybe enumerate) functions, as they are quite quick and might help you: http://book.pythontips.com/en/latest/map_filter.html
Unfortunately I am not really familiar with your use case though, but I don't see a reason why it wouldn't be applicable or compatible with this type of code tuning.
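To make that concrete for the narrative case, here is a rough, untested sketch. It assumes df has a 'Narrative' column and edit_distance is defined as in the question: pull the strings out of the DataFrame once, build the matrix with a list comprehension, and only wrap the result in an array at the end.

import numpy as np

narratives = df['Narrative'].tolist()   # index the DataFrame once, not inside the loops
leven_matrix = np.array([[edit_distance(a, b) for b in narratives]
                         for a in narratives])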
I'm looking for some help understanding best practices regarding dictionaries in Python.
I have an example below:
def convert_to_celsius(temp, source):
    conversion_dict = {
        'kelvin': temp - 273.15,
        'romer': (temp - 7.5) * 40 / 21
    }
    return conversion_dict[source]
def convert_to_celsius_lambda(temp, source):
    conversion_dict = {
        'kelvin': lambda x: x - 273.15,
        'romer': lambda x: (x - 7.5) * 40 / 21
    }
    return conversion_dict[source](temp)
Obviously, the two functions achieve the same goal, but via different means. Could someone help me understand the subtle difference between the two, and what the 'best' way to go on about this would be?
If both dictionaries are created inside the function, then the former will be more efficient: although it performs two calculations when only one is needed, there is more overhead in the latter version from creating the lambdas each time it's called:
>>> import timeit
>>> setup = "from __main__ import convert_to_celsius, convert_to_celsius_lambda, convert_to_celsius_lambda_once"
>>> timeit.timeit("convert_to_celsius(100, 'kelvin')", setup=setup)
0.5716437913429102
>>> timeit.timeit("convert_to_celsius_lambda(100, 'kelvin')", setup=setup)
0.6484164544288618
However, if you move the dictionary of lambdas outside the function:
CONVERSION_DICT = {
    'kelvin': lambda x: x - 273.15,
    'romer': lambda x: (x - 7.5) * 40 / 21
}

def convert_to_celsius_lambda_once(temp, source):
    return CONVERSION_DICT[source](temp)
then the latter is more efficient, as the lambda objects are only created once, and the function only does the necessary calculation on each call:
>>> timeit.timeit("convert_to_celsius_lambda_once(100, 'kelvin')", setup=setup)
0.3904035060131186
Note that this will only be a benefit where the function is being called a lot (in this case, 1,000,000 times), so that the overhead of creating the two lambda function objects is less than the time wasted in calculating two results when only one is needed.
The dictionary is totally pointless, since you need to re-create it on each call but all you ever do is a single look-up. Just use an if:
def convert_to_celsius(temp, source):
    if source == "kelvin": return temp - 273.15
    elif source == "romer": return (temp - 7.5) * 40 / 21
    raise KeyError("unknown temperature source '%s'" % source)
Even though both achieve the same thing, the first part is more readable and faster.
In your first example, every value in the dictionary is calculated as soon as convert_to_celsius is called.
In the second example, only the required conversion is calculated.
If you had the second function do an expensive calculation, then it would probably make sense to use a function instead, but for this particular example it's not required.
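To see the eager-versus-lazy difference in isolation, here is a small illustrative sketch; expensive() is a hypothetical stand-in for a costly conversion. With plain values every entry is computed while the dict literal is built, while with lambdas only the entry you actually look up is run.

import time

def expensive(x):
    time.sleep(1)   # hypothetical stand-in for a costly conversion
    return x

eager = {'a': expensive(1), 'b': 2}            # pays the one-second cost just to build the dict
lazy = {'a': lambda: expensive(1), 'b': lambda: 2}
print(lazy['b']())                              # looking up 'b' never runs expensive()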
As others have pointed out, neither of your options is ideal. The first one does both calculations every time and has an unnecessary dict. The second one has to create the lambdas every time through. If this example is the goal, then I agree with unwind: just use an if statement. If the goal is to learn something that can be expanded to other uses, I like this approach:
convert_to_celsius = {'kelvin': lambda temp: temp - 273.15,
                      'romer': lambda temp: (temp - 7.5) * 40 / 21}

newtemp = convert_to_celsius[source](temp)
Your calculation definitions are all stored together, and your function call is uncluttered and meaningful.
I'm running into a performance bottleneck when using a custom distance metric function for a clustering algorithm from sklearn.
The result as shown by Run Snake Run is this:
Clearly the problem is the dbscan_metric function. The function looks very simple and I don't quite know what the best approach to speeding it up would be:
def dbscan_metric(a, b):
    if a.shape[0] != NUM_FEATURES:
        return np.linalg.norm(a-b)
    else:
        return np.linalg.norm(np.multiply(FTR_WEIGHTS, (a-b)))
Any thoughts as to what is causing it to be this slow would be much appreciated.
I am not familiar with what the function does - but is there a possibility of repeated calculations? If so, you could memoize the function:
cache = {}

def dbscan_metric(a, b):
    diff = a - b
    if a.shape[0] != NUM_FEATURES:
        to_calc = diff
    else:
        to_calc = np.multiply(FTR_WEIGHTS, diff)
    key = tuple(to_calc)  # arrays aren't hashable, so use a tuple as the cache key
    if key not in cache:
        cache[key] = np.linalg.norm(to_calc)
    return cache[key]
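For context, a hedged sketch of how such a callable ends up in the hot path (the eps/min_samples values and X are made up, and it assumes dbscan_metric with its NUM_FEATURES/FTR_WEIGHTS globals is defined as above): sklearn's DBSCAN accepts a callable metric and invokes it for each pair of samples it compares, so every call pays Python function-call overhead, which is why caching repeated calculations can help.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(100, 5)   # hypothetical feature matrix
# dbscan_metric is invoked once per pair of samples DBSCAN compares
labels = DBSCAN(eps=0.5, min_samples=5, metric=dbscan_metric).fit_predict(X)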
The documentation for itertools provides a recipe for a pairwise() function, which I've slightly modified below so that it returns (last_item, None) as the final pair:
from itertools import tee, izip_longest

def pairwise_tee(iterable):
    a, b = tee(iterable)
    next(b, None)
    return izip_longest(a, b)
However, it seemed to me that using tee() might be overkill (given that it's only being used to provide one step of look-ahead), so I tried writing an alternative that avoids it:
def pairwise_zed(iterator):
    a = next(iterator)
    for b in iterator:
        yield a, b
        a = b
    yield a, None
Note: it so happens that I know my input will be an iterator for my use case; I'm aware that the function above won't work with a regular iterable. The requirement to accept an iterator is also why I'm not using something like izip_longest(iterable, iterable[1:]), by the way.
Testing both functions for speed gave the following results in Python 2.7.3:
>>> import random, string, timeit
>>> for length in range(0, 61, 10):
...     text = "".join(random.choice(string.ascii_letters) for n in range(length))
...     for variant in "tee", "zed":
...         test_case = "list(pairwise_%s(iter('%s')))" % (variant, text)
...         setup = "from __main__ import pairwise_%s" % variant
...         result = timeit.repeat(test_case, setup=setup, number=100000)
...         print "%2d %s %r" % (length, variant, result)
...     print
...
0 tee [0.4337780475616455, 0.42563915252685547, 0.42760396003723145]
0 zed [0.21209311485290527, 0.21059393882751465, 0.21039700508117676]
10 tee [0.4933490753173828, 0.4958930015563965, 0.4938509464263916]
10 zed [0.32074403762817383, 0.32239794731140137, 0.32340312004089355]
20 tee [0.6139161586761475, 0.6109561920166016, 0.6153261661529541]
20 zed [0.49281787872314453, 0.49651598930358887, 0.4942781925201416]
30 tee [0.7470319271087646, 0.7446520328521729, 0.7463529109954834]
30 zed [0.7085139751434326, 0.7165200710296631, 0.7171430587768555]
40 tee [0.8083810806274414, 0.8031280040740967, 0.8049719333648682]
40 zed [0.8273730278015137, 0.8248250484466553, 0.8298079967498779]
50 tee [0.8745720386505127, 0.9205660820007324, 0.878741979598999]
50 zed [0.9760301113128662, 0.9776301383972168, 0.978381872177124]
60 tee [0.9913749694824219, 0.9922418594360352, 0.9938201904296875]
60 zed [1.1071209907531738, 1.1063809394836426, 1.1069209575653076]
... so, it turns out that pairwise_tee() starts to outperform pairwise_zed() when there are about forty items. That's fine, as far as I'm concerned - on average, my input is likely to be under that threshold.
My question is: which should I use? pairwise_zed() looks like it'll be a little faster (and to my eyes is slightly easier to follow), but pairwise_tee() could be considered the "canonical" implementation by virtue of being taken from the official docs (to which I could link in a comment), and will work for any iterable - which isn't a consideration at this point, but I suppose could be later.
I was also wondering about potential gotchas if the iterator is interfered with outside the function, e.g.
for a, b in pairwise(iterator):
    # do something
    q = next(iterator)
... but as far as I can tell, pairwise_zed() and pairwise_tee() behave identically in that situation (and of course it would be a damn fool thing to do in the first place).
The itertools tee implementation is idiomatic for those experienced with itertools, though I'd be tempted to use islice instead of next to advance the leading iterator.
A disadvantage of your version is that it's less easy to extend it to n-wise iteration as your state is stored in local variables; I'd be tempted to use a deque:
from itertools import chain, islice, repeat
import collections

def pairwise_deque(iterator, n=2):
    it = chain(iterator, repeat(None, n - 1))
    d = collections.deque(islice(it, n - 1), maxlen=n)
    for a in it:
        d.append(a)
        yield tuple(d)
A useful idiom is calling iter on the iterator parameter; this is an easy way to ensure your function works on any iterable.
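Applied to your version, that idiom is a one-line change; a sketch: iter() is a no-op on something that is already an iterator, and turns any other iterable into one.

def pairwise_zed(iterable):
    iterator = iter(iterable)   # no-op for an iterator, converts lists/strings/etc.
    a = next(iterator)
    for b in iterator:
        yield a, b
        a = b
    yield a, None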
This is a subjective question; both versions are fine.
I would use tee, because it looks simpler to me: I know what tee does, so the first is immediately obvious, whereas with the second I have to think a little about the order in which you overwrite a at the end of each loop. The timing differences are probably small enough to be irrelevant, but you're the judge of that.
Regarding your other question, from the tee docs:
Once tee() has made a split, the original iterable should not be used anywhere else; otherwise, the iterable could get advanced without the tee objects being informed.
Novice programmer here. I'm writing a program that analyzes the relative spatial locations of points (cells). The program gets boundaries and cell type from an array with the x coordinate in column 1, the y coordinate in column 2, and the cell type in column 3. It then checks each cell for cell type and appropriate distance from the bounds. If it passes, it calculates its distance from each other cell in the array, and if the distance is within a specified analysis range it adds it to an output array at that distance.
My cell marking program is in wxPython, so I was hoping to develop this program in python as well and eventually stick it into the GUI. Unfortunately, right now python takes ~20 seconds to run the core loop on my machine while MATLAB can do ~15 loops/second. Since I'm planning on doing 1000 loops (with a randomized comparison condition) on ~30 cases times several exploratory analysis types, this is not a trivial difference.
I tried running a profiler; array calls are about a quarter of the time, and almost all of the rest is unspecified loop time.
Here is the python code for the main loop:
for basecell in range(0, cellnumber-1):
    if firstcelltype == np.array((cellrecord[basecell,2])):
        xloc = np.array((cellrecord[basecell,0]))
        yloc = np.array((cellrecord[basecell,1]))
        xedgedist = (xbound-xloc)
        yedgedist = (ybound-yloc)
        if xloc>excludedist and xedgedist>excludedist and yloc>excludedist and yedgedist>excludedist:
            for comparecell in range(0, cellnumber-1):
                if secondcelltype==np.array((cellrecord[comparecell,2])):
                    xcomploc = np.array((cellrecord[comparecell,0]))
                    ycomploc = np.array((cellrecord[comparecell,1]))
                    dist = math.sqrt((xcomploc-xloc)**2+(ycomploc-yloc)**2)
                    dist = round(dist)
                    if dist>=1 and dist<=analysisdist:
                        arraytarget = round(dist*analysisdist/intervalnumber)
                        addone = np.array((spatialraw[arraytarget-1]))
                        addone = addone+1
                        targetcell = arraytarget-1
                        np.put(spatialraw,[targetcell,targetcell],addone)
Here is the matlab code for the main loop:
for basecell = 1:cellnumber;
    if firstcelltype==cellrecord(basecell,3);
        xloc=cellrecord(basecell,1);
        yloc=cellrecord(basecell,2);
        xedgedist=(xbound-xloc);
        yedgedist=(ybound-yloc);
        if (xloc>excludedist) && (yloc>excludedist) && (xedgedist>excludedist) && (yedgedist>excludedist);
            for comparecell = 1:cellnumber;
                if secondcelltype==cellrecord(comparecell,3);
                    xcomploc=cellrecord(comparecell,1);
                    ycomploc=cellrecord(comparecell,2);
                    dist=sqrt((xcomploc-xloc)^2+(ycomploc-yloc)^2);
                    if (dist>=1) && (dist<=100.4999);
                        arraytarget=round(dist*analysisdist/intervalnumber);
                        spatialsum(1,arraytarget)=spatialsum(1,arraytarget)+1;
                    end
                end
            end
        end
    end
end
Thanks!
Here are some ways to speed up your python code.
First: Don't make np arrays when you are only storing one value. You do this many times over in your code. For instance,
if firstcelltype == np.array((cellrecord[basecell,2])):
can just be
if firstcelltype == cellrecord[basecell,2]:
I'll show you why with some timeit statements:
>>> timeit.Timer('x = 111.1').timeit()
0.045882196294822819
>>> t=timeit.Timer('x = np.array(111.1)','import numpy as np').timeit()
0.55774970267830071
That's an order of magnitude in difference between those calls.
Second: The following code:
arraytarget=round(dist*analysisdist/intervalnumber)
addone=np.array((spatialraw[arraytarget-1]))
addone=addone+1
targetcell=arraytarget-1
np.put(spatialraw,[targetcell,targetcell],addone)
can be replaced with
arraytarget=round(dist*analysisdist/intervalnumber)-1
spatialraw[arraytarget] += 1
Third: You can get rid of the sqrt as Philip mentioned by squaring analysisdist beforehand. However, since you use analysisdist to get arraytarget, you might want to create a separate variable, analysisdist2 that is the square of analysisdist and use that for your comparison.
Fourth: You are looking for cells that match secondcelltype every time you get to that point rather than finding those one time and using the list over and over again. You could define an array:
comparecells = np.where(cellrecord[:,2]==secondcelltype)[0]
and then replace
for comparecell in range (0, cellnumber-1):
if secondcelltype==np.array((cellrecord[comparecell,2])):
with
for comparecell in comparecells:
Fifth: Use psyco. It is a JIT compiler. Matlab has a built-in JIT compiler if you're using a somewhat recent version. This should speed up your code a bit.
Sixth: If the code still isn't fast enough after all previous steps, then you should try vectorizing your code. It shouldn't be too difficult. Basically, the more stuff you can have in numpy arrays the better. Here's my try at vectorizing:
basecells = np.where(cellrecord[:,2]==firstcelltype)[0]
xlocs = cellrecord[basecells, 0]
ylocs = cellrecord[basecells, 1]
xedgedists = xbound - xlocs
yedgedists = ybound - ylocs
whichcells = np.where((xlocs>excludedist) & (xedgedists>excludedist) & (ylocs>excludedist) & (yedgedists>excludedist))[0]
selectedcells = basecells[whichcells]
comparecells = np.where(cellrecord[:,2]==secondcelltype)[0]
xcomplocs = cellrecord[comparecells, 0]
ycomplocs = cellrecord[comparecells, 1]
analysisdist2 = analysisdist**2
for basecell in selectedcells:
    # work with squared distances so sqrt is only taken for cells that are in range
    dists2 = (xcomplocs-cellrecord[basecell,0])**2 + (ycomplocs-cellrecord[basecell,1])**2
    whichdists = np.where((dists2 >= 1) & (dists2 <= analysisdist2))[0]
    arraytargets = np.round(np.sqrt(dists2[whichdists])*analysisdist/intervalnumber).astype(int) - 1
    for target in arraytargets:
        spatialraw[target] += 1
You can probably take out that inner for loop, but you have to be careful because some of the elements of arraytargets could be the same. Also, I didn't actually try out all of the code, so there could be a bug or typo in there. Hopefully, it gives you a good idea of how to do this. Oh, one more thing: you could make analysisdist/intervalnumber a separate variable to avoid doing that division over and over again.
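On the duplicate-index caveat: with a reasonably recent NumPy (1.8 or later), np.add.at applies the increment once per occurrence of an index, so the inner loop can go away safely; plain fancy-index += would count each repeated index only once. A tiny sketch:

import numpy as np

spatialraw = np.zeros(10, dtype=int)
arraytargets = np.array([2, 2, 5, 7])   # note the repeated index 2
np.add.at(spatialraw, arraytargets, 1)  # unbuffered add: index 2 is incremented twice
print(spatialraw)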
Not too sure about the slowness of python, but your Matlab code can be HIGHLY optimized. Nested for-loops tend to have horrible performance issues. You can replace the inner loop with a vectorized computation ... as below:
for basecell = 1:cellnumber;
    if firstcelltype==cellrecord(basecell,3);
        xloc=cellrecord(basecell,1);
        yloc=cellrecord(basecell,2);
        xedgedist=(xbound-xloc);
        yedgedist=(ybound-yloc);
        if (xloc>excludedist) && (yloc>excludedist) && (xedgedist>excludedist) && (yedgedist>excludedist);
            % for comparecell = 1:cellnumber;
            %     if secondcelltype==cellrecord(comparecell,3);
            %         xcomploc=cellrecord(comparecell,1);
            %         ycomploc=cellrecord(comparecell,2);
            %         dist=sqrt((xcomploc-xloc)^2+(ycomploc-yloc)^2);
            %         if (dist>=1) && (dist<=100.4999);
            %             arraytarget=round(dist*analysisdist/intervalnumber);
            %             spatialsum(1,arraytarget)=spatialsum(1,arraytarget)+1;
            %         end
            %     end
            % end
            % replace with:
            secondcelltype_mask = secondcelltype == cellrecord(:,3);
            xcomploc_vec = cellrecord(secondcelltype_mask,1);
            ycomploc_vec = cellrecord(secondcelltype_mask,2);
            dist_vec = sqrt((xcomploc_vec-xloc).^2+(ycomploc_vec-yloc).^2);  % element-wise .^
            dist_mask = dist_vec>=1 & dist_vec<=100.4999;
            arraytarget_vec = round(dist_vec(dist_mask)*analysisdist/intervalnumber);
            count = accumarray(arraytarget_vec,1, [size(spatialsum,1),1]);
            spatialsum(:,1) = spatialsum(:,1)+count;
        end
    end
end
There may be some small errors in there since I don't have any data to test the code with, but it should give roughly a 10X speed-up over the original Matlab code.
From my experience with numpy, I've noticed that swapping out for-loops for vectorized/matrix-based arithmetic gives noticeable speed-ups as well. However, without the shapes of all of your variables it's hard to vectorize things.
You can avoid some of the math.sqrt calls by replacing the lines
dist=math.sqrt((xcomploc-xloc)**2+(ycomploc-yloc)**2)
dist=round(dist)
if dist>=1 and dist<=analysisdist:
    arraytarget=round(dist*analysisdist/intervalnumber)
with
dist=(xcomploc-xloc)**2+(ycomploc-yloc)**2
dist=round(dist)
if dist>=1 and dist<=analysisdist_squared:
    arraytarget=round(math.sqrt(dist)*analysisdist/intervalnumber)
where you have the line
analysisdist_squared = analysisdist * analysisdist
outside of the main loop of your function.
Since math.sqrt is called in the innermost loop, you should have from math import sqrt at the top of the module and just call the function as sqrt.
I would also try replacing
dist=(xcomploc-xloc)**2+(ycomploc-yloc)**2
with
dist=(xcomploc-xloc)*(xcomploc-xloc)+(ycomploc-yloc)*(ycomploc-yloc)
There's a chance it will produce faster byte code to do multiplication rather than exponentiation.
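If you want to check that claim on your own interpreter, a quick micro-benchmark along these lines should settle it (numbers vary by machine and Python version):

import timeit
print(timeit.timeit("x**2", setup="x = 3.14"))
print(timeit.timeit("x*x", setup="x = 3.14"))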
I doubt these will get you all the way to MATLAB's performance, but they should help reduce some overhead.
If you have a multicore machine, you could maybe give the multiprocessing module a try and use multiple processes to make use of all the cores.
Instead of sqrt you could use x**0.5, which is, if I remember correctly, slightly faster.