Python is a kind of "script" programming language.
In this situation:
def dic_test():
    a = {}
    a[0] = [0, 0, 0]
    for i in range(10000000):
        a[0][0] += 1
        a[0][1] += 1
        a[0][2] += 1
    print(a)

def no_dic_test():
    a = {}
    a[0] = [0, 0, 0]
    target = a[0]
    for i in range(10000000):
        target[0] += 1
        target[1] += 1
        target[2] += 1
    print(a)
Will no_dic_test() be faster than dic_test()?
I thought yes, because Python is dynamic: each statement is evaluated separately, so a[0] has to be looked up again every time.
I used profile to benchmark. The first function was slower than the second one, but the difference was slight.
First function: 5 function calls in 26.113 seconds
Second function: 5 function calls in 23.835 seconds
That is an extreme case. In my own case, with about 10k keys and 10k operations, using the dictionary directly is faster, which surprised me.
So, does Python have something like a C-style static compiler or a cache optimisation for dictionaries? Or are Python's hash tables simply fast enough that this problem doesn't matter?
Thanks!
It's pretty obvious that the second function is doing less work on each loop iteration.
The first function has to do a dict lookup for a[0] on every statement inside the loop, whereas the second function does the lookup once and keeps the result in a local variable.
There are runtimes like PyPy that spot the hot loop and JIT compile them for added performance, but the CPython runtime doesn't do this kind of optimisation yet.
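A minimal sketch of how the two variants could be compared with timeit. The function names and the reduced iteration count are assumptions made for illustration, not taken from the question, and absolute timings will differ between machines and interpreters.

```python
import timeit

N = 100_000  # assumed iteration count, much smaller than in the question


def dict_each_time(n=N):
    """Look up a[0] on every statement, as in dic_test()."""
    a = {0: [0, 0, 0]}
    for _ in range(n):
        a[0][0] += 1
        a[0][1] += 1
        a[0][2] += 1
    return a


def dict_once(n=N):
    """Look up a[0] once and reuse the local name, as in no_dic_test()."""
    a = {0: [0, 0, 0]}
    target = a[0]
    for _ in range(n):
        target[0] += 1
        target[1] += 1
        target[2] += 1
    return a


if __name__ == "__main__":
    for func in (dict_each_time, dict_once):
        # Taking the best of several runs reduces noise from other processes.
        best = min(timeit.repeat(func, number=1, repeat=5))
        print(f"{func.__name__}: {best:.4f} s")
```

Both functions return the same dictionary, so any difference in the printed times comes only from the repeated a[0] lookups.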
I am learning scala by converting some of my python code to scala code. I just encountered an issue where the python code is significantly outperforming the scala code. The code is supposed to construct a set of candidate pairs based on some conditions. Scala has comparable runtime performance with python for all previous parts.
The id_map is an array of map from Long to set of string. The average number of k-v pairs in the map is 1942.
The scala code snippet is below:
// id_map: Array[mutable.Map[Long, Set[String]]]
val candidate_pairs = id_map
  .flatMap(hashmap => hashmap.values)
  .filter(_.size >= 2)
  .flatMap(strset => strset.toList.combinations(2))
  .map(_.sorted)
  .toSet
and the corresponding python code is
from itertools import combinations

candidate_pairs = set()
for hashmap in id_map.values():
    for strset in hashmap.values():
        if len(strset) >= 2:
            for pair in combinations(strset, 2):
                candidate_pairs.add(tuple(sorted(pair)))
The scala code snippet takes 80 seconds while python version takes 10 seconds.
I am wondering how I can optimize the code above to make it faster. One thing I have tried is updating the set in a for loop:
var candidate_pairs = Set.empty[List[String]]
for (
  hashmap: mutable.Map[Long, Set[String]] <- id_map;
  setstr: Set[String] <- hashmap.values if setstr.size >= 2;
  pair <- setstr.toList.combinations(2)
)
  candidate_pairs += pair.sorted
Although candidate_pairs is updated many times and each update creates a new set, this is actually faster than the previous Scala version and takes about 50 seconds, which is still worse than Python. I also tried a mutable set, but the result is about the same as with the immutable version.
Any help would be appreciated! Thanks!
Being slower than python sounds ... surprising.
First of all, make sure you have adequate memory settings, and it is not spending half of those 80 seconds in GC.
Also, be sure to "warm up" the JVM (run your function a few times before doing actual measurement), use the same exact data for runs in python and scala (not just same statistics, exactly the same data), and do not include the time spent acquiring/generating data into measurement. Make several runs and compare average time, not how much a single run took.
Having said that, a few ways to make your code faster:
- Adding .view (or .iterator) after id_map in your implementation cuts the execution time by about a factor of 4 in my experiments. (.view applies your chained transformations lazily: essentially a single pass over the single instance of the array, instead of multiple passes with multiple copies.)
- Replacing .map(_.sorted) with
    .map {
      case List(a, b) if a < b => (a, b)
      case List(a, b) => (b, a)
    }
  shaves off about another 75% (sorting two-element lists is mostly overhead). This changes the return type to tuples rather than lists (constructing lots of tiny lists also adds up), but tuples seem even more appropriate in this case.
- Removing .filter(_.size >= 2) (it is redundant anyway, and computing the size of a collection may get expensive) yields a further improvement, but a fairly small one that I did not bother to measure exactly.
Additionally, it may be cheaper to get rid of the separate sort step altogether and just add .sorted before .combinations. I have not tested it, because it would be futile without knowing more details about your data profile.
These are general improvements that should help either way, though it is hard to be sure you'll see the same effect as I do, since I don't know anything about your data beyond the average map size. The improvement you see might be better than mine or somewhat smaller, but you should see some.
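The idea of sorting once before generating combinations, instead of sorting every pair, can be sketched against the Python version from the question. This is only an illustration: the function names and sample data are hypothetical, and whether it pays off depends on the data. It relies on itertools.combinations emitting tuples in the order of its input, so a sorted input yields already-ordered pairs.

```python
from itertools import combinations


def pairs_sort_each(strsets):
    """Sort every pair individually, as in the question's Python code."""
    result = set()
    for strset in strsets:
        for pair in combinations(strset, 2):
            result.add(tuple(sorted(pair)))
    return result


def pairs_sort_once(strsets):
    """Sort each set once; combinations() then yields already-ordered pairs."""
    result = set()
    for strset in strsets:
        # Sets with fewer than two elements simply produce no pairs.
        result.update(combinations(sorted(strset), 2))
    return result


# Hypothetical sample data standing in for the values of the maps in id_map.
sample = [{"b", "a", "c"}, {"z"}, {"d", "a"}]
assert pairs_sort_each(sample) == pairs_sort_once(sample)
```

The one-element set also shows why an explicit size check is not strictly needed: it contributes no pairs in either variant.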
I ran this version with some test Scala code I created. On a list of 1944 elements, it completed in about 15 ms on my laptop.
id_map
  .flatMap(hashmap => hashmap.values)
  .flatMap { strset =>
    if (strset.size >= 2) {
      strset.toIndexedSeq.combinations(2)
    } else IndexedSeq.empty
  }.map(_.sorted).toSet
Main changes I have are to use an IndexedSeq instead of a List (which is a LinkedList), and to do the filter on the fly.
I assume you didn't want to hyper optimize, in which case you could still remove a lot of the intermediate collections created in the flatMap, combinations, conversion to IndexedSeq and toSet call.
In the examples below, both functions have roughly the same number of procedures.
def lenIter(aStr):
    count = 0
    for c in aStr:
        count += 1
    return count
or
def lenRecur(aStr):
    if aStr == '':
        return 0
    return 1 + lenRecur(aStr[1:])
Is picking between the two techniques a matter of style, or is one of them more efficient here?
Python does not perform tail call optimization, so the recursive solution will exceed the interpreter's recursion limit on long strings (CPython's default limit is 1000 frames). The iterative method does not have this flaw.
That said, len(str) is faster than both methods.
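A small sketch of the recursion-limit problem, assuming Python 3.5 or later (where the error is RecursionError). The snake_case names and the helper hits_recursion_limit are introduced here for illustration only.

```python
import sys


def len_iter(s):
    count = 0
    for _ in s:
        count += 1
    return count


def len_recur(s):
    if s == "":
        return 0
    return 1 + len_recur(s[1:])


def hits_recursion_limit(s):
    """Return True if len_recur(s) exceeds the interpreter's recursion limit."""
    try:
        len_recur(s)
    except RecursionError:
        return True
    return False


# A string longer than the current recursion limit, whatever that limit is.
long_string = "a" * (sys.getrecursionlimit() + 100)

print(len_iter(long_string) == len(long_string))  # iterative version copes
print(hits_recursion_limit(long_string))          # recursive version does not
print(hits_recursion_limit("short"))
```

Raising the limit with sys.setrecursionlimit only moves the threshold; the iterative version and len() have no such threshold at all.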
This is not correct: 'functions have roughly the same number of procedures'. You probably mean that: 'these procedures require the same number of operations', or, more formally 'they have the same computational time complexity'.
While both have the same computational time complexity, the recursive version executes additional CPU instructions on every call: creating a new frame for the function, switching context, and cleaning up after each return. These operations do not increase the theoretical complexity, but in most real-life implementations they add significant overhead.
Also, the recursive method has higher space complexity, as each new recursive call needs new storage for its data.
Surely the first approach is more efficient, as Python doesn't have to do a lot of function calls and string slicing. Each of these operations involves further work that is costly for the Python interpreter, and may cause problems later when dealing with long strings.
As a more Pythonic way, you should use the len() function to get the length of a string.
You can also use the code object to see the stack size each function's frame requires:
>>> lenRecur.__code__.co_stacksize
4
>>> lenIter.__code__.co_stacksize
3
I notice some interesting behavior when it comes to building lists in different ways. .append takes longer than list-comprehensions, which take longer than map, as shown in the experiments below:
import time  # time.clock was removed in Python 3.8; use time.perf_counter there

def square(x): return x**2

def appendtime(times=10**6):
    answer = []
    start = time.clock()
    for i in range(times):
        answer.append(square(i))
    end = time.clock()
    return end-start

def comptime(times=10**6):
    start = time.clock()
    answer = [square(i) for i in range(times)]
    end = time.clock()
    return end-start

def maptime(times=10**6):
    start = time.clock()
    answer = map(square, range(times))
    end = time.clock()
    return end-start

for func in [appendtime, comptime, maptime]:
    print("%s: %s" %(func.__name__, func()))
Python 2.7:
appendtime: 0.42632
comptime: 0.312877
maptime: 0.232474
Python 3.3.3:
appendtime: 0.614167
comptime: 0.5506650000000001
maptime: 0.57115
Now, I am very aware that range in python 2.7 builds a list, so I get why there is a disparity between the times of the corresponding functions in python 2.7 and 3.3. What I am more concerned about is the relative time differences between append, list-comprehension and map.
At first, I considered that this might be because map and list comprehensions may afford the interpreter knowledge of the eventual size of the resultant list, which would allow the interpreter to malloc a sufficiently large C array under the hood to store the list. By that logic, list-comprehensions and map should take pretty much the same amount of time.
However, the timing data shows that in python 2.7, listcomps are ~1.36x as fast as append, and map is ~1.34x as fast as listcomps.
More curious is that in python 3.3, listcomps are ~1.12x as fast as append, and map is actually slower than listcomps.
Clearly, map and listcomps don't "play by the same rules"; clearly, map takes advantage of something that listcomps don't.
Could anybody shed some light on the reason behind the difference in these timing values?
First, in Python 3.x, map returns a lazy iterator, NOT a list, so that explains the 50kx speedup there. To make it a fair timing, in Python 3.x you'd need list(map(...)).
Second, .append will be slower because each time through the loop, the interpreter needs to look up the list, then look up the append method on the list. This additional .append lookup does not need to happen with the list comprehension or map.
Finally, with the list-comprehension, I believe the function square needs to be looked up at every turn of your loop. With map, it is only looked up when you call map which is why if you're calling a function in your list-comprehension, map will typically be faster. Note that a list-comprehension usually beats out map with a lambda function though.
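A sketch of a fairer comparison for Python 3, where the map result is wrapped in list() so that all three variants actually build the list. The function names and the problem size are assumptions for illustration; timings depend on the interpreter version and machine, so no ordering is claimed here.

```python
import timeit


def square(x):
    return x ** 2


def with_append(n):
    answer = []
    for i in range(n):
        answer.append(square(i))
    return answer


def with_comprehension(n):
    return [square(i) for i in range(n)]


def with_map(n):
    # list(...) forces the lazy Python 3 map object to do the work.
    return list(map(square, range(n)))


if __name__ == "__main__":
    n = 10 ** 5  # assumed size, smaller than in the question
    for func in (with_append, with_comprehension, with_map):
        best = min(timeit.repeat(lambda: func(n), number=1, repeat=5))
        print(f"{func.__name__}: {best:.4f} s")
```

Using timeit.repeat and taking the minimum also sidesteps time.clock, which no longer exists in current Python 3 releases.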
I am currently in the process of optimising the translation part of my software, which translates co-ordinates x amount of times. My current translation code is in the translate function and the supposedly optimised portion in the translate_map function.
I read here that the map function should be used instead of for loops where possible because the loop is performed in C.
When I run a test case below, the map function actually runs slower than a standard for loop. Why does the map perform slower than the conventional for loop? How could I optimise the translate function to run faster?
import time
def translate(atom_list):
    for i in atom_list:
        i[1]+=1
        i[2]+=1
        i[3]+=1

atoms = [[1,1,1,1]]*1000
start = time.time()
for x in xrange(10000):
    translate(atoms)
print time.time() - start

atoms = [[1,1,1,1]]*1000
start = time.time()
def translate_map(atom_list):
    atom_list[1]+=1
    atom_list[2]+=1
    atom_list[3]+=1

for x in xrange(10000):
    map(translate_map,atoms)
print time.time() - start
output:
2.92705798149
4.14674210548
I suspect most of the overhead you're seeing with your map implementation comes from function call overhead. The translate function does all its work within a single loop, so there's just a single function call for the whole process. The implementation with map makes a separate function call for every item in the list.
A second source of overhead (though I suspect it is small compared to the function calls) is that map creates a list with the return values from the function. Since translate_map doesn't have a return statement, this will be all None values. Note that in Python 3, map returns a lazy iterator, so your map version won't do any work at all unless you iterate over the results of the map call. The explicit loop is much clearer though, so I'd stick with that (if you don't go for numpy).
Oh, yes, numpy would make this much easier (and almost certainly faster too):
def translate(arr):  # arr should be a numpy array
    arr += 1
That's it! No loops needed (at the Python level).
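The Python 3 behaviour of map mentioned above can be seen in a short sketch. The name translate_one and the tiny three-atom list are made up for this illustration; the point is only that the function is not called until the map object is consumed.

```python
def translate_one(atom):
    atom[1] += 1
    atom[2] += 1
    atom[3] += 1


# Independent inner lists (note that [[1, 1, 1, 1]] * 3 would repeat one shared list).
atoms = [[1, 1, 1, 1] for _ in range(3)]

lazy = map(translate_one, atoms)       # Python 3: nothing has run yet
untouched = [list(a) for a in atoms]   # snapshot before consuming the iterator

for _ in lazy:                         # consuming the iterator runs the calls
    pass
translated = [list(a) for a in atoms]  # snapshot after

print(untouched)
print(translated)
```

A plain for loop over atoms does the same work without the intermediate map object, which is one more reason to prefer the explicit loop here.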
So I have a time-critical section of code within a Python script, and I decided to write a Cython module (with one function -- all I need) to replace it. Unfortunately, the execution speed of the function I'm calling from the Cython module (which I'm calling within my Python script) isn't nearly as fast as I tested it to be in a variety of other scenarios. Note that I CANNOT share the code itself because of contract law! See the following cases, and take them as an initial description of my issue:
(1) Execute Cython function by using the Python interpreter to import the module and run the function. Runs relatively quickly (~0.04 sec on ~100 separate tests, versus original ~0.24 secs).
(2) Call Cython function within Python script at 'global' level (i.e. not inside any function). Same speed as case (1).
(3) Call Cython function within Python script, with Cython function inside my Python script's main function; tested with the Cython function in global and local namespaces, all with the same speed as case (1).
(4) Same as (3), but inside a simple for-loop within said Python function. Same speed as case (1).
(5) problem! Same as (4), but inside yet another for-loop: Cython function's execution time (whether called globally or locally) balloons to ~10 times that of the other cases, and this is where I need the function to get called. Nothing odd to report about this loop, and I tested all of the components of this loop (adjusting/removing what I could). I also tried using a 'while' loop for giggles, to no avail.
"One thing I've yet to try is making this inner-most loop a function and going from there." EDIT: Just tried this- no luck.
Thanks for any suggestions you have- I deeply regret not being able to share my code...it hurts my soul a little, but my client just can't have this code floating around. Let me know if there is any other information that I can provide!
-The Real Problem and an Initial (ugly) Solution-
It turns out that the best hint in this scenario was the obvious one (as usual): it wasn't the for-loop that was causing the problem; why would it? After a few more tests, it became obvious that something about the way I was calling my Cython function was wrong, because I could call it elsewhere (using an input variable different from the one going to the 'real' Cython function) without the performance loss issue.
The underlying issue: data types. I wrote my Cython function to expect a list full of standard floats. Unfortunately, my code did this:
function_input = list(numpy_array_containing_npfloat64_data)  # yuck.
# type(function_input[0]) is numpy.float64
output = Cython_Function(function_input)
inside the Cython function:
def Cython_Function(list function_input):
    cdef many_vars
    """process lots of vars expecting C floats"""  # Slowness from converting numpy.float64's --> floats???
    # type(output) is list
    return output
I'm aware that I can play around more with types in the Cython function, which I very well may do to prevent having to 'list' an existing numpy array. Anyway, here is my current solution:
function_input = [float(x) for x in function_input]
I welcome any feedback and suggestions for improvement. The function_input numpy array doesn't really need the precision of numpy.float64, but it does get used a few times before getting passed to my Cython function.
It could be that, while individually, each function call with the Cython implementation is faster than its corresponding Python function, there is more overhead in the Cython function call because it has to look up the name in the module namespace. You can try assigning the function to a local callable first, for example:
from module import function

def main():
    my_func = function
    for i in sequence:
        my_func()
If possible, you should try to include the loops within the Cython function, which would reduce the overhead of a Python loop to the (very minimal) overhead of a compiled C loop. I understand that it might not be possible (i.e. need references from a global/larger scope), but it's worth some investigation on your part. Good luck!
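A rough way to check whether binding the function to a local name makes a measurable difference is sketched below. It uses math.sqrt as a stand-in for the Cython function, which is purely an assumption for illustration, and it measures a module attribute lookup rather than a plain global name, so the effect for the real code may be smaller.

```python
import math
import timeit


def call_via_module(values):
    total = 0.0
    for v in values:
        # Looks up the global name math, then its sqrt attribute, on every call.
        total += math.sqrt(v)
    return total


def call_via_local(values):
    sqrt = math.sqrt  # bind once to a local name
    total = 0.0
    for v in values:
        total += sqrt(v)
    return total


if __name__ == "__main__":
    data = list(range(100_000))  # assumed workload
    for func in (call_via_module, call_via_local):
        best = min(timeit.repeat(lambda: func(data), number=1, repeat=5))
        print(f"{func.__name__}: {best:.4f} s")
```

If the two timings are close, name lookup is not the bottleneck and the cost lies elsewhere, for example in how the arguments are converted.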
function_input = list(numpy_array_containing_npfloat64_data)
def Cython_Function(list function_input):
    cdef many_vars
I think the problem is in using the numpy array as a list ... can't you use the np.ndarray as input to the Cython function?
# requires `cimport numpy as np` in the .pyx file
def Cython_Function(np.ndarray[np.float64_t, ndim=1] input):
    ....