I saw a comment that led me to the question Why does Python code run faster in a function?.
I got to thinking, and figured I would try it myself using the timeit library, however I got very different results:
(note: 10**8 was changed to 10**7 to make things a little bit speedier to time)
>>> from timeit import repeat
>>> setup = """
def main():
    for i in xrange(10**7):
        pass
"""
>>> stmt = """
for i in xrange(10**7):
    pass
"""
>>> min(repeat('main()', setup, repeat=7, number=10))
1.4399558753975725
>>> min(repeat(stmt, repeat=7, number=10))
1.4410973942722194
>>> 1.4410973942722194 / 1.4399558753975725
1.000792745732109
Did I use timeit correctly?
Why are these results less than 0.1% different from each other, while the results from the other question were nearly 250% different?
Does it only make a difference with compiled implementations of Python (like Cython)?
Ultimately: is Python code really faster in a function, or does it just depend on how you time it?
The flaw in your test is the way timeit compiles the code of your stmt. It's actually compiled within the following template:
template = """
def inner(_it, _timer):
    %(setup)s
    _t0 = _timer()
    for _i in _it:
        %(stmt)s
    _t1 = _timer()
    return _t1 - _t0
"""
Thus stmt actually runs inside a function, using the fast locals array (i.e. STORE_FAST).
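To see that difference directly, here is a small sketch (not part of the original answer, just an illustration) that disassembles both forms with the dis module: module-level code stores the loop variable with STORE_NAME (a namespace lookup), while the same loop inside a function uses STORE_FAST (an indexed array slot).

import dis

loop = "for i in xrange(10**7):\n    pass"

# Module-level code: the loop variable is stored via STORE_NAME.
dis.dis(compile(loop, '<string>', 'exec'))

def main():
    for i in xrange(10**7):
        pass

# Function body: the loop variable is stored via STORE_FAST.
dis.dis(main)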
Here's a test comparing your function from the question (as f_opt) against the unoptimized compiled stmt executed in the function f_no_opt:
>>> import types, timeit
>>> code = compile(stmt, '<string>', 'exec')
>>> f_no_opt = types.FunctionType(code, globals())
>>> t_no_opt = min(timeit.repeat(f_no_opt, repeat=10, number=10))
>>> t_opt = min(timeit.repeat(f_opt, repeat=10, number=10))
>>> t_opt / t_no_opt
0.4931101445632647
It comes down to compiler optimization algorithms. When performing Just-in-time compilation, it is much easier to identify frequently used chunks of code if they're found in functions.
The efficiency gains really would depend on the nature of the tasks being performed. In the example you gave, you aren't really doing anything computationally intensive, leaving fewer opportunities to achieve gains in efficiency through optimization.
As others have pointed out, however, CPython does not do just-in-time compilation. When the code is compiled with a C compiler, though, similar optimizations (such as inlining) can often make it run faster.
Check out this document on the GCC compiler: http://gcc.gnu.org/onlinedocs/gcc/Inline.html
# This is my code
from sortedcontainers import SortedList, SortedSet, SortedDict
import timeit
import random

def test_speed1(data):
    SortedList(data)

def test_speed2(data):
    sorted_data = SortedList()
    for val in data:
        sorted_data.add(val)

data = []
numpts = 10 ** 5
for i in range(numpts):
    data.append(random.random())

print(f'Num of pts:{len(data)}')

sorted_data = SortedList()

n_runs = 10
result = timeit.timeit(stmt='test_speed1(data)', globals=globals(), number=n_runs)
print(f'Speed1 is {1000*result/n_runs:0.0f}ms')

n_runs = 10
result = timeit.timeit(stmt='test_speed2(data)', globals=globals(), number=n_runs)
print(f'Speed2 is {1000*result/n_runs:0.0f}ms')
The code for test_speed2 is supposed to take ~12 ms (I checked the setup they report). Why does it take 123 ms (10x slower)?
test_speed1 runs in 15 ms (which makes sense)
I am running in Conda.
This is where they outlined the performance:
https://grantjenks.com/docs/sortedcontainers/performance.html
You are presumably not executing your benchmark under the same conditions as they do:
you are not using the same benchmark code,
you don't use the same computer with the same performance characteristics,
you are not using the same Python version and environment,
you are not running the same OS,
etc.
Hence, the benchmark results are not comparable and you cannot conclude anything about the performance (and certainly not that "sortedcontainers is too slow").
Performance is only meaningful relative to a given execution context, and they only claimed that their solution is faster than competing solutions.
If you really wish to execute the benchmark on your computer, follow the instructions they give in the documentation.
"init() uses Python’s highly optimized sorted() function while add() cannot.". This is why the speed2 is faster than the speed3.
This is the answer I got from the developers of the sortedcontainers library.
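As a rough illustration of that point (a sketch based on the documented SortedList API, not code from the developers): bulk-loading lets the library sort everything once with sorted(), while per-element add() has to find an insertion point each time.

from sortedcontainers import SortedList
import random

data = [random.random() for _ in range(10 ** 5)]

# Bulk construction: the whole iterable is sorted once.
sl_init = SortedList(data)

# Incremental insertion: every add() performs its own position search.
sl_add = SortedList()
for val in data:
    sl_add.add(val)

# update() also bulk-loads an iterable and behaves like the constructor here.
sl_update = SortedList()
sl_update.update(data)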
When I profile code which heavily uses the all/any builtins I find that the call graph as produced by profiling tools (like gprof2dot) can be more difficult to interpret, because there are many edges from different callers leading to the all/any node, and many edges leading away. I'm looking for a way of essentially omitting the all/any nodes from the call graph such that in this very simplified example the two code paths would not converge and then split.
import cProfile
import random
import time

def bad_true1(*args, **kwargs):
    time.sleep(random.random() / 100)
    return True

def bad_true2(*args, **kwargs):
    time.sleep(random.random() / 100)
    return True

def code_path_one():
    nums = [random.random() for _ in range(100)]
    return all(bad_true1(x) for x in nums)

def code_path_two():
    nums = [random.random() for _ in range(100)]
    return all(bad_true2(x) for x in nums)

def do_the_thing():
    code_path_one()
    code_path_two()

def main():
    profile = OmittingProfile()  # hypothetical profiler that would omit all/any from the call graph
    profile.enable()
    do_the_thing()
    profile.disable()
    profile.dump_stats("foo.prof")

if "__main__" == __name__:
    main()
I don't think cProfile provides a way to filter out functions while collecting. You can probably filter functions out manually after you've got the stats, but that's probably more work than you want to do.
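For what it's worth, a minimal sketch of that post-hoc approach with the standard pstats module (my illustration, not the answerer's code). Note that the string arguments act as regex restrictions on what gets printed, so this trims the report rather than removing the all/any nodes from the recorded call graph:

import pstats

# Load the dump written by profile.dump_stats("foo.prof") in the question.
stats = pstats.Stats("foo.prof")
stats.sort_stats("cumulative")

# Only show rows whose function name matches the pattern.
stats.print_stats(r"bad_true")

# Show which callers reach the matching functions.
stats.print_callers(r"bad_true")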
Also, in my experience, as soon as there are a lot of nested calls, cProfile is only helpful for finding the "most time-consuming function", and that's it; there is no extra context, because cProfile only logs the parent function, not the whole call stack. For complicated programs, that may not be super helpful.
For the reasons above, I would recommend trying some other profiling tools, for example viztracer. VizTracer draws out your entire program execution so you know what happens in your program, and it happens to have the ability to filter out builtin functions.
pip install viztracer
# --ignore_c_function is the optional filter
viztracer --ignore_c_function your_program.py
vizviewer result.json
Of course, there are also statistical profilers that produce flamegraphs, which contain less information but also introduce less overhead, like py-spy and Scalene.
cProfile is a good tool for some simple profiling, but it's definitely not the best tool in the market.
The timeit module is great for measuring the execution time of small code snippets, but when the code changes global state (like an import does) it's really hard to get accurate timings.
For example, if I want to time how long it takes to import a module, the first import will take much longer than subsequent imports, because afterwards the submodules and dependencies are already imported and the files are already cached. So using a bigger number of repeats, as in:
>>> import timeit
>>> timeit.timeit('import numpy', number=1)
0.2819331711316805
>>> # Start a new Python session:
>>> timeit.timeit('import numpy', number=1000)
0.3035142574359181
doesn't really work, because the time for one execution is almost the same as for 1000 rounds. I could execute the command to "reload" the package:
>>> timeit.timeit('imp.reload(numpy)', 'import importlib as imp; import numpy', number=1000)
3.6543283935557156
But the fact that it's only about 10 times slower than the first import suggests it's not accurate either.
It also seems impossible to unload a module entirely ("Unload a module in Python").
So the question is: what would be an appropriate way to accurately measure the import time?
Since it's nearly impossible to fully unload a module, maybe the inspiration behind this answer is this: run, x times, a python subprocess that imports numpy and another one that does nothing, then subtract the two timings and average:
import subprocess, time

n = 100
python_load_time = 0
numpy_load_time = 0
for i in range(n):
    s = time.time()
    subprocess.call(["python", "-c", "import numpy"])
    numpy_load_time += time.time() - s

    s = time.time()
    subprocess.call(["python", "-c", "pass"])
    python_load_time += time.time() - s

print("average numpy load time = {}".format((numpy_load_time - python_load_time) / n))
I have a large Python code base which we recently started compiling with Cython. Without making any changes to the code, I expected performance to stay about the same, but we planned to optimize heavier computations with Cython specific code after profiling. However, the speed of the compiled application plummeted and it appears to be across the board. Methods are taking anywhere from 10% to 300% longer than before.
I've been playing around with test code to try and find things Cython does poorly and it appears that string manipulation is one of them. My question is, am I doing something wrong or is Cython really just bad at some things? Can you help me understand why this is so bad and what else Cython might do very poorly?
EDIT: Let me try to clarify. I realize that this type of string concatenation is very bad; I just noticed it has a huge speed difference so I posted it (probably a bad idea). The codebase doesn't have this type of terrible code but has still slowed dramatically and I'm hoping for pointers on what type of constructs Cython handles poorly so I can figure out where to look. I've tried profiling but it was not particularly helpful.
For reference, here is my string manipulation test code. I realize the code below is terrible and useless, but I'm still shocked by the speed difference.
# pyCode.py
def str1():
    val = ""
    for i in xrange(100000):
        val = str(i)

def str2():
    val = ""
    for i in xrange(100000):
        val += 'a'

def str3():
    val = ""
    for i in xrange(100000):
        val += str(i)
Timing code
# compare.py
import timeit
pyTimes = {}
cyTimes = {}
# STR1
number=10
setup = "import pyCode"
stmt = "pyCode.str1()"
pyTimes['str1'] = timeit.timeit(stmt=stmt, setup=setup, number=number)
setup = "import cyCode"
stmt = "cyCode.str1()"
cyTimes['str1'] = timeit.timeit(stmt=stmt, setup=setup, number=number)
# STR2
setup = "import pyCode"
stmt = "pyCode.str2()"
pyTimes['str2'] = timeit.timeit(stmt=stmt, setup=setup, number=number)
setup = "import cyCode"
stmt = "cyCode.str2()"
cyTimes['str2'] = timeit.timeit(stmt=stmt, setup=setup, number=number)
# STR3
setup = "import pyCode"
stmt = "pyCode.str3()"
pyTimes['str3'] = timeit.timeit(stmt=stmt, setup=setup, number=number)
setup = "import cyCode"
stmt = "cyCode.str3()"
cyTimes['str3'] = timeit.timeit(stmt=stmt, setup=setup, number=number)
for funcName in sorted(pyTimes.viewkeys()):
    print "PY {} took {}s".format(funcName, pyTimes[funcName])
    print "CY {} took {}s".format(funcName, cyTimes[funcName])
Compiling a Cython module with
cp pyCode.py cyCode.py
cython cyCode.py
gcc -O2 -fPIC -shared -I$PYTHONHOME/include/python2.7 \
-fno-strict-aliasing -fno-strict-overflow -o cyCode.so cyCode.c
Resulting timings
> python compare.py
PY str1 took 0.1610019207s
CY str1 took 0.104282140732s
PY str2 took 0.0739600658417s
CY str2 took 2.34380102158s
PY str3 took 0.224936962128s
CY str3 took 21.6859738827s
For reference, I've tried this with Cython 0.19.1 and 0.23.4. I've compiled the C code with gcc 4.8.2 and icc 14.0.2, trying various flags with both.
Worth reading: PEP 8 > Programming Recommendations:
Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such).
For example, do not rely on CPython's efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn't present at all in implementations that don't use refcounting. In performance sensitive parts of the library, the ''.join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.
Reference: https://www.python.org/dev/peps/pep-0008/#programming-recommendations
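A minimal plain-Python sketch of the two forms the PEP contrasts (my illustration, written with range rather than the question's Python 2 xrange):

# Repeated += on an immutable str: relies on CPython's fragile in-place
# optimization; in the worst case it re-copies the accumulated string every pass.
def concat_plus(n=100000):
    val = ""
    for i in range(n):
        val += str(i)
    return val

# ''.join of a collected sequence: a single linear-time concatenation.
def concat_join(n=100000):
    return "".join(str(i) for i in range(n))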
Repeated string concatenation of that form is usually frowned upon; some interpreters optimize for it anyway (secretly overallocating and allowing mutation of technically immutable data types in cases where it's known to be safe), but that trick lives in CPython's interpreter loop, and the C code Cython generates bypasses it, which makes the pattern much more expensive.
The real answer is "Don't concatenate immutable types over and over." (it's wrong everywhere, just worse in Cython). A perfectly reasonable approach Cython would likely handle fine is to make a list of the individual str, and then call ''.join(listofstr) at the end to make the str at once.
In any event, you're not giving Cython any typing information to work with, so the speed-ups aren't going to be very impressive. Try to help it out with the easy stuff, and the speed-ups there may more than make up for losses elsewhere. For example, cdef-ing your loop variable and using ''.join might help here:
cpdef str2():
    cdef int i
    val = []
    for i in xrange(100000):  # maybe range; Cython docs aren't clear if xrange is optimized
        val.append('a')
    val = ''.join(val)
I'm trying to time several things in Python, including upload time to Amazon's S3 Cloud Storage, and am having a little trouble. I can time my hash, and a few other things, but not the upload. I thought this post would finally get me there, but I can't seem to find salvation. Any help would be appreciated. Very new to Python, thanks!
import timeit
import boto
from boto.s3.key import Key

accKey = r"xxxxxxxxxxx"
secKey = r"yyyyyyyyyyyyyyyyyyyyyyyyy"
bucket_name = 'sweet_data'

c = boto.connect_s3(accKey, secKey)
b = c.get_bucket(bucket_name)
k = Key(b)

p = '/my/aws.path'
f = 'C:\\my.file'

def upload_data(p, f):
    k.key = p
    k.set_contents_from_filename(f)
    return

t = timeit.Timer(lambda: upload_data(p, f), "from aws_lib import upload_data; p=%r; f = %r" % (p, f))
# Just calling the function works fine
#upload_data(p, f)
I know this is heresy in the Python community, but I actually recommend not using timeit, especially for something like this. For your purposes, I believe it will be good enough (and possibly even better than timeit!) if you simply use time.time() to time things. In other words, do something like:
from time import time
t0 = time()
myfunc()
t1 = time()
print t1 - t0
Note that depending on your platform, you might want to try time.clock() instead (see Stack Overflow questions such as this and this), and if you're on Python 3.3+, then you have better options thanks to PEP 418.
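For instance, one of the PEP 418 additions is time.perf_counter(), a monotonic high-resolution clock that is a better fit than time.time() for measuring elapsed durations. A minimal sketch (myfunc here is just a placeholder for the real work, e.g. the upload call above):

from time import perf_counter, sleep

def myfunc():
    sleep(0.1)  # placeholder for the real work, e.g. upload_data(p, f)

t0 = perf_counter()
myfunc()
t1 = perf_counter()
print(t1 - t0)  # elapsed seconds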
You can use the command line interface to timeit.
Just save your code as a module without the timing stuff. For example:
# file: test.py
data = range(5)

def foo(l):
    return sum(l)
Then you can run the timing code from the command line, like this:
$ python -mtimeit -s 'import test;' 'test.foo(test.data)'
See also:
http://docs.python.org/2/library/timeit.html#command-line-interface
http://docs.python.org/2/library/timeit.html#examples