Benchmarking run times in Python

I have to benchmark JSON serialization time and compare it to Thrift's and Google's Protocol Buffers' serialization times. It also has to be in Python.
I was planning on using the Python profilers.
http://docs.python.org/2/library/profile.html
Would the profiler be the best way to find function runtimes? Or would outputting a timestamp before and after the function call be the better option?
Or is there an even better way?

From the profile docs that you linked to:
Note The profiler modules are designed to provide an execution profile for a given program, not for benchmarking purposes (for that, there is timeit for reasonably accurate results). This particularly applies to benchmarking Python code against C code: the profilers introduce overhead for Python code, but not for C-level functions, and so the C code would seem faster than any Python one.
So, no, you do not want to use profile to benchmark your code. What you want to use profile for is to figure out why your code is too slow, after you already know that it is.
And you do not want to output a timestamp before and after the function call, either. There are just way too many things you can get wrong that way if you're not careful (using the wrong timestamp function, letting the GC run a cycle collection in the middle of your test run, including test overhead in the loop timing, etc.), and timeit takes care of all of that for you.
Something like this is a common way to benchmark things:
import timeit

for impl in 'mycode', 'googlecode', 'thriftcode':
    t = timeit.timeit('serialize(data)',
                      setup='''from {} import serialize
with open('data.txt') as f: data = f.read()
'''.format(impl),
                      number=10000)
    print('{}: {}'.format(impl, t))
(I'm assuming here that you can write three modules that wrap the three different serialization tools in the same API, a single serialize function that takes a string and does something or other with it. Obviously there are different ways to organize things.)

You should be careful when you are profiling Python code based on a timestamp at the start and end of the program. This does not take into account other processes that might also be running concurrently.
Instead, you should consider looking at
Is there any simple way to benchmark python script?
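For a quick self-contained comparison, timeit.repeat handles the loop and GC details for you. This sketch times only the stdlib json serializer (the thrift and protobuf wrappers from the question are hypothetical here); taking the minimum of several repeats is the usual way to damp out interference from other processes:

```python
import json
import timeit

# A sample payload; in the real benchmark this would be your actual data.
data = {"user": "alice", "scores": [1, 2, 3], "active": True}

# timeit.repeat runs the whole measurement several times; taking the
# minimum is the conventional way to reduce noise from other processes.
times = timeit.repeat(lambda: json.dumps(data), number=10000, repeat=5)
print('json: {:.4f}s for 10000 calls (best of 5)'.format(min(times)))
```

The same pattern extends to the other two serializers once you have a callable for each.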

Related

How to speed up language-tool-python library use case

I have a pandas dataframe with 3 million rows of social media comments. I'm using the language-tool-python library to find the number of grammatical errors in a comment. AFAIK, the language-tool library by default sets up a local LanguageTool server on your machine and queries responses from it.
Getting the number of grammatical errors just consists of creating an instance of the language tool object and calling its .check() method with the string you want to check as a parameter.
>>> tool = language_tool_python.LanguageTool('en-US')
>>> text = 'A sentence with a error in the Hitchhiker’s Guide tot he Galaxy'
>>> matches = tool.check(text)
>>> len(matches)
2
So the method I used is df['body_num_errors'] = df['body'].apply(lambda row: len(tool.check(row))). I am pretty sure this works; it's quite straightforward. But this single line of code has been running for the past hour.
Running the above example took 10-20 seconds, so with 3 million instances it might as well take forever.
Is there any way I can cut my losses and speed this process up? Would iterating over every row and putting the whole thing inside a ThreadPoolExecutor help? Intuitively that makes sense to me, as it's an I/O-bound task.
I am open to any suggestions as to how to speed up this process and if the above method works would appreciate if someone can show me some sample code.
Edit (correction): the 10-20 seconds includes the instantiation; calling the method itself is almost instantaneous.
I'm the creator of language_tool_python. First, none of the comments here make sense. The bottleneck is in tool.check(); there is nothing slow about using pd.DataFrame.map().
LanguageTool is running on a local server on your machine. There are at least two major ways to speed this up:
Method 1: Initialize multiple servers
servers = []
for i in range(100):
    servers.append(language_tool_python.LanguageTool('en-US'))
Then call each server from a different thread, or alternatively initialize each server within its own thread.
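A minimal sketch of that multi-server pattern, with a stub class standing in for language_tool_python.LanguageTool (a queue hands each worker thread its own server instance, so no two threads ever share one):

```python
import queue
from concurrent.futures import ThreadPoolExecutor

class FakeTool:
    """Stand-in for language_tool_python.LanguageTool in this sketch."""
    def check(self, text):
        return []  # the real method returns a list of Match objects

N_SERVERS = 4
pool = queue.Queue()
for _ in range(N_SERVERS):
    # Real code would do: language_tool_python.LanguageTool('en-US')
    pool.put(FakeTool())

def num_errors(text):
    tool = pool.get()      # borrow a server instance
    try:
        return len(tool.check(text))
    finally:
        pool.put(tool)     # return it for the next task

comments = ['first comment', 'second comment', 'third comment']
with ThreadPoolExecutor(max_workers=N_SERVERS) as ex:
    counts = list(ex.map(num_errors, comments))
print(counts)
```

ex.map preserves input order, so the counts line up with the original rows; the results can then be assigned back onto the dataframe column.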
Method 2: Increase the thread count
LanguageTool takes a maxCheckThreads option – see the LT HTTPServerConfig documentation – so you could also try playing around with that? From a glance at LanguageTool's source code, it looks like the default number of threads in a single LanguageTool server is 10.
In the documentation, we can see that language-tool-python has the configuration option maxSpellingSuggestions.
However, despite the name of the variable and the default value being 0, I have noticed that the code runs noticeably faster (almost 2 times faster) when this parameter is actually set to 1.
I don't know where this discrepancy comes from, and the documentation does not mention anything specific about the default behavior. It is a fact, however, that this setting improves performance (at least for my own dataset, which I don't think should affect the running time much).
Example initialization:
import language_tool_python
language_tool = language_tool_python.LanguageTool('en-US', config={'maxSpellingSuggestions': 1})
If you are worried about scaling up with pandas, switch to Dask instead. It integrates with pandas and will use multiple cores in your CPU, which I am assuming you have, instead of the single core that pandas uses. This helps parallelize the 3 million instances and can speed up your execution time. You can read more about Dask here or see an example here.

Logging progress of `enumerate` function in Python

I am working on optimizing a Python (version 2.7; constrained due to old modules I need to use) script that is running on AWS and trying to identify exactly how many resources I need in the environment I am building. In order to do so, I am logging quite a bit of information to the console when the script runs and benchmarking different resource configurations.
One of the bottlenecks is a list with 32,000 items that runs through the Python enumerate function. This takes quite a while to run and my script is blind to its progress until it finishes. I assume enumerate is looping through the items in some fashion so thought I could inject logging or printing within that loop. I know this is 'hacky' but it will work fine for me for now since it is temporary while I run tests.
I can't find where the function runs from (I did find an Enumerate class in the numba module; I tried printing in there and it did not work). I know it is part of the __builtin__ module and am having trouble tracking that down as well; I tried several techniques to find its exact location, such as printing __builtin__.__file__, but to no avail.
My question is: a) can you help me identify where the function lives, and is this a good approach? Or b) is there a better approach for this?
Thanks for your help!
I suggest you use the tqdm module.
tqdm visualizes the progress of an iteration. It works with iterables whether or not their length is known.
By extending the tqdm class you can log whatever information you need after any number of iterations. You can find more information in the tqdm class docstring.
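If pulling in tqdm is not an option (e.g. on a locked-down Python 2.7 environment like the one in the question), a hand-rolled generator that logs every N items gives similar visibility. A minimal sketch that works on both Python 2 and 3:

```python
from __future__ import print_function
import time

def logged(iterable, every=1000, total=None):
    """Yield items unchanged, printing progress every `every` items."""
    start = time.time()
    for i, item in enumerate(iterable, 1):
        if i % every == 0:
            elapsed = time.time() - start
            if total:
                print('{}/{} items in {:.1f}s'.format(i, total, elapsed))
            else:
                print('{} items in {:.1f}s'.format(i, elapsed))
        yield item

items = list(range(32000))
results = [x * 2 for x in logged(items, every=8000, total=len(items))]
```

Because the wrapper just yields items through, it can be dropped around any existing loop without changing its behavior.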

What is the difference between random.normalvariate() and random.gauss() in python?

What is the difference between random.normalvariate() and random.gauss()?
They take the same parameters and return the same value, performing essentially the same function.
I understand from a previous answer that random.gauss() is not thread-safe, but what does this mean in this context? Why should a programmer care about this? Alternatively posed, why were both a thread-safe and a non-thread-safe version included in Python's random module?
This is an interesting question. In general, the best way to know the difference between two python implementations is to inspect the code yourself:
import inspect, random
str_gauss = inspect.getsource(random.gauss)
str_nv = inspect.getsource(random.normalvariate)
and then print each of the strings to see how the sources differ. A quick look at the code shows that not only do they behave differently multithread-wise, but also that the algorithms are not the same; for example, normalvariate uses something called the Kinderman and Monahan method, as per the following comments in str_nv:
# Uses Kinderman and Monahan method. Reference: Kinderman,
# A.J. and Monahan, J.F., "Computer generation of random
# variables using the ratio of uniform deviates", ACM Trans
# Math Software, 3, (1977), pp257-260.
Thread-safe pieces of code must account for possible race conditions during execution. This introduces overhead as a result of synchronization schemes like mutexes, semaphores, etc.
However, if your code does not need to be reentrant, no such synchronization is necessary, which means it can execute a bit faster. I guess this is why random.gauss() was introduced, since the Python docs say it's faster than the thread-safe version.
I'm not entirely sure about this but the Python Documentation says that random.gauss is slightly faster so if you're OK with non-thread safe then you can go a little faster.
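A quick way to check that claim yourself is to time both functions with timeit (the exact numbers, and even which one wins, can vary by Python version and machine):

```python
import timeit

# Time 100k calls to each generator; setup imports random once per measurement.
t_gauss = timeit.timeit('random.gauss(0, 1)',
                        setup='import random', number=100000)
t_nv = timeit.timeit('random.normalvariate(0, 1)',
                     setup='import random', number=100000)
print('gauss:         {:.3f}s'.format(t_gauss))
print('normalvariate: {:.3f}s'.format(t_nv))
```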
In a multi-threaded program, calling random.gauss twice in quick succession from different threads can cause its internal code to run twice, potentially before the first call has finished. random.gauss caches a second generated value between calls, and that internal state may not be updated consistently before the second call runs, which can corrupt the function's output.
random.normalvariate keeps no such state between calls, so successive calls do not interfere with each other.
The advantage of random.gauss is therefore that it is faster, but under concurrent use it may produce erroneous output.

Profiling Python via C-api (How to ? )

I have used the Python's C-API to call some Python code in my c code and now I want to profile my python code for bottlenecks. I came across the PyEval_SetProfile API and am not sure how to use it. Do I need to write my own profiling function?
I will be very thankful if you can provide an example or point me to an example.
If you only need to know the amount of time spent in the Python code, and not (for example) where in the Python code the most time is spent, then the Python profiling tools are not what you want. I would write some simple C code that samples the time before and after the Python interpreter invocation, and use that. Or use C-level profiling tools, treating the Python interpreter invocation as just another C function call.
If you need to profile within the Python code, I wouldn't recommend writing your own profile function. All it does is provide you with raw data; you'd still have to aggregate and analyze it. Instead, write a Python wrapper around your Python code that invokes the cProfile module to capture data that you can then examine.
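A minimal sketch of such a wrapper: run cProfile around the entry point your C code invokes, then dump the stats for inspection (entry_point here is a placeholder for your real function):

```python
import cProfile
import pstats
import io

def entry_point():
    # Placeholder for the Python function your C code actually invokes.
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
result = entry_point()
profiler.disable()

# Aggregate and print the ten most expensive calls by cumulative time.
out = io.StringIO()
stats = pstats.Stats(profiler, stream=out)
stats.sort_stats('cumulative').print_stats(10)
print(out.getvalue())
```

You could expose a wrapper like this as the module-level function your C code calls, so the profiling happens entirely on the Python side.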

What's the best way to record the type of every variable assignment in a Python program?

Python is so dynamic that it's not always clear what's going on in a large program, and looking at a tiny bit of source code does not always help. To make matters worse, editors tend to have poor support for navigating to the definitions of tokens or import statements in a Python file.
One way to compensate might be to write a special profiler that, instead of timing the program, would record the runtime types and paths of objects of the program and expose this data to the editor.
This might be implemented with sys.settrace(), which sets a callback for each line of code and is how pdb is implemented; or by using the ast module and an import hook to instrument the code; or is there a better strategy? How would you write something like this without making it impossibly slow, and without running afoul of extreme dynamism, e.g. side effects on property access?
I don't think you can help making it slow, but it should be possible to detect the address of each variable when you encounter a STORE_FAST, STORE_NAME, or other STORE_* opcode.
Whether or not this has been done before, I do not know.
If you need debugging, look at PDB, this will allow you to step through your code and access any variables.
import pdb

def test():
    print(1)
    pdb.set_trace()  # you will enter an interpreter here
    print(2)
What if you monkey-patched object's class or another prototypical object?
This might not be the easiest if you're not using new-style classes.
You might want to check out PyChecker's code - it does (I think) what you are looking to do.
Pythoscope does something very similar to what you describe and it uses a combination of static information in a form of AST and dynamic information through sys.settrace.
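A minimal sketch of the sys.settrace approach: on every 'line' event, record the type of each local variable in the current frame. Real tools would also handle 'call'/'return' events and filter by filename to keep the overhead down:

```python
import sys
from collections import defaultdict

seen_types = defaultdict(set)  # variable name -> set of type names observed

def tracer(frame, event, arg):
    if event == 'line':
        for name, value in frame.f_locals.items():
            seen_types[name].add(type(value).__name__)
    return tracer  # keep tracing line events in this frame

def demo():
    x = 1
    x = 'now a string'  # rebinding to a different type
    y = [x]
    return y

sys.settrace(tracer)
demo()
sys.settrace(None)

for name, types in sorted(seen_types.items()):
    print('{}: {}'.format(name, sorted(types)))
```

Since line events fire before each line executes, the tracer sees x as both an int and a str across the run, which is exactly the kind of information an editor integration would want.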
BTW, if you have problems refactoring your project, give Pythoscope a try.
