I have implemented a specific algorithm in R. There exists a Python library which offers an alternative implementation.
I would now like to formally compare the speed of the two implementations to assess which is "more efficient" in different kinds of situations.
What would be the best way to do this? Here, it is suggested to use the system.time() (in R) and time.time() (in Python) functions to time the actual function calls. Is this generally recommended? Do these two measures actually measure the same thing?
Would measuring the run time of the whole script be a reasonable alternative, so that "overhead" such as variable definitions is included as well?
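For the Python side, I gather the measurement would look roughly like this (run_algorithm and the test data are placeholders for the actual implementation; from what I have read, timeit repeats the call, which should be more robust than differencing a single pair of time.time() calls, and perf_counter gives wall-clock time comparable to the "elapsed" value reported by system.time() in R):

    import time
    import timeit

    def run_algorithm(data):
        # placeholder for the implementation being benchmarked
        return sorted(data)

    data = list(range(100_000, 0, -1))

    # one measurement of wall-clock time (roughly the "elapsed" entry of R's system.time())
    start = time.perf_counter()
    run_algorithm(data)
    print("single run:", time.perf_counter() - start, "s")

    # repeat the call and keep the best average to reduce scheduling noise
    best = min(timeit.repeat(lambda: run_algorithm(data), repeat=5, number=10)) / 10
    print("best of 5 x 10 runs:", best, "s per call")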
PS: I am a regular user of R, but have close to zero experience in Python
Related
I've read online that Scala is faster than Python, e.g. here. I've also seen a comparison between different front ends that concluded that R was so slow that the tester gave up trying to measure its performance (here; although this was specifically a test for user-defined functions and may not have used the sparklyr package).
I also know that sparklyr now has arrow integration, which has led to performance improvements for user defined functions as well as copying data to/from the cluster, as shown here.
My question: how fast is sparklyr compared to Python/Scala? I'm mostly interested in standard 'out-of-the-box' functions, but would also be interested to know how it stacks up for user-defined functions now that arrow has been integrated. And are there particular circumstances in which it performs well or badly?
I ask because I've built an app in sparklyr that is slower than I would have hoped despite lots of tinkering with tuning parameters, and I'm wondering if this is partly because of limitations in the package.
On a local PC, R is much faster than Spark most of the time.
I am looking for an efficient implementation of a string similarity metric function in Python (or a lib that provides Python bindings).
I want to compare strings averaging about 10 kB in size, and I can't take any shortcuts like comparing line by line; I need to compare the entire thing. I don't really care which exact metric is used, as long as the results are reasonable and the computation is fast. Here's what I've tried so far:
difflib.SequenceMatcher from the standard lib: ratio() gives good results, but takes >100 ms for 10 kB of text. quick_ratio() takes only half the time, but the results are sometimes far off the real value.
python-Levenshtein: Levenshtein distance is an acceptable metric for my use case, but Levenshtein.ratio('foo', 'bar') is not faster than SequenceMatcher.
Before I start benchmarking every lib on PyPI that provides functions for measuring string similarity, maybe you can point me in the right direction? I'd love to reduce the time for a single comparison to less than 10 ms (on commodity hardware), if possible.
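In case it helps, rough timings like the ones above can be reproduced with something along these lines (the two ~10 kB inputs are synthetic placeholders, not my real data):

    import difflib
    import random
    import string
    import timeit

    random.seed(0)
    # two synthetic ~10 kB strings that share a large common prefix
    a = "".join(random.choices(string.ascii_lowercase + " ", k=10_000))
    b = a[:5_000] + "".join(random.choices(string.ascii_lowercase + " ", k=5_000))

    sm = difflib.SequenceMatcher(None, a, b)
    print("ratio:      ", sm.ratio())
    print("quick_ratio:", sm.quick_ratio())

    # average time per full comparison for each variant
    print(timeit.timeit(lambda: difflib.SequenceMatcher(None, a, b).ratio(), number=5) / 5)
    print(timeit.timeit(lambda: difflib.SequenceMatcher(None, a, b).quick_ratio(), number=5) / 5)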
edlib seems to be fast enough for my use case.
It's a C++ lib with Python bindings that calculates the Levenshtein distance for texts <100 kB in less than 10 ms each (on my machine). 10 kB texts are done in ~1 ms, which is 100x faster than difflib.SequenceMatcher.
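Roughly, the usage is along these lines (check the edlib docs for the exact parameters; task="distance" computes only the distance, without the alignment path, which is the fastest mode):

    import edlib

    a = "first ~10 kB document ..."
    b = "second ~10 kB document ..."

    # global (NW) alignment, distance only
    result = edlib.align(a, b, mode="NW", task="distance")
    distance = result["editDistance"]

    # convert the edit distance into a rough similarity ratio in [0, 1]
    similarity = 1 - distance / max(len(a), len(b))
    print(distance, similarity)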
I've had some luck with RapidFuzz, I don't know how it compares to the others but it was much faster than thefuzz/fuzzywuzzy.
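For completeness, the call I mean is the usual ratio() one, e.g.:

    from rapidfuzz import fuzz

    # same interface as thefuzz/fuzzywuzzy's ratio(), returns a score from 0 to 100
    score = fuzz.ratio("the quick brown fox", "the quick brown dog")
    print(score)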
Don't know if it's applicable to your use case, but this is one of the first things you find when you Google "fast string similarity python".
Based on a lot of reading up I've done, something like tfidf_matcher worked well for me. It returns the best k matches. Also, it's easily 1000x faster than fuzzywuzzy.
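I can't speak to tfidf_matcher's exact API, but the underlying idea (character n-gram TF-IDF vectors plus nearest neighbours by cosine similarity) can be sketched with scikit-learn like this; the corpus and query are toy examples:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    corpus = ["apple computers", "aple computer", "banana republic", "apple computer inc"]
    query = ["apple computer"]

    # character n-grams are robust to typos and small edits
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
    doc_vectors = vectorizer.fit_transform(corpus)
    query_vector = vectorizer.transform(query)

    # cosine similarity of the query against every document; keep the best k
    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    k = 2
    for i in scores.argsort()[::-1][:k]:
        print(corpus[i], round(float(scores[i]), 3))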
I have recently stumbled upon the boost.odeint library and I am surprised by the number of possibilities and its configurability. However, having used scipy.integrate.odeint extensively (which is essentially a wrapper around ODEPACK in Fortran), I wonder how their performance compares. I know that boost.odeint also comes with parallelization, which is not possible with scipy (as far as I know) and which would then increase performance a lot, but I am asking about the single-core case.
However, since I would have to wrap boost.odeint into Python (using Cython or boost.python) in that case, maybe one of you has done that already? That would be a great achievement, since all the analysis possibilities are much more advanced in Python.
As far as I can tell from comparing the lists of available steppers for Boost.odeint and scipy.integrate.ode, the only algorithm implemented by both is the Dormand-Prince fifth-order stepper, dopri5. You could compare the efficiency of the two implementations of this algorithm in Python by using this Cython wrapper to Boost.odeint (it does not expose all the steppers provided by Boost.odeint, but does expose dopri5).
Depending on your definition of "testing performance", you could also compare different algorithms, but this is obviously not the same thing as comparing two implementations of the same algorithm.
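For the scipy half of such a comparison, a timing run with the dopri5 stepper could look like this (the harmonic oscillator is just a stand-in test problem; swap in whichever system you actually care about):

    import time
    from scipy.integrate import ode

    def rhs(t, y):
        # simple harmonic oscillator as a stand-in test problem
        return [y[1], -y[0]]

    solver = ode(rhs).set_integrator("dopri5", rtol=1e-8, atol=1e-10, nsteps=100000)
    solver.set_initial_value([1.0, 0.0], 0.0)

    start = time.perf_counter()
    solver.integrate(100.0)
    elapsed = time.perf_counter() - start

    print(solver.y, elapsed, "s")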
Is there a particular software resource monitor that researchers or academics use to compare execution time and other resource-usage metrics between programming environments? For instance, if I have a routine in C++, Python, and MATLAB that are all identical in function and similar in implementation, how would I make an objective, measurable comparison of which is the most efficient process? Likewise, is there a tool that could also analyze performance between versions of the same code to track improvements in processing efficiency? Please try to answer this question without generalizations like "oh, C++ is always more efficient than Python and Python will always be more efficient than MATLAB."
The correct way is to write timing tests: get the current time before the actual algorithm starts, and get the current time again after it ends. There are ways to do that in C++, Python, and MATLAB.
You must not treat the results as 100% precise, because of the system's scheduling process etc., but it is a good way to compare before/after results.
A good way to get more precise results is to run your code multiple times.
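In Python, for example, that boils down to something like this (routine is a placeholder for the identical algorithm you implement in each language):

    import statistics
    import time

    def routine():
        # placeholder for the algorithm implemented identically in each language
        return sum(i * i for i in range(100_000))

    # run the same code many times and report robust statistics,
    # since any single run is distorted by scheduling, caches, etc.
    runs = []
    for _ in range(20):
        start = time.perf_counter()
        routine()
        runs.append(time.perf_counter() - start)

    print("min    %.2f ms" % (min(runs) * 1e3))
    print("median %.2f ms" % (statistics.median(runs) * 1e3))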
I'm having a lot of fun learning Python by writing a genetic programming type of application.
I've had some great advice from Torsten Marek, Paul Hankin and Alex Martelli on this site.
The program has 4 main functions:
generate (randomly) an expression tree.
evaluate the fitness of the tree
crossbreed
mutate
As generate, crossbreed, and mutate all call 'evaluate the fitness', it is the busiest function and the primary speed bottleneck.
As is the nature of genetic algorithms, it has to search an immense solution space so the faster the better. I want to speed up each of these functions. I'll start with the fitness evaluator. My question is what is the best way to do this. I've been looking into cython, ctypes and 'linking and embedding'. They are all new to me and quite beyond me at the moment but I look forward to learning one and eventually all of them.
The 'fitness function' needs to compare the value of the expression tree to the value of the target expression. So it will consist of a postfix evaluator which will read the tree in a postfix order. I have all the code in python.
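To make that concrete, here is a minimal pure-Python sketch of the kind of evaluator I mean (the tuple-based tree representation and the operator set are just assumptions for illustration):

    import operator

    # internal nodes are (op, left, right) tuples; leaves are the variable "x" or a constant
    OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

    def evaluate(node, x):
        """Evaluate an expression tree bottom-up (children first, i.e. postfix order)."""
        if isinstance(node, tuple):
            op, left, right = node
            return OPS[op](evaluate(left, x), evaluate(right, x))
        if node == "x":
            return x
        return node

    def fitness(tree, samples):
        """Sum of squared errors against (x, target) pairs; smaller is fitter."""
        return sum((evaluate(tree, x) - target) ** 2 for x, target in samples)

    # tree encoding x*x + 1, scored against the target x*x + 1 at a few sample points
    tree = ("+", ("*", "x", "x"), 1)
    samples = [(x, x * x + 1) for x in range(-5, 6)]
    print(fitness(tree, samples))  # 0, a perfect match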
I need advice on which I should learn and use now: cython, ctypes or linking and embedding.
Thank you.
Ignore everyone else's answers for now. The first thing you should learn to use is the profiler. Python comes with profile/cProfile; you should learn how to read the results and analyze where the real bottlenecks are. The goal of optimization is three-fold: reduce the time spent on each call, reduce the number of calls made, and reduce memory usage to reduce disk thrashing.
The first goal is relatively easy. The profiler will show you the most time-consuming functions, and you can go straight to those functions to optimize them.
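Concretely, the built-in profiler can be run from the command line with python -m cProfile -s cumtime your_script.py, or inline like this (the profiled function is a placeholder standing in for one GA generation):

    import cProfile
    import pstats

    def run_one_generation():
        # placeholder standing in for generate/crossbreed/mutate plus fitness evaluation
        return sum(i % 7 for i in range(1_000_000))

    # profile the call, dump the stats to a file, then show the 10 hottest functions
    cProfile.run("run_one_generation()", "profile.out")
    pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)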
The second and third goals are harder, since they mean you need to change the algorithm to reduce the need to make so many calls. Find the functions with a high number of calls and try to find ways to reduce the need to call them. Utilize the built-in collections; they're very well optimized.
If you're doing a lot of number and array processing, you should take a look at the pandas, NumPy/SciPy, and gmpy third-party modules; they're well-optimised C libraries for processing arrays/tabular data.
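As a quick illustration of why that matters, compare a plain Python loop with the equivalent vectorised NumPy expression (toy example):

    import timeit
    import numpy as np

    values = list(range(1_000_000))
    array = np.arange(1_000_000)

    # the same computation: sum of squares over a million numbers
    loop_time = timeit.timeit(lambda: sum(v * v for v in values), number=10)
    numpy_time = timeit.timeit(lambda: (array * array).sum(), number=10)

    print("python loop: %.3f s, numpy: %.3f s" % (loop_time, numpy_time))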
Another thing you want to try is PyPy. PyPy can JIT-compile your code and do much more advanced optimisation than CPython, and it'll work without the need to change your Python code. Though well-optimised code targeting CPython can look quite different from well-optimised code targeting PyPy.
Next to try is Cython. Cython is a slightly different language from Python; in fact, it is best described as C with a typed, Python-like syntax.
For parts of your code that are in very tight loops and that you can no longer optimize in any other way, you may want to rewrite them as a C extension. Python has very good support for extending with C. In PyPy, the best way to extend is with cffi.
Cython is the quickest way to get the job done, either by writing your algorithm directly in Cython, or by writing it in C and binding it to Python with Cython.
My advice: learn Cython.
Another great option is boost::python, which lets you easily wrap C or C++.
Of these possibilities, though, since you already have Python code written, Cython is probably a good thing to try first. Perhaps you won't have to rewrite any code to get a speedup.
Try to structure your fitness function so that it supports memoization. This will replace all calls that duplicate previous calls with a quick dict lookup.
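If the tree can be encoded as something hashable (nested tuples, for instance), functools.lru_cache gives you this almost for free; a small sketch under that assumption:

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def fitness(tree_key):
        # tree_key must be hashable, e.g. a nested-tuple encoding of the expression tree;
        # duplicate trees (common after crossbreeding) then cost only a dict lookup
        return sum(hash((tree_key, i)) % 97 for i in range(200_000))  # stand-in for the real evaluation

    tree = ("+", ("*", "x", "x"), 1)
    print(fitness(tree))          # first call: computed
    print(fitness(tree))          # second call: served from the cache
    print(fitness.cache_info())   # hits=1, misses=1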