Most efficient string similarity metric function

Most efficient string similarity metric function - python

I am looking for an efficient implementation of a string similarity metric function in Python (or a lib that provides Python bindings).
I want to compare strings with an average of 10kb in size and I can't take any shortcuts like comparing line-by-line, I need to compare the entire thing. I don't really care, what exact metric will be used, as long as the results are reasonable and computation is fast. Here's what I've tried so far:
difflib.SequenceMatcher from the standard lib. ratio() gives good results, but takes >100ms for 10kb text. quick_ratio() takes only half the time, but the results are sometimes far of the real value.
python-Levenshtein: levenshtein is an acceptable metric for my use case, but Levenshtein.ratio('foo', 'bar') is not faster than the SequenceMatcher.
Before I start benchmarking every lib on pypi that provides functions for measuring string similarity, maybe you can point me in the right direction? I'd love to reduce the time for a single comparison to less than 10ms (on commodity hardware), if possible.

edlib seems to be fast enough for my use case.
It's a C++ lib with Python bindings that calculates the Levehnstein distance for texts <100kb in less than 10ms each (on my machine). 10kb texts are done in ~1ms, which is 100x faster than difflib.SequenceMatcher.

I've had some luck with RapidFuzz, I don't know how it compares to the others but it was much faster than thefuzz/fuzzywuzzy.
Don't know if it's applicable for your use-case, but this one of the first things you find when you google fast string similarity python

Based on a lot of reading up I've been able to do, something like tfidf_matcher worked well for me. Returns the best k matches. Also, it's easily 1000x faster than Fuzzywuzzy.

Related

Does python use the best possible algorithms in order to save the most time it can?

I have a question that might be very simple to answer.
I couldn't find the answer anywhere.
Does python use the best possible algorithms in order to save the most time it can?
I just saw on some website that for example, the max method time order -in lists- is O(n) in python which there are better time orders as you know.
Is it true?
should I use the algorithms that I know they can perform better in order to save more time or does python did this for me in its methods?

max method time order -in lists- is O(n) in python which there are better time orders as you know. Is it true?
No this is not true. Finding the maximum value in a list will require that all values in the list are inspected, hence O(n).
You may be confused with lists that have been prepared in some way. For instance:
You have a list that is already sorted (which is a O(nlogn) process). In that case you can of course get the maximum in constant time, since you know its index. If the list is sorted in ascending order, it would be unwise to call max on it, as that would indeed be a waste of time. You may know the list is sorted, but python will not assume this, and still scan the whole list.
You have a list that has been heapified to a max-heap (which is a O(n) process). Again, in that case you can get the maximum in constant time, since it is stored at index 0. Lists can be heapified with heapq -- the default being a min-heap.
So, if you know nothing about your list, then you will have to inspect all values to be sure to identify the maximum. That is what max() does. In case you do know something more that could help to identify the maximum without having to look at all values, then use another, more appropriate method.
should I use the algorithms that I know they can perform better in order to save more time or does python did this for me in its methods?
You should use the algorithms that you know can perform better (based on what you know about a data structure). In many cases there is such better algorithm implementation available via a python library. For example, to find a particular value in a sorted list, use bisect.bisect_left and not index.
Look at a more complex example. Say you have written code that can generate chess moves and simulate a game of chess. You have good ideas about evaluation functions, alphabeta pruning, killer moves, lookup tables, ...and a bunch of other optimisation techniques. You cannot expect python to get smart when you issue a naive max on "all" evaluated chess states. You need to implement the complex algorithm to efficiently search and filter the right states to get the "best" chess move out of that forest of information without wasting time on less promising moves.

A Python list is a sequential and contiguus container. That means that finding the ith element is in constant time, and adding to the end is easy is no reallocation is required.
Finding a value is O(n/2), and finding min or max is O(n).
If you want a list and being able to find its minimum value in O(1), the heapq module that maintains a binary tree is available.
But Python offers few specialized containers in its standard library.

In terms of complexity, you'll find that python almost always uses solutions based on algorithms with best complexity. Performance may vary depending on constants, and python is just not the fastest language compared to C or C++.
In this case, if you're looking for max value from a list, there is no better solution - to find maximum value, you have to check every value, meaning solution is O(n). That's just how lists work - it's just list with values. If you were to use some other structure, e.g. sorted list - accessing max value would take O(1) - but you would pay for this low complexity with higher complexity of adding/deleting values.

It differs from library to library
The defult python librarys like the sort function (if a algorithem not selected) will use the most efficient algorithem by deffult.
Sadly is Python quite slow in genera compared to languages like C, C++ or java.
This is becouse that python is one script that reads your script and executes it live.
C, C++ and Java all compiles to binary (exe) before executing.
//SW

Speed comparison of R and Python implementations

I have implemented a specific algorithm in R. There exists a Python library which offers an alternative implementation.
I would now like to formally compare the speed of the two implementations to assess which is "more efficient" in different kinds of situations.
What would be the best way to do this? Here, it is suggested to use the system.time() (in R) and time.time() (in Python) functions to compare the actual executions of the functions. Is this generally recommended? Do these time measures actually measure the same thing?
Would it be an alternative to measure the time of the whole script execution to include "overhead" stuff like variable definitions etc?
PS: I am a regular user of R, but have close to zero experience in Python

absolute fastest lookup in python / cython

I'd like to do a lookup mapping 32bit integer => 32bit integer.
The input keys aren't necessary contiguous nor cover 2^32 -1 (nor do I want this in-memory to consume that much space!).
The use case is for a poker evaluator, so doing a lookup must be as fast as possible. Perfect hashing would be nice, but that might be a bit out of scope.
I feel like the answer is some kind of cython solution, but I'm not sure about the underpinnings of cython and if it really does any good with Python's dict() type. Of course a flat array with just a simple offset jump would be super fast, but then I'm allocating 2^32 - 1 places in memory for the table, which I don't want.
Any tips / strategies? Absolute speed with minimal memory footprint is the goal.

You aren't smart enough to write something faster than dict. Don't feel bad; 99.99999% of the people on the planet aren't. Use a dict.

First, you should actually define what "fast enough" means to you, before you do anything else. You can always make something faster, so you need to set a target so you don't go insane. It is perfectly reasonable for this target to be dual-headed - say something like "Mapping lookups must execute in these parameters (min/max/mean), and when/if we hit those numbers we're willing to spend X more development hours to optimize even further, but then we'll stop."
Second, the very first thing you should do to make this faster is to copy the code in Objects/dictobject.c in the Cpython source tree (make something new like intdict.c or something) and then modify it so that the keys are not python objects. Chasing after a better hash function will not likely be a good use of your time for integers, but eliminating INCREF/DECREF and PyObject_RichCompareBool calls for your keys will be a huge win. Since you're not deleting keys you could also elide any checks for dummy values (which exist to preserve the collision traversal for deleted entries), although it's possible that you'll get most of that win for free simply by having better branch prediction for your new object.

You are describing a perfect use case for a hash indexed collection. You are also describing a perfect scenario for the strategy of write it first, optimise it second.
So start with the Python dict. It's fast and it absolutely will do the job you need.
Then benchmark it. Figure out how fast it needs to go, and how near you are. Then 3 choices.
It's fast enough. You're done.
It's nearly fast enough, say within about a factor of two. Write your own hash indexing, paying attention to the hash function and the collision strategy.
It's much too slow. You're dead. There is nothing simple that will give you a 10x or 100x improvement. At least you didn't waste any time on a better hash index.

DFT in Python taking significantly longer than C

I'm currently working on translating some C code To Python. This code is being used to help identify errors arising from the CLEAN algorithm used in Radio Astronomy. In order to do this analysis the value of the Fourier Transforms of Intensity Maps, Q Stokes Map and U Stokes Map must be found at specific pixel values (given by ANT_pix). These Maps are just 257*257 arrays.
The below code takes a few seconds to run with C but takes hours to run with Python. I'm pretty sure that it is terribly optimized as my knowledge of Python is quite poor.
Thanks for any help you can give.
Update My question is if there is a better way to implement the loops in Python which will speed things up. I've read quite a few answer here for other questions on Python which recommend avoiding nested for loops in Python if possible and I'm just wondering if anyone knows a good way of implementing something like the Python code below without the loops or with better optimised loops. I realise this may be a tall order though!
I've been using the FFT up till now but my supervisor wants to see what sort of difference the DFT will make. This is because the Antenna position will not, in general, occur at exact pixels values. Using FFT requires round to the closest pixel value.
I'm using Python as CASA, the computer program used to reduce Radio Astronomy datasets is written in python and implementing Python scripts in it is far far easier than C.
Original Code
def DFT_Vis(ANT_Pix="",IMap="",QMap="",UMap="", NMap="", Nvis=""):
UV=numpy.zeros([Nvis,6])
Offset=(NMap+1)/2
ANT=ANT_Pix+Offset;
i=0
l=0
k=0
SumI=0
SumRL=0
SumLR=0
z=0
RL=QMap+1j*UMap
LR=QMap-1j*UMap
Factor=[math.e**(-2j*math.pi*z/NMap) for z in range(NMap)]
for i in range(Nvis):
X=ANT[i,0]
Y=ANT[i,1]
for l in range(NMap):
for k in range(NMap):
Temp=Factor[int((X*l)%NMap)]*Factor[int((Y*k)%NMap)];
SumI+=IMap[l,k]*Temp
SumRL+=RL[l,k]*Temp
SumLR+=IMap[l,k]*Temp
k=1
UV[i,0]=SumI.real
UV[i,1]=SumI.imag
UV[i,2]=SumRL.real
UV[i,3]=SumRL.imag
UV[i,4]=SumLR.real
UV[i,5]=SumLR.imag
l=1
k=1
SumI=0
SumRL=0
SumLR=0
return(UV)

You should probably use numpy's fourier transform code, rather than writing your own: http://docs.scipy.org/doc/numpy/reference/routines.fft.html

If you are interested in boosting the performance of your script cython could be an option.

I am not an expert on the FFT, but my understanding is that the FFT is simply a fast way to compute the DFT. So to me your question sounds like you are trying to write a bubble sort algorithm to see if it gives a better answer than quicksort. They are both sorting algorithms that would give the same result!
So I am questioning your basic premise. I am wondering if you can just change your rounding on your data and get the same result from the SciPy FFT code.
Also, according to my DSP textbook, the FFT can produce a more accurate answer than computing the DFT the long way, simply because floating point operations are inexact, and the FFT invokes fewer floating point operations along the way to finding the correct answer.
If you have some working C code that does the calculation you want, you could always wrap the C code to let you call it from Python. Discussion here: Wrapping a C library in Python: C, Cython or ctypes?
To answer your actual question: as #ZoZo123 noted, it would be a big win to change from range() to xrange(). With range(), Python has to build a list of numbers, and then destroy the list when done; with xrange() Python just makes an iterator that yields up the numbers one at a time. (But note that in Python 3.x, range() makes an iterator and there is no xrange().)
Also, if this code does not have to integrate with the rest of your code, you might try running this code under PyPy. This is exactly the sort of code that PyPy can best optimize. The problem with PyPy is that currently your project must be "pure" Python, and it looks like you are using NumPy. (There are projects to get NumPy and PyPy to work together, but that's not done yet.) http://pypy.org/
If this code does need to integrate with the rest of your code, then I think you need to look at Cython (as noted by #Krzysztof Rosiński).

Increasing efficiency of Python copying large datasets

I'm having a bit of trouble with an implementation of random forests I'm working on in Python. Bare in mind, I'm well aware that Python is not intended for highly efficient number crunching. The choice was based more on wanting to get a deeper understanding of and additional experience in Python. I'd like to find a solution to make it "reasonable".
With that said, I'm curious if anyone here can make some performance improvement suggestions to my implementation. Running it through the profiler, it's obvious the most time is being spent executing the list "append" command and my dataset split operation. Essentially I have a large dataset implemented as a matrix (rather, a list of lists). I'm using that dataset to build a decision tree, so I'll split on columns with the highest information gain. The split consists of creating two new dataset with only the rows matching some critera. The new dataset is generated by initializing two empty lista and appending appropriate rows to them.
I don't know the size of the lists in advance, so I can't pre-allocate them, unless it's possible to preallocate abundant list space but then update the list size at the end (I haven't seen this referenced anywhere).
Is there a better way to handle this task in python?

Without seeing your codes, it is really hard to give any specific suggestions since optimisation is code-dependent process that varies case by case. However there are still some general things:
review your algorithm, try to reduce the number of loops. It seems
you have a lot of loops and some of them are deeply embedded in
other loops (I guess).
if possible use higher performance utility modules such as itertools
instead of naive codes written by yourself.
If you are interested, try PyPy (http://pypy.org/), it is a
performance-oriented implementation of Python.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.