use custom compare function in set equality python - python

I wish to use a custom compare function while calculating set. I wish to take advantage of the efficiencies of using set algorithm. technically I could create a double for loop to compare the two lists (keep, original) but I thought this might not be efficient.
eg://
textlist = ["ravi is happy", "happy ravi is", "is happy ravi", "is ravi happy"]
set() should return only 1 of these elements as the compare function would return if True if similarity between comparing items >= threshold.
In python. Thanks.
P.S.
The real trick is that I'd like to use my string_compare(t1,t2): Float to do the comparison rather then hashing and equal...
P.S.S.
C# has similar function:
How to remove similar string from a list?

I think this is what you were looking for:
{' '.join(sorted(sentence.split())) for sentence in textlist}
This re-orders the string and therefore Python set will now work because we are comparing identical strings.

Related

Searching fast in big binary file in python [duplicate]

Assume I have a set of strings S and a query string q. I want to know if any member of S is a substring of q. (For the purpose of this question substring includes equality, e.g. "foo" is a substring of "foo".) For example assume the function that does what I want is called anySubstring:
S = ["foo", "baz"]
q = "foobar"
assert anySubstring(S, q) # "foo" is a substring of "foobar"
S = ["waldo", "baz"]
assert not anySubstring(S, q)
Is there any easy-to-implement algorithm to do this with time complexity sublinear in len(S)? It's ok if S has to be processed into some clever data structure first because I will be querying each S with a lot of q strings, so the amortized cost of this preprocessing might be reasonable.
EDIT: To clarify, I don't care which member of S is a substring of q, only whether at least one is. In other words, I only care about a boolean answer.
I think Aho-Corasick algorithm does what you want. I think there is another solution which is very simple to implement, it's Karp-Rabin algorithm.
So if the length of S is way less then the sum of the lengths of the potential substrings your best option would be to build a suffix tree from S and then do a search in it. This is linear with respect to the length of S plus the summar length of the candidate substrings. Of course there can not be an algorithm with better complexity as you have to pass through all the input at least. If the case is opposite i.e. the length of s is more then the summar length of the substrings your best option would be aho-corasick.
Hope this helps.
Create a regular expression .*(S1|S2|...|Sn).* and construct its minimal DFA.
Run your query string through the DFA.

Most efficient way to check if any substrings in list are in another list of strings

I have two lists, one of words, and another of character combinations. What would be the fastest way to only return the combinations that don't match anything in the list?
I've tried to make it as streamlined as possible, but it's still very slow when it uses 3 characters for the combinations (goes up to 290 seconds for 4 characters, not even going to try 5)
Here's some example code, currently I'm converting all the words to a list, and then searching the string for each list value.
#Sample of stuff
allCombinations = ["a","aa","ab","ac","ad"]
allWords = ["testing", "accurate" ]
#Do the calculations
allWordsJoined = ",".join( allWords )
invalidCombinations = set( i for i in allCombinations if i not in allWordsJoined )
print invalidCombinations
#Result: set(['aa', 'ab', 'ad'])
I'm just curious if there's a better way to do this with sets? With a combination of 3 letters, there are 18278 list items to search for, and for 4 letters, that goes up to 475254, so currently my method isn't really fast enough, especially when the word list string is about 1 million characters.
Set.intersection seems like a very useful method if you need the whole string, so surely there must be something similar to search for a substring.
The first thing that comes to mind is that you can optimize lookup by checking current combination against combinations that are already "invalid". I.e. if ab is invalid, than ab.? will be invalid too and there's no point to check such.
And one more thing: try using
for i in allCombinations:
if i not in allWordsJoined:
invalidCombinations.add(i)
instead of
invalidCombinations = set(i for i in allCombinations if i not in allWordsJoined)
I'm not sure, but less memory allocations can be a small boost for real data run.
Seeing if a set contains an item is O(1). You would still have to iterate through your list of combinations (with some exceptions. If your word doesn't have "a" it's not going to have any other combinations that contain "a". You can use some tree-like data structure for this) to compare with your original set of words.
You shouldn't convert your wordlist to a string, but rather a set. You should get O(N) where N is the length of your combinations.
Also, I like Python, but it isn't the fastest of languages. If this is the only task you need to do, and it needs to be very fast, and you can't improve the algorithm, you might want to check out other languages. You should be able to very easily prototype something to get an idea of the difference in speed for different languages.

Searching string for different substrings

I have a string. I need to know if any of the following substrings appear in the string.
So, if I have:
thing_name = "VISA ASSESSMENTS"
I've been doing my searches with:
any((_ in thing_name for _ in ['ASSESSMENTS','KILOBYTE','INTERNATIONAL']))
I'm going through a long list of thing_name items, and I don't need to filter, exactly, just check for any number of substrings.
Is this the best way to do this? It feels wrong, but I can't think of a more efficient way to pull this off.
You can try re.search to see if that is faster. Something along the lines of
import re
pattern = re.compile('|'.join(['ASSESSMENTS','KILOBYTE','INTERNATIONAL']))
isMatch = (pattern.search(thing_name) != None)
If your list of substrings is small and the input is small, then using a for loop to do compares is fine.
Otherwise the fastest way I know to search a string for a (large) list of substrings is to construct a DAWG of the word list and then iterate through the input string, keeping a list of DAWG traversals and registering the substrings at each successful traverse.
Another way is to add all the substrings to a hashtable and then hash every possible substring (up to the length of the longest substring) as you traverse the input string.
It's been a while since I've worked in python, my memory of it is that it's slow to implement stuff in. To go the DAWG route, I would probably implement it as a native module and then use it from python (if possible). Otherwise, I'd do some speed checks to verify first but probably go the hashtable route since there are already high performance hashtables in python.

Grouping strings lexicographically (python)

I have N strings that I want to divide lexicographic into M even-sized buckets (+/- 1 string). Also, N>>M.
The direct way would be to sort all the strings and split the resulting list into the M buckets.
I would like to instead approximate this by routing each string as it is created to a bucket, before the full list is available.
Is there a fast and pythonic way to assign strings to buckets? I'm essentially looking for a string-equivalent of the integer modulo operator. Perhaps a hash that preserves lexicographic order? Is that even possible?
You can sort by first two chars of a string, or something of this sort.
Let's say that M=100, so you should divide the characters into sqrt(M) regions, and each should point to another sqrt(M) regions, then for each string you get, you can compare the first char to decide which region to direct the string to and again for the second char, something like a tree with buckets as leaves and comparisons as nodes.
A hash by definition doesn't preserve any order.
And I don't think there is any pythonic way to do this.
You could just create dictionaries (which are basically hashing functions) and keep adding a string to each round-robin style, but it wouldn't preserve any order.

How to make binary to hex converter using for loop - Python

Yes, this is homework.
I have the basic idea. I know that basically I need to introduce a for loop and set if's saying if the value is above 9 then it's a, b, c, and so forth. But what I need is to get the for loop to grab the integer and its index number to calculate and go back and forth and then print out the hex. by the way its an 8 bit binary number and has to come out in two digit hex form.
thanks a lot!!
I'm assuming that you have a string containing the binary data.
In Python, you can iterate over all sorts of things, strings included. It becomes as simple as this:
for char in mystring:
pass
And replace pass with your suite (a term meaning a "block" of code). At this point, char will be a single-character string. Nice an straight forward.
For getting the character ordinal, investigate ord (find help for it yourself, it's not hard and it's good practice).
For converting the number to hex, you could use % string formatting with '%x', which will produce a value like '9f', or you could use the hex function, which will produce a value like '0x9f'; there are other ways, too.
If you can't figure any thing out, ask; but try to work it out first. It's your homework. :-)
So assuming that you've got the binary number in a string, you will want to have an index variable that gets incremented with each iteration of the for loop. I'm not going to give you the exact code, but consider this:
Python's for loop is designed to set the index variable (for index in list) to each value of a list of values.
You can use the range function to generate a list of numbers (say, from 0 to 7).
You can get the character at a given index in a string by using e.g. binary[index].

Categories