Grouping strings lexicographically (python)


I have N strings that I want to divide lexicographically into M even-sized buckets (+/- 1 string). Also, N>>M.
The direct way would be to sort all the strings and split the resulting list into the M buckets.
I would like to instead approximate this by routing each string as it is created to a bucket, before the full list is available.
Is there a fast and pythonic way to assign strings to buckets? I'm essentially looking for a string-equivalent of the integer modulo operator. Perhaps a hash that preserves lexicographic order? Is that even possible?

You can bucket by the first two characters of each string, or something of that sort.
Say M=100: divide the character range into sqrt(M) = 10 regions, and have each region point to another 10 regions for the second character. For each string you get, compare the first character to decide which region it goes to, then do the same with the second character. The result is something like a tree with comparisons as nodes and buckets as leaves.
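A minimal sketch of the prefix idea, under stated assumptions: the strings contain only lowercase ASCII letters, and `prefix_bucket`, `alphabet`, and `depth` are illustrative names, not part of any library. Instead of nested regions, this variant reads the first two characters as one two-digit number in base 26 and scales it to the range 0..M-1, which keeps the bucket index non-decreasing in lexicographic order. Bucket sizes come out even only if the prefixes are roughly uniformly distributed.

```python
import string


def prefix_bucket(s, m, alphabet=string.ascii_lowercase, depth=2):
    """Map a string to a bucket in range(m) using its first `depth` characters.

    Assumption: every character examined is in `alphabet`, which is sorted.
    """
    base = len(alphabet)
    value = 0
    for i in range(depth):
        # Strings shorter than `depth` are padded with the smallest symbol.
        pos = alphabet.index(s[i]) if i < len(s) else 0
        value = value * base + pos
    # Scale the prefix value from range(base ** depth) down to range(m).
    return value * m // base ** depth


words = ["apple", "az", "ba", "mango", "zz"]
buckets = [prefix_bucket(w, 100) for w in words]
print(buckets)
```

For real data with a skewed distribution, the fixed scaling could be replaced by boundaries taken from a sorted sample of the strings, at the cost of needing that sample up front.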

A general-purpose hash is designed to scatter its inputs, so it doesn't preserve any order.
And I don't think there is any pythonic way to do this.
You could just create dictionaries (which are basically hash tables) and keep adding a string to each round-robin style, but it wouldn't preserve any order.
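For illustration, the round-robin assignment can be sketched as follows, using plain lists as buckets rather than dictionaries (the function name `round_robin` is made up for this example). It keeps bucket sizes within one of each other, but the buckets have no lexicographic meaning.

```python
from itertools import cycle


def round_robin(strings, m):
    """Distribute strings over m buckets in arrival order."""
    buckets = [[] for _ in range(m)]
    # cycle() repeats the bucket list, so string i lands in bucket i % m.
    for s, bucket in zip(strings, cycle(buckets)):
        bucket.append(s)
    return buckets


result = round_robin(["a", "b", "c", "d", "e", "f", "g"], 3)
print(result)  # [['a', 'd', 'g'], ['b', 'e'], ['c', 'f']]
```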

Related

Ordinal Numbers in Python

Does there exist an implementation of ordinal numbers in python?
An example use-case is the following:
We want to maintain a sorted dictionary indexed by numbers.
Over time we may insert an arbitrary number of "standard" elements (idx, e)
The dict should also contain certain "special" elements (idx, w) which should always appear at the end, after all other elements.
A very clean solution (*) would be to simply index the elements with ordinal numbers, so the "standard" elements would use the indices 0,1,2,3,... and the "special" elements could be assigned to ω, ω+1, ω+2, ω+3, ... This approach also generalizes well.
Of course there are alternative solutions, for example one could maintain multiple lists. However this has the disadvantage that if we want to iterate over all elements, we need to work through the lists one-by-one. With ordinal number indexing, one would just have to iterate over one list.
(*) At least from my perspective as a mathematician
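One way to sketch the indexing idea without a full ordinal type, assuming only ordinals below ω·2 are needed: represent n as the tuple (0, n) and ω+n as the tuple (1, n). Python compares tuples lexicographically, so every (1, n) sorts after every (0, n). The helper names `finite` and `omega_plus` are invented for this illustration.

```python
def finite(n):
    """The finite ordinal n."""
    return (0, n)


def omega_plus(n):
    """The ordinal omega + n."""
    return (1, n)


entries = {
    omega_plus(0): "w0",
    finite(2): "e2",
    finite(0): "e0",
    omega_plus(1): "w1",
    finite(1): "e1",
}

# A single sorted pass visits all standard elements, then all special ones.
ordered = [entries[key] for key in sorted(entries)]
print(ordered)  # ['e0', 'e1', 'e2', 'w0', 'w1']
```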

Get the highest String Version number in Python

I am trying to get the highest version of a string in Python. I was trying to sort the list, but that of course doesn't work so easily, as Python sorts by the string representation.
For that I am trying to work with a regex, but it somehow doesn't match.
The Strings look like this:
topic_v10_ext2
topic_v20_ext2
topic_v2_ext2
topic_v5_ext2
topic_v7_ext2
My Regex looks like this.
version_no = re.search("(?:_v([0-9]+))?", v.name)
I was thinking about saving the names in a list and looking for the highest v_xx in the list to return.
Also, for now I am doing this in two for loops, which runs in 2*O(n) and is not optimal, I believe.
How can I get the highest version in a fast and simple way?

You can use sorted or list.sort with key:
sorted(l, key=lambda x:int(x.split('_')[1][1:]), reverse=True)
['topic_v20_ext2',
'topic_v10_ext2',
'topic_v7_ext2',
'topic_v5_ext2',
'topic_v2_ext2']
x.split('_'): returns the split string as a list, e.g.: ['topic', 'v20', 'ext2']
Since the version is the key to the sorting, select it with x.split('_')[1]
The selected 'v20' has an unwanted character 'v', so slice it with [1:] to keep only the digits.
Finally, convert the digits to int for numerical ordering.
Also, sorted returns ascending order by default. Since you require descending order, use reverse=True.
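If only the highest version is needed, a full sort isn't required. A small sketch using the built-in max with the same key, which makes a single pass over the list (the names `names` and `version_key` are just for illustration):

```python
names = [
    "topic_v10_ext2",
    "topic_v20_ext2",
    "topic_v2_ext2",
    "topic_v5_ext2",
    "topic_v7_ext2",
]


def version_key(name):
    # 'topic_v20_ext2' -> ['topic', 'v20', 'ext2'] -> 'v20' -> 20
    return int(name.split("_")[1][1:])


highest = max(names, key=version_key)
print(highest)  # topic_v20_ext2
```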

It could also work with regular expressions, as first tried:
import re
v = 'topic_v7_ext2'
version_no = re.search("^[^_]*_v([0-9]+)", v)
print(version_no.group(1))
That expression searches for pattern from the beginning of the string (^), takes all characters different from _ (I hope your topics can't have one, else both answers are wrong), then finds the '_v' and takes the version number.
There is no need to match _ext, so it doesn't matter if it's there or not!
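A short sketch applying that expression to several names, including ones without an _ext suffix or without a version at all. The sample names and the helper `version_number` are made up here; returning None for a non-matching name is one possible choice, not the only one.

```python
import re

VERSION_RE = re.compile(r"^[^_]*_v([0-9]+)")


def version_number(name):
    """Return the version as an int, or None if the name has no _v<digits> part."""
    match = VERSION_RE.search(name)
    return int(match.group(1)) if match else None


names = ["topic_v10_ext2", "topic_v2", "notes_ext2"]
versions = {name: version_number(name) for name in names}
print(versions)
```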

Most efficient way to check if any substrings in list are in another list of strings

I have two lists, one of words, and another of character combinations. What would be the fastest way to only return the combinations that don't match anything in the list?
I've tried to make it as streamlined as possible, but it's still very slow when it uses 3 characters for the combinations (goes up to 290 seconds for 4 characters, not even going to try 5)
Here's some example code, currently I'm converting all the words to a list, and then searching the string for each list value.
# Sample data
allCombinations = ["a", "aa", "ab", "ac", "ad"]
allWords = ["testing", "accurate"]
# Do the calculations
allWordsJoined = ",".join(allWords)
invalidCombinations = set(i for i in allCombinations if i not in allWordsJoined)
print(invalidCombinations)
# Result: {'aa', 'ab', 'ad'} (set order may vary)
I'm just curious if there's a better way to do this with sets? With a combination of 3 letters, there are 18278 list items to search for, and for 4 letters, that goes up to 475254, so currently my method isn't really fast enough, especially when the word list string is about 1 million characters.
Set.intersection seems like a very useful method if you need the whole string, so surely there must be something similar to search for a substring.

The first thing that comes to mind is that you can optimize the lookup by checking the current combination against combinations that are already known to be "invalid". I.e. if ab is invalid, then ab.? will be invalid too and there's no point in checking it.
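A sketch of that pruning, assuming the combinations are generated level by level from an alphabet rather than taken from a prebuilt list. The function name `shortest_invalid` is hypothetical. It records only the shortest invalid combinations; every longer combination starting with one of them is invalid by implication and is never checked.

```python
def shortest_invalid(text, max_len, alphabet):
    """Return the invalid combinations that have no invalid proper prefix."""
    invalid = set()
    frontier = [""]  # combinations of the current length that do occur in text
    for _ in range(max_len):
        next_frontier = []
        for prefix in frontier:
            for ch in alphabet:
                combo = prefix + ch
                if combo in text:
                    next_frontier.append(combo)  # worth extending further
                else:
                    invalid.add(combo)  # all extensions are invalid too
        frontier = next_frontier
    return invalid


result = shortest_invalid("testing,accurate", 2, "abc")
print(sorted(result))  # ['aa', 'ab', 'b', 'ca', 'cb']
```

Here 'ba', 'bb', and 'bc' are never tested, because 'b' alone is already absent from the text.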
And one more thing: try using
invalidCombinations = set()
for i in allCombinations:
    if i not in allWordsJoined:
        invalidCombinations.add(i)
instead of
invalidCombinations = set(i for i in allCombinations if i not in allWordsJoined)
I'm not sure, but fewer memory allocations could give a small boost on real data.

Seeing if a set contains an item is O(1). You would still have to iterate through your list of combinations to compare them with your original set of words, with some exceptions: if no word contains "a", none will contain any other combination that includes "a". You can use some tree-like data structure for this.
Rather than joining your word list into one string, build a set (for substring matching, a set of the substrings that occur in the words). You should then get O(N) for the lookups, where N is the number of combinations.
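A minimal sketch of the set-based approach, assuming the combinations have a small maximum length: collect every substring of the words up to that length into a set once, then take a set difference. The function name `present_substrings` is invented for this example.

```python
def present_substrings(words, max_len):
    """Collect every substring of length 1..max_len that occurs in any word."""
    seen = set()
    for word in words:
        for start in range(len(word)):
            # Slices are capped so they never run past the end of the word.
            for end in range(start + 1, min(start + max_len, len(word)) + 1):
                seen.add(word[start:end])
    return seen


all_combinations = ["a", "aa", "ab", "ac", "ad"]
all_words = ["testing", "accurate"]

longest = max(len(c) for c in all_combinations)
invalid = set(all_combinations) - present_substrings(all_words, longest)
print(sorted(invalid))  # ['aa', 'ab', 'ad']
```

The set holds at most (total characters × maximum length) entries, so the work depends on the size of the word list rather than on the number of combinations.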
Also, I like Python, but it isn't the fastest of languages. If this is the only task you need to do, and it needs to be very fast, and you can't improve the algorithm, you might want to check out other languages. You should be able to very easily prototype something to get an idea of the difference in speed for different languages.

Searching string for different substrings

I have a string. I need to know if any of the following substrings appear in the string.
So, if I have:
thing_name = "VISA ASSESSMENTS"
I've been doing my searches with:
any((_ in thing_name for _ in ['ASSESSMENTS','KILOBYTE','INTERNATIONAL']))
I'm going through a long list of thing_name items, and I don't need to filter, exactly, just check for any number of substrings.
Is this the best way to do this? It feels wrong, but I can't think of a more efficient way to pull this off.

You can try re.search to see if that is faster. Something along the lines of
import re
# re.escape keeps any special characters in the substrings literal
pattern = re.compile('|'.join(map(re.escape, ['ASSESSMENTS','KILOBYTE','INTERNATIONAL'])))
isMatch = pattern.search(thing_name) is not None

If your list of substrings is small and the input is small, then using a for loop to do compares is fine.
Otherwise, the fastest way I know to search a string for a (large) list of substrings is to construct a DAWG (directed acyclic word graph) of the word list and then iterate through the input string, keeping a list of DAWG traversals and registering the substrings at each successful traversal.
Another way is to add all the substrings to a hashtable and then hash every possible substring (up to the length of the longest substring) as you traverse the input string.
It's been a while since I've worked in Python; my memory is that it's slow for implementing this kind of thing. To go the DAWG route, I would probably implement it as a native module and then use it from Python (if possible). Otherwise, I'd do some speed checks first to verify, but I'd probably go the hashtable route, since there are already high-performance hashtables in Python.
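The hashtable route could look roughly like this, using a Python set as the hashtable. This is a sketch, not a tuned implementation: the function name `contains_any` is made up, the substrings are assumed to be non-empty, and only window lengths that actually occur among the substrings are tried.

```python
def contains_any(text, needles):
    """Return True if any string in `needles` occurs in `text`."""
    needle_set = set(needles)
    lengths = sorted({len(n) for n in needle_set})
    for start in range(len(text)):
        for length in lengths:
            # A slice near the end of text may be shorter than `length`;
            # it can then only match a needle of that shorter length.
            if text[start:start + length] in needle_set:
                return True
    return False


substrings = ["ASSESSMENTS", "KILOBYTE", "INTERNATIONAL"]
print(contains_any("VISA ASSESSMENTS", substrings))  # True
print(contains_any("VISA APPLICATION", substrings))  # False
```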

How to make binary to hex converter using for loop - Python

Yes, this is homework.
I have the basic idea. I know that I need a for loop with ifs saying that if the value is above 9 then it's a, b, c, and so forth. But what I need is to get the for loop to grab each digit and its index number so I can calculate the value and then print out the hex. By the way, it's an 8-bit binary number and has to come out in two-digit hex form.
Thanks a lot!!

I'm assuming that you have a string containing the binary data.
In Python, you can iterate over all sorts of things, strings included. It becomes as simple as this:
for char in mystring:
    pass
And replace pass with your suite (a term meaning a "block" of code). At this point, char will be a single-character string. Nice and straightforward.
For getting the character ordinal, investigate ord (find help for it yourself, it's not hard and it's good practice).
For converting the number to hex, you could use % string formatting with '%x', which will produce a value like '9f', or you could use the hex function, which will produce a value like '0x9f'; there are other ways, too.
If you can't figure anything out, ask; but try to work it out first. It's your homework. :-)
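For reference, the formatting options mentioned above behave like this on a sample value. The value 159 is chosen only to match the '9f' example; the zero-padded '%02x' form is an addition for the two-digit requirement.

```python
value = 159  # sample number; ord() yields such an int from a single character

percent_style = "%x" % value   # lowercase hex digits, no prefix
hex_function = hex(value)      # same digits with a '0x' prefix
two_digits = "%02x" % 10       # zero-padded to two hex digits

print(percent_style)  # 9f
print(hex_function)   # 0x9f
print(two_digits)     # 0a
print(ord("A"))       # 65
```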

So assuming that you've got the binary number in a string, you will want to have an index variable that gets incremented with each iteration of the for loop. I'm not going to give you the exact code, but consider this:
Python's for loop is designed to set the index variable (for index in list) to each value of a list of values.
You can use the range function to generate a list of numbers (say, from 0 to 7).
You can get the character at a given index in a string by using e.g. binary[index].
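Putting those three hints together, one possible shape for the loop is sketched below. It assumes the input is a string of exactly eight '0'/'1' characters, and it uses '%02x' formatting for the final step rather than hand-written ifs for the digits above 9. The function name `binary_to_hex` is illustrative.

```python
def binary_to_hex(binary):
    """Convert an 8-character string of 0s and 1s to two hex digits."""
    if len(binary) != 8 or set(binary) - {"0", "1"}:
        raise ValueError("expected a string of exactly eight 0/1 characters")
    value = 0
    for index in range(len(binary)):
        # Shift the running total left by one bit, then add the next bit.
        value = value * 2 + int(binary[index])
    return "%02x" % value


print(binary_to_hex("10011111"))  # 9f
print(binary_to_hex("00001010"))  # 0a
```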
