Get the highest string version number in Python

I am trying to get the highest version from a list of strings in Python. I was trying to sort the list, but that doesn't work as easily as hoped, since Python sorts the string representations lexicographically.
So I tried to work with a regex instead, but it somehow doesn't match.
The strings look like this:
topic_v10_ext2
topic_v20_ext2
topic_v2_ext2
topic_v5_ext2
topic_v7_ext2
My regex looks like this:
version_no = re.search("(?:_v([0-9]+))?", v.name)
I was thinking about saving the names in a list and looking for the highest v_xx in the list to return.
Also, for now I am doing this in two for loops, which I believe is not optimal.
How can I get the highest version in a fast and simple way?

You can use sorted or list.sort with a key:
sorted(l, key=lambda x:int(x.split('_')[1][1:]), reverse=True)
['topic_v20_ext2',
'topic_v10_ext2',
'topic_v7_ext2',
'topic_v5_ext2',
'topic_v2_ext2']
x.split('_') returns the split string, e.g. ['topic', 'v20', 'ext2'].
Since the version is the sort key, select it with x.split('_')[1].
The selected 'v20' still contains the unwanted character 'v', so slice it off with [1:] to keep only the digits.
Finally, convert the digits to int for numerical ordering.
Also, sorted returns ascending order by default. Since you want descending order, pass reverse=True.
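If you only need the single highest version rather than the whole sorted list, max with the same key does it in one pass:
l = ['topic_v10_ext2', 'topic_v20_ext2', 'topic_v2_ext2', 'topic_v5_ext2', 'topic_v7_ext2']
# max() scans the list once and keeps the element with the largest key
print(max(l, key=lambda x: int(x.split('_')[1][1:])))  # topic_v20_ext2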

It could also work with regular expressions, as you first tried:
import re
v = 'topic_v7_ext2'
version_no = re.search("^[^_]*_v([0-9]+)", v)
print(version_no.group(1))
That expression searches from the beginning of the string (^), consumes all characters that are not _ (I hope your topics can't contain one, otherwise both answers are wrong), then finds the '_v' and captures the version number.
There is no need to match _ext, so it doesn't matter whether it's there or not!
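A minimal sketch combining that regex with max(), assuming every name contains a '_v<number>' part (otherwise re.search returns None and the key function fails):
import re

names = ['topic_v10_ext2', 'topic_v20_ext2', 'topic_v2_ext2', 'topic_v5_ext2', 'topic_v7_ext2']
# the captured group is the version number, compared as an int
print(max(names, key=lambda s: int(re.search(r"^[^_]*_v([0-9]+)", s).group(1))))
# topic_v20_ext2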

Related

Create unique id of fixed length only using given symbols?

I am trying to see how I can create a set of unique IDs of a fixed length (say length 12) in Python using a specific subset of all alphanumeric characters. The use case is that these IDs need to be read by people and referred to in printed documents, so I am trying to avoid the characters L, I, O and the digits 0 and 1. I of course need to be able to generate a new ID as needed.
I looked into the UUID function in other answers but wasn't able to find a way to use it to meet my requirements. I've done a lot of searching, but apologies if this is a duplicate.
Edit: So far I have tried UUID as described here, and also the hashids function, but could not figure out a way to do it with either. The next best solution I could come up with is to create a list of random strings and check each against all existing IDs, but that seems woefully inefficient.
For a set of characters to sample, you could use string.ascii_uppercase (A-Z) plus string.digits (0-9), then remove the unwanted characters 'LIO01'. From there you can use random.choices to generate a sequence of length k, allowing repeated characters.
import string
import random
def unique_id(size):
    # allowed characters: A-Z and 0-9, minus the ambiguous L, I, O, 0, 1
    chars = list(set(string.ascii_uppercase + string.digits).difference('LIO01'))
    return ''.join(random.choices(chars, k=size))
>>> unique_id(12)
'HBFXXHWZ8349'
>>> unique_id(12)
'A7W5WK636BYN'
>>> unique_id(12)
'WJ2JBX924NVK'
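Note that random.choices does not guarantee uniqueness across calls. A minimal sketch of how you might enforce it, where the in-memory set is a stand-in for whatever persistent store you actually query:
used_ids = set()  # stand-in for a database/file of already issued IDs

def new_unique_id(size=12):
    candidate = unique_id(size)
    # retry until we hit an ID that has not been issued yet
    while candidate in used_ids:
        candidate = unique_id(size)
    used_ids.add(candidate)
    return candidate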
You could use an iterator like itertools.combinations:
import itertools
import string
valid_chars = set(string.ascii_lowercase + string.digits) - set('lio01')

# you would probably want to persist the used values in some sort of
# database/file instead of this in-memory set
used = set()

unique_id_generator = itertools.combinations(valid_chars, 12)

generated = "".join(next(unique_id_generator))
while generated in used:
    generated = "".join(next(unique_id_generator))

# once an unused value has been found, add it to the used set
# (or whatever store you use to keep track)
used.add(generated)
This generator will continue to produce all possible combinations (without replacement) of the lowercase ASCII characters and digits, excluding the ones you mentioned. If you need upper case, you can call .upper() on the result, and if you want to allow repeated characters, use itertools.combinations_with_replacement.
If 'xyz' is not considered to be the same as 'xzy', take a look at itertools.permutations.
I bumped into a similar problem, and the simplest solution I could think of is this one:
Answer
from secrets import token_urlsafe
id = ''.join([c for c in token_urlsafe(10) if c not in '-_OI0l'])[:5]
print(id) # 'a3HkR'
Explanation
token_urlsafe(10): a URL-safe string built from 10 random bytes, roughly 14 characters drawn from [a-z, A-Z, 0-9, -, _]
if c not in '-_OI0l': removes the characters you don't want
[:5]: takes just the first 5 characters, if 5 is the length you want, for example.
Strengths
Readable
One-liner
Customizable
Can be highly secure if needed
Limitations
Uniqueness is not guaranteed. You can check uniqueness separately, or just pick as long an ID as needed so that randomness takes care of that for you.
The above example can create 459,165,024 different IDs.
If you remove many characters or want a longer ID, you also have to make the number in token_urlsafe(number) bigger, otherwise the filtered string can end up shorter than the length you slice.

Searching words without diacritics in a sorted list of words

I've been trying to come up with an efficient solution for the following problem. I have a sorted list of words that contain diacritics and I want to be able to do a search without using diacritics. So for example I want to match 'kříž' just using 'kriz'. After a bit of brainstorming I came up with the following and I want to ask you, more experienced (or clever) ones, whether it's optimal or there's a better solution. I'm using Python but the problem is language independent.
First I provide a mapping of those characters that have some diacritical siblings. So in case of Czech:
cz_map = {'a' : ('á',), ... 'e' : ('é', 'ě') ... }
Now I can easily create all variants of a word on the input. So for 'lama' I get: ['lama', 'láma', 'lamá', 'lámá']. I could already use this to search for words matching any of those permutations, but for a word like 'nepredvidatelny' (unpredictable) one gets 13824 permutations. Even though my laptop has a shiny Intel i5 logo on it, this is too naive a solution for my taste.
Here's an improvement I came up with. The dictionary of words I'm using has a variant of binary search for prefix matching (it returns the word at the lowest index with a matching prefix), which is very useful in this case. I start with the first character, check whether it exists as a prefix in the dictionary, and if it does, I keep it for the next round, in which the next character (and its diacritical siblings) is appended to all of the surviving prefixes. This way I'm propagating only those strings that can lead to a match. Here's the code:
def dia_search(word, cmap, dictionary):
    prefixes = ['']
    for c in word:
        # each character maps to itself
        subchars = [c]
        # and some diacritical siblings if they exist
        if c in cmap:
            subchars += cmap[c]
        # build a list of matching prefixes for the next round
        prefixes = [p+s for s in subchars
                    for p in prefixes
                    if dictionary.psearch(p+s) > 0]
    return prefixes
This technique gives very good results, but could it be even better? Or is there a technique that doesn't need the character mapping, as in this case? I'm not sure this is relevant, but the dictionary I'm using isn't sorted by any collation rules, so the sequence is 'a', 'z', 'á', not 'a', 'á', 'z' as one might expect.
Thanks for all comments.
EDIT: I cannot create any auxiliary precomputed database that would be a copy of the original one but without diacritics. Let's say the original database is too big to be replicated.
Using only the standard library (str.maketrans and str.translate), you could do this:
intab = "řížéě"   # ...add all the other characters
outtab = "rizee"  # and the characters you want them translated to
transtab = str.maketrans(intab, outtab)

strg = "abc kříž def "
print(strg.translate(transtab))  # abc kriz def
This is for Python 3.
For Python 2 you'd need to:
from string import maketrans
transtab = maketrans(intab, outtab)
# the rest remains the same
Have a look at Unidecode, which you can use to convert diacritics to the closest ASCII equivalents, e.g. unidecode(u'kříž').
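If a third-party package is not an option, a similar (though less complete) effect is possible with the standard library's unicodedata module: decompose each character (NFKD), then drop the combining marks. A minimal sketch:
import unicodedata

def remove_diacritics(s):
    # NFKD splits 'ř' into 'r' plus a combining caron; drop the marks
    return ''.join(c for c in unicodedata.normalize('NFKD', s)
                   if not unicodedata.combining(c))

print(remove_diacritics('kříž'))  # kriz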
As has been suggested, what you want to do is translate your Unicode words (containing diacritics) to the closest plain-ASCII versions.
One way of implementing this would be to create a second list of words (the same size as the original) with the corresponding translations. Then you do the query in the translated list, and once you have a match, look up the corresponding location in the original list, as in the sketch below.
Or, in case you can alter the original list, you can translate everything in place and strip duplicates.
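A minimal sketch of the parallel-list idea using bisect for the prefix search; remove_diacritics stands in for whichever translation function from the answers above you pick, and words is the original word list:
import bisect

# build (translated, original) pairs, sorted by the translated form
pairs = sorted((remove_diacritics(w), w) for w in words)
keys = [k for k, _ in pairs]

def search(query):
    # find the first translated word with a matching prefix, then scan right
    i = bisect.bisect_left(keys, query)
    matches = []
    while i < len(keys) and keys[i].startswith(query):
        matches.append(pairs[i][1])
        i += 1
    return matches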

regex extraction 2 groups resulting only in one match

New to regex.
Suppose you have the following text structure:
"hello_1:45||hello_2:67||bye_1:45||bye_5:89||.....|| bye_last:100" and so on
I want to build a dictionary out of it taking the string value as a key, and the decimal number as the dict value.
I was trying to check my concept using this nice tool
I wrote my regex expression:
(\w+):(\d+)
and got only one match, the first in the string: hello_1:45.
I also tried something like:
.*(\w+):(\d+).*
but that's also no good. Any ideas?
You should use the g (global) modifier to get all the matches instead of stopping at the first one. In Python, you can use the re.findall function to get all the matches.
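A minimal sketch: with two capture groups, re.findall returns (key, value) tuples, which feed straight into a dict comprehension:
import re

s = "hello_1:45||hello_2:67||bye_1:45||bye_5:89"
# each match is a (name, number) tuple; convert the number to int
d = {k: int(v) for k, v in re.findall(r"(\w+):(\d+)", s)}
print(d)  # {'hello_1': 45, 'hello_2': 67, 'bye_1': 45, 'bye_5': 89}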
You may also achieve this with the split function alone:
s = "hello_1:45||hello_2:67||bye_1:45||bye_5:89"
print({i.split(':')[0]: i.split(':')[1] for i in s.split('||')})
Try this if you want to convert the value part to int:
print({i.split(':')[0]: int(i.split(':')[1]) for i in s.split('||')})
or
print({i.split(':')[0]: float(i.split(':')[1]) for i in s.split('||')})

Searching string for different substrings

I have a string. I need to know if any of the following substrings appear in the string.
So, if I have:
thing_name = "VISA ASSESSMENTS"
I've been doing my searches with:
any((_ in thing_name for _ in ['ASSESSMENTS','KILOBYTE','INTERNATIONAL']))
I'm going through a long list of thing_name items, and I don't need to filter, exactly, just check whether any of the substrings occurs.
Is this the best way to do this? It feels wrong, but I can't think of a more efficient way to pull this off.
You can try re.search to see if that is faster. Something along the lines of:
import re
pattern = re.compile('|'.join(['ASSESSMENTS', 'KILOBYTE', 'INTERNATIONAL']))
is_match = pattern.search(thing_name) is not None
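One caveat: if any of the substrings could contain regex metacharacters, escape them before joining, e.g.:
# re.escape makes each substring match literally inside the alternation
pattern = re.compile('|'.join(map(re.escape, ['ASSESSMENTS', 'KILOBYTE', 'INTERNATIONAL'])))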
If your list of substrings is small and the input is small, then using a for loop to do compares is fine.
Otherwise, the fastest way I know to search a string for a (large) list of substrings is to construct a DAWG of the word list and then iterate through the input string, keeping a list of DAWG traversals and registering the substrings at each successful traversal.
Another way is to add all the substrings to a hashtable and then hash every possible substring (up to the length of the longest substring) as you traverse the input string, as sketched below.
It's been a while since I've worked in Python; my memory of it is that it's slow for implementing this kind of thing. To go the DAWG route, I would probably implement it as a native module and then use it from Python (if possible). Otherwise, I'd do some speed checks to verify first, but would probably go the hashtable route, since Python already ships high-performance hashtables.
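A minimal sketch of that hashtable approach, using a plain Python set as the hashtable (contains_any is a hypothetical helper, not a tuned implementation):
def contains_any(text, substrings):
    subs = set(substrings)
    max_len = max(map(len, subs))
    for i in range(len(text)):
        # only slices up to the longest substring's length can match
        for j in range(i + 1, min(i + max_len, len(text)) + 1):
            if text[i:j] in subs:
                return True
    return False

print(contains_any("VISA ASSESSMENTS", ['ASSESSMENTS', 'KILOBYTE', 'INTERNATIONAL']))  # True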

Grouping strings lexicographically (python)

I have N strings that I want to divide lexicographically into M even-sized buckets (+/- 1 string). Also, N >> M.
The direct way would be to sort all the strings and split the resulting list into the M buckets.
I would like to instead approximate this by routing each string as it is created to a bucket, before the full list is available.
Is there a fast and pythonic way to assign strings to buckets? I'm essentially looking for a string-equivalent of the integer modulo operator. Perhaps a hash that preserves lexicographic order? Is that even possible?
You can sort by the first two characters of a string, or something along those lines.
Let's say M = 100. You would then divide the characters into sqrt(M) regions, each pointing to another sqrt(M) regions. For each string you get, you compare the first character to decide which region to direct the string to, and do the same again for the second character: something like a tree with buckets as leaves and comparisons as nodes. A flattened sketch of this idea follows.
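A minimal sketch, flattening the two-level tree into one bisect lookup over precomputed boundary prefixes (the boundaries here are hypothetical; in practice you would estimate them from a sample of the incoming strings):
import bisect

# 8 hypothetical boundaries -> 9 buckets, ordered lexicographically
boundaries = ['c', 'f', 'i', 'l', 'o', 'r', 'u', 'x']
buckets = [[] for _ in range(len(boundaries) + 1)]

def route(s):
    # bisect picks the bucket whose range contains s, preserving order
    buckets[bisect.bisect(boundaries, s)].append(s)

for word in ['apple', 'mango', 'zebra']:
    route(word)
# buckets[0] == ['apple'], buckets[4] == ['mango'], buckets[8] == ['zebra']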
A hash by definition doesn't preserve any order, and I don't think there is any Pythonic way to do this.
You could just create dictionaries (which are basically hash tables) and keep adding strings to each in round-robin style, but that wouldn't preserve any order.
