Create unique id of fixed length only using given symbols? - python

I am trying to see how I can create a set of unique IDs of a fixed length (say length 12) in Python using a specific subset of the alphanumeric characters. The use case is that these IDs need to be read by people and referred to in printed documents, so I want to avoid the letters L, I, O and the digits 0, 1. I of course need to be able to generate a new ID as needed.
I looked into the UUID function in other answers but wasn't able to find a way to use that function to meet my requirements. I've done a lot of searching, but apologies if this is a duplicate.
Edit: So far I have tried using UUID as described here, and also the hashids function, but could not figure out a way to do it with either. The next best solution I could come up with is to create a list of random strings and check each one against all existing IDs, but that seems woefully inefficient.

For a set of characters to sample you could use string.ascii_uppercase (A-Z) plus string.digits (0-9), but then remove unwanted characters 'LIO01'. From there you can use random.choices to generate a sequence of length k while allowing repeated characters.
import string
import random

def unique_id(size):
    chars = list(set(string.ascii_uppercase + string.digits).difference('LIO01'))
    return ''.join(random.choices(chars, k=size))
>>> unique_id(12)
'HBFXXHWZ8349'
>>> unique_id(12)
'A7W5WK636BYN'
>>> unique_id(12)
'WJ2JBX924NVK'
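Random sampling alone does not guarantee uniqueness, so here is a minimal sketch of the "check against existing IDs" idea from the question. It assumes the issued IDs fit in an in-memory set (the names ALPHABET, new_unique_id and issued_ids are made up for illustration). Set membership is O(1) on average, so the check is cheap, and with 31 symbols and length 12 there are 31**12 (roughly 7.9e17) possible IDs, so retries should be rare.

```python
import random
import string

# 26 letters + 10 digits minus the 5 confusable characters = 31 symbols
ALPHABET = sorted(set(string.ascii_uppercase + string.digits) - set('LIO01'))

def new_unique_id(issued, size=12):
    """Return an ID that is not in `issued`, and record it there."""
    while True:
        candidate = ''.join(random.choices(ALPHABET, k=size))
        if candidate not in issued:  # average O(1) lookup in a set
            issued.add(candidate)
            return candidate

issued_ids = set()  # hypothetical store; persist it if IDs must survive restarts
first = new_unique_id(issued_ids)
second = new_unique_id(issued_ids)
```

If the IDs must also be hard to guess, secrets.choice could be swapped in for random.choices.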

You could use an iterator like itertools.combinations:
import itertools
import string

valid_chars = set(string.ascii_lowercase + string.digits) - set('lio01')
# Probably would want to persist the used values by using some sort of database/file
# instead of this
used = set()
unique_id_generator = itertools.combinations(valid_chars, 12)
generated = "".join(next(unique_id_generator))
while generated in used:
    generated = "".join(next(unique_id_generator))
# Once an unused value has been found, add it to used list (or some sort of database where you can keep track)
used.add(generated)
This generator will continue to produce all possible combinations (without replacement) of all ascii lower case characters and digits excluding the ones you mentioned. If you need this upper case, you can use .upper() and if you want to allow replacement, you can use itertools.combinations_with_replacement.
If 'xyz' is not considered to be the same as 'xzy', take a look at itertools.permutations.
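To make the difference between these iterators concrete, here is a small sketch on a three-letter alphabet (chosen only so the output stays readable). It also shows itertools.product, which is not mentioned above but is the one that yields every fixed-length string, repeats and all orders included.

```python
import itertools

symbols = 'ABC'  # tiny alphabet, for illustration only

def joined(tuples):
    """Turn an iterator of character tuples into a list of strings."""
    return [''.join(t) for t in tuples]

# No repeated characters, one ordering per selection
no_repeat_sorted = joined(itertools.combinations(symbols, 2))
# ['AB', 'AC', 'BC']

# Repeats allowed, still one ordering per selection
with_repeat_sorted = joined(itertools.combinations_with_replacement(symbols, 2))
# ['AA', 'AB', 'AC', 'BB', 'BC', 'CC']

# No repeated characters, every ordering
no_repeat_any_order = joined(itertools.permutations(symbols, 2))
# ['AB', 'AC', 'BA', 'BC', 'CA', 'CB']

# Every string of length 2 over the alphabet
every_string = joined(itertools.product(symbols, repeat=2))
```

Note that iterating over a set gives an arbitrary order, so passing a sorted string or list makes the sequence reproducible between runs.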

I ran into a similar problem, and the simplest solution I could think of is this one:
Answer
from secrets import token_urlsafe
id = ''.join([c for c in token_urlsafe(10) if c not in '-_OI0l'])[:5]
print(id) # 'a3HkR'
Explanation
token_urlsafe(10) URL-safe string built from 10 random bytes (about 14 characters) drawn from [a-z, A-Z, 0-9, -, _]
if c not in '-_OI0l' remove characters you don't want
[:5] Take just 5 from the beginning, if you want 5 for example.
Strengths
Readable
One-liner
Customizable
Can be highly secure if needed
Limitations
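Uniqueness is not guaranteed. You can check it in other ways, or just pick an id long enough that randomness takes care of it for you.
With the 58 characters left after filtering, the above example can create 58^5 = 656 356 768 different ids.
If you remove many characters or you want more characters, you have to make the number in token_urlsafe(number) bigger as well. Otherwise the filtered string can end up shorter than requested (the slice does not raise an IndexError, it just returns fewer characters).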
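One way around the length limitation is to keep drawing tokens until enough allowed characters have been collected. This is only a sketch of that idea; short_id and EXCLUDED are names I made up, and the excluded set is simply the one from the one-liner.

```python
from secrets import token_urlsafe

EXCLUDED = '-_OI0l'  # same characters the one-liner filters out

def short_id(length, excluded=EXCLUDED):
    """Collect allowed characters until `length` of them are available."""
    kept = []
    while len(kept) < length:
        # Each call yields a fresh batch of random URL-safe characters.
        kept.extend(c for c in token_urlsafe(length) if c not in excluded)
    return ''.join(kept[:length])

example = short_id(12)
```

This always returns exactly the requested number of characters, at the cost of being a few lines instead of one.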

Related

enumeration of character sequence "permutations" (python)

I have following problem:
There are n=20 characters in the sequence. For each position there is a predefined list of possible characters which can be 1 to m (where m usually is a single digit).
How can I enumerate all possible permutations efficiently?
Or in essence is there some preexisting library (numpy?) that could do that before I try it myself?
itertools.product seems to offer what I need. I just need to pass it a list of list:
itertools.product(*positions)
where positions is a list of lists (eg which chars at which positions).
In my case the available options for each position are small and often also just 1 so that keeps the number of possibilities in check but might crash your application if too many get generated.
I then build the final string:
results = []
for s in itertools.product(*positions):
    result = ''.join(s)
    results.append(result)
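A small self-contained sketch of the above, with made-up options per position. Since the number of results is just the product of the list lengths, it can be computed up front (here with math.prod, Python 3.8+) before deciding whether to generate everything.

```python
import itertools
import math

# Hypothetical options for each of three positions
positions = [['a', 'b'], ['x'], ['1', '2', '3']]

# Number of sequences that would be generated: 2 * 1 * 3
total = math.prod(len(options) for options in positions)

# itertools.product is lazy; the list is only built here
results = [''.join(chars) for chars in itertools.product(*positions)]
```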

How to call another function's results

def most_frequency_occ(chars,inputString):
    count = 0
    for ind_char in inputString:
        ind_char = ind_char.lower()
        if chars == ind_char:
            count += 1
    return count

def general(inputString):
    maxOccurences = 0
    for chars in inputString:
        most_frequency_occ(chars, inputString)
This is my current code. I'm trying to find the most frequently occurring letter in general. I created another function called most_frequency_occ that counts how often a specific character occurs in the string, but how do I generalize it to find the most frequent letter in a string without specifying a character, using only loops and no built-in string functions?
For example:
print(general('aqweasdaza'))
should print 4 as "a" occurs the most frequently, occurring 4 times.
If I understand your task correctly, I think using a dictionary will be more convenient for you.
# initializing string
text = "Hello world"
# initializing dict of freq
freq = {}
for i in text:
    if i in freq:
        freq[i] += 1
    else:
        freq[i] = 1
# Now, you have the count of every char in this string.
# If you want to extract the max, this step will do it for you:
max_freq_chr = max(freq.values())
There are multiple ways to find the most common letter in a string.
One easy-to-understand and cross-language way of doing this would be:
Initialize an array of 26 integers set to 0.
Go over the letters of your string one by one; if a letter is a B (B=2), increment the second value of the array.
Find the largest value in your array and return the corresponding letter.
Since you are using Python, you could use dictionaries, since that would be less work to implement.
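The array-based steps above could be sketched like this. It assumes only the letters a to z matter and that upper and lower case count as the same letter; the function name most_common_letter is my own.

```python
def most_common_letter(text):
    counts = [0] * 26  # one slot per letter, 'a' .. 'z'
    for ch in text.lower():
        if 'a' <= ch <= 'z':  # skip spaces, digits, punctuation
            counts[ord(ch) - ord('a')] += 1
    # Find the index of the largest count (first one wins on ties)
    best = 0
    for i in range(1, 26):
        if counts[i] > counts[best]:
            best = i
    return chr(ord('a') + best), counts[best]

letter, occurrences = most_common_letter('aqweasdaza')
```

For a string with no letters at all this returns ('a', 0), so a caller may want to treat a zero count as "nothing found".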
A word of caution, it sounds like you are doing a school assignment. If your school has a plagiarism checker that checks the internet, you might be caught for academic dishonesty if you copy paste code from the internet.
The other answers have suggested alternative ways of counting the letters in a string, some of which may be better than what you've come up with on your own. But I think it may be worth answering your question about how to call your most_frequency_occ function from your general function even if the algorithm isn't great, since you'll need to understand how functions work in other contexts.
The thing to understand about function calls is that the call expression will be evaluated to the value returned by the function. In this case, that's the count. Often you may want to assign the return value to a variable so you can reference it multiple times. Here's what that might look like:
count = most_frequency_occ(chars, inputString)
Now you can do a comparison between the count and the previous best count to see if you've just checked the most common letter so far:
def general(inputString):
    maxOccurences = 0
    for chars in inputString:
        count = most_frequency_occ(chars, inputString)
        if count > maxOccurences: # check if chars is more common than the previous best
            maxOccurences = count
    return maxOccurences
One final note: Some of your variable and function names are a bit misleading. That often happens when you're changing your code around from one design to another, but not changing the variable names at the same time. You may want to occasionally reread your code and double check to make sure that the variable names still match what you're doing with them. If not, you should "refactor" your code by renaming the variables to better match their actual uses.
To be specific, your most_frequency_occ function isn't actually finding the most frequent character itself, it's only doing a small step in that process, counting how often a single character occurs. So I'd call it count_char or something similar. The general function might be named something more descriptive like find_most_frequent_character.
And the variable chars (which exists in both functions) is also misleading since it represents a single character, but the name chars implies something plural (like a list or a string that contains several characters). Renaming it to char might be better, as that seems more like a singular name.
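Putting the renaming suggestions together, the two functions might look like the sketch below. I've left out the .lower() call (the question asks to avoid built-in string functions), so this version is case-sensitive, and I've made the outer function return the character as well as its count; both are my assumptions, not part of the original code.

```python
def count_char(char, text):
    """Count how many times the single character `char` occurs in `text`."""
    count = 0
    for candidate in text:
        if candidate == char:
            count += 1
    return count

def find_most_frequent_character(text):
    """Return (character, count) for the most frequent character in `text`."""
    best_char = None
    best_count = 0
    for char in text:
        count = count_char(char, text)
        if count > best_count:  # strictly greater, so the first seen wins ties
            best_char = char
            best_count = count
    return best_char, best_count

result = find_most_frequent_character('aqweasdaza')
```

This rescans the string once per character, so it is quadratic in the length of the input, which is fine for short strings.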

Searching words without diacritics in a sorted list of words

I've been trying to come up with an efficient solution for the following problem. I have a sorted list of words that contain diacritics and I want to be able to do a search without using diacritics. So for example I want to match 'kříž' just using 'kriz'. After a bit of brainstorming I came up with the following and I want to ask you, more experienced (or clever) ones, whether it's optimal or there's a better solution. I'm using Python but the problem is language independent.
First I provide a mapping of those characters that have some diacritical siblings. So in case of Czech:
cz_map = {'a' : ('á',), ... 'e' : ('é', 'ě') ... }
Now I can easily create all variants of an input word. So for 'lama' I get: ['lama', 'láma', 'lamá', 'lámá']. I could already use this to search for words that match any of those permutations, but when it comes to words like 'nepredvidatelny' (unpredictable) one gets 13824 permutations. Even though my laptop has a shiny Intel i5 logo on it, this solution is too naive for my taste.
Here's an improvement I came up with. The dictionary of words I'm using has a variant of binary search for prefix matching (it returns the word at the lowest index with a matching prefix) that is very useful in this case. I start with the first character and check whether any word in the dictionary has it as a prefix. If so, I keep it, and in the next round the following character (and each of its variants) is appended to every kept prefix and tested the same way. This way I'm propagating only those strings that lead to a match. Here's the code:
def dia_search(word, cmap, dictionary):
    prefixes = ['']
    for c in word:
        # each character maps to itself
        subchars = [c]
        # and some diacritical siblings if they exist
        if c in cmap:
            subchars += cmap[c]
        # build a list of matching prefixes for the next round
        prefixes = [p+s for s in subchars
                        for p in prefixes
                        if dictionary.psearch(p+s)>0]
    return prefixes
This technique gives very good results but could it be even better? Or is there a technique that doesn't need the character mapping as in this case? I'm not sure this is relevant but the dictionary I'm using isn't sorted by any collate rules so the sequence is 'a', 'z', 'á' not 'a', 'á', 'z' as one could expect.
Thanks for all comments.
EDIT: I cannot create any auxiliary precomputed database that would be a copy of the original one but without diacritics. Let's say the original database is too big to be replicated.
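For readers who want to try the prefix-pruning idea without the dictionary class, here is a sketch that stands in for psearch with the standard bisect module. It assumes the word list is a plain Python list sorted by code point (which matches the 'a', 'z', 'á' ordering described above); has_prefix and the tiny word list are made up for illustration.

```python
import bisect

def has_prefix(sorted_words, prefix):
    """True if some word in the code-point-sorted list starts with `prefix`."""
    # The first word >= prefix is the only candidate that needs checking.
    i = bisect.bisect_left(sorted_words, prefix)
    return i < len(sorted_words) and sorted_words[i].startswith(prefix)

def dia_search(word, cmap, sorted_words):
    prefixes = ['']
    for c in word:
        # the character itself plus its diacritical siblings, if any
        subchars = [c] + list(cmap.get(c, ()))
        # keep only the extensions that are still a prefix of some word
        prefixes = [p + s for p in prefixes for s in subchars
                    if has_prefix(sorted_words, p + s)]
    return prefixes

# Partial mapping and word list, for illustration only
cz_map = {'r': ('ř',), 'i': ('í',), 'z': ('ž',), 'a': ('á',)}
words = sorted(['kříž', 'lama', 'láma', 'kraj'])
matches = dia_search('kriz', cz_map, words)
```

As in the original, the result is a list of matching prefixes, so a final exact-match check is still needed if only whole words should be returned.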
Using only the standard library (str.maketrans and str.translate) you could do this:
intab = "řížéě" # ...add all the other characters
outtab = "rizee" # and the characters you want them translated to
transtab = str.maketrans(intab, outtab)
strg = "abc kříž def "
print(strg.translate(transtab)) # abc kriz def
This is for Python 3.
For Python 2 you'd need:
from string import maketrans
transtab = maketrans(intab, outtab)
# the rest remains the same
Have a look at Unidecode, which converts diacritics to their closest ASCII equivalents, e.g. unidecode(u'kříž').
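If a third-party package is not an option, a similar effect can be had with the standard unicodedata module and no hand-written character mapping. This is a sketch of a different technique from Unidecode (Unicode decomposition, then dropping combining marks); the function name strip_diacritics is my own.

```python
import unicodedata

def strip_diacritics(text):
    # NFD splits e.g. 'ř' into 'r' followed by a combining caron.
    decomposed = unicodedata.normalize('NFD', text)
    # Drop the combining marks and keep the base characters.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

plain = strip_diacritics('kříž')
```

This only handles characters that decompose into a base letter plus marks. Letters such as 'ł' or 'ø' have no such decomposition and pass through unchanged.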
As has been suggested, what you want to do is translate your Unicode words (containing diacritics) to their closest equivalents in the standard 26-letter alphabet.
One way of implementing this would be to create a second list of words (of the same size of the original) with the corresponding translations. Then you do the query in the translated list, and once you have a match look up the corresponding location in the original list.
Or in case you can alter the original list, you can translate everything in-place and strip duplicates.
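The "second list" idea could be sketched as a lookup table from the translated form back to the original words. Note that this does keep an extra copy of the data in memory, which the question's EDIT rules out for very large lists; the names fold and build_index are made up, and the folding uses unicodedata as a stand-in for whichever translation you choose.

```python
import unicodedata
from collections import defaultdict

def fold(word):
    """Remove combining marks so that 'kříž' becomes 'kriz'."""
    decomposed = unicodedata.normalize('NFD', word)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_index(words):
    """Map each folded form to the original words that produce it."""
    index = defaultdict(list)
    for word in words:
        index[fold(word)].append(word)
    return index

index = build_index(['kříž', 'lama', 'láma'])
hits = index['lama']
```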

Is there a way to generate possible short forms?

Consider the string Building Centre. If asked to abbreviate this to fit a specific number of characters, you and I may choose very different but equally valid representations. For instance, three valid 7 character representations are:
BLD CNT
BLD CTR
BLDNGCT
These are generated by:
Using only existing letters in the string (can't abbreviate using z)
Using them in the order they appear (LBD is not valid since L does not come before B in Building).
Selecting up to as many characters (including spaces) as indicated.
I'm looking to write a breadth-first or depth-first search based algorithm to generate all such short forms for a given string and desired length.
Before I go about writing the script, I am wondering if something similar has already been implemented. If not, how would you suggest I write something like this? Besides itertools, are there any useful libraries?
Yes, this can be beautifully done with itertools:
import itertools
text = 'Building Centre'
length = 7
shorts = [''.join(short) for short in itertools.combinations(text, length)]
print(shorts) # 6435 different versions!
Note that itertools.combinations does indeed preserve the order. You may want to check out the docs.
Edit
If short forms with fewer than length characters should be allowed as well, you can use
shorts = list(itertools.chain(*((''.join(short) for short in itertools.combinations(text, l))
                                for l in range(1, length + 1))))
As stated in the comments, some short forms get generated twice. To fix this, use e.g. shorts = list(set(shorts)).
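The variable-length version and the de-duplication can be combined by collecting straight into a set. A sketch, with short_forms as a made-up name and the text upper-cased so it can be compared with the examples from the question:

```python
import itertools

def short_forms(text, max_length):
    """All distinct order-preserving selections of 1..max_length characters."""
    return {
        ''.join(chars)
        for n in range(1, max_length + 1)
        for chars in itertools.combinations(text, n)
    }

forms = short_forms('Building Centre'.upper(), 7)
```

The three examples from the question ('BLD CNT', 'BLD CTR' and 'BLDNGCT') are all in forms, while 'LBD' is not.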

Most efficient way to check if any substrings in list are in another list of strings

I have two lists, one of words, and another of character combinations. What would be the fastest way to only return the combinations that don't match anything in the list?
I've tried to make it as streamlined as possible, but it's still very slow when it uses 3 characters for the combinations (goes up to 290 seconds for 4 characters, not even going to try 5)
Here's some example code, currently I'm converting all the words to a list, and then searching the string for each list value.
# Sample of stuff
allCombinations = ["a", "aa", "ab", "ac", "ad"]
allWords = ["testing", "accurate"]
# Do the calculations
allWordsJoined = ",".join(allWords)
invalidCombinations = set(i for i in allCombinations if i not in allWordsJoined)
print(invalidCombinations)
# Result: {'aa', 'ab', 'ad'} (set order may vary)
I'm just curious if there's a better way to do this with sets? With a combination of 3 letters, there are 18278 list items to search for, and for 4 letters, that goes up to 475254, so currently my method isn't really fast enough, especially when the word list string is about 1 million characters.
Set.intersection seems like a very useful method if you need the whole string, so surely there must be something similar to search for a substring.
The first thing that comes to mind is that you can optimize the lookup by checking the current combination against combinations that are already known to be "invalid". I.e. if ab is invalid, then any combination starting with ab is invalid too, and there's no point in checking those.
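One way to apply that pruning is to turn it around: build the combinations that do occur, extending only those shorter ones already found in the text, and treat everything else as invalid. This is a sketch under the assumption that combinations are made of lowercase letters only; present_combinations is a made-up name.

```python
import string

def present_combinations(text, alphabet, max_len):
    """Combinations of length 1..max_len that occur somewhere in `text`."""
    present = set()
    frontier = [c for c in alphabet if c in text]
    for length in range(1, max_len + 1):
        present.update(frontier)
        if length < max_len:
            # A longer combination can only occur if its prefix occurs,
            # so only prefixes that were found are extended.
            frontier = [p + c for p in frontier for c in alphabet
                        if p + c in text]
    return present

allCombinations = ["a", "aa", "ab", "ac", "ad"]
allWordsJoined = ",".join(["testing", "accurate"])
present = present_combinations(allWordsJoined, string.ascii_lowercase, 2)
invalidCombinations = {c for c in allCombinations if c not in present}
```

The number of substring checks then depends on how many combinations actually occur in the text rather than on the full 26**n.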
And one more thing: try using
invalidCombinations = set()
for i in allCombinations:
    if i not in allWordsJoined:
        invalidCombinations.add(i)
instead of
invalidCombinations = set(i for i in allCombinations if i not in allWordsJoined)
I'm not sure, but fewer memory allocations may give a small boost when running on real data.
Checking whether a set contains an item is O(1). You would still have to iterate through your list of combinations and compare each one against your set of words, with some exceptions: if no word contains "a", then no word will contain any other combination that includes "a". You can use a tree-like data structure to exploit this.
You shouldn't convert your word list to a string, but rather to a set. You should get O(N), where N is the number of combinations.
Also, I like Python, but it isn't the fastest of languages. If this is the only task you need to do, and it needs to be very fast, and you can't improve the algorithm, you might want to check out other languages. You should be able to very easily prototype something to get an idea of the difference in speed for different languages.
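A set of whole words cannot answer substring questions by itself, so to get the O(1) lookups mentioned above the set would have to hold the substrings. Here is a sketch of that reading of the suggestion (substrings_up_to is a made-up name): every substring up to the longest combination length is collected once, after which each combination is a single set lookup.

```python
def substrings_up_to(words, max_len):
    """Every substring of length 1..max_len of every word."""
    seen = set()
    for word in words:
        for start in range(len(word)):
            # Slicing past the end is harmless, but the range keeps it exact.
            for end in range(start + 1, min(start + max_len, len(word)) + 1):
                seen.add(word[start:end])
    return seen

allCombinations = ["a", "aa", "ab", "ac", "ad"]
allWords = ["testing", "accurate"]

longest = max(len(c) for c in allCombinations)
present = substrings_up_to(allWords, longest)
invalidCombinations = {c for c in allCombinations if c not in present}
```

Building the set costs roughly (total characters) x (maximum length) steps, which for a 1 million character word list and length 4 is a few million set insertions.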
