Is there a way to generate possible short forms? - python

Consider the string Building Centre. If asked to abbreviate this to fit a specific number of characters, you and I may choose very different but equally valid representations. For instance, three valid 7-character representations are:
BLD CNT
BLD CTR
BLDNGCT
These are generated by:
Using only existing letters in the string (can't abbreviate using z)
Using them in the order they appear (LBD is not valid since L does not come before B in Building).
Selecting up to as many characters (including spaces) as indicated.
I'm looking to write a breadth-first or depth-first search based algorithm to generate all such short forms for a given string and desired length.
Before I go about writing the script, I am wondering if something similar has already been implemented. If not, how would you suggest I write something like this? Besides itertools, are there any useful libraries?

Yes, this can be beautifully done with itertools:
import itertools
text = 'Building Centre'
length = 7
shorts = [''.join(short) for short in itertools.combinations(text, length)]
print(shorts) # 6435 different versions!
Note that itertools.combinations does indeed preserve the order. You may want to check out the docs.
Edit
If short forms with fewer than length characters should be allowed as well, you can use
shorts = list(itertools.chain(*((''.join(short) for short in itertools.combinations(text, l))
for l in range(1, length + 1))))
As stated in the comments, some short forms get generated twice. To fix this, use e.g. shorts = list(set(shorts)).
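Putting those pieces together, here is a minimal sketch (using the question's example string) that generates all unique short forms up to the target length; the helper name short_forms is made up:

```python
import itertools

def short_forms(text, length):
    # All unique subsequences of `text` with between 1 and `length` characters,
    # preserving the original character order; a set removes duplicates.
    seen = set()
    for l in range(1, length + 1):
        for combo in itertools.combinations(text, l):
            seen.add(''.join(combo))
    return seen

forms = short_forms('Building Centre', 7)
print('BldngCe' in forms, 'lB' in forms)  # True False
```

'BldngCe' picks characters in order, so it is valid; 'lB' reverses the order of l and B, so it is not.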

Related

enumeration of character sequence "permutations" (python)

I have following problem:
There are n=20 characters in the sequence. For each position there is a predefined list of possible characters, which can contain 1 to m entries (where m is usually a single digit).
How can I enumerate all possible permutations efficiently?
Or in essence is there some preexisting library (numpy?) that could do that before I try it myself?
itertools.product seems to offer what I need. I just need to pass it a list of lists:
itertools.product(*positions)
where positions is a list of lists (eg which chars at which positions).
In my case the available options for each position are small, and often just 1, so that keeps the number of possibilities in check, but it might crash your application if too many get generated.
I then build the final string:
results = []
for s in itertools.product(*positions):
    result = ''.join(s)
    results.append(result)
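As a small self-contained illustration (the position lists here are invented example data):

```python
import itertools

# Hypothetical per-position options: position 0 can be 'a' or 'b',
# position 1 only 'x', position 2 'y' or 'z'.
positions = [['a', 'b'], ['x'], ['y', 'z']]

# itertools.product yields one tuple per possible combination of positions.
results = [''.join(s) for s in itertools.product(*positions)]
print(results)  # ['axy', 'axz', 'bxy', 'bxz']
```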

Create unique id of fixed length only using given symbols?

I am trying to see how I can create a set of unique IDs of a fixed length (say length 12) in Python which uses a specific subset of all alphanumeric characters. The use case here is that these IDs need to be read by people and referred to in printed documents, so I am trying to avoid using the characters L, I, O and the numbers 0, 1. I of course need to be able to generate a new ID as needed.
I looked into the UUID function in other answers but wasn't able to find a way to use that function to meet my requirements. I've done a lot of searching, but apologies if this is duplicate.
Edit: So far I tried using UUID as described here, and also the hashids function, but could not figure out a way to do it using them. The next best solution I could come up with is to create a list of random strings and check each new one against all existing IDs. But that seems woefully inefficient.
For a set of characters to sample you could use string.ascii_uppercase (A-Z) plus string.digits (0-9), but then remove unwanted characters 'LIO01'. From there you can use random.choices to generate a sequence of length k while allowing repeated characters.
import string
import random
def unique_id(size):
    chars = list(set(string.ascii_uppercase + string.digits).difference('LIO01'))
    return ''.join(random.choices(chars, k=size))
>>> unique_id(12)
'HBFXXHWZ8349'
>>> unique_id(12)
'A7W5WK636BYN'
>>> unique_id(12)
'WJ2JBX924NVK'
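Since random.choices can repeat a previously issued ID, one way to guarantee uniqueness is a small wrapper that tracks already-issued IDs. A sketch (the helper name and the in-memory used set are assumptions; for real use, persist the set somewhere):

```python
import string
import random

# Allowed characters: A-Z and 0-9 minus the easily-confused L, I, O, 0, 1.
chars = sorted(set(string.ascii_uppercase + string.digits).difference('LIO01'))

used = set()  # in practice, persist this between runs (file/database)

def new_unique_id(size=12):
    # Draw random IDs until we hit one we haven't issued yet.
    while True:
        candidate = ''.join(random.choices(chars, k=size))
        if candidate not in used:
            used.add(candidate)
            return candidate

ids = [new_unique_id() for _ in range(5)]
print(ids)
```

With 31 allowed characters and length 12 the collision probability is tiny, so the retry loop almost never triggers.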
You could use an iterator like itertools.combinations
import itertools
import string
valid_chars = set(string.ascii_lowercase + string.digits) - set('lio01')
# Probably would want to persist the used values by using some sort of database/file
# instead of this
used = set()
unique_id_generator = itertools.combinations(valid_chars, 12)
generated = "".join(next(unique_id_generator))
while generated in used:
    generated = "".join(next(unique_id_generator))
# Once an unused value has been found, add it to used list (or some sort of database where you can keep track)
used.add(generated)
This generator will continue to produce all possible combinations (without replacement) of all ASCII lowercase characters and digits, excluding the ones you mentioned. If you need this in upper case, you can use .upper(), and if you want to allow replacement, you can use itertools.combinations_with_replacement.
If 'xyz' is not considered to be the same as 'xzy', take a look at itertools.permutations.
I bumped into a similar problem and the simplest solution I could think of is this one:
Answer
from secrets import token_urlsafe
id = ''.join([c for c in token_urlsafe(10) if c not in '-_OI0l'])[:5]
print(id) # 'a3HkR'
Explanation
token_urlsafe(10) String with 10 random chars from [a-z, A-Z, 0-9, -, _]
if c not in '-_OI0l' remove characters you don't want
[:5] Take just 5 from the beginning, if you want 5 for example.
Strengths
Readable
One-liner
Customizable
Can be highly secure if needed
Limitations
Uniqueness is not guaranteed. You can check the uniqueness in other ways, or just pick as long an id as needed so that randomness takes care of that for you.
The above example can create 459 165 024 different ids.
If you remove many characters, or you want more characters, you have to make the number in token_urlsafe(number) bigger as well, or the filtered string may come out shorter than the length you slice.

Scoring word similarity between arbitrary text

I have a list of over 500 very important, but arbitrary, strings. They look like:
list_important_codes = ['xido9','uaid3','frps09','ggix21']
What I know
*Casing is not important, but all other characters must match exactly.
*Every string starts with 4 alphabetical characters, and ends with either one or two numerical characters.
*I have a list of about 100,000 strings, list_recorded_codes, that were hand-typed and should match list_important_codes exactly, but about 10,000 of them don't. Because these strings were typed manually, the incorrect strings are usually only about 1 character off (errors such as: an added space, two letters switched around, "01" instead of "1", etc.)
What I need to do
I need to iterate through list_recorded_codes and find each entry's match (exact or closest) within list_important_codes.
What I tried
I spent about 10 hours trying to manually program a way to fix each word, but it seems impractical and incredibly tedious. Not to mention, when my list doubles in size at a later date, I would have to go through that manual process again.
The solution I think I need, and the expected output
I'm hoping that Python's NLTK can efficiently 'score' these arbitrary terms to find a 'best score'. For example, if the word in question is inputword = "gdix88", and that word is compared as score(inputword, "gdox89") = .84 and score(inputword, "sudh88") = .21, my expected output would be highscore = .84, highscoreword = 'gdox89'.
for manually_entered_text in ['xido9', 'uaid3', 'frp09', 'ggix21']:
    get_highest_score_from_important_words()  # returns word_with_highest_score
    manually_entered_text = word_with_highest_score
I am also willing to use a different set of tools to fix this issue if needed. but also, the simpler the better! Thank you!
The 'score' you are looking for is called an edit distance. There is quite a lot of literature and algorithms available - easy to find, but only after you know the proper term :)
See the corresponding wikipedia article.
The nltk package provides an implementation of the so-called Levenshtein edit-distance:
from nltk.metrics.distance import edit_distance
if __name__ == '__main__':
    print(edit_distance("xido9", "xido9 "))
    print(edit_distance("xido9", "xido8"))
    print(edit_distance("xido9", "xido9xxx"))
    print(edit_distance("xido9", "xido9"))
The results are 1, 1, 3 and 0 in this case.
Here is the documentation of the corresponding nltk module
There are more specialized versions of this score that take into account how frequent various typing errors are (for example, 'e' instead of 'r' might occur quite often because the keys are next to each other on a QWERTY keyboard).
But classic Levenshtein is where I would start.
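If you'd rather not depend on nltk, the classic Levenshtein distance is short enough to implement directly. Here is a sketch of the best-match loop the question describes (levenshtein and best_match are made-up helper names):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance: insertions, deletions and
    # substitutions all cost 1. Keeps only one row of the DP table at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

list_important_codes = ['xido9', 'uaid3', 'frps09', 'ggix21']

def best_match(typed, candidates):
    # Casing is not important per the question, so compare lowercased.
    return min(candidates, key=lambda c: levenshtein(typed.lower(), c.lower()))

print(best_match('frp09', list_important_codes))  # frps09
```

'frp09' is one insertion away from 'frps09', so that code wins over the other candidates.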
You could apply a dynamic programming approach to this problem. Once you have your scoring matrix, your alignment_matrix, and your local and global alignment functions set up, you could iterate through list_important_codes and find the highest scoring alignment in list_recorded_codes. Here is a project I did for DNA sequence alignment: DNA alignment. You can easily adapt it to your problem.

Most efficient way to check if any substrings in list are in another list of strings

I have two lists, one of words, and another of character combinations. What would be the fastest way to only return the combinations that don't match anything in the list?
I've tried to make it as streamlined as possible, but it's still very slow when it uses 3 characters for the combinations (it goes up to 290 seconds for 4 characters; I'm not even going to try 5).
Here's some example code, currently I'm converting all the words to a list, and then searching the string for each list value.
#Sample of stuff
allCombinations = ["a","aa","ab","ac","ad"]
allWords = ["testing", "accurate" ]
#Do the calculations
allWordsJoined = ",".join( allWords )
invalidCombinations = set( i for i in allCombinations if i not in allWordsJoined )
print(invalidCombinations)
#Result: {'aa', 'ab', 'ad'}
I'm just curious if there's a better way to do this with sets? With a combination of 3 letters, there are 18278 list items to search for, and for 4 letters, that goes up to 475254, so currently my method isn't really fast enough, especially when the word list string is about 1 million characters.
Set.intersection seems like a very useful method if you need the whole string, so surely there must be something similar to search for a substring.
The first thing that comes to mind is that you can optimize the lookup by checking the current combination against combinations that are already "invalid": if ab is invalid, then any longer combination starting with ab will be invalid too, so there's no point in checking those.
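That pruning idea can be sketched as follows (toy data; only the immediate prefix is checked, which suffices because combinations are processed shortest-first):

```python
import itertools

allWords = ["testing", "accurate"]
haystack = ",".join(allWords)

# Toy alphabet and all strings of length 1 or 2 over it (made-up example data).
alphabet = "abcd"
combos = [''.join(p) for n in (1, 2) for p in itertools.product(alphabet, repeat=n)]

invalid = set()
for combo in sorted(combos, key=len):  # shortest first, so prefixes are decided first
    # If the combination minus its last character is already invalid, the longer
    # string cannot occur either, so we skip the (expensive) substring search.
    if combo[:-1] in invalid or combo not in haystack:
        invalid.add(combo)

print('b' in invalid, 'ac' in invalid)  # True False
```

Since 'b' never occurs in the words, every two-letter combination starting with 'b' is marked invalid without searching the haystack at all.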
And one more thing: try using
invalidCombinations = set()
for i in allCombinations:
    if i not in allWordsJoined:
        invalidCombinations.add(i)
instead of
invalidCombinations = set(i for i in allCombinations if i not in allWordsJoined)
I'm not sure, but fewer memory allocations can be a small boost on a real data run.
Seeing if a set contains an item is O(1). You would still have to iterate through your list of combinations to compare with your original set of words (with some exceptions: if your word doesn't have "a", it's not going to have any other combinations that contain "a"; you can use a tree-like data structure to exploit this).
You shouldn't convert your word list to a string, but rather to a set. You should get O(N), where N is the number of combinations.
Also, I like Python, but it isn't the fastest of languages. If this is the only task you need to do, and it needs to be very fast, and you can't improve the algorithm, you might want to check out other languages. You should be able to very easily prototype something to get an idea of the difference in speed for different languages.
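One way to turn the set idea into code is to pre-build the set of every substring that actually occurs in any word, up to the maximum combination length, and then test each combination with an O(1) set lookup. A sketch using the question's sample data:

```python
allCombinations = ["a", "aa", "ab", "ac", "ad"]
allWords = ["testing", "accurate"]

# Build the set of every substring (up to the max combination length) that
# actually occurs in any word. Set membership tests are O(1).
max_len = max(len(c) for c in allCombinations)
present = set()
for word in allWords:
    for length in range(1, max_len + 1):
        for start in range(len(word) - length + 1):
            present.add(word[start:start + length])

invalidCombinations = {c for c in allCombinations if c not in present}
print(sorted(invalidCombinations))  # ['aa', 'ab', 'ad']
```

A side benefit over joining the words with "," is that substrings can no longer falsely match across a word boundary.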

How to work with very long strings in Python?

I'm tackling Project Euler's problem 220 (it looked easy in comparison to some of the others - thought I'd try a higher-numbered one for a change!)
So far I have:
D = "Fa"

def iterate(D, num):
    for i in range(0, num):
        D = D.replace("a", "A")
        D = D.replace("b", "B")
        D = D.replace("A", "aRbFR")
        D = D.replace("B", "LFaLb")
    return D

instructions = iterate("Fa", 50)
print(instructions)
Now, this works fine for low values, but when you push the number of repetitions higher you just get a "Memory error". Can anyone suggest a way to overcome this? I really want a string/file that contains instructions for the next step.
The trick is in noticing which patterns emerge as you run the string through each iteration. Try evaluating iterate(D,n) for n between 1 and 10 and see if you can spot them. Also feed the string through a function that calculates the end position and the number of steps, and look for patterns there too.
You can then use this knowledge to simplify the algorithm to something that doesn't use these strings at all.
Python strings are not going to be the answer to this one. Strings are stored as immutable arrays, so each one of those replacements creates an entirely new string in memory. Not to mention, the set of instructions after 10^12 steps will be at least 1TB in size if you store them as characters (and that's with some minor compressions).
Ideally, there should be a way to mathematically (hint, there is) generate the answer on the fly, so that you never need to store the sequence.
Just use the string as a guide to determine a method which creates your path.
If you think about how many "a" and "b" characters there are in D(0), D(1), etc, you'll see that the string gets very long very quickly. Calculate how many characters there are in D(50), and then maybe think again about where you would store that much data. I make it 4.5*10^15 characters, which is 4500 TB at one byte per char.
Come to think of it, you don't have to calculate - the problem tells you there are 10^12 steps at least, which is a terabyte of data at one byte per character, or quarter of that if you use tricks to get down to 2 bits per character. I think this would cause problems with the one-minute time limit on any kind of storage medium I have access to :-)
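That size estimate can be checked without ever building the string, by tracking only the character counts through the rewrite rules. A sketch:

```python
# Track (length, a-count, b-count) instead of the string itself.
# a -> "aRbFR" and b -> "LFaLb": each yields one new 'a' and one new 'b',
# and grows the string by 4 characters; F, L, R are left untouched.
length, a, b = 2, 1, 0  # D(0) = "Fa"
for _ in range(50):
    length, a, b = length + 4 * (a + b), a + b, a + b
print(length)  # 4503599627370494 == 2**52 - 2, about 4.5 * 10**15
```

The recurrence confirms the 4.5*10^15 figure: the a and b counts double each step, so the length is 2^(n+2) - 2 after n steps.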
Since you can't materialize the string, you must generate it. If you yield the individual characters instead of returning the whole string, you might get it to work.
def repl220(string):
    for c in string:
        if c == 'a':
            yield "aRbFR"
        elif c == 'b':
            yield "LFaLb"
        else:
            yield c
Something like that will do replacement without creating a new string.
Now, of course, you need to call it recursively, and to the appropriate depth. So, each yield isn't just a yield, it's something a bit more complex.
Trying not to solve this for you, so I'll leave it at that.
Just as a word of warning be careful when using the replace() function. If your strings are very large (in my case ~ 5e6 chars) the replace function would return a subset of the string (around ~ 4e6 chars) without throwing any errors.
You could treat D as a byte stream file.
Something like:-
seedfile = open('D1.txt', 'w')
seedfile.write("Fa")
seedfile.close()
n = 0
while (n
Warning: totally untested.
