String combinations that include a substring over a finite alphabet in Python

Let's assume we have an alphabet of 20 letters, and that we have the substring CCAY. I would like to calculate the number of words of length N letters that include this specific substring.
To be more precise, if N = 6 I would like the following combinations: CCAYxx, xCCAYx, xxCCAY, where x is any letter of the alphabet. If N = 7 the combinations adjust as follows: CCAYxxx, xCCAYxx, xxCCAYx, xxxCCAY, and so on.
Also, I can think of a pitfall when the substring consists of repetitions of a single letter of the alphabet, e.g. CCCC, in which case for N = 6 the string CCCCCC should not be counted multiple times.
I would appreciate any help or guidance on how to approach this problem. Any sample code in Python would also be highly appreciated.

You said brute force is okay, so here we go:
import itertools

alphabet = 'abc'
substring = 'ccc'
n = 7
res = set()
for combination in itertools.product(alphabet, repeat=n - len(substring)):
    # take the cartesian product of the alphabet such that we end up
    # with a total length of 'n' for the final combination
    for idx in range(len(combination) + 1):
        res.add(''.join((*combination[:idx], substring, *combination[idx:])))
print(len(res))
Prints:
295
For a substring with no repetitions, like abc, I get 396 as the result, so I assume it covers the corner case appropriately.
That this is inefficient enough to make mathematicians weep goes without saying, but as long as your problems are small in length it should get the job done.
Analytical approach
The maximum number of combinations is bounded by the number of unique ordered fillings of the free positions. With len(alphabet) = k symbols and m = n - len(substring) free positions, there are k^m fillings, and the substring can be inserted at any of the m + 1 points, which leads to a total maximum of (m+1)*k^m. The latter only holds if no two insertions produce identical final combinations, which is what makes this problem hard to compute analytically. So, the vague answer is your result will be somewhere between k^m and (m+1)*k^m.
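As an aside that is not part of this answer's approach: if you need the exact number, one standard route is complement counting, i.e. all k^n strings minus those that avoid the substring, tracked with a DP over KMP prefix states. The sketch below (function name and structure are mine) reproduces the brute-force result:

def count_containing(alphabet, sub, n):
    m = len(sub)
    # KMP failure function of the substring
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k and sub[i] != sub[k]:
            k = fail[k - 1]
        if sub[i] == sub[k]:
            k += 1
        fail[i] = k

    def step(s, c):
        # automaton transition: s characters of sub already matched, read c
        while s and c != sub[s]:
            s = fail[s - 1]
        return s + 1 if c == sub[s] else 0

    # avoid[s] = number of strings of the current length sitting in state s < m
    avoid = [0] * m
    avoid[0] = 1
    for _ in range(n):
        new = [0] * m
        for s in range(m):
            if avoid[s]:
                for c in alphabet:
                    t = step(s, c)
                    if t < m:  # the substring has not appeared yet
                        new[t] += avoid[s]
        avoid = new
    return len(alphabet) ** n - sum(avoid)

print(count_containing('abc', 'ccc', 7))  # 295, matching the brute force above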
If you want to count the number of identical final combinations that include the substring, you can do so by counting the number of repetitions of the substring within a preliminary product:
n = 6
pre_prod = 'abab'
sub = 'ab'
pre_prods = ['ababab', 'aabbab', 'ababab', 'abaabb', 'ababab']
prods = ['ababab', 'aabbab', 'abaabb']
# len(pre_prods) - pre_prod.count(sub) -> len(prods), i.e. 5 - 2 = 3
I will see if I can find a formula for that... sometime soon.

How to call an index value from an itertools permutation without converting it to a list?

I need to create all combinations of these characters:
'0123456789qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM. '
That are 100 letters long, such as:
'0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001'
I'm currently using this code:
import itertools

k_c = '0123456789qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM. '
babel = itertools.product(k_c, repeat=100)
This code works, but I need to be able to return the combination at a certain index; however, itertools.product does not support indexing, turning the product into a list yields a MemoryError, and iterating through the product until it reaches a certain index takes too long for values over a billion.
Thanks for any help
With 64 characters and 100 letters there will be 64^100 combinations. For each value of the first letter, there will be 64^99 combinations of the remaining letters, then 64^98, 64^97, and so on.
This means that your Nth combination can be expressed as N in base 64 where each "digit" represents the index of the letter in the string.
An easy solution would be to build the string recursively, taking the character at each position from the remainder of N modulo the alphabet size and recursing on the quotient for the rest of the string:
chars = '0123456789qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM. '

def comboNumber(n, size=100):
    if size == 1:
        return chars[n]
    return comboNumber(n // len(chars), size - 1) + chars[n % len(chars)]
output:
c = comboNumber(123456789000000000000000000000000000000000000123456789)
print(c)
# 000000000000000000000000000000000000000000000000000000000000000000000059.90jDxZuy6drpQdWATyZ8007dNJs
c = comboNumber(1083232247617211325080159061900470944719547986644358934)
print(c)
# 0000000000000000000000000000000000000000000000000000000000000000000000Python.Person says Hello World
Conversely, if you want to know at which combination index a particular string is located, you can compute the base64 value by combining the character index (digit) at each position:
s = "Python.Person says Hello World" # leading zeroes are implied
i = 0
for c in s:
i = i*len(chars)+chars.index(c)
print(i) # 1083232247617211325080159061900470944719547986644358934
You are now that much closer to understanding base64 encoding, which is the same thing applied to 24-bit numbers coded over 4 characters (i.e. 3 binary bytes --> 4 alphanumeric characters), or any variant thereof.
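For comparison, here is that 3-bytes-to-4-characters step done with the standard library (this snippet is my addition, not part of the answer):

import base64
print(base64.b64encode(b'Man'))  # b'TWFu': 24 bits re-coded as 4 characters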

How to calculate the number of possible sets of strings with elements of different lengths?

I need to write an algorithm that calculates the number of possible strings given some restrictions:
The strings must have exactly N characters;
The strings use only X different letters;
The strings can't contain any of Y prohibited digraphs;
For example, for N = 3, X = 3 and Y = 6:
The string: _ _ _
My set of letters: {a, b, c}
Set of prohibited digraphs: {aa, bb, cc, ab, ac, bc}
#The proper result is 1, but I'm getting -9
So far, my code only works properly when the string has a length of 2. I'm being unsuccessful in making an equation that covers all cases. I can't write a brute-force algorithm because the string length can be any value lower than 10^9.
That's my code:
def arrangements(sizeOfSet, numberOfElements, isDigraph):
    if isDigraph:
        # sizeOfSet is reduced by 1 because a digraph occupies 2 spaces
        return pow(numberOfElements, sizeOfSet - 1)
    return pow(numberOfElements, sizeOfSet)
And then I subtract the arrangements of the prohibited digraph set from the arrangements of my set of letters.
This is kind of a comment, but it is too big for one.
Unfortunately I don't think you can solve this problem as an algorithm with N, X and Y as its input because that data is not enough. You need the actual set of digraphs rather than its size.
Consider two similar problems. Both have an alphabet of 2 letters (a, b), N = 3 and a single prohibited digraph, but the prohibited digraphs are different.
Problem #1
We prohibit aa. So all the allowed triplets are:
aba
abb
bab
bba
bbb
So there are 5 solutions.
Problem #2
We prohibit ab. So all the allowed triplets are:
aaa
baa
bba
bbb
There are only 4 solutions.
This happens because the digraphs aa and ab interact with strings differently. The digraph aa can overlap itself: the triplet aaa contains aa twice (at positions 1-2 and 2-3), so prohibiting aa filters out only 3 triplets (aaa, aab, baa), while prohibiting ab filters out 4 (aab, aba, abb, bab); effectively one fewer triplet is filtered out in the first problem.
What I think could work is dynamic programming over the length of the strings, keeping track of the current last character.
Update (sketch of the algorithm)
Let's try to solve it as a dynamic programming problem over the length of the string. What we want to keep as the state is the number of valid strings ending with each character (so an array of size X, or a dictionary of the same size). If we find a way to update such a state for each next string size, then when we get to N we can just sum the values over the array and that will be our final answer.
Updating the state is easy. Let's consider what happens when we add a new character to a valid string of length n-1. We get a new valid string of length n unless the last pair is one of the prohibited digraphs. So building the new state is easy (here is Python-style pseudo-code):
for prev_last_char in chars:
    for new_last_char in chars:
        if (prev_last_char, new_last_char) not in prohibited:
            new_state[new_last_char] += old_state[prev_last_char]
Obviously the initial state is an array (or dictionary) filled with 1 for every character.
Assuming O(1) access time for prohibited, which should be achievable with a hash-based set or dictionary, the complexity of one step is O(X^2). Thus the total algorithm complexity is O(X^2*N).
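Here is a minimal runnable sketch of that DP (my own illustration; the function name and example inputs are made up):

def count_strings(chars, prohibited, n):
    # state[c] = number of valid strings of the current length ending in c
    state = {c: 1 for c in chars}  # every single character is a valid string
    for _ in range(n - 1):
        new_state = {c: 0 for c in chars}
        for prev in chars:
            for last in chars:
                if prev + last not in prohibited:
                    new_state[last] += state[prev]
        state = new_state
    return sum(state.values())

print(count_strings('ab', {'aa'}, 3))  # 5, as in problem #1
print(count_strings('ab', {'ab'}, 3))  # 4, as in problem #2

For N near the 10^9 bound mentioned in the question, the same update can be phrased as raising an X-by-X transition matrix to the N-th power by repeated squaring, bringing the cost down to O(X^3 * log N), but that is beyond this sketch.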

Fastest way to sort string to match second string - only adjacent swaps allowed

I want to get the minimum number of letter-swaps needed to convert one string to match a second string. Only adjacent swaps are allowed.
Inputs are: length of strings, string_1, string_2
Some examples:
Length | String 1 | String 2 | Output
-------+----------+----------+-------
3 | ABC | BCA | 2
7 | AABCDDD | DDDBCAA | 16
7 | ZZZAAAA | ZAAZAAZ | 6
Here's my code:
def letters(number, word_1, word_2):
    result = 0
    while word_1 != word_2:
        index_of_letter = word_1.find(word_2[0])
        result += index_of_letter
        word_1 = word_1.replace(word_2[0], '', 1)
        word_2 = word_2[1:]
    return result
It gives the correct results, but the calculation should stay under 20 seconds.
Here are two sets of input data (1 000 000 characters long strings): https://ufile.io/8hp46 and https://ufile.io/athxu.
On my setup the first one is executed in around 40 seconds and the second in 4 minutes.
How to calculate the result in less than 20 seconds?
@KennyOstrom's answer is 90% there. The inversion count is indeed the right angle to look at this problem.
The only bit that is missing is that we need a "relative" inversion count, meaning the number of inversions not to get to normal sort order but to the other word's order. We therefore need to compute the permutation that stably maps word1 to word2 (or the other way round), and then compute the inversion count of that. Stability is important here, because obviously there will be lots of nonunique letters.
Here is a numpy implementation that takes only a second or two for the two large examples you posted. I did not test it extensively, but it does agree with @trincot's solution on all test cases. For the two large pairs it finds 1819136406 and 480769230766.
import numpy as np

_, word1, word2 = open("lit10b.in").read().split()

word1 = np.frombuffer(word1.encode('utf8')
                      + (((1 << len(word1).bit_length()) - len(word1)) * b'Z'),
                      dtype=np.uint8)
word2 = np.frombuffer(word2.encode('utf8')
                      + (((1 << len(word2).bit_length()) - len(word2)) * b'Z'),
                      dtype=np.uint8)
n = len(word1)

o1 = np.argsort(word1, kind='mergesort')
o2 = np.argsort(word2, kind='mergesort')
o1inv = np.empty_like(o1)
o1inv[o1] = np.arange(n)
order = o2[o1inv]

sum_ = 0
for i in range(1, len(word1).bit_length()):
    order = np.reshape(order, (-1, 1 << i))
    oo = np.argsort(order, axis=-1, kind='mergesort')
    ioo = np.empty_like(oo)
    ioo[np.arange(order.shape[0])[:, None], oo] = np.arange(1 << i)
    order[...] = order[np.arange(order.shape[0])[:, None], oo]
    hw = 1 << (i - 1)
    sum_ += ioo[:, :hw].sum() - order.shape[0] * (hw - 1) * hw // 2
print(sum_)
Your algorithm runs in O(n^2) time:
The find() call will take O(n) time
The replace() call will create a complete new string which takes O(n) time
The outer loop executes O(n) times
As others have stated, this can be solved by counting inversions using merge sort, but in this answer I try to stay close to your algorithm, keeping the outer loop and result += index_of_letter, but changing the way index_of_letter is calculated.
The improvement can be done as follows:
preprocess the word_1 string and note the first position of each distinct letter in word_1 in a dict keyed by these letters. Link each letter with its next occurrence. I think it is most efficient to create one list for this, having the size of word_1, where at each index you store the index of the next occurrence of the same letter. This way you have a linked list for each distinct letter. This preprocessing can be done in O(n) time, and with it you can replace the find call with a O(1) lookup. Every time you do this, you remove the matched letter from the linked list, i.e. the index in the dict moves to the index of the next occurrence.
The previous change will give the absolute index, not taking into account the removals of letters that you have in your algorithm, so this will give wrong results. To solve that, you can build a binary tree (also in preprocessing), where each node represents an index in word_1, and which gives the actual number of non-deleted letters preceding a given index (including itself as well if not deleted yet). The nodes in the binary tree never get deleted (that might be an idea for a variant solution), but the counts get adjusted to reflect a deletion of a character. At most O(log n) nodes need to get a decremented value upon such a deletion. But apart from that no string would be rebuilt like with replace. This binary tree could be represented as a list, corresponding to nodes in in-order sequence. The values in the list would be the numbers of non-deleted letters preceding that node (including itself).
[Figure of the initial binary tree omitted.] The numbers in the nodes reflect the number of nodes at their left side, including themselves. They are stored in the numLeft list. Another list, parent, precalculates at which indexes the parents are located.
The actual code could look like this:
def letters(word_1, word_2):
    size = len(word_1)  # No need to pass size as argument
    # Create a binary tree for word_1, organised as a list
    # in in-order sequence, and with the values equal to the number of
    # non-matched letters in the range up to and including the current index:
    treesize = (1 << size.bit_length()) - 1
    numLeft = [(i >> 1 ^ ((i + 1) >> 1)) + 1 for i in range(0, treesize)]
    # Keep track of parents in this tree (could probably be simpler, I welcome comments):
    parent = [(i & ~((i ^ (i + 1)) + 1)) | (((i ^ (i + 1)) + 1) >> 1) for i in range(0, treesize)]
    # Create a linked list for each distinct character
    next = [-1] * size
    head = {}
    for i in range(len(word_1) - 1, -1, -1):  # go backwards
        c = word_1[i]
        # Add index at front of the linked list for this character
        if c in head:
            next[i] = head[c]
        head[c] = i
    # Main loop counting number of swaps needed for each letter
    result = 0
    for i, c in enumerate(word_2):
        # Extract next occurrence of this letter from linked list
        j = head[c]
        head[c] = next[j]
        # Get number of preceding characters with a binary tree lookup
        p = j
        index_of_letter = 0
        while p < treesize:
            if p >= j:  # On or at right?
                numLeft[p] -= 1  # Register that a letter has been removed at left side
            if p <= j:  # On or at left?
                index_of_letter += numLeft[p]  # Add the number of left-side letters
            p = parent[p]  # Walk up the tree
        result += index_of_letter
    return result
This runs in O(n log n), where the log n factor comes from the upward walk in the binary tree.
I tested on thousands of random inputs, and the above code produces the same results as your code in all cases. But... it runs a lot faster on the larger inputs.
I am going by the assumption that you just want to find the number of swaps, quickly, without needing to know what exactly to swap.
Google how to count inversions. It is often taught with merge sort. Several of the results are on Stack Overflow, like Merge sort to count split inversions in Python.
The inversion count is the number of adjacent swaps needed to get to a sorted string.
Count the inversions in string 1.
Count the inversions in string 2.
Error edited out here, see correction in correct answer. I would normally just delete a wrong answer but this answer is referenced in correct answer.
It makes sense, and it happens to work for all three of your small test cases, so I'm going to just assume this is the answer you want.
Using some code that I happen to have lying around from retaking some algorithms classes on free online classes (for fun):
print(week1.count_inversions('ABC'), week1.count_inversions('BCA'))
print(week1.count_inversions('AABCDDD'), week1.count_inversions('DDDBCAA'))
print(week1.count_inversions('ZZZAAAA'), week1.count_inversions('ZAAZAAZ'))
Output:
0 2
4 20
21 15
That lines up with the values you gave above: 2, 16, and 6.
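The week1.count_inversions helper is not shown here; a minimal merge-sort-based version might look like this sketch (my own, not the answer's code):

def count_inversions(seq):
    # Count inversions while performing a merge sort on a copy of seq.
    def sort_count(a):
        if len(a) < 2:
            return a, 0
        mid = len(a) // 2
        left, lc = sort_count(a[:mid])
        right, rc = sort_count(a[mid:])
        merged, i, j, split = [], 0, 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
                split += len(left) - i  # right[j] jumps over the rest of left
        merged += left[i:] + right[j:]
        return merged, lc + rc + split
    return sort_count(list(seq))[1]

print(count_inversions('BCA'))  # 2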

Permutations of 2 characters in Python into fixed length string with equal numbers of each character

I've looked through the 2 questions below, which seem closest to what I am asking, but don't get me to the answer to my question.
Permutation of x length of 2 characters
How to generate all permutations of a list in Python
I am trying to find a way to take 2 characters, say 'A' and 'B', and find all unique permutations of those characters into a 40 character string. Additionally - I need each character to be represented 20 times in the string. So all resulting strings each have 20 'A's and 20 'B's.
Like this:
'AAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBB'
'AAAAAAAAAAAAAAAAAAABABBBBBBBBBBBBBBBBBBB'
'AAAAAAAAAAAAAAAAAABAABBBBBBBBBBBBBBBBBBB'
etc...
All I really need is the count of unique combinations that follow these rules.
import itertools

y = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
comb = set(itertools.permutations(y))
print("Combinations Found: {:,}".format(len(comb)))
This works, but it doesn't scale well to an input string of 20 'A's and 20 'B's. The above code takes 90 seconds to execute. Even just scaling up to 10 'A's and 10 'B's ran for 20 minutes before I killed it.
Is there a more efficient way to approach this, given the parameters I've described?
If all you need is the count, this can be generalized to n choose k. Your total size is n and the number of elements of "A" is k. So, your answer would be:
(n choose k) = (40 choose 20) = 137846528820
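For instance, with the standard library (math.comb requires Python 3.8+; this snippet is my addition):

from math import comb
print(comb(40, 20))  # 137846528820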

How to create all possible sentence of length 100 characters from a list of strings in Python

I am trying to create a sentence of length 100 characters from a given list of strings. The length has to be exactly one hundred characters. We also have to find all possible sentences using permutation. There has to be a space between each word, and duplicate words are not allowed. The list is given below:
['saintliness', 'wearyingly', 'shampoo', 'headstone', 'dripdry', 'elapse', 'redaction', 'allegiance', 'expressionless', 'awesomeness', 'hearkened', 'aloneness', 'beheld', 'courtship', 'swoops', 'memphis', 'attentional', 'pintsized', 'rustics', 'hermeneutics', 'dismissive', 'delimiting', 'proposes', 'between', 'postilion', 'repress', 'racecourse', 'matures', 'directions', 'bloodline', 'despairing', 'syrian', 'guttering', 'unsung', 'suspends', 'coachmen', 'usurpation', 'convenience', 'portal', 'deferentially', 'tarmacadam', 'underlay', 'lifetime', 'nudeness', 'influences', 'unicyclists', 'endangers', 'unbridled', 'kennedy', 'indian', 'reminiscent', 'ravish', 'republics', 'nucleic', 'acacia', 'redoubled', 'minnows', 'bucklers', 'decays', 'garnered', 'aussies', 'harshen', 'monogram', 'consignments', 'continuum', 'pinion', 'inception', 'immoderate', 'reiterated', 'hipster', 'stridently', 'relinquished', 'microphones', 'righthanders', 'ethereally', 'glutted', 'dandies', 'entangle', 'selfdestructive', 'selfrighteous', 'rudiments', 'spotlessly', 'comradeinarms', 'shoves', 'presidential', 'amusingly', 'schoolboys', 'phlogiston', 'teachable', 'letting', 'remittances', 'armchairs', 'besieged', 'monophthongs', 'mountainside', 'aweless', 'redialling', 'licked', 'shamming', 'eigenstate']
Approach:
My first approach is to use backtracking and permutations to generate all sentences. But I think the complexity will be too high since my list is so big.
Is there any other method I can use here, or some built-in functions/packages? What would be the best way to do this in Python? Any pointers would be helpful.
You can't do it.
Think about it: even for selecting 4 words you already have 100 × 99 × 98 × 97 possibilities, almost 100 million.
Given the length of your words, at least 8 of them will fit in the sentence. That gives 100 × 99 × 98 × ... × 93 possibilities, which is approximately 7.5×10^15, a totally infeasible number.
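A quick sanity check of that arithmetic (math.perm requires Python 3.8+; this snippet is my addition):

from math import perm
print(perm(100, 4))  # 94109400, almost 100 million
print(perm(100, 8))  # 7503063898176000, about 7.5e15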
This problem is similar to the problem of partitioning in number theory.
The complexity of the problem can (presumably) be reduced using some of the constraints that are encoded in the problem statement:
The lengths of the words in the words list.
Repeats of word lengths: for example a word of length 8 is repeated X times.
Here's a possible general approach (would take some refining):
Find all partitions for the number 100 using only the lengths of the words in the words list; see the sketch after this list. (You would start with word lengths and their repeats, and not by brute-forcing all possible partitions.)
Filter out partitions that have repeat length values exceeding repeat length values for words in the list.
Apply combinations of words onto the partitions. A set of words of equal length will be mapped to length values in a partition. Say for example you have the partition (15+15+15+10+10+10+10+5+5+5) then you would generate combinations for all length 15 words over 3, length 10 words over 4, and length 5 words over 3. (I'm ignoring the space separation issue here).
Generate permutations of all the combinations over all the partitions.
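Here is a rough sketch of step 1, under the same simplification as above (spaces ignored); the function name and inputs are illustrative only:

def length_partitions(target, lengths, start=0, acc=()):
    # Yield tuples of word lengths (non-decreasing) summing to target.
    if target == 0:
        yield acc
        return
    for i in range(start, len(lengths)):
        l = lengths[i]
        if l <= target:
            yield from length_partitions(target - l, lengths, i, acc + (l,))

print(list(length_partitions(12, [3, 4, 5])))
# [(3, 3, 3, 3), (3, 4, 5), (4, 4, 4)]

Each generated tuple would then be checked against the available repeat counts (step 2) before combinations of words are mapped onto it.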
Simplify a bit: Change all the strings from "xxx" to "xxx ". Then set the sentence length to 101. This allows you to use len(x) instead of len(x)+1 and eliminates the edge case for the last word in the sentence. As you traverse and build the sentence left to right, you can eliminate words that would overflow the length, based on the sentence you've just constructed.
UPDATE:
Consider this to be a base n number problem where n is the number of words you have. Create a vector initialized with 0 [NOTE: it's only fixed size to illustrate]:
acc = [0, 0, 0, 0]
This is your "accumulator".
Now construct your sentence:
dict[acc[0]] + dict[acc[1]] + dict[acc[2]] + dict[acc[3]]
So, you get able able able able
Now increment the rightmost "digit" in the acc. The current position is denoted by "curpos". Here curpos is 3.
[0, 0, 0, 1]
Now you get able able able baker
You keep bumping acc[curpos] until you hit [0, 0, 0, n]. Now you've got a "carry out". "Go left" by decrementing curpos to 2 and increment acc[curpos]. If it doesn't "carry out", "go right" by incrementing curpos and setting acc[curpos] = 0. If you had gotten a carry out, you'd "go left" again by decrementing curpos to 1.
This is a form of backtracking (e.g. the "go left"), but you don't need a tree. Just this acc vector and a state machine with three states: goleft, goright, test/trunc/output/inc.
After the "go right" curpos will be back to the "most significant" position. That is, the sentence length constructed from acc[0 to curpos - 1] (the length without adding the final word) is less than 100. If it's too long (e.g. it's already over 100), do a "go left". If it's too short (e.g. you've got to add another word to get near [enough] to 100), do a "go right"
When you get a carry out and curpos==0, you're done
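A loose sketch of this traversal (names are hypothetical throughout; like the description above, it ignores the no-duplicate-words rule):

words = ['able', 'baker', 'charlie']  # stand-in for the real word list

def sentences(words, target):
    acc = [0]  # the accumulator vector, one "digit" per word slot
    while True:
        sentence = ' '.join(words[d] for d in acc)
        if len(sentence) == target:
            yield sentence
        if len(sentence) < target:  # too short: "go right", add a digit
            acc.append(0)
            continue
        # too long, or just emitted: increment with carry ("go left")
        while acc and acc[-1] == len(words) - 1:
            acc.pop()  # carry out of this position
        if not acc:
            return  # carry out of position 0: done
        acc[-1] += 1

for s in sentences(words, 20):
    print(s)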
I recently devised this as a solution to the "vampire number challenge" and the traversal you need is very similar.
I am not going to provide a complete solution, but I'll walk through my thinking.
Constraints:
A permutation of your complete list that exceeds 100 characters can be immediately thrown out. (Ok, 99 + len(longest_word).)
You are essentially dealing with a subset of the power set of elements in your list.
Given that:
Build the power set, but discard any sentences that exceed your maximum
Filter the final set for sentences that exactly match your needs
So you can have the following:
def construct_sentences(dictionary: list, length: int) -> list:
    if not dictionary:
        return [(0, [])]
    else:
        word = dictionary[0]
        word_length = len(word) + 1
        subset_length = length - word_length
        sentence_subset = construct_sentences(dictionary[1:], subset_length)
        new_sentences = []
        for sentence_length, sentence in sentence_subset:
            if sentence_length + word_length <= length:
                new_sentences = new_sentences + [(sentence_length + word_length, sentence + [word])]
        return new_sentences + sentence_subset
I'm using tuples to carry the length of each sentence alongside it and make it easily available for comparison. The result of the above function will be a list of sentences that are all no longer than the target length (which is key when considering potential permutations: 100 is fairly short, so a vast number of permutations can be readily discarded). The next step would be to simply filter out any sentence that isn't long enough (i.e. exactly 100 characters).
Note that at this point you have every possible word list fitting your criteria, but each such list of m words may still be reordered in m! ways. Still, that becomes a more manageable situation. With a list of 100 words averaging under 9 characters a word, the average number of words in a sentence is about 10, and 10! permutations per set isn't the worst situation in the world...
You'll have to modify it for your truncation case, of course, but this gets you in the ballpark. Unless I completely missed something, which is always possible. I do think something may be wrong, because running this produces a surprisingly short list.
Your problem size is way too large, but if 1) your actual problem is much smaller in scope, and/or 2) you have a lot of time and a very fast computer, you can generate these permutations using a recursive generator.
def f(string, list1):
    for word in list1:
        new_string = string + (' ' if string else '') + word
        # If there are other constraints that will allow you to prune branches,
        # you can add those conditions here and break out of the for loop
        if len(new_string) >= 100:
            yield new_string[:100]
        else:
            list2 = list1[:]
            list2.remove(word)
            for item in f(new_string, list2):
                yield item

x = f('', list1)
for sentence in x:
    check(sentence)
One caveat is that this may produce identical sentences if two words at the end get truncated to look the same.
