Word ranking partial completion [duplicate] - python

This question already has answers here:
Finding the ranking of a word (permutations) with duplicate letters
(6 answers)
Closed 8 years ago.
I am not sure how to solve this problem within the constraints.
Shortened problem formulation:
"Word" as any sequence of capital letters A-Z (not limited to just "dictionary words").
Consider list of permutations of all characters in a word, sorted lexicographically
Find a position of original word in such a list
Do not generate all possible permutations of a word, since it won't fit in time-memory constraints.
Constraints: word length <= 25 characters; memory limit 1Gb, any answer should fit in 64-bit integer
Original problem formulation:
Consider a "word" as any sequence of capital letters A-Z (not limited to just "dictionary words"). For any word with at least two different letters, there are other words composed of the same letters but in a different order (for instance, STATIONARILY/ANTIROYALIST, which happen to both be dictionary words; for our purposes "AAIILNORSTTY" is also a "word" composed of the same letters as these two). We can then assign a number to every word, based on where it falls in an alphabetically sorted list of all words made up of the same set of letters. One way to do this would be to generate the entire list of words and find the desired one, but this would be slow if the word is long. Write a program which takes a word as a command line argument and prints to standard output its number. Do not use the method above of generating the entire list. Your program should be able to accept any word 25 letters or less in length (possibly with some letters repeated), and should use no more than 1 GB of memory and take no more than 500 milliseconds to run. Any answer we check will fit in a 64-bit integer.
Sample words, with their rank:
ABAB = 2
AAAB = 1
BAAA = 4
QUESTION = 24572
BOOKKEEPER = 10743
examples:
AAAB - 1
AABA - 2
ABAA - 3
BAAA - 4
AABB - 1
ABAB - 2
ABBA - 3
BAAB - 4
BABA - 5
BBAA - 6
I came up with what I think is only a partial solution.
Imagine I have the word JACBZPUC. I sort the word and get ABCCJPUZ, which should be rank 1. I want to find the number of permutations from ABCCJPUZ up to the last word, alphabetically, before the first word starting with J.
ex:
for `JACBZPUC`
sorted --> `ABCCJPUZ`
permutations that start with A -> 8!/2!
permutations that start with B -> 8!/2!
permutations that start with C -> 8!/2!
Add the 3 values -> 60480
The other C is disregarded as the permutations would have the same values as the previous C (duplicates)
At this point I have the ranks from ABCCJPUZ to the word right before the word that starts with J
ABCCJPUZ rank 1
...
... 60480 values
...
*HERE*
JABCCPUZ rank 60481 LOCATION A
...
...
...
JACBZPUC rank ??? LOCATION B
I'm not sure how to get the values between Locations A and B:
Here is my code to find the 60480 values
import itertools

def perm(word):
    # brute force: builds every permutation, so only feasible for short words
    return len(set(itertools.permutations(word)))

def swap(word, i, j):
    word = list(word)
    word[i], word[j] = word[j], word[i]
    print(word)
    return ''.join(word)

def compute(word):
    if ''.join(sorted(word)) == word:
        return 1
    total = 0
    sortedWord = ''.join(sorted(word))
    beforeFirstCharacterSet = set(sortedWord[:sortedWord.index(word[0])])
    print(beforeFirstCharacterSet)
    for i in beforeFirstCharacterSet:
        total += perm(swap(sortedWord,0,sortedWord.index(i)))
    return total
Here is a solution I found online to solve this problem.
Consider the n-letter word { x1, x2, ... , xn }. My solution is based on the idea that the word number will be the sum of two quantities:
The number of combinations starting with letters lower in the alphabet than x1, and
How far we are into the arrangements that start with x1.
The trick is that the second quantity happens to be the word number of the word { x2, ... , xn }. This suggests a recursive implementation.
Getting the first quantity is a little complicated:
Let uniqLowers = { u1, u2, ... , um } = all the unique letters lower than x1
For each uj, count the number of permutations starting with uj.
Add all those up.
I think I have completed step 1 but not step 2. I am not sure how to complete that part.
Here is the Haskell solution...I don't know Haskell =/ and I am trying to write this program in Python
https://github.com/david-crespo/WordNum/blob/master/comb.hs

The idea of finding the number of permutations of the letters before the actual first letter is good. But your calculation:
for `JACBZPUC`
sorted --> `ABCCJPUZ`
permutations that start with A -> 8!/2!
permutations that start with B -> 8!/2!
permutations that start with C -> 8!/2!
Add the 3 values -> 60480
is wrong. There are only 8!/2! = 20160 permutations of JACBZPUC in total, so the starting position can't be anywhere near 60480. In your method, the first letter is fixed, so you can only permute the seven remaining letters:
permutations that start with A: 7! / 2! == 2520
permutations that start with B: 7! / 2! == 2520
permutations that start with C: 7! / 1! == 5040
-----
10080
You don't divide by 2! to find the permutations beginning with C, because the seven remaining letters are unique; there's only one C left.
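A quick way to check these numbers is the sketch below. The helper names arrangements and count_before_first_letter are made up for this illustration; the only assumption is that the word is a plain string of letters.

```python
from collections import Counter
from math import factorial

def arrangements(letters):
    """Distinct orderings of a multiset of letters: n! / (c1! * c2! * ...)."""
    total = factorial(len(letters))
    for c in Counter(letters).values():
        total //= factorial(c)
    return total

def count_before_first_letter(word):
    """Permutations that start with a letter smaller than word[0]."""
    total = 0
    for letter in sorted(set(word)):
        if letter >= word[0]:
            break
        rest = list(word)
        rest.remove(letter)          # fix this letter in front, permute the rest
        total += arrangements(rest)
    return total

print(arrangements("JACBZPUC"))               # all permutations of the word
print(count_before_first_letter("JACBZPUC"))  # those starting with A, B or C
```

For JACBZPUC this should print 20160 and 10080, matching the totals worked out by hand above.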
Here's a Python implementation:
def fact(n):
    """factorial of n, n!"""
    f = 1
    while n > 1:
        f *= n
        n -= 1
    return f

def rrank(s):
    """Back-end to rank for 0-based rank of a list permutation"""
    # trivial case
    if len(s) < 2: return 0
    order = s[:]
    order.sort()
    denom = 1
    # account for multiple occurrences of letters
    for i, c in enumerate(order):
        n = 1
        while i + n < len(order) and order[i + n] == c:
            n += 1
        denom *= n
    # starting letters alphabetically before current letter
    pos = order.index(s[0])
    # recurse to list without its head
    # (the division is exact; // keeps the result an integer in Python 3)
    return fact(len(s) - 1) * pos // denom + rrank(s[1:])

def rank(s):
    """Determine 1-based rank of string permutation"""
    return rrank(list(s)) + 1

strings = [
    "ABC", "CBA",
    "ABCD", "BADC", "DCBA", "DCAB", "FRED",
    "QUESTION", "BOOKKEEPER", "JACBZPUC",
    "AAAB", "AABA", "ABAA", "BAAA"
]

for s in strings:
    print(s, rank(s))

The second part of the solution you have found is also, I think, what I was about to suggest:
To go from what you call "Location A" to "Location B", you have to find the position of word ACBZPUC among its possible permutations. Consider that a new question to your algorithm, with a new word that just happens to be one position shorter than the original one.

The words in the alphabetical list between JABCCPUZ, which you know the position of, and JACBZPUC, which you want to find the position of, all start with J. Finding the position of JACBZPUC relative to JABCCPUZ, then, is equivalent to finding the relative positions of those two words with the initial J removed, which is the same as the problem you were trying to solve initially but with a word one character shorter.
Repeat that process enough times and you will be left with a word that contains a single character, C. The position of a word with a single character is known to always be 1, so you can then sum that and all of the previous relative positions for an absolute position.
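The repeated shortening can also be written as a loop instead of a recursion. The following is only a sketch under the assumption that the input is a plain string of capital letters; word_rank and arrangements_of are illustrative names, not part of any answer above.

```python
from collections import Counter
from math import factorial

def arrangements_of(counts):
    """Distinct orderings of a multiset given as a letter -> count mapping."""
    total = factorial(sum(counts.values()))
    for c in counts.values():
        total //= factorial(c)
    return total

def word_rank(word):
    """1-based rank of word among the sorted distinct permutations of its letters."""
    counts = Counter(word)
    rank = 1                              # the final one-letter word has position 1
    for ch in word:
        for smaller in sorted(counts):
            if smaller >= ch:
                break
            counts[smaller] -= 1          # put a smaller letter in this position
            rank += arrangements_of(counts)
            counts[smaller] += 1
        counts[ch] -= 1                   # drop the head and continue with the rest
        if counts[ch] == 0:
            del counts[ch]
    return rank

for w in ["ABAB", "AAAB", "BAAA", "QUESTION", "BOOKKEEPER"]:
    print(w, word_rank(w))
```

Each pass of the outer loop adds the number of words that share the prefix seen so far but have a smaller letter in the current position, which is the relative position described above. No permutation is ever generated, so a 25-letter word needs only a few hundred factorial computations.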

Related

Efficient Way to Count K-mers in O(k*N + k*Q)?

I have a string of lowercase letters. I need to find how many times each queried k-mer appears in it. One catch is that I need to output the counts in the order in which the k-mers are asked. Another catch is that the same k-mer may be asked more than once, and its count must be output each time. I need to accomplish this in O(kN + kQ), where k is the length of a k-mer, N is the length of the DNA string and Q is the number of k-mers of interest.
For example, for the following input, where N=7, k=3 and Q=5, aaabaab is the DNA string and the next 5 lines are the k-mers of interest:
7 3 5
aaabaab
aaa
aab
aaa
baa
xyz
I would expect to output the following:
aaa 1
aab 2
aaa 1
baa 1
xyz 0
Note that aaa is asked twice!
I have a list of the Q k-mers and a dictionary mapping each k-mer to its count (the dictionary can have fewer than Q entries). With a for loop, I iterate through the DNA character by character while keeping track of the current k-mer, O(N). In each iteration, I update the current k-mer by dropping its first letter and appending the current character. To output the answer, I iterate over the list of Q k-mers and look up each count in the dictionary.
import sys

l, n, k, q = [int(x) for x in sys.stdin.readline().strip('\n').split(' ')]
dna = ''
for i in range(l):
    dna += sys.stdin.readline().strip('\n')
mykmer = []
mycount = {}
for i in range(q):
    kmer = sys.stdin.readline().strip('\n')
    mykmer.append(kmer)
    mycount[kmer] = 0
current = dna[0:k]
for j in range(k-1, len(dna)):
    if j != k-1:
        current = current[1:] + str(dna[j])
    if current in mykmer:
        mycount[current] += 1
for x in mykmer:
    print(str(x) + ' ' + str(mycount[x]))
I get correct answers, but I get timed out!
I would improve your inner loop to:
    for j in range(len(dna) - k + 1):
        current = dna[j:j+k]
        if current in mycount:
            mycount[current] += 1
Slicing once costs less than repeated slicing and appending: current = current[1:]+str(dna[j]) results in three string allocations, whereas dna[j:j+k] results in one.
Use the dictionary you already have, rather than the list, for membership tests. A list lookup scans up to Q entries, so this removes a factor of Q.
The range(len(dna) - k + 1) ensures that the loop stops at the last index where a full k-mer still fits.
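Put together, the approach might look like the sketch below. It is a variation rather than the exact code above: it counts every window of the DNA with collections.Counter instead of only the queried k-mers, and count_kmers is a hypothetical helper name.

```python
from collections import Counter

def count_kmers(dna, k, queries):
    """Return (kmer, count) pairs in query order, repeats included."""
    # One pass over the DNA; each slice costs O(k), so this part is O(k*N).
    window_counts = Counter(dna[j:j + k] for j in range(len(dna) - k + 1))
    # One dictionary lookup per query; hashing a k-mer costs O(k), so O(k*Q).
    return [(kmer, window_counts[kmer]) for kmer in queries]

for kmer, count in count_kmers("aaabaab", 3, ["aaa", "aab", "aaa", "baa", "xyz"]):
    print(kmer, count)
```

A Counter returns 0 for keys it has never seen, so k-mers such as xyz need no special case. The trade-off is memory: the dictionary can hold up to N - k + 1 distinct windows instead of at most Q entries.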

Number of Palindromic Slices in a string with O(N) complexity

def solution(S):
    total = 0
    i = 1
    while i <= len(S):
        for j in range(0, len(S) - i + 1):
            if is_p(S[j: j + i]):
                total += 1
        i += 1
    return total

def is_p(S):
    if len(S) == 1:
        return False
    elif S == S[::-1]:
        return True
    else:
        return False
I am writing a function to count the number of palindromic slices (with length greater than 1) in a string. The above code has poor time complexity. Can someone help me improve it to O(N) complexity?
Edit: This is not a duplicate, since the other question is about finding the longest palindromic slice.
Apply Manacher's Algorithm, also described by the multiple answers to this question.
That gives you the length of the longest palindrome centered at every location (centered at a character for odd-length, or centered between characters for even-length). You can use this to easily calculate the number of palindromes. Note that every palindrome must be centered somewhere, so it must be a substring (or equal to) the longest palindrome centered there.
So consider the string ababcdcbaa. By Manacher's Algorithm, you know that the maximal length palindrome centered at the d has length 7: abcdcba. By the properties of palindromes, you immediately know that bcdcb and cdc and d are also palindromes centered at d. In fact there are floor((k+1)/2) palindromes centered at a location, if you know that the longest palindrome centered there has length k.
So you sum the results of Manacher's Algorithm to get your count of all palindromes. If you want to only count palindromes of length > 1, you just need to subtract the number of length-1 palindromes, which is just n, the length of your string.
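A compact sketch of this counting scheme follows. It uses the common trick of interleaving a separator character so that odd and even centers are handled uniformly; the function name count_palindromic_slices is made up here, and the sketch assumes '#' does not occur in the input string.

```python
def count_palindromic_slices(s):
    """Count palindromic substrings of length > 1 using Manacher's algorithm."""
    # Assumption: '#' is not a character of s.
    t = "#" + "#".join(s) + "#"
    n = len(t)
    radius = [0] * n      # radius[i] = length in s of the longest palindrome centered at t[i]
    center = right = 0    # center and right edge of the rightmost palindrome found so far
    for i in range(n):
        if i < right:
            # reuse the mirrored center's result, clipped to the known palindrome
            radius[i] = min(right - i, radius[2 * center - i])
        lo, hi = i - radius[i] - 1, i + radius[i] + 1
        while lo >= 0 and hi < n and t[lo] == t[hi]:
            radius[i] += 1
            lo -= 1
            hi += 1
        if i + radius[i] > right:
            center, right = i, i + radius[i]
    # A longest palindrome of length k contributes floor((k+1)/2) palindromes.
    total = sum((k + 1) // 2 for k in radius)
    return total - len(s)  # drop the len(s) single-character palindromes

print(count_palindromic_slices("ababcdcbaa"))
```

For the example string ababcdcbaa this counts aba, bab, cdc, bcdcb, abcdcba and aa, so it should print 6.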
This can be done in linear time using suffix trees:
1) For constant sized alphabet we can build suffix trees using Ukkonen's Algorithm in O(n).
2) For given string S, build a generalized suffix tree of S#S' where S' is reverse of string S and # is delimiting character.
3) Now in this suffix tree, for every suffix i in S, look for the lowest common ancestor with the (2n-i+1) suffix in S'.
4) count for all such suffixes in the tree to get total count of all palindromes.

Average time a substring occurs

I have a program that returns too many results, so I want to keep only the useful results that are above the average. My question is: in a string of length N produced from an alphabet of k letters, how many times, on average, does each substring of length m occur? For example, in the string "abcbbbbcbabcabcbcab" over the alphabet {a,b,c}, how many times on average do the substrings of length 3 occur? Here abc occurs 3 times, bbb occurs 2 times (I count occurrences even if they overlap), and so on. Or is there a way to know this in Python (where my code is) before executing the program?
Do you want to count the substrings in a specific string, or do you want the theoretical average in the general case? Assuming uniformly random letters, the probability that a given string of length m over an alphabet of k characters occurs at any given position is 1/(k^m), so if your string is N characters long, the expected number of occurrences is (N-m+1)/(k^m) (-m+1 because the string cannot start in the last m-1 positions). Another way to see this is as the number of substrings of length m (N-m+1) divided by the number of different such substrings (k^m).
You can calculate the average counts for your example, to see whether the formula gets to about the right result. Of course, one should not expect too much, as it's a very small sample size...
>>> s = "abcbbbbcbabcabcbcab"
>>> N = len(s)
>>> k = 3
>>> m = 3
For this, the formula gives us
>>> (N-m+1)/(k**m)
0.6296296296296297
We can count the occurrences of all the three-letter strings using itertools.product and a count function (str.count will not count overlapping strings correctly):
>>> import itertools
>>> count = lambda x: sum(s[i:i+m] == x for i in range(len(s)))
>>> X = [''.join(cs) for cs in itertools.product("abc", repeat=3)]
>>> counts = [count(x) for x in X]
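In this case, this gives you exactly the same result as the formula. That is not a coincidence: every one of the N-m+1 windows of length m is one of the k^m possible strings, so the counts always sum to N-m+1.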
>>> sum(counts)/len(counts)
0.6296296296296297
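To get back to the original goal of keeping only the above-average results, something along these lines might work. It is a sketch only: substring_counts and above_average are hypothetical helper names, and the threshold is the theoretical average from the formula.

```python
from collections import Counter

def substring_counts(s, m):
    """Count every (possibly overlapping) substring of length m in s."""
    return Counter(s[i:i + m] for i in range(len(s) - m + 1))

def above_average(s, m, k):
    """Substrings of length m that occur more often than (N-m+1)/(k^m)."""
    expected = (len(s) - m + 1) / k ** m
    return {sub: c for sub, c in substring_counts(s, m).items() if c > expected}

s = "abcbbbbcbabcabcbcab"
print(substring_counts(s, 3).most_common(3))
print(above_average(s, 3, 3))
```

With such a short string the average is below 1, so every substring that occurs at all passes the filter. On longer inputs the threshold becomes more selective.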

Convert a long number to corresponding letter combinations

Given a number, translate it to all possible combinations of corresponding letters. For example, if given the number 1234, it should spit out abcd, lcd, and awd because the combinations of numbers corresponding to letters could be 1 2 3 4, 12 3 4, or 1 23 4.
I was thinking of ways to do this in Python and I was honestly stumped. Any hints?
I basically only setup a simple system to convert single digit to letters so far.
Convert the number to a str.
Implement partition as in here.
Filter out partitions containing a number over 26.
Write a function that returns letters.
def alphabet(n):
    # return " abcde..."[n]
    return chr(n + 96)

def partition(lst):
    for i in range(1, len(lst)):
        for r in partition(lst[i:]):
            yield [lst[:i]] + r
    yield [lst]

def int2words(x):
    for lst in partition(str(x)):
        ints = [int(i) for i in lst]
        if all(i <= 26 for i in ints):
            yield "".join(alphabet(i) for i in ints)

x = 12121
print(list(int2words(x)))
# ['ababa', 'abau', 'abla', 'auba', 'auu', 'laba', 'lau', 'lla']
I'm not gonna give you a complete solution but an idea where to start:
I would transform the number to a string and iterate over the string. As the alphabet has 26 characters, you only have to check one- and two-digit numbers.
As in a comment above a recursive approach will do the trick, e.g.:
Number is 1234
*) Take first character -> number is 1
*) From there combine it with all remaining 1-digit numbers -->
1 2 3 4
*) Then combine it with the next 2 digit number (if <= 26) and the remaining 1 digit numbers -->
1 23 4
*) ...and so on
As I said, it's just an idea of where to start, but basically it's a recursive approach using combinatorics, including checks that two-digit numbers aren't greater than 26 and thus beyond the alphabet.
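One possible shape for that recursion is sketched below. The function name decode is hypothetical, and the rule that a chunk starting with 0 is rejected is an assumption, since the question does not say how zeros should be treated.

```python
def decode(digits):
    """All letter strings obtainable by splitting digits into numbers 1..26."""
    if not digits:
        return [""]                      # nothing left: one way, the empty string
    results = []
    for size in (1, 2):                  # only one- and two-digit chunks can be <= 26
        chunk = digits[:size]
        # assumption: chunks with a leading zero (including "0") map to no letter
        if len(chunk) < size or chunk[0] == "0" or int(chunk) > 26:
            continue
        letter = chr(int(chunk) + 96)    # 1 -> 'a', 26 -> 'z'
        results.extend(letter + rest for rest in decode(digits[size:]))
    return results

print(decode("1234"))
```

For 1234 this yields abcd, awd and lcd. Because each step only tries two chunk sizes, it never builds partitions with three-digit numbers that would be filtered out afterwards.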

python recursion with bubble sort

So, I have this problem where I receive two strings of the letters ACGT: one contains only letters, the other contains letters and dashes ("-"). Both are the same length. The string with the dashes is compared to the string without them, cell by cell, and each pairing is scored. I wrote the code below for the scoring system.
for example:
dna1: -ACA
dna2: TACG
The score is -1: the dash compared to a letter (T) gives -2, a letter compared to the same letter gives +1 (A to A) and +1 (C to C), and dissimilar letters (A to G) give -1, so the sum is -1.
def get_score(dna1, dna2, match=1, mismatch=-1, gap=-2):
    """Score two equal-length sequences position by position."""
    score = 0
    for index in range(len(dna1)):
        # compare with == and !=; "is" tests identity, not equality
        if dna1[index] == dna2[index]:
            score += match
        elif dna1[index] != dna2[index]:
            if "-" not in (dna1[index], dna2[index]):
                score += mismatch
            else:
                score += gap
    return score
This is working fine.
Now I have to use recursion to find the best possible score for two strings.
I receive two strings, which can be of different lengths this time (I can't change the order of the letters).
So I wrote the code below, which adds "-" to the shorter string as many times as needed to make both strings the same length, placing the dashes at the start of the list. Now I want to start moving the dashes around, record the score for every dash position, and finally get the highest possible score. To move the dashes around I wrote a little bubble sort, but it doesn't seem to do what I want. I realize it's a long question, but I'd love some help. Let me know if anything I wrote is unclear.
def best_score(dna1, dna2, match=1, mismatch=-1, gap=-2,\
               score=[], count=0):
    """"""
    diff = abs(len(dna1) - len(dna2))
    if len(dna1) is len(dna2):
        short = []
    elif len(dna1) < len(dna2):
        short = [base for base in iter(dna1)]
    else:
        short = [base for base in iter(dna2)]
    for i in range(diff):
        short.insert(count, "-")
    for i in range(diff+count, len(short)-1):
        if len(dna1) < len(dna2):
            score.append((get_score(short, dna2),\
                          ''.join(short), dna2))
        else:
            score.append((get_score(dna1, short),\
                          dna1, ''.join(short)))
        short[i+1], short[i] = short[i], short[i+1]
    if count is min(len(dna1), len(dna2)):
        return score[score.index(max(score))]
    return best_score(dna1, dna2, 1, -1, -2, score, count+1)
First, if I correctly deciphered your cost function, the choice of best alignment does not depend on gap, as the number of dashes is fixed.
Second, the score depends linearly on the number of mismatches, so the choice doesn't depend on the exact values of match and mismatch either, as long as they are positive and negative respectively.
So your task reduces to finding the longest subsequence of the longer string's letters that strictly matches a subsequence of the shorter string's letters.
Third, let M(string, substr) be the function returning the length of the best match described above. If the first letter of your shorter string is S, that is substr == 'S<letters>', then
M(string, 'S<letters>') = \
    max(1 + M(string[string.index(S) + 1:], '<letters>'),  # found S
        M(string[1:], '<letters>'))  # letter S not found, placed at 1st place
The latter is an easy-to-implement recursive expression.
For a pair string, substr, with m = M(string, substr), the best score is equal to
m * match + (len(substr) - m) * mismatch + (len(string)-len(substr)) * gap
It is straightforward, by storing which value was the max in the recursive expression, to find what exactly the best match is.
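As an alternative to the reduction above, the best score can also be computed directly by a recursion over positions. The sketch below is not the answer's method: it tries, at every position of the longer string, either pairing it with the next letter of the shorter string or pairing it with a dash, and memoizes the results. The name best_alignment_score is made up, and dashes are assumed to go only into the shorter string, as in the question.

```python
from functools import lru_cache

def best_alignment_score(dna1, dna2, match=1, mismatch=-1, gap=-2):
    """Best score when dashes are inserted only into the shorter string."""
    long_s, short_s = (dna1, dna2) if len(dna1) >= len(dna2) else (dna2, dna1)
    n, m = len(long_s), len(short_s)

    @lru_cache(maxsize=None)
    def best(i, j):
        # best score for aligning long_s[i:] with short_s[j:]
        if i == n:
            return 0
        options = []
        if j < m:   # pair long_s[i] with the next letter of the shorter string
            pair = match if long_s[i] == short_s[j] else mismatch
            options.append(pair + best(i + 1, j + 1))
        if n - i > m - j:   # a dash is still available: pair long_s[i] with "-"
            options.append(gap + best(i + 1, j))
        return max(options)

    return best(0, 0)

print(best_alignment_score("ACA", "TACG"))
```

For the example from the question this finds the alignment -ACA against TACG with a score of -1. There are only (n+1)*(m+1) distinct states, so the memoized recursion avoids enumerating every dash placement.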
