The problem: given a string S and an integer k < len(S), find the lexicographically largest string that can be obtained by removing any k characters while preserving the relative order of the remaining characters.
This is what I have so far:
def allPossibleCombinations(k,s,strings):
    if k == 0:
        strings.append(s)
        return strings
    for i in range(len(s)):
        new_str = s[:i]+s[i+1:]
        strings = allPossibleCombinations(k-1, new_str, strings)
    return strings
def stringReduction(k, s):
    strings = []
    combs = allPossibleCombinations(k,s, strings)
    return sorted(combs)[-1]
This is working for a few test cases but it says that I have too many recursive calls for other testcases. I don't know the testcases.
This should get you started -
from itertools import combinations
def all_possible_combinations(k = 0, s = ""):
    yield from combinations(s, len(s) - k)
Now for a given k=2, and s="abcde", we show all combinations of s with k characters removed -
for c in all_possible_combinations(2, "abcde"):
    print("".join(c))
# abc
# abd
# abe
# acd
# ace
# ade
# bcd
# bce
# bde
# cde
it says that I have too many recursive calls for other testcases
I'm surprised that it failed on recursive calls before it failed on taking too long to come up with an answer. The recursion depth is the same as k, so k would have had to reach 1000 for default Python to choke on it. However, your code takes 4 minutes to solve what appears to be a simple example:
print(stringReduction(8, "dermosynovitis"))
The amount of time is a function of k and string length. The problem as I see it, recursively, is this code:
for i in range(len(s)):
    new_str = s[:i]+s[i+1:]
    strings = allPossibleCombinations(k-1, new_str, strings)
Once we've removed the first character say, and done all the combinations without it, there's nothing stopping the recursive call that drops out the second character from again removing the first character and trying all the combinations. We're (re)testing too many strings!
The basic problem is that you need to prune (i.e. avoid) strings as you test, rather than generate all possibilities and then test them. If a candidate's first letter is less than that of the best string you've seen so far, no manipulation of the remaining characters in that candidate is going to improve it.
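One way to prune without generating any candidates is a greedy pass with a stack. This is only a sketch, and the name largest_after_removing is made up for illustration: whenever the incoming character is larger than the character on top of the stack and removals are still available, the smaller character is dropped, because keeping it could never give a larger result.

```python
def largest_after_removing(k, s):
    """Greedy sketch: lexicographically largest string after deleting k characters."""
    stack = []
    remaining = k  # deletions we are still allowed to make
    for ch in s:
        # A smaller character in front of a larger one can only hurt the result,
        # so drop it while we still have deletions left.
        while remaining and stack and stack[-1] < ch:
            stack.pop()
            remaining -= 1
        stack.append(ch)
    # If deletions are left over, the stack is non-increasing: trim from the end.
    return "".join(stack[:len(s) - k])


print(largest_after_removing(8, "dermosynovitis"))  # yvitis
```

Each character is pushed and popped at most once, so the running time is linear in len(s), and there is no recursion to hit a depth limit.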
I am trying to create a function compare(lst1, lst2) which compares two lists element by element, returns every common element in a new list, and shows the percentage of elements they have in common. All the elements in the lists are going to be strings. For example the function should return:
lst1 = AAAAABBBBBCCCCCDDDD
lst2 = ABCABCABCABCABCABCA
common strand = AxxAxxxBxxxCxxCxxxx
similarity = 25%
The parts of the list which are not similar will simply be returned as x.
I am having trouble in completing this function without the python set and zip method. I am not allowed to use them for this task and I have to achieve this using while and for loops. Kindly guide me as to how I can achieve this.
This is what I came up with.
lst1 = 'AAAAABBBBBCCCCCDDDD'
lst2 = 'ABCABCABCABCABCABCA'
common_strand = ''
score = 0
for i in range(len(lst1)):
    if lst1[i] == lst2[i]:
        common_strand = common_strand + str(lst1[i])
        score += 1
    else:
        common_strand = common_strand + 'x'
print('Common Strand: ', common_strand)
print('Similarity Score: ', score/len(lst1))
Output:
Common Strand: AxxAxxxBxxxCxxCxxxx
Similarity Score: 0.2631578947368421
I am having trouble in completing this function without the python set and zip method. I am not allowed to use them for this task and I have to achieve this using while and for loops. Kindly guide me as to how I can achieve this.
You have two strings A and B. Strings are ordered sequences of characters.
Suppose both A and B have equal length (the same number of characters). Choose some position i < len(A), len(B) (remember Python sequences are 0-indexed). Your problem statement requires:
1. If character i in A is identical to character i in B, yield that character
2. Otherwise, yield some placeholder to denote the mismatch
How do you find the ith character in some string A? Take a look at Python's string methods. Remember: strings are sequences of characters, so Python strings also implement several sequence-specific operations.
If len(A) != len(B), you need to decide what to do when position i exists in one string but lies past the end of the other. You might choose to represent these positions with the same placeholder as in (2).
If you know how to iterate the result of zip, you know how to use for loops. All you need is a way to iterate over the sequence of indices. Check out the language built-in functions.
Finally, for your measure of similarity: if you've compared n characters and found that N <= n are mismatched, you can define 1 - (N / n) as your measure of similarity. This works well for equally-long strings (for two strings with different lengths, you're always going to be calculating the proportion relative to the longer string).
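Putting those hints together, a minimal sketch could look like the following. It uses only a for loop over indices (no zip, no set). The placeholder parameter and the choice to count positions past the end of the shorter string as mismatches are assumptions, not part of the original task statement.

```python
def compare(a, b, placeholder="x"):
    """Return (common_strand, similarity) for two strings, position by position."""
    longer = max(len(a), len(b))
    shorter = min(len(a), len(b))
    common = ""
    mismatches = 0
    for i in range(longer):
        # Positions beyond the shorter string are treated as mismatches (assumption).
        if i < shorter and a[i] == b[i]:
            common += a[i]
        else:
            common += placeholder
            mismatches += 1
    # 1 - (N / n), relative to the longer string; two empty strings count as identical.
    similarity = 1 - mismatches / longer if longer else 1.0
    return common, similarity


strand, score = compare("AAAAABBBBBCCCCCDDDD", "ABCABCABCABCABCABCA")
print(strand)  # AxxAxxxBxxxCxxCxxxx
```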
We have an IT olympiad in my country. Normally solutions are written in Java, C or C++. I guess for a year or so they have also accepted other languages such as Python.
I tried to solve a task from a previous year, called Letters, in Python, and I keep failing. The task is to write code that counts the minimum number of swaps of neighboring letters needed to turn one string into another.
As input you get the number of letters in one string, followed by two strings with the same letters in a different order. The length of each string is from 2 to 1 000 000 letters. There are only capital letters; they may or may not be sorted, and they can repeat.
Here's an example:
7
AABCDDD
DDDBCAA
Correct output should be 16
As output you have to return a single value, the minimum number of shifts. The output has to be calculated in under 5 seconds.
I got it to calculate the correct output, but for longer strings (like 800 000 letters) it starts to slow down. The longest inputs return a value in about 30 seconds. There is also one input with 900 000 letters per word that takes 30 minutes!
Under link you can find all input files for tests:
https://oi.edu.pl/l/19oi_ksiazeczka/
Click on this link to download files for "Letters" task:
XIX OI testy i rozwiązania - zad. LIT (I etap) (3.5 MB)
Below is my code. How can I speed it up?
# import time
import sys
# start = time.time()
def file_reader():
    standard_input=""
    try:
        data = sys.stdin.readlines()
        for line in data:
            standard_input+=line
    except:
        print("An exception occurred")
    return standard_input
def mergeSortInversions(arr):
    if len(arr) == 1:
        return arr, 0
    else:
        a = arr[:len(arr)//2]
        b = arr[len(arr)//2:]
        a, ai = mergeSortInversions(a)
        b, bi = mergeSortInversions(b)
        c = []
        i = 0
        j = 0
        inversions = 0 + ai + bi
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            c.append(a[i])
            i += 1
        else:
            c.append(b[j])
            j += 1
            inversions += (len(a)-i)
    c += a[i:]
    c += b[j:]
    return c, inversions
def literki():
    words=(file_reader()).replace("\n", "")
    number = int("".join((map(str, ([int(i) for i in list(words) if i.isdigit()])))))
    all_letters = [x for x in list(words) if x not in str(number)]
    name = all_letters[:number]
    anagram = all_letters[number:]
    p=[]
    index=list(range(len(anagram)))
    anagram_dict = {index[i]: anagram[i] for i in range(len(index))}
    new_dict = {}
    anagram_counts={}
    for key, value in anagram_dict.items():
        if value in new_dict:
            new_dict[value].append(key)
        else:
            new_dict[value]=[key]
    for i in new_dict:
        anagram_counts.update({i:new_dict[i]})
    for letter in name:
        a=anagram_counts[letter]
        p.append(a.pop(0))
    print(mergeSortInversions(p)[1])
#>>
literki()
# end = time.time()
# print(start-end)
To explain what it does in parts. file_reader: simply reads the input from standard input. mergeSortInversions(arr): normally it would sort a list, but here I wanted it to return the number of inversions. I didn't work it out myself; I found it on the web, but it does the job. Unfortunately, for a list of 1 million elements it takes about 10 seconds. In the literki function: first, I split the input into the number of letters and two words of equal length, stored as lists.
Then I built something like an array of stacks (not sure if that is what it is called in English). Basically, I made a dictionary with every letter as a key and a list of the indexes of that letter as the value (if a letter occurs more than once, the list contains all indexes for that letter). The last thing I did before "the slow thing" was to extract the corresponding index for every letter in the name variable. Up to that point, all operations took around 2 seconds for every input. Then come the two lines that account for the rest of the running time: - I append the index to the p=[] list and at the same time pop it from the list in the dictionary, so it is not read again for another occurrence of the same letter. - I calculate the number of moves (inversions) with mergeSortInversions(arr) based on the p=[...] list and print it as the output.
I know that popping from the front of a list is slow, but otherwise I would have to build the index lists in reverse (so I could pop from the end), and that took even longer. I also tried converting a=[...] to a deque, but that was also too slow.
I think I'd try a genetic algorithm for this problem. GA's don't always come up with an optimal solution, but they are very good for getting an acceptable solution in a reasonable amount of time. And for small inputs, they can be optimal.
The gist is to come up with:
1) A fitness function that assigns a number indicating how good a particular candidate solution is
2) A sexual reproduction function, that combines, in a simple way, part of two candidate solutions
3) A mutation function, that introduces one small change to a candidate solution.
So you just let those functions go to town, creating solution after solution, and keeping the best ones - not the best one, the best ones.
Then after a while, the best solution found is your answer.
Here's an example of using a GA for another hard problem, called The House Robber Problem. It's in Python:
http://stromberg.dnsalias.org/~strombrg/house-robber-problem/
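For an exact answer, the approach from the question itself (map each letter of the first word to the leftmost unused matching position in the second word, then count inversions) can be kept while replacing the two slow parts. The sketch below is one possible arrangement, not a tested contest solution: it uses collections.deque so that taking the first index is O(1), and a Fenwick (binary indexed) tree in place of the recursive merge sort. The function name min_adjacent_swaps is made up for illustration, and whether it fits in the 5 second limit on the largest inputs is not something verified here.

```python
from collections import defaultdict, deque


def min_adjacent_swaps(source, target):
    """Count adjacent swaps needed to turn source into target (both anagrams)."""
    # For every letter, the positions where it occurs in target, left to right.
    positions = defaultdict(deque)
    for idx, ch in enumerate(target):
        positions[ch].append(idx)

    # Each letter of source goes to the leftmost unused matching position.
    perm = [positions[ch].popleft() for ch in source]

    # Count inversions of perm with a Fenwick tree over the values 0..n-1.
    n = len(perm)
    tree = [0] * (n + 1)
    inversions = 0
    for seen, value in enumerate(perm):
        # Number of earlier values that are <= value (prefix sum query).
        i = value + 1
        not_greater = 0
        while i > 0:
            not_greater += tree[i]
            i -= i & -i
        # Every other earlier value is greater, i.e. forms an inversion.
        inversions += seen - not_greater
        # Record the current value in the tree.
        i = value + 1
        while i <= n:
            tree[i] += 1
            i += i & -i
    return inversions


print(min_adjacent_swaps("AABCDDD", "DDDBCAA"))  # 16
```

The likely reason for the 30 minute case in the question is list.pop(0), which shifts the whole list on every call; for a word made mostly of one letter that becomes quadratic.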
Given a string, let's say "TATA__", I need to find the total number of differences between adjacent characters in that string, i.e. there is a difference between T and A, but not a difference between A and A, or _ and _.
My code more or less tells me this. But when a string such as "TTAA__" is given, it doesn't work as planned.
I need to take a character in that string, and check if the character next to it is not equal to the first character. If it is indeed not equal, I need to add 1 to a running count. If it is equal, nothing is added to the count.
This is what I have so far:
def num_diffs(state):
    count = 0
    for char in state:
        if char != state[char2]:
            count += 1
        char2 += 1
    return count
When I run it using num_diffs("TATA__") I get 4 as the response. When I run it with num_diffs("TTAA__") I also get 4. Whereas the answer should be 2.
If any of that makes sense at all, could anyone help in fixing it or pointing out where my error lies? I have a feeling it has to do with state[char2]. Sorry if this seems like a trivial problem, it's just that I'm totally new to the Python language.
import operator
def num_diffs(state):
    return sum(map(operator.ne, state, state[1:]))
To open this up a bit: it maps != (operator.ne) over state and over state starting at the 2nd character. The map function accepts multiple iterables as arguments and passes their elements one by one as positional arguments to the given function, until one of the iterables is exhausted (state[1:] in this case will stop first).
The map results in an iterable of boolean values, but since bool in python inherits from int you can treat it as such in some contexts. Here we are interested in the True values, because they represent the points where the adjacent characters differed. Calling sum over that mapping is an obvious next step.
Apart from the string slicing the whole thing runs using iterators in python3. It is possible to use iterators over the string state too, if one wants to avoid slicing huge strings:
import operator
from itertools import islice
def num_diffs(state):
    return sum(map(operator.ne,
                   state,
                   islice(state, 1, len(state))))
There are a couple of ways you might do this.
First, you could iterate through the string using an index, and compare each character with the character at the previous index.
Second, you could keep track of the previous character in a separate variable. The second seems closer to your attempt.
def num_diffs(s):
    count = 0
    prev = None
    for ch in s:
        if prev is not None and prev!=ch:
            count += 1
        prev = ch
    return count
prev is the character from the previous loop iteration. You set it to ch (the current character) at the end of each iteration so it will be available in the next.
You might want to investigate Python's groupby function which helps with this kind of analysis.
from itertools import groupby
def num_diffs(seq):
    return len(list(groupby(seq))) - 1
for test in ["TATA__", "TTAA__"]:
    print(test, num_diffs(test))
This would display:
TATA__ 4
TTAA__ 2
The groupby() function works by grouping consecutive identical entries together. It returns a key and a group, the key being the matching single entry, and the group being an iterator over the matching entries. So each new group after the first marks a difference, which is why the function subtracts 1 from the number of groups.
Trying to make as few modifications to your original code as possible:
def num_diffs(state):
    count = 0
    for char2 in range(1, len(state)):
        if state[char2 - 1] != state[char2]:
            count += 1
    return count
One of the problems with your original code was that the char2 variable was not initialized within the body of the function, so it was impossible to predict the function's behaviour.
However, working with indices is not the most Pythonic way and it is error-prone (see comments for a mistake that I made). You may want to rewrite the function so that it makes one pass over a pair of strings, one pair of characters at a time:
def num_diffs(state):
    count = 0
    for char1, char2 in zip(state[:-1], state[1:]):
        if char1 != char2:
            count += 1
    return count
Finally, that very logic can be written much more succinctly; see @Ilja's answer.
I had a question where I had to find contiguous substrings of a string, and the condition was the first and last letters of the substring had to be same. I tried doing it, but the runtime exceed the time-limit for the question for several test cases. I tried using map for a for loop, but I have no idea what to do for the nested for loop. Can anyone please help me to decrease the runtime of this program?
n = int(raw_input())
string = str(raw_input())
def get_substrings(string):
    length = len(string)
    list = []
    for i in range(length):
        for j in range(i,length):
            list.append(string[i:j + 1])
    return list
substrings = get_substrings(string)
contiguous = filter(lambda x: (x[0] == x[len(x) - 1]), substrings)
print len(contiguous)
If I understand the question properly (please let me know if that's not the case), try this:
I'm not sure this will speed up the runtime, but I believe this algorithm may, especially for longer strings (it eliminates the nested loop). Iterate through the string once, storing the index (position) of each character in a data structure with constant-time lookup (a hashmap, or an array if set up properly). When finished, you should have a data structure storing all the locations of every character. Using this you can easily retrieve the substrings.
Example:
codingisfun
Take the letter i, for example: after doing what I said above, you look it up in the hashmap and see that it occurs at indexes 3 and 6, meaning you can do something like substring(3, 6) to get it.
Not the best code, but it seems reasonable as a starting point. You may be able to eliminate a loop with some creative thinking:
import string
import itertools
my_string = 'helloilovetocode'
mappings = dict()
for index, char in enumerate(my_string):
    if not mappings.has_key(char):
        mappings[char] = list()
    mappings[char].append(index)
    print char
for char in mappings:
    if len(mappings[char]) > 1:
        for subset in itertools.combinations(mappings[char], 2):
            print my_string[subset[0]:(subset[1]+1)]
The problem is that your code is far too inefficient in terms of algorithmic complexity.
Here's an alternative (a cleaner but slightly slower version of soliman's I believe)
import collections
def index_str(s):
    """
    returns the indices characters show up at
    """
    indices = collections.defaultdict(list)
    for index, char in enumerate(s):
        indices[char].append(index)
    return indices
def get_substrings(s):
    indices = index_str(s)
    for key, index_lst in indices.items():
        num_indices = len(index_lst)
        for i in range(num_indices):
            for j in range(i, num_indices):
                yield s[index_lst[i]: index_lst[j] + 1]
The algorithmic problem with your solution is that you blindly check every possible substring, when you can determine the actual matching pairs in a single linear-time pass. If you only want the count, it can be determined in O(MN) time for a string of length N with M unique characters (given the number of occurrences of a character, you can work out mathematically how many substrings there are). Of course, in the worst case (all characters the same) your code has the same complexity as ours, but in the average case yours is much worse, since the nested for loop always takes n^2 time.
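If only the count is needed, the "figure it out mathematically" step can be sketched as follows. A character that occurs c times contributes c single-character substrings plus c*(c-1)/2 substrings that start and end at two different occurrences, which is c*(c+1)/2 in total. The function name is made up for illustration.

```python
from collections import Counter


def count_same_end_substrings(s):
    """Count substrings whose first and last characters are equal."""
    # c occurrences of a character give c*(c+1)/2 (start, end) pairs with start <= end.
    return sum(c * (c + 1) // 2 for c in Counter(s).values())


print(count_same_end_substrings("codingisfun"))  # 13
```

This makes one pass over the string to count characters and then one pass over the distinct characters, with no substrings built at all.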
I'm iterating through 3 words, each about 5 million characters long, and I want to find sequences of 20 characters that identify each word. That is, I want to find all sequences of length 20 in one word that are unique to that word. My problem is that the code I've written takes an extremely long time to run. I've never even completed one word, running my program overnight.
The function below takes a list containing dictionaries where each dictionary contains each possible word of 20 and its location from one of the 5 million long words.
If anybody has an idea how to optimize this I would be really thankful, I don't have a clue how to continue...
here's a sample of my code:
def findUnique(list):
    # Takes a list of dictionaries, compares each element in the dictionaries
    # with the others, puts all unique elements in new dictionaries and finally
    # puts the new dictionaries in a list.
    # The result is a list with (in this case) 3 dictionaries containing all unique
    # sequences and their locations from each string.
    dicList=[]
    listlength=len(list)
    s=0
    valuelist=[]
    for i in list:
        j=i.values()
        valuelist.append(j)
    while s<listlength:
        currdic=list[s]
        dic={}
        for key in currdic:
            currval=currdic[key]
            test=True
            n=0
            while n<listlength:
                if n!=s:
                    if currval in valuelist[n]: #this is where it takes too much time
                        n=listlength
                        test=False
                    else:
                        n+=1
                else:
                    n+=1
            if test:
                dic[key]=currval
        dicList.append(dic)
        s+=1
    return dicList
def slices(seq, length, prefer_last=False):
    unique = {}
    if prefer_last: # this doesn't have to be a parameter, just choose one
        for start in xrange(len(seq) - length + 1):
            unique[seq[start:start+length]] = start
    else: # prefer first
        for start in xrange(len(seq) - length, -1, -1):
            unique[seq[start:start+length]] = start
    return unique
# or find all locations for each slice:
import collections
def slices(seq, length):
    unique = collections.defaultdict(list)
    for start in xrange(len(seq) - length + 1):
        unique[seq[start:start+length]].append(start)
    return unique
This function (currently in my iter_util module) is O(n) (n being the length of each word) and you would use set(slices(..)) (with set operations such as difference) to get slices unique across all words (example below). You could also write the function to return a set, if you don't want to track locations. Memory usage will be high (though still O(n), just a large factor), possibly mitigated (though not by much if length is only 20) with a special "lazy slice" class that stores the base sequence (the string) plus start and stop (or start and length).
Printing unique slices:
a = set(slices("aab", 2)) # {"aa", "ab"}
b = set(slices("abb", 2)) # {"ab", "bb"}
c = set(slices("abc", 2)) # {"ab", "bc"}
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (x for x in all if x is not a), a)
print a_unique # {"aa"}
Including locations:
a = slices("aab", 2)
b = slices("abb", 2)
c = slices("abc", 2)
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (set(x) for x in all if x is not a), set(a))
# a_unique is only the keys so far
a_unique = dict((k, a[k]) for k in a_unique)
# now it's a dict of slice -> location(s)
print a_unique # {"aa": 0} or {"aa": [0]}
# (depending on which slices function used)
In a test script closer to your conditions, using randomly generated words of 5m characters and a slice length of 20, memory usage was so high that my test script quickly hit my 1G main memory limit and started thrashing virtual memory. At that point Python spent very little time on the CPU and I killed it. Reducing either the slice length or the word length (I used completely random words, which reduces duplicates and increases memory use) until the data fit within main memory let it run in under a minute. Memory pressure like this, combined with the O(n**2) behaviour of your original code, will take forever, which is why both time and space complexity matter.
import operator
import random
import string
def slices(seq, length):
    unique = {}
    for start in xrange(len(seq) - length, -1, -1):
        unique[seq[start:start+length]] = start
    return unique
def sample_with_repeat(population, length, choice=random.choice):
    return "".join(choice(population) for _ in xrange(length))
word_length = 5*1000*1000
words = [sample_with_repeat(string.lowercase, word_length) for _ in xrange(3)]
slice_length = 20
words_slices_sets = [set(slices(x, slice_length)) for x in words]
unique_words_slices = [reduce(operator.sub,
                              (x for x in words_slices_sets if x is not n),
                              n)
                       for n in words_slices_sets]
print [len(x) for x in unique_words_slices]
You say you have a "word" 5 million characters long, but I find it hard to believe this is a word in the usual sense.
If you can provide more information about your input data, a specific solution might be available.
For example, English text (or any other written language) might be sufficiently repetitive that a trie would be useable. In the worst case however, it would run out of memory constructing all 256^20 keys. Knowing your inputs makes all the difference.
edit
I took a look at some genome data to see how this idea stacked up, using a hardcoded [acgt]->[0123] mapping and 4 children per trie node.
adenovirus 2: 35,937bp -> 35,899 distinct 20-base sequences using 469,339 trie nodes
enterobacteria phage lambda: 48,502bp -> 40,921 distinct 20-base sequences using 529,384 trie nodes.
I didn't get any collisions, either within or between the two data sets, although maybe there is more redundancy and/or overlap in your data. You'd have to try it to see.
If you do get a useful number of collisions, you could try walking the three inputs together, building a single trie, recording the origin of each leaf and pruning collisions from the trie as you go.
If you can't find some way to prune the keys, you could try using a more compact representation. For example you only need 2 bits to store [acgt]/[0123], which might save you space at the cost of slightly more complex code.
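As an illustration of the 2-bit representation, here is a minimal sketch that packs a lowercase a/c/g/t sequence into a single integer and unpacks it again. The names pack_bases and unpack_bases are made up for this example, and input containing any other character (such as n for an unknown base) is assumed not to occur.

```python
BASES = "acgt"  # a -> 0, c -> 1, g -> 2, t -> 3


def pack_bases(seq):
    """Pack a string over 'acgt' into an integer, 2 bits per base."""
    code = 0
    for ch in seq:
        code = (code << 2) | BASES.index(ch)
    return code


def unpack_bases(code, length):
    """Inverse of pack_bases; length is needed because leading 'a's pack to zero bits."""
    return "".join(BASES[(code >> (2 * i)) & 3] for i in reversed(range(length)))


print(pack_bases("acgt"))    # 27, i.e. 0b00011011
print(unpack_bases(27, 4))   # acgt
```

A 20-base sequence fits in 40 bits this way, so distinct sequences always get distinct codes.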
I don't think you can just brute force this though - you need to find some way to reduce the scale of the problem, and that depends on your domain knowledge.
Let me build off Roger Pate's answer. If memory is an issue, I'd suggest instead of using the strings as the keys to the dictionary, you could use a hashed value of the string. This would save the cost of the storing the extra copy of the strings as the keys (at worst, 20 times the storage of an individual "word").
import collections
def hashed_slices(seq, length, hasher=hash):
    # hasher defaults to the built-in hash; pass your own function to control collisions
    unique = collections.defaultdict(list)
    for start in xrange(len(seq) - length + 1):
        unique[hasher(seq[start:start+length])].append(start)
    return unique
(If you really want to get fancy, you can use a rolling hash, though you'll need to change the function.)
Now, we can combine all the hashes :
unique = [] # Unique words in first string
# create a dictionary of hash values -> word index -> start position
hashed_starts = [hashed_slices(word, 20, hashing_fcn) for word in words]
all_hashed = collections.defaultdict(dict)
for i, hashed in enumerate(hashed_starts) :
    for h, starts in hashed.iteritems() :
        # We only care about the first word
        if h in hashed_starts[0] :
            all_hashed[h][i]=starts
# Now check all hashes
for starts_by_word in all_hashed.itervalues() :
    if len(starts_by_word) == 1 :
        # if there's only one word for the hash, it must be word 0, so the slices are valid
        # (a set removes repeats of the same slice at different positions)
        unique.extend(set(words[0][i:i+20] for i in starts_by_word[0]))
    else :
        # we might have a hash collision
        candidates = {}
        for word_idx, starts in starts_by_word.iteritems() :
            candidates[word_idx] = set(words[word_idx][j:j+20] for j in starts)
        # Now that we have the candidate slices, find the unique ones
        valid = candidates[0]
        for word_idx, candidate_set in candidates.iteritems() :
            if word_idx != 0 :
                valid -= candidate_set
        unique.extend(valid)
(I tried extending it to do all three. It's possible, but the complications would detract from the algorithm.)
Be warned, I haven't tested this. Also, there's probably a lot you can do to simplify the code, but the algorithm makes sense. The hard part is choosing the hash. Too many collisions and you won't gain anything. Too few and you'll hit the memory problems. If you are dealing with just DNA base codes, you can hash the 20-character string to a 40-bit number and still have no collisions. So the slices will take up nearly a fourth of the memory. That would save roughly 250 MB of memory in Roger Pate's answer.
The code is still O(N^2), but the constant should be much lower.
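For the DNA case, the rolling hash mentioned above can be sketched like this, in Python 3. Each window's 40-bit code is derived from the previous one by shifting in the next base and masking off the oldest, so no 20-character slice is ever built. The function name rolling_codes and the lowercase a/c/g/t alphabet are assumptions for the example.

```python
BASE_BITS = {"a": 0, "c": 1, "g": 2, "t": 3}


def rolling_codes(seq, length=20):
    """Yield (start, code) for every window of the given length in seq."""
    mask = (1 << (2 * length)) - 1  # keeps only the last `length` bases
    code = 0
    for i, ch in enumerate(seq):
        # Shift in the new base; the mask drops the base that left the window.
        code = ((code << 2) | BASE_BITS[ch]) & mask
        if i >= length - 1:
            yield i - length + 1, code


print(list(rolling_codes("acgt", 2)))  # [(0, 1), (1, 6), (2, 11)]
```

Because the code is an exact encoding of the window and not a lossy hash, equal codes mean equal slices, and the collision-handling branch is not needed for this alphabet.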
Let's attempt to improve on Roger Pate's excellent answer.
Firstly, let's keep sets instead of dictionaries - they manage uniqueness anyway.
Secondly, since we are likely to run out of memory faster than we run out of CPU time (and patience), we can sacrifice CPU efficiency for the sake of memory efficiency. So perhaps try only the 20s starting with one particular letter. For DNA, this cuts the requirements down by 75%.
seqlen = 20
maxlength = max([len(word) for word in words])
for startletter in letters:
    seqtrie = {}  # assumed: reset per pass so only one letter's sequences are held at a time
    for letterid in range(maxlength):
        for wordid, word in enumerate(words):
            if letterid < len(word):
                letter = word[letterid]
                if letter == startletter:
                    seq = word[letterid:letterid+seqlen]
                    if wordid not in seqtrie.setdefault(seq, []):
                        seqtrie[seq].append(wordid)
Or, if that's still too much memory, we can go through for each possible starting pair (16 passes instead of 4 for DNA), or every 3 (64 passes) etc.
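A self-contained sketch of the one-pass-per-starting-letter idea, in Python 3, might look like this. The function name unique_slices_by_prefix and the owners dictionary are made up for illustration. Only the slices beginning with the current letter are held in the working dictionary at any time, although the results for all letters are still accumulated at the end.

```python
def unique_slices_by_prefix(words, alphabet, seqlen=20):
    """For each word, collect the slices of length seqlen found in no other word."""
    unique = [set() for _ in words]
    for start in alphabet:
        # slice -> index of the only word seen containing it, or None once shared
        owners = {}
        for wordid, word in enumerate(words):
            for pos in range(len(word) - seqlen + 1):
                if word[pos] != start:
                    continue  # this slice belongs to another pass
                seq = word[pos:pos + seqlen]
                if seq not in owners:
                    owners[seq] = wordid
                elif owners[seq] != wordid:
                    owners[seq] = None  # seen in two different words
        for seq, wordid in owners.items():
            if wordid is not None:
                unique[wordid].add(seq)
    return unique


print(unique_slices_by_prefix(["aab", "abb", "abc"], "abc", seqlen=2))
# [{'aa'}, {'bb'}, {'bc'}]
```

Passing all two-letter prefixes instead of single letters would need a check on word[pos:pos + 2] in place of word[pos], at the cost of more passes over the data.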