Python, Huge Iteration Performance Problem

I'm iterating through 3 words, each about 5 million characters long, and I want to find the sequences of 20 characters that identify each word. That is, I want to find all sequences of length 20 in one word that are unique to that word. My problem is that the code I've written takes an extremely long time to run; I've never even finished one word running my program overnight.
The function below takes a list containing dictionaries, where each dictionary contains every possible 20-character sequence and its location, built from one of the 5-million-character words.
If anybody has an idea how to optimize this I would be really thankful; I don't have a clue how to continue.
Here's a sample of my code:
def findUnique(list):
    # Takes a list with dictionaries and compares each element in the dictionaries
    # with the others and puts all unique elements in new dictionaries and finally
    # puts the new dictionaries in a list.
    # The result is a list with (in this case) 3 dictionaries containing all unique
    # sequences and their locations from each string.
    dicList=[]
    listlength=len(list)
    s=0
    valuelist=[]
    for i in list:
        j=i.values()
        valuelist.append(j)
    while s<listlength:
        currdic=list[s]
        dic={}
        for key in currdic:
            currval=currdic[key]
            test=True
            n=0
            while n<listlength:
                if n!=s:
                    if currval in valuelist[n]:  # this is where it takes too much time
                        n=listlength
                        test=False
                    else:
                        n+=1
                else:
                    n+=1
            if test:
                dic[key]=currval
        dicList.append(dic)
        s+=1
    return dicList

def slices(seq, length, prefer_last=False):
    unique = {}
    if prefer_last:  # this doesn't have to be a parameter, just choose one
        for start in xrange(len(seq) - length + 1):
            unique[seq[start:start+length]] = start
    else:  # prefer first
        for start in xrange(len(seq) - length, -1, -1):
            unique[seq[start:start+length]] = start
    return unique

# or find all locations for each slice:
import collections
def slices(seq, length):
    unique = collections.defaultdict(list)
    for start in xrange(len(seq) - length + 1):
        unique[seq[start:start+length]].append(start)
    return unique
This function (currently in my iter_util module) is O(n) (n being the length of each word) and you would use set(slices(..)) (with set operations such as difference) to get slices unique across all words (example below). You could also write the function to return a set, if you don't want to track locations. Memory usage will be high (though still O(n), just a large factor), possibly mitigated (though not by much if length is only 20) with a special "lazy slice" class that stores the base sequence (the string) plus start and stop (or start and length).
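For illustration, here is a minimal sketch of what such a lazy-slice class might look like (my own hypothetical addition, not from the measurements below); hashing and equality still build the 20-character string temporarily, but only the base string, start and length are stored:

class LazySlice(object):
    # Hypothetical sketch: refer to a window of `seq` without storing a copy of it.
    __slots__ = ('seq', 'start', 'length')

    def __init__(self, seq, start, length):
        self.seq = seq
        self.start = start
        self.length = length

    def materialize(self):
        # Builds the actual slice only when needed (hashing, comparing, printing).
        return self.seq[self.start:self.start + self.length]

    def __hash__(self):
        return hash(self.materialize())

    def __eq__(self, other):
        if isinstance(other, LazySlice):
            other = other.materialize()
        return self.materialize() == other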
Printing unique slices:
a = set(slices("aab", 2)) # {"aa", "ab"}
b = set(slices("abb", 2)) # {"ab", "bb"}
c = set(slices("abc", 2)) # {"ab", "bc"}
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (x for x in all if x is not a), a)
print a_unique # {"aa"}
Including locations:
a = slices("aab", 2)
b = slices("abb", 2)
c = slices("abc", 2)
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (set(x) for x in all if x is not a), set(a))
# a_unique is only the keys so far
a_unique = dict((k, a[k]) for k in a_unique)
# now it's a dict of slice -> location(s)
print a_unique # {"aa": 0} or {"aa": [0]}
# (depending on which slices function used)
In a test script closer to your conditions, using randomly generated words of 5 million characters and a slice length of 20, memory usage was so high that my test script quickly hit my 1G main-memory limit and started thrashing virtual memory. At that point Python spent very little time on the CPU and I killed it. Reducing either the slice length or the word length (since I used completely random words, which reduces duplicates and increases memory use) enough to fit within main memory, it ran in under a minute. This situation, plus the O(n**2) behaviour of your original code, is why it takes forever, and why algorithmic time and space complexity are both important.
import operator
import random
import string

def slices(seq, length):
    unique = {}
    for start in xrange(len(seq) - length, -1, -1):
        unique[seq[start:start+length]] = start
    return unique

def sample_with_repeat(population, length, choice=random.choice):
    return "".join(choice(population) for _ in xrange(length))

word_length = 5*1000*1000
words = [sample_with_repeat(string.lowercase, word_length) for _ in xrange(3)]
slice_length = 20
words_slices_sets = [set(slices(x, slice_length)) for x in words]
unique_words_slices = [reduce(operator.sub,
                              (x for x in words_slices_sets if x is not n),
                              n)
                       for n in words_slices_sets]
print [len(x) for x in unique_words_slices]

You say you have a "word" 5 million characters long, but I find it hard to believe this is a word in the usual sense.
If you can provide more information about your input data, a specific solution might be available.
For example, English text (or any other written language) might be sufficiently repetitive that a trie would be usable. In the worst case, however, it would run out of memory constructing all 256^20 keys. Knowing your inputs makes all the difference.
edit
I took a look at some genome data to see how this idea stacked up, using a hardcoded [acgt]->[0123] mapping and 4 children per trie node.
adenovirus 2: 35,937bp -> 35,899 distinct 20-base sequences using 469,339 trie nodes
enterobacteria phage lambda: 48,502bp -> 40,921 distinct 20-base sequences using 529,384 trie nodes.
I didn't get any collisions, either within or between the two data sets, although maybe there is more redundancy and/or overlap in your data. You'd have to try it to see.
If you do get a useful number of collisions, you could try walking the three inputs together, building a single trie, recording the origin of each leaf and pruning collisions from the trie as you go.
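As a rough illustration of that walk (my own sketch, without the pruning step, using plain dicts for nodes rather than the fixed 4-child nodes above, and assuming `words` is the list of input strings):

def add_to_trie(root, seq, word_id):
    # Walk/extend one path of 20 nodes, then record which word this 20-mer came from.
    node = root
    for base in seq:
        node = node.setdefault(base, {})
    node.setdefault('origins', set()).add(word_id)

trie = {}
for word_id, word in enumerate(words):
    for start in range(len(word) - 20 + 1):
        add_to_trie(trie, word[start:start + 20], word_id)
# A 20-mer is unique to one word when its leaf's 'origins' set has exactly one member.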
If you can't find some way to prune the keys, you could try using a more compact representation. For example you only need 2 bits to store [acgt]/[0123], which might save you space at the cost of slightly more complex code.
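For example (a hypothetical sketch, not part of the measurements above), a 20-base [acgt] sequence fits in a 40-bit integer at 2 bits per base:

BASE_CODES = {'a': 0, 'c': 1, 'g': 2, 't': 3}

def pack_bases(seq):
    # 2 bits per base, so a 20-base sequence becomes a 40-bit integer key.
    code = 0
    for base in seq:
        code = (code << 2) | BASE_CODES[base]
    return code

print(pack_bases('acgt'))  # 0b00011011 == 27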
I don't think you can just brute force this though - you need to find some way to reduce the scale of the problem, and that depends on your domain knowledge.

Let me build off Roger Pate's answer. If memory is an issue, I'd suggest that instead of using the strings as the keys to the dictionary, you could use a hashed value of the string. This would save the cost of storing the extra copy of the strings as the keys (at worst, 20 times the storage of an individual "word").
import collections
def hashed_slices(seq, length, hasher=None):
    unique = collections.defaultdict(list)
    for start in xrange(len(seq) - length + 1):
        unique[hasher(seq[start:start+length])].append(start)
    return unique
(If you really want to get fancy, you can use a rolling hash, though you'll need to change the function.)
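As an aside, a Rabin-Karp style polynomial rolling hash can produce every window's hash in one linear pass. This is my own hedged sketch, not part of the answer's code; the base and modulus are arbitrary choices:

def rolling_hashes(seq, length, base=257, mod=(1 << 61) - 1):
    # Yields (start, hash) for every window of `length`, updating the hash
    # in O(1) per step instead of rehashing each 20-character slice.
    if len(seq) < length:
        return
    h = 0
    for ch in seq[:length]:
        h = (h * base + ord(ch)) % mod
    yield 0, h
    top = pow(base, length - 1, mod)
    for start in range(1, len(seq) - length + 1):
        h = (h - ord(seq[start - 1]) * top) % mod
        h = (h * base + ord(seq[start + length - 1])) % mod
        yield start, h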
Now, we can combine all the hashes:
unique = []  # Unique words in first string
# create a dictionary of hash values -> word index -> start positions
hashed_starts = [hashed_slices(word, 20, hashing_fcn) for word in words]
all_hashed = collections.defaultdict(dict)
for i, hashed in enumerate(hashed_starts):
    for h, starts in hashed.iteritems():
        # We only care about hashes that occur in the first word
        if h in hashed_starts[0]:
            all_hashed[h][i] = starts

# Now check all hashes
for starts_by_word in all_hashed.itervalues():
    if len(starts_by_word) == 1:
        # if there's only one word for the hash, it's obviously valid
        unique.extend(words[0][start:start + 20] for start in starts_by_word[0])
    else:
        # we might have a hash collision
        candidates = {}
        for word_idx, starts in starts_by_word.iteritems():
            candidates[word_idx] = set(words[word_idx][j:j + 20] for j in starts)
        # Now that we have the candidate slices, find the unique ones
        valid = candidates[0]
        for word_idx, candidate_set in candidates.iteritems():
            if word_idx != 0:
                valid -= candidate_set
        unique.extend(valid)
(I tried extending it to do all three. It's possible, but the complications would detract from the algorithm.)
Be warned, I haven't tested this. Also, there's probably a lot you can do to simplify the code, but the algorithm makes sense. The hard part is choosing the hash. Too many collisions and you won't gain anything. Too few and you'll hit the memory problems. If you are dealing with just DNA base codes, you can hash the 20-character string to a 40-bit number and still have no collisions. So the slices will take up nearly a fourth of the memory. That would save roughly 250 MB of memory in Roger Pate's answer.
The code is still O(N^2), but the constant should be much lower.

Let's attempt to improve on Roger Pate's excellent answer.
Firstly, let's keep sets instead of dictionaries - they manage uniqueness anyway.
Secondly, since we are likely to run out of memory faster than we run out of CPU time (and patience), we can sacrifice CPU efficiency for the sake of memory efficiency. So perhaps try only the 20s starting with one particular letter. For DNA, this cuts the requirements down by 75%.
import collections

seqlen = 20
seqtrie = collections.defaultdict(list)  # sequence -> ids of the words that contain it
maxlength = max(len(word) for word in words)
letters = 'acgt'  # the alphabet; one pass per starting letter
for startletter in letters:
    for letterid in range(maxlength):
        for wordid, word in enumerate(words):
            if letterid < len(word):
                letter = word[letterid]
                if letter == startletter:
                    seq = word[letterid:letterid + seqlen]
                    if wordid not in seqtrie[seq]:
                        seqtrie[seq].append(wordid)
Or, if that's still too much memory, we can make one pass for each possible starting pair (16 passes instead of 4 for DNA), or for every starting triple (64 passes), etc.
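A rough sketch of what a per-prefix pass might look like (my own illustration, assuming `words` is the list of input strings as before):

import itertools

seqlen = 20
for prefix in (''.join(p) for p in itertools.product('acgt', repeat=2)):
    # One pass per two-letter prefix: only ~1/16 of the candidate 20-mers
    # are held in memory at any one time.
    seqs = {}
    for wordid, word in enumerate(words):
        for start in range(len(word) - seqlen + 1):
            if word.startswith(prefix, start):
                seqs.setdefault(word[start:start + seqlen], set()).add(wordid)
    # ...keep the entries of `seqs` whose word-id set has exactly one member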

Related

Run time difference for "in" searching through "list" and "set" using Python

My understanding of list and set in Python is mainly that a list allows duplicates, a list is ordered, and a list has position information. I found, while trying to search whether an element is within a list, that the runtime is much faster if I convert the list to a set first. For example, I wrote code trying to find the longest consecutive sequence in a list. Using a list from 0 to 10000 as an example, the longest consecutive sequence is 10001. While using a list:
from datetime import datetime

start_time = datetime.now()
nums = list(range(10000))
longest = 0
for number in nums:
    if number - 1 not in nums:
        length = 0
        ## Search if number + 1 also in the list ##
        while number + length in nums:
            length += 1
        longest = max(length, longest)
end_time = datetime.now()
timecost = 'Duration: {}'.format(end_time - start_time)
print(timecost)
The run time for above code is "Duration: 0:00:01.481939"
Adding only one line to convert the list to a set (the nums = set(nums) line below):
from datetime import datetime

start_time = datetime.now()
nums = list(range(10000))
nums = set(nums)
longest = 0
for number in nums:
    if number - 1 not in nums:
        length = 0
        ## Search if number + 1 also in the set (was a list) ##
        while number + length in nums:
            length += 1
        longest = max(length, longest)
end_time = datetime.now()
timecost = 'Duration: {}'.format(end_time - start_time)
print(timecost)
The run time for the above code using a set is now "Duration: 0:00:00.005138", many times shorter than searching through a list. Could anyone help me understand the reason for that? Thank you!
This is a great question.
The issue with arrays is that there is no smarter way to search in some array a besides just comparing every element one by one.
Sometimes you'll get lucky and get a match on the first element of a.
Sometimes you'll get unlucky and not find a match until the last element of a, or perhaps none at all.
On average, you'll have to search half the elements of the array each time.
This is said to have a "time complexity" of O(len(a)), or colloquially, O(n). This means the time taken by the algorithm (searching for a value in the array) is linearly proportional to the size of the input (the number of elements in the array to be searched). This is why it's called "linear search". Oh, your array got 2x bigger? Well, your searches just got 2x slower. 1000x bigger? 1000x slower.
Arrays are great, but they're 💩 for searching if the element count gets too high.
Sets are clever. In Python, they're implemented as if they were a dictionary with keys and no values. Like dictionaries, they're backed by a data structure called a hash table.
A hash table uses the hash of a value as a quick way to get a "summary" of an object. This "summary" is then used to narrow down its search, so it only needs to linearly search a very small subset of all the elements. Searching in a hash table has a time complexity of O(1). Notice that there's no "n" or len(the_set) in there. That's because the time taken to search in a hash table does not grow as the size of the hash table grows. So it's said to have constant time complexity.
By analogy, you only search the dairy aisle when you're looking for milk. You know the hash value of milk (say, its aisle) is "dairy" and not "deli", so you don't have to waste any time searching the other aisles.
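A toy sketch of that idea (real Python sets are far more sophisticated; this just shows how the hash narrows the search):

buckets = [[] for _ in range(8)]

def insert(value):
    # The hash picks a bucket; only that one bucket ever needs to be scanned.
    buckets[hash(value) % 8].append(value)

def contains(value):
    return value in buckets[hash(value) % 8]

for word in ("milk", "bread", "cheese"):
    insert(word)
print(contains("milk"))   # True
print(contains("eggs"))   # False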
A natural follow-up question is "then why don't we always use sets?". Well, there's a trade-off.
As you mentioned, sets can't contain duplicates, so if you want to store two of something, they're a non-starter.
Sets are also unordered, so if you care about the order of elements, they won't do either.
Sets generally have a larger CPU/memory overhead, which adds up when you use many sets containing small numbers of elements.
Also, it's possible that because of fancy CPU features (like CPU caches and branch prediction), linear searching through small arrays can actually be faster than the hash-based lookup in sets.
I'd recommend you do some further reading into data structures and algorithms. This stuff is quite language-independent. Now that you know that set and dict use a hash table behind the scenes, you can look up resources that cover hash tables in any language, and that should help. There are also some Python-centric resources, like https://www.interviewcake.com/concept/python/hash-map

Most efficient way to get first value that startswith of large list

I have a very large list with over 100M strings. An example of that list looks as follows:
l = ['1,1,5.8067',
'1,2,4.9700',
'2,2,3.9623',
'2,3,1.9438',
'2,7,1.0645',
'3,3,8.9331',
'3,5,2.6772',
'3,7,3.8107',
'3,9,7.1008']
I would like to get the first string that starts with e.g. '3'.
To do so, I have used a lambda iterator followed by next() to get the first item:
next(filter(lambda i: i.startswith('3,'), l))
Out[1]: '3,3,8.9331'
Considering the size of the list, this strategy unfortunately still takes a relatively long time for a process I have to do over and over again. I was wondering if someone could come up with an even faster, more efficient approach. I am open to alternative strategies.
I have no way of testing it myself, but it is possible that if you join all the strings with a char that does not appear in any of the strings:
concat_list = '$'.join(l)
and then use a simple .find('$3,'), it will be faster. That is likely if all the strings are relatively short, since the whole text is now in one contiguous place in memory.
If the number of unique letters in the text is small, you can use the Abrahamson-Kosaraju method and get a time complexity of practically O(n).
Another approach is to use joblib: create n threads where the i-th thread checks elements i + k * n; when one finds the pattern, it stops the others. So the time complexity is O(naive algorithm / n).
Since your actual strings consist of relatively short tokens (such as 301) after splitting the strings by tabs, you can build a dict with every possible prefix of the first token as the keys, so that subsequent lookups take only O(1) average time.
Build the dict from the values of the list in reverse order, so that the first value in the list that starts with each distinct prefix is the one retained in the final dict:
d = {s[:i + 1]: s for s in reversed(l) for i in range(len(s.split('\t')[0]))}
so that given:
l = ['301\t301\t51.806763\n', '301\t302\t46.970094\n',
'301\t303\t39.962393\n', '301\t304\t18.943836\n',
'301\t305\t11.064584\n', '301\t306\t4.751911\n']
d['3'] will return '301\t301\t51.806763'.
If you only need to test each of the first tokens as a whole, rather than prefixes, you can simply make the first tokens as the keys instead:
d = {s.split('\t')[0]: s for s in reversed(l)}
so that d['301'] will return '301\t301\t51.806763'.

Performance task in Python

We have IT olympics in my country. Normally the tasks are solved in Java, C, or C++. For a year or so they have also included other languages, like Python.
I tried to solve a task from previous years in Python, called Letters, and I'm constantly failing. The task is to write code that counts the minimum number of swaps of neighboring letters needed to turn one string into another.
As input you get the number of letters in one string, and two strings with the same letters but in a different order. The length of one string is from 2 to 1,000,000 letters. There are only capital letters; they can (but don't have to) be sorted, and they can repeat.
Here's an example:
7
AABCDDD
DDDBCAA
The correct output should be 16.
As output you have to return a single value, which is the minimum number of swaps. It has to produce the output in under 5 seconds.
I made it calculate the correct output, but with longer strings (like 800,000 letters) it starts to slow down. The longest inputs return a value in about 30 seconds. There's also one input with 900,000 letters per word that takes 30 minutes!
Under link you can find all input files for tests:
https://oi.edu.pl/l/19oi_ksiazeczka/
Click on this link to download files for "Letters" task:
XIX OI testy i rozwiązania - zad. LIT (I etap) (3.5 MB)
Below is my code. How can I speed it up?
# import time
import sys
# start = time.time()

def file_reader():
    standard_input=""
    try:
        data = sys.stdin.readlines()
        for line in data:
            standard_input+=line
    except:
        print("An exception occurred")
    return standard_input

def mergeSortInversions(arr):
    if len(arr) == 1:
        return arr, 0
    else:
        a = arr[:len(arr)//2]
        b = arr[len(arr)//2:]
        a, ai = mergeSortInversions(a)
        b, bi = mergeSortInversions(b)
        c = []
        i = 0
        j = 0
        inversions = 0 + ai + bi
        while i < len(a) and j < len(b):
            if a[i] <= b[j]:
                c.append(a[i])
                i += 1
            else:
                c.append(b[j])
                j += 1
                inversions += (len(a)-i)
        c += a[i:]
        c += b[j:]
        return c, inversions

def literki():
    words=(file_reader()).replace("\n", "")
    number = int("".join((map(str, ([int(i) for i in list(words) if i.isdigit()])))))
    all_letters = [x for x in list(words) if x not in str(number)]
    name = all_letters[:number]
    anagram = all_letters[number:]
    p=[]
    index=list(range(len(anagram)))
    anagram_dict = {index[i]: anagram[i] for i in range(len(index))}
    new_dict = {}
    anagram_counts={}
    for key, value in anagram_dict.items():
        if value in new_dict:
            new_dict[value].append(key)
        else:
            new_dict[value]=[key]
    for i in new_dict:
        anagram_counts.update({i:new_dict[i]})
    for letter in name:
        a=anagram_counts[letter]
        p.append(a.pop(0))
    print(mergeSortInversions(p)[1])

#>>
literki()
# end = time.time()
# print(start-end)
So, to explain what it does in parts: file_reader simply reads the input from standard input. mergeSortInversions(arr) would normally sort a list, but here I wanted it to return the number of inversions. I'm not smart enough to have figured it out by myself; I found it on the web, but it does the job. Unfortunately, for 1-million-letter strings it takes 10 seconds or so. In the literki function: first, I divided the input into the number of letters and the two equal-length words, as lists.
Then, I made something similar in function to an array of stacks (not sure if that is what it's called in English). Basically, I made a dictionary with every letter as a key and the indexes of those letters as a list of values (if a letter occurs more than once, the value contains a list of all indexes for that letter). The last thing I did before "the slow thing": for every letter in the name variable I extracted the corresponding index. Up to that point, all operations for every input took around 2 seconds. And now the two lines that generate the rest of the time for calculating the outcome: I append the index to the p=[] list and at the same time pop it from the list in the dictionary, so it won't be read again for another occurrence of the same letter; and I calculate the number of moves (inversions) with mergeSortInversions(arr) based on the p=[...] list and print it as the output.
I know that popping from the front is slow, but on the other hand I would have to build the lists of indexes in reverse order (so I could pop an index from the end), and that took even longer. I've also tried converting a=[...] to a deque, but it was also too slow.
I think I'd try a genetic algorithm for this problem. GA's don't always come up with an optimal solution, but they are very good for getting an acceptable solution in a reasonable amount of time. And for small inputs, they can be optimal.
The gist is to come up with:
1) A fitness function that assigns a number indicating how good a particular candidate solution is
2) A sexual reproduction function, that combines, in a simple way, part of two candidate solutions
3) A mutation function, that introduces one small change to a candidate solution.
So you just let those functions go to town, creating solution after solution, and keeping the best ones - not the best one, the best ones.
Then after a while, the best solution found is your answer.
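A generic sketch of that loop (hypothetical; random_candidate, fitness, crossover and mutate are the problem-specific pieces you would have to design):

import random

def genetic_search(random_candidate, fitness, crossover, mutate,
                   pop_size=100, generations=500, mutation_rate=0.1):
    # Keep the best half each generation and refill the population with
    # mutated crossovers of surviving candidates.
    population = [random_candidate() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[:pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            child = crossover(a, b)
            if random.random() < mutation_rate:
                child = mutate(child)
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)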
Here's an example of using a GA for another hard problem, called The House Robber Problem. It's in Python:
http://stromberg.dnsalias.org/~strombrg/house-robber-problem/

Quick(er) way to check whether a word is english by comparing it against a whitelist of english words?

I am trying to eliminate all non-English words from many (100k) preprocessed text files (Porter-stemmed and lowercased, with all non a-z characters dropped). I have already parallelized the process to speed things up, but it is still painfully slow. Is there any more efficient way to do this in Python?
englishwords = list(set(nltk.corpus.words.words()))
englishwords = [x.lower() for x in list(englishwords)]
englishwords = [ps.stem(w) for w in englishwords]
# this step takes too long:
shareholderletter= ' '.join(w for w in nltk.wordpunct_tokenize(shareholderletter) if w in englishwords)
You are checking for something in otherthing, and your otherthing is a list.
Lists are good for storing stuff, but checking whether x is in a list takes O(n).
Use a set instead: that drops the lookup to O(1), and it eliminates any dupes, so the base size of the collection you look things up in shrinks as well if you have duplicates.
If your set does not change afterwards, go and use a frozenset - which is immutable.
Read: Documentation of sets
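Applied to the code in the question, that might be as small a change as (assuming ps is the stemmer from the question):

# Hypothetical one-line change: build the whitelist once as a frozenset,
# so every `w in englishwords` check is O(1).
englishwords = frozenset(ps.stem(w.lower()) for w in nltk.corpus.words.words())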
If you follow #DeepSpace 's suggestion and leverage set operations, you get even better performance:
s = set( t.lower().strip() for t in ["Some","text","in","set"])
t = set("Some text in a string that holds other words as well".lower().split())
print ( s&t ) # show me all things that are in both sets (aka intersection)
Output:
set(['text', 'some', 'in'])
See set operations
O(n): worst case, your word is the last of 200k words in your list and you check the whole list, which takes 200k checks.
O(1): the lookup time is constant; no matter how many items are in your data structure, it takes the same amount of time to check whether something is in it. To get this benefit, a set uses a more complex storage solution that takes slightly more memory (than a list) to perform so well on lookups.
Edit: worst case scenario for not finding a word inside a set/list:
import timeit
setupcode = """# list with some dupes
l = [str(i) for i in range(10000)] + [str(i) for i in range(10000)] + [str(i) for i in range(10000)]
# set of this list
s = set( l )
"""
print(timeit.timeit("""k = "10000" in l """,setup = setupcode, number=100))
print(timeit.timeit("""k = "10000" in s """,setup = setupcode, number=100))
0.03919574100000034 # checking 100 times if "10000" is in the list
0.00000512200000457 # checking 100 times if "10000" is in the set

need to decrease the run time of my program

I had a question where I had to find contiguous substrings of a string, and the condition was that the first and last letters of the substring had to be the same. I tried doing it, but the runtime exceeded the time limit for several test cases. I tried using map for one for loop, but I have no idea what to do about the nested for loop. Can anyone please help me decrease the runtime of this program?
n = int(raw_input())
string = str(raw_input())

def get_substrings(string):
    length = len(string)
    list = []
    for i in range(length):
        for j in range(i, length):
            list.append(string[i:j + 1])
    return list

substrings = get_substrings(string)
contiguous = filter(lambda x: (x[0] == x[len(x) - 1]), substrings)
print len(contiguous)
If I understand the question properly (please let me know if that's not the case), try this:
I'm not sure if this will speed up the runtime, but I believe this algorithm may, especially for longer strings, since it eliminates the nested loop. Iterate through the string once, storing the index (position) of each character in a data structure with constant-time lookup (a hashmap, or an array if set up properly). When finished, you should have a data structure storing all the different locations of every character. Using this you can easily retrieve the substrings.
Example:
codingisfun
Take the letter i for example: after doing what I said above, you look it up in the hashmap and see that it occurs at indexes 3 and 6, meaning you can do something like substring(3, 6) to get it.
Not the best code, but it seems reasonable as a starting point... you may be able to eliminate a loop with some creative thinking:
import string
import itertools

my_string = 'helloilovetocode'
mappings = dict()
for index, char in enumerate(my_string):
    if not mappings.has_key(char):
        mappings[char] = list()
    mappings[char].append(index)
    print char

for char in mappings:
    if len(mappings[char]) > 1:
        for subset in itertools.combinations(mappings[char], 2):
            print my_string[subset[0]:(subset[1]+1)]
The problem is that your code is far too inefficient in terms of algorithmic complexity.
Here's an alternative (a cleaner but slightly slower version of soliman's I believe)
import collections

def index_str(s):
    """
    returns the indices characters show up at
    """
    indices = collections.defaultdict(list)
    for index, char in enumerate(s):
        indices[char].append(index)
    return indices

def get_substrings(s):
    indices = index_str(s)
    for key, index_lst in indices.items():
        num_indices = len(index_lst)
        for i in range(num_indices):
            for j in range(i, num_indices):
                yield s[index_lst[i]: index_lst[j] + 1]
The algorithmic problem with your solution is that you blindly check each possible substring, when you can easily determine the actual pairs in a single linear-time pass. If you only want the count, it can be determined easily in O(MN) time for a string of length N with M unique characters (given the number of occurrences of a char, you can mathematically figure out how many substrings there are). Of course, in the worst case (all chars are the same), your code will have the same complexity as ours, but in the average case yours is much worse, since you have a nested for loop (O(n^2) time).
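For the counting-only variant, a sketch of that arithmetic (a character occurring k times contributes k*(k+1)/2 substrings whose first and last letters match):

from collections import Counter

def count_same_end_substrings(s):
    # Each pair of equal characters (including a position paired with itself)
    # defines exactly one substring whose first and last letters match.
    return sum(k * (k + 1) // 2 for k in Counter(s).values())

print(count_same_end_substrings('helloilovetocode'))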
