Python memory error when searching substring

I am trying to find all the substrings of a very large string and I am getting a memory error:
The code:
from collections import Counter

def substr(string):
    le = []
    st = list(string)
    for s in xrange(len(string)+1):
        for s1 in xrange(len(string)+1):
            le.append(''.join(st[s:s1]))
    cou = Counter(le)
    cou_val = cou.keys()
    cou_val.remove('')
    return le, cou_val
I am getting this error:
File "solution.py", line 31, in substr
    le.append(''.join(st[s:s1]))
MemoryError
How to tackle this problem?

Answer
I noticed that your code generates all the possible substrings of string in a certain order. I suggest that, instead of storing all of them in a list, you use code that returns just the substring you want. I tested the function below with 'a very long string' and it always returns the same value as if you were to get an indexed value from an array.
string = 'a very long string'

def substr2(string, i):
    return string[i//(len(string)+1):i%(len(string)+1)]

print(substr2(string, 10))
Solution
The way you order the arguments of your for loops (s, s1) works like a positional number system: s1 increments by 1 until it reaches a given value, then it resets to 0 and s increments by 1, repeating the cycle. This is how a decimal counter behaves (e.g. 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12, 13, 14, 15, 16, etc.)
The i//n (floor division) operator returns the integer part of i/n (e.g. 14//10 = 1).
The i%n (modulo) operator returns the remainder of i/n (e.g. 14%10 = 4).
So if we were to, for example, increment i by 1 and define (s,s1) as [i//10,i%10], we would get:
[0,0],[0,1],[0,2],[0,3],[0,4],[0,5],[0,6],[0,7],[0,8],[0,9],[1,0],[1,1],[1,2] etc.
We can utilize these to produce the same answer as in your array.
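As a quick sanity check (my own addition, not part of the original answer), you can enumerate i and confirm that substr2 reproduces the nested-loop order from the question:

# Compare substr2's output against the question's nested-loop enumeration.
string = 'abc'
n = len(string) + 1
expected = [string[s:s1] for s in range(n) for s1 in range(n)]
computed = [substr2(string, i) for i in range(n * n)]
assert computed == expected  # same substrings, same order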
PS. My first answer. Hope this helps!

It seems that you are running out of memory. When the string is very large, the code you posted copies pieces of it over and over again into the le list. As #Rikka's link suggests, buffer/memoryview may be of use to you, but I have never used it.
As a workaround to your solution/code, I would suggest that instead of storing each substring in le, you store the index pair as a tuple. Additionally, I don't think the st list is required (not sure though if your way speeds it up), so the result would be (code not tested):

def substr(string):
    le = []
    for s in xrange(len(string)+1):
        for s1 in xrange(len(string)+1):
            # Skip empty strings (note: pairs with s > s1 still denote empty slices)
            if s != s1:
                le.append((s, s1))
    cou = Counter(le)
    cou_val = cou.keys()  # these are now index tuples, so there is no '' to remove
    return le, cou_val
Now, an example of how you can use substr (code not tested):

myString = "very long string here"
matchString = "here"
matchPos = False

indexes, count = substr(myString)

# Check all the substrings without storing them simultaneously in memory
for i in indexes:
    # construct the substring and compare
    if myString[i[0]:i[1]] == matchString:
        matchPos = i
        break
After the above, you have the start and end positions of the first occurrence of "here" in your large string. I am not sure what you are trying to achieve, but this can easily be modified to find all occurrences, count matches, etc.; I just post it as an example. I am also not sure why the Counter is there...
This approach should not hit the memory error; however, it is a trade-off between memory and CPU, and I expect it to be a bit slower at runtime, since every time you use the indexes you have to re-construct the substring.
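To push the memory saving further (my suggestion, not part of the answer above), you could yield the index pairs lazily instead of building the le list at all:

def iter_substr_indexes(string):
    # Generate (start, end) pairs one at a time instead of storing them.
    for s in xrange(len(string)+1):
        for s1 in xrange(len(string)+1):
            if s != s1:
                yield (s, s1)

for s, s1 in iter_substr_indexes(myString):
    if myString[s:s1] == matchString:
        matchPos = (s, s1)
        break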
Hope it helps

The solution:
The MemoryError is always caused by going out of range, and the slice technique also has some rules.
When the step is positive, like 1, the first index must be smaller than the second. Conversely, when it is negative, like -1, the first index given is numerically the greater one (-1 > -2), even though it refers to the later position.
So in your program, the index s is greater than s1 when the step is one, so you access a place you never asked for. And you know, that is MemoryError!!!

Related

Time complexity for two different solutions

I want to understand the difference in time complexity between these two solutions.
The task is not relevant but if you're curious here's the link with the explanation.
This is my first solution. Scores a 100% in correctness but 0% in performance:
def solution(s, p, q):
    dna = dict({'A': 1, 'C': 2, 'G': 3, 'T': 4})
    result = []
    for i in range(len(q)):
        least = 4
        for c in set(s[p[i] : q[i] + 1]):
            if least > dna[c]: least = dna[c]
        result.append(least)
    return result
This is the second solution. Scores 100% in both correctness and performance:
def solution(s, p, q):
    result = []
    for i in range(len(q)):
        if 'A' in s[p[i]:q[i] + 1]: result.append(1)
        elif 'C' in s[p[i]:q[i] + 1]: result.append(2)
        elif 'G' in s[p[i]:q[i] + 1]: result.append(3)
        else: result.append(4)
    return list(result)
Now this is how I see it: in both solutions I'm iterating through a range of len(q), and on each iteration I'm slicing a different portion of a string, with a length between 1 and 100,000.
Here's where I get confused: in my first solution, on each iteration, I slice a portion of the string once and create a set to remove all the duplicates. The set can have a length between 1 and 4, so iterating through it must be very quick, and I iterate through it only once per iteration.
In the second solution, on each iteration, I slice a portion of the string three times and iterate through it, in the worst case three times, with a length of 100,000.
Then why is the second solution faster? How can the first have a time complexity of O(n*m) and the second O(n+m)?
I thought it was because of the in and for operators, but I tried the same second solution in JavaScript with the indexOf method and it still got 100% in performance. But why? I could understand it if Python's in and for operators had different implementations and worked differently behind the scenes, but in JS the indexOf method is just going to apply a for loop. Isn't that the same as doing the for loop directly inside my function? Shouldn't that be O(n*m) time complexity?
You haven't specified how the performance rating is obtained, but in any case the second algorithm is clearly better, mainly because it uses the in operator, which under the hood calls a function implemented in C, which is far more efficient than interpreted Python. More on this topic here.
Also, I'm not sure, but I don't think the Python interpreter is smart enough to slice the string only once and then reuse the same portion the other times in the second algorithm.
Creating the set in the first algorithm also seems like a very costly operation.
Lastly, maybe the performance ratings aren't based on the algorithm complexity, but rather on the execution time over different test strings?
I think the difference in complexity can easily be showcased on an example.
Consider the following input:
s = 'ACGT' * 1000000
# = 'ACGTACGTACGTACGTACGTACGTACGTACGTACGT...ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT'
p = [0]
q = [3999999]
Algorithm 2 very quickly checks that 'A' is in s[0:4000000] (it's the first character - no need to iterate through the whole string to find it!).
Algorithm 1, on the other hand, must iterate through the whole string s[0:4000000] to build the set {'A','C','G','T'}, because iterating through the whole string is the only way to check that there isn't a fifth distinct character hidden somewhere in the string.
Important note
I said algorithm 2 should be fast on this example, because the test 'A' in ... doesn't need to iterate through the whole string to find 'A' if 'A' is at the beginning of the string. However, note a possibly important difference in complexity between 'A' in s and 'A' in s[0:4000000]: creating a slice of the string can cost time (and memory), because it copies the string. Instead of slicing, you should use s.find('A', 0, 4000000), which is guaranteed not to build a copy (a small timing sketch follows the links below). For more information on this:
Documentation on string.find
Stackoverflow: Time complexity of string slice
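To make the slicing cost concrete, here is a small timing sketch (my addition; exact numbers will vary by machine):

import timeit
s = 'ACGT' * 1000000
# 'A' in s[0:4000000] copies the 4M-character slice before searching;
# s.find('A', 0, 4000000) searches in place without copying.
print(timeit.timeit(lambda: 'A' in s[0:4000000], number=100))
print(timeit.timeit(lambda: s.find('A', 0, 4000000) != -1, number=100))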

Most efficient way to get first value that startswith of large list

I have a very large list with over 100M strings. An example of that list looks as follows:
l = ['1,1,5.8067',
     '1,2,4.9700',
     '2,2,3.9623',
     '2,3,1.9438',
     '2,7,1.0645',
     '3,3,8.9331',
     '3,5,2.6772',
     '3,7,3.8107',
     '3,9,7.1008']
I would like to get the first string that starts with e.g. '3'.
To do so, I have used filter() with a lambda, followed by next() to get the first item:
next(filter(lambda i: i.startswith('3,'), l))
Out[1]: '3,3,8.9331'
Considering the size of the list, this strategy unfortunately still takes a relatively long time for a process I have to repeat over and over again. I was wondering if someone could come up with an even faster, more efficient approach. I am open to alternative strategies.
I have no way of testing it myself, but it is possible that if you join all the strings with a character that does not appear in any of them:
concat_list = '$'.join(l)
and then use a simple concat_list.find('$3,'), it would be faster. That is likely if all the strings are relatively short, since the whole list then sits in one contiguous place in memory.
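A rough sketch of that idea (my addition; it assumes '$' never occurs in the data, and prepends one extra '$' so the first element is searchable the same way):

l = ['1,1,5.8067', '1,2,4.9700', '2,2,3.9623', '3,3,8.9331']
concat_list = '$' + '$'.join(l)
pos = concat_list.find('$3,')
if pos != -1:
    end = concat_list.find('$', pos + 1)  # -1 means the match is the last element
    print(concat_list[pos + 1:end if end != -1 else None])  # '3,3,8.9331'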
If the number of unique letters in the text is small, you can use the Abrahamson-Kosaraju method and get a time complexity of practically O(n).
Another approach is to use joblib: create n threads where the i-th thread checks items i + k * n; when one finds the pattern, it stops the others. The time complexity is then O(naive algorithm / n). A sketch of this idea follows.
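A very rough sketch of that parallel idea (my addition; it assumes joblib is installed, uses the thread backend so the list is not copied per worker, and omits the early stopping of the other workers, which needs extra machinery):

from joblib import Parallel, delayed

def first_match(start, step):
    # The worker for stride `start` checks items start, start+step, start+2*step, ...
    for idx in range(start, len(l), step):
        if l[idx].startswith('3,'):
            return idx
    return None

n_jobs = 4
hits = Parallel(n_jobs=n_jobs, prefer='threads')(
    delayed(first_match)(i, n_jobs) for i in range(n_jobs))
first_idx = min((h for h in hits if h is not None), default=None)
print(l[first_idx] if first_idx is not None else None)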
Since your actual strings consist of relatively short tokens (such as 301) once the strings are split by tabs, you can build a dict with each possible prefix of the first token as a key, so that subsequent lookups take only O(1) average time complexity.
Build the dict from the values of the list in reverse order, so that the first value in the list that starts with each distinct prefix is the one retained in the final dict:
d = {s[:i + 1]: s for s in reversed(l) for i in range(len(s.split('\t')[0]))}
so that given:
l = ['301\t301\t51.806763\n', '301\t302\t46.970094\n',
'301\t303\t39.962393\n', '301\t304\t18.943836\n',
'301\t305\t11.064584\n', '301\t306\t4.751911\n']
d['3'] will return '301\t301\t51.806763'.
If you only need to test each of the first tokens as a whole, rather than prefixes, you can simply make the first tokens as the keys instead:
d = {s.split('\t')[0]: s for s in reversed(l)}
so that d['301'] will return '301\t301\t51.806763'.

The number of differences between characters in a string in Python 3

Given a string, let's say "TATA__", I need to find the total number of differences between adjacent characters in that string, i.e. there is a difference between T and A, but not between A and A, or _ and _.
My code more or less tells me this. But when a string such as "TTAA__" is given, it doesn't work as planned.
I need to take a character in that string, and check if the character next to it is not equal to the first character. If it is indeed not equal, I need to add 1 to a running count. If it is equal, nothing is added to the count.
This is what I have so far:
def num_diffs(state):
    count = 0
    for char in state:
        if char != state[char2]:
            count += 1
        char2 += 1
    return count
When I run it using num_diffs("TATA__") I get 4 as the response. When I run it with num_diffs("TTAA__") I also get 4. Whereas the answer should be 2.
If any of that makes sense at all, could anyone help in fixing it or pointing out where my error lies? I have a feeling it has to do with state[char2]. Sorry if this seems like a trivial problem; I'm totally new to the Python language.
import operator

def num_diffs(state):
    return sum(map(operator.ne, state, state[1:]))
To open this up a bit, it maps != (operator.ne) over state and state beginning at the 2nd character. The map function accepts multiple iterables as arguments and passes elements from them one by one as positional arguments to the given function, until one of the iterables is exhausted (state[1:] in this case will stop first).
The map results in an iterable of boolean values, but since bool in Python inherits from int, you can treat it as such in some contexts. Here we are interested in the True values, because they represent the points where adjacent characters differed. Calling sum over that mapping is the obvious next step.
Apart from the string slicing the whole thing runs using iterators in python3. It is possible to use iterators over the string state too, if one wants to avoid slicing huge strings:
import operator
from itertools import islice

def num_diffs(state):
    return sum(map(operator.ne,
                   state,
                   islice(state, 1, len(state))))
There are a couple of ways you might do this.
First, you could iterate through the string using an index, and compare each character with the character at the previous index.
Second, you could keep track of the previous character in a separate variable. The second seems closer to your attempt.
def num_diffs(s):
    count = 0
    prev = None
    for ch in s:
        if prev is not None and prev != ch:
            count += 1
        prev = ch
    return count
prev holds the character from the previous loop iteration. You assign ch (the current character) to it at the end of each iteration so that it is available in the next.
You might want to investigate Python's groupby function which helps with this kind of analysis.
from itertools import groupby

def num_diffs(seq):
    return len(list(groupby(seq))) - 1

for test in ["TATA__", "TTAA__"]:
    print(test, num_diffs(test))
This would display:
TATA__ 4
TTAA__ 2
The groupby() function works by grouping identical adjacent entries together. It returns a key and a group: the key is the matching single entry, and the group is an iterator over the matching entries. So each new group it returns tells you there is a difference.
Trying to make as few modifications to your original code as possible:
def num_diffs(state):
    count = 0
    for char2 in range(1, len(state)):
        if state[char2 - 1] != state[char2]:
            count += 1
    return count
One of the problems with your original code was that the char2 variable was not initialized within the body of the function, so it was impossible to predict the function's behaviour.
However, working with indices is not the most Pythonic way, and it is error prone (see comments for a mistake that I made). You may want to rewrite the function in such a way that it does one loop over a pair of strings, one pair of characters at a time:
def num_diffs(state):
    count = 0
    for char1, char2 in zip(state[:-1], state[1:]):
        if char1 != char2:
            count += 1
    return count
Finally, that very logic can be written much more succinctly — see #Ilja's answer.

need to decrease the run time of my program

I had a question where I had to find the contiguous substrings of a string, and the condition was that the first and last letters of the substring had to be the same. I tried doing it, but the runtime exceeded the time limit for several test cases. I tried using map for one for loop, but I have no idea what to do about the nested for loop. Can anyone please help me decrease the runtime of this program?
n = int(raw_input())
string = str(raw_input())

def get_substrings(string):
    length = len(string)
    list = []
    for i in range(length):
        for j in range(i, length):
            list.append(string[i:j + 1])
    return list

substrings = get_substrings(string)
contiguous = filter(lambda x: (x[0] == x[len(x) - 1]), substrings)
print len(contiguous)
If I understand the question properly (please let me know if that's not the case), try this:
Not sure if this will speed up the runtime, but I believe this algorithm may, especially for longer strings (it eliminates the nested loop). Iterate through the string once, storing the index (position) of each character in a data structure with constant-time lookup (a hashmap, or an array if set up properly). When finished, you should have a data structure storing all the different locations of every character. Using this you can easily retrieve the substrings.
Example:
codingisfun
Take the letter i, for example: after doing what I said above, you look it up in the hashmap and see that it occurs at indexes 3 and 6, meaning you can do something like substring(3, 6) to get it.
Not the best code, but it seems reasonable as a starting point; you may be able to eliminate a loop with some creative thinking:
import itertools

my_string = 'helloilovetocode'
mappings = dict()

# Record every index at which each character occurs
for index, char in enumerate(my_string):
    if not mappings.has_key(char):
        mappings[char] = list()
    mappings[char].append(index)
    print char

# For each character seen more than once, print every substring
# delimited by a pair of its occurrences
for char in mappings:
    if len(mappings[char]) > 1:
        for subset in itertools.combinations(mappings[char], 2):
            print my_string[subset[0]:(subset[1] + 1)]
The problem is that your code is far too inefficient in terms of algorithmic complexity.
Here's an alternative (a cleaner but slightly slower version of soliman's, I believe):
import collections

def index_str(s):
    """
    returns the indices characters show up at
    """
    indices = collections.defaultdict(list)
    for index, char in enumerate(s):
        indices[char].append(index)
    return indices

def get_substrings(s):
    indices = index_str(s)
    for key, index_lst in indices.items():
        num_indices = len(index_lst)
        for i in range(num_indices):
            for j in range(i, num_indices):
                yield s[index_lst[i]: index_lst[j] + 1]
The algorithmic problem with your solution is that you blindly check every possible substring, when you can determine the actual pairs in a single linear pass. If you only want the count, it can easily be determined in O(MN) time, for a string of length N and M unique characters (given the number of occurrences of a character, you can mathematically figure out how many such substrings there are; see the sketch below). Of course, in the worst case (all characters the same) your code has the same complexity as ours, but in the average case yours is much worse, since the nested for loop takes O(n^2) time.
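To make the counting remark concrete, here is a short sketch (my addition): a character occurring k times contributes k single-character substrings plus k*(k-1)/2 longer ones, i.e. k*(k+1)/2 substrings whose first and last letters match.

from collections import Counter

def count_matching_substrings(s):
    # Sum k*(k+1)//2 over the occurrence count k of each distinct character.
    return sum(k * (k + 1) // 2 for k in Counter(s).values())

print(count_matching_substrings('helloilovetocode'))  # same count as the question's filter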

Python trick in finding leading zeros in string

I have a binary string, say '01110000', and I want to return the number of leading zeros in front of it without writing a for loop. Does anyone have an idea how to do that? Preferably a way that also returns 0 if the string immediately starts with a '1'.
If you're really sure it's a "binary string":
input = '01110000'
zeroes = input.index('1')
Update: it breaks when there's nothing but "leading" zeroes
An alternate form that handles the all-zeroes case.
zeroes = (input+'1').index('1')
Here is another way:
In [36]: s = '01110000'
In [37]: len(s) - len(s.lstrip('0'))
Out[37]: 1
It differs from the other solutions in that it actually counts the leading zeroes instead of finding the first 1. This makes it a little bit more general, although for your specific problem that doesn't matter.
A simple one-liner:
x = '01110000'
leading_zeros = len(x.split('1', 1)[0])
This partitions the string into everything up to the first '1' and the rest after it, then counts the length of the prefix. The second argument to split is just an optimization, representing the number of splits to perform: the function will stop after it finds the first '1' instead of splitting on all occurrences. You could just use x.split('1')[0] if performance doesn't matter.
I'd use:
import itertools

s = '00001010'
sum(1 for _ in itertools.takewhile('0'.__eq__, s))
Rather pythonic, works in the general case, for example on the empty string and non-binary strings, and can handle strings of any length (or even iterators).
If you know the string contains only 0s and 1s:
x.find('1')
(it will return -1 if the string is all zeros; you may or may not want that behavior)
If you don't know which digit comes after the zeros (i.e. '1' in this case), and you just want to check whether there are leading zeros, you can convert to int and back and compare the two.
"0012300" == str(int("0012300"))
How about the re module?

import re
a = re.search('(?!0)', data)

Then a.start() is the position of the first non-zero character, i.e. the number of leading zeros.
I'm using has_leading_zero = re.match(r'0\d+', str(data)) as a solution that accepts any number and treats 0 as a valid number without a leading zero
