Find Repeating Substring In a List - python

I have a long list of sub-strings (close to 16000) and I want to find where the repeating cycle starts and stops. I have come up with this code as a starting point:
strings= ['1100100100000010',
'1001001000000110',
'0010010000001100',
'0100100000011011',
'1001000000110110',
'0010000001101101',
'1100100100000010',
'1001001000000110',
'0010010000001100',
'0100100000011011',]
pat = [ '1100100100000010',
'1001001000000110',
'0010010000001100',]
for i in range(0,len(strings)-1):
    for j in range(0,len(pat)):
        if strings[i] == pat[j]:
            continue
        if strings[i+1] == pat[j]:
            print 'match', strings[i]
            break
    break
The problem with this method is that you have to know what pat is in order to search for it. I would like to start with the first n sub-strings (in this case 3) and search for them; if there is no match, move down one sub-string to the next 3, and so on, until the search has gone through the entire list or found the repeat. I believe that if n is high enough (maybe 10) it will find the repeat without being too time-consuming.

strings= ['1100100100000010',
'1001001000000110',
'0010010000001100',
'0100100000011011',
'1001000000110110',
'0010000001101101',
'1100100100000010',
'1001001000000110',
'0010010000001100',
'0100100000011011',]
n = 3
patt_dict = {}
# + 1 so that the final window of n items is counted too
for i in range(0, len(strings) - n + 1):
    patt = ' '.join(strings[i:i + n])
    if patt not in patt_dict:
        patt_dict[patt] = 1
    else:
        patt_dict[patt] += 1
for key in patt_dict:
    if patt_dict[key] > 1:
        print('Found ' + str(patt_dict[key]) + ' repeating instances of ' + str(key) + '.')
Give this a shot. It runs in linear time. It uses a dictionary to count how many times each n-item pattern occurs in the list. If a count exceeds 1, then we have a repeating pattern :)
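If the goal is to locate where the cycle starts rather than to count patterns, the same dictionary idea can record the index at which each window was first seen. A minimal sketch, assuming the repeat shows up as an exact recurrence of n consecutive items; the name first_repeat is made up for this illustration:

```python
def first_repeat(items, n):
    """Return (first_index, repeat_index) for the first window of n
    consecutive items that has been seen before, or None if there is none."""
    seen = {}  # window (as a tuple) -> index where it first started
    for i in range(len(items) - n + 1):
        window = tuple(items[i:i + n])
        if window in seen:
            return seen[window], i
        seen[window] = i
    return None


# Same shape as the list in the question: six distinct items, then a repeat.
items = list("ABCDEFABCD")
print(first_repeat(items, 3))
```

For this sample the function returns (0, 6). The difference between the two indices is the cycle length, provided the list really is periodic from that point on.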

Here's a reasonably simple way that finds all matches of all lengths >= 1:
def findall(xs):
    from itertools import combinations
    # x2i maps each member of xs to a list of all the
    # indices at which that member appears.
    x2i = {}
    for i, x in enumerate(xs):
        x2i.setdefault(x, []).append(i)
    n = len(xs)
    for ixs in x2i.values():
        if len(ixs) > 1:
            for i, j in combinations(ixs, 2):
                length = 1  # xs[i] == xs[j]
                while (i + length < n and
                       j + length < n and
                       xs[i + length] == xs[j + length]):
                    length += 1
                yield i, j, length
Then:
for i, j, n in findall(strings):
    print("match of length", n, "at indices", i, "and", j)
displays:
match of length 4 at indices 0 and 6
match of length 1 at indices 3 and 9
match of length 3 at indices 1 and 7
match of length 2 at indices 2 and 8
What you do and don't want hasn't been precisely specified, so this lists all matches. You probably don't really want some of them. For example, the match of length 3 at indices 1 and 7 is just the tail end of the match of length 4 at indices 0 and 6.
So you'll need to alter the code to compute what you really want. Perhaps you only want a single, maximal match? All maximal matches? Only matches of a particular length? Etc.
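If only maximal matches are wanted, one option is to skip any pair (i, j) whose preceding items also match, since that pair is then just the tail of a longer match. A sketch of that variation, not the answer's own code; maximal_matches is an illustrative name:

```python
from itertools import combinations


def maximal_matches(xs):
    """Yield (i, j, length) for repeated runs that cannot be extended
    to the left or to the right."""
    positions = {}  # item -> list of indices where it appears
    for i, x in enumerate(xs):
        positions.setdefault(x, []).append(i)
    n = len(xs)
    for idxs in positions.values():
        for i, j in combinations(idxs, 2):  # i < j always holds here
            # If the items just before i and j are equal too, this pair
            # is the tail of a longer match that is reported separately.
            if i > 0 and xs[i - 1] == xs[j - 1]:
                continue
            length = 1
            while j + length < n and xs[i + length] == xs[j + length]:
                length += 1
            yield i, j, length


# Same repetition structure as the question's list of bit strings.
print(list(maximal_matches(list("ABCDEFABCD"))))
```

On this sample only the match of length 4 at indices 0 and 6 remains.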

Here's something that will find all subarrays that match within the strings array.
strings = ['A', 'B', 'C', 'D', 'Z', 'B', 'B', 'C', 'A', 'B', 'C']
pat = ['A', 'B', 'C', 'D']
i = 0
while i < len(strings):
    if strings[i] not in pat:
        i += 1
        continue
    matches = 0
    # range and print() work on both Python 2 and Python 3
    for j in range(pat.index(strings[i]), len(pat)):
        if i + j - pat.index(strings[i]) >= len(strings):
            break
        if strings[i + j - pat.index(strings[i])] == pat[j]:
            matches += 1
        else:
            break
    if matches:
        print('matched at index %d subsequence length: %d value %s' % (i, matches, strings[i]))
        i += matches
    else:
        i += 1
Output:
matched at index 0 subsequence length: 4 value A
matched at index 5 subsequence length: 1 value B
matched at index 6 subsequence length: 2 value B
matched at index 8 subsequence length: 3 value A

Related

How can I count the number of ways to divide a string into N parts of any size?

I'm trying to count the number of ways you can divide a given string into three parts in Python.
Example: "bbbbb" can be divided into three parts 6 ways:
b|b|bbb
b|bb|bb
b|bbb|b
bb|b|bb
bb|bb|b
bbb|b|b
My first line of thinking was N choose K, where N = the string's length and K = the number of ways to split (3), but that only works for 3 and 4.
My next idea was to iterate through the string and count the number of spots the first third could be segmented and the number of spots the second third could be segmented, then multiply the two counts, but I'm having trouble implementing that, and I'm not even too sure if it'd work.
How can I count the ways to split a string into N parts?
Think of it in terms of the places of the splits as the elements you're choosing:
b ^ b ^ b ^ ... ^ b
^ is where you can split, and there are N - 1 places where you can split (N is the length of the string), and, if you want to split the string into M parts, you need to choose M - 1 split places, so it's N - 1 choose M - 1.
For your example, N = 5, M = 3. (N - 1 choose M - 1) = (4 choose 2) = 6.
An implementation:
import scipy.special
s = 'bbbbb'
n = len(s)
m = 3
res = scipy.special.comb(n - 1, m - 1, exact=True)
print(res)
Output:
6
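Where SciPy is not available, the same count can be computed with math.comb from the standard library (Python 3.8 or later), and a brute-force enumeration of the cut positions confirms the formula. This is a sketch; count_splits and enumerate_splits are names chosen for the illustration:

```python
from itertools import combinations
from math import comb


def count_splits(s, m):
    """Number of ways to cut s into m non-empty parts: C(len(s) - 1, m - 1)."""
    return comb(len(s) - 1, m - 1)


def enumerate_splits(s, m):
    """List every split of s into m non-empty parts, joined with '|'."""
    result = []
    # Choose m - 1 cut positions among the len(s) - 1 gaps between characters.
    for cuts in combinations(range(1, len(s)), m - 1):
        bounds = (0,) + cuts + (len(s),)
        result.append("|".join(s[a:b] for a, b in zip(bounds, bounds[1:])))
    return result


print(count_splits("bbbbb", 3))
print(enumerate_splits("ABCDE", 3))
```

The enumeration is only practical for short strings; the closed form is the one to use when only the count is needed.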
I came up with a solution that counts the ways to split a string into three parts in Python. I think it is easier to understand, although it steps through every split one by one, so it takes at least quadratic time rather than evaluating a closed form.
def slitStr(s):
    i = 1
    j = 2
    count = 0
    while i <= len(s)-2:
        # a, b, c are the split strings
        a = s[:i]
        b = s[i:j]
        c = s[j:]
        # increase j till it gets to the end of the string
        # each time j gets to the end of the string increment i
        # and set j to i + 1
        if j < len(s):
            j += 1
        if j == len(s):
            i += 1
            j = i+1
        # you can increment count after each iteration
        count += 1
    return count
You can customize the solution to fit your needs. I hope this helps.
Hope this helps you too :
string = "ABCDE"
div = "|"
out = []
for i in range(len(string)):
    temp1 = ''
    if 1 < i < len(string):
        temp1 += string[0:i-1] + div
        for j in range(len(string) + 1):
            temp2 = ""
            if j > i:
                temp2 += string[i-1:j-1] + div + string[j-1:]
                out.append(temp1 + temp2)
print(out)
Result :
['A|B|CDE', 'A|BC|DE', 'A|BCD|E', 'AB|C|DE', 'AB|CD|E', 'ABC|D|E']

Python Optimization: Find the most frequent sequence of 4 letters inside a randomly generated 1000-letter string

I'm here to ask for help with my program.
I wrote a program whose purpose is to find the most frequent four-letter sequence in a longer, randomly generated string of x letters.
For example, if you want to know the most frequent four-letter sequence in 'abcdeabcdef', it's easy to see that it is 'abcd', so the program will return this.
Unfortunately, my program is very slow: it takes 119.7 seconds to analyze all the possibilities and display the results for a string of only 1000 letters.
This is my program right now:
import random
chars = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']
string = ''
for _ in range(1000):
    string += str(chars[random.randint(0, 25)])
print(string)
number = []
for ____ in range(0,26):
    print(____)
    for ___ in range(0,26):
        for __ in range(0, 26):
            for _ in range(0, 26):
                test = chars[____] + chars[___] + chars[__] + chars[_]
                print('trying :',test, end = ' ')
                number.append(0)
                for i in range(len(string) -3):
                    if string[i: i+4] == test:
                        number[len(number) -1] += 1
                print('>> finished')
_max = max(number)
for i in range(len(number)-1):
    if number[i] == _max :
        j, k, l, m = i, 0, 0, 0
        while j > 25:
            j -= 26
            k += 1
        while k > 25:
            k -= 26
            l += 1
        while l > 25:
            l -= 26
            m += 1
        Result = chars[m] + chars[l] + chars[k] + chars[j]
        print(str(Result),'occured',_max, 'times' )
I think there are ways to optimize it, but at my level I really don't know how. Maybe the structure itself is not the best. I hope you can help me :D
You only need to loop through your list once to count the 4-letter sequences. You are currently looping n*n*n*n. You can use zip to make a four letter sequence that collects the 997 substrings, then use Counter to count them:
from collections import Counter
import random
chars = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']
s = "".join([chars[random.randint(0, 25)] for _ in range(1000)])
it = zip(s, s[1:], s[2:], s[3:])
counts = Counter(it)
counts.most_common(1)
Edit:
.most_common(x) returns a list of the x most common items. counts.most_common(1) returns a single-item list containing a tuple of the letters and the number of times they occurred, like [(('a', 'b', 'c', 'd'), 2)]. So to get a string, just index into it and join():
''.join(counts.most_common(1)[0][0])
Even with your current approach of iterating through every possible 4-letter combination, you can speed up a lot by keeping a dictionary instead of a list, and testing whether the sequence occurs at all first before trying to count the occurrences:
counts = {}
for a in chars:
    for b in chars:
        for c in chars:
            for d in chars:
                test = a + b + c + d
                print('trying :',test, end = ' ')
                if test in s:  # if it occurs at all
                    # then record how often it occurs
                    # (the last start index is len(s) - 4, hence the - 3)
                    counts[test] = sum(1 for i in range(len(s)-3)
                                       if test == s[i:i+4])
The multiple loops can be replaced with itertools.product, which, unlike itertools.permutations, also generates sequences with repeated letters; this improves readability rather than performance:
import itertools
length = 4
for sequence in itertools.product(chars, repeat=length):
    test = "".join(sequence)
    if test in s:
        counts[test] = sum(1 for i in range(len(s)-length+1) if test == s[i:i+length])
You can then display the results like this:
_max = max(counts.values())
for k, v in counts.items():
    if v == _max:
        print(k, "occurred", _max, "times")
Provided that the string is shorter than, or around the same length as, 26**4 characters, it is much faster still to iterate through the string rather than through every combination:
length = 4
counts = {}
# + 1 so that the final window of the string is included
for i in range(len(s) - length + 1):
    sequence = s[i:i+length]
    if sequence in counts:
        counts[sequence] += 1
    else:
        counts[sequence] = 1
This is equivalent to the Counter approach already suggested.
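For reference, the single-pass idea can be packaged as a small function that counts string slices directly, so the result is already a string rather than a tuple of letters. A sketch only; most_common_ngram is an illustrative name, and ties are resolved however Counter.most_common happens to order them:

```python
from collections import Counter


def most_common_ngram(s, length=4):
    """Return (substring, count) for the most frequent substring of the
    given length, or None if s is shorter than that length."""
    if len(s) < length:
        return None
    # One pass over the string: every start index from 0 to len(s) - length.
    counts = Counter(s[i:i + length] for i in range(len(s) - length + 1))
    return counts.most_common(1)[0]


print(most_common_ngram("abcdxabcdy"))
```

For a 1000-letter string this builds 997 slices, compared with the 26**4 candidate sequences tested by the exhaustive approach.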

Finding Subarrays of Vowels from a given String

You are given a string S, and you have to find all the amazing substrings of S.
Amazing Substring is one that starts with a vowel (a, e, i, o, u, A, E, I, O, U).
Input
The only argument given is string S.
Output
Return a single integer X mod 10003, here X is number of Amazing Substrings in given string.
Constraints
1 <= length(S) <= 1e6
S can have special characters
Example
Input
ABEC
Output
6
Explanation
Amazing substrings of given string are :
1. A
2. AB
3. ABE
4. ABEC
5. E
6. EC
here number of substrings are 6 and 6 % 10003 = 6.
I have implemented the following algo for the above Problem.
class Solution:
    # @param A : string
    # @return an integer
    def solve(self, A):
        x = ['a', 'e','i','o', 'u', 'A', 'E', 'I', 'O', 'U']
        y = []
        z = len(A)
        for i in A:
            if i in x:
                n = A.index(i)
                m = z
                while m > n:
                    y.append(A[n:m])
                    m -= 1
        if y:
            return len(y)%10003
        else:
            return 0
The above solution works fine for short strings but not for longer ones.
For example,
A = "pGpEusuCSWEaPOJmamlFAnIBgAJGtcJaMPFTLfUfkQKXeymydQsdWCTyEFjFgbSmknAmKYFHopWceEyCSumTyAFwhrLqQXbWnXSn"
The above algorithm counts 1630 substrings, but the expected answer is 1244.
Please help me improve the algorithm. Thanks for the help.
Focus on the required output: you do not need to find all of those substrings. All you need is the quantity of substrings.
Look again at your short example, ABEC. There are two vowels, A and E.
A is at location 0. There are 4 total substrings, ending there and at each following location.
E is at location 2. There are 2 total substrings, ending there and at each following location.
2+4 => 6
All you need do is to find the position of each vowel, subtract from the string length, and accumulate those differences:
A = "pGpEusuCSWEaPOJmamlFAnIBgAJGtcJaMPFTLfUfkQKXeymydQsdWCTyEFjFgbSmknAmKYFHopWceEyCSumTyAFwhrLqQXbWnXSn"
lenA = len(A)
vowel = "aeiouAEIOU"
count = 0
for idx, char in enumerate(A):
    if char in vowel:
        count += lenA - idx
print(count%10003)
Output:
1244
In a single command:
print( sum(len(A) - idx if char.lower() in "aeiou" else 0
           for idx, char in enumerate(A)) % 10003 )
When you hit a vowel in a string, all sub-strings that start with this vowel are 'amazing' so you can just count them:
def solve(A):
    x = ['a', 'e','i','o', 'u', 'A', 'E', 'I', 'O', 'U']
    ans = 0
    for i in range(len(A)):
        if A[i] in x:
            ans = (ans + len(A)-i)%10003
    return ans
When you are looking for the index of the element n = A.index(i), you get the index of the first occurrence of the element. By using enumerate you can loop through indices and elements simultaneously.
def solve(A):
    x = ['a', 'e','i','o', 'u', 'A', 'E', 'I', 'O', 'U']
    y = []
    z = len(A)
    for n,i in enumerate(A):
        if i in x:
            m = z
            while m > n:
                y.append(A[n:m])
                m -= 1
    if y:
        return len(y)%10003
    else:
        return 0
A more general solution is to find all amazing substrings and then count them :
string = "pGpEusuCSWEaPOJmamlFAnIBgAJGtcJaMPFTLfUfkQKXeymydQsdWCTyEFjFgbSmknAmKYFHopWceEyCSumTyAFwhrLqQXbWnXSn"
amazing_substring_start = ['a','e','i','o','u','A','E','I','O','U']
amazing_substrings = []
for i in range(len(string)):
    if string[i] in amazing_substring_start:
        for j in range(len(string[i:])+1):
            amazing_substring = string[i:i+j]
            if amazing_substring!='':
                amazing_substrings += [amazing_substring]
print(amazing_substrings, len(amazing_substrings)%10003)
Create a loop that adds up the number of amazing substrings contributed by each vowel:
def Solve(A):
    sumn = 0
    for i in range(len(A)):
        if A[i] in "aeiouAEIOU":
            sumn += len(A[i:])
    return sumn%10003
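The counting argument used in the answers above (a vowel at index i starts exactly len(S) - i substrings) can be checked against a brute-force enumeration on small inputs. A sketch for that check; both function names are made up here, and the brute-force version is far too slow for the stated limit of 1e6 characters:

```python
VOWELS = set("aeiouAEIOU")


def count_amazing(s, mod=10003):
    """Count substrings that start with a vowel, using the position formula."""
    return sum(len(s) - i for i, ch in enumerate(s) if ch in VOWELS) % mod


def count_amazing_bruteforce(s, mod=10003):
    """Same count, obtained by visiting every (start, end) pair explicitly."""
    total = 0
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            if s[i] in VOWELS:  # substring s[i:j] starts with a vowel
                total += 1
    return total % mod


print(count_amazing("ABEC"), count_amazing_bruteforce("ABEC"))
```

Both functions return 6 for the example ABEC from the problem statement.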

How to find total number of possible combinations for a string?

How can I find the total number of possible subsequences of a given string that start with a particular character, say 'a', and end with a particular character, say 'b'?
EXAMPLE:
For the string 'aabb', if the subsequence must start with the character 'a' and end with the character 'b', then the valid subsequences are: (ab) from indices (0,2), (ab) from indices (0,3), (ab) from indices (1,2), (ab) from indices (1,3), (aab) from indices (0,1,2), (aab) from indices (0,1,3), (abb) from indices (0,2,3), (abb) from indices (1,2,3), and aabb itself.
So the total is 9. I can solve this for a short string, but how can I solve it for a long string where brute force doesn't work?
Note: We consider two sub strings to be different if they start or end at different indices of the given string.
def count(str,str1 ,str2 ):
    l = len(str)
    count=0
    for i in range(0, l+1):
        for j in range(i+1, l+1):
            if str[i] == str1 and str[j-1] == str2:
                count+=1
    return count
Before I post my main code I'll try to explain how it works. Let the source string be 'a123b'. The valid subsequences consist of all the subsets of '123' prefixed with 'a' and suffixed with 'b'. The set of all subsets is called the powerset, and the itertools docs have code showing how to produce the powerset using combinations in the Itertools Recipes section.
# Print all subsequences of '123', prefixed with 'a' and suffixed with 'b'
from itertools import combinations
src = '123'
for i in range(len(src) + 1):
    for s in combinations(src, i):
        print('a' + ''.join(s) + 'b')
output
ab
a1b
a2b
a3b
a12b
a13b
a23b
a123b
Here's a brute-force solution which uses that recipe.
from itertools import combinations
def count_bruteforce(src, targets):
    c0, c1 = targets
    count = 0
    for i in range(2, len(src) + 1):
        for t in combinations(src, i):
            if t[0] == c0 and t[-1] == c1:
                count += 1
    return count
It can be easily shown that the number of subsets of a set of n items is 2**n. So rather than producing the subsets one by one we can speed up the process by using that formula, which is what my count_fast function does.
from itertools import combinations
def count_bruteforce(src, targets):
    c0, c1 = targets
    count = 0
    for i in range(2, len(src) + 1):
        for t in combinations(src, i):
            if t[0] == c0 and t[-1] == c1:
                count += 1
    return count
def count_fast(src, targets):
    c0, c1 = targets
    # Find indices of the target chars
    idx = {c: [] for c in targets}
    for i, c in enumerate(src):
        if c in targets:
            idx[c].append(i)
    idx0, idx1 = idx[c0], idx[c1]
    count = 0
    for u in idx0:
        for v in idx1:
            if v < u:
                continue
            # Calculate the number of valid subsequences
            # which start at u+1 and end at v-1.
            n = v - u - 1
            count += 2 ** n
    return count
# Test
funcs = (
count_bruteforce,
count_fast,
)
targets = 'ab'
data = (
'ab', 'aabb', 'a123b', 'aacbb', 'aabbb',
'zababcaabb', 'aabbaaabbb',
)
for src in data:
    print(src)
    for f in funcs:
        print(f.__name__, f(src, targets))
    print()
output
ab
count_bruteforce 1
count_fast 1
aabb
count_bruteforce 9
count_fast 9
a123b
count_bruteforce 8
count_fast 8
aacbb
count_bruteforce 18
count_fast 18
aabbb
count_bruteforce 21
count_fast 21
zababcaabb
count_bruteforce 255
count_fast 255
aabbaaabbb
count_bruteforce 730
count_fast 730
There may be a way to make this even faster by starting the inner loop at the correct place rather than using continue to skip unwanted indices.
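The double loop over idx0 and idx1 can also be avoided altogether. The following sketch swaps in a different technique from the answer's: a single pass that keeps a running sum of the 2 ** n terms. The name count_linear is made up here, and the function assumes the two target characters are different.

```python
def count_linear(src, targets):
    """Count subsequences of src that start with targets[0] and end with
    targets[1] in one pass. Assumes the two target characters differ."""
    c0, c1 = targets
    count = 0
    # At position i, acc is the sum of 2 ** (i - u - 1) over every earlier
    # index u holding c0: the number of ways to start at some earlier c0
    # and pick any subset of the characters strictly between u and i.
    acc = 0
    for ch in src:
        if ch == c1:
            count += acc
        # Moving one step right doubles every existing term; a c0 at the
        # current position contributes a new term 2 ** 0 == 1.
        acc = 2 * acc + (1 if ch == c0 else 0)
    return count


for sample in ("ab", "aabb", "a123b", "aacbb", "aabbb"):
    print(sample, count_linear(sample, "ab"))
```

The counts agree with the test output listed above, for example 9 for 'aabb' and 21 for 'aabbb'.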
Easy, it should just be the number of letters to the power of two. I.e, n^2
Python implementation would just be n_substrings = n ** 2

Creating recursive function for nested loop in python

I posted this question two months ago:
"Non overlapping pattern matching with gap constraint in python". I got only one response, and the solution is quite long: one nested loop is needed for each word in the pattern. Is there any way of writing the following function recursively?
i=0
while i < len(pt_dic[pt_split[0]]):
    match=False
    ii = pt_dic[pt_split[0]][i]
    #print "ii=" + str(ii)
    # Start loop at next index after ii
    j = next(x[0] for x in enumerate(pt_dic[pt_split[1]]) if x[1] > ii)
    while j < len(pt_dic[pt_split[1]]) and not match:
        jj = pt_dic[pt_split[1]][j]
        #print "jj=" + str(jj)
        if jj > ii and jj <= ii + 2:
            # Start loop at next index after jj
            k = next(x[0] for x in enumerate(pt_dic[pt_split[2]]) if x[1] > jj)
            while k < len(pt_dic[pt_split[2]]) and not match:
                kk = pt_dic[pt_split[2]][k]
                #print "kk=" + str(kk)
                if kk > jj and kk <= jj + 2:
                    # Start loop at next index after kk
                    l = next(x[0] for x in enumerate(pt_dic[pt_split[3]]) if x[1] > kk)
                    # bound by the fourth list, which is the one indexed below
                    while l < len(pt_dic[pt_split[3]]) and not match:
                        ll = pt_dic[pt_split[3]][l]
                        #print "ll=" + str(ll)
                        if ll > kk and ll <= kk + 2:
                            print("Match: (" + str(ii) + "," + str(jj) + "," + str(kk) + "," + str(ll) + ")")
                            # Now that we've found a match, skip indices within that match.
                            i = next(x[0] for x in enumerate(pt_dic[pt_split[0]]) if x[1] > ll)
                            i -= 1
                            match=True
                        l += 1
                k += 1
        j += 1
    i += 1
Edit: For those who don't have the context:
I want to find the total number of non-overlapping matches of a pattern in a sequence, with a gap constraint of 2.
E.g. A B C is a pattern found using some algorithm. I have to find the total number of times this pattern appears in a sequence such as A A B B C D E A B C …, where the max gap constraint is 2.
The max gap does not apply across the whole sequence; it applies between two consecutive words of the pattern as they appear in the sequence. E.g. Pat: A B C and seq: A B D E C B A B A B C D E.
In this case, A B D E C ... is a match, as a gap of at most two is allowed between A, B and between B, C. Next we find A B A B C as another match. Interestingly, there are two matches here (2 chars b/w A, B and 2 chars b/w B, C). However, we will count them only as one, as they overlap. A B X X X C isn't valid.
I have read the original question only briefly. I'm not really sure if I've got the gap counting part right. I think you have L sorted sequences of unique indices and the code searches for all lists with L elements, where Nth element is from Nth sequence and where two adjacent items satisfy a condition prev < next < prev + GAP + 1
Anyway this question is about nested loops.
The basic idea of the code below is to pass a list of sequences to the recursive function. This function takes the first sequence from it and iterates over it. The remaining sequences are passed to the other instances of the same function where each instance does the same, i.e. iterates over the first sequence and passes the rest until no sequences to iterate over are left.
During that process a partial solution is being built step by step. The recursion continues only if this partial solution satisfies the condition. When all sequences are exhausted, the partial solution becomes a final solution.
list_of_seqs= [
    [0, 1, 7, 11, 22, 29],
    [2, 3, 8, 14, 25, 33],
    [4, 9, 15, 16, 27, 34],
    ]
def found_match(m):
    print(m)
GAP = 2
def recloop(part, ls):
    if not ls:
        found_match(part)
        return
    seq, *ls = ls # this is Python3 syntax
    last = part[-1] if part else None
    # this is not optimized:
    for i in seq:
        if last is None or last < i <= last + GAP + 1:
            recloop(part + [i], ls)
recloop([], list_of_seqs)
For Python2 replace the marked line with seq, ls = ls[0], ls[1:]
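The recursive idea above can also be written as a generator that returns the matches instead of printing them, which makes them easy to post-process. A sketch, assuming each sequence is sorted and using the same condition prev < next <= prev + gap + 1; gap_matches and non_overlapping are hypothetical names, and the greedy filter is just one simple way to discard overlapping matches, not necessarily the rule the question intends:

```python
def gap_matches(seqs, gap=2, prev=None):
    """Yield tuples with one index per sequence, where each index lies
    within gap + 1 positions after the previous one."""
    if not seqs:
        yield ()
        return
    first, rest = seqs[0], seqs[1:]
    for i in first:
        if prev is None or prev < i <= prev + gap + 1:
            for tail in gap_matches(rest, gap, i):
                yield (i,) + tail


def non_overlapping(matches):
    """Greedily keep each match that starts after the last kept match ends."""
    kept, end = [], -1
    for m in matches:
        if m[0] > end:
            kept.append(m)
            end = m[-1]
    return kept


list_of_seqs = [
    [0, 1, 7, 11, 22, 29],
    [2, 3, 8, 14, 25, 33],
    [4, 9, 15, 16, 27, 34],
]
all_matches = list(gap_matches(list_of_seqs))
print(len(all_matches), non_overlapping(all_matches))
```

Because the generator works for any number of sequences, a pattern with more words needs no extra nesting.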
