I'm looking for a Python solution that extracts, from a series of letters/numbers, the most frequently repeated pattern of a specific length that ends with a given outcome.
Problem: Given a block length (4 here), which block of digits/letters most often ends with search? (So each block has to END with search.)
Example:
Input: 0010000101010001011010011101001101000011100010100101010111
Search: 1
Block Length: 4
---
Answer: 0101
Appeared: 5 times
In the above case "1" is more likely to appear when 010 comes before 1.
001000 [0101] 0100 [0101] 1010011101001101000011100 [0101] 0 [0101] [0101] 11
So the answer is 0101 and it appeared 5 times.
NOTE:
This could return 0001 but that only appeared 4 times while 0101 appeared 5 times.
Changing the length would result in:
Input: 0010000101010001011010011101001101000011100010100101010111 (same as above)
Search: 1
Block Length: 5
---
Answer: 00101
Appeared: 4 times
Because:
00100 [00101] 010 [00101] 101001110100110100001110 [00101][00101] 010111
NOTE:
The second example could return 00001 but that only appeared 2 times while 00101 appeared 4 times.
If there are multiple outcomes, i.e. 0101 and 0111 appear the same number of times, both outcomes should be shown.
I'm at the point where I can find the most repeated string, but I don't know how to restrict it to a given length:
def find_most_repetitive_substring(string):
    max_counter = 1
    position, substring_length, times = 0, 0, 0
    for i in range(len(string)):
        for j in range(len(string) - i):
            counter = 1
            if j == 0:
                continue
            while True:
                if string[i + counter * j: i + (counter + 1) * j] != string[i: i + j] or i + (counter + 1) * j > len(string):
                    if counter > max_counter:
                        max_counter = counter
                        position, substring_length, times = i, j, counter
                    break
                else:
                    counter += 1
    return string[position: position + substring_length * times]
I've used re here as you're dealing with text, but you can use other techniques to create blocks of length N with overlaps...
import re
def f(text, search, length):
    # Get unique blocks of length N - including overlaps
    overlaps = set(re.findall(f'(?=(.{{{length}}}))', text))
    # Prioritise those ending with search, then the non-overlapping count, and then include the block itself
    return max((block.endswith(search), text.count(block), block) for block in overlaps)
S = '0010000101010001011010011101001101000011100010100101010111'
f(S, '1', 4)
# (True, 5, '0101')
f(S, '1', 5)
#(True, 4, '00101')
This might help you with part of your question (i.e., getting the counts); iterating and storing things in a lookup table (dictionary):
def find_most_repetitive_substring(string, substring_length, ending='1'):
    """
    Finds the most repetitive substring in a given string.
    :param string: String to search for repetitions.
    :param substring_length: Length of the substring to search for.
    :param ending: Character that the pattern must end with. Default is '1'.
    :return: Most repetitive substring and its number of occurrences.
    """
    substring_count = {}
    for i in range(len(string) - substring_length + 1):
        substring = string[i:i + substring_length]
        if substring[-1] == ending:  # added for ending
            if substring in substring_count:
                substring_count[substring] += 1
            else:
                substring_count[substring] = 1
    max_substr = max(substring_count, key=substring_count.get)
    return max_substr, substring_count[max_substr]
find_most_repetitive_substring('0010000101010001011010011101001101000011100010100101010111', 4)
And if you want to get all the keys with the max val, you can just return a list, changing the last lines to something like this:
    max_substr = max(substring_count, key=substring_count.get)
    max_substrs = [k for k, v in substring_count.items() if v == substring_count[max_substr]]
    return max_substrs, substring_count[max_substr]
You could use Counter:
from collections import Counter
def find_most_repetitive_substring(string, size, search):
    # + 1 so that the final window of the string is counted as well
    res = Counter([string[i:i+size] for i in range(len(string) - size + 1)])
    return max(res, key=lambda x: (x.endswith(search), res.get(x)))
Example run:
inp = "0010000101010001011010011101001101000011100010100101010111"
s = find_most_repetitive_substring(inp, 4, "1")
print(s) # 0101
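To also cover the requirement that tied blocks are all reported, the answers above can be combined roughly as follows. This is only a sketch: the name most_common_blocks is made up for illustration, and it assumes the question's counting (non-overlapping occurrences, as str.count does) is what is wanted.

```python
def most_common_blocks(text, search, length):
    """Return (blocks, count): every block of `length` chars ending with
    `search` that reaches the highest non-overlapping count in `text`."""
    # Every distinct window of the requested length that ends with `search`.
    candidates = {
        text[i:i + length]
        for i in range(len(text) - length + 1)
        if text[i:i + length].endswith(search)
    }
    if not candidates:
        return [], 0
    # str.count counts non-overlapping occurrences, like the question does.
    counts = {block: text.count(block) for block in candidates}
    best = max(counts.values())
    return sorted(b for b, n in counts.items() if n == best), best


S = '0010000101010001011010011101001101000011100010100101010111'
print(most_common_blocks(S, '1', 4))  # (['0101'], 5)
print(most_common_blocks(S, '1', 5))  # (['00101'], 4)
```

If overlapping occurrences should count instead, a Counter over all windows (as in the previous answer) would replace the str.count call.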
I am doing text mining and trying to clean bullet screen (弹幕) data. (Bullet screen is a kind of comment on video websites.) There are repeated expressions in my data ("LOL LOL LOL", "LMAOLMAOLMAOLMAO"), and I want to get "LOL", "LMAO".
In most cases, I want to find the minimum period of a sequence.
CORNER CASE: The tail of the input sequence can be seen as a part of the periodic subsequence.
"eat an apple eat an apple eat an" # input
"eat an apple" # output
There are some other test cases:
cases = [
"abcd", #4 abcd
"ababab", #2 ab
"ababcababc", #5 ababc
"abcdabcdabc", #4 abcd
]
NOTE: As for the last case "abcdabcdabc", "abcd" is better than "abcdabcdabc" because the last three characters "abc" are part of "abcd".
def solve(x):
    n = len(x)
    d = dict()
    T = 0
    k = 0
    while k < n:
        w = x[k]
        if w not in d:
            d[w] = T
            T += 1
        else:
            while k < n and d.get(x[k], None) == k%T:
                k += 1
            if k < n:
                T = k+1
        k += 1
    return T, x[:T]
It outputs correct answers for the first two cases but fails to handle all of them.
There is the efficient Z-algorithm:
Given a string S of length n, the Z Algorithm produces an array Z
where Z[i] is the length of the longest substring starting from S[i]
which is also a prefix of S, i.e. the maximum k such that
S[j] = S[i + j] for all 0 ≤ j < k. Note that Z[i] = 0 means that
S[0] ≠ S[i]. For easier terminology, we will refer to substrings which
are also a prefix as prefix-substrings.
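Calculate the Z-array for your string and find the smallest position i > 0 with the property i + Z[i] == len and len % i == 0 (len is the string length). That i is the period length. To allow a partial last repetition, as in the corner case of the question, drop the len % i == 0 condition.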
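A possible Python rendering of this idea is sketched below. The function names and the allow_partial_tail flag are assumptions made for illustration; the flag simply switches the len % i == 0 condition on or off.

```python
def z_array(s):
    """z[i] = length of the longest prefix of s that also starts at s[i]."""
    n = len(s)
    z = [0] * n
    left = right = 0  # current rightmost window [left, right) matching a prefix
    for i in range(1, n):
        if i < right:
            # Reuse what is already known from inside the window.
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def min_period(s, allow_partial_tail=True):
    """Shortest prefix whose repetition rebuilds s (hypothetical helper)."""
    n = len(s)
    z = z_array(s)
    for i in range(1, n):
        if i + z[i] == n and (allow_partial_tail or n % i == 0):
            return s[:i]
    return s


for case in ["abcd", "ababab", "ababcababc", "abcdabcdabc"]:
    print(case, "->", min_period(case))
```

With allow_partial_tail=False the last case returns the whole string, because 11 is not a multiple of 4.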
I'm not fluent in Python, but can easily describe the algorithm you need:
found <- false
length <- inputString.length
size <- 1
output <- inputString
while (not found) and (size <= length / 2) do
    if (length % size = 0) then
        chunk <- inputString.substring(0, size)
        found <- true
        for (j <- 1,length/size) do
            if (not inputString.substring(j * size, size).equals(chunk)) then
                found <- false
            if end
        for end
        if found then
            output <- chunk
        if end
    if end
    size <- size + 1
while end
The idea is to take increasingly long substrings from the start of the string, beginning with length 1, and to keep increasing the length until a repeating cycle is found (or until it is no longer feasible, that is, half of the input length has been reached). In each iteration you first check whether the length of the input string is divisible by the current substring length; if it is not, the current substring cannot tile the input string. (An optimization would be to compute the divisors of the input length and check only those lengths, but I avoided such optimizations for the sake of understandability.) If the length is divisible by the current size, you take the substring from the start of the input up to the current size and check whether it repeats throughout. The first time you find such a pattern you can stop the loop, because you have found the solution. If no such pattern is found, the input string itself is the smallest repetitive substring: it occurs in the string exactly once, so it is repeated 0 times.
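For readers who prefer Python, the strict variant described above (whole repetitions only) might look like this. It is a sketch of the same brute-force idea, not a translation line by line, and the function name is invented for the example.

```python
def smallest_repeating_chunk(text):
    """Shortest prefix that, repeated a whole number of times, rebuilds text."""
    length = len(text)
    for size in range(1, length // 2 + 1):
        if length % size:
            continue  # a chunk of this size cannot tile the string evenly
        chunk = text[:size]
        if chunk * (length // size) == text:
            return chunk  # the first hit is the smallest one
    return text  # nothing shorter repeats, so the string is its own chunk


print(smallest_repeating_chunk("ababab"))       # ab
print(smallest_repeating_chunk("abcdabcdabc"))  # abcdabcdabc
```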
EDIT
If you want to tolerate the last occurrence being only a part of the pattern, limited by the inputString, then the algorithm can be changed like this:
found <- false
length <- inputString.length
size <- 1
output <- inputString
while (not found) and (size <= length / 2) do
    chunk <- inputString.substring(0, size)
    found <- true
    for (j <- 1,length/size) do
        if (not inputString.substring(j * size, size).equals(chunk)) then
            found <- found and (chunk.indexOf(inputString.substring(j * size)) = 0)
        if end
    for end
    if found then
        output <- chunk
    if end
    size <- size + 1
while end
In this case, we see the line of
found <- found and (chunk.indexOf(inputString.substring(j * size)) = 0)
so, in the case of a mismatch, we check whether our chunk starts with the remaining part of the string. If so, then we are at the end of the input string and the pattern is partially matched up until the end of the string, so found stays true. If not, then found will be false.
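A compact Python sketch of this tolerant variant follows. Instead of comparing chunk by chunk, it repeats the candidate once more than needed and truncates, which has the same effect under the assumption that only the final repetition may be cut short. The function name is hypothetical.

```python
def smallest_period_with_tail(text):
    """Shortest prefix whose repetition rebuilds text, where the last
    repetition may be cut off by the end of the string."""
    length = len(text)
    for size in range(1, length // 2 + 1):
        chunk = text[:size]
        repeats = length // size + 1  # enough copies to cover a partial tail
        if (chunk * repeats)[:length] == text:
            return chunk
    return text


cases = ["abcd", "ababab", "ababcababc", "abcdabcdabc"]
for case in cases:
    period = smallest_period_with_tail(case)
    print(case, "->", len(period), period)
```

Like the pseudocode, this only tries chunk sizes up to half the input length, so a string such as "abcab" is returned unchanged.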
You could do it this way :
def solve(string):
    foundPeriods = {}
    for x in range(len(string)):
        # Tested substring
        substring = string[0:len(string)-x]
        # Frequency count
        occurence_count = string.count(substring)
        # Make a comparison to the original string
        if substring * occurence_count in string:
            foundPeriods[occurence_count] = substring
    return foundPeriods[max(foundPeriods.keys())]
for x in cases:
    print(x ,'===> ' , solve(x), "#" , len(solve(x)))
    print()
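Output (note that the first case prints a rather than the expected abcd: every prefix of a string without repetition has a count of 1, and the last prefix tested overwrites the earlier ones)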
abcd ===> a # 1
ababab ===> ab # 2
ababcababc ===> ababc # 5
abcdabcdabc ===> abcd # 4
EDIT :
Answer edited to consider the following in the question
"abcdabcdabc", "abcd" is better than "abcdabcdabc" because it comes more naturally
I have a string that holds a very long sentence without whitespaces/spaces.
mystring = "abcdthisisatextwithsampletextforasampleabcd"
I would like to find all of the repeated substrings that contain at least 4 chars.
So I would like to achieve something like this:
'text' 2 times
'sample' 2 times
'abcd' 2 times
As abcd, text and sample can each be found two times in mystring, they were recognized as properly matched substrings of at least 4 chars. It's important that I am seeking repeated substrings; finding only existing English words is not a requirement.
The answers I found are helpful for finding duplicates in texts with whitespaces, but I couldn't find a proper resource that covers the situation when there are no spaces and whitespaces in the string. How can this be done in the most efficient way?
Let's go through this step by step. There are several sub-tasks you should take care of:
1. Identify all substrings of length 4 or more.
2. Count the occurrence of these substrings.
3. Filter all substrings with 2 occurrences or more.
You can actually put all of them into a few statements. For understanding, it is easier to go through them one at a time.
The following examples all use
mystring = "abcdthisisatextwithsampletextforasampleabcd"
min_length = 4
1. Substrings of a given length
You can easily get substrings by slicing - for example, mystring[4:4+6] gives you the substring from position 4 of length 6: 'thisis'. More generically, you want substrings of the form mystring[start:start+length].
So what values do you need for start and length?
start must...
cover all substrings, so it must include the first character: start in range(0, ...).
not map to short substrings, so it can stop min_length characters before the end: start in range(..., len(mystring) - min_length + 1).
length must...
cover the shortest substring of length 4: length in range(min_length, ...).
not exceed the remaining string after start: length in range(..., len(mystring) - start + 1).
The +1 terms come from converting lengths (>=1) to indices (>=0).
You can put this all together into a single comprehension:
substrings = [
    mystring[i:i+j]
    for i in range(0, len(mystring) - min_length + 1)
    for j in range(min_length, len(mystring) - i + 1)
]
2. Count substrings
Trivially, you want to keep a count for each substring. Keeping anything for each specific object is what dicts are made for. So you should use substrings as keys and counts as values in a dict. In essence, this corresponds to this:
counts = {}
for substring in substrings:
    try:  # increase count for existing keys, set for new keys
        counts[substring] += 1
    except KeyError:
        counts[substring] = 1
You can simply feed your substrings to collections.Counter, and it produces something like the above.
>>> counts = collections.Counter(substrings)
>>> print(counts)
Counter({'abcd': 2, 'abcdt': 1, 'abcdth': 1, 'abcdthi': 1, 'abcdthis': 1, ...})
Notice how the duplicate 'abcd' maps to the count of 2.
3. Filtering duplicate substrings
So now you have your substrings and the count for each. You need to remove the non-duplicate substrings - those with a count of 1.
Python offers several constructs for filtering, depending on the output you want. These work also if counts is a regular dict:
>>> list(filter(lambda key: counts[key] > 1, counts))
['abcd', 'text', 'samp', 'sampl', 'sample', 'ampl', 'ample', 'mple']
>>> {key: value for key, value in counts.items() if value > 1}
{'abcd': 2, 'ampl': 2, 'ample': 2, 'mple': 2, 'samp': 2, 'sampl': 2, 'sample': 2, 'text': 2}
Using Python primitives
Python ships with primitives that allow you to do this more efficiently.
Use a generator to build substrings. A generator builds its member on the fly, so you never actually have them all in-memory. For your use case, you can use a generator expression:
substrings = (
    mystring[i:i+j]
    for i in range(0, len(mystring) - min_length + 1)
    for j in range(min_length, len(mystring) - i + 1)
)
Use a pre-existing Counter implementation. Python comes with a dict-like container that counts its members: collections.Counter can directly digest your substring generator. Especially in newer versions, this is much more efficient.
counts = collections.Counter(substrings)
You can exploit Python's lazy filters to only ever inspect one substring at a time. The filter builtin or another generator expression can produce one result at a time without storing them all in memory.
for substring in filter(lambda key: counts[key] > 1, counts):
    print(substring, 'occurs', counts[substring], 'times')
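Put together, the three steps fit into one small function. This is a sketch that wraps the pieces above; the name repeated_substrings is chosen here for illustration only.

```python
from collections import Counter


def repeated_substrings(text, min_length=4):
    """Map every substring of at least min_length chars that occurs
    more than once in text to its number of occurrences."""
    # Step 1: lazily generate all substrings of length >= min_length.
    substrings = (
        text[start:start + length]
        for start in range(len(text) - min_length + 1)
        for length in range(min_length, len(text) - start + 1)
    )
    # Step 2: count them.
    counts = Counter(substrings)
    # Step 3: keep only the duplicates.
    return {sub: n for sub, n in counts.items() if n > 1}


mystring = "abcdthisisatextwithsampletextforasampleabcd"
for sub, n in repeated_substrings(mystring).items():
    print(repr(sub), n, "times")
```

Note that the number of substrings grows quadratically with the length of the input, so this is best suited to short and medium-sized strings.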
Nobody is using re! Time for an answer [ab]using the regular expression built-in module ;)
import re
Finding all the maximal substrings that are repeated
repeated_ones = set(re.findall(r"(.{4,})(?=.*\1)", mystring))
This matches the longest substrings which have at least a single repetition after (without consuming). So it finds all disjointed substrings that are repeated while only yielding the longest strings.
Finding all substrings that are repeated, including overlaps
mystring_overlap = "abcdeabcdzzzzbcde"
# In case we want to match both abcd and bcde
repeated_ones = set()
pos = 0
while True:
    match = re.search(r"(.{4,}).*(\1)+", mystring_overlap[pos:])
    if match:
        repeated_ones.add(match.group(1))
        pos += match.pos + 1
    else:
        break
This ensures that all --not only disjoint-- substrings which have repetition are returned. It should be much slower, but gets the work done.
If you want in addition to the longest strings that are repeated, all the substrings, then:
base_repetitions = list(repeated_ones)
for s in base_repetitions:
    for i in range(4, len(s)):
        repeated_ones.add(s[:i])
That will ensure that for long substrings that have repetition, you have also the smaller substring --e.g. "sample" and "ample" found by the re.search code; but also "samp", "sampl", "ampl" added by the above snippet.
Counting matches
Because (by design) the substrings that we count are non-overlapping, the count method is the way to go:
from __future__ import print_function
for substr in repeated_ones:
    print("'%s': %d times" % (substr, mystring.count(substr)))
Results
Finding maximal substrings:
With the question's original mystring:
{'abcd', 'text', 'sample'}
with the mystring_overlap sample:
{'abcd'}
Finding all substrings:
With the question's original mystring:
{'abcd', 'ample', 'mple', 'sample', 'text'}
... and if we add the code to get all substrings then, of course, we get absolutely all the substrings:
{'abcd', 'ampl', 'ample', 'mple', 'samp', 'sampl', 'sample', 'text'}
with the mystring_overlap sample:
{'abcd', 'bcde'}
Future work
It's possible to filter the results of the finding all substrings with the following steps:
take a match "A"
check if this match is a substring of another match, call it "B"
if there is a "B" match, check the counter on that match "B_n"
if "A_n = B_n", then remove A
go to first step
It cannot happen that "A_n < B_n" because A is smaller than B (is a substring) so there must be at least the same number of repetitions.
If "A_n > B_n" it means that there is some extra match of the smaller substring, so it is a distinct substring because it is repeated in a place where B is not repeated.
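The filtering steps above could be sketched like this, assuming the matches are already available as a dict that maps each substring to its count (the function name and that input shape are assumptions, not part of the code above):

```python
def drop_redundant(counts):
    """Remove every match A that is contained in a longer match B
    with the same count (A_n == B_n)."""
    return {
        a: a_n
        for a, a_n in counts.items()
        if not any(a != b and a in b and b_n == a_n for b, b_n in counts.items())
    }


counts = {'abcd': 2, 'ampl': 2, 'ample': 2, 'mple': 2,
          'samp': 2, 'sampl': 2, 'sample': 2, 'text': 2}
print(drop_redundant(counts))
# {'abcd': 2, 'sample': 2, 'text': 2}
```

A shorter match that occurs more often than the longer one containing it is kept, which matches the "A_n > B_n" case described above.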
Script (explanation where needed, in comments):
from collections import Counter
mystring = "abcdthisisatextwithsampletextforasampleabcd"
mystring_len = len(mystring)
possible_matches = []
matches = []
# Range `start_index` over every position that still leaves room for the
# minimum char count of 4
for start_index in range(0, mystring_len-3):
    # Start `end_index` at `start_index+1` and range it throughout the rest of
    # the string
    for end_index in range(start_index+1, mystring_len+1):
        current_string = mystring[start_index:end_index]
        if len(current_string) < 4: continue # Skip this iteration, if len < 4
        possible_matches.append(current_string)
for possible_match, count in Counter(possible_matches).most_common():
    # Iterate until count is less than or equal to 1 because `Counter`'s
    # `most_common` method lists them in order. Once 1 (or less) is hit, all
    # others are the same or lower.
    if count <= 1: break
    matches.append((possible_match, count))
for match, count in matches:
    print(f'\'{match}\' {count} times')
Output:
'abcd' 2 times
'text' 2 times
'samp' 2 times
'sampl' 2 times
'sample' 2 times
'ampl' 2 times
'ample' 2 times
'mple' 2 times
Here's a Python3 friendly solution:
from collections import Counter
min_str_length = 4
mystring = "abcdthisisatextwithsampletextforasampleabcd"
all_substrings =[mystring[start_index:][:end_index + 1] for start_index in range(len(mystring)) for end_index in range(len(mystring[start_index:]))]
counted_substrings = Counter(all_substrings)
not_counted_final_candidates = [item[0] for item in counted_substrings.most_common() if item[1] > 1 and len(item[0]) >= min_str_length]
counted_final_candidates = {item: counted_substrings[item] for item in not_counted_final_candidates}
print(counted_final_candidates)
Bonus: largest string
sub_sub_strings = [substring1 for substring1 in not_counted_final_candidates for substring2 in not_counted_final_candidates if substring1!=substring2 and substring1 in substring2 ]
largest_common_string = list(set(not_counted_final_candidates) - set(sub_sub_strings))
Everything as a function:
from collections import Counter
def get_repeated_strings(input_string, min_str_length = 2, calculate_largest_repeated_string = True ):
    all_substrings = [input_string[start_index:][:end_index + 1]
                      for start_index in range(len(input_string))
                      for end_index in range(len(input_string[start_index:]))]
    counted_substrings = Counter(all_substrings)
    not_counted_final_candidates = [item[0]
                                    for item in counted_substrings.most_common()
                                    if item[1] > 1 and len(item[0]) >= min_str_length]
    counted_final_candidates = {item: counted_substrings[item] for item in not_counted_final_candidates}
    ### This is just a bit of bonus code for calculating the largest repeating string
    if calculate_largest_repeated_string == True:
        sub_sub_strings = [substring1 for substring1 in not_counted_final_candidates for substring2 in
                           not_counted_final_candidates if substring1 != substring2 and substring1 in substring2]
        largest_common_strings = list(set(not_counted_final_candidates) - set(sub_sub_strings))
        return counted_final_candidates, largest_common_strings
    else:
        return counted_final_candidates
Example:
mystring = "abcdthisisatextwithsampletextforasampleabcd"
print(get_repeated_strings(mystring, min_str_length= 4))
Output:
({'abcd': 2, 'text': 2, 'samp': 2, 'sampl': 2, 'sample': 2, 'ampl': 2, 'ample': 2, 'mple': 2}, ['abcd', 'text', 'sample'])
CODE:
pattern = "abcdthisisatextwithsampletextforasampleabcd"
string_more_4 = []
k = 4
while(k <= len(pattern)):
    for i in range(len(pattern)):
        if pattern[i:k+i] not in string_more_4 and len(pattern[i:k+i]) >= 4:
            string_more_4.append( pattern[i:k+i])
    k+=1
for i in string_more_4:
    if pattern.count(i) >= 2:
        print(i + " -> " + str(pattern.count(i)) + " times")
OUTPUT:
abcd -> 2 times
text -> 2 times
samp -> 2 times
ampl -> 2 times
mple -> 2 times
sampl -> 2 times
ample -> 2 times
sample -> 2 times
Hope this helps as my code length was short and it is easy to understand. Cheers!
This is in Python 2 because I'm not doing Python 3 at this time. So you'll have to adapt it to Python 3 yourself.
#!python2
# import module
from collections import Counter
# get the indices
def getIndices(length):
    # holds the indices
    specific_range = []; all_sets = []
    # start building the indices
    for i in range(0, length - 2):
        # build a set of indices of a specific range
        for j in range(1, length + 2):
            specific_range.append([j - 1, j + i + 3])
            # append 'specific_range' to 'all_sets', reset 'specific_range'
            if specific_range[j - 1][1] == length:
                all_sets.append(specific_range)
                specific_range = []
                break
    # return all of the calculated indices ranges
    return all_sets
# store search strings
tmplst = []; combos = []; found = []
# string to be searched
mystring = "abcdthisisatextwithsampletextforasampleabcd"
# mystring = "abcdthisisatextwithtextsampletextforasampleabcdtext"
# get length of string
length = len(mystring)
# get all of the indices ranges, 4 and greater
all_sets = getIndices(length)
# get the search string combinations
for sublst in all_sets:
    for subsublst in sublst:
        tmplst.append(mystring[subsublst[0]: subsublst[1]])
    combos.append(tmplst)
    tmplst = []
# search for matching string patterns
for sublst in all_sets:
    for subsublst in sublst:
        for sublstitems in combos:
            if mystring[subsublst[0]: subsublst[1]] in sublstitems:
                found.append(mystring[subsublst[0]: subsublst[1]])
# make a dictionary containing the strings and their counts
d1 = Counter(found)
# filter out counts of 2 or more and print them
for k, v in d1.items():
    if v > 1:
        print k, v
$ cat test.py
import collections
import sys
S = "abcdthisisatextwithsampletextforasampleabcd"
def find(s, min_length=4):
    """
    Find repeated character sequences in a provided string.
    Arguments:
    s -- the string to be searched
    min_length -- the minimum length of the sequences to be found
    """
    counter = collections.defaultdict(int)
    # A repeated sequence can't be longer than half the length of s
    sequence_length = len(s) // 2
    # populate counter with all possible sequences
    while sequence_length >= min_length:
        # Iterate over the string until the number of remaining characters is
        # fewer than the length of the current sequence.
        for i, x in enumerate(s[:-(sequence_length - 1)]):
            # Window across the string, getting slices
            # of length == sequence_length.
            candidate = s[i:i + sequence_length]
            counter[candidate] += 1
        sequence_length -= 1
    # Report.
    for k, v in counter.items():
        if v > 1:
            print('{} {} times'.format(k, v))
    return
if __name__ == '__main__':
    try:
        s = sys.argv[1]
    except IndexError:
        s = S
    find(s)
$ python test.py
sample 2 times
sampl 2 times
ample 2 times
abcd 2 times
text 2 times
samp 2 times
ampl 2 times
mple 2 times
This is my approach to this problem:
def get_repeated_words(string, minimum_len):
    # Storing count of repeated words in this dictionary
    repeated_words = {}
    # Traversing till last but 4th element
    # Actually leaving `minimum_len` elements at end (in this case its 4)
    for i in range(len(string)-minimum_len):
        # Starting with a length of 4(`minimum_len`) and going till end of string
        for j in range(i+minimum_len, len(string)):
            # getting the current word
            word = string[i:j]
            # counting the occurrences of the word
            word_count = string.count(word)
            if word_count > 1:
                # storing in dictionary along with its count if found more than once
                repeated_words[word] = word_count
    return repeated_words
if __name__ == '__main__':
    mystring = "abcdthisisatextwithsampletextforasampleabcd"
    result = get_repeated_words(mystring, 4)
This is how I would do it, but I don't know any other way:
string = "abcdthisisatextwithsampletextforasampleabcd"
l = len(string)
occurences = {}
for i in range(4, l):
    # + 1 so that the window ending at the last character is included;
    # without it the second 'abcd' is never counted
    for start in range(l - i + 1):
        substring = string[start:start + i]
        occurences[substring] = occurences.get(substring, 0) + 1
for key in occurences.keys():
    if occurences[key] > 1:
        print("'" + key + "'", str(occurences[key]), "times")
Output:
'abcd' 2 times
'text' 2 times
'samp' 2 times
'ampl' 2 times
'mple' 2 times
'sampl' 2 times
'ample' 2 times
'sample' 2 times
Efficient, no, but easy to understand, yes.
Here is a simple solution using the more_itertools library.
Given
import collections as ct
import more_itertools as mit
s = "abcdthisisatextwithsampletextforasampleabcd"
lbound, ubound = len("abcd"), len(s)
Code
windows = mit.flatten(mit.windowed(s, n=i) for i in range(lbound, ubound))
filtered = {"".join(k): v for k, v in ct.Counter(windows).items() if v > 1}
filtered
Output
{'abcd': 2,
'text': 2,
'samp': 2,
'ampl': 2,
'mple': 2,
'sampl': 2,
'ample': 2,
'sample': 2}
Details
The procedures are:
build sliding windows of varying sizes lbound <= n < ubound
count all occurrences and filter replicates
more_itertools is a third-party package, installed with pip install more_itertools.
s = 'abcabcabcdabcd'
d = {}
def get_repeats(s, l):
    # + 1 so that the last window of length l is counted too
    for i in range(len(s)-l+1):
        ss = s[i: i+l]
        if ss not in d:
            d[ss] = 1
        else:
            d[ss] = d[ss]+1
    return d
get_repeats(s, 3)
I have a long file with about 1200 sequences like these:
>3fm8|A|A0JLQ2
CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP
QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP
>2ht9|A|A0JLT0
LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA
LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL
I want to extract every pattern that has a cysteine (C) in the middle, preceded by five characters and followed by five more, such as xxxxxCxxxxx.
The output should look like this:
QDIQLCGMGIL
ILPEHCIIDIT
TISDNCVVIFS
FSKTSCSYCTM
This program only gives the position of C; it does not do what I want:
pos=[]
def find(ch,string1):
    for i in range(len(string1)):
        if ch == string1[i]:
            pos.append(i)
            return pos
z=find('C','AWERQRTCWERTYCTAAAACTTCTTT')
print z
You need to return outside the loop, you are returning on the first match so you only ever get a single character in your list:
def find(ch,string1):
    pos = []
    for i in range(len(string1)):
        if ch == string1[i]:
            pos.append(i)
    return pos # outside
You can also use enumerate with a list comp in place of your range logic:
def indexes(ch, s1):
    return [index for index, char in enumerate(s1) if char == ch and 5 <= index <= len(s1) - 6]
Each index in the list comp is the character index and each char is the actual character so we keep each index where char is equal to ch.
If you want the five chars that are both sides:
In [24]: s="CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP"
In [25]: inds = indexes("C",s)
In [26]: [s[i-5:i+6] for i in inds]
Out[26]: ['QDIQLCGMGIL', 'ILPEHCIIDIT']
I added checking the index as we obviously cannot get five chars before C if the index is < 5 and the same from the end.
You can do it all in a single function, yielding a slice when you find a match:
def find(ch, s):
    ln = len(s)
    for i, char in enumerate(s):
        if ch == char and 5 <= i <= ln - 6:
            yield s[i- 5:i + 6]
This presumes the data in your question is actually two lines from your file, like:
s="""">3fm8|A|A0JLQ2CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTPQKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP
>2ht9|A|A0JLT0LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDALYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCY"""
Running:
for line in s.splitlines():
    print(list(find("C" ,line)))
would output:
['0JLQ2CFLVNL', 'QDIQLCGMGIL', 'ILPEHCIIDIT']
['TISDNCVVIFS', 'FSKTSCSYCTM', 'TSCSYCTMAKK']
That gives six matches, not four as your expected output suggests, so I presume you did not include all possible matches.
You can also speed up the code using str.find, starting at the last match index + 1 for each subsequent match
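def find(ch, s):
    ln, i = len(s) - 6, s.find(ch)
    # keep searching while there is a match with five chars after it
    while i != -1 and i <= ln:
        if i >= 5:  # skip matches with fewer than five chars before them
            yield s[i - 5:i + 6]
        i = s.find(ch, i + 1)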
Which will give the same output. Of course if the strings cannot overlap you can start looking for the next match much further in the string each time.
My solution is based on regex, and finds all possible matches using a regex inside a while loop. Thanks to @Smac89 for improving it by transforming it into a generator:
import re
string = """CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTPQKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP
LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL"""
# Generator
def find_cysteine2(string):
    # Create a loop that will utilize regex multiple times
    # in order to capture matches within groups
    while True:
        # Find a match
        data = re.search(r'(\w{5}C\w{5})',string)
        # If match exists, let's collect the data
        if data:
            # Collect the string
            yield data.group(1)
            # Shrink the string to not include
            # the previous result
            location = data.start() + 1
            string = string[location:]
        # If there are no matches, stop the loop
        else:
            break
print [x for x in find_cysteine2(string)]
# ['QDIQLCGMGIL', 'ILPEHCIIDIT', 'TISDNCVVIFS', 'FSKTSCSYCTM', 'TSCSYCTMAKK']
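An alternative that avoids re-slicing the string is a zero-width lookahead, which lets re.findall report overlapping windows in a single pass. The sketch below assumes sequences consist of uppercase letters only and that each record has already been joined into one line; the function name and the flank parameter are made up for the example.

```python
import re


def cysteine_windows(sequence, flank=5):
    """Return every window of `flank` letters + C + `flank` letters,
    including windows that overlap each other."""
    # The lookahead (?=...) consumes nothing, so the scan advances one
    # character at a time and overlapping matches are not skipped.
    pattern = re.compile(r"(?=([A-Z]{%d}C[A-Z]{%d}))" % (flank, flank))
    return pattern.findall(sequence)


print(cysteine_windows("AWERQRTCWERTYCTAAAACTTCTTT"))
# ['ERQRTCWERTY', 'WERTYCTAAAA', 'TAAAACTTCTT']
```

A C that is closer than five letters to either end of the sequence is skipped, as in the answers above.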
I would like to create a program that generates particular 7-character strings.
They must follow these rules:
0-9 are before a-z which are before A-Z
Length is 7 characters.
Each character must be different from its two neighbours (for example, 'NN' is not allowed)
I need all the possible combinations, incrementing from 0000000 to ZZZZZZZ, in order rather than in a random sequence
I have already done it with this code:
from string import digits, ascii_uppercase, ascii_lowercase
from itertools import product
chars = digits + ascii_lowercase + ascii_uppercase
for n in range(7, 8):
    for comb in product(chars, repeat=n):
        if (comb[6] != comb[5] and comb[5] != comb[4] and comb[4] != comb[3] and comb[3] != comb[2] and comb[2] != comb[1] and comb[1] != comb[0]):
            print ''.join(comb)
But it is not performant at all because I have to wait a long time before the next combination.
Can someone help me?
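One way to avoid generating and then discarding invalid combinations is to never extend a prefix with the character it already ends in. The sketch below is written for Python 3 and assumes the first rule only defines the ordering of the alphabet (digits, then lowercase, then uppercase), as the code above does; the function name is hypothetical.

```python
from itertools import islice
from string import digits, ascii_lowercase, ascii_uppercase

CHARS = digits + ascii_lowercase + ascii_uppercase


def sequences(length, previous=''):
    """Yield, in alphabet order, all strings of `length` characters
    in which no two neighbouring characters are equal."""
    if length == 0:
        yield ''
        return
    for char in CHARS:
        if char == previous:
            continue  # prune the whole subtree instead of filtering later
        for rest in sequences(length - 1, char):
            yield char + rest


# Only the first few results are consumed; the generator is lazy.
print(list(islice(sequences(7), 3)))
# ['0101010', '0101012', '0101013']
```

The total number of results is still enormous (62 * 61**6 for length 7), so the gain is in the time between results, not in the overall amount of output.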
Edit: I've updated the solution to use cached short sequences for lengths greater than 4. This significantly speeds up the calculations. With the simple version, it'd take 18.5 hours to generate all sequences of length 7, but with the new method only 4.5 hours.
I'll let the docstring do all of the talking for describing the solution.
"""
Problem:
Generate a string of N characters that only contains alphanumerical
characters. The following restrictions apply:
* 0-9 must come before a-z, which must come before A-Z
* it's valid to not have any digits or letters in a sequence
* no neighbouring characters can be the same
* the sequences must be in an order as if the string is base62, e.g.,
01010...01019, 0101a...0101z, 0101A...0101Z, 01020...etc
Solution:
Implement a recursive approach which discards invalid trees. For example,
for "---" start with "0--" and recurse. Try "00-", but discard it for
"01-". The first and last sequences would then be "010" and "ZYZ".
If the previous character in the sequence is a lowercase letter, such as
in "02f-", shrink the pool of available characters to a-zA-Z. Similarly,
for "9gB-", we should only be working with A-Z.
The input also allows to define a specific sequence to start from. For
example, for "abGH", each character will have access to a limited set of
its pool. In this case, the last letter can iterate from H to Z, at which
point it'll be free to iterate its whole character pool next time around.
When specifying a starting sequence, if it doesn't have enough characters
compared to `length`, it will be padded to the right with characters free
to explore their character pool. For example, for length 4, the starting
sequence "29" will be transformed to "29  ", where we will deal with two
restricted characters temporarily.
For long lengths the function internally calls a routine which relies on
fewer recursions and cached results. Length 4 has been chosen as optimal
in terms of precomputing time and memory demands. Briefly, the sequence is
broken into a remainder and chunks of 4. For each preceding valid
subsequence, all valid following subsequences are fetched. For example, a
sequence of six would be split into "--|----" and for "fB|----" all
subsequences of 4 starting A, C, D, etc would be produced.
Examples:
>>> for i, x in enumerate(generate_sequences(7)):
... print i, x
0, 0101010
1, 0101012
etc
>>> for i, x in enumerate(generate_sequences(7, '012abcAB')):
... print i, x
0, 012abcAB
1, 012abcAC
etc
>>> for i, x in enumerate(generate_sequences(7, 'aB')):
... print i, x
0, aBABABA
1, aBABABC
etc
"""
import string
ALLOWED_CHARS = (string.digits + string.ascii_letters,
string.ascii_letters,
string.ascii_uppercase,
)
CACHE_LEN = 4
def _generate_sequences(length, sequence, previous=''):
    char_set = ALLOWED_CHARS[previous.isalpha() * (2 - previous.islower())]
    if sequence[-length] != ' ':
        char_set = char_set[char_set.find(sequence[-length]):]
        sequence[-length] = ' '
    char_set = char_set.replace(previous, '')
    if length == 1:
        for char in char_set:
            yield char
    else:
        for char in char_set:
            for seq in _generate_sequences(length-1, sequence, char):
                yield char + seq
def _generate_sequences_cache(length, sequence, cache, previous=''):
    sublength = length if length == CACHE_LEN else min(CACHE_LEN, length-CACHE_LEN)
    subseq = cache[sublength != CACHE_LEN]
    char_set = ALLOWED_CHARS[previous.isalpha() * (2 - previous.islower())]
    if sequence[-length] != ' ':
        char_set = char_set[char_set.find(sequence[-length]):]
        index = len(sequence) - length
        subseq0 = ''.join(sequence[index:index+sublength]).strip()
        sequence[index:index+sublength] = [' '] * sublength
        if len(subseq0) > 1:
            subseq[char_set[0]] = tuple(
                s for s in subseq[char_set[0]] if s.startswith(subseq0))
    char_set = char_set.replace(previous, '')
    if length == CACHE_LEN:
        for char in char_set:
            for seq in subseq[char]:
                yield seq
    else:
        for char in char_set:
            for seq1 in subseq[char]:
                for seq2 in _generate_sequences_cache(
                        length-sublength, sequence, cache, seq1[-1]):
                    yield seq1 + seq2
def precompute(length):
    char_set = ALLOWED_CHARS[0]
    if length > 1:
        sequence = [' '] * length
        result = {}
        for char in char_set:
            result[char] = tuple(char + seq for seq in _generate_sequences(
                length-1, sequence, char))
    else:
        result = {char: tuple(char) for char in ALLOWED_CHARS[0]}
    return result
def generate_sequences(length, sequence=''):
    # -------------------------------------------------------------------------
    # Error checking: consistency of the value/type of the arguments
    if not isinstance(length, int):
        msg = 'The sequence length must be an integer: {}'
        raise TypeError(msg.format(type(length)))
    if length < 0:
        msg = 'The sequence length must be greater than or equal to 0: {}'
        raise ValueError(msg.format(length))
    if not isinstance(sequence, str):
        msg = 'The sequence must be a string: {}'
        raise TypeError(msg.format(type(sequence)))
    if len(sequence) > length:
        msg = 'The sequence has length greater than {}'
        raise ValueError(msg.format(length))
    # -------------------------------------------------------------------------
    if not length:
        yield ''
    else:
        # ---------------------------------------------------------------------
        # Error checking: the starting sequence, if provided, must be valid
        if any(s not in ALLOWED_CHARS[0]+' ' for s in sequence):
            msg = 'The sequence contains invalid characters: {}'
            raise ValueError(msg.format(sequence))
        if sequence.strip() != sequence.replace(' ', ''):
            msg = 'Uninitiated characters in the middle of the sequence: {}'
            raise ValueError(msg.format(sequence.strip()))
        sequence = sequence.strip()
        if any(a == b for a, b in zip(sequence[:-1], sequence[1:])):
            msg = 'No neighbours must be the same character: {}'
            raise ValueError(msg.format(sequence))
        char_type = [s.isalpha() * (2 - s.islower()) for s in sequence]
        if char_type != sorted(char_type):
            msg = '0-9 must come before a-z, which must come before A-Z: {}'
            raise ValueError(msg.format(sequence))
        # ---------------------------------------------------------------------
        sequence = list(sequence.ljust(length))
        if length <= CACHE_LEN:
            for s in _generate_sequences(length, sequence):
                yield s
        else:
            remainder = length % CACHE_LEN
            if not remainder:
                cache = tuple((precompute(CACHE_LEN),))
            else:
                cache = tuple((precompute(CACHE_LEN), precompute(remainder)))
            for s in _generate_sequences_cache(length, sequence, cache):
                yield s
I've included thorough error checks in the generate_sequences() function. For the sake of brevity you can remove them if you can guarantee that whoever calls the function will never pass invalid input, in particular an invalid starting sequence.
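If the checks are removed from the generator, the same rules can still be applied separately. The following is only a minimal sketch of such a check, not part of the answer's code; the name is_valid_sequence is hypothetical and the logic simply restates the three rules from the docstring.

```python
import string


def is_valid_sequence(seq):
    """Return True if seq obeys the rules described in the docstring.

    Hypothetical helper for illustration only.
    """
    allowed = string.digits + string.ascii_letters
    # Only alphanumerical characters are allowed.
    if any(ch not in allowed for ch in seq):
        return False
    # No two neighbouring characters may be the same.
    if any(a == b for a, b in zip(seq, seq[1:])):
        return False
    # Digits (0) come first, then lowercase (1), then uppercase (2).
    kinds = []
    for ch in seq:
        if ch in string.digits:
            kinds.append(0)
        elif ch in string.ascii_lowercase:
            kinds.append(1)
        else:
            kinds.append(2)
    return kinds == sorted(kinds)


print(is_valid_sequence('0101010'))  # True
print(is_valid_sequence('0011abc'))  # False: "00" repeats a neighbour
print(is_valid_sequence('a1B'))      # False: a digit follows a letter
```

A check like this could also be used to filter starting sequences before they are handed to the generator.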
Counting number of sequences of specific length
While the function generates the sequences one by one, there is a simple combinatorics calculation we can perform to compute how many valid sequences exist in total.
A sequence can effectively be broken down into 3 separate subsequences. For length 7, a sequence can contain anywhere from 0 to 7 digits, followed by 0 to 7 lowercase letters, followed by 0 to 7 uppercase letters, as long as the three counts sum to 7. This means we can have the partition (1, 3, 3), or (2, 1, 3), or (6, 0, 1), etc. We can use stars and bars to enumerate the ways of splitting a sum of N into k bins. There is already an implementation for Python, which we'll borrow. The first few partitions are:
[0, 0, 7]
[0, 1, 6]
[0, 2, 5]
[0, 3, 4]
[0, 4, 3]
[0, 5, 2]
[0, 6, 1]
...
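As a rough illustration of the stars and bars idea, the sketch below enumerates the partitions of 7 into 3 bins and compares the number found with the closed-form count C(n+k-1, k-1). It is written for Python 3, and the names split_sum and split_count are made up for this example.

```python
from itertools import combinations
from math import factorial


def split_sum(n, k):
    """Yield every ordered way of writing n as a sum of k non-negative integers.

    Each choice of k-1 "bar" positions among n+k-1 slots gives one split;
    the gaps between consecutive bars are the bin sizes.
    """
    for bars in combinations(range(n + k - 1), k - 1):
        edges = (-1,) + bars + (n + k - 1,)
        yield [b - a - 1 for a, b in zip(edges, edges[1:])]


def split_count(n, k):
    """Closed-form number of splits: C(n+k-1, k-1)."""
    return factorial(n + k - 1) // (factorial(k - 1) * factorial(n))


parts = list(split_sum(7, 3))
print(parts[:3])                     # [[0, 0, 7], [0, 1, 6], [0, 2, 5]]
print(len(parts), split_count(7, 3))  # 36 36
```

So for length 7 there are 36 (digits, lowercase, uppercase) partitions to sum over.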
Next, we need to calculate how many valid sequences we have within a partition. Since the digit subsequences are independent of the lowercase letters, which are independent of the uppercase letters, we can calculate them individually and multiply them together.
So, how many digit subsequences can we have for a length of 4? The first character can be any of the 10 digits, but the second character has only 9 options (ten minus whichever digit precedes it). The same holds for the third character, and so on. So the total number of valid subsequences is 10*9*9*9. Similarly, for letters with length 3, we get 26*25*25. Overall, for the partition, say, (2, 3, 2), we have 10*9*26*25*25*26*25 = 950625000 combinations.
import itertools as it
def partitions(n, k):
    for c in it.combinations(xrange(n+k-1), k-1):
        yield [b-a-1 for a, b in zip((-1,)+c, c+(n+k-1,))]
def count_subsequences(pool, length):
    if length < 2:
        return pool**length
    return pool * (pool-1)**(length-1)
def count_sequences(length):
    counts = [[count_subsequences(i, j) for j in xrange(length+1)] \
              for i in [10, 26]]
    print 'Partition {:>18}'.format('Sequence count')
    total = 0
    for a, b, c in partitions(length, 3):
        subtotal = counts[0][a] * counts[1][b] * counts[1][c]
        total += subtotal
        print '{} {:18}'.format((a, b, c), subtotal)
    print '\nTOTAL {:22}'.format(total)
Overall, we observe that generating the sequences quickly isn't the problem; there are simply so many of them that enumerating them all takes a long time. Length 7 has 78550354750 (78.5 billion) valid sequences, and this number grows by a factor of roughly 25 to 30 with each additional character (about 28 going from length 7 to 8).
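To sanity-check the counting argument, here is a small Python 3 sketch that recomputes the total from the per-partition products and compares it against a brute-force filter for a tiny length. The function names (count_block, total_sequences, brute_force) are invented for this example, and the brute force is only practical for very short strings.

```python
from itertools import product
import string


def count_block(pool, length):
    """Strings of `length` over `pool` symbols with no equal neighbours."""
    if length == 0:
        return 1
    return pool * (pool - 1) ** (length - 1)


def total_sequences(length):
    """Sum the per-partition products over every (digits, lower, upper) split."""
    total = 0
    for a in range(length + 1):
        for b in range(length - a + 1):
            c = length - a - b
            total += count_block(10, a) * count_block(26, b) * count_block(26, c)
    return total


def brute_force(length):
    """Count valid strings by testing every candidate (tiny lengths only)."""
    alphabet = string.digits + string.ascii_lowercase + string.ascii_uppercase
    kind = {}
    for ch in alphabet:
        kind[ch] = 0 if ch in string.digits else (1 if ch in string.ascii_lowercase else 2)
    count = 0
    for cand in product(alphabet, repeat=length):
        kinds = [kind[ch] for ch in cand]
        if kinds == sorted(kinds) and all(x != y for x, y in zip(cand, cand[1:])):
            count += 1
    return count


print(total_sequences(2) == brute_force(2))  # True
print(total_sequences(7))                    # 78550354750
# Growth factor when going from length 7 to length 8.
print(total_sequences(8) / float(total_sequences(7)))
```

The formula and the brute-force filter agree for length 2, and the total for length 7 matches the figure quoted above.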
Try this
import string
import random
a = ''.join(random.choice(string.ascii_lowercase + string.ascii_uppercase + string.digits) for _ in range(7))
print(a)
If it's a random string you want that sticks to the above rules you can use something like this:
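import random
from string import ascii_lowercase, ascii_uppercase, digits
def f():
    digitLen = random.randrange(8)
    smallCharLen = random.randint(0, 7 - digitLen)
    capCharLen = 7 - (smallCharLen + digitLen)
    # Draw each digit separately: zfill() on a random integer still yields "0"
    # when digitLen is 0, which would make the result one character too long.
    print("".join([random.choice(digits) for i in range(digitLen)]) +
          "".join([random.choice(ascii_lowercase) for i in range(smallCharLen)]) +
          "".join([random.choice(ascii_uppercase) for i in range(capCharLen)]))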
I haven't added the repeated character rule, but once you have the string it's easy to filter out the unwanted strings using dictionaries. You can also fix the length of each segment by putting conditions on the segment lengths.
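As an alternative to filtering afterwards, the neighbour rule can be enforced while the string is being built. The sketch below is one possible way to do that, not the method described above; random_block and random_sequence are hypothetical names, and the result is not uniformly distributed over all valid strings because the segment lengths are drawn first.

```python
import random
from string import ascii_lowercase, ascii_uppercase, digits


def random_block(pool, length, rng=random):
    """Pick `length` characters from `pool` with no two equal neighbours."""
    out = []
    for _ in range(length):
        # Exclude the previously chosen character, if there is one.
        choices = [ch for ch in pool if not out or ch != out[-1]]
        out.append(rng.choice(choices))
    return ''.join(out)


def random_sequence(total=7, rng=random):
    """Digits, then lowercase, then uppercase; any segment may be empty."""
    n_digits = rng.randint(0, total)
    n_lower = rng.randint(0, total - n_digits)
    n_upper = total - n_digits - n_lower
    return (random_block(digits, n_digits, rng)
            + random_block(ascii_lowercase, n_lower, rng)
            + random_block(ascii_uppercase, n_upper, rng))


print(random_sequence())
```

Because digits, lowercase and uppercase letters never overlap, the segment boundaries cannot introduce a repeated neighbour.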
Edit: a minor bug.
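Extreme cases are not handled here (for example, len1 == 7 makes random.randint(1, 0) raise ValueError, and the digit and lowercase segments can never be empty), but it can be done this way: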
import random
from string import digits, ascii_uppercase, ascii_lowercase
len1 = random.randint(1, 7)
len2 = random.randint(1, 7-len1)
len3 = 7 - len1 - len2
print len1, len2, len3
result = ''.join(random.sample(digits, len1) + random.sample(ascii_lowercase, len2) + random.sample(ascii_uppercase, len3))
Using an approach similar to that of #julian:
from string import digits, ascii_uppercase, ascii_lowercase
from itertools import product, tee, chain, izip, imap
def flatten(listOfLists):
    "Flatten one level of nesting"
    #recipe of itertools
    return chain.from_iterable(listOfLists)
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    #recipe of itertools
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)
def eq_pair(x):
    return x[0]==x[1]
def comb_noNN(alfa,size):
    if size>0:
        for candidato in product(alfa,repeat=size):
            if not any( imap(eq_pair,pairwise(candidato)) ):
                yield candidato
    else:
        yield tuple()
def my_string(N=7):
    for a in range(N+1):
        for b in range(N-a+1):
            for c in range(N-a-b+1):
                if sum([a,b,c])==N:
                    for letras in product(
                            comb_noNN(digits,c),
                            comb_noNN(ascii_lowercase,b),
                            comb_noNN(ascii_uppercase,a)
                            ):
                        yield "".join(flatten(letras))
comb_noNN generates all combinations of characters of a particular size that follow rule 3. my_string then goes through every combination of lengths that adds up to N and generates every string that follows rule 1 by generating the digits, lowercase letters and uppercase letters individually.
Some output of for i,x in enumerate(my_string()):
0, '0101010'
...
100, '0101231'
...
491041580, '936gzrf'
...
758790032, '27ktxfi'
...
The reason it takes a long time to generate the first result with the original implementation is it takes a long time to reach the first valid value of 0101010 when starting from 0000000 as you do when using product.
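To put a number on that delay, the sketch below computes how many candidates itertools.product emits before it reaches a given string. It assumes the original code iterates product over a 62-character alphabet ordered as digits, lowercase, uppercase; that ordering and the name product_index are assumptions made for this example.

```python
from itertools import islice, product
from string import ascii_lowercase, ascii_uppercase, digits

# Assumed alphabet order: digits, then lowercase, then uppercase (62 symbols).
ALPHABET = digits + ascii_lowercase + ascii_uppercase


def product_index(candidate, alphabet=ALPHABET):
    """Position of `candidate` in product(alphabet, repeat=len(candidate)).

    product() counts like an odometer, so the position is just the
    candidate read as a number in base len(alphabet).
    """
    index = 0
    for ch in candidate:
        index = index * len(alphabet) + alphabet.index(ch)
    return index


# Cross-check on a short string by actually skipping that many items.
skipped = next(islice(product(ALPHABET, repeat=2), product_index('1a'), None))
print(''.join(skipped))            # 1a

print(product_index('0101010'))    # 916371222
```

Under that assumption, roughly 916 million invalid candidates are generated and discarded before the first valid string appears.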
Here's a recursive version which generates valid sequences rather than discarding invalid ones:
from string import digits, ascii_uppercase, ascii_lowercase
from sys import argv
from itertools import combinations_with_replacement, product
all_chars=[digits, ascii_lowercase, ascii_uppercase]
def seq(char_sets, start=None):
    for char_set in char_sets:
        for val in seqperm(char_set, start):
            yield val
def seqperm(char_set, start=None, exclude=None):
    left_chars, remaining_chars=char_set[0], char_set[1:]
    if start:
        try:
            left_chars=left_chars[left_chars.index(start[0]):]
            start=start[1:]
        except ValueError:
            # start[0] is not in this position's character set
            left_chars=''
    for left in left_chars:
        if left != exclude:
            if len(remaining_chars) > 0:
                for right in seqperm(remaining_chars, start, left):
                    yield left + right
            else:
                yield left
if __name__ == "__main__":
    count=int(argv[1])
    start=None
    if len(argv) == 3:
        start=argv[2]
    # char_sets=list(combinations_with_replacement(all_chars, 7))
    char_sets=[[''.join(all_chars)] * 7]
    for idx, val in enumerate(seq(char_sets, start)):
        if idx == count:
            break
        print idx, val
Run as follows:
./permute.py 10
Output:
0 0101010
1 0101012
2 0101013
3 0101014
4 0101015
5 0101016
6 0101017
7 0101018
8 0101019
9 010101a
If you pass an additional argument, the script skips to the portion of the sequence that starts with that argument, like this:
./permute.py 10 01234Z
If it's a requirement to generate only permutations where lowercase letters always follow numbers and uppercase letters always follow numbers and lowercase letters, then comment out the line char_sets=[[''.join(all_chars)] * 7] and use the line char_sets=list(combinations_with_replacement(all_chars, 7)).
Sample output for the above command line with char_sets=list(combinations_with_replacement(all_chars, 7)):
0 01234ZA
1 01234ZB
2 01234ZC
3 01234ZD
4 01234ZE
5 01234ZF
6 01234ZG
7 01234ZH
8 01234ZI
9 01234ZJ
Sample output for the same command line with char_sets=[[''.join(all_chars)] * 7]:
0 01234Z0
1 01234Z1
2 01234Z2
3 01234Z3
4 01234Z4
5 01234Z5
6 01234Z6
7 01234Z7
8 01234Z8
9 01234Z9
It's possible to implement the above without recursion as below. Performance characteristics don't change much:
from string import digits, ascii_uppercase, ascii_lowercase
from sys import argv
from itertools import combinations_with_replacement, product, izip_longest
all_chars=[digits, ascii_lowercase, ascii_uppercase]
def seq(char_sets, start=''):
    for char_set in char_sets:
        for val in seqperm(char_set, start):
            yield val
def seqperm(char_set, start=''):
    iters=[iter(chars) for chars in char_set]
    # move to starting point in sequence if specified
    for char, citer, chars in zip(list(start), iters, char_set):
        try:
            for _ in range(0, chars.index(char)):
                citer.next()
        except ValueError:
            # start character not available at this position: nothing to yield
            return
    pos=0
    val=''
    while True:
        citer=iters[pos]
        try:
            char=citer.next()
            # skip a character equal to the previous one
            if val and val[-1] == char:
                char=citer.next()
            if pos == len(char_set) - 1:
                yield val+char
            else:
                val = val + char
                pos += 1
        except StopIteration:
            if pos == 0:
                return
            # restart this position with its own character set, then backtrack
            iters[pos] = iter(char_set[pos])
            pos -= 1
            val=val[:pos]
if __name__ == "__main__":
    count=int(argv[1])
    start=''
    if len(argv) == 3:
        start=argv[2]
    # char_sets=list(combinations_with_replacement(all_chars, 7))
    char_sets=[[''.join(all_chars)] * 7]
    for idx, val in enumerate(seq(char_sets, start)):
        if idx == count:
            break
        print idx, val
A recursive version with caching is also possible and that generates results faster but is less flexible.