I would like to create a program that generate a particular long 7 characters string.
It must follow this rules:
0-9 are before a-z which are before A-Z
Length is 7 characters.
Each character must be different from the two close (Example 'NN' is not allowed)
I need all the possible combination incrementing from 0000000 to ZZZZZZZ but not in a random sequence
I have already done it with this code:
from string import digits, ascii_uppercase, ascii_lowercase
from itertools import product
chars = digits + ascii_lowercase + ascii_uppercase
for n in range(7, 8):
for comb in product(chars, repeat=n):
if (comb[6] != comb[5] and comb[5] != comb[4] and comb[4] != comb[3] and comb[3] != comb[2] and comb[2] != comb[1] and comb[1] != comb[0]):
print ''.join(comb)
But it is not performant at all because i have to wait a long time before the next combination.
Can someone help me?
Edit: I've updated the solution to use cached short sequences for lengths greater than 4. This significantly speeds up the calculations. With the simple version, it'd take 18.5 hours to generate all sequences of length 7, but with the new method only 4.5 hours.
I'll let the docstring do all of the talking for describing the solution.
"""
Problem:
Generate a string of N characters that only contains alphanumerical
characters. The following restrictions apply:
* 0-9 must come before a-z, which must come before A-Z
* it's valid to not have any digits or letters in a sequence
* no neighbouring characters can be the same
* the sequences must be in an order as if the string is base62, e.g.,
01010...01019, 0101a...0101z, 0101A...0101Z, 01020...etc
Solution:
Implement a recursive approach which discards invalid trees. For example,
for "---" start with "0--" and recurse. Try "00-", but discard it for
"01-". The first and last sequences would then be "010" and "ZYZ".
If the previous character in the sequence is a lowercase letter, such as
in "02f-", shrink the pool of available characters to a-zA-Z. Similarly,
for "9gB-", we should only be working with A-Z.
The input also allows to define a specific sequence to start from. For
example, for "abGH", each character will have access to a limited set of
its pool. In this case, the last letter can iterate from H to Z, at which
point it'll be free to iterate its whole character pool next time around.
When specifying a starting sequence, if it doesn't have enough characters
compared to `length`, it will be padded to the right with characters free
to explore their character pool. For example, for length 4, the starting
sequence "29" will be transformed to "29 ", where we will deal with two
restricted characters temporarily.
For long lengths the function internally calls a routine which relies on
fewer recursions and cached results. Length 4 has been chosen as optimal
in terms of precomputing time and memory demands. Briefly, the sequence is
broken into a remainder and chunks of 4. For each preceeding valid
subsequence, all valid following subsequences are fetched. For example, a
sequence of six would be split into "--|----" and for "fB|----" all
subsequences of 4 starting A, C, D, etc would be produced.
Examples:
>>> for i, x in enumerate(generate_sequences(7)):
... print i, x
0, 0101010
1, 0101012
etc
>>> for i, x in enumerate(generate_sequences(7, '012abcAB')):
... print i, x
0, 012abcAB
1, 012abcAC
etc
>>> for i, x in enumerate(generate_sequences(7, 'aB')):
... print i, x
0, aBABABA
1, aBABABC
etc
"""
import string
ALLOWED_CHARS = (string.digits + string.ascii_letters,
string.ascii_letters,
string.ascii_uppercase,
)
CACHE_LEN = 4
def _generate_sequences(length, sequence, previous=''):
char_set = ALLOWED_CHARS[previous.isalpha() * (2 - previous.islower())]
if sequence[-length] != ' ':
char_set = char_set[char_set.find(sequence[-length]):]
sequence[-length] = ' '
char_set = char_set.replace(previous, '')
if length == 1:
for char in char_set:
yield char
else:
for char in char_set:
for seq in _generate_sequences(length-1, sequence, char):
yield char + seq
def _generate_sequences_cache(length, sequence, cache, previous=''):
sublength = length if length == CACHE_LEN else min(CACHE_LEN, length-CACHE_LEN)
subseq = cache[sublength != CACHE_LEN]
char_set = ALLOWED_CHARS[previous.isalpha() * (2 - previous.islower())]
if sequence[-length] != ' ':
char_set = char_set[char_set.find(sequence[-length]):]
index = len(sequence) - length
subseq0 = ''.join(sequence[index:index+sublength]).strip()
sequence[index:index+sublength] = [' '] * sublength
if len(subseq0) > 1:
subseq[char_set[0]] = tuple(
s for s in subseq[char_set[0]] if s.startswith(subseq0))
char_set = char_set.replace(previous, '')
if length == CACHE_LEN:
for char in char_set:
for seq in subseq[char]:
yield seq
else:
for char in char_set:
for seq1 in subseq[char]:
for seq2 in _generate_sequences_cache(
length-sublength, sequence, cache, seq1[-1]):
yield seq1 + seq2
def precompute(length):
char_set = ALLOWED_CHARS[0]
if length > 1:
sequence = [' '] * length
result = {}
for char in char_set:
result[char] = tuple(char + seq for seq in _generate_sequences(
length-1, sequence, char))
else:
result = {char: tuple(char) for char in ALLOWED_CHARS[0]}
return result
def generate_sequences(length, sequence=''):
# -------------------------------------------------------------------------
# Error checking: consistency of the value/type of the arguments
if not isinstance(length, int):
msg = 'The sequence length must be an integer: {}'
raise TypeError(msg.format(type(length)))
if length < 0:
msg = 'The sequence length must be greater or equal than 0: {}'
raise ValueError(msg.format(length))
if not isinstance(sequence, str):
msg = 'The sequence must be a string: {}'
raise TypeError(msg.format(type(sequence)))
if len(sequence) > length:
msg = 'The sequence has length greater than {}'
raise ValueError(msg.format(length))
# -------------------------------------------------------------------------
if not length:
yield ''
else:
# ---------------------------------------------------------------------
# Error checking: the starting sequence, if provided, must be valid
if any(s not in ALLOWED_CHARS[0]+' ' for s in sequence):
msg = 'The sequence contains invalid characters: {}'
raise ValueError(msg.format(sequence))
if sequence.strip() != sequence.replace(' ', ''):
msg = 'Uninitiated characters in the middle of the sequence: {}'
raise ValueError(msg.format(sequence.strip()))
sequence = sequence.strip()
if any(a == b for a, b in zip(sequence[:-1], sequence[1:])):
msg = 'No neighbours must be the same character: {}'
raise ValueError(msg.format(sequence))
char_type = [s.isalpha() * (2 - s.islower()) for s in sequence]
if char_type != sorted(char_type):
msg = '0-9 must come before a-z, which must come before A-Z: {}'
raise ValueError(msg.format(sequence))
# ---------------------------------------------------------------------
sequence = list(sequence.ljust(length))
if length <= CACHE_LEN:
for s in _generate_sequences(length, sequence):
yield s
else:
remainder = length % CACHE_LEN
if not remainder:
cache = tuple((precompute(CACHE_LEN),))
else:
cache = tuple((precompute(CACHE_LEN), precompute(remainder)))
for s in _generate_sequences_cache(length, sequence, cache):
yield s
I've included thorough error checks in the generate_sequences() function. For the sake of brevity you can remove them if you can guarantee that whoever calls the function will never do so with invalid input. Specifically, invalid starting sequences.
Counting number of sequences of specific length
While the function will sequentially generate the sequences, there is a simple combinatorics calcuation we can perform to compute how many valid sequences exist in total.
The sequences can effectively be broken down to 3 separate subsequences. Generally speaking, a sequence can contain anything from 0 to 7 digits, followed by from 0 to 7 lowercase letters, followed by from 0 to 7 uppercase letters. As long as the sum of those is 7. This means we can have the partition (1, 3, 3), or (2, 1, 3), or (6, 0, 1), etc. We can use the stars and bars to calculate the various combinations of splitting a sum of N into k bins. There is already an implementation for python, which we'll borrow. The first few partitions are:
[0, 0, 7]
[0, 1, 6]
[0, 2, 5]
[0, 3, 4]
[0, 4, 3]
[0, 5, 2]
[0, 6, 1]
...
Next, we need to calculate how many valid sequences we have within a partition. Since the digit subsequences are independent of the lowercase letters, which are independent of the uppercase letters, we can calculate them individually and multiply them together.
So, how many digit combinations we can have for a length of 4? The first character can be any of the 10 digits, but the second character has only 9 options (ten minus the one that the previous character is). Similarly for the third letter and so on. So the total number of valid subsequences is 10*9*9*9. Similarly, for length 3 for letters, we get 26*25*25. Overall, for the partition, say, (2, 3, 2), we have 10*9*26*25*25*26*25 = 950625000 combinations.
import itertools as it
def partitions(n, k):
for c in it.combinations(xrange(n+k-1), k-1):
yield [b-a-1 for a, b in zip((-1,)+c, c+(n+k-1,))]
def count_subsequences(pool, length):
if length < 2:
return pool**length
return pool * (pool-1)**(length-1)
def count_sequences(length):
counts = [[count_subsequences(i, j) for j in xrange(length+1)] \
for i in [10, 26]]
print 'Partition {:>18}'.format('Sequence count')
total = 0
for a, b, c in partitions(length, 3):
subtotal = counts[0][a] * counts[1][b] * counts[1][c]
total += subtotal
print '{} {:18}'.format((a, b, c), subtotal)
print '\nTOTAL {:22}'.format(total)
Overall, we observe that while generating the sequences fast isn't a problem, there are so many that it can take a long time. Length 7 has 78550354750 (78.5 billion) valid sequences and this number only scales approximately by a factor of 25 with each incremented length.
Try this
import string
import random
a = ''.join(random.choice(string.ascii_lowercase + string.ascii_uppercase + string.digits) for _ in range(7))
print(a)
If it's a random string you want that sticks to the above rules you can use something like this:
def f():
digitLen = random.randrange(8)
smallCharLen = random.randint(0, 7 - digitLen)
capCharLen = 7 - (smallCharLen + digitLen)
print (str(random.randint(0,10**digitLen-1)).zfill(digitLen) +
"".join([random.choice(ascii_lowercase) for i in range(smallCharLen)]) +
"".join([random.choice(ascii_uppercase) for i in range(capCharLen)]))
I haven't added the repeated character rule but one you have the string it's easy to filter out the unwanted strings using dictionaries. You can also fix the length of each segment by putting conditions on the segment lengths.
Edit: a minor bug.
Extreme cases are not handled here but can be done this way
import random
from string import digits, ascii_uppercase, ascii_lowercase
len1 = random.randint(1, 7)
len2 = random.randint(1, 7-len1)
len3 = 7 - len1 - len2
print len1, len2, len3
result = ''.join(random.sample(digits, len1) + random.sample(ascii_lowercase, len2) + random.sample(ascii_uppercase, len3))
with a similar approach of #julian
from string import digits, ascii_uppercase, ascii_lowercase
from itertools import product, tee, chain, izip, imap
def flatten(listOfLists):
"Flatten one level of nesting"
#recipe of itertools
return chain.from_iterable(listOfLists)
def pairwise(iterable):
"s -> (s0,s1), (s1,s2), (s2, s3), ..."
#recipe of itertools
a, b = tee(iterable)
next(b, None)
return izip(a, b)
def eq_pair(x):
return x[0]==x[1]
def comb_noNN(alfa,size):
if size>0:
for candidato in product(alfa,repeat=size):
if not any( imap(eq_pair,pairwise(candidato)) ):
yield candidato
else:
yield tuple()
def my_string(N=7):
for a in range(N+1):
for b in range(N-a+1):
for c in range(N-a-b+1):
if sum([a,b,c])==N:
for letras in product(
comb_noNN(digits,c),
comb_noNN(ascii_lowercase,b),
comb_noNN(ascii_uppercase,a)
):
yield "".join(flatten(letras))
comb_noNN generate all combinations of char of a particular size that follow rule 3, then in my_string check all combination of length that add up to N and generate all string that follow rule 1 by individually generating each of digits, lower and upper case letters.
Some output of for i,x in enumerate(my_string())
0, '0101010'
...
100, '0101231'
...
491041580, '936gzrf'
...
758790032, '27ktxfi'
...
The reason it takes a long time to generate the first result with the original implementation is it takes a long time to reach the first valid value of 0101010 when starting from 0000000 as you do when using product.
Here's a recursive version which generates valid sequences rather than discarding invalid ones:
from string import digits, ascii_uppercase, ascii_lowercase
from sys import argv
from itertools import combinations_with_replacement, product
all_chars=[digits, ascii_lowercase, ascii_uppercase]
def seq(char_sets, start=None):
for char_set in char_sets:
for val in seqperm(char_set, start):
yield val
def seqperm(char_set, start=None, exclude=None):
left_chars, remaining_chars=char_set[0], char_set[1:]
if start:
try:
left_chars=left_chars[left_chars.index(start[0]):]
start=start[1:]
except:
left_chars=''
for left in left_chars:
if left != exclude:
if len(remaining_chars) > 0:
for right in seqperm(remaining_chars, start, left):
yield left + right
else:
yield left
if __name__ == "__main__":
count=int(argv[1])
start=None
if len(argv) == 3:
start=argv[2]
# char_sets=list(combinations_with_replacement(all_chars, 7))
char_sets=[[''.join(all_chars)] * 7]
for idx, val in enumerate(seq(char_sets, start)):
if idx == count:
break
print idx, val
Run as follows:
./permute.py 10
Output:
0 0101010
1 0101012
2 0101013
3 0101014
4 0101015
5 0101016
6 0101017
7 0101018
8 0101019
9 010101a
If you pass an additional argument then the script skips to the portion of the sequence which starts with that third argument like this:
./permute.py 10 01234Z
If it's a requirement to generate only permutations where lower letters always follow numbers and upper case always follow numbers and lower case then comment out the line char_sets=[[''.join(all_chars)] * 7] and use the line char_sets=list(combinations_with_replacement(all_chars, 7)).
Sample output for the above command line with char_sets=list(combinations_with_replacement(all_chars, 7)):
0 01234ZA
1 01234ZB
2 01234ZC
3 01234ZD
4 01234ZE
5 01234ZF
6 01234ZG
7 01234ZH
8 01234ZI
9 01234ZJ
Sample output for the same command line with char_sets=[[''.join(all_chars)] * 7]:
0 01234Z0
1 01234Z1
2 01234Z2
3 01234Z3
4 01234Z4
5 01234Z5
6 01234Z6
7 01234Z7
8 01234Z8
9 01234Z9
It's possible to implement the above without recursion as below. Performance characteristics don't change much:
from string import digits, ascii_uppercase, ascii_lowercase
from sys import argv
from itertools import combinations_with_replacement, product, izip_longest
all_chars=[digits, ascii_lowercase, ascii_uppercase]
def seq(char_sets, start=''):
for char_set in char_sets:
for val in seqperm(char_set, start):
yield val
def seqperm(char_set, start=''):
iters=[iter(chars) for chars in char_set]
# move to starting point in sequence if specified
for char, citer, chars in zip(list(start), iters, char_set):
try:
for _ in range(0, chars.index(char)):
citer.next()
except ValueError:
raise StopIteration
pos=0
val=''
while True:
citer=iters[pos]
try:
char=citer.next()
if val and val[-1] == char:
char=citer.next()
if pos == len(char_set) - 1:
yield val+char
else:
val = val + char
pos += 1
except StopIteration:
if pos == 0:
raise StopIteration
iters[pos] = iter(chars)
pos -= 1
val=val[:pos]
if __name__ == "__main__":
count=int(argv[1])
start=''
if len(argv) == 3:
start=argv[2]
# char_sets=list(combinations_with_replacement(all_chars, 7))
char_sets=[[''.join(all_chars)] * 7]
for idx, val in enumerate(seq(char_sets, start)):
if idx == count:
break
print idx, val
A recursive version with caching is also possible and that generates results faster but is less flexible.
Related
I'm looking for a Python solution to extract from a series of letters/numbers, the most repeating pattern which comes with an outcome and a specific length.
Problem: When is search more likely to occur given a 4 digit block (Block Length) of digits/letters? (So the string has to END with search)
Example:
Input: 0010000101010001011010011101001101000011100010100101010111
Search: 1
Block Length: 4
---
Answer: 0101
Appeared: 5 times
In the above case "1" is more likely to appear when 010 comes before 1.
001000 [0101] 0100 [0101] 1010011101001101000011100 [0101] 0 [0101] [0101] 11
So the answer is 0101 an it appeared 5 times.
NOTE:
This could return 0001 but that only appeared 4 times while 0101 appeared 5 times.
Changing the length would result in:
Input: 0010000101010001011010011101001101000011100010100101010111 (same as above)
Search: 1
Block Length: 5
---
Answer: 00101
Appeared: 4 times
Because:
00100 [00101] 010 [00101] 101001110100110100001110 [00101][00101] 010111
NOTE:
The second example could return 00001 but that only appeared 2 times while 00101 appeared 4 times.
If there are multiple outcomes ie: 0101 and 0111 have the same presence, both outcome should be showing.
I'm at the point where I can find the more repeating string, but I don't know how to give the length:
def find_most_repetitive_substring(string):
max_counter = 1
position, substring_length, times = 0, 0, 0
for i in range(len(string)):
for j in range(len(string) - i):
counter = 1
if j == 0:
continue
while True:
if string[i + counter * j: i + (counter + 1) * j] != string[i: i + j] or i + (counter + 1) * j > len(string):
if counter > max_counter:
max_counter = counter
position, substring_length, times = i, j, counter
break
else:
counter += 1
return string[position: position + substring_length * times]
I've used re here as you're dealing with text but you can use over techiques to create blocks of N length with overlaps...
import re
def f(text, search, length):
# Get unique blocks of length N - including overlaps
overlaps = set(re.findall(f'(?=(.{{{length}}}))', text))
# Priotise those ending with length, then the count of non-overlapping and then include the block itself
return max((block.endswith(search), text.count(block), block) for block in overlaps)
S = '0010000101010001011010011101001101000011100010100101010111'
f(S, '1', 4)
# (True, 5, '0101')
f(S, '1', 5)
#(True, 4, '00101')
This might help you with part of your question (i.e., getting the counts); iterating and storing things in a lookup table (dictionary):
def find_most_repetitive_substring(string, substring_length, ending='1'):
"""
Finds the most repetive substring in a given string.
:param string: String to search for repetitions.
:param substring_length: Length of the substring to search for.
:param ending: character that pattern must end with. default is '1'.
:return: Most repetitive substring and its number of occurrences.
"""
substring_count = {}
for i in range(len(string) - substring_length + 1):
substring = string[i:i + substring_length]
if substring[-1] == ending: # added for ending
if substring in substring_count:
substring_count[substring] += 1
else:
substring_count[substring] = 1
max_substr = max(substring_count, key=substring_count.get)
return max_substr, substring_count[max_substr]
find_most_repetitive_substring('0010000101010001011010011101001101000011100010100101010111', 4)
And if you want to get all the keys with the max val, you can just return a list, changing the last lines to something like this:
max_substr = max(substring_count, key=substring_count.get)
max_substrs = [k for k, v in substring_count.items() if v == substring_count[max_substr]]
return max_substrs, substring_count[max_substr]
You could use Counter:
from collections import Counter
def find_most_repetitive_substring(string, size, search):
res = Counter([string[i:i+size] for i in range(len(string) - size)])
return max(res, key=lambda x: (x.endswith(search), res.get(x)))
Example run:
inp = "0010000101010001011010011101001101000011100010100101010111"
s = find_most_repetitive_substring(inp, 4, "1")
print(s) # 0101
This question was asked in an exam but my code (given below) passed just 2 cases out of 7 cases.
Input Format : single line input seperated by comma
Input: str = “abcd,b”
Output: 6
“ab”, “abc”, “abcd”, “b”, “bc” and “bcd” are the required sub-strings.
def slicing(s, k, n):
loop_value = n - k + 1
res = []
for i in range(loop_value):
res.append(s[i: i + k])
return res
x, y = input().split(',')
n = len(x)
res1 = []
for i in range(1, n + 1):
res1 += slicing(x, i, n)
count = 0
for ele in res1:
if y in ele:
count += 1
print(count)
When the target string (ts) is found in the string S, you can compute the number of substrings containing that instance by multiplying the number of characters before the target by the number of characters after the target (plus one on each side).
This will cover all substrings that contain this instance of the target string leaving only the "after" part to analyse further, which you can do recursively.
def countsubs(S,ts):
if ts not in S: return 0 # shorter or no match
before,after = S.split(ts,1) # split on target
result = (len(before)+1)*(len(after)+1) # count for this instance
return result + countsubs(ts[1:]+after,ts) # recurse with right side
print(countsubs("abcd","b")) # 6
This will work for single character and multi-character targets and will run much faster than checking all combinations of substrings one by one.
Here is a simple solution without recursion:
def my_function(s):
l, target = s.split(',')
result = []
for i in range(len(l)):
for j in range(i+1, len(l)+1):
ss = l[i] + l[i+1:j]
if target in ss:
result.append(ss)
return f'count = {len(result)}, substrings = {result}'
print(my_function("abcd,b"))
#count = 6, substrings = ['ab', 'abc', 'abcd', 'b', 'bc', 'bcd']
Here you go, this should help
from itertools import combinations
output = []
initial = input('Enter string and needed letter seperated by commas: ') #Asking for input
list1 = initial.split(',') #splitting the input into two parts i.e the actual text and the letter we want common in output
text = list1[0]
final = [''.join(l) for i in range(len(text)) for l in combinations(text, i+1)] #this is the core part of our code, from this statement we get all the available combinations of the set of letters (all the way from 1 letter combinations to nth letter)
for i in final:
if 'b' in i:
output.append(i) #only outputting the results which have the required letter/phrase in it
I want to be able to generate 12 character long chain, of hexadecimal, BUT with no more than 2 identical numbers duplicate in the chain: 00 and not 000
Because, I know how to generate ALL possibilites, including 00000000000 to FFFFFFFFFFF, but I know that I won't use all those values, and because the size of the file generated with ALL possibilities is many GB long, I want to reduce the size by avoiding the not useful generated chains.
So my goal is to have results like 00A300BF8911 and not like 000300BF8911
Could you please help me to do so?
Many thanks in advance!
if you picked the same one twice, remove it from the choices for a round:
import random
hex_digits = set('0123456789ABCDEF')
result = ""
pick_from = hex_digits
for digit in range(12):
cur_digit = random.sample(hex_digits, 1)[0]
result += cur_digit
if result[-1] == cur_digit:
pick_from = hex_digits - set(cur_digit)
else:
pick_from = hex_digits
print(result)
Since the title mentions generators. Here's the above as a generator:
import random
hex_digits = set('0123456789ABCDEF')
def hexGen():
while True:
result = ""
pick_from = hex_digits
for digit in range(12):
cur_digit = random.sample(hex_digits, 1)[0]
result += cur_digit
if result[-1] == cur_digit:
pick_from = hex_digits - set(cur_digit)
else:
pick_from = hex_digits
yield result
my_hex_gen = hexGen()
counter = 0
for result in my_hex_gen:
print(result)
counter += 1
if counter > 10:
break
Results:
1ECC6A83EB14
D0897DE15E81
9C3E9028B0DE
CE74A2674AF0
9ECBD32C003D
0DF2E5DAC0FB
31C48E691C96
F33AAC2C2052
CD4CEDADD54D
40A329FF6E25
5F5D71F823A4
You could also change the while true loop to only produce a certain number of these based on a number passed into the function.
I interpret this question as, "I want to construct a rainbow table by iterating through all strings that have the following qualities. The string has a length of 12, contains only the characters 0-9 and A-F, and it never has the same character appearing three times in a row."
def iter_all_strings_without_triplicates(size, last_two_digits = (None, None)):
a,b = last_two_digits
if size == 0:
yield ""
else:
for c in "0123456789ABCDEF":
if a == b == c:
continue
else:
for rest in iter_all_strings_without_triplicates(size-1, (b,c)):
yield c + rest
for s in iter_all_strings_without_triplicates(12):
print(s)
Result:
001001001001
001001001002
001001001003
001001001004
001001001005
001001001006
001001001007
001001001008
001001001009
00100100100A
00100100100B
00100100100C
00100100100D
00100100100E
00100100100F
001001001010
001001001011
...
Note that there will be several hundred terabytes' worth of values outputted, so you aren't saving much room compared to just saving every single string, triplicates or not.
import string, random
source = string.hexdigits[:16]
result = ''
while len(result) < 12 :
idx = random.randint(0,len(source))
if len(result) < 3 or result[-1] != result[-2] or result[-1] != source[idx] :
result += source[idx]
You could extract a random sequence from a list of twice each hexadecimal digits:
digits = list('1234567890ABCDEF') * 2
random.shuffle(digits)
hex_number = ''.join(digits[:12])
If you wanted to allow shorter sequences, you could randomize that too, and left fill the blanks with zeros.
import random
digits = list('1234567890ABCDEF') * 2
random.shuffle(digits)
num_digits = random.randrange(3, 13)
hex_number = ''.join(['0'] * (12-num_digits)) + ''.join(digits[:num_digits])
print(hex_number)
You could use a generator iterating a window over the strings your current implementation yields. Sth. like (hex_str[i:i + 3] for i in range(len(hex_str) - window_size + 1)) Using len and set you could count the number of different characters in the slice. Although in your example it might be easier to just compare all 3 characters.
You can create an array from 0 to 255, and use random.sample with your list to get your list
I have a string that holds a very long sentence without whitespaces/spaces.
mystring = "abcdthisisatextwithsampletextforasampleabcd"
I would like to find all of the repeated substrings that contains minimum 4 chars.
So I would like to achieve something like this:
'text' 2 times
'sample' 2 times
'abcd' 2 times
As both abcd,text and sample can be found two times in the mystring they were recognized as properly matched substrings with more than 4 char length. It's important that I am seeking repeated substrings, finding only existing English words is not a requirement.
The answers I found are helpful for finding duplicates in texts with whitespaces, but I couldn't find a proper resource that covers the situation when there are no spaces and whitespaces in the string. How can this be done in the most efficient way?
Let's go through this step by step. There are several sub-tasks you should take care of:
Identify all substrings of length 4 or more.
Count the occurrence of these substrings.
Filter all substrings with 2 occurrences or more.
You can actually put all of them into a few statements. For understanding, it is easier to go through them one at a time.
The following examples all use
mystring = "abcdthisisatextwithsampletextforasampleabcd"
min_length = 4
1. Substrings of a given length
You can easily get substrings by slicing - for example, mystring[4:4+6] gives you the substring from position 4 of length 6: 'thisis'. More generically, you want substrings of the form mystring[start:start+length].
So what values do you need for start and length?
start must...
cover all substrings, so it must include the first character: start in range(0, ...).
not map to short substrings, so it can stop min_length characters before the end: start in range(..., len(mystring) - min_length + 1).
length must...
cover the shortest substring of length 4: length in range(min_length, ...).
not exceed the remaining string after i: length in range(..., len(mystring) - i + 1))
The +1 terms come from converting lengths (>=1) to indices (>=0).
You can put this all together into a single comprehension:
substrings = [
mystring[i:i+j]
for i in range(0, len(mystring) - min_length + 1)
for j in range(min_length, len(mystring) - i + 1)
]
2. Count substrings
Trivially, you want to keep a count for each substring. Keeping anything for each specific object is what dicts are made for. So you should use substrings as keys and counts as values in a dict. In essence, this corresponds to this:
counts = {}
for substring in substrings:
try: # increase count for existing keys, set for new keys
counts[substring] += 1
except KeyError:
counts[substring] = 1
You can simply feed your substrings to collections.Counter, and it produces something like the above.
>>> counts = collections.Counter(substrings)
>>> print(counts)
Counter({'abcd': 2, 'abcdt': 1, 'abcdth': 1, 'abcdthi': 1, 'abcdthis': 1, ...})
Notice how the duplicate 'abcd' maps to the count of 2.
3. Filtering duplicate substrings
So now you have your substrings and the count for each. You need to remove the non-duplicate substrings - those with a count of 1.
Python offers several constructs for filtering, depending on the output you want. These work also if counts is a regular dict:
>>> list(filter(lambda key: counts[key] > 1, counts))
['abcd', 'text', 'samp', 'sampl', 'sample', 'ampl', 'ample', 'mple']
>>> {key: value for key, value in counts.items() if value > 1}
{'abcd': 2, 'ampl': 2, 'ample': 2, 'mple': 2, 'samp': 2, 'sampl': 2, 'sample': 2, 'text': 2}
Using Python primitives
Python ships with primitives that allow you to do this more efficiently.
Use a generator to build substrings. A generator builds its member on the fly, so you never actually have them all in-memory. For your use case, you can use a generator expression:
substrings = (
mystring[i:i+j]
for i in range(0, len(mystring) - min_length + 1)
for j in range(min_length, len(mystring) - i + 1)
)
Use a pre-existing Counter implementation. Python comes with a dict-like container that counts its members: collections.Counter can directly digest your substring generator. Especially in newer version, this is much more efficient.
counts = collections.Counter(substrings)
You can exploit Python's lazy filters to only ever inspect one substring. The filter builtin or another generator generator expression can produce one result at a time without storing them all in memory.
for substring in filter(lambda key: counts[key] > 1, counts):
print(substring, 'occurs', counts[substring], 'times')
Nobody is using re! Time for an answer [ab]using the regular expression built-in module ;)
import re
Finding all the maximal substrings that are repeated
repeated_ones = set(re.findall(r"(.{4,})(?=.*\1)", mystring))
This matches the longest substrings which have at least a single repetition after (without consuming). So it finds all disjointed substrings that are repeated while only yielding the longest strings.
Finding all substrings that are repeated, including overlaps
mystring_overlap = "abcdeabcdzzzzbcde"
# In case we want to match both abcd and bcde
repeated_ones = set()
pos = 0
while True:
match = re.search(r"(.{4,}).*(\1)+", mystring_overlap[pos:])
if match:
repeated_ones.add(match.group(1))
pos += match.pos + 1
else:
break
This ensures that all --not only disjoint-- substrings which have repetition are returned. It should be much slower, but gets the work done.
If you want in addition to the longest strings that are repeated, all the substrings, then:
base_repetitions = list(repeated_ones)
for s in base_repetitions:
for i in range(4, len(s)):
repeated_ones.add(s[:i])
That will ensure that for long substrings that have repetition, you have also the smaller substring --e.g. "sample" and "ample" found by the re.search code; but also "samp", "sampl", "ampl" added by the above snippet.
Counting matches
Because (by design) the substrings that we count are non-overlapping, the count method is the way to go:
from __future__ import print_function
for substr in repeated_ones:
print("'%s': %d times" % (substr, mystring.count(substr)))
Results
Finding maximal substrings:
With the question's original mystring:
{'abcd', 'text', 'sample'}
with the mystring_overlap sample:
{'abcd'}
Finding all substrings:
With the question's original mystring:
{'abcd', 'ample', 'mple', 'sample', 'text'}
... and if we add the code to get all substrings then, of course, we get absolutely all the substrings:
{'abcd', 'ampl', 'ample', 'mple', 'samp', 'sampl', 'sample', 'text'}
with the mystring_overlap sample:
{'abcd', 'bcde'}
Future work
It's possible to filter the results of the finding all substrings with the following steps:
take a match "A"
check if this match is a substring of another match, call it "B"
if there is a "B" match, check the counter on that match "B_n"
if "A_n = B_n", then remove A
go to first step
It cannot happen that "A_n < B_n" because A is smaller than B (is a substring) so there must be at least the same number of repetitions.
If "A_n > B_n" it means that there is some extra match of the smaller substring, so it is a distinct substring because it is repeated in a place where B is not repeated.
Script (explanation where needed, in comments):
from collections import Counter
mystring = "abcdthisisatextwithsampletextforasampleabcd"
mystring_len = len(mystring)
possible_matches = []
matches = []
# Range `start_index` from 0 to 3 from the left, due to minimum char count of 4
for start_index in range(0, mystring_len-3):
# Start `end_index` at `start_index+1` and range it throughout the rest of
# the string
for end_index in range(start_index+1, mystring_len+1):
current_string = mystring[start_index:end_index]
if len(current_string) < 4: continue # Skip this interation, if len < 4
possible_matches.append(mystring[start_index:end_index])
for possible_match, count in Counter(possible_matches).most_common():
# Iterate until count is less than or equal to 1 because `Counter`'s
# `most_common` method lists them in order. Once 1 (or less) is hit, all
# others are the same or lower.
if count <= 1: break
matches.append((possible_match, count))
for match, count in matches:
print(f'\'{match}\' {count} times')
Output:
'abcd' 2 times
'text' 2 times
'samp' 2 times
'sampl' 2 times
'sample' 2 times
'ampl' 2 times
'ample' 2 times
'mple' 2 times
Here's a Python3 friendly solution:
from collections import Counter
min_str_length = 4
mystring = "abcdthisisatextwithsampletextforasampleabcd"
all_substrings =[mystring[start_index:][:end_index + 1] for start_index in range(len(mystring)) for end_index in range(len(mystring[start_index:]))]
counted_substrings = Counter(all_substrings)
not_counted_final_candidates = [item[0] for item in counted_substrings.most_common() if item[1] > 1 and len(item[0]) >= min_str_length]
counted_final_candidates = {item: counted_substrings[item] for item in not_counted_final_candidates}
print(counted_final_candidates)
Bonus: largest string
sub_sub_strings = [substring1 for substring1 in not_counted_final_candidates for substring2 in not_counted_final_candidates if substring1!=substring2 and substring1 in substring2 ]
largest_common_string = list(set(not_counted_final_candidates) - set(sub_sub_strings))
Everything as a function:
from collections import Counter
def get_repeated_strings(input_string, min_str_length = 2, calculate_largest_repeated_string = True ):
all_substrings = [input_string[start_index:][:end_index + 1]
for start_index in range(len(input_string))
for end_index in range(len(input_string[start_index:]))]
counted_substrings = Counter(all_substrings)
not_counted_final_candidates = [item[0]
for item in counted_substrings.most_common()
if item[1] > 1 and len(item[0]) >= min_str_length]
counted_final_candidates = {item: counted_substrings[item] for item in not_counted_final_candidates}
### This is just a bit of bonus code for calculating the largest repeating sting
if calculate_largest_repeated_string == True:
sub_sub_strings = [substring1 for substring1 in not_counted_final_candidates for substring2 in
not_counted_final_candidates if substring1 != substring2 and substring1 in substring2]
largest_common_strings = list(set(not_counted_final_candidates) - set(sub_sub_strings))
return counted_final_candidates, largest_common_strings
else:
return counted_final_candidates
Example:
mystring = "abcdthisisatextwithsampletextforasampleabcd"
print(get_repeated_strings(mystring, min_str_length= 4))
Output:
({'abcd': 2, 'text': 2, 'samp': 2, 'sampl': 2, 'sample': 2, 'ampl': 2, 'ample': 2, 'mple': 2}, ['abcd', 'text', 'sample'])
CODE:
pattern = "abcdthisisatextwithsampletextforasampleabcd"
string_more_4 = []
k = 4
while(k <= len(pattern)):
for i in range(len(pattern)):
if pattern[i:k+i] not in string_more_4 and len(pattern[i:k+i]) >= 4:
string_more_4.append( pattern[i:k+i])
k+=1
for i in string_more_4:
if pattern.count(i) >= 2:
print(i + " -> " + str(pattern.count(i)) + " times")
OUTPUT:
abcd -> 2 times
text -> 2 times
samp -> 2 times
ampl -> 2 times
mple -> 2 times
sampl -> 2 times
ample -> 2 times
sample -> 2 times
Hope this helps as my code length was short and it is easy to understand. Cheers!
This is in Python 2 because I'm not doing Python 3 at this time. So you'll have to adapt it to Python 3 yourself.
#!python2
# import module
from collections import Counter
# get the indices
def getIndices(length):
# holds the indices
specific_range = []; all_sets = []
# start building the indices
for i in range(0, length - 2):
# build a set of indices of a specific range
for j in range(1, length + 2):
specific_range.append([j - 1, j + i + 3])
# append 'specific_range' to 'all_sets', reset 'specific_range'
if specific_range[j - 1][1] == length:
all_sets.append(specific_range)
specific_range = []
break
# return all of the calculated indices ranges
return all_sets
# store search strings
tmplst = []; combos = []; found = []
# string to be searched
mystring = "abcdthisisatextwithsampletextforasampleabcd"
# mystring = "abcdthisisatextwithtextsampletextforasampleabcdtext"
# get length of string
length = len(mystring)
# get all of the indices ranges, 4 and greater
all_sets = getIndices(length)
# get the search string combinations
for sublst in all_sets:
for subsublst in sublst:
tmplst.append(mystring[subsublst[0]: subsublst[1]])
combos.append(tmplst)
tmplst = []
# search for matching string patterns
for sublst in all_sets:
for subsublst in sublst:
for sublstitems in combos:
if mystring[subsublst[0]: subsublst[1]] in sublstitems:
found.append(mystring[subsublst[0]: subsublst[1]])
# make a dictionary containing the strings and their counts
d1 = Counter(found)
# filter out counts of 2 or more and print them
for k, v in d1.items():
if v > 1:
print k, v
$ cat test.py
import collections
import sys
S = "abcdthisisatextwithsampletextforasampleabcd"
def find(s, min_length=4):
"""
Find repeated character sequences in a provided string.
Arguments:
s -- the string to be searched
min_length -- the minimum length of the sequences to be found
"""
counter = collections.defaultdict(int)
# A repeated sequence can't be longer than half the length of s
sequence_length = len(s) // 2
# populate counter with all possible sequences
while sequence_length >= min_length:
# Iterate over the string until the number of remaining characters is
# fewer than the length of the current sequence.
for i, x in enumerate(s[:-(sequence_length - 1)]):
# Window across the string, getting slices
# of length == sequence_length.
candidate = s[i:i + sequence_length]
counter[candidate] += 1
sequence_length -= 1
# Report.
for k, v in counter.items():
if v > 1:
print('{} {} times'.format(k, v))
return
if __name__ == '__main__':
try:
s = sys.argv[1]
except IndexError:
s = S
find(s)
$ python test.py
sample 2 times
sampl 2 times
ample 2 times
abcd 2 times
text 2 times
samp 2 times
ampl 2 times
mple 2 times
This is my approach to this problem:
def get_repeated_words(string, minimum_len):
# Storing count of repeated words in this dictionary
repeated_words = {}
# Traversing till last but 4th element
# Actually leaving `minimum_len` elements at end (in this case its 4)
for i in range(len(string)-minimum_len):
# Starting with a length of 4(`minimum_len`) and going till end of string
for j in range(i+minimum_len, len(string)):
# getting the current word
word = string[i:j]
# counting the occurrences of the word
word_count = string.count(word)
if word_count > 1:
# storing in dictionary along with its count if found more than once
repeated_words[word] = word_count
return repeated_words
if __name__ == '__main__':
mystring = "abcdthisisatextwithsampletextforasampleabcd"
result = get_repeated_words(mystring, 4)
This is how I would do it, but I don't know any other way:
string = "abcdthisisatextwithsampletextforasampleabcd"
l = len(string)
occurences = {}
for i in range(4, l):
for start in range(l - i):
substring = string[start:start + i]
occurences[substring] = occurences.get(substring, 0) + 1
for key in occurences.keys():
if occurences[key] > 1:
print("'" + key + "'", str(occurences[key]), "times")
Output:
'sample' 2 times
'ampl' 2 times
'sampl' 2 times
'ample' 2 times
'samp' 2 times
'mple' 2 times
'text' 2 times
Efficient, no, but easy to understand, yes.
Here is simple solution using the more_itertools library.
Given
import collections as ct
import more_itertools as mit
s = "abcdthisisatextwithsampletextforasampleabcd"
lbound, ubound = len("abcd"), len(s)
Code
windows = mit.flatten(mit.windowed(s, n=i) for i in range(lbound, ubound))
filtered = {"".join(k): v for k, v in ct.Counter(windows).items() if v > 1}
filtered
Output
{'abcd': 2,
'text': 2,
'samp': 2,
'ampl': 2,
'mple': 2,
'sampl': 2,
'ample': 2,
'sample': 2}
Details
The procedures are:
build sliding windows of varying sizes lbound <= n < ubound
count all occurrences and filter replicates
more_itertools is a third-party package installed by > pip install more_itertools.
s = 'abcabcabcdabcd'
d = {}
def get_repeats(s, l):
for i in range(len(s)-l):
ss = s[i: i+l]
if ss not in d:
d[ss] = 1
else:
d[ss] = d[ss]+1
return d
get_repeats(s, 3)
Deleting any pair of adjacent letters with same value. For example, string "aabcc" would become either "aab" or "bcc" after operation.
Sample input = aaabccddd
Sample output = abd
Confused how to iterate the list or the string in a way to match the duplicates and removing them, here is the way I am trying and I know it is wrong.
S = input()
removals = []
for i in range(0, len(S)):
if i + 1 >= len(S):
break
elif S[i] == S[i + 1]:
removals.append(i)
# removals is to store all the indexes that are to be deleted.
removals.append(i + 1)
i += 1
print(i)
Array = list(S)
set(removals) #removes duplicates from removals
for j in range(0, len(removals)):
Array.pop(removals[j]) # Creates IndexOutOfRange error
This is a problem from Hackerrank: Super Reduced String
Removing paired letters can be reduced to reducing runs of letters to an empty sequence if there are an even number of them, 1 if there are an odd number. aaaaaa becomes empty, aaaaa is reduced to a.
To do this on any sequence, use itertools.groupby() and count the group size:
# only include a value if their consecutive count is odd
[v for v, group in groupby(sequence) if sum(1 for _ in group) % 2]
then repeat until the size of the sequence no longer changes:
prev = len(sequence) + 1
while len(sequence) < prev:
prev = len(sequence)
sequence = [v for v, group in groupby(sequence) if sum(1 for _ in group) % 2]
However, since Hackerrank gives you text it'd be faster if you did this with a regular expression:
import re
even = re.compile(r'(?:([a-z])\1)+')
prev = len(text) + 1
while len(text) < prev:
prev = len(text)
text = even.sub(r'', text)
[a-z] in a regex matches a lower-case letter, (..)groups that match, and\1references the first match and will only match if that letter was repeated.(?:...)+asks for repeats of the same two characters.re.sub()` replaces all those patterns with empty text.
The regex approach is good enough to pass that Hackerrank challenge.
You can use stack in order to achieve O(n) time complexity. Iterate over the characters in a string and for each character check if the top of stack contains the same character. In case it does pop the character from stack and move to next item. Otherwise push the character to the stack. Whatever remains in the stack is the result:
s = 'aaabccddd'
stack = []
for c in s:
if stack and stack[-1] == c:
stack.pop()
else:
stack.append(c)
print ''.join(stack) if stack else 'Empty String' # abd
Update Based on the discussion I ran couple of tests to measure the speed of regex and stack based solutions with input length of 100. Tests were run on Python 2.7 on Windows 8:
All same
Regex: 0.0563033799756
Stack: 0.267807865445
Nothing to remove
Regex: 0.075074750044
Stack: 0.183467329017
Worst case
Regex: 1.9983200193
Stack: 0.196362265609
Alphabet
Regex: 0.0759905517997
Stack: 0.182778728207
Code used for benchmarking:
import re
import timeit
def reduce_regexp(text):
even = re.compile(r'(?:([a-z])\1)+')
prev = len(text) + 1
while len(text) < prev:
prev = len(text)
text = even.sub(r'', text)
return text
def reduce_stack(s):
stack = []
for c in s:
if stack and stack[-1] == c:
stack.pop()
else:
stack.append(c)
return ''.join(stack)
CASES = [
['All same', 'a' * 100],
['Nothing to remove', 'ab' * 50],
['Worst case', 'ab' * 25 + 'ba' * 25],
['Alphabet', ''.join([chr(ord('a') + i) for i in range(25)] * 4)]
]
for name, case in CASES:
print(name)
res = timeit.timeit('reduce_regexp(case)',
setup='from __main__ import reduce_regexp, case; import re',
number=10000)
print('Regex: {}'.format(res))
res = timeit.timeit('reduce_stack(case)',
setup='from __main__ import reduce_stack, case',
number=10000)
print('Stack: {}'.format(res))