Count max consecutive RE groups in a string [duplicate] - python

How can I count the max amount of consecutive string groups in a string?
import re
s = "HELLOasdHELLOasdHELLOHELLOHELLOasdHELLOHELLO"
# Give me the max amount of consecutive HELLO groups ---> which is 3
# There's a group of 3 and a group of 2, but 3 is the max.
count = re.findall("(HELLO)+", s) # count is: ['HELLO', 'HELLO', 'HELLO', 'HELLO']
count = len(count)
print(count)
Output is:
4
Which is totally wrong. The max amount of consecutive HELLO is 3.
I think I'm using the wrong RE and I have no clue how to count those repetitions in order to find the max.
And I can't understand why the output is 4.
Thanks!

You need to capture the entire string of consecutive HELLOs in your match; then you can work out the number of HELLOs by dividing the length of the match string by 5 (the length of HELLO). Using a list comprehension:
import re
s = "HELLOasdHELLOasdHELLOHELLOHELLOasdHELLOHELLO"
print(max([len(x) // 5 for x in re.findall(r'((?:HELLO)+)', s)]))
Output
3

I think a solution that is easy to understand is a better goal than the shortest possible code.
s = "HELLOasdHELLOasdHELLOHELLOHELLOasdHELLOHELLO"
word_search = "HELLO"

def find_char(str_var: str, word_search: str) -> int:
    count = 0
    # Try 0, 1, 2, ... repetitions and remember the largest one found in str_var.
    for i in range(len(str_var)):
        char = word_search * i
        if str_var.find(char) != -1:
            count = i
    return count

find = find_char(s, word_search)
print(find) # 3
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Update:
Actually it can work with one line of code without requiring additional modules:
c = max([i for i in range(len(s)) if s.find('HELLO' * i) != -1])
People often reach for regular expressions because they seem cool, and overlook simpler solutions that work.
Back to the original question: the asker doesn't understand why the output is 4.
re.findall() returns a list of the matches it finds, and here it finds 4 of them.
len(list) returns the number of elements in that list, so the output is 4.
The regexes in the other answers also find 4 matches, so plugging them into the original code would still give len(list) = 4.
In other words, len(list) is the wrong measure: it counts the runs, not the length of the longest run.
The problem really has the form: find max(x) such that 'HELLO' * x occurs in s.
Regex is a powerful and useful tool, but it isn't always needed.
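To see concretely where the 4 comes from, here is a small sketch comparing what re.findall returns with a capturing group and with a non-capturing group on the string from the question. The variable names captured and whole are only illustrative.

```python
import re

s = "HELLOasdHELLOasdHELLOHELLOHELLOasdHELLOHELLO"

# With a capturing group, findall returns the group's last capture for each match.
captured = re.findall(r"(HELLO)+", s)

# With a non-capturing group, findall returns each whole match instead.
whole = re.findall(r"(?:HELLO)+", s)

print(captured)  # ['HELLO', 'HELLO', 'HELLO', 'HELLO']
print(whole)     # ['HELLO', 'HELLO', 'HELLOHELLOHELLO', 'HELLOHELLO']

# Both lists have 4 entries (one per run); only the second keeps the run lengths.
print(max(len(m) // len("HELLO") for m in whole))  # 3
```

Either way len() gives 4, because there are four separate runs of HELLO; the longest run has to be read off the matched text itself.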

As explained in this question: Regex behaving weird when finding floating point strings
If one or more groups are present in the pattern, [re.findall will] return a list of groups
Therefore you want a non-capturing group instead. Let's work through with the sample string and pattern:
s = 'HELLOasdHELLOasdHELLOHELLOHELLOasdHELLOHELLO'
p = 'HELLO'
To find all occurrences of consecutive repetitions of the pattern, we just need to modify your regex slightly to use a non-capturing group:
>>> matches = re.findall(f'(?:{p})+', s)
>>> matches
['HELLO', 'HELLO', 'HELLOHELLOHELLO', 'HELLOHELLO']
Now we just need to find the longest string and divide its length by the length of the pattern:
>>> max(map(len, matches)) // len(p)
3
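The steps above could be wrapped in a small helper. This is only a sketch: the name max_consecutive is hypothetical, and re.escape is added on the assumption that the word being searched for might contain regex metacharacters.

```python
import re

def max_consecutive(s: str, word: str) -> int:
    """Return the largest number of back-to-back copies of `word` in `s`."""
    if not word:
        raise ValueError("word must be non-empty")
    # re.escape keeps characters such as '.' or '+' in `word` literal.
    pattern = f"(?:{re.escape(word)})+"
    # Each match is one run of consecutive copies; its length divided by
    # len(word) is the number of copies. default=0 covers "no match at all".
    return max(
        (len(m.group(0)) // len(word) for m in re.finditer(pattern, s)),
        default=0,
    )

print(max_consecutive("HELLOasdHELLOasdHELLOHELLOHELLOasdHELLOHELLO", "HELLO"))  # 3
```

Using re.finditer and group(0) sidesteps the findall group behaviour entirely, since group(0) is always the whole match.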

Related

How can I optimize this for-loop?

I need to check the occurrences of the letter "a" in a string s of size n.
Example:
s = "abcac"
n = 10
String to check for occurrences of letter "a": "abcacabcac".
Occurrences: 4
My code works, but I need it to work faster for larger values of n.
What can I do to optimize this code?
def repeatedString(s, n):
    a_count, word_iter = 0, 0
    for i in range(n):
        if s[word_iter] == "a":
            a_count+=1
        word_iter += 1
        if word_iter == (len(s)):
            word_iter = 0
    return a_count
You don't need to assemble the full repeated string to do it. Count the occurrences of the specified character in s and multiply that by the number of times s is fully repeated (n//len(s) times). Then add the occurrences in the last (truncated) part at the end of the repetitions (i.e. the first n%len(s) characters of s).
def countChar(s,n,c):
    # the parentheses matter: s.count(c)*n//len(s) would evaluate as (s.count(c)*n)//len(s)
    return s.count(c)*(n//len(s))+s[:n%len(s)].count(c)
output:
countChar("abcac",10,"a") # 4 times in 'abcacabcac'
countChar("abcac",17,"a") # 7 times in 'abcacabcacabcacab'
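One way to gain confidence in a closed-form count like this is to compare it with the naive construction on small inputs. The sketch below does that; count_char and count_char_naive are hypothetical names, and divmod is used only to get the number of full repetitions and the leftover length in one call.

```python
def count_char(s: str, n: int, c: str) -> int:
    """Count `c` in the first `n` characters of `s` repeated indefinitely."""
    full_repeats, remainder = divmod(n, len(s))
    return s.count(c) * full_repeats + s[:remainder].count(c)

def count_char_naive(s: str, n: int, c: str) -> int:
    """Reference version that actually builds the repeated string."""
    repeated = s * (n // len(s) + 1)  # one extra copy covers the partial tail
    return repeated[:n].count(c)

# Cross-check the two versions for a range of lengths.
for n in range(40):
    assert count_char("abcac", n, "a") == count_char_naive("abcac", n, "a")

print(count_char("abcac", 10, "a"))  # 4
print(count_char("abcac", 17, "a"))  # 7
```

The naive version needs memory proportional to n, so it is only suitable as a check; the arithmetic version touches s a constant number of times regardless of n.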
Count the number of times 'a' appears in s repeated up to length n:
s = "abcac"
n = 10
# repeat s one extra time so the slice also covers a partial final repetition
(s * (n // len(s) + 1))[:n].count('a')
You can use regular expressions:
import re
a_count = len(re.findall(r'a',s))
re.findall returns a list of all matches, and we can just take its length. Using a regular expression allows for greater generalization and the ability to search for more complex patterns. Debra's original answer is better for a simple string search, though:
a_count = s.count('a')

Is there an easy way to get the number of repeating character in a word?

I'm trying to count how many times any character repeats in a word. The repetitions must be sequential.
For example, the method with input "loooooveee" should return 6 (4 times 'o', 2 times 'e').
I'm trying to implement string-level functions and I can do it that way, but is there an easier way to do this? A regex, or something similar?
Original question: order of repetition does not matter
You can subtract the number of unique letters from the total number of letters. set applied to a string returns the collection of its unique letters.
x = "loooooveee"
res = len(x) - len(set(x)) # 6
Or you can use collections.Counter, subtract 1 from each value, then sum:
from collections import Counter
c = Counter("loooooveee")
res = sum(i-1 for i in c.values()) # 6
New question: repetitions must be sequential
You can use itertools.groupby to group sequential identical characters:
from itertools import groupby
g = groupby("aooooaooaoo")
res = sum(sum(1 for _ in j) - 1 for i, j in g) # 5
To avoid the nested sum calls, you can use itertools.islice:
from itertools import groupby, islice
g = groupby("aooooaooaoo")
res = sum(1 for _, j in g for _ in islice(j, 1, None)) # 5
You could use a regular expression if you want:
import re
rx = re.compile(r'(\w)\1+')
repeating = sum(x[1] - x[0] - 1
                for m in rx.finditer("loooooveee")
                for x in [m.span()])
print(repeating)
This correctly yields 6 and makes use of the .span() function.
The expression is
(\w)\1+
which captures a word character (a-zA-Z0-9_, plus other Unicode word characters for str patterns in Python 3) and then matches as many further repetitions of it as possible.
See a demo on regex101.com for the repeating pattern.
If you want to match any character (that is, not only word characters), change your expression to:
(.)\1+
See another demo on regex101.com.
try this:
word = input('something:')
total = 0  # renamed from sum to avoid shadowing the built-in
chars = set(word)  # get the set of unique characters
for item in chars:  # iterate over the set and output the count for each repeated item
    if word.count(item) > 1:
        total += word.count(item)
        print('{}|{}'.format(item, word.count(item)))
print('Total:' + str(total))
EDIT:
added total count of repetitions
Since it doesn't matter where the repetition is occurring or which characters are being repeated, you can make use of the set data structure provided in Python. It will discard the duplicate occurrences of any character or an object.
Therefore, the solution would look something like this:
def measure_normalized_emphasis(text):
    return len(text) - len(set(text))
This will give you the exact result.
Also make sure to test some edge cases, which is good practice anyway.
I think your code is comparing the wrong things
You start by finding the last character:
char = text[-1]
Then you compare this to itself:
for i in range(1, len(text)):
    if text[-i] == char: #<-- surely this is text[-1] to begin with?
Why not just run through the characters:
def measure_normalized_emphasis(text):
    char = text[0]
    emphasis_size = 0
    for i in range(1, len(text)):
        if text[i] == char:
            emphasis_size += 1
        else:
            char = text[i]
    return emphasis_size
This seems to work.

Python strings: quickly summarize the character count in order of appearance

Let's say I have the following strings in Python3.x
string1 = 'AAAAABBBBCCCDD'
string2 = 'CCBADDDDDBACDC'
string3 = 'DABCBEDCCAEDBB'
I would like to create a summary "frequency string" that counts the number of characters in the string in the following format:
string1_freq = '5A4B3C2D' ## 5 A's, followed by 4 B's, 3 C's, and 2D's
string2_freq = '2C1B1A5D1B1A1C1D1C'
string3_freq = '1D1A1B1C1B1E1D2C1A1E1D2B'
My problem:
How would I quickly create such a summary string?
My idea would be: create an empty list to keep track of the count. Then create a for loop which checks the next character. If there's a match, increase the count by +1 and move to the next character. Otherwise, append to end of the string 'count' + 'character identity'.
That's very inefficient in Python. Is there a quicker way (maybe using the functions below)?
There are several ways to count the elements of a string in python. I like collections.Counter, e.g.
from collections import Counter
counter_str1 = Counter(string1)
print(counter_str1['A']) # 5
print(counter_str1['B']) # 4
print(counter_str1['C']) # 3
print(counter_str1['D']) # 2
There's also str.count(sub[, start[, end]]):
Return the number of non-overlapping occurrences of substring sub in
the range [start, end]. Optional arguments start and end are
interpreted as in slice notation.
As an example:
print(string1.count('A')) ## 5
The following code accomplishes the task without importing any modules.
def freq_map(s):
    num = 0 # number of adjacent, identical characters
    curr = s[0] # current character being processed
    result = '' # result of function
    for i in range(len(s)):
        if s[i] == curr:
            num += 1
        else:
            result += str(num) + curr
            curr = s[i]
            num = 1
    result += str(num) + curr
    return result
Note: Since you requested a solution based on performance, I suggest you use this code or a modified version of it.
I ran a rough performance test against the code provided by CoryKramer for reference. This code performed the same task in 58% of the time without importing any modules. The snippet can be found here.
I would use itertools.groupby to group consecutive runs of the same letter. Then use a generator expression within join to create a string representation of the count and letter for each run.
from itertools import groupby
def summarize(s):
    return ''.join(str(sum(1 for _ in i[1])) + i[0] for i in groupby(s))
Examples
>>> summarize(string1)
'5A4B3C2D'
>>> summarize(string2)
'2C1B1A5D1B1A1C1D1C'
>>> summarize(string3)
'1D1A1B1C1B1E1D2C1A1E1D2B'
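Timings like the 58% figure quoted above depend heavily on the machine, the Python version and the input, so it may be worth measuring on your own data. Below is a rough sketch using timeit. Both functions are copied from the answers above; the sample string and the repetition count are arbitrary assumptions.

```python
import timeit
from itertools import groupby

def freq_map(s):
    """Loop-based run-length summary (from the answer above)."""
    num = 0
    curr = s[0]
    result = ''
    for ch in s:
        if ch == curr:
            num += 1
        else:
            result += str(num) + curr
            curr = ch
            num = 1
    result += str(num) + curr
    return result

def summarize(s):
    """groupby-based run-length summary (from the answer above)."""
    return ''.join(str(sum(1 for _ in run)) + key for key, run in groupby(s))

# Arbitrary test input: the first example string repeated many times.
sample = 'AAAAABBBBCCCDD' * 1000

# Both versions should agree before their speed is compared.
assert freq_map(sample) == summarize(sample)

for fn in (freq_map, summarize):
    seconds = timeit.timeit(lambda: fn(sample), number=20)
    print(f"{fn.__name__}: {seconds:.4f} s for 20 runs")
```

No expected timings are shown on purpose; inputs with many short runs or a few long runs may well rank the two versions differently.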

Realizing if there is a pattern in a string (Does not need to start at index 0, could be any length)

Coding a program to detect an n-length pattern in a string, even without knowing where the pattern starts, could be done easily by creating a list of n-length substrings and checking whether, from some point on, the same items repeat for the rest of the list. Without any information other than the string itself, is brute-forcing through all lengths the only way to recognize the pattern, or is there a more efficient algorithm?
(I'm just a beginner in Python, so this may be easy to code... )
Current code that only suits checking for starting at index 0:
def search(s):
    match=s[0]+s[1]
    while (match != s) and (match[0] != match[-1]):
        for matchLen in range(len(match),len(s)-1):
            letter = s[matchLen]
            if letter == match[-1]:
                match += s[len(match)]
                break
    if match == s:
        return None
    else:
        return match[:-1]
You can use re.findall(r'(.{2,})\1+', string). The parentheses create a capture group that is later backreferenced by \1. The . matches any character (except for line breaks). The {2,} requires the pattern to be at least two characters long (otherwise strings like ss would be considered a pattern). Finally, the + requires that pattern to repeat one or more times (in addition to the first time that it occurred inside the capture group). You can see it working in action.
Pattern is a far too vague term, but assuming you mean some string repeating itself, the regexp (?P<pat>.+)(?P=pat) will work.
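As a sketch of how the backreference idea could be used from Python: the helper below reports each repeated unit together with how many times it occurs back to back. The name find_repeats and the min_len parameter are hypothetical, and the quantifier is made lazy ({2,}?), which is a change from the regexes above so that the shortest repeating unit is reported rather than the longest.

```python
import re

def find_repeats(s: str, min_len: int = 2):
    """Return (unit, count) pairs for substrings repeated back to back in `s`."""
    # (.{min_len,}?) lazily captures the shortest candidate unit,
    # \1+ then requires at least one more consecutive copy of it.
    pattern = re.compile(r"(.{%d,}?)\1+" % min_len)
    return [
        (m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(s)
    ]

print(find_repeats("xxabcdeabcdeabcdeabcdeyy"))  # [('abcde', 4)]
print(find_repeats("ss"))                        # []
```

Note that the scan is left to right, so when the repeated block is preceded by a character that also ends the unit, the reported unit can be a rotation of the one you expect (for example 'eabcd' instead of 'abcde').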
Given a string what you could do is -
You start with length = 1, and take two pointer variables i and j which you shall use to traverse the string.
Set i = 0 and j = i+length
if str[i]==str[j]:
    i++,j++ // repeat while j is not equal to the length of the string
else:
    length = length + 1
    // increase length by 1 and start the algorithm over from i = 0
Take the example abcdeabcde :
In this we see
Initially i = 0, j = 1 ,
but str[0]!=str[1] i.e. a!=b,
Then we get length = 2 i.e., i = 0,j = 2
but str[0]!=str[2] i.e. a!=c,
Continuing in the same fashion,
We see when length = 5 and i = 0 and j = 5,
str[0]==str[5]
and thus you can see that i and j increment till j is equal to string length.
And you have your answer, which is the pattern length. It may not seem obvious, but I would suggest you dry-run this algorithm over some of your test cases and let me know the results.
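The two-pointer procedure described above can be sketched in Python roughly as follows. The function name smallest_period is hypothetical, and this version assumes the pattern starts at index 0 and may be cut off at the end of the string.

```python
def smallest_period(s: str) -> int:
    """Return the smallest length p such that s[i] == s[i + p] wherever both exist."""
    n = len(s)
    for length in range(1, n):
        # i runs from 0 while j = i + length stays inside the string.
        if all(s[i] == s[i + length] for i in range(n - length)):
            return length
    # No shorter period fits, so the whole string is its own "pattern".
    return n

print(smallest_period("abcdeabcde"))  # 5
print(smallest_period("abc"))         # 3
```

This is still a brute-force search over lengths, so the worst case is quadratic in the length of the string.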
You can use re.findall() to find all matches:
import re
s = "somethingabcdeabcdeabcdeabcdeabcdeelseabcdeabcdeabcde"
li = re.findall(r'abcde',s)
print(li)
Output:
['abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde']

Count character repeats in Python

I'm writing a Python program and I need some way to count the number of times an X or a stretch of Xs occurs in a string. So for example if the input is aaaXXXbbbXXXcccXdddXXXXXeXf then the output should be 5, since there are 5 stretches of X in the string.
In Perl I would have done this as follows.
my $count =()= $str =~ m/X+/g;
I'm familiar with the re.search command in Python, but I'm unaware of how to count the number of results, and I'm unsure whether this is the most efficient way to approach my problem in Python.
My highest priority is readability/clarity; efficiency is secondary.
You can use itertools.groupby for this:
>>> s = "aaaXXXbbbXXXcccXdddXXXXXeXf"
>>> import itertools
>>> sum(e == 'X' for e, g in itertools.groupby(s))
5
This groups the elements in the iterable -- if no key-function is given, it just groups equal elements. Then, you just use sum to count the elements where the key is 'X'.
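To make the grouping visible, here is a small sketch that materialises the runs groupby produces for the example string. The names runs and x_runs are only illustrative.

```python
from itertools import groupby

s = "aaaXXXbbbXXXcccXdddXXXXXeXf"

# Each item from groupby is (key, iterator over one run of equal characters).
runs = [(key, len(list(group))) for key, group in groupby(s)]

# Keep only the lengths of the runs whose key is 'X'.
x_runs = [length for key, length in runs if key == "X"]

print(x_runs)       # [3, 3, 1, 5, 1]
print(len(x_runs))  # 5
```

Keeping the run lengths around also answers related questions, such as the longest stretch of X, without a second pass.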
Or how about regular expressions:
>>> import re
>>> len(re.findall("X+", s))
5
This should work:
prev = None
count = 0
for letter in string:
    if letter == 'X' and prev != 'X':
        count += 1
    prev = letter
