Count character repeats in Python - python

I'm writing a Python program and I need some way to count the number of times an X or a stretch of Xs occurs in a string. So for example if the input is aaaXXXbbbXXXcccXdddXXXXXeXf then the output should be 5, since there are 5 stretches of X in the string.
In Perl I would have done this as follows.
my $count =()= $str =~ m/X+/g;
I'm familiar with the re.search command in Python, but I'm unaware of how to count the number of results, and I'm unsure whether this is the most efficient way to approach my problem in Python.
My highest priority is readability/clarity; efficiency is secondary.

You can use itertools.groupby for this:
>>> s = "aaaXXXbbbXXXcccXdddXXXXXeXf"
>>> import itertools
>>> sum(e == 'X' for e, g in itertools.groupby(s))
5
This groups the elements in the iterable -- if no key-function is given, it just groups equal elements. Then, you just use sum to count the elements where the key is 'X'.
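To see what groupby produces here, a minimal sketch follows. The helper name runs is made up for illustration; it lists each run with its length, so the number of X stretches is simply the number of runs whose key is 'X'.

```python
from itertools import groupby

def runs(text):
    """Return (character, run_length) pairs for each stretch of equal characters."""
    return [(key, len(list(group))) for key, group in groupby(text)]

s = "aaaXXXbbbXXXcccXdddXXXXXeXf"
print(runs(s)[:4])                             # [('a', 3), ('X', 3), ('b', 3), ('X', 3)]
print(sum(key == 'X' for key, _ in runs(s)))   # 5
```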
Or how about regular expressions:
>>> import re
>>> len(re.findall("X+", s))
5

This should work:
prev = None
count = 0
for letter in string:
    if letter == 'X' and prev != 'X':
        count += 1
    prev = letter
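The same loop can be wrapped in a function so it is easy to reuse and test. This is only a sketch: the name count_stretches and the target parameter are assumptions added for illustration, not part of the answer above.

```python
def count_stretches(text, target='X'):
    """Count maximal runs of `target` in `text` by detecting where each run starts."""
    prev = None
    count = 0
    for letter in text:
        # A run starts wherever target appears right after a different character.
        if letter == target and prev != target:
            count += 1
        prev = letter
    return count

print(count_stretches("aaaXXXbbbXXXcccXdddXXXXXeXf"))  # 5
print(count_stretches(""))                              # 0
```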

Related

Is there an easy way to get the number of repeating character in a word?

I'm trying to count how many times any character repeats in a word. The repetitions must be sequential.
For example, the method with input "loooooveee" should return 6 (4 times 'o', 2 times 'e').
I'm trying to implement string-level functions and I can do it that way, but is there an easier way to do this? A regex, or something similar?
Original question: order of repetition does not matter
You can subtract the number of unique letters from the total number of letters. Applying set to a string returns the collection of its unique letters.
x = "loooooveee"
res = len(x) - len(set(x)) # 6
Or you can use collections.Counter, subtract 1 from each value, then sum:
from collections import Counter
c = Counter("loooooveee")
res = sum(i-1 for i in c.values()) # 6
New question: repetitions must be sequential
You can use itertools.groupby to group sequential identical characters:
from itertools import groupby
g = groupby("aooooaooaoo")
res = sum(sum(1 for _ in j) - 1 for i, j in g) # 5
To avoid the nested sum calls, you can use itertools.islice:
from itertools import groupby, islice
g = groupby("aooooaooaoo")
res = sum(1 for _, j in g for _ in islice(j, 1, None)) # 5
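The difference between the two readings of the question shows up on a word whose letters repeat in separate runs. The sketch below uses two hypothetical helper names, total_repeats and sequential_repeats, to put the set-based count and the groupby-based count side by side.

```python
from itertools import groupby

def total_repeats(word):
    """Repeats anywhere in the word: every occurrence beyond the first of each letter."""
    return len(word) - len(set(word))

def sequential_repeats(word):
    """Repeats within runs only: each run of length n contributes n - 1."""
    return sum(len(list(group)) - 1 for _, group in groupby(word))

for word in ("loooooveee", "aooooaooaoo"):
    print(word, total_repeats(word), sequential_repeats(word))
# loooooveee 6 6
# aooooaooaoo 9 5
```

The two counts agree only when every letter occurs in a single run.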
You could use a regular expression if you want:
import re
rx = re.compile(r'(\w)\1+')
repeating = sum(x[1] - x[0] - 1
                for m in rx.finditer("loooooveee")
                for x in [m.span()])
print(repeating)
This correctly yields 6 and makes use of the .span() function.
The expression is
(\w)\1+
which captures a word character (a-zA-Z0-9_ for ASCII text; in Python 3, \w also matches other Unicode letters and digits) and then matches as many further repetitions of it as possible.
See a demo on regex101.com for the repeating pattern.
If you want to match any character (that is, not only word characters), change your expression to:
(.)\1+
See another demo on regex101.com.
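A small sketch of how the choice between the two expressions changes the result. The helper repeats and the sample text are assumptions made up for this illustration; the helper takes the pattern as a parameter so both can be tried on the same input.

```python
import re

def repeats(text, pattern):
    """Sum, over every run the pattern matches, the run length minus one."""
    return sum(len(m.group(0)) - 1 for m in re.finditer(pattern, text))

text = "wooow!!!"
print(repeats(text, r'(\w)\1+'))  # 2  (only the run 'ooo' is seen)
print(repeats(text, r'(.)\1+'))   # 4  ('ooo' and '!!!')
```

Note that . does not match a newline unless the re.DOTALL flag is passed.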
Try this:
word = input('something:')
total = 0
chars = set(word)  # the set of unique characters
for item in chars:  # iterate over the set and print the count for each repeated character
    if word.count(item) > 1:
        total += word.count(item)
        print('{}|{}'.format(item, word.count(item)))
print('Total:' + str(total))
EDIT:
added total count of repetitions
Since it doesn't matter where the repetition is occurring or which characters are being repeated, you can make use of the set data structure provided in Python. It will discard the duplicate occurrences of any character or an object.
Therefore, the solution would look something like this:
def measure_normalized_emphasis(text):
    return len(text) - len(set(text))
This will give you the exact result.
Also, make sure to check the edge cases, which is good practice anyway.
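A few edge cases worth checking for the set-based version, as a quick sketch. The sample inputs are chosen here for illustration and are not taken from the question.

```python
def measure_normalized_emphasis(text):
    return len(text) - len(set(text))

print(measure_normalized_emphasis(""))            # 0: empty input is handled
print(measure_normalized_emphasis("loooooveee"))  # 6
print(measure_normalized_emphasis("Oo"))          # 0: 'O' and 'o' are different characters
print(measure_normalized_emphasis("abab"))        # 2: repeats are counted even when not adjacent
```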
I think your code is comparing the wrong things
You start by finding the last character:
char = text[-1]
Then you compare this to itself:
for i in range(1, len(text)):
    if text[-i] == char: #<-- surely this is text[-1] to begin with?
Why not just run through the characters:
def measure_normalized_emphasis(text):
    char = text[0]
    emphasis_size = 0
    for i in range(1, len(text)):
        if text[i] == char:
            emphasis_size += 1
        else:
            char = text[i]
    return emphasis_size
This seems to work.

How to stop over counting of duplicate letters in a list of strings

I'm trying to count the number of times a duplicate letter shows up in the list element.
For example, given
arr = ['capps','hat','haaah']
I want to output a list and get [1, 0, 1].
def myfunc(words):
    counter = 0  # counts duplicate letters in words
    len_ = len(words)-1
    for i in range(len_):
        if words[i] == words[i+1]:  # if the letter ahead is the same, add one
            counter += 1
    return counter
def minimalOperations(arr):
    return [*map(myfunc, arr)]  # map applies myfunc to each element of arr
But my code would output [1,0,2]
I'm not sure why I am over counting.
Can anyone help me resolve this, thank you in advance.
A more efficient solution using a regular expression:
import re
def myfunc(words):
    reg_str = r"(\w)\1{1,}"
    return len(re.findall(reg_str, words))
This function will find the number of substrings of length 2 or more containing the same letter. Thus 'aaa' in your example will only be counted once.
For a string like
'hhhhfafaahggaa'
the output will be 4, since there are 4 maximal substrings of the same letter occurring at least twice: 'hhhh', 'aa', 'gg', 'aa'.
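To check which runs the pattern actually finds in that string, here is a short sketch. Because the expression contains one capturing group, re.findall returns the captured letter of each match rather than the whole run, so re.finditer is used to show the runs themselves.

```python
import re

pattern = re.compile(r"(\w)\1{1,}")
text = 'hhhhfafaahggaa'

# With one capturing group, findall returns the captured letter of each match.
print(pattern.findall(text))                          # ['h', 'a', 'g', 'a']
# finditer exposes the full matched run instead.
print([m.group(0) for m in pattern.finditer(text)])   # ['hhhh', 'aa', 'gg', 'aa']
```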
You aren't accounting for situations where you have greater than 2 identical characters in succession. To do this, you can look back as well as forward:
if (words[i] == words[i+1]) and (words[i] != words[i-1] if i != 0 else True):
    # as before
The conditional expression handles the first iteration of the loop, where words[i-1] would otherwise compare the first letter of the string with the last.
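Put together, the look-back check might be used as in the sketch below. The function name count_duplicate_runs and the starts_run variable are assumptions introduced here for readability.

```python
def count_duplicate_runs(word):
    """Count runs of two or more identical adjacent letters, once per run."""
    counter = 0
    for i in range(len(word) - 1):
        # A run starts at i if there is no previous letter or it differs.
        starts_run = i == 0 or word[i] != word[i - 1]
        if word[i] == word[i + 1] and starts_run:
            counter += 1
    return counter

print([count_duplicate_runs(w) for w in ['capps', 'hat', 'haaah']])  # [1, 0, 1]
```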
Another solution is to use itertools.groupby and count the number of instances where a group has a length greater than 1:
arr = ['capps','hat','haaah']
from itertools import groupby
res = [sum(1 for _, j in groupby(el) if sum(1 for _ in j) > 1) for el in arr]
print(res)
[1, 0, 1]
The sum(1 for _ in j) part is used to count the number of items in a generator. It's also possible to use len(list(j)), though this requires list construction.
Well, your code counts the number of duplications, so what you observe is quite logical:
your input is arr = ['capps','hat','haaah']
in 'capps', the letter p is duplicated 1 time => myfunc() returns 1
in 'hat', there is no duplicated letter => myfunc() returns 0
in 'haaah', the letter a is duplicated 2 times => myfunc() returns 2
So finally you get [1,0,2].
For your purpose, I suggest using a regex to match and count the groups of duplicated letters in each word. I also replaced map() with a list comprehension, which I find more readable:
import re
def myfunc(words):
    return len(re.findall(r'(\w)\1+', words))
def minimalOperations(arr):
    return [myfunc(a) for a in arr]
arr = ['capps','hat','haaah']
print(minimalOperations(arr)) # [1,0,1]
arr = ['cappsuul','hatppprrrrtyyy','haaah']
print(minimalOperations(arr)) # [2,3,1]
You need to keep track of a little more state, specifically if you're looking at duplicates now.
def myfunc(words):
    counter = 0  # counts runs of duplicate letters in words
    seen = None  # letter of the run that has already been counted
    len_ = len(words)-1
    for i in range(len_):
        if words[i] == words[i+1]:
            if words[i] != seen:  # first pair of this run: count it once
                counter += 1
                seen = words[i]
        else:
            seen = None  # the run ended, so the same letter may start a new run later
    return counter
This gives you the following output
>>> arr = ['capps','hat','haaah']
>>> list(map(myfunc, arr))
[1, 0, 1]
As others have pointed out, you could use a regular expression and trade clarity for performance. The key is to find a regular expression that means "two or more repeated characters"; the right one may depend on what you consider to be characters (e.g. how do you treat duplicate punctuation?).
Note: the "regex" used for this is technically an extension of regular expressions, because matching a backreference requires memory.
The form will be len(re.findall(regex, words))
I would break this kind of problem into smaller chunks. Starting by grouping duplicates.
The documentation for itertools has groupby and recipes for this kind of thing.
A slightly edited version of unique_justseen would look like this:
duplicates = (sum(1 for _ in group) for _key, group in itertools.groupby("haaah"))
and yields values: 1, 3, 1. As soon as any of these values are greater than 1 you have a duplicate. So just count them:
sum(n > 1 for n in duplicates)
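As a runnable sketch of those two steps, the run lengths can be produced by one small function and counted by another. The function names run_lengths and count_duplicate_groups are made up for this example.

```python
from itertools import groupby

def run_lengths(word):
    """Yield the length of each run of identical adjacent characters."""
    return (sum(1 for _ in group) for _key, group in groupby(word))

def count_duplicate_groups(word):
    """Count the runs that are longer than one character."""
    return sum(n > 1 for n in run_lengths(word))

print(list(run_lengths("haaah")))   # [1, 3, 1]
print([count_duplicate_groups(w) for w in ['capps', 'hat', 'haaah']])   # [1, 0, 1]
```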
Use re.findall for matches of 2 or more letters
>>> import re
>>> arr = ['capps','hat','haaah']
>>> [len(re.findall(r'(.)\1+', w)) for w in arr]
[1, 0, 1]

Python strings: quickly summarize the character count in order of appearance

Let's say I have the following strings in Python3.x
string1 = 'AAAAABBBBCCCDD'
string2 = 'CCBADDDDDBACDC'
string3 = 'DABCBEDCCAEDBB'
I would like to create a summary "frequency string" that counts the number of characters in the string in the following format:
string1_freq = '5A4B3C2D' ## 5 A's, followed by 4 B's, 3 C's, and 2D's
string2_freq = '2C1B1A5D1B1A1C1D1C'
string3_freq = '1D1A1B1C1B1E1D2C1A1E1D2B'
My problem:
How would I quickly create such a summary string?
My idea would be: create an empty list to keep track of the count. Then create a for loop which checks the next character. If there's a match, increase the count by +1 and move to the next character. Otherwise, append to end of the string 'count' + 'character identity'.
That's very inefficient in Python. Is there a quicker way (maybe using the functions below)?
There are several ways to count the elements of a string in python. I like collections.Counter, e.g.
from collections import Counter
counter_str1 = Counter(string1)
print(counter_str1['A']) # 5
print(counter_str1['B']) # 4
print(counter_str1['C']) # 3
print(counter_str1['D']) # 2
There's also str.count(sub[, start[, end]]):
Return the number of non-overlapping occurrences of substring sub in
the range [start, end]. Optional arguments start and end are
interpreted as in slice notation.
As an example:
print(string1.count('A')) ## 5
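Both tools report totals over the whole string, so on their own they do not give the run-by-run summary asked for. A brief sketch on string2 from the question makes that visible:

```python
from collections import Counter

string2 = 'CCBADDDDDBACDC'
totals = Counter(string2)

# Counter and str.count agree, but neither knows where the runs are.
print(totals['C'], string2.count('C'))   # 4 4
print(sorted(totals.items()))            # [('A', 2), ('B', 2), ('C', 4), ('D', 6)]
```

The four C's here are spread over three separate runs (2C, 1C, 1C) in the desired output.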
The following code accomplishes the task without importing any modules.
def freq_map(s):
    if not s:           # nothing to summarize for an empty string
        return ''
    num = 0             # number of adjacent, identical characters
    curr = s[0]         # current character being processed
    result = ''         # result of function
    for i in range(len(s)):
        if s[i] == curr:
            num += 1
        else:
            result += str(num) + curr
            curr = s[i]
            num = 1
    result += str(num) + curr
    return result
Note: Since you asked for a fast solution, I suggest you use this code or a modified version of it.
I ran a rough performance test against the code provided by CoryKramer for reference. This code performed the same task in 58% of the time without using external modules. The snippet can be found here.
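Since the original benchmark snippet is not reproduced here, the following is only a sketch of how such a comparison could be run with timeit. The sample input, the repetition count, and the groupby-based summarize used as the reference are all assumptions; timings will vary by machine and Python version, so no particular ratio should be expected.

```python
import timeit
from itertools import groupby

def freq_map(s):
    """Loop-based run-length summary (returns '' for empty input)."""
    if not s:
        return ''
    result, curr, num = '', s[0], 0
    for ch in s:
        if ch == curr:
            num += 1
        else:
            result += str(num) + curr
            curr, num = ch, 1
    return result + str(num) + curr

def summarize(s):
    """groupby-based run-length summary."""
    return ''.join(str(sum(1 for _ in g)) + k for k, g in groupby(s))

sample = 'AAAAABBBBCCCDD' * 100  # hypothetical test input
assert freq_map(sample) == summarize(sample)

for func in (freq_map, summarize):
    seconds = timeit.timeit(lambda: func(sample), number=200)
    print(f'{func.__name__}: {seconds:.4f} s for 200 calls')
```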
I would use itertools.groupby to group consecutive runs of the same letter. Then use a generator expression within join to create a string representation of the count and letter for each run.
from itertools import groupby
def summarize(s):
    return ''.join(str(sum(1 for _ in i[1])) + i[0] for i in groupby(s))
Examples
>>> summarize(string1)
'5A4B3C2D'
>>> summarize(string2)
'2C1B1A5D1B1A1C1D1C'
>>> summarize(string3)
'1D1A1B1C1B1E1D2C1A1E1D2B'

Detect and count numerical sequence in Python array

In a numerical sequence (e.g. one-dimensional array) I want to find different patterns of numbers and count each finding separately. However, the numbers can occur repeatedly but only the basic pattern is important.
# Example signal (1d array)
a = np.array([1,1,2,2,2,2,1,1,1,2,1,1,2,3,3,3,3,3,2,2,1,1,1])
# Search for these exact following "patterns": [1,2,1], [1,2,3], [3,2,1]
# Count the number of pattern occurrences
# [1,2,1] = 2 (occurs 2 times)
# [1,2,3] = 1
# [3,2,1] = 1
I have come across the Knuth-Morris-Pratt string matching recipe (http://code.activestate.com/recipes/117214/), which gives me the index of the searched pattern.
for s in KnuthMorrisPratt(list(a), [1,2,1]):
    print(s)
The problem is, I don't know how to find the case, where the pattern [1,2,1] "hides" in the sequence [1,2,2,2,1]. I need to find a way to reduce this sequence of repeated numbers in order to get to [1,2,1]. Any ideas?
I don't use NumPy and I am quite new to Python, so there might be a better and more efficient solution.
I would write a function like this:
def dac(data, pattern):
    count = 0
    for i in range(len(data)-len(pattern)+1):
        tmp = data[i:(i+len(pattern))]
        if tmp == pattern:
            count += 1
    return count
If you want to ignore repeated numbers in the middle of your pattern:
def dac(data, pattern):
    count = 0
    for i in range(len(data)-len(pattern)+1):
        tmp = [data[i], data[i+1]]
        try:
            for j in range(len(data)-i):
                # skip repeats of the last collected value
                if tmp[-1] != data[i+j+1]:
                    tmp.append(data[i+j+1])
                if len(tmp) == len(pattern):
                    break
        except IndexError:
            # ran past the end of data before collecting enough values
            pass
        if tmp == pattern:
            count += 1
    return count
Hope that might help.
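A different way to handle the repeated numbers, sketched here as an alternative rather than as the method above, is to collapse every run of equal values to a single value first and then count plain matches in the collapsed list. The names collapse_runs and count_pattern are made up, and the sketch assumes the pattern itself contains no consecutive duplicates and that the data is a plain list (for a NumPy array, a.tolist() would give one).

```python
from itertools import groupby

def collapse_runs(data):
    """Drop consecutive duplicates: [1, 1, 2, 2, 1] -> [1, 2, 1]."""
    return [key for key, _ in groupby(data)]

def count_pattern(data, pattern):
    """Count (possibly overlapping) occurrences of pattern in the collapsed data."""
    collapsed = collapse_runs(data)
    pattern = list(pattern)
    n = len(pattern)
    return sum(collapsed[i:i + n] == pattern for i in range(len(collapsed) - n + 1))

a = [1,1,2,2,2,2,1,1,1,2,1,1,2,3,3,3,3,3,2,2,1,1,1]
print(collapse_runs(a))              # [1, 2, 1, 2, 1, 2, 3, 2, 1]
for p in ([1, 2, 1], [1, 2, 3], [3, 2, 1]):
    print(p, count_pattern(a, p))    # counts are 2, 1 and 1 respectively
```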
Here's a one-liner that will do it
import numpy as np
a = np.array([1,1,2,2,2,2,1,1,1,2,1,1,2,3,3,3,3,3,2,2,1,1,1])
p = np.array([1,2,1])
num = sum(1 for k in
[a[j:j+len(p)] for j in range(len(a) - len(p) + 1)]
if np.array_equal(k, p))
The innermost part is a list comprehension that generates all pieces of the array that are the same length as the pattern. The outer part sums 1 for every element of this list which matches the pattern.
The only way I could think of to solve your problem with the subpattern matching was to use a regex.
The following is a demonstration of finding, for example, the sequence [1,2,1] in list1:
import re
list1 = [1,1,2,2,2,2,1,1,1,2,1,1,2,3,3,3,3,3,2,2,1,1,1]
str_list = ''.join(str(i) for i in list1)
print(re.findall(r'1+2+1', str_list))
This will give you as a result:
>>> print(re.findall(r'1+2+1', str_list))
['1122221', '1121']

How to find number of 'xx' pairs in a barcode recursively (python)

I am using python.
I'm having trouble with this recursion problem: I am trying to find how many pairs of identical adjacent characters there are in a string. For example, 'xx' would return 1 and 'xxx' would also return 1 because the pairs are not allowed to overlap. 'aabbb' would return 2.
I am completely stuck. I thought of breaking the word up into length 2 strings and recursing through the string like that, but then cases like 'aaa' would result in incorrect output.
Thanks.
Not sure why you want to do this recursively. If you wish to avoid regex, you can still just scan the string from left to right, for example using itertools.groupby:
>>> from itertools import groupby
>>> s = 'aabbb'
>>> sum(sum(1 for i in g)//2 for k,g in groupby(s))
2
>>> s = 'yyourr ssstringg'
>>> sum(sum(1 for i in g)//2 for k,g in groupby(s))
4
sum(1 for i in g) is used to find the length of the group. If the groups are not very long, you can use len(list(g)) instead.
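The reason integer division works is that a run of n equal characters contains n // 2 non-overlapping pairs. A minimal sketch with a hypothetical function name, count_pairs, checked against the examples from the question:

```python
from itertools import groupby

def count_pairs(text):
    """Count non-overlapping pairs of identical adjacent characters."""
    # A run of n equal characters holds n // 2 non-overlapping pairs.
    return sum(len(list(group)) // 2 for _, group in groupby(text))

for word in ('xx', 'xxx', 'aabbb', 'xxxx'):
    print(word, count_pairs(word))
# xx 1
# xxx 1
# aabbb 2
# xxxx 2
```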
You can use regex for that:
import re
s = 'yyourr ssstringg'
print(len(re.findall(r'(\w)\1', s)))
[OUTPUT]
4
This also takes care of your "overlaps-not-allowed" problem as you can see in the above example it prints 4 and not 5.
For a recursion approach, you can do it as:
st = 'yyourr ssstringg'
def get_double(s):
    if len(s) < 2:
        return 0
    else:
        for i in range(len(s) - 1):
            if s[i] == s[i+1]:
                return 1 + get_double(s[i+2:])
        return 0  # no pair left in s
>>> print(get_double(st))
4
And without a for loop:
st = 'yyourr sstringg'
def get_double(s):
    if len(s) < 2:
        return 0
    elif s[0]==s[1]:
        return 1 + get_double(s[2:])
    else:
        return 0 + get_double(s[1:])
>>> print(get_double(st))
4
I would evaluate it in twos.
For example, "sskkkj" would be looked at as two sets of two-character strings:
"ss", "kk", "kj" # from index 0
"sk", "kk" # offset by 1
Look at the two sets at the same time and add only one to the count if either has a pair.
