Python strings: quickly summarize the character count in order of appearance

Let's say I have the following strings in Python3.x
string1 = 'AAAAABBBBCCCDD'
string2 = 'CCBADDDDDBACDC'
string3 = 'DABCBEDCCAEDBB'
I would like to create a summary "frequency string" that counts the number of characters in the string in the following format:
string1_freq = '5A4B3C2D' ## 5 A's, followed by 4 B's, 3 C's, and 2D's
string2_freq = '2C1B1A5D1B1A1C1D1C'
string3_freq = '1D1A1B1C1B1E1D2C1A1E1D2B'
My problem:
How would I quickly create such a summary string?
My idea would be: create an empty list to keep track of the count. Then create a for loop which checks the next character. If there's a match, increase the count by +1 and move to the next character. Otherwise, append to end of the string 'count' + 'character identity'.
That's very inefficient in Python. Is there a quicker way (maybe using the functions below)?
There are several ways to count the elements of a string in python. I like collections.Counter, e.g.
from collections import Counter
counter_str1 = Counter(string1)
print(counter_str1['A']) # 5
print(counter_str1['B']) # 4
print(counter_str1['C']) # 3
print(counter_str1['D']) # 2
There's also str.count(sub[, start[, end]]):
Return the number of non-overlapping occurrences of substring sub in
the range [start, end]. Optional arguments start and end are
interpreted as in slice notation.
As an example:
print(string1.count('A')) ## 5

The following code accomplishes the task without importing any modules.
def freq_map(s):
    if not s:        # nothing to summarize for an empty string
        return ''
    num = 0          # number of adjacent, identical characters
    curr = s[0]      # current character being processed
    result = ''      # result of function
    for i in range(len(s)):
        if s[i] == curr:
            num += 1
        else:
            result += str(num) + curr
            curr = s[i]
            num = 1
    result += str(num) + curr
    return result
Note: Since you asked about performance, I suggest you use this code or a modified version of it.
I ran a rough performance test against the code provided by CoryKramer for reference. In that test, this code did the same job in 58% of the time, without importing any modules.
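A comparison like the one described above could be set up roughly as follows. This is only a sketch: the sample input, the repeat count, and the helper name compare are assumptions, and the timings it prints depend on the machine and Python version, so it will not necessarily reproduce the 58% figure.

```python
import timeit
from itertools import groupby


def freq_map(s):
    """Run-length summary using an explicit loop."""
    if not s:
        return ''
    num = 0
    curr = s[0]
    result = ''
    for ch in s:
        if ch == curr:
            num += 1
        else:
            result += str(num) + curr
            curr = ch
            num = 1
    result += str(num) + curr
    return result


def summarize(s):
    """Run-length summary built on itertools.groupby."""
    return ''.join(str(sum(1 for _ in group)) + key for key, group in groupby(s))


def compare(sample, repeats=100):
    """Return (loop_seconds, groupby_seconds) for `repeats` calls of each function."""
    # Both implementations must agree before timing them is meaningful.
    assert freq_map(sample) == summarize(sample)
    loop_time = timeit.timeit(lambda: freq_map(sample), number=repeats)
    groupby_time = timeit.timeit(lambda: summarize(sample), number=repeats)
    return loop_time, groupby_time


if __name__ == '__main__':
    sample = 'AAAAABBBBCCCDD' * 500   # hypothetical test input
    loop_time, groupby_time = compare(sample)
    print(f'loop: {loop_time:.4f}s  groupby: {groupby_time:.4f}s')
```

Short runs, as in this sample, tend to favor the plain loop, because groupby creates a new sub-iterator for every run; with long runs the balance may shift.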

I would use itertools.groupby to group consecutive runs of the same letter. Then use a generator expression within join to create a string representation of the count and letter for each run.
from itertools import groupby
def summarize(s):
    return ''.join(str(sum(1 for _ in i[1])) + i[0] for i in groupby(s))
Examples
>>> summarize(string1)
'5A4B3C2D'
>>> summarize(string2)
'2C1B1A5D1B1A1C1D1C'
>>> summarize(string3)
'1D1A1B1C1B1E1D2C1A1E1D2B'
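To see what groupby produces before the pieces are joined, it can help to materialize the runs as (character, length) pairs. The helper names runs and summarize_runs below are made up for this sketch.

```python
from itertools import groupby


def runs(s):
    """Return [(character, run_length), ...] in order of appearance."""
    # groupby yields (key, iterator) for each run of equal adjacent items.
    return [(key, len(list(group))) for key, group in groupby(s)]


def summarize_runs(s):
    """Join the runs into the '5A4B3C2D' style summary string."""
    return ''.join(f'{count}{char}' for char, count in runs(s))


print(runs('AAAAABBBBCCCDD'))            # [('A', 5), ('B', 4), ('C', 3), ('D', 2)]
print(summarize_runs('CCBADDDDDBACDC'))  # 2C1B1A5D1B1A1C1D1C
```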

Related

How can I optimize this for-loop?

I need to check the occurrences of the letter "a" in a string s of size n.
Example:
s = "abcac"
n = 10
String to check for occurrences of letter "a": "abcacabcac".
Occurrences: 4
My code works, but I need it to work faster for larger values of n.
What can I do to optimize this code?
def repeatedString(s, n):
    a_count, word_iter = 0, 0
    for i in range(n):
        if s[word_iter] == "a":
            a_count+=1
        word_iter += 1
        if word_iter == (len(s)):
            word_iter = 0
    return a_count

You don't need to assemble the full repeated string. Count the occurrences of the specified character in s and multiply that by the number of times s is fully repeated (n//len(s) times). Then add the number of occurrences in the truncated part at the end of the repetitions (i.e. the first n%len(s) characters).
def countChar(s,n,c):
    # the parentheses matter: s.count(c)*n//len(s) would floor the product instead
    return s.count(c)*(n//len(s))+s[:n%len(s)].count(c)
output:
countChar("abcac",10,"a") # 4 times in 'abcacabcac'
countChar("abcac",17,"a") # 7 times in 'abcacabcacabcacab'
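One way to gain confidence in this arithmetic is to compare it with a brute-force count over small inputs. The function names below are made up for this sketch, and divmod is used only to make the "full repeats plus remainder" split explicit.

```python
def count_char_repeated(s, n, c):
    """Count c in the first n characters of s repeated indefinitely."""
    full_repeats, remainder = divmod(n, len(s))
    return s.count(c) * full_repeats + s[:remainder].count(c)


def count_char_naive(s, n, c):
    """Same count, walking all n positions one by one (slow reference version)."""
    return sum(1 for i in range(n) if s[i % len(s)] == c)


# Compare the two on a few small cases, including n smaller than len(s).
for text in ('abcac', 'aa', 'xyz'):
    for n in range(0, 25):
        assert count_char_repeated(text, n, 'a') == count_char_naive(text, n, 'a')

print(count_char_repeated('abcac', 10, 'a'))  # 4
print(count_char_repeated('abcac', 17, 'a'))  # 7
```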

Count the number of times 'a' appears in s repeated up to length n:
s = "abcac"
n = 10
# repeat s one more time than n // len(s) so the slice always has n characters
(s * (n // len(s) + 1))[:n].count('a')

You can use regular expressions:
import re
a_count = len(re.findall(r'a',s))
re.findall returns a list of all matches, so its length is the count. A regular expression generalizes to more complex patterns. Debra's original answer is better for a simple string search, though:
a_count = s.count('a')

Is there an easy way to get the number of repeating character in a word?

I'm trying to count how many times characters repeat in a word. The repetitions must be sequential.
For example, the method with input "loooooveee" should return 6 (4 times 'o', 2 times 'e').
I'm trying to implement string level functions and I can do it this way but, is there an easy way to do this? Regex, or some other sort of things?

Original question: order of repetition does not matter
You can subtract the number of unique letters from the total number of letters. Applying set to a string returns the collection of its unique characters.
x = "loooooveee"
res = len(x) - len(set(x)) # 6
Or you can use collections.Counter, subtract 1 from each value, then sum:
from collections import Counter
c = Counter("loooooveee")
res = sum(i-1 for i in c.values()) # 6
New question: repetitions must be sequential
You can use itertools.groupby to group sequential identical characters:
from itertools import groupby
g = groupby("aooooaooaoo")
res = sum(sum(1 for _ in j) - 1 for i, j in g) # 5
To avoid the nested sum calls, you can use itertools.islice:
from itertools import groupby, islice
g = groupby("aooooaooaoo")
res = sum(1 for _, j in g for _ in islice(j, 1, None)) # 5
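The two readings of the question give different answers as soon as a character recurs in separate runs. A small side-by-side sketch (the function names are made up for illustration):

```python
from itertools import groupby


def repeats_anywhere(word):
    """Every occurrence beyond the first of each character, wherever it appears."""
    return len(word) - len(set(word))


def repeats_sequential(word):
    """Only repetitions inside a run of identical adjacent characters."""
    return sum(len(list(group)) - 1 for _, group in groupby(word))


print(repeats_anywhere('loooooveee'), repeats_sequential('loooooveee'))    # 6 6
print(repeats_anywhere('aooooaooaoo'), repeats_sequential('aooooaooaoo'))  # 9 5
```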

You could use a regular expression if you want:
import re
rx = re.compile(r'(\w)\1+')
repeating = sum(x[1] - x[0] - 1
                for m in rx.finditer("loooooveee")
                for x in [m.span()])
print(repeating)
This correctly yields 6 and makes use of the .span() function.
The expression is
(\w)\1+
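which captures a word character (a-zA-Z0-9_ for ASCII text; in Python 3, \w also matches other Unicode letters and digits by default) and then matches as many further repetitions of it as possible.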
See a demo on regex101.com for the repeating pattern.
If you want to match any character (that is, not only word characters), change your expression to:
(.)\1+
See another demo on regex101.com.
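The difference between the two expressions shows up on input that contains repeated punctuation. A minimal sketch, with count_repeats as a made-up helper name; it uses the length of each match rather than .span(), which gives the same numbers:

```python
import re


def count_repeats(text, pattern=r'(.)\1+'):
    """Sum, over every run matched by `pattern`, the run length minus one."""
    return sum(len(m.group(0)) - 1 for m in re.finditer(pattern, text))


print(count_repeats('loooooveee'))          # 6
print(count_repeats('aa!!!'))               # 3: one extra 'a' plus two extra '!'
print(count_repeats('aa!!!', r'(\w)\1+'))   # 1: '!' is not a word character
```

Note that . does not match a newline unless re.DOTALL is passed, so runs of blank lines would not be counted by the default pattern.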

Try this:
word = input('something:')
total = 0  # avoid shadowing the built-in sum
chars = set(word)  # get the set of unique characters
for item in chars:  # iterate over the set and output the count for each repeated item
    if word.count(item) > 1:
        total += word.count(item)
        print('{}|{}'.format(item, word.count(item)))
print('Total:' + str(total))
EDIT:
added total count of repetitions

Since it doesn't matter where the repetition is occurring or which characters are being repeated, you can make use of the set data structure provided in Python. It will discard the duplicate occurrences of any character or an object.
Therefore, the solution would look something like this:
def measure_normalized_emphasis(text):
    return len(text) - len(set(text))
This gives the exact result when the position of the repetitions does not matter.
Also, check edge cases such as an empty string, which is good practice anyway.

I think your code is comparing the wrong things
You start by finding the last character:
char = text[-1]
Then you compare this to itself:
for i in range(1, len(text)):
    if text[-i] == char: #<-- surely this is text[-1] to begin with?
Why not just run through the characters:
def measure_normalized_emphasis(text):
    char = text[0]
    emphasis_size = 0
    for i in range(1, len(text)):
        if text[i] == char:
            emphasis_size += 1
        else:
            char = text[i]
    return emphasis_size
This seems to work.

Backward search implementation python

I am dealing with some string search tasks just to improve an efficient way of searching.
I am trying to implement a way of counting how many substrings there are in a given set of strings by using backward search.
For example given the following strings:
original = 'panamabananas$'
s = 'smnpbnnaaaaa$a'
s1 = '$aaaaaabmnnnps' # sorted version of s
I am trying to find how many times the substring 'ban' occurs. To do so, I was thinking of iterating through both strings with the zip function. In the backward search, I should first look for the last character of ban (n) in s1 and see where it lines up with the next character, a, in s. It matches at indexes 9, 10 and 11, which are the third, fourth and fifth a in s. The next character to look for is b, but only for the matches found so far (that is, where n in s1 lined up with a in s). So we take those a's (third, fourth and fifth) from s and see whether the third, fourth or fifth a in s1 lines up with a b in s. That way we would have found an occurrence of 'ban'.
It seems complex to me to iterate and save quasi-occurrences, so what I was trying is something like this:
n = 0 # counter of occurrences
for i, j in zip(s1, s):
    if i == 'n' and j == 'a': # this should save the match
        if i[3:6] == 'a' and any(j[3:6] == 'b'):
            n += 1
I think nested if statements may be needed, but I am still a beginner. I am getting 0 occurrences when there is one occurrence of ban in the original.

You can run a loop with find to count the number of occurrences of a substring.
s = 'panamabananasbananasba'
ss = 'ban'
count = 0
idx = s.find(ss, 0)
while idx != -1:
    count += 1
    idx += len(ss)
    idx = s.find(ss, idx)
print(count)
If you really want a backward search, reverse both the string and the substring and apply the same loop.
s = 'panamabananasbananasban'
s = s[::-1]
ss = 'ban'
ss = ss[::-1]
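The procedure the question describes, pairing the sorted string s1 with s, looks like backward search over a Burrows-Wheeler transform, where s is the last column and s1 the first. The following is a minimal sketch under that assumption. The names bwt and backward_search_count are made up for illustration, and counting with slices is kept for readability; a real index would precompute those counts.

```python
def bwt(text):
    """Last column of the sorted rotations of text (text should end with '$')."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return ''.join(rotation[-1] for rotation in rotations)


def backward_search_count(last_column, pattern):
    """Count occurrences of pattern using only the BWT's last column."""
    first_column = sorted(last_column)
    # Row index at which each character first appears in the sorted column.
    first_occurrence = {}
    for row, ch in enumerate(first_column):
        first_occurrence.setdefault(ch, row)

    top, bottom = 0, len(last_column)      # half-open range of candidate rows
    for ch in reversed(pattern):           # process the pattern right to left
        if ch not in first_occurrence:
            return 0
        # Map the range through the last-to-first correspondence for ch.
        top = first_occurrence[ch] + last_column[:top].count(ch)
        bottom = first_occurrence[ch] + last_column[:bottom].count(ch)
        if top >= bottom:
            return 0
    return bottom - top


s = bwt('panamabananas$')
print(s)                                 # smnpbnnaaaaa$a
print(backward_search_count(s, 'ban'))   # 1
print(backward_search_count(s, 'ana'))   # 3
```

For 'ban' the row range narrows from rows 9 to 11 (the n's), to rows 3 to 5 (the third, fourth and fifth a), to row 7 (the single b), which matches the walk-through in the question.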

Realizing if there is a pattern in a string (Does not need to start at index 0, could be any length)

A program that detects an n-length pattern in a string, even without knowing where the pattern starts, could easily be written by creating a list of n-length substrings and checking whether, from some starting point, the same items repeat for the rest of the list. Without any information other than the string itself, is brute-forcing through all lengths the only way to recognize the pattern, or is there a more efficient algorithm?
(I'm just a beginner in Python, so this may be easy to code... )
Current code that only suits checking for starting at index 0:
def search(s):
    match=s[0]+s[1]
    while (match != s) and (match[0] != match[-1]):
        for matchLen in range(len(match),len(s)-1):
            letter = s[matchLen]
            if letter == match[-1]:
                match += s[len(match)]
                break
    if match == s:
        return None
    else:
        return match[:-1]

You can use re.findall(r'(.{2,})\1+', string). The parentheses create a capture group that is later backreferenced by \1. The . matches any character (except line breaks). The {2,} requires the pattern to be at least two characters long (otherwise strings like ss would be considered a pattern). Finally, the + requires the pattern to repeat one or more times (in addition to its first occurrence inside the capture group).
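A short sketch of how that expression behaves, using re.search so that both the repeated unit and the full repeated stretch are visible. The helper name find_repeated_unit and the sample strings are assumptions made for illustration.

```python
import re

REPEAT = re.compile(r'(.{2,})\1+')


def find_repeated_unit(text):
    """Return (unit, full_match) for the first repeated block, or None."""
    match = REPEAT.search(text)
    if match is None:
        return None
    return match.group(1), match.group(0)


print(find_repeated_unit('xxabcabcabcyy'))  # ('abc', 'abcabcabc')
print(find_repeated_unit('ss'))             # None: the unit must be 2+ characters
print(find_repeated_unit('abcdef'))         # None
```

Because .{2,} is greedy, the engine tries the longest candidate unit first, so a string with an even number of repetitions (for example four copies of abc) would report the doubled unit abcabc rather than abc.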

Pattern is a far too vague term, but assuming you mean some string repeating itself, the regexp (?P<pat>.+)(?P=pat) will work.

Given a string, what you could do is this:
Start with length = 1, and take two pointer variables i and j with which to traverse the string.
Set i = 0 and j = i + length
if str[i] == str[j]:
    i++, j++   // repeat until j reaches the length of the string
else:
    length = length + 1
    // increase length by 1 and start the algorithm over from i = 0
Take the example abcdeabcde :
In this we see
Initially i = 0, j = 1 ,
but str[0]!=str[1] i.e. a!=b,
Then we get length = 2 i.e., i = 0,j = 2
but str[0]!=str[2] i.e. a!=c,
Continuing in the same fashion,
We see when length = 5 and i = 0 and j = 5,
str[0]==str[5]
and thus you can see that i and j increment till j is equal to string length.
That gives you the answer: the pattern length. It may not seem obvious, but I would suggest you dry-run this algorithm on some of your test cases and let me know the results.
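The two-pointer procedure above can be written compactly in Python. This is a sketch of that idea, not a tuned implementation: the name smallest_period is made up, the pattern is assumed to start at index 0, and the worst case is quadratic in the string length.

```python
def smallest_period(s):
    """Smallest length such that s[i] == s[i + length] for every valid i."""
    n = len(s)
    for length in range(1, n):
        # i and j = i + length advance together; any mismatch rejects this length.
        if all(s[i] == s[i + length] for i in range(n - length)):
            return length
    return n  # no shorter period: the whole string is the pattern


print(smallest_period('abcdeabcde'))  # 5
print(smallest_period('aaaa'))        # 1
print(smallest_period('abcd'))        # 4
```

As written, the period does not have to divide the string length: 'abcab' gives 3 because the final repetition is allowed to be cut short.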

You can use re.findall() to find all matches:
import re
s = "somethingabcdeabcdeabcdeabcdeabcdeelseabcdeabcdeabcde"
li = re.findall(r'abcde',s)
print(li)
Output:
['abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde']

Counting the number of times a character in string1 is found in string2?

I'm trying to create a function which takes two strings and then returns the sum total of how many times every character in the first string is found in the second string under the condition that duplicate characters in the first are ignored.
e.g. search_counter('aabbaa','a') would mean a count of 1, since the second string only has one a and no bs, and we only want to search for a once despite there being four as.
Here's my attempt so far:
def search_counter(search_string, searchme):
    count = 0
    for x in search_string:
        for y in searchme:
            if x == y:
                count = count + 1
    return count
The problem with my example is that there is no check to ignore duplicate characters in search_string.
So instead of getting search_counter('aaa','a') = 1 I get 3.

Instead of
for x in search_string:
you can iterate over the characters without duplicates by converting the string to a set:
for x in set(search_string):

You can eliminate repetitions from a string by transforming it into a set:
>>> set("asdddd")
{'a', 's', 'd'}  # Python 3 repr; element order may vary (Python 2 prints set(['a', 's', 'd']))
Preprocess your string this way, and the rest of the algorithm will remain the same (set objects are iterables, just like strings)
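Putting the set suggestion together with the counting, a complete version might look like the sketch below. It keeps the asker's function and parameter names, but the body is an assumption about what was intended: str.count replaces the inner loop.

```python
def search_counter(search_string, searchme):
    """Total occurrences in searchme of each distinct character of search_string."""
    # set() drops duplicate characters, so each one is looked up only once.
    return sum(searchme.count(ch) for ch in set(search_string))


print(search_counter('aabbaa', 'a'))  # 1
print(search_counter('aaa', 'a'))     # 1
print(search_counter('ab', 'aabbb'))  # 5
```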

You can use iteration to do this
def search_counter(string, search):
    count = 0
    for i in range(len(string)):
        count += string[i:].startswith(search)
    return count
Or this one-liner
search_counter = lambda string, search: sum([string[i:].startswith(search) for i in range(len(string))])
