Count spaces in text (treat consecutive spaces as one) - python

How would you count the number of spaces or newline characters in a text in such a way that consecutive spaces are counted only as one?
For example, this is very close to what I want:
string = "This is an example text.\n But would be good if it worked."
counter = 0
for i in string:
    if i == ' ' or i == '\n':
        counter += 1
print(counter)
However, instead of returning with 15, the result should be only 11.

The default str.split() function will treat consecutive runs of spaces as one. So simply split the string, get the size of the resulting list, and subtract one.
len(string.split())-1
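A minimal sketch of the split-based count, wrapped in a hypothetical helper called count_gaps (the name and the empty-input guard are assumptions, not part of the answer). Since split() drops leading and trailing whitespace, those runs are not counted, and an empty string would otherwise give -1.

```python
def count_gaps(text):
    """Count whitespace runs between words using str.split().

    Hypothetical helper: leading and trailing whitespace is ignored,
    because split() discards it before the words are counted.
    """
    words = text.split()
    # max() guards the empty / all-whitespace case, where len(words) is 0.
    return max(len(words) - 1, 0)


string = "This is an example text.\n But would be good if it worked."
print(count_gaps(string))
```

If runs at the very start or end of the text should count as well, one of the character-by-character or regex approaches is probably a better fit.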

Assuming you are permitted to use Python regex:
import re
print(len(re.findall(r"[ \n]+", string)))
Quick and easy!
UPDATE: Additionally, use [\s] instead of [ \n] to match any whitespace character.

You can do this:
string = "This is an example text.\n But would be good if it worked."
counter = 0
# A boolean flag indicating whether the previous character was a space
previous = False
for i in string:
    if i == ' ' or i == '\n':
        # The current character is a space
        previous = True # Setup for the next iteration
    else:
        # The current character is not a space, check if the previous one was
        if previous:
            counter += 1
        previous = False
print(counter)

re to the rescue.
>>> import re
>>> string = "This is an example text.\n But would be good if it worked."
>>> spaces = sum(1 for match in re.finditer(r'\s+', string))
>>> spaces
11
This consumes minimal memory. An alternative solution that builds a temporary list would be
>>> len(re.findall(r'\s+', string))
11
If you only want to consider space characters and newline characters (as opposed to tabs, for example), use the regex r'(\n| )+' instead of r'\s+'.
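To illustrate the difference between the two patterns, here is a small sketch on a made-up sample string that contains a tab. It uses the character class [ \n]+, which for counting purposes matches the same runs as (\n| )+.

```python
import re

# Made-up sample: a tab, a double space, and a newline followed by a space.
sample = "a\tb  c\n d"

# \s+ treats the tab as whitespace, so there are three runs.
all_whitespace = len(re.findall(r"\s+", sample))

# [ \n]+ only matches spaces and newlines, so the tab is not counted.
spaces_and_newlines = len(re.findall(r"[ \n]+", sample))

print(all_whitespace, spaces_and_newlines)  # 3 2
```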

Just store a character that was the last character found. Set it to i each time you loop. Then within your inner if, do not increase the counter if the last character found was also a whitespace character.
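The idea could be sketched roughly like this. The function name count_whitespace_runs and the WHITESPACE set are made up for illustration, and only spaces and newlines are treated as whitespace, as in the question.

```python
WHITESPACE = {" ", "\n"}  # assumption: only spaces and newlines count


def count_whitespace_runs(text):
    """Count runs of whitespace by remembering the previous character."""
    counter = 0
    last = ""  # sentinel meaning "no character seen yet"
    for ch in text:
        # Only the first character of each run increases the counter.
        if ch in WHITESPACE and last not in WHITESPACE:
            counter += 1
        last = ch
    return counter


string = "This is an example text.\n But would be good if it worked."
print(count_whitespace_runs(string))  # 11
```

A set is used for the membership test because the empty sentinel string would count as "in" any other string.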

You can iterate through numbers to use them as indexes.
counter = 0
for i in range(1, len(string)):
    if string[i] in ' \n' and string[i-1] not in ' \n':
        counter += 1
if string[0] in ' \n':
    counter += 1
print(counter)
Pay attention to the first character: the loop starts from the second character because string[i-1] at i = 0 would wrap around to the last character, so the first character is checked separately.

You can use enumerate, checking the next char is not also whitespace so consecutive whitespace will only count as 1:
string = "This is an example text.\n But would be good if it worked."
print(sum(ch.isspace() and not string[i:i+1].isspace() for i, ch in enumerate(string, 1)))
You can also use iter with a generator function, keeping track of the last character and comparing:
def con(s):
    it = iter(s)
    prev = next(it, "")  # "" keeps an empty input from raising StopIteration
    for ele in it:
        yield prev.isspace() and not ele.isspace()
        prev = ele
    yield prev.isspace()  # count a run that ends the string
print(sum(con(string)))
An itertools version:
string = "This is an example text.\n But would be good if it worked. "
from itertools import tee, zip_longest  # izip_longest in Python 2
a, b = tee(string)
next(b)  # advance the second iterator so it runs one character ahead
print(sum(x.isspace() and not y.isspace() for x, y in zip_longest(a, b, fillvalue="")))

Try:
def word_count(my_string):
    word_count = 1
    for i in range(1, len(my_string)):
        if my_string[i] == " ":
            if not my_string[i - 1] == " ":
                word_count += 1
    return word_count

You can use the function groupby() to find groups of consecutive spaces:
from collections import Counter
from itertools import groupby
s = 'This is an example text.\n But would be good if it worked.'
c = Counter(k for k, _ in groupby(s, key=lambda x: ' ' if x == '\n' else x))
print(c[' '])
# 11

Related

Is there an easy way to get the number of repeating character in a word?

I'm trying to count how many times any character repeats in a word. The repetitions must be sequential.
For example, the method with input "loooooveee" should return 6 (4 times 'o', 2 times 'e').
I'm trying to implement string level functions and I can do it this way but, is there an easy way to do this? Regex, or some other sort of things?
Original question: order of repetition does not matter
You can subtract the number of unique letters from the total number of letters. Applying set to a string returns a collection of its unique letters.
x = "loooooveee"
res = len(x) - len(set(x)) # 6
Or you can use collections.Counter, subtract 1 from each value, then sum:
from collections import Counter
c = Counter("loooooveee")
res = sum(i-1 for i in c.values()) # 6
New question: repetitions must be sequential
You can use itertools.groupby to group sequential identical characters:
from itertools import groupby
g = groupby("aooooaooaoo")
res = sum(sum(1 for _ in j) - 1 for i, j in g) # 5
To avoid the nested sum calls, you can use itertools.islice:
from itertools import groupby, islice
g = groupby("aooooaooaoo")
res = sum(1 for _, j in g for _ in islice(j, 1, None)) # 5
You could use a regular expression if you want:
import re
rx = re.compile(r'(\w)\1+')
repeating = sum(x[1] - x[0] - 1
                for m in rx.finditer("loooooveee")
                for x in [m.span()])
print(repeating)
This correctly yields 6 and makes use of the .span() function.
The expression is
(\w)\1+
which captures a word character (one of a-zA-Z0-9_) and tries to repeat it as often as possible.
See a demo on regex101.com for the repeating pattern.
If you want to match any character (that is, not only word characters), change your expression to:
(.)\1+
See another demo on regex101.com.
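As a rough sketch of how the two expressions differ, the hypothetical helper below (count_sequential_repeats is a made-up name) sums the run length minus one for every match. Note that . does not match a newline unless re.DOTALL is passed, which this sketch does not do.

```python
import re


def count_sequential_repeats(text, pattern=r"(.)\1+"):
    """Sum, over each run of a repeated character, the run length minus one."""
    return sum(len(m.group(0)) - 1 for m in re.finditer(pattern, text))


print(count_sequential_repeats("loooooveee"))          # 6, as in the question
print(count_sequential_repeats("wow!!!", r"(\w)\1+"))  # 0, "!" is not a word character
print(count_sequential_repeats("wow!!!"))              # 2, "!!!" is a run of three
```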
Try this:
word = input('something:')
total = 0  # renamed from sum to avoid shadowing the built-in
chars = set(word)  # the set of unique characters
for item in chars:  # iterate over the set and output the count for each repeated character
    if word.count(item) > 1:
        total += word.count(item)
        print('{}|{}'.format(item, word.count(item)))
print('Total:' + str(total))
EDIT:
added total count of repetitions
Since it doesn't matter where the repetition is occurring or which characters are being repeated, you can make use of the set data structure provided in Python. It will discard the duplicate occurrences of any character or an object.
Therefore, the solution would look something like this:
def measure_normalized_emphasis(text):
return len(text) - len(set(text))
This will give you the exact result.
Also, make sure to test some edge cases, which is good practice anyway.
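One edge case worth checking is a character that repeats without being adjacent. The sketch below compares the set-based count with a sequential count built on itertools.groupby; both helper names are made up for illustration.

```python
from itertools import groupby


def repeats_anywhere(text):
    """Set-based count: position of the repetition is ignored."""
    return len(text) - len(set(text))


def repeats_sequential(text):
    """Count only repetitions that are directly adjacent."""
    return sum(len(list(group)) - 1 for _, group in groupby(text))


for sample in ("loooooveee", "abab", ""):
    print(repr(sample), repeats_anywhere(sample), repeats_sequential(sample))
```

For "loooooveee" both give 6, but for "abab" the set-based version reports 2 while the sequential version reports 0, so the two only agree when every repetition is adjacent.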
I think your code is comparing the wrong things
You start by finding the last character:
char = text[-1]
Then you compare this to itself:
for i in range(1, len(text)):
    if text[-i] == char:  # <-- surely this is text[-1] to begin with?
Why not just run through the characters:
def measure_normalized_emphasis(text):
    char = text[0]
    emphasis_size = 0
    for i in range(1, len(text)):
        if text[i] == char:
            emphasis_size += 1
        else:
            char = text[i]
    return emphasis_size
This seems to work.

partition a string by dash (-) python

I want to get a string and divide it into parts separated by "-".
Input:
aabbcc
And output:
aa-bb-cc
is there a way to do so?
If you want to do it based on the same letter then you can use itertools.groupby() to do this, e.g.:
In []:
import itertools as it
s = 'aabbcc'
'-'.join(''.join(g) for k, g in it.groupby(s))
Out[]:
'aa-bb-cc'
Or if you want it in chunks of 2 you can use iter() and zip():
In []:
n = 2
'-'.join(''.join(p) for p in zip(*[iter(s)]*n))
Out[]:
'aa-bb-cc'
Note: if the string length is not divisible by 2, this will drop the last character. You can replace zip(...) with itertools.zip_longest(..., fillvalue='') to keep it, but it is unclear whether the OP has this issue.
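A short sketch of the zip_longest variant, in case the odd-length case matters. The function name chunk_join and its parameters are hypothetical.

```python
from itertools import zip_longest


def chunk_join(s, n=2, sep="-"):
    """Join chunks of n characters with sep, keeping a shorter final chunk."""
    # The same iterator is repeated n times, so each tuple takes n characters;
    # fillvalue="" pads the last tuple instead of dropping it.
    chunks = zip_longest(*[iter(s)] * n, fillvalue="")
    return sep.join("".join(chunk) for chunk in chunks)


print(chunk_join("aabbcc"))  # aa-bb-cc
print(chunk_join("aabbc"))   # aa-bb-c
```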
If you want pairs of characters divided by a dash, you can use the function below:
def pair_div(string):
    newString = str()  # stores the divided string
    for i, s in enumerate(string):
        # Add a dash after every second character, except after the last character of the string.
        if i % 2 != 0 and i < (len(string) - 1):
            newString += s + '-'
        else:
            newString += s  # otherwise just add the character
    return newString
And for example:
[In]: pair_div("aazzxxcceewwqqbbvvaa")
[Out]: 'aa-zz-xx-cc-ee-ww-qq-bb-vv-aa'
But if you want to group identical characters and separate the groups with a dash, you are better off using regex methods.
BR,
Shend
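For the regex route mentioned in the answer above, grouping runs of identical characters might look like the sketch below. The function name dash_by_runs is made up, and re.DOTALL is added only so that newlines are grouped too.

```python
import re


def dash_by_runs(s):
    """Separate runs of identical characters with a dash."""
    # (.)\1* matches one character followed by any further copies of it.
    runs = re.finditer(r"(.)\1*", s, flags=re.DOTALL)
    return "-".join(m.group(0) for m in runs)


print(dash_by_runs("aabbcc"))  # aa-bb-cc
print(dash_by_runs("aaabcc"))  # aaa-b-cc
```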
You can try
data = "aabbcc"
"-".join([data[x:x+2] for x in range(0, len(data), 2)])
If you want to divide the string into blocks of 2 characters, this will help:
import textwrap
s='aabbcc'
lst=textwrap.wrap(s,2)
print('-'.join(lst))
The second argument sets the maximum number of characters in each group.
s = 'aabbccdd'
#index 01234567
new_s = ''
1)
for idx, char in enumerate(s):
    new_s += char
    if idx % 2 != 0:
        new_s += '-'
print(new_s.strip('-'))
# aa-bb-cc-dd
2)
new_s = ''.join([s[i]+'-' if i%2 != 0 else s[i] for i in range(len(s))]).strip('-')
print(new_s)
# aa-bb-cc-dd

Remove punctuation items from end of string

I have a seemingly simple problem, which I cannot seem to solve. Given a string containing a DOI, I need to keep removing the last character while it is a punctuation mark, until the last character is a letter or number.
For example, if the string was:
sampleDoi = "10.1097/JHM-D-18-00044.',"
I want the following output:
"10.1097/JHM-D-18-00044"
ie. remove .',
I wrote the following script to do this:
invalidChars = set(string.punctuation.replace("_", ""))
a = "10.1097/JHM-D-18-00044.',"
i = -1
for each in reversed(a):
    if any(char in invalidChars for char in each):
        a = a[:i]
        i = i - 1
    else:
        print (a)
        break
However, this produces 10.1097/JHM-D-18-00 but I would like it to produce 10.1097/JHM-D-18-00044. Why is the 44 removed from the end?
The string function rstrip() is designed to do exactly this:
>>> sampleDoi = "10.1097/JHM-D-18-00044.',"
>>> sampleDoi.rstrip(",.'")
'10.1097/JHM-D-18-00044'
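If the set of trailing characters is not known in advance, rstrip() can be given every punctuation character at once. The sketch below assumes, as the question's code does, that "_" should be kept; clean_doi and TRAILING_JUNK are made-up names.

```python
import string

# Assumption carried over from the question: "_" is treated as a valid character.
TRAILING_JUNK = string.punctuation.replace("_", "")


def clean_doi(doi):
    """Strip any run of punctuation from the right-hand end only."""
    return doi.rstrip(TRAILING_JUNK)


print(clean_doi("10.1097/JHM-D-18-00044.',"))  # 10.1097/JHM-D-18-00044
```

Punctuation inside the DOI, such as the dots, slash and hyphens, is left alone because rstrip() stops at the first character that is not in the given set.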
Corrected code:
import string
invalidChars = set(string.punctuation.replace("_", ""))
a = "10.1097/JHM-D-18-00044.',"
i = -1
for each in reversed(a):
    if any(char in invalidChars for char in each):
        a = a[:i]
        i = i # Well, really this line can just be removed altogether.
    else:
        print (a)
        break
This gives the output you want, while keeping the original code mostly the same.
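This is one way using next and str.isalnum with a generator expression utilizing enumerate / reversed.
sampleDoi = "10.1097/JHM-D-18-00044.',"
# Number of trailing characters to drop: the offset of the first alphanumeric character from the end.
idx = next((i for i, j in enumerate(reversed(sampleDoi)) if j.isalnum()), len(sampleDoi))
res = sampleDoi[:len(sampleDoi) - idx]
print(res)
# 10.1097/JHM-D-18-00044
The default len(sampleDoi) is used so that, if no alphanumeric character is found, an empty string is returned. Slicing with len(sampleDoi) - idx rather than -idx also keeps the whole string when it already ends in an alphanumeric character, where idx is 0.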
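If you don't want to use regex:
import string
the_str = "10.1097/JHM-D-18-00044.',"
# The emptiness check avoids an IndexError when every character is punctuation.
while the_str and the_str[-1] in string.punctuation:
    the_str = the_str[:-1]
This removes the last character until it is no longer a punctuation character.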

Error while Trying To Print the First Occurrence of a repeating Character in a String using Python 3.6

I am writing a simple program to replace the repeating characters in a string with an *(asterisk). But the thing here is I can print the 1st occurrence of a repeating character in a string, but not the other occurrences.
For example,
if my input is Google, my output should be Go**le.
I am able to replace the characters that repeat with an asterisk, but just can't find a way to print the 1st occurrence of the character. In other words, my output right now is ****le.
Have a look at my Python3 code for this:
s = 'Google'
s = s.lower()
for i in s:
    if s.count(i)>1:
        s = s.replace(i,'*')
print(s)
Can someone suggest me what should be done to get the required output?
replace will replace ALL occurrences of the char. You need to keep track of the characters you have already seen and, when one is repeated, replace JUST that character (at its specific index).
Strings don't support index assignment, so we can build a new list that represents the new string and ''.join() it afterwards.
Using a set, you can keep track of which items you have already seen.
It would look like this:
s = 'Google'
seen = set()
new_string = []
for c in s:
    if c.lower() in seen:
        new_string.append('*')
    else:
        new_string.append(c)
        seen.add(c.lower())
new_string = ''.join(new_string)
print(new_string)
Go**le
This is my approach:
First, you need to find the nth occurrence of the character. Then, you can replace other occurrences by using this snippet:
s = s[:position] + '*' + s[position+1:]
Full example code:
def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start
s = 'Google'
s_lower = s.lower()
for c in s_lower:
    if s_lower.count(c) > 1:
        position = find_nth(s_lower, c, 2)
        s = s[:position] + '*' + s[position+1:]
print(s)
Runnable link: https://repl.it/Mc4U/4
Regex approach:
import re
s = 'Google'
s_lower = s.lower()
for c in s_lower:
    if s_lower.count(c) > 1:
        # re.escape keeps characters such as '.' or '*' from being read as regex syntax
        position = [m.start() for m in re.finditer(re.escape(c), s_lower)][1]
        s = s[:position] + '*' + s[position+1:]
print(s)
Runnable link: https://repl.it/Mc4U/3
How about using list comprehensions? When constructing a list from another list (which is kind of what you are doing here, since we're considering strings as lists), a list comprehension is a great tool:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
inputstring = 'Google'.lower()
outputstring = ''.join(
    [char if inputstring.find(char, 0, index) == -1 else '*'
     for index, char in enumerate(inputstring)])
print(outputstring)
This results in go**le.
Hope this helps!
(edited to use '*' as the replacement character instead of '#')

python: Finding a substring within a string

I'm trying to write a program which counts how many times a substring appears within a string.
word = "wejmfoiwstreetstreetskkjoih"
streets = "streets"
count = 0
if streets in word:
    count += 1
print(count)
As you can see, "streets" appears twice, but the last s of the first "streets" is also the beginning of the second. I can't think of a way to loop over this.
Thanks!
Can be done using a regex
>>> import re
>>> text = 'streetstreets'
>>> len(re.findall('(?=streets)', text))
2
From the docs:
(?=...)
Matches if ... matches next, but doesn’t consume any of the
string. This is called a lookahead assertion. For example, Isaac
(?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
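Because the lookahead consumes nothing, the same idea can also report where each overlapping match starts. A small sketch, with re.escape added as a precaution in case the search term contains regex metacharacters:

```python
import re

word = "wejmfoiwstreetstreetskkjoih"
streets = "streets"

# The lookahead matches at every position where the needle starts,
# without consuming characters, so overlapping hits are all found.
pattern = re.compile("(?=" + re.escape(streets) + ")")
positions = [m.start() for m in pattern.finditer(word)]

print(positions)       # [8, 14]
print(len(positions))  # 2
```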
Quick and dirty:
>>> word = "wejmfoiwstreetstreetskkjoih"
>>> streets = "streets"
>>> sum(word[start:].startswith(streets) for start in range(len(word)))
2
A generic (though not as elegant) way would be a loop like this:
def count_substrings(stack, needle):
    idx = 0
    count = 0
    while True:
        idx = stack.find(needle, idx) + 1 # next time look after this idx
        if idx <= 0:
            break
        count += 1
    return count
My measurement shows that it's ~8.5 times faster than the solution with startswith for every substring.
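To reproduce a comparison like this yourself, something along the following lines should work. It is only a sketch: the function names are made up, the find-based loop is a slightly restructured variant of the one above, and the ratio you see will depend on the input and the Python version.

```python
import timeit


def count_with_find(stack, needle):
    """Count overlapping occurrences by repeatedly calling str.find."""
    count = 0
    idx = stack.find(needle)
    while idx != -1:
        count += 1
        idx = stack.find(needle, idx + 1)  # step one character to allow overlaps
    return count


def count_with_startswith(stack, needle):
    """Count overlapping occurrences by testing every start position."""
    return sum(stack[start:].startswith(needle) for start in range(len(stack)))


word = "wejmfoiwstreetstreetskkjoih"
streets = "streets"

for func in (count_with_find, count_with_startswith):
    seconds = timeit.timeit(lambda: func(word, streets), number=10_000)
    print(func.__name__, func(word, streets), round(seconds, 4))
```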
