The parameter to the function satisfies these rules:
It does not have any leading whitespace
It might have trailing whitespaces
There might be interleaved whitespaces in the string.
Goal: remove duplicate whitespaces that are interleaved & strip trailing whitespaces.
This is how I am doing it now:
# toks - a priori no leading space
def squeeze(toks):
    import re
    p = re.compile(r'\W+')
    a = p.split(toks)
    for i in range(0, len(a)):
        if len(a[i]) == 0:
            del a[i]
    return ' '.join(a)
>>> squeeze(' Mary Decker is hot ')
'Mary Decker is hot'
Can this be improved? Pythonic enough?
This is how I would do it:
" ".join(toks.split())
PS. Is there a subliminal message in this question? ;-)
Can't you use rstrip()?
some_string.rstrip()
or strip() for stripping the string from both sides?
In addition, the strip() methods also support passing in arbitrary characters to strip:
string.strip = strip(s, chars=None)
    strip(s [,chars]) -> string
Related: if you need to strip whitespaces in-between: split the string, strip the terms and re-join it.
Reading the API helps!
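A small illustration of those methods (just a sketch; the sample strings are made up):
>>> s = '  hello   world  '
>>> s.rstrip()
'  hello   world'
>>> s.strip()
'hello   world'
>>> '--hello--'.strip('-')
'hello'
>>> ' '.join(s.split())   # strip in-between: split, then re-join
'hello world'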
To answer your questions literally:
Yes, it could be improved. The first improvement would be to make it work.
>>> squeeze('x ! y')
'x y' # oops
Problem 1: You are using \W+ (non-word characters) when you should be using \s+ (whitespace characters)
>>> toks = 'x ! y z '
>>> re.split('\W+', toks)
['x', 'y', 'z', '']
>>> re.split('\s+', toks)
['x', '!', 'y', 'z', '']
Problem 2: The loop to delete empty strings works, but only by accident. If you wanted a general-purpose loop to delete empty strings in situ, you would need to work backwards, otherwise your subscript i would get out of whack with the number of elements remaining. It works here because re.split() without a capturing group can produce empty elements only at the start and end. You have defined away the start problem, and the end case doesn't cause a problem because there have been no prior deletions. So you are left with a very ugly loop which could be replaced by two lines:
if a and not a[-1]: # guard against empty list
    del a[-1]
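To illustrate the "work backwards" point above, a general-purpose in-place deletion would look something like this (a sketch only, not something you actually need here):
a = ['x', '', 'y', '', '', 'z', '']
# iterate from the end so deletions don't shift the indices
# of elements that haven't been visited yet
for i in range(len(a) - 1, -1, -1):
    if not a[i]:
        del a[i]
print(a)   # ['x', 'y', 'z']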
However unless your string is very long and you are worried about speed (in which case you probably shouldn't be using re), you'd probably want to allow for leading whitespace (assertions like "my data doesn't have leading whitespace" are ignored by convention) and just do it in a loop on the fly:
a = [x for x in p.split(toks) if x]
Next step is to avoid building the list a:
return ' '.join(x for x in p.split(toks) if x)
Now you did mention "Pythonic" ... so let's throw out all that re import and compile overhead stuff, and the genexp, and just do this:
return ' '.join(toks.split())
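For the record, that last one also copes with leading whitespace and tabs (a quick interpreter check on a made-up string):
>>> toks = '  Mary   Decker\tis  hot   '
>>> ' '.join(toks.split())
'Mary Decker is hot'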
Well, I tend not to use the re module if I can do the job reasonably with
the built-in functions and features. For example:
def toks(s):
    return ' '.join([x for x in s.split(' ') if x])
... seems to accomplish the same goal with only the built-in split, join, and a list comprehension to filter out empty elements of the split string.
Is that more "Pythonic?" I think so. However my opinion is hardly authoritative.
This could be done as a lambda expression as well; and I think that would not be Pythonic.
Incidentally this assumes that you want to ONLY squeeze out duplicate spaces and trim leading and trailing spaces. If your intent is to munge all whitespace sequences into single spaces (and trim leading and trailing) then change s.split(' ') to s.split() -- passing no argument, or None, to the split() method is different than passing it a space.
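A quick illustration of that difference (the sample string is made up):
>>> s = 'a  b\tc '
>>> s.split(' ')
['a', '', 'b\tc', '']
>>> s.split()
['a', 'b', 'c']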
To make your code more Pythonic, you must realize that in Python, since a[i] is a string, instead of deleting a[i] when a[i] == '', it is better to keep a[i] when a[i] != ''.
So, instead of
def squeeze(toks):
    import re
    p = re.compile(r'\W+')
    a = p.split(toks)
    for i in range(0, len(a)):
        if len(a[i]) == 0:
            del a[i]
    return ' '.join(a)
write
def squeeze(toks):
    import re
    p = re.compile(r'\W+')
    a = p.split(toks)
    a = [x for x in a if x]
    return ' '.join(a)
and then
def squeeze(toks):
    import re
    p = re.compile(r'\W+')
    return ' '.join([x for x in p.split(toks) if x])
Then, taking account that a function can receive a generator as well as a list:
def squeeze(toks):
    import re
    p = re.compile(r'\W+')
    return ' '.join((x for x in p.split(toks) if x))
and that the doubled parentheses aren't obligatory:
def squeeze(toks):
    import re
    p = re.compile(r'\W+')
    return ' '.join(x for x in p.split(toks) if x)
Additionally, instead of making Python check whether re is present in the function's namespace each time squeeze() is called (which is what an import inside the function does), it would be better to pass re as a default argument:
import re
def squeeze(toks, re=re):
    p = re.compile(r'\W+')
    return ' '.join(x for x in p.split(toks) if x)
and, even better:
import re
def squeeze(toks, p=re.compile(r'\W+')):
    return ' '.join(x for x in p.split(toks) if x)
Remark: the if x part in the expression is useful only to discard the leading '' and trailing '' that occur in the list p.split(toks) when toks begins or ends with whitespace.
But instead of splitting, it is just as good to keep what is desired:
import re
def squeeze(toks, p=re.compile(r'\w+')):
    return ' '.join(p.findall(toks))
All that said, the pattern r'\W+' in your question is wrong for your purpose, as John Machin pointed out.
If you want to compress internal whitespace and remove trailing whitespace, whitespace being taken in its pure sense designating the set of characters ' ', '\f', '\n', '\r', '\t', '\v' (see \s in re), you must replace your splitting with this one:
import re
def squeeze(toks, p=re.compile(r'\s+')):
    return ' '.join(x for x in p.split(toks) if x)
or, keeping the right substrings:
import re
def squeeze(toks, p=re.compile(r'\S+')):
    return ' '.join(p.findall(toks))
which is nothing else than the simpler and faster expression ' '.join(toks.split())
But if in fact you just want to compress internal and remove trailing ' ' and '\t' characters, keeping the newlines untouched, you will use
import re
def squeeze(toks, p=re.compile(r'[^ \t]+')):
    return ' '.join(p.findall(toks))
and that can't be replaced by anything else.
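To see that difference on a string that contains a newline (a made-up example):
>>> import re
>>> toks = 'a  b\t c\nd   e\t'
>>> ' '.join(toks.split())                   # the newline gets squeezed away too
'a b c d e'
>>> ' '.join(re.findall(r'[^ \t]+', toks))   # the newline is kept
'a b c\nd e'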
I know this question is old. But why not use regex?
import re
result = ' Mary Decker is hot '
print(f"=={result}==")
result = re.sub(r'\s+$', '', result)
print(f"=={result}==")
result = re.sub(r'^\s+', '', result)
print(f"=={result}==")
result = re.sub(r'\s+', ' ', result)
print(f"=={result}==")
The output is
== Mary Decker is hot ==
== Mary Decker is hot==
==Mary Decker is hot==
==Mary Decker is hot==
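The three substitutions can also be collapsed into a single pass if you don't need the intermediate prints (a minor variation, not necessarily better):
import re

result = ' Mary Decker is hot '
squeezed = re.sub(r'\s+', ' ', result).strip()
print(f"=={squeezed}==")
# ==Mary Decker is hot==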
To be more specific, it's for an "if" condition.
I have a list of strings which have 5 spaces and then a final character.
Is there a wildcard that can stand in for the last character of every string?
Like:
if string == "     &":
    do something
And the condition would be true if & were any character.
You can access the last character by slicing, e.g. -1 is the last one:
lst = ['&', 'A', 'B', 'C']
s = 'some random string which ends on &'
if s[-1] in lst:
    print('hurray!')
#hurray!
Alternatively you can also use .endswith() if it's only a few entries:
s = 'some random string which ends on &'
if s.endswith('&') or s.endswith('A'):
    print('hurray!')
#hurray!
Since you also asked how to replace the last character, this can be done like this:
s = s[:-1] + '!'
#Out[72]: 'some random string which ends on !'
As per your comment, here is a wildcard solution:
import re
s = r'     &'
pattern = r' .{1}$'
if re.search(pattern, s):
    print('hurray!')
#hurray!
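And if the intent really is "exactly five spaces and then any one final character", the pattern can be pinned down further (a sketch; the sample strings are made up):
import re

pattern = r' {5}.'   # exactly five spaces, then any single character
for s in ['     &', '     A', '    &']:
    if re.fullmatch(pattern, s):
        print(repr(s), 'matches')
# '     &' matches
# '     A' matches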
Try this:
if string[-1] == 'A' or string[-1] == '1':
    do something
You may use a regular expression along with re.search, for example:
vals = ["validA", "valid1", "invalid"]
for val in vals:
if re.search(r'[A1]$', val):
print(val + ": MATCH")
This prints:
validA: MATCH
valid1: MATCH
Perhaps you're looking for the .endswith() function? For example:
if "waffles".endswith("s"):
...
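Note that str.endswith() also accepts a tuple of suffixes, which covers "any of several last characters" without a regex:
s = 'some random string which ends on &'
if s.endswith(('&', 'A', '1')):
    print('hurray!')
#hurray!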
We get a string from the user and want to lowercase it, remove the vowels, and add a '.' before each remaining letter. For example, we get 'aBAcAba' and change it to '.b.c.b'. The first two parts are done, but I would like some help with the third one.
str = input()
str=str.lower()
for i in range(0,len(str)):
    str=str.replace('a','')
    str=str.replace('e','')
    str=str.replace('o','')
    str=str.replace('i','')
    str=str.replace('u','')
print(str)
for j in range(0,len(str)):
    str=str.replace(str[j],('.'+str[j]))
print(str)
A few things:
You should avoid the variable name str because this is used by a builtin, so I've changed it to st
In the first part, no loop is necessary; replace will replace all occurrences of a substring
For the last part, it is probably easiest to loop through the string and build up a new string. Limiting this answer to basic syntax, a simple for loop will work.
st = input()
st=st.lower()
st=st.replace('a','')
st=st.replace('e','')
st=st.replace('o','')
st=st.replace('i','')
st=st.replace('u','')
print(st)
st_new = ''
for c in st:
    st_new += '.' + c
print(st_new)
Another potential improvement: for the second part, you can also write a loop (instead of your five separate replace lines):
for c in 'aeiou':
    st = st.replace(c, '')
Other possibilities using more advanced techniques:
For the second part, a regular expression could be used:
st = re.sub('[aeiou]', '', st)
For the third part, a generator expression could be used:
st_new = ''.join(f'.{c}' for c in st)
You can use str.join() to place some character in between all the existing characters, and then you can use string concatenation to place it again at the end:
# st = 'bcb'
st = '.' + '.'.join(st)
# '.b.c.b'
As a sidenote, please don't use str as a variable name. It's the name of the built-in "string" type, and if you shadow it with your own variable you can no longer work with other strings properly. string, st, s, etc. are fine, as they don't shadow the builtin str.
z = "aBAcAba"
z = z.lower()
newstring = ''
for i in z:
if not i in 'aeiou':
newstring+='.'
newstring+=i
print(newstring)
Here I have gone step by step: first converting the string to lowercase, then checking whether each character is not a vowel, then adding a dot to the final string, and then adding the character itself.
You could try splitting the string into a list and then building a new string from its elements, appending a "." before each one.
Not too efficient, but it will work.
Thanks to all of you, especially allani. The below code worked.
st = input()
st=st.lower()
st=st.replace('a','')
st=st.replace('e','')
st=st.replace('o','')
st=st.replace('i','')
st=st.replace('u','')
print(st)
st_new = ''
for c in st:
    st_new += '.' + c
print(st_new)
This does everything.
import re
data = 'KujhKyjiubBMNBHJGJhbvgqsauijuetystareFGcvb'
matches = re.compile('[^aeiou]', re.I).finditer(data)
final = f".{'.'.join([m.group().lower() for m in matches])}"
print(final)
#.k.j.h.k.y.j.b.b.m.n.b.h.j.g.j.h.b.v.g.q.s.j.t.y.s.t.r.f.g.c.v.b
s = input()
s = s.lower()
for i in s:
    for x in ['a','e','i','o','u']:
        if i == x:
            s = s.replace(i,'')
new_s = ''
for i in s:
    new_s += '.'+ i
print(new_s)
def add_dots(n):
    return ".".join(n)
print(add_dots("test"))

def remove_dots(a):
    return a.replace(".", "")
print(remove_dots("t.e.s.t"))
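Putting the pieces from the answers above together, the whole task fits in a few lines (a sketch; the function name dotify is made up):
def dotify(s):
    # lowercase, drop the vowels, then put a '.' before each remaining character
    s = s.lower()
    return ''.join('.' + c for c in s if c not in 'aeiou')

print(dotify('aBAcAba'))   # .b.c.b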
I have the following piece of code. Basically, I'm trying to replace a word if it matches one of these regex patterns. If the word matches even once, the word should be completely gone from the new list. The code below works, however, I'm wondering if there's a way to implement this so that I can indefinitely add more patterns to the 'pat' list without having to write additional if statements within the for loop.
To clarify, my regex patterns have negative lookaheads and lookbehinds to make sure it's one word.
pat = [r'(?<![a-z][ ])Pacific(?![ ])', r'(?<![a-z][ ])Global(?![ ])']

if isinstance(x, list):
    new = []
    for i in x:
        if re.search(pat[0], i):
            i = re.sub(pat[0], '', i)
        if re.search(pat[1], i):
            i = re.sub(pat[1], '', i)
        if len(i) > 0:
            new.append(i)
    x = new
else:
    x = x.strip()
Just add another for loop:
for patn in pat:
    if re.search(patn, i):
        i = re.sub(patn, '', i)
if i:
    new.append(i)
pat = [r'(?<![a-z][ ])Pacific(?![ ])', r'(?<![a-z][ ])Global(?![ ])']

if isinstance(x, list):
    new = []
    for i in x:
        for p in pat:
            i = re.sub(p, '', i)
        if len(i) > 0:
            new.append(i)
    x = new
else:
    x = x.strip()
Add another loop:
pat = [r'(?<![a-z][ ])Pacific(?![ ])', r'(?<![a-z][ ])Global(?![ ])']

if isinstance(x, list):
    new = []
    for i in x:
        # iterate through pat list
        for regx in pat:
            if re.search(regx, i):
                i = re.sub(regx, '', i)
        ...
If only the words change between your patterns, you can join the words with | to make an alternation. Your two patterns from the example then become a single one, like the one below.
r'(?<![a-z][ ])(?:Pacific|Global)(?![ ])'
If you need to add more words, just add them with a pipe, for example (?:word1|word2|word3).
Inside the parentheses, ?: means "do not capture the group".
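If you want to build that combined pattern from a plain list of words, so that adding a word stays a one-line change, something like this sketch works (re.escape guards against words that contain regex metacharacters; the word list is made up):
import re

words = ['Pacific', 'Global', 'Atlantic']   # extend this list as needed
pat = r'(?<![a-z][ ])(?:' + '|'.join(map(re.escape, words)) + r')(?![ ])'
print(pat)
# (?<![a-z][ ])(?:Pacific|Global|Atlantic)(?![ ])
The resulting string can then be passed to re.sub in the loop from the question exactly as before.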
something like this:
[word for word in l if not any(re.search(p, word) for p in pat)]
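Spelled out with made-up sample data (l is assumed to be the list of strings being filtered):
import re

pat = [r'(?<![a-z][ ])Pacific(?![ ])', r'(?<![a-z][ ])Global(?![ ])']
l = ['Pacific', 'Global', 'keep me']   # made-up sample list

print([word for word in l if not any(re.search(p, word) for p in pat)])
# ['keep me']
Note that this drops any entry that matches a pattern entirely, rather than substituting the match out of a longer string as the original code does.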
I will attempt a guess here; if I am wrong, please skip to the "this is how I'd write it" and modify the code that I provide, according to what you intend to do (which I may have failed to understand).
I am assuming you are trying to eliminate the words "Global" and "Pacific" in a list of phrases that may contain them.
If that is the case, I think your regular expression does not do what you specify. You probably intended to have something like the following (which does not work as-is!):
pat = [r'(?<=[a-z][ ])Pacific(?=[ ])', r'(?<=[a-z][ ])Global(?=[ ])']
The difference is in the look-ahead patterns, which are positive ((?=...) and (?<=...)) instead of negative ((?!...) and (?<!...)).
Furthermore, writing your regular expressions like this will not always correctly eliminate white space between your words.
This is how I'd write it:
words = ['Pacific', 'Global']
pat = "|".join(r'\b' + word + r'\b\s*' for word in words)

if isinstance(x, str):
    x = x.strip() # I don't understand why you don't sub here, anyway!
else:
    x = [s for s in (re.sub(pat, '', s) for s in x) if s != '']
In the regular expression for patterns, notice (a) \b, standing for "the empty string, but only at the beginning or end of a word" (see the manual), (b) the use of | for separating alternative patterns, and (c) \s, standing for "characters considered whitespace". The latter is what takes care of correctly removing unnecessary space after each eliminated word.
This works correctly in both Python 2 and Python 3. I think the code is much clearer and, in terms of efficiency, it's best if you leave re to do its work instead of testing each pattern separately.
Given:
x = ["from Global a to Pacific b",
"Global Pacific",
"Pacific Global",
"none",
"only Global and that's it"]
this produces:
x = ['from a to b', 'none', "only and that's it"]
I have a string s with nested brackets: s = "AX(p>q)&E((-p)Ur)"
I want to remove all characters between all pairs of brackets and store in a new string like this: new_string = AX&E
I tried doing this:
p = re.compile("\(.*?\)", re.DOTALL)
new_string = p.sub("", s)
It gives output: AX&EUr)
Is there any way to correct this, rather than iterating each element in the string?
Another simple option is removing the innermost parentheses at every stage, until there are no more parentheses:
p = re.compile("\([^()]*\)")
count = 1
while count:
    s, count = p.subn("", s)
Working example: http://ideone.com/WicDK
You can just use string manipulation without regular expression
>>> s = "AX(p>q)&E(qUr)"
>>> [ i.split("(")[0] for i in s.split(")") ]
['AX', '&E', '']
I leave it to you to join the strings up.
>>> import re
>>> s = "AX(p>q)&E(qUr)"
>>> re.compile("""\([^\)]*\)""").sub('', s)
'AX&E'
Yeah, it should be:
>>> import re
>>> s = "AX(p>q)&E(qUr)"
>>> p = re.compile("\(.*?\)", re.DOTALL)
>>> new_string = p.sub("", s)
>>> new_string
'AX&E'
Nested brackets (or tags, ...) are something that cannot be handled in a general way using regex. See http://www.amazon.de/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=sr_1_1?ie=UTF8&s=gateway&qid=1304230523&sr=8-1-spell for details on why. You would need a real parser.
It's possible to construct a regex which can handle two levels of nesting, but they are already ugly, three levels will already be quite long. And you don't want to think about four levels. ;-)
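If you would rather not loop with regex substitutions or pull in a parser library, a plain scan with a nesting counter also handles this (a sketch, assuming the brackets in the input are balanced):
def strip_parens(s):
    # keep only the characters that sit outside every '(' ... ')' pair
    out = []
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return ''.join(out)

print(strip_parens("AX(p>q)&E((-p)Ur)"))   # AX&E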
You can use PyParsing to parse the string:
from pyparsing import nestedExpr
import sys
s = "AX(p>q)&E((-p)Ur)"
expr = nestedExpr('(', ')')
result = expr.parseString('(' + s + ')').asList()[0]
s = ''.join(filter(lambda x: isinstance(x, str), result))
print(s)
Most code is from: How can a recursive regexp be implemented in python?
You could use re.subn():
import re

s = 'AX(p>q)&E((-p)Ur)'
while True:
    s, n = re.subn(r'\([^)(]*\)', '', s)
    if n == 0:
        break
print(s)
Output
AX&E
this is just how you do it:
# strings
# double and single quotes use in Python
"hey there! welcome to CIP"
'hey there! welcome to CIP'
"you'll understand python"
'i said, "python is awesome!"'
'i can\'t live without python'
# use of 'r' before string
print(r"\new code", "\n")
first = "code in"
last = "python"
first + last #concatenation
# slicing of strings
user = "code in python!"
print(user)
print(user[5]) # print an element
print(user[-3]) # print an element from rear end
print(user[2:6]) # slicing the string
print(user[:6])
print(user[2:])
print(len(user)) # length of the string
print(user.upper()) # convert to uppercase
print(user.lstrip())
print(user.rstrip())
print(max(user)) # max alphabet from user string
print(min(user)) # min alphabet from user string
print(user.join(['1', '2', '3', '4'])) # join needs an iterable of strings, not ints
input()
I'm creating a function to create all 26 combinations of words with a fixed suffix. The script works except for the JOIN in the second-to-last line.
def create_word(suffix):
    e=[]
    letters="abcdefghijklmnopqrstuvwxyz"
    t=list(letters)
    for i in t:
        e.append(i)
        e.append(suffix)
    ' '.join(e)
    print e
Currently, it is printing ['a', 'suffix', 'b', 'suffix', ...etc]. And I want it to print out as one long string: 'aSuffixbSuffixcSuffix...etc.' Why isn't the join working in this? How can I fix this?
In addition, how would I separate the characters once I have the string? For example to translate "take the last character of the suffix and add a space to it every time ('aSuffixbSuffixcSuffix' --> 'aSuffix bSuffix cSuffix')". Or, more generally, to replace the x-nth character, where x is any integer (e.g., to replace the 3rd, 6th, 9th, etc. character some something I choose).
str.join returns a new value; it does not transform the existing one. Here's one way to accomplish it.
result = ' '.join(e)
print result
But if you're feeling clever, you can streamline a lot of the setup.
import string
def create_word(suffix):
    return ' '.join(i + suffix for i in string.ascii_lowercase)
join doesn't change its arguments - it just returns a new string:
result = ' '.join(e)
return result
If you really want the output you specified (all of the results concatenated together):
>>> import string
>>> string.ascii_lowercase
'abcdefghijklmnopqrstuvwxyz'
>>> letters = string.ascii_lowercase
>>> suffix = 'Suffix'
>>> ''.join('%s%s' % (l, suffix) for l in letters)
'aSuffixbSuffixcSuffixdSuffixeSuffixfSuffixgSuffixhSuffixiSuffixjSuffixkSuffixlSuffixmSuffixnSuffixoSuffixpSuffixqSuffixrSuffixsSuffixtSuffixuSuffixvSuffixwSuffixxSuffixySuffixzSuffix'
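As for the follow-up in the question (turning 'aSuffixbSuffix...' into 'aSuffix bSuffix ...'), since the suffix is known, one simple option is a plain replace (just a sketch):
joined = 'aSuffixbSuffixcSuffix'
print(joined.replace('Suffix', 'Suffix ').rstrip())
# aSuffix bSuffix cSuffix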
Besides the problem already mentioned by rekursive, you should have a look at list comprehension:
def create_word(suffix):
    return ''.join(
        [i+suffix for i in "abcdefghijklmnopqrstuvwxyz"]
    )
print create_word('suffix')