Python regular expression truncate string by special character with one leading space - python

I need to truncate a string at the first of the special characters '-', '(' or '/' when it is preceded by one whitespace character, i.e. at ' -', ' (' or ' /'.
How can I do that?
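import re

def truncate(row):  # wrapper name is illustrative; the original snippet was a function body
    patterns = r'[-/()]'
    try:
        # split at the first special character found and keep the left part
        return row.split(re.findall(patterns, row)[0], 1)[0]
    except IndexError:  # no special character in row
        return row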
The code above picks up all the special characters, but without requiring the leading space.
patterns=r'[s-/()]'
This one does not work.

Try this pattern:
patterns=r'^\s[-/()]'
The ^ anchors the match to the start of the string; remove it if the space and special character can appear anywhere.

It looks like you want to get a part of the string before the first occurrence of \s[-(/] pattern.
Use
return re.sub(r'\s[-(/].*', '', row)
This code returns the part of row before the first occurrence of a whitespace character (\s) followed by -, ( or / ([-(/]); everything from that whitespace to the end of the line (.*) is removed.
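A minimal sketch of the re.sub approach above, wrapped in a helper so it can be tried on a few inputs. The function name truncate_at_marker and the sample strings are made up for illustration; they are not from the question.

```python
import re

def truncate_at_marker(row):
    # Remove everything from the first whitespace + '-', '(' or '/' onward.
    # '.' does not match a newline by default, so only the rest of that line is removed.
    return re.sub(r'\s[-(/].*', '', row)

print(truncate_at_marker('Acme Corp - West'))   # Acme Corp
print(truncate_at_marker('foo-bar (baz)'))      # foo-bar
print(truncate_at_marker('a/b /c'))             # a/b
print(truncate_at_marker('no markers here'))    # no markers here
```

A special character without a leading space (as in 'foo-bar') is left alone, which is the behaviour the question asks for.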

Please try this pattern patterns = r'\s+-|\s\/|\s\(|\s\)'

Related

Python Regex - grab a block of whitespace from end of string

I am trying to write a regex that grabs blocks of whitespace from either side of a string. I can get the beginning, but I can't seem to grab the end block.
s = ' This is a string with whitespace on either side '
strip_regex = re.compile(r'(\s+)(.*)(something to grab end block)')
mo = strip_regex.findall(s)
What I get as an output is this:
[(' ', 'This is a string with whitespace on either side ')]
I have played around with what to put at the end, and the best I can get is one whitespace character, but I can never just grab the string up to the end of 'side'. I don't want to use the characters in 'side' because I want the regex to work with any string surrounded by whitespace. I am pretty sure that it's because I am using (.*), which grabs everything after the first whitespace block, but I can't figure out how to make it stop before the end whitespace block.
Thanks for any help :)
If what you want to do is strip whitespace, you could use strip() instead.
See: https://www.journaldev.com/23625/python-trim-string-rstrip-lstrip-strip
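If the goal is only to remove or inspect the surrounding whitespace, here is a small sketch using the built-in string methods, with no regex involved. The sample string and the variable names leading and trailing are illustrative.

```python
s = '  This is a string with whitespace on either side  '

stripped = s.strip()          # both ends removed
left_only = s.lstrip()        # leading whitespace removed
right_only = s.rstrip()       # trailing whitespace removed

# The whitespace blocks themselves can be recovered by slicing.
leading = s[:len(s) - len(left_only)]
trailing = s[len(right_only):]

print(repr(stripped))
print(repr(leading), repr(trailing))
```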
As for your regex, if you want both the start and end whitespace, I suggest matching the whole line, with the middle part non-greedy, like so:
import re
s = ' This is a string with whitespace on either side '
strip_regex = re.compile(r'^(\s+)(.*?)(\s+)$')
mo = strip_regex.findall(s)
Result:
[(' ', 'This is a string with whitespace on either side', ' ')]
More about greedy: How can I write a regex which matches non greedy?
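To make the greedy versus non-greedy difference concrete, here is a short sketch that runs both variants on the same input. The three-space padding is chosen for illustration so the difference is visible.

```python
import re

s = '   This is a string with whitespace on either side   '

# Greedy: (.*) swallows as much as it can and gives back only one
# character so that the final (\s+) can still match.
greedy = re.match(r'^(\s+)(.*)(\s+)$', s).groups()

# Non-greedy: (.*?) stops as early as possible, so the final (\s+)
# captures the whole trailing whitespace block.
lazy = re.match(r'^(\s+)(.*?)(\s+)$', s).groups()

print(greedy)
print(lazy)
```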

Remove only consecutive special characters but keep consecutive [a-zA-Z0-9] and single characters

How can I remove multiple consecutive occurrences of all the special characters in a string?
I can get the code like:
re.sub(r'\.\.+', ' ', string)
re.sub(r'##+', ' ', string)
re.sub(r'\s\s+', ' ', string)
for individual characters and, at best, use a loop over all the characters in a list, like:
import re
from string import punctuation
for i in punctuation:
    # escaped character followed by one or more repeats of it
    to = ('\\' + i + '\\' + i + '+')
    string = re.sub(to, ' ', string)
but I'm sure there is a more efficient method.
I tried:
re.sub('[^a-zA-Z0-9][^a-zA-Z0-9]+', ' ', '\n\n.AAA.x.##+*##=..xx000..x..\t.x..\nx*+Y.')
but it removes all the special characters except one preceded by letters.
The string can contain different consecutive special characters, like 99#aaaa*!##$., but not repeats of the same character, like ++--....
A pattern to match all non-alphanumeric characters in Python is [\W_].
So, all you need is to wrap the pattern with a capturing group and add \1+ after it to match 2 or more consecutive occurrences of the same non-alphanumeric characters:
text = re.sub(r'([\W_])\1+',' ',text)
In Python 3.x, if you wish to make the pattern ASCII aware only, use the re.A or re.ASCII flag:
text = re.sub(r'([\W_])\1+',' ',text, flags=re.A)
Mind the use of the r prefix that defines a raw string literal (so that you do not have to escape \ char).
A runnable example:
import re
text = "\n\n.AAA.x.##+*##=..xx000..x..\t.x..\nx*+Y."
print(re.sub(r'([\W_])\1+',' ',text))
Output:
 .AAA.x. +* = xx000 x 	.x 
x*+Y.
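The backreference approach can be wrapped in a small helper to check it against a few inputs. This is only a sketch; the name collapse_repeats and the ascii_only parameter are made up for illustration.

```python
import re

def collapse_repeats(text, ascii_only=False):
    # ([\W_]) captures one non-alphanumeric character;
    # \1+ requires the SAME character to repeat at least once more.
    flags = re.ASCII if ascii_only else 0
    return re.sub(r'([\W_])\1+', ' ', text, flags=flags)

print(collapse_repeats('a..b##c+*d__e'))   # a b c+*d e
print(collapse_repeats('99#aaaa*!##$.'))   # 99#aaaa*! $.
```

Runs of different special characters such as +* are kept, and repeated letters or digits are untouched because they are not matched by [\W_].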

Regex breaking with preceding characters

I am trying to parse a phone number from a group of strings by compiling this regex:
exp = re.compile(r'(\+\d|)(([^0-9\s]|)\d\d\d([^0-9\s]|)([^0-9\s]|)\d+([^0-9\s]|)\d+)')
This successfully matches with a line like "+1(123)-456-7890". However, if I add anything in front of it, like "P: +1(123)-456-7890" it does not match. I tested on Regex websites but can't figure this out at all.
You might consider using re.search (which scans) instead of re.match, which only looks at the beginning of the string. You could instead add a .* to the start.
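A short sketch of the difference, using the pattern and sample line from the question. The variable names line, anchored and scanned are illustrative.

```python
import re

exp = re.compile(r'(\+\d|)(([^0-9\s]|)\d\d\d([^0-9\s]|)([^0-9\s]|)\d+([^0-9\s]|)\d+)')
line = 'P: +1(123)-456-7890'

anchored = exp.match(line)    # only tries at position 0, so the "P: " prefix blocks it
scanned = exp.search(line)    # tries every position until a match is found

print(anchored)               # None
print(scanned.group(0))       # +1(123)-456-7890
```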
With findall, your regex returns the following result:
[('+1', '(123)-456-7890', '(', ')', '-', '-')]
If the format is fixed, you can use something like
phone = re.compile(r"\+\d\(\d+\)-\d+-\d+")
\d - matches digit.
+ - one or more occurrences.
\+ - for matching "+"
\( - for matching "("
text = "P: +1(123)-456-7890"
phone.findall(text)
Output :
['+1(123)-456-7890']

How can I remove all non-alphanumeric characters from a string, except for '#', with regex?

I currently have this line address = re.sub('[^A-Za-z0-9]+', ' ', address).lstrip() which will remove all special characters from my string address. How can I modify this line to keep #?
In order to avoid removing the hash symbol, you need to add it into the negated character class:
r'[^A-Za-z0-9#]+'
             ^
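A quick sketch of the modified line in use. The helper name clean_address and the sample address are hypothetical, added only to show the effect.

```python
import re

def clean_address(address):
    # '#' is listed inside the negated class, so it is no longer replaced.
    return re.sub(r'[^A-Za-z0-9#]+', ' ', address).lstrip()

print(repr(clean_address('  Apt #4, 12 Main St.')))   # 'Apt #4 12 Main St '
```

Note that lstrip() only trims the left side, so a trailing replacement space can remain; strip() would remove both.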

Python Regex - Match a character without consuming it

I would like to convert the following string
"For "The" Win","Way "To" Go"
to
"For ""The"" Win","Way ""To"" Go"
The straightforward regex would be
str2 = re.sub(r'(?<!,|^)"(?=\w)|(?<=\w)"(?!,|$)', '""', str1,flags=re.MULTILINE)
i.e., Double the quotes that are
Followed by a letter but not preceded by a comma or the beginning of line
Preceded by a letter but not followed by a comma or the end of line
The problem is that I am using Python, and its regex engine does not allow alternatives of different widths (here , and ^) inside a lookbehind. I get the error
sre_constants.error: look-behind requires fixed-width pattern
What I am looking for is a regex that will replace the '"' around 'The' and 'To' with '""'.
I can use the following regex (An answer provided to another question)
\b\s*"(?!,|[ \t]*$)
but that consumes the space just before the 'The' and 'To' and I get the below
"For""The"" Win","Way""To"" Go"
Is there a workaround so that I can double the quotes around 'The' and 'To' without consuming the spaces just before them?
Instead of saying not preceded by comma or the line start, say preceded by a non-comma character:
r'(?<=[^,])"(?=\w)|(?<=\w)"(?!,|$)'
Looks to me like you don't need to bother with anchors.
If there is a character before the quote, you know it's not at the beginning of the string.
If that character is not a newline, you're not at the beginning of a line.
If the character is not a comma, you're not at the beginning of a field.
So you don't need to use anchors, just do a positive lookbehind/lookahead for a single character:
result = re.sub(r'(?<=[^",\r\n])"(?=[^,"\r\n])', '""', subject)
I threw in the " on the chance that there might be some quotes that are already escaped. But realistically, if that's the case you're probably screwed anyway. ;)
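Both single-character lookbehind suggestions can be checked against the sample line from the question. This is a sketch; the variable names expected, with_end_check and no_anchors are illustrative.

```python
import re

str1 = '"For "The" Win","Way "To" Go"'
expected = '"For ""The"" Win","Way ""To"" Go"'

# Positive lookbehind for any non-comma character, end-of-line check kept.
with_end_check = re.sub(r'(?<=[^,])"(?=\w)|(?<=\w)"(?!,|$)', '""', str1,
                        flags=re.MULTILINE)

# No anchors at all: one required character on each side of the quote.
no_anchors = re.sub(r'(?<=[^",\r\n])"(?=[^,"\r\n])', '""', str1)

print(with_end_check)
print(no_anchors)
```

Because a positive lookbehind needs an actual character before the quote, the quote at the very start of the string can never match, which is why the ^ alternative is not needed.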
re.sub(r'\b(\s*)"(?!,|[ \t]*$)', r'\1""', s)
Most direct workaround whenever you encounter this issue: explode the look-behind into two look-behinds.
str2 = re.sub(r'(?<!,)(?<!^)"(?=\w)|(?<=\w)"(?!,|$)', '""', str1,flags=re.MULTILINE)
(don't name your strings str)
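A sketch of the split look-behind applied to the sample line from the question. Each negative lookbehind has a fixed width on its own (one character for the comma, zero for ^), so the pattern compiles. The variable name expected is illustrative.

```python
import re

str1 = '"For "The" Win","Way "To" Go"'
expected = '"For ""The"" Win","Way ""To"" Go"'

# (?<!,)(?<!^) means: not preceded by a comma AND not at the start of a line.
str2 = re.sub(r'(?<!,)(?<!^)"(?=\w)|(?<=\w)"(?!,|$)', '""', str1,
              flags=re.MULTILINE)

print(str2)
```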
str2 = re.sub(r'(?<=[^,])"(?=\w)'
              r'|'
              r'(?<=\w)"(?!,|$)',
              '""', ss,
              flags=re.MULTILINE)
Raw strings are not strictly required for this pattern to work, but \w in a normal string literal is an invalid escape sequence that recent Python versions warn about, so the r prefix is used here.
Note that I changed your str, which is the name of a built-in class, to ss.
For "fun":
str2 = re.sub(r'"'
              r'('
              r'(?<=[^,]")(?=\w)'
              r'|'
              r'(?<=\w")(?!,|$)'
              r')',
              '""', ss,
              flags=re.MULTILINE)
or also
str2 = re.sub(r'(?<=[^,]")(?=\w)'
              r'|'
              r'(?<=\w")(?!,|$)',
              '"', ss,
              flags=re.MULTILINE)
