Python regex '\s' vs '\\s' - python

I have simple expression \s and \\s. Both expression matches This is Sparta!!.
>>> re.findall('\\s',"This is Sparta")
[' ', ' ']
>>> re.findall('\s',"This is Sparta")
[' ', ' ']
I am confused here. \ is used to escape special character and \s represents white space but, how both are acting here?

Don't confuse python-level string-escaping and regex-level string-escaping. Since s is not an escapable character at python-level, the interpreter understand a string like "\s" as the two characters "\" and "s". Replace "s" with "n" (for example), and it understands it as the newline character.
'\s' == '\\s'
True
'\n' == '\\n'
False

\ only escapes the following character if the escaped character is valid
>>> len('\s')
2
>>> len('\n')
1
compare with
>>> len('\\s')
2
>>> len('\\n')
2

Related

Python re.sub strip leading/trailing whitespace within quotes

I want to use re.sub to remove leading and trailing whitespace from single-quoted strings embedded in a larger string. If I have, say,
textin = " foo ' bar nox ': glop ,' frox ' "
I want to produce
desired = " foo 'bar nox': glop ,'frox' "
Removing the leading whitespace is relatively straightforward.
>>> lstripped = re.sub(r"'\s*([^']*')", r"'\1", textin)
>>> lstripped
" foo 'bar nox ': glop ,'frox ' "
The problem is removing the trailing whitespace. I tried, for example,
>>> rstripped = re.sub(r"('[^']*)(\s*')", r"\1'", lstripped)
>>> rstripped
" foo 'bar nox ': glop ,'frox ' "
but that fails because the [^']* matches the trailing whitespace.
I thought about using lookback patterns, but the Re doc says they can only contain fixed-length patterns.
I'm sure this is a previously solved problem but I'm stumped.
Thanks!
EDIT: The solution needs to handle strings containing a single non-whitespace character and empty strings, i.e. ' p ' --> 'p' and ' ' --> ''.
[^\']* - is greedy, i.e. it includes also spaces and/or tabs, so let's use non-greedy one: [^\']*?
In [66]: re.sub(r'\'\s*([^\']*?)\s*\'','\'\\1\'', textin)
Out[66]: " foo 'bar nox': glop ,'frox' "
Less escaped version:
re.sub(r"'\s*([^']*?)\s*'", r"'\1'", textin)
The way to catch the whitespaces is by defining the previous
* as non-greedy, instead of r"('[^']*)(\s*')" use r"('[^']*?)(\s*')".
You can also catch both sides with a single regex:
stripped = re.sub("'\s*([^']*?)\s*'", r"'\1'", textin)
This seems to work:
'(\s*)(.*?)(\s*)'
' # an apostrophe
(\s*) # 0 or more white-space characters (leading white-space)
(.*?) # 0 or more any character, lazily matched (keep)
(\s*) # 0 or more white-space characters (trailing white-space)
' # an apostrophe
Demo

Padding ascii characters with spaces in a mix unicode-ascii string

Given a mixed string of unicode and ascii chars, e.g.:
它看灵魂塑Nike造得和学问同等重要。
The goal is to pad the ascii substrings with spaces, i.e.:
它看灵魂塑 Nike 造得和学问同等重要。
I've tried using the ([^[:ascii:]]) regex, it looks fine in matching the substrings, e.g. https://regex101.com/r/FVHhU1/1
But in code, the substitution with ' \1 ' is not achieving the desired output.
>>> import re
>>> patt = re.compile('([^[:ascii:]])')
>>> s = u'它看灵魂塑Nike造得和学问同等重要。'
>>> print (patt.sub(' \1 ', s))
它看灵魂塑Nike造得和学问同等重要。
How to pad ascii characters with spaces in a mix unicode-ascii string?
The pattern should be:
([\x00-\x7f]+)
So you can use:
patt = re.compile('([\x00-\x7f]+)')
patt.sub(r' \1 ',s)
This generates:
>>> print(patt.sub(r' \1 ',s))
它看灵魂塑 Nike 造得和学问同等重要。
ASCII is defined as a range of characters with hex codes between 00 and 7f. So we define such a range as [\x00-\x7f], use + to denote one or more, and replace the matching group with r' \1 ' to add two spaces.

Match literal string '\$'

I'm trying to match literal string '\$'. I'm escaping both '\' and '$' by backslash. Why isn't working when I escape the backslash in the pattern? But if I use a dot then it works.
import re
print re.match('\$','\$')
print re.match('\\\$','\$')
print re.match('.\$','\$')
Output:
None
None
<_sre.SRE_Match object at 0x7fb89cef7b90>
Can someone explain what's happening internally?
You should use the re.escape() function for this:
escape(string)
Return string with all non-alphanumerics backslashed; this is useful
if you want to match an arbitrary literal string that may have regular
expression metacharacters in it.
For example:
import re
val = re.escape('\$') # val = '\\\$'
print re.match(val,'\$')
It outputs:
<_sre.SRE_Match object; span=(0, 2), match='\\$'>
This is equivalent to what #TigerhawkT3 mentioned in his answer.
Unfortunately, you need more backslashes. You need to escape them to indicate that they're literals in the string and get them into the expression, and then further escape them to indicate that they're literals instead of regex special characters. This is why raw strings are often used for regular expressions: the backslashes don't explode.
>>> import re
>>> print re.match('\$','\$')
None
>>> print re.match('\\\$','\$')
None
>>> print re.match('.\$','\$')
<_sre.SRE_Match object at 0x01E1F800>
>>> print re.match('\\\\\$','\$')
<_sre.SRE_Match object at 0x01E1F800>
>>> print re.match(r'\\\$','\$')
<_sre.SRE_Match object at 0x01E1F800>
r'string'
is the raw string
try annotating your regex string
here are the same re's with and without raw annotation
print( re.match(r'\\\$', '\$'))
<_sre.SRE_Match object; span=(0, 2), match='\\$'>
print( re.match('\\\$', '\$'))
None
this is python3 on account of because
In a (non-raw) string literal, backslash is special. It means the Python interpreter should handle following character specially. For example "\n" is a string of length 1 containing the newline character. "\$" is a string of a single character, the dollar sign. "\\$" is a string of two characters: a backslash and a dollar sign.
In regular expressions, the backslash also means the following character is to be handled specially, but in general the special meaning is different. In a regular expression, $ matches the end of a line, and \$ matches a dollar sign, \\ matches a single backslash, and \\$ matches a backslash at the end of a line.
So, when you do re.match('\$',s) the Python interpreter reads '\$' to construct a string object $ (i.e., length 1) then passes that string object to re.match. With re.match('\\$',s) Python makes a string object \$ (length 2) and passes that string object to re.match.
To see what's actually being passed to re.match, just print it. For example:
pat = '\\$'
print "pat :" + pat + ":"
m = re.match(pat, s)
People usually use raw string literals to avoid the double-meaning of backslashes.
pat = r'\$' # same 2-character string as above
Thanks for the above answers. I am adding this answer because we don't have a short summary in the above answers.
The backslash \ needs to be escaped both in python string and regex engine.
Python string will translate 2 \\ to 1 \. And regex engine will require 2 \\ to match 1 \
So to provide the regex engine with 2 \\ in order to match 1 \ we will have to use 4 \\\\ in python string.
\\\\ --> Python(string translation) ---> \\ ---> Regex Engine(translation) ---> \
You have to use . as . matches any characters except newline.

Python re.sub: ignore backreferences in the replacement string

I want to replace a pattern with a string. The string is given in a variable. It might, of course, contains '\1', and it should not be interpreted as a backreference - but simply as \1.
How can I achieve that?
The previous answer using re.escape() would escape too much, and you would get undesirable backslashes in the replacement and the replaced string.
It seems like in Python only the backslash needs escaping in the replacement string, thus something like this could be sufficient:
replacement = replacement.replace("\\", "\\\\")
Example:
import re
x = r'hai! \1 <ops> $1 \' \x \\'
print "want to see: "
print x
print "getting: "
print re.sub(".(.).", x, "###")
print "over escaped: "
print re.sub(".(.).", re.escape(x), "###")
print "could work: "
print re.sub(".(.).", x.replace("\\", "\\\\"), "###")
Output:
want to see:
hai! \1 <ops> $1 \' \x \\
getting:
hai! # <ops> $1 \' \x \
over escaped:
hai\!\ \1\ \<ops\>\ \$1\ \\'\ \x\ \\
could work:
hai! \1 <ops> $1 \' \x \\
Due to comments I thought quite a while about this and tried it out. Helped me a lot to increase my understanding about escaping, so I changed my answer nearly completely that it could be useful for later readers.
NullUserException gave you just the short version, I try to explain it a bit more. And thanks to the critical reviews of Qtax and Duncan, this answer is hopefully now correct and helpful.
The backslash has a special meaning, its the escape character in strings, that means the backslash and the following character form an escape sequence that is translated to something else when something is done with the string. This "something is done" is already the creation of the string. So if you want to use \ literally you need to escape it. This escape character is the backslash itself.
So as start some examples for a better understanding what happens. I print additionally the ASCII codes of the characters in the string to hopefully increase the understandability of what happens.
s = "A\1\nB"
print s
print [x for x in s]
print [hex(ord(x)) for x in s]
is printing
A
B
['A', '\x01', '\n', 'B']
['0x41', '0x1', '0xa', '0x42']
So while I typed \ and 1 in the code, s does not contain those two characters, it contains the ASCII character 0x01 which is "Start of heading". Same for the \n, it translated to 0x0a the Linefeed character.
Since this behaviour is not always wanted, raw strings can be used, where the escape sequences are ignored.
s = r"A\1\nB"
print s
print [x for x in s]
print [hex(ord(x)) for x in s]
I just added the r before the string and the result is now
A\1\nB
['A', '\\', '1', '\\', 'n', 'B']
['0x41', '0x5c', '0x31', '0x5c', '0x6e', '0x42']
All characters are printed as I typed them.
This is the situation we have. Now there is the next thing.
There can be the situation that a string should be passed to a regex to be found literally, so every character that has a special meaning within a regex (e.g. +*$[.) needs to escaped, therefore there is a special function re.escape that does this job.
But for this question this is the wrong function, because the string should not be used within a regex, but as the replacement string for re.sub.
So new situation:
A raw string including escape sequences should be used as replacement string for re.sub. re.sub will also handle the escape sequences, but with a small, but important, difference to the handling before: \n is still translated to 0x0a the Linefeed character, but the transition of \1 has changed now! It will be replaced by the content of the capturing group 1 of the regex in re.sub.
s = r"A\1\nB"
print re.sub(r"(Replace)" ,s , "1 Replace 2")
And the result is
1 AReplace
B 2
The \1 has been replaced with the content of the capturing group and \n with the LineFeed character.
The important point is, you have to understand this behaviour and now you have two possiblities to my opinion (and I am not going to judge which one is the correct one)
The creator is unsure about the string behaviour and if he inputs \n then he wants a newline. In this case use this to just escape the \ that are followed by a digit.
OnlyDigits = re.sub(r"(Replace)" ,re.sub(r"(\\)(?=\d)", r"\\\\", s) , "1 Replace 2")
print OnlyDigits
print [x for x in OnlyDigits]
print [hex(ord(x)) for x in OnlyDigits
Output:
1 A\1
B 2
['1', ' ', 'A', '\\', '1', '\n', 'B', ' ', '2']
['0x31', '0x20', '0x41', '0x5c', '0x31', '0xa', '0x42', '0x20', '0x32']
The creator nows exactly what he is doing and if he would have wanted a newline, he would have typed \0xa. In this case escape all
All = re.sub(r"(Replace)" ,re.sub(r"(\\)", r"\\\\", s) , "1 Replace 2")
print All
print [x for x in All]
print [hex(ord(x)) for x in All]
Output:
1 A\1\nB 2
['1', ' ', 'A', '\\', '1', '\\', 'n', 'B', ' ', '2']
['0x31', '0x20', '0x41', '0x5c', '0x31', '0x5c', '0x6e', '0x42', '0x20', '0x32']

Regex to remove all punctuation and anything enclosed by brackets

I'm trying to remove all punctuation and anything inside brackets or parentheses from a string in python. The idea is to somewhat normalize song names to get better results when I query the MusicBrainz WebService.
Sample input: T.N.T. (live) [nyc]
Expected output: T N T
I can do it in two regexes, but I would like to see if it can be done in just one. I tried the following, which didn't work...
>>> re.sub(r'\[.*?\]|\(.*?\)|\W+', ' ', 'T.N.T. (live) [nyc]')
'T N T live nyc '
If I split the \W+ into it's own regex and run it second, I get the expected result, so it seems that \W+ is eating the braces and parens before the first two options can deal with them.
You are correct that the \W+ is eating the braces, remove the + and you should be set:
>>> re.sub(r'\[.*?\]|\(.*?\)|\W', ' ', 'T.N.T. (live) [nyc]')
'T N T '
Here's a mini-parser that does the same thing I wrote as an exercise. If your effort to normalize gets much more complex, you may start to look at parser-based solutions. This works like a tiny parser.
# Remove all non-word chars and anything between parens or brackets
def consume(I):
I = iter(I)
lookbehind = None
def killuntil(returnchar):
while True:
ch = I.next()
if ch == returnchar:
return
for i in I:
if i in 'abcdefghijklmnopqrstuvwyzABCDEFGHIJKLMNOPQRSTUVWXYZ':
yield i
lookbehind = i
elif not i.strip() and lookbehind != ' ':
yield ' '
lookbehind = ' '
elif i == '(':
killuntil(')')
elif i == '[':
killuntil(']')
elif lookbehind != ' ':
lookbehind = ' '
yield ' '
s = "T.N.T. (live) [nyc]"
c = consume(s)
The \W+ eats the brackets, because it "has a run": It starts matching at the dot after the second T, and matches on until and including the first parenthesis: . (. After that, it starts matching again from bracket to bracket: ) [.
\W
When the LOCALE and UNICODE flags are not specified, matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_].
So try r'\[.*?\]|\(.*?\)|{.*?}|[^a-zA-Z0-9_()[\]{}]+'.
Andrew's solution is probably better, though.

Categories