Confused about the backslash in Python - python

I understand that to match a literal backslash, it must be escaped in the regular expression. With raw string notation, this means r"\\". Without raw string notation, one must use "\\\\".
When I saw the code string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string), I was wondering the meaning of a backslash in \' and \`, since it also works well as ' and `, like string = re.sub(r"[^A-Za-z0-9(),!?'`]", " ", string). Is there any need to add the backslash here?
I tried some examples in Python:
str1 = "\'s"
print(str1)
str2 = "'s"
print(str2)
The result is the same in both cases: 's. I think this might be why the earlier code uses \'\` in string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string). Is there any difference between "\'s" and "'s"?
string = 'adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .'
re.match(r"\\", string)
re.match returns nothing, which suggests there is no backslash in the string. However, I do see backslashes in it. Is the backslash in \' actually not a backslash?

In Python, \' inside a normal (non-raw) string literal is an escape sequence. The backslash is there because the character that follows can have another meaning to the parser than the one it has on screen (for example, a single quote ends a string that was opened with a single quote). The Python language reference lists all string escape sequences. No backslashes were found in that string because each \' is an escaped single quote, so the resulting string contains only the quote. The escape is not always necessary, but it is still valid syntax because it is sometimes needed.
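The two questions above can be checked directly. This is a minimal sketch, assuming Python 3; the sample text and variable names are made up for illustration. It compares "\'s" with "'s", and runs the character class with and without the backslashes.

```python
import re

# \' inside a normal string literal is an escaped quote: no backslash is stored.
escaped = "\'s"
plain = "'s"
quotes_equal = (escaped == plain)  # both are the 2-character string 's

# In the raw-string pattern the backslashes do reach the regex engine, which
# treats an escaped punctuation character as that literal character.
text = "peter jackson's `vision` of middle-earth #1"
with_backslashes = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", text)
without_backslashes = re.sub(r"[^A-Za-z0-9(),!?'`]", " ", text)
same_result = (with_backslashes == without_backslashes)

print(quotes_equal, len(escaped), same_result)
```

So the backslashes in that character class are harmless but not required.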

Check out https://docs.python.org/2.0/ref/strings.html for a better explanation.
The problem with your second example is that string isn't a raw string, so the \' is interpreted as '. Compare the non-raw and the raw version of the same literal:
>>> not_raw = 'adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .'
>>> res1 = re.search(r'\\',not_raw)
>>> type(res1)
<type 'NoneType'>
>>> raw = r'adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .'
>>> res2 = re.search(r'\\',raw)
>>> type(res2)
<type '_sre.SRE_Match'>
For an explanation of re.match vs re.search: What is the difference between Python's re.search and re.match?

Related

Regarding the regex in search module with and without raw text

I am doing the following in python2.7
>>> a='hello team 123'
>>> b=re.search('hello team [0-9]+',a)
>>>
>>> b
<_sre.SRE_Match object at 0x00000000022995E0>
>>> b=re.search(r'hello team [0-9]+',a)
>>> b
<_sre.SRE_Match object at 0x0000000002299578>
>>>
Now as you see, in one case I am using a raw string while in the other I am not.
From one of the posts on SO, I learnt:
The r means that the string is to be treated as a raw string, which means all escape codes will be ignored.
For an example:
'\n' will be treated as a newline character, while r'\n' will be treated as the characters \ followed by n
Then, why is my example working for both cases i.e with r and without r?
Is it because none of my example uses \ ?
Also please look at the attached screenshot
You are not using any special characters in your string, so r'' and '' will do the same thing.
In hello team [0-9]+ nothing needs to be escaped. It will be passed to the regex engine as it is. If you use special characters in your Python string then you need to escape them to pass them to the regex engine.
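A short check of that claim, as a sketch assuming Python 3 (the variable names are illustrative only). Without a backslash the raw and non-raw literals are the same string. With '\n' they differ, yet both still match a newline, because the regex engine understands \n on its own.

```python
import re

# No backslash in the literal: the r prefix changes nothing.
same_literal = ('hello team [0-9]+' == r'hello team [0-9]+')

# With a backslash the two literals are different strings...
newline_len = len('\n')   # one character: a real newline
raw_len = len(r'\n')      # two characters: backslash and n

# ...but both patterns match a newline, since the regex engine
# also interprets the two-character sequence \n as a newline.
both_match = bool(re.search('\n', 'a\nb')) and bool(re.search(r'\n', 'a\nb'))

print(same_literal, newline_len, raw_len, both_match)
```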
There are two levels of escaping involved in regex. First level is Python string and second level regex engine.
So for example:
'\\\\' --> Python(string translation) ---> '\\' ---> Regex Engine(translation) ---> '\'
In order to avoid Python string translation you use raw strings.
r'\\' --> Python(string translation) ---> '\\' ---> Regex Engine(translation) ---> '\'
>>> print repr('\\')
'\\'
>>> print repr(r'\\')
'\\\\'
>>> print str('\\')
\
>>> print str(r'\\')
\\
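The two translation levels can also be verified with len and re.fullmatch. A minimal sketch, assuming Python 3; the names are chosen only for this example.

```python
import re

four_in_source = '\\\\'   # four backslashes typed, two stored
raw_two = r'\\'           # two backslashes typed, two stored
same_pattern = (four_in_source == raw_two)

# The regex engine turns the two stored backslashes into
# "match one literal backslash".
single_backslash = '\\'   # a string of length 1
match = re.fullmatch(raw_two, single_backslash)

print(same_pattern, len(raw_two), len(single_backslash), match is not None)
```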

Match literal string '\$'

I'm trying to match literal string '\$'. I'm escaping both '\' and '$' by backslash. Why isn't working when I escape the backslash in the pattern? But if I use a dot then it works.
import re
print re.match('\$','\$')
print re.match('\\\$','\$')
print re.match('.\$','\$')
Output:
None
None
<_sre.SRE_Match object at 0x7fb89cef7b90>
Can someone explain what's happening internally?
You should use the re.escape() function for this:
escape(string)
Return string with all non-alphanumerics backslashed; this is useful
if you want to match an arbitrary literal string that may have regular
expression metacharacters in it.
For example:
import re
val = re.escape('\$') # val is the 4-character string \\\$
print re.match(val,'\$')
It outputs:
<_sre.SRE_Match object; span=(0, 2), match='\\$'>
This is equivalent to what @TigerhawkT3 mentioned in his answer.
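For reference, here is the same idea as a small Python 3 sketch. The target is written as '\\$' so that the literal contains no invalid escape; it is the same two-character string as '\$'.

```python
import re

target = '\\$'                 # two characters: backslash, dollar sign
pattern = re.escape(target)    # escapes both the backslash and the dollar sign
match = re.match(pattern, target)

print(pattern)                 # the stored pattern text
print(len(pattern), match is not None)
```

The stored pattern is four characters long (three backslashes and a dollar sign), which is the same string as the raw literal r'\\\$'.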
Unfortunately, you need more backslashes. You need to escape them to indicate that they're literals in the string and get them into the expression, and then further escape them to indicate that they're literals instead of regex special characters. This is why raw strings are often used for regular expressions: the backslashes don't explode.
>>> import re
>>> print re.match('\$','\$')
None
>>> print re.match('\\\$','\$')
None
>>> print re.match('.\$','\$')
<_sre.SRE_Match object at 0x01E1F800>
>>> print re.match('\\\\\$','\$')
<_sre.SRE_Match object at 0x01E1F800>
>>> print re.match(r'\\\$','\$')
<_sre.SRE_Match object at 0x01E1F800>
r'string' is a raw string literal. Try writing your regex as a raw string.
Here is the same regex with and without the raw prefix:
print( re.match(r'\\\$', '\$'))
<_sre.SRE_Match object; span=(0, 2), match='\\$'>
print( re.match('\\\$', '\$'))
None
(This output is from Python 3.)
In a (non-raw) string literal, backslash is special. It means the Python interpreter should handle the following character specially. For example "\n" is a string of length 1 containing the newline character. "\\$" is a string of two characters: a backslash and a dollar sign. "\$" is not a recognized escape sequence, so Python leaves the backslash in the string and the result is that same two-character string (newer Python versions warn about such invalid escapes).
In regular expressions, the backslash also means the following character is to be handled specially, but in general the special meaning is different. In a regular expression, $ matches the end of a line, and \$ matches a dollar sign, \\ matches a single backslash, and \\$ matches a backslash at the end of a line.
So, when you do re.match('\$',s) the Python interpreter reads '\$', finds no recognized escape, and constructs the string object \$ (length 2), then passes that string object to re.match, where it means a literal dollar sign. With re.match('\\$',s) Python makes the same string object \$ (length 2) and passes that string object to re.match.
To see what's actually being passed to re.match, just print it. For example:
pat = '\\$'
print "pat :" + pat + ":"
m = re.match(pat, s)
People usually use raw string literals to avoid the double-meaning of backslashes.
pat = r'\$' # same 2-character string as above
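To see which stored pattern matches the two-character target \$, the sketch below (assuming Python 3) writes every pattern as a raw string, so what you read is exactly what the regex engine receives. The comments map each one back to the non-raw literals from the question.

```python
import re

target = r'\$'   # two characters: backslash, dollar sign

patterns = [
    r'\$',     # what '\$' stores: regex for a literal dollar sign
    r'\\$',    # what '\\\$' stores: a backslash, then end of string
    r'.\$',    # any character, then a literal dollar sign
    r'\\\$',   # what '\\\\\$' stores: a backslash, then a literal dollar sign
]

results = {p: re.match(p, target) is not None for p in patterns}
for pattern_text, matched in results.items():
    print(pattern_text, matched)
```

Only the last two match, which agrees with the outputs shown in the question and the answers.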
Thanks for the above answers. I am adding this answer because we don't have a short summary in the above answers.
The backslash \ needs to be escaped both in python string and regex engine.
Python string will translate 2 \\ to 1 \. And regex engine will require 2 \\ to match 1 \
So to provide the regex engine with 2 \\ in order to match 1 \ we will have to use 4 \\\\ in python string.
\\\\ --> Python(string translation) ---> \\ ---> Regex Engine(translation) ---> \
The version with . works because . matches any character except a newline, so it also matches the backslash.

python regex subsituting expression using a variable

What I am trying to achieve is to substitute a string using python regex with a variable (contents of the variable). Since I need to retain some of the matched expression, I use the \1 and \3 group match args.
My regex/sub looks like this:
pattern = "\1" + id + "\3" \b
out = re.sub(r'(;11=)(\w+)(;)',r'%s' % pattern, line)
What appears to be happening is \1 and \3 do not get added to the output.
I have also tried this with the substitution expression:
r'\1%s\3'%orderid
But I got similar results.
Any suggestion on what might fix this?
You need to use raw strings or double the backslashes:
pattern = r"\1" + id + r"\3"
or
pattern = "\\1" + id + "\\3"
In a regular Python string literal, \number is interpreted as an octal character code instead:
>>> '\1'
'\x01'
while the backslash has no special meaning in a raw string literal:
>>> r'\1'
'\\1'
Raw string literals are just a notation, not a type. Both r'' and '' produce strings, and only differ in how they interpret backslashes in source code.
Note that since group 1 and group 3 match literal text, you don't need to use substitutions at all; simply use:
out = re.sub(r';11=\w+;', ';11=%s;' % id, line)
or use look-behind and lookahead and forgo having to repeat the literals:
out = re.sub(r'(?<=;11=)\w+(?=;)', id, line)
Demo:
>>> import re
>>> line = 'foobar;11=spam;hameggs'
>>> id = 'monty'
>>> re.sub(r';11=\w+;', ';11=%s;' % id, line)
'foobar;11=monty;hameggs'
>>> re.sub(r'(?<=;11=)\w+(?=;)', id, line)
'foobar;11=monty;hameggs'
This isn't going to work:
pattern = "\1" + id + "\3"
# ...
r'%s' % pattern
The r prefix only affects how the literal is interpreted. So, r'%s' means that the % and s will be interpreted raw, but that's the same way they'd be interpreted without the r. Meanwhile, the pattern has non-raw literals "\1" and "\3", so it's already a control-A and a control-C before you even get to the %.
What you want is:
pattern = r"\1" + id + r"\3"
# ...
'%s' % pattern
However, you really don't need the % formatting at all; just use pattern itself and you'll get the exact same thing.
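One further caveat when a variable is spliced between group references: if the value starts with a digit, \1 followed by that digit becomes ambiguous (the re documentation notes that \g<2>0 exists because \20 would be read as group 20). The sketch below, assuming Python 3 and a made-up order_id value, shows two ways around that: the \g<1> form and a replacement function.

```python
import re

line = 'foobar;11=spam;hameggs'
order_id = '12345'   # hypothetical value that starts with digits

# \g<1> ends the group reference unambiguously, whatever follows it.
template = r'\g<1>' + order_id + r'\g<3>'
out_template = re.sub(r'(;11=)(\w+)(;)', template, line)

# A replacement function avoids template parsing altogether, so
# backslashes or digits in order_id are never reinterpreted.
def replace_id(match):
    return match.group(1) + order_id + match.group(3)

out_callable = re.sub(r'(;11=)(\w+)(;)', replace_id, line)

print(out_template)
print(out_callable)
```

Both calls produce foobar;11=12345;hameggs.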

how to place a character literal in a python string

I'm trying to write a regular expression in python, and one of the characters involved in it is the \001 character. putting \001 in a string doesn't seem to work. I also tried 'string' + str(chr(1)), but the regex doesn't seem to catch it. Please for the love of god somebody help me, I've been struggling with this all day.
import sys
import postgresql
import re
if len(sys.argv) != 2:
    print("usage: FixToDb <fix log file>")
else:
    f = open(sys.argv[1], 'r')
    timeExp = re.compile(r'(\d{2}):(\d{2}):(\d{2})\.(\d{6}) (\S)')
    tagExp = re.compile('(\\d+)=(\\S*)\001')
    for line in f:
        #parse the time
        m = timeExp.match(line)
        print(m.group(1) + ':' + m.group(2) + ':' + m.group(3) + '.' + m.group(4) + ' ' + m.group(5))
        tagPairs = re.findall('\\d+=\\S*\001', line)
        for t in tagPairs:
            tagPairMatch = tagExp.match(t)
            print ("tag = " + tagPairMatch.group(1) + ", value = " + tagPairMatch.group(2))
Here is an example line of the input. I replaced the '\001' character with a '~' for readability:
15:32:36.357227 R 1 0 0 0 8=FIX.4.2~9=0067~35=A~52=20120713-19:32:36~34=1~49=PD~56=P~98=0~108=30~10=134
output:
15:32:36.357227 R
tag = 8, value = FIX.4.29=006735=A52=20120713-19:32:3634=149=PD56=P98=0108=3010=134
So it doesn't stop at the '\001' character.
chr(1) should work, as will "\x01", as will "\001". (Note that chr(1) already returns a string, so you don't need to do str(chr(1)).) In your example it looks like you have both "\001" and chr(1), so that won't work unless you have two of the characters in a row in your data.
You say the regex "doesn't seem to catch it", but you don't give an example of your input data, so it's impossible to say why.
Edit: Okay, it looks like the problem has nothing to do with the \001. It is the classic greediness problem. The \S* in your tagExp expression will match a \001 character (since that character is not whitespace). So the \S* is gobbling the entire line. Use \S*? to make it non-greedy.
Edit: As others have noted, it also looks like your backslashes are awry. In regular expressions you face a backslash-doubling problem: Python uses the backslash for its own string escapes (like \t for tab, \n for newline), but regular expressions also use the backslash for their own purposes (e.g., \s for whitespace). The usual solution is to use raw strings. In a raw string Python itself no longer interprets the "\001" escape, although the re module also understands \001 and \x01 inside a pattern. You could certainly use raw strings for your timeExp regex. If you keep non-raw strings for your other regexes, double the backslashes (except on \001, because you want that one to be interpreted as a character-code escape).
Instead of using \S to match the value, which can be any non-whitespace character, including \001, you should use [^\x01], which will match any character that is not \001.
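Both suggested fixes can be compared on a shortened sample line. This is a sketch assuming Python 3; the sample string is a cut-down version of the input shown in the question, with the real \x01 separators put back.

```python
import re

# Octal, hexadecimal and chr() spellings all name the same character.
same_char = ('\001' == '\x01' == chr(1))

line = '8=FIX.4.2\x019=0067\x0135=A\x01'

# Greedy \S* runs through the separators and backtracks only to the last one.
greedy = re.findall(r'(\d+)=(\S*)\x01', line)

# Fix 1: a non-greedy quantifier stops at the first separator.
lazy = re.findall(r'(\d+)=(\S*?)\x01', line)

# Fix 2: a negated character class can never consume a separator.
negated = re.findall(r'(\d+)=([^\x01]*)\x01', line)

print(len(greedy), lazy, negated)
```

The greedy pattern returns a single pair covering the whole line, while both fixes return the three tag/value pairs.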
@Sam Mussmann, a note on the notation:
1 (decimal) = \001 (octal) = \x01 (hexadecimal); all three denote the same character, so [^\x01] and [^\001] are equivalent.

Can't get single \ in python

I'm trying to learn python, and I'm pretty new at it, and I can't figure this one part out.
Basically, what I'm doing now is something that takes the source code of a webpage, and takes out everything that isn't words.
Webpages have a lot of \n and \t, and I want something that will find \ and delete everything between it and the next ' '.
def removebackslash(source):
    while(source.find('\') != -1):
        startback = source.find('\')
        endback = source[startback:].find(' ') + startback + 1
        source = source[0:startback] + source[endback:]
    return source
is what I have. It doesn't work like this, because the \' doesn't close the string, but when I change \ to \\, it interprets the string as \\. I can't figure out anything that is interpreted as '\'
\ is an escape character; it either gives characters a special meaning or takes said special meaning away. Right now, it's escaping the closing single quote and treating it as a literal single quote. You need to escape it with itself to insert a literal backslash:
def removebackslash(source):
    while(source.find('\\') != -1):
        startback = source.find('\\')
        endback = source[startback:].find(' ') + startback + 1
        source = source[0:startback] + source[endback:]
    return source
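As an alternative sketch (not the asker's method), the same clean-up can be written with re.sub. Assumptions: Python 3, and the helper name remove_backslash_tokens and the sample text are invented for this example. One reason to prefer it: in the loop above, if a backslash is not followed by any space, find(' ') returns -1, the slice removes nothing, and the loop never ends.

```python
import re

def remove_backslash_tokens(source):
    # A backslash, then everything up to the next space, then that space
    # if there is one. With no space left, the rest of the text is removed.
    return re.sub(r'\\[^ ]* ?', '', source)

# Raw string, so these are literal backslashes, not real newlines or tabs.
sample = r'keep \nfoo this \tbar too'
cleaned = remove_backslash_tokens(sample)
no_trailing_space = remove_backslash_tokens(r'end \n')

print(cleaned)
```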
Try using replace:
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
So in your case:
my_text = my_text.replace('\n', '')
my_text = my_text.replace('\t', '')
As others have said, you need to use '\\'. The reason you think this isn't working is because when you get the results, they look like they begin with two backslashes. But they don't begin with two backslashes, it's just that Python shows two backslashes. If it didn't, you couldn't tell the difference between a newline (represented as \n) and a backslash followed by the letter n (represented as \\n).
There are two ways to convince yourself of what's really going on. One is to use print on the result, which causes it to expand the escapes:
>>> x = "here is a backslash \\ and here comes a newline \n this is on the next line"
>>> x
u'here is a backslash \\ and here comes a newline \n this is on the next line'
>>> print x
here is a backslash \ and here comes a newline
this is on the next line
>>> startback = x.find('\\')
>>> x[startback:]
u'\\ and here comes a newline \n this is on the next line'
>>> print x[startback:]
\ and here comes a newline
this is on the next line
Another way is to use len to verify the length of the string:
>>> x = "Backslash \\ !"
>>> startback = x.find('\\')
>>> x[startback:]
u'\\ !'
>>> print x[startback:]
\ !
>>> len(x[startback:])
3
Notice that len(x[startback:]) is 3. The string contains three characters: backslash, space, and exclamation point. You can see what's going on even more simply by just looking at a string that contains only a backslash:
>>> x = "\\"
>>> x
u'\\'
>>> print x
\
>>> len(x)
1
x only looks like it starts with two backslashes when you evaluate it at the interactive prompt (or otherwise use its __repr__ method). When you actually print it, you can see it's only one backslash, and when you look at its length, you can see it's only one character long.
So what this means is you need to escape the backslash in your find, and you need to recognize that the backslashes displayed in the output may also be doubled.
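The difference between what is stored and what the prompt displays can be checked without the interactive prompt. A minimal sketch, assuming Python 3:

```python
one_backslash = "\\"

shown_by_repr = repr(one_backslash)   # what the interactive prompt displays
shown_by_str = str(one_backslash)     # what print writes out

# The stored string has one character. Its repr has four: an opening quote,
# two backslashes (the escaped form), and a closing quote.
print(len(one_backslash), len(shown_by_repr), len(shown_by_str))
```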
The SO auto-format shows your problem. Since \ is used to escape characters, it's escaping the end quotes. Try changing that line to (note the use of double quotes):
while(source.find("\\") != -1):
Read more about escape characters in the docs.
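I don't think anyone's mentioned this yet, but if you don't want to deal with escaping characters, a raw string usually helps. The one exception is exactly this case: a raw string literal cannot end in a single backslash, so r'\' is a syntax error and a lone backslash still has to be written as '\\'.
source.find('\\')
Adding the letter r before a string literal tells Python not to process escape sequences, so r'\n' stays as a backslash followed by n.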
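The limits of raw strings can be tested with the built-in compile function. This is a sketch assuming Python 3; the helper name is_valid_literal is hypothetical. Each snippet is passed as source text, so a bad literal is reported instead of breaking the script itself.

```python
def is_valid_literal(source_text):
    # Try to compile the text as a Python expression.
    try:
        compile(source_text, "<demo>", "eval")
        return True
    except SyntaxError:
        return False

lone_raw = "r'\\'"      # the source text r'\'  : the quote is never closed
escaped = "'\\\\'"      # the source text '\\'  : one escaped backslash
raw_newline = "r'\\n'"  # the source text r'\n' : backslash followed by n

print(is_valid_literal(lone_raw), is_valid_literal(escaped), is_valid_literal(raw_newline))
```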
