I could not find a corresponding PEP or bug report for a problem in Python's re module.
Does anyone know if the following is planned to be fixed?
From regular-expressions.info:
Python does not support hexadecimal escapes in the replacement text
syntax, even though it supports \xFF and \uFFFF in string constants.
But it actually supports standard escapes like \n, \r, etc.
So, for example, one cannot replace the '<' character with the '>' character using hexadecimal escapes:
>>> import re
>>> re.sub(r'\x3c', r'\x3e', '\x3c')
'\\x3e'
Instead of '\\x3e' it should be '>'.
Using escaped \n works fine:
>>> re.sub(r'a', r'\n', 'a')
'\n'
Thanks in advance!
UPD: Not using a raw string is not an option. For example, if the pattern and replacement strings are stored in a config file and I write \x3e there, it is read as '\\x3e' rather than '>'.
The only workaround I know of is to not use a raw string for the replacement text and instead allow normal string evaluation to turn \x3e into >. This works because, as you noted, Python strings do support such sequences.
>>> import re
>>> re.sub(r'\x3c', '\x3e', '\x3c')
'>'
This means that in more complex replacement text you need more escapes, which could make it less readable, but at least it works.
I don't know if there is any plan to improve on this. I took a look at the existing documentation for the python 3.4 re module (under dev) and found no mention of including this kind of support.
However, if you have a need for more complex logic on the replacement, you can pass a function instead of replacement text for the repl argument of re.sub.
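For the config-file case, here is a minimal sketch of the function approach. It assumes the replacement arrives as the literal text \x3e, decodes Python-style escapes with the unicode_escape codec, and passes the result to re.sub through a callable so that re does not reinterpret it. The helper name decode_escapes and the sample strings are made up for illustration, and the encode step assumes the text is plain ASCII.

```python
import re

def decode_escapes(text):
    # Hypothetical helper: turn escapes such as \x3e into the characters they name.
    # Assumes ASCII-only text; non-ASCII input would need different handling.
    return text.encode("ascii").decode("unicode_escape")

raw_replacement = r"\x3e"  # the four characters a config file would hand you
replacement = decode_escapes(raw_replacement)

# The return value of a callable is inserted as-is, with no escape processing by re.
result = re.sub(r"\x3c", lambda match: replacement, "a < b")
print(result)
```

Going through a callable also avoids surprises if the decoded replacement itself happens to contain a backslash.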
Related
I thought this should work, but it doesn't:
import re
if re.match("\Qbla\E", "bla"):
    print "works!"
Why doesn't it work? Can I use the '\Q' and '\E' symbols in Python? How?
Python's regex engine doesn't support those; see §7.2.1 "Regular Expression Syntax" in the Python documentation for a list of what it does support. However, you can get the same effect by writing re.match(re.escape("bla"), "bla"); re.escape is a function that inserts backslashes before all special characters.
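A small sketch of the re.escape approach, using a made-up literal that contains metacharacters so the difference is visible (the variable names are only for illustration):

```python
import re

literal = "a.b*c"              # hypothetical text containing regex metacharacters
pattern = re.escape(literal)   # backslash-escapes the special characters

# The escaped pattern matches only the exact text.
matches_literal = re.match(pattern, "a.b*c") is not None
matches_other = re.match(pattern, "aXbbc") is not None

# Unescaped, the same text is a real regex: "a", any character, zero or more "b", "c".
unescaped_matches_other = re.match(literal, "aXbbc") is not None

print(matches_literal, matches_other, unescaped_matches_other)
```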
By the way, you should generally use "raw" strings, r"..." instead of just "...", since otherwise backslashes will get processed twice (once when the string is parsed, and then again by the regex engine), which means you have to write things like \\b instead of \b. Using r"..." prevents that first processing pass, so you can just write \b.
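To illustrate the two processing passes, a short sketch with a made-up word-boundary pattern:

```python
import re

raw_pattern = r"\bcat\b"        # backslashes reach the regex engine untouched
doubled_pattern = "\\bcat\\b"   # same string, written without the r prefix
same_string = raw_pattern == doubled_pattern

# In an ordinary literal, \b is a backspace character, not a word boundary,
# so this pattern looks for backspace + "cat" + backspace.
backspace_pattern = "\bcat\b"

found = re.search(raw_pattern, "the cat sat")
not_found = re.search(backspace_pattern, "the cat sat")

print(same_string, found is not None, not_found is None)
```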
Unfortunately, Python doesn't support the \Q and \E escape sequences. You just have to escape everything yourself.
Python doesn't support \Q...\E .
Ref: http://www.regular-expressions.info/refflavors.html
But that doesn't mean it doesn't support escaping strings of metacharacters.
Ref: http://docs.python.org/library/re.html#re.escape
How do python raw strings and string literals work? I'm trying to make a webscraper to download pdfs from a site. When I search the string it works, but when I try to implement it in python I always get None as my answer
import urllib
import re
url = ""  # insert url here
sock = urllib.urlopen(url)
htmlSource = sock.read()
sock.close()
m=re.match(r"<a href.*?pdf[^>]*?", raw(htmlSource))
print m
$ python temp.py
None
The raw function is from here: http://code.activestate.com/recipes/65211-convert-a-string-into-a-raw-string/
That said, how can I complete this program so that I can print out all of the matches and then download the pdfs?
Thanks!
You seem to be very confused.
A 'string literal' is a string that you type into the program. Because there needs to be a clear beginning and end to your string, certain characters become inconvenient to have within the middle of the string, and escape sequences must be used to represent them.
Python offers 'raw' string literals which have different rules for how the escape sequences are interpreted: the same rules are used to figure out where the string ends (so a single backslash, followed by the opening quote character, doesn't terminate the string), but the backslash sequences themselves are not transformed. So, while '\'' is a string that consists of a single quote character (the \' in the middle is an escape sequence that produces the quote), r'\'' is a string that consists of a backslash and a quote character.
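The two literals from this paragraph can be compared directly; the variable names here are only for illustration:

```python
plain = '\''    # escape sequence: a single quote character
raw = r'\''     # raw literal: a backslash followed by a quote character

print(len(plain), len(raw))

# Both literals produce ordinary str objects; only the parsing differs.
same_type = type(plain) is type(raw)
raw_spelled_out = raw == "\\'"
```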
The raw string literal produces an object of type str. It is the same type as produced by an ordinary string literal. These are often used for the pattern for a regex operation, because the strings used for regexes often need to contain a lot of backslashes. If you wanted to write a regex that matched a backslash in the source text, and you didn't have raw string literals, then you would need to put, perhaps surprisingly, four backslashes between the quotes in your source code: the Python compiler would interpret this as a string containing two real backslashes, which in turn represents "match a backslash" in the regex syntax.
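As a sketch of the four-backslash case, with a made-up sample text containing one backslash:

```python
import re

four_in_source = "\\\\"   # ordinary literal: four typed backslashes, two in the string
two_in_raw = r"\\"        # raw literal: two typed backslashes, two in the string
same_pattern = four_in_source == two_in_raw

text = "C:\\temp"         # hypothetical text: C, colon, one backslash, "temp"

# The regex engine reads the two backslashes as "match one literal backslash".
hit = re.search(two_in_raw, text)
print(same_pattern, len(two_in_raw), hit.start())
```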
The function you found is an imperfect attempt to re-introduce escape sequences into input text. This is not what you want to do, doesn't even really make sense, and doesn't meet the author's own spec anyway. It seems to be based on a misconception similar to your own. The concept of a "raw equivalent of" a string is nonsensical. There is, really, no such thing as "a raw string"; raw string literals are a convenience for creating ordinary strings.
You want to search for the pattern within htmlSource. It is already in the form you need it to be in. Your problem has nothing to do with string escapes. When a string comes from user input, file input, or basically anything other than the program source, it is not processed the way string literals are, unless you explicitly arrange for that to happen. If the web page contains a backslash followed by an n, the string that gets read by urllib contains, in the corresponding spot, exactly that - a backslash followed by an n, not a newline.
The problem is as follows: you want to search the string, as you said: "when I search the string it works". You are currently matching the string. See the documentation:
Help on function match in module re:
match(pattern, string, flags=0)
Try to apply the pattern at the start of the string, returning
a match object, or None if no match was found.
Your pattern does not appear at the beginning of the string, since the HTML for the webpage does not start with the <a> tag you are looking for.
You want m=re.search(r"<a href.*?pdf[^>]*?", htmlSource).
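A self-contained sketch of the difference, in Python 3 syntax and using a made-up HTML snippet instead of a downloaded page. The href-capturing pattern passed to findall is not from the question; it is just one assumed way to list every PDF link, and a real scraper would probably be better served by an HTML parser.

```python
import re

# Hypothetical page content standing in for htmlSource.
html = '<html><body><a href="docs/a.pdf">A</a> <a href="b.pdf">B</a></body></html>'

pattern = r"<a href.*?pdf[^>]*?"

first_try = re.match(pattern, html)   # anchored at the start of the string
found = re.search(pattern, html)      # scans the whole string

# Assumed pattern for collecting all links that end in .pdf.
pdf_links = re.findall(r'href="([^"]*\.pdf)"', html)

print(first_try)
print(found.group(0))
print(pdf_links)
```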
Check out this answer. It seems that Python’s urllib is a lot less user‐friendly — and Unicode‐friendly — than it should be. It seems to force you to deal with ugly raw bytes content instead of decoding it for you into a normal string.
I am seeing the following phenomenon, couldn't seem to figure it out, and didn't find anything with some search through archives:
if I type in:
>>> if re.search(r'\n',r'this\nis\nit'):
...     print 'found it!'
... else:
...     print "didn't find it"
...
I will get:
didn't find it
However, if I type in:
>>> if re.search(r'\\n',r'this\nis\nit'):
...     print 'found it!'
... else:
...     print "didn't find it"
...
Then I will get:
found it!
(The first one only has one backslash on the r'\n' whereas the second one has two backslashes in a row on the r'\\n' ... even this interpreter is removing one of them.)
I can guess what is going on, but I don't understand the official mechanism as to why this is happening: in the first case, I need to escape two things: both the regular expression and the special strings. "Raw" lets me escape the special strings, but not the regular expression.
But there will never be a regular expression in the second string, since it is the string being matched. So there is only a need to escape once.
However, something doesn't seem consistent to me: how am I supposed to ensure that the characters REALLY ARE taken literally in the first case? Can I type rr'' ? Or do I have to ensure that I escape things twice?
In a similar vein, how do I ensure that a variable is taken literally (or that it is NOT taken literally)? E.g., what if I had a variable tmp = 'this\nis\nmy\nhome', and I really wanted to find the literal combination of a backslash and an 'n', instead of a newline?
Thanks! Mike
re.search(r'\n', r'this\nis\nit')
As you said, "there will never be a regular expression in the second string." So we need to look at these strings differently: the first string is a regex, the second just a string. Usually your second string will not be raw, so any backslashes are Python-escapes, not regex-escapes.
So the first string consists of a literal "\" and an "n". This is interpreted by the regex parser as a newline (docs: "Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser"). So your regex will be searching for a newline character.
Your second string consists of the string "this" followed by a literal "\" and an "n". So this string does not contain an actual newline character. Your regex will not match.
As for your second regex:
re.search(r'\\n', r'this\nis\nit')
This version matches because your regex contains three characters: a literal "\", another literal "\" and an "n". The regex parser interprets the two backslashes as a single "\" character, followed by an "n". So your regex will be searching for a "\" followed by an "n", which is found within the string. But that isn't very helpful, since it has nothing to do with newlines.
Most likely what you want is to drop the r from the second string, thus treating it as a normal Python string.
re.search(r'\n', 'this\nis\nit')
In this case, your regex (as before) is searching for a newline character. And, it finds it, because the second string contains the word "this" followed by a newline.
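The three cases discussed above, side by side, as a small Python 3 sketch (the variable names are only for illustration):

```python
import re

# Regex for a newline, searched in text with backslash + n but no newline.
case_1 = re.search(r'\n', r'this\nis\nit')

# Regex for backslash + n, searched in text that contains backslash + n.
case_2 = re.search(r'\\n', r'this\nis\nit')

# Regex for a newline, searched in text that contains real newlines.
case_3 = re.search(r'\n', 'this\nis\nit')

print(case_1 is not None, case_2 is not None, case_3 is not None)
```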
Escaping special sequences in string literals is one thing, escaping regular expression special characters is another. The raw string modifier only affects the former.
Technically, re.search accepts two strings and passes the first to the regex builder with re.compile. The compiled regex object is used to search patterns inside simple strings. The second string is never compiled and thus it is not subject to regex special character rules.
If the regex builder receives a \n after the string literal is processed, it converts this sequence to a newline character. You also have to escape it if you need to match the sequence instead.
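When the text to match literally lives in a variable, one option is to let re.escape do the escaping. This is a sketch with made-up variable names, not something the question's code already does:

```python
import re

needle = "\\n"                  # two characters: a backslash and an "n"
pattern = re.escape(needle)     # regex that matches exactly those two characters

with_newlines = 'this\nis\nmy\nhome'        # contains real newline characters
with_backslash_n = r'this\nis\nmy\nhome'    # contains backslash + n pairs

hit_in_newlines = re.search(pattern, with_newlines)
hit_in_backslash_n = re.search(pattern, with_backslash_n)

print(hit_in_newlines is None, hit_in_backslash_n is not None)
```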
All rationale behind this is that regular expressions are not part of the language syntax. They are rather handled within the standard library inside the re module with common building blocks of the language.
The re.compile function uses special characters and escaping rules compatible with most commonly used regex implementations. However, the Python interpreter is not aware of the whole regular expression concept and it does not know whether a string literal will be compiled into a regex object or not. As a result, Python can't provide any kind of syntax simplification such as the ones you suggested.
Regexes have their own meaning for backslashes, for example in character classes like \d. If you actually want to match a literal backslash character, you do need to double-escape it. The two arguments aren't supposed to be parallel, since you're comparing a regex to a string.
Raw strings are just a convenience, and it would be way overkill to have double-raw strings.
I ran across something once upon a time and wondered if it was a Python "bug" or at least a misfeature. I'm curious if anyone knows of any justifications for this behavior. I thought of it just now reading "Code Like a Pythonista," which has been enjoyable so far. I'm only familiar with the 2.x line of Python.
Raw strings are strings that are prefixed with an r. This is great because I can use backslashes in regular expressions and I don't need to double everything everywhere. It's also handy for writing throwaway scripts on Windows, so I can use backslashes there also. (I know I can also use forward slashes, but throwaway scripts often contain content cut&pasted from elsewhere in Windows.)
So great! Unless, of course, you really want your string to end with a backslash. There's no way to do that in a 'raw' string.
In [9]: r'\n'
Out[9]: '\\n'
In [10]: r'abc\n'
Out[10]: 'abc\\n'
In [11]: r'abc\'
------------------------------------------------
File "<ipython console>", line 1
r'abc\'
^
SyntaxError: EOL while scanning string literal
In [12]: r'abc\\'
Out[12]: 'abc\\\\'
So one backslash before the closing quote is an error, but two backslashes gives you two backslashes! Certainly I'm not the only one that is bothered by this?
Thoughts on why 'raw' strings are 'raw, except for backslash-quote'? I mean, if I wanted to embed a single quote in there I'd just use double quotes around the string, and vice versa. If I wanted both, I'd just triple quote. If I really wanted three quotes in a row in a raw string, well, I guess I'd have to deal, but is this considered "proper behavior"?
This is particularly problematic with folder names in Windows, where the backslash is the path delimiter.
It's a FAQ.
And in response to "you really want your string to end with a backslash. There's no way to do that in a 'raw' string.": the FAQ shows how to work around it.
>>> r'ab\c' '\\' == 'ab\\c\\'
True
>>>
Raw strings are meant mostly for readably writing the patterns for regular expressions, which never need a trailing backslash; it's an accident that they may come in handy for Windows (where you could use forward slashes in most cases anyway -- the Microsoft C library which underlies Python accepts either form!). It's not considered acceptable to make it (nearly) impossible to write a regular expression pattern containing both single and double quotes, just to reinforce the accident in question.
("Nearly" because triple-quoting would almost always help... but it could be a little bit of a pain sometimes).
So, yes, raw strings were designed to behave that way (forbidding odd numbers of trailing backslashes), and it is considered perfectly "proper behavior" for them to respect the design decisions Guido made when he invented them;-).
Another way to work around this is:
>>> print(r"Raw \with\ trailing backslash\ "[:-1])
Raw \with\ trailing backslash\
Updated for Python 3 and removed unnecessary slash at the end which implied an escape.
Note that personally I doubt I would use the above. I guess maybe if it was a huge string with more than just a path. For the above I'd prefer non-raw and double up the slashes.
Thoughts on why 'raw' strings are 'raw, except for backslash-quote'? I
mean, if I wanted to embed a single quote in there I'd just use double
quotes around the string, and vice versa.
But that would then raise the question as to why raw strings are 'raw, except for embedded quotes?'
You have to have some escape mechanism, otherwise you can never use the outer quote characters inside the string at all. And then you need an escape mechanism for the escape mechanism.
Why can't I use this in Python:
r"c:\"
When a string must contain the same quote character with which it starts, escaping that character is the only available workaround -- so the design alternative was either to make raw-string literals unable to contain their leading quote character, or to keep the "backslash escapes" convention, even in raw-string literals, just for quote characters.
Since raw-string literals were designed for handy representation of regular expression patterns (not for DOS / Windows paths!-), and in RE patterns a trailing backslash is never necessary, the design decision was easy (based on the real use case for raw-string literals).
Use "c:/" or "c:\\". Raw string literals are for escaping escape-sequences, not for including literal backslashes, though they do work that way, except in this exact case.
It's a known case, I think; better to use "c:\\" here.
From the documentation:
... a raw string cannot end in a single backslash (since the backslash would escape the following quote character).
Even with raw strings, \" causes the " not to be interpreted as the end of the string (though the backslash gets into your string), so r"foo\"bar" would be a legal string. This is convenient enough when writing regex but not great for writing paths.
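A quick sketch of that behavior, plus the adjacent-literal trick for a trailing backslash (the variable names are only for illustration):

```python
quoted = r"foo\"bar"          # legal: the backslash keeps the quote from ending the string
print(len(quoted), quoted)    # the backslash stays in the string

# The same string written as an ordinary literal.
same_as_plain = quoted == 'foo\\"bar'

# A raw literal cannot end in one backslash, but adjacent literals are concatenated,
# so the trailing backslash can come from a short ordinary literal.
trailing = r"c:" "\\"
```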
This is not a big deal as most of the time you should be using os.path and other modules to deal with your paths.
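As a rough sketch of letting the path module add separators: ntpath is the Windows flavor of os.path and is used here only so the result is the same on any platform. The folder names are made up.

```python
import ntpath  # Windows implementation of os.path, importable on every platform

# Hypothetical folder names; join inserts the backslashes.
joined = ntpath.join("c:\\", "Users", "me")

# Joining with an empty last component yields a trailing separator,
# which sidesteps the raw-literal restriction entirely.
with_trailing = ntpath.join(r"c:\data", "")

print(joined)
print(with_trailing)
```

On Windows itself, plain os.path.join (or pathlib) would be the usual choice.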
Found in the Design and History FAQ: http://docs.python.org/faq/design.html#why-can-t-raw-strings-r-strings-end-with-a-backslash
Raw strings were designed to ease
creating input for processors (chiefly
regular expression engines) that want
to do their own backslash escape
processing. Such processors consider
an unmatched trailing backslash to be
an error anyway, so raw strings
disallow that. In return, they allow
you to pass on the string quote
character by escaping it with a
backslash. These rules work well when
r-strings are used for their intended
purpose.