How to escape unicode string for regular expressions?

I need to build an re pattern based on the unicode string (e.g. I have "word", and I need something like ^"word"| "word"). However the "word" can contain special re characters. To match the "word" as it is, I need to escape special re characters in unicode string. The basic re.escape() function does the job for ascii strings. How can I do this for unicode?

re.escape() inserts a backslash before every character that's not an ASCII alphanumeric (since Python 3.7 it escapes only characters that can be special in a regular expression). This may insert a multitude of unnecessary backslashes; however, the re module treats a backslash followed by a non-alphanumeric character as that literal character, so there is no big harm done (except possibly some performance penalty). The same holds for unicode strings.
But if you want to build a stricter escape(), you can:
import re
def escape(s):
    # Prefix each metacharacter with a backslash; \g<0> is the matched character
    return re.sub(r"[(){}\[\].*?|^$\\+-]", r"\\\g<0>", s)
which only touches the actual regex metacharacters. I sure hope I didn't miss any :)
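As a rough illustration of the pattern the question asks for (^"word"| "word"), here is a minimal sketch assuming Python 3, where every str is unicode. The helper name build_word_pattern and the sample word are made up for the example.

```python
import re

def build_word_pattern(word):
    # re.escape accepts str (unicode) input and makes it safe to match literally
    escaped = re.escape(word)
    # Match the word at the start of the text, or right after a single space
    return re.compile("^" + escaped + "| " + escaped)

# Hypothetical word containing both non-ASCII letters and regex metacharacters
pattern = build_word_pattern("naïve+(café)")

print(bool(pattern.search("naïve+(café) is a test")))  # True
print(bool(pattern.search("a naïve+(café) here")))     # True
print(bool(pattern.search("xnaïve+(café)")))           # False
```

The non-ASCII letters pass through untouched or with a harmless backslash, depending on the Python version; either way the compiled pattern matches them literally.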

Related

What does the "r" in Python's re.compile(r' pattern flags') mean?

I am reading through http://docs.python.org/2/library/re.html. According to it, the "r" in Python's re.compile(r' pattern flags') refers to the raw string notation:
The solution is to use Python’s raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with 'r'. So r"\n" is a two-character string
containing '\' and 'n', while "\n" is a one-character string
containing a newline. Usually patterns will be expressed in Python
code using this raw string notation.
Would it be fair to say then that:
re.compile(r pattern) means that "pattern" is a regex while, re.compile(pattern) means that "pattern" is an exact match?

As @PauloBu stated, the r string prefix is not specifically related to regexes, but to strings in Python generally.
Normal strings use the backslash character as an escape character for special characters (like newlines):
>>> print('this is \n a test')
this is
a test
The r prefix tells the interpreter not to do this:
>>> print(r'this is \n a test')
this is \n a test
>>>
This is important in regular expressions, as you need the backslash to make it to the re module intact - in particular, \b matches empty string specifically at the start and end of a word. re expects the string \b, however normal string interpretation '\b' is converted to the ASCII backspace character, so you need to either explicitly escape the backslash ('\\b'), or tell python it is a raw string (r'\b').
>>> import re
>>> re.findall('\b', 'test') # the backslash gets consumed by the python string interpreter
[]
>>> re.findall('\\b', 'test') # backslash is explicitly escaped and is passed through to re module
['', '']
>>> re.findall(r'\b', 'test') # often this syntax is easier
['', '']

No. As the quoted documentation explains, the r prefix indicates that the string is a raw string.
Python string escaping and regex escaping both use the backslash \ character, so they collide; raw strings provide a way to tell Python that you want the backslashes left unprocessed.
Examine the following:
>>> "\n"
'\n'
>>> r"\n"
'\\n'
>>> print "\n"
>>> print r"\n"
\n
Prefixing with an r merely tells Python that backslashes \ in the literal should be treated literally and not as escape characters.
This is helpful when, for example, you are searching on a word boundary. The regex for this is \b; to express it in a normal Python string, I'd need to use "\\b" as the pattern. Instead, I can use the raw string r"\b" to pattern match on.
This becomes especially handy when trying to find a literal backslash in regex. To match a backslash, the regex needs the pattern \\; in a normal Python string each of those backslashes must itself be escaped, so the pattern becomes "\\\\", or the much simpler r"\\".
As you can guess in longer and more complex regexes, the extra slashes can get confusing, so raw strings are generally considered the way to go.
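A small sketch of the literal-backslash case described above, assuming Python 3; the variable names and the sample path are only for illustration.

```python
import re

normal = "\\\\"  # four backslashes in the source, two in the resulting string
raw = r"\\"      # raw literal holding the same two backslashes
print(normal == raw, len(raw))  # True 2

path = "C:\\temp"  # the value contains exactly one backslash
matches = re.findall(raw, path)
print(len(matches))  # 1
```

Both spellings hand the same two-character pattern to the re module, which reads it as "one literal backslash".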

No. Not everything in regex syntax needs to be preceded by \, so ., *, +, etc. still have special meaning in a pattern.
The r'' is often used as a convenience for regex that do need a lot of \ as it prevents the clutter of doubling up the \
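To make the distinction concrete, here is a short sketch (Python 3.4 or later assumed, because of re.fullmatch) showing that a raw string is still interpreted as a regex, and that an exact match needs re.escape rather than the r prefix.

```python
import re

pattern_text = r"a.c"  # raw string, but '.' is still a regex wildcard

as_regex = re.fullmatch(pattern_text, "abc")
as_literal = re.fullmatch(re.escape(pattern_text), "abc")
literal_hit = re.fullmatch(re.escape(pattern_text), "a.c")

print(as_regex is not None)     # True: '.' matched 'b'
print(as_literal is not None)   # False: escaped '.' only matches a real dot
print(literal_hit is not None)  # True
```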

What is the correct way to use unicode characters in a python regex

In the process of scraping some documents using Python 2.7, I've run into some annoying page separators, which I've decided to remove. The separators use some funky characters. I already asked one question here on how to make these characters reveal their utf-8 codes. There are two non-ASCII characters used: '\xc2\xad', and '\x0c'. Now, I just need to remove these characters, as well as some spaces and the page numbers.
Elsewhere on SO, I've seen unicode characters used in tandem with regexps, but it's in a strange format that I do not have these characters in, e.g. '\u00ab'. In addition, none of them are using ASCII as well as non-ASCII characters. Finally, the python docs are very light on the subject of unicode in regexes... something about flags... I don't know. Can anyone help?
Here is my current usage, which does not do what I want:
re.sub('\\xc2\\xad\s\d+\s\\xc2\\xad\s\\x0c', '', my_str)

Rather than seek out specific unwanted chars, you could remove everything not wanted:
re.sub('[^\\s!-~]', '', my_str)
This throws away all characters not:
whitespace (spaces, tabs, newlines, etc)
printable "normal" ascii characters (! is the first printable char and ~ is the last under decimal 128)
You could include more chars if needed - just adjust the character class.
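A quick sketch of this character class in action, assuming Python 3 and a decoded (str) input; the function name and sample text are invented for the example. One thing it shows is that the form feed '\x0c' counts as whitespace for \s, so this class alone leaves it in place.

```python
import re

def keep_ascii_and_whitespace(text):
    # Drop every character that is neither whitespace nor printable ASCII (! through ~)
    return re.sub(r"[^\s!-~]", "", text)

sample = "total\u00ad 12 \u00ad\x0cnext"
cleaned = keep_ascii_and_whitespace(sample)
print(repr(cleaned))  # 'total 12 \x0cnext'

# The form feed survived because \s matches it; remove it explicitly if unwanted
without_form_feed = cleaned.replace("\x0c", "")
print(repr(without_form_feed))  # 'total 12 next'
```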

I had the same problem. I know this is not an efficient way, but it worked in my case:
result = re.sub(r"\\" ,",x,x",result)
result = re.sub(r",x,xu00ad" ,"",result)
result = re.sub(r",x,xu" ,"\\u",result)
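A more direct route for the separator described in the question might look like the sketch below. It assumes Python 3 and text that has already been decoded from UTF-8, in which case the two bytes '\xc2\xad' become the single soft-hyphen character '\u00ad'. The names SEPARATOR and strip_page_separators and the sample string are hypothetical.

```python
import re

SOFT_HYPHEN = "\u00ad"  # the UTF-8 bytes '\xc2\xad' decode to this one code point
FORM_FEED = "\x0c"

# soft hyphen, whitespace, page number, whitespace, soft hyphen, whitespace, form feed
SEPARATOR = re.compile(SOFT_HYPHEN + r"\s\d+\s" + SOFT_HYPHEN + r"\s" + FORM_FEED)

def strip_page_separators(text):
    return SEPARATOR.sub("", text)

sample = "end of page\u00ad 12 \u00ad \x0cstart of next"
print(strip_page_separators(sample))  # end of pagestart of next
```

Putting the characters themselves into the pattern, and keeping only the regex parts (\s, \d+) in raw strings, avoids the doubled backslashes from the original attempt.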

python regex re.compile match

I am trying to match (using regex in python):
http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg
in the following string:
http://www.mymaterialssite.com','http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg','Model Photo'
My code has something like this:
temp="http://www.mymaterialssite.com','http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg','Model Photo'"
dummy=str(re.compile(r'.com'',,''(.*?)'',,''Model Photo').search(str(temp)).group(1))
I do not think the "dummy" is correct & I am unsure how I "escape" the single and double quotes in the regex re.compile command.
I tried googling for the problem, but I couldn't find anything relevant.
Would appreciate any guidance on this.
Thanks.

The easiest way to deal with strings in Python that contain escape characters and quotes is to triple double-quote the string (""") and prefix it with r. For example:
my_str = r"""This string would "really "suck"" to write if I didn't
know how to tell Python to parse it as "raw" text with the 'r' character and
triple " quotes. Especially since I want \n to show up as a backslash followed
by n. I don't want \0 to be the null byte either!"""
The r means "take escape characters as literal". The triple double-quotes (""") prevent single-quotes, double-quotes, and double double-quotes from prematurely ending the string.
EDIT: I expanded the example to include things like \0 and \n. In a normal string (not a raw string) a \ (the escape character) signifies that the next character has special meaning. For example \n means "the newline character". If you literally wanted the character \ followed by n in your string you would have to write \\n, or just use a raw string instead, as I show in the example above.
You can also read about string literals in the Python documentation here:
For beginners: http://docs.python.org/tutorial/introduction.html#strings
Complex explanation: http://docs.python.org/reference/lexical_analysis.html#string-literals

Try triple quotes:
import re
tmp = """.*http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg.*"""
# named text rather than str, so the built-in str is not shadowed
text = """http://www.mymaterialssite.com','http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg','Model Photo'"""
x = re.match(tmp, text)
if x is not None:
    print(x.group())
Also you were missing the .* in the beginning of the pattern and at the end. I added that too.

If you use double quotes (which have the same meaning as single ones in Python), you don't have to escape anything in this case. You can even use a string literal without the leading r, since there are no backslashes in it:
re.compile(".com','(.*?)','Model Photo")

Commas don't need to be escaped, and single quotes don't need to be escaped if you use double quotes to create the string:
>>> dummy=re.compile(r".com','(.*?)','Model Photo").search(temp).group(1)
>>> print dummy
http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg
Note that I also removed some unnecessary str() calls, and for future reference if you do ever need to escape single or double quotes (say your string contains both), use a backslash like this:
'.com\',\'(.*?)\',\'Model Photo'
As mykhal pointed out in comments, this doesn't work very nicely with regex because you can no longer use the raw string (r'...') literal. A better solution would be to use triple quoted strings as other answers suggested.
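Putting the advice from these answers together, a minimal Python 3 sketch could look like this. The only change to the pattern beyond the quoting is \. for the dot, which is an addition here so that ".com" is matched literally.

```python
import re

temp = "http://www.mymaterialssite.com','http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg','Model Photo'"

# Double quotes outside, so the single quotes inside need no escaping;
# \. matches a literal dot instead of any character.
pattern = re.compile(r"\.com','(.*?)','Model Photo")

match = pattern.search(temp)
url = match.group(1) if match else None  # None when the pattern is absent
print(url)
```

Checking the match object before calling group() avoids an AttributeError when the input does not contain the expected fragment.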

python "re" package, strange phenomenon with "raw" string

I am seeing the following phenomenon, couldn't seem to figure it out, and didn't find anything with some search through archives:
If I type in:
>>> if re.search(r'\n',r'this\nis\nit'):
...     print 'found it!'
... else:
...     print "didn't find it"
...
I will get:
didn't find it
However, if I type in:
>>> if re.search(r'\\n',r'this\nis\nit'):
...     print 'found it!'
... else:
...     print "didn't find it"
...
Then I will get:
found it!
(The first one only has one backslash on the r'\n' whereas the second one has two backslashes in a row on the r'\\n' ... even this interpreter is removing one of them.)
I can guess what is going on, but I don't understand the official mechanism as to why this is happening: in the first case, I need to escape two things: both the regular expression and the special strings. "Raw" lets me escape the special strings, but not the regular expression.
But there will never be a regular expression in the second string, since it is the string being matched. So there is only a need to escape once.
However, something doesn't seem consistent to me: how am I supposed to ensure that the characters REALLY ARE taken literally in the first case? Can I type rr'' ? Or do I have to ensure that I escape things twice?
On a similar vein, how do I ensure that a variable is taken literally (or that it is NOT taken literally)? E.g., what if I had a variable tmp = 'this\nis\nmy\nhome', and I really wanted to find the literal combination of a slash and an 'n', instead of a newline?
Thanks! Mike

re.search(r'\n', r'this\nis\nit')
As you said, "there will never be a regular expression in the second string." So we need to look at these strings differently: the first string is a regex, the second just a string. Usually your second string will not be raw, so any backslashes are Python-escapes, not regex-escapes.
So the first string consists of a literal "\" and an "n". This is interpreted by the regex parser as a newline (docs: "Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser"). So your regex will be searching for a newline character.
Your second string consists of the string "this" followed by a literal "\" and an "n". So this string does not contain an actual newline character. Your regex will not match.
As for your second regex:
re.search(r'\\n', r'this\nis\nit')
This version matches because your regex contains three characters: a literal "\", another literal "\" and an "n". The regex parser interprets the two slashes as a single "\" character, followed by an "n". So your regex will be searching for a "\" followed by an "n", which is found within the string. But that isn't very helpful, since it has nothing to do with newlines.
Most likely what you want is to drop the r from the second string, thus treating it as a normal Python string.
re.search(r'\n', 'this\nis\nit')
In this case, your regex (as before) is searching for a newline character. And, it finds it, because the second string contains the word "this" followed by a newline.
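The cases discussed above can be checked side by side. This is a small Python 3 sketch; the variable names are only for illustration.

```python
import re

raw_text = r'this\nis\nit'   # contains backslash + 'n' pairs, no newline
plain_text = 'this\nis\nit'  # contains real newline characters

newline_in_raw = re.search(r'\n', raw_text)        # regex for a newline
backslash_n_in_raw = re.search(r'\\n', raw_text)   # regex for backslash + 'n'
newline_in_plain = re.search(r'\n', plain_text)
backslash_n_in_plain = re.search(r'\\n', plain_text)

print(newline_in_raw is not None)        # False
print(backslash_n_in_raw is not None)    # True
print(newline_in_plain is not None)      # True
print(backslash_n_in_plain is not None)  # False
```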

Escaping special sequences in string literals is one thing, escaping regular expression special characters is another. The raw string prefix only affects the former.
Technically, re.search accepts two strings and passes the first to the regex builder with re.compile. The compiled regex object is used to search patterns inside simple strings. The second string is never compiled and thus it is not subject to regex special character rules.
If the regex builder receives a \n after the string literal is processed, it converts this sequence to a newline character. You also have to escape it if you need to match the sequence itself instead.
All rationale behind this is that regular expressions are not part of the language syntax. They are rather handled within the standard library inside the re module with common building blocks of the language.
The re.compile function uses special characters and escaping rules compatible with most commonly used regex implementations. However, the Python interpreter is not aware of the whole regular expression concept and it does not know whether a string literal will be compiled into a regex object or not. As a result, Python can't provide any kind of syntax simplification such as the ones you suggested.

Regexes have their own meaning for literal backslashes, as character classes like \d. If you actually want a literal backslash character, you will in fact need to double-escape it. It's really not supposed to be parallel since you're comparing a regex to a string.
Raw strings are just a convenience, and it would be way overkill to have double-raw strings.
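For the question about taking a variable literally, one option is to let re.escape do the second level of escaping instead of writing it by hand. A minimal sketch, assuming Python 3; needle and haystack are hypothetical names.

```python
import re

needle = r'\n'                    # two characters: a backslash and an 'n'
haystack = r'this\nis\nmy\nhome'  # raw, so it holds backslash + 'n' pairs

# re.escape turns the needle into a regex that matches it verbatim
literal = re.compile(re.escape(needle))

hits = literal.findall(haystack)
print(len(hits))  # 3

# A string with real newlines contains no backslash + 'n' pair
print(literal.search('this\nis') is None)  # True
```

When no regex features are needed at all, a plain substring test such as needle in haystack does the same job without the re module.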

raw escape in python except last char

Why in python I can't use:
r"c:\"

When a string must contain the same quote character with which it starts, escaping that character is the only available workaround -- so the design alternative was either to make raw-string literals unable to contain their leading quote character, or keep the "backslash escapes" convention, even in raw-string literals, just for quote characters.
Since raw-string literals were designed for handy representation of regular expression patterns (not for DOS / Windows paths!-), and in RE patterns a trailing backslash is never necessary, the design decision was easy (based on the real use case for raw-string literals).

Use "c:/" or "c:\\". Raw string literals are for escaping escape-sequences, not for including literal backslashes, though they do work that way, except in this exact case.

It's a known case, I think; better to use "c:\\" here.

From the documentation:
... a raw string cannot end in a single backslash (since the backslash would escape the following quote character).

Even with raw strings, \" causes the " not to be interpreted as the end of the string (though the backslash gets into your string), so r"foo\"bar" would be a legal string. This is convenient enough when writing regex but not great for writing paths.
This is not a big deal as most of the time you should be using os.path and other modules to deal with your paths.
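The workarounds mentioned in these answers can be sketched as follows (Python 3 assumed). The variable names are illustrative, and the os.path.join result depends on the platform's separator, so no output is claimed for it.

```python
import os

escaped = "c:\\"    # normal literal with an escaped backslash
spliced = r"c:" "\\"  # raw part followed by an adjacent normal literal
print(escaped == spliced, len(escaped))  # True 3

quoted = r"foo\"bar"  # legal raw string; the backslash stays in the value
print(len(quoted))    # 8

# Letting os.path build the path avoids typing the separator by hand
joined = os.path.join("c:" + os.sep, "temp")
print(joined)
```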

Found in the Design and History FAQ: http://docs.python.org/faq/design.html#why-can-t-raw-strings-r-strings-end-with-a-backslash
Raw strings were designed to ease
creating input for processors (chiefly
regular expression engines) that want
to do their own backslash escape
processing. Such processors consider
an unmatched trailing backslash to be
an error anyway, so raw strings
disallow that. In return, they allow
you to pass on the string quote
character by escaping it with a
backslash. These rules work well when
r-strings are used for their intended
purpose.
