Replace based on several regex rules in Python

I want to use, for example, these patterns
rules = {
'\s': '_',
'.(?P<word>\w)': '\1',
'text1': 'text2',
#etc
}
using re.sub()
There are some examples like this, but they don't work with regex special characters.

I use raw strings when writing regexes in Python. It saves you from having to double the backslashes in the pattern: https://docs.python.org/2/library/re.html
Try:
rules = {
r"\s": r"_",
r"text1": r"text2",
#etc
}

You should use raw strings like so:
rules = {
r'\s': r'_',
r'.(?P<word>\w)': r'\1',
r'text1': r'text2',
#etc
}
That way you don't need to double the backslashes: Python passes them to the re module unchanged.
Here is why it happens (direct quote from the docs):
Regular expressions use the backslash character ('\') to indicate
special forms or to allow special characters to be used without
invoking their special meaning. This collides with Python’s usage of
the same character for the same purpose in string literals; for
example, to match a literal backslash, one might have to write '\\\\'
as the pattern string, because the regular expression must be \\, and
each backslash must be expressed as \\ inside a regular Python string
literal.
And how to solve it (another quote from the docs):
The solution is to use Python’s raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with 'r'. So r"\n" is a two-character string
containing '\' and 'n', while "\n" is a one-character string
containing a newline. Usually patterns will be expressed in Python
code using this raw string notation.
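
To see why the unprefixed '\1' from the question breaks, here is a minimal sketch (the sample string 'aabb' is made up for illustration). In a normal string literal, '\1' is an octal escape for the character with code 1, so re.sub() never receives a backreference:

```python
import re

# In a normal literal, '\1' is an octal escape: one character, code point 1.
plain = '\1'
# In a raw literal the backslash survives: two characters, '\' and '1'.
raw = r'\1'

print(len(plain), len(raw))

# With the raw replacement, re.sub() sees a backreference to group 1.
with_backref = re.sub(r'.(\w)', raw, 'aabb')
# With the plain replacement, each match is replaced by the control character.
with_control_char = re.sub(r'.(\w)', plain, 'aabb')

print(repr(with_backref))
print(repr(with_control_char))
```

The pattern is the same in both calls; only the replacement literal differs.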

Surely, you need to use raw strings when declaring Python regexes, and there are some issues with your examples, but you are interested in how to run the regex replacements.
I suggest using an OrderedDict so that the replacements are performed in a strict order, the order in which they were added to the dictionary. (Since Python 3.7 a plain dict also preserves insertion order.) The code will then look like
import re
from collections import OrderedDict  # keeps the rules in insertion order
rules = OrderedDict()  # maps regex pattern -> replacement
rules[r'\s'] = '-'
rules[r'.(\w)'] = r'\1'
rules['text1'] = 'text2'
s = "nnoo mmoorree tteexxtt11"  # a test string
for pattern, replacement in rules.items():  # apply each rule in order
    s = re.sub(pattern, replacement, s)  # perform the search and replace
print(s)  # demo printing
See the IDEONE demo
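
The same idea can be sketched with a list of (pattern, replacement) pairs, which makes the ordering explicit without a dictionary. The helper name apply_rules, the rules, and the sample text are all hypothetical; the named group is referenced with \g<word> in the replacement:

```python
import re

def apply_rules(text, rules):
    """Apply (pattern, replacement) pairs to text, in the order given."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

# Hypothetical rules, loosely based on the ones in the question.
rules = [
    (r'\s', '_'),                       # whitespace -> underscore
    (r'text1', 'text2'),                # plain literal replacement
    (r'<(?P<word>\w+)>', r'\g<word>'),  # named group, referenced by name
]

result = apply_rules("see text1 in <bold> tags", rules)
print(result)
```

Order matters here: each rule sees the output of the previous one, so a later pattern must be written against the already-modified text.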

Use raw string notation to avoid having to escape your special characters:
rules = {
r'\s': r'_',
r'.(?P<word>\w)': r'\1',
r'text1': r'text2',
#etc
}
Directly from the regular expression module (re) documentation:
Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. For example, the two following lines of code are functionally identical:
>>> re.match(r"\W(.)\1\W", " ff ")
<_sre.SRE_Match object at ...>
>>> re.match("\\W(.)\\1\\W", " ff ")
<_sre.SRE_Match object at ...>
When one wants to match a literal backslash, it must be escaped in the regular expression. With raw string notation, this means r"\\". Without raw string notation, one must use "\\\\", making the following lines of code functionally identical:
>>> re.match(r"\\", r"\\")
<_sre.SRE_Match object at ...>
>>> re.match("\\\\", r"\\")
<_sre.SRE_Match object at ...>
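
A quick way to convince yourself of the equivalence is to compare the two literals directly. This is only an illustrative sketch; the sample string is made up:

```python
import re

# Both literals produce the same two-character pattern: backslash, backslash.
raw_pattern = r"\\"
plain_pattern = "\\\\"

# A sample string containing exactly one backslash: a\b
sample = "a\\b"

raw_match = re.search(raw_pattern, sample)
plain_match = re.search(plain_pattern, sample)

print(raw_pattern == plain_pattern)
print(raw_match.group() == plain_match.group())
```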

Related

Search a regex containing special characters

I want to find out whether the pattern $*.* occurs in a text.
But I cannot figure out how to do that with regular expressions in Python.

Escape the dollar and the dot:
re.search(r'\$\.', inputstring)
Rules of thumb:
Use a raw string literal, so you don't have to double the backslashes (both Python and regular expressions derive meaning from the backslash)
When in doubt, escape the character to make it match a literal character.
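
If you would rather not escape by hand, re.escape() can do it for you. A minimal sketch, assuming the text to find is the literal two-character string "$." from the answer above:

```python
import re

literal = "$."                # the text we want to find verbatim
pattern = re.escape(literal)  # re.escape() adds the needed backslashes

found = re.search(pattern, "tes$.t") is not None
missing = re.search(pattern, "test") is not None

print(pattern, found, missing)
```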

Since you are looking for a specific substring, you don't even need a regular expression for this. This should do:
"$." in my_string
Example:
>>> "$." in "tes$.t"
True
>>> "$." in "test"
False

What does the "r" in Python's re.compile(r' pattern flags') mean?

I am reading through http://docs.python.org/2/library/re.html. According to this, the "r" in Python's re.compile(r' pattern flags') refers to raw string notation:
The solution is to use Python’s raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with 'r'. So r"\n" is a two-character string
containing '\' and 'n', while "\n" is a one-character string
containing a newline. Usually patterns will be expressed in Python
code using this raw string notation.
Would it be fair to say then that:
re.compile(r pattern) means that "pattern" is a regex, while re.compile(pattern) means that "pattern" is an exact match?

As @PauloBu stated, the r string prefix is not specifically related to regexes, but to strings generally in Python.
Normal strings use the backslash character as an escape character for special characters (like newlines):
>>> print('this is \n a test')
this is
a test
The r prefix tells the interpreter not to do this:
>>> print(r'this is \n a test')
this is \n a test
>>>
This is important in regular expressions, as you need the backslash to make it to the re module intact. In particular, \b matches the empty string at the start or end of a word. re expects the two-character string \b, but in a normal string literal '\b' is converted to the ASCII backspace character, so you need to either explicitly escape the backslash ('\\b') or tell Python it is a raw string (r'\b').
>>> import re
>>> re.findall('\b', 'test') # the backslash gets consumed by the python string interpreter
[]
>>> re.findall('\\b', 'test') # backslash is explicitly escaped and is passed through to re module
['', '']
>>> re.findall(r'\b', 'test') # often this syntax is easier
['', '']

No. As the documentation you pasted explains, the r prefix indicates that the string is a raw string.
Because both Python string escaping and regex escaping use the backslash character \, the two collide; raw strings provide a way to tell Python that you want the backslashes left unprocessed.
Examine the following:
>>> "\n"
'\n'
>>> r"\n"
'\\n'
>>> print("\n")
>>> print(r"\n")
\n
Prefixing a literal with r merely tells Python that backslashes \ should be treated literally and not as escape characters.
This is helpful when, for example, you are searching on a word boundary. The regex for this is \b; to get that into a normal Python string, I'd need to write "\\b" as the pattern. Instead, I can use the raw string r"\b" as the pattern.
This becomes especially handy when trying to find a literal backslash. To match a backslash, the regex itself must be \\; in a normal Python string each of those backslashes must be escaped again, so the pattern becomes "\\\\", or the much simpler r"\\".
As you can guess in longer and more complex regexes, the extra slashes can get confusing, so raw strings are generally considered the way to go.
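
The word-boundary case can be checked with a short sketch (the sample text is made up for illustration):

```python
import re

text = "cat concat cat."

# Raw string: the regex engine receives \b and treats it as a word boundary.
with_raw = re.findall(r"\bcat\b", text)
# Normal string: "\b" becomes a backspace character, which never occurs in text.
with_plain = re.findall("\bcat\b", text)
# Doubling the backslashes by hand is equivalent to the raw string.
with_doubled = re.findall("\\bcat\\b", text)

print(with_raw, with_plain, with_doubled)
```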

No. Not everything in regex syntax needs to be preceded by \, so ., *, +, etc. still have special meaning in a pattern.
The r'' prefix is often used as a convenience for regexes that need a lot of \, as it prevents the clutter of doubling up the \.
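
A small sketch of that point, with made-up sample text: the r prefix changes nothing about how the regex engine treats the dot.

```python
import re

text = "abc a.c"

# The r prefix does not disable regex syntax: '.' still matches any character.
any_char = re.findall(r"a.c", text)
# Escaping the dot inside the regex is what makes it literal.
literal_dot = re.findall(r"a\.c", text)

print(any_char, literal_dot)
```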

Python - Should I be using string prefix r when looking for a period (full stop or .) using regex?

I would like to know why I get the same result with and without the string prefix "r" when looking for a period (full stop) using Python regex.
After reading a number of sources (links below) multiple times and experimenting in code, only to get the same result (again, see below), I am still unsure of:
What is the difference between using the string prefix "r" and not using it when looking for a period using regex?
Which way is considered the correct way of finding a period in a string using Python regex: with the string prefix "r" or without it?
re.compile("\.").sub("!", "blah.")
'blah!'
re.compile(r"\.").sub("!", "blah.")
'blah!'
re.compile(r"\.").search("blah.").group()
'.'
re.compile("\.").search("blah.").group()
'.'
Sources I have looked at:
Python docs: string literals
http://docs.python.org/2/reference/lexical_analysis.html#string-literals
Regular expression to replace "escaped" characters with their originals
Python regex - r prefix
r prefix is for raw strings
http://forums.udacity.com/questions/7000217/r-prefix-is-for-raw-strings

The raw string notation is just that, a notation to specify a string value. The notation results in different string values when it comes to backslash escapes recognized by the normal string notation. Because regular expressions also attach meaning to the backslash character, raw string notation is quite handy as it avoids having to use excessive escaping.
Quoting from the Python Regular Expression HOWTO:
The solution is to use Python’s raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions will often be written in Python code using this raw string notation.
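The \. combination has no special meaning in regular Python strings, so the backslash is kept and '\.' and r'\.' produce the same string value. Either works, although recent Python versions warn about unrecognized escapes such as '\.' in normal literals, so the raw form is the safer habit: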
>>> len('\.')
2
>>> len(r'\.')
2
Raw strings only make a difference when the backslash + other characters do have special meaning in regular string notation:
>>> '\b'
'\x08'
>>> r'\b'
'\\b'
>>> len('\b')
1
>>> len(r'\b')
2
The \b combination has special meaning; in a regular string it is interpreted as the backspace character. But regular expressions see \b as a word boundary anchor, so you'd have to use \\b in your Python string every time you wanted to use this in a regular expression. Using r'\b' instead makes it much easier to read and write your expressions.
The regular expression functions are passed string values; the result of Python interpreting your string literal. The functions do not know if you used raw or normal string literal syntax.
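
You can check this on the compiled pattern objects. A minimal sketch, using the doubled-backslash form for the normal literal:

```python
import re

raw = re.compile(r"\.")
doubled = re.compile("\\.")

# Both compiled objects carry the same two-character pattern string.
same_pattern = raw.pattern == doubled.pattern
replaced = raw.sub("!", "blah.")

print(same_pattern, replaced)
```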

Raw string notation in regular expressions

Python has this way of specifying a regular expression pattern in which special characters are not treated as special. From the docs:
So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline.
Why then does this work?
import re
print(re.split(r"\n", "1\n2\n3"))
The first argument should be "\" and "n", and the second one should contain two newlines. But it prints:
['1', '2', '3']

The first one does contain backslash-and-n, but in regular-expression-language, backslash-and-n means newline (just like it does in Python string syntax). That is, the string r"\n" does not contain an actual newline, but it contains something that tells the regular expression engine to look for actual newlines.
If you want to search for a backslash followed by n, you need to use r"\\n".
The point of the raw strings is that they block Python's basic interpretation of string escapes, allowing you to use the backslash for its regular-expression meaning. If you don't want the regular-expression meaning, you still have to use two backslashes, as in my example above. But without raw strings it would be even worse: if you wanted to search for literal backslash-n without a raw string, you'd have to use "\\\\n". If the raw string blocked interpretation of the regular expression special characters (so that plain "\n" really meant backslash-n), you wouldn't have any way of using the regular expression syntax at all.
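
The three cases can be put side by side in a short sketch (the sample strings are made up for illustration):

```python
import re

data = "1\n2\n3"  # contains two real newline characters

# The regex engine translates the two-character pattern \n into "newline".
split_raw = re.split(r"\n", data)
# Here Python already produced a newline character, which matches itself.
split_plain = re.split("\n", data)

# To match a literal backslash followed by n, escape the backslash for the regex.
literal_hits = re.findall(r"\\n", r"a\nb")  # the subject holds a, \, n, b

print(split_raw, split_plain, literal_hits)
```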

Python regular expression with string in it

I would like to match a string with something like:
re.match(r'<some_match_symbols><my_match><some_other_match_symbols>', mystring)
where my_match is a string I would like it to find. The problem is that it may be different from time to time, and it is stored in a variable. Would it be possible to add a variable to a regex?

Nothing prevents you from simply doing this:
re.match('<some_match_symbols>' + '<my_match>' + '<some_other_match_symbols>', mystring)
Regular expressions are nothing more than strings containing some special characters specific to the regular expression syntax. They are still strings, so you can do whatever you are used to doing with strings.
The r'…' syntax is, by the way, raw string syntax, which just prevents escape sequences inside the string literal from being evaluated. So r'\n' is the same as '\\n', a string containing a backslash and an n, while '\n' contains a line break.

import re
url = "www.dupe.com"
# Note: the dots in url are regex metacharacters here and match any character
expression = re.compile('<p>%s</p>' % url)
result = expression.match("<p>www.dupe.com</p>BBB")
if result:
    print(result.start(), result.end())

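The r'' notation only applies to string literals written in source code. To build a pattern from a variable, format or concatenate ordinary strings and pass the result to re.compile().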
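
One caveat when inserting a variable into a pattern: any regex metacharacters in it (such as the dots in a URL) keep their special meaning unless you escape them. A minimal sketch using re.escape(); the helper name paragraph_pattern is hypothetical:

```python
import re

def paragraph_pattern(url):
    """Compile a pattern matching <p>url</p>, treating url as literal text."""
    # paragraph_pattern is a hypothetical helper, not part of the re module.
    return re.compile("<p>" + re.escape(url) + "</p>")

expression = paragraph_pattern("www.dupe.com")
exact = expression.match("<p>www.dupe.com</p>BBB")
lookalike = expression.match("<p>wwwXdupeYcom</p>")

if exact:
    print(exact.start(), exact.end())
print(lookalike)
```

Without re.escape(), the second string would match as well, because each unescaped dot matches any character.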
