python regex substituting expression using a variable - python

What I am trying to achieve is to substitute a string using python regex with a variable (contents of the variable). Since I need to retain some of the matched expression, I use the \1 and \3 group match args.
My regex/sub looks like this:
pattern = "\1" + id + "\3"
out = re.sub(r'(;11=)(\w+)(;)',r'%s' % pattern, line)
What appears to be happening is \1 and \3 do not get added to the output.
I have also tried this with the substitution expression:
r'\1%s\3'%orderid
But I got similar results.
Any suggestion on what might fix this?

You need to use raw strings or double the backslashes:
pattern = r"\1" + id + r"\3"
or
pattern = "\\1" + id + "\\3"
In a regular Python string literal, \number is interpreted as an octal character code instead:
>>> '\1'
'\x01'
while the backslash has no special meaning in a raw string literal:
>>> r'\1'
'\\1'
Raw string literals are just a notation, not a type. Both r'' and '' produce strings, and only differ in how they interpret backslashes in source code.
Note that since group 1 and group 3 match literal text, you don't need to use backreferences at all; simply use:
out = re.sub(r';11=\w+;', ';11=%s;' % id, line)
or use look-behind and lookahead and forgo having to repeat the literals:
out = re.sub(r'(?<=;11=)\w+(?=;)', id, line)
Demo:
>>> import re
>>> line = 'foobar;11=spam;hameggs'
>>> id = 'monty'
>>> re.sub(r';11=\w+;', ';11=%s;' % id, line)
'foobar;11=monty;hameggs'
>>> re.sub(r'(?<=;11=)\w+(?=;)', id, line)
'foobar;11=monty;hameggs'
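One caveat with concatenating \1 and a variable: if the value begins with a digit, the digits run into the group number (\1 followed by 42 becomes \142), and any backslash in the value is read as an escape too. A minimal sketch of two ways around that, using the \g<1> form of a group reference or a replacement function. The name order_id and its value are hypothetical, chosen only to trigger the problem:

```python
import re

line = 'foobar;11=spam;hameggs'
order_id = '42'  # hypothetical value that starts with a digit

# \g<1> marks the end of the group number explicitly, so the digits of
# order_id cannot be mistaken for part of the reference.
template = r'\g<1>' + order_id + r'\g<3>'
out_template = re.sub(r'(;11=)(\w+)(;)', template, line)

# A replacement function sidesteps template parsing entirely: its return
# value is inserted as-is, so digits and backslashes in order_id are safe.
def replace_id(match):
    return match.group(1) + order_id + match.group(3)

out_function = re.sub(r'(;11=)(\w+)(;)', replace_id, line)

print(out_template)
print(out_function)
```

The function form is probably the safer default when the inserted value comes from outside the program.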

This isn't going to work:
pattern = "\1" + id + "\3"
# ...
r'%s' % pattern
The r prefix only affects how the literal is interpreted. So, r'%s' means that the % and s will be interpreted raw, but that's the same way they'd be interpreted without the r. Meanwhile, the pattern has non-raw literals "\1" and "\3", so it's already a control-A and a control-C before you even get to the %.
What you want is:
pattern = r"\1" + id + r"\3"
# ...
'%s' % pattern
However, you really don't need the % formatting at all; just use pattern itself and you'll get the exact same thing.

Related

Use regex to replace words before any digit with nothing

I want to replace words and spaces that appear before a digit in a string with nothing. For example, for the string = 'Juice of 1/2', I want to return '1/2'. I tried the following, but it did not work.
string = "Juice of 1/2"
new = string.replace(r"^.+?(?=\d)", "")
Also, I am trying to perform this on every cell of a list of columns using the following code. How would I incorporate the new regex pattern into the existing pattern of r"\(|\)|,"?
df[pd.Index(cols2) + "_clean"] = (
df[cols2]
.apply(lambda col: col.str.replace(r"\(|\)|,", "", regex=True))
)
You might be able to phrase this using str.extract:
df["col2"] = df["col2"].str.extract(r'([0-9/-]+)')
.+? will match anything, including other digits. It will also match the / in 1/2. Since you only want to replace letters and spaces, use [a-z\s]+.
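You also have to use re.sub(), not string.replace(), because str.replace() treats its first argument as literal text (in pandas, .str.replace() processes regular expressions when you pass regex=True, as your existing code does; older pandas versions did so by default).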
new = re.sub(r'[a-z\s]+(?=\d)', '', string, flags=re.I)
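To make the difference concrete, here is a small sketch using only the re module. The second sample string and the combined pattern are assumptions added for illustration; they show one possible way to fold the new rule into the existing cleanup pattern from the question:

```python
import re

text = "Juice of 1/2"

# str.replace() looks for the pattern as literal text, so nothing changes.
unchanged = text.replace(r"^.+?(?=\d)", "")

# re.sub() interprets the pattern; the lookahead keeps the digit itself.
cleaned = re.sub(r"[a-z\s]+(?=\d)", "", text, flags=re.I)

# Hypothetical combination with the existing pattern: strip parentheses,
# commas, and any run of letters/spaces that sits directly before a digit.
combined = r"\(|\)|,|[A-Za-z\s]+(?=\d)"
cleaned_all = re.sub(combined, "", "Juice of 1/2, (fresh)")

print(repr(unchanged))
print(repr(cleaned))
print(repr(cleaned_all))
```

Because the combined pattern spells out [A-Za-z] instead of relying on a flag, the same pattern string should carry over to the .str.replace(..., regex=True) call in the question.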
Maybe something like this will work.
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"[A-Za-z\s]+"
test_str = "Juice of 1/2 hede"
subst = ""
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0)
if result:
    print (result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Match literal string '\$'

I'm trying to match the literal string '\$'. I'm escaping both '\' and '$' with a backslash. Why isn't it working when I escape the backslash in the pattern? If I use a dot instead, it works.
import re
print re.match('\$','\$')
print re.match('\\\$','\$')
print re.match('.\$','\$')
Output:
None
None
<_sre.SRE_Match object at 0x7fb89cef7b90>
Can someone explain what's happening internally?
You should use the re.escape() function for this:
escape(string)
Return string with all non-alphanumerics backslashed; this is useful
if you want to match an arbitrary literal string that may have regular
expression metacharacters in it.
For example:
import re
val = re.escape('\$') # val = '\\\$'
print re.match(val,'\$')
It outputs:
<_sre.SRE_Match object; span=(0, 2), match='\\$'>
This is equivalent to what @TigerhawkT3 mentioned in his answer.
Unfortunately, you need more backslashes. You need to escape them to indicate that they're literals in the string and get them into the expression, and then further escape them to indicate that they're literals instead of regex special characters. This is why raw strings are often used for regular expressions: the backslashes don't explode.
>>> import re
>>> print re.match('\$','\$')
None
>>> print re.match('\\\$','\$')
None
>>> print re.match('.\$','\$')
<_sre.SRE_Match object at 0x01E1F800>
>>> print re.match('\\\\\$','\$')
<_sre.SRE_Match object at 0x01E1F800>
>>> print re.match(r'\\\$','\$')
<_sre.SRE_Match object at 0x01E1F800>
r'string' is a raw string literal. Try adding the r prefix to your regex string.
Here are the same patterns with and without the raw prefix:
print( re.match(r'\\\$', '\$'))
<_sre.SRE_Match object; span=(0, 2), match='\\$'>
print( re.match('\\\$', '\$'))
None
(This is Python 3.)
In a (non-raw) string literal, backslash is special. It means the Python interpreter should handle the following character specially. For example "\n" is a string of length 1 containing the newline character. "\$" is not a recognized escape sequence, so Python keeps the backslash: it is a string of two characters, a backslash and a dollar sign (recent Python versions warn about such escapes). "\\$" is the same two-character string, written with an explicitly escaped backslash.
In regular expressions, the backslash also means the following character is to be handled specially, but in general the special meaning is different. In a regular expression, $ matches the end of a line, and \$ matches a dollar sign, \\ matches a single backslash, and \\$ matches a backslash at the end of a line.
So, when you do re.match('\$',s) the Python interpreter reads '\$' to construct the string object \$ (i.e., length 2), then passes that string object to re.match, which treats it as a pattern for a literal dollar sign. With re.match('\\$',s) Python makes the same string object \$ (length 2) and passes it to re.match.
To see what's actually being passed to re.match, just print it. For example:
pat = '\\$'
print "pat :" + pat + ":"
m = re.match(pat, s)
People usually use raw string literals to avoid the double-meaning of backslashes.
pat = r'\$' # same 2-character string as above
Thanks for the above answers. I am adding this answer because we don't have a short summary in the above answers.
The backslash \ needs to be escaped twice: once for the Python string literal and once for the regex engine.
Python translates \\ in a string literal into a single \, and the regex engine requires \\ to match a single \.
So to hand the regex engine \\ in order to match one \, we have to write \\\\ in the Python string literal.
\\\\ --> Python(string translation) ---> \\ ---> Regex Engine(translation) ---> \
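The two layers can be checked directly. This is a minimal sketch (the variable names are illustrative) that compares a normal literal, a raw literal, and the output of re.escape for the two-character target \$:

```python
import re

target = '\\$'       # two characters: a backslash and a dollar sign
escaped = '\\\\\\$'  # normal literal: each pair \\ collapses to one \
raw = r'\\\$'        # raw literal: written exactly as the regex engine sees it

# Both literals produce the same four-character pattern.
print(len(target), len(escaped), len(raw))
print(escaped == raw)

# re.escape() builds the same pattern from the target text.
print(re.escape(target) == raw)

# The pattern matches the whole target: a literal \ followed by a literal $.
print(re.fullmatch(raw, target) is not None)
```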
You have to use . as . matches any characters except newline.

Replacing Certain Parts of a String Python

I cannot seem to solve this. I have many different strings, and they are always different. I need to replace the ends of them, but they are always different lengths. Here is an example of a couple of strings:
string1 = "thisisnumber1(111)"
string2 = "itsraining(22252)"
string3 = "fluffydog(3)"
Now when I print these out it will of course print the following:
thisisnumber1(111)
itsraining(22252)
fluffydog(3)
What I would like it to print, though, is the following:
thisisnumber1
itsraining
fluffydog
I would like it to remove the part in the parentheses for each string, but I do not know how, since the lengths are always changing. Thank you.
You can use str.rsplit for this:
>>> string1 = "thisisnumber1(111)"
>>> string2 = "itsraining(22252)"
>>> string3 = "fluffydog(3)"
>>>
>>> string1.rsplit("(")
['thisisnumber1', '111)']
>>> string1.rsplit("(")[0]
'thisisnumber1'
>>>
>>> string2.rsplit("(")
['itsraining', '22252)']
>>> string2.rsplit("(")[0]
'itsraining'
>>>
>>> string3.rsplit("(")
['fluffydog', '3)']
>>> string3.rsplit("(")[0]
'fluffydog'
>>>
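str.rsplit splits the string from right-to-left rather than left-to-right like str.split. So, we split the string on ( and then retrieve the element at index 0 (the first element). This will be everything before the (...) at the end of each string. Note that the direction only makes a difference when you limit the number of splits: if a string could contain more than one (, call rsplit("(", 1) so that only the last ( is used.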
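A short sketch of why the maxsplit argument matters. The helper name strip_last_parens and the second sample string are hypothetical, added only to show a string with more than one opening parenthesis:

```python
def strip_last_parens(text):
    # maxsplit=1 splits only at the last "(", so earlier parentheses survive.
    return text.rsplit("(", 1)[0]

single = strip_last_parens("thisisnumber1(111)")
nested = strip_last_parens("fluffy(dog)(3)")

# Without maxsplit, rsplit splits at every "(", just like split would.
no_limit = "fluffy(dog)(3)".rsplit("(")[0]

print(single)
print(nested)
print(no_limit)
```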
Your other option is to use regular expressions, which can give you more precise control over what you want to get.
import re
regex = r"(.+)\(\d+\)"
print re.match(regex, string1).groups()[0] #returns thisisnumber1
print re.match(regex, string2).groups()[0] #returns itsraining
print re.match(regex, string3).groups()[0] #returns fluffydog
Breakdown of what's happening:
regex = r"(.+)\(\d+\)" is the regular expression, the formula for the string you're trying to find
.+ means match 1 or more character of any kind except newline
\d+ means match 1 or more digit
\( and \) are the "(" and ")" characters
putting .+ in parentheses puts that string sequence in a group, meaning that group of characters is one that you want to be able to access later on. We don't put the sequence \(\d+\) in a group because we don't care about those characters.
re.match(regex, string1).groups() returns a tuple of every substring in string1 that was captured by a group. Since there is only one group, you just access the 0th element.
There's a nice tutorial on regular expressions on Tutorials Point if you want to learn more.
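Note that re.match returns None when a string has no trailing (digits) part, so calling .groups() on the result would raise an error. A minimal sketch of a guarded version follows. The names TRAILING_NUMBER and strip_suffix are hypothetical, and the added $ anchor is an assumption that the parenthesised number is always at the very end:

```python
import re

# Compiled once; the $ anchors the parenthesised digits to the end of the string.
TRAILING_NUMBER = re.compile(r"(.+)\(\d+\)$")

def strip_suffix(text):
    match = TRAILING_NUMBER.match(text)
    # Fall back to the original text when there is nothing to strip.
    return match.group(1) if match else text

for sample in ("thisisnumber1(111)", "itsraining(22252)", "no parens here"):
    print(strip_suffix(sample))
```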
Since you say in a comment:
"all that will be in the parentheses will be numbers"
so you'll always have digits between your parens, I'd recommend taking a look at removing them with the regular expression module:
import re
string1 = "thisisnumber1(111)"
string2 = "itsraining(22252)"
string3 = "fluffydog(3)"
strings = string1, string2, string3
for s in strings:
    s_replaced = re.sub(
        r'''
        \(   # must escape the parens, since these are special characters in regex
        \d+  # one or more digits, 0-9
        \)
        ''',  # this regular expression will be replaced by the next argument
        '',  # replace the above with an empty string
        s,  # the string we're modifying
        flags=re.VERBOSE)  # verbose flag allows us to comment regex clearly; it must be passed as flags=, since the 4th positional argument is count
    print(s_replaced)
prints:
thisisnumber1
itsraining
fluffydog

Handling backreferences to capturing groups in re.sub replacement pattern

I want to take the string 0.71331, 52.25378 and return 0.71331,52.25378 - i.e. just look for a digit, a comma, a space and a digit, and strip out the space.
This is my current code:
coords = '0.71331, 52.25378'
coord_re = re.sub("(\d), (\d)", "\1,\2", coords)
print coord_re
But this gives me 0.7133,2.25378. What am I doing wrong?
You should be using raw strings for regex, try the following:
coord_re = re.sub(r"(\d), (\d)", r"\1,\2", coords)
With your current code, the backslashes in your replacement string are escaping the digits, so you are replacing all matches with the equivalent of chr(1) + "," + chr(2):
>>> '\1,\2'
'\x01,\x02'
>>> print '\1,\2'
,
>>> print r'\1,\2' # this is what you actually want
\1,\2
Any time you want to leave the backslash in the string, use the r prefix, or escape each backslash (\\1,\\2).
Python interprets the \1 as a character with ASCII value 1, and passes that to sub.
Use raw strings, in which Python doesn't interpret the \.
coord_re = re.sub(r"(\d), (\d)", r"\1,\2", coords)
This is covered right in the beginning of the re documentation, should you need more info.
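For comparison, a small sketch of the raw-string fix next to a variant that avoids backreferences altogether: lookarounds check for the digits without consuming them, so only the comma and space are replaced. The lookaround variant is an alternative added here, not something the question asked for:

```python
import re

coords = '0.71331, 52.25378'

# Raw strings keep \1 and \2 intact so re.sub sees them as group references.
with_groups = re.sub(r"(\d), (\d)", r"\1,\2", coords)

# Lookbehind/lookahead assert the surrounding digits without capturing them.
with_lookarounds = re.sub(r"(?<=\d), (?=\d)", ",", coords)

print(with_groups)
print(with_lookarounds)
```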

Regular expression to replace with XML node

I'm using Python to write a regular expression for replacing parts of the string with a XML node.
The source string looks like:
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace
And the result string should be like:
Hello
<replace name="str1"> this is to replace </replace>
<replace name="str2"> this is to replace </replace>
Can anyone help me?
What makes your problem a little bit tricky is that you want to match inside of a multiline string. You need to use the re.MULTILINE flag to make that work.
Then, you need to match some groups inside your source string, and use those groups in the final output. Here is code that works to solve your problem:
import re
s_pat = r"^\s*REPLACE\(([^)]+)\)(.*)$"
pat = re.compile(s_pat, re.MULTILINE)
s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
def mksub(m):
    return '<replace name="%s">%s</replace>' % m.groups()
s_output = re.sub(pat, mksub, s_input)
The only tricky part is the regular expression pattern. Let's look at it in detail.
^ matches the start of a string. With re.MULTILINE, this matches the start of a line within a multiline string; in other words, it matches right after a newline in the string.
\s* matches optional whitespace.
REPLACE matches the literal string "REPLACE".
\( matches the literal string "(".
( begins a "match group".
[^)] means "match any character but a ")"".
+ means "match one or more of the preceding pattern".
) closes a "match group".
\) matches the literal string ")"
(.*) is another match group containing ".*".
$ matches the end of a string. With re.MULTILINE, this also matches the end of each line within a multiline string; in other words, it matches just before a newline character in the string, without consuming the newline.
. matches any character, and * means to match zero or more of the preceding pattern. Thus .* matches anything, up to the end of the line.
So, our pattern has two "match groups". When you run re.sub() it will make a "match object" which will be passed to mksub(). The match object has a method, .groups(), that returns the matched substrings as a tuple, and that gets substituted in to make the replacement text.
EDIT: You actually don't need to use a replacement function. You can put the special string \1 inside the replacement text, and it will be replaced by the contents of match group 1. (Match groups count from 1; the special match group 0 corresponds to the entire string matched by the pattern.) The only tricky part of the \1 string is that \ is special in strings. In a normal string, to get a \, you need to put two backslashes in a row, like so: "\\1". But you can use a Python "raw string" to conveniently write the replacement pattern. Doing so, you get this:
import re
s_pat = r"^\s*REPLACE\(([^)]+)\)(.*)$"
pat = re.compile(s_pat, re.MULTILINE)
s_repl = r'<replace name="\1">\2</replace>'
s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
s_output = re.sub(pat, s_repl, s_input)
Here is an excellent tutorial on how to write regular expressions in Python.
Here is a solution using pyparsing. I know you specifically asked about a regex solution, but if your requirements change, you might find it easier to expand a pyparsing parser. Or a pyparsing prototype solution might give you a little more insight into the problem leading toward a regex or other final implementation.
src = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace
"""
from pyparsing import Suppress, Word, alphas, alphanums, restOfLine
LPAR,RPAR = map(Suppress,"()")
ident = Word(alphas, alphanums)
replExpr = "REPLACE" + LPAR + ident("name") + RPAR + restOfLine("body")
replExpr.setParseAction(
lambda toks : '<replace name="%(name)s">%(body)s </replace>' % toks
)
print replExpr.transformString(src)
In this case, you create the expression to be matched with pyparsing, define a parse action to do the text conversion, and then call transformString to scan through the input source to find all the matches, apply the parse action to each match, and return the resulting output. The parse action serves a similar function to mksub in @steveha's solution.
In addition to the parse action, pyparsing also supports naming individual elements of the expression - I used "name" and "body" to label the two parts of interest, which are represented in the re solution as groups 1 and 2. You can name groups in an re, the corresponding re would look like:
s_pat = r"^\s*REPLACE\((?P<name>[^)]+)\)(?P<body>.*)$"
Unfortunately, to access these groups by name, you have to invoke the group() method on the re match object, you can't directly do the named string interpolation as in my lambda parse action. But this is Python, right? We can wrap that callable with a class that will give us dict-like access to the groups by name:
class CallableDict(object):
    def __init__(self,fn):
        self.fn = fn
    def __getitem__(self,name):
        return self.fn(name)
def mksub(m):
    return '<replace name="%(name)s">%(body)s</replace>' % CallableDict(m.group)
s_output = re.sub(pat, mksub, s_input)
Using CallableDict, the string interpolation in mksub can now call m.group for each field, by making it look like we are retrieving the ['name'] and ['body'] elements of a dict.
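As an alternative sketch, the re module itself offers two shortcuts for named groups: the match object's groupdict() method returns a plain dict that works with %(name)s interpolation, and the replacement template accepts \g<name> references. The code below reuses the pattern and input from this answer, written for Python 3:

```python
import re

pat = re.compile(r"^\s*REPLACE\((?P<name>[^)]+)\)(?P<body>.*)$", re.MULTILINE)

s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""

# groupdict() maps each group name to its matched text.
def mksub(m):
    return '<replace name="%(name)s">%(body)s</replace>' % m.groupdict()

via_function = pat.sub(mksub, s_input)

# The same substitution expressed as a template with named references.
via_template = pat.sub(r'<replace name="\g<name>">\g<body></replace>', s_input)

print(via_function)
```

Either form avoids the wrapper class; the function form leaves room for extra processing of the groups, such as escaping characters that are special in XML.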
Maybe like this?
import re
mystr = """Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
prog = re.compile(r'REPLACE\((.*?)\)\s(.*)')
for line in mystr.split("\n"):
    print prog.sub(r'< replace name="\1" > \2',line)
Something like this should work:
import re,sys
f = open( sys.argv[1], 'r' )
for i in f:
    g = re.match( r'REPLACE\((.*)\)(.*)', i )
    if g is None:
        print i
    else:
        print '<replace name=\"%s\">%s</replace>' % (g.group(1),g.group(2))
f.close()
import re
a="""Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
regex = re.compile(r"^REPLACE\(([^)]+)\)\s+(.*)$", re.MULTILINE)
b=re.sub(regex, r'< replace name="\1" > \2 < /replace >', a)
print b
will do the replace in one line.
