Python Regex MULTILINE option not working correctly? - python

I'm writing a simple version updater in Python, and the regex engine is giving me mighty troubles.
In particular, ^ and $ aren't matching correctly even with re.MULTILINE option. The string matches without the ^ and $, but no joy otherwise.
I would appreciate your help if you can spot what I'm doing wrong.
Thanks
target.c
somethingsomethingsomething
NOTICE_TYPE revision[] = "A_X1_01.20.00";
somethingsomethingsomething
versionUpdate.py
fileName = "target.c"
newVersion = "01.20.01"
find = '^(\s+NOTICE_TYPE revision\[\] = "A_X1_)\d\d+\.\d\d+\.\d\d+(";)$'
replace = "\\1" + newVersion + "\\2"
file = open(fileName, "r")
fileContent = file.read()
file.close()
find_regexp = re.compile(find, re.MULTILINE)
file = open(fileName, "w")
file.write( find_regexp.sub(replace, fileContent) )
file.close()
Update: Thank you John and Ethan for a valid point. However, the regexp still isn't matching if I keep $. It works again as soon as I remove $.

Change your replace to:
replace = r'\g<1>' + newVersion + r'\2'
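The problem is that your version produces this replacement string:
replace = "\\101.20.01\\2"
This confuses the sub call: the digits run together, so \101 is no longer read as group 1 followed by the literal 01 (Python treats the three digits as an octal escape, the character A, rather than a group reference). From the documentation for the Python re module: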
\g<number> uses the corresponding group number; \g<2> is therefore
equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0.
\20 would be interpreted as a reference to group 20, not a reference
to group 2 followed by the literal character '0'.
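As a minimal sketch of the fix, the snippet below applies the \g<1> form to an in-memory string rather than to a file. The sample text and its leading indentation are assumptions made for illustration (the pattern's \s+ needs at least one whitespace character before NOTICE_TYPE), and the variable names are illustrative only.

```python
import re

# Hypothetical stand-in for the contents of target.c (the indentation is assumed).
file_content = (
    "somethingsomethingsomething\n"
    '    NOTICE_TYPE revision[] = "A_X1_01.20.00";\n'
    "somethingsomethingsomething\n"
)
new_version = "01.20.01"

find_regexp = re.compile(
    r'^(\s+NOTICE_TYPE revision\[\] = "A_X1_)\d\d+\.\d\d+\.\d\d+(";)$',
    re.MULTILINE,
)

# \g<1> ends the group reference explicitly, so the digits of new_version
# that follow it are treated as literal text.
replace = r"\g<1>" + new_version + r"\g<2>"
updated = find_regexp.sub(replace, file_content)
print(updated)
```

Using \g<2> for the second group is not required here, since \2 is followed by nothing, but it keeps both references in the same unambiguous form.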

If you print replace, you'll see the problem:
replace == '\\101.20.01\\2'
Since the 1 is followed directly by 01, the reference to group 1 is lost and the first portion of your line disappears. Try this instead:
newVersion = "_01.20.01"
find = r'^(\s+NOTICE_TYPE revision\[\] = "A_X1)_\d\d+\.\d\d+\.\d\d+(";)$'
replace = "\\1" + newVersion + "\\2"
(This moves the underscore out of group 1 and into newVersion, so the group reference is no longer followed directly by a digit.)

Related

How to search for a pattern that contains a variable using regex in python?

I have multiple files, each containing multiple strings like this one:
Species_name_ID:0.0000010229,
I need to find the string with a specific 'Species_name_ID', that I ask the user to provide, and do a simple replacement so that it now reads:
Species_name_ID:0.0000010229 #1,
I'm stuck at the first part, trying to look for the pattern. I've tried looking only for the numeric pattern at the end with this, and it returns a list of all the instances in which the pattern appears:
my_regex = r':0\.\d{10}'
for line in input_file:
    sp = re.findall(my_regex, line)
    print(sp)
However, when I try adding the rest by using the string the user provides, it doesn't work and returns an empty list.
search = input("Insert the name of the species: ")
my_regex = f"{search}:0\.\d{{10}}"
for line in input_file:
    sp = re.findall(my_regex, line)
    print(sp)
I've also tried the following syntax for defining the variable (all come from this previous question How to use a variable inside a regular expression?):
my_regex = f"{search}"
my_regex = f"{search}" + r':0\.\d{10}'
my_regex = search + r':0\.\d{10}'
my_regex = re.compile(re.escape(search) + r':0\.\d{10}')
my_regex = r'%s:0\.\d{10}'%search
my_regex = r"Drosophila_melanogaster_12215" + r':0\.\d{10}'
Even when I try searching for the specified string, it doesn't find it in the file even when there are multiple hits it could make.
my_regex = Drosophila_melanogaster_12215
What am I missing?
This should work for you:
import re
search = input("Insert the name of the species: ")
my_regex = fr"{re.escape(search)}:0\.\d{{10}}"
for line in input_file:
    print( re.findall(my_regex, line) )
Escape the user-supplied value with re.escape before placing it inside a regular expression.
Double the curly braces in an f-string if you want literal curly braces in the pattern.
Use a raw string literal for your regular expressions.
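The same three points also cover the replacement step the question asks for. The sketch below is one possible way to do it, assuming the tag is always appended right after the ten-digit number; the function name tag_species and its tag parameter are hypothetical, not part of the original code.

```python
import re

def tag_species(line, species_id, tag="#1"):
    """Append tag after 'species_id:0.<ten digits>' wherever it occurs in line."""
    # re.escape keeps characters such as '.' in the user input literal;
    # the doubled braces produce a literal {10} inside the f-string.
    pattern = re.compile(fr"({re.escape(species_id)}:0\.\d{{10}})")
    # A function replacement avoids any backslash handling in the tag text.
    return pattern.sub(lambda m: f"{m.group(1)} {tag}", line)

line = "Species_name_ID:0.0000010229,"
print(tag_species(line, "Species_name_ID"))
```

Because the identifier is escaped, an input such as Sp.name only matches a literal dot and does not accidentally match Spxname.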

Implement regular expression in Python to replace every occurence of "meshname = x" in a text file

I want to replace with " " every line in a text file that starts with "meshname = " and ends with any combination of letters, digits, and underscores. I used regexes in CS but I never really understood the different notations in Python. Can you help me with that?
Is this the right regex for my problem, and how would I transform it into a Python regex?
m.e.s.h.n.a.m.e.' '.=.' '.{{_}*,{0,...,9}*,{a,...,z}*,{A,...,Z}*}*
x.y = Concatenation of x and y
' ' = whitespace
{x} = set containing x
x* = x.x.x. ... .x or empty word
What would the script look like in order to replace every string/line in a file containing meshname = ... with the Python regex? Something like this?
fin = open("test.txt", 'r')
data = fin.read()
data = data.replace("^meshname = [[a-z]*[A-Z]*[0-9]*[_]*]+", "")
fin.close()
fin = open("test.txt", 'w')
fin.write(data)
fin.close()
or is this completely wrong? I've tried to get it working with this approach, but somehow it never matched the right string: How to input a regex in string.replace?
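Following the current code logic, you can use
data = re.sub(r'^meshname = .*\w$', ' ', data, flags=re.M)
(with import re at the top of the script). str.replace only replaces literal text and does not understand regular expressions, which is why the original attempt never matched. re.sub will replace with a space any line that matches
^ - line start (note the flags=re.M argument, which turns on multiline mode so that ^ and $ match at every line)
meshname =  - the literal text "meshname = ", including the spaces around =
.* - any zero or more chars other than line break chars, as many as possible
\w - a letter, digit, or underscore
$ - line end.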
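A small sketch of that substitution on an in-memory string, so the effect is visible without touching a file. The sample text is made up for illustration.

```python
import re

# Hypothetical file contents; only the "meshname = ..." lines should change.
data = "meshname = cube_01\nvertices = 8\nmeshname = Sphere2\n"

# re.M makes ^ and $ match at the start and end of every line.
cleaned = re.sub(r'^meshname = .*\w$', ' ', data, flags=re.M)
print(repr(cleaned))
```

Each matching line is replaced by a single space while its line break is kept, because $ matches just before the newline and . does not consume it.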

Python pattern to replace words between single or double quotes

I am new to Python and pretty bad with regex.
My requirement is to modify a pattern in existing code.
I have extracted the code that I am trying to fix.
def replacer_factory(spelling_dict):
    def replacer(match):
        word = match.group()
        return spelling_dict.get(word, word)
    return replacer
def main():
    repkeys = {'modify': 'modifyNew', 'extract': 'extractNew'}
    with open('test.xml', 'r') as file :
        filedata = file.read()
    pattern = r'\b\w+\b' # this pattern matches whole words only
    #pattern = r'[\'"]\w+[\'"]'
    #pattern = r'["]\w+["]'
    #pattern = '\b[\'"]\w+[\'"]\b'
    #pattern = '(["\'])(?:(?=(\\?))\2.)*?\1'
    replacer = replacer_factory(repkeys)
    filedata = re.sub(pattern, replacer, filedata)
if __name__ == '__main__':
    main()
Input
<fn:modify ele="modify">
<fn:extract name='extract' value="Title"/>
</fn:modify>
Expected output. Please note that the replacement words can be enclosed within single or double quotes.
<fn:modify ele="modifyNew">
<fn:extract name='extractNew' value="Title"/>
</fn:modify>
The existing pattern r'\b\w+\b' results in for example <fn:modifyNew ele="modifyNew">, but what I am looking for is <fn:modify ele="modifyNew">
The patterns I have attempted so far are given as comments. I realized late that a couple of them are wrong, since string literals prefixed with r handle backslashes specially. I am still including them to show what I have attempted so far.
It would be great to get a pattern that solves this rather than changing the logic. If this cannot be achieved with the existing code, please point that out as well. The environment I work in has Python 2.6.
Any help is appreciated.
You need to use r'''(['"])(\w+)\1''' regex, and then you need to adapt the replacer method:
def replacer_factory(spelling_dict):
    def replacer(match):
        return '{0}{1}{0}'.format(match.group(1), spelling_dict.get(match.group(2), match.group(2)))
    return replacer
The word you match with (['"])(\w+)\1 is either in double, or in single quotes, but the value is in Group 2, hence the use of spelling_dict.get(match.group(2), match.group(2)). Also, the quotes must be put back, hence the '{0}{1}{0}'.format().
See the Python demo:
import re
def replacer_factory(spelling_dict):
    def replacer(match):
        return '{0}{1}{0}'.format(match.group(1), spelling_dict.get(match.group(2), match.group(2)))
    return replacer
repkeys = {'modify': 'modifyNew', 'extract': 'extractNew'}
pattern = r'''(['"])(\w+)\1'''
replacer = replacer_factory(repkeys)
filedata = """<fn:modify ele="modify">
<fn:extract name='extract' value="Title"/>
</fn:modify>"""
print( re.sub(pattern, replacer, filedata) )
Output:
<fn:modify ele="modifyNew">
<fn:extract name='extractNew' value="Title"/>
</fn:modify>

python regex: pattern not found

I have a pattern compiled as
pattern_strings = ['\xc2d', '\xa0', '\xe7', '\xc3\ufffdd', '\xc2\xa0', '\xc3\xa7', '\xa0\xa0', '\xc2', '\xe9']
join_pattern = '|'.join(pattern_strings)
pattern = re.compile(join_pattern)
and then I find pattern in file as
def find_pattern(path):
    with open(path, 'r') as f:
        for line in f:
            print line
            found = pattern.search(line)
            if found:
                print dir(found)
                logging.info('found - ' + found)
and my input as path file is
\xc2d
d\xa0
\xe7
\xc3\ufffdd
\xc3\ufffdd
\xc2\xa0
\xc3\xa7
\xa0\xa0
'619d813\xa03697'
When I run this program, nothing happens.
It is not able to catch these patterns. What am I doing wrong here?
Desired output
- each line because each line has one or the other matching pattern
Update
After changing the regex to
pattern_strings = ['\\xc2d', '\\xa0', '\\xe7', '\\xc3\\ufffdd', '\\xc2\\xa0', '\\xc3\\xa7', '\\xa0\\xa0', '\\xc2', '\\xe9']
It is still the same, no output
UPDATE
After changing the regex to
pattern_strings = ['\\xc2d', '\\xa0', '\\xe7', '\\xc3\\ufffdd', '\\xc2\\xa0', '\\xc3\\xa7', '\\xa0\\xa0', '\\xc2', '\\xe9']
join_pattern = '[' + '|'.join(pattern_strings) + ']'
pattern = re.compile(join_pattern)
Things started to work, but only partially. The patterns still not caught are for the lines
\xc2\xa0
\xc3\xa7
\xa0\xa0
for which my pattern string is ['\\xc2\\xa0', '\\xc3\\xa7', '\\xa0\\xa0']
Escape the \ in the search patterns,
either with a raw string (r"\xa0") or by doubling the backslash ("\\xa0").
Do this:
['\\xc2d', '\\xa0', '\\xe7', '\\xc3\\ufffdd', '\\xc2\\xa0', '\\xc3\\xa7', '\\xa0\\xa0', '\\xc2', '\\xe9']
as everyone has been saying, except the one person you listened to.
Does your file actually contain \xc2d --- that is, five characters: a backslash followed by c, then 2, then d? If so, your regex won't match it. Each of your regexes will match one or two characters with certain character codes. If you want to match the string \xc2d your regex needs to be \\xc2d.
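The distinction can be sketched as follows, assuming the file really does contain the literal text (a backslash followed by x, c, 2, and so on) rather than the characters those escapes stand for. The list is a subset of the question's patterns, written as raw strings, and the sketch uses Python 3 syntax; building the alternation with re.escape and sorting longest first is one possible approach, not the only one.

```python
import re

# Raw strings: each entry is literal text such as backslash, 'x', 'a', '0'.
literals = [r'\xc2d', r'\xa0', r'\xe7', r'\xc2\xa0', r'\xc3\xa7', r'\xa0\xa0', r'\xc2', r'\xe9']

# re.escape turns each backslash into an escaped backslash in the regex.
# Longest alternatives go first so that \xc2\xa0 is tried before \xc2.
ordered = sorted(literals, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(s) for s in ordered))

line = r"'619d813\xa03697'"              # literal text, as in the input file
char_pattern = re.compile('\xa0')        # matches the single character U+00A0

print(char_pattern.search(line))         # None: the line has no U+00A0 character
print(pattern.search(line).group())      # the literal text \xa0
print(pattern.search(r'\xc2\xa0').group())
```

A plain alternation is used here instead of wrapping the patterns in square brackets, since a character class matches single characters rather than whole sequences.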

python regex for repeating string

I am wanting to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print ' matched.groups()', matched.groups()
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overwrites the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ; and then use a regular expression that returns all matches, not just a single match, with re.findall('c?[0-9]+', text).
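A minimal sketch of that two-step approach, assuming the string must begin with start: and that the codes end at the first semicolon. The helper name parse_codes is hypothetical.

```python
import re

def parse_codes(text):
    """Return the list of codes, or None if text lacks the 'start: ... ;' frame."""
    # Step 1: verify the frame and extract everything between start: and ;
    framed = re.match(r"start:([^;]*);", text)
    if framed is None:
        return None
    # Step 2: collect every code from the extracted part.
    return re.findall(r"c?[0-9]+", framed.group(1))

string = "start: c12354, c3456, 34526; other stuff that I don't care about"
print(parse_codes(string))  # ['c12354', 'c3456', '34526']
```

Returning None for a string that fails verification keeps the "does it match at all" check separate from the "which codes does it contain" result.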
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # True if s starts with this string
s.endswith(";") # True if s ends with this string
s[6:-1].strip().split(', ') # list of tokens separated by ", " (strip drops the space after "start:")
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            # Each group is [prefix, digits, comma]; keep the digits
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException:
            # Oh noes: string doesn't match!
            continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.
import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
    res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']
