Python backreference replacing doesn't work as expected - python

There are two named groups in my pattern, myFlag and id. I want to insert one more copy of myFlag immediately before the id group.
Here is my current code:
# i'm using Python 3.4.2
import re
import os
contents = b'''
xdlg::xdlg(x_app* pApp, CWnd* pParent)
: customized_dlg((UINT)0, pParent, pApp)
, m_pReaderApp(pApp)
, m_info(pApp)
{
}
'''
pattern = rb'(?P<myFlag>[a-zA-Z0-9_]+)::(?P=myFlag).+:.+(?P<id>\(UINT\)0 *,)'
res = re.search(pattern, contents, re.DOTALL)
if None != res:
    print(res.groups()) # the output is (b'xdlg', b'(UINT)0,')
# 'replPattern' becomes b'(?P<myFlag>[a-zA-Z0-9_]+)::(?P=myFlag).+:.+((?P=myFlag)\\(UINT\\)0 *,)'
replPattern = pattern.replace(b'?P<id>', b'(?P=myFlag)', re.DOTALL)
print(replPattern)
contents = re.sub(pattern, replPattern, contents)
print(contents)
The expected results should be:
xdlg::xdlg(x_app* pApp, CWnd* pParent)
: customized_dlg(xdlg(UINT)0, pParent, pApp)
, m_pReaderApp(pApp)
, m_info(pApp)
{
}
but the result is currently the same as the original:
xdlg::xdlg(x_app* pApp, CWnd* pParent)
: customized_dlg((UINT)0, pParent, pApp)
, m_pReaderApp(pApp)
, m_info(pApp)
{
}

The issue appears to be the pattern syntax, particularly the end:
0 *,)
That part looks questionable to me. Fixing it seems to solve most of the issues, although I would recommend dropping DOTALL and using MULTILINE instead:
p = re.compile(r'([a-zA-Z0-9_]+)::\1(.*\n\W+:.*)(\(UINT\)0,.*)', re.MULTILINE)
sub = r"\1::\1\2\1\3"
result = re.sub(p, sub, s)  # s is the input text as a str
print(result)
Result:
xdlg::xdlg(x_app* pApp, CWnd* pParent)
: customized_dlg(xdlg(UINT)0, pParent, pApp)
, m_pReaderApp(pApp)
, m_info(pApp)
{
}
https://regex101.com/r/hG3lV7/1
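A separate point worth checking, offered as a sketch rather than a definitive fix: re.sub treats its second argument as a replacement template, not as a pattern, so (?P=myFlag) is not expanded there. In a template, named groups are written \g<name>. The example below keeps the bytes input from the question, assumes the sample was originally indented, and introduces a hypothetical extra group called head so that the text before id can be re-emitted.

```python
import re

contents = b'''
xdlg::xdlg(x_app* pApp, CWnd* pParent)
    : customized_dlg((UINT)0, pParent, pApp)
    , m_pReaderApp(pApp)
    , m_info(pApp)
{
}
'''

# "head" (a name made up for this sketch) captures everything before the id,
# so the replacement can put it back unchanged.
pattern = rb'(?P<head>(?P<myFlag>[a-zA-Z0-9_]+)::(?P=myFlag).+?:.+?)(?P<id>\(UINT\)0 *,)'

# Replacement template: groups are referenced as \g<name>, not (?P=name).
repl = rb'\g<head>\g<myFlag>\g<id>'

result = re.sub(pattern, repl, contents, count=1, flags=re.DOTALL)
print(result.decode())
```

Passing flags=re.DOTALL by keyword avoids the flag being taken for the positional count argument.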

Related

Python regex to find functions and params pairs in js files

I am writing a JavaScript crawler application.
The application needs to open JavaScript files and find some specific code in order to do some stuff with them.
I am using regular expressions to find the code of interest.
Consider the following JavaScript code:
let nlabel = rs.length ? st('string1', [st('string2', ctx = 'ctx2')], ctx = 'ctx1') : st('Found {0}', [st(this.param)]);
As you can see there is the st function which is called three times in the same line. The first two calls have an extra parameter named ctx but the third one doesn't have it.
What I need to do is to have 3 re matches as below:
Match 1
Group: function = "st('"
Group: string = "string1"
Group: ctx = "ctx1"
Match 2
Group: function = "st('"
Group: string = "string2"
Group: ctx = "ctx2"
Match 3
Group: function = "st('"
Group: string = "Found {0}"
Group: ctx = (None)
I am using the regex101.com to test my patterns and the pattern that gives the closest thing to what I am looking for is the following:
(?P<function>st\([\"'])(?P<string>.+?(?=[\"'](\s*,ctx\s*|\s*,\s*)))
You can see it in action here.
However, I have no idea how to make it return the ctx group the way I want it.
For your reference I am using the following Python code:
import re

matches = []
code = "let nlabel = rs.length ? st('string1', [st('string2', ctx = 'ctx2')], ctx = 'ctx1') : st('Found {0}', [st(this.param)], ctx = 'ctxparam'"
pattern = r"(?P<function>st\([\"'])(?P<string>.+?(?=[\"'](\s*,ctx\s*|\s*,\s*)))"
for m in re.compile(pattern).finditer(code):
    fnc = m.group('function')
    msg = m.group('string')
    ctx = m.group('ctx')  # raises IndexError: the pattern has no 'ctx' group yet
    idx = m.start()
    matches.append([idx, fnc, msg, ctx])
print(matches)
I have the feeling that re alone isn't capable of doing exactly what I am looking for, but any suggestion or solution that gets closer is more than welcome.
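One possible direction, sketched under assumptions: use re only to locate each st('...' opening, then walk the characters with a depth counter to collect the arguments that belong to that call, and search those for ctx. The names CALL, CTX and find_calls are made up for this illustration, and the sketch does not handle quoted strings that contain unbalanced parentheses or brackets.

```python
import re

# Opening of a call whose first argument is a quoted string: st('...' or st("..."
CALL = re.compile(r"\bst\(\s*(['\"])(?P<string>.*?)\1")
# A ctx = '...' argument
CTX = re.compile(r"\bctx\s*=\s*(['\"])(?P<ctx>.*?)\1")


def find_calls(code):
    """Return [index, string, ctx] per st('...') call; ctx is None if absent."""
    found = []
    for m in CALL.finditer(code):
        depth = 0
        own_args = []  # this call's argument text, nested (...) and [...] skipped
        for ch in code[m.start() + 2:]:  # start at the '(' right after "st"
            if ch in "([":
                depth += 1
            elif ch in ")]":
                depth -= 1
                if depth == 0:  # reached the ')' that closes this call
                    break
            elif depth == 1:
                own_args.append(ch)
        ctx = CTX.search("".join(own_args))
        found.append([m.start(), m.group("string"),
                      ctx.group("ctx") if ctx else None])
    return found


code = ("let nlabel = rs.length ? st('string1', [st('string2', ctx = 'ctx2')], "
        "ctx = 'ctx1') : st('Found {0}', [st(this.param)]);")
for idx, msg, ctx in find_calls(code):
    print(idx, msg, ctx)
```

The call st(this.param) is skipped on purpose, because its first argument is not a quoted string.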

Displaying information in a list using regex and .group()

I would like to display a list containing information from some text files using Python.
What I want to display :
[host_name, hardware_ethernet_value, fixed_address_value]
An example of a file (random examples):
host name-random-text.etc {
hardware ethernet 00:02:99:aa:xx:yc;
fixed-address 1.3.0.155;
}
host name-another-again.etc {
hardware ethernet 00:02:99:aa:xx:yc;
fixed-address 3.5.0.115;
}
Someone helped me write code for that, but it no longer works. I do know where the problem is coming from.
The code is as follows:
#!/usr/bin/env python
#import modules
import pprint
import re
#open a file
filedesc = open("DATA/fixed.10.3", "r")
#using regex expressions to get the different informations
SCAN = {
    'host' : r"^host (\S+) {",
    'hardware' : r"hardware ethernet (\S+);",
    'fixed-adress' : r"fixed adress (\S+);"
}
item = []
for key in SCAN:
    #print(key)
    regex = re.compile(SCAN[key])
    #print(regex)
    for line in filedesc:
        #print(line)
        match = regex.search(line)
        #print(match)
        #match = re.search(regex, line)
        #if match is not None:
        #print(match.group(1))
        if match is not None:
            #print(match.group(1))
            if match.group(1) == key:
                print(line)
                item += [match.group(2)]
                break
#print the final dictionnaries
pp=print(item)
#make sure to close the file after using it with file.close()
What should be expected :
match.group(1) = host
match.group(2) = name-random-text.etc
But what I have is match.group(1) = name-random-text.etc so match.group(2) = nothing here. This is why the condition match.group(1) == key never works, because match.group(1) never takes the values ['host', 'hardware ethernet', 'fixed-address'].
Your regular expression captures only one group.
If you want to match two groups, with group 1 equal to the SCAN key, you need to change SCAN like this:
SCAN = {
    'host' : r"^(host) (\S+) {",
    'hardware' : r"(hardware) ethernet (\S+);",
    'fixed-address' : r"(fixed-address) (\S+);"
}
A very simple but working solution:
#!/usr/bin/env python
#import modules
import pprint
import re
#open a file
file_lines = """
host name-random-text.etc {
hardware ethernet 00:02:99:aa:xx:yc;
fixed-address 1.3.0.155;
}
host name-another-again.etc {
hardware ethernet 00:02:99:aa:xx:yc;
fixed-address 3.5.0.115;
}
""".split('\n')
SCAN = {
'host': r"^host (\S+) {",
'hardware': r"hardware ethernet (\S+);",
'fixed_address': r"fixed-address (\S+);",
'end_item': r"^\s*}$\s*",
}
# Compile each pattern once, up front
for key, expr in SCAN.items():
    SCAN[key] = re.compile(expr)
items = []
item = {}
for line in file_lines:
    for key, expr in SCAN.items():
        m = expr.search(line)
        if not m:
            continue
        if key == 'end_item':
            # closing brace: the current host block is complete
            items.append(item)
            print("Current item", [item.get(x) for x in ['host', 'hardware', 'fixed_address']])
            item = {}
        else:
            item[key] = m.group(1)
print("Full list of items")
pprint.pprint(items)
Output:
Current item ['name-random-text.etc', '00:02:99:aa:xx:yc', '1.3.0.155']
Current item ['name-another-again.etc', '00:02:99:aa:xx:yc', '3.5.0.115']
Full list of items
[{'fixed_address': '1.3.0.155',
'hardware': '00:02:99:aa:xx:yc',
'host': 'name-random-text.etc'},
{'fixed_address': '3.5.0.115',
'hardware': '00:02:99:aa:xx:yc',
'host': 'name-another-again.etc'}]
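As an alternative sketch, if every host block lists hardware ethernet before fixed-address (an assumption, since only two sample blocks are shown), a single multi-line pattern can produce the [host_name, hardware_ethernet_value, fixed_address_value] lists directly. The names BLOCK and entries are made up for this example, and the sample text is inlined instead of being read from a file.

```python
import re

text = """
host name-random-text.etc {
    hardware ethernet 00:02:99:aa:xx:yc;
    fixed-address 1.3.0.155;
}
host name-another-again.etc {
    hardware ethernet 00:02:99:aa:xx:yc;
    fixed-address 3.5.0.115;
}
"""

# One pattern for a whole block; \s* also absorbs the newlines between fields.
BLOCK = re.compile(
    r"^host\s+(?P<host>\S+)\s*\{"
    r"\s*hardware\s+ethernet\s+(?P<hardware>\S+);"
    r"\s*fixed-address\s+(?P<address>\S+);"
    r"\s*\}",
    re.MULTILINE,
)

entries = [
    [m.group("host"), m.group("hardware"), m.group("address")]
    for m in BLOCK.finditer(text)
]
print(entries)
```

Blocks whose fields come in a different order would be skipped by this pattern, which is where the line-by-line approach above is more forgiving.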

Converting part of string into variable name in python

I have a file containing a text like this:
loadbalancer {
upstream application1 {
server 127.0.0.1:8082;
server 127.0.0.1:8083;
server 127.0.0.1:8084;
}
upstream application2 {
server 127.0.0.1:8092;
server 127.0.0.1:8093;
server 127.0.0.1:8094;
}
}
Does anyone know how I could extract variables like the ones below?
appList=["application1","application2"]
ServerOfapp1=["127.0.0.1:8082","127.0.0.1:8083","127.0.0.1:8084"]
ServerOfapp2=["127.0.0.1:8092","127.0.0.1:8093","127.0.0.1:8094"]
.
.
.
and so on
If the lines you want always start with upstream and server this should work:
app_dic = {}
with open('file.txt','r') as f:
    for line in f:
        if line.startswith('upstream'):
            app_i = line.split()[1]
            server_of_app_i = []
            for line in f:
                if not line.startswith('server'):
                    break
                server_of_app_i.append(line.split()[1][:-1])
            app_dic[app_i] = server_of_app_i
app_dic should then be a dictionary of lists:
{'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084'],
'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094']}
EDIT
If the input file does not contain any newline character, as long as the file is not too large you could write it to a list and iterate over it:
app_dic = {}
with open('file.txt','r') as f:
    txt_iter = iter(f.read().split()) #iterator of list
    for word in txt_iter:
        if word == 'upstream':
            app_i = next(txt_iter)
            server_of_app_i=[]
            for word in txt_iter:
                if word == 'server':
                    server_of_app_i.append(next(txt_iter)[:-1])
                elif word == '}':
                    break
            app_dic[app_i] = server_of_app_i
This is uglier, because one has to search for the closing curly bracket in order to break. If it gets any more complicated, a regex should be used.
If you are able to use the newer regex module by Matthew Barnett, you can use the following solution, see an additional demo on regex101.com:
import regex as re
rx = re.compile(r"""
(?:(?P<application>application\d)\s{\n| # "application" + digit + { + newline
(?!\A)\G\n) # assert that the next match starts here
server\s # match "server"
(?P<server>[\d.:]+); # followed by digits, . and :
""", re.VERBOSE)
string = """
loadbalancer {
upstream application1 {
server 127.0.0.1:8082;
server 127.0.0.1:8083;
server 127.0.0.1:8084;
}
upstream application2 {
server 127.0.0.1:8092;
server 127.0.0.1:8093;
server 127.0.0.1:8094;
}
}
"""
result = {}
for match in rx.finditer(string):
    if match.group('application'):
        # a new application block starts here
        current = match.group('application')
        result[current] = list()
    if current:
        result[current].append(match.group('server'))
print(result)
# {'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094'], 'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084']}
This makes use of the \G anchor, named capture groups and some programming logic.
This is the basic method:
import re

# each of your objects here
objText = "xyz xcyz 244.233.233.2:123"
# Python patterns take no /.../g delimiters; findall already returns every match
listOfAll = re.findall(r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?):[0-9]{1,5}", objText)
for eachMatch in listOfAll:
    print("Here's one! %s" % eachMatch)
Obviously that's a bit rough around the edges, but it will perform a full-scale regex search of whatever string it's given. Probably a better solution would be to pass it the objects themselves, but for now I'm not sure what you would have as raw input. I'll try to improve on the regex, though.
I believe this can be solved with re as well:
>>> import re
>>> from collections import defaultdict
>>>
>>> APP = r'\b(?P<APP>application\d+)\b'
>>> IP = r'server\s+(?P<IP>[\d\.:]+);'
>>>
>>> pat = re.compile('|'.join([APP, IP]))
>>>
>>>
>>> scan = pat.scanner(s)
>>> d = defaultdict(list)
>>>
>>> for m in iter(scan.search, None):
...     group = m.lastgroup
...     if group == 'APP':
...         keygroup = m.group(group)
...         continue
...     else:
...         d[keygroup].append(m.group(group))
...
>>> d
defaultdict(<class 'list'>, {'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084'], 'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094']})
Or similarly with re.finditer method and without pat.scanner:
>>> for m in re.finditer(pat, s):
...     group = m.lastgroup
...     if group == 'APP':
...         keygroup = m.group(group)
...         continue
...     else:
...         d[keygroup].append(m.group(group))
...
>>> d
defaultdict(<class 'list'>, {'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084'], 'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094']})

Delete a repeating pattern in a string using Python

I have a JavaScript file with an array of data.
info = [ {
Date = "YR-MM-DDT00:00:10"
}, ....
What I'm trying to do is remove the T and everything after it in the Date field.
Here's what I've tried:
import re
with open ("info.js","r") as myFile:
    data= myFile.read();
data= re.sub('\0-9T,'',data);
Desired output for each Date field in the array:
Date = "YR-MM-DD"
You should match the T and the characters that come after it. This works for a single timestamp:
import re
print(re.sub('T.*$', '', 'YR-MM-DDT00:00:10'))
Or if you have text containing a bunch of timestamps, match the closing double quote as well, and replace with a double quote:
import re
text = """
info = [ {
Date = "YR-MM-DDT00:00:10",
Date = "YR-MM-DDT01:02:03",
Date = "YR-MM-DDT11:22:33"
}
"""
new_text = re.sub('T.*"', '"', text)
print(new_text)
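One caveat with 'T.*"' is that it also fires on any other line that contains a capital T before a double quote. A slightly stricter sketch, assuming the real file holds numeric dates such as 2015-03-01T00:00:10 (the sample values and the Title field below are invented for illustration), ties the T to the date in front of it:

```python
import re

# Hypothetical sample data; the real file's values are not shown in the question.
text = '''
info = [ {
  Date = "2015-03-01T00:00:10",
  Title = "Test entry"
} ]
'''

# Keep the date (group 1) and drop the T plus the hh:mm:ss part that follows it.
new_text = re.sub(r'(\d{4}-\d{2}-\d{2})T\d{2}:\d{2}:\d{2}', r'\1', text)
print(new_text)
```

Because the pattern requires a date immediately before the T, the Title line is left alone.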

Python multiline regex works in shell but not in program

I'm trying to find and replace a multiline pattern in a JSON feed. Basically, I'm looking for a line ending "}," followed by a line with just "}".
Example input would be:
s = """
"essSurfaceFreezePoint": "1001",
"essSurfaceBlackIceSignal": "4"
},
}
}
"""
and I want to find:
"""
},
}
"""
and replace it with:
"""
}
}
"""
I've tried the following:
pattern = re.compile(r'^ *},\n^ *}$',flags=re.MULTILINE)
pattern.findall(feedStr)
This works in the python shell. However, when I do the same search in my python program, it finds nothing. I'm using the full JSON feed in the program. Perhaps it's getting a different line termination when reading the feed.
The feed is at:
http://hardhat.ahmct.ucdavis.edu/tmp/test.json
If anyone can point out why this is working in the shell, but not in the program, I'd greatly appreciate it. Is there a better way to formulate the regular expression, so it would work in both?
Thanks for any advice.
=====================================================================================
To make this clearer, I'm adding my test code here. Note that I'm now including the regular expression provided by Ahosan Karim Asik. This regex works in the live demo link below, but doesn't quite work for me in a python shell. It also doesn't work against the real feed.
Thanks again for any assistance.
import urllib2
import json
import re
if __name__ == "__main__":
    # wget version of real feed:
    # url = "http://hardhat.ahmct.ucdavis.edu/tmp/test.json"
    # Short text, for milepost and brace substitution test:
    url = "http://hardhat.ahmct.ucdavis.edu/tmp/test.txt"
    request = urllib2.urlopen(url)
    rawResponse = request.read()
    # print("Raw response:")
    # print(rawResponse)
    # Find extra comma after end of records:
    p1 = re.compile('(}),(\r?\n *})')
    l1 = p1.findall(rawResponse)
    print("Brace matches found:")
    print(l1)
    # Check milepost:
    #p2 = re.compile('( *\"milepost\": *\")')
    p2 = re.compile('( *\"milepost\": *\")([0-9]*\.?[0-9]*)\r?\n')
    l2 = p2.findall(rawResponse)
    print("Milepost matches found:")
    print(l2)
    # Do brace substitutions:
    subst = "\1\2"
    response = re.sub(p1, subst, rawResponse)
    # Do milepost substitutions:
    subst = "\1\2\""
    response = re.sub(p2, subst, response)
    print(response)
Try this:
import re
p = re.compile(r'(^ *}),(\n^ *})$', re.MULTILINE)
test_str = u" \"essSurfaceFreezePoint\": \"1001\",\n \"essSurfaceBlackIceSignal\": \"4\"\n },\n }\n }"
subst = r"\1\2"  # Python replacement strings use \1, not $1
result = re.sub(p, subst, test_str)
live demo
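Two things may be worth ruling out in the program, sketched here with made-up sample data rather than the real feed: the line endings and the replacement string. A pattern with \r?\n accepts both LF and CRLF, and the replacement needs the r prefix, since a plain "\1\2" is the two control characters \x01 and \x02 rather than two group references.

```python
import re

# Hypothetical excerpt with Windows-style (CRLF) line endings.
feed = '  "essSurfaceBlackIceSignal": "4"\r\n  },\r\n  }\r\n}\r\n'

# [ \t]* allows indentation or trailing blanks; \r?\n accepts LF and CRLF.
pattern = re.compile(r'(\}),([ \t]*\r?\n[ \t]*\})')

# Raw string, so \1 and \2 reach re.sub as group references.
fixed = pattern.sub(r'\1\2', feed)
print(repr(fixed))
```

Printing repr() of the raw response in the program is a quick way to see which line endings the feed actually uses.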
