Python multiline regex works in shell but not in program - python

I'm trying to find and replace a multiline pattern in a JSON feed. Basically, I'm looking for a line ending in "}," followed by a line containing just "}".
Example input would be:
s = """
"essSurfaceFreezePoint": "1001",
"essSurfaceBlackIceSignal": "4"
},
}
}
"""
and I want to find:
"""
},
}
"""
and replace it with:
"""
}
}
"""
I've tried the following:
pattern = re.compile(r'^ *},\n^ *}$',flags=re.MULTILINE)
pattern.findall(feedStr)
This works in the Python shell. However, when I do the same search in my Python program, it finds nothing. I'm using the full JSON feed in the program. Perhaps the feed is being read with different line terminators.
The feed is at:
http://hardhat.ahmct.ucdavis.edu/tmp/test.json
If anyone can point out why this is working in the shell, but not in the program, I'd greatly appreciate it. Is there a better way to formulate the regular expression, so it would work in both?
Thanks for any advice.
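If the feed is served with Windows-style "\r\n" line endings, that alone would explain the difference: in MULTILINE mode, $ only matches immediately before a "\n", so a stray "\r" defeats both the literal "\n" inside the pattern and the final "}$". A minimal sketch of a pattern that tolerates either ending (this is the line-ending hypothesis, not a confirmed diagnosis of the feed):
import re

# Make "\r" optional and keep whatever ending the feed uses by capturing it.
pattern = re.compile(r'^([ \t]*)},(\r?\n[ \t]*}[ \t]*\r?)$', flags=re.MULTILINE)

sample = '  "essSurfaceBlackIceSignal": "4"\r\n  },\r\n  }\r\n'
print(repr(pattern.sub(r'\1}\2', sample)))
# '  "essSurfaceBlackIceSignal": "4"\r\n  }\r\n  }\r\n'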
=====================================================================================
To make this clearer, I'm adding my test code here. Note that I'm now including the regular expression provided by Ahosan Karim Asik. This regex works in the live demo link below, but doesn't quite work for me in a Python shell. It also doesn't work against the real feed.
Thanks again for any assistance.
import urllib2
import json
import re

if __name__ == "__main__":
    # wget version of real feed:
    # url = "http://hardhat.ahmct.ucdavis.edu/tmp/test.json"
    # Short text, for milepost and brace substitution test:
    url = "http://hardhat.ahmct.ucdavis.edu/tmp/test.txt"
    request = urllib2.urlopen(url)
    rawResponse = request.read()
    # print("Raw response:")
    # print(rawResponse)

    # Find extra comma after end of records:
    p1 = re.compile('(}),(\r?\n *})')
    l1 = p1.findall(rawResponse)
    print("Brace matches found:")
    print(l1)

    # Check milepost:
    # p2 = re.compile('( *\"milepost\": *\")')
    p2 = re.compile('( *\"milepost\": *\")([0-9]*\.?[0-9]*)\r?\n')
    l2 = p2.findall(rawResponse)
    print("Milepost matches found:")
    print(l2)

    # Do brace substitutions (the replacement template must be a raw string;
    # a plain "\1\2" is read by Python as two control characters):
    subst = r"\1\2"
    response = re.sub(p1, subst, rawResponse)

    # Do milepost substitutions:
    subst = r'\1\2"'
    response = re.sub(p2, subst, response)
    print(response)

try this:
import re
p = re.compile(ur'(^ *}),(\n^ *})$', re.MULTILINE)
test_str = u" \"essSurfaceFreezePoint\": \"1001\",\n \"essSurfaceBlackIceSignal\": \"4\"\n },\n }\n }"
subst = u"$1$2"
result = re.sub(p, subst, test_str)
live demo
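If the ultimate goal is to make the feed acceptable to json.loads, another option is to strip every trailing comma before a closing brace or bracket in one pass. A generic sketch of that idea (it assumes no string value in the feed itself contains a comma followed only by whitespace and a closing brace):
import re
import json

def strip_trailing_commas(text):
    # Remove a comma that is followed only by whitespace and a closing } or ].
    return re.sub(r',(\s*[}\]])', r'\1', text)

cleaned = strip_trailing_commas('{ "a": "1", }')
data = json.loads(cleaned)  # parses cleanly once the trailing comma is gone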

Related

How to replace parameterized string?

My use-case is to return the redirection uri for the given uri.
URI's will be as follows:
/books
/books/economic-genious
/books/flight-mechanics
My regular expression to match the above URI's as follows:
/books(/(.*))?$
My destination is configured as follows: /ebooks$1. So that the above URI's will be converted to:
/ebooks
/ebooks/economic-genious
/ebooks/flight-mechanics
For this my existing Javascript code is:
function getMappedURI(uri) {
    var exp = new RegExp('/books(/(.*))?$');
    var destUri = '/ebooks$1';
    var redirectUri = uri.replace(exp, destUri);
    return redirectUri;
}
Unable to achieve the same in Python.
That's a roundabout way to replace the beginning of a string. If it's only the result that matters, I would do it this way:
import re

uri_list = [
    '/books',
    '/books/economic-genious',
    '/books/flight-mechanics',
]

def getMappedURI(uri):
    return re.sub(r'^\/books', '/ebooks', uri)

for uri in uri_list:
    print(getMappedURI(uri))
Result
/ebooks
/ebooks/economic-genious
/ebooks/flight-mechanics
If you need to use the original regular expression, this should work:
import re

uri_list = [
    '/books/',
    '/books/economic-genious',
    '/books/flight-mechanics',
]

def getMappedURI(uri):
    return re.sub(r'\/books(\/(.*))?$', r'/ebooks\1', uri)

for uri in uri_list:
    print(getMappedURI(uri))
Result
/ebooks/
/ebooks/economic-genious
/ebooks/flight-mechanics
Note that backslashes have been added before the slashes in your regular expression (Python doesn't require them, but they are harmless).
If you want to keep the original pattern untouched, you can add them programmatically:
re.sub(r'/books(/(.*))?$'.replace('/', r'\/'), r'/ebooks\1', uri)
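For completeness, a direct translation of the JavaScript function is also possible; this sketch assumes Python 3.5 or later, where an optional group that did not participate in the match is substituted as an empty string instead of raising an error:
import re

def getMappedURI(uri):
    # \1 (or \g<1>) is Python's replacement syntax for JavaScript's $1.
    return re.sub(r'/books(/(.*))?$', r'/ebooks\1', uri)

for uri in ['/books', '/books/economic-genious', '/books/flight-mechanics']:
    print(getMappedURI(uri))
# /ebooks
# /ebooks/economic-genious
# /ebooks/flight-mechanics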

Python regex to find functions and params pairs in js files

I am writing a JavaScript crawler application.
The application needs to open JavaScript files and find some specific code in order to do some stuff with them.
I am using regular expressions to find the code of interest.
Consider the following JavaScript code:
let nlabel = rs.length ? st('string1', [st('string2', ctx = 'ctx2')], ctx = 'ctx1') : st('Found {0}', [st(this.param)]);
As you can see there is the st function which is called three times in the same line. The first two calls have an extra parameter named ctx but the third one doesn't have it.
What I need to do is to have 3 re matches as below:
Match 1
Group: function = "st('"
Group: string = "string1"
Group: ctx = "ctx1"
Match 2
Group: function = "st('"
Group: string = "string2"
Group: ctx = "ctx2"
Match 3
Group: function = "st('"
Group: string = "Found {0}"
Group: ctx = (None)
I am using the regex101.com to test my patterns and the pattern that gives the closest thing to what I am looking for is the following:
(?P<function>st\([\"'])(?P<string>.+?(?=[\"'](\s*,ctx\s*|\s*,\s*)))
You can see it in action here.
However, I have no idea how to make it return the ctx group the way I want it.
For your reference I am using the following Python code:
import re

matches = []
code = "let nlabel = rs.length ? st('string1', [st('string2', ctx = 'ctx2')], ctx = 'ctx1') : st('Found {0}', [st(this.param)], ctx = 'ctxparam'"
pattern = "(?P<function>st\([\"'])(?P<string>.+?(?=[\"'](\s*,ctx\s*|\s*,\s*)))"
for m in re.compile(pattern).finditer(code):
    fnc = m.group('function')
    msg = m.group('string')
    ctx = m.group('ctx')
    idx = m.start()
    matches.append([idx, fnc, msg, ctx])
print(matches)
I have the feeling that re alone isn't capable to do exactly what I am looking for but any suggestion/solution which gets closer is more than welcome.
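Your suspicion is probably right: a single re pattern cannot reliably pair each st(...) with its own ctx, because the ctx that belongs to an outer call can appear after arbitrarily nested inner calls. One workaround is to use re only for tokenising and to track parenthesis depth by hand. The sketch below is based on the test string above (assumed to end with its missing closing parenthesis); it is not a general JavaScript parser:
import re

code = ("let nlabel = rs.length ? st('string1', [st('string2', ctx = 'ctx2')], "
        "ctx = 'ctx1') : st('Found {0}', [st(this.param)], ctx = 'ctxparam')")

# Tokenise only what matters: st('...') openings, ctx = '...' assignments,
# and bare parentheses, so nesting depth can be tracked.
token = re.compile(r"st\('(?P<string>[^']*)'|ctx\s*=\s*'(?P<ctx>[^']*)'|[()]")

stack = []    # currently open parentheses; st(...) calls are dicts, a plain "(" is None
matches = []  # every st('...') call, in the order it was opened

for m in token.finditer(code):
    if m.group('string') is not None:          # st('...'  (this also opens a parenthesis)
        entry = {'pos': m.start(), 'function': "st('",
                 'string': m.group('string'), 'ctx': None}
        matches.append(entry)
        stack.append(entry)
    elif m.group('ctx') is not None:           # ctx = '...'
        if stack and isinstance(stack[-1], dict):
            stack[-1]['ctx'] = m.group('ctx')  # attach to the innermost open st(...)
    elif m.group(0) == '(':
        stack.append(None)                     # some other opening parenthesis
    else:                                      # ')'
        if stack:
            stack.pop()

for entry in matches:
    print(entry['string'], '->', entry['ctx'])
# string1 -> ctx1
# string2 -> ctx2
# Found {0} -> ctxparam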

Converting part of string into variable name in python

I have a file containing a text like this:
loadbalancer {
upstream application1 {
server 127.0.0.1:8082;
server 127.0.0.1:8083;
server 127.0.0.1:8084;
}
upstream application2 {
server 127.0.0.1:8092;
server 127.0.0.1:8093;
server 127.0.0.1:8094;
}
}
Does anyone know how I could extract variables like the ones below?
appList=["application1","application2"]
ServerOfapp1=["127.0.0.1:8082","127.0.0.1:8083","127.0.0.1:8084"]
ServerOfapp2=["127.0.0.1:8092","127.0.0.1:8093","127.0.0.1:8094"]
.
.
.
and so on
If the lines you want always start with upstream and server this should work:
app_dic = {}
with open('file.txt','r') as f:
    for line in f:
        if line.startswith('upstream'):
            app_i = line.split()[1]
            server_of_app_i = []
            for line in f:
                if not line.startswith('server'):
                    break
                server_of_app_i.append(line.split()[1][:-1])
            app_dic[app_i] = server_of_app_i
app_dic should then be a dictionary of lists:
{'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084'],
'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094']}
EDIT
If the input file does not contain any newline characters, then as long as the file is not too large you could read it into a list and iterate over that:
app_dic = {}
with open('file.txt','r') as f:
    txt_iter = iter(f.read().split())  # iterator of list
    for word in txt_iter:
        if word == 'upstream':
            app_i = next(txt_iter)
            server_of_app_i = []
            for word in txt_iter:
                if word == 'server':
                    server_of_app_i.append(next(txt_iter)[:-1])
                elif word == '}':
                    break
            app_dic[app_i] = server_of_app_i
This is uglier, as one has to search for the closing curly bracket in order to break out of the loop. If it gets any more complicated, regex should be used.
If you are able to use the newer regex module by Matthew Barnett, you can use the following solution, see an additional demo on regex101.com:
import regex as re
rx = re.compile(r"""
(?:(?P<application>application\d)\s{\n| # "application" + digit + { + newline
(?!\A)\G\n) # assert that the next match starts here
server\s # match "server"
(?P<server>[\d.:]+); # followed by digits, . and :
""", re.VERBOSE)
string = """
loadbalancer {
upstream application1 {
server 127.0.0.1:8082;
server 127.0.0.1:8083;
server 127.0.0.1:8084;
}
upstream application2 {
server 127.0.0.1:8092;
server 127.0.0.1:8093;
server 127.0.0.1:8094;
}
}
"""
result = {}
for match in rx.finditer(string):
    if match.group('application'):
        current = match.group('application')
        result[current] = list()
    if current:
        result[current].append(match.group('server'))
print result
# {'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094'], 'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084']}
This makes use of the \G anchor, named capture groups and some programming logic.
This is the basic method:
import re

# each of your objects here
objText = "xyz xcyz 244.233.233.2:123"
# note: Python patterns have no /.../g delimiters
listOfAll = re.findall(r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?):[0-9]{1,5}", objText)
for eachMatch in listOfAll:
    print "Here's one! %s" % eachMatch
Obviously that's a bit rough around the edges, but it will perform a full-scale regex search of whatever string it's given. Probably a better solution would be to pass it the objects themselves, but for now I'm not sure what you would have as raw input. I'll try to improve on the regex, though.
I believe this as well can be solved with re:
>>> import re
>>> from collections import defaultdict
>>>
>>> APP = r'\b(?P<APP>application\d+)\b'
>>> IP = r'server\s+(?P<IP>[\d\.:]+);'
>>>
>>> pat = re.compile('|'.join([APP, IP]))
>>>
>>>
>>> scan = pat.scanner(s)
>>> d = defaultdict(list)
>>>
>>> for m in iter(scan.search, None):
...     group = m.lastgroup
...     if group == 'APP':
...         keygroup = m.group(group)
...         continue
...     else:
...         d[keygroup].append(m.group(group))
>>> d
defaultdict(<class 'list'>, {'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084'], 'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094']})
Or similarly with re.finditer method and without pat.scanner:
>>> for m in re.finditer(pat, s):
...     group = m.lastgroup
...     if group == 'APP':
...         keygroup = m.group(group)
...         continue
...     else:
...         d[keygroup].append(m.group(group))
>>> d
defaultdict(<class 'list'>, {'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084'], 'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094']})

Python backreference replacing doesn't work as expected

There are two named groups in my pattern, myFlag and id. I want to insert one more myFlag immediately before group id.
Here is my current code:
# i'm using Python 3.4.2
import re
import os
contents = b'''
xdlg::xdlg(x_app* pApp, CWnd* pParent)
    : customized_dlg((UINT)0, pParent, pApp)
    , m_pReaderApp(pApp)
    , m_info(pApp)
{
}
'''
pattern = rb'(?P<myFlag>[a-zA-Z0-9_]+)::(?P=myFlag).+:.+(?P<id>\(UINT\)0 *,)'
res = re.search(pattern, contents, re.DOTALL)
if None != res:
    print(res.groups()) # the output is (b'xdlg', b'(UINT)0,')
# 'replPattern' becomes b'(?P<myFlag>[a-zA-Z0-9_]+)::(?P=myFlag).+:.+((?P=myFlag)\\(UINT\\)0 *,)'
replPattern = pattern.replace(b'?P<id>', b'(?P=myFlag)', re.DOTALL)
print(replPattern)
contents = re.sub(pattern, replPattern, contents)
print(contents)
The expected results should be:
xdlg::xdlg(x_app* pApp, CWnd* pParent)
    : customized_dlg(xdlg(UINT)0, pParent, pApp)
    , m_pReaderApp(pApp)
    , m_info(pApp)
{
}
but the actual result is the same as the original:
xdlg::xdlg(x_app* pApp, CWnd* pParent)
    : customized_dlg((UINT)0, pParent, pApp)
    , m_pReaderApp(pApp)
    , m_info(pApp)
{
}
The issue appears to be the pattern syntax, particularly the end:
0 *,)
That doesn't quite make sense as written. Fixing it seems to solve most of the issues, although I would recommend ditching DOTALL and going with MULTILINE instead:
p = re.compile(r'([a-zA-Z0-9_]+)::\1(.*\n\W+:.*)(\(UINT\)0,.*)', re.MULTILINE)
sub = "\\1::\\1\\2\\1\\3"
result = re.sub(p, sub, s)  # s holds the source text from the question
print(result)
Result:
xdlg::xdlg(x_app* pApp, CWnd* pParent)
    : customized_dlg(xdlg(UINT)0, pParent, pApp)
    , m_pReaderApp(pApp)
    , m_info(pApp)
{
}
https://regex101.com/r/hG3lV7/1
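If you would rather keep named groups throughout, the same substitution can be written with \g<name> backreferences in the replacement string. Note that the pattern below is my adjustment of yours (lazy quantifiers and an extra middle group), not the original:
import re

contents = b'''
xdlg::xdlg(x_app* pApp, CWnd* pParent)
    : customized_dlg((UINT)0, pParent, pApp)
    , m_pReaderApp(pApp)
    , m_info(pApp)
{
}
'''

pattern = rb'(?P<myFlag>[a-zA-Z0-9_]+)::(?P=myFlag)(?P<middle>.+?:.+?)(?P<id>\(UINT\)0 *,)'
# \g<name> in the replacement re-inserts a named group, so the class name is
# written out once more right before the (UINT)0 argument.
repl = rb'\g<myFlag>::\g<myFlag>\g<middle>\g<myFlag>\g<id>'
print(re.sub(pattern, repl, contents, flags=re.DOTALL).decode())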

Performing a Google search for .pdf and .ppt using keywords from a file in python

I am working on a program that performs a Google search for .pdf and .ppt files. Currently I'm giving the keywords manually as input to my program. What I want to do is perform an automated search for both .pdf and .ppt files.
Suppose I have a file.txt containing the keywords:
python
android
parser
I want my program to automatically take these keywords one by one and search for both .pdf and .ppt files.
import urllib2
import urllib
import json as m_json

def getgoogleurl(search,siteurl=False):
    if siteurl==False:
        return 'http://www.google.com/search?q='+urllib2.quote(search)+'&oq='+urllib2.quote(search)
    else:
        return 'http://www.google.com/search?q=site:'+urllib2.quote(siteurl)+'%20'+urllib2.quote(search)+'&oq=site:'+urllib2.quote(siteurl)+'%20'+urllib2.quote(search)

def getgooglelinks(search,siteurl=False):
    # google returns 403 without user agent
    headers = {'User-agent':'Mozilla/11.0'}
    req = urllib2.Request(getgoogleurl(search,siteurl),None,headers)
    site = urllib2.urlopen(req)
    data = site.read()
    site.close()
    start = data.find('<div id="res">')
    end = data.find('<div id="foot">')
    if data[start:end]=='':
        # error, no links to find
        return False
    else:
        links =[]
        data = data[start:end]
        start = 0
        end = 0
        while start>-1 and end>-1:
            # get only results of the provided site
            if siteurl==False:
                start = data.find('<a href="/url?q=')
            else:
                start = data.find('<a href="/url?q='+str(siteurl))
            data = data[start+len('<a href="/url?q='):]
            end = data.find('&sa=U&ei=')
            if start>-1 and end>-1:
                link = urllib2.unquote(data[0:end])
                data = data[end:len(data)]
                if link.find('http')==0:
                    links.append(link)
        return links

keyword1 = raw_input('Enter the keyword as keyword+filetype: \n eg: python filetype:pdf')
links = getgooglelinks(keyword1,'http://www.google.com/')
for link in links:
    print link

query = raw_input('Query: ')
query = urllib.urlencode({'q': query})
response = urllib.urlopen('http://ajax.googleapis.com/ajax/services/search/web?v=1.0&' + query).read()
json = m_json.loads(response)
results = json['responseData']['results']
for result in results:
    title = result['title']
    url = result['url']
    print(title + '; ' + url)
This is the code I'm working on. I tried using the Beautiful Soup library, but it didn't work.
So you are manually searching, as in typing the query into the program?
If that's what you are doing, then you are almost there. You just have to do some basic file operations: instead of passing the user-entered query, pass each line of the file as the query.
Make sure it is in string format, e.g. str(retrieved_data_from_file), and build the dictionary entry along the lines of mydict = {'q': "'" + str(retrieved_data_from_file) + "'"},
and close the file afterwards.
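A minimal sketch of that idea, reusing the getgooglelinks function and the file.txt from the question (written for Python 2 to match the question's code); the filetype list is an assumption:
# Assumes getgooglelinks() from the question is already defined.
filetypes = ['pdf', 'ppt']

with open('file.txt') as f:                      # the keyword file from the question
    keywords = [line.strip() for line in f if line.strip()]

for keyword in keywords:
    for ftype in filetypes:
        query = '%s filetype:%s' % (keyword, ftype)
        links = getgooglelinks(query, 'http://www.google.com/')
        print 'Results for %r:' % query
        if links:                                # getgooglelinks() returns False when nothing is found
            for link in links:
                print link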
