For some reason, I need to extract the fields in an XML doc with Python's re module.
Here is an example of the string I'll be applying the regex to:
payload2 = '<CheckEventRequest><EventList count="1"><Event event="0x80" path="\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db" flag="0x2" protocol="0" server="C2_EMCVNX" share="demoshare1" clientIP="172.26.64.233" serverIP="172.26.64.225" timeStamp="0x536C47EF0003C836" userSid="S-1-5-21-665520413-3518186362-2792099713-500" ownerSid="S-1-5-21-665520413-3518186362-2792099713-500" fileSize="0x11000" desiredAccess="0x0" createDispo="0x0" ntStatus="0x0" relativePath="\\C2_EMCVNX\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db"/></EventList></CheckEventRequest>'
Some of the fields you see above like 'clientIP' may not always be present.
The regex I have come up with is:
PAT3 = re.compile(r'.+(event="(?P<event_code>\S*?)"){1}[\S\s]+?(path="(?P<path>[\s\S]+?)"){0,1}[\S\s]+(clientIP="(?P<client_ip>[\S\s]+?)"){0,1}.*', re.UNICODE)
m1 = PAT3.search(payload2)
print m1.groupdict()
output:
{'path': '\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db', 'client_ip': None, 'event_code': '0x80'}
But when I put {1} instead of {0,1} after (?P<client_ip>[\S\s]+?)") it works. However, this breaks the case where clientIP is absent.
Any idea how I can make the regex work in both cases, whether a field is present or not?
My advice:
Stop trying to do a big one-line regex.
It's very simple to just break up your code so that it is not only more readable, but easier to work with too.
My version of your code:
payloads = [
'<CheckEventRequest><EventList count="1"><Event event="0x80" path="\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db" flag="0x2" protocol="0" server="C2_EMCVNX" share="demoshare1" serverIP="172.26.64.225" timeStamp="0x536C47EF0003C836" userSid="S-1-5-21-665520413-3518186362-2792099713-500" ownerSid="S-1-5-21-665520413-3518186362-2792099713-500" fileSize="0x11000" desiredAccess="0x0" createDispo="0x0" ntStatus="0x0" relativePath="\\C2_EMCVNX\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db"/></EventList></CheckEventRequest>',
'<CheckEventRequest><EventList count="1"><Event event="0x80" path="\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db" flag="0x2" protocol="0" server="C2_EMCVNX" share="demoshare1" clientIP="172.26.64.233" serverIP="172.26.64.225" timeStamp="0x536C47EF0003C836" userSid="S-1-5-21-665520413-3518186362-2792099713-500" ownerSid="S-1-5-21-665520413-3518186362-2792099713-500" fileSize="0x11000" desiredAccess="0x0" createDispo="0x0" ntStatus="0x0" relativePath="\\C2_EMCVNX\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db"/></EventList></CheckEventRequest>'
]
def scrape_xml(payload):
    import re
    ipv4 = r'clientIP="(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"'
    pat3 = dict()
    pat3['event_code'] = r'event="(0[xX][0-9a-fA-F]+?)"'
    pat3['path'] = r'path="(.*?)"'
    pat3['client_ip'] = ipv4
    matches = {}
    for index, regex in enumerate(pat3):
        matches[index] = re.search(
            pattern=pat3[regex],
            string=payload,
            flags=re.UNICODE
        )
    for index in matches:
        if not index:
            print "\n"
        if matches[index] is None:
            pass
        else:
            print matches[index].group(0)

for p in payloads:
    scrape_xml(p)
Output:
path="\c2_emcvnx.ntaplion.prv\CHECK$\demoshare1\Engineering\Benchmarking\Thumbs.db"
event="0x80"
path="\c2_emcvnx.ntaplion.prv\CHECK$\demoshare1\Engineering\Benchmarking\Thumbs.db"
clientIP="172.26.64.233"
event="0x80"
First, I have to give you the standard warning against parsing XML with regular expressions, but if you're deadset on that…
You probably don't want to be using [\S\s], as that'll match anything, including going past the quote. To prevent that, you made it non-greedy, but there's a better solution: just make it not match quotes: [^"]. Also note that you can replace {0,1} with ?.
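A minimal sketch of those two suggestions, written for illustration rather than taken from any post in this thread. One likely reason the original pattern leaves client_ip as None is that the greedy [\S\s]+ in front of the optional group can swallow the rest of the string, after which the optional group is free to match nothing. Here each value is matched with [^"]* so it cannot run past the closing quote, and each optional attribute sits in its own optional group. The names EVENT_RE and extract_fields are made up, the sample strings are shortened stand-ins for the real payload, and the pattern assumes the attributes appear in the order event, path, clientIP.

```python
import re

# Hypothetical pattern: required "event", then optional "path" and "clientIP".
# Each optional attribute sits in its own (?: ... )? group, so the lazy .*?
# only skips ahead when the attribute really exists further along the string.
EVENT_RE = re.compile(
    r'\bevent="(?P<event_code>[^"]*)"'
    r'(?:.*?\bpath="(?P<path>[^"]*)")?'
    r'(?:.*?\bclientIP="(?P<client_ip>[^"]*)")?'
)

def extract_fields(payload):
    """Return a dict of the three fields; absent attributes map to None."""
    m = EVENT_RE.search(payload)
    return m.groupdict() if m else None

# Shortened stand-ins for the payloads in the question (assumed shape).
with_ip = r'<Event event="0x80" path="\\srv\share\Thumbs.db" flag="0x2" clientIP="172.26.64.233" serverIP="172.26.64.225"/>'
without_ip = r'<Event event="0x80" path="\\srv\share\Thumbs.db" flag="0x2" serverIP="172.26.64.225"/>'

print(extract_fields(with_ip)["client_ip"])     # 172.26.64.233
print(extract_fields(without_ip)["client_ip"])  # None
```

If the attribute order can vary, searching for each attribute separately (as in the answer above) is probably the safer route.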
Does anyone know a smart way to extract unknown number of different parts from a string with Python regex?
I know this question is probably too general to answer clearly, so please let's have a look at the example:
S = "name.surname#sub1.sub2.sub3"
As a result I would like to get separately a local part and each subdomain. Please note that in this sample email address we have three subdomains but I would like to find a regular expression that is able to capture any number of them, so please do not use this number.
To avoid straying from the point, let's additionally assume only alphanumeric characters (hence \w), dots and one # are allowed in email addresses.
I tried to solve it myself and found this way:
L = re.findall(r"([\w.]+)(?=#)|(\w+)", S)
for i in L:
    if i[0] == '': print i[1],
    else: print i[0],
# output: name.surname sub1 sub2 sub3
But it doesn't look nice to me. Does anyone know a way to achieve this with one regex and without any loop?
Of course, we can easily do it without regular expressions:
L = S.split('#')
localPart = L[0] # name.surname
subdomains = str(L[1]).split('.') # ['sub1', 'sub2', 'sub3']
But I am interested in how to figure it out with regexes.
[EDIT]
Uff, finally I figured this out, here is the nice solution:
S = "name.surname#sub1.sub2.sub3"
print re.split(r"#|\.(?!.*#)", S) # ['name.surname', 'sub1', 'sub2', 'sub3']
S = "name.surname.nick#sub1.sub2.sub3.sub4"
print re.split(r"#|\.(?!.*#)", S) # ['name.surname.nick', 'sub1', 'sub2', 'sub3', 'sub4']
Perfect output.
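For readers unfamiliar with the negative lookahead in that pattern, here is a small sketch of how it behaves, in Python 3 syntax. The pattern is the one from the edit above; the names SPLIT_RE and split_address are invented for the example.

```python
import re

# Split on "#", or on a "." that has no "#" anywhere after it
# (i.e. a dot in the subdomain part, never one in the local part).
SPLIT_RE = re.compile(r"#|\.(?!.*#)")

def split_address(address):
    """Return [local_part, subdomain1, subdomain2, ...]."""
    return SPLIT_RE.split(address)

print(split_address("name.surname#sub1.sub2.sub3"))
# ['name.surname', 'sub1', 'sub2', 'sub3']
```

The dot inside the local part is followed (eventually) by "#", so the lookahead rejects it and the local part stays in one piece.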
If I am understanding your request correctly, you want to find each section in your sample email address, without the periods. One way is to compile the pattern with re.compile and pass it to re.findall (compiling is optional; re.findall also accepts a plain pattern string). For example:
import re
s = "name.surname#sub1.sub2.sub3"
r = r"\w+"
r2 = re.compile(r)
re.findall(r2,s)
This finds every match of the compiled pattern r2 in the string s and returns ['name', 'surname', 'sub1', 'sub2', 'sub3'].
Basically, you can use the fact that when the pattern contains a capture group, re.findall returns only the content of that group rather than the whole match:
>>> re.findall(r'(?:^[^#]*#|\.)([^.]*)', s)
['sub1', 'sub2', 'sub3']
Obviously the email format can be more complicated than your example string.
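To make the capture-group behaviour of re.findall concrete, here is a short sketch (Python 3 syntax) that runs the same idea with zero, one and two capture groups. The variable names are illustrative only.

```python
import re

s = "name.surname#sub1.sub2.sub3"

# No capture group: findall returns each whole match.
whole = re.findall(r'(?:^[^#]*#|\.)[^.]*', s)

# One capture group: findall returns only that group's text.
captured = re.findall(r'(?:^[^#]*#|\.)([^.]*)', s)

# Two capture groups: findall returns one tuple per match.
pairs = re.findall(r'(^[^#]*#|\.)([^.]*)', s)

print(whole)     # ['name.surname#sub1', '.sub2', '.sub3']
print(captured)  # ['sub1', 'sub2', 'sub3']
print(pairs)     # [('name.surname#', 'sub1'), ('.', 'sub2'), ('.', 'sub3')]
```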
I have a string "value=4020a345-f646-4984-a848-3f7f5cb51f21" and the following code:
if re.search( "value=\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*", x ):
    x = re.search( "value=\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*", x )
    m = x.group(1)
m only gives me 4020a345; I'm not sure why it does not give me the entire "4020a345-f646-4984-a848-3f7f5cb51f21".
Can anyone tell me what I am doing wrong?
Try this regex; it looks like you are trying to match a GUID:
value=[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
This should match what you want, if all the strings are of the form you've shown:
value=((\w*\d*\-?)*)
You can also use this website to validate your regular expressions:
http://regex101.com/
The below regex works as you expect.
value=([\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*]+)
You are trying to match hex numbers, so this regex is more precise than using [\w\d]:
pattern = "value=([0-9a-fA-F]{8}-([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12})"
data = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
res = re.search(pattern, data)
print(res.group(1))
If you don't care about validating the input, i.e. checking that it is correct hex, there is no reason not to use simple string manipulation as shown below.
>>> data = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
>>> print(data[6:])
4020a345-f646-4984-a848-3f7f5cb51f21
>>> # or maybe
...
>>> print(data[6:].replace('-',''))
4020a345f6464984a8483f7f5cb51f21
You can get the subparts of the value as a list
txt = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
parts = re.findall('\w+', txt)[1:]
parts is ['4020a345', 'f646', '4984', 'a848', '3f7f5cb51f21']
If you really want the entire string:
full = "-".join(parts)
A simpler way:
full = re.findall("[\w-]+", txt)[-1]
full is 4020a345-f646-4984-a848-3f7f5cb51f21
value=([\w\d]*\-[\w\d]*\-[\w\d]*\-[\w\d]*\-[\w\d]*)
Try this and grab the capture group. Your regex was not returning the whole value because you used the | operator: if the pattern on the left side of | matches, the engine does not try the alternatives to its right.
See the demo:
http://regex101.com/r/hQ1rP0/45
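The effect of the top-level | can be seen directly with a small sketch (Python 3 syntax). The ungrouped pattern below is a shortened version of the one in the question, and the grouped pattern is only one possible fix, chosen here for illustration.

```python
import re

data = "value=4020a345-f646-4984-a848-3f7f5cb51f21"

# A top-level "|" splits the WHOLE pattern into alternatives, so this reads as
# "value=\w*"  OR  "\d*\-\w*"  OR  "\d*"; the first alternative wins.
ungrouped = re.search(r"value=\w*|\d*\-\w*|\d*", data)
print(ungrouped.group(0))  # value=4020a345
print(ungrouped.groups())  # (): no capture group, so group(1) would raise

# Hypothetical fix: one capture group around a class of hex digits and dashes.
grouped = re.search(r"value=([0-9a-fA-F-]+)", data)
print(grouped.group(1))  # 4020a345-f646-4984-a848-3f7f5cb51f21
```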
I wrote a script to catch and correct commands before they are read by a parser. The parser requires equal, not equal, greater, etc, entries to be separated by commas, such as:
'test(a>=b)' is wrong
'test(a,>=,b)' is correct
The script I wrote works fine, but I would love to know if there's a more efficient way to do this.
Here's my script:
# Correction routine
def corrector(exp):
    def rep(exp,a,b):
        foo = ''
        while(True):
            foo = exp.replace(a,b)
            if foo == exp:
                return exp
            exp = foo
    # Replace all instances with a unique identifier. Do it in a specific order
    # so for example we catch an instance of '>=' before we get to '='
    items = ['>=','<=','!=','==','>','<','=']
    for i in range(len(items)):
        exp = rep(exp,items[i],'###%s###'%i)
    # Re-add items with commas
    for i in range(len(items)):
        exp = exp.replace('###%s###'%i,',%s,'%items[i])
    # Remove accidental double commas we may have added
    return exp.replace(',,',',')

print corrector('wrong_syntax(b>=c) correct_syntax(b,>=,c)')
# RESULT: wrong_syntax(b,>=,c) correct_syntax(b,>=,c)
thanks!
As mentioned in the comments, one approach would be to use a regular expression. The following regex matches any of your operators when they are not surrounded by commas, and replaces them with the same string with the commas inserted:
import re
inputstring = 'wrong_syntax(b>=c) correct_syntax(b,>=,c)'
regex = r"([^,])(>=|<=|!=|==|>|<|=)([^,])"
replace = r"\1,\2,\3"
result = re.sub(regex, replace, inputstring)
print(result)
Simple regexes are relatively easy, but they can get complicated quickly. Check out the docs for more info:
http://docs.python.org/2/library/re.html
Here is a regex that will do what you asked:
import re
regex = re.compile(r'''
    (?<!,)        # Negative lookbehind
    (!=|[><=]=?)
    (?!,)         # Negative lookahead
''', re.VERBOSE)
print regex.sub(r',\1,', 'wrong_expression(b>=c) or right_expression(b,>=,c)')
outputs
wrong_expression(b,>=,c) or right_expression(b,>=,c)
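The two regex answers agree on the example from the question, but they can differ on other input. The sketch below (Python 3 syntax) compares them on a chained comparison, which is an input invented for this illustration: the version that matches [^,] on both sides consumes the neighbouring characters, so a second operator sharing a neighbour is skipped, while the lookaround version only inspects the neighbours.

```python
import re

# Consuming version: the neighbouring characters are part of each match.
consuming = re.compile(r"([^,])(>=|<=|!=|==|>|<|=)([^,])")
# Lookaround version: neighbours are checked but not consumed.
lookaround = re.compile(r"(?<!,)(!=|[><=]=?)(?!,)")

original = "wrong_syntax(b>=c) correct_syntax(b,>=,c)"
chained = "test(a>b>c)"  # hypothetical input, not from the question

print(consuming.sub(r"\1,\2,\3", original))  # wrong_syntax(b,>=,c) correct_syntax(b,>=,c)
print(lookaround.sub(r",\1,", original))     # wrong_syntax(b,>=,c) correct_syntax(b,>=,c)
print(consuming.sub(r"\1,\2,\3", chained))   # test(a,>,b>c)
print(lookaround.sub(r",\1,", chained))      # test(a,>,b,>,c)
```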
I'm trying to match a pattern against strings that could have multiple instances of the pattern. I need every instance separately. re.findall() should do it but I don't know what I'm doing wrong.
pattern = re.compile('/review: (http://url.com/(\d+)\s?)+/', re.IGNORECASE)
match = pattern.findall('this is the message. review: http://url.com/123 http://url.com/456')
I need 'http://url.com/123', 'http://url.com/456' and the two numbers 123 & 456 to be different elements of the match list.
I have also tried '/review: ((http://url.com/(\d+)\s?)+)/' as the pattern, but no luck.
Use this. You need to place 'review' outside the capturing group to achieve the desired result.
pattern = re.compile(r'(?:review: )?(http://url.com/(\d+))\s?', re.IGNORECASE)
This gives output
>>> match = pattern.findall('this is the message. review: http://url.com/123 http://url.com/456')
>>> match
[('http://url.com/123', '123'), ('http://url.com/456', '456')]
You've got extra /'s in the regex. In Python the pattern should just be a string, e.g. instead of this:
pattern = re.compile('/review: (http://url.com/(\d+)\s?)+/', re.IGNORECASE)
It should be:
pattern = re.compile('review: (http://url.com/(\d+)\s?)+', re.IGNORECASE)
Also, typically in Python you'd actually use a "raw" string like this:
pattern = re.compile(r'review: (http://url.com/(\d+)\s?)+', re.IGNORECASE)
The extra r on the front of the string saves you from having to do lots of backslash escaping etc.
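A quick sketch (Python 3 syntax) of what the r prefix changes. The variable names are illustrative; the first pattern is the one from this answer, and the \b example is added here because it is a case where forgetting the prefix silently changes the pattern.

```python
import re

# Without the r prefix, every regex backslash must itself be escaped.
escaped = 'review: (http://url.com/(\\d+)\\s?)+'
raw = r'review: (http://url.com/(\d+)\s?)+'
print(escaped == raw)  # True: both spell the same pattern text

# "\b" shows why it matters: in a normal string it is a backspace character,
# in a raw string it stays the two characters the regex engine reads as a
# word boundary.
print(len('\b'), len(r'\b'))  # 1 2
print(re.findall(r'\bcat\b', 'cat concat cat'))  # ['cat', 'cat']
```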
Use a two-step approach: First get everything from "review:" to EOL, then tokenize that.
import re
msg = 'this is the message. review: http://url.com/123 http://url.com/456'
review_pattern = re.compile(r'.*review: (.*)$')
urls = review_pattern.findall(msg)[0]
url_pattern = re.compile(r"(http://url.com/(\d+))")
url_pattern.findall(urls)
I am wanting to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print ' matched.groups()', matched.groups()
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ; and then use a regular expression to return all matches, not just a single match, using re.findall('c?[0-9]+', text).
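Both points can be sketched in a few lines (Python 3 syntax). The first part shows a repeated capture group keeping only its last repetition; the second is one way to write the two-step approach. The helper name parse_codes and the exact framing pattern are assumptions made for the example.

```python
import re

text = "start: c12354, c3456, 34526; other stuff that I don't care about"

# A repeated capture group keeps only its last repetition.
single = re.search(r"start: (?:(c?[0-9]+),? ?)+;", text)
print(single.group(1))  # 34526

def parse_codes(line):
    """Hypothetical helper: validate the 'start: ... ;' frame, then tokenize."""
    frame = re.match(r"start:([^;]*);", line)
    if frame is None:
        return None  # line does not start with "start:" or has no ";"
    return re.findall(r"c?[0-9]+", frame.group(1))

print(parse_codes(text))  # ['c12354', 'c3456', '34526']
```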
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # True if s starts with this string
s.endswith(";") # True if s ends with this string
s[6:-1].strip().split(', ') # list of tokens separated by ", " (strip() drops the space after "start:")
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException:
            # Oh noes: string doesn't match!
            continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.
import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
    res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']