Python Regex Problems with Whitespace - python

I'm trying to do a python regular expression that looks for lines formatted as such ([edit:] without new lines; the original is all on one line):
<MediaLine Label="main-video" xmlns="ms-rtcp-metrics">
<OtherTags...></OtherTags>
</MediaLine>
I wish to create a capture group of the body of this XML element (so the OtherTags...) for later processing.
Now the problem lies in the first line, where Label="main-video", and I would like to not capture Label="main-audio"
My initial solution is as such:
m = re.search(r'<MediaLine(.*?)</MediaLine>', line)
This works, in that it filters out all other non-MediaLine elements, but doesn't account for video vs audio. So to build on it, I try simply adding
m = re.search(r'<MediaLine Label(.*?)</MediaLine>', line)
but this won't create a single match, let alone being specific enough to filter audio/video. My problem seems to come down to the space between line and Label. The two variations I can think of trying both fail:
m = re.search(r'<MediaLine L(.*?)</MediaLine>', line)
m = re.search(r'<MediaLine\sL(.*?)</MediaLine>', line)
However, the following works, without being able to distinguish audio/video:
m = re.search(r'<MediaLine\s(.*?)</MediaLine>', line)
Why is the 'L' the point of failure? Where am I going wrong? Thanks for any help.
And to add to this preemptively, my goal is an expression like this:
m = re.search("<MediaLine Label=\"main-video\"(?:.*?)>(?P<payload>.*?)</MediaLine>", line)
result = m.group('payload')

By default, . doesn’t match a newline, so your initial solution didn't work either. To make . match a newline, you need to use the re.DOTALL flag (aka re.S):
>>> m = re.search("<MediaLine Label=\"main-video\"(?:.*?)>(?P<payload>.*)</MediaLine>", line, re.DOTALL)
>>> m.group('payload')
'\n <OtherTags...></OtherTags>\n'
Notice there’s also an extra ? in the first group, so that it’s not greedy.
As another comment observes, the best thing to parse XML is an XML parser. But if your particular XML is sufficiently strict in the tags and attributes that it has, then a regular expression can get the job done. It will just be messier.
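For comparison, here is a minimal sketch of the parser route using the standard library's xml.etree.ElementTree. The `<Root>` wrapper, the `<Codec>` child tags and their text are made up for illustration; only the MediaLine element, its Label attribute and the xmlns value come from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <Root> and <Codec> are invented for this sketch.
doc = (
    '<Root>'
    '<MediaLine Label="main-audio" xmlns="ms-rtcp-metrics"><Codec>a</Codec></MediaLine>'
    '<MediaLine Label="main-video" xmlns="ms-rtcp-metrics"><Codec>v</Codec></MediaLine>'
    '</Root>'
)

# ElementTree expands the default namespace into a {uri} prefix on each tag.
NS = "{ms-rtcp-metrics}"

root = ET.fromstring(doc)

# Collect the child elements of every MediaLine whose Label is "main-video".
video_children = [
    child
    for media in root.iter(NS + "MediaLine")
    if media.get("Label") == "main-video"
    for child in media
]

payload_tags = [child.tag for child in video_children]
payload_text = [child.text for child in video_children]
```

Filtering on the attribute value sidesteps the whitespace and greediness questions entirely, at the cost of needing well-formed XML as input.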

Related

How to split a string on multiple pattern using pythonic way (one liner)?

I am trying to extract file name from file pointer without extension. My file name is as follows:
this site:time.list,this.list,this site:time_sec.list, that site:time_sec.list and so on. Here required file name always precedes either whitespace or dot.
Currently I am doing this to get file from file name preceding white space and dot in file name.
search_term = os.path.basename(f.name).split(" ")[0]
and
search_term = os.path.basename(f.name).split(".")[0]
Expected file name output: this, this, this, that.
How can I combine the two into a single, Pythonic one-liner?
Thanks in advance.
using regex as below,
[ .] will split either on a space or a dot char
re.split('[ .]', os.path.basename(f.name))[0]
Split on one separator, then split the result on the other: if the second split makes the string shorter, that shorter piece is what you want; if not, you keep what the first split gave you. You don't need regex for this.
search_term = os.path.basename(f.name).split(" ")[0].split(".")[0]
Use regex to get the first word at the beginning of the string:
import re
re.match(r"\w+", "this site:time_sec.list").group()
# 'this'
re.match(r"\w+", "this site:time.list").group()
# 'this'
re.match(r"\w+", "that site:time_sec.list").group()
# 'that'
re.match(r"\w+", "this.list").group()
# 'this'
try this:
pattern = re.compile(r"\w+")
pattern.match(os.path.basename(f.name)).group()
Make sure your filenames don't have whitespace inside when you rely on the assumption that a whitespace separates what you want to extract from the rest. It's much more likely to get unexpected results you didn't think up in advance if you rely on implicit rules like that instead of actually looking at the strings you want to extract and tailor explicit expressions to fit the content.
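A quick way to check that the suggestions in this thread agree is to run them side by side on the sample names from the question. This is only a sketch: the function names are made up, and plain strings stand in for os.path.basename(f.name).

```python
import re

# Sample names taken from the question.
names = [
    "this site:time.list",
    "this.list",
    "this site:time_sec.list",
    "that site:time_sec.list",
]

def first_token_split(name):
    # Split once on the first space or dot and keep the left part.
    return re.split(r"[ .]", name, maxsplit=1)[0]

def first_token_chained(name):
    # No regex: split on a space, then split that result on a dot.
    return name.split(" ")[0].split(".")[0]

def first_token_match(name):
    # Take the leading run of word characters; None if there is none.
    m = re.match(r"\w+", name)
    return m.group() if m else None

results = {
    "split": [first_token_split(n) for n in names],
    "chained": [first_token_chained(n) for n in names],
    "match": [first_token_match(n) for n in names],
}
```

Note that the three are not equivalent in general: `\w+` stops at any non-word character (a hyphen, for example), while the two split variants stop only at a space or a dot.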

Extract part of string according to pattern using regular expression Python

I have files that follow a specific format, which look something like this:
test_0800_20180102_filepath.csv
anotherone_0800_20180101_hello.csv
The numbers in the middle represent timestamps, so I would like to extract that information. I know that there is a specific pattern which will always be _time_date_, so essentially I want the part of the string that lies between the first and third underscores. I found some examples and somehow similar problems, but I am new to Python and I am having trouble adapting them.
This is what I have implemented thus far:
datetime = re.search(r"\d+_(\d+)_", "test_0800_20180102_filepath.csv")
But the result I get is only the date part:
20180102
But what I actually need is:
0800_20180102
That's quite simple:
match = re.search(r"_((\d+)_(\d+))_", your_string)
print(match.group(1)) # print time_date >> 0800_20180101
print(match.group(2)) # print time >> 0800
print(match.group(3)) # print date >> 20180101
Note that for such tasks the group operator () inside the regexp is really helpful, it allows you to access certain substrings of a bigger pattern without having to match each one individually (which can sometimes be much more ambiguous than matching a larger one).
Groups are numbered from 1 up to the number of groups in the pattern, and group 0 is the whole matched pattern. The numbers are assigned from left to right, in the order the opening parentheses appear in your pattern.
On a side note, if you have control over it, use unix timestamps so you only have one number defining both date and time universally.
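If numbered groups get hard to keep track of, named groups are an option. The sketch below assumes the first number is HHMM and the second is YYYYMMDD (the question only says they are a time and a date); the function name extract_stamp is made up.

```python
import re
from datetime import datetime

# Named groups instead of numbered ones; the pattern itself is the same idea.
pattern = re.compile(r"_(?P<time>\d+)_(?P<date>\d+)_")

def extract_stamp(filename):
    """Return (time, date) strings, or None if the name does not match."""
    m = pattern.search(filename)
    if m is None:
        return None
    return m.group("time"), m.group("date")

time_part, date_part = extract_stamp("test_0800_20180102_filepath.csv")

# Assumption: time is HHMM and date is YYYYMMDD.
stamp = datetime.strptime(date_part + time_part, "%Y%m%d%H%M")
```

Returning None for non-matching names keeps the caller from hitting an AttributeError on a failed search.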
The key here is you want everything between the first and the third underscores on each line, so there is no need to worry about designing a regex to match your time and date pattern.
with open('myfile.txt', 'r') as f:
    for line in f:
        x = '_'.join(line.split('_')[1:3])
        print(x)
The problem with your implementation is that you are only capturing the date part of your pattern. If you want to stick with a regex solution then simply move your parentheses to capture the entire pattern you want:
re.search(r"(\d+_\d+)_", "test_0800_20180102_filepath.csv").group(1)
gives:
'0800_20180102'
This is very easy to do with .split():
time = filename.split("_")[1]
date = filename.split("_")[2]

A rather special parsing of a txt file

Alright good people of Stack Overflow, my question is on the broad subject of parsing. The information I want to obtain is at multiple positions in a text file, marked by begin and end headers (special strings) at each appearance. I want to get everything that's between these headers. The code I have implemented so far seems terribly inefficient (although not slow) and, as you can see below, makes use of two while statements.
with open(sessionFile, 'r') as inp_ses:
    curr_line = inp_ses.readline()
    while 'ga_group_create' not in curr_line:
        curr_line = inp_ses.readline()
    set_name = curr_line.split("\"")[1]
    recording = []
    curr_line = inp_ses.readline()
    # now looking for the next instance
    while 'ga_group_create' not in curr_line:
        recording.append(curr_line)
        curr_line = inp_ses.readline()
Pay no attention to the fact that the begin and end headers are the same string (just call them "begin" and "end"). The code above gives me the text between the headers only the first time they appear. I can modify it to give me the rest by keeping track of variables that increment in every instance, modifying my while statements etc but all this feels like trying to re-invent the wheel and in a very bad way too.
Is there anything out there I can make use of?
Oye gentle stack traveller. Time hast come for thee to use the power of regex
Basic usage
import re
m = re.search('start(.*?)end', 'startsecretend')
m.group(1)
'secret'
. matches any character
* repeats any number of times
? makes it non greedy i.e. it won't capture 'end'
( ) indicates the group or capture
More at Python re manual
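To see what the ? after * changes, compare the greedy and non-greedy forms on a string that contains 'end' twice. The sample string is made up for illustration.

```python
import re

text = "startAendBend"

# Greedy: .* runs as far as it can, so the match stops at the LAST 'end'.
greedy = re.search(r"start(.*)end", text).group(1)

# Non-greedy: .*? stops at the FIRST 'end' that lets the pattern match.
lazy = re.search(r"start(.*?)end", text).group(1)
```

Here greedy is 'AendB' and lazy is 'A'.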
I agree regex is a good way to go here, but this is a more direct application to your problem:
import re
options = re.DOTALL | re.MULTILINE
contents = open('parsexample.txt').read()
m = re.search('ga_group_create(.*)ga_group_create', contents,
options)
lines_in_between = m.groups(0)[0].split()
If you have a couple of these groups, you can iterate through them:
for m in re.finditer('ga_group_create(.*?)ga_group_create', contents, options):
    print(m.groups(0)[0].split())
Notice I've used *? to do non-greedy matching.
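One thing to watch if the begin and end markers really are the same string: each match consumes its closing marker, so the text between that marker and the next one is skipped. A line-based alternative is sketched below, assuming every line containing the marker starts a new section; the function name split_sections and the sample lines are made up.

```python
def split_sections(lines, marker="ga_group_create"):
    """Return a list of (header_line, body_lines) pairs, one per marker line."""
    sections = []
    current = None
    for line in lines:
        if marker in line:
            # A marker line closes the previous section and opens a new one.
            current = (line, [])
            sections.append(current)
        elif current is not None:
            # Lines before the first marker are ignored.
            current[1].append(line)
    return sections

# Hypothetical input; in practice this would be the open file object.
sample = [
    'preamble',
    'ga_group_create "first"',
    'a',
    'b',
    'ga_group_create "second"',
    'c',
]

sections = split_sections(sample)
names = [header.split('"')[1] for header, _ in sections]
bodies = [body for _, body in sections]
```

This makes a single pass over the file and never loops forever at end of file, which the two while loops in the question would do if the marker were missing.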

How to get Metar data using python

So I want to write a python code that will take the latest Metar ONLY and spit it back out. The trick here though, is that this url constantly updates, but I still want it to take only the latest Metar and spit it out while ignoring the other previous Metars.
So far what I have for code is:
import urllib2
import re
URL="http://www.ogimet.com/display_metars2.php?lang=en&lugar=kewr&tipo=SA&ord=REV&nil=SI&fmt=html&ano=2015&mes=07&day=20&hora=17&anof=2015&mesf=08&dayf=19&horaf=18&minf=59&send=send"
f = urllib2.urlopen(URL)
data = f.read()
r = re.compile('<pre>(.*)</pre>', re.I | re.S | re.M)
print r.findall(data)
When I run it, it returns back all Metars.
Thanks in advance!
Your regex isn't correct: the .* is greedy and captures everything, including the intermediate </pre> tags. When I'm using regex for this type of parsing I normally use the form <tag>([^<]*), where the group matches any character except for <, which signals the next tag; obviously this isn't a super robust solution but it is often enough to do the trick. Also, you don't need those flags in your regex. In your case you will have:
r = re.compile('<pre>([^<]*)')
Secondly, re.findall returns a list of matches. In Python lists are indexed using square brackets, with the indexing starting at zero; if you want to print the first element of your list, you can call
print r.findall(data)[0]
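The same idea can be tried offline on a small HTML string. The sample reports below are invented, and the sketch assumes the page lists the newest report first (the ord=REV parameter in the URL suggests this, but that is not confirmed here). It uses a non-greedy group as an alternative to [^<]*.

```python
import re

# Made-up stand-in for the downloaded page: two <pre> blocks, newest first.
html = (
    "<html><pre>METAR KEWR 201651Z 18010KT</pre>"
    "<pre>METAR KEWR 201551Z 17008KT</pre></html>"
)

# Non-greedy group: each match stops at the first closing </pre>.
pre_blocks = re.findall(r"<pre>(.*?)</pre>", html, flags=re.I | re.S)

# Assumption: the first block is the latest report.
latest = pre_blocks[0].strip() if pre_blocks else None
```

Checking for an empty list before indexing avoids an IndexError when the page has no <pre> blocks at all.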

printing out optional regex

I am trying to print out values stored in a dictionary. These values were created from regular expressions.
Currently I have some optional fields but I am not sure if I am doing this correctly
(field A(field B)? field C (field D)?)?
I read a quick reference and it said that ? means 0 or 1 occurrence.
When I try to search for a field such as reputation or content-type I get None because these are optional on my regex. I might have the wrong regular expression but I am wondering why whenever I search for an optional field (...)? it prints out None
my code:
import re
httpproxy515139 = re.compile(r'....url\=\"(?P<url>(.*))\"(\s+exceptions\=\"(?P<exceptions>([^\"]*))\"\s+error\=\"(?P<error>([^\"]*))\"\s+(reputation\=\"(?P<reputation_opt>([^\"]*))\"\s+)?category\=\"(?P<category>([^\"]*))\"\s+reputation\=\"(?P<reputation>([^\"]*))\"\s+categoryname\=\"(?P<categoryname>([^\"]*))\"\s+(content-type\=\"(?P<content_type>([^\"]*))\")?)?')
f = open("sophos-httpproxy.out", "r")
fw = open("sophosfilter.log", "w+")
HttpProxyCount = 0
otherCount = 0
for line in f.readlines():
    HttpProxy = re.search(httpproxy515139, line)
    HttpProxy.groupdict()
    print "AV Field: "
    print "Date/Time: " + str(HttpProxy.groupdict()['categoryname'])
here is the full regex:
(?P<datetime>\w\w\w\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<IP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*httpproxy\[(?P<HTTPcode>(.*))\]:\s+id\=\"(?P<id>([^\"]*))\"\s+severity\=\"(?P<Severity>([^\"]*))\"\s+sys\=\"(?P<sys>([^\"]*))\"\s+sub\=\"(?P<sub>([^\"]*))\"\s+name\=\"(?P<name>([^\"]*))\"\s+action\=\"(?P<action>([^\"]*))\"\s+method\=\"(?P<method>([^\"]*))\"\s+srcip\=\"(?P<srcip>([^\"]*))\"\s+dstip\=\"(?P<dstip>([^\"]*))\"\s+user\=\"(?P<user>[^\"]*)\"\s+statuscode\=\"(?P<statuscode>([^\"]*))\"\s+cached\=\"(?P<cached>([^\"]*))\"\s+profile\=\"(?P<profile>([^\"]*))\"\s+filteraction\=\"(?P<filteraction>([^\"]*))\"\s+size\=\"(?P<size>([^\"]*))\"\s+request\=\"(?P<request>([^\"]*))\"\s+url\=\"(?P<url>(.*))\"(\s+exceptions\=\"(?P<exceptions>([^\"]*))\"\s+error\=\"(?P<error>([^\"]*))\"\s+(reputation\=\"(?P<reputation_opt>([^\"]*))\"\s+)?category\=\"(?P<category>([^\"]*))\"\s+reputation\=\"(?P<reputation>([^\"]*))\"\s+categoryname\=\"(?P<categoryname>([^\"]*))\"\s+(content-type\=\"(?P<content_type>([^\"]*))\")?)?
Here is a sample input:
Oct 7 13:22:55 192.168.10.2 2013: 10:07-13:22:54 httpproxy[15359]: id="0001" severity="info" sys="SecureWeb" sub="http" name="http access" action="pass" method="GET" srcip="192.168.8.47" dstip="64.94.90.108" user="" statuscode="200" cached="0" profile="REF_DefaultHTTPProfile (Default Proxy)" filteraction="REF_DefaultHTTPCFFAction (Default content filter action)" size="1502" request="0x10870200" url="http://www.concordmonitor.com/csp/mediapool/sites/dt.common.streams.StreamServer.cls?STREAMOID=6rXcvJGqsPivgZ7qnO$Sic$daE2N3K4ZzOUsqbU5sYvZF78hLWDhaM8n_FuBV1yRWCsjLu883Ygn4B49Lvm9bPe2QeMKQdVeZmXF$9l$4uCZ8QDXhaHEp3rvzXRJFdy0KqPHLoMevcTLo3h8xh70Y6N_U_CryOsw6FTOdKL_jpQ-&CONTENTTYPE=image/jpeg" exceptions="" error="" category="134" reputation="neutral" categoryname="General News" content-type="image/jpeg"
I am trying to capture the entire log.
However sometimes the url has many quotation marks in it which makes things confusing. Also in some logs, there is an extra reputation data field between error and reputation. content-type also does not always appear. Sometimes everything after the url data field is missing as well. This is why I added all the optional ?. I am trying to take account of these occurrences and print None when necessary.
Let's break your regex into two pieces:
....url\=\"(?P<url>(.*))\"
and
(\s+exceptions\=\"(?P<exceptions>([^\"]*))\"\s+error\=\"(?P<error>([^\"]*))\"\s+(reputation\=\"(?P<reputation_opt>([^\"]*))\"\s+)?category\=\"(?P<category>([^\"]*))\"\s+reputation\=\"(?P<reputation>([^\"]*))\"\s+categoryname\=\"(?P<categoryname>([^\"]*))\"\s+(content-type\=\"(?P<content_type>([^\"]*))\")?)?
The .* in the first part is greedy. It'll match everything it can, only backtracking if absolutely necessary.
The second part is one giant optional group.
When the regex executes, the .* will match everything up to the end of the string, then backtrack as necessary until the \" can match a quotation mark. That will be the last quotation mark in the string, and it will probably not be the one you wanted it to be.
Then, the giant optional group will try to match, but since the greedy .* already ate everything the giant optional group was supposed to parse, it'll fail. Since it's optional, the regex algorithm will be fine with that.
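The effect is easy to reproduce on a cut-down pattern. The two-field line below is a made-up miniature of the real log, not the full regex from the question.

```python
import re

# Made-up miniature of the log line: a url field followed by one optional field.
line = 'url="http://example.com/x" category="134"'

# Greedy .* swallows the rest of the line, then backs up to the last quote.
greedy = re.search(
    r'url="(?P<url>.*)"(\s+category="(?P<category>[^"]*)")?', line
)

# Non-greedy .*? stops at the first quote, leaving the optional group its input.
lazy = re.search(
    r'url="(?P<url>.*?)"(\s+category="(?P<category>[^"]*)")?', line
)

greedy_fields = greedy.groupdict()
lazy_fields = lazy.groupdict()
```

With the greedy pattern, url ends up holding the category text as well and category is None; with the non-greedy one, url is 'http://example.com/x' and category is '134'.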
To fix this? Well, non-greedy quantifiers might help with the immediate problem, but a better solution is probably to stop trying to use a regex to parse this. Look for existing parsers for your data format. Are you trying to pull data out of HTML or XML? I've seen a lot of recommendations for BeautifulSoup.
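If no ready-made parser fits, one lighter-weight option (a different technique from the single big regex, not a drop-in fix for it) is to pull out every key="value" pair and look fields up in a dict. The sketch assumes values never contain a double quote, which the question says is not always true for url, so treat it as a starting point only. The sample line is shortened and its URL is made up; the name parse_fields is hypothetical.

```python
import re

# One key="value" pair; keys may contain hyphens (e.g. content-type).
# Assumption: values contain no double quotes.
PAIR = re.compile(r'([\w-]+)="([^"]*)"')

def parse_fields(line):
    """Return a dict of all key="value" pairs found in the line."""
    return dict(PAIR.findall(line))

# Shortened, partly made-up sample in the shape of the log line above.
line = (
    'httpproxy[15359]: id="0001" severity="info" '
    'url="http://example.com/a?b=c" exceptions="" error="" '
    'category="134" reputation="neutral" categoryname="General News"'
)

fields = parse_fields(line)

# Optional fields simply come back as None when they are absent.
content_type = fields.get("content-type")
```

Because each field is matched independently, a missing content-type or an extra reputation field does not affect the others.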
