I want to write Python code that fetches only the latest METAR and prints it. The catch is that the page at this URL updates constantly, and I still want only the latest METAR, ignoring all the earlier ones.
So far what I have for code is:
import urllib2
import re
URL="http://www.ogimet.com/display_metars2.php?lang=en&lugar=kewr&tipo=SA&ord=REV&nil=SI&fmt=html&ano=2015&mes=07&day=20&hora=17&anof=2015&mesf=08&dayf=19&horaf=18&minf=59&send=send"
f = urllib2.urlopen(URL)
data = f.read()
r = re.compile('<pre>(.*)</pre>', re.I | re.S | re.M)
print r.findall(data)
When I run it, it returns back all Metars.
Thanks in advance!
Your regex isn't correct: the greedy .* captures everything up to the last </pre> tag, including every </pre> in between. When I use regex for this type of parsing I normally use the form <tag>([^<]*), where the group matches any character except <, which signals the next tag. This obviously isn't a super robust solution, but it is often enough to do the trick. Also, you don't need those flags in your regex. In your case you will have:
r = re.compile('<pre>([^<]*)')
Secondly, re.findall returns a list of matches. In Python lists are indexed using square brackets, with the indexing starting at zero; if you want to print the first element of your list, you can call
print r.findall(data)[0]
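For anyone on Python 3, here is a minimal sketch of the same idea that runs without hitting the network. The sample_html string and the latest_metar helper are my own stand-ins, not the real page source, and it assumes the page lists the newest report first (the ord=REV parameter in the URL suggests that, but I have not verified the layout).

```python
import re

# Hypothetical sample standing in for the fetched page; the real markup may differ.
sample_html = (
    "<html><body>"
    "<pre>METAR KEWR 191751Z 20010KT 10SM FEW050 31/21 A2992</pre>"
    "<pre>METAR KEWR 191651Z 21008KT 10SM SCT045 30/21 A2993</pre>"
    "</body></html>"
)


def latest_metar(html):
    """Return the text of the first <pre> block, or None if there is none."""
    # [^<]* stops at the next tag, so only the first block is captured.
    match = re.search(r"<pre>([^<]*)", html, re.IGNORECASE)
    return match.group(1).strip() if match else None


print(latest_metar(sample_html))
```

Using re.search instead of findall(...)[0] avoids an IndexError when the page contains no <pre> block at all.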
Alright, good people of Stack Overflow, my question is on the broad subject of parsing. The information I want to obtain is at multiple positions in a text file, marked by begin and end headers (special strings) at each appearance. I want to get everything that's between these headers. The code I have implemented so far seems terribly inefficient (although not slow) and, as you can see below, makes use of two while statements.
with open(sessionFile, 'r') as inp_ses:
    curr_line = inp_ses.readline()
    while 'ga_group_create' not in curr_line:
        curr_line = inp_ses.readline()
    set_name = curr_line.split("\"")[1]
    recording = []
    curr_line = inp_ses.readline()
    # now looking for the next instance
    while 'ga_group_create' not in curr_line:
        recording.append(curr_line)
        curr_line = inp_ses.readline()
Pay no attention to the fact that the begin and end headers are the same string (just call them "begin" and "end"). The code above gives me the text between the headers only the first time they appear. I could modify it to give me the rest by keeping track of variables that increment on every instance, modifying my while statements and so on, but all this feels like trying to re-invent the wheel, and in a very bad way too.
Is there anything out there I can make use of?
Oye gentle stack traveller. Time hast come for thee to use the power of regex
Basic usage
import re
m = re.search('start(.*?)end', 'startsecretend')
m.group(1)
'secret'
. matches any character
* repeats any number of times
? makes it non greedy i.e. it won't capture 'end'
( ) indicates the group or capture
More at Python re manual
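To see what the ? actually buys you, here is a small sketch comparing the greedy and non-greedy forms on a string with two start/end pairs. The sample text is made up for illustration.

```python
import re

text = "start one end start two end"

# Greedy: .* runs to the LAST 'end' in the string.
greedy = re.search(r"start(.*)end", text).group(1)

# Non-greedy: .*? stops at the FIRST 'end' it can.
lazy = re.search(r"start(.*?)end", text).group(1)

print(repr(greedy))  # ' one end start two '
print(repr(lazy))    # ' one '
```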
I agree regex is a good way to go here, but this is a more direct application to your problem:
import re
options = re.DOTALL | re.MULTILINE
contents = open('parsexample.txt').read()
m = re.search('ga_group_create(.*)ga_group_create', contents,
options)
lines_in_between = m.groups(0)[0].split()
If you have a couple of these groups, you can iterate through them:
for m in re.finditer('ga_group_create(.*?)ga_group_create', contents, options):
    print(m.groups(0)[0].split())
Notice I've used *? to do non-greedy matching.
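One caveat if the begin and end headers really are the same string, as in the question: each match consumes its closing header, so the next search starts after it and every other section can get skipped. A lookahead for the closing header avoids that. This is a sketch on made-up sample contents, not the asker's actual file.

```python
import re

# Hypothetical file contents with three headers, i.e. two sections between them.
contents = (
    'ga_group_create "first"\n'
    "line a\n"
    'ga_group_create "second"\n'
    "line b\n"
    'ga_group_create "third"\n'
)
marker = "ga_group_create"

# The closing marker is consumed, so the second section is never seen.
consuming = re.findall(marker + r"(.*?)" + marker, contents, re.DOTALL)

# (?=...) checks for the closing marker without consuming it.
lookahead = re.findall(marker + r"(.*?)(?=" + marker + ")", contents, re.DOTALL)

print(len(consuming))  # 1
print(len(lookahead))  # 2
```

If the real begin and end headers are different strings, the plain non-greedy pattern is enough and the lookahead is unnecessary.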
I'm trying to do a python regular expression that looks for lines formatted as such ([edit:] without new lines; the original is all on one line):
<MediaLine Label="main-video" xmlns="ms-rtcp-metrics">
<OtherTags...></OtherTags>
</MediaLine>
I wish to create a capture group of the body of this XML element (so the OtherTags...) for later processing.
Now the problem lies in the first line, where Label="main-video", and I would like to not capture Label="main-audio"
My initial solution is as such:
m = re.search(r'<MediaLine(.*?)</MediaLine>', line)
This works, in that it filters out all other non-MediaLine elements, but doesn't account for video vs audio. So to build on it, I try simply adding
m = re.search(r'<MediaLine Label(.*?)</MediaLine>', line)
but this won't create a single match, let alone being specific enough to filter audio/video. My problem seems to come down to the space between line and Label. The two variations I can think of trying both fail:
m = re.search(r'<MediaLine L(.*?)</MediaLine>', line)
m = re.search(r'<MediaLine\sL(.*?)</MediaLine>', line)
However, the following works, without being able to distinguish audio/video:
m = re.search(r'<MediaLine\s(.*?)</MediaLine>', line)
Why is the 'L' the point of failure? Where am I going wrong? Thanks for any help.
And to add to this preemptively, my goal is an expression like this:
m = re.search("<MediaLine Label=\"main-video\"(?:.*?)>(?P<payload>.*?)</MediaLine>", line)
result = m.group('payload')
By default, . doesn’t match a newline, so your initial solution didn't work either. To make . match a newline, you need to use the re.DOTALL flag (aka re.S):
>>> m = re.search("<MediaLine Label=\"main-video\"(?:.*?)>(?P<payload>.*)</MediaLine>", line, re.DOTALL)
>>> m.group('payload')
'\n <OtherTags...></OtherTags>\n'
Notice there’s also an extra ? in the first group, so that it’s not greedy.
As another comment observes, the best thing to parse XML is an XML parser. But if your particular XML is sufficiently strict in the tags and attributes that it has, then a regular expression can get the job done. It will just be messier.
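For comparison, here is a rough sketch of the parser route using the standard library's xml.etree.ElementTree. The <Metrics> wrapper, the <Codec> children and the video_children helper are invented for the example, since the question only shows a placeholder body. Note that the default xmlns has to be mapped to a prefix when searching.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the wrapper element and <Codec> children are made up.
doc = (
    '<Metrics xmlns="ms-rtcp-metrics">'
    '<MediaLine Label="main-audio"><Codec>audio-payload</Codec></MediaLine>'
    '<MediaLine Label="main-video"><Codec>video-payload</Codec></MediaLine>'
    '</Metrics>'
)

# Map the document's default namespace to a prefix usable in the search path.
NS = {"m": "ms-rtcp-metrics"}


def video_children(xml_text):
    """Return the child elements of the MediaLine labelled main-video."""
    root = ET.fromstring(xml_text)
    media = root.find("m:MediaLine[@Label='main-video']", NS)
    return [] if media is None else list(media)


print([child.text for child in video_children(doc)])
```

The attribute predicate does the audio/video filtering, so there is no need to worry about greediness or attribute order.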
I'm trying to write a small function for another script that pulls the generated text from "http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1"
Essentially, I need it to pull whatever sentence is between <br> tags.
I've been trying my darndest using regular expressions, but I never really could get the hang of those.
All of the searching I did turned up things for pulling either specific sentences, or single words.
This, however, needs to pull whatever arbitrary string is between <br> tags.
Can anyone help me out? Thanks.
Best I could come up with:
html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()
output = re.findall('\<br>.*\<br>', html)
EDIT: Ended up going with a different approach altogether, simply splitting the HTML into a list separated by <br> and pulling [3], which made for cleaner code and fewer string operations. Keeping this question up for future reference and for other people with similar questions.
You need to use the DOTALL flag as there are newlines in the expression that you need to match. I would use
re.findall('<br>(.*?)<br>', html, re.S)
However, this will return multiple results, as there are a bunch of <br><br> on that page. You may want to use the more specific:
re.findall('<hr><br>(.*?)<br><hr>', html, re.S)
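A quick sketch of why the flag matters, on a made-up fragment shaped like the pattern above (I am assuming the quote sits between <hr><br> and <br><hr> with newlines inside; the sample string is not the real page).

```python
import re

# Hypothetical fragment; the quote spans several lines.
html = "<hr><br>\nA quote, line one\nline two\n<br><hr>"

# Without re.S the dot refuses to cross a newline, so nothing matches.
without_flag = re.findall(r"<hr><br>(.*?)<br><hr>", html)

# With re.S (DOTALL) the dot matches newlines too.
with_flag = re.findall(r"<hr><br>(.*?)<br><hr>", html, re.S)

print(without_flag)  # []
print(with_flag)     # ['\nA quote, line one\nline two\n']
```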
from urllib import urlopen
import re
html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()
output = re.findall('<body>.*?>\n*([^<]{5,})<.*?</body>', html, re.S)
if (len(output) > 0):
    print(output)
    output = re.sub('\n', ' ', output[0])
    output = re.sub('\t', '', output)
    print(output)
Terminal
imac2011:Desktop allendar$ python test.py
['A black cat crossing your path signifies that the animal is going somewhere.\n\t\t-- Groucho Marx\n\n']
A black cat crossing your path signifies that the animal is going somewhere. -- Groucho Marx
You could also strip off the final \n's and replace all those inside the text (on longer quotes) with <br /> if you are displaying it in HTML again, so you would maintain the original line breaks visually.
All jokes on that page follow the same model, with nothing ambiguous, so you can use this:
output = re.findall('(?<=<br>\s)[^<]+(?=\s{2}<br)', html)
No need to use the DOTALL flag because there's no dot.
This is uh, 7 years later, but for future reference:
Use the BeautifulSoup library for this kind of purpose, as suggested by Floris in the comments.
Here is the html I am trying to parse.
<TD>Serial Number</TD><TD>AB12345678</TD>
I am attempting to use regex to parse the data. I heard about BeautifulSoup but there are around 50 items like this on the page all using the same table parameters and none of them have ID numbers. The closest they have to unique identifiers is the data in the cell before the data I need.
serialNumber = re.search("Serial Number</td><td>\n(.*?)</td>", source)
source is simply the source code of the page, grabbed using urllib. There is a newline in the HTML between the second <TD> and the serial number, but I am unsure if that matters.
Pyparsing can give you a little more robust extractor for your data:
from pyparsing import makeHTMLTags, Word, alphanums
htmlfrag = """<blah></blah><TD>Serial Number</TD><TD>
AB12345678
</TD><stuff></stuff>"""
td, tdEnd = makeHTMLTags("td")
sernoFormat = (td + "Serial Number" + tdEnd +
               td + Word(alphanums)('serialNumber') + tdEnd)
for sernoData in sernoFormat.searchString(htmlfrag):
    print(sernoData.serialNumber)
Prints:
AB12345678
Note that pyparsing doesn't care where the extra whitespace falls, and it also handles unexpected attributes that might crop up in the defined tags, whitespace inside tags, tags in upper/lower case, etc.
In most cases it is better to work on HTML using an appropriate parser, but in some cases it is perfectly OK to use regular expressions for the job. I do not know enough about your task to judge whether it is a good solution or whether it is better to go with @Paul's solution, but here I try to fix your regex:
serialNumber = re.search("Serial Number</td><td>(.*?)</td>", source, re.S | re.I )
I removed the \n, because it is difficult in my opinion (\n,\r,\r\n, ...?), instead I used the option re.S (Dotall).
But be aware, now if there is a newline, it will be in your capturing group! i.e. you should strip whitespaces afterwards from your result.
Another problem with your regex is that your string contains <TD> but you search for <td>. That is what the re.I (IgnoreCase) option is for.
You can find more explanations about regex here on docs.python.org
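Putting the pieces together, a small sketch with the strip step included. The source string here is a made-up fragment matching the question's description (newline after the second <TD>), not the real page.

```python
import re

# Hypothetical fragment shaped like the question's HTML.
source = "<TD>Serial Number</TD><TD>\nAB12345678\n</TD>"

match = re.search("Serial Number</td><td>(.*?)</td>", source, re.S | re.I)

# The captured group still contains the surrounding newlines, so strip them.
serial_number = match.group(1).strip() if match else None

print(serial_number)  # AB12345678
```

Checking match before calling group() avoids an AttributeError on pages where the cell is missing.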
I want to replace runs of a repeated symbol with just one, such as:
this is a dog???
to
this is a dog?
I'm using
str = re.sub(r"([^\s\w])(\s*\1)+", r"\1", str)
however I notice that this might replace symbols in urls that might happen in my text.
like http://example.com/this--is-a-page.html
Can someone give me some advice how to alter my regex?
So you want to unleash the power of regular expressions on an irregular language like HTML. First of all, search SO for "parse HTML with regex" to find out why that might not be such a good idea.
Then consider the following: You want to replace duplicate symbols in (probably user-entered) text. You don't want to replace them inside a URL. How can you tell what a URL is? They don't always start with http – let's say ars.userfriendly.org might be a URL that is followed by a longer path that contains duplicate symbols.
Furthermore, you'll find lots of duplicate symbols that you definitely don't want to replace (think of nested parentheses (like this)), some of them maybe inside a <script> on the page you're working on (||, && etc. come to mind).
So you might come up with something like
(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()"'\w\s])(?:\s*\1)+
which happens to work on the source code of this very page but will surely fail in other cases (for example if URLs don't start with ftp, http or mailto). Plus, it won't work in Python since it uses variable repetition inside lookbehind.
All in all, you probably won't get around parsing your HTML with a real parser, locating the body text, applying a regex to it and writing it back.
EDIT:
OK, you're already working on the parsed text, but it still might contain URLs.
Then try the following:
result = re.sub(
    r"""(?ix) # case-insensitive, verbose regex
    # Either match a URL
    # (protocol optional (if so, URL needs to start with www or ftp))
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$])
    # or
    |
    # match repeated non-word characters
    (?P<rpt>[^\s\w])(?:\s{0,100}(?P=rpt))+""",
    # and replace with both captured groups (one will always be empty)
    r"\g<URL>\g<rpt>", subject)
Re-EDIT: Hm, Python chokes on the (?:\s*(?P=rpt))+ part, saying the + has nothing to repeat. Looks like a bug in Python (reproducible with (.)(\s*\1)+ whereas (.)(\s?\1)+ works)...
Re-Re-EDIT: If I replace the * with {0,100}, then the regex compiles. But now Python complains about an unmatched group. Obviously you can't reference a group in a replacement if it hasn't participated in the match. I give up... :(
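For future readers: a sketch of the same idea that should run on a current Python 3. It sidesteps the unmatched-group complaint by using a replacement function instead of a template (the re documentation also notes that since Python 3.5 unmatched groups in a replacement template are replaced with an empty string). The dedupe_symbols name and the test sentence are my own, and the URL pattern is still only a heuristic.

```python
import re

PATTERN = re.compile(
    r"""
    # Either match a URL (kept unchanged) ...
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)
        [-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$])
    |
    # ... or a symbol repeated one or more times, optionally space-separated.
    (?P<rpt>[^\s\w])(?:\s*(?P=rpt))+
    """,
    re.IGNORECASE | re.VERBOSE,
)


def _keep_url_or_collapse(match):
    # Only one of the two named groups takes part in any given match.
    if match.group("URL") is not None:
        return match.group("URL")
    return match.group("rpt")


def dedupe_symbols(text):
    """Collapse repeated symbols outside of things that look like URLs."""
    return PATTERN.sub(_keep_url_or_collapse, text)


print(dedupe_symbols("this is a dog??? see http://example.com/this--is-a-page.html now!!"))
```

The caveats from the answer still apply: anything that is a URL but does not start with one of the listed prefixes will have its repeated symbols collapsed.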