Finding a random sentence in HTML with python regex - python

I'm trying to write a small function for another script that pulls the generated text from "http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1".
Essentially, I need it to pull whatever sentence is between the <br> tags.
I've been trying my darndest with regular expressions, but I never really got the hang of them.
All of the searching I did turned up ways to pull either specific sentences or single words.
This, however, needs to pull whatever arbitrary string is between the <br> tags.
Can anyone help me out? Thanks.
Best I could come up with:
html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()
output = re.findall('\<br>.*\<br>', html)
EDIT: Ended up going with a different approach altogether: simply splitting the HTML into a list separated by <br> and pulling [3], which made for cleaner code and fewer string operations. Keeping this question up for future reference and for other people with similar questions.
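The split approach from the EDIT can be sketched roughly as follows. The sample markup is an assumption standing in for the real page, so index 3 happens to hold the quote here; check the index against the actual response.

```python
# Hypothetical stand-in for the page source; the real markup may differ.
html = (
    "<html><body><br><br><hr><br>\n"
    "A black cat crossing your path signifies that the animal is going somewhere.\n"
    "\t\t-- Groucho Marx\n\n"
    "<br><hr></body></html>"
)

# Split on the literal tag and pick the piece that holds the quote.
parts = html.split("<br>")
quote = parts[3].strip()
print(quote)
```

This is brittle by design: any change in the number of <br> tags before the quote shifts the index.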

You need to use the DOTALL flag, as there are newlines in the text that you need to match. I would use
re.findall('<br>(.*?)<br>', html, re.S)
However, this will return multiple results, as there are a bunch of <br><br> sequences on that page. You may want to use the more specific:
re.findall('<hr><br>(.*?)<br><hr>', html, re.S)
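To see what the flag changes, here is a small sketch run against made-up markup (the sample string is an assumption, not the real page). Without re.S the dot stops at the first newline, so nothing matches.

```python
import re

# Hypothetical sample; the quote sits on its own lines between <hr><br> and <br><hr>.
sample = (
    "<body><br><br><hr><br>\n"
    "A black cat crossing your path signifies that the animal is going somewhere.\n"
    "\t\t-- Groucho Marx\n\n"
    "<br><hr></body>"
)

without_flag = re.findall('<hr><br>(.*?)<br><hr>', sample)
with_flag = re.findall('<hr><br>(.*?)<br><hr>', sample, re.S)

print(without_flag)  # empty: "." does not cross newlines by default
print(with_flag)     # one captured quote, surrounding newlines included
```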

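from urllib import urlopen  # Python 2; on Python 3 use urllib.request.urlopen and decode the bytes
import re
html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()
# Capture the first run of 5+ non-tag characters inside <body>
output = re.findall('<body>.*?>\n*([^<]{5,})<.*?</body>', html, re.S)
if output:
    print(output)
    # Flatten the quote onto one line
    output = re.sub('\n', ' ', output[0])
    output = re.sub('\t', '', output)
    print(output)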
Terminal
imac2011:Desktop allendar$ python test.py
['A black cat crossing your path signifies that the animal is going somewhere.\n\t\t-- Groucho Marx\n\n']
A black cat crossing your path signifies that the animal is going somewhere. -- Groucho Marx
You could also strip off the final \n's and replace all those inside the text (on longer quotes) with <br /> if you are displaying it in HTML again, so you would maintain the original line breaks visually.

All jokes on that page follow the same model, with nothing ambiguous, so you can use this:
output = re.findall('(?<=<br>\s)[^<]+(?=\s{2}<br)', html)
No need to use the DOTALL flag, because there's no dot in the pattern.
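A quick check of the lookaround pattern against made-up markup (the sample string is an assumption about the page layout: the quote follows "<br>" plus one whitespace character and is followed by two whitespace characters and "<br"):

```python
import re

# Hypothetical sample mirroring the layout the pattern expects.
sample = (
    "<body><br><br><hr><br>\n"
    "A black cat crossing your path signifies that the animal is going somewhere.\n"
    "\t\t-- Groucho Marx\n\n"
    "<br><hr></body>"
)

# The lookbehind and lookahead assert the context without consuming it,
# so only the quote itself is returned.
matches = re.findall(r'(?<=<br>\s)[^<]+(?=\s{2}<br)', sample)
print(matches)
```

The negated class [^<] matches newlines on its own, which is why no flag is needed here.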

This is uh, 7 years later, but for future reference:
Use the BeautifulSoup library for this kind of purpose, as suggested by Floris in the comments.

Related

Strip all html lines/code from string in python

Given the following string parsed from an email body...
s = "Keep all of this <h1>But remove this including the tags</h1> this is still good, but
<p>there also might be new lines like this that need to be removed</p>
<p> and even other lines like this all the way down here with whitespace after being parsed from email that need to be removed.</p>
But this is still okay."
How do I remove all the html code and lines from the string to simply return "Keep all of this this is still good But this is still okay." on one line? I've looked at bleach and lxml but they are simply just removing the html <> and returning what's inside, whereas I don't want any of it.
You can still use lxml to get all of the root element's text nodes:
import lxml.html
html = '''
Keep all of this <h1>But remove this including the tags</h1> this is still good, but
<p>there also might be new lines like this that need to be removed</p>
<p> and even other lines like this all the way down here with whitespace after being parsed from email that need to be removed.</p>
But this is still okay.
'''
root = lxml.html.fromstring('<div>' + html + '</div>')
text = ' '.join(t.strip() for t in root.xpath('text()') if t.strip())
Seems to work fine:
>>> text
'Keep all of this this is still good, but But this is still okay.'
Simple solution that requires no external packages:
import re
while '<' in s:
    s = re.sub('<.+?>.+?<.+?>', '', s)
Not very efficient, since it passes over the target string many times, but it should work. Note that the text itself must contain absolutely no stray < or > characters; an unmatched < would keep the loop running forever.
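A slightly more defensive variant of the same idea, as a sketch: it stops as soon as a pass removes nothing, and collapses leftover whitespace at the end. The function name strip_tagged_blocks and the shortened sample string are assumptions for illustration.

```python
import re

def strip_tagged_blocks(text):
    """Remove "<tag>...</tag>"-shaped spans, then collapse whitespace."""
    pattern = re.compile(r'<.+?>.+?<.+?>', re.S)
    while True:
        # subn reports how many replacements were made in this pass.
        text, count = pattern.subn(' ', text)
        if count == 0:
            break
    return ' '.join(text.split())

s = (
    "Keep all of this <h1>But remove this including the tags</h1> this is still good, but\n"
    "<p>there also might be new lines like this that need to be removed</p>\n"
    "<p> and even other lines like this all the way down here</p>\n"
    "But this is still okay."
)
print(strip_tagged_blocks(s))
```

Like the loop above, this assumes simple, non-nested tag pairs; it is not an HTML parser.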
This one?
import re
s = '...'  # Your string here
print(re.sub(r'[\s\n]*<.+?>.+?<.+?>[\s\n]*', ' ', s))
Edit: Just made a few mods to @BoppreH's answer, albeit with an extra space.

Bad named links search and replace

The problem I'm facing is badly named links.
There are a few hundred bad links in different files.
So I want to write a bash script to replace links like
<a href="../../../external.html?link=http://www.twitter.com">
<a href="../../external.html?link=http://www.facebook.com/pages/somepage/">
<a href="../external.html?link=http://www.tumblr.com/">
with direct links like
<a href="http://www.twitter.com">
I know the pattern ../ repeats one or more times. The external.html?link= part should also be removed.
How would you recommend doing this? awk, sed, maybe Python?
Will I need a regex?
Thanks for your opinions.
This could be a place where regular expressions are the correct solution. You are only searching for text in attributes, and the contents are regular, fitting a pattern.
The following python regular expression would locate these links for you:
r'href="((?:\.\./)+external\.html\?link=)([^"]+)"'
The pattern we look for is something inside a href="" chunk of text, where that 'something' starts with one or more instances of ../, followed by external.html?link=, then followed with any text that does not contain a " quote.
The matched text after the equals sign is grouped in group 2 for easy retrieval, group 1 holds the ../../external.html?link= part.
If all you want to do is remove the ../../external.html?link= part altogether (so the links point directly to the endpoint instead of going via the redirect page), leave off the first group and do a simple .sub() on your HTML files:
import re
redirects = re.compile(r'href="(?:\.\./)+external\.html\?link=([^"]+)"')
# ...
redirects.sub(r'href="\1"', somehtmlstring)
Note that this could also match any body text (so outside HTML tags), this is not a HTML-aware solution. Chances are there is no such body text though. But if there is, you'll need a full-blown HTML parser like BeautifulSoup or lxml instead.
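A small demonstration of that substitution on made-up markup (the link texts and the about.html link are assumptions added for illustration):

```python
import re

redirects = re.compile(r'href="(?:\.\./)+external\.html\?link=([^"]+)"')

# Hypothetical page fragment with three redirect links and one normal link.
page = (
    '<a href="../../../external.html?link=http://www.twitter.com">Twitter</a>\n'
    '<a href="../../external.html?link=http://www.facebook.com/pages/somepage/">Facebook</a>\n'
    '<a href="../external.html?link=http://www.tumblr.com/">Tumblr</a>\n'
    '<a href="about.html">About</a>\n'
)

# Keep only the captured target URL inside the href attribute.
cleaned = redirects.sub(r'href="\1"', page)
print(cleaned)
```

Links that do not go through external.html are left untouched.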
Use a HTML parser like BeautifulSoup or lxml.html.

regex regarding symbols in urls

I want to replace consecutive repeated symbols with just one, such as
this is a dog???
to
this is a dog?
I'm using
str = re.sub(r"([^\s\w])(\s*\1)+", r"\1", str)
however I notice that this might replace symbols in urls that might happen in my text.
like http://example.com/this--is-a-page.html
Can someone give me some advice how to alter my regex?
So you want to unleash the power of regular expressions on an irregular language like HTML. First of all, search SO for "parse HTML with regex" to find out why that might not be such a good idea.
Then consider the following: You want to replace duplicate symbols in (probably user-entered) text. You don't want to replace them inside a URL. How can you tell what a URL is? They don't always start with http – let's say ars.userfriendly.org might be a URL that is followed by a longer path that contains duplicate symbols.
Furthermore, you'll find lots of duplicate symbols that you definitely don't want to replace (think of nested parentheses (like this)), some of them maybe inside a <script> on the page you're working on (||, && etc. come to mind).
So you might come up with something like
(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()"'\w\s])(?:\s*\1)+
which happens to work on the source code of this very page but will surely fail in other cases (for example if URLs don't start with ftp, http or mailto). Plus, it won't work in Python since it uses variable repetition inside lookbehind.
All in all, you probably won't get around parsing your HTML with a real parser, locating the body text, applying a regex to it and writing it back.
EDIT:
OK, you're already working on the parsed text, but it still might contain URLs.
Then try the following:
result = re.sub(
r"""(?ix) # case-insensitive, verbose regex
# Either match a URL
# (protocol optional (if so, URL needs to start with www or ftp))
(?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$])
# or
|
# match repeated non-word characters
(?P<rpt>[^\s\w])(?:\s{0,100}(?P=rpt))+""",
# and replace with both captured groups (one will always be empty)
r"\g<URL>\g<rpt>", subject)
Re-EDIT: Hm, Python chokes on the (?:\s*(?P=rpt))+ part, saying the + has nothing to repeat. Looks like a bug in Python (reproducible with (.)(\s*\1)+ whereas (.)(\s?\1)+ works)...
Re-Re-EDIT: If I replace the * with {0,100}, then the regex compiles. But now Python complains about an unmatched group. Obviously you can't reference a group in a replacement if it hasn't participated in the match. I give up... :(
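One way around the unmatched-group complaint is to pass a function as the replacement, so the code decides which group to return. This is a sketch, assuming a recent Python 3 (where the pattern compiles with \s*, and where, since 3.5, unmatched groups in a replacement template are treated as empty strings anyway). The helper name keep_url_or_collapse and the sample sentence are made up for illustration.

```python
import re

pattern = re.compile(
    r"""
    # Either match a URL (protocol optional if it starts with www. or ftp.)
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$])
    |
    # or a non-word character repeated one or more times
    (?P<rpt>[^\s\w])(?:\s*(?P=rpt))+
    """,
    re.IGNORECASE | re.VERBOSE,
)

def keep_url_or_collapse(match):
    # Exactly one of the two groups took part in the match.
    if match.group("URL") is not None:
        return match.group("URL")
    return match.group("rpt")

subject = "this is a dog??? see http://example.com/this--is-a-page.html !!"
result = pattern.sub(keep_url_or_collapse, subject)
print(result)
```

URLs are matched first and returned unchanged, so the repeated "--" inside the path is never seen by the second alternative.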

skip over HTML tags in Regular Expression patterns

I'm trying to write a regular expression pattern (in python) for reformatting these template engine files.
Basically the scheme looks like this:
[$$price$$]
{
<h3 class="price">
$12.99
</h3>
}
I'm trying to make it remove any extra tabs\spaces\new lines so it should look like this:
[$$price$$]{<h3 class="price">$12.99</h3>}
I wrote this: (\t|\s)+? which works except it matches within the html tags, so h3 becomes h3class and I am unable to figure out how to make it ignore anything inside the tags.
Using regular expressions to deal with HTML is extremely error-prone; they're simply not the right tool.
Instead, use a HTML/XML-aware library (such as lxml) to build a DOM-style object tree; modify the text segments within the tree in-place, and generate your output again using said library.
Try this:
\r?\n[ \t]*
EDIT: The idea is to remove all newlines (either Unix: "\n", or Windows: "\r\n") plus any horizontal whitespace (TABs or spaces) that immediately follow them.
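Applied to the template from the question, that pattern can be used like this (a sketch; the exact indentation of the sample is an assumption):

```python
import re

template = (
    '[$$price$$]\n'
    '{\n'
    '\t<h3 class="price">\n'
    '\t\t$12.99\n'
    '\t</h3>\n'
    '}'
)

# Drop each line break together with the indentation that follows it.
compact = re.sub(r'\r?\n[ \t]*', '', template)
print(compact)
```

Spaces inside a line, such as the one in <h3 class="price">, are left alone because the pattern only starts matching at a newline.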
Alan,
I have to agree with Charles that the safest way is to parse the HTML, then work on the Text nodes only. Sounds overkill but that's the safest.
On the other hand, there is a way in regex to do that as long as you trust that the HTML code is correct (i.e. does not include invalid < and > in the tags as in: <a title="<this is a test>" href="look here">...)
Then, you know that any text has to be between > and < except at the very beginning and end (if you just get a snapshot of the page, otherwise there is the HTML tag minimum.)
So... You still need two regexes: find the text with '>[^<]+<', then apply the other regex as you mentioned.
The other way is to use an alternation, something like this (not tested!):
'(<[^>]*>)|([\r\n\f ]+)'
This will either find a tag or spaces. When you find a tag, do not replace, if you don't find a tag, replace with an empty string.
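The "replace only when it is not a tag" step can be expressed with a replacement function. A minimal sketch, assuming the sample template below (indented with spaces, since the character class in the pattern does not include tabs):

```python
import re

tag_or_space = re.compile(r'(<[^>]*>)|([\r\n\f ]+)')

def keep_tags(match):
    # Group 1 is a whole tag: return it unchanged.
    # Otherwise the match is whitespace: replace it with nothing.
    return match.group(1) or ''

template = (
    '[$$price$$]\n'
    '{\n'
    '    <h3 class="price">\n'
    '        $12.99\n'
    '    </h3>\n'
    '}'
)
compact = tag_or_space.sub(keep_tags, template)
print(compact)
```

Note that this also removes spaces between words in the text outside tags, so it only suits content like the price example.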

Regular expression to extract URL from an HTML link

I’m a newbie in Python. I’m learning regexes, but I need help here.
Here comes the HTML source:
<a href="http://www.ptop.se">http://www.ptop.se</a>
I’m trying to code a tool that only prints out http://ptop.se. Can you help me please?
If you're only looking for one:
import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))
If you have a long string, and want every instance of the pattern in it:
import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))
Where s is the string that you're looking for matches in.
Quick explanation of the regexp bits:
r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)
"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.
Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."
"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.
The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.
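Putting the pieces above together, here is a small sketch on made-up markup (the sample string is an assumption; it plays the role of s and covers double-quoted, single-quoted, and unquoted attribute values):

```python
import re

# Hypothetical input with three quoting styles.
sample = (
    '<a href="http://www.ptop.se" target="_blank">ptop</a> '
    "<a href='/about'>About</a> "
    '<a href=/contact>Contact</a>'
)

# Optional opening quote, then everything up to a quote, space, or ">".
urls = re.findall(r'href=[\'"]?([^\'" >]+)', sample)
print(urls)
```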
It's pretty easy to do:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_to_parse)
for tag in soup.findAll('a', href=True):
    print(tag['href'])
Once you've installed BeautifulSoup, anyway.
Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. First is more elegant probably, second just worked a heck of a lot faster on some unoptimized code I wrote a while back.
this should work, although there might be more elegant ways.
import re
url = '<a href="http://www.ptop.se">http://www.ptop.se</a>'
r = re.compile('(?<=href=").*?(?=")')
r.findall(url)
John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:
http://daringfireball.net/2009/11/liberal_regex_for_matching_urls
If you just want to grab the URL (i.e. you’re not really trying to parse the HTML), this might be more lightweight than an HTML parser.
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.
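Of those three, HTMLParser ships with Python, so it needs no installation. A minimal sketch using the Python 3 module html.parser; the class name LinkCollector and the sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>')
links = collector.links
print(links)
```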
this regex can help you, you should get the first group by \1 or whatever method you have in your language.
href="([^"]*)
example:
<a href="http://www.amghezi.com">amgheziName</a>
result:
http://www.amghezi.com
There's tonnes of them on regexlib
Yes, there are tons of them on regexlib. That only proves that REs should not be used to do that. Use SGMLParser or BeautifulSoup or write a parser, but don't use REs. The ones that seem to work are extremely complicated and still don't cover all cases.
This works pretty well: a non-capturing group matches href= and the opening quote, and the capturing group returns only the link. Tested on http://pythex.org/
(?:href=['"])([:/.A-Za-z?<_&\s=>0-9;-]+)
Output:
Match 1. /wiki/Main_Page
Match 2. /wiki/Portal:Contents
Match 3. /wiki/Portal:Featured_content
Match 4. /wiki/Portal:Current_events
Match 5. /wiki/Special:Random
Match 6. //donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
You can use this.
<a[^>]+href=["'](.*?)["']
