Python: Separating an HTML snippet into paragraphs

I have a snippet of HTML that contains paragraphs. (I mean p tags.) I want to split the string into the different paragraphs. For instance:
'''
<p class="my_class">Hello!</p>
<p>What's up?</p>
<p style="whatever: whatever;">Goodbye!</p>
'''
Should become:
['<p class="my_class">Hello!</p>',
 "<p>What's up?</p>",
 '<p style="whatever: whatever;">Goodbye!</p>']
What would be a good way to approach this?

If your string only contains paragraphs, you may be able to get away with a nicely crafted regex and re.split(). However, if your string is more complex HTML, or not always valid HTML, you might want to look at the BeautifulSoup package.
Usage goes like:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(some_html)
paragraphs = list(unicode(x) for x in soup.findAll('p'))
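If you do go the regex route for the paragraphs-only case, a rough sketch of what the re.split() version might look like (it splits on the whitespace between a closing and an opening p tag):
import re

snippet = '''<p class="my_class">Hello!</p>
<p>What's up?</p>
<p style="whatever: whatever;">Goodbye!</p>'''

# split wherever a </p> is followed by whitespace and another <p
paragraphs = re.split(r'(?<=</p>)\s+(?=<p)', snippet)
print paragraphs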

Use lxml.html to parse the HTML into the form you want. This is essentially the same advice as from the people recommending BeautifulSoup, except that lxml is still being actively developed while BeautifulSoup development has slowed.
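For reference, a minimal sketch of the lxml.html version of the paragraph-splitting task above:
import lxml.html

snippet = '''<p class="my_class">Hello!</p>
<p>What's up?</p>
<p style="whatever: whatever;">Goodbye!</p>'''

root = lxml.html.fromstring(snippet)
# serialize each <p> element back to markup; with_tail=False drops the
# whitespace that follows the closing tag
paragraphs = [lxml.html.tostring(p, with_tail=False) for p in root.xpath('//p')]
print paragraphs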

Use BeautifulSoup to parse the HTML and iterate over the paragraphs.

The xml.etree (std lib) or lxml.etree (enhanced) make this easy to do, but I'm not going to get the answer cred for this because I don't remember the exact syntax. I keep mixing it up with similar packages and have to look it up afresh every time.
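From memory, the etree version is roughly this, assuming the snippet is well-formed and wrapped in a single root element (unlike the HTML parsers above, xml.etree rejects broken markup):
import xml.etree.ElementTree as ET

snippet = ('<div>'
           '<p class="my_class">Hello!</p>'
           "<p>What's up?</p>"
           '<p style="whatever: whatever;">Goodbye!</p>'
           '</div>')

root = ET.fromstring(snippet)  # raises ParseError if the markup is not well-formed
paragraphs = [ET.tostring(p) for p in root.findall('p')]
print paragraphs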

Related

HTML Parser In Python without fixing HTML

I need to parse HTML, but I do not need the Python parsing library to attempt to "fix" the HTML. Any suggestions on a tool or method to use (in Python)? In my situation, if the HTML is malformed, my script needs to end the processing. I tried BeautifulSoup, but it fixed things that I did not want it to fix. I'm creating a tool that parses template files and outputs them in another, converted template style.
The book Foundations of Python Network Programming has a detailed comparison of what it looks like to scrape the same web page with Beautiful Soup and with the lxml library; but, in general, you will find that lxml is faster, more effective, and has an API which adheres closely to a Python standard (the ElementTree API, which comes with the Python Standard Library). See this blog post by the inimitable Ian Bicking for an idea of why you should be looking at lxml instead of the old-fashioned Beautiful Soup library for parsing HTML:
http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/
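If the hard requirement is failing on malformed input instead of repairing it, one option is to treat the templates as XML, since XML parsers must reject broken markup. A sketch with lxml (this only works if your templates are well-formed, e.g. XHTML):
from lxml import etree

def parse_strict(markup):
    # recover=False tells lxml to raise XMLSyntaxError rather than fix the input
    parser = etree.XMLParser(recover=False)
    return etree.fromstring(markup, parser)

try:
    tree = parse_strict('<p>unclosed paragraph')
except etree.XMLSyntaxError, err:
    print 'Malformed input, stopping:', err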
I believe BeautifulStoneSoup can do this if you pass in a list of self-closing tags:
The most common shortcoming of BeautifulStoneSoup is that it doesn't
know about self-closing tags. HTML has a fixed set of self-closing
tags, but with XML it depends on what the DTD says. You can tell
BeautifulStoneSoup that certain tags are self-closing by passing in
their names as the selfClosingTags argument to the constructor:
from BeautifulSoup import BeautifulStoneSoup
xml = "<tag>Text 1<selfclosing>Text 2"
print BeautifulStoneSoup(xml).prettify()
# <tag>
# Text 1
# <selfclosing>
# Text 2
# </selfclosing>
# </tag>
print BeautifulStoneSoup(xml, selfClosingTags=['selfclosing']).prettify()
# <tag>
# Text 1
# <selfclosing />
# Text 2
# </tag>

Python: Keyword to Links

I am building a blog on Google App Engine. I would like to convert some keywords in my blog posts to links, just like what you see in many WordPress blogs.
Here is one WP plugin which does the same thing: http://wordpress.org/extend/plugins/blog-mechanics-keyword-link-plugin-v01/
A plugin that allows you to define keyword/link pairs. The keywords are automatically linked in each of your posts.
I think this is more than a simple Python Replace. What I am dealing with is HTML code. It can be quite complex sometimes.
Take the following code snippet as an example. I want to convert the word example into a link to http://example.com:
Here is an example link: <a href="http://example.com">example.com</a>
A simple Python replace that substitutes every occurrence of example with <a href="http://example.com">example</a> would also rewrite the text inside the existing anchor, outputting broken markup like:
Here is an <a href="http://example.com">example</a> link: <a href="http://<a href="http://example.com">example</a>.com"><a href="http://example.com">example</a>.com</a>
but I want:
Here is an <a href="http://example.com">example</a> link: <a href="http://example.com">example.com</a>
Is there any Python plugin capable of this? Thanks a lot!
This is roughly what you could do using BeautifulSoup:
from BeautifulSoup import BeautifulSoup

html_body = """
Here is an example link: <a href='http://example.com'>example.com</a>
"""
soup = BeautifulSoup(html_body)

# temporarily wrap the text of existing links in '|' markers so the
# replacement pass below leaves them alone
for link_tag in soup.findAll('a'):
    link_tag.string = '|%s|' % link_tag.string

for text in soup.findAll(text=True):
    text_formatted = ['<a href="http://example.com">example</a>'
                      if word == 'example'
                      and not (word.startswith('|') and word.endswith('|'))
                      else word
                      for word in text.split()]
    text.replaceWith(' '.join(text_formatted))

# strip the '|' markers again
for link_tag in soup.findAll('a'):
    link_tag.string = link_tag.string[1:-1]

print soup
Basically I'm walking over all the text in the post body, replacing the word example with the given link, without touching the link texts, which are protected by the '|' markers during the pass.
This is not 100% perfect, for example it does not work if the word you are trying to replace ends with a period; with some patience you could fix all the edge cases.
This would probably be better suited to client-side code. You could easily modify a word highlighter to get the desired results. By keeping this client-side, you can avoid having to expire page caches when your 'tags' change.
If you really need it to be processed server-side, then you need to look at using re.sub, which lets you pass in a function; but unless you are operating on plain text, you will have to first parse the HTML using something like minidom to ensure you are not replacing something in the middle of any elements.
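A sketch of the re.sub-with-a-callable idea on plain text (keyword_map and linkify are made-up names for the example):
import re

keyword_map = {'example': 'http://example.com'}

def linkify(match):
    word = match.group(0)
    url = keyword_map.get(word.lower())
    # wrap known keywords in an anchor; leave every other word untouched
    return '<a href="%s">%s</a>' % (url, word) if url else word

print re.sub(r'\w+', linkify, 'Here is an example link.')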

Speedier/less resource-demolishing way to strip html from large files than BeautifulSoup? Or, a better way to use BeautifulSoup?

Currently I am having trouble typing this because, according to top, my processor is at 100% and my memory is at 85.7%, all being taken up by python.
Why? Because I had it go through a 250-meg file to remove markup. 250 megs, that's it! I've been manipulating these files in python with so many other modules and things; BeautifulSoup is the first code to give me any problems with something so small. How are nearly 4 gigs of RAM used to manipulate 250megs of html?
The one-liner that I found (on stackoverflow) and have been using was this:
''.join(BeautifulSoup(corpus).findAll(text=True))
Additionally, this seems to remove everything BUT markup, which is sort of the opposite of what I want to do. I'm sure that BeautifulSoup can do that, too, but the speed issue remains.
Is there anything that will do something similar (remove markup, leave text reliably) and NOT require a Cray to run?
lxml.html is FAR more efficient.
http://lxml.de/lxmlhtml.html
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
Looks like this will do what you want.
import lxml.html
t = lxml.html.fromstring("...")
t.text_content()
A couple of other similar questions:
python [lxml] - cleaning out html tags
lxml.etree, element.text doesn't return the entire text from an element
Filter out HTML tags and resolve entities in python
UPDATE:
You probably want to clean the HTML to remove all scripts and CSS, and then extract the text using .text_content()
from lxml import html
from lxml.html.clean import clean_html
tree = html.parse('http://www.example.com')
tree = clean_html(tree)
text = tree.getroot().text_content()
(From: Remove all html in python?)
Use Cleaner from lxml.html.clean:
>>> import lxml.html
>>> from lxml.html.clean import Cleaner
>>> cleaner = Cleaner(style=True) # to delete scripts styles objects comments etc;)
>>> html = lxml.html.fromstring(content).xpath('//body')[0]
>>> print cleaner.clean_html(html)

How to get the content of an HTML page in Python

I have downloaded the web page into an html file. I am wondering what's the simplest way to get the content of that page. By content, I mean I need the strings that a browser would display.
To be clear:
Input:
<html><head><title>Page title</title></head>
<body><p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>
Output:
Page title This is paragraph one. This is paragraph two.
Putting it together:
from BeautifulSoup import BeautifulSoup
import re

def removeHtmlTags(page):
    # regex approach: strip anything that looks like a tag
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)

def removeHtmlTags2(page):
    # parser approach: keep only the text nodes
    soup = BeautifulSoup(page)
    return ''.join(soup.findAll(text=True))
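A quick check against the sample input from the question (the exact whitespace in the output will differ slightly between the two functions):
page = ('<html><head><title>Page title</title></head>'
        '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.'
        '<p id="secondpara" align="blah">This is paragraph <b>two</b>.'
        '</html>')
print removeHtmlTags(page)   # regex version
print removeHtmlTags2(page)  # BeautifulSoup version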
Related
Python HTML removal
Extracting text from HTML file using Python
What is a light python library that can eliminate HTML tags? (and only text)
Remove HTML tags in AppEngine Python Env (equivalent to Ruby’s Sanitize)
RegEx match open tags except XHTML self-contained tags (famous don't use regex to parse html rant)
Parse the HTML with Beautiful Soup.
To get all the text, without the tags, try:
''.join(soup.findAll(text=True))
Personally, I use lxml because it's a swiss-army knife...
from lxml import html
print html.parse('http://someurl.at.domain').xpath('//body')[0].text_content()
This tells lxml to retrieve the page, locate the <body> tag then extract and print all the text.
I do a lot of page parsing and a regex is the wrong solution most of the time, unless it's a one-time-only need. If the author of the page changes their HTML you run a good risk of your regex breaking. A parser is a lot more likely to continue working.
The big problem with a parser is learning how to access the sections of the document you are after, but there are a lot of XPath tools you can use inside your browser that simplify the task.
You want to look at "Extracting data from HTML documents" from Dive Into Python, because it does (almost) exactly what you want.
The best modules for this task are lxml or html5lib; Beautiful Soup is, imho, no longer worth using. And for recursively nested markup like HTML, regular expressions are definitely the wrong method.
If I am getting your question correctly, this can simply be done using the urlopen function of urllib. Just have a look at that function to open a URL and read the response, which will be the HTML code of that page.
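A minimal sketch of that:
import urllib

# urlopen returns a file-like object; read() gives the raw HTML
page = urllib.urlopen('http://www.example.com')
html = page.read()
page.close()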
The quickest way to get a usable sample of what a browser would display is to remove any tags from the HTML and print the rest. This can, for example, be done using Python's re.
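For instance (a crude sketch; this naive pattern will misbehave on script blocks or attribute values that contain angle brackets):
import re

def strip_tags(html):
    # delete anything that looks like a tag: '<' up to the next '>'
    return re.sub(r'<[^>]+>', '', html)

print strip_tags('<p>This is <b>bold</b> text.</p>')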

Getting the value of href attributes in all <a> tags in an HTML file with Python

I'm building an app in python, and I need to get the URL of all links in one webpage. I already have a function that uses urllib to download the html file from the web, and transform it to a list of strings with readlines().
Currently I have this code that uses regex (I'm not very good at it) to search for links in every line:
for line in lines:
    result = re.match ('/href="(.*)"/iU', line)
    print result
This is not working, as it only prints "None" for every line in the file, but I'm sure that at least there are 3 links on the file I'm opening.
Can someone give me a hint on this?
Thanks in advance
Beautiful Soup can do this almost trivially:
from BeautifulSoup import BeautifulSoup as soup
html = soup('<body><a href="123">qwe</a><a href="fdsfsdf">asd</a></body>')
print [tag.attrMap['href'] for tag in html.findAll('a', {'href': True})]
Another alternative to BeautifulSoup is lxml (http://lxml.de/);
import lxml.html

# //a/@href selects the href attribute of every <a> element
links = lxml.html.parse("http://stackoverflow.com/").xpath("//a/@href")
for link in links:
    print link
There's an HTML parser that comes standard in Python. Check out htmllib.
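For what it's worth, htmllib's parser already collects anchor hrefs for you; a minimal sketch (the NullFormatter is used because we don't care about rendered output):
import htmllib
import formatter

parser = htmllib.HTMLParser(formatter.NullFormatter())
parser.feed('Visit <a href="http://example.com">example</a> today.')
parser.close()
print parser.anchorlist  # hrefs gathered by the default anchor handler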
As previously mentioned: regex does not have the power to parse HTML. Do not use regex for parsing HTML. Do not pass Go. Do not collect £200.
Use an HTML parser.
But for completeness, the primary problem is:
re.match ('/href="(.*)"/iU', line)
You don't use the “/.../flags” syntax for decorating regexes in Python. Instead put the flags in a separate argument:
re.match('href="(.*)"', line, re.I|re.U)
Another problem is the greedy ‘.*’ pattern. If you have two hrefs in a line, it'll happily suck up all the content between the opening " of the first match and the closing " of the second match. You can use the non-greedy ‘.*?’ or, more simply, ‘[^"]*’ to only match up to the first closing quote.
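To see the difference (a toy example; the first pattern swallows everything between the first and last quote on the line):
import re

line = '<a href="first.html">one</a> <a href="second.html">two</a>'
print re.findall(r'href="(.*)"', line)     # greedy: one bogus over-long match
print re.findall(r'href="([^"]*)"', line)  # ['first.html', 'second.html']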
But don't use regexes for parsing HTML. Really.
What others haven't told you is that using regular expressions for this is not a reliable solution.
Using regular expressions will give you wrong results in many situations: if there are <a> tags that are commented out, if there is text in the page that includes the string "href=", or if there are <textarea> elements with HTML code in them, among many others. Plus, the href attribute may exist on tags other than the anchor tag.
What you need for this is XPath, which is a query language for DOM trees, i.e. it lets you retrieve any set of nodes satisfying the conditions you specify (HTML attributes are nodes in the DOM).
XPath is a well-standardized language nowadays (a W3C recommendation), and it is well supported by all major languages. I strongly suggest you use XPath and not regexps for this.
adw's answer shows one example of using XPath for your particular case.
Don't divide the HTML content into lines, as there may be multiple matches in a single line. Also, don't assume there are always quotes around the URL.
Do something like this:
import re

# match the href value, with or without surrounding quotes
links = re.finditer(r' href="?([^\s"]+)', content)
for link in links:
    print link.group(1)
Well, just for completeness I will add here what I found to be the best answer; I found it in the book Dive Into Python, by Mark Pilgrim.
Here follows the code to list all URLs from a webpage:
from sgmllib import SGMLParser
import urllib

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []
    def start_a(self, attrs):
        # collect the href attribute of every <a> tag
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

usock = urllib.urlopen("http://diveintopython.net/")
parser = URLLister()
parser.feed(usock.read())
usock.close()
parser.close()
for url in parser.urls:
    print url
Thanks for all the replies.
