A search for "python" and "xml" returns a variety of libraries for combining the two.
This list is probably faulty:
xml.dom
xml.etree
xml.sax
xml.parsers.expat
PyXML
beautifulsoup?
HTMLParser
htmllib
sgmllib
It would be nice if someone could offer a quick summary of when to use which, and why.
The DOM/SAX divide is a basic one. It applies not just to Python, since DOM and SAX are cross-language.
DOM: read the whole document into memory and manipulate it.
Good for:
complex relationships across tags in the markup
small intricate XML documents
Cautions:
Easy to use excessive memory
SAX: parse the document while you read it. Good for:
Long documents or open ended streams
places where memory is a constraint
Cautions:
You'll need to code a stateful parser, which can be tricky (a minimal sketch follows below)
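A minimal sketch of that stateful style using the stdlib's xml.sax; the feed layout and tag names here are made up:

import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    """Collects the text of every <title> element, one state flag at a time."""
    def __init__(self):
        super().__init__()
        self.in_title = False   # the "state" the caution above warns about
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self.titles.append("")

    def characters(self, content):
        if self.in_title:
            self.titles[-1] += content

    def endElement(self, name):
        if name == "title":
            self.in_title = False

handler = TitleHandler()
xml.sax.parseString(b"<feed><item><title>First</title></item>"
                    b"<item><title>Second</title></item></feed>", handler)
print(handler.titles)  # ['First', 'Second']

The document is never held in memory as a whole; the handler only keeps what it chooses to keep.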
beautifulsoup:
Great for HTML or not-quite-well-formed markup. Easy to use and fast. Good for screen scraping, etc. It can work with markup where the XML-based parsers would just throw an error saying the markup is incorrect.
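For instance, a quick sketch (with invented markup) of BeautifulSoup shrugging off unclosed tags that a strict XML parser would reject:

from bs4 import BeautifulSoup

# Unclosed <p> and <b> tags: a strict XML parser raises an error here.
messy = "<html><body><p>First paragraph<p>Second <b>bold text</body>"
soup = BeautifulSoup(messy, "html.parser")

print([p.get_text() for p in soup.find_all("p")])
# ['First paragraph', 'Second bold text']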
Most of the rest I haven't used, but I don't think there's hard and fast rules about when to use which. Just your standard considerations: who is going to maintain the code, which APIs do you find most easy to use, how well do they work, etc.
In general, for basic needs, it's nice to use the standard library modules, since they are "standard" and thus available and well known. However, if you need to dig deep into something, there are almost always newer third-party modules with superior functionality outside the standard library.
I find xml.etree essentially sufficient for everything, except for BeautifulSoup when I need to parse broken XML (not a common problem, unlike broken HTML, which is everywhere and which BeautifulSoup also helps with): it has reasonable support for reading entire XML docs into memory, navigating them, creating them, and incrementally parsing large docs. lxml supports the same interface and is generally faster -- useful to push performance when you can afford to install third-party Python extensions (e.g. on App Engine you can't -- but xml.etree is still there, so you can run exactly the same code). lxml also has more features, and offers BeautifulSoup integration too.
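A sketch of the incremental parsing mentioned above, using xml.etree.ElementTree.iterparse so a large file never has to sit in memory all at once; the filename and tag names are made up:

import xml.etree.ElementTree as ET

# Stream a large file record by record instead of loading the whole tree.
count = 0
for event, elem in ET.iterparse("huge_feed.xml", events=("end",)):
    if elem.tag == "record":
        count += 1
        # ... inspect elem.findtext("title"), attributes, etc. here ...
        elem.clear()  # drop the processed subtree to keep memory flat

print(count, "records processed")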
The other libs you mention mimic APIs designed for very different languages, and in general I see no reason to contort Python into those gyrations. If you have very specific needs such as XSLT support, various kinds of validation, etc., it may be worth looking around for other libraries, but I haven't had such needs in a long time, so I'm not current on the offerings for them.
For many problems you can get by with the standard library's xml package. It has the major advantage of being part of the standard library. This means that it is pre-installed on almost every system and that the interface is stable. It is not the best, or the fastest, but it is there.
For everything else there is lxml. Specifically, lxml is best for parsing broken HTML, xHTML, or suspect feeds. It uses libxml2 and libxslt to handle XPath, XSLT, and EXSLT. The tutorial is clear and the interface is straightforward. The rest of the libraries mentioned exist because lxml was not available in its current form.
This is my opinion.
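A short sketch of the lxml workflow described above -- parsing slightly broken HTML and querying it with XPath; the markup is invented:

from lxml import html

# lxml.html tolerates missing/unclosed tags much like a browser would.
page = html.fromstring("<div id=links><a href='/a'>A<a href='/b'>B</div>")

# XPath comes via libxml2.
print(page.xpath("//a/@href"))              # ['/a', '/b']
print([a.text for a in page.xpath("//a")])  # ['A', 'B']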
I don't do much with XML, but when I've needed to, lxml has been a joy to work with and is apparently quite fast. The element tree API is very nice in an object oriented setting.
There's a code base (Py 2.6 stuff) that extensively uses zope components, and there's a need for a tool that can analyze the sources. The tool is supposed to analyze the source for things like usage of certain restricted classes/objects/interfaces, etc. Basically, it should scan each statement in the source while being aware of the statement's context (which class/method/function it is in, which module, etc.) and analyze it for specific patterns.
The approach of reading the source into text buffers and matching patterns doesn't seem to be a reliable approach at all.
Another approach that came up was using inspect, but inspect seems to be broken and unable to handle our code base (multiple tries, all of which crashed inspect). I might have to give up on that, as there seems to be no way forward with inspect.
Now, other options that I could think of are using pylint or using AST (and doing a lot of postprocessing on it). I'm not sure to what extent pylint is extensible, or whether, when analyzing source statements, it can be aware of context (i.e., which class definition/function/method contains a given statement). Using AST seems like overkill for such a trivial purpose.
What would you suggest as a suitable approach here? Anybody else had to deal with such an issue before?
Please suggest.
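For what it's worth, a rough sketch of the ast route mentioned in the question: an ast.NodeVisitor that tracks the enclosing class/function while flagging calls to a hypothetical restricted name (the filename and the restricted name are placeholders):

import ast

RESTRICTED = {"getUtility"}  # hypothetical names to flag

class ContextVisitor(ast.NodeVisitor):
    """Walks a module and records where restricted names are called."""
    def __init__(self):
        self.context = []   # stack of enclosing class/function names
        self.hits = []

    def visit_ClassDef(self, node):
        self.context.append(node.name)
        self.generic_visit(node)
        self.context.pop()

    visit_FunctionDef = visit_ClassDef  # same bookkeeping for functions

    def visit_Call(self, node):
        func = node.func
        name = getattr(func, "id", getattr(func, "attr", None))
        if name in RESTRICTED:
            self.hits.append((".".join(self.context) or "<module>", node.lineno))
        self.generic_visit(node)

source = open("some_module.py").read()   # placeholder filename
visitor = ContextVisitor()
visitor.visit(ast.parse(source))
print(visitor.hits)   # e.g. [('MyView.update', 42)]

Because the visitor maintains its own stack, each hit is reported with the class/method context the question asks about.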
Given some random news article, I want to write a web crawler to find the largest body of text present, and extract it. The intention is to extract the physical news article on the page.
The original plan was to use BeautifulSoup's findAll(True) command (which means: extract all HTML tags) and to sort each tag by its .getText() value. (EDIT: don't use BeautifulSoup for HTML work; use the lxml library instead -- it's Python-based and much faster than BeautifulSoup.)
But this won't work for most pages, like the one I listed as an example, because the large body of text is split into many smaller tags, like paragraph dividers for example.
Does anyone have any experience with this? Any help with something like this would be amazing.
At the moment I'm using BeautifulSoup along with python, but willing to explore other possibilities.
EDIT: Came back to this question a few months later (wow, I sounded like an idiot ^), and solved this with a combination of libraries & my own code.
Here are some deadly helpful Python libraries for the task, in order of how much each helped me:
#1 goose library -- fast, powerful, consistent (a usage sketch follows after this list)
#2 readability library -- content is passable; slower on average than goose but faster than boilerpipe
#3 python-boilerpipe -- slower and harder to install; no fault of the boilerpipe library itself (originally in Java), but of the fact that this library is built on top of another library in Java, which adds IO time, errors, etc.
I'll release benchmarks perhaps if there is interest.
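A rough sketch of basic goose usage as I remember it; the URL is a placeholder and the exact import path may vary between goose releases:

from goose import Goose  # python-goose; newer forks may import from goose3 instead

g = Goose()
article = g.extract(url="http://example.com/some-news-story")  # placeholder URL

print(article.title)
print(article.cleaned_text[:500])  # the extracted article body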
Indirectly related libraries, you should probably install them and read their docs:
NLTK text processing library -- This is too good not to install. They provide text analysis tools along with html tools (like cleanup, etc).
lxml html/xml parser -- Mentioned above. This beats BeautifulSoup in every aspect but usability. It's a bit harder to learn, but the results are worth it. HTML parsing takes much less time; it's very noticeable.
python webscraper library -- I think the value of this code isn't the lib itself, but using the lib as a reference manual to build your own crawlers/extractors. It's very nicely coded / documented!
A lot of the value and power of using Python, a rather slow language, comes from its open-source libraries. They are especially awesome when combined and used together, and everyone should take advantage of them to solve whatever problems they may have!
Goose library gets lots of solid maintenance, they just added Arabic support, it's great!
You might look at the python-readability package which does exactly this for you.
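A minimal sketch, assuming the readability-lxml distribution of that package and a placeholder URL:

import requests
from readability import Document  # pip install readability-lxml

html = requests.get("http://example.com/some-article").text  # placeholder URL
doc = Document(html)

print(doc.title())     # best-guess article title
print(doc.summary())   # cleaned-up HTML of the main article body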
You're really not going about it the right way, I would say, as all the comments above would attest to.
That said, this does what you're looking for.
from bs4 import BeautifulSoup as BS
import requests

# Fetch the article page.
html = requests.get('http://www.cnn.com/2013/01/04/justice/ohio-rape-online-video/index.html?hpt=hp_c2').text
# Parse with the stdlib html.parser (specifying it avoids bs4's "no parser specified" warning).
soup = BS(html, 'html.parser')
# Find the story's main container, then join the text of every <p> inside it.
print('\n\n'.join([k.text for k in soup.find(class_='cnn_strycntntlft').find_all('p')]))
It pulls out only the text, first by finding the main container of all the <p> tags, then by selecting only the <p> tags themselves to get the text; ignoring the <script> and other irrelevant ones.
As was mentioned in the comments, this will only work for CNN--and possibly, only this page. You might need a different strategy for every new webpage.
What is the best way in Python to read XML from a file, build a tree, do some rewriting in the tree and generate Python code? (I'm slightly confused as in Python there seem to be multiple options such as expat, ElementTree, XSLT, ...).
You should check out lxml. It is easy to use and does what you want in a few steps. If you want to stick to the stdlib, then check out ElementTree (in the standard library since Python 2.5). The type of XML processing you want depends on your needs; for high-performance XML parsing, read this.
EDIT: My answer is for XML parsing with Python and does not answer your last question: "generate Python code", cause that makes no sense :)
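A minimal sketch of that read, rewrite, write-back cycle with the stdlib's xml.etree.ElementTree; the filename and tag names are invented:

import xml.etree.ElementTree as ET

tree = ET.parse("books.xml")         # read the file and build the tree
root = tree.getroot()

# Rewrite: bump every <price> by 10% and mark the element we touched.
for price in root.iter("price"):
    price.text = "%.2f" % (float(price.text) * 1.10)
    price.set("adjusted", "true")

tree.write("books_updated.xml", encoding="utf-8", xml_declaration=True)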
I use Beautiful Soup. It's relatively easy to use and powerful.
Check it out here
I know that any language is capable of parsing XML; I'm really just looking for advantages or drawbacks that you may have come across in your own experiences. Perl would be my standard go to here, but I'm open to suggestions.
Thanks!
UPDATE: I ended up going with XML::Simple which did a nice job, but I have one piece of advice if you plan to use it--research the forcearray option first. I had to rewrite a bunch of statements after learning that it is usually best practice to set forcearray. This page had the clearest explanation that I could find. Frankly, I'm surprised this isn't the default behavior.
If you are using Perl then I would recommend XML::Simple:
As more and more Web sites begin using XML for their content, it's increasingly important for Web developers to know how to parse XML data and convert it into different formats. That's where the Perl module called XML::Simple comes in. It takes away the drudgery of parsing XML data, making the process easier than you ever thought possible.
XML::Twig is very nice, especially because it’s not as awfully verbose as some of the other options.
For pure XML parsing, I wouldn't use Java, C#, C++, C, etc. They tend to overcomplicate things, as in you want a banana and get the gorilla with it as well.
Higher-level and interpreted languages such as Perl, PHP, Python, Groovy are more suitable. Perl is included in virtually every Linux distro, as is PHP for the most part.
I've used Groovy recently for exactly this and found it very easy. Mind you, though, a C parser will be orders of magnitude faster than Groovy, for instance.
It's all going to be in the libraries.
Python has great libraries for XML. My preference is lxml. It uses libxml/libxslt, so it's fast, but the Python bindings make it really easy to use. Perl may very well have equally awesome OO libraries.
I saw that people recommend XML::Simple if you decide on Perl.
While XML::Simple is, indeed, very simple to use and great, it is a DOM parser. As such, it is, sadly, completely unsuitable for processing large XML files, as your process would run out of memory (it's a common problem for any DOM parser, not limited to XML::Simple or Perl).
So, for large files, you must pick a SAX parser in whichever language you choose (there are many XML SAX parsers in Perl, or you can use another stream parser like XML::Twig, which is even better than a standard SAX parser; I can't speak for other languages).
Not exactly a scripting language, but you could also consider Scala. You can start from here.
Scala's XML support is rather good, especially as XML can just be typed directly into Scala programs.
Microsoft also did some cool integrated stuff with their LINQ for XML
But I really like Elementtree and just that package alone is a good reason to use Python instead of Perl ;)
Here's an example:
import xml.etree.ElementTree as ET  # stdlib since Python 2.5 (the old standalone package was imported as elementtree.ElementTree)
# build a tree structure
root = ET.Element("html")
head = ET.SubElement(root, "head")
title = ET.SubElement(head, "title")
title.text = "Page Title"
body = ET.SubElement(root, "body")
body.set("bgcolor", "#ffffff")
body.text = "Hello, World!"
# wrap it in an ElementTree instance, and save as XML
tree = ET.ElementTree(root)
tree.write("page.xhtml")
It's not a scripting language, but Scala is great for working with XML natively. Also, see this book (draft) by Burak.
Python has some pretty good support for XML. From the standard library DOM packages to much more 'pythonic' libraries that parse XML directly into more usable object structures.
There isn't really a 'right' language though... there are good XML packages for most languages nowadays.
If you're going to use Ruby to do it then you're going to want to take a look at Nokogiri or Hpricot. Both have their strengths and weaknesses. The language and package selection really comes down to what you want to do with the data after you've parsed it.
Reading Data out of XML files is dead easy with C# and LINQ to XML!
Somehow, although I really love python, I found it hard to parse XML with the standard libraries.
I would say it depends like everything else. VB.NET 2008 uses XML literals, has IntelliSense for LINQ to XML, and a few power toys that help turn XML into XSD. So personally, if you are working in a .NET environment I think this is the best choice.
I'm writing a small Python app for distribution. I need to include simple XML validation (it's a debugging tool), but I want to avoid any dependencies on compiled C libraries such as lxml or pyxml as those will make the resulting app much harder to distribute. I can't find anything that seems to fit the bill - for DTDs, Relax NG or XML Schema. Any suggestions?
Do you mean something like MiniXsv? I have never used it, but from the website, we can read that
minixsv is a lightweight XML schema validator package written in pure Python (at least Python 2.4 is required).
so, it should work for you.
I believe that ElementTree could also be used for that goal, but I am not 100% sure.
Why don't you try invoking an online XML validator and parsing the results?
I couldn't find any free REST or SOAP based services, but it would be easy enough to use a normal HTML form based one such as this one or this one. You just need to construct the correct request and parse the results (httplib may be of help here if you don't want to use a third-party library such as mechanize to ease the pain).
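A rough sketch of that idea using only the standard library (Python 3's urllib here rather than httplib); the validator URL, form field names, and success marker are entirely made up and would need to be adapted to whichever service you pick:

import urllib.parse
import urllib.request

xml_doc = open("debug_dump.xml").read()   # hypothetical file to validate

# Hypothetical endpoint and form field names; inspect the real validator's form to get them.
data = urllib.parse.urlencode({"fragment": xml_doc, "doctype": "Inline"}).encode()
with urllib.request.urlopen("http://validator.example.com/check", data=data) as resp:
    report = resp.read().decode("utf-8", errors="replace")

# Crude result check; the real service's success marker will differ.
print("looks valid" if "No errors" in report else report[:500])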
I haven't looked at it in a while so it might have changed, but Fourthought's 4Suite (community edition) could still be pure Python.
Edit: just browsed through the docs and it's mostly Python which probably isn't enough for you.
The beautifulsoup module is pure Python (and a single .py file), but I don't know if it will fulfill your validation needs; I've only used it for quickly extracting some fields from large XML files.
http://www.crummy.com/software/BeautifulSoup/