I need to parse an XML file and put the results on an HTML form, but I am new to Python. Does Python 2.7 have something like LINQ to XML from C#, or can you suggest any good XML library?
Check out Pynq – Python Language Integrated Query https://github.com/heynemann/pynq/wiki
I'm not sure though whether Pynq will be sufficient for you, although it does implement expression trees in Python just like LINQ does for C#.
For an easy way to access XML in Python, you could check out BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/documentation.html). Note: for XML processing, use BeautifulStoneSoup.
A simple example: "find the first three p tags"
soup.findAll('p', limit=3)
For a more comprehensive selection of XML libraries for Python, please see "PythonXml" in PythonInfo Wiki.
Look at lxml, more specifically the combination of:
the ElementTree interface: an easier XML model/API than the DOM, much like the XDocument and XElement classes (attributes and text are even simpler because they are not separate "nodes", which may look strange at first if you work with "mixed content" models a lot and are used to DOM interfaces)
The E-factory of lxml.builder (like "functional construction" in LINQ to XML, but even better ;-))
Python's built-in list comprehensions and generator expressions, which can get you very close to the LINQ query syntax (without looking like SQL, however ;-))
... will get you a very similar experience. (Having played with LINQ to XML in .NET as well: I much prefer working with Python and lxml.)
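Purely as illustration, here is a minimal sketch of how those three pieces fit together, assuming lxml is installed; the input data and element names are made up:

from lxml import etree
from lxml.builder import E

items = ["alpha", "beta", "gamma"]  # hypothetical input data

# "Functional construction", roughly comparable to nesting XElement in LINQ to XML
doc = E.html(
    E.head(E.title("Demo")),
    E.body(
        E.ul(*[E.li(item) for item in items])  # a comprehension in place of a query
    ),
)
print(etree.tostring(doc, pretty_print=True).decode())

# A LINQ-like "query" over the tree with a generator expression
first_li = next(li.text for li in doc.iter("li") if li.text.startswith("b"))
print(first_li)  # -> beta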
Another remark: lxml has some good support for HTML (even ill-formed HTML) as well, including facilities for filling out HTML forms (I'm not sure if that is what you meant by "put results on HTML form", however).
The most fully featured XML library in Python is lxml (http://lxml.de), but as far as I know it doesn't offer a LINQ-style interface.
You can use ElementTree interface http://lxml.de/tutorial.html#the-elementtree-class
or XPath selectors http://lxml.de/xpathxslt.html#xpath
or CSS selectors http://lxml.de/cssselect.html
for data extraction from XML.
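For illustration, a minimal sketch of those three extraction styles, assuming lxml (and, for the CSS part, the cssselect package) is installed; the XML snippet is made up:

from lxml import etree
from lxml.cssselect import CSSSelector

xml = b"<catalog><book id='1'><title>A</title></book><book id='2'><title>B</title></book></catalog>"
root = etree.fromstring(xml)

# ElementTree-style navigation
print([b.get("id") for b in root.findall("book")])

# XPath selectors
print(root.xpath("//book[@id='2']/title/text()"))

# CSS selectors (requires cssselect)
sel = CSSSelector("book > title")
print([e.text for e in sel(root)])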
Related
There are inner <div> blocks nested inside a <div> block.
What is the fastest way to extract all <div> blocks from an HTML string?
(bs4, lxml, or regex?)
lxml is generally considered the fastest of the existing Python parsers, though parsing speed depends on multiple factors, from the specific HTML being parsed to the computational power you have available. For HTML parsing, use the lxml.html subpackage:
from lxml.html import fromstring, tostring
data = """my HTML string"""
root = fromstring(data)
print([tostring(div) for div in root.xpath(".//div")])
print([div.text_content() for div in root.xpath(".//div")])
There is also the awesome BeautifulSoup parser which, if allowed to use lxml under the hood, would be a great combination of convenience, flexibility and speed. It would generally not be faster than pure lxml, but it comes with one of the best APIs I've ever seen, allowing you to "view" your XML/HTML from different angles and use a huge variety of techniques:
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, "lxml")
print([str(div) for div in soup.find_all("div")])
print([div.get_text() for div in soup.find_all("div")])
And, personally, I think there is rarely a case where regex is suitable for HTML parsing:
RegEx match open tags except XHTML self-contained tags
When I teach XML/HTML parsing with Python, I usually present these levels of complexity:
RegEx: efficient for (very) simple parsing, but it can quickly become hard to maintain.
SAX: efficient and safe for parsing XML as a stream. Easy to extract pieces of data, but awful when you want to transform the tree. It can become really difficult to maintain. Who still uses it anyway?
DOM or ElementTree parsing with lxml: less memory-efficient, since the whole XML tree is loaded into memory (which can be an issue for big XML; see the streaming sketch after this list). But the library is compiled (in Cython), so it is fast. Very popular and reliable. Easy to understand, so the code can be maintained.
XSLT 1.0 is also a possibility. Very good for transforming the tree in depth, but not efficient because of the template machinery, and you need to learn a new language that many find difficult. Maintenance often becomes heavy. Note that lxml can run XSLT with Python functions as extensions.
XSLT 2.0 is very powerful, but the only implementation I know of is Saxon, written in Java, and launching the JRE is time-consuming. The language is difficult to learn, and you need to be an expert to understand every subtlety. In that respect it is worse than XSLT 1.0.
For your problem, lxml (or BeautifulSoup) sounds good.
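If memory is the concern, a middle ground between SAX and a full in-memory tree is lxml's iterparse. A minimal sketch, assuming a hypothetical big.xml made of <record> elements:

from lxml import etree

def handle(record):
    # hypothetical per-record work; here we just read a child's text
    print(record.findtext("name"))

for event, elem in etree.iterparse("big.xml", events=("end",), tag="record"):
    handle(elem)
    elem.clear()  # free the element we just handled
    # also drop references the root keeps to already-processed siblings
    while elem.getprevious() is not None:
        del elem.getparent()[0]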
I am working on a project that will involve parsing HTML.
After searching around, I found two probable options: BeautifulSoup and lxml.html
Is there any reason to prefer one over the other? I have used lxml for XML some time back and I feel I will be more comfortable with it; however, BeautifulSoup seems to be much more common.
I know I should use the one that works for me, but I was looking for personal experiences with both.
The simple answer, imo, is that if you trust your source to be well-formed, go with the lxml solution. Otherwise, BeautifulSoup all the way.
Edit:
This answer is three years old now; it's worth noting, as Jonathan Vanasco does in the comments, that BeautifulSoup4 now supports using lxml as the internal parser, so you can use the advanced features and interface of BeautifulSoup without most of the performance hit, if you wish (although I still reach straight for lxml myself -- perhaps it's just force of habit :)).
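For example, a minimal sketch of BeautifulSoup 4 delegating the actual parsing to lxml (assuming both packages are installed; the markup is made up):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "lxml")  # "lxml" selects the lxml parser
print(soup.b.get_text())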
In summary, lxml is positioned as a lightning-fast, production-quality HTML and XML parser that, by the way, also includes a soupparser module to fall back on BeautifulSoup's functionality. BeautifulSoup is a one-person project, designed to save you time quickly extracting data out of poorly-formed HTML or XML.
lxml documentation says that both parsers have advantages and disadvantages. For this reason, lxml provides a soupparser so you can switch back and forth. Quoting,
BeautifulSoup uses a different parsing approach. It is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superior support for encoding detection. It very much depends on the input which parser works better.
In the end they are saying,
The downside of using this parser is that it is much slower than the HTML parser of lxml. So if performance matters, you might want to consider using soupparser only as a fallback for certain cases.
If I understand them correctly, it means that the soup parser is more robust --- it can deal with a "soup" of malformed tags by using regular expressions --- whereas lxml is more straightforward and just parses things and builds a tree as you would expect. I assume it also applies to BeautifulSoup itself, not just to the soupparser for lxml.
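A minimal sketch of that fallback pattern, assuming both lxml and BeautifulSoup are installed; the broken markup is made up and may well parse fine with either parser:

from lxml import etree
from lxml.html import fromstring, soupparser, tostring

def parse_html(markup):
    try:
        return fromstring(markup)
    except (etree.ParserError, etree.XMLSyntaxError):
        # BeautifulSoup-based parsing: slower, but more forgiving
        return soupparser.fromstring(markup)

root = parse_html("<p>unclosed <b>bold<p>next")
print(tostring(root))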
They also show how to benefit from BeautifulSoup's encoding detection, while still parsing quickly with lxml:
>>> import lxml.html
>>> from BeautifulSoup import UnicodeDammit
>>> def decode_html(html_string):
... converted = UnicodeDammit(html_string, isHTML=True)
... if not converted.unicode:
... raise UnicodeDecodeError(
... "Failed to detect encoding, tried [%s]",
... ', '.join(converted.triedEncodings))
... # print converted.originalEncoding
... return converted.unicode
>>> root = lxml.html.fromstring(decode_html(tag_soup))
(Same source: http://lxml.de/elementsoup.html).
In the words of BeautifulSoup's creator,
That's it! Have fun! I wrote Beautiful Soup to save everybody time. Once you get used to it, you should be able to wrangle data out of poorly-designed websites in just a few minutes. Send me email if you have any comments, run into problems, or want me to know about your project that uses Beautiful Soup.
--Leonard
Quoted from the Beautiful Soup documentation.
I hope this is now clear. The soup is a brilliant one-person project designed to save you time to extract data out of poorly-designed websites. The goal is to save you time right now, to get the job done, not necessarily to save you time in the long term, and definitely not to optimize the performance of your software.
Also, from the lxml website,
lxml has been downloaded from the Python Package Index more than two million times and is also available directly in many package distributions, e.g. for Linux or MacOS-X.
And, from Why lxml?,
The C libraries libxml2 and libxslt have huge benefits:... Standards-compliant... Full-featured... fast. fast! FAST! ... lxml is a new Python binding for libxml2 and libxslt...
Use both? lxml for DOM manipulation, BeautifulSoup for parsing:
http://lxml.de/elementsoup.html
lxml's great. But parsing your input as HTML is useful only if the DOM structure actually helps you find what you're looking for.
Can you use ordinary string functions or regexes? For a lot of HTML parsing tasks, treating your input as a string rather than an HTML document is, counterintuitively, way easier.
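A minimal sketch of that string-first approach on a trivially regular input (the markup and the fields to grab are made up; for anything less uniform, a real parser is the safer bet):

import re

html = '<li class="price">$10</li><li class="price">$12</li>'

# regex on the raw string
print(re.findall(r'class="price">\$(\d+)<', html))   # -> ['10', '12']

# or plain string functions
print([chunk.split("<")[0] for chunk in html.split('class="price">')[1:]])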
What is the best way in Python to read XML from a file, build a tree, do some rewriting in the tree and generate Python code? (I'm slightly confused as in Python there seem to be multiple options such as expat, ElementTree, XSLT, ...).
You should check out lxml. It is easy to use and does what you want in a few steps. If you want to stick to the stdlib, then check out ElementTree (Python 2.5+). The type of XML processing you want depends on your needs, and for high-performance XML parsing, read this.
EDIT: My answer is for XML parsing with Python and does not answer your last question, "generate Python code", because that makes no sense :)
I use Beautiful Soup. It's relatively easy to use and powerful.
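For the read, rewrite, and write-back part of the question, here is a minimal sketch with the stdlib ElementTree; the file names and tag names are assumptions for illustration:

import xml.etree.ElementTree as ET

tree = ET.parse("config.xml")        # read XML from a file and build a tree
root = tree.getroot()

for node in root.iter("timeout"):    # rewrite part of the tree
    node.text = "30"

root.set("revised", "yes")           # touch an attribute on the root

tree.write("config_rewritten.xml", encoding="utf-8", xml_declaration=True)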
Check it out here
A search for "python" and "xml" returns a variety of libraries for combining the two.
This list is probably faulty:
xml.dom
xml.etree
xml.sax
xml.parsers.expat
PyXML
beautifulsoup?
HTMLParser
htmllib
sgmllib
It would be nice if someone could offer a quick summary of when to use which, and why.
The DOM/SAX divide is a basic one. It applies not just to Python, since DOM and SAX are cross-language.
DOM: read the whole document into memory and manipulate it.
Good for:
complex relationships across tags in the markup
small intricate XML documents
Cautions:
Easy to use excessive memory
SAX: parse the document while you read it. Good for:
Long documents or open ended streams
places where memory is a constraint
Cautions:
You'll need to code a stateful parser, which can be tricky (see the SAX sketch after this list)
beautifulsoup:
Great for HTML or not-quite-well-formed markup. Easy to use and fast. Good for screen scraping, etc. It can work with markup where the XML-based parsers would just throw an error saying the markup is incorrect.
Most of the rest I haven't used, but I don't think there's hard and fast rules about when to use which. Just your standard considerations: who is going to maintain the code, which APIs do you find most easy to use, how well do they work, etc.
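To make the "stateful parser" point concrete, a minimal SAX sketch that collects the text of hypothetical <item> elements without building a tree:

import xml.sax
from io import BytesIO

class ItemHandler(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.in_item = False   # the "state" we have to track by hand
        self.items = []

    def startElement(self, name, attrs):
        if name == "item":
            self.in_item = True
            self.items.append("")

    def characters(self, content):
        if self.in_item:
            self.items[-1] += content

    def endElement(self, name):
        if name == "item":
            self.in_item = False

handler = ItemHandler()
xml.sax.parse(BytesIO(b"<list><item>one</item><item>two</item></list>"), handler)
print(handler.items)   # -> ['one', 'two']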
In general, for basic needs, it's nice to use the standard library modules since they are "standard" and thus available and well known. However, if you need to dig deep into something, almost always there are newer nonstandard modules with superior functionality outside of the standard library.
I find xml.etree essentially sufficient for everything, except for BeautifulSoup if I ever need to parse broken XML (not a common problem, unlike broken HTML, which BeautifulSoup also helps with and which is everywhere): it has reasonable support for reading entire XML docs in memory, navigating them, creating them, and incrementally parsing large docs. lxml supports the same interface and is generally faster -- useful to push performance when you can afford to install third-party Python extensions (e.g. on App Engine you can't -- but xml.etree is still there, so you can run exactly the same code). lxml also has more features, and offers a BeautifulSoup-based parser too.
The other libs you mention mimic APIs designed for very different languages, and in general I see no reason to contort Python into those gyrations. If you have very specific needs such as support for xslt, various kinds of validations, etc, it may be worth looking around for other libraries yet, but I haven't had such needs in a long time so I'm not current wrt the offerings for them.
For many problems you can get by with the xml package. It has the major advantage of being part of the standard library. This means that it is pre-installed on almost every system and that the interface will be stable. It is not the best, or the fastest, but it is there.
For everything else there is lxml. Specifically, lxml is best for parsing broken HTML, XHTML, or suspect feeds. It uses libxml2 and libxslt to handle XPath, XSLT, and EXSLT. The tutorial is clear and the interface is straightforward. The rest of the libraries mentioned exist because lxml was not available in its current form.
This is my opinion.
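As a quick illustration of the XSLT side, a minimal sketch of running an XSLT 1.0 transform through lxml, since it bundles libxslt (the stylesheet and input document are made up):

from lxml import etree

xslt = etree.XML(b"""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/names">
    <greetings>
      <xsl:for-each select="name">
        <hello><xsl:value-of select="."/></hello>
      </xsl:for-each>
    </greetings>
  </xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(xslt)
doc = etree.XML(b"<names><name>Ada</name><name>Lin</name></names>")
print(etree.tostring(transform(doc), pretty_print=True).decode())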
I don't do much with XML, but when I've needed to, lxml has been a joy to work with and is apparently quite fast. The element tree API is very nice in an object oriented setting.
I know that any language is capable of parsing XML; I'm really just looking for advantages or drawbacks that you may have come across in your own experiences. Perl would be my standard go to here, but I'm open to suggestions.
Thanks!
UPDATE: I ended up going with XML::Simple which did a nice job, but I have one piece of advice if you plan to use it--research the forcearray option first. I had to rewrite a bunch of statements after learning that it is usually best practice to set forcearray. This page had the clearest explanation that I could find. Frankly, I'm surprised this isn't the default behavior.
If you are using Perl then I would recommend XML::Simple:
As more and more Web sites begin using XML for their content, it's increasingly important for Web developers to know how to parse XML data and convert it into different formats. That's where the Perl module called XML::Simple comes in. It takes away the drudgery of parsing XML data, making the process easier than you ever thought possible.
XML::Twig is very nice, especially because it’s not as awfully verbose as some of the other options.
For pure XML parsing, I wouldn't use Java, C#, C++, C, etc. They tend to overcomplicate things, as in you want a banana and get the gorilla with it as well.
Higher-level and interpreted languages such as Perl, PHP, Python, Groovy are more suitable. Perl is included in virtually every Linux distro, as is PHP for the most part.
I've used Groovy recently for exactly this and found it very easy. Mind you, though, that a C parser will be orders of magnitude faster than Groovy, for instance.
It's all going to be in the libraries.
Python has great libraries for XML. My preference is lxml. It uses libxml/libxslt so it's fast, but the Python binding make it really easy to use. Perl may very well have equally awesome OO libraries.
I saw that people recommend XML::Simple if you decide on Perl.
While XML::Simple is, indeed, very simple to use and great, it is a DOM parser. As such, it is, sadly, completely unsuitable for processing large XML files, as your process would run out of memory (it's a common problem for any DOM parser, not limited to XML::Simple or Perl).
So, for large files, you must pick a SAX parser in whichever language you choose (there are many XML SAX parsers in Perl, or use another stream parser like XML::Twig, which is even better than a standard SAX parser; I can't speak for other languages).
Not exactly a scripting language, but you could also consider Scala. You can start from here.
Scala's XML support is rather good, especially as XML can just be typed directly into Scala programs.
Microsoft also did some cool integrated stuff with their LINQ to XML.
But I really like ElementTree, and just that package alone is a good reason to use Python instead of Perl ;)
Here's an example:
import xml.etree.ElementTree as ET  # the standalone "elementtree" package lives in the stdlib as xml.etree since Python 2.5
# build a tree structure
root = ET.Element("html")
head = ET.SubElement(root, "head")
title = ET.SubElement(head, "title")
title.text = "Page Title"
body = ET.SubElement(root, "body")
body.set("bgcolor", "#ffffff")
body.text = "Hello, World!"
# wrap it in an ElementTree instance, and save as XML
tree = ET.ElementTree(root)
tree.write("page.xhtml")
It's not a scripting language, but Scala is great for working with XML natively. Also, see this book (draft) by Burak.
Python has some pretty good support for XML, from the standard library DOM packages to much more 'pythonic' libraries that parse XML directly into more usable object structures.
There isn't really a 'right' language though... there are good XML packages for most languages nowadays.
If you're going to use Ruby to do it then you're going to want to take a look at Nokogiri or Hpricot. Both have their strengths and weaknesses. The language and package selection really comes down to what you want to do with the data after you've parsed it.
Reading Data out of XML files is dead easy with C# and LINQ to XML!
Somehow, although I really love Python, I found it hard to parse XML with the standard libraries.
I would say it depends like everything else. VB.NET 2008 uses XML literals, has IntelliSense for LINQ to XML, and a few power toys that help turn XML into XSD. So personally, if you are working in a .NET environment I think this is the best choice.