I have some code that, as part of its running, takes an HTML document and mangles it into another form for output (HTML to BBCode, essentially).
I am currently doing this by defining dictionaries mapping XPath expressions to their replacements, then iterating over them with tools from lxml:
from lxml.etree import XPath, strip_elements, strip_tags

change_xpaths = {
    XPath(".//span[contains(@style, 'font')]"): "font",
    XPath(".//span[contains(@style, 'color')]"): "color",
    XPath(".//span[contains(@style, 'size')]"): "size"
}
replace_xpaths = {
    XPath(".//span[@style='text-decoration: underline']"): "u",
    XPath(".//span[@style='text-decoration: line-through']"): "s",
    XPath(".//div[@style='padding-left: 30px']"): "remove"
}
def _clean_text(cls, raw):
    # Drop struck-through elements entirely.
    for ele in cls.struck_through_xpath(raw):
        ele.getparent().remove(ele)
    # Exact style matches: swap the tag and drop the style attribute.
    for xp, repl in cls.replace_xpaths.items():
        for ele in xp(raw):
            ele.attrib.pop("style")
            ele.tag = repl
    # Substring style matches: swap the tag only.
    for xp, chng in cls.change_xpaths.items():
        for ele in xp(raw):
            ele.tag = chng
    # Turn <br> elements into newlines.
    for br in raw.xpath(".//br"):
        try:
            br.tail = "\n" + br.tail
        except TypeError:  # tail was None
            br.tail = "\n"
    strip_elements(raw, 'img', with_tail=False)
    strip_elements(raw, 'br', with_tail=False)
    strip_tags(raw, 'remove')
(This is, indeed, part of a class definition.)
I understand that I could do this with an XSLT transform, too.
I would like, firstly, confirmation that I can indeed do all of this with XSLT, namely replacing some tags with non-standard tags, and outright removing tags while leaving their text and tail content.
Secondly, I would like to know whether I can expect a significant performance increase by doing so. I would suspect so; however, I can't seem to find much about this on the internet.
Question 1: Yes, this is possible with XSLT. But it seems that you simply ignore the font, color and size values; actually parsing those values out of the inline CSS would be complicated in XSLT 1.0.
Question 2: I think it will be significantly faster. With your current solution, you have to iterate over all the nodes of your document multiple times (more than 10 times, AFAICS). With an XSLT stylesheet, you visit each input node only once. Also, since lxml is based on libxml2 and libxslt, you'll need fewer calls into the C API, which can be quite expensive in my experience.
OTOH, you could get a similar performance boost by rewriting your Python code to scan the document only once.
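A rough sketch of what that single pass could look like; this is a hypothetical helper, not your class method, and it leaves out the struck-through removal and the final strip_elements/strip_tags calls, which would stay as they are:

def clean_text_single_pass(raw):
    # One walk over the tree, dispatching on tag name and style,
    # instead of evaluating several XPath expressions per document.
    for ele in raw.iter():
        style = ele.get("style", "")
        if ele.tag == "span":
            if style == "text-decoration: underline":
                ele.attrib.pop("style")
                ele.tag = "u"
            elif style == "text-decoration: line-through":
                ele.attrib.pop("style")
                ele.tag = "s"
            elif "font" in style:
                ele.tag = "font"
            elif "color" in style:
                ele.tag = "color"
            elif "size" in style:
                ele.tag = "size"
        elif ele.tag == "div" and style == "padding-left: 30px":
            ele.tag = "remove"
        elif ele.tag == "br":
            ele.tail = "\n" + (ele.tail or "")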
Make sure that you compile your XSLT stylesheet only once if you perform multiple transformations.
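With lxml that just means building the etree.XSLT callable once and reusing it. A minimal sketch, where to_bbcode.xsl is a placeholder stylesheet name and documents stands for your already-parsed trees:

from lxml import etree

transform = etree.XSLT(etree.parse("to_bbcode.xsl"))  # compiled once
for doc in documents:
    result = transform(doc)  # reused for every document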
There are also some optimizations possible at the XSLT level. The most elegant way would be to write templates like:
<xsl:template match="span[contains(@style, 'font')]">...
<xsl:template match="span[contains(@style, 'color')]">...
<xsl:template match="span[contains(@style, 'size')]">...
It might be a little faster to have a single template per element name like this:
<xsl:template match="span">
  <xsl:choose>
    <xsl:when test="contains(@style, 'font')">...
    <xsl:when test="contains(@style, 'color')">...
    <xsl:when test="contains(@style, 'size')">...
Related
I have many graphml files starting with:
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns/graphml"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns/graphml">
I need to change the xmlns and xsi attributes to reflect the proper values for this XML file format specification:
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
I tried to change these values with BeautifulSoup, like this:
soup = BeautifulSoup(myfile, 'html.parser')
soup.graphml['xmlns'] = 'http://graphml.graphdrawing.org/xmlns'
soup.graphml['xsi:schemalocation'] = "http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd"
It works fine, but it is definitely too slow on some of my larger files, so I am trying to do the same with lxml. I don't quite understand how to achieve the same result there, though; I sort of managed to reach the attributes, but don't know how to change them:
doc = etree.parse(myfile)
root = doc.getroot()
root.attrib
> {'{http://www.w3.org/2001/XMLSchema-instance}schemaLocation': 'http://graphml.graphdrawing.org/xmlns/graphml'}
What is the right way to accomplish this task?
When you say that you have many files "starting with" those four lines, if they really are exactly like that, the fastest way is probably to ignore the fact that it's XML entirely and just replace those lines.
In Python, just read the first four lines, compare them to what you expect (so you can issue a warning if they don't match), then discard them. Write out the new four lines you want, then copy the rest of the file across. Repeat for each file.
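A minimal sketch of that approach; the filenames are placeholders and the header check is deliberately loose:

import shutil

NEW_HEADER = '''<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
         http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
'''

def fix_header(src, dst, old_header_lines=4):
    with open(src) as fin, open(dst, 'w') as fout:
        head = ''.join(fin.readline() for _ in range(old_header_lines))
        if 'graphml.graphdrawing.org/xmlns/graphml' not in head:
            print('warning: unexpected header in ' + src)
        fout.write(NEW_HEADER)
        shutil.copyfileobj(fin, fout)  # copy the rest of the file verbatim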
On the other hand, if you have namespace attributes anywhere else in the file, this method won't catch them, and you should probably use a real XML-based solution instead. With a regular SAX parser, you get a callback for each element start, element end, text node, and so on as it comes along. So you'd just copy events out until you hit the one(s) you want (in this case, a graphml element), then write out the new start-tag you want instead of copying the original. Then back to copying. XSLT is also a fine way to do this; it would let you write a tiny generic copier, plus one rule to handle the graphml element.
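If it does come to that, here is a rough SAX sketch; xml.sax.saxutils.XMLGenerator does the copying, and we only intercept the graphml start-tag (the filenames are placeholders):

import xml.sax
from xml.sax.saxutils import XMLGenerator

NEW_ATTRS = {
    'xmlns': 'http://graphml.graphdrawing.org/xmlns',
    'xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance',
    'xsi:schemaLocation': 'http://graphml.graphdrawing.org/xmlns '
                          'http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd',
}

class GraphmlRewriter(XMLGenerator):
    # Copies every SAX event through; only the attributes of the
    # graphml root element are replaced on the way out.
    def startElement(self, name, attrs):
        if name == 'graphml':
            merged = dict(attrs)
            merged.update(NEW_ATTRS)
            attrs = merged
        XMLGenerator.startElement(self, name, attrs)

with open('out.graphml', 'w') as out:
    xml.sax.parse('in.graphml', GraphmlRewriter(out, encoding='utf-8'))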
I have an XML file that looks like this:
xml = '''<?xml version="1.0"?>
<root>
    <item>text</item>
    <item2>more text</item2>
    <targetroot>
        <targetcontainer>
            <target>text i want to get</target>
        </targetcontainer>
        <targetcontainer>
            <target>text i want to get</target>
        </targetcontainer>
    </targetroot>
    ...more items
</root>
'''
With lxml I'm trying to access the text in the <target> element. I've found a solution, but I'm sure there is a better, more efficient way to do this. My solution:
target = etree.XML(xml)
for x in target.getiterator('root'):
    item1 = x.findtext('item')
    for target in x.iterchildren('targetroot'):
        for t in target.iterchildren('targetcontainer'):
            targetText = t.findtext('target')
Although this works and gives me access to all the elements in root as well as the target element, I'm having a hard time believing this is the most efficient solution.
So my question is this: is there a more efficient way to access the <target> texts while staying in the loop over root, because I also need access to the other elements?
You can use XPath:
for x in target.xpath('/root/targetroot/targetcontainer/target'):
    print x.text
We ask for all elements that match a path. In this case the path is /root/targetroot/targetcontainer/target, which means: all the <target> elements inside a <targetcontainer> element, inside a <targetroot> element, inside a <root> element. Also, the <root> element should be the document root, because the path starts with /, which means the beginning of the document.
Also, your XML document had two problems. First, the <?xml version="1.0"?> declaration should be the very first thing in the document, and in this example it was preceded by a newline and some spaces. Second, it is not a tag and should not be closed, so the </xml> at the end of your string should be removed. I already edited your question accordingly.
EDIT: this solution can be improved further. You do not need to spell out the whole path; you can just ask for all <target> elements anywhere in the document, by preceding the tag name with two slashes. Since you want all the <target> texts, independent of where they are, this can be a better solution. So the loop above can be written simply as:
for x in target.xpath('//target'):
    print x.text
I tried this at first, but it did not work. The problem, however, turned out to be the syntax errors in the XML, not the XPath; I tried the other, longer path and forgot to retry this one. Sorry! Anyway, I hope I shed some light on XPath nonetheless :)
I have just finished skimming through the Python DOM API and can't seem to find what I am looking for.
I basically want to preserve the XML tags when traversing the DOM tree. The idea is to print the tag names and corresponding attributes, which I later want to convert into an XML file.
<book name="bookname" source="/home/phiri/Book/book.xml"
      xmlns:xi="http://www.w3.org/2001/XInclude">
    <chapter>
        <page>page1</page>
        <page>page2</page>
    </chapter>
    <chapter>
        <page>page1</page>
        <page>page2</page>
        <page>Page3</page>
    </chapter>
</book>
Using the XML contents above, for instance, what I want is for the resulting book.xml file to contain:
<book name="bookname" source="/home/phiri/Book/book.xml"
      xmlns:xi="http://www.w3.org/2001/XInclude">
    <chapter></chapter>
    <chapter></chapter>
</book>
Is there an alternative XML package I could use to preserve the results I get when extracting content using Python?
A simple way to get the output you posted from the input is to override the XSLT identity transform. It looks like you want to eliminate all text nodes and all elements that have more than one ancestor (i.e. anything nested deeper than chapter), so you'd just add empty templates for those:
<xsl:template match="text()"/>
<xsl:template match="*[count(ancestor::*) > 1]"/>
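For completeness, here is a runnable sketch of the full stylesheet; I'm using lxml here rather than libxml2 purely for brevity, and book.xml is a placeholder filename:

from lxml import etree

# Identity transform plus the two empty templates from above.
xslt = etree.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Copy every node and attribute through unchanged by default. -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- Empty templates: drop text nodes and deeply nested elements. -->
  <xsl:template match="text()"/>
  <xsl:template match="*[count(ancestor::*) > 1]"/>
</xsl:stylesheet>""")

transform = etree.XSLT(xslt)
print(str(transform(etree.parse("book.xml"))))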
Generally, the best way to use XSLT in Python is with the libxml2 module. Unless you need a pure Python solution, in which case you're stuck not using XSLT, because nobody has built a pure Python XSLT processor yet.
Suppose we have a table:
Key|Val|Flag
01 |AAA| Y
02 |BBB| N
...
wrapped into xml this way:
<Data>
    <R><F>Key</F><F>Val</F><F>Flag</F></R>
    <R><F>01</F><F>AAA</F><F>Y</F></R>
    <R><F>02</F><F>BBB</F><F>N</F></R>
    ...
</Data>
There can be more columns and rows, obviously.
Now I'd like to parse the XML back into a table using a single regex.
I can find all the fields with '<F>([\w\d]*)</F>', but I need them to be grouped by rows somehow.
I thought about <R>(<F>([\w\d]*)</F>)*</R>, but the Python implementation finds nothing.
Can someone please help me compose the regex?
UPDATE
Some context for the question:
I'm aware of the many XML parsing libraries out there, but unfortunately my environment is limited to the standard library. Anyway, thanks to everyone who warned against using regexes for XML parsing.
I needed a quick and dirty solution, so I decided to start with regexes and switch to proper parsing later.
So far I have the code:
...
row_p = r'<R>(.*?)</R>'
field_p = r'<F>(.*?)</F>'
table = ''
for row in re.finditer(row_p, xml):
    table += '|'.join(re.findall(field_p, row.group(1))) + '\n'
...
It works for small datasets (about 10,000 rows) but fails for tables larger than 500,000 rows.
Maybe I'll investigate why it fails, but the next step I'm going to take is switching to a standard XML parser. ElementTree is the first candidate.
Mandatory links:
RegEx match open tags except XHTML self-contained tags and
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Use an XML parser. lxml is very good and even provides (among other XML-related goodies) XPath; if you have a fetish for one-liners, I'm sure there is an XPath one-liner to extract these elements ;)
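For instance, a minimal sketch (assuming the document text is in a string named txt, as in the code below):

from lxml import etree

doc = etree.fromstring(txt)
rows = [[f.text or '' for f in r.findall('F')] for r in doc.findall('R')]
# rows == [['Key', 'Val', 'Flag'], ['01', 'AAA', 'Y'], ['02', 'BBB', 'N']]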
import libxml2

txt = '\n<Data>\n <R><F>Key</F><F>Val</F><F>Flag</F></R>\n <R><F>01</F><F>AAA</F><F>Y</F></R>\n <R><F>02</F><F>BBB</F><F>N</F></R>\n</Data>\n'

rows = []
for elem in libxml2.parseDoc(txt):
    if elem.name == 'R':
        curRow = []
        rows.append(curRow)
    elif elem.name == 'F':
        curRow.append(elem.get_content())
returns:
rows = [['Key', 'Val', 'Flag'], ['01', 'AAA', 'Y'], ['02', 'BBB', 'N']]
If this question were tagged with Perl, I could post a solution plus code for you, but since this is Python...
Anyway, I suggest you load the XML file and read it line by line. Loop over each line until the end of the file and find all fields within that line. As far as I know, matches in Python are returned as a list. There you have it. I wish I could show you tested code, but this is just the main idea:
import re

field_p = re.compile(r'<F>([\w\d]*)</F>')
with open('table.xml') as f:  # placeholder filename
    for line in f:
        fields = field_p.findall(line)
        if fields:
            print('|'.join(fields))

DISCLAIMER: The above code is just a sketch.
Oh by the way, if possible, use an XML parser instead.
lxml is a Pythonic binding for the libxml2 and libxslt libraries. It is unique in that it combines the speed and feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API.
I need to remove the whitespace between XML tags, e.g. if the original XML looks like this:
<node1>
    <node2>
        <node3>foo</node3>
    </node2>
</node1>
I'd like the end result to be crunched down to a single line:
<node1><node2><node3>foo</node3></node2></node1>
Please note that I will not have control over the XML structure, so the solution should be generic enough to handle any valid XML. Also, the XML might contain CDATA blocks, which I'd need to exclude from this crunching and leave as-is.
I have a couple of ideas so far: (1) parse the XML as text and look for the start and end of tags, < and >; (2) load the XML document and go node by node, printing out a new document by concatenating the tags.
I think either method would work, but I'd rather not reinvent the wheel here, so maybe there is a Python library that already does something like this? If not, are there any issues or pitfalls to be aware of when rolling my own cruncher? Any recommendations?
EDIT
Thank you all for the answers and suggestions; both Triptych's and Van Gale's solutions work for me and do exactly what I want. I wish I could accept both answers.
This is pretty easily handled with lxml (note: this particular feature isn't in ElementTree):
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
foo = """<node1>
<node2>
<node3>foo </node3>
</node2>
</node1>"""
bar = etree.XML(foo, parser)
print etree.tostring(bar, pretty_print=False, with_tail=True)
Results in:
<node1><node2><node3>foo </node3></node2></node1>
Edit: The answer by Triptych reminded me about the CDATA requirements, so the line creating the parser object should actually look like this:
parser = etree.XMLParser(remove_blank_text=True, strip_cdata=False)
I'd use XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="*">
<xsl:copy>
<xsl:copy-of select="@*" />
<xsl:apply-templates />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
That should do the trick.
In Python you could use lxml (there is an XSLT example on its homepage) to transform it.
For a quick test, use xsltproc:
xsltproc test.xsl test.xml
where test.xsl is the file above and test.xml your XML file.
Pretty straightforward with BeautifulSoup.
This solution assumes it is ok to strip whitespace from the tail ends of character data.
Example: <foo> bar </foo> becomes <foo>bar</foo>
It will correctly ignore comments and CDATA.
import BeautifulSoup
s = """
<node1>
<node2>
<node3>foo</node3>
</node2>
<node3>
<!-- I'm a comment! Leave me be! -->
</node3>
<node4>
<![CDATA[
I'm CDATA! Changing me would be bad!
]]>
</node4>
</node1>
"""
soup = BeautifulSoup.BeautifulStoneSoup(s)
for t in soup.findAll(text=True):
    if type(t) is BeautifulSoup.NavigableString:  # Ignores comments and CDATA
        t.replaceWith(t.strip())
print soup
Not a solution really, but since you asked for recommendations: I'd advise against doing your own parsing (unless you want to learn how to write a complex parser) because, as you say, not all spaces should be removed. There are not only CDATA blocks but also elements with the xml:space="preserve" attribute, which corresponds to things like <pre> in XHTML (where the enclosed whitespace actually has meaning), and writing a parser able to recognize those elements and leave the whitespace alone would be possible but unpleasant.
I would go with the parsing method, i.e. load the document and go node by node, printing them out. That way you can easily identify which nodes you can strip the spaces from and which you can't. There are some modules in the Python standard library, none of which I have ever used ;-), that could be useful to you... try xml.dom, or maybe xml.parsers.expat, though I'm not sure the latter makes this easy. A rough sketch of the idea follows.
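Here is that node-by-node idea sketched with xml.dom.minidom; note that it does not honour xml:space="preserve", and the filename is a placeholder:

from xml.dom import minidom

def crunch(node):
    # Remove whitespace-only text nodes. CDATA sections have their own
    # nodeType, so they are left alone, as are comments; handling
    # xml:space="preserve" would need an extra explicit check.
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            crunch(child)

doc = minidom.parse("input.xml")
crunch(doc.documentElement)
print(doc.documentElement.toxml())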