How to split up a file by keyword? - python

I have a large XML file that looks like
<data> skdfnlsniisimsoinfsdfoisdfinsdofinodnfonf <emrosem> 23324097234097g </emrosem>
<peto> oifmisnie </peto>
</data>
<data> sfnseosfnosefoisneofinseionfoaisenfoisen <emrosem> 3249087203470w </emrosem>
<peto> sdfn </peto>
</data>
I want to separate this into a list that looks like
[<data> skdfnlsniisimsoinfsdfoisdfinsdofinodnfonf <emrosem> 23324097234097g </emrosem>
<peto> oifmisnie </peto></data>, <data> sfnseosfnosefoisneofinseionfoaisenfoisen
<emrosem> 3249087203470w </emrosem> <peto> sdfn </peto> </data>]
In other words, I want to split it based on the word "data".
I'm using python 2.7, thanks for the help.

The included XML Parser is one way to parse XML. It might be a bit kludgey to get data off of it and into a list with the tags intact but it should be doable.

Please don't use regular expressions for this. If you need to parse XML, use an XML parser. XML just has too many subtleties to handle it with simple string manipulation routines. For a nice explanation as to why, see the first answer to this question.

Related

How to append data to a parsed XML object - Python

I am trying to take an xml document parsed with lxml objectify in python and add subelements to it.
The problem is that I can't work out how to do this. The only real option I've found is a complete reconstruction of the data with objectify.Element and objectify.SubElement, but that... doesn't make any sense.
With JSON for instance, I can just import the data as an object and navigate to any section and read, add and edit data super easily.
How do I parse an xml document with objectify so that I can add subelement data to it?
data.xml
<?xml version='1.0' encoding='UTF-8'?>
<data>
<items>
<item> text_1 </item>
<item> text_2 </item>
</items>
</data>
I'm sure there is an answer on how to do this online but my search terminology is either bad or im dumb, but I can't find a solution to this question. I'm sorry if this is a duplicate question.
I guess it has been quite difficult explain the question, but the problem can essentially be defined by an ambiguity with how .Element and .SubElement can be applied.
This reference contains actionable and replicable ways in which one can append or add data to a parsed XML file.
Solving the key problem of:
How do I reference content in a nested tag without reconstructing the entire tree?
The author uses the find feature, which is called with root.findall(./tag)
This allows one to find the nested tag that they wanted without having to reconstruct the tree.
Here is one of the examples that they have used:
cost = ["$2000","$3000","$4000")
traveltime = ["3hrs", "8hrs", "15hrs"]
i = 0
for elm in root.findall("./country"):
ET.SubElement(elm, "Holiday", attrib={"fight_cost": cost[i],
"trip_duration": traveltime[i]})
i += 1
This example also answers the question of How do you dynamically add data to XML?
They have accomplished this by using a list outside of the loop and ieteration through it with i
In essence, this example helps explain how to reference nested content without remaking the entire tree, as well as how to dynamically add data to xml trees.

how to modify attribute values from xml tag with python

I have many graphml files starting with:
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns/graphml"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns/graphml">
I need to change the xmlns and xsi attributes to reflect proper values for this XML file format specification:
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
I tried to change these values with BeautifulSoup like:
soup = BeautifulSoup(myfile, 'html.parser')
soup.graphml['xmlns'] = 'http://graphml.graphdrawing.org/xmlns'
soup.graphml['xsi:schemalocation'] = "http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd"
It works fine but it is definitely too slow on some of my larger files, so I am trying to do the same with lxml, but I don't understand how to achieve the same result. I sort of managed to reach the attributes, but don't know how to change them:
doc = etree.parse(myfile)
root = doc.getroot()
root.attrib
> {'{http://www.w3.org/2001/XMLSchema-instance}schemaLocation': 'http://graphml.graphdrawing.org/xmlns/graphml'}
What is the right way to accomplish this task?
When you say that you have many files "starting with" those 4 lines, if you really mean they're exactly like that, the fastest way is probably to entirely ignore that fact that it's XML, and just replace those lines.
In Python, just read the first four lines, compare them to what you expect (so you can issue a warning if they don't match), then discard them. Write out the new four lines you want, then copy the rest of the file out. Repeat for each file.
On the other hand, if you have namespace attributes anywhere else in the file this method wouldn't catch them, and you should probably do a real XML-based solution. With a regular SAX parser, you get a callback for each element start, element end, text node, etc. as it comes along. So you'd just copy them out until you hit the one(s) you want (in this case, a graphml element), then instead of copying out that start-tag, write out the new one you want. Then back to copying. XSLT is also a fine way to do this, which would let you write a tiny generic copier, plus one rule to handle the graphml element.

XML parsing using Elemetree in python

I am trying to read a XML file using python [ver - 2.6.7] using ElementTree
There are some tags of the format :
<tag, [attributes]>
....Data....
</tag>
The data in my case is usually some binary data that I read using text attribute.
However there are some cases where data can reference any other tag in the file.
<tag, [attributes]>
....Data....
<ref target='idname'/>
</tag>
What attribute from element tree can be used to parse them ?
Try XPath expressions.
This will tell you whether the tag is present and, if present, returns the node.
I think I would use something like this:
for iteration in root.iter('tag'):
if iteration.find('ref'):
...
So basicly I would parse thous cases separately.

Preserve XML Tags - Python DOM Implementation

I have just finished skiming through the python DOM API and can't seem to find what I am looking for.
I basically want to preserve the XML tags when traversing through the DOM tree. The idea is to print the tag name and corresponding attributes which I later want to convert into an xml file.
<book name="bookname" source="/home/phiri/Book/book.xml"
xmlns:xi="http://www.w3.org/2001/XInclude">
<chapter>
<page>page1</page>
<page>page2</page>
</chapter>
<chapter>
<page>page1</page>
<page>page2</page>
<page>Page3</page>
</chapter>
</book>
Using the XML contents above for instance, what I want is for the result of the book.xml file to have.
<book name="bookname" source="/home/phiri/Book/book.xml"
xmlns:xi="http://www.w3.org/2001/XInclude">
<chapter></chapter>
<chapter></chapter>
</book>
Is there an alternative xml package I could use to preserve results I get when extracting contents using python?
A simple way to get the output you posted from the input is to override the XSLT identity transform. It looks like you want to eliminate all text nodes and all elements that have more than two ancestors, so you'd just add empty templates for those:
<xsl:template match="text()"/>
<xsl:template match="*[count(ancestor::*) > 2]"/>
Generally the best way to use XSLT in Python is with the libxml2 module. Unless you need a pure Python solution, in which case you're stuck not using XSLT, because nobody's built a pure Python XSLT processor yet.

Crunching xml with python

I need to remove white spaces between xml tags, e.g. if the original xml looks like:
<node1>
<node2>
<node3>foo</node3>
</node2>
</node1>
I'd like the end-result to be crunched down to single line:
<node1><node2><node3>foo</node3></node2></node1>
Please note that I will not have control over the xml structure, so the solution should be generic enough to be able to handle any valid xml. Also the xml might contain CDATA blocks, which I'd need to exclude from this crunching and leave them as-is.
I have couple of ideas so far: (1) parse the xml as text and look for start and end of tags < and > (2) another approach is to load the xml document and go node-by-node and print out a new document by concatenating the tags.
I think either method would work, but I'd rather not reinvent the wheel here, so may be there is a python library that already does something like this? If not, then any issues/pitfalls to be aware of when rolling out my own cruncher? Any recommendations?
EDIT
Thank you all for answers/suggestions, both Triptych's and Van Gale's solutions work for me and do exactly what I want. Wish I could accept both answers.
This is pretty easily handled with lxml (note: this particular feature isn't in ElementTree):
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
foo = """<node1>
<node2>
<node3>foo </node3>
</node2>
</node1>"""
bar = etree.XML(foo, parser)
print etree.tostring(bar,pretty_print=False,with_tail=True)
Results in:
<node1><node2><node3>foo </node3></node2></node1>
Edit: The answer by Triptych reminded me about the CDATA requirements, so the line creating the parser object should actually look like this:
parser = etree.XMLParser(remove_blank_text=True, strip_cdata=False)
I'd use XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="*">
<xsl:copy>
<xsl:copy-of select="#*" />
<xsl:apply-templates />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
That should do the trick.
In python you could use lxml (direct link to sample on homepage) to transform it.
For some tests, use xsltproc, sample:
xsltproc test.xsl test.xml
where test.xsl is the file above and test.xml your XML file.
Pretty straightforward with BeautifulSoup.
This solution assumes it is ok to strip whitespace from the tail ends of character data.
Example: <foo> bar </foo> becomes <foo>bar</foo>
It will correctly ignore comments and CDATA.
import BeautifulSoup
s = """
<node1>
<node2>
<node3>foo</node3>
</node2>
<node3>
<!-- I'm a comment! Leave me be! -->
</node3>
<node4>
<![CDATA[
I'm CDATA! Changing me would be bad!
]]>
</node4>
</node1>
"""
soup = BeautifulSoup.BeautifulStoneSoup(s)
for t in soup.findAll(text=True):
if type(t) is BeautifulSoup.NavigableString: # Ignores comments and CDATA
t.replaceWith(t.strip())
print soup
Not a solution really but since you asked for recommendations: I'd advise against doing your own parsing (unless you want to learn how to write a complex parser) because, as you say, not all spaces should be removed. There are not only CDATA blocks but also elements with the "xml:space=preserve" attribute, which correspond to things like <pre> in XHTML (where the enclosed whitespaces actually have meaning), and writing a parser that is able to recognize those elements and leave the whitespace alone would be possible but unpleasant.
I would go with the parsing method, i.e. load the document and go node-by-node printing them out. That way you can easily identify which nodes you can strip the spaces out of and which you can't. There are some modules in the Python standard library, none of which I have ever used ;-) that could be useful to you... try xml.dom, or I'm not sure if you could do this with xml.parsers.expat.

Categories