Parsing log4j in Python - python

I have a log file (log4j xml-ish format) that I am trying to pull info out of and use in my Python module. Could I treat this file as if it were XML? My gut is telling me no... If not, what is the best way to parse the data? Below is a section of the log file. The file does not include your standard doctype or version headers which is why I said "xml-ish."
<log4j:event
logger="com.hp.cp.elk.impl.subscriptions.AsyncSimpleSubscriptionManager"
timestamp="1352320517430" level="DEBUG" thread="Thread-77">
<log4j:message><![CDATA[Broadcasting signals to subscribers...]]></log4j:message>
</log4j:event>
<log4j:event logger="com.hp.cp.jdf.idp.queue.IDPJobProgressMonitor"
timestamp="1352320517430" level="DEBUG" thread="IDPJobProgressMonitorThread">
<log4j:message><![CDATA[[JDFQueueEntry[ --> JDFAutoQueueEntry[ --> JDFElement[
--> <?xml version="1.0" encoding="UTF-8"?><QueueEntry
xmlns="http://www.CIP4.org/JDFSchema_1_1"
DescriptiveName="H44E61-6.pdf" DeviceID="HPPRO1-SM1"
EndTime="2012-11-07T10:58:18-08:00" JobID="Default" Priority="50"
QueueEntryID="d5fbbe98a1194e0da573b51a0c8040fb" Status="Completed"
SubmissionTime="2012-11-06T16:35:06-08:00"> <Comment AgentName="CIP4 JDF Writer
Java" AgentVersion="1.4a BLD 63" ID="c_121106_163506894_000804"
Name="JobSpec">WBG_4C_Flat_21up_BusCards_Duplex</Comment>
</QueueEntry>
] ] ]] queue entries changed.]]></log4j:message>
</log4j:event>
<log4j:event logger="com.hp.cp.jdf.idp.queue.IDPJobProgressMonitor"
timestamp="1352320517430" level="DEBUG" thread="IDPJobProgressMonitorThread">
<log4j:message><![CDATA[no active queue entries changed.]]></log4j:message>
</log4j:event>
Sorry for the messy code, I just wanted to make you all can get an idea of the formatting. Anyway, I'm currently just trying to pull the value from QueueEntryID="d5fbbe98a1194e0da573b51a0c8040fb" Any suggestions? Thank you!

I would imagine that you could use standard XML tools like DOM or SAX to parse this. Otherwise, have fun with re or htmllib.

Related

how to write new xml file with same header

I have an xml file with this as the header
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type='text/xsl' href='\\segotn12805\ppr\PPRData3\StyleSheet\PPRData3.xslt'?>
when I modify the file I use .write (for example)
mytree.write('output.xml')
but the output file does not contain the header info.
The first two lines of the output file look like this
<ns0:pprdata xmlns:ns0="http://ManHub.PPRData">
<ns0:Group name="Models">
any ideas on how I can add the header info to the output file?
The first line is the XML declaration. It is optional, and a parser will assume UTF-8 if not specified.
The second line is a processing instruction.
It would be helpful if you provided more code to show what you are doing, but I suspect that you are using ElementTree. The documentation has this note indicating that by default these are skipped:
Note Not all elements of the XML input will end up as elements of the parsed tree. Currently, this module skips over any XML comments, processing instructions, and document type declarations in the input. Nevertheless, trees built using this module’s API rather than parsing from XML text can have comments and processing instructions in them; they will be included when generating XML output. A document type declaration may be accessed by passing a custom TreeBuilder instance to the XMLParser constructor.
As suggested in this answer, you might want to try using lxml

How to write XML with "usual" self-closing/header tags?

This is my code:
from xml.dom import minidom as md
doc = md.parse('file.props')
# operations with doc
xml_file = open('file.props', "w")
doc.writexml(xml_file, encoding="utf-8")
xml_file.close()
I parse an XML, I do some operations, than I open and write on it. But for example, if in my file got:
<MY_TAG />
^
it rewrites as:
<MY_TAG/>
^
I know this can seem irrelevant, but my file is constantly monitored by version control GIT, which say that line is "different" on every write.
The same with the header:
<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
it becomes:
<?xml version="1.0" encoding="utf-8"?><Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
Which is pretty annoying. Any clues?
Retaining quirks of formatting in XML through parsing and serialization is pretty well impossible. If you need to do text-level comparisons, the only way to do it is to canonicalize the formats that you are comparing (there are various XML canonicalization tools out there).
In principle you can configure git to use an XML-aware diff tool for comparisons, but please don't ask me for details, it's not something I've ever done myself. I've always just lived with the fact that it works badly.

Is it possible to call xsl Apache FOP without providing an input file but instead passing a string

I am trying to generate a PDF using FOP. To do this I am taking in a template file, initialling its values with Jinja2 and then passing it through to fop with a system call.
Is it possible to do a subprocess call to FOP without passing through an input file but instead a string containing the XML directly? And if so how would I go about doing so?
I was hoping for something like this
fop -fo "XML here" -pdf output.pdf
Yes actually it was possible.
Using python I was able to import the xml from the file into lxml.etree:
tree = etree.parse('FOP_PARENT.fo.xml')
And then by using the etree to parse the include tags:
tree.xinclude()
Then it was a simple case of converting the xml back into unicode:
xml = etree.tounicode(tree)
This is how I got the templates to work. Hopefully this helps someone who has the same issue!

Use Python to edit XML header

I've written a Python script to create some XML, but I didn't find a way to edit the heading within Python.
Basically, instead of having this as my heading:
<?xml version="1.0" ?>
I need to have this:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
I looked into this as best I could, but found no way to add standalone status from within Python. So I figured that I'd have to go over the file after it had been created and then replace the text. I read in several places that I should stay far away from using readlines() because it could ruin the XML formatting.
Currently the code I have to do this - which I got from another Stackoverflow post about editing XML with Python - is:
doc = parse('file.xml')
elem = doc.find('xml version="1.0"')
elem.text = 'xml version="1.0" encoding="UTF-8" standalone="no"'
That provides me with a KeyError. I've tried to debug it to no avail, which leads me to believe that perhaps the XML heading wasn't meant to be edited in this way. Or my code is just wrong.
If anyone is curious (or miraculously knows how to insert standalone status into Python code), here is my code to write the xml file:
with open('file.xml', 'w') as f:
f.write(doc.toprettyxml(indent=' '))
Some people don't like "toprettyxml", but with my relatively basic level, it seemed like the best bet.
Anyway, if anyone can provide some advice or insight, I would be very grateful.
The xml.etree API does not give you any options to write out a standalone attribute in the XML declaration.
You may want to use the lxml library instead; it uses the same ElementTree API, but offers more control over the output. tostring() supports a standalone flag:
from lxml import etree
etree.tostring(doc, pretty_print=True, standalone=False)
or use .write(), which support the same options:
doc.write(outputfile, pretty_print=True, standalone=False)

xml to Python data structure using lxml

How can I convert xml to Python data structure using lxml?
I have searched high and low but can't find anything.
Input example
<ApplicationPack>
<name>Mozilla Firefox</name>
<shortname>firefox</shortname>
<description>Leading Open Source internet browser.</description>
<version>3.6.3-1</version>
<license name="Firefox EULA">http://www.mozilla.com/en-US/legal/eula/firefox-en.html</license>
<ms-license>False</ms-license>
<vendor>Mozilla Foundation</vendor>
<homepage>http://www.mozilla.org/firefox</homepage>
<icon>resources/firefox.png</icon>
<download>http://download.mozilla.org/?product=firefox-3.6.3&os=win&lang=en-GB</download>
<crack required="0"/>
<install>scripts/install.sh</install>
<postinstall name="Clean Up"></postinstall>
<run>C:\\Program Files\\Mozilla Firefox\\firefox.exe</run>
<uninstall>c:\\Program Files\\Mozilla Firefox\\uninstall\\helper.exe /S</uninstall>
<requires name="autohotkey" />
</ApplicationPack>
>>> from lxml import etree
>>> treetop = etree.fromstring(anxmlstring)
converts the xml in the string to a Python data structure, and so does
>>> othertree = etree.parse(somexmlurl)
where somexmlurl is the path to a local XML file or the URL of an XML file on the web.
The Python data structure these functions provide (known as an "element tree", whence the etree module name) is well documented here -- all the classes, functions, methods, etc, that the Python data structure in question supports. It closely matches one supported in the Python standard library, by the way.
If you want some different Python data structure, you'll have to walk through the Python data structure which lxml returns, as above mentioned, and build your different data structure yourself based on the information thus collected; lxml can't specifically help you, except by offering several helpers for finding information in the parsed structure it returns, so that collecting said info is a flexible, easy task (again, see the documentation URL above).
It's not entirely clear what kind of data structure you're looking for, but here's a link to a code sample to convert XML to python dictionary of lists via lxml.etree.

Categories