Parsing an XML file in Python for emailing purposes

I am writing code in Python that can not only read an XML file but also send the results of that parsing as an email. Right now I am having trouble just trying to read the XML file I have. I made a simple Python script that I thought would at least read the file, which I could then try to email from within Python, but I am getting a SyntaxError on line 4:
root.tag 'log'
Anyway, here is the code I have written so far:
import xml.etree.cElementTree as etree
tree = etree.parse('C:/opidea.xml')
response = tree.getroot()
log = response.find('log').text
logentry = response.find('logentry').text
author = response.find('author').text
date = response.find('date').text
msg = [i.text for i in response.find('msg')]
The XML file has this type of formatting:
<log>
<logentry
revision="12345">
<author>glv</author>
<date>2012-08-09T13:16:24.488462Z</date>
<paths>
<path
action="M"
kind="file">/trunk/build.xml</path>
</paths>
<msg>BUG_NUMBER:N/A
FEATURE_AFFECTED:N/A
OVERVIEW:Example</msg>
</logentry>
</log>
I want to be able to send the contents of this XML file in an email. For now, though, I am just trying to get the Python code to read the XML file.

response.find('log') won't find anything, because:
find(self, path, namespaces=None)
Finds the first matching subelement, by tag name or path.
In your case log is not a subelement, but rather the root element itself. You can get its text directly, though: response.text. But in your example the log element doesn't have any text in it, anyway.
EDIT: Sorry, that quote from the docs actually applies to lxml.etree documentation, rather than xml.etree.
I'm not sure about the reason, but all the other calls to find also return None (you can check this by printing response.find('date') and so on). With lxml you can use xpath instead:
author = response.xpath('//author')[0].text
msg = [i.text for i in response.xpath('//msg')]
In any case, your use of find is not correct for msg, because find always returns a single element, not a list of them.
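As a footnote on the plain xml.etree route: find with a bare tag name only searches the direct children of the element it is called on, so the nested elements can still be reached by giving find a path. A minimal sketch against the layout shown in the question (the C:/opidea.xml path is taken from the question's code):
import xml.etree.cElementTree as etree

tree = etree.parse('C:/opidea.xml')
root = tree.getroot()                                 # this is the <log> element itself

logentry = root.find('logentry')                      # direct child of <log>
author = root.find('logentry/author').text            # nested elements need a path
date = root.find('logentry/date').text
msg = [m.text for m in root.findall('logentry/msg')]  # list of <msg> texts
revision = logentry.get('revision')                   # attributes are read with .get()
From there the collected strings can be assembled into a message body for the emailing step.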

Related

Importing xml files with an encoder element at the start(using lxml)

I have a file trace.xml. The very first line in this file is <?xml version="1.0" encoding="UTF-8"?>. I tried reading the data using the following code:
from lxml import etree

with open('trace.xml') as fobj:
    xml = fobj.read()
root = etree.fromstring(xml)
This, however, yields the following error: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
I am assuming it's because of the very first line I mentioned above. Is there a way around this? Any help would be appreciated.
Try changing
root = etree.fromstring(xml)
to
root = etree.fromstring(xml.encode())
and see if it works.
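For background, and as a sketch of two equivalent workarounds (the only assumption is the trace.xml filename from the question): lxml rejects an already-decoded unicode string that still carries an encoding declaration, because the declaration and the decoded text could disagree; bytes input does not have that ambiguity.
from lxml import etree

# read the file in binary mode so fromstring() receives bytes
with open('trace.xml', 'rb') as fobj:
    root = etree.fromstring(fobj.read())

# or let lxml read the file itself and honour the declaration
root = etree.parse('trace.xml').getroot()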

Parsing a large .bz2 file (40 GB) with lxml iterparse in python. Error that does not appear with uncompressed file

I am trying to parse OpenStreetMap's planet.osm, compressed in bz2 format. Because it is already 41G, I don't want to decompress the file completely.
So I figured out how to parse portions of the planet.osm file using bz2 and lxml, using the following code
from lxml import etree as et
from bz2 import BZ2File

path = "where/my/fileis.osm.bz2"

with BZ2File(path) as xml_file:
    parser = et.iterparse(xml_file, events=('end',))
    for events, elem in parser:
        if elem.tag == "tag":
            continue
        if elem.tag == "node":
            (do something)
        ## Do some cleaning
        # Get rid of that element
        elem.clear()
        # Also eliminate now-empty references from the root node to node
        while elem.getprevious() is not None:
            del elem.getparent()[0]
which works perfectly with the Geofabrik extracts. However, when I try to parse the planet-latest.osm.bz2 with the same script I get the error:
xml.etree.XMLSyntaxError: Specification mandate value for attribute num_change, line 3684, column 60
Here are the things I tried:
Check the planet-latest.osm.bz2 md5sum
Check the planet-latest.osm where the script with bz2 stops. There is no apparent error, and the attribute is called "num_changes", not "num_change" as indicated in the error
Also I did something stupid, but the error puzzled me: I opened the planet-latest.osm.bz2 in mode 'rb' [c = BZ2File('file.osm.bz2', 'rb')] and then passed c.read() to iterparse(), which returned me an error saying (very long string) cannot be opened. Strange thing, (very long string) ends right where the "Specification mandate value" error refers to...
Then I tried to first decompress the planet.osm.bz2 using a simple
bzcat planet.osm.bz2 > planet.osm
And ran the parser directly on planet.osm. And... it worked! I am very puzzled by this, and could not find any pointer to why this may happen and how to solve this. My guess would be there is something going on between the decompression and the parsing, but I am not sure. Please help me understand!
It turns out that the problem is with the compressed planet.osm file.
As indicated on the OSM Wiki, the planet file is compressed as a multistream file, and the bz2 python module cannot read multistream files. However, the bz2 documentation indicates an alternative module that can read such files, bz2file. I used it and it works perfectly!
So the code should read:
from lxml import etree as et
from bz2file import BZ2File

path = "where/my/fileis.osm.bz2"

with BZ2File(path) as xml_file:
    parser = et.iterparse(xml_file, events=('end',))
    for events, elem in parser:
        if elem.tag == "tag":
            continue
        if elem.tag == "node":
            (do something)
        ## Do some cleaning
        # Get rid of that element
        elem.clear()
        # Also eliminate now-empty references from the root node to node
        while elem.getprevious() is not None:
            del elem.getparent()[0]
Also, doing some research on using the PBF format (as advised in the comments), I stumbled upon imposm.parser, a python module that implements a generic parser for OSM data (in pbf or xml format). You may want to have a look at this!
As an alternative you can use the output of the bzcat command (which can handle multistream files too):
import subprocess

p = subprocess.Popen(["bzcat", "data.bz2"], stdout=subprocess.PIPE)
parser = et.iterparse(p.stdout, ...)
# at the end just check that p.returncode == 0 so there were no errors
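A slightly fuller sketch of that idea, reusing the event handling and cleanup from the question's loop (the planet-latest.osm.bz2 filename is the one mentioned in the question):
import subprocess
from lxml import etree as et

# bzcat concatenates every stream of a multistream .bz2 to stdout,
# and iterparse() can read from that pipe like any other file object
p = subprocess.Popen(["bzcat", "planet-latest.osm.bz2"], stdout=subprocess.PIPE)

for event, elem in et.iterparse(p.stdout, events=('end',)):
    if elem.tag == "node":
        pass  # (do something with the node here)
    # free the element and the references the root keeps to it
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]

p.stdout.close()
if p.wait() != 0:
    raise RuntimeError("bzcat failed with exit code %d" % p.returncode)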

Using Python to validate numeric string in XML

I have some hardware that creates a bazillion record XML file where the xml records look like this:
<Reading>
<DeviceId>13553678</DeviceId>
<Reading>1009735</Reading>
<DataStatus>0</DataStatus>
</Reading>
Every once in a while, we will experience a hardware failure where a character value gets inserted into the Reading tag, like this:
<Reading>
<DeviceId>13553678</DeviceId>
<Reading>100F735</Reading>
<DataStatus>0</DataStatus>
</Reading>
Unfortunately, the application that consumes this XML file will dump the ENTIRE file with a "Input string was not in a correct format" error. I would like to write an intermediary program in Python to remove the bad records from the xml file, archive them, and then rebuild the file for processing. I have used python for simple text manipulation but I believe there are some XML features I could leverage. Any help would be appreciated.
This can easily be done by using the lxml module and XPath expressions. Also see the logging module on how to do proper logging.
Configure a logger with a FileHandler
Get all inner <Reading/> nodes
If their text doesn't consist only of digits, drop the parent node and log
from lxml import etree
import logging

logger = logging.getLogger()
logger.addHandler(logging.FileHandler('dropped_readings.log'))

tree = etree.parse(open('readings.xml'))
readings = tree.xpath('//Reading/Reading')

for reading in readings:
    reading_block = reading.getparent()
    value = reading.text
    if not all(c.isdigit() for c in value):
        reading_dump = etree.tostring(reading_block)
        logger.warn("Dropped reading '%s':" % value)
        logger.warn(reading_dump)
        reading_block.getparent().remove(reading_block)

print etree.tostring(tree, xml_declaration=True, encoding='utf-8')
See the all() builtin and generator expressions for how the condition works.
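If you would rather write the cleaned document back to disk than print it (the question mentions rebuilding the file for processing), the parsed tree can be serialized directly; a one-line sketch, where the readings_clean.xml output name is just a placeholder:
# the dropped blocks are already archived in dropped_readings.log above
tree.write('readings_clean.xml', xml_declaration=True, encoding='utf-8')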

How to auto-close xml tags in truncated file?

I receive an email when a system in my company generates an error. This email contains XML all crammed onto a single line.
I wrote a notepad++ Python script that parses out everything except XML and pretty prints it. Unfortunately some of the emails contain too much XML data and it gets truncated. In general, the truncated data isn't that important to me. I would like to be able to just auto-close any open tags so that my Python script works. It doesn't need to be smart or correct, it just needs to make the xml well-enough formed that the script runs. Is there a way to do this?
I am open to Python scripts, online apps, downloadable apps, etc.
I realize that the right solution is to get the non-truncated xml, but pulling the right lever to get things done will be far more work than just dealing with it.
Use Beautiful Soup
>>> import bs4
>>> s= bs4.BeautifulSoup("<asd><xyz>asd</xyz>")
>>> s
<html><head></head><body><asd><xyz>asd</xyz></asd></body></html>
>>>
>>> s.body.contents[0]
<asd><xyz>asd</xyz></asd>
Notice that it closed the "asd" tag automagically.
To create a Notepad++ script to handle this:
Download the Beautiful Soup tarball and extract the files.
Copy the bs4 directory to your PythonScript/scripts folder.
In Notepad++, add the following code to your Python script:
#import Beautiful Soup
import bs4
#get text in document
text = editor.getText()
#soupify it to fix XML
soup = bs4.BeautifulSoup(text)
#convert soup object to string again
text = str(soup)
#clear editor and replace bad xml with fixed xml
editor.clearAll()
editor.addText(text)
#change language to xml
notepad.menuCommand( MENUCOMMAND.LANG_XML )
#soup has its own prettify, but I like the XML tools version better
notepad.runMenuCommand('XML Tools', 'Pretty print (XML only - with line breaks)', 1)
If you have BeautifulSoup and lxml installed, it's straightforward:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""
... <?xml version="1.0" encoding="utf-8"?>
... <a>
... <b>foo</b>
... <c>bar</""", "xml")
>>> soup
<?xml version="1.0" encoding="utf-8"?>
<a>
<b>foo</b>
<c>bar</c></a>
Note the second "xml" argument to the constructor to avoid the XML being interpreted as HTML.
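This approach needs both BeautifulSoup and lxml; if lxml alone is installed, its parser's recover option will also close tags that were cut off. A minimal sketch, reusing the truncated markup from the example above:
from lxml import etree

# the truncated markup from the example above
text = "<a><b>foo</b><c>bar</"

# recover=True tells libxml2 to do its best with broken input,
# which includes closing elements left open by the truncation
parser = etree.XMLParser(recover=True)
root = etree.fromstring(text, parser)
print etree.tostring(root, pretty_print=True)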

element tree treats similar files differently

Here are two different files that my python (2.6) script encounters. One will parse, the other will not. I'm just curious as to why this happens.
This xml file will not parse and the script will fail:
<Landfire_Feedback_Point_xlsform id="fbfm40v10" instanceID="uuid:9e062da6-b97b-4d40-b354-6eadf18a98ab" submissionDate="2013-04-30T23:03:32.881Z" isComplete="true" markedAsCompleteDate="2013-04-30T23:03:32.881Z" xmlns="http://opendatakit.org/submissions">
<date_test>2013-04-17</date_test>
<plot_number>10</plot_number>
<select_multiple_names>BillyBob</select_multiple_names>
<geopoint_plot>43.2452830500 -118.2149402900 210.3000030518 3.0000000000</geopoint_plot><fbfm40_new>GS2</fbfm40_new>
<select_grazing>NONE</select_grazing>
<image_close>1366230030355.jpg</image_close>
<plot_note>No road present.</plot_note>
<n0:meta xmlns:n0="http://openrosa.org/xforms">
<n0:instanceID>uuid:9e062da6-b97b-4d40-b354-6eadf18a98ab</n0:instanceID>
</n0:meta>
</Landfire_Feedback_Point_xlsform>
This xml file will parse correctly and the script succeeds:
<Landfire_Feedback_Point_xlsform id="fbfm40v10">
<date_test>2013-05-14</date_test>
<plot_number>010</plot_number>
<select_multiple_names>BillyBob</select_multiple_names>
<geopoint_plot>43.26630563 -118.39881809 351.70001220703125 5.0</geopoint_plot>
<fbfm40_new>GR1</fbfm40_new>
<select_grazing>HIGH</select_grazing>
<image_close>fbfm40v10_PLOT_010_ID_6.jpg</image_close>
<plot_note>Heavy grazing</plot_note>
<meta><instanceID>uuid:90e7d603-86c0-46fc-808f-ea0baabdc082</instanceID></meta>
</Landfire_Feedback_Point_xlsform>
Here is a little python script that demonstrates that one will work, while the other will not. I'm just looking for an explanation as to why one is seen by ElementTree as an xml file while the other isn't. Specifically, the one that doesn't seem to parse fails with a "'NONE' type doesn't have a 'text' attribute" or something similar. But, it's because it doesn't seem to consider the file as xml or it can't see any elements beyond the opening line. Any explanation or direction with regard to this error would be appreciated. Thanks in advance.
Python script:
import os
from xml.etree import ElementTree

def replace_xml_attribute_in_file(original_file, element_name, attribute_value):
    #THIS FUNCTION ONLY WORKS ON XML FILES WITH UNIQUE ELEMENT NAMES
    # -DUPLICATE ELEMENT NAMES WILL ONLY GET THE FIRST ELEMENT WITH A GIVEN NAME

    #split original filename and add tempfile name
    tempfilename = "temp.xml"
    rootsplit = original_file.rsplit('\\')  #split the root directory on the backslash
    rootjoin = '\\'.join(rootsplit[:-1])    #rejoin the root directory parts with a backslash -minus the last
    temp_file = os.path.join(rootjoin, tempfilename)

    et = ElementTree.parse(original_file)
    author = et.find(element_name)
    author.text = attribute_value
    et.write(temp_file)

    if os.path.exists(temp_file) and os.path.exists(original_file):  #if both the original and the temp files exist
        os.remove(original_file)             #erase the original
        os.rename(temp_file, original_file)  #rename the new file
    else:
        print "Something went wrong."

replace_xml_attribute_in_file("testfile1.xml", "image_close", "whoopdeedoo.jpg")
Here is a little python script that demonstrates that one will work, while the other will not. I'm just looking for an explanation as to why one is seen by ElementTree as an xml file while the other isn't.
Your code doesn't demonstrate that at all. It demonstrates that they're both seen by ElementTree as valid XML files chock full of nodes. They both parse just fine, they both read past the first line, etc.
The only problem is that the first one doesn't have a node named 'image_close', so your code doesn't work.
You can see that pretty easily:
for node in et.getroot().getchildren():
    print node.tag
You get 9 children of the root, with either version.
And the output to that should show you the problem. The node you want is actually named {http://opendatakit.org/submissions}image_close in the first example, rather than image_close as in the second.
And, as you can probably guess, this is because of the xmlns="http://opendatakit.org/submissions" default namespace on the root node. ElementTree uses the "James Clark notation" for mapping namespaced names to universal names.
Anyway, because none of the nodes are named image_close, the et.find(element_name) returns None, so your code stores author=None, then tries to assign to author.text, and gets an error.
As for how to fix this problem… well, you could learn how namespaces work by default in ElementTree, or you could upgrade to Python 2.7 or install a newer ElementTree for 2.6 that lets you customize things more easily. But if you want to do custom namespace handling and also stick with your old version… I'd start with this article (and its two predecessors) and this one.
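For reference, with the stock ElementTree in 2.6 the quickest fix is to pass the fully qualified (Clark notation) tag shown above; a minimal sketch reusing the function from the question (testfile2.xml is a hypothetical name for the second, namespace-free file):
# the default namespace from the root element, in Clark notation
NS = '{http://opendatakit.org/submissions}'

# the namespaced file needs the universal name...
replace_xml_attribute_in_file("testfile1.xml", NS + "image_close", "whoopdeedoo.jpg")

# ...while the plain tag keeps working for the file without a default namespace
replace_xml_attribute_in_file("testfile2.xml", "image_close", "whoopdeedoo.jpg")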
