element tree treats similar files differently

element tree treats similar files differently - python

Here are two different files that my python (2.6) script encounters. One will parse, the other will not. I'm just curious as to why this happens.
This xml file will not parse and the script will fail:
<Landfire_Feedback_Point_xlsform id="fbfm40v10" instanceID="uuid:9e062da6-b97b-4d40-b354-6eadf18a98ab" submissionDate="2013-04-30T23:03:32.881Z" isComplete="true" markedAsCompleteDate="2013-04-30T23:03:32.881Z" xmlns="http://opendatakit.org/submissions">
<date_test>2013-04-17</date_test>
<plot_number>10</plot_number>
<select_multiple_names>BillyBob</select_multiple_names>
<geopoint_plot>43.2452830500 -118.2149402900 210.3000030518 3.0000000000</geopoint_plot><fbfm40_new>GS2</fbfm40_new>
<select_grazing>NONE</select_grazing>
<image_close>1366230030355.jpg</image_close>
<plot_note>No road present.</plot_note>
<n0:meta xmlns:n0="http://openrosa.org/xforms">
<n0:instanceID>uuid:9e062da6-b97b-4d40-b354-6eadf18a98ab</n0:instanceID>
</n0:meta>
</Landfire_Feedback_Point_xlsform>
This xml file will parse correctly and the script succeeds:
<Landfire_Feedback_Point_xlsform id="fbfm40v10">
<date_test>2013-05-14</date_test>
<plot_number>010</plot_number>
<select_multiple_names>BillyBob</select_multiple_names>
<geopoint_plot>43.26630563 -118.39881809 351.70001220703125 5.0</geopoint_plot>
<fbfm40_new>GR1</fbfm40_new>
<select_grazing>HIGH</select_grazing>
<image_close>fbfm40v10_PLOT_010_ID_6.jpg</image_close>
<plot_note>Heavy grazing</plot_note>
<meta><instanceID>uuid:90e7d603-86c0-46fc-808f-ea0baabdc082</instanceID></meta>
</Landfire_Feedback_Point_xlsform>
Here is a little python script that demonstrates that one will work, while the other will not. I'm just looking for an explanation as to why one is seen by ElementTree as an xml file while the other isn't. Specifically, the one that doesn't seem to parse fails with a "'NONE' type doesn't have a 'text' attribute" or something similar. But, it's because it doesn't seem to consider the file as xml or it can't see any elements beyond the opening line. Any explanation or direction with regard to this error would be appreciated. Thanks in advance.
Python script:
import os
from xml.etree import ElementTree
def replace_xml_attribute_in_file(original_file,element_name,attribute_value):
#THIS FUNCTION ONLY WORKS ON XML FILES WITH UNIQUE ELEMENT NAMES
# -DUPLICATE ELEMENT NAMES WILL ONLY GET THE FIRST ELEMENT WITH A GIVEN NAME
#split original filename and add tempfile name
tempfilename="temp.xml"
rootsplit = original_file.rsplit('\\') #split the root directory on the backslash
rootjoin = '\\'.join(rootsplit[:-1]) #rejoin the root diretory parts with a backslash -minus the last
temp_file = os.path.join(rootjoin,tempfilename)
et = ElementTree.parse(original_file)
author=et.find(element_name)
author.text = attribute_value
et.write(temp_file)
if os.path.exists(temp_file) and os.path.exists(original_file): #if both the original and the temp files exist
os.remove(original_file) #erase the original
os.rename(temp_file,original_file) #rename the new file
else:
print "Something went wrong."
replace_xml_attribute_in_file("testfile1.xml","image_close","whoopdeedoo.jpg");

Here is a little python script that demonstrates that one will work, while the other will not. I'm just looking for an explanation as to why one is seen by ElementTree as an xml file while the other isn't.
Your code doesn't demonstrate that at all. It demonstrates that they're both seen by ElementTree as valid XML files chock full of nodes. They both parse just fine, they both read past the first line, etc.
The only problem is that the first one doesn't have a node named 'image_close', so your code doesn't work.
You can see that pretty easily:
for node in et.getroot().getchildren():
print node.tag
You get 9 children of the root, with either version.
And the output to that should show you the problem. The node you want is actually named {http://opendatakit.org/submissions}image_close in the first example, rather than image_close as in the second.
And, as you can probably guess, this is because of the namespace=http://opendatakit.org/submissions in the root node. ElementTree uses the "James Clark notation" for mapping unknown-namespaced names to universal names.
Anyway, because none of the nodes are named image_close, the et.find(element_name) returns None, so your code stores author=None, then tries to assign to author.text, and gets an error.
As for how to fix this problem… well, you could learn how namespaces work by default in ElementTree, or you could upgrade to Python 2.7 or install a newer ElementTree for 2.6 that lets you customize things more easily. But if you want to do custom namespace handling and also stick with your old version… I'd start with this article (and its two predecessors) and this one.

Related

Right way to use lxml to generate xml in python

I'm trying to create xml that corresponds to structure of directory (with subdirectories and files in that subdirs). When I try to use this example: Best way to generate xml? instead of output from example, that is:
<root>
<child/>
<child>some text</child>
</root>
I've got:
b'<root>\n <child/>\n <child>some text</child>\n</root>\n'
Why is it so?
Uses PyCharm IDE if it matters.

This is a change in Python 3. etree.tostring() is returning a bytes literal, which is denoted by b'string value'. Note that whenever you see \n in the bytes literal it means the same as a new line if you print it out to a file.
You can turn this into a regular string using s = s.decode('utf-8').

'\n' is a Line feed char (LF).
When you, for instance, save your output to a file and open it in some editor, you'll get what you expect.
Your output is fine.

How to extract text from an existing docx file using python-docx

I'm trying to use python-docx module (pip install python-docx)
but it seems to be very confusing as in github repo test sample they are using opendocx function but in readthedocs they are using Document class. Even though they are only showing how to add text to a docx file, not reading existing one?
1st one (opendocx) is not working, may be deprecated. For second case I was trying to use:
from docx import Document
document = Document('test_doc.docx')
print(document.paragraphs)
It returned a list of <docx.text.Paragraph object at 0x... >
Then I did:
for p in document.paragraphs:
print(p.text)
It returned all text but there were few thing missing. All URLs (CTRL+CLICK to go to URL) were not present in text on console.
What is the issue? Why URLs are missing?
How could I get complete text without iterating over loop (something like open().read())

you can try this
import docx
def getText(filename):
doc = docx.Document(filename)
fullText = []
for para in doc.paragraphs:
fullText.append(para.text)
return '\n'.join(fullText)

You can use python-docx2txt which is adapted from python-docx but can also extract text from links, headers and footers. It can also extract images.

Without Installing python-docx
docx is basically is a zip file with several folders and files within it. In the link below you can find a simple function to extract the text from docx file, without the need to rely on python-docx and lxml the latter being sometimes hard to install:
http://etienned.github.io/posts/extract-text-from-word-docx-simply/

There are two "generations" of python-docx. The initial generation ended with the 0.2.x versions and the "new" generation started at v0.3.0. The new generation is a ground-up, object-oriented rewrite of the legacy version. It has a distinct repository located here.
The opendocx() function is part of the legacy API. The documentation is for the new version. The legacy version has no documentation to speak of.
Neither reading nor writing hyperlinks are supported in the current version. That capability is on the roadmap, and the project is under active development. It turns out to be quite a broad API because Word has so much functionality. So we'll get to it, but probably not in the next month unless someone decides to focus on that aspect and contribute it. UPDATE Hyperlink support was added subsequent to this answer.

Using python-docx, as #Chinmoy Panda 's answer shows:
for para in doc.paragraphs:
fullText.append(para.text)
However, para.text will lost the text in w:smarttag (Corresponding github issue is here: https://github.com/python-openxml/python-docx/issues/328), you should use the following function instead:
def para2text(p):
rs = p._element.xpath('.//w:t')
return u" ".join([r.text for r in rs])

It seems that there is no official solution for this problem, but there is a workaround posted here
https://github.com/savoirfairelinux/python-docx/commit/afd9fef6b2636c196761e5ed34eb05908e582649
just update this file
"...\site-packages\docx\oxml_init_.py"
# add
import re
import sys
# add
def remove_hyperlink_tags(xml):
if (sys.version_info > (3, 0)):
xml = xml.decode('utf-8')
xml = xml.replace('</w:hyperlink>', '')
xml = re.sub('<w:hyperlink[^>]*>', '', xml)
if (sys.version_info > (3, 0)):
xml = xml.encode('utf-8')
return xml
# update
def parse_xml(xml):
"""
Return root lxml element obtained by parsing XML character string in
*xml*, which can be either a Python 2.x string or unicode. The custom
parser is used, so custom element classes are produced for elements in
*xml* that have them.
"""
root_element = etree.fromstring(remove_hyperlink_tags(xml), oxml_parser)
return root_element
and of course don't forget to mention in the documentation that use are changing the official library

How can I say a file is SVG without using a magic number?

An SVG file is basically an XML file so I could use the string <?xml (or the hex representation: '3c 3f 78 6d 6c') as a magic number but there are a few opposing reason not to do that if for example there are extra white-spaces it could break this check.
The other images I need/expect to check are all binaries and have magic numbers. How can I fast check if the file is an SVG format without using the extension eventually using Python?

XML is not required to start with the <?xml preamble, so testing for that prefix is not a good detection technique — not to mention that it would identify every XML as SVG. A decent detection, and really easy to implement, is to use a real XML parser to test that the file is well-formed XML that contains the svg top-level element:
import xml.etree.cElementTree as et
def is_svg(filename):
tag = None
with open(filename, "r") as f:
try:
for event, el in et.iterparse(f, ('start',)):
tag = el.tag
break
except et.ParseError:
pass
return tag == '{http://www.w3.org/2000/svg}svg'
Using cElementTree ensures that the detection is efficient through the use of expat; timeit shows that an SVG file was detected as such in ~200μs, and a non-SVG in 35μs. The iterparse API enables the parser to forego creating the whole element tree (module name notwithstanding) and only read the initial portion of the document, regardless of total file size.

You could try reading the beginning of the file as binary - if you can't find any magic numbers, you read it as a text file and match to any textual patterns you wish. Or vice-versa.

This is from man file (here), for the unix file command:
The magic tests are used to check for files with data in particular fixed formats. The canonical example of this is a binary executable ... These files have a “magic number” stored in a particular place near the beginning of the file that tells the UNIX operating system that the file is a binary executable, and which of several types thereof. The concept of a “magic” has been applied by extension to data files. Any file with some invariant identifier at a small fixed offset into the file can usually be described in this way. ...
(my emphasis)
And here's one example of the "magic" that the file command uses to identify an svg file (see source for more):
...
0 string \<?xml\ version=
>14 regex ['"\ \t]*[0-9.]+['"\ \t]*
>>19 search/4096 \<svg SVG Scalable Vector Graphics image
...
0 string \<svg SVG Scalable Vector Graphics image
...
As described by man magic, each line follows the format <offset> <type> <test> <message>.
If I understand correctly, the code above looks for the literal "<?xml version=". If that is found, it looks for a version number, as described by the regular expression. If that is found, it searches the next 4096 bytes until it finds the literal "<svg". If any of this fails, it looks for the literal "<svg" at the start of the file, and so on.
Something similar could be implemented in Python.
Note there's also python-magic, which provides an interface to libmagic, as used by the unix file command.

Parsing an XML file in python for emailing purposes

I am writing code in python that can not only read a xml but also send the results of that parsing as an email. Now I am having trouble just trying to read the file I have in xml. I made a simple python script that I thought would at least read the file which I can then try to email within python but I am getting a Syntax Error in line 4.
root.tag 'log'
Anyways here is the code I written so far:
import xml.etree.cElementTree as etree
tree = etree.parse('C:/opidea.xml')
response = tree.getroot()
log = response.find('log').text
logentry = response.find('logentry').text
author = response.find('author').text
date = response.find('date').text
msg = [i.text for i in response.find('msg')]
Now the xml file has this type of formating
<log>
<logentry
revision="12345">
<author>glv</author>
<date>2012-08-09T13:16:24.488462Z</date>
<paths>
<path
action="M"
kind="file">/trunk/build.xml</path>
</paths>
<msg>BUG_NUMBER:N/A
FEATURE_AFFECTED:N/A
OVERVIEW:Example</msg>
</logentry>
</log>
I want to be able to send an email of this xml file. For now though I am just trying to get the python code to read the xml file.

response.find('log') won't find anything, because:
find(self, path, namespaces=None)
Finds the first matching subelement, by tag name or path.
In your case log is not a subelement, but rather the root element itself. You can get its text directly, though: response.text. But in your example the log element doesn't have any text in it, anyway.
EDIT: Sorry, that quote from the docs actually applies to lxml.etree documentation, rather than xml.etree.
I'm not sure about the reason, but all other calls to find also return None (you can find it out by printing response.find('date') and so on). With lxml ou can use xpath instead:
author = response.xpath('//author')[0].text
msg = [i.text for i in response.xpath('//msg')]
In any case, your use of find is not correct for msg, because find always returns a single element, not a list of them.

xml to Python data structure using lxml

How can I convert xml to Python data structure using lxml?
I have searched high and low but can't find anything.
Input example
<ApplicationPack>
<name>Mozilla Firefox</name>
<shortname>firefox</shortname>
<description>Leading Open Source internet browser.</description>
<version>3.6.3-1</version>
<license name="Firefox EULA">http://www.mozilla.com/en-US/legal/eula/firefox-en.html</license>
<ms-license>False</ms-license>
<vendor>Mozilla Foundation</vendor>
<homepage>http://www.mozilla.org/firefox</homepage>
<icon>resources/firefox.png</icon>
<download>http://download.mozilla.org/?product=firefox-3.6.3&os=win&lang=en-GB</download>
<crack required="0"/>
<install>scripts/install.sh</install>
<postinstall name="Clean Up"></postinstall>
<run>C:\\Program Files\\Mozilla Firefox\\firefox.exe</run>
<uninstall>c:\\Program Files\\Mozilla Firefox\\uninstall\\helper.exe /S</uninstall>
<requires name="autohotkey" />
</ApplicationPack>

>>> from lxml import etree
>>> treetop = etree.fromstring(anxmlstring)
converts the xml in the string to a Python data structure, and so does
>>> othertree = etree.parse(somexmlurl)
where somexmlurl is the path to a local XML file or the URL of an XML file on the web.
The Python data structure these functions provide (known as an "element tree", whence the etree module name) is well documented here -- all the classes, functions, methods, etc, that the Python data structure in question supports. It closely matches one supported in the Python standard library, by the way.
If you want some different Python data structure, you'll have to walk through the Python data structure which lxml returns, as above mentioned, and build your different data structure yourself based on the information thus collected; lxml can't specifically help you, except by offering several helpers for finding information in the parsed structure it returns, so that collecting said info is a flexible, easy task (again, see the documentation URL above).

It's not entirely clear what kind of data structure you're looking for, but here's a link to a code sample to convert XML to python dictionary of lists via lxml.etree.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.