I'm trying to make a class which makes it easier to handle XML-invoices, but I am having trouble getting ElementTree to work within a class.
This is the general idea of what I'm trying to do:
def open_invoice(input_file):
    with open(input_file, 'r', encoding = 'utf8') as invoice_file:
        return ET.parse(input_file).getroot()
This works fine, and I can make functions to handle the data without issue. But when trying to do the equivalent inside a class, I get an error message:
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
I think this means that the parser is never given anything to parse, though I could be wrong.
The class:
import xmltodict
import xml.etree.ElementTree as ET
class Peppol:
    def __init__(self, peppol_invoice):
        self.invoice = xmltodict.parse(
            peppol_invoice.read()
        )
        self.root = ET.parse(peppol_invoice).getroot()
Making the class instance:
from pypeppol import Peppol
def open_invoice(input_file):
    with open(input_file, 'r', encoding = 'utf8') as invoice_file:
        return Peppol(invoice_file)
invoice = open_invoice('invoice.xml')
Help is much appreciated.
The error means that invoice.xml is empty, does not contain XML, or contains other content before the XML data.
import xml.etree.ElementTree as ET
with open('empty.xml', 'w') as f:
    f.write('')
    # or
    # f.write('No xml here!')
with open('empty.xml') as f:
    ET.parse(f).getroot()
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
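To check which of these cases applies, one option is to catch the exception and inspect it. This is a minimal sketch for Python 3 using only the standard library; the helper name describe_parse_failure is made up for this example and is not part of ElementTree.

```python
import xml.etree.ElementTree as ET


def describe_parse_failure(text):
    """Return None if text parses as XML, else (message, (line, column)).

    Hypothetical helper for illustration only.
    """
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        # ParseError records where parsing stopped in its .position attribute
        return str(exc), exc.position
    return None


empty_result = describe_parse_failure("")             # empty input
text_result = describe_parse_failure("No xml here!")  # non-XML input
ok_result = describe_parse_failure("<Invoice/>")      # well-formed input
```

An empty input and a non-XML input both fail, but with different messages, which helps tell "nothing was read" apart from "something other than XML was read".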
The problem here is that you are attempting to read the contents of the file peppol_invoice twice, once in the call to xmltodict.parse, and once in the call to ET.parse.
After the call to peppol_invoice.read() completes, peppol_invoice is left pointing at the end of the file. You get the error in your question title because when peppol_invoice is passed to ET.parse, there is nothing left to be read from the file.
If you want to read the contents of the file again, call peppol_invoice.seek(0) to reset the pointer back to the start of the file:
import xmltodict
import xml.etree.ElementTree as ET
class Peppol:
    def __init__(self, peppol_invoice):
        self.invoice = xmltodict.parse(
            peppol_invoice.read()
        )
        peppol_invoice.seek(0) # add this line
        self.root = ET.parse(peppol_invoice).getroot()
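An alternative that avoids seek() altogether is to read the file once and hand the same string to every parser. The sketch below assumes Python 3; InvoiceSketch is a hypothetical stand-in for the Peppol class, an in-memory io.StringIO replaces the opened file, and xmltodict is left out so the example only needs the standard library.

```python
import io
import xml.etree.ElementTree as ET


class InvoiceSketch:
    """Hypothetical stand-in for Peppol: reads the file exactly once."""

    def __init__(self, invoice_file):
        # Read once; the file position no longer matters after this line
        self.text = invoice_file.read()
        # fromstring() parses XML content directly and returns the root element
        self.root = ET.fromstring(self.text)
        # xmltodict.parse(self.text) could reuse the same string here


# Made-up invoice content standing in for the real file
invoice = InvoiceSketch(io.StringIO("<Invoice><ID>42</ID></Invoice>"))
invoice_id = invoice.root.find("ID").text
```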
Because I'm not able to use an XSL IDE, I've written a super-simple Python script using lxml to transform a given XML file with a given XSL transform, and write the results to a file. As follows (abridged):
p = XMLParser(huge_tree=True)
xml = etree.parse(xml_filename, parser=p)
xml_root = xml.getroot()
print(xml_root.tag)
xslt_root = etree.parse(xsl_filename)
transform = etree.XSLT(xslt_root)
newtext = transform(xml)
with open(output, 'w') as f:
    f.write(str(newtext))
I'm getting the following error:
"lxml.etree.XSLTApplyError: Failed to evaluate the 'select' expression"
...but I have quite a number of select expressions in my XSLT. After having looked carefully and isolated blocks of code, I'm still at a loss as to which select is failing, or why.
Without trying to debug the code, is there a way to get more information out of lxml, like a line number or quote from the failing expression?
aaaaaand of course as soon as I actually take the time to post the question, I stumble upon the answer.
This might be a duplicate of this question, but I think the added benefit here is the Python side of things.
The linked answer points out that each parser includes an error log that you can access. The only "trick" is catching those errors so that you can look in the log once it's been created.
I did it thusly (perhaps also poorly, but it worked):
import os
import lxml.etree as etree
from lxml.etree import XMLParser
import sys
xml_filename = '(some path to an XML file)'
xsl_filename = '(some path to an XSL file)'
output = '(some path to a file)'
p = XMLParser(huge_tree=True)
xml = etree.parse(xml_filename, parser=p)
xml_root = xml.getroot()
xslt_root = etree.parse(xsl_filename)
transform = etree.XSLT(xslt_root)
newtext = None
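try:
    newtext = transform(xml)
    with open(output, 'w') as f:
        f.write(str(newtext))
except etree.XSLTApplyError:
    # The transform keeps its own error log; print each entry with its line number
    for error in transform.error_log:
        print(error.message, error.line)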
The messages in this log are more descriptive than those printed to the console, and the "line" element will point you to the line number where the failure occurred.
This is the code that results in an error message:
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.parse(data)
The error:
I'm new to python. I did read documentation and a couple of tutorials, but clearly I still have done something wrong. I don't believe it is the xml file itself because it does this to two different xml files.
Consider using ElementTree's fromstring():
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
# http://feeds.bbci.co.uk/news/rss.xml?edition=int
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.fromstring(data)
print ET.tostring(tree, encoding='utf8', method='xml')
data is a reference to the XML content as a string, but the parse() function expects a filename or file object as its argument. That's why there is an error.
urlhandle is a file object, so tree = ET.parse(urlhandle) should work for you.
The error message indicates that your code is trying to open a file whose name is stored in the variable source.
It's failing to open that file (IOError) because the variable source contains a bunch of XML, not a file name.
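The difference between the two entry points can be sketched without a network connection. This example assumes Python 3 and uses a made-up byte string in place of the data returned by urlhandle.read(); io.BytesIO plays the role of the file object.

```python
import io
import xml.etree.ElementTree as ET

# Made-up stand-in for the bytes returned by urlhandle.read()
data = b"<rss><channel><title>Example feed</title></channel></rss>"

# fromstring() parses XML content directly and returns the root element
root_from_string = ET.fromstring(data)

# parse() wants a filename or a file-like object and returns an ElementTree
root_from_file = ET.parse(io.BytesIO(data)).getroot()

# Passing the XML text itself to parse() makes it try to open a file by that name
parse_error = None
try:
    ET.parse(data.decode("utf-8"))
except OSError as exc:
    parse_error = exc
```

Both successful calls end up with the same root element; only the kind of input they accept differs.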
import urllib2
import urllib
import json
import urlparse
def main():
    f = open("C:\Users\Stern Marketing\Desktop\dumpaday.txt","r")
    if f.mode == 'r':
        item = f.read()
        for x in item:
            urlParts = urlparse.urlsplit(x)
            filename = urlParts.path.split('/')[-1]
            urllib.urlretrieve(item.strip(), filename)
if __name__ == "__main__":
    main()
Looks like script still not working properly, I'm really not sure why... :S
Getting lots of errors...
urllib.urlretrieve("x", "0001.jpg")
This will try to download from the (static) URL "x".
The URL you actually want to download from is within the variable x, so you should write your line to reference that variable:
urllib.urlretrieve(x, "0001.jpg")
Also, you probably want to change the target filename for each download, so you don’t keep on overwriting it.
Regarding your filename update:
urlparse.urlsplit is a function that takes a URL and splits it into multiple parts. Those parts are returned from the function, so you need to save them in a variable.
One part is the path, which is what contains the file name. The path itself is a string on which you can call the split method to separate it by the / character. As you are interested in only the last part—the filename—you can discard everything else:
url = 'http://www.dumpaday.com/wp-content/uploads/2013/12/funny-160.jpg'
urlParts = urlparse.urlsplit(url)
print(urlParts.path) # /wp-content/uploads/2013/12/funny-160.jpg
filename = urlParts.path.split('/')[-1]
print(filename) # funny-160.jpg
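On Python 3 the same function lives in urllib.parse. The following is a small sketch of the idea, wrapped in a helper named filename_from_url that is made up for this example; the query string on the sample URL is also an addition, included to show that it does not end up in the path.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path segment of url (hypothetical helper)."""
    # .path excludes the scheme, host, query string, and fragment
    path = urlsplit(url).path
    # URL paths always use forward slashes, so posixpath is the right splitter
    return posixpath.basename(path)


name = filename_from_url(
    "http://www.dumpaday.com/wp-content/uploads/2013/12/funny-160.jpg?size=large"
)
```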
It should work like this:
import urllib2
import urllib
import json
import urlparse
def main():
    # Raw string so the backslashes in the Windows path are never treated as escapes
    with open(r"C:\Users\Stern Marketing\Desktop\dumpaday.txt","r") as f:
        for x in f:
            urlParts = urlparse.urlsplit(x.strip())
            filename = urlParts.path.split('/')[-1]
            urllib.urlretrieve(x.strip(), filename)
if __name__ == "__main__":
    main()
The readlines method of file objects returns lines with a trailing newline character (\n).
Change your loop to the following:
# By the way, you don't need readlines at all. Iterating over a file yields its lines.
for x in fl:
    urllib.urlretrieve(x.strip(), "0001.jpg")
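The trailing newline, and the earlier point about not overwriting the same target file, can both be illustrated without downloading anything. This is a sketch for Python 3; the URLs are placeholders and io.StringIO stands in for the text file of links.

```python
import io

# Made-up stand-in for the text file of URLs, one per line
url_file = io.StringIO(
    "http://example.com/a.jpg\n"
    "\n"
    "http://example.com/b.jpg\n"
)

# Iterating over a file yields lines that still end with "\n"
raw_lines = list(url_file)
url_file.seek(0)

# Strip the whitespace and skip blank lines
urls = [line.strip() for line in url_file if line.strip()]

# Number the target files so each download gets its own name
targets = [(url, "{:04d}.jpg".format(n)) for n, url in enumerate(urls, start=1)]
```

Each (url, filename) pair in targets could then be passed to the download call in place of the fixed "0001.jpg".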
Here is a solution that loops over images indexed 160 to 171. You can adjust as needed. This creates a URL from the base, opens it via urllib2, and saves the response as a binary file.
import urllib2
base_url = "http://www.dumpaday.com/wp-content/uploads/2013/12/funny-{}.jpg"
# xrange excludes its stop value, so 172 covers 160 through 171
for n in xrange(160, 172):
    url = base_url.format(n)
    f_save = "{}.jpg".format(n)
    req = urllib2.urlopen(url)
    with open(f_save,'wb') as FOUT:
        FOUT.write(req.read())
I wrote this code to validate my XML file against an XSD:
def parseAndObjectifyXml(xmlPath, xsdPath):
    from lxml import etree
    xsdFile = open(xsdPath)
    schema = etree.XMLSchema(file=xsdFile)
    xmlinput = open(xmlPath)
    xmlContent = xmlinput.read()
    myxml = etree.parse(xmlinput) # In this line xml input is empty
    schema.assertValid(myxml)
But when I try to validate it, xmlinput is empty even though xmlContent is not.
What is the problem?
Files in Python have a "current position"; it starts at the beginning of the file (position 0), then, as you read the file, the current position pointer moves along until it reaches the end.
You'll need to put that pointer back to the beginning before the lxml parser can read the contents in full. Use the .seek() method for that:
from lxml import etree
def parseAndObjectifyXml(xmlPath, xsdPath):
    xsdFile = open(xsdPath)
    schema = etree.XMLSchema(file=xsdFile)
    xmlinput = open(xmlPath)
    xmlContent = xmlinput.read()
    xmlinput.seek(0)
    myxml = etree.parse(xmlinput)
    schema.assertValid(myxml)
You only need to do this if you need xmlContent somewhere else too; you could alternatively pass it into the .parse() method if wrapped in a StringIO object to provide the necessary file object methods:
from lxml import etree
from cStringIO import StringIO
def parseAndObjectifyXml(xmlPath, xsdPath):
    xsdFile = open(xsdPath)
    schema = etree.XMLSchema(file=xsdFile)
    xmlinput = open(xmlPath)
    xmlContent = xmlinput.read()
    myxml = etree.parse(StringIO(xmlContent))
    schema.assertValid(myxml)
If you are not using xmlContent for anything else, then you do not need the extra .read() call either, and subsequently won't have problems parsing it with lxml; just omit the call altogether, and you won't need to move the current position pointer back to the start either:
from lxml import etree
def parseAndObjectifyXml(xmlPath, xsdPath):
    xsdFile = open(xsdPath)
    schema = etree.XMLSchema(file=xsdFile)
    xmlinput = open(xmlPath)
    myxml = etree.parse(xmlinput)
    schema.assertValid(myxml)
To learn more about .seek() (and its counterpart, .tell()), read up on file objects in the Python tutorial.
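The way the current position moves can be observed directly with .tell(). The sketch below assumes Python 3 and writes a small throwaway file so it is self-contained; the file content is made up, and the file is opened in binary mode so that .tell() reports plain byte offsets.

```python
import os
import tempfile

# Write a small throwaway file so the example is self-contained
with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as tmp:
    tmp.write(b"<root><item/></root>")
    path = tmp.name

try:
    with open(path, "rb") as f:
        start = f.tell()      # position before any read
        content = f.read()    # consumes the whole file
        end = f.tell()        # position is now at the end of the file
        leftover = f.read()   # a second read finds nothing left
        f.seek(0)             # move the position back to the start
        again = f.read()      # the full content is available again
finally:
    os.remove(path)
```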
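You should parse the XML content that you have already read. Since etree.parse() expects a filename or file object, use etree.fromstring() for a string:
xmlContent = xmlinput.read()
myxml = etree.fromstring(xmlContent)
instead of:
myxml = etree.parse(xmlinput)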
I have a design question. I have a function loadImage() for loading an image file. Right now it accepts a string which is a file path. But I also want to be able to load files which are not on physical disk, e.g. generated procedurally. I could have it accept a string, but then how could it know the string is not a file path but file data? I could add an extra boolean argument to specify that, but that doesn't sound very clean. Any ideas?
It's something like this now:
def loadImage(filepath):
    file = open(filepath, 'rb')
    data = file.read()
    # do stuff with data
The other version would be
def loadImage(data):
    # do stuff with data
How to have this function accept both 'filepath' or 'data' and guess what it is?
You can change your loadImage function to expect an opened file-like object, such as:
def load_image(f):
    data = f.read()  # read from the file-like object that was passed in
... and then have that called from two functions, one of which expects a path and the other a string that contains the data:
from StringIO import StringIO
def load_image_from_path(path):
    with open(path, 'rb') as f:
        load_image(f)
def load_image_from_string(s):
    sio = StringIO(s)
    try:
        load_image(sio)
    finally:
        sio.close()
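On Python 3 the same layout can be sketched with io.BytesIO, since image data is bytes rather than text. This is only an illustration of the design: load_image here simply returns the raw bytes, the name load_image_from_bytes is made up for this example, and the "image" content is fake.

```python
import io
import os
import tempfile


def load_image(f):
    """Read raw image bytes from any binary file-like object."""
    return f.read()


def load_image_from_path(path):
    with open(path, "rb") as f:
        return load_image(f)


def load_image_from_bytes(data):
    # BytesIO supports the context manager protocol, so no try/finally is needed
    with io.BytesIO(data) as f:
        return load_image(f)


# Fake image content, written to a throwaway file for the path-based variant
fake_image = b"\x89PNG fake image bytes"
with tempfile.NamedTemporaryFile("wb", suffix=".png", delete=False) as tmp:
    tmp.write(fake_image)
    tmp_path = tmp.name

try:
    from_path = load_image_from_path(tmp_path)
finally:
    os.remove(tmp_path)

from_bytes = load_image_from_bytes(fake_image)
```

Because both wrappers funnel into load_image, the decoding logic only has to be written once and never needs to guess what kind of argument it received.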
How about just creating two functions, loadImageFromString and loadImageFromFile?
This being Python, you can easily distinguish between a filename and a data string. I would do something like this:
import os.path as P
from StringIO import StringIO
def load_image(im):
    fin = None
    if P.isfile(im):
        fin = open(im, 'rb')
    else:
        fin = StringIO(im)
    # Read from fin like you would from any open file object
Other ways to do it would be a try block instead of using os.path, but the essence of the approach remains the same.
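A rough sketch of the try-block variant on Python 3 might look like the following. It assumes the argument is either a path to an existing file or raw image bytes; open_image_source is a made-up name, and the sample data is fake.

```python
import io
import os
import tempfile


def open_image_source(im):
    """Return a binary file object for im (hypothetical helper).

    Assumes im is either a path to an existing file or raw bytes.
    """
    try:
        return open(im, "rb")
    except (OSError, ValueError):
        # Not an openable path (missing file, embedded NUL byte, ...): treat as data
        return io.BytesIO(im)


# Fake image data; the NUL byte guarantees it cannot be mistaken for a path
fake_image = b"\x00\x01 fake image data"

with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(fake_image)
    tmp_path = tmp.name

try:
    with open_image_source(tmp_path) as fin:
        from_path = fin.read()
finally:
    os.remove(tmp_path)

with open_image_source(fake_image) as fin:
    from_data = fin.read()
```

The guess can still go wrong: data that happens to match the name of an existing file would be opened as a path, which is one argument in favour of the two explicit functions suggested earlier.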