I'm getting a strange MemoryError while parsing XML with minidom (it ran on a server; the file path has been changed):
Traceback (most recent call last):
File "python.py", line 19, in <module>
xmldoc = minidom.parseString(unicode(data,errors='ignore'))
File "/usr/lib/python2.6/xml/dom/minidom.py", line 1928, in parseString
return expatbuilder.parseString(string)
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 940, in parseString
return builder.parseString(string)
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 753, in start_element_handler
_append_child(self.curNode, node)
File "/usr/lib/python2.6/xml/dom/minidom.py", line 287, in _append_child
last.__dict__["nextSibling"] = node
MemoryError
The XML feed I'm parsing is huge, so that might be the problem. What can I do about it?
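One direction often suggested for feeds too large to hold as a full DOM is to stream the document and discard each record once it has been handled. Below is a minimal sketch, assuming Python 3 and the standard-library xml.etree.ElementTree.iterparse instead of minidom. The tag name "item" and the helper count_records are hypothetical stand-ins, and the sketch assumes the records are direct children of the root element.

```python
import io
import xml.etree.ElementTree as ET


def count_records(source, record_tag="item"):
    """Stream an XML source and count <record_tag> elements without keeping them.

    `source` is a filename or a binary file object. `record_tag` is a
    hypothetical name; substitute the repeating element of the real feed.
    """
    count = 0
    # Request "start" events too, so the first event hands us the root element.
    events = ET.iterparse(source, events=("start", "end"))
    _, root = next(events)
    for event, elem in events:
        if event == "end" and elem.tag == record_tag:
            count += 1  # process the finished record here
            # Assumes records are direct children of the root: dropping them
            # from the root keeps memory use roughly constant.
            root.clear()
    return count


# Small in-memory stand-in for a large feed.
sample = b"<feed>" + b"<item><name>x</name></item>" * 1000 + b"</feed>"
print(count_records(io.BytesIO(sample)))
```

The point of the design is that only one record is alive at a time, whereas minidom.parseString has to build every node of the document before returning.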
Why does this code work without issues on my Mac with any version of Python, requests and lxml, but fail in every Docker container? I have tried everything.
It fails at line 34533 (found by printing el.sourceline).
from requests import get
from lxml import etree

r = get('https://printbar.ru/synsfiles/yandex/market/idrr_full.xml')
with open('test.xml', 'wb') as f:
    f.write(r.content)
tree = etree.iterparse(source='test.xml', events=('end',))
for (ev, el) in tree:
    continue
print('ok')
https://printbar.ru/synsfiles/yandex/market/idrr_full.xml seems completely valid and parses locally on all of my Macs.
I tried Ubuntu, Alpine and several Python containers, even with prebuilt lxml; nothing helped. I did not expect parsing this file to fail partway through with this error:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "src/lxml/iterparse.pxi", line 210, in lxml.etree.iterparse.__next__
File "src/lxml/iterparse.pxi", line 195, in lxml.etree.iterparse.__next__
File "src/lxml/iterparse.pxi", line 230, in lxml.etree.iterparse._read_more_events
File "src/lxml/parser.pxi", line 1376, in lxml.etree._FeedParser.feed
File "src/lxml/parser.pxi", line 606, in lxml.etree._ParserContext._handleParseResult
File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
File "test.xml", line 1
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
xmllint reports an encoding error, yet the file parses locally on my Mac. How is that possible? I need this to run in Docker.
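Since xmllint points at an encoding error, one way to narrow this down is to check the downloaded bytes directly, without going through the parser. Below is a minimal sketch, assuming the feed is meant to be UTF-8 (the encoding is not stated above, so that is a guess); first_bad_utf8 is a hypothetical helper and the sample byte strings are made up.

```python
def first_bad_utf8(data, context=20):
    """Return (offset, nearby_bytes) for the first invalid UTF-8 sequence, or None.

    Hypothetical helper; it assumes the document is supposed to be UTF-8.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        start = max(err.start - context, 0)
        return err.start, data[start:err.end + context]
    return None


good = "<foo>æøå</foo>".encode("utf-8")
bad = b"<foo>\xe6\xf8\xe5</foo>"  # the same text encoded as Latin-1 instead

print(first_bad_utf8(good))
print(first_bad_utf8(bad))
```

Running the helper on the contents of test.xml in both environments may show whether the Mac and the container actually ended up with the same bytes on disk.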
I have a utility method that parses XML using a parser created as etree.XMLParser(recover=True). I would like to test failure scenarios in a unit test. Except for empty input throwing an lxml.etree.XMLSyntaxError, I can't seem to break the parser.
My question is: is it possible to construct a StringIO or BytesIO input for this parser such that the parser throws a parse error?
Here are some examples (tested with Python 3.5 and lxml 4.3.3):
from io import BytesIO
from lxml import etree

def parse(xml):
    parser = etree.XMLParser(recover=True)
    elem = etree.parse(BytesIO(xml), parser)
    print(etree.tostring(elem))

parse(b'<broken<') # prints b'<broken/>'
parse(b'</lf|\jf>') # prints None
parse('<?xml encoding="ascii"?><foo>æøå</foo>'.encode('utf-8')) # prints b'<foo/>'
parse(b'') # Throws lxml.etree.XMLSyntaxError
If I slap a NULL character at the beginning of any of the bad inputs you show that don't raise an error, I do get an error. For instance:
parse(b'\0<broken<')
produces:
Traceback (most recent call last):
File "test.py", line 13, in <module>
parse(b'\0<broken<') # prints b'<broken/>'
File "test.py", line 9, in parse
elem = etree.parse(BytesIO(xml), parser)
File "src/lxml/etree.pyx", line 3435, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1857, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1765, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
Isn't it because you are using recover=True?
recover - try hard to parse through broken XML
I changed it to recover=False and I get:
Traceback (most recent call last):
File "./foo.py", line 11, in <module>
parse(b'<broken<') # prints b'<broken/>'
File "./foo.py", line 7, in parse
elem = etree.parse(BytesIO(xml), parser)
File "src/lxml/etree.pyx", line 3435, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1857, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1765, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: error parsing attribute name, line 1, column 8
Am I missing something?
This error message comes up every time I run:
import nltk
nltk.download()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python34\Lib\site-packages\nltk\downloader.py", line 655, in download
self._interactive_download()
File "C:\Python34\Lib\site-packages\nltk\downloader.py", line 974, in _interactive_download
DownloaderGUI(self).mainloop()
File "C:\Python34\Lib\site-packages\nltk\downloader.py", line 1234, in __init__
self._fill_table()
File "C:\Python34\Lib\site-packages\nltk\downloader.py", line 1530, in _fill_table
items = self._ds.collections()
File "C:\Python34\Lib\site-packages\nltk\downloader.py", line 499, in collections
self._update_index()
File "C:\Python34\Lib\site-packages\nltk\downloader.py", line 825, in _update_index
ElementTree.parse(compat.urlopen(self._url)).getroot())
File "C:\Python34\lib\xml\etree\ElementTree.py", line 1187, in parse
tree.parse(source, parser)
File "C:\Python34\lib\xml\etree\ElementTree.py", line 598, in parse
self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0
The downloader window opens, but it is empty: no list of packages is shown. I have checked downloader.py, and it contains the correct, up-to-date default link to the corpora.
I am using Windows standard edition.
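A ParseError at line 1, column 0 means the parser rejected the very first byte, which may indicate that whatever came back from the index URL was not XML at all. Below is a minimal sketch for inspecting such a payload before trusting it, using only the standard library and assuming Python 3. describe_payload is a hypothetical helper, and both sample payloads are made up for illustration, not taken from the real NLTK index.

```python
import xml.etree.ElementTree as ET


def describe_payload(payload):
    """Try to parse bytes as XML; on failure, report where and what was received."""
    try:
        root = ET.fromstring(payload)
    except ET.ParseError as err:
        line, column = err.position
        return "parse failed at line %d, column %d; payload starts with %r" % (
            line, column, payload[:40])
    return "parsed OK, root element is <%s>" % root.tag


# Made-up payloads: one well-formed document, one that is clearly not XML.
print(describe_payload(b"<index><package/></index>"))
print(describe_payload(b"\x1f\x8b\x08\x00 not xml"))
```

Seeing the first few bytes of the response usually makes it obvious whether the server returned an error page, compressed data, or something else entirely.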
I have imported a Python module and would like to view the implementation of some of its methods. How can I do this?
I tried inspect.getsource, however, I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/inspect.py", line 701, in getsource
lines, lnum = getsourcelines(object)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/inspect.py", line 690, in getsourcelines
lines, lnum = findsource(object)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/inspect.py", line 529, in findsource
raise IOError('source code not available')
IOError: source code not available
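inspect.getsource can only return text when the object was defined in a Python source file that is still readable on disk; objects implemented in C have no Python source to show. Below is a minimal sketch, assuming Python 3, where the failure surfaces as OSError or TypeError instead of the Python 2 IOError shown above. source_or_reason is a hypothetical helper.

```python
import inspect
import json


def source_or_reason(obj):
    """Return the source text of obj, or a short explanation of why there is none."""
    try:
        return inspect.getsource(obj)
    except TypeError:
        # Raised for built-ins and other objects implemented in C.
        return "no Python source: built-in or C-implemented object"
    except OSError:
        # Raised when the defining source file cannot be found or read.
        return "no Python source: source file not available"


# json.dumps is written in Python, len is a built-in.
print(source_or_reason(json.dumps).splitlines()[0])
print(source_or_reason(len))
```

When no source is available, inspect.getfile(module) may still show which file the module was loaded from, which hints at whether it is a compiled extension.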
Here's the Python source:
fsock = urllib2.urlopen('http://eprints.soton.ac.uk/cgi/exportview/divisions/uos-fp/2009/XML/uos-fp_2009.xml')
doc=et.parse(fsock)
When I run this, it gives the following error:
Traceback (most recent call last):
File "C:\Python27\reading and writing xml file from web1.py", line 30, in
doc=et.parse(fsock)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1176, in parse
tree.parse(source, parser)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 654, in parse
self._root = parser.close()
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1635, in close
self._raiseerror(v)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1487, in _raiseerror
raise err
ParseError: no element found: line 1, column 0
Can anyone explain why this is happening?
Your code works:
import urllib2
from xml.etree.cElementTree import parse, dump
fsock = urllib2.urlopen('http://eprints.soton.ac.uk/cgi/exportview/divisions/uos-fp/2009/XML/uos-fp_2009.xml')
doc = parse(fsock)
dump(doc)
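The message "no element found: line 1, column 0" is what ElementTree reports when the input ends before any element has been seen, for example when the response body is empty. Below is a minimal sketch, assuming Python 3 and a body that has already been read into bytes; parse_body is a hypothetical helper and the sample document is made up.

```python
import xml.etree.ElementTree as ET


def parse_body(body):
    """Parse response bytes, failing early with a clearer message if empty."""
    if not body.strip():
        raise ValueError("empty response body, nothing to parse")
    return ET.fromstring(body)


print(parse_body(b"<eprints><eprint id='1'/></eprints>").tag)

try:
    ET.fromstring(b"")
except ET.ParseError as err:
    print(err)  # the same kind of error as in the traceback above
```

Checking the body before parsing separates "the server sent nothing" from "the server sent malformed XML", which the bare ParseError does not make obvious.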