MemoryError on Python minidom - python

I'm getting a MemoryError while parsing XML with minidom (run on a server; file path changed):
Traceback (most recent call last):
  File "python.py", line 19, in <module>
    xmldoc = minidom.parseString(unicode(data,errors='ignore'))
  File "/usr/lib/python2.6/xml/dom/minidom.py", line 1928, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 753, in start_element_handler
    _append_child(self.curNode, node)
  File "/usr/lib/python2.6/xml/dom/minidom.py", line 287, in _append_child
    last.__dict__["nextSibling"] = node
MemoryError
The XML feed I'm parsing is huge, so that might be the problem. What can I do about it?
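One common way around this is to stream the feed instead of building the whole DOM in memory. Here is a minimal sketch using xml.etree.ElementTree.iterparse from the standard library, written for Python 3. The tag name "item" and the helper name iter_items are assumptions for illustration, since the question does not show the feed's structure.

```python
import io
import xml.etree.ElementTree as ET


def iter_items(source, tag="item"):
    """Yield the child fields of each <tag> element as a dict, one at a time.

    `source` is a filename or a binary file object. The tag name "item"
    is a placeholder; substitute whatever element repeats in the feed.
    """
    # An "end" event fires once an element and all its children are parsed.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {child.tag: child.text for child in elem}
            # Free the element's children and text. The emptied element
            # itself stays attached to its parent, which is small
            # compared with its content.
            elem.clear()


if __name__ == "__main__":
    sample = b"<feed><item><id>1</id></item><item><id>2</id></item></feed>"
    for record in iter_items(io.BytesIO(sample)):
        print(record)
```

Because records are handed out one by one, peak memory depends on the size of a single record rather than on the size of the whole feed.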

Related

python lxml in docker: "Document is empty" while parsing

Why does this code work without issues on my Mac with any version of Python, requests, and lxml, but fail in every Docker container? I have tried everything.
It fails at line 34533 (found by printing el.sourceline):
from requests import get
from lxml import etree
r = get('https://printbar.ru/synsfiles/yandex/market/idrr_full.xml')
with open('test.xml', 'wb') as f:
    f.write(r.content)
tree = etree.iterparse(source='test.xml', events=('end',))
for (ev, el) in tree:
    continue
print('ok')
https://printbar.ru/synsfiles/yandex/market/idrr_full.xml seems completely valid and parses locally on all of my Macs.
I tried Ubuntu, Alpine, and several Python images, even with prebuilt lxml; nothing helped. I expected the file to parse without throwing this error partway through:
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "src/lxml/iterparse.pxi", line 210, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 195, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 230, in lxml.etree.iterparse._read_more_events
  File "src/lxml/parser.pxi", line 1376, in lxml.etree._FeedParser.feed
  File "src/lxml/parser.pxi", line 606, in lxml.etree._ParserContext._handleParseResult
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "test.xml", line 1
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
xmllint says there is an encoding error, but it works locally on my Mac. How? I want this to run in Docker.

Can etree.XMLParser in recover mode still throw a parse error?

I have a utility method that parses XML using a parser created as etree.XMLParser(recover=True). I would like to test failure scenarios in a unit test. Except for empty input throwing an lxml.etree.XMLSyntaxError, I can't seem to break the parser.
My question is: is it possible to construct a StringIO or BytesIO input for this parser such that the parser throws a parse error?
Here are some examples (tested with Python 3.5 and lxml 4.3.3):
from io import BytesIO
from lxml import etree
def parse(xml):
    parser = etree.XMLParser(recover=True)
    elem = etree.parse(BytesIO(xml), parser)
    print(etree.tostring(elem))
parse(b'<broken<') # prints b'<broken/>'
parse(b'</lf|\jf>') # prints None
parse('<?xml encoding="ascii"?><foo>æøå</foo>'.encode('utf-8')) # prints b'<foo/>'
parse(b'') # Throws lxml.etree.XMLSyntaxError
If I slap a NULL character at the beginning of any of the bad inputs you show that don't raise an error, I do get an error. For instance:
parse(b'\0<broken<')
produces:
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    parse(b'\0<broken<') # prints b'<broken/>'
  File "test.py", line 9, in parse
    elem = etree.parse(BytesIO(xml), parser)
  File "src/lxml/etree.pyx", line 3435, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1857, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1765, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
Isn't it because you are using recover=True?
recover - try hard to parse through broken XML
I changed it to recover=False and I get:
Traceback (most recent call last):
  File "./foo.py", line 11, in <module>
    parse(b'<broken<') # prints b'<broken/>'
  File "./foo.py", line 7, in parse
    elem = etree.parse(BytesIO(xml), parser)
  File "src/lxml/etree.pyx", line 3435, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1857, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1765, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: error parsing attribute name, line 1, column 8
Am I missing something?

After importing nltk nltk.download() is not working

This error message comes up every time:
import nltk
nltk.download()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python34\Lib\site-packages\nltk\downloader.py", line 655, in download
    self._interactive_download()
  File "C:\Python34\Lib\site-packages\nltk\downloader.py", line 974, in _interactive_download
    DownloaderGUI(self).mainloop()
  File "C:\Python34\Lib\site-packages\nltk\downloader.py", line 1234, in __init__
    self._fill_table()
  File "C:\Python34\Lib\site-packages\nltk\downloader.py", line 1530, in _fill_table
    items = self._ds.collections()
  File "C:\Python34\Lib\site-packages\nltk\downloader.py", line 499, in collections
    self._update_index()
  File "C:\Python34\Lib\site-packages\nltk\downloader.py", line 825, in _update_index
    ElementTree.parse(compat.urlopen(self._url)).getroot())
  File "C:\Python34\lib\xml\etree\ElementTree.py", line 1187, in parse
    tree.parse(source, parser)
  File "C:\Python34\lib\xml\etree\ElementTree.py", line 598, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0
The downloader window opens empty, with no package list. I have checked downloader.py and it contains the correct, current default link to the corpora.
I am using Windows standard edition.
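An error at line 1, column 0 points at the very start of the data, so it can help to look at what the server actually returned before blaming the parser. The sketch below is not NLTK code: diagnose_xml is a hypothetical helper that tries to parse a byte string and, on failure, reports the error together with the leading bytes. In the real case the bytes would come from reading the index URL configured in downloader.py; the sample inputs here are made up.

```python
import xml.etree.ElementTree as ET


def diagnose_xml(data, preview=60):
    """Return None if `data` (bytes) parses as XML, else a short report.

    Hypothetical helper for illustration; not part of NLTK or the stdlib.
    """
    try:
        ET.fromstring(data)
    except ET.ParseError as exc:
        # Show the error and the first few bytes so that a proxy page,
        # a login form, or other non-XML content is easy to spot.
        return "ParseError: %s; data starts with %r" % (exc, data[:preview])
    return None


if __name__ == "__main__":
    # Made-up payloads: one well-formed document, one non-XML response.
    print(diagnose_xml(b"<index><package id='example'/></index>"))
    print(diagnose_xml(b"Access denied by proxy"))
```

If the report shows HTML or plain text instead of an XML document, the problem is likely on the network side (proxy, firewall, redirect) rather than in the downloader itself.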

view the code of methods in a python module

I have imported some python module and I would like to view the implementation of some of the module methods. How can I do this?
I tried inspect.getsource, but I get this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/inspect.py", line 701, in getsource
    lines, lnum = getsourcelines(object)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/inspect.py", line 690, in getsourcelines
    lines, lnum = findsource(object)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/inspect.py", line 529, in findsource
    raise IOError('source code not available')
IOError: source code not available
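For context, inspect.getsource can only return text for objects defined in a Python source file that is still readable; built-ins and C extension functions have no Python source to show. A small sketch of that distinction follows, written for Python 3, where the missing-source case raises OSError (IOError is an alias of it) and built-ins raise TypeError. The helper name source_or_none is made up for this example.

```python
import inspect
import json


def source_or_none(obj):
    """Return the source text of `obj`, or None if it cannot be retrieved."""
    try:
        return inspect.getsource(obj)
    except (OSError, TypeError):
        # OSError: the source file could not be found or read.
        # TypeError: the object is a built-in (implemented in C).
        return None


if __name__ == "__main__":
    # json.dumps is written in Python, so its source is normally available.
    print(source_or_none(json.dumps) is not None)
    # len is a built-in implemented in C; there is no Python source for it.
    print(source_or_none(len))
```

When the helper returns None for a function you expected to be pure Python, checking the module's __file__ attribute shows where it was loaded from and whether a .py file sits next to it.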

"no element found" using xml.etree.ElementTree with XML file from the web

Here's the Python source:
fsock = urllib2.urlopen('http://eprints.soton.ac.uk/cgi/exportview/divisions/uos-fp/2009/XML/uos-fp_2009.xml')
doc=et.parse(fsock)
When I run this, it gives the following error:
Traceback (most recent call last):
  File "C:\Python27\reading and writing xml file from web1.py", line 30, in <module>
    doc=et.parse(fsock)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1176, in parse
    tree.parse(source, parser)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 654, in parse
    self._root = parser.close()
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1635, in close
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1487, in _raiseerror
    raise err
ParseError: no element found: line 1, column 0
Can anyone explain why this is happening?
Your code works:
import urllib2
from xml.etree.cElementTree import parse, dump
fsock = urllib2.urlopen('http://eprints.soton.ac.uk/cgi/exportview/divisions/uos-fp/2009/XML/uos-fp_2009.xml')
doc = parse(fsock)
dump(doc)
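"no element found: line 1, column 0" is what the parser reports when it is handed no data at all, so if the code works on one machine and not another, an empty response body (for example from a proxy or a failed request) is worth ruling out. The sketch below reproduces the message without any network access. It is written for Python 3, and parse_body is a hypothetical helper, not something from the original code; the sample document is made up.

```python
import io
import xml.etree.ElementTree as ET


def parse_body(body):
    """Parse `body` (bytes) and return the root element.

    Raises ValueError with a clearer message when the body is empty,
    instead of letting the parser report "no element found".
    """
    if not body.strip():
        raise ValueError("response body is empty; nothing to parse")
    return ET.parse(io.BytesIO(body)).getroot()


if __name__ == "__main__":
    # An empty stream reproduces the error from the question.
    try:
        ET.parse(io.BytesIO(b""))
    except ET.ParseError as exc:
        print(exc)
    # A made-up, well-formed document parses normally.
    print(parse_body(b"<eprints><eprint/></eprints>").tag)
```

Reading the response into a variable first and passing it through a check like this makes an empty download show up as its own error rather than as a parser failure.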
