Parser.pxi problems when pretty printing xml file

I'm getting an error when I try to pretty-print an XML file. I've looked everywhere and tried installing the latest version of lxml, but I still get this error.
My script is pretty simple; it looks like this:
import os
import lxml.etree as etree
from lxml.etree import parse
fname = r'C:\Test_folder\SlutR_20150218.xml'
x = etree.parse(fname)
print etree.tostring(x, pretty_print = True)
And the error I'm getting is the following:
Traceback (most recent call last):
File "C:\Users\a.curcic\Desktop\Övriga_python_script\Pretty_print_example.py", line 5, in <module>
x = etree.parse(fname)
File "lxml.etree.pyx", line 3301, in lxml.etree.parse (src\lxml\lxml.etree.c:72453)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:105915)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106214)
XMLSyntaxError: Extra content at the end of the document, line 2, column 909
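
The traceback alone does not show what is wrong with the file, but libxml2 reports this message when it finds markup it does not allow after the root element has closed, for example a second top-level element. A minimal sketch that reproduces the message and reads the reported position, assuming lxml is installed; the two-element string is made up for illustration and is not the asker's file:

```python
from lxml import etree

# Hypothetical input: two top-level elements, which XML does not allow.
two_roots = b"<a/><b/>"

try:
    etree.fromstring(two_roots)
    message = None
except etree.XMLSyntaxError as err:
    message = str(err)
    # err.position is a (line, column) tuple pointing at the offending spot.
    line, column = err.position
    print(message)
    print("problem starts at line", line, "column", column)
```

In the question, the reported position (line 2, column 909) is the place to inspect in SlutR_20150218.xml.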

Related

TypeError: expected str, bytes or os.PathLike object, not WindowsPath while converting .py using Pyinstaller

While trying to build an .exe using Pyinstaller, this error is thrown:
133235 INFO: Loading module hook 'hook-matplotlib.backends.py' from 'c:\\users\\jimit vaghela\\appdata\\local\\programs\\python\\python37\\lib\\site-packages\\PyInstaller\\hooks'...
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "c:\users\jimit vaghela\appdata\local\programs\python\python37\lib\site-packages\matplotlib\__init__.py", line 901, in <module>
fail_on_error=True)
File "c:\users\jimit vaghela\appdata\local\programs\python\python37\lib\site-packages\matplotlib\__init__.py", line 796, in _rc_params_in_file
with _open_file_or_url(fname) as fd:
File "c:\users\jimit vaghela\appdata\local\programs\python\python37\lib\contextlib.py", line 112, in __enter__
return next(self.gen)
File "c:\users\jimit vaghela\appdata\local\programs\python\python37\lib\site-packages\matplotlib\__init__.py", line 770, in _open_file_or_url
fname = os.path.expanduser(fname)
File "c:\users\jimit vaghela\appdata\local\programs\python\python37\lib\ntpath.py", line 291, in expanduser
path = os.fspath(path)
TypeError: expected str, bytes or os.PathLike object, not WindowsPath
134074 INFO: Loading module hook 'hook-matplotlib.py' from 'c:\\users\\jimit vaghela\\appdata\\local\\programs\\python\\python37\\lib\\site-packages\\PyInstaller\\hooks'...
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "c:\users\jimit vaghela\appdata\local\programs\python\python37\lib\site-packages\matplotlib\__init__.py", line 901, in <module>
fail_on_error=True)
File "c:\users\jimit vaghela\appdata\local\programs\python\python37\lib\site-packages\matplotlib\__init__.py", line 796, in _rc_params_in_file
with _open_file_or_url(fname) as fd:
File "c:\users\jimit vaghela\appdata\local\programs\python\python37\lib\contextlib.py", line 112, in __enter__
return next(self.gen)
File "c:\users\jimit vaghela\appdata\local\programs\python\python37\lib\site-packages\matplotlib\__init__.py", line 770, in _open_file_or_url
fname = os.path.expanduser(fname)
File "c:\users\jimit vaghela\appdata\local\programs\python\python37\lib\ntpath.py", line 291, in expanduser
path = os.fspath(path)
TypeError: expected str, bytes or os.PathLike object, not WindowsPath
I found a solution posted on Stack Overflow which says that some code needs to be inserted into backend.py in the PyInstaller folder, but that does not work either.
What is going wrong here?

I found out that matplotlib was the issue. Excluding it with PyInstaller's --exclude-module option fixed it.
Like this:
pyinstaller --exclude-module matplotlib main.py

This is currently an issue with matplotlib.
I solved it by editing the PyInstaller hook file. If you don't want to edit it, installing an older version of matplotlib such as 3.0.3 is said to avoid the issue.
In my case that didn't work. Anyway, below are the steps I took.
First, open the Python interpreter, run the following, and copy the path you get as output.
>>> import matplotlib
>>> matplotlib.get_data_path() # copy the below path
'C:\\<Python Path>\\lib\\site-packages\\matplotlib\\mpl-data'
Next, open the file C:\<Python Path>\lib\site-packages\PyInstaller\hooks\hook-matplotlib.py. To be on the safe side, make a backup of this file first.
from PyInstaller.utils.hooks import exec_statement
# old line; delete this
mpl_data_dir = exec_statement(
"import matplotlib; print(matplotlib.get_data_path())")
# Add this line
mpl_data_dir = 'C:\\<Python Path>\\lib\\site-packages\\matplotlib\\mpl-data'
assert mpl_data_dir, "Failed to determine matplotlib's data directory!"
datas = [
(mpl_data_dir, "matplotlib/mpl-data"),
]
And don't forget to save.

Transform docx to html raises python MemoryError

I have a function that converts a docx to html and a large docx file to be converted.
The problem is that this function is part of a bigger program and the converted HTML is parsed afterwards, so I cannot switch to another converter without impacting the rest of the code (which is not wanted). It runs on 32-bit Python 2.7.13, and switching to 64-bit is also not desired.
This is the function:
import logging
import ooxml
from ooxml import serialize

def trasnformDocxtoHtml(inputFile, outputFile):
    logging.basicConfig(filename='ooxml.log', level=logging.INFO)
    # Parse the whole .docx into memory; this is the call that fails below.
    dfile = ooxml.read_from_file(inputFile)
    with open(outputFile, 'w') as htmlFile:
        htmlFile.write(serialize.serialize(dfile.document))
and here's the error:
>>> import library
>>> library.trasnformDocxtoHtml(r'large_file.docx', 'output.html')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "library.py", line 9, in trasnformDocxtoHtml
dfile = ooxml.read_from_file(inputFile)
File "C:\Python27\lib\site-packages\ooxml\__init__.py", line 52, in read_from_file
dfile.parse()
File "C:\Python27\lib\site-packages\ooxml\docxfile.py", line 46, in parse
self._doc = parse_from_file(self)
File "C:\Python27\lib\site-packages\ooxml\parse.py", line 655, in parse_from_file
document = parse_document(doc_content)
File "C:\Python27\lib\site-packages\ooxml\parse.py", line 463, in parse_document
document.elements.append(parse_table(document, elem))
File "C:\Python27\lib\site-packages\ooxml\parse.py", line 436, in parse_table
for p in tc.xpath('./w:p', namespaces=NAMESPACES):
File "src\lxml\etree.pyx", line 1583, in lxml.etree._Element.xpath
MemoryError
no mem for new parser
MemoryError
Could I somehow increase the buffer memory in python? Or fix the function without impacting the html output format?

Error trying parsing xml using python : xml.etree.ElementTree.ParseError: syntax error: line 1,

In python, simply trying to parse XML:
import xml.etree.ElementTree as ET
data = 'info.xml'
tree = ET.fromstring(data)
but got error:
Traceback (most recent call last):
File "C:\mesh\try1.py", line 3, in <module>
tree = ET.fromstring(data)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1312, in XML
return parser.close()
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1665, in close
self._raiseerror(v)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1517, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: syntax error: line 1, column 0
This is part of the XML I have:
<?xml version="1.0" encoding="utf-16"?>
<AnalysisData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<BlendOperations OperationNumber="1">
<ComponentQuality>
<MaterialName>Oil</MaterialName>
<Weight>1067.843017578125</Weight>
<WeightPercent>31.545017776585109</WeightPercent>
Why is it happening?

You're trying to parse the string 'info.xml' instead of the contents of the file.
You could call tree = ET.parse('info.xml') which will open the file.
Or you could read the file directly:
ET.fromstring(open('info.xml').read())
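
The difference can be seen side by side in a small sketch. It is written for Python 3 and uses a temporary stand-in for info.xml with made-up content, so the file contents and variable names here are assumptions for illustration only:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Write a small stand-in for info.xml (hypothetical content) to a temp directory.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "info.xml")
with open(path, "w", encoding="utf-8") as f:
    f.write("<AnalysisData><MaterialName>Oil</MaterialName></AnalysisData>")

# fromstring() treats its argument as XML text, so a file name fails to parse.
try:
    ET.fromstring(path)
    name_error = None
except ET.ParseError as err:
    name_error = str(err)

# parse() takes a file name (or file object) and opens it for you.
root_from_parse = ET.parse(path).getroot()

# Reading the file yourself and handing the text to fromstring() also works.
with open(path, encoding="utf-8") as f:
    root_from_text = ET.fromstring(f.read())

print(name_error)
print(root_from_parse.tag, root_from_text.find("MaterialName").text)
```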

How to parse broken HTML with LXML

I'm trying to parse broken HTML with the lxml parser on Python 2.5 and 2.7.
Contrary to the lxml documentation (http://lxml.de/parsing.html#parsing-html), parsing broken HTML does not work for me:
from lxml import etree
import StringIO
broken_html = "<html><head><title>test<body><h1>page title</h3>"
parser = etree.HTMLParser()
tree = etree.parse(StringIO.StringIO(broken_html))
Result:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 2954, in lxml.etree.parse (src/lxml/lxml.etree.c:56220)
File "parser.pxi", line 1550, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82482)
File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82764)
File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81562)
File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78232)
File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74488)
File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75379)
File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74712)
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: h1 line 1 and h3, line 1, column 50

Don't just construct that parser, use it (as per the example you link to):
>>> tree = etree.parse(StringIO.StringIO(broken_html), parser=parser)
>>> tree
<lxml.etree._ElementTree object at 0x2fd8e60>
Or use lxml.html as a shortcut:
>>> from lxml import html
>>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
>>> html.fromstring(broken_html)
<Element html at 0x2dde650>

lxml lets you load broken markup by creating a parser instance with recover=True:
etree.HTMLParser(recover=True)
The same keyword works when creating an etree.XMLParser for broken XML.
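
A minimal sketch of the recover=True approach applied to XML, assuming lxml is installed. The input string is a made-up example, and the exact shape of the recovered tree depends on libxml2, so it is printed here rather than relied on:

```python
from lxml import etree

# Hypothetical broken XML: <h1> is closed by a mismatched </h3>.
broken_xml = b"<root><h1>page title</h3></root>"

# The default (strict) parser rejects the input.
try:
    etree.fromstring(broken_xml)
    strict_failed = False
except etree.XMLSyntaxError:
    strict_failed = True

# A recovering parser builds whatever tree it can instead of raising.
parser = etree.XMLParser(recover=True)
root = etree.fromstring(broken_xml, parser)

print(strict_failed)
print(etree.tostring(root))
# The errors the parser skipped over remain available for inspection.
print(len(parser.error_log))
```

Recovery can silently drop or rearrange content, so it is worth checking the result before trusting it.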

You might try to use lxml.html instead
>>> import lxml.html
>>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
>>> root = lxml.html.fromstring(broken_html)
>>> lxml.html.tostring(root)
'<html><head><title>test</title></head><body><h1>page title</h1></body></html>'

Why is ElementTree.iterparse() raising a ParseError?

import xml.etree.ElementTree as ET
xmldata = file('my_xml_file.xml')
tree = ET.parse(xmldata)
root = tree.getroot()
root_iter = root.iter()
Now I can call root_iter.next() and get my Element objects. The problem is the real file I am working with is huge and I can't fit all of it in memory. So I am trying to use:
parse_iter = ET.iterparse(xmldata)
If I call parse_iter.next() it raises the following
Traceback (most recent call last):
File "<pyshell#38>", line 1, in <module>
parse_iter.next()
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1260, in next
self._root = self._parser.close()
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1636, in close
self._raiseerror(v)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
raise err
ParseError: no element found: line 1, column 0
What am I doing wrong?

The code I had was perfectly fine, except I was calling ElementTree.iterparse() on a file object I had already read with ElementTree.parse(). D'oh!
So for those who happen to make the same mistake, the solution is to either open a new file object or use file.seek(0) to reset the file cursor.
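
The mistake and the seek(0) fix can be reproduced with a small sketch. It is written for Python 3 and uses an in-memory io.BytesIO object as a stand-in for the real file, which is an assumption for illustration; a file opened from disk behaves the same way:

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for a real file on disk (hypothetical content).
xmldata = io.BytesIO(b"<root><item>1</item><item>2</item></root>")

# parse() reads the file object to the end and leaves the cursor there.
tree = ET.parse(xmldata)

# iterparse() on the exhausted file object sees no data and raises ParseError.
try:
    list(ET.iterparse(xmldata))
    first_attempt = "ok"
except ET.ParseError as err:
    first_attempt = str(err)

# Rewinding the cursor lets iterparse() read the document from the start.
xmldata.seek(0)
tags = [elem.tag for _event, elem in ET.iterparse(xmldata)]

print(first_attempt)
print(tags)
```

For a file too large for memory, skip the initial ET.parse() call entirely and call iterparse() on a freshly opened file.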
