IOError passing requests Response.content to lxml.etree.parse() [duplicate] - python

This question already has an answer here:
lxml error "IOError: Error reading file" when parsing facebook mobile in a python scraper script
(1 answer)
Closed 7 years ago.
I have the following xml on a webpage -
<entry>
<id>1750</id>
<title>variablename</title>
<source>
com.tidalsoft.webclient.tes.dsp.db.datatypes.Variable
</source>
<tes:variable>
<tes:ownername>ownergroup</tes:ownername>
<tes:productiondate>2015-08-17T00:00:00-0400</tes:productiondate>
<tes:readonly>N</tes:readonly>
<tes:publish>N</tes:publish>
<tes:description>
Decription Here
</tes:description>
<tes:startcalendar>0</tes:startcalendar>
<tes:ownerid>666</tes:ownerid>
<tes:type>1</tes:type>
<tes:lastusermodifiedtime>2015-06-15T15:42:27-0400</tes:lastusermodifiedtime>
<tes:innervalue>\\share\location</tes:innervalue>
<tes:calc>N</tes:calc>
<tes:name>variablename</tes:name>
<tes:startdate>1899-12-30T00:00:00-0500</tes:startdate>
<tes:pub>Y</tes:pub>
<tes:lastvalue>\\share\location</tes:lastvalue>
<tes:id>1750</tes:id>
<tes:startdateasstring>18991230000000</tes:startdateasstring>
<tes:lastchangetime>2015-06-15T15:42:27-0400</tes:lastchangetime>
<tes:clientcachelastchangetime>2015-08-17T09:56:49-0400</tes:clientcachelastchangetime>
</tes:variable>
</entry>
I'm trying to parse this data. I have a get through requests -
r = requests.get(url, auth=('username', 'password'))
but when I try to parse the content I get errors.
>>> xmlObject = etree.parse(r.content)
Traceback (most recent call last):
File "apiTest.py", line 46, in <module>
xmlObject = etree.parse(r.content)
File "lxml.etree.pyx", line 3310, in lxml.etree.parse (src\lxml\lxml.etree.c:7
2517)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etre
e.c:105979)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lx
ml.etree.c:106278)
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src\lxml\lxml.e
tree.c:105277)
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src
\lxml\lxml.etree.c:100227)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDo
c (src\lxml\lxml.etree.c:94350)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.e
tree.c:95786)
File "parser.pxi", line 618, in lxml.etree._raiseParseError (src\lxml\lxml.etr
ee.c:94818)
IOError: Error reading file ''
On the last line what is between the quotes is the xml stated at the beginning as a string -
<?xml version="1.0" encoding="UTF-8" standalone="ye
s"?><entry xmlns="http://purl.org/atom/ns#"><id>1750</id><title>....
The data is being provided as content-type: text/xml

etree.parse expects a filename, a file-like object, or a URL as its first argument (see help(etree.parse)). It does not expect an XML string. To parse an XML string use
xmlObject = etree.fromstring(r.content)
Note that etree.fromstring returns a lxml.etree._Element. In contrast, etree.parse returns a lxml.etree._ElementTree. Given the _Element, you can obtain the _ElementTree with the getroottree method:
xmlTree = xmlObject.getroottree()

Related

Read XML using pandas parsing to csv

I have the following code to extract data from XML to CSV file, but there is an error and I don't know how to solve it.
if anyone can help, please.
url = "http://90.161.233.78:65519/services/user/records.xml?begin=04052022?end=06052022?var=EDSLINEEMBEDDED.Module2.VI1?var=EDSLINEEMBEDDED.Module2.API1?period=900"
s = unescape(requests.get(url).text)[5:-6]
df = pd.read_xml(s, xpath="//record/* | //dateTime")
df["field"] = df["field"].ffill()
df.to_csv('output0.csv')
The Error is
doc = fromstring(
File "src\lxml\etree.pyx", line 3252, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1913, in lxml.etree._parseMemoryDocument
File "src\lxml\parser.pxi", line 1800, in lxml.etree._parseDoc
File "src\lxml\parser.pxi", line 1141, in lxml.etree._BaseParser._parseDoc
File "src\lxml\parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 725, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 654, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 2
Consider reading URL without requests or escaping content directly into pandas.read_xml(). Per docs, emphasis added:
path_or_buffer: str, path object, or file-like object
String, path object (implementing os.PathLike[str]), or file-like object implementing a read() function. The string can be any valid XML string or a path. The string can further be a URL. Valid URL schemes include http, ftp, s3, and file.
import pandas as pd
url = (
"http://90.161.233.78:65519/services/user/records.xml?"
"begin=04052022?end=06052022?var=EDSLINEEMBEDDED.Module2.VI1?"
"var=EDSLINEEMBEDDED.Module2.API1?period=900"
)
df = pd.read_xml(url, xpath="//record/* | //dateTime")
# FILL PARENT TEXT FORWARD TO CHILD ITEMS
df["dateTime"] = df["dateTime"].ffill()
# DROP UNNEEDED ROWS
df[(pd.notnull(df["id"])) & (pd.notnull(df["value"]))]
df.to_csv('output0.csv')

Python: append multiple sub-elements to a parent element by reading from a string

I am new to python and trying to read and add multiple xml elements represented by a string as a subelement of an XML element.
For eg:
<student>
<name>abc</name>
<description>abcd</description>
<regno>200</regno>
</student>
I have a string which can represent further student info which may contain nested information as well. For eg:
"<grade>100</grade><address-info><street>xyz</street><city>efgh</city><zip>505050</zip></address-info>"
I need this string to be parsed and to be added inside the student element and
result in something like
<student>
<name>abc</name>
<description>abcd</description>
<regno>200</regno>
<grade>100</grade>
<address-info>
<street>xyz</street>
<city>efgh</city>
<zip>505050</zip>
</address-info>
</student>
I tried using append method which is resulting in an error
def add_cfg_under_correct_student(in_name, cfg_to_be_added, root):
if root is None:
return True
for student in root.findall('student'):
name = student.find('name')
if name.text != in_name:
continue
student.append(ET.fromstring(cfg_to_be_added))
return True
But I got an error as
Traceback (most recent call last):
add_cfg_under_correct_student
student.append(ET.fromstring(cfg_to_be_added))
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:79003)
File "src/lxml/parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:118334)
File "src/lxml/parser.pxi", line 1736, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:117014)
File "src/lxml/parser.pxi", line 1102, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:111258)
File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105102)
File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106810)
File "src/lxml/parser.pxi", line 635, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105664)
File "", line 1
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 1, column 27
Then I tried using the ET.ElementTree(ET.fromstring(xmlstring))
option as suggested in this answer, but still get a similar error.
Then I looked up another answer on adding multiple elements at once in this question, but it doesn't exactly solve my scenario.
Does the solution mentioned in the above question to use extend work on sub elements which in turn could have sub elements under them as well?
Please help

Entity 'ouml' error while using lxml to parse dblp data

I am trying to parse dblp data(xml format). So far my code is :
#-*-coding:utf-8-*-
from lxml import etree # lxml import library
parser = etree.XMLParser (load_dtd =True)
Tree = etree.parse( "dblp.xml" ,parser)
Root = tree.getroot()
I tried running the code and I get the following error:
Tree = etree.parse( "dblp.xml" ,parser) # Parse the xml with tree structure
File "src/lxml/etree.pyx", line 3426, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1839, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1865, in lxml.etree._parseDocumentFromURL
File "src/lxml/parser.pxi", line 1769, in lxml.etree._parseDocFromFile
File "src/lxml/parser.pxi", line 1162, in lxml.etree._BaseParser._parseDocFromFile
File "src/lxml/parser.pxi", line 600, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 710, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 639, in lxml.etree._raiseParseError
File "dblp.xml", line 70
lxml.etree.XMLSyntaxError: Entity 'ouml' not defined, line 70,
column 27
how can i resolve this error?
Note: I have xml and dtd files in same location.
I recently encountered the same issue whilst parsing DBLP's XML database. In my case, I was missing the appropriate .dtd file for my dblp.xml (which provides the necessary information for parsing certain custom entities, including ouml). The top of your file should look something like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp-2017-08-29.dtd">
The .dtd file specified on the second line should be located in the same directory as the dblp.xml file that you're attempting to parse. You can download the appropriate .dtd file your XML file from here: http://dblp.org/xml/release/
$ ls
dblp-2017-08-29.dtd dblp-2018-11-01.xml
Also, given the size of dblp.xml, you may also want to use lxml.etree.iterparse to stream the contents of the file instead. Below is some of the code that I used to obtain entries for certain types of publication within the database.
fn = 'dblp.xml'
for event, elem in lxml.etree.iterparse(fn, load_dtd=True):
if elem.tag not in ['article', 'inproceedings', 'proceedings']:
continue
title = elem.find('title') # type: Optional[str]
year = elem.find('year') # type: Optional[int]
authors = elem.find('author') # type: Optional[str]
venue = elem.find('venue') # type: Optional[str]
...
elem.clear()

Error trying parsing xml using python : xml.etree.ElementTree.ParseError: syntax error: line 1,

In python, simply trying to parse XML:
import xml.etree.ElementTree as ET
data = 'info.xml'
tree = ET.fromstring(data)
but got error:
Traceback (most recent call last):
File "C:\mesh\try1.py", line 3, in <module>
tree = ET.fromstring(data)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1312, in XML
return parser.close()
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1665, in close
self._raiseerror(v)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1517, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: syntax error: line 1, column 0
thats a bit of xml, i have:
<?xml version="1.0" encoding="utf-16"?>
<AnalysisData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<BlendOperations OperationNumber="1">
<ComponentQuality>
<MaterialName>Oil</MaterialName>
<Weight>1067.843017578125</Weight>
<WeightPercent>31.545017776585109</WeightPercent>
Why is it happening?
You're trying to parse the string 'info.xml' instead of the contents of the file.
You could call tree = ET.parse('info.xml') which will open the file.
Or you could read the file directly:
ET.fromstring(open('info.xml').read())

How to parse broken HTML with LXML

I'm trying to parse a broken HTML with LXML parser on python 2.5 and 2.7
Unlike in LXML documentation (http://lxml.de/parsing.html#parsing-html) parsing a broken HTML does not work:
from lxml import etree
import StringIO
broken_html = "<html><head><title>test<body><h1>page title</h3>"
parser = etree.HTMLParser()
tree = etree.parse(StringIO.StringIO(broken_html))
Result:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 2954, in lxml.etree.parse (src/lxml/lxml.etree.c:56220)
File "parser.pxi", line 1550, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82482)
File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82764)
File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81562)
File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78232)
File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74488)
File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75379)
File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74712)
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: h1 line 1 and h3, line 1, column 50
Don't just construct that parser, use it (as per the example you link to):
>>> tree = etree.parse(StringIO.StringIO(broken_html), parser=parser)
>>> tree
<lxml.etree._ElementTree object at 0x2fd8e60>
Or use lxml.html as a shortcut:
>>> from lxml import html
>>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
>>> html.fromstring(broken_html)
<Element html at 0x2dde650>
lxml allows you load a broken xml by creating a parser instance with recover=True
etree.HTMLParser(recover=True)
You could use the same technique when creating the parser.
You might try to use lxml.html instead
>>> import lxml.html
>>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
>>> root = lxml.html.fromstring(broken_html)
>>> lxml.html.tostring(root)
'<html><head><title>test</title></head><body><h1>page title</h1></body></html>'

Categories