Read XML using pandas parsing to csv - python

I have the following code to extract data from an XML URL into a CSV file, but it raises an error that I don't know how to fix.
Any help would be appreciated.
url = "http://90.161.233.78:65519/services/user/records.xml?begin=04052022?end=06052022?var=EDSLINEEMBEDDED.Module2.VI1?var=EDSLINEEMBEDDED.Module2.API1?period=900"
s = unescape(requests.get(url).text)[5:-6]
df = pd.read_xml(s, xpath="//record/* | //dateTime")
df["field"] = df["field"].ffill()
df.to_csv('output0.csv')
The Error is
doc = fromstring(
File "src\lxml\etree.pyx", line 3252, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1913, in lxml.etree._parseMemoryDocument
File "src\lxml\parser.pxi", line 1800, in lxml.etree._parseDoc
File "src\lxml\parser.pxi", line 1141, in lxml.etree._BaseParser._parseDoc
File "src\lxml\parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 725, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 654, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 2

Consider passing the URL directly to pandas.read_xml() instead of fetching it with requests and unescaping the content. Per the docs (emphasis added):
path_or_buffer: str, path object, or file-like object
String, path object (implementing os.PathLike[str]), or file-like object implementing a read() function. The string can be any valid XML string or a path. The string can further be a URL. Valid URL schemes include http, ftp, s3, and file.
import pandas as pd
url = (
"http://90.161.233.78:65519/services/user/records.xml?"
"begin=04052022?end=06052022?var=EDSLINEEMBEDDED.Module2.VI1?"
"var=EDSLINEEMBEDDED.Module2.API1?period=900"
)
df = pd.read_xml(url, xpath="//record/* | //dateTime")
# FILL PARENT TEXT FORWARD TO CHILD ITEMS
df["dateTime"] = df["dateTime"].ffill()
# DROP UNNEEDED ROWS
df = df[(pd.notnull(df["id"])) & (pd.notnull(df["value"]))]
df.to_csv('output0.csv')
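For readers on a pandas version older than 1.3 (where read_xml was introduced), the same flattening can be sketched with the standard library alone. The XML layout here is an assumption, since the device's actual response is not shown in the question: each record is assumed to hold one dateTime element followed by field elements with id and value children. The sample values and the helper name flatten_records are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed sample layout; the real response from the device may differ.
SAMPLE = """<recordGroup>
  <record>
    <dateTime>04052022000000</dateTime>
    <field><id>VI1</id><value>230.1</value></field>
    <field><id>API1</id><value>12.5</value></field>
  </record>
  <record>
    <dateTime>04052022001500</dateTime>
    <field><id>VI1</id><value>229.8</value></field>
  </record>
</recordGroup>"""


def flatten_records(xml_text):
    """Return one (dateTime, id, value) row per <field>, carrying the parent's dateTime."""
    rows = []
    root = ET.fromstring(xml_text)
    for record in root.iter("record"):
        date_time = record.findtext("dateTime")
        for field in record.findall("field"):
            rows.append((date_time, field.findtext("id"), field.findtext("value")))
    return rows


rows = flatten_records(SAMPLE)

# Written to an in-memory buffer here; use open("output0.csv", "w", newline="") for a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["dateTime", "id", "value"])
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Copying the parent's dateTime onto each field row plays the role of the ffill() step above, and rows without an id and value never get created, so no filtering is needed afterwards.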

Related

Python Pandas xlsxwriter failing to close

I am building automation for a multi-tabbed Excel document. When I try to close the document I get the error below (full traceback, minus the personal details at the top), and the resulting xlsx file is corrupted and cannot be opened. Unfortunately I haven't found any clues to go on. I am using xlsxwriter functions to set row and column formatting; from what I've found this could be the cause, but I haven't been able to track it down. Any thoughts on possible solutions?
writer.close()
File "/opt/homebrew/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 1480, in close
self._save()
File "/opt/homebrew/lib/python3.10/site-packages/pandas/io/excel/_xlsxwriter.py", line 244, in _save
self.book.close()
File "/opt/homebrew/lib/python3.10/site-packages/xlsxwriter/workbook.py", line 324, in close
self._store_workbook()
File "/opt/homebrew/lib/python3.10/site-packages/xlsxwriter/workbook.py", line 709, in _store_workbook
xml_files = packager._create_package()
File "/opt/homebrew/lib/python3.10/site-packages/xlsxwriter/packager.py", line 137, in _create_package
self._write_worksheet_files()
File "/opt/homebrew/lib/python3.10/site-packages/xlsxwriter/packager.py", line 193, in _write_worksheet_files
worksheet._assemble_xml_file()
File "/opt/homebrew/lib/python3.10/site-packages/xlsxwriter/worksheet.py", line 4221, in _assemble_xml_file
self._write_cols()
File "/opt/homebrew/lib/python3.10/site-packages/xlsxwriter/worksheet.py", line 5807, in _write_cols
self._write_col_info(self.colinfo[col])
File "/opt/homebrew/lib/python3.10/site-packages/xlsxwriter/worksheet.py", line 5836, in _write_col_info
if width > 0:
TypeError: '>' not supported between instances of 'Format' and 'int'

Python: append multiple sub-elements to a parent element by reading from a string

I am new to Python and am trying to parse multiple XML elements held in a string and add them as sub-elements of an existing XML element.
For example:
<student>
<name>abc</name>
<description>abcd</description>
<regno>200</regno>
</student>
I have a string that holds further student info, which may contain nested elements as well. For example:
"<grade>100</grade><address-info><street>xyz</street><city>efgh</city><zip>505050</zip></address-info>"
I need this string to be parsed and added inside the student element, resulting in something like:
<student>
<name>abc</name>
<description>abcd</description>
<regno>200</regno>
<grade>100</grade>
<address-info>
<street>xyz</street>
<city>efgh</city>
<zip>505050</zip>
</address-info>
</student>
I tried using the append method, which results in an error:
def add_cfg_under_correct_student(in_name, cfg_to_be_added, root):
    if root is None:
        return True
    for student in root.findall('student'):
        name = student.find('name')
        if name.text != in_name:
            continue
        student.append(ET.fromstring(cfg_to_be_added))
    return True
But I got the following error:
Traceback (most recent call last):
add_cfg_under_correct_student
student.append(ET.fromstring(cfg_to_be_added))
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:79003)
File "src/lxml/parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:118334)
File "src/lxml/parser.pxi", line 1736, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:117014)
File "src/lxml/parser.pxi", line 1102, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:111258)
File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105102)
File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106810)
File "src/lxml/parser.pxi", line 635, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105664)
File "", line 1
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 1, column 27
Then I tried the ET.ElementTree(ET.fromstring(xmlstring)) option suggested in this answer, but I still get a similar error.
I also looked at another answer on adding multiple elements at once in this question, but it doesn't exactly cover my scenario.
Does the solution from that question, which uses extend, also work on sub-elements that in turn have sub-elements of their own?
Please help
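The "Extra content at the end of the document" error arises because the string holds several top-level elements, while fromstring expects a single root. One possible way around it, sketched here with the standard library's xml.etree.ElementTree (lxml.etree offers the same fromstring and extend calls), is to wrap the fragment in a throwaway root and move its children across with extend. The helper name append_fragment and the wrapper tag are made up for illustration.

```python
import xml.etree.ElementTree as ET


def append_fragment(parent, fragment):
    """Parse a multi-root XML fragment and append each top-level element to parent."""
    # A temporary root makes the fragment a well-formed document.
    wrapper = ET.fromstring("<wrapper>" + fragment + "</wrapper>")
    # list() takes a snapshot of the children before they are attached elsewhere.
    parent.extend(list(wrapper))
    return parent


student = ET.fromstring(
    "<student><name>abc</name><description>abcd</description><regno>200</regno></student>"
)
fragment = (
    "<grade>100</grade>"
    "<address-info><street>xyz</street><city>efgh</city><zip>505050</zip></address-info>"
)

append_fragment(student, fragment)
child_tags = [child.tag for child in student]
city = student.find("address-info").find("city").text
```

Each top-level element brings its whole subtree with it, so nested content such as address-info needs no special handling.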

Entity 'ouml' error while using lxml to parse dblp data

I am trying to parse DBLP data (XML format). So far my code is:
# -*- coding: utf-8 -*-
from lxml import etree  # import the lxml library
parser = etree.XMLParser(load_dtd=True)
Tree = etree.parse("dblp.xml", parser)
Root = Tree.getroot()
I tried running the code and I get the following error:
Tree = etree.parse( "dblp.xml" ,parser) # Parse the xml with tree structure
File "src/lxml/etree.pyx", line 3426, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1839, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1865, in lxml.etree._parseDocumentFromURL
File "src/lxml/parser.pxi", line 1769, in lxml.etree._parseDocFromFile
File "src/lxml/parser.pxi", line 1162, in lxml.etree._BaseParser._parseDocFromFile
File "src/lxml/parser.pxi", line 600, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 710, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 639, in lxml.etree._raiseParseError
File "dblp.xml", line 70
lxml.etree.XMLSyntaxError: Entity 'ouml' not defined, line 70, column 27
How can I resolve this error?
Note: I have xml and dtd files in same location.
I recently encountered the same issue whilst parsing DBLP's XML database. In my case, I was missing the appropriate .dtd file for my dblp.xml (which provides the necessary information for parsing certain custom entities, including ouml). The top of your file should look something like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp-2017-08-29.dtd">
The .dtd file specified on the second line should be located in the same directory as the dblp.xml file that you're attempting to parse. You can download the appropriate .dtd file for your XML file from here: http://dblp.org/xml/release/
$ ls
dblp-2017-08-29.dtd dblp-2018-11-01.xml
Also, given the size of dblp.xml, you may also want to use lxml.etree.iterparse to stream the contents of the file instead. Below is some of the code that I used to obtain entries for certain types of publication within the database.
import lxml.etree

fn = 'dblp.xml'
# load_dtd=True lets the parser resolve entities such as &ouml; from the DTD
for event, elem in lxml.etree.iterparse(fn, load_dtd=True):
    if elem.tag not in ['article', 'inproceedings', 'proceedings']:
        continue
    # find() returns the first matching child element, or None if there is none
    title = elem.find('title')
    year = elem.find('year')
    authors = elem.find('author')
    venue = elem.find('venue')
    ...
    elem.clear()  # release the memory held by the processed element
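To see why the DTD matters: XML itself predefines only five entities (amp, lt, gt, quot and apos), so ouml has to be declared somewhere before a parser can expand it. The sketch below uses the standard library's ElementTree rather than lxml, and declares the entity in an inline DOCTYPE instead of the external dblp .dtd file, purely to stay self-contained. The author name is made-up sample data.

```python
import xml.etree.ElementTree as ET

# No declaration for &ouml; anywhere: the parser cannot expand it.
undeclared = "<dblp><author>J&ouml;rg</author></dblp>"

# Same document, with the entity declared in an inline DOCTYPE (an assumption
# made for this example; dblp ships its declarations in a separate .dtd file).
declared = (
    '<!DOCTYPE dblp [<!ENTITY ouml "&#246;">]>'
    "<dblp><author>J&ouml;rg</author></dblp>"
)

try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

author = ET.fromstring(declared).findtext("author")
```

With the real dblp.xml the declarations live in the external .dtd, which is why lxml needs load_dtd=True and the file in the expected location.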

lxml: some XML from URL give this lxml.etree.XMLSyntaxError

I have a script which is supposed to extract some terms from XML files fetched from a list of URLs.
All the URLs give access to XML data.
It works fine at first, opening, parsing and extracting correctly, but it is then interrupted by some XML files with this error:
File "<stdin>", line 18, in <module>
File "lxml.etree.pyx", line 2953, in lxml.etree.parse (src/lxml/lxml.etree.c:56204)
File "parser.pxi", line 1555, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82511)
File "parser.pxi", line 1585, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:82832)
File "parser.pxi", line 1468, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:81688)
File "parser.pxi", line 1024, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:78735)
File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74696)
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
From my search, it might be because some XML files contain white space, but I'm not sure that is the problem. I can't tell which files give the error.
Is there a way to get around this error?
Here is my script:
URLlist = ["http://www.uniprot.org/uniprot/"+x+".xml" for x in IDlist]
for id, item in zip(IDlist, URLlist):
    goterm_location = []
    goterm_function = []
    goterm_process = []
    location_list[id] = []
    function_list[id] = []
    biological_list[id] = []
    try:
        textfile = urllib2.urlopen(item);
    except urllib2.HTTPError:
        print("URL", item, "could not be read.")
        continue
    #Try to solve empty line error#
    tree = etree.parse(textfile);
    #root = tree.getroot()
    for node in tree.iter('{http://uniprot.org/uniprot}dbReference'):
        if node.attrib.get('type') == 'GO':
            for child in node:
                value = child.attrib.get('value');
                if value.startswith('C:'):
                    goterm_C = node.attrib.get('id')
                    if goterm_C:
                        location_list[id].append(goterm_C);
                if value.startswith('F:'):
                    goterm_F = node.attrib.get('id')
                    if goterm_F:
                        function_list[id].append(goterm_F);
                if value.startswith('P:'):
                    goterm_P = node.attrib.get('id')
                    if goterm_P:
                        biological_list[id].append(goterm_P);
I have tried:
tree = etree.iterparse(textfile, events = ("start","end"));
OR
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(textfile, parser)
Without success.
Any help would be greatly appreciated
"I can't tell which files give the error"
Debug by printing the name of the file/URL prior to parsing. Then you'll see which file(s) cause the error.
Also, read the error message:
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
This suggests that the downloaded XML file is empty. Once you have determined the URL(s) that cause the problem, try downloading the file and checking its contents. I suspect it might be empty.
You can ignore problematic files (empty or otherwise syntactically invalid) by using a try/except block when parsing:
    try:
        tree = etree.parse(textfile)
    except etree.XMLSyntaxError:
        print('Skipping invalid XML from URL {}'.format(item))
        continue  # go on to the next URL
Or you could check just for empty files by checking the 'Content-length' header, or even by reading the resource returned by urlopen(), but I think that the above is better as it will also catch other potential errors.
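The two suggestions can also be combined: check for an empty body first, then fall back on the exception for anything else that fails to parse. This is only a sketch, assuming the response body has already been read into bytes. It uses the standard library's ElementTree, whose ParseError plays the role of lxml's XMLSyntaxError here. The helper name parse_or_none and the sample payloads are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_or_none(body):
    """Return the parsed root element, or None when body is empty or not well-formed XML."""
    if not body or not body.strip():
        return None  # empty download: nothing to parse
    try:
        return ET.fromstring(body)
    except ET.ParseError:
        return None  # malformed XML: the caller can log the URL and move on


# Stand-ins for downloaded response bodies.
downloads = {
    "good": b"<uniprot><entry/></uniprot>",
    "empty": b"",
    "broken": b"<uniprot><entry></uniprot>",
}

skipped = sorted(name for name, body in downloads.items() if parse_or_none(body) is None)
```

The result is compared with "is None" rather than tested for truthiness, because an element without children evaluates as false.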
I got the same error message in Python 3.6
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
In my case the XML file was not empty; the issue was the encoding.
Initially I used utf-8:
from lxml import etree
etree.iterparse('my_xml_file.xml', tag='MyTag', encoding='utf-8')
Changing the encoding to iso-8859-1 solved my issue:
etree.iterparse('my_xml_file.xml', tag='MyTag', encoding='iso-8859-1')

IOError passing requests Response.content to lxml.etree.parse() [duplicate]

I have the following xml on a webpage -
<entry>
<id>1750</id>
<title>variablename</title>
<source>
com.tidalsoft.webclient.tes.dsp.db.datatypes.Variable
</source>
<tes:variable>
<tes:ownername>ownergroup</tes:ownername>
<tes:productiondate>2015-08-17T00:00:00-0400</tes:productiondate>
<tes:readonly>N</tes:readonly>
<tes:publish>N</tes:publish>
<tes:description>
Decription Here
</tes:description>
<tes:startcalendar>0</tes:startcalendar>
<tes:ownerid>666</tes:ownerid>
<tes:type>1</tes:type>
<tes:lastusermodifiedtime>2015-06-15T15:42:27-0400</tes:lastusermodifiedtime>
<tes:innervalue>\\share\location</tes:innervalue>
<tes:calc>N</tes:calc>
<tes:name>variablename</tes:name>
<tes:startdate>1899-12-30T00:00:00-0500</tes:startdate>
<tes:pub>Y</tes:pub>
<tes:lastvalue>\\share\location</tes:lastvalue>
<tes:id>1750</tes:id>
<tes:startdateasstring>18991230000000</tes:startdateasstring>
<tes:lastchangetime>2015-06-15T15:42:27-0400</tes:lastchangetime>
<tes:clientcachelastchangetime>2015-08-17T09:56:49-0400</tes:clientcachelastchangetime>
</tes:variable>
</entry>
I'm trying to parse this data. I have a get through requests -
r = requests.get(url, auth=('username', 'password'))
but when I try to parse the content I get errors.
>>> xmlObject = etree.parse(r.content)
Traceback (most recent call last):
File "apiTest.py", line 46, in <module>
xmlObject = etree.parse(r.content)
File "lxml.etree.pyx", line 3310, in lxml.etree.parse (src\lxml\lxml.etree.c:72517)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:105979)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106278)
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:105277)
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:100227)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94350)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:95786)
File "parser.pxi", line 618, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:94818)
IOError: Error reading file ''
On the last line, what appears between the quotes is the XML shown at the beginning, as a string:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><entry xmlns="http://purl.org/atom/ns#"><id>1750</id><title>....
The data is being provided as content-type: text/xml
etree.parse expects a filename, a file-like object, or a URL as its first argument (see help(etree.parse)). It does not expect an XML string. To parse an XML string use
xmlObject = etree.fromstring(r.content)
Note that etree.fromstring returns an lxml.etree._Element. In contrast, etree.parse returns an lxml.etree._ElementTree. Given the _Element, you can obtain the _ElementTree with the getroottree method:
xmlTree = xmlObject.getroottree()
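If a tree object is wanted directly, the bytes can also be wrapped in a file-like object, since parse accepts one. Below is a sketch using the standard library's ElementTree, which follows the same parse/fromstring split as lxml.etree. The content variable stands in for r.content and is a trimmed version of the XML from the question.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for r.content (the bytes returned by requests).
content = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    b'<entry xmlns="http://purl.org/atom/ns#"><id>1750</id>'
    b"<title>variablename</title></entry>"
)

root = ET.fromstring(content)          # XML string/bytes -> root element
tree = ET.parse(io.BytesIO(content))   # file-like object -> element tree

# The document declares a default namespace, so lookups need a prefix mapping.
ns = {"atom": "http://purl.org/atom/ns#"}
entry_id = root.findtext("atom:id", namespaces=ns)
title = tree.getroot().findtext("atom:title", namespaces=ns)
```

Passing the raw bytes straight to parse fails because a plain string argument is treated as a filename or URL, which is what the IOError in the question reports.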
