Split XML in multiple files - python

I would like to split an XML file into multiple XML files. I'm trying this script, but I keep getting the following error:
Traceback (most recent call last):
File "F:\dokumenty\COVID-19\rozdel_export.py", line 4, in <module>
for event, elem in context:
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1227, in iterator
yield from pullparser.read_events()
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1302, in read_events
raise event
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1274, in feed
self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 4, column 5
Does anyone have an idea how to solve this?
import xml.etree.ElementTree as ET
context = ET.iterparse('export.xml', events=('end', ))
for event, elem in context:
    if elem.tag == 'row':
        title = elem.find('ID').text
        filename = format(title + ".xml")
        with open(filename, 'wb') as f:
            f.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
            f.write("<csv_data>\n")
            f.write(ET.tostring(elem))
            f.write("</csv_data>")

According to the error, the problem is that your "export.xml" is not well-formed XML.
Note that the standard library's xml.etree.ElementTree.iterparse has no recover option. lxml's iterparse does, and it tries hard to parse through broken input.
So you can try lxml with this option enabled and see if it works:
from lxml import etree
context = etree.iterparse('export.xml', events=('end', ), recover=True)
Keep in mind that, depending on the data, in most cases you probably do not want to ignore XML errors.
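Separately from the parse error, the script in the question opens each output file in binary mode ('wb') but writes str literals to it, which raises a TypeError in Python 3 once parsing succeeds. Below is a minimal sketch of one way to write each row to its own file, assuming the input is well-formed and every row has an ID child as in the question. The function name split_rows and the sample data are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def split_rows(source_path, out_dir, row_tag="row", id_tag="ID"):
    """Write every <row> element of source_path to its own file in out_dir."""
    written = []
    for _event, elem in ET.iterparse(source_path, events=("end",)):
        if elem.tag != row_tag:
            continue
        row_id = elem.findtext(id_tag)
        if row_id and row_id.strip():
            # Wrap the row in <csv_data>, as the original script intended.
            wrapper = ET.Element("csv_data")
            wrapper.append(elem)
            path = os.path.join(out_dir, row_id.strip() + ".xml")
            # ElementTree.write handles the declaration and the byte encoding.
            ET.ElementTree(wrapper).write(
                path, encoding="UTF-8", xml_declaration=True
            )
            written.append(path)
        elem.clear()  # release the row's children once it has been handled
    return written


# Hypothetical sample input, only for demonstration.
sample = (
    "<export>"
    "<row><ID>1</ID><name>a</name></row>"
    "<row><ID>2</ID><name>b</name></row>"
    "</export>"
)
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "export.xml")
    with open(src, "w", encoding="utf-8") as f:
        f.write(sample)
    files = split_rows(src, tmp)
    print([os.path.basename(p) for p in files])  # ['1.xml', '2.xml']
```

Rows without a usable ID are skipped here rather than raising. Whether that is the right policy depends on the data.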

Related

Parsing XML from String Imported from SQL Server

I have imported a query from SQL Server where the item is a stored XML script. It's being saved as a pyodbc item and I need to parse it as XML.
import pyodbc
import urllib
import xml.etree.ElementTree as ET
# Create connection
con = pyodbc.connect(driver="{SQL Server}",server="Server",database="Database")
cur = con.cursor()
db_cmd = "SELECT [XML] FROM [Database].[dbo].[Table] where ID = 1"
res = cur.execute(db_cmd)
for row in res.fetchall():
    print(row)
    tree = ET.ElementTree(ET.fromstring(str(row)))
I keep getting this error:
Traceback (most recent call last):
File "C:...", line 2961, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-46-1e4a5c1ba170>", line 15, in <module>
tree = ET.ElementTree(ET.fromstring(str(row)))
File "C:...", line 1315, in XML
parser.feed(text)
File "<string>", line unknown
ParseError: syntax error: line 1, column 0
I'm guessing there is an issue with the XML script but I don't know enough about XML to determine what the issue is. Here is an excerpt of the script:
('<response error_code="0"><xml_root><report xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<report_options>...;/html></html_root></response>', )
I tried saving it as a string as you can see from my code but I get the above error. How can I read in the value from SQL and save it as an XML item? The XML is redacted for privacy reasons but if more details are required, please let me know.

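I think you are accidentally passing the whole row object, rather than the string it contains, to ET.fromstring(str(row)). The string form of a row starts with "(", which is why the parser reports a syntax error at line 1, column 0.
Try:
tree = ET.ElementTree(ET.fromstring(str(row[0])))
According to the pyodbc docs, you may also be able to reference the column by name as an attribute, e.g. row.XML, rather than row[0].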
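The difference can be reproduced without a database. This is a small sketch that uses a plain tuple as a stand-in for the pyodbc row. The helper name parse_first_column and the shortened XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_first_column(row):
    """Parse the XML text held in the first column of a result row."""
    return ET.fromstring(row[0])


# A plain tuple standing in for a pyodbc row (an assumption for this demo).
row = ('<response error_code="0"><xml_root/></response>',)

try:
    # str(row) yields "('<response ...>',)", which is not XML at all.
    ET.fromstring(str(row))
except ET.ParseError as exc:
    print("str(row) fails:", exc)

root = parse_first_column(row)
print(root.tag, root.attrib["error_code"])  # response 0
```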

Your XML is not well-formed:
XML
<response error_code="0"><xml_root><report xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<report_options>...;/html></html_root></response>
Error
Parsing error: expected end of tag 'report_options' (line 3, column 20)

Python Post Request Response Xml Error load fromstring

I'm new to Python and have encountered something I'm not sure how to resolve. I'm sure it must be a simple fix, but I haven't found a solution and hope someone with more Python knowledge can help.
My request:
...
contacts = requests.post(url,data=readContactsXml,headers=headers);
#print (contacts.content) ;
outF = open("contact.xml", "wb")
outF.write(contacts.content)
outF.close();
All is fine with the above until I have to manipulate the data before saving it, e.g.:
...
contacts = requests.post(url,data=readContactsXml,headers=headers);
import xml.etree.ElementTree as ET
# contacts.encoding = 'utf-8'
parser = ET.XMLParser(encoding="UTF-8")
tree = ET.fromstring(contacts.content, parser=parser)
root = tree.getroot()
for item in root[0][0].findall('.//fields'):
    if item[0].text == 'maching-text-here':
        if not item[1].text:
            item[1].text = 'N/A'
        print(item[1].text)
#print (contacts.content) ;
outF = open("contact.xml", "wb")
outF.write(contacts.content)
outF.close();
In the above I am simply replacing an empty value with 'N/A'.
The error that I'm receiving is:
Traceback (most recent call last):
File "Desktop/PythonTests/test.py", line 107, in <module>
tree = ET.fromstring(contacts.content, parser=parser)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1311, in XML
parser.feed(text)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1659, in feed
self._raiseerror(v)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1523, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 192300
Looking around this column, I can see text with accented characters, e.g. Sinéd; the É seems to be the problem. When I simply save this XML file and open it in a browser, I get much the same error at nearly the same column (off by 2):
This page contains the following errors:
error on line 1 at column 192298: Encoding error
Below is a rendering of the page up to the first error.
What can I do with an XML response that contains such characters?
Any help is appreciated!

Found my answer after digging through Stack Overflow.
I changed:
tree = ET.fromstring(contacts.content, parser=parser)
to:
tree = ElementTree(fromstring(contacts.content))
(with ElementTree and fromstring imported from xml.etree.ElementTree).
REF: https://stackoverflow.com/questions/33962620/elementtree-returns-element-instead-of-elementtree/44483259#44483259
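One likely reason this helps, stated as an assumption since the response itself is not shown: without the explicit XMLParser(encoding="UTF-8"), the parser uses the encoding given in the document's own XML declaration instead of a forced UTF-8. Note also that the script in the question writes contacts.content, the original bytes, so the 'N/A' edits would never reach the file; the modified tree has to be serialized instead. Below is a small sketch with made-up sample data. The name fill_empty_values and the layout of the fields elements are hypothetical.

```python
import xml.etree.ElementTree as ET


def fill_empty_values(xml_bytes, placeholder="N/A"):
    """Replace empty second children of <fields> elements, return new bytes."""
    # Passing bytes lets the parser honour the declared encoding.
    root = ET.fromstring(xml_bytes)
    for field in root.iter("fields"):
        if len(field) >= 2 and not field[1].text:
            field[1].text = placeholder
    # Serialize the modified tree; the original bytes are left untouched.
    return ET.tostring(root, encoding="utf-8")


# Made-up response body, declared and encoded as ISO-8859-1.
sample = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<contacts>"
    b"<fields><name>first</name><value>Sin\xe9d</value></fields>"
    b"<fields><name>last</name><value/></fields>"
    b"</contacts>"
)
result = fill_empty_values(sample)
print(result.decode("utf-8"))
```

To save the result, write these bytes to a file opened in 'wb' mode rather than writing the original response content.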

Unable to Parse XML file in Python

I am trying to parse a large XML file (more than 50 MB) and am getting the following parsing error.
The file is attached for reference.
import xml.etree.cElementTree as ET
tree = ET.parse('input_file.xml')
error
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/etree/ElementTree.py", line 598, in parse
self._root = parser._parse_whole(source)
File "<string>", line unknown
ParseError: no element found: line 21, column 0

Your XML is not well-formed, so ElementTree cannot parse it. Please take a look at your XML file and check whether every element has a proper closing tag, whether it contains unescaped special characters, and so on.
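To see where the parser gives up, the position attribute of ParseError can be inspected. Here is a minimal sketch, assuming the document fits in memory as a string. The helper name find_xml_error and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def find_xml_error(text):
    """Return (line, column, source_line) of the first parse error, or None."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        line, column = exc.position  # line is 1-based
        lines = text.splitlines()
        # The reported line may lie past the end when the input is truncated.
        context = lines[line - 1] if 0 < line <= len(lines) else ""
        return line, column, context
    return None


print(find_xml_error("<a><b/></a>"))              # None: well-formed
print(find_xml_error("<a>\n<b></a>"))             # mismatched tag on line 2
print(find_xml_error("<root>\n<item>1</item>\n")) # root element never closed
```

A "no element found" error at the very end of the input, as in the question, usually means the document stops before the root element is closed.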

parsing xml file in python - no element found

I'm a python beginner.
I want to be able to pick values of certain elements in an xml sheet. Below is what my xml sheet looks like:
<TempFolder>D:\Mooniology\DiSecTemp\160708_M02091_0202_000000000-APC99</TempFolder>
<AnalysisFolder>D:\Mooniology\MiSeqAnalysis\160708_M0209831_0202_000000000-APC99</AnalysisFolder>
<RunStartDate>160708</RunStartDate>
<MostRecentWashType>PostRun</MostRecentWashType>
<RecipeFolder>D:\Mooniology\MiSeq Control Software\CustomRecipe</RecipeFolder>
<ILMNOnlyRecipeFolder>C:\Mooniology\MiSeq Control Software\Recipe</ILMNOnlyRecipeFolder>
<SampleSheetName>20160708 ALK Amplicon NGS cDNA synthesis kit comparison</SampleSheetName>
<SampleSheetFolder>Q:\GNO MiSeq\Jaya</SampleSheetFolder>
<ManifestFolder>Q:\GNO MiSeq</ManifestFolder>
<OutputFolder>\\rpbns4-lab\vol10\RMSdisect\160708_M02091_0202_000000000-APC99</OutputFolder>
<FocusMethod>AutoFocus</FocusMethod>
<SurfaceToScan>Both</SurfaceToScan>
<SaveFocusImages>true</SaveFocusImages>
<SaveScanImages>true</SaveScanImages>
By "picking values" I mean, for example, that I want the value of the element called TempFolder: the script should print D:\Mooniology\DiSecTemp\160708_M02091_0202_000000000-APC99
Below is the code I'm using to attempt to scan it:
#!/usr/bin/python2.7
import xml.etree.ElementTree as ET
tree = ET.parse('online.xml')
root = tree.getroot()
for child in root:
    print(child.tag, child.attrib)
Every time I run this code, no matter how I modify it (based on what I found on Google), the end result is always the following error:
Traceback (most recent call last):
File "./mindo.py", line 5, in <module>
tree = ET.parse('online.xml')
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1182, in parse
tree.parse(source, parser)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 657, in parse
self._root = parser.close()
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1654, in close
self._raiseerror(v)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: no element found: line 75, column 0
I suspect that the issue could be the XML file I'm using, but since I'm new to Python, I have to presume it's my code.

This is because the XML is not well-formed and therefore cannot be parsed:
In [4]: tree = ET.parse('online.xml')
...:
File "<string>", line unknown
ParseError: junk after document element: line 2, column 2
The XML needs to have a single root element, i.e.:
<params>
<TempFolder>D:\Mooniology\DiSecTemp\160708_M02091_0202_000000000-APC99</TempFolder>
<AnalysisFolder>D:\Mooniology\MiSeqAnalysis\160708_M0209831_0202_000000000-APC99</AnalysisFolder>
<RunStartDate>160708</RunStartDate>
<MostRecentWashType>PostRun</MostRecentWashType>
...
...
...
</params>
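If the file itself cannot be changed, the fragment can be wrapped in a root element in memory and the wanted value read with findtext. This is a minimal sketch, assuming the text contains no XML declaration of its own. The helper name parse_fragment is hypothetical, and the sample is shortened from the question.

```python
import xml.etree.ElementTree as ET


def parse_fragment(text, root_tag="params"):
    """Wrap sibling elements in a single root element and parse the result."""
    return ET.fromstring("<{0}>{1}</{0}>".format(root_tag, text))


# Shortened sample taken from the question; a raw string keeps the backslashes.
fragment = r"""<TempFolder>D:\Mooniology\DiSecTemp\160708_M02091_0202_000000000-APC99</TempFolder>
<RunStartDate>160708</RunStartDate>"""

root = parse_fragment(fragment)
print(root.findtext("TempFolder"))
# D:\Mooniology\DiSecTemp\160708_M02091_0202_000000000-APC99
```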

lxml: some XML from URL give this lxml.etree.XMLSyntaxError

I have a script which is supposed to extract some terms from XML files fetched from a list of URLs.
All the URLs give access to XML data.
It works fine at first, opening, parsing and extracting correctly, but is then interrupted by some XML files with this error:
File "<stdin>", line 18, in <module>
File "lxml.etree.pyx", line 2953, in lxml.etree.parse (src/lxml/lxml.etree.c:56204)
File "parser.pxi", line 1555, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82511)
File "parser.pxi", line 1585, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:82832)
File "parser.pxi", line 1468, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:81688)
File "parser.pxi", line 1024, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:78735)
File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74696)
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
From my search, it might be because some XML files contain whitespace, but I'm not sure that is the problem. I can't tell which files give the error.
Is there a way to get around this error?
Here is my script:
URLlist = ["http://www.uniprot.org/uniprot/"+x+".xml" for x in IDlist]
for id, item in zip(IDlist, URLlist):
    goterm_location = []
    goterm_function = []
    goterm_process = []
    location_list[id] = []
    function_list[id] = []
    biological_list[id] = []
    try:
        textfile = urllib2.urlopen(item);
    except urllib2.HTTPError:
        print("URL", item, "could not be read.")
        continue
    #Try to solve empty line error#
    tree = etree.parse(textfile);
    #root = tree.getroot()
    for node in tree.iter('{http://uniprot.org/uniprot}dbReference'):
        if node.attrib.get('type') == 'GO':
            for child in node:
                value = child.attrib.get('value');
                if value.startswith('C:'):
                    goterm_C = node.attrib.get('id')
                    if goterm_C:
                        location_list[id].append(goterm_C);
                if value.startswith('F:'):
                    goterm_F = node.attrib.get('id')
                    if goterm_F:
                        function_list[id].append(goterm_F);
                if value.startswith('P:'):
                    goterm_P = node.attrib.get('id')
                    if goterm_P:
                        biological_list[id].append(goterm_P);
I have tried:
tree = etree.iterparse(textfile, events = ("start","end"));
OR
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(textfile, parser)
Without success.
Any help would be greatly appreciated

I can't tell which files give the error
Debug by printing the name of the file/URL prior to parsing. Then you'll see which file(s) cause the error.
Also, read the error message:
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
This suggests that the downloaded XML file is empty. Once you have determined the URL(s) that cause the problem, try downloading the file and checking its contents. I suspect it might be empty.
You can ignore problematic files (empty or otherwise syntactically invalid) by using a try/except block when parsing:
    try:
        tree = etree.parse(textfile)
    except etree.XMLSyntaxError:
        print('Skipping invalid XML from URL {}'.format(item))
        continue # go on to the next URL
Or you could check just for empty files by checking the 'Content-length' header, or even by reading the resource returned by urlopen(), but I think that the above is better as it will also catch other potential errors.
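The skip-and-continue pattern can be sketched independently of the network. The example below uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies; with lxml the exception to catch would be etree.XMLSyntaxError. The function name parse_all and the sample payloads are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_all(documents):
    """Parse each payload; collect names of unparsable ones with a reason."""
    parsed, skipped = {}, []
    for name, payload in documents.items():
        if not payload.strip():
            # Empty body: nothing to parse, so record it and move on.
            skipped.append((name, "empty response"))
            continue
        try:
            parsed[name] = ET.fromstring(payload)
        except ET.ParseError as exc:
            skipped.append((name, str(exc)))
    return parsed, skipped


# Made-up payloads standing in for downloaded responses.
documents = {
    "good": b"<entry><dbReference type='GO' id='GO:1'/></entry>",
    "empty": b"",
    "broken": b"<entry><dbReference></entry>",
}
parsed, skipped = parse_all(documents)
print(sorted(parsed))              # ['good']
print([name for name, _ in skipped])  # ['empty', 'broken']
```

Recording the name next to the reason also answers the "which files give the error" part of the question.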

I got the same error message in Python 3.6:
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
In my case the XML file was not empty; the issue was the encoding.
Initially I used utf-8:
from lxml import etree
etree.iterparse('my_xml_file.xml', tag='MyTag', encoding='utf-8')
Changing the encoding to iso-8859-1 solved my issue:
etree.iterparse('my_xml_file.xml', tag='MyTag', encoding='iso-8859-1')
