Strip attributes / namespaces from SOAP XML - python

If I have several tags like this:
<ServiceId xsi:type="xsd:string">aval</ServiceId>
Is xsi:type="xsd:string" technically an attribute?
When I try this:
from StringIO import StringIO
from SOAPpy.wstools.Utility import DOM
badxml = '''<?xml version="1.0" encoding="utf-8"?>
<ServiceId xsi:type="xsd:string">aval</ServiceId>'''
document = DOM.loadDocument(StringIO(badxml))
orig_len = len(document.childNodes[0].toxml())
for node in document.childNodes:
    node.removeAttribute('xsi:type')
new_len = len(node.toxml())
diff = orig_len - new_len
print diff
...I get an error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/site-packages/SOAPpy/wstools/Utility.py", line 572, in loadDocument
return xml.dom.minidom.parse(data)
File "/usr/lib64/python2.6/site-packages/_xmlplus/dom/minidom.py", line 1915, in parse
return expatbuilder.parse(file)
File "/usr/lib64/python2.6/site-packages/_xmlplus/dom/expatbuilder.py", line 930, in parse
result = builder.parseFile(file)
File "/usr/lib64/python2.6/site-packages/_xmlplus/dom/expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: unbound prefix: line 2, column 9
I basically want to remove all attributes from large XML documents.

Yes, xsi:type="xsd:string" is an attribute; xsi is a namespace prefix. The "unbound prefix" error is raised because the snippet never declares that prefix (there is no xmlns:xsi="..." on the element or on an ancestor), so the parser rejects the document before your loop runs. You can use namespaces in your queries if you need them. Removing them can corrupt your results, for example when other XML elements share the same element name but live in a different namespace.
Have a look here:
Python ElementTree module: How to ignore the namespace of XML files to locate matching element when using the method "find", "findall"
Otherwise, what you are doing is a bit of a hack, and you might as well read the file as a string and do a mass regex replace on the namespace string you want to delete (not recommended).
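As a rough illustration of the points above, here is a minimal sketch using the standard library's ElementTree rather than SOAPpy. The Envelope wrapper element and the strip_attributes helper are hypothetical names added for the example; the wrapper exists only to declare the xsi and xsd prefixes so that the document parses.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element: it declares the prefixes the original snippet left unbound.
XML = '''<?xml version="1.0" encoding="utf-8"?>
<Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <ServiceId xsi:type="xsd:string">aval</ServiceId>
</Envelope>'''


def strip_attributes(root):
    """Remove every attribute from root and all of its descendants."""
    for element in root.iter():
        element.attrib.clear()
    return root


# Parse from bytes so the encoding declaration is handled by the parser.
root = strip_attributes(ET.fromstring(XML.encode("utf-8")))
result = ET.tostring(root, encoding="unicode")
print(result)
```

Element text is left untouched; only attributes are dropped, so any type information carried by xsi:type is lost.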

Related

Parsing XML from String Imported from SQL Server

I have imported a query from SQL Server where the item is a stored XML script. It's being saved as a pyodbc item and I need to parse it as XML.
import pyodbc
import urllib
import xml.etree.ElementTree as ET
# Create connection
con = pyodbc.connect(driver="{SQL Server}",server="Server",database="Database")
cur = con.cursor()
db_cmd = "SELECT [XML] FROM [Database].[dbo].[Table] where ID = 1"
res = cur.execute(db_cmd)
for row in res.fetchall():
    print(row)
    tree = ET.ElementTree(ET.fromstring(str(row)))
I keep getting this error:
Traceback (most recent call last):
File "C:...", line 2961, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-46-1e4a5c1ba170>", line 15, in <module>
tree = ET.ElementTree(ET.fromstring(str(row)))
File "C:...", line 1315, in XML
parser.feed(text)
File "<string>", line unknown
ParseError: syntax error: line 1, column 0
I'm guessing there is an issue with the XML script but I don't know enough about XML to determine what the issue is. Here is an excerpt of the script:
('<response error_code="0"><xml_root><report xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<report_options>...;/html></html_root></response>', )
I tried saving it as a string as you can see from my code but I get the above error. How can I read in the value from SQL and save it as an XML item? The XML is redacted for privacy reasons but if more details are required, please let me know.
I think you are accidentally passing a row object, rather than a string, to ET.fromstring(str(row)).
Try:
tree = ET.ElementTree(ET.fromstring(str(row[0])))
According to the pyodbc docs, you may also be able to reference the column by name as an attribute, e.g. row.XML, rather than row[0].
Your XML is not well-formed:
XML
<response error_code="0"><xml_root><report xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<report_options>...;/html></html_root></response>
Error
Parsing error: expected end of tag 'report_options' (line 3, column 20)
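To see why str(row) fails, here is a small sketch that uses a plain tuple as a stand-in for the pyodbc row (an assumption made so the example runs without a database). The XML string is a shortened, well-formed placeholder, not the redacted document from the question.

```python
import xml.etree.ElementTree as ET

# Stand-in for a pyodbc row: a one-element tuple holding the XML column.
row = ('<response error_code="0"><xml_root /></response>',)

# str(row) is the tuple's repr and starts with "(", which is not XML.
try:
    ET.fromstring(str(row))
    tuple_repr_parsed = True
except ET.ParseError:
    tuple_repr_parsed = False

# Indexing the row yields the actual XML string.
root = ET.fromstring(row[0])
error_code = root.get("error_code")
print(tuple_repr_parsed, error_code)
```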

Unable to Parse XML file in Python

I am trying to parse a large XML file (more than 50 MB) and I get the following parsing error.
File attached for reference.
import xml.etree.cElementTree as ET
tree = ET.parse('input_file.xml')
error
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/etree/ElementTree.py", line 598, in parse
self._root = parser._parse_whole(source)
File "<string>", line unknown
ParseError: no element found: line 21, column 0
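Your XML is not well-formed, so ElementTree cannot parse it. "no element found" usually means the parser reached the end of the input while it still expected more content, so check the file around the reported line for a missing closing tag, a truncated file, or stray special characters.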
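One way to narrow down where a file breaks is to catch the ParseError and read its position attribute. The sketch below is only illustrative: find_parse_error is a hypothetical helper, and a small in-memory truncated document stands in for the 50 MB file.

```python
import io
import xml.etree.ElementTree as ET


def find_parse_error(source):
    """Return (line, column) of the first parse error, or None if the XML parses."""
    try:
        ET.parse(source)
    except ET.ParseError as exc:
        return exc.position
    return None


# A truncated document: the root element is never closed.
truncated = io.BytesIO(b"<root>\n<item>1</item>\n")
position = find_parse_error(truncated)
print(position)
```

The same helper accepts a file name, e.g. find_parse_error('input_file.xml').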

parsing xml file in python - no element found

I'm a Python beginner.
I want to be able to pick the values of certain elements in an XML file. Below is what my XML file looks like:
<TempFolder>D:\Mooniology\DiSecTemp\160708_M02091_0202_000000000-APC99</TempFolder>
<AnalysisFolder>D:\Mooniology\MiSeqAnalysis\160708_M0209831_0202_000000000-APC99</AnalysisFolder>
<RunStartDate>160708</RunStartDate>
<MostRecentWashType>PostRun</MostRecentWashType>
<RecipeFolder>D:\Mooniology\MiSeq Control Software\CustomRecipe</RecipeFolder>
<ILMNOnlyRecipeFolder>C:\Mooniology\MiSeq Control Software\Recipe</ILMNOnlyRecipeFolder>
<SampleSheetName>20160708 ALK Amplicon NGS cDNA synthesis kit comparison</SampleSheetName>
<SampleSheetFolder>Q:\GNO MiSeq\Jaya</SampleSheetFolder>
<ManifestFolder>Q:\GNO MiSeq</ManifestFolder>
<OutputFolder>\\rpbns4-lab\vol10\RMSdisect\160708_M02091_0202_000000000-APC99</OutputFolder>
<FocusMethod>AutoFocus</FocusMethod>
<SurfaceToScan>Both</SurfaceToScan>
<SaveFocusImages>true</SaveFocusImages>
<SaveScanImages>true</SaveScanImages>
By "picking values" I mean, for example, that I want the value of the element called TempFolder, so the script should print D:\Mooniology\DiSecTemp\160708_M02091_0202_000000000-APC99
Below is the code I'm using to attempt to scan it:
#!/usr/bin/python2.7
import xml.etree.ElementTree as ET
tree = ET.parse('online.xml')
root = tree.getroot()
for child in root:
    print(child.tag, child.attrib)
Every time I run this code, no matter how I modify it (after researching on Google), the end result is always the following error:
Traceback (most recent call last):
File "./mindo.py", line 5, in <module>
tree = ET.parse('online.xml')
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1182, in parse
tree.parse(source, parser)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 657, in parse
self._root = parser.close()
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1654, in close
self._raiseerror(v)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: no element found: line 75, column 0
I suspected that the issue could be the XML file I'm using. But since I'm new to Python, I have to presume it's my code.
This is because the XML is not well-formed and therefore cannot be parsed:
In [4]: tree = ET.parse('online.xml')
...:
File "<string>", line unknown
ParseError: junk after document element: line 2, column 2
The XML needs to have a single root element, i.e.:
<params>
<TempFolder>D:\Mooniology\DiSecTemp\160708_M02091_0202_000000000-APC99</TempFolder>
<AnalysisFolder>D:\Mooniology\MiSeqAnalysis\160708_M0209831_0202_000000000-APC99</AnalysisFolder>
<RunStartDate>160708</RunStartDate>
<MostRecentWashType>PostRun</MostRecentWashType>
...
...
...
</params>
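If the file itself cannot be changed, a possible workaround is to wrap its contents in a root element in memory before parsing. This is a sketch under that assumption; the params name follows the answer above, and the two-element fragment is a shortened copy of the question's data.

```python
import xml.etree.ElementTree as ET

# Shortened fragment from the question: several top-level elements, no root.
fragment = r'''<TempFolder>D:\Mooniology\DiSecTemp\160708_M02091_0202_000000000-APC99</TempFolder>
<RunStartDate>160708</RunStartDate>'''

# Wrap the fragment in a single root element so the parser accepts it.
root = ET.fromstring("<params>" + fragment + "</params>")

# findtext returns the text of the first matching child element.
temp_folder = root.findtext("TempFolder")
print(temp_folder)
```

With a real file, the fragment would come from reading online.xml into a string first.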

BeautifulSoup (bs4), html5lib, HTMLParseError: malformed start tag, at line 1, column 11

I need to copy the source code from a website into an HTML file stored locally, as parsing from the URL directly does not capture all of the page elements. I am hoping to extract locational elements within a table in the source code to be used for geocoding. My program goes through several pages of search results, writing the source code from each to an HTML file stored locally. The address elements are only about a third of the material on each page, so it would be nice to get rid of the additional elements to reduce the file size.
To do this, I would like the program to open a blank HTML doc for writing, write the current page's source code to it, close the doc, reopen it for parsing (in 'r' mode now), open a new doc for writing, and use Beautiful Soup to capture all of the geocoding data from the first doc and write it to the new document. The program will then close the first doc and reopen it in 'w' mode again.
This will be done in a loop so the first doc will always get overwritten with the current page's source code while the second doc will stay open and keep having just the geocoding data written to it until there are no more pages.
Everything with looping, navigating and writing the source code to file is working fine, but I can't get the parsing part figured out. I tried experimenting in an interactive environment with this code:
from bs4 import BeautifulSoup
import html5lib
data = open(r"C:\GIS DataBase\web_resutls_raw_new_test.html",'r').read()
document = html5lib.parse(data)
soup = BeautifulSoup(str(document))
And I get the following error:
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
File "C:\Python27\lib\bs4\__init__.py", line 228, in __init__
self._feed()
File "C:\Python27\lib\bs4\__init__.py", line 289, in _feed
self.builder.feed(self.markup)
File "C:\Python27\lib\bs4\builder\_htmlparser.py", line 219, in feed
raise e
HTMLParseError: malformed start tag, at line 1, column 11
So I tried the following fix:
soup = HTMLParser.handle_starttag(BeautifulSoup(str(document)))
And alas:
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
File "C:\Python27\lib\bs4\__init__.py", line 228, in __init__
self._feed()
File "C:\Python27\lib\bs4\__init__.py", line 289, in _feed
self.builder.feed(self.markup)
File "C:\Python27\lib\bs4\builder\_htmlparser.py", line 219, in feed
raise e
HTMLParseError: malformed start tag, at line 1, column 11
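I also tried with lxml and etree, and nothing seems to work. I cannot get the elements I need by parsing from the URL directly. I need to parse from the HTML file.
html5lib.parse() returns a tree object, so str(document) is that object's repr rather than HTML, which is why the parser reports a malformed start tag. Pass data directly to BeautifulSoup instead: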
soup = BeautifulSoup(data,'html.parser')

How to parse an XML file with encoding declaration in Python?

I have this XML file, called xmltest.xml:
<?xml version="1.0" encoding="GBK"?>
<productMeta>
<bands>1,2,3,4</bands>
<imageName>TestName.tif</imageName>
<browseName>TestName.jpg</browseName>
</productMeta>
And I have this Python dummy code:
import xml.etree.ElementTree as ET
xmldoc = ET.parse('xmltest.xml')
But it raises a ValueError:
ValueError: multi-byte encodings are not supported
I understand this error: it is raised because of the encoding declaration in the first line of the XML file. The XML file is actually UTF-8 encoded but always has that declaration (I'm not the creator of the XML files to be analyzed). How can I ignore the encoding declaration when parsing an XML file like the one above?
One thing that I tried, and that worked for me, is to open the XML file as a file object and then pass its complete contents to ElementTree.fromstring().
Example -
>>> import xml.etree.ElementTree as ET
>>> ef = ET.parse('a.xml')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python34\lib\xml\etree\ElementTree.py", line 1187, in parse
tree.parse(source, parser)
File "C:\Python34\lib\xml\etree\ElementTree.py", line 598, in parse
self._root = parser._parse_whole(source)
ValueError: multi-byte encodings are not supported
>>> with open('a.xml','r') as f:
...     ef = ET.fromstring(f.read())
...
>>> ef
<Element 'productMeta' at 0x028DF180>
You can also create an XMLParser with the required encoding, which should let you parse the file using that encoding. Example -
import xml.etree.ElementTree as ET
xmlp = ET.XMLParser(encoding="utf-8")
f = ET.parse('a.xml',parser=xmlp)
ET.parse('a.xml', parser=ET.XMLParser(encoding='iso-8859-5'))
solved my problem when dealing with Excel XML files in Python.
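The parser-override approach can be tried without a file on disk. The sketch below builds UTF-8 bytes that carry the misleading GBK declaration from the question and parses them with an explicit encoding; the raw variable and the shortened document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# UTF-8 bytes whose declaration claims GBK, as in the question.
raw = (
    '<?xml version="1.0" encoding="GBK"?>\n'
    '<productMeta><bands>1,2,3,4</bands>'
    '<imageName>TestName.tif</imageName></productMeta>'
).encode("utf-8")

# An explicit encoding on the parser takes precedence over the declaration.
parser = ET.XMLParser(encoding="utf-8")
root = ET.fromstring(raw, parser=parser)

bands = root.findtext("bands").split(",")
print(bands)
```

This only gives correct text if the bytes really are UTF-8; for a file that is genuinely GBK-encoded, decoding it with Python's own codec before parsing is the safer route.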
