Error with Python and XML - python

I'm getting an error when trying to grab a value from my XML. I get "Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration."
Here is my code:
import requests
import lxml.etree
from requests.auth import HTTPBasicAuth
r= requests.get("https://somelinkhere/folder/?parameter=abc", auth=HTTPBasicAuth('username', 'password'))
print r.text
root = lxml.etree.fromstring(r.text)
textelem = root.find("opensearch:totalResults")
print textelem.text
I'm getting this error:
Traceback (most recent call last):
File "tickets2.py", line 8, in <module>
root = lxml.etree.fromstring(r.text)
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:82934)
File "src/lxml/parser.pxi", line 1814, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:124471)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
Here is what the XML looks like, where I'm trying to grab the file in the last line.
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:apple-wallpapers="http://www.apple.com/ilife/wallpapers" xmlns:g-custom="http://base.google.com/cns/1.0" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:georss="http://www.georss.org/georss/" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:cc="http://web.resource.org/cc/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:g-core="http://base.google.com/ns/1.0">
<title>Feed from some link here</title>
<link rel="self" href="https://somelinkhere/folder/?parameter=abc" />
<link rel="first" href="https://somelinkhere/folder/?parameter=abc" />
<id>https://somelinkhere/folder/?parameter=abc</id>
<updated>2018-03-06T17:48:09Z</updated>
<dc:creator>company.com</dc:creator>
<dc:date>2018-03-06T17:48:09Z</dc:date>
<opensearch:totalResults>4</opensearch:totalResults>
I have tried various changes from links like https://twigstechtips.blogspot.com/2013/06/python-lxml-strings-with-encoding.html and http://makble.com/how-to-parse-xml-with-python-and-lxml but I keep running into the same error.

Instead of r.text, which guesses at the text encoding and decodes it, try using r.content which accesses the response body as bytes. (See http://docs.python-requests.org/en/latest/user/quickstart/#response-content.)
You could also use r.raw. See parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml) for more info.
Once that issue is fixed, you'll have the issue of the namespace. The element you're trying to find (opensearch:totalResults) has the prefix opensearch which is bound to the uri http://a9.com/-/spec/opensearch/1.1/.
You can find the element by combining the namespace uri and the local name (Clark notation):
{http://a9.com/-/spec/opensearch/1.1/}totalResults
See http://lxml.de/tutorial.html#namespaces for more info.
Here's an example with both changes implemented:
os = "{http://a9.com/-/spec/opensearch/1.1/}"
root = lxml.etree.fromstring(r.content)
textelem = root.find(os + "totalResults")
print textelem.text

Related

lxml is not coercing int into a string before parsing it

I am trying to pull a Qualys vulnerability report ID from an API call with python. Essentially, the report ID is an int, and lxml can only parse strings. I have used my same code in the past to do this and it worked fine. I assumed lxml is smart enough to coerce the int into a string before parsing. Is there any way I can do this manually so I stop getting parse errors? Below is my code, output, and the traceback.
Code:
import requests
import time
import lxml
from lxml import etree
s = requests.Session()
s.headers.update({'X-Requested-With':'X'})
def login(s):
payload = {'action':'login', 'username':'X', 'password':'X'}
r = s.post('https://qualysapi.qualys.com/api/2.0/fo/session/',
data=payload)
def launchReport(s, polling_delay=250):
payload = {'action':'launch', 'template_id':'X',
'output_format':'xml', 'report_title':'X'}
r = s.post('https://qualysapi.qualys.com/api/2.0/fo/report/',
data=payload)
print r.text
extract_id = etree.XML(r).find('.//VALUE')
print("Report ID = %s" % extract_id)
time.sleep(polling_delay)
return extract_id
login(s)
launchReport(s)
Output:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE SIMPLE_RETURN SYSTEM
"https://qualysapi.qualys.com/api/2.0/simple_return.dtd">
<SIMPLE_RETURN>
<RESPONSE>
<DATETIME>2018-02-01T16:00:14Z</DATETIME>
<TEXT>New report launched</TEXT>
<ITEM_LIST>
<ITEM>
<KEY>ID</KEY>
<VALUE>16441920</VALUE>
</ITEM>
</ITEM_LIST>
</RESPONSE>
</SIMPLE_RETURN>
Traceback:
Traceback (most recent call last):
File "test.py", line 30, in <module>
launchReport(s)
File "test.py", line 22, in launchReport
extract_id = etree.XML(r).find('.//VALUE')
File "src/lxml/etree.pyx", line 3209, in lxml.etree.XML
(src/lxml/etree.c:80823)
File "src/lxml/parser.pxi", line 1870, in
lxml.etree._parseMemoryDocument (src/lxml/etree.c:121231)
ValueError: can only parse strings
You are attempting to parse the response object instead of the data in the response. Change etree.XML(r) to etree.XML(r.text).

How to remove all " \n" in xml payload by using lxml library

I'm trying to change a text value in xml file, and I need to return the updated xml content by using lxml library. I can able to successfully update the value, but the updated xml file contains "\n"(next line) character as below.
Output:
<?xml version='1.0' encoding='ASCII'?>\n<Order>\n <content>\n <sID>123</sID>\n <spNumber>UserTemp</spNumber>\n <client>ARRCHANA</client>\n <orderType>Dashboard</orderType>\n </content>\n
<content>\n <sID>111</sID>\n <spNumber>UserTemp</spNumber>\n <client>ARRCHANA</client>\n <orderType>Dashboard</orderType>\n </content>\n
</Order>
Note: I didn't format the above xml output, and posted it how exactly I get it from output console.
Input:
<Order>
<content>
<sID>123</sID>
<spNumber>UserTemp</spNumber>
<client>WALLMART</client>
<orderType>Dashboard</orderType>
</content>
<content>
<sID>111</sID>
<spNumber>UserTemp</spNumber>
<client>D&B</client>
<orderType>Dashboard</orderType>
</content>
</Order>
Also, I tried to remove the \n character in output xml file by using
getValue = getValue.replace('\n','')
but, no luck.
The below code I used to update the xml( tag), and tried to return the updated xml content back.
Python Code:
from lxml import etree
from io import StringIO
import six
import numpy
def getListOfNodes(location):
f = open(location)
xml = f.read()
f.close()
#print(xml)
getXml = etree.parse(location)
for elm in getXml.xpath('.//Order//content/client'):
index='ARRCHANA'
elm.text=index
#with open('C:\\New folder\\temp.xml','w',newline='\r\n') as writeFile:
#writeFile.write(str(etree.tostring(getXml,pretty_print=True, xml_declaration=True)))
getValue=str((etree.tostring(getXml,pretty_print=True, xml_declaration=True)))
#getValue = getValue.replace('\n','')
#getValue=getValue.replace("\n","<br/>")
print(getValue)
return getValue
When I'm trying to open the response payload through firefox browser, then It says the below error message:
XML Parsing Error: no element found Location:
file:///C:/New%20folder/Confidential.xml
Line Number 1, Column 1:
It says that "no element found location in Line Number 1, column 1" in xml file when it found "\n" character in it.
Can somebody assist me the better way to update the text value, and return it back without any additional characters.
It's fixed by myself by using the below script:
code = root.xpath('.//Order//content/client')
if code:
code[0].text = 'ARRCHANA'
etree.ElementTree(root).write('D:\test.xml', pretty_print=True)

What is wrong with my xpath expression?

I want to extract all the links in td whose class is u-ctitle.
import os
import urllib
import lxml.html
down='http://v.163.com/special/opencourse/bianchengdaolun.html'
file=urllib.urlopen(down).read()
root=lxml.html.document_fromstring(file)
namelist=root.xpath('//td[#class="u-ctitle"]/a')
len(namelist)
The output is [],there are so many td whose classis "u-ctitle" ,with firebug you ca get, why can't extract it?
My python version is 2.7.9.
It is no use to change file into other name.
Your XPath is correct. The problem is unrelated.
If you examine HTML, you will see following meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=GBK" />
And in this code:
file=urllib.urlopen(down).read()
root=lxml.html.document_fromstring(file)
file is actually a bytes sequence, so decoding from GBK-encoded bytes to Unicode string is happening inside document_fromstring method.
The problem is, HTML encoding is not actually GBK and lxml decodes it incorrectly, leading to loss of data.
>>> file.decode('gbk')
Traceback (most recent call last):
File "down.py", line 9, in <module>
file.decode('gbk')
UnicodeDecodeError: 'gbk' codec can't decode bytes in position 7247-7248: illegal multibyte sequence
After some trial and error, we can find that actual encoding is GB_18030. To make script work, you need to decode bytes manually:
root=lxml.html.document_fromstring(file.decode('GB18030'))

pykml and utf8: either the input is incorrect, or I don't get it

I have a kmz file from the www and wish to read it into csv or such using pykml.
The file is in UTF8, or at least it claims to - see header below. Reading it works, but triggers an error when coming on the first accented character.
<?xml version='1.0' encoding='UTF-8'?>
<kml xmlns='http://www.opengis.net/kml/2.2'>
<Document>
<name>
from pykml import parser
with open(KMZFIL) as f:
folder=parser.parse(f).getroot().Document.Folder
for pm in folder.Placemark:
print(pm.name)
Ablitas (militar) (Emerg)
Ademuz (forestal)
Ager (PL%)
Alcala del Rio (ILIPA MAGNA)(Esc.)
Traceback (most recent call last):
File "bin4/b21_xxxxxxx", line 15, in <module>
print(pm.name)
grep "name" $INFIL | head -7
( ... )
<name>Ablitas (militar) (Emerg)</name>
<name>Ademuz (forestal)</name>
<name>Ager (PL%)</name>
<name>Alcala del Rio (ILIPA MAGNA)(Esc.)</name>
<name>Ainzón</name>
You need to open the file in a way that instructs Python to interpret the bytes as UTF-8 characters. In Python 2.7 you do it with the codecs module.
import codecs
with codecs.open(KMZFIL, encoding='utf-8') as f:
In Python 3 the encoding option has been added to the standard open so there's no need to use codecs.
Didn't see the answer here but these are lmxl StringElements -- I used .text to fix this error.
change print(pm.name) to print(pm.name.text)
https://lxml.de/api/lxml.objectify.StringElement-class.html

WordPress XML ParseError: unbound prefix?

I am trying to use Python's xml.etree.ElementTree.parse() function to parse an XML file I created by exporting all of the content from a WordPress blog. However, when I try like so:
import xml.etree.ElementTree as xml
tree = xml.parse('/path/to/file.xml')
I get the following error:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1183, in parse
tree.parse(source, parser)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
parser.feed(data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1643, in feed
self._raiseerror(v)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1507, in _raiseerror
raise err
ParseError: unbound prefix: line 189, column 1
Here's what's on line 189 of my XML file:
<atom:link rel="search" type="application/opensearchdescription+xml" href="http://blogname.wordpress.com/osd.xml" title="blog name" />
I've seen many questions about this error coming up with Android development, but I can't tell if and how that applies to my situation. Can anyone help with this?
Apologies to everyone for whom this was stupidly obvious, but it turns out I simply didn't have a namespace definition for "atom" in the document. I'm guessing that "unbound prefix" means that the prefix "atom" wasn't "bound" to a namespace definition?
Anyway, adding said definition has solved the problem. Although it makes me wonder why WordPress exports XML files without proper definitions for all of the namespaces they use...
If you remove all the Name Space, works absolutely fine.
Change
<s:home>USA</s:home>
to
<home>USA</home>
Just in case it helps someone some day, I was also working with a WordPress XML export (WordPress eXtended RSS) file in Python and was getting the same error. In my case, WordPress had included most of the correct namespace definitions. However, the XML had iTunes podcast information as well, and the iTunes namespace declaration was not present.
I fixed it by adding xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" into the RSS declaration block. So this:
<!-- generator="WordPress/4.9.8" created="2018-08-06 03:12" -->
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
>
became this:
<!-- generator="WordPress/4.9.8" created="2018-08-06 03:12" -->
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
>

Categories