lxml is not coercing int into a string before parsing it - python

I am trying to pull a Qualys vulnerability report ID from an API call with python. Essentially, the report ID is an int, and lxml can only parse strings. I have used my same code in the past to do this and it worked fine. I assumed lxml is smart enough to coerce the int into a string before parsing. Is there any way I can do this manually so I stop getting parse errors? Below is my code, output, and the traceback.
Code:
import requests
import time
import lxml
from lxml import etree
s = requests.Session()
s.headers.update({'X-Requested-With':'X'})
def login(s):
payload = {'action':'login', 'username':'X', 'password':'X'}
r = s.post('https://qualysapi.qualys.com/api/2.0/fo/session/',
data=payload)
def launchReport(s, polling_delay=250):
payload = {'action':'launch', 'template_id':'X',
'output_format':'xml', 'report_title':'X'}
r = s.post('https://qualysapi.qualys.com/api/2.0/fo/report/',
data=payload)
print r.text
extract_id = etree.XML(r).find('.//VALUE')
print("Report ID = %s" % extract_id)
time.sleep(polling_delay)
return extract_id
login(s)
launchReport(s)
Output:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE SIMPLE_RETURN SYSTEM
"https://qualysapi.qualys.com/api/2.0/simple_return.dtd">
<SIMPLE_RETURN>
<RESPONSE>
<DATETIME>2018-02-01T16:00:14Z</DATETIME>
<TEXT>New report launched</TEXT>
<ITEM_LIST>
<ITEM>
<KEY>ID</KEY>
<VALUE>16441920</VALUE>
</ITEM>
</ITEM_LIST>
</RESPONSE>
</SIMPLE_RETURN>
Traceback:
Traceback (most recent call last):
File "test.py", line 30, in <module>
launchReport(s)
File "test.py", line 22, in launchReport
extract_id = etree.XML(r).find('.//VALUE')
File "src/lxml/etree.pyx", line 3209, in lxml.etree.XML
(src/lxml/etree.c:80823)
File "src/lxml/parser.pxi", line 1870, in
lxml.etree._parseMemoryDocument (src/lxml/etree.c:121231)
ValueError: can only parse strings

You are attempting to parse the response object instead of the data in the response. Change etree.XML(r) to etree.XML(r.text).

Related

Python XML Iterparse halt on text

I am new to python, using 3.x, and am running into an issue with an XML file that I'm testing/learning on. When I look at the raw file (which is ASCII encoded btw), the issue (I'm pretty sure) is that there's a U+00A0 code in there.
The XML is as follows:
<?xml version="1.0" encoding="utf-8"?>
<XMLSetData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://www.clientsite.com/subdir/r2.4/v1">
<FileCreationDate>2018-05-05T11:35:44.1043858-05:00</FileCreationDate>
<XMLSetDataList>
<DataIDNumber>99345346</DataIDNumber>
<DataName>RSRS TVL5697 ULLĀ  Georgetown</DataName>
</XMLSetDataList>
</XMLSetData>
Using Notepad++, it shows me that the text has "xA0 " instead of " " (two spaces) between ULL and Georgetown. So when I do the code below:
import xml.etree.ElementTree as ET
events = ("end", "start-ns", "end-ns")
for event, elem in ET.iterparse(xml_file, events=events):
if event == "end":
eltag = elem.tag
eltext = elem.text
print( eltag, eltext)
It gives me an error stating:
File "C:\Users\d\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1222, in iterator
yield from pullparser.read_events()
File "C:\Users\d\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1297, in read_events
raise event
File "C:\Users\d\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1269, in feed
self._parser.feed(data)
File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 6, column 30
How do I fix this / get around it? If I remove the xA0 part, it parses fine, but obviously something like this may come up again, and I'd like to programmatically handle it.

Error with Python and XML

I'm getting an error when trying to grab a value from my XML. I get "Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration."
Here is my code:
import requests
import lxml.etree
from requests.auth import HTTPBasicAuth
r= requests.get("https://somelinkhere/folder/?parameter=abc", auth=HTTPBasicAuth('username', 'password'))
print r.text
root = lxml.etree.fromstring(r.text)
textelem = root.find("opensearch:totalResults")
print textelem.text
I'm getting this error:
Traceback (most recent call last):
File "tickets2.py", line 8, in <module>
root = lxml.etree.fromstring(r.text)
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:82934)
File "src/lxml/parser.pxi", line 1814, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:124471)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
Here is what the XML looks like, where I'm trying to grab the file in the last line.
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:apple-wallpapers="http://www.apple.com/ilife/wallpapers" xmlns:g-custom="http://base.google.com/cns/1.0" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:georss="http://www.georss.org/georss/" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:cc="http://web.resource.org/cc/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:g-core="http://base.google.com/ns/1.0">
<title>Feed from some link here</title>
<link rel="self" href="https://somelinkhere/folder/?parameter=abc" />
<link rel="first" href="https://somelinkhere/folder/?parameter=abc" />
<id>https://somelinkhere/folder/?parameter=abc</id>
<updated>2018-03-06T17:48:09Z</updated>
<dc:creator>company.com</dc:creator>
<dc:date>2018-03-06T17:48:09Z</dc:date>
<opensearch:totalResults>4</opensearch:totalResults>
I have tried various changes from links like https://twigstechtips.blogspot.com/2013/06/python-lxml-strings-with-encoding.html and http://makble.com/how-to-parse-xml-with-python-and-lxml but I keep running into the same error.
Instead of r.text, which guesses at the text encoding and decodes it, try using r.content which accesses the response body as bytes. (See http://docs.python-requests.org/en/latest/user/quickstart/#response-content.)
You could also use r.raw. See parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml) for more info.
Once that issue is fixed, you'll have the issue of the namespace. The element you're trying to find (opensearch:totalResults) has the prefix opensearch which is bound to the uri http://a9.com/-/spec/opensearch/1.1/.
You can find the element by combining the namespace uri and the local name (Clark notation):
{http://a9.com/-/spec/opensearch/1.1/}totalResults
See http://lxml.de/tutorial.html#namespaces for more info.
Here's an example with both changes implemented:
os = "{http://a9.com/-/spec/opensearch/1.1/}"
root = lxml.etree.fromstring(r.content)
textelem = root.find(os + "totalResults")
print textelem.text

How to remove all " \n" in xml payload by using lxml library

I'm trying to change a text value in xml file, and I need to return the updated xml content by using lxml library. I can able to successfully update the value, but the updated xml file contains "\n"(next line) character as below.
Output:
<?xml version='1.0' encoding='ASCII'?>\n<Order>\n <content>\n <sID>123</sID>\n <spNumber>UserTemp</spNumber>\n <client>ARRCHANA</client>\n <orderType>Dashboard</orderType>\n </content>\n
<content>\n <sID>111</sID>\n <spNumber>UserTemp</spNumber>\n <client>ARRCHANA</client>\n <orderType>Dashboard</orderType>\n </content>\n
</Order>
Note: I didn't format the above xml output, and posted it how exactly I get it from output console.
Input:
<Order>
<content>
<sID>123</sID>
<spNumber>UserTemp</spNumber>
<client>WALLMART</client>
<orderType>Dashboard</orderType>
</content>
<content>
<sID>111</sID>
<spNumber>UserTemp</spNumber>
<client>D&B</client>
<orderType>Dashboard</orderType>
</content>
</Order>
Also, I tried to remove the \n character in output xml file by using
getValue = getValue.replace('\n','')
but, no luck.
The below code I used to update the xml( tag), and tried to return the updated xml content back.
Python Code:
from lxml import etree
from io import StringIO
import six
import numpy
def getListOfNodes(location):
f = open(location)
xml = f.read()
f.close()
#print(xml)
getXml = etree.parse(location)
for elm in getXml.xpath('.//Order//content/client'):
index='ARRCHANA'
elm.text=index
#with open('C:\\New folder\\temp.xml','w',newline='\r\n') as writeFile:
#writeFile.write(str(etree.tostring(getXml,pretty_print=True, xml_declaration=True)))
getValue=str((etree.tostring(getXml,pretty_print=True, xml_declaration=True)))
#getValue = getValue.replace('\n','')
#getValue=getValue.replace("\n","<br/>")
print(getValue)
return getValue
When I'm trying to open the response payload through firefox browser, then It says the below error message:
XML Parsing Error: no element found Location:
file:///C:/New%20folder/Confidential.xml
Line Number 1, Column 1:
It says that "no element found location in Line Number 1, column 1" in xml file when it found "\n" character in it.
Can somebody assist me the better way to update the text value, and return it back without any additional characters.
It's fixed by myself by using the below script:
code = root.xpath('.//Order//content/client')
if code:
code[0].text = 'ARRCHANA'
etree.ElementTree(root).write('D:\test.xml', pretty_print=True)

python minidom: 'NoneType' object has no attribute 'data' from url

I'm trying to parse a XML with Python using minidom. When I'm parsing a xml file from my filesystem I haven't any probnlem.
doc = minidom.parse("PATH HERE")
etiquetaDia = doc.getElementsByTagName("dia")
for dia in etiquetaDia:
probPrecip = dia.getElementsByTagName("maxima")[0]
print(probPrecip.firstChild.data)
But when I try to parse a XML from a url with this code:
url = urllib2.urlopen('URL HERE')
doc = minidom.parse(url)
etiquetaDia = doc.getElementsByTagName("dia")
for dia in etiquetaDia:
probPrecip = dia.getElementsByTagName("maxima")[0]
print(probPrecip.firstChild.data)
I have an error message
Obviously it's the same XML in path and in url. Thanks
The urlopen function returns an HttpResponse object. You must first call the read() method of this object to get the actual content of the response, and pass that to minidom
minidom.parse(url.read())
Try the new urllib library instead like below.
It prints out Hello. Is that what you want?
from xml.dom import minidom
from urllib import request
url = request.urlopen('http://localhost:8000/sample.xml')
doc = minidom.parse(url)
etiquetaDia = doc.getElementsByTagName("dia")
for dia in etiquetaDia:
probPrecip = dia.getElementsByTagName("maxima")[0]
print(probPrecip.firstChild.data)
Sample XML
<?xml version="1.0" encoding="UTF-8"?>
<dia>
<maxima>Hello</maxima>
</dia>

Extracting the value of A tag by xpath in python

I have a simple python script like:
#!/usr/bin/python
import requests
from lxml import html
response = requests.get('http://site.ir/')
out=response.content
tree = html.fromstring(open(out).read())
print [e.text_content() for e in tree.xpath('//div[class="group"]/div[class="groupinfo"]/a/text()')]
I used xpath in order to get value of tag a as you can see from image below...
But the output sample is not what I expected.
UPDATE
I have also the following error:
Traceback (most recent call last):
File "p.py", line 7, in <module>
tree = html.fromstring(open(out).read())
IOError: [Errno 36] File name too long: '\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" ....
You need to put # at the beginning of attribute name to address an attribute in XPath :
//div[#class="group"]/div[#class="groupinfo"]/a/text()

Categories