ElementTree XML parsing and urllib2.urlopen - python

I'm opening a URL using:
response = urllib2.urlopen(url, data, timeout=_TIMEOUT)
and using response.read(), it gives the following output:
<XMLlookup licenseid="X4X6X42" reason="OK" status="1" />
but when I want to parse it using ElementTree, like so:
print response.read()
t = ET.parse(response)
r = t.getroot()
print r.attrib.get('status')
gives me the following error message:
File "<string>", line 62, in parse
File "<string>", line 38, in parse
cElementTree.ParseError: no element found: line 1, column 0
but when I remove the line response.read(), the code works fine. What am I doing wrong?

You can only read once a response, because it is a file object like (in fact a addinfourl). Subsequent calls to read will return an empty string since you have always read the whole text.
So either you do not call read before using ET.parse(response), or you store the result in a string and use it for ET :
txt = response.read()
# do what you want with txt (without changing it)
t = ET.fromstring(txt)

You need to use:
t = ET.fromstring(response.read())

instead of
response.read()
t = ET.parse(response)
r = t.getroot()
try
resp = response.read()
t = ET.fromstring(resp)
r = t.getroot()
or
t = ET.fromstring(response.read())
r = t.getroot()
Also, you should note that not all HTML is parsable as XML. If your request returns XHTML then you will be fine, but otherwise you will get a very similar error to what you are seeing.

Related

How to locate an XML error in python given the line number and column number?

I am getting an error when I parse my xml. It gives a line and column number, but I am not sure how to go about locating it.
My code
urlBase = 'https://www.goodreads.com/review/list_rss/'
urlMiddle = '?shelf=read&order=d&sort=rating&per_page=200&page='
finalUrl = urlBase + str(32994) + urlMiddle +str(1)
resp = requests.get(finalUrl)
from xml.etree import ElementTree as ET
x = ET.fromstring(resp.content)
Error
File "<string>", line unknown
ParseError: not well-formed (invalid token): line 952, column 1023
I try to print the contents, but it's just one line
resp.content
The output is too big to print here.
So I'm not sure how to check a specific line since it's just one line.
You are trying to parse a HTML content with an XML parser. You may run into problem if the content is not XML-valid: if it is not XHTML.
Instead of that, you can use a HTML parser like the one available with lxml.
For instance
parser = etree.HTMLParser()
tree = etree.parse(BytesIO(resp.content), parser)
This will solve your issue.
Most likely you are on Windows and the print isn’t respecting e.g \n.
Try adding:
open(‘resp.xml’).write(resp.content)
After where you get resp
Then, you can open resp.xml in an editor and see what line 952 looks like.

Python xml.ElementTree - function to return parsed xml in variable to be used later

I have a funcion which sends get request and parse response to xml:
def get_object(object_name):
...
...
#parse xml file
encoded_text = response.text.encode('utf-8', 'replace')
root = ET.fromstring(encoded_text)
tree = ET.ElementTree(root)
return tree
Then I use this function to loop through object from list to get xmls and store them in variable:
jx_task_tree = ''
for jx in jx_tasks_lst:
jx_task_tree += str(get_object(jx))
I am not sure, if the function returns me data in correct format/form to use them later the way I need to.
When I want to parse variable jx_task_tree like this:
parser = ET.XMLParser(encoding="utf-8")
print(type(jx_task_tree))
tree = ET.parse(jx_task_tree, parser=parser)
print(ET.tostring(tree))
it throws me an error:
Traceback (most recent call last):
File "import_uac_wf.py", line 59, in <module>
tree = ET.parse(jx_task_tree, parser=parser)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1182, in
parse
tree.parse(source, parser)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 647, in parse
source = open(source, "rb")
IOError: [Errno 36] File name too long:
'<xml.etree.ElementTree.ElementTree
object at 0x7ff2607c8910>\n<xml.etree.ElementTree.ElementTree object at
0x7ff2607e23d0>\n<xml.etree.ElementTree.ElementTree object at
0x7ff2607ee4d0>\n<xml.etree.ElementTree.ElementTree object at
0x7ff2607d8e90>\n<xml.etree.ElementTree.ElementTree object at
0x7ff2607e2550>\n<xml.etree.ElementTree.ElementTree object at
0x7ff2607889d0>\n<xml.etree.ElementTree.ElementTree object at
0x7ff26079f3d0>\n'
Would anybody help me, what should function get_object() return and how to work with it later, so what's returned can be joined into one variable and parsed?
Regarding to your current exception:
According to [Python 3.Docs]: xml.etree.ElementTree.parse(source, parser=None) (emphasis is mine):
Parses an XML section into an element tree. source is a filename or file object containing XML data.
If you want to load the XML from a string, use ET.fromstring instead.
Then, as you suspected, the 2nd code snippet is completely wrong:
get_object(jx) returns an already parsed XML, so an ElementTree object
Calling str on it, will yield its textual representation (e.g. "<xml.etree.ElementTree.ElementTree object at 0x7ff26079f3d0>") which is not what you want
You could do something like:
jx_tasks_string = ""
for jx in jx_tasks_lst:
jx_tasks_string += ET.tostring(get_object(jx).getroot())
Since jx_tasks_string is the concatenation of some strings obtained from parsing some XML blobs, there's no reason to parse it again.

normally my api returns json, but sometimes returns a full response object?

All,
I have a script i have in place which fetches JSON off of a webserver. Simple as the following:
url = "foo.com/json"
response = requests.get(url).content
data = json.loads(response)
but i noticed is that sometimes instead of returning the JSON object, it will return what looks like a response dump. See here: https://pastebin.com/fUy5YMuY
What confuses me is to how to continue on.
Right now i took the above python and wrapped it
try:
url = "foo.com/json"
response = requests.get(url).content
data = json.loads(response)
except Exception as ex:
with open("test.txt", "w") as t:
t.write(response)
print("Error", sys.exc_info())
Is there a way to catch this? Right now I get a ValueError... and then reparse it? I was thinking to do something like:
except Exception as ex:
response = reparse(response)
but im still confused as to why it will sometimes return the JSON and other times, the header info + content.
def reparse(response):
"""
Catch the ValueError and attempt to reparse it for the json contnet
"""
Can i feed something like the pastebin dump into some sort of requests.Reponse class or similar?
Edit Here is the full stack trace I am getting.
File "scrape_people_by_fcc_docket.py", line 82, in main
json_data = get_page(limit, page*limit)
File "scrape_people_by_fcc_docket.py", line 13, in get_page
data = json.loads(response)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 369, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 2 column 1 - line 16 column 367717 (char 3 - 368222)
None
In the above code, the response variable is defined by:
response = requests.get(url).content
which is odd because most of the time, reponse will return a JSON object which is completely parsable.
Ideally, I have been trying to find a way to, when content isnt JSON, pass some how parse it for the actual content and then continue on.
instead of using .text or .content you can use the response method: .json() which so far seems to resolve my issues. I am doing continual testing and watching for errors and will update this as needed, but it seems that the json function will return the data i need without headers, and similarly already calls json.loads or similar to parse the information.

lxml.etree.XPathEvalError: Invalid expression

I am getting an error with Python that I am not able to understand. I have simplified my code to the very bare minimum:
response = requests.get('http://pycoders.com/archive')
tree = html.fromstring(response.text)
r = tree.xpath('//divass="campaign"]/a/#href')
print(r)
and still get the error
Traceback (most recent call last):
File "ultimate-1.py", line 17, in <module>
r = tree.xpath('//divass="campaign"]/a/#href')
File "lxml.etree.pyx", line 1509, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:50702)
File "xpath.pxi", line 318, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:145954)
File "xpath.pxi", line 238, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:144962)
File "xpath.pxi", line 224, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:144817)
lxml.etree.XPathEvalError: Invalid expression
Would anyone have an idea of where the issue is coming from? Might it be a dependencies problem? Thanks.
The expression '//divass="campaign"]/a/#href' is not syntactically correct and does not make much sense. Instead you meant to check the class attribute:
//div[#class="campaign"]/a/#href
Now, that would help to avoid the Invalid Expression error, but you would get nothing found by the expression. This is because the data is not there in the response that requests receives. You would need to mimic what the browser does to get the desired data and make an additional request to get the javascript file containing the campaigns.
Here is what works for me:
import ast
import re
import requests
from lxml import html
with requests.Session() as session:
# extract script url
response = session.get('http://pycoders.com/archive')
tree = html.fromstring(response.text)
script_url = tree.xpath("//script[contains(#src, 'generate-js')]/#src")[0]
# get the script
response = session.get(script_url)
data = ast.literal_eval(re.match(r'document.write\((.*?)\);$', response.content).group(1))
# extract the desired data
tree = html.fromstring(data)
campaigns = [item.attrib["href"].replace("\\", "") for item in tree.xpath('//div[#class="campaign"]/a')]
print(campaigns)
Prints:
['http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=3384ab2140',
...
'http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=8b91cb0481'
]
U was wrong in making xpath.
If u want to take all hrefs your xpath should be like
hrefs = tree.xpath('//div[#class="campaign"]/a')
for href in hrefs:
print(href.get('href'))
or in one line:
hrefs = [item.get('href') for item in tree.xpath('//div[#class="campaign"]/a')]

Python print regex in json string

So, I'm using a small python script to try to simply print out every occurrence of a website 'northwest.hall.' where the wildcard() is a number, in a very large json string pulled from a url.
I have this so far:
import urllib, json, re
url = 'http://graphite.website.com/render/?target=stats.web.northwest.hall.*&format=json'
response = urllib.urlopen(url)
data = json.loads(response.read())
code = re.findall('northwest', data)
print code
This should return a list of the 30 regexpressions of northwest.hall.number in the json string being parsed, but I get the following error instead:
Traceback (most recent call last):
File "/Users/arin/Desktop/scripts/code_parser2.py", line 7, in <module>
code = re.findall('community', data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 177, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
New to Python (sure you can tell).
Thanks in advance.
Use
data = response.read()
to get json string from server.
Using
data = json.loads(response.read())
you change this string into python dictionary.
EDIT:
import re
data = """
stats.web.northwest.hall.01
stats.web.northwest.hall.223
stats.web.northwest.hall.31
stats.web.northwest.hall.4
"""
print re.findall(r'stats.web.northwest.hall.(\d+)', data)
['01', '223', '31', '4']

Categories