Convert html to xml Python chilkat - python

Good morning. I am trying to convert HTML to XML using the chilkat library, but it throws this error:
import chilkat
a = "asd asd asd asd"
xml = a.toXml()
print(xml)
Traceback (most recent call last):
  File "C:\Users\acalobish\Desktop\iaa.py", line 4, in <module>
    xml = a.toXml()
AttributeError: 'str' object has no attribute 'toXml'

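A plain Python str has no toXml() method; the conversion methods belong to a chilkat.CkHtmlToXml object, so create one and call it:
import chilkat
htmlToXml = chilkat.CkHtmlToXml()
# Indicate the charset of the output XML we'll want.
htmlToXml.put_XmlCharset("utf-8")
# Convert the HTML file on disk and write the result to out.xml.
success = htmlToXml.ConvertFile("test.html", "out.xml")
if not success:
    print(htmlToXml.lastErrorText())
else:
    print("Success")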

Related

Namespace error python zeep

I'm trying to read this SOAP API: http://telemetriaws1.ana.gov.br/ServiceANA.asmx?WSDL.
When I try to call the DadosHidrometeorologicosGerais function, I get this error:
File "/home/1234/.local/share/virtualenvs/data_getter-1W9NAele/lib/python3.6/site-packages/zeep/xsd/schema.py", line 570, in _get_component
return items[qname]
KeyError: <lxml.etree.QName object at 0x7f2e93fa6d00>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
zeep.exceptions.LookupError: No element 'DocumentElement' in namespace None. Available elements are: NewDataSet
All other functions work fine.
Here is my code:
from zeep import Client
client = Client('http://telemetriaws1.ana.gov.br/ServiceANA.asmx?WSDL')
client.service.DadosHidrometeorologicosGerais(codEstacao='86450500', dataInicio='05/05/2018', dataFim='05/05/2018')
I'm not sure, but the XML seems to be generated from some data set. The "id" attribute does not match the identification of the element.
A possible solution is to skip zeep's parsing of the XML and return the raw response instead, then use the xml.etree.ElementTree module to parse it.
from zeep import Client, Settings
import xml.etree.ElementTree as ET
settings = Settings(force_https=False, raw_response=True)
WSDL = 'http://telemetriaws1.ana.gov.br/ServiceANA.asmx?WSDL'
client = Client(WSDL, settings=settings)
response = client.service.DadosHidrometeorologicosGerais(
    codEstacao='86450500',
    dataInicio='05/05/2018',
    dataFim='05/05/2018')
root = ET.fromstring(response.content)
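Once the raw response is available, its body is ordinary XML bytes. A minimal sketch of walking it with ElementTree follows. The payload and the element names (DataTable, DadosHidro, CodEstacao, Nivel) and the namespace URI are hypothetical stand-ins, since the real response layout is not shown here; SAMPLE plays the role of response.content.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for response.content (bytes); the real service
# may use different element names, namespaces, and nesting.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<DataTable xmlns="http://example.invalid/ns">
  <DocumentElement>
    <DadosHidro><CodEstacao>86450500</CodEstacao><Nivel>120</Nivel></DadosHidro>
    <DadosHidro><CodEstacao>86450500</CodEstacao><Nivel>118</Nivel></DadosHidro>
  </DocumentElement>
</DataTable>"""


def local_name(tag):
    """Drop a leading '{namespace}' from an ElementTree tag."""
    return tag.rsplit('}', 1)[-1]


def rows_from_xml(content, row_tag):
    """Return one dict per element whose local name equals row_tag."""
    root = ET.fromstring(content)
    rows = []
    for elem in root.iter():
        if local_name(elem.tag) == row_tag:
            rows.append({local_name(child.tag): child.text for child in elem})
    return rows


rows = rows_from_xml(SAMPLE, 'DadosHidro')
print(rows)
```

Comparing local names sidesteps namespace prefixes, which is convenient when the exact namespace of the response is not known in advance.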

NameError in function to retrieve JSON data

I'm using Python 3.6.1 and have the following code, which successfully retrieves data in JSON format:
import urllib.request,json,pprint
url = "https://someurl"
response = urllib.request.urlopen(url)
data = json.loads(response.read())
pprint.pprint(data)
I want to wrap this in a function so I can reuse it. This is what I have tried in a file called getdata.py:
from urllib.request import urlopen
import json
def get_json_data(url):
    response = urlopen(url)
    return json.loads(response.read())
This is the error I get after importing the file and attempting to print out the response:
>>> import getdata
>>> print(getdata.get_json_data("https://someurl"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Nick\getdata.py", line 6, in get_json_data
from urllib.request import urlopen
NameError: name 'urllib' is not defined
I also tried this and got the same error:
import urllib.request,json
def get_json_data(url):
    response = urllib.request.urlopen(url)
    return json.loads(response.read())
What do I need to do to get this to work, please?
Cheers
It's working now! I think the problem was the Hydrogen add-on I have for the Atom editor. I uninstalled it, tried again, and it worked. Thanks for looking.
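For reference, here is a sketch of the reusable helper as a standalone module. The data: URL is only a stand-in so the example runs without a network connection; it is an assumption, not the original endpoint. If an editor plugin or a long-running interpreter still holds an old copy of getdata, importlib.reload(getdata) reloads the module from the file on disk.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def get_json_data(url):
    """Fetch url and decode the response body as JSON."""
    with urlopen(url) as response:
        return json.loads(response.read().decode('utf-8'))


# Stand-in URL (assumption): a data: URL carrying a small JSON document,
# so no network access is needed to try the function.
demo_url = 'data:application/json,' + quote('{"answer": 42}')
result = get_json_data(demo_url)
print(result)
```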

renderContents in beautifulsoup (python)

The code I'm trying to get working is:
h = str(heading)
# '<h1>Heading</h1>'
heading.renderContents()
I get this error:
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
print h.renderContents()
AttributeError: 'str' object has no attribute 'renderContents'
Any ideas?
I have a string with HTML tags and I need to clean it. If there is a different way of doing that, please suggest it.
Your error message and your code sample don't line up. You say you're calling:
heading.renderContents()
But your error message says you're calling:
print h.renderContents()
Which suggests that perhaps you have a bug in your code, trying to call renderContents() on a string object that doesn't define that method.
In any case, it would help if you checked what type of object heading is to make sure it's really a BeautifulSoup instance. This works for me with BeautifulSoup 3.2.0:
from BeautifulSoup import BeautifulSoup
heading = BeautifulSoup('<h1>heading</h1>')
repr(heading)
# '<h1>heading</h1>'
print heading.renderContents()
# <h1>heading</h1>
print str(heading)
# '<h1>heading</h1>'
h = str(heading)
print h
# <h1>heading</h1>
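If the goal is only to clean the tags out of a string, one different way is the standard library's html.parser. The sketch below is for Python 3 and is not BeautifulSoup's API; TextExtractor and strip_tags are hypothetical names chosen for illustration. Note that it keeps every text node, so the contents of script or style elements would be included as well.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def strip_tags(html):
    """Return html with all tags removed and entities decoded."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


cleaned = strip_tags('<h1>Heading</h1><p>Some <b>bold</b> text &amp; more</p>')
print(cleaned)
```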

How can I get specific elements from XML data?

I have some code to retrieve XML data:
import cStringIO
import pycurl
from xml.etree import ElementTree
_API_KEY = 'my api key'
_ima = '/the/path/to/a/image'
sock = cStringIO.StringIO()
upl = pycurl.Curl()
values = [
("key", _API_KEY),
("image", (upl.FORM_FILE, _ima))]
upl.setopt(upl.URL, "http://api.imgur.com/2/upload.xml")
upl.setopt(upl.HTTPPOST, values)
upl.setopt(upl.WRITEFUNCTION, sock.write)
upl.perform()
upl.close()
xmldata = sock.getvalue()
#print xmldata
sock.close()
The resulting data looks like:
<?xml version="1.0" encoding="utf-8"?>
<upload><image><name></name><title></title><caption></caption><hash>dxPGi</hash><deletehash>kj2XOt4DC13juUW</deletehash><datetime>2011-06-10 02:59:26</datetime><type>image/png</type><animated>false</animated><width>1024</width><height>768</height><size>172863</size><views>0</views><bandwidth>0</bandwidth></image><links><original>http://i.stack.imgur.com/dxPGi.png</original><imgur_page>http://imgur.com/dxPGi</imgur_page><delete_page>http://imgur.com/delete/kj2XOt4DC13juUW</delete_page><small_square>http://i.stack.imgur.com/dxPGis.jpg</small_square><large_thumbnail>http://i.stack.imgur.com/dxPGil.jpg</large_thumbnail></links></upload>
Now, following this answer, I'm trying to get some specific values from the data.
This is my attempt:
tree = ElementTree.fromstring(xmldata)
url = tree.findtext('original')
webpage = tree.findtext('imgur_page')
delpage = tree.findtext('delete_page')
print 'Url: ' + str(url)
print 'Pagina: ' + str(webpage)
print 'Link de borrado: ' + str(delpage)
I get an AttributeError if I try to add the .text access:
Traceback (most recent call last):
File "<pyshell#28>", line 27, in <module>
url = tree.find('original').text
AttributeError: 'NoneType' object has no attribute 'text'
I couldn't find anything in Python's help for ElementTree about this attribute. How can I get only the text, not the object?
I found some info about getting a text string here; but when I try it I get a TypeError:
Traceback (most recent call last):
File "<pyshell#32>", line 34, in <module>
print 'Url: ' + url
TypeError: cannot concatenate 'str' and 'NoneType' objects
If I try to print 'Url: ' + str(url) instead, there is no error, but the result shows as None.
How can I get the url, webpage and delete_page data from this XML?
Your find() call is trying to find an immediate child of the top of the tree with a tag named original, not a tag at any lower level than that. Use:
url = tree.find('.//original').text
to search the whole tree for a tag named original; find() returns the first match, and findall() with the same path returns every match. The pattern matching rules for ElementTree's find() method are laid out in a table on this page: http://effbot.org/zone/element-xpath.htm
For // matching it says:
Selects all subelements, on all levels beneath the current element (search the entire subtree). For example, “.//egg” selects all “egg” elements in the entire tree.
Edit: here is some test code for you. It uses the XML sample string you posted; I just ran it through XML Tidy in TextMate to make it legible:
from xml.etree import ElementTree
xmldata = '''<?xml version="1.0" encoding="utf-8"?>
<upload>
  <image>
    <name/>
    <title/>
    <caption/>
    <hash>dxPGi</hash>
    <deletehash>kj2XOt4DC13juUW</deletehash>
    <datetime>2011-06-10 02:59:26</datetime>
    <type>image/png</type>
    <animated>false</animated>
    <width>1024</width>
    <height>768</height>
    <size>172863</size>
    <views>0</views>
    <bandwidth>0</bandwidth>
  </image>
  <links>
    <original>http://i.stack.imgur.com/dxPGi.png</original>
    <imgur_page>http://imgur.com/dxPGi</imgur_page>
    <delete_page>http://imgur.com/delete/kj2XOt4DC13juUW</delete_page>
    <small_square>http://i.stack.imgur.com/dxPGis.jpg</small_square>
    <large_thumbnail>http://i.stack.imgur.com/dxPGil.jpg</large_thumbnail>
  </links>
</upload>'''
tree = ElementTree.fromstring(xmldata)
print tree.find('.//original').text
On my machine (OS X running python 2.6.1) that produces:
Ian-Cs-MacBook-Pro:tmp ian$ python test.py
http://i.stack.imgur.com/dxPGi.png
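The difference between the lookups can be seen side by side in a short Python 3 sketch. It uses a trimmed copy of the XML from the question; the variable name direct is introduced here only for illustration.

```python
from xml.etree import ElementTree

XML = '''<upload>
  <image><hash>dxPGi</hash></image>
  <links>
    <original>http://i.stack.imgur.com/dxPGi.png</original>
    <imgur_page>http://imgur.com/dxPGi</imgur_page>
    <delete_page>http://imgur.com/delete/kj2XOt4DC13juUW</delete_page>
  </links>
</upload>'''

tree = ElementTree.fromstring(XML)

# A bare tag name only matches direct children of <upload>, so this is None.
direct = tree.find('original')

# './/' searches every level; find() returns the first match.
url = tree.find('.//original').text

# An explicit path works too, and findtext() returns the text directly.
webpage = tree.findtext('links/imgur_page')
delpage = tree.findtext('links/delete_page')

print(direct, url, webpage, delpage)
```

Either form avoids the 'NoneType' errors from the question, because the lookup now reaches the elements nested under links.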

pubDate RSS parsing weirdness with Beautifulsoup/Python

I'm trying to parse an RSS/podcast feed using BeautifulSoup, and everything works nicely except that I can't seem to parse the 'pubDate' field.
import urllib2
from BeautifulSoup import BeautifulStoneSoup

data = urllib2.urlopen("http://www.democracynow.org/podcast.xml")
dom = BeautifulStoneSoup(data, fromEncoding='utf-8')
items = dom.findAll('item')
for item in items:
    title = item.find('title').string.strip()
    pubDate = item.find('pubDate').string.strip()
The title gets parsed fine but when it gets to pubDate, it says:
Traceback (most recent call last):
File "", line 2, in
AttributeError: 'NoneType' object has no attribute 'string'
However, when I download a copy of the XML file and rename 'pubDate' to something else, then parse it again, it seems to work. Is pubDate a reserved variable or something in Python?
Thanks,
g
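It works with item.find('pubdate').string.strip(). BeautifulStoneSoup lowercases tag names while parsing, so the element is stored as pubdate; pubDate is not a reserved word in Python.
Why don't you use feedparser?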
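Another option, sketched here for Python 3 with only the standard library, is xml.etree.ElementTree, which keeps tag names case-sensitive. This is a swapped-in technique, not BeautifulSoup or feedparser, and the two-item FEED below is a hypothetical stand-in for the downloaded podcast XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed standing in for the downloaded podcast XML.
FEED = '''<rss version="2.0"><channel>
  <item><title> First show </title><pubDate>Mon, 06 Jun 2011 08:00:00 -0400</pubDate></item>
  <item><title>Second show</title><pubDate>Tue, 07 Jun 2011 08:00:00 -0400</pubDate></item>
</channel></rss>'''

root = ET.fromstring(FEED)
entries = []
for item in root.iter('item'):
    # ElementTree keeps tag names case-sensitive, so 'pubDate' matches as written.
    title = item.findtext('title', default='').strip()
    pub_date = item.findtext('pubDate', default='').strip()
    entries.append((title, pub_date))

print(entries)
```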
