Here's my project: I'm graphing weather data from WeatherBug using RRDTool. I need a simple, efficient way to download the weather data from WeatherBug. I was using a terribly inefficient bash-script scraper, then moved on to BeautifulSoup, but its performance is just too slow (this runs on a Raspberry Pi), so I need to use lxml.
What I have so far:
from lxml import etree
doc=etree.parse('weather.xml')
print doc.xpath("//aws:weather/aws:ob/aws:temp")
But I get an error message. Weather.xml is this:
<?xml version="1.0" encoding="UTF-8"?>
<aws:weather xmlns:aws="http://www.aws.com/aws">
<aws:api version="2.0"/>
<aws:WebURL>http://weather.weatherbug.com/PA/Tunkhannock-weather.html?ZCode=Z5546&amp;Units=0&amp;stat=TNKCN</aws:WebURL>
<aws:InputLocationURL>http://weather.weatherbug.com/PA/Tunkhannock-weather.html?ZCode=Z5546&amp;Units=0</aws:InputLocationURL>
<aws:ob>
<aws:ob-date>
<aws:year number="2013"/>
<aws:month number="1" text="January" abbrv="Jan"/>
<aws:day number="11" text="Friday" abbrv="Fri"/>
<aws:hour number="10" hour-24="22"/>
<aws:minute number="26"/>
<aws:second number="00"/>
<aws:am-pm abbrv="PM"/>
<aws:time-zone offset="-5" text="Eastern Standard Time (USA)" abbrv="EST"/>
</aws:ob-date>
<aws:requested-station-id/>
<aws:station-id>TNKCN</aws:station-id>
<aws:station>Tunkhannock HS</aws:station>
<aws:city-state zipcode="18657">Tunkhannock, PA</aws:city-state>
<aws:country>USA</aws:country>
<aws:latitude>41.5663871765137</aws:latitude>
<aws:longitude>-75.9794464111328</aws:longitude>
<aws:site-url>http://www.tasd.net/highschool/index.cfm</aws:site-url>
<aws:aux-temp units="°F">-100</aws:aux-temp>
<aws:aux-temp-rate units="°F">0</aws:aux-temp-rate>
<aws:current-condition icon="http://deskwx.weatherbug.com/images/Forecast/icons/cond013.gif">Cloudy</aws:current-condition>
<aws:dew-point units="°F">40</aws:dew-point>
<aws:elevation units="ft">886</aws:elevation>
<aws:feels-like units="°F">41</aws:feels-like>
<aws:gust-time>
<aws:year number="2013"/>
<aws:month number="1" text="January" abbrv="Jan"/>
<aws:day number="11" text="Friday" abbrv="Fri"/>
<aws:hour number="12" hour-24="12"/>
<aws:minute number="18"/>
<aws:second number="00"/>
<aws:am-pm abbrv="PM"/>
<aws:time-zone offset="-5" text="Eastern Standard Time (USA)" abbrv="EST"/>
</aws:gust-time>
<aws:gust-direction>NNW</aws:gust-direction>
<aws:gust-direction-degrees>323</aws:gust-direction-degrees>
<aws:gust-speed units="mph">17</aws:gust-speed>
<aws:humidity units="%">98</aws:humidity>
<aws:humidity-high units="%">100</aws:humidity-high>
<aws:humidity-low units="%">61</aws:humidity-low>
<aws:humidity-rate>3</aws:humidity-rate>
<aws:indoor-temp units="°F">77</aws:indoor-temp>
<aws:indoor-temp-rate units="°F">-1.1</aws:indoor-temp-rate>
<aws:light>0</aws:light>
<aws:light-rate>0</aws:light-rate>
<aws:moon-phase moon-phase-img="http://api.wxbug.net/images/moonphase/mphase01.gif">0</aws:moon-phase>
<aws:pressure units="&quot;">30.09</aws:pressure>
<aws:pressure-high units="&quot;">30.5</aws:pressure-high>
<aws:pressure-low units="&quot;">30.08</aws:pressure-low>
<aws:pressure-rate units="&quot;/h">-0.01</aws:pressure-rate>
<aws:rain-month units="&quot;">0.11</aws:rain-month>
<aws:rain-rate units="&quot;/h">0</aws:rain-rate>
<aws:rain-rate-max units="&quot;/h">0.12</aws:rain-rate-max>
<aws:rain-today units="&quot;">0.09</aws:rain-today>
<aws:rain-year units="&quot;">0.11</aws:rain-year>
<aws:temp units="°F">41</aws:temp>
<aws:temp-high units="°F">42</aws:temp-high>
<aws:temp-low units="°F">29</aws:temp-low>
<aws:temp-rate units="°F/h">-0.9</aws:temp-rate>
<aws:sunrise>
<aws:year number="2013"/>
<aws:month number="1" text="January" abbrv="Jan"/>
<aws:day number="11" text="Friday" abbrv="Fri"/>
<aws:hour number="7" hour-24="07"/>
<aws:minute number="29"/>
<aws:second number="53"/>
<aws:am-pm abbrv="AM"/>
<aws:time-zone offset="-5" text="Eastern Standard Time (USA)" abbrv="EST"/>
</aws:sunrise>
<aws:sunset>
<aws:year number="2013"/>
<aws:month number="1" text="January" abbrv="Jan"/>
<aws:day number="11" text="Friday" abbrv="Fri"/>
<aws:hour number="4" hour-24="16"/>
<aws:minute number="54"/>
<aws:second number="19"/>
<aws:am-pm abbrv="PM"/>
<aws:time-zone offset="-5" text="Eastern Standard Time (USA)" abbrv="EST"/>
</aws:sunset>
<aws:wet-bulb units="°F">40.802</aws:wet-bulb>
<aws:wind-speed units="mph">3</aws:wind-speed>
<aws:wind-speed-avg units="mph">1</aws:wind-speed-avg>
<aws:wind-direction>S</aws:wind-direction>
<aws:wind-direction-degrees>163</aws:wind-direction-degrees>
<aws:wind-direction-avg>SE</aws:wind-direction-avg>
</aws:ob>
</aws:weather>
I used http://www.xpathtester.com/test to test my xpath and it worked there. But I get the error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 2043, in lxml.etree._ElementTree.xpath (src/lxml/lxml.etree.c:47570)
File "xpath.pxi", line 376, in lxml.etree.XPathDocumentEvaluator.__call__ (src/lxml/lxml.etree.c:118247)
File "xpath.pxi", line 239, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:116911)
File "xpath.pxi", line 224, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:116728)
lxml.etree.XPathEvalError: Undefined namespace prefix
This is all very new to me -- Python, XML, and LXML. All I want is the observed time and the temperature.
Do my problems have anything to do with that aws: prefix in front of everything? What does that even mean?
Any help you can offer is greatly appreciated!
The problem has everything "to do with that aws: prefix in front of everything": it is a namespace prefix, and you have to tell lxml which namespace URI it maps to before using it in an XPath expression. This is easily done, as in:
print doc.xpath('//aws:weather/aws:ob/aws:temp',
namespaces={'aws': 'http://www.aws.com/aws'})[0].text
The need for this mapping from namespace prefixes to URIs is documented at http://lxml.de/xpathxslt.html.
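Putting it together, here is a minimal, self-contained sketch that pulls out just the observed time and temperature. The embedded XML is a trimmed stand-in for the real weather.xml; the element names and namespace come from the file shown above:

```python
from lxml import etree

NSMAP = {'aws': 'http://www.aws.com/aws'}

# Trimmed stand-in for weather.xml (assumption: the real file has the
# same structure; only the elements we need are kept here)
xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<aws:weather xmlns:aws="http://www.aws.com/aws">
  <aws:ob>
    <aws:ob-date>
      <aws:hour number="10" hour-24="22"/>
      <aws:minute number="26"/>
    </aws:ob-date>
    <aws:temp units="&#176;F">41</aws:temp>
  </aws:ob>
</aws:weather>"""

doc = etree.fromstring(xml)

# Every prefixed step in the path needs the namespaces= mapping
temp = doc.xpath('//aws:ob/aws:temp', namespaces=NSMAP)[0].text
hour = doc.xpath('//aws:ob-date/aws:hour/@hour-24', namespaces=NSMAP)[0]
minute = doc.xpath('//aws:ob-date/aws:minute/@number', namespaces=NSMAP)[0]

print(temp)                       # 41
print('%s:%s' % (hour, minute))   # 22:26
```

For the real file, swap `etree.fromstring(xml)` for `etree.parse('weather.xml')` and call `.xpath()` on the resulting tree the same way.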
Try something like this:
from lxml import etree
ns = etree.FunctionNamespace("http://www.aws.com/aws")
ns.prefix = "aws"
doc=etree.parse('weather.xml')
print doc.xpath("//aws:weather/aws:ob/aws:temp")[0].text
See this link: http://lxml.de/extensions.html
Related
I'm trying to parse an XML file that contains some vCards. I need to extract the info FN and NOTE (SIREN and A) and print them as a list of FN, SIREN_A pairs. I would also like to add them to the list only if the string in the description equals "diviseur".
I've tried different things (vobject, finditer) but none of them work. My parser uses the library xml.etree.ElementTree and pandas, which sometimes cause incompatibilities.
Python code:
import xml.etree.ElementTree as ET
import vobject

newlist = []
data = []
data.append(newlist)
diviseur = []

tree = ET.parse('test_oc.xml')
root = tree.getroot()
newlist = []

for lifeCycle in root.findall('{http://ltsc.ieee.org/xsd/LOM}lifeCycle'):
    for contribute in lifeCycle.findall('{http://ltsc.ieee.org/xsd/LOM}contribute'):
        for entity in contribute.findall('{http://ltsc.ieee.org/xsd/LOM}entity'):
            vcard = vobject.readOne(entity)
            siren = vcard.contents['note'].value, ":", vcard.contents['fn'].value
            print('siren', siren.text)
        for date in contribute.findall('{http://ltsc.ieee.org/xsd/LOM}date'):
            for description in date.findall('{http://ltsc.ieee.org/xsd/LOM}description'):
                entite = description.find('{http://ltsc.ieee.org/xsd/LOM}string')
                print('Type entité:', entite.text)
                newlist.append(entite)

j = 0
for j in range(len(entite) - 1):
    if entite[j] == "diviseur":
        diviseur.append(siren[j])
        print('diviseur:', diviseur)
        newlist.append(diviseur)

data.append(newlist)
print(data)
xml file to parse:
<?xml version="1.0" encoding="UTF-8"?>
<lom:lom xmlns:lom="http://ltsc.ieee.org/xsd/LOM" xmlns:lomfr="http://www.lom-fr.fr/xsd/LOMFR" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://ltsc.ieee.org/xsd/LOM">
<lom:version uniqueElementName="version">
<lom:string language="http://id.loc.gov/vocabulary/iso639-2/fre">V4.1</lom:string>
</lom:version>
<lom:lifeCycle uniqueElementName="lifeCycle">
<lom:contribute>
<lom:entity><![CDATA[
BEGIN:VCARD
VERSION:4.0
FN:Cailler
N:;Valérie;;Mr;
ORG:Veoli
NOTE:SIREN=203025106
NOTE :ISNI=0000000000000000
END:VCARD
]]></lom:entity>
<lom:date uniqueElementName="date">
<lom:dateTime uniqueElementName="dateTime">2019-07-10</lom:dateTime>
<lom:description uniqueElementName="description">
<lom:string>departure</lom:string>
</lom:description>
</lom:date>
</lom:contribute>
<lom:contribute>
<lom:entity><![CDATA[
BEGIN:VCARD
VERSION:4.0
FN:Besnard
N:;Ugo;;Mr;
ORG:MG
NOTE:SIREN=501 025 205
NOTE :A=0000 0000
END:VCARD
]]></lom:entity>
<lom:date uniqueElementName="date">
<lom:dateTime uniqueElementName="dateTime">2019-07-10</lom:dateTime>
<lom:description uniqueElementName="description">
<lom:string>diviseur</lom:string>
</lom:description>
</lom:date>
</lom:contribute>
</lom:lifeCycle>
</lom:lom>
Traceback (most recent call last):
File "parser_export_csv_V2.py", line 73, in
vcard = vobject.readOne(entity)
File "C:\Users\b\AppData\Local\Programs\Python\Python36-32\lib\site-packages\vobject\base.py", line 1156, in readOne
allowQP))
File "C:\Users\b\AppData\Local\Programs\Python\Python36-32\lib\site-packages\vobject\base.py", line 1089, in readComponents
for line, n in getLogicalLines(stream, allowQP):
File "C:\Users\b\AppData\Local\Programs\Python\Python36-32\lib\site-packages\vobject\base.py", line 869, in getLogicalLines
val = fp.read(-1)
AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'read'
There are a few problems here.
entity is an Element instance, and vCard is a plain text data format. vobject.readOne() expects text.
There is unwanted whitespace adjacent to the vCard properties in the XML file.
NOTE :ISNI=0000000000000000 is invalid; it should be NOTE:ISNI=0000000000000000 (space removed).
vcard.contents['note'] is a list and does not have a value property.
Here is code that probably doesn't produce exactly what you want, but I hope it helps:
import xml.etree.ElementTree as ET
import vobject

NS = {"lom": "http://ltsc.ieee.org/xsd/LOM"}

tree = ET.parse('test_oc.xml')

for contribute in tree.findall('.//lom:contribute', NS):
    desc_string = contribute.find('.//lom:string', NS)
    print(desc_string.text)
    entity = contribute.find('lom:entity', NS)
    txt = entity.text.replace(" ", "")  # Text with spaces removed
    vcard = vobject.readOne(txt)
    for p in vcard.contents["note"]:
        print(p.name, p.value)
    for p in vcard.contents["fn"]:
        print(p.name, p.value)
    print()
Output:
departure
NOTE SIREN=203025106
NOTE ISNI=0000000000000000
FN Cailler
diviseur
NOTE SIREN=501025205
NOTE A=00000000
FN Besnard
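If vobject turns out to be unavailable or awkward, the same extraction can be sketched with only the standard library, since the vCard properties in this file are simple `NAME:value` lines. This is a sketch under the assumption that the real file keeps the structure of the sample; the embedded XML is a trimmed stand-in for test_oc.xml:

```python
import xml.etree.ElementTree as ET

NS = {"lom": "http://ltsc.ieee.org/xsd/LOM"}

# Trimmed stand-in for test_oc.xml (assumption: same structure as the sample)
xml = """<lom:lom xmlns:lom="http://ltsc.ieee.org/xsd/LOM">
<lom:lifeCycle><lom:contribute>
<lom:entity>
BEGIN:VCARD
FN:Besnard
NOTE:SIREN=501 025 205
END:VCARD
</lom:entity>
<lom:date><lom:description><lom:string>diviseur</lom:string></lom:description></lom:date>
</lom:contribute></lom:lifeCycle></lom:lom>"""

root = ET.fromstring(xml)
diviseurs = []
for contribute in root.findall('.//lom:contribute', NS):
    role = contribute.find('.//lom:description/lom:string', NS).text
    fn = siren = None
    # Parse the vCard text line by line instead of with vobject
    for line in contribute.find('lom:entity', NS).text.splitlines():
        line = line.strip()
        if line.startswith('FN:'):
            fn = line[3:]
        elif line.startswith('NOTE:SIREN='):
            siren = line.split('=', 1)[1].replace(' ', '')
    if role == 'diviseur':
        diviseurs.append((fn, siren))

print(diviseurs)  # [('Besnard', '501025205')]
```

The same loop structure drops straight onto `ET.parse('test_oc.xml').getroot()` for the real file.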
I am currently learning about XML and DTDs and I came across a DTD which was a bit puzzling.
<!ELEMENT foo (superpowers*)>
<!ELEMENT superpowers ( foo | agility )>
Firstly, is this DTD legal?
Whenever I try to generate a corresponding XML file I get the following error:
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    foo.append(superpowers)
  File "src/lxml/etree.pyx", line 832, in lxml.etree._Element.append
  File "src/lxml/apihelpers.pxi", line 1283, in lxml.etree._appendChild
ValueError: cannot append parent to itself
The code I am using for XML generation in Python is represented by this pseudocode.
from lxml import etree as xml
import pprint

foo = xml.Element("foo")
superpowers = xml.Element("superpowers")
x = True
if x:
    foo = xml.Element("foo")
    superpowers.append(foo)
    foo.append(superpowers)
else:
    agility = xml.Element("agility")
    superpowers.append(agility)
    foo.append(superpowers)
tree = xml.ElementTree(foo)
print(xml.tostring(tree, pretty_print=True))
with open("foo.xml", "wb") as op:
    tree.write(op, pretty_print=True)  # pretty_print used for indentation
Have I overlooked anything?
<!ELEMENT foo (superpowers*)>
<!ELEMENT superpowers ( foo | agility )>
Firstly, is this DTD legal?
Yes, that partial DTD is legal. (I say partial because there isn't an element declaration for agility.)
It's saying that foo can contain zero or more superpowers and superpowers must contain exactly one foo or agility.
For example, this would be valid according to that DTD...
<foo>
<superpowers>
<foo/>
</superpowers>
<superpowers>
<foo>
<superpowers>
<foo/>
</superpowers>
</foo>
</superpowers>
</foo>
The error you're getting makes sense; you can't foo.append(superpowers) because superpowers is already the parent of foo. (Like Harry Dunne says, "You can't triple stamp a double stamp Lloyd!".)
What you would need to do is create a brand new foo and append superpowers to that.
Example...
if x:
    foo = xml.Element("foo")
    superpowers.append(foo)
    foo2 = xml.Element("foo")
    foo2.append(superpowers)
and what you'd end up with is (comments added to try to help clarify)...
<foo><!--foo2-->
<superpowers>
<foo><!--original foo--></foo>
</superpowers>
</foo>
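A complete, runnable version of that fix looks like the following. This is a sketch of the no-reparenting approach described above, not the only way to build the tree; the variable names `inner_foo`/`outer_foo` are chosen here for clarity:

```python
from lxml import etree as xml

# Build <foo><superpowers><foo/></superpowers></foo> without re-parenting:
# each element can have only one parent, so the outer foo must be a
# brand-new Element rather than the one already inside superpowers.
inner_foo = xml.Element("foo")
superpowers = xml.Element("superpowers")
superpowers.append(inner_foo)   # inner foo now belongs to superpowers

outer_foo = xml.Element("foo")  # fresh element, distinct from inner_foo
outer_foo.append(superpowers)   # legal: superpowers had no parent yet

print(xml.tostring(outer_foo).decode())
# <foo><superpowers><foo/></superpowers></foo>
```

Deeper nesting per the DTD just repeats the same pattern: wrap the current `foo` in a new `superpowers`, then wrap that in a new `foo`.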
I have been trying to use minidom but have no real preference. For some reason lxml will not install on my machine.
I would like to parse an xml file:
<?xml version="1.0" encoding="UTF-8"?>
<transfer frmt="1" vtl="0" serial_number="E5XX-0822" date="2016-10-03 16:34:53.000" style="startstop">
<plateInfo>
<plate barcode="E0122326" name="384plate" type="source"/>
<plate barcode="A1234516" name="1536plateD" type="destination"/>
</plateInfo>
<printmap total="1387">
<w reason="" cf="13" aa="1.779" eo="299.798" tof="32.357" sv="1565.311" ct="1.627" ft="1.649" fc="88.226" memt="0.877" fldu="Percent" fld="DMSO" dy="0" dx="0" region="-1" tz="18989.481" gy="72468.649" gx="55070.768" avt="50" vt="50" vl="3.68" cvl="3.63" t="16:30:47.703" dc="0" dr="0" dn="A1" c="0" r="0" n="A1"/>
<w reason="" cf="13" aa="1.779" eo="299.798" tof="32.357" sv="1565.311" ct="1.627" ft="1.649" fc="88.226" memt="0.877" fldu="Percent" fld="DMSO" dy="0" dx="0" region="-1" tz="18989.481" gy="72468.649" gx="55070.768" avt="50" vt="50" vl="3.68" cvl="3.63" t="16:30:47.703" dc="0" dr="0" dn="A1" c="1" r="0" n="A2"/>
</printmap>
</transfer>
As you can see, the files do not have any element text content; all the information is contained in the attributes. In trying to adapt another SO post, I have the code below, but it seems to be geared more toward elements. I am also failing to find a good way to "browse" the XML information, i.e. I would like to say dir(xml_file) and get a list of all the methods I can carry out on my tree structure, or see all the attributes. I know this was a lot and potentially different directions, but thank you in advance!
def parse(files):
    for xml_file in files:
        xmldoc = minidom.parse(xml_file)
        transfer = xmldoc.getElementsByTagName('transfer')[0]
        plateInfo = transfer.getElementsByTagName('plateInfo')[0]
With minidom you can access the attributes of a particular element through its attributes property, which can then be treated as a dictionary. This example iterates over and prints the attributes of the element transfer[0]:
from xml.dom.minidom import parse, parseString
xml_file='''<?xml version="1.0" encoding="UTF-8"?>
<transfer frmt="1" vtl="0" serial_number="E5XX-0822" date="2016-10-03 16:34:53.000" style="startstop">
<plateInfo>
<plate barcode="E0122326" name="384plate" type="source"/>
<plate barcode="A1234516" name="1536plateD" type="destination"/>
</plateInfo>
<printmap total="1387">
<w reason="" cf="13" aa="1.779" eo="299.798" tof="32.357" sv="1565.311" ct="1.627" ft="1.649" fc="88.226" memt="0.877" fldu="Percent" fld="DMSO" dy="0" dx="0" region="-1" tz="18989.481" gy="72468.649" gx="55070.768" avt="50" vt="50" vl="3.68" cvl="3.63" t="16:30:47.703" dc="0" dr="0" dn="A1" c="0" r="0" n="A1"/>
<w reason="" cf="13" aa="1.779" eo="299.798" tof="32.357" sv="1565.311" ct="1.627" ft="1.649" fc="88.226" memt="0.877" fldu="Percent" fld="DMSO" dy="0" dx="0" region="-1" tz="18989.481" gy="72468.649" gx="55070.768" avt="50" vt="50" vl="3.68" cvl="3.63" t="16:30:47.703" dc="0" dr="0" dn="A1" c="1" r="0" n="A2"/>
</printmap>
</transfer>'''
xmldoc = parseString(xml_file)
transfer = xmldoc.getElementsByTagName('transfer')
attlist = transfer[0].attributes.keys()
for a in attlist:
    print transfer[0].attributes[a].name, transfer[0].attributes[a].value
you can find more information here:
http://www.diveintopython.net/xml_processing/attributes.html
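The same pattern reaches the `w` elements, which is where your per-well data lives. A minimal sketch (the embedded XML is a trimmed stand-in for your transfer file, and `rows` is just an illustrative name):

```python
from xml.dom.minidom import parseString

# Trimmed stand-in for the real transfer file (assumption: same shape)
xml_file = '''<transfer serial_number="E5XX-0822">
  <printmap total="2">
    <w vl="3.68" n="A1"/>
    <w vl="3.68" n="A2"/>
  </printmap>
</transfer>'''

doc = parseString(xml_file)

# attributes.items() yields (name, value) pairs, so each attribute map
# converts cleanly to an ordinary dictionary
rows = [dict(w.attributes.items()) for w in doc.getElementsByTagName('w')]
for row in rows:
    print(row['n'], row['vl'])
```

As for "browsing" the structure: `dir(doc)` and `dir(doc.documentElement)` do list every method available on the document and element nodes, which is the exploration you were asking about.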
I have this function to read HTML files saved on the computer:
def get_doc_ondrive(self, mypath):
    the_file = open(mypath, "r")
    line = the_file.readline()
    if (line != "") and (line != None):
        self.soup = BeautifulSoup(line)
    else:
        print "Something is wrong with line:\n\n%r\n\n" % line
        quit()
    print "\t\t------------ line: %r ---------------\n" % line
    while line != "":
        line = the_file.readline()
        print "\t\t------------ line: %r ---------------\n" % line
        if (line != "") and (line != None):
            print "\t\t\tinner if executes: line: %r\n" % line
            self.soup.feed(line)
    self.get_word_vector()
    self.has_doc = True
Doing self.soup = BeautifulSoup(open(mypath,"r")) returns None, but feeding it line by line at least crashes and gives me something to look at.
I edited the functions listed by the traceback in BeautifulSoup.py and sgmllib.py
When I try to run this, I get:
me#GIGABYTE-SERVER:code$ python test_docs.py
in sgml.finish_endtag
in _feed: inDocumentEncoding: None, fromEncoding: None, smartQuotesTo: 'html'
in UnicodeDammit.__init__: markup: '<!DOCTYPE html>\n'
in UnicodeDammit._detectEncoding: xml_data: '<!DOCTYPE html>\n'
in sgmlparser.feed: rawdata: '', data: u'<!DOCTYPE html>\n' self.goahead(0)
------------ line: '<!DOCTYPE html>\n' ---------------
------------ line: '<html dir="ltr" class="client-js ve-not-available" lang="en"><head>\n' ---------------
inner if executes: line: '<html dir="ltr" class="client-js ve-not-available" lang="en"><head>\n'
in sgmlparser.feed: rawdata: u'', data: '<html dir="ltr" class="client-js ve-not-available" lang="en"><head>\n' self.goahead(0)
in sgmlparser.goahead: end: 0,rawdata[i]: u'<', i: 0,literal:0
in sgmlparser.parse_starttag: i: 0, __starttag_text: None, start_pos: 0, rawdata: u'<html dir="ltr" class="client-js ve-not-available" lang="en"><head>\n'
in sgmlparser.goahead: end: 0,rawdata[i]: u'<', i: 61,literal:0
in sgmlparser.parse_starttag: i: 61, __starttag_text: None, start_pos: 61, rawdata: u'<html dir="ltr" class="client-js ve-not-available" lang="en"><head>\n'
------------ line: '<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n' ---------------
inner if executes: line: '<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n'
in sgmlparser.feed: rawdata: u'', data: '<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n' self.goahead(0)
in sgmlparser.goahead: end: 0,rawdata[i]: u'<', i: 0,literal:0
in sgmlparser.parse_starttag: i: 0, __starttag_text: None, start_pos: 0, rawdata: u'<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n'
in sgml.finish_starttag: tag: u'meta', attrs: [(u'http-equiv', u'content-type'), (u'content', u'text/html; charset=UTF-8')]
in start_meta: attrs: [(u'http-equiv', u'content-type'), (u'content', u'text/html; charset=UTF-8')] declaredHTMLEncoding: u'UTF-8'
in _feed: inDocumentEncoding: u'UTF-8', fromEncoding: None, smartQuotesTo: 'html'
in UnicodeDammit.__init__: markup: None
in UnicodeDammit._detectEncoding: xml_data: None
and the Traceback:
Traceback (most recent call last):
File "test_docs.py", line 28, in <module>
newdoc.get_doc_ondrive(testeee)
File "/home/jddancks/Capstone/Python/code/pkg/vectors/DOCUMENT.py", line 117, in get_doc_ondrive
self.soup.feed(line)
File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
self.goahead(0)
File "/usr/lib/python2.7/sgmllib.py", line 139, in goahead
k = self.parse_starttag(i)
File "/usr/lib/python2.7/sgmllib.py", line 298, in parse_starttag
self.finish_starttag(tag, attrs)
File "/usr/lib/python2.7/sgmllib.py", line 348, in finish_starttag
self.handle_starttag(tag, method, attrs)
File "/usr/lib/python2.7/sgmllib.py", line 385, in handle_starttag
method(attrs)
File "/usr/lib/python2.7/dist-packages/BeautifulSoup.py", line 1618, in start_meta
self._feed(self.declaredHTMLEncoding)
File "/usr/lib/python2.7/dist-packages/BeautifulSoup.py", line 1172, in _feed
smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
File "/usr/lib/python2.7/dist-packages/BeautifulSoup.py", line 1776, in __init__
self._detectEncoding(markup, isHTML)
File "/usr/lib/python2.7/dist-packages/BeautifulSoup.py", line 1922, in _detectEncoding
'^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
TypeError: expected string or buffer
so this line
<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n
is somehow causing a null string to be parsed in UnicodeDammit. Why is this happening?
I just read through the source and I think I understand the problem. Essentially, here’s how BeautifulSoup thinks things are supposed to go:
1. You call BeautifulSoup with the entire markup.
2. It sets self.markup to that markup.
3. It calls _feed on itself, which resets the document and parses it in the initially-detected encoding.
4. While feeding itself, it finds a meta tag that states a different encoding.
5. To use this new encoding, it calls _feed on itself again, which reparses self.markup.
6. After the first _feed as well as the _feed it recursed into has finished, it sets self.markup to None. (After all, we’ve parsed everything now; <sarcasm>who could ever need the original markup any more?</sarcasm>)
But the way you’re using it:
1. You call BeautifulSoup with the first line of the markup.
2. It sets self.markup to the first line of the markup and calls _feed.
3. _feed sees no interesting meta tag on the first line, so it finishes successfully.
4. The constructor thinks we’re done parsing, so it sets self.markup back to None and returns.
5. You call feed on the BeautifulSoup object, which goes straight to the SGMLParser.feed implementation, which is not overridden by BeautifulSoup.
6. It sees an interesting meta tag and calls _feed to parse the document in this new encoding.
7. _feed tries to construct a UnicodeDammit object from self.markup.
8. It explodes, since self.markup is None; it thought it would only be called during that little chunk of time in the constructor of BeautifulSoup.
Moral of the story is that feed is an unsupported way of sending input to BeautifulSoup. You have to pass it all the input at once.
As for why BeautifulSoup(open(mypath, "r")) returns None, I’ve no idea; I don’t see a __new__ defined on BeautifulSoup, so it seems like it has to return a BeautifulSoup object.
All that said, you might want to look into using BeautifulSoup 4 rather than 3. Here’s the porting guide. In order to support Python 3, it had to remove the dependency on SGMLParser, and I wouldn’t be surprised if during that part of the rewrite whatever bug you’re encountering was fixed.
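To make the moral concrete: read the whole file first, then hand the parser the complete markup in one call. A minimal sketch, assuming BeautifulSoup 4 is installed (the markup here is a stand-in for your saved pages):

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is available

# Stand-in for one of the saved HTML files, including the kind of
# <meta> charset declaration that triggered the crash in BS3
markup = '''<!DOCTYPE html>
<html><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<title>demo</title></head>
<body><p>hello</p></body></html>'''

# One-shot parse: the whole document goes in at once, so the parser can
# handle the declared encoding internally without any incremental feed
soup = BeautifulSoup(markup, 'html.parser')
print(soup.title.string)   # demo
print(soup.p.get_text())   # hello
```

In your function that means replacing the readline loop with something like `self.soup = BeautifulSoup(open(mypath).read())`; the line-by-line feed goes away entirely.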
I'm working with the Mega API and Python, hoping to produce a folder tree readable by Python. At the moment I'm working with the JSON responses Mega's API gives, but for some reason I'm having trouble parsing them. In the past I would simply use simplejson in the format below, but right now it's not working. At the moment I'm just trying to get the file name. Any help is appreciated!
import simplejson
megaResponseToFileSearch = "(u'BMExefXbYa', {u'a': {u'n': u'A Bullet For Pretty Boy - 1 - Dial M For Murder.mp3'}, u'h': u'BMExXbYa', u'k': (5710166, 21957970, 11015946, 7749654L), u'ts': 13736999, 'iv': (7949460, 15946811, 0, 0), u'p': u'4FlnwBTb', u's': 5236864, 'meta_mac': (529642, 2979591L), u'u': u'xpz_tb-YDUg', u't': 0, 'key': (223xx15874, 642xx8505, 1571620, 26489769L, 799460, 1596811, 559642, 279591L)})"
jsonRespone = simplejson.loads(megaResponseToFileSearch)
print jsonRespone[u'a'][u'n']
ERROR:
Traceback (most recent call last):
File "D:/Projects/Mega Sync/megasync.py", line 18, in <module>
jsonRespone = simplejson.loads(file4)
File "D:\Projects\Mega Sync\simplejson\__init__.py", line 453, in loads
return _default_decoder.decode(s)
File "D:\Projects\Mega Sync\simplejson\decoder.py", line 429, in decode
obj, end = self.raw_decode(s)
File "D:\Projects\Mega Sync\simplejson\decoder.py", line 451, in raw_decode
raise JSONDecodeError("No JSON object could be decoded", s, idx)
simplejson.decoder.JSONDecodeError: No JSON object could be decoded: line 1 column 0 (char 0)
EDIT:
I was asked where I got the string from. It's a response to searching for a file using the Mega API. I'm using the module found here. https://github.com/richardasaurus/mega.py
The code itself looks like this:
from mega import Mega
mega = Mega({'verbose': True})
m = mega.login(email, password)
file = m.find('A Bullet For Pretty Boy - 1 - Dial M For Murder.mp3')
print file
The thing you are getting from m.find is not JSON at all; it is just a Python tuple, whose second element (index 1) is a dictionary:
(u'99M1Tazb',
{u'a': {u'n': u'test.txt'},
u'h': u'99M1Tazb',
u'k': (1145485578, 1435138417, 702505527, 274874292),
u'ts': 1373482712,
'iv': (1883603069, 763415510, 0, 0),
u'p': u'9td12YaY',
u's': 0,
'meta_mac': (1091379956, 402442960),
u'u': u'79_166PAQCA',
u't': 0,
'key': (872626551, 2013967015, 1758609603, 127858020, 1883603069, 763415510, 1091379956, 402442960)})
To get the filename, just use:
print file[1]['a']['n']
So, no need to use simplejson at all.
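Since it is already a native tuple, ordinary unpacking and dictionary indexing are all you need. A small sketch with a shortened stand-in for the value m.find returns:

```python
# Shortened stand-in for what m.find returns: a (handle, metadata) tuple,
# already a Python object, not a JSON string
found = ('99M1Tazb', {'a': {'n': 'test.txt'}, 'h': '99M1Tazb', 's': 0})

handle, meta = found      # unpack the tuple
print(handle)             # 99M1Tazb
print(meta['a']['n'])     # test.txt
```

simplejson.loads only enters the picture when you have raw JSON text from the wire; the mega.py wrapper has already decoded it for you.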