Getting values from an XML URL in Python

I have an account at https://es.besoccer.com/ and they have an API for getting data as XML.
I have this code in Python to print the values I need from the XML:
from xml.dom import minidom

doc = minidom.parse("datos.xml")
partidos = doc.getElementsByTagName("matches")
for partido in partidos:
    local = partido.getElementsByTagName("local")[0]
    visitante = partido.getElementsByTagName("visitor")[0]
    print("local:%s" % local.firstChild.data)
    print("visitante:%s" % visitante.firstChild.data)
    canales = partido.getElementsByTagName("channels")
    for canal in canales:
        nombre = canal.getElementsByTagName("name")[0]
        print("canal:%s" % nombre.firstChild.data)
The problem is that this site serves the XML at a URL, so I don't know how to read the XML directly from the URL. Another problem is that the XML contains some tags whose content is a link, and Python throws an error on those tags.

Read the API docs here: https://www.besoccer.com/api/documentacion
After you understand which API call you need, prepare the URL and the query arguments and use a library like requests to read the data.
Once you have the reply (assuming it is XML based), you can use your existing code to parse it.
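For example, a minimal sketch of that approach (the endpoint path and query parameters below are placeholders, not the real BeSoccer API; take the actual ones from the documentation):

import requests
from xml.dom import minidom

# Hypothetical endpoint and parameters -- replace with the real ones from the docs.
url = "https://example.besoccer.com/api/matches"
params = {"key": "YOUR_API_KEY", "format": "xml"}

resp = requests.get(url, params=params)
resp.raise_for_status()

# minidom.parseString parses the XML from the response body instead of a file.
doc = minidom.parseString(resp.content)
for partido in doc.getElementsByTagName("matches"):
    local = partido.getElementsByTagName("local")[0]
    visitante = partido.getElementsByTagName("visitor")[0]
    print("local:%s" % local.firstChild.data)
    print("visitante:%s" % visitante.firstChild.data)

The rest of your loop (channels, names) works unchanged, since parseString returns the same Document object that parse does.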

Related

Trying to parse XML from a URL and store it in a string so I can use it in another spot to output to IRC

The following is the XML from the remote URL:
<SHOUTCASTSERVER>
<CURRENTLISTENERS>0</CURRENTLISTENERS>
<PEAKLISTENERS>0</PEAKLISTENERS>
<MAXLISTENERS>100</MAXLISTENERS>
<UNIQUELISTENERS>0</UNIQUELISTENERS>
<AVERAGETIME>0</AVERAGETIME>
<SERVERGENRE>variety</SERVERGENRE>
<SERVERGENRE2/>
<SERVERGENRE3/>
<SERVERGENRE4/>
<SERVERGENRE5/>
<SERVERURL>http://localhost/</SERVERURL>
<SERVERTITLE>Wicked Radio WIKD/WPOS</SERVERTITLE>
<SONGTITLE>Unknown - Haxor Radio Show 08</SONGTITLE>
<STREAMHITS>0</STREAMHITS>
<STREAMSTATUS>1</STREAMSTATUS>
<BACKUPSTATUS>0</BACKUPSTATUS>
<STREAMLISTED>0</STREAMLISTED>
<STREAMLISTEDERROR>200</STREAMLISTEDERROR>
<STREAMPATH>/stream</STREAMPATH>
<STREAMUPTIME>448632</STREAMUPTIME>
<BITRATE>128</BITRATE>
<CONTENT>audio/mpeg</CONTENT>
<VERSION>2.4.7.256 (posix(linux x64))</VERSION>
</SHOUTCASTSERVER>
All I am trying to do is store the contents of the <SONGTITLE> element so I can post it to IRC using a bot that I have.
import urllib2
from lxml import etree

url = "http://142.4.217.133:9203/stats?sid=1&mode=viewxml&page=0"
fp = urllib2.urlopen(url)
doc = etree.parse(fp)
fp.close()
for record in doc.xpath('//SONGTITLE'):
    for x in record.xpath("./subfield/text()"):
        print "\t", x
That is what I have so far; I'm not sure what I am doing wrong here. I am quite new to Python, but the IRC bot works and does some other utility-type things; I just want to add this as a feature.
You don't need to include ./subfield/:
for x in record.xpath("text()"):
Output:
Unknown - Haxor Radio Show 08
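
Putting it together, the corrected loop (the text of SONGTITLE sits directly on the element, so there is no subfield child to descend into):

for record in doc.xpath('//SONGTITLE'):
    for x in record.xpath("text()"):
        print "\t", x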

urllib2.urlopen not getting all content

I am a beginner in Python, trying to pull some data from reddit.com.
More precisely, I am trying to send a request to http://www.reddit.com/r/nba/.json to get the JSON content of the page and then parse it for entries about a specific team or player.
To automate the data gathering, I am requesting the page like this:
import urllib2
FH = urllib2.urlopen("http://www.reddit.com/r/nba/.json")
rnba = FH.readlines()
rnba = str(rnba[0])
FH.close()
I am also pulling the content like this in a copy of the script, just to be sure:
import requests
FH = requests.get("http://www.reddit.com/r/nba/.json", timeout=10)
rnba_json = FH.json()
FH.close()
However, with either method I am not getting the full data that is presented when I manually go to
http://www.reddit.com/r/nba/.json, in particular when I call
print len(rnba_json['data']['children']) # prints 20-something child stories
but when I do the same loading the copy-pasted JSON string like this:
import json
import urllib2
fh = r"""{"kind": "Listing", "data": {"modhash": ..."""# long JSON string
r_nba = json.loads(fh) #loads the json string from the site into json object
print len(r_nba['data']['children']) #prints upwards of 100 stories
I get more story links. I know about the timeout parameter but providing it did not resolve anything.
What am I doing wrong or what can I do to get all the content presented when I pull the page in the browser?
To get the max allowed, you'd use the API like: http://www.reddit.com/r/nba/.json?limit=100
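A sketch of that call with requests (the User-Agent header is an assumption on my part: reddit's API guidelines ask for a descriptive one, and the default can get throttled):

import requests

# Ask for the maximum page size; unauthenticated listings default to ~25 items.
resp = requests.get(
    "http://www.reddit.com/r/nba/.json",
    params={"limit": 100},
    headers={"User-Agent": "nba-scraper-example/0.1"},  # hypothetical UA string
    timeout=10,
)
rnba_json = resp.json()
print len(rnba_json['data']['children'])  # now up to 100 child stories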

Download a Google Sites page Content Feed using gdata-python-client

My final goal is to import some data from Google Sites pages.
I'm trying to use gdata-python-client (v2.0.17) to download a specific Content Feed:
self.client = gdata.sites.client.SitesClient(source=SOURCE_APP_NAME)
self.client.client_login(USERNAME, PASSWORD, source=SOURCE_APP_NAME, service=self.client.auth_service)
self.client.site = SITE
self.client.domain = DOMAIN
uri = '%s?path=%s' % (self.client.MakeContentFeedUri(), '[PAGE PATH]')
feed = self.client.GetContentFeed(uri=uri)
entry = feed.entry[0]
...
The resulting entry.content has the page content in XHTML format, but this tree doesn't contain any plain-text data from the page, only the HTML page structure and links.
For example, my test page has
<div>Some text</div>
but the ContentFeed entry has only the div node, with text=None.
I have debugged the gdata-python-client request/response and checked the raw data returned by the server: there is no plain-text data in the content. So it looks like a Google API bug.
Maybe there is some workaround? Maybe I can use some common request parameter? What's going wrong here?
This code works for me against a Google Apps domain and gdata 2.0.17:
import atom.data
import gdata.sites.client
import gdata.sites.data
client = gdata.sites.client.SitesClient(source='yourCo-yourAppName-v1', site='examplesite', domain='example.com')
client.ClientLogin('admin@example.com', 'examplepassword', client.source)
uri = '%s?path=%s' % (client.MakeContentFeedUri(), '/home')
feed = client.GetContentFeed(uri=uri)
entry = feed.entry[0]
print entry
Granted, it's pretty much identical to yours, but it might help you prove or disprove something. Good luck!

Google Reader Archive feed not valid xml?

I want to get the most recent 10,000 entries from CNN's top stories RSS feed. I'm using the following python program to do this, connecting to Google's archive tool as follows:
import string
import urllib2
from xml.dom import minidom

feedAddr = "http://www.google.com/reader/atom/feed/http://rss.cnn.com/rss/cnn_topstories.rss?r=n&n=1000"
feedString = urllib2.build_opener().open(urllib2.Request(feedAddr)).read()
xml = minidom.parseString(feedString)
items = xml.getElementsByTagName("item")
for item in items:
    titleNode = item.childNodes[1]
    linkNode = item.childNodes[3]
    titleString = titleNode.firstChild.data
    linkString = linkNode.firstChild.data
    print titleString, linkString
I'm getting the following error:
xml.parsers.expat.ExpatError: mismatched tag: line 1285, column 4
Is this a problem with Google's archiving tool or feed generator? Is it a problem with my Python code? I'm getting the feed URL from this page, splicing in the CNN feed URL as seen above:
http://googlesystem.blogspot.com/2007/06/reconstruct-feeds-history-using-google.html
Have you actually examined the data returned by urllib? Are you sure you're getting a feed and not something else? Google Reader requires authentication and if you attempt to load that URL without authentication you will get back an HTML error page. Try this:
feedString = urllib2.build_opener().open(urllib2.Request(feedAddr)).read()
open('feed.xml', 'w').write(feedString)
And examine the feed.xml file.
Also, you can grab it from CNN directly by just stripping off the "http://www.google.com/reader/atom/feed/" part and using:
http://rss.cnn.com/rss/cnn_topstories.rss?r=n&n=1000.
This returns a valid RSS feed.
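A minimal sketch of that direct approach, reusing the question's minidom code but looking elements up by tag name instead of fixed childNodes indexes (whitespace text nodes make fixed indexes fragile):

import urllib2
from xml.dom import minidom

# Fetch the CNN feed directly, without the Google Reader prefix.
feedAddr = "http://rss.cnn.com/rss/cnn_topstories.rss?r=n&n=1000"
feedString = urllib2.urlopen(feedAddr).read()
xml = minidom.parseString(feedString)
for item in xml.getElementsByTagName("item"):
    title = item.getElementsByTagName("title")[0].firstChild.data
    link = item.getElementsByTagName("link")[0].firstChild.data
    print title, link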

The JSON syntax vs html/xml tags

The JSON syntax definition says that HTML/XML tags (like the <script>...</script> part) are not part of valid JSON; see the description at http://json.org.
A number of browsers and tools ignore these things silently, but Python does not.
I'd like to insert the JavaScript code (Google Analytics) to get info about the users of this service (place, browser, OS, ...).
What do you suggest I do?
Should I solve the problem in the [browser output][^1] or in the [python script][^2]?
thanks,
Antonio
[^1]: Browser output
<script>...</script>
[{"key": "value"}]
[^2]: python script
#!/usr/bin/env python
import urllib2, urllib, json

url = "http://.........."
params = {}
url = url + '?' + urllib.urlencode(params, doseq=True)
req = urllib2.Request(url)
headers = {'Accept': 'application/json;text/json'}
for key, val in headers.items():
    req.add_header(key, val)
data = urllib2.urlopen(req)
print json.load(data)
These sound like two different kinds of services: one is a user-oriented web view of some data, with visualizations, formatting, etc., and one is a machine-oriented data service. I would keep these separate, and maybe build the user view as an extension to the data service.
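For instance, a minimal sketch of that split using only the Python 2 standard library (the paths and payload are made up for illustration): /data returns pure JSON with no <script> tags, while the root path serves the HTML view that carries the analytics snippet.

import json
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/data':
            # Machine-oriented endpoint: valid JSON only, no markup.
            body, ctype = json.dumps([{"key": "value"}]), 'application/json'
        else:
            # User-oriented view: HTML page embedding the analytics <script>.
            body = ('<html><head><script>/* analytics snippet */</script></head>'
                    '<body>user-facing view of the data</body></html>')
            ctype = 'text/html'
        self.send_response(200)
        self.send_header('Content-Type', ctype)
        self.end_headers()
        self.wfile.write(body)

HTTPServer(('', 8000), Handler).serve_forever()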
