I want to get the most recent 10,000 entries from CNN's top stories RSS feed. I'm using the following python program to do this, connecting to Google's archive tool as follows:
import urllib2
from xml.dom import minidom

feedAddr = "http://www.google.com/reader/atom/feed/http://rss.cnn.com/rss/cnn_topstories.rss?r=n&n=1000"
feedString = urllib2.build_opener().open(urllib2.Request(feedAddr)).read()
xml = minidom.parseString(feedString)
items = xml.getElementsByTagName("item")
for item in items:
    titleNode = item.childNodes[1]
    linkNode = item.childNodes[3]
    titleString = titleNode.firstChild.data
    linkString = linkNode.firstChild.data
    print titleString, linkString
I'm getting the following error:
xml.parsers.expat.ExpatError: mismatched tag: line 1285, column 4
Is this a problem with Google's archiving tool or feed generator? Is it a problem with my Python code? I'm getting the feed url from this page, splicing in the CNN feed url as seen above:
http://googlesystem.blogspot.com/2007/06/reconstruct-feeds-history-using-google.html
Have you actually examined the data returned by urllib2? Are you sure you're getting a feed and not something else? Google Reader requires authentication, and if you attempt to load that URL without authenticating you will get back an HTML error page. Try this:
feedString = urllib2.build_opener().open(urllib2.Request(feedAddr)).read()
open('feed.xml', 'w').write(feedString)
And examine the feed.xml file.
Also, you can grab it from CNN directly by just stripping off the `http://www.google.com/reader/atom/feed/` part and using:
http://rss.cnn.com/rss/cnn_topstories.rss?r=n&n=1000
This returns a valid RSS feed.
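Once you have a real RSS response, a slightly more robust way to pull out titles and links is to look the child elements up by tag name instead of relying on fixed childNodes indices, since whitespace text nodes between elements shift those positions. A minimal sketch against a made-up feed fragment (the real feed has the same item/title/link shape):

```python
from xml.dom import minidom

# Tiny stand-in for the real CNN feed, for illustration only.
sample = """<rss version="2.0"><channel>
<item><title>Story one</title><link>http://example.com/1</link></item>
<item><title>Story two</title><link>http://example.com/2</link></item>
</channel></rss>"""

doc = minidom.parseString(sample)
results = []
for item in doc.getElementsByTagName("item"):
    # Look children up by tag name; childNodes[1]/[3] breaks as soon as
    # whitespace text nodes appear between the elements.
    title = item.getElementsByTagName("title")[0].firstChild.data
    link = item.getElementsByTagName("link")[0].firstChild.data
    results.append((title, link))

print(results)
```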
So here's my problem. I'm trying to use lxml to scrape a website and get some information, but the elements the information pertains to aren't found by the `.xpath()` call. The page itself loads fine; the XPath query just returns an empty list.
import requests
from lxml import html

def main():
    result = requests.get('https://rocketleague.tracker.network/rocket-league/profile/xbl/ReedyOrange/overview')
    # the root of the tracker website
    page = html.fromstring(result.content)
    print('its getting the element from here', page)
    threesRank = page.xpath('//*[@id="app"]/div[2]/div[2]/div/main/div[2]/div[3]/div[1]/div/div/div[1]/div[2]/table/tbody/tr[*]/td[3]/div/div[2]/div[1]/div')
    print('the 3s rank is: ', threesRank)

if __name__ == "__main__":
    main()
OUTPUT:
"D:\Python projects\venv\Scripts\python.exe" "D:/Python projects/main.py"
its getting the element from here <Element html at 0x20eb01006d0>
the 3s rank is: []
Process finished with exit code 0
The output next to "the 3s rank is:" should look something like this
[<Element html at 0x20eb01006d0>, <Element html at 0x20eb01006d0>, <Element html at 0x20eb01006d0>]
Because the XPath string does not match anything, page.xpath(..) returns an empty result set. It's difficult to say exactly what you are looking for, but given the name "threesRank" I assume you want the table values, i.e. the rankings and so on.
You can get a more accurate and self-explanatory XPath using the Chrome extension "XPath Helper". Usage: open the site and activate the extension, then hold down the shift key and hover over the element you are interested in.
Since the HTML used by rocketleague.tracker.network is built dynamically with JavaScript (BootstrapVue together with Moment/Typeahead/jQuery), there is a real risk that the dynamic rendering produces different markup from one load to the next.
Instead of scraping the rendered HTML, I suggest you use the structured data the page is rendered from, which in this case is stored as JSON in a JavaScript variable called __INITIAL_STATE__:
import requests
import re
import json
from contextlib import suppress

# get page
result = requests.get('https://rocketleague.tracker.network/rocket-league/profile/xbl/ReedyOrange/overview')

# Extract everything needed to render the current page. The data is stored as JSON in the
# JavaScript variable: window.__INITIAL_STATE__={"route":{"path":"\u0 ... }};
json_string = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\});", result.text).group(1)

# convert the text string to structured json data
rocketleague = json.loads(json_string)

# Save the structured json data to a text file that helps you orient yourself and pick
# the parts you are interested in.
with open('rocketleague_json_data.txt', 'w') as outfile:
    outfile.write(json.dumps(rocketleague, indent=4, sort_keys=True))

# Access members using names
print(rocketleague['titles']['currentTitle']['platforms'][0]['name'])

# To avoid an exception when a key is missing or an index is out of range, use
# "with suppress" as in the example below: since there is no platform no. 99, the
# variable "platform99" stays unassigned instead of raising. Note that a bad list
# index raises IndexError, not KeyError, so suppress both.
with suppress(KeyError, IndexError):
    platform1 = rocketleague['titles']['currentTitle']['platforms'][0]['name']
    platform99 = rocketleague['titles']['currentTitle']['platforms'][99]['name']

# print the platforms used by currentTitle
for platform in rocketleague['titles']['currentTitle']['platforms']:
    print(platform['name'])

# print all titles with their corresponding platforms
for title in rocketleague['titles']['titles']:
    print(f"\nTitle: {title['name']}")
    for platform in title['platforms']:
        print(f"\tPlatform: {platform['name']}")
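The extraction step can be exercised offline against a stand-in page. The HTML snippet and the JSON keys below are made up for illustration; the real __INITIAL_STATE__ blob is far larger:

```python
import re
import json

# Made-up miniature of the page; only the variable assignment matters here.
html_text = ('<script>window.__INITIAL_STATE__='
             '{"titles": {"currentTitle": {"platforms": [{"name": "xbl"}]}}};'
             '</script>')

# Same pattern as above: capture everything between the '=' and the closing '};'.
match = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\});", html_text)
state = json.loads(match.group(1))

print(state['titles']['currentTitle']['platforms'][0]['name'])
```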
The raw HTML served by the site contains no <tbody> elements; browsers insert them when rendering, so an XPath copied from the browser's DOM includes a tbody step that lxml will never match. Change your XPath to
'//*[@id="app"]/div[2]/div[2]/div/main/div[2]/div[3]/div[1]/div/div/div[1]/div[2]/table/tr[*]/td[3]/div/div[2]/div[1]/div'
I have an account on https://es.besoccer.com/ and they have an API for getting data as XML.
I have this Python code to print the values I need from the XML:
from xml.dom import minidom

doc = minidom.parse("datos.xml")
partidos = doc.getElementsByTagName("matches")
for partido in partidos:
    local = partido.getElementsByTagName("local")[0]
    visitante = partido.getElementsByTagName("visitor")[0]
    print("local:%s" % local.firstChild.data)
    print("visitante:%s" % visitante.firstChild.data)
    canales = partido.getElementsByTagName("channels")
    for canal in canales:
        nombre = canal.getElementsByTagName("name")[0]
        print("canal:%s" % nombre.firstChild.data)
The problem is that the XML from this site is served at a URL, so I don't know how to read the XML directly from the URL. Another problem is that the XML contains some tags whose content is a link, and Python throws an error on those tags.
Read the API docs here: https://www.besoccer.com/api/documentacion
After you understand which API call you need to use, prepare the URL and the query arguments and use a library like requests in order to read the data.
Once you have the reply (assuming it is XML based) - you can use your code and parse it.
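A minimal sketch of that flow. The tag names below are guesses based on the question's code, and the endpoint/parameters are placeholders; take the real ones from the API docs:

```python
from xml.dom import minidom

def parse_matches(xml_text):
    """Pull (local, visitor) pairs out of the feed XML (tag names assumed)."""
    doc = minidom.parseString(xml_text)
    pairs = []
    for match in doc.getElementsByTagName("match"):
        local = match.getElementsByTagName("local")[0].firstChild.data
        visitor = match.getElementsByTagName("visitor")[0].firstChild.data
        pairs.append((local, visitor))
    return pairs

# To read the XML straight from the URL, fetch it first, then parse the text, e.g.:
#   resp = requests.get("<endpoint from the API docs>", params={"key": "YOUR_KEY"})
#   print(parse_matches(resp.text))

# Offline demonstration with a made-up fragment shaped like the question's tags:
sample = ("<matches><match><local>Team A</local>"
          "<visitor>Team B</visitor></match></matches>")
print(parse_matches(sample))
```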
I am trying to extract meta descriptions using goose, with the following code, and I also handle cookies. When I test with just one URL it works. However, when I iterate over an array of URLs, I get an empty array for the meta descriptions.
import os
import urllib2
import urlparse
import pandas
import numpy as np
import goose

os.chdir(r"C:\Users\EDAWES01\Desktop\Cookie profiling")
data = pandas.read_csv('activity_url.csv', delimiter=';')
x = "https"
url_data = np.array(data[(data.iloc[:, 2] == 1) & (data.iloc[:, 1].str.contains(x))])[:, 1]
# remove '~oref='
clean_url_data = [urlparse.urlparse(i)[2].split("=")[1] for i in url_data]

g = goose.Goose()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())  # for websites with cookie handling
website_meta_description = [g.extract(raw_html=opener.open(urlw).read()).meta_description for urlw in clean_url_data]
print website_meta_description
The following is the XML from the remote URL:
<SHOUTCASTSERVER>
<CURRENTLISTENERS>0</CURRENTLISTENERS>
<PEAKLISTENERS>0</PEAKLISTENERS>
<MAXLISTENERS>100</MAXLISTENERS>
<UNIQUELISTENERS>0</UNIQUELISTENERS>
<AVERAGETIME>0</AVERAGETIME>
<SERVERGENRE>variety</SERVERGENRE>
<SERVERGENRE2/>
<SERVERGENRE3/>
<SERVERGENRE4/>
<SERVERGENRE5/>
<SERVERURL>http://localhost/</SERVERURL>
<SERVERTITLE>Wicked Radio WIKD/WPOS</SERVERTITLE>
<SONGTITLE>Unknown - Haxor Radio Show 08</SONGTITLE>
<STREAMHITS>0</STREAMHITS>
<STREAMSTATUS>1</STREAMSTATUS>
<BACKUPSTATUS>0</BACKUPSTATUS>
<STREAMLISTED>0</STREAMLISTED>
<STREAMLISTEDERROR>200</STREAMLISTEDERROR>
<STREAMPATH>/stream</STREAMPATH>
<STREAMUPTIME>448632</STREAMUPTIME>
<BITRATE>128</BITRATE>
<CONTENT>audio/mpeg</CONTENT>
<VERSION>2.4.7.256 (posix(linux x64))</VERSION>
</SHOUTCASTSERVER>
All I am trying to do is store the contents of the <SONGTITLE> element so I can post it to IRC using a bot that I have.
import urllib2
from lxml import etree

url = "http://142.4.217.133:9203/stats?sid=1&mode=viewxml&page=0"
fp = urllib2.urlopen(url)
doc = etree.parse(fp)
fp.close()

for record in doc.xpath('//SONGTITLE'):
    for x in record.xpath("./subfield/text()"):
        print "\t", x
That is what I have so far; not sure what I am doing wrong here. I am quite new to python but the IRC bot works and does some other utility type things I just want to add this as a feature to it.
You don't need to include ./subfield/ — the text sits directly inside <SONGTITLE>, which has no child elements:
for x in record.xpath("text()"):
Output:
Unknown - Haxor Radio Show 08
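The same extraction also works offline with the standard library's xml.etree.ElementTree (shown here as an alternative to lxml, against an abridged copy of the server XML from the question):

```python
import xml.etree.ElementTree as ET

# Abridged copy of the stats XML shown in the question.
stats_xml = """<SHOUTCASTSERVER>
<SERVERTITLE>Wicked Radio WIKD/WPOS</SERVERTITLE>
<SONGTITLE>Unknown - Haxor Radio Show 08</SONGTITLE>
<STREAMSTATUS>1</STREAMSTATUS>
</SHOUTCASTSERVER>"""

root = ET.fromstring(stats_xml)
# findtext returns the element's text, or None if the tag is absent.
song = root.findtext("SONGTITLE")
print(song)  # Unknown - Haxor Radio Show 08
```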
I am a beginner in Python, trying to pull some data from reddit.com.
More precisely, I am sending a request to http://www.reddit.com/r/nba/.json to get the JSON content of the page and then parsing it for entries about a specific team or player.
To automate the data gathering, I am requesting the page like this:
import urllib2
FH = urllib2.urlopen("http://www.reddit.com/r/nba/.json")
rnba = FH.readlines()
rnba = str(rnba[0])
FH.close()
I am also pulling the content like this on a copy of the script, just to be sure:
FH = requests.get("http://www.reddit.com/r/nba/.json",timeout=10)
rnba_json = FH.json()
FH.close()
However, I am not getting the full data that is presented when I manually go to
http://www.reddit.com/r/nba/.json with either method, in particular when I call
print len(rnba_json['data']['children']) # prints 20-something child stories
but when I do the same loading the copy-pasted JSON string like this:
import json
import urllib2
fh = r"""{"kind": "Listing", "data": {"modhash": ..."""# long JSON string
r_nba = json.loads(fh) #loads the json string from the site into json object
print len(r_nba['data']['children']) #prints upwards of 100 stories
I get more story links. I know about the timeout parameter but providing it did not resolve anything.
What am I doing wrong or what can I do to get all the content presented when I pull the page in the browser?
To get the maximum allowed per request, use the API's limit parameter: http://www.reddit.com/r/nba/.json?limit=100
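Reddit listings cap each response at 100 items, so going past that means paging with the after token that each reply carries in data.after. A sketch of the URL construction (Python 3 syntax here; the paging loop itself needs network access, so it is only outlined in comments):

```python
from urllib.parse import urlencode

BASE = "http://www.reddit.com/r/nba/.json"

def listing_url(limit=100, after=None):
    """Build a listing URL; `after` is the fullname of the last item on the previous page."""
    params = {"limit": limit}
    if after is not None:
        params["after"] = after
    return BASE + "?" + urlencode(params)

# Paging loop, sketched in comments since it needs network access:
#   data = requests.get(listing_url()).json()["data"]
#   while data["after"]:
#       data = requests.get(listing_url(after=data["after"])).json()["data"]

print(listing_url())  # http://www.reddit.com/r/nba/.json?limit=100
print(listing_url(after="t3_abc123"))  # "t3_abc123" is a made-up fullname token
```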