HTML data grabbing in python? - python

I'm fairly new to programming and I am trying to take data from a webpage and use it in my python code. Basically, I'm trying to take the price of an item for a game by having python grab the data whenever I run my code, if that makes sense. Here's what I'm struggling with in particular:
The HTML page I'm using is for runescape, namely
http://services.runescape.com/m=itemdb_oldschool/api/catalogue/detail.json?item=4151
This page provides me with a bunch of dictionaries from which I am trying to extract the price of the item in question. All I really want to do is get all of this data into Python so I can then manipulate it. My current code is:
import urllib2
response =urllib2.urlopen('http://services.runescape.com/m=itemdb_oldschool/api/catalogue/detail.json?item=4151')
print response
And it outputs:
<addinfourl at 49631760 whose fp = <socket._fileobject object at 0x02F4B2F0>>
whereas I just want it to display exactly what is on the URL in question.
Any ideas? I'm sorry if my formatting is terrible. And if it sounds like I have no idea what I'm talking about, it's because I don't.

If the webpage returns JSON-encoded data, then do something like this:
import urllib2
import json
response = urllib2.urlopen("http://services.runescape.com/m=itemdb_oldschool/api/catalogue/detail.json?item=4151")
data = json.load(response)
print(data)
Extract the relevant keys in the data variable to get the values you want.
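
For readers on Python 3, where urllib2 no longer exists, the same idea might look like the sketch below. The nested keys item, current and price are an assumption about the shape of the response, and the sample payload is made up so the example runs offline; print the decoded data first to confirm the real layout.

```python
import json
from urllib.request import urlopen

URL = ("http://services.runescape.com/m=itemdb_oldschool"
       "/api/catalogue/detail.json?item=4151")


def fetch_json(url):
    """Download a URL and decode its body as JSON (Python 3)."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_price(data):
    """Pull the price out of the decoded dictionary.

    The key names used here are assumed, not taken from the question.
    """
    return data["item"]["current"]["price"]


# Hypothetical payload so the example runs without network access.
sample = json.loads(
    '{"item": {"name": "Example item", "current": {"price": "1.5m"}}}'
)
sample_price = extract_price(sample)
print(sample_price)

# With network access: print(extract_price(fetch_json(URL)))
```

Keeping the download and the key lookup in separate functions makes it easy to adjust the lookup once the real structure is known.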

Related

How do I print specific values from a json request?

I am trying to request data from Yahoo Finance and then print specific pieces of the data.
My code so far is:
import requests
ticker = input("Enter Stock Ticker: ")
url = "https://query1.finance.yahoo.com/v8/finance/chart/{}?region=GB&lang=en-GB&includePrePost=false&interval=2m&range=1d&corsDomain=uk.finance.yahoo.com&.tsrc=finance".format(ticker)
r = requests.get(url)
data = r.json()
What I am unsure of is how to extract certain pieces of the 'data' variable. For example, I want to display the value that is paired with 'regularMarketPrice'. This can be found in the response.
How can I do this?
Apologies if this isn't worded correctly.
Thanks
If you print data, you will see that it is a dictionary.
If you dig deep enough into the dictionary, you will see that regularMarketPrice can be retrieved as follows (for the first result):
print(data['chart']['result'][0]['meta']['regularMarketPrice'])
If there are multiple results, then you can use the following:
for result in data['chart']['result']:
    print(result['meta']['regularMarketPrice'])
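
If a key or list entry might be missing, chained indexing raises an exception. A small helper, sketched here as one possible approach, walks the path step by step and returns a default instead. The helper name dig and the sample payload are made up for illustration; only the key path comes from the answer above.

```python
def dig(data, path, default=None):
    """Follow a sequence of keys/indexes into nested dicts and lists."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            # Missing key, index out of range, or a non-container value.
            return default
    return current


# Made-up payload shaped like the response described above.
sample = {"chart": {"result": [{"meta": {"regularMarketPrice": 123.45}}]}}

price = dig(sample, ["chart", "result", 0, "meta", "regularMarketPrice"])
missing = dig(sample, ["chart", "result", 1, "meta", "regularMarketPrice"])
print(price, missing)
```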

How can I use urllib to parse a url but input multiple url's into a text prompt?

I'm using urllib to parse a URL, but I want it to take input from a text box so I can enter multiple URLs whenever I need to, instead of changing the code to parse just one URL. I tried using tkinter but I couldn't figure out how to get urllib to grab the input from that.
You haven't provided much information on your use case but let's pretend you have multiple URLs already and that part is working.
def retrieve_input(list_of_urls):
    for url in list_of_urls:
        # do parsing as needed
        pass
Now if you wanted to have a way to get more than one URL and put them in a list, maybe you would do something like:
list_of_urls = []
while True:
    url = input('What is your URL?')
    if url != 'Stop':
        list_of_urls.append(url)
    else:
        break
In practice you would probably want to validate the input more strictly, but this should give you an idea. If you are expecting help with the tkinter portion, you'll need to provide more information and examples of what you have tried, your expected input (and method), and expected output.
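
Putting the two pieces together might look like the sketch below. It assumes "parse" means splitting each URL into its components with urllib.parse, which the question does not confirm. The function names read_urls and parse_urls, and the canned answers that stand in for keyboard input, are made up for illustration.

```python
from urllib.parse import urlparse


def read_urls(prompt_fn=input, stop_word="Stop"):
    """Collect URLs until the stop word is entered."""
    urls = []
    while True:
        url = prompt_fn("What is your URL? ").strip()
        if url == stop_word:
            break
        if url:  # ignore empty lines
            urls.append(url)
    return urls


def parse_urls(urls):
    """Split each URL into (scheme, host, path) with urllib.parse."""
    parsed = []
    for url in urls:
        parts = urlparse(url)
        parsed.append((parts.scheme, parts.netloc, parts.path))
    return parsed


# Canned answers replace typing so the sketch runs unattended.
answers = iter(["http://example.com/a", "https://example.org/b?x=1", "Stop"])
collected = read_urls(prompt_fn=lambda _prompt: next(answers))
print(parse_urls(collected))
```

Passing the prompt function in as a parameter means the same loop could later be fed from a tkinter entry widget instead of input().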

No JSON object could be decoded (Requests + Pandas)

I'm learning to work with the requests library and pandas, but I have been struggling to get past the starting point even with a good number of examples online.
I am trying to extract NBA shot data from the URL below using a GET request, and then turn it into a DataFrame:
import pandas
import requests

def extractData():
    Harden_data_url = "https://stats.nba.com/events/?flag=3&CFID=33&CFPARAMS=2017-18&PlayerID=201935&ContextMeasure=FGA&Season=2017-18&section=player&sct=hex"
    response = requests.get(Harden_data_url)
    data = response.json()
    shots = data['resultSets'][0]['rowSet']
    headers = data['resultSets'][0]['headers']
    df = pandas.DataFrame.from_records(shots, columns = headers)
However I get this error starting on line 2 "response = requests.get(url)"
ValueError: No JSON object could be decoded
I imagine I am missing something basic, any debugging help is appreciated!
The problem is that you are using the wrong URL for fetching the data.
The URL you used was for the HTML, which is in charge of the layout of the site. The data comes from a different URL, which fetches it in JSON format.
The correct URL for the data you are looking for is this:
https://stats.nba.com/stats/shotchartdetail?CFID=33&CFPARAMS=2017-18&ContextMeasure=FGA&DateFrom=&DateTo=&EndPeriod=10&EndRange=28800&GameID=&GameSegment=&GroupQuantity=5&LastNGames=0&LeagueID=00&Location=&Month=0&OnOff=&OpponentTeamID=0&Outcome=&PORound=0&Period=0&PlayerID=201935&PlayerPosition=&RangeType=0&RookieYear=&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&StartPeriod=1&StartRange=0&TeamID=0&VsConference=&VsDivision=
If you open it in a browser, you will see only the raw JSON data, which is exactly what your code will receive, so response.json() can decode it.
This blog post explains the method to find the data URL, and although the API has changed a little since the post was written, the method still works:
http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/
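
Once the JSON endpoint is used, the headers and rowSet lists from the question can be paired up row by row. The sketch below does this with plain Python so it runs without pandas or network access. The column names and values in the sample payload are invented; only the resultSets / headers / rowSet layout comes from the question's code.

```python
def rows_to_records(payload, index=0):
    """Pair each row in a result set with its column headers."""
    result_set = payload["resultSets"][index]
    headers = result_set["headers"]
    return [dict(zip(headers, row)) for row in result_set["rowSet"]]


# Made-up payload that mimics the layout used in the question.
sample = {
    "resultSets": [
        {
            "headers": ["PLAYER_ID", "SHOT_MADE_FLAG"],
            "rowSet": [[201935, 1], [201935, 0]],
        }
    ]
}

records = rows_to_records(sample)
print(records)
```

A list of dictionaries like this can also be passed straight to pandas.DataFrame(records) if a DataFrame is the end goal.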

python scrape webpage and parse the content

I want to scrape the data on this link
http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json
I am not sure what type of content this link returns: is it HTML, JSON, or something else? Sorry for my limited web knowledge. I tried the following code to scrape it:
import requests
url='http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json'
source=requests.get(url).text
The type of source is unicode. I also tried using urllib2 to scrape it:
source2=urllib2.urlopen(url).read()
The type of source2 is str. I am not sure which method is better, because the link is not like a normal webpage that contains different tags. If I want to clean the scraped data and load it into a DataFrame (like a pandas DataFrame), what method or process should I follow?
Thanks.
The returned response is text containing valid JSON data within it. You can validate it yourself using a service like http://jsonlint.com/ if you want. To do so, just copy the code within the parentheses:
return_json("JSON code to copy")
To make use of that data, you just need to parse it in your program. Here is an example: https://docs.python.org/2/library/json.html
The response is text. It does contain JSON; you just need to extract it:
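import json
import requests
url = 'http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json'
strip_len = len("return_json(")
# drop the leading "return_json(" and the last two characters that close the call
source = requests.get(url).text[strip_len:-2]
source = json.loads(source)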
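
The fixed slice above depends on the exact callback name and on how many characters close the call. A slightly more tolerant variant, sketched here as one option, looks for the outermost parentheses instead. The function name unwrap_jsonp and the keys in the sample text are invented for illustration.

```python
import json


def unwrap_jsonp(text):
    """Return the decoded JSON found between the outermost parentheses."""
    start = text.find("(")
    end = text.rfind(")")
    if start == -1 or end <= start:
        raise ValueError("text does not look like a JSONP response")
    return json.loads(text[start + 1:end])


# Made-up response text in the same callback(...) shape.
sample = 'return_json({"poll": {"id": 5491, "rcp_avg": []}});'
data = unwrap_jsonp(sample)
print(data["poll"]["id"])
```

With requests, this would be called as unwrap_jsonp(requests.get(url).text).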

After parsing url as json, can't return a specific element from list- only whole list

All,
I wrote the following code:
import requests,bs4
res=requests.get('http://itunes.apple.com/lookup?id=551798799')
res.raise_for_status()
wwe=bs4.BeautifulSoup(res.text)
print wwe.select('p averageUserRating')
If I only do: print wwe.select('p') then the code works, but it prints everything in the list. However when I print what is in the output above, this throws an error saying the selector is unsupported.
I am basically only trying to return the averageUserRating value (which is 4.0).
Thanks for the help!
The content of that response isn't HTML, which is what BeautifulSoup is designed to read; it's a different data format called JSON. Thankfully, the requests library makes it really easy to parse JSON: if you call .json() on a response, it parses the body into a dictionary. You need to access averageUserRating, which is inside the first element of the results list, so you can use this to access what you need:
>>> data = res.json()
>>> data["results"][0]["averageUserRating"]
4.0
To modify your existing code:
import requests
res=requests.get('http://itunes.apple.com/lookup?id=551798799')
res.raise_for_status()
wwe=res.json()
print(wwe["results"][0]["averageUserRating"])
