Common Crawl Request returns 403 WARC - python

I am trying to crawl some WARC files from the common crawls archives, but I do not seem to get successful requests through to the server. A minimal python example below is provided below to replicate the error. I tried adding the UserAgent in the request header, but it did help. Any ideas on how to proceed?
import io
import time
import justext # >= 2.2.0
import argparse
import requests # >= 2.23.0
import pandas as pd # pandas >= 1.0.3
from tqdm import tqdm
from warcio.archiveiterator import ArchiveIterator warcio >= 1.7.3
def debug():
common_crawl_data = {"filename":"crawl-data/CC-MAIN-2016-07/segments/1454702018134.95/warc/CC-MAIN-20160205195338-00121-ip-10-236-182-209.ec2.internal.warc.gz",
"offset":244189209,
"length":989
}
offset, length = int(common_crawl_data['offset']), int(common_crawl_data['length'])
offset_end = offset + length - 1
prefix = 'https://commoncrawl.s3.amazonaws.com/'
resp = requests.get(prefix + common_crawl_data['filename'], headers={'Range': 'bytes={}-{}'.format(offset, offset_end)})
raw_data = io.BytesIO(resp.content)
uri = None
page = None
for record in ArchiveIterator(raw_data, arc2warc=True):
uri = record.rec_headers.get_header('WARC-Target-URI')
R = record.content_stream().read()
try:
page = R.strip().decode('utf-8')
except:
page = R.strip().decode('latin1')
print(uri, page)
return uri, page
debug()

Please see this commoncrawl blog posting for the recent change that generates 403s for some unauthenticated requests.

Related

Spotify API Works Sometimes, Stalls Other Times(?)

I'm trying to follow along in a simple tutorial that uses the spotipy api. The code is below:
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy
import sys
from pprint import pprint
import spotipy
import yaml
from spotipy.oauth2 import SpotifyOAuth
from data_functions import offset_api_limit, get_artists_df, get_tracks_df, get_track_audio_df,\
get_all_playlist_tracks_df, find_popular_album, get_recommendations_artists, get_recommendations_tracks, get_most_popular_artist, get_recommendations_genre
import pandas as pd
urn = 'spotify:artist:3jOstUTkEu2JkjvRdBA5Gu'
scope = "user-library-read user-follow-read user-top-read playlist-read-private"
with open("spotify/spotify_info.yaml", 'r') as stream:
spotify_details = yaml.safe_load(stream)
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(
client_id=spotify_details['client_id'],
client_secret=spotify_details['client_secret'],
redirect_uri=spotify_details['redirect_uri'],
scope=scope,
))
results = sp.search(q='album:' + 'From 2 to 3', type='album')
print(results)
print("\n\n")
items = results['albums']['items']
if len(items) > 0:
print(items)
album = items[0]
print(album)
print(album['name'], album['id'])
alb = sp.album(album['id']) #issues here
pprint(alb)
The code works and pulls the information from spotify until the marked line. I have no idea what's different about this function but when it hits that line it stalls indefinitely. Has anyone encountered this issue?
I've tried changing my authentication configuration and various functions such as albums, related_artists that also don't work.

Json problems to csv

I'm trying to get some stats from the NBA stats page. I'm following this tutorial-idea
https://towardsdatascience.com/using-python-pandas-and-plotly-to-generate-nba-shot-charts-e28f873a99cb
The basic idea is put the data into a csv file.
So I try this code, to get the data from the nba web, trying to get the json file and the convert it to a csv:
import requests
import json
import pandas as pd
from pandas import DataFrame as df
import urllib.request
shot_data_url_start="https://stats.nba.com/events/?flag=3&CFID=33&CFPARAMS=2017-18&PlayerID="
player_id="202695"
shot_data_url_end="&ContextMeasure=FGA&Season=2017-18&section=player&sct=plot"
def shoy_chart(player_id):
full_url = shot_data_url_start + str(player_id) + shot_data_url_end
json = requests.get(full_url, headers=headers).json()
return(json)
data = json['resultSets'][0]['rowSets']
columns = json['resultSets'][0]['headers']
df = pd.DataFrame.from_records(data, columns=columns)
And this is the error that notebook shows to me:
TypeError Traceback (most recent call last)
<ipython-input-42-a3452c3a4fc8> in <module>
18
19
---> 20 data = json['resultSets'][0]['rowSets']
21 columns = json['resultSets'][0]['headers']
22
TypeError: 'module' object is not subscriptable
Anyone can help me, or know another way to get the data into a .csv or excel file?
When imported with import json, the name json is referring to the JSON module of the Python standard library. You cannot use it as a regular variable name. If you rename your variable to something else such as response_json, this part of your code will work.
Regarding the rest of the code, the page https://stats.nba.com/events/ doesn't return any JSON text, it is a regular web page with images, menus, a video player, etc... If you want to access the API that returns the shots in JSON format, you will have to use the https://stats.nba.com/stats/shotchartdetail (with the right query string). This API endpoint is mentioned in the tutorial, in the "Chrome XHR tab and resulting json linked by url" image.
Ok I've changed the code like this:
import requests
import json
import pandas as pd
from pandas import DataFrame as df
import urllib.request
def shot_chart(player_id):
full_url = "https://stats.nba.com/stats/shotchartdetail?AheadBehind=&CFID=33&CFPARAMS=2017-18&ClutchTime=&Conference=&ContextFilter=&ContextMeasure=FGA&DateFrom=&DateTo=&Division=&EndPeriod=10&EndRange=28800&GROUP_ID=&GameEventID=&GameID=&GameSegment=&GroupID=&GroupMode=&GroupQuantity=5&LastNGames=0&LeagueID=00&Location=&Month=0&OnOff=&OpponentTeamID=0&Outcome=&PORound=0&Period=0&PlayerID=202695&PlayerID1=&PlayerID2=&PlayerID3=&PlayerID4=&PlayerID5=&PlayerPosition=&PointDiff=&Position=&RangeType=0&RookieYear=&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StartPeriod=1&StartRange=0&StarterBench=&TeamID=0&VsConference=&VsDivision=&VsPlayerID1=&VsPlayerID2=&VsPlayerID3=&VsPlayerID4=&VsPlayerID5=&VsTeamID="
response_json = requests.get(full_url, headers=headers)
return(response_json)
data = response_json['resultSets'][0]['rowSets']
columns = response_json['resultSets'][0]['headers']
df = pd.DataFrame.from_records(data, columns=columns)
import requests
import json
import pandas as pd
from pandas import DataFrame as df
import urllib.request
shot_data_url_start="https://stats.nba.com/stats/shotchartdetail?AheadBehind=&CFID=33&CFPARAMS=2019-20&ClutchTime=&Conference=&ContextFilter=&ContextMeasure=FGA&DateFrom=&DateTo=&Division=&EndPeriod=10&EndRange=28800&GROUP_ID=&GameEventID=&GameID=&GameSegment=&GroupID=&GroupMode=&GroupQuantity=5&LastNGames=0&LeagueID=00&Location=&Month=0&OnOff=&OpponentTeamID=0&Outcome=&PORound=0&Period=0&PlayerID="
player_id="202330"
shot_data_url_end="&PlayerID1=&PlayerID2=&PlayerID3=&PlayerID4=&PlayerID5=&PlayerPosition=&PointDiff=&Position=&RangeType=0&RookieYear=&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StartPeriod=1&StartRange=0&StarterBench=&TeamID=0&VsConference=&VsDivision=&VsPlayerID1=&VsPlayerID2=&VsPlayerID3=&VsPlayerID4=&VsPlayerID5=&VsTeamID="
def shot_chart(player_id):
full_url = shot_data_url_start + str(player_id) + shot_data_url_end
response_json = requests.get(full_url).json()
return(response_json)
data = response_json['resultSets'][0]['rowSets']
columns = response_json['resultSets'][0]['headers']
df = pd.DataFrame.from_records(data, columns=columns)
shot_chart("202330")
What is going on now? the notebook is tucked right know
Try this out
import pandas as pd
from pandas import DataFrame as df
shot_data_url_start = "https://stats.nba.com/stats/shotchartdetail?AheadBehind=&CFID=33&CFPARAMS=2017-18&ClutchTime=&Conference=&ContextFilter=&ContextMeasure=FGA&DateFrom=&DateTo=&Division=&EndPeriod=10&EndRange=28800&GROUP_ID=&GameEventID=&GameID=&GameSegment=&GroupID=&GroupMode=&GroupQuantity=5&LastNGames=0&LeagueID=00&Location=&Month=0&OnOff=&OpponentTeamID=0&Outcome=&PORound=0&Period=0&PlayerID="
player_id = "204001"
shot_data_url_end = "&PlayerID1=&PlayerID2=&PlayerID3=&PlayerID4=&PlayerID5=&PlayerPosition=&PointDiff=&Position=&RangeType=0&RookieYear=&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StartPeriod=1&StartRange=0&StarterBench=&TeamID=0&VsConference=&VsDivision=&VsPlayerID1=&VsPlayerID2=&VsPlayerID3=&VsPlayerID4=&VsPlayerID5=&VsTeamID="
def get_shot_data(player_id):
full_url = shot_data_url_start + player_id + shot_data_url_end
data = requests.get(
full_url,
headers = {
"User-Agent": "PostmanRuntime/7.4.0"
}
)
return data.json()
shot_results = get_shot_data(player_id)
result_sets = shot_results['resultSets']
first_result_set = result_sets[0]
row_set = first_result_set['rowSet']
set_headers = first_result_set['headers']
df = pd.DataFrame.from_records(row_set, columns=set_headers)
I see how you got confused with that medium post. You were missing the headers and the url for the NBA api wasn't right. That's what #pierre was trying to say in his response. The url you're using isn't right. If you reread that post you were following, you'll see that the author said he had to dig in to dev tools in order to find that actual url to use in order to grab the JSON.
Edit: Forgot to mention that when I didn't pass a User-Agent in the headers, the request would timeout. If you don't pass that in, you won't get a successful response.

Python 3 Get and parse JSON API [duplicate]

This question already has answers here:
HTTP requests and JSON parsing in Python [duplicate]
(8 answers)
Closed 8 months ago.
How would I parse a json api response with python?
I currently have this:
import urllib.request
import json
url = 'https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty'
def response(url):
with urllib.request.urlopen(url) as response:
return response.read()
res = response(url)
print(json.loads(res))
I'm getting this error:
TypeError: the JSON object must be str, not 'bytes'
What is the pythonic way to deal with json apis?
Version 1: (do a pip install requests before running the script)
import requests
r = requests.get(url='https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty')
print(r.json())
Version 2: (do a pip install wget before running the script)
import wget
fs = wget.download(url='https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty')
with open(fs, 'r') as f:
content = f.read()
print(content)
you can use standard library python3:
import urllib.request
import json
url = 'http://www.reddit.com/r/all/top/.json'
req = urllib.request.Request(url)
##parsing response
r = urllib.request.urlopen(req).read()
cont = json.loads(r.decode('utf-8'))
counter = 0
##parcing json
for item in cont['data']['children']:
counter += 1
print("Title:", item['data']['title'], "\nComments:", item['data']['num_comments'])
print("----")
##print formated
#print (json.dumps(cont, indent=4, sort_keys=True))
print("Number of titles: ", counter)
output will be like this one:
...
Title: Maybe we shouldn't let grandma decide things anymore.
Comments: 2018
----
Title: Carrie Fisher and Her Stunt Double Sunbathing on the Set of Return of The Jedi, 1982
Comments: 880
----
Title: fidget spinner
Comments: 1537
----
Number of titles: 25
I would usually use the requests package with the json package. The following code should be suitable for your needs:
import requests
import json
url = 'https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty'
r = requests.get(url)
print(json.loads(r.content))
Output
[11008076,
11006915,
11008202,
....,
10997668,
10999859,
11001695]
The only thing missing in the original question is a call to the decode method on the response object (and even then, not for every python3 version). It's a shame no one pointed that out and everyone jumped on a third party library.
Using only the standard library, for the simplest of use cases :
import json
from urllib.request import urlopen
def get(url, object_hook=None):
with urlopen(url) as resource: # 'with' is important to close the resource after use
return json.load(resource, object_hook=object_hook)
Simple use case :
data = get('http://url') # '{ "id": 1, "$key": 13213654 }'
print(data['id']) # 1
print(data['$key']) # 13213654
Or if you prefer, but riskier :
from types import SimpleNamespace
data = get('http://url', lambda o: SimpleNamespace(**o)) # '{ "id": 1, "$key": 13213654 }'
print(data.id) # 1
print(data.$key) # invalid syntax
# though you can still do
print(data.__dict__['$key'])
With Python 3
import requests
import json
url = 'http://IP-Address:8088/ws/v1/cluster/scheduler'
r = requests.get(url)
data = json.loads(r.content.decode())

Getting the auth signature in the yelp api using python

I'm trying to write a simple script in python using urllib2 and json where I print the json to the console.
Currently I'm having trouble getting the auth_signature value. I already have the url variable setup with the appropriate keys except for the auth_signature. How do I go about this?
Here is what I have:
import json
import urllib2
import oauth2
timestamp = oauth2.generate_timestamp
nonce = oauth2.generate_nonce
url = "http://api.yelp.com/v2/search?term=food&location=Seattle&callback=callbackYelpAuth&oauth_consumer_key=XXX&oauth_consumer_secret=XXX&oauth_token=XXX&oauth_signature_method=HMAC-SHA1&oauth_timestamp=" + str(timestamp) + "&oauth_nonce=" + str(nonce) + "&oauth_signature=" + str(????)
json_obj = urllib2.urlopen(url)
data = json.load(json_obj)
print data
Following code is working fine. Hope it will help you.
import urlparse
url = "http://api.yelp.com/v2/search?term=food&location=Seattle&callback=callbackYelpAuth&oauth_consumer_key=XXX&oauth_consumer_secret=XXX&oauth_token=XXX&oauth_signature_method=HMAC-SHA1&oauth_timestamp=temstamp_value&oauth_nonce=nonce_value&oauth_signature=signature_value"
parsed = urlparse.urlparse(url)
params = urlparse.parse_qsl(parsed.query)
for x,y in params:
print "Parameter = "+x,", Value = "+y

JSON serialization Error in Python 3.2

I am using JSON library and trying to import a page feed to an CSV file. Tried many a ways to get the result however every time code execute it Gives JSON not serialzable. No Facebook use auth code which I have and used it so connection string will change however if you use a page which has public privacy you will still be able to get the result from below code.
following is the code
import urllib3
import json
import requests
#from pprint import pprint
import csv
from urllib.request import urlopen
page_id = "abcd" # username or id
api_endpoint = "https://graph.facebook.com"
fb_graph_url = api_endpoint+"/"+page_id
try:
#api_request = urllib3.Requests(fb_graph_url)
#http = urllib3.PoolManager()
#api_response = http.request('GET', fb_graph_url)
api_response = requests.get(fb_graph_url)
try:
#print (list.sort(json.loads(api_response.read())))
obj = open('data', 'w')
# write(json_dat)
f = api_response.content
obj.write(json.dumps(f))
obj.close()
except Exception as ee:
print(ee)
except Exception as e:
print( e)
Tried many approach but not successful. hope some one can help
api_response.content is the text content of the API, not a Python object so you won't be able to dump it.
Try either:
f = api_response.content
obj.write(f)
Or
f = api_response.json()
obj.write(json.dumps(f))
requests.get(fb_graph_url).content
is probably a string. Using json.dumps on it won't work. This function expects a list or a dictionary as the argument.
If the request already returns JSON, just write it to the file.

Categories