Context for using the `yield` keyword in Python

I have the following program to scrape data from a website. I want to improve the code below by using a generator with `yield` instead of calling generate_url and callme multiple times sequentially. The purpose of this exercise is to properly understand `yield` and the contexts in which it can be used.
import requests
import shutil
start_date='03-03-1997'
end_date='10-04-2015'
yf_base_url ='http://real-chart.finance.yahoo.com/table.csv?s=%5E'
index_list = ['BSESN','NSEI']
def generate_url(index, start_date, end_date):
    s_day = start_date.split('-')[0]
    s_month = start_date.split('-')[1]
    s_year = start_date.split('-')[2]
    e_day = end_date.split('-')[0]
    e_month = end_date.split('-')[1]
    e_year = end_date.split('-')[2]
    if (index == 'BSESN') or (index == 'NSEI'):
        url = yf_base_url + index + '&a={}&b={}&c={}&d={}&e={}&f={}'.format(s_day, s_month, s_year, e_day, e_month, e_year)
        return url

def callme(url, index):
    print('URL {}'.format(url))
    r = requests.get(url, verify=False, stream=True)
    if r.status_code != 200:
        print("Failure!!")
        exit()
    else:
        r.raw.decode_content = True
        with open(index + "file.csv", 'wb') as f:
            shutil.copyfileobj(r.raw, f)
        print("Success")

if __name__ == '__main__':
    url = generate_url(index_list[0], start_date, end_date)
    callme(url, index_list[0])
    url = generate_url(index_list[1], start_date, end_date)
    callme(url, index_list[1])

There are multiple options. You could use `yield` to iterate over URLs, or over request objects.
If your index_list were long, I would suggest yielding URLs, because then you could use multiprocessing.Pool to map a function that does a request and saves the output over those URLs. That would execute them in parallel, potentially making it a lot faster (assuming that you have enough network bandwidth, and that Yahoo Finance doesn't throttle connections).
import multiprocessing

yf = ('http://real-chart.finance.yahoo.com/table.csv?s=%5E'
      '{}&a={}&b={}&c={}&d={}&e={}&f={}')
index_list = ['BSESN', 'NSEI']

def genurl(symbols, start_date, end_date):
    # Assemble the URLs.
    s_day, s_month, s_year = start_date.split('-')
    e_day, e_month, e_year = end_date.split('-')
    for s in symbols:
        url = yf.format(s, s_day, s_month, s_year, e_day, e_month, e_year)
        yield url

def download(url):
    # Do the request, save the file.
    ...

if __name__ == '__main__':
    p = multiprocessing.Pool()
    rv = p.map(download, genurl(index_list, '03-03-1997', '10-04-2015'))
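The download body above is left as a stub. Here is a minimal sketch of how it could be filled in, reusing the requests/shutil streaming approach from the question; deriving the output filename from the URL is an assumption here, not part of the original answer:

import requests
import shutil

def download(url):
    # Stream the CSV and copy the raw response into a file named after the symbol.
    r = requests.get(url, stream=True)
    if r.status_code != 200:
        print('Failure for {}'.format(url))
        return
    r.raw.decode_content = True
    # Hypothetical filename scheme: the symbol sits between '%5E' and the first '&'.
    symbol = url.split('%5E')[1].split('&')[0]
    with open(symbol + 'file.csv', 'wb') as f:
        shutil.copyfileobj(r.raw, f)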

If I understand you correctly, what you want to know is how to change the code so that you can replace the last part by
if __name__ == '__main__':
    for url in generate_url(index_list, start_date, end_date):
        callme(url, index)
If this is correct, you need to change generate_url, but not callme. Changing generate_url is rather mechanical. Make the first parameter index_list instead of index, wrap the function body in a for index in index_list loop, and change return url to yield url.
You don't need to change callme, because you never want to write something like for call in callme(...); it stays an ordinary function call.
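For concreteness, here is a sketch of that mechanical change (the six split lines from the question are condensed with tuple unpacking, but the URL assembly is otherwise unchanged):

def generate_url(index_list, start_date, end_date):
    s_day, s_month, s_year = start_date.split('-')
    e_day, e_month, e_year = end_date.split('-')
    for index in index_list:
        if (index == 'BSESN') or (index == 'NSEI'):
            url = yf_base_url + index + '&a={}&b={}&c={}&d={}&e={}&f={}'.format(s_day, s_month, s_year, e_day, e_month, e_year)
            yield url

Note that the desired loop also passes index to callme; since the generator only yields URLs, one option (not covered in the original answer) is to yield (index, url) pairs and unpack both in the for loop.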


Use list items in variable in python requests url

I am trying to make a call to an API and then grab event_ids from the data. I then want to use those event IDs as variables in another request, parse that data, and then loop back and make another request with the next event ID, for all the IDs in event_ids.
So far I have the following:
def nba_odds():
    url = "https://xxxxx.com.au/sports/summary/basketball?api_key=xxxxx"
    response = requests.get(url)
    data = response.json()
    event_ids = []

    for event in data['Events']:
        if event['Country'] == 'USA' and event['League'] == 'NBA':
            event_ids.append(event['EventID'])

    # print(event_ids)
    game_url = f'https://xxxxx.com.au/sports/detail/{event_ids}?api_key=xxxxx'
    game_response = requests.get(game_url)
    game_data = game_response.json()
    print(game_url)
That gives me the result below in the terminal:
https://xxxxx.com.au/sports/detail/['dbx-1425135', 'dbx-1425133', 'dbx-1425134', 'dbx-1425136', 'dbx-1425137', 'dbx-1425138', 'dbx-1425139', 'dbx-1425140', 'anyvsany-nba01-1670043600000000000', 'dbx-1425141', 'dbx-1425142', 'dbx-1425143', 'dbx-1425144', 'dbx-1425145', 'dbx-1425148', 'dbx-1425149', 'dbx-1425147', 'dbx-1425146', 'dbx-1425150', 'e95270f6-661b-46dc-80b9-cd1af75d38fb', '0c989be7-0802-4683-8bb2-d26569e6dcf9']?api_key=779ac51a-2fff-4ad6-8a3e-6a245a0a4cbb
The URL above should instead look like:
https://xxxx.com.au/sports/detail/dbx-1425135
If anyone can point me in the right direction it would be appreciated.
thanks.
You need to loop over the event IDs and call the API with one event_id at a time if it does not support multiple event_ids, like:
all_events_response = []
for event_id in event_ids:
    game_url = f'https://xxxxx.com.au/sports/detail/{event_id}?api_key=xxxxx'
    game_response = requests.get(game_url)
    game_data = game_response.json()
    all_events_response.append(game_data)
    print(game_url)
You can then find the list of JSON responses in all_events_response.
event_ids is an entire list of event ids. You make a single URL with the full list converted to its string view (['dbx-1425135', 'dbx-1425133', ...]). But it looks like you want to get information on each event in turn. To do that, put the second request in the loop so that it runs for every event you find interesting.
def nba_odds():
    url = "https://xxxxx.com.au/sports/summary/basketball?api_key=xxxxx"
    response = requests.get(url)
    data = response.json()
    event_ids = []

    for event in data['Events']:
        if event['Country'] == 'USA' and event['League'] == 'NBA':
            event_id = event['EventID']
            # print(event_id)
            game_url = f'https://xxxxx.com.au/sports/detail/{event_id}?api_key=xxxxx'
            game_response = requests.get(game_url)
            game_data = game_response.json()
            # do something with game_data - it will be overwritten
            # on the next round of the loop
            print(game_url)

IndexError: list index out of range (on Reddit data crawler)

The code below is expected to run without issues.
Solution to Reddit data:
import requests
import re
import praw
from datetime import date
import csv
import pandas as pd
import time
import sys
class Crawler(object):
    '''
    basic_url is the reddit site.
    headers is for requests.get method
    REX is to find submission ids.
    '''

    def __init__(self, subreddit="apple"):
        '''
        Initialize a Crawler object.
        subreddit is the topic you want to parse. default is r"apple"
        basic_url is the reddit site.
        headers is for requests.get method
        REX is to find submission ids.
        submission_ids save all the ids of submission you will parse.
        reddit is an object created using praw API. Please check it before you use.
        '''
        self.basic_url = "https://www.reddit.com"
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
        self.REX = re.compile(r"<div class=\" thing id-t3_[\w]+")
        self.subreddit = subreddit
        self.submission_ids = []
        self.reddit = praw.Reddit(client_id="your_id", client_secret="your_secret", user_agent="subreddit_comments_crawler")

    def get_submission_ids(self, pages=2):
        '''
        Collect all ids of submissions.
        One page has 25 submissions.
        page url: https://www.reddit.com/r/subreddit/?count25&after=t3_id
        id(after) is the last submission from last page.
        '''
        # This is page url.
        url = self.basic_url + "/r/" + self.subreddit

        if pages <= 0:
            return []

        text = requests.get(url, headers=self.headers).text
        ids = self.REX.findall(text)
        ids = list(map(lambda x: x[-6:], ids))
        if pages == 1:
            self.submission_ids = ids
            return ids

        count = 0
        after = ids[-1]
        for i in range(1, pages):
            count += 25
            temp_url = self.basic_url + "/r/" + self.subreddit + "?count=" + str(count) + "&after=t3_" + ids[-1]
            text = requests.get(temp_url, headers=self.headers).text
            temp_list = self.REX.findall(text)
            temp_list = list(map(lambda x: x[-6:], temp_list))
            ids += temp_list
            if count % 100 == 0:
                time.sleep(60)
        self.submission_ids = ids
        return ids

    def get_comments(self, submission):
        '''
        Submission is an object created using praw API.
        '''
        # Remove all "more comments".
        submission.comments.replace_more(limit=None)
        comments = []
        for each in submission.comments.list():
            try:
                comments.append((each.id, each.link_id[3:], each.author.name, date.fromtimestamp(each.created_utc).isoformat(), each.score, each.body))
            except AttributeError as e:  # Some comments are deleted, we cannot access them.
                # print(each.link_id, e)
                continue
        return comments

    def save_comments_submissions(self, pages):
        '''
        1. Save all the ids of submissions.
        2. For each submission, save information of this submission. (submission_id, #comments, score, subreddit, date, title, body_text)
        3. Save comments in this submission. (comment_id, submission_id, author, date, score, body_text)
        4. Separately, save them to two csv files.
        Note: You can link them with submission_id.
        Warning: According to the rules of the Reddit API, the get action should not be too frequent. To be safe, use the default time span in this crawler.
        '''
        print("Start to collect all submission ids...")
        self.get_submission_ids(pages)
        print("Start to collect comments...This may cost a long time depending on # of pages.")
        submission_url = self.basic_url + "/r/" + self.subreddit + "/comments/"
        comments = []
        submissions = []
        count = 0
        for idx in self.submission_ids:
            temp_url = submission_url + idx
            submission = self.reddit.submission(url=temp_url)
            submissions.append((submission.name[3:], submission.num_comments, submission.score, submission.subreddit_name_prefixed, date.fromtimestamp(submission.created_utc).isoformat(), submission.title, submission.selftext))
            temp_comments = self.get_comments(submission)
            comments += temp_comments
            count += 1
            print(str(count) + " submissions have got...")
            if count % 50 == 0:
                time.sleep(60)

        comments_fieldnames = ["comment_id", "submission_id", "author_name", "post_time", "comment_score", "text"]
        df_comments = pd.DataFrame(comments, columns=comments_fieldnames)
        df_comments.to_csv("comments.csv")
        submissions_fieldnames = ["submission_id", "num_of_comments", "submission_score", "submission_subreddit", "post_date", "submission_title", "text"]
        df_submission = pd.DataFrame(submissions, columns=submissions_fieldnames)
        df_submission.to_csv("submissions.csv")
        return df_comments


if __name__ == "__main__":
    args = sys.argv[1:]
    if len(args) != 2:
        print("Wrong number of args...")
        exit()
    subreddit, pages = args
    c = Crawler(subreddit)
    c.save_comments_submissions(int(pages))
but I got:
(base) UserAir:scrape_reddit user$ python reddit_crawler.py apple 2
Start to collect all submission ids...
Traceback (most recent call last):
  File "reddit_crawler.py", line 127, in <module>
    c.save_comments_submissions(int(pages))
  File "reddit_crawler.py", line 94, in save_comments_submissions
    self.get_submission_ids(pages)
  File "reddit_crawler.py", line 54, in get_submission_ids
    after = ids[-1]
IndexError: list index out of range
Erik's answer diagnoses the specific cause of this error, but more broadly I think this is caused by you not using PRAW to its fullest potential. Your script imports requests and performs a lot of manual requests that PRAW has methods for already. The whole point of PRAW is to prevent you from having to write these requests that do things such as paginate a listing, so I recommend you take advantage of that.
As an example, your get_submission_ids function (which scrapes the web version of Reddit and handles paginating) could be replaced by just
def get_submission_ids(self, pages=2):
    return [
        submission.id
        for submission in self.reddit.subreddit(self.subreddit).hot(
            limit=25 * pages
        )
    ]
because the .hot() function does everything you tried to do by hand.
I'm going to go one step further here and have the function just return a list of Submission objects, because the rest of your code ends up doing things that would be better done by interacting with the PRAW Submission object. Here's that code (I renamed the function to reflect its updated purpose):
def get_submissions(self, pages=2):
    return list(self.reddit.subreddit(self.subreddit).hot(limit=25 * pages))
(I've updated this function to just return its result, as your version both returns the value and sets it as self.submission_ids, unless pages is 0. That felt quite inconsistent, so I made it just return the value.)
Your get_comments function looks good.
The save_comments_submissions function, like get_submission_ids, does a lot of manual work that PRAW can handle. You construct a temp_url that has the full URL of a post, and then use that to make a PRAW Submission object, but we can replace that with directly using the one returned by get_submissions. You also have some calls to time.sleep() which I removed because PRAW will automatically sleep the appropriate amount for you. Lastly, I removed the return value of this function because the point of the function is to save data to disk, not to return it to anywhere else, and the rest of your script doesn't use the return value. Here's the updated version of that function:
def save_comments_submissions(self, pages):
    """
    1. Save all the ids of submissions.
    2. For each submission, save information of this submission. (submission_id, #comments, score, subreddit, date, title, body_text)
    3. Save comments in this submission. (comment_id, submission_id, author, date, score, body_text)
    4. Separately, save them to two csv files.
    Note: You can link them with submission_id.
    Warning: According to the rules of the Reddit API, the get action should not be too frequent. To be safe, use the default time span in this crawler.
    """
    print("Start to collect all submission ids...")
    submissions = self.get_submissions(pages)
    print(
        "Start to collect comments...This may cost a long time depending on # of pages."
    )
    comments = []
    pandas_submissions = []
    for count, submission in enumerate(submissions):
        pandas_submissions.append(
            (
                submission.name[3:],
                submission.num_comments,
                submission.score,
                submission.subreddit_name_prefixed,
                date.fromtimestamp(submission.created_utc).isoformat(),
                submission.title,
                submission.selftext,
            )
        )
        temp_comments = self.get_comments(submission)
        comments += temp_comments
        print(str(count) + " submissions have got...")

    comments_fieldnames = [
        "comment_id",
        "submission_id",
        "author_name",
        "post_time",
        "comment_score",
        "text",
    ]
    df_comments = pd.DataFrame(comments, columns=comments_fieldnames)
    df_comments.to_csv("comments.csv")
    submissions_fieldnames = [
        "submission_id",
        "num_of_comments",
        "submission_score",
        "submission_subreddit",
        "post_date",
        "submission_title",
        "text",
    ]
    df_submission = pd.DataFrame(pandas_submissions, columns=submissions_fieldnames)
    df_submission.to_csv("submissions.csv")
Here's an updated version of the whole script that uses PRAW fully:
from datetime import date
import sys

import pandas as pd
import praw


class Crawler:
    """
    basic_url is the reddit site.
    headers is for requests.get method
    REX is to find submission ids.
    """

    def __init__(self, subreddit="apple"):
        """
        Initialize a Crawler object.
        subreddit is the topic you want to parse. default is r"apple"
        basic_url is the reddit site.
        headers is for requests.get method
        REX is to find submission ids.
        submission_ids save all the ids of submission you will parse.
        reddit is an object created using praw API. Please check it before you use.
        """
        self.subreddit = subreddit
        self.submission_ids = []
        self.reddit = praw.Reddit(
            client_id="your_id",
            client_secret="your_secret",
            user_agent="subreddit_comments_crawler",
        )

    def get_submissions(self, pages=2):
        """
        Collect all submissions.
        One page has 25 submissions.
        page url: https://www.reddit.com/r/subreddit/?count25&after=t3_id
        id(after) is the last submission from last page.
        """
        return list(self.reddit.subreddit(self.subreddit).hot(limit=25 * pages))

    def get_comments(self, submission):
        """
        Submission is an object created using praw API.
        """
        # Remove all "more comments".
        submission.comments.replace_more(limit=None)
        comments = []
        for each in submission.comments.list():
            try:
                comments.append(
                    (
                        each.id,
                        each.link_id[3:],
                        each.author.name,
                        date.fromtimestamp(each.created_utc).isoformat(),
                        each.score,
                        each.body,
                    )
                )
            except AttributeError as e:  # Some comments are deleted, we cannot access them.
                # print(each.link_id, e)
                continue
        return comments

    def save_comments_submissions(self, pages):
        """
        1. Save all the ids of submissions.
        2. For each submission, save information of this submission. (submission_id, #comments, score, subreddit, date, title, body_text)
        3. Save comments in this submission. (comment_id, submission_id, author, date, score, body_text)
        4. Separately, save them to two csv files.
        Note: You can link them with submission_id.
        Warning: According to the rules of the Reddit API, the get action should not be too frequent. To be safe, use the default time span in this crawler.
        """
        print("Start to collect all submission ids...")
        submissions = self.get_submissions(pages)
        print(
            "Start to collect comments...This may cost a long time depending on # of pages."
        )
        comments = []
        pandas_submissions = []
        for count, submission in enumerate(submissions):
            pandas_submissions.append(
                (
                    submission.name[3:],
                    submission.num_comments,
                    submission.score,
                    submission.subreddit_name_prefixed,
                    date.fromtimestamp(submission.created_utc).isoformat(),
                    submission.title,
                    submission.selftext,
                )
            )
            temp_comments = self.get_comments(submission)
            comments += temp_comments
            print(str(count) + " submissions have got...")

        comments_fieldnames = [
            "comment_id",
            "submission_id",
            "author_name",
            "post_time",
            "comment_score",
            "text",
        ]
        df_comments = pd.DataFrame(comments, columns=comments_fieldnames)
        df_comments.to_csv("comments.csv")
        submissions_fieldnames = [
            "submission_id",
            "num_of_comments",
            "submission_score",
            "submission_subreddit",
            "post_date",
            "submission_title",
            "text",
        ]
        df_submission = pd.DataFrame(pandas_submissions, columns=submissions_fieldnames)
        df_submission.to_csv("submissions.csv")


if __name__ == "__main__":
    args = sys.argv[1:]
    if len(args) != 2:
        print("Wrong number of args...")
        exit()
    subreddit, pages = args
    c = Crawler(subreddit)
    c.save_comments_submissions(int(pages))
I realize that my answer here gets into Code Review territory, but I hope that this answer is helpful for understanding some of the things PRAW can do. Your "list index out of range" error would have been avoided by using the pre-existing library code, so I do consider this to be a solution to your problem.
When my_list[-1] throws an IndexError, it means that my_list is empty:
>>> ids = []
>>> ids[-1]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: list index out of range
>>> ids = ['1']
>>> ids[-1]
'1'
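In this crawler, that means REX.findall matched nothing on the page, so ids is empty by the time ids[-1] runs; Reddit's HTML has likely changed since the regex was written, or the request was blocked. If you keep the scraping approach, a small guard along these lines (my addition, not part of the original code) fails more gracefully:

text = requests.get(url, headers=self.headers).text
ids = list(map(lambda x: x[-6:], self.REX.findall(text)))
if not ids:
    # Nothing matched: the page layout changed or the request was rejected.
    print("No submission ids found for /r/" + self.subreddit)
    return []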

simple web scraper very slow

I'm fairly new to Python and web scraping in general. The code below works, but it seems awfully slow for the amount of information it's actually going through. Is there any way to easily cut down on execution time? I'm not sure, but it does seem like I have typed out more, and made it more difficult, than I actually needed to; any help would be appreciated.
Currently the code starts at the sitemap and then iterates through a list of additional sitemaps. Within the new sitemaps it pulls information to construct a URL for the JSON data of a webpage. From the JSON data I pull an XML link that I use to search for a string. If the string is found, it is appended to a text file.
import io
import requests
from bs4 import BeautifulSoup

# global variables
start = 'https://www.govinfo.gov/wssearch/getContentDetail?packageId='
dash = '-'
urlSitemap = "https://www.govinfo.gov/sitemap/PLAW_sitemap_index.xml"

old_xml = requests.get(urlSitemap)
print(old_xml)
new_xml = io.BytesIO(old_xml.content).read()
final_xml = BeautifulSoup(new_xml)
linkToBeFound = final_xml.findAll('loc')
for loc in linkToBeFound:
    urlPLmap = loc.text
    old_xmlPLmap = requests.get(urlPLmap)
    print(old_xmlPLmap)
    new_xmlPLmap = io.BytesIO(old_xmlPLmap.content).read()
    final_xmlPLmap = BeautifulSoup(new_xmlPLmap)
    linkToBeFound2 = final_xmlPLmap.findAll('loc')
    for pls in linkToBeFound2:
        argh = pls.text.find('PLAW')
        theWanted = pls.text[argh:]
        thisShallWork = eval(requests.get(start + theWanted).text)
        print(requests.get(start + theWanted))
        dict1 = (thisShallWork['download'])
        finaldict = (dict1['modslink'])[2:]
        print(finaldict)
        url2 = 'https://' + finaldict
        try:
            old_xml4 = requests.get(url2)
            print(old_xml4)
            new_xml4 = io.BytesIO(old_xml4.content).read()
            final_xml4 = BeautifulSoup(new_xml4)
            references = final_xml4.findAll('identifier', {'type': 'Statute citation'})
            for sec in references:
                if sec.text == "106 Stat. 4845":
                    print(dash * 20)
                    print(sec.text)
                    print(dash * 20)
                    sec313 = open('sec313info.txt', 'a')
                    sec313.write("\n")
                    sec313.write(pls.text + '\n')
                    sec313.close()
        except:
            print('error at: ' + url2)
No idea why I spent so long on this, but I did. Your code was really hard to look through, so I started with that: I broke it up into two parts, getting the links from the sitemaps, then everything else, and broke a few bits out into separate functions too.
This is checking about 2 URLs per second on my machine, which seems about right.
How this is better (you can argue with me about this part):
You don't have to reopen and close the output file after each write.
Removed a fair bit of unneeded code.
Gave your variables better names (this does not improve speed in any way, but please do it, especially if you are asking for help).
Really the main thing: once you break it all up, it becomes fairly clear that what's slowing you down is waiting on the requests, which is pretty standard for web scraping. You can look into multithreading to avoid the wait; once you get into multithreading, the benefit of breaking up your code will likely also become much more evident (a threaded sketch follows the code below).
import requests
from bs4 import BeautifulSoup

# returns sitemap links
def get_links(s):
    old_xml = requests.get(s)
    new_xml = old_xml.text
    final_xml = BeautifulSoup(new_xml, "lxml")
    return final_xml.findAll('loc')

# gets the final url from your middle url and looks through it for the thing you are looking for
def scrapey(link):
    link_id = link[link.find("PLAW"):]
    r = requests.get('https://www.govinfo.gov/wssearch/getContentDetail?packageId={}'.format(link_id))
    print(r.url)
    try:
        r = requests.get("https://{}".format(r.json()["download"]["modslink"][2:]))
        print(r.url)
        soup = BeautifulSoup(r.text, "lxml")
        references = soup.findAll('identifier', {'type': 'Statute citation'})
        for ref in references:
            if ref.text == "106 Stat. 4845":
                return r.url
        else:
            return False
    except:
        print("bah" + r.url)
        return False

sitemap_links_el = get_links("https://www.govinfo.gov/sitemap/PLAW_sitemap_index.xml")
sitemap_links = map(lambda x: x.text, sitemap_links_el)
nlinks_el = map(get_links, sitemap_links)
links = [num.text for elem in nlinks_el for num in elem]

with open("output.txt", "a") as f:
    for link in links:
        url = scrapey(link)
        if url is False:
            print("no find")
        else:
            print("found on: {}".format(url))
            f.write("{}\n".format(url))
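To make the multithreading suggestion concrete, here is a minimal sketch using concurrent.futures on top of the functions above; the worker count is an arbitrary choice and not part of the original answer:

from concurrent.futures import ThreadPoolExecutor

# scrapey spends nearly all of its time waiting on HTTP responses, so a thread
# pool lets several of those waits overlap.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = pool.map(scrapey, links)

with open("output.txt", "a") as f:
    for url in results:
        if url is False:
            print("no find")
        else:
            print("found on: {}".format(url))
            f.write("{}\n".format(url))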

Eliminate unwanted characters from JSON file using different threads (Python)

In my Python file, I have created a class called Download. Here is the code for the class:
import requests, json, os, pytube, threading

class Download:
    def __init__(self, url, json=False, get=False, post=False, put=False, unwanted="", wanted="", unwanted2="", wanted2="", unwanted3="", wanted3=""):
        self.url = url
        self.json = json
        self.get = get
        self.post = post
        self.put = put
        self.unwanted = unwanted
        self.wanted = wanted
        self.unwanted2 = unwanted2
        self.wanted2 = wanted2
        self.unwanted3 = unwanted3
        self.wanted3 = wanted3

    def downloadJson(self):
        if self.get is True:
            downloadJson = requests.get(self.url)
            downloadJson = str(downloadJson.content)
            downloadJsonS = str(downloadJson)  # This saves the downloaded JSON file as string
            if self.json is True:
                with open("downloadedJson.json", "w") as writeDownloadedJson:
                    writeDownloadedJson.write(json.dumps(downloadJson))
                    writeDownloadedJson.close()
                with open("downloadedJson.json", "r") as replaceUnwanted:
                    a = replaceUnwanted.read()
                    x = a.replace(self.unwanted, self.wanted)
                    # y = a.replace(self.unwanted2, self.wanted2)
                    # z = a.replace(self.unwanted3, self.wanted3)
                    print(x)
                with open("downloadedJson.json", "w") as writeUnwanted:
                    # writeUnwanted.write(y)
                    # writeUnwanted.write(z)
                    writeUnwanted.write(x)
            else:
                # with open("downloadedJson.json", "w") as j:
                #     j.write(downloadJsonS)
                #     j.close()
                pass
I have written all of this myself, and I understand how it works. My objective is to remove all the unwanted characters that come in the JSON file once downloaded, such as \\n, \' or \n. I have many arguments in the __init__() function, like __init__(unwanted="", wanted="", unwanted2="") etcetera.
The idea is that when adding any character to the unwanted parameter, such as \\n, it should replace all those characters with a space. That part works properly. The commented-out lines are the ones I was using before, but they did not work: only the characters from one argument would get replaced.
Is there any way of passing in all the unwanted characters for each argument, using threads? If it is not possible with threads, is there any alternative?
By the way, this is the file where I am using the class (main.py):
from downloader import Download
with open("url.txt", "r") as url:
    x = Download(url.read(), get=True, json=True, unwanted="\\n")
    x.downloadJson()
Thanks
You could apply the replacements one after another:
x = a.replace(self.unwanted, self.wanted)
x = x.replace(self.unwanted2, self.wanted2)
x = x.replace(self.unwanted3, self.wanted3)
You could also chain the replacement together, but that would quickly become hard to read:
x = a.replace(...).replace(...).replace(...)
By the way, instead of having multiple unwantedN and wantedN parameters, it would probably be a lot easier to use a list of (unwanted, wanted) pairs, something like this:
def __init__(self, url, json=False, get=False, post=False, put=False, replacements=[]):
    self.url = url
    self.json = json
    self.get = get
    self.post = post
    self.put = put
    self.replacements = replacements
And then you could perform the replacements in a loop:
x = a
for unwanted, wanted in self.replacements:
    x = x.replace(unwanted, wanted)
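For example, assuming downloadJson is updated to use that loop instead of the individual self.unwanted attributes, usage from main.py might look like this (the specific replacement pairs are only illustrative):

from downloader import Download

with open("url.txt", "r") as url:
    x = Download(
        url.read(),
        get=True,
        json=True,
        # Each (unwanted, wanted) pair is applied in turn by the loop above.
        replacements=[("\\n", " "), ("\\'", "'")],
    )
    x.downloadJson()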

Python: Check the url request per hour

I am accessing an API and extracting JSON from it, but I want to make sure I stay within the hourly request limit. What would be the best way to do this?
This is where I make the request:
# return the json
def returnJSONQuestion(id):
    url = 'http://someApi.com?index_id={0}&output=json'
    format_url = url.format(id)
    try:
        urlobject = urllib2.urlopen(format_url)
        jsondata = json.loads(urlobject.read().decode("utf-8"))
        print jsondata
        shortRandomSleep()
    except urllib2.URLError, e:
        print e.reason
    except (json.decoder.JSONDecodeError, ValueError):
        print 'Decode JSON has failed'
    return jsondata
I usually use a cheap hack where I make the script run every other minute by checking the current time. This is the general form of the function:
def minuteMod(x, p=0):
    import datetime
    minute = datetime.datetime.now() + datetime.timedelta(seconds=15)
    minute = int(datetime.datetime.strftime(minute, "%M"))
    if minute % x == p:
        return True
    return False
p is the remainder here and has a default value of 0, so there is no particular need to pass in the second argument.
So basically, if you want your script to run only every other minute, you use it like this:
def returnJSONQuestion(id):
    if not minuteMod(2):
        return None or ''
    # rest of the code
This will stop the request if the current minute is not even. Since this is not the best way to do things, you can instead use this function to cache results (depending on whether that is allowed). So basically, you would do something like this:
def returnJSONQuestion(id):
    if minuteMod(3):  # current minute is a multiple of 3
        return jsonFromCache  # open a file and output cached contents
    else:
        url = 'http://...'
        storeJSONToFile(url)
        return json
You could use a token bucket algorithm, something like this: http://code.activestate.com/recipes/511490/
Have tokens added to the bucket at the rate the API allows you to make requests, and take a token from the bucket each time you make a request.
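A minimal sketch of that idea (the linked recipe is more complete; the rate and capacity values below are placeholders for whatever the API actually allows):

import time

class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.time()

    def consume(self):
        # Top up the bucket for the time that has passed, then take one token,
        # sleeping first if none are available yet.
        now = time.time()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            time.sleep((1 - self.tokens) / self.rate)
            self.last = time.time()
            self.tokens = 1
        self.tokens -= 1

# e.g. at most 100 requests per hour:
bucket = TokenBucket(rate=100 / 3600.0, capacity=100)
# call bucket.consume() before each request in returnJSONQuestion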
