I want to handle the Search-API rate limit of 180 requests / 15 minutes. The first solution I came up with was to check the remaining requests in the header and wait 900 seconds. See the following snippet:
results = search_interface.cursor(search_interface.search, q=k, lang=lang, result_type=result_mode)

while True:
    try:
        tweet = next(results)
        if limit_reached(search_interface):
            sleep(900)
        self.writer(tweet)
    except StopIteration:
        break

def limit_reached(search_interface):
    remaining_rate = int(search_interface.get_lastfunction_header('X-Rate-Limit-Remaining'))
    return remaining_rate <= 2
But it seems that the header information is not reset to 180 after it reaches the two remaining requests.
The second solution I came up with was to handle the Twython exception for rate limiting and wait for the remaining amount of time:
results = search_interface.cursor(search_interface.search, q=k, lang=lang, result_type=result_mode)

while True:
    try:
        tweet = next(results)
        self.writer(tweet)
    except TwythonError as inst:
        logger.error(inst.msg)
        wait_for_reset(search_interface)
        continue
    except StopIteration:
        break

def wait_for_reset(search_interface):
    reset_timestamp = int(search_interface.get_lastfunction_header('X-Rate-Limit-Reset'))
    now_timestamp = datetime.now().timestamp()
    seconds_offset = 10
    t = reset_timestamp - now_timestamp + seconds_offset
    logger.info('Waiting {0} seconds for Twitter rate limit reset.'.format(t))
    sleep(t)
But with this solution I receive the message "INFO: Resetting dropped connection: api.twitter.com", and the loop does not continue with the last element of the generator. Has anybody faced the same problem?
Regards.
Just rate limit yourself is my suggestion (assuming you are constantly hitting the limit ...):
import time

QUERY_PER_SEC = 15 * 60 / 180.0  # 180 requests per 15 minutes -> ~5 seconds per query

class TwitterBot:
    last_update = 0

    def doQuery(self, *args, **kwargs):
        # sleep until at least QUERY_PER_SEC seconds have passed since the last call
        tdiff = time.time() - self.last_update
        if tdiff < QUERY_PER_SEC:
            time.sleep(QUERY_PER_SEC - tdiff)
        self.last_update = time.time()
        return search_interface.cursor(*args, **kwargs)  # search_interface as in the question
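A minimal usage sketch of this idea, assuming the search_interface, k, lang and result_mode variables from the question (note that this throttles each doQuery call itself; page fetches made internally by the returned cursor are not throttled):

bot = TwitterBot()
results = bot.doQuery(search_interface.search, q=k, lang=lang, result_type=result_mode)
for tweet in results:
    writer(tweet)  # stand-in for the question's self.writer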
Related
I have a loop with a delay which calls a method inside my class; however, it only runs fully once, and after that only the first line gets called. I can see this through my logs in the terminal.
class Processor:
    def __init__(self) -> None:
        self.suppliers = ["BullionByPostUK", "Chards"]
        self.scraper_delay = 60  # in seconds (30mins)
        self.oc_delay = 30  # in seconds (5mins)
        self.scraper = Scraper()

    def run_one(self):
        self.scraper.scrape()
        self.data = self.scraper.get_data()
        t_s = time.perf_counter()
        for supplier in self.suppliers:
            eval(supplier)(self.data).run()  # danger? eval()
        t_e = time.perf_counter()
        dur = f"{t_e - t_s:0.4f}"
        dur_ps = float(dur) / len(self.suppliers)
        print(f"Processed/Updated {len(self.suppliers)} suppliers in {dur} seconds \n Average {dur_ps} seconds per supplier\n")
        logging.info(f"Processed/Updated {len(self.suppliers)} suppliers in {dur} seconds \n Average {dur_ps} seconds per supplier")

    def run_queue(self):  # IN PROGRESS
        print(color.BOLD + color.YELLOW + f"[{datetime.datetime.utcnow()}] " + color.END + "Queue Started")
        while True:
            recorded_silver = self.scraper.silver()
            try:
                wait(lambda: self.bullion_diff(recorded_silver), timeout_seconds=self.scraper_delay, sleep_seconds=self.oc_delay)
                self.run_one()
            except exceptions.TimeoutExpired:  # delay passed
                self.run_one()  # update algorithm (x1)
            finally:
                continue
    ...
As seen from the image provided, each run-through after the first does not get all the way to the processing step.
Why does the code up to the eval() only run once and then stop? I'm basically trying to loop the run_one() method so it runs on a cycle.
downloadStart = datetime.now()

while True:
    requestURL = transactionAPI.format(page=tempPage, limit=5000)
    response = requests.get(requestURL, headers=headers)
    json_data = json.loads(response.content)
    tempMomosTransactionHistory.extend(json_data["list"])
    if datetime.fromtimestamp(json_data["list"][-1]["crtime"]) < datetime(datetime.today().year, datetime.today().month, datetime.today().day - dateRange):
        break
    tempPage += 1

downloadEnd = datetime.now()
Any suggestions, please? Threading or something like that?
Output:
downloadtime 0:00:02.056010
downloadtime 0:00:05.680806
downloadtime 0:00:05.447945
You need to improve it in two ways.
Optimise code within loop
Parallelize code execution
#1
By looking at your code I can see one improvement: create the datetime.today() object once instead of three times. Check whether other parts, such as the transactionAPI call, can be optimised further.
#2
Because the loop is dominated by network I/O, you can overlap the downloads by spawning a thread per page. Refer to the modified code below.
import threading

def processRequest(tempPage):
    requestURL = transactionAPI.format(page=tempPage, limit=5000)
    response = requests.get(requestURL, headers=headers)
    json_data = json.loads(response.content)
    tempMomosTransactionHistory.extend(json_data["list"])

downloadStart = datetime.now()
# fetch the datetime.today() object once instead of three times
datetimetoday = datetime.today()
threads = []
while True:
    # create a thread per page
    t1 = threading.Thread(target=processRequest, args=(tempPage,))
    t1.start()
    threads.append(t1)
    # note: json_data is produced inside the worker thread, so the stop
    # condition below needs a thread-safe way to see the latest response
    # (e.g. a queue); that plumbing is left out of this sketch
    if datetime.fromtimestamp(json_data["list"][-1]["crtime"]) < datetime(datetimetoday.year, datetimetoday.month, datetimetoday.day - dateRange):
        break
    tempPage += 1
for t in threads:
    t.join()  # wait for the outstanding page downloads to finish
downloadEnd = datetime.now()
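For what it's worth, a more complete sketch of the same idea uses a bounded thread pool: fetch a batch of pages concurrently, then check the oldest timestamp in the batch before requesting the next one. This assumes the transactionAPI, headers, dateRange, tempPage and tempMomosTransactionHistory names from the question, and a hypothetical batch size of 8:

from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta

def fetch_page(page):
    url = transactionAPI.format(page=page, limit=5000)
    return requests.get(url, headers=headers).json()["list"]

today = datetime.today()
cutoff = datetime(today.year, today.month, today.day) - timedelta(days=dateRange)
page, batch_size = tempPage, 8  # batch size is a hypothetical choice
done = False
with ThreadPoolExecutor(max_workers=batch_size) as executor:
    while not done:
        # download one batch of pages in parallel
        for rows in executor.map(fetch_page, range(page, page + batch_size)):
            tempMomosTransactionHistory.extend(rows)
            if datetime.fromtimestamp(rows[-1]["crtime"]) < cutoff:
                done = True  # the oldest transaction in this page is older than the cutoff
        page += batch_size

Because the stop condition is only checked per batch, the last batch may contain a few pages older than the cutoff; they can be filtered out afterwards.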
def get_price_history_data(ticker):
    pricelist = []
    try:
        pricedata = False
        tradingdays = 252
        Historical_Prices = pdr.get_data_yahoo(symbols=ticker, start=(datetime.today() - timedelta(tradingdays)), end=(datetime.today()))  # -timedelta(years4-1)))
        price_df = pd.DataFrame(Historical_Prices)
        pricelist = price_df['Adj Close']
        pricedata = True
    except:
        print(ticker, ' failed to get price data')
    return (pricelist, pricedata)

tickers = ['FB', 'V']
for ticker in tickers:
    [pricelist, pricedata] = get_price_history_data(ticker)
I have a list of a few thousand tickers that I run through this for loop. It outputs a single-column df and a boolean. Overall it works just fine and does what I need it to. However, it inconsistently freezes indefinitely with no error message and stops running, forcing me to close the program and re-run from the beginning.
I am looking for a way to skip an iteration of the for loop if a certain amount of time has passed. I have looked into time.sleep() and the continue statement but can't figure out how to apply them to this specific application. If it freezes, it freezes on the pdr.get_data_yahoo() section. Help would be appreciated.
I'm guessing that get_data_yahoo() probably freezes because it's making some kind of request to a server that never gets answered. It doesn't have a timeout option so the most obvious option is to start it in another thread/process and terminate it if it takes too long. You can use concurrent.futures for that. Once you're happy about how the code below works, you can replace sleeps_for_a_while with get_price_history_data and (3, 1, 4, 0) with tickers.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
from time import sleep

TIMEOUT = 2  # seconds

def sleeps_for_a_while(sleep_for):
    print('starting {}s sleep'.format(sleep_for))
    sleep(sleep_for)
    print('finished {}s sleep'.format(sleep_for))
    # return a value to break out of the while loop
    return sleep_for * 100

if __name__ == '__main__':
    # this only works with functions that return values
    results = []
    for sleep_secs in (3, 1, 4, 0):
        with ThreadPoolExecutor(max_workers=1) as executor:
            # a future represents something that will be done
            future = executor.submit(sleeps_for_a_while, sleep_secs)
            try:
                # result() raises an error if it times out
                results.append(future.result(TIMEOUT))
            except TimeoutError:
                print('Function timed out')
                results.append(None)
    print('Got results:', results)
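Applying that substitution to the question's function might look roughly like this (a sketch, assuming the tickers list and get_price_history_data function from the question, and a hypothetical 30-second timeout):

from concurrent.futures import ThreadPoolExecutor, TimeoutError

TIMEOUT = 30  # seconds; hypothetical value, tune it to your data source

results = {}
for ticker in tickers:
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(get_price_history_data, ticker)
        try:
            pricelist, pricedata = future.result(TIMEOUT)
        except TimeoutError:
            print(ticker, 'timed out, skipping')
            pricelist, pricedata = [], False
        results[ticker] = (pricelist, pricedata)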
I'm currently using an API that rate limits me to 3000 requests per 10 seconds. I have 10,000 URLs that are fetched using Tornado due to its asynchronous IO nature.
How do I go about implementing a rate limit to reflect the API limit?
from tornado import ioloop, httpclient

i = 0

def handle_request(response):
    print(response.code)
    global i
    i -= 1
    if i == 0:
        ioloop.IOLoop.instance().stop()

http_client = httpclient.AsyncHTTPClient()
for url in open('urls.txt'):
    i += 1
    http_client.fetch(url.strip(), handle_request, method='HEAD')
ioloop.IOLoop.instance().start()
You can check where the value of i lies within each interval of 3000 requests. For example, if i is between 3000 and 6000, you can schedule each of those requests with a delay of 10 seconds until 6000. After 6000, just double the delay, and so on.
import functools
from tornado import ioloop
from tornado.httpclient import AsyncHTTPClient

http_client = AsyncHTTPClient()
timeout = 10
interval = 3000

# i and handle_request are the same as in the question
for url in open('urls.txt'):
    i += 1
    if i <= interval:
        # i is in the first 3000:
        # just fetch the request without any delay
        http_client.fetch(url.strip(), handle_request, method='GET')
        continue  # skip the rest of the loop
    if i % interval == 1:
        # i is now 3001, or 6001, and so on ...
        timeout += timeout  # double the delay for the next 3000 calls
    loop = ioloop.IOLoop.current()
    loop.call_later(timeout, functools.partial(http_client.fetch, url.strip(), handle_request, method='GET'))
Note: I only tested this code with a small number of requests. It is possible that the value of i changes underneath you because you're decrementing i in the handle_request function. If that's the case, you should maintain another variable similar to i and use it for the scheduling instead.
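A sketch of that suggestion, keeping i as the counter used by handle_request and adding a separate (hypothetical) scheduled counter for the delay bookkeeping:

scheduled = 0  # counts scheduled requests; i keeps counting outstanding ones

for url in open('urls.txt'):
    i += 1
    scheduled += 1
    if scheduled <= interval:
        http_client.fetch(url.strip(), handle_request, method='GET')
        continue
    if scheduled % interval == 1:
        timeout += timeout  # double the delay for the next batch of 3000
    ioloop.IOLoop.current().call_later(
        timeout,
        functools.partial(http_client.fetch, url.strip(), handle_request, method='GET'))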
I have a checkin list which contains about 600,000 checkins, and there is a URL in each checkin that I need to expand back to the original long one. I do so by:
now = time.time()
files_without_url = 0
for i, checkin in enumerate(NYC_checkins):
    try:
        foursquare_url = urllib2.urlopen(re.search("(?P<url>https?://[^\s]+)", checkin[5]).group("url")).url
    except:
        files_without_url += 1
    if i % 1000 == 0:
        print("from %d to %d: %2.5f seconds" % (i - 1000, i, time.time() - now))
        now = time.time()
But this takes too long: from checkin 0 to checkin 1000, it takes 3241 seconds! Is this normal? What's the most efficient way to expand URLs in Python?
MODIFIED: Some URLs are from Bitly while others are not, and I am not sure where they come from. In this case, I simply want to use the urllib2 module.
For your information, here is an example of checkin[5]:
I'm at The Diner (2453 18th Street NW, Columbia Rd., Washington) w/ 4 others. http...... (this is the short url)
I thought I would expand on my comment regarding the use of multiprocessing to speed up this task.
Let's start with a simple function that will take a url and resolve it as far as possible (following redirects until it gets a 200 response code):
import requests

def resolve_url(url):
    try:
        r = requests.get(url)
    except requests.exceptions.RequestException:
        return (url, None)
    if r.status_code != 200:
        longurl = None
    else:
        longurl = r.url
    return (url, longurl)
This will either return a (shorturl, longurl) tuple, or it will return (shorturl, None) in the event of a failure.
Now, we create a pool of workers:
import multiprocessing
pool = multiprocessing.Pool(10)
And then ask our pool to resolve a list of urls:
resolved_urls = []
for shorturl, longurl in pool.map(resolve_url, urls):
    resolved_urls.append((shorturl, longurl))
Using the above code...
With a pool of 10 workers, I can resolve 500 URLs in 900 seconds.
If I increase the number of workers to 100, I can resolve 500 URLs in 30 seconds.
If I increase the number of workers to 200, I can resolve 500 URLs in 25 seconds.
This is hopefully enough to get you started.
(NB: you could write a similar solution using the threading module rather than multiprocessing. I usually just grab for multiprocessing first, but in this case either would work, and threading might even be slightly more efficient.)
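For example, a threading-based version is a near drop-in replacement via multiprocessing.dummy, which exposes the same Pool API backed by threads:

from multiprocessing.dummy import Pool  # same Pool interface, but uses threads

pool = Pool(100)  # hypothetical worker count; tune to your connection
resolved_urls = pool.map(resolve_url, urls)
pool.close()
pool.join()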
Threads are most appropriate for network I/O. But you could try the following first:
pat = re.compile("(?P<url>https?://[^\s]+)")  # always compile it
missing_urls = 0
bad_urls = 0

def check(checkin):
    match = pat.search(checkin[5])
    if not match:
        global missing_urls
        missing_urls += 1
    else:
        url = match.group("url")
        try:
            urllib2.urlopen(url)  # don't look up .url if you don't need it later
        except URLError:  # or just Exception
            global bad_urls
            bad_urls += 1

for i, checkin in enumerate(NYC_checkins):
    check(checkin)

print(bad_urls, missing_urls)
If you get no improvement, then, now that we have a nice check function, create a thread pool and feed it; a speedup is practically guaranteed. Using processes for network I/O is pointless.
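That could look something like the following sketch, again using the thread-backed multiprocessing.dummy pool (the worker count is a hypothetical choice):

from multiprocessing.dummy import Pool  # thread-backed Pool

pool = Pool(50)  # hypothetical number of worker threads
pool.map(check, NYC_checkins)
pool.close()
pool.join()
print(bad_urls, missing_urls)
# note: the global counters are incremented from several threads; for exact
# counts, protect them with a threading.Lock or have check() return a result
# and tally in the main thread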