I have a list of 2000 start URLs and I'm using:
DOWNLOAD_DELAY = 0.25
to control the speed of the requests, but I also want to add a bigger delay after every n requests.
For example, I want a delay of 0.25 seconds between requests and a delay of 100 seconds every 500 requests.
Edit:
Sample code:
import os
from os.path import join
import time

import scrapy

date = time.strftime("%d/%m/%Y").replace('/', '_')

list_of_pages = {
    'http://www.lapatilla.com/site/': 'la_patilla',
    'http://runrun.es/': 'runrunes',
    'http://www.noticierodigital.com/': 'noticiero_digital',
    'http://www.eluniversal.com/': 'el_universal',
    'http://www.el-nacional.com/': 'el_nacional',
    'http://globovision.com/': 'globovision',
    'http://www.talcualdigital.com/': 'talcualdigital',
    'http://www.maduradas.com/': 'maduradas',
    'http://laiguana.tv/': 'laiguana',
    'http://www.aporrea.org/': 'aporrea',
}

root_dir = os.getcwd()
output_dir = join(root_dir, 'data/', date)


class TestSpider(scrapy.Spider):
    name = "news_spider"
    download_delay = 1
    start_urls = list_of_pages.keys()

    def parse(self, response):
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
        filename = list_of_pages[response.url]
        print(time.time())
        with open(join(output_dir, filename), 'wb') as f:
            f.write(response.body)
The list in this case is shorter, but the idea is the same. I want to have two levels of delay: one for each request and one every 'N' requests.
I'm not crawling the links, just saving the main page.
You can look into the AutoThrottle extension, which does not give you tight control over the delays but instead has its own algorithm for slowing down the spider, adjusting it on the fly depending on response time and the number of concurrent requests.
If you need more control over the delays at certain stages of the scraping process, you might need a custom middleware or a custom extension (similar to AutoThrottle - source).
You can also change the .download_delay attribute of your spider on the fly. By the way, this is essentially what the AutoThrottle extension does under the hood: it updates the delay value on the fly. A sketch of this kind of adjustment follows the related links below.
Some related topics:
Per request delay
Request delay configurable for each Request
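As a rough sketch of that idea (not the actual AutoThrottle code): spider.download_delay is only read when a download slot is first created, so for requests already flowing through a slot you would tweak the slot's delay directly, which is the same attribute AutoThrottle tunes. The spider name and URL below are placeholders, and engine.downloader.slots plus the download_slot meta key are Scrapy internals rather than a formal public API, so treat this as an illustration only.

import scrapy


class BurstDelaySpider(scrapy.Spider):
    name = "burst_delay_spider"                 # placeholder name
    download_delay = 0.25                       # base delay between requests
    start_urls = ["http://www.example.com/"]    # replace with your 2000 URLs

    responses_seen = 0

    def parse(self, response):
        self.responses_seen += 1
        # every 500th response, make the next request wait much longer
        new_delay = 100 if self.responses_seen % 500 == 0 else 0.25
        self.download_delay = new_delay         # only affects slots created later
        slot_key = response.meta.get("download_slot")
        slot = self.crawler.engine.downloader.slots.get(slot_key)
        if slot is not None:
            slot.delay = new_delay              # affects the slot that served this response
        yield {"url": response.url}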
Here's a sleepy decorator I wrote that pauses after N function calls.
from time import sleep


def sleepy(f):
    def wrapped(*args, **kwargs):
        wrapped.calls += 1
        print(f"{f.__name__} called {wrapped.calls} times")
        if wrapped.calls % 500 == 0:
            print("Sleeping...")
            sleep(20)
        return f(*args, **kwargs)
    wrapped.calls = 0
    return wrapped
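As a hedged usage sketch (the spider name and URL are placeholders), you can decorate the spider's callback with it. Keep in mind that time.sleep() blocks Scrapy's Twisted reactor, so every in-flight request stalls during the pause.

import scrapy


class NewsSpider(scrapy.Spider):
    name = "news_spider"
    download_delay = 0.25                       # per-request delay
    start_urls = ["http://www.example.com/"]    # placeholder list

    @sleepy  # the decorator defined above: pauses the whole crawl every 500 calls
    def parse(self, response):
        yield {"url": response.url}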
I want to build a scraper that collects information from given webpages every 5 minutes. I implemented this by adding a sleep of 5 minutes between recursive callbacks, like so:
def _parse(self, response):
    status_loader = ItemLoader(Status())
    # perform parsing
    yield status_loader.load_item()
    time.sleep(5)
    yield scrapy.Request(response._url, callback=self._parse, dont_filter=True, meta=response.meta)
However, adding time.sleep(5) to the scraper seems to mess with the inner workings of Scrapy. For some reason Scrapy does send out the request, but the yielded items are not (or only rarely) written to the given output file.
I was thinking it has to do with the request prioritization of scrapy, which might prioritize sending a new request over yielding the scraped items. Could this be the case? I tried to edit the settings to go from a depth-first queue to a breadth-first queue. This did not solve the problem.
How would I go about scraping a website at a given interval, let's say 5 minutes?
It won't work, because Scrapy is asynchronous by default.
Try setting up a cron-style job like this instead:
import logging
import subprocess
import sys
import time

import schedule


def subprocess_cmd(command):
    process = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
    proc_stdout = process.communicate()[0].strip()
    logging.info(proc_stdout)


def cron_run_win():
    # print('start scraping... ####')
    logging.info('start scraping... ####')
    subprocess_cmd('scrapy crawl <spider_name>')


def cron_run_linux():
    # print('start scraping... ####')
    logging.info('start scraping... ####')
    subprocess_cmd('scrapy crawl <spider_name>')


def cron_run():
    if 'win' in sys.platform:
        cron_run_win()
        schedule.every(5).minutes.do(cron_run_win)
    elif 'linux' in sys.platform:
        cron_run_linux()
        schedule.every(5).minutes.do(cron_run_linux)
    while True:
        schedule.run_pending()
        time.sleep(1)


cron_run()
This will run your desired spider every 5 minutes, depending on the OS you are using.
I'm new to Scrapy. I have thousands of (url, xpath) tuples and values in a database.
These URLs are from different domains (not always; there can be 100 URLs from the same domain).
x.com/a //h1
y.com/a //div[@class='1']
z.com/a //div[@href='...']
x.com/b //h1
x.com/c //h1
...
Now I want to get these values every 2 hours, as fast as possible, but without overloading any of these sites.
Can't figure out how to do that.
My thoughts:
I could create one Spider for every different domain, set its parsing rules and run them all at once.
Is it a good practice?
EDIT:
I'm also not sure how outputting data into the database would work with respect to concurrency.
EDIT2:
I can do something like this: for every domain there is a new spider. But this is impossible to do with thousands of different URLs and their XPaths.
import scrapy
# HtmlXPathSelector is the selector API from older Scrapy versions;
# response.xpath() is the modern equivalent.
from scrapy.selector import HtmlXPathSelector


class WikiScraper(scrapy.Spider):
    name = "wiki_headers"

    def start_requests(self):
        urls = [
            'https://en.wikipedia.org/wiki/Spider',
            'https://en.wikipedia.org/wiki/Data_scraping',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        header = hxs.select('//h1/text()').extract()
        print(header)
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)


class CraigslistScraper(scrapy.Spider):
    name = "craigslist_headers"

    def start_requests(self):
        urls = [
            'https://columbusga.craigslist.org/act/6062657418.html',
            'https://columbusga.craigslist.org/acc/6060297390.html',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        header = hxs.select('//span[@id="titletextonly"]/text()').extract()
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)
From the example you posted in EDIT2, it looks like all your classes can easily be abstracted by one more level. How about this?
from urllib.parse import urlparse

import scrapy
from scrapy.selector import HtmlXPathSelector  # older Scrapy API, as in your example


class GenericScraper(scrapy.Spider):

    def __init__(self, urls, xpath, **kwargs):
        # the spider name must be set before/through Spider.__init__
        name = self._create_scraper_name_from_url(urls[0])
        super().__init__(name=name, **kwargs)
        self.urls = urls
        self.xpath = xpath

    @staticmethod
    def _create_scraper_name_from_url(url):
        '''Generate scraper name from url
        www.example.com/foobar/bar -> www_example_com'''
        netloc = urlparse(url).netloc
        return netloc.replace('.', '_')

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        header = hxs.select(self.xpath).extract()
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)
Next, you could group the data from the database by XPath:

for urls, xpath in grouped_data:
    scraper = GenericScraper(urls, xpath)
    # do whatever you need with the scraper

Regarding concurrency: your database should handle concurrent writes, so I do not see a problem there.
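To actually run the grouped scrapers in one process, a possible (hedged) sketch is Scrapy's CrawlerProcess, which can schedule several spiders in a single reactor; grouped_data below stands for the grouped rows from your database:

from scrapy.crawler import CrawlerProcess

# grouped_data: iterable of (urls, xpath) pairs pulled from your database
process = CrawlerProcess(settings={"DOWNLOAD_DELAY": 0.5})
for urls, xpath in grouped_data:
    process.crawl(GenericScraper, urls=urls, xpath=xpath)
process.start()  # blocks until every scheduled spider has finished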
Edit:
Related to the timeouts: I do not know how Scrapy works under the hood, i.e. whether it uses some sort of parallelization and whether it runs asynchronously in the background. But from what you wrote I guess it does, and when you fire up 1k scrapers each firing multiple requests at a time, your hardware can't handle that much traffic (disclaimer: this is just a guess!).
There might be a native way to do this, but a possible workaround is to use multiprocessing + Queue:
from multiprocessing import JoinableQueue, Process

NUMBER_OF_CPU = 4  # change this to your number.
SENTINEL = None


class Worker(Process):

    def __init__(self, queue):
        super().__init__()
        self.queue = queue

    def run(self):
        while True:
            # blocking wait! You have to use sentinels if you use blocking waits!
            item = self.queue.get()
            if item is SENTINEL:
                # we got the sentinel, there are no more scrapers to process
                self.queue.task_done()
                return
            # item is a scraper, run it
            item.run_spider()  # or however you run your scrapers
            # This assumes that each scraper is **not** running in the background!
            # Tell the JoinableQueue we have processed one more item.
            # In the main process, queue.join() waits until a queue.task_done()
            # has been called for each item taken from the queue.
            self.queue.task_done()


def run():
    queue = JoinableQueue()
    # if putting that many things in the queue gets slow (I imagine
    # it can) you can fire up a separate Thread/Process to fill the
    # queue in the background while workers are already consuming it.
    for urls, xpath in grouped_data:
        scraper = GenericScraper(urls, xpath)
        queue.put(scraper)
    for sentinel in range(NUMBER_OF_CPU):
        # None or a sentinel of your choice to tell the workers there are
        # no more scrapers to process
        queue.put(SENTINEL)
    workers = []
    for _ in range(NUMBER_OF_CPU):
        worker = Worker(queue)
        workers.append(worker)
        worker.start()
    # We have to wait until the queue is processed
    queue.join()
But please bear in mind that this is a vanilla approach to parallel execution that completely ignores Scrapy's abilities. I have found this blog post which uses Twisted to achieve (what I think is) the same thing, but since I've never used Twisted I can't comment on that.
If you are thinking that Scrapy can't handle multiple domains at once because of the allowed_domains parameter, remember that it is optional.
If no allowed_domains parameter is set in the spider, it can work with every domain it gets.
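As a small, hypothetical illustration (the domains are the ones from your example; the XPath choice is arbitrary), a single spider without allowed_domains happily accepts start URLs from several domains:

import scrapy


class MultiDomainSpider(scrapy.Spider):
    name = "multi_domain"
    # no allowed_domains attribute, so no offsite filtering is applied
    start_urls = [
        "http://x.com/a",
        "http://y.com/a",
        "http://z.com/a",
    ]

    def parse(self, response):
        yield {"url": response.url,
               "header": response.xpath("//h1/text()").get()}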
If I understand correctly, you have a map of domain to XPath values and you want to pick the XPath depending on which domain you crawl?
Try something like:
import logging

DOMAIN_DATA = [('domain.com', '//div')]


def get_domain(url):
    for domain, xpath in DOMAIN_DATA:
        if domain in url:
            return xpath


def parse(self, response):
    xpath = get_domain(response.url)
    if not xpath:
        logging.error('no xpath for url: {}; unknown domain'.format(response.url))
        return
    item = dict()
    item['some_field'] = response.xpath(xpath).extract()
    yield item
I'm working with CherryPy in a server that implements a RESTful-like API.
The responses involve some heavy computation that takes about 2 seconds per
request. To do this computation, some data is used that is updated three
times a day.
The data is updated in the background (it takes about half an hour),
and once it is updated, the references to the new data are passed to
the functions that respond to the requests. This takes just a millisecond.
What I need is to be sure that each request is answered with either the
old data or the new data, but no request processing can take place while the data references are being changed. Ideally, I would like to find a way of buffering incoming requests while the data references are changed, and also to ensure that the references are changed only after all in-process requests have finished.
My current (not) working minimal example is as follows:
import time

import cherrypy
from cherrypy.process import plugins

theData = 0


def processData():
    """Background task works for half an hour three times a day,
    and when it finishes it publishes the result on the engine bus."""
    global theData  # using global variables to simplify the example
    theData += 1
    cherrypy.engine.publish("doChangeData", theData)


class DataPublisher(object):

    def __init__(self):
        self.data = 'initData'
        cherrypy.engine.subscribe('doChangeData', self.changeData)

    def changeData(self, newData):
        cherrypy.engine.log("Changing data, buffering should start!")
        self.data = newData
        time.sleep(1)  # exaggeration of the 1 millisecond the reference update takes, to visualize the problem
        cherrypy.engine.log("Continue serving buffered and new requests.")

    @cherrypy.expose
    def index(self):
        result = "I get " + str(self.data)
        cherrypy.engine.log(result)
        time.sleep(3)
        return result


if __name__ == '__main__':
    conf = {
        '/': {'server.socket_host': '127.0.0.1',
              'server.socket_port': 8080}
    }
    cherrypy.config.update(conf)
    btask = plugins.BackgroundTask(5, processData)  # 5 secs for the example
    btask.start()
    cherrypy.quickstart(DataPublisher())
If I run this script, open a browser at localhost:8080, and refresh the page a lot, I get:
...
[17/Sep/2015:21:32:41] ENGINE Changing data, buffering should start!
127.0.0.1 - - [17/Sep/2015:21:32:41] "GET / HTTP/1.1" 200 7 "...
[17/Sep/2015:21:32:42] ENGINE I get 3
[17/Sep/2015:21:32:42] ENGINE Continue serving buffered and new requests.
127.0.0.1 - - [17/Sep/2015:21:24:44] "GET / HTTP/1.1" 200 7 "...
...
This means that some request processing starts before the data references begin to change and ends after they have changed. I want to avoid both cases.
Something like:
...
127.0.0.1 - - [17/Sep/2015:21:32:41] "GET / HTTP/1.1" 200 7 "...
[17/Sep/2015:21:32:41] ENGINE Changing data, buffering should start!
[17/Sep/2015:21:32:42] ENGINE Continue serving buffered and new requests.
[17/Sep/2015:21:32:42] ENGINE I get 3
127.0.0.1 - - [17/Sep/2015:21:24:44] "GET / HTTP/1.1" 200 7 "...
...
I searched the documentation and the web and found these references, which do not completely cover this case:
http://www.defuze.org/archives/198-managing-your-process-with-the-cherrypy-bus.html
How to execute asynchronous post-processing in CherryPy?
http://tools.cherrypy.org/wiki/BackgroundTaskQueue
Cherrypy : which solutions for pages with large processing time
How to stop request processing in Cherrypy?
Update (with a simple solution):
After giving it more thought, I think the question is misleading, since it includes some implementation requirements in the question itself, namely: to stop processing and start buffering. For the actual problem the requirement can be simplified to: be sure that each request is processed with either the old data or the new data.
For the latter, it is enough to store a temporary local reference to the data in use. This reference can be used throughout the request processing, and it is no problem if another thread changes self.data in the meantime. For Python objects, the garbage collector will take care of the old data.
Specifically, it is enough to change the index function to:
@cherrypy.expose
def index(self):
    tempData = self.data
    result = "I started with %s" % str(tempData)
    time.sleep(3)  # heavy use of tempData
    result += " that changed to %s" % str(self.data)
    result += " but I am still using %s" % str(tempData)
    cherrypy.engine.log(result)
    return result
And as a result we will see:
[21/Sep/2015:10:06:00] ENGINE I started with 1 that changed to 2 but I am still using 1
I still want to keep the original (more restrictive) question and cyraxjoe's answer too, since I find those solutions very useful.
I'll explain two approaches that will solve the issue.
The first one is Plugin based.
Plugin based: this still needs a kind of synchronization. It only works because there is only one BackgroundTask making the modifications (and it is just an atomic operation).
import time
import threading

import cherrypy
from cherrypy.process import plugins

UPDATE_INTERVAL = 0.5
REQUEST_DELAY = 0.1
UPDATE_DELAY = 0.1
THREAD_POOL_SIZE = 20

next_data = 1


class DataGateway(plugins.SimplePlugin):

    def __init__(self, bus):
        super(DataGateway, self).__init__(bus)
        self.data = next_data

    def start(self):
        self.bus.log("Starting DataGateway")
        self.bus.subscribe('dg:get', self._get_data)
        self.bus.subscribe('dg:update', self._update_data)
        self.bus.log("DataGateway has been started")

    def stop(self):
        self.bus.log("Stopping DataGateway")
        self.bus.unsubscribe('dg:get', self._get_data)
        self.bus.unsubscribe('dg:update', self._update_data)
        self.bus.log("DataGateway has been stopped")

    def _update_data(self, new_val):
        self.bus.log("Changing data, buffering should start!")
        self.data = new_val
        time.sleep(UPDATE_DELAY)
        self.bus.log("Continue serving buffered and new requests.")

    def _get_data(self):
        return self.data


def processData():
    """Background task works for half an hour three times a day,
    and when it finishes it publishes the result on the engine bus."""
    global next_data
    cherrypy.engine.publish("dg:update", next_data)
    next_data += 1


class DataPublisher(object):

    @property
    def data(self):
        return cherrypy.engine.publish('dg:get').pop()

    @cherrypy.expose
    def index(self):
        result = "I get " + str(self.data)
        cherrypy.engine.log(result)
        time.sleep(REQUEST_DELAY)
        return result


if __name__ == '__main__':
    conf = {
        'global': {
            'server.thread_pool': THREAD_POOL_SIZE,
            'server.socket_host': '127.0.0.1',
            'server.socket_port': 8080,
        }
    }
    cherrypy.config.update(conf)
    DataGateway(cherrypy.engine).subscribe()
    plugins.BackgroundTask(UPDATE_INTERVAL, processData).start()
    cherrypy.quickstart(DataPublisher())
In this version the synchronization comes from the fact that both the read and write operations are executed on the cherrypy.engine thread. Everything is abstracted in the DataGateway plugin; you just operate on it by publishing into the engine.
The second approach uses a threading.Event object. This is a more manual approach with the added benefit that it's probably going to be faster, given that reads are faster because they don't execute over the cherrypy.engine thread.
threading.Event based (a.k.a. manual)
import time
import threading

import cherrypy
from cherrypy.process import plugins

UPDATE_INTERVAL = 0.5
REQUEST_DELAY = 0.1
UPDATE_DELAY = 0.1
THREAD_POOL_SIZE = 20

next_data = 1


def processData():
    """Background task works for half an hour three times a day,
    and when it finishes it publishes the result on the engine bus."""
    global next_data
    cherrypy.engine.publish("doChangeData", next_data)
    next_data += 1


class DataPublisher(object):

    def __init__(self):
        self._data = next_data
        self._data_readable = threading.Event()
        cherrypy.engine.subscribe('doChangeData', self.changeData)

    @property
    def data(self):
        if self._data_readable.is_set():
            return self._data
        else:
            self._data_readable.wait()
            return self.data

    @data.setter
    def data(self, value):
        self._data_readable.clear()
        time.sleep(UPDATE_DELAY)
        self._data = value
        self._data_readable.set()

    def changeData(self, newData):
        cherrypy.engine.log("Changing data, buffering should start!")
        self.data = newData
        cherrypy.engine.log("Continue serving buffered and new requests.")

    @cherrypy.expose
    def index(self):
        result = "I get " + str(self.data)
        cherrypy.engine.log(result)
        time.sleep(REQUEST_DELAY)
        return result


if __name__ == '__main__':
    conf = {
        'global': {
            'server.thread_pool': THREAD_POOL_SIZE,
            'server.socket_host': '127.0.0.1',
            'server.socket_port': 8080,
        }
    }
    cherrypy.config.update(conf)
    plugins.BackgroundTask(UPDATE_INTERVAL, processData).start()
    cherrypy.quickstart(DataPublisher())
I've added some niceties with the @property decorator, but the real gist is the threading.Event and the fact that the DataPublisher object is shared among the worker threads.
I also added the thread pool configuration required to increase the thread pool size in both examples. The default is 10.
As a way to test what I just said, you can execute this Python 3 script (if you don't have Python 3, now you have a pretext to install it). It will do 100 requests more or less concurrently, given the thread pool.
Test script
import time
import urllib.request
import concurrent.futures

URL = 'http://localhost:8080/'
TIMEOUT = 60
DELAY = 0.05
MAX_WORKERS = 20
REQ_RANGE = range(1, 101)


def load_url():
    with urllib.request.urlopen(URL, timeout=TIMEOUT) as conn:
        return conn.read()


with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    futures = {}
    for i in REQ_RANGE:
        print("Sending req {}".format(i))
        futures[executor.submit(load_url)] = i
        time.sleep(DELAY)

    results = []
    for future in concurrent.futures.as_completed(futures):
        try:
            data = future.result().decode()
        except Exception as exc:
            print(exc)
        else:
            results.append((futures[future], data))

curr_max = 0
for i, data in sorted(results, key=lambda r: r[0]):
    new_max = int(data.split()[-1])
    assert new_max >= curr_max, "The data was not updated correctly"
    print("Req {}: {}".format(i, data))
    curr_max = new_max
The way you determined that you have a problem, based on the log, is not trustworthy for this kind of problem, especially given that you don't have control over the time at which the request gets written to the "access" log. I couldn't make your code fail with my test code, but there is indeed a race condition in the general case; in this example it should work all the time because the code is just performing an atomic operation: a single attribute assignment, done periodically from a central point.
I hope the code is self-explanatory; in case you have a question, leave a comment.
EDIT: I edited the plugin-based approach to note that it only works because there is just one place executing the update. If you create another background task that updates the data, then it could have problems when you do something more than just an assignment. Regardless, the code could be what you are looking for if you only update from one BackgroundTask.
I've written a script that fetches URLs from a file and sends HTTP requests to all the URLs concurrently. I now want to limit the number of HTTP requests per second and the bandwidth per interface (eth0, eth1, etc.) in a session. Is there any way to achieve this in Python?
You could use a Semaphore object, which is part of the standard Python library:
python doc
Or if you want to work with threads directly, you could use wait([timeout]).
There is no library bundled with Python that can work at the level of the Ethernet or another network interface. The lowest level you can go is socket.
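A minimal sketch of the Semaphore idea, using only the standard library (the URL list and limit are placeholders). Note that this caps how many requests are in flight at once; it does not by itself enforce an exact requests-per-second rate or a bandwidth limit:

import threading
import urllib.request

MAX_IN_FLIGHT = 2                      # at most two concurrent requests
sem = threading.BoundedSemaphore(MAX_IN_FLIGHT)


def fetch(url):
    with sem:                          # acquired on entry, released when done
        with urllib.request.urlopen(url, timeout=10) as resp:
            print(url, resp.status)


urls = ["http://example.com/", "http://example.org/"]  # placeholder URLs
threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()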
Based on your reply, here's my suggestion. Notice active_count; use it only to test that your script runs only two request threads. In this case there will be three threads in total, because the first one is your main script and then you have two URL requests.
import time
import threading

import requests

# Limit the number of threads.
pool = threading.BoundedSemaphore(2)


def worker(u):
    # Request passed URL.
    r = requests.get(u)
    print(r.status_code)
    # Release lock for other threads.
    pool.release()
    # Show the number of active threads.
    print(threading.active_count())


def req():
    # Get URLs from a text file, remove white space.
    urls = [url.strip() for url in open('urllist.txt')]
    for u in urls:
        # Thread pool.
        # Blocks other threads (more than the set limit).
        pool.acquire(blocking=True)
        # Create a new thread.
        # Pass each URL (i.e. u parameter) to the worker function.
        t = threading.Thread(target=worker, args=(u, ))
        # Start the newly created thread.
        t.start()


req()
You could use the worker concept described in the documentation:
https://docs.python.org/3.4/library/queue.html
Add a wait/sleep call inside your workers to make them wait between requests (in the example from the documentation: inside the "while True", after the task_done).
Example: 5 worker threads with a waiting time of 1 second between requests will do fewer than 5 fetches per second.
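A minimal sketch of that worker pattern, assuming the standard library only (the URL list, worker count, and pause are placeholders):

import queue
import threading
import time
import urllib.request

NUM_WORKERS = 5
PAUSE_SECONDS = 1.0
work = queue.Queue()


def worker():
    while True:
        url = work.get()
        if url is None:                # sentinel: no more work for this worker
            work.task_done()
            return
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                print(url, resp.status)
        except Exception as exc:
            print(url, "failed:", exc)
        time.sleep(PAUSE_SECONDS)      # wait between requests -> at most NUM_WORKERS fetches/second
        work.task_done()


urls = ["http://example.com/", "http://example.org/"]  # placeholder URLs
for u in urls:
    work.put(u)
for _ in range(NUM_WORKERS):           # one sentinel per worker
    work.put(None)

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
work.join()                            # wait until every queued item has been processed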
Note: the solution below still sends the requests serially, but it limits the TPS (transactions per second).
TLDR;
There is a class which keeps a count of the number of calls that can still be made in the current second. It is decremented for every call that is made and refilled every second.
import time
from multiprocessing import Process, Value


# Naive TPS regulation
# This class holds a bucket of tokens which are refilled every second based on the expected TPS
class TPSBucket:

    def __init__(self, expected_tps):
        self.number_of_tokens = Value('i', 0)
        self.expected_tps = expected_tps
        self.bucket_refresh_process = Process(target=self.refill_bucket_per_second)  # process to constantly refill the TPS bucket

    def refill_bucket_per_second(self):
        while True:
            print("refill")
            self.refill_bucket()
            time.sleep(1)

    def refill_bucket(self):
        self.number_of_tokens.value = self.expected_tps
        print('bucket count after refill', self.number_of_tokens.value)

    def start(self):
        self.bucket_refresh_process.start()

    def stop(self):
        self.bucket_refresh_process.kill()

    def get_token(self):
        response = False
        if self.number_of_tokens.value > 0:
            with self.number_of_tokens.get_lock():
                if self.number_of_tokens.value > 0:
                    self.number_of_tokens.value -= 1
                    response = True
        return response


def test():
    tps_bucket = TPSBucket(expected_tps=1)  # Let's say I want to send 1 request per second
    tps_bucket.start()

    total_number_of_requests = 60  # Let's say I want to send 60 requests
    request_number = 0

    t0 = time.time()
    while True:
        if tps_bucket.get_token():
            request_number += 1
            print('Request', request_number)  # This is my request
            if request_number == total_number_of_requests:
                break
    print(time.time() - t0, ' time elapsed')  # Some metrics to tell me how long everything took
    tps_bucket.stop()


if __name__ == "__main__":
    test()