I have a list of 2000 start URLs and I'm using:
DOWNLOAD_DELAY = 0.25
to control the speed of the requests, but I also want to add a bigger delay after every n requests.
For example, I want a delay of 0.25 seconds between requests and a delay of 100 seconds every 500 requests.
Edit:
Sample code:
import os
from os.path import join
import scrapy
import time

date = time.strftime("%d/%m/%Y").replace('/', '_')

list_of_pages = {'http://www.lapatilla.com/site/': 'la_patilla',
                 'http://runrun.es/': 'runrunes',
                 'http://www.noticierodigital.com/': 'noticiero_digital',
                 'http://www.eluniversal.com/': 'el_universal',
                 'http://www.el-nacional.com/': 'el_nacional',
                 'http://globovision.com/': 'globovision',
                 'http://www.talcualdigital.com/': 'talcualdigital',
                 'http://www.maduradas.com/': 'maduradas',
                 'http://laiguana.tv/': 'laiguana',
                 'http://www.aporrea.org/': 'aporrea'}

root_dir = os.getcwd()
output_dir = join(root_dir, 'data/', date)

class TestSpider(scrapy.Spider):
    name = "news_spider"
    download_delay = 1
    start_urls = list_of_pages.keys()

    def parse(self, response):
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
        filename = list_of_pages[response.url]
        print(time.time())
        with open(join(output_dir, filename), 'wb') as f:
            f.write(response.body)
The list, in this case, is shorter, but the idea is the same. I want to have two levels of delays: one for each request and one every 'N' requests.
I'm not crawling the links, just saving the main page.
You can look into using the AutoThrottle extension, which does not give you tight control over the delays but instead has its own algorithm for slowing the spider down, adjusting it on the fly depending on response times and the number of concurrent requests.
If you need more control over the delays at certain stages of the scraping process, you might need a custom middleware or a custom extension (similar to AutoThrottle - source).
You can also change the .download_delay attribute of your spider on the fly. By the way, this is exactly what the AutoThrottle extension does under the hood - it updates the .download_delay value on the fly.
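For the specific "pause 100 seconds every 500 requests" requirement, one option is a small downloader middleware that counts outgoing requests and blocks for the extra delay on every N-th one. This is only a rough sketch (the class name and the BATCH_PAUSE_* setting names are my own inventions, and the blunt time.sleep blocks the whole Twisted reactor, so the entire crawl pauses - which is presumably what you want here):

import time

class BatchPauseMiddleware:
    """Hypothetical downloader middleware: sleep for extra_delay seconds every batch_size requests."""

    def __init__(self, batch_size, extra_delay):
        self.batch_size = batch_size
        self.extra_delay = extra_delay
        self.request_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        # BATCH_PAUSE_SIZE / BATCH_PAUSE_DELAY are made-up setting names
        return cls(
            crawler.settings.getint('BATCH_PAUSE_SIZE', 500),
            crawler.settings.getfloat('BATCH_PAUSE_DELAY', 100.0),
        )

    def process_request(self, request, spider):
        self.request_count += 1
        if self.request_count % self.batch_size == 0:
            spider.logger.info('Pausing %s seconds after %d requests',
                               self.extra_delay, self.request_count)
            time.sleep(self.extra_delay)  # blocks the reactor for the long pause
        return None  # let Scrapy continue handling the request normally

You would enable it via the DOWNLOADER_MIDDLEWARES setting, while the usual DOWNLOAD_DELAY = 0.25 keeps handling the per-request spacing.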
Some related topics:
Per request delay
Request delay configurable for each Request
Here's a sleepy decorator I wrote that pauses every N function calls (in this case it sleeps 20 seconds after every 500 calls):
from time import sleep

def sleepy(f):
    def wrapped(*args, **kwargs):
        wrapped.calls += 1
        print(f"{f.__name__} called {wrapped.calls} times")
        if wrapped.calls % 500 == 0:
            print("Sleeping...")
            sleep(20)
        return f(*args, **kwargs)
    wrapped.calls = 0
    return wrapped
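A hedged usage sketch, assuming it is applied to the TestSpider from the question - decorating parse makes the whole process pause every 500 parsed responses (again, sleep() blocks the reactor, so everything stops during the pause):

class TestSpider(scrapy.Spider):
    name = "news_spider"
    download_delay = 0.25
    start_urls = list_of_pages.keys()

    @sleepy
    def parse(self, response):
        filename = list_of_pages[response.url]
        with open(join(output_dir, filename), 'wb') as f:
            f.write(response.body)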
Related
I want to build a scraper that fetches information from a given set of webpages every 5 minutes. I implemented this by adding a sleep time of 5 minutes in between a recursive callback, like so:
def _parse(self, response):
    status_loader = ItemLoader(Status())
    # perform parsing
    yield status_loader.load_item()
    time.sleep(5)
    yield scrapy.Request(response._url, callback=self._parse, dont_filter=True, meta=response.meta)
However, adding time.sleep(5) to the scraper seems to mess with the inner workings of Scrapy. For some reason Scrapy does send out the request, but the yielded items are not (or only rarely) written to the given output file.
I was thinking it has to do with Scrapy's request prioritization, which might prioritize sending a new request over yielding the scraped items. Could this be the case? I tried editing the settings to go from a depth-first queue to a breadth-first queue. This did not solve the problem.
How would I go about scraping a website at a given interval, let's say 5 minutes?
It won't work this way because Scrapy is asynchronous by default: time.sleep() blocks the single reactor thread that drives all requests and item handling, so the whole crawl stalls instead of just that one callback.
Try setting up a cron-style job like this instead:
import logging
import subprocess
import sys
import time

import schedule

def subprocess_cmd(command):
    process = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
    proc_stdout = process.communicate()[0].strip()
    logging.info(proc_stdout)

def cron_run_win():
    # print('start scraping... ####')
    logging.info('start scraping... ####')
    subprocess_cmd('scrapy crawl <spider_name>')

def cron_run_linux():
    # print('start scraping... ####')
    logging.info('start scraping... ####')
    subprocess_cmd('scrapy crawl <spider_name>')

def cron_run():
    # startswith() avoids 'darwin' accidentally matching 'win'
    if sys.platform.startswith('win'):
        cron_run_win()
        schedule.every(5).minutes.do(cron_run_win)
    elif sys.platform.startswith('linux'):
        cron_run_linux()
        schedule.every(5).minutes.do(cron_run_linux)
    while True:
        schedule.run_pending()
        time.sleep(1)

cron_run()
This will run your desired spider every 5 minutes, depending on the OS you are using.
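If you would rather stay inside a single Python process instead of shelling out, a possible alternative is to let Twisted re-schedule the crawl. This is only a sketch: MySpider and the 300-second interval are placeholders, it assumes a regular Scrapy project so get_project_settings() works, and runs can overlap if a crawl takes longer than the interval:

from twisted.internet import reactor, task

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())

def crawl():
    # CrawlerRunner.crawl returns a Deferred; each call starts a fresh crawl
    return runner.crawl(MySpider)  # MySpider is a placeholder for your spider class

# run the crawl now and then every 300 seconds, without blocking the reactor
task.LoopingCall(crawl).start(300)
reactor.run()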
I've created a script in Scrapy to parse the titles of different sites listed in start_urls. The script is doing its job flawlessly.
What I wish to do now is let my script stop after two of the URLs are parsed, no matter how many URLs there are.
I've tried so far with:
import scrapy
from scrapy.crawler import CrawlerProcess

class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/", "https://www.yahoo.com/", "https://www.bing.com/"]

    def parse(self, response):
        yield {'title': response.css('title::text').get()}

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(TitleSpider)
    c.start()
How can I make my script stop when two of the listed urls are scraped?
Currently I see only one way to immediately stop this script - using the os._exit force-exit function:
import os
import scrapy
from scrapy.crawler import CrawlerProcess

class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/", "https://www.yahoo.com/", "https://www.bing.com/"]
    item_counter = 0

    def parse(self, response):
        yield {'title': response.css('title::text').get()}
        self.item_counter += 1
        print(self.item_counter)
        if self.item_counter >= 2:
            self.crawler.stats.close_spider(self, "2 items")
            os._exit(0)

if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    c.crawl(TitleSpider)
    c.start()
Other things that I tried, which did not give the required result (immediately stopping the script after 2 scraped items, with only 3 URLs in start_urls):
Passing the CrawlerProcess instance into the spider settings and calling CrawlerProcess.stop, reactor.stop, etc. and other methods from the parse method.
Using the CloseSpider extension (docs, source) with the following CrawlerProcess definition:
c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'EXTENSIONS': {
        'scrapy.extensions.closespider.CloseSpider': 500,
    },
    'CLOSESPIDER_ITEMCOUNT': 2
})
Reducing the CONCURRENT_REQUESTS setting to 1 (together with a raise CloseSpider condition in the parse method). When the application has scraped 2 items and reaches the line with raise CloseSpider, the 3rd request has already been sent. Using the conventional way to stop the spider, the application stays active until it has processed the previously sent requests and their responses, and only after that does it close.
As your application has a relatively low number of URLs in start_urls, it starts processing all of them long before it reaches raise CloseSpider.
As Gallaecio proposed, you can add a counter, but the difference here is that you export an item after the if statement. This way, it will almost always end up exporting 2 items.
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import CloseSpider

class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/", "https://www.yahoo.com/", "https://www.bing.com/"]
    item_limit = 2

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.counter = 0

    def parse(self, response):
        self.counter += 1
        if self.counter > self.item_limit:
            raise CloseSpider
        yield {'title': response.css('title::text').get()}
Why almost always? you may ask. It has to do with a race condition in the parse method.
Imagine that self.counter is currently equal to 1, which means that one more item is expected to be exported. But now Scrapy receives two responses at the same moment and invokes the parse method for both of them. If the two threads running the parse method increase the counter simultaneously, they will both see self.counter equal to 3 and thus will both raise the CloseSpider exception.
In this case (which is very unlikely, but can still happen), the spider will export only one item.
Building on top of https://stackoverflow.com/a/38331733/939364, you can define a counter in the constructor of your spider, and use parse to increase it and raise CloseSpider when it reaches 2:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import CloseSpider  # 1. Import CloseSpider

class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/", "https://www.yahoo.com/", "https://www.bing.com/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.counter = 0  # 2. Define a self.counter property

    def parse(self, response):
        yield {'title': response.css('title::text').get()}
        self.counter += 1  # 3. Increase the count on each parsed URL
        if self.counter >= 2:
            raise CloseSpider  # 4. Raise CloseSpider after 2 URLs are parsed

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(TitleSpider)
    c.start()
I am not 100% certain that it will prevent a third URL from being parsed, because I think CloseSpider stops new requests from starting but waits for requests already in flight to finish.
If you want to prevent more than 2 items from being scraped, you can edit parse not to yield items when self.counter > 2.
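A minimal sketch of that variant of parse (essentially the same guard used in the earlier counter-based answer):

def parse(self, response):
    self.counter += 1
    if self.counter > 2:
        raise CloseSpider  # already have 2 items; do not yield any more
    yield {'title': response.css('title::text').get()}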
enumerate does the job fine here, with some changes to the architecture:

for cnt, url in enumerate(start_urls):
    if cnt > 1:
        break
    parse(url)
I'm new to Scrapy. I have thousands of (url, xpath) tuples and values in a database.
These URLs are from different domains (not always; there can be 100 URLs from the same domain).
x.com/a //h1
y.com/a //div[@class='1']
z.com/a //div[@href='...']
x.com/b //h1
x.com/c //h1
...
Now I want to fetch these values every 2 hours, as fast as possible, but without overloading any of these domains.
I can't figure out how to do that.
My thoughts:
I could create one spider for every different domain, set its parsing rules and run them all at once.
Is that good practice?
EDIT:
I'm not sure how it would work out with writing the data into the database, given the concurrency.
EDIT2:
I can do something like this - a new spider for every domain. But this is impossible to do when there are thousands of different URLs and their xpaths.
import scrapy
from scrapy.selector import HtmlXPathSelector  # older Scrapy API; response.xpath() is the modern equivalent

class WikiScraper(scrapy.Spider):
    name = "wiki_headers"

    def start_requests(self):
        urls = [
            'https://en.wikipedia.org/wiki/Spider',
            'https://en.wikipedia.org/wiki/Data_scraping',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        header = hxs.select('//h1/text()').extract()
        print(header)
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)

class CraigslistScraper(scrapy.Spider):
    name = "craigslist_headers"

    def start_requests(self):
        urls = [
            'https://columbusga.craigslist.org/act/6062657418.html',
            'https://columbusga.craigslist.org/acc/6060297390.html',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        header = hxs.select('//span[@id="titletextonly"]/text()').extract()
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)
From the example you posted in edit2, it looks like all your classes can easily be abstracted by one more level. How about this?
from urllib.parse import urlparse

import scrapy
from scrapy.selector import HtmlXPathSelector

class GenericScraper(scrapy.Spider):

    def __init__(self, urls, xpath):
        # the name has to be set before Spider.__init__ runs
        self.name = self._create_scraper_name_from_url(urls[0])
        super().__init__()
        self.urls = urls
        self.xpath = xpath

    def _create_scraper_name_from_url(self, url):
        '''Generate scraper name from url:
        www.example.com/foobar/bar -> www_example_com'''
        netloc = urlparse(url).netloc
        return netloc.replace('.', '_')

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        header = hxs.select(self.xpath).extract()
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)
Next, you could group the data from the database by xpath:
for urls, xpath in grouped_data:
    scraper = GenericScraper(urls, xpath)
    # do whatever you need with scraper
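For instance, a hedged sketch of handing the grouped spiders to a single CrawlerProcess (grouped_data is assumed to yield (urls, xpath) pairs as above; CrawlerProcess.crawl forwards the extra arguments to the spider's constructor):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
for urls, xpath in grouped_data:
    # each group becomes one GenericScraper; Scrapy schedules all of them together
    process.crawl(GenericScraper, urls, xpath)
process.start()  # blocks until every spider has finished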
Regarding concurrency: your database should handle concurrent writes, so I do not see a problem there.
Edit:
Related to the timeouts: I do not know how Scrapy works under the hood, i.e. whether it uses some sort of parallelization and whether it runs asynchronously in the background. But from what you wrote I guess it does, and when you fire up 1k scrapers, each firing multiple requests at a time, your hardware can't handle that much traffic (disclaimer: this is just a guess!).
There might be a native way to do this, but a possible workaround is to use multiprocessing + Queue:
from multiprocessing import JoinableQueue, Process

NUMBER_OF_CPU = 4  # change this to your number.
SENTINEL = None

class Worker(Process):

    def __init__(self, queue):
        super().__init__()
        self.queue = queue

    def run(self):
        while True:
            # blocking wait! You have to use sentinels if you use blocking waits!
            item = self.queue.get()
            if item is SENTINEL:
                # we got the sentinel, there are no more scrapers to process
                self.queue.task_done()
                return
            # item is a scraper, run it
            item.run_spider()  # or however you run your scrapers
            # This assumes that each scraper is **not** running in the background!

            # Tell the JoinableQueue we have processed one more item.
            # In the main thread, queue.join() waits until queue.task_done()
            # has been called once for each item taken from the queue.
            self.queue.task_done()

def run():
    queue = JoinableQueue()
    # If putting that many things in the queue gets slow (I imagine it can),
    # you can fire up a separate Thread/Process to fill the queue in the
    # background while the workers are already consuming it.
    for urls, xpath in grouped_data:
        scraper = GenericScraper(urls, xpath)
        queue.put(scraper)
    for sentinel in range(NUMBER_OF_CPU):
        # None, or a sentinel of your choice, to tell the workers there are
        # no more scrapers to process
        queue.put(SENTINEL)

    workers = []
    for _ in range(NUMBER_OF_CPU):
        worker = Worker(queue)
        workers.append(worker)
        worker.start()

    # We have to wait until the queue is processed
    queue.join()
But please bear in mind that this is a vanilla approach to parallel execution that completely ignores Scrapy's abilities. I have found this blog post which uses Twisted to achieve (what I think is) the same thing. But since I've never used Twisted I can't comment on that.
If you are thinking that Scrapy can't handle multiple domains at once because of the allowed_domains parameter, remember that it is optional.
If no allowed_domains parameter is set in the spider, it can work with every domain it gets.
If I understand correctly, you have a map of domain to xpath values and you want to pull the right xpath depending on which domain you crawl?
Try something like:
import logging

DOMAIN_DATA = [('domain.com', '//div')]

def get_domain(url):
    for domain, xpath in DOMAIN_DATA:
        if domain in url:
            return xpath

def parse(self, response):
    xpath = get_domain(response.url)
    if not xpath:
        logging.error('no xpath for url: {}; unknown domain'.format(response.url))
        return
    item = dict()
    item['some_field'] = response.xpath(xpath).extract()
    yield item
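If the database rows already pair each full URL with its xpath (as in the question), a related sketch is to skip the domain lookup entirely and carry the xpath along with each request via Request.meta. Here url_xpath_pairs is a placeholder for whatever your database query returns:

import scrapy

class XpathPerUrlSpider(scrapy.Spider):
    name = 'xpath_per_url'

    def start_requests(self):
        # url_xpath_pairs is a hypothetical iterable of (url, xpath) rows from the database
        for url, xpath in url_xpath_pairs:
            yield scrapy.Request(url, callback=self.parse, meta={'xpath': xpath})

    def parse(self, response):
        xpath = response.meta['xpath']
        yield {'url': response.url, 'value': response.xpath(xpath).extract()}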
I've written a script that fetches URLs from a file and sends HTTP requests to all the URLs concurrently. I now want to limit the number of HTTP requests per second and the bandwidth per interface (eth0, eth1, etc.) in a session. Is there any way to achieve this in Python?
You could use the Semaphore object, which is part of the standard Python library:
python doc
Or if you want to work with threads directly, you could use wait([timeout]).
There is no library bundled with Python that can work at the level of an Ethernet or other network interface. The lowest you can go is socket.
Based on your reply, here's my suggestion. Notice the active_count. Use this only to test that your script runs only two threads. Well, in this case there will be three, because number one is your script and then you have two URL requests.
import time
import requests
import threading

# Limit the number of threads.
pool = threading.BoundedSemaphore(2)

def worker(u):
    # Request the passed URL.
    r = requests.get(u)
    print(r.status_code)
    # Release the lock for other threads.
    pool.release()
    # Show the number of active threads.
    print(threading.active_count())

def req():
    # Get URLs from a text file, remove whitespace.
    urls = [url.strip() for url in open('urllist.txt')]
    for u in urls:
        # Thread pool.
        # Blocks other threads (more than the set limit).
        pool.acquire(blocking=True)
        # Create a new thread.
        # Pass each URL (i.e. the u parameter) to the worker function.
        t = threading.Thread(target=worker, args=(u,))
        # Start the newly created thread.
        t.start()

req()
You could use a worker concept as described in the documentation:
https://docs.python.org/3.4/library/queue.html
Add a wait (e.g. time.sleep()) inside your workers so they pause between the requests (in the example from the documentation: inside the "while true", after the task_done).
Example: 5 worker threads with a waiting time of 1 second between requests will do fewer than 5 fetches per second.
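A minimal sketch of that idea, assuming the URLs are already in a Python list and using the numbers from the example above (5 worker threads, a 1-second pause per worker between fetches):

import queue
import threading
import time

import requests

NUM_WORKERS = 5
PAUSE_SECONDS = 1.0

url_queue = queue.Queue()

def worker():
    while True:
        url = url_queue.get()
        if url is None:  # sentinel: no more work
            url_queue.task_done()
            return
        try:
            response = requests.get(url)
            print(url, response.status_code)
        finally:
            url_queue.task_done()
        time.sleep(PAUSE_SECONDS)  # wait between requests, capping this worker's rate

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

for url in ['https://www.example.com/'] * 10:  # placeholder URL list
    url_queue.put(url)
for _ in range(NUM_WORKERS):
    url_queue.put(None)  # one sentinel per worker

url_queue.join()  # wait until every URL has been processed
for t in threads:
    t.join()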
Note: the solution below still sends the requests serially, but limits the TPS (transactions per second).
TLDR;
There is a class which keeps a count of the number of calls that can still be made in the current second. It is decremented for every call that is made and refilled every second.
import time
from multiprocessing import Process, Value

# Naive TPS regulation.
# This class holds a bucket of tokens which are refilled every second based on the expected TPS.
class TPSBucket:

    def __init__(self, expected_tps):
        self.number_of_tokens = Value('i', 0)
        self.expected_tps = expected_tps
        # process to constantly refill the TPS bucket
        self.bucket_refresh_process = Process(target=self.refill_bucket_per_second)

    def refill_bucket_per_second(self):
        while True:
            print("refill")
            self.refill_bucket()
            time.sleep(1)

    def refill_bucket(self):
        self.number_of_tokens.value = self.expected_tps
        print('bucket count after refill', self.number_of_tokens.value)

    def start(self):
        self.bucket_refresh_process.start()

    def stop(self):
        self.bucket_refresh_process.kill()

    def get_token(self):
        response = False
        if self.number_of_tokens.value > 0:
            with self.number_of_tokens.get_lock():
                if self.number_of_tokens.value > 0:
                    self.number_of_tokens.value -= 1
                    response = True
        return response

def test():
    tps_bucket = TPSBucket(expected_tps=1)  # Let's say I want to send 1 request per second
    tps_bucket.start()
    total_number_of_requests = 60  # Let's say I want to send 60 requests
    request_number = 0
    t0 = time.time()
    while True:
        if tps_bucket.get_token():
            request_number += 1
            print('Request', request_number)  # This is my request
            if request_number == total_number_of_requests:
                break
    print(time.time() - t0, 'time elapsed')  # Some metrics to tell me how long everything took
    tps_bucket.stop()

if __name__ == "__main__":
    test()