I want to write a scraper that collects information from a given set of webpages every 5 minutes. I implemented this by adding a sleep between recursive callbacks, like so:
def _parse(self, response):
    status_loader = ItemLoader(Status())
    # perform parsing
    yield status_loader.load_item()
    time.sleep(5)
    yield scrapy.Request(response._url, callback=self._parse, dont_filter=True, meta=response.meta)
However, adding time.sleep(5) to the scraper seems to interfere with the inner workings of Scrapy. Scrapy does send out the request, but the yielded items are not (or only rarely) written to the given output file.
I suspected this has to do with Scrapy's request prioritization, which might prioritize sending a new request over yielding the scraped items. Could this be the case? I tried changing the settings to go from a depth-first queue to a breadth-first queue. This did not solve the problem.
How would I go about scraping a website at a given interval, let's say 5 minutes?
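It won't work because Scrapy is asynchronous: time.sleep() blocks the single thread that Scrapy's event loop runs in, so nothing else (including writing out items) can happen while it sleeps.
Try setting up a cron-style job like this instead: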
import logging
import subprocess
import sys
import time

import schedule  # third-party package: pip install schedule

# Without this, logging.info() messages are hidden by the default WARNING level.
logging.basicConfig(level=logging.INFO)

def subprocess_cmd(command):
    # Run the command in a shell and log whatever it printed to stdout.
    process = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
    proc_stdout = process.communicate()[0].strip()
    logging.info(proc_stdout)

def cron_run_win():
    logging.info('start scraping... ####')
    subprocess_cmd('scrapy crawl <spider_name>')

def cron_run_linux():
    logging.info('start scraping... ####')
    subprocess_cmd('scrapy crawl <spider_name>')

def cron_run():
    # startswith() avoids matching 'darwin' (macOS), which also contains 'win'.
    if sys.platform.startswith('win'):
        cron_run_win()
        schedule.every(5).minutes.do(cron_run_win)
    elif sys.platform.startswith('linux'):
        cron_run_linux()
        schedule.every(5).minutes.do(cron_run_linux)
    # Check once per second whether a scheduled run is due.
    while True:
        schedule.run_pending()
        time.sleep(1)

cron_run()
This will run your spider every 5 minutes on either Windows or Linux.
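If you would rather not depend on the schedule package, the same idea can be sketched with the standard library only. This is a minimal sketch, not a definitive implementation: run_every and crawl_once are hypothetical helper names, "my_spider" is a placeholder spider name, and the max_runs and sleep parameters exist only so the loop can be exercised without actually waiting.

```python
import subprocess
import time


def run_every(interval_seconds, job, max_runs=None, sleep=time.sleep):
    """Call job() repeatedly, waiting interval_seconds between calls.

    With max_runs=None the loop runs until the process is interrupted.
    max_runs and sleep are hypothetical knobs added for testing.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        job()
        runs += 1
        # Skip the final wait when we already know no further run follows.
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs


def crawl_once(spider_name="my_spider"):
    """Run one complete crawl in a child process (placeholder spider name)."""
    return subprocess.run(["scrapy", "crawl", spider_name], check=False)
```

Calling run_every(300, crawl_once) would then start a fresh crawl every 5 minutes. Because each crawl runs in its own process, the sleep happens outside Scrapy and cannot stall item output.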
Related
I am using Scrapy for scraping text from websites.
I would like Scrapy to scrape text from various URLs with different structure, without having to change the code for each website.
The following example works in my Jupyter Notebook for the given URL ( http://quotes.toscrape.com/tag/humor/ ). But it does not work for another (for example, https://en.wikipedia.org/wiki/Web_scraping ).
My question is: how can I make it work for (most) other websites without manually inspecting every site and changing the code each time? I guess I need to make a change under def parse(self, response), but so far I could not find a good example of how to do that.
Modules:
import scrapy
import scrapy.crawler as crawler
from multiprocessing import Process, Queue
from twisted.internet import reactor
Spider:
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())
A wrapper to make it run more times in Jupyter:
def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result
Get the result:
print('Extracted text:')
run_spider(QuotesSpider)
Extracted text:
“The person, be it gentleman or lady, who has not pleasure in a good novel, ..."
I've created a simple python program that scrapes my favorite recipe website and returns the individual recipe URLs from the main site. While this is a relatively quick and simple process, I've tried scaling this out to scrape multiple webpages within the site. When I do this, it takes about 45 seconds to scrape all of the recipe URLs from the whole site. I'd like this process to be much quicker so I tried implementing threads into my program.
I realize there is something wrong here, as each thread returns the whole list of URLs over and over again instead of 'splitting up' the work. Does anyone have any suggestions on how to better implement the threads? I've included my work below. Using Python 3.
from bs4 import BeautifulSoup
import urllib.request
from urllib.request import urlopen
from datetime import datetime
import threading

startTime = datetime.now()
quote_page = 'http://thepioneerwoman.com/cooking_cat/all-pw-recipes/'
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
all_recipe_links = []

# get all recipe links on current page
def get_recipe_links():
    for link in soup.find_all('a', attrs={'post-card-permalink'}):
        if link.has_attr('href'):
            if 'cooking/' in link.attrs['href']:
                all_recipe_links.append(link.attrs['href'])
    print(datetime.now() - startTime)
    return all_recipe_links

def worker():
    """thread worker function"""
    print(get_recipe_links())
    return

threads = []
for i in range(5):
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()
I was able to distribute the work to the workers by having the workers all process data from a single list, instead of having them all run the whole method individually. Below are the parts that I changed. The method get_recipe_links is no longer needed, since its tasks have been moved to other methods.
all_recipe_links = []
links_to_process = []

def worker():
    """thread worker function"""
    while True:
        try:
            link = links_to_process.pop()
        except IndexError:
            break  # another worker already took the last link
        if link.has_attr('href'):
            if 'cooking/' in link.attrs['href']:
                all_recipe_links.append(link.attrs['href'])

threads = []
links_to_process = soup.find_all('a', attrs={'post-card-permalink'})
for i in range(5):
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()

# Wait for every worker to finish instead of busy-looping on the list length.
for t in threads:
    t.join()

print(all_recipe_links)
I ran the new methods several times; on average they take 0.02 seconds to run.
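The same "all workers pull from one shared pool" idea can also be sketched with queue.Queue, which handles the locking around handing out work. This is only an illustration under a few assumptions: collect_recipe_links is a hypothetical helper, it works on plain href strings rather than BeautifulSoup tags, and the example.com URLs are made-up sample data.

```python
import queue
import threading


def collect_recipe_links(hrefs, num_workers=5):
    """Filter hrefs in parallel; each href is handled by exactly one worker."""
    todo = queue.Queue()
    for href in hrefs:
        todo.put(href)

    found = []
    found_lock = threading.Lock()

    def worker():
        while True:
            try:
                href = todo.get_nowait()
            except queue.Empty:
                return  # no work left for this worker
            if "cooking/" in href:
                with found_lock:
                    found.append(href)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every worker has drained the queue
    return found


# Made-up sample data standing in for the hrefs found on the page.
sample_hrefs = [
    "http://example.com/cooking/pie/",
    "http://example.com/about/",
    "http://example.com/cooking/soup/",
]
print(sorted(collect_recipe_links(sample_hrefs)))
```

The order in which workers append results is not guaranteed, which is why the usage line sorts the output before printing.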
I have to call the crawler from another Python file, for which I use the following code:
def crawl_koovs():
    spider = SomeSpider()
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()
Running this gives the following error:
exceptions.ValueError: signal only works in main thread
The only workaround I could find is to use
reactor.run(installSignalHandlers=False)
which I don't want to use as I want to call this method multiple times and want reactor to be stopped before the next call. What can I do to make this work (maybe force the crawler to start in the same 'main' thread)?
First, when you run Scrapy from an external file the log level is set to INFO. Change it to DEBUG to see what is happening if your code doesn't work.
Change the line:
log.start()
to:
log.start(loglevel=log.DEBUG)
To store everything in a log file (for debugging purposes), you can do:
log.start(logfile="file.log", loglevel=log.DEBUG, crawler=crawler, logstdout=False)
As for the signals issue, with the log level set to DEBUG you may see output that helps you fix it. You can also try putting your script into the Scrapy project folder to see whether it still crashes.
What happens if you change the line:
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
to:
dispatcher.connect(reactor.stop, signals.spider_closed)
Depending on your Scrapy version, this may be deprecated.
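For background on the ValueError in the question: by default reactor.run() installs signal handlers, and Python only allows that from the main thread. The restriction can be reproduced without Scrapy or Twisted. This is a minimal sketch using only the standard library; install_handler and errors are illustrative names.

```python
import signal
import threading

errors = []


def install_handler():
    """Try to register a SIGINT handler from whichever thread runs this."""
    try:
        signal.signal(signal.SIGINT, signal.default_int_handler)
    except ValueError as exc:
        # Python refuses to install signal handlers outside the main thread.
        errors.append(exc)


# Run the registration from a worker thread, as happens when the
# crawl function is called from a non-main thread.
worker = threading.Thread(target=install_handler)
worker.start()
worker.join()

print(type(errors[0]).__name__, errors[0])
```

This is why reactor.run(installSignalHandlers=False) avoids the error: it skips the handler registration entirely.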
For looping, including inside an Azure Functions TimerTrigger, you can use this Twisted task:
from twisted.internet import task
from twisted.internet import reactor

loopTimes = 3
failInTheEnd = False
_loopCounter = 0

def runEverySecond():
    """
    Called at every loop interval.
    """
    global _loopCounter

    if _loopCounter < loopTimes:
        _loopCounter += 1
        print('A new second has passed.')
        return

    if failInTheEnd:
        raise Exception('Failure during loop execution.')

    # We looped enough times.
    loop.stop()
    return

def cbLoopDone(result):
    """
    Called when loop was stopped with success.
    """
    print("Loop done.")
    reactor.stop()

def ebLoopFailed(failure):
    """
    Called when loop execution failed.
    """
    print(failure.getBriefTraceback())
    reactor.stop()

loop = task.LoopingCall(runEverySecond)

# Start looping every 1 second.
loopDeferred = loop.start(1.0)

# Add callbacks for stop and failure.
loopDeferred.addCallback(cbLoopDone)
loopDeferred.addErrback(ebLoopFailed)

reactor.run()
As the Twisted documentation puts it: if we want a task to run every X seconds repeatedly, we can use twisted.internet.task.LoopingCall.
Source: https://docs.twisted.org/en/stable/core/howto/time.html
Here's the Python script that I am using to call Scrapy, based on the answer to "Scrapy crawl from script always blocks script execution after scraping":
def stop_reactor():
    reactor.stop()
dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider(start_url='abc')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg('Running reactor...')
reactor.run() # the script will block here until the spider is closed
log.msg('Reactor stopped.')
Here's my pipelines.py code:
from scrapy import log, signals
from scrapy.contrib.exporter import JsonItemExporter
from scrapy.xlib.pydispatch import dispatcher

class scrapermar11Pipeline(object):
    def __init__(self):
        self.files = {}
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_opened(self, spider):
        file = open('links_pipelines.json', 'wb')
        self.files[spider] = file
        self.exporter = JsonItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        log.msg('It reached here')
        return item
This code is taken from "Scrapy :: Issues with JSON export".
When I run the crawler like this:
scrapy crawl MySpider -a start_url='abc'
a links file with the expected output is created. But when I execute the Python script, no file is created, even though the crawler runs: the dumped Scrapy stats are similar to those of the previous run.
I think there's a mistake in the Python script, since the file is created with the first approach. How do I get the script to output the file?
This code worked for me:
from scrapy import signals, log
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.http import Request
from multiprocessing.queues import Queue
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process
from twisted.internet import reactor  # used by handleSpiderIdle below
# import your spider here
def handleSpiderIdle(spider):
    reactor.stop()
mySettings = {'LOG_ENABLED': True, 'ITEM_PIPELINES': '<name of your project>.pipelines.scrapermar11Pipeline'}
settings.overrides.update(mySettings)
crawlerProcess = CrawlerProcess(settings)
crawlerProcess.install()
crawlerProcess.configure()
spider = <nameofyourspider>(domain="") # create a spider ourselves
crawlerProcess.crawl(spider) # add it to spiders pool
dispatcher.connect(handleSpiderIdle, signals.spider_idle) # use this if you need to handle idle event (restart spider?)
log.start() # depends on LOG_ENABLED
print "Starting crawler."
crawlerProcess.start()
print "Crawler stopped."
A solution that worked for me was to ditch the run script and the internal API, and instead use the command line with GNU Parallel to parallelize.
To run all known spiders, one per core:
scrapy list | parallel --line-buffer scrapy crawl
scrapy list lists all spiders, one per line, which lets us pipe them as arguments to append to a command (scrapy crawl) passed to GNU Parallel. --line-buffer means that output received back from the processes will be printed to stdout interleaved, but on a line-by-line basis rather than with quarter or half lines garbled together (for other options, look at --group and --ungroup).
NB: this works best on machines that have multiple CPU cores, since by default GNU Parallel runs one job per core. Note that, unlike many modern development machines, the cheap AWS EC2 and DigitalOcean tiers only have one virtual CPU core. So if you wish to run jobs simultaneously on one core, you will have to play with the --jobs argument to GNU Parallel. For example, to run 2 Scrapy crawlers per core:
scrapy list | parallel --jobs 200% --line-buffer scrapy crawl
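If GNU Parallel is not available, a rough Python analogue of the same idea (one scrapy crawl process per spider, a bounded number at a time) might look like the sketch below. It is not a replacement for GNU Parallel: crawl_all is a hypothetical helper, the spider names would have to come from scrapy list, the run parameter exists only so the function can be tried without Scrapy installed, and nothing here tidies up interleaved output the way --line-buffer does.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def crawl_all(spider_names, jobs=None, run=subprocess.run):
    """Run `scrapy crawl <name>` for each spider, at most `jobs` at a time."""
    if jobs is None:
        jobs = os.cpu_count() or 1  # mirror the one-job-per-core default
    commands = [["scrapy", "crawl", name] for name in spider_names]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Each thread only waits on its child process, so threads are enough;
        # the crawling itself happens in the separate scrapy processes.
        return list(pool.map(run, commands))
```

Passing jobs=2 * (os.cpu_count() or 1) would correspond roughly to the two-crawlers-per-core setup described above.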
My Scrapy script seems to work just fine when I run it in 'one off' scenarios from the command line, but if I try running the code twice in the same python session I get this error:
"ReactorNotRestartable"
Why?
The offending code (last line throws the error):
crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()
# schedule spider
#crawler.crawl(MySpider())
spider = MySpider()
crawler.queue.append_spider(spider)
# start engine scrapy/twisted
crawler.start()
Close to Joël's answer, but I want to elaborate a bit more than is possible in the comments. If you look at the Crawler source code, you see that the CrawlerProcess class has a start, but also a stop function. This stop function takes care of cleaning up the internals of the crawling so that the system ends up in a state from which it can start again.
So, if you want to restart the crawling without leaving your process, call crawler.stop() at the appropriate time. Later on, simply call crawler.start() again to resume operations.
Edit: in retrospect, this is not possible (due to the Twisted reactor, as mentioned in a different answer); the stop just takes care of a clean termination. Looking back at my code, I happened to have a wrapper for the Crawler processes. Below you can find some (redacted) code to make it work using Python's multiprocessing module. In this way you can more easily restart crawlers. (Note: I found the code online last month, but I didn't include the source... so if someone knows where it came from, I'll update the credits for the source.)
from scrapy import project, signals
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from multiprocessing.queues import Queue
from multiprocessing import Process

class CrawlerWorker(Process):
    def __init__(self, spider, results):
        Process.__init__(self)
        self.results = results

        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        self.items = []
        self.spider = spider
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.results.put(self.items)

# The part below can be called as often as you want
results = Queue()
crawler = CrawlerWorker(MySpider(myArgs), results)
crawler.start()
for item in results.get():
    pass # Do something with item
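The reason the wrapper above helps is process isolation: every crawl gets a brand-new process, and therefore a reactor that has never been started. The principle can be sketched with the standard library alone. In this sketch run_isolated is a hypothetical helper and child_code is a stand-in for a real crawl script; it only reports which process it ran in.

```python
import os
import subprocess
import sys


def run_isolated(code):
    """Run a Python snippet in a brand-new interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Stand-in for a crawl script: it only prints the ID of its own process.
child_code = "import os; print(os.getpid())"

first_pid = int(run_isolated(child_code))
second_pid = int(run_isolated(child_code))

# Neither run happened inside the calling process, so no state
# (such as an already-stopped reactor) can leak between runs.
print(first_pid != os.getpid(), second_pid != os.getpid())
```

In a real setup the child would contain the crawler start-up code, and results would be passed back through a queue or a file, as in the multiprocessing version.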
crawler.start() starts the Twisted reactor. There can be only one reactor.
If you want to run more spiders, use:
another_spider = MyAnotherSpider()
crawler.queue.append_spider(another_spider)
I've used threads to start the reactor several times in one app and avoid the ReactorNotRestartable error:
Thread(target=process.start).start()
Here is the detailed explanation: Run a Scrapy spider in a Celery Task
It seems to me that you cannot call crawler.start() twice: you may have to re-create the crawler if you want it to run a second time.