I have a problem. I need to stop the execution of a function for a while, but not halt parsing as a whole. That is, I need a non-blocking pause.
It looks like this:
class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):
        for url in ['url1', 'url2', 'url3', 'more urls']:
            yield Request(url, callback=self.second_parse_function)

        # Here I need something that pauses only this function, like time.sleep(10)

        yield Request('some url', callback=self.non_stop_function)  # Call itself

    def second_parse_function(self, response):
        pass
The non_stop_function needs to be paused for a while, but it should not block the rest of the output.
If I insert time.sleep(), it stops the whole parser, which is not what I need. Is it possible to pause just one function using Twisted or something else?
Reason: I need to create a non-blocking function that will parse a page of the website every n seconds. There it will collect URLs and pause for 10 seconds. The URLs that have already been obtained will continue to be processed, but the main function needs to sleep.
UPDATE:
Thanks to TkTech and viach. One answer helped me understand how to make a pending Request, and the other showed how to activate it. Both answers complement each other, and together they make an excellent non-blocking pause for Scrapy:
def call_after_pause(self, response):
    d = Deferred()
    reactor.callLater(10.0, d.callback, Request(
        'https://example.com/',
        callback=self.non_stop_function,
        dont_filter=True))
    return d
And I use this function for my request:
yield Request('https://example.com/', callback=self.call_after_pause, dont_filter=True)
The Request object has a callback parameter; try using it for this purpose. That is, create a Deferred which wraps self.second_parse_function and a pause.
Here is my rough, untested example; the changed lines are marked.
class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):
        parse_and_pause = Deferred()                              # changed
        parse_and_pause.addCallback(self.second_parse_function)  # changed
        parse_and_pause.addCallback(pause, seconds=10)            # changed

        for url in ['url1', 'url2', 'url3', 'more urls']:
            yield Request(url, callback=parse_and_pause)          # changed

        yield Request('some url', callback=self.non_stop_function)  # Call itself

    def second_parse_function(self, response):
        pass
If this approach works for you, you can create a helper function which constructs such a Deferred object according to the rule. It could be implemented like the following:
def get_perform_and_pause_deferred(seconds, fn, *args, **kwargs):
    d = Deferred()
    d.addCallback(fn, *args, **kwargs)
    d.addCallback(pause, seconds=seconds)
    return d
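Note that pause is not defined anywhere in this answer. A minimal sketch of what it might look like, assuming it should simply delay the callback chain and pass the previous result through, could use Twisted's deferLater:
from twisted.internet import reactor
from twisted.internet.task import deferLater

def pause(result, seconds=10):
    # return a Deferred that fires with the same result after the delay,
    # pausing the chain without blocking the reactor
    return deferLater(reactor, seconds, lambda: result)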
And here is a possible usage:
class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):
        for url in ['url1', 'url2', 'url3', 'more urls']:
            # changed
            yield Request(url, callback=get_perform_and_pause_deferred(10, self.second_parse_function))

        yield Request('some url', callback=self.non_stop_function)  # Call itself

    def second_parse_function(self, response):
        pass
If you're attempting to use this for rate limiting, you probably just want to use DOWNLOAD_DELAY instead.
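For reference, DOWNLOAD_DELAY is just a setting; a minimal sketch using the question's spider (the 10-second value mirrors the question, and RANDOMIZE_DOWNLOAD_DELAY is optional):
class ScrapySpider(Spider):
    name = 'live_function'

    custom_settings = {
        'DOWNLOAD_DELAY': 10,               # wait ~10 seconds between requests to the same site
        'RANDOMIZE_DOWNLOAD_DELAY': False,  # keep the delay fixed instead of the default 0.5x-1.5x jitter
    }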
Scrapy is just a framework on top of Twisted. For the most part, you can treat it the same as any other Twisted app. Instead of calling sleep, just return the next request to make and tell Twisted to wait a bit. Ex:
from twisted.internet import reactor, defer

def non_stop_function(self, response):
    d = defer.Deferred()
    reactor.callLater(10.0, d.callback, Request(
        'some url',
        callback=self.non_stop_function
    ))
    return d
The asker already provides an answer in the question's update, but I want to give a slightly better version so it's reusable for any request.
# removed...
from twisted.internet import reactor, defer

class MySpider(scrapy.Spider):
    # removed...

    def request_with_pause(self, response):
        d = defer.Deferred()
        reactor.callLater(response.meta['time'], d.callback, scrapy.Request(
            response.url,
            callback=response.meta['callback'],
            dont_filter=True, meta={'dont_proxy': response.meta['dont_proxy']}))
        return d

    def parse(self, response):
        # removed....
        yield scrapy.Request(the_url, meta={
            'time': 86400,
            'callback': self.the_parse,
            'dont_proxy': True
        }, callback=self.request_with_pause)
For explanation: Scrapy uses Twisted to manage requests asynchronously, so we need Twisted's tools to make a delayed request too.
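As a side note, Twisted's task.deferLater can build the same delayed Deferred in one step instead of wiring callLater by hand; this is just an equivalent sketch of the snippet above, using the same meta keys:
from twisted.internet import reactor
from twisted.internet.task import deferLater

def request_with_pause(self, response):
    # deferLater returns a Deferred that fires with the new Request after the delay
    return deferLater(reactor, response.meta['time'], lambda: scrapy.Request(
        response.url,
        callback=response.meta['callback'],
        dont_filter=True))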
I have the following scrapy CrawlSpider:
import logger as lg
from scrapy.crawler import CrawlerProcess
from scrapy.http import Response
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashTextResponse
from urllib.parse import urlencode
from scrapy.linkextractors import LinkExtractor
from scrapy.http import HtmlResponse

logger = lg.get_logger("oddsportal_spider")


class SeleniumScraper(CrawlSpider):
    name = "splash"

    custom_settings = {
        "USER_AGENT": "*",
        "LOG_LEVEL": "WARNING",
        "DOWNLOADER_MIDDLEWARES": {
            'scraper_scrapy.odds.middlewares.SeleniumMiddleware': 543,
        },
    }

    httperror_allowed_codes = [301]

    start_urls = ["https://www.oddsportal.com/tennis/results/"]

    rules = (
        Rule(
            LinkExtractor(allow="/atp-buenos-aires/results/"),
            callback="parse_tournament",
            follow=True,
        ),
        Rule(
            LinkExtractor(
                allow="/tennis/",
                restrict_xpaths=("//td[@class='name table-participant']//a"),
            ),
            callback="parse_match",
        ),
    )

    def parse_tournament(self, response: Response):
        logger.info(f"Parsing tournament - {response.url}")

    def parse_match(self, response: Response):
        logger.info(f"Parsing match - {response.url}")


process = CrawlerProcess()
process.crawl(SeleniumScraper)
process.start()
The Selenium middleware is as follows:
class SeleniumMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        logger.debug(f"Selenium processing request - {request.url}")
        self.driver.get(request.url)
        return HtmlResponse(
            request.url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )

    def spider_opened(self, spider):
        options = webdriver.FirefoxOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Firefox(
            options=options,
            executable_path=Path("/opt/geckodriver/geckodriver"),
        )

    def spider_closed(self, spider):
        self.driver.close()
End to end this takes around a minute for around 50-ish pages. To try to speed things up and take advantage of multiple threads and JavaScript, I've implemented the following scrapy_splash spider:
class SplashScraper(CrawlSpider):
    name = "splash"

    custom_settings = {
        "USER_AGENT": "*",
        "LOG_LEVEL": "WARNING",
        "SPLASH_URL": "http://localhost:8050",
        "DOWNLOADER_MIDDLEWARES": {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        "SPIDER_MIDDLEWARES": {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100},
        "DUPEFILTER_CLASS": 'scrapy_splash.SplashAwareDupeFilter',
        "HTTPCACHE_STORAGE": 'scrapy_splash.SplashAwareFSCacheStorage',
    }

    httperror_allowed_codes = [301]

    start_urls = ["https://www.oddsportal.com/tennis/results/"]

    rules = (
        Rule(
            LinkExtractor(allow="/atp-buenos-aires/results/"),
            callback="parse_tournament",
            process_request="use_splash",
            follow=True,
        ),
        Rule(
            LinkExtractor(
                allow="/tennis/",
                restrict_xpaths=("//td[@class='name table-participant']//a"),
            ),
            callback="parse_match",
            process_request="use_splash",
        ),
    )

    def process_links(self, links):
        for link in links:
            link.url = "http://localhost:8050/render.html?" + urlencode({'url': link.url})
        return links

    def _requests_to_follow(self, response):
        if not isinstance(response, (HtmlResponse, SplashTextResponse)):
            return
        seen = set()
        for rule_index, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            for link in rule.process_links(links):
                seen.add(link)
                request = self._build_request(rule_index, link)
                yield rule.process_request(request, response)

    def use_splash(self, request, response):
        request.meta.update(splash={'endpoint': 'render.html'})
        return request

    def parse_tournament(self, response: Response):
        logger.info(f"Parsing tournament - {response.url}")

    def parse_match(self, response: Response):
        logger.info(f"Parsing match - {response.url}")
However, this takes about the same amount of time. I was hoping to see a big increase in speed :(
I've tried playing around with different DOWNLOAD_DELAY settings but that hasn't made things any faster.
All the concurrency settings are left at their defaults.
Any ideas on if/how I'm going wrong?
Taking a stab at an answer here with no experience of the libraries.
It looks like Scrapy crawlers themselves are single-threaded. To get multi-threaded behavior you need to configure your application differently or write code that makes it behave multi-threaded. It sounds like you've already tried this, so this is probably not news to you, but make sure you have configured the CONCURRENT_REQUESTS and REACTOR_THREADPOOL_MAXSIZE settings.
https://docs.scrapy.org/en/latest/topics/settings.html?highlight=thread#reactor-threadpool-maxsize
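A hedged sketch of where those settings could go for the spider in the question (the values are only illustrative, not tuned recommendations):
class SplashScraper(CrawlSpider):
    name = "splash"

    custom_settings = {
        # ... existing settings from the question ...
        "CONCURRENT_REQUESTS": 32,             # Scrapy default is 16
        "CONCURRENT_REQUESTS_PER_DOMAIN": 16,  # Scrapy default is 8
        "REACTOR_THREADPOOL_MAXSIZE": 20,      # Scrapy default is 10
    }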
I can't imagine there is much CPU work going on in the crawling process, so I doubt it's a GIL issue.
Excluding the GIL as an option, there are two possibilities here:
1. Your crawler is not actually multi-threaded. This may be because you are missing some setup or configuration that would make it so, i.e. you may have set the env variables correctly but your crawler is written in a way that processes requests for URLs synchronously instead of submitting them to a queue.
To test this, create a global object and store a counter on it. Each time your crawler starts a request, increment the counter. Each time your crawler finishes a request, decrement the counter. Then run a thread that prints the counter every second. If the counter value is always 1, then you are still running synchronously.
# global_state.py
GLOBAL_STATE = {"counter": 0}

# middleware.py
from global_state import GLOBAL_STATE

class SeleniumMiddleware:
    def process_request(self, request, spider):
        GLOBAL_STATE["counter"] += 1
        self.driver.get(request.url)
        GLOBAL_STATE["counter"] -= 1
        ...

# main.py
from global_state import GLOBAL_STATE
import threading
import time

def main():
    gst = threading.Thread(target=gs_watcher)
    gst.start()

    # Start your app here

def gs_watcher():
    while True:
        print(f"Concurrent requests: {GLOBAL_STATE['counter']}")
        time.sleep(1)
2. The site you are crawling is rate limiting you.
To test this, run the application multiple times. If you go from 50 req/s to 25 req/s per application, then you are being rate limited. To get around this, use a VPN to hop around.
If after that you find that you are running concurrent requests, and you are not being rate limited, then there is something funky going on in the libraries. Try removing chunks of code until you get to the bare minimum of what you need to crawl. If you have gotten to the absolute bare minimum implementation and it's still slow then you now have a minimal reproducible example and can get much better/informed help.
I'm using Scrapy to scrape a site that has a login page followed by a set of content pages with sequential integer IDs, pulled up as a URL parameter. This has been successfully running for a while, but the other day I decided to move the code that yields the Requests into a separate method, so that I can call it other places besides the initial load (basically, to dynamically add some more pages to fetch).
And it... just won't call that separate method. It reaches the point where I invoke self.issue_requests(), and proceeds right through it as if the instruction isn't there.
So this (part of the spider class definition, without the separate method) works:
# ...final bit of start_requests():
        yield scrapy.FormRequest(url=LOGIN_URL + '/login', method='POST', formdata=LOGIN_PARAMS, callback=self.parse_login)

    def parse_login(self, response):
        self.logger.debug("Logged in successfully!")
        global next_priority, input_reqno, REQUEST_RANGE, badreqs
        # go through our desired page numbers
        while len(REQUEST_RANGE) > 0:
            input_reqno = int(REQUEST_RANGE.pop(0))
            if input_reqno not in badreqs:
                yield scrapy.Request(url=REQUEST_BASE_URL + str(input_reqno), method='GET', meta={'input_reqno': input_reqno, 'dont_retry': True}, callback=self.parse_page, priority=next_priority)
                next_priority -= 1

    def parse_page(self, response):
        # ...
...however, this slight refactor does not:
# ...final bit of start_requests():
        yield scrapy.FormRequest(url=LOGIN_URL + '/login', method='POST', formdata=LOGIN_PARAMS, callback=self.parse_login)

    def issue_requests(self):
        self.logger.debug("Inside issue_requests()!")
        global next_priority, input_reqno, REQUEST_RANGE, badreqs
        # go through our desired page numbers
        while len(REQUEST_RANGE) > 0:
            input_reqno = int(REQUEST_RANGE.pop(0))
            if input_reqno not in badreqs:
                yield scrapy.Request(url=REQUEST_BASE_URL + str(input_reqno), method='GET', meta={'input_reqno': input_reqno, 'dont_retry': True}, callback=self.parse_page, priority=next_priority)
                next_priority -= 1
        return

    def parse_login(self, response):
        self.logger.debug("Logged in successfully!")
        self.issue_requests()

    def parse_page(self, response):
        # ...
Looking at the logs, it reaches the "logged in successfully!" part, but then never gets "inside issue_requests()", and because there are no scrapy Request objects yielded by the generator, its next step is to close the spider, having done nothing.
I've never seen a situation where an object instance just refuses to call a method. You'd expect there to be some failure message if it can't pass control to the method, or if there's (say) a problem with the variable scoping in the method. But for it to silently move on and pretend I never told it to go to issue_requests() is, to me, bizarre. Help!
(this is Python 2.7.18, btw)
You have to yield items from parse_login as well:
def parse_login(self, response):
    self.logger.debug("Logged in successfully!")
    for req in self.issue_requests():
        yield req
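To spell out why the original version silently did nothing: calling a generator function only creates a generator object; its body does not run until something iterates over it. A tiny standalone illustration (the names are made up for the example):
def issue_requests():
    print("Inside issue_requests()!")
    yield "request-1"
    yield "request-2"

issue_requests()               # only creates a generator object; nothing is printed
for req in issue_requests():   # iterating is what actually runs the body
    print(req)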
I'm trying to write a scrapy-splash spider that goes to a website, continuously enters fields into a form, and submits the form. For the loop for field in fields[0:10], it works for the first field, gives me the data I want, and writes the files. But for the last 9 elements I get no response / the callback function is never called. I tried doing the body of parse in the start_urls function, but got the same response. I hope someone can clear up what I seem to be misunderstanding.
Additional notes: I'm doing this in Jupyter, where I predefined my settings and pass them in when I initialize my process.
Observations: the first field will work, and then the next 9 will print their respective Lua script instantly. I cannot tell if this is because the Lua script is not working, or something fundamental about how scrapy-splash works. Additionally, print(c) only works once, so either the callback is not being called, or the splash_request is not called.
c = 0
fields = ['']

class MySpider(scrapy.Spider):
    name = 'url'
    start_urls = ['url']

    def __init__(self, *args, **kwargs):
        with open('script.lua', 'r') as script:
            self.LUA_SOURCE = script.read()
            print(self.LUA_SOURCE)

    def parse(self, response):
        for field in fields[0:10]:
            try:
                src = self.LUA_SOURCE % (field)
                print(src)
                yield SplashRequest(url=self.start_urls[0],
                                    callback=self.parse2,
                                    endpoint='execute',
                                    args={'wait': 10,
                                          'timeout': 60,
                                          'lua_source': src},
                                    cache_args=['lua_source'],
                                    )
            except Exception as e:
                print(e)

    def parse2(self, response):
        global c
        c += 1
        print(c)
        try:
            with open(f'text_files/{c}.txt', 'w') as f:
                f.write(response.text)
        except Exception as e:
            print(e)
I have the following model:
Command 'collect' (collect_positions.py) -> Celery task (tasks.py) -> Scrapy spider (MySpider) ...
collect_positions.py:
from django.core.management.base import BaseCommand
from tracker.models import Keyword
from tracker.tasks import positions


class Command(BaseCommand):
    help = 'collect_positions'

    def handle(self, *args, **options):
        def chunks(l, n):
            """Yield successive n-sized chunks from l."""
            for i in range(0, len(l), n):
                yield l[i:i + n]

        chunk_size = 1
        keywords = Keyword.objects.filter(product=product).values_list('id', flat=True)
        chunks_list = list(chunks(keywords, chunk_size))
        positions.chunks(chunks_list, 1).apply_async(queue='collect_positions')
        return 0
tasks.py:
from app_name.celery import app
from scrapy.settings import Settings
from scrapy_app import settings as scrapy_settings
from scrapy_app.spiders.my_spider import MySpider
from tracker.models import Keyword
from scrapy.crawler import CrawlerProcess


@app.task
def positions(*args):
    s = Settings()
    s.setmodule(scrapy_settings)
    keywords = Keyword.objects.filter(id__in=list(args))
    process = CrawlerProcess(s)
    process.crawl(MySpider, keywords_chunk=keywords)
    process.start()
    return 1
I run the command through the command line, which creates tasks for parsing. The first queue completes successfully, but the others return an error:
twisted.internet.error.ReactorNotRestartable
Please tell me how can I fix this error?
I can provide any data if there is a need...
UPDATE 1
Thanks for the answer, @Chiefir! I managed to run all the queues, but only the start_requests() function is started, and parse() does not run.
The main functions of the Scrapy spider:
def start_requests(self):
    print('STEP1')
    yield scrapy.Request(
        url='example.com',
        callback=self.parse,
        errback=self.error_callback,
        dont_filter=True
    )

def error_callback(self, failure):
    print(failure)
    # log all errback failures;
    # in case you want to do something special for some errors,
    # you may need the failure's type
    print(repr(failure))

    # if isinstance(failure.value, HttpError):
    if failure.check(HttpError):
        # you can get the response
        response = failure.value.response
        print('HttpError on %s', response.url)

    # elif isinstance(failure.value, DNSLookupError):
    elif failure.check(DNSLookupError):
        # this is the original request
        request = failure.request
        print('DNSLookupError on %s', request.url)

    # elif isinstance(failure.value, TimeoutError):
    elif failure.check(TimeoutError):
        request = failure.request
        print('TimeoutError on %s', request.url)

def parse(self, response):
    print('STEP2', response)
In the console I get:
STEP1
What could be the reason?
This question is as old as the world itself:
This is what helped me win the battle against the ReactorNotRestartable error: the last answer from the author of the question.
0) pip install crochet
1) add the import: from crochet import setup
2) call setup() at the top of the file
3) remove 2 lines:
a) d.addBoth(lambda _: reactor.stop())
b) reactor.run()
I had the same problem with this error, spent 4+ hours solving it, and read all the questions here about it. I finally found that one, and I'm sharing it. That is how I solved this. The only meaningful lines left from the Scrapy docs are the last 2 lines in this code of mine:
# some more imports
from importlib import import_module

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

from crochet import setup
setup()


def run_spider(spiderName):
    module_name = "first_scrapy.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)          # do some dynamic import of the selected spider
    spiderObj = scrapy_var.mySpider()                # get the mySpider object from the spider module
    crawler = CrawlerRunner(get_project_settings())  # from Scrapy docs
    crawler.crawl(spiderObj)                         # from Scrapy docs
This code allows me to select which spider to run just by passing its name to the run_spider function, and after scraping finishes, to select another spider and run it again.
In your case, you need to create a separate function in a separate file which runs your spiders, and call it from your task. That is usually how I do it :)
P.S. And there really is no way to restart the Twisted reactor.
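Here is a rough sketch of how that might plug into the asker's setup; the module path and spider name are assumptions, and run_spider is the crochet-based helper above moved into its own file:
# tasks.py - sketch only, not the definitive wiring
from app_name.celery import app
from scrapy_app.run_spiders import run_spider  # hypothetical module holding the helper above

@app.task
def positions(*args):
    # with crochet's setup(), the reactor lives in a background thread,
    # so repeated tasks no longer hit ReactorNotRestartable
    run_spider('my_spider')
    return 1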
UPDATE 1
I don't know if you need to call a start_requests() method. For me it usually works just with this code:
class mySpider(scrapy.Spider):
    name = "somname"
    allowed_domains = ["somesite.com"]
    start_urls = ["https://somesite.com"]

    def parse(self, response):
        pass

    def parse_dir_contents(self, response):  # for crawling additional links
        pass
You can fix this by setting the parameter stop_after_crawl to False on the start method of CrawlerProcess:
stop_after_crawl (bool) – stop or not the reactor when all crawlers have finished
@shared_task
def crawl(m_id, *args, **kwargs):
    process = CrawlerProcess(get_project_settings(), install_root_handler=False)
    process.crawl(SpiderClass, m_id=m_id)
    process.start(stop_after_crawl=False)
I want to parse a sitemap, find all the URLs in it, append some word to each URL, and then check the response code of all the modified URLs.
For this task I decided to use Scrapy because it has built-in support for crawling sitemaps, as described in Scrapy's documentation.
With the help of this documentation I created my spider, but I want to change the URLs before they are sent for fetching. For this I tried to take help from this link, which suggested using rules and implementing process_requests(), but I am not able to make use of these. I tried a little, which I have commented out below. Could anyone help me write the exact code for the commented lines, or suggest any other way to do this task in Scrapy?
from scrapy.contrib.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    # sitemap_rules = [some_rules, process_request='process_request')]

    # def process_request(self, request, spider):
    #     modified_url = orginal_url_from_sitemap + 'myword'
    #     return request.replace(url=modified_url)

    def parse(self, response):
        print response.status, response.url
You can connect a function to the request_scheduled signal and do what you want in that function. For example:
class MySpider(SitemapSpider):
    @classmethod
    def from_crawler(cls, crawler):
        spider = cls()
        crawler.signals.connect(spider.request_scheduled, signals.request_scheduled)
        return spider

    def request_scheduled(self, request, spider):
        modified_url = orginal_url_from_sitemap + 'myword'
        request.url = modified_url
SitemapSpider has a sitemap_filter method.
You can override it to implement the required functionality.
class MySpider(SitemapSpider):
...
def sitemap_filter(self, entries):
for entry in entries:
entry["loc"] = entry["loc"] + myword
yield entry
Each of those entry objects is a dict with a structure like this:
<class 'dict'>:
{'loc': 'https://example.com/',
 'lastmod': '2019-01-04T08:09:23+00:00',
 'changefreq': 'weekly',
 'priority': '0.8'}
Important note: the SitemapSpider.sitemap_filter method appeared in Scrapy 1.6.0, released in January 2019 (see the 1.6.0 release notes, "new extensibility features" section).
I've just faced this. Apparently you can't really use process_requests because the sitemap rules in SitemapSpider are different from the Rule objects in CrawlSpider - only the latter can take this argument.
After examining the code, it looks like this can be worked around by manually overriding part of the SitemapSpider implementation:
class MySpider(SitemapSpider):
    sitemap_urls = ['...']
    sitemap_rules = [('/', 'parse')]

    def start_requests(self):
        # override to call custom_parse_sitemap instead of _parse_sitemap
        for url in self.sitemap_urls:
            yield Request(url, self.custom_parse_sitemap)

    def custom_parse_sitemap(self, response):
        # modify requests marked to be called with the parse callback
        for request in super()._parse_sitemap(response):
            if request.callback == self.parse:
                yield self.modify_request(request)
            else:
                yield request

    def modify_request(self, request):
        return request.replace(
            # ...
        )

    def parse(self, response):
        # ...
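For the use case in the original question (appending a word to each sitemap URL), modify_request could be filled in along these lines; the suffix is purely illustrative:
    def modify_request(self, request):
        # illustrative only: append the extra word to the URL taken from the sitemap
        return request.replace(url=request.url + 'myword')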