I'm using Scrapy to scrape a site that has a login page followed by a set of content pages with sequential integer IDs, pulled up as a URL parameter. This has been successfully running for a while, but the other day I decided to move the code that yields the Requests into a separate method, so that I can call it other places besides the initial load (basically, to dynamically add some more pages to fetch).
And it... just won't call that separate method. It reaches the point where I invoke self.issue_requests(), and proceeds right through it as if the instruction isn't there.
So this (part of the spider class definition, without the separate method) works:
    # ...final bit of start_requests():
    yield scrapy.FormRequest(url=LOGIN_URL + '/login', method='POST', formdata=LOGIN_PARAMS, callback=self.parse_login)

def parse_login(self, response):
    self.logger.debug("Logged in successfully!")
    global next_priority, input_reqno, REQUEST_RANGE, badreqs
    # go through our desired page numbers
    while len(REQUEST_RANGE) > 0:
        input_reqno = int(REQUEST_RANGE.pop(0))
        if input_reqno not in badreqs:
            yield scrapy.Request(url=REQUEST_BASE_URL + str(input_reqno), method='GET', meta={'input_reqno': input_reqno, 'dont_retry': True}, callback=self.parse_page, priority=next_priority)
            next_priority -= 1

def parse_page(self, response):
    # ...
...however, this slight refactor does not:
    # ...final bit of start_requests():
    yield scrapy.FormRequest(url=LOGIN_URL + '/login', method='POST', formdata=LOGIN_PARAMS, callback=self.parse_login)

def issue_requests(self):
    self.logger.debug("Inside issue_requests()!")
    global next_priority, input_reqno, REQUEST_RANGE, badreqs
    # go through our desired page numbers
    while len(REQUEST_RANGE) > 0:
        input_reqno = int(REQUEST_RANGE.pop(0))
        if input_reqno not in badreqs:
            yield scrapy.Request(url=REQUEST_BASE_URL + str(input_reqno), method='GET', meta={'input_reqno': input_reqno, 'dont_retry': True}, callback=self.parse_page, priority=next_priority)
            next_priority -= 1
    return

def parse_login(self, response):
    self.logger.debug("Logged in successfully!")
    self.issue_requests()

def parse_page(self, response):
    # ...
Looking at the logs, it reaches the "Logged in successfully!" part but never logs "Inside issue_requests()!", and because no Scrapy Request objects are yielded by the generator, its next step is to close the spider, having done nothing.
I've never seen a situation where an object instance just refuses to call a method. You'd expect there to be some failure message if it can't pass control to the method, or if there's (say) a problem with the variable scoping in the method. But for it to silently move on and pretend I never told it to go to issue_requests() is, to me, bizarre. Help!
(this is Python 2.7.18, btw)
You have to yield the requests from parse_login as well; calling self.issue_requests() by itself only creates a generator object and never runs its body:
def parse_login(self, response):
    self.logger.debug("Logged in successfully!")
    for req in self.issue_requests():
        yield req
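Here is the underlying Python behaviour in a minimal sketch outside Scrapy (the gen() function is hypothetical, purely for illustration): any function containing yield is a generator function, so calling it only builds a generator object, and none of its body runs until something iterates it.

def gen():
    print("inside gen()")  # not printed by the bare call below
    yield 1

g = gen()       # builds a generator object; the body has not run yet
print(g)        # <generator object gen at 0x...>
print(list(g))  # iterating finally runs the body: prints "inside gen()", then [1]

On Python 3.3+ the loop in parse_login could be written as yield from self.issue_requests(), but on Python 2.7 (as here) the explicit for loop is the way to re-yield the requests.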
I'm trying to write a scrapy-splash spider that goes to a website, continuously enters fields into a form, and submits the form. In the loop for field in fields[0:10], the first field works: it gives me the data I want and writes the file. But for the remaining 9 elements I get no response and the callback is never called. I tried moving the body of parse into start_requests, but got the same result. I hope someone can clear up what I seem to be misunderstanding.
Additional notes: I'm doing this in Jupyter, where I predefine my settings and pass them in when I initialize my process.
Observations: the first field works, and then the next 9 just print their respective Lua scripts instantly. I can't tell whether this is because the Lua script isn't working or it's something fundamental about how scrapy-splash works. Additionally, print(c) only fires once, so either the callback is not being called or the SplashRequest is never issued.
c = 0
fields = ['']

class MySpider(scrapy.Spider):
    name = 'url'
    start_urls = ['url']

    def __init__(self, *args, **kwargs):
        with open('script.lua', 'r') as script:
            self.LUA_SOURCE = script.read()
        print(self.LUA_SOURCE)

    def parse(self, response):
        for field in fields[0:10]:
            try:
                src = self.LUA_SOURCE % (field)
                print(src)
                yield SplashRequest(url=self.start_urls[0],
                                    callback=self.parse2,
                                    endpoint='execute',
                                    args={'wait': 10,
                                          'timeout': 60,
                                          'lua_source': src},
                                    cache_args=['lua_source'],
                                    )
            except Exception as e:
                print(e)

    def parse2(self, response):
        global c
        c += 1
        print(c)
        try:
            with open(f'text_files/{c}.txt', 'w') as f:
                f.write(response.text)
        except Exception as e:
            print(e)
I was trying to use the following function to wait for a crawler to finish and return all the results. However, this function always returns immediately when called while the crawler is still running. What am I missing here? Isn't join() supposed to wait?
def spider_results():
    runner = CrawlerRunner(get_project_settings())
    results = []

    def crawler_results(signal, sender, item, response, spider):
        results.append(item)

    dispatcher.connect(crawler_results, signal=signals.item_passed)
    runner.crawl(QuotesSpider)
    runner.join()
    return results
According to the Scrapy docs (the Common Practices section), the CrawlerProcess class is recommended for cases like this. Note also that CrawlerRunner.join() does not block; it returns a Deferred that fires once all crawls are finished, which is why your function returns immediately.
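A minimal sketch of that approach, assuming the same QuotesSpider as above; items are collected through the item_scraped signal, and process.start() blocks until the crawl finishes (it starts the Twisted reactor, so it can only be called once per process):

from pydispatch import dispatcher
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def spider_results():
    results = []

    def collect_item(signal, sender, item, response, spider):
        results.append(item)

    dispatcher.connect(collect_item, signal=signals.item_scraped)

    process = CrawlerProcess(get_project_settings())
    process.crawl(QuotesSpider)
    process.start()  # blocks here until the crawl is finished
    return results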
In the following code example I have a function do_async_thing which appears to return a Future, even though I'm not sure why?
import tornado.ioloop
import tornado.web
import tornado.httpclient
import tornado.gen

@tornado.gen.coroutine
def do_async_thing():
    http = tornado.httpclient.AsyncHTTPClient()
    response = yield http.fetch("http://www.google.com/")
    return response.body

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        x = do_async_thing()
        print(x)  # <tornado.concurrent.Future object at 0x10753a6a0>
        self.set_header("Content-Type", "application/json")
        self.write('{"foo":"bar"}')
        self.finish()

if __name__ == "__main__":
    app = tornado.web.Application([
        (r"/foo/?", MainHandler),
    ])
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()
You'll see that I yield the call to fetch and in doing so I should have forced the value to be realised (and subsequently been able to access the body field of the response).
What's more interesting is how I can even access the body field on a Future and not have it error (as far as I know a Future has no such field/property/method)
So does anyone know how I can:
Resolve the Future so I get the actual value
Modify this example so the function do_async_thing makes multiple async url fetches
Now it's worth noting that because I was still getting a Future back I thought I would try adding a yield to prefix the call to do_async_thing() (e.g. x = yield do_async_thing()) but that gave me back the following error:
tornado.gen.BadYieldError: yielded unknown object <generator object get at 0x1023bc308>
I also looked at doing something like this for the second point:
def do_another_async_thing():
    http = tornado.httpclient.AsyncHTTPClient()
    a = http.fetch("http://www.google.com/")
    b = http.fetch("http://www.github.com/")
    return a, b

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        y = do_another_async_thing()
        print(y)
But again this returns:
<tornado.concurrent.Future object at 0x102b966d8>
Whereas I would've expected a tuple of Futures at least? At this point I'm unable to resolve these Futures without getting an error such as:
tornado.gen.BadYieldError: yielded unknown object <generator object get at 0x1091ac360>
Update
Below is an example that works (as per the answer by A. Jesse Jiryu Davis).
I've also added another example, where a new function do_another_async_thing makes two async HTTP requests (evaluating their values is a little more involved, as you'll see):
def do_another_async_thing():
    http = tornado.httpclient.AsyncHTTPClient()
    a = http.fetch("http://www.google.com/")
    b = http.fetch("http://www.github.com/")
    return a, b

@tornado.gen.coroutine
def do_async_thing():
    http = tornado.httpclient.AsyncHTTPClient()
    response = yield http.fetch("http://www.google.com/")
    return response.body

class MainHandler(tornado.web.RequestHandler):
    @tornado.gen.coroutine
    def get(self):
        x = yield do_async_thing()
        print(x)  # displays HTML response

        fa, fb = do_another_async_thing()
        fa = yield fa
        fb = yield fb
        print(fa.body, fb.body)  # displays HTML response for each
It's worth clarifying: you might expect the two yield statements for do_another_async_thing to cause blocking. But here is a breakdown of the steps that are happening:
do_another_async_thing immediately returns a tuple of two Futures
we yield the first Future, which suspends the coroutine until its value is realised
the value is realised, so we move on to the next line
we yield again, suspending until the second value is realised
but because both futures were created at the same time and run concurrently, the second yield returns practically instantly
Coroutines return futures. To wait for the coroutine to complete, the caller must also be a coroutine, and must yield the future. So:
@gen.coroutine
def get(self):
    x = yield do_async_thing()
For more info see Refactoring Tornado Coroutines.
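For the second point (multiple fetches inside one function), generator coroutines can also yield a list of Futures, which runs the fetches concurrently and resumes once all of them have resolved. A minimal sketch, using a hypothetical do_multiple_async_things coroutine:

import tornado.gen
import tornado.httpclient

@tornado.gen.coroutine
def do_multiple_async_things():
    http = tornado.httpclient.AsyncHTTPClient()
    # yielding a list of Futures runs both fetches concurrently and
    # resumes this coroutine once both responses have arrived
    a, b = yield [
        http.fetch("http://www.google.com/"),
        http.fetch("http://www.github.com/"),
    ]
    return a.body, b.body

Inside a coroutine handler it is consumed the same way as before: google_body, github_body = yield do_multiple_async_things().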
I have a problem. I need to stop the execution of a function for a while, without stopping the parsing as a whole. That is, I need a non-blocking pause.
It looks like this:
class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):
        for url in ['url1', 'url2', 'url3', 'more urls']:
            yield Request(url, callback=self.second_parse_function)

        # Here I need some function for sleep only this function like time.sleep(10)

        yield Request('some url', callback=self.non_stop_function)  # Call itself

    def second_parse_function(self, response):
        pass
The non_stop_function needs to be paused for a while, but it should not block the rest of the output.
If I insert time.sleep(), it stops the whole parser, which I don't want. Is it possible to pause a single function using Twisted or something else?
Reason: I need to create a non-blocking function that will parse a page of the website every n seconds. There it will collect URLs and pause for 10 seconds. The URLs that have already been obtained will continue to be processed, but the main function needs to sleep.
UPDATE:
Thanks to TkTech and viach. One answer helped me understand how to make a pending Request, and the other showed how to activate it. Both answers complement each other, and I ended up with an excellent non-blocking pause for Scrapy:
def call_after_pause(self, response):
    d = Deferred()
    reactor.callLater(10.0, d.callback, Request(
        'https://example.com/',
        callback=self.non_stop_function,
        dont_filter=True))
    return d
And use this function for my request:
yield Request('https://example.com/', callback=self.call_after_pause, dont_filter=True)
The Request object has a callback parameter; try to use that one for this purpose. I mean, create a Deferred which wraps self.second_parse_function and pause.
Here is my dirty and untested example; the changed lines are marked.
class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):
        parse_and_pause = Deferred()  # changed
        parse_and_pause.addCallback(self.second_parse_function)  # changed
        parse_and_pause.addCallback(pause, seconds=10)  # changed

        for url in ['url1', 'url2', 'url3', 'more urls']:
            yield Request(url, callback=parse_and_pause)  # changed

        yield Request('some url', callback=self.non_stop_function)  # Call itself

    def second_parse_function(self, response):
        pass
If this approach works for you, you can then create a function which constructs a Deferred object according to the rule. It could be implemented like the following:
def get_perform_and_pause_deferred(seconds, fn, *args, **kwargs):
    d = Deferred()
    d.addCallback(fn, *args, **kwargs)
    d.addCallback(pause, seconds=seconds)
    return d
And here is a possible usage:
class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):
        for url in ['url1', 'url2', 'url3', 'more urls']:
            # changed
            yield Request(url, callback=get_perform_and_pause_deferred(10, self.second_parse_function))

        yield Request('some url', callback=self.non_stop_function)  # Call itself

    def second_parse_function(self, response):
        pass
If you're attempting to use this for rate limiting, you probably just want to use DOWNLOAD_DELAY instead.
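For reference, a minimal sketch of that rate-limiting route in settings.py (or in custom_settings on the spider); the values here are only illustrative:

# settings.py
DOWNLOAD_DELAY = 10              # wait about 10 seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter the delay (0.5x to 1.5x) so it looks less robotic

# or let Scrapy adapt the delay to server load instead:
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 10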
Scrapy is just a framework on top of Twisted. For the most part, you can treat it the same as any other Twisted app. Instead of calling sleep, just return the next request to make and tell Twisted to wait a bit. For example:
from twisted.internet import reactor, defer

def non_stop_function(self, response):
    d = defer.Deferred()
    reactor.callLater(10.0, d.callback, Request(
        'some url',
        callback=self.non_stop_function
    ))
    return d
The asker already provides an answer in the question's update, but I want to give a slightly better version so it's reusable for any request.
# removed...
from twisted.internet import reactor, defer

class MySpider(scrapy.Spider):
    # removed...

    def request_with_pause(self, response):
        d = defer.Deferred()
        reactor.callLater(response.meta['time'], d.callback, scrapy.Request(
            response.url,
            callback=response.meta['callback'],
            dont_filter=True, meta={'dont_proxy': response.meta['dont_proxy']}))
        return d

    def parse(self, response):
        # removed....
        yield scrapy.Request(the_url, meta={
            'time': 86400,
            'callback': self.the_parse,
            'dont_proxy': True
        }, callback=self.request_with_pause)
By way of explanation: Scrapy uses Twisted to manage requests asynchronously, so we need Twisted's tools to make a delayed request as well.
I want to parse a sitemap, find all the URLs in it, append some word to each URL, and then check the response code of every modified URL.
For this task I decided to use Scrapy, because it has built-in support for crawling sitemaps, as described in Scrapy's documentation.
With the help of that documentation I created my spider, but I want to change the URLs before they are fetched. For this I tried to take help from this link, which suggested using rules and implementing process_requests(). But I am not able to make use of these; the little I tried is commented out below. Could anyone help me write the exact code for the commented lines, or suggest any other way to do this task in Scrapy?
from scrapy.contrib.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    # sitemap_rules = [(some_rules, process_request='process_request')]

    # def process_request(self, request, spider):
    #     modified_url = orginal_url_from_sitemap + 'myword'
    #     return request.replace(url=modified_url)

    def parse(self, response):
        print response.status, response.url
You can connect a function to the request_scheduled signal and do what you want in that function. For example:
class MySpider(SitemapSpider):

    @classmethod
    def from_crawler(cls, crawler):
        spider = cls()
        crawler.signals.connect(spider.request_scheduled, signals.request_scheduled)
        return spider

    def request_scheduled(self, request, spider):
        modified_url = orginal_url_from_sitemap + 'myword'
        request.url = modified_url
SitemapSpider has a sitemap_filter method. You can override it to implement the required functionality.
class MySpider(SitemapSpider):
    ...

    def sitemap_filter(self, entries):
        for entry in entries:
            entry["loc"] = entry["loc"] + myword
            yield entry
Each of those entry objects is a dict with a structure like this:
<class 'dict'>:
{'loc': 'https://example.com/',
'lastmod': '2019-01-04T08:09:23+00:00',
'changefreq': 'weekly',
'priority': '0.8'}
Important note: the SitemapSpider.sitemap_filter method appeared in Scrapy 1.6.0, released in January 2019 (see the "New extensibility features" section of the 1.6.0 release notes).
I've just faced this. Apparently you can't really use process_requests, because the sitemap rules in SitemapSpider are different from the Rule objects in CrawlSpider; only the latter can take this argument.
After examining the code, it looks like this can be worked around by manually overriding part of the SitemapSpider implementation:
class MySpider(SitemapSpider):
    sitemap_urls = ['...']
    sitemap_rules = [('/', 'parse')]

    def start_requests(self):
        # override to call custom_parse_sitemap instead of _parse_sitemap
        for url in self.sitemap_urls:
            yield Request(url, self.custom_parse_sitemap)

    def custom_parse_sitemap(self, response):
        # modify requests marked to be called with the parse callback
        for request in super()._parse_sitemap(response):
            if request.callback == self.parse:
                yield self.modify_request(request)
            else:
                yield request

    def modify_request(self, request):
        return request.replace(
            # ...
        )

    def parse(self, response):
        # ...