I want to asynchronously query a database for keys, then make requests to several URLs for each key.
I have a function that returns a Deferred from the database whose value is the key for several requests. Ideally, I would call this function and return a generator of Deferreds from start_requests:
@inlineCallbacks
def get_request_deferred(self):
    d = yield engine.execute(select([table]))  # async
    d.addCallback(make_url)
    d.addCallback(Request)
    return d

def start_requests(self):
    ????
But attempting this in several ways raises
builtins.AttributeError: 'Deferred' object has no attribute 'dont_filter'
which I take to mean that start_requests must return Request objects, not Deferreds whose values are Request objects. The same seems to be true of spider middleware's process_start_requests().
Alternatively, I can make initial requests to, say, http://localhost/ and change them to the real URL once the key is available from the database through downloader middleware's process_request(). However, process_request can only return a single Request object; it cannot yield Requests for multiple pages using the key: attempting yield Request(url) raises
AssertionError: Middleware myDownloaderMiddleware.process_request
must return None, Response or Request, got generator
What is the cleanest solution to:
get a key asynchronously from the database
for each key, generate several requests?
You haven't described a use case where asynchronous database queries are a necessity. I'm assuming you cannot begin to scrape your URLs unless you query the database first? If that's the case, then you're better off just doing the query synchronously, iterating over the query results, extracting what you need, and then yielding Request objects, as sketched below. It makes little sense to query a db asynchronously and then just sit around waiting for the query to finish.
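A minimal sketch of that synchronous approach, reusing the engine/select/table names from your snippet (the DSN string and the make_urls helper are hypothetical, and this is untested):

import scrapy
from sqlalchemy import create_engine, select

class MySpider(scrapy.Spider):
    name = 'db_keys'

    def start_requests(self):
        # Blocking query, run once at startup before any crawling happens.
        engine = create_engine('postgresql://user:pass@localhost/mydb')  # hypothetical DSN
        for row in engine.execute(select([table])):  # `table` as defined in your project
            key = row[0]
            for url in make_urls(key):  # hypothetical helper: one key -> several URLs
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        ...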
You can let the callback for the Deferred object pass the URLs to a generator of some sort. The generator will then convert any received URLs into scrapy Request objects and yield them. Below is an example using the code you linked (not tested):
import scrapy
from Queue import Queue
from pdb import set_trace as st
from twisted.internet.defer import Deferred, inlineCallbacks


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def __init__(self):
        self.urls = Queue()
        self.stop = False
        self.requests = self.request_generator()
        self.deferred = self.deferred_generator()  # generator; advance it to create the Deferred

    def deferred_generator(self):
        d = Deferred()
        d.addCallback(self.deferred_callback)
        yield d

    def request_generator(self):
        # Convert URLs pushed onto the queue into Request objects.
        while not self.stop:
            url = self.urls.get()
            yield scrapy.Request(url=url, callback=self.parse)

    def start_requests(self):
        # start_requests must return an iterable, so yield the first request
        yield self.requests.next()

    def parse(self, response):
        st()
        # when you need to parse the next url from the callback
        yield self.requests.next()

    def deferred_callback(self, url):
        self.urls.put(url)
        if no_more_urls():  # placeholder for your own termination check
            self.stop = True
Don't forget to stop the request generator when you're done.
Related
I'm trying to iterate through a list of URLs returned from the callback passed to a scrapy Request, but I'm getting the following error:
TypeError: 'Request' object is not iterable
The following works; I can see all the extracted URLs flood the terminal:
import scrapy

class PLSpider(scrapy.Spider):
    name = 'pl'
    start_urls = ['https://example.com']

    def genres(self, resp):
        for genre in resp.css('div.sub-menus a'):
            yield {
                'genre': genre.css('::text').extract_first(),
                'url': genre.css('::attr(href)').extract_first()
            }

    def extractSamplePackURLs(self, resp):
        return {
            'packs': resp.css('h4.product-title a::attr(href)').extract()
        }

    def extractPackData(self, resp):
        return {
            'title': resp.css('h1.product-title::text'),
            'description': resp.css('div.single-product-description p').extract_first()
        }

    def parse(self, resp):
        for genre in self.genres(resp):
            samplePacks = scrapy.Request(genre['url'], callback=self.extractSamplePackURLs)
            yield samplePacks
But if I replace the yield samplePacks line with:
def parse(self, resp):
    for genre in self.genres(resp):
        samplePacks = scrapy.Request(genre['url'], callback=self.extractSamplePackURLs)
        for pack in samplePacks:
            yield pack
... I get the error I posted above.
Why is this and how can I loop through the returned value of the callback?
Yielding Request objects in scrapy.Spider callbacks only tells the Scrapy framework to enqueue HTTP requests. It yields Request objects, nothing more. It does not download them immediately, nor does it give back control once they are downloaded; i.e. after the yield, you still don't have the result. Request objects are not promises, futures, or Deferreds. Scrapy is not designed the same way as various async frameworks.
These Request objects will eventually get processed by the framework's downloader, and the response body from each HTTP request will be passed to the associated callback.
This is the basis of Scrapy's asynchronous programming pattern.
If you want to do something more "procedural-like", in which yield Request(...) gets you the HTTP response the next time you have control, you can have a look at https://github.com/rmax/scrapy-inline-requests/.
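One idiomatic way to get what the question is after with plain Scrapy is to chain callbacks: each callback yields further Requests instead of trying to iterate a Request object. A minimal, untested sketch reusing the selectors from the question:

import scrapy

class PLSpider(scrapy.Spider):
    name = 'pl'
    start_urls = ['https://example.com']

    def parse(self, resp):
        # For each genre link, schedule a request; the pack URLs only become
        # available later, inside extractSamplePackURLs, once that genre
        # page has actually been downloaded.
        for genre in resp.css('div.sub-menus a'):
            url = genre.css('::attr(href)').extract_first()
            yield scrapy.Request(url, callback=self.extractSamplePackURLs)

    def extractSamplePackURLs(self, resp):
        # Now we have the genre page; yield one request per pack URL.
        for href in resp.css('h4.product-title a::attr(href)').extract():
            yield scrapy.Request(href, callback=self.extractPackData)

    def extractPackData(self, resp):
        yield {
            'title': resp.css('h1.product-title::text').extract_first(),
            'description': resp.css('div.single-product-description p').extract_first(),
        }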
I'm trying to make an API that will collect responses from several other APIs and combine the results into one response. I want to send the GET requests asynchronously so that it runs faster, but even though I'm using coroutines and yielding, my code still seems to be making each request one at a time. I'm wondering if maybe it's because I'm using the requests library instead of Tornado's AsyncHTTPClient, or because I'm calling self.path_get inside of a loop, or because I'm storing results in an instance variable?
The APIs I'm hitting return arrays of JSON objects, and I want to combine them all into one array and write that to the response.
from tornado import gen, ioloop, web
from tornado.gen import Return
import requests

PATHS = [
    "http://firsturl",
    "http://secondurl",
    "http://thirdurl"
]

class MyApi(web.RequestHandler):
    @gen.coroutine
    def get(self):
        self.results = []
        for path in PATHS:
            x = yield self.path_get(path)
        self.write({
            "results": self.results,
        })

    @gen.coroutine
    def path_get(self, path):
        resp = yield requests.get(path)
        self.results += resp.json()["results"]
        raise Return(resp)

ROUTES = [
    (r"/search", MyApi),
]

def run():
    app = web.Application(
        ROUTES,
        debug=True,
    )
    app.listen(8000)
    ioloop.IOLoop.current().start()

if __name__ == "__main__":
    run()
There are many reasons why your code doesn't work. To begin with, requests blocks the event loop and doesn't let anything else execute, so replace requests with AsyncHTTPClient.fetch. Also, yielding each request inside the loop makes the requests run sequentially rather than concurrently, as you suspected. Here's an example of how your code could be restructured:
import json
from tornado import gen, httpclient, ioloop, web

# ...

class MyApi(web.RequestHandler):
    @gen.coroutine
    def get(self):
        futures_list = []
        for path in PATHS:
            futures_list.append(self.path_get(path))
        yield futures_list
        result = json.dumps({'results': [x.result() for x in futures_list]})
        self.write(result)

    @gen.coroutine
    def path_get(self, path):
        request = httpclient.AsyncHTTPClient()
        resp = yield request.fetch(path)
        result = json.loads(resp.body.decode('utf-8'))
        raise gen.Return(result)
What's happening is that we build a list of the Futures returned by the gen.coroutine functions and yield the entire list, which suspends get() until all the results are available. Once all the requests are complete, futures_list is iterated and each Future's result is used to build a new list, which is then serialized into a JSON object.
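As a side note, in Tornado a coroutine that yields a list of Futures gets back the list of their results directly, so get() could be written slightly more compactly. This is a sketch under the same assumptions as the code above, not something from the original answer:

@gen.coroutine
def get(self):
    # Kick off all requests first, then wait on them together.
    futures_list = [self.path_get(path) for path in PATHS]
    results = yield futures_list  # list of results, in the same order as PATHS
    self.write(json.dumps({'results': results}))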
I want to write an asynchronous HTTP client using the Twisted framework that fires 5 requests asynchronously/simultaneously to 5 different servers, then compares those responses and displays a result. Could someone please help with this?
For this situation I'd suggest using treq and DeferredList to aggregate the responses, then fire a callback when all the URLs have returned. Here is a quick example:
import treq
from twisted.internet import reactor, defer, task

def fetchURL(*urls):
    dList = []
    for url in urls:
        d = treq.get(url)
        d.addCallback(treq.content)
        dList.append(d)
    return defer.DeferredList(dList)

def compare(responses):
    # the responses are returned in a list of tuples
    # Ex: [(True, b'')]
    for status, content in responses:
        print(content)

def main(reactor):
    urls = [
        'http://swapi.co/api/films/schema',
        'http://swapi.co/api/people/schema',
        'http://swapi.co/api/planets/schema',
        'http://swapi.co/api/species/schema',
        'http://swapi.co/api/starships/schema',
    ]
    d = fetchURL(*urls)     # returns Deferred
    d.addCallback(compare)  # fire compare() once the URLs return with a response
    return d                # wait for the DeferredList to finish

task.react(main)
# usually you would run reactor.run() but react() takes care of that
In the main function, a list of URLs is passed into fetchURL(). There, an async request is made to each site and a Deferred is appended to a list. The final list is then used to create and return a DeferredList object. Finally, we add a callback (compare() in this case) to the DeferredList that can access each response. You would put your comparison logic in the compare() function.
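One detail worth flagging: in a DeferredList result, a failed request shows up as a (False, Failure) tuple. A hedged variation of the example that checks the success flag and passes consumeErrors=True (so a failed request isn't also reported as an unhandled error) might look like this:

import treq
from twisted.internet import defer

def fetchURL(*urls):
    dList = [treq.get(url).addCallback(treq.content) for url in urls]
    # consumeErrors=True keeps a failed request from being logged as an
    # unhandled error; the failure still appears in the result tuples.
    return defer.DeferredList(dList, consumeErrors=True)

def compare(responses):
    for success, content in responses:
        if success:
            print(content)
        else:
            # `content` is a twisted.python.failure.Failure here
            print('request failed:', content.getErrorMessage())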
You don't necessarily need Twisted to make asynchronous HTTP requests. You can use Python threads and the wonderful requests package.
from threading import Thread

import requests

def make_request(url, results):
    response = requests.get(url)
    results[url] = response

def main():
    results = {}
    threads = []
    for i in range(5):
        url = 'http://webpage/{}'.format(i)
        t = Thread(target=make_request, kwargs={'url': url, 'results': results})
        t.start()
        threads.append(t)
    # wait for all requests to finish
    for t in threads:
        t.join()
    print(results)

if __name__ == '__main__':
    main()
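If you're on Python 3 (or have the futures backport installed), a similar effect with less bookkeeping can be had from the standard library's concurrent.futures; this is an alternative to the snippet above, not something from the original answer:

import requests
from concurrent.futures import ThreadPoolExecutor

urls = ['http://webpage/{}'.format(i) for i in range(5)]

with ThreadPoolExecutor(max_workers=5) as executor:
    # map() runs requests.get concurrently and returns results in input order;
    # leaving the with-block waits for all workers to finish.
    responses = list(executor.map(requests.get, urls))

print(dict(zip(urls, responses)))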
I'm trying to get Scrapy to grab a URL from a message queue and then scrape that URL. I have the loop going just fine and grabbing the URL from the queue, but it never enters the parse() method once it has a URL; it just continues to loop (and sometimes the URL comes back around even though I've deleted it from the queue...).
While it's running in the terminal, if I press CTRL+C and force it to end, it enters the parse() method and crawls the page, then ends. I'm not sure what's wrong here.
import time

import boto.sqs
from scrapy import Spider


class my_Spider(Spider):
    name = "my_spider"
    allowed_domains = ['domain.com']

    def __init__(self):
        super(my_Spider, self).__init__()
        self.url = None

    def start_requests(self):
        while True:
            # Crawl the url from queue
            yield self.make_requests_from_url(self._pop_queue())

    def _pop_queue(self):
        # Grab the url from queue
        return self.queue()

    def queue(self):
        url = None
        while url is None:
            conf = {
                "sqs-access-key": "",
                "sqs-secret-key": "",
                "sqs-queue-name": "crawler",
                "sqs-region": "us-east-1",
                "sqs-path": "sqssend"
            }
            # Connect to AWS
            conn = boto.sqs.connect_to_region(
                conf.get('sqs-region'),
                aws_access_key_id=conf.get('sqs-access-key'),
                aws_secret_access_key=conf.get('sqs-secret-key')
            )
            q = conn.get_queue(conf.get('sqs-queue-name'))
            message = conn.receive_message(q)
            # Didn't get a message back, wait.
            if not message:
                time.sleep(10)
                url = None
            else:
                url = message
        if url is not None:
            message = url[0]
            message_body = str(message.get_body())
            message.delete()
            self.url = message_body
            return self.url

    def parse(self, response):
        ...
        yield item
Updated from comments:
def start_requests(self):
    while True:
        # Crawl the url from queue
        queue = self._pop_queue()
        self.logger.error(queue)
        if queue is None:
            time.sleep(10)
            continue
        url = queue
        if url:
            yield self.make_requests_from_url(url)
Removed the while url is None: loop, but still get the same problem.
Would I be right to assume that if this works:
import scrapy
import random


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]

    def __init__(self):
        super(ExampleSpider, self).__init__()
        self.url = None

    def start_requests(self):
        while True:
            # Crawl the url from queue
            yield self.make_requests_from_url(self._pop_queue())

    def _pop_queue(self):
        # Grab the url from queue
        return self.queue()

    def queue(self):
        return 'http://www.example.com/?{}'.format(random.randint(0, 100000))

    def parse(self, response):
        print "Successfully parsed!"
Then your code should work as well, unless:
There's a problem with allowed_domains and your queue actually returns URLs outside it
There's a problem with your queue() function and/or the data it produces, e.g. it returns arrays, or it blocks indefinitely, or something like that
Note also that the boto library is blocking, not Twisted/asynchronous. In order not to block Scrapy while using it, you would have to use a Twisted-compatible library like txsqs. Alternatively, you might want to run the boto calls in a separate thread with deferToThread, as sketched below.
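For illustration only, wrapping the blocking boto call with deferToThread might look roughly like this (poll_queue_blocking/poll_queue_async are hypothetical helpers, not a tested integration with your spider):

from twisted.internet import threads

def poll_queue_blocking(conn, queue):
    # Ordinary blocking boto call; it runs inside Twisted's thread pool.
    return conn.receive_message(queue, number_messages=10)

def poll_queue_async(conn, queue):
    # Returns a Deferred that fires with the messages once the blocking
    # call finishes, without stalling the reactor in the meantime.
    return threads.deferToThread(poll_queue_blocking, conn, queue)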
After your follow-up question on the Scrapy list, I believe you have to understand that your code is quite far from functional, which makes this as much a generic Boto/SQS question as a Scrapy question. Anyway, here's a reasonably functional solution.
I created an AWS SQS queue and gave it some overly broad permissions.
Now I'm able to submit messages to the queue with the AWS CLI like this:
$ aws --region eu-west-1 sqs send-message --queue-url "https://sqs.eu-west-1.amazonaws.com/123412341234/my_queue" --message-body 'url:https://stackoverflow.com'
For some weird reason, I think that when I set --message-body to a bare URL it actually downloaded the page and sent the result as the message body(!). Not sure, and I don't have time to confirm this, but it's interesting. Anyway.
Here's a proper-ish spider. As I said before, boto is a blocking API, which is bad. In this implementation I call its API just once from start_requests() and afterwards only when the spider is idle, in the spider_idle() callback. At that point, because the spider is idle, the fact that boto blocks doesn't pose much of a problem. While I pull URLs from SQS, I pull as many as possible in the while loop (you could put a limit there if you don't want to consume, e.g., more than 500 at a time) in order to call the blocking API as rarely as possible. Notice also the call to conn.delete_message_batch(), which actually removes messages from the queue (otherwise they stay there forever), and queue.set_message_class(boto.sqs.message.RawMessage), which keeps boto from trying to base64-decode the message bodies.
Overall, this might be an OK solution for your level of requirements.
from scrapy import Spider, Request
from scrapy import signals
import boto.sqs
from scrapy.exceptions import DontCloseSpider


class CPU_Z(Spider):
    name = "cpuz"
    allowed_domains = ['valid.x86.fr']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(CPU_Z, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def __init__(self, *args, **kwargs):
        super(CPU_Z, self).__init__(*args, **kwargs)
        conf = {
            "sqs-access-key": "AK????????????????",
            "sqs-secret-key": "AB????????????????????????????????",
            "sqs-queue-name": "my_queue",
            "sqs-region": "eu-west-1",
        }
        self.conn = boto.sqs.connect_to_region(
            conf.get('sqs-region'),
            aws_access_key_id=conf.get('sqs-access-key'),
            aws_secret_access_key=conf.get('sqs-secret-key')
        )
        self.queue = self.conn.get_queue(conf.get('sqs-queue-name'))
        assert self.queue
        # Use RawMessage so boto doesn't try to base64-decode the body.
        self.queue.set_message_class(boto.sqs.message.RawMessage)

    def _get_some_urls_from_sqs(self):
        # Drain as many messages as possible in one go; each API call is blocking.
        while True:
            messages = self.conn.receive_message(self.queue, number_messages=10)
            if not messages:
                break
            for message in messages:
                body = message.get_body()
                if body[:4] == 'url:':
                    url = body[4:]
                    yield self.make_requests_from_url(url)
            self.conn.delete_message_batch(self.queue, messages)

    def spider_idle(self, spider):
        # Refill the scheduler when the spider runs out of requests.
        for request in self._get_some_urls_from_sqs():
            self.crawler.engine.crawl(request, self)
        raise DontCloseSpider()

    def start_requests(self):
        for request in self._get_some_urls_from_sqs():
            yield request

    def parse(self, response):
        yield {
            "freq_clock": response.url
        }
My code is included below and is really not much more than a slightly tweaked version of the example lifted from Scrapy's documentation. The code works as-is, but there is a gap in the logic that I am not understanding, between the login and how the request is passed through subsequent requests.
According to the documentation, a request object returns a response object. This response object is passed as the first argument to a callback function. This I get. This is the way authentication can be handled and subsequent requests made using the user credentials.
What I am not understanding is how the response object makes it to the next request call following authentication. In my code below, the parse method returns a FormRequest object created to authenticate with the user's credentials. Since the FormRequest has a callback to the after_login method, the after_login method is called with the response from the FormRequest as its first parameter.
The after_login method checks to make sure there are no errors, then makes another request through a yield statement. What I do not understand is how the response passed in as an argument to the after_login method is making it to the Request following the yield. How does this happen?
The primary reason I am interested is that I need to make two requests per iterated value in the after_login method, and I cannot figure out how the responses are handled by the scraper well enough to know how to modify the code. Thank you in advance for your time and explanations.
# import Scrapy modules
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.http import FormRequest
from scrapy import log

# import custom item from item module
from scrapy_spage.items import ReachItem


class AwSpider(BaseSpider):
    name = 'spage'
    allowed_domains = ['webpage.org']
    start_urls = ('https://www.webpage.org/',)

    def parse(self, response):
        credentials = {'username': 'user',
                       'password': 'pass'}
        return [FormRequest.from_response(response,
                                          formdata=credentials,
                                          callback=self.after_login)]

    def after_login(self, response):
        # check to ensure login succeeded
        if 'Login failed' in response.body:
            # log error
            self.log('Login failed', level=log.ERROR)
            # exit method
            return
        else:
            # for every integer from one to 5000, 1100 to 1110 for testing...
            for reach_id in xrange(1100, 1110):
                # call make requests, use format to create four digit string for each reach
                yield Request('https://www.webpage.org/content/River/detail/id/{0:0>4}/'.format(reach_id),
                              callback=self.scrape_page)

    def scrape_page(self, response):
        # create selector object instance to parse response
        sel = Selector(response)
        # create item object instance
        reach_item = ReachItem()
        # get attribute
        reach_item['attribute'] = sel.xpath('//body/text()').extract()
        # other selectors...
        # return the reach item
        return reach_item
how the response passed in as an argument to the after_login method is making it to the Request following the yield.
If I understand your question, the answer is that it doesn't.
The mechanism is simple:
for x in spider.function():
    if x is a request:
        make the HTTP call for this request and wait for the response asynchronously
    if x is an item:
        send it to pipelines etc...
upon getting a response:
    request.callback(response)
As you can see, there is no limit to the number of requests the function can yield, so you can do:
for reach_id in xrange(x, y):
    yield Request(url=url1, callback=callback1)
    yield Request(url=url2, callback=callback2)
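Concretely, for the two-requests-per-id case you mention, here is a hedged sketch of how after_login could schedule both requests and let each callback know which reach_id it belongs to; the 'extra/' URL suffix and the scrape_extra_page callback are made-up placeholders, not part of your site:

def after_login(self, response):
    if 'Login failed' in response.body:
        self.log('Login failed', level=log.ERROR)
        return
    for reach_id in xrange(1100, 1110):
        base = 'https://www.webpage.org/content/River/detail/id/{0:0>4}/'.format(reach_id)
        # meta travels with the Request and comes back on the Response,
        # so each callback knows which reach_id it belongs to.
        yield Request(base, callback=self.scrape_page, meta={'reach_id': reach_id})
        yield Request(base + 'extra/', callback=self.scrape_extra_page,
                      meta={'reach_id': reach_id})

def scrape_extra_page(self, response):
    reach_id = response.meta['reach_id']  # recover the id inside the callback
    # ... build and return a second item for this reach_id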
Hope this helps.