I am using Scrapy and I have two problems:
First problem: I put start_requests in a loop, but the function is not called for each iteration.
Second problem: I need to call a different callback depending on which start_urls list the loop provides, but I can't give the callback a dynamic name. I would like to write callback=parse_i, where i comes from the loop above.
liste = [liste1, liste2, liste3]
for i in range(3):
    start_urls = liste[i]
def start_requests(self):
    # print(self.start_urls)
    for u in self.start_urls:
        try:
            req = requests.get(u)
        except requests.exceptions.ConnectionError:
            print("Connection refused")
            continue  # otherwise req is undefined below
        if req.status_code != 200:
            print("Request failed, status code is:", req.status_code)
            continue
        yield scrapy.Request(u, callback=self.parse, meta={'dont_merge_cookies': True}, dont_filter=False)
thanks
I need to call a different callback depending on which start_urls list the loop provides, but I can't give the callback a dynamic name. I would like to write callback=parse_i, where i comes from the loop above.
The callback attribute just needs to be a callable, so you can use getattr just like normal:
my_callback = getattr(self, 'parse_{}'.format(i))
yield Request(u, callback=my_callback)
Separately, while you didn't ask about this, it is highly unusual to make an HTTP call with requests from within start_requests, since (a) dealing with all that non-200 stuff is exactly why one would use Scrapy to begin with, and (b) doing so with requests will not honor the throttling, proxy, user-agent, resumption, or host of other knobs with which one would wish to influence a scraping job.
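As an illustration only (the URL lists and the parse_1/parse_2/parse_3/on_error names below are placeholders of mine, not from the question), here is how the getattr idea combines with Scrapy's own error handling via an errback, so requests never has to be called inside start_requests; connection failures and non-200 responses end up in the errback instead of being checked by hand:

import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    # hypothetical URL lists; replace with your own liste1, liste2, liste3
    listes = [
        ['http://example.com/a'],
        ['http://example.com/b'],
        ['http://example.com/c'],
    ]

    def start_requests(self):
        for i, urls in enumerate(self.listes, start=1):
            # pick parse_1, parse_2, ... dynamically, as in the answer above
            callback = getattr(self, 'parse_{}'.format(i))
            for u in urls:
                # let Scrapy fetch the URL; failures go to the errback
                yield scrapy.Request(u, callback=callback,
                                     errback=self.on_error,
                                     meta={'dont_merge_cookies': True})

    def on_error(self, failure):
        self.logger.warning("Request failed: %s", failure.request.url)

    def parse_1(self, response):
        pass

    def parse_2(self, response):
        pass

    def parse_3(self, response):
        pass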
So I am trying to build a very basic scraper that pulls information from my server and, using that information, creates a link which it then yields a request for. After parsing that, it grabs a single link from the parsed page and uploads it back to the server using a GET request. The problem I am encountering is that it will pull info from the server, create a link, and then yield the request, and depending on the response time there (which is not reliably consistent) it will dump out and start over with another GET request to the server. The way my server logic is designed is that it serves up the next data set that needs to be worked on, and until a course of action is decided for that data set, the scraper will continuously try to pull it and parse it. I am fairly new to Scrapy and in need of assistance. I know that my code is wrong, but I haven't been able to come up with another approach without changing a lot of server code and creating unnecessary hassle, and unfortunately I am not super savvy with Scrapy or Python.
My start_requests method:
name = "scrapelevelone"
start_urls = []
def start_requests(self):
print("Start Requests is initiatied")
while True:
print("Were looping")
r = requests.get('serverlink.com')
print("Sent request")
pprint(r.text)
print("This is the request response text")
print("Now try to create json object: ")
try:
personObject = json.loads(r.text)
print("Made json object: ")
pprint(personObject)
info = "streetaddress=" + '+'.join(personObject['address1'].split(" ")) + "&citystatezip=" + '+'.join(personObject['city'].split(" ")) + ",%20" + personObject['state'] + "%20" + personObject['postalcodeextended']
nextPage = "https://www.webpage.com/?" + info
print("Creating info")
newRequest = scrapy.Request(nextPage, self.parse)
newRequest.meta['item'] = personObject
print("Yielding request")
yield newRequest
except Exception:
print("Reach JSON exception")
time.sleep(10)
Every time the parse function gets called it does all its logic, and at the end it makes a requests.get call that is supposed to send data back to the server. It all does what it is supposed to if it gets that far. I have tried a lot of different things to get the scraper to loop and constantly ask the server for more information. I want the scraper to run indefinitely, but that defeats the purpose when I can't step away from the computer because it chokes on a request. Any recommendations for keeping the scraper running 24/7 without using the stupid while loop in start_requests? And on top of that, can anyone tell me why it gets stuck in a loop of requests? :( I have a huge headache trying to troubleshoot this and finally gave in to a forum...
What you should do is start with your server URL and keep re-requesting it by yielding Request objects. If the data you get is new, parse it and schedule your requests:
import json

import scrapy
from scrapy import Request


class MyCrawler(scrapy.Spider):
    name = 'mycrawler'
    start_urls = ['http://myserver.com']
    past_data = None

    def parse(self, response):
        data = json.loads(response.body_as_unicode())
        if data == self.past_data:  # if the data is the same, retry
            # time.sleep(10)  # you could add a delay, but sleep would block everything
            yield Request(response.url, dont_filter=True, priority=-100)
            return
        self.past_data = data
        for url in data['urls']:
            yield Request(url, self.parse_url)
        # keep retrying
        yield Request(response.url, dont_filter=True, priority=-100)

    def parse_url(self, response):
        # ...
        yield {'scrapy': 'item'}
I am using Scrapy 1.1 to scrape a website. The site requires periodic relogin; I can tell when this is needed because a 302 redirection occurs when login is required. Based on http://sangaline.com/post/advanced-web-scraping-tutorial/, I have subclassed the RedirectMiddleware, making the Location HTTP header available in the spider under:
request.meta['redirect_urls']
My problem is that after logging in, I have set up a function to loop through 100 pages to scrape. Let's say that after 15 pages I see that I have to log back in (based on the contents of request.meta['redirect_urls']). My code looks like:
def test1(self, response):
    ......
    for row in empties:  # 100 records
        d = object_as_dict(row)
        AA
        yield Request(url=myurl, headers=self.headers, callback=self.parse_lookup,
                      meta={'d': d}, dont_filter=True)

def parse_lookup(self, response):
    if 'redirect_urls' in response.meta:
        print str(response.meta['redirect_urls'])
        BB
    d = response.meta['d']
So as you can see, I get 'notified' of the need to re-login in parse_lookup at BB, but I need to feed this information back to cancel the loop creating requests in test1 (AA). How can I make the information in parse_lookup available in the prior callback function?
Why not use a DownloaderMiddleware?
You could write a DownloaderMiddleware like so:
Edit: I have edited the original code to address a second problem the OP had in the comments.
from scrapy.http import Request
class CustomMiddleware():
def process_response(self, request, response, spider):
if 'redirect_urls' in response.meta:
# assuming your spider has a method for handling the login
original_url = response.meta["redirect_urls"][0]
return Request(url="login_url",
callback=spider.login,
meta={"original_url": original_url})
return response
So you "intercept" the response before it goes to parse_lookup, re-login/fix whatever is wrong, and yield new requests...
As Tomáš Linhart said, the requests are asynchronous, so I don't know whether you could run into problems by re-logging in several times in a row, as multiple requests might be redirected at the same time.
Remember to add the middleware to your settings:
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 542,
    'myproject.middlewares.CustomMiddleware': 543,
}
You can't achieve what you want because Scrapy uses asynchronous processing.
In theory you could use the approach partially suggested in the comment by @Paulo Scardine, i.e. raise an exception in parse_lookup. For it to be useful, you would then have to write a spider middleware and handle that exception in its process_spider_exception method to log back in and retry the failed requests.
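A rough sketch of that first approach (the LoginNeeded exception, login_url, after_login and ReloginMiddleware names are placeholders of mine, not tested against your spider):

from scrapy import Request


class LoginNeeded(Exception):
    """Raised by a callback when it detects it has been redirected to the login page."""


class ReloginMiddleware:
    def process_spider_exception(self, response, exception, spider):
        if isinstance(exception, LoginNeeded):
            # go log back in, remembering which request failed so it can be retried
            return [Request(spider.login_url,
                            callback=spider.after_login,
                            meta={'failed_request': response.request},
                            dont_filter=True)]
        # any other exception: let the default handling deal with it
        return None

parse_lookup would then raise LoginNeeded instead of just printing the redirect URLs, and the middleware would have to be enabled under SPIDER_MIDDLEWARES in your settings.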
But I think a better and simpler approach would be to do the same thing directly where you detect the need to log in, i.e. in parse_lookup. I'm not sure exactly how CONCURRENT_REQUESTS_PER_DOMAIN works, but setting it to 1 might let you process one request at a time, so there should be no failing requests because you always log back in as soon as you need to.
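For instance, the setting can live on the spider itself via custom_settings (just a sketch; whether one request at a time is actually enough to avoid overlapping re-logins depends on the site):

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    # handle one request to the domain at a time, so a re-login can
    # finish before the next page request goes out
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
    }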
Don't iterate over the 100 items and create requests for all of them. Instead, just create a request for the first item, process it in your callback function, yield the item, and only after that's done create the request for the second item and yield it. With this approach, you can check for the location header in your callback and either make the request for the next item or login and repeat the current item request.
For example:
def parse_lookup(self, response):
if 'redirect_urls' in response.meta:
# It's a redirect
        yield Request(url=your_login_url, callback=self.parse_login_response, meta={'current_item_url': response.request.url})
else:
# It's a normal response
item = YourItem()
... # Extract your item fields from the response
yield item
next_item_url = ... # Extract the next page URL from the response
yield Request(url=next_item_url, callback=self.parse_lookup)
This assumes that you can get the next item URL from the current item page, otherwise just put the list of URLs in the first request's META dict and pass it along.
I think it would be better not to fire all 100 requests at once; instead you should try to "serialize" the requests. For example, you could put all your empties in the request's meta and pop them off as needed, or keep the empties as a field of your spider.
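A minimal sketch of that chaining idea (get_empties_from_server, build_url and parse_empty are placeholders of mine, Request is scrapy.Request, and this assumes each empty can be turned into a URL):

def start_requests(self):
    # hypothetical helpers: fetch the whole work list once, turn each empty into a URL
    empties = self.get_empties_from_server()
    urls = [self.build_url(e) for e in empties]
    if urls:
        yield Request(urls[0], callback=self.parse_empty,
                      meta={'pending': urls[1:]}, dont_filter=True)

def parse_empty(self, response):
    # ... extract and yield the item for this page ...
    pending = response.meta['pending']
    if pending:
        # only now schedule the next request, so the empties are processed serially
        yield Request(pending[0], callback=self.parse_empty,
                      meta={'pending': pending[1:]}, dont_filter=True)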
Another alternative would be to use the scrapy-inline-requests package to accomplish what you want, but you should probably extend your middleware to perform the login.
I am making a call to a URL in Python using urllib2.urlopen in a while True loop.
My URL keeps changing every time (as a particular parameter of the URL changes with every call).
My code looks as follows:
def get_url(url):
'''Get json page data using a specified API url'''
response = urlopen(url)
data = str(response.read().decode('utf-8'))
page = json.loads(data)
return page
I am calling the above method from the main function by changing the url every time I make the call.
What I observe is that after a few calls to the function, suddenly (I don't know why), the code gets stuck at the statement
response = urlopen(url)
and it just waits and waits...
How do I best handle this situation?
I want to make sure that if it does not respond within say 10 seconds, I make the same call again.
I read about
response = urlopen(url, timeout=10)
but then what about the repeated call if this fails?
Depending on how many retries you want to attempt, use a try/except inside a loop:
while True:
try:
response = urlopen(url, timeout=10)
break
except:
# do something with the error
pass
# do something with response
data = str(response.read().decode('utf-8'))
...
This will silence all exceptions, which may not be ideal (more on that here: Handling urllib2's timeout? - Python)
With this method you can retry once.
def get_url(url, trial=1):
try:
'''Get json page data using a specified API url'''
response = urlopen(url, timeout=10)
data = str(response.read().decode('utf-8'))
page = json.loads(data)
return page
except:
if trial == 1:
return get_url(url, trial=2)
else:
return
There is a website I'm scraping that will sometimes return a 200, but not have any text in response.body (raises an AttributeError when I try to parse it with Selector).
Is there a simple way to check to make sure the body includes text, and if not, retry the request until it does? Here is some pseudocode to outline what I'm trying to do.
def check_response(response):
if response.body != '':
return response
else:
return Request(copy_of_response.request,
callback=check_response)
Basically, is there a way I can repeat a request with the exact same properties (method, url, payload, cookies, etc.)?
Follow the EAFP principle:
Easier to ask for forgiveness than permission. This common Python
coding style assumes the existence of valid keys or attributes and
catches exceptions if the assumption proves false. This clean and fast
style is characterized by the presence of many try and except
statements. The technique contrasts with the LBYL style common to many
other languages such as C.
Handle an exception and yield a Request to the current url with dont_filter=True:
dont_filter (boolean) – indicates that this request should not be
filtered by the scheduler. This is used when you want to perform an
identical request multiple times, to ignore the duplicates filter. Use
it with care, or you will get into crawling loops. Default to False.
def parse(self, response):
    try:
        # parsing logic here
        ...
    except AttributeError:
        yield Request(response.url, callback=self.parse, dont_filter=True)
You can also make a copy of the current request (not tested):
new_request = response.request.copy()
new_request.dont_filter = True
yield new_request
Or, make a new request using replace():
new_request = response.request.replace(dont_filter=True)
yield new_request
How about calling the actual _retry() method from the retry middleware, so it acts as a normal retry with all its logic that takes settings into account?
In settings:
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
'scraper.middlewares.retry.RetryMiddleware': 550,
}
Then your retry middleware could be like:
from scrapy.downloadermiddlewares.retry import RetryMiddleware \
as BaseRetryMiddleware
class RetryMiddleware(BaseRetryMiddleware):
def process_response(self, request, response, spider):
# inject retry method so request could be retried by some conditions
# from spider itself even on 200 responses
if not hasattr(spider, '_retry'):
spider._retry = self._retry
return super(RetryMiddleware, self).process_response(request, response, spider)
And then in your success response callback you can call, for example:
yield self._retry(response.request, ValueError, self)
From Scrapy 2.5.0 there is a new method get_retry_request().
It's pretty easy; here is the example from the Scrapy docs:
from scrapy.downloadermiddlewares.retry import get_retry_request

def parse(self, response):
if not response.text:
new_request_or_none = get_retry_request(
response.request,
spider=self,
reason='empty',
)
return new_request_or_none
In your existing code, you can simply pass dont_filter=True:
def check_response(response):
if response.body != '':
return response
else:
return Request(copy_of_response.request,
callback=check_response, dont_filter=True)
My code is included below and is really not much more than a slightly tweaked version of the example lifted from Scrapy's documentation. The code works as-is, but there is a gap in the logic I am not understanding between the login and how the request is passed through subsequent requests.
According to the documentation, a request object returns a response object. This response object is passed as the first argument to a callback function. This I get. This is the way authentication can be handled and subsequent requests made using the user credentials.
What I am not understanding is how the response object makes it to the next request call following authentication. In my code below, the parse method returns a result object created when authenticating using the FormRequest method. Since the FormRequest has a callback to the after_login method, the after_login method is called with the response from the FormRequest as the first parameter.
The after_login method checks to make sure there are no errors, then makes another request through a yield statement. What I do not understand is how the response passed in as an argument to the after_login method is making it to the Request following the yield. How does this happen?
The primary reason why I am interested is I need to make two requests per iterated value in the after_login method, and I cannot figure out how the responses are being handled by the scraper to then understand how to modify the code. Thank you in advance for your time and explanations.
# import Scrapy modules
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.http import FormRequest
from scrapy import log
# import custom item from item module
from scrapy_spage.items import ReachItem
class AwSpider(BaseSpider):
name = 'spage'
allowed_domains = ['webpage.org']
start_urls = ('https://www.webpage.org/',)
def parse(self, response):
credentials = {'username': 'user',
'password': 'pass'}
return [FormRequest.from_response(response,
formdata=credentials,
callback=self.after_login)]
def after_login(self, response):
# check to ensure login succeeded
if 'Login failed' in response.body:
# log error
self.log('Login failed', level=log.ERROR)
# exit method
return
else:
# for every integer from one to 5000, 1100 to 1110 for testing...
for reach_id in xrange(1100, 1110):
# call make requests, use format to create four digit string for each reach
yield Request('https://www.webpage.org/content/River/detail/id/{0:0>4}/'.format(reach_id),
callback=self.scrape_page)
def scrape_page(self, response):
# create selector object instance to parse response
sel = Selector(response)
# create item object instance
reach_item = ReachItem()
# get attribute
reach_item['attribute'] = sel.xpath('//body/text()').extract()
# other selectors...
# return the reach item
return reach_item
how the response passed in as an argument to the after_login method is making it to the Request following the yield.
If I understand your question, the answer is that it doesn't.
The mechanism is simple:
for x in spider.function():
    if x is a request:
        http call this request and wait for a response asynchronously
    if x is an item:
        send it to pipelines etc...

upon getting a response:
    request.callback(response)
As you can see, there is no limit to the number of requests the function can yield, so you can:
for reach_id in xrange(x, y):
yield Request(url=url1, callback=callback1)
yield Request(url=url2, callback=callback2)
hope this helps