Scrapy: Sending information to a prior function - Python

I am using Scrapy 1.1 to scrape a website. The site requires periodic relogin; I can tell when this is needed because a 302 redirection occurs when login is required. Based on http://sangaline.com/post/advanced-web-scraping-tutorial/, I have subclassed the RedirectMiddleware, making the Location HTTP header available in the spider under:
request.meta['redirect_urls']
My problem is that after logging in, I have set up a function to loop through 100 pages to scrape. Let's say after 15 pages I see that I have to log back in (based on the contents of request.meta['redirect_urls']). My code looks like:
def test1(self, response):
    # ......
    for row in empties:  # 100 records
        d = object_as_dict(row)
        # AA
        yield Request(url=myurl, headers=self.headers, callback=self.parse_lookup,
                      meta={'d': d}, dont_filter=True)

def parse_lookup(self, response):
    if 'redirect_urls' in response.meta:
        print str(response.meta['redirect_urls'])
        # BB
    d = response.meta['d']
So as you can see, I get 'notified' of the need to re-login in parse_lookup at BB, but need to feed this information back to cancel the loop creating requests in test1 (AA). How can I make the information in parse_lookup available in the prior callback function?

Why not use a DownloaderMiddleware?
You could write a DownloaderMiddleware like so:
Edit: I have edited the original code to address a second problem the OP had in the comments.
from scrapy.http import Request


class CustomMiddleware(object):

    def process_response(self, request, response, spider):
        # 'redirect_urls' is set on the request's meta by the RedirectMiddleware
        if 'redirect_urls' in request.meta:
            # assuming your spider has a method for handling the login
            original_url = request.meta['redirect_urls'][0]
            return Request(url='login_url',
                           callback=spider.login,
                           meta={'original_url': original_url})
        return response
So you "intercept" the response before it goes to parse_lookup, re-login/fix what is wrong, and yield new requests...
As Tomáš Linhart said, the requests are asynchronous, so I don't know whether you could run into problems by re-logging in several times in a row, since multiple requests might be redirected at the same time.
Remember to add the middleware to your settings:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 542,
    'myproject.middlewares.CustomMiddleware': 543,
}
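The middleware above assumes your spider has a login callback; here is a minimal sketch of what that pair of methods might look like (the form field names, after_login and the use of parse_lookup as the resume point are placeholders, not part of the original answer):

import scrapy
from scrapy.http import FormRequest, Request


class MySpider(scrapy.Spider):
    name = 'myspider'

    def login(self, response):
        # Submit the login form; the field names here are placeholders.
        return FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},
            meta={'original_url': response.meta.get('original_url')},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Resume scraping the page the middleware redirected us away from.
        original_url = response.meta.get('original_url')
        if original_url:
            yield Request(original_url, callback=self.parse_lookup,
                          dont_filter=True)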

You can't achieve what you want because Scrapy uses asynchronous processing.
In theory you could use the approach partially suggested in the comment by @Paulo Scardine, i.e. raise an exception in parse_lookup. For it to be useful, you would then have to write your own spider middleware and handle this exception in its process_spider_exception method to log back in and retry the failed requests.
But I think a better and simpler approach would be to do the same once you detect the need to log in, i.e. in parse_lookup. I'm not sure exactly how CONCURRENT_REQUESTS_PER_DOMAIN works, but setting it to 1 might let you process one request at a time, so there should be no failing requests, as you always log back in when you need to.
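A rough sketch of the spider-middleware variant described above, assuming a custom LoginRequired exception raised from parse_lookup and a spider that exposes login_url and a login callback (all of these names are illustrative, not from the original question):

import scrapy


class LoginRequired(Exception):
    """Raised by a callback when it detects a redirect to the login page."""


class ReloginMiddleware(object):
    def process_spider_exception(self, response, exception, spider):
        if isinstance(exception, LoginRequired):
            # Log back in, then retry the request whose response triggered the exception.
            return [scrapy.Request(spider.login_url,
                                   callback=spider.login,
                                   meta={'retry_request': response.request},
                                   dont_filter=True)]
        # Returning None lets other middlewares handle the exception.
        return None

The simpler alternative mentioned above is just a settings change, e.g. CONCURRENT_REQUESTS_PER_DOMAIN = 1.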

Don't iterate over the 100 items and create requests for all of them. Instead, just create a request for the first item, process it in your callback function, yield the item, and only after that's done create the request for the second item and yield it. With this approach, you can check for the location header in your callback and either make the request for the next item or login and repeat the current item request.
For example:
def parse_lookup(self, response):
    if 'redirect_urls' in response.meta:
        # It's a redirect
        yield Request(url=your_login_url, callback=self.parse_login_response,
                      meta={'current_item_url': response.request.url})
    else:
        # It's a normal response
        item = YourItem()
        ...  # Extract your item fields from the response
        yield item
        next_item_url = ...  # Extract the next page URL from the response
        yield Request(url=next_item_url, callback=self.parse_lookup)
This assumes that you can get the next item URL from the current item page; otherwise, just put the list of URLs in the first request's meta dict and pass it along.
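If the next URL cannot be derived from the page, here is a sketch of that meta-based variant (build_url is a hypothetical helper; empties, object_as_dict and self.headers come from the question's code):

from scrapy import Request

def test1(self, response):
    pending_urls = [self.build_url(object_as_dict(row)) for row in empties]
    first_url = pending_urls.pop(0)
    yield Request(url=first_url, headers=self.headers, callback=self.parse_lookup,
                  meta={'pending_urls': pending_urls}, dont_filter=True)

def parse_lookup(self, response):
    pending_urls = response.meta['pending_urls']
    # ... extract and yield the item for this page ...
    if pending_urls:
        next_url = pending_urls.pop(0)
        yield Request(url=next_url, headers=self.headers, callback=self.parse_lookup,
                      meta={'pending_urls': pending_urls}, dont_filter=True)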

I think it would be better not to fire all 100 requests at once; instead you should try to "serialize" the requests. For example, you could put all your empties in the request's meta and pop them out as necessary, or keep the empties as a field of your spider.
Another alternative would be to use the scrapy-inline-requests package to accomplish what you want, but you would probably still need to extend your middleware to perform the login.
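For reference, a rough sketch of how the scrapy-inline-requests style would read, assuming the package's @inline_requests decorator and reusing the names from the question (empties, object_as_dict, myurl); login_url is a placeholder:

import scrapy
from inline_requests import inline_requests


class MySpider(scrapy.Spider):
    name = 'myspider'

    @inline_requests
    def parse(self, response):
        for row in empties:  # the question's 100 records
            d = object_as_dict(row)
            # Yielding a Request here suspends until its response arrives.
            lookup = yield scrapy.Request(myurl, headers=self.headers, dont_filter=True)
            if 'redirect_urls' in lookup.meta:
                # Session expired: log back in before continuing the loop.
                yield scrapy.Request(self.login_url, dont_filter=True)
            # ... extract data from lookup ...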

Related

How do I use Python requests to POST data to a database and wait for a POST success before making another POST?

I am using a scraper to upload JSON data to a database, but I noticed that my uploads are not in order, even though my local JSON file FILE_PATH mirroring the database has the data in order. This means the culprit is in how I use Python's requests module.
def main():
    if not FILE_PATH.is_file():
        print("initializing json database file")
    # initialize and reverse the dictionary list in one go, [start:stop:step]
    ref_links = getArchivedCharacterList()[::-1]
    index = 0  # counting up the roster size for assignReleasedate
    for link in ref_links:
        char_details = getCharacterDetails(link['ref_id'], link['game_origin'], index)
        index += 1
        saveCharacterInfo(char_details)
        postRequest(POST_CHARA_DETAILS_URL, char_details, headers)
    print(f"Done! Updated the file with all previously released and archived units as of: \n {datetime.datetime.now()} \n See {FILE_PATH} for details!")
From the above, I initially scrape a page for a list of links using getArchivedCharacterList(); then, for each link, I grab more information from the individual pages, save it into my local file FILE_PATH using saveCharacterInfo, and POST it to my database with the postRequest function.
postRequest function
def postRequest(url, json_details, headers):
    r = requests.post(url, json=json_details, headers=headers)
    while r.status_code != 201:
        time.sleep(0.1)
    print(r.status_code)
I tried using a while loop to wait for the 201 POST success response; it didn't work. I looked up async/await tutorials and it doesn't seem like what I want... unless I am supposed to bundle up all my POSTs in a single go? I did this before in JavaScript, where I bundled all my posts into a promise array. Is there a promise equivalent in Python so that I can do my uploads in order? Or is there another method to achieve a sequential upload?
Thank you in advance!

Call several callback functions in Scrapy

I am using Scrapy and I have several problems:
First problem: I put start_requests in a loop, but the function is not started for each iteration.
Second problem: I need to call a different callback depending on the start_urls given by the loop, but I can't give a dynamic name for the callback. I would like to use callback=parse_i, where i comes from the loop above.
liste = [[liste1], [liste2], [liste3]]
for i in range(0, 2):
    start_urls = liste[i]

def start_requests(self):
    # print(self.start_urls)
    for u in self.start_urls:
        try:
            req = requests.get(u)
        except requests.exceptions.ConnectionError:
            print("Connection refused")
        if req.status_code != 200:
            print("Request failed, status code is :", req.status_code)
            continue
        yield scrapy.Request(u, callback=self.parse, meta={'dont_merge_cookies': True}, dont_filter=False)
thanks
"I need to call a different callback depending on the start_urls given by the loop, but I can't give a dynamic name for the callback. I would like to use callback=parse_i, where i comes from the loop above."
The callback attribute just needs to be a callable, so you can use getattr just like normal:
my_callback = getattr(self, 'parse_{}'.format(i))
yield Request(u, callback=my_callback)
Separately, while you didn't ask this, it is highly unusual to make a URL call from within start_requests, since (a) dealing with all that non-200 handling is why one would use Scrapy to begin with, and (b) doing so with requests will not honor any of the throttling, proxy, user-agent, resumption, or host of other knobs with which one would want to influence a scraping job.
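A sketch of how the two points combine, letting Scrapy itself issue the requests and route failures to an errback (liste1/liste2/liste3 and the parse_0/parse_1/... callbacks are assumed to exist on the spider, mirroring the question):

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    listes = [liste1, liste2, liste3]  # the question's URL lists

    def start_requests(self):
        for i, urls in enumerate(self.listes):
            my_callback = getattr(self, 'parse_{}'.format(i))
            for u in urls:
                # Let Scrapy perform the download; failed requests go to
                # the errback instead of being checked with requests.get.
                yield scrapy.Request(u, callback=my_callback,
                                     errback=self.on_error,
                                     meta={'dont_merge_cookies': True})

    def on_error(self, failure):
        self.logger.error(repr(failure))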

Scraper getting stuck in request loop

So I am trying to build a very basic scraper that pulls information from my server and, using that information, creates a link which it then yields a request for. After parsing that, it grabs a single link from the parsed page and uploads it back to the server using a GET request. The problem I am encountering is that it will pull info from the server, create a link, and then yield the request, and depending on the response time there (which is not reliably consistent) it will dump out and start over with another GET request to the server. The way my server logic is designed is that it pulls the next data set that needs to be worked on, and until a course of action is decided for this data set, it will continuously try to pull it and parse it. I am fairly new to Scrapy and in need of assistance. I know that my code is wrong, but I haven't been able to come up with another approach without changing a lot of server code and creating unnecessary hassle, and I am not super savvy with Scrapy or Python, unfortunately.
My start_requests method:
name = "scrapelevelone"
start_urls = []
def start_requests(self):
print("Start Requests is initiatied")
while True:
print("Were looping")
r = requests.get('serverlink.com')
print("Sent request")
pprint(r.text)
print("This is the request response text")
print("Now try to create json object: ")
try:
personObject = json.loads(r.text)
print("Made json object: ")
pprint(personObject)
info = "streetaddress=" + '+'.join(personObject['address1'].split(" ")) + "&citystatezip=" + '+'.join(personObject['city'].split(" ")) + ",%20" + personObject['state'] + "%20" + personObject['postalcodeextended']
nextPage = "https://www.webpage.com/?" + info
print("Creating info")
newRequest = scrapy.Request(nextPage, self.parse)
newRequest.meta['item'] = personObject
print("Yielding request")
yield newRequest
except Exception:
print("Reach JSON exception")
time.sleep(10)
Every time the parse function gets called, it does all the logic and makes a requests.get call at the end that is supposed to send data back to the server. It all does what it is supposed to if it gets to the end. I tried a lot of different things to get the scraper to loop and constantly request more information from the server. I want the scraper to run indefinitely, but that defeats the purpose when I can't step away from the computer because it chokes on a request. Any recommendations for keeping the scraper running 24/7 without using the stupid while loop in the start_requests function? And on top of that, can anyone tell me why it gets stuck in a loop of requests? :( I have a huge headache trying to troubleshoot this and finally gave in to a forum...
What you should do is start with your server URL and keep retrying it constantly by yielding Request objects. If the data you get is new, parse it and schedule your requests:
import json

import scrapy
from scrapy import Request


class MyCrawler(scrapy.Spider):
    name = 'mycrawler'
    start_urls = ['http://myserver.com']
    past_data = None

    def parse(self, response):
        data = json.loads(response.body_as_unicode())
        if data == self.past_data:  # if data is the same, retry
            # time.sleep(10)  # you could add a delay, but sleep will stop everything
            yield Request(response.url, dont_filter=True, priority=-100)
            return
        self.past_data = data
        for url in data['urls']:
            yield Request(url, self.parse_url)
        # keep retrying
        yield Request(response.url, dont_filter=True, priority=-100)

    def parse_url(self, response):
        # ...
        yield {'scrapy': 'item'}
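If you want to space out the retries without blocking the reactor (as the commented-out time.sleep would), one option is to lean on Scrapy's built-in delay settings instead, for example:

# settings.py (or custom_settings on the spider)
DOWNLOAD_DELAY = 10               # wait ~10 seconds between requests to the same domain
RANDOMIZE_DOWNLOAD_DELAY = False  # keep the delay fixed instead of 0.5x-1.5x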

Is it possible to modify a Scrapy Response object in middleware?

I scrape some data from several EU sites and find that sometimes my calls to response.xpath() break the text. For instance, I found that HTML entities like &amp;, &#164; and other similar ones get translated into broken bytes like \x92 or \xc3, etc.
I found a working solution: unescape HTML entities before calling the xpath method (using the lxml lib). It looks like this:
body_str = str(response.body, response._body_declared_encoding())
unescaped_body = html.unescape(body_str)
response = response.replace(body=unescaped_body)
It seems to work fine for me if this code is called immediately at the start of the callback that processes the response.
What I'm trying to do now is to move this code into a Spider Middleware, so I can use the approach for every request, in other spiders, etc. But the problem is that this code doesn't modify the response object inside
def process_spider_input(self, response, spider):
It seems that response = response.replace(...) creates a new local variable response, which isn't used anywhere else.
And my question is in the title: can I modify the response object inside a spider middleware or not?
I would say it is better to use a Downloader Middleware with the process_response method and return a Response object.
...

def process_response(self, request, response, spider):
    ...
    body_str = str(response.body, response._body_declared_encoding())
    unescaped_body = html.unescape(body_str)
    new_response = response.replace(body=unescaped_body)
    return new_response
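As with the earlier answer, remember to enable the downloader middleware in your settings; the module path and class name below are placeholders for wherever you put it:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.UnescapeBodyMiddleware': 543,
}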

How are response objects passed through request callbacks in a Scrapy scraper?

My code is included below and is really not much more than a slightly tweaked version of the example lifted from Scrapy's documentation. The code works as-is, but there is a gap in the logic I am not understanding, between the login and how the request is passed through subsequent requests.
According to the documentation, a request object returns a response object. This response object is passed as the first argument to a callback function. This I get. This is the way authentication can be handled and subsequent requests made using the user credentials.
What I am not understanding is how the response object makes it to the next request call following authentication. In my code below, the parse method returns a result object created when authenticating using the FormRequest method. Since the FormRequest has a callback to the after_login method, the after_login method is called with the response from the FormRequest as the first parameter.
The after_login method checks to make sure there are no errors, then makes another request through a yield statement. What I do not understand is how the response passed in as an argument to the after_login method is making it to the Request following the yield. How does this happen?
The primary reason why I am interested is I need to make two requests per iterated value in the after_login method, and I cannot figure out how the responses are being handled by the scraper to then understand how to modify the code. Thank you in advance for your time and explanations.
# import Scrapy modules
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.http import FormRequest
from scrapy import log

# import custom item from item module
from scrapy_spage.items import ReachItem


class AwSpider(BaseSpider):
    name = 'spage'
    allowed_domains = ['webpage.org']
    start_urls = ('https://www.webpage.org/',)

    def parse(self, response):
        credentials = {'username': 'user',
                       'password': 'pass'}
        return [FormRequest.from_response(response,
                                          formdata=credentials,
                                          callback=self.after_login)]

    def after_login(self, response):
        # check to ensure login succeeded
        if 'Login failed' in response.body:
            # log error
            self.log('Login failed', level=log.ERROR)
            # exit method
            return
        else:
            # for every integer from one to 5000, 1100 to 1110 for testing...
            for reach_id in xrange(1100, 1110):
                # call make requests, use format to create four digit string for each reach
                yield Request('https://www.webpage.org/content/River/detail/id/{0:0>4}/'.format(reach_id),
                              callback=self.scrape_page)

    def scrape_page(self, response):
        # create selector object instance to parse response
        sel = Selector(response)
        # create item object instance
        reach_item = ReachItem()
        # get attribute
        reach_item['attribute'] = sel.xpath('//body/text()').extract()
        # other selectors...
        # return the reach item
        return reach_item
how the response passed in as an argument to the after_login method is making it to the Request following the yield.
If I understand your question, the answer is that it doesn't.
The mechanism is simple:
for x in spider.function():
    if x is a request:
        http call this request and wait for a response asynchronously
    if x is an item:
        send it to pipelines etc...

upon getting a response:
    request.callback(response)
As you can see, there is no limit to the number of requests the function can yield, so you can:
for reach_id in xrange(x, y):
    yield Request(url=url1, callback=callback1)
    yield Request(url=url2, callback=callback2)
hope this helps
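Since the question mentions needing two requests per iterated value, here is a sketch of the usual pattern of chaining them through meta (the second URL and the 'history' field are made up for illustration):

def after_login(self, response):
    for reach_id in xrange(1100, 1110):
        # First request per reach_id; carry the id along in meta.
        yield Request('https://www.webpage.org/content/River/detail/id/{0:0>4}/'.format(reach_id),
                      callback=self.scrape_detail,
                      meta={'reach_id': reach_id})

def scrape_detail(self, response):
    reach_item = ReachItem()
    reach_item['attribute'] = Selector(response).xpath('//body/text()').extract()
    # Second request for the same reach_id; pass the partially filled item along.
    yield Request('https://www.webpage.org/content/River/history/id/{0:0>4}/'.format(response.meta['reach_id']),
                  callback=self.scrape_history,
                  meta={'item': reach_item})

def scrape_history(self, response):
    reach_item = response.meta['item']
    reach_item['history'] = Selector(response).xpath('//body/text()').extract()
    yield reach_item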
