Scrapy callback after redirect - python

I have a very basic Scrapy spider, which grabs URLs from a file and then downloads them. The only problem is that some of them get redirected to a slightly modified URL within the same domain. I want to get the original URLs in my callback function using response.meta, and it works for normal URLs, but when a URL is redirected the callback doesn't seem to get called. How can I fix it?
Here's my code.
from scrapy.contrib.spiders import CrawlSpider
from scrapy import log
from scrapy import Request

class DmozSpider(CrawlSpider):
    name = "dmoz"
    handle_httpstatus_list = [302]
    # allowed_domains takes domain names, not URLs
    allowed_domains = ["exmaple.net"]

    f = open("C:\\python27\\1a.csv", 'r')
    url = 'http://www.exmaple.net/Query?indx='
    # strip the trailing newline from each row before building the URL
    start_urls = [url + row.strip() for row in f.readlines()]

    def parse(self, response):
        print response.meta.get('redirect_urls', [response.url])
        print response.status
        print response.headers.get('Location')
I've also tried something like this:
def parse(self, response):
    return Request(response.url,
                   meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
                   callback=self.parse_my_url)

def parse_my_url(self, response):
    print response.status
    print response.headers.get('Location')
And it doesn't work either.

By default Scrapy follows redirects. If you don't want a request to be redirected, use the start_requests method and add the flags to the request meta, like this:
def start_requests(self):
    # start_urls already holds the full URLs, so each one can be requested directly
    requests = [Request(u,
                        meta={'handle_httpstatus_list': [302],
                              'dont_redirect': True},
                        callback=self.parse)
                for u in self.start_urls]
    return requests
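For the original goal of seeing the pre-redirect URL in the callback, there is also the opposite approach: leave the stock RedirectMiddleware enabled (drop handle_httpstatus_list and dont_redirect), let it follow the 302, and read the redirect_urls key it stores in the response meta. A minimal sketch of the callback:

def parse(self, response):
    # redirect_urls is only present when a redirect actually happened;
    # its first element is the URL that was originally requested
    original_url = response.meta.get('redirect_urls', [response.url])[0]
    self.log("requested %s, landed on %s" % (original_url, response.url))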

Related

Scrapy response url not exactly the same as the one I defined in start_urls

I have a spider and I give it this URL: https://tuskys.dpo.store/#!/~/search/keyword=dairy milk
However, when I try to get the URL in the Scrapy parse method, it looks like this: https://tuskys.dpo.store/?_escaped_fragment_=%2F%7E%2Fsearch%2Fkeyword%3Ddairy%2520milk
Here is some demo code to demonstrate my problem:
import scrapy

class TuskysDpoSpider(scrapy.Spider):
    name = "Tuskys_dpo"
    #allowed_domains = ['ebay.com']
    start_urls = ['https://tuskys.dpo.store/#!/~/search/keyword=dairy milk']

    def parse(self, response):
        yield {'url': response.url}

Results: {"url": "https://tuskys.dpo.store/?_escaped_fragment_=%2F%7E%2Fsearch%2Fkeyword%3Ddairy%2520milk"}
Why is my Scrapy response URL not exactly the same as the URL I defined, and is there a way to work around this?
You should use response.request.url: because you are redirected from your start URL, response.url is the URL you were redirected to.
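If what you ultimately need is the URL exactly as you typed it, another workaround is to carry it through the request meta yourself. A minimal sketch (original_url is just a meta key chosen here for illustration, not a Scrapy feature):

import scrapy

class TuskysDpoSpider(scrapy.Spider):
    name = "Tuskys_dpo"

    def start_requests(self):
        url = 'https://tuskys.dpo.store/#!/~/search/keyword=dairy milk'
        # stash the hand-written URL so the callback can read it back unchanged
        yield scrapy.Request(url, callback=self.parse, meta={'original_url': url})

    def parse(self, response):
        yield {
            'url': response.url,                            # possibly rewritten/redirected
            'original_url': response.meta['original_url'],  # exactly what was defined
        }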

Scrapy to login and then grab data from Weibo

I am still trying to use Scrapy to collect data from pages on Weibo which need to be logged in to access.
I now understand that I need to use Scrapy FormRequests to get the login cookie. I have updated my Spider to try to make it do this, but it still isn't working.
Can anybody tell me what I am doing wrong?
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'WB'

    def start_requests(self):
        return [
            scrapy.Request("https://www.weibo.com/u/2247704362/home?wvr=5&lf=reg", callback=self.parse_item)
        ]

    def parse_item(self, response):
        return scrapy.FormRequest.from_response(response, formdata={'user': 'user', 'pass': 'pass'}, callback=self.parse)

    def parse(self, response):
        print(response.body)
When I run this spider, Scrapy redirects from the URL under start_requests and then returns the following error:
ValueError: No element found in <200 https://passport.weibo.com/visitor/visitor?entry=miniblog&a=enter&url=https%3A%2F%2Fweibo.com%2Fu%2F2247704362%2Fhome%3Fwvr%3D5%26lf%3Dreg&domain=.weibo.com&ua=php-sso_sdk_client-0.6.28&_rand=1585243156.3952>
Does that mean I need to get the spider to look for something other than form data in the original page? How do I tell it to look for the cookie?
I have also tried a spider like the one below, based on this post.
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'WB'
    login_url = "https://www.weibo.com/overseas"
    test_url = 'https://www.weibo.com/u/2247704362/'

    def start_requests(self):
        yield scrapy.Request(url=self.login_url, callback=self.parse_login)

    def parse_login(self, response):
        return scrapy.FormRequest.from_response(response, formid="W_login_form", formdata={"loginname": "XXXXX", "password": "XXXXX"}, callback=self.start_crawl)

    def start_crawl(self, response):
        yield scrapy.Request(self.test_url, callback=self.parse_item)

    def parse_item(self, response):
        print("Test URL " + response.url)
But it still doesn't work, giving the error:
ValueError: No element found in <200 https://www.weibo.com/overseas>
Would really appreciate any help anybody can offer as this is kind of beyond my range of knowledge.

Why is the print() function not echoing to console?

I haven't written any Python code in over 10 years. So I'm trying to use Scrapy to assemble some information off of a website:
import scrapy
import sys

class TutorialSpider(scrapy.Spider):
    name = "tutorial"

    def start_requests(self):
        urls = [
            'https://example.com/page/1',
            'https://example.com/page/2',
        ]
        for url in urls:
            print(f'{self.name} spider')
            print(f'url is {url}')
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(response.url)
        self.log(response.url)
        sys.stdout.write('hello')
I'm trying to parse the URL in the parse method. What I can't figure out is why those simple print statements print nothing to stdout. They are silent. There doesn't seem to be a way to echo anything back to the console, and I am very curious about what I am missing here.
Both requests you're doing in your spider receive 404 Not found responses. By default, Scrapy ignores responses with such a status and your callback doesn't get called.
In order to have your self.parse callback called for such responses, you have to add the 404 status code to the list of handled status codes using the handle_httpstatus_list meta key (more info here).
You could change your start_requests method so that the requests instruct Scrapy to handle even 404 responses:
import scrapy
import sys

class TutorialSpider(scrapy.Spider):
    name = "tutorial"

    def start_requests(self):
        urls = [
            'https://example.com/page/1',
            'https://example.com/page/2',
        ]
        for url in urls:
            print(f'{self.name} spider')
            print(f'url is {url}')
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta={'handle_httpstatus_list': [404]},
            )

    def parse(self, response):
        print(response.url)
        self.log(response.url)
        sys.stdout.write('hello')
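If you want this behaviour for every request rather than setting it per request, an equivalent spider-level sketch (assuming the default HttpErrorMiddleware is enabled) uses the handle_httpstatus_list attribute, with the HTTPERROR_ALLOWED_CODES setting as a project-wide alternative:

import scrapy

class TutorialSpider(scrapy.Spider):
    name = "tutorial"
    # every request from this spider now handles 404 responses
    handle_httpstatus_list = [404]
    # setting-level alternative:
    # custom_settings = {'HTTPERROR_ALLOWED_CODES': [404]}
    start_urls = ['https://example.com/page/1', 'https://example.com/page/2']

    def parse(self, response):
        print(response.status, response.url)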

Scrapy ignore request for a specific domain

I am trying to crawl the forum category of craigslist.org (https://forums.craigslist.org/).
My spider:
import scrapy
from scrapy import Request

class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["forums.craigslist.org"]
    start_urls = ['http://geo.craigslist.org/iso/us/']

    def error_handler(self, failure):
        print failure

    def parse(self, response):
        yield Request('https://forums.craigslist.org/',
                      self.getForumPage,
                      dont_filter=True,
                      errback=self.error_handler)

    def getForumPage(self, response):
        print "forum page"
I get this message from the error callback:
[Failure instance: Traceback: :
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:455:callback
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:563:_startRunCallbacks
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:649:_runCallbacks
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:1316:gotResult
--- ---
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:1258:_inlineCallbacks
/usr/local/lib/python2.7/site-packages/twisted/python/failure.py:389:throwExceptionIntoGenerator
/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py:37:process_request
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:649:_runCallbacks
/usr/local/lib/python2.7/site-packages/scrapy/downloadermiddlewares/robotstxt.py:46:process_request_2
]
But I have this problem only with the forum section of Craigslist. It might be because the forum section uses https, unlike the rest of the website.
So it's impossible to get a response...
Any ideas?
I'm posting a solution that I found to get around the problem, using the urllib2 library:
import urllib2

import scrapy
from scrapy.http import HtmlResponse

class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["forums.craigslist.org"]
    start_urls = ['http://geo.craigslist.org/iso/us/']

    def error_handler(self, failure):
        print failure

    def parse(self, response):
        # Get a valid request with urllib2
        req = urllib2.Request('https://forums.craigslist.org/')
        # Get the content of this request
        pageContent = urllib2.urlopen(req).read()
        # Parse the content into an HtmlResponse compatible with Scrapy
        response = HtmlResponse(url=response.url, body=pageContent)
        print response.css(".forumlistcolumns li").extract()
With this solution, you can turn the fetched page into a valid Scrapy response and parse it as usual.
There is probably a better method, but this one works.
I think you are dealing with robots.txt. Try running your spider with
custom_settings = {
    "ROBOTSTXT_OBEY": False
}
You can also test it using command line settings: scrapy crawl craigslist -s ROBOTSTXT_OBEY=False.
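Putting that together with the original spider, a sketch (assuming robots.txt really is what blocks the forum request) that avoids the urllib2 detour entirely:

import scrapy
from scrapy import Request

class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["forums.craigslist.org"]
    start_urls = ['http://geo.craigslist.org/iso/us/']
    # disable robots.txt checking for this spider only
    custom_settings = {"ROBOTSTXT_OBEY": False}

    def parse(self, response):
        yield Request('https://forums.craigslist.org/',
                      callback=self.getForumPage,
                      dont_filter=True)

    def getForumPage(self, response):
        self.log("forum page: %s" % response.url)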

Scrapy: Creating a Request to a site that is not in allowed_domains

I am scraping a certain website. Under certain conditions, I might want to make a Request to go to a website that is not listed in allowed_domains. Is that possible? If not, can I temporarily add the domain in there, create a Request and then remove the domain from my parser callback?
Set dont_filter=True on a Request object (documentation):
dont_filter (boolean) – indicates that this request should not be
filtered by the scheduler.
Example:
from scrapy.spider import BaseSpider
from scrapy.http import Request

class MySpider(BaseSpider):
    name = 'wikipedia'
    allowed_domains = ['en.wikipedia.org']
    start_urls = [
        'http://en.wikipedia.org/wiki/Main_Page',
    ]

    def parse(self, response):
        print "I'm at wikipedia"
        request = Request(url="https://google.com",
                          callback=self.parse_google,
                          dont_filter=True)
        yield request

    def parse_google(self, response):
        print "I'm at google"
