I haven't written any Python code in over 10 years, so I'm rusty. I'm trying to use Scrapy to gather some information from a website:
import scrapy
import sys


class TutorialSpider(scrapy.Spider):
    name = "tutorial"

    def start_requests(self):
        urls = [
            'https://example.com/page/1',
            'https://example.com/page/2',
        ]
        for url in urls:
            print(f'{self.name} spider')
            print(f'url is {url}')
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(response.url)
        self.log(response.url)
        sys.stdout.write('hello')
I'm trying to parse the URL in the parse method. What I can't figure out is why those simple print statements don't print anything to stdout; they are silent. There doesn't seem to be a way to echo anything back to the console from there, and I am very curious about what I am missing.
Both requests your spider makes receive 404 Not Found responses. By default, Scrapy ignores responses with such a status and your callback doesn't get called.
In order to have your self.parse callback called for such responses, you have to add the 404 status code to the list of handled status codes using the handle_httpstatus_list meta key (more info in the HttpError middleware docs).
You could change your start_requests method so that the requests instruct Scrapy to handle even 404 responses:
import scrapy
import sys


class TutorialSpider(scrapy.Spider):
    name = "tutorial"

    def start_requests(self):
        urls = [
            'https://example.com/page/1',
            'https://example.com/page/2',
        ]
        for url in urls:
            print(f'{self.name} spider')
            print(f'url is {url}')
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta={'handle_httpstatus_list': [404]},
            )

    def parse(self, response):
        print(response.url)
        self.log(response.url)
        sys.stdout.write('hello')
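As a side note not covered in the original answer: if every request the spider makes should treat 404s this way, the spider-level handle_httpstatus_list attribute can be used instead of repeating the meta key on each request. A minimal sketch, reusing the names and URLs from the question:

import scrapy


class TutorialSpider(scrapy.Spider):
    name = "tutorial"
    # Let 404 responses reach the callbacks for every request this spider makes.
    handle_httpstatus_list = [404]

    start_urls = [
        'https://example.com/page/1',
        'https://example.com/page/2',
    ]

    def parse(self, response):
        # The callback now runs even when the page returns 404.
        print(response.status, response.url)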
This problem is starting to frustrate me very much as I feel like I have no clue how scrapy works and that I can't wrap my head around the documentation.
My question is simple. I have the most standard of spiders.
class MySpider(scrapy.Spider):
    def start_requests(self):
        header = ..
        url = "www.website.whatever/search/filter1=.../&page=1"
        test = scrapy.Request(url=url, callback=self.parse, headers=header)

    def parse(self, response):
        site_number_of_pages = int(response.xpath(..))
        return site_number_of_pages
I just want to somehow get the number of pages from the parse function back into the start_requests function, so I can start a for loop that goes through all the pages on the website, using the same parse function again. The code above only illustrates the principle and would not work in practice: the variable test would be a Request object, not the plain integer I want.
How would I accomplish what I am trying to do?
EDIT:
This is what I have tried so far:
class MySpider(scrapy.Spider):
    def start_requests(self):
        header = ..
        url = ..
        yield scrapy.Request(url=url, callback=self.parse, headers=header)

    def parse(self, response):
        header = ..
        site_number_of_pages = int(response.xpath(..))
        for count in range(2, site_number_of_pages):
            url = url + str(count)
            yield scrapy.Request(url=url, callback=self.parse, headers=header)
Scrapy is an asynchronous framework. There is no way to "return" a value back to start_requests: you can only yield Requests, which are then followed by their callbacks.
In general, if requests are produced as the result of parsing some response (in your case, site_number_of_pages from the first URL), they don't belong in start_requests.
The easiest thing you can do here is to yield the follow-up requests from the parse method itself:
def parse(self, response):
    site_number_of_pages = int(response.xpath(..))
    for i in range(site_number_of_pages):
        ...
        yield Request(url=...)
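A slightly fuller sketch of that approach; the URL pattern, spider name, and XPath are placeholders, not taken from the real site:

import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"  # placeholder name

    def start_requests(self):
        # Fetch page 1 first to learn the total page count (placeholder URL).
        yield scrapy.Request(
            url="https://www.website.whatever/search/filter1=.../&page=1",
            callback=self.parse,
        )

    def parse(self, response):
        # ... extract whatever page 1 itself contains ...

        # Placeholder XPath for the element holding the number of pages.
        site_number_of_pages = int(response.xpath('//span[@class="page-count"]/text()').get())

        # Schedule the remaining pages from here; they use a separate callback
        # so the pagination logic runs only once.
        for page in range(2, site_number_of_pages + 1):
            yield scrapy.Request(
                url=f"https://www.website.whatever/search/filter1=.../&page={page}",
                callback=self.parse_page,
            )

    def parse_page(self, response):
        # Parse the individual result pages here.
        pass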
Instead of grabbing the number of pages and looping through all of them, I grabbed the "next page" feature of the web page. So every time self.parse is activated, it grabs the next page and calls itself again. This goes on until there is no next page, at which point it just errors out.
class MySpider(scrapy.Spider):
    def start_requests(self):
        header = ..
        url = "www.website.whatever/search/filter1=.../&page=1"
        yield scrapy.Request(url=url, callback=self.parse, headers=header)

    def parse(self, response):
        header = ..
        ..
        next_page = response.xpath(..)
        url = "www.website.whatever/search/filter1=.../&page=" + next_page
        yield scrapy.Request(url=url, callback=self.parse, headers=header)
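For this "next page" pattern, response.follow is a convenient alternative, since it resolves relative URLs and the spider simply finishes when no next link exists. A minimal sketch assuming a hypothetical CSS selector for the next-page link:

    def parse(self, response):
        # ... extract items from the current page ...

        # Hypothetical selector; adjust to the site's actual markup.
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page is not None:
            # response.follow builds the Request and resolves relative URLs,
            # so no manual string concatenation is needed.
            yield response.follow(next_page, callback=self.parse)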
I am still trying to use Scrapy to collect data from pages on Weibo that require logging in to access.
I now understand that I need to use Scrapy FormRequests to get the login cookie. I have updated my Spider to try to make it do this, but it still isn't working.
Can anybody tell me what I am doing wrong?
import scrapy


class LoginSpider(scrapy.Spider):
    name = 'WB'

    def start_requests(self):
        return [
            scrapy.Request("https://www.weibo.com/u/2247704362/home?wvr=5&lf=reg", callback=self.parse_item)
        ]

    def parse_item(self, response):
        return scrapy.FormRequest.from_response(response, formdata={'user': 'user', 'pass': 'pass'}, callback=self.parse)

    def parse(self, response):
        print(response.body)
When I run this spider, Scrapy redirects from the URL under start_requests and then returns the following error:
ValueError: No element found in <200 https://passport.weibo.com/visitor/visitor?entry=miniblog&a=enter&url=https%3A%2F%2Fweibo.com%2Fu%2F2247704362%2Fhome%3Fwvr%3D5%26lf%3Dreg&domain=.weibo.com&ua=php-sso_sdk_client-0.6.28&_rand=1585243156.3952>
Does that mean I need to get the spider to look for something other than form data in the original page? How do I tell it to look for the cookie?
I have also tried a spider like the one below, based on this post.
import scrapy


class LoginSpider(scrapy.Spider):
    name = 'WB'
    login_url = "https://www.weibo.com/overseas"
    test_url = 'https://www.weibo.com/u/2247704362/'

    def start_requests(self):
        yield scrapy.Request(url=self.login_url, callback=self.parse_login)

    def parse_login(self, response):
        return scrapy.FormRequest.from_response(response, formid="W_login_form", formdata={"loginname": "XXXXX", "password": "XXXXX"}, callback=self.start_crawl)

    def start_crawl(self, response):
        yield scrapy.Request(self.test_url, callback=self.parse_item)

    def parse_item(self, response):
        print("Test URL " + response.url)
But it still doesn't work, giving the error:
ValueError: No element found in <200 https://www.weibo.com/overseas>
Would really appreciate any help anybody can offer as this is kind of beyond my range of knowledge.
I'm trying to crawl a site and check whether any of the links pointing to pages within the site are down. As there is no sitemap available, I'm using Scrapy to crawl the site and get all the links on every page, but I can't get it to output a file with all the links found and their status codes. The site I'm using to test the code is quotes.toscrape.com and my code is:
from scrapy.spiders import Spider
from mytest.items import MytestItem
from scrapy.http import Request
import re


class MySpider(Spider):
    name = "sample"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        # We store already-crawled links in this list
        crawledLinks = []
        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            if link not in crawledLinks:
                link = "http://quotes.toscrape.com" + link
                crawledLinks.append(link)
                yield Request(link, self.parse)
I've tried adding the following lines after yield:
item = MytestItem()
item['url'] = link
item['status'] = response.status
yield item
But it gets me a bunch of duplicates and no URLs with status 404 or 301. Does anyone know how I can get all the URLs along with their status?
Scrapy by default does not pass unsuccessful responses to your callbacks, but you can catch and handle them in one of your functions if you set an errback on the request.
def parse(self, response):
    # some code
    yield Request(link, self.parse, errback=self.parse_error)

def parse_error(self, failure):
    # log the response as an error
The documentation contains an example of how to use failure to determine the error reason and access the Response if available:
# Imports needed for the error checks below (from the Scrapy docs example)
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError


def errback_httpbin(self, failure):
    # log all failures
    self.logger.error(repr(failure))

    # in case you want to do something special for some errors,
    # you may need the failure's type:
    if failure.check(HttpError):
        # these exceptions come from HttpError spider middleware
        # you can get the non-200 response
        response = failure.value.response
        self.logger.error('HttpError on %s', response.url)

    elif failure.check(DNSLookupError):
        # this is the original request
        request = failure.request
        self.logger.error('DNSLookupError on %s', request.url)

    elif failure.check(TimeoutError, TCPTimedOutError):
        request = failure.request
        self.logger.error('TimeoutError on %s', request.url)
You should use HTTPERROR_ALLOW_ALL in your settings, or set the meta key handle_httpstatus_all = True in all your requests; please refer to the docs for more information.
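For reference, a minimal sketch of both options; the spider name and URL are placeholders rather than taken from the question:

import scrapy


class StatusSpider(scrapy.Spider):
    name = "status_check"  # placeholder name

    # Option 1: allow all HTTP status codes through to the callbacks
    # for this spider only.
    custom_settings = {
        "HTTPERROR_ALLOW_ALL": True,
    }

    def start_requests(self):
        # Option 2: allow all status codes on a per-request basis instead.
        yield scrapy.Request(
            "http://quotes.toscrape.com",
            callback=self.parse,
            meta={"handle_httpstatus_all": True},
        )

    def parse(self, response):
        # Non-2xx responses (e.g. 404) now reach this callback,
        # so their status can be recorded alongside the URL.
        yield {"url": response.url, "status": response.status}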
I am trying to crawl the forum category of craigslist.org (https://forums.craigslist.org/).
My spider:
import scrapy
from scrapy import Request


class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["forums.craigslist.org"]
    start_urls = ['http://geo.craigslist.org/iso/us/']

    def error_handler(self, failure):
        print failure

    def parse(self, response):
        yield Request('https://forums.craigslist.org/',
                      self.getForumPage,
                      dont_filter=True,
                      errback=self.error_handler)

    def getForumPage(self, response):
        print "forum page"
I get this message from the error callback:
[Failure instance: Traceback: :
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:455:callback
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:563:_startRunCallbacks
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:649:_runCallbacks
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:1316:gotResult
--- ---
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:1258:_inlineCallbacks
/usr/local/lib/python2.7/site-packages/twisted/python/failure.py:389:throwExceptionIntoGenerator
/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py:37:process_request
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:649:_runCallbacks
/usr/local/lib/python2.7/site-packages/scrapy/downloadermiddlewares/robotstxt.py:46:process_request_2
]
But I have this problem only with the forum section of Craigslist. It might be because the forum section is HTTPS, unlike the rest of the website.
So it seems impossible to get a response...
Any idea?
I am posting a solution that I found to get around the problem.
I used the urllib2 library. Look:
import urllib2
import scrapy
from scrapy.http import HtmlResponse


class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["forums.craigslist.org"]
    start_urls = ['http://geo.craigslist.org/iso/us/']

    def error_handler(self, failure):
        print failure

    def parse(self, response):
        # Get a valid request with urllib2
        req = urllib2.Request('https://forums.craigslist.org/')
        # Get the content of this request
        pageContent = urllib2.urlopen(req).read()
        # Parse the content into an HtmlResponse compatible with Scrapy
        response = HtmlResponse(url=response.url, body=pageContent)
        print response.css(".forumlistcolumns li").extract()
With this solution, you can wrap the fetched content in a valid Scrapy response and use it normally.
There is probably a better method, but this one is functional.
I think you are dealing with robots.txt. Try running your spider with
custom_settings = {
    "ROBOTSTXT_OBEY": False
}
You can also test it using command line settings: scrapy crawl craigslist -s ROBOTSTXT_OBEY=False.
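In context, that setting would live on the spider class itself; a minimal sketch based on the spider from the question:

import scrapy


class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["forums.craigslist.org"]
    start_urls = ['http://geo.craigslist.org/iso/us/']

    # Per-spider settings override the project settings, so only this
    # spider ignores robots.txt.
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
    }

    def parse(self, response):
        # With robots.txt ignored, the HTTPS forum page can be requested directly.
        yield scrapy.Request('https://forums.craigslist.org/', callback=self.get_forum_page)

    def get_forum_page(self, response):
        self.logger.info("forum page: %s", response.url)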
I have a very basic Scrapy spider, which grabs URLs from a file and then downloads them. The only problem is that some of them get redirected to a slightly modified URL within the same domain. I want to get them in my callback function using response.meta, and it works on normal URLs, but when a URL is redirected the callback doesn't seem to get called. How can I fix it?
Here's my code.
from scrapy.contrib.spiders import CrawlSpider
from scrapy import log
from scrapy import Request


class DmozSpider(CrawlSpider):
    name = "dmoz"
    handle_httpstatus_list = [302]
    allowed_domains = ["http://www.exmaple.net/"]
    f = open("C:\\python27\\1a.csv", 'r')
    url = 'http://www.exmaple.net/Query?indx='
    start_urls = [url + row for row in f.readlines()]

    def parse(self, response):
        print response.meta.get('redirect_urls', [response.url])
        print response.status
        print (response.headers.get('Location'))
I've also tried something like this:
def parse(self, response):
    return Request(response.url, meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse_my_url)

def parse_my_url(self, response):
    print response.status
    print (response.headers.get('Location'))
And it doesn't work either.
By default, Scrapy follows redirects. If you don't want a redirect to be followed, you can do it like this: use the start_requests method and add the flags to the request meta.
def start_requests(self):
    requests = [Request(self.url + u,
                        meta={'handle_httpstatus_list': [302],
                              'dont_redirect': True},
                        callback=self.parse)
                for u in self.start_urls]
    return requests
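With those flags set, the 302 response itself reaches the callback, so the redirect target can be read from the Location header. A small sketch of such a callback (written in Python 3 style, not taken from the original answer):

def parse(self, response):
    if response.status == 302:
        # The redirect was not followed, so the target URL is in the
        # Location header of the 302 response (returned as bytes by Scrapy).
        location = response.headers.get('Location')
        if location is not None:
            self.logger.info("%s redirected to %s", response.url, location.decode('utf-8'))
    else:
        self.logger.info("fetched %s with status %s", response.url, response.status)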