I was trying to make an authenticated spider. I have referred almost every post here related to Scrapy authenticated spider, I couldn't find any answer for my issue. I have used the following code:
import scrapy
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import FormRequest, Request
import logging
from PWC.items import PwcItem
class PwcmoneySpider(scrapy.Spider):
name = "PWCMoney"
allowed_domains = ["pwcmoneytree.com"]
start_urls = (
'https://www.pwcmoneytree.com/SingleEntry/singleComp?compName=Addicaid',
)
def parse(self, response):
return [scrapy.FormRequest("https://www.pwcmoneytree.com/Account/Login",
formdata={'UserName': 'user', 'Password': 'pswd'},
callback=self.after_login)]
def after_login(self, response):
if "authentication failed" in response.body:
self.log("Login failed", level=logging.ERROR)
return
# We've successfully authenticated, let's have some fun!
print("Login Successful!!")
return Request(url="https://www.pwcmoneytree.com/SingleEntry/singleComp?compName=Addicaid",
callback=self.parse_tastypage)
def parse_tastypage(self, response):
for sel in response.xpath('//div[#id="MainDivParallel"]'):
item = PwcItem()
item['name'] = sel.xpath('div[#id="CompDiv"]/h2/text()').extract()
item['location'] = sel.xpath('div[#id="CompDiv"]/div[#id="infoPane"]/div[#class="infoSlot"]/div/a/text()').extract()
item['region'] = sel.xpath('div[#id="CompDiv"]/div[#id="infoPane"]/div[#id="contactInfoDiv"]/div[1]/a[2]/text()').extract()
yield item
And I got the following output:
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
C:\Python27\PWC>scrapy crawl PWCMoney -o test.csv
2016-04-29 11:37:35 [scrapy] INFO: Scrapy 1.0.5 started (bot: PWC)
2016-04-29 11:37:35 [scrapy] INFO: Optional features available: ssl, http11
2016-04-29 11:37:35 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'PW
C.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['PWC.spiders'], 'FEED_URI':
'test.csv', 'BOT_NAME': 'PWC'}
2016-04-29 11:37:35 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter
, TelnetConsole, LogStats, CoreStats, SpiderState
2016-04-29 11:37:36 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddl
eware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultH
eadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMidd
leware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-04-29 11:37:36 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddlewa
re, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-04-29 11:37:36 [scrapy] INFO: Enabled item pipelines:
2016-04-29 11:37:36 [scrapy] INFO: Spider opened
2016-04-29 11:37:36 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 i
tems (at 0 items/min)
2016-04-29 11:37:36 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-04-29 11:37:37 [scrapy] DEBUG: Retrying <POST https://www.pwcmoneytree.com/
Account/Login> (failed 1 times): 500 Internal Server Error
2016-04-29 11:37:38 [scrapy] DEBUG: Retrying <POST https://www.pwcmoneytree.com/
Account/Login> (failed 2 times): 500 Internal Server Error
2016-04-29 11:37:38 [scrapy] DEBUG: Gave up retrying <POST https://www.pwcmoneyt
ree.com/Account/Login> (failed 3 times): 500 Internal Server Error
2016-04-29 11:37:38 [scrapy] DEBUG: Crawled (500) <POST https://www.pwcmoneytree
.com/Account/Login> (referer: None)
2016-04-29 11:37:38 [scrapy] DEBUG: Ignoring response <500 https://www.pwcmoneyt
ree.com/Account/Login>: HTTP status code is not handled or not allowed
2016-04-29 11:37:38 [scrapy] INFO: Closing spider (finished)
2016-04-29 11:37:38 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 954,
'downloader/request_count': 3,
'downloader/request_method_count/POST': 3,
'downloader/response_bytes': 30177,
'downloader/response_count': 3,
'downloader/response_status_count/500': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 4, 29, 6, 7, 38, 674000),
'log_count/DEBUG': 6,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2016, 4, 29, 6, 7, 36, 193000)}
2016-04-29 11:37:38 [scrapy] INFO: Spider closed (finished)
Since I am new to python and Scrapy, I can't seem to understand the error, I hope someone here could help me.
So, I modified the code like this taking Rejected's advice, showing only the modified part:
allowed_domains = ["pwcmoneytree.com"]
start_urls = (
'https://www.pwcmoneytree.com/Account/Login',
)
def start_requests(self):
return [scrapy.FormRequest.from_response("https://www.pwcmoneytree.com/Account/Login",
formdata={'UserName': 'user', 'Password': 'pswd'},
callback=self.logged_in)]
And got the following error:
C:\Python27\PWC>scrapy crawl PWCMoney -o test.csv
2016-04-30 11:04:47 [scrapy] INFO: Scrapy 1.0.5 started (bot: PWC)
2016-04-30 11:04:47 [scrapy] INFO: Optional features available: ssl, http11
2016-04-30 11:04:47 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'PW
C.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['PWC.spiders'], 'FEED_URI':
'test.csv', 'BOT_NAME': 'PWC'}
2016-04-30 11:04:50 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter
, TelnetConsole, LogStats, CoreStats, SpiderState
2016-04-30 11:04:54 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddl
eware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultH
eadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMidd
leware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-04-30 11:04:54 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddlewa
re, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-04-30 11:04:54 [scrapy] INFO: Enabled item pipelines:
Unhandled error in Deferred:
2016-04-30 11:04:54 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
File "c:\python27\lib\site-packages\scrapy\cmdline.py", line 150, in _run_comm
and
cmd.run(args, opts)
File "c:\python27\lib\site-packages\scrapy\commands\crawl.py", line 57, in run
self.crawler_process.crawl(spname, **opts.spargs)
File "c:\python27\lib\site-packages\scrapy\crawler.py", line 153, in crawl
d = crawler.crawl(*args, **kwargs)
File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 1274, in
unwindGenerator
return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 1128, in
_inlineCallbacks
result = g.send(result)
File "c:\python27\lib\site-packages\scrapy\crawler.py", line 72, in crawl
start_requests = iter(self.spider.start_requests())
File "C:\Python27\PWC\PWC\spiders\PWCMoney.py", line 16, in start_requests
callback=self.logged_in)]
File "c:\python27\lib\site-packages\scrapy\http\request\form.py", line 36, in
from_response
kwargs.setdefault('encoding', response.encoding)
exceptions.AttributeError: 'str' object has no attribute 'encoding'
2016-04-30 11:04:54 [twisted] CRITICAL:
As seen in your error log, it's the POST request to https://www.pwcmoneytree.com/Account/Login that is giving you a 500 error.
I tried making the same POST request manually, using POSTman. It gives the 500 error code and a HTML page containing this error message:
The required anti-forgery cookie "__RequestVerificationToken" is not present.
This is a feature many APIs and websites use to prevent CSRF attacks. If you still want to scrape the site, you would have to first visit the login form and get the proper cookie before logging in.
You're making your crawler do work for no reason. Your first request (initiated with start_urls) is being processed, and then the response is discarded. There's very rarely a reason to do this (unless making the request itself is a requirement).
Instead, change your start_urls to "https://www.pwcmoneytree.com/Account/Login", and change scrapy.FormRequest(...) to scrapy.FormRequest.from_response(...). You'll also need to also change the provided URL to being the received response (and possibly identify the desired form).
This will save you a wasted request, fetch/pre-fill other verification tokens, and clean up your code.
EDIT: Below is code you should be using. Note: You changed self.after_login to self.logged_in, so I left it as the newer change.
...
allowed_domains = ["pwcmoneytree.com"]
start_urls = (
'https://www.pwcmoneytree.com/Account/Login',
)
def parse(self, response):
return scrapy.FormRequest.from_response(response,
formdata={'UserName': 'user', 'Password': 'pswd'},
callback=self.logged_in)
...
Related
I'm trying to scrape the prices for shoes on the website in the code. I have no idea of knowing if my syntax is even correct. I could really use some help.
from scrapy.spider import BaseSpider
from scrapy import Field
from scrapy import Item
from scrapy.selector import HtmlXPathSelector
def Yeezy(Item):
price = Field()
class YeezySpider(BaseSpider):
name = "yeezy"
allowed_domains = ["https://www.grailed.com/"]
start_url = ['https://www.grailed.com/feed/0Qu8Gh1qHQ?page=2']
def parse(self, response):
hxs = HtmlXPathSelector(response)
price = hxs.css('.listing-price .sub-title:nth-child(1) span').extract()
items = []
for price in price:
item = Yeezy()
item["price"] = price.select(".listing-price .sub-title:nth-child(1) span").extract()
items.append(item)
yield item
The code is reporting this to the console:
ScrapyDeprecationWarning: YeezyScrape.spiders.yeezy_spider.YeezySpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
class YeezySpider(BaseSpider):
2017-08-02 14:45:25-0700 [scrapy] INFO: Scrapy 0.25.1 started (bot: YeezyScrape)
2017-08-02 14:45:25-0700 [scrapy] INFO: Optional features available: ssl, http11
2017-08-02 14:45:25-0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'YeezyScrape.spiders', 'SPIDER_MODULES': ['YeezyScrape.spiders'], 'BOT_NAME': 'YeezyScrape'}
2017-08-02 14:45:25-0700 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled item pipelines:
2017-08-02 14:45:26-0700 [yeezy] INFO: Spider opened
2017-08-02 14:45:26-0700 [yeezy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-02 14:45:26-0700 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-02 14:45:26-0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2017-08-02 14:45:26-0700 [yeezy] INFO: Closing spider (finished)
2017-08-02 14:45:26-0700 [yeezy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 8, 2, 21, 45, 26, 127000),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'start_time': datetime.datetime(2017, 8, 2, 21, 45, 26, 125000)}
2017-08-02 14:45:26-0700 [yeezy] INFO: Spider closed (finished)
Process finished with exit code 0
At first I thought it was a problem with the css elements I entered but now I'm not so sure. This is my first time trying a project like this, I could really use some insight. Thank you in advance.
EDIT: So I tried simulating an xhr request in my code by following another example. This is what I have:
import scrapy
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector
#from YeezyScrape import YeezyscrapeItem
class YeezySpider(scrapy.Spider):
name = "yeezy"
allowed_domains = ["www.grailed.com"]
start_url = ["https://www.grailed.com/feed/0Qu8Gh1qHQ?page=2"]
def parse(self, response):
for i in range(0,2):
yield FormRequest(url = 'https://mnrwefss2q-
dsn.algolia.net/1/indexes/Listing_production/query?x-algolia-
agent=Algolia%20for%20vanilla%20JavaScript%203.21.1&x-algolia-application-
id=MNRWEFSS2Q&x-algolia-api-key=a3a4de2e05d9e9b463911705fb6323ad',
method="post", formdata={"params":"query:boost
filters:(strata:'basic' OR strata:'grailed' OR strata:'hype') AND
(category_path:'footwear.slip_ons' OR category_path:'footwear.sandals' OR
category_path:'footwear.lowtop_sneakers' OR category_path:'footwear.leather'
OR category_path:'footwear.hitop_sneakers' OR
category_path:'footwear.formal_shoes' OR category_path:'footwear.boots') AND
(marketplace:grailed)
hitsPerPage:40
facets ["strata","size","category","category_size",
"category_path","category_path_size",
"category_path_root_size","price_i","designers.id",
"location","marketplace"]
page:2"}, callback=self.data_parse())
def data_parse(self, response):
hxs = HtmlXPathSelector(response)
prices = hxs.xpath("//p").extract()
for prices in prices:
price = prices.select("a/text()").extract()
print price
I had to reformat things a little to fit the indentation differences between Python and Stackoverflow.
These are the logs reported in the terminal, again thanks for the help:
C:\Python27\python.exe C:/Python27/Lib/site-packages/scrapy/cmdline.py crawl yeezy -o price.json
2017-08-04 13:23:27-0700 [scrapy] INFO: Scrapy 0.25.1 started (bot: YeezyScrape)
2017-08-04 13:23:27-0700 [scrapy] INFO: Optional features available: ssl, http11
2017-08-04 13:23:27-0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'YeezyScrape.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['YeezyScrape.spiders'], 'FEED_URI': 'price.json', 'BOT_NAME': 'YeezyScrape'}
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled item pipelines:
2017-08-04 13:23:27-0700 [yeezy] INFO: Spider opened
2017-08-04 13:23:28-0700 [yeezy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-04 13:23:28-0700 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-04 13:23:28-0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2017-08-04 13:23:28-0700 [yeezy] INFO: Closing spider (finished)
2017-08-04 13:23:28-0700 [yeezy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 8, 4, 20, 23, 28, 3000),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'start_time': datetime.datetime(2017, 8, 4, 20, 23, 28, 1000)}
2017-08-04 13:23:28-0700 [yeezy] INFO: Spider closed (finished)
Process finished with exit code 0
Seems like the products are retrieved by AJAX (see related: Can scrapy be used to scrape dynamic content from websites that are using AJAX?).
If you open up browsers webinspector, select network tab and look for XHR requests when the page loads, you can see this:
Seems like a POST type request is being made with categories, filter etc. and a json of products is returned. You can reverse engineer it and replicate it in scrapy.
I am trying to scrap a site using Scrapy and Selenium.
I can get the web-browser to open using selenium but i am unable to get the start url into the web-browser. At present, the web-browser opens, does nothing and then closes whilst i get the error "<405 https://etc etc>: HTTP status code is not handled or not allowed".
Which, as far as i understand, confirms that i am not being able to pass the url to the web-browser.
What am i doing wrong here?
import scrapy
import time
from selenium import webdriver
from glassdoor.items import GlassdoorItem
class glassdoorSpider(scrapy.Spider):
name = "glassdoor"
allowed_domains = ["glassdoor.co.uk"]
start_urls = ["https://www.glassdoor.co.uk/Overview/Working-at-Greene-King-EI_IE10160.11,22.htm",
]
def __init__(self):
self.driver = webdriver.Chrome("C:/Users/andrew/Downloads/chromedriver_win32/chromedriver.exe")
def parse(self, response):
self.driver.get(response.url)
time.sleep(5)
for sel in response.xpath('//*[#id="EmpStats"]'):
item = GlassdoorItem()
item['rating'] = sel.xpath('//*[#class="notranslate ratingNum"]/text()').extract()
# item['recommend'] = sel.xpath('//*[#class="address"]/text()').extract()
# item['approval'] = sel.xpath('//*[#class="address"]/text()').extract()
yield item
# self.driver.close()
the logs I get from the above are:
2017-01-26 21:49:02 [scrapy] INFO: Scrapy 1.0.5 started (bot: glassdoor)
2017-01-26 21:49:02 [scrapy] INFO: Optional features available: ssl, http11
2017-01-26 21:49:02 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'glassdoor.spiders', 'SPIDER_MODULES': ['glassdoor.spiders'], 'BOT_NAME': 'glassdoor'}
2017-01-26 21:49:02 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2017-01-26 21:49:04 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:58378/session {"requiredCapabilities": {}, "desiredCapabilities": {"platform": "ANY", "browserName": "chrome", "version": "", "chromeOptions": {"args": [], "extensions": []}, "javascriptEnabled": true}}
2017-01-26 21:49:06 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-01-26 21:49:06 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-01-26 21:49:06 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-01-26 21:49:06 [scrapy] INFO: Enabled item pipelines:
2017-01-26 21:49:06 [scrapy] INFO: Spider opened
2017-01-26 21:49:06 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-26 21:49:06 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-01-26 21:49:07 [scrapy] DEBUG: Crawled (405) <GET https://www.glassdoor.co.uk/Overview/Working-at-Greene-King-EI_IE10160.11,22.htm> (referer: None)
2017-01-26 21:49:07 [scrapy] DEBUG: Ignoring response <405 https://www.glassdoor.co.uk/Overview/Working-at-Greene-King-EI_IE10160.11,22.htm>: HTTP status code is not handled or not allowed
2017-01-26 21:49:07 [scrapy] INFO: Closing spider (finished)
2017-01-26 21:49:07 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 269,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 7412,
'downloader/response_count': 1,
'downloader/response_status_count/405': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 1, 26, 21, 49, 7, 388000),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 1, 26, 21, 49, 6, 572000)}
2017-01-26 21:49:07 [scrapy] INFO: Spider closed (finished)
ok, as suggest by both the replies, i was not passing the correct response to selenium.
Hence, by adding the line:
response1 = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
and therefore changing one line of the code as well:
for sel in response1.xpath('//*[#id="EmpStats"]'):
the new code is (which works):
import scrapy
import time
from selenium import webdriver
from glassdoor.items import GlassdoorItem
class glassdoorSpider(scrapy.Spider):
header = {"User-Agent":"Mozilla/5.0 Gecko/20100101 Firefox/33.0"}
name = "glassdoor"
allowed_domains = ["glassdoor.co.uk"]
start_urls = ["https://www.glassdoor.co.uk/Overview/Working-at-Greene-King-EI_IE10160.11,22.htm",
]
def __init__(self):
self.driver = webdriver.Chrome("C:/Users/andrew/Downloads/chromedriver_win32/chromedriver.exe")
def parse(self, response):
self.driver.get(response.url)
response1 = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
time.sleep(5)
for sel in response1.xpath('//*[#id="EmpStats"]'):
item = GlassdoorItem()
item['rating'] = sel.xpath('//*[#class="notranslate ratingNum"]/text()').extract()
# item['recommend'] = sel.xpath('//*[#class="address"]/text()').extract()
# item['approval'] = sel.xpath('//*[#class="address"]/text()').extract()
yield item
# self.driver.close()
I'm using scrapy to scrap this site but when I run the spider I don't see any response.
I tried reddit.com and quora.com and they both returned data (started to crawl) but not the site I want.
Here is my simple spider:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
class FirstSpider(CrawlSpider):
name = "jobs"
allowed_domains = ["bayt.com"]
start_urls = (
'http://www.bayt.com/',
)
rules = [
Rule(
LinkExtractor(allow=['.*']),
)
]
I tried several combinations of urls in the start_urls but nothing seemed to work.
Here is the log after running the spider:
2015-12-13 20:31:45 [scrapy] INFO: Scrapy 1.0.3 started (bot: bayt)
2015-12-13 20:31:45 [scrapy] INFO: Optional features available: ssl, http11
2015-12-13 20:31:45 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'bayt.spiders', 'SPIDER_MODULES': ['bayt.spiders'], 'BOT_NAME': 'bayt'}
2015-12-13 20:31:45 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-12-13 20:31:45 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-12-13 20:31:45 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-12-13 20:31:45 [scrapy] INFO: Enabled item pipelines:
2015-12-13 20:31:45 [scrapy] INFO: Spider opened
2015-12-13 20:31:45 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-13 20:31:45 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-12-13 20:31:45 [scrapy] DEBUG: Redirecting (302) to <GET http://www.bayt.com/en/jordan/> from <GET http://www.bayt.com/>
2015-12-13 20:31:46 [scrapy] DEBUG: Crawled (200) <GET http://www.bayt.com/en/jordan/> (referer: None)
2015-12-13 20:31:46 [scrapy] INFO: Closing spider (finished)
2015-12-13 20:31:46 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 881,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2320,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 12, 13, 18, 31, 46, 212468),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2015, 12, 13, 18, 31, 45, 138408)}
2015-12-13 20:31:46 [scrapy] INFO: Spider closed (finished)
the problem is that you are not using the rules as you mentioned, you have your own parse method, which is not ok, CrawlSpider uses the parse method so you shouldn't override that method.
Now, if you are still getting items when overriding the parse method, it is because parse is the default method for the start_urls requests, so the requests are not really following the rules, but only crawling urls inside start_urls
Just change the name of your parsing method from parse to a different one, and specify that on your rule as a callback.
I did Curl www.bayt.com in the command line and it seems that they redirect the request to http://www.bayt.com/en/jordan/
I put that as My start_urls and it worked and changed the user agent in the settings.py to localhost and it worked.
I'm having a problem getting my Scrapy spider to run its callback method.
I don't think it's an indentation error which seems to be the case for the other previous posts, but perhaps it is and I don't know it? Any ideas?
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy import log
import tldextract
class CrawlerSpider(CrawlSpider):
name = "crawler"
def __init__(self, initial_url):
log.msg('initing...', level=log.WARNING)
CrawlSpider.__init__(self)
if not initial_url.startswith('http'):
initial_url = 'http://' + initial_url
ext = tldextract.extract(initial_url)
initial_domain = ext.domain + '.' + ext.tld
initial_subdomain = ext.subdomain + '.' + ext.domain + '.' + ext.tld
self.allowed_domains = [initial_domain, 'www.' + initial_domain, initial_subdomain]
self.start_urls = [initial_url]
self.rules = [
Rule(SgmlLinkExtractor(), callback='parse_item'),
Rule(SgmlLinkExtractor(allow_domains=self.allowed_domains), follow=True),
]
self._compile_rules()
def parse_item(self, response):
log.msg('parse_item...', level=log.WARNING)
hxs = HtmlXPathSelector(response)
links = hxs.select("//a/#href").extract()
for link in links:
log.msg('link', level=log.WARNING)
Sample output is below; it should show a warning message with "parse_item..." printed but it doesn't.
$ scrapy crawl crawler -a initial_url=http://www.szuhanchang.com/test.html
2013-02-19 18:03:24+0000 [scrapy] INFO: Scrapy 0.16.4 started (bot: crawler)
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled item pipelines:
2013-02-19 18:03:24+0000 [scrapy] WARNING: initing...
2013-02-19 18:03:24+0000 [crawler] INFO: Spider opened
2013-02-19 18:03:24+0000 [crawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-02-19 18:03:25+0000 [crawler] DEBUG: Crawled (200) <GET http://www.szuhanchang.com/test.html> (referer: None)
2013-02-19 18:03:25+0000 [crawler] DEBUG: Filtered offsite request to 'www.20130219-0606.com': <GET http://www.20130219-0606.com/>
2013-02-19 18:03:25+0000 [crawler] INFO: Closing spider (finished)
2013-02-19 18:03:25+0000 [crawler] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 234,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 363,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 2, 19, 18, 3, 25, 84855),
'log_count/DEBUG': 8,
'log_count/INFO': 4,
'log_count/WARNING': 1,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2013, 2, 19, 18, 3, 24, 805064)}
2013-02-19 18:03:25+0000 [crawler] INFO: Spider closed (finished)
Thanks in advance!
The start_urls of http://www.szuhanchang.com/test.html has only one anchor link, namely:
Test
which contains a link to the domain 20130219-0606.com and according to your allowed_domains of:
['szuhanchang.com', 'www.szuhanchang.com', 'www.szuhanchang.com']
this Request gets filtered by the OffsiteMiddleware:
2013-02-19 18:03:25+0000 [crawler] DEBUG: Filtered offsite request to 'www.20130219-0606.com': <GET http://www.20130219-0606.com/>
therefore parse_item will not be called for this url.
Changing the name of your callback to parse_start_url seems to work, although since the test URL provided is quite small, I cannot be sure if this will still be effective. Give it a go and let me know. :)
I'm trying to parse site with Scrapy. The urls I need to parse formed like this http://example.com/productID/1234/. This links can be found on pages with address like: http://example.com/categoryID/1234/. The thing is that my crawler fetches first categoryID page (http://www.example.com/categoryID/79/, as you can see from trace below), but nothing more. What am I doing wrong? Thank you.
Here is my Scrapy code:
# -*- coding: UTF-8 -*-
#THIRD-PARTY MODULES
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
class ExampleComSpider(CrawlSpider):
name = "example.com"
allowed_domains = ["http://www.example.com/"]
start_urls = [
"http://www.example.com/"
]
rules = (
# Extract links matching 'categoryID/xxx'
# and follow links from them (since no callback means follow=True by default).
Rule(SgmlLinkExtractor(allow=('/categoryID/(\d*)/', ), )),
# Extract links matching 'productID/xxx' and parse them with the spider's method parse_item
Rule(SgmlLinkExtractor(allow=('/productID/(\d*)/', )), callback='parse_item'),
)
def parse_item(self, response):
self.log('Hi, this is an item page! %s' % response.url)
Here is a trace of Scrapy:
2012-01-31 12:38:56+0000 [scrapy] INFO: Scrapy 0.14.1 started (bot: parsers)
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Enabled item pipelines:
2012-01-31 12:38:57+0000 [example.com] INFO: Spider opened
2012-01-31 12:38:57+0000 [example.com] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-01-31 12:38:58+0000 [example.com] DEBUG: Crawled (200) <GET http://www.example.com/> (referer: None)
2012-01-31 12:38:58+0000 [example.com] DEBUG: Filtered offsite request to 'www.example.com': <GET http://www.example.com/categoryID/79/>
2012-01-31 12:38:58+0000 [example.com] INFO: Closing spider (finished)
2012-01-31 12:38:58+0000 [example.com] INFO: Dumping spider stats:
{'downloader/request_bytes': 199,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 121288,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 1, 31, 12, 38, 58, 409806),
'request_depth_max': 1,
'scheduler/memory_enqueued': 1,
'start_time': datetime.datetime(2012, 1, 31, 12, 38, 57, 127805)}
2012-01-31 12:38:58+0000 [example.com] INFO: Spider closed (finished)
2012-01-31 12:38:58+0000 [scrapy] INFO: Dumping global stats:
{'memusage/max': 26992640, 'memusage/startup': 26992640}
It can be a difference between "www.example.com" and "example.com". If it helps, you can use them both this way
allowed_domains = ["www.example.com", "example.com"]
Replace:
allowed_domains = ["http://www.example.com/"]
with:
allowed_domains = ["example.com"]
That should do the trick.