I am using Scrapy 1.0.5 and Gearman to create distributed spiders.
The idea is to build a spider, call it from a Gearman worker script, and pass 20 URLs at a time to crawl from a Gearman client to the worker and then on to the spider.
I am able to start the worker and pass URLs to it from the client, on to the spider to crawl. The first URL or batch of URLs does get picked up and crawled. Once the spider is done, though, I am unable to reuse it: I get a log message that the spider is closed. When I start the client again, the spider reopens but doesn't crawl.
Here is my worker:
import gearman
import json
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

gm_worker = gearman.GearmanWorker(['localhost:4730'])

def task_listener_reverse(gearman_worker, gearman_job):
    process = CrawlerProcess(get_project_settings())
    data = json.loads(gearman_job.data)
    if data['vendor_name'] == 'walmart':
        process.crawl('walmart', url=data['url_list'])
        process.start()  # the script will block here until the crawling is finished
    return 'completed'

# gm_worker.set_client_id is optional
gm_worker.set_client_id('python-worker')
gm_worker.register_task('reverse', task_listener_reverse)

# Enter our work loop and call gm_worker.after_poll() after each time we timeout/see socket activity
gm_worker.work()
Here is the code of my spider:
from crawler.items import CrawlerItemLoader
from scrapy.spiders import Spider

class WalmartSpider(Spider):
    name = "walmart"

    def __init__(self, **kw):
        super(WalmartSpider, self).__init__(**kw)
        self.start_urls = kw.get('url')
        self.allowed_domains = ["walmart.com"]

    def parse(self, response):
        item = CrawlerItemLoader(response=response)
        item.add_value('url', response.url)
        # Title
        item.add_xpath('title', '//div/h1/span/text()')
        if response.xpath('//div/h1/span/text()'):
            title = response.xpath('//div/h1/span/text()')
            item.add_value('title', title)
        yield item.load_item()
The first client run produces results, and I get the data I need, whether it was a single URL or multiple URLs.
On the second run, the spider opens and there are no results.
This is what I get back, and then it stops:
2016-02-19 01:16:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-19 01:16:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-19 01:16:30 [scrapy] INFO: Enabled item pipelines: MySQLStorePipeline
2016-02-19 01:16:30 [scrapy] INFO: Spider opened
2016-02-19 01:16:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-19 01:16:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6047
I was able to print the URL or URLs from both the worker and the spider, and confirmed they were being passed on both the first (working) run and the second (non-working) run. I have spent two days on this and haven't gotten anywhere. I would appreciate any pointers.
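For what it's worth, the symptom matches a known Twisted constraint: CrawlerProcess.start() runs the Twisted reactor, and a reactor cannot be restarted in the same process once it has stopped. One workaround, sketched here with a stand-in child program instead of a real crawl (the names `handle_job` and `CHILD_PROGRAM` are illustrative, not from the post), is to run each Gearman job in a fresh process so each job gets a fresh reactor:

```python
import json
import subprocess
import sys

# Stand-in for "create a CrawlerProcess, crawl, print results".  A real
# worker would run the scrapy crawl in the child and let the reactor die
# with the child process; here the child just echoes its URLs.
CHILD_PROGRAM = """
import json, sys
urls = json.loads(sys.argv[1])
print(json.dumps(['crawled ' + u for u in urls]))
"""

def handle_job(url_list):
    # Each job gets a brand-new Python process, hence a brand-new reactor.
    out = subprocess.run(
        [sys.executable, '-c', CHILD_PROGRAM, json.dumps(url_list)],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)

print(handle_job(['http://example.com/a', 'http://example.com/b']))
```

In the worker above, `task_listener_reverse` could call something like `handle_job(data['url_list'])` instead of creating a CrawlerProcess in its own process.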
Well, I decided to abandon Scrapy.
I looked around a lot, and everyone kept pointing to the limitation of the Twisted reactor. Rather than fighting the framework, I decided to build my own scraper, and it was very successful for what I needed. I am able to spin up multiple Gearman workers and use the scraper I built to scrape the data concurrently in a server farm.
If anyone is interested, I started with this simple article to build the scraper.
I use a Gearman client to query the DB and send multiple URLs to a worker; the worker scrapes the URLs and runs an update query back to the DB. Success! :)
http://docs.python-guide.org/en/latest/scenarios/scrape/
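For anyone following the same route, here is a minimal stand-alone sketch of such a scraper, using only the standard library's html.parser rather than the requests/lxml stack the article uses (`TitleParser` and `scrape_title` are illustrative names, not from the article):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the contents of the <title> tag from an HTML document."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def scrape_title(html):
    # In a Gearman worker, this html would be fetched (e.g. with urllib or
    # requests) for each URL received from the client.
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()

print(scrape_title('<html><head><title>Acme Widget</title></head></html>'))
# prints: Acme Widget
```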
Related
I'm trying to scrape the prices of shoes on the website in the code below. I have no way of knowing whether my syntax is even correct. I could really use some help.
from scrapy.spider import BaseSpider
from scrapy import Field
from scrapy import Item
from scrapy.selector import HtmlXPathSelector

def Yeezy(Item):
    price = Field()

class YeezySpider(BaseSpider):
    name = "yeezy"
    allowed_domains = ["https://www.grailed.com/"]
    start_url = ['https://www.grailed.com/feed/0Qu8Gh1qHQ?page=2']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        price = hxs.css('.listing-price .sub-title:nth-child(1) span').extract()
        items = []
        for price in price:
            item = Yeezy()
            item["price"] = price.select(".listing-price .sub-title:nth-child(1) span").extract()
            items.append(item)
        yield item
The code is reporting this to the console:
ScrapyDeprecationWarning: YeezyScrape.spiders.yeezy_spider.YeezySpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
class YeezySpider(BaseSpider):
2017-08-02 14:45:25-0700 [scrapy] INFO: Scrapy 0.25.1 started (bot: YeezyScrape)
2017-08-02 14:45:25-0700 [scrapy] INFO: Optional features available: ssl, http11
2017-08-02 14:45:25-0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'YeezyScrape.spiders', 'SPIDER_MODULES': ['YeezyScrape.spiders'], 'BOT_NAME': 'YeezyScrape'}
2017-08-02 14:45:25-0700 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled item pipelines:
2017-08-02 14:45:26-0700 [yeezy] INFO: Spider opened
2017-08-02 14:45:26-0700 [yeezy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-02 14:45:26-0700 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-02 14:45:26-0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2017-08-02 14:45:26-0700 [yeezy] INFO: Closing spider (finished)
2017-08-02 14:45:26-0700 [yeezy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 8, 2, 21, 45, 26, 127000),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'start_time': datetime.datetime(2017, 8, 2, 21, 45, 26, 125000)}
2017-08-02 14:45:26-0700 [yeezy] INFO: Spider closed (finished)
Process finished with exit code 0
At first I thought it was a problem with the CSS selectors I entered, but now I'm not so sure. This is my first time trying a project like this, and I could really use some insight. Thank you in advance.
EDIT: So I tried simulating an XHR request in my code by following another example. This is what I have:
import scrapy
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector
#from YeezyScrape import YeezyscrapeItem

class YeezySpider(scrapy.Spider):
    name = "yeezy"
    allowed_domains = ["www.grailed.com"]
    start_url = ["https://www.grailed.com/feed/0Qu8Gh1qHQ?page=2"]

    def parse(self, response):
        for i in range(0, 2):
            yield FormRequest(
                url='https://mnrwefss2q-dsn.algolia.net/1/indexes/Listing_production/query?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.21.1&x-algolia-application-id=MNRWEFSS2Q&x-algolia-api-key=a3a4de2e05d9e9b463911705fb6323ad',
                method="post",
                formdata={"params": """query:boost
                    filters:(strata:'basic' OR strata:'grailed' OR strata:'hype') AND
                    (category_path:'footwear.slip_ons' OR category_path:'footwear.sandals' OR
                    category_path:'footwear.lowtop_sneakers' OR category_path:'footwear.leather'
                    OR category_path:'footwear.hitop_sneakers' OR
                    category_path:'footwear.formal_shoes' OR category_path:'footwear.boots') AND
                    (marketplace:grailed)
                    hitsPerPage:40
                    facets ["strata","size","category","category_size",
                    "category_path","category_path_size",
                    "category_path_root_size","price_i","designers.id",
                    "location","marketplace"]
                    page:2"""},
                callback=self.data_parse())

    def data_parse(self, response):
        hxs = HtmlXPathSelector(response)
        prices = hxs.xpath("//p").extract()
        for prices in prices:
            price = prices.select("a/text()").extract()
            print price
I had to reformat things a little to fit the indentation differences between Python and Stack Overflow.
These are the logs reported in the terminal; again, thanks for the help:
C:\Python27\python.exe C:/Python27/Lib/site-packages/scrapy/cmdline.py crawl yeezy -o price.json
2017-08-04 13:23:27-0700 [scrapy] INFO: Scrapy 0.25.1 started (bot: YeezyScrape)
2017-08-04 13:23:27-0700 [scrapy] INFO: Optional features available: ssl, http11
2017-08-04 13:23:27-0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'YeezyScrape.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['YeezyScrape.spiders'], 'FEED_URI': 'price.json', 'BOT_NAME': 'YeezyScrape'}
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled item pipelines:
2017-08-04 13:23:27-0700 [yeezy] INFO: Spider opened
2017-08-04 13:23:28-0700 [yeezy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-04 13:23:28-0700 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-04 13:23:28-0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2017-08-04 13:23:28-0700 [yeezy] INFO: Closing spider (finished)
2017-08-04 13:23:28-0700 [yeezy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 8, 4, 20, 23, 28, 3000),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'start_time': datetime.datetime(2017, 8, 4, 20, 23, 28, 1000)}
2017-08-04 13:23:28-0700 [yeezy] INFO: Spider closed (finished)
Process finished with exit code 0
Seems like the products are retrieved by AJAX (see related: Can scrapy be used to scrape dynamic content from websites that are using AJAX?).
If you open up your browser's web inspector, select the Network tab and look for XHR requests when the page loads, you can see this:
Seems like a POST request is being made with categories, filters etc., and a JSON list of products is returned. You can reverse engineer it and replicate it in scrapy.
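As a sketch of that reverse engineering, assuming (not verified here) that the Algolia endpoint from the question expects a JSON body whose `params` value is a URL-encoded query string, the payload can be built separately from the request (`build_algolia_body` is an illustrative helper, not a library function):

```python
import json
from urllib.parse import urlencode

def build_algolia_body(query, filters, hits_per_page, page):
    # Algolia-style payload: a JSON object whose "params" value is itself
    # a URL-encoded query string (assumption based on the captured XHR).
    params = urlencode({
        'query': query,
        'filters': filters,
        'hitsPerPage': hits_per_page,
        'page': page,
    })
    return json.dumps({'params': params})

body = build_algolia_body(
    query='boost',
    filters="(marketplace:grailed)",
    hits_per_page=40,
    page=2,
)
```

The body could then be sent with `scrapy.Request(url, method='POST', body=body, callback=self.parse_products)`; note that the callback should be passed uncalled (`self.parse_products`, not `self.parse_products()`).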
Load the scrapy shell:
scrapy shell "http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/"
Try a selector:
response.xpath('(//table[@class="standard_tabelle"])[1]/tr[not(th)]')
Note: it prints results.
But now use that selector in a for statement:
for row in response.xpath('(//table[@class="standard_tabelle"])[1]/tr[not(th)]'):
    row.xpath(".//a[contains(@href, 'report')]/@href").extract_first()
Hit return twice; nothing is printed. To print results inside the for loop, you have to wrap the selector in a print call, like so:
print(row.xpath(".//a[contains(@href, 'report')]/@href").extract_first())
Why?
Edit
If I do the exact same thing as Liam's post below, my output is this:
rmp:www rmp$ scrapy shell "http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/"
2016-03-05 06:13:28 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-05 06:13:28 [scrapy] INFO: Optional features available: ssl, http11
2016-03-05 06:13:28 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2016-03-05 06:13:28 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2016-03-05 06:13:28 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-05 06:13:28 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-05 06:13:28 [scrapy] INFO: Enabled item pipelines:
2016-03-05 06:13:28 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-05 06:13:28 [scrapy] INFO: Spider opened
2016-03-05 06:13:29 [scrapy] DEBUG: Crawled (200) <GET http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x108c89c10>
[s] item {}
[s] request <GET http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/>
[s] response <200 http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/>
[s] settings <scrapy.settings.Settings object at 0x10a25bb10>
[s] spider <DefaultSpider 'default' at 0x10c1201d0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
2016-03-05 06:13:29 [root] DEBUG: Using default logger
In [1]: for row in response.xpath('(//table[@class="standard_tabelle"])[1]/tr[not(th)]'):
   ...:     row.xpath(".//a[contains(@href, 'report')]/@href").extract_first()
...:
But with print added?
In [2]: for row in response.xpath('(//table[@class="standard_tabelle"])[1]/tr[not(th)]'):
   ...:     print row.xpath(".//a[contains(@href, 'report')]/@href").extract_first()
...:
/report/premier-league-2015-2016-manchester-united-tottenham-hotspur/
/report/premier-league-2015-2016-afc-bournemouth-aston-villa/
/report/premier-league-2015-2016-everton-fc-watford-fc/
/report/premier-league-2015-2016-leicester-city-sunderland-afc/
/report/premier-league-2015-2016-norwich-city-crystal-palace/
This just worked for me.
>>> scrapy shell "http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/"
>>> for row in response.xpath('(//table[@class="standard_tabelle"])[1]/tr[not(th)]'):
...     row.xpath(".//a[contains(@href, 'report')]/@href").extract_first()
...
u'/report/premier-league-2015-2016-manchester-united-tottenham-hotspur/'
u'/report/premier-league-2015-2016-afc-bournemouth-aston-villa/'
u'/report/premier-league-2015-2016-everton-fc-watford-fc/'
u'/report/premier-league-2015-2016-leicester-city-sunderland-afc/'
u'/report/premier-league-2015-2016-norwich-city-crystal-palace/'
u'/report/premier-league-2015-2016-chelsea-fc-swansea-city/'
u'/report/premier-league-2015-2016-arsenal-fc-west-ham-united/'
u'/report/premier-league-2015-2016-newcastle-united-southampton-fc/'
u'/report/premier-league-2015-2016-stoke-city-liverpool-fc/'
u'/report/premier-league-2015-2016-west-bromwich-albion-manchester-city/'
Does this not show the same results for you?
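The difference between the two transcripts above comes down to how each shell echoes values: the plain `>>>` shell echoes every bare expression it evaluates, including ones inside a loop body, while IPython only displays the value of the whole input block, and a `for` statement has no value. In a script neither is echoed, which is why `print` is the portable choice. A small stand-alone demonstration (plain Python, no scrapy needed):

```python
import io
from contextlib import redirect_stdout

rows = ['/report/a/', '/report/b/']

# A bare expression inside a loop prints nothing when run as a script;
# only the interactive >>> shell would echo its value.
buf = io.StringIO()
with redirect_stdout(buf):
    for row in rows:
        row.upper()          # evaluated, result silently discarded
assert buf.getvalue() == ''

# Wrapping the expression in print() produces output everywhere.
buf = io.StringIO()
with redirect_stdout(buf):
    for row in rows:
        print(row.upper())
assert buf.getvalue() == '/REPORT/A/\n/REPORT/B/\n'
```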
I am using the following code to scrape data using Scrapy:
from scrapy.selector import Selector
from scrapy.spider import Spider

class ExampleSpider(Spider):
    name = "example"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        for li in sel.xpath('//ul/li'):
            title = li.xpath('a/text()').extract()
            link = li.xpath('a/@href').extract()
            desc = li.xpath('text()').extract()
            print title, link, desc
However, when I run this spider, I get the following message:
2014-06-30 23:39:00-0500 [scrapy] INFO: Scrapy 0.24.1 started (bot: tutorial)
2014-06-30 23:39:00-0500 [scrapy] INFO: Optional features available: ssl, http11
2014-06-30 23:39:00-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['tutorial.spiders'], 'FEED_URI': 'willthiswork.csv', 'BOT_NAME': 'tutorial'}
2014-06-30 23:39:01-0500 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-30 23:39:01-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-30 23:39:01-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-30 23:39:01-0500 [scrapy] INFO: Enabled item pipelines:
2014-06-30 23:39:01-0500 [example] INFO: Spider opened
2014-06-30 23:39:01-0500 [example] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-30 23:39:01-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-06-30 23:39:01-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-06-30 23:39:01-0500 [example] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
Of note is the line "Crawled 0 pages (at 0 pages/min)...", as well as the overridden settings.
Additionally, the file I intended to write my data to is completely blank.
Is there something I am doing wrong that is causing the data not to be written?
I am assuming you are trying to use scrapy crawl tutorial -o myfile.json.
To make this work, you need to use Scrapy items.
Add the following to items.py:
from scrapy.item import Item, Field

class MozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
and adjust the parse function:
def parse(self, response):
    sel = Selector(response)
    for li in sel.xpath('//ul/li'):
        item = MozItem()  # create a fresh item for each <li>
        item['title'] = li.xpath('a/text()').extract()
        item['link'] = li.xpath('a/@href').extract()
        item['desc'] = li.xpath('text()').extract()
        yield item
I've been stuck with this log for 3 days now:
2014-06-03 11:32:54-0700 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-03 11:32:54-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-03 11:32:54-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-03 11:32:54-0700 [scrapy] INFO: Enabled item pipelines: ImagesPipeline, FilterFieldsPipeline
2014-06-03 11:32:54-0700 [NefsakLaptopSpider] INFO: Spider opened
2014-06-03 11:32:54-0700 [NefsakLaptopSpider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-03 11:32:54-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-06-03 11:32:54-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-06-03 11:32:56-0700 [NefsakLaptopSpider] UNFORMATTABLE OBJECT WRITTEN TO LOG with fmt 'DEBUG: Crawled (%(status)s) %(request)s (referer: %(referer)s)%(flags)s', MESSAGE LOST
2014-06-03 11:33:54-0700 [NefsakLaptopSpider] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2014-06-03 11:34:54-0700 [NefsakLaptopSpider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
More lines like the last one... forever, and very slowly.
The offending line, 4th from the bottom, appears only when I set the logging level in Scrapy to DEBUG.
Here's the header of my spider:
class ScrapyCrawler(CrawlSpider):
    name = "ScrapyCrawler"

    def __init__(self, spiderPath, spiderID, name="ScrapyCrawler", *args, **kwargs):
        super(ScrapyCrawler, self).__init__()
        self.name = name
        self.path = spiderPath
        self.id = spiderID
        self.path_index = 0
        self.favicon_required = kwargs.get("downloadFavicon", True)  # the favicon for the scraped site will be added to the first item
        self.favicon_item = None

    def start_requests(self):
        start_path = self.path.pop(0)
        # determine the callback based on next step
        callback = self.parse_intermediate if type(self.path[0]) == URL \
            else self.parse_item_pages
        if type(start_path) == URL:
            start_url = start_path
            request = Request(start_path, callback=callback)
        elif type(start_path) == Form:
            start_url = start_path.url
            request = FormRequest(start_path.url, start_path.data,
                                  callback=callback)
        return [request]

    def parse_intermediate(self, response):
        ...

    def parse_item_pages(self, response):
        ...
def parse_intermediate(self, response):
...
def parse_item_pages(self, response):
...
The thing is, none of the callbacks are called after start_requests().
Here's a hint: the first request out of start_requests() is to a page like http://www.example.com. If I change http to https, this causes a redirect in Scrapy and the log changes to this:
2014-06-03 12:00:51-0700 [NefsakLaptopSpider] UNFORMATTABLE OBJECT WRITTEN TO LOG with fmt 'DEBUG: Redirecting (%(reason)s) to %(redirected)s from %(request)s', MESSAGE LOST
2014-06-03 12:00:51-0700 [NefsakLaptopSpider] DEBUG: Redirecting (302) to <GET http://www.nefsak.com/home.php?cat=58> from <GET http://www.nefsak.com/home.php?cat=58&xid_be279=248933808671e852497b0b1b33333a8b>
2014-06-03 12:00:52-0700 [NefsakLaptopSpider] DEBUG: Redirecting (301) to <GET http://www.nefsak.com/15-17-Screen/> from <GET http://www.nefsak.com/home.php?cat=58>
2014-06-03 12:00:54-0700 [NefsakLaptopSpider] DEBUG: Crawled (200) <GET http://www.nefsak.com/15-17-Screen/> (referer: None)
2014-06-03 12:00:54-0700 [NefsakLaptopSpider] ERROR: Spider must return Request, BaseItem or None, got 'list' in <GET http://www.nefsak.com/15-17-Screen/>
2014-06-03 12:00:56-0700 [NefsakLaptopSpider] DEBUG: Crawled (200) <GET http://www.nefsak.com/15-17-Screen/?page=4> (referer: http://www.nefsak.com/15-17-Screen/)
More extracted links and more errors like the above, and then it finishes, unlike the former log.
As you can see from the last line, the spider has actually gone and extracted a navigation page, all by itself (there is navigation-extraction code, but it doesn't get called, as the debugger breakpoints are never reached).
Unfortunately, I couldn't reproduce the error outside the project. A similar spider just works, but not inside the project.
I'll provide more code if requested.
Thanks, and sorry for the long post.
Well, I had a URL class derived from the built-in str. It was coded like this:
class URL(str):
    def canonicalize(self, parentURL):
        parsed_self = urlparse.urlparse(self)
        if parsed_self.scheme:
            return self[:]  # string copy?
        else:
            parsed_parent = urlparse.urlparse(parentURL)
            return urlparse.urljoin(parsed_parent.scheme + "://" + parsed_parent.netloc, self)

    def __str__(self):
        return "<URL : {0} >".format(self)
The __str__ method caused infinite recursion when the URL was printed or logged, because format() called __str__ again... but the exception was somehow swallowed by Twisted.
The error only showed up when the response was printed. The fix:
def __str__(self):
    return "<URL : " + self + " >"  # or use super(URL, self).__str__()
:-)
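A minimal stand-alone reproduction of that pitfall (class names here are illustrative, not from the project; the re-entry is triggered explicitly with str() since that is the unambiguous form of the same bug):

```python
class BadURL(str):
    def __str__(self):
        # str(self) re-enters this method: unbounded recursion.
        # In the original code, "{0}".format(self) triggered the same re-entry.
        return "<URL : " + str(self) + " >"

class GoodURL(str):
    def __str__(self):
        # Plain concatenation works on the underlying str data and never
        # calls __str__ again.
        return "<URL : " + self + " >"

try:
    str(BadURL("http://example.com"))
except RecursionError:
    print("BadURL recursed")

print(str(GoodURL("http://example.com")))
# prints: <URL : http://example.com >
```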
I'm writing spiders with Scrapy to get some data from a couple of applications using ASP. Both web pages are almost identical and require logging in before scraping can start, but I only managed to scrape one of them. On the other one, Scrapy waits forever and never gets past the login using the FormRequest method.
The code of both spiders (they are almost identical, but with different IPs) is as follows:
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest
from scrapy.shell import inspect_response

class MySpider(BaseSpider):
    name = "my_very_nice_spider"
    allowed_domains = ["xxx.xxx.xxx.xxx"]
    start_urls = ['http://xxx.xxx.xxx.xxx/reporting/']

    def parse(self, response):
        # Simulate user login on (http://xxx.xxx.xxx.xxx/reporting/)
        return [FormRequest.from_response(response,
                                          formdata={'user': 'the_username',
                                                    'password': 'my_nice_password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        inspect_response(response, self)  # Spider never gets here on one site
        if "Bad login" in response.body:
            print "Login failed"
            return
        # Scraping code begins...
Wondering what could be different between them, I used Firefox's Live HTTP Headers to inspect the headers and found only one difference: the page that works runs on IIS 6.0, and the one that doesn't on IIS 5.1.
As this alone couldn't explain why one works and the other doesn't, I used Wireshark to capture the network traffic and found this:
Interaction using scrapy with working webpage (IIS 6.0)
scrapy --> webpage GET /reporting/ HTTP/1.1
scrapy <-- webpage HTTP/1.1 200 OK
scrapy --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded)
scrapy <-- webpage HTTP/1.1 302 Object moved
scrapy --> webpage GET /reporting/htm/webpage.asp
scrapy <-- webpage HTTP/1.1 200 OK
scrapy --> webpage POST /reporting/asp/report1.asp
...Scraping begins
Interaction using scrapy with the non-working webpage (IIS 5.1)
scrapy --> webpage GET /reporting/ HTTP/1.1
scrapy <-- webpage HTTP/1.1 200 OK
scrapy --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded)
scrapy <-- webpage HTTP/1.1 100 Continue # What the f...?
scrapy <-- webpage HTTP/1.1 302 Object moved
...Scrapy waits forever...
I googled a bit and found that IIS 5.1 does indeed have a nice kind of "feature" that makes it return HTTP 100 whenever someone makes a POST to it, as shown here.
Knowing that the root of all evil is where it always is, but having to scrape that site anyway... how can I make Scrapy work in this situation? Or am I doing something wrong?
Thank you!
Edit - console log with the non-working site:
2014-01-17 09:09:50-0300 [scrapy] INFO: Scrapy 0.20.2 started (bot: mybot)
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Optional features available: ssl, http11
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'bot.spiders', 'SPIDER_MODULES': ['bot.spiders'], 'BOT_NAME': 'bot'}
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled item pipelines:
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Spider opened
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-01-17 09:09:54-0300 [my_very_nice_spider] DEBUG: Crawled (200) <GET http://xxx.xxx.xxx.xxx/reporting/> (referer: None)
2014-01-17 09:10:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:11:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:12:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:12:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 1 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds..
2014-01-17 09:13:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:14:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:15:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:15:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 2 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds..
...
Try using the HTTP 1.0 downloader:
# settings.py
DOWNLOAD_HANDLERS = {
    'http': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
}