I'm writing spiders with Scrapy to get some data from a couple of ASP applications. Both web pages are almost identical and require logging in before scraping can start, but I only managed to scrape one of them. On the other one Scrapy waits forever and never gets past the login made with the FormRequest method.
The code of both spiders (they are almost identical, but with different IPs) is as follows:
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest
from scrapy.shell import inspect_response

class MySpider(BaseSpider):
    name = "my_very_nice_spider"
    allowed_domains = ["xxx.xxx.xxx.xxx"]
    start_urls = ['http://xxx.xxx.xxx.xxx/reporting/']

    def parse(self, response):
        # Simulate user login on (http://xxx.xxx.xxx.xxx/reporting/)
        return [FormRequest.from_response(response,
                                          formdata={'user': 'the_username',
                                                    'password': 'my_nice_password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        inspect_response(response, self)  # Spider never gets here on one of the sites
        if "Bad login" in response.body:
            print "Login failed"
            return
        # Scraping code begins...
Wondering what could be different between them, I used Firefox's Live HTTP Headers to inspect the headers and found only one difference: the web page that works runs IIS 6.0, and the one that doesn't runs IIS 5.1.
As this alone couldn't explain why one works and the other doesn't, I used Wireshark to capture the network traffic and found this:
Interaction using scrapy with working webpage (IIS 6.0)
scrapy --> webpage GET /reporting/ HTTP/1.1
scrapy <-- webpage HTTP/1.1 200 OK
scrapy --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded)
scrapy <-- webpage HTTP/1.1 302 Object moved
scrapy --> webpage GET /reporting/htm/webpage.asp
scrapy <-- webpage HTTP/1.1 200 OK
scrapy --> webpage POST /reporting/asp/report1.asp
...Scrapping begins
Interaction using scrapy with not working webpage (IIS 5.1)
scrapy --> webpage GET /reporting/ HTTP/1.1
scrapy <-- webpage HTTP/1.1 200 OK
scrapy --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded)
scrapy <-- webpage HTTP/1.1 100 Continue # What the f...?
scrapy <-- webpage HTTP/1.1 302 Object moved
...Scrapy waits forever...
I googled a bit and found that IIS 5.1 indeed has a nice kind of "feature" that makes it return HTTP 100 whenever someone sends it a POST, as shown here.
Knowing that the root of all evil is where it always is, but having to scrape that site anyway... how can I make Scrapy work in this situation? Or am I doing something wrong?
Thank you!
Edit - console log with the non-working site:
2014-01-17 09:09:50-0300 [scrapy] INFO: Scrapy 0.20.2 started (bot: mybot)
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Optional features available: ssl, http11
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'bot.spiders', 'SPIDER_MODULES': ['bot.spiders'], 'BOT_NAME': 'bot'}
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled item pipelines:
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Spider opened
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-01-17 09:09:54-0300 [my_very_nice_spider] DEBUG: Crawled (200) <GET http://xxx.xxx.xxx.xxx/reporting/> (referer: None)
2014-01-17 09:10:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:11:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:12:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:12:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 1 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds..
2014-01-17 09:13:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:14:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:15:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:15:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 2 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds..
...
Try using the HTTP 1.0 download handler:
# settings.py
DOWNLOAD_HANDLERS = {
    'http': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
}
Related
I am trying to use Tor with Scrapy. To convert SOCKS5 into HTTP I am using the python-proxy (pproxy) module. To configure Tor with the help of pproxy I used the command pproxy -l http://:8080 -r socks5://127.0.0.1:9050. To quickly check whether this setup was working, I used the requests module. The code is:
import requests
url = 'http://icanhazip.com'
proxies = {
    "http": 'http://127.0.0.1:8080',
    "https": 'http://127.0.0.1:8080'
}
response = requests.get(url, proxies=proxies)
response2 = requests.get(url)
print('Tor IP address: ', response.text)
print('My IP address: ', response2.text)
The output was:
Tor IP address: 18.27.197.252
My IP address: 119.152.110.234
which confirmed that pproxy and Tor were working properly. The problem happened when I tried to use this setup within Scrapy. The code for that is as follows:
import scrapy

class Test2Spider(scrapy.Spider):
    name = 'test2'

    def start_requests(self):
        url = 'http://icanhazip.com'
        # For getting the Tor IP
        yield scrapy.Request(url=url, callback=self.parse,
                             meta={'proxy': 'http://127.0.0.1:8080'})
        # For getting my actual IP
        #yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        yield {
            'ip': response.xpath('//p/text()').get()
        }
I can't access the site and get an HTTP 400 error:
2020-07-01 17:35:33 [scrapy.core.engine] INFO: Spider opened
2020-07-01 17:35:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-01 17:35:33 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-01 17:35:34 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://icanhazip.com> (referer: None)
2020-07-01 17:35:34 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://icanhazip.com>: HTTP status code is not handled or not allowed
2020-07-01 17:35:34 [scrapy.core.engine] INFO: Closing spider (finished)
I also tried changing the user agent, but had no luck. And I know this can't be because of the user agent, because when I tried without the proxy I was able to get my own IP successfully:
2020-07-01 17:41:39 [scrapy.core.engine] INFO: Spider opened
2020-07-01 17:41:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-01 17:41:39 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-01 17:41:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://icanhazip.com> (referer: None)
2020-07-01 17:41:40 [scrapy.core.scraper] DEBUG: Scraped from <200 http://icanhazip.com>
{'ip': '119.152.110.234'}
2020-07-01 17:41:40 [scrapy.core.engine] INFO: Closing spider (finished)
I can't understand two things:
If I am allowed to access the site with my original IP, why can't I access it using the Tor IP address?
Supposing there is some problem with configuring Tor through pproxy, how am I able to access the site with the Tor IP address using requests?
Edit:
I have observed that it varies from site to site. I am able to access a good number of sites, but not all. Some sites return an HTTP 400 error while others return a 403. But every site that denied access over the Tor IP was accessible using my own IP, or using requests over Tor. I can't figure out what's happening.
I'm trying to use Scrapy to crawl www.mywebsite.com.
www.mywebsite.com is hosted on a free host with the url www.mywebsite.freehost.com. I am redirecting the free host to my paid domain.
The problem here is that scrapy ignores the redirect and the end result is that 0 pages are scraped.
How do I tell scrapy that I need it to crawl the redirected url? I only need it to crawl the redirected url and not other urls that lead out of the website (like facebook pages etc.)
2016-11-27 14:48:42 [scrapy] INFO: Spider opened
2016-11-27 14:48:42 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-11-27 14:48:42 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-11-27 14:48:44 [scrapy] DEBUG: Crawled (200) <GET http://www.mywebsite.com/> (referer: None)
2016-11-27 14:48:44 [scrapy] DEBUG: Filtered offsite request to 'www.mywebsite.freehost.net': <GET www.mywebsite.freehost.net>
2016-11-27 14:48:44 [scrapy] INFO: Closing spider (finished)
2016-11-27 14:48:44 [scrapy] INFO: Dumping Scrapy stats:
The logs show that your request is being filtered:
DEBUG: Filtered offsite request to 'www.mywebsite.freehost.net': <GET www.mywebsite.freehost.net>
Add that domain, freehost.net, to your allowed_domains list, or remove allowed_domains from your spider entirely to allow every domain.
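For a sense of why the request gets dropped: the offsite filter is essentially a suffix match of the request's host against allowed_domains. Here is a rough stdlib-only sketch of that check (an approximation, not Scrapy's actual code; the domain names are the placeholders from the question):

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    """Approximation of the offsite check: a request is allowed if its
    host equals an allowed domain or is a subdomain of one."""
    host = urlparse(url).hostname or ""
    return not any(
        host == d or host.endswith("." + d) for d in allowed_domains
    )

# With only the paid domain listed, the free-host URL is filtered:
print(is_offsite("http://www.mywebsite.freehost.net/", ["mywebsite.com"]))  # True (filtered)
# Adding freehost.net lets the redirected request through:
print(is_offsite("http://www.mywebsite.freehost.net/",
                 ["mywebsite.com", "freehost.net"]))  # False (allowed)
```

Listing both domains keeps the crawl inside the two hosts while still letting Scrapy follow the redirect between them.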
Load the scrapy shell
scrapy shell "http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/"
Try a selector:
response.xpath('(//table[@class="standard_tabelle"])[1]/tr[not(th)]')
Note that it prints results. But now use that selector in a for loop:
for row in response.xpath('(//table[@class="standard_tabelle"])[1]/tr[not(th)]'):
    row.xpath(".//a[contains(@href, 'report')]/@href").extract_first()
Hit return twice; nothing is printed. To print results inside the for loop, you have to wrap the selector in a print call, like so:
print(row.xpath(".//a[contains(@href, 'report')]/@href").extract_first())
Why?
Edit
If I do the exact same thing as in Liam's post below, my output is this:
rmp:www rmp$ scrapy shell "http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/"
2016-03-05 06:13:28 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-05 06:13:28 [scrapy] INFO: Optional features available: ssl, http11
2016-03-05 06:13:28 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2016-03-05 06:13:28 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2016-03-05 06:13:28 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-05 06:13:28 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-05 06:13:28 [scrapy] INFO: Enabled item pipelines:
2016-03-05 06:13:28 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-05 06:13:28 [scrapy] INFO: Spider opened
2016-03-05 06:13:29 [scrapy] DEBUG: Crawled (200) <GET http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x108c89c10>
[s] item {}
[s] request <GET http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/>
[s] response <200 http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/>
[s] settings <scrapy.settings.Settings object at 0x10a25bb10>
[s] spider <DefaultSpider 'default' at 0x10c1201d0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
2016-03-05 06:13:29 [root] DEBUG: Using default logger
2016-03-05 06:13:29 [root] DEBUG: Using default logger
In [1]: for row in response.xpath('(//table[@class="standard_tabelle"])[1]/tr[not(th)]'):
   ...:     row.xpath(".//a[contains(@href, 'report')]/@href").extract_first()
   ...:
But with print added?
In [2]: for row in response.xpath('(//table[@class="standard_tabelle"])[1]/tr[not(th)]'):
   ...:     print row.xpath(".//a[contains(@href, 'report')]/@href").extract_first()
   ...:
/report/premier-league-2015-2016-manchester-united-tottenham-hotspur/
/report/premier-league-2015-2016-afc-bournemouth-aston-villa/
/report/premier-league-2015-2016-everton-fc-watford-fc/
/report/premier-league-2015-2016-leicester-city-sunderland-afc/
/report/premier-league-2015-2016-norwich-city-crystal-palace/
This just worked for me.
>>>scrapy shell "http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/"
>>> for row in response.xpath('(//table[@class="standard_tabelle"])[1]/tr[not(th)]'):
...     row.xpath(".//a[contains(@href, 'report')]/@href").extract_first()
...
u'/report/premier-league-2015-2016-manchester-united-tottenham-hotspur/'
u'/report/premier-league-2015-2016-afc-bournemouth-aston-villa/'
u'/report/premier-league-2015-2016-everton-fc-watford-fc/'
u'/report/premier-league-2015-2016-leicester-city-sunderland-afc/'
u'/report/premier-league-2015-2016-norwich-city-crystal-palace/'
u'/report/premier-league-2015-2016-chelsea-fc-swansea-city/'
u'/report/premier-league-2015-2016-arsenal-fc-west-ham-united/'
u'/report/premier-league-2015-2016-newcastle-united-southampton-fc/'
u'/report/premier-league-2015-2016-stoke-city-liverpool-fc/'
u'/report/premier-league-2015-2016-west-bromwich-albion-manchester-city/'
does this not show the same results for you?
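A likely explanation for the discrepancy (not stated in the thread, so treat it as an educated guess): the session in the question runs inside IPython (note the In [1]: prompt), while this one uses the plain >>> shell. The plain interactive interpreter compiles each input in 'single' mode, which echoes the value of every bare expression statement through sys.displayhook, even inside a for loop; IPython only displays the value of the input as a whole, and a for loop has none, so nothing is shown without print. The stdlib code module can demonstrate the plain shell's behaviour:

```python
import io
import code
import contextlib

# InteractiveInterpreter compiles input with symbol='single', exactly
# like the plain >>> shell, so bare expressions inside the loop are
# echoed to stdout via sys.displayhook.
interp = code.InteractiveInterpreter()
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    interp.runsource("for i in range(2): i\n")
print(repr(buf.getvalue()))  # '0\n1\n'
```

So both sessions behave as designed; only the shells differ.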
I am using Scrapy 1.0.5 and Gearman to create distributed spiders.
The idea is to build a spider, call it from a gearman worker script and pass 20 URLs at a time to crawl from a gearman client to the worker and then to the spider.
I am able to start the worker and pass URLs from the client, through the worker, on to the spider to crawl. The first URL or array of URLs gets picked up and crawled. Once the spider is done, I am unable to reuse it. I get the log message that the spider is closed. When I run the client again, the spider reopens but doesn't crawl.
Here is my worker:
import gearman
import json
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

gm_worker = gearman.GearmanWorker(['localhost:4730'])

def task_listener_reverse(gearman_worker, gearman_job):
    process = CrawlerProcess(get_project_settings())
    data = json.loads(gearman_job.data)
    if data['vendor_name'] == 'walmart':
        process.crawl('walmart', url=data['url_list'])
        process.start()  # the script will block here until the crawling is finished
    return 'completed'

# gm_worker.set_client_id is optional
gm_worker.set_client_id('python-worker')
gm_worker.register_task('reverse', task_listener_reverse)

# Enter our work loop and call gm_worker.after_poll() after each time we time out / see socket activity
gm_worker.work()
Here is the code of my Spider.
from crawler.items import CrawlerItemLoader
from scrapy.spiders import Spider

class WalmartSpider(Spider):
    name = "walmart"

    def __init__(self, **kw):
        super(WalmartSpider, self).__init__(**kw)
        self.start_urls = kw.get('url')
        self.allowed_domains = ["walmart.com"]

    def parse(self, response):
        item = CrawlerItemLoader(response=response)
        item.add_value('url', response.url)
        # Title
        item.add_xpath('title', '//div/h1/span/text()')
        if response.xpath('//div/h1/span/text()'):
            title = response.xpath('//div/h1/span/text()')
            item.add_value('title', title)
        yield item.load_item()
The first client run produces results and I get the data I need whether it was a single URL or multiple URLs.
On the second run, the spider opens and no results.
This is what I get back, and then it stops:
2016-02-19 01:16:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-19 01:16:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-19 01:16:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-19 01:16:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-19 01:16:30 [scrapy] INFO: Enabled item pipelines: MySQLStorePipeline
2016-02-19 01:16:30 [scrapy] INFO: Enabled item pipelines: MySQLStorePipeline
2016-02-19 01:16:30 [scrapy] INFO: Spider opened
2016-02-19 01:16:30 [scrapy] INFO: Spider opened
2016-02-19 01:16:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-19 01:16:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-19 01:16:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6047
2016-02-19 01:16:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6047
I was able to print the URL or URLs from both the worker and the spider, and ensured they were getting passed on the first (working) run and the second (non-working) run. I've spent two days on this and haven't gotten anywhere. I would appreciate any pointers.
Well, I decided to abandon Scrapy.
I looked around a lot, and everyone kept pointing to the limitation that the Twisted reactor cannot be restarted within the same process. Rather than fighting the framework, I decided to build my own scraper, and it was very successful for what I needed. I am able to spin up multiple Gearman workers and use the scraper I built to scrape the data concurrently in a server farm.
If anyone is interested, I started with this simple article to build the scraper.
I use a Gearman client to query the DB and send multiple URLs to a worker; the worker scrapes the URLs and runs an update query back to the DB. Success!! :)
http://docs.python-guide.org/en/latest/scenarios/scrape/
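The approach in that article boils down to fetch-then-parse. A minimal stdlib-only sketch of the same idea (the article itself uses requests and lxml; this version is just an illustration of the pattern a worker would run per URL):

```python
from html.parser import HTMLParser
from urllib.request import urlopen  # a worker would fetch with this (or requests)

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag seen in the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

# A worker would do: extract_links(urlopen(url).read().decode())
print(extract_links('<a href="/report/1">one</a> <a href="/report/2">two</a>'))
# ['/report/1', '/report/2']
```

Because each call is an ordinary function with no global reactor, every Gearman job can run it independently, which is exactly what the CrawlerProcess approach could not do.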
I've been stuck on this log for 3 days now:
2014-06-03 11:32:54-0700 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-03 11:32:54-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-03 11:32:54-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-03 11:32:54-0700 [scrapy] INFO: Enabled item pipelines: ImagesPipeline, FilterFieldsPipeline
2014-06-03 11:32:54-0700 [NefsakLaptopSpider] INFO: Spider opened
2014-06-03 11:32:54-0700 [NefsakLaptopSpider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-03 11:32:54-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-06-03 11:32:54-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-06-03 11:32:56-0700 [NefsakLaptopSpider] UNFORMATTABLE OBJECT WRITTEN TO LOG with fmt 'DEBUG: Crawled (%(status)s) %(request)s (referer: %(referer)s)%(flags)s', MESSAGE LOST
2014-06-03 11:33:54-0700 [NefsakLaptopSpider] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2014-06-03 11:34:54-0700 [NefsakLaptopSpider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
It goes on like the last line... forever, and very slowly.
The offending line, fourth from the bottom, appears only when I set the logging level in Scrapy to DEBUG.
Here's the header of my spider:
class ScrapyCrawler(CrawlSpider):
    name = "ScrapyCrawler"

    def __init__(self, spiderPath, spiderID, name="ScrapyCrawler", *args, **kwargs):
        super(ScrapyCrawler, self).__init__()
        self.name = name
        self.path = spiderPath
        self.id = spiderID
        self.path_index = 0
        self.favicon_required = kwargs.get("downloadFavicon", True)  # the favicon for the scraped site will be added to the first item
        self.favicon_item = None

    def start_requests(self):
        start_path = self.path.pop(0)
        # determine the callback based on the next step
        callback = self.parse_intermediate if type(self.path[0]) == URL \
            else self.parse_item_pages
        if type(start_path) == URL:
            start_url = start_path
            request = Request(start_path, callback=callback)
        elif type(start_path) == Form:
            start_url = start_path.url
            request = FormRequest(start_path.url, start_path.data,
                                  callback=callback)
        return [request]

    def parse_intermediate(self, response):
        ...

    def parse_item_pages(self, response):
        ...
The thing is, none of the callbacks are called after start_requests().
Here's a hint: the first request out of start_requests() is to a page like http://www.example.com. If I change http to https, this causes a redirect in Scrapy and the log changes to this:
2014-06-03 12:00:51-0700 [NefsakLaptopSpider] UNFORMATTABLE OBJECT WRITTEN TO LOG with fmt 'DEBUG: Redirecting (%(reason)s) to %(redirected)s from %(request)s', MESSAGE LOST
2014-06-03 12:00:51-0700 [NefsakLaptopSpider] DEBUG: Redirecting (302) to <GET http://www.nefsak.com/home.php?cat=58> from <GET http://www.nefsak.com/home.php?cat=58&xid_be279=248933808671e852497b0b1b33333a8b>
2014-06-03 12:00:52-0700 [NefsakLaptopSpider] DEBUG: Redirecting (301) to <GET http://www.nefsak.com/15-17-Screen/> from <GET http://www.nefsak.com/home.php?cat=58>
2014-06-03 12:00:54-0700 [NefsakLaptopSpider] DEBUG: Crawled (200) <GET http://www.nefsak.com/15-17-Screen/> (referer: None)
2014-06-03 12:00:54-0700 [NefsakLaptopSpider] ERROR: Spider must return Request, BaseItem or None, got 'list' in <GET http://www.nefsak.com/15-17-Screen/>
2014-06-03 12:00:56-0700 [NefsakLaptopSpider] DEBUG: Crawled (200) <GET http://www.nefsak.com/15-17-Screen/?page=4> (referer: http://www.nefsak.com/15-17-Screen/)
More extracted links and more errors like the one above follow, and then it finishes, unlike the former log.
As you can see from the last line, the spider has actually gone and extracted a navigation page, all by itself. (There is navigation-extraction code, but it doesn't get called, as the debugger breakpoints are never reached.)
Unfortunately, I couldn't reproduce the error outside the project. A similar spider just works, but not inside this project.
I'll provide more code if requested.
Thanks, and sorry for the long post.
Well, I had a URL class derived from the built-in str. It was coded like this:
import urlparse

class URL(str):
    def canonicalize(self, parentURL):
        parsed_self = urlparse.urlparse(self)
        if parsed_self.scheme:
            return self[:]  # string copy?
        else:
            parsed_parent = urlparse.urlparse(parentURL)
            return urlparse.urljoin(parsed_parent.scheme + "://" + parsed_parent.netloc, self)

    def __str__(self):
        return "<URL : {0} >".format(self)
The __str__ method caused infinite recursion whenever the object was printed or logged, because format() called __str__ again... but the exception was somehow swallowed by Twisted; the error only showed up when the response itself was printed. The fix was to avoid format():
def __str__(self):
    return "<URL : " + self + " >"  # or use super(URL, self).__str__()
:-)
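To make the failure mode concrete, here is a minimal self-contained reproduction of the recursion and the fix (class names are illustrative, not from the original project; shown in Python 3, where the same trap exists):

```python
class BadURL(str):
    def __str__(self):
        # format() with an empty spec falls back to str(self) for str
        # subclasses, which calls __str__ again: infinite recursion.
        return "<URL : {0} >".format(self)

class GoodURL(str):
    def __str__(self):
        # Plain str concatenation never re-enters __str__.
        return "<URL : " + self + " >"

try:
    str(BadURL("http://example.com"))
except RecursionError:
    print("BadURL recursed")

print(str(GoodURL("http://example.com")))  # <URL : http://example.com >
```

The same applies to logging: anything that ends up calling format() on the object re-enters __str__, which is why the spider silently stalled instead of crashing.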