I'm trying to write parsing script using python/scrapy. How can I remove [] and u' from strings in result file?
Now I have text like this:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.markup import remove_tags
from googleparser.items import GoogleparserItem
import sys
class GoogleparserSpider(BaseSpider):
name = "google.com"
allowed_domains = ["google.com"]
start_urls = [
"http://www.google.com/search?q=this+is+first+test&num=20&hl=uk&start=0",
"http://www.google.com/search?q=this+is+second+test&num=20&hl=uk&start=0"
]
def parse(self, response):
print "===START======================================================="
hxs = HtmlXPathSelector(response)
qqq = hxs.select('/html/head/title/text()').extract()
print qqq
print "---DATA--------------------------------------------------------"
sites = hxs.select('/html/body/div[5]/div[3]/div/div/div/ol/li/h3')
i = 1
items = []
for site in sites:
try:
item = GoogleparserItem()
title1 = site.select('a').extract()
title2=str(title1)
title=remove_tags(title2)
link=site.select('a/#href').extract()
item['num'] = i
item['title'] = title
item['link'] = link
i= i+1
items.append(item)
except:
print 'EXCEPTION'
return items
print "===END========================================================="
SPIDER = GoogleparserSpider()
and I have result like this after running
python scrapy-ctl.py crawl google.com
2010-07-25 17:44:44+0300 [-] Log opened.
2010-07-25 17:44:44+0300 [googleparser] DEBUG: Enabled extensions: CoreStats, CloseSpider, WebService, TelnetConsole, MemoryUsage
2010-07-25 17:44:44+0300 [googleparser] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2010-07-25 17:44:44+0300 [googleparser] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloaderStats, UserAgentMiddleware, RedirectMiddleware, DefaultHeadersMiddleware, CookiesMiddleware, HttpCompressionMiddleware, RetryMiddleware
2010-07-25 17:44:44+0300 [googleparser] DEBUG: Enabled spider middlewares: UrlLengthMiddleware, HttpErrorMiddleware, RefererMiddleware, OffsiteMiddleware, DepthMiddleware
2010-07-25 17:44:44+0300 [googleparser] DEBUG: Enabled item pipelines: CsvWriterPipeline
2010-07-25 17:44:44+0300 [-] scrapy.webservice.WebService starting on 6080
2010-07-25 17:44:44+0300 [-] scrapy.telnet.TelnetConsole starting on 6023
2010-07-25 17:44:44+0300 [google.com] INFO: Spider opened
2010-07-25 17:44:45+0300 [google.com] DEBUG: Crawled (200) <GET http://www.google.com/search?q=this+is+first+test&num=20&hl=uk&start=0> (referer: None)
===START=======================================================
[u'this is first test - \u041f\u043e\u0448\u0443\u043a Google']
---DATA--------------------------------------------------------
2010-07-25 17:52:42+0300 [google.com] DEBUG: Scraped GoogleparserItem(num=1, link=[u'http://www.amazon.com/First-Protector-Small-Tamora-Pierce/dp/0679889175'], title=u"[u'Amazon.com: First Test (Protector of the Small) (9780679889175 ...']") in <http://www.google.com/search?q=this+is+first+test&num=100&hl=uk&start=0>
and this text in file:
1,[u'Amazon.com: First Test (Protector of the Small) (9780679889175 ...'],[u'http://www.amazon.com/First-Protector-Small-Tamora-Pierce/dp/0679889175']
more prettier - print qqq.pop()
Replace print qqq with print qqq[0]. You get that result because qqq is a list.
Same problem with your text file. You have a list with one element that you're writing instead of the element within the list.
It looks like the result from extract is a list. Try:
print ', '.join(qqq)
The u infront of the code, purely means it's a unicode string. See the reference here. http://docs.python.org/tutorial/introduction.html#unicode-strings. The fix would be to convert your content to a string using the str() method.
Related
Load the scrapy shell
scrapy shell "http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/"
Try a selector:
response.xpath('(//table[#class="standard_tabelle"])[1]/tr[not(th)]')
Note: it prints results.
But now use that selector as a for statement:
for row in response.xpath('(//table[#class="standard_tabelle"])[1]/tr[not(th)]'):
row.xpath(".//a[contains(#href, 'report')]/#href").extract_first()
Hit return twice, nothing is printed. To print results inside the for loop, you have to wrap the selector in a print function. Like so:
print(row.xpath(".//a[contains(#href, 'report')]/#href").extract_first())
Why?
Edit
If I do the exact same thing as Liam's post below, my output is this:
rmp:www rmp$ scrapy shell "http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/"
2016-03-05 06:13:28 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-05 06:13:28 [scrapy] INFO: Optional features available: ssl, http11
2016-03-05 06:13:28 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2016-03-05 06:13:28 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2016-03-05 06:13:28 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-05 06:13:28 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-05 06:13:28 [scrapy] INFO: Enabled item pipelines:
2016-03-05 06:13:28 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-05 06:13:28 [scrapy] INFO: Spider opened
2016-03-05 06:13:29 [scrapy] DEBUG: Crawled (200) <GET http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x108c89c10>
[s] item {}
[s] request <GET http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/>
[s] response <200 http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/>
[s] settings <scrapy.settings.Settings object at 0x10a25bb10>
[s] spider <DefaultSpider 'default' at 0x10c1201d0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
2016-03-05 06:13:29 [root] DEBUG: Using default logger
2016-03-05 06:13:29 [root] DEBUG: Using default logger
In [1]: for row in response.xpath('(//table[#class="standard_tabelle"])[1]/tr[not(th)]'):
...: row.xpath(".//a[contains(#href, 'report')]/#href").extract_first()
...:
But with print added?
In [2]: for row in response.xpath('(//table[#class="standard_tabelle"])[1]/tr[not(th)]'):
...: print row.xpath(".//a[contains(#href, 'report')]/#href").extract_first()
...:
/report/premier-league-2015-2016-manchester-united-tottenham-hotspur/
/report/premier-league-2015-2016-afc-bournemouth-aston-villa/
/report/premier-league-2015-2016-everton-fc-watford-fc/
/report/premier-league-2015-2016-leicester-city-sunderland-afc/
/report/premier-league-2015-2016-norwich-city-crystal-palace/
This just worked for me.
>>>scrapy shell "http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/"
>>> for row in response.xpath('(//table[#class="standard_tabelle"])[1]/tr[not(th)]'):
... row.xpath(".//a[contains(#href, 'report')]/#href").extract_first()
...
u'/report/premier-league-2015-2016-manchester-united-tottenham-hotspur/'
u'/report/premier-league-2015-2016-afc-bournemouth-aston-villa/'
u'/report/premier-league-2015-2016-everton-fc-watford-fc/'
u'/report/premier-league-2015-2016-leicester-city-sunderland-afc/'
u'/report/premier-league-2015-2016-norwich-city-crystal-palace/'
u'/report/premier-league-2015-2016-chelsea-fc-swansea-city/'
u'/report/premier-league-2015-2016-arsenal-fc-west-ham-united/'
u'/report/premier-league-2015-2016-newcastle-united-southampton-fc/'
u'/report/premier-league-2015-2016-stoke-city-liverpool-fc/'
u'/report/premier-league-2015-2016-west-bromwich-albion-manchester-city/'
does this not show the same results for you?
I am using Scrapy 1.0.5 and Gearman to create distributed spiders.
The idea is to build a spider, call it from a gearman worker script and pass 20 URLs at a time to crawl from a gearman client to the worker and then to the spider.
I am able to start the worker, pass URLs to it from the client on to the spider to crawl. The first URL or array of URLs do get picked up and crawled. Once the spider is done, I am unable to reuse it. I get the log message that the spider is closed. When I initiate the client again, the spider reopens, but doesn't crawl.
Here is my worker:
import gearman
import json
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
gm_worker = gearman.GearmanWorker(['localhost:4730'])
def task_listener_reverse(gearman_worker, gearman_job):
process = CrawlerProcess(get_project_settings())
data = json.loads(gearman_job.data)
if(data['vendor_name'] == 'walmart'):
process.crawl('walmart', url=data['url_list'])
process.start() # the script will block here until the crawling is finished
return 'completed'
# gm_worker.set_client_id is optional
gm_worker.set_client_id('python-worker')
gm_worker.register_task('reverse', task_listener_reverse)
# Enter our work loop and call gm_worker.after_poll() after each time we timeout/see socket activity
gm_worker.work()
Here is the code of my Spider.
from crawler.items import CrawlerItemLoader
from scrapy.spiders import Spider
class WalmartSpider(Spider):
name = "walmart"
def __init__(self, **kw):
super(WalmartSpider, self).__init__(**kw)
self.start_urls = kw.get('url')
self.allowed_domains = ["walmart.com"]
def parse(self, response):
item = CrawlerItemLoader(response=response)
item.add_value('url', response.url)
#Title
item.add_xpath('title', '//div/h1/span/text()')
if(response.xpath('//div/h1/span/text()')):
title = response.xpath('//div/h1/span/text()')
item.add_value('title', title)
yield item.load_item()
The first client run produces results and I get the data I need whether it was a single URL or multiple URLs.
On the second run, the spider opens and no results.
This is what I get back and it stops
2016-02-19 01:16:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-19 01:16:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-19 01:16:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-19 01:16:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-19 01:16:30 [scrapy] INFO: Enabled item pipelines: MySQLStorePipeline
2016-02-19 01:16:30 [scrapy] INFO: Enabled item pipelines: MySQLStorePipeline
2016-02-19 01:16:30 [scrapy] INFO: Spider opened
2016-02-19 01:16:30 [scrapy] INFO: Spider opened
2016-02-19 01:16:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-19 01:16:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-19 01:16:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6047
2016-02-19 01:16:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6047
I was able to print the URL or URLs from the worker and spider and ensured they were getting passed on the first working run and second non working run. I spent 2 days and haven't gotten anywhere with it. I would appreciate any pointers.
Well, I decided to abandon Scrapy.
I looked around a lot and everyone kept pointing to the limitation of the twisted reactor. Rather than fighting the framework, I decided to build my own scraper and it was very successful for what I needed. I am able to spin up multiple gearman workers and use the scraper I built to scrape the data concurrently in a server farm.
If anyone is interested, I started with this simple article to build the scraper.
I use gearman client to query the DB and send multiple urls to a worker, the worker scrapes the URLs and does an update query back to the DB. Success!! :)
http://docs.python-guide.org/en/latest/scenarios/scrape/
I am using the following code to scrape data using scrapey:
from scrapy.selector import Selector
from scrapy.spider import Spider
class ExampleSpider(Spider):
name = "example"
allowed_domains = ["dmoz.org"]
start_urls = [
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
def parse(self, response):
sel = Selector(response)
for li in sel.xpath('//ul/li'):
title = li.xpath('a/text()').extract()
link = li.xpath('a/#href').extract()
desc = li.xpath('text()').extract()
print title, link, desc
However, when I run this spider, I get the following message:
2014-06-30 23:39:00-0500 [scrapy] INFO: Scrapy 0.24.1 started (bot: tutorial)
2014-06-30 23:39:00-0500 [scrapy] INFO: Optional features available: ssl, http11
2014-06-30 23:39:00-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['tutorial.spiders'], 'FEED_URI': 'willthiswork.csv', 'BOT_NAME': 'tutorial'}
2014-06-30 23:39:01-0500 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-30 23:39:01-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-30 23:39:01-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-30 23:39:01-0500 [scrapy] INFO: Enabled item pipelines:
2014-06-30 23:39:01-0500 [example] INFO: Spider opened
2014-06-30 23:39:01-0500 [example] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-30 23:39:01-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-06-30 23:39:01-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-06-30 23:39:01-0500 [example] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
Of note is the line "Crawled 0 pages (at 0 pages/min....., as well as the overridden settings.
Additionally, the file I intended to write my data to is completely blank.
Is there something I am doing wrong that is causing data not to write?
I am assuming you are trying to use scrapy crawl tutorial -o myfile.json
To make this work, you need to use scrapy items.
add the following to items.py:
def MozItem(Item):
title = Field()
link = Field()
desc = Field()
and adjust the parse function
def parse(self, response):
sel = Selector(response)
item = MozItem()
for li in sel.xpath('//ul/li'):
item['title'] = li.xpath('a/text()').extract()
item['link'] = li.xpath('a/#href').extract()
item['desc'] = li.xpath('text()').extract()
yield item
I'm stuck with this log now for 3 days:
2014-06-03 11:32:54-0700 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-03 11:32:54-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-03 11:32:54-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-03 11:32:54-0700 [scrapy] INFO: Enabled item pipelines: ImagesPipeline, FilterFieldsPipeline
2014-06-03 11:32:54-0700 [NefsakLaptopSpider] INFO: Spider opened
2014-06-03 11:32:54-0700 [NefsakLaptopSpider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-03 11:32:54-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-06-03 11:32:54-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-06-03 11:32:56-0700 [NefsakLaptopSpider] UNFORMATTABLE OBJECT WRITTEN TO LOG with fmt 'DEBUG: Crawled (%(status)s) %(request)s (referer: %(referer)s)%(flags)s', MESSAGE LOST
2014-06-03 11:33:54-0700 [NefsakLaptopSpider] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2014-06-03 11:34:54-0700 [NefsakLaptopSpider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
More like the last line... Forever and very slowly
The offensive line 4th from bottom, appears only when I set the logging level in Scrapy to DEBUG.
Here's the header of my spider:
class ScrapyCrawler(CrawlSpider):
name = "ScrapyCrawler"
def __init__(self, spiderPath, spiderID, name="ScrapyCrawler", *args, **kwargs):
super(ScrapyCrawler, self).__init__()
self.name = name
self.path = spiderPath
self.id = spiderID
self.path_index = 0
self.favicon_required = kwargs.get("downloadFavicon", True) #the favicon for the scraped site will be added to the first item
self.favicon_item = None
def start_requests(self):
start_path = self.path.pop(0)
# determine the callback based on next step
callback = self.parse_intermediate if type(self.path[0]) == URL \
else self.parse_item_pages
if type(start_path) == URL:
start_url = start_path
request = Request(start_path, callback=callback)
elif type(start_path) == Form:
start_url = start_path.url
request = FormRequest(start_path.url, start_path.data,
callback=callback)
return [request]
def parse_intermediate(self, response):
...
def parse_item_pages(self, response):
...
The thing is, none of the callbacks are called after start_requests().
Here's a hint: The first request out of start_request() is to a page like http://www.example.com. If I change http to https, this causes a redirect in scrapy and the log changes to this:
2014-06-03 12:00:51-0700 [NefsakLaptopSpider] UNFORMATTABLE OBJECT WRITTEN TO LOG with fmt 'DEBUG: Redirecting (%(reason)s) to %(redirected)s from %(request)s', MESSAGE LOST
2014-06-03 12:00:51-0700 [NefsakLaptopSpider] DEBUG: Redirecting (302) to <GET http://www.nefsak.com/home.php?cat=58> from <GET http://www.nefsak.com/home.php?cat=58&xid_be279=248933808671e852497b0b1b33333a8b>
2014-06-03 12:00:52-0700 [NefsakLaptopSpider] DEBUG: Redirecting (301) to <GET http://www.nefsak.com/15-17-Screen/> from <GET http://www.nefsak.com/home.php?cat=58>
2014-06-03 12:00:54-0700 [NefsakLaptopSpider] DEBUG: Crawled (200) <GET http://www.nefsak.com/15-17-Screen/> (referer: None)
2014-06-03 12:00:54-0700 [NefsakLaptopSpider] ERROR: Spider must return Request, BaseItem or None, got 'list' in <GET http://www.nefsak.com/15-17-Screen/>
2014-06-03 12:00:56-0700 [NefsakLaptopSpider] DEBUG: Crawled (200) <GET http://www.nefsak.com/15-17-Screen/?page=4> (referer: http://www.nefsak.com/15-17-Screen/)
More extracted links and more errors like above, then it finishes, unlike former log
As you can see from the last line, the spider has actually gone and extracted a navigation page!. All By Itself.(There's a navigation extraction code, but it doesn't get called, as the debugger breakpoints are never reached).
Unfortunately, I couldn't reproduce the error outside the project. A similar spider just works!. But not inside the project though.
I'll provide more code if requested.
Thanks, and sorry for the long post.
Well, I had a URL class derived from the built-in str. It was coded like that:
class URL(str):
def canonicalize(self, parentURL):
parsed_self = urlparse.urlparse(self)
if parsed_self.scheme:
return self[:] #string copy?
else:
parsed_parent = urlparse.urlparse(parentURL)
return urlparse.urljoin(parsed_parent.scheme + "://" + parsed_parent.netloc, self)
def __str__(self):
return "<URL : {0} >".format(self)
The __str__ method caused infinite recursion when it was printed or logged, because format() called __str__ again... But the exception was swallowed by twisted somehow.
Only when printed the response that the error was shown.
def __str__(self):
return "<URL : " + self + " >" # or use super(URL, self).__str__()
:-)
Hi i am working on scrapy to scrape xml urls
Suppose below is my spider.py code
class TestSpider(BaseSpider):
name = "test"
allowed_domains = {"www.example.com"}
start_urls = [
"https://example.com/jobxml.asp"
]
def parse(self, response):
print response,"??????????????????????"
result:
2012-07-24 16:43:34+0530 [scrapy] INFO: Scrapy 0.14.3 started (bot: testproject)
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled item pipelines:
2012-07-24 16:43:34+0530 [test] INFO: Spider opened
2012-07-24 16:43:34+0530 [test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-07-24 16:43:36+0530 [testproject] DEBUG: Retrying <GET https://example.com/jobxml.asp> (failed 1 times): 400 Bad Request
2012-07-24 16:43:37+0530 [test] DEBUG: Retrying <GET https://example.com/jobxml.asp> (failed 2 times): 400 Bad Request
2012-07-24 16:43:38+0530 [test] DEBUG: Gave up retrying <GET https://example.com/jobxml.asp> (failed 3 times): 400 Bad Request
2012-07-24 16:43:38+0530 [test] DEBUG: Crawled (400) <GET https://example.com/jobxml.asp> (referer: None)
2012-07-24 16:43:38+0530 [test] INFO: Closing spider (finished)
2012-07-24 16:43:38+0530 [test] INFO: Dumping spider stats:
{'downloader/request_bytes': 651,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 504,
'downloader/response_count': 3,
'downloader/response_status_count/400': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 7, 24, 11, 13, 38, 573931),
'scheduler/memory_enqueued': 3,
'start_time': datetime.datetime(2012, 7, 24, 11, 13, 34, 803202)}
2012-07-24 16:43:38+0530 [test] INFO: Spider closed (finished)
2012-07-24 16:43:38+0530 [scrapy] INFO: Dumping global stats:
{'memusage/max': 263143424, 'memusage/startup': 263143424}
Whether scrapy does n't work for xml scraping, if yes can anyone please provide me an example on how to scrape xml tag data
Thanks in advance...........
You have a specific spider made for scraping xml feeds. This is from scrapy documentation:
XMLFeedSpider example
These spiders are pretty easy to use, let’s have a look at one example:
from scrapy import log
from scrapy.contrib.spiders import XMLFeedSpider
from myproject.items import TestItem
class MySpider(XMLFeedSpider):
name = 'example.com'
allowed_domains = ['example.com']
start_urls = ['http://www.example.com/feed.xml']
iterator = 'iternodes' # This is actually unnecesary, since it's the default value
itertag = 'item'
def parse_node(self, response, node):
log.msg('Hi, this is a <%s> node!: %s' % (self.itertag, ''.join(node.extract())))
item = Item()
item['id'] = node.select('#id').extract()
item['name'] = node.select('name').extract()
item['description'] = node.select('description').extract()
return item
This is another way without scrapy:
This is a function used to download xml from given url, note that some import are not in here and this will also give you a nice progress for downloading xml file.
def get_file(self, dir, url, name):
s = urllib2.urlopen(url)
f = open('xml/test.xml','w')
meta = s.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (name, file_size)
current_file_size = 0
block_size = 4096
while True:
buf = s.read(block_size)
if not buf:
break
current_file_size += len(buf)
f.write(buf)
status = ("\r%10d [%3.2f%%]" %
(current_file_size, current_file_size * 100. / file_size))
status = status + chr(8)*(len(status)+1)
sys.stdout.write(status)
sys.stdout.flush()
f.close()
print "\nDone getting feed"
return 1
And then you parse that xml file that you downloaded and saved with iterparse, something like:
for event, elem in iterparse('xml/test.xml'):
if elem.tag == "properties":
print elem.text
That's just an example how do you go through xml tree.
Also, this is an old code of mine, so you would be better of using with for opening files.