I've been stuck on this log for 3 days now:
2014-06-03 11:32:54-0700 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-03 11:32:54-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-03 11:32:54-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-03 11:32:54-0700 [scrapy] INFO: Enabled item pipelines: ImagesPipeline, FilterFieldsPipeline
2014-06-03 11:32:54-0700 [NefsakLaptopSpider] INFO: Spider opened
2014-06-03 11:32:54-0700 [NefsakLaptopSpider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-03 11:32:54-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-06-03 11:32:54-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-06-03 11:32:56-0700 [NefsakLaptopSpider] UNFORMATTABLE OBJECT WRITTEN TO LOG with fmt 'DEBUG: Crawled (%(status)s) %(request)s (referer: %(referer)s)%(flags)s', MESSAGE LOST
2014-06-03 11:33:54-0700 [NefsakLaptopSpider] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2014-06-03 11:34:54-0700 [NefsakLaptopSpider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
It goes on like that last line... forever, and very slowly.
The offending line, 4th from the bottom, appears only when I set the logging level in Scrapy to DEBUG.
Here's the header of my spider:
class ScrapyCrawler(CrawlSpider):
name = "ScrapyCrawler"
def __init__(self, spiderPath, spiderID, name="ScrapyCrawler", *args, **kwargs):
super(ScrapyCrawler, self).__init__()
self.name = name
self.path = spiderPath
self.id = spiderID
self.path_index = 0
self.favicon_required = kwargs.get("downloadFavicon", True) #the favicon for the scraped site will be added to the first item
self.favicon_item = None
def start_requests(self):
start_path = self.path.pop(0)
# determine the callback based on next step
callback = self.parse_intermediate if type(self.path[0]) == URL \
else self.parse_item_pages
if type(start_path) == URL:
start_url = start_path
request = Request(start_path, callback=callback)
elif type(start_path) == Form:
start_url = start_path.url
request = FormRequest(start_path.url, start_path.data,
callback=callback)
return [request]
def parse_intermediate(self, response):
...
def parse_item_pages(self, response):
...
The thing is, none of the callbacks are called after start_requests().
Here's a hint: the first request out of start_requests() is to a page like http://www.example.com. If I change http to https, this causes a redirect in Scrapy and the log changes to this:
2014-06-03 12:00:51-0700 [NefsakLaptopSpider] UNFORMATTABLE OBJECT WRITTEN TO LOG with fmt 'DEBUG: Redirecting (%(reason)s) to %(redirected)s from %(request)s', MESSAGE LOST
2014-06-03 12:00:51-0700 [NefsakLaptopSpider] DEBUG: Redirecting (302) to <GET http://www.nefsak.com/home.php?cat=58> from <GET http://www.nefsak.com/home.php?cat=58&xid_be279=248933808671e852497b0b1b33333a8b>
2014-06-03 12:00:52-0700 [NefsakLaptopSpider] DEBUG: Redirecting (301) to <GET http://www.nefsak.com/15-17-Screen/> from <GET http://www.nefsak.com/home.php?cat=58>
2014-06-03 12:00:54-0700 [NefsakLaptopSpider] DEBUG: Crawled (200) <GET http://www.nefsak.com/15-17-Screen/> (referer: None)
2014-06-03 12:00:54-0700 [NefsakLaptopSpider] ERROR: Spider must return Request, BaseItem or None, got 'list' in <GET http://www.nefsak.com/15-17-Screen/>
2014-06-03 12:00:56-0700 [NefsakLaptopSpider] DEBUG: Crawled (200) <GET http://www.nefsak.com/15-17-Screen/?page=4> (referer: http://www.nefsak.com/15-17-Screen/)
More extracted links and more errors like the above follow, then it finishes, unlike the former log.
As you can see from the last line, the spider has actually gone and extracted a navigation page, all by itself. (There is navigation-extraction code, but it never gets called; the debugger breakpoints are never reached.)
Unfortunately, I couldn't reproduce the error outside the project. A similar spider just works, but not inside this project.
I'll provide more code if requested.
Thanks, and sorry for the long post.
Well, I had a URL class derived from the built-in str. It was coded like this:
class URL(str):
def canonicalize(self, parentURL):
parsed_self = urlparse.urlparse(self)
if parsed_self.scheme:
return self[:] #string copy?
else:
parsed_parent = urlparse.urlparse(parentURL)
return urlparse.urljoin(parsed_parent.scheme + "://" + parsed_parent.netloc, self)
def __str__(self):
return "<URL : {0} >".format(self)
The __str__ method caused infinite recursion whenever the URL was printed or logged: with an empty format spec, format() on a str subclass ends up calling str(self), which calls __str__ again. The exception was swallowed by Twisted somehow; the error only showed up when the response itself was printed. The fix:
def __str__(self):
return "<URL : " + self + " >" # or use super(URL, self).__str__()
:-)
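To make the failure mode concrete, here is a small standalone sketch (the class names are illustrative, not from the project) contrasting the broken and the fixed __str__, in Python 2 like the rest of the thread:
# Standalone illustration of the __str__ recursion described above.
class BadURL(str):
    def __str__(self):
        # With an empty format spec, formatting a str *subclass* falls back
        # to str(self), which re-enters this __str__ -> infinite recursion.
        return "<URL : {0} >".format(self)

class GoodURL(str):
    def __str__(self):
        # Concatenation works on the underlying string data directly,
        # so __str__ is never re-entered.
        return "<URL : " + self + " >"

print str(GoodURL("http://www.example.com"))   # <URL : http://www.example.com >
try:
    str(BadURL("http://www.example.com"))
except RuntimeError as e:
    print "BadURL blew the stack:", e          # maximum recursion depth exceeded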
Related
Load the scrapy shell
scrapy shell "http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/"
Try a selector:
response.xpath('(//table[@class="standard_tabelle"])[1]/tr[not(th)]')
Note: it prints results.
But now use that selector in a for loop:
for row in response.xpath('(//table[@class="standard_tabelle"])[1]/tr[not(th)]'):
    row.xpath(".//a[contains(@href, 'report')]/@href").extract_first()
Hit return twice, and nothing is printed. To print results inside the for loop, you have to wrap the expression in a print call. Like so:
print(row.xpath(".//a[contains(@href, 'report')]/@href").extract_first())
Why?
Edit
If I do the exact same thing as Liam's post below, my output is this:
rmp:www rmp$ scrapy shell "http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/"
2016-03-05 06:13:28 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-05 06:13:28 [scrapy] INFO: Optional features available: ssl, http11
2016-03-05 06:13:28 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2016-03-05 06:13:28 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2016-03-05 06:13:28 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-05 06:13:28 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-05 06:13:28 [scrapy] INFO: Enabled item pipelines:
2016-03-05 06:13:28 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-05 06:13:28 [scrapy] INFO: Spider opened
2016-03-05 06:13:29 [scrapy] DEBUG: Crawled (200) <GET http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x108c89c10>
[s] item {}
[s] request <GET http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/>
[s] response <200 http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/>
[s] settings <scrapy.settings.Settings object at 0x10a25bb10>
[s] spider <DefaultSpider 'default' at 0x10c1201d0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
2016-03-05 06:13:29 [root] DEBUG: Using default logger
2016-03-05 06:13:29 [root] DEBUG: Using default logger
In [1]: for row in response.xpath('(//table[@class="standard_tabelle"])[1]/tr[not(th)]'):
...: row.xpath(".//a[contains(@href, 'report')]/@href").extract_first()
...:
But with print added?
In [2]: for row in response.xpath('(//table[@class="standard_tabelle"])[1]/tr[not(th)]'):
...: print row.xpath(".//a[contains(@href, 'report')]/@href").extract_first()
...:
/report/premier-league-2015-2016-manchester-united-tottenham-hotspur/
/report/premier-league-2015-2016-afc-bournemouth-aston-villa/
/report/premier-league-2015-2016-everton-fc-watford-fc/
/report/premier-league-2015-2016-leicester-city-sunderland-afc/
/report/premier-league-2015-2016-norwich-city-crystal-palace/
This just worked for me.
>>>scrapy shell "http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/"
>>> for row in response.xpath('(//table[@class="standard_tabelle"])[1]/tr[not(th)]'):
... row.xpath(".//a[contains(@href, 'report')]/@href").extract_first()
...
u'/report/premier-league-2015-2016-manchester-united-tottenham-hotspur/'
u'/report/premier-league-2015-2016-afc-bournemouth-aston-villa/'
u'/report/premier-league-2015-2016-everton-fc-watford-fc/'
u'/report/premier-league-2015-2016-leicester-city-sunderland-afc/'
u'/report/premier-league-2015-2016-norwich-city-crystal-palace/'
u'/report/premier-league-2015-2016-chelsea-fc-swansea-city/'
u'/report/premier-league-2015-2016-arsenal-fc-west-ham-united/'
u'/report/premier-league-2015-2016-newcastle-united-southampton-fc/'
u'/report/premier-league-2015-2016-stoke-city-liverpool-fc/'
u'/report/premier-league-2015-2016-west-bromwich-albion-manchester-city/'
Does this not show the same results for you?
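For what it's worth, the difference between the two transcripts above is most likely the shell doing the echoing, not Scrapy: the plain >>> interpreter compiles interactive input in 'single' mode, where every bare expression statement (even one inside a for loop) is handed to sys.displayhook and echoed, while IPython only displays the last top-level expression of a cell. A small Scrapy-free sketch of that compile-mode difference, in Python 2 syntax to match the thread:
# Scrapy-free sketch: why the plain ">>>" shell echoes values inside a loop.
snippet = "for i in range(2):\n    i * 10\n"

# 'single' is the mode the interactive ">>>" prompt uses: bare expression
# statements go through sys.displayhook, so their reprs get printed.
exec compile(snippet, "<stdin>", "single")    # prints 0 and 10

# 'exec' is what scripts (and, effectively, an IPython loop body) get:
# bare expressions are evaluated and discarded, so nothing is printed.
exec compile(snippet, "<cell>", "exec")       # prints nothing
print "you have to call print yourself here"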
I am using the following code to scrape data using Scrapy:
from scrapy.selector import Selector
from scrapy.spider import Spider
class ExampleSpider(Spider):
name = "example"
allowed_domains = ["dmoz.org"]
start_urls = [
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
def parse(self, response):
sel = Selector(response)
for li in sel.xpath('//ul/li'):
title = li.xpath('a/text()').extract()
link = li.xpath('a/@href').extract()
desc = li.xpath('text()').extract()
print title, link, desc
However, when I run this spider, I get the following message:
2014-06-30 23:39:00-0500 [scrapy] INFO: Scrapy 0.24.1 started (bot: tutorial)
2014-06-30 23:39:00-0500 [scrapy] INFO: Optional features available: ssl, http11
2014-06-30 23:39:00-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['tutorial.spiders'], 'FEED_URI': 'willthiswork.csv', 'BOT_NAME': 'tutorial'}
2014-06-30 23:39:01-0500 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-30 23:39:01-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-30 23:39:01-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-30 23:39:01-0500 [scrapy] INFO: Enabled item pipelines:
2014-06-30 23:39:01-0500 [example] INFO: Spider opened
2014-06-30 23:39:01-0500 [example] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-30 23:39:01-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-06-30 23:39:01-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-06-30 23:39:01-0500 [example] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
Of note is the line "Crawled 0 pages (at 0 pages/min...", as well as the overridden settings.
Additionally, the file I intended to write my data to is completely blank.
Is there something I am doing wrong that is causing data not to write?
I am assuming you are trying to use scrapy crawl tutorial -o myfile.json
To make this work, you need to use scrapy items.
add the following to items.py:
from scrapy.item import Item, Field

class MozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
and adjust the parse function
def parse(self, response):
sel = Selector(response)
item = MozItem()
for li in sel.xpath('//ul/li'):
item['title'] = li.xpath('a/text()').extract()
item['link'] = li.xpath('a/@href').extract()
item['desc'] = li.xpath('text()').extract()
yield item
I'm writing spiders with Scrapy to get some data from a couple of applications using ASP. Both web pages are almost identical and require logging in before scraping can start, but I only managed to scrape one of them. On the other one, Scrapy waits forever and never gets past the login made with the FormRequest method.
The code of both spiders (they are almost identical but with different IPs) is as following:
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest
from scrapy.shell import inspect_response
class MySpider(BaseSpider):
name = "my_very_nice_spider"
allowed_domains = ["xxx.xxx.xxx.xxx"]
start_urls = ['http://xxx.xxx.xxx.xxx/reporting/']
def parse(self,response):
#Simulate user login on (http://xxx.xxx.xxx.xxx/reporting/)
return [FormRequest.from_response(response,
formdata={'user':'the_username',
'password':'my_nice_password'},
callback=self.after_login)]
def after_login(self,response):
inspect_response(response,self) #Spider never gets here in one site
if "Bad login" in response.body:
print "Login failed"
return
#Scraping code begins...
Wondering what could be different between them, I used Firefox Live HTTP Headers to inspect the headers and found only one difference: the webpage that works uses IIS 6.0 and the one that doesn't uses IIS 5.1.
As this alone couldn't explain why one works and the other doesn't, I used Wireshark to capture network traffic and found this:
Interaction using scrapy with working webpage (IIS 6.0)
scrapy --> webpage GET /reporting/ HTTP/1.1
scrapy <-- webpage HTTP/1.1 200 OK
scrapy --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded)
scrapy <-- webpage HTTP/1.1 302 Object moved
scrapy --> webpage GET /reporting/htm/webpage.asp
scrapy <-- webpage HTTP/1.1 200 OK
scrapy --> webpage POST /reporting/asp/report1.asp
...Scrapping begins
Interaction using scrapy with not working webpage (IIS 5.1)
scrapy --> webpage GET /reporting/ HTTP/1.1
scrapy <-- webpage HTTP/1.1 200 OK
scrapy --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded)
scrapy <-- webpage HTTP/1.1 100 Continue # What the f...?
scrapy <-- webpage HTTP/1.1 302 Object moved
...Scrapy waits forever...
I googled a little and found that IIS 5.1 indeed has a nice kind of "feature" that makes it return HTTP 100 whenever someone makes a POST to it, as shown here.
Knowing where the root of all evil always lies, but having to scrape that site anyway... how can I make Scrapy work in this situation? Or am I doing something wrong?
Thank you!
Edit - Console log with not working site:
2014-01-17 09:09:50-0300 [scrapy] INFO: Scrapy 0.20.2 started (bot: mybot)
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Optional features available: ssl, http11
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'bot.spiders', 'SPIDER_MODULES': ['bot.spiders'], 'BOT_NAME': 'bot'}
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled item pipelines:
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Spider opened
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-01-17 09:09:54-0300 [my_very_nice_spider] DEBUG: Crawled (200) <GET http://xxx.xxx.xxx.xxx/reporting/> (referer: None)
2014-01-17 09:10:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:11:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:12:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:12:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 1 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds..
2014-01-17 09:13:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:14:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:15:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:15:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 2 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds..
...
Try using the HTTP 1.0 downloader:
# settings.py
DOWNLOAD_HANDLERS = {
'http': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
'https': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
}
Hi, I am working with Scrapy to scrape XML URLs.
Suppose the following is my spider.py code:
class TestSpider(BaseSpider):
name = "test"
allowed_domains = {"www.example.com"}
start_urls = [
"https://example.com/jobxml.asp"
]
def parse(self, response):
print response,"??????????????????????"
result:
2012-07-24 16:43:34+0530 [scrapy] INFO: Scrapy 0.14.3 started (bot: testproject)
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled item pipelines:
2012-07-24 16:43:34+0530 [test] INFO: Spider opened
2012-07-24 16:43:34+0530 [test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-07-24 16:43:36+0530 [testproject] DEBUG: Retrying <GET https://example.com/jobxml.asp> (failed 1 times): 400 Bad Request
2012-07-24 16:43:37+0530 [test] DEBUG: Retrying <GET https://example.com/jobxml.asp> (failed 2 times): 400 Bad Request
2012-07-24 16:43:38+0530 [test] DEBUG: Gave up retrying <GET https://example.com/jobxml.asp> (failed 3 times): 400 Bad Request
2012-07-24 16:43:38+0530 [test] DEBUG: Crawled (400) <GET https://example.com/jobxml.asp> (referer: None)
2012-07-24 16:43:38+0530 [test] INFO: Closing spider (finished)
2012-07-24 16:43:38+0530 [test] INFO: Dumping spider stats:
{'downloader/request_bytes': 651,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 504,
'downloader/response_count': 3,
'downloader/response_status_count/400': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 7, 24, 11, 13, 38, 573931),
'scheduler/memory_enqueued': 3,
'start_time': datetime.datetime(2012, 7, 24, 11, 13, 34, 803202)}
2012-07-24 16:43:38+0530 [test] INFO: Spider closed (finished)
2012-07-24 16:43:38+0530 [scrapy] INFO: Dumping global stats:
{'memusage/max': 263143424, 'memusage/startup': 263143424}
Does Scrapy not work for XML scraping? If so, can anyone please provide me an example of how to scrape XML tag data?
Thanks in advance...........
There is a specific spider made for scraping XML feeds. This is from the Scrapy documentation:
XMLFeedSpider example
These spiders are pretty easy to use, let’s have a look at one example:
from scrapy import log
from scrapy.contrib.spiders import XMLFeedSpider
from myproject.items import TestItem
class MySpider(XMLFeedSpider):
name = 'example.com'
allowed_domains = ['example.com']
start_urls = ['http://www.example.com/feed.xml']
iterator = 'iternodes' # This is actually unnecessary, since it's the default value
itertag = 'item'
def parse_node(self, response, node):
log.msg('Hi, this is a <%s> node!: %s' % (self.itertag, ''.join(node.extract())))
item = TestItem()
item['id'] = node.select('@id').extract()
item['name'] = node.select('name').extract()
item['description'] = node.select('description').extract()
return item
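The example above imports TestItem from myproject.items; a minimal items.py backing it could look like this (the field names are taken from the parse_node calls above, the rest is just a sketch):
# myproject/items.py -- minimal item definition for the XMLFeedSpider example
from scrapy.item import Item, Field

class TestItem(Item):
    id = Field()
    name = Field()
    description = Field()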
This is another way without scrapy:
This is a function used to download XML from a given URL; note that some imports are not included here, and it will also give you a nice progress indicator while downloading the XML file.
def get_file(self, dir, url, name):
s = urllib2.urlopen(url)
f = open('xml/test.xml','w')
meta = s.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (name, file_size)
current_file_size = 0
block_size = 4096
while True:
buf = s.read(block_size)
if not buf:
break
current_file_size += len(buf)
f.write(buf)
status = ("\r%10d [%3.2f%%]" %
(current_file_size, current_file_size * 100. / file_size))
status = status + chr(8)*(len(status)+1)
sys.stdout.write(status)
sys.stdout.flush()
f.close()
print "\nDone getting feed"
return 1
And then you parse the XML file that you downloaded and saved, using iterparse, something like:
from xml.etree.ElementTree import iterparse

for event, elem in iterparse('xml/test.xml'):
    if elem.tag == "properties":
        print elem.text
That's just an example of how you go through the XML tree.
Also, this is old code of mine, so you would be better off using with for opening files; see the sketch below.
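As a rough illustration of that last point, here is the same download loop reshaped around with (still Python 2 urllib2, as above; dropping the unused self and dir parameters is my own simplification):
import sys
import urllib2
from contextlib import closing

def get_file(url, name):
    # 'closing' and 'with' make sure both the connection and the file are
    # released even if the download raises halfway through.
    with closing(urllib2.urlopen(url)) as s:
        meta = s.info()
        file_size = int(meta.getheaders("Content-Length")[0])
        print "Downloading: %s Bytes: %s" % (name, file_size)
        current_file_size = 0
        block_size = 4096
        with open('xml/test.xml', 'w') as f:
            while True:
                buf = s.read(block_size)
                if not buf:
                    break
                current_file_size += len(buf)
                f.write(buf)
                status = ("\r%10d  [%3.2f%%]" %
                          (current_file_size, current_file_size * 100. / file_size))
                sys.stdout.write(status)
                sys.stdout.flush()
    print "\nDone getting feed"
    return 1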
I'm trying to write a parsing script using Python/Scrapy. How can I remove the [] and u' from the strings in the result file?
Here is what I have now:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.markup import remove_tags
from googleparser.items import GoogleparserItem
import sys
class GoogleparserSpider(BaseSpider):
name = "google.com"
allowed_domains = ["google.com"]
start_urls = [
"http://www.google.com/search?q=this+is+first+test&num=20&hl=uk&start=0",
"http://www.google.com/search?q=this+is+second+test&num=20&hl=uk&start=0"
]
def parse(self, response):
print "===START======================================================="
hxs = HtmlXPathSelector(response)
qqq = hxs.select('/html/head/title/text()').extract()
print qqq
print "---DATA--------------------------------------------------------"
sites = hxs.select('/html/body/div[5]/div[3]/div/div/div/ol/li/h3')
i = 1
items = []
for site in sites:
try:
item = GoogleparserItem()
title1 = site.select('a').extract()
title2=str(title1)
title=remove_tags(title2)
link=site.select('a/@href').extract()
item['num'] = i
item['title'] = title
item['link'] = link
i= i+1
items.append(item)
except:
print 'EXCEPTION'
return items
print "===END========================================================="
SPIDER = GoogleparserSpider()
and I get a result like this after running
python scrapy-ctl.py crawl google.com
2010-07-25 17:44:44+0300 [-] Log opened.
2010-07-25 17:44:44+0300 [googleparser] DEBUG: Enabled extensions: CoreStats, CloseSpider, WebService, TelnetConsole, MemoryUsage
2010-07-25 17:44:44+0300 [googleparser] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2010-07-25 17:44:44+0300 [googleparser] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloaderStats, UserAgentMiddleware, RedirectMiddleware, DefaultHeadersMiddleware, CookiesMiddleware, HttpCompressionMiddleware, RetryMiddleware
2010-07-25 17:44:44+0300 [googleparser] DEBUG: Enabled spider middlewares: UrlLengthMiddleware, HttpErrorMiddleware, RefererMiddleware, OffsiteMiddleware, DepthMiddleware
2010-07-25 17:44:44+0300 [googleparser] DEBUG: Enabled item pipelines: CsvWriterPipeline
2010-07-25 17:44:44+0300 [-] scrapy.webservice.WebService starting on 6080
2010-07-25 17:44:44+0300 [-] scrapy.telnet.TelnetConsole starting on 6023
2010-07-25 17:44:44+0300 [google.com] INFO: Spider opened
2010-07-25 17:44:45+0300 [google.com] DEBUG: Crawled (200) <GET http://www.google.com/search?q=this+is+first+test&num=20&hl=uk&start=0> (referer: None)
===START=======================================================
[u'this is first test - \u041f\u043e\u0448\u0443\u043a Google']
---DATA--------------------------------------------------------
2010-07-25 17:52:42+0300 [google.com] DEBUG: Scraped GoogleparserItem(num=1, link=[u'http://www.amazon.com/First-Protector-Small-Tamora-Pierce/dp/0679889175'], title=u"[u'Amazon.com: First Test (Protector of the Small) (9780679889175 ...']") in <http://www.google.com/search?q=this+is+first+test&num=100&hl=uk&start=0>
and this text in file:
1,[u'Amazon.com: First Test (Protector of the Small) (9780679889175 ...'],[u'http://www.amazon.com/First-Protector-Small-Tamora-Pierce/dp/0679889175']
Prettier: print qqq.pop()
Replace print qqq with print qqq[0]. You get that result because qqq is a list.
Same problem with your text file. You have a list with one element that you're writing instead of the element within the list.
It looks like the result from extract is a list. Try:
print ', '.join(qqq)
The u in front of the string simply means it's a Unicode string. See the reference here: http://docs.python.org/tutorial/introduction.html#unicode-strings. The fix would be to convert your content to a string using the str() method.
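Putting those answers together, here is a tiny sketch of what each option prints (the value is made up to stand in for the scraped title):
# 'title' stands in for the one-element list that extract() returns.
title = [u'Amazon.com: First Test (Protector of the Small) ...']

print title             # [u'Amazon.com: First Test ...']  <- list repr: brackets and u'' show up
print title[0]          # Amazon.com: First Test ...       <- the string itself, no brackets
print ', '.join(title)  # also fine, and works when the list has several elements
# title.pop() returns the element too, but it also removes it from the list.
So when writing the CSV, write the element (title[0]), not the list itself, and the brackets and u'' disappear.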