I am using the following code to scrape data using scrapey:
from scrapy.selector import Selector
from scrapy.spider import Spider
class ExampleSpider(Spider):
name = "example"
allowed_domains = ["dmoz.org"]
start_urls = [
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
def parse(self, response):
sel = Selector(response)
for li in sel.xpath('//ul/li'):
title = li.xpath('a/text()').extract()
link = li.xpath('a/#href').extract()
desc = li.xpath('text()').extract()
print title, link, desc
However, when I run this spider, I get the following message:
2014-06-30 23:39:00-0500 [scrapy] INFO: Scrapy 0.24.1 started (bot: tutorial)
2014-06-30 23:39:00-0500 [scrapy] INFO: Optional features available: ssl, http11
2014-06-30 23:39:00-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['tutorial.spiders'], 'FEED_URI': 'willthiswork.csv', 'BOT_NAME': 'tutorial'}
2014-06-30 23:39:01-0500 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-30 23:39:01-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-30 23:39:01-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-30 23:39:01-0500 [scrapy] INFO: Enabled item pipelines:
2014-06-30 23:39:01-0500 [example] INFO: Spider opened
2014-06-30 23:39:01-0500 [example] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-30 23:39:01-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-06-30 23:39:01-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-06-30 23:39:01-0500 [example] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
Of note is the line "Crawled 0 pages (at 0 pages/min....., as well as the overridden settings.
Additionally, the file I intended to write my data to is completely blank.
Is there something I am doing wrong that is causing data not to write?
I am assuming you are trying to use scrapy crawl tutorial -o myfile.json
To make this work, you need to use scrapy items.
add the following to items.py:
def MozItem(Item):
title = Field()
link = Field()
desc = Field()
and adjust the parse function
def parse(self, response):
sel = Selector(response)
item = MozItem()
for li in sel.xpath('//ul/li'):
item['title'] = li.xpath('a/text()').extract()
item['link'] = li.xpath('a/#href').extract()
item['desc'] = li.xpath('text()').extract()
yield item
Related
I'm trying to scrape the prices for shoes on the website in the code. I have no idea of knowing if my syntax is even correct. I could really use some help.
from scrapy.spider import BaseSpider
from scrapy import Field
from scrapy import Item
from scrapy.selector import HtmlXPathSelector
def Yeezy(Item):
price = Field()
class YeezySpider(BaseSpider):
name = "yeezy"
allowed_domains = ["https://www.grailed.com/"]
start_url = ['https://www.grailed.com/feed/0Qu8Gh1qHQ?page=2']
def parse(self, response):
hxs = HtmlXPathSelector(response)
price = hxs.css('.listing-price .sub-title:nth-child(1) span').extract()
items = []
for price in price:
item = Yeezy()
item["price"] = price.select(".listing-price .sub-title:nth-child(1) span").extract()
items.append(item)
yield item
The code is reporting this to the console:
ScrapyDeprecationWarning: YeezyScrape.spiders.yeezy_spider.YeezySpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
class YeezySpider(BaseSpider):
2017-08-02 14:45:25-0700 [scrapy] INFO: Scrapy 0.25.1 started (bot: YeezyScrape)
2017-08-02 14:45:25-0700 [scrapy] INFO: Optional features available: ssl, http11
2017-08-02 14:45:25-0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'YeezyScrape.spiders', 'SPIDER_MODULES': ['YeezyScrape.spiders'], 'BOT_NAME': 'YeezyScrape'}
2017-08-02 14:45:25-0700 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled item pipelines:
2017-08-02 14:45:26-0700 [yeezy] INFO: Spider opened
2017-08-02 14:45:26-0700 [yeezy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-02 14:45:26-0700 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-02 14:45:26-0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2017-08-02 14:45:26-0700 [yeezy] INFO: Closing spider (finished)
2017-08-02 14:45:26-0700 [yeezy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 8, 2, 21, 45, 26, 127000),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'start_time': datetime.datetime(2017, 8, 2, 21, 45, 26, 125000)}
2017-08-02 14:45:26-0700 [yeezy] INFO: Spider closed (finished)
Process finished with exit code 0
At first I thought it was a problem with the css elements I entered but now I'm not so sure. This is my first time trying a project like this, I could really use some insight. Thank you in advance.
EDIT: So I tried simulating an xhr request in my code by following another example. This is what I have:
import scrapy
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector
#from YeezyScrape import YeezyscrapeItem
class YeezySpider(scrapy.Spider):
name = "yeezy"
allowed_domains = ["www.grailed.com"]
start_url = ["https://www.grailed.com/feed/0Qu8Gh1qHQ?page=2"]
def parse(self, response):
for i in range(0,2):
yield FormRequest(url = 'https://mnrwefss2q-
dsn.algolia.net/1/indexes/Listing_production/query?x-algolia-
agent=Algolia%20for%20vanilla%20JavaScript%203.21.1&x-algolia-application-
id=MNRWEFSS2Q&x-algolia-api-key=a3a4de2e05d9e9b463911705fb6323ad',
method="post", formdata={"params":"query:boost
filters:(strata:'basic' OR strata:'grailed' OR strata:'hype') AND
(category_path:'footwear.slip_ons' OR category_path:'footwear.sandals' OR
category_path:'footwear.lowtop_sneakers' OR category_path:'footwear.leather'
OR category_path:'footwear.hitop_sneakers' OR
category_path:'footwear.formal_shoes' OR category_path:'footwear.boots') AND
(marketplace:grailed)
hitsPerPage:40
facets ["strata","size","category","category_size",
"category_path","category_path_size",
"category_path_root_size","price_i","designers.id",
"location","marketplace"]
page:2"}, callback=self.data_parse())
def data_parse(self, response):
hxs = HtmlXPathSelector(response)
prices = hxs.xpath("//p").extract()
for prices in prices:
price = prices.select("a/text()").extract()
print price
I had to reformat things a little to fit the indentation differences between Python and Stackoverflow.
These are the logs reported in the terminal, again thanks for the help:
C:\Python27\python.exe C:/Python27/Lib/site-packages/scrapy/cmdline.py crawl yeezy -o price.json
2017-08-04 13:23:27-0700 [scrapy] INFO: Scrapy 0.25.1 started (bot: YeezyScrape)
2017-08-04 13:23:27-0700 [scrapy] INFO: Optional features available: ssl, http11
2017-08-04 13:23:27-0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'YeezyScrape.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['YeezyScrape.spiders'], 'FEED_URI': 'price.json', 'BOT_NAME': 'YeezyScrape'}
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled item pipelines:
2017-08-04 13:23:27-0700 [yeezy] INFO: Spider opened
2017-08-04 13:23:28-0700 [yeezy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-04 13:23:28-0700 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-04 13:23:28-0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2017-08-04 13:23:28-0700 [yeezy] INFO: Closing spider (finished)
2017-08-04 13:23:28-0700 [yeezy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 8, 4, 20, 23, 28, 3000),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'start_time': datetime.datetime(2017, 8, 4, 20, 23, 28, 1000)}
2017-08-04 13:23:28-0700 [yeezy] INFO: Spider closed (finished)
Process finished with exit code 0
Seems like the products are retrieved by AJAX (see related: Can scrapy be used to scrape dynamic content from websites that are using AJAX?).
If you open up browsers webinspector, select network tab and look for XHR requests when the page loads, you can see this:
Seems like a POST type request is being made with categories, filter etc. and a json of products is returned. You can reverse engineer it and replicate it in scrapy.
Load the scrapy shell
scrapy shell "http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/"
Try a selector:
response.xpath('(//table[#class="standard_tabelle"])[1]/tr[not(th)]')
Note: it prints results.
But now use that selector as a for statement:
for row in response.xpath('(//table[#class="standard_tabelle"])[1]/tr[not(th)]'):
row.xpath(".//a[contains(#href, 'report')]/#href").extract_first()
Hit return twice, nothing is printed. To print results inside the for loop, you have to wrap the selector in a print function. Like so:
print(row.xpath(".//a[contains(#href, 'report')]/#href").extract_first())
Why?
Edit
If I do the exact same thing as Liam's post below, my output is this:
rmp:www rmp$ scrapy shell "http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/"
2016-03-05 06:13:28 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-05 06:13:28 [scrapy] INFO: Optional features available: ssl, http11
2016-03-05 06:13:28 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2016-03-05 06:13:28 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2016-03-05 06:13:28 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-05 06:13:28 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-05 06:13:28 [scrapy] INFO: Enabled item pipelines:
2016-03-05 06:13:28 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-05 06:13:28 [scrapy] INFO: Spider opened
2016-03-05 06:13:29 [scrapy] DEBUG: Crawled (200) <GET http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x108c89c10>
[s] item {}
[s] request <GET http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/>
[s] response <200 http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/>
[s] settings <scrapy.settings.Settings object at 0x10a25bb10>
[s] spider <DefaultSpider 'default' at 0x10c1201d0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
2016-03-05 06:13:29 [root] DEBUG: Using default logger
2016-03-05 06:13:29 [root] DEBUG: Using default logger
In [1]: for row in response.xpath('(//table[#class="standard_tabelle"])[1]/tr[not(th)]'):
...: row.xpath(".//a[contains(#href, 'report')]/#href").extract_first()
...:
But with print added?
In [2]: for row in response.xpath('(//table[#class="standard_tabelle"])[1]/tr[not(th)]'):
...: print row.xpath(".//a[contains(#href, 'report')]/#href").extract_first()
...:
/report/premier-league-2015-2016-manchester-united-tottenham-hotspur/
/report/premier-league-2015-2016-afc-bournemouth-aston-villa/
/report/premier-league-2015-2016-everton-fc-watford-fc/
/report/premier-league-2015-2016-leicester-city-sunderland-afc/
/report/premier-league-2015-2016-norwich-city-crystal-palace/
This just worked for me.
>>>scrapy shell "http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/"
>>> for row in response.xpath('(//table[#class="standard_tabelle"])[1]/tr[not(th)]'):
... row.xpath(".//a[contains(#href, 'report')]/#href").extract_first()
...
u'/report/premier-league-2015-2016-manchester-united-tottenham-hotspur/'
u'/report/premier-league-2015-2016-afc-bournemouth-aston-villa/'
u'/report/premier-league-2015-2016-everton-fc-watford-fc/'
u'/report/premier-league-2015-2016-leicester-city-sunderland-afc/'
u'/report/premier-league-2015-2016-norwich-city-crystal-palace/'
u'/report/premier-league-2015-2016-chelsea-fc-swansea-city/'
u'/report/premier-league-2015-2016-arsenal-fc-west-ham-united/'
u'/report/premier-league-2015-2016-newcastle-united-southampton-fc/'
u'/report/premier-league-2015-2016-stoke-city-liverpool-fc/'
u'/report/premier-league-2015-2016-west-bromwich-albion-manchester-city/'
does this not show the same results for you?
I am using Scrapy 1.0.5 and Gearman to create distributed spiders.
The idea is to build a spider, call it from a gearman worker script and pass 20 URLs at a time to crawl from a gearman client to the worker and then to the spider.
I am able to start the worker, pass URLs to it from the client on to the spider to crawl. The first URL or array of URLs do get picked up and crawled. Once the spider is done, I am unable to reuse it. I get the log message that the spider is closed. When I initiate the client again, the spider reopens, but doesn't crawl.
Here is my worker:
import gearman
import json
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
gm_worker = gearman.GearmanWorker(['localhost:4730'])
def task_listener_reverse(gearman_worker, gearman_job):
process = CrawlerProcess(get_project_settings())
data = json.loads(gearman_job.data)
if(data['vendor_name'] == 'walmart'):
process.crawl('walmart', url=data['url_list'])
process.start() # the script will block here until the crawling is finished
return 'completed'
# gm_worker.set_client_id is optional
gm_worker.set_client_id('python-worker')
gm_worker.register_task('reverse', task_listener_reverse)
# Enter our work loop and call gm_worker.after_poll() after each time we timeout/see socket activity
gm_worker.work()
Here is the code of my Spider.
from crawler.items import CrawlerItemLoader
from scrapy.spiders import Spider
class WalmartSpider(Spider):
name = "walmart"
def __init__(self, **kw):
super(WalmartSpider, self).__init__(**kw)
self.start_urls = kw.get('url')
self.allowed_domains = ["walmart.com"]
def parse(self, response):
item = CrawlerItemLoader(response=response)
item.add_value('url', response.url)
#Title
item.add_xpath('title', '//div/h1/span/text()')
if(response.xpath('//div/h1/span/text()')):
title = response.xpath('//div/h1/span/text()')
item.add_value('title', title)
yield item.load_item()
The first client run produces results and I get the data I need whether it was a single URL or multiple URLs.
On the second run, the spider opens and no results.
This is what I get back and it stops
2016-02-19 01:16:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-19 01:16:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-19 01:16:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-19 01:16:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-19 01:16:30 [scrapy] INFO: Enabled item pipelines: MySQLStorePipeline
2016-02-19 01:16:30 [scrapy] INFO: Enabled item pipelines: MySQLStorePipeline
2016-02-19 01:16:30 [scrapy] INFO: Spider opened
2016-02-19 01:16:30 [scrapy] INFO: Spider opened
2016-02-19 01:16:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-19 01:16:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-19 01:16:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6047
2016-02-19 01:16:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6047
I was able to print the URL or URLs from the worker and spider and ensured they were getting passed on the first working run and second non working run. I spent 2 days and haven't gotten anywhere with it. I would appreciate any pointers.
Well, I decided to abandon Scrapy.
I looked around a lot and everyone kept pointing to the limitation of the twisted reactor. Rather than fighting the framework, I decided to build my own scraper and it was very successful for what I needed. I am able to spin up multiple gearman workers and use the scraper I built to scrape the data concurrently in a server farm.
If anyone is interested, I started with this simple article to build the scraper.
I use gearman client to query the DB and send multiple urls to a worker, the worker scrapes the URLs and does an update query back to the DB. Success!! :)
http://docs.python-guide.org/en/latest/scenarios/scrape/
I'm trying to parse site with Scrapy. The urls I need to parse formed like this http://example.com/productID/1234/. This links can be found on pages with address like: http://example.com/categoryID/1234/. The thing is that my crawler fetches first categoryID page (http://www.example.com/categoryID/79/, as you can see from trace below), but nothing more. What am I doing wrong? Thank you.
Here is my Scrapy code:
# -*- coding: UTF-8 -*-
#THIRD-PARTY MODULES
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
class ExampleComSpider(CrawlSpider):
name = "example.com"
allowed_domains = ["http://www.example.com/"]
start_urls = [
"http://www.example.com/"
]
rules = (
# Extract links matching 'categoryID/xxx'
# and follow links from them (since no callback means follow=True by default).
Rule(SgmlLinkExtractor(allow=('/categoryID/(\d*)/', ), )),
# Extract links matching 'productID/xxx' and parse them with the spider's method parse_item
Rule(SgmlLinkExtractor(allow=('/productID/(\d*)/', )), callback='parse_item'),
)
def parse_item(self, response):
self.log('Hi, this is an item page! %s' % response.url)
Here is a trace of Scrapy:
2012-01-31 12:38:56+0000 [scrapy] INFO: Scrapy 0.14.1 started (bot: parsers)
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Enabled item pipelines:
2012-01-31 12:38:57+0000 [example.com] INFO: Spider opened
2012-01-31 12:38:57+0000 [example.com] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-01-31 12:38:57+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-01-31 12:38:58+0000 [example.com] DEBUG: Crawled (200) <GET http://www.example.com/> (referer: None)
2012-01-31 12:38:58+0000 [example.com] DEBUG: Filtered offsite request to 'www.example.com': <GET http://www.example.com/categoryID/79/>
2012-01-31 12:38:58+0000 [example.com] INFO: Closing spider (finished)
2012-01-31 12:38:58+0000 [example.com] INFO: Dumping spider stats:
{'downloader/request_bytes': 199,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 121288,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 1, 31, 12, 38, 58, 409806),
'request_depth_max': 1,
'scheduler/memory_enqueued': 1,
'start_time': datetime.datetime(2012, 1, 31, 12, 38, 57, 127805)}
2012-01-31 12:38:58+0000 [example.com] INFO: Spider closed (finished)
2012-01-31 12:38:58+0000 [scrapy] INFO: Dumping global stats:
{'memusage/max': 26992640, 'memusage/startup': 26992640}
It can be a difference between "www.example.com" and "example.com". If it helps, you can use them both this way
allowed_domains = ["www.example.com", "example.com"]
Replace:
allowed_domains = ["http://www.example.com/"]
with:
allowed_domains = ["example.com"]
That should do the trick.
I'm trying to write parsing script using python/scrapy. How can I remove [] and u' from strings in result file?
Now I have text like this:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.markup import remove_tags
from googleparser.items import GoogleparserItem
import sys
class GoogleparserSpider(BaseSpider):
name = "google.com"
allowed_domains = ["google.com"]
start_urls = [
"http://www.google.com/search?q=this+is+first+test&num=20&hl=uk&start=0",
"http://www.google.com/search?q=this+is+second+test&num=20&hl=uk&start=0"
]
def parse(self, response):
print "===START======================================================="
hxs = HtmlXPathSelector(response)
qqq = hxs.select('/html/head/title/text()').extract()
print qqq
print "---DATA--------------------------------------------------------"
sites = hxs.select('/html/body/div[5]/div[3]/div/div/div/ol/li/h3')
i = 1
items = []
for site in sites:
try:
item = GoogleparserItem()
title1 = site.select('a').extract()
title2=str(title1)
title=remove_tags(title2)
link=site.select('a/#href').extract()
item['num'] = i
item['title'] = title
item['link'] = link
i= i+1
items.append(item)
except:
print 'EXCEPTION'
return items
print "===END========================================================="
SPIDER = GoogleparserSpider()
and I have result like this after running
python scrapy-ctl.py crawl google.com
2010-07-25 17:44:44+0300 [-] Log opened.
2010-07-25 17:44:44+0300 [googleparser] DEBUG: Enabled extensions: CoreStats, CloseSpider, WebService, TelnetConsole, MemoryUsage
2010-07-25 17:44:44+0300 [googleparser] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2010-07-25 17:44:44+0300 [googleparser] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloaderStats, UserAgentMiddleware, RedirectMiddleware, DefaultHeadersMiddleware, CookiesMiddleware, HttpCompressionMiddleware, RetryMiddleware
2010-07-25 17:44:44+0300 [googleparser] DEBUG: Enabled spider middlewares: UrlLengthMiddleware, HttpErrorMiddleware, RefererMiddleware, OffsiteMiddleware, DepthMiddleware
2010-07-25 17:44:44+0300 [googleparser] DEBUG: Enabled item pipelines: CsvWriterPipeline
2010-07-25 17:44:44+0300 [-] scrapy.webservice.WebService starting on 6080
2010-07-25 17:44:44+0300 [-] scrapy.telnet.TelnetConsole starting on 6023
2010-07-25 17:44:44+0300 [google.com] INFO: Spider opened
2010-07-25 17:44:45+0300 [google.com] DEBUG: Crawled (200) <GET http://www.google.com/search?q=this+is+first+test&num=20&hl=uk&start=0> (referer: None)
===START=======================================================
[u'this is first test - \u041f\u043e\u0448\u0443\u043a Google']
---DATA--------------------------------------------------------
2010-07-25 17:52:42+0300 [google.com] DEBUG: Scraped GoogleparserItem(num=1, link=[u'http://www.amazon.com/First-Protector-Small-Tamora-Pierce/dp/0679889175'], title=u"[u'Amazon.com: First Test (Protector of the Small) (9780679889175 ...']") in <http://www.google.com/search?q=this+is+first+test&num=100&hl=uk&start=0>
and this text in file:
1,[u'Amazon.com: First Test (Protector of the Small) (9780679889175 ...'],[u'http://www.amazon.com/First-Protector-Small-Tamora-Pierce/dp/0679889175']
more prettier - print qqq.pop()
Replace print qqq with print qqq[0]. You get that result because qqq is a list.
Same problem with your text file. You have a list with one element that you're writing instead of the element within the list.
It looks like the result from extract is a list. Try:
print ', '.join(qqq)
The u infront of the code, purely means it's a unicode string. See the reference here. http://docs.python.org/tutorial/introduction.html#unicode-strings. The fix would be to convert your content to a string using the str() method.