I'm having an issue running through the CrawlSpider example in the Scrapy documentation. It seems to be crawling just fine but I'm having trouble getting it to output to a CSV file (or anything really).
So, my question is: can I use this:
scrapy crawl dmoz -o items.csv
or do I have to create an Item Pipeline?
UPDATED, now with code:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from targets.item import TargetsItem

class MySpider(CrawlSpider):
    name = 'abc'
    allowed_domains = ['ididntuseexample.com']
    start_urls = ['http://www.ididntuseexample.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('ididntuseexample.com', ))),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        item = TargetsItem()
        item['title'] = response.xpath('//h2/a/text()').extract()  # this pulled down data in scrapy shell
        item['link'] = response.xpath('//h2/a/@href').extract()  # this pulled down data in scrapy shell
        return item
Rules are the mechanism CrawlSpider uses for following links. Those links are defined with a LinkExtractor, which basically indicates which links to extract from the crawled pages (such as the ones defined in the start_urls list) so they can be followed. You can then pass a callback that will be called on each extracted link, or more precisely, on the pages downloaded by following those links.
Your rule must call parse_item as its callback. So, replace:
Rule(LinkExtractor(allow=('ididntuseexample.com', ))),
with:
Rule(LinkExtractor(allow=('ididntuseexample.com',)), callback='parse_item'),
This rule says that you want to call parse_item on every link whose URL matches ididntuseexample.com. I suspect that what you want as the link extractor pattern is not the domain, but the links you want to follow/scrape.
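For instance, if the pages you actually want to scrape live under a path such as /products/ (a purely hypothetical pattern, adjust it to your site), the rule might look like this:

    Rule(LinkExtractor(allow=(r'/products/.+', )), callback='parse_item'),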
Here is a basic example that crawls Hacker News to retrieve the title and the first lines of the first comment for every story on the front page.
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class HackerNewsItem(scrapy.Item):
    title = scrapy.Field()
    comment = scrapy.Field()

class HackerNewsSpider(CrawlSpider):
    name = 'hackernews'
    allowed_domains = ['news.ycombinator.com']
    start_urls = [
        'https://news.ycombinator.com/'
    ]

    rules = (
        # Follow any item link and call parse_item.
        Rule(LinkExtractor(allow=('item.*', )), callback='parse_item'),
    )

    def parse_item(self, response):
        item = HackerNewsItem()
        # Get the title
        item['title'] = response.xpath('//*[contains(@class, "title")]/a/text()').extract()
        # Get the first words of the first comment
        item['comment'] = response.xpath('(//*[contains(@class, "comment")])[1]/font/text()').extract()
        return item
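Once a callback actually returns items, the feed export you asked about works without any pipeline. For this sketch that would be, for example:

scrapy crawl hackernews -o items.csv

(the file name is arbitrary; .json and .xml outputs work the same way).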
I am new to Scrapy and am trying to crawl a domain, following all internal links and scraping the title of every URL matching the pattern /example/.*.
Crawling works, but the scraping of the title does not, since the output file is empty. Most likely I got the rules wrong. Is this the right syntax for the rules in order to achieve what I am looking for?
items.py

import scrapy

class BidItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
spider.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from bid.items import BidItem

class GetbidSpider(CrawlSpider):
    name = 'getbid'
    allowed_domains = ['domain.de']
    start_urls = ['https://www.domain.de/']

    rules = (
        Rule(
            LinkExtractor(),
            follow=True
        ),
        Rule(
            LinkExtractor(allow=['example/.*']),
            callback='parse_item'
        ),
    )

    def parse_item(self, response):
        href = BidItem()
        href['url'] = response.url
        href['title'] = response.css("h1::text").extract()
        return href
crawl: scrapy crawl getbid -o 012916.csv
From the CrawlSpider docs:
If multiple rules match the same link, the first one will be used,
according to the order they’re defined in this attribute.
Since your first rule will match all links, it will always be used and all other rules will be ignored.
Fixing the problem is as simple as switching the order of the rules.
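A minimal sketch of the reordered rules (everything else in your spider stays as you have it):

rules = (
    # The more specific rule must come first so it gets a chance to match.
    Rule(
        LinkExtractor(allow=['example/.*']),
        callback='parse_item'
        # add follow=True here as well if you also want to keep crawling
        # from the matched pages; with a callback set, follow defaults to False
    ),
    # The catch-all rule now only handles the remaining links.
    Rule(
        LinkExtractor(),
        follow=True
    ),
)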
I'm scraping reddit to get the link of every entry in a subreddit, and I would also like to follow the links that match http://imgur.com/gallery/\w*. But I'm having problems running the callback for Imgur; it just doesn't execute it. What's failing?
I'm detecting the Imgur URLs with a simple if "http://imgur.com/gallery/" in item['link'][0]: statement; maybe Scrapy provides a better way to detect them?
This is what I tried:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from reddit.items import RedditItem

class RedditSpider(CrawlSpider):
    name = "reddit"
    allowed_domains = ["reddit.com"]
    start_urls = [
        "http://www.reddit.com/r/pics",
    ]

    rules = [
        Rule(
            LinkExtractor(allow=['/r/pics/\?count=\d.*&after=\w.*']),
            callback='parse_item',
            follow=True
        )
    ]

    def parse_item(self, response):
        for title in response.xpath("//div[contains(@class, 'entry')]/p/a"):
            item = RedditItem()
            item['title'] = title.xpath('text()').extract()
            item['link'] = title.xpath('@href').extract()
            yield item

            if "http://imgur.com/gallery/" in item['link'][0]:
                # print item['link'][0]
                url = response.urljoin(item['link'][0])
                print url
                yield scrapy.Request(url, callback=self.parse_imgur_gallery)

    def parse_imgur_gallery(self, response):
        print response.url
This is my Item class:
import scrapy

class RedditItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
This is the output when executing the spider with --nolog and printing the url variable inside the if condition (it is not the response.url output); it still doesn't run the callback:
PS C:\repos\python\scrapy\reddit> scrapy crawl --output=export.json --nolog reddit
http://imgur.com/gallery/W7sXs/new
http://imgur.com/gallery/v26KnSX
http://imgur.com/gallery/fqqBq
http://imgur.com/gallery/9GDTP/new
http://imgur.com/gallery/5gjLCPV
http://imgur.com/gallery/l6Tpavl
http://imgur.com/gallery/Ow4gQ
...
I've found it. The imgur.com domain wasn't allowed; I just needed to add it:
allowed_domains = ["reddit.com", "imgur.com"]
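For context: requests to hosts outside allowed_domains are silently dropped by Scrapy's offsite filtering, which is why parse_imgur_gallery never ran. With imgur.com allowed, you could also let CrawlSpider hand the gallery links to their own callback instead of checking them manually; a rough sketch (the patterns are illustrative, not tested against reddit's markup):

rules = [
    # Follow the pagination links and scrape each listing page.
    Rule(
        LinkExtractor(allow=['/r/pics/\?count=\d.*&after=\w.*']),
        callback='parse_item',
        follow=True
    ),
    # Send imgur gallery links straight to their own callback.
    Rule(
        LinkExtractor(allow=['imgur\.com/gallery/\w+']),
        callback='parse_imgur_gallery'
    ),
]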
Problem: Scrapy keeps visiting a single URL and keeps scraping it recursively. I have checked response.url to make sure it is a single page that it keeps scraping, and there is no query string involved that might render the same page under different URLs.
What I have done to resolve it:
Under Scrapy/spider.py I noticed that dont_filter was set to True and changed it to False, but it didn't help.
I also set unique = True in the code, but this didn't help either.
Additional information
The page given as the start URL has only one link, to a page a.html. Scrapy keeps scraping a.html again and again.
Code
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from kt.items import DmozItem

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["datacaredubai.com"]
    start_urls = ["http://www.datacaredubai.com/aj/link.html"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('/aj'), unique=('Yes')), callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        sites = sel.xpath('//*')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('/html/head/meta[3]').extract()
            item['req_url'] = response.url
            items.append(item)
        return items
Scrapy, by default, appends to the output file if it already exists. What you see in output.csv is the result of multiple spider runs. Remove output.csv before running the spider again.
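For example (and note that newer Scrapy releases also accept an uppercase -O instead of -o to overwrite rather than append):

rm output.csv
scrapy crawl dmoz -o output.csv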
I am trying to collect all the URLs under a domain using Scrapy. I was trying to use CrawlSpider to start from the homepage and crawl the whole site. For each page, I want to use XPath to extract all the hrefs and store the data as a key-value pair.
Key: the current URL
Value: all the links on this page.
class MySpider(CrawlSpider):
    name = 'abc.com'
    allowed_domains = ['abc.com']
    start_urls = ['http://www.abc.com']

    rules = (Rule(SgmlLinkExtractor()), )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = AbcItem()
        item['key'] = response.url
        item['value'] = hxs.select('//a/@href').extract()
        return item
I defined my AbcItem as below:
from scrapy.item import Item, Field

class AbcItem(Item):
    # key: url
    # value: list of links existing in the key url
    key = Field()
    value = Field()
And when I run my code like this:
nohup scrapy crawl abc.com -o output -t csv &
The robot seems to have begun crawling and I can see the nohup.out file being populated with all the configuration logs, but there is no information in my output file, which is what I am trying to collect. Can anyone help me with this? What might be wrong with my robot?
You should have defined a callback for your rule. Here's an example that gets all the links from the twitter.com main page (follow=False):
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class MySpider(CrawlSpider):
    name = 'twitter.com'
    allowed_domains = ['twitter.com']
    start_urls = ['http://www.twitter.com']

    rules = (Rule(SgmlLinkExtractor(), callback='parse_url', follow=False), )

    def parse_url(self, response):
        item = MyItem()
        item['url'] = response.url
        return item
Then, in the output file, I see:
http://status.twitter.com/
https://twitter.com/
http://support.twitter.com/forums/26810/entries/78525
http://support.twitter.com/articles/14226-how-to-find-your-twitter-short-code-or-long-code
...
Hope that helps.
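If you also want the list of links on each page, as in your AbcItem, the callback can be extended along these lines (a sketch reusing the HtmlXPathSelector and item class from your own snippet):

def parse_url(self, response):
    hxs = HtmlXPathSelector(response)
    item = AbcItem()
    item['key'] = response.url                         # the page being scraped
    item['value'] = hxs.select('//a/@href').extract()  # all hrefs found on that page
    return item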
If you don't set the callback explicitly, Scrapy uses the spider's parse method to process crawled pages, so you should add parse_item as the callback for your rule. (With CrawlSpider, avoid renaming parse_item to parse: CrawlSpider uses parse internally for its own logic.)
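Applied to the spider above, that is a one-line change; a sketch of just the rules attribute:

# Setting a callback flips follow's default to False, so pass follow=True
# explicitly if you still want the spider to keep crawling the whole site.
rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True), )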
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc
This is my code. I want to scrape plenty of URLs using a loop, so how am I supposed to do that? I did put multiple URLs in there, but I didn't get output from all of them; some URLs stop responding. So how can I reliably get the data with this code?
Your code looks OK, but are you sure that start_urls shouldn't start with http://?
start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
UPD
start_urls is the list of URLs Scrapy starts with. Usually it has one or two links, rarely more.
These pages must have an identical HTML structure, because the Scrapy spider processes them all the same way.
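If you really do have plenty of similar URLs, you can build the list programmatically instead of typing them out; a sketch (the category names here are made up):

categories = ["Books", "Resources", "Modules"]  # hypothetical examples
start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/%s/" % c
    for c in categories
]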
See, if I put 4-5 URLs in start_urls, it gives output OK for the first 2-3 URLs.
I don't believe this, because Scrapy doesn't care how many links are in the start_urls list.
But it stops responding; also, tell me how I can implement a GUI for this?
Scrapy has a debug shell for testing your code.
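For instance, open one of your start URLs in the shell and try the selectors interactively before running the whole spider:

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

Then, inside the shell, run the same selector your spider uses, e.g. hxs.select('//ul/li/a/text()').extract() (older Scrapy versions expose hxs in the shell; newer ones expose response and sel instead).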
You just posted the code from the tutorial. What you should do is actually read the whole documentation, especially the basic concepts part. What you basically want is the CrawlSpider, where you can define rules that the spider will follow and process pages with your given code.
To quote the docs with the example:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

        hxs = HtmlXPathSelector(response)
        item = Item()
        item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
        item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
        return item
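Note that for this quoted example to run as-is, the id, name and description fields have to be declared on an Item subclass, since a bare Item rejects fields it doesn't know about; a minimal sketch (the class name is arbitrary):

from scrapy.item import Item, Field

class ExampleItem(Item):
    id = Field()
    name = Field()
    description = Field()

and then use item = ExampleItem() inside parse_item.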