Scrapy: save every link of a whole domain - Python

Introduction
Currently I'm working on a crawler which saves every link of a domain to a .csv file.
Problem
In my console I can see which links it is following, but my items are still empty. I get something like:
Here is my default code:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from ..items import LinkextractorItem


class TopArtSpider(CrawlSpider):
    name = "topart"
    start_urls = [
        'https://www.topart-online.com/de/Bambus-Kunstbaeume/l-KAT11'
    ]
    custom_settings = {'FEED_EXPORT_FIELDS': ['Link']}

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        items = LinkextractorItem()
        link = response.xpath('a/@href')
        items['Link'] = link
        yield items
My start_url is just one category of the domain, because I don't want to wait too long while I'm still trying to build a correct spider.

The XPath selector isn't searching the entire DOM. Change it to this:
link = response.xpath('//a/@href')
The // searches the entire DOM.
You also aren't actually grabbing the data, so you need to call getall(), which gives you a list. Alternatively, loop over each link with a for loop, which is probably the approach you want:
link = response.xpath('//a/@href')
for a in link:
    items['Link'] = a.get()
    yield items
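Putting the pieces together, a minimal corrected spider could look like the sketch below. It assumes the LinkextractorItem from the question declares a Link field, and it creates a fresh item per link so each CSV row gets one URL:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from ..items import LinkextractorItem


class TopArtSpider(CrawlSpider):
    name = "topart"
    start_urls = [
        'https://www.topart-online.com/de/Bambus-Kunstbaeume/l-KAT11'
    ]
    custom_settings = {'FEED_EXPORT_FIELDS': ['Link']}

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # // searches the whole DOM; yield one item per extracted href
        for href in response.xpath('//a/@href').getall():
            item = LinkextractorItem()
            item['Link'] = href
            yield item

Running scrapy crawl topart -o links.csv should then produce a populated Link column.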

Related

Crawling through Web-pages that have categories

I'm trying to scrape a website that has an uncommon page structure: page upon page upon page until I get to the item I'm trying to extract data from.
Edit: thanks to the answers, I have been able to extract most of the data I require; however, I still need the path links to get to the product in question.
Here's the code I have so far:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'drapertools.com'
    start_urls = ['https://www.drapertools.com/category/0/Product%20Range']

    rules = (
        Rule(LinkExtractor(allow=['/category-?.*?/'])),
        Rule(LinkExtractor(allow=['/product/']), callback='parse_product'),
    )

    def parse_product(self, response):
        yield {
            'product_name': response.xpath('//div[@id="product-title"]//h1[@class="text-primary"]/text()').extract_first(),
            'product_number': response.xpath('//div[@id="product-title"]//h1[@style="margin-bottom: 20px; color:#000000; font-size: 23px;"]/text()').extract_first(),
            'product_price': response.xpath('//div[@id="product-title"]//p/text()').extract_first(),
            'product_desc': response.xpath('//div[@class="col-md-6 col-sm-6 col-xs-12 pull-left"]//div[@class="col-md-11 col-sm-11 col-xs-11"]//p/text()').extract_first(),
            'product_path': response.xpath('//div[@class="nav-container"]//ol[@class="breadcrumb"]//li//a/text()').extract(),
            'product_path_links': response.xpath('//div[@class="nav-container"]//ol[@class="breadcrumb"]//li//a/href()').extract(),
        }
I don't know if this would work or anything; can anyone please help me here? I would greatly appreciate it.
More info:
I'm trying to access all categories and all items within them; however, there are categories within categories, and even more levels before I can get to the item.
I'm thinking of using Guillaume's LinkExtractor code, but I'm not sure it is meant to be used for the outcome I want...
rules = (
    Rule(LinkExtractor(allow=['/category-?.*?/'])),
    Rule(LinkExtractor(allow=['/product/']), callback='parse_product'),
)
Why not use a CrawlSpider instead! It's perfect for this use case!
It basically gets all the links for every page recursively, and calls a callback only for the interesting ones (I'm assuming that you are interested in the products).
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'drapertools.com'
    start_urls = ['https://www.drapertools.com/category/0/Product%20Range']

    rules = (
        Rule(LinkExtractor(allow=['/category-?.*?/'])),
        Rule(LinkExtractor(allow=['/product/']), callback='parse_product'),
    )

    def parse_product(self, response):
        yield {
            'product_name': response.xpath('//div[@id="product-title"]//h1[@class="text-primary"]/text()').extract_first(),
        }
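If you also need the breadcrumb links the question asks about, the usual XPath form is @href rather than /href(); an untested one-liner that could go in the same dict:

            'product_path_links': response.xpath('//ol[@class="breadcrumb"]//li//a/@href').extract(),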
You have the same structure for all pages, maybe you can shorten it?
import scrapy


class DraperToolsSpider(scrapy.Spider):
    name = 'drapertools_spider'
    start_urls = ["https://www.drapertools.com/category/0/Product%20Range"]

    def parse(self, response):
        # this will call self.parse by default for all your categories
        for url in response.css('.category p a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(url))

        # here you can add some "if" if you want to catch details only on certain pages
        for req in self.parse_details(response):
            yield req

    def parse_details(self, response):
        yield {}
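The empty dict in parse_details is only a placeholder; a sketch of what it could yield, reusing selectors from the question (assumed, not verified against the live site):

    def parse_details(self, response):
        yield {
            'product_name': response.xpath('//div[@id="product-title"]//h1[@class="text-primary"]/text()').extract_first(),
            'product_path': response.xpath('//ol[@class="breadcrumb"]//li//a/text()').extract(),
        }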

Using python scrapy to extract links from a webpage

I am a beginner with Python and am using Scrapy to extract links from the following webpage:
http://www.basketball-reference.com/leagues/NBA_2015_games.html.
The code that I have written is:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from basketball.items import BasketballItem


class BasketballSpider(CrawlSpider):
    name = 'basketball'
    allowed_domains = ['basketball-reference.com/']
    start_urls = ['http://www.basketball-reference.com/leagues/NBA_2015_games.html']

    rules = [Rule(LinkExtractor(allow=['http://www.basketball-reference.com/boxscores/^\w+$']), 'parse_item')]

    def parse_item(self, response):
        item = BasketballItem()
        item['url'] = response.url
        return item
I run this code through the command prompt, but the file created does not have any links. Could someone please help?
It cannot find the links; fix your regular expression in the rule:
rules = [
    Rule(LinkExtractor(allow='boxscores/\w+'))
]
Also, you don't have to set the callback when it is called parse_item - it is a default.
And allow can be set as a string also.
rules = [
    Rule(LinkExtractor(allow='boxscores/\w+'), callback='parse_item')
]
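For completeness, the whole spider with the fixed rule might look like the sketch below. It keeps the question's BasketballItem and import paths; the trailing slash is dropped from allowed_domains because that setting expects bare domains, not URLs:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from basketball.items import BasketballItem


class BasketballSpider(CrawlSpider):
    name = 'basketball'
    allowed_domains = ['basketball-reference.com']  # domain only, no trailing slash
    start_urls = ['http://www.basketball-reference.com/leagues/NBA_2015_games.html']

    rules = [
        Rule(LinkExtractor(allow=r'boxscores/\w+'), callback='parse_item'),
    ]

    def parse_item(self, response):
        # one item per box-score URL found by the rule
        item = BasketballItem()
        item['url'] = response.url
        return item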

make scrapy move to next page recursively

I'm trying to scrape this page using Scrapy. I can successfully scrape the data on the page, but I want to be able to scrape data from the other pages too (the ones that say "next"). Here's the relevant part of my code:
def parse(self, response):
    item = TimemagItem()
    item['title'] = response.xpath('//div[@class="text"]').extract()
    links = response.xpath('//h3/a').extract()
    crawledLinks = []
    linkPattern = re.compile("^(?:ftp|http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$")

    for link in links:
        if linkPattern.match(link) and not link in crawledLinks:
            crawledLinks.append(link)
            yield Request(link, self.parse)

    yield item
I'm getting the right information (the titles from the linked pages), but it simply isn't navigating to the next pages. How do I tell Scrapy to navigate?
Take a look at the Scrapy Link Extractors documentation. They are the correct way to tell your spider to follow the links on the page.
Looking at the page you want to crawl, I believe you should do it with two extractor rules. Here is an example of a simple spider with rules that fit the needs of your TIME web page:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class TIMESpider(CrawlSpider):
    name = "time_spider"
    allowed_domains = ["time.com"]
    start_urls = [
        'http://search.time.com/results.html?N=45&Ns=p_date_range|1&Ntt=&Nf=p_date_range%7cBTWN+19500101+19500130'
    ]

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="tout"]/h3/a',)),
             callback='parse'),
        Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@title="Next"]',)),
             follow=True),
    )

    def parse(self, response):
        item = TimemagItem()
        item['title'] = response.xpath('.//title/text()').extract()
        return item
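Note that SgmlLinkExtractor and the scrapy.contrib paths were later deprecated and removed; with a current Scrapy release the same idea can be written with the generic LinkExtractor. In the sketch below the callback is also renamed, because CrawlSpider uses parse() internally and overriding it can break rule processing; the selectors are taken from the answer above and assumed to still match the page:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TIMESpider(CrawlSpider):
    name = "time_spider"
    allowed_domains = ["time.com"]
    start_urls = [
        'http://search.time.com/results.html?N=45&Ns=p_date_range|1&Ntt=&Nf=p_date_range%7cBTWN+19500101+19500130'
    ]

    rules = (
        # scrape each article found in the result list
        Rule(LinkExtractor(restrict_xpaths='//div[@class="tout"]/h3/a'), callback='parse_article'),
        # keep following the "Next" pagination link
        Rule(LinkExtractor(restrict_xpaths='//a[@title="Next"]'), follow=True),
    )

    def parse_article(self, response):
        yield {'title': response.xpath('//title/text()').get()}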

scrapy crawlspider output

I'm having an issue running through the CrawlSpider example in the Scrapy documentation. It seems to be crawling just fine, but I'm having trouble getting it to output to a CSV file (or anything, really).
So, my question is can I use this:
scrapy crawl dmoz -o items.csv
or do I have to create an Item Pipeline?
UPDATED, now with code!:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from targets.item import TargetsItem


class MySpider(CrawlSpider):
    name = 'abc'
    allowed_domains = ['ididntuseexample.com']
    start_urls = ['http://www.ididntuseexample.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('ididntuseexample.com', ))),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        item = TargetsItem()
        item['title'] = response.xpath('//h2/a/text()').extract()  # this pulled down data in scrapy shell
        item['link'] = response.xpath('//h2/a/@href').extract()  # this pulled down data in scrapy shell
        return item
Rules are the mechanism CrawlSpider uses for following links. Those links are defined with a LinkExtractor. This element basically indicates which links to extract from the crawled page (like the ones defined in the start_urls list) to be followed. Then you can pass a callback that will be called on each extracted link, or, more precisely, on the pages downloaded by following those links.
Your rule must call parse_item. So, replace:
Rule(LinkExtractor(allow=('ididntuseexample.com', ))),
with:
Rule(LinkExtractor(allow=('ididntuseexample.com',)), callback='parse_item'),
This rule says that you want to call parse_item on every link whose href matches ididntuseexample.com. I suspect that what you want as the link extractor is not the domain, but the links you want to follow/scrape.
Here is a basic example that crawls Hacker News to retrieve the title and the first words of the first comment for every story on the main page.
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class HackerNewsItem(scrapy.Item):
    title = scrapy.Field()
    comment = scrapy.Field()


class HackerNewsSpider(CrawlSpider):
    name = 'hackernews'
    allowed_domains = ['news.ycombinator.com']
    start_urls = [
        'https://news.ycombinator.com/'
    ]

    rules = (
        # Follow any item link and call parse_item.
        Rule(LinkExtractor(allow=('item.*', )), callback='parse_item'),
    )

    def parse_item(self, response):
        item = HackerNewsItem()
        # Get the title
        item['title'] = response.xpath('//*[contains(@class, "title")]/a/text()').extract()
        # Get the first words of the first comment
        item['comment'] = response.xpath('(//*[contains(@class, "comment")])[1]/font/text()').extract()
        return item
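To answer the export part of the original question: no item pipeline is needed here. Once a rule points at a callback that actually returns or yields items, Scrapy's built-in feed export writes them directly from the command line, for example:

scrapy crawl hackernews -o items.csv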

scrapy didn't crawl all link

I want to extract data from http://community.sellfree.co.kr/. Scrapy is working, but it appears to only scrape the start_urls and doesn't crawl any links.
I would like the spider to crawl the entire site.
The following is my code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from metacritic.items import MetacriticItem


class MetacriticSpider(BaseSpider):
    name = "metacritic"  # Name of the spider, to be used when crawling
    allowed_domains = ["sellfree.co.kr"]  # Where the spider is allowed to go
    start_urls = [
        "http://community.sellfree.co.kr/"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('.*',)), callback="parse", follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)  # The XPath selector
        sites = hxs.select('/html/body')
        items = []
        for site in sites:
            item = MetacriticItem()
            item['title'] = site.select('//a[@title]').extract()
            items.append(item)
        return items
There are two kinds of links on the page. One looks like onclick="location='../bbs/board.php?bo_table=maket_5_3'" and the other is <span class="list2">solution</span>.
How can I get the crawler to follow both kinds of links?
Before I get started, I'd highly recommend using an updated version of Scrapy. It appears you're still using an old one, as many of the methods/classes you're using have been moved around or deprecated.
To the problem at hand: the scrapy.spiders.BaseSpider class will not do anything with the rules you specify. Instead, use the scrapy.contrib.spiders.CrawlSpider class, which has the functionality to handle rules built in.
Next, you'll need to switch your parse() method to a new name, since the CrawlSpider uses parse() internally to work. (We'll assume parse_page() for the rest of this answer.)
To pick up all basic links and have them crawled, your link extractor will need to be changed. By default, you shouldn't use regular-expression syntax for the domains you want to follow. The following will pick everything up, and your DUPEFILTER will filter out links not on the site:
rules = (
    Rule(SgmlLinkExtractor(allow=('')), callback="parse_page", follow=True),
)
As for the onclick=... links, these are JavaScript links, and the page you are trying to process relies on them heavily. Scrapy cannot crawl things like onclick=location.href="javascript:showLayer_tap('2')" or onclick="win_open('./bbs/profile.php?mb_id=wlsdydahs')", because it can't execute showLayer_tap() or win_open() in JavaScript.
You can, however, write your own functions for parsing these (the following is untested, but should work and gives the basic idea of what you need to do). For instance, the following can handle onclick=location.href="./photo/":
def process_onclick(value):
    m = re.search("location.href=\"(.*?)\"", value)
    if m:
        return m.group(1)
Then add the following rule (this only handles tables, expand it as needed):
Rule(SgmlLinkExtractor(allow=(''), tags=('table',),
                       attrs=('onclick',), process_value=process_onclick),
     callback="parse_page", follow=True),
