I'm currently trying to use Scrapy to go through the Elite Dangerous subreddit and collect post titles, URLs, and vote counts. I got the first two fine, but I'm unsure how to write an XPath expression to access the votes.
selector.xpath('//div[@class="score unvoted"]').extract() works, but it returns the vote counts for all posts on the current page instead of one per individual post. response.css('div.score.unvoted').extract() works for each individual post, but returns [u'<div class="score unvoted">1</div>'] instead of just 1. (I would also really like to know how to do this with XPath! :) )
Code is as follows:
class redditSpider(CrawlSpider):  # http://doc.scrapy.org/en/1.0/topics/spiders.html#scrapy.spiders.CrawlSpider
    name = "reddits"
    allowed_domains = ["reddit.com"]
    start_urls = [
        "https://www.reddit.com/r/elitedangerous",
    ]

    rules = [
        Rule(LinkExtractor(
            allow=['/r/EliteDangerous/\?count=\d*&after=\w*']),  # Looks for next page with RE
            callback='parse_item',  # What do I do with this? --- pass to self.parse_item
            follow=True),  # Tells spider to continue after callback
    ]

    def parse_item(self, response):
        selector_list = response.css('div.thing')  # Each individual little "box" with content
        for selector in selector_list:
            item = RedditItem()
            item['title'] = selector.xpath('div/p/a/text()').extract()
            item['url'] = selector.xpath('a/@href').extract()
            # item['votes'] = selector.xpath('//div[@class="score unvoted"]')
            item['votes'] = selector.css('div.score.unvoted').extract()
            yield item
You are on the right track. The first approach just needs two things:
a dot at the beginning to make it context-specific
text() at the end
Fixed version:
selector.xpath('.//div[@class="score unvoted"]/text()').extract()
And, FYI, you can make the second option work too by using the ::text pseudo-element:
response.css('div.score.unvoted::text').extract()
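Put back into the question's loop, a minimal sketch could look like this (field names are taken from the question's RedditItem; the int() conversion is only an illustration and assumes the score text is always numeric):

def parse_item(self, response):
    for selector in response.css('div.thing'):
        item = RedditItem()
        item['title'] = selector.xpath('div/p/a/text()').extract()
        item['url'] = selector.xpath('a/@href').extract()
        # The leading dot keeps the search inside this post's <div class="thing">
        votes = selector.xpath('.//div[@class="score unvoted"]/text()').extract()
        # Equivalent CSS version: selector.css('div.score.unvoted::text').extract()
        item['votes'] = int(votes[0]) if votes else None  # assumes a numeric score
        yield item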
This should work too:
selector.xpath('.//div[contains(@class, "score unvoted")]/text()').extract()
Hello, I'm trying to build a crawler using Scrapy.
My crawler code is:
import scrapy
from shop.items import ShopItem

class ShopspiderSpider(scrapy.Spider):
    name = 'shopspider'
    allowed_domains = ['www.organics.com']
    start_urls = ['https://www.organics.com/product-tag/special-offers/']

    def parse(self, response):
        items = ShopItem()
        title = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/h3').extract()
        sale_price = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/span[2]/del/span').extract()
        product_original_price = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/span[2]/ins/span').extract()
        category = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/span[2]/ins/span').extract()
        items['product_name'] = ''.join(title).strip()
        items['product_sale_price'] = ''.join(sale_price).strip()
        items['product_original_price'] = ''.join(product_original_price).strip()
        items['product_category'] = ','.join(map(lambda x: x.strip(), category)).strip()
        yield items
But when I run the command scrapy crawl shopspider -o info.csv to see the output, I only get the information about the first product, not all the products on the page.
So I removed the numbers between [ ] in the XPaths, for example the XPath of the title: //*[@id="content"]/div/div/ul/li/a/h3
but I still get the same result.
The result is: <span class="amount">£40.00</span>,<h3>Halo Skincare Organic Gift Set</h3>,"<span class=""amount"">£40.00</span>","<span class=""amount"">£58.00</span>"
Kindly help, please.
If you remove the indexes on your XPaths, they will find all the items on the page:
response.xpath('//*[@id="content"]/div/div/ul/li/a/h3').extract() # Returns 7 items
However, note that this returns a list of strings of the selected HTML elements. You should add /text() to the XPath if you want the text inside the element (which it looks like you do).
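For illustration, using the product title from the question (the exact output depends on the page):

response.xpath('//*[@id="content"]/div/div/ul/li/a/h3').extract()         # ['<h3>Halo Skincare Organic Gift Set</h3>', ...]
response.xpath('//*[@id="content"]/div/div/ul/li/a/h3/text()').extract()  # ['Halo Skincare Organic Gift Set', ...]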
Also, the reason you only get one result is that you are concatenating all the items into a single string when assigning them to the item:
items['product_name'] = ''.join(title).strip()
Here title is a list of elements and you concatenate them all into a single string. The same logic applies to the other variables.
If that's really what you want, you can disregard the following, but I believe a better approach would be to loop over the products and yield each item separately.
My suggestion would be:
def parse(self, response):
    products = response.xpath('//*[@id="content"]/div/div/ul/li')
    for product in products:
        items = ShopItem()
        items['product_name'] = product.xpath('a/h3/text()').get()
        items['product_sale_price'] = product.xpath('a/span/del/span/text()').get()
        items['product_original_price'] = product.xpath('a/span/ins/span/text()').get()
        items['product_category'] = product.xpath('a/span/ins/span/text()').get()
        yield items
Notice that in your original code the category variable has the same XPath as product_original_price; I kept that logic in the code, but it's probably a mistake.
I am trying to write some code to scrape the website of a UK housebuilder and record a list of houses for sale.
I am starting on the page http://www.persimmonhomes.com/sitemap and I have written one part of the code to list all the URLs of the housebuilder's developments, and a second part of the code to scrape prices etc. from each of those URLs.
I know the second part works and I know that the first part lists out all the URLs. But for some reason the URLs listed by the first part don't seem to trigger the second part of the code to scrape from them.
The code of this first part is:
def parse(self, response):
    for href in response.xpath('//*[@class="contacts-item"]/ul/li/a/@href'):
        url = urlparse.urljoin('http://www.persimmonhomes.com/', href.extract())
        yield scrapy.Request(url, callback=self.parse_dir_contents)
Now, I know this lists the URLs I want (if I put in the line print url they all get listed), and I could manually add them to the code to run the second part if I wanted to. However, even though the URLs are created, they do not seem to let the second part of the code scrape from them.
The entire code is below:
import scrapy
import urlparse
from Persimmon.items import PersimmonItem

class persimmonSpider(scrapy.Spider):
    name = "persimmon"
    allowed_domains = ["http://www.persimmonhomes.com/"]
    start_urls = [
        "http://www.persimmonhomes.com/sitemap",
    ]

    def parse(self, response):
        for href in response.xpath('//*[@class="contacts-item"]/ul/li/a/@href'):
            url = urlparse.urljoin('http://www.persimmonhomes.com/', href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//*[@id="aspnetForm"]/div[4]'):
            item = PersimmonItem()
            item['name'] = sel.xpath('//*[@id="aspnetForm"]/div[4]/div[1]/div[1]/div/div[2]/span/text()').extract()
            item['address'] = sel.xpath('//*[@id="XplodePage_ctl12_dsDetailsSnippet_pDetailsContainer"]/div/*[@itemprop="postalCode"]/text()').extract()
            plotnames = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/text()').extract()
            plotnames = [plotname.strip() for plotname in plotnames]
            plotids = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/@href').extract()
            plotids = [plotid.strip() for plotid in plotids]
            plotprices = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__price"]/text()').extract()
            plotprices = [plotprice.strip() for plotprice in plotprices]
            result = zip(plotnames, plotids, plotprices)
            for plotname, plotid, plotprice in result:
                item['plotname'] = plotname
                item['plotid'] = plotid
                item['plotprice'] = plotprice
                yield item
Any views as to why the first part of the code creates the URLs but the second part does not loop through them?
You just need to fix your allowed_domains property:
allowed_domains = ["persimmonhomes.com"]
(tested - worked for me).
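For context: allowed_domains should hold bare domain names, and the usual reason this exact pattern fails is that Scrapy's offsite filtering drops every request whose host doesn't match an entry there, so the Requests yielded from parse never reach parse_dir_contents. A sketch of the corrected header (the rest of the spider stays as in the question):

class persimmonSpider(scrapy.Spider):
    name = "persimmon"
    # Domain only (no scheme, no trailing slash), so the yielded Requests
    # are not filtered out as offsite
    allowed_domains = ["persimmonhomes.com"]
    start_urls = [
        "http://www.persimmonhomes.com/sitemap",
    ]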
I'll start with the scrapy code I'm trying to use to iterate through a collection of vehicles and extract the model and price:
def parse(self, response):
    hxs = Selector(response)
    split_url = response.url.split("/")
    listings = hxs.xpath("//div[contains(@class,'listing-item')]")
    for vehicle in listings:
        item = Vehicle()
        item['make'] = split_url[5]
        item['price'] = vehicle.xpath("//div[contains(@class,'price')]/text()").extract()
        item['description'] = vehicle.xpath("//div[contains(@class,'title-module')]/h2/a/text()").extract()
        yield item
I was expecting that to loop through the listings and return the price only for the single vehicle being parsed, but it is actually adding an array of all prices on the page to each vehicle item.
I assume the problem is in my XPath selectors - is "//div[contains(@class,'price')]/text()" somehow allowing the parser to look at divs outside the single vehicle that should be getting parsed each time?
For reference, if I do listings[1] it returns only 1 listing, hence the loop should be working.
Edit: I added the line print vehicle.extract() above, and confirmed that vehicle is definitely only a single item (and it changes each time the loop iterates). How is the XPath selector applied to vehicle able to escape the vehicle object and return all prices?
I was having the same problem. I consulted the documentation you referred to. I'm providing the modified code here so that it will be helpful to beginners like me. Note the use of '.' at the start of the XPath: .//div[contains(@class,'title-module')]/h2/a/text()
def parse(self, response):
    hxs = Selector(response)
    split_url = response.url.split("/")
    listings = hxs.xpath("//div[contains(@class,'listing-item')]")
    for vehicle in listings:
        item = Vehicle()
        item['make'] = split_url[5]
        item['price'] = vehicle.xpath(".//div[contains(@class,'price')]/text()").extract()
        item['description'] = vehicle.xpath(".//div[contains(@class,'title-module')]/h2/a/text()").extract()
        yield item
I was able to solve the problem with the aid of the manual, here. In summary, the XPath was indeed escaping the iteration because I neglected to put a period in front of the //, which meant it was jumping back to the root node every time.
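In other words (illustrative only, reusing the selectors from the code above):

# Absolute path: searches the whole document, regardless of which node it is called on
vehicle.xpath("//div[contains(@class,'price')]/text()").extract()   # every price on the page
# Relative path: the leading dot restricts the search to the current vehicle node
vehicle.xpath(".//div[contains(@class,'price')]/text()").extract()  # this vehicle's price only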
I'm writing a spider (CrawlSpider) for an online store. According to the client's requirements, I need to write two rules: one for determining which pages have items and another for extracting the items.
I have both rules already working independently:
If my start_urls = ["www.example.com/books.php", "www.example.com/movies.php"] and I comment out the Rule and the code of parse_category, my parse_item will extract every item.
On the other hand, if start_urls = "http://www.example.com" and I comment out the Rule and the code of parse_item, parse_category will return every link in which there are items to extract, i.e. parse_category will return www.example.com/books.php and www.example.com/movies.php.
My problem is that I don't know how to merge both pieces, so that start_urls = "http://www.example.com", parse_category extracts www.example.com/books.php and www.example.com/movies.php, and those links are fed to parse_item, where I actually extract the info for each item.
I need to do it this way instead of just using start_urls = ["www.example.com/books.php", "www.example.com/movies.php"], because if a new category is added in the future (e.g. www.example.com/music.php), the spider wouldn't be able to detect that new category automatically and would have to be edited manually. Not a big deal, but the client doesn't want this.
class StoreSpider(CrawlSpider):
    name = "storyder"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    # start_urls = ["http://www.example.com/books.php", "http://www.example.com/movies.php"]

    rules = (
        Rule(LinkExtractor(), follow=True, callback='parse_category'),
        Rule(LinkExtractor(), follow=False, callback="parse_item"),
    )

    def parse_category(self, response):
        category = StoreCategory()
        # some code for determining whether the current page is a category, or just another stuff
        if is a category:
            category['name'] = name
            category['url'] = response.url
        return category

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item
The CrawlSpider rules don't work the way you want here; you'll need to implement the logic yourself. When more than one rule matches the same link, only the first one is applied, so with two bare LinkExtractor() rules every link is handled by the first rule and parse_item never runs; check the documentation.
You could try something like this:
class StoreSpider(CrawlSpider):
    name = "storyder"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    # no rules

    def parse(self, response):  # this is parse_category
        category_le = LinkExtractor("something for categories")
        for a in category_le.extract_links(response):
            yield Request(a.url, callback=self.parse_category)
        item_le = LinkExtractor("something for items")
        for a in item_le.extract_links(response):
            yield Request(a.url, callback=self.parse_item)

    def parse_category(self, response):
        category = StoreCategory()
        # some code for determining whether the current page is a category, or just another stuff
        if is a category:
            category['name'] = name
            category['url'] = response.url
        yield category
        for req in self.parse(response):
            yield req

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item
Instead of using a parse_category, I used restrict_css in the LinkExtractor to get the links I want, and it seems to feed the second Rule with the extracted links, so my question is answered. It ended up this way:
class StoreSpider(CrawlSpider):
    name = "storyder"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    rules = (
        Rule(LinkExtractor(restrict_css=("#movies", "#books"))),
        Rule(LinkExtractor(), callback="parse_item"),
    )

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item
It still can't detect newly added categories (and there is no clear pattern to use in restrict_css without fetching other garbage), but at least it complies with the client's requirements: two rules, one for extracting category links and another for extracting item data.
I have made a spider that can take the information from this page and follow "Next page" links. Right now, though, the spider only takes the information that I'm showing in the following structure.
The structure of the page is something like this:
Title 1
URL 1 ---------> If you click you go to one page with more information
Location 1
Title 2
URL 2 ---------> If you click you go to one page with more information
Location 2
Next page
What I want is for the spider to go into each URL link and get the full information. I suppose I must add another rule that specifies that I want to do something like this.
The behaviour of the spider should be:
Go to URL1 (get info)
Go to URL2 (get info)
...
Next page
But I don't know how to implement it. Can someone guide me?
Code of my Spider:
class BcnSpider(CrawlSpider):
    name = 'bcn'
    allowed_domains = ['guia.bcn.cat']
    start_urls = ['http://guia.bcn.cat/index.php?pg=search&q=*:*']

    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=(re.escape("index.php")),
                restrict_xpaths=("//div[@class='paginador']")),
            callback="parse_item",
            follow=True),
    )

    def parse_item(self, response):
        self.log("parse_item")
        sel = Selector(response)
        sites = sel.xpath("//div[@id='llista-resultats']/div")
        items = []
        cont = 0
        for site in sites:
            item = BcnItem()
            item['id'] = cont
            item['title'] = u''.join(site.xpath('h3/a/text()').extract())
            item['url'] = u''.join(site.xpath('h3/a/@href').extract())
            item['when'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[1]/text()').extract())
            item['where'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[2]/span/a/text()').extract())
            item['street'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[3]/span/text()').extract())
            item['phone'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[4]/text()').extract())
            items.append(item)
            cont = cont + 1
        return items
EDIT: After searching on the internet I found code with which I can do this.
First of all, I have to get all the links, then I have to call another parse method.
def parse(self, response):
    # Get all URL's
    yield Request(url=_url, callback=self.parse_details)

def parse_details(self, response):
    # Detailed information of each page
If you want to use Rules because the page has a paginator, you should rename def parse to def parse_start_url and then call this method through the Rule. With this change you make sure that parsing begins at parse_start_url, and the code would be something like this:
rules = (
    Rule(
        SgmlLinkExtractor(
            allow=(re.escape("index.php")),
            restrict_xpaths=("//div[@class='paginador']")),
        callback="parse_start_url",
        follow=True),
)

def parse_start_url(self, response):
    # Get all URL's
    yield Request(url=_url, callback=self.parse_details)

def parse_details(self, response):
    # Detailed information of each page
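For concreteness, here is one way those two stubs might be filled in, reusing the result-list XPath from the original spider (the selector for the detail links and the fields filled in parse_details are assumptions):

def parse_start_url(self, response):
    # Collect the detail-page links from the result list and request each one
    for href in response.xpath("//div[@id='llista-resultats']/div/h3/a/@href").extract():
        yield Request(url=href, callback=self.parse_details)  # assumes the hrefs are absolute

def parse_details(self, response):
    # Detailed information of each page
    item = BcnItem()
    item['url'] = response.url
    # ...fill in the remaining fields with selectors for the detail page...
    yield item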
That's all folks.
There is an easier way of achieving this. Click Next on your page and read the new URL carefully:
http://guia.bcn.cat/index.php?pg=search&from=10&q=*:*&nr=10
By looking at the GET parameters in the URL (everything after the question mark), and with a bit of testing, we find that they mean:
from=10 - Starting index
q=*:* - Search query
nr=10 - Number of items to display
This is how I would've done it:
Set nr=100 or higher. (1000 may do as well, just be sure that there is no timeout)
Loop from from=0 to 34300. This is above the number of entries currently. You may want to extract this value first.
Example code:
entries = 34246
step = 100
stop = entries - entries % step + step

for x in xrange(0, stop, step):
    url = 'http://guia.bcn.cat/index.php?pg=search&from={}&q=*:*&nr={}'.format(x, step)
    # Loop over all entries, and open links if needed
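A possible completion of that loop, assuming it runs inside a spider callback (parse_page is a hypothetical callback that would extract the entries and follow the detail links):

for x in xrange(0, stop, step):
    url = 'http://guia.bcn.cat/index.php?pg=search&from={}&q=*:*&nr={}'.format(x, step)
    # Request each page of results; parse_page would loop over the entries
    # and open the detail links if needed
    yield Request(url, callback=self.parse_page)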