Hello, I'm trying to build a crawler using Scrapy.
My crawler code is:
import scrapy
from shop.items import ShopItem

class ShopspiderSpider(scrapy.Spider):
    name = 'shopspider'
    allowed_domains = ['www.organics.com']
    start_urls = ['https://www.organics.com/product-tag/special-offers/']

    def parse(self, response):
        items = ShopItem()
        title = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/h3').extract()
        sale_price = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/span[2]/del/span').extract()
        product_original_price = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/span[2]/ins/span').extract()
        category = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/span[2]/ins/span').extract()
        items['product_name'] = ''.join(title).strip()
        items['product_sale_price'] = ''.join(sale_price).strip()
        items['product_original_price'] = ''.join(product_original_price).strip()
        items['product_category'] = ','.join(map(lambda x: x.strip(), category)).strip()
        yield items
But when I run the command scrapy crawl shopspider -o info.csv to see the output, I can find just the information about the first product, not all the products on the page.
So I removed the numbers between [ ] in the XPaths, for example the XPath of the title: //*[@id="content"]/div/div/ul/li/a/h3
But I still get the same result.
The result is: <span class="amount">£40.00</span>,<h3>Halo Skincare Organic Gift Set</h3>,"<span class=""amount"">£40.00</span>","<span class=""amount"">£58.00</span>"
Kindly help, please.
If you remove the indexes from your XPaths, they will find all the items on the page:
response.xpath('//*[@id="content"]/div/div/ul/li/a/h3').extract()  # Returns 7 items
However, you should observe that this returns a list of strings of the selected HTML elements. You should add /text() to the XPath if you want the text inside the element (which it looks like you do).
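For example (a minimal sketch, assuming the same page structure; the output is illustrative):

response.xpath('//*[@id="content"]/div/div/ul/li/a/h3/text()').extract()  # e.g. ['Halo Skincare Organic Gift Set', ...]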
Also, the reason you only get one result is that you are concatenating all the items into a single string when assigning them to the item:
items['product_name'] = ''.join(title).strip()
Here title is a list of elements and you concatenate them all into a single string. The same logic applies to the other variables.
If that's really what you want you can disregard the following, but I believe a better approach would be to loop over the products and yield each one separately.
My suggestion would be:
def parse(self, response):
    products = response.xpath('//*[@id="content"]/div/div/ul/li')
    for product in products:
        items = ShopItem()
        items['product_name'] = product.xpath('a/h3/text()').get()
        items['product_sale_price'] = product.xpath('a/span/del/span/text()').get()
        items['product_original_price'] = product.xpath('a/span/ins/span/text()').get()
        items['product_category'] = product.xpath('a/span/ins/span/text()').get()
        yield items
Notice that in your original code your category variable has the same XPath as your product_original_price; I kept that logic in the code, but it's probably a mistake.
I'm getting started with Scrapy and there is a website I'm trying to get data from, specifically the phone number element, which is inside a div element that has an id. I noticed that if I send a request to this page I can get it:
https://www.otomoto.pl/ajax/misc/contact/multi_phone/6CLxXv/0
So basically the base URL would be https://www.otomoto.pl/ajax/misc/contact/multi_phone/ID/0/, and 6CLxXv is the ID for this example.
How do I scrape all the div elements, concatenate them with the base URL and then retrieve the phone number element?
Here is the code used:
import scrapy
from inline_requests import inline_requests  # needed for the @inline_requests decorator below
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Compose
from otomoto.items import OtomotoItem

def filter_out_array(x):
    x = x.strip()
    return None if x == '' else x

def remove_spaces(x):
    return x.replace(' ', '')

def convert_to_integer(x):
    return int(x)

class OtomotoCarLoader(ItemLoader):
    default_output_processor = TakeFirst()
    features_out = MapCompose(filter_out_array)
    price_out = Compose(TakeFirst(), remove_spaces, convert_to_integer)
class OtomotoSpider(scrapy.Spider):
    name = 'otomoto'
    start_urls = ['https://www.otomoto.pl/osobowe/']

    def parse(self, response):
        for car_page in response.css('.offer-title__link::attr(href)'):
            yield response.follow(car_page, self.parse_car_page)
        for next_page in response.css('.next.abs a::attr(href)'):
            yield response.follow(next_page, self.parse)

    @inline_requests
    def parse_car_page(self, response):
        property_list_map = {
            'Marka pojazdu': 'brand',
            'Model pojazdu': 'model',
            'Rok produkcji': 'year',
        }
        contact_response = yield scrapy.Request(url_number)  # how do I get the specific phone number URL?
        number = ...  # parse the response here? then load it into the loader
        loader = OtomotoCarLoader(OtomotoItem(), response=response)
        for params in response.css('.offer-params__item'):
            property_name = params.css('.offer-params__label::text').extract_first().strip()
            if property_name in property_list_map:
                css = params.css('.offer-params__value::text').extract_first().strip()
                if css == '':
                    css = params.css('a::text').extract_first().strip()
                loader.add_value(property_list_map[property_name], css)
        loader.add_css('price', '.offer-price__number::text')
        loader.add_css('price_currency', '.offer-price__currency::text')
        loader.add_css('features', '.offer-features__item::text')
        loader.add_value('url', response.url)
        loader.add('phone number', number)  # here I want to add the phone number to the rest of the elements
        yield loader.load_item()
Note: I was able to find the link https://www.otomoto.pl/ajax/misc/contact/multi_phone/6CLxXv/0 by checking the page's XHR requests.
Take a look into XPath: https://docs.scrapy.org/en/0.9/topics/selectors.html. There you should find feasible solutions to select the distinct elements you need, e.g. selecting all the child divs of a parent div which has the id attribute 'a': //div[@id='a']/div
This way you can put your results into a list. The latter part, extracting the numbers from the list and building the final URL strings, is simple string concatenation:
The same goes for scraping the IDs: find unique indicators so you can make sure that those are the elements you need. For example, is the id you need different from the others on the page which you don't need?
for idx in collected_list:
    url = 'https.com/a/b/' + idx + '/0'
EDIT:
I see. Your code is quite advanced. I could get more into it if I had the full code, but from what I can see, you use this HTML element:
<a href="" class="spoiler seller-phones__button" data-path="multi_phone" data-id="6D5zmw" data-id_raw="6074401671" title="Kontakt Rafał" data-test="view-seller-phone-1-button" data-index="0" data-type="bottom">
    <span class="icon-phone2 seller-phones__icon"></span>
    <span data-test="seller-phone-2" class="phone-number seller-phones__number">694 *** ***</span>
    <span class="separator">-</span>
    <span class="spoilerAction">Wyświetl numer</span>
</a>
The data-id is what you need to extract, because it's the ID you are looking for, and you can simply apply it to:
new_request_url = "https://www.otomoto.pl/ajax/misc/contact/multi_phone/"+id+"/0/"
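Here is a minimal sketch of how the pieces could fit together without inline_requests; the parse_phone callback, the phone_number field, and the assumption that the AJAX endpoint returns the number in the response body are all hypothetical:

def parse_car_page(self, response):
    loader = OtomotoCarLoader(OtomotoItem(), response=response)
    # ... populate brand/model/price/features as in the original code ...
    phone_id = response.css('a.seller-phones__button::attr(data-id)').extract_first()
    url = 'https://www.otomoto.pl/ajax/misc/contact/multi_phone/' + phone_id + '/0'
    # carry the partially filled loader along to the next callback via meta
    yield scrapy.Request(url, callback=self.parse_phone, meta={'loader': loader})

def parse_phone(self, response):
    loader = response.meta['loader']
    # assumption: the endpoint returns the phone number in the response body
    loader.add_value('phone_number', response.text.strip())
    yield loader.load_item()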
I'll start with the Scrapy code I'm trying to use to iterate through a collection of vehicles and extract the model and price:
def parse(self, response):
    hxs = Selector(response)
    split_url = response.url.split("/")
    listings = hxs.xpath("//div[contains(@class,'listing-item')]")
    for vehicle in listings:
        item = Vehicle()
        item['make'] = split_url[5]
        item['price'] = vehicle.xpath("//div[contains(@class,'price')]/text()").extract()
        item['description'] = vehicle.xpath("//div[contains(@class,'title-module')]/h2/a/text()").extract()
        yield item
I was expecting that to loop through the listings and return the price only for the single vehicle being parsed, but it is actually adding an array of all prices on the page to each vehicle item.
I assume the problem is in my XPath selectors: is "//div[contains(@class,'price')]/text()" somehow allowing the parser to look at divs outside the single vehicle that should be parsed each time?
For reference, if I do listings[1] it returns only one listing, hence the loop should be working.
Edit: I added the line print vehicle.extract() above, and confirmed that vehicle is definitely only a single item (and it changes each time the loop iterates). How is the XPath selector applied to vehicle able to escape the vehicle object and return all prices?
I was having the same problem. I consulted the document that you referred to. I'm providing the modified code here so that it may be helpful to beginners like me. Note the usage of '.' in the XPath: .//div[contains(@class,'title-module')]/h2/a/text()
def parse(self, response):
    hxs = Selector(response)
    split_url = response.url.split("/")
    listings = hxs.xpath("//div[contains(@class,'listing-item')]")
    for vehicle in listings:
        item = Vehicle()
        item['make'] = split_url[5]
        item['price'] = vehicle.xpath(".//div[contains(@class,'price')]/text()").extract()
        item['description'] = vehicle.xpath(".//div[contains(@class,'title-module')]/h2/a/text()").extract()
        yield item
I was able to solve the problem with the aid of the manual, here. In summary, the XPath was indeed escaping the iteration because I neglected to put a period in front of the //, which meant that it escaped to the root node every time.
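To make the difference concrete, here is a quick illustration you can run in a Python shell (the two-listing HTML snippet is made up):

from scrapy.selector import Selector

sel = Selector(text=(
    '<div class="listing-item"><div class="price">100</div></div>'
    '<div class="listing-item"><div class="price">200</div></div>'
))
vehicle = sel.xpath("//div[contains(@class,'listing-item')]")[0]
vehicle.xpath("//div[contains(@class,'price')]/text()").extract()   # ['100', '200'] - escapes to the document root
vehicle.xpath(".//div[contains(@class,'price')]/text()").extract()  # ['100'] - stays inside this listing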
I'm currently trying to use Scrapy to go through the Elite Dangerous subreddit and collect post titles, URLs, and vote counts. I did the first two fine, but am unsure of how to write an XPath expression to access the votes.
selector.xpath('//div[@class="score unvoted"]').extract() works, but it returns vote counts for all posts on the current page (instead of for each individual post).
response.css('div.score.unvoted').extract() works for each individual post, but returns [u'<div class="score unvoted">1</div>'] instead of just 1. (I would also really like to know how to do this with XPath!)
Code is as follows:
class redditSpider(CrawlSpider):  # http://doc.scrapy.org/en/1.0/topics/spiders.html#scrapy.spiders.CrawlSpider
    name = "reddits"
    allowed_domains = ["reddit.com"]
    start_urls = [
        "https://www.reddit.com/r/elitedangerous",
    ]
    rules = [
        Rule(LinkExtractor(
            allow=['/r/EliteDangerous/\?count=\d*&after=\w*']),  # Looks for next page with RE
            callback='parse_item',  # What do I do with this? --- pass to self.parse_item
            follow=True),  # Tells spider to continue after callback
    ]

    def parse_item(self, response):
        selector_list = response.css('div.thing')  # Each individual little "box" with content
        for selector in selector_list:
            item = RedditItem()
            item['title'] = selector.xpath('div/p/a/text()').extract()
            item['url'] = selector.xpath('a/@href').extract()
            # item['votes'] = selector.xpath('//div[@class="score unvoted"]')
            item['votes'] = selector.css('div.score.unvoted').extract()
            yield item
You are on the right track. The first approach just needs two things:
- a dot at the beginning to make it context-specific
- text() at the end
Fixed version:
selector.xpath('.//div[@class="score unvoted"]/text()').extract()
And, FYI, you can make the second option work too by using the ::text pseudo-element:
response.css('div.score.unvoted::text').extract()
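Note that extract() still returns a list in both cases; if you want the bare value (u'1' rather than [u'1']), extract_first() should do it:

selector.xpath('.//div[@class="score unvoted"]/text()').extract_first()  # u'1'
selector.css('div.score.unvoted::text').extract_first()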
This should work too; note that it needs the same leading dot so it stays inside each post's selector:
selector.xpath('.//div[contains(@class, "score unvoted")]/text()').extract()
I'm trying to scrape the items from this website. The items are: brand, model and price. Because of the complexity of the page structure, the spider uses two XPath selectors.
The brand and model items come from one XPath; the price comes from a different XPath. I'm using the ( | ) operator as @har07 suggested. The XPaths were tested individually for each item and were extracting the needed items correctly. However, after joining the two XPaths, the price item started picking up additional items, like commas, and the prices aren't matched with the brand/model items when outputting to CSV.
This is how the parse fragment of the spider looks:
def parse(self, response):
    sel = Selector(response)
    titles = sel.xpath('//table[@border="0"]//td[@class="compact"] | //table[@border="0"]//td[@class="cl-price-cont"]//span[4]')
    items = []
    for t in titles:
        item = AltaItem()
        item["brand"] = t.xpath('div[@class="cl-prod-name"]/a/text()').re('^([\w\-]+)')
        item["model"] = t.xpath('div[@class="cl-prod-name"]/a/text()').re('\s+(.*)$')
        item["price"] = t.xpath('text()').extract()
        items.append(item)
    return(items)
And after scraping, the CSV shows the mismatch (see screenshot): the brand and model end up in one row, and the price in the next.
Any suggestions on how to fix this? Thank you.
Basically, the issue is caused by your titles XPath. The XPath goes down too deep, to the point where you needed to join two XPaths in order to scrape the brand/model field and the price field.
Modifying the titles XPath to a single XPath that includes both of the repeating elements for brand/model and price (and subsequently changing the brand, model and price XPaths) means that you no longer get mismatches where the brand and model are in one item and the price is in the next item.
def parse(self, response):
    sel = Selector(response)
    titles = sel.xpath('//table[@class="table products cl"]//tr[@valign="middle"]')
    items = []
    for t in titles:
        item = AltaItem()
        item["brand"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re('^([\w\-]+)')
        item["model"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re('\s+(.*)$')
        item["price"] = t.xpath('td[@class="cl-price-cont"]//span[4]/text()').extract()
        items.append(item)
    return items
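The key point is selecting one selector per table row and keeping every field XPath relative to that row, so each row yields one internally consistent item. The same pattern can also yield items as it goes instead of collecting them into a list (equivalent behaviour, slightly more idiomatic Scrapy):

def parse(self, response):
    # one selector per product row; all field XPaths below are relative to the row
    for t in response.xpath('//table[@class="table products cl"]//tr[@valign="middle"]'):
        item = AltaItem()
        item["brand"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re('^([\w\-]+)')
        item["model"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re('\s+(.*)$')
        item["price"] = t.xpath('td[@class="cl-price-cont"]//span[4]/text()').extract()
        yield item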
I have a Python script using Scrapy which scrapes data from a website, allocates it to three fields and then generates a CSV. It works OK, but with one major problem: all fields contain all of the data, rather than it being separated out for each table row. I'm sure this is because my loop isn't working, and when it finds the XPath it just grabs all the data for every row before moving on to get data for the other two fields, instead of creating separate rows.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    divs = hxs.select('//tr[@class="someclass"]')
    for div in divs:
        item = TestBotItem()
        item['var1'] = div.select('//table/tbody/tr[*]/td[2]/p/span[2]/text()').extract()
        item['var2'] = div.select('//table/tbody/tr[*]/td[3]/p/span[2]/text()').extract()
        item['var3'] = div.select('//table/tbody/tr[*]/td[4]/p/text()').extract()
    return item
The tr index with the * increases with each entry on the website I need to crawl, and the other two paths slot in below. How do I edit this so it grabs the first set of data for, say, //table/tbody/tr[3] only, stores it for all three fields, and then moves on to //table/tbody/tr[4], etc.?
Update
This works correctly now; however, I'm trying to add some validation to the pipelines.py file to drop any records where var1 is more than 100%. I'm certain my code below is wrong. Also, does "yield" instead of "return" stop the pipeline being used?
from scrapy.exceptions import DropItem

class TestbotPipeline(object):
    def process_item(self, item, spider):
        if item('var1') > 100%:
            return item
        else:
            raise Dropitem(item)
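For reference, a corrected sketch of that pipeline; the percentage parsing is an assumption about how var1 is scraped (e.g. as a string like '95%'):

from scrapy.exceptions import DropItem

class TestbotPipeline(object):
    def process_item(self, item, spider):
        # assumption: item['var1'] holds a percentage string such as '95%'
        value = float(item['var1'].rstrip('%'))
        if value > 100:
            raise DropItem(item)  # note the capitalisation: DropItem, not Dropitem
        return item  # items are indexed with ['var1'], not called with ('var1')

And to answer the side question: using yield instead of return in the spider does not stop pipelines from being used; pipelines receive every item the spider yields.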
I think this is what you are looking for:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    divs = hxs.select('//tr[@class="someclass"]')
    for div in divs:
        item = TestBotItem()
        item['var1'] = div.select('./td[2]/p/span[2]/text()').extract()
        item['var2'] = div.select('./td[3]/p/span[2]/text()').extract()
        item['var3'] = div.select('./td[4]/p/text()').extract()
        yield item
You loop on the trs and then use relative XPath expressions (./td...), and in each iteration you use the yield instruction.
You can also append each item to a list and return that list outside of the loop, like this (it's equivalent to the code above):
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    divs = hxs.select('//tr[@class="someclass"]')
    items = []
    for div in divs:
        item = TestBotItem()
        item['var1'] = div.select('./td[2]/p/span[2]/text()').extract()
        item['var2'] = div.select('./td[3]/p/span[2]/text()').extract()
        item['var3'] = div.select('./td[4]/p/text()').extract()
        items.append(item)
    return items
You don't need HtmlXPathSelector. Scrapy already has built-in XPath selectors. Try this:
def parse(self, response):
    divs = response.xpath('//tr[@class="someclass"]')
    for div in divs:
        item = TestBotItem()
        item['var1'] = div.xpath('./td[2]/p/span[2]/text()').extract()[0]
        item['var2'] = div.xpath('./td[3]/p/span[2]/text()').extract()[0]
        item['var3'] = div.xpath('./td[4]/p/text()').extract()[0]
        yield item