I'm trying to scrape the items from this website.
The items are Brand, Model and Price. Because of the complexity of the page structure, the spider uses two XPath selectors.
The Brand and Model items come from one XPath; the Price comes from a different one. I'm using the union operator ( | ) as @har07 suggested. Each XPath was tested individually and extracted the needed items correctly. However, after joining the two XPaths, the price field started picking up extra items, such as commas, and the prices no longer line up with the Brand/Model items when outputting to CSV.
This is how the parse fragment of the spider looks:
def parse(self, response):
    sel = Selector(response)
    titles = sel.xpath('//table[@border="0"]//td[@class="compact"] | //table[@border="0"]//td[@class="cl-price-cont"]//span[4]')
    items = []
    for t in titles:
        item = AltaItem()
        item["brand"] = t.xpath('div[@class="cl-prod-name"]/a/text()').re(r'^([\w\-]+)')
        item["model"] = t.xpath('div[@class="cl-prod-name"]/a/text()').re(r'\s+(.*)$')
        item["price"] = t.xpath('text()').extract()
        items.append(item)
    return items
and this is what the CSV looks like after scraping:
Any suggestions on how to fix this?
Thank you.
Basically, the issue is caused by your titles XPath. It descends too deeply, to the point where you need to join two XPaths just to reach both the brand/model field and the price field.
Changing titles to a single XPath that selects the repeating element containing both brand/model and price (and adjusting the brand, model and price XPaths accordingly) means you no longer get mismatches where the brand and model land in one item and the price in the next.
def parse(self, response):
    sel = Selector(response)
    titles = sel.xpath('//table[@class="table products cl"]//tr[@valign="middle"]')
    items = []
    for t in titles:
        item = AltaItem()
        item["brand"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re(r'^([\w\-]+)')
        item["model"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re(r'\s+(.*)$')
        item["price"] = t.xpath('td[@class="cl-price-cont"]//span[4]/text()').extract()
        items.append(item)
    return items
Hello, I'm trying to build a crawler using Scrapy.
My crawler code is:
import scrapy
from shop.items import ShopItem

class ShopspiderSpider(scrapy.Spider):
    name = 'shopspider'
    allowed_domains = ['www.organics.com']
    start_urls = ['https://www.organics.com/product-tag/special-offers/']

    def parse(self, response):
        items = ShopItem()
        title = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/h3').extract()
        sale_price = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/span[2]/del/span').extract()
        product_original_price = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/span[2]/ins/span').extract()
        category = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/span[2]/ins/span').extract()
        items['product_name'] = ''.join(title).strip()
        items['product_sale_price'] = ''.join(sale_price).strip()
        items['product_original_price'] = ''.join(product_original_price).strip()
        items['product_category'] = ','.join(map(lambda x: x.strip(), category)).strip()
        yield items
But when I run the command scrapy crawl shopspider -o info.csv to see the output, I can only find the information about the first product, not all the products on the page.
So I removed the numbers between [ ] in the XPath, for example the XPath of the title: //*[@id="content"]/div/div/ul/li/a/h3
but I still get the same result.
The result is: <span class="amount">£40.00</span>,<h3>Halo Skincare Organic Gift Set</h3>,"<span class=""amount"">£40.00</span>","<span class=""amount"">£58.00</span>"
Kindly help, please.
If you remove the indexes from your XPaths, they will find all the items on the page:
response.xpath('//*[@id="content"]/div/div/ul/li/a/h3').extract()  # Returns 7 items
However, note that this returns a list of strings of the selected HTML elements. You should add /text() to the XPath if you want the text inside the element (which it looks like you do).
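For example, the text-extracting variant of the same XPath would be:

    response.xpath('//*[@id="content"]/div/div/ul/li/a/h3/text()').extract()  # e.g. ['Halo Skincare Organic Gift Set', ...]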
Also, the reason you only get one result is that you are concatenating all the items into a single string when assigning them to the item:
items['product_name'] = ''.join(title).strip()
Here title is a list of elements, and you concatenate them all into a single string. The same logic applies to the other variables.
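A minimal illustration with made-up values:

    title = ['<h3>A</h3>', '<h3>B</h3>', '<h3>C</h3>']
    ''.join(title)  # '<h3>A</h3><h3>B</h3><h3>C</h3>' -- three matches collapse into one string, hence one CSV row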
If that's really what you want, you can disregard the following, but I believe a better approach would be to execute a for loop and yield the items separately.
My suggestion would be:
def parse(self, response):
    products = response.xpath('//*[@id="content"]/div/div/ul/li')
    for product in products:
        items = ShopItem()
        items['product_name'] = product.xpath('a/h3/text()').get()
        items['product_sale_price'] = product.xpath('a/span/del/span/text()').get()
        items['product_original_price'] = product.xpath('a/span/ins/span/text()').get()
        items['product_category'] = product.xpath('a/span/ins/span/text()').get()
        yield items
Notice that in your original code the category var has the same XPath as product_original_price. I kept that logic in the code, but it's probably a mistake.
I am using Scrapy to scrape some website data, but I can't get the data out properly.
This is the output of my code (see the code below):
In the command line:
scrapy crawl myspider -o items.csv
Output:
asin_product product_name
ProductA,,,ProductB,,,ProductC,,, BrandA,,,BrandB,,,BrandC,,,
ProductA,,,ProductD,,,ProductE,,, BrandA,,,BrandB,,,BrandA,,,
# Note that the rows represent the start_urls and that the three commas ',,,' separate the data fields.
Desired output:
scrapy crawl myspider -o items.csv
Start_URL asin_product product_name
URL1 ProductA BrandA
URL1 ProductB BrandB
URL1 ProductC BrandC
URL2 ProductA BrandA
URL2 ProductD BrandB
URL2 ProductE BrandA
The code I used in Scrapy:
import scrapy
from amazon.items import AmazonItem

class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]

    # Use working product URLs below
    start_urls = [
        "https://www.amazon.com/s?k=shoes&ref=nb_sb_noss_2",    # This should be URL 1
        "https://www.amazon.com/s?k=computer&ref=nb_sb_noss_2"  # This should be URL 2
    ]

    def parse(self, response):
        items = AmazonItem()
        title = response.xpath('//*[@class="a-size-base-plus a-color-base a-text-normal"]/text()').extract()
        asin = response.xpath('//*[@class="a-link-normal"]/@href').extract()
        # Note that I divided the products with ',,,' to make them easy to separate.
        # I am aware that this is not the best approach.
        items['product_name'] = ',,,'.join(title).strip()
        items['asin_product'] = ',,,'.join(asin).strip()
        yield items
First of all, it's recommended to use CSS when querying by class.
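The reason is that a class attribute can hold several space-separated tokens, which XPath can only match reliably with a verbose contains() guard, while CSS matches class tokens natively. As a rough comparison (illustrative selectors, not tied to the page):

    # XPath: must treat @class as one string and guard against partial-token matches
    response.xpath('//a[contains(concat(" ", normalize-space(@class), " "), " a-link-normal ")]')
    # CSS: matches the class token directly
    response.css('a.a-link-normal')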
Now to your code:
The product name is inside the a tag (the product URL), so you can iterate through the links and store both the URL and the title.
<a class="a-link-normal a-text-normal" href="/adidas-Mens-Lite-Racer-Running/dp/B071P19D3X/ref=sr_1_3?keywords=shoes&qid=1554132536&s=gateway&sr=8-3">
<span class="a-size-base-plus a-color-base a-text-normal">Adidas masculina Lite Racer byd tênis de corrida</span>
</a>
You need to create one AmazonItem object per line of your CSV file.
def parse(self, response):
    # You need to improve this css selector because there are links which
    # are not products; this is why I check if title is None and continue.
    for product in response.css('a.a-link-normal.a-text-normal'):
        # product is a selector
        title = product.css('span.a-size-base-plus.a-color-base.a-text-normal::text').get()
        if not title:
            continue
        # The selector is already the a tag, so we only need to extract its href attribute value.
        asin = product.xpath('./@href').get()
        item = AmazonItem()
        item['product_name'] = title.strip()
        item['asin_product'] = asin.strip()
        yield item
Make the start_url available in the parse method
Instead of using start_urls, you can yield your initial requests from a method named start_requests (see https://docs.scrapy.org/en/latest/intro/tutorial.html?highlight=start_requests#our-first-spider).
With each request you can pass the start URL as meta data. This meta data is then available within your parse method (see https://docs.scrapy.org/en/latest/topics/request-response.html?highlight=meta#scrapy.http.Request.meta).
from scrapy import Request

def start_requests(self):
    urls = [...]  # this is equal to your start_urls
    for start_url in urls:
        yield Request(url=start_url, meta={"start_url": start_url})

def parse(self, response):
    start_url = response.meta["start_url"]
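With that in place, each item built in parse (for example inside the per-product loop shown in the next section) can carry its start URL, giving you the Start_URL column from the desired output. This assumes you also add a start_url field to AmazonItem; that field name is my own, not from the original code:

    item['start_url'] = response.meta["start_url"]  # assumes AmazonItem defines a start_url field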
Yield multiple items, one for each product
Instead of joining the titles and ASINs into one string, you can yield several items from parse. For the example below I assume the lists title and asin have the same length.
for title, asin in zip(title, asin):
    item = AmazonItem()
    item['product_name'] = title
    item['asin_product'] = asin
    yield item
PS: You should check Amazon's robots.txt; they might not allow you to scrape their site and could ban your IP (https://www.amazon.de/robots.txt).
I'll start with the scrapy code I'm trying to use to iterate through a collection of vehicles and extract the model and price:
def parse(self, response):
    hxs = Selector(response)
    split_url = response.url.split("/")
    listings = hxs.xpath("//div[contains(@class,'listing-item')]")
    for vehicle in listings:
        item = Vehicle()
        item['make'] = split_url[5]
        item['price'] = vehicle.xpath("//div[contains(@class,'price')]/text()").extract()
        item['description'] = vehicle.xpath("//div[contains(@class,'title-module')]/h2/a/text()").extract()
        yield item
I was expecting that to loop through the listings and return the price only for the single vehicle being parsed, but it is actually adding an array of all prices on the page to each vehicle item.
I assume the problem is in my XPath selectors: is "//div[contains(@class,'price')]/text()" somehow allowing the parser to look at divs outside the single vehicle that should be getting parsed each time?
For reference, if I do listings[1] it returns only 1 listing, hence the loop should be working.
Edit: I added the line print vehicle.extract() above, and confirmed that vehicle is definitely only a single item (and it changes each time the loop iterates). How is the xpath selector applied to vehicle able to escape the vehicle object and return all prices?
I was having the same problem and consulted the document referred to above. I'm providing the modified code here so that it will be helpful to beginners like me. Note the use of '.' at the start of the XPath .//div[contains(@class,'title-module')]/h2/a/text():
def parse(self, response):
    hxs = Selector(response)
    split_url = response.url.split("/")
    listings = hxs.xpath("//div[contains(@class,'listing-item')]")
    for vehicle in listings:
        item = Vehicle()
        item['make'] = split_url[5]
        item['price'] = vehicle.xpath(".//div[contains(@class,'price')]/text()").extract()
        item['description'] = vehicle.xpath(".//div[contains(@class,'title-module')]/h2/a/text()").extract()
        yield item
I was able to solve the problem with the aid of the manual, here. In summary, the XPath was indeed escaping the iteration because I neglected to put a period in front of the //, which meant it was escaping to the root node every time.
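In other words, the leading period anchors the query to the current selector instead of the document root:

    vehicle.xpath("//div[contains(@class,'price')]/text()")   # absolute: matches every price div in the document
    vehicle.xpath(".//div[contains(@class,'price')]/text()")  # relative: matches only price divs inside this vehicle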
This is a newbie question (new to Scrapy and first question on Stackoverflow):
I currently have a spider to crawl the following Amazon page (http://www.amazon.co.uk/Televisions-TVs-LED-LCD-Plasma/b/ref=sn_gfs_co_auto_560864_1?ie=UTF8&node=560864).
I am trying to scrape the title of the TV and the main (listed) price. I can successfully parse the TV name. However, the Amazon TVs listed don't all have the same XPath elements: some have a main (listed) price, some have an "as New" price, and some also have an "as Used" price.
My issue is that when a TV does not have a main (listed) price, my CSV output does not record a NULL for that item but instead takes the next XPath item which does have a main price.
Is there a way to check whether an item exists in the XPath content and, if not, to get the spider or the pipeline to record a NULL or ""?
My main spider code is:
class AmazonSpider(BaseSpider):
    name = "amazon"
    allowed_domains = ["amazon.co.uk"]
    start_urls = [
        "http://www.amazon.co.uk/Televisions-TVs-LED-LCD-Plasma/b/ref=sn_gfs_co_auto_560864_1?ie=UTF8&node=560864"
    ]

    def parse(self, response):
        sel = Selector(response)
        title = sel.xpath('.//*[starts-with(@id,"result_")]/h3/a/span/text()').extract()
        price = sel.xpath('.//*[starts-with(@id,"result_")]/ul/li[1]/div/a/span/text()').extract()
        items = []
        for title, price in zip(title, price):
            item = AmazonItem()
            item["title"] = title.strip()
            item["price"] = price.strip()
            items.append(item)
        return items
My pipeline is:
class AmazonPipeline(object):
    def process_item(self, item, spider):
        return item
My items file is:
import scrapy
from scrapy.item import Item, Field

class AmazonItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
I am outputting to CSV as follows:
scrapy crawl amazon -o output.csv -t csv
Thanks in advance!
You can make the XPath relative so that this won't happen again.
Look at the code below; it might help:
def parse(self, response):
    selector_object = response.xpath('//div[starts-with(@id,"result_")]')
    for select in selector_object:
        title = select.xpath('./h3/a/span/text()').extract()
        title = title[0].strip() if title else 'N/A'
        price = select.xpath('./ul/li[1]/div/a/span/text()').extract()  # note the leading './' to stay inside this div
        price = price[0].strip() if price else 'N/A'
        item = AmazonItem(
            title=title,
            price=price
        )
        yield item
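As a side note, recent Scrapy versions also offer .get() with a default, which collapses the extract-and-check pattern into one line (same behaviour, assuming a Scrapy release that supports .get()):

    title = select.xpath('./h3/a/span/text()').get(default='N/A').strip()
    price = select.xpath('./ul/li[1]/div/a/span/text()').get(default='N/A').strip()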
I extended Jithin's approach with a couple of if/else statements, which solved my problem:
def parse(self, response):
    selector_object = response.xpath('//div[starts-with(@id,"result_")]')
    for select in selector_object:
        new_price = select.xpath('./ul/li[1]/a/span[1]/text()').extract()
        title = select.xpath('./h3/a/span/text()').extract()
        title = title[0].strip() if title else 'N/A'
        price = select.xpath('./ul/li[1]/div/a/span/text()').extract()
        if price:
            price = price[0].strip()
        elif new_price:
            price = new_price[0].strip()
        item = AmazonItem(
            title=title,
            price=price
        )
        yield item
I have a Python script using Scrapy which scrapes data from a website, allocates it to three fields and then generates a CSV. It works OK, but with one major problem: all fields contain all of the data, rather than it being separated out per table row. I'm sure this is because my loop isn't working; when it finds the XPath it just grabs all the data for every row before moving on to get the data for the other two fields, instead of creating separate rows.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    divs = hxs.select('//tr[@class="someclass"]')
    for div in divs:
        item = TestBotItem()
        item['var1'] = div.select('//table/tbody/tr[*]/td[2]/p/span[2]/text()').extract()
        item['var2'] = div.select('//table/tbody/tr[*]/td[3]/p/span[2]/text()').extract()
        item['var3'] = div.select('//table/tbody/tr[*]/td[4]/p/text()').extract()
    return item
The tr with the * increases in number with each entry on the website I need to crawl, and the other two paths slot in below. How do I edit this so that it grabs the first set of data for, say, //table/tbody/tr[3] only, stores it for all three fields, and then moves on to //table/tbody/tr[4], and so on?
Update
This works correctly. However, I'm now trying to add some validation to the pipelines.py file to drop any records where var1 is more than 100%. I'm certain my code below is wrong. Also, does using "yield" instead of "return" stop the pipeline from being used?
from scrapy.exceptions import DropItem

class TestbotPipeline(object):
    def process_item(self, item, spider):
        if item('var1') > 100%:
            return item
        else:
            raise Dropitem(item)
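For reference, a syntactically valid version of that check might look like the sketch below. It assumes var1 arrives as a string such as '95%' that must be parsed first; adjust to your actual data. And using yield instead of return in the spider does not bypass pipelines: process_item runs for every item the spider produces either way.

    from scrapy.exceptions import DropItem

    class TestbotPipeline(object):
        def process_item(self, item, spider):
            # Fields are read with item['var1'], not item('var1'), and '100%'
            # is not valid Python, so parse the number out of the scraped string.
            value = float(item['var1'].rstrip('%'))
            if value <= 100:
                return item
            raise DropItem('var1 exceeds 100%: {}'.format(item['var1']))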
I think this is what you are looking for:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    divs = hxs.select('//tr[@class="someclass"]')
    for div in divs:
        item = TestBotItem()
        item['var1'] = div.select('./td[2]/p/span[2]/text()').extract()
        item['var2'] = div.select('./td[3]/p/span[2]/text()').extract()
        item['var3'] = div.select('./td[4]/p/text()').extract()
        yield item
You loop over the trs and then use relative XPath expressions (./td...), and in each iteration you use the yield instruction.
You can also append each item to a list and return that list outside of the loop, like this (it's equivalent to the code above):
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    divs = hxs.select('//tr[@class="someclass"]')
    items = []
    for div in divs:
        item = TestBotItem()
        item['var1'] = div.select('./td[2]/p/span[2]/text()').extract()
        item['var2'] = div.select('./td[3]/p/span[2]/text()').extract()
        item['var3'] = div.select('./td[4]/p/text()').extract()
        items.append(item)
    return items
You don't need HtmlXPathSelector; Scrapy already has a built-in XPath selector. Try this:
def parse(self, response):
    divs = response.xpath('//tr[@class="someclass"]')
    for div in divs:
        item = TestBotItem()
        # div is already the tr, so the paths must be relative to it
        item['var1'] = div.xpath('./td[2]/p/span[2]/text()').extract()[0]
        item['var2'] = div.xpath('./td[3]/p/span[2]/text()').extract()[0]
        item['var3'] = div.xpath('./td[4]/p/text()').extract()[0]
        yield item