How to select various elements of a website - python

I am scraping a website using Scrapy, and I want to extract a few details of a product, such as its price, description, and features. I want to know how to select each of these elements using CSS or XPath selectors and store them in XML or JSON format.
I have written the following code skeleton. Please guide me on what I should do from here.
# -*- coding: utf-8 -*-
import scrapy
import time


class QuotesSpider(scrapy.Spider):
    name = 'myquotes'
    start_urls = [
        'https://www.amazon.com/international-sales-offers/b/ref=gbps_ftr_m-9_2862_dlt_LD?node=15529609011&gb_f_deals1=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL%252CEXPIRED%252CSOLDOUT%252CUPCOMING,sortOrder:BY_SCORE,MARKETING_ID:ship_export,enforcedCategories:15684181,dealTypes:LIGHTNING_DEAL&pf_rd_p=9b8adb89-8774-4860-8b6e-e7cefc1c2862&pf_rd_s=merchandised-search-9&pf_rd_t=101&pf_rd_i=15529609011&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=AA0VVPMWMQM1MF4XQZKR&ie=UTF8'
    ]

    def parse(self, response):
        all_div_quotes = response.css('a-section a-spacing-none tallCellView gridColumn2 singleCell')
        for quotes in all_div_quotes:
            title1 = all_div_quotes.css('.dealPriceText::text').extract()
            title2 = all_div_quotes.css('.a-declarative::text').extract()
            title3 = all_div_quotes.css('#shipSoldInfo::text').extract()
            yield {
                'price': title1,
                'details1': title2,
                'details2': title3
            }
I am running the code using the command:
scrapy crawl myquotes -o myfile.json
to save the output in a JSON file. The problem with this code is that it is not returning the title, product price, or product description as intended. If someone could help me with how to scrape the product name, price, and description of an Amazon page, it would be a great help.

The easiest way to check and verify CSS selectors is to use the Scrapy shell.
In your case, I have listed the selectors you can use, along with the code:
Name: response.css("#productTitle::text").get()
Price: the price was not available in my country, so I couldn't test it.
Description: response.css("#productDescription p::text").getall()
Best of luck.

The normal way to solve an error like this is to start at the top. I think your very first CSS selector is too detailed. Using the Selector Gadget, the general CSS selector is
.dealDetailContainer
Yield the whole response without a for loop and check the output to confirm that you're getting some kind of response.
For individual products: when I scraped a different Amazon link, the CSS selector for the product name was
#productTitle::text    (the # is a CSS id selector here, not a comment)
Basically, you're going wrong with the CSS selectors. Use the CSS Selector Gadget, and before using the command to output to JSON, do a normal crawl first.

Generally, what you could do is:
Name: response.css("#productTitle::text").extract()
Description: response.css("#productDescription p::text").extract()
With this you should be good to go.
CSS selectors tend to be more stable, so they are usually a better bet than XPath and consequently the way to go.

Related

Scrapy CSS selector returning blank

I'm currently trying to scrape the href elements from each restaurant on a website like:
https://www.menulog.com.au/area/2173-moorebank?lat=-33.9477825&long=150.9190988&q=liverpool
The relevant HTML was shown in a screenshot (not reproduced here).
However, when I use the code below in the scrapy shell, it returns nothing:
response.css("div.c-listing>div>div")
I was wondering why this is the case/ what I can do to resolve this?
Thank you!
Use the CSS selector 'a.c-listing-item-link.u-clearfix' to extract the URL links in the scrapy shell:
>>> for url in response.css('a.c-listing-item-link.u-clearfix ::attr("href")').extract():
... print(response.urljoin(url))
...
https://www.menulog.com.au/restaurants-blazin-grillz-liverpool/menu
https://www.menulog.com.au/restaurants-phillies-liverpool/menu
https://www.menulog.com.au/restaurants-mcdonalds-liverpool-south/menu
https://www.menulog.com.au/restaurants-kfc-liverpool/menu
https://www.menulog.com.au/restaurants-omer-biryani-house-liverpool/menu
https://www.menulog.com.au/restaurants-classic-burger-liverpool/menu
https://www.menulog.com.au/restaurants-jasmin-1-liverpool/menu
https://www.menulog.com.au/restaurants-subway-liverpool/menu
https://www.menulog.com.au/restaurants-himalayas-indian-restaurant-liverpool/menu
https://www.menulog.com.au/restaurants-jasmins-liverpool/menu
https://www.menulog.com.au/restaurants-sharetea-liverpool/menu
https://www.menulog.com.au/restaurants-healthy-kitchen-liverpool-halal/menu
https://www.menulog.com.au/restaurants-dosa-hut-liverpool/menu
https://www.menulog.com.au/restaurants-the-kulcha-house-liverpool/menu
https://www.menulog.com.au/restaurants-biang-biang-noodle-shop-liverpool/menu
https://www.menulog.com.au/restaurants-zambeekas-liverpool/menu
https://www.menulog.com.au/restaurants-mina-bakery-liverpool/menu
https://www.menulog.com.au/restaurants-crossroads-hotel-liverpool/menu
https://www.menulog.com.au/restaurants-nutrition-station-liverpool/menu
https://www.menulog.com.au/restaurants-mizuki-sushi-liverpool/menu
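Note that response.urljoin() is what turns the relative hrefs returned by ::attr(href) into the absolute URLs above; it behaves like the standard library's urljoin. A sketch with illustrative relative hrefs:

```python
from urllib.parse import urljoin

base = 'https://www.menulog.com.au/area/2173-moorebank'
# relative hrefs as they would come out of ::attr(href); illustrative values
hrefs = ['/restaurants-kfc-liverpool/menu', '/restaurants-subway-liverpool/menu']
full_urls = [urljoin(base, h) for h in hrefs]
print(full_urls[0])  # https://www.menulog.com.au/restaurants-kfc-liverpool/menu
```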

Scrapy spider not scraping the data correctly

I am trying to scrape the data about the circulars from my college's website using Scrapy for a project, but my spider is not scraping the data properly. There are a lot of blank elements, and I am also unable to scrape the 'href' attributes of the circulars for some reason. I am assuming that my CSS selectors are wrong, but I am unable to figure out what exactly I am doing wrong. I copied my CSS selectors using the 'Selector Gadget' Chrome extension. I am still learning Scrapy, so it would be great if you could explain what I was doing wrong.
The Website I am scraping data from is : https://www.imsnsit.org/imsnsit/notifications.php
My code is :
import scrapy
from ..items import CircularItem


class CircularSpider(scrapy.Spider):
    name = "circular"
    start_urls = [
        "https://www.imsnsit.org/imsnsit/notifications.php"
    ]

    def parse(self, response):
        items = CircularItem()
        all = response.css('tr~ tr+ tr font')
        for x in all:
            cirName = x.css('a font::text').extract()
            cirLink = x.css('.list-data-focus a').attrib['href'].extract()
            date = x.css('tr~ tr+ tr td::text').extract()
            items["Name"] = cirName
            items["href"] = cirLink
            items["Date"] = date
            yield items
I modified your parse callback and changed the CSS selectors into XPath. Also, try to learn XPath selectors; they are very powerful and easy to use.
Generally, it is a bad idea to copy CSS or XPath expressions from automatic selector tools, because in some cases they give you incorrect results, or just one element without a general path.
First of all, I select all the tr elements. If you look carefully, some of the tr rows are blank and used as separators. You can filter them out by trying to select the date: if it is None, you can just skip the row. Finally, you can select cirName and cirLink.
Also, the markup of the given website is not good, and it is really hard to write proper selectors; the elements don't have many attributes, like class or id. This is the solution I came up with; I know it is not perfect.
def parse(self, response):
    items = CircularItem()
    all = response.xpath('//tr')  # select all table rows
    for x in all:
        date = x.xpath('.//td/font[@size="3"]/text()').get()  # filter rows by date
        if not date:
            continue
        cirName = x.xpath('.//a/font/text()').get()
        cirLink = x.xpath('.//a[@title="NOTICES / CIRCULARS"]/@href').get()
        items["Name"] = cirName
        items["href"] = cirLink
        items["Date"] = date
        yield items

Scrapy - scraping html custom attributes

I am trying to scrape a website and I want to scrape a custom html attribute.
First I get the link:
result.css('p.paraclass a').extract()
It looks like this (href elided):
<a data-id="12345" href="...">I am a link</a>
I'd like to scrape the value of the data-id attribute. I can do this by getting the entire link and then manipulating it, but I'd like to figure out whether there is a way to do it directly with a Scrapy selector.
I believe the following will work:
result.css('a::attr(data-id)').extract()
Two ways to achieve this:
from scrapy.selector import Selector

partial_body = '<a data-id="12345" href="...">I am a link</a>'
sel = Selector(text=partial_body)

XPath selector:
sel.xpath('//a/@data-id').extract()
# output: ['12345']

CSS selector:
sel.css('a::attr(data-id)').extract_first()
# output: '12345'

extracting text from css node scrapy

I'm trying to scrape a catalog id number from this page:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
url = 'http://www.enciclovida.mx/busquedas/resultados?utf8=%E2%9C%93&busqueda=basica&id=&nombre=astomiopsis+exserta&button='
response = HtmlResponse(url=url)
using the css selector (which works in R with rvest::html_nodes)
".result-nombre-container > h5:nth-child(2) > a:nth-child(1)"
I would like to retrieve the catalog id, which in this case should be:
6011038
I'm OK if it is done more easily with XPath.
I don't have Scrapy here, but I tested this XPath and it will get you the href:
//div[contains(@class, 'result-nombre-container')]/h5[2]/a/@href
If you're having too much trouble with scrapy and css selector syntax, I would also suggest trying out BeautifulSoup python package. With BeautifulSoup you can do things like
link.get('href')
If you need to parse the id from the href:
catalog_id = response.xpath("//div[contains(@class, 'result-nombre-container')]/h5[2]/a/@href").re_first(r'(\d+)$')
There seems to be only one link in the h5 element, so in short:
response.css('h5 > a::attr(href)').re(r'(\d+)$')
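The re_first()/re() step is plain regular-expression matching on the extracted href, so with a hypothetical href ending in the catalog id, the same pattern can be checked with the standard library alone:

```python
import re

# Hypothetical href ending in the catalog id, as the pattern assumes.
href = 'http://www.enciclovida.mx/especies/6011038'
match = re.search(r'(\d+)$', href)   # digits at the very end of the string
catalog_id = match.group(1) if match else None
print(catalog_id)  # 6011038
```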

Scrapy Xpath unable to get all prices

I am using scrapy to crawl this page
class QuotesSpider(scrapy.Spider):
    name = "tesco"
    start_urls = [
        'https://www.tesco.com/direct/tv-offer.event?icid=offers_trade_slot1',
    ]

    def parse(self, response):
        for quote in response.xpath('//li[contains(@class,"product-tile")]'):
            learningscrapyItem = crawlerItem()
            learningscrapyItem['title'] = quote.xpath('.//h3/a/text()').extract_first()
            price = quote.xpath('.//div[@class="buy-box-container"]/p[2]/text()').extract_first()
            learningscrapyItem['price'] = price.strip()
            yield learningscrapyItem
I am having issues with the price XPath, which is only pulling some prices:
//div[@class="buy-box-container"]/p[2]/text()
By removing text() I think I can see the reason why. The tiles that do pull the price are set up like this:
<p class="price">
£189.00
</p>
The ones that are not are structured like:
<p class="price">
<span class="from">From</span>
£549.00
</p>
So the strip() appears to be removing these. Is there a way with XPath to get the text from within the paragraph tag, and not the "From" span within it?
Thanks.
The problem is that /text() only matches the direct text child nodes, and, as you correctly understood, the second example breaks the selector (its first direct text node is just the whitespace before the span).
I would just get all the "text" nodes from inside the "price" element and grab the amount with .re_first():
price = quote.xpath('.//div[@class="buy-box-container"]/p[2]//text()').re_first(r"\d+\.\d+")
Or, even simpler with a CSS selector instead of the XPath:
price = quote.css('.buy-box-container .price').re_first(r"\d+\.\d+")
Try the below way to get the prices you wish to have.
Instead of using this:
quote.xpath('.//div[@class="buy-box-container"]/p[2]/text()').extract_first()
Try using this:
quote.xpath('.//div[@class="buy-box-container"]//p[@class="price"]/text()').extract()[-1]
