Scrapy - scraping html custom attributes - python

I am trying to scrape a website and I want to scrape a custom html attribute.
First I get the link:
result.css('p.paraclass a').extract()
It looks like this:
<a data-id="12345" href="...">I am a link</a>
I'd like to scrape the value of the data-id attribute. I can do this by getting the entire link and then manipulating it, but I'd like to figure out whether there is a way to do it directly with a Scrapy selector.

I believe the following will work:
result.css('a::attr(data-id)').extract()

Two ways to achieve this:
from scrapy.selector import Selector
partial_body = '<a data-id="12345" href="...">I am a link</a>'
sel = Selector(text=partial_body)
XPath selector:
sel.xpath('//a/@data-id').extract()
# output: ['12345']
CSS selector:
sel.css('a::attr(data-id)').extract_first()
# output: '12345'
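If Scrapy isn't installed, the same attribute extraction can be sketched with the standard library's html.parser. This is a minimal illustration, not Scrapy's mechanism; the sample markup is made up and the 12345 value is taken from the output above:

```python
from html.parser import HTMLParser

class DataIdExtractor(HTMLParser):
    """Collects the data-id attribute of every <a> tag it sees."""
    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'data-id':
                    self.ids.append(value)

parser = DataIdExtractor()
parser.feed('<a data-id="12345" href="...">I am a link</a>')
print(parser.ids)  # ['12345']
```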

Scrapy CSS selector returning blank

I'm currently trying to scrape the href attributes from each restaurant on a website like:
https://www.menulog.com.au/area/2173-moorebank?lat=-33.9477825&long=150.9190988&q=liverpool
The relevant html can be found at:
HTML snippet (screenshot)
However, when I use the below code in scrapy shell, it returns nothing
response.css("div.c-listing>div>div")
I was wondering why this is the case and what I can do to resolve it?
Thank you!
Use the CSS selector 'a.c-listing-item-link.u-clearfix' to extract the URL links in scrapy shell:
>>> for url in response.css('a.c-listing-item-link.u-clearfix ::attr("href")').extract():
... print(response.urljoin(url))
...
https://www.menulog.com.au/restaurants-blazin-grillz-liverpool/menu
https://www.menulog.com.au/restaurants-phillies-liverpool/menu
https://www.menulog.com.au/restaurants-mcdonalds-liverpool-south/menu
https://www.menulog.com.au/restaurants-kfc-liverpool/menu
https://www.menulog.com.au/restaurants-omer-biryani-house-liverpool/menu
https://www.menulog.com.au/restaurants-classic-burger-liverpool/menu
https://www.menulog.com.au/restaurants-jasmin-1-liverpool/menu
https://www.menulog.com.au/restaurants-subway-liverpool/menu
https://www.menulog.com.au/restaurants-himalayas-indian-restaurant-liverpool/menu
https://www.menulog.com.au/restaurants-jasmins-liverpool/menu
https://www.menulog.com.au/restaurants-sharetea-liverpool/menu
https://www.menulog.com.au/restaurants-healthy-kitchen-liverpool-halal/menu
https://www.menulog.com.au/restaurants-dosa-hut-liverpool/menu
https://www.menulog.com.au/restaurants-the-kulcha-house-liverpool/menu
https://www.menulog.com.au/restaurants-biang-biang-noodle-shop-liverpool/menu
https://www.menulog.com.au/restaurants-zambeekas-liverpool/menu
https://www.menulog.com.au/restaurants-mina-bakery-liverpool/menu
https://www.menulog.com.au/restaurants-crossroads-hotel-liverpool/menu
https://www.menulog.com.au/restaurants-nutrition-station-liverpool/menu
https://www.menulog.com.au/restaurants-mizuki-sushi-liverpool/menu

unable to extract full url @href using scrapy

I am trying to extract the url of a product from amazon.in. The href-attribute inside the a-tag from the source looks like this:
href="/Parachute-Coconut-Oil-600-Free/dp/B081WSB91C/ref=sr_1_49?dchild=1&fpw=pantry&fst=as%3Aoff&qid=1588693187&s=pantry&sr=8-49&srs=9574332031&swrs=789D2F4EC1B25821250A55BFCB953F03"
What Scrapy is extracting is:
/Parachute-Coconut-Oil-Bottle-600ml/dp/B071FB2ZVT?dchild=1
I used the following xpath:
//div[@class="a-section a-spacing-none a-spacing-top-small"]//a[@class="a-link-normal a-text-normal"]/@href
This is the website I am trying to scrape:
https://www.amazon.in/s?i=pantry&srs=9574332031&bbn=9735693031&rh=n%3A9735693031&dc&page=2&fst=as%3Aoff&qid=1588056650&swrs=789D2F4EC1B25821250A55BFCB953F03&ref=sr_pg_2
How can I extract the expected url with Scrapy?
That is known as a relative URL. To get the full URL you can simply combine it with the base URL. I don't know what your code is, but try something like this:
half_url = response.xpath('//div[@class="a-section a-spacing-none a-spacing-top-small"]//a[@class="a-link-normal a-text-normal"]/@href').extract_first()
full_url = 'https://www.amazon.in' + half_url
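Note that manual concatenation is fragile: if the base ends with / and the href starts with /, you get a double slash. The standard library's urllib.parse.urljoin (which Scrapy's response.urljoin wraps) resolves a relative href against the base correctly either way. A sketch with a shortened version of the href above:

```python
from urllib.parse import urljoin

base = 'https://www.amazon.in/'
half_url = '/Parachute-Coconut-Oil-600-Free/dp/B081WSB91C'

# Naive concatenation leaves a double slash in the path:
print(base + half_url)          # https://www.amazon.in//Parachute-Coconut-Oil-...
# urljoin resolves the relative path against the base cleanly:
print(urljoin(base, half_url))  # https://www.amazon.in/Parachute-Coconut-Oil-...
```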

Cannot get a CSS class from Google search page

I use BeautifulSoup for parsing a Google search, but I get an empty list. I want to make a spellchecker by using Google's "Did you mean?".
import requests
from bs4 import BeautifulSoup
import urllib.parse
text = "i an you ate goode maan"
data = urllib.parse.quote_plus(text)
url = 'https://translate.google.com/?source=osdd#view=home&op=translate&sl=auto&tl=en&text='
rq = requests.get(url + data)
soup = BeautifulSoup(rq.content, 'html.parser')
words = soup.select('.tlid-spelling-correction spelling-correction gt-spell-correct-message')
print(words)
The output is just: [], but expected: "i and you are good man" (sorry for such a bad text example)
First, the element you are looking for is loaded using JavaScript. Since BeautifulSoup does not execute JavaScript, the target elements never get loaded into the DOM, so the query selector can't find them. Try using Selenium instead of BeautifulSoup.
Second, the CSS selector should be
.tlid-spelling-correction.spelling-correction.gt-spell-correct-message
Notice the . instead of a space in front of every class name.
I have verified this using a JS query selector.
The selector you were using, .tlid-spelling-correction spelling-correction gt-spell-correct-message, was looking for an element with class gt-spell-correct-message inside an element with class spelling-correction, which itself was inside another element with class tlid-spelling-correction.
By removing the spaces and putting a dot in front of every class name, the selector looks for a single element with all three of the above-mentioned classes.
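The rule can be illustrated without a browser: a compound selector like .a.b.c matches only elements carrying all of those classes, whereas a space means descendant nesting. Below is a tiny sketch of just the class-matching part of that rule (the helper name is made up for illustration):

```python
def matches_compound(element_class_attr, selector_classes):
    """True if the element's class attribute contains every class
    named in a compound selector like '.a.b.c'."""
    element_classes = set(element_class_attr.split())
    return set(selector_classes) <= element_classes

classes = 'tlid-spelling-correction spelling-correction gt-spell-correct-message'
# Element with all three classes matches the compound selector:
print(matches_compound(classes, ['tlid-spelling-correction',
                                 'spelling-correction',
                                 'gt-spell-correct-message']))  # True
# An element with only one of the classes does not:
print(matches_compound('spelling-correction', ['tlid-spelling-correction']))  # False
```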

How to select various elements of a website

I am scraping a website using scrapy where I want to extract a few details such as price, product description, features etc of a product. I want to know how to select each of these elements using css selectors or xpath selectors and store them in xml or json format.
I have written the following code skeleton. Please guide me what should I do from here.
# -*- coding: utf-8 -*-
import scrapy
import time
class QuotesSpider(scrapy.Spider):
    name = 'myquotes'
    start_urls = [
        'https://www.amazon.com/international-sales-offers/b/ref=gbps_ftr_m-9_2862_dlt_LD?node=15529609011&gb_f_deals1=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL%252CEXPIRED%252CSOLDOUT%252CUPCOMING,sortOrder:BY_SCORE,MARKETING_ID:ship_export,enforcedCategories:15684181,dealTypes:LIGHTNING_DEAL&pf_rd_p=9b8adb89-8774-4860-8b6e-e7cefc1c2862&pf_rd_s=merchandised-search-9&pf_rd_t=101&pf_rd_i=15529609011&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=AA0VVPMWMQM1MF4XQZKR&ie=UTF8'
    ]

    def parse(self, response):
        all_div_quotes = response.css('a-section a-spacing-none tallCellView gridColumn2 singleCell')
        for quotes in all_div_quotes:
            title1 = all_div_quotes.css('.dealPriceText::text').extract()
            title2 = all_div_quotes.css('.a-declarative::text').extract()
            title3 = all_div_quotes.css('#shipSoldInfo::text').extract()
            yield {
                'price': title1,
                'details1': title2,
                'details2': title3
            }
I am running the code using the command:
scrapy crawl myquotes -o myfile.json
to save it inside a json file. The problem with this code is that it is not returning the title, product price, product description as intended. If someone could help me with how to scrape the product name, price and description of an amazon page it would be of great help.
The easiest way to check and verify CSS selectors is using scrapy shell.
In your case, I have listed the selectors you can use along with the code:
Name: response.css("#productTitle::text").get()
Price: Price was not available in my country so couldn't test it.
Description: response.css("#productDescription p::text").getall()
Best of luck.
The normal method to solve an error like this is to start at the top. I think your very first CSS selector is too detailed. Using the SelectorGadget tool, the general CSS selector is
.dealDetailContainer
Yield the whole response without a for loop and check the output to confirm that you're getting some kind of response.
For products individually, when I scraped a different Amazon link, the CSS selector for the product name is
#productTitle::text  (the # here is a CSS id selector, not a comment)
Basically, you're going wrong with the CSS selectors. Use the CSS Selector Gadget, and before using the command to output to JSON, do a normal crawl first.
Generally, what you could do is:
Name: response.css("#productTitle::text").extract()
Description: response.css("#productDescription p::text").extract()
With this you should be good to go.
CSS selectors are more stable, so they are usually a better bet than XPath and consequently the way to go.

extracting text from css node scrapy

I'm trying to scrape a catalog id number from this page:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
url = 'http://www.enciclovida.mx/busquedas/resultados?utf8=%E2%9C%93&busqueda=basica&id=&nombre=astomiopsis+exserta&button='
response = HtmlResponse(url=url)
using the css selector (which works in R with rvest::html_nodes)
".result-nombre-container > h5:nth-child(2) > a:nth-child(1)"
I would like to retrieve the catalog id, which in this case should be:
6011038
I'm OK if it's done more easily with XPath.
I don't have Scrapy here, but I tested this XPath and it will get you the href:
//div[contains(@class, 'result-nombre-container')]/h5[2]/a/@href
If you're having too much trouble with scrapy and css selector syntax, I would also suggest trying out BeautifulSoup python package. With BeautifulSoup you can do things like
link.get('href')
If you need to parse id from href:
catalog_id = response.xpath("//div[contains(@class, 'result-nombre-container')]/h5[2]/a/@href").re_first(r'(\d+)$')
There seems to be only one link in the h5 element. So in short:
response.css('h5 > a::attr(href)').re(r'(\d+)$')
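The (\d+)$ pattern that .re() / .re_first() apply is ordinary Python regex, so you can check it on a sample href with the standard library. The URL path below is a hypothetical example ending in the expected catalog id, not the page's actual href:

```python
import re

# Hypothetical href ending in the catalog id, as the pattern above assumes
href = 'http://www.enciclovida.mx/especies/6011038'

# (\d+)$ captures the run of digits at the very end of the string
match = re.search(r'(\d+)$', href)
catalog_id = match.group(1) if match else None
print(catalog_id)  # 6011038
```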
