Scrapy CSS selector returning blank - python

I'm currently trying to scrape the href elements from each restaurant on a website like:
https://www.menulog.com.au/area/2173-moorebank?lat=-33.9477825&long=150.9190988&q=liverpool
The relevant html can be found at:
HTML Snipping
However, when I use the code below in scrapy shell, it returns nothing:
response.css("div.c-listing>div>div")
I was wondering why this is the case and what I can do to resolve it.
Thank you!

Use the CSS selector 'a.c-listing-item-link.u-clearfix' to extract the URL links in scrapy shell:
>>> for url in response.css('a.c-listing-item-link.u-clearfix ::attr("href")').extract():
...     print(response.urljoin(url))
...
https://www.menulog.com.au/restaurants-blazin-grillz-liverpool/menu
https://www.menulog.com.au/restaurants-phillies-liverpool/menu
https://www.menulog.com.au/restaurants-mcdonalds-liverpool-south/menu
https://www.menulog.com.au/restaurants-kfc-liverpool/menu
https://www.menulog.com.au/restaurants-omer-biryani-house-liverpool/menu
https://www.menulog.com.au/restaurants-classic-burger-liverpool/menu
https://www.menulog.com.au/restaurants-jasmin-1-liverpool/menu
https://www.menulog.com.au/restaurants-subway-liverpool/menu
https://www.menulog.com.au/restaurants-himalayas-indian-restaurant-liverpool/menu
https://www.menulog.com.au/restaurants-jasmins-liverpool/menu
https://www.menulog.com.au/restaurants-sharetea-liverpool/menu
https://www.menulog.com.au/restaurants-healthy-kitchen-liverpool-halal/menu
https://www.menulog.com.au/restaurants-dosa-hut-liverpool/menu
https://www.menulog.com.au/restaurants-the-kulcha-house-liverpool/menu
https://www.menulog.com.au/restaurants-biang-biang-noodle-shop-liverpool/menu
https://www.menulog.com.au/restaurants-zambeekas-liverpool/menu
https://www.menulog.com.au/restaurants-mina-bakery-liverpool/menu
https://www.menulog.com.au/restaurants-crossroads-hotel-liverpool/menu
https://www.menulog.com.au/restaurants-nutrition-station-liverpool/menu
https://www.menulog.com.au/restaurants-mizuki-sushi-liverpool/menu
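The response.urljoin call in the loop above resolves each href against the page URL, roughly the way urllib.parse.urljoin from the standard library does. A minimal sketch of that step, assuming the hrefs on the page are relative paths (the two hrefs below are illustrative guesses, not copied from the page source):

```python
from urllib.parse import urljoin

# URL of the listing page, taken from the question.
page_url = (
    "https://www.menulog.com.au/area/2173-moorebank"
    "?lat=-33.9477825&long=150.9190988&q=liverpool"
)

# Assumed relative hrefs, for illustration only.
hrefs = [
    "/restaurants-kfc-liverpool/menu",
    "/restaurants-subway-liverpool/menu",
]

# A path starting with "/" replaces the page's path and query string.
absolute_urls = [urljoin(page_url, href) for href in hrefs]

for url in absolute_urls:
    print(url)
```

If the hrefs were already absolute, urljoin would return them unchanged, so calling it is harmless either way.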

Extract specific HREF with xpath or css

Recently I ran into an unusual element that is not trivial to scrape. Could you please suggest how to retrieve its href?
I am scraping some Tripadvisor restaurants with Python and Scrapy and need to retrieve the Google Maps link (the href attribute) from the location and contacts section.
The webpage for example (link)
The code of the element:
<a data-encoded-url="S0k3X2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfeVBw" class="_2wKz--mA _27M8V6YV" target="_blank" href="https://maps.google.com/maps?saddr=&daddr=Scabellstr.+10-11%2C+14109+Berlin+Germany@52.428818,13.182421"><span class="_2saB_OSe">Scabellstr. 10-11, 14109 Berlin Germany</span><span class="ui_icon external-link-no-box _2OpUzCuO"></span></a>
I have tried the following XPath, but every time I either got None or could not get the href attribute, as if it did not exist:
response.xpath("//a[contains(@class, '_2wKz--mA _27M8V6YV')]").getall()
The output:
['<a data-encoded-url="Z3pLX2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfMk1z" class="_2wKz--mA _27M8V6YV" target="_blank"><span class="_2saB_OSe">Scabellstr. 10-11, 14109 Berlin Germany</span><span class="ui_icon external-link-no-box _2OpUzCuO"></span></a>',
'Website']
Use the data-encoded-url that you already got and decode it using Base64. Example:
>>> import base64
>>> base64.b64decode("Z3pLX2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfMk1z").decode("utf-8")
'gzK_https://maps.google.com/maps?saddr=&daddr=Scabellstr.+10-11%2C+14109+Berlin+Germany@52.428818,13.182421_2Ms'
You can then remove the gzK_ prefix and _2Ms suffix and you will have your URL.
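The decode-and-trim steps can be wrapped in a small helper. This is only a sketch: the function name decode_encoded_url is made up here, and it assumes the wrapper always has the shape prefix_URL_suffix seen in this one example, which may not hold for every link on the site.

```python
import base64


def decode_encoded_url(encoded: str) -> str:
    """Decode a data-encoded-url value and strip its wrapper.

    Assumption: the decoded text looks like '<prefix>_<url>_<suffix>',
    as in the single example from the question.
    """
    decoded = base64.b64decode(encoded).decode("utf-8")
    # Drop everything up to and including the first underscore.
    _, _, rest = decoded.partition("_")
    # Drop the last underscore and everything after it.
    url, _, _ = rest.rpartition("_")
    return url


encoded = (
    "Z3pLX2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIu"
    "KzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfMk1z"
)
maps_url = decode_encoded_url(encoded)
print(maps_url)
```

Splitting on the first and last underscore, rather than slicing a fixed number of characters, keeps the helper working if the prefix or suffix length changes, as long as the URL itself is still wrapped the same way.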
You can also try a more specific XPath query, such as "//a[contains(@class, 'foobar')]/@href", to retrieve a single attribute of the element.

How to select various elements of a website

I am scraping a website using scrapy where I want to extract a few details such as price, product description, features etc of a product. I want to know how to select each of these elements using css selectors or xpath selectors and store them in xml or json format.
I have written the following code skeleton. Please guide me on what I should do from here.
# -*- coding: utf-8 -*-
import scrapy
import time
class QuotesSpider(scrapy.Spider):
    name = 'myquotes'
    start_urls = [
        'https://www.amazon.com/international-sales-offers/b/ref=gbps_ftr_m-9_2862_dlt_LD?node=15529609011&gb_f_deals1=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL%252CEXPIRED%252CSOLDOUT%252CUPCOMING,sortOrder:BY_SCORE,MARKETING_ID:ship_export,enforcedCategories:15684181,dealTypes:LIGHTNING_DEAL&pf_rd_p=9b8adb89-8774-4860-8b6e-e7cefc1c2862&pf_rd_s=merchandised-search-9&pf_rd_t=101&pf_rd_i=15529609011&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=AA0VVPMWMQM1MF4XQZKR&ie=UTF8'
    ]
    def parse(self, response):
        all_div_quotes = response.css('a-section a-spacing-none tallCellView gridColumn2 singleCell')
        for quotes in all_div_quotes:
            title1 = all_div_quotes.css('.dealPriceText::text').extract()
            title2 = all_div_quotes.css('.a-declarative::text').extract()
            title3 = all_div_quotes.css('#shipSoldInfo::text').extract()
            yield {
                'price': title1,
                'details1': title2,
                'details2': title3
            }
I am running the code using the command:
scrapy crawl myquotes -o myfile.json
to save it inside a JSON file. The problem with this code is that it does not return the title, product price, and product description as intended. If someone could help me scrape the product name, price, and description of an Amazon page, it would be a great help.
The easiest way to check and verify CSS selectors is to use scrapy shell.
In your case, here are the selectors you can use:
Name: response.css("#productTitle::text").get()
Price: the price was not available in my country, so I couldn't test it.
Description: response.css("#productDescription p::text").getall()
Best of luck.
The normal way to debug an error like this is to start at the top. I think your very first CSS selector is too detailed. Using SelectorGadget, the general CSS selector is
.dealDetailContainer
Yield the whole response without a for loop and check the output to confirm that you are getting some kind of response.
For individual products, when I scraped a different Amazon link, the CSS selector for the product name was
#productTitle::text (the # here is an ID selector, not a code comment)
Basically, you're going wrong with the CSS selectors. Use SelectorGadget, and before using the command that writes the output to JSON, do a normal crawl first.
Generally, what you could do is
Name: response.css("#productTitle::text").extract()
Description: response.css("#productDescription p::text").extract()
With this you should be good to go.
CSS selectors tend to be more stable, so they are usually a better bet than XPath.

Scrapy - scraping html custom attributes

I am trying to scrape a website and I want to scrape a custom html attribute.
First I get the link:
result.css('p.paraclass a').extract()
It looks like this:
<a data-id="12345">I am a link</a>
I'd like to scrape the value of the data-id tag. I can do this by getting the entire link and then manipulating it, but I'd like to figure out if there is a way to do it directly with a scrapy selector.
I believe the following will work:
result.css('a::attr(data-id)').extract()
Two ways to achieve this:
from scrapy.selector import Selector
partial_body = '<a data-id="12345">I am a link</a>'
sel = Selector(text=partial_body)
# XPath selector
sel.xpath('//a/@data-id').extract()
# output: ['12345']
# CSS selector
sel.css('a::attr(data-id)').extract_first()
# output: '12345'

extracting text from css node scrapy

I'm trying to scrape a catalog id number from this page:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
url = 'http://www.enciclovida.mx/busquedas/resultados?utf8=%E2%9C%93&busqueda=basica&id=&nombre=astomiopsis+exserta&button='
response = HtmlResponse(url=url)
using the css selector (which works in R with rvest::html_nodes)
".result-nombre-container > h5:nth-child(2) > a:nth-child(1)"
I would like to retrieve the catalog id, which in this case should be:
6011038
I'm fine with using XPath if that makes it easier.
I don't have Scrapy here, but I tested this XPath and it will get you the href:
//div[contains(@class, 'result-nombre-container')]/h5[2]/a/@href
If you're having too much trouble with Scrapy and CSS selector syntax, I would also suggest trying the BeautifulSoup Python package. With BeautifulSoup you can do things like
link.get('href')
If you need to parse the id from the href:
catalog_id = response.xpath("//div[contains(@class, 'result-nombre-container')]/h5[2]/a/@href").re_first(r'(\d+)$')
There seems to be only one link in the h5 element. So in short:
response.css('h5 > a::attr(href)').re(r'(\d+)$')
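The pattern passed to re and re_first above is an ordinary Python regular expression that captures the run of digits at the very end of the href. A quick way to check it without Scrapy, assuming the link ends with the catalog id (the href below is a made-up example, not taken from the page):

```python
import re

# Hypothetical href; the real link on the page may look different.
href = "/especies/6011038"

# (\d+)$ captures one or more digits anchored at the end of the string.
match = re.search(r"(\d+)$", href)
catalog_id = match.group(1) if match else None

print(catalog_id)
```

If the href carried a trailing slash or a query string, the pattern would not match and catalog_id would be None, so the expression may need adjusting to the actual link format.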

XPATH works in Chrome, but not in Scrapy

I tried to scrape a page. Sorry, I can't disclose the link because of my job's non-disclosure agreement.
print(response.xpath('//tr'))
But it's weird: the XPath only works in Chrome DevTools, not in Scrapy. I checked the scraped HTML via response.body, and the HTML looks normal.
Found the answer. It turns out the HTML is broken and Scrapy can't fix it on its own, so it needs help from Beautiful Soup. I do it like this:
from scrapy.selector import Selector
from bs4 import BeautifulSoup
fixed_html = str(BeautifulSoup(response.body, "lxml"))
print(Selector(text=fixed_html).xpath('//*'))
