I have a scrapy crawler that extracts product data from a website.
The website is: https://www.softsurroundings.com
The main page has categories, and inside the categories there are products. A product link looks like "https://www.softsurroundings.com/p/audun-jumpsuit/", which I got from the first item in the category at "https://www.softsurroundings.com/clothing/jumpsuits/".
If I don't use the link extractor in the code below and instead give the link of a specific product as the start URL, "parse_start_url" works fine and gives me the data I want to scrape.
But when I give only "https://www.softsurroundings.com" in start_urls and use the link extractor as shown in my code below, the link extractor gets the category links but doesn't scrape the items inside the categories.
My existing code is:
name = 'Soft'
allowed_domains = ["softsurroundings.com"]
start_urls = [
    "https://www.softsurroundings.com/"
]
den_subdirectory = ['/orderstatus/', '/faq/', '/folder/', '/sitemap/', '/catalogrequest/',
                    '/soft-surroundings-gift-card/', '/cart/', '/stores/',
                    '/emailus/', '/myaccount/', '/home-wellness/', '/new-bedding-home/']
rules = [
    Rule(LinkExtractor(deny=den_subdirectory), callback='parse_start_url')
]
Now I need to modify the above code so that the link extractor goes into each category and scrapes all the products that exist in that category.
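Something like this is what I am aiming for (a minimal sketch; the "/p/" pattern for product URLs is an assumption based on the example product link above, and the selectors in parse_product are placeholders):

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class SoftSpider(CrawlSpider):
        name = 'Soft'
        allowed_domains = ['softsurroundings.com']
        start_urls = ['https://www.softsurroundings.com/']

        den_subdirectory = ['/orderstatus/', '/faq/', '/folder/', '/sitemap/', '/catalogrequest/',
                            '/soft-surroundings-gift-card/', '/cart/', '/stores/',
                            '/emailus/', '/myaccount/', '/home-wellness/', '/new-bedding-home/']

        rules = [
            # Rules are matched in order, so the product rule must come first,
            # otherwise the broad rule below would swallow product links too.
            Rule(LinkExtractor(allow=r'/p/'), callback='parse_product'),
            # Everything else (category pages etc.) is followed, so the links
            # found on those pages get extracted as well.
            Rule(LinkExtractor(deny=den_subdirectory), follow=True),
        ]

        def parse_product(self, response):
            # Placeholder fields; replace with the real product selectors.
            yield {'url': response.url, 'name': response.css('h1::text').get()}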
I am scraping a website using scrapy, where I want to extract a few details of a product such as the price, description, and features. I want to know how to select each of these elements using CSS or XPath selectors and store them in XML or JSON format.
I have written the following code skeleton. Please guide me on what I should do from here.
# -*- coding: utf-8 -*-
import scrapy
import time

class QuotesSpider(scrapy.Spider):
    name = 'myquotes'
    start_urls = [
        'https://www.amazon.com/international-sales-offers/b/ref=gbps_ftr_m-9_2862_dlt_LD?node=15529609011&gb_f_deals1=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL%252CEXPIRED%252CSOLDOUT%252CUPCOMING,sortOrder:BY_SCORE,MARKETING_ID:ship_export,enforcedCategories:15684181,dealTypes:LIGHTNING_DEAL&pf_rd_p=9b8adb89-8774-4860-8b6e-e7cefc1c2862&pf_rd_s=merchandised-search-9&pf_rd_t=101&pf_rd_i=15529609011&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=AA0VVPMWMQM1MF4XQZKR&ie=UTF8'
    ]

    def parse(self, response):
        all_div_quotes = response.css('a-section a-spacing-none tallCellView gridColumn2 singleCell')
        for quotes in all_div_quotes:
            title1 = all_div_quotes.css('.dealPriceText::text').extract()
            title2 = all_div_quotes.css('.a-declarative::text').extract()
            title3 = all_div_quotes.css('#shipSoldInfo::text').extract()
            yield {
                'price': title1,
                'details1': title2,
                'details2': title3
            }
I am running the code using the command:
scrapy crawl myquotes -o myfile.json
to save it to a JSON file. The problem with this code is that it is not returning the title, product price, or product description as intended. If someone could help me with how to scrape the product name, price, and description of an Amazon page, it would be of great help.
The easiest way to check and verify CSS selectors is to use the scrapy shell.
In your case, here are the selectors you can use, along with the code:
Name: response.css("#productTitle::text").get()
Price: the price was not available in my country, so I couldn't test it.
Description: response.css("#productDescription p::text").getall()
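For example, you can verify them interactively like this (the product URL is a placeholder; substitute a real one):

    scrapy shell "https://www.amazon.com/dp/ASIN_GOES_HERE"
    >>> response.css("#productTitle::text").get()
    >>> response.css("#productDescription p::text").getall()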
Best of luck.
The normal method for solving an error like this is to start at the top. I think your very first CSS selector is too detailed. Using the Selector Gadget, the general CSS selector is
.dealDetailContainer
Yield the whole response without a for loop and check the output to verify that you're getting some kind of response (see the sketch below).
For products individually, when I scraped a different Amazon link, the CSS selector for the product name was
#productTitle::text
(the # here is a CSS id selector, not a commented line of code)
Basically, you're going wrong with the CSS selectors. Use the CSS Selector Gadget, and before using the command to output to JSON, do a normal crawl first.
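For instance, a minimal debug version of parse could look like this (the yielded field name is arbitrary):

    def parse(self, response):
        # Dump whatever the container selector matches, so the output shows
        # whether the selector hits anything at all.
        yield {'containers': response.css('.dealDetailContainer').getall()}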
Generally, what you could do is:
Name: response.css("#productTitle::text").extract()
Description: response.css("#productDescription p::text").extract()
With this you should be good to go.
CSS selectors are more stable, so they are usually a better bet than XPath and consequently the way to go.
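For comparison, these two should select the same thing (a quick sketch):

    # CSS
    response.css("#productTitle::text").extract()
    # XPath equivalent of the same selection
    response.xpath('//*[@id="productTitle"]/text()').extract()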
In my web-scraping project I have to scrape football match data from https://www.national-football-teams.com/country/67/2018/France.html
In order to navigate to the match data from the above URL, I have to follow a hyperlink whose href contains a hash, roughly like this:
<a href="#matches">Matches</a>
The standard scrapy mechanism of following the links:
href = response.xpath("//a[contains(@href, 'matches')]/@href").extract_first()
href = response.urljoin(href)
will produce a link that will not lead to the matches data:
https://www.national-football-teams.com/matches.html
I would appreciate any help. Since I am a newbie to web scraping and anything to do with web development, more specific advice and/or a minimal working example would be highly appreciated.
For completeness, here is the full code of my scrapy spider:
import scrapy

class NationalFootballTeams(scrapy.Spider):
    name = "nft"
    start_urls = ['https://www.national-football-teams.com/continent/1/Europe.html']

    def parse(self, response):
        for country in response.xpath("//div[@class='row country-teams']/div[1]/ul/li/a"):
            cntry = country.xpath("text()").extract_first().strip()
            if cntry == 'France':
                href = country.xpath("@href").extract_first()
                yield response.follow(href, self.parse_country)

    def parse_country(self, response):
        href = response.xpath("//a[contains(@href, 'matches')]/@href").extract_first()
        href = response.urljoin(href)
        print(href)
        yield scrapy.Request(url=href, callback=self.parse_matches)

    def parse_matches(self, response):
        print(response.xpath("//tr[@class='win']").extract())
When clicking that link, no new page or even new data is loaded; it's already in the HTML, but hidden. Clicking the link calls some JavaScript that hides the current tab and shows the new one. So to get to the data, you don't need to follow any link at all; just use a different XPath query. The match data is under the XPath //div[@id='matches'].
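So in the spider above, parse_country can query the hidden tab directly instead of requesting matches.html; a minimal sketch (the yielded field name is made up):

    def parse_country(self, response):
        # The match markup is already in this response, just hidden by the
        # JS tabs, so select it directly instead of following the link.
        for row in response.xpath("//div[@id='matches']//tr[@class='win']"):
            yield {'row_html': row.extract()}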
I need to get ASINs from the href links on an Amazon page.
ASINs are unique blocks of 10 letters and/or numbers that identify items.
In particular, I tried to scrape https://www.amazon.it/gp/goldbox/ with Scrapy (Python).
On this page there are a lot of links that contain ASINs.
<a id="dealImage" class="a-link-normal" href="https://www.amazon.it/Marantz-TT5005-Giradischi-Equalizzatore-Incorporato/dp/B008NIV668/ref=gbph_img_s-3_c128_ca594162?smid=A11IL2PNWYJU7H&pf_rd_p=8accddad-a52b-4a55-a9e1-760ad483c128&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A11IL2PNWYJU7H&pf_rd_r=5E0HASYCKDNV4YWQCJSJ">
...
Every link contains the ASIN right after "/dp/", as in ".../dp/B008NIV668/...".
This is my code, but I can't manage to scrape and get the ASINs...
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "amazon"

    def start_requests(self):
        urls = [
            'https://www.amazon.it/gp/goldbox/'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.xpath('//a[contains(@class, "a-link-normal")]')
I can split the link with this: split("/dp/")
I hope someone can help me, thanks!
There are different types of ASINs on that page, so I can't decide which ones you need to parse; you can write your own pattern and grab them. For example:
response.xpath('//*[contains(text(), "Risparmia su Bic Cristal Original - ")]').re(r'"reviewAsin" : "([^"]+)"')
To inspect the raw text first, check out this:
response.xpath('//*[contains(text(), "Risparmia su Bic Cristal Original - ")]').extract()
The HTML there is generated by JavaScript based on JSON objects embedded in the page, so you can pull the data directly from those JSON objects.
You can get all ASINs with this expression:
/reviewAsin\" : \"([A-Z0-9]+)\"/
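Putting that together, a minimal sketch of a parse method for the spider above (the "reviewAsin" key and the 10-character format come from this thread; the whitespace tolerance around the colon is my addition):

    import re

    def parse(self, response):
        # The JSON objects live in <script> blocks, so run the regex over the
        # raw page body instead of the rendered elements.
        asins = set(re.findall(r'"reviewAsin"\s*:\s*"([A-Z0-9]{10})"', response.text))
        for asin in asins:
            yield {'asin': asin}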
Is there a "general" way to scrape link titles from any website in Python? For example, if I use the following code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "https://news.google.com"
html = urlopen(site)
soup = BeautifulSoup(html.read(), 'lxml')
titles = soup.findAll('span', attrs={'class': 'titletext'})
for title in titles:
    print(title.contents)
I am able to extract nearly every headline title from news.google.com. However, if I use the same code on www.yahoo.com, I cannot, due to its different HTML formatting.
Is there a more general way to do this so that it works for most sites?
No. Each site is different, and if you make a more general scraper, it will pick up more data that isn't as specific as the headline titles.
For instance, the following would get every headline title from Google and would probably also get them from Yahoo:
titles = soup.find_all('a')
for title in titles:
    print(title.get_text())
However, it would also get all of the headers and other links, which would muddy up your results. (There are approximately 150 links on that Google page that aren't headlines.)
No; that's why we need CSS selectors and XPath. But if there is only a small number of sites, there is a convenient way to do it:
site = "https://news.google.com"

if 'google' in site:
    filters = {'name': 'span', 'class': 'titletext'}
elif 'yahoo' in site:
    filters = {'name': 'blala', 'class': 'blala'}

titles = soup.findAll(**filters)
for title in titles:
    print(title.contents)
I'm working on a project that ingests data from eBay listings and stores it for analysis. I'm using Scrapy to scrape/crawl specific categories, but I'm running into issues when trying to extract the text within an item's "description" field. Each listing seems to have a unique layout for the description, so I can't generalize an XPath for that field of my Scrapy item.
For example, one advertisement may have one layout, while another may be formatted completely differently. How can I go about extracting the text within each description tab? I can successfully extract other fields, as their XPaths are universal across eBay advertisements. Here's the method I'm referring to:
def parse_item(self, response):
    item = EbayItem()
    item['url'] = response.url
    item['title'] = response.xpath('//*[@id="itemTitle"]/text()').extract()
    item['description'] = response.xpath( #THISISWHEREIMLOST ).extract()
    print(":(")