Scrape Amazon deals page - python

I need to get ASINs from the href links on an Amazon page.
ASINs are unique blocks of 10 letters and/or numbers that identify items.
In particular, I tried to scrape https://www.amazon.it/gp/goldbox/ with Scrapy (Python).
This page contains a lot of links that contain ASINs, for example:
<a id="dealImage" class="a-link-normal" href="https://www.amazon.it/Marantz-TT5005-Giradischi-Equalizzatore-Incorporato/dp/B008NIV668/ref=gbph_img_s-3_c128_ca594162?smid=A11IL2PNWYJU7H&pf_rd_p=8accddad-a52b-4a55-a9e1-760ad483c128&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A11IL2PNWYJU7H&pf_rd_r=5E0HASYCKDNV4YWQCJSJ">
...
Every link contains the ASIN right after "/dp/", i.e. ".../dp/ASIN/...".
This is my code, but I can't scrape and get the ASINs...
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "amazon"

    def start_requests(self):
        urls = [
            'https://www.amazon.it/gp/goldbox/'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.xpath('//a[contains(@class, "a-link-normal")]')
I can split the link with this: split("/dp/")
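Something like this is what I had in mind, just a sketch; it assumes the anchors are present in the static HTML:

def parse(self, response):
    # Collect every deal link and keep the 10-character ASIN that follows "/dp/".
    for href in response.xpath('//a[contains(@class, "a-link-normal")]/@href').getall():
        if '/dp/' in href:
            asin = href.split('/dp/')[1].split('/')[0]
            yield {'asin': asin}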
Hope someone can help me, thanks!

response.xpath('//*[contains(text(), "Risparmia su Bic Cristal Original - ")]').re(r'"reviewAsin" : "([^"]+)"')
There are different types of ASINs on the page, so I can't decide for you which ones to parse; you can write your own pattern and grab them. Check out this:
response.xpath('//*[contains(text(), "Risparmia su Bic Cristal Original - ")]').extract()

The HTML there is generated by JavaScript from JSON objects embedded in the page, so you can pull the data directly from those JSON objects.
You can get all the ASINs with this expression:
/reviewAsin\" : \"([A-Z0-9]+)\"/
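A rough sketch of using that pattern in the parse callback (assuming the "reviewAsin" entries really are embedded in the raw page source):

import re

def parse(self, response):
    # Run the pattern over the raw page source; the JSON objects behind the
    # JavaScript-rendered deals contain "reviewAsin" : "<ASIN>" pairs.
    asins = set(re.findall(r'"reviewAsin"\s*:\s*"([A-Z0-9]+)"', response.text))
    for asin in asins:
        yield {'asin': asin}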

Related

How to get a value from inside an href in the HTML structure

I am using the following code to get values from a site
import scrapy

class scraping(scrapy.Spider):
    name = 'NewsSpider'
    start_urls = ['https://www.uol.com.br/']

    def parse(self, response):
        news = response.xpath('//article')
        for n in news:
            print({
                'Link': n.xpath("//a[@class='hyperlink headlineSub__link']").get(),
                'Title': n.xpath('//a/div/h3/text()').get(),
            })
On "Link" I am getting a lot of information but I want to get only the link inside the href, is it possible to get only that information?
I have a sample that does this very same thing. You should use a selector like this:
.css('a[href*=topic]::attr(href)')
The a tag in my case was something like <a ... href="topic/1321343">something</a>.
The key is a::attr(href):
narrow your response down as far as you can, then grab the href value you want.
This is from my solution in a project that scrapes Microsoft Academia articles; the linked line gets the items in the "Related Topics" section.
Here is another example:
<span class="title">
</span>
Parse it with:
Link = Link1.css('span.title a::attr(href)').extract()[0]
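Applied to the spider in the question, a sketch could look like this (the class names are taken from the question and may no longer match the live page):

def parse(self, response):
    for n in response.xpath('//article'):
        yield {
            # ::attr(href) returns only the attribute value instead of the whole <a> element;
            # the selectors are relative (descendant CSS and ".//") so each article is searched on its own.
            'Link': n.css("a.hyperlink.headlineSub__link::attr(href)").get(),
            'Title': n.xpath('.//a/div/h3/text()').get(),
        }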

How do I scrape all content if it has various tags within it?

I have a spider that I'd like to use to scrape an article I'm interested in, then store the title and the content in a dictionary. However, when I scrape the body it returns the HTML code, which I want to convert to text (including all the h1 and href elements within the article), but when I use .getall() it returns an empty list. How do I turn all of this into text while still keeping all of the content within the article?
In the Scrapy shell I tried the following, which returned a large list containing all of the HTML code:
response.css("div.rich-text-content").getall()
Below is the initial spider that I created for this task:
import scrapy

class ArticleSpider(scrapy.Spider):
    name = "article"

    def start_requests(self):
        urls = [
            "https://www.codehousegroup.com/insight-and-inspiration/tech-stream/what-is-machine-learning"
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css("div.article-page"):
            yield {
                'heading': quote.css("h1::text").get(),
                'text': quote.css("p.rectangle-decoration::text").get(),
                'body': quote.css("div.rich-text-content rich-text-content::text").getall(),
            }
The expected result is a string with everything currently in the 'body' item of my dictionary, just without the tags.
If I understand correctly, you need to select all the elements inside the div tag and return their text.
You can use * in CSS, which selects all descendant elements:
'body': quote.css("div.rich-text-content *::text").getall()
You can also use XPath instead of CSS. Example:
for quote in response.xpath('//div[@class="article-page"]'):
    text = quote.xpath(".//h1/text()").get()
    ...
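Putting both hints together, the parse method could look roughly like this (a sketch; the join simply flattens the list that getall() returns into one string):

def parse(self, response):
    for quote in response.css("div.article-page"):
        # *::text grabs the text of every descendant node, whatever tag it sits in.
        body_parts = quote.css("div.rich-text-content *::text").getall()
        yield {
            'heading': quote.css("h1::text").get(),
            'text': quote.css("p.rectangle-decoration::text").get(),
            'body': " ".join(part.strip() for part in body_parts if part.strip()),
        }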

How to follow a hyper-reference in Scrapy if the href attribute contains a hash symbol

In my web-scraping project I have to scrape the football match data from https://www.national-football-teams.com/country/67/2018/France.html
In order to navigate to the match data from the above URL, I have to follow a hyper-reference that has a hash in the URL:
Matchesevent
The standard Scrapy mechanism of following links:
href = response.xpath("//a[contains(@href,'matches')]/@href").extract_first()
href = response.urljoin(href)
will produce a link that will not lead to the matches data:
https://www.national-football-teams.com/matches.html
I would appreciate any help. Since I am a newbie to web scraping and anything that has to do with web development, a more specific piece of advice and/or a minimal working example would be highly appreciated.
For completeness, here is the complete code of my Scrapy spider:
import scrapy

class NationalFootballTeams(scrapy.Spider):
    name = "nft"
    start_urls = ['https://www.national-football-teams.com/continent/1/Europe.html']

    def parse(self, response):
        for country in response.xpath("//div[@class='row country-teams']/div[1]/ul/li/a"):
            cntry = country.xpath("text()").extract_first().strip()
            if cntry == 'France':
                href = country.xpath("@href").extract_first()
                yield response.follow(href, self.parse_country)

    def parse_country(self, response):
        href = response.xpath("//a[contains(@href,'matches')]/@href").extract_first()
        href = response.urljoin(href)
        print href
        yield scrapy.Request(url=href, callback=self.parse_matches)

    def parse_matches(self, response):
        print response.xpath("//tr[@class='win']").extract()
When clicking that link, no new page or even new data is loaded; it is already in the HTML, just hidden. Clicking the link runs some JavaScript that hides the current tab and shows the new one. So to get to the data you don't need to follow any link at all, you just need a different XPath query: the match data is under //div[@id='matches'].
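So parse_country can read the hidden tab directly instead of requesting matches.html. A sketch, assuming the same win rows the original parse_matches looked for:

def parse_country(self, response):
    # The match table is already in the downloaded HTML, only hidden by JavaScript,
    # so select it straight away instead of following the "#matches" link.
    for row in response.xpath("//div[@id='matches']//tr[@class='win']"):
        yield {'match_row': row.extract()}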

Confusion using Xpath when scraping websites with Scrapy

I'm having trouble understanding which part of the XPath to select when trying to scrape certain elements of a website. In this case, I am trying to scrape all the websites that are linked in this article. For example, this section of the HTML:
data-track="Body Text Link: External" href="http://www.uspreventiveservicestaskforce.org/Page/Document/RecommendationStatementFinal/brca-related-cancer-risk-assessment-genetic-counseling-and-genetic-testing">
My spider works but it doesn't scrape anything!
My code is below:
import scrapy
from scrapy.selector import Selector
from nymag.items import nymagItem

class nymagSpider(scrapy.Spider):
    name = 'nymag'
    allowed_domains = ['http://wwww.nymag.com']
    start_urls = ["http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html"]

    def parse(self, response):
        # I'm pretty sure the below line is the issue
        links = Selector(response).xpath('//*[@id="primary"]/main/article/div/span')
        for link in links:
            item = nymagItem()
            # This might also be wrong - am trying to extract the href section
            item['link'] = question.xpath('a/@href').extract()
            yield item
There is an easier way. Get all the a elements having data-track and href attributes:
In [1]: for link in response.xpath("//div[@id = 'primary']/main/article//a[@data-track and @href]"):
   ...:     print link.xpath("@href").extract()[0]
   ...:
//nymag.com/tags/healthcare/
//nymag.com/author/Susan%20Rinkunas/
http://twitter.com/sueonthetown
http://www.facebook.com/sharer/sharer.php?u=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dfb-share-thecut
https://twitter.com/share?text=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&url=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dtwitter-share-thecut&via=TheCut
https://plus.google.com/share?url=http%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html
http://pinterest.com/pin/create/button/?url=http://nymag.com/thecut/2015/09/should-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Dpinterest-share-thecut&description=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&media=http:%2F%2Fpixel.nymag.com%2Fimgs%2Ffashion%2Fdaily%2F2015%2F09%2F08%2F08-angelina-jolie.w750.h750.2x.jpg
whatsapp://send?text=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F%0A%0Ahttp%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html&mid=whatsapp
mailto:?subject=Should%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F&body=I%20saw%20this%20on%20The%20Cut%20and%20thought%20you%20might%20be%20interested...%0A%0AShould%20All%20Women%20Get%20Tested%20for%20the%20Breast%20Cancer%20Gene%3F%0AIt's%20not%20a%20crystal%20ball.%0Ahttp%3A%2F%2Fnymag.com%2Fthecut%2F2015%2F09%2Fshould-we-all-get-the-breast-cancer-gene-test.html%3Fmid%3Demailshare%5Fthecut
...
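Inside the spider itself, the same expression can be turned into items, for example (a sketch reusing the nymagItem class from the question):

def parse(self, response):
    for link in response.xpath("//div[@id='primary']/main/article//a[@data-track and @href]"):
        item = nymagItem()
        # Store just the href attribute value, not the whole <a> element.
        item['link'] = link.xpath("@href").extract_first()
        yield item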

scrape through website with href references

I am using scrapy, and I want to scrape through www.rentler.com. I have gone to the website and searched for the city that I am interested in, and here is the link of that search result:
https://www.rentler.com/search?Location=millcreek&MaxPrice=
Now, all of the listings that I am interested in are contained on that page, and I want to recursively step through them, one by one.
Each listing is listed under:
<body>/<div id="wrap">/<div class="container search-res">/<ul class="search-results"><li class="result">
Each result has an <a class="search-result-link" href="/listing/288910">.
I know that I need to create a rule for the CrawlSpider and have it look at that href and append it to the URL, so that it can go to each page and grab the data that I am interested in.
I think I need something like this:
rules = (Rule(SgmlLinkExtractor(allow="not sure what to insert here, but this is where I think I need the href appending"), callback='parse_item', follow=True),)
UPDATE
Thank you for the input. Here is what I now have; it seems to run but does not scrape:
import re
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from KSL.items import KSLitem

class KSL(CrawlSpider):
    name = "ksl"
    allowed_domains = ["https://www.rentler.com"]
    start_urls = ["https://www.rentler.com/ksl/listing/index/?sid=17403849&nid=651&ad=452978"]
    regex_pattern = '<a href="listing/(.*?) class="search-result-link">'

    def parse_item(self, response):
        items = []
        hxs = HtmlXPathSelector(response)
        sites = re.findall(regex_pattern, "https://www.rentler.com/search?location=millcreek&MaxPrice=")
        for site in sites:
            item = KSLitem()
            item['price'] = site.select('//div[@class="price"]/text()').extract()
            item['address'] = site.select('//div[@class="address"]/text()').extract()
            item['stats'] = site.select('//ul[@class="basic-stats"]/li/div[@class="count"]/text()').extract()
            item['description'] = site.select('//div[@class="description"]/div/p/text()').extract()
            items.append(item)
        return items
Thoughts?
If you need to scrape data out of HTML files, which is the case here, I would recommend using BeautifulSoup; it's very easy to install and use:
from bs4 import BeautifulSoup

bs = BeautifulSoup(html)
for link in bs.find_all('a'):
    if link.has_attr('href'):
        print link.attrs['href']
This little script gets every href inside the <a> tags of an HTML document.
Edit: fully functional script.
I tested this on my computer and the result was as expected. BeautifulSoup just needs plain HTML, and you can scrape whatever you need out of it. Take a look at this code:
import requests
from bs4 import BeautifulSoup

html = requests.get(
    'https://www.rentler.com/search?Location=millcreek&MaxPrice=').text
bs = BeautifulSoup(html)
possible_links = bs.find_all('a')
for link in possible_links:
    if link.has_attr('href'):
        print link.attrs['href']
That only shows how to scrape the hrefs out of the HTML page you are trying to scrape. Of course you can use it inside Scrapy; as I said, BeautifulSoup only needs plain HTML, which is why I use requests.get(url).text, and you can scrape out of that. So Scrapy can simply pass that plain HTML to BeautifulSoup.
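For example, a sketch of mixing the two (a plain Scrapy callback handing response.text to BeautifulSoup):

from bs4 import BeautifulSoup

def parse(self, response):
    # response.text is the plain HTML Scrapy downloaded; hand it to BeautifulSoup.
    bs = BeautifulSoup(response.text, "html.parser")
    for link in bs.find_all('a', href=True):
        yield {'href': link['href']}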
Edit 2
OK, look, I don't think you need Scrapy at all. If the previous script gets you all the links you want to take data from, you only need to do something like this:
Suppose I have a valid list of URLs I want to get specific data from, say price, acres, address... You could build that list with the previous script: instead of printing the URLs to screen, append them to a list, keeping only the ones that start with /listing/. That way you have a valid list of URLs.
for url in valid_urls:
    bs = BeautifulSoup(requests.get(url).text)
    price = bs.find('span', {'class': 'amount'}).text
    print price
You only need to look at the source code to get the idea of how to scrape the data you need from every single URL.
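For instance, a sketch of that filtering step, building on the bs object from the script above and keeping only the listing links:

# Keep only the hrefs that point at listings and turn them into absolute URLs.
valid_urls = ['https://www.rentler.com' + link.attrs['href']
              for link in bs.find_all('a')
              if link.has_attr('href') and link.attrs['href'].startswith('/listing/')]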
You can use a regular expression to find all the rental home ids from the links. From there, you can use the ids you have and scrape that page instead.
import re

regex_pattern = '<a href="/listing/(.*?)" class="search-result-link">'
rental_home_ids = re.findall(regex_pattern, SOURCE_OF_THE_RENTLER_PAGE)
for rental_id in rental_home_ids:
    # Process the data from the page here.
    print rental_id
EDIT:
Here's a standalone version of the code. It prints all the listing IDs; you can use it as-is.
import re
import urllib

url_to_scrape = "https://www.rentler.com/search?Location=millcreek&MaxPrice="
page_source = urllib.urlopen(url_to_scrape).read()
regex_pattern = '<a href="/listing/(.*?)" class="search-result-link">'
rental_home_ids = re.findall(regex_pattern, page_source)
for rental_id in rental_home_ids:
    # Process the data from the page here.
    print rental_id
